From Numerical Arrays to Real-World Analytical Tables
In the previous page, you explored NumPy arrays—the foundation of high-performance numerical computation. Arrays are powerful, but real-world datasets rarely arrive as pure matrices of numbers. They come as spreadsheets, CSV files, SQL tables, logs, or API responses. They contain column names, mixed data types, missing values, timestamps, and categorical variables.
This is where Pandas becomes essential.
Pandas builds on NumPy and introduces labeled, structured data containers that resemble relational tables. It allows you to move from raw numerical computation to applied data manipulation—the type required in almost every analytics workflow.
This page develops a deep conceptual understanding of DataFrames, indexing, transformation logic, and structured operations.
The DataFrame as a Concept
A Pandas DataFrame is a two-dimensional, labeled data structure. Conceptually, it is a table with:
- Rows representing observations
- Columns representing variables (features)
- Labels attached to both axes
Unlike NumPy arrays, which are position-based, DataFrames support label-based access. This makes them more intuitive for working with structured datasets.
For example:
import pandas as pd
df = pd.DataFrame({
"Name": ["Alice", "Bob", "Charlie"],
"Age": [25, 30, 35],
"Salary": [50000, 60000, 70000]
})
Each column can have a different data type. This heterogeneity is crucial for real datasets, where numeric, categorical, and textual data coexist.
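You can verify this heterogeneity directly: each column of the example frame keeps its own dtype, with strings stored as `object` and the numeric columns as integers.

```python
import pandas as pd

# The example frame from above: three columns, three values each
df = pd.DataFrame({
    "Name": ["Alice", "Bob", "Charlie"],
    "Age": [25, 30, 35],
    "Salary": [50000, 60000, 70000],
})

# Each column keeps its own dtype: object (strings), int64, int64
print(df.dtypes)
```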
Columns as Series
Every column in a DataFrame is a Series, which is essentially a labeled NumPy array.
When you select:
df["Salary"]
you receive a Series object.
Understanding that a DataFrame is composed of multiple Series objects clarifies how operations work internally. Most column-wise operations are vectorized because they rely on NumPy arrays under the hood.
This design balances performance with flexibility.
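A minimal check makes the layering concrete: a selected column is a `Series`, and its underlying storage is a NumPy array.

```python
import pandas as pd

df = pd.DataFrame({"Salary": [50000, 60000, 70000]})

col = df["Salary"]
print(type(col))             # a pandas Series
print(type(col.to_numpy()))  # the NumPy array underneath
```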
Indexing and Selection
DataFrames support two primary indexing mechanisms:
- loc for label-based indexing
- iloc for positional indexing
For example:
df.loc[0, "Salary"]
accesses a value by row label and column label.
df.iloc[0, 2]
accesses the same value by position.
This dual indexing model is powerful but requires conceptual clarity. Misunderstanding indexing is one of the most common beginner errors in Pandas.
Filtering and Boolean Logic
Structured datasets often require conditional filtering.
For example:
df[df["Age"] > 28]
This expression creates a Boolean mask and returns only rows satisfying the condition.
Behind the scenes, this is vectorized Boolean indexing—similar to what you saw in NumPy.
Boolean filtering is foundational in analytics because it enables segmentation, cohort analysis, and targeted transformations.
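Conditions can also be combined. A sketch using the example frame: note that pandas requires the element-wise operators `&` and `|` (with parentheses), not Python's `and`/`or`.

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Alice", "Bob", "Charlie"],
    "Age": [25, 30, 35],
    "Salary": [50000, 60000, 70000],
})

# Combine conditions with & and parentheses, not Python's `and`
mask = (df["Age"] > 28) & (df["Salary"] < 70000)
subset = df[mask]
print(subset)  # only Bob satisfies both conditions
```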
Creating New Columns
Feature engineering often involves deriving new variables from existing ones.
For example:
df["Annual Bonus"] = df["Salary"] * 0.10
This operation is vectorized across the entire column.
Notice how the transformation resembles the mathematical expression directly. Clean, readable transformations are a hallmark of strong analytical code.
Aggregation and Grouping
Real-world data analysis often involves summarizing information across categories.
For example:
df.groupby("Department")["Salary"].mean()
This performs:
- Grouping rows by a categorical variable
- Applying an aggregation function
- Returning summarized results
Grouping is conceptually similar to SQL’s GROUP BY clause. It is central to descriptive analytics and business reporting.
Aggregation functions commonly include:
- mean
- sum
- count
- median
- std (standard deviation)
Understanding how grouping reshapes data is crucial for insight generation.
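The grouping steps above can be sketched end to end. The `Department` column here is hypothetical illustrative data; `agg` lets you apply several aggregation functions in one pass.

```python
import pandas as pd

# A small hypothetical staff table with a Department column
df = pd.DataFrame({
    "Department": ["Sales", "Sales", "IT", "IT"],
    "Salary": [50000, 60000, 70000, 90000],
})

# Group by the categorical variable, then aggregate several ways at once
summary = df.groupby("Department")["Salary"].agg(["mean", "sum", "count"])
print(summary)
```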
Handling Missing Data
Missing values are unavoidable in practical datasets.
Pandas represents missing values as NaN. Several methods are available for handling them:
- dropna() removes missing entries
- fillna() replaces them
- isnull() identifies them
For example:
df.fillna(0)
Handling missing data requires analytical judgment. Blindly dropping rows can introduce bias. Filling values may distort distributions. Sound data practice involves understanding the source and impact of missingness.
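A minimal sketch of the three approaches on a column with one missing value, including mean imputation as one possible fill strategy:

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({"Salary": [50000, np.nan, 70000]})

# Identify: count the missing entries
n_missing = df["Salary"].isnull().sum()

# Replace: impute with the column mean (one of many strategies)
filled = df["Salary"].fillna(df["Salary"].mean())

# Remove: drop the incomplete rows instead
dropped = df.dropna()
```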
Sorting and Ranking
Sorting enables ordering data based on specific columns:
df.sort_values("Salary", ascending=False)
Ranking operations are common in reporting dashboards and performance evaluation contexts.
These operations are computationally efficient and leverage optimized internal algorithms.
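Since ranking was mentioned but not shown, a short sketch of both operations on the example frame:

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Alice", "Bob", "Charlie"],
    "Salary": [50000, 60000, 70000],
})

# Sort descending by Salary (returns a new frame)
ordered = df.sort_values("Salary", ascending=False)

# Assign a rank per row: 1 = highest salary
df["SalaryRank"] = df["Salary"].rank(ascending=False).astype(int)
print(df)
```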
Merging and Joining
In practice, data rarely exists in a single table. It is distributed across multiple sources.
Pandas supports relational-style merging:
pd.merge(df1, df2, on="EmployeeID")
This operation combines datasets based on a shared key.
Understanding joins is essential for:
- Data integration
- Multi-source analytics
- Feature enrichment
Improper joins can silently introduce duplication or data loss, so conceptual precision is critical.
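A sketch with two hypothetical tables sharing an `EmployeeID` key. The `how` parameter controls which keys survive, which is exactly where silent data loss can creep in:

```python
import pandas as pd

# Two hypothetical tables sharing an EmployeeID key
employees = pd.DataFrame({"EmployeeID": [1, 2, 3],
                          "Name": ["Alice", "Bob", "Charlie"]})
salaries = pd.DataFrame({"EmployeeID": [1, 2],
                         "Salary": [50000, 60000]})

# Inner join keeps only keys present in both tables (Charlie is dropped)
inner = pd.merge(employees, salaries, on="EmployeeID")

# Left join keeps every employee; missing salaries become NaN
left = pd.merge(employees, salaries, on="EmployeeID", how="left")
```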
Time Series Handling
Many analytics problems involve temporal data. Pandas provides specialized tools for time-based indexing.
For example:
df["Date"] = pd.to_datetime(df["Date"])
df.set_index("Date", inplace=True)
Once indexed by time, you can:
- Resample data
- Compute rolling averages
- Extract year/month/day components
Rolling averages are particularly important in smoothing volatile signals.
For instance, a simple moving average over a window of k observations, MA_t = (x_t + x_{t-1} + … + x_{t-k+1}) / k, smooths short-term noise in much the same way that analyzing trends in continuous functions abstracts away local fluctuation. Although a rolling average is not linear regression, trend interpretation often begins with such simple approximations.
Time-aware computation is essential in forecasting, anomaly detection, and financial analytics.
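The time-series operations above can be sketched on a small hypothetical daily series:

```python
import pandas as pd

# Hypothetical daily values indexed by date
dates = pd.date_range("2024-01-01", periods=6, freq="D")
ts = pd.Series([10, 12, 8, 14, 11, 13], index=dates)

# 3-day rolling mean: smooths short-term fluctuation
smooth = ts.rolling(window=3).mean()

# Resample into 2-day buckets, summing within each bucket
bucketed = ts.resample("2D").sum()
```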
Vectorized Transformations vs Apply
Pandas provides the .apply() method, which applies custom logic element-wise on a Series or row-wise/column-wise on a DataFrame. However, excessive use of .apply() can degrade performance because it reintroduces Python-level loops.
Whenever possible, prefer vectorized operations.
For example, instead of:
df["Squared"] = df["Value"].apply(lambda x: x**2)
Use:
df["Squared"] = df["Value"] ** 2
This distinction becomes increasingly important as datasets scale.
Descriptive Statistics and Exploration
Pandas provides built-in summary statistics:
df.describe()
This produces:
- Count
- Mean
- Standard deviation
- Minimum
- Quartiles
- Maximum
Such summaries form the first layer of exploratory data analysis (EDA).
Quantitative summaries are often interpreted using statistical concepts like the standard score, z = (x − μ) / σ, which expresses how far a value lies from the mean in units of standard deviation.
Understanding how these metrics are computed reinforces statistical literacy within programming workflows.
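Computing the standard score directly is a good exercise in both statistics and vectorization; a minimal sketch:

```python
import pandas as pd

df = pd.DataFrame({"Salary": [50000, 60000, 70000]})

# Standard score: z = (x - mean) / std
mean = df["Salary"].mean()
std = df["Salary"].std()  # sample standard deviation (ddof=1)
df["SalaryZ"] = (df["Salary"] - mean) / std

print(df["SalaryZ"])  # symmetric around 0 for this symmetric sample
```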
DataFrame as an Analytical Pipeline Component
A DataFrame is not just storage—it is an intermediate stage in a larger system.
A typical workflow may involve:
- Loading raw data
- Cleaning and filtering
- Engineering features
- Aggregating and summarizing
- Exporting for modeling
Each transformation produces a new structured representation.
Well-designed pipelines avoid modifying data unpredictably and instead build transformations step by step.
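One way to express such a pipeline is method chaining, where every step returns a new frame instead of mutating the original. A sketch over hypothetical staff data:

```python
import pandas as pd

raw = pd.DataFrame({
    "Department": ["Sales", "Sales", "IT", "IT"],
    "Salary": [50000, None, 70000, 90000],
})

# Each step returns a new object; `raw` is never modified in place
result = (
    raw
    .dropna(subset=["Salary"])                   # clean
    .assign(Bonus=lambda d: d["Salary"] * 0.10)  # engineer a feature
    .groupby("Department")["Bonus"].mean()       # aggregate
)
```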
Performance Considerations
While Pandas is powerful, it is not infinitely scalable. For very large datasets, memory constraints become critical.
Best practices include:
- Avoiding unnecessary copies
- Selecting only required columns
- Using categorical data types where appropriate
- Leveraging vectorized methods
Understanding these considerations prepares you for large-scale analytics systems.
Conceptual Integration
At this point in the course, you have moved through:
- Core Python structures
- Functions and abstraction
- Vectorized computation
- NumPy arrays
- Structured DataFrames
You are transitioning from “learning syntax” to “engineering data transformations.”
Pandas is the bridge between computational mathematics and real-world datasets.
It enables you to express complex analytical logic cleanly, efficiently, and reproducibly.
Transition to the Next Page
In the next section, we will explore Exploratory Data Analysis (EDA) & Data Visualization.
If NumPy provides mathematical power and Pandas provides structured manipulation, visualization provides interpretation. You will learn how to translate structured tables into graphical representations that reveal patterns, trends, and anomalies.
This marks the shift from data preparation to data understanding.

