What Is a JOIN and Why Do You Need It
Every query in Topics 2 and 3 touched a single table. That works fine when all your data lives in one place — but in real databases it almost never does.
Customer details live in a customers table. Orders live in an orders table. Products live in a products table. These tables are kept separate deliberately — it avoids storing the same customer name and address on every single order they place. Instead, each order stores a customer ID, and that ID links back to the customer record.
This is efficient for storage and data integrity. But it means that to answer most real business questions, you need to combine two or more tables. That is exactly what a JOIN does.
A JOIN connects two tables based on a shared column — usually an ID that appears in both. The result is a new combined table containing columns from both sources, matched row by row.
Without JOINs, a relational database is just a collection of disconnected tables. With JOINs, it becomes a connected system you can query across freely.
Setting Up the Superstore Tables for JOINs
In Topics 2 and 3 the Superstore data was one flat table. To practice JOINs properly you need it split into separate tables the way a real database works. Run this setup code once in your notebook to create the tables:
import sqlite3
import pandas as pd
conn = sqlite3.connect('superstore.db')
# Load the original flat table
df = pd.read_csv('superstore_sales.csv')
df.columns = [col.strip().replace(' ', '_').lower() for col in df.columns]
# Create customers table
customers = df[['customer_id', 'customer_name', 'segment']].drop_duplicates()
customers.to_sql('customers', conn, if_exists='replace', index=False)
# Create orders table
orders = df[['order_id', 'customer_id', 'order_date',
'ship_date', 'ship_mode', 'region',
'city', 'state']].drop_duplicates()
orders.to_sql('orders', conn, if_exists='replace', index=False)
# Create order_items table
order_items = df[['order_id', 'product_id', 'sub_category',
'category', 'sales', 'quantity',
'discount', 'profit']].drop_duplicates()
order_items.to_sql('order_items', conn, if_exists='replace', index=False)
print("Tables created successfully.")
conn.close()
Now you have three separate tables — customers, orders, and order_items — linked by shared ID columns. This mirrors how a real company database is structured.
def query(sql):
return pd.read_sql(sql, conn)
conn = sqlite3.connect('superstore.db')
How a JOIN Works — The Mental Model
Before writing JOIN syntax, understand what is happening conceptually.
Imagine two tables sitting side by side. The orders table has a column called customer_id. The customers table also has a column called customer_id. A JOIN says: for every row in orders, find the matching row in customers where the customer_id values are equal, and combine them into one wider row.
The column you join on is called the join key. It must exist in both tables and contain matching values. In the Superstore setup:
• orders.customer_id links to customers.customer_id
• order_items.order_id links to orders.order_id
The difference between JOIN types is what happens when a match is not found. That is the entire distinction between INNER JOIN and LEFT JOIN.
INNER JOIN — Only Matching Rows
Syntax:
SELECT columns<br>FROM table_one<br>INNER JOIN table_two ON table_one.key = table_two.key
INNER JOIN returns only rows where a match exists in both tables. If a row in the left table has no matching row in the right table, it is excluded from the result. If a row in the right table has no match in the left table, it is also excluded.
Think of it as the intersection — only rows that exist in both tables come through.
Your First INNER JOIN
-- Combine orders with customer details
query("""
SELECT
o.order_id,
o.order_date,
o.region,
c.customer_name,
c.segment
FROM orders o
INNER JOIN customers c ON o.customer_id = c.customer_id
LIMIT 10
""")
Output (first 5 rows):
| order_id | order_date | region | customer_name | segment |
|---|---|---|---|---|
| CA-2020-152156 | 2020-11-08 | South | Claire Gute | Consumer |
| CA-2020-138688 | 2020-06-12 | West | Darrin Van Huff | Corporate |
| US-2019-108966 | 2019-10-11 | South | Sean O’Donnell | Consumer |
| CA-2019-115812 | 2019-06-09 | East | Brosina Hoffman | Consumer |
| CA-2019-114412 | 2019-04-15 | West | Andrew Allen | Consumer |
Notice o. and c. before column names. These are table aliases — shorthand so you don’t have to type the full table name every time. orders o means “refer to the orders table as o.” This becomes essential when both tables have columns with the same name.
Joining Three Tables
Most real analyst queries join more than two tables. Here is how to bring orders, customers, and order_items together in one query:
-- Full picture: customer details + order details + financials
query("""
SELECT
c.customer_name,
c.segment,
o.order_date,
o.region,
oi.category,
oi.sub_category,
ROUND(oi.sales, 2) AS sales,
ROUND(oi.profit, 2) AS profit
FROM orders o
INNER JOIN customers c
ON o.customer_id = c.customer_id
INNER JOIN order_items oi
ON o.order_id = oi.order_id
LIMIT 10
""")
Each INNER JOIN adds another table into the result. The pattern is always the same — join on the shared key column between the two tables being connected.
INNER JOIN with WHERE and ORDER BY
JOINs combine cleanly with everything from Topics 2 and 3:
-- High-value orders in the West region with customer details
query("""
SELECT
c.customer_name,
c.segment,
o.region,
oi.category,
ROUND(oi.sales, 2) AS sales,
ROUND(oi.profit, 2) AS profit
FROM orders o
INNER JOIN customers c
ON o.customer_id = c.customer_id
INNER JOIN order_items oi
ON o.order_id = oi.order_id
WHERE o.region = 'West'
AND oi.sales > 1000
ORDER BY oi.sales DESC
LIMIT 10
""")
LEFT JOIN — Keep All Rows from the Left Table
Syntax:
SELECT columns<br>FROM table_one<br>LEFT JOIN table_two ON table_one.key = table_two.key
LEFT JOIN returns all rows from the left table, plus matching rows from the right table. When no match exists in the right table, the columns from the right table come back as NULL.
The key difference from INNER JOIN: no rows from the left table are ever dropped. Even if they have no match on the right side, they appear in the result — just with NULL values in the right table’s columns.
When to Use LEFT JOIN
LEFT JOIN is the right choice when you want to keep all records from one table regardless of whether they have matching records in another. Common scenarios:
- Find customers who have never placed an order
- Find products that have never been sold
- Identify records in one system that are missing from another
- Audit data completeness across two sources
LEFT JOIN Example — Finding Unmatched Records
To demonstrate LEFT JOIN clearly, first add a test customer with no orders:
Add a customer who has never ordered
import pandas as pd
import sqlite3
conn = sqlite3.connect('superstore.db')
new_customer = pd.DataFrame([{
'customer_id': 'TEST-001',
'customer_name': 'Test Customer',
'segment': 'Consumer'
}])
new_customer.to_sql('customers', conn, if_exists='append', index=False)
conn.commit()
Now run a LEFT JOIN to find customers with no orders:
-- Find customers who have never placed an order
query("""
SELECT
c.customer_id,
c.customer_name,
c.segment,
o.order_id
FROM customers c
LEFT JOIN orders o ON c.customer_id = o.customer_id
WHERE o.order_id IS NULL
""")
Output:
| customer_id | customer_name | segment | order_id |
|---|---|---|---|
| TEST-001 | Test Customer | Consumer | NULL |
The NULL in order_id tells you this customer exists in the customers table but has no matching record in the orders table. The INNER JOIN version of this query would have excluded this row entirely — you would never know this customer existed.
This pattern — LEFT JOIN followed by WHERE right_table.key IS NULL — is one of the most useful SQL techniques for data quality auditing.
Handling NULLs After a JOIN
NULL values appearing after a LEFT JOIN are expected and useful — they signal missing matches. But they need to be handled carefully in any further calculations or filtering.
COALESCE — Replacing NULL with a Default Value
COALESCE returns the first non-NULL value from a list of arguments. Use it to replace NULLs with a meaningful default:
-- Replace NULL order_id with a readable label
query("""
SELECT
c.customer_name,
COALESCE(o.order_id, 'No orders yet') AS order_status
FROM customers c
LEFT JOIN orders o ON c.customer_id = o.customer_id
WHERE o.order_id IS NULL
""")
COALESCE is also useful when joining tables where a column might be populated in one table but missing in another. Rather than seeing NULL in your output, you get a clean fallback value.
NULL in Aggregate Functions After a JOIN
One important behaviour: aggregate functions like COUNT, SUM, and AVG ignore NULL values automatically. This matters after a LEFT JOIN because unmatched rows produce NULLs in the right table’s columns.
-- Count orders per customer — unmatched customers show 0, not NULL
query("""
SELECT
c.customer_name,
COUNT(o.order_id) AS order_count
FROM customers c
LEFT JOIN orders o ON c.customer_id = o.customer_id
GROUP BY c.customer_name
ORDER BY order_count DESC
LIMIT 10
""")
COUNT(o.order_id) counts non-NULL values only — so customers with no orders correctly show 0. If you used COUNT() here you would get 1 for every customer including unmatched ones, because COUNT() counts the row itself regardless of NULL values.
INNER JOIN vs LEFT JOIN — When to Use Which
| Situation | Use |
|---|---|
| You only want rows that exist in both tables | INNER JOIN |
| You want all rows from the left table, matched or not | LEFT JOIN |
| Finding records that are missing from another table | LEFT JOIN + WHERE right key IS NULL |
| Combining sales data with product details | INNER JOIN |
| Auditing customers with no orders | LEFT JOIN |
| Getting complete order history including customer info | INNER JOIN |
| Data completeness check across two systems | LEFT JOIN |
The default choice for most analyst queries is INNER JOIN — you usually only want complete, matched records. Reach for LEFT JOIN specifically when missing matches are meaningful information rather than just gaps to exclude.
A Complete Business Query Using JOINs
Here is a realistic analyst query combining JOINs, WHERE, GROUP BY, and ORDER BY to answer a full business question:
Question: Which customer segments generate the most revenue and profit, broken down by product category?
query("""
SELECT
c.segment,
oi.category,
COUNT(DISTINCT o.order_id) AS total_orders,
ROUND(SUM(oi.sales), 2) AS total_revenue,
ROUND(SUM(oi.profit), 2) AS total_profit,
ROUND(SUM(oi.profit) /
SUM(oi.sales) * 100, 1) AS profit_margin_pct
FROM orders o
INNER JOIN customers c
ON o.customer_id = c.customer_id
INNER JOIN order_items oi
ON o.order_id = oi.order_id
GROUP BY c.segment, oi.category
ORDER BY c.segment, total_revenue DESC
""")
This single query pulls from three tables, aggregates across two dimensions, calculates a derived metric, and produces a result a manager could read directly in a meeting. That is the power of combining JOINs with everything from the previous topics.
Common JOIN Mistakes
| Mistake | What Happens | Fix |
|---|---|---|
| Joining on the wrong column | Incorrect or empty results | Double-check which columns are the shared keys between tables |
| Forgetting table aliases when column names clash | SQL error — ambiguous column name | Always use aliases when the same column name exists in both tables |
| Using INNER JOIN when LEFT JOIN is needed | Silently drops unmatched rows | Ask yourself — do I care about rows with no match? If yes, use LEFT JOIN |
| Not handling NULLs after LEFT JOIN | Wrong aggregation results | Use COALESCE for display, COUNT(column) not COUNT(*) for counting |
| Joining without ON clause | Cartesian product — every row matched to every row | Always include the ON condition |
Practice Exercises
- Join the orders and customers tables. Show the customer name, region, and order date for all orders placed in 2021.
- Join all three tables. Find the top 5 customers by total profit across all their orders.
- Using a LEFT JOIN, find any order IDs in order_items that do not have a matching record in the orders table.
- Join orders and customers. Group by segment and region. Show total revenue per combination sorted by revenue descending.
- Join all three tables. Filter to Furniture category only. Show total sales and profit per customer segment.
Summary — What You Can Now Do
- Explain what a JOIN does and why relational databases require them
- Write an INNER JOIN to combine two or more tables on a shared key column
- Write a LEFT JOIN to keep all rows from the left table including unmatched ones
- Use LEFT JOIN with WHERE IS NULL to find records missing from a second table
- Handle NULLs after a JOIN using COALESCE and COUNT(column) vs COUNT(*)
- Combine JOINs with WHERE, GROUP BY, ORDER BY, and HAVING in a single query
- Choose between INNER JOIN and LEFT JOIN based on whether unmatched rows matter
Up next — Topic 5: Subqueries
Topic 5 covers queries inside queries — how to use a SELECT result as a filter in WHERE, or as a derived table in FROM. Subqueries let you answer multi-step business questions in a single SQL statement without needing temporary tables or multiple separate queries.
Leave a Reply