Category: Data Science

  • SQL Mini Project: Analyze the Superstore Database Using SQL and Python

    What This Project Is

    You have completed all six topics in Module 2. You can query a database, filter and sort rows, aggregate data, join tables, write subqueries, and connect SQL to Python. This mini project puts all of that together in one end-to-end deliverable.
    This is not a guided tutorial. There are no step-by-step instructions telling you which functions to use. The five business questions below are the kind you would genuinely receive in an entry-level data role — and your job is to answer them using everything you have learned.
    By the end you will have a completed Jupyter notebook, a written findings brief, and a GitHub repository — three things you can point to directly when applying for data roles.

    The Scenario

    You have just joined a retail company as a junior data professional. It is your first week. Your manager has sent you an email:

    “Hey — I’ve given you access to our sales database. Before our Friday meeting I’d love to get your take on five questions I’ve been sitting on for a while. Nothing fancy — just pull the numbers and tell me what you find. A short write-up is fine.”

    The five questions are below. The database is the Superstore SQLite database you have been working with throughout Module 2.

    The Five Business Questions

    Answer each question using SQL. Pull the result into a pandas DataFrame. Then write one to three sentences in plain English summarising what the data tells you.

    Question 1 — Regional Performance

    Which region generates the most total revenue and which generates the most total profit? Are they the same region? If not, what does that tell you?

    Question 2 — Product Profitability

    Which three sub-categories have the highest total profit and which three have the lowest? Are any sub-categories losing money overall?

    Question 3 — Customer Value

    Who are the top 10 customers by total revenue? For each of those customers, what is their profit margin? Are your highest-revenue customers also your most profitable ones?

    Question 4 — Loss-Making Orders

    What percentage of all orders are loss-making (profit below zero)? Which category has the highest proportion of loss-making orders? Which region?

    Question 5 — Shipping and Profitability

    Does ship mode affect profitability? Show average profit and average sales for each ship mode. Is there a pattern worth flagging to the business?

    Deliverables

    Submit three things when you complete this project:

    1. Jupyter Notebook
      One clean notebook containing all your SQL queries, pandas code, and output. Structure it with a markdown cell before each question stating the business question, followed by your code and output. Name it Module2_MiniProject.ipynb.
    2. Written Findings Brief
      A short document — five short paragraphs, one per question — written in plain English as if you are sending it to your manager. No code. No jargon. Just what the data shows and why it matters. Aim for 150 to 200 words total. Name it Module2_Findings_Brief.md.
    3. GitHub Repository
      Push both files to a public GitHub repository. Name it superstore-sql-analysis. Include a README that describes the project in two to three sentences, lists the tools used, and explains how to run the notebook.

    Technical Requirements

    Your notebook must meet all of the following:

    • All data retrieved using SQL queries against the Superstore SQLite database — no loading the raw CSV directly
    • At least three queries must use JOIN across two or more tables
    • At least one query must use a subquery
    • At least one query must use GROUP BY with HAVING
    • All results pulled into pandas DataFrames using pd.read_sql()
    • Connection opened once at the top and closed once at the bottom
    • All SQL queries use parameterised values where a variable is involved
    • Column names in output are readable — use AS aliases where needed

    Notebook Structure

    Set your notebook up in this order:

    1. Title cell — project name, your name, date
    2. Setup cell — imports and database connection
    3. Question 1 — markdown heading + SQL query + DataFrame output + written insight
    4. Question 2 — markdown heading + SQL query + DataFrame output + written insight
    5. Question 3 — markdown heading + SQL query + DataFrame output + written insight
    6. Question 4 — markdown heading + SQL query + DataFrame output + written insight
    7. Question 5 — markdown heading + SQL query + DataFrame output + written insight
    8. Summary cell — three to five overall takeaways in plain English
    9. Close connection
    

    Every question cell should follow this pattern:

    # ── QUESTION 1: Regional Performance ─────────────────────
    
    q1 = pd.read_sql("""
        -- Your SQL query here
    """, conn)
    
    q1
    

    Followed by a markdown cell with your plain-English insight.

    Hints — Read Only If Stuck

    These are directional hints only — not solutions.

    Question 1: GROUP BY region with SUM for both sales and profit. Think about why the most profitable region might not be the highest revenue region — margin matters.

    Question 2: GROUP BY sub_category with SUM(profit). Sort both ascending and descending. Use HAVING to isolate sub-categories where total profit is negative.

    Question 3: JOIN orders, customers, and order_items. GROUP BY customer name. Calculate profit margin as SUM(profit) / SUM(sales) * 100. Think about what a high-revenue but low-margin customer means for the business.

    Question 4: Use a subquery or a CASE statement to flag loss-making orders. Calculate percentage in pandas after pulling the counts. GROUP BY category and region separately to find the worst offenders.

    Question 5: GROUP BY ship_mode. Use AVG for both sales and profit. Think about whether causation is implied — does ship mode cause profitability differences or just correlate with them?

    Evaluation Criteria

    Your project will be assessed on four dimensions:

    Correctness — Do your SQL queries return accurate results? Are joins and aggregations logically sound?

    Clarity — Is your notebook clean and readable? Would a colleague understand your work without asking you to explain it?

    Insight — Do your written findings go beyond restating the numbers? Does your brief say something meaningful about the business?

    Craft — Are column names clean? Is the connection managed properly? Are queries well formatted and commented?

    Sharing Your Work

    When your project is complete:
    • Post your GitHub link in the course community forum
    • Write one sentence about the most surprising thing you found in the data
    • Review one other student’s project and leave a comment on their findings brief

    Looking at how others approached the same five questions is one of the most effective ways to deepen your SQL intuition. There is rarely one right query — seeing different approaches to the same problem is genuinely instructive.

    Up next — Module 3: Data Cleaning and Wrangling

    Module 3 moves back into Python full time. You will learn how to take messy, real-world data — missing values, wrong data types, duplicates, inconsistent categories — and turn it into a clean, analysis-ready dataset. The skills in Module 3 are what separate someone who can analyse clean data from someone who can handle data the way it actually arrives in the real world.

  • SQL and Python Together: How to Use sqlite3 and pd.read_sql() for Data Analysis

    Why SQL and Python Belong Together

    Every topic in this module has used Python to run SQL queries. You have been writing SQL inside Python strings and passing them to pd.read_sql(). That combination is not a workaround — it is the standard professional workflow for data analysts who work with databases.

    SQL and Python are not competing tools. They are complementary layers in the same pipeline. SQL is where you retrieve, filter, and summarise data at the database level. Python is where you transform, visualise, model, and communicate that data. Understanding where one ends and the other begins is one of the most practically valuable things you can take away from this module.

    This topic goes deeper into that boundary. You will learn how the connection between SQLite and Python actually works, how to manage that connection properly, how to decide what belongs in SQL versus pandas, and how to structure a clean repeatable workflow that scales from a local SQLite file to a production cloud database.

    How the sqlite3 Connection Works

    Every query you have run in this module started with one line:

    conn = sqlite3.connect('superstore.db')

    That line opens a connection to a SQLite database file. A connection is a live channel between your Python session and the database. Through that channel you can send SQL statements and receive results back as Python objects.

    Understanding the connection lifecycle matters because connections consume resources. A well-written analysis opens a connection, does its work, and closes the connection cleanly. A poorly written one leaves connections open, which can cause file locking issues and unpredictable behaviour, especially when multiple processes access the same database.

    Opening and Closing Connections Properly

    import sqlite3
    import pandas as pd
    
    # Open the connection
    conn = sqlite3.connect('superstore.db')
    
    # Do your work
    df = pd.read_sql("SELECT * FROM superstore LIMIT 5", conn)
    
    # Always close when done
    conn.close()

    For longer notebooks where you need the connection throughout, the best practice is to open it once at the top and close it once at the bottom — not open and close it around every query.

    Using a Context Manager

    Python’s with statement handles the connection lifecycle automatically. The connection closes itself when the block ends, even if an error occurs inside it:

    # Context manager — connection closes automatically
    with sqlite3.connect('superstore.db') as conn:
      df = pd.read_sql("SELECT * FROM superstore LIMIT 5", conn)
      print(df)
    # conn is closed here automatically

    For notebook-based analysis the manual approach is fine. For scripts that run automatically — scheduled reports, data pipelines — always use the context manager.

    Checking What Tables Exist

    When working with an unfamiliar database, the first thing you want to know is what tables are available:

    conn = sqlite3.connect('superstore.db')
    # List all tables in the database
    tables = pd.read_sql("""
    SELECT name
    FROM sqlite_master
    WHERE type = 'table'
    ORDER BY name
    """, conn)
    print(tables)

    sqlite_master is SQLite’s internal catalogue table. It stores metadata about everything in the database — tables, indexes, and views. This query is the SQLite equivalent of asking “what is in here?” when you open an unfamiliar database for the first time.
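    Once you know the table names, SQLite's PRAGMA table_info tells you what columns each table has. A minimal sketch, using a throwaway in-memory table so it runs anywhere; against superstore.db you would point the same PRAGMA at your real tables:

```python
import sqlite3
import pandas as pd

# Throwaway in-memory database; with superstore.db you would reuse your
# existing connection instead of creating this one.
conn = sqlite3.connect(':memory:')
conn.execute("CREATE TABLE superstore (order_id TEXT, sales REAL, profit REAL)")

# PRAGMA table_info lists each column's position, name, type, and constraints
cols = pd.read_sql("PRAGMA table_info(superstore)", conn)
print(cols[['name', 'type']])

conn.close()
```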

    pd.read_sql() — The Bridge Between SQL and pandas

    pd.read_sql() is the function that executes a SQL query and returns the result directly as a pandas DataFrame. It is the core of the SQL-Python workflow.

    # Basic usage
    df = pd.read_sql(sql_query, connection)

    Once the result is a DataFrame you have the full pandas toolkit available — filtering, reshaping, visualisation, statistical analysis, and everything from Module 1.

    Passing Parameters Safely

    When your query needs to include a variable value — a user input, a date from a loop, a value from another DataFrame — never build the query by concatenating strings. This is a security risk called SQL injection and also causes bugs when values contain special characters like apostrophes.

    Instead use parameterised queries:

    # Unsafe — never do this
    region = "West"
    df = pd.read_sql(f"SELECT * FROM superstore WHERE region = '{region}'", conn)
    
    # Safe — use parameters
    region = "West"
    df = pd.read_sql(
        "SELECT * FROM superstore WHERE region = ?",
        conn,
        params=(region,)
    )

    The ? placeholder gets replaced safely by the value in params. SQLite handles the escaping automatically. This is especially important when the variable value comes from user input or an external source.

    Passing Multiple Parameters

    # Filter by region and minimum sales value
    
    region = "West"
    min_sales = 500
    df = pd.read_sql(
        """
        SELECT order_id, region, sales, profit
        FROM superstore
        WHERE region = ?
          AND sales > ?
        ORDER BY sales DESC
        """,
        conn,
        params=(region, min_sales)
    )

    Parameters are passed as a tuple in the same order as the ? placeholders appear in the query.
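    With several parameters, positional ? placeholders can get hard to read. sqlite3 also supports named placeholders (:name), with the values passed as a dict. A small self-contained sketch, using an in-memory table with assumed columns:

```python
import sqlite3
import pandas as pd

# In-memory demo table standing in for the Superstore database
conn = sqlite3.connect(':memory:')
conn.execute("CREATE TABLE superstore (region TEXT, sales REAL)")
conn.executemany("INSERT INTO superstore VALUES (?, ?)",
                 [("West", 600.0), ("West", 120.0), ("East", 900.0)])

# Named placeholders: :name in the SQL, a dict in params
df = pd.read_sql(
    "SELECT region, sales FROM superstore WHERE region = :region AND sales > :min_sales",
    conn,
    params={"region": "West", "min_sales": 500},
)
print(df)

conn.close()
```

    Named placeholders make the query self-documenting and immune to ordering mistakes when you add or remove a parameter.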

    Reading Large Datasets in Chunks

    When a query returns a very large result set — millions of rows — loading everything into memory at once can crash your notebook. pd.read_sql() supports chunked reading via the chunksize parameter:

    # Read in chunks of 1000 rows at a time
    chunks = pd.read_sql(
        "SELECT * FROM superstore",
        conn,
        chunksize=1000
    )
    
    # Process each chunk
    dfs = []
    for chunk in chunks:
        # Apply any row-level processing here
        dfs.append(chunk)
    
    df = pd.concat(dfs, ignore_index=True)
    print(f"Total rows loaded: {len(df)}")
    

    For the Superstore dataset this is not necessary — 10,000 rows loads instantly. But on a production database with millions of rows it is an essential technique to know.
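    When all you need is an aggregate, you do not have to concatenate the chunks at all: accumulate per chunk and keep memory flat. A self-contained sketch with an in-memory table of invented values:

```python
import sqlite3
import pandas as pd

# In-memory demo: 5,000 rows of invented sales values
conn = sqlite3.connect(':memory:')
conn.execute("CREATE TABLE superstore (sales REAL)")
conn.executemany("INSERT INTO superstore VALUES (?)",
                 [(float(i),) for i in range(1, 5001)])

total_sales = 0.0
row_count = 0
for chunk in pd.read_sql("SELECT sales FROM superstore", conn, chunksize=1000):
    total_sales += chunk['sales'].sum()   # only 1,000 rows in memory at a time
    row_count += len(chunk)

print(row_count, total_sales)
conn.close()
```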

    Writing Data Back to SQLite

    The SQL-Python bridge works in both directions. You can read data from a database into pandas, and you can write a pandas DataFrame back into a database as a table.

    to_sql() — Writing a DataFrame to a Database Table

    # Create a summary DataFrame in pandas
    summary = df.groupby('region').agg(
        total_sales=('sales', 'sum'),
        total_profit=('profit', 'sum'),
        order_count=('order_id', 'count')
    ).reset_index().round(2)
    
    # Write it back to the database as a new table
    summary.to_sql(
        'region_summary',       # table name
        conn,
        if_exists='replace',    # replace if table already exists
        index=False             # don't write the DataFrame index as a column
    )
    
    print("Summary table written to database.")
    

    The if_exists parameter controls what happens if the table already exists:

    • replace — drop and recreate the table
    • append — add rows to the existing table
    • fail — raise an error (the default)

    Once written back to the database, you can query this summary table with SQL just like any other table.
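    A minimal round trip, using an in-memory database and a hypothetical two-row summary frame in place of the groupby result above:

```python
import sqlite3
import pandas as pd

conn = sqlite3.connect(':memory:')

# Hypothetical summary frame standing in for the real groupby result
summary = pd.DataFrame({
    'region': ['East', 'West'],
    'total_sales': [1200.0, 900.0],
})

# Write it as a table, then query it back with plain SQL
summary.to_sql('region_summary', conn, if_exists='replace', index=False)
back = pd.read_sql(
    "SELECT region, total_sales FROM region_summary ORDER BY total_sales DESC",
    conn,
)
print(back)
conn.close()
```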

    When to Filter in SQL vs pandas

    This is the most practically important decision in the SQL-Python workflow. The wrong choice does not break anything — both tools can filter data. But the right choice makes your analysis faster, cleaner, and more professional.

    The Core Principle

    Filter and aggregate in SQL. Transform, visualise, and model in Python.

    SQL runs inside the database engine which is optimised for filtering and aggregating large datasets. When you push filtering into SQL, only the rows you actually need travel from the database to Python. When you pull everything into Python and filter there, you are loading unnecessary data into memory and doing work that the database could have done more efficiently.


    Filter in SQL When

    # ✅ Row-level filters that reduce data volume
    query("""
        SELECT *
        FROM superstore
        WHERE region = 'West'
        AND order_date >= '2021-01-01'
    """)
    
    # ✅ Aggregations that summarise large tables
    query("""
        SELECT region, SUM(sales) AS total_sales
        FROM superstore
        GROUP BY region
    """)
    
    # ✅ JOINs that combine tables
    query("""
        SELECT o.order_id, c.customer_name, oi.sales
        FROM orders o
        INNER JOIN customers c ON o.customer_id = c.customer_id
        INNER JOIN order_items oi ON o.order_id = oi.order_id
    """)
    
    # ✅ Deduplication before analysis
    query("""
        SELECT DISTINCT customer_id, segment
        FROM customers
    """)
    

    Filter in pandas When

    # ✅ Complex conditional logic involving multiple Python objects
    df['high_value'] = (df['sales'] > df['sales'].mean() * 1.5)
    
    # ✅ String operations not easily done in SQL
    df_filtered = df[df['customer_name'].str.contains('son', case=False)]
    
    # ✅ Filtering based on values calculated in Python
    threshold = df['profit'].quantile(0.75)
    df_top = df[df['profit'] > threshold]
    
    # ✅ Time-based filtering using pandas datetime methods
    df['order_date'] = pd.to_datetime(df['order_date'])
    df_recent = df[df['order_date'].dt.year == 2021]
    
    # ✅ Filtering after a merge or reshape operation in pandas
    merged = df1.merge(df2, on='customer_id')
    filtered = merged[merged['total_orders'] > 3]
    

    The Decision Framework

    Ask yourself three questions before deciding where to filter:

    1. Does the filter reduce the number of rows significantly?
      If yes, do it in SQL. Bringing fewer rows into Python is always better.
    2. Does the filter require Python objects, methods, or calculated values that SQL cannot access?
      If yes, do it in pandas after loading.
    3. Is this a one-time exploration or a repeatable pipeline?
      For pipelines, push as much as possible into SQL for performance and reliability.

    Building a Clean SQL-Python Workflow

    Here is a complete, realistic analyst workflow that shows SQL and Python working together from raw database to final insight:

    import sqlite3
    import pandas as pd
    import matplotlib.pyplot as plt
    
    # ── STEP 1: Connect ──────────────────────────────────────
    conn = sqlite3.connect('superstore.db')
    
    # ── STEP 2: Pull clean, pre-filtered data using SQL ──────
    df = pd.read_sql("""
        SELECT
            c.segment,
            o.region,
            oi.category,
            oi.sub_category,
            o.order_date,
            ROUND(oi.sales, 2)   AS sales,
            ROUND(oi.profit, 2)  AS profit
        FROM orders o
        INNER JOIN customers c
            ON o.customer_id = c.customer_id
        INNER JOIN order_items oi
            ON o.order_id = oi.order_id
        WHERE o.order_date >= '2021-01-01'
    """, conn)
    
    # ── STEP 3: Convert types in pandas ──────────────────────
    df['order_date'] = pd.to_datetime(df['order_date'])
    df['month'] = df['order_date'].dt.to_period('M')
    
    # ── STEP 4: Further analysis in pandas ───────────────────
    # Profit margin by segment
    df['profit_margin'] = (df['profit'] / df['sales'] * 100).round(2)
    
    # Monthly revenue trend
    monthly = df.groupby('month')['sales'].sum().reset_index()
    monthly.columns = ['month', 'total_sales']
    
    # Segment performance
    segment = df.groupby('segment').agg(
        total_sales=('sales', 'sum'),
        total_profit=('profit', 'sum'),
        avg_margin=('profit_margin', 'mean')
    ).round(2).reset_index()
    
    # ── STEP 5: Print insights ────────────────────────────────
    print("=== Segment Performance (2021) ===")
    print(segment.sort_values('total_profit', ascending=False))
    
    print("\n=== Monthly Revenue Trend ===")
    print(monthly)
    
    # ── STEP 6: Close connection ─────────────────────────────
    conn.close()
    

    This workflow is the template for every analysis you will build in this course going forward. SQL handles retrieval and pre-filtering. Python handles enrichment, aggregation, and presentation. Each tool does what it is best at.

    Saving Query Results for Reuse

    When a query takes a long time to run — common on large production databases — save the result to a CSV or parquet file so you do not have to re-query every time you restart your notebook:

    # Run the heavy query once
    df = pd.read_sql(heavy_query, conn)
    
    # Save locally
    df.to_csv('data/superstore_clean.csv', index=False)
    
    # Next session — load from file instead of re-querying
    df = pd.read_csv('data/superstore_clean.csv')
    

    This is standard practice in professional analytics. Query the database to get fresh data when you need it. Work from a saved file during iterative analysis and visualisation where you are not changing the underlying data pull.
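    One common way to wire this up is a small load-or-query helper: use the cached file if it exists, otherwise run the query and cache the result. A sketch, with load_or_query as a hypothetical helper name and an in-memory database standing in for the real one:

```python
import os
import sqlite3
import tempfile
import pandas as pd

def load_or_query(cache_path, sql, conn):
    """Hypothetical helper: load the cached CSV if present, else query and cache."""
    if os.path.exists(cache_path):
        return pd.read_csv(cache_path)
    df = pd.read_sql(sql, conn)
    df.to_csv(cache_path, index=False)
    return df

# Demo with an in-memory database and a temporary cache file
conn = sqlite3.connect(':memory:')
conn.execute("CREATE TABLE superstore (sales REAL)")
conn.executemany("INSERT INTO superstore VALUES (?)", [(100.0,), (250.0,)])

cache = os.path.join(tempfile.mkdtemp(), 'superstore_clean.csv')
first = load_or_query(cache, "SELECT sales FROM superstore", conn)   # hits the database
second = load_or_query(cache, "SELECT sales FROM superstore", conn)  # hits the CSV
print(len(first), len(second))
conn.close()
```

    Delete the cache file whenever you need fresh data; the next call re-queries the database automatically.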

    From SQLite to Production Databases

    Everything you have learned in this module using SQLite transfers directly to production databases. The only thing that changes is the connection setup. The SQL syntax, pd.read_sql(), parameterised queries, and the SQL-Python workflow are identical.

    Here is how connections look for the most common production databases:

    # PostgreSQL — using psycopg2
    import psycopg2
    conn = psycopg2.connect(
        host="your-host",
        database="your-db",
        user="your-user",
        password="your-password"
    )
    
    # MySQL — using mysql-connector-python
    import mysql.connector
    conn = mysql.connector.connect(
        host="your-host",
        database="your-db",
        user="your-user",
        password="your-password"
    )
    
    # BigQuery — using google-cloud-bigquery
    from google.cloud import bigquery
    client = bigquery.Client()
    df = client.query("SELECT * FROM dataset.table LIMIT 10").to_dataframe()
    
    # For PostgreSQL and MySQL connections, pd.read_sql() then works exactly as before
    df = pd.read_sql("SELECT * FROM orders LIMIT 10", conn)
    

    The credentials and connection libraries differ. The workflow after that — SQL queries, pd.read_sql(), DataFrames — is exactly the same. What you have learned here scales directly to enterprise databases handling billions of rows.

    Common Mistakes in the SQL-Python Workflow

    | Mistake | What happens | Fix |
    | --- | --- | --- |
    | Leaving connections open | File locking, resource leaks | Always call conn.close() or use a context manager |
    | Building queries with f-strings and user input | SQL injection risk, apostrophe bugs | Use parameterised queries with ? placeholders |
    | Pulling full tables into pandas before filtering | Slow, memory-heavy, unprofessional | Filter in SQL first, bring only what you need into Python |
    | Re-running expensive queries every notebook restart | Slow development cycle | Save results to CSV after the first run |
    | Not resetting the index after pd.read_sql() | Index issues in downstream operations | Add .reset_index(drop=True) if needed |
    | Hardcoding credentials in notebooks | Security risk if shared | Use environment variables or a config file |
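    On the credentials point, the standard fix is to read them from environment variables rather than typing them into a cell. The variable names below (SUPERSTORE_DB_HOST and friends) are assumptions for illustration; use whatever your team has standardised on:

```python
import os

# Credentials come from the environment, never from the notebook itself.
# Set these in your shell or a .env file before launching Jupyter.
host = os.environ.get("SUPERSTORE_DB_HOST", "localhost")
user = os.environ.get("SUPERSTORE_DB_USER", "analyst")
password = os.environ.get("SUPERSTORE_DB_PASSWORD")  # None if unset: fail loudly, not silently

print(host, user, password is None)
```

    A notebook written this way can be pushed to GitHub without leaking secrets, because the values live outside the file.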

    Practice Exercises

    1. Connect to your Superstore database, pull all orders from 2020 using a SQL WHERE filter, and calculate the monthly revenue trend in pandas.
    2. Write a parameterised query that accepts a region name as a variable and returns total sales and profit for that region. Test it for all four regions in a loop.
    3. Pull the top 10 customers by total sales using SQL GROUP BY. Then in pandas, add a column showing each customer’s share of total revenue as a percentage.
    4. Write a complete workflow: SQL pulls order data joined with customer and product tables. pandas calculates profit margin per segment. Print a clean summary table.
    5. Save the result of a complex JOIN query to a CSV file. Then reload it from CSV in a new cell and confirm the row count matches.

    Summary — What You Can Now Do

    • Open, use, and close a SQLite connection correctly in Python
    • Use context managers for safe automatic connection handling
    • Query a database using pd.read_sql() and work with the result as a pandas DataFrame
    • Write parameterised queries to safely pass variable values into SQL
    • Read large result sets in chunks using the chunksize parameter
    • Write pandas DataFrames back to a database table using to_sql()
    • Decide confidently whether a filter or aggregation belongs in SQL or pandas
    • Build a clean end-to-end SQL-Python workflow from database connection to final insight
    • Understand how the SQLite workflow transfers directly to production databases

    Module 2 Complete

    You have now finished all six topics in Module 2. Here is what you can do that you could not at the start of this module:

    • Query any relational database using SELECT, WHERE, ORDER BY, LIMIT, and DISTINCT
    • Summarise data with COUNT, SUM, AVG, MIN, MAX, GROUP BY, and HAVING
    • Combine multiple tables using INNER JOIN and LEFT JOIN
    • Write subqueries in WHERE and FROM clauses for multi-step analysis
    • Connect SQL to Python, retrieve results as DataFrames, and decide where each tool does its best work

    The Mini Project for this module brings all of this together. You will use SQL to query the Superstore database, answer five business questions, pull the results into pandas, and write a short plain-English brief of your findings — exactly what a junior analyst would be asked to do in their first month on the job.

    Up next — Module 2 Mini Project

    Five business questions. One database. SQL queries, pandas output, and a written brief. Your first end-to-end analyst deliverable.

  • SQL Subqueries Explained: WHERE and FROM Subqueries for Data Analysts with Examples

    What Is a Subquery?

    A subquery is a SELECT statement written inside another SELECT statement. The inner query runs first, produces a result, and the outer query uses that result to complete its own logic.

    You have already written queries that filter rows, aggregate data, and join tables. A subquery combines those capabilities into a single statement — letting you answer multi-step business questions without breaking them into separate queries or creating temporary tables.

    Here is the simplest way to think about it. Imagine you want to find all orders where the sales value is above average. You cannot write WHERE sales > AVG(sales) directly — SQL does not allow aggregate functions inside a WHERE clause. But you can write a subquery that calculates the average first, then use that result in the WHERE condition:

    -- Find all orders above the average sales value
    query("""
    
    SELECT order_id, customer_name, sales
    FROM superstore
    WHERE sales > (SELECT AVG(sales) FROM superstore)
    ORDER BY sales DESC
    LIMIT 10
    
    """)

    The inner query SELECT AVG(sales) FROM superstore runs first and returns a single number. The outer query then uses that number as the filter threshold. This is a subquery in its most fundamental form.

    Why Subqueries Matter for Analysts

    New analysts sometimes wonder whether subqueries are necessary — after all, you could run two separate queries and use the result of the first manually. That works for one-off exploration but breaks down quickly in real work.

    Subqueries let you build self-contained, reusable queries that answer complex questions in a single execution. They make your SQL readable, auditable, and easy to hand off to a colleague. When you save a query in a reporting tool or share it with your team, everything is in one place — not split across two separate statements that need to be run in a specific order.

    They also open the door to a category of questions that would be genuinely difficult to answer any other way — questions like “which customers spend more than the average customer?” or “which products outsell the category average?” These comparisons require knowing a benchmark first, then filtering against it. Subqueries are the natural SQL tool for that.

    Types of Subqueries Covered in This Topic

    There are two subquery patterns every analyst needs to know:

    WHERE subquery — the inner query runs and produces a value or list of values that the outer query filters on. Used for comparison and filtering against calculated benchmarks.

    FROM subquery — the inner query runs and produces a temporary table that the outer query selects from. Used for multi-step aggregation and pre-filtering before a summary.

    Both patterns follow the same core principle: the inner query runs first, produces a result, and the outer query uses that result.

    WHERE Subqueries

    A WHERE subquery places a SELECT statement inside the WHERE clause. The inner query must return either a single value or a list of values that the outer WHERE condition can compare against.

    Comparing Against a Single Calculated Value

    This is the most common WHERE subquery pattern. Calculate a benchmark with the inner query, then filter rows against it in the outer query.

    -- Which orders have sales above the overall average?
    query("""
    
    SELECT
        order_id,
        customer_name,
        region,
        ROUND(sales, 2) AS sales
    FROM superstore
    WHERE sales > (
        SELECT AVG(sales)
        FROM superstore
    )
    ORDER BY sales DESC
    LIMIT 10
    
    """)

    The inner query returns one number — the average sales value across all orders. The outer query filters to rows where individual sales exceed that number. This gives you above-average orders without needing to know the average value in advance.

    -- Which orders have profit above the average profit?
    query("""
    
    SELECT
        order_id,
        customer_name,
        category,
        ROUND(profit, 2) AS profit
    FROM superstore
    WHERE profit > (
        SELECT AVG(profit)
        FROM superstore
    )
    ORDER BY profit DESC
    LIMIT 10
    
    """)

    Using IN with a Subquery

    When the inner query returns multiple values instead of one, use IN to check whether the outer query’s column matches any value in that list.

    -- Find all orders placed by customers in the Corporate segment
    query("""
    
    SELECT
        order_id,
        customer_id,
        order_date,
        region
    FROM orders
    WHERE customer_id IN (
        SELECT customer_id
        FROM customers
        WHERE segment = 'Corporate'
    )
    ORDER BY order_date DESC
    LIMIT 10
    
    """)

    The inner query returns a list of customer IDs belonging to the Corporate segment. The outer query then filters the orders table to only show orders where the customer ID appears in that list. This achieves a similar result to a JOIN — but the logic reads differently and is sometimes clearer depending on the question being asked.

    -- Find orders containing Technology products with high profit
    query("""
    
    SELECT
        o.order_id,
        o.order_date,
        o.region
    FROM orders o
    WHERE o.order_id IN (
        SELECT order_id
        FROM order_items
        WHERE category = 'Technology'
          AND profit > 500
    )
    ORDER BY o.order_date DESC
    LIMIT 10
    
    """)

    NOT IN — Exclusion Filter

    The reverse of IN is NOT IN — filter to rows where the value does not appear in the subquery result. Useful for finding records that are absent from another dataset.

    -- Find customers who have never ordered a Technology product
    query("""
    
    SELECT DISTINCT
        customer_id,
        customer_name,
        segment
    FROM customers
    WHERE customer_id NOT IN (
        SELECT DISTINCT o.customer_id
        FROM orders o
        INNER JOIN order_items oi
            ON o.order_id = oi.order_id
        WHERE oi.category = 'Technology'
    )
    ORDER BY customer_name
    LIMIT 10
    
    """)

    NOT IN and NULLs — important warning: If the subquery result contains any NULL values, NOT IN returns no rows at all. This is a subtle but serious bug. Always add WHERE column IS NOT NULL inside a NOT IN subquery to be safe.
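    You can see the trap in miniature with a throwaway in-memory database; the table names here are invented for the demo:

```python
import sqlite3
import pandas as pd

# Tiny in-memory demo of the NOT IN + NULL trap
conn = sqlite3.connect(':memory:')
conn.execute("CREATE TABLE customers (customer_id TEXT)")
conn.execute("CREATE TABLE tech_buyers (customer_id TEXT)")
conn.executemany("INSERT INTO customers VALUES (?)", [("C1",), ("C2",), ("C3",)])
conn.executemany("INSERT INTO tech_buyers VALUES (?)", [("C1",), (None,)])

# A NULL in the subquery result makes NOT IN return no rows at all
buggy = pd.read_sql("""
    SELECT customer_id FROM customers
    WHERE customer_id NOT IN (SELECT customer_id FROM tech_buyers)
""", conn)

# Guarding with IS NOT NULL restores the expected behaviour: C2 and C3 come back
safe = pd.read_sql("""
    SELECT customer_id FROM customers
    WHERE customer_id NOT IN (
        SELECT customer_id FROM tech_buyers WHERE customer_id IS NOT NULL
    )
""", conn)

print(len(buggy), len(safe))
conn.close()
```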

    FROM Subqueries

    A FROM subquery places a SELECT statement in the FROM clause, treating the result of the inner query as a temporary table. The outer query then selects from that temporary table as if it were a real one.

    This pattern is used when you need to aggregate data in two stages — for example, first summarise by one dimension, then summarise or filter that summary further.

    Syntax:

    SELECT columns
    FROM (
        SELECT columns
        FROM table
        GROUP BY something
    ) AS subquery_alias
    WHERE condition

    The alias after the closing bracket, AS subquery_alias, is required: SQL needs a name to refer to the temporary table in the outer query. The AS keyword itself is optional in most databases, but the alias is not.

    Two-Stage Aggregation

    -- Step 1 inner query: calculate total sales per customer
    -- Step 2 outer query: find customers above a revenue threshold
    query("""
    
    SELECT
    customer_name,
    segment,
    ROUND(total_sales, 2) AS total_sales
    FROM (
    SELECT
    c.customer_name,
    c.segment,
    SUM(oi.sales) AS total_sales
    FROM orders o
    INNER JOIN customers c
    ON o.customer_id = c.customer_id
    INNER JOIN order_items oi
    ON o.order_id = oi.order_id
    GROUP BY c.customer_name, c.segment
    ) AS customer_summary
    WHERE total_sales > 5000
    ORDER BY total_sales DESC
    LIMIT 10
    
    """)

    Notice the outer query uses WHERE total_sales > 5000, filtering on the aggregated column from the inner query. A HAVING SUM(oi.sales) > 5000 clause would return the same rows for this simple threshold, but the FROM subquery pattern generalises further: once the summary exists as a derived table you can join it, aggregate it again, or filter it on several computed columns at once.
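If you want to experiment with this pattern outside the Superstore notebook, here is a self-contained miniature using an in-memory SQLite database (the table and values are illustrative):

```python
import sqlite3
import pandas as pd

conn = sqlite3.connect(':memory:')
conn.executescript("""
CREATE TABLE sales (customer TEXT, amount REAL);
INSERT INTO sales VALUES
  ('Ann', 3000), ('Ann', 4000),
  ('Bob', 1000), ('Bob', 2000),
  ('Cat', 6000);
""")

# Stage 1 (inner): total per customer. Stage 2 (outer): threshold filter.
big_spenders = pd.read_sql("""
    SELECT customer, total
    FROM (
        SELECT customer, SUM(amount) AS total
        FROM sales
        GROUP BY customer
    ) AS summary
    WHERE total > 5000
    ORDER BY total DESC
""", conn)

print(big_spenders)
# Ann (7000) and Cat (6000) pass the threshold; Bob (3000) is filtered out
```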

    Filtering a Summary Before Further Analysis

    -- Find regions where average order value exceeds $250
    query("""
    
    SELECT
    region,
    ROUND(avg_order_value, 2) AS avg_order_value,
    total_orders
    FROM (
    SELECT
    o.region,
    AVG(oi.sales) AS avg_order_value,
    COUNT(*) AS total_orders
    FROM orders o
    INNER JOIN order_items oi
    ON o.order_id = oi.order_id
    GROUP BY o.region
    ) AS region_summary
    WHERE avg_order_value > 250
    ORDER BY avg_order_value DESC
    
    """)

    The inner query calculates average order value and total orders per region. The outer query then filters to regions where the average exceeds $250. This is clean, readable, and easy to modify — just change the threshold in the outer WHERE clause.

    Ranking Categories by Performance

    -- Which product categories have above-average total profit?
    query("""
    
    SELECT
    category,
    ROUND(total_profit, 2) AS total_profit
    FROM (
    SELECT
    category,
    SUM(profit) AS total_profit
    FROM order_items
    GROUP BY category
    ) AS category_totals
    WHERE total_profit > (
    SELECT AVG(total_profit)
    FROM (
    SELECT SUM(profit) AS total_profit
    FROM order_items
    GROUP BY category
    ) AS avg_calc
    )
    ORDER BY total_profit DESC
    
    """)

    This query has a subquery inside a FROM clause and another subquery inside a WHERE clause — both in the same statement. It reads from the inside out: the innermost queries run first, their results feed the next level, and the outermost query produces the final answer. This is advanced but follows the exact same rules you have already learned.

    Subquery vs JOIN — When to Use Which

    Both subqueries and JOINs can answer many of the same questions. Choosing between them is partly about correctness and partly about readability.

    Situation | Better Choice | Why
    Filtering based on a calculated value (avg, max) | Subquery in WHERE | JOINs cannot filter on aggregations directly
    Finding records absent from another table | Subquery with NOT IN or LEFT JOIN + IS NULL | Both work — LEFT JOIN is safer with NULLs
    Combining columns from two tables in the result | JOIN | Subqueries in WHERE do not add columns to output
    Two-stage aggregation | FROM subquery | Cleaner than a JOIN for pre-summarised data
    Simple lookup across two tables | JOIN | Faster and more readable for straightforward matches
    Filtering to a dynamic list from another table | Either — IN subquery or INNER JOIN both work | JOIN is often faster on large datasets

    The practical guideline: use a JOIN when you need columns from both tables in your output. Use a subquery when you are filtering or calculating based on a value derived from another query and do not need to show those extra columns.
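The equivalence between the two filtering styles can be checked directly. This sketch (in-memory SQLite, illustrative data) filters orders to Corporate customers both ways:

```python
import sqlite3
import pandas as pd

conn = sqlite3.connect(':memory:')
conn.executescript("""
CREATE TABLE customers (customer_id TEXT, segment TEXT);
INSERT INTO customers VALUES ('A', 'Corporate'), ('B', 'Consumer');
CREATE TABLE orders (order_id INTEGER, customer_id TEXT);
INSERT INTO orders VALUES (1, 'A'), (2, 'A'), (3, 'B');
""")

# Filtering with IN: one output row per qualifying order
via_in = pd.read_sql("""
    SELECT order_id FROM orders
    WHERE customer_id IN (
        SELECT customer_id FROM customers WHERE segment = 'Corporate'
    )
""", conn)

# The same filter as a JOIN: equivalent here because customer_id is unique
via_join = pd.read_sql("""
    SELECT o.order_id
    FROM orders o
    INNER JOIN customers c ON o.customer_id = c.customer_id
    WHERE c.segment = 'Corporate'
""", conn)

print(sorted(via_in['order_id']) == sorted(via_join['order_id']))  # True
```

One subtlety worth remembering: if the customers table ever contained duplicate customer_id rows, the JOIN version would duplicate order rows while the IN version would not.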

    Common Subquery Mistakes

    Mistake | What Happens | Fix
    Missing alias on FROM subquery | SQL error — every derived table needs a name | Always add AS alias_name after the closing bracket
    Inner query returns multiple rows in a single-value context | SQL error | Use IN instead of = when the subquery can return multiple rows
    NOT IN with NULLs in subquery result | Returns zero rows silently | Add WHERE column IS NOT NULL inside the NOT IN subquery
    Deeply nested subqueries that are hard to read | Difficult to debug and maintain | Break into steps using pandas after pulling data, or use CTEs in future
    Using a subquery when a JOIN would be simpler | Slower and harder to read | If you need columns from both tables, use a JOIN

    Practice Exercises

    1. Find all orders where sales are below the average sales value. Show order ID, customer name, and sales. Sort by sales ascending.
    2. Find all customers from the Consumer segment who have placed more than 5 orders. Use a FROM subquery to first count orders per customer, then filter.
    3. Find all sub-categories where total profit is above the average sub-category profit. Use a FROM subquery for the totals and a WHERE subquery for the average.
    4. Using NOT IN, find all customers who have never placed an order in the West region.
    5. Find the top 3 regions by average order value using a FROM subquery. Show region, average order value, and total orders.

    Summary — What You Can Now Do

    • Explain what a subquery is and why it runs before the outer query
    • Write a WHERE subquery to filter rows against a single calculated value
    • Use IN and NOT IN with subqueries to filter against a list of values
    • Write a FROM subquery to create a temporary summary table for further filtering
    • Combine WHERE and FROM subqueries in the same query for multi-step analysis
    • Choose between a subquery and a JOIN based on what the question requires
    • Avoid common subquery errors including missing aliases and NOT IN with NULLs

    Up next — Topic 6: SQL Meets Python

    Topic 6 brings everything together — connecting SQLite to Python, running queries with pd.read_sql(), deciding when to filter in SQL versus pandas, and building a workflow where SQL retrieves the data and Python does the analysis. This is the bridge between Module 2 and everything that follows in the course.

  • SQL JOINs Explained: INNER JOIN and LEFT JOIN for Data Analysts with Examples

    What Is a JOIN and Why Do You Need It

    Every query in Topics 2 and 3 touched a single table. That works fine when all your data lives in one place — but in real databases it almost never does.

    Customer details live in a customers table. Orders live in an orders table. Products live in a products table. These tables are kept separate deliberately — it avoids storing the same customer name and address on every single order they place. Instead, each order stores a customer ID, and that ID links back to the customer record.

    This is efficient for storage and data integrity. But it means that to answer most real business questions, you need to combine two or more tables. That is exactly what a JOIN does.

    A JOIN connects two tables based on a shared column — usually an ID that appears in both. The result is a new combined table containing columns from both sources, matched row by row.

    Without JOINs, a relational database is just a collection of disconnected tables. With JOINs, it becomes a connected system you can query across freely.

    Setting Up the Superstore Tables for JOINs

    In Topics 2 and 3 the Superstore data was one flat table. To practice JOINs properly you need it split into separate tables the way a real database works. Run this setup code once in your notebook to create the tables:

    import sqlite3
    import pandas as pd
    
    conn = sqlite3.connect('superstore.db')
    
    # Load the original flat table
    df = pd.read_csv('superstore_sales.csv')
    df.columns = [col.strip().replace(' ', '_').lower() for col in df.columns]
    
    # Create customers table
    customers = df[['customer_id', 'customer_name', 'segment']].drop_duplicates()
    customers.to_sql('customers', conn, if_exists='replace', index=False)
    
    # Create orders table
    orders = df[['order_id', 'customer_id', 'order_date',
                 'ship_date', 'ship_mode', 'region',
                 'city', 'state']].drop_duplicates()
    orders.to_sql('orders', conn, if_exists='replace', index=False)
    
    # Create order_items table
    order_items = df[['order_id', 'product_id', 'sub_category',
                      'category', 'sales', 'quantity',
                      'discount', 'profit']].drop_duplicates()
    order_items.to_sql('order_items', conn, if_exists='replace', index=False)
    
    print("Tables created successfully.")
    conn.close()
    

    Now you have three separate tables — customers, orders, and order_items — linked by shared ID columns. This mirrors how a real company database is structured.

    conn = sqlite3.connect('superstore.db')
    
    def query(sql):
        return pd.read_sql(sql, conn)

    How a JOIN Works — The Mental Model

    Before writing JOIN syntax, understand what is happening conceptually.

    Imagine two tables sitting side by side. The orders table has a column called customer_id. The customers table also has a column called customer_id. A JOIN says: for every row in orders, find the matching row in customers where the customer_id values are equal, and combine them into one wider row.

    The column you join on is called the join key. It must exist in both tables and contain matching values. In the Superstore setup:

    •   orders.customer_id links to customers.customer_id
    •   order_items.order_id links to orders.order_id

    The difference between JOIN types is what happens when a match is not found. That is the entire distinction between INNER JOIN and LEFT JOIN.

    INNER JOIN — Only Matching Rows

    Syntax:

    SELECT columns
    FROM table_one
    INNER JOIN table_two ON table_one.key = table_two.key

    INNER JOIN returns only rows where a match exists in both tables. If a row in the left table has no matching row in the right table, it is excluded from the result. If a row in the right table has no match in the left table, it is also excluded.

    Think of it as the intersection — only rows that exist in both tables come through.

    Your First INNER JOIN

    -- Combine orders with customer details
    query("""
    
    SELECT
    o.order_id,
    o.order_date,
    o.region,
    c.customer_name,
    c.segment
    FROM orders o
    INNER JOIN customers c ON o.customer_id = c.customer_id
    LIMIT 10
    
    """)

    Output (first 5 rows):

    order_id       | order_date | region | customer_name   | segment
    CA-2020-152156 | 2020-11-08 | South  | Claire Gute     | Consumer
    CA-2020-138688 | 2020-06-12 | West   | Darrin Van Huff | Corporate
    US-2019-108966 | 2019-10-11 | South  | Sean O’Donnell  | Consumer
    CA-2019-115812 | 2019-06-09 | East   | Brosina Hoffman | Consumer
    CA-2019-114412 | 2019-04-15 | West   | Andrew Allen    | Consumer

    Notice o. and c. before column names. These are table aliases — shorthand so you don’t have to type the full table name every time. orders o means “refer to the orders table as o.” This becomes essential when both tables have columns with the same name.

    Joining Three Tables

    Most real analyst queries join more than two tables. Here is how to bring orders, customers, and order_items together in one query:

    -- Full picture: customer details + order details + financials
    query("""
    
    SELECT
    c.customer_name,
    c.segment,
    o.order_date,
    o.region,
    oi.category,
    oi.sub_category,
    ROUND(oi.sales, 2) AS sales,
    ROUND(oi.profit, 2) AS profit
    FROM orders o
    INNER JOIN customers c
    ON o.customer_id = c.customer_id
    INNER JOIN order_items oi
    ON o.order_id = oi.order_id
    LIMIT 10
    
    """)

    Each INNER JOIN adds another table into the result. The pattern is always the same — join on the shared key column between the two tables being connected.

    INNER JOIN with WHERE and ORDER BY

    JOINs combine cleanly with everything from Topics 2 and 3:

    -- High-value orders in the West region with customer details
    query("""
    
    SELECT
    c.customer_name,
    c.segment,
    o.region,
    oi.category,
    ROUND(oi.sales, 2) AS sales,
    ROUND(oi.profit, 2) AS profit
    FROM orders o
    INNER JOIN customers c
    ON o.customer_id = c.customer_id
    INNER JOIN order_items oi
    ON o.order_id = oi.order_id
    WHERE o.region = 'West'
    AND oi.sales > 1000
    ORDER BY oi.sales DESC
    LIMIT 10
    
    """)

    LEFT JOIN — Keep All Rows from the Left Table

    Syntax:

    SELECT columns
    FROM table_one
    LEFT JOIN table_two ON table_one.key = table_two.key

    LEFT JOIN returns all rows from the left table, plus matching rows from the right table. When no match exists in the right table, the columns from the right table come back as NULL.

    The key difference from INNER JOIN: no rows from the left table are ever dropped. Even if they have no match on the right side, they appear in the result — just with NULL values in the right table’s columns.

    When to Use LEFT JOIN

    LEFT JOIN is the right choice when you want to keep all records from one table regardless of whether they have matching records in another. Common scenarios:

    • Find customers who have never placed an order
    • Find products that have never been sold
    • Identify records in one system that are missing from another
    • Audit data completeness across two sources

    LEFT JOIN Example — Finding Unmatched Records

    To demonstrate LEFT JOIN clearly, first add a test customer with no orders:

    # Add a customer who has never ordered
    import pandas as pd
    import sqlite3
    
    conn = sqlite3.connect('superstore.db')
    new_customer = pd.DataFrame([{
        'customer_id': 'TEST-001',
        'customer_name': 'Test Customer',
        'segment': 'Consumer'
    }])
    new_customer.to_sql('customers', conn, if_exists='append', index=False)
    conn.commit()

    Now run a LEFT JOIN to find customers with no orders:

    -- Find customers who have never placed an order
    query("""
    
    SELECT
    c.customer_id,
    c.customer_name,
    c.segment,
    o.order_id
    FROM customers c
    LEFT JOIN orders o ON c.customer_id = o.customer_id
    WHERE o.order_id IS NULL
    
    """)

    Output:

    customer_id | customer_name | segment  | order_id
    TEST-001    | Test Customer | Consumer | NULL

    The NULL in order_id tells you this customer exists in the customers table but has no matching record in the orders table. The INNER JOIN version of this query would have excluded this row entirely — you would never know this customer existed.

    This pattern — LEFT JOIN followed by WHERE right_table.key IS NULL — is one of the most useful SQL techniques for data quality auditing.
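Here is the anti-join pattern in a self-contained miniature (in-memory SQLite, illustrative rows). Note that, unlike NOT IN, a NULL in the right table's key column does not poison the result:

```python
import sqlite3
import pandas as pd

conn = sqlite3.connect(':memory:')
conn.executescript("""
CREATE TABLE customers (customer_id TEXT);
INSERT INTO customers VALUES ('A'), ('B'), ('C');
CREATE TABLE orders (order_id INTEGER, customer_id TEXT);
INSERT INTO orders VALUES (1, 'A'), (2, NULL);
""")

# Anti-join: unmatched left rows surface with NULL keys, then get kept
no_orders = pd.read_sql("""
    SELECT c.customer_id
    FROM customers c
    LEFT JOIN orders o ON c.customer_id = o.customer_id
    WHERE o.order_id IS NULL
""", conn)

print(sorted(no_orders['customer_id']))  # ['B', 'C']
```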

    Handling NULLs After a JOIN

    NULL values appearing after a LEFT JOIN are expected and useful — they signal missing matches. But they need to be handled carefully in any further calculations or filtering.

    COALESCE — Replacing NULL with a Default Value

    COALESCE returns the first non-NULL value from a list of arguments. Use it to replace NULLs with a meaningful default:

    -- Replace NULL order_id with a readable label
    query("""
    
    SELECT
    c.customer_name,
    COALESCE(o.order_id, 'No orders yet') AS order_status
    FROM customers c
    LEFT JOIN orders o ON c.customer_id = o.customer_id
    WHERE o.order_id IS NULL
    
    """)

    COALESCE is also useful when joining tables where a column might be populated in one table but missing in another. Rather than seeing NULL in your output, you get a clean fallback value.
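A quick way to see the left-to-right behaviour, runnable without any tables (SQLite permits SELECT without FROM):

```python
import sqlite3
import pandas as pd

conn = sqlite3.connect(':memory:')

# COALESCE walks its arguments left to right and returns the first non-NULL
df = pd.read_sql("""
    SELECT
        COALESCE(NULL, NULL, 'fallback') AS first_non_null,
        COALESCE('value', 'fallback')    AS already_set
""", conn)

print(df.iloc[0].tolist())  # ['fallback', 'value']
```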

    NULL in Aggregate Functions After a JOIN

    One important behaviour: aggregate functions like COUNT, SUM, and AVG ignore NULL values automatically. This matters after a LEFT JOIN because unmatched rows produce NULLs in the right table’s columns.

    -- Count orders per customer — unmatched customers show 0, not NULL
    query("""
    
    SELECT
    c.customer_name,
    COUNT(o.order_id) AS order_count
    FROM customers c
    LEFT JOIN orders o ON c.customer_id = o.customer_id
    GROUP BY c.customer_name
    ORDER BY order_count DESC
    LIMIT 10
    
    """)

    COUNT(o.order_id) counts non-NULL values only — so customers with no orders correctly show 0. If you used COUNT(*) here you would get 1 for every customer including unmatched ones, because COUNT(*) counts the row itself regardless of NULL values.
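The difference shows up clearly on a miniature dataset (in-memory SQLite, illustrative rows):

```python
import sqlite3
import pandas as pd

conn = sqlite3.connect(':memory:')
conn.executescript("""
CREATE TABLE customers (customer_id TEXT);
INSERT INTO customers VALUES ('A'), ('B');
CREATE TABLE orders (order_id INTEGER, customer_id TEXT);
INSERT INTO orders VALUES (1, 'A'), (2, 'A');
""")

counts = pd.read_sql("""
    SELECT
        c.customer_id,
        COUNT(*)          AS count_star,   -- counts rows, NULL or not
        COUNT(o.order_id) AS count_column  -- counts non-NULL order_ids only
    FROM customers c
    LEFT JOIN orders o ON c.customer_id = o.customer_id
    GROUP BY c.customer_id
    ORDER BY c.customer_id
""", conn)

print(counts)
# Customer A: count_star 2, count_column 2
# Customer B (no orders): count_star 1, count_column 0
```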

    INNER JOIN vs LEFT JOIN — When to Use Which

    Situation | Use
    You only want rows that exist in both tables | INNER JOIN
    You want all rows from the left table, matched or not | LEFT JOIN
    Finding records that are missing from another table | LEFT JOIN + WHERE right key IS NULL
    Combining sales data with product details | INNER JOIN
    Auditing customers with no orders | LEFT JOIN
    Getting complete order history including customer info | INNER JOIN
    Data completeness check across two systems | LEFT JOIN

    The default choice for most analyst queries is INNER JOIN — you usually only want complete, matched records. Reach for LEFT JOIN specifically when missing matches are meaningful information rather than just gaps to exclude.

    A Complete Business Query Using JOINs

    Here is a realistic analyst query combining JOINs, WHERE, GROUP BY, and ORDER BY to answer a full business question:

    Question: Which customer segments generate the most revenue and profit, broken down by product category?

    query("""
    
    SELECT
    c.segment,
    oi.category,
    COUNT(DISTINCT o.order_id) AS total_orders,
    ROUND(SUM(oi.sales), 2) AS total_revenue,
    ROUND(SUM(oi.profit), 2) AS total_profit,
    ROUND(SUM(oi.profit) /
    SUM(oi.sales) * 100, 1) AS profit_margin_pct
    FROM orders o
    INNER JOIN customers c
    ON o.customer_id = c.customer_id
    INNER JOIN order_items oi
    ON o.order_id = oi.order_id
    GROUP BY c.segment, oi.category
    ORDER BY c.segment, total_revenue DESC
    
    """)

    This single query pulls from three tables, aggregates across two dimensions, calculates a derived metric, and produces a result a manager could read directly in a meeting. That is the power of combining JOINs with everything from the previous topics.

    Common JOIN Mistakes

    Mistake | What Happens | Fix
    Joining on the wrong column | Incorrect or empty results | Double-check which columns are the shared keys between tables
    Forgetting table aliases when column names clash | SQL error — ambiguous column name | Always use aliases when the same column name exists in both tables
    Using INNER JOIN when LEFT JOIN is needed | Silently drops unmatched rows | Ask yourself — do I care about rows with no match? If yes, use LEFT JOIN
    Not handling NULLs after LEFT JOIN | Wrong aggregation results | Use COALESCE for display, COUNT(column) not COUNT(*) for counting
    Joining without ON clause | Cartesian product — every row matched to every row | Always include the ON condition

    Practice Exercises

    1. Join the orders and customers tables. Show the customer name, region, and order date for all orders placed in 2021.
    2. Join all three tables. Find the top 5 customers by total profit across all their orders.
    3. Using a LEFT JOIN, find any order IDs in order_items that do not have a matching record in the orders table.
    4. Join orders and customers. Group by segment and region. Show total revenue per combination sorted by revenue descending.
    5. Join all three tables. Filter to Furniture category only. Show total sales and profit per customer segment.

    Summary — What You Can Now Do

    • Explain what a JOIN does and why relational databases require them
    • Write an INNER JOIN to combine two or more tables on a shared key column
    • Write a LEFT JOIN to keep all rows from the left table including unmatched ones
    • Use LEFT JOIN with WHERE IS NULL to find records missing from a second table
    • Handle NULLs after a JOIN using COALESCE and COUNT(column) vs COUNT(*)
    • Combine JOINs with WHERE, GROUP BY, ORDER BY, and HAVING in a single query
    • Choose between INNER JOIN and LEFT JOIN based on whether unmatched rows matter

    Up next — Topic 5: Subqueries

    Topic 5 covers queries inside queries — how to use a SELECT result as a filter in WHERE, or as a derived table in FROM. Subqueries let you answer multi-step business questions in a single SQL statement without needing temporary tables or multiple separate queries.

  • How to Use COUNT, SUM, AVG, GROUP BY and HAVING in SQL

    SQL Aggregations Explained: COUNT, SUM, AVG, GROUP BY and HAVING with Examples

    In Topic 2 you learned to retrieve and filter individual rows. That is useful — but it is not where the real analysis happens. A manager does not want to see 10,000 individual order rows. They want to know: what is our total revenue this year? Which region is most profitable? Which product category has the highest average order value?
    Those questions cannot be answered by looking at individual rows. They require aggregation — combining multiple rows into a single summary figure. This topic covers the five aggregate functions every analyst uses daily, along with GROUP BY and HAVING which control how aggregations are grouped and filtered.

    What Is Aggregation and Why Does It Matter

    Aggregation means collapsing multiple rows of data into a single calculated value. Instead of seeing every individual sale, you see the total. Instead of every order’s profit, you see the average. Instead of each transaction, you see a count.
    In pandas you did this with groupby() and agg() in Module 1. SQL does the same thing — but directly inside the database, before the data even reaches Python. This matters for performance. Aggregating 500,000 rows in SQL and returning a 4-row summary is dramatically faster than loading all 500,000 rows into Python and then aggregating them.
    In a professional analyst workflow the rule is simple: aggregate in SQL, visualise and present in Python. Do the heavy lifting where the data lives.

    The Five Aggregate Functions

    Before combining these with GROUP BY, understand what each function does on its own when applied to an entire table.

    COUNT — How Many Rows

    COUNT counts the number of rows. It is the most frequently used aggregate function in day-to-day analyst work.

    -- How many orders are in the dataset?
    query("""
    
    SELECT COUNT(*) AS total_orders
    FROM superstore
    
    """)

    Output:

    total_orders
    9994

    COUNT(*) counts all rows including nulls. COUNT(column_name) counts only rows where that column is not null — a useful distinction when checking data completeness:

    -- How many rows have a valid postal code vs total rows?
    query("""
    
    SELECT
    COUNT(*) AS total_rows,
    COUNT(postal_code) AS rows_with_postal_code
    FROM superstore
    
    """)

    If these two numbers differ, you have missing values in that column — which is exactly what you want to know before any analysis.

    SUM — Total Value

    SUM adds up all values in a numeric column. Used constantly for revenue, profit, quantity, and any other additive metric.

    -- What is the total revenue and total profit across all orders?
    query("""
    
    SELECT
    ROUND(SUM(sales), 2) AS total_revenue,
    ROUND(SUM(profit), 2) AS total_profit
    FROM superstore
    
    """)

    Output:

    total_revenue | total_profit
    2297200.86    | 286397.02

    ROUND(value, 2) keeps results to two decimal places. Always use it with financial figures — raw floating point results from SQL can be messy.

    AVG — Average Value

    AVG calculates the mean of a numeric column. Useful for understanding typical order size, average margin, average quantity, and so on.

    -- What is the average order value and average profit per transaction?
    query("""
    
    SELECT
    ROUND(AVG(sales), 2) AS avg_order_value,
    ROUND(AVG(profit), 2) AS avg_profit
    FROM superstore
    
    """)

    Mean vs median reminder: AVG calculates the mean, which is pulled upward by large outliers. A small number of very large orders can make the average look higher than what a typical order is worth. In Module 1 you compared mean and median in pandas — keep that in mind when interpreting AVG results from SQL.
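A tiny pandas example of the distortion, with illustrative numbers:

```python
import pandas as pd

# Four typical orders and one large outlier
orders = pd.Series([20, 25, 30, 35, 2000])

print(orders.mean())    # 422.0 — dragged up by the single large order
print(orders.median())  # 30.0 — what a typical order actually looks like
```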

    MIN and MAX — Smallest and Largest Values

    MIN and MAX return the lowest and highest values in a column. Useful for understanding the range of your data and spotting outliers.

    -- What is the range of order values in the dataset?
    query("""
    
    SELECT
    ROUND(MIN(sales), 2) AS smallest_order,
    ROUND(MAX(sales), 2) AS largest_order,
    ROUND(MIN(profit), 2) AS worst_profit,
    ROUND(MAX(profit), 2) AS best_profit
    FROM superstore
    
    """)

    Output:

    smallest_order | largest_order | worst_profit | best_profit
    0.44           | 22638.48      | -6599.98     | 8399.98

    The worst profit being nearly -$6,600 on a single order is a significant finding. MAX and MIN surface these extremes instantly — without having to sort through thousands of rows manually.

    GROUP BY — Aggregating by Category

    Aggregate functions on their own give you a single number for the entire table. GROUP BY breaks that down by a category — giving you one summary row per group instead of one summary row for everything.

    Syntax:

    SELECT column, AGG_FUNCTION(column)
    FROM table
    GROUP BY column

    This is the SQL equivalent of pandas groupby(). And just like in pandas, it is one of the most powerful tools in your analyst toolkit.
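If the translation is not obvious, this sketch runs the same aggregation both ways on a small illustrative DataFrame and confirms they agree:

```python
import sqlite3
import pandas as pd

df = pd.DataFrame({
    'region': ['West', 'West', 'East'],
    'sales':  [100.0, 50.0, 75.0],
})

conn = sqlite3.connect(':memory:')
df.to_sql('superstore', conn, index=False)

# The same aggregation in SQL...
sql_totals = pd.read_sql("""
    SELECT region, SUM(sales) AS total_sales
    FROM superstore
    GROUP BY region
    ORDER BY region
""", conn)

# ...and in pandas
pandas_totals = (df.groupby('region', as_index=False)['sales']
                   .sum()
                   .rename(columns={'sales': 'total_sales'})
                   .sort_values('region', ignore_index=True))

print(sql_totals.equals(pandas_totals))  # True
```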

    Total Sales and Profit by Region

    -- Business question: how does each region perform?
    query("""
    
    SELECT
    region,
    COUNT(*) AS total_orders,
    ROUND(SUM(sales), 2) AS total_revenue,
    ROUND(SUM(profit), 2) AS total_profit
    FROM superstore
    GROUP BY region
    ORDER BY total_profit DESC
    
    """)

    Output:

    region  | total_orders | total_revenue | total_profit
    West    | 3203         | 725457.82     | 108418.45
    East    | 2848         | 678781.24     | 91522.78
    Central | 2323         | 501239.89     | 39706.36
    South   | 1620         | 391721.91     | 46749.43

    Four rows. One per region. Each one summarising thousands of individual transactions. This is what aggregation does — it turns raw data into something a manager can read and act on in seconds.

    Grouping by Multiple Columns

    You can GROUP BY more than one column to get more granular breakdowns:

    -- Revenue and profit broken down by region AND category
    query("""
    
    SELECT
    region,
    category,
    ROUND(SUM(sales), 2) AS total_revenue,
    ROUND(SUM(profit), 2) AS total_profit
    FROM superstore
    GROUP BY region, category
    ORDER BY region, total_profit DESC
    
    """)

    This gives you one row per unique region-category combination — 12 rows total (4 regions × 3 categories). A clean, structured view of performance across two dimensions at once.

    The GROUP BY Rule — Every Non-Aggregated Column Must Be Listed

    This is the rule that trips up every beginner at least once. In a SELECT statement that uses GROUP BY, every column that is not inside an aggregate function must appear in the GROUP BY clause. If you select region and SUM(sales), you must GROUP BY region. If you select region, category, and SUM(sales), you must GROUP BY region, category.

    -- Strict databases reject this: category is selected but not grouped
    -- (SQLite runs it anyway and picks an arbitrary category per region)
    query("""
    
    SELECT region, category, SUM(sales)
    FROM superstore
    GROUP BY region -- missing category here
    
    """)
    -- This is correct
    query("""
    
    SELECT region, category, SUM(sales)
    FROM superstore
    GROUP BY region, category
    
    """)

    When you get a GROUP BY error, the fix is almost always: add the missing column to the GROUP BY clause.
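One caveat for this course's setup: SQLite is unusually forgiving here. Rather than raising an error, it picks an arbitrary value for the ungrouped column, which is a silent wrong answer and arguably worse. A minimal demonstration on illustrative data:

```python
import sqlite3
import pandas as pd

conn = sqlite3.connect(':memory:')
conn.executescript("""
CREATE TABLE superstore (region TEXT, category TEXT, sales REAL);
INSERT INTO superstore VALUES
  ('West', 'Furniture', 100), ('West', 'Technology', 50);
""")

# Most databases reject this query; SQLite runs it and returns an
# arbitrary category value for each region group — no error, just wrong
loose = pd.read_sql("""
    SELECT region, category, SUM(sales) AS total_sales
    FROM superstore
    GROUP BY region
""", conn)

print(len(loose))               # 1 — a single region group
print(loose['total_sales'][0])  # 150.0 — both categories summed together
```

The habit of listing every non-aggregated column in GROUP BY protects you in SQLite too, even though SQLite will not force you.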

    HAVING — Filtering Aggregated Results

    WHERE filters individual rows before aggregation. HAVING filters the aggregated results after GROUP BY runs. This distinction is critical and confuses many beginners.

    The simple rule:

    • Use WHERE to filter rows — before grouping
    • Use HAVING to filter groups — after grouping

    Syntax:

    SELECT column, AGG_FUNCTION(column)
    FROM table
    GROUP BY column
    HAVING AGG_FUNCTION(column) condition

    Filter Groups by Aggregate Value

    -- Which regions have total profit above $50,000?
    query("""
    
    SELECT
    region,
    ROUND(SUM(profit), 2) AS total_profit
    FROM superstore
    GROUP BY region
    HAVING SUM(profit) > 50000
    ORDER BY total_profit DESC
    
    """)

    You cannot use WHERE profit > 50000 here because at the time WHERE runs, the profit has not been summed yet. HAVING runs after the sum is calculated — that is why it can filter on it.
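You can verify the failure directly. This sketch (in-memory SQLite, illustrative rows) attempts the aggregate-in-WHERE version, catches the resulting error, and then runs the correct HAVING version:

```python
import sqlite3
import pandas as pd

conn = sqlite3.connect(':memory:')
conn.executescript("""
CREATE TABLE superstore (region TEXT, profit REAL);
INSERT INTO superstore VALUES ('West', 60000), ('East', 40000);
""")

# The aggregate does not exist yet when WHERE runs, so the query fails
try:
    pd.read_sql("""
        SELECT region, SUM(profit) AS total_profit
        FROM superstore
        WHERE SUM(profit) > 50000
        GROUP BY region
    """, conn)
    failed = False
except Exception:
    # SQLite reports a misuse-of-aggregate error, surfaced by pandas
    failed = True

print(failed)  # True

# The HAVING version runs fine
ok = pd.read_sql("""
    SELECT region, SUM(profit) AS total_profit
    FROM superstore
    GROUP BY region
    HAVING SUM(profit) > 50000
""", conn)
print(list(ok['region']))  # ['West']
```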

    WHERE and HAVING Together

    WHERE and HAVING can be used in the same query. WHERE filters the raw rows first, then GROUP BY aggregates, then HAVING filters the groups:

    -- Among Technology orders only, which sub-categories have total sales above $100,000?
    query("""
    
    SELECT
    sub_category,
    ROUND(SUM(sales), 2) AS total_sales,
    ROUND(SUM(profit), 2) AS total_profit
    FROM superstore
    WHERE category = 'Technology'
    GROUP BY sub_category
    HAVING SUM(sales) > 100000
    ORDER BY total_sales DESC
    
    """)

    Output:

    sub_category | total_sales | total_profit
    Phones       | 330007.05   | 44515.73
    Machines     | 189238.63   | 3384.76
    Copiers      | 149528.03   | 55617.82

    Read the logic out loud: “From all Technology orders, group by sub-category, show only groups with over $100,000 in sales, sorted by sales.” WHERE handles the category filter. HAVING handles the post-aggregation filter. Both in the same query.

    Practical Business Queries Using Aggregations

    Here are three complete queries that combine everything in this topic to answer real analyst questions.

    Which customer segment is most valuable?

    query("""
    
    SELECT
    segment,
    COUNT(*) AS total_orders,
    ROUND(SUM(sales), 2) AS total_revenue,
    ROUND(AVG(sales), 2) AS avg_order_value,
    ROUND(SUM(profit), 2) AS total_profit
    FROM superstore
    GROUP BY segment
    ORDER BY total_revenue DESC
    
    """)

    Which sub-categories are losing money overall?

    query("""
    
    SELECT
    sub_category,
    ROUND(SUM(profit), 2) AS total_profit,
    COUNT(*) AS total_orders
    FROM superstore
    GROUP BY sub_category
    HAVING SUM(profit) < 0
    ORDER BY total_profit ASC
    
    """)

    This query directly answers one of the most common business questions an analyst gets asked: which product lines are unprofitable and how bad is it?

    Top performing sub-categories by profit margin

    query("""
    
    SELECT
    sub_category,
    ROUND(SUM(sales), 2) AS total_sales,
    ROUND(SUM(profit), 2) AS total_profit,
    ROUND(SUM(profit) / SUM(sales) * 100, 1) AS profit_margin_pct
    FROM superstore
    GROUP BY sub_category
    ORDER BY profit_margin_pct DESC
    LIMIT 10
    
    """)

    Notice SUM(profit) / SUM(sales) * 100 — you can do arithmetic directly inside a SELECT using aggregate functions. This calculates the overall profit margin for each sub-category, not an average of individual margins, which gives a more accurate business figure.
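The difference between the two calculations is not academic. With two illustrative orders, the margin of totals and the average of per-order margins diverge sharply:

```python
import pandas as pd

orders = pd.DataFrame({
    'sales':  [10.0, 1000.0],
    'profit': [5.0,  100.0],
})

# Overall margin: total profit over total revenue
overall = orders['profit'].sum() / orders['sales'].sum() * 100

# Average of per-order margins: treats a $10 order like a $1000 one
per_order_avg = (orders['profit'] / orders['sales']).mean() * 100

print(round(overall, 1))        # 10.4
print(round(per_order_avg, 1))  # 30.0
```

The small order's 50% margin barely moves the business figure, but it dominates the naive average. Weighting by revenue, which SUM/SUM does automatically, is what the business question actually asks for.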

    WHERE vs HAVING — The Full Picture

    This distinction comes up so often it deserves its own summary:

     | WHERE | HAVING
    When it runs | Before GROUP BY | After GROUP BY
    What it filters | Individual rows | Aggregated groups
    Can use aggregate functions | No | Yes
    Example | WHERE region = 'West' | HAVING SUM(sales) > 10000
    Common use | Filter raw data before summarising | Keep only groups that meet a threshold

    A useful mental model: WHERE is a row-level gate. HAVING is a group-level gate. Data passes through WHERE first, gets grouped and aggregated, then passes through HAVING.

    Common Mistakes with Aggregations

    | Mistake | What Happens | Fix |
    |---|---|---|
    | Using WHERE instead of HAVING | Error or wrong results when filtering on aggregated values | Use HAVING for any condition involving SUM, COUNT, AVG, MIN, MAX |
    | Forgetting to GROUP BY all non-aggregated columns | SQL error in strict databases | Add every non-aggregated SELECT column to GROUP BY |
    | Not using ROUND on financial figures | Results show many decimal places | Wrap financial aggregations in ROUND(value, 2) |
    | Averaging row-level margins instead of dividing the sums | Small orders skew the margin figure | For overall margin use SUM(profit)/SUM(sales), not AVG(profit/sales) |
    | Using an aggregated alias in HAVING | Error in many databases (the alias is not yet available) | Repeat the full expression: HAVING SUM(sales) > 10000, not HAVING total_sales > 10000 |

    Practice Exercises

    Try these before moving to Topic 4. Each one maps to a real analyst scenario.

    1. What is the total revenue and total profit for each category? Sort by total profit descending.
    2. Which ship modes are used most frequently? Show the count of orders per ship mode.
    3. Find all sub-categories where the average order value is above $300.
    4. Which states have more than 100 orders? Show state name and order count.
    5. Among orders in the Consumer segment only, what is the total profit per category?

    Summary — What You Can Now Do

    • Use COUNT, SUM, AVG, MIN, and MAX to calculate summary statistics across an entire table
    • Apply GROUP BY to break aggregations down by one or multiple categories
    • Follow the GROUP BY rule — every non-aggregated SELECT column must appear in GROUP BY
    • Use HAVING to filter groups after aggregation
    • Combine WHERE and HAVING in the same query to filter both rows and groups
    • Write business-ready summary queries that answer questions about revenue, profit, and performance by segment

    Up next — Topic 4: JOINs

    So far every query has touched a single table. In Topic 4 you learn INNER JOIN and LEFT JOIN — how to combine data from two tables into one result. This is where SQL becomes genuinely powerful for real-world databases where customer data, order data, and product data all live separately.

  • How to Write Your First SQL Query: A Beginner’s Guide

    SQL Query Basics: SELECT, WHERE, ORDER BY, LIMIT and DISTINCT Explained with Examples

    Before writing complex analysis, you need to master five keywords: SELECT, WHERE, ORDER BY, LIMIT, and DISTINCT. These five alone let you answer most basic business questions from any database. Every query you write in this module — and most queries you will write in your career — is built on this foundation.

    Why These Five Keywords Matter

    When a data analyst joins a company, one of the first things they are asked to do is pull data. Not build a model. Not create a dashboard. Just answer a question: how many orders came in last month? Which region had the highest sales? Which customers haven’t ordered in 90 days?
    All of those questions are answered with the keywords in this topic. SELECT chooses what to show. FROM says where to look. WHERE filters to the relevant rows. ORDER BY sorts the results. LIMIT keeps things manageable. DISTINCT strips out duplicates. Master these and you can answer the majority of day-to-day analyst requests without writing a single line of Python.
    This is also why SQL appears in more data analyst job postings than any other technical skill. It is the universal language of data retrieval. Every database system — MySQL, PostgreSQL, SQL Server, BigQuery, Snowflake — understands these five keywords. Learn them once and they work everywhere.

    How SQL Thinks

    Every SQL query follows the same skeleton structure. Write it in this order every time:

    SELECT    -- what columns do you want?
    FROM      -- which table?
    WHERE     -- which rows? (filter — optional)
    ORDER BY  -- in what order? (sort — optional)
    LIMIT     -- how many rows? (cap — optional)

    You write queries top to bottom in this order. Internally the database reads FROM first, then WHERE, then SELECT — but you don’t need to worry about that yet. Write top to bottom and it works.
    Before starting, add this to the top of your notebook. It sets up your database connection and creates a helper function so every query is one clean line:

    import sqlite3
    import pandas as pd
    
    # Connect to your Superstore database from Topic 1
    conn = sqlite3.connect('superstore.db')
    
    # Helper — run any SQL query in one line
    def query(sql):
        return pd.read_sql(sql, conn)
    

    SELECT — Choosing Your Columns

    Syntax: SELECT column1, column2 FROM table_name
    SELECT tells the database which columns to return. FROM tells it which table to look in. These two always go together and appear in every query you will ever write.
    The asterisk * means “all columns”:

    -- Get everything from the superstore table
    query("""
    SELECT *
    FROM superstore
    """)

    Avoid SELECT * in real work. It retrieves every column including ones you don’t need, which slows things down on large tables. Name your columns explicitly.

    In practice, always specify only the columns you need:

    -- Show order dates, regions, and financials only
    query("""
    SELECT order_date, region, category, sales, profit
    FROM superstore
    """)

    Expected output (first 5 rows):

    | order_date | region | category | sales | profit |
    |---|---|---|---|---|
    | 2020-11-08 | South | Furniture | 261.96 | 41.91 |
    | 2020-11-08 | South | Office Supplies | 731.94 | 219.58 |
    | 2020-06-12 | West | Office Supplies | 14.62 | 6.87 |
    | 2020-06-12 | West | Technology | 957.58 | -383.03 |
    | 2020-06-12 | West | Furniture | 22.37 | 2.52 |

    Renaming Columns with AS

    You can rename any column in your output using AS. This is useful when column names are long, unclear, or when you want your output readable for a non-technical audience. The original column name in the database stays unchanged — AS only affects what comes back to you:

    query("""
    SELECT
        order_id AS order_number,
        customer_name AS customer,
        sales AS revenue,
        profit AS net_profit
    FROM superstore
    """)

    Note the alias is order_number rather than order — ORDER is a reserved word in SQL, so it cannot be used as a bare alias.

    A practical use case: if you are preparing output to paste into a report or share with a manager, clean column names matter. net_profit reads better than profit in a business context, and revenue is more universally understood than sales depending on your company’s terminology.

    LIMIT — Controlling How Many Rows Come Back

    Syntax: SELECT … LIMIT n

    LIMIT caps the number of rows returned. Always use it when exploring a large table — you never want to accidentally load tens of thousands of rows into a notebook.

    -- Quick preview — return only the first 10 rows
    query("""
    SELECT order_date, customer_name, sales
    FROM superstore
    LIMIT 10
    """)

    Good habit: Start every exploration with LIMIT 10. Understand what you’re looking at. Then remove the limit when you’re ready to work with the full dataset.

    LIMIT also has an important use in combination with ORDER BY — which you will see later in this topic. When you sort by a value and apply a limit, you get a clean top-N or bottom-N result. For example, the top 10 most profitable orders, or the 5 most recent transactions. This pattern appears constantly in real analyst work.

    DISTINCT — Finding Unique Values

    Syntax: SELECT DISTINCT column FROM table

    DISTINCT removes duplicate values and returns only unique entries. Use it routinely when you first encounter a new dataset to understand what values exist in a column.

    -- What regions exist in the dataset?
    query("""
    SELECT DISTINCT region
    FROM superstore
    """)

    Output:

    region
    South
    West
    Central
    East

    You can also apply DISTINCT across multiple columns to see unique combinations:

    -- What unique category and sub-category combinations exist?
    query("""
    SELECT DISTINCT category, sub_category
    FROM superstore
    ORDER BY category, sub_category
    """)

    When DISTINCT spans multiple columns it returns unique combinations, not unique values per column individually.
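
A quick way to convince yourself of that is a throwaway in-memory table (hypothetical rows, not the real data): four rows collapse to three unique combinations, because only rows that are identical across both columns are removed.

```python
import sqlite3

import pandas as pd

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE superstore (category TEXT, sub_category TEXT)")
conn.executemany("INSERT INTO superstore VALUES (?, ?)", [
    ("Furniture", "Chairs"),
    ("Furniture", "Chairs"),   # exact duplicate pair -- collapsed by DISTINCT
    ("Furniture", "Tables"),
    ("Technology", "Phones"),
])

df = pd.read_sql("""
SELECT DISTINCT category, sub_category
FROM superstore
ORDER BY category, sub_category
""", conn)

print(df)  # three unique (category, sub_category) pairs
```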


    Why DISTINCT Is More Useful Than It Looks

    New analysts often underestimate DISTINCT. Beyond simple column exploration, it helps you catch data quality problems early. If a column that should only have four region values suddenly shows five — including a misspelling like “Westt” — DISTINCT will surface it immediately. Use it as a data validation step every time you work with a new table.

    -- Validate ship mode values — should only be 4 options
    query("""
    SELECT DISTINCT ship_mode
    FROM superstore
    ORDER BY ship_mode
    """)

    If you see unexpected values, that is a data quality issue worth flagging before any analysis begins.

    WHERE — Filtering Rows

    Syntax: SELECT … FROM … WHERE condition

    WHERE is the most powerful basic clause in SQL. It filters rows so only the ones matching your condition come back. Everything else is excluded.


    WHERE Operators

    | Operator | What It Does | Example |
    |---|---|---|
    | = | Equals | region = 'West' |
    | != or <> | Not equals | region != 'West' |
    | > / < | Greater / less than | sales > 1000 |
    | BETWEEN | Range, inclusive | sales BETWEEN 100 AND 500 |
    | LIKE | Pattern match | customer_name LIKE 'A%' |
    | IN | Matches a list | region IN ('East', 'West') |
    | IS NULL | Value is missing | postal_code IS NULL |
    | AND | Both conditions true | region = 'West' AND sales > 500 |
    | OR | Either condition true | region = 'West' OR region = 'East' |

    Important: Text values always need single quotes — 'West', not West. Numbers do not — sales > 500, not sales > '500'. This is one of the most common beginner errors.
    Basic Filter Examples

    -- All orders from the West region
    query("""
    SELECT order_id, customer_name, region, sales
    FROM superstore
    WHERE region = 'West'
    """)

    -- High-value orders only — sales above $1,000
    query("""
    SELECT order_id, customer_name, sales, profit
    FROM superstore
    WHERE sales > 1000
    LIMIT 10
    """)

    Using BETWEEN for Ranges

    BETWEEN is cleaner than writing two separate conditions when you are filtering a range. Both the lower and upper values are inclusive:

    -- Orders with sales between $500 and $1,000
    query("""
    SELECT order_id, customer_name, sales, profit
    FROM superstore
    WHERE sales BETWEEN 500 AND 1000
    ORDER BY sales DESC
    """)

    This is identical to writing WHERE sales >= 500 AND sales <= 1000 but easier to read, especially when shared with others.
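
The inclusive boundaries are easy to verify with a throwaway in-memory table (invented order IDs and amounts): rows sitting exactly on 500 and 1000 are kept, while rows a cent outside the range are not.

```python
import sqlite3

import pandas as pd

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE superstore (order_id TEXT, sales REAL)")
conn.executemany("INSERT INTO superstore VALUES (?, ?)", [
    ("A", 499.99),   # just below the lower bound -- excluded
    ("B", 500.00),   # exactly on the lower bound -- included
    ("C", 750.00),   # inside the range -- included
    ("D", 1000.00),  # exactly on the upper bound -- included
    ("E", 1000.01),  # just above the upper bound -- excluded
])

df = pd.read_sql("""
SELECT order_id, sales
FROM superstore
WHERE sales BETWEEN 500 AND 1000
ORDER BY sales
""", conn)

print(list(df["order_id"]))  # ['B', 'C', 'D']
```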


    Combining Conditions with AND and OR

    AND means both conditions must be true. OR means at least one must be true.

    -- High-value orders specifically in the West region
    query("""
    SELECT order_id, customer_name, region, sales
    FROM superstore
    WHERE region = 'West'
    AND sales > 1000
    """)

    Use IN instead of multiple OR conditions — it’s cleaner and easier to read:

    -- Orders from East or West
    query("""
    SELECT order_id, region, sales
    FROM superstore
    WHERE region IN ('East', 'West')
    """)

    Pattern Matching with LIKE

    The % wildcard means “zero or more of any character.” The _ wildcard means exactly one character. Here is how both work in practice:

    -- Customers whose name starts with 'A'
    query("""
    SELECT DISTINCT customer_name, segment
    FROM superstore
    WHERE customer_name LIKE 'A%'
    ORDER BY customer_name
    """)

    -- Find any sub-category containing the word 'Chair'
    query("""
    SELECT DISTINCT sub_category
    FROM superstore
    WHERE sub_category LIKE '%Chair%'
    """)

    -- Find products where the second letter is 'a'
    -- _ means exactly one character
    query("""
    SELECT DISTINCT sub_category
    FROM superstore
    WHERE sub_category LIKE '_a%'
    """)

    Case sensitivity note: SQLite’s LIKE is case-insensitive for ASCII letters by default, so LIKE 'west' will still match 'West'. However, this varies between database systems. In PostgreSQL, LIKE is case-sensitive. Build the habit of matching case correctly so your queries work everywhere.
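
You can check SQLite's default behaviour in one line, no table needed (a LIKE expression in a bare SELECT evaluates to 1 for a match and 0 otherwise):

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# SQLite's LIKE ignores ASCII case by default, so the lowercase pattern matches
insensitive = conn.execute("SELECT 'West' LIKE 'west'").fetchone()[0]
print(insensitive)  # 1

# Wrapping the value in LOWER() makes the intent explicit and portable
# to case-sensitive systems such as PostgreSQL
portable = conn.execute("SELECT LOWER('West') LIKE 'west'").fetchone()[0]
print(portable)  # 1
```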


    Filtering for Missing Values with IS NULL

    NULL in SQL means a value is missing or unknown. You cannot check for NULL using the equals operator — you must use IS NULL or IS NOT NULL:

    -- Find rows where postal code is missing
    query("""
    SELECT order_id, customer_name, city, postal_code
    FROM superstore
    WHERE postal_code IS NULL
    """)

    -- Find all rows that DO have a postal code
    query("""
    SELECT order_id, customer_name, city, postal_code
    FROM superstore
    WHERE postal_code IS NOT NULL
    LIMIT 10
    """)

    NULL handling trips up beginners constantly.
    Remember: = NULL never works. Always use IS NULL.
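
The sketch below shows the failure mode against a throwaway in-memory table (invented rows): the = NULL version silently returns zero rows rather than raising an error, which is exactly what makes this mistake so easy to miss.

```python
import sqlite3

import pandas as pd

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE superstore (order_id TEXT, postal_code TEXT)")
conn.executemany("INSERT INTO superstore VALUES (?, ?)", [
    ("A", "94110"),
    ("B", None),  # missing postal code, stored as NULL
])

# NULL = NULL evaluates to NULL, never TRUE, so this matches nothing
wrong = pd.read_sql("SELECT * FROM superstore WHERE postal_code = NULL", conn)

# IS NULL is the correct test for missing values
right = pd.read_sql("SELECT * FROM superstore WHERE postal_code IS NULL", conn)

print(len(wrong), len(right))  # 0 1
```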

    Finding Loss-Making Orders

    -- Which orders lost money? Sorted worst first.
    query("""
    SELECT order_id, customer_name, category, sales, profit
    FROM superstore
    WHERE profit < 0
    ORDER BY profit ASC
    LIMIT 10
    """)

    Output (top 5 shown):

    | order_id | customer_name | category | sales | profit |
    |---|---|---|---|---|
    | CA-2019-138422 | Cindy Stewart | Technology | 4199.93 | -1199.97 |
    | CA-2021-144568 | Zuschuss Donatelli | Technology | 3199.95 | -1079.98 |
    | US-2020-108966 | Bill Donatelli | Furniture | 2799.96 | -839.98 |
    | CA-2020-160415 | Erin Smith | Technology | 2799.96 | -839.98 |
    | CA-2021-119549 | Dave Brooks | Furniture | 2799.96 | -699.98 |

    ORDER BY — Sorting Results

    Syntax: ORDER BY column ASC | DESC

    ORDER BY sorts the output rows. ASC is ascending (smallest to largest, A to Z) and is the default. DESC is descending (largest to smallest, Z to A).

    -- Top 10 highest-value orders
    query("""
    SELECT order_id, customer_name, region, sales
    FROM superstore
    ORDER BY sales DESC
    LIMIT 10
    """)

    You can sort by multiple columns. The database sorts by the first column, then breaks ties using the second:

    -- Sort by region alphabetically, then by sales within each region
    query("""
    SELECT region, category, customer_name, sales
    FROM superstore
    ORDER BY region ASC, sales DESC
    LIMIT 20
    """)

    Putting It All Together

    Here is one query using all five keywords to answer a specific business question:
    Question: What are the top 5 most profitable Technology orders in the West region?

    query("""
    SELECT
        order_id,
        customer_name,
        sub_category,
        sales,
        profit
    FROM superstore
    WHERE category = 'Technology'
    AND region = 'West'
    AND profit > 0
    ORDER BY profit DESC
    LIMIT 5
    """)

    Output:

    | order_id | customer_name | sub_category | sales | profit |
    |---|---|---|---|---|
    | CA-2021-145317 | Ken Black | Copiers | 3299.98 | 1187.99 |
    | CA-2020-127180 | Anne McFarland | Copiers | 2799.98 | 1007.99 |
    | CA-2021-162688 | Phillip Schmitt | Phones | 1919.97 | 614.39 |
    | CA-2020-100294 | Lena Creighton | Phones | 1679.97 | 537.59 |
    | CA-2019-118977 | Sung Shields | Phones | 1439.97 | 460.79 |

    Read that query out loud: “Select these columns from Superstore, where category is Technology and region is West and profit is positive, sorted by profit descending, top 5 only.” That maps word-for-word to the business question. This is the core skill of SQL — translating a plain English question into precise syntax.

    Common Beginner Mistakes

    | Mistake | Wrong | Right |
    |---|---|---|
    | Quotes on text | WHERE region = West | WHERE region = 'West' |
    | NULL comparison | WHERE postal_code = NULL | WHERE postal_code IS NULL |
    | Alias in WHERE | WHERE revenue > 500 (alias) | WHERE sales > 500 (original name) |
    | AND vs OR confusion | WHERE region = 'West' AND region = 'East' returns zero rows | If a column can't be two values at once, use OR or IN |
    | LIKE case sensitivity | LIKE 'west' misses 'West' in some databases | Match case exactly or use LOWER(region) LIKE 'west' |
    | ORDER BY without LIMIT | Loads the entire table sorted | Add LIMIT n when you only need top or bottom rows |

    Practice Exercises

    Try writing these queries yourself against your Superstore database before looking at the answers. These are the kinds of questions a real manager might ask.

    1. Show all orders from the Furniture category with sales greater than $500, sorted by sales descending.
    2. How many unique customer names are in the dataset? (Hint: use DISTINCT then count the rows in Python)
    3. Find all orders where the ship mode is 'First Class' and the region is 'East'.
    4. Show the 10 most recent orders by order date.
    5. Find all orders where profit is between -$100 and $0 — orders that lost a small amount. Sort by profit ascending.


    Writing these yourself — even if you need to look back at the examples — is more valuable than reading the topic a second time. Muscle memory in SQL comes from typing queries, not from reading them.

    Summary — What You Can Now Do

    By the end of this topic you should be able to write all of these from memory:

    • Select specific columns from a table and rename them with AS
    • Use DISTINCT to find unique values and catch data quality issues early
    • Use LIMIT to safely cap results during exploration
    • Filter rows with WHERE using =, >, <, BETWEEN, IN, LIKE, IS NULL, and IS NOT NULL
    • Use % and _ wildcards in LIKE for pattern matching
    • Combine multiple filters with AND and OR
    • Sort results with ORDER BY across one or multiple columns
    • Write a complete query combining all five keywords to answer a specific business question

    Up next — Topic 3: Aggregations

    COUNT, SUM, AVG, MIN, MAX, GROUP BY, and HAVING. This is where SQL stops working with individual rows and starts answering questions like “what is total revenue per region?” and “which category has the highest average order value?” That is where the real business analysis begins.