SQL Mini Project: Analyze the Superstore Database Using SQL and Python

What This Project Is

You have completed all six topics in Module 2. You can query a database, filter and sort rows, aggregate data, join tables, write subqueries, and connect SQL to Python. This mini project puts all of that together in one end-to-end deliverable.
This is not a guided tutorial. There are no step-by-step instructions telling you which functions to use. The five business questions below are the kind you would genuinely receive in an entry-level data role — and your job is to answer them using everything you have learned.
By the end you will have a completed Jupyter notebook, a written findings brief, and a GitHub repository — three things you can point to directly when applying for data roles.

The Scenario

You have just joined a retail company as a junior data professional. It is your first week. Your manager has sent you an email:

“Hey — I’ve given you access to our sales database. Before our Friday meeting I’d love to get your take on five questions I’ve been sitting on for a while. Nothing fancy — just pull the numbers and tell me what you find. A short write-up is fine.”

The five questions are below. The database is the Superstore SQLite database you have been working with throughout Module 2.

The Five Business Questions

Answer each question using SQL. Pull the result into a pandas DataFrame. Then write one to three sentences in plain English summarising what the data tells you.

Question 1 — Regional Performance

Which region generates the most total revenue and which generates the most total profit? Are they the same region? If not, what does that tell you?

Question 2 — Product Profitability

Which three sub-categories have the highest total profit and which three have the lowest? Are any sub-categories losing money overall?

Question 3 — Customer Value

Who are the top 10 customers by total revenue? For each of those customers, what is their profit margin? Are your highest-revenue customers also your most profitable ones?

Question 4 — Loss-Making Orders

What percentage of all orders are loss-making (profit below zero)? Which category has the highest proportion of loss-making orders? Which region?

Question 5 — Shipping and Profitability

Does ship mode affect profitability? Show average profit and average sales for each ship mode. Is there a pattern worth flagging to the business?

Deliverables

Submit three things when you complete this project:

  1. Jupyter Notebook
    One clean notebook containing all your SQL queries, pandas code, and output. Structure it with a markdown cell before each question stating the business question, followed by your code and output. Name it Module2_MiniProject.ipynb.
  2. Written Findings Brief
    A short document — five short paragraphs, one per question — written in plain English as if you are sending it to your manager. No code. No jargon. Just what the data shows and why it matters. Aim for 150 to 200 words total. Name it Module2_Findings_Brief.md.
  3. GitHub Repository
    Push both files to a public GitHub repository. Name it superstore-sql-analysis. Include a README that describes the project in two to three sentences, lists the tools used, and explains how to run the notebook.

Technical Requirements

Your notebook must meet all of the following:

  • All data retrieved using SQL queries against the Superstore SQLite database — no loading the raw CSV directly
  • At least three queries must use JOIN across two or more tables
  • At least one query must use a subquery
  • At least one query must use GROUP BY with HAVING
  • All results pulled into pandas DataFrames using pd.read_sql()
  • Connection opened once at the top and closed once at the bottom
  • All SQL queries use parameterised values where a variable is involved
  • Column names in output are readable — use AS aliases where needed
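The requirements above can be seen in miniature in the sketch below. It uses a toy in-memory table — the table and column names here are illustrative, not the real Superstore schema — but it shows the shape of a compliant cell: one connection opened at the top, `pd.read_sql()` with a parameterised value, `AS` aliases for readable column names, and the connection closed at the end.

```python
import sqlite3
import pandas as pd

# Toy in-memory database standing in for the Superstore SQLite file.
# Table and column names are illustrative only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (region TEXT, sales REAL, profit REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [("West", 100.0, 20.0), ("East", 80.0, -5.0), ("West", 50.0, 10.0)],
)

# Parameterised query: the region value goes in via params, never
# interpolated into the SQL string. AS aliases keep the output readable.
region = "West"
df = pd.read_sql(
    """
    SELECT region,
           SUM(sales)  AS total_sales,
           SUM(profit) AS total_profit
    FROM orders
    WHERE region = ?
    GROUP BY region
    """,
    conn,
    params=(region,),
)
print(df)

conn.close()  # opened once at the top, closed once at the bottom
```

In your actual notebook the connection stays open across all five questions and is closed in the final cell, rather than per query as in this compressed sketch.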

Notebook Structure

Set your notebook up in this order:

1. Title cell — project name, your name, date
2. Setup cell — imports and database connection
3. Question 1 — markdown heading + SQL query + DataFrame output + written insight
4. Question 2 — markdown heading + SQL query + DataFrame output + written insight
5. Question 3 — markdown heading + SQL query + DataFrame output + written insight
6. Question 4 — markdown heading + SQL query + DataFrame output + written insight
7. Question 5 — markdown heading + SQL query + DataFrame output + written insight
8. Summary cell — three to five overall takeaways in plain English
9. Close connection

Every question cell should follow this pattern:

# ── QUESTION 1: Regional Performance ─────────────────────

q1 = pd.read_sql("""
    -- Your SQL query here
""", conn)

q1

Followed by a markdown cell with your plain-English insight.

Hints — Read Only If Stuck

These are directional hints only — not solutions.

Question 1: GROUP BY region with SUM for both sales and profit. Think about why the most profitable region might not be the highest revenue region — margin matters.

Question 2: GROUP BY sub_category with SUM(profit). Sort both ascending and descending. Use HAVING to isolate sub-categories where total profit is negative.
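If the HAVING mechanics are unfamiliar, here is the pattern on a toy table (deliberately not the Superstore schema, so it stays a hint rather than a solution): the HAVING clause filters whole groups after aggregation, keeping only sub-categories whose total profit is negative.

```python
import sqlite3
import pandas as pd

# Toy table: two sub-categories, one profitable overall, one not.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE items (sub_category TEXT, profit REAL)")
conn.executemany(
    "INSERT INTO items VALUES (?, ?)",
    [("Tables", -50.0), ("Tables", -20.0), ("Chairs", 90.0), ("Chairs", -10.0)],
)

# HAVING filters groups after SUM runs; WHERE could not do this.
losers = pd.read_sql(
    """
    SELECT sub_category, SUM(profit) AS total_profit
    FROM items
    GROUP BY sub_category
    HAVING SUM(profit) < 0
    ORDER BY total_profit ASC
    """,
    conn,
)
print(losers)  # only Tables survives the HAVING filter
conn.close()
```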

Question 3: JOIN orders, customers, and order_items. GROUP BY customer name. Calculate profit margin as SUM(profit) / SUM(sales) * 100. Think about what a high-revenue but low-margin customer means for the business.
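One mechanical trap worth knowing for the margin calculation: if the columns happen to be stored as integers, SQLite performs integer division. Writing the constant as `100.0` forces a float result. The toy table below (again, illustrative names, not the real schema) shows the pattern and why top revenue and top margin can diverge.

```python
import sqlite3
import pandas as pd

# Toy table with INTEGER columns to demonstrate the division pitfall.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (customer TEXT, sales INTEGER, profit INTEGER)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [("Ann", 200, 50), ("Ann", 100, 10), ("Bob", 400, 20)],
)

# Multiplying by 100.0 (not 100) guarantees float division in SQLite.
margins = pd.read_sql(
    """
    SELECT customer,
           SUM(sales) AS total_sales,
           ROUND(SUM(profit) * 100.0 / SUM(sales), 1) AS margin_pct
    FROM sales
    GROUP BY customer
    ORDER BY total_sales DESC
    """,
    conn,
)
print(margins)  # Bob leads on revenue; Ann leads on margin
conn.close()
```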

Question 4: Use a subquery or a CASE statement to flag loss-making orders. Calculate percentage in pandas after pulling the counts. GROUP BY category and region separately to find the worst offenders.
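The CASE-flag pattern the hint mentions looks like this on a toy table (illustrative names only): CASE turns each row into a 0/1 flag, SUM counts the flagged rows per group, and the percentage is then computed in pandas, as suggested.

```python
import sqlite3
import pandas as pd

# Toy table: some rows profitable, some loss-making.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (category TEXT, profit REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [("Furniture", -5.0), ("Furniture", 12.0),
     ("Technology", 8.0), ("Technology", 3.0)],
)

# CASE yields 1 for loss-making rows, 0 otherwise; SUM counts them.
flags = pd.read_sql(
    """
    SELECT category,
           SUM(CASE WHEN profit < 0 THEN 1 ELSE 0 END) AS loss_orders,
           COUNT(*) AS all_orders
    FROM orders
    GROUP BY category
    """,
    conn,
)

# Percentage computed in pandas after pulling the counts.
flags["loss_pct"] = flags["loss_orders"] / flags["all_orders"] * 100
print(flags)
conn.close()
```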

Question 5: GROUP BY ship_mode. Use AVG for both sales and profit. Think about whether causation is implied — does ship mode cause profitability differences or just correlate with them?

Evaluation Criteria

Your project will be assessed on four dimensions:

Correctness — Do your SQL queries return accurate results? Are joins and aggregations logically sound?

Clarity — Is your notebook clean and readable? Would a colleague understand your work without asking you to explain it?

Insight — Do your written findings go beyond restating the numbers? Does your brief say something meaningful about the business?

Craft — Are column names clean? Is the connection managed properly? Are queries well formatted and commented?

Sharing Your Work

When your project is complete:
• Post your GitHub link in the course community forum
• Write one sentence about the most surprising thing you found in the data
• Review one other student’s project and leave a comment on their findings brief

Looking at how others approached the same five questions is one of the most effective ways to deepen your SQL intuition. There is rarely one right query — seeing different approaches to the same problem is genuinely instructive.

Up next — Module 3: Data Cleaning and Wrangling

Module 3 moves back into Python full time. You will learn how to take messy, real-world data — missing values, wrong data types, duplicates, inconsistent categories — and turn it into a clean, analysis-ready dataset. The skills in Module 3 are what separate someone who can analyse clean data from someone who can handle data the way it actually arrives in the real world.
