The Tool You Can’t Avoid
You’ve just spent Module 1 loading a CSV file into pandas and analysing it in Python. That felt powerful — and it was. But here’s something most beginner data courses don’t tell you upfront: in most real companies, data doesn’t live in CSV files.
It lives in databases. Structured, relational, often enormous databases — containing millions of rows spread across dozens of connected tables. Before a data analyst can do anything with that data, they need to query it. And the language used to query relational databases is SQL — Structured Query Language.
SQL has been around since the 1970s. It has survived every major technology shift since then — the rise of the internet, the cloud, big data, machine learning, and AI. In 2024, SQL still appears in more data analyst job postings than any other technical skill, including Python. That kind of longevity is not an accident.
In most analytics workflows, SQL is where data gets retrieved. Python is where it gets transformed and visualised. Excel is where it gets presented. Understanding where each tool starts and stops is the difference between a junior analyst and a confident one.
Core principle
This topic has three goals. First, give you a clear mental model for when to use SQL versus Excel versus Python. Second, explain how relational databases are structured so that SQL queries make intuitive sense. Third, get your local SQL environment set up using the same Superstore dataset from Module 1 — so you’re coding, not just reading.
THE THREE TOOLS
SQL vs Excel vs Python
Most people entering data analytics already know Excel. Many have started learning Python. SQL often seems like yet a third thing to learn — and that can feel overwhelming. The good news is that these three tools are not competitors. They are complements. Each one is exceptional at specific tasks and weak at others.
Here is a side-by-side comparison of the three tools:
| Aspect | SQL | Python (pandas) | Excel |
|---|---|---|---|
| Primary Purpose | Data extraction and querying | Data analysis, transformation, automation | Quick analysis and reporting |
| Best Use Case | Working with large databases | Complex data processing and advanced analysis | Small to medium datasets, business reporting |
| Data Size Handling | Excellent (millions of rows) | Very good (depends on memory) | Limited (can slow/crash on large data) |
| Ease of Learning | Easy to start, logical syntax | Moderate (requires programming basics) | Very easy (beginner-friendly UI) |
| Performance | Very fast (optimized databases) | Fast, but depends on code efficiency | Slower with large datasets |
| Data Source | Directly connects to databases | Works with files, APIs, databases | Mostly local files (Excel, CSV) |
| Data Cleaning | Basic | Advanced and flexible | Manual and limited |
| Automation | Limited | Strong automation capabilities | Very limited |
| Visualization | Not supported (basic output only) | Strong (Matplotlib, Seaborn, etc.) | Built-in charts and dashboards |
| Scalability | High | High (with proper setup) | Low |
| Real-World Role | Extract and prepare data | Analyze and model data | Present and share insights |
| Dependency | Independent (data source tool) | Often depends on SQL for data | Often depends on exported data |
| Industry Usage | Mandatory for analysts | Highly preferred | Widely used for reporting |
Simple Takeaway
Instead of choosing one tool over another, think of them as a workflow:
SQL → Get the data
Python → Analyze the data
Excel → Present or quickly explore
Here is how to think about them:
SQL (The retrieval layer)
Best for querying large databases, joining tables, filtering and aggregating millions of rows, and extracting exactly the data you need before analysis begins.
Python (The analysis layer)
Best for complex data transformation, statistical analysis, visualisation, machine learning, and building repeatable automated workflows.
Excel (The presentation layer)
Best for sharing results with non-technical stakeholders, building simple models, formatting reports, and quick one-off calculations. Most business users live here.
The key insight is that a professional data analyst workflow often uses all three. SQL pulls the data from a database. Python cleans, transforms, and analyses it. Excel or a dashboard tool presents the final result to stakeholders. You are not choosing between them — you are learning to use the right one at the right stage.
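The three-stage workflow can be sketched end to end in a few lines. This is an illustrative sketch, not the course exercise: it seeds an in-memory stand-in table (in the real workflow the data already lives in `superstore.db`, which you build in the Setup Lab), and the Excel handoff line is shown commented out because it needs the `openpyxl` package.

```python
import sqlite3
import pandas as pd

# Stand-in table; in practice this data already lives in superstore.db
conn = sqlite3.connect(':memory:')
conn.execute("CREATE TABLE superstore (region TEXT, sales REAL, profit REAL)")
conn.executemany("INSERT INTO superstore VALUES (?, ?, ?)",
                 [('West', 100.0, 20.0), ('East', 80.0, 10.0), ('West', 50.0, 5.0)])

# Stage 1 (SQL): extract exactly the slice you need
df = pd.read_sql("SELECT region, sales, profit FROM superstore", conn)
conn.close()

# Stage 2 (Python): transform and aggregate
summary = df.groupby('region')[['sales', 'profit']].sum()

# Stage 3 (Excel): hand the result to stakeholders
# summary.to_excel('regional_summary.xlsx')  # uncomment once openpyxl is installed
print(summary)
```

Each stage hands a table to the next, which is why the three tools complement rather than replace each other.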
When to Use What — Real Scenarios
Abstract descriptions only go so far. Here is a practical breakdown of common analyst tasks and which tool wins for each:
| TASK | BEST TOOL | WHY |
|---|---|---|
| Pull last 3 months of orders for one region | SQL | Filtering a live database by date and region is a native SQL operation |
| Calculate profit margin across 50k rows | Python | Vectorised NumPy operations handle this faster and more flexibly |
| Build a monthly revenue summary for your manager | Excel | Non-technical stakeholders can view, filter, and share it without any tools |
| Join customer table with orders table to find repeat buyers | SQL | JOINs are SQL’s core strength — doing this in Excel is painful and error-prone |
| Build a churn prediction model | Python | scikit-learn, pandas, and model validation tools all live in Python |
| Quick sanity check on a 500-row dataset | Excel | Fastest tool for visual inspection of small, already-exported data |
| Automate a weekly report that pulls fresh data | SQL + Python | SQL queries the database, Python formats and emails the report |
📌 RULE OF THUMB
If the data is already in front of you (a CSV, a DataFrame), work in Python. If the data lives in a database and you need to extract a specific slice of it, start with SQL. If you need to share a result with someone who doesn’t code, move to Excel or a dashboard.
HOW COMPANIES STORE DATA
Relational Databases — The Conceptual Model
When you worked with the Superstore dataset in Module 1, everything was in one flat CSV file — all columns side by side in a single table. That is convenient for learning, but it is not how production data works.
Real companies store data in relational databases — systems that split information across multiple connected tables. Instead of repeating a customer’s name and address on every order they place, a relational database stores the customer details once in a customers table and links each order to the customer via a shared ID.
This approach — called normalisation — reduces duplication, prevents inconsistencies, and makes large datasets much faster to query. Understanding it conceptually is all you need at this stage. Here is what it looks like with Superstore data:
superstore_db — Simplified Schema
Table: orders
| Column | Type | Key |
|---|---|---|
| order_id | TEXT | PK |
| customer_id | TEXT | FK → customers |
| order_date | DATE | |
| ship_date | DATE | |
| ship_mode | TEXT | |
| region | TEXT | |
| segment | TEXT | |
Table: customers
| Column | Type | Key |
|---|---|---|
| customer_id | TEXT | PK |
| customer_name | TEXT | |
| segment | TEXT | |
| city | TEXT | |
| state | TEXT | |
| country | TEXT | |
Table: products
| Column | Type | Key |
|---|---|---|
| product_id | TEXT | PK |
| product_name | TEXT | |
| category | TEXT | |
| sub_category | TEXT | |
Table: order_items
| Column | Type | Key |
|---|---|---|
| item_id | INT | PK |
| order_id | TEXT | FK → orders |
| product_id | TEXT | FK → products |
| sales | REAL | |
| quantity | INT | |
| discount | REAL | |
| profit | REAL | |
Relationships
∙ orders.customer_id → customers.customer_id
∙ order_items.order_id → orders.order_id
∙ order_items.product_id → products.product_id
Note: This is a normalised version of the flat Superstore CSV — split into 4 linked tables. In Module 1 you worked with it as one flat file. In this module you’ll query it as a real relational database using JOINs to reconnect the tables.
Three terms worth knowing at this stage:
Primary Key (PK) — a unique identifier for each row in a table. In the orders table, order_id is the primary key. No two rows can have the same value.
Foreign Key (FK) — a column that references the primary key of another table. customer_id in the orders table points to customer_id in the customers table. This is how tables are linked.
Schema — the overall structure of a database: its tables, columns, data types, and how they relate. When a colleague says “check the schema,” they mean look at this blueprint.
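All three terms show up directly in SQL's table-definition syntax. Here is a minimal illustrative sketch using Python's built-in sqlite3 module (a trimmed-down version of the schema above, not the course database itself):

```python
import sqlite3

# Two linked tables: a primary key in customers, a foreign key in orders
conn = sqlite3.connect(':memory:')
conn.executescript("""
CREATE TABLE customers (
    customer_id   TEXT PRIMARY KEY,   -- PK: unique per row
    customer_name TEXT
);
CREATE TABLE orders (
    order_id    TEXT PRIMARY KEY,
    customer_id TEXT,
    FOREIGN KEY (customer_id) REFERENCES customers (customer_id)  -- FK: link to customers
);
""")

# The schema itself is stored as queryable metadata
tables = [row[0] for row in
          conn.execute("SELECT name FROM sqlite_master WHERE type='table'")]
print(sorted(tables))
```

Notice that the database records the relationships in its own system table (`sqlite_master`), which is exactly what "check the schema" means in practice.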
You don’t need to design databases at this stage. You just need to understand that when you write a SQL query, you are asking a structured question against a system that looks like this — and the answer comes back as a table you can then work with in Python.
💡 WHY THIS MATTERS FOR YOUR QUERIES
Because data is split across tables, getting a complete picture often means combining tables. A query asking “show me all orders placed by customers in New York” needs to look in both the `orders` table and the `customers` table. That is what JOINs are for — covered in Topic 4.
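As a preview (JOIN syntax is covered properly in Topic 4), here is what that New York question looks like as a query. The two tables below are tiny stand-ins mirroring the orders/customers split, not the real Superstore data:

```python
import sqlite3
import pandas as pd

# Tiny stand-in tables mirroring the orders/customers split
conn = sqlite3.connect(':memory:')
conn.execute("CREATE TABLE customers (customer_id TEXT, customer_name TEXT, city TEXT)")
conn.execute("CREATE TABLE orders (order_id TEXT, customer_id TEXT)")
conn.executemany("INSERT INTO customers VALUES (?, ?, ?)",
                 [('C1', 'Ana', 'New York'), ('C2', 'Ben', 'Boston')])
conn.executemany("INSERT INTO orders VALUES (?, ?)",
                 [('O1', 'C1'), ('O2', 'C2'), ('O3', 'C1')])

# "All orders placed by customers in New York" needs both tables
query = """
SELECT o.order_id, c.customer_name, c.city
FROM orders AS o
JOIN customers AS c ON o.customer_id = c.customer_id
WHERE c.city = 'New York'
"""
ny_orders = pd.read_sql(query, conn)
print(ny_orders)
```

The shared `customer_id` column is what lets the query reconnect what normalisation split apart.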
SETUP LAB
Setting Up SQLite + Converting Superstore to a Database
For this module we are using SQLite — a lightweight, file-based database that requires zero server setup and works directly inside Python. It is the perfect SQL learning environment because you can get started in under five minutes with no installation beyond what you already have.
Better still — we are converting the Superstore CSV from Module 1 into a SQLite database. You already know this dataset. The columns, the business context, the quirks. This means you can focus entirely on learning SQL syntax instead of learning new data at the same time.
1 Confirm your setup
SQLite comes built into Python’s standard library — no pip install needed. Confirm it’s available by running this in a new notebook cell:
```python
import sqlite3
import pandas as pd

# Confirm the sqlite3 version
print("SQLite version:", sqlite3.sqlite_version)
print("Ready to go!")
```
2 Load the Superstore CSV and convert to SQLite
This script reads your CSV, creates a SQLite database file, and writes the data into it as a table called superstore. Run it once — it creates a file called superstore.db that you’ll use throughout this module.
```python
import sqlite3
import pandas as pd

# Load the CSV you used in Module 1
df = pd.read_csv('superstore_sales.csv')

# Clean column names — replace spaces with underscores
df.columns = [col.strip().replace(' ', '_').lower() for col in df.columns]

# Create a SQLite database file
conn = sqlite3.connect('superstore.db')

# Write the DataFrame into a SQL table called 'superstore'
df.to_sql('superstore', conn, if_exists='replace', index=False)

print(f"Database created. Rows loaded: {len(df)}")
conn.close()
```
3 Verify the database is working
Run your first SQL query. This confirms the database is readable and shows you the column names you’ll be working with throughout the module.
```python
import sqlite3
import pandas as pd

# Connect to the database
conn = sqlite3.connect('superstore.db')

# Your very first SQL query — read 5 rows
query = """
SELECT *
FROM superstore
LIMIT 5
"""

result = pd.read_sql(query, conn)
print("Columns:", result.columns.tolist())
result
```
4 Check the table structure
SQLite has a built-in way to inspect a table’s schema. This is useful any time you work with an unfamiliar database — it tells you the column names and their data types.
```python
# Inspect the table schema
schema_query = "PRAGMA table_info(superstore)"
schema = pd.read_sql(schema_query, conn)
print(schema[['name', 'type']])

# Also check the row count
count = pd.read_sql("SELECT COUNT(*) AS total FROM superstore", conn)
print(f"\nTotal rows: {count['total'].iloc[0]}")
```
✅ EXPECTED OUTPUT
After Step 4, you should see all your column names listed with their types (TEXT, REAL, INTEGER), and a total row count matching your original CSV. If you see that — your SQLite database is ready and you’re set for upcoming topics in this module.
Common Misconceptions
As you begin working with SQL, it is useful to address a few misconceptions.
One common belief is that SQL is only for database engineers. In reality, data analysts use SQL extensively. It is one of their primary tools for daily work.
Another misconception is that Python can replace SQL. While Python is extremely powerful, it still relies on data input. SQL remains the most efficient way to retrieve structured data from databases.
There is also a perception that SQL is difficult. In practice, SQL is relatively straightforward. Its syntax is readable, and you can start writing useful queries very quickly.
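To see how readable SQL is in practice, here is a first taste. The table is a toy stand-in (not the course database), created inline so the snippet runs on its own:

```python
import sqlite3
import pandas as pd

# A toy table just to show how readable a query is
conn = sqlite3.connect(':memory:')
pd.DataFrame({'region': ['West', 'East', 'South'],
              'sales': [100.0, 80.0, 50.0]}).to_sql('sales_data', conn, index=False)

# The query reads almost like an English sentence
result = pd.read_sql(
    "SELECT region, sales FROM sales_data WHERE sales > 60 ORDER BY sales DESC",
    conn)
print(result)
```

Even without any formal training, you can likely guess what this query returns before running it.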
Understanding these points early helps you approach SQL with the right mindset.
SUMMARY
What You Now Know
Topic 1 is intentionally conceptual — it builds the mental model that makes every SQL query you write from here feel logical rather than arbitrary. Before moving to Topic 2, make sure you can answer these questions without looking at your notes:
- ✓ Why do most companies store data in relational databases rather than flat files?
- ✓ In a real analyst workflow, at what stage does SQL get used — before or after Python?
- ✓ What is the difference between a primary key and a foreign key?
- ✓ Which tool would you use to join two tables and filter by date — SQL, Python, or Excel?
- ✓ What does `pd.read_sql()` do and why is it useful?
- ✓ Your Superstore SQLite database is created and returns 5 rows when queried.
COMING UP IN THIS MODULE
Now that your database is set up and your mental model is clear, Topic 2 dives into writing real queries — SELECT, WHERE, ORDER BY, DISTINCT, and LIMIT. By the end of Topic 2 you’ll be able to answer basic business questions entirely in SQL against your Superstore database.
Your First Queries — SELECT, WHERE, ORDER BY, LIMIT, DISTINCT