Why SQL for Data Analysts?

The Tool You Can’t Avoid

You’ve just spent Module 1 loading a CSV file into pandas and analysing it in Python. That felt powerful — and it was. But here’s something most beginner data courses don’t tell you upfront: in most real companies, data doesn’t live in CSV files.

It lives in databases. Structured, relational, often enormous databases — containing millions of rows spread across dozens of connected tables. Before a data analyst can do anything with that data, they need to query it. And the language used to query relational databases is SQL — Structured Query Language.

SQL has been around since the 1970s. It has survived every major technology shift since then — the rise of the internet, the cloud, big data, machine learning, and AI. Industry surveys consistently show SQL appearing in more data analyst job postings than any other technical skill, including Python. That kind of longevity is not an accident.

In most analytics workflows, SQL is where data gets retrieved. Python is where it gets transformed and visualised. Excel is where it gets presented. Understanding where each tool starts and stops is the difference between a junior analyst and a confident one.

Core principle

This topic has three goals. First, give you a clear mental model for when to use SQL versus Excel versus Python. Second, explain how relational databases are structured so that SQL queries make intuitive sense. Third, get your local SQL environment set up using the same Superstore dataset from Module 1 — so you’re coding, not just reading.

THE THREE TOOLS

SQL vs Excel vs Python

Most people entering data analytics already know Excel. Many have started learning Python. SQL often feels like a third thing to learn — and that can feel overwhelming. The good news is that these three tools are not competitors. They are complements. Each one is exceptional at specific tasks and weak at others.

The table below compares the three tools across the dimensions that matter most in day-to-day analyst work:

| Aspect | SQL | Python (pandas) | Excel |
| --- | --- | --- | --- |
| Primary purpose | Data extraction and querying | Data analysis, transformation, automation | Quick analysis and reporting |
| Best use case | Working with large databases | Complex data processing and advanced analysis | Small to medium datasets, business reporting |
| Data size handling | Excellent (millions of rows) | Very good (depends on memory) | Limited (can slow or crash on large data) |
| Ease of learning | Easy to start, logical syntax | Moderate (requires programming basics) | Very easy (beginner-friendly UI) |
| Performance | Very fast (optimised databases) | Fast, but depends on code efficiency | Slower with large datasets |
| Data source | Directly connects to databases | Works with files, APIs, databases | Mostly local files (Excel, CSV) |
| Data cleaning | Basic | Advanced and flexible | Manual and limited |
| Automation | Limited | Strong automation capabilities | Very limited |
| Visualisation | Not supported (basic output only) | Strong (Matplotlib, Seaborn, etc.) | Built-in charts and dashboards |
| Scalability | High | High (with proper setup) | Low |
| Real-world role | Extract and prepare data | Analyse and model data | Present and share insights |
| Dependency | Independent (data source tool) | Often depends on SQL for data | Often depends on exported data |
| Industry usage | Mandatory for analysts | Highly preferred | Widely used for reporting |

Simple Takeaway

Instead of choosing one tool over another, think of them as a workflow:

SQL → Get the data
Python → Analyze the data
Excel → Present or quickly explore

Here is how to think about them:

SQL (The retrieval layer)

Best for querying large databases, joining tables, filtering and aggregating millions of rows, and extracting exactly the data you need before analysis begins. In one word: retrieving.

Python (The analysis layer)

Best for complex data transformation, statistical analysis, visualisation, machine learning, and building repeatable automated workflows. In one word: analysing.

Excel (The presentation layer)

Best for sharing results with non-technical stakeholders, building simple models, formatting reports, and quick one-off calculations. Most business users live here. In one word: presenting.

The key insight is that a professional data analyst workflow often uses all three. SQL pulls the data from a database. Python cleans, transforms, and analyses it. Excel or a dashboard tool presents the final result to stakeholders. You are not choosing between them — you are learning to use the right one at the right stage.
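The three-stage workflow described above can be sketched end to end in a few lines. This is a minimal illustration using a throwaway in-memory SQLite table with made-up rows, not the real Superstore database; the Excel hand-off is left commented out so the sketch has no dependencies beyond pandas.

```python
import sqlite3
import pandas as pd

# Stand-in for the company database: an in-memory table with made-up rows
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (region TEXT, sales REAL)")
conn.executemany("INSERT INTO orders VALUES (?, ?)",
                 [("West", 120.0), ("East", 80.0), ("West", 50.0)])
conn.commit()

# Stage 1 — SQL: retrieve exactly the slice you need
df = pd.read_sql("SELECT region, sales FROM orders WHERE sales > 60", conn)

# Stage 2 — Python: transform and aggregate
summary = df.groupby("region", as_index=False)["sales"].sum()

# Stage 3 — Excel: hand off to stakeholders (commented out to avoid
# requiring an Excel writer engine; shown for the shape of the workflow)
# summary.to_excel("regional_sales.xlsx", index=False)

print(summary)
conn.close()
```

The point is not the tiny numbers; it is that each tool handles one stage and passes the result along.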

When to Use What — Real Scenarios

Abstract descriptions only go so far. Here is a practical breakdown of common analyst tasks and which tool wins for each:

| Task | Best tool | Why |
| --- | --- | --- |
| Pull last 3 months of orders for one region | SQL | Filtering a live database by date and region is a native SQL operation |
| Calculate profit margin across 50k rows | Python | Vectorised NumPy operations handle this faster and more flexibly |
| Build a monthly revenue summary for your manager | Excel | Non-technical stakeholders can view, filter, and share it without any tools |
| Join customer table with orders table to find repeat buyers | SQL | JOINs are SQL’s core strength — doing this in Excel is painful and error-prone |
| Build a churn prediction model | Python | scikit-learn, pandas, and model validation tools all live in Python |
| Quick sanity check on a 500-row dataset | Excel | Fastest tool for visual inspection of small, already-exported data |
| Automate a weekly report that pulls fresh data | SQL + Python | SQL queries the database, Python formats and emails the report |

📌 RULE OF THUMB

If the data is already in front of you (a CSV, a DataFrame), work in Python. If the data lives in a database and you need to extract a specific slice of it, start with SQL. If you need to share a result with someone who doesn’t code, move to Excel or a dashboard.

HOW COMPANIES STORE DATA

Relational Databases — The Conceptual Model

When you worked with the Superstore dataset in Module 1, everything was in one flat CSV file — all columns side by side in a single table. That is convenient for learning, but it is not how production data works.

Real companies store data in relational databases — systems that split information across multiple connected tables. Instead of repeating a customer’s name and address on every order they place, a relational database stores the customer details once in a customers table and links each order to the customer via a shared ID.

This approach — called normalisation — reduces duplication, prevents inconsistencies, and makes large datasets much faster to query. Understanding it conceptually is all you need at this stage. Here is what it looks like with Superstore data:


superstore_db — Simplified Schema

Table: orders

| Column | Type | Key |
| --- | --- | --- |
| order_id | TEXT | PK |
| customer_id | TEXT | FK → customers |
| order_date | DATE | |
| ship_date | DATE | |
| ship_mode | TEXT | |
| region | TEXT | |
| segment | TEXT | |

Table: customers

| Column | Type | Key |
| --- | --- | --- |
| customer_id | TEXT | PK |
| customer_name | TEXT | |
| segment | TEXT | |
| city | TEXT | |
| state | TEXT | |
| country | TEXT | |

Table: products

| Column | Type | Key |
| --- | --- | --- |
| product_id | TEXT | PK |
| product_name | TEXT | |
| category | TEXT | |
| sub_category | TEXT | |

Table: order_items

| Column | Type | Key |
| --- | --- | --- |
| item_id | INT | PK |
| order_id | TEXT | FK → orders |
| product_id | TEXT | FK → products |
| sales | REAL | |
| quantity | INT | |
| discount | REAL | |
| profit | REAL | |

Relationships
∙ orders.customer_id → customers.customer_id
∙ order_items.order_id → orders.order_id
∙ order_items.product_id → products.product_id

Note: This is a normalised version of the flat Superstore CSV — split into 4 linked tables. In Module 1 you worked with it as one flat file. In this module you’ll query it as a real relational database, using JOINs to reconnect the tables.

Three terms worth knowing at this stage:

Primary Key (PK) — a unique identifier for each row in a table. In the orders table, order_id is the primary key. No two rows can have the same value.

Foreign Key (FK) — a column that references the primary key of another table. customer_id in the orders table points to customer_id in the customers table. This is how tables are linked.

Schema — the overall structure of a database: its tables, columns, data types, and how they relate. When a colleague says “check the schema,” they mean look at this blueprint.
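Here is a minimal sketch of how those keys are declared in SQLite. The table and column names mirror the simplified schema above, but the rows are made up for illustration. One quirk worth knowing: SQLite only enforces foreign keys when the `foreign_keys` pragma is switched on.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces FKs only when this is on

conn.execute("""
    CREATE TABLE customers (
        customer_id   TEXT PRIMARY KEY,   -- PK: unique per row
        customer_name TEXT
    )
""")
conn.execute("""
    CREATE TABLE orders (
        order_id    TEXT PRIMARY KEY,
        customer_id TEXT REFERENCES customers(customer_id)  -- FK: links to customers
    )
""")

conn.execute("INSERT INTO customers VALUES ('C-001', 'Ada Lovelace')")
conn.execute("INSERT INTO orders VALUES ('O-100', 'C-001')")  # valid: C-001 exists

# An order pointing at a customer that does not exist is rejected
fk_rejected = False
try:
    conn.execute("INSERT INTO orders VALUES ('O-101', 'C-999')")
except sqlite3.IntegrityError:
    fk_rejected = True

print("FK violation rejected:", fk_rejected)
conn.close()
```

This is exactly the duplication-prevention benefit of normalisation: the database itself refuses rows that would break the links between tables.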

You don’t need to design databases at this stage. You just need to understand that when you write a SQL query, you are asking a structured question against a system that looks like this — and the answer comes back as a table you can then work with in Python.

💡 WHY THIS MATTERS FOR YOUR QUERIES

Because data is split across tables, getting a complete picture often means combining tables. A query asking “show me all orders placed by customers in New York” needs to look in both the orders table and the customers table. That is what JOINs are for — covered in Topic 4.
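As a small preview of Topic 4, here is roughly what that "orders placed by customers in New York" question looks like as a JOIN. It runs against a tiny in-memory stand-in with hypothetical rows, so you can see the shape of the query without the real database.

```python
import sqlite3
import pandas as pd

# Tiny in-memory stand-in for the normalised schema (made-up rows)
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (customer_id TEXT PRIMARY KEY, city TEXT);
    CREATE TABLE orders (order_id TEXT PRIMARY KEY, customer_id TEXT);
    INSERT INTO customers VALUES ('C-1', 'New York'), ('C-2', 'Boston');
    INSERT INTO orders VALUES ('O-1', 'C-1'), ('O-2', 'C-2'), ('O-3', 'C-1');
""")

# JOIN reconnects the two tables via the shared customer_id
query = """
    SELECT o.order_id, c.city
    FROM orders AS o
    JOIN customers AS c ON o.customer_id = c.customer_id
    WHERE c.city = 'New York'
"""
ny_orders = pd.read_sql(query, conn)
print(ny_orders)
conn.close()
```

Don't worry about the syntax yet; the takeaway is that one query reads from both tables at once.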

SETUP LAB

Setting Up SQLite + Converting Superstore to a Database

For this module we are using SQLite — a lightweight, file-based database that requires zero server setup and works directly inside Python. It is the perfect SQL learning environment because you can get started in under five minutes with no installation beyond what you already have.

Better still — we are converting the Superstore CSV from Module 1 into a SQLite database. You already know this dataset. The columns, the business context, the quirks. This means you can focus entirely on learning SQL syntax instead of learning new data at the same time.

1 Confirm your setup

SQLite comes built into Python’s standard library — no pip install needed. Confirm it’s available by running this in a new notebook cell:

```python
import sqlite3
import pandas as pd

# Confirm sqlite3 version
print("SQLite version:", sqlite3.sqlite_version)
print("Ready to go!")
```

2 Load the Superstore CSV and convert to SQLite

This script reads your CSV, creates a SQLite database file, and writes the data into it as a table called superstore. Run it once — it creates a file called superstore.db that you’ll use throughout this module.

```python
import sqlite3
import pandas as pd

# Load the CSV you used in Module 1
df = pd.read_csv('superstore_sales.csv')

# Clean column names — replace spaces with underscores
df.columns = [col.strip().replace(' ', '_').lower() for col in df.columns]

# Create a SQLite database file
conn = sqlite3.connect('superstore.db')

# Write the DataFrame into a SQL table called 'superstore'
df.to_sql('superstore', conn, if_exists='replace', index=False)

print(f"Database created. Rows loaded: {len(df)}")
conn.close()
```

3 Verify the database is working

Run your first SQL query. This confirms the database is readable and shows you the column names you’ll be working with throughout the module.

```python
# Connect to the database
conn = sqlite3.connect('superstore.db')

# Your very first SQL query — read 5 rows
query = """
    SELECT *
    FROM superstore
    LIMIT 5
"""

result = pd.read_sql(query, conn)
print("Columns:", result.columns.tolist())
result  # in a notebook, displays the 5 rows as a table
```

4 Check the table structure

SQLite has a built-in way to inspect a table’s schema. This is useful any time you work with an unfamiliar database — it tells you the column names and their data types.

python

<em># Inspect the table schema</em>
schema_query = "PRAGMA table_info(superstore)"
schema = pd.read_sql(schema_query, conn)
print(schema[['name', 'type']])

<em># Also check row count</em>
count = pd.read_sql("SELECT COUNT(*) as total FROM superstore", conn)
print(f"\nTotal rows: {count['total'][0]}")

✅ EXPECTED OUTPUT

After Step 4, you should see all your column names listed with their types (TEXT, REAL, INTEGER), and a total row count matching your original CSV. If you see that — your SQLite database is ready and you’re set for upcoming topics in this module.

Common Misconceptions

As you begin working with SQL, it is useful to address a few misconceptions.

One common belief is that SQL is only for database engineers. In reality, data analysts use SQL extensively. It is one of their primary tools for daily work.

Another misconception is that Python can replace SQL. While Python is extremely powerful, it still relies on data input. SQL remains the most efficient way to retrieve structured data from databases.

There is also a perception that SQL is difficult. In practice, SQL is relatively straightforward. Its syntax is readable, and you can start writing useful queries very quickly.
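To see how readable SQL can be, here is a query that answers "total sales per region, biggest first". It runs against a tiny in-memory stand-in table with made-up numbers rather than your real superstore.db, so it is self-contained.

```python
import sqlite3
import pandas as pd

# Stand-in table with made-up rows, just to demonstrate the syntax
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE superstore (region TEXT, sales REAL);
    INSERT INTO superstore VALUES ('West', 200.0), ('East', 90.0), ('West', 40.0);
""")

# Reads almost like the business question itself
result = pd.read_sql("""
    SELECT region, SUM(sales) AS total_sales
    FROM superstore
    GROUP BY region
    ORDER BY total_sales DESC
""", conn)
print(result)
conn.close()
```

Even before you learn the keywords formally, you can probably guess what each line does — that readability is the point.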

Understanding these points early helps you approach SQL with the right mindset.

SUMMARY

What You Now Know

Topic 1 is intentionally conceptual — it builds the mental model that makes every SQL query you write from here feel logical rather than arbitrary. Before moving to Topic 2, make sure you can answer these questions without looking at your notes:

  • ✓ Why do most companies store data in relational databases rather than flat files?
  • ✓ In a real analyst workflow, at what stage does SQL get used — before or after Python?
  • ✓ What is the difference between a primary key and a foreign key?
  • ✓ Which tool would you use to join two tables and filter by date — SQL, Python, or Excel?
  • ✓ What does pd.read_sql() do, and why is it useful?
  • ✓ Your Superstore SQLite database is created and returns 5 rows when queried.

COMING UP IN THIS MODULE

Now that your database is set up and your mental model is clear, Topic 2 dives into writing real queries — SELECT, WHERE, ORDER BY, DISTINCT, and LIMIT. By the end of Topic 2 you’ll be able to answer basic business questions entirely in SQL against your Superstore database.

NEXT TOPIC →

Your First Queries — SELECT, WHERE, ORDER BY, LIMIT, DISTINCT
