Why SQL for Data Analysts?

The Tool You Can’t Avoid

You’ve just spent Module 1 loading a CSV file into pandas and analysing it in Python. That felt powerful — and it was. But here’s something most beginner data courses don’t tell you upfront: in most real companies, data doesn’t live in CSV files.

It lives in databases. Structured, relational, often enormous databases — containing millions of rows spread across dozens of connected tables. Before a data analyst can do anything with that data, they need to query it. And the language used to query relational databases is SQL — Structured Query Language.

SQL has been around since the 1970s. It has survived every major technology shift since then — the rise of the internet, the cloud, big data, machine learning, and AI. Industry surveys consistently show SQL appearing in more data analyst job postings than any other technical skill, including Python. That kind of longevity is not an accident.

In most analytics workflows, SQL is where data gets retrieved. Python is where it gets transformed and visualised. Excel is where it gets presented. Understanding where each tool starts and stops is the difference between a junior analyst and a confident one.

Core principle

This topic has three goals. First, give you a clear mental model for when to use SQL versus Excel versus Python. Second, explain how relational databases are structured so that SQL queries make intuitive sense. Third, get your local SQL environment set up using the same Superstore dataset from Module 1 — so you’re coding, not just reading.

THE THREE TOOLS

SQL vs Excel vs Python

Most people entering data analytics already know Excel. Many have started learning Python. SQL often feels like a third thing to learn — and that can feel overwhelming. The good news is that these three tools are not competitors. They are complements. Each one is exceptional at specific tasks and weak at others.

The table below compares the three tools across the dimensions that matter most in day-to-day analyst work:

| Aspect | SQL | Python (pandas) | Excel |
| --- | --- | --- | --- |
| Primary purpose | Data extraction and querying | Data analysis, transformation, automation | Quick analysis and reporting |
| Best use case | Working with large databases | Complex data processing and advanced analysis | Small to medium datasets, business reporting |
| Data size handling | Excellent (millions of rows) | Very good (depends on memory) | Limited (can slow or crash on large data) |
| Ease of learning | Easy to start, logical syntax | Moderate (requires programming basics) | Very easy (beginner-friendly UI) |
| Performance | Very fast (optimised databases) | Fast, but depends on code efficiency | Slower with large datasets |
| Data source | Directly connects to databases | Works with files, APIs, databases | Mostly local files (Excel, CSV) |
| Data cleaning | Basic | Advanced and flexible | Manual and limited |
| Automation | Limited | Strong automation capabilities | Very limited |
| Visualisation | Not supported (basic output only) | Strong (Matplotlib, Seaborn, etc.) | Built-in charts and dashboards |
| Scalability | High | High (with proper setup) | Low |
| Real-world role | Extract and prepare data | Analyse and model data | Present and share insights |
| Dependency | Independent (data source tool) | Often depends on SQL for data | Often depends on exported data |
| Industry usage | Mandatory for analysts | Highly preferred | Widely used for reporting |

Simple Takeaway

Instead of choosing one tool over another, think of them as a workflow:

SQL → Get the data
Python → Analyze the data
Excel → Present or quickly explore

Here is how to think about them:

SQL (The retrieval layer)

Best for querying large databases, joining tables, filtering and aggregating millions of rows, and extracting exactly the data you need before analysis begins. In one word: retrieving.

Python (The analysis layer)

Best for complex data transformation, statistical analysis, visualisation, machine learning, and building repeatable automated workflows. In one word: analysing.

Excel (The presentation layer)

Best for sharing results with non-technical stakeholders, building simple models, formatting reports, and quick one-off calculations. Most business users live here. In one word: presenting.

The key insight is that a professional data analyst workflow often uses all three. SQL pulls the data from a database. Python cleans, transforms, and analyses it. Excel or a dashboard tool presents the final result to stakeholders. You are not choosing between them — you are learning to use the right one at the right stage.
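The three-stage workflow described above can be sketched end to end in a few lines. This is a minimal illustration using a throwaway in-memory SQLite table with made-up rows, not the real Superstore database; the Excel hand-off is left commented out so the sketch has no dependencies beyond pandas.

```python
import sqlite3
import pandas as pd

# Stand-in for the company database: an in-memory table with made-up rows
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (region TEXT, sales REAL)")
conn.executemany("INSERT INTO orders VALUES (?, ?)",
                 [("West", 120.0), ("East", 80.0), ("West", 50.0)])
conn.commit()

# Stage 1 — SQL: retrieve exactly the slice you need
df = pd.read_sql("SELECT region, sales FROM orders WHERE sales > 60", conn)

# Stage 2 — Python: transform and aggregate
summary = df.groupby("region", as_index=False)["sales"].sum()

# Stage 3 — Excel: hand off to stakeholders (commented out to avoid
# requiring an Excel writer engine; shown for the shape of the workflow)
# summary.to_excel("regional_sales.xlsx", index=False)

print(summary)
conn.close()
```

The point is not the tiny numbers; it is that each tool handles one stage and passes the result along.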

When to Use What — Real Scenarios

Abstract descriptions only go so far. Here is a practical breakdown of common analyst tasks and which tool wins for each:

| Task | Best tool | Why |
| --- | --- | --- |
| Pull last 3 months of orders for one region | SQL | Filtering a live database by date and region is a native SQL operation |
| Calculate profit margin across 50k rows | Python | Vectorised NumPy operations handle this faster and more flexibly |
| Build a monthly revenue summary for your manager | Excel | Non-technical stakeholders can view, filter, and share it without any tools |
| Join customer table with orders table to find repeat buyers | SQL | JOINs are SQL’s core strength — doing this in Excel is painful and error-prone |
| Build a churn prediction model | Python | scikit-learn, pandas, and model validation tools all live in Python |
| Quick sanity check on a 500-row dataset | Excel | Fastest tool for visual inspection of small, already-exported data |
| Automate a weekly report that pulls fresh data | SQL + Python | SQL queries the database, Python formats and emails the report |

📌 RULE OF THUMB

If the data is already in front of you (a CSV, a DataFrame), work in Python. If the data lives in a database and you need to extract a specific slice of it, start with SQL. If you need to share a result with someone who doesn’t code, move to Excel or a dashboard.

HOW COMPANIES STORE DATA

Relational Databases — The Conceptual Model

When you worked with the Superstore dataset in Module 1, everything was in one flat CSV file — all columns side by side in a single table. That is convenient for learning, but it is not how production data works.

Real companies store data in relational databases — systems that split information across multiple connected tables. Instead of repeating a customer’s name and address on every order they place, a relational database stores the customer details once in a customers table and links each order to the customer via a shared ID.

This approach — called normalisation — reduces duplication, prevents inconsistencies, and makes large datasets much faster to query. Understanding it conceptually is all you need at this stage. Here is what it looks like with Superstore data:


superstore_db — Simplified Schema

Table: orders

| Column | Type | Key |
| --- | --- | --- |
| order_id | TEXT | PK |
| customer_id | TEXT | FK → customers |
| order_date | DATE | |
| ship_date | DATE | |
| ship_mode | TEXT | |
| region | TEXT | |
| segment | TEXT | |

Table: customers

| Column | Type | Key |
| --- | --- | --- |
| customer_id | TEXT | PK |
| customer_name | TEXT | |
| segment | TEXT | |
| city | TEXT | |
| state | TEXT | |
| country | TEXT | |

Table: products

| Column | Type | Key |
| --- | --- | --- |
| product_id | TEXT | PK |
| product_name | TEXT | |
| category | TEXT | |
| sub_category | TEXT | |

Table: order_items

| Column | Type | Key |
| --- | --- | --- |
| item_id | INT | PK |
| order_id | TEXT | FK → orders |
| product_id | TEXT | FK → products |
| sales | REAL | |
| quantity | INT | |
| discount | REAL | |
| profit | REAL | |

Relationships
∙ orders.customer_id → customers.customer_id
∙ order_items.order_id → orders.order_id
∙ order_items.product_id → products.product_id

Note: This is a normalised version of the flat Superstore CSV — split into 4 linked tables. In Module 1 you worked with it as one flat file. In this module you’ll query it as a real relational database, using JOINs to reconnect the tables.

Three terms worth knowing at this stage:

Primary Key (PK) — a unique identifier for each row in a table. In the orders table, order_id is the primary key. No two rows can have the same value.

Foreign Key (FK) — a column that references the primary key of another table. customer_id in the orders table points to customer_id in the customers table. This is how tables are linked.

Schema — the overall structure of a database: its tables, columns, data types, and how they relate. When a colleague says “check the schema,” they mean look at this blueprint.
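Here is a minimal sketch of how those keys are declared in SQLite. The table and column names mirror the simplified schema above, but the rows are made up for illustration. One quirk worth knowing: SQLite only enforces foreign keys when the `foreign_keys` pragma is switched on.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces FKs only when this is on

conn.execute("""
    CREATE TABLE customers (
        customer_id   TEXT PRIMARY KEY,   -- PK: unique per row
        customer_name TEXT
    )
""")
conn.execute("""
    CREATE TABLE orders (
        order_id    TEXT PRIMARY KEY,
        customer_id TEXT REFERENCES customers(customer_id)  -- FK: links to customers
    )
""")

conn.execute("INSERT INTO customers VALUES ('C-001', 'Ada Lovelace')")
conn.execute("INSERT INTO orders VALUES ('O-100', 'C-001')")  # valid: C-001 exists

# An order pointing at a customer that does not exist is rejected
fk_rejected = False
try:
    conn.execute("INSERT INTO orders VALUES ('O-101', 'C-999')")
except sqlite3.IntegrityError:
    fk_rejected = True

print("FK violation rejected:", fk_rejected)
conn.close()
```

This is exactly the duplication-prevention benefit of normalisation: the database itself refuses rows that would break the links between tables.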

You don’t need to design databases at this stage. You just need to understand that when you write a SQL query, you are asking a structured question against a system that looks like this — and the answer comes back as a table you can then work with in Python.

💡 WHY THIS MATTERS FOR YOUR QUERIES

Because data is split across tables, getting a complete picture often means combining tables. A query asking “show me all orders placed by customers in New York” needs to look in both the orders table and the customers table. That is what JOINs are for — covered in Topic 4.
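As a small preview of Topic 4, here is roughly what that "orders placed by customers in New York" question looks like as a JOIN. It runs against a tiny in-memory stand-in with hypothetical rows, so you can see the shape of the query without the real database.

```python
import sqlite3
import pandas as pd

# Tiny in-memory stand-in for the normalised schema (made-up rows)
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (customer_id TEXT PRIMARY KEY, city TEXT);
    CREATE TABLE orders (order_id TEXT PRIMARY KEY, customer_id TEXT);
    INSERT INTO customers VALUES ('C-1', 'New York'), ('C-2', 'Boston');
    INSERT INTO orders VALUES ('O-1', 'C-1'), ('O-2', 'C-2'), ('O-3', 'C-1');
""")

# JOIN reconnects the two tables via the shared customer_id
query = """
    SELECT o.order_id, c.city
    FROM orders AS o
    JOIN customers AS c ON o.customer_id = c.customer_id
    WHERE c.city = 'New York'
"""
ny_orders = pd.read_sql(query, conn)
print(ny_orders)
conn.close()
```

Don't worry about the syntax yet; the takeaway is that one query reads from both tables at once.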

SETUP LAB

Setting Up SQLite + Converting Superstore to a Database

For this module we are using SQLite — a lightweight, file-based database that requires zero server setup and works directly inside Python. It is the perfect SQL learning environment because you can get started in under five minutes with no installation beyond what you already have.

Better still — we are converting the Superstore CSV from Module 1 into a SQLite database. You already know this dataset. The columns, the business context, the quirks. This means you can focus entirely on learning SQL syntax instead of learning new data at the same time.

1 Confirm your setup

SQLite comes built into Python’s standard library — no pip install needed. Confirm it’s available by running this in a new notebook cell:

```python
import sqlite3
import pandas as pd

# Confirm sqlite3 version
print("SQLite version:", sqlite3.sqlite_version)
print("Ready to go!")
```

2 Load the Superstore CSV and convert to SQLite

This script reads your CSV, creates a SQLite database file, and writes the data into it as a table called superstore. Run it once — it creates a file called superstore.db that you’ll use throughout this module.

```python
import sqlite3
import pandas as pd

# Load the CSV you used in Module 1
df = pd.read_csv('superstore_sales.csv')

# Clean column names — replace spaces with underscores
df.columns = [col.strip().replace(' ', '_').lower() for col in df.columns]

# Create a SQLite database file
conn = sqlite3.connect('superstore.db')

# Write the DataFrame into a SQL table called 'superstore'
df.to_sql('superstore', conn, if_exists='replace', index=False)

print(f"Database created. Rows loaded: {len(df)}")
conn.close()
```

3 Verify the database is working

Run your first SQL query. This confirms the database is readable and shows you the column names you’ll be working with throughout the module.

```python
# Connect to the database
conn = sqlite3.connect('superstore.db')

# Your very first SQL query — read 5 rows
query = """
    SELECT *
    FROM superstore
    LIMIT 5
"""

result = pd.read_sql(query, conn)
print("Columns:", result.columns.tolist())
result  # in a notebook, displays the 5 rows as a table
```

4 Check the table structure

SQLite has a built-in way to inspect a table’s schema. This is useful any time you work with an unfamiliar database — it tells you the column names and their data types.

python

<em># Inspect the table schema</em>
schema_query = "PRAGMA table_info(superstore)"
schema = pd.read_sql(schema_query, conn)
print(schema[['name', 'type']])

<em># Also check row count</em>
count = pd.read_sql("SELECT COUNT(*) as total FROM superstore", conn)
print(f"\nTotal rows: {count['total'][0]}")

✅ EXPECTED OUTPUT

After Step 4, you should see all your column names listed with their types (TEXT, REAL, INTEGER), and a total row count matching your original CSV. If you see that — your SQLite database is ready and you’re set for upcoming topics in this module.

Common Misconceptions

As you begin working with SQL, it is useful to address a few misconceptions.

One common belief is that SQL is only for database engineers. In reality, data analysts use SQL extensively. It is one of their primary tools for daily work.

Another misconception is that Python can replace SQL. While Python is extremely powerful, it still relies on data input. SQL remains the most efficient way to retrieve structured data from databases.

There is also a perception that SQL is difficult. In practice, SQL is relatively straightforward. Its syntax is readable, and you can start writing useful queries very quickly.
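To see how readable SQL can be, here is a query that answers "total sales per region, biggest first". It runs against a tiny in-memory stand-in table with made-up numbers rather than your real superstore.db, so it is self-contained.

```python
import sqlite3
import pandas as pd

# Stand-in table with made-up rows, just to demonstrate the syntax
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE superstore (region TEXT, sales REAL);
    INSERT INTO superstore VALUES ('West', 200.0), ('East', 90.0), ('West', 40.0);
""")

# Reads almost like the business question itself
result = pd.read_sql("""
    SELECT region, SUM(sales) AS total_sales
    FROM superstore
    GROUP BY region
    ORDER BY total_sales DESC
""", conn)
print(result)
conn.close()
```

Even before you learn the keywords formally, you can probably guess what each line does — that readability is the point.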

Understanding these points early helps you approach SQL with the right mindset.

SUMMARY

What You Now Know

Topic 1 is intentionally conceptual — it builds the mental model that makes every SQL query you write from here feel logical rather than arbitrary. Before moving to Topic 2, make sure you can answer these questions without looking at your notes:

  • ✓ Why do most companies store data in relational databases rather than flat files?
  • ✓ In a real analyst workflow, at what stage does SQL get used — before or after Python?
  • ✓ What is the difference between a primary key and a foreign key?
  • ✓ Which tool would you use to join two tables and filter by date — SQL, Python, or Excel?
  • ✓ What does pd.read_sql() do, and why is it useful?
  • ✓ Your Superstore SQLite database is created and returns 5 rows when queried.

COMING UP IN THIS MODULE

Now that your database is set up and your mental model is clear, Topic 2 dives into writing real queries — SELECT, WHERE, ORDER BY, DISTINCT, and LIMIT. By the end of Topic 2 you’ll be able to answer basic business questions entirely in SQL against your Superstore database.

NEXT TOPIC →

Your First Queries — SELECT, WHERE, ORDER BY, LIMIT, DISTINCT
