Why SQL and Python Belong Together
Every topic in this module has used Python to run SQL queries. You have been writing SQL inside Python strings and passing them to pd.read_sql(). That combination is not a workaround — it is the standard professional workflow for data analysts who work with databases.
SQL and Python are not competing tools. They are complementary layers in the same pipeline. SQL is where you retrieve, filter, and summarise data at the database level. Python is where you transform, visualise, model, and communicate that data. Understanding where one ends and the other begins is one of the most practically valuable things you can take away from this module.
This topic goes deeper into that boundary. You will learn how the connection between SQLite and Python actually works, how to manage that connection properly, how to decide what belongs in SQL versus pandas, and how to structure a clean repeatable workflow that scales from a local SQLite file to a production cloud database.
How the sqlite3 Connection Works
Every query you have run in this module started with one line:
```python
conn = sqlite3.connect('superstore.db')
```
That line opens a connection to a SQLite database file. A connection is a live channel between your Python session and the database. Through that channel you can send SQL statements and receive results back as Python objects.
Understanding the connection lifecycle matters because connections consume resources. A well-written analysis opens a connection, does its work, and closes the connection cleanly. A poorly written one leaves connections open, which can cause file locking issues and unpredictable behaviour, especially when multiple processes access the same database.
Opening and Closing Connections Properly
```python
import sqlite3
import pandas as pd

# Open the connection
conn = sqlite3.connect('superstore.db')

# Do your work
df = pd.read_sql("SELECT * FROM superstore LIMIT 5", conn)

# Always close when done
conn.close()
```
For longer notebooks where you need the connection throughout, the best practice is to open it once at the top and close it once at the bottom — not open and close it around every query.
Using a Context Manager
Python’s with statement can handle the connection lifecycle automatically. One subtlety: using a sqlite3 connection directly as a context manager (with sqlite3.connect(...) as conn:) commits or rolls back the current transaction at the end of the block, but does not close the connection. To have the connection close itself when the block ends, even if an error occurs inside it, wrap it in contextlib.closing():

```python
from contextlib import closing

# Context manager — connection closes automatically
with closing(sqlite3.connect('superstore.db')) as conn:
    df = pd.read_sql("SELECT * FROM superstore LIMIT 5", conn)
    print(df)
# conn is closed here automatically
```
For notebook-based analysis the manual approach is fine. For scripts that run automatically — scheduled reports, data pipelines — always use the context manager.
Checking What Tables Exist
When working with an unfamiliar database, the first thing you want to know is what tables are available:
```python
conn = sqlite3.connect('superstore.db')

# List all tables in the database
tables = pd.read_sql("""
    SELECT name
    FROM sqlite_master
    WHERE type = 'table'
    ORDER BY name
""", conn)
print(tables)
```
sqlite_master is SQLite’s internal catalogue table. It stores metadata about everything in the database — tables, indexes, and views. This query is the SQLite equivalent of asking “what is in here?” when you open an unfamiliar database for the first time.
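Once you know which tables exist, the natural follow-up question is what columns each one has. SQLite answers this with PRAGMA table_info, which pd.read_sql can read like any query. A minimal sketch: the tiny in-memory table here stands in for the real superstore.db so the example runs on its own.

```python
import sqlite3
import pandas as pd

# In-memory stand-in for superstore.db, just for illustration
conn = sqlite3.connect(':memory:')
conn.execute("CREATE TABLE superstore (order_id TEXT, region TEXT, sales REAL)")

# PRAGMA table_info returns one row per column: cid, name, type, notnull, ...
cols = pd.read_sql("PRAGMA table_info(superstore)", conn)
print(cols[['name', 'type']])

conn.close()
```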
pd.read_sql() — The Bridge Between SQL and pandas
pd.read_sql() is the function that executes a SQL query and returns the result directly as a pandas DataFrame. It is the core of the SQL-Python workflow.
```python
# Basic usage
df = pd.read_sql(sql_query, connection)
```
Once the result is a DataFrame you have the full pandas toolkit available — filtering, reshaping, visualisation, statistical analysis, and everything from Module 1.
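For instance, a result loaded this way can go straight into ordinary pandas operations such as groupby. A quick sketch, using a small in-memory table in place of the course database:

```python
import sqlite3
import pandas as pd

# Small in-memory table standing in for superstore.db
conn = sqlite3.connect(':memory:')
conn.execute("CREATE TABLE superstore (region TEXT, sales REAL)")
conn.executemany(
    "INSERT INTO superstore VALUES (?, ?)",
    [('West', 100.0), ('East', 200.0), ('West', 50.0)]
)

df = pd.read_sql("SELECT * FROM superstore", conn)
conn.close()

# From here on it is ordinary pandas
print(df.groupby('region')['sales'].mean())
```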
Passing Parameters Safely
When your query needs to include a variable value — a user input, a date from a loop, a value from another DataFrame — never build the query by concatenating strings. This is a security risk called SQL injection and also causes bugs when values contain special characters like apostrophes.
Instead use parameterised queries:
```python
# Unsafe — never do this
region = "West"
df = pd.read_sql(f"SELECT * FROM superstore WHERE region = '{region}'", conn)

# Safe — use parameters
region = "West"
df = pd.read_sql(
    "SELECT * FROM superstore WHERE region = ?",
    conn,
    params=(region,)
)
```
The ? placeholder gets replaced safely by the value in params. SQLite handles the escaping automatically. This is especially important when the variable value comes from user input or an external source.
Passing Multiple Parameters
```python
# Filter by region and minimum sales value
region = "West"
min_sales = 500

df = pd.read_sql(
    """
    SELECT order_id, region, sales, profit
    FROM superstore
    WHERE region = ?
      AND sales > ?
    ORDER BY sales DESC
    """,
    conn,
    params=(region, min_sales)
)
```
Parameters are passed as a tuple in the same order as the ? placeholders appear in the query.
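Besides positional ? placeholders, sqlite3 also accepts named placeholders of the form :name, with params given as a dictionary. This reads more clearly when a query takes several parameters. A sketch against a small in-memory table:

```python
import sqlite3
import pandas as pd

# Small in-memory table for illustration
conn = sqlite3.connect(':memory:')
conn.execute("CREATE TABLE superstore (order_id TEXT, region TEXT, sales REAL)")
conn.executemany(
    "INSERT INTO superstore VALUES (?, ?, ?)",
    [('O1', 'West', 250.0), ('O2', 'East', 900.0), ('O3', 'West', 120.0)]
)

# Named placeholders: the dict keys match the :names in the query
df = pd.read_sql(
    "SELECT * FROM superstore WHERE region = :region AND sales > :min_sales",
    conn,
    params={'region': 'West', 'min_sales': 200}
)
print(df)
conn.close()
```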
Reading Large Datasets in Chunks
When a query returns a very large result set — millions of rows — loading everything into memory at once can crash your notebook. pd.read_sql() supports chunked reading via the chunksize parameter:
```python
# Read in chunks of 1000 rows at a time
chunks = pd.read_sql(
    "SELECT * FROM superstore",
    conn,
    chunksize=1000
)

# Process each chunk
dfs = []
for chunk in chunks:
    # Apply any row-level processing here
    dfs.append(chunk)

df = pd.concat(dfs, ignore_index=True)
print(f"Total rows loaded: {len(df)}")
```
For the Superstore dataset this is not necessary — 10,000 rows loads instantly. But on a production database with millions of rows it is an essential technique to know.
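When the goal is a summary rather than the full table, you do not even need to keep the chunks: accumulate a running aggregate as each one arrives, so memory use stays flat regardless of table size. A sketch, with a small in-memory table standing in for a large production one:

```python
import sqlite3
import pandas as pd

# Build a small table; in practice this would be a large production table
conn = sqlite3.connect(':memory:')
conn.execute("CREATE TABLE superstore (region TEXT, sales REAL)")
conn.executemany(
    "INSERT INTO superstore VALUES (?, ?)",
    [('West', 100.0), ('East', 200.0), ('West', 50.0), ('East', 25.0)] * 500
)
conn.commit()

# Add each chunk's per-region subtotal into a running total
total_by_region = pd.Series(dtype='float64')
for chunk in pd.read_sql("SELECT region, sales FROM superstore", conn, chunksize=300):
    total_by_region = total_by_region.add(
        chunk.groupby('region')['sales'].sum(), fill_value=0
    )

print(total_by_region)
conn.close()
```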
Writing Data Back to SQLite
The SQL-Python bridge works in both directions. You can read data from a database into pandas, and you can write a pandas DataFrame back into a database as a table.
to_sql() — Writing a DataFrame to a Database Table
```python
# Create a summary DataFrame in pandas
summary = df.groupby('region').agg(
    total_sales=('sales', 'sum'),
    total_profit=('profit', 'sum'),
    order_count=('order_id', 'count')
).reset_index().round(2)

# Write it back to the database as a new table
summary.to_sql(
    'region_summary',      # table name
    conn,
    if_exists='replace',   # replace if table already exists
    index=False            # don't write the DataFrame index as a column
)
print("Summary table written to database.")
```
The if_exists parameter controls what happens if the table already exists:
- replace — drop and recreate the table
- append — add rows to the existing table
- fail — raise an error (the default)
Once written back to the database, you can query this summary table with SQL just like any other table.
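A quick round trip makes the point: write a DataFrame with to_sql, then query it back with plain SQL. The small frame here is illustrative:

```python
import sqlite3
import pandas as pd

conn = sqlite3.connect(':memory:')

# A small summary frame standing in for the groupby result above
summary = pd.DataFrame({
    'region': ['East', 'West'],
    'total_sales': [1200.50, 980.25],
})

# Write it, then query it like any other table
summary.to_sql('region_summary', conn, if_exists='replace', index=False)
top = pd.read_sql(
    "SELECT region, total_sales FROM region_summary ORDER BY total_sales DESC LIMIT 1",
    conn
)
print(top)
conn.close()
```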
When to Filter in SQL vs pandas
This is the most practically important decision in the SQL-Python workflow. The wrong choice does not break anything — both tools can filter data. But the right choice makes your analysis faster, cleaner, and more professional.
The Core Principle
Filter and aggregate in SQL. Transform, visualise, and model in Python.
SQL runs inside the database engine which is optimised for filtering and aggregating large datasets. When you push filtering into SQL, only the rows you actually need travel from the database to Python. When you pull everything into Python and filter there, you are loading unnecessary data into memory and doing work that the database could have done more efficiently.
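You can see the difference directly: both approaches return the same rows, but the SQL filter sends half as much data across the bridge. A sketch with an in-memory table:

```python
import sqlite3
import pandas as pd

# 2,000-row in-memory table, half 'West' and half 'East'
conn = sqlite3.connect(':memory:')
conn.execute("CREATE TABLE superstore (region TEXT, sales REAL)")
conn.executemany(
    "INSERT INTO superstore VALUES (?, ?)",
    [('West', 100.0), ('East', 200.0)] * 1000
)

# Filter in pandas: all 2,000 rows travel to Python first
all_rows = pd.read_sql("SELECT * FROM superstore", conn)
west_pandas = all_rows[all_rows['region'] == 'West']

# Filter in SQL: only the 1,000 matching rows travel
west_sql = pd.read_sql("SELECT * FROM superstore WHERE region = 'West'", conn)

print(len(all_rows), len(west_pandas), len(west_sql))
conn.close()
```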
Filter in SQL When
```python
# ✅ Row-level filters that reduce data volume
query("""
    SELECT *
    FROM superstore
    WHERE region = 'West'
      AND order_date >= '2021-01-01'
""")

# ✅ Aggregations that summarise large tables
query("""
    SELECT region, SUM(sales) AS total_sales
    FROM superstore
    GROUP BY region
""")

# ✅ JOINs that combine tables
query("""
    SELECT o.order_id, c.customer_name, oi.sales
    FROM orders o
    INNER JOIN customers c ON o.customer_id = c.customer_id
    INNER JOIN order_items oi ON o.order_id = oi.order_id
""")

# ✅ Deduplication before analysis
query("""
    SELECT DISTINCT customer_id, segment
    FROM customers
""")
```
Filter in pandas When
```python
# ✅ Complex conditional logic involving multiple Python objects
df['high_value'] = df['sales'] > df['sales'].mean() * 1.5

# ✅ String operations not easily done in SQL
df_filtered = df[df['customer_name'].str.contains('son', case=False)]

# ✅ Filtering based on values calculated in Python
threshold = df['profit'].quantile(0.75)
df_top = df[df['profit'] > threshold]

# ✅ Time-based filtering using pandas datetime methods
df['order_date'] = pd.to_datetime(df['order_date'])
df_recent = df[df['order_date'].dt.year == 2021]

# ✅ Filtering after a merge or reshape operation in pandas
merged = df1.merge(df2, on='customer_id')
filtered = merged[merged['total_orders'] > 3]
```
The Decision Framework
Ask yourself three questions before deciding where to filter:
- Does the filter reduce the number of rows significantly? If yes, do it in SQL. Bringing fewer rows into Python is always better.
- Does the filter require Python objects, methods, or calculated values that SQL cannot access? If yes, do it in pandas after loading.
- Is this a one-time exploration or a repeatable pipeline? For pipelines, push as much as possible into SQL for performance and reliability.
Building a Clean SQL-Python Workflow
Here is a complete, realistic analyst workflow that shows SQL and Python working together from raw database to final insight:
```python
import sqlite3
import pandas as pd
import matplotlib.pyplot as plt

# ── STEP 1: Connect ──────────────────────────────────────
conn = sqlite3.connect('superstore.db')

# ── STEP 2: Pull clean, pre-filtered data using SQL ──────
df = pd.read_sql("""
    SELECT
        c.segment,
        o.region,
        oi.category,
        oi.sub_category,
        o.order_date,
        ROUND(oi.sales, 2) AS sales,
        ROUND(oi.profit, 2) AS profit
    FROM orders o
    INNER JOIN customers c
        ON o.customer_id = c.customer_id
    INNER JOIN order_items oi
        ON o.order_id = oi.order_id
    WHERE o.order_date >= '2021-01-01'
""", conn)

# ── STEP 3: Convert types in pandas ──────────────────────
df['order_date'] = pd.to_datetime(df['order_date'])
df['month'] = df['order_date'].dt.to_period('M')

# ── STEP 4: Further analysis in pandas ───────────────────
# Profit margin by segment
df['profit_margin'] = (df['profit'] / df['sales'] * 100).round(2)

# Monthly revenue trend
monthly = df.groupby('month')['sales'].sum().reset_index()
monthly.columns = ['month', 'total_sales']

# Segment performance
segment = df.groupby('segment').agg(
    total_sales=('sales', 'sum'),
    total_profit=('profit', 'sum'),
    avg_margin=('profit_margin', 'mean')
).round(2).reset_index()

# ── STEP 5: Print insights ───────────────────────────────
print("=== Segment Performance (2021) ===")
print(segment.sort_values('total_profit', ascending=False))

print("\n=== Monthly Revenue Trend ===")
print(monthly)

# ── STEP 6: Close connection ─────────────────────────────
conn.close()
```
This workflow is the template for every analysis you will build in this course going forward. SQL handles retrieval and pre-filtering. Python handles enrichment, aggregation, and presentation. Each tool does what it is best at.
Saving Query Results for Reuse
When a query takes a long time to run — common on large production databases — save the result to a CSV or parquet file so you do not have to re-query every time you restart your notebook:
```python
# Run the heavy query once
df = pd.read_sql(heavy_query, conn)

# Save locally
df.to_csv('data/superstore_clean.csv', index=False)

# Next session — load from file instead of re-querying
df = pd.read_csv('data/superstore_clean.csv')
```
This is standard practice in professional analytics. Query the database to get fresh data when you need it. Work from a saved file during iterative analysis and visualisation where you are not changing the underlying data pull.
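This pattern can be wrapped in a small helper that hits the database only when no cached file exists yet. The function name cached_query is illustrative, not a standard API:

```python
import os
import sqlite3
import tempfile
import pandas as pd

def cached_query(sql, conn, cache_path):
    """Run the query once and cache to CSV; later calls read the file instead."""
    if os.path.exists(cache_path):
        return pd.read_csv(cache_path)
    df = pd.read_sql(sql, conn)
    df.to_csv(cache_path, index=False)
    return df

# Demo against a tiny in-memory table
conn = sqlite3.connect(':memory:')
conn.execute("CREATE TABLE superstore (region TEXT, sales REAL)")
conn.executemany("INSERT INTO superstore VALUES (?, ?)", [('West', 10.0), ('East', 20.0)])

cache_file = os.path.join(tempfile.mkdtemp(), 'superstore_cache.csv')
first = cached_query("SELECT * FROM superstore", conn, cache_file)   # hits the database
second = cached_query("SELECT * FROM superstore", conn, cache_file)  # hits the CSV
conn.close()
```

Delete the cache file whenever you want the next run to pull fresh data from the database.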
From SQLite to Production Databases
Everything you have learned in this module using SQLite transfers directly to production databases. The only thing that changes is the connection setup. The SQL syntax, pd.read_sql(), parameterised queries, and the SQL-Python workflow are identical.
Here is how connections look for the most common production databases:
```python
# PostgreSQL — using psycopg2
import psycopg2
conn = psycopg2.connect(
    host="your-host",
    database="your-db",
    user="your-user",
    password="your-password"
)

# MySQL — using mysql-connector-python
import mysql.connector
conn = mysql.connector.connect(
    host="your-host",
    database="your-db",
    user="your-user",
    password="your-password"
)

# BigQuery — using google-cloud-bigquery
from google.cloud import bigquery
client = bigquery.Client()
df = client.query("SELECT * FROM dataset.table LIMIT 10").to_dataframe()

# Once connected, pd.read_sql() works the same way for the PostgreSQL and
# MySQL connections above (BigQuery uses its own client, as shown)
df = pd.read_sql("SELECT * FROM orders LIMIT 10", conn)
```
The credentials and connection libraries differ. The workflow after that — SQL queries, pd.read_sql(), DataFrames — is exactly the same. What you have learned here scales directly to enterprise databases handling billions of rows.
Common Mistakes in the SQL-Python Workflow
| Mistake | What Happens | Fix |
|---|---|---|
| Leaving connections open | File locking, resource leaks | Always call conn.close() or use a context manager |
| Building queries with f-strings and user input | SQL injection risk, apostrophe bugs | Use parameterised queries with ? placeholders |
| Pulling full tables into pandas before filtering | Slow, memory-heavy, unprofessional | Filter in SQL first, bring only what you need into Python |
| Re-running expensive queries every notebook restart | Slow development cycle | Save results to CSV after the first run |
| Not resetting the index after pd.read_sql() | Index issues in downstream operations | Add .reset_index(drop=True) if needed |
| Hardcoding credentials in notebooks | Security risk if shared | Use environment variables or a config file |
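For the last row of the table, reading credentials from environment variables can look like this. The variable names (DB_HOST and so on) are a common convention, not a requirement of any library:

```python
import os

# Pull credentials from the environment instead of hardcoding them.
# The second argument to .get() is a fallback for local development only.
db_config = {
    'host': os.environ.get('DB_HOST', 'localhost'),
    'database': os.environ.get('DB_NAME', 'analytics'),
    'user': os.environ.get('DB_USER', ''),
    'password': os.environ.get('DB_PASSWORD', ''),
}

# These keys match the keyword arguments psycopg2.connect() expects:
# conn = psycopg2.connect(**db_config)
```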
Practice Exercises
- Connect to your Superstore database, pull all orders from 2020 using a SQL WHERE filter, and calculate the monthly revenue trend in pandas.
- Write a parameterised query that accepts a region name as a variable and returns total sales and profit for that region. Test it for all four regions in a loop.
- Pull the top 10 customers by total sales using SQL GROUP BY. Then in pandas, add a column showing each customer’s share of total revenue as a percentage.
- Write a complete workflow: SQL pulls order data joined with customer and product tables. pandas calculates profit margin per segment. Print a clean summary table.
- Save the result of a complex JOIN query to a CSV file. Then reload it from CSV in a new cell and confirm the row count matches.
Summary — What You Can Now Do
- Open, use, and close a SQLite connection correctly in Python
- Use context managers for safe automatic connection handling
- Query a database using pd.read_sql() and work with the result as a pandas DataFrame
- Write parameterised queries to safely pass variable values into SQL
- Read large result sets in chunks using the chunksize parameter
- Write pandas DataFrames back to a database table using to_sql()
- Decide confidently whether a filter or aggregation belongs in SQL or pandas
- Build a clean end-to-end SQL-Python workflow from database connection to final insight
- Understand how the SQLite workflow transfers directly to production databases
Module 2 Complete
You have now finished all six topics in Module 2. Here is what you can do that you could not at the start of this module:
- Query any relational database using SELECT, WHERE, ORDER BY, LIMIT, and DISTINCT
- Summarise data with COUNT, SUM, AVG, MIN, MAX, GROUP BY, and HAVING
- Combine multiple tables using INNER JOIN and LEFT JOIN
- Write subqueries in WHERE and FROM clauses for multi-step analysis
- Connect SQL to Python, retrieve results as DataFrames, and decide where each tool does its best work
The Mini Project for this module brings all of this together. You will use SQL to query the Superstore database, answer five business questions, pull the results into pandas, and write a short plain-English brief of your findings — exactly what a junior analyst would be asked to do in their first month on the job.
Up next — Module 2 Mini Project
Five business questions. One database. SQL queries, pandas output, and a written brief. Your first end-to-end analyst deliverable.