Why SQL and Python Belong Together
Every topic in this module has used Python to run SQL queries. You have been writing SQL inside Python strings and passing them to pd.read_sql(). That combination is not a workaround — it is the standard professional workflow for data analysts who work with databases.
SQL and Python are not competing tools. They are complementary layers in the same pipeline. SQL is where you retrieve, filter, and summarise data at the database level. Python is where you transform, visualise, model, and communicate that data. Understanding where one ends and the other begins is one of the most practically valuable things you can take away from this module.
This topic goes deeper into that boundary. You will learn how the connection between SQLite and Python actually works, how to manage that connection properly, how to decide what belongs in SQL versus pandas, and how to structure a clean repeatable workflow that scales from a local SQLite file to a production cloud database.
How the sqlite3 Connection Works
Every query you have run in this module started with one line:
```python
conn = sqlite3.connect('superstore.db')
```
That line opens a connection to a SQLite database file. A connection is a live channel between your Python session and the database. Through that channel you can send SQL statements and receive results back as Python objects.
Understanding the connection lifecycle matters because connections consume resources. A well-written analysis opens a connection, does its work, and closes the connection cleanly. A poorly written one leaves connections open, which can cause file locking issues and unpredictable behaviour, especially when multiple processes access the same database.
Opening and Closing Connections Properly
```python
import sqlite3
import pandas as pd

# Open the connection
conn = sqlite3.connect('superstore.db')

# Do your work
df = pd.read_sql("SELECT * FROM superstore LIMIT 5", conn)

# Always close when done
conn.close()
```
For longer notebooks where you need the connection throughout, the best practice is to open it once at the top and close it once at the bottom — not open and close it around every query.
Using a Context Manager
Python’s with statement can handle the connection lifecycle automatically. One subtlety: using a sqlite3 connection directly as a context manager (with sqlite3.connect(...) as conn:) commits or rolls back the current transaction at the end of the block, but does not close the connection. To have the connection close itself when the block ends, even if an error occurs inside it, wrap it in contextlib.closing():

```python
from contextlib import closing

# Context manager — connection closes automatically
with closing(sqlite3.connect('superstore.db')) as conn:
    df = pd.read_sql("SELECT * FROM superstore LIMIT 5", conn)
    print(df)
# conn is closed here automatically
```
For notebook-based analysis the manual approach is fine. For scripts that run automatically — scheduled reports, data pipelines — always use the context manager.
Checking What Tables Exist
When working with an unfamiliar database, the first thing you want to know is what tables are available:
```python
conn = sqlite3.connect('superstore.db')

# List all tables in the database
tables = pd.read_sql("""
    SELECT name
    FROM sqlite_master
    WHERE type = 'table'
    ORDER BY name
""", conn)
print(tables)
```
sqlite_master is SQLite’s internal catalogue table. It stores metadata about everything in the database — tables, indexes, and views. This query is the SQLite equivalent of asking “what is in here?” when you open an unfamiliar database for the first time.
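Once you know which tables exist, the natural follow-up question is what columns each one has. SQLite answers this with PRAGMA table_info, which pd.read_sql can read like any query. A minimal sketch: the tiny in-memory table here stands in for the real superstore.db so the example runs on its own.

```python
import sqlite3
import pandas as pd

# In-memory stand-in for superstore.db, just for illustration
conn = sqlite3.connect(':memory:')
conn.execute("CREATE TABLE superstore (order_id TEXT, region TEXT, sales REAL)")

# PRAGMA table_info returns one row per column: cid, name, type, notnull, ...
cols = pd.read_sql("PRAGMA table_info(superstore)", conn)
print(cols[['name', 'type']])

conn.close()
```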
pd.read_sql() — The Bridge Between SQL and pandas
pd.read_sql() is the function that executes a SQL query and returns the result directly as a pandas DataFrame. It is the core of the SQL-Python workflow.
```python
# Basic usage
df = pd.read_sql(sql_query, connection)
```
Once the result is a DataFrame you have the full pandas toolkit available — filtering, reshaping, visualisation, statistical analysis, and everything from Module 1.
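For instance, a result loaded this way can go straight into ordinary pandas operations such as groupby. A quick sketch, using a small in-memory table in place of the course database:

```python
import sqlite3
import pandas as pd

# Small in-memory table standing in for superstore.db
conn = sqlite3.connect(':memory:')
conn.execute("CREATE TABLE superstore (region TEXT, sales REAL)")
conn.executemany(
    "INSERT INTO superstore VALUES (?, ?)",
    [('West', 100.0), ('East', 200.0), ('West', 50.0)]
)

df = pd.read_sql("SELECT * FROM superstore", conn)
conn.close()

# From here on it is ordinary pandas
print(df.groupby('region')['sales'].mean())
```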
Passing Parameters Safely
When your query needs to include a variable value — a user input, a date from a loop, a value from another DataFrame — never build the query by concatenating strings. This is a security risk called SQL injection and also causes bugs when values contain special characters like apostrophes.
Instead use parameterised queries:
```python
# Unsafe — never do this
region = "West"
df = pd.read_sql(f"SELECT * FROM superstore WHERE region = '{region}'", conn)

# Safe — use parameters
region = "West"
df = pd.read_sql(
    "SELECT * FROM superstore WHERE region = ?",
    conn,
    params=(region,)
)
```
The ? placeholder gets replaced safely by the value in params. SQLite handles the escaping automatically. This is especially important when the variable value comes from user input or an external source.
Passing Multiple Parameters
```python
# Filter by region and minimum sales value
region = "West"
min_sales = 500

df = pd.read_sql(
    """
    SELECT order_id, region, sales, profit
    FROM superstore
    WHERE region = ?
      AND sales > ?
    ORDER BY sales DESC
    """,
    conn,
    params=(region, min_sales)
)
```
Parameters are passed as a tuple in the same order as the ? placeholders appear in the query.
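Besides positional ? placeholders, sqlite3 also accepts named placeholders of the form :name, with params given as a dictionary. This reads more clearly when a query takes several parameters. A sketch against a small in-memory table:

```python
import sqlite3
import pandas as pd

# Small in-memory table for illustration
conn = sqlite3.connect(':memory:')
conn.execute("CREATE TABLE superstore (order_id TEXT, region TEXT, sales REAL)")
conn.executemany(
    "INSERT INTO superstore VALUES (?, ?, ?)",
    [('O1', 'West', 250.0), ('O2', 'East', 900.0), ('O3', 'West', 120.0)]
)

# Named placeholders: the dict keys match the :names in the query
df = pd.read_sql(
    "SELECT * FROM superstore WHERE region = :region AND sales > :min_sales",
    conn,
    params={'region': 'West', 'min_sales': 200}
)
print(df)
conn.close()
```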
Reading Large Datasets in Chunks
When a query returns a very large result set — millions of rows — loading everything into memory at once can crash your notebook. pd.read_sql() supports chunked reading via the chunksize parameter:
```python
# Read in chunks of 1000 rows at a time
chunks = pd.read_sql(
    "SELECT * FROM superstore",
    conn,
    chunksize=1000
)

# Process each chunk
dfs = []
for chunk in chunks:
    # Apply any row-level processing here
    dfs.append(chunk)

df = pd.concat(dfs, ignore_index=True)
print(f"Total rows loaded: {len(df)}")
```
For the Superstore dataset this is not necessary — 10,000 rows loads instantly. But on a production database with millions of rows it is an essential technique to know.
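When the goal is a summary rather than the full table, you do not even need to keep the chunks: accumulate a running aggregate as each one arrives, so memory use stays flat regardless of table size. A sketch, with a small in-memory table standing in for a large production one:

```python
import sqlite3
import pandas as pd

# Build a small table; in practice this would be a large production table
conn = sqlite3.connect(':memory:')
conn.execute("CREATE TABLE superstore (region TEXT, sales REAL)")
conn.executemany(
    "INSERT INTO superstore VALUES (?, ?)",
    [('West', 100.0), ('East', 200.0), ('West', 50.0), ('East', 25.0)] * 500
)
conn.commit()

# Add each chunk's per-region subtotal into a running total
total_by_region = pd.Series(dtype='float64')
for chunk in pd.read_sql("SELECT region, sales FROM superstore", conn, chunksize=300):
    total_by_region = total_by_region.add(
        chunk.groupby('region')['sales'].sum(), fill_value=0
    )

print(total_by_region)
conn.close()
```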
Writing Data Back to SQLite
The SQL-Python bridge works in both directions. You can read data from a database into pandas, and you can write a pandas DataFrame back into a database as a table.
to_sql() — Writing a DataFrame to a Database Table
```python
# Create a summary DataFrame in pandas
summary = df.groupby('region').agg(
    total_sales=('sales', 'sum'),
    total_profit=('profit', 'sum'),
    order_count=('order_id', 'count')
).reset_index().round(2)

# Write it back to the database as a new table
summary.to_sql(
    'region_summary',      # table name
    conn,
    if_exists='replace',   # replace if table already exists
    index=False            # don't write the DataFrame index as a column
)
print("Summary table written to database.")
```
The if_exists parameter controls what happens if the table already exists:
- replace — drop and recreate the table
- append — add rows to the existing table
- fail — raise an error (the default)
Once written back to the database, you can query this summary table with SQL just like any other table.
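A quick round trip makes the point: write a DataFrame with to_sql, then query it back with plain SQL. The small frame here is illustrative:

```python
import sqlite3
import pandas as pd

conn = sqlite3.connect(':memory:')

# A small summary frame standing in for the groupby result above
summary = pd.DataFrame({
    'region': ['East', 'West'],
    'total_sales': [1200.50, 980.25],
})

# Write it, then query it like any other table
summary.to_sql('region_summary', conn, if_exists='replace', index=False)
top = pd.read_sql(
    "SELECT region, total_sales FROM region_summary ORDER BY total_sales DESC LIMIT 1",
    conn
)
print(top)
conn.close()
```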
When to Filter in SQL vs pandas
This is the most practically important decision in the SQL-Python workflow. The wrong choice does not break anything — both tools can filter data. But the right choice makes your analysis faster, cleaner, and more professional.
The Core Principle
Filter and aggregate in SQL. Transform, visualise, and model in Python.
SQL runs inside the database engine which is optimised for filtering and aggregating large datasets. When you push filtering into SQL, only the rows you actually need travel from the database to Python. When you pull everything into Python and filter there, you are loading unnecessary data into memory and doing work that the database could have done more efficiently.
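You can see the difference directly: both approaches return the same rows, but the SQL filter sends half as much data across the bridge. A sketch with an in-memory table:

```python
import sqlite3
import pandas as pd

# 2,000-row in-memory table, half 'West' and half 'East'
conn = sqlite3.connect(':memory:')
conn.execute("CREATE TABLE superstore (region TEXT, sales REAL)")
conn.executemany(
    "INSERT INTO superstore VALUES (?, ?)",
    [('West', 100.0), ('East', 200.0)] * 1000
)

# Filter in pandas: all 2,000 rows travel to Python first
all_rows = pd.read_sql("SELECT * FROM superstore", conn)
west_pandas = all_rows[all_rows['region'] == 'West']

# Filter in SQL: only the 1,000 matching rows travel
west_sql = pd.read_sql("SELECT * FROM superstore WHERE region = 'West'", conn)

print(len(all_rows), len(west_pandas), len(west_sql))
conn.close()
```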
Filter in SQL When
```python
# ✅ Row-level filters that reduce data volume
query("""
    SELECT *
    FROM superstore
    WHERE region = 'West'
      AND order_date >= '2021-01-01'
""")

# ✅ Aggregations that summarise large tables
query("""
    SELECT region, SUM(sales) AS total_sales
    FROM superstore
    GROUP BY region
""")

# ✅ JOINs that combine tables
query("""
    SELECT o.order_id, c.customer_name, oi.sales
    FROM orders o
    INNER JOIN customers c ON o.customer_id = c.customer_id
    INNER JOIN order_items oi ON o.order_id = oi.order_id
""")

# ✅ Deduplication before analysis
query("""
    SELECT DISTINCT customer_id, segment
    FROM customers
""")
```
Filter in pandas When
```python
# ✅ Complex conditional logic involving multiple Python objects
df['high_value'] = df['sales'] > df['sales'].mean() * 1.5

# ✅ String operations not easily done in SQL
df_filtered = df[df['customer_name'].str.contains('son', case=False)]

# ✅ Filtering based on values calculated in Python
threshold = df['profit'].quantile(0.75)
df_top = df[df['profit'] > threshold]

# ✅ Time-based filtering using pandas datetime methods
df['order_date'] = pd.to_datetime(df['order_date'])
df_recent = df[df['order_date'].dt.year == 2021]

# ✅ Filtering after a merge or reshape operation in pandas
merged = df1.merge(df2, on='customer_id')
filtered = merged[merged['total_orders'] > 3]
```
The Decision Framework
Ask yourself three questions before deciding where to filter:
- Does the filter reduce the number of rows significantly? If yes, do it in SQL. Bringing fewer rows into Python is always better.
- Does the filter require Python objects, methods, or calculated values that SQL cannot access? If yes, do it in pandas after loading.
- Is this a one-time exploration or a repeatable pipeline? For pipelines, push as much as possible into SQL for performance and reliability.
Building a Clean SQL-Python Workflow
Here is a complete, realistic analyst workflow that shows SQL and Python working together from raw database to final insight:
```python
import sqlite3
import pandas as pd
import matplotlib.pyplot as plt

# ── STEP 1: Connect ──────────────────────────────────────
conn = sqlite3.connect('superstore.db')

# ── STEP 2: Pull clean, pre-filtered data using SQL ──────
df = pd.read_sql("""
    SELECT
        c.segment,
        o.region,
        oi.category,
        oi.sub_category,
        o.order_date,
        ROUND(oi.sales, 2) AS sales,
        ROUND(oi.profit, 2) AS profit
    FROM orders o
    INNER JOIN customers c
        ON o.customer_id = c.customer_id
    INNER JOIN order_items oi
        ON o.order_id = oi.order_id
    WHERE o.order_date >= '2021-01-01'
""", conn)

# ── STEP 3: Convert types in pandas ──────────────────────
df['order_date'] = pd.to_datetime(df['order_date'])
df['month'] = df['order_date'].dt.to_period('M')

# ── STEP 4: Further analysis in pandas ───────────────────
# Profit margin by segment
df['profit_margin'] = (df['profit'] / df['sales'] * 100).round(2)

# Monthly revenue trend
monthly = df.groupby('month')['sales'].sum().reset_index()
monthly.columns = ['month', 'total_sales']

# Segment performance
segment = df.groupby('segment').agg(
    total_sales=('sales', 'sum'),
    total_profit=('profit', 'sum'),
    avg_margin=('profit_margin', 'mean')
).round(2).reset_index()

# ── STEP 5: Print insights ───────────────────────────────
print("=== Segment Performance (2021) ===")
print(segment.sort_values('total_profit', ascending=False))

print("\n=== Monthly Revenue Trend ===")
print(monthly)

# ── STEP 6: Close connection ─────────────────────────────
conn.close()
```
This workflow is the template for every analysis you will build in this course going forward. SQL handles retrieval and pre-filtering. Python handles enrichment, aggregation, and presentation. Each tool does what it is best at.
Saving Query Results for Reuse
When a query takes a long time to run — common on large production databases — save the result to a CSV or parquet file so you do not have to re-query every time you restart your notebook:
```python
# Run the heavy query once
df = pd.read_sql(heavy_query, conn)

# Save locally
df.to_csv('data/superstore_clean.csv', index=False)

# Next session — load from file instead of re-querying
df = pd.read_csv('data/superstore_clean.csv')
```
This is standard practice in professional analytics. Query the database to get fresh data when you need it. Work from a saved file during iterative analysis and visualisation where you are not changing the underlying data pull.
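This pattern can be wrapped in a small helper that hits the database only when no cached file exists yet. The function name cached_query is illustrative, not a standard API:

```python
import os
import sqlite3
import tempfile
import pandas as pd

def cached_query(sql, conn, cache_path):
    """Run the query once and cache to CSV; later calls read the file instead."""
    if os.path.exists(cache_path):
        return pd.read_csv(cache_path)
    df = pd.read_sql(sql, conn)
    df.to_csv(cache_path, index=False)
    return df

# Demo against a tiny in-memory table
conn = sqlite3.connect(':memory:')
conn.execute("CREATE TABLE superstore (region TEXT, sales REAL)")
conn.executemany("INSERT INTO superstore VALUES (?, ?)", [('West', 10.0), ('East', 20.0)])

cache_file = os.path.join(tempfile.mkdtemp(), 'superstore_cache.csv')
first = cached_query("SELECT * FROM superstore", conn, cache_file)   # hits the database
second = cached_query("SELECT * FROM superstore", conn, cache_file)  # hits the CSV
conn.close()
```

Delete the cache file whenever you want the next run to pull fresh data from the database.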
From SQLite to Production Databases
Everything you have learned in this module using SQLite transfers directly to production databases. The only thing that changes is the connection setup. The SQL syntax, pd.read_sql(), parameterised queries, and the SQL-Python workflow are identical.
Here is how connections look for the most common production databases:
```python
# PostgreSQL — using psycopg2
import psycopg2
conn = psycopg2.connect(
    host="your-host",
    database="your-db",
    user="your-user",
    password="your-password"
)

# MySQL — using mysql-connector-python
import mysql.connector
conn = mysql.connector.connect(
    host="your-host",
    database="your-db",
    user="your-user",
    password="your-password"
)

# BigQuery — using google-cloud-bigquery
from google.cloud import bigquery
client = bigquery.Client()
df = client.query("SELECT * FROM dataset.table LIMIT 10").to_dataframe()

# Once connected, pd.read_sql() works the same way for the PostgreSQL and
# MySQL connections above (BigQuery uses its own client, as shown)
df = pd.read_sql("SELECT * FROM orders LIMIT 10", conn)
```
The credentials and connection libraries differ. The workflow after that — SQL queries, pd.read_sql(), DataFrames — is exactly the same. What you have learned here scales directly to enterprise databases handling billions of rows.
Common Mistakes in the SQL-Python Workflow
| Mistake | What Happens | Fix |
|---|---|---|
| Leaving connections open | File locking, resource leaks | Always call conn.close() or use a context manager |
| Building queries with f-strings and user input | SQL injection risk, apostrophe bugs | Use parameterised queries with ? placeholders |
| Pulling full tables into pandas before filtering | Slow, memory-heavy, unprofessional | Filter in SQL first, bring only what you need into Python |
| Re-running expensive queries every notebook restart | Slow development cycle | Save results to CSV after the first run |
| Not resetting the index after pd.read_sql() | Index issues in downstream operations | Add .reset_index(drop=True) if needed |
| Hardcoding credentials in notebooks | Security risk if shared | Use environment variables or a config file |
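For the last row of the table, reading credentials from environment variables can look like this. The variable names (DB_HOST and so on) are a common convention, not a requirement of any library:

```python
import os

# Pull credentials from the environment instead of hardcoding them.
# The second argument to .get() is a fallback for local development only.
db_config = {
    'host': os.environ.get('DB_HOST', 'localhost'),
    'database': os.environ.get('DB_NAME', 'analytics'),
    'user': os.environ.get('DB_USER', ''),
    'password': os.environ.get('DB_PASSWORD', ''),
}

# These keys match the keyword arguments psycopg2.connect() expects:
# conn = psycopg2.connect(**db_config)
```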
Practice Exercises
- Connect to your Superstore database, pull all orders from 2020 using a SQL WHERE filter, and calculate the monthly revenue trend in pandas.
- Write a parameterised query that accepts a region name as a variable and returns total sales and profit for that region. Test it for all four regions in a loop.
- Pull the top 10 customers by total sales using SQL GROUP BY. Then in pandas, add a column showing each customer’s share of total revenue as a percentage.
- Write a complete workflow: SQL pulls order data joined with customer and product tables. pandas calculates profit margin per segment. Print a clean summary table.
- Save the result of a complex JOIN query to a CSV file. Then reload it from CSV in a new cell and confirm the row count matches.
Summary — What You Can Now Do
- Open, use, and close a SQLite connection correctly in Python
- Use context managers for safe automatic connection handling
- Query a database using pd.read_sql() and work with the result as a pandas DataFrame
- Write parameterised queries to safely pass variable values into SQL
- Read large result sets in chunks using the chunksize parameter
- Write pandas DataFrames back to a database table using to_sql()
- Decide confidently whether a filter or aggregation belongs in SQL or pandas
- Build a clean end-to-end SQL-Python workflow from database connection to final insight
- Understand how the SQLite workflow transfers directly to production databases
Module 2 Complete
You have now finished all six topics in Module 2. Here is what you can do that you could not at the start of this module:
- Query any relational database using SELECT, WHERE, ORDER BY, LIMIT, and DISTINCT
- Summarise data with COUNT, SUM, AVG, MIN, MAX, GROUP BY, and HAVING
- Combine multiple tables using INNER JOIN and LEFT JOIN
- Write subqueries in WHERE and FROM clauses for multi-step analysis
- Connect SQL to Python, retrieve results as DataFrames, and decide where each tool does its best work
The Mini Project for this module brings all of this together. You will use SQL to query the Superstore database, answer five business questions, pull the results into pandas, and write a short plain-English brief of your findings — exactly what a junior analyst would be asked to do in their first month on the job.
Up next — Module 2 Mini Project
Five business questions. One database. SQL queries, pandas output, and a written brief. Your first end-to-end analyst deliverable.