Tutorial: SQL
This tutorial will teach you the basics of SQL for querying relational databases.
Tutorial
This tutorial is based on the Library Carpentry SQL tutorial, licensed under CC-BY 4.0.
You can run this Jupyter notebook in Google Colab or on your local installation.
Note: In Colab, if the notebook is View Only, remember to make a copy, so that you can modify it:
- File –> Save a copy in Drive.
Attribution
Jordan Perdersen (Ed.), Kristin Lee (Ed.), Christopher Erdmann (Ed.), Lise Doucette (Ed.), Elaine Wong (Ed.), Janice Chan (Ed.), James Baker, Fernando Rios, Tim Dennis, Belinda Weaver, … orobecca. (2019, July). LibraryCarpentry/lc-sql: Library Carpentry: Introduction to SQL, June 2019 (Version v2019.06.1). Zenodo. http://doi.org/10.5281/zenodo.3266102
Modifications
Extracts modified to Jupyter notebook use by Stian Soiland-Reyes https://orcid.org/0000-0001-9842-9718.
References
- https://pythonspot.com/sqlite-database-with-pandas/
- https://pythoncourses.gumroad.com/l/KmxqY
- https://www.sqlitetutorial.net/
- https://wesmckinney.com/book/pandas-basics
import sqlite3
Data ingestion
In this tutorial we assume data has already been ingested and cleaned. We’ll retrieve the already prepared SQLite database.
SQLite is a miniature relational database that can be used without a server installation, for instance by directly importing the SQLite library in Python and “connecting” it to a local file. It is worth noting that this means only a single process can access the file at a time (in this case our Jupyter notebook).
Download SQLite database
The Zenodo deposit https://doi.org/10.5281/zenodo.8360812 contains a ZIP file with the pre-populated database in SQLite format. As this tutorial is not showing “proper” data ingestion, we’ll just download and unzip using Linux shell commands:
!wget https://zenodo.org/records/8360812/files/doaj-article-sample-db.zip && unzip doaj-article-sample-db.zip
--2025-07-08 15:12:03-- https://zenodo.org/records/8360812/files/doaj-article-sample-db.zip
Resolving zenodo.org (zenodo.org)... 188.185.45.92, 188.185.48.194, 188.185.43.25, ...
Connecting to zenodo.org (zenodo.org)|188.185.45.92|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 947007 (925K) [application/octet-stream]
Saving to: ‘doaj-article-sample-db.zip’
doaj-article-sample 100%[===================>] 924.81K 499KB/s in 1.9s
2025-07-08 15:12:06 (499 KB/s) - ‘doaj-article-sample-db.zip’ saved [947007/947007]
Archive: doaj-article-sample-db.zip
inflating: doaj-article-sample.db
inflating: doaj-article-sample.db.sql
Now let’s connect to this database from within Python using the sqlite3 library, and see which tables are included. In SQLite this can be done by inspecting the sqlite_master
table, using the SELECT *
query . We’ll use pandas as a helper to better view the SQL results in Jupyter Notebook.
import sqlite3
import pandas as pd
db = sqlite3.connect("doaj-article-sample.db")
pd.read_sql_query("SELECT * FROM sqlite_master", db)
type | name | tbl_name | rootpage | sql | |
---|---|---|---|---|---|
0 | table | sqlite_sequence | sqlite_sequence | 275 | CREATE TABLE sqlite_sequence(name,seq) |
1 | table | languages | languages | 408 | CREATE TABLE "languages" (\n "id" INTEGER NOT... |
2 | table | publishers | publishers | 679 | CREATE TABLE "publishers" (\n "id" INTEGER NO... |
3 | table | licences | licences | 2 | CREATE TABLE "licences" (\n "id" INTEGER NOT ... |
4 | table | journals | journals | 3 | CREATE TABLE "journals" (\n "id" INTEGER NOT ... |
5 | table | articles | articles | 5 | CREATE TABLE "articles" (\n "id" INTEGER NOT ... |
Introduction to SQL
In this tutorial we’ll follow the Library Carpentry guide, but run the queries from within Python instead of the DB Browser.
- Read through Section 1: Introduction to SQL (ignoring the DB Browser subsection).
Selecting and querying data
Next, follow the Section 2: Selecting and sorting data. Use Python’s """
strings to support multiple lines.
Note that you don’t need to use print()
, as Jupyter Notebook will output the result of the last line, which with Panda becomes a HTML table:
pd.read_sql_query("""
SELECT title
FROM articles
""", db)
Title | |
---|---|
0 | The Fisher Thermodynamics of Quasi-Probabilities |
1 | Aflatoxin Contamination of the Milk Supply: A ... |
2 | Metagenomic Analysis of Upwelling-Affected Bra... |
3 | Synthesis and Reactivity of a Cerium(III) Scor... |
4 | Performance and Uncertainty Evaluation of Snow... |
... | ... |
996 | Crystal structure of [3-(1H-benzimidazol-2-yl)... |
997 | Crystal structure of bis(3-bromopyridine-κN)bi... |
998 | Crystal structure of 4,4′-(ethane-1,2-diyl)bis... |
999 | Crystal structure of (Z)-4-[1-(4-acetylanilino... |
1000 | Metagenomic Analysis of Upwelling-Affected Bra... |
1001 rows × 1 columns
SQL results in Python
Above, Pandas’ DataFrame is rendered in Juyter Notebook, skipping most of the 1001 rows. Alternatively we can inspect the table returned in Python. This is beneficial in case you need to do further processing in code.
Note: You do not have to use Panda dataframes for accessing SQL databases from Python, see also the SQLite Python tutorial for accessing the database directly.
Below, the method .head()
is used to only select the first couple lines of the dataframe, while .itertuples()
iterates over each row in the SQL result. See the Getting started with pandas for further guidance on dataframes.
Warning: While SQL is generally case-insensitive, the database’s spelling of the column name will be returned, which in this case is Title
(instead of title
as in the query) – this matters below as Python code is case-sensitive.
articles = pd.read_sql_query("""
SELECT Title
FROM articles
""", db)
for article in articles.head().itertuples():
print(article.Title)
The Fisher Thermodynamics of Quasi-Probabilities
Aflatoxin Contamination of the Milk Supply: A Pakistan Perspective
Metagenomic Analysis of Upwelling-Affected Brazilian Coastal Seawater Reveals Sequence Domains of Type I PKS and Modular NRPS
Synthesis and Reactivity of a Cerium(III) Scorpionate Complex Containing a Redox Non-Innocent 2,2′-Bipyridine Ligand
Performance and Uncertainty Evaluation of Snow Models on Snowmelt Flow Simulations over a Nordic Catchment (Mistassibi, Canada)
Tip: You can modify the code of Jupyter notebook, and click each cell’s ▶️ button to run them. Use the [+ Code] button in Colab to add further code blocks. Use one code block per output.
Errors
Before we continue, let’s look quickly at different errors in the SQL, which will cause a Python exception.
Syntax error
The Python code is valid, but the SQL inside has a typo (SELET
instead of SELECT
):
pd.read_sql_query("""
SELET title
FROM articles
""", db)
---------------------------------------------------------------------------
OperationalError Traceback (most recent call last)
/usr/local/lib/python3.11/dist-packages/pandas/io/sql.py in execute(self, sql, params)
2673 try:
-> 2674 cur.execute(sql, *args)
2675 return cur
OperationalError: near "SELET": syntax error
The above exception was the direct cause of the following exception:
DatabaseError Traceback (most recent call last)
/tmp/ipython-input-22-3013307109.py in <cell line: 0>()
----> 1 pd.read_sql_query("""
2 SELET title
3 FROM articles
4 """, db)
/usr/local/lib/python3.11/dist-packages/pandas/io/sql.py in read_sql_query(sql, con, index_col, coerce_float, params, parse_dates, chunksize, dtype, dtype_backend)
524
525 with pandasSQL_builder(con) as pandas_sql:
--> 526 return pandas_sql.read_query(
527 sql,
528 index_col=index_col,
/usr/local/lib/python3.11/dist-packages/pandas/io/sql.py in read_query(self, sql, index_col, coerce_float, parse_dates, params, chunksize, dtype, dtype_backend)
2736 dtype_backend: DtypeBackend | Literal["numpy"] = "numpy",
2737 ) -> DataFrame | Iterator[DataFrame]:
-> 2738 cursor = self.execute(sql, params)
2739 columns = [col_desc[0] for col_desc in cursor.description]
2740
/usr/local/lib/python3.11/dist-packages/pandas/io/sql.py in execute(self, sql, params)
2684
2685 ex = DatabaseError(f"Execution failed on sql '{sql}': {exc}")
-> 2686 raise ex from exc
2687
2688 @staticmethod
DatabaseError: Execution failed on sql '
SELET title
FROM articles
': near "SELET": syntax error
Semantic error
Here the query is correct syntactically, but it’s referencing a column that does not exist:
pd.read_sql_query("""
SELECT wrongColumn
FROM articles
""", db)
---------------------------------------------------------------------------
OperationalError Traceback (most recent call last)
/usr/local/lib/python3.11/dist-packages/pandas/io/sql.py in execute(self, sql, params)
2673 try:
-> 2674 cur.execute(sql, *args)
2675 return cur
OperationalError: no such column: wrongColumn
The above exception was the direct cause of the following exception:
DatabaseError Traceback (most recent call last)
/tmp/ipython-input-23-818033597.py in <cell line: 0>()
----> 1 pd.read_sql_query("""
2 SELECT wrongColumn
3 FROM articles
4 """, db)
/usr/local/lib/python3.11/dist-packages/pandas/io/sql.py in read_sql_query(sql, con, index_col, coerce_float, params, parse_dates, chunksize, dtype, dtype_backend)
524
525 with pandasSQL_builder(con) as pandas_sql:
--> 526 return pandas_sql.read_query(
527 sql,
528 index_col=index_col,
/usr/local/lib/python3.11/dist-packages/pandas/io/sql.py in read_query(self, sql, index_col, coerce_float, parse_dates, params, chunksize, dtype, dtype_backend)
2736 dtype_backend: DtypeBackend | Literal["numpy"] = "numpy",
2737 ) -> DataFrame | Iterator[DataFrame]:
-> 2738 cursor = self.execute(sql, params)
2739 columns = [col_desc[0] for col_desc in cursor.description]
2740
/usr/local/lib/python3.11/dist-packages/pandas/io/sql.py in execute(self, sql, params)
2684
2685 ex = DatabaseError(f"Execution failed on sql '{sql}': {exc}")
-> 2686 raise ex from exc
2687
2688 @staticmethod
DatabaseError: Execution failed on sql '
SELECT wrongColumn
FROM articles
': no such column: wrongColumn
- First, try to correct the errors above, then re-execute the cells.
Next, follow the rest of section 2 onwards to the remaining SQL selection exercises.
Capitalization and good style
Follow exercise on good style below:
pd.read_sql_query("""
SELECT ...
""", db)
pd.read_sql_query("""
SELECT ...
""", db)
pd.read_sql_query("""
SELECT ...
""", db)
pd.read_sql_query("""
SELECT ...
""", db)
Unique values
Follow exercise on unique values
pd.read_sql_query("""
SELECT ...
""", db)
pd.read_sql_query("""
SELECT ...
""", db)
Sorting
Follow exercise on sorting
pd.read_sql_query("""
SELECT ...
""", db)
pd.read_sql_query("""
SELECT ...
""", db)
pd.read_sql_query("""
SELECT ...
""", db)
Filtering
Follow exercies on filtering
pd.read_sql_query("""
SELECT *
FROM articles
WHERE ISSNs = '2056-9890'
""", db).head()
id | Title | Authors | DOI | URL | Subjects | ISSNs | Citation | LanguageId | LicenceId | Author_Count | First_Author | Citation_Count | Day | Month | Year | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 142 | Crystal structure of 7-isopropyl-1,4a,N-trimet... | Li Liu|Xin-Yan Yan|Xiao-Ping Rao | 10.1107/S2056989015017648 | https://doaj.org/article/69bfe1351c864bc8a6b38... | crystal structure|dihydroabietic acid derivati... | 2056-9890 | Acta Crystallographica Section E: Crystallogra... | 1 | 1 | 3 | Li Liu | 10 | 1 | 10 | 2015 |
1 | 143 | Crystal structure of 2-(2,4-diphenyl-3-azabicy... | K. Priya|K. Saravanan|S. Kabilan|S. Selvanayagam | 10.1107/S2056989015017740 | https://doaj.org/article/ce085e7ac3bd43e49a122... | crystal structure|3-azabicyclononane derivativ... | 2056-9890 | Acta Crystallographica Section E: Crystallogra... | 1 | 1 | 4 | K. Priya | 10 | 1 | 10 | 2015 |
2 | 144 | Crystal structure of 2-[2-(hydroxyimino)-1-phe... | Brian J. Anderson|Michael B. Freedman|Sean P. ... | 10.1107/S2056989015017739 | https://doaj.org/article/4dd1a69aaefa478b8d2bb... | crystal structure|thiosemicarbazone|weak inter... | 2056-9890 | Acta Crystallographica Section E: Crystallogra... | 1 | 1 | 5 | Brian J. Anderson | 10 | 1 | 10 | 2015 |
3 | 145 | Crystal structure of 1-methoxy-5-methyl-N-phen... | Inna S. Khazhieva|Tatiana V. Glukhareva|Pavel ... | 10.1107/S2056989015017776 | https://doaj.org/article/8c3bd9d71ba642f5b28d0... | crystal structure|1,2,3-triazole|rearrangement... | 2056-9890 | Acta Crystallographica Section E: Crystallogra... | 1 | 1 | 4 | Inna S. Khazhieva | 10 | 1 | 10 | 2015 |
4 | 146 | Redetermined structure of 4,4′-bipyridine–1,4-... | Rima Paul|Sanchay Jyoti Bora | 10.1107/S2056989015017569 | https://doaj.org/article/b2d1ca2bb6fb4ba48c11e... | crystal structure|co-crystal|supramolecular in... | 2056-9890 | Acta Crystallographica Section E: Crystallogra... | 1 | 1 | 2 | Rima Paul | 10 | 1 | 10 | 2015 |
Note that in SQL, value comparisons is done with a single equal sign =
, unlike Python and other programing languagues typical use of ==
or .equals
pd.read_sql_query("""
SELECT ...
""", db).head()
pd.read_sql_query("""
SELECT ...
""", db).head()
pd.read_sql_query("""
SELECT ...
""", db).head()
pd.read_sql_query("""
SELECT ...
""", db).head()
Ordering and commenting
Follow exercises on Ordering and commenting
pd.read_sql_query("""
SELECT Title, Authors
FROM articles
WHERE ISSNs = '2067-2764|2247-6202'
ORDER BY First_Author ASC;
""", db)
Title | Authors | |
---|---|---|
0 | Zweier I-Convergent Double Sequence Spaces Def... | A. Khan Vakeel| Khan Nazneen|Yasmeen |
1 | New Čebyšev Type Inequalities for Functions wh... | B. Meftah|K. Boukerrioua |
2 | Initial Maclaurin Coecients Bounds for New Sub... | Basem Aref Frasin|Tariq Al-Hawary |
3 | Measure of Tessellation Quality of Voronoï Meshes | E. A-iyeh|J.F. Peters |
4 | The Applicability of $-Calculus to Solve Some ... | Eugene Eberbach |
5 | Some Families of q-Series Identities and Assoc... | H. M. Srivastava|S. N. Singh|S. P. Singh |
6 | Lp - Approximation of Analytic Functions on Co... | Kumar Devendra|Jain Vandna |
7 | A Mixed Integer Linear Programming Formulation... | Marija Ivanović |
8 | Properties of Stabilizing Computations | Mark Burgin |
9 | Primality Testing and Factorization by using F... | Musha Takaaki |
10 | Luhn Prime Numbers | Octavian Cira|Florentin Smarandache |
11 | Hadamard Product of Certain Harmonic Univalent... | R. M. El-Ashwah|B. A. Frasin |
12 | Second Hankel Determinant for Generalized Saka... | S. P. Vijayalakshmi|T. V. Sudharsan |
13 | Some Perturbed Ostrowski Type Inequalities for... | S. S. Dragomir |
14 | Katsaras’s Type Fuzzy Norm under Triangular Norms | Sorin Nădăban|Tudor Bînzar|Flavius Pater|Carme... |
15 | On BVσ I-convergent Sequence Spaces Defined by... | Vakeel A. Khan|Mohd Shafiq|Rami Kamel Ahmad Ra... |
16 | Coupled Systems of Fractional Integro-Differen... | Zoubir Dahmani|Mohamed Amin Abdellaoui|Mohamed... |
pd.read_sql_query("""
SELECT ...
""", db)
pd.read_sql_query("""
SELECT ...
""", db)
pd.read_sql_query("""
SELECT ...
""", db)
Aggregating values
Follow exercise on Aggregating and calculating values
pd.read_sql_query("""
SELECT ISSNs, AVG(Citation_Count)
FROM articles
GROUP BY ISSNs;
""", db)
ISSNs | AVG(Citation_Count) | |
---|---|---|
0 | 0367-0449|1988-3250 | 4.000000 |
1 | 1099-4300 | 4.000000 |
2 | 1420-3049 | 4.000000 |
3 | 1422-0067 | 8.000000 |
4 | 1424-8220 | 4.000000 |
5 | 1660-3397 | 4.000000 |
6 | 1660-4601 | 10.000000 |
7 | 1996-1944 | 4.000000 |
8 | 1999-4907 | 4.000000 |
9 | 1999-4915 | 4.000000 |
10 | 2056-9890 | 9.593240 |
11 | 2067-2764|2247-6202 | 8.647059 |
12 | 2071-1050 | 5.000000 |
13 | 2072-4292 | 5.000000 |
14 | 2072-6643 | 4.000000 |
15 | 2072-6651 | 4.000000 |
16 | 2072-6694 | 4.000000 |
17 | 2073-4344 | 4.000000 |
18 | 2073-4360 | 4.000000 |
19 | 2073-4395 | 4.000000 |
20 | 2073-4425 | 4.000000 |
21 | 2073-4441 | 4.000000 |
22 | 2073-8994 | 4.000000 |
23 | 2075-163X | 4.000000 |
24 | 2075-1702 | 4.000000 |
25 | 2075-4418 | 4.000000 |
26 | 2075-4442 | 4.000000 |
27 | 2075-4701 | 4.000000 |
28 | 2075-5309 | 4.000000 |
29 | 2076-0787 | 4.000000 |
30 | 2076-2615 | 4.000000 |
31 | 2076-3417 | 5.000000 |
32 | 2076-3905 | 4.000000 |
33 | 2076-393X | 4.000000 |
34 | 2077-0375 | 4.000000 |
35 | 2077-0472 | 5.000000 |
36 | 2077-1444 | 4.000000 |
37 | 2078-1547 | 4.000000 |
38 | 2079-6374 | 4.000000 |
39 | 2079-7737 | 4.000000 |
40 | 2079-8954 | 4.000000 |
41 | 2079-9268 | 8.000000 |
42 | 2220-9964 | 8.000000 |
43 | 2227-9717 | 4.000000 |
44 | 2250-1177 | 7.000000 |
45 | 2278-4748|2278-4802 | 8.000000 |
46 | 2304-6740 | 5.000000 |
47 | 2304-6767 | 5.000000 |
48 | 2305-6304 | 3.000000 |
49 | 2306-5338 | 4.000000 |
50 | 2306-5354 | 5.000000 |
pd.read_sql_query("""
SELECT ...
""", db)
pd.read_sql_query("""
SELECT ...
""", db)
pd.read_sql_query("""
SELECT ...
""", db)
Joins and aliases
Follow exercises on Joins and aliases
pd.read_sql_query("""
SELECT *
FROM articles
JOIN journals
ON articles.ISSNs = journals.ISSNs;
""", db).head()
id | Title | Authors | DOI | URL | Subjects | ISSNs | Citation | LanguageId | LicenceId | ... | First_Author | Citation_Count | Day | Month | Year | id | ISSN-L | ISSNs | PublisherId | Journal_Title | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0 | The Fisher Thermodynamics of Quasi-Probabilities | Flavia Pennini|Angelo Plastino | 10.3390/e17127853 | https://doaj.org/article/b75e8d5cca3f46cbbd63e... | Fisher information|quasi-probabilities|complem... | 1099-4300 | Entropy, Vol 17, Iss 12, Pp 7848-7858 (2015) | 1 | 1 | ... | Flavia Pennini | 4 | 1 | 11 | 2015 | 14 | 1099-4300 | 1099-4300 | 2 | Entropy |
1 | 1 | Aflatoxin Contamination of the Milk Supply: A ... | Naveed Aslam|Peter C. Wynn | 10.3390/agriculture5041172 | https://doaj.org/article/0edc5af6672641c0bd456... | aflatoxins|AFM1|AFB1|milk marketing chains|hep... | 2077-0472 | Agriculture (Basel), Vol 5, Iss 4, Pp 1172-118... | 1 | 1 | ... | Naveed Aslam | 5 | 1 | 11 | 2015 | 1 | 2077-0472 | 2077-0472 | 2 | Agriculture |
2 | 2 | Metagenomic Analysis of Upwelling-Affected Bra... | Rafael R. C. Cuadrat|Juliano C. Cury|Alberto M... | 10.3390/ijms161226101 | https://doaj.org/article/d9fe469f75a0442382b84... | PKS|NRPS|metagenomics|environmental genomics|u... | 1422-0067 | International Journal of Molecular Sciences, V... | 1 | 1 | ... | Rafael R. C. Cuadrat | 8 | 1 | 11 | 2015 | 22 | 1422-0067 | 1422-0067 | 2 | International Journal of Molecular Sciences |
3 | 3 | Synthesis and Reactivity of a Cerium(III) Scor... | Fabrizio Ortu|Hao Zhu|Marie-Emmanuelle Boulon|... | 10.3390/inorganics3040534 | https://doaj.org/article/95606ed39deb4f43b96f7... | lanthanide|cerium|scorpionate|tris(pyrazolyl)b... | 2304-6740 | Inorganics (Basel), Vol 3, Iss 4, Pp 534-553 (... | 1 | 1 | ... | Fabrizio Ortu | 5 | 1 | 11 | 2015 | 20 | 2304-6740 | 2304-6740 | 2 | Inorganics |
4 | 4 | Performance and Uncertainty Evaluation of Snow... | Magali Troin|Richard Arsenault|François Brissette | 10.3390/hydrology2040289 | https://doaj.org/article/18b1d70730d44573ab5c2... | snow models|hydrological models|snowmelt|uncer... | 2306-5338 | Hydrology, Vol 2, Iss 4, Pp 289-317 (2015) | 1 | 1 | ... | Magali Troin | 4 | 1 | 11 | 2015 | 19 | 2306-5338 | 2306-5338 | 2 | Hydrology |
5 rows × 21 columns
pd.read_sql_query("""
SELECT ...
""", db).head()
pd.read_sql_query("""
SELECT ...
""", db).head()
pd.read_sql_query("""
SELECT ...
""", db).head()
pd.read_sql_query("""
SELECT ...
""", db).head()
Aliases
Follow exercises on aliases
pd.read_sql_query("""
SELECT ...
""", db).head()
Saving queries
Follow exercise of Saving queries
Note: As DROP VIEW
and CREATE VIEW
do not return any result, below we have to use db.execute
directly rather than pd.read_sql_query
.
db.execute("DROP VIEW IF EXISTS journal_counts")
db.execute("""
CREATE VIEW ...
""")
pd.read_sql_query("""
SELECT ...
""", db).head()
Remember that a view is not stored as a separate table, but is a pre-canned query that can then itself be queries. That means updates to the tables will modify the views! SQL Views typically cannot themselves be modified with INSERT
.
Nevertheless, if a view was temporary, we may want to delete it so it does clutter our database:
db.execute("DROP VIEW journal_counts")
Additional reading
The remaining sections of Library Carpentry SQL relates to creating databases before they can be queried.
Tip: Sometimes knowing what kind of queries are needed will influence the design at earlier data engineering phases, e.g. reducing the need for complex JOIN
s.