One of the most important parts of the data analysis pipeline is the ability to collate data obtained from different sources. While the CSV file remains the most popular input format, production-level data analysis often requires fetching data from a remote source, which in the simplest case means downloading a CSV file. Often, however, the data must be pulled from a database, which in turn requires an understanding of relational databases and how to interact with them.

For this reason, I have spent the past week or two (at the time of this post) improving my knowledge of relational databases, and more specifically finally learning SQL. SQL has its origins at IBM, where the motivation was to “manipulate and retrieve data stored in IBM’s original quasirelational database management system”. Over the years, the language has grown extremely useful and powerful, with several variants now available that differ in syntax and a few other aspects.

With the importance of SQL highlighted, it is time to take a look at how one can run SQL queries within Python and finally convert the retrieved data to the familiar DataFrame format.

Required Libraries

This blog post uses sqlite3, the SQLite interface that ships with the Python standard library, and pandas (1.3.4), the most popular data analysis library. First, we import these as required:

import sqlite3
import pandas as pd 

Define helper functions

Next, to ease our SQL query handling, we define two helper functions: one to create a database connection and one to execute our SQL queries. These are defined below.

def create_connection(db_path):
    """Create a connection to the SQLite database at db_path."""
    try:
        conn = sqlite3.connect(db_path)
    except sqlite3.Error:
        raise  # re-raise so connection failures surface immediately
    else:
        return conn

The above function simply creates a connection to a database via the connect method from sqlite3. We use a try-except-else block to ensure that errors are raised whenever they occur and that a connection is returned on success.
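
For instance, pointing the function at a path whose directory does not exist (a hypothetical path, used here purely to trigger the failure branch) surfaces the underlying sqlite3 error:

# A path in a non-existent directory makes sqlite3.connect fail,
# and the bare raise re-surfaces the sqlite3.OperationalError
try:
    create_connection("/nonexistent/dir/sample.db")
except sqlite3.OperationalError as err:
    print(err)  # unable to open database file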

def execute_sql(conn, sql_command):
    """Execute a SQL command and return all fetched rows."""
    try:
        cursor = conn.cursor()
        cursor.execute(sql_command)
        res = cursor.fetchall()
        conn.commit()  # persist any changes made by INSERT/UPDATE/DELETE
    except sqlite3.Error:
        raise
    else:
        return res

The execute_sql function takes a sqlite3 Connection and a SQL command as its arguments. The Cursor object allows us to execute SQL queries via the execute method, after which fetchall retrieves any returned rows and commit persists any changes to the database. We use a try-except-else again, as above, to ease our error handling.
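
As a quick smoke test (a hypothetical example using ":memory:", SQLite's special name for a temporary in-memory database), the function returns query results as a list of tuples:

# Run a trivial query against a throwaway in-memory database
tmp_conn = create_connection(":memory:")
print(execute_sql(tmp_conn, "SELECT sqlite_version()"))  # e.g. [('3.37.2',)]
tmp_conn.close()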

Run SQL Queries

With our functions in place, we can now run our SQL queries in Python. First, we need to create a connection. One thing to note is that create_connection requires a path to a database .db file. If the provided path does not exist, a database will be created with that name. For the purposes of this blog post, we supply a non-existent database name (sample.db), meaning that it will be created for us.

conn = create_connection("sample.db")

Next, for our table, I chose to create a genes table with gene id as the primary key and a column of gene expression values.

res = execute_sql(conn, """
CREATE TABLE IF NOT EXISTS genes ([gene_id] TEXT PRIMARY KEY,
[gene_expression] NUMERIC)
""")

In SQL, to add values to a table, we use the INSERT command. While preparing this blog post, I noticed that an INSERT statement itself returns no rows, so fetchall hands back an empty list even when the values were added successfully; the changes also need to be committed before they are visible elsewhere (hence the conn.commit() in execute_sql). In addition, re-running a plain INSERT on an existing primary key raises a constraint error, so I use INSERT OR REPLACE INTO, which overwrites the row instead.

res = list(map(lambda gene, value: execute_sql(conn, f"INSERT OR REPLACE INTO genes VALUES ('{gene}', {value})"),
               ("RAN1", "RAN2", "RAN3", "RAN4", "RAN5", "RAN6"),
               (0.01, 0.02, 0.03, 0.00001, 0.01, 0.00004)))

As one may quickly realize, manually typing out every value to insert gets tedious. For this blog post, I use a lambda-map approach to semi-automate the insertion process.
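
As an aside, sqlite3 also provides executemany, which takes a parameterized statement with "?" placeholders and a sequence of rows. A minimal sketch of the same insertions with that approach (the rows list below simply mirrors the tuples above):

# Parameterized bulk insert: placeholders avoid formatting values into the SQL string
rows = [("RAN1", 0.01), ("RAN2", 0.02), ("RAN3", 0.03),
        ("RAN4", 0.00001), ("RAN5", 0.01), ("RAN6", 0.00004)]
cursor = conn.cursor()
cursor.executemany("INSERT OR REPLACE INTO genes VALUES (?, ?)", rows)
conn.commit()  # persist the inserted rows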

With the above, we have successfully created a SQL table with random gene ids and random gene expression values.
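
A quick SELECT * serves as a sanity check that the rows actually landed:

# Fetch every row from the genes table
print(execute_sql(conn, "SELECT * FROM genes"))
# e.g. [('RAN1', 0.01), ('RAN2', 0.02), ('RAN3', 0.03),
#       ('RAN4', 1e-05), ('RAN5', 0.01), ('RAN6', 4e-05)]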

SQL to Pandas

Finally, with our pre-sequel in place, it’s time to convert the result to a pandas DataFrame. We do this by simply calling DataFrame on our query result and supplying column names. Here I have used a SELECT within a SELECT (a subquery), both because I found it really fun to work with and because, if misused, it can result in disaster.

pd.DataFrame(execute_sql(conn, """
SELECT * FROM genes WHERE gene_expression < (SELECT gene_expression FROM genes WHERE gene_id = 'RAN2')
"""), columns=["geneid", "expression"])

This gives us

  geneid  expression
0   RAN1     0.01000
1   RAN4     0.00001
2   RAN5     0.01000
3   RAN6     0.00004
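
Incidentally, pandas can also run the query against the connection directly via its built-in read_sql_query, which picks up the column names from the table itself; a minimal sketch:

# Let pandas execute the query and name the columns from the schema
df = pd.read_sql_query("""
SELECT * FROM genes WHERE gene_expression <
    (SELECT gene_expression FROM genes WHERE gene_id = 'RAN2')
""", conn)
print(df)

Either way, it is good practice to close the connection with conn.close() once you are done.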

Outro

As always, this has been a short walk-through (mostly to reinforce concepts) of how to move from a SQL query to a pandas DataFrame. I hope it motivates a reader to use SQL more in their own work. The full notebook used in this post is available here.

Thank you for reading

Nelson

Recommended resources