Pandas is one of the best data manipulation libraries in recent times. It lets you slice and dice, groupby, join and do any arbitrary data transformation. You can take a look at this post , which talks about handling most of the data manipulation cases using a straightforward, simple, and matter of fact way using Pandas.
But even with how awesome pandas generally is, there sometimes are moments when you would like to have just a bit more. Say you come from a SQL background in which the same operation was too easy. Or you wanted to have more readable code. Or you just wanted to run an ad-hoc SQL query on your data frame. Or, maybe you come from R and want a replacement for sqldf.
For example, one of the operations that Pandas doesn’t have an alternative for is non-equi joins, which are quite trivial in SQL.
In this series of posts named Python Shorts , I will explain some simple but very useful constructs provided by Python, some essential tips, and some use cases I come up with regularly in my Data Science work.
This post is essentially about using SQL with pandas Dataframes.
But, what are non-equi joins, and why would I need them?
Let’s say you have to join two data frames. One shows us the periods where we offer some promotions on some items. And the second one is our transaction Dataframe. I want to know the sales that were driven by promotions, i.e., the sales that happen for an item in the promotion period.
We can do this by doing a join on the item column as well as a join condition (TransactionDt≥StartDt and TransactionDt≤EndDt). Since now our join conditions have a greater than and less than signs as well, such joins are called non-equi joins. Do think about how you will do such a thing in Pandas before moving on.
The Pandas Solution
So how will you do it in Pandas? Yes, a Pandas based solution exists, though I don’t find it readable enough.
Let’s start by generating some random data to work with.
import pandas as pd
import random
import datetime
def random_dt_bw(start_date,end_date):
days_between = (end_date - start_date).days
random_num_days = random.randrange(days_between)
random_dt = start_date + datetime.timedelta(days=random_num_days)
return random_dt
def generate_data(n=1000):
items = [f"i_{x}" for x in range(n)]
start_dates = [random_dt_bw(datetime.date(2020,1,1),datetime.date(2020,9,1)) for x in range(n)]
end_dates = [x + datetime.timedelta(days=random.randint(1,10)) for x in start_dates]
offerDf = pd.DataFrame({"Item":items,
"StartDt":start_dates,
"EndDt":end_dates})
transaction_items = [f"i_{random.randint(0,n)}" for x in range(5*n)]
transaction_dt = [random_dt_bw(datetime.date(2020,1,1),datetime.date(2020,9,1)) for x in range(5*n)]
sales_amt = [random.randint(0,1000) for x in range(5*n)]
transactionDf = pd.DataFrame({"Item":transaction_items,"TransactionDt":transaction_dt,"Sales":sales_amt})
return offerDf,transactionDf
offerDf,transactionDf = generate_data(n=100000)
You don’t need to worry about the random data generation code above. Just know how our random data looks like:
Once we have the data, we can do the non-equi join by merging the data on the column item and then filtering by the required condition.
merged_df = pd.merge(offerDf,transactionDf,on='Item')
pandas_solution = merged_df[(merged_df['TransactionDt']>=merged_df['StartDt']) &
(merged_df['TransactionDt']<=merged_df['EndDt'])]
The result is below just as we wanted:
The PandaSQL solution
The Pandas solution is alright, and it does what we want, but we could also have used PandaSQL to get the same thing done in a much more readable way.
What is PandaSQL ?
PandaSQL provides us with a way to write SQL on Pandas Dataframes. So if you have got some SQL queries already written, it might make more sense to use pandaSQL rather than converting them to pandas syntax. To get started with PandaSQL we install it simply with:
pip install -U pandasql
Once we have pandaSQL installed, we can use it by creating a pysqldf function that takes a query as an input and runs the query to return a Pandas DF. Don’t worry about the syntax; it remains more or less constant.
from pandasql import sqldf
pysqldf = lambda q: sqldf(q, globals())
We can now run any SQL query on our Pandas data frames using this function. And, below is the non-equi join, we want to do in the much more readable SQL format.
q = """
SELECT A.*,B.TransactionDt,B.Sales
FROM
offerDf A
INNER JOIN
transactionDf B
ON
A.Item = B.Item AND
A.StartDt <= B.TransactionDt AND
A.EndDt >= B.TransactionDt;
"""
pandaSQL_solution = pysqldf(q)
The result is a pandas Dataframe as we would expect. The index is already reset for us, unlike before.
Caveats:
While the PandaSQL function lets us run SQL queries on our Pandas data frames and is an excellent tool to be aware of in certain situations, it is not as performant as pure pandas syntax.
When we time Pandas against the more readable PandaSQL, we find that the PandaSQL takes around 10x the time of native Pandas.
Conclusion
In this post of the Python Shorts series, we learned about pandaSQL, which lets us use SQL queries on our Dataframes. We also looked at how to do non-equi joins using both native pandas as well as pandaSQL.
While the PandaSQL library is not as performant as native pandas, it is a great addition to our data analytics toolbox when we want to do ad-hoc analysis and to people who feel much more comfortable with using SQL queries.
For a closer look at the code for this post, please visit my GitHub repository, where you can find the code for this post as well as all my posts.
Continue Learning
If you want to learn more about Python 3, I would like to call out an excellent course on Learn Intermediate level Python from the University of Michigan. Do check it out.
I am going to be writing more beginner-friendly posts in the future too. Follow me up at Medium or Subscribe to my blog .
Also, a small disclaimer — There might be some affiliate links in this post to relevant resources, as sharing knowledge is never a bad idea.