Pandas For All - Some Basic Pandas Functions
It has been quite a few days I have been working with Pandas and apparently I feel I have gotten quite good at it. (Quite a Braggard I know) So thought about adding a post about Pandas usage here. I intend to make this post quite practical and since I find the pandas syntax quite self explanatory, I won’t be explaining much of the codes. Just the use cases and the code to achieve them.
1. Import Pandas
We Start by importing the libraries that we will need to use.
import pandas as pd
2. Read a Datasource:
# Read from csv data files
# With Header
df = pd.read_csv("/Users/ragarw5/Downloads/SalesJan2009.csv")
# Without Header. sep param to provide the delimiter
df = pd.read_csv("/Users/ragarw5/Downloads/SalesJan2009.csv", header=None, sep= ",")
# Reading from SQL Datasource
import MySQLdb
from pandas import DataFrame
from pandas.io.sql import read_sql
db = MySQLdb.connect(host="localhost", # your host, usually localhost
user="root", # your username
passwd="password", # your password
db="dbname") # name of the data base
query = "SELECT * FROM tablename"
data = read_sql(query, db)
# Reading from ExcelFile
data = pd.read_excel(filename)
For now, we will be working with the file at http://samplecsvs.s3.amazonaws.com/SalesJan2009.csv. The Sales Jan 2009 file contains some “sanitized” sales transactions during the month of January. If you want to work along you can download this file from that location.
df = pd.read_csv("/Users/ragarw5/Downloads/SalesJan2009.csv")
3. See few rows of data:
# top 5 rows
df.head()
# top 50 rows
df.head(50)
# last 5 rows
df.tail()
# last 50 rows
df.tail(50)
4. Getting Column Names in a list:
columnnames = df.columns
5. Specifying user defined Column Names:
Sometimes you want to change the column names:
df.columns = ['Transdate', 'Product', 'Price', 'PaymentType', 'Name',
'City', 'State', 'Country', 'AccountCreated', 'LastLogin',
'Latitude', 'Longitude']
6. Subsetting specific columns:
Sometimes you only need to work with specific columns in a dataframe only. You can subset the columns in the dataframe using
newDf = df[['Product', 'Price', 'PaymentType', 'Name', 'City', 'State', 'Country']]
7. Seeing column types:
newDf.dtypes
8. Change type of a column
First thing i try is this.
newDf['Price'] = newDf['Price'].astype('int')
It gives error : ValueError: invalid literal for long() with base 10: ‘13,000’. That is you cannot cast a string with “,” to an int. To do that we first have to get rid of the comma. For that we use a particular lambda-apply functionality which lets us apply functions to each row in the data.
newDf['Price'] = newDf.apply(lambda x: int(x['Price'].replace(',', '')),axis=1)
9. Simple Dataframe Statistics:
# To get statistics of numerical columns
newDf.describe()
# To get maximum value of a column. When you take a single column you can think of it as a list and apply functions you would apply to a list
max(newDf['Price'])
# no of rows in dataframe
len(newDf)
# Shape of Dataframe
newDf.shape
10. Creating a new column:
# Create a column Address containing City,State and Country. Simply concat the columns.
newDf['Address'] = newDf['City'] +","+ newDf['State'] +","+ newDf['Country']
# I like to use a function defined approach with lambda-apply as it gives me more flexibility and more options. Like if i want to create a column which is 1 if the price is greater than 1200 and 0 otherwise.
def gt(x):
if x>1200:
return 1
else:
return 0
newDf['Pricegt1200'] = newDf.apply(lambda x: gt(x['Price']),axis=1)
11. Subset a DataFrame:
# Single condition: dataframe with all entries priced greater than 1500
df_gt_1500 = newDf[newDf['Price']>1500]
# Multiple conditions: AND - dataframe with all entries priced greater than 1500 and from London
And_df = newDf[(newDf['Price']>1500) & (newDf['City']=='London')]
# Multiple conditions: OR - dataframe with all entries priced greater than 1500 or from London
Or_df = newDf[(newDf['Price']>1500) | (newDf['City']=='London')]
# Multiple conditions: NOT - dataframe with all entries priced greater than 1500 or from London have to be excluded
Not_df = newDf[~((newDf['Price']>1500) | (newDf['City']=='London'))]
12. Change the Column at particular places or impute:
# In the state column the state is abbreviated as 'TX'. We want the whole name 'Texas' in there
newDf.loc[newDf['State']=='TX','State'] = 'Texas'
# When City is Monaco State is not given. You want to impute 'Monaco State' as state also.
newDf.loc[newDf['City']=='Monaco','State'] = 'Monaco State'
13. GroupBy:
One of the most used functionality. One simple example
# Find out the sum of transactions by a state. reset_index() is a function that resets the index of a dataframe. I apply this function ALWAYS whenever I do a groupby and you might think of it as a default syntax for groupby operations
import numpy as np
newDf.groupby(['State']).aggregate(np.sum).reset_index()
# You might get a few extra columns that you dont need. Just subset the columns in the dataframe. You could just chain the commands to subset for the columns you need.
newDf.groupby(['State']).aggregate(np.sum).reset_index()[['State','Price']]
# Find minimum transaction in each state
newDf.groupby(['State']).aggregate(np.min).reset_index()[['State','Price']]
# You might want to groupby more than one column
newDf.groupby(['State','City']).aggregate(np.sum).reset_index()[['State','City','Price']]
14. Concat:
You have two datarames df1 and df2 you need to concat. Means append one below the other you can do it using:
pd.concat([df1,df2])
15. Merge:
#Suppose in the start, you had two dataframes. One which contains city and price information:
City_Price = newwDf[['City','Price']]
#And another which contains 'City' and 'State' insformation
City_State = newDf[['City','State']].drop_duplicates(keep=False).reset_index()
#You need to merge these datatframes on basis of city. You need to do:
City_Price_State_df = pd.merge(City_Price,City_State,on=['City'],how='left')
16. Save a Dataframe to external File:
# To Csv file
newDf.to_csv("NewDfData.csv",index=False)
# To Excel File
from pandas import ExcelWriter
writer = ExcelWriter('NewDfData.xlsx')
newDf.to_excel(writer,'Sheet1')
writer.save()
17. Pushing Pandas Df to a sql database:
from pandas.io import sql
import MySQLdb
db = MySQLdb.connect(host="localhost", # your host, usually localhost
user="root", # your username
passwd="password", # your password
db="dbname") # name of the data base
newDf.to_sql(con = db, name='tablename',if_exists='append',flavor='mysql', chunksize=10000,index=False)
Hope you found this post useful and worth your time. I tried to make this as simple as possible but You may always ask me or see the documentation for doubts.
If you have any more ideas on how to use Pandas or other usecases, please suggest in the comments section.
Till then ciao!!
References
Intro to Pandas By Greg Rada What I have written is in a condensed form, If you want to get a detailed description visit Greg Rada’s 3 posts series.