It has been quite a few days I have been working with Pandas and apparently I feel I have gotten quite good at it. (Quite a Braggard I know) So thought about adding a post about Pandas usage here. I intend to make this post quite practical and since I find the pandas syntax quite self explanatory, I won't be explaining much of the codes. Just the use cases and the code to achieve them.
1. Import Pandas
We Start by importing the libraries that we will need to use.
import pandas as pd
2. Read a Datasource:
# Read from csv data files # With Header df = pd.read_csv("/Users/ragarw5/Downloads/SalesJan2009.csv") # Without Header. sep param to provide the delimiter df = pd.read_csv("/Users/ragarw5/Downloads/SalesJan2009.csv", header=None, sep= ",") # Reading from SQL Datasource import MySQLdb from pandas import DataFrame from pandas.io.sql import read_sql db = MySQLdb.connect(host="localhost", # your host, usually localhost user="root", # your username passwd="password", # your password db="dbname") # name of the data base query = "SELECT * FROM tablename" data = read_sql(query, db) # Reading from ExcelFile data = pd.read_excel(filename)
For now, we will be working with the file at http://samplecsvs.s3.amazonaws.com/SalesJan2009.csv. The Sales Jan 2009 file contains some “sanitized” sales transactions during the month of January. If you want to work along you can download this file from that location.
df = pd.read_csv("/Users/ragarw5/Downloads/SalesJan2009.csv")
3. See few rows of data:
# top 5 rows df.head() # top 50 rows df.head(50) # last 5 rows df.tail() # last 50 rows df.tail(50)
4. Getting Column Names in a list:
columnnames = df.columns
5. Specifying user defined Column Names:
Sometimes you want to change the column names:
df.columns = ['Transdate', 'Product', 'Price', 'PaymentType', 'Name', 'City', 'State', 'Country', 'AccountCreated', 'LastLogin', 'Latitude', 'Longitude']
6. Subsetting specific columns:
Sometimes you only need to work with specific columns in a dataframe only. You can subset the columns in the dataframe using
newDf = df[['Product', 'Price', 'PaymentType', 'Name', 'City', 'State', 'Country']]
7. Seeing column types:
8. Change type of a column
First thing i try is this.
newDf['Price'] = newDf['Price'].astype('int')
It gives error : ValueError: invalid literal for long() with base 10: '13,000'. That is you cannot cast a string with "," to an int. To do that we first have to get rid of the comma. For that we use a particular lambda-apply functionality which lets us apply functions to each row in the data.
newDf['Price'] = newDf.apply(lambda x: int(x['Price'].replace(',', '')),axis=1)
9. Simple Dataframe Statistics:
# To get statistics of numerical columns newDf.describe() # To get maximum value of a column. When you take a single column you can think of it as a list and apply functions you would apply to a list max(newDf['Price']) # no of rows in dataframe len(newDf) # Shape of Dataframe newDf.shape
10. Creating a new column:
# Create a column Address containing City,State and Country. Simply concat the columns. newDf['Address'] = newDf['City'] +","+ newDf['State'] +","+ newDf['Country'] # I like to use a function defined approach with lambda-apply as it gives me more flexibility and more options. Like if i want to create a column which is 1 if the price is greater than 1200 and 0 otherwise. def gt(x): if x>1200: return 1 else: return 0 newDf['Pricegt1200'] = newDf.apply(lambda x: gt(x['Price']),axis=1)
11. Subset a DataFrame:
# Single condition: dataframe with all entries priced greater than 1500 df_gt_1500 = newDf[newDf['Price']>1500] # Multiple conditions: AND - dataframe with all entries priced greater than 1500 and from London And_df = newDf[(newDf['Price']>1500) & (newDf['City']=='London')] # Multiple conditions: OR - dataframe with all entries priced greater than 1500 or from London Or_df = newDf[(newDf['Price']>1500) | (newDf['City']=='London')] # Multiple conditions: NOT - dataframe with all entries priced greater than 1500 or from London have to be excluded Not_df = newDf[~((newDf['Price']>1500) | (newDf['City']=='London'))]
12. Change the Column at particular places or impute:
# In the state column the state is abbreviated as 'TX'. We want the whole name 'Texas' in there newDf.loc[newDf['State']=='TX','State'] = 'Texas' # When City is Monaco State is not given. You want to impute 'Monaco State' as state also. newDf.loc[newDf['City']=='Monaco','State'] = 'Monaco State'
One of the most used functionality. One simple example
# Find out the sum of transactions by a state. reset_index() is a function that resets the index of a dataframe. I apply this function ALWAYS whenever I do a groupby and you might think of it as a default syntax for groupby operations import numpy as np newDf.groupby(['State']).aggregate(np.sum).reset_index() # You might get a few extra columns that you dont need. Just subset the columns in the dataframe. You could just chain the commands to subset for the columns you need. newDf.groupby(['State']).aggregate(np.sum).reset_index()[['State','Price']] # Find minimum transaction in each state newDf.groupby(['State']).aggregate(np.min).reset_index()[['State','Price']] # You might want to groupby more than one column newDf.groupby(['State','City']).aggregate(np.sum).reset_index()[['State','City','Price']]
You have two datarames df1 and df2 you need to concat. Means append one below the other you can do it using:
#Suppose in the start, you had two dataframes. One which contains city and price information: City_Price = newwDf[['City','Price']] #And another which contains 'City' and 'State' insformation City_State = newDf[['City','State']].drop_duplicates(keep=False).reset_index() #You need to merge these datatframes on basis of city. You need to do: City_Price_State_df = pd.merge(City_Price,City_State,on=['City'],how='left')
16. Save a Dataframe to external File:
# To Csv file newDf.to_csv("NewDfData.csv",index=False) # To Excel File from pandas import ExcelWriter writer = ExcelWriter('NewDfData.xlsx') newDf.to_excel(writer,'Sheet1') writer.save()
17. Pushing Pandas Df to a sql database:
from pandas.io import sql import MySQLdb db = MySQLdb.connect(host="localhost", # your host, usually localhost user="root", # your username passwd="password", # your password db="dbname") # name of the data base newDf.to_sql(con = db, name='tablename',if_exists='append',flavor='mysql', chunksize=10000,index=False)
Hope you found this post useful and worth your time. I tried to make this as simple as possible but You may always ask me or see the documentation for doubts.
If you have any more ideas on how to use Pandas or other usecases, please suggest in the comments section.
Till then ciao!!