Exploring Vowpal Wabbit with the Avazu Clickthrough Prediction Challenge
In online advertising, click-through rate (CTR) is a very important metric for evaluating ad performance. As a result, click prediction systems are essential and widely used for sponsored search and real-time bidding.
For this competition, we have provided 11 days' worth of Avazu data to build and test prediction models. Can you find a strategy that beats standard classification algorithms? The winning models from this competition will be released under an open-source license.
Data Fields
id: ad identifier
click: 0/1 for non-click/click
hour: format is YYMMDDHH, so 14091123 means 23:00 on Sept. 11, 2014 UTC.
C1 -- anonymized categorical variable
banner_pos
site_id
site_domain
site_category
app_id
app_domain
app_category
device_id
device_ip
device_model
device_type
device_conn_type
C14-C21 -- anonymized categorical variables
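The hour field packs a date and an hour of day into one string. A minimal sketch of how it can be unpacked, assuming the YYMMDDHH layout described above (the helper name parse_hour is hypothetical):

```python
from datetime import datetime

def parse_hour(value):
    """Split a YYMMDDHH string into (date, weekday name, hour of day)."""
    stamp = datetime.strptime(value, "%y%m%d%H")
    return stamp.date(), stamp.strftime("%A"), stamp.hour

date, weekday, hour = parse_hour("14091123")
print(date, weekday, hour)  # 2014-09-11 Thursday 23
```

The weekday and the hour of day are the two pieces the conversion script later turns into features.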
Loading Data
## Loading the data
import pandas as pd
import numpy as np
import string as stri
# The data is too large to keep in memory,
# so we will process it line by line instead.
#data = pd.read_csv("/Users/RahulAgarwal/kaggle_cpr/train")
Since the data is large (around 6 GB), we will process it line by line rather than load it into memory. We will try Vowpal Wabbit first: it is an online learner, it can minimize log loss directly through its logistic loss function, and it is very fast to run, so it quickly gives us an intuition for how good our predictions can be.
I will use all the variables in the first implementation and revisit the feature choices as we move on.
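As a small illustration of the line-by-line approach, the sketch below streams a CSV and computes the overall click rate without holding the file in memory. It assumes the file has a header row with a click column, as in the field list above; the function name click_rate is hypothetical.

```python
import csv

def click_rate(path):
    """Stream a CSV with a 'click' column and return (row count, click rate)."""
    rows = 0
    clicks = 0
    with open(path, newline="") as infile:
        # DictReader yields one row at a time, so memory use stays flat.
        for record in csv.DictReader(infile):
            rows += 1
            if record["click"] == "1":
                clicks += 1
    rate = (clicks / rows) if rows else 0.0
    return rows, rate

# Usage, with a path of your own:
# rows, rate = click_rate("train")
```

Knowing the base click rate is a useful sanity check before training, because it shows how skewed the labels are.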
Running Vowpal Wabbit
Creating data in vowpal format (One Time Only)
from datetime import datetime

def csv_to_vw(loc_csv, loc_output, train=True):
    start = datetime.now()
    print("\nTurning %s into %s. Is_train_set? %s" % (loc_csv, loc_output, train))
    counter = 0
    # Open both files in text mode: writing str to a "wb" handle fails on Python 3.
    with open(loc_csv, "r") as infile, open(loc_output, "w") as outfile:
        next(infile, None)  # skip the header row
        for line in infile:
            # The data has only categorical features.
            categorical_features = ""
            counter = counter + 1
            # Strip the line ending so it does not leak into the last feature.
            line = line.rstrip("\r\n").split(",")
            if train:
                # Work on the date column: take the weekday and the hour.
                a = line[2]
                new_date = datetime(int("20" + a[0:2]), int(a[2:4]), int(a[4:6]))
                day = new_date.strftime("%A")
                hour = a[6:8]
                categorical_features += " |hr %s" % hour
                categorical_features += " |day %s" % day
                # 24 columns in the train data; columns 3 to 23 are features.
                for col in range(3, 24):
                    if line[col] != "":
                        categorical_features += " |c%s %s" % (col, line[col])
            else:
                # The test data has no click column, so every index shifts left by one.
                a = line[1]
                new_date = datetime(int("20" + a[0:2]), int(a[2:4]), int(a[4:6]))
                day = new_date.strftime("%A")
                hour = a[6:8]
                categorical_features += " |hr %s" % hour
                categorical_features += " |day %s" % day
                for col in range(2, 23):
                    if line[col] != "":
                        # col + 1 keeps namespace names aligned with the train file.
                        categorical_features += " |c%s %s" % (col + 1, line[col])
            # Create the labels.
            if train:  # we care about labels
                if line[1] == "1":
                    label = 1
                else:
                    label = -1  # VW's logistic loss expects labels -1 and 1
                outfile.write("%s '%s %s\n" % (label, line[0], categorical_features))
            else:  # we don't care about labels, so write a dummy label of 1
                outfile.write("1 '%s %s\n" % (line[0], categorical_features))
            # Report progress.
            if counter % 1000000 == 0:
                print("%s\t%s" % (counter, str(datetime.now() - start)))
    print("\n %s Task execution time:\n\t%s" % (counter, str(datetime.now() - start)))
#csv_to_vw("/Users/RahulAgarwal/kaggle_cpr/train", "/Users/RahulAgarwal/kaggle_cpr/click.train_original_data.vw",train=True)
#csv_to_vw("/Users/RahulAgarwal/kaggle_cpr/test", "/Users/RahulAgarwal/kaggle_cpr/click.test_original_data.vw",train=False)
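To make the target format concrete, here is a sketch of how a single train row maps to one VW line: a label, the ad id as a tag after an apostrophe, then one namespace per feature. The helper train_row_to_vw is hypothetical and the row values are made up for illustration; it mirrors the layout the script above aims for rather than reproducing its exact spacing.

```python
from datetime import datetime

def train_row_to_vw(fields):
    """Format one already-split train row (id, click, hour, features...) as a VW line."""
    label = 1 if fields[1] == "1" else -1
    stamp = datetime.strptime(fields[2], "%y%m%d%H")
    parts = [
        "%d '%s" % (label, fields[0]),     # label and tag
        "|hr %s" % fields[2][6:8],         # hour of day
        "|day %s" % stamp.strftime("%A"),  # weekday name
    ]
    # Remaining columns become namespaces c3, c4, ... named by column index.
    for col in range(3, len(fields)):
        if fields[col] != "":
            parts.append("|c%d %s" % (col, fields[col]))
    return " ".join(parts)

row = ["10001", "0", "14102100", "1005", "0"]  # made-up, shortened row
print(train_row_to_vw(row))
# -1 '10001 |hr 00 |day Tuesday |c3 1005 |c4 0
```

Putting each column in its own namespace keeps equal values from different columns from being hashed to the same feature.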
Running Vowpal Wabbit on the data
Vowpal Wabbit is run directly from the command line.
Training VW:
vw click.train_original_data.vw -f click.model.vw --loss_function logistic
Testing VW:
vw click.test_original_data.vw -t -i click.model.vw -p click.preds.txt
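If you would rather drive both steps from Python, the same two commands can be launched with subprocess. This is only a sketch: it assumes the vw executable is on your PATH, and the helper names vw_commands and run_vw are hypothetical. The flags are exactly the ones used above.

```python
import shutil
import subprocess

def vw_commands(train_vw, test_vw, model, preds):
    """Build the training and testing command lines used above."""
    train_cmd = ["vw", train_vw, "-f", model, "--loss_function", "logistic"]
    test_cmd = ["vw", test_vw, "-t", "-i", model, "-p", preds]
    return train_cmd, test_cmd

def run_vw(train_vw, test_vw, model, preds):
    """Run training, then testing; stop at the first failing command."""
    if shutil.which("vw") is None:
        raise RuntimeError("vw executable not found on PATH")
    for cmd in vw_commands(train_vw, test_vw, model, preds):
        subprocess.run(cmd, check=True)

# Usage:
# run_vw("click.train_original_data.vw", "click.test_original_data.vw",
#        "click.model.vw", "click.preds.txt")
```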
Creating Kaggle Submission File
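import math

def zygmoid(x):
    # Map VW's raw score to a probability between 0 and 1.
    return 1 / (1 + math.exp(-x))

# Text mode ("w"): we write str, not bytes.
with open("kaggle.click.submission.csv", "w") as outfile:
    outfile.write("id,click\n")
    for line in open("click.preds.txt"):
        # Each prediction line is "<raw score> <tag>", where the tag is the ad id.
        row = line.strip().split(" ")
        try:
            outfile.write("%s,%f\n" % (row[1], zygmoid(float(row[0]))))
        except (IndexError, ValueError, OverflowError):
            # Skip lines that cannot be parsed or converted.
            pass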
This solution ranked 211 out of 371 submissions at the time, with a leaderboard score of 0.4031825; the best leaderboard score was 0.3901120.
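The leaderboard score is log loss, so lower is better. To estimate it locally on a held-out slice of the training data, a minimal sketch of the metric could look like this (the function name and the clipping constant eps are my own choices, not part of the competition code):

```python
import math

def log_loss(y_true, y_prob, eps=1e-15):
    """Mean negative log-likelihood of 0/1 labels under predicted probabilities."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        # Clip probabilities so log() never sees exactly 0 or 1.
        p = min(max(p, eps), 1 - eps)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)

# Predicting 0.5 for every ad gives ln(2), roughly 0.693.
print(log_loss([1, 0, 0, 0], [0.5, 0.5, 0.5, 0.5]))
```

Any useful model should score well below that 0.693 baseline.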
Next Steps
Create a better VW model
Shuffle the data before training, since VW is an online learner and may have given more weight to the most recent data.
Give higher weights to clicks, as the data is skewed. How much?
Tune the VW model using vw-hypersearch. What should be tuned?
Use categorical features like |C1 “C1”&“1”
Create an XGBoost model.
Create a Sofia-ML model and see how it works on this data.
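For the shuffling step, the converted file is too large to shuffle in memory in one go. One possible approach, sketched here under the assumption that each temporary bucket fits in memory (with 16 buckets and a 6 GB file, roughly 400 MB each), is to scatter lines at random into bucket files and then shuffle each bucket separately. The function name and parameters are hypothetical.

```python
import os
import random
import tempfile

def shuffle_large_file(src, dst, n_buckets=16, seed=0):
    """Shuffle the lines of src into dst using temporary bucket files."""
    rng = random.Random(seed)
    tmpdir = tempfile.mkdtemp()
    paths = [os.path.join(tmpdir, "bucket_%d" % k) for k in range(n_buckets)]
    buckets = [open(p, "w") for p in paths]
    try:
        # Pass 1: send every line to a randomly chosen bucket.
        with open(src) as infile:
            for line in infile:
                if not line.endswith("\n"):
                    line += "\n"  # keep a final unterminated line separate
                rng.choice(buckets).write(line)
    finally:
        for bucket in buckets:
            bucket.close()
    # Pass 2: shuffle each bucket in memory and append it to the output.
    with open(dst, "w") as outfile:
        for path in paths:
            with open(path) as bucket:
                lines = bucket.readlines()
            rng.shuffle(lines)
            outfile.writelines(lines)
            os.remove(path)
    os.rmdir(tmpdir)

# Usage:
# shuffle_large_file("click.train_original_data.vw", "click.train_shuffled.vw")
```

The fixed seed makes the shuffle reproducible, which helps when comparing models trained on the same ordering.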