In online advertising, click-through rate (CTR) is a very important metric for evaluating ad performance. As a result, click prediction systems are essential and widely used for sponsored search and real-time bidding.
For this competition, we have provided 11 days' worth of Avazu data to build and test prediction models. Can you find a strategy that beats standard classification algorithms? The winning models from this competition will be released under an open-source license.
Data fields:

- id: ad identifier
- click: 0/1 for non-click/click
- hour: format is YYMMDDHH, so 14091123 means 23:00 on Sept. 11, 2014 UTC
- C1: anonymized categorical variable
- banner_pos
- site_id
- site_domain
- site_category
- app_id
- app_domain
- app_category
- device_id
- device_ip
- device_model
- device_type
- device_conn_type
- C14-C21: anonymized categorical variables
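The `hour` format described above can be parsed with the standard library. This is a minimal sketch; the helper name `parse_hour` is hypothetical and not part of the competition material.

```python
from datetime import datetime, timezone

def parse_hour(value):
    """Parse an Avazu `hour` value such as '14091123' (YYMMDDHH, UTC)."""
    return datetime.strptime(str(value), "%y%m%d%H").replace(tzinfo=timezone.utc)

ts = parse_hour("14091123")
print(ts.isoformat())   # 2014-09-11T23:00:00+00:00
print(ts.hour)          # 23
print(ts.weekday())     # 3 (Monday is 0, so this is a Thursday)
```

Day of week and hour of day are the two features derived from this column later in the notebook.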
## Loading the data
import pandas as pd
import numpy as np
import string as stri
# The data is too large to keep in memory,
# so we process it line by line instead of loading it with pandas.
#data = pd.read_csv("/Users/RahulAgarwal/kaggle_cpr/train")
Since the data is large (around 6 GB), we will process it line by line. We will try Vowpal Wabbit first: it is an online learner, it can minimize log loss directly through its logistic loss function, and it is very fast to run, so it quickly gives us an intuition for how good our predictions can be.
I will use all the variables in the first implementation, and we will revisit that choice as we move on.
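The line-by-line idea can be sketched as follows. This is a minimal illustration, not the conversion script itself: the helper name `streaming_ctr` and the in-memory sample rows are assumptions made for the example; only the `click` column name comes from the data description.

```python
import csv
import io

def streaming_ctr(lines):
    """Return (rows, clicks, ctr) while holding only one row in memory at a time."""
    reader = csv.DictReader(lines)
    rows = 0
    clicks = 0
    for record in reader:
        rows += 1
        if record["click"] == "1":
            clicks += 1
    return rows, clicks, (clicks / rows if rows else 0.0)

# Tiny in-memory stand-in for the real train file (made-up values).
sample = io.StringIO(
    "id,click,hour\n"
    "1,0,14102100\n"
    "2,1,14102100\n"
    "3,0,14102101\n"
    "4,0,14102101\n"
)
print(streaming_ctr(sample))   # (4, 1, 0.25)
```

With the real data, the same function could be given an open file handle, for example `open(path, newline="")`, where `path` points at the train file.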
from datetime import datetime

def csv_to_vw(loc_csv, loc_output, train=True):
    """Convert the CSV file at loc_csv into Vowpal Wabbit format at loc_output."""
    start = datetime.now()
    print("\nTurning %s into %s. Is_train_set? %s" % (loc_csv, loc_output, train))
    # The train file has the click label at index 1. The test file has no
    # label column, so every later column sits one position to the left.
    offset = 0 if train else -1
    counter = 0
    # Open the output in text mode ("w"): we write str, not bytes.
    with open(loc_csv, "r") as infile, open(loc_output, "w") as outfile:
        next(infile, None)  # skip the header row
        for line in infile:
            counter += 1
            # Strip the newline so it does not end up inside the last feature.
            row = line.rstrip("\n").split(",")
            # Work on the date column (YYMMDDHH): we take day of week and hour.
            a = row[2 + offset]
            new_date = datetime(int("20" + a[0:2]), int(a[2:4]), int(a[4:6]))
            day = new_date.strftime("%A")
            hour = a[6:8]
            # The data has only categorical features; each gets its own namespace.
            categorical_features = " |hr %s |day %s" % (hour, day)
            # 24 columns in the train data; namespaces c3..c23 follow the train column index.
            for col in range(3, 24):
                value = row[col + offset]
                if value != "":
                    categorical_features += " |c%s %s" % (col, value)
            if train:
                # We care about labels; the negative label is set to -1.
                label = 1 if row[1] == "1" else -1
                outfile.write("%s '%s %s\n" % (label, row[0], categorical_features))
            else:
                # We don't care about labels; write a dummy label of 1.
                outfile.write("1 '%s %s\n" % (row[0], categorical_features))
            # Report progress
            if counter % 1000000 == 0:
                print("%s\t%s" % (counter, str(datetime.now() - start)))
    print("\n %s Task execution time:\n\t%s" % (counter, str(datetime.now() - start)))
#csv_to_vw("/Users/RahulAgarwal/kaggle_cpr/train", "/Users/RahulAgarwal/kaggle_cpr/click.train_original_data.vw",train=True)
#csv_to_vw("/Users/RahulAgarwal/kaggle_cpr/test", "/Users/RahulAgarwal/kaggle_cpr/click.test_original_data.vw",train=False)
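To make the output of `csv_to_vw` concrete, here is a small sketch of how one example looks in VW's input format (label, then a tag introduced by an apostrophe, then one `|namespace feature` group per column). The helper `to_vw_line` and the sample id and values are hypothetical, chosen only for illustration.

```python
def to_vw_line(example_id, namespaces, label=None):
    """Format one example as a VW input line: label 'tag |ns feature ..."""
    feats = " ".join("|%s %s" % (ns, value) for ns, value in namespaces)
    # Test rows have no known label, so a dummy label of 1 is written.
    return "%s '%s %s" % (1 if label is None else label, example_id, feats)

# Made-up row: a non-click at hour 00 on a Tuesday with one categorical column.
line = to_vw_line(
    "1000009418151094273",
    [("hr", "00"), ("day", "Tuesday"), ("c3", "1005")],
    label=-1,
)
print(line)   # -1 '1000009418151094273 |hr 00 |day Tuesday |c3 1005
```

Putting each column in its own namespace keeps identical values from different columns apart, since VW hashes the namespace together with the feature name.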
Vowpal Wabbit itself is run from the command line.
Training VW:
vw click.train_original_data.vw -f click.model.vw --loss_function logistic
Testing VW:
vw click.test_original_data.vw -t -i click.model.vw -p click.preds.txt
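import math

def zygmoid(x):
    # Logistic function: maps VW's raw score to a probability in (0, 1).
    return 1 / (1 + math.exp(-x))

# Open the submission in text mode ("w"): we write str, not bytes.
with open("click.preds.txt") as infile, open("kaggle.click.submission.csv", "w") as outfile:
    outfile.write("id,click\n")
    for line in infile:
        # Each prediction line is "<raw score> <id>".
        row = line.strip().split(" ")
        try:
            outfile.write("%s,%f\n" % (row[1], zygmoid(float(row[0]))))
        except (IndexError, ValueError, OverflowError):
            # Skip malformed lines and scores too extreme for math.exp.
            pass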
This solution ranked 211 out of 371 submissions at the time, with a leaderboard score of 0.4031825; the best leaderboard score was 0.3901120.
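Assuming the leaderboard score is the same log loss the model was trained to minimize (lower is better), it can be estimated locally on a labeled hold-out set. The sketch below is illustrative only: `sigmoid` is a numerically stable variant of the conversion used above, `log_loss` is a hypothetical helper, and the scores and labels are made up.

```python
import math

def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)

def log_loss(labels, probs, eps=1e-15):
    """Mean negative log-likelihood for 0/1 labels and predicted click probabilities."""
    total = 0.0
    for y, p in zip(labels, probs):
        p = min(max(p, eps), 1.0 - eps)   # clip to avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(labels)

raw_scores = [-2.0, 0.0, 1.5]   # made-up raw VW outputs
labels = [0, 0, 1]              # made-up ground truth
probs = [sigmoid(s) for s in raw_scores]
print(log_loss(labels, probs))
```

Unlike the plain `1 / (1 + math.exp(-x))` form, this variant does not overflow for large negative scores, so no prediction has to be skipped.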
Next steps:

- Create a better VW model.
- Create an XGBoost model.
- Create a Sofia-ML model and see how it works on this data.