Using XGBoost for time series prediction tasks

Recently, Kaggle master Kazanova and some of his friends released a "How to Win a Data Science Competition" Coursera course. You can start for free with the 7-day free trial. The course included a final project, which was itself a time series prediction problem. Here I describe how I reached a top 10 position as of writing this article.

Description of the Problem:

In this competition we were given a challenging time-series dataset consisting of daily sales data, kindly provided by one of the largest Russian software firms - 1C Company.

We were asked to predict total sales for every product and store in the next month.

The evaluation metric was RMSE, with true target values clipped into the [0,20] range. This target range matters a lot for understanding the submissions I prepared.
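As a rough illustration of the metric, here is a minimal sketch of an RMSE computed on clipped values. The function name `clipped_rmse` is hypothetical, and clipping the predictions as well as the targets is an assumption on my part (the competition only states that the true targets are clipped).

```python
import numpy as np

def clipped_rmse(y_true, y_pred, lower=0, upper=20):
    """RMSE after clipping targets and predictions into [lower, upper].

    Hypothetical helper for illustration; clipping y_pred is an assumption.
    """
    y_true = np.clip(np.asarray(y_true, dtype=float), lower, upper)
    y_pred = np.clip(np.asarray(y_pred, dtype=float), lower, upper)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

# A true value of 35 is scored as 20, so predicting 20 for it costs nothing.
score = clipped_rmse([0, 3, 35], [0, 3, 20])
```

The practical consequence is that any month with more than 20 sales counts as exactly 20 when the submission is scored.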

The main thing I noticed was that data preparation was by far the most important aspect of this competition. I created a variety of features. Here are the steps I took and the features I created.

1. Created a dataframe of all Date_block_num, Store and Item combinations:

This is important because in months where we have no data for an item-store combination, the machine learning algorithm needs to be told explicitly that the sales are zero.

from itertools import product

import numpy as np
import pandas as pd
# Create "grid" with columns
index_cols = ['shop_id', 'item_id', 'date_block_num']

# For every month we create a grid from all shops/items combinations from that month
grid = []
for block_num in sales['date_block_num'].unique():
    cur_shops = sales.loc[sales['date_block_num'] == block_num, 'shop_id'].unique()
    cur_items = sales.loc[sales['date_block_num'] == block_num, 'item_id'].unique()
    grid.append(np.array(list(product(*[cur_shops, cur_items, [block_num]])),dtype='int32'))
grid = pd.DataFrame(np.vstack(grid), columns = index_cols,dtype=np.int32)
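To see why the grid matters, here is a small sketch on made-up data (the `toy_*` names and values are hypothetical). Only two shop/item pairs have sales, but the grid has four, and a left merge followed by a zero fill turns the missing pairs into explicit zero-sales rows.

```python
from itertools import product

import pandas as pd

# Hypothetical toy data: one month, two shops, two items, only two pairs sold.
toy_sales = pd.DataFrame({
    'date_block_num': [0, 0],
    'shop_id': [1, 2],
    'item_id': [10, 11],
    'item_cnt_day': [3, 1],
})

shops = toy_sales['shop_id'].unique().tolist()
items = toy_sales['item_id'].unique().tolist()

# Every shop/item combination for month 0, whether or not it had a sale.
toy_grid = pd.DataFrame(list(product(shops, items, [0])),
                        columns=['shop_id', 'item_id', 'date_block_num'])

# The left merge keeps all grid rows; pairs without sales get NaN, then 0.
toy_full = toy_grid.merge(toy_sales,
                          on=['shop_id', 'item_id', 'date_block_num'],
                          how='left').fillna({'item_cnt_day': 0})
```

Without the grid, the model would only ever see rows where something was sold.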

2. Cleaned up the sales data a little after some basic EDA:

sales = sales[sales.item_price<100000]
sales = sales[sales.item_cnt_day<=1000]

3. Created Mean Encodings:

# Monthly totals per shop/item: summed counts and mean price
sales_m = sales.groupby(['date_block_num','shop_id','item_id']).agg({'item_cnt_day': 'sum','item_price': 'mean'}).reset_index()
sales_m = pd.merge(grid,sales_m,on=['date_block_num','shop_id','item_id'],how='left').fillna(0)
# adding the category id too
sales_m = pd.merge(sales_m,items,on=['item_id'],how='left')

for type_id in ['item_id','shop_id','item_category_id']:
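    for column_id,aggregator,aggtype in [('item_price','mean','avg'),('item_cnt_day','sum','sum'),('item_cnt_day','mean','avg')]:
        # Group on sales_m, since it carries item_category_id after the merge above,
        # and select the column first so only that column is aggregated
        mean_df = sales_m.groupby([type_id,'date_block_num'])[column_id].agg(aggregator).reset_index()[[column_id,type_id,'date_block_num']]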
        mean_df.columns = [type_id+'_'+aggtype+'_'+column_id,type_id,'date_block_num']

        sales_m = pd.merge(sales_m,mean_df,on=['date_block_num',type_id],how='left')

The lines above add the following 9 features:

  • 'item_id_avg_item_price'
  • 'item_id_sum_item_cnt_day'
  • 'item_id_avg_item_cnt_day'
  • 'shop_id_avg_item_price'
  • 'shop_id_sum_item_cnt_day'
  • 'shop_id_avg_item_cnt_day'
  • 'item_category_id_avg_item_price'
  • 'item_category_id_sum_item_cnt_day'
  • 'item_category_id_avg_item_cnt_day'
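To make one of these features concrete, here is a minimal sketch on made-up data (the `toy` frame and its values are hypothetical). It builds 'item_id_sum_item_cnt_day', the total count per item and month, and attaches it to every row for that item and month.

```python
import pandas as pd

# Hypothetical toy data: item 10 is sold in two shops in month 0.
toy = pd.DataFrame({
    'date_block_num': [0, 0, 0, 1],
    'shop_id':        [1, 2, 1, 1],
    'item_id':        [10, 10, 11, 10],
    'item_cnt_day':   [3, 1, 6, 2],
})

# Aggregate per item and month, then name the column like the feature above.
item_month = (toy.groupby(['item_id', 'date_block_num'])['item_cnt_day']
                 .sum()
                 .reset_index()
                 .rename(columns={'item_cnt_day': 'item_id_sum_item_cnt_day'}))

# Every row for the same item and month receives the same aggregate value.
toy = toy.merge(item_month, on=['item_id', 'date_block_num'], how='left')
```

Both month-0 rows for item 10 end up with the value 4, regardless of the shop.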

4. Create Lag Features:

Next we create lag features with different lag periods on the following features:

  • 'item_id_avg_item_price'
  • 'item_id_sum_item_cnt_day'
  • 'item_id_avg_item_cnt_day'
  • 'shop_id_avg_item_price'
  • 'shop_id_sum_item_cnt_day'
  • 'shop_id_avg_item_cnt_day'
  • 'item_category_id_avg_item_price'
  • 'item_category_id_sum_item_cnt_day'
  • 'item_category_id_avg_item_cnt_day'
  • 'item_cnt_day'

lag_variables  = list(sales_m.columns[7:])+['item_cnt_day']
lags = [1 ,2 ,3 ,4, 5, 12]
# Start from the current-month frame and attach the lagged copies to it
sales_means = sales_m.copy()
for lag in lags:
    sales_new_df = sales_m.copy()
    # Shift the month index forward so values from month m line up with month m+lag
    sales_new_df['date_block_num'] += lag
    sales_new_df = sales_new_df[['date_block_num','shop_id','item_id']+lag_variables]
    sales_new_df.columns = ['date_block_num','shop_id','item_id']+ [lag_feat+'_lag_'+str(lag) for lag_feat in lag_variables]
    sales_means = pd.merge(sales_means, sales_new_df,on=['date_block_num','shop_id','item_id'] ,how='left')
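The lag mechanics are easier to follow on a tiny example. This is a minimal sketch on made-up data (the `toy` frame is hypothetical): shifting `date_block_num` forward by one in a copy and merging it back gives each month the previous month's value.

```python
import pandas as pd

# Hypothetical toy frame: one shop/item pair observed over three months.
toy = pd.DataFrame({
    'date_block_num': [0, 1, 2],
    'shop_id': [1, 1, 1],
    'item_id': [10, 10, 10],
    'item_cnt_day': [5, 7, 2],
})

keys = ['date_block_num', 'shop_id', 'item_id']

lagged = toy.copy()
lagged['date_block_num'] += 1  # month m's value now lines up with month m+1
lagged = lagged.rename(columns={'item_cnt_day': 'item_cnt_day_lag_1'})

# Left merge: month 0 has no earlier month, so its lag value is NaN.
toy_lag = toy.merge(lagged, on=keys, how='left')
```

The NaN in the first month is exactly what the fill step that follows has to deal with.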

5. Fill NA with zeros:

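for feat in sales_means.columns:
    if 'item_cnt' in feat:
        sales_means[feat] = sales_means[feat].fillna(0)
    elif 'item_price' in feat:
        # Assumption: the fill for price columns is not shown; zero follows the step title
        sales_means[feat] = sales_means[feat].fillna(0)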

6. Drop the columns that we are not going to use in training:

cols_to_drop = lag_variables[:-1] + ['item_name','item_price']

7. Take a recent bit of data only:

sales_means = sales_means[sales_means['date_block_num']>12]

8. Split into train and CV:

X_train = sales_means[sales_means['date_block_num']<33].drop(cols_to_drop, axis=1)
X_cv =  sales_means[sales_means['date_block_num']==33].drop(cols_to_drop, axis=1)


9. Clipped the target values:

At the start I said that the [0,20] clipping would be important. In the next few lines I clip the target, item_cnt_day, to the range [0,40]. You might ask why 40. An intuitive answer is that if I had clipped to [0,20], very few tree nodes could give 20 as an answer, whereas with a limit of 40, predicting 20 becomes much easier. Please note that we still clip our predictions to the [0,20] range in the end.

def clip(x):
    # Limit a single target value to the [0, 40] range
    if x>40:
        return 40
    elif x<0:
        return 0
    else:
        return x

# X_train and X_cv are the frames from step 8
X_train['item_cnt_day'] = X_train['item_cnt_day'].apply(clip)
X_cv['item_cnt_day'] = X_cv['item_cnt_day'].apply(clip)
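The same clipping can also be written with the vectorized pandas `Series.clip` method, which avoids a Python-level call per row. This is a minimal sketch on made-up values, not the code used for the submission.

```python
import pandas as pd

# Hypothetical target values, including a negative count and a large outlier.
target = pd.Series([-1, 0, 15, 20, 37, 250], name='item_cnt_day')

# Wider range used for training, as described above.
train_target = target.clip(lower=0, upper=40)

# Range used by the competition metric.
score_target = target.clip(lower=0, upper=20)
```

With the wider training range, 37 stays 37 while 250 becomes 40. Under the metric range, both become 20.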

10. Modelling:

  • Created an XGBoost model to get the most important features (top 42 features)
  • Used hyperopt to tune XGBoost
  • Used the top 10 models from the tuned XGBoosts to generate predictions
  • Clipped the predictions to the [0,20] range
  • Final solution was the average of these 10 predictions
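The last two bullets can be sketched without any model at all. In this minimal sketch, `blend_predictions` is a hypothetical helper and the arrays stand in for the outputs of the tuned models. Each prediction vector is clipped to [0,20] first, and the clipped vectors are then averaged.

```python
import numpy as np

def blend_predictions(pred_list, lower=0, upper=20):
    """Clip each model's predictions to [lower, upper], then average them.

    Hypothetical helper for illustration only.
    """
    clipped = [np.clip(np.asarray(p, dtype=float), lower, upper)
               for p in pred_list]
    return np.mean(clipped, axis=0)

# Stand-ins for the predictions of two tuned models on three rows.
preds = [np.array([0.5, 25.0, -1.0]), np.array([1.5, 19.0, 0.0])]
blended = blend_predictions(preds)
```

Because every input is already inside [0,20], the average is guaranteed to stay inside that range too.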

I learned a lot of new things from this awesome course. Highly recommended.
