Kaggle is an excellent place to learn, and I learned a lot from the recently concluded Quora Insincere Questions Classification competition, in which I ranked 182 out of 4037. In this post, I summarize the things I tried, as well as the ideas I missed that were part of the winning solutions. As a side note: if you want to know more about NLP, I recommend this awesome course on Natural Language Processing in the Advanced Machine Learning specialization.
This is the second post in the NLP text classification series. As a recap: I recently started working on an NLP text classification competition on Kaggle called the Quora Question Insincerity Challenge, and I decided to share what I learned through a series of blog posts on text classification. The first post covered the preprocessing techniques that work with deep learning models and how to increase embeddings coverage. In this post, I will take you through some basic conventional approaches such as TF-IDF, Count Vectorizer, and hashing.
Recently, I started working on an NLP competition on Kaggle called the Quora Question Insincerity Challenge. It is an NLP challenge on text classification. The problem became much clearer to me after working through the competition and going through the invaluable kernels put up by the Kaggle experts, so I thought of sharing what I learned. Since there is a large amount of material to cover, I am splitting it into a series of posts.
Recently, Kaggle master Kazanova, along with some of his friends, released a “How to win a data science competition” Coursera course. You can start for free with the 7-day free trial. The course included a final project, which was itself a time series prediction problem. Here I describe how I reached a top 10 position as of writing this article. Description of the problem: in this competition, we were given a challenging time-series dataset consisting of daily sales data, kindly provided by one of the largest Russian software firms, 1C Company.
Oftentimes we fall short of creativity, and creativity is one of the basic ingredients of what we do. Creating features needs creativity. So here is a list of ideas I have gathered in day-to-day life, where people have used creativity to get great results on Kaggle leaderboards. Take a look at the How to Win a Data Science Competition: Learn from Top Kagglers course in the Advanced Machine Learning specialization by Kazanova (the number 3 Kaggler at the time of writing).
Recently, Quora put out a question similarity competition on Kaggle. This was the first time I attempted an NLP problem, so there was a lot to learn. The one thing that blew my mind was the word2vec embeddings. Until now, whenever I heard the term word2vec, I visualized it as a way to create a bag-of-words vector for a sentence. For those who don’t know bag of words: If we have a series of sentences (documents)