Often times it happens that we fall short of creativity. And creativity is one of the basic ingredients of what we do. Creating features needs creativity. So here is the list of ideas I gather in day to day life, where people have used creativity to get great results on Kaggle leaderboards. Take a look at the How to Win a Data Science Competition: Learn from Top Kagglers course in the Advanced machine learning specialization by Kazanova(Number 3 Kaggler at the time of writing).
Newton once said that “God does not play dice with the universe”. But actually he does. Everything happening around us could be explained in terms of probabilities. We repeatedly watch things around us happen due to chances, yet we never learn. We always get dumbfounded by the playfulness of nature. One of such ways intuition plays with us is with the Birthday problem. Problem Statement: In a room full of N people, what is the probability that 2 or more people share the same birthday(Assumption: 365 days in year)?
Recently Quora put out a Question similarity competition on Kaggle. This is the first time I was attempting an NLP problem so a lot to learn. The one thing that blew my mind away was the word2vec embeddings. Till now whenever I heard the term word2vec I visualized it as a way to create a bag of words vector for a sentence. For those who don’t know bag of words: If we have a series of sentences(documents)
As a data scientist I believe that a lot of work has to be done before Classification/Regression/Clustering methods are applied to the data you get. The data which may be messy, unwieldy and big. So here are the list of algorithms that helps a data scientist to make better models using the data they have: 1. Sampling Algorithms. In case you want to work with a sample of data.
It has been quite a few days I have been working with Pandas and apparently I feel I have gotten quite good at it. (Quite a Braggard I know) So thought about adding a post about Pandas usage here. I intend to make this post quite practical and since I find the pandas syntax quite self explanatory, I won’t be explaining much of the codes. Just the use cases and the code to achieve them.
Yesterday I got introduced to awk programming on the shell and is it cool. It lets you do stuff on the command line which you never imagined. As a matter of fact, it’s a whole data analytics software in itself when you think about it. You can do selections, groupby, mean, median, sum, duplication, append. You just ask. There is no limit actually. And it is easy to learn.
Shell Commands are powerful. And life would be like hell without shell is how I like to say it(And that is probably the reason that I dislike windows). Consider a case when you have a 6 GB pipe-delimited file sitting on your laptop and you want to find out the count of distinct values in one particular column. You can probably do this in more than one way. You could put that file in a database and run SQL Commands, or you could write a python/perl script.