Just kidding, nothing is hotter than Jennifer Lawrence. But as you are here, let’s proceed. Practitioners in any field are only as good as the tools they use. Data scientists are no different. But sometimes we don’t even know which tools we need, or whether we need them at all. We can’t tell whether there is a more natural way to solve the problem we face.
Einstein once said that “God does not play dice with the universe”. But actually he does. Everything happening around us can be explained in terms of probabilities. We repeatedly watch things happen by chance, yet we never learn. We are always dumbfounded by the playfulness of nature. One of the ways our intuition fails us is the Birthday problem. Problem Statement: In a room of N people, what is the probability that 2 or more people share the same birthday (Assumption: 365 days in a year)?
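The problem statement above can be checked numerically. Here is a minimal sketch, assuming that every one of the 365 days is equally likely and that birthdays are independent (the uniformity assumption is mine, and so is the function name `birthday_collision_probability`). It computes the chance that all N birthdays are distinct and subtracts that from 1.

```python
def birthday_collision_probability(n, days=365):
    """Probability that at least two of n people share a birthday.

    Assumes uniformly distributed, independent birthdays over `days` days.
    """
    if n > days:
        # Pigeonhole principle: more people than days forces a shared birthday.
        return 1.0
    p_all_distinct = 1.0
    for i in range(n):
        # Person i+1 must avoid the i birthdays already taken.
        p_all_distinct *= (days - i) / days
    return 1.0 - p_all_distinct


if __name__ == "__main__":
    for n in (10, 23, 50):
        print(n, round(birthday_collision_probability(n), 4))
```

Under these assumptions, 23 people are already enough to push the probability just past one half (about 0.507), which is why the result feels so counterintuitive.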
Recently Quora put out a Question similarity competition on Kaggle. This was the first time I attempted an NLP problem, so there was a lot to learn. The one thing that blew my mind was the word2vec embeddings. Until now, whenever I heard the term word2vec, I visualized it as a way to create a bag-of-words vector for a sentence. For those who don’t know bag of words: If we have a series of sentences (documents)
I have been looking to create this list for a while now. Many people on Quora ask me how I started in the data science field, so I wanted to create this reference. To be frank, when I first started learning, it all looked very utopian and out of this world. The Andrew Ng course felt like black magic. And it still never ceases to amaze me.
A data scientist needs to be critical and always on the lookout for something that others miss. So here is some advice that you can apply in your day-to-day data science work to get better at it: 1. Beware of the Clean Data Syndrome. You need to ask yourself questions even before you start working on the data. Does this data make sense? Falsely assuming that the data is clean could lead you towards wrong hypotheses.
Yesterday I got introduced to awk programming on the shell, and is it cool! It lets you do things on the command line that you never imagined possible. As a matter of fact, it’s a whole data analytics tool in itself when you think about it. You can do selections, groupby, mean, median, sum, duplication, append. You just ask. There is no limit, actually. And it is easy to learn.
Shell commands are powerful. Life would be like hell without shell, is how I like to say it (and that is probably the reason I dislike Windows). Consider a case where you have a 6 GB pipe-delimited file sitting on your laptop and you want to find the count of distinct values in one particular column. You can probably do this in more than one way. You could put that file in a database and run SQL commands, or you could write a Python/Perl script.
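For comparison, the script route might look something like the sketch below. It is only an illustration under stated assumptions: the function name `count_distinct` and the zero-based `column_index` parameter are hypothetical, the file is assumed to have no header row and no quoted fields containing the delimiter, and the set of distinct values is assumed to fit in memory. The file is read line by line, so the full 6 GB is never loaded at once.

```python
def count_distinct(path, column_index, delimiter="|"):
    """Count distinct values in one column of a delimited text file.

    Assumes no header row and no quoted fields; column_index is zero-based.
    """
    seen = set()
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            # Strip the line ending, then split on the delimiter.
            fields = line.rstrip("\r\n").split(delimiter)
            # Skip short or malformed rows that lack the requested column.
            if column_index < len(fields):
                seen.add(fields[column_index])
    return len(seen)
```

Calling `count_distinct("data.txt", 2)` would count the distinct values in the third column of a hypothetical file named `data.txt`.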