We as data scientists have gotten quite comfortable with Pandas, SQL, or some other relational, table-based tool. We are used to seeing our users in rows with their attributes as columns. But does the real world really behave like that? In a connected world, users cannot be considered independent entities. They have relationships with each other, and we would sometimes like to include those relationships while building our machine learning models.
Just kidding, nothing is hotter than Jennifer Lawrence. But as you are here, let’s proceed. Practitioners in any field are only as good as the tools they use. Data scientists are no different. But sometimes we don’t even know which tools we need, or whether we need them at all. We cannot tell whether there might be a more natural way to solve the problem we face.
Einstein once said that “God does not play dice with the universe”. But actually he does. Everything happening around us could be explained in terms of probabilities. We repeatedly watch things around us happen by chance, yet we never learn. We are always dumbfounded by the playfulness of nature. One such way our intuition gets tricked is the Birthday problem. Problem Statement: In a room full of N people, what is the probability that 2 or more people share the same birthday (Assumption: 365 days in a year)?
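To make the problem statement concrete, here is a minimal sketch of the usual way to compute it: find the probability that all N birthdays are different, then subtract that from 1. The function name birthday_collision_probability is my own, and the sketch assumes every birthday is independent and equally likely across the 365 days.

```python
def birthday_collision_probability(n, days=365):
    """Probability that at least two of n people share a birthday.

    Assumes birthdays are independent and uniformly spread over `days` days
    (an illustrative simplification, not a claim about real birth data).
    """
    if n > days:
        # Pigeonhole principle: more people than days forces a shared birthday.
        return 1.0
    p_all_distinct = 1.0
    for i in range(n):
        # Person i must avoid the i birthdays already taken.
        p_all_distinct *= (days - i) / days
    return 1.0 - p_all_distinct


if __name__ == "__main__":
    for n in (10, 23, 50):
        print(n, round(birthday_collision_probability(n), 3))
```

With only 23 people in the room, the probability already crosses one half (about 0.507), which is exactly the kind of result that catches intuition off guard.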
Recently Quora put out a question similarity competition on Kaggle. This was the first time I attempted an NLP problem, so there was a lot to learn. The one thing that blew my mind was the word2vec embeddings. Until now, whenever I heard the term word2vec, I visualized it as a way to create a bag of words vector for a sentence. For those who don’t know bag of words: If we have a series of sentences (documents)
I have been looking to create this list for a while now. Many people on Quora ask me how I started in the data science field, so I wanted to create this reference. To be frank, when I first started learning, it all looked very utopian and out of this world. The Andrew Ng course felt like black magic. And it still never ceases to amaze me.
A data scientist needs to be critical and always on the lookout for something that others miss. So here are some pieces of advice that you can include in your day-to-day data science work to get better at it: 1. Beware of the Clean Data Syndrome: You need to ask yourself questions even before you start working on the data. Does this data make sense? Wrongly assuming that the data is clean could lead you toward wrong hypotheses.
Yesterday I got introduced to awk programming on the shell, and boy, is it cool. It lets you do stuff on the command line that you never imagined. As a matter of fact, it’s a whole data analytics tool in itself when you think about it. You can do selections, groupby, mean, median, sum, duplication, append. You just ask. There is no limit actually. And it is easy to learn.