As a data scientist I believe that a lot of work has to be done before Classification/Regression/Clustering methods are applied to the data you get. The data which may be messy, unwieldy and big. So here are the list of algorithms that helps a data scientist to make better models using the data they have: 1. Sampling Algorithms. In case you want to work with a sample of data.
It has been quite a few days I have been working with Pandas and apparently I feel I have gotten quite good at it. (Quite a Braggard I know) So thought about adding a post about Pandas usage here. I intend to make this post quite practical and since I find the pandas syntax quite self explanatory, I won’t be explaining much of the codes. Just the use cases and the code to achieve them.
Yesterday I got introduced to awk programming on the shell and is it cool. It lets you do stuff on the command line which you never imagined. As a matter of fact, it’s a whole data analytics software in itself when you think about it. You can do selections, groupby, mean, median, sum, duplication, append. You just ask. There is no limit actually. And it is easy to learn.
Shell Commands are powerful. And life would be like hell without shell is how I like to say it(And that is probably the reason that I dislike windows). Consider a case when you have a 6 GB pipe-delimited file sitting on your laptop and you want to find out the count of distinct values in one particular column. You can probably do this in more than one way. You could put that file in a database and run SQL Commands, or you could write a python/perl script.
When it comes to data preparation and getting acquainted with data, the one step we normally skip is the data visualization. While a part of it could be attributed to the lack of good visualization tools for the platforms we use, most of us also get lazy at times. Now as we know of it Python never had any good Visualization library. For most of our plotting needs, I would read up blogs, hack up with StackOverflow solutions and haggle with Matplotlib documentation each and every time I needed to make a simple graph.
Last time I wrote an article on MCMC and how they could be useful. We learned how MCMC chains could be used to simulate from a random variable whose distribution is partially known i.e. we don’t know the normalizing constant. So MCMC Methods may sound interesting to some (for these what follows is a treat) and for those who don’t really appreciate MCMC till now, I hope I will be able to pique your interest by the end of this blog post.
The things that I find hard to understand push me to my limits. One of the things that I have always found hard is Markov Chain Monte Carlo Methods. When I first encountered them, I read a lot about them but mostly it ended like this. The meaning is normally hidden in deep layers of Mathematical noise and not easy to decipher. This blog post is intended to clear up the confusion around MCMC methods, Know what they are actually useful for and Get hands on with some applications.