MLWhiz
https://mlwhiz.com/
Recent content on MLWhiz
Hugo -- gohugo.io
en-us
Wed, 19 Feb 2014 00:00:00 +0000

About Me
https://mlwhiz.com/about/
Wed, 19 Feb 2014 00:00:00 +0000
Summary: I’m a data science consultant and big data engineer based in Bangalore, where I am currently working with WalmartLabs.
Previously, I worked at startups like Fractal and MyCityWay and conglomerates like Citi. I started this blog to deepen my own understanding of new things while helping others learn about them. I also write for publications on Medium like Towards Data Science and HackerNoon.
As Feynman said: “I couldn’t do it.

NLP Learning Series: Part 3 - Attention, CNN and what not for Text Classification
https://mlwhiz.com/blog/2019/03/09/deeplearning_architectures_text_classification/
Sat, 09 Mar 2019 00:00:00 +0000
This is the third post of the NLP Text classification series. To give you a recap, I started with an NLP text classification competition on Kaggle called Quora Question insincerity challenge, and I thought I would share the knowledge via a series of blog posts on text classification. The first post talked about the different preprocessing techniques that work with Deep learning models and increasing embeddings coverage. In the second post, I talked through some basic conventional models like TFIDF, Count Vectorizer, Hashing, etc.

What my first Silver Medal taught me about Text Classification and Kaggle in general?
https://mlwhiz.com/blog/2019/02/19/siver_medal_kaggle_learnings/
Tue, 19 Feb 2019 00:00:00 +0000
Kaggle is an excellent place for learning. I learned a lot from the recently concluded competition on Quora Insincere questions classification, in which I got a rank of 182/4037. In this post, I will try to summarize the things I tried, as well as the ideas I missed that were part of other winning solutions.
As a side note: if you want to know more about NLP, I would like to recommend this awesome course on Natural Language Processing in the Advanced machine learning specialization.

NLP Learning Series: Part 2 - Conventional Methods for Text Classification
https://mlwhiz.com/blog/2019/02/08/deeplearning_nlp_conventional_methods/
Fri, 08 Feb 2019 00:00:00 +0000
This is the second post of the NLP Text classification series. To give you a recap, I recently started with an NLP text classification competition on Kaggle called Quora Question insincerity challenge, and I thought I would share the knowledge via a series of blog posts on text classification. The first post talked about the various preprocessing techniques that work with Deep learning models and increasing embeddings coverage. In this post, I will try to take you through some basic conventional models like TFIDF, Count Vectorizer, Hashing etc.

NLP Learning Series: Part 1 - Text Preprocessing Methods for Deep Learning
https://mlwhiz.com/blog/2019/01/17/deeplearning_nlp_preprocess/
Thu, 17 Jan 2019 00:00:00 +0000
Recently, I started with an NLP competition on Kaggle called Quora Question insincerity challenge. It is an NLP challenge on text classification, and as the problem became clearer after working through the competition and going through the invaluable kernels put up by the Kaggle experts, I thought of sharing the knowledge.
Since we have a large amount of material to cover, I am splitting this post into a series of posts.

A Layman guide to moving from Keras to Pytorch
https://mlwhiz.com/blog/2019/01/06/pytorch_keras_conversion/
Sun, 06 Jan 2019 00:00:00 +0000
Recently I started with a Kaggle competition on text classification, and as a part of the competition, I had to move to Pytorch to get deterministic results. I have always worked with Keras in the past and it has given me pretty good results, but I found out that the CuDNNGRU/CuDNNLSTM layers in Keras are not deterministic, even after setting the seeds.

What Kagglers are using for Text Classification
https://mlwhiz.com/blog/2018/12/17/text_classification/
Mon, 17 Dec 2018 00:00:00 +0000
With the problem of Image Classification more or less solved by Deep learning, Text Classification is the next developing theme in deep learning. For those who don’t know, Text classification is a common task in natural language processing, which maps a sequence of text of indefinite length to a category. How could you use that?
To find the sentiment of a review. To find toxic comments on a platform like Facebook. To find insincere questions on Quora.

To all Data Scientists - The one Graph Algorithm you need to know
https://mlwhiz.com/blog/2018/12/07/connected_components/
Fri, 07 Dec 2018 00:00:00 +0000
Graphs provide us with a very useful data structure. They can help us to find structure within our data. With the advent of Machine learning and big data we need to get as much information as possible about our data. Learning a little bit of graph theory can certainly help us with that.
Here is a Graph Analytics for Big Data course on Coursera by UCSanDiego which I highly recommend to learn the basics of graph theory.

Object Detection: An End to End Theoretical Perspective
https://mlwhiz.com/blog/2018/09/22/object_detection/
Sat, 22 Sep 2018 00:00:00 +0000
We all know about the image classification problem. Given an image, can you find out the class the image belongs to? We can solve any new image classification problem with ConvNets and Transfer Learning using pre-trained nets. ConvNet as fixed feature extractor: take a ConvNet pretrained on ImageNet, remove the last fully-connected layer (this layer’s outputs are the 1000 class scores for a different task like ImageNet), then treat the rest of the ConvNet as a fixed feature extractor for the new dataset.

Hyperopt - A bayesian Parameter Tuning Framework
https://mlwhiz.com/blog/2017/12/28/hyperopt_tuning_ml_model/
Thu, 28 Dec 2017 00:00:00 +0000
Recently I was working on an in-class competition from the “How to win a data science competition” Coursera course. You can start for free with the 7-day Free Trial. I learned a lot of new things from it about using XGBoost for time series prediction tasks.
The one thing that I tried out in this competition was the Hyperopt package - A bayesian Parameter Tuning Framework. And I was amazed.

Using XGBoost for time series prediction tasks
https://mlwhiz.com/blog/2017/12/26/win_a_data_science_competition/
Tue, 26 Dec 2017 00:00:00 +0000
Recently Kaggle master Kazanova, along with some of his friends, released a “How to win a data science competition” Coursera course. You can start for free with the 7-day Free Trial. The course involved a final project which was itself a time series prediction problem. Here I will describe how I got a top 10 position as of writing this article.
Description of the Problem: In this competition we were given a challenging time-series dataset consisting of daily sales data, kindly provided by one of the largest Russian software firms - 1C Company.

Good Feature Building Techniques - Tricks for Kaggle - My Kaggle Code Repository
https://mlwhiz.com/blog/2017/09/14/kaggle_tricks/
Thu, 14 Sep 2017 00:00:00 +0000
Oftentimes we fall short of creativity, and creativity is one of the basic ingredients of what we do. Creating features needs creativity. So here is the list of ideas I gather in day to day life, where people have used creativity to get great results on Kaggle leaderboards.
Take a look at the How to Win a Data Science Competition: Learn from Top Kagglers course in the Advanced machine learning specialization by Kazanova (Number 3 Kaggler at the time of writing).

The story of every distribution - Discrete Distributions
https://mlwhiz.com/blog/2017/09/14/discrete_distributions/
Thu, 14 Sep 2017 00:00:00 +0000
Distributions play an important role in the life of every Statistician. Coming from a non-statistics background, I am not so well versed in these and keep forgetting the properties of these famous distributions. That is why I chose to write down my own understanding in an intuitive way to keep track. One of the most helpful ways to learn more about these is the STAT110 course by Joe Blitzstein and his book.

Today I Learned This Part 2: Pretrained Neural Networks What are they?
https://mlwhiz.com/blog/2017/04/17/deep_learning_pretrained_models/
Mon, 17 Apr 2017 00:00:00 +0000
Deep learning is the buzzword right now. I was working through the deep learning course by Jeremy Howard and one thing I noticed was pretrained deep Neural Networks. In the first lesson he used a pretrained NN to predict on the Dogs vs Cats competition on Kaggle and achieved very good results.
What are pretrained Neural Networks? So let me tell you about the background a little bit. There is a challenge that happens every year in the visual recognition community - The Imagenet Challenge.

Maths Beats Intuition probably every damn time
https://mlwhiz.com/blog/2017/04/16/maths_beats_intuition/
Sun, 16 Apr 2017 00:00:00 +0000
Einstein once said that “God does not play dice with the universe”. But actually he does. Everything happening around us could be explained in terms of probabilities. We repeatedly watch things around us happen due to chance, yet we never learn. We always get dumbfounded by the playfulness of nature.
One of the ways intuition plays with us is the Birthday problem.
Problem Statement: In a room full of N people, what is the probability that 2 or more people share the same birthday (Assumption: 365 days in a year)?

Today I Learned This Part I: What are word2vec Embeddings?
https://mlwhiz.com/blog/2017/04/09/word_vec_embeddings_examples_understanding/
Sun, 09 Apr 2017 00:00:00 +0000
Recently Quora put out a Question similarity competition on Kaggle. This was the first time I attempted an NLP problem, so there was a lot to learn. The one thing that blew my mind was the word2vec embeddings.
Till now, whenever I heard the term word2vec, I visualized it as a way to create a bag of words vector for a sentence.
For those who don’t know bag of words: If we have a series of sentences (documents)

Top Data Science Resources on the Internet right now
https://mlwhiz.com/blog/2017/03/26/top_data_science_resources_on_the_internet_right_now/
Sun, 26 Mar 2017 00:00:00 +0000
I have been looking to create this list for a while now. There are many people on Quora who ask me how I started in the data science field. And so I wanted to create this reference.
To be frank, when I first started learning, it all looked very utopian and out of this world. The Andrew Ng course felt like black magic. And it still doesn’t cease to amaze me.

Basics Of Linear Regression
https://mlwhiz.com/blog/2017/03/23/basics_of_linear_regression/
Thu, 23 Mar 2017 00:00:00 +0000
Today we will look into the basics of linear regression. Here we go:
Contents: Simple Linear Regression (SLR), Multiple Linear Regression (MLR), Assumptions.
1. Simple Linear Regression: Regression is the process of building a relationship between a dependent variable and a set of independent variables. Linear Regression restricts this relationship to be linear in terms of coefficients. In SLR, we consider only one independent variable.
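As a rough illustration of SLR, here is a minimal Python sketch that fits y = intercept + slope * x by ordinary least squares. The function name `fit_simple_linear_regression` and the sample numbers are assumptions made up for this example; they are not taken from the post or from the Waist Circumference – Adipose Tissue data.

```python
def fit_simple_linear_regression(x, y):
    """Return (intercept, slope) of the least-squares line y = intercept + slope * x."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("x and y must have the same length, with at least 2 points")

    mean_x = sum(x) / n
    mean_y = sum(y) / n

    # Sum of squared deviations of x, and sum of cross-deviations of x and y.
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    if sxx == 0:
        raise ValueError("x must contain at least two distinct values")
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))

    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


# Hypothetical data lying exactly on the line y = 1 + 2x.
x = [1.0, 2.0, 3.0, 4.0]
y = [3.0, 5.0, 7.0, 9.0]
intercept, slope = fit_simple_linear_regression(x, y)
print(f"intercept={intercept:.2f}, slope={slope:.2f}")
```

The model is linear in its two coefficients (intercept and slope), which is the sense of "linear in terms of coefficients" above.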
Example: The Waist Circumference – Adipose Tissue data. Studies have shown that individuals with excess Adipose tissue (AT) in the abdominal region have a higher risk of cardio-vascular diseases

Top advice for a Data Scientist
https://mlwhiz.com/blog/2017/03/05/think_like_a_data_scientist/
Sun, 05 Mar 2017 00:00:00 +0000
A data scientist needs to be critical and always on the lookout for something that others miss. So here is some advice that one can apply in day to day data science work to get better at it:
1. Beware of the Clean Data Syndrome: You need to ask yourself questions even before you start working on the data. Does this data make sense? Falsely assuming that the data is clean could lead you towards wrong hypotheses.

Machine Learning Algorithms for Data Scientists
https://mlwhiz.com/blog/2017/02/05/ml_algorithms_for_data_scientist/
Sun, 05 Feb 2017 00:00:00 +0000
As a data scientist I believe that a lot of work has to be done before Classification/Regression/Clustering methods are applied to the data you get, data which may be messy, unwieldy and big. So here is the list of algorithms that help a data scientist make better models using the data they have:
1. Sampling Algorithms. In case you want to work with a sample of data.

Things to see while buying a Mutual Fund
https://mlwhiz.com/blog/2016/12/24/mutual_fund_ratios/
Sat, 24 Dec 2016 00:00:00 +0000
This post deviates from the pattern of the blogs I have written till now, but I found that Finance also uses a lot of Statistics, so it is not a stretch to put it on my blog. I recently started investing in Mutual funds, so I thought of researching the area before going all in. Here is the result of some of my research.

Pandas For All - Some Basic Pandas Functions
https://mlwhiz.com/blog/2016/10/27/baby_panda/
Thu, 27 Oct 2016 00:00:00 +0000
I have been working with Pandas for quite a few days now and I feel I have gotten quite good at it. (Quite a braggart, I know.) So I thought about adding a post about Pandas usage here. I intend to make this post quite practical, and since I find the pandas syntax quite self-explanatory, I won’t be explaining much of the code. Just the use cases and the code to achieve them.

Deploying ML Apps using Python and Flask- Learning about Flask
https://mlwhiz.com/blog/2016/01/10/deploying_ml_apps_using_python_flask/
Sun, 10 Jan 2016 00:00:00 +0000
It has been a long time since I wrote anything on my blog. So I thought about giving everyone a treat this time. Or so I think it is.
Recently I was thinking about a way to deploy all these machine learning models I create in python. I searched through the web but couldn’t find anything nice and easy. Then I came across this book by Sebastian Raschka and I knew that it was what I was looking for.

Shell Basics every Data Scientist Should know - Part II(AWK)
https://mlwhiz.com/blog/2015/10/11/shell_basics_for_data_science_2/
Sun, 11 Oct 2015 00:00:00 +0000
Yesterday I got introduced to awk programming on the shell, and is it cool. It lets you do stuff on the command line which you never imagined. As a matter of fact, it’s a whole data analytics software in itself when you think about it. You can do selections, groupby, mean, median, sum, duplication, append. You just ask. There is no limit actually.
And it is easy to learn.

Shell Basics every Data Scientist Should know -Part I
https://mlwhiz.com/blog/2015/10/09/shell_basics_for_data_science/
Fri, 09 Oct 2015 00:00:00 +0000
Shell commands are powerful. And life would be like hell without shell, as I like to say (and that is probably the reason that I dislike Windows).
Consider a case when you have a 6 GB pipe-delimited file sitting on your laptop and you want to find out the count of distinct values in one particular column. You can probably do this in more than one way. You could put that file in a database and run SQL commands, or you could write a python/perl script.

Create basic graph visualizations with SeaBorn- The Most Awesome Python Library For Visualization yet
https://mlwhiz.com/blog/2015/09/13/seaborn_visualizations/
Sun, 13 Sep 2015 00:00:00 +0000
When it comes to data preparation and getting acquainted with data, the one step we normally skip is the data visualization. While a part of it could be attributed to the lack of good visualization tools for the platforms we use, most of us also get lazy at times.
Now, as far as we know, Python never had a good visualization library. For most of my plotting needs, I would read blogs, hack together StackOverflow solutions and haggle with the Matplotlib documentation each and every time I needed to make a simple graph.

Learning Spark using Python: Basics and Applications
https://mlwhiz.com/blog/2015/09/07/spark_basics_explain/
Mon, 07 Sep 2015 00:00:00 +0000
I generally have a use case for Hadoop in my daily job. It has made my life easier in the sense that I am able to get results which I was not able to get with SQL queries. But I still find it painfully slow. I have to write procedural programs while I work: merge these two datasets, then filter, then merge another dataset, then filter using some condition, and yada-yada.

Behold the power of MCMC
https://mlwhiz.com/blog/2015/08/21/mcmc_algorithm_cryptography/
Fri, 21 Aug 2015 00:00:00 +0000
Last time I wrote an article on MCMC and how it could be useful. We learned how MCMC chains could be used to simulate from a random variable whose distribution is only partially known, i.e. we don’t know the normalizing constant.
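To make the "unknown normalizing constant" point concrete, here is a minimal random-walk Metropolis sketch in Python. It is one common MCMC method and not necessarily the one used in the earlier post; the function `metropolis_sample`, the `target` density, and all parameter values are assumptions chosen for illustration.

```python
import math
import random


def metropolis_sample(unnormalized_density, start, n_samples, step=1.0, seed=0):
    """Draw samples using only an unnormalized density (random-walk Metropolis)."""
    rng = random.Random(seed)
    current = start
    current_density = unnormalized_density(current)
    samples = []
    for _ in range(n_samples):
        # Propose a symmetric Gaussian step around the current point.
        proposal = current + rng.gauss(0.0, step)
        proposal_density = unnormalized_density(proposal)
        # Accept with probability min(1, proposal_density / current_density).
        # Only the ratio matters, so the normalizing constant cancels out.
        if rng.random() * current_density < proposal_density:
            current, current_density = proposal, proposal_density
        samples.append(current)
    return samples


def target(x):
    # A standard normal density with its constant 1/sqrt(2*pi) deliberately dropped.
    return math.exp(-0.5 * x * x)


samples = metropolis_sample(target, start=0.0, n_samples=20000)
mean = sum(samples) / len(samples)
variance = sum((s - mean) ** 2 for s in samples) / len(samples)
print(f"mean={mean:.2f}, variance={variance:.2f}")
```

The sample mean and variance should land near 0 and 1 even though the sampler never sees the constant that makes the density integrate to one.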
So MCMC methods may sound interesting to some (for them, what follows is a treat), and for those who don’t really appreciate MCMC yet, I hope I will be able to pique your interest by the end of this blog post.

My Tryst With MCMC Algorithms
https://mlwhiz.com/blog/2015/08/19/mcmc_algorithms_b_distribution/
Wed, 19 Aug 2015 00:00:00 +0000
The things that I find hard to understand push me to my limits. One of the things that I have always found hard is Markov Chain Monte Carlo Methods. When I first encountered them, I read a lot about them but mostly it ended like this.
The meaning is normally hidden in deep layers of mathematical noise and not easy to decipher. This blog post is intended to clear up the confusion around MCMC methods, show what they are actually useful for, and get hands-on with some applications.

Hadoop Mapreduce Streaming Tricks and Techniques
https://mlwhiz.com/blog/2015/05/09/hadoop_mapreduce_streaming_tricks_and_technique/
Sat, 09 May 2015 00:00:00 +0000
I have been using Hadoop a lot nowadays and thought about writing up some of the novel techniques that a user could use to get the most out of the Hadoop Ecosystem.
Using Shell Scripts to run your Programs: I am not a fan of large bash commands, the ones where you have to specify the whole path of the jar files and such. You can effectively organize your workflow by using shell scripts.

Exploring Vowpal Wabbit with the Avazu Clickthrough Prediction Challenge
https://mlwhiz.com/blog/2014/12/01/exploring_vowpal_wabbit_avazu/
Mon, 01 Dec 2014 00:00:00 +0000
In online advertising, click-through rate (CTR) is a very important metric for evaluating ad performance. As a result, click prediction systems are essential and widely used for sponsored search and real-time bidding.
For this competition, we have provided 11 days worth of Avazu data to build and test prediction models. Can you find a strategy that beats standard classification algorithms? The winning models from this competition will be released under an open-source license.

Data Science 101 : Playing with Scraping in Python
https://mlwhiz.com/blog/2014/10/02/data_science_101_python_pattern/
Thu, 02 Oct 2014 00:00:00 +0000
This is a simple illustration of using the Pattern module to scrape web data using Python. We will be scraping the data from IMDb for the top TV series along with their ratings.
We will be using this link for this:
http://www.imdb.com/search/title?count=100&num_votes=5000,&ref_=gnr_tv_hr&sort=user_rating,desc&start=1&title_type=tv_series,mini_series
This URL gives a list of top rated TV series which have at least 5000 votes. The thing to note in this URL is the “&start=” parameter, where we can specify which review the list should begin with.

Dictvectorizer for One Hot Encoding of Categorical Data
https://mlwhiz.com/blog/2014/09/30/dictvectorizer_one_hot_encoding/
Tue, 30 Sep 2014 00:00:00 +0000
THE PROBLEM: Recently I was working on the Criteo Advertising Competition on Kaggle. The competition was a classification problem which basically involved predicting the click through rates based on several features provided in the train data. Seeing the size of the data (11 GB Train), I felt that going with Vowpal Wabbit might be a better option.
But after getting to a CV error of .47 on the Kaggle LB and being stuck there, I felt the need to go back to Scikit learn.

Learning pyspark – Installation – Part 1
https://mlwhiz.com/blog/2014/09/28/learning_pyspark/
Sun, 28 Sep 2014 00:00:00 +0000
This is part one of a learning series on pyspark, which is a Python binding to the Spark program written in Scala.
The installation is pretty simple. These steps were done on Mac OS Mavericks but should work for Linux too. Here are the steps for the installation:
1. Download the Binaries:
Spark : http://spark.apache.org/downloads.html
Scala : http://www.scala-lang.org/download/
Don’t use the latest version of Scala; use Scala 2.10.x
2. Add these lines to your .

Hadoop, Mapreduce and More – Part 1
https://mlwhiz.com/blog/2014/09/27/hadoop_mapreduce/
Sat, 27 Sep 2014 00:00:00 +0000
I had been putting off learning Hadoop for some time. I finally got some free time and realized that Hadoop may not be so difficult after all. What I finally understood is that Hadoop basically comprises 3 elements:
A File System, Map – Reduce, and its many individual Components. Let’s go through each of them one by one.
1. Hadoop as a File System: One of the main things that Hadoop provides is cheap data storage.