Recently, I was asked how to explain p-values in simple terms to a layperson, and I found it hard to do. P-values are a headache to explain even to someone who already knows about them, let alone someone who doesn’t understand statistics. I went to Wikipedia to find something, and here is the definition:
> In statistical hypothesis testing, the p-value or probability value is, for a given statistical model, the probability that, when the null hypothesis is true, the statistical summary (such as the sample mean difference between two groups) would be equal to, or more extreme than, the actual observed results.
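One way to make that definition concrete is a permutation test: shuffle the group labels many times, which simulates the null hypothesis that the groups do not differ, and count how often the shuffled difference in means is at least as extreme as the observed one. The sketch below is only an illustration under that assumption. The function name `permutation_p_value`, its parameters, and the sample numbers are hypothetical and do not come from the original post.

```python
import random
from statistics import mean


def permutation_p_value(group_a, group_b, n_permutations=10_000, seed=0):
    """Estimate a two-sided p-value for the difference in group means.

    Hypothetical helper for illustration; not taken from the original post.
    """
    rng = random.Random(seed)  # fixed seed so the estimate is reproducible
    observed = abs(mean(group_a) - mean(group_b))
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)

    at_least_as_extreme = 0
    for _ in range(n_permutations):
        # Shuffling the pooled values simulates "the null hypothesis is true":
        # group membership carries no information about the values.
        rng.shuffle(pooled)
        diff = abs(mean(pooled[:n_a]) - mean(pooled[n_a:]))
        if diff >= observed:
            at_least_as_extreme += 1

    # Fraction of shuffles that look at least as extreme as the real data.
    return at_least_as_extreme / n_permutations


if __name__ == "__main__":
    # Made-up measurements for two groups, purely for demonstration.
    control = [12, 15, 11, 14, 13]
    treated = [18, 21, 17, 20, 19]
    print(permutation_p_value(control, treated))
```

A small returned value says that a difference this large would rarely show up if the labels were meaningless. It does not say how likely the null hypothesis itself is.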
Explain Like I am 5. It is a basic tenet of learning for me: I try to distill any concept into a more palatable form. As Feynman said: “I couldn’t do it. I couldn’t reduce it to the freshman level. That means we don’t really understand it.” So, when I saw the ELI5 library, which aims to interpret machine learning models, I just had to try it out.
What do we want to optimize for? Most businesses fail to answer this simple question. Every business problem is a little different and should be optimized differently. We have all created classification models, and a lot of the time we evaluate them on accuracy. But do we really want accuracy as the metric of our model's performance? What if we are predicting the number of asteroids that will hit the earth?
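To see why accuracy alone can mislead, consider a rare-event problem. The sketch below uses made-up labels (10 positives in 1,000 samples) and a hypothetical "model" that always predicts the negative class. The helper names `accuracy` and `recall` are assumptions for illustration, not code from the original post.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def recall(y_true, y_pred, positive=1):
    """Fraction of actual positives that the model found."""
    actual_positives = sum(t == positive for t in y_true)
    true_positives = sum(
        t == positive and p == positive for t, p in zip(y_true, y_pred)
    )
    return true_positives / actual_positives if actual_positives else 0.0


# Made-up, heavily imbalanced labels: 990 negatives and 10 positives.
y_true = [0] * 990 + [1] * 10
# A trivial "model" that never predicts the rare event.
y_pred = [0] * 1000

if __name__ == "__main__":
    print("accuracy:", accuracy(y_true, y_pred))
    print("recall:  ", recall(y_true, y_pred))
```

Under these assumptions the trivial model scores 99% accuracy while finding none of the events we care about, which is why the metric should follow from the business question.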
We, as data scientists, have gotten quite comfortable with Pandas, SQL, or other relational tools. We are used to seeing our users in rows with their attributes as columns. But does the real world behave like that? In a connected world, users cannot be considered independent entities. They have relationships with each other, and we would sometimes like to include those relationships when building our machine learning models.
When we create our machine learning models, a common task that falls on us is tuning them. People end up taking different manual approaches. Some of them work and some don’t, and a lot of time is spent waiting and running the code again and again. That brings us to the quintessential question: can we automate this process? A while back, I was working on an in-class competition from the “How to win a data science competition” Coursera course.
Creating a great machine learning system is an art, and there are a lot of things to consider while building one. But it often happens that we as data scientists only worry about certain parts of the project. Most of the time that part is modeling, but in reality the success or failure of a machine learning project depends on a lot of other factors.
I always get confused whenever someone talks about generative vs. discriminative classification models. I end up reading about it again and again, yet somehow it eludes me. So I thought of writing a post on it to improve my understanding. This post is about understanding generative models and how they differ from discriminative models. In the end, we will create a simple generative model ourselves.
Discriminative vs. Generative Classifiers
Problem Statement: Given some input data X, we want to classify the data into labels y.