My Tryst With MCMC Algorithms

The things that I find hard to understand push me to my limits. One of the things that I have always found hard is Markov Chain Monte Carlo Methods. When I first encountered them, I read a lot about them but mostly it ended like this.

The meaning is normally hidden in deep layers of Mathematical noise and not easy to decipher. This blog post is intended to clear up the confusion around MCMC methods, Know what they are actually useful for and Get hands on with some applications.

So what really are MCMC Methods?

First of all we have to understand what are Monte Carlo Methods!!!

Monte Carlo methods derive their name from Monte Carlo Casino in Monaco. There are many card games that need probability of winning against the dealer. Sometimes calculating this probability can be mathematically complex or highly intractable. But we can always run a computer simulation to simulate the whole game many times and see the probability as the number of wins divided by the number of games played.

So that is all you need to know about Monte carlo Methods. Yes it is just a simple simulation technique with a Fancy Name.

So as we have got the first part of MCMC, we also need to understand what are Markov Chains. Before Jumping onto Markov Chains let us learn a little bit about Markov Property.

Suppose you have a system of $$M$$ possible states, and you are hopping from one state to another. Markov Property says that given a process which is at a state $$X_n$$ at a particular point of time, the probability of $$X_{n+1} = k$$, where $$k$$ is any of the $$M$$ states the process can hop to, will only be dependent on which state it is at the given moment of time. And not on how it reached the current state.

Mathematically speaking:

$$P(X_{n+1}=k | X_n=k_n,X_{n-1}=k_{n-1},....,X_1=k_1) = P(X_{n+1}=k|X_n=k_n)$$

If a process exhibits the Markov Property than it is known as a Markov Process.

Now Why is a Markov Chain important? It is important because of its stationary distribution.

So what is a Stationary Distribution?

Assume you have a markov process like below. You start from any state $$X_i$$ and want to find out the state Probability distribution at $$X_{i+1}$$.

You have a matrix of transition probability

which defines the probability of going from a state $$X_i$$ to $$X_j$$. You start calculating the Probability distribution for the next state. If you are at Bull Market State at time $$i$$ , you have a state Probability distribution as [0,1,0]

you want to get the state pdf at $$X_{i+1}$$. That is given by

$$s_{i+1} = s_{i}Q$$
$$s_{i+1}=\left[ {\begin{array}{cc} .15 & .8 & .05 \end{array} } \right]$$

And the next state distribution could be found out by

$$s_{i+1} = s_iQ^2$$

and so on. Eventually you will reach a stationary state s where:

$$sQ=s$$
For this transition matrix Q the Stationary distribution $$s$$ is
$$s_{i+1}=\left[ {\begin{array}{cc} .625 & .3125 & .0625 \end{array} } \right]$$

The stationary state distribution is important because it lets you define the probability for every state of a system at a random time. That is for this particular example we can say that 62.5% of the times market will be in a bull market state, 31.25% of weeks it will be a bear market and 6.25% of weeks it will be stagnant

Intuitively you can think of it as an random walk on a chain. You might visit some nodes more often than others based on node probabilities. In the Google Pagerank problem you might think of a node as a page, and the probability of a page in the stationary distribution as its relative importance.

Woah! That was a lot of information and we have yet not started talking about the MCMC Methods. Well if you are with me till now, we can now get on to the real topic now.

So What is MCMC?

According to Wikipedia:

Markov Chain Monte Carlo (MCMC) methods are a class of algorithms for sampling from a probability distribution based on constructing a Markov chain that has the desired distribution as its stationary distribution. The state of the chain after a number of steps is then used as a sample of the desired distribution. The quality of the sample improves as a function of the number of steps.

So let's explain this with an example: Assume that we want to sample from a Beta distribution. The PDF is:

$$f(x) = Cx^{\alpha -1}(1-x)^{\beta -1}$$
where $$C$$ is the normalizing constant (which we actually don't need to Sample from the distribution as we will see later).

This is a fairly difficult problem with the Beta Distribution if not intractable. In reality you might need to work with a lot harder Distribution Functions and sometimes you won't actually know the normalizing constants.

MCMC methods make life easier for us by providing us with algorithms that could create a Markov Chain which has the Beta distribution as its stationary distribution given that we can sample from a uniform distribution(which is fairly easy).

If we start from a random state and traverse to the next state based on some algorithm repeatedly, we will end up creating a Markov Chain which has the Beta distribution as its stationary distribution and the states we are at after a long time could be used as sample from the Beta Distribution.

One such MCMC Algorithm is the Metropolis Hastings Algorithm.

Metropolis Hastings Algorithm

Let $$s=(s_1,s_2,....,s_M)$$ be the desired stationary distribution. We want to create a Markov Chain that has this stationary distribution. We start with an arbitrary Markov Chain $$P$$ with $$M$$ states with transition matrix $$Q$$, so that $$Q_{ij}$$ represents the probability of going from state $$i$$ to $$j$$. Intuitively we know how to wander around this Markov Chain but this Markov Chain does not have the required Stationary Distribution. This chain does have some stationary distribution(which is not of our use)

Our Goal is to change the way we wander on the this Markov Chain $$P$$ so that this chain has the desired Stationary distribution.

To do this we:

1. Start at a random initial State $$i$$.
2. Randomly pick a new Proposal State by looking at the transition probabilities in the ith row of the transition matrix Q.
3. Compute an measure called the Acceptance Probability which is defined as:
$$a_{ij} = min(s_jp_{ji}/s_{i}p_{ij},1)$$
4. Now Flip a coin that lands head with probability $$a_{ij}$$. If the coin comes up heads, accept the proposal i.e move to next state else reject the proposal i.e. stay at the current state.
5. Repeat for a long time

After a long time this chain will converge and will have a stationary distribution $$s$$. We can then use the states of the chain as the sample from any distribution.

While doing this to sample the Beta Distribution, the only time we are using the PDF is to find the acceptance probability and in that we divide $$s_j$$ by $$s_i$$, i.e. the normalizing constant $$C$$ gets cancelled.

Now Let's Talk about the intuition. For the Intuition I am quoting an Answer from the site Stack Exchange,as this was the best intuitive explanation that I could find:

I think there's a nice and simple intuition to be gained from the (independence-chain) Metropolis-Hastings algorithm.

First, what's the goal? The goal of MCMC is to draw samples from some probability distribution without having to know its exact height at any point(We don't need to know C). The way MCMC achieves this is to "wander around" on that distribution in such a way that the amount of time spent in each location is proportional to the height of the distribution. If the "wandering around" process is set up correctly, you can make sure that this proportionality (between time spent and height of the distribution) is achieved.

Intuitively, what we want to do is to to walk around on some (lumpy) surface in such a way that the amount of time we spend (or # samples drawn) in each location is proportional to the height of the surface at that location. So, e.g., we'd like to spend twice as much time on a hilltop that's at an altitude of 100m as we do on a nearby hill that's at an altitude of 50m. The nice thing is that we can do this even if we don't know the absolute heights of points on the surface: all we have to know are the relative heights. e.g., if one hilltop A is twice as high as hilltop B, then we'd like to spend twice as much time at A as we spend at B.

The simplest variant of the Metropolis-Hastings algorithm (independence chain sampling) achieves this as follows: assume that in every (discrete) time-step, we pick a random new "proposed" location (selected uniformly across the entire surface). If the proposed location is higher than where we're standing now, move to it. If the proposed location is lower, then move to the new location with probability p, where p is the ratio of the height of that point to the height of the current location. (i.e., flip a coin with a probability p of getting heads; if it comes up heads, move to the new location; if it comes up tails, stay where we are). Keep a list of the locations you've been at on every time step, and that list will (asyptotically) have the right proportion of time spent in each part of the surface. (And for the A and B hills described above, you'll end up with twice the probability of moving from B to A as you have of moving from A to B).

There are more complicated schemes for proposing new locations and the rules for accepting them, but the basic idea is still: (1) pick a new "proposed" location; (2) figure out how much higher or lower that location is compared to your current location; (3) probabilistically stay put or move to that location in a way that respects the overall goal of spending time proportional to height of the location. """

Sampling from Beta Distribution

Now Let's Move on to the problem of Simulating from Beta Distribution. Now Beta Distribution is a continuous Distribution on [0,1] and it can have infinite states on [0,1].

Lets Assume an arbitrary Markov Chain P with infinite states on [0,1] having transition Matrix Q such that $Q_{ij} = Q_{ji} =$ All entries in Matrix. We don't really need the Matrix Q as we will see later, But I want to keep the problem description as close to the algorihm we suggested.

1. Start at a random initial State $$i$$ given by Unif(0,1).
2. Randomly pick a new Proposal State by looking at the transition probabilities in the ith row of the transition matrix Q. Lets say we pick up another Unif(0,1) state as a proposal state $$j$$.
3. Compute an measure called the Acceptance Probability :
$$a_{ij} = min(s_jp_{ji}/s_{i}p_{ij},1)$$
which is,
$$a_{ij} = min(s_j/s_i,1)$$
where,
$$s_i = Ci^{\alpha -1}(1-i)^{\beta -1}$$
and,
$$s_j = Cj^{\alpha -1}(1-j)^{\beta -1}$$
4. Now Flip a coin that lands head with probability $$a_{ij}$$. If the coin comes up heads, accept the proposal i.e move to next state else reject the proposal i.e. stay at the current state.
5. Repeat for a long time

So enough with theory, Let's Move on to python to create our Beta Simulations Now....

import random
# Lets define our Beta Function to generate s for any particular state. We don't care for the normalizing constant here.
def beta_s(w,a,b):
return w**(a-1)*(1-w)**(b-1)

# This Function returns True if the coin with probability P of heads comes heads when flipped.
def random_coin(p):
unif = random.uniform(0,1)
if unif>=p:
return False
else:
return True

# This Function runs the MCMC chain for Beta Distribution.
def beta_mcmc(N_hops,a,b):
states = []
cur = random.uniform(0,1)
for i in range(0,N_hops):
states.append(cur)
next = random.uniform(0,1)
ap = min(beta_s(next,a,b)/beta_s(cur,a,b),1) # Calculate the acceptance probability
if random_coin(ap):
cur = next
return states[-1000:] # Returns the last 100 states of the chain


Let us check our results of the MCMC Sampled Beta distribution against the actual beta distribution.

import numpy as np
import pylab as pl
import scipy.special as ss
%matplotlib inline
pl.rcParams['figure.figsize'] = (17.0, 4.0)

# Actual Beta PDF.
def beta(a, b, i):
e1 = ss.gamma(a + b)
e2 = ss.gamma(a)
e3 = ss.gamma(b)
e4 = i ** (a - 1)
e5 = (1 - i) ** (b - 1)
return (e1/(e2*e3)) * e4 * e5

# Create a function to plot Actual Beta PDF with the Beta Sampled from MCMC Chain.
def plot_beta(a, b):
Ly = []
Lx = []
i_list = np.mgrid[0:1:100j]
for i in i_list:
Lx.append(i)
Ly.append(beta(a, b, i))
pl.plot(Lx, Ly, label="Real Distribution: a="+str(a)+", b="+str(b))
pl.hist(beta_mcmc(100000,a,b),normed=True,bins =25, histtype='step',label="Simulated_MCMC: a="+str(a)+", b="+str(b))
pl.legend()
pl.show()

plot_beta(0.1, 0.1)
plot_beta(1, 1)
plot_beta(2, 3)


As we can see our sampled beta values closely resemble the beta distribution.

So MCMC Methods are useful for the following basic problems.

1. Simulating from a Random Variable PDF. Example: Simulate from a Beta(0.5,0.5) or from a Normal(0,1).
2. Solve problems with a large state space.For Example: Knapsack Problem, Encrytion Cipher etc. We will work on this in the Next Blog Post as this one has already gotten bigger than what I expected.

Till Then Ciao!!!!!!

References and Sources:

One of the newest and best resources that you can keep an eye on is the Bayesian Methods for Machine Learning course in the Advanced machine learning specialization created jointly by Kazanova(Number 3 Kaggler at the time of writing)

Apart from that I also found a course on Bayesian Statistics on Coursera. In the process of doing it right now so couldn't really comment on it. But since I had done an course on Inferential Statistics taught by the same professor before(Mine Ã‡etinkaya-Rundel), I am very hopeful for this course. Let's see.