Friday, December 1, 2017

Monte Carlo Sampling

Why use Monte Carlo Sampling?

  1. To sample from a distribution $p(x)$ - often a posterior
  2. To calculate expectations $\mathbb{E}_{p(x)}[f(x)]$

Here, 1 and 2 are related. If we can sample from $p(x)$, we can also calculate $\mathbb{E}_{p(x)}[f(x)]$ by application of the law of large numbers. Suppose $x_1, \dots, x_N$ are i.i.d. samples drawn from $p(x)$. Then the law of large numbers says:

$$\frac{1}{N}\sum_{i=1}^{N} f(x_i) \xrightarrow{\;N \to \infty\;} \mathbb{E}_{p(x)}[f(x)]$$

Properties of Monte Carlo Sampling

  • The estimator $\frac{1}{N}\sum_{i=1}^{N} f(x_i)$ is unbiased
  • Its variance shrinks as $\mathcal{O}(1/N)$ with the number of samples (see the sketch below)
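
Here is a minimal sketch in NumPy of the plain Monte Carlo estimator. The target $p(x)$ is taken to be a standard normal and $f(x) = x^2$ (both purely illustrative choices), so the true expectation is $1$ and we can watch the estimate improve with $N$:

```python
import numpy as np

# Illustrative setup: p(x) = N(0, 1) and f(x) = x^2, so E_p[f(x)] = 1.
rng = np.random.default_rng(0)

def mc_estimate(n_samples):
    x = rng.standard_normal(n_samples)   # i.i.d. samples x_i ~ p(x)
    return np.mean(x ** 2)               # (1/N) * sum_i f(x_i)

for n in [100, 10_000, 1_000_000]:
    print(n, mc_estimate(n))             # estimate approaches 1 as N grows
```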

Inversion Method

  • To sample from a distribution $p(x)$, use the inverse of the CDF of the distribution and a uniform random number generator.
  • The CDF of a distribution is given as $F(x) = \int_{-\infty}^{x} p(t)\,dt$


Steps:

  • Draw a number $u$ from a uniform generator, $u \sim U(0, 1)$
  • Assign this number as the value of the CDF, i.e. $F(x) = u$. Then $x = F^{-1}(u)$ is a sample from $p(x)$ (see the sketch below)
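
A small sketch of the inversion method, assuming the target is an Exponential($\lambda$) distribution, since its CDF $F(x) = 1 - e^{-\lambda x}$ has the closed-form inverse $F^{-1}(u) = -\log(1 - u)/\lambda$:

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_exponential(rate, n_samples):
    """Inversion method for an Exponential(rate) distribution."""
    u = rng.uniform(0.0, 1.0, size=n_samples)   # u ~ Uniform(0, 1)
    return -np.log1p(-u) / rate                 # x = F^{-1}(u)

samples = sample_exponential(rate=2.0, n_samples=100_000)
print(samples.mean())   # should be close to 1 / rate = 0.5
```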

Cons:

  • If we can’t compute $F^{-1}$ (in closed form), then we won’t be able to use this method.

Rejection Sampling Method

  • To sample from $p(x)$, we consider another distribution $q(x)$ from which we can sample, under the restriction $p(x) \le M q(x)$ for some constant $M$ - i.e., we are upper bounding the distribution $p(x)$ with $M q(x)$.


Steps:

  • Generate a sample $x \sim q(x)$
  • Accept this sample with a certain probability. How?

    • Generate a sample $u$ from a uniform distribution $U(0, 1)$
    • Accept $x$ if $u \le \frac{p(x)}{M q(x)}$
      Alternatively,

    • Generate a sample $u$ from a uniform distribution $U(0, M q(x))$

    • Accept this sample if $u \le p(x)$

This method accepts a fraction $1/M$ of the proposed points on average.
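
A sketch of rejection sampling for a concrete (made-up) choice of target and proposal: $p(x)$ is a Beta(2, 5) density, $q(x)$ is Uniform(0, 1), and $M = 2.5$ upper bounds $p(x)$ on $[0, 1]$, so roughly $1/M = 40\%$ of proposals get accepted:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
target = stats.beta(2, 5)   # p(x), supported on [0, 1]
M = 2.5                     # p(x) <= M * q(x) with q = Uniform(0, 1); max of this pdf is ~2.46

def rejection_sample(n_samples):
    samples = []
    while len(samples) < n_samples:
        x = rng.uniform(0.0, 1.0)        # x ~ q(x)
        u = rng.uniform(0.0, 1.0)        # u ~ Uniform(0, 1)
        if u <= target.pdf(x) / M:       # accept with probability p(x) / (M q(x))
            samples.append(x)
    return np.array(samples)

samples = rejection_sample(10_000)
print(samples.mean())   # close to the Beta(2, 5) mean 2/7 ≈ 0.286
```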

Pros:

  • Works for most distributions (even unnormalized).
  • Even works if we know our distribution only up to a normalization constant, as happens in probabilistic modeling - i.e. if we know $\tilde{p}(x) \propto p(x)$, then we can upper bound $\tilde{p}(x)$ by $M q(x)$ and use this method to perform sampling.

Cons:

  • In higher dimensions, $M$ is high. Then, this method rejects most of the points.

Importance Sampling

Unlike the inversion method or rejection sampling, here the goal is to compute the expectation $\mathbb{E}_{p(x)}[f(x)]$ directly.

Steps:

  • Sample $x_1, \dots, x_N$ from a different distribution $q(x)$
  • Define the importance weight $w_i = \frac{p(x_i)}{q(x_i)}$
  • Then, $\frac{1}{N}\sum_{i=1}^{N} w_i\,f(x_i)$ is in fact an unbiased estimate of $\mathbb{E}_{p(x)}[f(x)]$

Proof:

$$\mathbb{E}_{q(x)}\!\left[\frac{p(x)}{q(x)}\,f(x)\right] = \int q(x)\,\frac{p(x)}{q(x)}\,f(x)\,dx = \int p(x)\,f(x)\,dx = \mathbb{E}_{p(x)}[f(x)]$$

  • In practice, we would like to choose $q$ as close as possible to $p$ to reduce the variance of our estimator (see the sketch below)
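
A sketch of importance sampling with made-up choices: the target $p$ is $N(0, 1)$, the proposal $q$ we actually draw from is $N(0, 2^2)$, and $f(x) = x^2$, so the weighted estimate should come out near $1$:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
p = stats.norm(0.0, 1.0)     # target distribution p(x)
q = stats.norm(0.0, 2.0)     # proposal distribution q(x)

n = 100_000
x = rng.normal(0.0, 2.0, size=n)   # x_i ~ q(x)
w = p.pdf(x) / q.pdf(x)            # importance weights w_i = p(x_i) / q(x_i)
print(np.mean(w * x ** 2))         # (1/N) * sum_i w_i f(x_i) ≈ E_p[x^2] = 1
```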

Markov Chain Monte Carlo (MCMC)

Rejection sampling and importance sampling perform poorly in higher dimensions. So, instead we use MCMC sampling.

Markov Chain

See this book for details.

  • In a first-order Markov chain, the transition to the next state depends only on the previous state. A stochastic process satisfies the Markov property if $P(x_{t+1} \mid x_t, x_{t-1}, \dots, x_1) = P(x_{t+1} \mid x_t)$
  • For a Markov chain, we have an initial state $x_0$ and a transition matrix $T$.

(Figure: a two-state Markov chain for the frog example, with transition matrix $T$.)

Here, $T_{ij}$ is the probability of moving from state $i$ to state $j$, so each row of $T$ sums to $1$.

  • Assuming the frog starts at state $1$, i.e. $x_0 = [1, 0]$, then after $n$ steps the probability that the frog will be in state $1$ or state $2$ is given by $x_n = x_0 T^n$.

    We can also replace $x_0$ with a probability vector to represent the initial state - i.e. $x_0 = [0.5, 0.5]$, the frog is equally likely to start in either state $1$ or state $2$. A small numeric sketch follows.
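
A quick numeric sketch of the frog chain. The actual transition probabilities were only given in the figure, so the entries of $T$ below are made up for illustration; row $i$ holds the probabilities of hopping out of state $i$:

```python
import numpy as np

# Hypothetical transition matrix for the two-state frog chain (rows sum to 1).
T = np.array([[0.7, 0.3],
              [0.4, 0.6]])

x0 = np.array([1.0, 0.0])      # the frog starts in state 1 with certainty
for n in [1, 2, 5, 50]:
    xn = x0 @ np.linalg.matrix_power(T, n)   # x_n = x_0 T^n
    print(n, xn)
# Replacing x0 with [0.5, 0.5] models a frog equally likely to start in either state.
```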

Regular Markov Chain & Stationary Distribution

  • A Markov chain is regular if all the entries in the transition matrix are non-zero (this is a sufficient condition, not a necessary condition).
  • For a regular Markov chain, long-range predictions are independent of the starting state, i.e. it doesn’t depend on whether the frog started in state $1$ or state $2$. In other words, $x_0 T^n \to \pi$ for any choice of $x_0$ when $n \to \infty$. Here, $\pi$ is a stationary distribution such that $\pi T = \pi$. One interesting property is that all the entries in a column of $T^n$ will have the same value as $n \to \infty$, because every row of $T^n$ converges to $\pi$ (see the numeric check below).

    And, $x_0 T^n \to \pi$ for any probability vector $x_0$.

A distribution is said to be invariant/stationary w.r.t. a Markov chain if the transition function of that chain leaves that distribution unchanged. E.g. for a Markov chain with transition operator $T$, a distribution $\pi$ is considered a stationary distribution if

$$\pi(x') = \sum_x T(x' \mid x)\,\pi(x) \quad \text{for all } x'$$
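
Continuing with the made-up frog matrix from the sketch above, we can find the stationary distribution as the left eigenvector of $T$ with eigenvalue $1$ and check that the chain leaves it unchanged:

```python
import numpy as np

T = np.array([[0.7, 0.3],
              [0.4, 0.6]])     # same hypothetical transition matrix as before

eigvals, eigvecs = np.linalg.eig(T.T)                 # left eigenvectors of T
pi = np.real(eigvecs[:, np.argmax(np.real(eigvals))]) # eigenvector for eigenvalue 1
pi = pi / pi.sum()                                    # normalize to a probability vector
print(pi)                                             # ~ [0.571, 0.429]
print(pi @ T)                                         # equals pi, i.e. pi T = pi
print(np.linalg.matrix_power(T, 50))                  # every column has (nearly) identical entries
```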

Detailed Balance

A transition operator $T$ satisfies detailed balance if

$$\pi(x)\,T(x' \mid x) = \pi(x')\,T(x \mid x')$$

Here, $\pi(x)$ is the stationary distribution evaluated at state $x$, and $T(x' \mid x)$ is the transition probability of moving from state $x$ to state $x'$.

How to design a Markov Chain with stationary distribution $\pi$

If a transition operator $T$ satisfies detailed balance w.r.t. a particular distribution $\pi$, then that distribution will be invariant under $T$. In other words,

$$\sum_x \pi(x)\,T(x' \mid x) = \pi(x')$$

Proof:

$$\sum_x \pi(x)\,T(x' \mid x) = \sum_x \pi(x')\,T(x \mid x') = \pi(x') \sum_x T(x \mid x') = \pi(x')$$

Therefore, to design a Markov chain with a given stationary distribution $\pi$, find a transition operator $T$ such that it satisfies detailed balance w.r.t. $\pi$.

How to use Markov Chain for Sampling?

  • We want to sample from $\pi(x)$
  • Build a Markov chain that converges to $\pi(x)$
    • Start from any initial state $x_0$
    • For $k = 0, 1, 2, \dots$: sample $x_{k+1} \sim T(x_{k+1} \mid x_k)$
    • Eventually $x_k$ will look like samples from $\pi(x)$
      Here $T(x_{k+1} \mid x_k)$ is the transition probability from state $x_k$ to state $x_{k+1}$.

Metropolis-Hastings Algorithm

This applies an idea somewhat similar to rejection sampling to a Markov chain. We start with a “wrong” Markov chain (a proposal distribution $Q$), and introduce a critic. The critic ensures that the random walk does not go too far away from the desired distribution.

For $k = 1, 2, \dots$
– Sample a proposal $x'$ from the wrong $Q(x' \mid x_k)$
– Accept the proposal with probability $A(x' \mid x_k)$, i.e. set $x_{k+1} = x'$
– Otherwise stay at $x_k$, i.e. set $x_{k+1} = x_k$

The transition probability of the above algorithm is as follows:

$$T(x' \mid x) = Q(x' \mid x)\,A(x' \mid x), \qquad x' \neq x$$

How to choose critic

We choose the critic $A$ such that the transition probability described above converges to the desired distribution $\pi(x)$. We can use detailed balance for this purpose:

$$\pi(x)\,Q(x' \mid x)\,A(x' \mid x) = \pi(x')\,Q(x \mid x')\,A(x \mid x')$$

Now, we assign $A(x \mid x') = 1$; then detailed balance gives $A(x' \mid x) = \frac{\pi(x')\,Q(x \mid x')}{\pi(x)\,Q(x' \mid x)}$, which is a valid probability as long as this ratio is $\le 1$. When the ratio is greater than $1$, we can simply choose $A(x' \mid x) = 1$ instead. I.e.

$$A(x' \mid x) = \min\!\left(1,\; \frac{\pi(x')\,Q(x \mid x')}{\pi(x)\,Q(x' \mid x)}\right)$$

Things to note:

  • We are only using the ratio of the desired distribution - this means we don’t need to know the exact distribution. So, there is no issue with the normalization constant.
  • Choice of $Q$:
    • $Q$ should spread out, i.e. a $Q$ with high variance will give us less correlated samples. But this also means the probability of rejection will increase. On the other hand, a $Q$ with low variance will take a long time to converge. See the following figures:

(Figures: behavior of the sampler for a $Q$ with proper variance, compared with higher- and lower-variance choices of $Q$.)
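
A sketch of Metropolis-Hastings with a Gaussian random-walk proposal $Q$. The target here is a made-up unnormalized density $\tilde{p}(x) \propto e^{-x^4/4}$; note that only ratios of $\tilde{p}$ are ever used, so the normalization constant never appears:

```python
import numpy as np

rng = np.random.default_rng(0)

def p_tilde(x):
    """Unnormalized target density, p(x) ∝ exp(-x^4 / 4)."""
    return np.exp(-x ** 4 / 4.0)

def metropolis_hastings(n_samples, proposal_std=1.0):
    x = 0.0                      # start from an arbitrary state
    samples = []
    for _ in range(n_samples):
        x_prop = x + proposal_std * rng.standard_normal()   # x' ~ Q(x' | x), random walk
        # Critic: accept with probability min(1, p(x') / p(x)).
        # The Q terms cancel because the random-walk proposal is symmetric.
        if rng.uniform() < min(1.0, p_tilde(x_prop) / p_tilde(x)):
            x = x_prop
        samples.append(x)
    return np.array(samples)

samples = metropolis_hastings(50_000)
print(samples.mean(), samples.var())   # mean ≈ 0 for this symmetric target
```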


Gibbs Sampling

We want to sample from a joint distribution $p(x_1, x_2, x_3)$.

Initialize $x_1^0$, $x_2^0$, $x_3^0$.
For $k = 0, 1, 2, \dots$

$$x_1^{k+1} \sim p(x_1 \mid x_2^{k},\, x_3^{k})$$
$$x_2^{k+1} \sim p(x_2 \mid x_1^{k+1},\, x_3^{k})$$
$$x_3^{k+1} \sim p(x_3 \mid x_1^{k+1},\, x_2^{k+1})$$

The 1D conditional distributions in the above algorithm can be calculated analytically or sampled using something like rejection sampling.
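
A sketch of Gibbs sampling for a made-up target: a 2D standard Gaussian with correlation $\rho$, for which both 1D conditionals are Gaussian and can be sampled exactly, $p(x_1 \mid x_2) = \mathcal{N}(\rho x_2,\, 1 - \rho^2)$ and symmetrically for $x_2$:

```python
import numpy as np

rng = np.random.default_rng(0)
rho = 0.8                              # correlation of the target 2D Gaussian

def gibbs(n_samples):
    x1, x2 = 0.0, 0.0                  # arbitrary initialization
    samples = np.empty((n_samples, 2))
    cond_std = np.sqrt(1.0 - rho ** 2)
    for k in range(n_samples):
        x1 = rng.normal(rho * x2, cond_std)   # x1 ~ p(x1 | x2)
        x2 = rng.normal(rho * x1, cond_std)   # x2 ~ p(x2 | x1)
        samples[k] = (x1, x2)
    return samples

samples = gibbs(50_000)
print(np.corrcoef(samples.T))          # off-diagonal entries ≈ rho = 0.8
```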

Why Gibbs Sampling works?

We need to prove that the above-mentioned sampling steps actually have $p(x_1, x_2, x_3)$ as their stationary distribution. In other words, we need to prove: if $(x_1^{k}, x_2^{k}, x_3^{k}) \sim p(x_1, x_2, x_3)$, then $(x_1^{k+1}, x_2^{k+1}, x_3^{k+1}) \sim p(x_1, x_2, x_3)$.

Proof:

Consider the first update. If $(x_1, x_2, x_3) \sim p$ and we resample $x_1' \sim p(x_1 \mid x_2, x_3)$, then the new triple has density $p(x_1' \mid x_2, x_3)\,p(x_2, x_3) = p(x_1', x_2, x_3)$, so it is still distributed according to $p$. The same argument applies to the $x_2$ and $x_3$ updates in turn, hence the full Gibbs sweep leaves $p$ invariant.

Metropolis-Hastings vs. Gibbs

Gibbs sampling generates highly correlated samples and takes a long time to converge. Also, since the sampling of each dimension depends on the most recent samples of the other dimensions, the scheme is not parallel. It’s possible to make it parallel by updating every coordinate at once, conditioning only on the previous iteration:

$$x_i^{k+1} \sim p(x_i \mid x_{-i}^{k}) \quad \text{for all } i \text{ in parallel}$$

where $x_{-i}$ denotes all coordinates except $x_i$.

However, this does not guarantee that the samples will converge to the desired distribution $p$. So, the alternative is to sample from the above parallel scheme and use the result as a proposal in Metropolis-Hastings, where the critic will decide whether or not to accept this point. Since the above proposal is pretty close to the desired distribution $p$, there is a high chance that the critic will accept this point.
