Friday, December 1, 2017

Monte Carlo Sampling

Why use Monte Carlo Sampling?

  1. To sample from a distribution $p(x)$ - often a posterior
  2. To calculate expectations $\mathbb{E}_{p(x)}[f(x)]$

Here, 1 and 2 are related. If we can sample from $p(x)$, we can also calculate $\mathbb{E}_{p(x)}[f(x)]$ by application of the law of large numbers. Suppose $x_1, \dots, x_N$ are i.i.d. samples drawn from $p(x)$. Then the law of large numbers says:

$$\frac{1}{N}\sum_{i=1}^{N} f(x_i) \xrightarrow{\;N \to \infty\;} \mathbb{E}_{p(x)}[f(x)]$$

Properties of Monte Carlo Sampling

  • The estimator $\frac{1}{N}\sum_{i=1}^{N} f(x_i)$ is unbiased
  • Its variance shrinks as $\mathcal{O}(1/N)$ with the number of samples (see the sketch below)
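
Here is a minimal sketch in NumPy of the plain Monte Carlo estimator. The target $p(x)$ is taken to be a standard normal and $f(x) = x^2$ (both purely illustrative choices), so the true expectation is $1$ and we can watch the estimate improve with $N$:

```python
import numpy as np

# Illustrative setup: p(x) = N(0, 1) and f(x) = x^2, so E_p[f(x)] = 1.
rng = np.random.default_rng(0)

def mc_estimate(n_samples):
    x = rng.standard_normal(n_samples)   # i.i.d. samples x_i ~ p(x)
    return np.mean(x ** 2)               # (1/N) * sum_i f(x_i)

for n in [100, 10_000, 1_000_000]:
    print(n, mc_estimate(n))             # estimate approaches 1 as N grows
```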

Inversion Method

  • To sample from a distribution $p(x)$, use the inverse of the CDF of the distribution and a uniform random number generator.
  • The CDF of a distribution is given as $F(x) = \int_{-\infty}^{x} p(t)\,dt$


Steps:

  • Draw a number $u$ from a uniform generator, $u \sim U(0, 1)$
  • Assign this number as the value of the CDF, i.e. $F(x) = u$. Then $x = F^{-1}(u)$ is a sample from $p(x)$ (see the sketch below)
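
A small sketch of the inversion method, assuming the target is an Exponential($\lambda$) distribution, since its CDF $F(x) = 1 - e^{-\lambda x}$ has the closed-form inverse $F^{-1}(u) = -\log(1 - u)/\lambda$:

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_exponential(rate, n_samples):
    """Inversion method for an Exponential(rate) distribution."""
    u = rng.uniform(0.0, 1.0, size=n_samples)   # u ~ Uniform(0, 1)
    return -np.log1p(-u) / rate                 # x = F^{-1}(u)

samples = sample_exponential(rate=2.0, n_samples=100_000)
print(samples.mean())   # should be close to 1 / rate = 0.5
```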

Cons:

  • If we can’t compute $F^{-1}$ (in closed form), then we won’t be able to use this method.

Rejection Sampling Method

  • To sample from $p(x)$, we consider another distribution $q(x)$ from which we can sample, under the restriction $p(x) \le M q(x)$ for some constant $M$ - i.e., we are upper bounding the distribution $p(x)$ with $M q(x)$.


Steps:

  • Generate a sample $x \sim q(x)$
  • Accept this sample with a certain probability. How?

    • Generate a sample $u$ from a uniform distribution $U(0, 1)$
    • Accept $x$ if $u \le \frac{p(x)}{M q(x)}$
      Alternatively,

    • Generate a sample $u$ from a uniform distribution $U(0, M q(x))$

    • Accept this sample if $u \le p(x)$

This method accepts a fraction $1/M$ of the proposed points on average.
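
A sketch of rejection sampling for a concrete (made-up) choice of target and proposal: $p(x)$ is a Beta(2, 5) density, $q(x)$ is Uniform(0, 1), and $M = 2.5$ upper bounds $p(x)$ on $[0, 1]$, so roughly $1/M = 40\%$ of proposals get accepted:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
target = stats.beta(2, 5)   # p(x), supported on [0, 1]
M = 2.5                     # p(x) <= M * q(x) with q = Uniform(0, 1); max of this pdf is ~2.46

def rejection_sample(n_samples):
    samples = []
    while len(samples) < n_samples:
        x = rng.uniform(0.0, 1.0)        # x ~ q(x)
        u = rng.uniform(0.0, 1.0)        # u ~ Uniform(0, 1)
        if u <= target.pdf(x) / M:       # accept with probability p(x) / (M q(x))
            samples.append(x)
    return np.array(samples)

samples = rejection_sample(10_000)
print(samples.mean())   # close to the Beta(2, 5) mean 2/7 ≈ 0.286
```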

Pros:

  • Works for most distributions (even unnormalized).
  • Even works if we know our distribution only up to a normalization constant, as happens in probabilistic modeling - i.e. if we know $\tilde{p}(x) \propto p(x)$, then we can upper bound $\tilde{p}(x)$ by $M q(x)$ and use this method to perform sampling.

Cons:

  • In higher dimensions, $M$ is high. Then, this method rejects most of the points.

Importance Sampling

Unlike the inversion method or rejection sampling, here the goal is to compute the expectation $\mathbb{E}_{p(x)}[f(x)]$ directly.

Steps:

  • Sample $x_1, \dots, x_N$ from a different distribution $q(x)$
  • Define the importance weight $w_i = \frac{p(x_i)}{q(x_i)}$
  • Then, $\frac{1}{N}\sum_{i=1}^{N} w_i\,f(x_i)$ is in fact an unbiased estimate of $\mathbb{E}_{p(x)}[f(x)]$

Proof:

$$\mathbb{E}_{q(x)}\!\left[\frac{p(x)}{q(x)}\,f(x)\right] = \int q(x)\,\frac{p(x)}{q(x)}\,f(x)\,dx = \int p(x)\,f(x)\,dx = \mathbb{E}_{p(x)}[f(x)]$$

  • In practice, we would like to choose $q$ as close as possible to $p$ to reduce the variance of our estimator (see the sketch below)
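
A sketch of importance sampling with made-up choices: the target $p$ is $N(0, 1)$, the proposal $q$ we actually draw from is $N(0, 2^2)$, and $f(x) = x^2$, so the weighted estimate should come out near $1$:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
p = stats.norm(0.0, 1.0)     # target distribution p(x)
q = stats.norm(0.0, 2.0)     # proposal distribution q(x)

n = 100_000
x = rng.normal(0.0, 2.0, size=n)   # x_i ~ q(x)
w = p.pdf(x) / q.pdf(x)            # importance weights w_i = p(x_i) / q(x_i)
print(np.mean(w * x ** 2))         # (1/N) * sum_i w_i f(x_i) ≈ E_p[x^2] = 1
```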

Markov Chain Monte Carlo (MCMC)

Rejection sampling and importance sampling perform poorly in higher dimensions. So, instead we use MCMC sampling.

Markov Chain

See this book for details.

  • In a first-order Markov chain, the transition to the next state depends only on the previous state. A stochastic process satisfies the Markov property if $P(x_{t+1} \mid x_t, x_{t-1}, \dots, x_1) = P(x_{t+1} \mid x_t)$
  • For a Markov chain, we have an initial state $x_0$ and a transition matrix $T$.

(Figure: a two-state Markov chain for the frog example, with transition matrix $T$.)

Here, $T_{ij}$ is the probability of moving from state $i$ to state $j$, so each row of $T$ sums to $1$.

  • Assuming the frog starts at state $1$, i.e. $x_0 = [1, 0]$, then after $n$ steps the probability that the frog will be in state $1$ or state $2$ is given by $x_n = x_0 T^n$.

    We can also replace $x_0$ with a probability vector to represent the initial state - i.e. $x_0 = [0.5, 0.5]$, the frog is equally likely to start in either state $1$ or state $2$. A small numeric sketch follows.
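
A quick numeric sketch of the frog chain. The actual transition probabilities were only given in the figure, so the entries of $T$ below are made up for illustration; row $i$ holds the probabilities of hopping out of state $i$:

```python
import numpy as np

# Hypothetical transition matrix for the two-state frog chain (rows sum to 1).
T = np.array([[0.7, 0.3],
              [0.4, 0.6]])

x0 = np.array([1.0, 0.0])      # the frog starts in state 1 with certainty
for n in [1, 2, 5, 50]:
    xn = x0 @ np.linalg.matrix_power(T, n)   # x_n = x_0 T^n
    print(n, xn)
# Replacing x0 with [0.5, 0.5] models a frog equally likely to start in either state.
```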

Regular Markov Chain & Stationary Distribution

  • A Markov chain is regular if all the entries in the transition matrix are non-zero (this is a sufficient condition, not a necessary condition).
  • For a regular Markov chain, long-range predictions are independent of the starting state, i.e. it doesn’t depend on whether the frog started in state $1$ or state $2$. In other words, $x_0 T^n \to \pi$ for any choice of $x_0$ when $n \to \infty$. Here, $\pi$ is a stationary distribution such that $\pi T = \pi$. One interesting property is that all the entries in a column of $T^n$ will have the same value as $n \to \infty$, because every row of $T^n$ converges to $\pi$ (see the numeric check below).

    And, $x_0 T^n \to \pi$ for any probability vector $x_0$.

A distribution is said to be invariant/stationary w.r.t. a Markov chain if the transition function of that chain leaves that distribution unchanged. E.g. for a Markov chain with transition operator $T$, a distribution $\pi$ is considered a stationary distribution if

$$\pi(x') = \sum_x T(x' \mid x)\,\pi(x) \quad \text{for all } x'$$
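
Continuing with the made-up frog matrix from the sketch above, we can find the stationary distribution as the left eigenvector of $T$ with eigenvalue $1$ and check that the chain leaves it unchanged:

```python
import numpy as np

T = np.array([[0.7, 0.3],
              [0.4, 0.6]])     # same hypothetical transition matrix as before

eigvals, eigvecs = np.linalg.eig(T.T)                 # left eigenvectors of T
pi = np.real(eigvecs[:, np.argmax(np.real(eigvals))]) # eigenvector for eigenvalue 1
pi = pi / pi.sum()                                    # normalize to a probability vector
print(pi)                                             # ~ [0.571, 0.429]
print(pi @ T)                                         # equals pi, i.e. pi T = pi
print(np.linalg.matrix_power(T, 50))                  # every column has (nearly) identical entries
```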

Detailed Balance

A transition operator $T$ satisfies detailed balance if

$$\pi(x)\,T(x' \mid x) = \pi(x')\,T(x \mid x')$$

Here, $\pi(x)$ is the stationary distribution evaluated at state $x$, and $T(x' \mid x)$ is the transition probability of moving from state $x$ to state $x'$.

How to design a Markov Chain with stationary distribution $\pi$

If a transition operator $T$ satisfies detailed balance w.r.t. a particular distribution $\pi$, then that distribution will be invariant under $T$. In other words,

$$\sum_x \pi(x)\,T(x' \mid x) = \pi(x')$$

Proof:

$$\sum_x \pi(x)\,T(x' \mid x) = \sum_x \pi(x')\,T(x \mid x') = \pi(x') \sum_x T(x \mid x') = \pi(x')$$

Therefore, to design a Markov chain with a given stationary distribution $\pi$, find a transition operator $T$ such that it satisfies detailed balance w.r.t. $\pi$.

How to use Markov Chain for Sampling?

  • We want to sample from $\pi(x)$
  • Build a Markov chain that converges to $\pi(x)$
    • Start from any initial state $x_0$
    • For $k = 0, 1, 2, \dots$: sample $x_{k+1} \sim T(x_{k+1} \mid x_k)$
    • Eventually $x_k$ will look like samples from $\pi(x)$
      Here $T(x_{k+1} \mid x_k)$ is the transition probability from state $x_k$ to state $x_{k+1}$.

Metropolis-Hastings Algorithm

This applies an idea somewhat similar to rejection sampling to a Markov chain. We start with a “wrong” Markov chain (a proposal distribution $Q$), and introduce a critic. The critic ensures that the random walk does not go too far away from the desired distribution.

For $k = 1, 2, \dots$
– Sample a proposal $x'$ from the wrong $Q(x' \mid x_k)$
– Accept the proposal with probability $A(x' \mid x_k)$, i.e. set $x_{k+1} = x'$
– Otherwise stay at $x_k$, i.e. set $x_{k+1} = x_k$

The transition probability of the above algorithm is as follows:

$$T(x' \mid x) = Q(x' \mid x)\,A(x' \mid x), \qquad x' \neq x$$

How to choose critic

We choose the critic $A$ such that the transition probability described above converges to the desired distribution $\pi(x)$. We can use detailed balance for this purpose:

$$\pi(x)\,Q(x' \mid x)\,A(x' \mid x) = \pi(x')\,Q(x \mid x')\,A(x \mid x')$$

Now, we assign $A(x \mid x') = 1$; then detailed balance gives $A(x' \mid x) = \frac{\pi(x')\,Q(x \mid x')}{\pi(x)\,Q(x' \mid x)}$, which is a valid probability as long as this ratio is $\le 1$. When the ratio is greater than $1$, we can simply choose $A(x' \mid x) = 1$ instead. I.e.

$$A(x' \mid x) = \min\!\left(1,\; \frac{\pi(x')\,Q(x \mid x')}{\pi(x)\,Q(x' \mid x)}\right)$$

Things to note:

  • We are only using the ratio of the desired distribution - this means we don’t need to know the exact distribution. So, there is no issue with the normalization constant.
  • Choice of $Q$:
    • $Q$ should spread out, i.e. a $Q$ with high variance will give us less correlated samples. But this also means the probability of rejection will increase. On the other hand, a $Q$ with low variance will take a long time to converge. See the following figures:

(Figures: behavior of the sampler for a $Q$ with proper variance, compared with higher- and lower-variance choices of $Q$.)
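
A sketch of Metropolis-Hastings with a Gaussian random-walk proposal $Q$. The target here is a made-up unnormalized density $\tilde{p}(x) \propto e^{-x^4/4}$; note that only ratios of $\tilde{p}$ are ever used, so the normalization constant never appears:

```python
import numpy as np

rng = np.random.default_rng(0)

def p_tilde(x):
    """Unnormalized target density, p(x) ∝ exp(-x^4 / 4)."""
    return np.exp(-x ** 4 / 4.0)

def metropolis_hastings(n_samples, proposal_std=1.0):
    x = 0.0                      # start from an arbitrary state
    samples = []
    for _ in range(n_samples):
        x_prop = x + proposal_std * rng.standard_normal()   # x' ~ Q(x' | x), random walk
        # Critic: accept with probability min(1, p(x') / p(x)).
        # The Q terms cancel because the random-walk proposal is symmetric.
        if rng.uniform() < min(1.0, p_tilde(x_prop) / p_tilde(x)):
            x = x_prop
        samples.append(x)
    return np.array(samples)

samples = metropolis_hastings(50_000)
print(samples.mean(), samples.var())   # mean ≈ 0 for this symmetric target
```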


Gibbs Sampling

We want to sample from a joint distribution $p(x_1, x_2, x_3)$.

Initialize $x_1^0$, $x_2^0$, $x_3^0$.
For $k = 0, 1, 2, \dots$

$$x_1^{k+1} \sim p(x_1 \mid x_2^{k},\, x_3^{k})$$
$$x_2^{k+1} \sim p(x_2 \mid x_1^{k+1},\, x_3^{k})$$
$$x_3^{k+1} \sim p(x_3 \mid x_1^{k+1},\, x_2^{k+1})$$

The 1D conditional distributions in the above algorithm can be calculated analytically or sampled using something like rejection sampling.
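
A sketch of Gibbs sampling for a made-up target: a 2D standard Gaussian with correlation $\rho$, for which both 1D conditionals are Gaussian and can be sampled exactly, $p(x_1 \mid x_2) = \mathcal{N}(\rho x_2,\, 1 - \rho^2)$ and symmetrically for $x_2$:

```python
import numpy as np

rng = np.random.default_rng(0)
rho = 0.8                              # correlation of the target 2D Gaussian

def gibbs(n_samples):
    x1, x2 = 0.0, 0.0                  # arbitrary initialization
    samples = np.empty((n_samples, 2))
    cond_std = np.sqrt(1.0 - rho ** 2)
    for k in range(n_samples):
        x1 = rng.normal(rho * x2, cond_std)   # x1 ~ p(x1 | x2)
        x2 = rng.normal(rho * x1, cond_std)   # x2 ~ p(x2 | x1)
        samples[k] = (x1, x2)
    return samples

samples = gibbs(50_000)
print(np.corrcoef(samples.T))          # off-diagonal entries ≈ rho = 0.8
```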

Why Gibbs Sampling works?

We need to prove that the above-mentioned sampling steps actually have $p(x_1, x_2, x_3)$ as their stationary distribution. In other words, we need to prove: if $(x_1^{k}, x_2^{k}, x_3^{k}) \sim p(x_1, x_2, x_3)$, then $(x_1^{k+1}, x_2^{k+1}, x_3^{k+1}) \sim p(x_1, x_2, x_3)$.

Proof:

Consider the first update. If $(x_1, x_2, x_3) \sim p$ and we resample $x_1' \sim p(x_1 \mid x_2, x_3)$, then the new triple has density $p(x_1' \mid x_2, x_3)\,p(x_2, x_3) = p(x_1', x_2, x_3)$, so it is still distributed according to $p$. The same argument applies to the $x_2$ and $x_3$ updates in turn, hence the full Gibbs sweep leaves $p$ invariant.

Metropolis-Hastings vs. Gibbs

Gibbs sampling generates highly correlated samples and takes a long time to converge. Also, since the sampling of each dimension depends on the most recent samples of the other dimensions, the scheme is not parallel. It’s possible to make it parallel by updating every coordinate at once, conditioning only on the previous iteration:

$$x_i^{k+1} \sim p(x_i \mid x_{-i}^{k}) \quad \text{for all } i \text{ in parallel}$$

where $x_{-i}$ denotes all coordinates except $x_i$.

However, this does not guarantee that the samples will converge to the desired distribution $p$. So, the alternative is to sample from the above parallel scheme and use the result as a proposal in Metropolis-Hastings, where the critic will decide whether or not to accept this point. Since the above proposal is pretty close to the desired distribution $p$, there is a high chance that the critic will accept this point.
