Saturday, November 25, 2017

Notes on Bayesian Inference

Bayes' Theorem

We have,

  • $\theta$: parameters
  • $X$: observations

Bayes' rule is as follows:

$$P(\theta \mid X) = \frac{P(X \mid \theta)\, P(\theta)}{P(X)}$$

Probability rules (illustrated numerically in the sketch below):

  • Sum rule (marginalization from the joint distribution): $P(X) = \sum_{\theta} P(X, \theta)$
  • Chain rule: $P(X, \theta) = P(X \mid \theta)\, P(\theta)$
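As a minimal sketch (not from the original notes), the snippet below applies the sum rule and Bayes' rule to a discrete parameter; the candidate values of $\theta$ and the prior are made-up numbers.

```python
# Discrete Bayes' rule: theta is a coin bias restricted to three hypothetical values.
thetas = [0.3, 0.5, 0.7]        # candidate values of theta
prior = [0.2, 0.6, 0.2]         # P(theta), made up
x = 1                           # one observed coin flip, X = 1 (heads)

likelihood = [t if x == 1 else 1 - t for t in thetas]    # P(X | theta)

# Sum rule: P(X) = sum over theta of P(X, theta) = sum of P(X | theta) P(theta)
evidence = sum(l * p for l, p in zip(likelihood, prior))

# Bayes' rule: P(theta | X) = P(X | theta) P(theta) / P(X)
posterior = [l * p / evidence for l, p in zip(likelihood, prior)]
print(posterior)                # [0.12, 0.6, 0.28]
```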

Point Estimation (Frequentist vs. Bayesian)

Rather than estimate the entire posterior distribution $P(\theta \mid X)$, sometimes it is sufficient to find a single 'good' value for $\theta$. We call this a point estimate.

  • Frequentists think parameters are fixed and data are random.
    Maximum Likelihood Estimation (MLE): $\hat{\theta}_{\text{MLE}} = \arg\max_{\theta} P(X \mid \theta)$
  • Bayesians think parameters are random and data are fixed.
    Maximum A Posteriori Estimation (MAP): $\hat{\theta}_{\text{MAP}} = \arg\max_{\theta} P(\theta \mid X) = \arg\max_{\theta} P(X \mid \theta)\, P(\theta)$ (a small numerical comparison follows this list)
    • MAP estimation is not invariant to non-linear transformations of $\theta$. E.g., applying a non-linear transformation $g$ to $\theta$ can shift the posterior mode, so that the MAP estimate of $g(\theta)$ is not equal to $g(\hat{\theta}_{\text{MAP}})$.
    • The MAP estimate may not be typical of the posterior.
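A small numerical comparison of MLE and MAP on made-up coin-flip data; the Beta(2, 2) prior is an assumption chosen only for illustration.

```python
import numpy as np

flips = np.array([1, 1, 1, 0, 1, 1, 0, 1, 1, 1])    # made-up data: 8 heads in 10 flips
n, k = len(flips), int(flips.sum())

# MLE: maximizes the likelihood P(X | theta) only.
theta_mle = k / n                                    # 0.8

# MAP: maximizes P(X | theta) P(theta) with a Beta(a, b) prior on theta.
a, b = 2.0, 2.0
theta_map = (k + a - 1) / (n + a + b - 2)            # mode of the Beta(k+a, n-k+b) posterior = 0.75

print(theta_mle, theta_map)
```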

Bayesian Network (Graphical Model)

[Figure: Bayesian network for the rain/sprinkler/wet-grass example]

  • Nodes are random variables
  • Edges indicate dependence (e.g., whether the grass is wet depends on both the sprinkler and the rain, and whether the sprinkler is on or off depends on the rain)
  • Observed variables are shaded nodes; unshaded nodes are hidden
  • Plates denote replicated structure

The joint probability over all the variables in the above model is given by:

$$P(G, S, R) = P(R)\, P(S \mid R)\, P(G \mid S, R),$$

where $R$ denotes rain, $S$ the sprinkler, and $G$ whether the grass is wet.
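A sketch of this factorization in Python; the conditional probability tables below are made-up numbers, not values from the notes.

```python
# P(G, S, R) = P(R) * P(S | R) * P(G | S, R) for the rain/sprinkler/wet-grass network.
P_R = {True: 0.2, False: 0.8}                              # P(Rain)
P_S_given_R = {True: {True: 0.01, False: 0.99},            # P(Sprinkler | Rain)
               False: {True: 0.4, False: 0.6}}
P_G_given_SR = {(True, True): 0.99, (True, False): 0.9,    # P(GrassWet = True | Sprinkler, Rain)
                (False, True): 0.8, (False, False): 0.0}

def joint(g, s, r):
    """P(G = g, S = s, R = r) via the factorization above."""
    p_g_true = P_G_given_SR[(s, r)]
    return P_R[r] * P_S_given_R[r][s] * (p_g_true if g else 1 - p_g_true)

# Sum rule: marginal probability that the grass is wet.
p_wet = sum(joint(True, s, r) for s in (True, False) for r in (True, False))
print(p_wet)
```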

Example 2: Naive Bayes Classifier
[Figure: naive Bayes graphical model, a class node $y$ with conditionally independent feature nodes $x_1, \ldots, x_n$]

Joint probability: $P(y, x_1, \ldots, x_n) = P(y) \prod_{i=1}^{n} P(x_i \mid y)$

In plate notation, the figure above can be shortened as follows:

[Figure: naive Bayes model in plate notation, with the $n$ feature nodes collapsed into a single plate]
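A minimal naive Bayes sketch using this factorization; the class prior and feature probabilities are made-up numbers for illustration.

```python
# P(y, x_1, ..., x_n) = P(y) * prod_i P(x_i | y), with two binary features.
P_y = {"spam": 0.3, "ham": 0.7}                 # class prior P(y)
P_xi_given_y = {"spam": [0.8, 0.6],             # P(x_i = 1 | y) for each feature i
                "ham":  [0.1, 0.4]}

def joint(y, x):
    p = P_y[y]
    for xi, p1 in zip(x, P_xi_given_y[y]):
        p *= p1 if xi == 1 else 1 - p1
    return p

x = [1, 0]                                      # an observed feature vector
# Classification: P(y | x) is proportional to the joint P(y, x).
scores = {y: joint(y, x) for y in P_y}
print(max(scores, key=scores.get), scores)      # 'spam', {'spam': 0.096, 'ham': 0.042}
```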

Calculation of Posterior Distribution

  • Analytical approach: use of a conjugate prior
  • Converting to an optimization problem: variational inference (mean field approximation)
  • Simulation: MCMC methods (Metropolis-Hastings or Gibbs sampling) - see next post.

Since variational inference only approximates the posterior, MCMC usually produces higher accuracy; however, it may be slower to converge, as shown in the figure below:

[Figure: accuracy vs. runtime trade-off between variational inference and MCMC]

Conjugate Prior

Point estimation is useful for many applications; however, the true goal in Bayesian analysis is often to find the full posterior $P(\theta \mid X)$. In most cases, it is difficult to calculate the denominator $P(X) = \int P(X \mid \theta)\, P(\theta)\, d\theta$. One approach to circumventing the integral is to use conjugate priors. The idea is that if we choose the 'right' prior for a particular likelihood function, then we can compute the posterior without worrying about the integral.

Formally, a prior $P(\theta)$ is conjugate to the likelihood $P(X \mid \theta)$ if the prior and the posterior $P(\theta \mid X)$ belong to the same family of distributions.

Examples:

  • Beta distribution is conjugate to Bernoulli likelihood. Here is a good example of this for baseball batting average calculation.
  • Dirichlet distribution is conjugate to Multinomial likelihood (e.g. application in LDA)
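A sketch of the Beta-Bernoulli conjugate update for a batting average (the prior pseudo-counts and season totals are made-up numbers); because the prior is conjugate, the posterior follows by simple counting rather than by integration.

```python
from scipy import stats

a0, b0 = 81, 219                 # Beta prior, roughly "81 hits in 300 at-bats" of prior belief
hits, at_bats = 30, 100          # observed data for the season (made up)

# Conjugacy: Beta prior + Bernoulli/Binomial likelihood -> Beta posterior.
a_post, b_post = a0 + hits, b0 + (at_bats - hits)
posterior = stats.beta(a_post, b_post)

print(posterior.mean())          # posterior mean batting average = 111 / 400 = 0.2775
print(posterior.interval(0.95))  # 95% credible interval
```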

Variational Inference

  • Very intuitive explanation in this blog
  • Very nice notes on the derivation from CMU here.

If there is no conjugate prior, it may be hard to calculate the posterior. In many cases, we can instead approximate the posterior with some simpler, known distribution.

[Figure: approximating the true posterior $p(\theta \mid X)$ with a simpler distribution $q(\theta)$]

Steps

We want to find the posterior $p(\theta \mid X)$.

  • Select a family of distributions $q(\theta; \lambda)$ parameterized by variational parameters $\lambda$.
  • Find the best approximation of $p(\theta \mid X)$ from this family by minimizing the KL divergence between the two: $\lambda^* = \arg\min_{\lambda} \mathrm{KL}\big(q(\theta; \lambda)\,\|\, p(\theta \mid X)\big)$.

Will there be an issue with the denominator integral $p(X)$?

Due to the use of the KL divergence, we have:

$$\mathrm{KL}\big(q(\theta)\,\|\, p(\theta \mid X)\big) = \mathbb{E}_q[\log q(\theta)] - \mathbb{E}_q[\log p(\theta, X)] + \log p(X),$$

where $\log p(X)$ is constant with respect to $q$. So we don't have to worry about $p(X)$ when optimizing over $q$.

Evidence Lower Bound (ELBO)

The evidence lower bound is defined as:

$$\mathrm{ELBO}(q) = \mathbb{E}_q[\log p(\theta, X)] - \mathbb{E}_q[\log q(\theta)].$$

Properties: $\log p(X) \geq \mathrm{ELBO}(q)$.

Proof: rearranging the KL identity above gives $\log p(X) = \mathrm{ELBO}(q) + \mathrm{KL}\big(q(\theta)\,\|\, p(\theta \mid X)\big)$, and the KL divergence is always non-negative, so $\log p(X) \geq \mathrm{ELBO}(q)$.

Minimizing the KL divergence

We can write the KL divergence as:

$$\mathrm{KL}\big(q(\theta)\,\|\, p(\theta \mid X)\big) = \log p(X) - \mathrm{ELBO}(q).$$

Therefore,

$$\arg\min_q \mathrm{KL}\big(q(\theta)\,\|\, p(\theta \mid X)\big) = \arg\max_q \mathrm{ELBO}(q).$$

In words, any $q$ that maximizes the ELBO also minimizes the KL divergence.
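A numerical check of the identity $\log p(X) = \mathrm{ELBO}(q) + \mathrm{KL}(q \,\|\, p(\theta \mid X))$ on a small discrete toy problem; the joint table and the variational distribution $q$ are made-up numbers.

```python
import numpy as np

p_joint = np.array([0.08, 0.24, 0.12, 0.06])   # p(theta_i, X) over four values of theta (made up)
p_X = p_joint.sum()                             # evidence p(X)
p_post = p_joint / p_X                          # true posterior p(theta | X)

q = np.array([0.1, 0.4, 0.3, 0.2])              # an arbitrary variational distribution

elbo = np.sum(q * (np.log(p_joint) - np.log(q)))   # E_q[log p(theta, X)] - E_q[log q(theta)]
kl = np.sum(q * (np.log(q) - np.log(p_post)))      # KL(q || p(theta | X))

print(np.log(p_X), elbo + kl)                   # the two values match
print(elbo <= np.log(p_X))                      # ELBO really is a lower bound -> True
```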

Mean Field Approximation

In the mean field approximation, the family of distributions is assumed to factorize over the components of $\theta = (\theta_1, \ldots, \theta_m)$, i.e.

$$q(\theta) = \prod_{j=1}^{m} q_j(\theta_j),$$

and we are trying to achieve

$$q^* = \arg\max_q \mathrm{ELBO}(q) = \arg\max_q \Big( \mathbb{E}_q[\log p(\theta, X)] - \sum_{j=1}^{m} \mathbb{E}_{q_j}[\log q_j(\theta_j)] \Big).$$

Isolating the terms that involve a single factor $q_j$ (treating all other factors as fixed), the ELBO can be written as

$$\mathrm{ELBO}(q_j) = \mathbb{E}_{q_j}\big[\mathbb{E}_{q_{-j}}[\log p(\theta, X)]\big] - \mathbb{E}_{q_j}[\log q_j(\theta_j)] + \text{const}.$$

To get to this form, we used

$$\mathbb{E}_q[\log p(\theta, X)] = \mathbb{E}_{q_j}\big[\mathbb{E}_{q_{-j}}[\log p(\theta, X)]\big],$$

and the fact that

$$\sum_{k=1}^{m} \mathbb{E}_{q_k}[\log q_k(\theta_k)] = \mathbb{E}_{q_j}[\log q_j(\theta_j)] + \text{const with respect to } q_j.$$

Co-ordinate Ascent

We can solve this using a coordinate ascent algorithm, maximizing a single factor $q_j$ while keeping all other factors fixed. Maximizing the ELBO above with respect to $q_j$ gives the optimal update

$$q_j^*(\theta_j) \propto \exp\big(\mathbb{E}_{q_{-j}}[\log p(\theta, X)]\big),$$

which is applied to each factor in turn until the ELBO converges.

In summary, we first defined a family of approximations called the mean field family, in which there are no dependencies between the latent variables $\theta_j$. Then we decomposed the ELBO into a convenient form under the mean field assumption. Finally, we derived coordinate ascent updates to iteratively optimize each local variational factor.
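A sketch of mean field coordinate ascent on a small discrete toy posterior; the table of joint probabilities is randomly generated (i.e. a made-up example), and each update applies $q_j^* \propto \exp(\mathbb{E}_{q_{-j}}[\log p(\theta, X)])$ on a grid.

```python
import numpy as np

rng = np.random.default_rng(0)
# Unnormalized log joint log p(theta1, theta2, X) on a 3 x 4 grid (made-up values).
log_p = np.log(rng.dirichlet(np.ones(12)).reshape(3, 4))

q1 = np.full(3, 1 / 3)            # q1(theta1), initialized uniform
q2 = np.full(4, 1 / 4)            # q2(theta2), initialized uniform

def normalize(logits):
    p = np.exp(logits - logits.max())
    return p / p.sum()

for _ in range(100):
    q1 = normalize(log_p @ q2)    # q1*(theta1) ∝ exp(E_{q2}[log p(theta1, theta2, X)])
    q2 = normalize(log_p.T @ q1)  # q2*(theta2) ∝ exp(E_{q1}[log p(theta1, theta2, X)])

exact = np.exp(log_p)
exact /= exact.sum()              # exact posterior p(theta1, theta2 | X)
print(np.round(exact, 3))
print(np.round(np.outer(q1, q2), 3))   # factorized mean field approximation
```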

Common Probability Distributions

Gamma Distribution

The Gamma distribution with shape $a > 0$ and rate $b > 0$ has density

$$p(x \mid a, b) = \frac{b^a}{\Gamma(a)}\, x^{a-1} e^{-b x}.$$

Here,

  • the support of the Gamma distribution is $x \in (0, \infty)$
  • the mean is $a/b$ and the variance is $a/b^{2}$

Example: Suppose I run 5 km $\pm$ 100 m every day, i.e., mean 5 km with standard deviation 100 m. We can model this with a Gamma distribution. We could also use a Gaussian; however, that would assign probability to running a negative distance.
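A sketch of this example with scipy, matching the mean and standard deviation by moments (the moment-matching choice of parameters is an assumption for illustration):

```python
from scipy import stats

mean, std = 5.0, 0.1                  # 5 km with 100 m standard deviation
a = (mean / std) ** 2                 # shape, from moment matching: a = (mean/std)^2
b = mean / std ** 2                   # rate: b = mean / std^2

dist = stats.gamma(a, scale=1 / b)    # scipy parameterizes by shape and scale = 1/rate
print(dist.mean(), dist.std())        # 5.0, 0.1
print(dist.cdf(0.0))                  # 0.0: no probability of running a negative distance
```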

Beta Distribution

The Beta distribution has density

$$p(x \mid a, b) = \frac{\Gamma(a + b)}{\Gamma(a)\, \Gamma(b)}\, x^{a-1} (1 - x)^{b-1}.$$

Here,
- $a, b > 0$
- the support of the Beta distribution is $[0, 1]$, i.e. $x \in [0, 1]$
- mean: $a / (a + b)$
- variance: $ab / \big((a + b)^2 (a + b + 1)\big)$

Example: A baseball batting average (it's a number between 0 and 1); see the sketch below.
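A small scipy sketch of a Beta distribution for a batting average; Beta(81, 219) is a made-up choice whose mean is 0.27.

```python
from scipy import stats

a, b = 81, 219
dist = stats.beta(a, b)

print(dist.support())   # (0.0, 1.0)
print(dist.mean())      # a / (a + b) = 0.27
print(dist.var())       # a*b / ((a + b)^2 * (a + b + 1)) ~= 0.000655
```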