Suggest edit — Markov Chain Monte-Carlo

Title

Name

Note

---
title: "Markov Chain Monte-Carlo"
source: https://www.jemoka.com/posts/kbhmarkov_chain_monte_carlo/
---

Start with an initial sample \(\tau\)
At each distribution&hellip;
sample \(\tau &rsquo; \sim g\qty(\cdot | \tau)\) (for instance, \(\tau&rsquo; \sim \mathcal{N}\qty(\cdot | \tau, \sigma^{2})\)) accept the sample with probability given by \(\frac{\bar{p} \qty(\tau&rsquo;) g\qty(\tau | \tau&rsquo;)}{\bar{p}\qty(\tau) g\qty(\tau&rsquo; | \tau)}\), otherwise keep \(\tau\) (this is also called the Metropolis-Hastings criteria) intuition The kernel is often chosen to be symmetric, so:
\begin{equation} g\qty(\tau | \tau&rsquo;) = g\qty(\tau&rsquo; | \tau) \end{equation}
we want to sample from areas of high density more often. Consider:
\(\bar{p}\qty(\tau&rsquo;) \geq \bar{p}\qty(\tau)\), meaning \(\tau&rsquo;\) is more likely under the target density than \(\tau\), so \(\frac{\bar{p}\qty(\tau&rsquo;)}{\bar{p} \qty(\tau)} \geq 1\), so we will always accept whereas \(\bar{p}\qty(\tau&rsquo;) &lt; \bar{p}\qty(\tau)\), meaning \(\frac{\bar{p}\qty(\tau&rsquo;)}{\bar{p}\qty(\tau)} &lt; 1\), so we will still accept some of the time but mostly not accept; the bigger the deviations, the less likely we are able to sample it Key: this all works assuming this is in the limit of infinite samples.
problems with MCMC The problem is that we may not ever get to a failure by doing this naively (think multi-modal distributions)
burn in A useful hack to do is to start sampling and throw away some samples in the beginning.
failure smoothing In certain cases where the failure distribution is multi-modal, its really hard to perturb your way into jumping between different distributions. So, let&rsquo;s define:
\(\Delta \qty(\tau) = 0\), if \(\tau \not \in \psi\) (is failure) \(\Delta \qty(\tau) &gt; 0\), otherwise for instance, a function that behaves like tis is \(\max \qty(\rho(\tau, \psi), 0)\)
Now, we can write:
\begin{align} \bar{p} \qty(\tau | \tau \not \in\psi) &amp;= 1 \qty {\tau \not \in \psi} p\qty(\tau) \\ &amp;=\mathbb{1} \qty{\Delta \qty(\tau) = 0} p\qty(\tau) \end{align}
Which we can now approximate&mdash;using some smoothing parameter \(\varepsilon\):
\begin{align} \bar{p} \qty(\tau | \tau \not \in\psi) &amp;= 1 \qty {\tau \not \in \psi} p\qty(\tau) \\ &amp;=\mathbb{1} \qty{\Delta \qty(\tau) = 0} p\qty(\tau) \\ &amp;= \mathcal{N}\qty(\Delta \qty(\tau) \mid0, \epsilon^{2}) \end{align}
If we got a non-failure from the smoothed distribution, just throw it away.
a good kernel A good kernel: using the gradient of the likelihood to define the kernel is sometimes a good method of increasing likelihood of the discovered failure. (NUTS, HMC, etc.)
or generally: use probabilistic programming&mdash;which allows us to use probabilistic modes as computer programs.