[[
wikihub
]]
Search
⌘K
Explore
People
For Agents
Sign in
Explore
People
For Agents
Sign in
@jemoka / Jemoka Knowledge Base / raw/concept/kbhmarkov_chain_monte_carlo.md
Suggest edit
Cancel
Submit suggestion
Title
Name
Note
--- title: "Markov Chain Monte-Carlo" source: https://www.jemoka.com/posts/kbhmarkov_chain_monte_carlo/ --- Start with an initial sample \(\tau\) At each distribution… sample \(\tau ’ \sim g\qty(\cdot | \tau)\) (for instance, \(\tau’ \sim \mathcal{N}\qty(\cdot | \tau, \sigma^{2})\)) accept the sample with probability given by \(\frac{\bar{p} \qty(\tau’) g\qty(\tau | \tau’)}{\bar{p}\qty(\tau) g\qty(\tau’ | \tau)}\), otherwise keep \(\tau\) (this is also called the Metropolis-Hastings criteria) intuition The kernel is often chosen to be symmetric, so: \begin{equation} g\qty(\tau | \tau’) = g\qty(\tau’ | \tau) \end{equation} we want to sample from areas of high density more often. Consider: \(\bar{p}\qty(\tau’) \geq \bar{p}\qty(\tau)\), meaning \(\tau’\) is more likely under the target density than \(\tau\), so \(\frac{\bar{p}\qty(\tau’)}{\bar{p} \qty(\tau)} \geq 1\), so we will always accept whereas \(\bar{p}\qty(\tau’) < \bar{p}\qty(\tau)\), meaning \(\frac{\bar{p}\qty(\tau’)}{\bar{p}\qty(\tau)} < 1\), so we will still accept some of the time but mostly not accept; the bigger the deviations, the less likely we are able to sample it Key: this all works assuming this is in the limit of infinite samples. problems with MCMC The problem is that we may not ever get to a failure by doing this naively (think multi-modal distributions) burn in A useful hack to do is to start sampling and throw away some samples in the beginning. failure smoothing In certain cases where the failure distribution is multi-modal, its really hard to perturb your way into jumping between different distributions. So, let’s define: \(\Delta \qty(\tau) = 0\), if \(\tau \not \in \psi\) (is failure) \(\Delta \qty(\tau) > 0\), otherwise for instance, a function that behaves like tis is \(\max \qty(\rho(\tau, \psi), 0)\) Now, we can write: \begin{align} \bar{p} \qty(\tau | \tau \not \in\psi) &= 1 \qty {\tau \not \in \psi} p\qty(\tau) \\ &=\mathbb{1} \qty{\Delta \qty(\tau) = 0} p\qty(\tau) \end{align} Which we can now approximate—using some smoothing parameter \(\varepsilon\): \begin{align} \bar{p} \qty(\tau | \tau \not \in\psi) &= 1 \qty {\tau \not \in \psi} p\qty(\tau) \\ &=\mathbb{1} \qty{\Delta \qty(\tau) = 0} p\qty(\tau) \\ &= \mathcal{N}\qty(\Delta \qty(\tau) \mid0, \epsilon^{2}) \end{align} If we got a non-failure from the smoothed distribution, just throw it away. a good kernel A good kernel: using the gradient of the likelihood to define the kernel is sometimes a good method of increasing likelihood of the discovered failure. (NUTS, HMC, etc.) or generally: use probabilistic programming—which allows us to use probabilistic modes as computer programs.