Suggest edit — SU-CS361 MAY072024

Title

Name

Note

---
title: "SU-CS361 MAY072024"
source: https://www.jemoka.com/posts/kbhsu_cs361_may072024/
date: 2024-05-07
---

Generalization Error \begin{equation} \epsilon_{gen} = \mathbb{E}_{x \sim \mathcal{X}} \qty[\qty(f(x) - \hat{f}(x))^{2}] \end{equation}
we usually instead of compute it by averaging specific points we measured.
Probabilistic Surrogate Models Gaussian Process A Gaussian Process is a Gaussian distribution over functions!
Consider a mean function \(m(x)\), and a covariance (kernel) function \(k(x, x&rsquo;)\). And, for a set of objective values \(y_{j} \in \mathbb{R}\), which we are trying to infer using \(m\) and \(k\).
\begin{equation} \mqty[y_1 \\ \dots \\ y_{m}] \sim \mathcal{N} \qty(\mqty[m(x_1) \\ \dots \\ m(x_{m})], \mqty[k(x_1, x_1) &amp; \dots &amp; k(x_1, x_{m}) \&amp;\dots&amp;\\ k(x_{m}, x_{1}) &amp;\dots &amp;k(x_{m}, x_{m})]) \end{equation}
The choice of kernel makes or breaks the your ability to model your system. Its the way by which your input values are &ldquo;smoothed&rdquo; together to create a probabilistic estimate.
Choice of Kernels squared exponential kernel
\begin{equation} k(x,x&rsquo;) = \exp \qty( \frac{-(x-x&rsquo;)^{2}}{2 l^{2}}) \end{equation}
where, \(l\) is the parameter controlling the &ldquo;length scale&rdquo; (i.e. distance required for the function to change significantly). As \(l\) gets larger, there&rsquo;s more smoothing.
Matérn Kernel
This is a very common kernel. Look it up.
Prediction Given known means and variances of the sampled points from the original system, we can compute:
\begin{equation} P(Y^{*}|Y) \end{equation}
through using conditioning Gaussian distributions.
Specifically, with:
\begin{equation} \mqty[\hat{y} \\ y] \sim \mathcal{N} \qty(\mqty[m(X^{*}) \\ m(X)], \mqty[K(X^{*}, X^{*}) &amp; K(X^{*},X) \\ K(X, X^{*}) &amp; K(X, X)]) \end{equation}
we can compute a new mean and a new covariance using conditioning Gaussian distributions
Noisy Measurements We can account for zero-mean noise by adding some noise to your covariance:
\begin{equation} \mqty[\hat{y} \\ y] \sim \mathcal{N} \qty(\mqty[m(X^{*}) \\ m(X)], \mqty[K(X^{*}, X^{*}) &amp; K(X^{*},X) \\ K(X, X^{*}) &amp; K(X, X) + v I]) \end{equation}
Surrogate Optimization Prediction Based Exploration Given your existing points \(D\), evaluate \(\mu_{x|D}\), and optimize for the next design point that has the smallest \(\mu_{x|D}\).
This is all exploitation, no exploration.
Error Based Exploration Use the 95% confidence interval from the Gaussian Process, find the areas with with the biggest gap and then lower those.
This is *all exploration, no exploitation.
Lower Confidence Bound Exploration tradeoff between exploration and exploitation. Try to minimize:
\begin{equation} LB(x) = \hat{\mu}(x) - \alpha \hat{\Sigma}(x) \end{equation}
try to minimize both the LOWER BOUND as well as the optimum. This is a probabilistic generalization of the Shubert-Piyavskill Method&mdash;and no Lipschitz Constant needed!
Reminder, though, these are probabilistic bounds&mdash;unlike Shubert-Piyavskill Method.
Probability of Improvement Exploration We define &ldquo;improvement&rdquo; as:
\begin{equation} I(y) = \begin{cases} y_{\min} - y, \text{if}\ y &lt; y_{\min} \\ 0, \text{otherwise} \end{cases} \end{equation}
then, we have the &ldquo;probability of improvement&rdquo; metric at:
\begin{equation} P(y &lt; y_{\min}) &amp;= \int_{-\infty}^{y_{\min}} \mathcal{N}(y | \hat{\mu}, \hat{\Sigma}) \dd{y} \\ &amp;= \Phi\qty( \frac{y_{\min } - \hat{\mu}}{\hat{\Sigma}}) \end{equation}
(i.e. we want to find points that are very possible to improve). This could be zero when \(\hat{\Sigma} = 0\), which happens when we are at a noiseless point.
You can also do this by the expected value of improvement