---
title: "least-squares error"
source: https://www.jemoka.com/posts/kbhleast_squares_error/
---

requirements: \(h_{\theta}\qty(x)\), the predictor function; \(x, y\), the samples of data.
definition:
\begin{equation} J\qty(\theta) = \frac{1}{2} \sum_{i=1}^{n}\qty(h_{\theta }\qty(x^{(i)}) - y^{(i)})^{2} \end{equation}
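A minimal sketch of the definition in Python, assuming scalar inputs and a fixed predictor. The names `least_squares_error`, `xs`, `ys`, and the toy data are hypothetical and only for illustration.

```python
from typing import Callable, Sequence


def least_squares_error(
    h: Callable[[float], float],
    xs: Sequence[float],
    ys: Sequence[float],
) -> float:
    """Return J = 1/2 * sum_i (h(x_i) - y_i)^2 for a fixed predictor h."""
    if len(xs) != len(ys):
        raise ValueError("xs and ys must have the same length")
    return 0.5 * sum((h(x) - y) ** 2 for x, y in zip(xs, ys))


# Hypothetical toy data and predictor, chosen only for illustration.
xs = [0.0, 1.0, 2.0]
ys = [1.0, 3.0, 4.0]
h = lambda x: 2.0 * x + 1.0  # predictions are 1, 3, 5

# Residuals are 0, 0, 1, so J is half of the single squared residual.
J = least_squares_error(h, xs, ys)
print(J)
```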
see also: the example of gradient descent for least-squares error.
additional information
"Why the \(\frac{1}{2}\)?" Because when you take \(\nabla J\qty(\theta)\), the \(2\) brought down from the exponent cancels with the \(\frac{1}{2}\).
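Written out with the chain rule (a worked step, assuming \(h_{\theta}\) is differentiable in \(\theta\)), the cancellation looks like this:
\begin{align} \nabla_{\theta} J\qty(\theta) &= \frac{1}{2} \sum_{i=1}^{n} 2 \qty(h_{\theta}\qty(x^{(i)}) - y^{(i)}) \nabla_{\theta} h_{\theta}\qty(x^{(i)}) \\ &= \sum_{i=1}^{n} \qty(h_{\theta}\qty(x^{(i)}) - y^{(i)}) \nabla_{\theta} h_{\theta}\qty(x^{(i)}) \end{align}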
probabilistic intuition for least-squares error in linear regression
Assume that our dataset \(\qty(x^{(i)}, y^{(i)}) \sim D\) has the following property: "the true \(y\) value is just our model's output, plus some error." Meaning:
\begin{equation} y^{(i)} = \theta^{\top} x^{(i)} + \varepsilon^{(i)} \end{equation}
Assume further that \(\varepsilon^{(i)} \sim \mathcal{N}\qty(0, \sigma^{2})\) for all \(i\); that is, the error is normally distributed with mean zero. Recall the PDF of the normal distribution:
\begin{equation} P\qty(\varepsilon^{(i)}) = \frac{1}{\sigma\sqrt{2\pi}} \exp \qty( \frac{- \qty(\varepsilon^{(i)})^{2}}{2\sigma^{2}}) \end{equation}
Substituting \(\varepsilon^{(i)} = y^{(i)} - \theta^{\top} x^{(i)}\) from our model:
\begin{equation} P\qty(y^{(i)} | x^{(i)}, \theta) = \frac{1}{\sigma\sqrt{2\pi}} \exp \qty( \frac{- \qty(y^{(i)}- \theta^{\top}x^{(i)})^{2}}{2\sigma^{2}}) \end{equation}
If we now assume the entire dataset is IID, we can then write:
\begin{align} P\qty(y | x, \theta) &= \prod_{i=1}^{n} P\qty(y^{(i)} | x^{(i)}, \theta) \\ &= \prod_{i=1}^{n} \frac{1}{\sigma\sqrt{2\pi}} \exp \qty( \frac{- \qty(y^{(i)}- \theta^{\top}x^{(i)})^{2}}{2\sigma^{2}}) \end{align}
To pick \(\theta\), we perform maximum likelihood estimation (MLE): we want the model that maximizes the likelihood of seeing our real data \(y\). Meaning, we desire:
\begin{equation} \theta = \arg\max_{\theta} P\qty(y | x,\theta) \end{equation}
Let's do it! First, let's write the thing we want to maximize as a function of \(\theta\):
\begin{equation} L\qty(\theta) = \prod_{i=1}^{n} \frac{1}{\sigma\sqrt{2\pi}} \exp \qty( \frac{- \qty(y^{(i)}- \theta^{\top}x^{(i)})^{2}}{2\sigma^{2}}) \end{equation}
Recall that \(\log\) is monotonically increasing, so it preserves the arg max:
\begin{align} \arg\max_{\theta} L\qty(\theta) &= \arg\max_{\theta} \log \qty(L\qty(\theta)) \\ &= \arg\max_{\theta} \log \prod_{i=1}^{n}\frac{1}{\sigma\sqrt{2\pi}} \exp \qty(\dots) \\ &= \arg\max_{\theta} \qty[ n \log \frac{1}{\sigma\sqrt{2\pi}} + \sum_{i=1}^{n} \frac{-\qty(y^{(i)}- \theta^{\top}x^{(i)})^{2}}{2\sigma^{2}} ] \end{align}
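We can throw away the left term, since it is just a constant that does not depend on \(\theta\). The right term is \(-\frac{1}{\sigma^{2}} J\qty(\theta)\): the negated least-squares error formula with \(h_{\theta}\qty(x) = \theta^{\top} x\), scaled by a positive constant. We may as well take \(\sigma=1\); its value doesn't matter, since rescaling does not change the arg max. So maximizing the likelihood is the same as minimizing the least-squares error. Yay!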
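A small numerical sanity check of this conclusion, as a sketch rather than a proof. It assumes a one-dimensional model \(y = \theta x + \varepsilon\) and a coarse grid search over \(\theta\); the data, seed, grid, and function names are all hypothetical. The grid minimizer of the least-squares error and the grid maximizers of the Gaussian log-likelihood (for two different values of \(\sigma\)) should all sit next to the closed-form least-squares solution \(\sum_i x^{(i)} y^{(i)} / \sum_i \qty(x^{(i)})^{2}\).

```python
import math
import random


def log_likelihood(theta, xs, ys, sigma):
    """Gaussian log-likelihood of ys given xs under y = theta * x + noise."""
    n = len(xs)
    const = n * math.log(1.0 / (sigma * math.sqrt(2.0 * math.pi)))
    sq = sum((y - theta * x) ** 2 for x, y in zip(xs, ys))
    return const - sq / (2.0 * sigma**2)


def least_squares_error(theta, xs, ys):
    """J(theta) = 1/2 * sum_i (theta * x_i - y_i)^2."""
    return 0.5 * sum((theta * x - y) ** 2 for x, y in zip(xs, ys))


# Hypothetical synthetic data: true slope 1.5, Gaussian noise with std 0.3.
rng = random.Random(0)
xs = [rng.uniform(-2.0, 2.0) for _ in range(50)]
ys = [1.5 * x + rng.gauss(0.0, 0.3) for x in xs]

# Candidate thetas on a grid with step 0.001 over [0, 3].
grid = [i / 1000.0 for i in range(3001)]

theta_ls = min(grid, key=lambda t: least_squares_error(t, xs, ys))
theta_mle_sigma_1 = max(grid, key=lambda t: log_likelihood(t, xs, ys, 1.0))
theta_mle_sigma_small = max(grid, key=lambda t: log_likelihood(t, xs, ys, 0.3))

# Closed-form least-squares solution for the 1-D model without intercept.
theta_closed = sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)

print(theta_ls, theta_mle_sigma_1, theta_mle_sigma_small, theta_closed)
```

The choice of \(\sigma\) changes the value of the log-likelihood but not where its maximum sits, which is why it can be ignored when picking \(\theta\).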