Suggest edit — gradient descent

Title

Name

Note

---
title: "gradient descent"
source: https://www.jemoka.com/posts/kbhgradient_descent/
---

It&rsquo;s hard to make globally optimal solution, so therefore we instead make local progress.
constituents parameters \(\theta\) step size \(\alpha\) cost function \(J\) (and its derivative \(J&rsquo;\)) requirements let \(\theta^{(0)} = 0\) (or a random point), and then:
\begin{equation} \theta^{(t+1)} = \theta^{(t)} - \alpha J&rsquo;\qty (\theta^{(t)}) \end{equation}
&ldquo;update the weight by taking a step in the opposite direction of the gradient by weight&rdquo;. We stop, btw, when its &ldquo;good enough&rdquo; because the training data noise is so much that like a little bit non-convergent optimization its fine.
additional information multi-dimensional case \begin{equation} \theta^{(t+1)} = \theta^{(t)} - \alpha \nabla J\qty(\theta^{(t)}) \end{equation}
where:
\begin{equation} \nabla J(\theta) = \mqty(\dv \theta_{1} J(\theta) \\ \dots \\ \dv \theta_{d} J(\theta)) \end{equation}
gradient descent for least-squares error We have:
\begin{equation} J\qty(\theta) = \frac{1}{2} \sum_{i=1}^{n} \qty(h_{\theta }\qty(x^{(i)}) - y^{(i)})^{2} \end{equation}
we want to take the derivative of this, which actually is chill
\begin{equation} \dv \theta_{j }J(\theta) = \sum_{i=1}^{n}\qty(h_{\theta } \qty(x^{(i)}) - y^{(i)}) \dv \theta_{j} h_{\theta} \qty(x^{(i)}) \end{equation}
recall that \(h_{\theta}(x) = \theta_{0} x_{0} + \ldots\)
and so: \(\dv \theta_{j} h_{\theta}(x) = x_{j}\) since every other term goes to \(0\).
So, our update rule is:
\begin{align} \theta_{j}^{(t+1)} &amp;= \theta_{j}^{(t)} - \alpha \dv \theta_{j} J\qty(\theta^{(t)}) \\ &amp;= \theta_{j}^{(t)} -\alpha \sum_{i=1}^{n} \qty(h_{\theta}\qty(x^{(i)}) - y^{(i)})x_{j}^{(i)} \end{align}
Meaning, in vector notation: \(\theta^{(t+1)} = \theta^{(t)}-\alpha \sum_{i=1}^{n} \qty(h_{\theta }\qty(x^{(i)}) - y^{(i)})x^{(i)}\)
when does gradient descent provably work? &hellip; on convex functions
stochastic gradient descent see stochastic gradient descent