@jemoka / Jemoka Knowledge Base / wiki/concepts/stochastic_gradient_descent.md
---
title: "stochastic gradient descent"
type: concept
related: [Stochastic Gradient Descent, Gradient Descent]
source: https://www.jemoka.com/posts/kbhstochastic_gradient_descent/
confidence: high
status: active
---

gradient descent makes a pass over all points to take one gradient step. We can instead approximate the gradient on a minibatch of data. This is the idea behind stochastic-gradient-descent.

\begin{equation}
\theta^{t+1} = \theta^{t} - \eta \nabla_{\theta} L(f_{\theta}(x), y)
\end{equation}

where \(\eta\) is the learning rate (written \(\alpha\) in the rest of this note). this terminates when the differences between successive \(\theta\) become small, or when progress halts: like when the loss begins going up instead.

we update the weights in SGD by taking a single random sample and moving the weights in the direction of that sample's negative gradient.

    while not_converged():
        for x, y in data:  # the n samples, visited one at a time in random order
            theta = theta - alpha * loss(x, y).grad()

In theory this is an approximation of gradient descent; however, neural networks often actually work BETTER when you jiggle a bit.

## batch gradient descent

batch gradient descent sums the loss term over the entire dataset for every step, which is fine but slow: you are summing over your whole (possibly TBs) dataset, rerunning inference, and then taking a tiny step.

stochastic gradient descent gives choppy movements because it does one sample at a time.

\begin{equation}
\theta := \theta - \frac{\alpha}{B} \sum_{k=1}^{B} \nabla_{\theta} J^{(j_{k})} \left(\theta\right)
\end{equation}

where \(B\) is the batch size, \(\alpha\) is the learning rate, and \(J^{(j_{k})}\) is the loss on the \(k\)-th sample in the batch.

## mini-batch gradient

mini-batches help take advantage of both by training over groups of \(m\) samples (here \(m\) plays the role of the batch size \(B\) above)

See
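
The three variants described in this note differ only in how many samples feed each step, so one loop can sketch all of them. The following is a minimal illustration, not a reference implementation: the 1-D linear model, the squared-error loss, the toy data, and the names `grad` and `minibatch_sgd` are all assumptions made for the example. It also assumes samples are drawn by shuffling once per epoch (sampling without replacement), which is one common choice among several.

```python
import random


def grad(theta, x, y):
    """Gradient of the per-sample loss 0.5 * (w*x + b - y)**2 w.r.t. (w, b)."""
    w, b = theta
    err = w * x + b - y
    return (err * x, err)


def minibatch_sgd(data, batch_size, alpha=0.5, epochs=2000, seed=0):
    """Mini-batch update: theta := theta - (alpha / B) * sum of per-sample gradients.

    batch_size=1 gives SGD; batch_size=len(data) gives batch gradient descent.
    """
    rng = random.Random(seed)
    data = list(data)
    theta = (0.0, 0.0)
    for _ in range(epochs):
        rng.shuffle(data)  # fresh random order each epoch
        for start in range(0, len(data), batch_size):
            batch = data[start:start + batch_size]
            grads = [grad(theta, x, y) for x, y in batch]
            # average the per-sample gradients (the last batch may be smaller)
            gw = sum(g[0] for g in grads) / len(batch)
            gb = sum(g[1] for g in grads) / len(batch)
            theta = (theta[0] - alpha * gw, theta[1] - alpha * gb)
    return theta


# Hypothetical noiseless data drawn from y = 2x + 1
data = [(i / 10, 2 * (i / 10) + 1) for i in range(11)]

theta_sgd = minibatch_sgd(data, batch_size=1)            # one sample per step
theta_mini = minibatch_sgd(data, batch_size=4)           # small groups per step
theta_batch = minibatch_sgd(data, batch_size=len(data))  # whole dataset per step
```

On this toy problem all three runs should land close to `(2, 1)`. The difference is in cost per step: with the same number of epochs, the full-batch run takes one step per pass over the data, while the single-sample run takes eleven.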