---
title: "deep learning"
type: concept
related: [Stochastic Gradient Descent, Neural Network, Supervised Learning]
source: https://www.jemoka.com/posts/kbhdeep_learning/
confidence: high
status: active
---

supervised learning with non-linear models.
Motivation
Previously, our learning methods were linear in the parameters \(\theta\) (i.e. the model could use non-linear features of \(x\), but it was always linear in \(\theta\)). Today: with deep learning, the model can be non-linear in both \(\theta\) and \(x\).
constituents
We have \(\qty {\qty(x^{(i)}, y^{(i)})}_{i=1}^{n}\), the dataset
Our loss: \(J^{(i)}\qty(\theta) = \qty(y^{(i)} - h_{\theta}\qty(x^{(i)}))^{2}\)
Our overall cost: \(J\qty(\theta) = \frac{1}{n} \sum_{i=1}^{n} J^{(i)}\qty(\theta)\)
Optimization: \(\min_{\theta} J\qty(\theta)\)
Optimization step: \(\theta = \theta - \alpha \nabla_{\theta} J\qty(\theta)\)
Hyperparameters: learning rate \(\alpha\), batch size \(B\), iterations \(n_{\text{iter}}\)
stochastic gradient descent (where we randomly sample a dataset point, etc.) or batch gradient descent (where we scale the learning rate by the batch size and compute a batch)
neural network
requirements
additional information
Background
Notation:
\(x\) is the input, \(h\) is the hidden layers, and \(\hat{y}\) is the prediction.
We call each weight, at each layer, from \(x_{i}\) to \(h_{j}\), \(\theta_{i,j}^{(h)}\). At every neuron on each layer, we calculate:
\begin{equation} h_{j} = \sigma\qty[\sum_{i}^{} x_{i} \theta_{i,j}^{(h)}] \end{equation}
\begin{equation} \hat{y} = \sigma\qty[\sum_{i}^{} h_{i}\theta_{i}^{(y)}] \end{equation}
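The two equations above, together with the squared-error loss and mean cost defined earlier, can be sketched in a few lines of plain Python. This is a minimal illustration, not a reference implementation: it assumes \(\sigma\) is the logistic sigmoid (the note does not say which activation is meant), it omits bias terms because the equations above have none, and the names `sigmoid`, `forward`, `cost`, and the toy weights are hypothetical.

```python
import math


def sigmoid(z):
    """Logistic function; an assumed choice for the activation sigma."""
    return 1.0 / (1.0 + math.exp(-z))


def forward(x, theta_h, theta_y):
    """Forward pass of a network with one hidden layer.

    x       : list of input features x_i
    theta_h : theta_h[i][j] is the weight from x_i to h_j
    theta_y : theta_y[j] is the weight from h_j to the prediction
    """
    n_hidden = len(theta_y)
    # h_j = sigma(sum_i x_i * theta_h[i][j])
    h = [
        sigmoid(sum(x[i] * theta_h[i][j] for i in range(len(x))))
        for j in range(n_hidden)
    ]
    # y_hat = sigma(sum_j h_j * theta_y[j])
    y_hat = sigmoid(sum(h[j] * theta_y[j] for j in range(n_hidden)))
    return h, y_hat


def cost(dataset, theta_h, theta_y):
    """Mean of the per-example squared errors (y - h_theta(x))^2."""
    losses = [(y - forward(x, theta_h, theta_y)[1]) ** 2 for x, y in dataset]
    return sum(losses) / len(losses)


if __name__ == "__main__":
    # Toy values, chosen only for illustration.
    theta_h = [[0.5, -0.5], [0.25, 0.75]]  # 2 inputs -> 2 hidden units
    theta_y = [1.0, -1.0]                  # 2 hidden units -> 1 output
    dataset = [([1.0, 0.0], 1.0), ([0.0, 1.0], 0.0)]

    h, y_hat = forward([1.0, 0.0], theta_h, theta_y)
    print("hidden:", h)
    print("prediction:", y_hat)
    print("cost:", cost(dataset, theta_h, theta_y))
```

Because every activation passes through \(\sigma\), the prediction depends non-linearly on the weights, which is the contrast with the earlier linear-in-\(\theta\) models.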
note! we often