---
title: "Neural Networks"
type: concept
related: [Sigmoid, Seperable Diffequ, Logistic Regression]
source: https://www.jemoka.com/posts/kbhneural_networks/
confidence: high
status: active
---

Neural Network Unit
A unit takes a real-valued vector as input, multiplies each component by a weight, sums the results together with a bias term, and squashes the sum with a non-linear transform. The weighted sum is:
\begin{equation} z = w\cdot x + b \end{equation}
We then squash \(z\) by passing it through an "activation" function:
\begin{equation} y = \text{sigmoid}(z) \end{equation}
One common activation is the sigmoid, so one common formulation is:
\begin{equation} y = \frac{1}{1+\exp (- (w \cdot x + b))} \end{equation}
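The unit above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function names `sigmoid` and `unit` and the numeric values of `w`, `x`, and `b` are assumptions made up for the example.

```python
import math

def sigmoid(z):
    """Logistic sigmoid: squashes any real z into the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def unit(w, x, b):
    """One neural unit: weighted sum plus bias (z = w . x + b), then sigmoid."""
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    return sigmoid(z)

# Illustrative values only (not taken from the note).
w = [0.2, 0.3, 0.9]
x = [0.5, 0.6, 0.1]
b = 0.5
y = unit(w, x, b)   # here z = 0.1 + 0.18 + 0.09 + 0.5 = 0.87
print(y)
```

Whatever the weights are, the output stays strictly between 0 and 1, which is what lets us read it as a probability.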
Tanh
\begin{equation} y(z) = \frac{e^{z} - e^{-z}}{e^{z}+e^{-z}} \end{equation}
Tanh "saturates": its derivative approaches \(0\) when \(|z|\) is large.
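To see saturation numerically, we can evaluate the derivative of tanh, which is \(1 - \tanh^{2}(z)\), at a few points. The helper name `tanh_grad` and the sample values of `z` are assumptions chosen for illustration.

```python
import math

def tanh_grad(z):
    """Derivative of tanh: d/dz tanh(z) = 1 - tanh(z)^2."""
    return 1.0 - math.tanh(z) ** 2

# The gradient is largest at z = 0 and shrinks quickly as z grows.
for z in [0.0, 2.0, 5.0, 10.0]:
    print(f"z={z:5.1f}  tanh={math.tanh(z):.6f}  grad={tanh_grad(z):.2e}")
```

A unit whose input sits far from zero therefore passes back almost no gradient, so its weights barely move during training.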
ReLU
\begin{equation} y(z) = \max(z,0) \end{equation}
multi-layer networks
A single computing unit can't compute XOR. Consider a perceptron, whose decision boundary is:
\begin{equation} w_1x_1 + w_2x_2 + b = 0 \end{equation}
meaning:
\begin{equation} x_2 = \qty(\frac{-w_1}{w_2})x_1 + \qty(\frac{-b}{w_2}) \end{equation}
That is, we obtain a line that acts as a decision boundary: the output is 0 if the input is on one side of the line and 1 if it is on the other. XOR, unfortunately, cannot be split by a single linear boundary; it is not linearly separable.
Logistic regression, for instance, can't compute XOR because it is linear until the squashing step.
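A small sketch makes the contrast concrete. Single perceptrons with hand-picked weights handle AND and OR, while XOR needs a hidden layer. The weights below are assumptions chosen by hand (a commonly used textbook-style construction with two ReLU hidden units), not learned values and not something this note specifies.

```python
def step(z):
    """Perceptron threshold: 1 if z > 0, else 0."""
    return 1 if z > 0 else 0

def perceptron(w1, w2, b):
    """Return a function computing step(w1*x1 + w2*x2 + b)."""
    return lambda x1, x2: step(w1 * x1 + w2 * x2 + b)

def relu(z):
    return max(z, 0)

# Hand-picked weights (assumed for illustration, not learned).
AND = perceptron(1, 1, -1)   # fires only when x1 + x2 > 1
OR = perceptron(1, 1, 0)     # fires when x1 + x2 > 0

def xor(x1, x2):
    """Two-layer network: two ReLU hidden units feeding one linear output."""
    h1 = relu(x1 + x2)        # hidden unit 1: weights (1, 1), bias 0
    h2 = relu(x1 + x2 - 1)    # hidden unit 2: weights (1, 1), bias -1
    return h1 - 2 * h2        # output: weights (1, -2), bias 0

inputs = [(0, 0), (0, 1), (1, 0), (1, 1)]
and_out = [AND(a, b) for a, b in inputs]
or_out = [OR(a, b) for a, b in inputs]
xor_out = [xor(a, b) for a, b in inputs]
print(and_out, or_out, xor_out)
```

The hidden layer remaps the four inputs so that the final unit can separate them with a single line, which no choice of `w1`, `w2`, `b` achieves on the raw inputs.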
feed-forward network
We can think of logistic regression as a one-layer network. Softmax generalizes the sigmoid from two classes to \(k\) classes:
\begin{equation} \text{softmax}(z_{i}) = \frac{\exp(z_{i})}{\sum_{j=1}^{k} \exp(z_{j})} \end{equation}
Multinomial logistic regression uses the softmax above, and is considered a "layer" in the feed-forward network.
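A minimal softmax sketch follows. Subtracting the maximum score before exponentiating is a standard numerical-stability trick that does not change the result; it is an addition here, not something the note states. The input scores are made up.

```python
import math

def softmax(z):
    """Map a list of scores z to probabilities that sum to 1."""
    m = max(z)                               # shift by the max to avoid overflow
    exps = [math.exp(zi - m) for zi in z]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([2.0, 1.0, 0.1])             # illustrative scores
print(probs)

# With two classes and the second score fixed at 0, softmax reduces to sigmoid.
two_class = softmax([1.5, 0.0])[0]
print(two_class, 1.0 / (1.0 + math.exp(-1.5)))
```

The last two printed values agree, which is the sense in which softmax generalizes the sigmoid.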
notation:
- \(W^{(j)}\), the weight matrix for layer \(j\)
- \(b^{(j)}\), the bias vector for layer \(j\)
- \(g^{(j)}\), the activation function at layer \(j\)
- \(z^{(j)}\), the output at layer \(j\) (before the activation function)
- \(a^{(j)}\), the activation at layer \(j\)

Instead of a bias, we sometimes add a dummy node \(a_{0}\): we force \(a_{0}\) to the value \(1\) and use its weights as the bias.
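Using this notation, a forward pass can be sketched as repeatedly computing \(z^{(j)} = W^{(j)} a^{(j-1)} + b^{(j)}\) and \(a^{(j)} = g^{(j)}(z^{(j)})\), with the input playing the role of \(a^{(0)}\). That recurrence is the usual convention and is assumed here. The 2-3-2 network, its weights, and the helper names (`matvec`, `forward`, `with_dummy`) are all hypothetical.

```python
import math

def matvec(W, a):
    """Multiply weight matrix W (a list of rows) by vector a."""
    return [sum(wij * aj for wij, aj in zip(row, a)) for row in W]

def relu(z):
    return [max(zi, 0.0) for zi in z]

def softmax(z):
    m = max(z)
    exps = [math.exp(zi - m) for zi in z]
    total = sum(exps)
    return [e / total for e in exps]

def forward(x, layers):
    """layers is a list of (W, b, g) triples, one per layer j."""
    a = x                                                  # a^(0) is the input
    for W, b, g in layers:
        z = [zi + bi for zi, bi in zip(matvec(W, a), b)]   # z^(j)
        a = g(z)                                           # a^(j)
    return a

# Hypothetical 2-3-2 network with made-up weights.
layers = [
    ([[1.0, -1.0], [0.5, 0.5], [-1.0, 2.0]], [0.0, 0.1, -0.2], relu),
    ([[1.0, 0.0, -1.0], [0.0, 1.0, 1.0]], [0.0, 0.0], softmax),
]
x = [1.0, 2.0]
y_hat = forward(x, layers)
print(y_hat)

def with_dummy(W, b):
    """Fold the bias into W as an extra first column (weights of a_0)."""
    return [[bi] + row for row, bi in zip(W, b)]

# Bias as a dummy node: prepend a_0 = 1 to the input of the first layer.
W1, b1, _ = layers[0]
z_bias = [zi + bi for zi, bi in zip(matvec(W1, x), b1)]
z_dummy = matvec(with_dummy(W1, b1), [1.0] + x)
print(z_bias, z_dummy)
```

The two printed vectors match up to floating-point rounding, showing that the dummy-node form is only a change of bookkeeping.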
embeddings
We use a vector-space model to feed words into networks: each word is first converted into an embedding, and the embeddings are then fed into the network.
A network needs a fixed-length input, but sentences vary in length. Ways to fix the length problem:
- sentence embedding (the mean of all the word embeddings)
- element-wise max of all the word embeddings to create a sentence embedding
- use the max length and pad

For language models, we can use a "sliding window"; that is:
\begin{equation} P(w_{t}|w_{1 \dots t-1}) \approx P(w_{t} | w_{t-N+1 \dots t-1}) \end{equation}
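The length fixes and the sliding window can be sketched as below. The toy two-dimensional embeddings, the `<pad>` token, and all function names are assumptions for illustration. `pad` assumes `max_len` is at least the length of the longest sentence.

```python
def mean_pool(vectors):
    """Sentence embedding as the mean of the word embeddings."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def max_pool(vectors):
    """Sentence embedding as the element-wise max of the word embeddings."""
    return [max(col) for col in zip(*vectors)]

def pad(tokens, max_len, pad_token="<pad>"):
    """Right-pad a token list up to max_len items."""
    return tokens + [pad_token] * (max_len - len(tokens))

def sliding_windows(tokens, n):
    """Yield (context, target) pairs; context is the previous n-1 tokens."""
    for t in range(n - 1, len(tokens)):
        yield tokens[t - n + 1:t], tokens[t]

# Toy 2-dimensional embeddings, invented for illustration.
emb = {"the": [0.0, 1.0], "cat": [2.0, 3.0], "sat": [4.0, -1.0]}
sentence = ["the", "cat", "sat"]
vecs = [emb[w] for w in sentence]

mean_vec = mean_pool(vecs)
max_vec = max_pool(vecs)
padded = pad(sentence, 5)
windows = list(sliding_windows(["a", "b", "c", "d"], 3))
print(mean_vec, max_vec, padded, windows)
```

With `n = 3`, each target word is predicted from only the two words before it, which is the approximation in the equation above with \(N = 3\).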
Training
For every tuple \((x,y)\), we run a forward pass to obtain \(\hat{y}\). Then, we run the network backwards to update the weights.
The loss function computes the negative log of the probability that the network assigns to the correct label (the cross-entropy loss).
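As a minimal sketch of this loop, consider the simplest case: a single sigmoid unit trained with stochastic gradient descent. For this case the gradient of the cross-entropy loss with respect to \(z\) is \(\hat{y} - y\), so the backward step is a single line. The toy dataset (logical OR), the learning rate, and the epoch count are assumptions picked for illustration.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def predict(w, b, x):
    """Forward pass of a single sigmoid unit."""
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)

def cross_entropy(y_hat, y):
    """Negative log probability the model assigns to the correct label y."""
    p_correct = y_hat if y == 1 else 1.0 - y_hat
    return -math.log(p_correct)

def sgd_step(w, b, x, y, lr):
    """One update: forward pass, then move the weights against the gradient."""
    y_hat = predict(w, b, x)
    err = y_hat - y                      # dL/dz for sigmoid + cross-entropy
    w = [wi - lr * err * xi for wi, xi in zip(w, x)]
    b = b - lr * err
    return w, b

# Toy, linearly separable data (logical OR), invented for illustration.
data = [([0.0, 0.0], 0), ([0.0, 1.0], 1), ([1.0, 0.0], 1), ([1.0, 1.0], 1)]

def mean_loss(w, b):
    return sum(cross_entropy(predict(w, b, x), y) for x, y in data) / len(data)

w, b = [0.0, 0.0], 0.0
loss_before = mean_loss(w, b)
for _ in range(500):                     # assumed epoch count
    for x, y in data:
        w, b = sgd_step(w, b, x, y, lr=0.5)
loss_after = mean_loss(w, b)
print(loss_before, loss_after)
```

In a multi-layer network the same idea applies, except that the error has to be propagated back through every layer to reach the earlier weights.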
backpropagation
See backprop.