Basics of ML for 224n
Random thoughts as I scan through the book:
Central framing: learning as a means to generalize + predict
Key Tasks regression (predict a value) binary classification (sort an example into a boolean class of Y/N) multi-class classification (sort an example into multiple classes) ranking (sorting an object into relevant order) Optimality of Bayes Optimal Classifier If you have the underlying data distribution, we can classify y from x by choosing:
where P(a,b) is the probability of a,b pain on the data D; but like you don’t have the probability distributed over D, so sadge.
Optimality:
assume for contradiction \exists some g which works better than f, meaning \exists x: g(x) \neq f(x). The error rate of f on x is 1-P(x, f(x)), of g is 1-P(x, g(x)). Yet, P(x,f(x)) is maximized by definition of f, so 1-P(x, f(x)) \leq 1-P(x,g(x)), meaning, P(x,g(x)) \leq P(x,f(x)); recall P is the distribution on actual data so g is worse than f, reaching contradiction. \blacksquare
Decision Tree Decision Tree
Bias Inductive Bias Inductive Bias is defined here as the bias towards a particular choice of solution in a set of possible variations of valid solutions in absence of data which further narrows down the relevant concept.
Nice.
Confused about their treatment of parity, will come back later.
So true so true
Sooo true I officially have no opinions on their treatment of KNN/kmeans or single layer perceptrons
I wish there was a proof for why kmeans works
Perception See logistic regression and Neural Networks
Perception Convergence Suppose D is linearly separable with margin \gamma >0 (otherwise it wouldn’t be very separable); assume \mid x \mid \leq 1; then, 1-layer perceptrons converge in \frac{1}{\gamma^{2}} updates.
Sketch: take the fact that w^{(k)} = w^{(k-1)} + yx. Compare it to w^{*} to obtain that w^{*} \cdot w^{(k)} \geq k \gamma. Norm it and because x is within 1 so \mid w^{(k)}\mid^{2} \leq k. Now do algebra.
We will obtain k \leq \frac{1}{\gamma^{2}}.