---
title: "Interesting Facts in Machine Learning"
visibility: public
---

# Interesting Facts in Machine Learning

Category: [[machine-intelligence|Machine Intelligence]]

[Read the original document](https://docs.google.com/document/d/1kv_g4eO_xKvdu6GLR6xNyogq4uURs3ovdELj8d1DFCg/edit?usp=drivesdk&sa=D&ust=1596495076583000&usg=AOvVaw3kJxGzrtc1sLmadxtwktrI)

---

## Linear Regression

1. You can get better generalization with a stochastic solver
2. The fastest solution is often through QR factorization rather than computing the inverse or pseudo-inverse, and it is also more numerically stable. Linear regression is unlike almost every other algorithm in this respect.
3. Scaling can still be important for the optimizer: the model is technically convex and reaches the same solution either way, but gradient-based solvers converge more slowly on poorly scaled features
4. Linear generalization w/ quality feature engineering is stronger than almost every other form of generalization for unstructured data (trees + networks overfit)
5. Best in terms of not overfitting the data: the optimal algorithm in low signal-to-noise ratio environments
6. One of the few major supervised algorithms with a closed-form solution
7. Every relationship between a feature and the label should be as close to linear as possible
8. You can use a Box-Cox transform to automatically bring a (positive-valued) feature closer to linear
9. Convex
10. Lasso is not invariant to rescaling
11. Bayesian perspective: an L1 penalty corresponds to a Laplace prior on the coefficients, an L2 penalty to a Gaussian prior (the penalized solution is the MAP estimate)
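
The closed-form and rescaling points above can be checked on a toy problem. This is a minimal sketch, assuming a single feature, no intercept, and the lasso objective `(1 / 2n) * sum((y - w * x)^2) + penalty * |w|`; the data and the function names `fit_ols`, `fit_lasso`, and `soft_threshold` are hypothetical and chosen only for illustration.

```python
def soft_threshold(value, penalty):
    """Shrink value toward zero by penalty, clipping at zero."""
    if value > penalty:
        return value - penalty
    if value < -penalty:
        return value + penalty
    return 0.0


def fit_ols(xs, ys):
    """Closed-form least squares for y ~ w * x (one feature, no intercept)."""
    return sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)


def fit_lasso(xs, ys, penalty):
    """Closed-form one-feature lasso: soft-threshold the correlation, then rescale."""
    n = len(xs)
    correlation = sum(x * y for x, y in zip(xs, ys)) / n
    scale = sum(x * x for x in xs) / n
    return soft_threshold(correlation, penalty) / scale


# Hypothetical toy data, roughly y = x.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [1.1, 1.9, 3.2, 3.9]
xs_rescaled = [0.1 * x for x in xs]  # the same feature expressed in different units

# Predictions for the same input (x = 1.0, which is 0.1 after rescaling).
ols_pred = fit_ols(xs, ys) * 1.0
ols_pred_rescaled = fit_ols(xs_rescaled, ys) * 0.1
lasso_pred = fit_lasso(xs, ys, penalty=0.5) * 1.0
lasso_pred_rescaled = fit_lasso(xs_rescaled, ys, penalty=0.5) * 0.1

print("OLS:  ", ols_pred, ols_pred_rescaled)      # agree up to rounding
print("Lasso:", lasso_pred, lasso_pred_rescaled)  # differ after rescaling
```

The unpenalized fit absorbs the change of units into the coefficient, so its predictions are unchanged. The lasso penalty acts on the raw coefficient, so the same penalty shrinks the rescaled model much harder, which is one reason features are usually standardized before fitting a lasso.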

## Logistic Regression

1. Two major ways to do multiclass classification:
   1. Softmax loss
   2. One-vs-all with a binary (logistic) function per class
2. Naming:
   1. “Logistic” regression due to Sigmoid (logistic) function
   2. “Softmax” regression due to softmax function
3. No closed form solution, despite convexity
4. Many, many optimizers:
   1. Newton / Newton-CG
   2. BFGS
      1. L-BFGS
   3. IRLS
   4. Trust Region Conjugate Gradient
   5. Gradient Descent
      1. GD + Line Search
   6. Stochastic Average Gradient
5. Difficult Bayesian Solutions (No convenient conjugate prior)
6. Discriminative: learns P(Y|X) directly, rather than first learning the joint P(Y, X) and then conditioning on X (the generative approach)
7. Without regularization, the weights can become arbitrarily large (they diverge when the classes are linearly separable), damaging generalization. Penalties are more important than in the regression setting.
8. You can get better generalization with a stochastic solver [https://arxiv.org/pdf/1708.05070.pdf]
9. Scaling can still be important for the optimizer: the model is technically convex and reaches the same solution either way, but gradient-based solvers converge more slowly on poorly scaled features
10. Linear generalization is stronger than almost every other form of generalization for unstructured data (trees + networks overfit)
11. Every relationship between a feature and the label should be as close to linear as possible
12. You can use a Box-Cox transform to automatically bring a (positive-valued) feature closer to linear
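
The point about unbounded weights can be illustrated with plain gradient descent, since there is no closed-form solution to call. This is a rough sketch under stated assumptions: one feature, no intercept, a hypothetical linearly separable dataset, and a helper named `fit_logistic` that is not taken from any library.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def fit_logistic(xs, ys, steps, l2_penalty=0.0, learning_rate=1.0):
    """Gradient descent on the mean log loss of p = sigmoid(w * x),
    plus an optional (l2_penalty / 2) * w^2 term."""
    w = 0.0
    n = len(xs)
    for _ in range(steps):
        grad = sum((sigmoid(w * x) - y) * x for x, y in zip(xs, ys)) / n
        w -= learning_rate * (grad + l2_penalty * w)
    return w


# Hypothetical separable data: every negative x is class 0, every positive x is class 1.
xs = [-2.0, -1.0, 1.0, 2.0]
ys = [0, 0, 1, 1]

w_short = fit_logistic(xs, ys, steps=200)
w_long = fit_logistic(xs, ys, steps=2000)
w_penalized = fit_logistic(xs, ys, steps=2000, l2_penalty=0.1)

print("unregularized, 200 steps: ", w_short)
print("unregularized, 2000 steps:", w_long)       # keeps growing with more steps
print("L2-penalized, 2000 steps: ", w_penalized)  # settles at a finite value
```

On separable data the unpenalized loss can always be lowered by scaling the weight up, so the optimizer never settles; the penalty gives the objective a finite minimizer.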

## Decision Trees

1. Captures structure that looks like discontinuities or thresholds in a feature
   1. This is close to quantized structure!
      1. This is a great example of model blending. If your input features are the leaving time and the countdown time of each stoplight on the way, and you use both a decision tree and an MLP, you capture the quantized structure (hitting a stoplight leads to a large difference in arrival time) and the continuous structure (the relationship between leaving time and arrival time).
2. Captures discrete structure in continuous and discrete features
3. Fails to capture continuous structure
4. Extremely poor generalization out of domain - best case, takes the most extreme example found in training data
5. Works over missing data
   1. One approach is to stop when you hit a missing data point and assign the classification from the largest class in the label distribution of the remaining children
6. Learns hierarchy of feature interactions, top down
   1. Question - decision trees are learned top-down. How can we do supervised learning bottom up? Hierarchical clustering w/ supervision? 
7. Recursively chooses the split that leads to the greatest variance reduction (for regression) or the greatest impurity reduction, measured by entropy (information gain) or Gini impurity (for classification).
8. Insensitive to monotone transformations of features (only cares how the distribution of labels varies across split points)
9. Greedy algorithm
10. Can be seen as a hierarchical mixture of experts (train expert models on subsets of the data)
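
The greedy split search, the insensitivity to monotone transformations, and the out-of-domain behavior can all be seen in a one-split regression tree (a stump). This is a minimal sketch on hypothetical data; `best_split` and `stump_predict` are illustrative names, and a real tree would apply the same search recursively to each side.

```python
import math


def sse(values):
    """Sum of squared errors around the mean (n times the variance)."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)


def best_split(xs, ys):
    """Greedy search for the threshold with the largest reduction in squared error."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    total = sse(ys)
    best_gain, best_threshold, best_left = 0.0, None, None
    for k in range(1, len(order)):
        lo, hi = xs[order[k - 1]], xs[order[k]]
        if lo == hi:
            continue  # cannot split between identical feature values
        left, right = order[:k], order[k:]
        gain = total - sse([ys[i] for i in left]) - sse([ys[i] for i in right])
        if gain > best_gain:
            best_gain, best_threshold, best_left = gain, (lo + hi) / 2, frozenset(left)
    return best_threshold, best_left


def stump_predict(x, threshold, xs, ys):
    """Predict the mean label of whichever side of the split x falls on."""
    side = [y for xi, y in zip(xs, ys) if (xi <= threshold) == (x <= threshold)]
    return sum(side) / len(side)


# Hypothetical data with a threshold effect between x = 4 and x = 10.
xs = [1.0, 2.0, 3.0, 4.0, 10.0, 11.0, 12.0, 13.0]
ys = [1.0, 1.2, 0.9, 1.1, 5.0, 5.2, 4.9, 5.1]

threshold_raw, left_raw = best_split(xs, ys)
threshold_log, left_log = best_split([math.log(x) for x in xs], ys)

print(left_raw == left_log)  # the same examples end up on the left side
print(stump_predict(1000.0, threshold_raw, xs, ys))  # far out of domain, still a training-set mean
```

The log transform changes the threshold value but not the partition of the examples, because only the ordering of feature values matters. The prediction for `x = 1000.0` is just the mean of the right-hand leaf, so the stump cannot extrapolate beyond labels it has seen.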

## Neural Networks

1. Learns compositional (bottom-up) hierarchical structure
2. Model complexity overcomes the curse of dimensionality
   1. Combinatorial in depth and in width
3. Requires high signal-to-noise ratio
4. ‘Just’ adaptive basis function regression
5. Optimizer improved by exponentially weighted averages of the gradient (momentum) and of the squared gradient, which adapts the learning rate (as in RMSProp and Adam)
6. Covariate Shift
7. Close-to-linear behavior leads to failures of generalization, e.g. adversarial examples
8. Dimensionality of the representation increases with depth of a convnet.
9. Softmax leads to extreme solutions
10. Non-convex optimization surface is dominated by saddle points.
11. Convnets are defined by:
   1. Parameter sharing, which leads to translation equivariance
   2. Locality (sparse connectivity)
   3. Composition
   4. No built-in equivariance to scale or rotation
12. Many machine learning libraries implement cross-correlation but call it convolution
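
The difference between cross-correlation and convolution, and the translation equivariance that comes from sharing one kernel across positions, can be shown in one dimension. This is a minimal sketch in plain Python with hypothetical inputs; it uses "valid" outputs (no padding), and the function names are illustrative rather than any library's API.

```python
def cross_correlate(signal, kernel):
    """'Valid' cross-correlation: slide the kernel over the signal without flipping it."""
    n, k = len(signal), len(kernel)
    return [sum(signal[i + j] * kernel[j] for j in range(k)) for i in range(n - k + 1)]


def convolve(signal, kernel):
    """'Valid' convolution: the same operation with the kernel reversed."""
    return cross_correlate(signal, kernel[::-1])


signal = [0, 0, 1, 2, 3, 0, 0, 0]
kernel = [1, 0, -1]  # asymmetric, so flipping it changes the result

print(cross_correlate(signal, kernel))  # [-1, -2, -2, 2, 3, 0]
print(convolve(signal, kernel))         # [1, 2, 2, -2, -3, 0]

# Translation equivariance: shifting the input by one step shifts the output by one step.
shifted = [0] + signal[:-1]
out = cross_correlate(signal, kernel)
out_shifted = cross_correlate(shifted, kernel)
print(out_shifted[1:] == out[:-1])  # True
```

Because the kernel weights are learned, a network can learn the flipped kernel just as easily, which is why the naming difference rarely matters in practice.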

## Optimization

1. Single-pass stochastic gradient descent, where each datapoint is touched only once, optimizes for the validation / test error directly, while batch gradient methods optimize for the training set error (and so can overfit).
   1. https://arxiv.org/abs/1509.01240
   2. http://papers.nips.cc/paper/6015-learning-with-incremental-iterative-regularization
2. Improved by exponentially weighted averages of the gradient (momentum) and of the squared gradient, which adapts the learning rate (as in RMSProp and Adam)
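
The effect of averaging the gradient can be sketched on an ill-conditioned quadratic. The loss, learning rates, and step count below are hypothetical choices for illustration, and only the gradient average (momentum) is shown, not the adaptive learning rate. Plain gradient descent on this loss is stable only for learning rates below 2 / 100 = 0.02, which makes progress along the shallow direction slow.

```python
def loss(w):
    """Ill-conditioned quadratic: curvature 1 along w[0], curvature 100 along w[1]."""
    return 0.5 * (w[0] ** 2 + 100.0 * w[1] ** 2)


def gradient(w):
    return [w[0], 100.0 * w[1]]


def plain_gd(w, learning_rate, steps):
    """Gradient descent using the raw gradient at every step."""
    w = list(w)
    for _ in range(steps):
        g = gradient(w)
        w = [wi - learning_rate * gi for wi, gi in zip(w, g)]
    return w


def ema_gd(w, learning_rate, steps, beta=0.9):
    """Gradient descent that steps along an exponentially weighted average of gradients."""
    w = list(w)
    m = [0.0] * len(w)
    for _ in range(steps):
        g = gradient(w)
        m = [beta * mi + (1.0 - beta) * gi for mi, gi in zip(m, g)]
        w = [wi - learning_rate * mi for wi, mi in zip(w, m)]
    return w


start = [1.0, 1.0]
w_plain = plain_gd(start, learning_rate=0.015, steps=300)
w_ema = ema_gd(start, learning_rate=0.1, steps=300)

print("plain GD loss:   ", loss(w_plain))
print("averaged GD loss:", loss(w_ema))  # lower for these settings
```

Averaging damps the oscillation along the steep direction, so a larger nominal learning rate stays stable and the shallow direction converges faster. The comparison depends on the settings chosen here and is not a general guarantee.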
