# davidrosenberg/mlcourse

Machine learning course materials.

| repo name | davidrosenberg/mlcourse |
| --- | --- |
| repo link | https://github.com/davidrosenberg/mlcourse |
| homepage | https://davidrosenberg.github.io/ml2019 |
| language | Jupyter Notebook |
| size (curr.) | 356314 kB |
| stars (curr.) | 314 |
| created | 2015-10-11 |
| license | |

## Notable Changes from 2017FOML to 2018

- Elaborated on the case against sparsity in the lecture on elastic net, to complement the reasons *for* sparsity on the slide "Lasso Gives Feature Sparsity: So What?".
- Added a note on conditional expectations, since many students find the notation confusing.
- Added a note on the correlated features theorem for elastic net, which was basically a translation of Zou and Hastie’s 2005 paper “Regularization and variable selection via the elastic net” into the notation of our class, dropping an unnecessary centering condition and using a more standard definition of correlation.
- Changes to the EM algorithm presentation: added several diagrams (slides 10-14) to give the general idea of a variational method, and made explicit that the marginal log-likelihood is exactly the pointwise supremum over the variational lower bounds (slides 31 and 32).
- Treatment of the representer theorem is now well before any mention of kernels, and is described as an interesting consequence of basic linear algebra: “Look how the solution always lies in the subspace spanned by the data. That’s interesting (and obvious with enough practice). We can now constrain our optimization problem to this subspace…”
- The kernel methods lecture was rewritten to significantly reduce references to the feature map. When we’re just talking about kernelization, it seems like unneeded extra notation.
- Replaced the 1-hour crash course on Lagrangian duality with a 10-minute summary, which I actually never presented and instead left as optional reading.
- Added a brief note on Thompson sampling for Bernoulli bandits as a fun application for our unit on Bayesian statistics (see the sketch after this list).
- Significant improvement of the programming problem for lasso regression in Homework #2.
- New written and programming problems on logistic regression in Homework #5 (showing the equivalence of the ERM and the conditional probability model formulations, as well as implementing regularized logistic regression).
- New homework (Homework #7) on backpropagation (with Philipp Meerkamp and Pierre Garapon).
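
For concreteness, here is a minimal sketch of Thompson sampling for a Bernoulli bandit with Beta priors; the arm success probabilities, horizon, and seed below are made up purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
true_means = [0.3, 0.5, 0.7]          # hypothetical arm success probabilities
n_arms, horizon = len(true_means), 1000

# Beta(1, 1) prior on each arm's success probability
alpha = np.ones(n_arms)
beta = np.ones(n_arms)

for t in range(horizon):
    # Sample a plausible mean from each arm's posterior and play the best sample
    theta = rng.beta(alpha, beta)
    arm = int(np.argmax(theta))
    reward = rng.binomial(1, true_means[arm])
    # Conjugate Beta-Bernoulli posterior update
    alpha[arm] += reward
    beta[arm] += 1 - reward

print("posterior means:", alpha / (alpha + beta))
```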

## Notable Changes from 2017 to 2017FOML

- This version of the course didn’t have any ML prerequisites, so I added a couple of lectures on the basics:
  - Added a lecture on Black Box ML.
  - Added a lecture on standard methods of evaluating classifier performance.

- Added a note on the main takeaways from duality for the SVM.
- Rather than go through the full derivation of the SVM dual, in the new lecture I just state the dual formulation and highlight the insights we get from the complementary slackness conditions, with an emphasis on the “sparsity in the data” (see the note after this list).
- Dropped the geometric derivation of SVMs and all mention of hard-margin SVM. It was always a crowd-pleaser, but I don’t think it’s worth the time. Seemed most useful as a review of affine spaces, projections, and other basic linear algebra.
- Dropped most of the AdaBoost lecture, except to mention it as a special case of forward stagewise additive modeling with an exponential loss (slides 24-29).
- New worked example for predicting Poisson distributions with linear and gradient boosting models.
- New module on back propagation.
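
For readers who want the punchline being referenced, here is a brief sketch of the standard soft-margin SVM dual facts (written with one common scaling of the objective and with the bias omitted; not tied to these particular slides). The dual optimum gives

$$
w^* = \sum_{i=1}^{n} \alpha_i^* \, y_i \, x_i, \qquad 0 \le \alpha_i^* \le \tfrac{c}{n},
$$

and complementary slackness forces $\alpha_i^* = 0$ whenever $y_i \, {w^*}^\top x_i > 1$. So only examples on or violating the margin (the support vectors) contribute to $w^*$, which is the “sparsity in the data” mentioned above.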

## Notable Changes from 2016 to 2017

- New lecture on geometric approach to SVMs (Brett)
- New lecture on principal component analysis (Brett)
- Added slide on k-means++ (Brett)
- Added slides on the explicit feature vector for the 1-dim RBF kernel (see the expansion after this list)
- Created notebook to regenerate the buggy lasso/elastic net plots from Hastie’s book (Vlad)
- L2 constraint for linear models gives Lipschitz continuity of prediction function (Thanks to Brian Dalessandro for pointing this out to me).
- Expanded discussion of L1/L2/ElasticNet with correlated random variables (Thanks Brett for the figures)
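
For reference, the identity behind those RBF slides is the standard power-series expansion of the 1-dim RBF kernel; with bandwidth $\sigma$ (one common parameterization),

$$
e^{-\frac{(x - x')^2}{2\sigma^2}}
= \sum_{j=0}^{\infty} \phi_j(x)\,\phi_j(x'),
\qquad
\phi_j(x) = e^{-\frac{x^2}{2\sigma^2}} \, \frac{x^j}{\sigma^j \sqrt{j!}},
$$

so the RBF kernel corresponds to an infinite-dimensional explicit feature vector $\phi(x) = (\phi_0(x), \phi_1(x), \dots)$.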

## Notable Changes from 2015 to 2016

- New lecture on **multiclass classification** and an intro to **structured prediction**
- New homework on **multiclass hinge loss** and **multiclass SVM**
- New homework on Bayesian methods, specifically the **beta-binomial model, hierarchical models, empirical Bayes ML-II, MAP-II**
- New short lecture on correlated variables with L1, L2, and **Elastic Net** regularization
- Added some details about subgradient methods, including a one-slide proof that subgradient descent moves us towards a minimizer of a convex function (based on Boyd’s notes; see the sketch at the end of this list)
- Added some review notes on directional derivatives, gradients, and first-order approximations
- Added light discussion of convergence rates for SGD vs GD (accidentally left out theorem for SGD)
- For lack of time, dropped the curse of dimensionality discussion, originally based on Guillaume Obozinski’s slides
- New lecture (from slide 12) on the **Representer Theorem** (without RKHS) and its use for kernelization (based on Shalev-Shwartz and Ben-David’s book)
- Dropped the kernel machine approach (slide 16) to introducing kernels, which was based on the approach in Kevin Murphy’s book
- Added EM algorithm convergence theorem (slide 20) based on Vaida’s result
- New lecture giving more details on gradient boosting, including brief mentions of some variants (**stochastic gradient boosting**, **LogitBoost**, **XGBoost**)
- New worked example for predicting exponential distributions with generalized linear models and gradient boosting models.
- Deconstructed 2015’s lecture on generalized linear models, which started with natural exponential families (slide 15) and built up to a definition of GLMs (slide 20). Instead, presented the more general notion of conditional probability models, focused on using MLE, gave multiple examples, and relegated the formal introduction of exponential families and generalized linear models to the end.
- Removed equality constraints from convex optimization lecture to simplify, but check here if you want them back
- Dropped content on Bayesian Naive Bayes, for lack of time
- Dropped formal discussion of k-means objective function (slide 9)
- Dropped the brief introduction to **information theory**. It was initially included since we needed to introduce KL divergence and Gibbs’ inequality anyway for the EM algorithm; the mathematical prerequisites are now given here (slide 15).
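
The idea behind that one-slide subgradient proof, paraphrased as the standard argument from Boyd’s subgradient-method notes (a sketch, not the slide’s exact statement): a subgradient step need not decrease $f$, but for a small enough step size it decreases the distance to any minimizer. For convex $f$, a subgradient $g \in \partial f(x)$, a minimizer $x^*$, and the step $x^+ = x - \eta g$ with $\eta > 0$,

$$
\|x^+ - x^*\|^2 = \|x - x^*\|^2 - 2\eta \, g^\top (x - x^*) + \eta^2 \|g\|^2
\le \|x - x^*\|^2 - 2\eta \big( f(x) - f(x^*) \big) + \eta^2 \|g\|^2,
$$

so whenever $f(x) > f(x^*)$ and $\eta < 2\big(f(x) - f(x^*)\big) / \|g\|^2$, the step moves $x$ strictly closer to $x^*$.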

## Possible Future Topics

### Basic Techniques

- Gaussian processes
- MCMC (or at least Gibbs sampling)
- Importance sampling
- Density ratio estimation (for covariate shift, anomaly detection, conditional probability modeling)
- Local methods (knn, locally weighted regression, etc.)

### Applications

- Collaborative filtering / matrix factorization (building on this lecture on matrix factorization and Brett’s lecture on PCA)
- Learning to rank and associated concepts
- Bandits / learning from logged data?
- Generalized additive models for interpretable nonlinear fits (smoothing way, basis function way, and gradient boosting way)
- Automated hyperparameter search (with GPs, random, hyperband,…)
- Active learning
- Domain shift / covariate shift adaptation
- Reinforcement learning (minimal path to REINFORCE)

#### Latent Variable Models

- PPCA / Factor Analysis and non-Gaussian generalizations
  - Personality types as an example of factor analysis, if we can get data?

- Variational Autoencoders
- Latent Dirichlet Allocation / topic models
- Generative models for images and text (where we care about the human-perceived quality of what’s generated rather than the likelihood given to test examples) (GANs and friends)

#### Bayesian Models

- Relevance vector machines
- BART
- Gaussian process regression and conditional probability models

### Technical Points

- Overfitting the validation set?
- Link to paper on subgradient convergence for tame functions

### Other

- Class imbalance
- Black box feature importance measures (building on Ben’s 2018 lecture)
- Quantile regression and conditional prediction intervals (perhaps integrated into homework on loss functions)
- More depth on basic neural networks: weight initialization, vanishing / exploding gradient, possibly batch normalization
- Finish up ‘structured prediction’ with beam search / Viterbi
  - Give the probabilistic analogue with MEMMs / CRFs

- Generative vs discriminative (Jordan & Ng’s naive bayes vs logistic regression, plus new experiments including regularization)
- Something about causality?
- DART
- LightGBM and CatBoost’s efficient handling of categorical features (i.e., handling categorical features in regression trees)

## Citation Information

Machine Learning Course Materials by Various Authors is licensed under a Creative Commons Attribution 4.0 International License. The author of each document in this repository is considered the license holder for that document.