CPSC 5XX - Numerical Optimization for Machine Learning (July 2022)

Last-ish Friday of the month from 3-5pm in ICICS X836.

Date / Slides / Related links:

July 7: How many iterations of gradient descent do we need? Related links: Cauchy's 1847 paper, Lipschitz relationships, practical line-searches, PL condition.
July 14: Momentum, acceleration, and second-order methods. Related links: heavy-ball, CG, SSO, accelerated gradient, restarting, quadratic convergence, damped Newton (Section 9.5), cubic regularization.
July 21: Coordinate optimization and stochastic gradient descent. Related links: random coordinate descent, shuffle coordinate descent, Gauss-Southwell, block coordinate descent, accelerated coordinate descent.
July 28: SGD with Constant Step Sizes, Growing Batches, and Over-Parameterization. Related links: non-convex SGD, decreasing-step SGD, constant-step SGD, shuffle SGD, growing batch sizes, SGC, accelerated SGD, non-uniform SGD, SGD + Armijo.
August 4: No lecture.
August 11: Variance Reduction and 1.5-Order Methods. Related links: SAG, SVRG, non-uniform sampling, acceleration, loopless SVRG, SGD*, SVRG for deep learning, diagonal approximation, Hessian-free Newton, mini-batch Hessian, Newton sketch, 2.5-order, Barzilai-Borwein, quasi-Newton (superlinear), L-BFGS, initialization, L-BFGS preconditioning, explicit superlinear.
August 18+: Baby break.
January 27: Projected Gradient, Projected Newton, and Frank-Wolfe. Related links: translation of the original PG and PN paper, projection onto simple sets (Section 8.1), Dykstra's algorithm, active-set identification and PG backtracking, spectral projected gradient, two-metric projection, projected quasi-Newton, projected coordinate descent, Frank-Wolfe.
February 17: Global Optimization, Subgradients, and Cutting Planes. Related links: random search, Bayesian optimization, harmless global optimization, BO rate, subgradients, subgradient method, stochastic subgradient, suffix averaging, (k+1) averaging, weakly-convex rate, tame function convergence, smoothing, adaptive smoothing, cutting planes, randomized center of gravity, ignoring non-smoothness, bundle methods, orthant-projected min-norm subgradient (Chapter 2).
April 21: Proximal-Gradient and Fenchel Duality. Related links: proximal-gradient (and acceleration), active-set complexity, proximal PL, group L1-regularization, structured sparsity, inexact proximal-gradient, proximal average, ADMM, coordinate-wise proximal-gradient, stochastic proximal-gradient, regularized dual averaging, proximal SVRG, proximal Newton, proximal point, convex conjugate and duality (Section 3.3 and Chapter 5), kernel methods, Lipschitz-smoothness and strong-convexity duality, Fenchel duality, SDCA, dual-free SDCA, gap safe screening, SVM safe screening.
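The July 7 lecture covers practical line-searches for gradient descent. As a rough illustration (a toy sketch, not taken from the course materials; the quadratic objective, tolerance, and Armijo constants below are chosen only for demonstration), here is gradient descent with an Armijo backtracking line-search:

```python
# Illustrative sketch: gradient descent with Armijo backtracking.
# The objective, tolerance, and constants are hypothetical choices, not from the slides.

def grad_descent_armijo(f, grad, x0, c=1e-4, beta=0.5, tol=1e-8, max_iter=1000):
    x = list(x0)
    for _ in range(max_iter):
        g = grad(x)
        g_norm_sq = sum(gi * gi for gi in g)
        if g_norm_sq < tol:  # stop when the gradient is (nearly) zero
            break
        alpha = 1.0
        fx = f(x)
        # Backtrack until the Armijo sufficient-decrease condition holds:
        #   f(x - alpha*g) <= f(x) - c * alpha * ||g||^2
        while f([xi - alpha * gi for xi, gi in zip(x, g)]) > fx - c * alpha * g_norm_sq:
            alpha *= beta
        x = [xi - alpha * gi for xi, gi in zip(x, g)]
    return x

# Toy example: minimize f(x, y) = (x - 1)^2 + 10*(y + 2)^2, minimizer (1, -2).
f = lambda x: (x[0] - 1) ** 2 + 10 * (x[1] + 2) ** 2
grad = lambda x: [2 * (x[0] - 1), 20 * (x[1] + 2)]
x_star = grad_descent_armijo(f, grad, [0.0, 0.0])
```

Backtracking starts from alpha = 1 each iteration, so no knowledge of the Lipschitz constant is needed; the sufficient-decrease test alone guarantees progress.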
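The July 14 lecture includes the heavy-ball method. A minimal sketch on a toy quadratic (again not from the course materials; the step size and momentum parameter are hand-picked values, not tuned to the problem's curvature):

```python
# Illustrative sketch: heavy-ball (Polyak momentum) iteration
#   x_{k+1} = x_k - alpha * grad(x_k) + beta * (x_k - x_{k-1}).
# alpha and beta here are hypothetical hand-picked constants.

def heavy_ball(grad, x0, alpha=0.05, beta=0.8, iters=500):
    x = list(x0)
    x_prev = list(x0)  # with x_0 = x_{-1}, the first step is plain gradient descent
    for _ in range(iters):
        g = grad(x)
        x_new = [xi - alpha * gi + beta * (xi - xpi)
                 for xi, gi, xpi in zip(x, g, x_prev)]
        x_prev, x = x, x_new
    return x

# Same toy quadratic: f(x, y) = (x - 1)^2 + 10*(y + 2)^2, minimizer (1, -2).
grad = lambda x: [2 * (x[0] - 1), 20 * (x[1] + 2)]
x_star = heavy_ball(grad, [0.0, 0.0])
```

The momentum term reuses the previous displacement, which damps the zig-zagging that plain gradient descent exhibits on ill-conditioned quadratics.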
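The April 21 lecture covers proximal-gradient methods for L1-regularized problems. As a self-contained sketch (not from the slides; the separable quadratic smooth term and the unit step size are illustrative assumptions that keep the example dependency-free), here is proximal-gradient (ISTA) with the soft-threshold prox of the L1 norm:

```python
# Illustrative sketch: proximal-gradient (ISTA) for
#   min_x 0.5 * sum_i (x_i - b_i)^2 + lam * ||x||_1.
# The smooth part is separable with Lipschitz constant 1, so step = 1 is valid.

def soft_threshold(z, t):
    # Proximal operator of t*|.|: shrink z toward zero by t.
    if z > t:
        return z - t
    if z < -t:
        return z + t
    return 0.0

def ista(b, lam, step=1.0, iters=100):
    x = [0.0] * len(b)
    for _ in range(iters):
        # Gradient step on the smooth part f(x) = 0.5 * sum((x_i - b_i)^2)...
        x = [xi - step * (xi - bi) for xi, bi in zip(x, b)]
        # ...followed by the prox of the nonsmooth L1 term.
        x = [soft_threshold(xi, step * lam) for xi in x]
    return x

# Entries of b smaller than lam in magnitude are driven exactly to zero.
x_star = ista([3.0, 0.5, -2.0], lam=1.0)  # -> [2.0, 0.0, -1.0]
```

The exact zeros in the solution illustrate the active-set identification property listed among the related links: after finitely many iterations the iterates land on (and stay on) the sparsity pattern of the solution.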

Additional notes:

Thanks to Philip Loewen for many comments that improved the slides.
