February 2, 2020

5166 words 25 mins read

mitmath/18335

18.335 - Introduction to Numerical Methods course


repo name	mitmath/18335
repo link	https://github.com/mitmath/18335
homepage
language	Jupyter Notebook
size (curr.)	149639 kB
stars (curr.)	132
created	2019-01-30
license

18.335J/6.337J: Introduction to Numerical Methods

This is the repository of course materials for the 18.335J/6.337J course at MIT, taught by Prof. Steven G. Johnson, in Spring 2020.

Syllabus

Lectures: Monday/Wednesday/Friday 3–4pm (2-190). Lecture videos are posted online. Office Hours: Thursday 4–5pm (2-345).

Topics: Advanced introduction to numerical linear algebra and related numerical methods. Topics include direct and iterative methods for linear systems, eigenvalue decompositions and QR/SVD factorizations, stability and accuracy of numerical algorithms, the IEEE floating-point standard, sparse and structured matrices, and linear algebra software. Other topics may include memory hierarchies and the impact of caches on algorithms, nonlinear optimization, numerical integration, FFTs, and sensitivity analysis. Problem sets will involve use of Julia, a Matlab-like environment (little or no prior experience required; you will learn as you go).

Prerequisites: Understanding of linear algebra (18.06, 18.700, or equivalents). 18.335 is a graduate-level subject, however, so much more mathematical maturity, ability to deal with abstractions and proofs, and general exposure to mathematics is assumed than for 18.06!

Textbook: The primary textbook for the course is Numerical Linear Algebra by Trefethen and Bau. (Readable online with MIT certificates.)

Other Reading: Previous terms can be found in branches of the 18335 git repository. The course notes from 18.335 in much earlier terms can be found on OpenCourseWare. For a review of iterative methods, the online books Templates for the Solution of Linear Systems (Barrett et al.) and Templates for the Solution of Algebraic Eigenvalue Problems are useful surveys.

Grading: 33% problem sets (about six, ~biweekly). 33% take-home mid-term exam (posted Monday Apr. 6 and due Tuesday Apr. 7), 34% final project (one-page proposal due Friday March 20, project due Tuesday May 12).

Psets will be submitted electronically via Stellar. Submit a good-quality PDF scan of any handwritten solutions and also a PDF printout of a Julia notebook of your computational solutions.

TA/grader: Jacob Gold.

Collaboration policy: Talk to anyone you want to and read anything you want to, with three exceptions: First, you may not refer to homework solutions from the previous terms in which I taught 18.335. Second, make a solid effort to solve a problem on your own before discussing it with classmates or googling. Third, no matter whom you talk to or what you read, write up the solution on your own, without having their answer in front of you.

Final Projects: The final project will be a 8–15 page paper (single-column, single-spaced, ideally using the style template from the SIAM Journal on Numerical Analysis), reviewing some interesting numerical algorithm not covered in the course. [Since this is not a numerical PDE course, the algorithm should not be an algorithm for turning PDEs into finite/discretized systems; however, your project may take a PDE discretization as a given “black box” and look at some other aspect of the problem, e.g. iterative solvers.] Your paper should be written for an audience of your peers in the class, and should include example numerical results (by you) from application to a realistic problem (small-scale is fine), discussion of accuracy and performance characteristics (both theoretical and experimental), and a fair comparison to at least one competing algorithm for the same problem. Like any review paper, you should thoroughly reference the published literature (citing both original articles and authoritative reviews/books where appropriate [rarely web pages]), tracing the historical development of the ideas and giving the reader pointers on where to go for more information and related work and later refinements, with references cited throughout the text (enough to make it clear what references go with what results). (Note: you may re-use diagrams from other sources, but all such usage must be explicitly credited; not doing so is plagiarism.) Model your paper on academic review articles (e.g. read SIAM Review and similar journals for examples).

A good final project will include:

An extensive introduction and bibliography putting the algorithm in context. Where did it come from, and what motivated its development? Where is it commonly used (if anywhere)? What are the main competing algorithms? Were any variants of the algorithm proposed later? What do other authors say about it?
A clear description of the algorithm, with a summary of its derivation and key properties. Don’t copy long mathematical derivations or proofs from other sources, but do summarize the key ideas and results in the literature.
A convincing validation on a representative/quasi-realistic test problem (i.e. show that your code works), along with an informative comparison to important competing algorithms. For someone who is thinking about using the algorithm, you should strive to give them useful guidance on how the algorithm compares to competing algorithms: when/where should you consider using it (if ever)? Almost never rely on actual timing results — see below!

Frequently asked questions about the final project:

Does it have to be about numerical linear algebra? No. It can be any numerical topic (basically, anything where you are computing a conceptually real result, not integer computations), excluding algorithms for discretizing PDEs.
Can I use a matrix from a discretized PDE? Yes. You can take a matrix from the PDE as input and then talk about iterative methods to solve it, etcetera. I just don’t want the paper to be about the PDE discretization technique itself.
How formal is the proposal? Very informal—one page describing what you plan to do, with a couple of references that you are using as starting points. Basically, the proposal is just so that I can verify that what you are planning is reasonable and to give you some early feedback.
How much code do I need to write? A typical project (there may be exceptions) will include a working proof-of-concept implementation, e.g. in Julia or Python or Matlab, that you wrote to demonstrate that you understand how the algorithm works. Your code does not have to be competitive with “serious” implementations, and I encourage you to download and try out existing “serious” implementations (where available) for any large-scale testing and comparisons.
How should I do performance comparisons? Be very cautious about timing measurements: unless you are measuring highly optimized code or only care about orders of magnitude, timing measurements are more about implementation quality than algorithms. Better to measure something implementation-independent (like flop counts, or matrix-vector multiplies for iterative algorithms, or function evaluations for integrators/optimizers), even though such measures have their own weaknesses.

Lecture Summaries and Handouts

Lecture 1 (Feb 3)

pset 1 and accompanying notebook, due Friday Feb. 14.
Newton’s method for square roots and accompanying notebook.

Brief overview of the huge field of numerical methods, and outline of the small portion that this course will cover. Key new concerns in numerical analysis, which don’t appear in more abstract mathematics, are (i) performance (traditionally, arithmetic counts, but now memory access often dominates) and (ii) accuracy (both floating-point roundoff errors and also convergence of intrinsic approximations in the algorithms).

As a starting example, considered the convergence of Newton’s method (as applied to square roots); see the handout and Julia notebook above.

Further reading: Googling “Newton’s method” will find lots of references; as usual, the Wikipedia article on Newton’s method is a reasonable starting point. Beware that the terminology for the convergence order (linear, quadratic, etc.) is somewhat different in this context from the terminology for discretization schemes (first-order, second-order, etc.); see e.g. the linked Wikipedia article. Homer Reid’s notes on machine arithmetic for 18.330 are an excellent introduction that covers several applications and algorithms for root-finding. For numerical computation in 18.335, we will be using the Julia language: see this information on Julia at MIT.

Lecture 2 (Feb 5)

notes on floating-point (18.335 Fall 2007; also slides)
Julia floating-point notebook
some floating-point myths

New topic: Floating-point arithmetic

The basic issue is that, for computer arithmetic to be fast, it has to be done in hardware, operating on numbers stored in a fixed, finite number of digits (bits). As a consequence, only a finite subset of the real numbers can be represented, and the question becomes which subset to store, how arithmetic on this subset is defined, and how to analyze the errors compared to theoretical exact arithmetic on real numbers.

In floating-point arithmetic, we store both an integer coefficient and an exponent in some base: essentially, scientific notation. This allows large dynamic range and fixed relative accuracy: if fl(x) is the closest floating-point number to any real x, then |fl(x)-x| < ε|x| where ε is the machine precision. This makes error analysis much easier and makes algorithms mostly insensitive to overall scaling or units, but has the disadvantage that it requires specialized floating-point hardware to be fast. Nowadays, all general-purpose computers, and even many little computers like your cell phones, have floating-point units.

Overview of floating-point representations, focusing on the IEEE 754 standard (see also handout from previous lecture). The key point is that the nearest floating-point number to x, denoted fl(x), has the property of uniform relative precision (for |x| and 1/|x| < than some range, ≈10³⁰⁰ for double precision) that |fl(x)−_x_| ≤ εmachine|x|, where εmachine is the relative “machine precision” (about 10⁻¹⁶ for double precision). There are also a few special values: ±Inf (e.g. for overflow), NaN, and ±0 (e.g. for underflow).

Went through some simple examples in Julia (see notebook above), illustrating basic syntax and a few interesting tidbits. In particular, we looked at two examples of catastrophic cancellation and how it can sometimes be avoided by rearranging a calculation.

Further reading: Trefethen, lecture 13. What Every Computer Scientist Should Know About Floating Point Arithmetic (David Goldberg, ACM 1991). William Kahan, How Java’s floating-point hurts everyone everywhere (2004): contains a nice discussion of floating-point myths and misconceptions. A brief but useful summary can be found in this Julia-focused floating-point overview by Prof. John Gibson.

Lecture 3 (Feb 7)

notes on the accuracy and stability of floating-point summation

Summation, accuracy, and stability.

Further reading: See the further reading from the previous lecture. Trefethen, lectures 14, 15, and 3. See also the Wikipedia article on asymptotic (“big O”) notation; note that for expressions like O(ε) we are looking in the limit of small arguments rather than of large arguments (as in complexity theory), but otherwise the ideas are the same.

Julia tutorial (Feb 7: 5pm in 32-141) — optional

Handout: Julia cheat-sheet, Julia intro slides

On Friday, 7 February, at 5pm in 32-141, I will give an (attendance-optional!) Julia tutorial, introducing the Julia programming language and environment that we will use this term. Please see the tutorial notes online.

Please bring your laptops, and try to install Julia and the IJulia interface first via the abovementioned tutorial notes. Several people will be at the tutorial session to help answer installation questions. Alternatively, you can use Julia online at JuliaBox without installing anything (although running things on your own machine is usually faster).

Lecture 4 (Feb 10)

notes on the equivalence of norms

When quantifying errors, a central concept is a norm, and we saw in our proof of backwards stability of summation that the choice of norm seems important. Defined norms (as in lecture 3 of Trefethen): for a vector space V, a norm takes any v∈V and gives you a real number ‖v‖ satisfying three properties:

Positive definite: ‖v‖≥0, and =0 if and only if v=0
Scaling: ‖αv‖ = |α|⋅‖v‖ for any scalar α.
Triangle inequality: ‖v+w‖≤‖v‖+‖w‖

There are many norms, for many different vector spaces. Gave examples of norms of column vectors: Lₚ norms (usually p = 1, 2, or ∞) and weighted L₂ norms. A (complete) normed vector space is called a Banach space, and these error concepts generalize to f(x) when f and x are in any Banach spaces (including scalars, column vectors, matrices, …though infinite-dimensional Banach spaces are trickier).

Especially important in numerical analysis are functions where the inputs and/or outputs are matrices, and for these cases we need matrix norms. The most important matrix norms are those that are related to matrix operations. Started with the Frobenius norm. Related the Frobenius norm ‖A‖F to the square root of the sum of eigenvalues of A’A, which are called the singular values σ²; we will do much more on singular values later, but for now noted that they equal the squared eigenvalues of A if A’A (Hermitian). Also defined the induced matrix norm, and bounded it below by the maximum eigenvalue magnitude of A (if A is square). For the L₂ induced norm, related it (without proof for now) to the maximum singular value. A useful property of the induced norm is ‖AB‖≤‖A‖⋅‖B‖. Multiplication by a unitary matrix Q (Q' = Q⁻¹) preserves both the Frobenius and L₂ induced norms.

Equivalence of norms. Described fact that any two norms are equivalent up to a constant bound, and showed that this means that stability in one norm implies stability in all norms. Sketched proof (only skimmed this): (i) show we can reduce the problem to proving any norm is equivalent to e.g. L₁ on (ii) the unit circle; (iii) show any norm is continuous; and (ii) use a result from real analysis: a continuous function on a closed/bounded (compact) set achieves its maximum and minimum, the extreme value theorem. See notes handout.

Further reading: Trefethen, lecture 3. If you don’t immediately recognize that A’A has nonnegative real eigenvalues (it is positive semidefinite), now is a good time to review your linear algebra; see also Trefethen lecture 24.

Lecture 5 (Feb 12)

Relate backwards error to forwards error, and backwards stability to forwards error (or “accuracy” as the book calls it). Show that, in the limit of high precision, the forwards error can be bounded by the backwards error multiplied by a quantity κ, the relative condition number. The nice thing about κ is that it involves only exact linear algebra and calculus, and is completely separate from considerations of floating-point roundoff. Showed that, for differentiable functions, κ can be written in terms of the induced norm of the Jacobian matrix.

Calculated condition number for square root, summation, and matrix-vector multiplication, as well as solving Ax=b, similar to the book. Defined the condition number of a matrix: for f(x)=Ax, the condition number is ‖A‖⋅‖x‖/‖Ax‖, which is bounded above by κ(A)=‖A‖⋅‖A⁻¹‖.

Related matrix _L_2 norm to eigenvalues of B=A*A. B is obviously Hermitian (B*=B), and with a little more work showed that it is positive semidefinite: x*B_x_≥0 for any x. Reviewed and re-derived properties of eigenvalues and eigenvectors of Hermitian and positive-semidefinite matrices. Hermitian means that the eigenvalues are real, the eigenvectors are orthogonal (or can be chosen orthogonal). Also, a Hermitian matrix must be diagonalizable (I skipped the proof for this; we will prove it later in a more general setting). Positive semidefinite means that the eigenvalues are nonnegative.

Proved that, for a Hermitian matrix B, the Rayleigh quotient R(x)=x*Bx/x*x is bounded above and below by the largest and smallest eigenvalues of B (the “min–max theorem”). Hence showed that the L2 induced norm of A is the square root of the largest eigenvalue of B=A*A. Similarly, showed that the L₂ induced norm of A⁻¹, or more generally the supremum of ‖x‖/‖Ax‖, is equal to the square root of the inverse of the smallest eigenvalue of A*.A

Understanding norms and condition numbers of matrices therefore reduces to understanding the eigenvalues of A*A (or AA*). However, looking at it this way is unsatisfactory for several reasons. First, we would like to solve one eigenproblem, not two. Second, working with things like A*A explicitly is often bad numerically, because it squares the condition number [showed that κ(A*A)=κ(A)2] and hence exacerbates roundoff errors. Third, we would really like to get some better understanding of A itself. All of these concerns are addressed by the singular value decomposition or SVD, which we will derive next time.

Further reading: Trefethen, lectures 12, 14, 15, 24. See any linear-algebra textbook for a review of eigenvalue problems, especially Hermitian/real-symmetric ones. See also these notes from 18.06.

Lecture 6 (Feb 14)

pset 1 solutions and accompanying Julia notebook
pset 2, due Fri. Feb 28, 2020, at 3pm via Stellar.

Understanding norms and condition numbers of matrices therefore reduces to understanding the eigenvalues of A*A (or AA*). However, looking at it this way is unsatisfactory for several reasons. First, we would like to solve one eigenproblem, not two. Second, working with things like A*A explicitly is often bad numerically, because it squares the condition number [showed that κ(A*A)=κ(A)²] and hence exacerbates roundoff errors. Third, we would really like to get some better understanding of A itself. All of these concerns are addressed by the singular value decomposition or SVD.

Explicitly constructed SVD (both “thin” and thick/unitary) in terms of eigenvectors/eigenvalues of A*A and AA*. Recall from last time that we related the singular values to induced L₂ norm and condition number.

Talked a little about the SVD and low-rank approximations (more on this in homework), e.g. graphically illustrated via image compression, or principal component analysis (PCA), e.g. illustrated with this nice demo of human locomotion analysis.

Further reading: Trefethen, lectures 4, 5, 11.

Lecture 7 (Feb 18: Tuesday `==` MIT Monday)

Handouts: least-squares IJulia notebook

Introduced least-squares problems, gave example of polynomial fitting, gave normal equations, and derived them from the condition that the L2 error be minimized.

Discussed solution of normal equations. Discussed condition number of solving normal equations directly, and noted that it squares the condition number—not a good idea if we can avoid it.

Introduced the alternative of QR factorization (finding an orthonormal basis for the column space of the matrix). Explained why, if we can do it accurately, this will give a good way to solve least-squares problems.

Further reading: Trefethen, lectures 7, 8, 18, 19

Lecture 8 (Feb 19)

Per Persson’s 2006 18.335 Gram-Schmidt notes
Gram-Schmidt IJulia notebook

Gave the simple, but unstable, construction of the Gram-Schmidt algorithm, to find a QR factorization.

Discussed loss of orthogonality in classical Gram-Schmidt, using a simple example, especially in the case where the matrix has nearly dependent columns to begin with. Showed modified Gram-Schmidt and argued how it (mostly) fixes the problem. Numerical examples (see notebook).

Re-interpret Gram-Schmidt in matrix form as Q = AR1R2…, i.e. as multiplying A on the right by a sequence of upper-triangular matrices to get Q. The problem is that these matrices R may be very badly conditioned, leading to an inaccurate Q and loss of orthogonality. Instead of multiplying A on the right by R’s to get Q, however, we can instead multiply A on the left by Q’s to get R. This leads us to the Householder QR algorithm.

Further reading: Trefethen, lectures 7, 8, 18, 19. The Wikipedia Gram-Schmidt article is also nice. It turns out that modified GS is backwards stable in the sense that the product QR is close to A, i.e. the function f(A) = Q*R is backwards stable in MGS; this is why solving systems with Q,R (appropriately used as discussed in Trefethen lecture 19) is an accurate approximation to solving them with A. For a review of the literature on backwards-stability proofs of MGS, see e.g. this 2006 paper by Paige et al. [SIAM J. Matrix Anal. Appl. 28, pp. 264-284].

Lecture 9 (Feb 21)

Guest lecture by Prof. Alan Edelman: the Generalized SVD (GSVD). Least-square problems (via QR or SVD) and different viewpoints on linear regression: linear algebra, optimization, statistics, machine learning.

Further reading: Trefethen, lectures 7, 11. Edelman and Wang (2019), The GSVD: Where are the ellipses?.

Lecture 10 (Feb 24)

Householder QR notes from Per Persson.

Introduced Householder QR, emphasized the inherent stability properties of multiplying by a sequence of unitary matrices (as shown in pset 2). Show how we can convert a matrix to upper-triangular form (superficially similar to Gaussian elimination) via unitary Householder reflectors.

Finished Householder QR derivation, including the detail that one has a choice of Householder reflectors…we choose the sign to avoid taking differences of nearly-equal vectors. Gave flop count, showed that we don’t need to explicitly compute Q if we store the Householder reflector vectors.

Operation count for Gram-Schmidt (2mn²) and Householder (2mn² - 2n³/3), using the simple “graphical” estimation method from Trefethen. Evidently, Householder is at least as accurate as modified GS while being slightly faster. But does fewer operations really mean it is faster?

Further reading: Trefethen, lectures 7, 8, 10, 16.

Lecture 11 (Feb 26)

(Start by finishing operation counts from last lecture.)

performance experiments with matrix multiplication (one-page or full-size versions)
ideal-cache terminology

Counting arithmetic operation counts is no longer enough. Illustrate this with some performance experiments on a much simpler problem, matrix multiplication (see handouts). This leads us to analyze memory-access efficiency and caches and points the way to restructuring many algorithms.

Outline of the memory hierarchy: CPU, registers, L1/L2 cache, main memory, and presented simple 2-level ideal-cache model that we can analyze to get the basic ideas.

Analyzed cache complexity of simple row-column matrix multiply, showed that it asymptotically gets no benefit from the cache. Presented blocked algorithm, and showed that it achieves optimal Θ(n³/√Z) cache complexity.

Discussed some practical difficulties of the blocked matrix multiplication: algorithm depends on cache-size Z, and multi-level memories require multi-level blocking. Discussed how these ideas are applied to the design of modern linear-algebra libraries (LAPACK) by building them out of block operations (performed by an optimized BLAS).

Further reading: Wikipedia has a reasonable introduction to memory locality that you might find useful. The optimized matrix multiplication shown on the handouts is called ATLAS, and you can find out more about it on the ATLAS web page. Cache-oblivious algorithms, describing ideal cache model and analysis for various algorithms, by Frigo, Leiserson, Prokop, and Ramachandran (1999). Notes on the switch from LINPACK to LAPACK/BLAS in Matlab. The MIT course 6.172 has two lecture videos (first and second) on cache-efficient algorithms, including a discussion of matrix multiplication.

Lecture 12 (Feb 28)

experiments with cache-oblivious matrix-multiplication (handout or full-size slides)
IJulia matrix-multiplication notebook
pset 2 solutions and IJulia notebook
pset 3 (due Fri 13 March)

Introduced the concept of optimal cache-oblivious algorithms. Discussed cache-oblivious matrix multiplication in theory and in practice (see handout and Frigo et. al paper above).

Discussion of spatial locality and cache lines, with examples of dot products and matrix additions (both of which are “level 1 BLAS” operations with no temporal locality); you’ll do more on this in pset 3.

Further reading: Frigo et al. paper from previous lecture. ATLAS web page above. See Register Allocation in Kernel Generators (talk by M. Frigo, 2007) on the difficulty of optimizing for the last level of cache (the registers) in matrix multiplication (compared to FFTs), and why a simple cache-oblivious algorithm is no longer enough. See e.g. the Wikipedia article on row-major and column-major order.

Lecture 13 (March 2)

Finished cache and memory discussion from last lecture’s notebook.

Review of Gaussian elimination. Reviewed the fact (from 18.06) that this givs an A=LU factorization, and that we then solve Ax=b by solving Ly=b (doing the same steps to b that we did to A during elimination to get y) and then solving Ux=y (backsubstitution). Emphasized that you should almost never compute A⁻¹ explicitly. It is just as cheap to keep L and U around, since triangular solves are essentially the same cost as a matrix-vector multiplication. Computing A⁻¹ is usually a mistake: you can’t do anything with A⁻¹ that you couldn’t do with L and U, and you are wasting both computations and accuracy in computing A⁻¹. A⁻¹ is useful in abstract manipulations, but whenever you see “x=A⁻¹b” you should interpret it for computational purposes as solving Ax=b by LU or some other method.

Further reading: Trefethen, lectures 20–22.

Lecture 14 (March 4)

Introduced partial pivoting, and pointed out (omitting bookkeeping details) that this can be expressed as a PA=LU factorization where P is a permutation. Began to discuss backwards stability of LU, and mentioned example where U matrix grows exponentially fast with m to point out that the backwards stability result is practically useless here, and that the (indisputable) practicality of Gaussian elimination is more a result of the types of matrices that arise in practice.

Discussed Cholesky factorization, which is Gaussian elimation for the special case of Hermitian positive-definite matrices, where we can save a factor of two in time and memory. More generally, if the matrix A has a special form, one can sometimes take advantage of this to have a more efficient Ax=b solver, for example: Hermitian positive-definite (Cholesky), tridiagonal or banded (linear-time solvers), lower/upper triangular (forward/backsubstitution), sparse (mostly zero—sparse-direct and iterative solvers, to be discussed later; typically only worthwhile when the matrix is much bigger than 1000×1000).

Further reading: Trefethen, lectures 20–23. See all of the special cases of LAPACK’s linear-equation solvers.

Lecture 15 (March 6)

New topic: eigenproblems. Reviewed the usual formulation of eigenproblems and the characteristic polynomial, mentioned extensions to generalized eigenproblems and SVDs. Discussed diagonalization, defective matrices, and the generalization ot the Schur factorization.

Discussed diagonalization, defective matrices, and the generalization ot the Schur factorization. Proved (by induction) that every (square) matrix has a Schur factorization, and that for Hermitian matrices the Schur form is real and diagonal.

Pointed out that an “LU-like” algorithm for eigenproblems, which computes the exact eigenvalues/eigenvectors (in exact arithmetic, neglecting roundoff) in a finite number of steps involving addition, subtraction, multiplication, division, and roots, is impossible. The reason is that no such algorithm exists (or can ever exist) to find roots of polynomials with degree greater than 4, thanks to a theorem by Abel, Galois and others in the 19th century. Used the companion matrix to show that polynomial root finding is equivalent to the problem of finding eigenvalues. Mentioned the connection to other classic problems of antiquity (squaring the circle, trisecting an angle, doubling the cube), which were also proved impossible in the 19th century.

As a result, all eigenproblem methods must be iterative: they must consist of improving an initial guess, in successive steps, so that it converges towards the exact result to any desired accuracy, but never actually reaches the exact answer in general. A simple example of such a method is Newton’s method, which can be applied to iteratively approximate a root of any nonlinear function to any desired accuracy, given a sufficiently good initial guess.

However, finding roots of the characteristic polynomial is generally a terrible way to find eigenvalues. Actually computing the characteristic polynomial coefficients and then finding the roots somehow (Newton’s method?) is a disaster, incredibly ill-conditioned: gave the example of Wilkinson’s polynomial. If we can compute the characteristic polynomial values implicitly somehow, directly from the determinant, then it is not too bad (if you are looking only for eigenvalues in some known interval, for example), but we haven’t learned an efficient way to do that yet. The way LAPACK and Matlab actually compute eigenvalues, the QR method and its descendants, wasn’t discovered until 1960.

Further reading: Trefethen, lecture 24, 25. See this Wilkinson polynomial Julia notebook for some experiments with polynomial roots in Julia, as well as this more recent 18.06 notebook.

Lecture 16 (March 9)

The key to making most of the eigensolver algorithms efficient is reducing A to Hessenberg form: A=QHQ* where H is upper triangular plus one nonzero value below each diagonal. Unlike Schur form, Hessenberg factorization can be done exactly in a finite number of steps (in exact arithmetic), Θ(m³) steps to be precise. H and A are similar: they have the same eigenvalues, and the eigenvector are related by Q. And once we reduce to Hessenberg form, all the subsequent operations we might want to do (determinants, LU or QR factorization, etcetera), will be fast. In the case of Hermitian A, showed that H is Hermitian tridiagonal; in this case, most subsequent operations (even LU and QR factorization) will be Θ(m) (you will show this in HW)! (In fact, you can always arrange that H is a real tridiagonal matrix even if A is complex Hermitian.)

Reviewed power method for biggest-|λ| eigenvector/eigenvalue and its the convergence rate. To get the eigenvalue, we use the Rayleigh quotient of our eigenvector estimate. Showed that (related to the min–max theorem), for a Hermitian matrix the eigenvectors are all extrema of the Rayleigh quotient, and this means that the eigenvalue estimates converge at twice the rate (i.e. squared error) of the eigenvectors.

Further reading: Trefethen, lecture 25, 26, and and Per Persson’s 2006 notes on Hessenberg factorization. The Julia LinearAlgebra provides functions schur, eigen, and hessenberg for the Schur, eigenvector, and Hessenberg factorizations respectively. (For a large real-symmetric matrix, Hessenberg factorization is about 5× faster than diagonalization, but is only about 40% faster than finding the eigenvalues and not eigenvectors.)

Lecture 17 (March 11)

Discussed how to use the power method to get multiple eigenvalues/vectors of Hermitian matrices by “deflation” (using orthogonality of eigenvectors). Discussed how, in principle, QR factorization of Aⁿ for large n will give the eigenvectors and eigenvalues in descending order of magnitude, but how this is killed by roundoff errors.

Unshifted QR method: proved that repeatedly forming A=QR, then replacing A with RQ (as in pset 3) is equivalent (in exact arithmetic) to QR factorizing Aⁿ. But since we do this while only multiplying repeatedly by unitary matrices, it is well conditioned and we get the eigenvalues accurately.

To make the QR method faster, we first reduce to Hessenberg form; you will show in pset 3 that this is especially fast when A is Hermitian and the Hessenberg form is tridiagonal. Second, we use shifts. In particular, the worst case for the QR method, just as for the power method, is when eigenvalues are nearly equal. We can fix this by shifting.

Further reading: See Trefethen, lectures 27–30, and Per Persson’s 2006 notes on power/inverse/Rayleigh iteration and on QR (part 1 and part 2).

Lecture 18 (March 13)

New topic: iterative algorithms, usually for sparse matrices, and in general for matrices where you have a fast way to compute Ax matrix-vector products but cannot (practically) mess around with the specific entries of A.

Further reading: Trefethen, lecture 31, 32, 33, 34. The online books, Templates for the Solution of Linear Systems (Barrett et al.) and Templates for the Solution of Algebraic Eigenvalue Problems, are useful surveys of iterative methods.

mitmath/18335

18.335J/6.337J: Introduction to Numerical Methods

Syllabus

Lecture Summaries and Handouts

Lecture 1 (Feb 3)

Lecture 2 (Feb 5)

Lecture 3 (Feb 7)

Julia tutorial (Feb 7: 5pm in 32-141) — optional

Lecture 4 (Feb 10)

Lecture 5 (Feb 12)

Lecture 6 (Feb 14)

Lecture 7 (Feb 18: Tuesday == MIT Monday)

Lecture 8 (Feb 19)

Lecture 9 (Feb 21)

Lecture 10 (Feb 24)

Lecture 11 (Feb 26)

Lecture 12 (Feb 28)

Lecture 13 (March 2)

Lecture 14 (March 4)

Lecture 15 (March 6)

Lecture 16 (March 9)

Lecture 17 (March 11)

Lecture 18 (March 13)

hse-aml/intro-to-dl

fastai/course-nlp

mitmath/1806

MLIME/MAC0460

CamDavidsonPilon/Probabilistic-Programming-and-Bayesian-Methods-for-Hackers

Yorko/dl_in_nlp_deeppavlov_cs224n_spring2020

mitmath/18330

qiaoxu123/Self-Driving-Cars

vinsis/math-and-ml-notes

Lecture 7 (Feb 18: Tuesday `==` MIT Monday)