Reconsider a system of linear equations

$$y = Ax + e.$$

Here, the vector $y \in \mathbb{R}^m$ and the matrix $A \in \mathbb{R}^{m \times n}$ are known, and we want to computationally recover the vector $x \in \mathbb{R}^n$. The unknown vector $e \in \mathbb{R}^m$ models noise.

If we know nothing about $x$ beyond the measurements $y$, then it makes sense to seek $x$ that predicts $y$ as closely as possible via $Ax$. Note, however, that this approach only makes sense when $m \geq n$, *i.e.*, when there are at least as many measurements as unknowns; otherwise, we end up with infinitely many potential vectors $x$ that can perfectly match the given measurements $y$.

So, how can one enforce the closeness of $Ax$ to the measurements $y$?

The simplest, and most popular, approach is to find $x$ that minimizes the least-squares function

$$f(x) = \frac{1}{2}\|y - Ax\|_2^2.$$

Given the least-squares cost function, how would we actually minimize it?

One approach that conveniently circumvents the need to physically store and invert the matrix $A^\mathsf{T} A$ is the gradient descent method. To apply it, we first find the expression for the gradient of $f$ as

$$\nabla f(x) = A^\mathsf{T}(Ax - y).$$

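As a quick sanity check on this gradient expression, here is a small numerical sketch (the random $A$, $y$, and the function names are illustrative assumptions, not part of the text) that compares the analytic gradient against a central finite-difference approximation:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((20, 5))  # illustrative measurement matrix
y = rng.standard_normal(20)       # illustrative measurements

def f(x):
    # Least-squares cost f(x) = 0.5 * ||y - A x||^2
    return 0.5 * np.linalg.norm(y - A @ x) ** 2

def grad_f(x):
    # Analytic gradient: A^T (A x - y)
    return A.T @ (A @ x - y)

# Compare against a central finite-difference approximation of the gradient
x = rng.standard_normal(5)
eps = 1e-6
g_fd = np.array([
    (f(x + eps * e) - f(x - eps * e)) / (2 * eps)
    for e in np.eye(5)
])
assert np.allclose(g_fd, grad_f(x), atol=1e-4)
```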
Then, given a guess of the solution $x^k$, we can update it by taking a step in the direction of the negative gradient

$$x^{k+1} = x^k - \gamma \nabla f(x^k),$$

where the parameter $\gamma > 0$ controls the size of the step. The process above is repeated many times, for every $k = 0, 1, 2, \dots$, starting from some initial guess $x^0$.
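The iteration above can be sketched in a few lines of NumPy; the problem data, the iteration count, and the choice $\gamma = 1/L$ below are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((30, 5))   # illustrative measurement matrix (m >= n)
y = rng.standard_normal(30)        # illustrative measurements

L = np.linalg.eigvalsh(A.T @ A).max()  # largest eigenvalue of A^T A
gamma = 1.0 / L                        # step size gamma <= 1/L

x = np.zeros(5)  # initial guess x^0
for _ in range(500):
    # x^{k+1} = x^k - gamma * grad f(x^k)
    x = x - gamma * A.T @ (A @ x - y)

# Compare against the closed-form least-squares solution
x_star, *_ = np.linalg.lstsq(A, y, rcond=None)
print(np.linalg.norm(x - x_star))
```

Note that the loop never forms or inverts $A^\mathsf{T} A$; it only needs matrix-vector products with $A$ and $A^\mathsf{T}$.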

How can we be sure that this gradient-descent iteration actually minimizes $f$? Can we comment on the number of iterations required to get to the bottom of $f$?

To start analyzing the gradient descent method, we first write the Taylor series expansion of $f$ at an arbitrary location $z$

$$f(x) = f(z) + \nabla f(z)^\mathsf{T}(x - z) + \frac{1}{2}(x - z)^\mathsf{T} \mathbf{H} (x - z),$$

where we used $\mathbf{H} = \nabla^2 f$ to denote the Hessian, which in our case reduces to $\mathbf{H} = A^\mathsf{T} A$. Note that the second-order Taylor expansion is exact since $f$ is a quadratic function. Then, we can obtain two bounds on $f$ for any $x$ and $z$

$$f(x) \geq f(z) + \nabla f(z)^\mathsf{T}(x - z)$$

and

$$f(x) \leq f(z) + \nabla f(z)^\mathsf{T}(x - z) + \frac{L}{2}\|x - z\|_2^2,$$

where $L = \lambda_{\max}(A^\mathsf{T} A)$, with the notation $\lambda_{\max}(\cdot)$ denoting the largest eigenvalue of the matrix. The lower bound reflects the convexity of $f$, while the upper bound reflects the Lipschitz continuity of the gradient $\nabla f$.
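These two bounds are easy to probe numerically; the sketch below (random data and names are illustrative assumptions) checks both at random pairs of points:

```python
import numpy as np

rng = np.random.default_rng(2)
A = rng.standard_normal((30, 5))  # illustrative problem data
y = rng.standard_normal(30)

f = lambda x: 0.5 * np.linalg.norm(y - A @ x) ** 2
grad = lambda x: A.T @ (A @ x - y)
L = np.linalg.eigvalsh(A.T @ A).max()  # largest eigenvalue of the Hessian

# Check the convexity lower bound and the Lipschitz upper bound
for _ in range(100):
    x, z = rng.standard_normal(5), rng.standard_normal(5)
    linear = f(z) + grad(z) @ (x - z)
    assert f(x) >= linear - 1e-9                                         # lower bound
    assert f(x) <= linear + 0.5 * L * np.linalg.norm(x - z) ** 2 + 1e-9  # upper bound
```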

We now evaluate the upper bound at $x = x^{k+1} = x^k - \gamma \nabla f(x^k)$ and $z = x^k$ to obtain

$$f(x^{k+1}) \leq f(x^k) - \gamma\left(1 - \frac{\gamma L}{2}\right)\|\nabla f(x^k)\|_2^2,$$

which for $0 < \gamma \leq 1/L$ reduces to

$$f(x^{k+1}) \leq f(x^k) - \frac{\gamma}{2}\|\nabla f(x^k)\|_2^2.$$

This inequality clearly indicates that gradient descent is indeed a descent method, *i.e.*, at each step it reduces our cost $f$ by an amount proportional to the product of the step size and the squared norm of the gradient.
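The descent inequality can be verified along an actual run of the iteration; this sketch assumes illustrative random problem data and the step size $\gamma = 1/L$:

```python
import numpy as np

rng = np.random.default_rng(3)
A = rng.standard_normal((30, 5))  # illustrative problem data
y = rng.standard_normal(30)

f = lambda x: 0.5 * np.linalg.norm(y - A @ x) ** 2
grad = lambda x: A.T @ (A @ x - y)
gamma = 1.0 / np.linalg.eigvalsh(A.T @ A).max()

# Check f(x^{k+1}) <= f(x^k) - (gamma/2) * ||grad f(x^k)||^2 at every step
x = rng.standard_normal(5)
for _ in range(50):
    g = grad(x)
    x_next = x - gamma * g
    assert f(x_next) <= f(x) - 0.5 * gamma * g @ g + 1e-9
    x = x_next
```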

Now, by plugging the lower bound inequality into the previous inequality, we obtain for any $z$

$$f(x^{k+1}) \leq f(z) + \nabla f(x^k)^\mathsf{T}(x^k - z) - \frac{\gamma}{2}\|\nabla f(x^k)\|_2^2.$$

Thus, by evaluating this inequality at $z = x^*$, where $x^*$ is the minimizer of $f$, and expanding $\|x^{k+1} - x^*\|_2^2 = \|x^k - \gamma \nabla f(x^k) - x^*\|_2^2$, we obtain

$$f(x^{k+1}) - f(x^*) \leq \frac{1}{2\gamma}\left(\|x^k - x^*\|_2^2 - \|x^{k+1} - x^*\|_2^2\right).$$

Note that since $f(x^{k+1}) \geq f(x^*)$, we have that $\|x^{k+1} - x^*\|_2 \leq \|x^k - x^*\|_2$, *i.e.*, the distance to a minimizer decreases.
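This monotonicity is also easy to observe numerically; a short sketch with illustrative random data, using the closed-form least-squares solution as $x^*$:

```python
import numpy as np

rng = np.random.default_rng(4)
A = rng.standard_normal((30, 5))  # illustrative problem data
y = rng.standard_normal(30)
gamma = 1.0 / np.linalg.eigvalsh(A.T @ A).max()

x_star, *_ = np.linalg.lstsq(A, y, rcond=None)  # minimizer of f

# Check that ||x^k - x*|| never increases along the iterations
x = rng.standard_normal(5)
dists = []
for _ in range(100):
    dists.append(np.linalg.norm(x - x_star))
    x = x - gamma * A.T @ (A @ x - y)
assert all(d_next <= d_prev + 1e-12
           for d_prev, d_next in zip(dists, dists[1:]))
```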

Finally, by summing this inequality over iterations $t = 0, 1, \dots, k-1$, and telescoping the right-hand side, we obtain

$$\sum_{t=1}^{k}\left(f(x^t) - f(x^*)\right) \leq \frac{1}{2\gamma}\|x^0 - x^*\|_2^2,$$

and since the sequence $\{f(x^t)\}$ is nonincreasing,

$$f(x^k) - f(x^*) \leq \frac{1}{k}\sum_{t=1}^{k}\left(f(x^t) - f(x^*)\right) \leq \frac{\|x^0 - x^*\|_2^2}{2\gamma k}.$$

This implies that gradient descent reaches an accuracy of $f(x^k) - f(x^*) \leq \epsilon$ after $k = \mathcal{O}(1/\epsilon)$ iterations, where it suffices to take

$$k \geq \frac{\|x^0 - x^*\|_2^2}{2\gamma\epsilon}.$$

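The $\mathcal{O}(1/k)$ bound can be checked along a run of the iteration; the sketch below uses illustrative random data and the closed-form least-squares solution as $x^*$:

```python
import numpy as np

rng = np.random.default_rng(5)
A = rng.standard_normal((30, 5))  # illustrative problem data
y = rng.standard_normal(30)

f = lambda x: 0.5 * np.linalg.norm(y - A @ x) ** 2
gamma = 1.0 / np.linalg.eigvalsh(A.T @ A).max()

x_star, *_ = np.linalg.lstsq(A, y, rcond=None)
f_star = f(x_star)

x0 = rng.standard_normal(5)
bound_const = np.linalg.norm(x0 - x_star) ** 2 / (2 * gamma)

# Check f(x^k) - f* <= ||x^0 - x*||^2 / (2 gamma k) for every k
x = x0
for k in range(1, 200):
    x = x - gamma * A.T @ (A @ x - y)
    assert f(x) - f_star <= bound_const / k + 1e-9
```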
