Reconsider a system of linear equations
$$\mathbf{y} = \mathbf{A}\mathbf{x} + \mathbf{e}.$$
Here, the vector $\mathbf{y} \in \mathbb{R}^m$ and the matrix $\mathbf{A} \in \mathbb{R}^{m \times n}$ are known, and we want to computationally recover the vector $\mathbf{x} \in \mathbb{R}^n$. The unknown vector $\mathbf{e} \in \mathbb{R}^m$ models noise.
If we know nothing about $\mathbf{x}$ beyond the measurements $\mathbf{y}$, then it makes sense to seek $\mathbf{x}$ that predicts $\mathbf{y}$ as closely as possible via $\mathbf{A}\mathbf{x}$. Note, however, that this approach only makes sense when $m \ge n$, i.e., when there are at least as many measurements as unknowns; otherwise, we end up with infinitely many potential vectors $\mathbf{x}$ that perfectly match the given measurements $\mathbf{y}$.
So, how can one enforce the closeness of $\mathbf{A}\mathbf{x}$ to the measurements $\mathbf{y}$?
The simplest, and most popular, approach is to find $\mathbf{x}$ that minimizes the least-squares function
$$f(\mathbf{x}) = \frac{1}{2}\|\mathbf{y} - \mathbf{A}\mathbf{x}\|_2^2.$$
Given the least-squares cost function, how would we actually minimize it?
One approach that conveniently circumvents the need to physically store and invert the matrix $\mathbf{A}^\mathsf{T}\mathbf{A}$ is the gradient descent method. To apply it, we first find the expression for the gradient of $f$ as
$$\nabla f(\mathbf{x}) = \mathbf{A}^\mathsf{T}(\mathbf{A}\mathbf{x} - \mathbf{y}).$$
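As a quick sanity check, this gradient expression can be compared against a central finite-difference approximation of $f$. A minimal numpy sketch (the problem sizes and random seed are arbitrary choices for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 8, 5                          # arbitrary problem sizes
A = rng.standard_normal((m, n))
y = rng.standard_normal(m)
x = rng.standard_normal(n)

def f(v):
    """Least-squares cost f(v) = 0.5 * ||y - A v||^2."""
    r = y - A @ v
    return 0.5 * r @ r

# Closed-form gradient: grad f(x) = A^T (A x - y).
grad = A.T @ (A @ x - y)

# Central finite differences, one coordinate at a time.
eps = 1e-6
fd = np.zeros(n)
for i in range(n):
    e = np.zeros(n)
    e[i] = eps
    fd[i] = (f(x + e) - f(x - e)) / (2 * eps)

print(np.max(np.abs(grad - fd)))     # tiny: the two gradients agree
```

Since $f$ is quadratic, the central difference is exact up to roundoff, so the discrepancy should sit near machine precision.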
Then, given a guess of the solution $\mathbf{x}^{t-1}$, we can update it by taking a step in the direction of the negative gradient
$$\mathbf{x}^{t} = \mathbf{x}^{t-1} - \gamma \nabla f(\mathbf{x}^{t-1}),$$
where the parameter $\gamma > 0$ controls the size of the step. The process above is repeated many times, for every $t = 1, 2, 3, \dots$, starting from some initial guess $\mathbf{x}^0$.
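In code, the entire method is only a few lines. Below is a minimal numpy sketch on a random instance; the sizes, seed, iteration budget, and the step-size choice $\gamma = 1/\lambda_{\max}(\mathbf{A}^\mathsf{T}\mathbf{A})$ are illustrative choices, not prescribed by the text:

```python
import numpy as np

rng = np.random.default_rng(1)
m, n = 50, 20                        # arbitrary sizes with m >= n
A = rng.standard_normal((m, n))
y = rng.standard_normal(m)

# Step size gamma = 1/L, with L the largest eigenvalue of A^T A.
L = np.linalg.eigvalsh(A.T @ A).max()
gamma = 1.0 / L

x = np.zeros(n)                      # initial guess x^0
for t in range(500):                 # fixed iteration budget
    x = x - gamma * A.T @ (A @ x - y)   # x^t = x^{t-1} - gamma * grad f(x^{t-1})

# Reference solution from a direct least-squares solver.
x_star, *_ = np.linalg.lstsq(A, y, rcond=None)
print(np.linalg.norm(x - x_star))    # tiny: the iterates converged
```

Here `numpy.linalg.lstsq` merely provides a reference answer for comparison; on large problems, the point of gradient descent is precisely that it only needs matrix-vector products with $\mathbf{A}$ and $\mathbf{A}^\mathsf{T}$.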
How can we be sure that this gradient-descent iteration actually minimizes $f$? Can we comment on the number of iterations required to get to the bottom of $f$?
To start analyzing the gradient descent method, we first write the Taylor series expansion of $f$ at an arbitrary location $\mathbf{x}$
$$f(\mathbf{z}) = f(\mathbf{x}) + \nabla f(\mathbf{x})^\mathsf{T}(\mathbf{z} - \mathbf{x}) + \frac{1}{2}(\mathbf{z} - \mathbf{x})^\mathsf{T}\mathbf{H}(\mathbf{z} - \mathbf{x}),$$
where we used $\mathbf{H} = \nabla^2 f(\mathbf{x})$ to denote the Hessian, which in our case reduces to $\mathbf{H} = \mathbf{A}^\mathsf{T}\mathbf{A}$. Note that the second-order Taylor expansion is sufficient, and indeed exact, since $f$ is a quadratic function. Then, we can obtain two bounds on $f(\mathbf{z})$ for any $\mathbf{z}$
$$f(\mathbf{x}) + \nabla f(\mathbf{x})^\mathsf{T}(\mathbf{z} - \mathbf{x}) \;\le\; f(\mathbf{z}) \;\le\; f(\mathbf{x}) + \nabla f(\mathbf{x})^\mathsf{T}(\mathbf{z} - \mathbf{x}) + \frac{L}{2}\|\mathbf{z} - \mathbf{x}\|_2^2,$$
where $L = \lambda_{\max}(\mathbf{A}^\mathsf{T}\mathbf{A})$ is the largest eigenvalue of the Hessian; the lower bound holds because $\mathbf{A}^\mathsf{T}\mathbf{A}$ is positive semidefinite.
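Both bounds are easy to probe numerically at random pairs of points. A sketch (sizes and seed arbitrary) that records the worst observed violation of each bound:

```python
import numpy as np

rng = np.random.default_rng(2)
m, n = 30, 10
A = rng.standard_normal((m, n))
y = rng.standard_normal(m)

def f(v):
    r = y - A @ v
    return 0.5 * r @ r

L = np.linalg.eigvalsh(A.T @ A).max()    # largest eigenvalue of the Hessian

viol_lower = 0.0    # worst observed violation of the lower bound
viol_upper = 0.0    # worst observed violation of the upper bound
for _ in range(100):
    x = rng.standard_normal(n)
    z = rng.standard_normal(n)
    g = A.T @ (A @ x - y)                # gradient at x
    linear = f(x) + g @ (z - x)          # first-order (tangent) model
    upper = linear + 0.5 * L * np.sum((z - x) ** 2)
    viol_lower = max(viol_lower, linear - f(z))
    viol_upper = max(viol_upper, f(z) - upper)

print(viol_lower, viol_upper)            # both stay at 0.0: no violations
```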
We now evaluate $f$ at $\mathbf{z} = \mathbf{x} - \gamma \nabla f(\mathbf{x})$, and use the upper bound to obtain
$$f(\mathbf{x} - \gamma \nabla f(\mathbf{x})) \le f(\mathbf{x}) - \gamma\left(1 - \frac{L\gamma}{2}\right)\|\nabla f(\mathbf{x})\|_2^2,$$
which for $0 < \gamma \le 1/L$ reduces to
$$f(\mathbf{x} - \gamma \nabla f(\mathbf{x})) \le f(\mathbf{x}) - \frac{\gamma}{2}\|\nabla f(\mathbf{x})\|_2^2.$$
This inequality clearly indicates that gradient descent is indeed a descent method, i.e., at each step it reduces $f$ by at least $\frac{\gamma}{2}\|\nabla f(\mathbf{x})\|_2^2$, a quantity proportional to the product of the step size and the squared norm of the gradient.
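The guaranteed per-step decrease can also be checked along an actual run. A sketch (random instance, arbitrary iteration count) that records the worst slack in the descent inequality:

```python
import numpy as np

rng = np.random.default_rng(3)
m, n = 40, 15
A = rng.standard_normal((m, n))
y = rng.standard_normal(m)

def f(v):
    r = y - A @ v
    return 0.5 * r @ r

L = np.linalg.eigvalsh(A.T @ A).max()
gamma = 1.0 / L

x = rng.standard_normal(n)
worst_slack = np.inf
for _ in range(200):
    g = A.T @ (A @ x - y)
    x_next = x - gamma * g
    # guaranteed decrease is at least (gamma/2) * ||g||^2
    slack = (f(x) - 0.5 * gamma * (g @ g)) - f(x_next)
    worst_slack = min(worst_slack, slack)
    x = x_next

print(worst_slack)                   # nonnegative up to roundoff
```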
Now, by plugging the lower bound inequality into the previous inequality, we obtain for any $\mathbf{z}$
$$f(\mathbf{x} - \gamma \nabla f(\mathbf{x})) \le f(\mathbf{z}) + \nabla f(\mathbf{x})^\mathsf{T}(\mathbf{x} - \mathbf{z}) - \frac{\gamma}{2}\|\nabla f(\mathbf{x})\|_2^2.$$
Thus, by evaluating this inequality at $\mathbf{z} = \mathbf{x}^\ast$, where $\mathbf{x}^\ast$ is a minimizer of $f$, and completing the square, we obtain
$$f(\mathbf{x} - \gamma \nabla f(\mathbf{x})) - f(\mathbf{x}^\ast) \le \frac{1}{2\gamma}\left(\|\mathbf{x} - \mathbf{x}^\ast\|_2^2 - \|\mathbf{x} - \gamma \nabla f(\mathbf{x}) - \mathbf{x}^\ast\|_2^2\right).$$
Note that since $f(\mathbf{x} - \gamma \nabla f(\mathbf{x})) \ge f(\mathbf{x}^\ast)$, we have that $\|\mathbf{x} - \gamma \nabla f(\mathbf{x}) - \mathbf{x}^\ast\|_2 \le \|\mathbf{x} - \mathbf{x}^\ast\|_2$, i.e., the distance to a minimizer never increases.
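This monotonicity is easy to observe empirically. A sketch (random instance, arbitrary horizon) that counts the steps on which the distance to the minimizer grew:

```python
import numpy as np

rng = np.random.default_rng(4)
m, n = 40, 15
A = rng.standard_normal((m, n))
y = rng.standard_normal(m)

L = np.linalg.eigvalsh(A.T @ A).max()
gamma = 1.0 / L
x_star, *_ = np.linalg.lstsq(A, y, rcond=None)   # a minimizer of f

x = rng.standard_normal(n)
dists = []
for _ in range(300):
    dists.append(np.linalg.norm(x - x_star))
    x = x - gamma * A.T @ (A @ x - y)
dists.append(np.linalg.norm(x - x_star))

# Count steps on which the distance to x* grew (allowing for roundoff).
increases = sum(d1 > d0 * (1 + 1e-10) for d0, d1 in zip(dists, dists[1:]))
print(increases)                     # 0: the distance never increases
```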
Finally, by setting $\mathbf{x} = \mathbf{x}^{t-1}$, $\mathbf{x}^{t} = \mathbf{x}^{t-1} - \gamma \nabla f(\mathbf{x}^{t-1})$, and $\mathbf{z} = \mathbf{x}^\ast$, and summing over $t = 1, \dots, T$, we obtain the telescoping sum
$$\sum_{t=1}^{T}\left(f(\mathbf{x}^{t}) - f(\mathbf{x}^\ast)\right) \le \frac{1}{2\gamma}\left(\|\mathbf{x}^{0} - \mathbf{x}^\ast\|_2^2 - \|\mathbf{x}^{T} - \mathbf{x}^\ast\|_2^2\right) \le \frac{\|\mathbf{x}^{0} - \mathbf{x}^\ast\|_2^2}{2\gamma},$$
and since the sequence $\{f(\mathbf{x}^{t})\}_{t \ge 1}$ is nonincreasing,
$$f(\mathbf{x}^{T}) - f(\mathbf{x}^\ast) \le \frac{1}{T}\sum_{t=1}^{T}\left(f(\mathbf{x}^{t}) - f(\mathbf{x}^\ast)\right) \le \frac{\|\mathbf{x}^{0} - \mathbf{x}^\ast\|_2^2}{2\gamma T}.$$
This implies that gradient descent reaches an $\varepsilon$-accurate solution, $f(\mathbf{x}^{T}) - f(\mathbf{x}^\ast) \le \varepsilon$, after $T = \mathcal{O}(1/\varepsilon)$ iterations, where it suffices to take
$$T \ge \frac{\|\mathbf{x}^{0} - \mathbf{x}^\ast\|_2^2}{2\gamma \varepsilon}.$$
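The $1/T$ rate can be checked against an actual run by comparing the suboptimality at every iteration with the bound. A sketch (random instance, arbitrary horizon; `lstsq` supplies the minimizer):

```python
import numpy as np

rng = np.random.default_rng(5)
m, n = 40, 15
A = rng.standard_normal((m, n))
y = rng.standard_normal(m)

def f(v):
    r = y - A @ v
    return 0.5 * r @ r

L = np.linalg.eigvalsh(A.T @ A).max()
gamma = 1.0 / L
x_star, *_ = np.linalg.lstsq(A, y, rcond=None)
f_star = f(x_star)

x0 = rng.standard_normal(n)
x = x0.copy()
r0 = np.sum((x0 - x_star) ** 2)          # ||x^0 - x*||^2
bound_holds = True
for T in range(1, 201):
    x = x - gamma * A.T @ (A @ x - y)
    bound = r0 / (2 * gamma * T)         # the 1/T suboptimality bound
    bound_holds = bound_holds and (f(x) - f_star <= bound + 1e-9)

print(bound_holds)                       # True: the bound holds throughout
```

In practice the bound is quite loose for well-conditioned problems, where the iterates converge linearly rather than at the worst-case $1/T$ rate.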
 S. Boyd and L. Vandenberghe, Convex Optimization, Cambridge Univ. Press, 2004.
 A. Beck and M. Teboulle, “A fast iterative shrinkage-thresholding algorithm for linear inverse problems,” SIAM J. Imaging Sciences, vol. 2, no. 1, pp. 183–202, 2009.