# Descending the least-squares

Reconsider a system of linear equations

$$\mathbf{y} = \mathbf{H}\mathbf{x} + \mathbf{e}.$$

Here, the vector $\mathbf{y} \in \mathbb{R}^M$ and matrix $\mathbf{H} \in \mathbb{R}^{M \times N}$ are known, and we want to computationally recover the vector $\mathbf{x} \in \mathbb{R}^N$. The unknown vector $\mathbf{e} \in \mathbb{R}^M$ models noise.

If we know nothing about $\mathbf{x}$ beyond the measurements $\mathbf{y}$, then it makes sense to seek an $\mathbf{x}$ that predicts $\mathbf{y}$ as closely as possible via $\mathbf{H}$. Note, however, that this approach only makes sense when $M \geq N$, i.e., when there are at least as many measurements as unknowns; otherwise, there are infinitely many vectors $\mathbf{x} \in \mathbb{R}^N$ that perfectly match the given measurements $\mathbf{y}$.

So, how can one enforce the closeness of $\mathbf{x}$ to the measurements $\mathbf{y}$?

The simplest, and most popular, approach is to find $\mathbf{x} \in \mathbb{R}^N$ that minimizes the least-squares cost function

$$\mathcal{D}(\mathbf{x}) \triangleq \frac{1}{2}\|\mathbf{y} - \mathbf{H}\mathbf{x}\|_{\ell_2}^2.$$

Given the least-squares cost function, how would we actually minimize it?

One approach that conveniently circumvents the need to explicitly store and invert the matrix $\mathbf{H}$ is the gradient descent method. To apply it, we first find the expression for the gradient of $\mathcal{D}$ as

$$\nabla\mathcal{D}(\mathbf{x}) = \mathbf{H}^\mathrm{T}(\mathbf{H}\mathbf{x} - \mathbf{y}).$$

Then, given a guess of the solution $\mathbf{x}^{t-1}$, we can update it by taking a step in the direction of the negative gradient

$$\mathbf{x}^{t} = \mathbf{x}^{t-1} - \gamma \nabla\mathcal{D}(\mathbf{x}^{t-1}),$$

where the parameter $\gamma > 0$ controls the size of the step. The process above is repeated many times, for every $t = 1, 2, 3, \dots$, starting from some initial guess $\mathbf{x}^0$.
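As a concrete illustration, the iteration above can be sketched in a few lines of NumPy. This is a minimal sketch: the function name `least_squares_gd`, the problem sizes, and the noise level are illustrative assumptions, not part of the derivation.

```python
import numpy as np

def least_squares_gd(y, H, gamma, num_iters=500, x0=None):
    """Minimize D(x) = 0.5 * ||y - H x||^2 by gradient descent."""
    x = np.zeros(H.shape[1]) if x0 is None else x0.copy()
    for _ in range(num_iters):
        grad = H.T @ (H @ x - y)  # gradient of D evaluated at the current x
        x = x - gamma * grad      # step of size gamma along the negative gradient
    return x

# Toy example (sizes and noise level chosen arbitrarily): y = H x + e with M >= N.
rng = np.random.default_rng(0)
M, N = 50, 10
H = rng.standard_normal((M, N))
x_true = rng.standard_normal(N)
y = H @ x_true + 0.01 * rng.standard_normal(M)

L = np.linalg.eigvalsh(H.T @ H).max()  # largest eigenvalue of H^T H
x_hat = least_squares_gd(y, H, gamma=1.0 / L, num_iters=2000)
print(np.linalg.norm(x_hat - x_true))  # small, since the noise is small
```

Note that the iteration only ever applies $\mathbf{H}$ and $\mathbf{H}^\mathrm{T}$ to vectors, which is what makes the method attractive when $\mathbf{H}$ is large or is only available as an operator.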

How can we be sure that this gradient-descent iteration actually minimizes $\mathcal{D}$? Can we comment on the number of iterations required to get to the bottom of $\mathcal{D}$?

To start analyzing the gradient descent method, we first write the Taylor series expansion of $\mathcal{D}(\mathbf{x})$ at an arbitrary location $\mathbf{z} \in \mathbb{R}^N$

$$\mathcal{D}(\mathbf{x}) = \mathcal{D}(\mathbf{z}) + \nabla\mathcal{D}(\mathbf{z})^\mathrm{T}(\mathbf{x}-\mathbf{z}) + \frac{1}{2}(\mathbf{x}-\mathbf{z})^\mathrm{T}\nabla^2\mathcal{D}(\mathbf{z})(\mathbf{x}-\mathbf{z}),$$

where we used $\nabla^2$ to denote the Hessian, which in our case reduces to $\nabla^2\mathcal{D}(\mathbf{z}) = \mathbf{H}^\mathrm{T}\mathbf{H}$. Note that the second-order Taylor expansion is exact since $\mathcal{D}$ is a quadratic function. Then, we can obtain two bounds on $\mathcal{D}$ for any $\mathbf{x},\mathbf{z} \in \mathbb{R}^N$

$$\mathcal{D}(\mathbf{x}) \geq \mathcal{D}(\mathbf{z}) + \nabla\mathcal{D}(\mathbf{z})^\mathrm{T}(\mathbf{x}-\mathbf{z})$$

and

$$\mathcal{D}(\mathbf{x}) \leq \mathcal{D}(\mathbf{z}) + \nabla\mathcal{D}(\mathbf{z})^\mathrm{T}(\mathbf{x}-\mathbf{z}) + \frac{L}{2}\|\mathbf{x}-\mathbf{z}\|_{\ell_2}^2,$$

where $L \triangleq \lambda_{\text{max}}(\mathbf{H}^\mathrm{T}\mathbf{H})$ denotes the largest eigenvalue of the matrix $\mathbf{H}^\mathrm{T}\mathbf{H}$. The lower bound follows by dropping the nonnegative quadratic term (since $\mathbf{H}^\mathrm{T}\mathbf{H}$ is positive semidefinite) and establishes the convexity of $\mathcal{D}$, while the upper bound establishes the Lipschitz continuity of the gradient $\nabla\mathcal{D}$ with constant $L$.
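Both bounds are easy to verify numerically. The sketch below (problem sizes chosen arbitrarily) checks them at random pairs $(\mathbf{x}, \mathbf{z})$:

```python
import numpy as np

rng = np.random.default_rng(1)
M, N = 30, 8  # arbitrary sizes for the check
H = rng.standard_normal((M, N))
y = rng.standard_normal(M)

D = lambda x: 0.5 * np.sum((y - H @ x) ** 2)  # least-squares cost
grad = lambda x: H.T @ (H @ x - y)            # its gradient
L = np.linalg.eigvalsh(H.T @ H).max()         # L = lambda_max(H^T H)

for _ in range(100):
    x, z = rng.standard_normal(N), rng.standard_normal(N)
    linear = D(z) + grad(z) @ (x - z)         # first-order expansion at z
    assert linear - 1e-9 <= D(x)              # lower bound (convexity)
    assert D(x) <= linear + 0.5 * L * np.sum((x - z) ** 2) + 1e-9  # upper bound
print("both bounds hold")
```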

We now evaluate $\mathcal{D}$ at $\mathbf{x}^+ = \mathbf{x} - \gamma \nabla\mathcal{D}(\mathbf{x})$, and use the upper bound to obtain

$$\mathcal{D}(\mathbf{x}^+) \leq \mathcal{D}(\mathbf{x}) - \gamma\left(1 - \frac{L\gamma}{2}\right)\|\nabla\mathcal{D}(\mathbf{x})\|_{\ell_2}^2,$$

which for $\gamma \in (0, 1/L]$ reduces to

$$\mathcal{D}(\mathbf{x}^+) \leq \mathcal{D}(\mathbf{x}) - \frac{\gamma}{2}\|\nabla\mathcal{D}(\mathbf{x})\|_{\ell_2}^2.$$

This inequality clearly indicates that gradient descent is indeed a descent method, i.e., at each step it reduces $\mathcal{D}$ by an amount proportional to the product of the step size and the squared norm of the gradient.
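The descent property can also be checked numerically. In the sketch below (sizes again arbitrary), the inequality holds at every iteration when $\gamma = 1/L$:

```python
import numpy as np

rng = np.random.default_rng(2)
M, N = 40, 6  # arbitrary sizes for the check
H = rng.standard_normal((M, N))
y = rng.standard_normal(M)

D = lambda x: 0.5 * np.sum((y - H @ x) ** 2)
grad = lambda x: H.T @ (H @ x - y)
L = np.linalg.eigvalsh(H.T @ H).max()
gamma = 1.0 / L

x0 = rng.standard_normal(N)  # arbitrary starting point
x = x0.copy()
for _ in range(50):
    g = grad(x)
    x_next = x - gamma * g
    # D(x+) <= D(x) - (gamma / 2) * ||grad D(x)||^2
    assert D(x_next) <= D(x) - 0.5 * gamma * np.sum(g ** 2) + 1e-9
    x = x_next
print("the descent inequality holds at every step")
```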

Now, by plugging the lower bound into the previous inequality, we obtain for any $\mathbf{z} \in \mathbb{R}^N$

$$\mathcal{D}(\mathbf{x}^+) \leq \mathcal{D}(\mathbf{z}) + \nabla\mathcal{D}(\mathbf{x})^\mathrm{T}(\mathbf{x}-\mathbf{z}) - \frac{\gamma}{2}\|\nabla\mathcal{D}(\mathbf{x})\|_{\ell_2}^2.$$

Thus, by evaluating this inequality at $\mathbf{z} = \mathbf{x}^\ast$, where $\mathbf{x}^\ast$ is the minimizer of $\mathcal{D}$, and completing the square, we obtain

$$\mathcal{D}(\mathbf{x}^+) - \mathcal{D}(\mathbf{x}^\ast) \leq \frac{1}{2\gamma}\left(\|\mathbf{x}-\mathbf{x}^\ast\|_{\ell_2}^2 - \|\mathbf{x}^+-\mathbf{x}^\ast\|_{\ell_2}^2\right).$$

Note that since $\mathcal{D}(\mathbf{x}^+) - \mathcal{D}(\mathbf{x}^\ast) \geq 0$, we have that $\|\mathbf{x}^+ - \mathbf{x}^\ast\|_{\ell_2}^2 \leq \|\mathbf{x}-\mathbf{x}^\ast\|_{\ell_2}^2$, i.e., the distance to the minimizer never increases.

Finally, by setting $\mathbf{x} = \mathbf{x}^{t-1}$, $\mathbf{x}^+ = \mathbf{x}^{t}$, and $\gamma = 1/L$, and summing over $t$, we obtain

$$\sum_{t=1}^{T}\left[\mathcal{D}(\mathbf{x}^{t}) - \mathcal{D}(\mathbf{x}^\ast)\right] \leq \frac{L}{2}\left(\|\mathbf{x}^0-\mathbf{x}^\ast\|_{\ell_2}^2 - \|\mathbf{x}^T-\mathbf{x}^\ast\|_{\ell_2}^2\right) \leq \frac{L}{2}\|\mathbf{x}^0-\mathbf{x}^\ast\|_{\ell_2}^2,$$

and since the sequence $\mathcal{D}(\mathbf{x}^t)$ is nonincreasing,

$$\mathcal{D}(\mathbf{x}^{T}) - \mathcal{D}(\mathbf{x}^\ast) \leq \frac{1}{T}\sum_{t=1}^{T}\left[\mathcal{D}(\mathbf{x}^{t}) - \mathcal{D}(\mathbf{x}^\ast)\right] \leq \frac{L\|\mathbf{x}^0-\mathbf{x}^\ast\|_{\ell_2}^2}{2T}.$$

This implies that gradient descent reaches $[\mathcal{D}(\mathbf{x}^{t}) - \mathcal{D}(\mathbf{x}^{\ast})] \leq \epsilon$ after at most $C/\epsilon$ iterations, where

$$C \triangleq \frac{L\|\mathbf{x}^0-\mathbf{x}^\ast\|_{\ell_2}^2}{2}.$$
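The $O(1/t)$ rate is easy to observe in practice. The sketch below (arbitrary sizes; $\mathbf{x}^\ast$ is computed with a direct solver purely for the comparison) checks $\mathcal{D}(\mathbf{x}^t) - \mathcal{D}(\mathbf{x}^\ast) \leq C/t$ at every iteration:

```python
import numpy as np

rng = np.random.default_rng(3)
M, N = 60, 12  # arbitrary sizes for the check
H = rng.standard_normal((M, N))
y = rng.standard_normal(M)

D = lambda x: 0.5 * np.sum((y - H @ x) ** 2)
grad = lambda x: H.T @ (H @ x - y)
L = np.linalg.eigvalsh(H.T @ H).max()

x_star, *_ = np.linalg.lstsq(H, y, rcond=None)  # minimizer, for reference only
D_star = D(x_star)

x = np.zeros(N)                          # initial guess x^0
C = 0.5 * L * np.sum((x - x_star) ** 2)  # C = (L / 2) ||x^0 - x*||^2
for t in range(1, 201):
    x = x - (1.0 / L) * grad(x)
    assert D(x) - D_star <= C / t + 1e-9  # the O(1/t) bound
print("the O(1/t) bound holds for all t")
```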

