# Descending the least-squares

Reconsider a system of linear equations

$$\mathbf{y} = \mathbf{H}\mathbf{x} + \mathbf{e}.$$

Here, the vector $\mathbf{y} \in \mathbb{R}^M$ and matrix $\mathbf{H} \in \mathbb{R}^{M \times N}$ are known, and we want to computationally recover the vector $\mathbf{x} \in \mathbb{R}^N$. The unknown vector $\mathbf{e} \in \mathbb{R}^M$ models noise.

If we know nothing about $\mathbf{x}$ beyond the measurements $\mathbf{y}$, then it makes sense to seek the $\mathbf{x}$ that predicts $\mathbf{y}$ as closely as possible via $\mathbf{H}$. Note, however, that this approach only makes sense when $M \geq N$, i.e., when there are at least as many measurements as unknowns; otherwise, infinitely many vectors $\mathbf{x} \in \mathbb{R}^N$ can perfectly match the given measurements $\mathbf{y}$.

So, how can one enforce the closeness of $\mathbf{x}$ to the measurements $\mathbf{y}$?

The simplest, and most popular, approach is to find the $\mathbf{x} \in \mathbb{R}^N$ that minimizes the least-squares cost function

$$\mathcal{D}(\mathbf{x}) \triangleq \frac{1}{2}\|\mathbf{y} - \mathbf{H}\mathbf{x}\|_{\ell_2}^2.$$

Given the least-squares cost function, how would we actually minimize it?

One approach that conveniently circumvents the need to explicitly form and invert the matrix $\mathbf{H}^\mathrm{T}\mathbf{H}$ is the gradient descent method. To apply it, we first find the expression for the gradient of $\mathcal{D}$ as

$$\nabla\mathcal{D}(\mathbf{x}) = \mathbf{H}^\mathrm{T}(\mathbf{H}\mathbf{x} - \mathbf{y}).$$

Then, given a guess of the solution $\mathbf{x}^{t-1}$, we can update it by taking a step in the direction of the negative gradient

$$\mathbf{x}^t = \mathbf{x}^{t-1} - \gamma\nabla\mathcal{D}(\mathbf{x}^{t-1}),$$

where the parameter $\gamma > 0$ controls the size of the step. The process above is repeated for every $t = 1, 2, 3, \dots$, starting from some initial guess $\mathbf{x}^0$.
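As a concrete illustration, the iteration above can be sketched in a few lines of NumPy (the problem instance, function names, and iteration count below are our own illustrative choices, not part of the original post):

```python
import numpy as np

def gradient_descent(H, y, gamma, num_iters=500):
    """Minimize D(x) = (1/2) * ||y - H x||^2 by gradient descent."""
    x = np.zeros(H.shape[1])            # initial guess x^0
    for _ in range(num_iters):
        grad = H.T @ (H @ x - y)        # gradient of D at the current iterate
        x = x - gamma * grad            # step in the negative-gradient direction
    return x

# Illustrative overdetermined problem (M >= N) with mild noise
rng = np.random.default_rng(0)
M, N = 50, 10
H = rng.standard_normal((M, N))
y = H @ rng.standard_normal(N) + 0.01 * rng.standard_normal(M)

L = np.linalg.eigvalsh(H.T @ H).max()   # largest eigenvalue of H^T H
x_hat = gradient_descent(H, y, gamma=1.0 / L)
```

With the step size set to $1/L$, the iterates converge to the least-squares solution, which can be cross-checked against a direct solver.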

How can we be sure that this gradient-descent iteration actually minimizes $\mathcal{D}$? Can we comment on the number of iterations required to get to the bottom of $\mathcal{D}$?

To start analyzing the gradient descent method, we first write the Taylor series expansion of $\mathcal{D}(\mathbf{x})$ around an arbitrary location $\mathbf{z} \in \mathbb{R}^N$

$$\mathcal{D}(\mathbf{x}) = \mathcal{D}(\mathbf{z}) + \nabla\mathcal{D}(\mathbf{z})^\mathrm{T}(\mathbf{x} - \mathbf{z}) + \frac{1}{2}(\mathbf{x} - \mathbf{z})^\mathrm{T}\nabla^2\mathcal{D}(\mathbf{z})(\mathbf{x} - \mathbf{z}),$$

where we used $\nabla^2$ to denote the Hessian, which in our case reduces to $\nabla^2\mathcal{D}(\mathbf{z}) = \mathbf{H}^\mathrm{T}\mathbf{H}$. Note that the second-order Taylor expansion is exact since $\mathcal{D}$ is a quadratic function. Then, we can obtain two bounds on $\mathcal{D}$ for any $\mathbf{x}, \mathbf{z} \in \mathbb{R}^N$

$$\mathcal{D}(\mathbf{x}) \geq \mathcal{D}(\mathbf{z}) + \nabla\mathcal{D}(\mathbf{z})^\mathrm{T}(\mathbf{x} - \mathbf{z})$$

and

$$\mathcal{D}(\mathbf{x}) \leq \mathcal{D}(\mathbf{z}) + \nabla\mathcal{D}(\mathbf{z})^\mathrm{T}(\mathbf{x} - \mathbf{z}) + \frac{L}{2}\|\mathbf{x} - \mathbf{z}\|_{\ell_2}^2,$$

where $L \triangleq \lambda_{\text{max}}(\mathbf{H}^\mathrm{T}\mathbf{H})$ denotes the largest eigenvalue of the matrix $\mathbf{H}^\mathrm{T}\mathbf{H}$. The lower bound expresses the convexity of $\mathcal{D}$ (since $\mathbf{H}^\mathrm{T}\mathbf{H}$ is positive semidefinite), while the upper bound expresses the Lipschitz continuity of the gradient $\nabla\mathcal{D}$, with Lipschitz constant $L$.

We now evaluate $\mathcal{D}$ at $\mathbf{x}^+ = \mathbf{x} - \gamma\nabla\mathcal{D}(\mathbf{x})$, and use the upper bound with $\mathbf{z} = \mathbf{x}$ to obtain

$$\mathcal{D}(\mathbf{x}^+) \leq \mathcal{D}(\mathbf{x}) - \gamma\left(1 - \frac{L\gamma}{2}\right)\|\nabla\mathcal{D}(\mathbf{x})\|_{\ell_2}^2,$$

which for $\gamma \in (0, 1/L]$ reduces to

$$\mathcal{D}(\mathbf{x}^+) \leq \mathcal{D}(\mathbf{x}) - \frac{\gamma}{2}\|\nabla\mathcal{D}(\mathbf{x})\|_{\ell_2}^2.$$

This inequality clearly indicates that gradient descent is indeed a descent method: at each step it reduces $\mathcal{D}$ by an amount proportional to the product of the step size and the squared norm of the gradient.

Now, by plugging the lower bound (with the roles of $\mathbf{x}$ and $\mathbf{z}$ exchanged) into the previous inequality, we obtain for any $\mathbf{z} \in \mathbb{R}^N$

$$\mathcal{D}(\mathbf{x}^+) \leq \mathcal{D}(\mathbf{z}) + \nabla\mathcal{D}(\mathbf{x})^\mathrm{T}(\mathbf{x} - \mathbf{z}) - \frac{\gamma}{2}\|\nabla\mathcal{D}(\mathbf{x})\|_{\ell_2}^2.$$

Thus, by evaluating this inequality at $\mathbf{z} = \mathbf{x}^\ast$, where $\mathbf{x}^\ast$ is a minimizer of $\mathcal{D}$, and completing the square using $\mathbf{x}^+ = \mathbf{x} - \gamma\nabla\mathcal{D}(\mathbf{x})$, we obtain

$$\mathcal{D}(\mathbf{x}^+) - \mathcal{D}(\mathbf{x}^\ast) \leq \frac{1}{2\gamma}\left(\|\mathbf{x} - \mathbf{x}^\ast\|_{\ell_2}^2 - \|\mathbf{x}^+ - \mathbf{x}^\ast\|_{\ell_2}^2\right).$$

Note that since $\mathcal{D}(\mathbf{x}^+) - \mathcal{D}(\mathbf{x}^\ast) \geq 0$, we have that $\|\mathbf{x}^+ - \mathbf{x}^\ast\|_{\ell_2}^2 \leq \|\mathbf{x}-\mathbf{x}^\ast\|_{\ell_2}^2$, i.e., distance to a minimizer decreases.

Finally, by setting $\mathbf{x} = \mathbf{x}^{t-1}$, $\mathbf{x}^+ = \mathbf{x}^{t}$, and $\gamma = 1/L$, and summing over $t$, we obtain the telescoping bound

$$\sum_{i=1}^{t}\left[\mathcal{D}(\mathbf{x}^{i}) - \mathcal{D}(\mathbf{x}^\ast)\right] \leq \frac{L}{2}\left(\|\mathbf{x}^0 - \mathbf{x}^\ast\|_{\ell_2}^2 - \|\mathbf{x}^t - \mathbf{x}^\ast\|_{\ell_2}^2\right) \leq \frac{L}{2}\|\mathbf{x}^0 - \mathbf{x}^\ast\|_{\ell_2}^2,$$

and since the sequence $\mathcal{D}(\mathbf{x}^t)$ is nonincreasing,

$$\mathcal{D}(\mathbf{x}^{t}) - \mathcal{D}(\mathbf{x}^\ast) \leq \frac{1}{t}\sum_{i=1}^{t}\left[\mathcal{D}(\mathbf{x}^{i}) - \mathcal{D}(\mathbf{x}^\ast)\right] \leq \frac{L\|\mathbf{x}^0 - \mathbf{x}^\ast\|_{\ell_2}^2}{2t}.$$

This implies that gradient descent reaches $[\mathcal{D}(\mathbf{x}^{t}) - \mathcal{D}(\mathbf{x}^{\ast})] \leq \epsilon$ after at most $C/\epsilon$ iterations, where

$$C \triangleq \frac{L\|\mathbf{x}^0 - \mathbf{x}^\ast\|_{\ell_2}^2}{2}.$$
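This $O(1/t)$ bound can be sanity-checked numerically by running the iteration with $\gamma = 1/L$ on a small random instance and verifying that the suboptimality gap stays below $C/t$ at every iteration (an illustrative script; the instance and variable names are our own choices):

```python
import numpy as np

rng = np.random.default_rng(1)
M, N = 40, 8
H = rng.standard_normal((M, N))
y = rng.standard_normal(M)

D = lambda x: 0.5 * np.sum((y - H @ x) ** 2)    # least-squares cost
grad = lambda x: H.T @ (H @ x - y)              # its gradient

L = np.linalg.eigvalsh(H.T @ H).max()           # Lipschitz constant of the gradient
x_star, *_ = np.linalg.lstsq(H, y, rcond=None)  # minimizer of D
C = 0.5 * L * np.sum(x_star ** 2)               # C = L * ||x^0 - x^*||^2 / 2, with x^0 = 0

x = np.zeros(N)
gaps = []
for t in range(1, 201):
    x = x - grad(x) / L                         # gradient step with gamma = 1/L
    gaps.append(D(x) - D(x_star))               # suboptimality gap at iteration t

# Check D(x^t) - D(x^*) <= C / t at every iteration
bound_holds = all(gap <= C / t + 1e-9 for t, gap in enumerate(gaps, start=1))
```

The same run also confirms the descent property: the recorded gaps form a nonincreasing sequence.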


# Minimization via soft-thresholding

Previously, we saw that it was straightforward to modify the gradient descent algorithm to obtain a sparse solution to a set of linear equations. The only required modification was to soft-threshold the solution vector at every iteration of gradient descent. At the end of that post, we even claimed that this modified algorithm, known as ISTA (iterative shrinkage/thresholding algorithm), minimizes an $\ell_1$-norm regularized energy function.

Today, we move towards a more rigorous understanding of ISTA. Specifically, we show that the soft-thresholding operator

$$\eta(z, \lambda) \triangleq \mathrm{sgn}(z)\,\max(|z| - \lambda, 0)$$

minimizes the following convex function

$$c(x) \triangleq \frac{1}{2}(x - z)^2 + \lambda|x|,$$

where the number $z \in \mathbb{R}$ and the parameter $\lambda > 0$ are known, and $|\cdot|$ denotes the absolute value. This relationship for scalar functions is easily extended to the following key result

$$\eta(\mathbf{z}, \lambda) = \mathop{\arg\min}_{\mathbf{x} \in \mathbb{R}^N}\left\{\frac{1}{2}\|\mathbf{x} - \mathbf{z}\|_{\ell_2}^2 + \lambda\|\mathbf{x}\|_{\ell_1}\right\}$$

for an arbitrary $\mathbf{z} \in \mathbb{R}^N$ and parameter $\lambda > 0$. Note that, here, $\eta$ is applied component-wise to its first argument.
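In NumPy, the component-wise operator can be sketched as follows (an illustrative implementation; the function name is our own):

```python
import numpy as np

def soft_threshold(z, lam):
    """Soft-thresholding eta(z, lam), applied component-wise to z."""
    return np.sign(z) * np.maximum(np.abs(z) - lam, 0.0)

# Entries with |z_n| <= lam are set exactly to zero;
# larger entries are shrunk toward zero by lam.
z = np.array([-2.0, -0.3, 0.0, 0.5, 1.5])
shrunk = soft_threshold(z, 1.0)
```

The exact zeros produced for small entries are what make ISTA iterates sparse.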

We first introduce the notions of a subderivative and of the subdifferential. A subderivative of a convex function $c$ at $x$ is a number $g \in \mathbb{R}$ that satisfies the inequality $c(y) \geq c(x) + g(y - x)$ for all $y$. When $c$ is differentiable at $x$, the only possible choice for $g$ is the derivative $c^\prime(x)$. The set of subderivatives of $c$ at $x$ is called the subdifferential, denoted $\partial c(x)$. Similarly to the standard optimality condition for differentiable convex functions, a point $x^\ast \in \mathbb{R}$ is a minimizer of $c$ if and only if zero is contained in the subdifferential of $c$ at $x^\ast$, i.e., $0 \in \partial c(x^\ast)$.

When $x = 0$, the absolute value term in $c$ is not differentiable. However, its subdifferential at zero is the interval $[-1, 1]$, which leads to the optimality condition

$$0 \in \partial c(0) = -z + \lambda[-1, 1] \quad\Longleftrightarrow\quad |z| \leq \lambda,$$

i.e., $x^\ast = 0$ is the minimizer if and only if $|z| \leq \lambda$.

When $x \neq 0$, the absolute value term in $c$ is differentiable with derivative $\mathrm{sgn}(x)$, which leads to the optimality condition

$$0 = x^\ast - z + \lambda\,\mathrm{sgn}(x^\ast).$$

By noting that $(x^\ast < 0) \Rightarrow (z < -\lambda)$ and that $(x^\ast > 0) \Rightarrow (z > \lambda)$, so that $\mathrm{sgn}(x^\ast) = \mathrm{sgn}(z)$, we can express the minimizer $x^\ast \neq 0$ as

$$x^\ast = z - \lambda\,\mathrm{sgn}(z).$$

Putting the two cases together, we obtain

$$x^\ast = \mathrm{sgn}(z)\,\max(|z| - \lambda, 0),$$

which is precisely the definition of the soft-thresholding operator.

Finally, to see how this scalar result generalizes to vectors, we re-write

$$\frac{1}{2}\|\mathbf{x} - \mathbf{z}\|_{\ell_2}^2 + \lambda\|\mathbf{x}\|_{\ell_1} = \sum_{n=1}^{N}\left[\frac{1}{2}(x_n - z_n)^2 + \lambda|x_n|\right]$$

and optimize separately over every $x_n$.
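The scalar result can also be sanity-checked numerically, by comparing the closed-form minimizer against a brute-force search of $c(x)$ over a fine grid (an illustrative script with our own choices of $z$, $\lambda$, and grid):

```python
import numpy as np

def soft_threshold(z, lam):
    # Closed-form minimizer of c(x) = (1/2)(x - z)^2 + lam * |x|
    return np.sign(z) * np.maximum(np.abs(z) - lam, 0.0)

lam = 0.7
grid = np.linspace(-5.0, 5.0, 200001)   # fine grid of candidate x values

ok = True
for z in [-2.3, -0.4, 0.0, 0.5, 1.9]:
    c = 0.5 * (grid - z) ** 2 + lam * np.abs(grid)   # the scalar cost c(x)
    x_brute = grid[np.argmin(c)]                     # brute-force minimizer
    ok = ok and abs(x_brute - soft_threshold(z, lam)) < 1e-3
```

For every tested $z$, the brute-force minimizer agrees with $\eta(z, \lambda)$ up to the grid resolution.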
