Enforcing sparsity with a proximal-gradient method

Sparsity of a vector is a key property that makes its accurate estimation possible when solving underdetermined systems of linear equations of the form

Here, we know the vector $\mathbf{y} \in \mathbb{R}^M$ and the matrix $\mathbf{H} \in \mathbb{R}^{M \times N}$, and we would like to computationally recover the vector $\mathbf{x} \in \mathbb{R}^N$. The last term $\mathbf{e} \in \mathbb{R}^M$ models noise, which is unknown, and gives some robustness to the linear system of equations. The term underdetermined means that $M < N$, i.e., there are less measurements than unknowns, and sparsity means that $K \ll N$, i.e., only very few components of $\mathbf{x}$ are non-zero. A great deal has already been said about the relationship between $K$ and $M$ and the level of noise that would make a computational estimation of $\mathbf{x}$ possible. This underlying theory is often called compressive sensing and has received a lot of attention in the recent past [1][2].

Now, suppose that we are working on a problem that can be formulated as the linear system of equations above, and it turns out that the vector we are trying to recover is indeed sparse. How do we go about recovering $\mathbf{x}$ in a way that takes advantage of its sparsity?

First, let's consider the scenario without sparsity. Then, the simplest approach to computationally estimate $\mathbf{x}$ is to implement and run the gradient descent algorithm summarized as follows

Here, $t = 1, 2, 3, \dots$ denotes the iteration number, $\mathbf{x}^t$ denotes the estimate at $t$, the superscript letter $^\text{T}$ denotes the matrix-transposition operator, and $\gamma > 0$ is a step-size. The step-size is a parameter that can be precomputed in advance to make sure that the sequence of estimates converges. Specifically, we can set $\gamma = 1/\lambda_{\text{max}}(\mathbf{H}^{\text{T}}\mathbf{H})$, where $\lambda_{\text{max}}(\mathbf{H}^{\text{T}}\mathbf{H})$ denotes the largest eigenvalue of the matrix $\mathbf{H}^{\text{T}}\mathbf{H}$ [3].

To understand the standard gradient descent consider the following least-squares energy function

which simply measures the Euclidean- or $\ell_2$-distance between the actual measurements $\mathbf{y}$ and the predicted measurements $\mathbf{H}\mathbf{x}$. The gradient of the function $\mathcal{D}$ with respect to the vector $\mathbf{x}$ is given by

which implies that the gradient descent algorithm simply minimizes the least-squares energy function by iteratively pushing the estimate towards the direction of steepest descent.

It turns out that the sparsity of $\mathbf{x}$ can be enforced via a slight modification of the standard gradient descent. Specifically, we can implement and run the following iteration

Here, $\eta$ is a special scalar function that is applied component-wise to its first argument, and $\lambda > 0$  is a parameter controlling the sparsity of the solution, i.e., larger  $\lambda$ leads to sparser solutions. One popular function is the soft-thresholding function [4] that can be defined as follows

where the operator $(\cdot)_{+}$ returns the positive part of its argument, and $\mathrm{sgn}(\cdot)$ returns $+1$ if its argument is positive, $-1$ if its negative, and $0$ if it is exactly zero.

This modified gradient-descent algorithm is often called iterative shrinkage/thresholding algorithm (ISTA) [5–7]; more broadly, these types of algorithms that alternate between taking a gradient-step and evaluating a nonlinearity are called proximal-gradient algorithms.

Additionally, it turns out that ISTA minimizes the following energy function

where $\mathcal{D}$ is the same least-squares function as above, and $\mathcal{R}(\mathbf{x}) \triangleq \|\mathbf{x}\|_{\ell_1}$ is the $\ell_1$-norm penalty, which is the basis of the compressive sensing theory.

Here is how a Python code of ISTA would look like given a matrix $\mathbf{H}$ and a soft-thresholding function $\eta$ (ipython demo):

numIter = 1000  # number of iterations
gamma = np.square(1/np.linalg.norm(H, 2)) # step-size
lam = 0.001  # regularization parameter

xhat = np.zeros((N, 1))  # initialize ISTA

for iter in range(numIter):
xhat = eta(xhat - gamma*grad, gamma*lam)  # update

References:

[1] E. J. Candès, J. Romberg, and T. Tao, “Robust uncertainty principles: Exact signal reconstruction from highly incomplete frequency information,” IEEE Trans. Inf. Theory, vol. 52, no. 2, pp. 489–509, February 2006.

[2] D. L. Donoho, “Compressed sensing,” IEEE Trans. Inf. Theory, vol. 52, no. 4, pp. 1289–1306, April 2006.

[3] A. Beck and M. Teboulle, “A fast iterative shrinkage-thresholding algorithm for linear inverse problems,” SIAM J. Imaging Sciences, vol. 2, no. 1, pp. 183–202, 2009.

[4] D. L. Donoho, “De-noising by soft-thresholding,” IEEE Trans. Inf. Theory, vol. 41, no. 3, pp. 613-627, May 1995.

[5] M. A. T. Figueiredo and R. D. Nowak, “An EM algorithm for wavelet-based image restoration,” IEEE Trans. Image Process., vol. 12, no. 8, pp. 906–916, August 2003.

[6] J. Bect, L. Blanc-Feraud, G. Aubert, and A. Chambolle, “A $\ell_1$-unified variational framework for image restoration,” in Proc. ECCV, Springer, Ed., vol. 3024, New York, 2004, pp. 1–13.

[7] I. Daubechies, M. Defrise, and C. D. Mol, “An iterative thresholding algorithm for linear inverse problems with a sparsity constraint,” Commun. Pure Appl. Math., vol. 57, no. 11, pp. 1413–1457, November 2004.