
A simple ISTA ↔ FISTA switch

Today, let us revisit the topic of enforcing sparsity and look at an easy trick for accelerating the convergence of the algorithm (demo).

The basic algorithm that we discussed last time is the iterative shrinkage/thresholding algorithm (ISTA) that can be specified as follows

\mathbf{f}^t \leftarrow \eta(\mathbf{f}^{t-1} - \gamma \mathbf{H}^\mathrm{T}(\mathbf{H}\mathbf{f}^{t-1}-\mathbf{y}), \gamma \tau),

where t = 1, 2, 3, \dots is the iteration number, \mathbf{y} is the measurement vector, \mathbf{H} is the measurement matrix that models the acquisition system, \gamma > 0 is the step size of the algorithm, which we can always set to the inverse of the largest eigenvalue of \mathbf{H}^\mathrm{T}\mathbf{H} to ensure convergence (i.e., set \gamma = 1/L with L = \lambda_{\text{max}}(\mathbf{H}^\mathrm{T}\mathbf{H})), \tau > 0 is the regularization parameter that controls the sparsity of the final solution (a larger \tau leads to a sparser solution), and finally \eta is a scalar thresholding function applied in a component-wise fashion. One of the most popular thresholding functions is the soft-thresholding function, defined as

\eta(x, \tau) \triangleq \mathrm{sgn}(x)(|x|-\tau)_{+}

where (\cdot)_+ returns the positive part of its argument, and \mathrm{sgn}(\cdot) is the signum function that returns +1 when its argument is positive and -1 when it is negative.
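For concreteness, here is a minimal NumPy sketch of this thresholding function (the helper name soft_threshold is mine, not from the demo):

import numpy as np

def soft_threshold(x, tau):
    # eta(x, tau) = sgn(x) * max(|x| - tau, 0), applied component-wise
    return np.sign(x) * np.maximum(np.abs(x) - tau, 0.0)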

ISTA is a very well understood method: its rate of convergence matches that of the gradient-descent method, which is O(1/t).

Let us consider the following simple iteration

\mathbf{f}^t \leftarrow \eta(\mathbf{s}^{t-1} - \gamma \mathbf{H}^\mathrm{T}(\mathbf{H}\mathbf{s}^{t-1}-\mathbf{y}), \gamma \tau)

\mathbf{s}^t \leftarrow \mathbf{f}^{t} + ((q_{t-1}-1)/q_t) (\mathbf{f}^t - \mathbf{f}^{t-1}),

where \mathbf{f}^0 = \mathbf{s}^0 = \mathbf{f}_{\text{init}}. When q_t = 1 for all t = 1, 2, 3, \dots, the iteration reduces to ISTA; however, an appropriate choice of \{q_t\}_{t = 0, 1, 2, \dots} leads to a faster O(1/t^2) convergence, which is crucial for large-scale problems where one tries to reduce the number of matrix-vector products with \mathbf{H} and \mathbf{H}^\mathrm{T}. This faster version of ISTA was originally proposed by Beck & Teboulle (2009) and is widely known as fast ISTA (FISTA).

So, what is that appropriate choice of \{q_t\}_{t = 0, 1, 2, \dots}?

Beck & Teboulle proposed to initialize q_0 = 1 and then set the rest iteratively as follows

q_t \leftarrow \frac{1}{2}\left(1 +\sqrt{1+4q_{t-1}^2}\right).
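Putting the pieces together, here is a hedged sketch of the whole ISTA/FISTA switch. The name fista_est and its signature only loosely mirror the demo's fistaEst, so treat this as an illustrative reconstruction rather than the demo code; it reuses the soft_threshold helper sketched above:

import numpy as np

def fista_est(y, H, tau, num_iter, accelerate=True):
    # Minimizes 0.5*||H f - y||^2 + tau*||f||_1 with ISTA (accelerate=False)
    # or FISTA (accelerate=True).
    gamma = 1.0 / np.linalg.eigvalsh(H.T @ H).max()  # step size gamma = 1/L
    f = np.zeros(H.shape[1])  # f^0 = f_init
    s = f.copy()              # s^0 = f^0
    q = 1.0                   # q_0 = 1
    cost = []
    for _ in range(num_iter):
        f_old = f
        # Gradient step on the quadratic term followed by soft-thresholding
        f = soft_threshold(s - gamma * (H.T @ (H @ s - y)), gamma * tau)
        if accelerate:
            # FISTA: update q_t, then take the momentum step
            q_old, q = q, 0.5 * (1.0 + np.sqrt(1.0 + 4.0 * q * q))
            s = f + ((q_old - 1.0) / q) * (f - f_old)
        else:
            # ISTA: q_t = 1 for all t, so s^t = f^t
            s = f
        cost.append(0.5 * np.sum((H @ f - y) ** 2) + tau * np.sum(np.abs(f)))
    return f, cost

Flipping the single accelerate flag is then all it takes to switch between the O(1/t) and O(1/t^2) behavior.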

A short note on the Python demo: it was done as a fully self-contained IPython notebook. The first two cells create and save two files, matrixmodel.py and algorithm.py, which are reusable as stand-alone files.

The switch from ISTA to FISTA is done in cell 8:

# Reconstruct with ISTA
[fhatISTA, costISTA] = fistaEst(y, forwardObj, tau, numIter, accelerate=False)

# Reconstruct with FISTA 
[fhatFISTA, costFISTA] = fistaEst(y, forwardObj, tau, numIter, accelerate=True)

The output of the demo is the following figure, plotting the results of both algorithms over 100 iterations:

Figure: Comparison of ISTA with FISTA.