
A simple ISTA ↔ FISTA switch

Today, let us revisit the topic of enforcing sparsity and look at an easy trick for accelerating the convergence of the algorithm (demo).

The basic algorithm that we discussed last time is the iterative shrinkage/thresholding algorithm (ISTA) that can be specified as follows

\mathbf{f}^t \leftarrow \eta(\mathbf{f}^{t-1} - \gamma \mathbf{H}^\mathrm{T}(\mathbf{H}\mathbf{f}^{t-1}-\mathbf{y}), \gamma \tau),

where t = 1, 2, 3, \dots is the iteration number, \mathbf{y} is the measurement vector, \mathbf{H} is the measurement matrix that models the acquisition system, \gamma > 0 is the step-size of the algorithm, \tau > 0 is the regularization parameter that controls the sparsity of the final solution (larger \tau leads to a sparser solution), and \eta is a scalar thresholding function applied in a component-wise fashion. To ensure convergence, we can always set the step-size to the inverse of the largest eigenvalue of \mathbf{H}^\mathrm{T}\mathbf{H}, i.e., \gamma = 1/L with L = \lambda_{\text{max}}(\mathbf{H}^\mathrm{T}\mathbf{H}). One of the most popular thresholding functions is the soft-thresholding defined as

\eta(x, \tau) \triangleq \mathrm{sgn}(x)(|x|-\tau)_{+}

where (\cdot)_+ returns the positive part of its argument, and \mathrm{sgn}(\cdot) is the signum function, which returns +1 if its argument is positive and -1 if it is negative.
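In NumPy, this thresholding is a one-liner. Here is a minimal sketch (the function name softThreshold is my own, not part of the demo code):

import numpy as np

def softThreshold(x, tau):
    # Component-wise soft-thresholding: sgn(x) * max(|x| - tau, 0).
    return np.sign(x) * np.maximum(np.abs(x) - tau, 0.0)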

ISTA is a very well-understood method: it is well known that its rate of convergence matches that of the gradient-descent method, namely O(1/t).
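Putting the pieces together, a bare-bones ISTA loop might look as follows. This is only a sketch under the assumption that H and y are a NumPy matrix and vector; the names istaSketch and numIter are illustrative rather than the demo's actual API, and softThreshold is the helper defined above:

def istaSketch(y, H, tau, numIter):
    # Step-size: inverse of the largest eigenvalue of H^T H ensures convergence.
    gamma = 1.0 / np.linalg.eigvalsh(H.T @ H).max()
    f = np.zeros(H.shape[1])  # f^0 = 0
    for _ in range(numIter):
        grad = H.T @ (H @ f - y)  # gradient of the quadratic data-fidelity term
        f = softThreshold(f - gamma * grad, gamma * tau)
    return f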

Let us now consider the following simple iteration

\mathbf{f}^t \leftarrow \eta(\mathbf{s}^{t-1} - \gamma \mathbf{H}^\mathrm{T}(\mathbf{H}\mathbf{s}^{t-1}-\mathbf{y}), \gamma \tau)

\mathbf{s}^t \leftarrow \mathbf{f}^{t} + ((q_{t-1}-1)/q_t) (\mathbf{f}^t - \mathbf{f}^{t-1}),

where \mathbf{f}^0 = \mathbf{s}^0 = \mathbf{f}_{\text{init}}. When q_t = 1 for all t = 1, 2, 3, \dots, the iteration reduces to ISTA; however, an appropriate choice of \{q_t\}_{t = 0, 1, 2, \dots} leads to a faster O(1/t^2) convergence, which is crucial for large-scale problems where one tries to reduce the number of matrix-vector products with \mathbf{H} and \mathbf{H}^\mathrm{T}. This faster version of ISTA was originally proposed by Beck & Teboulle (2009) and is widely known as fast ISTA (FISTA).

So, what is that appropriate choice of \{q_t\}_{t = 0, 1, 2, \dots}?

Beck & Teboulle proposed to initialize q_0 = 1 and then to set the rest iteratively as follows

q_t \leftarrow \frac{1}{2}\left(1 +\sqrt{1+4q_{t-1}^2}\right).
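In code, grafting this momentum step onto the ISTA loop above takes only a few extra lines. Again, this is a sketch with illustrative names, not the demo's actual fistaEst implementation:

def fistaSketch(y, H, tau, numIter):
    gamma = 1.0 / np.linalg.eigvalsh(H.T @ H).max()
    f = s = np.zeros(H.shape[1])  # f^0 = s^0 = 0
    q = 1.0                       # q_0 = 1
    for _ in range(numIter):
        fPrev = f
        f = softThreshold(s - gamma * (H.T @ (H @ s - y)), gamma * tau)
        qPrev, q = q, 0.5 * (1.0 + np.sqrt(1.0 + 4.0 * q * q))
        s = f + ((qPrev - 1.0) / q) * (f - fPrev)  # momentum step
    return f

Note that freezing q_t at 1 makes the momentum coefficient (q_{t-1}-1)/q_t vanish, so that \mathbf{s}^t = \mathbf{f}^t and the iteration collapses back to plain ISTA; this is exactly what the accelerate flag toggles in the demo below.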

A short note on the Python demo: it was done as an IPython notebook in a fully self-contained way. The first two cells create and save two files, matrixmodel.py and algorithm.py, which are reusable as stand-alone files.

The switch from ISTA to FISTA is done in cell 8:

# Reconstruct with ISTA
[fhatISTA, costISTA] = fistaEst(y, forwardObj, tau, numIter, accelerate=False)

# Reconstruct with FISTA 
[fhatFISTA, costFISTA] = fistaEst(y, forwardObj, tau, numIter, accelerate=True)

The output of the demo is the following figure plotting the results of both algorithms within 100 iterations:

[Figure: Comparison of ISTA with FISTA]

Fundamentals: Extreme Value Theorems—Part 2

In the last post, we stated two fundamental theorems that establish the existence of global minimizers of a continuous function \mathcal{C}(\mathbf{x}), where \mathbf{x} \in \mathbb{R}^N. Today, we want to discuss the technicalities of establishing these theorems.

Let's start with the first theorem, which will actually enable the proof of the second theorem.

Theorem 1 [Extreme Value Theorem]: If \mathcal{X} \subseteq \mathbb{R}^N is a compact set and \mathcal{C} is a continuous function on \mathcal{X}, then \mathcal{C} has a global minimizer on \mathcal{X}.

Proof: We shall divide the proof into two parts.

[Part 1: \mathcal{C} is bounded below on \mathcal{X}] We shall prove this by contradiction. Suppose that \mathcal{C} is not bounded below on \mathcal{X}. Then, we can define a sequence \{\mathbf{x}_n\}_{n \in \mathbb{N}} in \mathcal{X} by picking, for every n \in \mathbb{N}, a vector \mathbf{x}_n \in \mathcal{X} such that \mathcal{C}(\mathbf{x}_n) < -n. Since \mathcal{X} is compact, and thus bounded and closed, we know from the Bolzano-Weierstrass theorem that there exists a convergent subsequence \{\mathbf{x}_{n_k}\}_{k \in \mathbb{N}} of \{\mathbf{x}_n\}_{n \in \mathbb{N}}, whose limit we denote by \mathbf{x} \in \mathcal{X}. Note that \mathbf{x} is indeed in \mathcal{X} due to the fact that the latter is closed, i.e., \mathcal{X} contains all its limit points. From the continuity of \mathcal{C}, we know that \{\mathcal{C}(\mathbf{x}_{n_k})\}_{k \in \mathbb{N}} converges to the real number \mathcal{C}(\mathbf{x}). However, since n_k \geq k, we have \mathcal{C}(\mathbf{x}_{n_k}) < -n_k \leq -k for all k \in \mathbb{N}, so \{\mathcal{C}(\mathbf{x}_{n_k})\}_{k \in \mathbb{N}} diverges to -\infty. This is a contradiction, which means that \mathcal{C} is bounded below on \mathcal{X}.

[Part 2: \mathcal{C} achieves its infimum on \mathcal{X}] Since \mathcal{C} is bounded below on \mathcal{X}, it has a greatest lower bound (commonly known as the infimum), which we shall denote by C \in \mathbb{R}. What we want to show is that there exists a vector \mathbf{x}^\ast \in \mathcal{X} such that \mathcal{C}(\mathbf{x}^\ast) = C. Since C is the greatest lower bound, for any n \in \mathbb{N}, we know that C + 1/n is not a lower bound for \mathcal{C} on \mathcal{X}. We can then define a sequence \{\mathbf{x}_n\}_{n \in \mathbb{N}} in \mathcal{X} by picking \mathbf{x}_n \in \mathcal{X} such that \mathcal{C}(\mathbf{x}_n) < C + 1/n. Then, we have that

 C \leq \mathcal{C}(\mathbf{x}_n) < C + \frac{1}{n},

for all n \in \mathbb{N}, which means that the sequence \{\mathcal{C}(\mathbf{x}_n)\}_{n \in \mathbb{N}} converges to C as n \rightarrow \infty. Since \mathcal{X} is bounded, we know from the Bolzano-Weierstrass theorem that there exists a subsequence \{\mathbf{x}_{n_k}\}_{k \in \mathbb{N}} that converges to some \mathbf{x}^\ast, and additionally, since \mathcal{X} is closed, we know that \mathbf{x}^\ast \in \mathcal{X}. From the continuity of the function \mathcal{C}, we have that \{\mathcal{C}(\mathbf{x}_{n_k})\}_{k \in \mathbb{N}} converges to \mathcal{C}(\mathbf{x}^\ast). But \{\mathcal{C}(\mathbf{x}_{n_k})\}_{k \in \mathbb{N}} is a subsequence of \{\mathcal{C}(\mathbf{x}_n)\}_{n \in \mathbb{N}}, which converges to C, so we must have C = \mathcal{C}(\mathbf{x}^\ast). Therefore, we can conclude that \mathcal{C} attains its infimum C at \mathbf{x}^\ast \in \mathcal{X}.
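To see that compactness is essential in Theorem 1, consider a simple one-dimensional example: the function \mathcal{C}(x) = x is continuous and bounded below on the set \mathcal{X} = (0, 1), with \inf_{x \in \mathcal{X}} \mathcal{C}(x) = 0, yet no x^\ast \in \mathcal{X} attains this infimum, since \mathcal{X} is not closed and hence not compact.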

To establish the second theorem, we first prove that coercivity implies compactness of the sublevel sets of \mathcal{C}, as stated in the following result.

Lemma 1 [coercivity \Rightarrow compactness]: If \mathcal{C} is a continuous coercive function defined on all of \mathbb{R}^N, then for every C \in \mathbb{R} the set

\mathcal{X} \triangleq \{\mathbf{x} \in \mathbb{R}^N \,|\, \mathcal{C}(\mathbf{x}) \leq C\}

is compact.

Proof: First note that, since \mathcal{C} is continuous, the set \mathcal{X} is closed. Thus, we only need to show that \mathcal{X} is bounded, which we shall do by contradiction. Suppose that there is a constant C \in \mathbb{R} for which \mathcal{X} is unbounded. Then, there must exist a sequence \{\mathbf{x}_n\}_{n \in \mathbb{N}} in \mathcal{X} such that \|\mathbf{x}_n\|_{\ell_2} \rightarrow \infty. But then, by coercivity of \mathcal{C}, we must also have \mathcal{C}(\mathbf{x}_n) \rightarrow \infty, which contradicts the fact that, by definition of \mathcal{X}, we have \mathcal{C}(\mathbf{x}_n) \leq C for all n \in \mathbb{N}. Therefore, the set \mathcal{X} must be bounded.
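As a concrete illustration of Lemma 1, consider the coercive function \mathcal{C}(\mathbf{x}) = \|\mathbf{x}\|_{\ell_2}^2. For every C \geq 0, its sublevel set

\mathcal{X} = \{\mathbf{x} \in \mathbb{R}^N \,|\, \|\mathbf{x}\|_{\ell_2}^2 \leq C\}

is the closed ball of radius \sqrt{C}, which is closed and bounded, and hence compact (for C < 0 the set is simply empty).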

We can now finally establish the following important result.

Theorem 2 [Extreme Value Theorem]: If \mathcal{C} is a continuous coercive function defined on all of \mathbb{R}^N, then \mathcal{C} has a global minimizer.

Proof: Let a constant C \in \mathbb{R} be chosen so that the set \mathcal{X} = \{\mathbf{x} \in \mathbb{R}^N \,|\, \mathcal{C}(\mathbf{x}) \leq C\} is non-empty. From Lemma 1, we know that \mathcal{X} is compact. From Theorem 1 above, we know that \mathcal{C} has at least one global minimizer \mathbf{x}^\ast on \mathcal{X}, and since any \mathbf{x} \notin \mathcal{X} satisfies \mathcal{C}(\mathbf{x}) > C \geq \mathcal{C}(\mathbf{x}^\ast), any global minimizer on \mathcal{X} is also a global minimizer on all of \mathbb{R}^N, which gives the result.
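As an example that connects back to the first post, consider the \ell_1-regularized least-squares cost underlying the ISTA/FISTA iterations

\mathcal{C}(\mathbf{f}) \triangleq \frac{1}{2}\|\mathbf{y} - \mathbf{H}\mathbf{f}\|_{\ell_2}^2 + \tau\|\mathbf{f}\|_{\ell_1},

with \tau > 0. This function is continuous and coercive on \mathbb{R}^N, since the term \tau\|\mathbf{f}\|_{\ell_1} grows without bound as \|\mathbf{f}\|_{\ell_2} \rightarrow \infty, so Theorem 2 guarantees that it has a global minimizer.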

