Monthly Archives: March 2016

Fundamentals: Extreme Value Theorems—Part 1

In the last post, we proved that the gradient descent algorithm can be used to computationally find a global minimizer of the least-squares cost functional, i.e., it converges to a vector \mathbf{x}^\ast \in \mathbb{R}^N such that

\mathcal{C}(\mathbf{x}^\ast) \leq \mathcal{C}(\mathbf{x}),

for all \mathbf{x} \in \mathbb{R}^N, where \mathcal{C}(\mathbf{x}) \triangleq \frac{1}{2}\|\mathbf{y}-\mathbf{H}\mathbf{x}\|_{\ell_2}^2.

Today, we want to be more general and ask the following question: given an arbitrary continuous function \mathcal{C}(\mathbf{x}), with \mathbf{x} \in \mathbb{R}^N, how to know that the function has a global minimizer?

Specifically, we will discuss two famous theorems that establish existence of global minimizers.

Theorem 1 [Extreme Value Theorem]: If \mathcal{X} \subseteq \mathbb{R}^N is a compact set and \mathcal{C} is a continuous function on \mathcal{X}, then \mathcal{C} has a global minimizer on \mathcal{X}

Theorem 2 [Extreme Value Theorem]: If \mathcal{C} is a continuous coercive function defined on all of \mathbb{R}^N, then \mathcal{C} has a global minimizer.

Those theorems provide sufficient conditions for the existence of global minimizers. The rest of the post will be about defining the terms compact and coercive; in Part 2 of the post we shall prove both theorems.

We start by establishing two fundamental concepts of open and closed sets in \mathbb{R}^N.

Definition 1 [open set]: A subset \mathcal{X} \subset \mathbb{R}^N is open if and only if for every \mathbf{x} \in \mathcal{X}  there exist \epsilon > 0 such that the open ball \mathcal{B}_{\epsilon}(\mathbf{x}) of center \mathbf{x} and radius \epsilon remains in \mathcal{X}, i.e., \mathcal{B}_{\epsilon}(\mathbf{x}) \subset \mathcal{X}.

Remark that the radius \epsilon may depend on \mathbf{x}, and also we remind the definition \mathcal{B}_{\epsilon}(\mathbf{x}) \triangleq \{\mathbf{y} \in \mathbb{R}^N : \|\mathbf{x}-\mathbf{y}\|_{\ell_2} < \epsilon\}.

Definition 2 [closed set]: A subset \mathcal{X} \subset \mathbb{R}^N is closed if and only if its complement \mathcal{X}^\complement = \mathbb{R}^N \setminus \mathcal{X} is open.

More intuitively, a closed set is a set that includes its boundary (if there is one), while an open set does not. To establish the term compact set, we introduce one more intermediate definition.

Definition 3 [bounded set]: A subset \mathcal{X} \subseteq \mathbb{R}^N is bounded if there exists a constant C > 0 such that \|\mathbf{x}\|_{\ell_2} < C for all \mathbf{x} \in \mathcal{X}.

Finally, we now define the compact set by combining Definitions 2 and 3.

Definition 4 [compact set]: A subset \mathcal{X} \subseteq \mathbb{R}^N is compact if it is closed and bounded.

From the perspective of Definition 4, Theorem 1 simply states that if we are optimizing \mathcal{C} over some compact subset of \mathcal{X} \subseteq \mathbb{R}^N, then we are sure that there exists a global minimizer \mathbf{x}^\ast that we may hope to find computationally.

Now, what happens if we are interested in optimizing over the whole \mathbb{R}^N rather than on some subset?

This is where the notion of coercive functions becomes useful.

Definition 5 [coercive function]: A continuous function \mathcal{C} that is defined on all of \mathbb{R}^N is coercive if

\lim_{\|\mathbf{x}\|_{\ell_2} \rightarrow \infty} \left\{\mathcal{C}(\mathbf{x})\right\} = +\infty.

This means that for any constant C > 0 there exists a constant X > 0 (that may depend on C) such that \mathcal{C}(\mathbf{x}) > C whenever \|\mathbf{x}\|_{\ell_2} > X.

Intuitively, for a function to be coercive, it must approach +\infty along any path within \mathbb{R}^N on which \|\mathbf{x}\|_{\ell_2} becomes infinite.

Going back to the example of the least-squares cost functional, we have from the reverse triangular inequality that

 \|\mathbf{y}-\mathbf{H}\mathbf{x}\|_{\ell_2} \geq \|\mathbf{H}\mathbf{x}\|_{\ell_2}-\|\mathbf{y}\|_{\ell_2} \geq C\|\mathbf{x}\|_{\ell_2}-\|\mathbf{y}\|_{\ell_2},

where the constant C = \sqrt{\lambda_{\text{min}}(\mathbf{H}^\mathrm{T}\mathbf{H})} with \lambda_{\text{min}}(\mathbf{H}^\mathrm{T}\mathbf{H}) denoting the smallest eigenvalue of the matrix \mathbf{H}^\mathrm{T}\mathbf{H}.

Thus, when the matrix \mathbf{H} is non-singular, i.e., C > 0, and since

 \lim_{\|\mathbf{x}\|_{\ell_2} \rightarrow \infty} \left\{\|\mathbf{x}\|_{\ell_2}\right\} = +\infty,

we see that least-squares is indeed a coercive function.

Descending the least-squares

Reconsider a system of linear equations

\mathbf{y} = \mathbf{H}\mathbf{x} + \mathbf{e}.

Here, the vector \mathbf{y} \in \mathbb{R}^M and matrix \mathbf{H} \in \mathbb{R}^{M \times N} are known, and we want to computationally recover the vector \mathbf{x} \in \mathbb{R}^N. The unknown vector \mathbf{e} \in \mathbb{R}^M models noise.

If we know nothing about \mathbf{x} beyond the measurements \mathbf{y}, then it makes sense to seek \mathbf{x} that predicts \mathbf{y} as closely as possible via  \mathbf{H}. Note, however, that this approach would only make sense when M \geq N, i.e., when there are at least as many measurements as the unknowns; otherwise, we end up with infinitely many potential vectors \mathbf{x} \in \mathbb{R}^N that can perfectly match given measurements \mathbf{y}.

So, how can one enforce the closeness of \mathbf{x} to the measurements \mathbf{y}?

The simplest, and most popular, approach is to find \mathbf{x} \in \mathbb{R}^N that minimizes the least-squares function

\mathcal{D}(\mathbf{x}) \triangleq \frac{1}{2}\|\mathbf{y}- \mathbf{H}\mathbf{x}\|_{\ell_2}^2 = \frac{1}{2} \sum_{m = 1}^M \left(y_m - \sum_{n = 1}^N H_{mn} x_n\right)^2 .

Given the least-squares cost function, how would we actually minimize it?

One approach that conveniently circumvents the need to physically store and invert the matrix \mathbf{H} is the gradient descent method. To apply it, we first find the expression for the gradient of \mathcal{D} as

\nabla \mathcal{D}(\mathbf{x}) = \mathbf{H}^\mathrm{T}\left(\mathbf{H}\mathbf{x}-\mathbf{y}\right).

Then, given a guess of the solution \mathbf{x}^{t-1}, we can update it by taking a step in the direction of negative gradient

\mathbf{x}^t \leftarrow \mathbf{x}^{t-1} - \gamma \nabla \mathcal{D}(\mathbf{x}^{t-1}),

where the parameter \gamma > 0 controls the size of the step. The process above is repeated many times, for every t = 1, 2, 3, \dots, starting from some initial guess  \mathbf{x}^0.

How can we be sure that this gradient-descent iteration actually minimizes \mathcal{D}? Can we comment on the number of iterations required to get to the bottom of \mathcal{D}?

To start analyzing the gradient descent method, we first write the Taylor series expansion of \mathcal{D}(\mathbf{x}) at an arbitrary location \mathbf{z} \in \mathbb{R}^N

\mathcal{D}(\mathbf{x}) = \mathcal{D}(\mathbf{z}) + [\nabla \mathcal{D}(\mathbf{z})]^\mathrm{T}(\mathbf{x}-\mathbf{z}) + (\mathbf{x}-\mathbf{z})^\mathrm{T}[\nabla^2 \mathcal{D}(\mathbf{z})](\mathbf{x}-\mathbf{z}),

where we used \nabla^2 to denote the Hessian, which in our case reduces to  \nabla^2\mathcal{D}(\mathbf{z}) = \mathbf{H}^\mathrm{T}\mathbf{H}. Note that the second order Taylor expansion was sufficient since \mathcal{D} is a quadratic function. Then, we can obtain two bounds on \mathcal{D} for any \mathbf{x},\mathbf{z} \in \mathbb{R}^N

 \mathcal{D}(\mathbf{x}) \geq \mathcal{D}(\mathbf{z}) + [\nabla \mathcal{D}(\mathbf{z})]^\mathrm{T}(\mathbf{x}-\mathbf{z})

and

 \mathcal{D}(\mathbf{x}) \leq \mathcal{D}(\mathbf{z}) + [\nabla \mathcal{D}(\mathbf{z})]^\mathrm{T}(\mathbf{x}-\mathbf{z}) + \frac{L}{2}\|\mathbf{x}-\mathbf{z}\|_{\ell_2}^2,

where L \triangleq \lambda_{\text{max}}(\mathbf{H}^\mathrm{T}\mathbf{H}) with the notation \lambda_{\text{max}}(\mathbf{H}^\mathrm{T}\mathbf{H}) denoting the largest eigenvalue of the matrix \mathbf{H}^\mathrm{T}\mathbf{H}. The lower bound above indicates to the convexity of \mathcal{D}, while the upper bound to the Lipschitz continuity of the gradient \nabla \mathcal{D}.

We now evaluate \mathcal{D} at \mathbf{x}^+ = \mathbf{x} - \gamma \nabla \mathcal{D}(\mathbf{x}), and use the upper bound to obtain

\mathcal{D}(\mathbf{x}^+) \leq \mathcal{D}(\mathbf{x}) - \gamma \|\nabla \mathcal{D}(\mathbf{x})\|_{\ell_2}^2 + \frac{\gamma^2 L}{2}\|\nabla \mathcal{D}(\mathbf{x})\|_{\ell_2}^2,

which for \gamma \in [0, 1/L] reduces to

\mathcal{D}(\mathbf{x}^+) \leq \mathcal{D}(\mathbf{x}) - \frac{\gamma}{2} \|\nabla \mathcal{D}(\mathbf{x})\|_{\ell_2}^2.

This inequality clearly indicates that gradient descent is indeed a descent method, i.e., at each step it reduces our \mathcal{D} by a value proportional to the product of the step-size and the norm of the gradient.

Now, by plugging in the lower bound inequality into the previous inequality, we obtain for any \mathbf{z} \in \mathbb{R}^N

\mathcal{D}(\mathbf{x}^+) \leq \mathcal{D}(\mathbf{z}) + [\nabla \mathcal{D}(\mathbf{x})]^\mathrm{T}(\mathbf{x}-\mathbf{z}) - \frac{\gamma}{2}\|\nabla \mathcal{D}(\mathbf{x})\|_{\ell_2}^2.

Thus, by evaluating this inequality at \mathbf{z} = \mathbf{x}^\ast, where  \mathbf{x}^\ast is the minimizer of \mathcal{D}, we obtain

\mathcal{D}(\mathbf{x}^+) - \mathcal{D}(\mathbf{x}^\ast) \leq [\nabla \mathcal{D}(\mathbf{x})]^\mathrm{T}(\mathbf{x}-\mathbf{x}^\ast) - \frac{\gamma}{2}\|\nabla \mathcal{D}(\mathbf{x})\|_{\ell_2}^2 \\ = \frac{1}{2\gamma}\left(\|\mathbf{x}-\mathbf{x}^\ast\|_{\ell_2}^2 - \|\mathbf{x}^+ - \mathbf{x}^\ast\|_{\ell_2}^2\right).

Note that since \mathcal{D}(\mathbf{x}^+) - \mathcal{D}(\mathbf{x}^\ast) \geq 0, we have that \|\mathbf{x}^+ - \mathbf{x}^\ast\|_{\ell_2}^2 \leq \|\mathbf{x}-\mathbf{x}^\ast\|_{\ell_2}^2, i.e., distance to a minimizer decreases.

Finally, by setting \mathbf{x} = \mathbf{x}^{t-1}, \mathbf{x}^+ = \mathbf{x}^{t}, and \gamma = 1/L, and summing over t, we obtain

\sum_{i = 1}^t(\mathcal{D}(\mathbf{x}^t)-\mathcal{D}(\mathbf{x}^\ast)) \leq \frac{L}{2}\|\mathbf{x}^0 - \mathbf{x}^\ast\|_{\ell_2}^2

and since the sequence \mathcal{D}(\mathbf{x}^t) is nonincreasing,

\mathcal{D}(\mathbf{x}^t) - \mathcal{D}(\mathbf{x}^\ast) \leq \frac{1}{t}\sum_{i = 1}^t (\mathcal{D}(\mathbf{x}^t) - \mathcal{D}(\mathbf{x}^\ast)) \leq \frac{L}{2t}\|\mathbf{x}^0-\mathbf{x}^\ast\|_{\ell_2}^2.

This implies that gradient descent reaches [\mathcal{D}(\mathbf{x}^{t}) - \mathcal{D}(\mathbf{x}^{\ast})] \leq \epsilon after C/\epsilon iterations, where

C = \frac{L}{2}\|\mathbf{x}^0 - \mathbf{x}^\ast\|_{\ell_2}^2.


References:

[1] S. Boyd and L. Vandenberghe, Convex Optimization, Cambridge Univ. Press, 2004.

[2] A. Beck and M. Teboulle, “A fast iterative shrinkage-thresholding algorithm for linear inverse problems,” SIAM J. Imaging Sciences, vol. 2, no. 1, pp. 183–202, 2009.