# Fundamentals: Extreme Value Theorems—Part 1

In the last post, we proved that the gradient descent algorithm can be used to computationally find a global minimizer of the least-squares cost functional, i.e., it converges to a vector $\mathbf{x}^\ast \in \mathbb{R}^N$ such that

$$\mathcal{C}(\mathbf{x}^\ast) \leq \mathcal{C}(\mathbf{x})$$

for all $\mathbf{x} \in \mathbb{R}^N$, where $\mathcal{C}(\mathbf{x}) \triangleq \frac{1}{2}\|\mathbf{y}-\mathbf{H}\mathbf{x}\|_{\ell_2}^2.$

Today, we want to be more general and ask the following question: given an arbitrary continuous function $\mathcal{C}(\mathbf{x})$, with $\mathbf{x} \in \mathbb{R}^N$, how do we know that it has a global minimizer?

Specifically, we will discuss two famous theorems that establish the existence of global minimizers.

Theorem 1 [Extreme Value Theorem]: If $\mathcal{X} \subseteq \mathbb{R}^N$ is a compact set and $\mathcal{C}$ is a continuous function on $\mathcal{X}$, then $\mathcal{C}$ has a global minimizer on $\mathcal{X}$.

Theorem 2 [Extreme Value Theorem]: If $\mathcal{C}$ is a continuous coercive function defined on all of $\mathbb{R}^N$, then $\mathcal{C}$ has a global minimizer.

These theorems provide sufficient conditions for the existence of global minimizers. The rest of this post is about defining the terms compact and coercive; in Part 2 we shall prove both theorems.

We start by establishing two fundamental concepts: open and closed sets in $\mathbb{R}^N$.

Definition 1 [open set]: A subset $\mathcal{X} \subset \mathbb{R}^N$ is open if and only if for every $\mathbf{x} \in \mathcal{X}$ there exists $\epsilon > 0$ such that the open ball $\mathcal{B}_{\epsilon}(\mathbf{x})$ of center $\mathbf{x}$ and radius $\epsilon$ remains in $\mathcal{X}$, i.e., $\mathcal{B}_{\epsilon}(\mathbf{x}) \subset \mathcal{X}$.

Note that the radius $\epsilon$ may depend on $\mathbf{x}$; recall also the definition $\mathcal{B}_{\epsilon}(\mathbf{x}) \triangleq \{\mathbf{y} \in \mathbb{R}^N : \|\mathbf{x}-\mathbf{y}\|_{\ell_2} < \epsilon\}$.

Definition 2 [closed set]: A subset $\mathcal{X} \subset \mathbb{R}^N$ is closed if and only if its complement $\mathcal{X}^\complement = \mathbb{R}^N \setminus \mathcal{X}$ is open.

More intuitively, a closed set is a set that includes its boundary (if there is one), while an open set does not. To establish the term compact set, we introduce one more intermediate definition.

Definition 3 [bounded set]: A subset $\mathcal{X} \subseteq \mathbb{R}^N$ is bounded if there exists a constant $C > 0$ such that $\|\mathbf{x}\|_{\ell_2} < C$ for all $\mathbf{x} \in \mathcal{X}$.

Finally, we now define the compact set by combining Definitions 2 and 3.

Definition 4 [compact set]: A subset $\mathcal{X} \subseteq \mathbb{R}^N$ is compact if it is closed and bounded.

From the perspective of Definition 4, Theorem 1 simply states that if we are optimizing $\mathcal{C}$ over some compact subset $\mathcal{X} \subseteq \mathbb{R}^N$, then we are sure that there exists a global minimizer $\mathbf{x}^\ast$ that we may hope to find computationally.

Now, what happens if we are interested in optimizing over the whole $\mathbb{R}^N$ rather than on some subset?

This is where the notion of coercive functions becomes useful.

Definition 5 [coercive function]: A continuous function $\mathcal{C}$ that is defined on all of $\mathbb{R}^N$ is coercive if

$$\mathcal{C}(\mathbf{x}) \rightarrow +\infty \quad \text{as} \quad \|\mathbf{x}\|_{\ell_2} \rightarrow +\infty.$$

This means that for any constant $C > 0$ there exists a constant $X > 0$ (that may depend on $C$) such that $\mathcal{C}(\mathbf{x}) > C$ whenever $\|\mathbf{x}\|_{\ell_2} > X$.

Intuitively, for a function to be coercive, it must approach $+\infty$ along any path within $\mathbb{R}^N$ on which $\|\mathbf{x}\|_{\ell_2}$ becomes infinite.
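This path-dependence is easy to see numerically. The sketch below uses two hypothetical example functions (not from the post): $f(\mathbf{x}) = \|\mathbf{x}\|_{\ell_2}^2$, which is coercive, and $g(\mathbf{x}) = x_1^2$, which is not, since it stays flat along the path $\mathbf{x} = (0, t)$ even though $\|\mathbf{x}\|_{\ell_2} \rightarrow \infty$.

```python
import numpy as np

# Hypothetical example functions on R^2 (for illustration only):
f = lambda x: np.dot(x, x)   # ||x||^2: grows without bound along every unbounded path
g = lambda x: x[0] ** 2      # x_1^2: ignores the second coordinate entirely

ts = [1.0, 10.0, 100.0, 1000.0]
f_along_path = [f(np.array([0.0, t])) for t in ts]  # grows like t^2
g_along_path = [g(np.array([0.0, t])) for t in ts]  # identically zero, so g is not coercive

print(f_along_path)  # increasing without bound
print(g_along_path)  # [0.0, 0.0, 0.0, 0.0]
```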

Going back to the example of the least-squares cost functional, we have from the reverse triangle inequality that

$$\sqrt{2\mathcal{C}(\mathbf{x})} = \|\mathbf{y}-\mathbf{H}\mathbf{x}\|_{\ell_2} \geq \|\mathbf{H}\mathbf{x}\|_{\ell_2} - \|\mathbf{y}\|_{\ell_2} \geq C\|\mathbf{x}\|_{\ell_2} - \|\mathbf{y}\|_{\ell_2},$$

where the constant $C = \sqrt{\lambda_{\text{min}}(\mathbf{H}^\mathrm{T}\mathbf{H})}$ with $\lambda_{\text{min}}(\mathbf{H}^\mathrm{T}\mathbf{H})$ denoting the smallest eigenvalue of the matrix $\mathbf{H}^\mathrm{T}\mathbf{H}$.

Thus, when the matrix $\mathbf{H}$ has full column rank (equivalently, when $\mathbf{H}^\mathrm{T}\mathbf{H}$ is non-singular), i.e., $C > 0$, and since

$$C\|\mathbf{x}\|_{\ell_2} - \|\mathbf{y}\|_{\ell_2} \rightarrow +\infty \quad \text{as} \quad \|\mathbf{x}\|_{\ell_2} \rightarrow +\infty,$$

we see that the least-squares cost is indeed a coercive function.
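The coercivity lower bound $\sqrt{2\mathcal{C}(\mathbf{x})} \geq C\|\mathbf{x}\|_{\ell_2} - \|\mathbf{y}\|_{\ell_2}$ can be sanity-checked numerically. The sketch below uses a randomly generated $\mathbf{H}$ and $\mathbf{y}$ (chosen here purely for illustration) and verifies the bound at points of growing norm:

```python
import numpy as np

rng = np.random.default_rng(0)
M, N = 20, 5
H = rng.standard_normal((M, N))   # tall random matrix: full column rank almost surely
y = rng.standard_normal(M)

cost = lambda x: 0.5 * np.sum((y - H @ x) ** 2)      # C(x) = 0.5*||y - Hx||^2
C = np.sqrt(np.linalg.eigvalsh(H.T @ H).min())       # sqrt of smallest eigenvalue of H^T H
assert C > 0  # full column rank, so the cost is coercive

# Check sqrt(2 C(x)) >= C*||x|| - ||y|| at random points of growing norm
for scale in [1.0, 10.0, 100.0]:
    x = scale * rng.standard_normal(N)
    lhs = np.sqrt(2.0 * cost(x))
    rhs = C * np.linalg.norm(x) - np.linalg.norm(y)
    assert lhs >= rhs - 1e-9
```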

# Descending the least-squares

Reconsider a system of linear equations

$$\mathbf{y} = \mathbf{H}\mathbf{x} + \mathbf{e}.$$

Here, the vector $\mathbf{y} \in \mathbb{R}^M$ and matrix $\mathbf{H} \in \mathbb{R}^{M \times N}$ are known, and we want to computationally recover the vector $\mathbf{x} \in \mathbb{R}^N$. The unknown vector $\mathbf{e} \in \mathbb{R}^M$ models noise.

If we know nothing about $\mathbf{x}$ beyond the measurements $\mathbf{y}$, then it makes sense to seek $\mathbf{x}$ that predicts $\mathbf{y}$ as closely as possible via $\mathbf{H}$. Note, however, that this approach only makes sense when $M \geq N$, i.e., when there are at least as many measurements as unknowns; otherwise, we end up with infinitely many potential vectors $\mathbf{x} \in \mathbb{R}^N$ that perfectly match the given measurements $\mathbf{y}$.

So, how can one enforce the closeness of the prediction $\mathbf{H}\mathbf{x}$ to the measurements $\mathbf{y}$?

The simplest, and most popular, approach is to find $\mathbf{x} \in \mathbb{R}^N$ that minimizes the least-squares cost function

$$\mathcal{D}(\mathbf{x}) \triangleq \frac{1}{2}\|\mathbf{y}-\mathbf{H}\mathbf{x}\|_{\ell_2}^2.$$
Given the least-squares cost function, how would we actually minimize it?

One approach that conveniently circumvents the need to physically store and invert the matrix $\mathbf{H}$ is the gradient descent method. To apply it, we first find the expression for the gradient of $\mathcal{D}$ as

$$\nabla \mathcal{D}(\mathbf{x}) = \mathbf{H}^\mathrm{T}(\mathbf{H}\mathbf{x}-\mathbf{y}).$$
Then, given a guess of the solution $\mathbf{x}^{t-1}$, we can update it by taking a step in the direction of the negative gradient

$$\mathbf{x}^{t} = \mathbf{x}^{t-1} - \gamma \nabla \mathcal{D}(\mathbf{x}^{t-1}),$$

where the parameter $\gamma > 0$ controls the size of the step. The process above is repeated for every $t = 1, 2, 3, \dots$, starting from some initial guess $\mathbf{x}^0$.
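The iteration above is only a few lines of code. Here is a minimal NumPy sketch (the problem data are randomly generated purely for illustration, and the step size $\gamma = 1/L$ with $L = \lambda_{\text{max}}(\mathbf{H}^\mathrm{T}\mathbf{H})$ anticipates the analysis that follows):

```python
import numpy as np

rng = np.random.default_rng(1)
M, N = 30, 8
H = rng.standard_normal((M, N))                   # known measurement matrix
x_true = rng.standard_normal(N)
y = H @ x_true + 0.01 * rng.standard_normal(M)    # noisy measurements y = Hx + e

grad = lambda x: H.T @ (H @ x - y)                # gradient of D(x) = 0.5*||y - Hx||^2
L = np.linalg.eigvalsh(H.T @ H).max()             # largest eigenvalue of H^T H
gamma = 1.0 / L                                   # step size

x = np.zeros(N)                                   # initial guess x^0
for t in range(1000):
    x = x - gamma * grad(x)                       # x^t = x^{t-1} - gamma * grad D(x^{t-1})

# Compare against a direct least-squares solver
x_ls = np.linalg.lstsq(H, y, rcond=None)[0]
print(np.linalg.norm(x - x_ls))                   # small: both find the same minimizer
```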

How can we be sure that this gradient-descent iteration actually minimizes $\mathcal{D}$? Can we comment on the number of iterations required to get to the bottom of $\mathcal{D}$?

To start analyzing the gradient descent method, following the standard arguments in [1] and [2], we first write the Taylor series expansion of $\mathcal{D}(\mathbf{x})$ around an arbitrary location $\mathbf{z} \in \mathbb{R}^N$

$$\mathcal{D}(\mathbf{x}) = \mathcal{D}(\mathbf{z}) + \nabla\mathcal{D}(\mathbf{z})^\mathrm{T}(\mathbf{x}-\mathbf{z}) + \frac{1}{2}(\mathbf{x}-\mathbf{z})^\mathrm{T}\nabla^2\mathcal{D}(\mathbf{z})(\mathbf{x}-\mathbf{z}),$$

where we used $\nabla^2$ to denote the Hessian, which in our case reduces to $\nabla^2\mathcal{D}(\mathbf{z}) = \mathbf{H}^\mathrm{T}\mathbf{H}$. Note that the second-order Taylor expansion is exact since $\mathcal{D}$ is a quadratic function. Then, we can obtain two bounds on $\mathcal{D}$ for any $\mathbf{x},\mathbf{z} \in \mathbb{R}^N$

$$\mathcal{D}(\mathbf{x}) \geq \mathcal{D}(\mathbf{z}) + \nabla\mathcal{D}(\mathbf{z})^\mathrm{T}(\mathbf{x}-\mathbf{z})$$

and

$$\mathcal{D}(\mathbf{x}) \leq \mathcal{D}(\mathbf{z}) + \nabla\mathcal{D}(\mathbf{z})^\mathrm{T}(\mathbf{x}-\mathbf{z}) + \frac{L}{2}\|\mathbf{x}-\mathbf{z}\|_{\ell_2}^2,$$

where $L \triangleq \lambda_{\text{max}}(\mathbf{H}^\mathrm{T}\mathbf{H})$, with $\lambda_{\text{max}}(\mathbf{H}^\mathrm{T}\mathbf{H})$ denoting the largest eigenvalue of the matrix $\mathbf{H}^\mathrm{T}\mathbf{H}$. The lower bound expresses the convexity of $\mathcal{D}$ (its Hessian $\mathbf{H}^\mathrm{T}\mathbf{H}$ is positive semidefinite), while the upper bound expresses the Lipschitz continuity of the gradient $\nabla \mathcal{D}$.
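These relations are straightforward to sanity-check numerically. The sketch below (with randomly generated data, for illustration only) verifies that the second-order expansion is exact for the quadratic $\mathcal{D}$ and that both bounds hold at a random pair of points:

```python
import numpy as np

rng = np.random.default_rng(2)
M, N = 15, 4
H = rng.standard_normal((M, N))
y = rng.standard_normal(M)

D = lambda x: 0.5 * np.sum((y - H @ x) ** 2)
grad = lambda x: H.T @ (H @ x - y)
Hess = H.T @ H                              # constant Hessian of the quadratic D
L = np.linalg.eigvalsh(Hess).max()          # largest eigenvalue of H^T H

x, z = rng.standard_normal(N), rng.standard_normal(N)
d = x - z
taylor = D(z) + grad(z) @ d + 0.5 * d @ Hess @ d
assert abs(D(x) - taylor) < 1e-9            # expansion is exact for a quadratic

lower = D(z) + grad(z) @ d                  # convexity lower bound
upper = lower + 0.5 * L * d @ d             # Lipschitz-gradient upper bound
assert lower - 1e-9 <= D(x) <= upper + 1e-9
```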

We now evaluate $\mathcal{D}$ at $\mathbf{x}^+ = \mathbf{x} - \gamma \nabla \mathcal{D}(\mathbf{x})$, and use the upper bound to obtain

$$\mathcal{D}(\mathbf{x}^+) \leq \mathcal{D}(\mathbf{x}) - \gamma\left(1-\frac{\gamma L}{2}\right)\|\nabla \mathcal{D}(\mathbf{x})\|_{\ell_2}^2,$$

which for $\gamma \in [0, 1/L]$ reduces to

$$\mathcal{D}(\mathbf{x}^+) \leq \mathcal{D}(\mathbf{x}) - \frac{\gamma}{2}\|\nabla \mathcal{D}(\mathbf{x})\|_{\ell_2}^2.$$
This inequality clearly indicates that gradient descent is indeed a descent method, i.e., at each step it reduces $\mathcal{D}$ by an amount proportional to the product of the step size and the squared norm of the gradient.
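We can confirm this sufficient-decrease property numerically. The sketch below (random data for illustration, $\gamma = 1/L$) checks that $\mathcal{D}(\mathbf{x}^+) \leq \mathcal{D}(\mathbf{x}) - \frac{\gamma}{2}\|\nabla \mathcal{D}(\mathbf{x})\|_{\ell_2}^2$ holds at every iteration:

```python
import numpy as np

rng = np.random.default_rng(3)
M, N = 25, 6
H = rng.standard_normal((M, N))
y = rng.standard_normal(M)

D = lambda x: 0.5 * np.sum((y - H @ x) ** 2)
grad = lambda x: H.T @ (H @ x - y)
L = np.linalg.eigvalsh(H.T @ H).max()
gamma = 1.0 / L

x = rng.standard_normal(N)
for _ in range(10):
    g = grad(x)
    x_plus = x - gamma * g
    # D(x+) <= D(x) - (gamma/2) * ||grad D(x)||^2
    assert D(x_plus) <= D(x) - 0.5 * gamma * (g @ g) + 1e-9
    x = x_plus
```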

Now, by plugging the lower bound into the previous inequality, we obtain for any $\mathbf{z} \in \mathbb{R}^N$

$$\mathcal{D}(\mathbf{x}^+) \leq \mathcal{D}(\mathbf{z}) + \nabla \mathcal{D}(\mathbf{x})^\mathrm{T}(\mathbf{x}-\mathbf{z}) - \frac{\gamma}{2}\|\nabla \mathcal{D}(\mathbf{x})\|_{\ell_2}^2.$$
Thus, by evaluating this inequality at $\mathbf{z} = \mathbf{x}^\ast$, where $\mathbf{x}^\ast$ is a minimizer of $\mathcal{D}$, and completing the square, we obtain

$$\mathcal{D}(\mathbf{x}^+) - \mathcal{D}(\mathbf{x}^\ast) \leq \frac{1}{2\gamma}\left(\|\mathbf{x}-\mathbf{x}^\ast\|_{\ell_2}^2 - \|\mathbf{x}^+-\mathbf{x}^\ast\|_{\ell_2}^2\right).$$
Note that since $\mathcal{D}(\mathbf{x}^+) - \mathcal{D}(\mathbf{x}^\ast) \geq 0$, we have that $\|\mathbf{x}^+ - \mathbf{x}^\ast\|_{\ell_2}^2 \leq \|\mathbf{x}-\mathbf{x}^\ast\|_{\ell_2}^2$, i.e., distance to a minimizer decreases.
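This monotone shrinkage of the distance to $\mathbf{x}^\ast$ is also easy to observe directly. The sketch below (random data; $\mathbf{x}^\ast$ computed with a direct least-squares solver for reference) tracks the distance over the iterations:

```python
import numpy as np

rng = np.random.default_rng(4)
M, N = 25, 6
H = rng.standard_normal((M, N))
y = rng.standard_normal(M)

grad = lambda x: H.T @ (H @ x - y)
L = np.linalg.eigvalsh(H.T @ H).max()
x_star = np.linalg.lstsq(H, y, rcond=None)[0]  # minimizer of D, for reference

x = rng.standard_normal(N)
dists = [np.linalg.norm(x - x_star)]
for _ in range(50):
    x = x - (1.0 / L) * grad(x)                # gradient step with gamma = 1/L
    dists.append(np.linalg.norm(x - x_star))

# the distance to the minimizer never increases
assert all(d2 <= d1 + 1e-12 for d1, d2 in zip(dists, dists[1:]))
```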

Finally, by setting $\mathbf{x} = \mathbf{x}^{t-1}$, $\mathbf{x}^+ = \mathbf{x}^{t}$, and $\gamma = 1/L$, and summing the telescoping terms over $t = 1, \dots, T$, we obtain

$$\sum_{t=1}^{T}\left(\mathcal{D}(\mathbf{x}^{t}) - \mathcal{D}(\mathbf{x}^\ast)\right) \leq \frac{L}{2}\left(\|\mathbf{x}^{0}-\mathbf{x}^\ast\|_{\ell_2}^2 - \|\mathbf{x}^{T}-\mathbf{x}^\ast\|_{\ell_2}^2\right) \leq \frac{L}{2}\|\mathbf{x}^{0}-\mathbf{x}^\ast\|_{\ell_2}^2,$$

and since the sequence $\mathcal{D}(\mathbf{x}^{t})$ is nonincreasing,

$$\mathcal{D}(\mathbf{x}^{T}) - \mathcal{D}(\mathbf{x}^\ast) \leq \frac{1}{T}\sum_{t=1}^{T}\left(\mathcal{D}(\mathbf{x}^{t}) - \mathcal{D}(\mathbf{x}^\ast)\right) \leq \frac{L\|\mathbf{x}^{0}-\mathbf{x}^\ast\|_{\ell_2}^2}{2T}.$$
This implies that gradient descent reaches $[\mathcal{D}(\mathbf{x}^{t}) - \mathcal{D}(\mathbf{x}^{\ast})] \leq \epsilon$ after at most $C/\epsilon$ iterations, where

$$C \triangleq \frac{L}{2}\|\mathbf{x}^{0}-\mathbf{x}^\ast\|_{\ell_2}^2.$$
References:

[1] S. Boyd and L. Vandenberghe, Convex Optimization, Cambridge Univ. Press, 2004.

[2] A. Beck and M. Teboulle, “A fast iterative shrinkage-thresholding algorithm for linear inverse problems,” SIAM J. Imaging Sciences, vol. 2, no. 1, pp. 183–202, 2009.