A little bit of theory on unconstrained optimization

In unconstrained optimization one wishes to minimize or maximize a given function f, the so called objective function, without further constraints. We restrict ourselves here to minimization; maximization can then be done by reversing the sign of f. Without some additional assumptions the task might not be solvable at all, hence we state some (not too severe) restrictions on f which give us a well defined environment to work in.
General assumptions
  1. The domain D of f is open in $\mathbb{R}^n$.
  2. There is an $x_0 \in D$ such that the level set
    $$\mathcal{L}(f;x_0) = \{\, x \in D : f(x) \le f(x_0) \,\}$$

    is closed and bounded in $\mathbb{R}^n$ (i.e. compact).
  3. f is twice continuously differentiable on an open neighborhood of $\mathcal{L}(f;x_0)$.
From this the following facts follow easily using a little calculus:
  1. There exists an $x^*$ in $\mathcal{L}(f;x_0)$ such that
    $$f(x^*) \le f(x) \quad \forall\, x \in D$$

    (the global minimizer).
  2. If $x^*$ is a local minimizer in $\mathcal{L}(f;x_0)$, i.e.
    $$f(x^*) \le f(x) \quad \forall\, x \in D \cap \{x : \|x - x^*\| \le \delta\}$$

    for some $\delta > 0$, then there holds
    $$\nabla f(x^*) = 0 \quad \text{and} \quad \nabla^2 f(x^*) \ \text{is positive semidefinite},$$

    the so called necessary first and second order optimality conditions.
  3. If $\nabla f(x^*) = 0$ and $\nabla^2 f(x^*)$ is positive definite, then $x^*$ is a strict local minimizer of f, i.e.
    $$f(x^*) < f(x) \quad \forall\, x \in D \cap \{x : 0 < \|x - x^*\| \le \delta\}$$

    for some $\delta > 0$: a sufficient condition for strict local optimality.
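As a small illustration (not part of the codes of this chapter), these conditions can be checked numerically at a candidate point. A minimal Python sketch, assuming NumPy and analytically given gradient and Hessian; the test function is Rosenbrock's, chosen here only for demonstration:

```python
import numpy as np

def check_optimality(grad, hess, x, tol=1e-8):
    """Test the necessary and sufficient optimality conditions at x."""
    g = grad(x)
    first_order = np.linalg.norm(g) <= tol       # grad f(x) = 0 (numerically)
    eig = np.linalg.eigvalsh(hess(x))            # Hessian assumed symmetric
    necessary = eig.min() >= -tol                # positive semidefinite
    sufficient = eig.min() > tol                 # positive definite
    return first_order, necessary, sufficient

# Rosenbrock's function f(x) = (x1-1)^2 + 100 (x2-x1^2)^2, minimizer (1,1)
grad = lambda x: np.array([2*(x[0] - 1) - 400*x[0]*(x[1] - x[0]**2),
                           200*(x[1] - x[0]**2)])
hess = lambda x: np.array([[2 - 400*(x[1] - 3*x[0]**2), -400*x[0]],
                           [-400*x[0],                   200.0]])
print(check_optimality(grad, hess, np.array([1.0, 1.0])))  # (True, True, True)
```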
With the exception of the Newton minimizer, all the codes in this chapter restrict themselves to finding a point satisfying the first order necessary condition
$$\nabla f(x^*) = 0.$$

Silently one hopes that, since by construction of the methods some associated sequence of function values decreases strictly monotonically, the limit point one obtains will be a strict local minimizer. There exist methods which indeed guarantee finding the global minimizer, but these are of a completely different structure from those presented here.
The general structure of most of the methods in this chapter is as follows: given $x_0$ (hopefully as assumed above) one computes a sequence $\{x_k\}$ from
$$x_{k+1} = x_k - \sigma_k d_k$$

where $d_k$ is strictly gradient related, which means
$$\gamma_1 \|\nabla f(x_k)\| \le \|d_k\| \le \gamma_2 \|\nabla f(x_k)\|$$

and
$$\frac{\nabla f(x_k)^T d_k}{\|d_k\|\;\|\nabla f(x_k)\|} \ge \gamma_3 > 0$$

for some constants $\gamma_1, \gamma_2, \gamma_3 > 0$ which are implicit in the construction of $d_k$ and not necessarily explicitly known. In words: the direction of change has a length comparable to that of the current gradient and makes an angle with it which is bounded away from $\pi/2$. In most cases this property is obtained by choosing $d_k$ as the solution of a linear system
$$A_k d_k = \nabla f(x_k)$$

with a sequence of matrices $A_k$ whose eigenvalues are bounded from below by a positive constant and bounded from above as well. Such a sequence of matrices is called uniformly positive definite.
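A minimal sketch of this construction (Python/NumPy; the spectral shift `mu` enforcing uniform positive definiteness is an illustrative device, not taken from the codes of this chapter):

```python
import numpy as np

def gradient_related_direction(A, g, mu=1e-6):
    """Solve A d = grad f(x_k) after shifting the spectrum of A so that
    all eigenvalues are at least mu.  Assumes g != 0."""
    A = 0.5 * (A + A.T)                        # work with the symmetric part
    lmin = np.linalg.eigvalsh(A).min()
    if lmin < mu:
        A = A + (mu - lmin) * np.eye(len(g))   # shift the eigenvalues up
    d = np.linalg.solve(A, g)
    # With eigenvalues of A in [l1, l2], 0 < l1 <= l2, one gets
    #   ||g||/l2 <= ||d|| <= ||g||/l1        (the gamma_1, gamma_2 bounds)
    #   g.d / (||d|| ||g||) >= l1/l2 > 0     (the gamma_3 bound)
    cos_angle = (g @ d) / (np.linalg.norm(d) * np.linalg.norm(g))
    return d, cos_angle
```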
The stepsize $\sigma_k$ is computed such that it satisfies the so called principle of sufficient decrease:
$$f(x_k) - f(x_k - \sigma_k d_k) \ge \gamma_4\, \sigma_k\, \nabla f(x_k)^T d_k$$

and
$$\sigma_k \ge \gamma_5\, \frac{\nabla f(x_k)^T d_k}{\|d_k\|^2}.$$

$\gamma_4$ is an explicitly known constant, provided by the user (in the codes in this chapter it is the parameter $\delta$), and should be chosen less than 0.5 for efficiency reasons. The existence of $\gamma_5$ is assured by the assumptions on f and the construction of $\sigma_k$, but it is usually unknown.
Inserting all the above one gets
$$f(x_k) - f(x_{k+1}) \ge \gamma_4 \gamma_5 \gamma_3^2\, \|\nabla f(x_k)\|^2 \ge 0,$$

from the boundedness of f from below that
$$\nabla f(x_k) \to 0,$$

and from this that every limit point of $\{x_k\}$ is a first order necessary point. Clearly $\{x_k\}$ has limit points, since it stays in the compact set $\mathcal{L}(f;x_0)$. One also gets
$$d_k \to 0.$$

Hence if $\{\sigma_k\}$ is also bounded one gets
$$x_k - x_{k+1} \to 0,$$

and if f has only finitely many stationary points then the whole sequence $\{x_k\}$ must converge to one of these.
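For completeness, the step from the decrease estimate to $\nabla f(x_k) \to 0$ is the usual telescoping argument: summing the decrease inequality over $k = 0, \ldots, N$ gives
$$\gamma_4 \gamma_5 \gamma_3^2 \sum_{k=0}^{N} \|\nabla f(x_k)\|^2 \;\le\; f(x_0) - f(x_{N+1}) \;\le\; f(x_0) - \min_{x \in \mathcal{L}(f;x_0)} f(x) \;<\; \infty,$$
so the series on the left converges and its terms must tend to zero.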
This is what a user hopes for: the code delivers a sequence such that termination with
$$\|\nabla f(x_k)\| \le \epsilon \quad \text{or} \quad \|x_k - x_{k+1}\| \le \epsilon$$

delivers a result "near" a true solution. Up to a second order term the error $x^* - x_k$ will be
$$-\bigl(\nabla^2 f(x_k)\bigr)^{-1} \nabla f(x_k),$$

hence if the method delivers a $d_k$ such that
$$\frac{\|(\nabla^2 f(x_k) - A_k)\, d_k\|}{\|d_k\|} \to 0$$

(the so called Broyden-Dennis-Moré condition) then termination on the grounds of a small difference in $x_k$ will also be a reliable one.
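A minimal sketch of such a combined termination test (Python/NumPy; names and tolerances are illustrative, and the available approximation $A_k$ stands in for the exact Hessian, which the condition above justifies):

```python
import numpy as np

def terminate(g, x_old, x_new, A=None, eps=1e-8):
    """Stop on a small gradient or a small step; if a Hessian
    approximation A is at hand, also return the first order error
    estimate x* - x_k = -A^{-1} g."""
    small_gradient = np.linalg.norm(g) <= eps
    small_step = np.linalg.norm(x_new - x_old) <= eps
    err = -np.linalg.solve(A, g) if A is not None else None
    return small_gradient or small_step, err
```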
We have already mentioned the construction of $d_k$ above. The codes in use satisfy this, sometimes with $A_k$ constructed explicitly, sometimes with $A_k$ existing but not known explicitly. We now turn to the computation of the stepsize $\sigma_k$. Two methods are in use here. The simplest one is known as backtracking or the Goldstein-Armijo descent test and works as follows:
Given a strictly gradient related direction $d_k$, choose some initial stepsize $\sigma_{0,k}$ less than some universal constant $\gamma_0$ (often $= 1$). Then take as $\sigma_k$ the largest value $\beta^j \sigma_{0,k}$, $j = 0, 1, 2, \ldots$, which satisfies
$$f(x_k) - f(x_k - \beta^j \sigma_{0,k}\, d_k) \ge \delta\, \beta^j \sigma_{0,k}\, \nabla f(x_k)^T d_k$$

with parameters $0 < \beta < 1$ (often $\beta = 1/2$) and $0 < \delta < 1$, preferably $0 < \delta < 1/2$.
[Figure: backtrackpict.png]
The efficiency of this method depends on the proper choice of $\sigma_{0,k}$. For example, the minimizer of the parabola interpolating the values
$$p(0),\; p(1),\; p'(0), \qquad \text{where } p(s) = f(x_k - s\, d_k),$$

is often useful. The picture above shows the graph of a function, its tangent line at $\sigma = 0$, the "Armijo line" used in the test, two intervals of admissible stepsizes, and the local upper bound on f: a parabola into which the upper bound L for $\|\nabla^2 f(x)\|$ on $\mathcal{L}(f;x_0)$ enters. This parabola is used to prove that there always exist stepsizes satisfying the principle of sufficient decrease.
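A compact sketch of the backtracking scheme together with the parabolic choice of $\sigma_{0,k}$ just described (Python; g, x, d are NumPy vectors; the parameters $\delta$, $\beta$ follow the text, the safeguard is an illustrative addition):

```python
def backtracking(f, g, x, d, delta=0.01, beta=0.5, sigma0_max=1.0):
    """Goldstein-Armijo backtracking for x_{k+1} = x_k - sigma_k d_k,
    where g is the gradient at x and d a strictly gradient related
    direction, so that slope = g^T d > 0."""
    fx = f(x)
    slope = g @ d
    # initial stepsize: minimizer of the parabola through p(0), p(1) and
    # p'(0) = -slope, namely slope / (2 (p(1) - p(0) + slope))
    curv = f(x - d) - fx + slope
    sigma = min(slope / (2.0 * curv), sigma0_max) if curv > 0 else sigma0_max
    # take the largest beta^j sigma_{0,k} passing the descent test
    while fx - f(x - sigma * d) < delta * sigma * slope:
        sigma *= beta
        if sigma < 1e-16:   # safeguard; cannot occur under the assumptions
            break
    return sigma
```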
The drawback of the dependency on $\sigma_{0,k}$ is avoided by the so called Powell-Wolfe stepsize, which is independent of such a choice: here a stepsize $\sigma_k$ is computed such that
$$f(x_k) - f(x_k - \sigma_k d_k) \ge \delta\, \sigma_k\, \nabla f(x_k)^T d_k \quad \text{and} \quad \nabla f(x_k - \sigma_k d_k)^T d_k \le \kappa\, \nabla f(x_k)^T d_k$$

with $0 < \delta < \kappa < 1$. Sometimes the strong Powell-Wolfe conditions are used, where the absolute value of the left hand side of the second inequality is taken. This means that the stepsize must come close to a first order stationary point of f along the search line.
[Figure: powellwolfepict.png]
The picture above shows the graph of a function and two admissible intervals of stepsizes. Observe that, due to the second condition and in contrast to the backtracking scheme, arbitrarily small stepsizes are not admissible here. If the strong conditions were used as well, only two small intervals around the two local minimizers would remain admissible.
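A sketch of one classical bracketing/bisection realization of the Powell-Wolfe conditions (Python; x, d are NumPy vectors; this is a standard construction, not necessarily the one used in the codes of this chapter):

```python
def powell_wolfe(f, grad, x, d, delta=0.01, kappa=0.9):
    """Powell-Wolfe stepsize for x_{k+1} = x_k - sigma_k d_k, assuming
    slope = grad(x)^T d > 0 and 0 < delta < kappa < 1."""
    fx = f(x)
    slope = grad(x) @ d

    def armijo(s):       # principle of sufficient decrease
        return fx - f(x - s * d) >= delta * s * slope

    def curvature(s):    # second Powell-Wolfe condition
        return grad(x - s * d) @ d <= kappa * slope

    # phase 1: bracket [lo, hi] with the descent test holding at lo and
    # failing at hi (boundedness of f from below stops the doubling)
    if armijo(1.0):
        lo = 1.0
        while armijo(2.0 * lo):
            lo *= 2.0
            if curvature(lo):
                return lo
        hi = 2.0 * lo
    else:
        hi = 1.0
        while not armijo(hi / 2.0):
            hi /= 2.0
        lo = hi / 2.0
    # phase 2: bisect, keeping armijo(lo) true and armijo(hi) false,
    # until the curvature condition holds at lo as well
    while not curvature(lo):
        mid = 0.5 * (lo + hi)
        if armijo(mid):
            lo = mid
        else:
            hi = mid
    return lo
```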

