A little bit of theory on unconstrained optimization

In unconstrained optimization one wishes to minimize or maximize a given function f, the so called objective function, without further constraints. We restrict ourselves here to minimization; maximization can then be done by reversing the sign of f. Without some additional assumptions the task might not be solvable at all, hence we state some (not too severe) restrictions on f which give us a well defined environment to work in.
General assumptions
  1. The domain D of f is open in $\mathbb{R}^n$.
  2. There is an $x_0 \in D$ such that the level set
    $$\mathcal{L}(f;x_0) = \{\, x \in D : f(x) \le f(x_0) \,\}$$

    is closed and bounded in $\mathbb{R}^n$ (i.e. compact).
  3. f is twice continuously differentiable on an open neighborhood of $\mathcal{L}(f;x_0)$.
From this the following facts follow easily using a little calculus:
  1. There exists an $x^*$ in $\mathcal{L}(f;x_0)$ such that
    $$f(x^*) \le f(x) \quad \forall\, x \in D$$

    (the global minimizer).
  2. If $x^*$ is a local minimizer in $\mathcal{L}(f;x_0)$, i.e.
    $$f(x^*) \le f(x) \quad \forall\, x \in D \cap \{x : \|x - x^*\| \le \delta\}$$

    for some $\delta > 0$, then there holds
    $$\nabla f(x^*) = 0 \quad \text{and} \quad \nabla^2 f(x^*) \ \text{is positive semidefinite},$$

    the so called necessary first and second order optimality conditions.
  3. If $\nabla f(x^*) = 0$ and $\nabla^2 f(x^*)$ is positive definite, then $x^*$ is a strict local minimizer of f, i.e.
    $$f(x^*) < f(x) \quad \forall\, x \in D \cap \{x : 0 < \|x - x^*\| \le \delta\}$$

    for some $\delta > 0$: a sufficient condition for strict local optimality.
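As a small illustration (not part of the codes of this chapter), these conditions can be checked numerically at a candidate point. A minimal Python sketch, assuming NumPy and analytically given gradient and Hessian; the test function is Rosenbrock's, chosen here only for demonstration:

```python
import numpy as np

def check_optimality(grad, hess, x, tol=1e-8):
    """Test the necessary and sufficient optimality conditions at x."""
    g = grad(x)
    first_order = np.linalg.norm(g) <= tol       # grad f(x) = 0 (numerically)
    eig = np.linalg.eigvalsh(hess(x))            # Hessian assumed symmetric
    necessary = eig.min() >= -tol                # positive semidefinite
    sufficient = eig.min() > tol                 # positive definite
    return first_order, necessary, sufficient

# Rosenbrock's function f(x) = (x1-1)^2 + 100 (x2-x1^2)^2, minimizer (1,1)
grad = lambda x: np.array([2*(x[0] - 1) - 400*x[0]*(x[1] - x[0]**2),
                           200*(x[1] - x[0]**2)])
hess = lambda x: np.array([[2 - 400*(x[1] - 3*x[0]**2), -400*x[0]],
                           [-400*x[0],                   200.0]])
print(check_optimality(grad, hess, np.array([1.0, 1.0])))  # (True, True, True)
```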
With the exception of the Newton minimizer, all the codes in this chapter restrict themselves to finding a point satisfying the first order necessary condition
$$\nabla f(x^*) = 0.$$

Silently one hopes that, since by construction of the methods some associated sequence of function values decreases strictly monotonically, the limit point one obtains will be a strict local minimizer. There exist methods which indeed guarantee finding the global minimizer, but these are of a completely different structure from those presented here.
The general structure of most of the methods in this chapter is as follows: given $x_0$ (hopefully as assumed above) one computes a sequence $\{x_k\}$ from
$$x_{k+1} = x_k - \sigma_k d_k$$

where $d_k$ is strictly gradient related, which means
$$\gamma_1 \|\nabla f(x_k)\| \le \|d_k\| \le \gamma_2 \|\nabla f(x_k)\|$$

and
$$\frac{\nabla f(x_k)^T d_k}{\|d_k\|\;\|\nabla f(x_k)\|} \ge \gamma_3 > 0$$

for some constants $\gamma_1, \gamma_2, \gamma_3 > 0$ which are implicit in the construction of $d_k$ and not necessarily explicitly known. In words: the direction of change has a length comparable to that of the current gradient and makes an angle with it which is bounded away from $\pi/2$. In most cases this property is obtained by choosing $d_k$ as the solution of a linear system
$$A_k d_k = \nabla f(x_k)$$

with a sequence of matrices $A_k$ whose eigenvalues are bounded from below by a positive constant and bounded from above as well. Such a sequence of matrices is called uniformly positive definite.
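A minimal sketch of this construction (Python/NumPy; the spectral shift `mu` enforcing uniform positive definiteness is an illustrative device, not taken from the codes of this chapter):

```python
import numpy as np

def gradient_related_direction(A, g, mu=1e-6):
    """Solve A d = grad f(x_k) after shifting the spectrum of A so that
    all eigenvalues are at least mu.  Assumes g != 0."""
    A = 0.5 * (A + A.T)                        # work with the symmetric part
    lmin = np.linalg.eigvalsh(A).min()
    if lmin < mu:
        A = A + (mu - lmin) * np.eye(len(g))   # shift the eigenvalues up
    d = np.linalg.solve(A, g)
    # With eigenvalues of A in [l1, l2], 0 < l1 <= l2, one gets
    #   ||g||/l2 <= ||d|| <= ||g||/l1        (the gamma_1, gamma_2 bounds)
    #   g.d / (||d|| ||g||) >= l1/l2 > 0     (the gamma_3 bound)
    cos_angle = (g @ d) / (np.linalg.norm(d) * np.linalg.norm(g))
    return d, cos_angle
```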
The stepsize $\sigma_k$ is computed such that it satisfies the so called principle of sufficient decrease:
$$f(x_k) - f(x_k - \sigma_k d_k) \ge \gamma_4\, \sigma_k\, \nabla f(x_k)^T d_k$$

and
$$\sigma_k \ge \gamma_5\, \frac{\nabla f(x_k)^T d_k}{\|d_k\|^2}.$$

$\gamma_4$ is an explicitly known constant, provided by the user (in the codes in this chapter it is the parameter $\delta$), and should be chosen less than 0.5 for efficiency reasons. The existence of $\gamma_5$ is assured by the assumptions on f and the construction of $\sigma_k$, but it is usually unknown.
Inserting all the above one gets
$$f(x_k) - f(x_{k+1}) \ge \gamma_4 \gamma_5 \gamma_3^2\, \|\nabla f(x_k)\|^2 \ge 0,$$

from the boundedness of f from below that
$$\nabla f(x_k) \to 0,$$

and from this that every limit point of $\{x_k\}$ is a first order necessary point. Clearly $\{x_k\}$ has limit points, since it stays in the compact set $\mathcal{L}(f;x_0)$. One also gets
$$d_k \to 0.$$

Hence if $\{\sigma_k\}$ is also bounded one gets
$$x_k - x_{k+1} \to 0,$$

and if f has only finitely many stationary points then the whole sequence $\{x_k\}$ must converge to one of these.
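For completeness, the step from the decrease estimate to $\nabla f(x_k) \to 0$ is the usual telescoping argument: summing the decrease inequality over $k = 0, \ldots, N$ gives
$$\gamma_4 \gamma_5 \gamma_3^2 \sum_{k=0}^{N} \|\nabla f(x_k)\|^2 \;\le\; f(x_0) - f(x_{N+1}) \;\le\; f(x_0) - \min_{x \in \mathcal{L}(f;x_0)} f(x) \;<\; \infty,$$
so the series on the left converges and its terms must tend to zero.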
This is what a user hopes for: the code delivers a sequence such that termination with
$$\|\nabla f(x_k)\| \le \epsilon \quad \text{or} \quad \|x_k - x_{k+1}\| \le \epsilon$$

delivers a result "near" a true solution. Up to a second order term the error $x^* - x_k$ will be
$$-\bigl(\nabla^2 f(x_k)\bigr)^{-1} \nabla f(x_k),$$

hence if the method delivers a $d_k$ such that
$$\frac{\|(\nabla^2 f(x_k) - A_k)\, d_k\|}{\|d_k\|} \to 0$$

(the so called Broyden-Dennis-Moré condition) then termination on the grounds of a small difference in $x_k$ will also be a reliable one.
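A minimal sketch of such a combined termination test (Python/NumPy; names and tolerances are illustrative, and the available approximation $A_k$ stands in for the exact Hessian, which the condition above justifies):

```python
import numpy as np

def terminate(g, x_old, x_new, A=None, eps=1e-8):
    """Stop on a small gradient or a small step; if a Hessian
    approximation A is at hand, also return the first order error
    estimate x* - x_k = -A^{-1} g."""
    small_gradient = np.linalg.norm(g) <= eps
    small_step = np.linalg.norm(x_new - x_old) <= eps
    err = -np.linalg.solve(A, g) if A is not None else None
    return small_gradient or small_step, err
```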
We have already mentioned the construction of $d_k$ above. The codes in use satisfy this, sometimes with $A_k$ constructed explicitly, sometimes with $A_k$ existing but not known explicitly. We now turn to the computation of the stepsize $\sigma_k$. Two methods are in use here. The simplest one is known as backtracking or the Goldstein-Armijo descent test and works as follows:
Given a strictly gradient related direction $d_k$, choose some initial stepsize $\sigma_{0,k}$ less than some universal constant $\gamma_0$ (often $= 1$). Then take as $\sigma_k$ the largest value $\beta^j \sigma_{0,k}$, $j = 0, 1, 2, \ldots$, which satisfies
$$f(x_k) - f(x_k - \beta^j \sigma_{0,k}\, d_k) \ge \delta\, \beta^j \sigma_{0,k}\, \nabla f(x_k)^T d_k$$

with parameters $0 < \beta < 1$ (often $\beta = 1/2$) and $0 < \delta < 1$, preferably $0 < \delta < 1/2$.
[Figure: backtrackpict.png]
The efficiency of this method depends on the proper choice of $\sigma_{0,k}$. For example, the minimizer of the parabola interpolating the values
$$p(0),\; p(1),\; p'(0), \qquad \text{where } p(s) = f(x_k - s\, d_k),$$

is often useful. The picture above shows the graph of a function, its tangent line at $\sigma = 0$, the "Armijo line" used in the test, two intervals of admissible stepsizes, and the local upper bound on f: a parabola into which the upper bound L for $\|\nabla^2 f(x)\|$ on $\mathcal{L}(f;x_0)$ enters. This parabola is used to prove that there always exist stepsizes satisfying the principle of sufficient decrease.
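A compact sketch of the backtracking scheme together with the parabolic choice of $\sigma_{0,k}$ just described (Python; g, x, d are NumPy vectors; the parameters $\delta$, $\beta$ follow the text, the safeguard is an illustrative addition):

```python
def backtracking(f, g, x, d, delta=0.01, beta=0.5, sigma0_max=1.0):
    """Goldstein-Armijo backtracking for x_{k+1} = x_k - sigma_k d_k,
    where g is the gradient at x and d a strictly gradient related
    direction, so that slope = g^T d > 0."""
    fx = f(x)
    slope = g @ d
    # initial stepsize: minimizer of the parabola through p(0), p(1) and
    # p'(0) = -slope, namely slope / (2 (p(1) - p(0) + slope))
    curv = f(x - d) - fx + slope
    sigma = min(slope / (2.0 * curv), sigma0_max) if curv > 0 else sigma0_max
    # take the largest beta^j sigma_{0,k} passing the descent test
    while fx - f(x - sigma * d) < delta * sigma * slope:
        sigma *= beta
        if sigma < 1e-16:   # safeguard; cannot occur under the assumptions
            break
    return sigma
```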
The drawback of the dependency on $\sigma_{0,k}$ is avoided by the so called Powell-Wolfe stepsize, which is independent of such a choice: here a stepsize $\sigma_k$ is computed such that
$$f(x_k) - f(x_k - \sigma_k d_k) \ge \delta\, \sigma_k\, \nabla f(x_k)^T d_k \quad \text{and} \quad \nabla f(x_k - \sigma_k d_k)^T d_k \le \kappa\, \nabla f(x_k)^T d_k$$

with $0 < \delta < \kappa < 1$. Sometimes the strong Powell-Wolfe conditions are used, where the absolute value of the left hand side of the second inequality is taken. This means that the stepsize must come close to a first order stationary point of f along the search line.
[Figure: powellwolfepict.png]
The picture above shows the graph of a function and two admissible intervals of stepsizes. Observe that, due to the second condition and in contrast to the backtracking scheme, arbitrarily small stepsizes are not admissible here. If the strong conditions were used as well, only two small intervals around the two local minimizers would remain admissible.
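A sketch of one classical bracketing/bisection realization of the Powell-Wolfe conditions (Python; x, d are NumPy vectors; this is a standard construction, not necessarily the one used in the codes of this chapter):

```python
def powell_wolfe(f, grad, x, d, delta=0.01, kappa=0.9):
    """Powell-Wolfe stepsize for x_{k+1} = x_k - sigma_k d_k, assuming
    slope = grad(x)^T d > 0 and 0 < delta < kappa < 1."""
    fx = f(x)
    slope = grad(x) @ d

    def armijo(s):       # principle of sufficient decrease
        return fx - f(x - s * d) >= delta * s * slope

    def curvature(s):    # second Powell-Wolfe condition
        return grad(x - s * d) @ d <= kappa * slope

    # phase 1: bracket [lo, hi] with the descent test holding at lo and
    # failing at hi (boundedness of f from below stops the doubling)
    if armijo(1.0):
        lo = 1.0
        while armijo(2.0 * lo):
            lo *= 2.0
            if curvature(lo):
                return lo
        hi = 2.0 * lo
    else:
        hi = 1.0
        while not armijo(hi / 2.0):
            hi /= 2.0
        lo = hi / 2.0
    # phase 2: bisect, keeping armijo(lo) true and armijo(hi) false,
    # until the curvature condition holds at lo as well
    while not curvature(lo):
        mid = 0.5 * (lo + hi)
        if armijo(mid):
            lo = mid
        else:
            hi = mid
    return lo
```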

