The LBFGS method (Byrd, Nocedal and Schnabel)
- In this application we search for a local minimizer x_opt of a smooth nonlinear
function f on R^n. This means
f(x_opt) <= f(x) for all x in some neighborhood of x_opt.
Strictly speaking, the method only tries to find a stationary point of f, i.e.
a zero of ∇f(x). If f is locally convex, this will indeed
be a minimizer; convexity, however, is not needed for this method.
We require f to be in C^1, and the gradient is used explicitly,
but second partial derivatives never enter this method.
For a bit of theory concerning the background, look here.
- The general structure of the method is as follows: given an initial guess x_0
(which need not be a ''good'' one) and an arbitrary positive definite symmetric matrix
A_0, one proceeds iteratively:
x_{k+1} = x_k - σ_k d_k.
Here d_k is computed as
d_k = (A_k)^{-1} ∇f(x_k)
with an explicit update of this inverse.
- As the name of the method already indicates, it is based on the (full) BFGS method,
but here only a limited number of back values of the differences of the x_{k-j}
and ∇f(x_{k-j}) is used, together with a fixed initialization
matrix (chosen as a multiple of the unit matrix, which makes things especially easy),
in order to define (A_k)^{-1}. There exist several formulas
for such a limited memory update; we use here the one published by Byrd, Nocedal and Schnabel.
You find its details here.
(An illustrative sketch of how such a direction can be computed is given after this list.)
- As you can see from that formula, (A_k)^{-1} is not represented by a
full n × n matrix, but rather by a few scalars, two n × m and two
m × m matrices, where m is the ''memory length'', the number of back differences
considered.
- The stepsize σ_k is computed by a code which uses a combination of
quadratic and cubic interpolation with the aim of finding a stepsize
which satisfies the strong Powell-Wolfe conditions (we omit the index k):
|∇f(x_next)^T d| <= κ |∇f(x)^T d|,
f(x) - f(x_next) >= σ δ |∇f(x)^T d|.
Here δ and κ are two parameters with default
values 0.01 and 0.9. They must always satisfy
0 < δ < κ < 1.
A very small κ means an almost exact one-dimensional minimization in the direction -d_k.
(A sketch of this acceptance test is given after this list.)
- Growth in norm and definiteness of the matrix update are checked internally, and a
restart is performed if necessary. Hence this method always finds a stationary point, provided the level
set of f defined by x_0 is bounded and f has only finitely many stationary
values there. Of course the final precision will depend on the conditioning of the Hessian and the precision
of the function evaluations. Internally we assume here that these are correct up to machine precision.
- Due to cancellation effects the update formula is, unfortunately, numerically unstable (like
others of this kind). Hence it makes little sense to choose a memory length larger than about 20.
- The numerical code behind this is LBFGSB.f from netlib, designed by Ciyou Zhu
in collaboration with R. H. Byrd, P. Lu and J. Nocedal.
- Termination occurs if
  - the maximum number of steps (4000) is reached,
  - the number of function and gradient evaluations exceeds 40000,
  - ||∇f(x_k)|| < ε (1 + |f(x_k)|),
  - the stepsize algorithm fails to find a sufficient decrease, or
  - f does not decrease by more than 10^{-12} in relative terms.
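To make the iteration above concrete, here is a minimal Python sketch of how a direction
d_k = (A_k)^{-1} ∇f(x_k) can be formed from the stored back differences. Note that this sketch
uses the classical two-loop recursion with a multiple of the unit matrix as initialization;
the applet itself uses the compact representation of Byrd, Nocedal and Schnabel inside LBFGSB.f,
so the names and structure below are purely illustrative.

```python
import numpy as np

def lbfgs_direction(grad, s_list, y_list):
    # Two-loop recursion: returns d = H*grad, where H plays the role of (A_k)^{-1}
    # and is built from the stored pairs s_j = x_{j+1} - x_j, y_j = grad_{j+1} - grad_j.
    q = grad.astype(float)
    rhos = [1.0 / np.dot(y, s) for s, y in zip(s_list, y_list)]
    alphas = []
    # first loop: run over the stored pairs, newest first
    for s, y, rho in zip(reversed(s_list), reversed(y_list), reversed(rhos)):
        a = rho * np.dot(s, q)
        alphas.append(a)
        q = q - a * y
    # fixed initialization: a multiple of the unit matrix (gamma * I)
    if s_list:
        gamma = np.dot(s_list[-1], y_list[-1]) / np.dot(y_list[-1], y_list[-1])
    else:
        gamma = 1.0
    r = gamma * q
    # second loop: oldest pair first
    for (s, y, rho), a in zip(zip(s_list, y_list, rhos), reversed(alphas)):
        b = rho * np.dot(y, r)
        r = r + (a - b) * s
    return r  # search direction d_k; the step is then x_{k+1} = x_k - sigma_k * d_k
```

Keeping only the m most recent (s, y) pairs in s_list and y_list is what makes the memory
''limited''; with m around 10 the cost per iteration stays of order m*n.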
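The acceptance test for the stepsize can likewise be written down directly from the two
Powell-Wolfe conditions above. This is only the test itself, not the quadratic/cubic
interpolation that searches for an acceptable σ; the callables f and grad_f and all names
are assumptions of this sketch.

```python
import numpy as np

def wolfe_ok(f, grad_f, x, d, sigma, delta=0.01, kappa=0.9):
    # Strong Powell-Wolfe test for a trial stepsize sigma along -d
    # (the step is x_next = x - sigma*d, as in the iteration above).
    # delta and kappa must satisfy 0 < delta < kappa < 1.
    x_next = x - sigma * d
    slope = abs(np.dot(grad_f(x), d))                      # |grad f(x)^T d|
    curvature_ok = abs(np.dot(grad_f(x_next), d)) <= kappa * slope
    decrease_ok = f(x) - f(x_next) >= sigma * delta * slope
    return curvature_ok and decrease_ok
```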
Input
- You have the choice between 14 predefined functions, for which you can read a short
description of the examples here.
You may also define a function f of your own,
depending on the n variables x(1),...,x(n).
You may also specify the initial guess x_0 yourself.
- Either you use the standard parameter settings for the termination criteria or you specify them
yourself. For convenience these have been reduced to ε, δ, κ, iprint and m.
The standard termination test is
||∇f|| <= ε (1 + |f|) and
||x - x_prev|| <= ε (||x|| + 1), with ε = 1.0e-7.
The default for iprint is zero (no intermediate output) and for m it is min{n/2+1, 10}.
(A small sketch of this test is given after this list.)
- For the predefined cases you have the possibility to replace the standard initial point by
another one. The required dimension can be read from the input form.
- You may request a contour plot of f around the best point found,
with two of the n variables varying and the other components of x fixed at their
current values. You can specify the size of the region. Be careful: often these functions grow
so rapidly that on a large region you may ''see'' nothing useful.
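For orientation, the standard termination test and the default memory length quoted above can
be written as a small sketch; ε = 1.0e-7 and m = min{n/2+1, 10} are the values from the text,
while the function names and the use of integer division are assumptions.

```python
import numpy as np

def standard_termination(f_val, grad, x, x_prev, eps=1.0e-7):
    # Default stopping test: gradient small relative to |f| and
    # change in x small relative to ||x||.
    grad_small = np.linalg.norm(grad) <= eps * (1.0 + abs(f_val))
    step_small = np.linalg.norm(x - x_prev) <= eps * (np.linalg.norm(x) + 1.0)
    return grad_small and step_small

def default_memory(n):
    # Default memory length m = min{n/2 + 1, 10}.
    return min(n // 2 + 1, 10)
```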
Output
- The CPU time used. A value of zero means less than 0.01 seconds.
- The reason for termination.
- The best computed values f(x_opt) and
x(1)_opt,...,x(n)_opt, together with the gradient components.
- The number of function and gradient evaluations used.
- Two plots in semi-logarithmic format showing f_k - f_opt
and ||∇f(x_k)||.
- The contour plot, if requested.