A Quasi-Newton least squares method
- This method solves nonlinear least squares problems. In abstract terms this means minimizing the Euclidean length of a vector function F of m components with respect to the variable vector x with n components, where we assume here that m >= n. In applications the situation is similar to many applications of linear least squares: for a set of data $(t_i, y_i)$ one seeks the parameter vector x of a function model f(x;t) such that the sum of squares of the deviations $f(x;t_i) - y_i$ is minimal. The components of F are made up from these deviations (or ''residuals''). Hence
$$S(x) = \tfrac{1}{2}\,\|F(x)\|_2^2, \qquad F_i(x) = f(x;t_i) - y_i.$$
The function f(x;t) depends nonlinearly on x here; a small sketch of these definitions in code follows below.
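To make the definitions concrete, here is a minimal sketch in Python (the actual code behind this page is FORTRAN); the one-term exponential model and all names here are our own illustration, not part of NL2SOL:

```python
import numpy as np

# Hypothetical model f(x;t) = x1*exp(x2*t) with parameter vector x = (x1, x2).
def model(x, t):
    return x[0] * np.exp(x[1] * t)

# Residual vector F(x): one component f(x;t_i) - y_i per data point.
def residuals(x, t, y):
    return model(x, t) - y

# Objective S(x) = (1/2)*||F(x)||_2^2, half the sum of squared residuals.
def objective(x, t, y):
    F = residuals(x, t, y)
    return 0.5 * F @ F
```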
- This method tries to find a zero of the gradient of S(x), which of course is only a necessary optimality condition. Since this is achieved by a monotonic decrease of S, with a decrease depending on the norm of the gradient, it is hoped that indeed a minimizer is found.
Newton's method itself could also be used for finding a zero of $\nabla S(x)$. This would require the computation of the Hessian matrix of S (the matrix of second partial derivatives), which reads, using the Jacobian $J_F(x)$ of the vector function F,
$$\nabla^2 S(x) = J_F(x)^T J_F(x) + \sum_{i=1}^{m} F_i(x)\,\nabla^2 F_i(x).$$
Far from the minimizer this matrix could be indefinite and fail to produce a direction of descent for S. This, besides the necessity of computing the second partial derivatives of F and the fact that saddle points and maxima are also attraction points of Newton's method, is the reason why this method, which is a modification of the Gauss-Newton method, is preferred.
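The decomposition of the Hessian can be checked numerically. The following sketch (our own illustration, using finite differences for all derivatives) verifies that $J^T J$ plus the residual-weighted second-derivative sum agrees with the Hessian of S for the toy model above:

```python
import numpy as np

def fd_jacobian(F, x, h=1e-6):
    """Forward-difference Jacobian of a vector function F at x."""
    F0 = F(x)
    J = np.empty((F0.size, x.size))
    for j in range(x.size):
        xp = x.copy(); xp[j] += h
        J[:, j] = (F(xp) - F0) / h
    return J

def fd_hessian(g, x, h=1e-5):
    """Central-difference Hessian of a scalar function g at x."""
    n = x.size
    H = np.empty((n, n))
    for i in range(n):
        for j in range(n):
            xpp = x.copy(); xpp[i] += h; xpp[j] += h
            xpm = x.copy(); xpm[i] += h; xpm[j] -= h
            xmp = x.copy(); xmp[i] -= h; xmp[j] += h
            xmm = x.copy(); xmm[i] -= h; xmm[j] -= h
            H[i, j] = (g(xpp) - g(xpm) - g(xmp) + g(xmm)) / (4 * h * h)
    return H

# Toy data and model f(x;t) = x1*exp(x2*t), as in the sketch above.
t = np.linspace(0.0, 1.0, 5)
y = np.array([1.0, 1.3, 1.7, 2.3, 3.1])
F = lambda x: x[0] * np.exp(x[1] * t) - y
S = lambda x: 0.5 * F(x) @ F(x)

x = np.array([1.0, 1.0])
J = fd_jacobian(F, x)
# Gauss-Newton part plus the second-order part sum_i F_i * Hess(F_i):
H2 = sum(F(x)[i] * fd_hessian(lambda z: F(z)[i], x) for i in range(t.size))
print(np.allclose(J.T @ J + H2, fd_hessian(S, x), atol=1e-3))
```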
- In the Gauss-Newton method only the always positive semidefinite part $J_F(x)^T J_F(x)$ of the Hessian is used. If the residuals are ''large'', this part may deviate considerably from the true Hessian and the convergence can be very slow. Therefore one tries here to approximate the second part by a quasi-Newton approximation
$$C \approx \sum_{i=1}^{m} F_i(x)\,\nabla^2 F_i(x)$$
in the hope of ultimately obtaining superlinear convergence (in the case of a positive definite true Hessian). The direction of descent for S is then defined using the so-called trust region method: one computes a correction d for the current best point $x_0$ by minimizing the quadratic model for S,
$$S(x_0 + d) \approx S(x_0) + \nabla S(x_0)^T d + \tfrac{1}{2}\, d^T \left(J^T J + C\right) d, \qquad \text{subject to the constraint } \|d\| \le \Delta,$$
with adaptation of $\Delta$ from step to step (a sketch of this subproblem follows below). If such a step does not seem promising, the method falls back to a Gauss-Newton step.
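The constrained quadratic subproblem is commonly handled by introducing a Levenberg-Marquardt-like parameter $\lambda \ge 0$ and adjusting it until the step length matches the radius. The following sketch is a generic textbook variant (plain bisection on $\lambda$), not NL2SOL's actual routine:

```python
import numpy as np

def trust_region_step(B, g, delta, lam_hi=1e8, iters=60):
    """Approximately minimize g^T d + 0.5 d^T B d subject to ||d|| <= delta.

    B is the model Hessian (here J^T J + C), g the gradient J^T F.
    Solves (B + lam*I) d = -g and bisects on lam until ||d|| ~ delta.
    Generic sketch; assumes B + lam*I is positive definite for the
    lam values visited.
    """
    I = np.eye(g.size)
    d = np.linalg.solve(B + 1e-12 * I, -g)
    if np.linalg.norm(d) <= delta:   # constraint inactive: full model step
        return d
    lo, hi = 0.0, lam_hi             # scalar equation ||d(lam)|| = delta
    for _ in range(iters):
        lam = 0.5 * (lo + hi)
        d = np.linalg.solve(B + lam * I, -g)
        if np.linalg.norm(d) > delta:
            lo = lam                 # step too long: more regularization
        else:
            hi = lam                 # step short enough: less regularization
    return d
```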
- The initial guess for C is usually zero. In step k, given $C_k$ and the correction $d_k$, one computes the update $C_{k+1}$ using the secant equation
$$C_{k+1} d_k \;=\; \sum_{i=1}^{m} F_{i,k+1}\,\nabla^2 F_{i,k+1}\, d_k \;\approx\; \sum_{i=1}^{m} F_{i,k+1}\left(\nabla F_{i,k+1} - \nabla F_{i,k}\right) \;=\; \left(J_{k+1}^T - J_k^T\right) F_{k+1} \;=\; y_k.$$
This of course, together with the symmetry, does not fix $C_{k+1}$, and one adds the requirement of minimal deviation from $C_k$ in the sense of the Frobenius norm, arriving at the so-called PSB (Powell symmetric Broyden) update
$$C_{k+1} = C_k + \frac{r_k v_k^T + v_k r_k^T}{d_k^T v_k} - \frac{d_k^T r_k}{\left(d_k^T v_k\right)^2}\, v_k v_k^T$$
with
$$r_k = y_k - C_k d_k, \qquad v_k = \nabla S_{k+1} - \nabla S_k.$$
Since experimentation showed that a scaling of $C_k$ would be useful, in the above formula $C_k$ is replaced by $\tau_k C_k$ with
$$\tau_k = \min\left\{\, 1,\; \frac{|d_k^T y_k|}{|d_k^T C_k d_k|} \,\right\}$$
(see the code sketch below).
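In code, the scaled PSB update is roughly the following (a direct transcription of the formulas above into Python; the safeguards against tiny denominators are our own additions):

```python
import numpy as np

def psb_update(C, d, y, v, eps=1e-30):
    """Scaled PSB update of the second-order Hessian part C.

    d: step d_k, y: secant right-hand side (J_{k+1}^T - J_k^T) F_{k+1},
    v: gradient difference grad S_{k+1} - grad S_k.
    """
    dCd = d @ C @ d
    dy = d @ y
    # Scaling tau_k = min(1, |d^T y| / |d^T C d|), applied to C first.
    tau = min(1.0, abs(dy) / abs(dCd)) if abs(dCd) > eps else 1.0
    C = tau * C
    dv = d @ v
    if abs(dv) <= eps:
        return C                     # skip the update on a tiny denominator
    r = y - C @ d                    # r_k = y_k - (tau_k C_k) d_k
    return (C + (np.outer(r, v) + np.outer(v, r)) / dv
              - (d @ r) / dv**2 * np.outer(v, v))
```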
- The introduction of the trust region radius results in a regularization of the approximate Hessian if the constraint becomes active, and the computation of the regularization parameter reduces to a scalar nonlinear equation, as in the Levenberg-Marquardt method.
- The trust region radius is adapted, and the decision whether to use a pure Gauss-Newton step or this quasi-Newton step is made dependent on the prospective results of a step with the current radius. We omit these details here; a generic variant of the radius adaptation is sketched below.
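NL2SOL's actual adaptation rules are more elaborate; purely for orientation, a common generic scheme compares the achieved with the predicted decrease and shrinks or grows $\Delta$ accordingly:

```python
def update_radius(delta, actual_decrease, predicted_decrease, step_norm):
    """Generic trust region radius update (not NL2SOL's actual rule).

    rho compares the achieved decrease in S with the decrease
    predicted by the quadratic model.
    """
    rho = actual_decrease / predicted_decrease
    if rho < 0.25:                   # model was poor: shrink the radius
        return 0.5 * step_norm
    if rho > 0.75:                   # model was good: allow a larger step
        return max(delta, 2.0 * step_norm)
    return delta                     # otherwise keep the radius
```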
- The order of convergence of this process is superlinear. However, due to cancellation effects in the computation of the quasi-Newton correction and due to early termination, this may not be visible in practice.
- If one requires an output of intermediate results, one gets a table whose columns need some explanation (a transcription of the last three into code follows below):
- it is the number of the step
- nf is the number of function evaluations (mostly identical to the number of Jacobian evaluations)
- f is half the sum of squares, i.e. $S(x_k)$
- reldf is $(S_k - S_{k+1})/S_k$ (zero for a zero denominator)
- preldf is the predicted relative change in S using the quadratic model
- reldx is $\max_{i=1,\dots,n} |x_{k,i} - x_{k+1,i}| \,/\, \max_{j=1,\dots,n}\left(|x_{k,j}| + |x_{k+1,j}|\right)$
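In code the last three columns could be computed as follows (our own transcription of the definitions above; `S_pred` stands for the value of S predicted by the quadratic model and is a name we introduce here):

```python
import numpy as np

def protocol_columns(S_old, S_new, S_pred, x_old, x_new):
    """reldf, preldf, reldx as defined above (our transcription)."""
    reldf  = (S_old - S_new) / S_old if S_old != 0.0 else 0.0
    preldf = (S_old - S_pred) / S_old if S_old != 0.0 else 0.0
    reldx  = (np.max(np.abs(x_new - x_old))
              / np.max(np.abs(x_old) + np.abs(x_new)))
    return reldf, preldf, reldx
```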
- There are several types of termination (a schematic version of the cascade follows below):
- ''Absolute function convergence'' means $S_k \le 10^{-14}$
- ''x-convergence'' means reldx ≤ xtol with user input xtol
- ''relative function convergence'' means reldf ≤ ftol with user input ftol
- ''singular convergence'' means that none of the three cases above applies but the predicted decrease in S is less than ftol*S
- ''false convergence'' means that none of the four cases above applies but reldx $\le 2.2 \times 10^{-16}$. This may occur if either the model is not smooth or the precision requirements of the user are too strong.
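A schematic version of this cascade of tests (our own paraphrase, not the actual NL2SOL logic) might look like:

```python
def termination_status(S, reldx, reldf, pred_decrease, xtol, ftol):
    """Schematic cascade of the stopping tests above (our paraphrase;
    the real code attaches extra side conditions to the first tests,
    so that ''false convergence'' can actually be reached)."""
    if S <= 1e-14:
        return "absolute function convergence"
    if reldx <= xtol:
        return "x-convergence"
    if reldf <= ftol:
        return "relative function convergence"
    if pred_decrease < ftol * S:
        return "singular convergence"
    if reldx <= 2.2e-16:
        return "false convergence"
    return "continue"
```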
- The numerical code used here is NL2SOL (netlib/toms/573) in an upgraded FORTRAN version.
- More details of the implementation are described in the paper
J.E. Dennis, D.M. Gay and R.E. Welsch: An adaptive nonlinear least-squares algorithm. ACM TOMS 7 (1981), 348-368.
There are two input forms: one with a predefined exponential fitting problem and one where you can define the fit yourself.

Input: the predefined exponential fit
- Here the data model f(x;t) is a sum of exponentials such that
$$F_i(x) = a_0 + a_1 e^{b_1 t_i} + \dots + a_n e^{b_n t_i} - y_i$$
with $x = (a_0, a_1, b_1, \dots, a_n, b_n)^T$.
- You specify the number of exponentials n.
- You specify the 2n+1 ''true'' coefficients $x = (a_0, a_1, b_1, \dots, a_n, b_n)^T$. Internally these coefficients are stored as the components x(1), x(2), x(2+n), x(3), x(3+n), ..., x(n+1), x(2n+1).
- You specify an interval [a,b] for t, in which the data will be generated
equidistantly.
- You specify the number m of data points. Important: 2n+1 <= m <= 200 !
- A variable level is used for the generation of artificial ''measurement errors''. Important: 0 <= level <= 1 ! Then m data points $(t_i, y_i)$ are computed using the model with the ''true'' coefficients, and afterwards the values $y_i$ are perturbed using random numbers $r_i$ in [0,1] via the formula
$$y_i = y_{i,\mathrm{orig}} + \left(r_i - \tfrac{1}{2}\right) \cdot \mathrm{level} \cdot 2 \cdot \max_j |y_{j,\mathrm{orig}}|$$
(a sketch of this data generation follows below). Subsequently the model is fitted to the perturbed data. Attention: don't hope for wonders! Fitting by exponentials is a notoriously ill-conditioned problem. This becomes clear if you consider it as a problem of parameter identification for a linear differential equation with constant coefficients, with the solution given by a discrete and noisy set of data.
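In Python the data generation could look roughly like this (the actual code is FORTRAN; all names here are our own):

```python
import numpy as np

def make_data(a, b, m, coeff, level, rng=np.random.default_rng()):
    """Generate perturbed data for the exponential model.

    coeff = (a0, a1, b1, ..., an, bn) are the ''true'' coefficients.
    """
    t = np.linspace(a, b, m)              # equidistant points in [a, b]
    a0, rest = coeff[0], np.asarray(coeff[1:])
    amps, exps = rest[0::2], rest[1::2]   # the a_j and b_j
    y_orig = a0 + np.sum(amps[:, None] * np.exp(np.outer(exps, t)), axis=0)
    r = rng.random(m)                     # r_i uniform in [0, 1]
    y = y_orig + (r - 0.5) * level * 2.0 * np.max(np.abs(y_orig))
    return t, y
```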
- You have the possibility to change some of the parameters of the code:
- the required precision of the solution vector xtol in [1.0d-12,1.0d-2] (understood as a relative change, of course)
- an upper bound ftol for the relative change in the mean square error which results in termination, in [1.0d-14,1.0d-2]; the absolute bound for this is fixed to 1.0d-14
- the maximal number of function evaluations allowed
- the maximal number of steps allowed, maxstep >= n
- the amount of printed output iprint: -1 = none, 0 = only final result, 1 = short iteration protocol plus final result

Output: the predefined exponential fit
- If required, a short iteration protocol.
- The cpu time used.
- A table with the original and the computed (fitted) coefficients.
- The sum of squares of the data perturbations.
- The sum of squares of the residuals after fitting.
- A plot which shows the perturbed data and the fit function.

Questions: exponential fit
- How does an increase of the perturbation influence the performance of the method?
- How sensitively do the coefficients react to an increase of the perturbation level for different choices of the exponents?
- What happens if two of your chosen exponents are nearly equal, or if the decay is very slow?

Input: user defined case
- Here you can define the model function f(x;t) yourself.
- You can either fit your own data, in which case your input for x serves as the initial guess, or have the data generated internally. In the latter case your input for x serves as the ''true'' parameter value: data are computed from it using your model function and then perturbed randomly. The perturbed data are subsequently used for fitting.
- You specify the dimension n of x. Important:
n <= 20 !
- You specify the model function f(x;t) by a piece of code depending on x(1),...,x(n) and t, for example an expression like x(1)*exp(x(2)*t)+x(3). This must obey the rules of FORTRAN and the restrictions of JAKEF.
- You specify the initial vector x(1),...,x(n) with n entries.
- The Jacobian is produced internally by analytic differentiation, using the code JAKEF from NETLIB.
- You have the choice to specify explicitly your data or to
let the program generate data artificially using your specifications.
In the latter case you define:
- An interval [a,b] from which the variable t is drawn equidistantly.
- The number m of data points. Important: n <= m <= 200 !
- An error level level. Important: 0 <= level <= 1 ! Then m data points $(t_i, y_{i,\mathrm{orig}} = f(x;t_i))$ are computed, and subsequently the data $y_i$ are computed from
$$y_i = y_{i,\mathrm{orig}} + \left(r_i - \tfrac{1}{2}\right) \cdot \mathrm{level} \cdot 2 \cdot \max_j |y_{j,\mathrm{orig}}|$$
where the $r_i$ are random numbers in [0,1]. These data are then fitted.
Direct input of data points:
- You specify their number m. Important: n <= m <= 200 !
- You type the data into the prepared textarea, the data items t(i),y(i) separated by comma or blank, not necessarily one pair per line, for example
0,0 1,1 2,1.4 ..
(a small parsing sketch follows below).
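A sketch of how such free-form input can be parsed (in Python, purely illustrative; the page's actual parser may differ):

```python
def parse_pairs(text):
    """Split free-form 't,y' input on commas and whitespace into pairs."""
    tokens = text.replace(",", " ").split()
    values = [float(tok) for tok in tokens]
    if len(values) % 2 != 0:
        raise ValueError("odd number of values: expected t,y pairs")
    return list(zip(values[0::2], values[1::2]))

# Example: parse_pairs("0,0 1,1 2,1.4") -> [(0.0, 0.0), (1.0, 1.0), (2.0, 1.4)]
```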
- Finally you have the possibility to change some of the parameters of the code:
- the required precision of the solution vector xtol in [1.0d-12,1.0d-2]
- an upper bound ftol for the change in the sum of squares which results in termination, in the range [1.0d-14,1.0d-2]
- the maximal number of function evaluations allowed
- the maximal number of steps allowed maxstep >= n
- the amount of printed output iprint: -1 = none,
0 = only final result, 1 = short iteration protocol plus final result
- two components of x to be printed in the iteration protocol, if this is desired

Output: user defined fit
- If required, a short iteration protocol.
- The cpu time used by the minimizer.
- A table of the original (resp. initial) coefficients and the fitted ones.
- The sum of squares of the generated errors (if applicable).
- The sum of squares of the residuals for the final solution.
- A plot which shows the fit data and the fit function.

Questions: user defined case
- What do you observe if the fit is ''bad'' (i.e. the residuals resp. the perturbations are large)?
- When do you observe large perturbations in the coefficients?
- How do these effects depend on the model function?