A Quasi-Newton least squares method

Directly to the input form: exponential fit

Directly to the input form: user defined case

 
  • This method solves nonlinear least squares problems. In abstract terms this means minimizing the Euclidean length of a vector function F of m components with respect to the variable vector x with n components, where we assume here that m >= n. In applications the situation is similar to many applications of linear least squares: for a set of data (t_i, y_i) one seeks the parameter x of a function model f(x;t) such that the sum of squares of the deviations f(x;t_i) - y_i is minimal. The components of F are made up from these deviations (or ''residuals''). Hence
    S(x) = (1/2)·||F(x)||₂²
    and
    F_i(x) = f(x;t_i) - y_i.
    The function f(x;t) depends nonlinearly on x here.
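To make the notation concrete: with made-up data and a one-term exponential model, the residual vector F and the objective S can be set up as in the following Python sketch (the data, the model, and all names are illustrative and not part of the NL2SOL interface):

```python
import numpy as np

# Made-up data and a simple model f(x; t) = x[0]*exp(x[1]*t); purely illustrative.
t = np.array([0.0, 1.0, 2.0])
y = np.array([1.0, 2.0, 5.0])

def residuals(x):
    """F_i(x) = f(x; t_i) - y_i."""
    return x[0] * np.exp(x[1] * t) - y

def S(x):
    """S(x) = (1/2) * ||F(x)||_2^2."""
    F = residuals(x)
    return 0.5 * F @ F
```

For x = (1, 0) the residuals are (0, -1, -4), so S = 8.5.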
  • This method tries to find a zero of the gradient of S(x), which of course is only a necessary optimality condition. Since this is achieved by a monotonic decrease of S, using a decrease depending on the norm of the gradient, it is hoped that indeed a minimizer is found. Newton's method itself could also be used for finding a zero of ∇S(x). This would require the computation of the Hessian matrix of S (the matrix of second partial derivatives). Using the Jacobian J_F(x) of the vector function F, this reads
    ∇²S(x) = J_F(x)ᵀ J_F(x) + Σ_{i=1}^m F_i(x)·∇²F_i(x).
    Far from the minimizer this could be indefinite and fail to produce a direction of descent for S. This, besides the necessity of computing the second partial derivatives of F and the fact that saddle points and maxima are also attraction points of Newton's method, is the reason why this method, which is a modification of the Gauss-Newton method, is preferred.
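The structure of this Hessian can be verified numerically on a one-parameter toy problem, where all quantities are scalars or vectors over the data points (an illustrative sketch with invented data, not related to the code behind this page):

```python
import numpy as np

# Toy model f(x; t) = exp(x*t) with invented data; n = 1.
t = np.array([0.0, 0.5, 1.0])
y = np.array([1.0, 1.0, 2.0])

def F(x):  return np.exp(x * t) - y       # residuals F_i(x)
def J(x):  return t * np.exp(x * t)       # first derivatives dF_i/dx
def D2(x): return t**2 * np.exp(x * t)    # second derivatives d2F_i/dx2

def S(x):  return 0.5 * F(x) @ F(x)

x0 = 0.3
hess = J(x0) @ J(x0) + F(x0) @ D2(x0)     # J^T J + sum_i F_i * second derivative
h = 1e-5
hess_fd = (S(x0 + h) - 2.0*S(x0) + S(x0 - h)) / h**2   # central difference of S
```

The two numbers agree up to the discretization error of the central difference.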
  • In the Gauss-Newton method only the always positive semidefinite part J_F(x)ᵀ J_F(x) of the Hessian is used. If the residuals are ''large'' then this may deviate considerably from the true Hessian and the convergence can be very slow. Therefore one tries here to approximate the second part by a quasi-Newton approximation
    C ≈ Σ_{i=1}^m F_i(x)·∇²F_i(x)
    in the hope of obtaining ultimately superlinear convergence (in the case of a positive definite true Hessian). The direction of descent for S is then defined using the so-called trust region method: one computes a correction d for the current best point x₀ minimizing the quadratic model for S
    S(x₀+d) ≈ S(x₀) + ∇S(x₀)ᵀd + (1/2)dᵀ(JᵀJ + C)d
    subject to the constraint
    ||d|| ≤ Δ
    with adaptation of Δ from step to step. If such a step does not seem promising then the method falls back to a Gauss-Newton step.
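The constrained quadratic subproblem can be sketched as a search for a Levenberg-Marquardt-type multiplier λ: if the unconstrained minimizer of the model is short enough it is taken directly, otherwise λ is increased until ||d|| ≤ Δ. The following Python sketch is a didactic simplification, not the strategy implemented in NL2SOL, and it assumes JᵀJ + C is positive definite:

```python
import numpy as np

def trust_region_step(J, F, C, delta):
    """Minimize g^T d + (1/2) d^T (J^T J + C) d subject to ||d|| <= delta."""
    g = J.T @ F                          # gradient of S:  J^T F
    H = J.T @ J + C                      # model Hessian
    d = np.linalg.solve(H, -g)           # unconstrained model minimizer
    if np.linalg.norm(d) <= delta:
        return d                         # constraint inactive
    I = np.eye(len(g))
    lo, hi = 0.0, 1.0
    # enlarge hi until the regularized step is short enough, then bisect on lambda
    while np.linalg.norm(np.linalg.solve(H + hi * I, -g)) > delta:
        hi *= 2.0
    for _ in range(60):
        lam = 0.5 * (lo + hi)
        if np.linalg.norm(np.linalg.solve(H + lam * I, -g)) > delta:
            lo = lam
        else:
            hi = lam
    return np.linalg.solve(H + hi * I, -g)
```

This also shows the connection to the Levenberg-Marquardt method: an active constraint simply adds λ·I to the approximate Hessian.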
  • The initial guess for C usually is zero. In step k, given C_k and the correction d_k, one computes the update C_{k+1} using the secant equation
    C_{k+1} d_k = Σ_{i=1}^m F_{i,k+1} ∇²F_{i,k+1} d_k
    ≈ Σ_{i=1}^m F_{i,k+1} (∇F_{i,k+1} - ∇F_{i,k})
    = (J_{k+1}ᵀ - J_kᵀ) F_{k+1}
    = y_k .
    This of course, together with the symmetry, does not fix C_{k+1}, and one adds the requirement of minimal deviation from C_k in the sense of the Frobenius norm, arriving at the so-called PSB (Powell symmetric Broyden) update
    C_{k+1} = C_k + (r_k v_kᵀ + v_k r_kᵀ)/(d_kᵀ v_k) - ((d_kᵀ r_k)/(d_kᵀ v_k)²) v_k v_kᵀ
    with
    r_k = y_k - C_k d_k,    v_k = ∇S_{k+1} - ∇S_k.
    Since experimentation showed that a scaling of C_k is useful, in the above formula C_k is replaced by τ_k C_k with
    τ_k = min { 1 , |d_kᵀ y_k| / |d_kᵀ C_k d_k| }.
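In code the scaled PSB update reads as follows (an illustrative Python sketch of the formulas above, not the NL2SOL source); by construction the result is symmetric and satisfies the secant equation C_{k+1} d_k = y_k:

```python
import numpy as np

def psb_update(C, d, y, v):
    """Scaled PSB update: C is first scaled by tau, then corrected.
    d: step d_k, y: secant right-hand side y_k, v: gradient difference v_k."""
    dCd = d @ C @ d
    tau = min(1.0, abs(d @ y) / abs(dCd)) if dCd != 0.0 else 1.0
    C = tau * C
    r = y - C @ d                        # r_k, computed with the scaled C
    dv = d @ v
    return (C + (np.outer(r, v) + np.outer(v, r)) / dv
              - ((d @ r) / dv**2) * np.outer(v, v))
```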
  • The introduction of the trust region radius results in a regularization of the approximate Hessian if the constraint becomes active, and the computation of the regularization parameter is reduced to a scalar nonlinear equation, as in the Levenberg-Marquardt method.
  • The trust region radius is adapted, and the decision whether to use a pure Gauss-Newton step or this quasi-Newton step is made dependent on the prospective results of a step with the current radius. We omit these details here.
  • The order of convergence of this process is superlinear. However, due to cancellation effects in the computation of the quasi-Newton correction and early termination, this may not be visible.
  • If one requires an output of intermediate results, one gets a table whose contents need some explanation:
    1. it is the number of the step
    2. nf is the number of function evaluations (mostly identical to the number of Jacobian evaluations)
    3. f is the sum of squares, halved
    4. reldf is (S_k - S_{k+1})/S_k (zero for zero denominator)
    5. preldf is the predicted relative change in S using the quadratic model
    6. reldx is max_{i=1..n} |x_{k,i} - x_{k+1,i}| / max_{j=1..n} (|x_{k,j}| + |x_{k+1,j}|)
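The quantity reldx can be computed directly from two consecutive iterates (a one-line sketch of the formula above):

```python
import numpy as np

def reldx(x_old, x_new):
    """max_i |x_old_i - x_new_i| / max_j (|x_old_j| + |x_new_j|)."""
    return np.max(np.abs(x_old - x_new)) / np.max(np.abs(x_old) + np.abs(x_new))
```

For x_k = (1, 0) and x_{k+1} = (1, 1) this gives 1/2.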
  • There are several types of termination:
    1. ''Absolute function convergence'' means S_k ≤ 10⁻¹⁴
    2. ''x-convergence'' means reldx ≤ xtol with user input xtol
    3. ''relative function convergence'' means reldf ≤ ftol with user input ftol
    4. ''singular convergence'' means that none of the three cases above applies but the predicted decrease in S is less than ftol·S
    5. ''false convergence'' means that none of the four cases above applies but reldx ≤ 2.2 × 10⁻¹⁶. This may occur if either the model is not smooth or the precision requirements of the user are too strong.
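Checked in the order listed, the termination tests can be summarized as follows (a simplified sketch of the list above; the actual tests in NL2SOL are more involved):

```python
def termination(S_new, reldx_val, reldf_val, preldf_val, xtol, ftol):
    """Return the first applicable termination reason, or 'continue'."""
    if S_new <= 1.0e-14:
        return "absolute function convergence"
    if reldx_val <= xtol:
        return "x-convergence"
    if reldf_val <= ftol:
        return "relative function convergence"
    if preldf_val < ftol:
        return "singular convergence"
    if reldx_val <= 2.2e-16:
        return "false convergence"
    return "continue"
```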
  • The numerical code used here is NL2SOL (netlib/toms/573) in an upgraded FORTRAN version.
  • More details of the implementation are described in the paper
    J.E. Dennis, D.M. Gay and R.E. Welsch: An adaptive nonlinear least-squares algorithm. ACM TOMS 7 (1981), 348-368.
 

There are two input forms: one with a predefined exponential fitting problem and one where the user can define the fit himself.

 

Input: the predefined exponential fit

 
  • Here the data model f(x;t) is a sum of exponentials such that
    F_i(x) = a_0 + a_1*exp(b_1*t_i) + ... + a_n*exp(b_n*t_i) - y_i with x = (a_0, a_1, b_1, ..., a_n, b_n)ᵀ.
  • You specify the number of exponentials n.
  • You specify the 2n+1 ''true'' coefficients x = (a_0, a_1, b_1, ..., a_n, b_n)ᵀ. Internally these coefficients are stored as the components x(1), x(2), x(2+n), x(3), x(3+n), ..., x(n+1), x(2n+1).
  • You specify an interval [a,b] for t, in which the data will be generated equidistantly.
  • You specify the number m of data points. Important: 2n+1 <= m <= 200 !
  • A variable level is used for the generation of artificial ''measurement errors''. Important: 0 <= level <= 1 ! Now m data points (t_i, y_i) will be computed using the model with the ''true'' coefficients, and afterwards the values y_i are perturbed using random numbers r_i in [0,1] according to the formula
    y_i = y_{i,orig} + (r_i - 1/2)*level*2*max_j{|y_{j,orig}|}
    Subsequently the model is fitted to the perturbed data. Attention: don't hope for wonders! Fitting by exponentials is a notoriously badly conditioned problem. This becomes clear if you consider it as a problem of parameter identification for a linear differential equation with constant coefficients, with the solution given by a discrete and noisy set of data.
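The data generation just described can be sketched in a few lines of Python (illustrative only; the coefficient layout (a0, a1, b1, ..., an, bn) follows the description above, while everything else, including the function name, is invented):

```python
import numpy as np

def make_data(a, b, coeffs, m, level, rng):
    """Equidistant t in [a, b]; y from the ''true'' coefficients, then perturbed.
    coeffs = (a0, a1, b1, ..., an, bn)."""
    t = np.linspace(a, b, m)
    a0, rest = coeffs[0], coeffs[1:]
    amps, exps = rest[0::2], rest[1::2]            # a1..an and b1..bn
    y_orig = a0 + sum(ai * np.exp(bi * t) for ai, bi in zip(amps, exps))
    r = rng.random(m)                               # random numbers in [0, 1]
    return t, y_orig + (r - 0.5) * level * 2.0 * np.max(np.abs(y_orig))
```

With level = 0 the unperturbed model values are returned.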
  • You have the possibility to change some of the parameters of the code:
    1. the required precision of the solution vector, xtol, in [1.0d-12,1.0d-2] (measured as relative change, of course)
    2. an upper bound ftol for the relative change in the sum of squares which results in termination, in [1.0d-14,1.0d-2]. The absolute bound for this is fixed at 1.0d-14.
    3. the maximal number of function evaluations allowed
    4. the maximal number of steps allowed maxstep >= n
    5. the amount of printed output iprint: -1 = none, 0 = only final result, 1 = short iteration protocol plus final result
 

Output: the predefined exponential fit

 
  • If required, a short iteration protocol.
  • CPU time used.
  • A table with the original and the computed (fitted) coefficients.
  • The sum of squares of the data perturbations.
  • The sum of squares of the residuals after fitting.
  • A plot which shows the perturbed data and the fit function.
 

Questions: exponential fit

 
  • How does an increase of the perturbation influence the performance of the method?
  • How sensitively do the coefficients react to an increase of the perturbation level for different choices of the exponents?
  • What happens if two of your chosen exponents are nearly equal, or if the decay is very slow?
 

To the input form: exponential fit

 

Input: user defined case

 
  • Here you can define the model function f(x;t) yourself.
  • You have the possibility to fit your own data, in which case your input for x serves as the initial guess, or to have the data generated internally. In the latter case your input for x serves as the ''true'' parameter value. Data are computed from this using your model function and then perturbed using random perturbations. The perturbed data are subsequently used for fitting.
  • You specify the dimension n of x. Important: n <= 20 !
  • You specify the model function f(x;t) by a piece of code depending on x(1),...,x(n) and t. This must obey the rules of FORTRAN and the restrictions of JAKEF.
  • You specify the initial vector x(1),...,x(n) with n entries.
  • The Jacobian is produced internally here by analytic differentiation, using the code JAKEF from NETLIB.
  • You have the choice to specify explicitly your data or to let the program generate data artificially using your specifications. In the latter case you define:
    1. An interval [a,b] from which the variable t is drawn equidistantly.
    2. The number m of data points. Important: n <= m <= 200 !
    3. An error level level. Important: 0 <= level <= 1 ! m data points (t_i, y_{i,orig} = f(x;t_i)) are computed and subsequently the data y_i are computed from
      y_i = y_{i,orig} + (r_i - 1/2)*level*2*max_j{|y_{j,orig}|}
      where the r_i are random numbers in [0,1]. These data are then fitted.
    Direct input of data points:
    1. You specify their number m. Important: n <= m <= 200 !
    2. You type the data in the prepared textarea, the data items t(i),y(i)
      separated by comma or blank, not necessarily one per line, for example
      0,0 1,1
      2,1.4 ..
    3. Finally you have the possibility to change some of the parameters of the code:
      1. the required precision of the solution vector xtol in [1.0d-12,1.0d-2]
      2. an upper bound for the change in the sum of squares which results in termination ftol in the range [1.0d-14,1.0d-2]
      3. the maximal number of function evaluations allowed
      4. the maximal number of steps allowed maxstep >= n
      5. the amount of printed output iprint: -1 = none, 0 = only final result, 1 = short iteration protocol plus final result
      6. two components of x to be printed in the iteration protocol if this is desired
 

Output: user defined fit

 
  • If required, a short iteration protocol.
  • CPU time used by the minimizer.
  • Table of the original (resp. initial) coefficients and the fitted ones.
  • Sum of squares of the generated errors (if applicable).
  • Sum of squares of the residuals for the final solution.
  • A plot which shows the fit data and the fit function.
 

Questions: user defined case

 
  • What do you observe if the fit is ''bad'' (residuals resp. perturbations are large)?
  • When do you observe large perturbations in the coefficients?
  • How do these effects depend on the model function?
 

To the input form: user defined case

 
 

13.12.2013