Linear least squares: QR decomposition and normal equations
- A linear system
A p = b
with more equations than unknowns will typically have no solution,
and hence one searches for a ''quasi-solution''. This might be the
minimizer of the residual vector
r = A p - b
in some norm. Linear least squares uses the Euclidean norm, which is
the simplest case. The solution will be unique if A has full column
rank, which we assume here. For the more general case see the
singular value decomposition (SVD).
In the full rank case the solution can be explicitly written
down as
p = (A^T A)^{-1} A^T b
or
(A^T A) p = A^T b  <=>  A^T (A p - b) = A^T r = 0 ,
the so-called ''normal equations'', saying that the optimal residual is perpendicular to the columns of A.
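- As a minimal illustration of the normal equations (a Python/NumPy sketch, not part of the program described here; matrix and right hand side are made up):

    import numpy as np

    # Overdetermined system: N = 5 equations, n = 2 unknowns (made-up data)
    A = np.array([[1.0, 0.0],
                  [1.0, 1.0],
                  [1.0, 2.0],
                  [1.0, 3.0],
                  [1.0, 4.0]])
    b = np.array([1.1, 1.9, 3.2, 3.9, 5.1])

    # Normal equations: (A^T A) p = A^T b
    p = np.linalg.solve(A.T @ A, A.T @ b)

    # The optimal residual r = A p - b is perpendicular to the columns of A
    r = A @ p - b
    print(p, A.T @ r)   # A^T r should vanish up to roundoff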
- The direct application of this representation is often subject to
severe ill-conditioning, leading to destructive roundoff propagation,
and hence other solution methods are applied, see below.
''Ill-conditioning'' means that small variations in b or A
give large variations in p.
- A result in this direction asks for the difference between the solutions of the
original problem
||A p - b|| = min_p
and the perturbed one
||Ã p̃ - b̃|| = min_p̃ ,
which is as follows:
  - Let
    κ = ||Σ|| ||Σ^{-1}||
    with the singular value decomposition
    A = U Σ V^T
    (κ is the condition number of A).
  - E = A - Ã , δ(b) = ||b - b̃|| .
  - Assume that with δ(A) = ||E||/||A|| there holds
    2 κ δ(A) < 1 .
  - Then, with r the minimal residual of the original problem,
    ||p - p̃|| / ||p|| <= ( κ / (1 - 2 κ δ(A)) ) × ( δ(A) + ( δ(b) + κ δ(A) ||r|| ) / ( ||b||^2 - ||r||^2 )^{1/2} ) .
This means that the condition number of the so-called normal equations (κ^2)
appears here only multiplied by the norm of the residual, and that if κ ||r|| <= 1
the sensitivity of the least squares problem is essentially identical to the
sensitivity of a (solvable) linear system A p = b, whereas a large residual problem
can be very ill-conditioned.
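- The bound can be checked numerically; a hedged Python/NumPy sketch (the test matrix, the perturbation size, and the use of the spectral norm are assumptions for illustration):

    import numpy as np

    rng = np.random.default_rng(0)
    N, n = 20, 4
    A = rng.standard_normal((N, n))
    b = rng.standard_normal(N)

    p, *_ = np.linalg.lstsq(A, b, rcond=None)   # unperturbed solution
    r = A @ p - b                               # minimal residual

    s = np.linalg.svd(A, compute_uv=False)
    kappa = s[0] / s[-1]                        # condition number of A

    E  = 1e-8 * rng.standard_normal((N, n))     # perturbation of A
    db = 1e-8 * rng.standard_normal(N)          # perturbation of b
    pt, *_ = np.linalg.lstsq(A + E, b + db, rcond=None)

    dA = np.linalg.norm(E, 2) / np.linalg.norm(A, 2)   # delta(A)
    dB = np.linalg.norm(db)                            # delta(b)
    lhs = np.linalg.norm(p - pt) / np.linalg.norm(p)
    rhs = kappa / (1 - 2*kappa*dA) * (
        dA + (dB + kappa*dA*np.linalg.norm(r))
             / np.sqrt(np.linalg.norm(b)**2 - np.linalg.norm(r)**2))
    print(lhs <= rhs, lhs, rhs)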
- The above-mentioned situation ''more equations than unknowns'' appears
quite often in applications. Imagine you have a set of measured data
(x_i, y_i), i = 1,...,N,
and a ''model'' for the dependence of y on x, that is,
an ''Ansatz'' with given functions φ_j(x),
y = F(x;p) = p_1 φ_1(x) + ... + p_n φ_n(x) + ε(x) ,
which depends linearly on the unknown parameters p_j, j = 1,...,n, and where you
have reason to assume that ε(x_i) is ''small noise''.
Inserting the data into this model with ε(x_i) neglected leads to the situation above with
an N × n matrix A with the elements
A_{i,j} = φ_j(x_i) and b_i = y_i .
- Using the Euclidean norm results in minimization of the sum of squares
∑_{i=1}^{N} ( F(x_i;p) - y_i )^2
w.r.t. p.
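- To make the construction of A concrete, a small Python/NumPy sketch (the basis functions φ_j and the data are invented for illustration):

    import numpy as np

    # Made-up data (x_i, y_i), i = 1,...,N
    x = np.linspace(-1.0, 1.0, 20)
    y = 0.5 - 1.5*x + 2.0*x**2 \
        + 0.01*np.random.default_rng(1).standard_normal(x.size)

    # Ansatz F(x;p) = p_1 phi_1(x) + p_2 phi_2(x) + p_3 phi_3(x)
    phis = [lambda t: np.ones_like(t), lambda t: t, lambda t: t**2]

    # Design matrix A_{i,j} = phi_j(x_i), right hand side b_i = y_i
    A = np.column_stack([phi(x) for phi in phis])
    p, *_ = np.linalg.lstsq(A, y, rcond=None)
    print(p)   # close to the ''true'' coefficients (0.5, -1.5, 2.0)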
- The method of least squares also has a justification in statistics: if you
assume that ''small noise'' above means normally and independently distributed random
variables ε(x_i) with mean zero and a variance independent of i,
then it turns out that the expected value of the computed p is the ''true'' parameter
in the model.
- The QR decomposition of the matrix provides one way to diminish the influence of the ill-conditioning
present in the normal equations. It uses the fact that the Euclidean length is invariant under orthonormal
transformations and hence applies such transformations with the aim of transforming the system
A p = b into the form
(Q A)^T = ( R^T, O^T ) ,
R p = c_1 ,
O p = c_2 ,
where R is triangular and square, O is a zero matrix of N-n rows, and c = Q b
(split accordingly into c_1 and c_2). Since the multiplication with Q does not change the rank, R will be
nonsingular. Hence the incompatibility of the system sits in the second equation only, with ||c_2||
being the length of the optimal residual. The optimal parameter p is obtained simply by
back substitution in the first equation. Formally this is equivalent to writing
A = Q^T ( R^T, O^T )^T .
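- A sketch of this solution path with NumPy's built-in QR (which returns the ''economy'' form A = Q R, with Q of size N × n having orthonormal columns; the data are made up):

    import numpy as np

    rng = np.random.default_rng(2)
    A = rng.standard_normal((8, 3))
    b = rng.standard_normal(8)

    # Reduced QR: A = Q R with R upper triangular and square
    Q, R = np.linalg.qr(A)

    # c_1 = Q^T b; solving R p = c_1 stands in for explicit back substitution
    c1 = Q.T @ b
    p = np.linalg.solve(R, c1)

    # The residual length equals ||c_2|| = sqrt(||b||^2 - ||c_1||^2)
    print(np.linalg.norm(A @ p - b),
          np.sqrt(np.linalg.norm(b)**2 - np.linalg.norm(c1)**2))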
- In this application the matrix Q is not computed explicitly,
but kept in factorized form as the product of n (in the case N = n, of n-1) so-called
Householder matrices. These are special orthonormal matrices performing a reflection at a
hyperplane with normal u_i:
I - 2 u_i u_i^T / (u_i^T u_i) , i = 1,...,n .
The vector u_i is the normal of this hyperplane and can be determined easily
from the job of step i in this decomposition: annihilate the elements (i+1,i),...,(N,i)
of the (current) matrix. If you think of these elements as a column, then for example
(8,3,1,-1,4,-3)^T -> (-10,0,0,0,0,0)^T
with
u = (18,3,1,-1,4,-3)^T .
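- This reflection can be verified in a few lines (Python/NumPy sketch):

    import numpy as np

    x = np.array([8.0, 3.0, 1.0, -1.0, 4.0, -3.0])

    # u = x + sign(x_1) ||x|| e_1; here ||x|| = 10, so u = (18,3,1,-1,4,-3)^T
    u = x.copy()
    u[0] += np.sign(x[0]) * np.linalg.norm(x)

    # Apply H x = x - 2 u (u^T x)/(u^T u) without forming H explicitly
    Hx = x - 2.0 * u * (u @ x) / (u @ u)
    print(Hx)   # (-10, 0, 0, 0, 0, 0) up to roundoff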
- Given those normals u_i, any desired operation with the matrix Q, which is
the product of the Householder matrices, can be accomplished.
- One can show that the computation and the application of these transformations is numerically backward
stable in the sense that the computed solution p is the exact solution of a problem which is
a perturbation of the original system of the order of machine roundoff.
- It should be obvious that the choice of the ''Ansatz'' heavily influences the
condition number of A and hence the sensitivity of the coefficients. The
application here allows you to test this, for example using different bases of the
n-dimensional linear space of polynomials of maximum degree n-1
(e.g. monomial, Legendre and Chebyshev).
The following short table, with the abscissas x_i equidistant in
[-1,1], demonstrates this quite well; a sketch for reproducing numbers of
this kind follows the table. For the monomial basis on intervals
like [1,5] the situation becomes even worse!
The table shows the reciprocal condition number of A, i.e. 1/κ.
N | n | Monomial | Chebyshev |
10 | 5 | 0.5982810E-001 | 0.6058193E+000 |
10 | 10 | 0.2161485E-003 | 0.6858268E-001 |
20 | 5 | 0.5834324E-001 | 0.5135544E+000 |
20 | 10 | 0.7595056E-003 | 0.4190469E+000 |
30 | 5 | 0.5688019E-001 | 0.4786713E+000 |
30 | 10 | 0.8214191E-003 | 0.4422010E+000 |
30 | 15 | 0.8670296E-005 | 0.2740692E+000 |
40 | 5 | 0.5599882E-001 | 0.4613035E+000 |
40 | 10 | 0.8290337E-003 | 0.4275510E+000 |
40 | 15 | 0.9995193E-005 | 0.3545616E+000 |
40 | 20 | 0.9708376E-007 | 0.1313730E+000 |
50 | 5 | 0.5542571E-001 | 0.4509568E+000 |
50 | 10 | 0.8266143E-003 | 0.4072196E+000 |
50 | 15 | 0.1050676E-004 | 0.3782863E+000 |
50 | 20 | 0.1163482E-006 | 0.2202498E+000 |
100 | 5 | 0.5418598E-001 | 0.4298162E+000 |
100 | 10 | 0.8072133E-003 | 0.3451954E+000 |
100 | 15 | 0.1084858E-004 | 0.3424422E+000 |
100 | 20 | 0.1387583E-006 | 0.3193107E+000 |
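- Numbers of this kind can be reproduced with a Python/NumPy sketch like the following (np.polynomial supplies Vandermonde-type matrices for both bases; exact agreement with the table depends on details such as column scaling):

    import numpy as np
    from numpy.polynomial import chebyshev, polynomial

    for N, n in [(10, 5), (40, 20)]:
        x = np.linspace(-1.0, 1.0, N)
        A_mon = polynomial.polyvander(x, n - 1)  # basis 1, x, ..., x^{n-1}
        A_che = chebyshev.chebvander(x, n - 1)   # basis T_0, ..., T_{n-1}
        for name, A in [("monomial", A_mon), ("Chebyshev", A_che)]:
            s = np.linalg.svd(A, compute_uv=False)
            print(N, n, name, s[-1] / s[0])      # reciprocal condition 1/kappa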
- This program gives you three choices: either you specify a problem explicitly by
the coefficients of A and b; or you play with some linear least squares
models which are predefined here (for example it might be instructive to play with a
polynomial model using different intervals on the real axis); or, finally, you specify an ''Ansatz''
of your own and work either with your own data or with artificially generated ones.
Input

Case 1: explicit specification of matrix and right hand side:
- row and column number of the matrix
- the coefficients of A, with the elements of b appended, row-wise
Case 2: predefined models with data generation:
- You choose a model.
- You specify an interval [a,b] for the ''free'' variable x.
- You specify the number of data points N.
- You also specify the amount of generated ''noise''. This is a set of random numbers r_i,
uniformly distributed in [0,1], producing perturbations in the interval
[-errlev, errlev] · y_max_abs,
where errlev can be chosen. That means that the data are disturbed with an absolute error
which is proportional to the largest absolute function value of F.
errlev = 0 is possible and shows the effect of roundoff errors on the computation.
- The abscissas x_i are generated equidistantly in [a,b].
- Then y_i is computed as
y_i = F(x_i; p_1,...) + 2·(½ - r_i)·errlev·y_max_abs ;
a sketch of this step follows the list.
- You have the choice between four models with 5 or 6 coefficients:
  - A polynomial using the monomials 1, x, x^2, ..., x^5
    as basis. You have the additional choice of solving the problem after a linear
    transformation of [a,b] to [-1,1].
  - A polynomial with the Legendre polynomials L_0(x),...,L_5(x)
    as basis. These are
    L_0(x) = 1, L_1(x) = x,
    L_2(x) = ½(3x^2 - 1), ...
  - A trigonometric model with the basis 1, sin(c_1 π x),
    cos(c_2 π x), sin(c_3 π x), cos(c_4 π x).
  - An exponential model with the fixed basis functions 1, exp(c_i x), i = 1,...,4.
- In the last two cases you may choose the constants c_i yourself.
- You specify the ''true'' coefficients p.
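- The data generation of this case can be mimicked as follows (Python/NumPy sketch; the model F, its coefficients and errlev are placeholder choices):

    import numpy as np

    def generate_data(F, p, a, b, N, errlev, seed=0):
        """Equidistant abscissas in [a,b]; uniform noise scaled by max |F|."""
        rng = np.random.default_rng(seed)
        x = np.linspace(a, b, N)
        y_exact = F(x, p)
        y_max_abs = np.max(np.abs(y_exact))
        r = rng.uniform(0.0, 1.0, N)                 # r_i uniform in [0,1]
        y = y_exact + 2.0*(0.5 - r)*errlev*y_max_abs
        return x, y

    # Example: quadratic model with 5 percent noise (illustrative values)
    x, y = generate_data(lambda t, p: p[0] + p[1]*t + p[2]*t**2,
                         [1.0, -2.0, 0.5], -1.0, 1.0, 30, 0.05)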
Case 3: a model of your own:
- In this case you must write a piece of code, following FORTRAN conventions,
which computes one row of A for a given ''instance'' x.
Input here are x and n, output is one row of A.
Row and column numbers are specified separately. An illustrative sketch follows this list.
- With the ''Ansatz'' defined this way you either may generate artificial data
(in this case the abscissas are chosen equidistantly in a given interval
and you specify the ''true'' coefficients and a perturbation level as in case 2)
or
- you fit your own given discrete data.
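- The program itself expects this routine in FORTRAN; purely to illustrate what it computes, a Python analogue for an invented exponential ''Ansatz'' might look like this (function name and constants are hypothetical):

    import numpy as np

    def row_of_A(x, n, c=(0.0, -1.0, 1.0, 2.0)):
        """One row of A for the Ansatz exp(c_1 x), ..., exp(c_n x);
        c_1 = 0 makes the first basis function the constant 1."""
        return np.array([np.exp(c[j] * x) for j in range(n)])

    # Stacking the rows for all abscissas yields the full N x n matrix A
    xs = np.linspace(0.0, 1.0, 10)
    A = np.vstack([row_of_A(xi, 4) for xi in xs])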
Output
- In case 1 you get the solution vector p and the Euclidean length of the residual.
You also get the reciprocal condition number 1/κ.
- In case 2 you get
  - your chosen ''true'' coefficients p_1,...,p_n,
  - the computed coefficients, showing the influence of roundoff and/or generated noise,
    for both solution paths, the normal equations and the QR decomposition,
  - a plot showing the discrete data points and the approximating function F(x;p),
    with p from the QR solution path,
  - the reciprocal condition number 1/κ.
- The same output as in case 2 is also provided for case 3, an ''Ansatz'' of your own.
Questions ?!
- If a small errlev generates large deviations in p: what do you conclude
from this?
- In which cases do the results from the normal equations and from the QR solution path differ most,
and why? Can you see a connection between these differences and the reciprocal
condition number?
- Which role does the transformation to [-1,1] play in the case of a polynomial model?
- Which differences do you observe between the monomial and the Legendre basis?
- What do you expect if, in case 2 with the exponential model, you choose some c_i = 0?
- How does the upper bound for the change in the solution given above compare
with the one you can observe here, if you first set errlev to zero and then
to some realistic value of the magnitude of a few percent (like measurement errors)?