Method of Least Squares

This page is a quote from the PhD Thesis of P. Brommer.

Force Matching is closely related to determining the parameters $\xi_{i}$, $i=1,\ldots,n$ of a parametrised function $f_{\left\{\xi_{i}\right\}}(x)$ from a series of $m$ measurements, with $m\gg n$

$$f_{\left\{\xi_{i}\right\}}(x_{j}) = y_{j}, \quad j=1,\ldots,m.\qquad (2.51)$$

Due to measurement errors, this system can generally not be solved exactly. The best solution minimises the deviations of the particular measurements from the function. The definition of those deviations is far from trivial, but if one assumes the measurement errors to be independent, then using the squares of the differences between measurement and function is the method of choice. This is called the method of least squares and goes back to C.F. Gauss. In a further refinement one may assign a weight $w_{j}$ to measurement $j$. Then the target function becomes

$$Z(\left\{\xi_{i}\right\})=\sum_{j=1}^{m}w_{j}\left(f_{\left\{\xi_{i}\right\}}(x_{j})- y_{j}\right)^{2}, \quad i =1,\ldots,n. \qquad (2.52)$$

If the function $f_{\left\{\xi_{i}\right\}}(x)$ in Eq. (2.51) depends linearly on the parameters $\xi_{i}$, i.e.

$$f_{\left\{\xi_{i}\right\}}(x)=\sum_{i=1}^{n}c_{i}(x)\xi_{i}$$

then the solution of the linear least square problem is reduced to the solution of a linear system of equations. The deviation of a measured value $y_{j}$ from $f_{\left\{\xi_{i}\right\}}(x_{j})$is the so-called residue $r_{j}$. An overdetermined system of $m$ equations in $n$ variables $\xi_{i}$ is defined as

$$\sum_{i=1}^{n}c_{i}(x_{j})\xi_{i}-y_{j}=r_{j}$$

or in matrix notation

$$\boldsymbol C \boldsymbol\xi-\boldsymbol y =\boldsymbol r,\quad \boldsymbol C \in \mathbb{R}^{m\times n}, \boldsymbol\xi\in\mathbb{R}^{n}, \boldsymbol y,\boldsymbol r\in\mathbb{R}^{m}. \qquad (2.55)$$

These are the so-called error equations. The objective is now minimising the squares of the residues $r_{j}$, which is just the Euclidian norm $|\boldsymbol r|^{2}$ of the vector $\boldsymbol r$. This is equivalent to solving the linear system of equations

$$\boldsymbol A\boldsymbol\xi+\boldsymbol b=0,\quad\text{where}\quad \boldsymbol A:=\boldsymbol C^{T} \boldsymbol C\in\mathbb{R}^{n\times n}, \boldsymbol b:=-\boldsymbol C^{T}\boldsymbol y\in\mathbb{R}^{n}.$$

The above equations are the normal equations to the error equations (2.55).

There are various efficient numerical implementations of the linear least squares problem. Unfortunately, force matching is a highly nonlinear problem, and the minimum can only be obtained in an iterative fashion. Two methods that were implemented in this work and are described below.

Minimisation without Derivatives According to Powell

Even if the minimum of a function of many variables cannot be obtained in a single calculation, it is usually possible to iteratively improve a starting point until one arrives at an extremal position. The most basic way to do this is following the gradient of the target function to the minimum. Starting at $\boldsymbol \xi_{0}$, the negative gradient vector $\boldsymbol g_{0}=-\nabla Z(\boldsymbol\xi)|_{\boldsymbol\xi=\boldsymbol\xi_{0}}$ of the target function is calculated at $\boldsymbol\xi_{0}$. The next point $\boldsymbol\xi_{1}$ is at the minimum of $Z(\boldsymbol\xi)$ along the straight line through $\boldsymbol\xi_{0}$ with the direction vector $\boldsymbol g_{0}$. Algorithms for one-dimensional minimisation are described below. At $\boldsymbol\xi_{1}$, the gradient of $Z(\boldsymbol\xi)$ is evaluated again and a new search direction is chosen. As each step always goes parallel to the negative gradient of the target function, this is called the method of steepest descent.

Though this method is simple to understand, it is quite inefficient in the number of iteration steps. Consecutive gradients $\boldsymbol g_{i},\boldsymbol g_{i+1}$ are perpendicular ($\boldsymbol g_{i}\cdot\nabla Z(\boldsymbol\xi)|_{\boldsymbol\xi= \boldsymbol\xi_{0}}=0$, otherwise $\boldsymbol\xi_{i+1}$ would not be at the minimum of $Z$ along $\boldsymbol g_{i}$), and the minimisation along $\boldsymbol g_{i+i}$ will generally break the optimisation in the direction of $\boldsymbol g_{i}$. In unfortunate target function landscapes this might lead to a large number of inefficient steps.

The method of conjugate gradients (CG) avoids this by choosing a new gradient direction $\boldsymbol g_{i}$ in a way that will not disturb earlier minimisation steps. At the point $\boldsymbol \xi_{i}$, the Taylor expansion of $Z$ is

$$Z(\boldsymbol x)=Z(\boldsymbol\xi_{i}) + \sum_{l}\frac{\partial Z}{\partial x_{l}}x_{l} + \frac{1}{2}\sum_{l,m}\frac{\partial^{2}Z}{\partial x_{l}\partial x_{m}}x_{l}x_{m} +\ldots\quad\text{with}\quad \boldsymbol{x}=\boldsymbol\xi-\boldsymbol\xi_{i},$$ $$\approx c - \boldsymbol b\cdot\boldsymbol x + \frac{1}{2} \boldsymbol x^{T}\cdot\boldsymbol A\cdot\boldsymbol x,\qquad\qquad\text{(2.58)}$$ where $$c\equiv Z(\boldsymbol\xi_{i}),\qquad \boldsymbol{b}\equiv\left. -\nabla Z\right|_{\boldsymbol\xi_{i}},\qquad [\boldsymbol{A}]_{lm}\equiv \left.\frac{\partial^{2}Z}{\partial x_{l}\partial x_{m}}\right|_{\boldsymbol\xi_{i}}.$$

The matrix $\boldsymbol A$ is the Hessian of $Z$ in $\boldsymbol\xi_{i}$. In the approximation (2.58), the gradient is $\nabla Z = \boldsymbol{A}\cdot \boldsymbol{x} - \boldsymbol{b}.$ When moving by $\delta \boldsymbol{x}$, the change in the gradient is $$\delta(\nabla Z) = \boldsymbol{A}\cdot (\delta \boldsymbol{x}).$$ As $\boldsymbol\xi_{i}$ is at the minimum of the target function along $\boldsymbol g_{i-i}$, the condition that motion along the new direction $\boldsymbol g_{i}$ conserve this optimum is equivalent to demanding the gradient of $Z$ remain perpendicular to $\boldsymbol g_{i-1}$ during that motion. This implies that the also change of the gradient be perpendicular to $\boldsymbol g_{i}$, or $$0=\boldsymbol{g}_{i-1}\cdot \delta(\nabla Z) = \boldsymbol{g}_{i-1}\cdot\boldsymbol{A}\cdot \boldsymbol{g}_{i}.\qquad\qquad\text{(2.62)}$$

Directions that fulfil (2.62) are called conjugated. $n$ pairwise conjugated vectors spanning the $\mathbb{R}^{n}$ form a conjugated basis. If the approximation (2.58) is exact (i.e. $Z$ is a quadratic form), the minimum of $Z$ will be acquired after $n$ steps along the conjugated basis directions. Otherwise, this method ensures quadratic convergence towards the minimum. Many well-documented implementations of this minimisation algorithm (see e.g. [1]) do neither take into account the special form of $Z$ in the case of Force Matching (being a sum of squares) nor the high cost involved in calculating the gradient of $Z$.

A target function of the functional form (2.52) can also be written as $$Z(\boldsymbol\xi)=\sum_{k=1}^{m}[f^{(k)}(\boldsymbol\xi)]^{2},\qquad(2.63)$$ where the $f^{(k)}$ represent the various functions, whose squares contribute to the sum. The vector $\boldsymbol\xi$ formed by the parameters $\xi_{i}$ has dimension $>n$. The first and second derivatives of the $f^{(k)}$ will be written as follows: $$g_{i}^{(k)}(\boldsymbol\xi)=\frac{\partial}{\partial xi_{i}}f^{(k)}(\boldsymbol\xi)\qquad(2.64)$$

$$\text{and} \qquad G_{ij}^{(k)}(\boldsymbol\xi)=\frac{\partial^{2}}{\partial \xi_{i}\partial xi_{j}} f^{(k)}(\boldsymbol\xi).$$

Let $\boldsymbol x$ be an approximation to the minimum at $\boldsymbol x + \boldsymbol\delta$. The derivative of (2.63) vanishes at the minimum: $$\sum_{k=1}^{m}g_{i}^{(k)}(\boldsymbol x+\boldsymbol\delta)\cdot f^{(k)}(\boldsymbol{x}+\boldsymbol\delta) =0 \qquad i=1,\ldots, n.\qquad(2.66)$$ The first two terms of the Taylor expansion of (2.66) around $\boldsymbol x$ yield $$\sum_{k=1}^{m}\left[g_{i}^{(k)}(\boldsymbol x)\cdot f^{(k)}(\boldsymbol x) + \sum_{j=1}^{n}{ G_{ij}^{(k)}(\boldsymbol x) f^{(k)}(\boldsymbol x)+ g_{i}^{(k)}(\boldsymbol x) \cdot g_{j}^{(k)}(\boldsymbol x) }\delta_{j}\right]\approx 0.$$ In a further approximation, the term $G_{ij}^{(k)}(\boldsymbol x) f^{(k)}(\boldsymbol x)$ is neglected. The correction $\boldsymbol\delta$ to $\boldsymbol x$ results from the solution of the linear system of equations $$\sum_{j=1}^{n}\left\{\sum_{k=1}^{m-1}g_{i}^{(k)}(\boldsymbol x) g_{j}^{(k)}(\boldsymbol x) \right\} \delta_{j} = -\sum_{k=1}^{m} g_{i}^{(k)}(\boldsymbol x) f^{(k)}(\boldsymbol x); \qquad i=1,\ldots, n.\qquad(2.68)$$

The new approximation is of course $\boldsymbol x+\lambda_{m}\boldsymbol\delta$, where $\lambda_{m}$ minimises the function $Z(\boldsymbol x+\lambda\boldsymbol\delta)$. In general, the matrix in (2.68) is positive definite and (unless $\boldsymbol x$ is a stationary point) there is a $\lambda_{m}>0$ that improves $Z(\boldsymbol x+ \lambda_{m}\boldsymbol\delta)$, thus theoretically guaranteeing convergence. This is the so-called generalised least squares method.

The derivatives (2.64) are in the case of Force Matching not known explicitly, so they have to be determined numerically at each point $\boldsymbol\xi_{i}$- a very costly operation. Powell² removes the need for explicit gradient calculations beyond the initialisation. He also shows that if the functions $f^{(k)}(\boldsymbol x)$ are of the same order as the correction $\boldsymbol\delta$, errors in the numerical approximate derivatives $g_{j}^{(k)}(\boldsymbol x)$ are acceptable to ensure convergence properties comparable to the generalised least squares method described above.

One starts with $n$ linearly independent direction vectors $\boldsymbol d(1),\ldots,\boldsymbol d(n)$ spanning the parameter space, together with approximations of the gradients $\gamma_{i}^{(k)}$ along these directions, i.e. $$\gamma^{(k)}_{i}\approx\sum_{j=1}^{n}g_{j}^{(k)}(\boldsymbol\xi)\cdot d_{j}(i) \qquad i=1,\ldots,n; \quad k=1,\ldots, m\qquad(2.69)$$ is the derivative of $f^{(k)}$ in the direction $\boldsymbol d_{i}$. These gradients are explicitly taken as independent of the location in configuration space $\boldsymbol\xi$. A change of $\boldsymbol\xi$ by $\boldsymbol\delta$ causes an error of order $\boldsymbol\delta$ times a second derivative in $\gamma^{(k)}_{i}$, which is - as stated above - acceptable. In the first iteration, the coordinate directions are chosen as $\boldsymbol d(i)$, and the $\gamma^{(k)}_{i}$ are initialised with the difference quotients with increments $\epsilon_{i}$ $$\gamma^{(k)}(i)=\frac{f^{(k)}(x_{1},x_{2},\ldots,x_{i-1},x_{i}+\epsilon_{i},x_{i+1},\ldots,x_{n})- f^{(k)}(\boldsymbol x)}{\epsilon_{i}}.\qquad(2.70)$$

The correction $\boldsymbol\delta$ to the starting point $\boldsymbol x$ is the solution of a system of linear equations given by putting the numerical estimates for the derivatives into (2.69): $$\sum_{j=1}^{n}\left\{\sum_{k=1}^{m}\gamma^{(k)}(i)\gamma^{(k)}(j) \right\} q(j) = p(i); \qquad i=1,\ldots, n,$$ $$\text{where}\quad \boldsymbol\delta = \sum_{i=}^{n}q(i)\cdot \boldsymbol{d}(i)$$ $$\text{and}\quad p(i) =-\sum_{k=}^{m} \gamma^{(k)}(i) f^{(k)}(\boldsymbol{x}).$$

For numerical reasons it is beneficial to scale the direction vectors $\boldsymbol d(i)$, so that $$\sum_{k=1}^{m}[\gamma^{(k)}(i)]^{2}=1, \qquad i=1,\ldots,n.\qquad(2.74)$$

Again, the next step is to determine the $\lambda_{\text{min}}$ that minimises $Z(\boldsymbol x+\lambda\boldsymbol\delta)$. Due to the errors in (2.69), this $\lambda_{\text{min}}$ need not be positive. To find this optimal $\lambda_{\text{min}}$, one first brackets a minimum (i.e. find a $f(a)>f(m)<f(b)$ with $a<m<b$) by increasing the bracket size, and then locates the minimum with Brent's algorithm³. This algorithm combines the fast-converging parabolic interpolation with the golden section search. Thus it provides a little insurance: Even in very unfavourable cases, it converges at least linearly.

The best and second best values for $\lambda$, $\lambda_{1}$ and $\lambda_{2}$, that were obtained during the minimisation along $\boldsymbol\delta$, and the associated function values $f^{(k)}(\boldsymbol x+\lambda_{1,2}\boldsymbol\delta)$ are used to approximate the gradient in direction $\boldsymbol\delta$ by the difference quotient $$\frac{\partial}{\partial \lambda}f^{(k)}(\boldsymbol x+\lambda\boldsymbol\delta) \approx \frac{f^{(k)}(\boldsymbol x+\lambda_{1}\boldsymbol\delta)- f^{(k)}(\boldsymbol x+\lambda_{2}\boldsymbol\delta)} {\lambda_{1}-\lambda_{2}}=u^{(k)}(\boldsymbol\delta).$$ By exploiting the fact, that the gradient of $Z(\boldsymbol\xi)$ in the direction $\boldsymbol\delta$ vanishes at $\boldsymbol x+\lambda_{\text{min}}\boldsymbol\delta$, one can improve this approximation to $v^{(k)}(\boldsymbol\delta)$, where $$v^{(k)}(\boldsymbol\delta) = u^{(k)}(\boldsymbol\delta) - \mu f^{(k)}(\boldsymbol x+\lambda_{\text{min}}\boldsymbol\delta)$$

$$\text{with}\quad \mu = \frac{1}{Z(\boldsymbol x+\lambda_{\text{min}}\boldsymbol\delta)} \sum_{k=0}^{m-1}[u^{(k)}(\boldsymbol\delta)\cdot f^{(k)}(\boldsymbol x+\lambda_{\text{min}}\boldsymbol\delta)].$$ and scaling both $v^{(k)}(\boldsymbol\delta)$ and $\boldsymbol\delta$, so that (2.74) still holds when replacing one of the $\gamma^{(k)}(t)$ by $v^{(k)}(\boldsymbol\delta)$ and one of the directions $\boldsymbol d(t)$ by $\delta$. The integer $t$ is selected to maximize the absolute value of the product $p(t)\cdot q(t)$, i.e. $$|p(t)\cdot q(t)|=\max_{1\leq i\leq n}|p(i)\cdot q(i)|.$$ The next iteration step is then started from $\boldsymbol x+\lambda_{\text{min}}\boldsymbol\delta$. The iteration is finished, if the components of both $\boldsymbol\delta$ and $\lambda_{\text{min}}\boldsymbol\delta$ are sufficiently small.

The advantages of Powell's method are discussed in detail in [2] and [4]. Among others, consecutive vectors $\boldsymbol\delta$ tend to be mutually conjugate through the matrix $$\Gamma_{ij}=\sum_{k=0}^{m-1}g_{i}^{(k)}(\boldsymbol x)g_{j}^{(k)}(\boldsymbol x).$$ This makes the algorithm an approximation to the conjugate gradient scheme. Most important for force matching purposes, though, is that beyond the initial $n$ difference quotients no further numeric derivatives are needed explicitly. The only function calls after the initialisation are in the one-dimensional minimisation according to Brent; the results are then reused to calculate the new $\gamma^{(k)}(t)$. Still, the convergence is comparable to the generalised least squares algorithm described above.

1. Press, W. H., Teukolsky, S. A., Vetterling, W. T., and Flannery, B. P.: Numerical Recipes in C: The Art of Scientific Computing (Academic Press, Cambridge, 1992), 2 edition.
2. Powell, M. J. D.: A method for minimizing a sum of squares of non-linear functions without calculating derivatives. Comp. J., 7 (4),303–307, 1965.
3. Brent, R. P.: Algorithms for minimization without derivatives. Prentice-Hall series in automatic computation (Prentice-Hall, Englewood Cliffs, NJ, 1973). ISBN 0–13–022335–2.
4. Brommer, P.: Entwicklung und Test von Wechselwirkungspotenzialen in Quasikristallen. Master’s thesis, Universität Stuttgart, 2003.