#### Abstract

In experimental science and engineering, least squares are ubiquitous in analysis and digital data processing applications. Minimizing sums of squares of some quantities can be interpreted in very different ways and confusion can arise in practice, especially concerning the optimality and reliability of the results. Interpretations of least squares in terms of norms and likelihoods need to be considered to provide guidelines for general users. Assuming minimal prerequisites, the following expository discussion is intended to elaborate on some of the mathematical characteristics of the least-squares methodology and some closely related questions in the analysis of the results, model identification, and reliability for practical applications. Examples of simple applications are included to illustrate some of the advantages, disadvantages, and limitations of least squares in practice. Concluding remarks summarize the situation and provide some indications of practical areas of current research and development.

#### 1. Introduction

Least squares go back to Gauss and Legendre in the late 1790s. The first important publication on the topic was authored by Legendre in 1806 with the title “New Methods for Determination of a Comet’s Orbit” and had a supplement entitled “On the method of least squares”. Gauss’s first publication on least squares appeared in 1809 at the end of his Theoria motus. He mentioned there in passing that Legendre had presented the method in his work of 1806, but that he himself had already discovered it in 1795. Gauss’s correspondence and the papers found after his death proved that he was certainly the first to make the discovery, but since Legendre was first to publish it, priority rights belong to the latter. Obviously, as both of them reached the result independently of each other, both deserve the honour [1].

In these astronomical applications, “least squares” was a method of obtaining the best possible average value for a measured magnitude, given several observations of the magnitude, when the measurements are found to be unavoidably different due to (random) errors. Of course, long before the theory of errors ever saw the light of day, common sense had chosen the arithmetic average value as the most probable value, hence the dichotomy in numerous situations: independently of the nature of the errors involved, a least-squares procedure gives the arithmetic mean, or more generally some weighted average value, which may or may not be the most likely value in the probabilistic sense (see, e.g., [2] for more general discussions). These two very different interpretations of least squares, now technically often referred to as the Best Approximation Estimate (BAE) in terms of quadratic norms, and the Maximum Likelihood Estimate (MLE) in terms of distributions, respectively, are widely used in functional analysis, inference analysis, and all kinds of application areas. Notice that in general, BAEs in terms of arbitrary norms can be very different from MLEs in terms of distributions. However, it is well known that in general, BAEs for *p* norms and MLEs for exponential distributions coincide with most interesting implications in practice (see, e.g., [3]). The following discussions will concentrate on the fundamental characteristics of the least-squares methodology and related implementation aspects.

Least-squares parameter estimation can be applied to underdetermined just as to overdetermined linear problems. In fact, underdetermined prediction problems are generally more common than overdetermined filtering and adjustment problems. With observations and unknown parameters of unequal weights modeled using some empirical or theoretical covariance (or correlation) functions, more sophisticated estimation methods such as Kriging and least-squares collocation are employed. Correspondingly, Radial Basis Functions (RBFs) and related strategies are used for interpolations of spatially scattered data and other approximations as BAEs.

The preceding implicit assumption of model linearity is essential for practical reasons. In practice, just about any engineering problem nonlinear in terms of its unknown parameters can be linearized as follows:(a)Using a Taylor expansion about an appropriate point or parameter value, the linear term can be used to approximate the model in the neighborhood of the expansion point.(b)Using differentiation in terms of the unknown parameters, the total derivative of the function is linear in terms of the differentials corresponding to the unknown parameters, and hence differential corrections to the unknown parameters can be evaluated as a linear problem.

These two strategies for nonlinear least squares then imply iterative procedures for differential correction estimates to the unknown parameters. Convergence of such iterative procedures is usually ensured for well-chosen parametrization, and the general situation has been discussed by Pope [4] and others in the estimation literature. Obviously, in general, there are numerous types of nonlinear complex problems that cannot be treated with such simplistic strategies but for the purposes of the following discussions, linearity in terms of the unknown parameters will be assumed from here on.

From an application perspective, one exceptional class of separable nonlinear least-squares problems deserves mention in this context. These are problems for which the mathematical model function is a linear combination of nonlinear functions. Specifically, one can assume that there are two sets of unknown parameters where one set is dependent on the other and can be explicitly eliminated. The method of variable projections has proven very appropriate for such nonlinear problems in several application areas [5]. General nonlinear least-squares estimation is still the object of current research (see, e.g., [6] for further discussions and references).

The intrinsic linearity of least-squares computations implies that these can be done simultaneously or in a stepwise manner to obtain exactly the same estimation results. In other words, at the limit, one unknown parameter can be estimated at a time, or one observation or measurement can be processed at a time. This characteristic of linear computations is most useful in least-squares procedures and has led to numerous formulations such as summation of normals, and sequential adjustments. Furthermore, the quadratic computations can be avoided in critical numerically sensitive situations using orthogonal methods such as Givens rotations, Householder’s reflections, and others.

#### 2. Least Squares and Alternatives

Consider a system of linear algebraic nonhomogeneous equations with unknowns , , ,, , where is not necessarily equal to , with corresponding matrix representation and assuming for simplicity. When = without any rank deficiencies in the matrix , that is, with nonsingular, then the unique solution is simply which can be evaluated in practice by Gaussian elimination or any other solution method for simultaneous linear equations. Notice that such solution methods are usually more efficient than the direct matrix inversion method, which is a consideration in numerous application contexts.

When the system is overdetermined with , that is, more equations than unknowns, then one could use the first equations, or some other selection of equations, and assuming no rank deficiency, proceed as in the previous case of = . However, this is not appropriate for most applications as all the observations should somehow contribute to some “optimal” solution. Hence, rewriting the given system of equations with an error term **e**, that is, , to emphasize that there may not exist one value that would satisfy exactly, one obvious strategy is to minimize some norm of , that is, some acceptable measure of the “length” of the vector **e**. In practical terms, this norm of **e** can simply be its Euclidean length, that is,
but more generally, using norms, denoted by ,
for . The solution for a specified value of , if one exists, is called an BAE of . For , the _{2} estimate is the familiar least-squares estimate of , which is going to be discussed below. When *p* = 1, the _{1} estimate is a least-magnitude estimate of , a generalization of the median, and is well known in robust estimation. When , the estimate is a least-maximum or min-max estimate of . For other values of , some BAEs are possible but not often used in practice, except perhaps for 1 2 in multifacility location-allocation problems (e.g., [7]). Notice that for , the BAE does not necessarily exist and when it does exist, it is not necessarily unique, which can greatly complicate matters in applications.

When , the least-squares estimate always exists for a finite set of linear equations, assuming linearly independent columns, and its unique value is easily obtained using basic calculus:
using matrix notation and to minimize **e** = ,
which gives the familiar normal equations
where the square matrix **A** is easily seen to be symmetric and positive definite. The least-squares estimate is then written as
which is easily verified to correspond to a minimum as
The previous matrix inequality simply means that, as the matrix **A** is symmetric and positive definite, it has positive real eigenvalues and hence the situation corresponds to a minimum of **e**, as desired.

For the corresponding underdetermined system **Ax** = **f**, **f****0**, assuming linearly independent rows, there are obviously infinitely many solutions in general. For an optimal solution with minimum quadratic norm, the easiest approach is to use unknown correlates **x** = **y** which imply by substitution
and assuming no rank deficiency as before, **AA**^{T} is nonsingular and hence
which gives by substitution
To see the appropriateness of this estimate , consider
with usually called the nullspace projector corresponding to **A**. More explicitly,(i)for a vector **z** with **Az** = **0**, , and conversely;(ii) for all vector **z**.

Therefore, for any vector **z**,
but for a minimum quadratic norm estimate , only **z***≡***0** is acceptable.

The previous result can readily be generalized to the quadratic norm with weight matrices for correlated observations or measurements of different quality as follows. Let **f** denote an nonzero vector and **A** an matrix with linearly independent columns. Then there is a unique vector which minimizes over all , for some appropriate weight matrix **P**. Furthermore, = .

More generally, the unknowns themselves may have different relevance or other characteristics requiring weight matrices such as for regularization. Letting **f** denote an nonzero vector and **A** an matrix with linearly independent columns, then there is a unique vector which minimizes over all , for some appropriate weight matrices **P** and **Q**. Furthermore, = and when **P **and **Q** are nonsingular, = **Q**^{-1}(**AQ ^{-1}**)

^{-1}

**f**, by algebraic duality.

The proof of this last statement is a straightforward generalization of the previous situation involving the weighted observational errors and prior information about the unknown parameters. Using the Matrix Inversion Lemma (also called the Schur Identity), one can write
when the inverses of the weight matrices exist. Notice that the first RHS expression is in terms of the weight matrices while the second is in terms of their inverses, and that the preceding least-squares estimates are obtained with **P** = **I** and **Q** = **0** for a simple overdetermined system, and with **Q**^{-1} = **I** and **P**^{-1} = **0** for a simple underdetermined system.

There is also an extensive theory dealing with the cases of rank deficiency in the matrix **A** implying a singular matrix **A**^{T}**A** or **AA**^{T} in the above expressions. In such cases, special precautions are required to reduce the number of parameters to be estimated or constrain their estimation. In general, the singular value decomposition discussed in the next section provides the best general strategy for linear problems with rank deficiencies.

In geometrical terms, the least-squares approach corresponds to an approximation using a normal (i.e., orthogonal) projection, and as a BAE, it does not necessarily involve any statistical information. In other words, as shown explicitly above, in all cases of underdetermined, determined, and overdetermined situations, even with weights associated with the observations and parameters, the least-squares solution is simply a weighted average of the observations for each unknown parameter. This is the reason for considering the least-squares approach basically as a mathematical approximation procedure which turns out to be most appropriate for statistical applications (see, e.g., [2] for more general discussions).

However, when interpreting the measurements or observations with finite first and second moments as a Gaussian sample, the average and hence the least-squares estimate becomes the unbiased minimum-variance estimate or the MLE. This is because any sample with finite first and second moments may be identified with a Gaussian sample as the Gaussian or normal distribution is fully specified by the first two moments. This statistical interpretation of least-squares estimates is really useful for error analysis and reliability considerations as in addition to Gaussian implications for the first moment, the second moment information behaves as a Chi-Square () distribution. Gaussian statistics are widely used in the analysis of least-squares estimates largely because of the well-developed theory and wide-ranging practical experience.

The previously introduced weight matrices **P** and **Q** are usually interpreted in the statistical sense as inversely proportional to the covariance matrices of the measurements and unknown parameters, respectively. Using unit proportionality factors, these are explictly
in which the zero subscript corresponds to the mean or expected value. Notice that = **E**[**e**] =** 0 **for unbiasedness while = **E**[] is not necessarily zero in general applications. Using the well-known *covariance propagation law*; that is, for any linear transformation **z** = **Rx**, one has for the corresponding second moment
then the variance of the estimated parameters is readily obtained as
which again shows a formulation in terms of the weights **P** and **Q**, and a dual formulation in terms of their inverses. In practical applications, these two equivalent formulations can be exploited to minimize the computational efforts either in terms of weight (or information) matrices or in terms of covariance matrices. Notice that adding a diagonal term to an arbitrary matrix can be interpreted in several different ways to numerically stabilize the matrix inversion such as, for example, in ridge estimation [8], Tikhonov regularization [9], and variations thereof.

All nonzero covariance matrices and their inverses, the nonzero weight (or information) matrices, are symmetric and positive definite in least-squares estimation. The optimality of the estimates in the sense of minimum variances requires such symmetry and positive definiteness for all the nonzero weight and covariance matrices involved. In some applications, it may be required to control the dynamic range and spectral shape of the covariance of the estimation error and to that end, such methods as Covariance Shaping Least-Squares Estimation [10] can be used advantageously in practice.

For illustration purposes, consider the following situation. Given five measurements with some a priori information about a desired quadratic regression model, that is, , with unknown parameters a, b, and c, the preceding least-squares formulations can be used to provide the estimates for the quadratic polynomial in the interior of the interval spanned by the observations used (see Table 1) for interpolation purposes. Notice that the mathematical model, namely, the quadratic polynomial, was assumed from the context of the experiment or exercise. In general, the model identification needs to be resolved and the situation will be briefly discussed in Section 6.

#### 3. Least-Squares Interpolation and Prediction

Simple linear interpolation consists in an arithmetic mean of quantities, that is,
with all the coefficients 1/. In a more general context, the coefficients would be estimated to optimize the interpolation or prediction. For instance writing and for the observations, = , *i* = 1,, one has the general interpolation formula
that is, with normalized weights . In the presence of correlations between the observations, using the usual matrix notation,
assuming the weight matrix **P** = [] corresponds to the inverse of the covariance matrix **C** = [].

Correspondingly, the least-squares prediction formula
in which the quantities are defined in terms of the correlations between and , including , assuming the expected value of to be zero, **E**[*f*()] = 0. Notice that this expression is of the form of the least-squares solution to an underdetermined linear problem as discussed before. For prediction applications, the correlation terms are often modelled empirically using correlation functions of the separation distance .

When **E**[()] is an unknown constant , the normal equations for the unknown coefficients and are written explicitly as
for some unknown Lagrange multiplier . In general when **E**[()] is modelled as a th degree polynomial, these normal equations become
for some unknown coefficients .

Least-squares prediction problems can be classified differently depending on application and other considerations. In different contexts, deterministic and probabilistic interpretations are used and hence the inferences are different. Three methodologies are mentioned here.(a)Kriging methods with variograms or generalized covariance functions such as, Cov() = or for spatial distance d (see [11] for details).(b)Radial Basis Function (RBF) methods with empirical RBFs as weighing functions, such as, RBF() = log or for radial distance and constant c (see, e.g., [12] for more details). (c)Least-Squares Collocation (LSC) methods with ordinary covariance functions **C***≡***C**_{s} + **C**_{n} with **C**_{s} denoting the signal part and **C**_{n} denoting the noise part of the covariance matrix **C**.

Furthermore, generalized covariance functions (that is with positive power spectra) include ordinary covariance functions and empirical RBFs that can often be interpreted as covariance or correlation functions. Notice that the nonsingularity of the normal equation matrices is always assumed to guarantee a solution without additional constraints.

#### 4. Solution Methodology for Normal Equations

The normal equation matrices **A** or **A** and the like are positive definite symmetric matrices that lend themselves to **L** or **L** decompositions in terms of lower triangular matrices **L**. The best known decomposition algorithm for such a matrix is the Cholesky square-root algorithm which is usually applied simultaneously to the normal equation matrix **C **and right-hand side vector (or column matrix) **F** as follows:
The solution is then obtained as a back substitution in the resulting upper triangular system of equations. The computational effort for normal equations is approximately 1/3 of the effort required using inverse matrix strategies. Furthermore in the case of least-squares adjustments of blocks of stereomodels, photographs and networks of geodetic stations, the normal equation matrix is usually banded to and the Cholesky’s algorithm only requires in such cases. This is really advantageous and has been used in large geodetic network, photogrammetric block adjustments, and most other similar least-squares applications.

Furthermore, in the Cholesky square-root reduction of a normal equation matrix to an upper (or lower) triangular matrix, the numerical conditioning is usually monitored by the magnitude of the computed diagonal elements as these should remain positive for a positive definite symmetric matrix. In some applications, the procedure of monitoring the magnitude of the diagonal elements of the triangular matrix is used to decide on an optimal order for a polynomial model which may be in terms of complex variables. Other similar strategies for numerical analysis of least squares problems based on the triangular decomposition of the normal equations matrix will be mentioned below.

However as mentioned in the previous section, the least-squares prediction problems have a “normal” matrix of the general form
with the matrix **C **symmetric and positive definite, **D** rectangular, and **0** denoting a zero matrix. Such a “normal” matrix is obviously nonpositive definite and hence is not really appropriate for the previous triangular matrix representation. However, the Cholesky’s square-root procedure can be applied to the first equations and then Givens rotations can be applied to transform the remaining rows into the upper triangular form. Givens rotations are applied to two row vectors at any one time to eliminate the first nonzero element of the second row vector thus transforming the system of equations into an upper triangular system for back substitution at any time. Explicit implementation details can be found in [13] and elsewhere. Graphically, the situation is as follows:
Notice that Givens rotations could be applied to the full equations but the preceding strategy is superior in terms of computational efficiency. Givens rotations have excellent numerical stability characteristics but often require slightly more computational efforts than the alternatives.

#### 5. Singular Value Decomposition

For indepth analysis of least-squares results and other related applications, the Singular Value Decomposition (SVD) approach is essential in practice and a brief overview follows. Considering the preceding (rectangular ) matrix **A**, its SVD gives
where **U** is the matrix of (unit) eigenvectors of **A**, **V** is the matrix of (unit) eigenvectors of **A**, and is an matrix with diagonal elements (called the singular values) equal to the square roots of the nonzero eigenvalues of **A** or **A**. By substitution, one readily obtains
where and denote the diagonal matrices of the squares of the singular values of **A **and dimensions and , respectively. The last step in both derivations follows from the orthogonality of the (unit) eigenvectors in **U** and **V**, respectively. Their inverses are, respectively,
as matrix inversion does not change the eigenvectors in the SVD of a symmetric matrix.

The previous least-squares solution for the overdetermined system **A** = **f** is simply
with the special notation = and for the underdetermined case
with the special notation = . These and are usually called generalized inverses of .

From a computational perspective, for an overdetermined system with , it may be more efficient to first perform a **QR** factorization of **A** with **Q** as an matrix with orthogonal columns and an upper triangular matrix **R** of order , and then compute the SVD of **R**, since if **A** = **QR** and **R** = **U**, then the SVD of **A** is given by **A** = (**QU**) . Similarly, for an underdetermined system with , it may be more efficient to first perform an **LQ** factorization of **A** with a lower triangular matrix **L** of order and an matrix **Q** with orthogonal rows, and then compute the SVD of **L**, since if **A** = **LQ** and **L** = **U**, then the SVD of **A** is given by **A** = **U**. The SVD approach is often used in spectral analysis and computing a minimum norm solution for (possibly) rank-deficient linear least-squares and related problems. More discussion of the computational aspects can be found, for example, in [14].

In practice, the SVD of a matrix has been described as “one of the most elegant algorithms in numerical algebra for exposing quantitative information about the structure of a system of linear equations” [15]. In current data assimilation and prediction research using spatiotemporal processes such as in global change and other environmental applications, the SVD approach has become very important for at least three problem areas.

First, considering sequences of discrete data with zero mean for simplicity associated with discrete times , and written in matrix form as
in which the superscripts correspond to the times . Such a convention with columns corresponding to the data sequences at discrete times is quite common in environmental applications. Then a SVD of this data matrix **X **yields **X **= **UΛ**, as discussed above. The columns of **U** are the *Empirical Orthogonal Functions* (EOFs) for the data matrix **X** while the columns of **V** are the corresponding *principal components*. The data transformation **x** or more generally **U*** for a (complex) data vector is usually called a Karhunen-Loève transformation. Such resulting data sequences **z** = **x** are easily seen to be uncorrelated as
which is most useful in practical applications. It is also important to notice that the EOFs can be described as eigenvectors of the corresponding covariance matrix of the available data. These are often called *normal modes* of the measured spatiotemporal process [16]. Since the (power) spectrum of a data sequence is well known to correspond to the spectrum of its (auto) covariance matrix, such normal modes have interesting interpretations in the context of dynamical systems driven by noise (e.g., [17]). Further discussions can be found in [18] and the references therein. An example of simulated application is shown in Figure 1(a) with a spatial pattern of a box followed by two cones over a sinusoidal path, with the resulting time series in Figure 1(b) of 20 occurrences of the pattern of 36 observations. Figure 1(c) shows the first modal (spatial) pattern and the corresponding first singular values which are identical to the input information except for the different scaling in amplitude and sign convention of Mathematica 7 [19]. Analogous simulations can readily be done in two dimensions with various patterns (see [20]). Such simulations show the potential of this methodology in the analysis of environmental and other geophysical time series.

**(a)**

**(b)**

**(c)**

**(d)**

**(e)**

Second, in numerical conditioning analysis for any linear algebraic system of equations, the singular values of the matrix give most relevant information about the propagation of numerical errors from observations to estimated unknown parameters. For instance, considering some symmetric and positive definite matrix **B**, consider the linear algebraic system
and for some small perturbations **u** and **v** such that
then using the spectral norm, it is well known that
in which (**B**) is the condition number of the matrix **B** in terms of its maximum and minimum eigenvalues, (**B**) and (**B**), respectively. This provides a powerful tool for the analysis of relative changes in the unknown parameters implied by some relative perturbations in the observations.

Third, as the nonzero eigenvalues of **A** and **A** are identical, computations, and numerical analysis in filtering and smoothing can take advantage of this fact in using the smaller matrix in the least-squares computations. In several areas of environmental research, enormous quantities of data and complex mathematical models lead to very large normal equation systems that require much computational efforts. Substantial reductions in the Kalman filtering and data assimilation computations are implied by the proper choice of covariance/information formulations but for indepth error analysis of the results, SVD-based techniques become critical (see, e.g., [21]).

#### 6. Model Identification and Reliability Considerations

As previously mentioned, alternatives to least squares (), such as the least magnitude () and least maximum (), have advantages and disadvantages. First, the contributions of errors and especially outliers will increase in the estimation procedure from to . Therefore, in the absence of outliers, would be best while in the presence of large errors, or robust estimation is more appropriate. However considering the linear normal equations with least squares, least squares are selected for most practical applications. Outliers are a well-known problem with least squares and much literature exists to mitigate the implications.

Furthermore, in numerous application contexts, assuming an algebraic formulation, one has to decide on degrees and orders in regression modeling such as in curve fitting and spectrum estimation. Considering a simple set of measurements or observations to be modelled using an algebraic polynomial of the form . If the degree of the polynomial is unknown, then the least-squares approach to estimating can always achieve a perfect fit to the measurements by selecting the polynomial degree equal to . Actually, the variance of the residuals will decrease with higher and higher degrees to become zero for degree . Hence, the least-squares approach cannot be used to decide on the polynomial degree for such regression applications. Furthermore, the approach alone can hardly be used to decide on some other possible mathematical models such as as additional information is necessary for model identification. Such model identification problems have been studied extensively by Akaike [22, 23] and others (see [24] for further discussions and references).

For example, the sample measurements with an algebraic polynomial of the form would lead to an exact fit for degree 4. However, such high degree would likely be unacceptable because of the oscillations between the data points which would imply uncertainty in any prediction. Considering the least-squares estimates for degrees 0, 1, 2, 3, and 4, with corresponding error variances However, considering the (normalized) Akaike Information Criterion (AIC), defined as for approximating (model) parameters using measurements, one obtains corresponding to each degree 0, 1, 2, and 3, respectively, implying an optimal degree 2 for the modeling as 2.90809 is the minimum AIC value. Other examples can be found in [24] and the references therein.

In this context of least squares, the assumption of finite first and second moments with Gaussian statistics interpretation have implications in terms of expected moments for the estimated parameters. Essentially, assuming a given mathematical model, the given variances for the measurements and/or observations and a priori variances for the unknown parameters can be propagated using the variance propagation law into the estimated parameters and interpreted at some confidence level such as 95%. This is the familiar approach in geomatics with error ellipses reflecting the accuracy of measurements and/or observations and the geometrical strength of a network in positioning.

For example, given positional information at two discrete points and with a variance , the mid point located by the arithmetic average of the coordinates of and has a predicted variance when a linear model is known for any intermediate point. However, when the location and/or definition of the mid point is ambiguous or unknown, such as along some fuzzy line, its predicted uncertainty is likely to be much more than . In other words, the uncertainty in any estimation results is attributable to uncertainty in the assumed mathematical model and in the observational information used.

Another illustrative example is in the prediction of some quantity *g*, such as gravity, as a function of known data at the discrete points and with observational variance . At the mid point between and , the average value, that is, , is usually an adequate prediction of the value there but its variance is likely to be greater than . Otherwise, why bother with measurements It should be noticed that in nonlinear and/or non-Gaussian situations, the error propagation is much more complex, and only numerical Monte Carlo simulations offer a general strategy for uncertainty modeling (see [25] for further details and references).

#### 7. Concluding Remarks

Least squares are ubiquitous in applied science and engineering data processing. From a mathematical perspective, a least-squares estimate is a (weighted) mean solution which may be interpreted differently depending on the application context. Furthermore, any linear finite problem, even an ill-posed one, has a unique solution in the “average sense”. Such a solution is a BAE using a minimum quadratic norm or minimum variance, with the flexibility of possible statistical interpretation as MLE for optimal and reliable predictions.

Advantages of the least-squares approach are essentially in the simple assumptions (i.e., finite first and second moments), the unique estimates from linear normal equations, with excellent computational and applicability characteristics. Disadvantages of the least-squares approach are mainly in terms of oversmoothing properties (such as in curve or surface fitting) and relative overemphasis of outlier observations or measurements.

In terms of numerical computations, the least-squares approach has excellent characteristics in terms of stability and efficiency. This is best seen using the SVD approach for any indepth analysis of least-squares results. The readjustment of the geodetic networks in the North American Datum of 1983 has demonstrated that nearly one million unknowns can be handled reliably with only 32 bit arithmetic on conventional computer platforms (see [26, 27]). A better numerical approach would be difficult to find

Furthermore, it is also important to emphasize that least squares are not appropriate for all types of estimation problems as there are numerous application contexts where a best observation or measurement value needs to be selected among the available ones (as the most frequent value or the one with minimum error). In other application contexts when dealing with observations or measurements likely to be affected by outliers, a more robust estimate such as a median value or estimate may be preferable. No single estimation method can be considered best or optimal for all applications as data characteristics and desired estimates need to be considered.

Finally, some areas of current research and development in least squares and computational analysis include multiresolution analysis and synthesis, data regularization and fusion, EOFs of multidimensional time sequences, RBFs, and related techniques for optimal data assimilation and prediction, especially in spatiotemporal processes. With scientific data generally considered to be increasing faster than computational power, real challenges in the analysis of current observations and measurements abound and strategies have to become more sophisticated.

#### Acknowledgments

The author would like to acknowledge the sponsorship of the Natural Science and Engineering Research Council in the form of a Research Grant on Computational Tools for the Geosciences. Comments and suggestions from the reviewers are also acknowledged with gratitude.