 # The Weighted Linear Least Squares Algorithm

$$\def\myT{\mathsf{T}} \def\myPhi{\mathbf{\Phi}}$$

In this blog post we take a look at the so-called weighted linear least squares estimator, which is very similar to ordinary linear least squares but comes with one slight modification. While the ordinary estimator assumes that the errors of all data points have the same variance (typically referred to as homoscedasticity) and therefore assigns the same weight to every error in the objective function, the weighted counterpart allows us to weight each error individually. This is especially interesting in heteroscedastic settings, i.e., when the variability of the errors cannot be assumed to be the same for all data points.

# The Ordinary Least Squares Estimator

Let us start with the definition of a linear function (a function which is linear in its parameters):

\begin{align} y = \theta_0 + \sum_{i=1}^{k} \theta_i \phi_i(x_1, x_2, \ldots, x_\ell), \label{eq:linearModel} \end{align}

where $\theta_i$ is the $i$-th parameter of the linear model ($\theta_0$ is the so-called bias) and $\phi_i$ returns a scalar computed from the inputs $x_1, x_2, \ldots, x_\ell$. Since $y$ only has to be linear in its parameters $\theta_i$, there is no requirement for $\phi_i$ to be a linear function: $\phi_i$ could, for example, simply be $\phi_i = x_i$, a polynomial, a radial basis function, or something else entirely. For simplicity, we write equation \eqref{eq:linearModel} in the following as:

\begin{align} y = \vec{\theta}^{\,\myT} \vec{\phi}(\vec{x}), \end{align}

with

\begin{align} \vec{\theta} = \begin{pmatrix} \theta_0 \\ \theta_1 \\ \vdots \\ \theta_k \end{pmatrix} \quad \text{and} \quad \vec{\phi}(\vec{x}) = \begin{pmatrix} 1 \\ \phi_1(\vec{x}) \\ \vdots \\ \phi_k(\vec{x}) \end{pmatrix}. \label{eq:theta} \end{align}

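To make the role of the basis functions concrete, here is a small Python/NumPy sketch (the function names and parameter values are made up for illustration) of a model with a polynomial basis, which is nonlinear in $x$ but linear in the parameters $\theta_i$:

```python
import numpy as np

def phi(x):
    # Basis vector (1, x, x^2): a polynomial basis, nonlinear in x,
    # while the model remains linear in the parameters theta.
    return np.array([1.0, x, x**2])

theta = np.array([1.0, 0.5, -2.0])  # (theta_0, theta_1, theta_2)

def y(x):
    # y = theta^T phi(x)
    return theta @ phi(x)

print(y(2.0))  # 1.0 + 0.5*2 - 2.0*4 = -6.0
```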
Now let us assume that we have collected a set of $n$ data points $(\vec{x}^{(1)}, y_*^{(1)}), (\vec{x}^{(2)}, y_*^{(2)}), \ldots, (\vec{x}^{(n)}, y_*^{(n)})$ for which we want to build a linear model. The ordinary linear least squares estimator determines the parameters $\theta_i$ of the linear model by minimizing the sum of squared errors plus an L2 penalty for regularization (where $\lambda$ is the so-called regularization parameter):

\begin{align} E(\vec{\theta}) = \sum_{i=1}^{n} \big( \vec{\theta}^{\,\myT} \vec{\phi}(\vec{x}^{(i)}) - y_*^{(i)} \big)^2 + \lambda \vec{\theta}^{\,\myT} \vec{\theta}, \label{eq:MSE} \end{align}

which leads to the well-known equation for ridge regression:

\begin{align} \vec{\theta} = \big( \myPhi^\myT \myPhi + \lambda \mathbf{I} \big)^{-1} \myPhi^\myT \vec{y}_*, \end{align}

where $\myPhi$ is a $n \times (k+1)$ matrix:

\begin{align} \myPhi = \begin{pmatrix} 1 & \phi_1(\vec{x}^{(1)}) & \cdots & \phi_k(\vec{x}^{(1)}) \\ 1 & \phi_1(\vec{x}^{(2)}) & \cdots & \phi_k(\vec{x}^{(2)}) \\ \vdots & \vdots & \ddots & \vdots \\ 1 & \phi_1(\vec{x}^{(n)}) & \cdots & \phi_k(\vec{x}^{(n)}) \end{pmatrix}, \label{eq:Phi} \end{align}
and $\vec{y}_*$ is a $n$-dimensional vector: \begin{align} \vec{y}_* = \begin{pmatrix} y_*^{(1)} \\ y_*^{(2)} \\ \vdots \\ y_*^{(n)} \end{pmatrix}. \end{align}
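As a quick sanity check of the closed-form solution, here is a minimal NumPy sketch (the sources in the appendix are written in R; the data here are a made-up noise-free line and $\lambda$ is chosen tiny, so the fit recovers the true parameters):

```python
import numpy as np

# Made-up sample: 5 points on the noise-free line y = 2x + 1
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = 2.0 * x + 1.0

# Design matrix Phi with a column of ones for the bias theta_0 (k = 1)
Phi = np.column_stack([np.ones_like(x), x])
lam = 1e-8  # tiny regularization parameter lambda

# Ridge solution: theta = (Phi^T Phi + lambda I)^{-1} Phi^T y_*
theta = np.linalg.solve(Phi.T @ Phi + lam * np.eye(Phi.shape[1]), Phi.T @ y)
print(theta)  # approximately [1.0, 2.0]: intercept and slope
```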

# The Weighted Least Squares Estimator

Now let us add a tiny change to equation \eqref{eq:MSE}: we simply introduce a weight $w_i$ with which we can weight the error of each individual data point (we also add some factors of $\frac{1}{2}$ for convenience, which do not change the solution fundamentally):

\begin{align} E(\vec{\theta}) = \frac{1}{2} \sum_{i=1}^{n} w_i \big( \vec{\theta}^{\,\myT} \vec{\phi}(\vec{x}^{(i)}) - y_*^{(i)} \big)^2 + \frac{\lambda}{2} \vec{\theta}^{\,\myT} \vec{\theta}. \end{align}

In order to minimize the error $E$ defined in the above equation, let us first compute the gradient and set it to zero:

\begin{align} \nabla_{\vec{\theta}} E = \sum_{i=1}^{n} w_i \big( \vec{\theta}^{\,\myT} \vec{\phi}(\vec{x}^{(i)}) - y_*^{(i)} \big) \vec{\phi}(\vec{x}^{(i)}) + \lambda \vec{\theta} = \vec{0}. \end{align}

Now let us try to vectorize the above equations and bring them into matrix form. First, let us bring all terms containing $-y_*^{(i)}$ to the other side of the equation:

\begin{align} \sum_{i=1}^{n} w_i \big( \vec{\theta}^{\,\myT} \vec{\phi}(\vec{x}^{(i)}) \big) \vec{\phi}(\vec{x}^{(i)}) + \lambda \vec{\theta} = \sum_{i=1}^{n} w_i y_*^{(i)} \vec{\phi}(\vec{x}^{(i)}). \end{align}

In the next step we can factor out the vector $\vec{\theta}$, as defined in \eqref{eq:theta}, and write:

\begin{align} \Big( \sum_{i=1}^{n} w_i \vec{\phi}(\vec{x}^{(i)}) \vec{\phi}(\vec{x}^{(i)})^\myT + \lambda \mathbf{I} \Big) \vec{\theta} = \sum_{i=1}^{n} w_i y_*^{(i)} \vec{\phi}(\vec{x}^{(i)}). \label{eq:weightedLSMatrix1} \end{align}

Now, let us again define $\myPhi$ in the same way as in \eqref{eq:Phi} and introduce a $n \times n$ matrix $\mathbf{W}$ which contains all weights $w_i$ on its diagonal:

\begin{align} \mathbf{W} = \begin{pmatrix} w_1 & 0 & \cdots & 0 \\ 0 & w_2 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & w_n \end{pmatrix}. \end{align}

Using the standard correspondences between sums and matrix products ($\sum_i w_i \vec{\phi}(\vec{x}^{(i)}) \vec{\phi}(\vec{x}^{(i)})^\myT = \myPhi^\myT \mathbf{W} \myPhi$ and $\sum_i w_i y_*^{(i)} \vec{\phi}(\vec{x}^{(i)}) = \myPhi^\myT \mathbf{W} \vec{y}_*$), we can now significantly simplify Eq. \eqref{eq:weightedLSMatrix1}:

\begin{align} \big( \myPhi^\myT \mathbf{W} \myPhi + \lambda \mathbf{I} \big) \vec{\theta} = \myPhi^\myT \mathbf{W} \vec{y}_*. \end{align}

With the above equation we can finally find the desired parameters of the linear model with weighted squared residuals:

\begin{align} \vec{\theta} = \big( \myPhi^\myT \mathbf{W} \myPhi + \lambda \mathbf{I} \big)^{-1} \myPhi^\myT \mathbf{W} \vec{y}_*. \end{align}

Since $\mathbf{W}$ is an $n \times n$ matrix, in practice one should avoid actually generating this diagonal matrix for large datasets. Instead, for a diagonal weight matrix $\mathbf{W} = \mbox{diag}(\vec{w})$ and the $(k+1) \times n$ matrix $\mathbf{\Psi} = \myPhi^\myT$, one can directly use the relation:

\begin{align} \mathbf{\Psi} \mathbf{W} = \begin{pmatrix} w_1 \vec{\psi}_1 & w_2 \vec{\psi}_2 & \cdots & w_n \vec{\psi}_n \end{pmatrix}, \end{align}

where $\vec{\psi}_i$ is the $i$-th column of $\mathbf{\Psi}$. Hence, each column simply has to be multiplied by its corresponding weight.
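This column-scaling trick can be sketched in NumPy as follows (the data and the true line $y = 2x + 1$ are made up for illustration; broadcasting replaces the explicit diagonal matrix):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
x = rng.uniform(-5.0, 5.0, size=n)
# Heteroscedastic noise: most points have sigma = 1, some have sigma = 5
sigma = np.where(rng.uniform(size=n) < 0.25, 5.0, 1.0)
y = 2.0 * x + 1.0 + rng.normal(0.0, sigma)  # made-up true line y = 2x + 1

Phi = np.column_stack([np.ones(n), x])  # n x (k+1) design matrix, k = 1
w = 1.0 / sigma**2                      # inverse-variance weights
lam = 0.0                               # no regularization here

# Phi^T W equals Phi^T with column i scaled by w_i, which is the same as
# scaling row i of Phi by w_i before transposing -- no n x n matrix needed:
PhiTW = (Phi * w[:, None]).T
theta = np.linalg.solve(PhiTW @ Phi + lam * np.eye(2), PhiTW @ y)
print(theta)  # close to [1.0, 2.0] (intercept, slope)
```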

# Example

Let us take a look at a small example. Assume we have sampled some data from a stochastic linear function

\begin{align} y(x) = m \cdot x + b + \epsilon(x), \end{align}

where $\epsilon(x)$ is some random noise which we do not know. The data sample could look like this:

*Figure 1: A data sample for which we want to fit a linear model.*

We evaluated each $x$ several times and can observe that the noise seems to vary with the value of $x$. The linear correlation between the two variables in the plot is quite apparent, although it is not so clear how to fit the line through the data points: for those $x$ with a large spread in the observed values, it is difficult to estimate where the line should approximately pass through, and outliers might have a large impact on the estimated parameters of the linear model. Hence, we can assign a weight to each $x$ which is higher for those $x$ with a small spread in their values and vice versa. A reasonable choice is the inverse of the variance at each $x$, so that we get:

\begin{align} w_i = \frac{1}{\sigma_{x_i}^2}, \end{align}

where $\sigma_{x_i}^2$ is the variance at $x = x_i$. Now let us apply both the ordinary linear least squares estimator and its weighted version to the collected data and see what happens:

*Figure 2: Linear model fits for the dataset. The real curve is represented by the red line. The green line is the estimate of the conventional linear least squares estimator and the blue line is the estimate of the weighted linear least squares estimator.*

As we can see, the estimate of the weighted linear least squares estimator is much closer to the real line than that of the conventional estimator. Since this might just be a coincidence for this particular dataset, let us repeat the procedure several thousand times. For this, we specify the real line to be

and set the noise $\epsilon(x)$ to be normally distributed with zero mean and a standard deviation $\sigma_x$ which is selected randomly with probabilities $\mathbb P(\sigma_x = 1) = 3/4$ and $\mathbb P(\sigma_x = 5) = 1/4$. In each run we create a dataset with $N = 55$ points ($x \in \{-5, -4, \ldots, 4, 5\}$ with 5 points for each $x$) and compute the errors of the weighted and unweighted linear least squares estimators in the slope $m$ and the intercept $b$. We repeat this 10 000 times and sum up the squared errors (SSE) of both estimators for both parameters. The output of our simulation is:

As we can see, for this example the error of the weighted estimator is about $1/3$ (approx. $1/\sqrt{3}$ for the sum of absolute errors) of that of its unweighted counterpart, for both the intercept $b$ and the slope $m$. Finally, let us plot some histograms of the errors of the two estimators:

*Figure 3: Histograms of the errors in the estimates of the intercept $b$ and the slope $m$ for the unweighted and weighted least squares. The black lines specify the intervals containing 95% of the observed errors.*

The above figure also shows that the weighted estimator generally produces smaller errors, since its histograms have sharper peaks around zero. The sources for all reported results can be found in the appendix.
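The simulation loop can be sketched as follows (a hedged Python/NumPy sketch rather than the original R sources: the true slope $m = 2$ and intercept $b = 1$ are hypothetical placeholders, and only 1000 runs are used here for brevity):

```python
import numpy as np

rng = np.random.default_rng(42)

def one_run():
    # x in {-5, ..., 5} with 5 observations each -> N = 55 data points
    x = np.repeat(np.arange(-5, 6), 5).astype(float)
    # One noise level per x-value: sigma = 1 with prob 3/4, sigma = 5 with prob 1/4
    sig_per_x = np.where(rng.uniform(size=11) < 0.75, 1.0, 5.0)
    sigma = np.repeat(sig_per_x, 5)
    # Hypothetical "real" line y = 2x + 1 (placeholder values)
    y = 2.0 * x + 1.0 + rng.normal(0.0, sigma)

    Phi = np.column_stack([np.ones_like(x), x])
    # Unweighted ordinary least squares estimate
    th_u = np.linalg.lstsq(Phi, y, rcond=None)[0]
    # Weighted estimate with inverse-variance weights w_i = 1 / sigma_i^2
    w = 1.0 / sigma**2
    PhiTW = (Phi * w[:, None]).T  # Phi^T W without forming the diagonal matrix
    th_w = np.linalg.solve(PhiTW @ Phi, PhiTW @ y)
    return th_u, th_w

true_params = np.array([1.0, 2.0])  # (intercept b, slope m)
runs = [one_run() for _ in range(1000)]
sse_unweighted = sum(np.sum((u - true_params) ** 2) for u, _ in runs)
sse_weighted = sum(np.sum((v - true_params) ** 2) for _, v in runs)
print(sse_unweighted, sse_weighted)  # the weighted SSE is typically much smaller
```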

# Appendix

Source code: `images/2018-03-13-the-weighted-least-squares-algorithm/weightedLSE.R`

### Markus Thill

I studied computer engineering (B.Sc.) and Automation & IT (M.Eng.). Generally, I am interested in machine learning (ML) approaches (in the broadest sense), but particularly in the fields of time series analysis, anomaly detection, Reinforcement Learning (e.g. for board games), Deep Learning (DL) and incremental (on-line) learning procedures.
