The Weighted Linear Least Squares Algorithm



In this blog post we take a look at the so-called weighted linear least squares estimator, which is very similar to ordinary linear least squares, but with one slight modification: while the ordinary estimator assumes that the errors of all data points have the same variance (typically referred to as homoscedasticity) and therefore assigns the same weight to every error in the objective function, the weighted counterpart allows us to weight each individual error. This is especially interesting in heteroscedastic settings, i.e., when the variability of the errors cannot be assumed to be the same for all data points.

The Ordinary Least Squares Estimator

Let us start with the definition of a linear function (a function linear in its parameters):

$$ f(\mathbf{x}) = \theta_0 + \sum_{j=1}^{n} \theta_j \phi_j(\mathbf{x}), \label{eq:linearModel} $$

where $\theta_j$ is the $j$-th parameter of the linear model ($\theta_0$ is the so-called bias) and each basis function $\phi_j(\mathbf{x})$ returns a scalar which is computed based on the inputs $\mathbf{x}$. Since $f$ is linear in its parameters $\theta_j$, there is no requirement for the $\phi_j$ to be linear functions. A $\phi_j(\mathbf{x})$ could, for example, simply be a component $x_j$ of the input, a polynomial, a radial basis function, or something completely different. For reasons of simplicity we write equation \eqref{eq:linearModel} in the following as:

$$ f(\mathbf{x}) = \boldsymbol{\theta}^\mathsf{T} \boldsymbol{\phi}(\mathbf{x}), \label{eq:theta} $$

with the parameter vector $\boldsymbol{\theta} = (\theta_0, \theta_1, \ldots, \theta_n)^\mathsf{T}$ and the vector of basis functions $\boldsymbol{\phi}(\mathbf{x}) = \bigl(1, \phi_1(\mathbf{x}), \ldots, \phi_n(\mathbf{x})\bigr)^\mathsf{T}$.
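To make the role of the basis functions concrete, here is a minimal R sketch; the polynomial basis and the parameter values are chosen purely for illustration:

```r
# A model that is linear in its parameters theta, but not in the input x:
# here we pick the polynomial basis phi(x) = (1, x, x^2) as an example.
phi <- function(x) c(1, x, x^2)
theta <- c(2, 1, 0.5)                  # hypothetical parameters; theta_0 = 2 is the bias
f <- function(x) sum(theta * phi(x))   # f(x) = theta^T phi(x)
f(2)  # 2 + 1*2 + 0.5*2^2 = 6
```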

Now let us assume that we have collected a set of $N$ data points $(\mathbf{x}_i, y_i)$, $i = 1, \ldots, N$, for which we want to build a linear model. The ordinary linear least squares estimator determines the parameters of the linear model by minimizing the sum of squared errors plus an L2 penalty for regularization (where $\lambda \ge 0$ is the so-called regularization parameter):

$$ E(\boldsymbol{\theta}) = \sum_{i=1}^{N} \bigl( y_i - \boldsymbol{\theta}^\mathsf{T} \boldsymbol{\phi}(\mathbf{x}_i) \bigr)^2 + \lambda \lVert \boldsymbol{\theta} \rVert_2^2, \label{eq:MSE} $$

which leads to the well-known equation for ridge regression:

$$ \boldsymbol{\theta} = \bigl( \mathbf{\Phi}^\mathsf{T} \mathbf{\Phi} + \lambda \mathbf{I} \bigr)^{-1} \mathbf{\Phi}^\mathsf{T} \mathbf{y}, $$

where $\mathbf{\Phi}$ is a $N \times (n+1)$ matrix:

$$ \mathbf{\Phi} = \begin{pmatrix} \boldsymbol{\phi}(\mathbf{x}_1)^\mathsf{T} \\ \boldsymbol{\phi}(\mathbf{x}_2)^\mathsf{T} \\ \vdots \\ \boldsymbol{\phi}(\mathbf{x}_N)^\mathsf{T} \end{pmatrix} \label{eq:Phi} $$

and $\mathbf{y}$ is a $N$-dimensional vector:

$$ \mathbf{y} = (y_1, y_2, \ldots, y_N)^\mathsf{T}. $$
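As a quick sketch of this closed-form solution in R, using hypothetical toy data, the identity basis $\boldsymbol{\phi}(x) = (1, x)^\mathsf{T}$ and an arbitrarily chosen small $\lambda$:

```r
# Ridge solution: theta = (Phi^T Phi + lambda*I)^(-1) Phi^T y
set.seed(1)
x <- seq(-5, 5, by = 0.5)
y <- 2 + x + rnorm(length(x), sd = 0.5)   # noisy samples of the line y = x + 2
Phi <- cbind(1, x)                        # N x 2 design matrix (bias column and x)
lambda <- 1e-3                            # small, arbitrary regularization parameter
theta <- solve(t(Phi) %*% Phi + lambda * diag(2), t(Phi) %*% y)
theta  # should be close to (2, 1)
```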
Weighted Least Squares Estimator

Now let us add a tiny change to equation \eqref{eq:MSE}. We simply introduce a weight $w_i \ge 0$ for each data point, with which we can weight the errors of the individual data points (we also add some factors of $\frac{1}{2}$ for convenience, which do not change the solution fundamentally):

$$ E(\boldsymbol{\theta}) = \frac{1}{2} \sum_{i=1}^{N} w_i \bigl( y_i - \boldsymbol{\theta}^\mathsf{T} \boldsymbol{\phi}(\mathbf{x}_i) \bigr)^2 + \frac{\lambda}{2} \lVert \boldsymbol{\theta} \rVert_2^2. $$
In order to minimize the error defined in the above equation, let us first compute the gradient with respect to $\boldsymbol{\theta}$ and set it to zero:

$$ \nabla_{\boldsymbol{\theta}} E = -\sum_{i=1}^{N} w_i \bigl( y_i - \boldsymbol{\theta}^\mathsf{T} \boldsymbol{\phi}(\mathbf{x}_i) \bigr) \boldsymbol{\phi}(\mathbf{x}_i) + \lambda \boldsymbol{\theta} = \mathbf{0}. $$

Now let us try to vectorize the above equation and bring it into matrix form. First, let us bring all terms containing $\boldsymbol{\theta}$ to the other side of the equation:

$$ \sum_{i=1}^{N} w_i\, y_i\, \boldsymbol{\phi}(\mathbf{x}_i) = \sum_{i=1}^{N} w_i\, \boldsymbol{\phi}(\mathbf{x}_i) \boldsymbol{\phi}(\mathbf{x}_i)^\mathsf{T} \boldsymbol{\theta} + \lambda \boldsymbol{\theta}. $$
In the next step we can factor out the parameter vector $\boldsymbol{\theta}$, as specified in \eqref{eq:theta}, and write:

$$ \sum_{i=1}^{N} w_i\, y_i\, \boldsymbol{\phi}(\mathbf{x}_i) = \left( \sum_{i=1}^{N} w_i\, \boldsymbol{\phi}(\mathbf{x}_i) \boldsymbol{\phi}(\mathbf{x}_i)^\mathsf{T} + \lambda \mathbf{I} \right) \boldsymbol{\theta}. \label{eq:weightedLSMatrix1} $$

Now, let us again define $\mathbf{\Phi}$ in the same way as in \eqref{eq:Phi} and introduce a $N \times N$ matrix $\mathbf{W}$, which contains all weights on its diagonal:

$$ \mathbf{W} = \operatorname{diag}(w_1, w_2, \ldots, w_N). $$
Using the correspondences $\sum_i w_i y_i \boldsymbol{\phi}(\mathbf{x}_i) = \mathbf{\Phi}^\mathsf{T} \mathbf{W} \mathbf{y}$ and $\sum_i w_i \boldsymbol{\phi}(\mathbf{x}_i) \boldsymbol{\phi}(\mathbf{x}_i)^\mathsf{T} = \mathbf{\Phi}^\mathsf{T} \mathbf{W} \mathbf{\Phi}$, we can now significantly simplify Eq. \eqref{eq:weightedLSMatrix1}:

$$ \mathbf{\Phi}^\mathsf{T} \mathbf{W} \mathbf{y} = \bigl( \mathbf{\Phi}^\mathsf{T} \mathbf{W} \mathbf{\Phi} + \lambda \mathbf{I} \bigr) \boldsymbol{\theta}. $$

With the above equation we can finally find the desired parameters of the linear model with weighted squared residuals:

$$ \boldsymbol{\theta} = \bigl( \mathbf{\Phi}^\mathsf{T} \mathbf{W} \mathbf{\Phi} + \lambda \mathbf{I} \bigr)^{-1} \mathbf{\Phi}^\mathsf{T} \mathbf{W} \mathbf{y}. $$
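A minimal sketch of this estimator in R, on hypothetical heteroscedastic toy data (the noise levels and inverse-variance weights are made up for illustration):

```r
# Weighted solution: theta = (Phi^T W Phi + lambda*I)^(-1) Phi^T W y
set.seed(2)
x   <- rep(-5:5, each = 5)                                 # five samples per x value
sds <- rep(sample(c(1, 5), 11, replace = TRUE), each = 5)  # per-x noise level
y   <- 2 + x + rnorm(length(x), sd = sds)                  # noisy samples of y = x + 2
Phi <- cbind(1, x)
W   <- diag(1 / sds^2)            # inverse-variance weights on the diagonal
lambda <- 1e-3
tPhiW <- t(Phi) %*% W
theta <- solve(tPhiW %*% Phi + lambda * diag(2), tPhiW %*% y)
```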
Since $\mathbf{W}$ is a $N \times N$ matrix, one should avoid actually generating this diagonal matrix in practice for large datasets. Instead, for a diagonal weight matrix $\mathbf{W} = \operatorname{diag}(w_1, \ldots, w_N)$ and the $N \times (n+1)$ matrix $\mathbf{\Phi}$, one can directly use the relation:

$$ \mathbf{\Phi}^\mathsf{T} \mathbf{W} = \begin{pmatrix} w_1 \boldsymbol{\phi}(\mathbf{x}_1) & w_2 \boldsymbol{\phi}(\mathbf{x}_2) & \cdots & w_N \boldsymbol{\phi}(\mathbf{x}_N) \end{pmatrix}. $$

Hence, each column of $\mathbf{\Phi}^\mathsf{T}$ simply has to be multiplied with its corresponding weight.
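In R this amounts to a simple element-wise multiplication; a quick sanity check with a small random matrix confirms that no explicit diagonal matrix is needed:

```r
set.seed(3)
Phi <- cbind(1, rnorm(6))   # small 6 x 2 design matrix, just for the check
w   <- runif(6)             # one weight per data point
A <- t(Phi) %*% diag(w)     # explicit N x N diagonal matrix (avoid for large N)
B <- t(Phi * w)             # rows of Phi (= columns of t(Phi)) scaled by w
max(abs(A - B))             # -> 0 (identical up to floating point)
```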


Let us take a look at a small example. Assume we have sampled some data from a stochastic linear function

$$ y = m x + b + \epsilon, $$

where $\epsilon$ is some random noise which we do not know. The data sample could look like this:


Figure 1: A data sample for which we want to fit a linear model.

We evaluated each $x$ several times and can observe that the noise seems to vary for each value of $x$. The linear correlation between the two variables in the plot is quite apparent, although it is not that clear how to fit a line through the data points. For those $x$ for which we have a large spread in the observed values, it is difficult to estimate where the line should approximately pass through. Also, outliers might have a large impact on the estimation of the parameters of the linear model. Hence, we can assign weights to each $x_i$ which are higher for those $x_i$ that have a small spread in their observed values and vice versa. A reasonable weight is the inverse of the variance at each $x_i$, so that we get:

$$ w_i = \frac{1}{\sigma_i^2}, $$

where $\sigma_i^2$ is the variance of the $y$ values observed at $x_i$. Now let us apply both the linear least squares estimator and its weighted version to the collected data and see what happens:


Figure 2: Linear model fits for the dataset. The real curve is represented by the red line. The green line is based on the estimation of the conventional linear least squares estimator and the blue line is the estimate of the weighted linear least squares estimator.

As we can see, the estimate of the weighted linear least squares estimator is much closer to the real line than that of the conventional estimator. Since this might just be a coincidence for this particular dataset, let us repeat the procedure several thousand times. For this, we specify the real line to be

$$ y = x + 2 \qquad (m = 1,\ b = 2) $$

and set the noise to be normally distributed with zero mean and a standard deviation which, for each $x$, is selected randomly as $\sigma = 1$ with probability $3/4$ or $\sigma = 5$ with probability $1/4$. In each run we create a dataset with $55$ points ($x \in \{-5, -4, \ldots, 5\}$ and 5 points for each $x$) and compute the error of the weighted and unweighted linear least squares estimator in the slope $m$ and the intercept $b$. We repeat this 10 000 times and sum up the squared errors (SSE) of both estimators for both parameters. The output of our simulation is:

SSE unweighted: b  SSE unweighted: m   SSE weighted: b   SSE weighted: m
          1275.48             124.82            410.00             40.65

As we can see, for this example the squared error of the weighted estimator is roughly one third of that of its unweighted counterpart, both for the intercept $b$ and for the slope $m$; the sums of absolute errors (not shown) also favor the weighted estimator. Finally, let us plot some histograms for the errors of the two estimators:


Figure 3: Histograms of the errors in the estimations of the intercept and the slope for the unweighted and weighted least squares. The black lines specify the intervals containing 95% of the observed errors.

Also in the above figure we can see that the weighted estimator generally produces smaller errors, since its error densities are more sharply peaked around zero. The sources for all reported results can be found in the appendix.


images/2018-03-13-the-weighted-least-squares-algorithm/weightedLSE.R:

library(MASS)      # for ginv()
library(reshape2)  # for melt()
library(ggplot2)   # for plotting
library(dplyr)     # for the quantile summary below

N = 5     # number of points for each x
b = 2     # offset of the linear function
m = 1     # slope of the linear function
x = -5:5  # range of x values to generate points for

testWeightedLS <- function(doPlot = F) {
  # Generate N data points around the line for each x, with a randomly
  # chosen noise level (sd = 1 with prob. 3/4, sd = 5 with prob. 1/4)
  data <- t(sapply(x, function(i) c(i, m*i+b + rnorm(n=N, mean = 0, sd = sample(c(1,1,1,5),1)))))
  dataAligned <- melt(as.data.frame(data), variable.name = "key", id.vars = c(1))
  # Data used for training the models
  xTrain <- dataAligned[,1]
  yTrain <- dataAligned[,3]
  # Estimate the parameters with the unweighted least squares estimator
  thetaNoW = ginv(cbind(1,xTrain)) %*% yTrain
  # Compute the sample variance of the y values for each x
  estVars <- cbind(data[,1], apply(data[,c(2:ncol(data))], 1, var))
  w <- sapply(xTrain, FUN=function(i) estVars[which(estVars[,1] == i),2])
  # Weighting matrix W based on the inverse variance for each x
  # n x n matrix! Should be done differently for larger data sets
  W <- diag(1/w)
  # Compute the parameters for the weighted least squares estimator
  X = cbind(1,xTrain)
  tXW = t(X) %*% W
  thetaW = ginv( tXW %*% X ) %*% tXW %*% yTrain
  if(doPlot) {
    df <- data.frame(x=xTrain, y=yTrain)
    plotDf <- melt(df, id.vars=c("x"))
    cc <- data.frame(sl = c(thetaNoW[2],thetaW[2], m), 
                     int = c(thetaNoW[1],thetaW[1], b), 
                     Estimator = c('unweighted','weighted','real'))
    p <- ggplot(data = plotDf, aes(x=x,y=value)) +
      geom_point() +
      theme_bw(base_size=25) +
      geom_abline(data = cc, aes(slope = sl, intercept = int, colour = Estimator))
    print(p)
  }
  # Return the errors of the estimated parameters
  c(thetaNoW - c(b,m), thetaW - c(b,m))
}

runErrs <- t(replicate(10000, testWeightedLS()))
colnames(runErrs) <- c("Error unweighted LSE: b", "Error unweighted LSE: m", "Error weighted LSE: b", "Error weighted LSE: m")
apply((runErrs)^2, 2, mean) # Mean squared error of the estimated parameters
apply((runErrs)^2, 2, sum)  # Sum of squared errors of the estimated parameters
apply(abs(runErrs), 2, sum) # Sum of absolute errors of the estimated parameters

# Prepare data for histogram
errs_b <- runErrs[,c(1,3)]
colnames(errs_b) <- c("unweighted", "weighted")
histData_b <- data.frame(melt(errs_b)[,-1], param = "b")

errs_m <- runErrs[,c(2,4)]
colnames(errs_m) <- c("unweighted", "weighted")
histData_m <- data.frame(melt(errs_m)[,-1], param = "m")

histData <- rbind(histData_b, histData_m)
colnames(histData) <- c("method", "error", "param")

# Compute 2.5% and 97.5% quantiles
d2 <- histData %>%
  group_by(method, param) %>%
  summarize(lower = quantile(error, probs = .025),
            upper = quantile(error, probs = .975))

# Plot Histogram
ggplot(histData, aes(x = error)) +
  facet_grid(method ~ param) +
  geom_density(aes(colour = method)) +
  theme_bw(base_size=25) +
  geom_vline(data = d2, aes(xintercept = lower)) +
  geom_vline(data = d2, aes(xintercept = upper)) +
  xlim(c(-1,1))

Markus Thill

I studied computer engineering (B.Sc.) and Automation & IT (M.Eng.). Generally, I am interested in machine learning (ML) approaches (in the broadest sense), but particularly in the fields of time series analysis, anomaly detection, Reinforcement Learning (e.g. for board games), Deep Learning (DL) and incremental (on-line) learning procedures.
