Memory of the exponentially decaying Estimator for Mean and Covariance Matrix

\(
\def\myX{\mathbf{x}} \def\myM{\mathbf{M}} \def\myY{\mathbf{y}} \def\mySigma{\mathbf{\Sigma}} \def\myT{\mathsf{T}}
\)

This post is part 4 of a series of five articles:

  1. Online Maximum Likelihood Estimation of (multivariate) Gaussian Distributions
  2. Online Estimation of Weighted Sample Mean and Covariance Matrix
  3. The Covariance of weighted Means
  4. Memory of the exponentially decaying Estimator for Mean and Covariance Matrix
  5. Online Estimation of the Inverse Covariance Matrix

Due to the exponentially decaying weights, the estimator described in the previous post has a limited historic memory, since older observations fade out more and more with every new data point. Such an approach makes it possible to adapt the parameters to drifting (changing) distributions, however, at the cost of less accuracy when the data-generating process is a stationary distribution. This means, for example, that for forgetting factors \(\lambda < 1\), the (co-)variances of the mean vector do not converge to zero for large sample sizes and there will always be some fixed amount of noise left in the estimation, depending on the "rate" of forgetting (specified by \(\lambda\)). So one could ask the following two questions:

  1. If we compute a weighted mean \(\bar{\myX}\) with forgetting factor \(\lambda\), how is this mean distributed, hence, what are \(\operatorname{E}[\bar{\myX}]\) and \(\operatorname{Cov}(\bar{\myX})\)?
  2. What is the memory \(N\) of an estimator with exponentially decaying weights? Hence: for which sample size \(N\), when computing the ordinary mean (without forgetting) instead, do we obtain the same distribution parameters \(\operatorname{E}[\bar{\myX}]\) and \(\operatorname{Cov}(\bar{\myX})\)? This means that we are basically looking for a corresponding ordinary estimator which only takes into account the last \(N\) data points in order to compute the mean.
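To get a feeling for how quickly old observations fade out, the following small sketch (a hypothetical illustration, using the forgetting factor \(\lambda = 0.99\) that also appears later in this post) prints the weight \(\lambda^k\) that an observation of age \(k\) receives:

  lambda <- 0.99                    # forgetting factor
  k <- c(0, 10, 100, 500, 1000)     # age of an observation (0 = most recent)
  round(lambda^k, 4)                # its weight lambda^k
  # approx.: 1.0000 0.9044 0.3660 0.0066 0.0000

Observations that are a few hundred steps old contribute almost nothing to the estimate anymore, which is exactly the limited memory we want to quantify.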

Intuitively, one could guess for question 2 that the sample size \(N\) for the ordinary mean corresponds to the value of the normalization factor \(W_n\), since, by the geometric series, \(W_n\) converges for large \(n\) towards a fixed value (for \(\lambda < 1\)):

\[
W_n = \sum_{i=1}^{n} \lambda^{\,n-i} = \frac{1-\lambda^{n}}{1-\lambda} \;\xrightarrow{\; n \to \infty \;}\; \frac{1}{1-\lambda}.
\]

For a forgetting factor of \(\lambda = 0.99\), \(W_n\) would converge towards \(\frac{1}{1-0.99} = 100\). This would correspond to a sample size of \(N = 100\) when the unweighted mean is computed. Let us verify this in a small simulation with \(\lambda = 0.99\):

  require(MASS)       # provides mvrnorm()
  require(ggplot2)
  require(reshape2)

  N_weighted <- 10000
  lambda = .99
  realCov <- matrix(c(2,1,1,3), nrow=2)
  realMu <- c(0,2)
  w = lambda ^(N_weighted-c(1:N_weighted))
  W_N = sum(w)
  oneSimulation <- function() {
    #
    # The sample size of the unweighted mean corresponds to W_N
    #
    X_unweighted <- mvrnorm(n=round(W_N),Sigma = realCov,  mu = realMu, tol= 1e-13)
    unweightedMean <- apply(X_unweighted,2,mean)
    #
    # Take a larger sample size for the weighted mean, to demonstrate
    # its limited memory
    #
    X_weighted <- mvrnorm(n=N_weighted,Sigma = realCov,  mu = realMu, tol= 1e-13)
    weightedMean <- apply(sweep(X_weighted,MARGIN=1,w,`*`),2,sum) / W_N
    c(unweightedMean, weightedMean)
  }

  Sim <- replicate(100000, oneSimulation())
  meanUnweighted <- t(Sim[1:2,])
  meanWeighted <- t(Sim[3:4,])
  covMeanUnweighted <- var(meanUnweighted)
  covMeanWeighted <- var(meanWeighted)

  covMeanUnweighted / covMeanWeighted

  df <- data.frame(unweighted = meanUnweighted[,1], weighted = meanWeighted[,1])
  dfPlot <- melt(df, id.vars = c())
  ggplot(dfPlot, aes(value, fill = variable, colour = variable)) +
    geom_density(alpha = 0.1) +
    theme_bw() +
    theme(axis.text=element_text(size=13),
          axis.title=element_text(size=15),
          legend.text=element_text(size=13),
          legend.title=element_text(size=15)) +
    scale_fill_discrete(name = expression("Mean" ~ bar(x)[1])) +
    scale_colour_discrete(name = expression("Mean" ~ bar(x)[1])) +
    ggtitle("Distribution of the unweighted and weighted Sample Mean") +
    theme(plot.title = element_text(hjust = 0.5, size=16, face="bold"))

If we look at the element-wise ratios of the estimated covariance matrices for the unweighted and weighted mean, we notice that the memory of the weighted estimator has to be larger than our initial guess of \(N = W_n \approx 100\):

> covMeanUnweighted / covMeanWeighted

       [,1]     [,2]
[1,] 1.987378 1.968643
[2,] 1.968643 1.998786

Also, the distributions of the two sample means are visibly different, as the following diagram illustrates for the first component \(\bar{x}_1\) of the estimated mean vector (for \(\bar{x}_2\), we would obtain a similar graph):


Figure 1: Comparison of the distributions of the sample means for the unweighted and weighted case. Both sample means appear to be normally distributed. Our initial guess that the memory of the weighted estimator corresponds to the sum of all weights \(W_n \approx 100\) appears to be wrong, if we look at the two distributions. The memory of the weighted estimator must be larger, since its variance is smaller than that of the unweighted estimator with a sample size of \(N = 100\).

Hence, we have to find another way to estimate the memory \(N\). We can do this by actually computing the covariance matrix of the weighted mean. The derivation is shown in the previous article on the covariance of weighted means, which results in the following equation:

\[
\operatorname{Cov}(\bar{\myX}) = \frac{\sum_{i=1}^{n} w_i^{2}}{\left(\sum_{i=1}^{n} w_i\right)^{2}} \, \mySigma .
\]

If we extend the above equation to our scenario with exponentially decaying weights \(w_i = \lambda^{\,n-i}\), we get:

\[
\operatorname{Cov}(\bar{\myX}) = \frac{\sum_{i=1}^{n} \lambda^{2(n-i)}}{\left(\sum_{i=1}^{n} \lambda^{\,n-i}\right)^{2}} \, \mySigma
= \frac{1-\lambda^{2n}}{1-\lambda^{2}} \cdot \frac{(1-\lambda)^{2}}{(1-\lambda^{n})^{2}} \, \mySigma .
\]

For \(\lambda < 1\) and large \(n\), the above equation converges towards

\[
\operatorname{Cov}(\bar{\myX}) \approx \frac{1-\lambda}{1+\lambda} \, \mySigma .
\]

Due to the central limit theorem, we know that the ordinary mean vector of \(N\) observations is distributed as

\[
\bar{\myX} \sim \mathcal{N}\!\left(\boldsymbol{\mu}, \tfrac{1}{N}\mySigma\right),
\]

so that we have to solve the following correspondence for \(N\):

\[
\frac{1}{N}\,\mySigma = \frac{1-\lambda^{2n}}{1-\lambda^{2}} \cdot \frac{(1-\lambda)^{2}}{(1-\lambda^{n})^{2}} \, \mySigma ,
\]

which, for large sample sizes \(n\), converges to:

\[
N = \frac{(1-\lambda^{n})^{2}}{1-\lambda^{2n}} \cdot \frac{1+\lambda}{1-\lambda} \;\xrightarrow{\; n \to \infty \;}\; \frac{1+\lambda}{1-\lambda}.
\]

For our previous example with \(\lambda = 0.99\), this would mean that the memory of the estimator is approximately \(N = \frac{1+0.99}{1-0.99} = 199\).
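As a quick sanity check (reusing \(\lambda = 0.99\) from the simulation above), the effective memory and the resulting theoretical ratio of the two covariance matrices can be computed directly:

  lambda <- 0.99
  (1 + lambda) / (1 - lambda)                 # effective memory N = 199
  (1 / 100) / ((1 - lambda) / (1 + lambda))   # Cov ratio unweighted (N = 100) vs. weighted: 1.99

The theoretical ratio of 1.99 matches the element-wise ratios of roughly 1.97 to 2.0 that we observed in the simulation above.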

Now let us replace the sample size of the unweighted mean X_unweighted in the R-script of the previous example with \(N = 199\), as sketched below, and then run the code again.
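Only one line of the script has to change (this sketch assumes the variables lambda, realCov and realMu from the script above): the sample size of the unweighted mean is now derived from the forgetting factor instead of from W_N:

  # effective memory N = (1 + lambda) / (1 - lambda) = 199 instead of round(W_N)
  X_unweighted <- mvrnorm(n = round((1 + lambda) / (1 - lambda)),
                          Sigma = realCov, mu = realMu, tol = 1e-13)

Running the simulation again, we get: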

> covMeanUnweighted / covMeanWeighted

       [,1]     [,2]
[1,] 0.9961787 1.068541
[2,] 1.0685406 1.022556

and also the corresponding density plot for the first component of the mean vector indicates that \(N = 199\) is now the correct value:


Figure 2: Comparison of the distributions of the sample means for the unweighted and weighted case. Both sample means appear to be normally distributed. This time we set \(N = 199\), which appears to be the correct value.

We can also take a look at how the estimator can track the arithmetic mean of a time series with concept changes, as shown in the figure below.


Figure 3: Comparison of several estimators with different forgetting factors for the estimation of the arithmetic mean of a time series with concept changes. The original time series is a data stream sampled from a normal distribution. For illustrative purposes, each curve tracks a different time series with a different mean. At the point \(n = 5000\), all time series undergo a sudden change in their mean. Depending on the forgetting factor, this change can be tracked more or less well.

As we can see, forgetting factors close to 1 reduce the variance of the estimated mean, but have the drawback that they cannot track sudden changes in the time series sufficiently fast. On the other hand, smaller forgetting factors lead to a rather large variance in the estimated mean, but they can track changes in the time series much faster.
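This trade-off can also be quantified with the asymptotic result derived above: for a univariate stream with variance \(\sigma^2\), the steady-state standard deviation of the exponentially weighted mean is approximately \(\sigma\sqrt{(1-\lambda)/(1+\lambda)}\). A small sketch (for \(\sigma = 1\), as in the simulation below):

  Q <- c(0.999, 0.99, 0.9)          # forgetting factors used in the figure
  round((1 + Q) / (1 - Q))          # effective memory: 1999, 199, 19
  round(sqrt((1 - Q) / (1 + Q)), 3) # steady-state sd of the estimate: 0.022, 0.071, 0.229

So a forgetting factor of 0.9 reacts to changes within a few dozen samples, but its estimate fluctuates roughly ten times as much as the one with \(\lambda = 0.999\).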

The above figure can be generated with the following R-Code:

require(ggplot2)
require(reshape2)

# Computes the exponentially weighted mean of a data stream with a concept
# change. p[1] is the forgetting factor Q, p[2] the baseline mean of the
# series (each curve gets a different offset for illustrative purposes).
getMeanSeries <- function(p) {
  Q <- p[1]
  mm <- p[2]
  X <- rnorm(10000, mean = mm, sd = 1)
  X[1:5000] <- X[1:5000] + 5        # sudden change of the mean at n = 5000
  mean <- 0
  QSum <- 0                         # normalization factor W_n
  M <- 0                            # weighted sum of squared deviations
  myMeans <- myVars <- rep(0, length(X))
  for (i in 1:length(X)) {
    x <- X[i]
    # online update of the exponentially weighted mean
    QSum <- Q * QSum + 1
    delta <- x - mean
    mean <- mean + delta / QSum
    myMeans[i] <- mean
    # online update of the exponentially weighted variance (not plotted here)
    if (i > 1) M <- Q * M + (x - myMeans[i - 1]) * (x - myMeans[i])
    myVars[i] <- M / QSum
  }
  return(myMeans)
}

M <- lapply(list(c(1, 3), c(.999, 2), c(0.99, 1), c(0.9, 0)), getMeanSeries)
df <- do.call(cbind, M)
colnames(df) <- c("Q=1", "Q=0.999", "Q=0.99", "Q=0.9")
dfPlot <- melt(df)
colnames(dfPlot) <- c("n", "Forgetting", "value")
levels(dfPlot$Forgetting) <- c("Q=1", "Q=0.999", "Q=0.99", "Q=0.9")
p <- ggplot(data = dfPlot, aes(x = n, y = value, color = Forgetting)) +
  geom_line() +
  xlab("n") +
  ylab("estimated mean") +
  theme_bw() +
  theme(text = element_text(size = 25))
plot(p)
