Memory of the exponentially decaying Estimator for Mean and Covariance Matrix
\(
\def\myX{\mathbf{x}}
\def\myM{\mathbf{M}}
\def\myY{\mathbf{y}}
\def\mySigma{\mathbf{\Sigma}}
\def\myT{\mathsf{T}}
\)
This post is part 4 of a series of five articles:
- Online Maximum Likelihood Estimation of (multivariate) Gaussian Distributions
- Online Estimation of Weighted Sample Mean and Covariance Matrix
- The Covariance of weighted Means
- Memory of the exponentially decaying Estimator for Mean and Covariance Matrix
- Online Estimation of the Inverse Covariance Matrix
Due to the exponentially decaying weights, the estimator described in the previous post has a limited historic memory, since older observations fade out more and more with every new data point. With such an approach it is possible to adapt the parameters to drifting (changing) distributions, however, at the cost of less accuracy when the data-generating process is a stationary distribution. This means, for example, that for forgetting factors \(\lambda < 1\), the (co-)variances of the mean vector do not converge to zero for large sample sizes and there will always be some fixed amount of noise left in the estimation, depending on the “rate” of forgetting (specified by \(\lambda\)). So one could ask the following two questions:
- If we compute a weighted mean with forgetting factor \(\lambda\), how is this mean distributed, hence, what is \(\mathrm{E}\!\left[\bar{\myX}_n\right]\) and \(\mathrm{Cov}\!\left[\bar{\myX}_n\right]\) of the weighted mean vector \(\bar{\myX}_n\)?
- What is the memory \(N\) of an estimator with exponentially decaying weights? Hence: for which sample size \(N\), when computing the ordinary mean (without forgetting) instead, can we obtain the same distribution parameters \(\mathrm{E}\!\left[\bar{\myX}\right]\) and \(\mathrm{Cov}\!\left[\bar{\myX}\right]\)? This means that we are basically looking for a corresponding ordinary estimator which only takes into account the last \(N\) data points in order to compute the mean (the form of the weighted estimator we assume throughout is recapped below).
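Before answering these questions, let us fix the notation. The following is a brief recap, assuming the weighted estimator from the previous post uses the exponentially decaying weights \(w_i = \lambda^{\,n-i}\) with forgetting factor \(0 < \lambda < 1\):
\[
\bar{\myX}_n \;=\; \frac{1}{w_n}\sum_{i=1}^{n}\lambda^{\,n-i}\,\myX_i ,
\qquad
w_n \;=\; \sum_{i=1}^{n}\lambda^{\,n-i} \;=\; \frac{1-\lambda^{n}}{1-\lambda},
\]
which can be updated online via \(w_n = \lambda\, w_{n-1} + 1\) and \(\bar{\myX}_n = \bar{\myX}_{n-1} + \frac{1}{w_n}\bigl(\myX_n - \bar{\myX}_{n-1}\bigr)\).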
Intuitively, one could guess for question 2 that the sample size \(N\) for the ordinary mean corresponds to the value of the normalization factor \(w_n\), since \(w_n\) converges for large \(n\) – according to the geometric series – towards a fixed value (for \(\lambda < 1\)):
\[
\lim_{n \to \infty} w_n \;=\; \lim_{n \to \infty} \sum_{i=1}^{n} \lambda^{\,n-i} \;=\; \frac{1}{1-\lambda}
\]
For a forgetting factor of, say, \(\lambda = 0.98\), \(w_n\) would converge towards \(\frac{1}{1-0.98} = 50\). This would correspond to a sample size of \(N = 50\) when the unweighted mean is computed. Let us verify this in a small simulation:
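The original script is not reproduced here; the following sketch merely illustrates the idea, and the distribution parameters, the stream length and the number of repetitions are illustrative assumptions rather than the values used in the original experiment:

```r
library(MASS)  # for mvrnorm

set.seed(1)

lambda <- 0.98
N      <- round(1 / (1 - lambda))  # initial guess for the memory: 50
n      <- 1000                     # length of each simulated stream
reps   <- 2000                     # repetitions to estimate the distribution of the sample means

mu    <- c(1, -2)                         # illustrative parameters
Sigma <- matrix(c(2, 0.8, 0.8, 1), 2, 2)

w <- lambda^((n - 1):0)  # exponentially decaying weights w_i = lambda^(n - i)

means_weighted   <- matrix(NA, reps, 2)
means_unweighted <- matrix(NA, reps, 2)
for (r in seq_len(reps)) {
  X <- mvrnorm(n, mu, Sigma)
  means_weighted[r, ] <- colSums(w * X) / sum(w)   # exponentially weighted mean
  X_unweighted <- X[(n - N + 1):n, ]               # last N observations
  means_unweighted[r, ] <- colMeans(X_unweighted)  # ordinary mean of N points
}

# element-wise ratios of the estimated covariance matrices (unweighted vs. weighted);
# values clearly above 1 indicate that the memory is larger than the initial guess N
cov(means_unweighted) / cov(means_weighted)
```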
If we look at the element-wise ratios of the estimated covariance matrices for the unweighted and the weighted mean, we notice that the memory of the weighted estimator has to be larger than our initial guess \(N = \frac{1}{1-\lambda}\):
Also, the visualizations of the distributions of the sample means are not the same, as the following diagram illustrates for the first component of the estimated mean vector (for the second component, we would obtain a similar graph):
Hence, we have to find another way to estimate the memory \(N\). We can do this by actually computing the covariance matrix of the weighted mean. The derivation is shown in another article, in which the following resulting equation is found:
\[
\mathrm{Cov}\!\left[\bar{\myX}_n\right] \;=\; \frac{\sum_{i=1}^{n} w_i^2}{\left(\sum_{i=1}^{n} w_i\right)^{2}}\,\mySigma
\]
If we extend the above equation to our scenario with exponentially decaying weights \(w_i = \lambda^{\,n-i}\), we get:
\[
\mathrm{Cov}\!\left[\bar{\myX}_n\right]
\;=\; \frac{\sum_{i=1}^{n} \lambda^{2(n-i)}}{\left(\sum_{i=1}^{n} \lambda^{\,n-i}\right)^{2}}\,\mySigma
\;=\; \frac{\left(1-\lambda^{2n}\right)(1-\lambda)^{2}}{\left(1-\lambda^{2}\right)\left(1-\lambda^{n}\right)^{2}}\,\mySigma
\]
For \(\lambda < 1\) and large \(n\), the above equation converges towards
\[
\mathrm{Cov}\!\left[\bar{\myX}_n\right] \;\longrightarrow\; \frac{1-\lambda}{1+\lambda}\,\mySigma .
\]
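As a quick numerical sanity check (not part of the original post), we can evaluate the exact weight ratio in R and compare it with this limit:

```r
lambda <- 0.98
n      <- 1000
w      <- lambda^((n - 1):0)   # exponentially decaying weights w_i = lambda^(n - i)

sum(w^2) / sum(w)^2            # exact ratio from the covariance formula
(1 - lambda) / (1 + lambda)    # limiting value (1 - lambda) / (1 + lambda)
```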
Due to the central limit theorem, we know that the ordinary mean vector \(\bar{\myX}_N\) of \(N\) observations is distributed as
\[
\bar{\myX}_N \;\sim\; \mathcal{N}\!\left(\boldsymbol{\mu},\; \tfrac{1}{N}\,\mySigma\right),
\]
so that we have to solve the following correspondence for \(N\):
\[
\frac{1}{N}\,\mySigma \;=\; \frac{\left(1-\lambda^{2n}\right)(1-\lambda)^{2}}{\left(1-\lambda^{2}\right)\left(1-\lambda^{n}\right)^{2}}\,\mySigma
\quad\Longleftrightarrow\quad
N \;=\; \frac{\left(1-\lambda^{2}\right)\left(1-\lambda^{n}\right)^{2}}{\left(1-\lambda^{2n}\right)(1-\lambda)^{2}},
\]
which, for large sample sizes \(n\), converges to:
\[
N \;=\; \frac{1-\lambda^{2}}{(1-\lambda)^{2}} \;=\; \frac{1+\lambda}{1-\lambda}.
\]
For our previous example with \(\lambda = 0.98\), this would mean that the memory of the estimator is approx. \(N = \frac{1.98}{0.02} = 99\).
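To get a feeling for how the naive guess \(\frac{1}{1-\lambda}\) and the actual memory \(\frac{1+\lambda}{1-\lambda}\) relate for different forgetting factors, here is a small illustrative snippet (not part of the original post):

```r
lambda <- c(0.9, 0.95, 0.98, 0.99)
data.frame(lambda = lambda,
           guess  = 1 / (1 - lambda),             # normalization factor (naive guess)
           memory = (1 + lambda) / (1 - lambda))  # actual memory, roughly twice the guess
```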
Now let us replace the sample size of the unweighted mean X_unweighted in the R-script of the previous example by this value and then run the code again. This time we get:
and also the corresponding density plot for the first component of the mean vector indicates that \(N = \frac{1+\lambda}{1-\lambda}\) is now the correct value:
We can also take a look at how the estimator can track the arithmetic mean of a time series with concept changes, as shown in the figure below.
As we can see, forgetting factors close to 1 can reduce the variance in the estimated mean, but have the drawback that they cannot track sudden changes in the time series sufficiently fast. Smaller forgetting factors, on the other hand, lead to a rather large variance in the estimated mean, but they can track changes in the time series much faster.
The above figure can be generated with the following R-Code:
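The original code is not included in this excerpt; the following is a minimal sketch of how such a figure could be produced, with the piecewise-constant signal, the change points and the two forgetting factors (0.99 and 0.90) chosen purely for illustration:

```r
set.seed(42)

# piecewise-constant signal ("concept changes") plus Gaussian noise
true_mean <- rep(c(0, 3, -2, 1), each = 250)
x <- true_mean + rnorm(length(true_mean))

# online exponentially weighted mean for a given forgetting factor
ewMean <- function(x, lambda) {
  m <- numeric(length(x))
  w <- 0
  mean_i <- 0
  for (i in seq_along(x)) {
    w <- lambda * w + 1
    mean_i <- mean_i + (x[i] - mean_i) / w
    m[i] <- mean_i
  }
  m
}

plot(x, col = "grey", pch = 20, xlab = "t", ylab = "x",
     main = "Tracking the mean of a time series with concept changes")
lines(true_mean, lwd = 2)
lines(ewMean(x, 0.99), col = "red", lwd = 2)
lines(ewMean(x, 0.90), col = "blue", lwd = 2)
legend("topright", lwd = 2, col = c("black", "red", "blue"),
       legend = c("true mean", expression(lambda == 0.99), expression(lambda == 0.90)))
```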