 # The Relationship between the Mahalanobis Distance and the Chi-Squared Distribution

In practice, sometimes (multivariate) Gaussian distributions are used for anomaly detection tasks (assuming that the considered data is approx. normally distributed): the parameters of the Gaussian can be estimated using maximum likelihood estimation (MLE) where the maximum likelihood estimate is the sample mean and sample covariance matrix. After the estimating the parameters of the distribution one has to specify a critical value which separates the normal data from the anomalous data. Typically, this critical value is taken from the probability density function (PDF) in such a way that it is smaller than the PDF value of all normal data points in the data set. Then a new data point can be classified as anomalous if the value of the PDF for this new point is below the critical value. Hence, the critical value specifies a boundary which is used to separate normal from anomalous data. In the univariate case the boundary separates the lower and upper tails of the Gaussian from its center (mean). In the 2-dimensional case the boundary is an ellipse around the center and in higher dimensions the boundary can be described by an ellipsoid. But what do we do, if want to find a boundary in a way that separates the most unlikely 2% of the data points from a sample from the remaining 99%. In the univariate case, this scenario is simple: We just have to compute the first percentile (1% quantile) and 99th percentile. All points that end up in the specified tails would then be classified as anomalous. For the multivariate case this is not that straightforward any longer, since our boundary has to be described by an ellipsoid. However, there is a way out of this problem, which has to do with a so called Mahalanobis-distance, as we will see in the following.

$$\def\matr#1{\mathbf #1} \def\tp{\mathsf T}$$

## Prerequisites

### Multiplication of a Matrix with its Transpose

Generally, the product of a $n \times \ell$ matrix $\matr A$ and a $\ell \times p$ matrix $\matr B$ is defined as:

Then, the multiplication of a matrix $\matr A$ with its transpose $\matr A^\tp$ can be written as:

where $\vec a_{k}$ is the $k$th column vector of matrix $\matr A$. Another trivial relation required later is as follows:

### Inverse of a Matrix-Product

\begin{align} (\matr{A} \matr B)^{-1} = \matr B^{-1}\matr A^{-1} \label{eq:inverseProduct} \end{align} since %

### Eigenvalues and Eigenvectors

For a matrix $\matr A$ solve: $\begin{equation} \matr{A} \vec{u} = \lambda \vec{u} \end{equation}$

A value $\lambda$ which fulfills the equation is called an eigenvalue of $\matr A$ and the corresponding vector $\vec\mu$ is called eigenvector. When the eigenvectors of $\matr A$ are arranged in a matrix $\matr U$, we have:

where $\matr \Lambda$ is a diagonal matrix containing the eigenvalues $\lambda_i$ of the corresponding eigenvectors. This representation, where the matrix is represented in terms of its eigenvalues and eigenvectors is also called eigenvalue decomposition. For symmetric matrices $\matr{A}$, the eigenvectors are orthogonal (orthonormal) and the matrix $\matr{U}$ is orthogonal as well (the product with its transpose is the identity matrix). In this case $\matr{U}^{-1}=\matr{U}^{T}$, and equation $\eqref{eq:eigendecomp}$ can be written as:

In this case, also the square root of $\matr A$ (written here as $\matr A^{\frac{1}{2}}$) – such that $\matr A^{\frac{1}{2}}\matr A^{\frac{1}{2}}=A$ – can be easily found to be:

since

The eigenvalue decomposition of the inverse of a matrix $\matr A$ can be computed as follows, using the relation described in equation $\eqref{eq:inverseProduct}$ and the associative property of the matrix product:

Note that $\Lambda^{-1}$ is again a diagonal matrix containing the inverse eigenvalues of $\matr{A}$.

### Linear Affine Transform of a Normally Distributed Random Variable

Assume we apply a linear affine transform to a random variable $X \thicksim N(\vec \mu_x, \Sigma_x)$ with a mean vector $\vec\mu_x$ and a covariance matrix $\Sigma_x$ in order to create a new random variable $Y$:

One can compute the new mean $\vec\mu_y$ and covariance matrix $\Sigma_y$ for $Y$:

and

## Quantile Estimation for multivariate Gaussian Distributions

• Calculating quantiles for multivariate normal distributions is not that trivial as in the one-dimensional case, since we cannot simply compute the integral in the tails of the distribution
• The quantiles in the bivariate case can be seen as ellipses, in higher dimensions as ellipsoids
• The Mahalanobis distance is an interesting measure to describe all points on the surface of an ellipsoid.
• More formal: The usual quantile definition requires a random variable: The p-quantile for a random distribution is the value that fulfills . In the case of a multivariate normal distribution we can take the squared Mahalanobis distance between a point of the multivariate normal distribution and its mean as such a random variable. Then the p-quantile computation will answer the following question: Which value is required so that a random point fulfills ? In other words, when we pick a random point from the distribution, it will have with probability p a squared Mahalanobis distance equal or smaller than . The set of points with forms an ellipsoid.
• In a naive solution one can use a Monte Carlo approach to sample the multivariate normal distribution and compute the quantile based on the Mahalanobis distances of the elements of the sample
• However, this Monte Carlo approach is rather computationally inefficient, especially if quantiles have to be computed very often
• One can show that the squared Mahalanobis distance of a Gaussian distribution is actually Chi-Square distributed.

### Empirical Results suggesting that the Mahalanobis Distance is Chi-Square distributed

In a Quantile-Quantile Plot one can see that quantiles of the Mahalanobis distance of a sample drawn from a Gaussian distribution is very similar to the corresponding quantiles computed on the Chi-Square distribution. The following R-script shows this: ### The squared Mahalanobis Distance follows a Chi-Square Distribution: More formal Derivation

The Mahalanobis distance between two points $\vec x$ and $\vec y$ is defined as

Thus, the squared Mahalanobis distance of a random vector \matr X and the center \vec \mu of a multivariate Gaussian distribution is defined as: \begin{align} D = d(\matr X,\vec \mu)^2 = (\matr X -\vec \mu )^\tp \matr \Sigma^{-1} (\matr X - \vec \mu ) \label{eq:sqMahalanobis} \end{align}

where $\matr \Sigma$ is a $\ell \times \ell$ covariance matrix and $\vec \mu$ is the mean vector. In order to achieve a different representation of $D$ one can first perform an eigenvalue decomposition on $\matr \Sigma^{-1}$ which is (with Eq. $\eqref{eq:eigenvalueInverse}$ and assuming orthonormal eigenvectors):

With Eq. $\eqref{eq:matrixProductWithTranspose}$ we get:

where $\vec u_{k}$ is the $k$th eigenvector of the corresponding eigenvalue $\lambda_k$. Plugging \eqref{eq:SigmaInverseAsSum} back into \eqref{eq:sqMahalanobis} results in:

With Eq. \eqref{eq:multOfTwoSkalars} one gets:

where $Y_k$ is a new random variable based on an affine linear transform of the random vector $\matr X$. According to Eq. \eqref{eq:AffineLinearTransformMean} , we have $\matr Z = (\matr X - \vec \mu ) \thicksim N(\vec 0,\Sigma)$. If we set $\vec a_{k}^\tp = \lambda_k^{-\frac{1}{2}} \vec u_{k}^\tp$ then we get $Y_k = \vec a_{k}^\tp \matr Z = \lambda_k^{-\frac{1}{2}} \vec u_{k}^\tp \matr Z$. Note that $Y_k$ is now a random Variable drawn from a univariate normal distribution $Y_k \thicksim N(0,\sigma_k^2)$, where, according to \eqref{eq:AffineLinearTransformCovariance}:

If we insert

into Eq. \eqref{eq:smallSigma}, we get: %

Since all eigenvectors $\vec u_{i}$ are pairwise orthonormal the dotted products $\vec u_{k}^\tp \vec u_{j}$ and $\vec u_{j}^\tp \vec u_{k}$ will be zero for $j \neq k$. Only for the case $j = k$ we get: %

since the the norm $||\vec u_{k}||$ of a orthonormal eigenvector is equal to 1. The squared Mahalanobis distance can be expressed as:

where $Y_k \thicksim N(0,1).$

Now the Chi-square distribution with $\ell$ degrees of freedom is exactly defined as being the distribution of a variable which is the sum of the squares of $\ell$ random variables being standard normally distributed. Hence, $D$ is Chi-square distributed with $\ell$ degrees of freedom.

### Derivation based on the Whitening Property of the Mahalanobis Distance

Since the inverse $\matr \Sigma^{-1}$ of the covariance matrix $\matr \Sigma$ is also a symmetric matrix, its squareroot can be found – based on Eq. \eqref{eq:sqrtSymMatrix} – to be a symmetric matrix . In this case we can write the squared Mahalanobis distance as %

The multiplication $\matr Y = \matr W \matr Z$, with $\matr W=\matr \Sigma^{-\frac{1}{2}}$ and $\matr Z= \matr X -\vec \mu$ is typically reffered to as a whitening transform, where in this case $\matr W=\matr \Sigma^{-\frac{1}{2}}$ is the so called Mahalanobis (or ZCA) whitening matrix. $\matr Y$ has zero mean, since $(\matr X - \vec \mu ) \thicksim N(\vec 0,\Sigma)$. Due to the (linear) whitening transform the new covariance matrix $\matr \Sigma_y$ is the identity matrix $\matr I$, as shown in the following (using the property in Eq. \eqref{eq:AffineLinearTransformCovariance}):

Hence, all elements $Y_k$ in the random vector $\matr Y$ are random variables drawn from independent normal distributions $Y_k \thicksim N(0,1)$, which leads us to the same conclusion as before, that $D$ is Chi-square distributed with $\ell$ degrees of freedom.

### Markus Thill I studied computer engineering (B.Sc.) and Automation & IT (M.Eng.). Generally, I am interested in machine learning (ML) approaches (in the broadest sense), but particularly in the fields of time series analysis, anomaly detection, Reinforcement Learning (e.g. for board games), Deep Learning (DL) and incremental (on-line) learning procedures.

### Deriving a Closed-Form Solution of the Fibonacci Sequence using the Z-Transform

The Fibonacci sequence might be one of the most famous sequences in the field of mathmatics and computer science. Already high school stu...… Continue reading

#### Derivation of a Weighted Recursive Linear Least Squares Estimator

Published on May 05, 2019

#### Gaussian Distribution With a Diagonal Covariance Matrix

Published on May 04, 2019