A few Notes on Principal Component Analysis


\(
\def\myX{\mathbf{X}} \def\mySigma{\mathbf{\Sigma}} \def\myT{\mathsf{T}} \)

The main idea of principal component analysis (PCA) is to perform an orthogonal transformation of multi- or high-dimensional numeric data so that the resulting data consists of linearly uncorrelated variables, which are called principal components. The first principal component (first variable) accounts for the most variability in the data (hence, the variance along this axis is the largest). The second principal component accounts for the highest possible variance in the data under the constraint that it is orthogonal to the first principal component, and so on. Hence, the last principal component (which is orthogonal to all previous principal components) accounts for the least variability in the data. A common application of PCA is dimensionality reduction, where one omits the last principal components of the PCA-transformed data, since these variables usually explain only very little of the variance in the data and do not have significant predictive power. But PCA also has many other applications in statistics, exploratory data analysis (EDA), machine learning and neural networks. In this post, we will take a brief look at one way to derive PCA.
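As a small illustration of the dimensionality-reduction use case, here is a minimal NumPy sketch (the function and variable names are chosen freely for this post) which centers the data, computes the covariance matrix, and keeps only the first k principal components:

```python
import numpy as np

def pca_reduce(X, k):
    """Project the rows of X onto the first k principal components."""
    # Center the data (PCA assumes mean-zero data).
    Xc = X - X.mean(axis=0)
    # Covariance matrix of the centered data, shape (d, d).
    cov = Xc.T @ Xc / Xc.shape[0]
    # Eigendecomposition of the symmetric covariance matrix;
    # np.linalg.eigh returns the eigenvalues in ascending order.
    eigvals, eigvecs = np.linalg.eigh(cov)
    # Keep the k eigenvectors with the largest eigenvalues (the principal components).
    components = eigvecs[:, np.argsort(eigvals)[::-1][:k]]  # shape (d, k)
    # Coordinates of the data in the reduced k-dimensional space.
    return Xc @ components

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))   # 100 examples, 5 dimensions
Z = pca_reduce(X, k=2)
print(Z.shape)                  # (100, 2)
```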

Prerequisites

Law of Cosines


Figure 1: Derivation of the law of cosines for triangles.

We can derive the law of cosines simply by looking at the triangle above and writing down a few properties that we can observe:
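\begin{align}
d &= b \cos \theta \label{eq:cos} \\
h &= b \sin \theta \label{eq:sin} \\
c^2 &= h^2 + (a - d)^2 \label{eq:pythagoras}
\end{align}

Here, \(a\), \(b\) and \(c\) denote the three side lengths, \(\theta\) the angle enclosed by \(a\) and \(b\), \(h\) the height of the perpendicular dropped onto side \(a\), and \(d\) the length of the segment that this perpendicular cuts off from \(a\).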

Now we simply insert Eq. \eqref{eq:cos} and \eqref{eq:sin} into Eq. \eqref{eq:pythagoras} and see where this leads us:
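\begin{align}
c^2 &= (b \sin \theta)^2 + (a - b \cos \theta)^2 \\
&= b^2 \sin^2 \theta + a^2 - 2ab \cos \theta + b^2 \cos^2 \theta \\
&= a^2 + b^2 - 2ab \cos \theta ,
\end{align}

which is exactly the law of cosines. Note that for \(\theta = 90^\circ\) it reduces to the Pythagorean theorem.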

Vector Inner Product

Remember that the vector inner product of two vectors is defined as:
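\begin{align}
\mathbf{a}^\myT \mathbf{b} = \sum_{i=1}^{d} a_i b_i ,
\end{align}

for two vectors \(\mathbf{a}, \mathbf{b} \in \mathbb{R}^{d}\).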

The inner product of a vector with itself is:
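\begin{align}
\mathbf{a}^\myT \mathbf{a} = \sum_{i=1}^{d} a_i^2 = \lVert \mathbf{a} \rVert^2 ,
\end{align}

i.e., the squared Euclidean length of the vector.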

We can derive a slightly different notation if we consider the law of cosines, as described in the previous section:
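\begin{align}
\lVert \mathbf{a} - \mathbf{b} \rVert^2 = \lVert \mathbf{a} \rVert^2 + \lVert \mathbf{b} \rVert^2 - 2 \lVert \mathbf{a} \rVert \lVert \mathbf{b} \rVert \cos \theta ,
\end{align}

where \(\theta\) is the angle between \(\mathbf{a}\) and \(\mathbf{b}\), and \(\mathbf{a} - \mathbf{b}\) closes the triangle spanned by the two vectors. Expanding the left-hand side as \((\mathbf{a} - \mathbf{b})^\myT (\mathbf{a} - \mathbf{b}) = \lVert \mathbf{a} \rVert^2 + \lVert \mathbf{b} \rVert^2 - 2\, \mathbf{a}^\myT \mathbf{b}\) and comparing both sides yields

\begin{align}
\mathbf{a}^\myT \mathbf{b} = \lVert \mathbf{a} \rVert \, \lVert \mathbf{b} \rVert \cos \theta .
\end{align}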

Vector Projection

Sometimes it is necessary to project one vector onto another vector. The figure below illustrates the projection step. We first construct a line orthogonal to the target vector which connects to the tip of the vector being projected (indicated by the dotted line). The resulting projection vector has the same direction as the target vector, but a length which is specified by the intersection of the target vector and the orthogonal dotted line.


Figure 2: Projection of one vector onto another vector. The resulting vector is the projection vector.

In the following, we try to figure out a way to compute the projection vector when both vectors are given. Let us first compute the inner product of the vector that we want to project and the unit vector in the direction onto which we project:
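\begin{align}
\mathbf{a}^\myT \frac{\mathbf{b}}{\lVert \mathbf{b} \rVert} = \frac{\lVert \mathbf{a} \rVert \, \lVert \mathbf{b} \rVert \cos \theta}{\lVert \mathbf{b} \rVert} = \lVert \mathbf{a} \rVert \cos \theta
\end{align}

Here, \(\mathbf{a}\) denotes the vector that we want to project, \(\mathbf{b}\) the vector that we project onto, and \(\theta\) the angle between the two. The quantity \(\lVert \mathbf{a} \rVert \cos \theta\) is exactly the length of the projection shown in Figure 2.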

With this little trick we have computed the length of the desired projection vector. Now we just have to multiply this length by the unit vector in the projection direction and we are done:
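\begin{align}
\mathbf{p} = \left( \mathbf{a}^\myT \frac{\mathbf{b}}{\lVert \mathbf{b} \rVert} \right) \frac{\mathbf{b}}{\lVert \mathbf{b} \rVert} = \frac{\mathbf{a}^\myT \mathbf{b}}{\mathbf{b}^\myT \mathbf{b}} \, \mathbf{b} ,
\end{align}

where \(\mathbf{p}\) denotes the projection of \(\mathbf{a}\) onto \(\mathbf{b}\).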

Deriving the Principal Component Analysis

  • Data X has to be centered (mean 0)
  • Dimensions of e, X, x, mu, J
  • Notation: we have data points, arranged in a matrix (rows are examples and columns are dimensions)

One can say that PCA performs an orthogonal linear transformation which maps each data point to a new point in another coordinate system. In this new coordinate system the variance of the transformed data points happens to be the largest for the first axis, the second largest for the second axis, and so on. Hence, if our first axis is described by a vector, this axis would explain most of the variance in the data. Since we want to find the vector that maximizes the variance of the data points projected onto it, we have an optimization problem for which we can write down the objective function in the form:
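\begin{align}
J(\mathbf{e}) = \frac{1}{n} \sum_{i=1}^{n} \left( \frac{\mathbf{e}^\myT \mathbf{x}_i}{\lVert \mathbf{e} \rVert} - \mu_e \right)^2 , \label{eq:objective1}
\end{align}

where \(\mathbf{x}_1, \ldots, \mathbf{x}_n\) are the \(n\) data points (the rows of \(\myX\)), \(\mathbf{e}\) is the direction vector describing the new axis, and \(\mu_e\) denotes the mean of the projected points.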

Note that \(\mu_e\) is the mean of all points in the projected space along the axis described by the direction vector. We can compute the mean along this axis as follows:
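\begin{align}
\mu_e = \frac{1}{n} \sum_{i=1}^{n} \frac{\mathbf{e}^\myT \mathbf{x}_i}{\lVert \mathbf{e} \rVert}
= \frac{\mathbf{e}^\myT}{\lVert \mathbf{e} \rVert} \left( \frac{1}{n} \sum_{i=1}^{n} \mathbf{x}_i \right)
= \frac{\mathbf{e}^\myT \boldsymbol{\mu}}{\lVert \mathbf{e} \rVert} = 0 ,
\end{align}

where \(\boldsymbol{\mu}\) is the mean of the data points in the original space, which is the zero vector since the data is centered.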

So, since the data in the original space is centered around the mean along each dimension, the mean in the projected space is also zero for any arbitrary projection direction. This allows us to re-write Eq. \eqref{eq:objective1} as:
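\begin{align}
J(\mathbf{e}) = \frac{1}{n} \sum_{i=1}^{n} \left( \frac{\mathbf{e}^\myT \mathbf{x}_i}{\lVert \mathbf{e} \rVert} \right)^2 .
\end{align}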

In matrix notation, based on rule TODO in article TODO, the above expression simplifies to:
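\begin{align}
J(\mathbf{e}) = \frac{1}{n} \, \frac{\mathbf{e}^\myT \myX^\myT \myX \, \mathbf{e}}{\mathbf{e}^\myT \mathbf{e}} .
\end{align}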

Since the vector is normalized by its norm, we can simplify the above equation a little further by formulating the objective as a constrained optimization problem:
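\begin{align}
\max_{\mathbf{e}} \; \frac{1}{n} \, \mathbf{e}^\myT \myX^\myT \myX \, \mathbf{e} \quad \text{subject to} \quad \mathbf{e}^\myT \mathbf{e} = 1
\end{align}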

or
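\begin{align}
\max_{\mathbf{e}} \; \mathbf{e}^\myT \mySigma \, \mathbf{e} \quad \text{subject to} \quad \mathbf{e}^\myT \mathbf{e} = 1 ,
\end{align}

where \(\mySigma\) denotes the covariance matrix of the (centered) data.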

This problem can be solved using the Lagrange multiplier method. The Lagrangian can therefore be written as
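\begin{align}
L(\mathbf{e}, \lambda) = \mathbf{e}^\myT \mySigma \, \mathbf{e} + \lambda \left( 1 - \mathbf{e}^\myT \mathbf{e} \right) ,
\end{align}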

where we used the relation \(\mySigma = \frac{1}{n} \myX^\myT \myX\), which holds since the original data is centered around the means of each dimension.

Now we can compute the partial derivatives of the Lagrangian using the denominator layout and the rule TODO in TODO, and set both gradients to zero:
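\begin{align}
\frac{\partial L}{\partial \mathbf{e}} &= 2 \, \mySigma \, \mathbf{e} - 2 \lambda \, \mathbf{e} \overset{!}{=} \mathbf{0} \label{eq:gradient1} \\
\frac{\partial L}{\partial \lambda} &= 1 - \mathbf{e}^\myT \mathbf{e} \overset{!}{=} 0 \label{eq:gradient2}
\end{align}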

Eq. \eqref{eq:gradient2} leads us to the original constraint
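\begin{align}
\mathbf{e}^\myT \mathbf{e} = 1
\end{align}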

and Eq. \eqref{eq:gradient1} leads us to a more interesting equation
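\begin{align}
\mySigma \, \mathbf{e} = \lambda \, \mathbf{e} , \label{eq:eigCandidates}
\end{align}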

which corresponds to finding the eigenvalues and eigenvectors of the covariance matrix. However, this results in a whole set of possible candidate solutions. But since we want to maximize our objective, we can simply insert all candidates from \eqref{eq:eigCandidates} into the objective function and see which candidate maximizes the value:
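\begin{align}
J(\mathbf{e}_i) &= \mathbf{e}_i^\myT \mySigma \, \mathbf{e}_i \label{eq:maxJ1} \\
&= \mathbf{e}_i^\myT \lambda_i \, \mathbf{e}_i \\
&= \lambda_i \, \mathbf{e}_i^\myT \mathbf{e}_i = \lambda_i , \label{eq:maxJ2}
\end{align}

where \((\lambda_i, \mathbf{e}_i)\) denotes the \(i\)-th eigenvalue/eigenvector pair of \(\mySigma\) and we used \(\mathbf{e}_i^\myT \mathbf{e}_i = 1\).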

Hence, the eigenvector with the largest corresponding eigenvalue is the solution we are looking for. Or in other words: the eigenvector with the largest eigenvalue of the covariance matrix of our dataset describes the new axis in our projected space which (from all possible axes) maximizes the variance when all data points are projected onto this axis. Furthermore, we also know how large this variance is: remember that our original objective function measures the variance of all data points projected onto the chosen axis. If we look at Eq. \eqref{eq:maxJ1} – \eqref{eq:maxJ2}, this means that the variance actually has to be exactly the largest eigenvalue.
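To make this result a bit more tangible, here is a small NumPy sketch (again with freely chosen variable names) which checks numerically that the variance of the data projected onto the eigenvector with the largest eigenvalue is indeed that eigenvalue, and that no random unit vector does better:

```python
import numpy as np

rng = np.random.default_rng(42)

# Centered toy data: 1000 points in 3 dimensions with correlated features.
X = rng.normal(size=(1000, 3)) @ rng.normal(size=(3, 3))
X -= X.mean(axis=0)

Sigma = X.T @ X / X.shape[0]              # covariance matrix of the centered data
eigvals, eigvecs = np.linalg.eigh(Sigma)  # eigenvalues in ascending order
e1 = eigvecs[:, -1]                       # eigenvector with the largest eigenvalue

var_e1 = np.mean((X @ e1) ** 2)           # variance of the points projected onto e1
print(np.isclose(var_e1, eigvals[-1]))    # True: the projected variance equals the largest eigenvalue

# No random unit vector yields a larger projected variance (up to numerical noise).
for _ in range(1000):
    v = rng.normal(size=3)
    v /= np.linalg.norm(v)
    assert np.mean((X @ v) ** 2) <= var_e1 + 1e-9
```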

TODOs

  • PCA in Bengio's book
  • reduce dimensionality of problem and preserve as much of the structure of the data as possible. we represent the data with fewer variables
  • feature selection is a possibility, but does not always work
  • plot with xkcd
  • relationship of pca and autoencoder
  • svm of victor
  • why largest variability? To preserve the relative distance between the points
  • multiply vector with covariance matrix
  • PCA-04: 06:00
  • Example in 2d
  • Optimization in the space of with dimension of e
  • visualize the cost function + constraint for an example
  • PCA on MNIST instead of eigen faces
  • Derivation of Lagrange multipliers
  • Eigenvectors are orthogonal
  • What is a principal component
  • How to find the second Principal Component


Markus Thill
I studied computer engineering (B.Sc.) and Automation & IT (M.Eng.). Generally, I am interested in machine learning (ML) approaches (in the broadest sense), but particularly in the fields of time series analysis, anomaly detection, Reinforcement Learning (e.g. for board games), Deep Learning (DL) and incremental (on-line) learning procedures.
