Fundamentals of Statistics contains material of various lectures and courses of H. Lohninger on statistics, data analysis and chemometrics.


Principal Component Analysis

Principal component analysis (PCA) is a versatile method. It not only makes it possible to inspect high-dimensional datasets (those with more than three variables), but also has mathematical properties that support the calculation of multilinear regression models even when the variables are strongly correlated.

One of the problems with multivariate datasets is that they cannot be displayed directly on two-dimensional paper or computer screens. The more variables (dimensions) a dataset has, the more complicated and opaque the situation becomes, until it is no longer possible to recognize any relationships at all.

The central idea behind principal component analysis is to project the high-dimensional data space onto a two-dimensional plane in such a way that any interesting features of the data become visible. As the structure of the projected data depends on the direction of the projection, the question is how to find the "best" rotation of the data (or of the axes, which amounts to the same thing).

If we assume that the information in a dataset lies in the directions of greatest variation, we simply have to find those directions and align the rotated axes along them. In addition, the new axes should again be orthogonal to each other.
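Stated as a formula, this requirement might be written as follows. The notation is an assumption introduced here and does not appear in the original text: $\mathbf{C}$ denotes the covariance matrix of the data, and $\mathbf{w}_k$ is the unit vector along the $k$-th new axis. The quantity $\mathbf{w}^{\mathsf T}\mathbf{C}\,\mathbf{w}$ is the variance of the data projected onto the direction $\mathbf{w}$.

```latex
\mathbf{w}_1 = \arg\max_{\|\mathbf{w}\| = 1} \; \mathbf{w}^{\mathsf T} \mathbf{C}\, \mathbf{w},
\qquad
\mathbf{w}_k = \arg\max_{\substack{\|\mathbf{w}\| = 1 \\ \mathbf{w} \,\perp\, \mathbf{w}_1, \dots, \mathbf{w}_{k-1}}} \; \mathbf{w}^{\mathsf T} \mathbf{C}\, \mathbf{w}
\quad (k = 2, 3, \dots)
```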

To find the new axes, we first search for the direction in which the dataset has its maximum extent. This direction becomes the first axis. Next, we take a second axis perpendicular to the first and rotate it around the first axis until the variation along it is a maximum. Then we add a third axis, again orthogonal to the other two and in the direction of the maximum remaining variation, and so on. This procedure is repeated until all dimensions have been "used up".
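The axis-by-axis search can be sketched in code. The sketch below is only one possible reading of the procedure, not the author's implementation: it uses power iteration, restricted at every step to the subspace orthogonal to the axes already found. The function name `greedy_principal_axes` and its parameters are hypothetical, and the covariance matrix is assumed to have full rank (no direction with zero variance).

```python
import numpy as np

def _orthonormalize(w, axes):
    """Remove the parts of w along the axes found so far, then scale to unit length."""
    for a in axes:
        w = w - (w @ a) * a
    return w / np.linalg.norm(w)

def greedy_principal_axes(X, n_iter=1000, seed=0):
    """Find the axes one at a time, each in the direction of maximum remaining variation.

    X has one sample per row and one variable per column.
    Assumes a full-rank covariance matrix (hypothetical helper, for illustration only).
    """
    rng = np.random.default_rng(seed)
    C = np.cov(X, rowvar=False)          # np.cov subtracts the column means itself
    n_vars = C.shape[0]
    axes = []
    for _ in range(n_vars):
        # Random start direction, orthogonal to all axes found so far
        w = _orthonormalize(rng.normal(size=n_vars), axes)
        for _ in range(n_iter):
            # Power iteration step: C @ w pulls w toward the direction of
            # largest variance; re-orthogonalizing keeps it normal to earlier axes
            w = _orthonormalize(C @ w, axes)
        axes.append(w)
    axes = np.array(axes)                               # one axis per row
    variances = np.array([w @ C @ w for w in axes])     # variance along each axis
    return axes, variances
```

Calling `greedy_principal_axes(X)` on a data matrix returns the axes in order of decreasing variance. In practice the eigenvalue formulation is usually preferred, since it yields all axes in a single step.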

The process described above is called principal component analysis (PCA) and results in a rotated coordinate system whose axes point in the directions of maximum variation. This somewhat simplified picture can be condensed mathematically into a so-called eigenvalue problem. The eigenvectors of the covariance matrix constitute the principal components. Each corresponding eigenvalue equals the variance of the data along its component and thus indicates how much "information" the component contains.
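As a minimal sketch of the eigenvalue formulation, the following Python code computes the principal components with NumPy. The function name `pca_by_eigendecomposition` and the random example data are assumptions made for illustration; they are not part of the original material.

```python
import numpy as np

def pca_by_eigendecomposition(X):
    """Return eigenvalues, principal components, and explained-variance shares.

    X has one sample per row and one variable per column.
    """
    C = np.cov(X, rowvar=False)              # np.cov centers each column itself
    eigvals, eigvecs = np.linalg.eigh(C)     # eigh: for symmetric matrices, ascending order
    order = np.argsort(eigvals)[::-1]        # reorder from largest to smallest variance
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    explained = eigvals / eigvals.sum()      # fraction of the total variance per component
    return eigvals, eigvecs, explained

# Illustrative data only: 200 samples, 3 variables with unequal spread
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3)) * np.array([5.0, 2.0, 0.5])

eigvals, eigvecs, explained = pca_by_eigendecomposition(X)
print("eigenvalues:", eigvals)
print("share of total variance:", explained)
```

The columns of `eigvecs` are the principal components, and `explained` expresses each eigenvalue as a fraction of the total variance, which is one common way to judge how many components are worth keeping.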

The following interactive example shows a three-dimensional dataset and the corresponding principal components. Note that the principal components are orthogonal to each other, and that the correlation between the data coordinates along any two principal components is zero.
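Independently of the interactive example, both properties can be checked numerically. The sketch below builds a hypothetical three-dimensional dataset in which two variables are strongly correlated; the data and all variable names are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical dataset: the first two variables share a common source, the third is independent
latent = rng.normal(size=(300, 1))
X = np.hstack([
    latent + 0.1 * rng.normal(size=(300, 1)),
    2.0 * latent + 0.3 * rng.normal(size=(300, 1)),
    rng.normal(size=(300, 1)),
])

Xc = X - X.mean(axis=0)                                  # mean-center each variable
eigvals, V = np.linalg.eigh(np.cov(Xc, rowvar=False))    # columns of V: principal components
scores = Xc @ V                                          # data coordinates in the rotated system

axes_gram = V.T @ V                                # identity matrix if the axes are orthonormal
score_corr = np.corrcoef(scores, rowvar=False)     # identity matrix if the scores are uncorrelated

print(np.round(np.corrcoef(X, rowvar=False), 2))   # correlations of the original variables
print(np.round(axes_gram, 6))
print(np.round(score_corr, 6))
```

The first matrix shows a strong correlation between the first two original variables, whereas the off-diagonal entries of the other two matrices vanish up to rounding error.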