Fundamentals of Statistics contains material of various lectures and courses of H. Lohninger on statistics, data analysis and chemometrics.


Variable Selection - Introduction

Often a large number of independent variables, Xi, is available for a given modeling problem, and not all of these predictor variables contribute equally well to the explanation of the predicted variable Y. Some of them may not contribute to the model at all. We therefore have to select among these variables to obtain a model which contains as few variables as possible while still being the "best" model. In principle, all possible combinations of independent variables should be tried in order to find a suitable model; for p candidate variables this means 2^p - 1 candidate models. This can turn out to be a formidable task, even if high-performance computers are available. Apart from the practicability of this approach, several theoretical considerations should also be taken into account:
 

  • trying all possible combinations may lead to chance correlations
  • the contribution of a single variable to the explanation of Y may not easily be assessed if only a small number of observations is available
  • a simple criterion, like the goodness of fit, r², may lead to wrong conclusions if the number of selected variables approaches the number of observations
  • for more complicated models (e.g. artificial neural networks) the calculation of a single model may be so time-consuming that it is practically impossible to find the "best" combination of independent variables
  • the selection of combinations is guided by the available data; thus the resulting final selection reflects the "best" model for the given data set, and not the "best" subset for the population
  • some of the selection methods are specifically tailored to linear (regression) models; they are unusable with non-linear methods such as neural networks.
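The warning about the goodness of fit can be checked with a small experiment. The following is a minimal sketch, not part of the original course material: it fits a purely random Y with an increasing number of purely random predictors and reports r² for each fit. The helper names (`dot`, `center`, `r_squared`), the seed and the sample size of 10 observations are assumptions made for illustration only. The fit uses ordinary least squares with an intercept, computed by Gram-Schmidt orthogonalisation so that only the standard library is needed.

```python
import random

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def center(v):
    mean = sum(v) / len(v)
    return [a - mean for a in v]

def r_squared(columns, y):
    """r² of an ordinary least-squares fit (with intercept) of y on the columns."""
    resid = center(y)                  # centering accounts for the intercept
    ss_tot = dot(resid, resid)
    basis = []                         # orthogonalised predictors used so far
    for col in columns:
        q = center(col)
        for b in basis:                # remove what earlier predictors explain
            f = dot(q, b) / dot(b, b)
            q = [qi - f * bi for qi, bi in zip(q, b)]
        qq = dot(q, q)
        if qq < 1e-12:                 # column carries no new information
            continue
        f = dot(resid, q) / qq
        resid = [ri - f * qi for ri, qi in zip(resid, q)]
        basis.append(q)
    return 1.0 - dot(resid, resid) / ss_tot

random.seed(0)                         # illustrative seed
n_obs = 10                             # illustrative number of observations
y = [random.gauss(0, 1) for _ in range(n_obs)]
noise = [[random.gauss(0, 1) for _ in range(n_obs)] for _ in range(n_obs - 1)]

# r² for models using the first k noise variables, k = 0 .. n_obs - 1
r2_by_k = [r_squared(noise[:k], y) for k in range(n_obs)]
for k, r2 in enumerate(r2_by_k):
    print(f"{k} noise variables: r2 = {r2:.3f}")
```

Although none of the predictors has any relation to Y, r² can only grow as variables are added, and it reaches 1 (up to rounding) once the number of fitted parameters equals the number of observations.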
Depending on the type of model being used, there are several strategies to (partially) solve the problem:

Using all possible subsets of variables:
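As a rough illustration of what an exhaustive search involves, the sketch below enumerates every non-empty subset of a list of variable names and counts how quickly the number of candidate models grows. The function names (`all_subsets`, `best_subset`, `n_subsets`), the variable names and the `toy_score` function are hypothetical; in a real application the score would come from fitting and validating a model for each subset.

```python
from itertools import combinations

def all_subsets(variables):
    """Yield every non-empty subset of the variables, smallest subsets first."""
    for size in range(1, len(variables) + 1):
        yield from combinations(variables, size)

def best_subset(variables, score):
    """Return the subset with the highest score; `score` is supplied by the caller."""
    return max(all_subsets(variables), key=score)

def n_subsets(p):
    """Number of non-empty subsets of p variables."""
    return 2 ** p - 1

names = ["X1", "X2", "X3", "X4"]       # hypothetical variable names
subsets = list(all_subsets(names))
print(len(subsets), "candidate models for", len(names), "variables")
for p in (10, 20, 30):
    print(p, "variables ->", n_subsets(p), "candidate models")

# Toy stand-in for a real model-quality criterion: rewards X1 and X3,
# penalises every additional variable. Purely illustrative.
def toy_score(subset):
    return len(set(subset) & {"X1", "X3"}) - 0.1 * len(subset)

winner = best_subset(names, toy_score)
print("best subset under the toy score:", winner)
```

Because the number of subsets doubles with every additional variable, this approach is only feasible for a small number of candidates or for models that are cheap to calculate.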
 


Stepwise procedures:
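One common stepwise variant is forward selection: start with an empty model and repeatedly add the variable that improves the fit most. The sketch below is an assumption-laden illustration for a linear model, not the procedure of any particular package. The names (`forward_selection`, `min_gain`), the simulated data and the stopping rule are hypothetical; in particular, the fixed threshold on the gain in r² is only a placeholder, and a partial F-test is a common choice in practice.

```python
import random

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def center(v):
    mean = sum(v) / len(v)
    return [a - mean for a in v]

def forward_selection(columns, y, min_gain=0.05):
    """Greedy forward selection for a linear model with intercept.

    columns: dict mapping variable name -> list of values.
    Returns a list of (name, cumulative r²) in the order of selection.
    """
    resid = center(y)
    ss_tot = dot(resid, resid)
    remaining = {name: center(col) for name, col in columns.items()}
    selected, r2 = [], 0.0
    while remaining:
        # Gain in r² each candidate would bring on top of the current model
        gains = {}
        for name, q in remaining.items():
            qq = dot(q, q)
            gains[name] = dot(resid, q) ** 2 / (qq * ss_tot) if qq > 1e-12 else 0.0
        best = max(gains, key=gains.get)
        if gains[best] < min_gain:     # placeholder stopping rule
            break
        q = remaining.pop(best)
        qq = dot(q, q)
        f = dot(resid, q) / qq
        resid = [ri - f * qi for ri, qi in zip(resid, q)]
        # Remove the part of each remaining candidate already explained by `best`
        remaining = {
            name: [ci - dot(c, q) / qq * qi for ci, qi in zip(c, q)]
            for name, c in remaining.items()
        }
        r2 += gains[best]
        selected.append((best, r2))
    return selected

# Simulated data: Y depends on X1 and X3 only; X2 and X4 are pure noise.
random.seed(1)
n_obs = 20
data = {f"X{j}": [random.gauss(0, 1) for _ in range(n_obs)] for j in range(1, 5)}
y = [2.0 * a - 1.5 * b + 0.1 * random.gauss(0, 1)
     for a, b in zip(data["X1"], data["X3"])]

selected = forward_selection(data, y)
for name, r2 in selected:
    print(f"added {name}: cumulative r2 = {r2:.3f}")
```

A forward search of this kind evaluates far fewer models than an exhaustive search, but it is greedy: a variable that is useful only in combination with others can be missed.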