Exercise: Estimation of Boiling Points from Chemical Structure
Chemical structures can be described in many different ways. One particular way
is quite useful for setting up quantitative structure-property relationships
(QSPR). For each chemical structure investigated, a large number of numerical
descriptors is calculated. These descriptors may capture simple things, like the
number of carbon atoms in the structure, or more sophisticated things, such as
quantities derived from graph-theoretical calculations. After calculating
the descriptors, you end up with a matrix containing these numbers (one row per
structure, one column per descriptor) and a vector with the chemical/physical
property being investigated (e.g. the boiling point). You can then try to find
a suitable set of variables and set up a multivariate regression model.
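The setup described above can be sketched in a few lines of Python with NumPy. This is only an illustration under stated assumptions: the two descriptors (`n_carbons`, `n_branches`), the helper names `fit_mlr` and `predict_mlr`, and the property values are all made up for the example and are not taken from the BOILPTS data set.

```python
import numpy as np


def fit_mlr(X, y):
    """Ordinary least squares with an intercept.

    X: (n_structures, n_descriptors) descriptor matrix
    y: (n_structures,) property vector
    Returns the coefficient vector, intercept first.
    """
    A = np.column_stack([np.ones(len(X)), X])  # prepend a column of ones
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coef


def predict_mlr(coef, X):
    """Apply a fitted model to a descriptor matrix."""
    return coef[0] + X @ coef[1:]


# Synthetic, noise-free toy data (hypothetical descriptors, NOT the BOILPTS set).
rng = np.random.default_rng(0)
n_carbons = rng.integers(1, 11, size=50).astype(float)
n_branches = rng.integers(0, 4, size=50).astype(float)

X = np.column_stack([n_carbons, n_branches])      # descriptor matrix
y = 20.0 + 25.0 * n_carbons - 6.0 * n_branches    # made-up property vector

coef = fit_mlr(X, y)
y_hat = predict_mlr(coef, X)
```

Because the toy property is an exact linear function of the two descriptors, the fit recovers the generating coefficients. Real boiling point data will leave residuals, which is what the exercise asks you to examine.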
Use the data set BOILPTS and go to the DataLab to model the boiling point from
the given structural descriptors. Try to combine different descriptors to find
an optimum combination (just a hint: the model should result in a standard
deviation of the residuals below 8.0, a quality of fit of about 0.97, and an
F-statistic of about 2300).
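If you want to check such figures outside the DataLab, the following sketch computes the usual textbook versions of the three quantities for a least-squares model with an intercept. It assumes that "quality of fit" refers to the coefficient of determination and that the residual standard deviation uses n - p - 1 degrees of freedom; the DataLab may define them slightly differently. The function name `regression_summary` and the four demo points are invented for illustration.

```python
import numpy as np


def regression_summary(X, y):
    """Fit y = b0 + X b by least squares and return common fit statistics."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    n, p = X.shape                      # n samples, p descriptors
    A = np.column_stack([np.ones(n), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)

    residuals = y - A @ coef
    sse = float(residuals @ residuals)            # residual sum of squares
    sst = float(((y - y.mean()) ** 2).sum())      # total sum of squares
    dof = n - p - 1                               # residual degrees of freedom

    return {
        "coef": coef,
        "s_res": np.sqrt(sse / dof),              # std. deviation of residuals
        "r2": 1.0 - sse / sst,                    # coefficient of determination
        "f_stat": ((sst - sse) / p) / (sse / dof) # overall F-statistic
    }


# Tiny made-up example with one descriptor (not BOILPTS data).
X_demo = np.array([[0.0], [1.0], [2.0], [3.0]])
y_demo = np.array([0.0, 1.0, 2.0, 4.0])
summary = regression_summary(X_demo, y_demo)
```

Adding a descriptor never lowers the coefficient of determination, so the residual standard deviation and the F-statistic, which both account for the number of variables, are the more informative figures when comparing descriptor subsets.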
Try to answer the following questions:
- How do you justify your selection of variables?
- How do the MLR results compare with PCR?
- Do you have any idea how to cope with the remaining non-linearity?
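For the MLR versus PCR question, it may help to see how the two relate mechanically. The sketch below is one common way to do principal component regression (standardize the descriptors, project onto the leading principal axes, regress on the scores); it is not necessarily how the DataLab implements it. The function `fit_pcr` and the synthetic, deliberately correlated descriptors are assumptions made for illustration.

```python
import numpy as np


def fit_pcr(X, y, n_components):
    """Principal component regression; returns a prediction function."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    x_mean = X.mean(axis=0)
    x_std = X.std(axis=0, ddof=1)
    Z = (X - x_mean) / x_std                      # standardized descriptors

    # Rows of Vt are the principal axes, ordered by explained variance.
    _, _, Vt = np.linalg.svd(Z, full_matrices=False)
    V = Vt[:n_components].T
    T = Z @ V                                     # scores (already centered)

    y_mean = y.mean()
    gamma, *_ = np.linalg.lstsq(T, y - y_mean, rcond=None)

    def predict(X_new):
        Z_new = (np.asarray(X_new, dtype=float) - x_mean) / x_std
        return y_mean + Z_new @ V @ gamma

    return predict


# Synthetic data with two nearly collinear descriptors (not BOILPTS data).
rng = np.random.default_rng(1)
x1 = rng.normal(size=80)
x2 = x1 + 0.05 * rng.normal(size=80)              # almost a copy of x1
x3 = rng.normal(size=80)
X = np.column_stack([x1, x2, x3])
y = 3.0 * x1 + 2.0 * x3 + 0.1 * rng.normal(size=80)

# Reference MLR fit on the raw descriptors.
A = np.column_stack([np.ones(len(X)), X])
b, *_ = np.linalg.lstsq(A, y, rcond=None)
mlr_pred = A @ b

pcr_full_pred = fit_pcr(X, y, n_components=3)(X)  # all components kept
pcr_two_pred = fit_pcr(X, y, n_components=2)(X)   # smallest component dropped

sse_mlr = float(((y - mlr_pred) ** 2).sum())
sse_pcr_two = float(((y - pcr_two_pred) ** 2).sum())
```

With all components retained, PCR reproduces the MLR fit exactly. Dropping components can only increase the residual sum of squares on the training data, but it removes directions of near-collinearity, which tends to stabilize the coefficients.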