Principal Component Regression with Weka


Update 06/12/16: Because of its apparent importance to the use of Principal Component Regression, I have put together a little more on collinearity below.

Yes, that's right, Principal Component Regression. What am I doing looking at that? This is another follow-up to the Advanced Data Mining With Weka MOOC. The course included a Challenge involving chemometric analysis of a petrochemical dataset. The challenge results for the second dataset were particularly good. On the course forums Peter Reutemann indicated these results were obtained with Gaussian Processes. I tried to figure out how you could get better numbers for this dataset.

In googling chemometric data analysis, Principal Component Regression came up as a commonly used technique. Using the R MLR interface provided by Weka, I obtained a better correlation coefficient and root mean squared error than the Challenge results. I posted this to the list. Eibe Frank replied that this could be done directly in Weka using a FilteredClassifier, with PrincipalComponents set as the filter and LinearRegression as the classifier. I had attempted something like this by trying to use Principal Components for AttributeSelection instead of data filtering, with no success, so I had resorted to the R option. The Weka filtering is mainly controlled by setting the maximum number of attributes to retain. I asked how Eibe had determined this value and he said he had found the best value that didn't give really bad results.
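To make concrete what that filter-plus-regression pipeline is doing, here is a small Python/NumPy sketch of the PCR core: project the centered predictors onto the top-k principal components, then do ordinary least squares on the scores. This is my own illustration under those assumptions (the names pcr_fit and pcr_predict are made up), not the Weka code or PCR.java:

```python
import numpy as np

def pcr_fit(X, y, k):
    """PCR sketch: PCA on centered X, then least squares on the top-k scores."""
    x_mean = X.mean(axis=0)
    Xc = X - x_mean                              # center predictors
    # eigendecomposition of the covariance; eigh returns ascending eigenvalues
    _, vecs = np.linalg.eigh(np.cov(Xc, rowvar=False))
    V = vecs[:, ::-1][:, :k]                     # top-k principal directions
    T = Xc @ V                                   # component scores
    A = np.column_stack([np.ones(len(y)), T])    # intercept + scores
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return x_mean, V, coef

def pcr_predict(model, X):
    x_mean, V, coef = model
    T = (X - x_mean) @ V
    return coef[0] + T @ coef[1:]
```

With k equal to the full number of attributes this reduces to ordinary linear regression; keeping fewer components discards the low-variance directions that make the regression unstable when attributes are collinear.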

I decided to see if I could come up with something programmatic that would implement the whole Principal Component Regression process and also determine a good value for the maximum number of attributes. I came up with a sort of binary search for finding this value.
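My actual search is in PCR.java below; as a rough illustration of the idea only, here is a simple interval-narrowing search over the number of components. It assumes the error curve is unimodal in k (an assumption I come back to later), and rmse_for_k stands in for evaluating a PCR model with k components:

```python
def search_components(rmse_for_k, lo, hi):
    """Find the k in [lo, hi] minimizing rmse_for_k, assuming a unimodal curve.

    Ternary-style search: evaluate two interior points each round and
    discard the worse end of the interval. If the unimodality assumption
    fails, the search can converge to the wrong k.
    """
    while hi - lo > 2:
        m1 = lo + (hi - lo) // 3
        m2 = hi - (hi - lo) // 3
        if rmse_for_k(m1) < rmse_for_k(m2):
            hi = m2 - 1        # minimum lies left of m2
        else:
            lo = m1 + 1        # minimum lies right of m1
    return min(range(lo, hi + 1), key=rmse_for_k)
```

In practice rmse_for_k would be something like cross-validated RMSE of the FilteredClassifier with the maximum number of attributes set to k.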

This is PCR.java.

This shows its results on the second dataset from the challenge.

results2.txt

These results are pretty good, but it is only one dataset. I looked at the pls package for R, which includes principal component regression as pcr. It comes with three datasets - yarn, oliveoil, and gasoline. You can get more information on these datasets from R with something like...
library(pls)
?yarn

I converted these datasets to Weka's preferred ARFF format and got these results for yarn...
yarn_train.arff
yarn_test.arff
yarn_results.txt
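The conversion itself is mechanical: an ARFF file is just a header of @attribute declarations followed by comma-separated @data rows. A minimal Python sketch for an all-numeric dataset (the function name write_arff is mine, and this ignores nominal attributes, missing values, and quoting):

```python
def write_arff(path, relation, names, rows):
    """Write an all-numeric dataset as an ARFF file (last column = class)."""
    with open(path, "w") as f:
        f.write("@relation %s\n\n" % relation)
        for name in names:
            f.write("@attribute %s numeric\n" % name)
        f.write("\n@data\n")
        for row in rows:
            f.write(",".join(str(v) for v in row) + "\n")
```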

The search again converges to something that gives good results.

And the results for gasoline. I split this dataset to train/test as per The pls Package: Principal Component and Partial Least Squares Regression in R.
gas_train.arff
gas_test.arff
gas_results.txt

And yet again the search converges to something that looks fairly optimal.

Comparing these results with those from other classifiers, I think this would be the best of what I've looked at, including R's regr.pcr. I could set up a test with a stepped search using the same number of steps as my PCR uses in its search; again, I think my search would almost always arrive at a more optimal number of components. So I am reasonably happy that it works fairly well in many cases.

Is it amazing and always better than anything else ever? No. I also tried it on another dataset used in the MOOC, soil sample NIR data. It got bad results and failed to converge to its own search optimum. Trying to run with only training data, not including test data, also seemed to put the code into some sort of loop for these datasets.

As far as the search being off goes, I think it rests on the assumption that the number of components has a single optimal value, and that moving toward this value from either fewer or more components always improves accuracy in terms of correlation coefficient and RMSE. Surprisingly, a lot of the data seems to fit this unimodal pattern. The soil data apparently does not.

The possibly suboptimal search and the program loop could be seen as important flaws by anyone seriously wanting to use the code. If you are such a person, let me know and I will be happy to look into these concerns.

Collinearity

All of the data tested above appears to be NIR data. Such data tends to involve a lot of collinearity, which is, to my understanding, where PCR makes sense. I thought some way of measuring collinearity would be good: first, to know whether PCR is a reasonable choice for given data; second, to help determine why PCR may not have worked so well with a given dataset.

Googling around I came up with VIF - the Variance Inflation Factor - which seemed to be a common metric for collinearity. I implemented it based on Weka LinearRegression as VIF.java. I looked for R packages that did the same thing to cross-check my results; the HH package seemed a good one for the purpose. I noticed some differences in the VIF values for the Weka glass dataset. Asking on the Weka list, it was suggested that I disable Weka's attribute and collinearity elimination. I did, and my numbers then closely matched the HH package ones. Fine.

Then a bit later I thought, wait a minute - eliminate the Weka collinearity attribute selection? I looked, and in fact Weka does have its own code in LinearRegression for this. It checks for attributes with standardized coefficients greater than 1.5 to eliminate. It eliminates one attribute, then does another regression and checks again - because eliminating that one attribute may change the picture, right? This is probably more efficient than using VIF, where I had to do a regression for each attribute before checking values. To use VIF properly for elimination you would then have to start over for each remaining attribute, because again the situation might change after eliminating one. The R package VIF, I think, takes this approach and claims to do fast regression. Interesting. Possibly more on this later.
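VIF itself is simple to state: regress each attribute on all the others and take VIF_j = 1 / (1 - R_j^2), where values well above 5 or 10 are usually read as problematic collinearity. A NumPy sketch of that per-attribute loop (my own illustration, not VIF.java):

```python
import numpy as np

def vif(X):
    """Variance inflation factor per column: 1 / (1 - R^2) from regressing
    that column on all the other columns (with an intercept)."""
    n, p = X.shape
    out = []
    for j in range(p):
        y = X[:, j]
        others = np.delete(X, j, axis=1)
        A = np.column_stack([np.ones(n), others])
        coef, *_ = np.linalg.lstsq(A, y, rcond=None)
        resid = y - A @ coef
        ss_res = float(resid @ resid)
        ss_tot = float(((y - y.mean()) ** 2).sum())
        r2 = 1.0 - ss_res / ss_tot
        out.append(1.0 / (1.0 - r2))
    return out
```

Note the cost this paragraph mentions: one full regression per attribute, before any elimination even starts.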