Machine Learning

Machine learning, data mining, data analytics, ...
Note that this page is under some construction to make it more like my javasrc page. I sort of liked the layout of that one.


Update 07/26/15 Not putting up a code update this week, though I wasn't completely idle. I used CVStepper some in trying to best my old Kaggle Titanic score. No joy yet. 80% scores look more common now, but .799 is still my best. The first configuration CVStepper found was about the best I did; I got away from it for a while and failed to equal it afterwards. Finally I tried using R to throw together a quick voter-type scheme based on prior .799 results (I had a few), but the voter runs repeatedly managed only .794. I did remember some R anyhow. I think I might give up yet again on Titanic for a while, although I might check out some other Kaggle competitions; they can be interesting. I might also put a little more time into the CVStepper code. A few too many searches now seem to be coming up non-optimal. There might be nothing I can do about this, but maybe I can improve the 'scan' (as opposed to 'stepping') code. It was slightly disappointing that parameter optimization didn't give better Kaggle competition results, but then Experimenter results didn't seem to hold up either. Maybe the missing piece is knowing when configurations tend toward overfitting? Anyhow, I will still put more time of some sort into this.

Update 07/19/15 Still working on CVStepper. Fixed a couple of bugs. Unfortunately you now do, correctly, get 'non-optimal' runs that don't include the exhaustive best result. One way this happens is if you initially start with low values that never reach a 'threshold' setting where an improvement results. I have added some code to work on that, which might need more improvement sometime. It was also possible, with the routines that scan when no improvements are found or scan past epsilon convergence, to still do slightly more evaluations than exhaustive search in special cases. This wasn't just bug related, although I think I added some related fixes. I then added an upper-bounds check so the scan routines won't exceed the max values, which should eliminate any chance of doing more evaluations than exhaustive.

In a couple of cases while the code was 'buggy' it was actually able to beat the exhaustive best values. This happened when it incorrectly exceeded the maximum range values and managed to hit on better ones. One benefit of faster searches might be that you can make your search ranges larger and not miss too many of these cases. One occurrence of this was in the number of trees for RandomForest. I was trying to test a range of 80-120 trees but suddenly had tests with 10900 trees, which beat the exhaustive best. I hadn't thought adding these really large numbers of trees actually did any good, but it seems it can. Another question I'd sort of like to answer is whether cross-validation doing an exhaustive search for each fold really gives you better prediction accuracy estimates. Training with different parameter configurations improving the estimates seems non-intuitive to me. To decide if this is true I felt I would need testing on data with separate test datasets to compare accuracy on. One place that has this is Kaggle competitions.

All of this sort of leads back to my going after something of a white whale I had in the Titanic competition there. This is supposed to be a 101 training competition for people starting out. I was able to get very close to 80% on it but could not get past that after numerous attempts. I remember that one of the R tutorials (which did achieve 80%) had a RandomForest that included 2000 trees, a value I thought very high. But I am now testing with CVStepper over a much larger range of tree values.
-P "I 100.0 1500.0 15.0"
That has been running for hours. A lot of trees. It might still go off on memory; I didn't increase that at all. But I'm back chasing that a little bit.

One thing to maybe note is that the multiplication version of this parameter is a little different. The last 15.0 in the number-of-trees parameter is the number of 'steps'. I find that sort of non-intuitive, so for multiplication increments I have it as shown below, like...
C 0.03125 32768.0 4.0 *
The 4.0 here is the multiplicative increment itself, applied each time the value changes, like
value = value * 4

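Roughly, my reading of how the two range styles expand into candidate values is sketched below. This is just an illustration of the convention, not the actual CVStepper parsing code, and it assumes the additive form follows the usual CVParameterSelection meaning of 'steps' (evenly spaced values including both bounds).

import java.util.ArrayList;
import java.util.List;

public class ParamRangeSketch {
    // Additive: -P "I 100.0 1500.0 15.0" -> 15 evenly spaced values from 100 to 1500.
    static List<Double> additive(double min, double max, double steps) {
        List<Double> vals = new ArrayList<>();
        double inc = (max - min) / (steps - 1);
        for (double v = min; v <= max + 1e-9; v += inc) vals.add(v);
        return vals;
    }

    // Multiplicative: -P "C 0.03125 32768.0 4.0 *" -> multiply by 4 until the max is passed.
    static List<Double> multiplicative(double min, double max, double factor) {
        List<Double> vals = new ArrayList<>();
        for (double v = min; v <= max + 1e-9; v *= factor) vals.add(v);
        return vals;
    }

    public static void main(String[] args) {
        System.out.println(additive(100.0, 1500.0, 15.0));          // 100, 200, ..., 1500
        System.out.println(multiplicative(0.03125, 32768.0, 4.0));  // 2^-5, 2^-3, ..., 2^15
    }
}
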
Hopefully more testing and less coding to follow.

Update 07/12/15 If interested grab CVStepper.java off of one of the links below. Sort of a lost weekend. I figured the code was ready for more testing. It seems to work well with RandomForest, so I thought I'd branch out into other classifiers. I ended up on SVM. Both the Weka Optimizing parameters page and the libsvm guide consider it.

libsvm, sort of interestingly, considered it with multiplicative ranges for the complexity and gamma parameters. I thought that's an enhancement I can add, so I did. With CVStepper you can indicate a multiplicative increment with something like...
C 0.03125 32768.0 4.0 *
The last character indicates that the multiplication operator should be used for increments.

The problem I ran into is with using the ranges from the libsvm guide: C = 2^-5, 2^-3, ..., 2^15. Using the first few values led to very bad, unchanging results. I added code to try to continue searching even if no increments seem to give better-than-epsilon results. For this to work I had to rely on the weakest option to continue: simply incrementing the first parameter and hoping you get to a threshold where a suitable improvement results. For ionosphere.arff this did work. However, the stepper algorithm is now somehow more inefficient than exhaustive search. Not good. Maybe just accepting that unrealistic parameter ranges, and maybe multiplicative increments, are not good ideas would be best. But I'm not sure I've decided that yet, at least for now. I also noticed that even when I was getting really bad accuracy before the improvement was reached, I was still reporting optimal searches, and in fact exhaustive search was indicating the same bad result as best. This must be a bug, which means all optimal-result indications I've had up to now are suspect. Again, not good.

So more to be done anyhow, and effectiveness still not really demonstrated, although it does clearly reach some improvement over the default parameter settings.

Update 07/09/15 I think I have the benchmarking how I want it for now so that version is, yes, still CVStepper.java. Now to do some testing besides Iris and see if the code is any good or needs some sort of changing.

I wasn't really interested in CPU seconds for the benchmarking, nor does it do anything like benchmark warm-up. Mainly: does the code save wall clock time and effort in the way of number of evaluations done? Also, whether or not 'stepping' finds the same best parameter set as 'exhaustive'. Additionally, how many steps does 'stepper' actually take? I'm a little worried the search will tend to terminate early.

Sample Iris benchmark


== Benchmark Information (non-CV) ==
--Stepper--
Elapsed: 00:00:01.704
Number of evaluations: 14
Total Number of steps taken: 6
Result: 0.0333
Options: -depth 3 -I 80 -K 2 -S 1 -num-slots 1
Optimal Search: true
--Parameter Selection--
Elapsed: 00:00:13.184
Number of evaluations: 125
Result: 0.0333
Options: -depth 3 -I 80 -K 2 -S 1 -num-slots 1
== Benchmark Information (total CV) ==
--Stepper--
Elapsed: 00:00:03.278
Number of evaluations: 110
Total Number of steps taken: 52
Result: 0.0519
Options: -depth 3 -I 80 -K 2 -S 1 -num-slots 1
--Parameter Selection (exhaustive)--
Elapsed: 00:01:52.323
Number of evaluations: 1250
Result: 0.0519
Options: -depth 3 -I 80 -K 2 -S 1 -num-slots 1
Alternate optimal: 7
-depth 5 -I 80 -K 1 -S 1 -num-slots 1 (0.0444)
-depth 2 -I 90 -K 1 -S 1 -num-slots 1 (0.0519)
-depth 3 -I 80 -K 3 -S 1 -num-slots 1 (0.0222)
-depth 3 -I 100 -K 3 -S 1 -num-slots 1 (0.0444)
-depth 4 -I 110 -K 2 -S 1 -num-slots 1 (0.0444)
-depth 3 -I 80 -K 4 -S 1 -num-slots 1 (0.0444)
-depth 2 -I 80 -K 2 -S 1 -num-slots 1 (0.0296)

Optimal Search: true means that we found the same 'optimal' parameters as the exhaustive search did. The command line output looks slightly better than the pasted-in HTML.

For more bad-looking pasted-in HTML, an evaluation is roughly


// One 'evaluation' = a full m_NumFolds-fold cross-validation of the classifier with one
// candidate set of option values. 'evaluation' is a weka.classifiers.Evaluation, presumably
// constructed from trainData; the resulting error rate is read from it afterwards.
((OptionHandler)m_Classifier).setOptions(options);
for (int j = 0; j < m_NumFolds; j++) {

  // We want to randomize the data the same way for every
  // learning scheme.
  Instances train = trainData.trainCV(m_NumFolds, j, new Random(1));
  Instances test = trainData.testCV(m_NumFolds, j);
  m_Classifier.buildClassifier(train);
  evaluation.setPriors(train);
  evaluation.evaluateModel(m_Classifier, test);
}

Update 06/28/15 The code is a little farther along yet. Still the CVStepper.java. Maybe not complete yet, still largely untested for one thing. The one sort of test I've been doing so far is just RandomForest with the Iris dataset.

For some idea of benchmarking on that, you can run the more or less usual CVParameterSelection version with the -E (exhaustive) option like
java -cp .:weka.jar us.hall.weka.CVStepper -t "/Applications/weka-3-7-12/data/iris.arff" -E -B -P "depth 0.0 4.0 5.0" -P "I 80.0 120.0 5.0" -P "K 0.0 4.0 5.0" -X 10 -S 1 -W weka.classifiers.trees.RandomForest -- -I 100 -K 0 -S 1 -num-slots 1

This gets benchmark (-B) results like...
== Benchmark Information ==
Total Elapsed: 00:02:07.647
Total Number of evaluations: 1375
Final result: 0.0444
Final Options: -depth 4 -I 110 -K 2 -S 1 -num-slots 1

The final result is the best (lowest) error value.

The results in CVStepper mode look more like...
== Benchmark Information ==
Total Elapsed: 00:00:05.256
Total Number of evaluations: 47
Final result: 0.0593
Final Options: -depth 0 -I 80 -K 0 -S 1 -num-slots 1

So, if it can consistently give results reasonably close to the optimal, it runs a lot faster and does a lot less work here.

I've even started a separate page for it. This might be sort of silly if it turns out in more extensive testing not to work worth a nit but anyhow I have a page going for it.

Update 06/23/15 Not totally buggy. Not quite right, but not totally buggy. There's a 'stepper' boolean at the start of the code that allows switching between stepper search and the usual exhaustive search. CVStepper.java below is still the relevant source. Including both should allow wrapping in some benchmarking at some point.

Update 06/21/15 Start at something new anyhow. Although, GapStatisticClusterNum still needs work. I'm trying to come up with a CVParameterSelection subclass that would work something like gradient descent.

CVStepper.java

The idea is to take incremental steps in whatever 'direction' (i.e., incremented parameter values) seems to give the greatest reduction in cross-validation error. Then iterate, or take another step, until you stop getting improvements in the error. The step can involve multiple parameters, a combined step. The idea seems like it should work with CVParameterSelection, but it's really buggy right now.
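For illustration, a stripped-down sketch of the single-parameter version of the stepping idea is below. The actual CVStepper also tries combined steps over multiple parameters and does its evaluations by cross-validation; here a toy error function stands in for that.

import java.util.Arrays;
import java.util.function.ToDoubleFunction;

public class StepperSketch {

    // values[p] holds the candidate settings for parameter p, lowest first.
    // Greedily step one parameter at a time until no step improves the error by more than epsilon.
    static int[] step(double[][] values, ToDoubleFunction<double[]> cvError, double epsilon) {
        int[] idx = new int[values.length];     // start every parameter at its lowest value
        double best = cvError.applyAsDouble(current(values, idx));
        boolean improved = true;
        while (improved) {
            improved = false;
            int bestParam = -1;
            double bestErr = best;
            for (int p = 0; p < values.length; p++) {
                if (idx[p] + 1 >= values[p].length) continue;   // stay inside the range
                idx[p]++;
                double err = cvError.applyAsDouble(current(values, idx));
                idx[p]--;
                if (err < bestErr - epsilon) { bestErr = err; bestParam = p; }
            }
            if (bestParam >= 0) { idx[bestParam]++; best = bestErr; improved = true; }
        }
        return idx;
    }

    static double[] current(double[][] values, int[] idx) {
        double[] v = new double[idx.length];
        for (int p = 0; p < idx.length; p++) v[p] = values[p][idx[p]];
        return v;
    }

    public static void main(String[] args) {
        // Toy error surface with a minimum at depth=3, trees=100.
        double[][] grid = { {0, 1, 2, 3, 4}, {80, 90, 100, 110, 120} };
        int[] best = step(grid, v -> Math.abs(v[0] - 3) + Math.abs(v[1] - 100) / 100.0, 1e-6);
        System.out.println(Arrays.toString(current(grid, best)));
    }
}
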

Not sure if the idea is original, but it seems like, at least for Weka, it might make some sense to have a non-exhaustive search of the "state space" (I'm trying to read the CVParameterSelection-related paper as well; that's its terminology for this). Cutting off when the error converges should amount to some pruning; in the future maybe some attribute checks could also be pruned.

A couple possible problems come immediately to mind. One is that the parameter ranges aren't all continuous. For example, 0 for number of attributes for RandomForest doesn't really mean 0, but that it will try to determine the best number to use. You might have to figure out how to leave 0 out of the range for these. Another problem would be getting the ranges wrong. Since you are always incrementing you might miss an optimal value that is less than the initial one. So you have to always start your range low enough to include the actual optimal. There are probably other possible problems with this approach. But again the possible benefits of avoiding exhaustive cross-validation runs for each possible parameter combination might be worth it. This version should include some benchmarking to see how much benefit this might be.

See how it goes.

Update 03/08/15 At long last, another update. Not a real big one. I made a couple of changes to GapStatisticClusterNum. The first is to check the gap as it goes along so it can quit as soon as it hits a suitable one, rather than processing everything all the way through for each k cluster-number value and only at the end checking to see if any had a suitable gap. This should generally work out to be more efficient.

The other change is a bug fix. The code uses a Weka-EM-like bit to iterate through a number of randomly centered KMeans runs to try to find the best cluster centroids. The code was doing the iterating but not remembering the KMeans run that had hit on the best error, both in the selection and Monte Carlo parts. This is still a lot of KMeans runs, so probably not the best choice from a performance standpoint. I should do some accuracy comparisons with Weka EM and XMeans, and I have a few other changes in mind yet. Hopefully it won't take so long this time.

Update 12/07/14 Cluster number guesstimating. I thought using the iterated SimpleKMeans code in Weka's EM code and picking the cluster number that minimized error would work for this. I did one implementation of this that sort of assumed the error would decrease linearly. That isn't true, and it didn't work real well when the class attribute itself was removed.

I did another implementation based on the gap statistic that was suggested on the Weka forum. This assumes the error decreases non-linearly but hits an elbow point at the right number of clusters. It seems to sort of work at this point.

The code is GapStatisticClusterNum.java.
and
GapKMeans.java.
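For reference, the selection rule itself (the standard gap-statistic criterion from Tibshirani et al.) can be sketched as below, given precomputed log within-cluster dispersions. This is only the criterion, not the actual GapStatisticClusterNum code, and the numbers in main are made up for illustration.

public class GapStatisticSketch {

    // logW[k-1] = log(W_k) for the real data; refLogW[b][k-1] for uniform reference dataset b.
    static int chooseK(double[] logW, double[][] refLogW) {
        int maxK = logW.length;
        int B = refLogW.length;
        double[] gap = new double[maxK];
        double[] s = new double[maxK];
        for (int k = 0; k < maxK; k++) {
            double mean = 0;
            for (int b = 0; b < B; b++) mean += refLogW[b][k];
            mean /= B;
            gap[k] = mean - logW[k];
            double var = 0;
            for (int b = 0; b < B; b++) var += Math.pow(refLogW[b][k] - mean, 2);
            s[k] = Math.sqrt(var / B) * Math.sqrt(1.0 + 1.0 / B);
        }
        // Smallest k with Gap(k) >= Gap(k+1) - s(k+1). This is also the test that lets the
        // search quit early instead of evaluating every k first and checking at the end.
        for (int k = 0; k < maxK - 1; k++) {
            if (gap[k] >= gap[k + 1] - s[k + 1]) return k + 1;
        }
        return maxK;
    }

    public static void main(String[] args) {
        // Made-up illustrative dispersions with an elbow at k = 3.
        double[] logW = {5.0, 4.2, 3.1, 3.0, 2.95};
        double[][] ref = { {5.1, 4.6, 4.2, 3.9, 3.7}, {5.0, 4.5, 4.1, 3.8, 3.6} };
        System.out.println("chosen k = " + chooseK(logW, ref));
    }
}
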

03/15/15 I did a little more on this but nothing worth uploading yet. I figure an easy way to compare how effectively the cluster guesstimating is done would be to use ClassificationViaClustering. You should be able to use this as your meta classifier and then select FilteredClusterer as its clusterer, then use EM or XMeans or GapStatistic as the filtered clusterer with a Remove filter to eliminate the last, class attribute for the cluster number estimation.

I got that part working for EM and XMeans; you have to like how Weka lets you chain these things together. But I only have a very minimal start on a Weka Clusterer implementation for GapStatistic.

Update 11/23/14 Massive data mining and other online privacy concerns

Update 11/09/14 With the elections over I have moved what I did with the Senate elections to here


Weka


To make an initial entry here I came up with a simple command to list 3.7 package jars.
WekaPackageJars.java
I added a variation of this to my HalfPipe application. That currently looks like (for me)...
org.cmdline.weka.Packages
/Users/mjh/wekafiles/packages/classificationViaClustering classificationViaClustering.jar
/Users/mjh/wekafiles/packages/partialLeastSquares partialLeastSquares.jar
/Users/mjh/wekafiles/packages/ensembleLibrary ensembleLibrary.jar
/Users/mjh/wekafiles/packages/gridSearch gridSearch.jar
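A minimal sketch of what such a listing amounts to, assuming the default ~/wekafiles/packages layout; this is not the actual WekaPackageJars.java.

import java.io.File;

public class PackageJarsSketch {
    public static void main(String[] args) {
        // Installed Weka 3.7 packages each get a directory under ~/wekafiles/packages.
        File packagesDir = new File(System.getProperty("user.home"), "wekafiles/packages");
        File[] packages = packagesDir.listFiles(File::isDirectory);
        if (packages == null) return;
        for (File pkg : packages) {
            File[] jars = pkg.listFiles((dir, name) -> name.endsWith(".jar"));
            if (jars == null) continue;
            for (File jar : jars) {
                System.out.println(pkg.getAbsolutePath() + " " + jar.getName());
            }
        }
    }
}
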

Another minor update: I changed the command properties to give aliases to the earlier Weka commands - classifiers and classifier. I should really have an alias for the Packages class above. More info on these commands again at the Past Updates page.


FastRandomForest Package Update 03/15/15

After some discussion on memory and RandomForest on the Weka mailing list I came across the FastRandomForest project, which indicates possible improvements in both speed and memory. I came up with a Weka 'package' for the code involved that should work with the 3.7 Weka package manager. The URL you would enter in the Package Manager would be...
http://www195.pair.com/mik3hall/FastRandomZip.zip
Or download here and browse to it.

I have emailed the author in case he wants to include it with the package. Otherwise, I guess it was GNU licensed and the source is included, so my making it available would, I think, be all right. I think the package manager expects a weka.classifiers package, so I added such a class that is just a pass-through to the author's.
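My guess at what such a pass-through class looks like is below. The fully qualified FastRandomForest class name and the wrapper's package and class name here are assumptions for illustration, not taken from the actual package contents.

// Comment: the hr.irb.fastRandomForest.FastRandomForest name and the wrapper's
// weka.classifiers.trees home are assumed, not confirmed from the package.
package weka.classifiers.trees;

public class FastRandomForestWrapper extends hr.irb.fastRandomForest.FastRandomForest {
    // Pure pass-through; everything is inherited from the FastRandomForest implementation.
}
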

I based some of the package stuff on what was done for NeuralNetwork

Which is also included on the Weka 'Unofficial' packages page.

Also, possibly of interest for the package manager are...
http://weka.wikispaces.com/How+do+I+use+the+package+manager%3F#GUI%20package%20manager-Unofficial%20packages
http://weka.wikispaces.com/How+are+packages+structured+for+the+package+management+system%3F


Kaggle

These are being done with Weka, Weka-related Java of my own, or the occasional Java helper class of my own. Some of the data manipulation is being done in R.


Competitions

Titanic 'Predict' survivors of the disaster.
mnist Classify images of handwritten digits.

Titanic Competition

On this one I had hit a wall just short of 80% accuracy that I couldn't get past. I reproduced my best result a number of times but so far - no better. As ideas come up that I think might apply I have made one or two more attempts but so far no 80%. I had looked at the R tutorial that was indicated to get 80% but didn't recognize any magic in there that I could apply to my Weka attempts.


mnist Competition

I had one decent entry with IBk. Improved on that with the 3rd party NeuralNetwork classifier. This is a Weka classifier done by Johannes Amtén. He had a Kaggle forum post indicating an impressive 99.46% accuracy submission for this competition. My improvement was based on running it with no modifications. I wrote code to run it from the command line, based on the usual boilerplate I came up with for the Forest competition. This provides a convenient way to get more memory than the Weka GUI provides on OS X. The code...
RunNeuralNetwork.java
So: -e for evaluation instead of prediction. Default prediction output goes to System.out; you can redirect it with > predictions.file.
-t gives the training file and -T the test file. Given this result here, and a 'deep learning' submission far ahead in the Forest competition, neural nets seem worth a closer look.
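The rough shape of that boilerplate is sketched below, with IBk standing in for the package's NeuralNetwork classifier and a made-up one-prediction-per-line output format; it is not the actual RunNeuralNetwork.java.

import java.util.Random;
import weka.classifiers.Classifier;
import weka.classifiers.Evaluation;
import weka.classifiers.lazy.IBk;
import weka.core.Instances;
import weka.core.Utils;
import weka.core.converters.ConverterUtils.DataSource;

public class RunClassifierSketch {
    public static void main(String[] args) throws Exception {
        boolean evaluate = Utils.getFlag('e', args);
        Instances train = DataSource.read(Utils.getOption('t', args));
        train.setClassIndex(train.numAttributes() - 1);
        Classifier cls = new IBk();   // stand-in for the package NeuralNetwork classifier

        if (evaluate) {
            // -e: cross-validated evaluation on the training data.
            Evaluation eval = new Evaluation(train);
            eval.crossValidateModel(cls, train, 10, new Random(1));
            System.out.println(eval.toSummaryString());
        } else {
            // Default: predictions for the test set, one per line, to System.out.
            Instances test = DataSource.read(Utils.getOption('T', args));
            test.setClassIndex(test.numAttributes() - 1);
            cls.buildClassifier(train);
            for (int i = 0; i < test.numInstances(); i++) {
                double pred = cls.classifyInstance(test.instance(i));
                System.out.println((i + 1) + "," + test.classAttribute().value((int) pred));
            }
        }
    }
}
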

Update 08/14/14 Weka package loading

Changed to use Weka package loading so that the package jars do not need to be included separately in the classpath. NeuralNet.jar does still need to be included for compiling.
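The package loading itself can be done with something like the sketch below; the class name passed to forName is just a placeholder, since I'm not stating the NeuralNetwork package's actual class name here.

import weka.classifiers.AbstractClassifier;
import weka.classifiers.Classifier;
import weka.core.WekaPackageManager;

public class PackageLoadingSketch {
    public static void main(String[] args) throws Exception {
        // Load all installed Weka packages so their jars are available to the classloader.
        WekaPackageManager.loadPackages(false);
        // "NeuralNetwork" is a placeholder; the real class name from the package may differ.
        Classifier nn = AbstractClassifier.forName("NeuralNetwork", new String[0]);
        System.out.println(nn.getClass().getName());
    }
}
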

Execution for now, though, can be done simply with something like...
java -Xmx2560m -cp .:weka.jar us.hall.weka.RunNeuralNetwork -e -t train_sparse.arff -T test_sparse.arff

Forest Cover Competition

Update 05/31/15

I set up a page to consider with Weka Experimenter how the AdaBoostM1 and Bagging ensembles work with RandomForest for most of the Weka sample datasets - here

Update 08/03/14

Got to a fairly decent standing in the competition, currently 19th I think. As mentioned in the prior update, I started using Weka Explorer to check for attributes that might contribute to the class 1-2 misclassifications. None really seem to, but I did notice that getting rid of some improved general performance. With a couple of changes for this I got to my current competition standing. I wrote the LessIsMore java class in the working downloads section to make it easier to locate attribute removals that improve classification. However, ever since the initial success with that, I have failed to get any improvement this way. Removing some attributes appears to improve training set accuracy, but the gains don't hold up on the test set. So at this point most, if not all, roads appear to lead down the path of over-fitting the training data.

Update to the update. If you look at the code for LessIsMore it is a bit of a mess, as I've modified it and commented out existing code. At some point I should clean some of this competition-specific code up into something nicer and more reusable. Some of it is learning as I go as well. Keeping prior history in the code can be a little useful there, although I suppose if I committed to something like git I'd be better off.

Anyhow, the commented code shows I started out using IBk to scan the attributes for any that showed improvement. IBk does well enough with this data and seems less computationally intense than most of the classifiers. So it is fast and uses low memory. When IBk found one I double checked it with Random Forest to see if that improved as well. Of course I came to realize that the best attribute sets for different classifiers can themselves be different to the point of one including attributes that might not even be in the set for the other. So a 'spotter' classifier is probably not optimal. Running with Random Forest or SMO as long as they aren't meta wrapped isn't all that inefficient anyhow.
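The scanning idea is roughly the following sketch (not the actual LessIsMore.java): cross-validate a baseline, then drop one attribute at a time with the Remove filter and report any removal that lowers the error.

import java.util.Random;
import weka.classifiers.Evaluation;
import weka.classifiers.lazy.IBk;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.filters.Filter;
import weka.filters.unsupervised.attribute.Remove;

public class LessIsMoreSketch {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read(args[0]);
        data.setClassIndex(data.numAttributes() - 1);

        double baseline = cvError(data);
        System.out.println("baseline error " + baseline);

        // Try removing each non-class attribute in turn and report any that help.
        for (int i = 0; i < data.numAttributes() - 1; i++) {
            Remove remove = new Remove();
            remove.setAttributeIndices(String.valueOf(i + 1));   // Remove uses 1-based indices
            remove.setInputFormat(data);
            Instances reduced = Filter.useFilter(data, remove);
            double err = cvError(reduced);
            if (err < baseline) {
                System.out.println("removing " + data.attribute(i).name() + " -> " + err);
            }
        }
    }

    // IBk as the fast 'spotter' classifier, as described above.
    static double cvError(Instances data) throws Exception {
        Evaluation eval = new Evaluation(data);
        eval.crossValidateModel(new IBk(), data, 10, new Random(1));
        return eval.errorRate();
    }
}
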

I've added a couple of the attribute removed training datasets to the file downloads.

I did hit one test run where the confusion matrix appeared to show fair improvement in the 1-2 classification. I wrote ClassPickerAB to implement the idea I had: set those classifications from the better classifier. Again, this didn't hold up, so either I was mistaken about the improvement (I'm not sure I have a valid confusion matrix from my best test run), or, again, this classifier overfit the training set but didn't do as well with the test. The code shows the implementation of the idea anyhow. I might still find a second classifier it works with - improving the classification for these 2 classes while keeping the original results for the rest.

Currently trying to get ensemble methods like Vote and Stacking to work. They still need a lot of memory and run a long, long time. Maybe boosting memory on the machine would help? I also had a bug where I was resetting the classifier to run to one other than the ensemble. Fixed that and had to start over. That should, I think, be reflected in the ForestStacking and ForestVote below.
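For reference, setting these ensembles up in Weka looks roughly like the sketch below; the base classifiers chosen here are just placeholders, not the combinations ForestVote and ForestStacking actually use.

import weka.classifiers.Classifier;
import weka.classifiers.functions.SMO;
import weka.classifiers.lazy.IBk;
import weka.classifiers.meta.Stacking;
import weka.classifiers.meta.Vote;
import weka.classifiers.trees.J48;
import weka.classifiers.trees.RandomForest;

public class EnsembleSketch {
    public static void main(String[] args) {
        // Placeholder base classifiers.
        Classifier[] bases = { new RandomForest(), new SMO(), new IBk() };

        Vote vote = new Vote();
        vote.setClassifiers(bases);             // combines the bases' predictions by voting

        Stacking stacking = new Stacking();
        stacking.setClassifiers(bases);
        stacking.setMetaClassifier(new J48());  // level-1 learner over the bases' CV predictions

        System.out.println(vote.getClass().getSimpleName() + " and "
            + stacking.getClass().getSimpleName() + " configured");
    }
}
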

About all I have time to say on this for right now

Update 07/27/14

Sort of spinning wheels trying a different approach. Although I am also sort of moving forward with the same approach but based on a different classifier. It is looking like a somewhat optimized SMO performs about as well as the Random Forest versions. Yet to be seen if it derives the same benefits from the AdaBoostM1 and Bagging meta wrappers. SMO does seem to run even longer.

Working downloads related to that...
BaggingBoostIBk.java Was another try. Not quite as good as Random Forest or SMO so far.
BoostSMO.java Was needed because even just using AdaBoostM1 on SMO went out sideways on memory.
BaggingBoostSMO.java The in-progress full version.
ForestStacking.java Competition-specific implementation of the Stacking meta ensemble classifier.
ForestVote.java Competition-specific implementation of the Vote meta ensemble classifier.
LessIsMore.java Indicates if classifier accuracy improves when an attribute is removed.
ClassPickerAB.java Replaces results for two classes with, hopefully, results of a better prediction for those two classes.

A good-sized chunk of the misclassification error seemed to be between the first and second classes. These appear to account for about a third of the misclassifications, with 1 classified as 2 or 2 classified as 1. I thought splitting out a new arff, based on the class 1 or 2 predictions of the original model, might allow me to come up with a better model for classes one and two. So far I can't seem to code this correctly.

Given no apparent headway on that so far I did change the arff file so it was just classes 1, 2 and put everything else in class 3. Not quite the model simplification I was looking for above but still somewhat simpler. The train3 download is below. So far it doesn't seem to improve classification for 1 and 2. Although it does somewhat improve overall classification. It may turn out a simpler model isn't going to work for me. Maybe eliminating some attributes would eliminate some of the 1 and 2 overlap?

About it for this week

My starting point on this one was Weka Bagging of Random Forest which is where I mostly got my best results in the Titanic competition. I noticed that AdaBoostM1 also did pretty well and came up with code that chains both...
BaggingBoostRF.java
Additional parms are...
'-r' reverses the chaining to AdaBoostM1 of Bagging of Random Forest
'-e' to do a cross-validated evaluation instead of making predictions. It still requires the test set parm, which it shouldn't.
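A minimal sketch of the chaining is below; it is not the actual BaggingBoostRF.java, which also handles the prediction output and the -r reversal.

import java.util.Random;
import weka.classifiers.Evaluation;
import weka.classifiers.meta.AdaBoostM1;
import weka.classifiers.meta.Bagging;
import weka.classifiers.trees.RandomForest;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class BaggingBoostRFSketch {
    public static void main(String[] args) throws Exception {
        Instances train = DataSource.read(args[0]);   // e.g. forest-train-sparse.arff
        train.setClassIndex(train.numAttributes() - 1);

        // Chain: Bagging of AdaBoostM1 of RandomForest.
        AdaBoostM1 boost = new AdaBoostM1();
        boost.setClassifier(new RandomForest());
        Bagging bagging = new Bagging();
        bagging.setClassifier(boost);

        // Cross-validated evaluation, roughly what the -e option does.
        Evaluation eval = new Evaluation(train);
        eval.crossValidateModel(bagging, train, 10, new Random(1));
        System.out.println(eval.toSummaryString());
    }
}
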

This is run, on OS X, for this competition with something like...
java -Xmx2048m -cp .:weka.jar us.hall.weka.BaggingBoostRF -t forest-train-sparse.arff -T forest-test-sparse.arff > predictions
Predictions were handled in a way somewhat specific to this test case. I'm not sure if I can make that more general or not yet. Note the memory override. More on this later.

Arff files
Training: forest-train-sparse.arff
and the much larger...
Test: forest-test-sparse.arff
Trying to separately model the first two classes, this has 1, 2 and everything else lumped into 3.
Training 3 class: forest-train3-sparse.arff
This training set has attributes removed that seemed to work well with IBk
forest-train-x-sparse.arff
This training set should have attributes removed more suited to Random Forest
forest-train-xrf-sparse.arff