Weka

Waikato Environment for Knowledge Analysis

02/07/16 Starting a new page for anything Weka related. More will be added from the machine learning page as I find time and interest to update documentation. As well, of course, as new content as I find time and interest for that.

10/22/17 Looked at package classifiers. This turned out to be a bit tricky. Including MLRClassifier crashed the app, so I omit it. The same goes for something in the timeseries .gui. classes, so I skip those as well; probably no classifiers are missed there. As before, see this for details..

10/15/17 Support for my HalfPipe application will now probably mostly go to the new Java 9 version. See details.. That version was getting errors on the weka.classifiers command, which lists all available Weka classifiers. I made a minor change to fix that. I think this might be the only version left that doesn't include package classifiers. That should be looked at sometime.

10/01/17 Updates to some commands for my HalfPipe application for displaying Weka packages. e.g.
whereis ClassificationViaClustering
weka.classifier ClassificationViaClustering
showc weka.classifiers.meta.ClassificationViaClustering

09/17/17 Extended a Groovy script provided by Eibe Frank on the Weka mailing list for learning curves - LearningCurve.java. This version adds in plotting the curves. Rather than taking the parameters to run the classifier it has a parameter to take a saved model for the classifier.
e.g.
java -cp /Applications/weka-3-8-1/weka.jar:. us.hall.weka.LearningCurve -l iris_cluster.model -t $iris

Where the -l switch is the saved model and -t the data file.

07/16/17 I had put a version of the Weka package, ensembleLibrary, up on GitHub. This was to correct, or work around, some hanging/recursion bug. I am looking at classification optimization again, maybe hyperparameter tuning, maybe ensembles. The main interest, I guess, still being related to Kaggle competitions. As part of that I tried using ensembleLibrary again. It had some issues. I made a couple of changes and figured out how to force an update to GitHub. I might make some more changes.

It was indicated on the Weka list that they would come up with or already had a release with a fix for the issue I first ran into. But the code seems to have other issues and possibly room for a little improvement so I may make more updates to the GitHub version of mine.

06/12/16 More OS X specific content, RWeka setup workarounds.
Also an update to the Principal Components Regression with Weka page.

06/09/16 Still more code related to the Weka Advanced MOOC. Principal Components Regression with Weka.

05/22/16 More time in on the same code as last week. It seemed to me I should be able to figure out something Weka was doing that would correlate with the recursion problems R was having. Of course the obvious thing it was doing for this MOOC Class assignment was working with data having 163,841 attributes. Very likely to cause code problems somewhere. I thought though that there might be some way Weka could change its invocations to work around the problem. I didn't find any such way; there just wasn't a good way to handle that many attributes. -Xss to work with the recursion is probably the only recourse.

I did notice though that using more of the JRI java/R interface data type objects and assign methods instead of putting together string scripts for R to evaluate did seem to generally improve performance. I noticed that this seems to be true for either more attributes or more instances. Again, it isn't all that big a deal until the numbers for these start getting fairly high. But these were general improvements that always seemed to do better than the current implementation. It seemed like using them might make some sense.

So I thought if you're going to go for this you might as well go for it all the way. I modified the code to use a 'fast' method that uses a single call to REXP.createDataFrame. This is probably the fastest way I came up with as the number of attributes increases, but only to a point. Past that point it would crash the app itself. For the Nifti Class exercise data this seemed to be somewhere between 45,000 and 50,000 attributes. So I set a constant limit of 40,000 where the code switches from this to a more 'scalable' method. The scalable method pretty closely follows the current implementation, but it avoids some of the script evaluates and uses the JRI types. I made a few other changes as well. This version is still faster than the current implementation. It also seemed a little faster than the 'fast' version as the number of instances increases. Almost all my testing here though was with the Weka RDG1 data generator, which I noticed almost after the fact only generates true/false nominal attributes.

The testing is overall still very limited but I think this code will hold up to be generally faster running than the current version.

I made one other change. I wrapped this in a SwingWorker subclass, based roughly on the ones in the Weka PackageManager code. I think for GUI use, which most of the RPlugin-related code is, this would get it off the application event thread and would make it easier to lazy load the data when R actually needs it.

i2df.execute();
result = i2df.get();

The execute runs the instances to dataframe in background. You could if you like start that right after loading the Weka instances. You can ensure, anytime, later that it is complete by using the get(). I think, I'm not real familiar with this code.
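
A minimal sketch of the execute()/get() pattern described above. It is not the RUtils code: the class name BackgroundConverter is hypothetical, a plain double[][] stands in for the Weka instances, and a map of named columns stands in for the R data frame.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import javax.swing.SwingWorker;

// Hypothetical stand-in for the instances-to-data-frame worker: it turns
// row-major data into named columns on a background thread.
class BackgroundConverter extends SwingWorker<Map<String, double[]>, Void> {
    private final String[] names;
    private final double[][] rows;

    BackgroundConverter(String[] names, double[][] rows) {
        this.names = names;
        this.rows = rows;
    }

    @Override
    protected Map<String, double[]> doInBackground() {
        // Runs on a SwingWorker thread, not on the event dispatch thread.
        Map<String, double[]> columns = new LinkedHashMap<>();
        for (int c = 0; c < names.length; c++) {
            double[] column = new double[rows.length];
            for (int r = 0; r < rows.length; r++) {
                column[r] = rows[r][c];
            }
            columns.put(names[c], column);
        }
        return columns;
    }

    public static void main(String[] args) throws Exception {
        String[] names = {"sepallength", "sepalwidth"};
        double[][] rows = {{5.1, 3.5}, {4.9, 3.0}, {4.7, 3.2}};
        BackgroundConverter worker = new BackgroundConverter(names, rows);
        worker.execute();                             // start right after loading
        Map<String, double[]> result = worker.get();  // block only when needed
        System.out.println(result.keySet() + " rows=" + result.get("sepalwidth").length);
    }
}
```

Note that get() blocks the calling thread until the work is finished, so calling it from the event thread before the conversion is done would still freeze the GUI for the remaining time.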

My latest version of RUtils...
RUtils.java

For the really curious...

The benchmarking version, would need to be renamed to be used...
TestRUtils.java

The compile command I used...
javac -cp .:weka.jar:/Users/mjh/wekafiles/packages/RPlugin/lib/JRI.jar:/Users/mjh/wekafiles/packages/RPlugin/lib/JRIEngine.jar:/Users/mjh/wekafiles/packages/RPlugin/libExternal/REngine-0.6-8.1.jar:/Users/mjh/wekafiles/packages/RPlugin/pdm-RPlugin-ce-TRUNK-SNAPSHOT.jar -d . RUtils.java

The OS X execution script...
#!/bin/bash
export R_HOME=/Library/Frameworks/R.framework/Resources
java -Xss10M -Xmx4096M -cp .:weka.jar weka.gui.GUIChooser

How would I use any of this myself on an ongoing basis? I don't know. But I may look some more at using the Weka/R interface from Java code and could include this in that somehow. Anyhow, I turned last week's hacky test version into something that looks a little more like real code.

05/15/16 This week's activities were mainly related to the Advanced Weka courses. This week concerned the R interface. It ran into difficulties at a couple places for me in a RecursiveRelease method in R. Others seemed to have this problem on the large dataset used in the week's final activity. This involved a dataset initially with 163841 attributes. The suggestion was made to increase stack space using -Xss on java launch. This avoided the crashes.

The R code seemed buggy to me and I googled up hits where this was considered to be the case. However, checking I found this method in R's memory.c, which looked established and possibly sort of core to normal processing. So, possibly not likely to be considered by those supporting R as buggy and subject to change.

This still seemed like a bit of a performance issue for Weka since with the RPlugin installed loading the data into Weka also automatically loads it into R. This is done on the EDT and added a couple minutes more time with an unresponsive application not really indicating anything going on. This seems a little questionable to me, it should possibly be done on a worker thread and maybe 'lazy' loaded into R as well only when it is actually needed?

I figured out where Weka does the data loading: org.weka.RUtils. (This is my own testing version.) The code seemed a little strange in putting together the R data.frame a column at a time, sort of a piecemeal approach. I started making some changes to different versions of the instancesToDataFrame method with some benchmarking included. The current implementation of interest is _instancesToDataFrame4. That uses REXP.createDataFrame. One call seems to do everything the original is working much harder to do with numerous calls into R. However, on the activity dataset it goes out sideways. No crash log written, no exception thrown, just the app disappears. I verified it does work with the iris dataset. I added code to truncate the activity dataset attribute loop. Worked at 150 like iris, worked at 1500, worked at 15000, didn't work at 150000. [The current version has debug messages, expects to run to 150000, etc., debugging not ready for normal use.] So it is probably just that the interface can't handle the size of the dataset? Most likely memory? This would be the likely explanation for the Weka approach as well. It will work as long as no single column gets too big to be added.

About the only other idea I have on this is to let R read in the dataset itself. It has its own methods to do this which it has undoubtedly tried to make as efficient as possible. The Weka interfacing code would need to know the original path to the arff file, which I'm not sure is readily accessible as this would normally not be needed with the data in Instances. It would have to figure out how many records to skip to get past the @data. It would then probably use read.table and add code to convert that to a data.frame, maybe add column names, etc.
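
The "how many records to skip" step could be sketched roughly as below. The class and method names are hypothetical, the sample ARFF text is made up, and the printed read.table call is only a guess at what the R side might look like. It assumes a dense ARFF file where the data section starts on the line after @data.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;

class ArffHeaderSkip {
    // Returns the number of lines up to and including the @data line,
    // or -1 if no @data line is found.
    static int linesThroughData(Reader source) throws IOException {
        BufferedReader in = new BufferedReader(source);
        int count = 0;
        String line;
        while ((line = in.readLine()) != null) {
            count++;
            // ARFF keywords are case-insensitive, so compare ignoring case.
            if (line.trim().regionMatches(true, 0, "@data", 0, 5)) {
                return count;
            }
        }
        return -1;
    }

    public static void main(String[] args) throws IOException {
        // Made-up header, only for illustration.
        String arff = "% comment\n@relation iris\n@attribute sepallength numeric\n"
                + "@attribute class {a,b}\n@data\n5.1,a\n4.9,b\n";
        int skip = linesThroughData(new StringReader(arff));
        System.out.println("read.table(file, skip=" + skip + ", sep=\",\")");
    }
}
```

Column names and types would still have to be taken from the @attribute lines, and sparse ARFF or quoted values would need more than this.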

Sort of interesting anyhow

05/08/16 Again more HalfPipe, minor updates to the cron command. Started parsing code for Quartz job file - this will allow dynamic updates without editing it.
See HalfPipe download below.

05/01/16 More HalfPipe, my app, specific. Finally got the Quartz scheduler configuration files handled so that Quartz can be started automatically. This is still OS X only. I have no Windows access to do anything with that right now. I had to get away from my fancier idea of adding a directory into classpath where Quartz could find its files. I can get it into a classloader but not classpath. So, instead I just set properties to correct full paths. Leave it at that for now. Next would probably be giving CronCommand an option to update the job initialization file. Then you could, say, automatically have it check Weka packages for currency every time the application is started.

Also taking the Advanced Weka course. It looks to cover a variety of interesting topics before it's done. Only week 2 now. MOA and large datasets with streaming. Week 1 was time series, still a definite interest of mine.

04/17/16 New major Weka releases
These Weka changes seem to join the previous 3.6 production and 3.7 development branches and start a new pair with 3.8 production and 3.9 development as the releases. I assume for now, right after the merge, that the two versions are pretty much the same.

There are a number of changes in particular affecting OS X. First, to a more general one. The announcement email indicated...

NOTE 1: Users of Weka 3.6 will find that serialized models created in 3.6 cannot be used in 3.8.

I had considered trying to do something related with Weka after I first heard about its use of serialization. I didn't finish that. After seeing this I did put something together that I hope might be useful.

WekaSerialConverter.java

09/17/17 New Weka package classloader related changes

To get information on a serialized model you can...

java -cp .:weka37.jar WekaSerialConverter j4837.ser.model

The Weka jar in classpath is the one used to create the model. Or one from a serialized compatible version.
To actually attempt to recreate the serialized model with the new version you can...

java -cp .:weka37.jar WekaSerialConverter -n weka38.jar -t iris.arff -d j4838.ser j4837.ser.model

-n Is the Weka jar file for the new, serialized incompatible release.
-t Is the training file.
-d Is the 'dumping', serialized new release model.

This code has seen very limited testing. Just using a model created with J48 trained on the Iris dataset. Also, simple meta testing with Bagging of RandomForest again for the Iris dataset. If you have issues with any of your models in using this please let me know. ( email )
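
For getting basic information out of a serialized file, one generic approach is sketched below. This is not how WekaSerialConverter works; it is a plain-JDK illustration with a hypothetical class name, using a java.util.ArrayList as a stand-in for a model. Standard Java serialization records each class name together with a serialVersionUID, and a mismatch in either is one common reason a stream fails to load under a different version of a library. Whether that is what happens between Weka 3.6 and 3.8 is an assumption I have not checked.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.ObjectStreamClass;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper: records every class descriptor found in a stream.
class StreamClassLister extends ObjectInputStream {
    final List<String> seen = new ArrayList<>();

    StreamClassLister(InputStream in) throws IOException {
        super(in);
    }

    @Override
    protected ObjectStreamClass readClassDescriptor()
            throws IOException, ClassNotFoundException {
        // The descriptor carries the name and UID as written to the stream.
        ObjectStreamClass desc = super.readClassDescriptor();
        seen.add(desc.getName() + " serialVersionUID=" + desc.getSerialVersionUID());
        return desc;
    }

    static List<String> describe(byte[] serialized)
            throws IOException, ClassNotFoundException {
        try (StreamClassLister in =
                new StreamClassLister(new ByteArrayInputStream(serialized))) {
            in.readObject();
            return in.seen;
        }
    }

    public static void main(String[] args) throws Exception {
        List<String> labels = new ArrayList<>(Arrays.asList("setosa", "versicolor"));
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(labels);
        }
        describe(bytes.toByteArray()).forEach(System.out::println);
    }
}
```

The classes in the stream still have to be loadable for readObject() to succeed, which is why the commands above put the jar that created the model on the classpath.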

Update 04/25/16

The Weka 3.8 package manager issues I was having all seem resolved if you move or remove the ser file indicated in the release notes. I also had to move or remove the unofficial command-to-code package. I will have to use this from 3.7 for now. Hopefully, the author will have time at some point to make it 3.8 compatible.

Whatever problems I was having in updating the application bundle also seem to be gone now. I duplicated the application. I then made my modifications to that bundle and they seem to be working correctly.

For what it's worth these changes are...

Increase the Info.plist JVMOptions memory parameter from 1024M to 2048M, I might even increase it more sometime.

Set R_HOME in the Info.plist environment variables to /Library/Frameworks/R.framework/Resources.

Added the bin/java command from my own installed 1.8 jre to the Weka app embedded jre for the Auto-Weka package.

I would personally vote to see the R_HOME and bin/java inclusion as permanent changes. I'm not sure why my changes were hanging before and what the seeming 'signing' related messages were that I was seeing in system log. I had seen some reference to an Apple bug related to these messages, maybe it was this. Maybe I needed to reboot, who knows?

But the good news is I seem to be good to go now with Weka 3.8 and hopefully 3.9 as well.

03/06/16 Not specifically Weka related. But I have added some basic scheduling functionality into the HalfPipe application. It still has a ways to go. I haven't figured out a good place in application startup to initialize it but if you fire up a new CronCommand it should initialize Quartz. For example...
us.hall.quartz.CronCommand 15 0/2 * * * ? echo "this is scheduled"
My understanding, and per the documentation, is that this will be "scheduled to run every other minute starting at 15 seconds past the minute." There is also a status job that is scheduled to run every 10 minutes that will just log a message that Quartz is active, like...
INFO] 06 Mar 07:07:45.000 PM DefaultQuartzScheduler_Worker-6 [us.hall.quartz.ActiveJob]
Quartz active _________
The underscores show up as a solid green bar in the app - a sort of visual that it is active.
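
To make the cron fields in the example above concrete, here is a simplified sketch of how the seconds field (15) and the minutes field (0/2) expand. The class and method names are hypothetical and this is not Quartz's own parser; it handles only "*", a single number, and the "start/step" form.

```java
import java.util.ArrayList;
import java.util.List;

class CronFieldSketch {
    // Expands one seconds/minutes style field (range 0..max) for three
    // simple forms only: "*", a single number, or "start/step".
    static List<Integer> expand(String field, int max) {
        List<Integer> values = new ArrayList<>();
        if (field.equals("*")) {
            for (int v = 0; v <= max; v++) {
                values.add(v);
            }
        } else if (field.contains("/")) {
            String[] parts = field.split("/");
            int start = parts[0].equals("*") ? 0 : Integer.parseInt(parts[0]);
            int step = Integer.parseInt(parts[1]);
            if (step <= 0) {
                throw new IllegalArgumentException("step must be positive: " + field);
            }
            for (int v = start; v <= max; v += step) {
                values.add(v);
            }
        } else {
            values.add(Integer.parseInt(field));
        }
        return values;
    }

    public static void main(String[] args) {
        // Field order: seconds minutes hours day-of-month month day-of-week
        String[] fields = "15 0/2 * * * ?".split(" ");
        List<Integer> seconds = expand(fields[0], 59);
        List<Integer> minutes = expand(fields[1], 59);
        // First few fire times within an hour, printed as minute:second.
        for (int i = 0; i < 3; i++) {
            System.out.printf("%02d:%02d%n", minutes.get(i), seconds.get(0));
        }
    }
}
```

So within each hour the job fires at 15 seconds past every even-numbered minute.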

As this shows, logging support is again active in the application. I think at some point application bloat became a concern to me so I removed the log4j jar, with a somewhat involved mechanism for optionally handling logging support if it was added to the classpath by the user. With scheduled activities logging again seems like a must have. It might be nice to ensure that random, scheduled output doesn't mingle with normal command output, so that command output can be cut and pasted as a whole. I am currently thinking of maybe having two output areas: one for command entry and output, and another split screen for logged messages.

More related to Weka again is that this could allow periodically running the packages out of sync command. Scheduling this to be done maybe daily is a little problematic if you don't have the application up and running all the time. Which I don't currently do myself. I believe Quartz supports persistent job stores of a couple different types, such as SQL databases. I need to look into that. For anyone who doesn't want to set up mysql or something like that just for this functionality, I think you could maybe have an external text file (xml?) that could be read in at startup and set schedules. Where to keep this data is still sort of hard to know. Weka uses the wekafiles directory in the user's home directory. On OS X I am more thinking the Application Support directory right now.

For supporting external files I am also considering plugin type support for HalfPipe. The Weka-related jars for the commands, property file for command aliases, schedules, or any other resources would go in the plugin, which I am again thinking of going into the Application Support directory on OS X. This is sort of similar to the Weka package manager idea but somewhat different. Another possible plugin would be one for downloading Oil related data, another sort of machine learning mini-project I have. BitCoin, another interest I have, could be another plugin that could have periodic processes. Plugins would again help with bloat problems for the application itself.

Anyhow, a lot to do and still not all that much done. One more command has been added to list out Quartz meta data...
us.hall.quartz.SchedulerMeta
I still have to come up with aliases for these. The Quartz stuff could also be a plugin but probably not an optional one.

The HalfPipe download again...

halfpipe.dmg

02/21/16 Still not too much of an update. All HalfPipe related. Download below. Fixed head and tail commands so -n will work correctly with piped, like...
weka.log | head -n 5
There is also a fix for weka.pkgmgr.search so that it will include any installed unofficial packages in its searches.

Not quite a complete update. Worked on 'weka.pkgmgr.check' and the gui for it you can get by using 'set weka.pkgmgr.gui true'. Not complete yet. Would still like the fancy progress bar for updates. Refreshes still seem to refresh everything and not just selected.

I would like to have this working before adding scheduling functionality into HalfPipe. Currently considering Quartz for the scheduler. I could then periodically do checks for Weka packages out of sync. Or, schedule periodic data updates. Or, schedule scripts to run if scripting functionality becomes more used. Supposed to be the next step anyhow.

For now there is a new addition for Weka to the HalfPipe command set.

Package Manager is becoming an important addition to Weka for the 3.7.x versions. There are better than 150 packages in the official repository.

Some way to locate packages that might be of interest might itself be of interest. So I wrote a HalfPipe command to do this.

e.g.
weka.pkgmgr.search ClassifierAttributeEval

Halfpipe download

halfpipe7.dmg

HalfPipe is still mainly an OS X application until I get more access to a Windows machine to begin with. I did get a bat file launcher working a little bit a while back.

This combines invocations of Weka's own weka.core.WekaPackageManager command with a little of the HalfPipe grep command (which I think was pretty much nio sample code from somewhere). So the argument is actually a regular expression. For my own use I will probably just use simple keyword searches which is how I pretty much use grep as well. There is some highlighting to try and show where matches occur in the package info and visually separate packages.
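
A minimal sketch of the grep-style part, assuming the package info has already been fetched as lines of text. The class name, the "pkginfo" source label, and the sample lines are all made up for illustration; this is not the HalfPipe code. The output uses the same source:line:text shape as the grep output shown elsewhere on this page.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

class RegexLineSearch {
    // Returns each matching line formatted as source:lineNumber:text.
    static List<String> grep(String regex, String source, List<String> lines) {
        Pattern pattern = Pattern.compile(regex);
        List<String> hits = new ArrayList<>();
        for (int i = 0; i < lines.size(); i++) {
            // find() matches anywhere in the line, like grep does.
            if (pattern.matcher(lines.get(i)).find()) {
                hits.add(source + ":" + (i + 1) + ":" + lines.get(i));
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        // Placeholder package info, not real repository content.
        List<String> info = Arrays.asList(
                "Package: examplePackageA",
                "Description: wraps ClassifierAttributeEval for ranking",
                "Package: examplePackageB",
                "Description: something unrelated");
        for (String hit : grep("ClassifierAttributeEval", "pkginfo", info)) {
            System.out.println(hit);
        }
    }
}
```

Because the argument is compiled as a regular expression, a plain keyword works unchanged as long as it contains no regex metacharacters.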

There is also weka.pkgmgr.check which should tell you if any of your installed packages are out of sync. For me it shows mismatched repo/install versions but then says no packages are out of sync. It was, if I remember right, not complete. I sort of wanted to include a gui'd version that allowed refresh and would show you one of those neat progress bars with the download status. If there is any interest in seeing that completed or any other comments on any of this feel free to get a hold of me...
EMail: contact

HalfPipe and Weka files

I started prefixing weka command aliases with the weka. prefix. Sort of a namespace scheme. This is the first time I've tried it. You can get a complete list of Weka related commands with weka.cmds, which sort of takes advantage of this.

Questions have occasionally come up on the Weka mailing list where the answer is to look at the weka.log file. To possibly make viewing the log a little more convenient HalfPipe includes the weka.log command. Using this by itself will output the entire file.

You can get somewhat more selective about this. For example to get an idea if the RPlugin package has initialized correctly you could use the command(s)...
weka.log | grep JRI
System.in:8:Injecting JRI classes into the root class loader...
System.in:11:Engine class: class org.rosuda.JRI.Rengine ClassLoader:sun.misc.Launcher$ExtClassLoader@2e5f8245
(colors somewhat different in HalfPipe)
If you don't see that you could use grep to check for R_HOME instead. There might be an error message saying that it isn't set correctly. There was for me. How to set that for OS X will follow.

You could also try commands like...
weka.log | head
weka.log | tail
(I added the head and tail commands to HalfPipe just for this. They had been on the to-do list for a while).

I don't have a convenient way to access the data files that Weka includes with its distribution. Although it probably shouldn't be that difficult to add yourself. For example, for me I move the install folder along with the app to the OS X /Applications directory.

Then you can get the installed version...
weka.version
3.7.13
and then define a property variable...
set weka.data /Applications/weka-3-7-13/data
This should make it more convenient to do command line invocations from HalfPipe, but I haven't tried that yet. An example of using it other than for that might be...
head ${weka.data}/Iris.arff
% 1. Title: Iris Plants Database
%
I have a weka.run command in place but haven't tried it yet.
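
A minimal sketch of how a ${name} reference like the one above could be expanded from a table of properties. The class name is hypothetical and this is not the HalfPipe implementation; leaving unknown references untouched is my own choice here.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class PropertyExpander {
    // Matches ${name} and captures the name.
    private static final Pattern REF = Pattern.compile("\\$\\{([^}]+)\\}");

    static String expand(String text, Map<String, String> props) {
        Matcher m = REF.matcher(text);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String value = props.get(m.group(1));
            // Unknown references are left as written.
            String replacement = (value != null) ? value : m.group();
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        Map<String, String> props = new HashMap<>();
        props.put("weka.data", "/Applications/weka-3-7-13/data");
        System.out.println(expand("head ${weka.data}/Iris.arff", props));
    }
}
```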

Weka (and R) on OS/X

RWeka setup on OS X

RWeka currently seems to have issues on OS X. It requires Java 6 to be installed and tries to run against that, but needs to run the Weka-related code with Java 7 or later. The problem seems to be in the rJava package, whose maintainers in turn seem to say it's Oracle's problem. So you get errors like...
Error in .jnew("weka/core/Attribute", attname[i]) :
java.lang.UnsupportedClassVersionError: weka/core/Attribute : Unsupported major.minor version 51.0
To get around this you can follow these steps...

1). From Terminal check which Java R should be using with the following command...
sudo R CMD javareconf
This should probably indicate Java 8 for the latest Weka versions.

2). From the R console re-install the rJava package from source...
install.packages('rJava', type='source')

3). If you get an error after doing that when entering 'library(rJava)' which indicates somewhere in it that 'Library not loaded: @rpath/libjvm.dylib' you can work around this from Terminal again with something like...
sudo ln -s $(/usr/libexec/java_home)/jre/lib/server/libjvm.dylib /usr/local/lib
This makes a symbolic link adding a pointer to the current location of libjvm.dylib to /usr/local/lib. NOTE: This link will not be automatically updated as java versions change.

In order to redo the link first remove the old one like...
sudo rm /usr/local/lib/libjvm.dylib

NOTE: 06/07/I just had to redo this

You can verify that this has worked with...
library(rJava)
.jinit()
.jcall("java/lang/System", "S", "getProperty", "java.runtime.version")
[1] "1.8.0_65-b17"
For the Java runtime being used by rJava.

Update 06/07/17 I ran into problems...
Warning: unable to access index for repository http://streaming.stat.iastate.edu/CRAN/src/contrib:
cannot open URL 'http://streaming.stat.iastate.edu/CRAN/src/contrib/PACKAGES'
chooseCRANmirror()
http://www.onthelambda.com/2014/09/17/fun-with-rprofile-and-customizing-r-startup/
sudo rm /usr/local/lib/libjvm.dylib

Update 05/29/17 I recently had problems with this again when updating the RWeka package and added the first two lines. library() of course is normal for loading an R package that you didn't just install. The .jinit() also seemed necessary. The problem I was having, OS X, seemed related to having a Java 9 early access JDK installed. rJava appears to try and parse the version out of the version string shown above. For 9 early access this is just 'ea'. It didn't work. Temporarily I dragged the 'ea' jdk out of the way. It worked then. I had also brought my 1.8 version current at the same time and threw away some old 1.7 and 1.8's. This caused Eclipse problems because it was hard-coded to a specific 1.8 that I had trashed. Rather than try to change Eclipse for now I just dragged the old 1.8 back out of the Trash. Everything fine.

You can verify RWeka with whatever you normally use RWeka for...
data <- read.arff("org_c_no_missing-rn.arff")
> dim(data)
[1] 3911 217

Most of this was based on the discussions at...
https://github.com/s-u/rJava/issues/37
http://conjugateprior.org/2014/12/r-java8-osx/

R_HOME for the OS X Weka application

Briefly, for setting R_HOME required by the RPlugin package on OS X I went with updating the Weka application. I had Googled a page that indicated another approach involving launchctl or something like that but I am more comfortable just updating an application's Info.plist file.

So I control clicked the application and chose "Show Package Contents". I then double clicked on the Contents/Info.plist file which launched Xcode. There used to be a standalone application just for editing plists but that doesn't seem to be around anymore. The Xcode editor seems similar. I added the R_HOME environment variable. LSEnvironment I believe is the actual plist key, but the editor shows it more readably as "Environment variables". This is a dictionary type that you add an R_HOME item to. I used the value R itself gave...
> R.home()
[1] "/Library/Frameworks/R.framework/Resources"

I still haven't done that much yet with either RWeka or the RPlugin but the R interfaces remain an interest. I had tried one for HalfPipe at one time I think along the lines of the plugin. But had limited success.