Kaggle Titanic competition with Weka

Introduction

I was interested in trying Weka on the Kaggle Titanic competition, and this describes a little of what I did. It wasn't all done with Weka: most of the data manipulation was done with R, and some changes were made with a plain text editor. If you aren't interested in R you may not be too interested in what follows.

Methods

Data Collection

Getting the data set up correctly was the biggest task for me in using Weka for the competition. The Kaggle Titanic data [1] is supplied as csv. The Weka Explorer application can work with these quite nicely, perhaps after a little adjusting with the appropriate filters. One thing that can trip you up is that Weka expects the predicted class to be the last field. I thought I had used a filter to move the first column of the Titanic 'train' file, the actual 'survived' class, into that position. I had not, and didn't realize until considerably later that I had actually been predicting 'embarked' all along. By the time I noticed, I was getting better than 80% accuracy on the 'embarked' class (an AdaBoostM1 classifier paired with RandomForest). I was impressed there was enough information in the other features to predict it that accurately.

I did most of the data manipulation in R and some in the OS X text editor TextWrangler. I saved some of the history for future reference.

Writing out the Weka csv file while I was still working with csv:
write.csv(test,"wekatest.csv",row.names=FALSE,quote=c(3))
I was still including the name (third) column at this point, and the quote=c(3) quoted it. I later decided it carried no useful information and started omitting it. Testing showed little impact even on the training set for parch, so I omitted that as well. The ticket values looked almost completely different between the training and test sets, so I ended up skipping that too.
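The column dropping itself is trivial in R; a hypothetical reconstruction on a toy row (the exact commands weren't saved):

```r
# Hypothetical reconstruction -- dropping a data.frame column
# is just assigning NULL to it.
test <- data.frame(name="Kelly, Mr. James", ticket="330911",
                   parch=0, age=34.5)
test$name   <- NULL
test$ticket <- NULL
test$parch  <- NULL
names(test)   # only 'age' remains in this toy frame
```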

Setting missing age values to the mean:
test$age <- replace(test$age,is.na(test$age),30.0)
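The hard-coded 30.0 is roughly the mean age; it could also be computed directly. A small sketch with toy ages (not the actual data):

```r
# Toy ages standing in for the real column
age <- c(22, 38, 26, NA, 35)
m <- mean(age, na.rm=TRUE)          # mean of the non-missing values
age <- replace(age, is.na(age), m)  # fill the NAs with that mean
```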

In some cases I tried to replace missing factor data with the Weka-preferred unknown value of '?'. Occasionally this meant adding '?' as a level for the factor.
levels(train$embarked) <- c(levels(train$embarked),"?")
train$embarked <- replace(train$embarked,which(train$embarked == ""),"?")

Weka requires that the test file include the predicted column, in the same ('last') position as in the training file.
survived <- rep(0,418)
test <- cbind(survived,test)
I later changed this from 0 to the Weka-preferred '?'.

I did end up including the cabin field in the arff files. To define each cabin as a unique nominal value in the @attribute line, I first made them a quoted, comma-separated list in R.
paste(shQuote(unique(train$cabin)),collapse=",")
I tweaked this, pasted it into the arff file, and formatted it with the text editor. When I moved on to the actual 'test' file it had new cabin values, so I had to merge the training and test values and build the unique list from that, then make sure both files had the same complete @attribute definition. I didn't save these commands.
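A plausible reconstruction of that merge, on toy cabin vectors (the originals weren't saved):

```r
# Plausible reconstruction: build one @attribute value list from
# the cabins seen in either file, then quote as before.
train_cabin <- c("B45", "E31", "B45")
test_cabin  <- c("B45", "C148")
allcabins <- unique(c(train_cabin, test_cabin))
paste(shQuote(allcabins), collapse=",")
```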

Finally, you get the test file classification to run and want to submit the predictions [2]. To get the predictions written out you need to select the "More options..." button in the Weka Explorer and enable outputting predictions. The test run output then includes something like this:
inst#, actual, predicted, error, probability distribution
1 ? 1:0 + *0.914 0.086
2 ? 2:1 + 0.461 *0.539
3 ? 1:0 + *0.706 0.294
4 ? 1:0 + *0.719 0.281
Skipping the header line, you paste the rest of the results into a text file and read that into R. The process there is something like:
results <- read.table("resBagging.txt")
PassengerID <- c(892:1309)
Survived <- as.character(results$V3)
Survived <- replace(Survived,which(Survived == "1:0"),0)
Survived <- replace(Survived,which(Survived == "2:1"),1)
results.csv <- data.frame(cbind(PassengerID,Survived))
write.csv(results.csv,"resBagging.csv",row.names=FALSE)
Submit the results csv file to Kaggle and you're done.

Things got more difficult when I was done with the training data and tried to switch to testing with the "Supplied test set" option. As noted in the Weka documentation [3],

Using CSV files as train and test set can be a frustrating exercise.

Tell me about it. I was getting nowhere with the two files as csv, so I decided to convert to arff files. That was probably for the best, but it didn't eliminate the data problems. Some of the data fields gave Weka trouble regardless of csv or arff; these were mainly the character fields like cabin or ticket. I will probably work out a better approach to these types of attributes as I run into them in the future.

Results

The first result is the reusable Weka training and test files for the Titanic competition.

train.arff

test.arff

@attribute pclass {1,2,3}
@attribute sex {female,male}
@attribute age numeric
@attribute sibsp numeric
@attribute fare numeric
@attribute cabin {'B45','E31',...,'B42','C148'}
@attribute embarked {C,Q,S}
@attribute survived {0,1}

The next version of the data. Splitting the deck off of the cabin felt like an original idea at the time, though I've since seen that others considered the same thing. In googling this I did see that only the first and second class decks had lifeboat stations, so that may have had some bearing.

wekatrain2.arff

wekatest2.arff

@attribute pclass {1,2,3}
@attribute sex {female,male}
@attribute age numeric
@attribute sibsp numeric
@attribute parch numeric
@attribute fare numeric
@attribute embarked {C,Q,S}
@attribute decks { 'A','B','C','D','E','F','G','T' }
@attribute cabinNum { '0','1','10',...,'94','95','99' }
@attribute survived {0,1}
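The deck/cabinNum split could be done along these lines in R, shown on toy values (illustrative, not necessarily the exact commands used):

```r
# Split 'cabin' into a leading deck letter and the remaining number
cabin <- c("B45", "E31", "C148")
decks    <- substr(cabin, 1, 1)   # first character is the deck
cabinNum <- substring(cabin, 2)   # the rest is the cabin number
```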

Another data attempt was converting age into a ranged nominal attribute. I think I saw this on the Kaggle forums. This was the one data change where I think results mostly got worse. The AF in the file names stands for age factor.

wekatrainAF.arff

wekatestAF.arff

@attribute pclass {1,2,3}
@attribute sex {female,male}
@attribute age { "G1","G2","G3","G4","G5","G6","G7" }
@attribute sibsp numeric
@attribute parch numeric
@attribute fare numeric
@attribute embarked {C,Q,S}
@attribute decks { 'A','B','C','D','E','F','G','T' }
@attribute cabinNum { '0','1','10',...,'94','95','99' }
@attribute survived {0,1}
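The G1..G7 age factor could have been produced with R's cut(); the breakpoints below are assumptions, not the ones actually used:

```r
# Bin age into seven nominal groups G1..G7
# (breakpoints are illustrative guesses)
age <- c(4, 22, 38, 71)
ageF <- cut(age, breaks=c(0, 10, 20, 30, 40, 50, 60, 100),
            labels=paste0("G", 1:7))
```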

The last of the mainstream data changes so far was adding a title attribute. I think this was also suggested on the Kaggle forums, as a better indicator than age itself given that attribute's missing values.

wekatrainT.arff

wekatestT.arff

@attribute pclass {1,2,3}
@attribute sex {female,male}
@attribute age numeric
@attribute sibsp numeric
@attribute parch numeric
@attribute fare numeric
@attribute embarked {C,Q,S}
@attribute decks { 'A','B','C','D','E','F','G','T' }
@attribute cabinNum { '0','1','10','101',...,'93','94','95','99' }
@attribute title { "Mr","Mrs","Miss","Master","Other" }
@attribute survived {0,1}
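Title extraction from the name field could look something like this; the regex and the grouping of rare titles into "Other" are my assumptions:

```r
# Pull the title out of names like "Braund, Mr. Owen Harris"
name <- c("Braund, Mr. Owen Harris",
          "Heikkinen, Miss. Laina",
          "Simonius-Blumer, Col. Oberst Alfons")
title <- sub(".*,\\s*([A-Za-z]+)\\..*", "\\1", name, perl=TRUE)
# Collapse anything outside the four common titles into "Other"
title[!title %in% c("Mr", "Mrs", "Miss", "Master")] <- "Other"
```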

These files should work with whatever classifier you want to try them out on. At least they have for me.


LibSVM

I did notice at some point that LibSVM wasn't selectable as a classifier for this data, though it was available for the traditional Iris dataset. There seem to be difficulties with some nominal attributes in some cases. I finally removed all attributes from the dataset in Weka Explorer and then started adding them back in to see where LibSVM stopped being selectable. Everything was fine until I got back to the embarked/decks/cabinNum attributes; including any of those made LibSVM unselectable. I tried the NominalToBinary filter, which converts a nominal attribute into new 0/1-valued attributes, one for each possible value. I would have thought this should make LibSVM happy; it is actually recommended in the LibSVM introductory guide. That it didn't work seems like a bit of a glitch. To verify the approach should work, I created another modified version of the data, applying the conversion manually myself.

trainLibSVM.arff

testLibSVM.arff

@attribute pclass_1 numeric
@attribute pclass_2 numeric
@attribute pclass_3 numeric
@attribute sex_female numeric
@attribute sex_male numeric
@attribute age numeric
@attribute sibsp numeric
@attribute parch numeric
@attribute fare numeric
@attribute embarked_na numeric
@attribute embarked_c numeric
@attribute embarked_q numeric
@attribute embarked_s numeric
@attribute decks_na numeric
@attribute decks_a numeric
@attribute decks_b numeric
@attribute decks_c numeric
@attribute decks_d numeric
@attribute decks_e numeric
@attribute decks_f numeric
@attribute decks_g numeric
@attribute decks_t numeric
@attribute title_master numeric
@attribute title_miss numeric
@attribute title_mr numeric
@attribute title_mrs numeric
@attribute title_other numeric
@attribute survived {0,1}

I omitted cabinNum since I didn't want to deal with all its values. With that done, LibSVM worked just fine, again suggesting the NominalToBinary filter should have worked in Weka Explorer.
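The manual 0/1 expansion can be done in R with model.matrix; this is a sketch of the approach on toy values, not the commands actually used:

```r
# One 0/1 indicator column per factor level, analogous to
# Weka's NominalToBinary filter ("- 1" drops the intercept)
embarked <- factor(c("C", "Q", "S", "S"))
emb <- model.matrix(~ embarked - 1)
colnames(emb)   # "embarkedC" "embarkedQ" "embarkedS"
```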

I followed the previously mentioned guide, eventually. Normalizing the data was all but required; the results were much worse without it. I initially seemed to get better results with the linear kernel, but adjusting the cost and gamma parameters as suggested in the guide made no difference. Reading a little closer, the grid search over those parameters was intended for the RBF kernel. I changed back to that, and tweaking the parameters finally got me slightly better results on the training data with RBF.

I submitted this to the Titanic competition, getting a .79426, not besting my current high of .79904. I'd still like to break .8 sometime, but for now I'm done with the Titanic competition again for a little while, the LibSVM mystery mostly solved.


Conclusion

Weka is very usable for the Kaggle Titanic competition.

References

[1] Data for the Kaggle Titanic competition.
[2] Weka documentation: Making Predictions.
[3] Weka FAQ: Can I use CSV files?