You can do an internet search on "winter oil" to find more information. The source I based my
data changes on is this
GasBuddy blog post, which provides a brief description of why there are seasonal blends.
The very basic idea is that more volatile fuel is allowed in winter than in summer. This is
because it is assumed that vehicles' pollution-control equipment won't be able to handle the higher volatility
fuels in the hotter months. Volatility is measured as Reid Vapor Pressure (RVP).
From the blog we have...
Normal RVP values look like this:
January-March RVP is 13.5+ in many areas
April-September RVP is 7.0-9.0 in many areas
September-December RVP is 11+ in many areas
Splitting my data by these periods gives the following average prices:
> mean(fall$Price) [1] 2.90831
> mean(winter$Price) [1] 3.036031
> mean(summer$Price) [1] 3.228312
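As a rough sketch of how rows could be tagged with a blend period and averaged, here is one way to do it in R. The helper name `blend_for_month`, the `Date` and `Blend` columns, and the sample rows are all made up for illustration. Because the blog's ranges overlap in September, this sketch assumes September counts as summer.

```r
# Hypothetical helper: map a month number (1-12) to a blend period label.
# Assumption: September is treated as summer, since the quoted ranges overlap.
blend_for_month <- function(month) {
  ifelse(month <= 3, "winter",
         ifelse(month <= 9, "summer", "fall"))
}

# Made-up rows purely for illustration (not the data used in this post).
gas <- data.frame(
  Date  = as.Date(c("2014-01-15", "2014-06-15", "2014-09-15", "2014-11-15")),
  Price = c(3.10, 3.50, 3.30, 2.80)
)

# Extract the month number from each date and label the row.
gas$Blend <- blend_for_month(as.integer(format(gas$Date, "%m")))

# Average price per blend period, analogous to mean(fall$Price) and friends.
blend_means <- tapply(gas$Price, gas$Blend, mean)
print(blend_means)
```

The `Blend` column can then be written out as an extra nominal attribute for Weka alongside the existing ones.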
Now the question is: does this additional information improve our gas price prediction model? It doesn't seem to help on the training data when using Weka's default RandomForest.
Metric | Training Without Blend | Training With Blend
---|---|---
Correlation coefficient | 0.9646 | 0.9628
Mean absolute error | 0.0714 | 0.0718
Root mean squared error | 0.0985 | 0.1013
Relative absolute error | 23.5823 % | 23.7112 %
Root relative squared error | 26.5171 % | 27.2607 %
On the training data, the model with blend does slightly worse on the error measures than the model without. However, the picture changes when we apply the models to the test data.
Metric | Test Without Blend | Test With Blend
---|---|---
Correlation coefficient | -0.3842 | -0.0778
Mean absolute error | 0.2677 | 0.2233
Root mean squared error | 0.3141 | 0.2764
Relative absolute error | 27.2418 % | 22.7197 %
Root relative squared error | 30.853 % | 27.1499 %
On the test data, the model with blend seems clearly better. Outputting the Weka prediction results and checking
them with R seems to mostly confirm this.
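For anyone wanting to repeat that kind of check, the following is a minimal sketch of recomputing the headline measures in R from a set of predictions. The function name `prediction_summary` and the sample numbers are made up for illustration, and it assumes you have the actual and predicted prices as two numeric vectors of equal length.

```r
# Summarise regression predictions: correlation, MAE and RMSE.
# 'actual' and 'predicted' are numeric vectors of equal length.
prediction_summary <- function(actual, predicted) {
  err <- predicted - actual
  c(correlation = cor(actual, predicted),
    mae         = mean(abs(err)),
    rmse        = sqrt(mean(err^2)))
}

# Made-up values for illustration only (not the results from this post).
actual    <- c(2.40, 2.35, 2.50, 2.30)
predicted <- c(2.30, 2.45, 2.40, 2.40)

s <- prediction_summary(actual, predicted)
print(round(s, 4))
```

The relative measures Weka reports are left out here, since they depend on the baseline Weka compares against.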
Consider which model gets closer to actual average price...
> mean(resbal$predicted) [1] 2.311187
> mean(resblend$predicted) [1] 2.374188
> mean(test_blend$Price) [1] 2.388521
Comparing maximum errors...
> max(abs(resbal$error)) [1] 0.584
> max(abs(resblend$error)) [1] 0.485
Comparing how many errors fall below 0.10...
> table(resbal$error < .10)
FALSE TRUE
24 72
> table(resblend$error < .10)
FALSE TRUE
24 72
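One thing worth keeping in mind with this kind of count: if the `error` column is signed (predicted minus actual, which is an assumption here), then `error < .10` also counts large negative errors as TRUE. A small sketch with made-up numbers shows the difference between thresholding the signed error and thresholding its absolute value.

```r
# Made-up signed errors (predicted minus actual), for illustration only.
err <- c(-0.30, -0.05, 0.02, 0.08, 0.15)

# Signed comparison: -0.30 passes the test even though it is a large miss.
signed_counts <- table(err < 0.10)

# Absolute comparison: only errors within 0.10 of the actual price pass.
absolute_counts <- table(abs(err) < 0.10)

print(signed_counts)
print(absolute_counts)
```

Running both versions on the real prediction output would show whether the two models still tie once the sign is ignored.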
Including the blend data does seem to improve the prediction overall, although it is particular to gas prices and wouldn't apply at all to predicting oil prices.
At this point the model seems sufficient for predicting gas prices with reasonable accuracy. I'm not sure how useful this would actually be in forecasting prices. It might be more useful for verifying that the local gas prices you are seeing are consistent with what is to be expected.
There are still some inherent assumptions involved. I use Midwest gas prices, and not all areas might be so predictable. The blend data might not apply to all areas either. I saw somewhere that California always uses a "summer" blend, probably due to climate and pollution concerns in densely populated areas.
It might be nice to have the process include geolocation data. Then it could download the correct price data for the area and maybe apply the right blend schedule as well. This might be a future enhancement.
This might be a good place to consider the data that I am using and where I am getting
it in a little more detail. Next time around maybe.
For now though, in case you would like to verify the results,
here is the training/test data used in Weka.
training_blend.arff
test_blend.arff