Gas Blends


Background

You can do an internet search on winter oil and find more information. I based my data changes on this GasBuddy blog post, which gives a brief description of why there are seasonal blends. The basic idea is that more volatile fuel is allowed in winter than in summer, because it is assumed that vehicles' pollution-control equipment won't be able to handle the higher-volatility fuels in the hotter months. Volatility is measured as Reid Vapor Pressure (RVP).
From the blog we have...

Normal RVP values look like this:
January-March RVP is 13.5+ in many areas
April-September RVP is 7.0-9.0 in many areas
September-December RVP is 11+ in many areas

Gas Price Prediction

If we apply this to our current data, labeling each seasonal range as SUMMER, FALL, or WINTER, we get...

> table(blend$blend)

  FALL SUMMER WINTER
   113    144     64
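
The labeling step can be sketched in R roughly as follows. This is only a minimal sketch: the helper name blend_label is hypothetical, the dates are made up, and because the quoted ranges overlap in September it assumes September counts as SUMMER, which may not match how the counts above were produced.

```r
# Hypothetical helper: map dates to a blend label using the month ranges above.
# Assumption: September is treated as SUMMER, since the quoted ranges overlap there.
blend_label <- function(dates) {
  month <- as.integer(format(as.Date(dates), "%m"))
  ifelse(month <= 3, "WINTER",
         ifelse(month <= 9, "SUMMER", "FALL"))
}

# Illustrative dates only, not rows from the actual dataset
dates <- c("2014-01-15", "2014-06-01", "2014-09-30", "2014-11-20")
labels <- blend_label(dates)
print(table(labels))
```

If September should count as FALL instead, changing the second cutoff from 9 to 8 is the only edit needed.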

Does this do us any good? The first question is whether there actually seems to be a seasonal price difference.
> mean(fall$Price)
[1] 2.90831
> mean(winter$Price)
[1] 3.036031
> mean(summer$Price)
[1] 3.228312
There is a difference of about 32 cents between the highest average price (summer) and the lowest (fall). Somewhat oddly, the cheapest prices are in the fall time frame, which occupies the middle RVP range.
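
The same comparison can be done in one call instead of three separate data frames. This is a sketch on made-up values; it assumes a single data frame with a Price column and a blend column, which is a guess based on the console output above.

```r
# Illustrative data frame only; the column names Price and blend follow the
# console output above, the values are made up.
blend <- data.frame(
  Price = c(3.00, 3.10, 3.20, 3.30, 2.90, 2.80),
  blend = c("WINTER", "WINTER", "SUMMER", "SUMMER", "FALL", "FALL")
)

# Mean price per blend label in one call
season_means <- tapply(blend$Price, blend$blend, mean)
print(season_means)

# Spread between the most and least expensive season, in dollars
spread <- max(season_means) - min(season_means)
print(spread)
```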

Now the question is whether this additional information improves our gas price prediction model. It doesn't seem to help on the training data when using the default Weka RandomForest.

                               Training Without Blend   Training With Blend
Correlation coefficient        0.9646                   0.9628
Mean absolute error            0.0714                   0.0718
Root mean squared error        0.0985                   0.1013
Relative absolute error        23.5823 %                23.7112 %
Root relative squared error    26.5171 %                27.2607 %

The model with blend does slightly worse on every measure than the model without. However, things change when we apply the models to the test data.

                               Test Without Blend   Test With Blend
Correlation coefficient        -0.3842              -0.0778
Mean absolute error            0.2677               0.2233
Root mean squared error        0.3141               0.2764
Relative absolute error        27.2418 %            22.7197 %
Root relative squared error    30.853 %             27.1499 %
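
For reference, the five measures in these tables can be reproduced in R from a vector of actual prices and a vector of predictions. This is a minimal sketch using the textbook definitions. The function name regression_metrics is hypothetical, the numbers are made up, and the baseline used for the two relative errors is an assumption: here it defaults to the mean of the actual values, and Weka may use a different baseline, such as the training mean.

```r
# Sketch of the five measures, using their textbook definitions.
# `baseline` is the constant prediction the relative errors are compared against;
# defaulting to mean(actual) is an assumption, not confirmed Weka behaviour.
regression_metrics <- function(actual, predicted, baseline = mean(actual)) {
  err <- predicted - actual
  base_err <- baseline - actual
  c(
    correlation = cor(actual, predicted),
    mae         = mean(abs(err)),
    rmse        = sqrt(mean(err^2)),
    rae_pct     = 100 * sum(abs(err)) / sum(abs(base_err)),
    rrse_pct    = 100 * sqrt(sum(err^2) / sum(base_err^2))
  )
}

# Illustrative values only, not the actual test set
actual    <- c(2.0, 2.2, 2.4, 2.6)
predicted <- c(2.1, 2.1, 2.5, 2.5)
m <- regression_metrics(actual, predicted)
print(round(m, 4))
```

The relative errors compare the model against always predicting the baseline, so a value under 100 % means the model beats that constant guess.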

On the test data, the model with blend is clearly better on every error measure (although the correlation coefficient is still slightly negative). Outputting the Weka prediction results and checking them in R mostly confirms this.
Consider which model gets closer to actual average price...

> mean(resbal$predicted)
[1] 2.311187
> mean(resblend$predicted)
[1] 2.374188
> mean(test_blend$Price)
[1] 2.388521

The average predicted price with blend is within a penny and a half of the actual average. That is more than 6 cents closer than the model without blend. It doesn't matter which test dataset we take the actual prices from, as the gas price values should be identical.

Comparing maximum errors...

> max(abs(resbal$error))
[1] 0.584
> max(abs(resblend$error))
[1] 0.485

The maximum absolute error with blend is smaller by almost 10 cents (0.485 versus 0.584).

Comparing errors falling within a threshold value...

> table(resbal$error < .10)

FALSE  TRUE
   24    72
> table(resblend$error < .10)

FALSE  TRUE
   24    72

Both models give the same result: 72 of the 96 test predictions (75%) have an error under 10 cents. This is not an improvement, but the blend model is no worse.
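
One thing to watch with this check: the threshold above is applied to the error as-is. If the error column is signed (the abs() in the maximum comparison suggests it is), a large negative error also counts as TRUE. Below is a minimal sketch of the difference on made-up values; the vector name error and the numbers are only for illustration.

```r
# Illustrative signed errors (predicted - actual), in dollars; not real results
error <- c(-0.45, -0.08, 0.02, 0.09, 0.30)

# Signed comparison: the large negative error counts as "under 10 cents"
signed_counts <- table(error < 0.10)

# Absolute comparison: only errors within 10 cents in either direction count
abs_counts <- table(abs(error) < 0.10)

print(signed_counts)
print(abs_counts)

# Share of predictions within the threshold
share_within <- mean(abs(error) < 0.10)
print(share_within)
```

Using abs() here would keep the threshold check consistent with the maximum error comparison.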

Including the blend data does seem to improve the prediction overall. However, it is particular to gas prices and wouldn't apply at all to predicting oil prices.

Conclusion

At this point the model seems sufficient for predicting gas prices with reasonable accuracy. I'm not sure how useful this would actually be in forecasting prices. It might be more useful in verifying that the local gas prices you are seeing are consistent with what is to be expected.

There are still some inherent assumptions involved. I use Midwest gas prices, and not all areas might be so predictable. The blend data might not apply to all areas either. I saw somewhere that California always uses a "summer" blend, probably due to climate and pollution concerns in densely populated areas.

It might be nice to have the process include geolocation data. Then it could download the correct price data and maybe do the right thing as far as blending goes. This might be a future enhancement.

Data

This might be a good place to look in a little more detail at the data I am using and where I am getting it. Next time around, maybe. For now, in case you would like to verify the results, here are the training and test data used in Weka:
training_blend.arff
test_blend.arff