Improving fuel price predictions with open data

Raphael Volz
WebST 2016
Agenda
• Approach: Linking Open Data
• Case Study: Fuel Price Predictions
• Conclusion
Oct 2015
Prof. Dr. Raphael Volz
Linking Data requires Data Integration
• ID in Wikidata
• Matching by name does not work
• ID in other data sets
Source: https://tools.wmflabs.org/reasonator/?&q=1339
Example: Linking with Wikidata
Which movies feature music by Johann Sebastian Bach?
Q1339 (Wikidata) = nm0001925 (IMDb)
Equivalence at the core of linked data
Example: symmetry (Johann Sebastian Bach)
Q1339 (Wikidata) = nm0001925 (IMDb)
nm0001925 (IMDb) = Q1339 (Wikidata)
• Equivalence can be derived if predicates are (inverse) functional
Source: Raphael Volz, Web Ontology Reasoning with Logic Databases, Dissertation University of Karlsruhe, 2004, p. 24 and p. 110
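The rule about inverse functional predicates can be illustrated with a small sketch. It assumes that an IMDb identifier behaves as an inverse functional predicate (one value points to exactly one entity) and that a name does not. The predicate names `imdb_id` and `name`, the function `derive_equivalences`, and the subject `ex:other_bach` are hypothetical and chosen only for illustration; this is not the reasoning procedure from the cited dissertation.

```python
from collections import defaultdict


def derive_equivalences(triples, inverse_functional):
    """Group subjects that share a value of an inverse functional predicate."""
    parent = {}

    def find(x):
        # Union-find lookup with path halving.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    def union(a, b):
        parent[find(a)] = find(b)

    first_subject = {}  # (predicate, value) -> first subject seen with it
    for subject, predicate, value in triples:
        find(subject)  # register the subject
        if predicate in inverse_functional:
            key = (predicate, value)
            if key in first_subject:
                union(subject, first_subject[key])
            else:
                first_subject[key] = subject

    classes = defaultdict(set)
    for x in parent:
        classes[find(x)].add(x)
    # Only classes with more than one member express a derived equivalence.
    return [members for members in classes.values() if len(members) > 1]


# Illustrative triples; "ex:other_bach" is a made-up subject.
TRIPLES = [
    ("wikidata:Q1339", "imdb_id", "nm0001925"),
    ("imdb:nm0001925", "imdb_id", "nm0001925"),
    ("wikidata:Q1339", "name", "Johann Sebastian Bach"),
    ("ex:other_bach", "name", "Johann Sebastian Bach"),
]

classes = derive_equivalences(TRIPLES, inverse_functional={"imdb_id"})
```

Because the classes are sets, symmetry comes for free: membership does not depend on which identifier is written first. The shared name alone does not merge `ex:other_bach` into the class, since `name` is not declared inverse functional.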
Linking Geographical Data
Matching via names
Matching coordinates
• Easier but not trivial
– Translation of coordinate formats
   WGS84 (degrees, minutes, seconds): 48° 50′ 14″ N, 10° 5′ 37″ E
   WGS84 (decimal degrees): 48.837222, 10.093611
   UTM: 32U 580249 5409937
– Polygon containment (Wikidata coordinate inside OpenStreetMap polygon)
– Precision to use for matching?
Source: Raphael Volz, Joachim Kleb, and Wolfgang Mueller.
"Towards Ontology-based Disambiguation of Geographical
Identifiers." I3 Workshop at WWW2007, 2007.
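Two of the steps named on this slide, translating coordinate formats and testing polygon containment, can be sketched in a few lines. The sketch covers only the conversion from degrees, minutes, seconds to decimal degrees (UTM needs a map projection and is left out) and a standard ray-casting test that treats latitude and longitude as planar, which is an approximation suited to small polygons. The function names and the rectangle `HYPOTHETICAL_POLYGON` are assumptions made for illustration; a real OpenStreetMap boundary would have many more vertices.

```python
def dms_to_decimal(degrees, minutes, seconds, hemisphere):
    """Convert degrees, minutes, seconds to signed decimal degrees."""
    value = degrees + minutes / 60 + seconds / 3600
    return -value if hemisphere in ("S", "W") else value


def point_in_polygon(lat, lon, polygon):
    """Ray casting; polygon is a list of (lat, lon) vertices in order."""
    inside = False
    n = len(polygon)
    for i in range(n):
        lat1, lon1 = polygon[i]
        lat2, lon2 = polygon[(i + 1) % n]
        # Does this edge cross the horizontal line through the point?
        if (lat1 > lat) != (lat2 > lat):
            lon_cross = lon1 + (lat - lat1) * (lon2 - lon1) / (lat2 - lat1)
            if lon < lon_cross:
                inside = not inside
    return inside


# Coordinate from the slide: 48° 50′ 14″ N, 10° 5′ 37″ E
LAT = dms_to_decimal(48, 50, 14, "N")
LON = dms_to_decimal(10, 5, 37, "E")

# Made-up rectangle standing in for an OpenStreetMap polygon.
HYPOTHETICAL_POLYGON = [
    (48.80, 10.05),
    (48.80, 10.15),
    (48.90, 10.15),
    (48.90, 10.05),
]

contained = point_in_polygon(LAT, LON, HYPOTHETICAL_POLYGON)
```

Rounded to six decimal places, `LAT` and `LON` reproduce the decimal-degree values shown on the slide.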
Decimal  Decimal     Qualitative                                N/S or E/W
places   degrees                                                at equator
0        1.0         country or large region                    111.32 km
1        0.1         large city or district                     11.132 km
2        0.01        town or village                            1.1132 km
3        0.001       neighborhood, street                       111.32 m
4        0.0001      individual street, land parcel             11.132 m
5        0.00001     individual trees                           1.1132 m
6        0.000001    individual humans                          11.132 cm
7        0.0000001   practical limit of commercial surveying    1.1132 cm
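The distances in the precision table follow from scaling the length of one degree at the equator (111.32 km) by the number of decimal places. The sketch below reproduces that arithmetic and adds a simple tolerance-based match. The cosine correction for east/west distances away from the equator is a standard approximation and is not taken from the slide; the function names and the second coordinate pair are assumptions for illustration.

```python
import math

KM_PER_DEGREE_AT_EQUATOR = 111.32  # value used in the precision table


def resolution_m(decimal_places, latitude_deg=0.0):
    """Ground distance in metres covered by one unit of the last decimal place.

    Returns (north_south, east_west); east/west shrinks with cos(latitude).
    """
    north_south = KM_PER_DEGREE_AT_EQUATOR * 1000 * 10 ** -decimal_places
    east_west = north_south * math.cos(math.radians(latitude_deg))
    return north_south, east_west


def same_place(a, b, decimal_places):
    """True if both coordinates differ by less than 10^-decimal_places degrees."""
    tolerance = 10 ** -decimal_places
    return all(abs(x - y) < tolerance for x, y in zip(a, b))


wikidata_point = (48.837222, 10.093611)  # coordinate from the slide
other_point = (48.8375, 10.0934)         # made-up coordinate from another source

match_at_3 = same_place(wikidata_point, other_point, 3)
match_at_4 = same_place(wikidata_point, other_point, 4)
```

With these two points the match succeeds at three decimal places (neighborhood level) and fails at four, which shows why the precision has to be chosen to fit the kind of object being linked.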
Linking Data as new step in analytics pipeline
Analytics pipeline and required competencies
1. Data identification
2. Data curation
3. Statistical analysis
4. Model creation
5. Model assessment
6. Model selection
7. Model use
Linking Data as a new core step – Data Science Rocket
1. Data identification (several in parallel)
2. Data curation (several in parallel)
3. Integration
4. Statistical analysis
5. Model creation
6. Model assessment
7. Model selection
8. Model use
Core competencies: subject matter expertise, computer science, mathematics and statistics
Source: Raphael Volz, Collaborative Business / Business Intelligence, Slides of Lecture 1, HS Pforzheim, summer term 15
Agenda
• Open Data
• Linked Data
• Case Study: Fuel Price Predictions
• Conclusion
Open fuel price data in Germany
Fuel prices in Germany
• Since Sep 2013, companies operating a public fuel station must report prices to the German antitrust agency (Bundeskartellamt) in real time
• Objective:
– Increase price transparency
– “Improve the Bundeskartellamt’s possibilities to intervene in the case of illegal predatory strategies and other forms of market power abuse”(1)
Data Set Characteristics(2)
• 3 fuel types (E5, E10, Diesel)
• 14,957 fuel stations
• 30,231,752 price changes in one year (Jul 2014 to Jun 2015)
–  82,827 price changes per day
–  5.6 price changes per station and day
[Figure: snapshots for 30 Jun 2015 at 9 am and 5 pm]
• Open Data published at MDM portal
• Data basis for fuel-price apps
Source: (1) http://www.bundeskartellamt.de/EN/Economicsectors/MineralOil/MTU-Fuels/mtufuels_node.html
(2) Figure and statistics: own analysis of MTS-K data
A similar price pattern repeats every day
[Figure: daily price pattern; x-axis: day of year 2015]
Data: Diesel sales price at OMV station Bad Herrenalb via MTS-K,
Rotterdam market price of Brent North Sea crude oil in euro,
interbank USD/EUR daily closing price
Own analysis of MTS-K data for OMV Bad Herrenalb
(Jul 2014 to Jun 2015)
Despite the regularity of the price pattern, we need (open)
market data to make robust predictions for a station
Linear regression model of the gross sales price of 1 l diesel

$\hat{y} = \beta_{hour} \cdot h + \beta_{day} \cdot w + \beta_{oil} \cdot o + c$

h: hour of day, w: day of week, o: raw oil price (Brent) in euro at the EUR/USD exchange rate, c: intercept

Coefficient      Level    Estimate
c                         0.96
β_hour           1 am     -0.00
                 …
                 6 am     -0.01
                 …
                 noon     -0.09
                 …
                 6 pm     -0.12
                 …
                 9 pm     -0.00
                 …
β_day            Mo       0.00
                 …
                 We       0.00
                 Th       0.01
                 Fr       0.00
                 Sa       0.00
β_oil                     1.01

Note: Coefficients of the factor variables (hour, day) can be read as € savings
> R2_train 0.8185748 > R2_test 0.8178038 > RMSE 0.03838322
Data: Diesel sales price at OMV station Bad Herrenalb via MTS-K,
Crude Oil (petroleum), Dated Brent, light blend 38 API, fob U.K. in euro
at interbank EUR/USD closing price, Jul 2014 – Jun 2015
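A model of this shape, with two factor variables and one numeric variable, can be fitted with ordinary least squares after dummy coding the factors. The sketch below is a plain-Python illustration and not the code behind the slide. The data are synthetic and noise-free, the generating values (`TRUE_C`, `TRUE_OIL`, `HOUR_EFFECT`, `DAY_EFFECT`) are made up and are not the estimates reported above, and all function names are hypothetical.

```python
def one_hot(value, levels):
    """Dummy-code a factor; the first level is the baseline (all zeros)."""
    return [1.0 if value == level else 0.0 for level in levels[1:]]


def solve(a, b):
    """Solve the linear system a x = b by Gauss-Jordan elimination."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        # Partial pivoting keeps the elimination numerically stable.
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [x - factor * y for x, y in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def fit_ols(rows, y):
    """Ordinary least squares via the normal equations (X^T X) b = X^T y."""
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * t for r, t in zip(rows, y)) for i in range(k)]
    return solve(xtx, xty)


HOURS = ["1 am", "6 am", "noon", "6 pm"]  # "1 am" is the baseline
DAYS = ["Mo", "Tu", "We"]                 # "Mo" is the baseline


def design_row(hour, day, oil):
    """Intercept, hour dummies, day dummies, oil price."""
    return [1.0] + one_hot(hour, HOURS) + one_hot(day, DAYS) + [oil]


# Synthetic, made-up generating values (NOT the estimates from the slide).
TRUE_C, TRUE_OIL = 1.00, 0.80
HOUR_EFFECT = {"1 am": 0.0, "6 am": -0.02, "noon": -0.05, "6 pm": -0.10}
DAY_EFFECT = {"Mo": 0.0, "Tu": 0.01, "We": 0.02}

X, y = [], []
i = 0
for week in range(4):
    for day in DAYS:
        for hour in HOURS:
            oil = 0.25 + 0.01 * ((7 * i) % 11)  # synthetic oil price per litre
            X.append(design_row(hour, day, oil))
            y.append(TRUE_C + HOUR_EFFECT[hour] + DAY_EFFECT[day] + TRUE_OIL * oil)
            i += 1

beta = fit_ols(X, y)
names = ["c"] + HOURS[1:] + DAYS[1:] + ["oil"]
estimates = dict(zip(names, beta))
```

Each factor coefficient is the price difference relative to the baseline level, which is why a negative hour coefficient can be read as a saving in euro compared with the baseline hour. On real, noisy data a statistics package would be the more practical choice; the normal equations are used here only to keep the example self-contained.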
Leveraging open data we can better understand
competitive dynamics and improve predictions
• OpenStreetMap: nearby competitors, nearby cities
• Wikidata: population of cities
• Mobilitätsdatenmarktplatz: operator, brand
Brand: Real -2.1%, Jet -1.9%, Shell 1.3%, Aral 1.4%
Linked Open Data can improve prediction models
and provides interesting data sets for teaching and research
Conclusion
• At minimum, we have learned today when and where to
get fuel for the lowest price
• We can obtain novel insights from open data
• Linking data sets has allowed us to improve prediction
quality, which strengthens automated decision
making and analytics
Outlook
• Can showcase “interesting” non-confidential case studies to
students and potential research partners
• Many interesting new research questions arise from
leveraging linked open data for analytics, predictive
systems and building intelligent systems around data
Average Quality of Linear
Regression Models per Fuel Station
Quality of Single Model for all stations
linked by open data
Model Type          R2 Train   R2 Test   RMSE
Deep Learning       0.8081     0.3156    0.052
Random Forest       0.8938     0.3305    0.052
Linear Regression   0.8125     0.2818    0.054
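The quality measures in this table can be computed with their textbook definitions, sketched below. Evaluating R2 separately on training and held-out test data, as the table does, shows how well a model generalizes. The function names and the three-point example are illustrative assumptions; the slides do not show how the figures were computed.

```python
import math


def r_squared(actual, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean) ** 2 for a in actual)
    return 1.0 - ss_res / ss_tot


def rmse(actual, predicted):
    """Root mean squared error, in the unit of the target (here: euro)."""
    squared = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    return math.sqrt(squared / len(actual))


# Tiny made-up example, not taken from the fuel price data.
actual = [1.0, 2.0, 3.0]
predicted = [1.0, 2.0, 4.0]

example_r2 = r_squared(actual, predicted)
example_rmse = rmse(actual, predicted)
```

Here the residual sum of squares is 1 and the total sum of squares is 2, so R2 is 0.5, and the RMSE is the square root of 1/3.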
Scaled variable importance
in the Random Forest model for all stations