
The Big Data for Official Statistics Competition – results
and lessons learned
Bogomil Kovachev ([email protected])¹, Martin Karlberg², Boro Nikic³, Bogdan Oancea⁴
and Paolo Righi⁵
Keywords: official statistics, big data, nowcasting, forecasting
1. INTRODUCTION
The Big Data for Official Statistics Competition (BDCOMP) was the first official
statistics nowcasting competition at EU level with a big data focus. It was carried out
under the framework of the European Statistical System (ESS⁶) Vision 2020 Project
BIGD⁷ and was organised by Eurostat in collaboration with the ESS Big Data Task
Force. The BDCOMP scientific committee was composed of colleagues from various
member and observer organisations of the ESS.
The term nowcasting is used to signify the forecasting of a statistical indicator with
extremely tight timeliness – sometimes even before the reference period is over. The
need for good nowcasting techniques is currently increasing given the constant demand
for releasing statistics faster.
The main purpose of BDCOMP was to evaluate different methodologies with respect to
their applicability to the nowcasting process. The task for participants in the competition
was to nowcast official statistics for EU Member States. The accuracy of these nowcasts
is the measure of success for each nowcasting approach.
The call for participation⁸, which was circulated to various academic institutions and
international organisations, contains many details of the competition that are beyond the
scope of this abstract. The competition was additionally publicised on the Eurostat
website and was also promoted at various events on related topics in which Eurostat
participated.
After a relatively high number of expressions of interest, there were finally five
participants⁹ who made it through the whole competition process:
P1 – Prof. George Djolov University of Stellenbosch and Statistics South Africa
P2 – Team ETLAnow: Research Institute of the Finnish Economy
¹ European Court of Auditors
² Eurostat
³ The Statistical Office of the Republic of Slovenia
⁴ The National Institute of Statistics of Romania
⁵ Istat
⁶ http://ec.europa.eu/eurostat/web/european-statistical-system
⁷ http://ec.europa.eu/eurostat/web/ess/about-us/ess-vision-2020/implementation-portfolio#BIGD
⁸ https://ec.europa.eu/eurostat/cros/content/call-participation
⁹ More details on the participants are available at https://ec.europa.eu/eurostat/cros/content/bdcomp-participants-and-methodological-information
P3 – JRC Team: Joint Research Centre of the European Commission
P4 – Dr. Roland Weigand
P5 – University of Warwick Forecast Team
2. METHODS
2.1. Design
The competition is designed to provide the opportunity for an out-of-sample objective
evaluation of nowcasting methods for official statistics. This is achieved by requiring
participants to submit a forecast (i.e. nowcast) before the official number is out.
The indicator set (described below) is the basis for dividing the competition into tracks
(one track corresponds to one indicator) and then further into tasks – for each indicator
there is a separate series for every EU Member State (and two more for the EU and euro
area aggregates). The combination of an indicator and an EU Member State is called a
task (e.g. “HICP for Luxembourg” is a task).
The competition is divided into rounds – one round per month for the year 2016. Since
different official statistics have different release dates, each round has two submission
deadlines.
2.2. Indicators used
The following indicators were part of the competition:
Track 1: Unemployment in levels
Track 2: Harmonised Index of Consumer Prices (HICP) all items
Track 3: HICP excluding energy
Track 4: Tourism – nights spent at tourist accommodation establishments
Track 5: Tourism – nights spent at hotels
Track 6: Volume of retail trade
Track 7: Volume of retail trade excluding automotive fuel
These seven indicators were chosen to fit the purposes of the competition. They can
generally be considered economically important, so there is sufficient interest in them.
It is also important for BDCOMP that they are monthly indicators: this ensures
that within a year there is a sufficient number of data points for a reasonable evaluation.
2.3. Usage of big data and reproducible submissions
Despite the focus on big data, we allowed competitors to use more traditional
techniques. This was done in order to enlarge participation and to provide a benchmark.
The main goal of the competition being the evaluation of methods, we encouraged
the participants to make open submissions. These are the submissions from which official
statistics and the scientific community at large benefit the most. Unfortunately, however, it
is not always possible (or desirable) to disclose the data used for the competition, and in
order to make a submission fully reproducible it is necessary to supply both the methods
(source code) used and the data. As a result, not all submissions complied with
this request, even if most did.
2.4. List of evaluation measures
The evaluation measures for BDCOMP were designed to provide an objective
comparison of methods. As no single measure can capture all desirable properties of
estimates several measures were used in the competition. The following are the official
evaluation criteria for BDCOMP:
Criterion 1: Relative mean squared error
This is perhaps the most obvious measure to include in a competition of this kind.
Intuitively it represents how far the submitted forecasts were from the actual outturn. The
formula used is:
$$\mathrm{RMSE} = \frac{1}{N}\sum_{i=1}^{N}\left(\frac{F_i - R_i}{R_i}\right)^2$$
where $F_i$ is the forecast for the $i$-th period and $R_i$ is the official release for the same
period (the benchmark).
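As an illustration, the relative mean squared error of a series of forecasts against the first official releases can be computed as follows (a minimal sketch in Python; the function name and input layout are our own):

```python
def relative_mse(forecasts, releases):
    """Mean of squared relative errors of forecasts F_i against the
    first official releases R_i (the benchmark)."""
    assert len(forecasts) == len(releases)
    return sum(((f - r) / r) ** 2
               for f, r in zip(forecasts, releases)) / len(forecasts)
```

For example, forecasts of 110 and 95 against releases of 100 and 100 give (0.1² + 0.05²)/2 = 0.00625.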
Criterion 2: Directional accuracy
The measure is included in order to provide another look at the performance of a method
– namely whether it correctly predicts the direction of change in the indicator. In order to
be able to draw meaningful conclusions this measure is only applied to seasonally
adjusted data or data that is officially not considered to have significant seasonality.
Based on our results it seems that using directional accuracy as a sole discriminatory
measure could lead to many ties due to the limited size of the set of possible values.
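A sketch of how directional accuracy might be computed (Python; the treatment of ties is our own illustrative assumption, since the exact rule is defined in the call for participation):

```python
def directional_accuracy(forecasts, releases, previous):
    """Share of periods in which the forecast change from the previous
    official value has the same sign as the released change.
    Counting a no-change tie on both sides as a hit is an assumption
    made for this sketch."""
    hits = 0
    for f, r, p in zip(forecasts, releases, previous):
        if (f - p) * (r - p) > 0 or (f == p and r == p):
            hits += 1
    return hits / len(forecasts)
```

Because the outcome per period is binary, the score over a year can take only a handful of values, which is why ties between methods are common.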
Criterion 3: Density estimate
The aim of including this measure was to allow participants to express a level of
confidence in the prediction. The measure is designed to allow comparison between
predictions of indicators that have different magnitude (e.g. unemployment levels in a big
country and a small country). The formula used is:
$$L = \left(\prod_{i=1}^{N} P_i\right)^{1/N}$$
where $P_i$ is an appropriately modified likelihood of the official release for period $i$ under
the submitted distribution for the same period¹⁰.
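The geometric mean of likelihoods is conveniently computed in log space to avoid underflow for long series (a minimal sketch; the function name is our own):

```python
import math

def density_score(likelihoods):
    """Criterion 3: geometric mean of the (appropriately modified)
    likelihoods P_i of the official releases under the submitted
    distributions, computed via logs for numerical stability."""
    log_sum = sum(math.log(p) for p in likelihoods)
    return math.exp(log_sum / len(likelihoods))
```

Because the score is a geometric mean, it is scale-free in the number of periods, which allows comparison across indicators of different magnitude.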
2.5. Organisational challenges during the competition
Before starting with the organisation, we had already anticipated some of the challenges
that lay ahead; naturally, we were not prepared for everything. We list here most of what
we deem important to bear in mind for anyone who might be preparing to run a similar
event.
¹⁰ More details on the definition of the metrics can be found in the BDCOMP call for participation: https://ec.europa.eu/eurostat/cros/content/call-participation
• We did not make registration an official requirement, due to the relatively short
interval between the announcement of the competition and the first submission
deadline. This meant that we had no established communication
channel with the participants before the start of BDCOMP. Consequently, any
updates and clarifications had to be made during the competition.
• Strict deadlines need to be maintained, which calls for a reliable system for collecting
participants' forecasts. We required participants to use a special submission
template for ease of automatic processing. This was complied with quite consistently,
which allowed us to use scripts for processing. We used email, which worked
reasonably well at the scale of participation that we had. A dedicated submission system
would have offered some further advantages with respect to machine readability, but
would have caused a major issue in case of a system failure shortly before a deadline.
• It turned out that for two of our indicators – HICP and HICP excluding energy – a
change of reference year was planned for 2016. We discovered this shortly after
launching the competition. For the headline figure (HICP), Eurostat continued to
produce numbers with the old reference year, so the official target remained
unchanged. For HICP excluding energy, however, this was not the case, and the
benchmark (and in one case the participants' submissions) needed to be re-referenced.
This is not an ideal situation, since the re-referencing is done with officially
published numbers only, and some precision is lost.
• We set the rule that for any indicator it is the first official release that counts as the
benchmark. Eurostat's dissemination chain is not adapted to such a requirement:
because published numbers are subsequently revised, it is not easy to recover later
what the initially released figure was. We had to operate an automated daily download
and in the end produced the benchmark series from it. During this process we
discovered that the dissemination chain is not completely static, and adjustments
had to be made frequently.
• Initially we did expect rounding issues to be a problem in certain areas¹¹. What we
did not foresee, however, was that for the density estimate measure our precision
requirement would turn out to be inadequate. For example, HICP is reported by some
countries with a precision of one decimal place; consequently, we requested
submissions with the same precision. For density estimation, though, this turns out
to be inadequate: suppose that one considers¹² that a reasonable estimate for a
particular figure is 100.05 with a standard deviation of 0.05 – in effect giving equal
probability to 100.0 and 100.1. Under our original rules one would be forced to
choose between 100.0 and 100.1, which in effect makes density estimation useless.
We had to update the rules during the competition and remove the first month from
the evaluation.
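The first-release benchmark described above can be maintained with a simple first-seen rule. The sketch below (Python, with a hypothetical data layout of our own) keeps only the value from the first daily snapshot in which each figure appeared:

```python
def first_release_series(daily_snapshots):
    """Given daily download snapshots in chronological order, each a dict
    mapping (task, reference_period) -> published value, keep only the
    first value ever seen for each key, ignoring later revisions."""
    first = {}
    for snapshot in daily_snapshots:
        for key, value in snapshot.items():
            first.setdefault(key, value)  # later revisions do not overwrite
    return first
```

Applied to the daily downloads, this yields the benchmark series even after the dissemination database has been revised.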
¹¹ E.g. for unemployment, numbers are reported in thousands of people, which causes the series for small countries to be very stable and thus not suitable for such a competition.
¹² Or rather, one's estimation procedure gives such a result.
3. RESULTS¹³
3.1. Track 1 – Unemployment
Only P2 proposes a forecasting approach using a big data source (Google Trends). The
team uses a seasonal AR(1) model with the exogenous contemporaneous value of the
Google Index as a further covariate. The model is closely related to the work shown in the
seminal paper of Choi and Varian [1].
P1 introduces the robust nowcasting algorithm (RNA). The emphasis of the RNA is on
robustness, i.e. flexibility in the face of imperfect data conditions, to accommodate its
possible use across different time series. At BDCOMP the RNA is applied to
nowcasting the monthly Irish unemployment levels in nominal (i.e. raw) and
seasonally adjusted terms.
Here and in all other tracks, P4 applies a series of univariate benchmark methods, which
were applied automatically to the time series of interest. They were taken from the open-source
R package "forecast" [2] and based on its ets (Error-Trend-Seasonal, or
ExponenTial Smoothing) and auto.arima functions. Model selection in both
cases is done using the information criteria AIC and BIC. auto.arima chooses the best
ARIMA model, while ets ([3] and [4]) represents a state space framework for automatic
forecasting with exponential smoothing techniques, estimating the initial states
and the smoothing parameters by optimising the likelihood function. The ets function
was applied with the default parameters, which means an automatically selected error
type, trend type and season type. Both functions were applied to all EU countries and
separately to three different subsets of countries. A fifth prediction model was defined as
a simple average of the predictions given by the first four models (auto.arima and ets,
with models chosen by AIC and BIC).
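The fifth benchmark model, an equally weighted combination of the four univariate forecasts, amounts to the following (a trivial Python sketch; the function name is our own):

```python
def average_forecast(predictions):
    """Fifth benchmark model: equally weighted average of the point
    forecasts produced by the four univariate models (auto.arima and
    ets, each selected by AIC and BIC)."""
    return sum(predictions) / len(predictions)
```

Simple forecast averaging of this kind is a standard, hard-to-beat baseline in forecasting practice.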
Table 1. Track 1 results
Legend: BM = benchmark, BD = a big data approach, RNA = robust nowcasting
* - BD was more than 2σ ahead
** - only track where the RNA was fielded
† - only benchmark approaches
¹³ The full set of results is available at https://ec.europa.eu/eurostat/cros/content/bdcomp-results
We observe univariate (benchmark) approaches performing quite well on the point
estimate accuracy measure; models were retrained automatically (not changed) every
month. The RNA was fielded for only one task (IE) and performed best with respect to
directional accuracy.
The big data method of P2 participated only in point estimate accuracy; it was the
best performing method for one of the tasks – BG.
3.2. Tracks 2 and 3: HICP and HICP excluding energy
P3 developed 20 methods for HICP prediction based on classical ARIMA modelling as
well as on more elaborate techniques, including econometric modelling using leading
economic indicators, random forest models with data from Eurostat and the Billion
Prices Project, and Xgboost models, and applied them to eight EU Member States. For a
description of the Xgboost (Extreme Gradient Boosting) algorithm the interested reader
may consult [5], while the Billion Prices Project is described in [6]. The experiments
carried out by P3 showed that ARIMA models gave the best results for HICP prediction.
The best results for most countries were obtained with two models that are
essentially weighted averages of five ARIMA models estimated on time series of
different lengths.
P5 used three basic models, along with combinations of them using equal and log score
weights, and applied them to euro area countries and to FR, DE, IT and UK. Many
exogenous variables were used, rendering the approaches of P5
"almost big data". The three basic models were the Bayesian vector autoregressive
model (BVAR) described in [7], a conditional unobserved component stochastic volatility
model (UC-SV) described in [8] and [9], and a univariate autoregressive model AR(p)
with the lag order chosen by BIC. For the first two models, different
economic variables were used as regressors for each country.
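Combining models with weights derived from past log scores can be sketched generically as follows (Python; this is an illustrative softmax-style weighting scheme, not necessarily P5's exact rule):

```python
import math

def log_score_weights(avg_log_scores):
    """Illustrative combination weights: each model's weight is
    proportional to the exponential of its average past log predictive
    score, so better-scoring models get larger weights and the weights
    sum to one."""
    exps = [math.exp(s) for s in avg_log_scores]
    total = sum(exps)
    return [e / total for e in exps]
```

With equal past scores this reduces to the equal-weight combination, so the two schemes P5 used coincide in that limiting case.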
Table 2. Track 2 results
Legend: BM = benchmark, BD = a big data approach, MV = multivariate approach
FF – photo finish: second best within around σ/10
† - only benchmark approaches
For Track 2 the big data approach P3Approach12 performed well for several countries –
UK, NL, IE, FR (point estimate accuracy). A model with many exogenous variables
("almost" big data) performed best for the EA (point estimate and density estimate
accuracy) and FR (density estimate accuracy). The P5 approach containing the oil price
as an exogenous variable performed particularly well only for IT.
For Track 3 there was an unforeseen re-referencing, announced after the launch
of the competition. Consequently, the submissions for the first month had to be discarded
for evaluation purposes. Big economies (EA, EU, DE, FR, IT) seem to be easier to
forecast than in the case of the HICP headline aggregate (Track 2). Most
models with exogenous data outperformed the benchmark models for DE, EA, FR and
UK; only for IT was the benchmark better (point estimate accuracy). For directional
accuracy the picture is often reversed, suggesting the complementarity of the two
measures.
3.3. Tracks 4 and 5: Tourism – nights spent at tourist accommodation and at hotels
A characteristic feature of these tracks is that the first official estimate (which is used as
the benchmark in all BDCOMP tasks) is not as stable – data are revised often for some
countries. Moreover, IE, EL, LU and the UK were not part of this track due to data
quality issues.
P3 used Eurostat data for modelling the sub-aggregates total nights spent (B06), total
nights spent by residents (B04) and total nights spent by non-residents (B05). The
SABRE database, with information about the number of flights booked for future months,
was also used. Forecasting was done via ARIMA, random-forest-based regression and
Xgboost regression. Some of the ARIMA models used B06, B05 or B04 data, and some
used a combination of B04 and B05 data.
Table 3. Track 4 results
Legend: BM = benchmark, BD = a big data approach
FF – photo finish: second best within around σ/10
† - only benchmark approaches
Big data approaches were used for all countries; they were the best approach in 8 cases
for Track 4 and in 3 cases for Track 5. A lot of variability is observable in the results:
for HR, σ is around 50% and some methods are off by more than 100% on average. It
also seems that for some countries March, the month of Easter in 2016, proved slightly
harder to forecast.
3.4. Tracks 6 and 7 – Volume of retail trade including and excluding automotive fuel
These could be regarded as the most unstable of all tracks. Data are often revised, and
the revisions are sometimes big.
For these tracks only benchmark approaches (provided by P4) were fielded. An
interesting observation is that for many countries the retail trade indicator excluding
automotive fuel (Track 7) seems significantly harder to predict than the indicator
including automotive fuel (Track 6). For example, the best performing approach for SI
has an RRMSE of 3.1% for Track 6 against 8.2% for Track 7; for MT the scores are
1.3% for Track 6 vs 5.5% for Track 7.
4. CONCLUSIONS
Since the use of big data for official statistics is still a relatively new development, it
was expected that participation would be considerably increased if the rules of the
competition allowed traditional forecasting approaches to take part. This assumption
seems to have been justified by the quite low proportion of approaches actually using
big data. It can further be observed that big data methods did not outperform traditional
methods. This can of course be explained by the fact that macroeconomic forecasting is
an established discipline with a long tradition, while the introduction of big data into it
is still an ongoing process.
Many technical challenges were identified in advance, allowing us to be adequately
prepared; however, many remained, as explained in section 2.5. None of them proved
insurmountable, though, so the competition could be carried through to the end and the
main objectives were achieved.
Concerning the practical aspects of organising a competition of this kind several remarks
are due here:
• Attracting participants dedicated enough to commit to twelve months of
submissions with strict deadlines is a major challenge. In this respect, the five
participants that BDCOMP managed to retain seem a reasonable number.
• As one can observe in the results, each measure considers the results from a
different angle, which speaks in favour of maintaining the variety in the
evaluation. The fact that there is no single final winner perhaps diminishes the
competitive aspect somewhat, but this seems inevitable for the reasons described
above.
• From a scientific viewpoint it would have been much better to perform the evaluation
against data of a more mature vintage than the first release. However, this would have
implied a big timing gap between the end of the competition and the evaluation.
Alternatively, the evaluation could have been made against the latest available release
for each month. One approach could be to perform a study on revisions of the
indicators one plans to include and to select appropriate vintages per indicator.
• It is our belief that the official scoring and the analysis presented here are only part
of the insight that can be gained from this event. Since the data and many of the
approaches are openly available, they are easy to analyse further. As mentioned
above, one obvious extension would be to wait until final releases are published for
all indicators for the whole of 2016 – a better ground truth – and redo the scoring.
Clustering the approaches according to other criteria and evaluating the performance
of the clusters is another possible extension.
REFERENCES
[1] Choi, H. and Varian, H. R., Predicting the Present with Google Trends, Economic Record, 88(s1), (2012), 2–9.
[2] Hyndman, R.J. and Khandakar, Y., Automatic time series forecasting: The forecast package for R, Journal of Statistical Software, 26(3), (2008).
[3] Hyndman, R.J., Koehler, A.B., Snyder, R.D. and Grose, S., A state space framework for automatic forecasting using exponential smoothing methods, International Journal of Forecasting, 18(3), (2002), 439–454.
[4] Hyndman, R.J., Akram, Md. and Archibald, B., The admissible parameter space for exponential smoothing models, Annals of Statistical Mathematics, 60(2), (2008), 407–426.
[5] Chen, T. and He, T., Higgs Boson Discovery with Boosted Trees, JMLR: Workshop and Conference Proceedings, 42, (2015), 69–80.
[6] Cavallo, A. and Rigobon, R., The Billion Prices Project: Using Online Prices for Measurement and Research, Journal of Economic Perspectives, 30(2), (2016), 151–178.
[7] Carriero, A., Galvao, A.B. and Kapetanios, G., A comprehensive evaluation of macroeconomic forecasting methods, Working Paper, (2015).
[8] Stock, J.H. and Watson, M.W., Why Has U.S. Inflation Become Harder to Forecast?, Journal of Money, Credit and Banking, 39(1), (2007), 3–33.
[9] Stock, J.H. and Watson, M.W., Modeling Inflation after the Crisis, Federal Reserve Bank of Kansas City, Proceedings – Economic Policy Symposium – Jackson Hole, (2010), 173–220.