Scaling relations in constructed CDR

Mining Mobile Phone Data
to Recognize Urban Areas
Maarten Vanhoof
Orange Labs France
Open Lab Newcastle University
[email protected]
@Metti Hoof
www.MaartenVanhoof.com
Together with
Stéphanie Combes (INSEE)
Marie-Pierre de Bellefon (INSEE)
Thomas Plötz (Open Lab)
Orange Labs-INSEE-Eurostat project
Stéphanie Combes (INSEE)
Benjamin Sakarovitch (INSEE)
Marie-Pierre de Bellefon (INSEE)
Pauline Givord (INSEE)
Vincent Loonis (INSEE)
Pierre-Nicolas Morin (ENSAE)
Fernando Reis (Eurostat)
Michail Skaliotis (Eurostat)
Zbigniew Smoreda (Orange Labs)
Thomas Plötz (Open Lab)
Use of mobile phone data in Geography and Official Statistics
Indicators and the home detection problem
&
Learning urban areas
Learning Urban Areas
Can we use mobile phone data to delineate
urban areas in France?
Learning Urban Areas
Combes, de Bellefon, Plötz and Vanhoof (in preparation) Mining Mobile Phone Data to Recognize Urban Areas
Population presence
Deville et al. (2014) Dynamic population mapping using mobile phone data
Picture from http://www.flowminder.org/
Learning Urban Areas
Can we use mobile phone data to delineate
urban areas in France?
or
Can we, based on CDR data, predict in what kind of urban area
a cell-tower is positioned?
Goal
Test the feasibility of a continuous urban area zoning based on
mobile phone data (as opposed to 8-yearly official releases)
- Insights into patterns of urban development, segregation, etc.
- Direct help for creation of new urban areas
- Indirect validation of urban area definitions
Learning Urban Areas
• Approach
• Machine learning (supervised)
• Challenges
• Feature creation from mobile phone data
• Algorithm selection
• Cross-validation
• Evaluation for imbalanced classes
• Results
Machine Learning
1. Find a proxy (P) for something hard to know (C).
2. Find a function (F) that defines the relation between P and C.
3. Use function (F) to make guesses about (C) given a certain (P).
4. Evaluate your guesses and improve (F) by repeating steps 2 and 3.
= The learning phase, which different algorithms carry out in different ways
Machine learning
Proxy (P) = things from mobile phone data
Function (F) = created by algorithm
Something hard to know (C) = urban area the cell-tower is situated in.
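The four steps above can be sketched in a few lines of Python. This is only an illustration: the nearest-centroid rule stands in for (F) and is not one of the algorithms used in the study, and the feature values and class labels are hypothetical.

```python
from collections import defaultdict
from math import dist

def fit_centroids(P, C):
    """Step 2: build F by averaging the proxy vectors of each class."""
    sums = {}
    counts = defaultdict(int)
    for features, label in zip(P, C):
        if label not in sums:
            sums[label] = [0.0] * len(features)
        sums[label] = [s + x for s, x in zip(sums[label], features)]
        counts[label] += 1
    return {label: [s / counts[label] for s in total]
            for label, total in sums.items()}

def predict(centroids, features):
    """Step 3: guess C as the class whose centroid is nearest to the proxy."""
    return min(centroids, key=lambda label: dist(centroids[label], features))

def accuracy(centroids, P, C):
    """Step 4: share of guesses that match the known labels."""
    hits = sum(predict(centroids, p) == c for p, c in zip(P, C))
    return hits / len(C)

# Step 1: hypothetical proxies per cell-tower
# (mean hourly activity, Voronoi cell area in km2) and hypothetical labels.
train_P = [[950.0, 0.4], [880.0, 0.6], [40.0, 35.0], [55.0, 28.0]]
train_C = ["urban centre", "urban centre", "rural", "rural"]
test_P = [[900.0, 0.5], [60.0, 30.0]]
test_C = ["urban centre", "rural"]

F = fit_centroids(train_P, train_C)
test_accuracy = accuracy(F, test_P, test_C)
```

In practice the features would be scaled first, and step 4 would feed back into step 2 by tuning the algorithm rather than fitting it once.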
Learning Urban Areas
• Approach
• Machine learning (supervised)
• Challenges
• Feature creation from mobile phone data (= choosing (P))
• Algorithm selection
• Cross-validation
• Evaluation for imbalanced classes
• Results
Feature creation (proxy (P))
- Standardized time series of the number of
activities per cell-tower per hour
- Pre-existing (aggregated) differences for
- Weekend and weekdays
- Summer and non-summer months
- Urban areas
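One possible reading of these features, as a minimal sketch: z-score each cell-tower's hourly activity counts, then average the standardized values per hour of day, separately for weekdays and weekends. The function names, the input layout (one count per hour, starting at midnight) and the choice of z-scores are assumptions made for illustration, not the study's actual pipeline.

```python
from statistics import mean, pstdev

def standardize(series):
    """Z-score one tower's hourly counts so busy and quiet towers are comparable."""
    mu = mean(series)
    sigma = pstdev(series)
    if sigma == 0:
        return [0.0 for _ in series]
    return [(x - mu) / sigma for x in series]

def weekday_weekend_profiles(hourly_counts, is_weekend):
    """Average the standardized series per hour of day, split by day type.

    hourly_counts: one count per hour, starting at midnight of day 0.
    is_weekend: one boolean per day covered by hourly_counts.
    """
    z = standardize(hourly_counts)
    buckets = {"weekday": [[] for _ in range(24)],
               "weekend": [[] for _ in range(24)]}
    for i, value in enumerate(z):
        day, hour = divmod(i, 24)
        key = "weekend" if is_weekend[day] else "weekday"
        buckets[key][hour].append(value)
    return {key: [mean(v) if v else 0.0 for v in hours]
            for key, hours in buckets.items()}

# Hypothetical tower: a weekday with a quiet morning and a busy afternoon,
# followed by a flat weekend day.
counts = [10] * 12 + [30] * 12 + [20] * 24
profiles = weekday_weekend_profiles(counts, is_weekend=[False, True])
```

The same bucketing idea would extend to summer versus non-summer months by keying on the month instead of the day type.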
Feature creation (proxy (P))
- Average number of actions per cell-tower
- Size of the Voronoi cell per cell-tower (relates to population density)
Machine Learning Algorithms
• Different algorithms use features differently, as they construct and
improve their predictions in different ways
• We tested several supervised algorithms; these worked best:
• Penalized logistic regression (elastic net)
• Boosted trees
• Random forests
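For reference, the elastic net penalty in its common textbook form, written here for the binary case. The slides do not state the exact formulation or the multi-class extension that was used, so the symbols below ($\lambda$ for the overall penalty strength, $\alpha$ for the mix between the L1 and L2 terms) are assumptions.

```latex
\min_{\beta_0,\,\beta}\;
-\frac{1}{N}\sum_{i=1}^{N}
\Bigl[\, y_i\,(\beta_0 + x_i^{\top}\beta)
      - \log\bigl(1 + e^{\beta_0 + x_i^{\top}\beta}\bigr) \Bigr]
\;+\;
\lambda\Bigl[\, \tfrac{1-\alpha}{2}\,\lVert\beta\rVert_2^2
      + \alpha\,\lVert\beta\rVert_1 \Bigr]
```

With $\alpha = 1$ this reduces to the lasso and with $\alpha = 0$ to ridge regression.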
Learning Urban Areas
• Approach
• Machine learning (supervised)
• Challenges
• Feature creation from mobile phone data (= choosing (P))
• Algorithm selection
• Cross-validation in the light of spatial autocorrelation
• Evaluation for imbalanced classes
• Results
Cross-validation
• ML algorithms learn their function (F) on a training set and are evaluated on a test set.
• Training and test sets should be as independent as possible for cross-validation to work
• Urban Areas are spatially autocorrelated, so neighbouring cell-towers are not
independent. We therefore have to be careful when constructing test sets.
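One common way to handle this, sketched below, is to split by spatial blocks rather than by individual cell-towers, so that close neighbours end up in the same fold. The slides do not say which scheme was actually used; the grid blocks, the 50 km block size, the tower coordinates and the function names are all hypothetical.

```python
def spatial_block_folds(towers, block_size_km, n_folds):
    """Assign every tower to a fold through the grid block it falls in.

    towers: dict mapping tower id to projected (x_km, y_km) coordinates.
    Towers in the same block always share a fold.
    """
    folds = {}
    block_ids = {}
    for tower_id, (x_km, y_km) in towers.items():
        block = (int(x_km // block_size_km), int(y_km // block_size_km))
        if block not in block_ids:
            block_ids[block] = len(block_ids)
        folds[tower_id] = block_ids[block] % n_folds
    return folds

def train_test_split_for_fold(folds, test_fold):
    """Return sorted (train ids, test ids) for one held-out fold."""
    train = sorted(t for t, f in folds.items() if f != test_fold)
    test = sorted(t for t, f in folds.items() if f == test_fold)
    return train, test

# Hypothetical tower positions in kilometres.
towers = {
    "a": (1.0, 1.0), "b": (2.0, 3.0),    # same 50 km block
    "c": (52.0, 4.0), "d": (55.0, 8.0),  # neighbouring block
    "e": (120.0, 90.0),                  # far away
}
folds = spatial_block_folds(towers, block_size_km=50, n_folds=3)
train_ids, test_ids = train_test_split_for_fold(folds, test_fold=0)
```

Towers close to a block boundary still have near neighbours in another fold, so a buffer zone around the test blocks may be needed in practice.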
Evaluation
• Different ways of evaluating outcomes are possible
• General accuracy (e.g. % of cell-towers correctly predicted)
• Spatial agreement (e.g. Fuzzy Kappa)
• Accuracy by Urban Area class (e.g. G-mean)
• In our case, the classes are heavily imbalanced, but we
want our evaluation metric to give equal weight to all Urban Area
classes.
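To illustrate why overall accuracy can mislead here, the sketch below computes the G-mean as the geometric mean of the per-class recalls, which is its usual definition. The labels and predictions are hypothetical.

```python
from math import prod

def per_class_recall(y_true, y_pred):
    """Share of each true class that was predicted correctly."""
    totals, hits = {}, {}
    for truth, guess in zip(y_true, y_pred):
        totals[truth] = totals.get(truth, 0) + 1
        if truth == guess:
            hits[truth] = hits.get(truth, 0) + 1
    return {label: hits.get(label, 0) / totals[label] for label in totals}

def g_mean(y_true, y_pred):
    """Geometric mean of per-class recalls; every class weighs the same."""
    recalls = per_class_recall(y_true, y_pred)
    return prod(recalls.values()) ** (1 / len(recalls))

# Hypothetical imbalanced sample: 8 urban-centre towers, 2 rural towers.
y_true = ["urban centre"] * 8 + ["rural"] * 2

# A classifier that always answers "urban centre" is right 80% of the time,
# yet it never finds a rural tower.
always_urban = ["urban centre"] * 10
naive_accuracy = sum(t == p for t, p in zip(y_true, always_urban)) / len(y_true)
naive_g_mean = g_mean(y_true, always_urban)

# A classifier that recovers one of the two rural towers.
better = ["urban centre"] * 8 + ["rural", "urban centre"]
better_g_mean = g_mean(y_true, better)
```

Because the recalls are multiplied, a single class that is never predicted drives the G-mean to zero, however large the other classes are.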
Results (random forests)
• Findings are mixed
• Overall performance is modest, but metrics that allow for fuzziness are acceptable.
• Urban centers are predicted very well
• Multi-polarized centers are predicted the worst
Comparison with INSEE data
• Ran the same random forest on INSEE data related to urban area
delineation and on INSEE data enriched with land use information from
CORINE Land Cover data
• Results are slightly better, but surprisingly similar
Learning Urban Areas
• Complex classification task
• Imbalanced multi-class
• Spatial aspect: auto-correlation
• Results encourage the creation of recurrent urban area zonings from CDR
• Between official (less frequent) releases
• Detection of (change in) major urban centers seems trustworthy
• Assessments of rural and isolated areas should be done with caution
• Pilot for use of machine learning in official statistics
• Applications can be rather straightforward
• Supervised classification tasks provide, by definition, an evaluation of accuracy
Final thoughts
Use of mobile phone data in official statistics is plausible but may/will
require:
1. An investment in the collection of ground truth data on
• Mobile phone use
• Local market shares of operators
2. A shift from the individual paradigm towards more general descriptive patterns
• E.g. individual home-detection versus population presence over time
• Adoption of new methodological toolkits, such as machine learning, together with education
about and communication of new types of uncertainty
But please, please, always remember:
Don’t be Batman.
Thanks for your kind attention
[email protected]
@Metti Hoof
www.MaartenVanhoof.com