Mining Mobile Phone Data to Recognize Urban Areas

Maarten Vanhoof
Orange Labs France
Open Lab, Newcastle University
[email protected]
@Metti Hoof
www.MaartenVanhoof.com

Together with
Stéphanie Combes (INSEE)
Marie-Pierre de Bellefon (INSEE)
Thomas Plötz (Open Lab)

Orange Labs-INSEE-Eurostat project
Stéphanie Combes (INSEE)
Benjamin Sakarovitch (INSEE)
Marie-Pierre de Bellefon (INSEE)
Pauline Givord (INSEE)
Vincent Loonis (INSEE)
Pierre-Nicolas Morin (ENSAE)
Fernando Reis (Eurostat)
Michail Skaliotis (Eurostat)
Zbigniew Smoreda (Orange Labs)
Thomas Plötz (Open Lab)

Use of mobile phone data in Geography and Official Statistics
Indicators and the home detection problem & Learning urban areas

Learning Urban Areas
Can we use mobile phone data to delineate urban areas in France?
Combes, de Bellefon, Plötz and Vanhoof (in preparation), Mining Mobile Phone Data to Recognize Urban Areas

Population presence
Deville et al. (2014), Dynamic population mapping using mobile phone data
Picture from http://www.flowminder.org/

Learning Urban Areas
Can we use mobile phone data to delineate urban areas in France?
or
Can we, based on CDR data, predict in what kind of urban area a cell-tower is positioned?

Goal
Test the feasibility of a continuous urban area zoning based on mobile phone data (as opposed to 8-yearly official releases)
- Insights into patterns of urban development, segregation, etc.
- Direct help for the creation of new urban areas
- Indirect validation of urban area definitions

Learning Urban Areas
• Approach
  • Machine learning (supervised)
• Challenges
  • Feature creation from mobile phone data
  • Algorithm selection
  • Cross-validation
  • Evaluation for imbalanced classes
• Results

Machine Learning
1. Find a proxy (P) for something hard to know (C).
2. Find a function (F) that defines the relation between (P) and (C).
3. Use function (F) to make guesses about (C) given a certain (P).
4. Evaluate your guesses and improve (F) by repeating steps 2 and 3 (= the learning phase, which different algorithms carry out in different ways).

Machine learning
Proxy (P) = features derived from mobile phone data
Function (F) = created by the algorithm
Something hard to know (C) = the urban area the cell-tower is situated in

Feature creation (proxy (P))
- Standardized time series of the amount of activity at each cell-tower for each hour
- Pre-existing (aggregated) differences for
  - Weekends and weekdays
  - Summer and non-summer months
  - Urban areas
Combes, de Bellefon, Plötz and Vanhoof (in preparation)

Feature creation (proxy (P))
- Average number of actions per cell-tower
- Size of the Voronoi cell per cell-tower (relates to population density)

Machine Learning Algorithms
• Different algorithms use features differently, as they construct and improve their predictions in different ways
• We tested several supervised algorithms; these worked best:
  • Penalized logistic regression (elastic net)
  • Boosting trees
  • Random forests

Cross-validation in the light of spatial autocorrelation
• ML algorithms use training and test sets to improve their function (F).
• Training and test sets should be as independent as possible for cross-validation to work.
• Urban areas are spatially autocorrelated, so neighbouring cell-towers are not independent. As such, we have to be careful when constructing test sets.

Evaluation
• Different ways of evaluating outcomes are possible
  • General accuracy (e.g. % of cell-towers correctly predicted)
  • Spatial agreement (e.g. Fuzzy Kappa)
  • Accuracy by urban area class (e.g. G-means)
• In our case, we are dealing with heavily imbalanced classes, but we want our evaluation metric to put equal importance on all urban area classes.

Results (random forests)
• Findings are mixed
• Performance is not that high, but metrics allowing for fuzziness are OK.
• Urban centers are predicted really well
• Multi-polarized centers are predicted the worst
Combes, de Bellefon, Plötz and Vanhoof (in preparation)

Results
Combes, de Bellefon, Plötz and Vanhoof (in preparation)

Comparison with INSEE data
• Ran the same random forest on INSEE data related to urban area delineation, and on INSEE data enriched with land use information from CORINE data
• Results are slightly better than with mobile phone data, but surprisingly similar
Combes, de Bellefon, Plötz and Vanhoof (in preparation)

Results
Combes, de Bellefon, Plötz and Vanhoof (in preparation)

Learning Urban Areas
• Complex classification task
  • Imbalanced multi-class
  • Spatial aspect: autocorrelation
• Results encourage the creation of recurrent urban area zoning from CDR
  • Between official (less frequent) releases
  • Detection of (change in) major urban centers seems trustworthy
  • Assessments of rural and isolated areas should be made with caution
• Pilot for the use of machine learning in official statistics
  • Applications can be rather straightforward
  • Supervised classification tasks provide, by definition, an evaluation of accuracy

Final thoughts
Use of mobile phone data in official statistics is plausible but may/will require:
1. An investment in the collection of ground truth data on
  • Mobile phone use
  • Local market shares of operators
2. A shift from the individual paradigm towards more general descriptive patterns
  • E.g. individual home-detection versus population presence over time
  • Adaptation of new methodological toolkits, like machine learning, and the education and communication of new types of uncertainties

But please, please, always remember: Don’t be Batman.

Thanks for your kind attention
[email protected]
@Metti Hoof
www.MaartenVanhoof.com
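
Backup sketch: evaluation and test-set construction
A minimal sketch, not the authors' implementation, of two ideas from the challenges discussed in this talk. The first is scoring predictions with the G-mean (the geometric mean of per-class recall), so that every urban area class counts equally however rare it is. The second is building cross-validation folds from spatial grid blocks, so that neighbouring cell-towers stay in the same fold. The grid-block scheme, the function names g_mean and spatial_block_folds, the parameters block_size and n_folds, and the class labels in the usage note are illustrative assumptions; the talk does not specify how the test sets were constructed.

```python
import math
from collections import defaultdict


def g_mean(y_true, y_pred):
    """Geometric mean of per-class recall; every class weighs equally."""
    hits = defaultdict(int)    # correctly predicted towers per true class
    totals = defaultdict(int)  # number of towers per true class
    for truth, guess in zip(y_true, y_pred):
        totals[truth] += 1
        if truth == guess:
            hits[truth] += 1
    recalls = [hits[c] / totals[c] for c in totals]
    return math.prod(recalls) ** (1.0 / len(recalls))


def spatial_block_folds(coords, block_size, n_folds):
    """Assign each tower to a fold through its grid block (assumed scheme).

    Towers that fall in the same square block of side `block_size` always
    share a fold, so they are never split between training and test set.
    """
    def block_of(x, y):
        return (int(x // block_size), int(y // block_size))

    blocks = sorted({block_of(x, y) for x, y in coords})
    fold_of_block = {b: i % n_folds for i, b in enumerate(blocks)}
    return [fold_of_block[block_of(x, y)] for x, y in coords]
```

Usage note: with hypothetical labels, predicting "urban_center" for five towers of which four are truly "urban_center" and one is "isolated" gives an accuracy of 0.8 but a G-mean of 0.0, because the recall of the "isolated" class is zero. This is the behaviour wanted when all classes should weigh equally. Grid blocks reduce, but do not remove, dependence between folds: towers on either side of a block border are still neighbours.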