Mapping Leiden:
Automatically extracting street names
from digitized newspaper articles
Kim Groeneveld
Menno van Zaanen
Tilburg University
Introduction
The Dutch institute “Erfgoed Leiden en Omstreken” (ELO) is developing an interactive map of Leiden that shows people monumental buildings and many other interesting geographical entities or properties (Erfgoed Leiden en Omstreken, 2014).
Fig. 1: Interactive map of Leiden.
A planned feature will let users search for (historical) newspaper articles related to a geographical area indicated on the map. This requires identifying street names in the collection of digitized newspaper articles.
Geographical Entity Recognition (GER) is a sub-field of Named Entity Recognition (NER) that identifies geographical entities such as countries, cities, or street names (Leidner & Lieberman, 2011; Pouliquen, Steinberger, Ignat, & De Groeve, 2004).
Results: Less annotated data
[Figure: learning curves. x-axis: Amount of training data (%), at 25, 50, 75, and 100; y-axis: Evaluation metric score (%). Curves: F-score, precision, and recall for MBT and for Stanford NER, with the baseline F-score, precision, and recall as reference lines.]
• Performance of both systems improves with more data
– As expected
• Stanford NER outperforms MBT
• Stanford NER requires much more processing time than MBT
100% = 5000 files
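The learning-curve setup can be sketched as follows, assuming the 25%, 50%, 75%, and 100% steps are nested random subsets of the training files. The function name `learning_curve_subsets`, the fixed seed, the nesting, and the file names are assumptions for illustration, not details taken from the experiments.

```python
import random

def learning_curve_subsets(files, percentages=(25, 50, 75, 100), seed=0):
    """Return {percentage: list of files}; smaller subsets are prefixes of larger ones."""
    shuffled = list(files)
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the subsets reproducible
    subsets = {}
    for pct in percentages:
        n = len(shuffled) * pct // 100
        subsets[pct] = shuffled[:n]
    return subsets

# Hypothetical file names standing in for the 5000 annotated articles.
files = [f"article_{i:04d}.iob" for i in range(5000)]
subsets = learning_curve_subsets(files)
print({pct: len(chosen) for pct, chosen in subsets.items()})
# prints {25: 1250, 50: 2500, 75: 3750, 100: 5000}
```

Nesting the subsets means each larger training set only adds files, so differences between points on a curve come from the extra data rather than from a different random sample.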
Data set
• Collection of newspaper articles
– Random selection
∗ 5000 newspaper articles
∗ from 1997 to 2005
– 16.6 million sentences
– 14,678 street name occurrences
– 846 unique street names
– manually corrected IOB tags
∗ based on baseline
• List of street names
– All street names in Leiden
– 1716 street names
• Street names show Zipfian distribution
[Figure: frequency of street names plotted against rank. x-axis: Rank; y-axis: Frequency.]
Fig. 2: Distribution of street names in data set.
Results: Fewer annotations
[Figure: information curves in two panels. x-axis: Number of annotated street names, with tick labels from 50% to 100%; y-axis: Evaluation metric score (%). Curves: F-score, precision, and recall for MBT and for Stanford NER.]
• Performance of systems decreases with fewer street names annotated in training data
– Precision remains constant
– Recall decreases linearly
• MBT and Stanford NER do not identify street names not seen in training data
MBT: 5000 files; Stanford: 2000 files
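Removing annotations, as opposed to removing files, can be sketched as below. Sentences are assumed to be lists of (token, IOB tag) pairs with the tags `B-STREET`, `I-STREET`, and `O`. The tag names, the function names, and the choice to drop individual mentions at random (rather than, say, all mentions of selected street names) are assumptions; the poster does not specify the selection procedure.

```python
import random

def remove_annotations(sentences, keep_fraction, seed=0):
    """Relabel a random share of street name mentions as O (unannotated)."""
    rng = random.Random(seed)
    result = []
    for sentence in sentences:
        new_sentence = []
        keep = True
        for token, tag in sentence:
            if tag.startswith("B-"):
                # Decide once per mention, at its first token.
                keep = rng.random() < keep_fraction
            if tag != "O" and not keep:
                tag = "O"
            new_sentence.append((token, tag))
        result.append(new_sentence)
    return result

def count_mentions(sentences):
    """Count annotated mentions by counting B- tags."""
    return sum(tag.startswith("B-") for sentence in sentences for _, tag in sentence)

# Hypothetical example sentences.
sentences = [
    [("Brand", "O"), ("aan", "O"), ("de", "O"), ("Breestraat", "B-STREET")],
    [("Markt", "O"), ("op", "O"), ("de", "O"), ("Nieuwe", "B-STREET"), ("Rijn", "I-STREET")],
]
print(count_mentions(remove_annotations(sentences, keep_fraction=1.0)))  # prints 2
print(count_mentions(remove_annotations(sentences, keep_fraction=0.0)))  # prints 0
```

The text itself is left untouched, so the training data keeps its size and only the number of annotated street names changes.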
Conclusion
• If complete list of street names is available
– Baseline system (always annotate street name) is fast and effective
– Machine learning systems may help disambiguate known street names in context
→ Improvement of precision
• If list of street names is incomplete
– No system will identify missing street names
→ No improvement of recall
• Machine learning systems disambiguate street names
• Systems do not generalize to unseen street names
Problems
1. How to identify street names best?
• Impact of amount of training data on system; disambiguation
2. What is the impact of the (in)completeness of the training data?
• (In)completeness of list of street names; generalization
Systems
• Baseline: Always assign street name based on list of street names
• Memory Based Tagger: k-NN approach; settings taken from NER module of Frog
• Stanford Named Entity Recognizer (NER): CRF-based approach; default settings
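The baseline can be sketched as a longest-match lookup against the street name list. The tag names, the whitespace tokenization, the case-sensitive matching, and the function name `baseline_tag` are assumptions; the poster only states that the baseline always assigns a street name based on the list.

```python
def baseline_tag(tokens, street_names):
    """Tag every occurrence of a listed street name, preferring the longest match."""
    entries = {tuple(name.split()) for name in street_names}
    max_len = max((len(entry) for entry in entries), default=0)
    tags = ["O"] * len(tokens)
    i = 0
    while i < len(tokens):
        matched = 0
        # Try the longest candidate span first, then shorter ones.
        for length in range(min(max_len, len(tokens) - i), 0, -1):
            if tuple(tokens[i:i + length]) in entries:
                matched = length
                break
        if matched:
            tags[i] = "B-STREET"
            for j in range(i + 1, i + matched):
                tags[j] = "I-STREET"
            i += matched
        else:
            i += 1
    return tags

# Hypothetical list entries and sentence.
street_names = ["Breestraat", "Nieuwe Rijn"]
tokens = "Markt op de Nieuwe Rijn en de Breestraat".split()
tags = baseline_tag(tokens, street_names)
print(list(zip(tokens, tags)))
```

Because the lookup ignores context, a listed name that is used in another sense is still tagged, and a name missing from the list is never tagged.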
Acknowledgments
We would like to thank Cor de Graaf and Walther Hasselo, ELO, for making the data (list of street names and newspaper articles) available.
Method
• Measure performance of systems
• Reduce training data
– Less annotated data
– Fewer street names/annotations
[Figure: Gold standard → Remove data → Learning curves; Gold standard → Remove annotations → Information curves.]
Fig. 3: Setup of experiments.
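Measuring performance can be sketched with precision, recall, and F-score over street name mentions. Scoring only exact span matches and using the balanced F1 are assumptions, as are the function names; the poster does not state how matches are counted.

```python
def extract_spans(tags):
    """Return the set of (start, end) token spans marked by B-/I- tags."""
    spans = set()
    start = None
    for i, tag in enumerate(list(tags) + ["O"]):  # sentinel closes a final span
        if start is not None and not tag.startswith("I-"):
            spans.add((start, i))
            start = None
        if tag.startswith("B-"):
            start = i
    return spans

def precision_recall_f1(gold_tags, predicted_tags):
    """Score predicted spans against gold spans; only exact matches count."""
    gold = extract_spans(gold_tags)
    predicted = extract_spans(predicted_tags)
    correct = len(gold & predicted)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(gold) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if correct else 0.0
    return precision, recall, f1

# Hypothetical tag sequences: the system finds one of the two gold mentions.
gold = ["O", "B-STREET", "I-STREET", "O", "B-STREET"]
predicted = ["O", "B-STREET", "I-STREET", "O", "O"]
precision, recall, f1 = precision_recall_f1(gold, predicted)
print(precision, recall, round(f1, 3))  # prints 1.0 0.5 0.667
```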
References
Erfgoed Leiden en Omstreken. (2014). https://www.erfgoedleiden.nl/. (Accessed: 30.09.2014)
Leidner, J. L., & Lieberman, M. D. (2011). Detecting geographical references in the form of place names and associated spatial natural language. SIGSPATIAL Special, 3(2), 5–11.
Pouliquen, B., Steinberger, R., Ignat, C., & De Groeve, T. (2004). Geographical information recognition and visualization in texts written in various languages. In Proceedings of the 2004 ACM Symposium on Applied Computing (pp. 1051–1058).