Mapping Leiden: Automatically extracting street names from digitized newspaper articles

Kim Groeneveld, Menno van Zaanen
Tilburg University

Introduction

The Dutch institute “Erfgoed Leiden en Omstreken” (ELO) is developing an interactive map of Leiden that shows people monumental buildings and many other interesting geographical entities or properties (Erfgoed Leiden en Omstreken, 2014). Future functionality will allow users to search for (historical) newspaper articles related to a geographical area indicated on the map. This requires identifying street names in the collection of digitized newspaper articles. Geographical Entity Recognition (GER) is a sub-field of Named Entity Recognition (NER) that identifies geographical entities such as countries, cities, or street names (Leidner & Lieberman, 2011; Pouliquen, Steinberger, Ignat, & De Groeve, 2004).

Fig. 1: Interactive map of Leiden.

Problems

1. How to identify street names best?
• Impact of amount of training data on system; disambiguation
2. What is the impact of the (in)completeness of the training data?
• (In)completeness of list of street names; generalization

Data set

• Collection of newspaper articles
– Random selection
∗ 5000 newspaper articles
∗ from 1997 to 2005
– 16.6 million sentences
– 14,678 street name occurrences
– 846 unique street names
– manually corrected IOB tags
∗ based on baseline
• List of street names
– All street names in Leiden
– 1716 street names
• Street names show Zipfian distribution

[Plot: Frequency against Rank.]
Fig. 2: Distribution of street names in data set.

Systems

• Baseline: Always assign street name based on list of street names
• Memory Based Tagger (MBT): k-NN approach; settings taken from NER module of Frog
• Stanford Named Entity Recognizer (NER): CRF-based approach; default settings

Method

• Measure performance of systems
• Reduce training data
– Less annotated data
– Fewer street names/annotations

[Diagram: Gold standard → Remove data → Learning curves; Gold standard → Remove annotations → Information curves.]
Fig. 3: Setup of experiments.

Results: Less annotated data

[Plot: Evaluation metric score (%) against Amount of training data (%), at 25, 50, 75, and 100; 100% = 5000 files. Series: F-score, precision, and recall for MBT and for Stanford NER, with baseline F-score, precision, and recall as reference lines.]

• Performance of systems improves with more data
– As expected
• Stanford NER outperforms MBT
• Stanford NER requires much more processing time than MBT

Results: Fewer annotations

[Two plots: Evaluation metric score (%) against Number of annotated street names, with 50% to 100% of the annotations retained. Series: F-score, precision, and recall for MBT and for Stanford NER. MBT: 5000 files; Stanford: 2000 files.]

• Performance of systems decreases with fewer street names annotated in training data
– Precision remains constant
– Recall decreases linearly
• MBT and Stanford NER do not identify street names not seen in training data

Conclusion

• If complete list of street names is available
– Baseline system (always annotate street name) is fast and effective
– Machine learning systems may help disambiguate known street names in context
→ Improvement of precision
• If list of street names is incomplete
– No system will identify missing street names
→ No improvement of recall
• Machine learning systems disambiguate street names
• Systems do not generalize to unseen street names

Acknowledgments

We would like to thank Cor de Graaf and Walther Hasselo, ELO, for making the data (list of street names and newspaper articles) available.

References

Erfgoed Leiden en Omstreken. (2014). https://www.erfgoedleiden.nl/. (Accessed: 30.09.2014)
Leidner, J. L., & Lieberman, M. D. (2011). Detecting geographical references in the form of place names and associated spatial natural language. SIGSPATIAL Special, 3(2), 5–11.
Pouliquen, B., Steinberger, R., Ignat, C., & De Groeve, T. (2004). Geographical information recognition and visualization in texts written in various languages. In Proceedings of the 2004 ACM Symposium on Applied Computing (pp. 1051–1058).
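Illustration: the baseline system and the evaluation metrics can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not the implementation used for the poster. Whitespace tokenization, case-insensitive longest-match lookup, and entity-level scoring are all assumptions, and the function names (`tag_street_names`, `spans`, `precision_recall_f`) and the example sentence are hypothetical.

```python
def tag_street_names(tokens, street_names):
    """Assign IOB tags by longest match against a list of street names.

    Assumption: matching is case-insensitive on whitespace-split tokens.
    """
    entries = {tuple(name.lower().split()) for name in street_names}
    max_len = max((len(entry) for entry in entries), default=0)
    lowered = [token.lower() for token in tokens]
    tags = ["O"] * len(tokens)
    i = 0
    while i < len(tokens):
        match_len = 0
        # Try the longest candidate first so multi-word names win.
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            if tuple(lowered[i:i + n]) in entries:
                match_len = n
                break
        if match_len:
            tags[i] = "B"
            for j in range(i + 1, i + match_len):
                tags[j] = "I"
            i += match_len
        else:
            i += 1
    return tags


def spans(tags):
    """Return the set of (start, end) spans encoded by a sequence of IOB tags."""
    found = set()
    start = None
    # A trailing "O" closes a span that runs to the end of the sequence.
    for i, tag in enumerate(list(tags) + ["O"]):
        if tag != "I" and start is not None:
            found.add((start, i))
            start = None
        if tag == "B":
            start = i
    return found


def precision_recall_f(gold_tags, predicted_tags):
    """Entity-level precision, recall, and F-score (exact span match)."""
    gold, predicted = spans(gold_tags), spans(predicted_tags)
    correct = len(gold & predicted)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(gold) if gold else 0.0
    total = precision + recall
    f_score = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_score


# Hypothetical sentence: "haven" (harbour) is used here as a common noun,
# although it also appears in the (hypothetical) street name list.
tokens = "De haven ligt bij de Hoge Rijndijk".split()
gold = ["O", "O", "O", "O", "O", "B", "I"]
predicted = tag_street_names(tokens, ["Haven", "Hoge Rijndijk"])
precision, recall, f_score = precision_recall_f(gold, predicted)
```

In this constructed example the list lookup finds every street name (recall stays at 1.0) but also tags the common noun, which lowers precision. That is the kind of ambiguity the Conclusion expects the machine learning systems to help resolve in context.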