Semantic Enrichment for Messy Domains
with Messy Data
Timothy Hill, Europeana
Who we are
Europeana: a pan-European aggregator of
cultural heritage (CH) data objects
• 54 million items (and counting)
• 30+ languages
• 4500 institutions
• national institutions (Louvre, British Museum)
• civic museums
• specialised collections and archives, large (Wellcome Collection) and small
• Extremely heterogeneous
Why enrichment?
• heterogeneity is a big challenge for Europeana
• in particular linguistic heterogeneity
Adding structure and information for …
• Contextualisation
• Navigation
• Exploration
What is enrichment?
Perspectives on enrichment
• Linked Open Data: linking to additional items
• Information Retrieval: adding additional labels
• In fact we do both
• entities and links created using the Europeana Data Model (EDM)
• “Contextual Classes”: Agents, Places, Timespans, Concepts, Events
• labels from these entities are incorporated into our documents
Sample enriched record
• http://iconclass.org/71T37:
["Hanna bringt einen jungen Ziegenbock nach Hause; Tobit glaubt, sie habe ihn gestohlen"] (de);
["Anna brings home a young goat: Tobit thinks she has stolen it"] (fi);
["Anna brings home a young goat: Tobit thinks she has stolen it"] (en);
["Anna porta a casa un capretto: Tobi pensa che essa lo abbia rubato"] (it);
["Anna apporte à la maison un chevreau; Tobit croit qu'elle l'a volé"] (fr);
• http://vocab.getty.edu/aat/300015050:
["油性漆 (顏料料塗層)"] (zh-hant);
["yu hsing ch'i"] (zh-latn-wadegile);
["oil paint (paint)"] (en);
["yóu xìng qī"] (zh-latn-pinyin-x-hanyu);
["you xing qi"] (zh-latn-pinyin-x-notone);
["peinture à l'huile (oil paint)"] (fr);
["pintura al óleo"] (es);
["olieverf"] (nl);
• http://semium.org/time/16xx_2_quarter:
["2-я четверть 17-го века"] (ru);
["2 quarter of the 17th century"] (en);
["2e quart 17e siècle"] (fr);
Tobit en Anna met het bokje | Rembrandt Harmensz. van Rijn 1626, Rijksmuseum
Netherlands, Public Domain
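As a rough illustration of how labels from linked entities might be folded into a record, here is a minimal Python sketch. The URIs and labels are taken from the sample record above; the field names `entity_links` and `enrichment_labels` and the `enrich` helper are assumptions made for this example, not EDM property names or Europeana's actual implementation.

```python
from collections import defaultdict

# Contextual entities keyed by URI; labels copied from the sample record (subset).
ENTITY_LABELS = {
    "http://iconclass.org/71T37": {
        "en": ["Anna brings home a young goat: Tobit thinks she has stolen it"],
        "fr": ["Anna apporte à la maison un chevreau; Tobit croit qu'elle l'a volé"],
    },
    "http://vocab.getty.edu/aat/300015050": {
        "en": ["oil paint (paint)"],
        "es": ["pintura al óleo"],
        "nl": ["olieverf"],
    },
    "http://semium.org/time/16xx_2_quarter": {
        "en": ["2 quarter of the 17th century"],
        "fr": ["2e quart 17e siècle"],
    },
}


def enrich(record, entity_labels):
    """Return a copy of the record with the labels of its linked entities
    grouped by language (hypothetical field names)."""
    labels = defaultdict(list)
    for uri in record["entity_links"]:
        for lang, values in entity_labels.get(uri, {}).items():
            labels[lang].extend(values)
    enriched = dict(record)
    enriched["enrichment_labels"] = dict(labels)
    return enriched


record = {
    "title": "Tobit en Anna met het bokje",
    "entity_links": list(ENTITY_LABELS),
}
enriched = enrich(record, ENTITY_LABELS)
print(enriched["enrichment_labels"]["nl"])
```

With labels attached in this way, a query in one language (for example "olieverf") could in principle retrieve a record whose original metadata is in another.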
First attempts
Lessons learned from the first iteration regarding …
• initial dataset selection
• source vocabulary selection
• variety of tests and tools used in pilot
• precision ranging from 0.312 to 0.985
• recall from 0.083 to 0.567
• best F-measure 0.597
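For readers unfamiliar with these measures, the sketch below shows how precision, recall and F-measure can be computed for a set of proposed enrichments. It assumes the balanced F1 variant (the slides do not say which F-measure was used), and the record and entity identifiers are made up for illustration.

```python
def precision_recall_f1(proposed, correct):
    """Compare proposed (record, entity) links against those judged correct."""
    proposed, correct = set(proposed), set(correct)
    true_positives = len(proposed & correct)
    precision = true_positives / len(proposed) if proposed else 0.0
    recall = true_positives / len(correct) if correct else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical toy data: 4 proposed links, 5 correct links, 2 in common.
proposed = {("rec1", "e1"), ("rec1", "e2"), ("rec2", "e3"), ("rec3", "e4")}
correct = {("rec1", "e1"), ("rec2", "e3"), ("rec2", "e5"), ("rec3", "e6"), ("rec4", "e7")}

precision, recall, f1 = precision_recall_f1(proposed, correct)
print(f"precision={precision:.3f} recall={recall:.3f} F1={f1:.3f}")
```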
Dataset ('target') selection
What should be enriched?
Criteria for the pilot corpus
• Decent metadata quality
• orthographically accurate
• orthographically consistent
• correct data-model mappings
• Representative
• linguistically
• in terms of domain
• Note that this question resolves down to the level of the field as much as the record
• some fields are more tightly defined than others
• the less structured the field, the less relevant an enrichment is likely to be
Vocabulary (‘source’) selection
What should we enrich with?
Criteria for pilot vocabularies
• multilingual labels available
• open licenses
• domain-applicable
Initial vocabularies chosen
• GeoNames
• DBpedia
Teachable Moment (i)
Domain matters
The trouble with GEMET
• Satisfied many of our criteria
• well-structured
• multilingual
• covered a surprisingly large number of required concepts
• Polysemy when cross-applied to other domains
• ‘Druck’ (‘impression, print’) and ‘drawing’ have both medical and artistic senses
• Intra-domain polysemy
• ‘permeability’
• geographical and Linnaean terms often cause false matches
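One way to surface this kind of polysemy before matching is to index vocabulary labels and flag any label that points to more than one concept. The sketch below is illustrative only: the concept identifiers and domain tags are invented, and the ‘pressure’ sense of ‘Druck’ is supplied here as the assumed medical reading.

```python
from collections import defaultdict

# Hypothetical concepts; identifiers and domain tags are invented for illustration.
CONCEPTS = [
    {"id": "ex:print", "domain": "art", "labels": {"de": ["Druck"], "en": ["print"]}},
    {"id": "ex:pressure", "domain": "medicine", "labels": {"de": ["Druck"], "en": ["pressure"]}},
    {"id": "ex:oil-paint", "domain": "art", "labels": {"en": ["oil paint"], "nl": ["olieverf"]}},
]


def ambiguous_labels(concepts):
    """Map (language, lower-cased label) to concept ids, keeping only labels
    shared by more than one concept."""
    index = defaultdict(set)
    for concept in concepts:
        for lang, labels in concept["labels"].items():
            for label in labels:
                index[(lang, label.casefold())].add(concept["id"])
    return {key: ids for key, ids in index.items() if len(ids) > 1}


ambiguous = ambiguous_labels(CONCEPTS)
print(ambiguous)
```

Labels flagged in this way could be excluded from automatic matching or routed to a human for review.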
Teachable Moment (ii)
Evaluation is difficult
Problems encountered
• Labour required meant no gold standard achievable
• pooled recall used instead
• Inter-rater agreement usable, but low (Fleiss’ kappa 0.329)
• more training required
• Co-reference resolution eases the task, but is not always possible
• Spreadsheet presentation was confusing and led to multiple errors
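The two measures named above can be sketched as follows. This is a minimal illustration under stated assumptions, not the evaluation code used in the pilot: pooled recall is taken here to mean recall against the union of correct enrichments found by all tools, the tool names and toy ratings are hypothetical, and Fleiss' kappa follows the standard textbook formula.

```python
def pooled_recall(correct_by_tool):
    """Recall of each tool relative to the pool of all correct results
    found by any tool (a stand-in for an unavailable gold standard)."""
    pool = set().union(*correct_by_tool.values())
    if not pool:
        return {tool: 0.0 for tool in correct_by_tool}
    return {tool: len(found) / len(pool) for tool, found in correct_by_tool.items()}


def fleiss_kappa(ratings):
    """ratings[i][j] = number of raters who put item i in category j.
    Every item must be rated by the same number of raters."""
    n_items = len(ratings)
    n_raters = sum(ratings[0])
    n_categories = len(ratings[0])
    # Observed agreement, averaged over items.
    p_items = [
        (sum(count * count for count in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in ratings
    ]
    p_observed = sum(p_items) / n_items
    # Agreement expected by chance, from overall category proportions.
    p_categories = [
        sum(row[j] for row in ratings) / (n_items * n_raters)
        for j in range(n_categories)
    ]
    p_expected = sum(p * p for p in p_categories)
    return (p_observed - p_expected) / (1 - p_expected)


# Hypothetical toy data: two tools, and three raters judging five enrichments
# as [correct, incorrect].
recalls = pooled_recall({"tool_a": {"e1", "e2"}, "tool_b": {"e2", "e3", "e4"}})
kappa = fleiss_kappa([[3, 0], [3, 0], [0, 3], [2, 1], [1, 2]])
print(recalls, round(kappa, 3))
```

Note that pooled recall is relative: a correct enrichment that no tool found is invisible to it, so the figures are upper bounds on true recall.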
Teachable Moment (iii)
For rulesets, tighter is better
• Don’t match across languages
• NB: this can be surprisingly hard to avoid
• Don’t match acronyms
• Dates are helpful where available
• particularly for Agents
• but also for Places
Top: Intra-Cranial Pressure
Wikimedia Commons - CC-BY-4.0:
Bottom: Insane Clown Posse
CC-BY-SA 3.0: http://bit.ly/2iCuJAJ
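The three rules on this slide could be combined into a matcher roughly like the one below. It is a sketch under assumptions: the `Entity` class, the acronym heuristic (two to five letters, all upper-case) and the `ex:` identifiers are invented for this example, and the only real-world facts used are Rembrandt's life dates (1606 to 1669).

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Entity:
    uri: str
    labels: dict                   # language code -> list of labels
    years: Optional[tuple] = None  # (start_year, end_year), if known


def looks_like_acronym(value):
    """Crude heuristic: two to five letters, all of them upper-case."""
    letters = [ch for ch in value if ch.isalpha()]
    return 2 <= len(letters) <= 5 and all(ch.isupper() for ch in letters)


def matches(value, lang, entity, year=None):
    # Rule: don't match acronyms.
    if looks_like_acronym(value):
        return False
    # Rule: don't match across languages; only labels in the value's language count.
    same_language = [label.casefold() for label in entity.labels.get(lang, [])]
    if value.casefold() not in same_language:
        return False
    # Rule: use dates where both sides have them.
    if year is not None and entity.years is not None:
        start, end = entity.years
        if not start <= year <= end:
            return False
    return True


rembrandt = Entity("ex:rembrandt", {"en": ["Rembrandt"]}, (1606, 1669))
icp = Entity("ex:icp", {"en": ["ICP"]})

print(matches("Rembrandt", "en", rembrandt, year=1626))  # same language, date in range
print(matches("Rembrandt", "de", rembrandt))             # no German label: rejected
print(matches("ICP", "en", icp))                         # acronym: rejected
```

Rejecting acronyms outright trades recall for precision, which is the direction the slide recommends.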
Larger Lessons (i)
Data-wrangling involves more human effort
and expertise than you want it to …
… so make involving humans easier
• “Noise squared”
• loose and unpredictable structures (DBpedia, Wikidata) are better suited to the CH domain than more consistent taxonomies
• these also demand more curation and editing
• Devolve enrichment to the institutional level, where it should be built into workflow
• as it is in, for example, the Rijksmuseum
• Provide simple APIs for developers and data providers to annotate with …
• … and good feedback mechanisms for non-expert users
• Evaluation needs to be supported with toolkits
Larger Lessons (ii)
Users query for entities more than records
‘Enrichments’ aren’t a bonus; they’re the essence
• Entity names make up 70%-95% of all queries through the Europeana portal
• (depending on what Concepts are defined)
• Information need expressed is for entities, with records being ‘about’ those entities
• in terms of discovery, entities are primary and records secondary
• ‘Enrichment’ now reconceptualised as a knowledge graph
• records are nodes linked to entities and each other
• traversal possibly as important as retrieval
• Enrichment more than an aid to discovery
• potentially changes the character of what it is that’s discovered
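To make the knowledge-graph view concrete, the sketch below treats records and entities as nodes and finds related records by traversal rather than by text retrieval. The record identifiers, the `record:` prefix convention and the `entity:other` node are hypothetical; the three entity URIs are those attached to the sample record earlier in the deck.

```python
from collections import deque

AAT_OIL_PAINT = "http://vocab.getty.edu/aat/300015050"
TIME_17C_Q2 = "http://semium.org/time/16xx_2_quarter"
ICONCLASS_TOBIT = "http://iconclass.org/71T37"

# Hypothetical links between records and entities.
EDGES = [
    ("record:tobit", AAT_OIL_PAINT),
    ("record:tobit", TIME_17C_Q2),
    ("record:tobit", ICONCLASS_TOBIT),
    ("record:a", AAT_OIL_PAINT),
    ("record:b", TIME_17C_Q2),
    ("record:a", "entity:other"),
    ("record:c", "entity:other"),
]


def build_graph(edges):
    """Build an undirected adjacency map from (node, node) pairs."""
    graph = {}
    for left, right in edges:
        graph.setdefault(left, set()).add(right)
        graph.setdefault(right, set()).add(left)
    return graph


def related_records(graph, start, max_hops=2):
    """Breadth-first traversal: records reachable from start within max_hops."""
    seen = {start}
    queue = deque([(start, 0)])
    found = []
    while queue:
        node, distance = queue.popleft()
        if distance == max_hops:
            continue
        for neighbour in graph.get(node, ()):
            if neighbour not in seen:
                seen.add(neighbour)
                if neighbour.startswith("record:"):
                    found.append(neighbour)
                queue.append((neighbour, distance + 1))
    return sorted(found)


graph = build_graph(EDGES)
print(related_records(graph, "record:tobit"))
print(related_records(graph, "record:tobit", max_hops=4))
```

Two hops (record to entity to record) yields records sharing an entity with the starting record; raising `max_hops` widens the neighbourhood at the cost of relevance.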
Europeana home: http://www.europeana.eu/portal
Europeana Data Model (EDM) documentation: http://pro.europeana.eu/page/edm-documentation
Task Force on Enrichment and Evaluation: Comparative Evaluation of Semantic
Enrichments: http://pro.europeana.eu/files/Europeana_Professional/EuropeanaTech/
EuropeanaTech Task Force on a Multilingual and Semantic Enrichment Strategy: final report:
Entitifying Europeana: Building an Ecosystem of Networked References for Cultural
Objects: http://swib.org/swib16/slides/manguinhas_europeana.pdf
21 October 2016