emearc17

#EMEARC17
Resolving Identities of Entities
Entities and their Attributes
Marta Cichoń, National Library of Poland
Objectives
• Data atomization in the bibliographic database
• Using a controlled vocabulary and matching it with other standard vocabularies
• Populating standard MARC 21 fields with values that assign attributes to entities
• Creating additional links between entities in our database
• From unstructured data to structured information
• Improving possibilities for data retrieval and linking
New opportunities with MARC 21
Bibliographic Records:
336 – Content Type (R)
337 – Media Type (R)
338 – Carrier Type (R)
380 – Form of Work (R)
381 – Other Distinguishing Characteristics of Work or Expression (R)
385 – Audience Characteristics (R)
386 – Creator/Contributor Characteristics (R)
388 – Time Period of Creation (R)
658 – Index Term-Curriculum Objective (R) (here as field of knowledge – content objective)
Authority Records:
368 – Other Attributes of Person or Corporate Body (R)
370 – Associated Place (R)
371 – Address (R)
372 – Field of Activity (R)
373 – Associated Group (R)
374 – Occupation (R)
375 – Gender (R)
376 – Family Information (R)
377 – Associated Language (R)
378 – Fuller Form of Personal Name (NR)
380 – Form of Work (R)
385 – Audience Characteristics (R)
386 – Creator/Contributor Characteristics (R)
388 – Time Period of Creation (R)
The same MARC 21, new approach
• Shared entities for name and subject authority control
• Allowing additional “attributes”
• Allowing better data segmentation
• Using new attributes to populate additional facets in the Faceted Search
• Using the enriched content of records to extract specific data in various formats
Entities and their Attributes
Type of Entity: Persons
o 043 – Geographic Area Code (NR)
o 046 – Special Coded Dates (R)
o 368|c – Other designation (R)
o 368|d – Title of Person (R)
o 370 – Associated Place (R)
o 372 – Field of Activity (R)
o 373 – Associated Group (R)
o 374 – Occupation (R)
o 375 – Gender (R)
o 377 – Associated Language (R)
Entities and their Attributes
Type of Entity: Corporate Bodies
o 043 – Geographic Area Code (NR)
o 046 – Special Coded Dates (R)
o 368|a – Type of corporate body (R)
o 370 – Associated Place (R)
o 371 – Address (R)
o 372 – Field of Activity (R)
o 377 – Associated Language (R)
Entities and their Attributes
Type of Entity: Events
o 043 – Geographic Area Code (NR)
o 046 – Special Coded Dates (R)
o 368|a – Type of meeting/event (R)
o 370 – Associated Place (R)
o 372 – Field of Activity (R)
o 377 – Associated Language (R)
Entities and their Attributes
Type of Entity: Places
o 034 – Coded Cartographic Mathematical Data (R)
o 043 – Geographic Area Code (NR)
o 045 – Time Period (NR)
o 368|b – Type of jurisdiction (R)
Semi-automatic, automatic and manual methods
• Semi-automatic method of populating records with attributes extracted from existing records’ data with the help of the open-source tool MarcEdit
• Manual method of populating records with new attributes using the knowledge of experienced catalogers
• Automatic method of mapping attributes from other datasets
Semi-automatic method
Parsing unstructured bibliographic data into structured data elements. Example: parsing the birth and death dates recorded in MARC 100|d into separate subfields MARC 046|f and 046|g using regular expressions:
• Selecting and exporting headings where 100|d matches the regular expression (\(.*), since dates are recorded in this subfield in brackets
• Copying all data from field 100|d to field 046 in a new file
• Using the ‘Edit Indicators Utility’ tool to correct the indicators and the ‘Edit Subfield’ tool to delete the surplus subfield ‘a’ in the copied 046 field
• Using the ‘Swap Field Utility’ tool to add subfields ‘f’ and ‘g’ to the 046 field with identical copied data
• Using the ‘Edit Subfield’ tool to edit the data in subfields ‘f’ and ‘g’ by means of a regular expression
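The regular-expression step can also be sketched outside MarcEdit. The Python fragment below is only an illustration of the idea, not the workflow used in MarcEdit: the function name split_life_dates is hypothetical, and the assumed shape of a 100|d value, such as "(1798-1855)" or "(1923- )", is a guess based on the remark that dates are recorded in brackets.

```python
import re

# Assumed shape of a 100|d value: "(birth-death)", e.g. "(1798-1855)".
# The exact punctuation in real records is an assumption of this sketch.
LIFE_DATES = re.compile(r"^\((?P<birth>[^-)]*)-(?P<death>[^)]*)\)")


def split_life_dates(subfield_d):
    """Return 046 subfields 'f' (birth) and 'g' (death) parsed from 100|d.

    Hypothetical helper: returns an empty dict when the value does not
    start with a bracketed date range.
    """
    match = LIFE_DATES.match(subfield_d.strip())
    if match is None:
        return {}
    parts = {
        "f": match.group("birth").strip(),
        "g": match.group("death").strip(),
    }
    # Keep only the subfields that actually carry a value.
    return {code: value for code, value in parts.items() if value}


if __name__ == "__main__":
    print(split_life_dates("(1798-1855)"))
    print(split_life_dates("(1923- )"))
```

Values that do not match the assumed pattern are left for manual review rather than guessed at, which mirrors the semi-automatic character of the method.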
Automatic method
Based on mappings of property values, example: dcterms
MARC 21    dcterms namespace properties
024        dcterms:identifier
034        dcterms:spatial
045        dcterms:temporal
336        dcterms:type
337        dcterms:requires
338        dcterms:medium
385        dcterms:educationLevel, dcterms:audience
388        dcterms:created
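A mapping of this kind can be expressed as a simple lookup table. The sketch below is a minimal illustration, assuming records are already available as (tag, value) pairs; the function name to_dcterms and the sample values are hypothetical, and emitting both properties listed for field 385 is a design choice of this sketch, not something the mapping prescribes.

```python
# MARC 21 tag -> dcterms properties, copied from the mapping listed above.
MARC_TO_DCTERMS = {
    "024": ["dcterms:identifier"],
    "034": ["dcterms:spatial"],
    "045": ["dcterms:temporal"],
    "336": ["dcterms:type"],
    "337": ["dcterms:requires"],
    "338": ["dcterms:medium"],
    "385": ["dcterms:educationLevel", "dcterms:audience"],
    "388": ["dcterms:created"],
}


def to_dcterms(fields):
    """Map (tag, value) pairs to (property, value) pairs.

    Hypothetical helper: tags without a mapping are skipped silently.
    """
    statements = []
    for tag, value in fields:
        for prop in MARC_TO_DCTERMS.get(tag, []):
            statements.append((prop, value))
    return statements


if __name__ == "__main__":
    # Sample values are illustrative only.
    for statement in to_dcterms([("336", "tekst"), ("385", "Dzieci")]):
        print(statement)
```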
Automatic method
Based on mappings of property values, example: schema.org
MARC 21                         schema.org vocabulary equivalent property
024                             schema:sameAs
034                             schema:geo
046 (for separate subfields)    schema:birthDate, schema:deathDate; schema:foundingDate
368                             schema:AdministrativeArea
370 (for separate subfields)    schema:birthPlace, schema:deathPlace, schema:foundingPlace, schema:location
371                             schema:address
372                             schema:industry
373                             schema:affiliation
374                             schema:jobTitle
375                             schema:gender
376                             schema:relatedTo
377                             schema:inLanguage
385                             schema:audienceType
386                             schema:nationality
388                             schema:dateCreated
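Where a field maps "for separate subfields", the lookup key needs the subfield code as well as the tag. The sketch below shows one possible shape for a person record. It is an assumption-laden illustration: 046|f and 046|g as birth and death dates follow the parsing example earlier in the deck, while the subfield codes for 370, 374 and 375 are taken from the MARC 21 authority format rather than from these slides, and the function name and sample values are hypothetical. The subfield carrying schema:foundingDate is not stated in the slides, so it is left out.

```python
# (tag, subfield code) -> schema.org property for a person authority record.
# Subfield assignments for 370, 374 and 375 are assumptions of this sketch.
PERSON_SUBFIELD_TO_SCHEMA = {
    ("046", "f"): "schema:birthDate",
    ("046", "g"): "schema:deathDate",
    ("370", "a"): "schema:birthPlace",
    ("370", "b"): "schema:deathPlace",
    ("374", "a"): "schema:jobTitle",
    ("375", "a"): "schema:gender",
}


def to_schema_properties(subfields):
    """Group (tag, code, value) triples by their schema.org property.

    Hypothetical helper: repeatable fields yield lists of values, and
    triples without a mapping are ignored.
    """
    properties = {}
    for tag, code, value in subfields:
        prop = PERSON_SUBFIELD_TO_SCHEMA.get((tag, code))
        if prop is not None:
            properties.setdefault(prop, []).append(value)
    return properties


if __name__ == "__main__":
    # Placeholder values, not taken from a real record.
    sample = [("046", "f", "1798"), ("046", "g", "1855"), ("374", "a", "poeta")]
    print(to_schema_properties(sample))
```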
Structured record data
“Wytwórnia filmowa” is the Polish-language equivalent of “Film Studio Company”
Better record data segmentation
More possibilities of data retrieval
• Retrieving selected datasets in various formats, based on attribute-related criteria, by means of the data extraction service data.bn.org.pl
• Using a standard RESTful API
• Queries returning data in standard formats
More possibilities of data retrieval
Sample queries:
• http://data.bn.org.pl/api/authorities.xml?jurisdiction=Morza
• http://data.bn.org.pl/api/authorities.xml?fieldOfActivity=niemiecki
• http://data.bn.org.pl/api/bibs.json?audienceGroup=Dzieci
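The sample queries share one pattern: a resource name, a format extension and attribute criteria as query parameters. The sketch below composes such URLs with the Python standard library. The helper name build_query_url is hypothetical, only the resources, formats and parameters shown in the sample queries are taken from the slides, and the code deliberately makes no network request, since the current behaviour of the service is not assumed here.

```python
from urllib.parse import urlencode

# Base path as it appears in the sample queries above.
BASE_URL = "http://data.bn.org.pl/api"


def build_query_url(resource, fmt, **criteria):
    """Compose a query URL such as .../authorities.xml?jurisdiction=Morza.

    Hypothetical helper: 'resource' and 'fmt' are inserted as given and
    the criteria are percent-encoded by urlencode.
    """
    return f"{BASE_URL}/{resource}.{fmt}?{urlencode(criteria)}"


if __name__ == "__main__":
    print(build_query_url("authorities", "xml", jurisdiction="Morza"))
    print(build_query_url("bibs", "json", audienceGroup="Dzieci"))
```

Using urlencode keeps criteria with Polish diacritics or spaces valid in the query string without manual escaping.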
More possibilities of data retrieval
GUI: a search query interface is available at http://data.bn.org.pl
Populating additional facets in the Faceted Search
New Facets:
• Audience
• Cultural/Ethnic Identity
• Time Period of Creation
• Content Objective
• Form/Genre:
o Form
o Genre
Thank you!
[email protected]