Ch. Chiarcos, ACoLi CO, 2016, July 22
LLOD and corpora
TODO
If you have a Un*x environment (Linux, BSD, Mac, Cygwin):
- make sure you have Java installed and wifi works
- lab members: svn checkout /intern/incubator/conll-rdf
- others: download and unzip http://acoli.informatik.uni-frankfurt.de/tmp/conll-rdf-prerelease.zip
If you don't:
- find a neighbor who does
LLOD and corpora
1. Linguistic Linked Open Data
and interoperability challenges wrt. corpora
2. State of the Art: NIF & POWLA
... and why neither is used in NLP or corpus linguistics
3. Yet another format: CoNLL-RDF
Breaking the usage barrier?
4. Working with CoNLL-RDF
Linked Open Data
RDF, RDF vocabularies, Linking
Resource Description Framework (RDF)
• W3C standard (1999)
– generic data model: directed labeled graph
• nodes, edges, labels
– originally developed to provide metadata about
resources
• e.g., journals in a bookstore and eBooks in an online
shop
– resources are unambiguously identified in the web
of data by Uniform Resource Identifiers (URIs)
Resource Description Framework (RDF)
• a (labeled directed multi-) graph
– nodes („RDF resources“)
• anything we want to provide information about
– edges („RDF properties“)
• assigns a source node („subject“) a target node („object“)
or a value („literal“)
– nodes and edges are unambiguously identified
• Uniform Resource Identifiers (URIs), e.g., URLs
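The graph model above can be mirrored in a few lines of Python. This is only an illustrative sketch: the `Triple` tuple, the helper functions and all `ex:` names are made up for this example and do not come from any RDF library.

```python
from collections import namedtuple

# One labeled, directed edge: subject --predicate--> object
Triple = namedtuple("Triple", ["subject", "predicate", "obj"])

# A graph is a set of edges (hypothetical example data).
graph = {
    Triple("ex:book1", "ex:soldBy", "ex:shopA"),
    Triple("ex:book1", "ex:shippedBy", "ex:shopA"),  # second edge, same node pair
    Triple("ex:book1", "ex:title", '"An example title"'),  # literal value
}

def outgoing(graph, node):
    """Return all (predicate, object) pairs that leave the given node."""
    return sorted((t.predicate, t.obj) for t in graph if t.subject == node)

def nodes(graph):
    """Subjects and non-literal objects are the resources (nodes)."""
    found = set()
    for t in graph:
        found.add(t.subject)
        if not t.obj.startswith('"'):
            found.add(t.obj)
    return found
```

Two edges with different labels between the same pair of nodes are what makes this a multigraph; the quoted string is a literal and therefore not a node of its own.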
RDF
thesaurus:bronstijd rdf:type thesaurus:period.
[Graph: thesaurus:bronstijd -- rdf:type --> thesaurus:period]
(the concept) „bronstijd“ is an (instance of the concept) „period“
RDF
thesaurus:bronstijd rdf:type thesaurus:period.
[Graph: thesaurus:bronstijd -- rdf:type --> thesaurus:period]
rdf:type is the abbreviation of the URI <http://www.w3.org/1999/02/22-rdf-syntax-ns#type>
this URI could be opened in a browser: resolvable URIs may provide further information
RDF
thesaurus:bronstijd rdf:type thesaurus:period.
thesaurus:bronstijd skos:prefLabel "Bronzezeit"@de.
[Graph: thesaurus:bronstijd -- rdf:type --> thesaurus:period; thesaurus:bronstijd -- skos:prefLabel --> „Bronzezeit“@de]
in German (de), the preferred label for (the concept) „bronstijd“ is „Bronzezeit“
RDF
thesaurus:bronstijd rdf:type thesaurus:period.
thesaurus:bronstijd skos:prefLabel "Bronzezeit"@de.
thesaurus:bronstijd skos:prefLabel "bronze age"@en.
[Graph: thesaurus:bronstijd -- rdf:type --> thesaurus:period; thesaurus:bronstijd -- skos:prefLabel --> „Bronzezeit“@de, „bronze age“@en]
in English (en), the preferred label for (the concept) „bronstijd“ is „bronze age“
RDF
thesaurus:bronstijd rdf:type thesaurus:period.
thesaurus:bronstijd skos:prefLabel "Bronzezeit"@de.
thesaurus:bronstijd skos:prefLabel "bronze age"@en.
thesaurus:bronstijd skos:prefLabel "bronstijd"@nl .
[Graph: thesaurus:bronstijd -- rdf:type --> thesaurus:period; thesaurus:bronstijd -- skos:prefLabel --> „bronstijd“@nl, „Bronzezeit“@de, „bronze age“@en]
in Dutch (nl), the preferred label for (the concept) „bronstijd“ is „bronstijd“
RDF
thesaurus:bronstijd rdf:type thesaurus:period.
thesaurus:bronstijd skos:prefLabel "Bronzezeit"@de.
thesaurus:bronstijd skos:prefLabel "bronze age"@en.
thesaurus:bronstijd skos:prefLabel "bronstijd"@nl .
The lines above are the triple notation (Turtle); the graphical notation of the same data:
[Graph: thesaurus:bronstijd -- rdf:type --> thesaurus:period; thesaurus:bronstijd -- skos:prefLabel --> „bronstijd“@nl, „Bronzezeit“@de, „bronze age“@en]
RDF Querying
thesaurus:bronstijd rdf:type thesaurus:period.
thesaurus:bronstijd skos:prefLabel "Bronzezeit"@de.
thesaurus:bronstijd skos:prefLabel "bronze age"@en.
thesaurus:bronstijd skos:prefLabel "bronstijd"@nl .
SPARQL („SQL meets Turtle“), querying the triples above (Turtle notation):
PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
SELECT ?de ?en
WHERE {
  ?a skos:prefLabel ?de.
  ?a skos:prefLabel ?en.
  FILTER(langMatches(lang(?de), "de"))
  FILTER(langMatches(lang(?en), "en"))
}
RDF Querying
thesaurus:bronstijd rdf:type thesaurus:period.
thesaurus:bronstijd skos:prefLabel "Bronzezeit"@de.
thesaurus:bronstijd skos:prefLabel "bronze age"@en.
thesaurus:bronstijd skos:prefLabel "bronstijd"@nl .
SPARQL query over the triples above (Turtle notation):
PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
SELECT ?de ?en
WHERE {
  # triples with variables
  ?a skos:prefLabel ?de.
  ?a skos:prefLabel ?en.
  # FILTERs with XPath-like functions
  FILTER(langMatches(lang(?de), "de"))
  FILTER(langMatches(lang(?en), "en"))
}
RDF Querying
thesaurus:bronstijd rdf:type thesaurus:period.
thesaurus:bronstijd skos:prefLabel "Bronzezeit"@de.
thesaurus:bronstijd skos:prefLabel "bronze age"@en.
thesaurus:bronstijd skos:prefLabel "bronstijd"@nl .
The lines above are the triple notation (Turtle); the graphical notation of the same data:
[Graph: thesaurus:bronstijd -- rdf:type --> thesaurus:period; thesaurus:bronstijd -- skos:prefLabel --> „bronstijd“@nl, „Bronzezeit“@de, „bronze age“@en]
=> query result: a list with two columns
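How a query with two triple patterns and two language filters arrives at a two-column result can be traced by hand. The following is a minimal sketch in plain Python, not a SPARQL engine: the tuple layout and the function names are assumptions made for this illustration, and the language test is a plain equality check, whereas langMatches also accepts subtags such as "de-AT".

```python
# The four example triples as (subject, predicate, value, language) tuples.
# The language is None for objects that are not language-tagged literals.
triples = [
    ("thesaurus:bronstijd", "rdf:type", "thesaurus:period", None),
    ("thesaurus:bronstijd", "skos:prefLabel", "Bronzezeit", "de"),
    ("thesaurus:bronstijd", "skos:prefLabel", "bronze age", "en"),
    ("thesaurus:bronstijd", "skos:prefLabel", "bronstijd", "nl"),
]

def pref_labels(triples, lang):
    """Match '?a skos:prefLabel ?x' and keep labels tagged with lang."""
    return [(s, o) for (s, p, o, l) in triples
            if p == "skos:prefLabel" and l == lang]

def german_english_pairs(triples):
    """Join both patterns on the shared variable ?a; rows are (?de, ?en)."""
    return [(de, en)
            for (a1, de) in pref_labels(triples, "de")
            for (a2, en) in pref_labels(triples, "en")
            if a1 == a2]

rows = german_english_pairs(triples)
```

The Dutch label and the rdf:type triple match neither filter, so they do not contribute to the result.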
RDF
RDF graphs can represent arbitrarily complex structures and
can be freely extended with additional nodes, links (edges
pointing to external resources), etc.
[Graph: thesaurus:bronstijd -- rdf:type --> thesaurus:period -- rdfs:subClassOf --> thesaurus:arch_concept; thesaurus:bronstijd -- skos:prefLabel --> „bronstijd“@nl, „Bronzezeit“@de, „bronze age“@en]
„period“ is a subclass of „arch_concept“: (every) „period“ is an „arch_concept“
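The reading "(every) period is an arch_concept" is what a reasoner derives from rdfs:subClassOf. A minimal sketch of that inference, assuming the class hierarchy and the type assertions are held in plain dictionaries (the function names are illustrative only):

```python
def superclasses(cls, subclass_of):
    """All classes reachable from cls via rdfs:subClassOf (transitive)."""
    seen = set()
    todo = [cls]
    while todo:
        current = todo.pop()
        for parent in subclass_of.get(current, ()):
            if parent not in seen:
                seen.add(parent)
                todo.append(parent)
    return seen

def inferred_types(instance, types, subclass_of):
    """Asserted rdf:type values plus everything implied by the hierarchy."""
    result = set(types.get(instance, ()))
    for cls in list(result):
        result |= superclasses(cls, subclass_of)
    return result

# Data from the slide: period is a subclass of arch_concept.
subclass_of = {"thesaurus:period": {"thesaurus:arch_concept"}}
types = {"thesaurus:bronstijd": {"thesaurus:period"}}
```

With this data, `inferred_types("thesaurus:bronstijd", types, subclass_of)` contains both the asserted class and its superclass.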
RDF vocabularies
Specialized vocabularies for different kinds of
information do exist and can (and should) be re-used
=> uniform information representation
[Graph: thesaurus:bronstijd -- rdf:type --> thesaurus:period -- rdfs:subClassOf --> thesaurus:arch_concept; thesaurus:bronstijd -- skos:prefLabel --> „bronstijd“@nl, „Bronzezeit“@de, „bronze age“@en]
RDF and RDF Schema (RDFS): basic vocabulary for taxonomies
RDF vocabularies
Specialized vocabularies for different kinds of
information do exist and can (and should) be re-used
[Graph: thesaurus:bronstijd -- rdf:type --> thesaurus:period -- rdfs:subClassOf --> thesaurus:arch_concept; thesaurus:bronstijd -- skos:prefLabel --> „bronstijd“@nl, „Bronzezeit“@de, „bronze age“@en]
Additional vocabularies (SKOS, OWL, etc.): extended
vocabulary for taxonomies and ontologies
RDF vocabularies
Specialized vocabularies for different kinds of
information do exist
Domain-specific vocabularies can be specified as required
[Graph: thesaurus:bronstijd -- rdf:type --> thesaurus:period -- rdfs:subClassOf --> thesaurus:arch_concept; thesaurus:bronstijd -- skos:prefLabel --> „bronstijd“@nl, „Bronzezeit“@de, „bronze age“@en]
thesaurus: toy vocabulary for archeology
RDF vocabularies
• graphical browsers and editors, e.g.
– http://labs.sparna.fr/skos-play (SKOS)
– http://protege.stanford.edu/ (RDF, OWL, SKOS)
RDF vocabularies
• graphical browsers and editors do exist …
… as well as
• databases,
• W3C-standardized query language,
• APIs,
• reasoners,
• etc.
RDF vocabularies
• RDF resources and edges are defined by URIs
– as a URL, they may be accessed remotely
re-usable vocabularies & knowledge bases
– link to community-maintained data types/terms instead of
defining your own
• „Linked Open Data Cloud“ (LOD)
– a great variety of resources that are linked with each other
and released under an open license
Linked Data
• Linked Data
– rules of best practice for publishing data on the web
• (1) use URIs as names for things
– links to external URIs allow us to retrieve more
information from these sites
• (2) if they can be resolved via HTTP
• (3) and provide information as RDF*
• (4) and they include links to other URIs
then, this is Linked Data (informal)
http://www.w3.org/DesignIssues/LinkedData.html
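Rules (2) and (3) mean in practice that a client sends an HTTP request to the URI and asks for an RDF serialization. The sketch below only builds such a request with the Python standard library and does not send it; the helper name `rdf_request` and the choice of `text/turtle` as media type are assumptions for illustration, and whether a given server honors the Accept header has to be checked case by case.

```python
import urllib.request

def rdf_request(uri, media_type="text/turtle"):
    """Build (but do not send) an HTTP GET request that asks for RDF."""
    return urllib.request.Request(uri, headers={"Accept": media_type})

# The URI behind the abbreviation rdf:type, as quoted earlier in the slides.
request = rdf_request("http://www.w3.org/1999/02/22-rdf-syntax-ns#type")

# urllib.request.urlopen(request) would perform the actual lookup.
```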
Linked Data
• Linked Data
– rules of best practice for publishing data on the web
=> Information integration
– Structural interoperability
• comparable formats and protocols to access data
=> the same query language for different data sets
Linked Data
• Linked Data
– rules of best practice for publishing data on the web
=> Information integration
– Structural interoperability
– Conceptual interoperability
• develop and (re-)use shared vocabularies for equivalent
concepts
=> the same query on different data sets
Linked Data
• Linked Data
– rules of best practice for publishing data on the web
=> Information integration
– Structural interoperability
– Conceptual interoperability
– Federation
• data published on the web
– with a query interface (SPARQL end point)
=> a single query to query different datasets simultaneously
Linked Open Data (LOD)
LOD cloud: Aug 2014
[Figure: LOD cloud diagram, http://lod-cloud.net/. Labeled domains: media, linguistic LOD, bibliography, government data, geo information, social networks, life sciences]
Linking Corpora …
• … with other schemes
• … with terminology repositories
• … with lexical-semantic networks
• corpora as Linked Open Data
=> network effects
Structurally Interoperable
Language Resources
NLP Interchange Format (NIF):
Interoperability for NLP pipelines
Linked Data Corpus Creation with NIF
(NIF Reference Card)
Best practices to follow for the generation of Linked Data text
corpora, using the NLP Interchange Format (NIF).
Target audience
Corpus creators and users seeking to make corpora interoperable
and to publish them as linked data. Basic knowledge of RDF is
mandatory for conversion. Basic knowledge of linked data and
web server access is needed for publication.
Scope
Conversion of existing corpora into RDF using NIF, as well as
creation of linked data corpora from textual data.
Website: http://site.nlp2rdf.org
Github: http://github.com/nlp2rdf
Example corpus: http://brown.nlp2rdf.org
Core concepts
Corpus
We understand a corpus as a collection of documents. Documents contain text, represented as strings
of characters and annotations that provide more information about these strings. NIF provides a way to
identify strings using URIs and annotate them using an ontology.
String identification via URI:
Strings are identified using a URI scheme consisting of: the prefix of the corpus URI; the character
indices of beginning and end of the string; and a scheme identifier between document URI and string
position identifier. Character indices in NIF are counted offset based, starting at zero before the first
character and counting the gaps between the characters until after the last character of the referenced
string:
http://example.org/corpus/document#offset4_10
This URI scheme is valid for text/plain. Other mime types may require different URI schemes.
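The offset counting described above corresponds to zero-based slice indices. A minimal sketch, assuming a plain-text document; the function `string_uri` is hypothetical, and the two fragment spellings it produces are the `#offset_4_16` and `#char=3,12` styles that appear elsewhere in these slides, not a complete implementation of any NIF version.

```python
def string_uri(document_uri, text, substring, scheme="offset"):
    """Build a URI for the first occurrence of substring in text.

    Indices count the gaps between characters, starting at zero before
    the first character, exactly like Python slice boundaries.
    """
    begin = text.index(substring)  # raises ValueError if not found
    end = begin + len(substring)
    if scheme == "offset":
        return f"{document_uri}#offset_{begin}_{end}"
    if scheme == "char":
        return f"{document_uri}#char={begin},{end}"
    raise ValueError(f"unknown scheme: {scheme}")

text = "The Semantic Web is a good idea."
uri = string_uri("http://example.org/sem", text, "Semantic Web")
```

Because the indices are slice boundaries, `text[4:16]` gives back the annotated string.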
String annotation
After assigning URIs to meaningful strings of the corpus, these URIs can be annotated using the NIF
core ontology.
Example
String URIs
◦ Strings are the basis of analysis
◦ nif:anchorOf => string value
◦ offset-based
@base <http://example.org/prefix>
<#char=3,12>
<http://example.org/prefix#char=3,12>
Context
◦ Contains document text in nif:isString
◦ nif:beginIndex is always 0
◦ Strings refer to Context with
nif:referenceContext
Example
Pre-defined categories
◦ Word, Phrase, Sentence, Paragraph
◦ rdfs:subClassOf String
◦ hierarchy => subString
Pre-defined properties
◦ head, lemma, stem, posTag, …
oriented towards industrial applications
primarily used for Entity Linking (taIdentRef)
limited in scalability
limited support of dependency annotations, no
support for relational semantics
Example
Information Integration
◦ Annotations are attached to strings
◦ Implicit unification of divergent annotations
◦ If different tools annotate the same string, this refers to the same URI
NIF Core ontology
Rich vocabulary
• OWL-based
• redundant properties
• transitive
• inverse
• before ~ previousWord
NIF and computational linguistics
• NIF is used in
– EU projects
– community projects (e.g., Apache Stanbol)
– industrial applications
• Why not in NLP and linguistics?
– ignorance
“Standard data formats … I'm not sure these are important: if
someone can use a parser, they can probably also write a
Python wrapper”
Mark Johnson (2012), Computational Linguistics. Where do we go from
here?, invited plenary talk at the 50th Annual Meeting of the ACL, Jeju
NIF and computational linguistics
• NIF is used in
– EU projects
– community projects (e.g., Apache Stanbol)
– industrial applications
• Why not in NLP and linguistics?
– ignorance
– readability
human-readable?
CoNLL:
The            DT    _
Semantic Web   NNP   http://dbpedia.org/resource/Semantic_Web
is             VB    _
a              DT    _
good           JJ    _
idea           NN    _
NIF:
<http://example.org/sem#offset_4_16>
  a nif:String , nif:Phrase, nif:OffsetBasedString ;
  nif:anchorOf "Semantic Web"@en ;
  nif:beginIndex "4"^^xsd:int ;
  nif:endIndex "16"^^xsd:int ;
  nif:oliaLink <http://purl.org/olia/penn.owl#NNP> ;
  itsrdf:taIdentRef <http://dbpedia.org/resource/Semantic_Web> ;
  nif:referenceContext <http://example.org/sem#offset_0_32> .
NIF and computational linguistics
• NIF is used in
– EU projects
– community projects (e.g., Apache Stanbol)
– industrial applications
• Why not in NLP and linguistics?
– ignorance
– readability
– limited expressivity
• non-String annotations / empty strings?
• non-phrasal MWEs?
• hard-wired properties and concepts
– limited to morphosyntax, NER and dependency syntax
NIF and computational linguistics
• NIF is used in
– EU projects
– community projects (e.g., Apache Stanbol)
– industrial applications
• Why not in NLP and linguistics?
– ignorance
– readability
– limited expressivity
– established formats for „basic“ annotations
• Do we get anything that CoNLL-TSV doesn't give us yet?
NIF and computational linguistics
• NIF is used in
– EU projects
– community projects (e.g., Apache Stanbol)
– industrial applications
• Why not in NLP and linguistics?
– ignorance
– readability
– limited expressivity
– established formats for „basic“ annotations
– developed with a focus on Semantic Web applications
• neither linguistics nor NLP
NIF Alternatives
• TELIX
– motivation similar to NIF
– RDFa annotations, added to XML documents
• popularity decreases with the popularity of XML
• OpenAnnotation
– intended for expressing metadata over content
elements in HTML
• limited support for linguistic annotations
NIF Alternatives
• POWLA
– RDF/OWL reconstruction of an XML standoff
format
• capable of representing any annotation faithfully
– comes from a small and specialized sub-community in NLP
and linguistics (multi-layer corpora, discourse annotation)
• as unreadable as the original XML standoff format
– but better to process
CoNLL-RDF
Yet another corpus formalism
Technical motivations for corpora in RDF
• use off-the-shelf technologies for
– data storing (RDF triple/quad stores),
– querying (SPARQL),
– manipulation (SPARQL Update), and
– access (SPARQL end points)
• structurally interoperable with
– NLP output (NIF) and other RDF-based corpus
formats,
– dictionaries (lemon), and
– terminology bases
=> flexible information integration
CoNLL-RDF
• RDF-based formalism
– rather than OWL-based (NIF, POWLA)
– minimal: no transitive and inverse properties
– generic: no pre-defined categories nor properties
• Grounded in established and widely used
formats in the field
– tab-separated values: CoNLL format family
• comfortable
– import and export to human-readable
representations
[Figures: example data for each of the following]
CoNLL-X format (2006)
Tab-Separated Values
CoNLL-2009 additions: SRL
CoNLL-U: Universal Dependencies
Tool-specific CoNLL variants: SENNA
CoNLL formats
• comments begin with #
• sentences separated by empty line
• one word per line
• annotations separated by tab (=> columns)
• empty columns left empty, contain -, _ or O
• HEAD points to another word in the same
sentence
– „foreign key“, identified by sentential position/ID
• all other annotations assign string values
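The rules above are enough to read a CoNLL-style file. The following is a minimal sketch, not the parser used by the CoNLL-RDF tools; the function name `parse_conll` and the sample data are made up for illustration, and the column labels must be supplied by the caller because they differ between CoNLL dialects.

```python
def parse_conll(text, columns):
    """Split CoNLL-style text into sentences, each a list of row dicts."""
    sentences, current = [], []
    for line in text.splitlines():
        if line.startswith("#"):       # comment line
            continue
        if not line.strip():           # empty line closes a sentence
            if current:
                sentences.append(current)
                current = []
            continue
        values = line.split("\t")      # one word per line, tab-separated
        # Values without a matching column label are dropped by zip().
        current.append(dict(zip(columns, values)))
    if current:                        # last sentence without trailing blank line
        sentences.append(current)
    return sentences

sample = "# a comment\n1\tThe\tDT\t2\n2\tidea\tNN\t0\n\n1\tGood\tJJ\t0\n"
sentences = parse_conll(sample, ["ID", "WORD", "POS", "HEAD"])
```

HEAD stays a string here; resolving it to the word with that ID in the same sentence is the „foreign key“ lookup mentioned above.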
CoNLL-RDF
• generate URIs
– word: base URI + „s“ + sentence nr + „.“ + word ID
– sentence: word with ID „0“ (CoNLL root)
– base URI should refer to the original document in
a corpus, e.g.,
http://cormand.humanum.fr/01npogotiginin_kokorobola.dis.html#
CoNLL-RDF
• generate URIs
– word: base URI + „s“ + sentence nr + „.“ + word ID
• supply labels for all columns
datatype property in conll namespace
http://ufal.mff.cuni.cz/conll2009-st/taskdescription.html#<COLNAME>
or
conll:<COLNAME>
CoNLL-RDF
• generate URIs
– word: base URI + „s“ + sentence nr + „.“ + word ID
• supply labels for all columns
datatype property in conll namespace
• special treatment of HEAD
– object property pointing to head URI
CoNLL-RDF
• generate URIs
– word: base URI + „s“ + sentence nr + „.“ + word ID
• supply labels for all columns
datatype property in conll namespace
• special treatment of HEAD
– object property pointing to head URI
• minimal use of NIF vocabulary
– nif:Word, nif:nextWord, nif:nextSentence
CoNLL-RDF
• generate URIs
– word: base URI + „s“ + sentence nr + „.“ + word ID
• supply labels for all columns
datatype property in conll namespace
• special treatment of HEAD
– object property pointing to head URI
• minimal use of NIF vocabulary
• conventionally formatted to resemble CoNLL
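The conversion rules listed above can be traced with a small amount of code. This is a sketch under stated assumptions and not the CoNLL2RDF implementation: it takes already parsed rows (dicts keyed by column label, with an ID column), returns triples as tuples of strings with prefixed property names instead of writing Turtle, and treats only "" and "_" as empty values.

```python
EMPTY = {"", "_"}  # assumption: values treated as "no annotation"

def word_uri(base_uri, sentence_nr, word_id):
    """base URI + "s" + sentence nr + "." + word ID, as described above."""
    return f"<{base_uri}s{sentence_nr}.{word_id}>"

def literal(value):
    """Quote a string value, escaping backslashes and double quotes."""
    escaped = value.replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"'

def conll_to_triples(sentences, base_uri):
    triples = []
    for nr, sentence in enumerate(sentences, start=1):
        root = word_uri(base_uri, nr, 0)  # the sentence is the word with ID 0
        if nr > 1:
            triples.append((word_uri(base_uri, nr - 1, 0), "nif:nextSentence", root))
        previous = None
        for row in sentence:
            word = word_uri(base_uri, nr, row["ID"])
            triples.append((word, "rdf:type", "nif:Word"))
            if previous is not None:
                triples.append((previous, "nif:nextWord", word))
            previous = word
            for column, value in row.items():
                if value in EMPTY:
                    continue
                if column == "HEAD":  # object property pointing to the head URI
                    triples.append((word, "conll:HEAD", word_uri(base_uri, nr, value)))
                else:                 # datatype property in the conll namespace
                    triples.append((word, f"conll:{column}", literal(value)))
    return triples

rows = [[{"ID": "1", "WORD": "The", "POS": "DT", "HEAD": "2"},
         {"ID": "2", "WORD": "idea", "POS": "NN", "HEAD": "0"}]]
triples = conll_to_triples(rows, "http://example.org/doc#")
```

The base URI `http://example.org/doc#` is a placeholder; in a real corpus it would point to the original document.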
CoNLL vs. CoNLL-RDF
...
CoNLL-RDF processing
• three simple Java programs
svn:/intern/incubator/conll-rdf/
– CoNLL2RDF
basic converter
read CoNLL file, write Turtle
CoNLL-RDF processing
• three simple Java programs
svn:/intern/incubator/conll-rdf/
– CoNLL2RDF
basic converter
– CoNLLStreamExtractor
graph manipulation
read CoNLL file sentence by sentence, for every sentence,
apply SPARQL Update statements, write Turtle
CoNLL-RDF processing
• three simple Java programs
svn:/intern/incubator/conll-rdf/
– CoNLL2RDF
basic converter
– CoNLLStreamExtractor
graph manipulation
– CoNLLRDFFormatter
basic visualization
read CoNLL-RDF, output: CoNLL-like visualization
• special treatment of dependencies
• limited to conll namespace
• coloring on Unix shells
• imposes some naming conventions
CoNLL-RDF processing
• three simple Java programs
svn:/intern/incubator/conll-rdf/
– CoNLL2RDF
basic converter
– CoNLLStreamExtractor
graph manipulation
– CoNLLRDFFormatter
basic visualization
– anything beyond this is handled by SPARQL
update queries
• modular pipeline (examples in *.sh)
• re-usable modules (given appropriate documentation)
– maybe hard to read, but easy to write your own!
An example
Requires a Unix shell
Please look into the code!
An example: acoli-example.sh
An example: acoli-example.sh
read one or several files
if you have one file only, say FILE.conll, write „cat FILE.conll | \“ instead
An example: acoli-example.sh
call CoNLLStreamExtractor (run.sh just adds the classpath)
base URI (can be any URL, etc.)
column labels in the order they occur in the CoNLL file
An example: acoli-example.sh
do some layouting
$* means that command-line arguments are passed to CoNLLRDFFormatter
but these are not necessary
An example: acoli-example.sh
do some layouting
$* means that command-line arguments are passed to CoNLLRDFFormatter
but these are not necessary
Please try for yourself
Manipulating the data
Between reading and writing, manipulations can be applied to the graph.
-u => next are some SPARQL Update queries
these queries are read from the files given as arguments
optionally, a number of iterations can be supplied in {0}
sparql/remove-ID.sparql
• For more complicated manipulations, we also
require insertions
– Use INSERT { ... } after DELETE { ... } and before WHERE
shift-reduce/initialize-SHIFT.sparql
• This is from another pipeline, see shift-reduce-example.sh
Task: Write a SPARQL update query
• Given the UD annotations
– write a query that extracts the main verb
(predicate) of a sentence
– add this information to the graph
?verb a conll:predicate.
– add this query to acoli-example.sh and run it
use this as a template.
Example, extended
Linking with OLiA
Linking Corpora
OLiA
(Schmidt et al. 2006, Chiarcos 2008, 2010, 2012)
[Figure: OLiA architecture
– External Reference Models (Terminology Repositories): GOLD, ISOcat (morphosyntax), OntoTag (morphosyntax), TDS ontology
– OLiA Reference Model
– Annotation Models
  German: STTS, TIGER, TüBa-D/Z, Connexor
  English: Penn, Brown, Susanne, etc.
  EAGLES: 11 European languages
  MULTEXT/East: 15 (mostly) Eastern European languages]
Linking:
given a POWLA individual i
if annotations of i match OLiA annotation model specs
then declare i an instance of the corresponding OLiA class
Integrating external information
link English POS / dependencies with OLiA
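The linking rule (if the annotation of a word matches an annotation model entry, declare the word an instance of the corresponding class) amounts to two lookups. A minimal sketch with hypothetical data: the `penn:` and `olia:` names in the two dictionaries are illustrative and were not checked against the published OLiA files, and the actual pipeline does this with SPARQL Update rather than Python.

```python
# Hypothetical excerpt of an annotation model: tag string -> model concept.
PENN_TAGS = {
    "NNP": "penn:NNP",
    "NN": "penn:NN",
    "DT": "penn:DT",
    "JJ": "penn:JJ",
}

# Hypothetical linking: annotation model concept -> reference model class.
PENN_LINK = {
    "penn:NNP": "olia:ProperNoun",
    "penn:NN": "olia:CommonNoun",
    "penn:DT": "olia:Determiner",
    "penn:JJ": "olia:Adjective",
}

def link_pos(word, tag):
    """Return the rdf:type triples that the linking rule adds for one word."""
    concept = PENN_TAGS.get(tag)
    if concept is None:          # annotation does not match the model
        return []
    triples = [(word, "rdf:type", concept)]
    reference = PENN_LINK.get(concept)
    if reference is not None:    # the concept is linked to the reference model
        triples.append((word, "rdf:type", reference))
    return triples
```

For example, `link_pos(":s1.2", "NNP")` yields one triple for the annotation model concept and one for the reference model class, while an unknown tag yields nothing.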
sparql/link-penn-POS-simple.sparql
Task II: Extend this such that ?a is also an
instance of superclasses of ?concept
Example, further extended
Consult an online dictionary
Getting German glosses for English text
• SPARQL permits accessing endpoints and web
services on the web
– in the WHERE part of the query
SERVICE <URL> { ... }
– e.g. DBnary, an RDF edition of various
Wiktionaries
• http://kaiko.getalp.org/sparql
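A SPARQL end point is queried over HTTP by passing the query text in a `query` parameter, and a SERVICE block lets another SPARQL engine forward a graph pattern to it. The sketch below only assembles the strings with the Python standard library and sends nothing; the helper names and the exploratory query are assumptions for illustration, and only the end point URL comes from the slide.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

ENDPOINT = "http://kaiko.getalp.org/sparql"  # end point named on the slide

def query_url(endpoint, query):
    """URL for a SPARQL protocol GET request carrying the query text."""
    return endpoint + "?" + urlencode({"query": query})

def federated(endpoint, pattern):
    """Wrap a graph pattern in SERVICE so that it is evaluated remotely."""
    return f"SELECT * WHERE {{ SERVICE <{endpoint}> {{ {pattern} }} }}"

# A generic exploratory query: which classes are used in the data?
explore = "SELECT DISTINCT ?class WHERE { ?x a ?class } LIMIT 20"
url = query_url(ENDPOINT, explore)
```

Opening `url` in a browser or with an HTTP client would run the exploratory query, provided the end point is reachable.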
Exploring an end point
What datasets are in here?
There seems to be a German DBnary
Exploring an end point
What does the German dictionary contain?
lemon:LexicalEntry may be what we're looking for
Exploring an end point
Examples for LexicalEntries
Ok, let's look into „Abenteuer“
Exploring an end point
Examples for LexicalEntries
Indeed, this contains dbnary:isTranslationOf
Exploring an end point
targetLanguage => language
dbnary:writtenForm => String
Check an example translation
Explore an end point
etc., until we have pairs of strings from DBnary
Explore an end point
Adding German translations
sparql/gloss-en-to-de-DBnary.sparql
Sample output
Task III: return a word-by-word German
„translation“ (and nothing else)
• hints:
{ ... } UNION { ... } => logical or
FILTER(NOT EXISTS { ... }) => negation