
Ontology-based Annotation
Sergey Sosnovsky
@PAWS@SIS@PITT
Outline
• O-based Annotation
• Conclusion
• Questions
Why Do We Need Annotation
Annotation-based Services
• Knowledge Management
  • Integration of Dispersed Information (knowledge-based linking)
  • Better Indexing and Retrieval (based on the document semantics)
  • Content-based Adaptation (modeling document content in terms of the domain model)
  • Organizations' Repositories as mini Webs (Boeing, Rolls Royce, Fiat, GlaxoSmithKline, Merck, NPSA, …)
• Collaboration Support
  • Knowledge sharing and communication
What is Added by O-based Annotation
• Ontology-driven processing (effective formal reasoning)
• Connecting to other O-based services (O-mapping, O-visualization, …)
• Unified vocabulary
• Connecting to the rest of SW knowledge
Definition
O-based Annotation is a process of
creating a mark-up of Web-documents using a pre-existing ontology
and/or
populating knowledge bases by marked up documents
Example: “Michael Jordan plays basketball”
  MichaelJordan   rdf:type    our:Athlete
  Basketball      rdf:type    our:Sports
  MichaelJordan   our:plays   Basketball
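In RDF terms, the annotation adds these three triples to a knowledge base. A minimal sketch with Python's rdflib; the our: namespace URI and resource names are assumptions for illustration:

from rdflib import Graph, Namespace, RDF

OUR = Namespace("http://example.org/our#")  # hypothetical namespace URI

g = Graph()
g.bind("our", OUR)

# The three triples produced by annotating the sentence:
g.add((OUR.MichaelJordan, RDF.type, OUR.Athlete))
g.add((OUR.Basketball, RDF.type, OUR.Sports))
g.add((OUR.MichaelJordan, OUR.plays, OUR.Basketball))

print(g.serialize(format="turtle"))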
List of Tools
• AeroDAML / AeroSWARM
• Annotea / Annozilla
• Armadillo
• AktiveDoc
• COHSE
• GOA
• KIM Semantic Annotation Platform
• MagPie
• Melita
• MnM
• OntoAnnotate
• Ontobroker
• OntoGloss
• ONTO-H
• Ont-O-Mat / S-CREAM / CREAM
• Ontoseek
• Pankow
• SHOE Knowledge Annotator
• Seeker
• Semantik
• SemTag
• SMORE
• Yawas
• …
Information Extraction Tools:
• Alembic
• Amilcare / T-REX
• Annie
• Fastus
• Lasie
• Proteus
• SIFT
• …
Important Characteristics
• Automation of Annotation (manual / semi-automatic / automatic / editable)
• Ontology-related issues:
  • pluggable ontology (yes / no);
  • ontology language (RDFS / DAML+OIL / OWL / …);
  • local / anywhere access;
  • ontology elements available for annotation (concepts / instances / relations / triples);
  • where annotations are stored (in the annotated document / on a dedicated server / where specified);
  • annotation format (XML / RDF / OWL / …).
• Annotated Documents:
  • document kinds (text / multimedia)
  • document formats (plain text / HTML / PDF / …)
  • document access (local / Web)
• Architecture / Interface / Interoperability (standalone tool / web interface / web component / API / …)
• Annotation Scale (large – the WWW size / small – a hundred documents)
• Existing Documentation / Tutorial
• Availability
SMORE
• Manual Annotation
• OWL-based Markup
• Simultaneous O modification (if necessary)
• ScreenScraper mines metadata from annotated pages and suggests them as candidates for the markup
• Post-annotation O-based Inference (a sketch follows the example below)
Example (as before): “Michael Jordan plays basketball”
  MichaelJordan   rdf:type    our:Athlete
  Basketball      rdf:type    our:Sports
  MichaelJordan   our:plays   Basketball
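Post-annotation O-based inference means a reasoner can derive new facts from the annotation plus the ontology's axioms, e.g. deriving the two rdf:type triples from the our:plays assertion alone. A minimal sketch, assuming the ontology declares domain/range for our:plays and using the owlrl package (the namespace URI is hypothetical; SMORE's own reasoner works differently):

from rdflib import Graph, Namespace, RDF, RDFS
from owlrl import DeductiveClosure, RDFS_Semantics

OUR = Namespace("http://example.org/our#")  # hypothetical namespace URI
g = Graph()

# Ontology axioms: whoever plays something is an Athlete; what is played is a Sport.
g.add((OUR.plays, RDFS.domain, OUR.Athlete))
g.add((OUR.plays, RDFS.range, OUR.Sports))

# The only triple asserted by the annotator:
g.add((OUR.MichaelJordan, OUR.plays, OUR.Basketball))

DeductiveClosure(RDFS_Semantics).expand(g)   # materialize the inferences

assert (OUR.MichaelJordan, RDF.type, OUR.Athlete) in g
assert (OUR.Basketball, RDF.type, OUR.Sports) in g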
Problems of Manual Annotation
• Expensive / Time-consuming
• Difficult / Error-prone
• Subjective (two people annotating the same documents annotate 15–30% of them differently)
• Never ending:
  • new documents
  • new versions of ontologies
• Annotation storage problem:
  • where?
• Trust in the owner's annotation:
  • incompetence
  • spam (Google does not use <META> info)
Solution: Dedicated Automatic Annotation Services (“Search Engine”-like)
Automatic O-based Annotation
• Supervised:
  • MnM
  • S-CREAM
  • Melita & AktiveDoc
• Unsupervised:
  • SemTag - Seeker
  • Armadillo
  • AeroSWARM
MnM
• Ontology-based Annotation Interface:
  • Ontology browser (rich navigation capabilities)
  • Document browser (usually a Web browser)
  • Annotation is mainly based on select-drag-and-drop association of text fragments with ontology elements
  • A built-in or external ML component classifies the main corpus of documents
• Activity Flow (sketched below):
  • Markup: a human user manually annotates a training set of documents with ontology elements
  • Learn: a learning algorithm is run over the marked-up corpus to learn the extraction rules
  • Extract: an IE mechanism is selected and run over a set of documents
  • Review: a human user observes the results and corrects them if necessary
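How the four steps interlock can be seen in a toy sketch: the "learner" below merely memorizes the word immediately preceding each annotated span as an extraction cue, whereas MnM delegates this step to a real IE engine such as Amilcare. All names and data here are hypothetical:

import re
from collections import defaultdict

# 1. Markup: a manually annotated training set (span -> ontology concept).
training = [
    ("Michael Jordan plays basketball", {"basketball": "our:Sports"}),
    ("Serena Williams plays tennis", {"tennis": "our:Sports"}),
]

# 2. Learn: collect the word preceding each annotated span as a cue.
cues = defaultdict(set)
for text, annotations in training:
    for span, concept in annotations.items():
        m = re.search(r"(\w+)\s+" + re.escape(span), text)
        if m:
            cues[concept].add(m.group(1).lower())

# 3. Extract: apply the learned cues to unseen documents.
def extract(text):
    found = {}
    for concept, words in cues.items():
        for cue in words:
            for m in re.finditer(rf"\b{re.escape(cue)}\s+(\w+)", text, re.IGNORECASE):
                found[m.group(1)] = concept
    return found

suggestions = extract("Lionel Messi plays football")

# 4. Review: a human confirms or corrects each suggestion.
for span, concept in suggestions.items():
    print(f"suggest: '{span}' -> {concept}  [accept / correct?]")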
Amilcare and T-REX
• Amilcare:
  • Automatic IE component
  • Used in at least five O-based A tools (Melita, MnM, OntoAnnotate, Ont-O-Mat, SemantiK)
  • Released to about 50 industrial and academic sites
  • Java API
  • Recently succeeded by T-REX
Pankow
• Input: a Web page
• Step 1: the page is scanned for phrases that might be categorized as instances of the ontology (a part-of-speech tagger finds candidate proper nouns)
• Result 1: a set of candidate proper nouns
• Step 2: the system iterates through all candidate proper nouns and all candidate ontology concepts to derive hypothesis phrases using preset linguistic patterns
• Result 2: a set of hypothesis phrases
• Step 3: Google is queried for each hypothesis phrase
• Result 3: the number of hits for each hypothesis phrase
• Step 4: the system sums up the query results to a total for each instance–concept pair, then categorizes each candidate proper noun into its highest-ranked concept (a sketch follows below)
• Result 4: an ontologically annotated Web page
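A minimal sketch of Steps 2–4. The linguistic patterns are simplified Hearst-style templates, and web_hit_count is a hypothetical stand-in for the search-engine queries the original system issued:

PATTERNS = [
    "{instance} is a {concept}",
    "{concept}s such as {instance}",    # naive "+s" pluralization, illustration only
    "{instance} and other {concept}s",
]

def categorize(proper_nouns, concepts, web_hit_count):
    """Assign each candidate proper noun its highest-ranked concept (Step 4)."""
    result = {}
    for noun in proper_nouns:
        # Steps 2-3: build hypothesis phrases and count their web hits.
        totals = {
            concept: sum(
                web_hit_count(p.format(instance=noun, concept=concept))
                for p in PATTERNS
            )
            for concept in concepts
        }
        best = max(totals, key=totals.get)
        if totals[best] > 0:            # no evidence at all -> leave unannotated
            result[noun] = best
    return result

# Usage with a canned counter (a real one would query a search engine):
fake_counts = {"Paris is a city": 120000, "Paris and other citys": 150}
hits = lambda phrase: fake_counts.get(phrase, 0)
print(categorize(["Paris"], ["city", "hotel"], hits))  # {'Paris': 'city'}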
SemTag - Seeker
• IBM-developed
• ~264 million Web pages
• ~72 thousand concepts (TAP taxonomy)
• 434 million automatically disambiguated semantic tags
• Spotting pass:
  • Documents are retrieved from the Seeker store and tokenized
  • Tokens are matched against the TAP concepts
  • Each resulting label is saved with ten words to either side as a “window” of context around the particular candidate object
• Learning pass:
  • A representative sample of the data is scanned to determine the corpus-wide distribution of terms at each internal node of the taxonomy; the TBD (taxonomy-based disambiguation) algorithm is used
• Tagging pass (a sketch follows below):
  • “Windows” are scanned once more to disambiguate each reference and determine a TAP object
  • A record is entered into a database of final results containing the URL, the reference, and any other associated metadata
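A simplified stand-in for the tagging pass (not IBM's TBD implementation): each spotted label's ten-word context window is compared against per-node term distributions gathered in the learning pass, and the best-matching node wins if it clears a similarity threshold. All node names and data are hypothetical:

from collections import Counter
from math import sqrt

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = sqrt(sum(v * v for v in a.values()))
    nb = sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def disambiguate(window_words, node_distributions, threshold=0.1):
    """Pick the taxonomy node whose learned term distribution best matches the window."""
    window = Counter(w.lower() for w in window_words)
    scored = {node: cosine(window, dist) for node, dist in node_distributions.items()}
    best = max(scored, key=scored.get)
    # Below the threshold the spot stays untagged (ambiguous / off-topic).
    return best if scored[best] >= threshold else None

# Usage with toy distributions for two senses of "Jordan":
dists = {
    "tap:AthleteJordan": Counter({"basketball": 8, "nba": 5, "plays": 3}),
    "tap:CountryJordan": Counter({"river": 6, "amman": 4, "middle": 3}),
}
window = "the bulls star plays basketball better than anyone in the nba".split()
print(disambiguate(window, dists))  # -> tap:AthleteJordan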
Conclusions
• Web-document A is a necessary thing
• O-based A adds benefits (O-based post-processing, unified vocabularies, etc.)
• Manual A is a bad thing (expensive, error-prone, subjective)
• Automatic A is a good thing:
  • Supervised O-based A:
    • useful O-based interface for annotating the training set
    • traditional IE tools for textual classification
  • Unsupervised O-based A:
    • COHSE – matches concept names from the ontology and a thesaurus against tokens from the text
    • Pankow – uses the ontology to build candidate queries, then uses community wisdom to choose the best candidate
    • SemTag – uses concept names to match tokens, and hierarchical relations in the ontology to disambiguate between candidate concepts for a text fragment
Questions?