Methods for the Automatic Construction of Topic Maps

Methods for the Automatic
Construction of Topic Maps
Eric Freese, Senior Consultant
ISOGEN International
Agenda
• Automatic Construction from Structured
Documents
• Automatic Construction from
Unstructured Documents
Contextual Harvesting
• Markup can provide clues about the
information within a document
• Largely dependent on semantic markup
• Takes advantage of nesting within elements
• Rules can be developed for harvesting data
to build topic map constructs
– Rules could then be applied to similar types of
documents
DTD/Schema Development to
Support Harvesting
• DTDs like HTML are mostly useless to a
harvesting system
• Flat structures make associations
between elements more difficult
• New DTD/schema development should
take possible knowledge harvesting into
account
Content-Based Harvesting
• Combination of contextual and natural
language harvesting
• Text is parsed and clues within the text
are used to harvest knowledge.
• HTML documents where labels are
included in the text could be processed
this way
NLP Strategies
• Named Entity recognition
– A list of entities (people, companies,
places, etc.) is defined
– Programs parse a corpus of information to
identify entities
– Limited to the completeness of the entity
list
NLP Strategies – cont.
• Concept extraction
– A list of key words can be defined much
like the named entity strategy
– Common strings may also be identified and
suggested as new concepts
– More processing intensive than named
entity
NLP Strategies – cont.
• Taxonomic classification
– Documents are analysed and classified
according to a human-defined taxonomy
– Specialized programs must be developed
that are able to understand the taxonomy
• Must also be able to process synonyms and
related concepts
NLP Strategies – cont.
• Discourse analysis
– Programs are developed that attempt to
understand the meaning of a text
– Analyze the parts of speech using a
lexicons and rules to attempt to derive the
meanings and usages of words
Steps in Processing Natural
Language
•
•
•
•
Tokenization
Part of Speech Tagging
Bracketing
Identification of useful structures
Tokenization
• GOAL: Prepare text for processing by a
natural language processing system
• Flowing paragraph text is broken into
sentence units
• Sentence units are broken into word tokens
• Word tokens are prepared for part of speech
processing
– Contractions and other constructs receive special
processing
– n’t, ’s
Part of Speech (POS) Tagging
• GOAL: Identify the part(s) of speech for each
word token
• Lexicon – list of words with the possible POS
tags for each word
• EXAMPLE:
– sound - NNP JJ NN VB
•
•
•
•
The Sound of Music, Puget Sound
a sound decision
the sound of silence
sound the alarm
POS Tagging – cont.
• POS tagging is difficult
• Exception processing often required for
phrases and grouped words
• EXAMPLE:
– Time flies like an arrow.
– Fruit flies like an apple.
Bracketing
• GOAL: Identify groupings of words into
phrases and the hierarchical relationship of
phrases to one another
• A set of rules is used to identify how different
parts of speech and phrases can be
combined to form larger phrases.
• EXAMPLE:
– [NP, DT, JJ, NN]
– Noun phrase can consist of a determiner followed
by adjective followed by a noun
Benefits of the phased approach
• The separation of functions allows this
approach to be applied to any language.
– A lexicon is developed for the language
– Rules for language construction are defined
– Generic engine is able to process the data
• The separation of the lexicon and the rules
base allows the model to be
modified/improved as the corpus of text
grows.
Putting it all together
• EXAMPLE: The red ball rolled down the hill.
• Tokenization and POS tagging
–
–
–
–
–
–
The DT
red JJ NNP
ball NN
rolled VBD VBN
down RB IN RBR VBP JJ NN RP
hill NN
Putting it all together – cont.
• EXAMPLE: The red ball rolled down the hill.
• Bracketing rules
–
–
–
–
–
[S, NP, VP]
[NP, DT, JJ, NN]
[VP, VBD, PP]
[PP, IN, NP]
[NP, DT, NN]
• RESULT (using XML for bracketing):
<S><NP><DT>The</DT><JJ>red</JJ><NN>ball</NN>
</NP><VP><VBD>rolled</VBD><PP><IN>down</IN>
<NP><DT>the</DT><NN>hill</NN></NP></VP></S>
Harvesting Considerations
• GIGO rule in effect
• The harvesting process only hastens topic
map construction
• Only some of the topic map merging rules are
applicable
– Limited prospect of meaningful subject identities
• Humans must still participate in the process
of knowledge organization in order to
maintain quality
– Selective inclusion in the topic map/knowledge
base
Q&A
Questions or comments welcome at:
ISOGEN International
1611 W. County Road B, Suite 204
St. Paul, MN 55113 USA
Voice: 1.651.636.9180 - Fax: 1.651.636.9191
[email protected]
www.isogen.com
Demonstrations
• SemanText – open source
• Ontopia Knowledge Suite – commercial
Harvesting Knowledge using
SemanText
• Contextual Harvesting
• Natural Language-Based Harvesting
Contextual Harvesting in
SemanText
• XML markup is used as in NLP to
identify document structures
• Users write rules for harvesting
information into topic maps structures
Harvesting XML Content
• Data can be harvested from any XML
document (including RDF) into a topic map
without the need to develop specialized
programs
• Specification language is based on Xpath
• In future, will also record the location from
which the data was harvested as an
occurrence
Harvesting RDF Content
• RDF structures can be harvested in order
to build topic maps
• Demonstrates the possibility of
interoperability between the two models
• Rules can be established for flavors of RDF
(e.g. Dublin Core, DAML/OIL)
– Allows any document using the tagging
scheme to be harvested
• Only binary associations can be generated
NLP in SemanText
• The initial lexicon included within
SemanText is based on a lexicon derived
from the Penn Treebank tagging of the Brown
corpus (1 million+ words) and a very large
sample from the Wall Street Journal (approx.
3 million words)
• SemanText provides the ability to identify
and add new words to the lexicon through a
GUI.
NLP in SemanText – cont.
• SemanText uses a public domain parser to
tokenize flowing text and identify the
appropriate POS for each word token based
on a set of bracketing rules.
• XML is used to denote the bracketing
• This XML markup allows SemanText to
process natural language using its
contextual-based harvesting capability