Methods for the Automatic Construction of Topic Maps
Eric Freese, Senior Consultant
ISOGEN International

Agenda
• Automatic Construction from Structured Documents
• Automatic Construction from Unstructured Documents

Contextual Harvesting
• Markup can provide clues about the information within a document
• Largely dependent on semantic markup
• Takes advantage of nesting within elements
• Rules can be developed for harvesting data to build topic map constructs
  – Rules could then be applied to similar types of documents

DTD/Schema Development to Support Harvesting
• DTDs like HTML are mostly useless to a harvesting system
• Flat structures make associations between elements more difficult
• New DTD/schema development should take possible knowledge harvesting into account

Content-Based Harvesting
• Combination of contextual and natural language harvesting
• Text is parsed and clues within the text are used to harvest knowledge.
• HTML documents where labels are included in the text could be processed this way

NLP Strategies
• Named Entity recognition
  – A list of entities (people, companies, places, etc.) is defined
  – Programs parse a corpus of information to identify entities
  – Limited to the completeness of the entity list

NLP Strategies – cont.
• Concept extraction
  – A list of key words can be defined much like the named entity strategy
  – Common strings may also be identified and suggested as new concepts
  – More processing intensive than named entity

NLP Strategies – cont.
• Taxonomic classification
  – Documents are analysed and classified according to a human-defined taxonomy
  – Specialized programs must be developed that are able to understand the taxonomy
    • Must also be able to process synonyms and related concepts

NLP Strategies – cont.
• Discourse analysis
  – Programs are developed that attempt to understand the meaning of a text
  – Analyze the parts of speech using a lexicon and rules to attempt to derive the meanings and usages of words

Steps in Processing Natural Language
• Tokenization
• Part of Speech Tagging
• Bracketing
• Identification of useful structures

Tokenization
• GOAL: Prepare text for processing by a natural language processing system
• Flowing paragraph text is broken into sentence units
• Sentence units are broken into word tokens
• Word tokens are prepared for part of speech processing
  – Contractions and other constructs receive special processing
  – n’t, ’s

Part of Speech (POS) Tagging
• GOAL: Identify the part(s) of speech for each word token
• Lexicon: a list of words with the possible POS tags for each word
• EXAMPLE:
  – sound - NNP JJ NN VB
    • The Sound of Music, Puget Sound (NNP)
    • a sound decision (JJ)
    • the sound of silence (NN)
    • sound the alarm (VB)

POS Tagging – cont.
• POS tagging is difficult
• Exception processing is often required for phrases and grouped words
• EXAMPLE:
  – Time flies like an arrow.
  – Fruit flies like an apple.

Bracketing
• GOAL: Identify groupings of words into phrases and the hierarchical relationship of phrases to one another
• A set of rules is used to identify how different parts of speech and phrases can be combined to form larger phrases.
• EXAMPLE:
  – [NP, DT, JJ, NN]
  – A noun phrase can consist of a determiner followed by an adjective followed by a noun

Benefits of the phased approach
• The separation of functions allows this approach to be applied to any language.
  – A lexicon is developed for the language
  – Rules for language construction are defined
  – A generic engine is able to process the data
• The separation of the lexicon and the rule base allows the model to be modified/improved as the corpus of text grows.

Putting it all together
• EXAMPLE: The red ball rolled down the hill.
• Tokenization and POS tagging
  – The DT
  – red JJ NNP
  – ball NN
  – rolled VBD VBN
  – down RB IN RBR VBP JJ NN RP
  – hill NN

Putting it all together – cont.
• EXAMPLE: The red ball rolled down the hill.
• Bracketing rules
  – [S, NP, VP]
  – [NP, DT, JJ, NN]
  – [VP, VBD, PP]
  – [PP, IN, NP]
  – [NP, DT, NN]
• RESULT (using XML for bracketing):
<S><NP><DT>The</DT><JJ>red</JJ><NN>ball</NN></NP>
<VP><VBD>rolled</VBD><PP><IN>down</IN>
<NP><DT>the</DT><NN>hill</NN></NP></PP></VP></S>

Harvesting Considerations
• The GIGO (garbage in, garbage out) rule is in effect
• The harvesting process only hastens topic map construction
• Only some of the topic map merging rules are applicable
  – Limited prospect of meaningful subject identities
• Humans must still participate in the process of knowledge organization in order to maintain quality
  – Selective inclusion in the topic map/knowledge base

Q&A
Questions or comments welcome at:
ISOGEN International
1611 W. County Road B, Suite 204
St. Paul, MN 55113 USA
Voice: 1.651.636.9180 - Fax: 1.651.636.9191
[email protected]
www.isogen.com

Demonstrations
• SemanText – open source
• Ontopia Knowledge Suite – commercial

Harvesting Knowledge using SemanText
• Contextual Harvesting
• Natural Language-Based Harvesting

Contextual Harvesting in SemanText
• As in NLP, XML markup is used to identify document structures
• Users write rules for harvesting information into topic map structures

Harvesting XML Content
• Data can be harvested from any XML document (including RDF) into a topic map without the need to develop specialized programs
• The specification language is based on XPath
• In the future, the location from which the data was harvested will also be recorded as an occurrence

Harvesting RDF Content
• RDF structures can be harvested in order to build topic maps
• Demonstrates the possibility of interoperability between the two models
• Rules can be established for flavors of RDF (e.g. Dublin Core, DAML/OIL)
  – Allows any document using the tagging scheme to be harvested
• Only binary associations can be generated

NLP in SemanText
• The initial lexicon included within SemanText is based on a lexicon derived from the Penn Treebank tagging of the Brown corpus (1 million+ words) and a very large sample from the Wall Street Journal (approx. 3 million words)
• SemanText provides the ability to identify and add new words to the lexicon through a GUI.

NLP in SemanText – cont.
• SemanText uses a public domain parser to tokenize flowing text and identify the appropriate POS for each word token based on a set of bracketing rules.
• XML is used to denote the bracketing
• This XML markup allows SemanText to process natural language using its contextual harvesting capability
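
The phased approach described in these slides (tokenization, lexicon-based POS lookup, bracketing, then harvesting from the XML result) can be illustrated with a small sketch. This is not SemanText's implementation: the miniature `LEXICON`, the function names (`tokenize`, `parse`, `expand`, `bracket`, `harvest_noun_phrases`), and the choice of a simple top-down backtracking parser are all assumptions made for illustration. Only the example sentence, the candidate tags, and the five bracketing rules come from the slides.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical miniature lexicon: word -> possible POS tags (taken from the slide example).
LEXICON = {
    "the": ["DT"],
    "red": ["JJ", "NNP"],
    "ball": ["NN"],
    "rolled": ["VBD", "VBN"],
    "down": ["RB", "IN", "RBR", "VBP", "JJ", "NN", "RP"],
    "hill": ["NN"],
}

# Bracketing rules from the slide: phrase label -> alternative sequences of parts.
RULES = {
    "S": [["NP", "VP"]],
    "NP": [["DT", "JJ", "NN"], ["DT", "NN"]],
    "VP": [["VBD", "PP"]],
    "PP": [["IN", "NP"]],
}


def tokenize(sentence):
    """Split a sentence into word tokens; n't and 's become separate tokens."""
    sentence = re.sub(r"(n't|'s)\b", r" \1", sentence)
    return re.findall(r"[A-Za-z']+", sentence)


def parse(symbol, tokens, pos):
    """Yield (element, next_pos) for each way `symbol` matches tokens from `pos`."""
    if symbol in RULES:
        # Phrase label: try every rule for it.
        for rhs in RULES[symbol]:
            for children, end in expand(rhs, tokens, pos):
                node = ET.Element(symbol)
                node.extend(children)
                yield node, end
    elif pos < len(tokens) and symbol in LEXICON.get(tokens[pos].lower(), []):
        # POS tag: matches if the lexicon lists this tag for the current token.
        leaf = ET.Element(symbol)
        leaf.text = tokens[pos]
        yield leaf, pos + 1


def expand(rhs, tokens, pos):
    """Yield (children, next_pos) for each way the sequence `rhs` matches from `pos`."""
    if not rhs:
        yield [], pos
        return
    for first, mid in parse(rhs[0], tokens, pos):
        for rest, end in expand(rhs[1:], tokens, mid):
            yield [first] + rest, end


def bracket(sentence):
    """Return the first full-sentence bracketing as an XML element, or None."""
    tokens = tokenize(sentence)
    for tree, end in parse("S", tokens, 0):
        if end == len(tokens):
            return tree
    return None


def harvest_noun_phrases(tree):
    """Contextual harvesting step: collect the text of every NP as a candidate topic name."""
    return [" ".join(np.itertext()) for np in tree.findall(".//NP")]


if __name__ == "__main__":
    tree = bracket("The red ball rolled down the hill.")
    print(ET.tostring(tree, encoding="unicode"))
    print(harvest_noun_phrases(tree))
```

In this sketch the bracketing rules also resolve the POS ambiguity: "down" has seven candidate tags, but only IN lets the rule [PP, IN, NP] match, so that is the tag that appears in the output. The harvesting step uses the limited XPath subset supported by `xml.etree.ElementTree`; whether noun phrases are the right structures to turn into topics is a decision for the human-written harvesting rules.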