Natural Language Processing

By
Dr. A. S. Alvi
What is Understanding?
 To understand something is to transform it from one
representation into another.
What makes Understanding Hard?
 The complexity of the target representation into which the
mapping is being done.
 The type of mapping: one-one, one-many, many-one,
many-many.
 The level of interaction of the components of the source
representation.
Understanding of Single Sentences:
 Understanding Words: The process of determining the
correct meaning of an individual word is called word sense
disambiguation or lexical disambiguation.
 Understanding Sentences:
Syntactic analysis: Linear sequences of words are
transformed into structures that show how the words relate
to each other.
Semantic analysis: The structures created by the syntactic
analyzer are assigned meaning.
Pragmatic analysis: The structure representing what was
said is reinterpreted to determine what was actually meant.
Understanding of Multiple Sentences:
 The following relationships must be recognized:
1. Identical objects. Consider the text
Bill had a red balloon. John wanted it.
The word it should be identified as referring to the red
balloon.
2. Parts of objects. Consider the text
John opened the book he just bought. The title page was
torn.
The phrase ‘the title page’ should be recognized as
being part of the book that was just bought.
Understanding of Multiple Sentences:
3. Parts of actions.
John went on a business trip to New York. He left on an early
morning flight.
Taking a flight should be recognized as part of going on
a trip.
4. Objects involved in actions:
Bill decided to drive to the store. He went outside but his
car wouldn’t start.
Bill’s car should be recognized as an object involved in
his driving to the store.
Understanding of Multiple Sentences:
5. Causal chains. Consider the text
There was a big snow storm yesterday. The schools are
closed today.
The snow storm should be recognized as the cause of the schools being closed.
6. Planning sequences. Consider the text
Tom wanted a new car. He decided to get a job.
Getting a job should be recognized as part of Tom’s plan for getting a new car.
Definitions
 Natural languages are languages that living creatures use
for communication
 Artificial Languages are mathematically defined classes of
signals that can be used for communication with machines
 A language is a set of sentences that may be used as
signals to convey semantic information
 The meaning of a sentence is the semantic information it
conveys
Characteristics of Successful Natural Language
Systems
 Successful systems share two properties:
– they are focused on a particular domain rather than allowing
discussion of any topic
– they are focused on a particular task rather than attempting
to understand language completely
 This means that a natural language system is more likely to
work correctly if the set of possible inputs is restricted: the
larger the space of possible inputs, the lower the likelihood
of success.
Machine Translation
 Example machine translation systems include:
– TAUM-METEO system
• translates weather reports from English to French
• works very well since language in government weather reports is highly
stylized and regular
– SPANAM system
• translates Spanish into English
• worked on a more open domain
• results were reasonably good although resulting English text was not
always grammatical and very rarely fluent
– AVENTINUS system
• advanced information system for multilingual drug enforcement
• allows law enforcement officials to know what the foreign document is
about
• sorts, classifies and analyzes drug related information
Machine Translation (Cont)
 There are three basic types of machine translation:
– Machine-assisted (aided) human translation (MAHT)
• the translation is performed by human translator, but
he/she uses a computer as a tool to improve or speed up
the translation process
– Human-assisted (aided) machine translation (HAMT)
• the source language text is modified by human translator
either before, during or after it is translated by the
computer
– Fully automatic machine translation (FAMT)
• the source language text is fed into the computer as a file,
and the computer produces a translation automatically
without any human intervention
1.2.1 Machine Translation (Cont)
 Standing on its own, unrestricted machine translation
(FAMT) is still inadequate
– Human-assisted machine translation (HAMT) could be used to
improve the quality of translation
• one possibility is to have a human reader go over the text after the
translation, correcting grammar errors (post-processing)
– human reader can save a lot of time since some of the text will be
translated correctly
– sometimes a monolingual human can edit the output without reading the
original
• another possibility is to have a human reader edit the document before
translation (pre-processing)
– make the original conform to a restricted subset of the language
– this will usually allow the system to translate the resulting text without
any requirement for post-editing
1.2.1 Machine Translation (Cont)
 Restricted languages are sometimes called “Caterpillar
English”
– Caterpillar was the first company to try writing their manuals using
pre-processing
– Xerox was the first company to use the pre-processing approach really successfully (with the SYSTRAN system)
• language defined for their manuals was highly restricted, thus
translation into other languages worked quite well
 There is a substantial start-up cost to any machine
translation effort
– to achieve broad coverage, translation systems should have
lexicons of 20,000 to 100,000 words and grammars of 100 to
10,000 rules (depending on the choice of formalism)
1.2.1 Machine Translation (Cont)
 There are several basic theoretical approaches to machine
translation:
– Direct MT Strategy
• based on good glossaries and morphological analysis (a toy sketch
follows this list)
• always between a specific pair of languages
– Transfer MT Strategy
• first, source language is parsed into an abstract internal representation
• a ‘transfer’ is then made into the corresponding structures in the target
language
– Interlingua MT Strategy
• the idea is to create an artificial intermediate language
– it shares all the features and makes all the distinctions of all languages
– Knowledge-Based Strategy
• similar to the above
• the intermediate form is semantic rather than syntactic in nature
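The Direct strategy can be made concrete with a minimal sketch, assuming a hand-built glossary for one language pair; the glossary entries below are invented for illustration, and real direct systems also rely on morphological analysis.

```python
# Toy sketch of the Direct MT strategy: word-by-word glossary substitution
# for one specific language pair (English -> French here), with no parsing.
# The glossary entries are invented for illustration only.
GLOSSARY_EN_FR = {"the": "le", "weather": "temps", "is": "est", "cold": "froid"}

def direct_translate(sentence):
    # unknown words are passed through untranslated
    return " ".join(GLOSSARY_EN_FR.get(word, word) for word in sentence.lower().split())

print(direct_translate("The weather is cold"))   # le temps est froid
```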
1.2.2 Database Access
 The first major success of natural language processing
 There was a hope that databases could be controlled by
natural languages instead of complicated data retrieval
commands
– this was a major problem in the early 1970s since the staff in
charge of data retrieval could not keep up with users' demand
for data
 LUNAR system was the first such interface
– built by William Woods in 1973 for NASA Manned Spacecraft
Center
– system was able to correctly answer 78% of the questions such as:
“What is the average modal plagioclase concentration for lunar
samples that contain rubidium?”
1.2.2 Database Access (Cont)
 Other examples of data retrieval systems would include:
– CHAT system
• developed by Fernando Pereira in 1983
• similar level of complexity to the LUNAR system
• worked on geographical databases
• was restricted
– question wording was very important
– TEAM system
• could handle a wider set of problems than CHAT
• was still restricted and unable to handle all types of input
1.2.2 Database Access (Cont)
 Companies such as Natural Language Inc. and Symantec
are still selling database tools that use natural language
 The ability to have natural language control of databases is
not as big a concern as it was in the 1970s
– graphical user interface and integration of spreadsheets, word
processors, graphing utilities, report generating utilities, etc are of
greater concern to database buyers today
– mathematical or set notation seems to be a more natural way of
communicating with a database than plain English
– with advent of SQL, the problem of data retrieval is not as major as
it was in the past
1.2.3 Text Interpretation
 In the early 1980s, most online information was stored in
databases and spreadsheets
 Now, most online information is text: email, news,
journals, articles, books, encyclopedias, reports, essays,
etc
– there is a need to sort this information to reduce it to some
comprehensible amount
 Has become a major field in natural language processing
– becoming more and more important with expansion of the Internet
– consists of:
• information retrieval
• text categorization
• data extraction
1.2.3.1 Information Retrieval
 Information retrieval (IR) is also known as information
extraction (IE)
 Information retrieval systems analyze unrestricted text in
order to extract specific types of information
 IR systems do not attempt to understand all of the text in
all of the documents, but they do analyze those portions of
each document that contain relevant information
– relevance is determined by pre-defined domain guidelines which
must specify, as accurately as possible, exactly what types of
information the system is expected to find
• query would be a good example of such a pre-defined domain
– documents that contain relevant information are retrieved while
other are ignored
1.2.3.1 Information Retrieval (Cont)
 Sometimes documents could be represented by a
surrogate, such as the title and a list of keywords
and/or an abstract
 It is more common to use the full text, possibly subdivided
into sections that each serve as a separate document for
retrieval purposes
 The query is normally a list of words typed by the user
– Boolean combinations of words were used by earlier systems to
construct queries
• users found it difficult to get good results from Boolean queries
• it was hard to find a combination of “AND”s and “OR”s that will
produce appropriate results
1.2.3.1 Information Retrieval (Cont)
 Boolean model has been replaced by vector-space model in
modern IR systems
– in vector-space model every list of words (both the documents and
query) is treated as a vector in n-dimensional vector space (where
n is the number of distinct tokens in the document collection)
– can use a “1” in a vector position if that word appears and “0” if it
does not
– vectors are then compared to determine which ones are close
– vector model is more flexible than Boolean model
• documents can be ranked and closest matches could be reported first
1.2.3.1 Information Retrieval (Cont)
 There are many variations on vector-space model
– some allow stating that two words must appear near each other
– some use thesaurus to automatically augment the words in the
query with their synonyms
 A good discriminator must be chosen in order for the
system to be effective
– common words like “a”, “the” don’t tell us much since they occur
in just about every document
– a good way to set up the retrieval is to give a term a larger weight
if it appears in a small number of documents
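The weighting idea in the last point can be sketched with an inverse-document-frequency style score; the documents below are invented for illustration, and the log formula is one common choice rather than the only one.

```python
import math

# Terms that appear in few documents get a larger weight; common words like
# "the" get a weight of zero because they are poor discriminators.
docs = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "crude oil prices rose sharply",
]

def idf(term, documents):
    containing = sum(1 for d in documents if term in d.split())
    return math.log(len(documents) / containing) if containing else 0.0

print(round(idf("the", docs), 2))    # 0.0  -> occurs in every document
print(round(idf("crude", docs), 2))  # 1.1  -> occurs in one document
```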
1.2.3.1 Information Retrieval (Cont)
 Another way to think about IR is in terms of databases.
An IR system attempts to convert unstructured text
documents into codified database entries. Database entries
might be drawn from a set of fixed values, or they can be
actual sub-strings pulled from the original source text.
 From a language processing perspective, IR systems must
operate at many levels, from word recognition to sentence
analysis, and from understanding at the sentence level on
up to discourse analysis at the level of full text document.
 Dictionary coverage is an especially challenging problem
since open-ended documents can be filled with all manner
of jargon, abbreviations, and proper names, not to mention
typos and telegraphic writing styles.
1.2.3.1 Information Retrieval (Cont)
 Example: (Vector-Space Model) we assume that we have
one very short document that contains one sentence:
“CPSC 533 is the best Computer Science course at UofC”;
also assume that our query is: “UofC”
– we need to set up our n-dimensional vector space: we have 10
distinct tokens (one for every word in the sentence)
– we are going to set up the following vector to represent the
sentence: (1,1,1,1,1,1,1,1,1,1) -- indicating that all ten words are
present
– we are going to set the following vector for the query:
(0,0,0,0,0,0,0,0,0,1) -- indicating that “UofC” is the only word
present in the query
– by ANDing the two vectors together, we get (0,0,0,0,0,0,0,0,0,1)
meaning that our document contains “UofC”, as expected
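The same worked example expressed as a short sketch: a binary vector over the document's ten tokens, a binary vector for the query, and their AND.

```python
# The slide's worked vector-space example: a 10-dimensional binary vector
# for the document, a binary vector for the query, and their AND.
doc   = "CPSC 533 is the best Computer Science course at UofC".split()
query = ["UofC"]

vocab     = doc                                   # the 10 distinct tokens, in document order
doc_vec   = [1 if token in doc   else 0 for token in vocab]
query_vec = [1 if token in query else 0 for token in vocab]
overlap   = [d & q for d, q in zip(doc_vec, query_vec)]

print(doc_vec)     # [1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
print(query_vec)   # [0, 0, 0, 0, 0, 0, 0, 0, 0, 1]
print(overlap)     # [0, 0, 0, 0, 0, 0, 0, 0, 0, 1] -> the document contains "UofC"
```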
1.2.3.1 Information Retrieval (Cont)
 Example: Commercial System (HIGHLIGHT):
– helps users find relevant information in large volumes of text and
present it in a structured fashion
• it can extract information from newswire reports for a specific topic
area - such as global banking, or the oil industry - as well as current
and historical financial and other data
– although its accuracy will never match the decision-making skills
of a trained human expert, HIGHLIGHT can process large
amounts of text very quickly, allowing users to discover more
information than even the most trained professional would have
time to look for
– see Demo at: http://www-cgi.cam.sri.com/highlight/
– could be classified under “Extracting Data From Text (1.2.3.3)”
1.2.3.2 Text Categorization
 It is often desirable to sort all text into several categories
 There are a number of companies that provide their
subscribers access to all news on a particular industry,
company or geographic area
– traditionally, human experts were used to assign the categories
– in the last few years, NLP systems have proven very accurate
(correctly categorizing over 90% of the news stories)
 Context in which text appears is very important since the
same word could be categorized completely differently
depending on the context
– Example: in a dictionary, the primary definition of the word
“crude” is vulgar, but in a large sample of the Wall Street Journal,
“crude” refers to oil 100% of the time
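A toy sketch of this kind of categorization: a news snippet is assigned to the category whose cue words it mentions most often. The categories and cue words below are invented for illustration; real systems learn such associations from labeled training data.

```python
# Toy keyword-based categorizer: score each category by the number of its
# cue words that appear in the text, and pick the highest-scoring category.
CUES = {
    "oil industry": {"crude", "barrel", "opec", "refinery"},
    "banking":      {"bank", "loan", "interest", "deposit"},
}

def categorize(text):
    words = set(text.lower().split())
    scores = {category: len(words & cues) for category, cues in CUES.items()}
    return max(scores, key=scores.get)

print(categorize("Crude futures jumped after OPEC cut barrel quotas"))   # oil industry
```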
1.2.3.3 Extracting Data From Text
 The task of data extraction is to take on-line text and derive
from it some assertions that can be put into a structured
database
 Examples of data extraction systems include:
– SCISOR system
• able to take stock information text (such as the type released by Dow
Jones News Service) and extract important stock information
pertaining to:
– events that took place
– companies involved
– starting share prices
– quantity of shares that changed hands
– effect on stock prices
Current Topic
 1.0 Definitions
 1.1 Characteristics of Successful Machines
 1.2 Practical Applications
– 1.2.1 machine translation
– 1.2.2 database access
– 1.2.3 text interpretation
• 1.2.3.1 information retrieval
• 1.2.3.2 text categorization
• 1.2.3.3 extracting data from text
 2.0 Efficient Parsing
 3.0 Scaling up the Lexicon
 4.0 List of References
2.0 Efficient Parsing
 Parsing -- the act of analyzing the grammaticality of an
utterance according to some specific grammar
– the previous sentence was “parsed” according to some grammar of
English, and it was determined to be grammatical
– we read the words in some order (from left to right; from right to
left; or in random order) and analyzed them one-by-one
 Each parse is a different method of analyzing some target
sentence according to some specified grammar
2.0 Efficient Parsing (Cont)
 Simple left-to-right parsing is often insufficient
– it is hard to determine the nature of the sentence
• this means that we have to make an initial guess as to what it is the
sentence is saying
• this forces us to backtrack if the guess is incorrect
 Some backtracking is inevitable
– to make parsing efficient, we want to minimize the amount of
backtracking
• even if a wrong guess is made, we know that a portion of the sentence
has already been analyzed -- there is no need to start from scratch
since we can use the information that is available to us
2.0 Efficient Parsing (Cont)
 Example: we have two sentences:
– “Have students in section 2 of Computer Science 203 take the
exam.”
– “Have students in section 2 of Computer Science 203 taken the
exam?”
• first ten words: “Have students in section 2 of Computer Science 203”
are exactly the same although the meanings of the two sentences are
completely different
• if an incorrect guess is made, we can still use the first ten words when
we backtrack
– this will require a lot less work
2.0 Efficient Parsing (Cont)
 There are three main things that we can do to improve
efficiency:
– don’t do twice what you can do once
– don’t do once what you can avoid altogether
– don’t represent distinctions that you don’t need
 To accomplish these we can use a data structure known as
chart (matrix) to store partial results
– this is a form of dynamic programming
– results are only calculated if they can not be found in the chart
– only a portion of the calculations that can not be found in the chart
is done while the rest is retrieved from the chart
– algorithms that do this are called chart parsers
2.0 Efficient Parsing (Cont)
 Examples of parsing techniques:
– Top-Down, Depth-First
– Top-Down, Breadth-First
– Bottom-Up, Depth-First Chart
– Prolog
– Feature Augmented Phrase Structure
 These are not the only parsing techniques that exist
 One is free to come up with his or her own algorithm for
the order in which individual words in every sentence will
be analyzed
2.0 Efficient Parsing (Cont)
 i) Top-Down, Depth-First
– uses a strategy of searching for phrasal constituents from the
highest node (the sentence node) to the terminal nodes (the
individual lexical items) to find a match to the possible syntactic
structure of the input sentence
– stores attempts on a possibilities list as a stacked data structure
(LIFO)
 ii) Top-Down, Breadth-First
– same searching strategy as Top-Down, Depth-First
– stores attempts on a possibilities list as a queued data structure
(FIFO)
2.0 Efficient Parsing (Cont)
 iii) Bottom-Up, Depth-First Chart
– parse begins at the word level and uses the grammar rules to build
higher-level structures (“bottom-up”), which are combined until a
goal state is reached or until all the applicable grammar rules have
been exhausted
 iv) Prolog
– relies on the functionality of Prolog Programming Language to
generate a parse using Top-Down, Depth-First algorithm
– naturally deals with constituents and their relationships
 v) Feature Augmented Phrase Structure
– takes sentence as input and parses it by accessing information in a
featured phrase-structure grammar and lexicon
– parser output is a tree
2.0 Efficient Parsing (Cont)
 Chart parsing can be represented pictorially using a
combination of n + 1 vertices and a number of edges
 Notation for edge labels:
[<Starting Vertex>, <Ending Vertex>, <Result> →
<Part 1> ... <Part n> • <Needed Part 1> ... <Needed Part k>]
– if Needed Parts are added to already available Parts then Result
would be the outcome, spanning edges from Starting Vertex to
Ending Vertex
– see examples (two pages down)
 If there are no Needed Parts (if k = 0), then the edge is
called complete
– edge is called incomplete otherwise
2.0 Efficient Parsing (Cont)
 Chart-parsing algorithms use a combination of top-down and
bottom-up processing
– this means that it never has to consider certain constituents that could
not lead to a complete parse
– this also means that it can handle grammars with both left-recursive
rules and rules with empty right-hand sides without going into an
infinite loop
– result of our algorithm is a packed forest of parse tree constituents
rather than an enumeration of all possible trees
 Chart parsing consists of forming a chart with n + 1 vertices
and adding edges to the chart one at a time, trying to produce
a complete edge that spans from vertex 0 to n and is of
category S (sentence): [0, n, S → NP VP •]. There is no
backtracking -- everything that is put into the chart stays there
2.0 Efficient Parsing (Cont)
 A) Edge [0, 5, S → NP VP •] -- says an NP followed by
VP combine to make an S that spans the string from 0 to 5
 B) Edge [0, 2, S → NP • VP] -- says that an NP spans the
string from 0 to 2, and if we could find a VP to follow it,
then we would have an S
2.0 Efficient Parsing (Cont)
 There are four ways to add an edge to the chart:
– Initializer
• adds an edge to indicate that we are looking for the start symbol of the
grammar, S, starting at position 0, but have not found anything yet
– Predictor
• takes an incomplete edge that is looking for an X and adds new
incomplete edges, that if completed, would build an X in the right place
– Completer
• takes an incomplete edge that is looking for an X and ends at vertex j
and a complete edge that begins at j and has X as the left-hand side, and
combines them to make a new edge where the X has been found
– Scanner
• similar to the completer, except that it uses the input words rather than
existing complete edges to generate the X
2.0 Efficient Parsing (Cont)
Nondeterministic Chart Parsing Algorithm (figure)
2.0 Efficient Parsing (Cont)
 Nondeterministic Chart Parsing Algorithm:
– treats the chart as a set of edges
– a new edge is non-deterministically added to the chart at every
step (an edge is non-deterministically chosen from the possible
additions)
– S is the start symbol and S’ is the new nonterminal symbol
• we start out looking for S (i.e. we currently have an empty string)
– add edges using one of the three methods (predictor, completer,
scanner), one at a time until no new edges can be added
– at the end, if the required parse exists, it is found
– if none of the methods could be used to add another edge to the set,
the algorithm terminates
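Below is a minimal Python sketch of such a chart parser, using the toy grammar and lexicon implied by the "I feel it" trace on the following slides. The edge layout, rule names and agenda discipline are simplifications for illustration, not the textbook's exact algorithm.

```python
# Minimal chart-parser sketch: an edge is (start, end, lhs, found, needed).
# The initializer seeds the agenda; predictor, scanner and completer add
# edges until no new edge can be added. Grammar/lexicon cover "I feel it".
GRAMMAR = {
    "S'": [["S"]],                       # S' is the new start symbol
    "S":  [["NP", "VP"]],
    "NP": [["Pronoun"]],
    "VP": [["Verb"], ["VP", "NP"]],
}
LEXICON = {"i": "Pronoun", "feel": "Verb", "it": "Pronoun"}

def parse(words):
    chart, agenda = set(), [(0, 0, "S'", (), ("S",))]         # INITIALIZER
    while agenda:
        edge = agenda.pop()
        if edge in chart:
            continue
        chart.add(edge)
        start, end, lhs, found, needed = edge
        if needed:
            nxt = needed[0]
            for rhs in GRAMMAR.get(nxt, []):                   # PREDICTOR
                agenda.append((end, end, nxt, (), tuple(rhs)))
            if end < len(words) and LEXICON.get(words[end].lower()) == nxt:
                agenda.append((start, end + 1, lhs,            # SCANNER
                               found + (nxt,), needed[1:]))
        for (s2, e2, lhs2, found2, needed2) in list(chart):    # COMPLETER
            if not needed and needed2 and e2 == start and needed2[0] == lhs:
                agenda.append((s2, end, lhs2, found2 + (lhs,), needed2[1:]))
            if needed and not needed2 and s2 == end and needed[0] == lhs2:
                agenda.append((start, e2, lhs, found + (lhs2,), needed[1:]))
    return chart

chart = parse("I feel it".split())
# a complete S edge spanning vertices 0..3 means the parse succeeded
print(any(e[:3] == (0, 3, "S") and not e[4] for e in chart))   # True
```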
2.0 Efficient Parsing (Cont)
Chart for a Parse of: “I feel it” (figure)
2.0 Efficient Parsing (Cont)
 Using the sample chart on the previous page, the following
steps are taken to complete the parse of “I feel it” -- page 1/3:
– 1. INITIALIZER: if we parse from vertex 0 to vertex 0 and look for S’, we
still need to find S -- (a)
– 2. PREDICTOR: we are looking for an incomplete edge, that if
completed, would give us S -- we know that S consists of NP and VP,
meaning that by going from 0 to 0 we will have S if we find VP and NP
-- (b)
– 3. PREDICTOR: following a very similar rule, we know that we will
have NP if we can find a Pronoun; this condition can be achieved by
going from 0 to 0, looking for a Pronoun -- (c)
– 4. SCANNER: if we go from 0 to 1, parsing “I” we will have our NP
since a Pronoun is found -- (d)
2.0 Efficient Parsing (Cont)
 Example (continued) -- page 2/3:
– 5. COMPLETER: summarizing the above steps, we are looking
for S and by going from 0 to 1 we have NP and are still looking for
VP -- (e)
– 6. PREDICTOR: we are now looking for VP and by going from 1
to 1 we will have VP if we can find a Verb -- (f)
– 7. PREDICTOR: VP can consist of another VP and NP, meaning
that 6 would also work if we can find VP and NP -- (g)
– 8. SCANNER: by going from 1 to 2 we can find a Verb, thus we
can find VP -- (h)
– 9. COMPLETER: using 7 and 8, we know that since VP is found
we can complete VP by going from 1 to 2 and finding NP -- (i)
– 10. PREDICTOR: NP can be completed by going from 2 to 2 and
finding a Pronoun -- (j)
2.0 Efficient Parsing (Cont)
 Example (continued) -- page 3/3:
– 11. SCANNER: we can find a Pronoun if we go from 2 to 3, thus
completing NP -- (k)
– 12. COMPLETER: using 7 - 11, we know that VP can be found by
going from 1 to 3, thus finding NP and VP -- (l)
– 13. COMPLETER: using all of the information we collected up to
this point, one can get S by going from 0 to 3, thus finding the
original NP and VP, where VP consists of another VP and NP -- (m)
 All of these steps are summarized on the diagram on the
next page
2.0 Efficient Parsing (Cont)
Trace of a Parse of “I feel it”
2.0 Efficient Parsing (Cont)
Left-Corner Parsing Algorithm
2.0 Efficient Parsing (Cont)
 Left-Corner Parsing:
– avoids building some edges that could not possibly be part of an S
spanning the whole string
– builds up a parse tree that starts with the grammar’s start symbol
and extends down to the last word in the sentence
– the Non-deterministic Chart Parsing Algorithm is an example of a left-corner parser
– using example on the previous slide:
• “ride the horse” would never be considered as VP
– saves time since unrealistic combinations do not have to be first
worked out and then discarded
2.0 Efficient Parsing (Cont)
 Extracting Parses From the Chart: Packing
– when the chart parsing algorithm finishes, it returns an entire chart
(collection of parse trees)
– what we really want is a parse tree (or several parse trees)
– Ex:
• a) pick out parse trees that span the entire input
• b) pick out parse trees that for some reason do not span the entire
input
– the easiest way to do this is to modify COMPLETER so that when
it combines two child edges to produce a parent edge, it stores in
the parent edge the list of children that comprise it.
– when we are done with the parse, we only need to look in chart[n]
for an edge that starts at 0, and recursively look at the children
lists to reproduce a complete parse tree
2.0 Efficient Parsing (Cont)
 Keeps track of the entire parse tree
 We can look in chart[n] for an edge that starts at 0, and
recursively look at the children lists to reproduce a complete
parse tree
A Variant of the Nondeterministic Chart Parsing Algorithm (figure)
Current Topic
 1.0 Definitions
 1.1 Characteristics of Successful Machines
 1.2 Practical Applications
– 1.2.1 machine translation
– 1.2.2 database access
– 1.2.3 text interpretation
• 1.2.3.1 information retrieval
• 1.2.3.2 text categorization
• 1.2.3.3 extracting data from text
 2.0 Efficient Parsing
 3.0 Scaling up the Lexicon
 4.0 List of References
3.0 Scaling Up the Lexicon
 In real text-understanding systems, the input is a sequence
of characters from which the words must be extracted
 Four step process for doing this consists of:
–
–
–
–
tokenization
morphological analysis
dictionary lookup
error recovery
 Since many natural languages are fundamentally different,
these steps would be much harder to apply to some
languages than others
3.0 Scaling Up the Lexicon (Cont)
 a) Tokenization
– process of dividing the input into distinct tokens -- words and
punctuation marks.
– this is not easy in some languages, like Japanese, where there are
no spaces between words
– this process is much easier in English although it is not trivial by
any means
– examples of complications may include:
• A hyphen at the end of the line may be an interword or an intraword
dash
– tokenization routines are designed to be fast, with the idea that as
long as they are consistent in breaking up the input text into
tokens, any problems can always be handled at some later stage of
processing
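A sketch of a fast, consistent English tokenizer in the spirit of this slide; the regular expression is a simplification and deliberately leaves hard cases (such as end-of-line hyphens) to later stages of processing.

```python
import re

# Split English text into word and punctuation tokens. The pattern keeps
# internal hyphens and apostrophes inside a word token; any other
# non-whitespace character becomes its own punctuation token.
TOKEN = re.compile(r"\w+(?:[-']\w+)*|[^\w\s]")

def tokenize(text):
    return TOKEN.findall(text)

print(tokenize("John opened the book he'd just bought; the title-page was torn."))
# ['John', 'opened', 'the', 'book', "he'd", 'just', 'bought', ';', 'the',
#  'title-page', 'was', 'torn', '.']
```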
3.0 Scaling Up the Lexicon (Cont)
 b) Morphological Analysis
– the process of describing a word in terms of the prefixes, suffixes
and root forms that comprise it
– there are three ways that words can be composed:
• Inflectional Morphology
– reflects the changes to a word that are needed in a particular
grammatical context (Ex: most nouns take the suffix “s” when they are
plural)
• Derivational Morphology
– derives a new word from another word that is usually of a different
category (Ex: the noun “softness” is derived from the adjective “soft”)
• Compounding
– takes two words and puts them together (Ex: “bookkeeper” is a
compound of “book” and “keeper”)
– used a lot in morphologically complex languages such as German,
Finnish, Turkish, Inuit, and Yupik
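The three processes can be illustrated with deliberately simplistic rules that only cover the regular cases and the slide's own examples.

```python
# Toy illustration of the three ways words are composed; each rule only
# handles the regular cases mentioned on this slide.
def pluralize(noun):                # inflectional morphology: most nouns take "s"
    return noun + "s"

def adjective_to_noun(adjective):   # derivational morphology: "soft" -> "softness"
    return adjective + "ness"

def compound(first, second):        # compounding: "book" + "keeper" -> "bookkeeper"
    return first + second

print(pluralize("balloon"), adjective_to_noun("soft"), compound("book", "keeper"))
# balloons softness bookkeeper
```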
3.0 Scaling Up the Lexicon (Cont)
 c) Dictionary Lookup
– is performed on every token (except for special ones such as
punctuation)
– the task is to find the word in the dictionary and return its
definition
– two ways to do dictionary lookup:
• store morphologically complex words first:
– complex words are written to the dictionary and then looked up when needed
• do morphological analysis first:
– process the word before looking anything up
– Ex: “walked” -- strip off “ed” and look up “walk”
» if the verb is not marked as irregular, then “walked” would be the
past tense of “walk”
– any implementation of the table abstract data type can serve as a
dictionary: hash tables, binary trees, b-trees, and tries
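A sketch of the "morphological analysis first" strategy: strip a known suffix, look up the root, and build the analysis, unless the token itself is stored directly (e.g. as an irregular form). The toy dictionary entries are illustrative only.

```python
# Toy dictionary lookup with suffix stripping; irregular forms are stored
# directly, regular past tenses are derived from the root entry.
DICTIONARY = {
    "walk": {"pos": "verb"},
    "go":   {"pos": "verb", "past": "went"},
    "went": {"pos": "verb", "tense": "past", "root": "go"},   # irregular, stored directly
}

def lookup(token):
    if token in DICTIONARY:
        return DICTIONARY[token]
    if token.endswith("ed") and token[:-2] in DICTIONARY:
        root = token[:-2]
        if DICTIONARY[root]["pos"] == "verb":
            return {"pos": "verb", "tense": "past", "root": root}
    return None     # not found: hand over to error recovery

print(lookup("walked"))   # {'pos': 'verb', 'tense': 'past', 'root': 'walk'}
print(lookup("went"))     # {'pos': 'verb', 'tense': 'past', 'root': 'go'}
```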
3.0 Scaling Up the Lexicon (Cont)
 d) Error Recovery
– is undertaken when a word is not found in the dictionary
– there are four types of error recovery:
• morphological rules can guess at the word’s syntactic class
– Ex: “smarply” is not in the dictionary but it is probably an adverb
• capitalization is a clue that a word is a proper name
• other specialized formats denote dates, times, social security numbers,
etc
• spelling correction routines can be used to find a word in the
dictionary that is close to the input word
– there are two popular models for defining “closeness” in words:
» Letter-Based Model
» Sound-Based Model
3.0 Scaling Up the Lexicon (Cont)
 Letter-Based Model
– an error consists of inserting or deleting a single letter, transposing
two adjacent letters or replacing one letter with another
– Ex: a 10-letter word is one error away from 555 other strings:
• 10 deletions -- each of the ten letters could be deleted
• 9 swaps -- _x_x_x_x_x_x_x_x_x_ there are nine possible swaps
where “x” signifies that “_” on its left and right could be switched
• 10 x 25 replacements -- each of the ten letters can be replaced by
(26 - 1) letters of the alphabet
• 11 x 26 insertions -- x_x_x_x_x_x_x_x_x_x_x and each “x” can be
one of the 26 letters of the alphabet
• total = 10 + 9 + 250 + 286 = 555
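The counting above can be checked with a short sketch that enumerates every candidate single edit (essentially the standard one-edit enumeration used in spelling correctors); note that distinct resulting strings can be slightly fewer when different edits happen to produce the same string.

```python
ALPHABET = "abcdefghijklmnopqrstuvwxyz"

def one_edit_candidates(word):
    """Every candidate one edit away: deletions, adjacent transpositions,
    replacements by a different letter, and insertions."""
    splits     = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes    = [a + b[1:] for a, b in splits if b]
    transposes = [a + b[1] + b[0] + b[2:] for a, b in splits if len(b) > 1]
    replaces   = [a + c + b[1:] for a, b in splits if b for c in ALPHABET if c != b[0]]
    inserts    = [a + c + b for a, b in splits for c in ALPHABET]
    return deletes + transposes + replaces + inserts

# 10 deletions + 9 transpositions + 10*25 replacements + 11*26 insertions = 555
print(len(one_edit_candidates("abcdefghij")))   # 555
```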
3.0 Scaling Up the Lexicon (Cont)
 Sound-Based Model
– words are translated into canonical form that preserves most of
information needed to pronounce the word, but abstracts away the
details
– Ex: a word such as “attention” might be translated into the
sequence [a, T, a, N, S, H, a, N], where “a” stands for any vowel
• this would mean that words such as “attension” and “atennshun”
translate to the same sequence
• if no other word in the dictionary translates into the same sequence,
then we can unambiguously correct the spelling error
• NOTE: letter-based approach would work just as well for “attention”
but not for “atennshun”, which is 5 errors away from “attention”
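The canonical form above can be approximated with a few toy rewrite rules; the rules below are invented to cover only the slide's three example spellings and are nowhere near a full pronunciation model.

```python
import re

def sound_key(word):
    """Toy sound-based canonical form: collapse doubled consonants, treat
    'ti'/'si' before a vowel as 'sh', and map every vowel run to 'a'."""
    w = word.lower()
    w = re.sub(r"([bcdfghjklmnpqrstvwxz])\1", r"\1", w)   # "tt" -> "t", "nn" -> "n"
    w = re.sub(r"[ts]i(?=[aeiou])", "sh", w)              # "-tion"/"-sion" sound
    w = re.sub(r"[aeiouy]+", "a", w)                      # any vowel run -> generic vowel
    return w

print(sound_key("attention"), sound_key("attension"), sound_key("atennshun"))
# atanshan atanshan atanshan  -> all three spellings map to the same key
```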
3.0 Scaling Up the Lexicon (Cont)
 Practical NLP systems have lexicons of from 10,000 to
100,000 root word forms
– building such a sizable lexicon is very time consuming and
expensive
• this has been a cost that dictionary publishing companies and
companies with NLP programs have not been willing to share
 WordNet is an exception to this rule:
– freely available dictionary, developed by a group at Princeton (led
by George Miller)
– the diagram on the next slide gives an example of the type of
information returned by WordNet about the word “ride”
3.0 Scaling Up the Lexicon (Cont)
WordNet Example for the Word “ride” (figure)
3.0 Scaling Up the Lexicon (Cont)
 Although dictionaries like WordNet are useful, they do not
provide all the lexical information one would like
– frequency information is missing
• some of the meanings are far more likely than others
• Ex: “pen” usually means a writing instrument although (very rarely)
it can mean a female swan
– semantic restrictions are missing
• we need to know related information
• Ex: with the word “ride”, we may need to know whether we are
talking about animals or vehicles because the actions in two cases are
quite different
Current Topic
 1.0 Definitions
 1.1 Characteristics of Successful Machines
 1.2 Practical Applications
– 1.2.1 machine translation
– 1.2.2 database access
– 1.2.3 text interpretation
• 1.2.3.1 information retrieval
• 1.2.3.2 text categorization
• 1.2.3.3 extracting data from text
 2.0 Efficient Parsing
 3.0 Scaling up the Lexicon
 4.0 List of References
4.0 List of References
 http://nats-www.informatik.uni-hamburg.de/ Natural Language Systems
 http://www.he.net/~hedden/intro_mt.html Machine Translation: A Brief Introduction
 http://foxnet.cs.cmu.edu/people/spot/frg/Tomita.txt Masaru Tomita
 http://www.csli.stanford.edu/~aac/papers.html Ann Copestake's Online Publications
 http://www.aventinus.de/ AVENTINUS advanced information system for multilingual drug enforcement
4.0 List of References (Cont)
 http://ai10.bpa.arizona.edu/~ktolle/np.html AZ Noun
Phraser
 http://www.cam.sri.com/ Cambridge Computer Science
Research Center
 http://www-cgi.cam.sri.com/highlight/ Cambridge
Computer Science Research Center, Highlight
 http://www.cogs.susx.ac.uk/lab/nlp/ Natural Language
Processing and Computational Linguistics at The
University of Sussex
 http://www.cogs.susx.ac.uk/lab/nlp/lexsys/ LexSys:
Analysis of Naturally-Occurring English Text with
Stochastic Lexicalized Grammars
4.0 List of References (Cont)
 http://www.georgetown.edu/compling/parsinfo.htm
Georgetown University: General Description of Parsers
 http://www.georgetown.edu/compling/graminfo.htm
Georgetown University: General Information about
Grammars
 http://www.georgetown.edu/cball/ling361/ling361_nlp1.html
Georgetown University: Introduction to Computational
Linguistics
 http://www.georgetown.edu/compling/module.html
Georgetown University: Modularity in Natural Language
Parsing
4.0 List of References (Cont)
 Elaine Rich and Kevin Knight, Artificial Intelligence
 Patrick Henry Winston, Artificial Intelligence
 Philip C. Jackson, Introduction to Artificial Intelligence