
Using language models for generic entity extraction
Ian H. Witten, Zane Bray, Malika Mahoui, W.J. Teahan
Computer Science
University of Waikato
Hamilton, New Zealand
[email protected]
Abstract

This paper describes the use of statistical language modeling
techniques, such as are commonly used for text compression, to
extract meaningful, low-level information about the location of
semantic tokens, or "entities," in text. We begin by marking up
several different token types in training documents—for example,
people's names, dates and time periods, phone numbers, and sums of
money. We form a language model for each token type and examine how
accurately it identifies new tokens. We then apply a search algorithm
to insert token boundaries in a way that maximizes compression of the
entire test document. The technique can be applied to
hierarchically-defined tokens, leading to a kind of "soft parsing"
that will, we believe, be able to identify structured items such as
references and tables in html or plain text, based on nothing more
than a few marked-up examples in training documents.
1. INTRODUCTION
Text mining is about looking for patterns in text, and may
be defined as the process of analyzing text to extract
information that is useful for particular purposes.
Compared with the kind of data stored in databases, text
is unstructured, amorphous, and difficult to deal with.
Nevertheless, in modern Western culture, text is the most
common vehicle for the formal exchange of information.
The motivation for trying to extract information from it is
compelling—even if success is only partial.
Text mining is possible because you do not have to understand text in
order to extract useful information from it. Here are four examples.
First, if only names could be identified, links could be inserted
automatically to other places that mention the same name—links that
are "dynamically evaluated" by calling upon a search engine to bind
them at click time. Second, actions can be associated with different
types of data, using either explicit programming or
programming-by-demonstration techniques. A day/time specification
appearing anywhere within one's email could be associated with diary
actions such as updating a personal organizer or creating an
automatic reminder, and each mention of a day/time in the text could
raise a popup menu of calendar-based actions. Third, text could be
mined for data in tabular format, allowing databases to be created
from formatted tables such as stock-market information on Web pages.
Fourth, an agent could monitor incoming newswire stories for company
names and collect documents that mention them—an automated press
clipping service.

In all these examples, the key problem is to recognize different
types of target fragments, which we will call tokens or "entities."
This is really a kind of language recognition problem: we have a text
made up of different sublanguages (for personal names, company names,
dates, table entries, and so on) and seek to determine which parts
are expressed in which language.

The information extraction research community (of which we were,
until recently, unaware) has studied these tasks and reported results
at annual Message Understanding Conferences (MUC). For example,
"named entities" are defined as proper names and quantities of
interest, including personal, organization, and location names, as
well as dates, times, percentages, and monetary amounts (Chinchor,
1999).
AI
IS
CS
THE COMPUTISTS' COMMUNIQUE
"Careers beyond programming."
Vol. 8, No. 25.1
August 18, 1998

"Our entire culture has been sucked into the black hole
of computation, an utterly frenetic process of virtual
planned obsolescence. But you know -- that process

1> Politics and policy.
2> Career jobs.
3> Book and journal calls.
4> Silicon Valley jobs.
_________________________________________________________________

1> Politics and policy:

The President's Information Technology Advisory Committee
has issued an Aug98 Interim Report about future research needs.
It's online at <http://www.ccic.gov/ac/interim/>.
[Maria Zemankova <[email protected]>, IRList, 10Aug98.]

The US created 20K new computer services jobs in Jul98,
plus 3K in computer manufacturing, out of just 66K new
US jobs total. The Bureau of Labor Statistics characterized
the computer field as having "strong long-term growth trends."
[TechWeb, 08Aug98. EduP.]

The peak year for female CS graduates was 1983-4, when women
earned 37% (32,172) of BSCS degrees. It dropped to 28% in 1993-4

2> Career jobs (in our CCJ 8.25 digest this week):

Fraunhofer CRCG (Providence, RI): MS/PhD researcher
for digital watermark agents.

Case Western Reserve U. (Cleveland): ESCES dept. chair.

UOklahoma (Norman): CS dept. director.

Santa Fe Institute (NM): postdocs in complex, adaptive systems.

3> Book and journal calls:

CRC Press is seeking proposals for future volumes
in its International Series on Computational Intelligence, or for
chapters in such volumes. Lakhmi C. Jain <[email protected]>.
[connectionists, 13Aug98.]

Kluwer Academic Publishers has a new Genetic Programming
book series, starting with Langdon's "Genetic Programming
and Data Structures: Genetic Programming + Data Structures
= Automatic Programming!". Book ideas may be sent to John R. Koza

4> Silicon Valley jobs:

Wired ran an article last year about headhunting in
Silicon Valley, by Po Bronson, author of "The First $20 Million
Is Always the Hardest" (Random House). Bronson says there is
tremendous demand for programmers, computer operators, and
marketing people -- so much so that Nohital Systems had
acceded to the demands of a programmer who brought his
8-foot python to work and [temporarily] a night-shift operator

Figure 1. Masthead and beginning of each section of the electronic newsletter
The standard approach to this problem is manual: tokenizers and
grammars are hand-designed for the particular data being extracted.
Looking at current commercial state-of-the-art text mining software,
for example, IBM's Intelligent Miner for Text (Tkach, 1997) uses
specific recognition modules carefully programmed for the different
data types, while Apple's data detectors (Nardi et al., 1998) use
language grammars. The Text Tokenization Tool of Grover et al. (1999)
is another example, and a demonstration version is available on the
Web. The challenge for machine learning is to use training instead of
explicit programming to detect instances of sublanguages in running
text.
This paper explores the ability of the kind of adaptive
language models used in text compression to locate
patterns in text, patterns of the kind typically sought by
text mining systems.
2. STRUCTURED INFORMATION IN TEXT
Looking for patterns in text is really the same as looking
for structured data inside documents. According to Nardi
et al. (1998), “a common user complaint is that they
cannot easily take action on the structured information
found in everyday documents ... Ordinary documents are
full of such structured information: phone numbers, fax
numbers, street addresses, email addresses, email
signatures, abstracts, tables of contents, lists of references,
tables, figures, captions, meeting announcements, Web
addresses, and more. In addition, there are countless
domain-specific structures, such as ISBN numbers, stock
symbols, chemical structures, and mathematical
equations.”
As an example, Figure 1 shows the masthead and beginning of each
section of a 4-page, 1500-word, weekly electronic newsletter, and
Table 1 shows information items extracted (manually) from it—items of
the kind that readers might wish to take action on. They are
classified into generic types: people's names; dates and time
periods; locations; sources, journals, and book series;
organizations; URLs; email addresses; phone numbers; fax numbers; and
sums of money. Identifying these types is rather subjective. For
example, dates and time periods are lumped together, whereas for some
purposes they should be distinguished. Personal and organizational
names are separated, whereas for some purposes they should be
amalgamated—indeed it may be impossible for a person (let alone a
computer) to distinguish them. The methodology we develop here
accommodates all these options: unlike AI approaches to natural
language understanding, it is not committed to any particular
ontology.

The sheer quantity of different information items in Table 1 is
impressive: 175 items, in ten categories, from a mere four pages. The
volume of such information items is highly dependent on the kind of
text—one would expect far less in a novel, for example—but this
example is not atypical of the factual, newsy writing that
information workers have to scan on a regular basis.
Nardi et al. (1998) define structured information as "data
recognizable by a grammar," and go on to describe how such data can
be specified by an explicit grammar and acted on by "intelligent
agents." Despite the intuitive appeal of being able to define data
detectors that trigger particular kinds of actions, there are
practical problems with this approach that stem from the difficulty
of enabling an ordinary user to specify grammars that recognize the
kinds of tokens he or she is interested in.

First, practical schemes that allow users to specify grammars
generally presuppose that the input is somehow divided into lexical
tokens, and although "words" delimited by non-alphanumeric characters
provide a natural tokenization for many of the items in Table 1, such
a decision turns out to be restrictive in particular cases. For
example, generic tokenization would not allow the unusual date
structure in this particular document (e.g. 30Jul98) to be
recognized. In general, any prior division into tokens runs the risk
of obscuring information.
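To make this point concrete, here is a small illustration of our own
(it is not from the paper): a conventional word tokenizer delivers
30Jul98 as a single opaque unit, so the day, month, and year inside
it are invisible to any grammar defined over such tokens, whereas the
structure is still available at the character level.

import re

line = "[connectionists, 13Aug98.]  Book ideas may be sent to John R. Koza"

# A typical generic tokenizer: words are runs of alphanumeric characters.
words = re.findall(r"[A-Za-z0-9]+", line)
print(words)             # '13Aug98' survives only as one indivisible token

# The structure of interest lies below the word level; a character-based
# model (or, here, an explicit character pattern) can still see it.
day, month, year = re.match(r"(\d{2})([A-Za-z]{3})(\d{2})", "13Aug98").groups()
print(day, month, year)  # 13 Aug 98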
Second, practical grammar-based approaches that allow users to
specify the structure use deterministic grammars rather than
probabilistic ones. These paint the world black and white. Yet the
situation exemplified by Table 1 reeks of ambiguity. A particular
name might be a person's name, a place name, a company name—or all
three. A name appearing at the beginning of a sentence may be
indistinguishable from an ordinary capitalized word that starts the
sentence—and it is not even completely clear what "start a sentence"
means. Determining what is and what is not a time period is not
always easy—the notion of "time period" in natural language text is
ill-defined. Text, with its richness and ambiguity, does not
necessarily support hard and fast distinctions. Email addresses and
URLs are exceptions: but, of course, they are not "natural" language.

Third, recognition should degrade gracefully when faced with real
text, that is, text containing errors. Traditional grammars decide
whether a particular string is or is not recognized, whereas in this
application it is often helpful to be able to recognize a string as
belonging to a particular type even though it is malformed, or as a
name even though it is misspelled. (The newsletter used for Table 1
is unusual in that it is virtually error free.)

Fourth and most important, text mining will require incremental,
evolutionary development of grammars. The problems are not fully
defined in advance. Grammars will have to be modified to take account
of new data. This is not easy: the addition of just one new example
can completely alter a grammar and render worthless all the work that
has been expended in building it.

Finally, many of the items in Table 1 cannot be recognized by
conventional grammars. Names are a good example. Some will have been
encountered before; for them, table lookup is appropriate—but the
lookup operation should recognize legitimate variants. Others will be
composed of parts that have been encountered before, say John and
Smith, but not in that particular combination. Others will be
recognizable by format (e.g. Randall B. Caldwell). Still
others—particularly certain foreign names—will be clearly
recognizable because of peculiar language statistics (e.g. Kung-Kiu
Lau). Others will not be recognizable except by capitalization, which
is an unreliable guide—particularly when only one name is present.
Table 1 Generic data items extracted from a 4-page electronic newsletter
People’s names (n)
Al Kamen
Barbara Davies
Bill Park
Bruce Sterling
Ed Royce
Eric Bonabeau
Erricos John Kontoghiorghes
Heather Wilson
John Holland
John R. Koza
Kung-Kiu Lau
Lakhmi C. Jain
Lashon Booker
Lily Laws
Maria Zemankova
Mark Sanford
Martyne Page
Mike Cassidy
Po Bronson
Randall B. Caldwell
Robert L. Park
Robert Tolksdorf
Sherwood L. Boehlert
Simon Taylor
Sorin C. Istrail
Stewart Robinson
Terry Labach
Vernon Ehlers
Zoran Obradovic
Sums of money (m)
$1K
$24K
$60
$65K
$70
$78K
$100
Dates/time periods (d)
30Jul98
31Jul98
02Aug98
04Aug98
05Aug98
07Aug98
08Aug98
09Aug98
10Aug98
11Aug98
13Aug98
14Aug98
15Aug98
August 18, 1998
01Sep98
15Sep98
15Oct98
31Oct98
10Nov98
01Dec98
01Apr99
Nov97
Jul98
Aug98
Mar99
August
July
Spring 1999
Spring 2000
1993-4
1999
120 days
eight years
eight-week
end of 1999
late 1999
month
twelve-year period
Sources, journals, book series (s)
Autonomous Agents and Multi-Agent Systems Journal
Commerce Business Daily (CBD)
Computational Molecular Biology Series
DAI-List
ECOLOG-L
Evolutionary Computation Journal
Genetic Programming book series
IRList
International Series on Computational Intelligence
J. of Complex Systems
J. of Computational Intelligence in Finance (JCIF)
J. of Symbolic Computation (JSC)
J. of the Operational Research Society
Parallel Computing Journal
Pattern Analysis and Applications (PAA)
QOTD
SciAm
TechWeb
WHAT'S NEW
Washington Post
Wired
comp.ai.doc-analysis.ocr
comp.ai.genetic
comp.ai.neural-nets
comp.simulation
Dbworld
sci.math.num-analysis
sci.nanotech
Email addresses (e)
[email protected]
[email protected]
[email protected]
[email protected]
[email protected]
[email protected]
[email protected]
[email protected]
[email protected]
[email protected]
[email protected]
[email protected]
[email protected]
[email protected]
[email protected]
[email protected]
[email protected]
[email protected]
[email protected]
[email protected]
Organizations (o)
ACM
Austrian Research Inst. for AI
Bureau of Labor Statistics
CRC Press
Case Western Reserve U.
Fraunhofer CRCG
Ida Sproul Hall
Kluwer Academic Publishers
NSF
Nohital Systems
Oregon Graduate Inst.
Permanent Solutions
Random House
Santa Fe Institute
UOklahoma
UTrento
Locations (l)
Beaverton
Berkeley
Britain
Canada
Cleveland
Italy
Montreal
NM
Norman
Providence, RI
Quebec
Silicon Valley
Stanford
US
Vienna
the Valley
Phone numbers (p)
650-941-0336
(703) 883-7609
+44 161 275 5716
+44-1752-232 558
Fax numbers (f)
650-941-9430 fax
+44 161 275 6204 Fax
(703) 883-6435 fax
+44-1752-232 540 fax
URLs (u)
http://cbdnet.access.gpo.gov/
http://ourworld.compuserve.com/homepages/ftpub/call.htm
http://www.ccic.gov/ac/interim/
http://www.cs.man.ac.uk/~kung-kiu/jsc
http://www.cs.sandia.gov/~scistra/DAM
http://www.cs.tu-berlin.de/~tolk/AAMAS-CfP.html
http://www.elsevier.nl/locate/parco
http://www.santafe.edu/~bonabeau
http://www.soc.plym.ac.uk/soc/sameer/paa.htm
http://www.wired.com/wired/5.11/es_hunt.html
AI
IS
CS
THE COMPUTISTS' COMMUNIQUE
"Careers beyond programming."
Vol. 8, No. 25.1
<d>August 18, 1998</d>

"Our entire culture has been sucked into the black hole
of computation, an utterly frenetic process of virtual
planned obsolescence. But you know -- that process

1> Politics and policy.
2> Career jobs.
3> Book and journal calls.
4> Silicon Valley jobs.
_________________________________________________________________

1> Politics and policy:

The President's Information Technology Advisory Committee
has issued an <d>Aug98</d> Interim Report about future research ne
It's online at <<u>http://www.ccic.gov/ac/interim/</u>>.
[<n>Maria Zemankova</n> <<e>[email protected]</e>>, <s>IRList</s>,
<d>10Aug98</d>.]

The <l>US</l> created 20K new computer services jobs in <d>Jul
plus 3K in computer manufacturing, out of just 66K new
<l>US</l> jobs total. The <o>Bureau of Labor Statistics</o> charac
the computer field as having "strong long-term growth trends."
[<s>TechWeb</s>, <d>08Aug98</d>. EduP.]

The peak year for female CS graduates was <d>1983-4</d>, when w
earned 37% (32,172) of BSCS degrees. It dropped to 28% in <d>199

2> Career jobs (in our CCJ 8.25 digest this week):

<o>Fraunhofer CRCG</o> (<l>Providence, RI</l>): MS/PhD researcher
for digital watermark agents.

<o>Case Western Reserve U.</o> (<l>Cleveland</l>): ESCES dept. cha

<o>UOklahoma</o> (<l>Norman</l>): CS dept. director.

<o>Santa Fe Institute</o> (<l>NM</l>): postdocs in complex, adapti

3> Book and journal calls:

<o>CRC Press</o> is seeking proposals for future volumes
in its <s>International Series on Computational Intelligence</s>,
chapters in such volumes. <n>Lakhmi C. Jain</n> <<e>l.jain@unisa.
[connectionists, <d>13Aug98</d>.]

<o>Kluwer Academic Publishers</o> has a new <s>Genetic Program
book series</s>, starting with Langdon's "Genetic Programming
and Data Structures: Genetic Programming + Data Structures
= Automatic Programming!". Book ideas may be sent to <n>John R. K

4> <l>Silicon Valley</l> jobs:

<s>Wired</s> ran an article last year about headhunting in
Silicon Valley, by <n>Po Bronson</n>, author of "The First $20 Mil
Is Always the Hardest" (<o>Random House</o>). Bronson says there
tremendous demand for programmers, computer operators, and
marketing people -- so much so that <o>Nohital Systems</o> had
acceded to the demands of a programmer who brought his
8-foot python to work and [temporarily] a night-shift operator

Figure 2. Marked up version of Figure 1 (some lines truncated for display)
3. LANGUAGE MODELS FOR TEXT COMPRESSION
Statistical language models are well developed in the field
of text compression. Compression methods are usually
divided into symbolwise and dictionary schemes (Bell et
al., 1990). Symbolwise methods, which generally make
use of adaptively generated statistics, give excellent
compression—in fact, they include the best known
methods. Although the dictionary methods such as the
Ziv-Lempel schemes perform less well, they are used in
practical compression utilities such as Unix compress and
gzip because they are fast.
In our work we use the Prediction by Partial Matching
(PPM) symbolwise compression scheme (Cleary and
Witten, 1984), which has become a benchmark in the
compression community. It generates “predictions” for
each input symbol in turn. Each prediction takes the form
of a probability distribution that is provided to an encoder.
The encoder is usually an arithmetic or Huffman coder;
fortunately the details of coding are of no relevance to this
paper.
PPM uses finite-context models of characters, where the
previous few (say three) characters predict the upcoming
one. The conditional probability distribution of characters,
conditioned on the preceding few characters, is
maintained and updated as each character of input is
processed. This distribution, conditioned on the actual
value of the preceding few characters, is used to predict
each upcoming symbol. Exactly the same distributions are
maintained by the decoder, which updates the appropriate
distribution as each character is received. This is what we
call “adaptive modeling”: both encoder and decoder
maintain the same models—not by communicating the
models directly, but by updating them in precisely the
same way.
Rather than using a fixed context length (three was
suggested above), the PPM method chooses a maximum
context length and maintains statistics for this and all
shorter contexts. For example, in most of the experiments
below the maximum context length was five, and
statistics were maintained for models of order five, order
four, order three, order two, order one, and order zero.
These are not stored separately; they are all kept in a
single trie structure.
To encode the next symbol, PPM starts with the
maximum-order model (say order five). If it contains a
prediction for the upcoming character, it is transmitted
according to the order-five distribution. Otherwise, both
encoder and decoder “escape” down to order four. There
are two possible situations. If the order-five context—that
is, the preceding five-character sequence—has not been
encountered before, then escape to order four is
inevitable, and both encoder and decoder can deduce that
fact without requiring any communication. If not, that is,
if the preceding five characters have been encountered in
sequence before but not followed by the upcoming
character, then only the encoder knows that an escape is
necessary. In this case, therefore, it must signal this fact to
the decoder by transmitting an “escape event”—and room
for this event must be made in each probability
distribution that the encoder and decoder maintain.
Once any necessary escape event has been transmitted and received,
both encoder and decoder agree that the upcoming character will be
coded by the order-four model. Of course, this may not be possible
either, and further escapes may take place. Ultimately, the
order-zero model may be reached; in this case the character can be
transmitted if it is one that has occurred before. Otherwise, there
is one further escape (to an "order -1" model), and the 8-bit ASCII
representation of the character is sent.

The only remaining question is how to calculate the escape
probabilities. There has been much discussion of this, and several
different methods have been proposed. Our experiments use method D
(Howard, 1993), which calculates the escape probability in a
particular context as

d/(2n),

where n is the number of times that context has appeared and d is
the number of different symbols that have directly followed it. The
probability of a character that has occurred c times in that context
is

(c - 1/2)/n.

Since there are d such characters, and their counts sum to n, it is
easy to confirm that the probabilities in the distribution (including
the escape probability) sum to 1.

One slight further improvement to PPM is incorporated in the
experiments: deterministic scaling (Teahan, 1997). Although it
probably has negligible effect on our overall results, we record it
here for completeness. Experiments show that in deterministic
contexts, for which d = 1, the probability of the single character
that has occurred before reappearing is greater than the 1 - 1/(2n)
implied by the above estimator. Consequently, in this case the
probability is increased in an ad hoc manner to 1 - 1/(6n).
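To make the mechanism concrete, the following minimal sketch (our own
illustration, not the authors' implementation) shows an order-k
character model that blends contexts of decreasing length using
method D escape probabilities of d/(2n). It omits exclusions, the
trie data structure, and the arithmetic coder of a real PPM
implementation, and all of the names in it are ours.

import math
from collections import defaultdict

class SimplePPM:
    """Illustrative order-k character model with method D escapes.

    Counts are kept for every context of length 0..max_order, and the
    probability of a character blends the orders via escape
    probabilities, ending in a uniform "order -1" model over 256 byte
    values. A real PPM coder also applies exclusions, omitted here.
    """

    def __init__(self, max_order=5):
        self.max_order = max_order
        # counts[context][char] = number of times char followed context
        self.counts = defaultdict(lambda: defaultdict(int))

    def train(self, text):
        for i, ch in enumerate(text):
            for k in range(self.max_order + 1):
                if i - k < 0:
                    break
                self.counts[text[i - k:i]][ch] += 1

    def prob(self, context, ch):
        """P(ch | context), escaping from the longest context downwards."""
        escape = 1.0
        for k in range(min(self.max_order, len(context)), -1, -1):
            ctx = context[len(context) - k:]
            followers = self.counts.get(ctx)
            if not followers:
                continue                       # unseen context: free escape
            n = sum(followers.values())
            d = len(followers)
            c = followers.get(ch, 0)
            if c > 0:
                return escape * (c - 0.5) / n  # method D: (c - 1/2) / n
            escape *= d / (2.0 * n)            # method D escape: d / (2n)
        return escape / 256.0                  # order -1: uniform over bytes

    def entropy(self, text, context=""):
        """Bits needed to code `text` after the given preceding context."""
        bits = 0.0
        for ch in text:
            bits -= math.log2(self.prob(context, ch))
            context = (context + ch)[-self.max_order:]
        return bits

# Example: a model trained on dates codes a new date in few bits.
date_model = SimplePPM(max_order=3)
date_model.train("30Jul98 31Jul98 02Aug98 15Sep98 01Dec98")
print(round(date_model.entropy("04Aug98"), 2), "bits")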
4. USING LANGUAGE MODELS TO RECOGNIZE TOKENS
Character-based language models provide a good way to
recognize lexical tokens. Tokens can be compressed using
models derived from different training data, and classified
according to which one supports the most economical
representation.
To test this, several issues of the same newsletter used for
Table 1 were analyzed manually to extract all people’s
names; dates and time periods; locations; sources,
journals, and book series; organizations; URLs; email
addresses; phone numbers; fax numbers; and sums of
money. The issues were marked up to identify these items
using an XML-style markup; Figure 2 shows the marked-up version of the extract in Figure 1.
Various experiments were carried out to determine the
power of language models to discriminate these tokens,
both out of context and within their context in the
newsletter. Throughout this work, we use the PPM text
compression scheme as described above, with order five
unless otherwise mentioned.
4.1 DISCRIMINATING ISOLATED TOKENS
Lists of names, dates, locations, etc. in 19 issues of the
newsletter were input to PPM separately to form ten
language models labeled n, d, l, s, o, u, e, p, f, m. In
addition, a plain text model, t, was formed from the full
text of all 19 issues. These models were used to identify
each of the tokens in Table 1 on the basis of which model
compresses them the most. The results are summarized in
the form of a confusion matrix in Table 2a, where the
rows represent the correct label and the columns the
computed one. Notice that although t never appears as the
correct label, it could be assigned to a token of a different
type because it compresses it best.
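The rule itself is easy to state in code. The sketch below is our own
illustration: a crude fixed-order character model with add-one-half
smoothing stands in for PPM, and the three-entry training lists are
invented; only the decision "keep the label whose model codes the
token in fewest bits" corresponds to the paper.

import math
from collections import defaultdict

def train_char_model(examples, order=2):
    """Count order-`order` character contexts over a list of example tokens."""
    counts = defaultdict(lambda: defaultdict(int))
    for token in examples:
        padded = "\x02" * order + token + "\x03"      # start/end padding
        for i in range(order, len(padded)):
            counts[padded[i - order:i]][padded[i]] += 1
    return counts

def bits(token, counts, order=2, alphabet=96):
    """Approximate code length of `token` under the counted model
    (add-1/2 smoothing stands in for PPM's escape mechanism)."""
    padded = "\x02" * order + token + "\x03"
    total = 0.0
    for i in range(order, len(padded)):
        ctx, ch = padded[i - order:i], padded[i]
        n = sum(counts[ctx].values())
        p = (counts[ctx][ch] + 0.5) / (n + 0.5 * alphabet)
        total += -math.log2(p)
    return total

# Hypothetical marked-up training lists, one per token type.
training = {
    "n": ["John R. Koza", "Maria Zemankova", "Po Bronson"],
    "d": ["30Jul98", "13Aug98", "Spring 1999"],
    "m": ["$1K", "$65K", "$100"],
}
models = {label: train_char_model(examples) for label, examples in training.items()}

def classify(token):
    return min(models, key=lambda label: bits(token, models[label]))

print(classify("08Aug98"))   # expected: "d"
print(classify("$24K"))      # expected: "m"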
Table 2 Confusion matrix: tokens are identified by language models.
(a) Tokens in isolation; (b) tokens in context. Rows give the correct
label and columns the computed label, over the ten token types
(d, n, s, o, l, m, e, u, p, f) and plain text (t); 192 tokens in total.

We will discuss these results below, looking at individual errors to
get a feeling for the kinds of mistakes that are made by
compression-based token identification. These results are for a
single issue of the newsletter. Table 3a shows the average confusion
matrices over twenty test documents, derived using leave-one-out
training: for each issue, the compression models were trained on the
other nineteen issues. The single issue that we will discuss in
detail is fairly representative of this data set.

Table 3 Confusion matrices averaged over 20 test documents
(leave-one-out training). (a) Tokens in isolation; (b) tokens in
context. Rows give the correct label and columns the computed label;
157 tokens on average per document.

Of the 192 tokens in Table 1, 174 are identified correctly and 18
incorrectly. In fact, 69 of them appear in the training data (with
the same label) and 123 are new; all of the errors are on new
symbols. Three of the "old" symbols contain line breaks that do not
appear in the training data: for example, in the test data Parallel
Computing Journal is split across two lines. However, these items
were still identified correctly. The 18 errors are easily explained;
some are quite understandable.

Beaverton, a location, was mis-identified as a name. Compressed as a
name, it occupies 3.18 bits/char versus 3.25 bits/char as a location.
Norman (!), Cleveland, Britain and Quebec were also identified as
names. Conversely, the name Mark Sanford was mis-identified as a
location. Although Mark appears in five different names, this is
outweighed by eighteen appearances of San in locations (e.g. San
Jose, San Mateo, Santa Monica, as well as Sankte Augustin). The name
Sorin C. Istrail was also identified as a location, as was the
organization Fraunhofer CRCG.

ACM, an organization, was mis-identified as a source. In fact, the
only place these letters appear in the training data is in ACM
Washington Update, which is a source. Sources are the most diverse
category and swallow up many foreign items: the dates eight-week,
eight years and Spring 2000 (note: comp.software.year-2000 is a
source); the name Adnan Amin; the organizations Commerce Business
Daily (CBD) and PAA; and the email address [email protected]
(because genetic-programming appears in the training data as both a
source and plain text). Some sources were also mis-identified:
ECOLOG-L was identified as an organization, and sci.nanotech as an
email address.
4.2 DISTINGUISHING TOKENS IN CONTEXT
Our quest is to identify tokens in the newsletter. Here,
contextual information is available which provides
additional help in disambiguating them. However,
identification must be done conservatively, so that strings
of plain text are not misinterpreted as tokens—and since
there are many strings of plain text, there are countless
opportunities for error.
For example, email addresses in the newsletter are always
flanked by angle brackets. Many sources are preceded by
a [ or ,• and followed by a ,•. (Bullets are used to make
spaces visible.) These contextual clues often help in
identifying tokens. Conversely, identification may be
foiled in some cases by misleading context. For example,
some names are preceded by Rep.•, which reduces the
weight of the capitalization evidence because
capitalization routinely occurs following a period. But by
far the most influential effect of context is that to mark up
any string as a token requires the insertion of two extra
symbols: begin-token and end-token. Unless the language
models are good, tokens will be misread as plain text to
avoid the overhead of these extra symbols.
In order to evaluate the effect of context, all tokens were replaced
by a symbol that was treated by PPM as a single character. The
training data used for the plain-text model was transformed in this
way, a new model was generated from it, and the test article was
compressed by this model to give a baseline entropy figure of e0
bits. The first token in Table 1 was restored into the test article
as plain text and the result recompressed to give entropy e bits. The
net space saved by recognizing this token as belonging to model m is

e - (e0 + em) bits,

where em is the entropy of the token with respect to model m. This
was evaluated for each model to determine which one classified the
token best, or whether it was best left as plain text. The entire
procedure was repeated for each token individually.

Table 2b shows the confusion matrix that was generated. (Again,
cross-validation results for the entire data set are shown in Table
3b.) The number of errors has increased from 18 to 26; however, 24 of
these "errors" are caused by failure to recognize a token as being
different from plain text, and only two are actual mis-recognitions
(Berkeley is identified as a name and Mark Sanford as a location).

There is a small overall improvement in compression through the use
of tokens. To code the original test file using a model generated
from the original training files takes a total of 28,589 bits for
10,889 characters, an average of 2.63 bits/char. To code the
marked-up test file using the appropriate models generated from the
marked-up training files takes 841 fewer bits—despite the fact that
there are 364 additional begin-token and end-token symbols. This
represents a 2.9% improvement for the file as a whole, or 0.077 bits
per character of the original file. Of course, the improvement is
diluted by the presence of a large volume of plain text. When the
savings are averaged over the affected characters—namely the 2945
characters that are present in the tokens alone—the improvement
seems more impressive: 0.28 bits/char.
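The resulting decision, for each candidate, is to compare the bits
saved by each token model against leaving the characters as plain
text, charging for the begin-token and end-token symbols and, if
desired, a threshold (the tradeoff discussed in Section 4.3 below).
A minimal sketch of our own, with stand-in numbers in place of the
PPM entropies:

def best_label(plain_bits, token_bits, marker_bits, threshold=0.0):
    """Pick the token model that saves the most bits, or plain text.

    plain_bits  -- bits to code the candidate string as ordinary text
    token_bits  -- dict mapping label -> bits to code it under that model
    marker_bits -- cost of the begin-token and end-token symbols
    threshold   -- extra bits a model must save before a token is preferred;
                   raising it trades identification failures for fewer errors
    (All numbers fed to this function are stand-ins for the entropies
    that would really come from the compression models.)
    """
    best, best_saving = "plain text", threshold
    for label, bits in token_bits.items():
        saving = plain_bits - (bits + marker_bits)
        if saving > best_saving:
            best, best_saving = label, saving
    return best

# Hypothetical figures for one nine-character candidate string:
print(best_label(plain_bits=30.0,
                 token_bits={"n": 24.5, "l": 26.0},
                 marker_bits=4.0,
                 threshold=0.0))   # "n": it saves 1.5 bits over plain text
print(best_label(plain_bits=30.0,
                 token_bits={"n": 24.5, "l": 26.0},
                 marker_bits=4.0,
                 threshold=2.0))   # "plain text": the saving is below threshold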
4.3 EFFECT OF MODEL ORDER
To investigate the effect of model order, the token
discrimination exercise was re-run using PPM models of
order 0, 1, 2, 3, 4, and 5, using the same training and test
data. Figure 3 shows the number of errors in token
identification observed, for both isolated tokens and
tokens in context. The dark bars show actual errors, and
the light ones indicate failure to identify a token as
something other than plain text. The number of actual
errors is very small when tokens are taken in context,
even for low-order models. However, the number of
identification failures is enormous when the models are
poor ones.
Figure 3 Effect of model order on token identification: (a) isolated
tokens, (b) tokens in context. The number of errors (0-100) is
plotted against model order (0-5); light areas represent tokens
mis-identified as plain text.
Although the lowest error rates are observed with the
highest-order models, the improvement as model order
increases beyond 2 is marginal. Note that we have used
models of the same order for all information items
(including the text model itself); better results may be
obtained by choosing the optimal order for each token
type individually. For example, order-0 models identify
every single sum of money, email address, URL, phone
and fax number, with no errors—even in context.
The tradeoff, with tokens in context, between actual errors
and failures to identify is not a fixed one. It can be
adjusted by using a non-zero threshold when comparing
the compression for a particular token with the
compression when its characters are interpreted as plain
text. This allows us to control the error rate, sacrificing a
small increase in the number of errors for a larger
decrease in identification failures.
4.4 EFFECT OF QUANTITY OF TRAINING DATA
Next we examine how identification accuracy depends on
the amount of training data. We varied the training data
from one to 38 issues of the same newsletter, marked up
manually. Figure 4 plots the number of errors in token
identification, for the same test file, against the amount of
training data. The middle line corresponds to isolated
tokens (as in Figure 3a). The other two correspond to
tokens in context, the lower one to the number of actual
errors made (as in the dark bars of Figure 3b), and the
upper one to the number of failures to identify a token as
anything but plain text (the light bars). Notice that there is
a very satisfactory reduction in errors as the amount of training
data increases, but the number of identification failures stabilizes
at an approximately constant level. Our choice of 19 training
documents for all other tests seems to be a sensible one.

Figure 4 Effect of quantity of training data on token identification.
The number of errors is plotted against the number of training files
(up to about 40) for three curves: identification failures (in
context), actual errors (in isolation), and actual errors (in
context).

5. LOCATING TOKENS IN CONTEXT

We locate tokens in context by considering the input as an
interleaved string of information from different sources. This model
has been studied by Reif and Storer (1997), who consider optimal
lossless compression of non-stationary sources produced by
concatenating finite strings from different sources. However, they
assume that the individual strings are long, and grow without bound
as the input increases; this assumption allows an encoding method to
be derived with asymptotically optimal expected length. Volf and
Willems (1997) study the combination of two universal coding
algorithms using a switching method. They devise a dynamic
programming structure to control switching, but rather than computing
the probability of a single transition sequence, they weight over all
transition sequences.

Our work derives from Teahan et al.'s (1997) method for correcting
English text using PPM models. Suppose every token is bracketed by
begin-token and end-token symbols; the problem then is to "correct"
the text by inserting such symbols appropriately. Begin- and
end-token symbols identify the type of the token in question—thus we
have begin-name-token, end-name-token, etc., written as <n>, </n>.
The innovation in the present paper is that whenever a begin-token
symbol is encountered, the encoder switches to the language model
appropriate to that type of token, initialized to a null prior
context. And whenever an end-token symbol is encountered, the encoder
reverts to the plain text model that was in effect before, replacing
the token by the single symbol representing that token type.
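A small sketch may help to make the switching scheme concrete. It is
our own illustration: the two cost functions are stand-ins for the
PPM entropies, the collapsed-token symbol is a single placeholder
character, and nested tokens are not handled.

import re

TOKEN = re.compile(r"<(\w+)>(.*?)</\1>", re.S)

def switched_cost(marked_text, plain_bits, token_bits, marker_bits=0.0):
    """Total code length of a marked-up string under model switching.

    plain_bits(s)        -- bits to code `s` under the plain-text model, in
                            which every token has been collapsed to one symbol
    token_bits(label, s) -- bits to code token text `s` under model `label`,
                            starting from a null prior context
    marker_bits          -- cost charged for each begin-token and end-token symbol
    All three are stand-in interfaces assumed for this sketch.
    """
    total, collapsed, last = 0.0, [], 0
    for m in TOKEN.finditer(marked_text):
        collapsed.append(marked_text[last:m.start()])   # untouched plain text
        collapsed.append("\x01")                        # stand-in single symbol for the token
        total += token_bits(m.group(1), m.group(2))     # inner model, null context
        total += 2 * marker_bits                        # begin-token and end-token
        last = m.end()
    collapsed.append(marked_text[last:])
    return total + plain_bits("".join(collapsed))       # outer (plain-text) model

# Example with dummy per-character costs (3 bits/char plain, 2 bits/char in a token):
cost = switched_cost("In <d>1998</d>, <m>$2</m>.",
                     plain_bits=lambda s: 3.0 * len(s),
                     token_bits=lambda label, s: 2.0 * len(s),
                     marker_bits=1.0)
print(round(cost, 1))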
5.1 ALGORITHM FOR MODEL ASSIGNMENT
Our algorithm takes a string of text and works out the
optimal sequence of models that would produce it, along
with their placement. As a small example, the input string
In•1998,•$2.
produces the output
<t>In•<d>1998</d>,•<m>$2</m>.</t>
Here, t is a model formed from all the training text with
every token replaced by a single-symbol code. The
characters 1998 have been recognized as a date token,
because the date model compresses them best; $2 has
been recognized as a money token, because the money
model compresses them best. The remainder has been
recognized as plain text.
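Before the tree search is described in detail below, the following
sketch conveys the flavour of the computation. It is a simplified,
context-free stand-in of our own: character costs are fixed per model
rather than computed by PPM, nesting is ignored, and an exact
Dijkstra-style best-first search replaces the pruned search described
next; all names and numbers are ours.

import heapq

def assign_models(text, char_bits, marker_bits=1.0):
    """Best-first search for the cheapest way to bracket `text` with tokens.

    char_bits(model, ch) is a fixed per-character cost ("t" is the
    plain-text model); because the costs do not depend on context, no
    pruning is needed here, unlike the full algorithm.
    """
    heap = [(0.0, 0, "t", "")]       # (bits so far, position, active model, output)
    best = {}
    while heap:
        bits, pos, model, out = heapq.heappop(heap)
        if best.get((pos, model), float("inf")) < bits:
            continue                  # a cheaper path already reached this state
        best[(pos, model)] = bits
        if pos == len(text) and model == "t":
            return bits, out          # lowest-entropy complete interpretation
        if model == "t" and pos < len(text):
            # branch: insert a begin-token symbol for each possible token type
            for m in ("d", "m", "n"):
                heapq.heappush(heap, (bits + marker_bits, pos, m, out + f"<{m}>"))
        elif model != "t":
            # branch: insert an end-token symbol for the currently active type
            heapq.heappush(heap, (bits + marker_bits, pos, "t", out + f"</{model}>"))
        if pos < len(text):
            # code the next character under the model currently in force
            ch = text[pos]
            heapq.heappush(heap, (bits + char_bits(model, ch), pos + 1, model, out + ch))
    return float("inf"), ""

# Dummy per-character costs standing in for the compression models.
def char_bits(model, ch):
    if model == "t":
        return 4.0                              # plain-text model
    if model == "d":
        return 1.5 if ch.isdigit() else 6.0     # date model likes digits
    if model == "m":
        return 1.5 if ch == "$" else 2.5 if ch.isdigit() else 6.0
    return 6.0                                  # name model: nothing cheap here

print(assign_models("In 1998, $2.", char_bits))
# (38.0, 'In <d>1998</d>, <m>$2</m>.')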
The algorithm works by processing the input characters to
build up a tree in which each path from root to leaf
represents a string of characters that is a possible
interpretation of the input. The paths are alternative
output strings, and begin-token and end-token symbols
appear on them. The entropy of a path can be calculated
by starting at the root and coding each symbol along the
path according to the model that is in force when that
symbol is reached. The context is re-initialized to a
unique starting token whenever begin-token is
encountered, and the appropriate model is entered. On
encountering end-token, it is encoded and the context
reverts to what it was before. For example, the character
that follows
<t>In•<d>1998</d>

will be predicted by the four characters In•<d/d>, where <d/d> is the
single-symbol code representing the occurrence of a date. This
context is interpreted in the t model.

What causes the tree to branch is the insertion of begin-token
symbols for each possible token type, and the end-token symbol for
the currently active token type (in order that nesting is properly
respected). To expand the tree, a list of open leaves is maintained,
each recording the point in the input string that has been reached
and the entropy value up to that point. The lowest-entropy leaf is
chosen for expansion at each stage.

Unless the tree and the list of open leaves are pruned, they grow
very large very quickly. Currently, three separate pruning operations
are applied that remove leaves from the list and therefore prevent
the corresponding paths from growing further. First, if two leaves
are labeled with the same character and have the same preceding k
characters, the one with the greater entropy is deleted. Here, k is
the model order (default 5). This would be a "safe" pruning criterion
that could not possibly eliminate any path that might turn out to be
the best one, were it not for the fact that a different sequence of
unterminated models might exist in the two paths, some of which, when
terminated, might cause the contexts to differ between the paths. At
present, we do not check for this eventuality. The second pruning
operation is to delete any leaf that (a) has a larger entropy than
the best path so far, and (b) represents a point that lags more than
k symbols behind that best path. This pruning heuristic is not
guaranteed to be safe: it may delete a path that would ultimately
turn out to be best. The reason is that the price in bits to enter
any model is unbounded; yet so is the benefit that may eventually
accrue from using the model. The third is to restrict the open-leaves
list to a predetermined length.

These pruning strategies cause a small number of identification
errors (discussed below); other strategies are under investigation.

5.2 RESULTS WHEN LOCATING TOKENS IN CONTEXT
To evaluate the procedure for locating tokens in context,
we used the training data from the same 19 issues of the
newsletter that were used previously, and the same single
issue for testing. Error counting is complicated by the
presence of multiple errors on the same token. Counting
all errors on the same token as one, there are a total of 47
errors:
2 identification errors noted in Section 4.2 for in-context discrimination
24 failures to recognize a token as being different
from plain text
5 incorrect positive identifications
9 boundary errors
3 phone/fax absorption errors
4 pruning errors
In addition, a further 9 “errors” occurred which were
actually errors in the original markup, made by the person
who marked up the test data.
The two identification errors and 24 failures to recognize
are those noted in the confusion matrix of Table 2b. The
five incorrect positive identifications picked out bonus as
a date, son and Prophet as names, and field and Ida as
organizations. Boundary errors are more interesting.
Many involved names: Wilson, Kung-Kiu, Lashon and
Sorin C were identified by the algorithm as names
whereas it was Heather Wilson, Kung-Kiu Lau, Lashon
Booker and Sorin C. Istrail that were marked up. Similar
mistakes occurred with other token types: year for
twelve-year, CBD for CBDNet, International Series and
Computational Intelligence (separately) for International
Series on Computational Intelligence, koza and .org
(separately) for [email protected], J. of
Symbolic Computation and "JSC)" (separately) for J. of
Symbolic Computation (JSC), and comp.ai for
comp.ai.genetic. Phone/fax absorption errors are a special
case of boundary errors involving phone and fax numbers:
for example, +44-1752-232 540 was identified as a phone
number whereas it was +44-1752-232 540 fax that was
marked up, as a fax number. Finally, pruning errors are
caused by the pruning strategy described in the previous
section.
To summarize the success of token identification, 449 out of a total
of 535 names were recognized correctly in isolation (Table 3a gives
these figures as per-document averages, quantized into integers), for
a success rate of 83.9%. These figures improved to 478 correctly
recognized names, that is 89.4%, when context was taken into account
(Table 3b). Other token types fared similarly: the recognition rate
for phone numbers, for example, increased from 75% without context to
90% in context. However, 84% of the 456 sources were correctly
identified without context, but this dropped to 60% when context was
taken into account. The reason is that many sources (37%) were
misrecognized as plain text, an option that was unavailable in the
experiments on tokens in isolation. Thus in this case only 3% of
sources were misidentified as other token types.

The success rate for token location is harder to quantify, because
some errors involve slight misplacement of the boundary, and a
significant number of other errors can be attributed to mistakes made
in the markup. We observed a fairly small increase in actual
misidentification of the token type, but a significant number of
further instances where tokens in the text were missed; however,
these have not yet been quantified.
6. USING LANGUAGE MODELS TO RECOGNIZE STRUCTURES
So far we have not taken advantage of the potentially
hierarchical nature of the tokens that are inferred. We use
the term “soft parsing” to denote inference of what is
effectively a grammar from example strings, using exactly
the same compression methodology. After analyzing the
errors noted above, we refined the markup of the training
documents to use forename, initial, surname for names;
username, domain, and top-level domain for email
addresses; and embedded phone numbers for fax
numbers. Examples are
Name:
<n><f>Ian</f>•<i>H</i>.•<s>Witten</s></n>
Email:
<e><u>ihw</u>@<d>cs.waikato.ac</d>.<t>nz</t></e>
Fax:
<f><p>+64-7-856-2889</p>•fax</f>
(Of course, different symbols were used for f, s, u, d, and t
to avoid confusion with the other models.) Then, during
training, models are built for each component of a
structured item, as well as the item itself, and when the
test file is processed to locate tokens in context, these new
tags are inserted into it too. The algorithm described
above accommodates nested tokens without any
modification at all.
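For illustration only (this is not part of the identification
algorithm), the nested notation used in these examples can be read
back into an explicit tree, which is the structure that "soft
parsing" is intended to recover. The little parser below is ours; it
assumes well-nested tags and uses ordinary spaces rather than the
bullets shown above.

import re

TAG = re.compile(r"<(/?)(\w+)>")

def parse_markup(s):
    """Parse nested <x>...</x> markup into (label, children) tuples;
    plain text between tags becomes bare strings."""
    root = ("root", [])
    stack = [root]
    pos = 0
    for m in TAG.finditer(s):
        if m.start() > pos:
            stack[-1][1].append(s[pos:m.start()])       # literal text
        if m.group(1):                                   # closing tag
            node = stack.pop()
            assert node[0] == m.group(2), "mismatched tags"
        else:                                            # opening tag
            node = (m.group(2), [])
            stack[-1][1].append(node)
            stack.append(node)
        pos = m.end()
    if pos < len(s):
        stack[-1][1].append(s[pos:])
    return root[1]

print(parse_markup("<n><f>Ian</f> <i>H</i>. <s>Witten</s></n>"))
# [('n', [('f', ['Ian']), ' ', ('i', ['H']), '. ', ('s', ['Witten'])])]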
The results were mixed. Some errors were corrected (e.g.
Kung-Kiu Lau and Sorin C. Istrail were correctly marked),
but other problems remained (e.g. the fax/phone number
mix-up) and a few new ones were introduced. Some of
these are caused by the pruning strategies used; others are
due to insufficient training data.
The examples we have given also highlight some
weaknesses of the use of compression models for
identifying and locating tokens. The difficulty we
observed with a particular email address, for instance,
highlights the fact that for some lexical items (in this case
an artificial rather than a natural one), the appearance of a
certain character (“@”) is very strong evidence of token
type that should not easily be outvoted by
unrepresentative language statistics.
Despite these inconclusive initial results, we believe that
soft parsing will be a valuable technique in situations with
stronger hierarchical context (e.g. references and tables).
Great power is gained from the hierarchical nature of the
representation. A reference, for example, contains tokens that
represent names, year of publication, title, journal, volume, issue
number, page numbers, month of publication. These tokens will be
separated by short fillers involving spaces, punctuation, quotation
marks, the word "and", etc. The fillers will be quite regular, and
although the tokens that appear, and the order in which they appear,
will vary somewhat, the number of different possibilities is not
large.
7. CONCLUSIONS
In this paper we have, through an extended example,
argued the case that statistical language modeling
techniques are valuable for text mining. Different kinds of
tokens in text can be classified because different models
compress them better. Good results are achieved for
tokens in isolation. Taking each token’s context into
account reduces the error rate at the expense of an increase in the
number of symbols that are mistaken for plain text—a tradeoff that
is adjustable. The dynamic programming method allows this technique
to be used to identify tokens in running text with no clues as to
where they begin and end. The methodology works with
hierarchically-defined tokens, where each token can contain
subtokens. No explicit programming is required for token
identification: rather, machine learning methodology is used to
acquire identification information automatically from a marked-up
set of training documents. The result is automatic location and
classification of the items contained in test documents.

Applications of text mining based on language modeling are legion; we
have just begun to scratch the surface. Identifying references in
documents, locating information in tables (such as stock prices)
expressed in either html or plain text, inferring document structure,
finding names, addresses, and phone numbers on Web pages, data
detectors of any kind—all of these could be accomplished without any
explicit programming.
In order to investigate the application of language
modeling to text mining in a constrained context, the
experiments reported here have been self-contained. All
training and testing has taken place on issues of a
particular electronic magazine—we have eschewed the
use of any additional information. However, in practice, it
is easy and very attractive to prime models from external
sources—lists of names, organizations, geographical
locations, information sources, even randomly-generated
dates, sums of money, phone numbers. Priming will
greatly reduce the volume of training data that needs to be
marked up manually, making text mining practical even
with small amounts of training data.
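Priming is easy to arrange in a count-based implementation: the
counts of a token model can simply be seeded from an external list
before the marked-up training data are added. The sketch below is our
own illustration, with an invented weighting scheme and toy lists;
nothing in it is taken from the paper's implementation.

from collections import defaultdict

def char_counts(strings, order=2):
    """Order-`order` character context counts over a list of strings."""
    counts = defaultdict(lambda: defaultdict(int))
    for s in strings:
        padded = " " * order + s
        for i in range(order, len(padded)):
            counts[padded[i - order:i]][padded[i]] += 1
    return counts

def primed_model(external_list, marked_up_examples, prior_weight=0.2):
    """Prime a token model from an external word list, then let the
    (much smaller) hand-marked training data dominate as it accumulates.
    The weighting scheme here is an assumption, not the paper's."""
    model = defaultdict(lambda: defaultdict(float))
    for ctx, followers in char_counts(external_list).items():
        for ch, c in followers.items():
            model[ctx][ch] += prior_weight * c
    for ctx, followers in char_counts(marked_up_examples).items():
        for ch, c in followers.items():
            model[ctx][ch] += c
    return model

# e.g. prime a name model from a gazetteer, then add the hand-marked names
names = primed_model(["Alice Walker", "John Smith", "Mei Chen"],
                     ["Po Bronson", "Maria Zemankova"])
print(len(names), "contexts")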
ACKNOWLEDGMENTS

We are very grateful to Stuart Inglis and John Cleary who provided
valuable advice and assistance.

REFERENCES

Bell, T.C., Cleary, J.G. and Witten, I.H. (1990) Text compression.
Prentice Hall, Englewood Cliffs, New Jersey.

Chinchor, N.A. (1999) "Overview of MUC-7/MET-2." Proc Message
Understanding Conference MUC-7.

Cleary, J.G. and Witten, I.H. (1984) "Data compression using adaptive
coding and partial string matching." IEEE Trans on Communications,
Vol. 32, No. 4, pp. 396–402.

Grover, C., Matheson, C. and Mikheev, A. (1999) "TTT: Text
Tokenization Tool." http://www.ltg.ed.ac.uk/

Howard, P.G. (1993) The design and analysis of efficient lossless
data compression systems. PhD thesis, Brown University, Providence,
RI.

Nardi, B.A., Miller, J.R. and Wright, D.J. (1998) "Collaborative,
programmable intelligent agents." Comm ACM, Vol. 41, No. 3, pp.
96–104.

Reif, J.H. and Storer, J.A. (1997) "Optimal lossless compression of a
class of dynamic sources." Proc Data Compression Conference, edited
by J.A. Storer and J.H. Reif. IEEE Computer Society Press, Los
Alamitos, CA, pp. 501–510.

Teahan, W.J. (1997) Modelling English text. PhD thesis, University of
Waikato, NZ.

Teahan, W.J., Inglis, S., Cleary, J.G. and Holmes, G. (1997)
"Correcting English text using PPM models." Proc Data Compression
Conference, edited by J.A. Storer and J.H. Reif. IEEE Computer
Society Press, Los Alamitos, CA, pp. 289–298.

Tkach, D. (1997) Text mining technology: Turning information into
knowledge. IBM White paper.

Volf, P.A.J. and Willems, F.M.J. (1997) "Switching between two
universal source coding algorithms." Proc Data Compression
Conference, edited by J.A. Storer and J.H. Reif. IEEE Computer
Society Press, Los Alamitos, CA, pp. 491–500.