Using Sentence-Selection Heuristics to Rank Text
Segments in TXTRACTOR
Daniel McDonald and Hsinchun Chen
Artificial Intelligence Lab
Management Information Systems Department
University of Arizona
Tucson, AZ 85721, USA
520-621-2748
{dmm, hchen}@eller.arizona.edu
ABSTRACT
TXTRACTOR is a tool that uses established sentence-selection
heuristics to rank text segments, producing summaries that
contain a user-defined number of sentences. The purpose of
identifying text segments is to maximize topic diversity, which is
an adaptation of the Maximal Marginal Relevance criterion used
by Carbonell and Goldstein [5]. Sentence selection heuristics are
then used to rank the segments. We hypothesize that ranking text
segments via traditional sentence-selection heuristics produces a
balanced summary with more useful information than one
produced by using segmentation alone. The proposed summary is
created in a three-step process, which includes 1) sentence
evaluation 2) segment identification and 3) segment ranking. As
the required length of the summary changes, low-ranking
segments can then be dropped from (or higher ranking segments
added to) the summary. We compare the output of TXTRACTOR
to the output of a segmentation tool based on the TextTiling
algorithm to validate the approach.
Categories and Subject Descriptors
I.2.7 Natural Language Processing - Language parsing and understanding, Text analysis

General Terms
Algorithms

Keywords
Text summarization, text segmentation, information retrieval, text extraction

1. INTRODUCTION
1.1 Digital Libraries
Automatic text summarization offers potential benefits to the operation and design of digital libraries. As digital libraries grow in size, so does the user's need for information-filtering tools. Indicative text summarization systems support the user in deciding which documents to view in their totality and which to ignore. Some summarization techniques use measures of query relevance to tailor the summary to a specific query [22] [5]. Providing tools for users to sift through query results can potentially ease the burden of information overload.

Using document summaries can also potentially improve the results of queries on digital libraries. Relevance-feedback methods usually select terms from entire documents in order to expand queries. Lam-Adesina and Jones found query expansion using document summaries to be considerably more effective than query expansion using full documents [13]. Other summarization research explores the processing of summaries instead of full documents in information retrieval tasks [18, 21]. Using summaries instead of full documents in a digital library has the potential to speed query processing and facilitate greater post-retrieval analysis, again potentially easing the burden of information overload.

1.2 Background
Approaches to text summarization vary greatly. A distinction is frequently made between summaries generated by text extraction and those generated by text abstraction. Text extraction is
widely used [10], utilizing sentences from a document to create a
summary. Early examples of summarization techniques utilized
text extraction [16]. Text abstraction programs, on the other hand,
produce grammatical sentences that summarize a document’s
concepts. The concepts in an abstract are often thought of as
having been compressed. While the formation of an abstract may
better fit the idea of a summary, its creation involves greater
complexity and difficulty [10]. Producing abstracts usually
involves several stages such as topic fusion and text generation
that are not required for text extracts. Recent summarization
research has largely focused on text extraction with renewed
interest in sentence-selection summarization methods in particular
[17]. An extracted summary remains closer to the original
document, by using sentences from the text, thus limiting the bias
that might otherwise appear in a summary [16]. TXTRACTOR
continues this trend by utilizing text extraction methods to
produce summaries.
The goals of text summarizers can be categorized by their intent,
focus, and coverage [7]. Intent refers to the potential use of the
summary. Firmin and Chrzanowski divide a summary’s intent into
three main categories: indicative, informative, and evaluative.
Indicative summaries give an indication of the central topic of the
original text or enough information to judge the text’s relevancy.
Informative summaries can serve as substitutes for the full
documents and evaluative summaries express the point of view of
the author on a given topic. Focus refers to the summary’s scope,
whether generic or query-relevant. A generic summary is based on
the original text, while a query-relevant summary is based on a
topic selected by the user. Finally, coverage refers to the number
of documents that contribute to the summary, whether the
summary is based on a single document or multiple documents.
TXTRACTOR uses a text extraction approach to produce
summaries that are categorized as indicative, generic, and based
only on single documents.
2. RELATED RESEARCH
TXTRACTOR is most strongly related to the research by
Carbonell and Goldstein [5] that strives to reduce the redundancy
of information in a query-focused summary. Carbonell and
Goldstein introduce the concept of Maximal Marginal Relevance
(MMR), where each sentence is ranked based on a combination of
a relevance and diversity measure. The consideration of diversity
in TXTRACTOR is achieved by segmenting a document using the
TextTiling algorithm [9]. Sentences coming from different text
segments are considered adequately diverse. All text segments
must be represented in a summary before additional sentences
from an already represented segment can be added. Nomoto [18]
and Radev [19] also present different ways to implement diversity
calculations for summary creation. Unlike the summarization work of Carbonell and Goldstein, however, TXTRACTOR is not query-focused; it uses sentence-selection heuristics, instead of query relevance, to rank a document's sentences.
2.1 Sentence Selection
Much research has been done on techniques to identify sentences
that effectively summarize a document. Luhn in 1958 first utilized
word-frequency-based rules to identify sentences for summaries
[16]. Edmundson (1969) added three rules in addition to word
frequencies for selecting sentences to extract, including cue
phrases (e.g., “significant,” “impossible,” “hardly”), title and
heading words, and sentence location (words starting a paragraph
were more heavily weighted) [6]. The ideas behind these older
approaches are still referenced in modern text extraction research.
Sentence-selection methods assist in finding the salient sentences
in a document. By salient, we mean sentences a user would
include in a summary. Sentence-selection methods have been reviewed extensively. Teufel and Moens found the use of
cue phrases to be the best individual method [24]. Kupiec et al.,
on the other hand, found the position-based method to be the best
[12]. Regarding the combining of sentence-selection heuristics,
research conducted by Kupiec, Pedersen, and Chen found that the
best mix of extraction methods included position, cue phrase, and
sentence length. Aone et al. tested several different variations of tf*idf and the use or suppression of proper names in their
system DimSum [3]. Goldstein et al. found that summary
sentences had 90 percent more proper nouns per sentence [8].
When deciding which combination of extraction methods to use in
TXTRACTOR, we assume each method is independent and its
impact can be aggregated into the total sentence score. As we
conduct additional summarization experimentation, we will
further refine our use of sentence-selection methods and add
additional promising methods.
Despite the usefulness of sentence extraction methods in finding salient sentences, they cannot alone produce the highest-quality extracts. Sentence-selection techniques are often domain-dependent. For example, the words "Abstract" and "in conclusion" are more likely to appear in scientific literature than in newspaper articles [10]. Position-based methods are also domain-dependent: the first sentence in a paragraph contains the topic sentence in some domains, whereas it is the last sentence elsewhere. Combined with other techniques, however, these extraction methods can still contribute to the quality of a summary.

2.2 Document Segmentation
Document segmentation is an Information Retrieval (IR) approach to summarization. Narrowing the scope from a collection of documents, the IR approach views a single document as a collection of words and phrases from which topic boundaries must be identified [10]. Recent research in this field, particularly the TextTiling algorithm [9], suggests that a document's topic boundaries can be identified with a fair amount of success. Once a document's segments have been identified, sentences from within the segments are typically extracted using word-based rules in order to turn the segments into a summary. Breaking a document into segments identifies the document's topic boundaries and helps ensure that its topics are adequately represented in a summary.

The IR approach to extraction does have some weaknesses. A word-level focus "prevents researchers from employing reasoning at the non-word level" [10]. While the IR technique successfully segments single documents into topic areas [9], the selection of sentences to extract from within those topic areas could be improved by using many different heuristics, both word-based and those that utilize language knowledge. In addition, once a document is segmented, there is no way to know which of the segments is the most salient to the overall document. Some mechanism is required to rank segments so that the most pertinent topic information either gets extra coverage in the summary or is covered first. Ranking segments also addresses a practical problem: when the required number of sentences in a summary is less than the number of identified segments, there must be an intelligent way to decide which segments will not be covered. One possible solution is to force a document to have a number of segments that matches the number of sentences allowed in the summary. Presetting the number of acceptable topic areas, however, seems to defeat the purpose of true segment identification; segment boundaries would seem arbitrary if there were a limit on their number. The process of finding a document's topic areas is separate from that of selecting representative sentences to appear in the summary. A ranking of the segments, however, would allow a summary to grow and shrink while extracting sentences from as many of the highest-ranked topic areas as possible. While this approach would not be suited to an informative summary, ranking segments and controlling the number of sentences in a summary are acceptable for an indicative summary.
2.3 Combination Proposal
TXTRACTOR attempts to capture the benefits of sentence-selection summarization and document segmentation while overcoming many of their deficiencies. The document segmentation algorithm identifies the document's main topic areas, while sentence-selection heuristics identify the salient sentences of a summary. The topic areas are used as the foundation for the summary, and the salient sentences are used as a compass guiding the inclusion of certain topic areas. Document segmentation provides a thorough, domain-independent analysis of the entire document, created in a bottom-up manner. Sentence-selection heuristics provide saliency information in a structured top-down manner. In addition, we have included many sentence-selection techniques in order to reduce the domain-dependency effect of any one heuristic. We hypothesize that ranking a document's segments on the basis of their containing one or many of the document's salient sentences will produce summaries that are more information-rich than those produced by the segmentation-only approach.
3. TXTRACTOR IMPLEMENTATION
TXTRACTOR is a summarizer based on text extraction written in
Java. Its major components include sentence-selection rules and a
segmentation algorithm. The summarization process takes place in
three main steps: 1) sentence evaluation, 2) segmentation or topic-boundary identification, and 3) segment ranking and extraction.
3.1 Sentence Evaluation:
The summarization process begins by parsing the sentences of the
original text using a program that recognizes 60 abbreviations and
various punctuation anomalies. The original order of the sentences is preserved so they can be added to the summary in
the intended order. Once the sentences are identified,
TXTRACTOR begins ranking each sentence. We use five
sentence-selection heuristics to evaluate the document’s
sentences. Each of the following ranking methods contributes to a sentence's overall score.
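Before any heuristic fires, the splitter must decide which periods actually end a sentence. The paper does not enumerate the 60 recognized abbreviations, so the following Java sketch is only illustrative of abbreviation-aware splitting, with a small sample list standing in:

import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class SentenceSplitter {
    // Illustrative sample only; TXTRACTOR recognizes 60 abbreviations.
    private static final Set<String> ABBREVIATIONS = new HashSet<>(
            Arrays.asList("Dr.", "Mr.", "Mrs.", "Ph.D.", "e.g.", "i.e.", "etc.", "vs."));

    /** Splits on '.', '!' or '?' unless the token is a known abbreviation. */
    public static List<String> split(String text) {
        List<String> sentences = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (String token : text.trim().split("\\s+")) {
            current.append(token).append(' ');
            boolean endsSentence = token.matches(".*[.!?][\"')]?")
                    && !ABBREVIATIONS.contains(token);
            if (endsSentence) {
                sentences.add(current.toString().trim());
                current.setLength(0);
            }
        }
        if (current.length() > 0) {
            sentences.add(current.toString().trim()); // trailing fragment
        }
        return sentences; // original document order is preserved
    }
}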
3.1.1 Presence of cue phrases
Currently, each sentence is checked for the existence of ten
different cue phrases (e.g. “in summary,” “in conclusion,” “in
short,” “therefore”). Cue phrases are words that signal to a reader
that the author is going to summarize his or her idea or subject.
The cue phrases are loaded out of a text file so that additional
words can be easily added as more experimentation is done in this
area.
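A minimal sketch of this rule follows; the class name, file format (one phrase per line), and the example weight are illustrative assumptions rather than TXTRACTOR's actual code, although Figure 1 does show a +10 contribution for the cue phrase "thus":

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.List;

public class CuePhraseRule {
    private final List<String> cuePhrases; // loaded from a plain text file, one phrase per line
    private final double weight;           // normalized impact, e.g. 10 points as in Figure 1

    public CuePhraseRule(String phraseFile, double weight) throws IOException {
        this.cuePhrases = Files.readAllLines(Paths.get(phraseFile));
        this.weight = weight;
    }

    /** Returns the cue-phrase contribution to a sentence's score. */
    public double score(String sentence) {
        String lower = sentence.toLowerCase();
        for (String phrase : cuePhrases) {
            if (lower.contains(phrase.toLowerCase())) {
                return weight; // one hit is enough to earn the bonus
            }
        }
        return 0.0;
    }
}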
3.1.2 Proper nouns in the sentence
A TXTRACTOR-generated summary is meant to provide enough
information for a user to be able to decide whether she or he
wants to read the original document in its entirety. Important to
this decision is the existence of certain proper names and places.
Currently, TXTRACTOR simply reads each sentence and counts
the capitalized words, not including the opening word in the
sentence. This is meant as a temporary implementation until a full
entity-extraction algorithm can be implemented. The total number
of capitalized words in each sentence is then averaged over the
number of words in the sentence. Shorter sentences are thus not
penalized for having fewer proper nouns than longer sentences.
The average number of proper nouns is then normalized and
added to the sentence’s score.
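A minimal sketch of the capitalized-word count described above is shown below; the normalization factor is a tuned constant in TXTRACTOR, so here it is exposed as a placeholder parameter:

public class ProperNounRule {
    /**
     * Counts capitalized words (excluding the sentence-opening word) and
     * averages the count over sentence length, so shorter sentences are
     * not penalized. The normalization factor is a placeholder; the paper
     * reports tuning it experimentally.
     */
    public static double score(String sentence, double normalizationFactor) {
        String[] words = sentence.trim().split("\\s+");
        if (words.length < 2) return 0.0;
        int capitalized = 0;
        for (int i = 1; i < words.length; i++) { // skip the opening word
            if (Character.isUpperCase(words[i].charAt(0))) capitalized++;
        }
        double average = (double) capitalized / words.length;
        return average * normalizationFactor;
    }
}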
3.1.3 TF*IDF
Tf*idf measures how frequently the words in a sentence occur
relative to their occurrence in the entire document. Sentences that
have document words in common are scored higher. To calculate
tf*idf, the occurrence of every word in a sentence and the word’s
total occurrences in the document are totaled. Before the terms are
totaled, however, each word is made lower-case and stemmed
using the Porter stemmer. The Porter stemmer is one of the most
widely used stemming algorithms [11] and can be thought of as a
lexicon-free stemmer because it uses cascaded rewrite rules that
can be run very quickly and do not require the use of a lexicon.
Stemming is performed so that words with the same stem but
different affixes may be treated as the same word when
calculating the frequency of a particular term. The tf*idf calculation is computed and then averaged over the length of the sentence. Thus, a high tf*idf score for a sentence is normalized for sentence length. The resulting score is then added to the sentence's score.
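A minimal sketch of this frequency scoring is given below, with the Porter stemmer hidden behind a placeholder interface (any implementation exposing a stem(String) method would do; stop-word handling is omitted):

import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class TermFrequencyRule {
    /** Placeholder for a Porter stemmer implementation. */
    public interface Stemmer { String stem(String word); }

    /** Scores each tokenized sentence by its words' document-wide frequencies. */
    public static double[] score(List<String[]> sentences, Stemmer stemmer) {
        // Document-wide counts over lower-cased, stemmed terms.
        Map<String, Integer> docFreq = new HashMap<>();
        for (String[] sentence : sentences) {
            for (String word : sentence) {
                String term = stemmer.stem(word.toLowerCase());
                docFreq.merge(term, 1, Integer::sum);
            }
        }
        double[] scores = new double[sentences.size()];
        for (int i = 0; i < sentences.size(); i++) {
            String[] sentence = sentences.get(i);
            double total = 0.0;
            for (String word : sentence) {
                total += docFreq.get(stemmer.stem(word.toLowerCase()));
            }
            // Average over sentence length so long sentences are not over-weighted.
            scores[i] = sentence.length > 0 ? total / sentence.length : 0.0;
        }
        return scores;
    }
}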
3.1.4 Sentence position in a paragraph
As the sentences are extracted from the original document, new lines and carriage returns signal the beginning of new paragraphs. The beginning sentence of a document and the beginning sentence of a paragraph are given additional weight due to their greater summarizing potential.

3.1.5 Sentence length
The length of a sentence can provide clues as to its usefulness in a summary [12] [3]. Before adding the sentence-length heuristic, we tried to achieve the same effect by simply not averaging tf*idf scores over the number of words in the sentence. Longer sentences would naturally score higher because they contain more non-stop-word terms. This approach overly weighted long sentences, to the point where scores from the tf*idf equation would overpower the scores in other areas, while normalizing that score would mute the value of a concentrated sentence with many document-wide terms. To solve this problem, we made sentence length its own rule. The length of a sentence is calculated and its impact added to the sentence's weight.

Because each of the five sentence-selection rules is calculated differently, each score had to be normalized so the impact of each rule would be comparable. For example, the impact of an extra proper noun in a sentence is not the same as that of a sentence occurring first in a paragraph or of a sentence being very long. The current normalization factor for each heuristic was determined through experimentation.

TXTRACTOR has a configuration option that allows the user to adjust the impact of each sentence-selection heuristic without having to recompile the program. Because each extraction heuristic was normalized, a user can change the weighting of a particular heuristic and immediately judge its impact on the summary. Including the configuration capability facilitates experimentation with different heuristic weights. In addition, while the summary-generation logic of TXTRACTOR was designed to be reasonably domain-independent, a user can still change the weighting given to different sentence-selection rules through the configuration option, thus customizing the summarizer to different domains and uses. For example, a user may want to see as many proper nouns as possible in the summary. Increasing the weight of the proper-nouns rule will cause sentences with proper nouns to move to the top of the sentence ranking and thus appear more often in the summary.

Once each sentence is scored based on the above five heuristics, the sentences are ranked according to their summarizing value. Unlike segmentation-only approaches, sentences are ranked against other sentences from the entire document, not only those within the same topic area. An example of sentence scoring is shown in Figure 1. The three sentences listed are the top three sentences extracted from a document entitled "May the Source be With You" from Wired Magazine [15]. A nearly complete copy of the article is found in Figure 4. Each sentence in Figure 1 comes from a different topic segment. The highest-scoring sentence greatly benefits from being the first sentence in the article (+20). This position heuristic has been shown to be the most effective of all sentence-selection heuristics [12]. All three sentences begin a paragraph (+10). The second sentence contains the cue phrase "thus," adding 10 points. The cue phrase allows it to outrank the third sentence despite the third sentence having the highest tf*idf and sentence-length values of the three.
(topic: 0) (sentence 0) (score: 95) "The laws protecting software code are stifling creativity, destroying knowledge, and betraying the public trust." (First document sentence +30, first sentence of paragraph +20, 0 for proper nouns, +34 for tf*idf, +11 for sentence length = score of 95)

(topic: 5) (sentence 52) (score: 85) "Thus, I would dramatically reduce the safeguards for software from the ordinary term of 95 years to an initial term of 5 years, renewable once." (First sentence of paragraph +20, +10 for cue phrase "thus," +0 for proper nouns, +41 for tf*idf, +14 for sentence length = score of 85)

(topic: 1) (sentence 23) (score: 85) "Finally, while control is needed, and perfectly warranted, our bias should be clear up front: Monopolies are not justified by theory; they should be permitted only when justified by facts." (First sentence of paragraph +20, +1 for proper nouns, +45 for tf*idf, +18 for sentence length = score of 85)

Figure 1 - The weighting of individual sentences
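The totals in Figure 1 are, in effect, a weighted sum of normalized heuristic scores. The following sketch shows one way such a combination could be wired together; the Heuristic interface and the weight values are illustrative placeholders, not TXTRACTOR's actual classes, though the paper does state that weights are user-adjustable without recompiling:

import java.util.LinkedHashMap;
import java.util.Map;

public class SentenceScorer {
    /** A sentence-selection heuristic returning a normalized score for a sentence. */
    public interface Heuristic { double score(String sentence); }

    private final Map<Heuristic, Double> weightedHeuristics = new LinkedHashMap<>();

    /** Weights can be changed at run time, mirroring the configuration option. */
    public void addHeuristic(Heuristic heuristic, double weight) {
        weightedHeuristics.put(heuristic, weight);
    }

    /** Total score = sum over heuristics of (weight * normalized heuristic score). */
    public double score(String sentence) {
        double total = 0.0;
        for (Map.Entry<Heuristic, Double> entry : weightedHeuristics.entrySet()) {
            total += entry.getValue() * entry.getKey().score(sentence);
        }
        return total;
    }
}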
3.2 Segmentation:
The segmentation algorithm used is based on the TextTiling
algorithm developed by Marti Hearst [9]. The TextTiling
algorithm analyzes a document and determines where the topic
boundaries are located. A topic boundary can be thought of as the
point at which the author of the document changes subjects or
themes. The first step in the TextTiling algorithm is to divide the
text into token-sequences, removing any words that appear on the
stop list. We have used a token-sequence length of 20 and the
same stop-word list used by Marti Hearst in TextTiling. Token-sequences are then combined to form blocks. Blocks are
compared using a similarity algorithm. The comparison between
blocks functions like a sliding window. The first block contains
the first token-sequence plus k token-sequences before it. The
second block contains the second token-sequence and the k token-sequences after it. The value for k used in our summarizer is 10,
also the same one used by Marti Hearst in TextTiling. The blocks
are then compared using an algorithm that returns the similarity as
a percentage, which is derived from the number of times the same
terms appear in the two blocks being compared. The Jaccard
coefficient is used for the similarity equation, which differs
slightly from the normalized inner product equation used by
Hearst. We did not consider the impact of using different
similarity equations to be significant. The Jaccard coefficient is as
follows:
$$ S_{i,j} = \frac{\sum_{k=1}^{L} w_{ik}\, w_{jk}}{\sum_{k=1}^{L} w_{ik}^{2} \;+\; \sum_{k=1}^{L} w_{jk}^{2} \;-\; \sum_{k=1}^{L} w_{ik}\, w_{jk}} $$

where S_{i,j} is the similarity between the two blocks of grouped token-sequences i and j, w_{ik} is the number of occurrences of term k in block i, and L is the total number of distinct terms.
Once the topic boundaries have been identified, TXTRACTOR assigns each sentence to a document segment. After all sentences have been given weights and assigned to segments, segment ranking and sentence extraction can operate.
3.3 Segment Ranking:
Once a document is segmented into its main topic areas, TXTRACTOR ranks the document segments based on the scores given to sentences by the sentence-selection heuristics. An example of the segment ranking is shown in Figure 2. High-ranking sentences are added first to the summary. Two sentences from the same segment are not included in the summary (regardless of their ranking) until a sentence from each segment has been included. Once all segments are represented in the summary, the process starts over, again adding one sentence from each segment. Remaining ranked sentences are added by segment until the summary-length requirement is met. Once the length requirement is met, the sentences are sorted by the order in which they appeared in the original document and displayed on the screen. Figure 3 shows pseudo code for the segment-ranking routine. Document segmentation, therefore, provides the topic structure for a document within which sentence selection can be utilized to identify the salient topic areas. It is of practical advantage to rank segments: a user can easily change the desired length of the summary while the ranking routine identifies which segments, represented by their sentences, to add to or drop from the summary.

Figure 2 - Text segmentation and sentence-selection combined. (Ranked sentences are grouped into segments; a two-sentence summary would include the sentences ranked 1 and 3, while a five-sentence summary would include the sentences ranked 1, 2, 15, 3, and 4.)

Rank segments (array of ranked sentences)
    while (summary length not achieved)
        for (each ranked sentence in array)
            if (sentence's segment not already used)
                if (summary length achieved)
                    break
                end if
                add sentence to summary
            else
                add sentence to temp array for recursive call
            end else
        end for
        Rank segments (temp array)
    end while
end Rank segments

for (all summary sentences)
    rank sentences by original document order
end for

Figure 3 - Pseudo code for segment ranking
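For readers who prefer working code to pseudo code, the following is a minimal iterative rendering of the same round-robin routine; the Sentence record is an illustrative stand-in for TXTRACTOR's internal representation, not the system's actual code:

import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class SegmentRanker {
    /** Minimal sentence representation: document position, segment id, heuristic score. */
    public record Sentence(int position, int segment, double score) {}

    /** Picks up to maxLength sentences, never repeating a segment within a round. */
    public static List<Sentence> summarize(List<Sentence> ranked, int maxLength) {
        List<Sentence> summary = new ArrayList<>();
        List<Sentence> remaining = new ArrayList<>(ranked);
        remaining.sort(Comparator.comparingDouble(Sentence::score).reversed());
        while (summary.size() < maxLength && !remaining.isEmpty()) {
            Set<Integer> usedSegments = new HashSet<>(); // reset each round
            List<Sentence> deferred = new ArrayList<>();
            for (Sentence s : remaining) {
                if (summary.size() >= maxLength) break;
                if (usedSegments.add(s.segment())) {
                    summary.add(s);    // best remaining sentence of an unused segment
                } else {
                    deferred.add(s);   // its segment is taken this round; try next round
                }
            }
            remaining = deferred;
        }
        // Restore original document order for display.
        summary.sort(Comparator.comparingInt(Sentence::position));
        return summary;
    }
}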
4. PRELIMINARY TESTING
As a preliminary test of the performance of the TXTRACTOR
summarizer, subjects compared summaries produced by
segmentation alone with summaries produced by TXTRACTOR.
A length limit of five sentences was imposed on all the
summaries.
4.1 Segmented Summaries
The summaries produced by the segmentation-only approach used
the same segmenting code as that used in TXTRACTOR. After
segments were identified, every word in each segment, except for
those on the stop list, was scored based on the tf*idf equation. The
two highest-ranking terms from the segment were identified along
with the first occurrence of each of the terms in the segment. The sentence(s) in which those first occurrences appeared were then added to the summary. Each segment produced one or two sentences, depending on whether one sentence contained both of the top two keywords. This procedure was repeated for every segment until the summary contained at least one sentence from each segment. In cases where there were more than
five sentences, only the first five sentences were included in the
summary for comparison purposes.
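As an illustration, the baseline's per-segment selection might be sketched as follows; the tf*idf scores are assumed to be precomputed in a map and stop words removed upstream, so this is a sketch of the described procedure, not the study's actual code:

import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Map;

public class SegmentBaseline {
    /**
     * For one segment, finds the two highest-scoring terms and returns the
     * indexes of the sentences where each term first occurs (a single index
     * if both terms first occur in the same sentence).
     */
    public static List<Integer> pickSentences(List<String[]> segmentSentences,
                                              Map<String, Double> tfidf) {
        List<String> topTerms = tfidf.entrySet().stream()
                .sorted(Map.Entry.<String, Double>comparingByValue(Comparator.reverseOrder()))
                .limit(2)
                .map(Map.Entry::getKey)
                .toList();
        List<Integer> picks = new ArrayList<>();
        for (String term : topTerms) {
            for (int i = 0; i < segmentSentences.size(); i++) {
                if (List.of(segmentSentences.get(i)).contains(term)) {
                    if (!picks.contains(i)) picks.add(i); // one sentence may cover both terms
                    break; // first occurrence only
                }
            }
        }
        return picks;
    }
}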
4.2 TXTRACTOR Configuration
The configuration settings of the document-segmenting algorithm
were kept constant between the TXTRACTOR system and the
segmentation-only system. Token-sequences of 20 tokens were used, and 10 token-sequences were added together to form a
block for the similarity comparisons. Blocks were allowed to
cross sentence boundaries and no stemming or noun phrasing was
applied in identifying the document segments. Stemming,
however, was used in calculating tf*idf in the sentence-selection
portion of TXTRACTOR. The segmenting code was allowed to
determine how many topic areas the document had instead of
being forced to generate the boundaries for a predetermined
number of topics.
4.3 An Example Document
While not included in the summaries evaluated in the user studies, the article in Figure 4 is a good example of the differences between the TXTRACTOR summaries (referenced by "TXT#" in the figure) and the summaries generated by the segmentation-only approach (referenced by "SEG#" in the figure) [15]. Large asterisks and segment numbers highlight the breaks in the document segments. The first sentence selected by TXTRACTOR is the first sentence in the document, despite the fact that it had a 10-point lower tf*idf score than the first segmentation sentence. The first sentence, however, is a very good summarizing sentence. The segmentation approach then selects a second sentence from the first topic area. The two summaries then select the same sentence from segment two. Later, TXTRACTOR skips over the third topic area, while the segmentation algorithm adds its final two sentences from that topic area; a segmentation summary tries to include sentences from every segment. TXTRACTOR had ranked the two sentences added in the segmentation summary 50th and 62nd, respectively. Those sentences had low scores for sentence length and somewhat low scores for tf*idf. Sentences three and four of the TXTRACTOR summary are scored highly due to the included cue phrase, "thus". The final sentence selected by TXTRACTOR is not rated in the top five sentences (it is sixth), but because two sentences in the top five come from the first topic area, room in the summary is preserved for a segment not already represented. Thus, the ranking routine includes a sentence from the seventh segment in the summary instead of duplicating sentences in a segment.

4.4 Document Selection
Five documents were deliberately selected from a variety of subject domains, ranging from psychology and sports to arts and science. The subjects of the documents were varied in order to see whether the TXTRACTOR approach had limitations in certain subject domains. Effort was also made to vary the length of the documents, which ranged from 537 to 13,293 words. Documents of different lengths were selected so that varied numbers of segments would be created. By including long articles, we hoped to get preliminary clues as to which summarizer prioritized a document's segments best and whether prioritizing segments led to improved summaries. In this experiment, we did not ask the subjects to judge the cohesiveness of the summaries; we tried to focus the user on the information content of the summary, not its cohesiveness.

We selected the following five documents to be summarized:
1. Turning Snooping Into Art, by Noah Shachtman, 773 words [23]
2. No. 4 Virginia suffers first loss of season, Game Day Recap, 537 words [20]
3. Nanotech Fine Tuning, by Mark K. Anderson, 654 words [2]
4. Ann Landers, by Ann Landers, 650 words [14]
5. A Primer on Narcissism, by Sam Vaknin, Ph.D., 13,293 words [25]

4.5 Experiment Participants
Five subjects were chosen to compare the TXTRACTOR summary with the summary generated by the segmentation-only approach. All subjects were above the age of 20 and were either completing or had already obtained a bachelor's degree. The subjects were emailed the ten summaries, grouped by original document. Participants were directed to choose the summary that provided the most pertinent information and seemed to be the most useful in general.
1**[
TXT1
The laws protecting software code are stifling creativity, destroying knowledge, and betraying the public trust. Legal heavy Lawrence Lessig argues it's time to bust the copyright monopoly.

SEG1
In the early 1970s, RCA was experimenting with a new technology for distributing film on magnetic tape - what we would come to call video. Researchers were keen not only to find a means for reproducing celluloid with high fidelity but also to discover a way to control the use of the technology. Their aim was a method that could restrict the use of a film distributed on video, allowing the studio to maximize the film's return from distribution.

The technology eventually chosen was relatively simple. A video would play once, and when finished, the cassette would lock into place. If a customer wanted to play the tape again, she would have to return it to the video store and have it unlocked. …

They were horrified. They would "never," Feely reported, permit their content to be distributed in that form, because the content - however clever the self-locking tape was - was still insufficiently controlled. How could they know, one of the Disney execs asked Feely, "how many people were going to be sitting there watching" a film? What's to stop someone else from coming in and watching for free?

SEG2
We live in a world with "free" content, and this freedom is not an imperfection. We listen to the radio without paying for the songs we hear; we hear friends humming tunes that they have not licensed. We tell jokes that reference movie plots without the permission of the directors. We read our children books, borrowed from a library, without paying the original copyright holder for the performance rights. The fact that content at a particular time may be free tells us nothing about whether using that content is theft. Similarly, in arguing for increasing content owners' control over content users, it's not sufficient to say "They didn't pay for this use."

Second, the reason perfect control has not been our tradition's aim is that creation always involves building upon something else. There is no art that doesn't reuse. And there will be less art if every reuse is taxed by the appropriator.
]**2**[
Monopoly controls have been the exception in free societies; they have been the rule in closed societies.

TXT2
SEG3
Finally, while control is needed, and perfectly warranted, our bias should be clear up front: Monopolies are not justified by theory; they should be permitted only when justified by facts. If there is no solid basis for extending a certain monopoly protection, then we should not extend that protection. This does not mean that every copyright must prove its value initially. That would be a far too cumbersome system of control. But it does mean that every system or category of copyright or patent should prove its worth. Before the monopoly should be permitted, there must be reason to believe it will do some good - for society, and not just for monopoly holders.
]**3**[
SEG4
One example of this expansion of control is in the realm of software. Like authors and publishers, coders (or more likely, the companies they work for) enjoy decades of copyright protection. Yet the public gets very little in return. The current term of protection for software is the life of an author plus 70 years, or, if it's work-for-hire, a total of 95 years. This is a bastardization of the Constitution's requirement that copyright be for "limited times." By the time Apple's Macintosh operating system finally falls into the public domain, there will be no machine that could possibly run it. The term of copyright for software is effectively unlimited.

SEG5
Worse, the copyright system safeguards software without creating any new knowledge in return. When the system protects Hemingway, we at least get to see how Hemingway writes.
]**4**[
We get to learn about his style and the tricks he uses to make his work succeed. We can see this because it is the nature of creative writing that the writing is public. There is no such thing as language that conveys meaning while not simultaneously transmitting its words. Software is different: Software gets compiled, and the compiled code is essentially unreadable; but in order to copyright software, the author need not reveal the source code.
TXT3
Thus, while the English department gets to analyze Virginia Woolf's novels to train its students in better writing, the computer science department doesn't get to examine Apple's operating system to train its students in better coding.
]**5**[
The harm that comes from this system of protecting creativity is greater than the loss experienced by computer science education. While the creative works from the 16th century can still be accessed and used by others, the data in some software programs from the 1990s is already inaccessible. Once a company that produces a certain product goes out of business, it has no simple way to uncover how its product encoded data. The code is thus lost, and the software is inaccessible. Knowledge has been destroyed.

Copyright law doesn't require the release of source code because it is believed that software would become unprotectable. The open source movement might throw that view into doubt, but even if one believes it, the remedy (no source code) is worse than the disease. There are plenty of ways for software to be secured without the safeguards of law. Copy-protection systems, for example, give the copyright holder plenty of control over how and when the software is copied.
]**6**[
If society is to give software producers more protection than they would otherwise take, then we should get something in return. And one thing we could get would be access to the source code after the copyright expires.

TXT4
Thus, I would dramatically reduce the safeguards for software - from the ordinary term of 95 years to an initial term of 5 years, renewable once. And I would extend that government-backed protection only if the author submitted a duplicate of the source code to be held in escrow while the work was protected. Once the copyright expired, that escrowed version would be publicly available from the copyright office.

Most programmers should like this change. No code lives for 10 years, and getting access to the source code of even orphaned software projects would benefit all. More important, it would unlock the knowledge built into this protected code for others to build upon as they see fit. Software would thus be like every other creative work - open for others to see and to learn from.
]**7**[
There are other ways that the government could help free up resources for innovation. …

One context in particular where this could do some good is in orphaned software. Companies often decide that the costs of developing or maintaining software outweigh the benefits. They therefore "orphan" the software by neither selling it nor supporting it. They have little reason, however, to make the software's source code available to others. The code simply disappears, and the products become useless.

Software gets 95 years of copyright protection. By the time the Mac OS finally falls into the public domain, no machine will be able to run it.

TXT5
But if Congress created an incentive for these companies to donate their code to a conservancy, then others could build on the earlier work and produce updated or altered versions. This in turn could improve the software available by preserving the knowledge that was built into the original code. Orphans could be adopted by others who saw their special benefit.
]**8**[
The problems with software are just examples of the problems found generally with creativity. Our trend in copyright law has been to enclose as much as we can; the consequence of this enclosure is a stifling of creativity and innovation. If the Internet teaches us anything, it is that great value comes from leaving core resources in a commons, where they're free for people to build upon as they see fit. An Innovation Commons was the essence - the core - of the Internet. We are now corrupting this core, and this corruption will in turn destroy the opportunity for creativity that the Internet built.
]**

Figure 4 - Original text showing topics and sentences extracted via TXTRACTOR and a segmentation-only approach
4.6 Results
Despite the small size of the experiment, we were able to observe
some encouraging responses. Of the 25 comparisons that were
made between summaries, the TXTRACTOR summary was
preferred 14 times, the segmentation-only summary was preferred
8 times, and 3 times the summaries were judged to be more or less
the same. The subjects therefore preferred the TXTRACTOR
summaries 7:4 over the summaries generated by segmentation
only. After submitting their responses, the subjects were told
which summarizer produced which summary. The participants
then volunteered explanations for why they chose the summary
they did. A common sentiment was that the TXTRACTOR
summary contained more information, but the sentences
sometimes did not flow well. When the sentences flowed well, as
judged by the participants, the TXTRACTOR-produced summary
was usually preferred. An interesting note is that even though we
had not instructed the subjects to assess the readability of the
summary, users did not ignore the summary's cohesiveness. It seems that, even with indicative summaries, poor readability can distract a subject from information content.
5. CONCLUSION & FUTURE DIRECTION
Based on our tests, the TXTRACTOR summarizer outperformed
the summarizer based solely on segmentation. The hypothesis that
ranking segments through the use of established sentence-selection heuristics leads to better text-extracted summaries
appears to be promising. There is much that can be done,
however, to improve the performance of the summarizer. Future
improvements to TXTRACTOR include implementing the local
salience method of cohesion analysis [4]. The local salience
method is based on the assumption that relevant words and
phrases are revealed by a “combination of grammatical, syntactic,
and contextual parameters”. The original document is parsed to
identify a sentence’s subjects and predicates. Different weights
are then given to sentences based on the part-of-speech containing
the term being analyzed. Experimentation will be conducted on
how many parts-of-speech to parse out of each sentence.
Additional research is needed to tune the weights of the sentence-selection methods being used. Much research that has been done
in this area could be incorporated into our work. In addition,
analyzing the discourse context of the sentences should help
improve the cohesiveness of the summaries.
We are currently planning on conducting more complete
experiments and user studies on our combined segmentation and
sentence-selection approach to summarization. We are looking to
test our summarization approach on a larger scale, similar to that
done at the May 1998 SUMMAC conference [1]. Finally, we
would like to implement and test TXTRACTOR in different
digital library domains such as medical libraries and web pages.
6. ACKNOWLEDGMENTS
We would like to express our gratitude to NSF Digital Library
Initiative-2, “High-performance Digital Library Systems: From
Information Retrieval to Knowledge Management,” IIS-9817473,
April 1999 – March 2002. We also would like to thank William
Oliver for his implementation of the TextTiling algorithm and
Karina McDonald for her feedback on the summaries.
7. REFERENCES
[1] In TIPSTER Text Phase III 18-Month Workshop, (Fairfax, VA, 1998).
[2] Anderson, M.K. Nanotech Fine Tuning. http://www.wired.com/news/technology/0,1282,494472,00.html.
[3] Aone, C., Okurowski, M.E., Gorlinsky, J. and Larsen, B. A Trainable Summarizer with Knowledge Acquired from Robust NLP Techniques. In Maybury, M.T. ed. Advances in Automatic Text Summarization, The MIT Press, Cambridge, 1999, 71-80.
[4] Boguraev, B. and Kennedy, C. Salience-based Content Characterization of Text Documents. In Proceedings of the Workshop on Intelligent Scalable Text Summarization at the ACL/EACL Conference, (Madrid, Spain, 1997), 2-9.
[5] Carbonell, J. and Goldstein, J. The Use of MMR, Diversity-Based Reranking for Reordering Documents and Producing Summaries. In SIGIR, (Melbourne, Australia, 1998), 335-336.
[6] Edmundson, H.P. New Methods in Automatic Extracting. In Maybury, M.T. ed. Advances in Automatic Text Summarization, The MIT Press, Cambridge, 1969, 23-42.
[7] Firmin, T. and Chrzanowski, M.J. An Evaluation of Automatic Text Summarization Systems. In Maybury, M.T. ed. Advances in Automatic Text Summarization, The MIT Press, Cambridge, 1999.
[8] Goldstein, J., Kantrowitz, M., Mittal, V. and Carbonell, J. Summarizing Text Documents: Sentence Selection and Evaluation Metrics. In 22nd International Conference on Research and Development in Information Retrieval, (1999).
[9] Hearst, M.A. Segmenting Text into Multi-Paragraph Subtopic Passages. Computational Linguistics, 23(1), 1997, 33-64.
[10] Hovy, E. and Lin, C.-Y. Automated Text Summarization in SUMMARIST. In Maybury, M.T. ed. Advances in Automatic Text Summarization, The MIT Press, Cambridge, 1999, 81-94.
[11] Jurafsky, D. and Martin, J.H. Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition. Prentice Hall, Upper Saddle River, 2000.
[12] Kupiec, J., Pedersen, J. and Chen, F. A Trainable Document Summarizer. In Proceedings of the 18th ACM SIGIR Conference, (1995), 68-73.
[13] Lam-Adesina, A.M. and Jones, G.J.F. Applying Summarization Techniques for Term Selection in Relevance Feedback. In SIGIR, (New Orleans, Louisiana, USA, 2001), 1-9.
[14] Landers, A. Ann Landers. http://www.washingtonpost.com/wp-dyn/articles/A62823-2002Jan4.html.
[15] Lessig, L. May the Source Be With You. Wired Magazine, 9.12 (December 2001). http://www.wired.com/wired/archive/9.12/lessig.html.
[16] Luhn, H.P. The Automatic Creation of Literature Abstracts. In Maybury, M.T. ed. Advances in Automatic Text Summarization, The MIT Press, Cambridge, 1958, 15-22.
[17] Mani, I. and Maybury, M.T. (eds.). Advances in Automatic Text Summarization. The MIT Press, Cambridge, 1999.
[18] Nomoto, T. and Matsumoto, Y. A New Approach to Unsupervised Text Summarization. In SIGIR, (New Orleans, LA, USA, 2001), 26-34.
[19] Radev, D.R., Jing, H. and Budzikowska, M. Centroid-based Summarization of Multiple Documents: Sentence Extraction, Utility-based Evaluation, and User Studies. In ANLP/NAACL Workshop on Automatic Summarization, (Seattle, WA, 2000).
[20] Recap. No. 4 Virginia Suffers First Loss of Season. http://sports.espn.go.com/ncaa/mbasketball/recap?gameId=220050189, 2002.
[21] Sakai, T. and Sparck Jones, K. Generic Summaries for Indexing in Information Retrieval. In SIGIR, (New Orleans, Louisiana, USA, 2001), 190-198.
[22] Sanderson, M. Accurate User Directed Summarization from Existing Tools. In Conference on Information and Knowledge Management, (Bethesda, MD, USA, 1998), 45-51.
[23] Shachtman, N. Turning Snooping Into Art. http://www.wired.com/news/culture/0,1284,49439,00.html.
[24] Teufel, S. and Moens, M. Sentence Extraction as a Classification Task. In Workshop on Intelligent Scalable Text Summarization, ACL/EACL Conference, (Madrid, Spain, 1997), 58-65.
[25] Vaknin, S. A Primer on Narcissism. http://www.mentalhelp.net/poc/view_doc.php/type/doc/id/419.