The Use of Phrases and Structured Queries in Information Retrieval

W. Bruce Croft, Howard R. Turtle,* and David D. Lewis†
Computer and Information Science Department
University of Massachusetts, Amherst, MA 01003
Abstract
and
in information
tems.
retrieval,
In previous
as a source
This
like
in
this
for
show
that
results
improve
performance,
matically
extracted
nearly
in
queries
using
re-
by
this
way
are
language
selected
can
auto-
query
per-
phrases.
the
came
tion
must
obey
some
among
its
showed
significant
use of phrases
as part
of a text
representation
language
has been
investigated
since
of information
retrieval
research.
indexing
collections
ative
low
baseline
days
example,
included
phrase-based
field
studies
riety
of experiments
tem.
Certainly,
phrases,
(1966).
if used
tained
(1968)
using
there
phrases
should
language
and,
representation.
with
phrases
ition.
These
small
improvements
results
in
have
the
been
improve
experimental
however,
been
very
in some
we feel
that
support
mixed,
collections
this
ranging
intu-
treated
from
swers
to decreases
* Current address: West Publishing Company, St. Paul, Minnesota.
† Current address: Center for Information and Language Studies, University of Chicago, Chicago, Illinois.

Permission to copy without fee all or part of this material is granted provided that the copies are not made or distributed for direct commercial advantage, the ACM copyright notice and the title of the publication and its date appear, and notice is given that copying is by permission of the Association for Computing Machinery. To copy otherwise, or to republish, requires a fee and/or specific permission.
© 1991 ACM 0-89791-448-1/91/0009/0032...$1.50
results
representations
from
words,
between
index
for
of this
paper
in using
phrases
with
systems,
operators
such
are
terms?
not
searchers
as AND
(∧),
The
anand
algorithms.
the
issues
model.
express
Boolean
to
it be
obvious
clarify
a retrieval
using
similar
or should
retrieval
is to
phrases,
retrieval
For example,
term,
single
as these
on
to the
examined.
implications
phrases)
of work
as an index
goals
(e.g.
1 we
call
is given
and
2A
for Computing
queries,
a fee
assume
linguistic
expressions
OR
con-
word-level
(∨),
section
$J .50
retrieval
effectiveness
is measured
in
terms
of
re-
precision.
test
and
Communications
-J/9 j[O009/0032...
word
for
and the
requires
con-
a probabilis-
provided
of phrases
such
commercial
taining
neither
sufficiently
questions
structure
Illinois.
the ACM
is by permission
To copy
Information
over
been
user-identified
and
amount
be treated
derived
of the
involved
Minnesota.
been
significant
One
with
single
relationship
as a relationship
to
have
in
significant
the
terms
have
rel-
1990).
a phrase
index
ob-
the
has not
should
quality
results
Das,
that
model
the specificity
the
and
fig-
were
Improvements
algorithm
from
phrases.
as CACM2
might
that
with
syntactic
results.
found
relaresults
his improvement
experiments
different
Despite
sys-
feeling
we
collections
of
addi-
syntactic
some
such
Fagan’s
in
Fagan’s
that
co-
some
but
words.
using
baselines
In
both
algorithm,
(Croft
a va-
SMART
the
using
significantly
Cran-
described
consequently,
The
do not,
in the
also
has always
correctly,
of the indexing
of the text
Salton
indexing
tic
for
word
smaller.
phrases
early
Cleverdon,
single
and
by
phrase,
in
out
with
defined
in a document.
on the
increases
also be pointed
phrase
is
the proxim-
characterized
constraint
ures obtained
siderably
or
the
be
none
to quite
and/or
of components
but
the
of occurrences
words
com-
phrases,
“syntactic”
in
phrase
component
phrases,
best
and
as a statistical
tionships
the
The
number
most
using
of factors
statistical
may
criteria
statistical
Introduction
the
phrase
It should
1
A
on
of the
indexing
“statistical”
occurrences
between
syntactic
is one
a number
of its component
A
that
both
varied
constraints
ity
in
used
occurrences
phrases
model.
(1987)
of automatic
process.
to build
retrieval
a natural
he
and
where
phrases
as manually
that
phrases
formation
effectiveness,
thesis
studies
are used
phrases
that
model,
on phrases,
retrieval
in
used
01003
in othersl,
recent
prehensive
sys-
been
retrieval
a probabilistic
from
as well
have
an approach
and
history
in commercial
queries
language
for
a long
of research
we describe
queries
have
a statistical
majority
in natural
structured
form
Boolean
improvement
paper,
identified
Our
the
little
queries
particularly
work,
of phrases
work,
sulted
In
Boolean
D. Lewist
MA
effectiveness
phrases
Retrieval
Department
Amherst,
Fagan’s
Both
in Information
4.1.
collection
lists
consists
of
the
relevant
of
the
ACM
of
a
set
documents
(CACM)
of
documents,
for
collection
each
a
query.
is described
set
of
The
in
proximity,
level
for
sentence-level
The
proximity.
example,
tried),
may
or by
formation
query
is used
tify
model
can
phrases
operator
how
incorporating
term
phrases
phrase,
were
used
dl
in the
or other
d2
paper,
(Croft,
1986),
as specifying
identified
struct
in a natural
a structured
bilistic
model
1991).
goal,
rl
rz
based
representation
In the following
section,
start
by describing
of our
phrases,
been
the inference
emphasizing
in retrieval
these
models
ables
the similarities
clearly
in
seen.
Boolean
In section
building
structured
used
presents
the
work.
network
reviews
work
content
of
for
an overview
and
phrases
in
of our
that
The
uses
in sta-
results
Finally,
in
approach
describe
this
and
section
the
paper.
the importance
to
are
and
other
the
use
are query
information
difference
of
document
nodes,
and
need.
).
need,
paper,
future
document
and
collections.
values
true
and
of the
into
of
forms
emphasizes
to
calculate
of the
of the
such
under
model.
For this
structured
queries
in the inference
model
can
of the
net model
be shown
model
docu-
informa-
as a thesaurus
this
are that
of the
features
net model
it
evidence
knowledge
account
advantages
These
that
representations
interpretation
different
is
representations
domain
a natural
diagram.
sources
different
and
the inference
models
Different
the major
have
possible
as representations
between
multiple
can all be taken
of
with
regarded
probabilistic
content,
tion
4
a discussion
of large
d~’s are
qi’s
need.
major
ment
specific
Section
5, we indicate
Network:
nodes,
user’s
Queries
P(I | Document
queries,
and discuss
the
information
to be
as proximity
Inference
of a document,
false.
en-
them
1: Basic
v+’s are concept
on
have
each
among
Figure
nodes,
1 represents
is the
phrases
of an inference
such
We
research
Instantiating
differences
experimental
results.
directions
through
models.
3, we give
techniques
ways
operators
renet-
which
different
subsection
and
retrieval
previous
describe
form
last
need
net model
models.
and
The
queries
tistical
those
the
Croft,
overall
inference
We then
the
P
interaction.
we review
experiments.
treated
user
to con-
and
our
information
and
ql
Phrases
(Turtle
a complex,
of an
analysis
used in a proba-
towards
is to build
language
basis
nets
a step
rs
In
term
are used
is then
on inference
which
approach.
query
which
represents
search
natural
language
query,
based
This
a different
d
in a probabilistic
interpreted
we take
to iden-
dependencies.
In this
lintext.
queries
dependency
were
as (in-
in a document
Boolean
re-
A
such
Structure
the
used
that
retrieval,
(information
be detected
we have
information
3 words of retrieval.
work,
work,
and
by
a proximity
to describe
potential
that
using
construct,
In previous
concept
be expressed
within
guistic
proximity,
using
are discussed
a
be-
low.
Previous
2
2.1
The
The
inference
used
as the
basis
of
phrases,
ments
tion
it
4.
It
follows
son,
1977).
net
that
it
particular
the
comparisons
for
probability
Typically,
decides
model
information
(Turtle
of different
ranking
principle
a document
query
is relevant
(Fuhr,
a slightly
P(I | Document
information
different
More
need
as a complex
that
trix
specifically,
node
inferin
probagiven
we consider
about
on
the
of parents
its
potential
for
the
can
to
compute
be
used
associated
paper.
with
1 shows
It
the
consists
of the
the
all
Given
the
probability
or
de-
all nodes
a set
these
multiof that
and
DAG,
remaining
basic
has
characterizes
node
a
a ma-
all possible
a node
that
causes.
roots
q, we draw
the dependence
and
between
probabilities
this
set
If
or im-
q contains
P(q | p) for
When
specifies
relationship
Figure
an
the
matrix
propositions.
node
specifies
nodes
and edges
p “causes”
by node
The
variables.
representing
belief
a
that
two
the
pendence
a par-
The
approach
proposition
ple parents,
y
represented
p to q.
matrix)
of the
between
by a node
is a di-
in which
or constants
relations
from
1989)
(DAG)
variables
proposition
(a link
(Pearl,
graph
represented
edge
values
calculates
is satisfied
the
in
probability
is the
dependence
directed
given
1989).
), which
need
document.
is the
represent
sec-
(Robert-
model
which
propositional
in
network
dependency
represent
plies
treat-
model
inference
acyclic
a proposition
1991) is
experiments
a probabilistic
takes
Croft,
retrieval
,Query),
and
a user’s
and
the
a probabilistic
computes
that
Model
Net
and
document
ence
bility
for
Bayesian
rected,
model
IDocument
a user
ticular
net
the
P(Relevant
that
Inference
is
A
Work
of prior
networks
degree
of
nodes.
inference
of a document
network
network
used
and
in
a
For
retrieval,
teraction
network.
that
a query
with
the
This
and,
and
allows
the information
ument
network
user,
us
is
to
to
through
to the
compute
need is met
consequently,
built
attached
the
for any
produce
in-
document
probability
particular
doc-
a ranked
list
of
can
be
documents.
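As a rough illustration of the ranking described in this section, the sketch below evaluates a small structured query against per-document term beliefs and orders the documents by the resulting belief that the information need is met. The operator definitions (product for AND, complement of a product for OR, complement for NOT, mean for SUM) are the closed forms commonly associated with the inference net model. The function names, the tuple encoding of queries, the default belief of 0.4, and all belief values are assumptions made for this example, not details taken from the experiments reported here.

```python
from math import prod

DEFAULT_BELIEF = 0.4  # assumed belief for a term with no evidence in a document


def belief(node, term_beliefs, default=DEFAULT_BELIEF):
    """Return the belief in a query node for one document.

    node is either a term (str) or a tuple (operator, child, child, ...).
    term_beliefs maps terms observed in the document to beliefs in [0, 1].
    """
    if isinstance(node, str):
        return term_beliefs.get(node, default)
    op, *children = node
    values = [belief(child, term_beliefs, default) for child in children]
    if op == "and":  # product of the children's beliefs
        return prod(values)
    if op == "or":  # complement of "no child is true"
        return 1.0 - prod(1.0 - v for v in values)
    if op == "not":  # complement of a single child
        (value,) = values
        return 1.0 - value
    if op == "sum":  # mean of the children's beliefs
        return sum(values) / len(values)
    raise ValueError(f"unknown operator: {op!r}")


def rank(query, documents, default=DEFAULT_BELIEF):
    """Return document ids sorted by decreasing belief in the query."""
    scored = [(belief(query, tb, default), doc_id) for doc_id, tb in documents.items()]
    return [doc_id for _, doc_id in sorted(scored, key=lambda s: (-s[0], s[1]))]


# Hypothetical collection: each document maps its terms to a belief value.
documents = {
    "d1": {"information": 0.8, "retrieval": 0.7},
    "d2": {"information": 0.9},
    "d3": {},
}
# A phrase treated as a conjunction, combined with a single-word term.
query = ("sum", ("and", "information", "retrieval"), "phrases")
ranking = rank(query, documents)
```

Because AND multiplies values that lie in [0, 1], a document missing one component of the phrase is penalised but still receives a non-zero belief through the default, which is what allows a ranked list rather than a strict Boolean match.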
2.2
Phrases
The
use of phrases
discussed
1. What
2. Are
phrases
(information
query
network
for
the
∨ files
∧ retrieval)
the
terms
to
determine
if
a phrase
concepts
or are they
relation-
concepts?
weighting
use of phrases
are
systems
issues:
or query?
is an appropriate
Should
4.
query
used
separate
between
3. What
Structured
is
IR
following
in a document
ships
2:
of the
evidence
exists
Figure
in experimental
in terms
used
for
for
affect
phrases?
which
indexing
single
word
and
docu-
queries
ments?
query
network.
a collection
query
gle
The
and
its
processing.
node
and
or
more
each information
tive
query
The
and
and
document
been
cific
to
(i.e.
the
content
represent
a document
signed.
A representation
given
its
The
gle leaf
tion
the
query
that
need
query
networks
expressions.
plex
Figure
query
Boolean
operators
matrix
form
(Turtle
showed
that
queries
is at least
version
of the vector
this
such
as those
2 shows
has
event
that
that
be used
the
∧ retrieval)
(information
with
DAG
roots
may
inference
as effective
space
been
as-
the
need.
formed
with
as the
model
set
1991).
model
to
both
for
parse
phrase
indexing
about
pairs
(Van
(1990)
tend
to
same
phrases.
Sparck
phrasal
synonymous
used
together
for
other
concept.
or nearly
in documents
research
on
term
evidence
parser
techniques,
for
together
to idenusing
information
than
has been
the
If
may
part
of
hypothesis
words
the
mea-
words
being
synonymous
other
of words
associated
two
Tait
queries.
example,
For instance,
clustering.
and
on the
Of course,
reasons
noun
as phrases
the co-occurrence
mutual
co-occur
grammar.
(e.g.
to analyze
such as the expected
1979).
document
Jones
a syntactic
strongly
Rijsbergen,
used the PLNLP
of the
identified
are
that
siderable
that
of
and grammars
document
and
to use semantic
It is possible,
of words
a library
Parser-based
constructs
information
use information
sure
the
used
cate-
and patterns
a simpler
are then
Dil-
of the
syntactic
against
used
extraction.
semantic
used.
the
(1987)
linguistic
tree
example,
general
link
specific
been
is typical
parse
(1988)
phrase
the
measures
1983).
Smeaton
to refine
to identify
Wu,
Fagan
a complete
lin-
template-
noun>),
It is also possible
tify
Boolean”
where
use
Both
are identified
are matched
For example,
in the
canonical
Fox and
have
(1983)
as <adjective
produce
Statistical
of structured
approaches,
cases,
of the
Turtle
system
for indexing.
hand,
the
techniques
FASIT
between
As mentioned
techniques
phrases,
in documents
whereas
Each
“extended
(Salton,
for
text.
of
com-
Boolean
network
∨ files.
network
A
query
with
to
indexing
identify
have used parsing
techniques
sophistication
to analyze
the
(1984),
an informa-
phrases.
approaches
of varying
phrases)
a sin-
to describe
query
Croft,
with
phrase
categories
parser
or only
distinguishing
(such
text,
node
for
syntactic
parser-based
of words
indexing
templates
In
correspond
a corresponding
and
repre-
a specification
the information
nodes
has
document
basis
and
to
Gray’s
of adjacent
nodes.
to the
express
query
has
associated
multiple
the
of a spenode
and
gories
by a directed
contains
document
in
model)
concept
is an “inverted”
and
that
concepts
intermediate
the
probability
corresponds
is met
simple
each
syntactic
evidence
and
of the
search?
is the
phrases
template-based
Each
a specific
assignment
from
node
network
that
to a document
to which
set of parent
this
nodes
(rk’s).
issue
before,
based
part
during
guistic
lon
document
event
the
node
for
interac-
of document
actual
in
concept
representation
conditional
through
first
statistical
express
is built
nodes
to the
We
The
need
feedback.
consists
text
senting
of the
is modified
an
during
which
phrases
identified
of a sin-
information
or relevance
represents
change
network
representation
representation
arc
and
query
corresponds
observed.
user’s
A
network
node
collection
the
5. Are
once for
consists
representations
concept
document
not
query
need
is built
network
need.
formulation
document
(dt’s)
does
query
represents
information
network
structure
The
which
one
that
document
basis
linguistic
will
be
of conor
statistical
clues
types
of
then
the
choice
tionship
can
1991;
of
Yu
pairs
used
on
Croft,
of words
indexing
various
mation
about
term
specificity
words
in text
that
is provided
prove
criteria.
and
phrase
collection
formed
from
queries.
more
than
pair
form
and
Das,
In these
to identify
phrases
ments
relevant
very
and
accurate
heavy
tent,
burden
the
users
be generally
but
in the
collection
that
used
initial
Croft
were
asked
User
it
and,
input
state-
1988;
phrase
identification
a
to some
been
(Harman,
a
places
during
has
ex-
query
shown
Croft
A
to
and
Das,
have
been
2 through
4 are best
the
inference
net
nets
show
alternative
probabilistic
the
text
In
text
As
an
example,
currence
r~
then
information
In the first
correspond
a
in
the
pk is also
a rep-
to a phrase
in the
correspond
retrieval,
a query.
to
the
respectively,
to occurrence
model
(Figure
corresponding
belief
in the
phrase
dence
about
the
to the
concept
ocand
of the
them,
presence
of a query
including
the probability
information
Smeaton’s
work
a phrase
Figure
phrase
component
words
and
linguistic
phrase
concept
that
need).
(1986),
is treated
independent
the
using
relationship
The
will
is the
all
satisfies
model
phrases
had
in phrase
(b)
Belief
in phrase
(c)
Phrase
(d)
evi-
in a document
3: Alternative
(a) Belief
Phrase
Models:
independent
of belief
in components
dependent
on belief
in components
is a dependency
relationship
between
com-
ponents
The
relationships.
document
This
where
the
as
of the
words.
can be estimated
component
between
(or
3(a)),
concept,
(d)
repre-
words
Q represents
(c)
to
model,
are
rj
two
phrase
may
rj
and
space
ri,
to
words.
representation
concepts
in
vector
in
be used
retrieval.
a separate
query
and
also
El
to
inference
phrases
networks,
corresponds
two
of information
pk would
increase
that
of the
These
can
in the
O!m. The
concept
consisting
3.
and
small
by referring
modeling
corresponding
document
resentation
of
model,
these
concepts
of the
Figure
use of phrases
example.
sentation
in
ways
retrieval
describe
for
models
discussed
mixed.
Issues
(b)
(a)
is potentially
designer
feedback
in pre-
1986;
query
although
with
rejected.
users
of
system.
and
occurring
(Croft,
evidence,
the
“phrases”
pairs
were
in
im-
with
has been
words)
relevance
results
not
effective-
documents
that
This
effective
frequency,
with
documents.
of the
and
inforamong
best
were
was
on the interface
formulation
1990),
4.1)
experiments,
(and
form
the
did
the
is user judgments
1990).
that
in his experiments
of evidence
where
relationships
example,
in the
experiments
texts,
co-occurrence
found
of words
Yang
selecting
by document
restriction
90 times
Another
vious
For
obtained
only
query
and
(see section
every
The
used by Fa-
of co-occurrence
selection.
CACM
(Lewis,
by Salton,
of their
Fagan
frequency
ness improvements
rela-
basis
involves
and
and the form
satisfy
proximity,
two
as well,
thesaurus
used
algorithm
document
words
or
procedure
of that
basic
from
the individual
model
these
others
1991).
phrase
The
distinguish
possibly
a case-by-case
is an extension
(1975).
to
and
a phrase
and
statistical
gan (1987)
and
be
be made
Krovetz
The
can
co-occurrences,
the
used
the
Belief
in components
dependent
on belief
in phrase
same
belief
(or
experiments
The
the
second
belief
in the
in
is the
tical
and
the
relative
(idf). Specific
Salton
and
work,
McGill
the score
document
is also used
where
cepts
the
in Figure
concept
may
document
tic
phrases,
observed
The
rate
specific
in the
third
model.
For
model
Here
phrase
but
to
that
both
contains
the
Ti to
query
rj.
We
with
now
than
concepts,
such
important
r]
for
due
to this
The
final
in
model,
the
the
ditional
and
the
in
not
should
both
in section
that
and
We
address
4. This
model
in the
phrase
of these
retrieval.
Although
contains
for
to
makes
the
in
number
also
issue
has the
word
a lesser
explicit
concepts,
that
models
the
Wu,
1983)
do
not,
however,
can
such
phrase
for large
col-
of using
vir-
overheads
used.
If phrase
however,
for
it
will
the documents
example,
and
weight
storing
word
access to the full
an
text.
a user
Treating
such
the
structured
by the
processed
indexing
There
specify
translation
a query
may
by computer,
expressed
has been
into
be
to
some
in natural
some
Boolean
need
the
capture,
sim-
in
a variety
of
terms
advantages
in
the
by their
in the query.
of weighted
of
than
specify,
the words
One
and
Croft,
terms.
relationships,
as a set
connections.
Fox
and
effective
they
between
be
experiments
information
query,
linguistic
query
in
(Turtle
more
an
language
can
(Salton,
of a set of weighted
connections
these
were
contain-
of information
that
model
that
effectiveness
proximity
model
network
difficult
query
and
consider
presents
of particular
search
a structured
OR,
be
evidence
representations
queries
of a natural
nores
that
Boolean
the
may
good
as AND,
consisting
meaningful
rep-
achieve
example,
and
languages
is considerable
accurate
structured
choice
there
extended
queries
normally
of phrase
Some
to identify
for
query
use,
It appears
As
with
When
experiments
is limited
to
them.
form
For examin the
Boolean
operators
pler
con-
concept.
model
involve,
Queries
searchers
1991),
extent,
the
disadvantage
concepts
in
in a document,
the
time.
represen-
storage
was
organization
Structured
needs.
us-
of a phrase
be used
in
at query
as an indexing
search,
information
may
used to describe
are derived
component
occurs
the file
to
by
doc-
little.
in unreasonable
any
out
scanning
technique
terms
This
that
document
very
or providing
people
ing
this
the beliefs
words
then
in a document
prior
of docu-
in
phrases
and
case,
of index
sufficient
part
be carried
and
Fagan’s
done
for
can
information
trained
used
In
representation.
number
this
been
words
and,
model
is a formal
The
and
all component
prime
belief
and
2.3
case, the
is established
text,
phrase
prime
in the component
concept
file
This
position
for
belief
justification.
component
between
idea
belief
Each
some
document
This
phrase
resentation?
that
not
with
indexing.
also be impractical
of words
result
are
decision,
query
may
is not
phrases.
using
be used in the text
if the
In this
has
relatively
working
query
of those
example,
would
to contain
ex-
Although
4 use the
phrases
in the query
is, in this
pair
be necessary
between
a significant
3(d))
has
the
text.
dependence
might
ple,
but
representing
the
in
and
queries.
relationships
generate
in the phrase
belief
also
Boolean
a document
(Figure
from
document
and
(1986)
In
infrequently
inaccurate.
described
between
For
indexing
from
pa-
weights
collections.
occur
or just
occurrences
every
phrase
dependence.
the belief
the concepts
Croft
in a thesaurus.
also
work,
evidence
from
will
coming
phrases
if an inverted
sat-
is less appropriate
other
is that
model
previous
ing
as those
ri
by
model
capturing
behavior
concept
used
belief
to
phrase
estimating
in section
is whether
methods
tually
likely
a tech-
in this
we are currently
models
concepts
to
suggest
of test
al-
of models
further
with
an implementation
for
lections.
this
size
indexing,
difference
tation
A document
more
increased
this
uments
the concepts
words.
be
language
that
The
as a sepa-
between
will
was
natural
believe
phrases
only
model
identifying
dependence
represented
component
to the
con-
be
issue
query
phrase
indexing
had
is a term
is not
words
due
This
periments
the
of the
to be present.
as a dependence
corresponding
is essentially
from
(1990)
phrases
reported
context
the appropriate
problems
collection,
final
and
syntac-
the
address
limited
be essome
collection.
The
The
in
are consequently
CACM
model
on evidence
relationships
3(c))
estimates
a larger
in the
not
should
investigate
Buckley
most
experiments
ment
2).
and
we will
belief
will
be used to learn
is the
to the
of two
Fagan’s
but
of phrases
we
schemes
Fuhr
could
collections,
small
the belief
with
(b).
phrases
collecby
weights)
paper,
weighting
these
the
queries,
Figure
that
a phrase
(Figure
the
concept,
isfy
for
AND
partially
syntactic
text
structured
as the
instance,
is
(or
this
One of the major
for
In Fagan’s
This
In
that
per.
of the
was added
with
depend
text.
weight,
are discussed
words.
indicates
statis-
weight
inverse
retrieval in
3(b)
beliefs
and
nique
frequency
(1990).
the individual
(a)
words.
“tf.idf”
document
phrase
experiments
the
ternative
the weight
tf.idf
the
Turtle
to information
arrow
phrase
and
both
relative
weight
are represented
(similar
gray
of this
how
the
belief
of the
A
in the
a matching
in our
for
calculates
and
forms
from
phrases
(tf)
(1983)
score
(1987)
of the
word
for
on the
component
words.
of the
in
the case where
as the average
a document
used
depends
Fagan
component
frequency
shows
by Fagan
a combination
in
be
timated.
to the
phrases,
the
of a word
3(b))
the phrase
using
also
4.
concept
used
with
of
(Figure
phrase
syntactic
weights
will
corresponding
model
associated
formed
and
in section
model
concepts
This
tion
weight),
described
a form
of the relational
igof a
easily
structure
language.
preliminary
operators
research
(AND
on the best
and
OR)
of
linguistic
relationships
pared
the
as sources
model.
from
use of Boolean
of word
pairs
Das-Gupta
both
syntactic
when
the
natural
interpreted
OR.
proposed
term
an algorithm,
and
must
when
should
a high
(1990)
presented
translating
a full
query
Boolean
into
Boolean
and
better
Smith’s
to
fectiveness
To
draw
understand
the
of the
from
capturing
in the
query.
from
for
the
makes
relative
capturing
ef-
the
that
50%
82%
to relatively
words
Boolean
the
jective
and
to
certain
it
ANDed
of
into
that
by trans-
Boolean
op-
above
studies
linguistic
Proximity
phrases
queries
to
operating
queries.
ing and
on the
the
system
another
which
may
occurring
query
languages,
tured
queries
On
the
means
text
specification
The
structured
that
for structured
goal,
As
an
the
phrase
several
of the
not
other
times
two
hand,
ex-
ORs,
and
and
lin-
proximity
than
imposing
a
Queries
Language
accurate
avoiding
our
user
descriptions
the
research
input
of query
queries
and
will
represent
goal
and
with
will
is
natural
to
complex
build
(e.g.
strucanaly-
that
Anick,
be represented
relationships
informa-
of
language
information
information
of
an interface
structure
contain
the
problems
as inference
about
need,
aids
1990).
concepts
their
relative
im-
them
(Croft
and
between
1990).
this
paper,
which
phrases.
ments
we address
is to evaluate
Specifically,
designed
Hypothesis
ope~at-
of each
to the
all
in particular
we
to test
one
part
structured
report
the
of this
queries
the
following
results
research
containing
of experi-
hypotheses:
in CACM
words
3 words
referring
Das,
be-
In
consider
in
with
of phrases
rather
by a combining
user
operators.
help,
within
3 were
connections
Boolean
relationships.
occurs
64 instances
only
system.
and
linguistic
proximity
system,
Of
in documents,
operating
provide
capture
of how
all focused
relationships
obtain
while
portance,
The
ample
to
networks
erators.
tween
be equiv-
experiments
Structured
Natural
needs
sis of free
an ad-
suggests
order
tion
natural
groups
be produced
relationships
In
relationships
This
might
paragraph)
on all phrases.
from
of words
or between
modifies.
same
we will
ANDs,
on a case by case basis,
model
proximity
will
translation
ab-
per doc-
level
In future
into
only
stems
We
correspond
in the
modification
phrases,
queries
syntactic
(ANDs)
of the
grad-
1990).
of words
(e.g.
the
is the small
unlimited
documents
relationships
with
contain
20 unique
prox-
weighting,
evaluate
which
document
for
Building
3
CACM
Cornell
structures
61%
queries,
the
(Smith,
(ORs)
direct
noun
from
that
them
of noun
Boolean
methods
ef-
queries
in structured
conjunctions
Indeed,
the
reasonable
of full-text
single
between
language
queries
linguistic
heads
connection
queries
from
of the
correspond
between
used
disjunctions
queries.
the
in natural
language
simple
language
about
derived
of the
collections
to using
phrase
to
documents,
the
documents.
investigate
improve
case, the problem
proximity
full-text
phrases
operators
with
to restricted
guistic
different
As with
of only
that
as requiring
not
an approach
documents,
at
did
Fa-
showed
proximity.
is difficult
small
accurately
phrases
unlimited
a
proximity.
hand,
of words
an average
describe
corresponding
simple
other
and
pres-
in CACM
be very
produced
In this
(1990)
concepts
using
the
Shapiro,
the
groups)
could
queries.
In these
longer,
its use of ad
queries,
about
the
on
CACM
and
alent
the
and
to infer
(noun
4, we describe
or co-occurrence
syntac-
Croft
allowing
collection.
ument.
and
proximity
over
of the
stracts
ef-
However,
including
operators
natural
students
lating
some
reflected
conclusions
more
Boolean
collection
and
comes
rejected
relationships
we compared
found
be
size
Gay
texts
of proximity
CACM
Booleans
ones.
that
use
used
that
in structured
the
are
statistically
In section
sub-
op-
on co-occurrence
text
compounds
document
imity
relationships.
linguistic
uate
produced
algorithm,
to
a vari-
apart
concept
interpretation
(Tong
based
compounds
(1987),
fectiveness
operators.
in
found
results
system
rules
in document
close
performed
produced
language
of Boolean
linguistic
and
nominal
interpreting
using
Boolean
Booleans
are directly
of words
difficult
for
queries
of Smith’s
lists
collections
suggests
of a natural
complexity
the
They
to these
resulting
statistically
about
use of a proximity
concepts.
queries.
gan’s
for
were
RUBRIC
of nominal
language
the
and
statistically
strongly
which
structure
hoc
test
of structured
relationships
it
three
than
of a natural
manually
to manually
work
fectiveness
both
produced
comparably
algorithm
She compared
interpretations
syntactically
stantially
tic
on
p-norm
complex
parse
form.
with
ones
of
The
syntactic
queries
produced
a
distances
the
of words
identified
Smith
at greater
not
simple
ence of query
Gupta.
ety
where
study
by Das-
occurring
that
of the
is the
proximity
due to
discussed
example
phrases
1985),
experts,
tentative,
limitations
be
words
system.
An
for
degree
of human
two
in documents
erating
as a Boolean
showed
be considered
were
deciding
and
translations
of these
using
for
conjunction
of experimental
comqueries
dependency
information,
AND
the
(1986)
language
same
comparisons
with
the results
a variety
the
language
Preliminary
Croft
natural
semantic
as a Boolean
of agreement
though
for
(1987)
and
queries.
and
ing
other
phrases
structured
concept
27 instances
Hypothesis
1:
Structured
will
be
queries
more
incorporat-
effective
than
queries.
2:
Phrases
selected
automatically
un-
Number
manually
No filtering
Corpus
filtering
Table
will
perform
as well
(% manual)
parsed
tagged
+ dict
407
(71%)
148
(69%)
191
(89%)
151
221
(50%)
119
(52%)
159
(72%)
of phrases
selected
generated
using
various
next
section,
a variety
of phrase
methods
for
tic
the
models
and
man-
are
syntactic
phrase
and
system
Croft
Recall    Precision (% change) - 50 queries
          NL      and-based
10        67.6    68.5 (+1.2)
20        54.3    61.8 (+13.8)
30        48.7    53.4 (+9.7)
40        42.5    43.8 (+3.0)
50        35.8    37.4 (+4.6)
60        28.3    29.1 (+2.7)
70        19.7    22.3 (+13.2)
80        15.7    18.1 (+15.6)
90        10.6    12.7 (+19.6)
100        8.0     8.9 (+11.8)
average   33.1    35.6 (+7.5)
Two
from
and
the
These
from
stochas-
(1988).
selected
are
procedure
by Church
to phrases
(+1.2)
20
using
These
extraction
(1990),
developed
are compared
68.5
phrases
investigated.
manually
queries.
The
4
Experiments
Table
-
Two
sets
of
experiments
hypotheses.
The
to test
methods
queries
and
retrieval
for
The
representing
to test
whether
phrase
These
test
was
concept
in
directly
set of experiments
Hypothesis
2.
of different
with
for
on
(Section
These
methods
that
were
(Salton,
Fox
collection
nications
4.2)
tests
for
obtained
was
Three
de-
compare
having
portant
se-
manually
in the
a list
Table
the
various
will
be referred
the
and
One
student
This
query
compare
used
the
lected
in
this
that
were
to
used
accommodate
percentage
contained
and
set is the
basis
Stan1983)
retrieval
of the
in
the
The
This
first
model
3(b),
for
4.1.1
supporting
of a phrase
the
The
table
in
a document
is based
relation,
belief
and
depends
terms
on the
the
on
as well
the
as the
phrase
other
two
model
shown
are based
on the
3(a).
se-
various
sets
terms
approach,
represents
any
operator.
terms,
ponent
Since
component
the
and
by
remaining
product
beliefs
will
term.
mean
Using
anding
terms
the
using
network
beliefs
lie in the
range
be lower
than
The
models
beliefs
a
in a
component
subex-
a probabilistic
a two-term
for
assigned
model
and
the individ-
[0..1],
that
probabilistic
and-based
terms
the phrasal
of the
of the
this
(1990),
component
combining
In an inference
as the
terms.
in Turtle
of the
is formed
to a phrase
computes
a query
with
assigned
of a representation
reported
for each phrase
pression
sum
phrases
as a co-occurrence
document;
figures
manually
first
phrase
by
phrases
assignment
terms
phrase.
Conjunctive
either
occurrence
which
individual
whereas
in Figure
is modeled
estimates
the
method
in Figure
results.
produced
The
were
each
and
McGill,
in phrases
of single
as a proximity
in
of the
For
belief
in a document),
method
frequency
im-
paper.
3. a hybrid
ual
evidence
is similar
techniques
as a conjunction
a phrase
both
in
queries
sections.
2. treating
of phrases.
The
a phrase
extended
estimating
a phrase
set of queries
is provided.
the
to in the following
show
Belief
be
co-occurrence
of
identify
of phrases
for
frequency
experiments.
(Salton
numbers
in parentheses
4.1
sense,
estimation
of
Commu-
50 queries
form.
documents
techniques
phrases
the
must
methods
(term
col-
version
from
language
tables
to evaluate
1 shows
with
phrase
of relevant
recall-precision
are used
In this
but
terms
1. treating
se-
test
Our
science
text.
selected
CACM
natural
computer
phrases
query,
the
the
1983).
abstracts
Boolean
by taking
manually
using
along
and
a M.S.
of the
queries
tested:
the
automatically
using
Wong,
3204
ACM,
language
was formed
done
and
contains
of the
natural
term,
normal
phrases.
experiments
lection
constructed
phrases.
phrases
All
to a document.
to a normal
improve
bear
of manually
phrases
our
information
approaches
results
2: Performance
and-based
intended
1.
performance
lected
to
4.1)
these
to test
lecting
conducted
(Section
second
signed
dard
set
performance.
Hypothesis
this
were
first
and-based
67.6
estimates,
indexing
– 50 queries
NL
10
experiments
belief
syntactic
by Lewis
tagging
describe
queries
a parser-based
described
we
deriving
language
phrases
techniques
Recall
the
natural
queries
tagged
ually.
In
-50
197
1: Numbers
as phrases
of Phrases
selected
the
belief
assigned
sum
to the
with
to
operator
com-
queries
Precision (% change) - 50 queries

Recall    NL     and-based
                                 65.9 (-2.6)
  20     54.3   61.8 (+13.8)    56.2 (+3.6)
  30     48.7   53.4 (+9.7)     50.0 (+2.7)
  40     42.5   43.8 (+3.0)     39.5 (-7.1)
  50     35.8   37.4 (+4.6)     33.9 (-5.2)
  60     28.3   29.1 (+2.7)     28.3 (-0.2)
  70     19.7   22.3 (+13.2)    21.4 (+8.6)
  80     15.7   18.1 (+15.6)    16.1 (+2.5)
  90     10.6   12.7 (+19.6)    10.3 (-3.0)
 100      8.0    8.9 (+11.8)     7.0 (-12.3)
         33.1   35.6 (+7.5)     32.9 (-0.8)

Recall
  10     68.5   68.5 (+0.0)     69.2 (+1.1)
  20     61.8   57.4 (-7.1)     62.2 (+0.8)
  30     53.4   51.0 (-4.5)     53.0 (-0.7)
  40     43.8   44.3 (+1.2)     43.6 (-0.3)
  50     37.4   37.6 (+0.4)     37.4 (-0.0)
  60     29.1   33.5 (+15.3)    29.6 (+1.8)
  70     22.3   24.5 (+10.1)    22.4 (+0.3)
  80     18.1   19.9 (+9.6)     18.2 (+0.2)
  90     12.7   13.2 (+3.8)     12.7 (-0.2)
 100      8.9   10.0 (+12.3)     9.0 (+0.8)
         35.6   36.0 (+1.1)     35.7 (+0.4)

         35.2   38.0 (+8.0)     37.2 (+5.7)
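As an illustration only, the three ways of estimating the belief in a phrase that this section compares (and-based, proximity-based, and hybrid) can be sketched as follows. The sketch assumes that the and-based estimate is the product of the single-term beliefs, that a proximity match means the two words occur within a window of three word positions in either order, and that the hybrid falls back to the and-based estimate when the proximity constraint is not met. The function names, the default belief of 0.4, and the fallback rule are assumptions made for this sketch and are not taken from the experiments.

```python
from math import prod

# Hypothetical default belief assigned when a phrase is not matched.
DEFAULT_BELIEF = 0.4


def and_based_belief(term_beliefs):
    """Treat the phrase as a conjunction: multiply the single-term beliefs."""
    return prod(term_beliefs)


def has_proximity_match(positions_a, positions_b, window=3):
    """True if some occurrence of word A lies within `window` positions of word B.

    Word order is ignored here, which is an assumption of this sketch.
    """
    return any(
        pa != pb and abs(pa - pb) <= window
        for pa in positions_a
        for pb in positions_b
    )


def proximity_belief(phrase_belief, matched, default=DEFAULT_BELIEF):
    """Use the phrase's own belief only when the proximity constraint is met."""
    return phrase_belief if matched else default


def hybrid_belief(term_beliefs, phrase_belief, matched):
    """One possible hybrid: proximity belief if matched, and-based otherwise."""
    if matched:
        return phrase_belief
    return and_based_belief(term_beliefs)
```

For a document in which the two words of a phrase occur at positions 3 and 5, `has_proximity_match([3], [5])` is true, so the hybrid returns the phrase belief; if the words are far apart it returns the product of the term beliefs instead of a flat default.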
10
3: Performance
of manually
constructed
average
queries
Table
4: Comparison
belief
estimates
ually
phrases
resulted
the
and
natural
Proximity
In the
second
of terms
model,
and-based
fault
belief.
in
of the
in close
This
is viewed
the
only
in the
a belief
greater
estimate
depends
on the number
model,
resentation
a phrase
concept
document
quency
details
for concepts
on the
results
of Gay
As shown
significantly
worse
performs
guage
query.
cause
the
that
do
the
about
same
principally
rather
the
belief
The
than
one
hybrid,
poor
and Fagan’s
due
as the
the
to not
an
are
documents.
of
imity
in
for
4.1.3
hance
single
that
model
brids
it
lan-
suffers
containing
be-
phrase
the hypothesis
term
would
collections
containing
by recent
Indeed,
ten
documents
top
queries
significantly
better
than
nearly
as well
exif we
re-
using
queries
and
ex-
effective
1990).
collection,
the
estimate
is present
mate
term
or
based
of the
model
is
matches
prox-
the origas the
belief
based
a document
single
that
estimate
of the
beliefs,
Of
the
maximum
these,
proximity
met,
were
(the
and
of the
and
use some
single
constraints
other
belief
if the
based
include
the
of the
beliefs),
hybrid
term
were
of the
phrase
beliefs
not
attempt
proximity
on singlearm!-
the
single
mean
term
operator
for
if
esti-
original
met.
hy-
phrase
estimates
maximum
en-
recall,
of these
proximity
tested
best
that
All
on the
product
the
the
enhance
models
of both,
term
The
phrases
phrases
phrase
features
on the
liefs.
if the
satisfy
of hybrid
are not
beliefs
proximity-based
and-based
best
in
based
that
while
a series
constraints
assigned
none
do not
We
approaches
precision
to combine
– documents
are
col-
phrases.
Hybrid
To test
performs
the
with
fre-
natural
model
sacrifices
belief
phrase
“strict”
the
query
to be more
is supported
in
CACM
collec-
use of an and-based
Candela,
precision
language
and-based
within
and
perform
natural
which
effective
for
view
the
use of proxim-
device,
estimate
the
phrases
inal
based
of the proximity
documents
raw
from
The
estimate
This
(Harman
compare
trieved
enhancing
and-based
and
mechanism
The
doc-
are present
legitimate
particularly
documents.
collec-
(many
that
if at all.
it is not
of short
are short
abstracts
Many
is also
CACM
paragraph)
enhancing
As such,
we tested
constraints
recognizing
is a precision
window
model
of the
records.
infrequently,
than
of the proxim-
original
terms
occur
and
3204
proximity
proximity
collection
of a single
large
rep-
experiments,
this
is too
consist
contains
all
(1990)
model
in this
tion
periments
of the
no abstract
is a recall
(1990).
as a document
to ignoring
width
records
generally
the
matches.
composition
proximity-based
belief
to estimate
conjunctive
proximity
performance
idf
the
the
In the proxon the
to
pect
in which
document
performance
have
only
Relaxing
additional
estimate
is independent
proximity
with
of
This
terms.
these
the
as well
The
lections
the
proximity
See Turtle
Croft
model
only
satisfy
terms.
than
tion.
few
part,
recall.
only
In
inverse
The
3, the
Performance
proximity
contain
not
and
in Table
is based
is based
unit.
for
of
as an independent
the
nets.
is set to three
number
terms.
default.
use of tf and
in inference
window
de-
containing
and
belief
and
phrasal
on the
in
ity
term
the
of documents
is viewed
(tf)
(idf ) of the
more
the
the single
whose
frequency
but
required
than
with
the
single
is satisfied
associated
poor
due,
phrases
In
any
than
documents
assigned
imity
with
the
those
relation
containing
greater
phrase,
with
as a sequence
in a document.
increases
in a phrase
proximity
The
2).
very
uments
a belief
with
model,
the beliefs
than
(Table
constraints.
produces
man-
performance
proximity
associated
terms
identified
queries
a document
belief
common
proximity
better
is assigned
beliefs
were
language
a phrase
model,
a phrase
on the
terms
phrases
occurring
from
terms
single
in significantly
original
4.1.2
of and-based,
phrases
proximity
in which
and
Fagan
(+1.2)
– proximity-based
ity
queries
68.5
Table
the
-50
f37.t3
top
the
(% change
hybrid
10
average
with
Precision
and
Recall
proximity
beused
a document
The
perfor-
mance
of the
better
than
CACM
To test
the
4),
gle terms,
can
no single
were
and
terms
from
retrieval
performs
manually
the
(and-based,
by
terms
single
from
original
sections.
relatively
phrases
with
summarized
here.
As
all
only
manually
The
shown
in
de-
single
last
the
selected
of
and
proximity-based
lection.
4.1.4
Comparison
with
earlier
phrase
model
results
To
compare
mented
work
a phrase
a phrase
the
our
model
is simply
component
tially
the
as we would
query.
By
in effect,
acts
term
when
estimate
belief
beliefs
phrase
As
The
for
for terms
in a
nificantly
in a
the concepts
in this
way
so that
with
other
it
tive
terms
in
Table
CACM
suggests
form
better
than
mate
on collections
point
out
that
the average
than
reported
precision
nets
Fagan
of .36 compared
with
estimate
figures
(1987).
lent
to that
terms
Automatic
versus
also
well
as simply
using
all
query.
set of single
One
strategy
For
an
selected
The
and
results
of the
improve
last
about
retrieval
kind
vide
this
we are of course
identify
useful
effort.
In
three
methods
for
parser
based
primarily
a stochastic
section
phrase
4.2.3),
(Section
4.2.4).
and
a user
we
describe
bracketer
sparing
syntax
which
obtained
from
obtained
can
pro-
, such
frequency
terms
should
to be correct,
Unless
of this
original
query
4.2.1
Corpus
the
influence
as those
section
use all
in addition
to the
terms
that
should
results
terms
in the
from
set of selected
dicthe
phrases
all
single
in
collection,
For
noted,
the
quality.
found
in the
as
original
is phrase
the component
otherwise
phrases
about
from
be dropped.
be retained.
all single
for select-
with
probably
but
equiva-
performed
terms
high
are less likely
a training
quality
the
phrases.
user
erate
ural
a
the phrases
attempt
some
part
that
can
pus
phrase
and
form
of test
out
are
exceeds
some
retained
threshold
only
it
to generated
(Fagan,
if
(in
ones.
spurious
1987)
their
our
varies
used to gen-
of the original
cases
quality
to eliminate
filtering
phrases
many
low
phrases
the technique
on the quality
In
be used
selected
upon
query,
to screen
corpus
a dictionary
automatically
depending
language
to apply
4.2.2),
filtering
of
considerably
with
phrases:
incorporates
The
could
the
(Section
from
ses-
that
experiments
recognizing
on phrase
phrases
could
an interactive
automatically,
should
effecelimi-
query,
contain
terms
single
for
Strategies
to be included
phrases
with
all of the sinqueries
that
of
de-
incorporatphrases
in techniques
automatically
information
(Section
during
interested
phrases
that
selected
While
of information
this
show
performance.
sion,
of speech
section
manually
that
select
high-quality
remainder
information
to
not
is essentially
strategies
single
are not
the original
queries
terms
factor
used
tionaries
phrases
ing
with
these
ri-
from
sig-
many
are required
queries
to be
terms
since
query
to the phrases).
but
of .32.
manually
terms
examined,
terms
constructed
of these
were
higher
of
improve
on retrieval
single
that
query
manually
obtained
component
4.2
It is also clear
ing
obtained
col-
and-based
impact
performance
the original
(in addition
the
will
that
eliminating
per-
Specifically,
to an average
on
the
CACM
performance
and
in the original
60% of the single
be-
sim-
than
a set of single
retrieval
The
performance
the
a major
general,
esti-
We should
are significantly
has
by phrases.
the
larger
will
or Fagan’s
documents.
inference
in
estimates
work
hybrid
and-based
of large
with
difference
Fagan
initial
the
the
is little
or
Again,
that
either
sing phrases
those
there
hybrid,
collection.
collections
average
4,
and-based,
to select
contained
retrieval.
are
general,
behave
on the
that
documents
used
from
In
estimates
phrases
In
gle terms
described
better
used
be
degrade.
degrades
scribed
we
when
vari-
will
phrases
estimates
estimate
of large
with
performance.
hybrid
significantly
will
and
variables,
perform
fo-
in the
two
these
estimates
proximity-based
method
included
nate
shown
the
collections
We will
selected
we hypothesize
and
we use essen-
weight
combined
for
for
for terms
weight
hybrid
performance
imple-
estimates
beliefs
phrase
the
belief
model,
use to combine
the
we
query.
tween
the
this
normalizing
as a single
in the
of the
With
to combine
computing
(1987),
the
mean
method
phrase
Fagan
in which
terms.
the same
are,
with
Again,
the
belief
used
terms
filtering
remaining
of
manually
the
is used
single
subset).
the
remaining
and
both
the
for
the
and-based
ilarly
terms.
of
section
independent
terms
for
all
corpus
independent
performance
the
filter-
estimate
terms,
and
Results
three
corpus
and the method
or some
technique
are
belief
single
queries,
ables
significantly
what
(no
method,
whether
or hybrid),
terms
with
Adding
as using
4.2.1) is used,
to select
the
recognition
considered:
proximity,
phrases
queries.
single
(Section
following
identi-
phrase
were
cus on recognition
and sin-
manually
identified
original
the
as well
ing
that
identified
performance.
about
with
precision
to the
variables
work
of the phrases
using
addition
other
estimate.
manually
5, dropping
In
on the
suggests
average
conducted
terms,
single
in Table
improve
importance
and terms,
significantly
initial
documents
and-based
tests
is not
formulation
However,
larger
the relative
fied phrases
grades
(Table
estimate
10-20% over
estimate
and-based
containing
hybrid
all
phrase
original
collection
a collection
the
hybrid
the
in
may
phrases
One
more
in an
technique
phrases
which
collection
case,
nat-
be useful
is cor-
candidate
frequency
than
one
precision
(% change)
man.
=
65.2
(-4.8)
68.5
(+0.0)
54.1
(-5.7)
58.6
(+2.1)
30
51.0
45.7
(–10.5)
51.0
(-0.2)
40
44.3
39.7
(–10.3)
43.7
(-1.5)
50
37.6
33.6
(–10.6)
38.4
(+2.2)
60
33.5
28.4
(–15.3)
31.9
(-4.9)
70
24.5
20.2
(–17.8)
23.7
(-3.4)
80
19.9
15.4
(–22.4)
18.0
(-9.6)
90
13.2
11.2
(–14.8)
12.8
(-3.2)
100
10.0
9.3
(–7.5)
10.2
(+1.4)
36.0
32.3
(-10.3)
35.7
(–0.9)
effect
of single
Table
5: Performance
– 50 queries

Recall    unfiltered   filtered
  10        68.5       68.2 (-0.5)
  20        58.6       58.0 (-1.0)
  30        51.0       50.1 (-1.8)
  40        43.7       43.0 (-1.6)
  50        38.4       37.1 (-3.5)
  60        31.9       30.8 (-3.3)
  70        23.7       22.9 (-3.6)
  80        18.0       17.7 (-1.8)
  90        12.8       11.9 (-6.8)
 100        10.2        9.3 (-8.8)
            35.7       34.9 (-2.2)

be
term
good
tends
able
tering
sistency,
otherwise
phrase
performance
tering.
6: Effect
constructed
of corpus
phrase
filtering
selected
phrases
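The idea behind corpus phrase filtering, keeping a candidate phrase only if it is actually found in the document collection, can be illustrated with a short sketch. Everything specific here is an assumption made for illustration: a phrase is counted as occurring in a document when all of its words appear somewhere in that document, and the threshold `min_docs` and both function names are hypothetical rather than taken from the experiments.

```python
def document_frequency(phrase, documents):
    """Count the documents that contain every word of the phrase.

    Simple co-occurrence within a document stands in for phrase
    recognition; this is an assumption of the sketch.
    """
    words = set(phrase.lower().split())
    return sum(1 for doc in documents if words <= set(doc.lower().split()))


def corpus_filter(phrases, documents, min_docs=1):
    """Keep only the phrases found in at least `min_docs` documents."""
    return [p for p in phrases if document_frequency(p, documents) >= min_docs]


# Hypothetical three-document collection and two candidate query phrases.
docs = ["operating system design", "file system", "parallel algorithms"]
candidates = ["operating system", "horizontal microcode"]
kept = corpus_filter(candidates, docs)
```

With this toy collection only "operating system" is kept, since "horizontal microcode" never occurs in any document.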
tend
phrases
and
improves
performance
results
reduce
the
number
but
it is also clear
are
eliminated
that
of phrases
that
corpus
with
for
each
a number
(using
does
technique
used,
of reasonable
manually
selected
in
the
operating
to match
query
only
slight
and
performance
frequen-
word
terms
high
frequency
are
occur
as a
receive
Again,
this
with
no credit
technique
the
in-
system)
they
documents
gains
corpus
collection
when
fil-
technique,
computer
only
occurrences.
that
this
very
system,
the
in a document
term
high
better
phrase
suggests
Essentially,
unless
selected
since
set of single
fil-
For con-
corpus
With
reli-
phrase
filtering
understated
very
the
query.
(e.g.,
phrase
with
to be as
manually
without
effective.
from
of the
phrases,
that
collections
be
phrases
removed
for single
phrases
phrases
larger
not
corpus
phrase
mean
somewhat
can
from
cluded
corpus
will
can be achieved
filtering
words
filtering
are
inclusion
for these
use
This
results
phrases
1 shows
will
noted.
Work
manually
the
performance.
selected
term
with
and
overall
as manually
assumed
Table
descriptors
to help
Automatically
queries
occurrence).
selection
content
phrases
cies are
Table
terms
57.4
average
Precision
phrases
all
20
10
Recall
man.
no terms
Recall
– 50 queries
phrases
gives
CACM
collec-
tion.
as a
guideline).
The
effectiveness
heavily
ually
on the
selected
do not
hurts
occur
to
treat
it
dow
mand
The
interpreter,
using
in the
phrase,
quality
even
by the
small
complexity
words
strict
are
class,
comprising
proximity,
assigned
the
text
belief
during
(e.g.,
sim-
(e.g.,
com-
compatibility
phrases
is unaffected
for
the
from
of automatically
a partial
which
generate
all pairs
parse
since
subject,
and
system
analyzes
from
is the
Longman
noun
a noun
only
level,
phrases
(LDOCE)
phrase
head
(Boguraev
words
that
and
a modifying
sentence
of
and Briscoe,
which
adjective.
to
The
1987)
which
a
is its
The
below
produce
lexicon
Contemporary
reare
constituents
of heuristics
constituents.
a
to
are heads
Examples
phrase
those
Dictionary
we used
attempts
by a grammatical
1990).
a set
relationships
which
of a noun
relying
adjacent
is to
all pairs
experiment,
system
Croft,
extract
syntactic
this
connected
and
phrases
and
of non-function
structures
the
clause
For
generation
(Lewis
and
text,
in specified
output.
phrase
the
to
occur
parser
generating
or document
syntactic
verb
y).
tend
the
lationship
win-
others
type
phrases
query
of syntactic
while
performance
same
appear
sample
these
in
reason
not
the
of words
de-
user-produced
collection
microcode)
the
does
method
parse
actually
a user
terminology
CACM
One
that
is strong
if it
used
phrases
Since
of these
Syntactic
For man-
collection
6).
there
Some
commonly
in
query
4.2.2
depends
phrases.
(Table
horizontal
occur
individual
documents
the
covered
not
3 When
all
not
manager,
do
once
collection.
were
period
ply
than
filtering
original
eliminating
slightly3
as high
CACM
phrases
the
more
produced
in the
phrase
of the
phrases,
performance
liberately
of corpus
quality
used
English
provides
syntactic
categories
for
analyzer
for
cabulary
considerably.
As
shown
phrases
form
tic
Table
by
phrases
(Table
a very
many
nated
by corpus
remain
phrase
programs,
separation
Since
contains
most
queries,
it
could
to
the
tagged
tified
vo-
of which
do not
if
phrases
that
phrases.
More
comprehensive
the
when
these
abbreviated
per-
phrases
4.2.5
ral
performance
also
using
that
signs
parts
lexical
tagger
language
queries
a
tent
help.
a stochastic
developed
of speech
and
to
contextual
addition,
the
by
words
Church
based
on
probabilities
boundaries
of
tag-
(1988)
knowledge
investigated
phrases
ing
tagged
by the
phrases.
Church
The
tagger
produced
higher
noun
Church
number
of
phrases
number
results
both
generates
than
tors
is the
remains
of phrases
from
its
produced
lower
bigger
from
number
1), but
a much
based
phrases.
and
fewer
system.
contributor
the
system
because
linguistic
Which
The
using
than
the
worse
original
7, but
better
tagged
struc-
of these
to improved
natural
than
the
recall
without
with
appears
to
fac-
performance
perform
language
manually
slightly
queries
and
selected
that
manual
phrases
corpus
phrase
filtering).
perform
Phrases
from
work
on the
retrieval
effectiveness.
slightly
these
and
others,
evidence
about
or
thesaurus
is
to
specific
general
ACM
version).
source,
possibility
use
for
identifying
phrases
in
a machine-readable
lexicon.
source
For
of
Computing
Because
we used
these
dictionary
experiments,
phrases
for
Review
Classification
there
those
are
identified
or
few
domain-
we used
computer
very
queries
a very
science
System
phrases
in queries
in
cepts
related
- the
tween
tures
this
concepts
experts
as additions
not
the
simple
thus
far.
queries)
the hyimprove
also support
the sec-
difference
between
selected
we are currently
a much
that,
and
will
larger
on
more
phrases
re-
corpus.
larger
useful
collecsource
contribute
concentrate
structured
of information
importance
those
of
more
in
the
that
go beyond
form
the
to include
in the query,
concepts
capturing
of two
This
and
query.
con-
We
relationships
simple
For example,
AND
the
We are look-
original
for
components.
on improving
queries.
of query
methods
as phrases.
often
phrase
with
phrases,
to
studying
such
schemes
and automatically
a much
types
as relative
(1987
phrases
hybrid
than
is little
observing
research
such
also
results
becomes
for building
ing at other
Another
in-
retrieval.
future
techniques
the
identifying
and
previously,
are
that
queries
4, we can accept
there
experiments
proximity
the set
Proximity
investigated
of manually
better
selecting
size.
structured
The
in that
these
with
in
better
(and
be importhat
suggests
for
in section
phrases
As mentioned
to effective
a dictionary
results
pro-
query,
collection
techniques
that
hypothesis
for
work
substantially
pothesis
tions,
and
tool
con-
selected
may
improved
proximity-based
Conclusion
Our
4.2.4
and
5
We,
(Ta-
and
as
and
is likely
structure
be an effective
to
which
initial
document
co-occurrence
peating
some-
phrases
earlier,
overall
manually
in the
representing
the effectiveness
better
phrases
in
natu-
automatically
It
can be further
term
ond
levels,
results
phrases
the
to be included
of
phrases.
phrases
ocwith
original
as well
The
perspective.
selecting
mentioned
creases
9).
precision
levels
for
Based
it
to be established.
what
high
terms
appear
the
are reasonable,
rate,
from
that
phrases
the
not
outperform
a user’s
in documents
indexby
by the tagger
error
phrases
than
selected
phrases
selected
(Table
did
of single
As
noun
produced
(Table
phrases
the parser-based
Queries
at
from
although
terms
importance
are
simple
as syntactic
lower
the manually
indexing
tures
tagger
parser
of these
with
the
of phrases
by the syntactic
proportion
lower
exactly
is considerably
on comparison
ble
using
Us-
words
further,
than
manually
phrases
strategies
In
identified.
We
slightly.
Performance
dictionary
better
queries,
containing
bearing
tant
as-
of occurrence.
simple
in
improved
and
perform
performance
stochastic
queries
tagged
queries
ging
The
contained
of
to the tagged
single
50 documents.
all
set.
improves
we removed
were
than
almost
selected
are added
the
dictionary
Summary
Combining
original
parsing,
from
phrases
idenin
and
phrases
Syntactic
they
the
phrases,
performance
filtering,
in more
duced
4.2.3
term
curred
technique
the
might
corpus
8),
of
manually
was
occurred
although
number
in the
phrase
phrase
that,
the dictionary
(Table
queries
is large
filtering
accurate
grammar,
a small
present
elimi-
degrade
from
improve
When
of the
algorithms
which
a better
to significantly
were
phrases
of questionable
phrases
phrases
added
A dictionary
form
1 shows
query
are
work,
corresponding)
useful
describe
exact
Table
them
ing
phrases.
the
only
lansyntac-
of candidate
phrases
the set of syntactic
per-
natural
The
number
noise
syntactic
above
the
large
filtering,
is possible
of syntactic
more
this
the
phrases.
implementations
of the
be used
using
either
of these
(e.g.,
formance.
A simple
described
than
1), many
While
value
queries
system
constructed
produces
content.
7,
the
worse
manually
parser
words.
augments
query.
in
or
35,000
morphology
produced
significantly
guage
about
inflectional
linguistic
in Boolean
concepts
implies
are
be-
strucqueries,
which
a strong
are
rela-
               Precision (% change) - 50 queries
Recall     NL     manual          parsed           tagged
  10      67.6   68.2 (+0.8)     59.2 (-12.5)     68.4 (+1.2)
  20      54.3   58.0 (+6.9)     51.4 (-5.3)      57.3 (+5.6)
  30      48.7   50.1 (+2.8)     41.5 (-14.7)     49.4 (+1.5)
  40      42.5   43.0 (+1.1)     36.4 (-14.4)     41.4 (-2.5)
  50      35.8   37.1 (+3.7)     30.5 (-14.8)     35.4 (-1.0)
  60      28.3   30.8 (+8.8)     26.1 (-7.9)      29.0 (+2.3)
  70      19.7   22.9 (+16.1)    19.7 (+0.3)      21.3 (+8.3)
  80      15.7   17.7 (+12.5)    15.6 (-0.7)      17.0 (+8.4)
  90      10.6   11.9 (+12.0)    10.1 (-4.5)      11.3 (+6.8)
 100       8.0    9.3 (+16.1)     7.2 (-9.4)       8.5 (+6.0)
average   33.1   34.9 (+5.3)     29.8 (-10.1)     33.9 (+2.4)

Table 7: Performance of automatically selected phrases
Precision
(% change)
NL
tagged
corpus
phrase
– 50 queries
tagged
Recall
(with
plus
with
dictionary
term
filtering
  10      67.6   68.4 (+1.2)     69.4 (+2.7)      71.5 (+5.7)
  20      54.3   57.3 (+5.6)     57.0 (+5.1)      59.1 (+8.8)
  30      48.7   49.4 (+1.5)     49.5 (+1.7)      50.3 (+3.4)
  40      42.5   41.4 (-2.5)     40.7 (-4.2)      42.1 (-0.8)
  50      35.8   35.4 (-1.0)     36.0 (+0.8)      36.5 (+2.0)
  60      28.3   29.0 (+2.3)     28.8 (+1.6)      29.3 (+3.6)
  70      19.7   21.3 (+8.3)     21.6 (+9.5)      21.9 (+11.4)
  80      15.7   17.0 (+8.4)     17.5 (+11.8)     17.8 (+13.4)
  90      10.6   11.3 (+6.8)     11.7 (+10.4)     12.1 (+13.8)
 100       8.0    8.5 (+6.0)      9.0 (+12.8)      9.3 (+16.4)
average   33.1   33.9 (+2.4)     34.1 (+3.1)      35.0 (+5.7)
Table 8: Performance of tagged phrases

               Precision (% change) - 50 queries
Recall     NL     manual          tagged+dictionary
                  (unfiltered)    (filtered)
  10      67.6   68.5 (+1.3)     71.5 (+5.7)
  20      54.3   58.6 (+8.0)     59.1 (+8.8)
  30      48.7   51.0 (+4.6)     50.3 (+3.4)
  40      42.5   43.7 (+2.7)     42.1 (-0.8)
  50      35.8   38.4 (+7.4)     36.5 (+2.0)
  60      28.3   31.9 (+12.6)    29.3 (+3.6)
  70      19.7   23.7 (+20.4)    21.9 (+11.4)
  80      15.7   18.0 (+14.6)    17.8 (+13.4)
  90      10.6   12.8 (+20.2)    12.1 (+13.8)
 100       8.0   10.2 (+27.3)     9.3 (+16.4)
average   33.1   35.7 (+7.7)     35.0 (+5.7)

Table 9: Performance of manual versus automatic phrase selection
filtering)
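The summary rows of the results tables can be checked directly from the per-level figures. The sketch below assumes that the "average" row is the arithmetic mean of precision over the ten recall levels and that the figure in parentheses is the percentage change relative to the NL baseline; both assumptions reproduce the averages printed in Table 9. The function and variable names are illustrative only.

```python
# Precision at recall levels 10, 20, ..., 100, as printed in Table 9.
NL = [67.6, 54.3, 48.7, 42.5, 35.8, 28.3, 19.7, 15.7, 10.6, 8.0]
MANUAL = [68.5, 58.6, 51.0, 43.7, 38.4, 31.9, 23.7, 18.0, 12.8, 10.2]
TAGGED_DICT = [71.5, 59.1, 50.3, 42.1, 36.5, 29.3, 21.9, 17.8, 12.1, 9.3]


def average_precision(precisions):
    """Arithmetic mean of precision over the recall levels."""
    return sum(precisions) / len(precisions)


def percent_change(baseline, value):
    """Relative change of `value` with respect to `baseline`, in percent."""
    return 100.0 * (value - baseline) / baseline


nl_avg = average_precision(NL)
manual_avg = average_precision(MANUAL)
tagged_avg = average_precision(TAGGED_DICT)

manual_gain = percent_change(nl_avg, manual_avg)
tagged_gain = percent_change(nl_avg, tagged_avg)
```

Rounded to one decimal place the three means agree with the printed averages (33.1, 35.7, 35.0). Because the inputs are themselves rounded, a recomputed percentage change can differ from the printed one in the last digit.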
tionship
between
not
what
clear
of our
work
those
type
in
the
implementation
concepts
in
the
of relationship.
near
future
of interfaces
text,
An
will
but
it
important
be the
appropriate
design
for
Syntactically
is
Based
American
part
Society
Fagan,
J. Experiments
Document
and
Acknowledgments
Non-Syntactic
Report
Force
was supported
and
Office
to thank
sions
of Scientific
Research.
about
Krovetz
this
Ltd.
as Ken
Church
able
AFOSR
Bob
Group,
the
in part
by contract
and
work.
for
the
Bell
stochastic
Penelope
and
for
to thank
LDOCE
Air
Gay,
Brennan,
Interface
Boolean
for
Language
Query”,
Conference
tion
Information
on Research
Retrieval,
1990.
and
Management,
Fuhr,
N.
ing”,
avail-
Briscoe,
T.
“Models
Retrieval
ral
Language
ing
system
“Large
Processing:
of LDOCE”,
Hanssen,
203-218;
1987.
Church,
K.
N.;
Buckley,
from
via
Natural
ACM
Phrase
Parser
the
Parts
for Unrestricted
Second
Conference
for
Harman,
cessing,
136-143,
D.
grammar
13:
Program
and
Proceedings
Natural
Performance
England.
E. M.
Factors
of Indexing
Systems,
Cranfield
Research
Aslib
W.
B.
“Boolean
cies in Probabilistic
American
Queries
Retrieval
Society
for
of the
Systems,
Language
Vol. 1,2.
W. B.;
and
ceedings
search
368,
of the
and
Term
in
Science,
“Experiments
R.
document
It$’th
with
retrieval
International
Development
37:
P. “Boolean
ciety
Dillon,
for
Information
M.;
Interpretation
Retrieval”,
Gray,
Journal
Science,
A.S.
38:
“FASIT:
query
acquiPro-
Retrieval,
on
expansion”,
Fully
321-332,
G. “Retrieving
records
using
Society
and
1988.
from
a gi-
statistical
for
rank-
Information
1990.
“Lexical
Ambiguity
and
In-
on Information
appear).
D. Representation
Ph.D.
and
Thesis,
W.B.
Proceedings
J.,
Learning
in preparation,
“Term
Clustering
of the 13th
and
in Informa1991.
of Syntactic
International
Development
Confer-
in Information
Re-
1990.
Van
Probabilistic
Morgan
Reasoning
Kaufmann,
Rijsbergen,
worths,
C.J.
London;
in
California,
Intelligent
Sys-
1989.
Information
Retrieval.
Butter-
1979.
ReRobertson,
S.E.
Journal
Salton,
of Conjunctions
245-254;
Retrieval,
Transactions
385-404,
Pearl,
349-
of the American
A
query
of the
IR”,
Document
Development
on Research
ACM
on Research
trieval,
71-77;
systems”,
Conference
in Information
of
and
Conference
W.B.
D. D.; Croft,
Phrases”,
Dependen-
Journal
Models”.
interactive
Croft,
(to
D.
Lewis,
1966.
1990.
Das-Gupta,
for
Das,
use
Index-
Proceedings
1990.
Retrieval,
Retrieval.
tems.
Croft,
index25, 55-
the
1986.
sition
probabilistic
Pro-
Cranfield,
Project,
Information
Pro-
1990.
Management,
Research
of the American
R.;
formation
Determining
and
Com-
Document
on a minicomputer
41, 581-589,
Noun
ence
Croft,
Science
Nominal
Data”,
45-62,
ACM
Candela,
Journal
Science,
Lewis,
Keen,
on
in Information
of text
ing”,
cod-
1988.
C. W.;
with
and
“Probabilistic
“Towards
of 11th
D.;
gabyte
tion
Cleverdon,
Technical
Information
21-38,
Feedback
Retrieval,
Development
Natu-
Linguistics,
Text”,
on Applied
C.
Conference
Proceedings
in Informa-
Computational
Stochastic
retrieval
Relevance
Krovetz,
“A
26(l),
Processing
in Information
International
Lexicons
Utilizing
Retrieval”,
for
Information
Harman,
B.;
Boguraev,
Thesis,
Computer
Manipulation
Development
135-150,
Syntactic
“Interpreting
cessing
as well
R. A.;
of the 13th
and
W.B.
discus-
making
“A Direct
Proceedings
Indexing
of
72, 1989.
Flynn,
J.M.
Croft,
Information
bracketer.
J. D.;
Robbins,
Ph. D.
University,
for
13th
B.;
Phrase
Comparison
1987.
L.;
Fuhr,
P. G.;
Cornell
pounds
ing
D. R.; Alvey,
Automatic
A
Methods.
87-868,
wish
References
Anick,
99- 108;
Longman
available,
for
of the
34:
IRI-
the
authors
Sibun
Laboratories
tagger
Grant
with
The
We also wish
making
and
by NSF
90-0110
in
Retrieval:
Department,
research
Journal
Science,
1983.
and
for
This
System”,
Information
structured
queries.
8814790
Indexing
for
G.
Retrieval.
So-
“The
Probability
Ranking
of Documentation,
Automatic
33, 294-304,
Information
McGraw-Hill,
New
Principle
Organization
York;
in
1977.
and
1968.
1987.
Automatic
Salton,
G.;
mation
Retrieval.
McGill,
M.
Introduction
McGraw-Hill,
to Modern
New
York;
1983.
Infor-
Salton,
G.;
Fox,
formation
26, 1022-1036,
Salton,
G.;
Wu,
H.
“Extended
Boolean
Communications
of
the
InACM,
1983.
Yang,
Importance
in
the American
44;
E. A.;
Retrieval”,
C. S.; Yu,
Automatic
Society
C.T.
Text
for
“A
Theory
of Term
Analysis”,
Information
Journal
Science,
of
26:
33-
1975.
Smeaton,
A.;
corporating
Van
a Document
SIGIR
Rijsbergen,
Syntactic
Smith,
and
Theoretical
Sparck
Jones,
De-
Model
of Informa-
Query
Tait,
Generation,
Ph.D.
Efficiency
Thesis,
University,
1990.
J.I. “Automatic
Journal
of
Computer
Search
Term
Documentation,
40:
1984.
Tong,
R. M.;
of
D.G.
in
Retrieval”,
Machine
Studies,
Turtle,
H.
Ph. D.
Thesis,
Technical
Shapiro,
Uncertainty
formation
Turtle,
and
of the P-Norm
Generation”,
50-66;
on Research
1988.
Cornell
K.;
into
of ACM
31-52,
Properties.
Department,
on In-
Queries
Retrieval,
Syntactic
Science
tions
Proceedings
Conference
M.E. Aspects
Retrieval:
Variant
Strategy”.
in Information
tion
“Experiments
of User
Retrieval
International
velopment
C.J.
Processing
22:
Inference
H. R.;
International
265-282;
Networks
University
Report
90-92,
Croft,
“Experimental
a Rule-Based
Retrieval
on Information
Systems,
for
of
In-
Man-
Document
Massachusetts,
Retrieval,
COINS
1990.
W. B.
Network-Based
Journal
1985.
for
of
InvestigaSystem
“Evaluation
Model”,
(to
of an Inference
ACM
Transactions
appear).