The Use of Phrases W. and Structured Bruce Croft, Computer Howard Queries R. Turtle; and Information University and David Science of Massachusetts, Abstract and in information tems. retrieval, In previous as a source This like in this for show that results improve performance, matically extracted nearly in queries using re- by this way are language selected can auto- query per- phrases. the came tion must obey some among its showed significant use of phrases as part of a text representation language has been investigated since of information retrieval research. indexing collections ative low baseline days example, included phrase-based field studies riety of experiments tem. Certainly, phrases, (1966). if used tained (1968) using there phrases should language and, representation. with phrases ition. These small improvements results in have the been improve experimental however, been very in some we feel that support mixed, collections this ranging intu- treated from swers to decreases * Current address: West F’ubtishing Company, St. Paul, Language Stud- In t Current ies, University Permission granted direct address: of to commercial of the that copying Machinery. Chicago, copy provided title Center that for Chicago, without fee the copies advantage, publication all or part of this are not made and its date copyright appear, material or distributed notice and notice of the Aeeociation otherwise, and/or specific permission. @ 1991 ACM O-8979 J-448 and or to republish, is results representations from words, between index for of this paper in using phraees with systems, operators such are terme? not searchers as AND (A), The anand algorithms. the issues model. express Boolean to it be obvious clarify a retrieval using eimilar or should retrieval is to phrases, retrieval For example, term, single as these on to the examined. implications phrases) of work as an index goals (e.g. 
1 we call is given and 2A for Computing queries, a fee assume linguistic expressions OR con- word-level (V), section $J .50 32 retrieval effectiveness is measured in terms of re- precision. test and Communications -J/9 j[O009/0032... word for and the requires con- a probabilis- provided of phrases such commercial taining neither sufficiently queetions structure Illinois. the ACM is by permission To copy Information over been user-identified and amount be t rested derived of the involved Minnesota. been significant One with single relationship as a relationship to have in significant the terms have rel- 1990). a phrase index ob- the has not should quality results Dae, that model the specificity the and fig- were Improvements algorithm from phrases. as CACM2 might that with syntactic results. found relaresults his improvement experiments different Despite sys- feeling we collections of addi- syntactic some such Fagan’s in Fagan’s that co- some but words. using baselines In both algorithm, (Croft a va- SMART the using significantly Cran- described consequently, The do not, in the also has always correctly, of the indexing of the text Salton indexing tic for word smaller. phrasee early Cleverdon, single and by phrase, in out with defined in a document. on the increases also be pointed phrase is the proxim- characterized constraint ures obtained siderably or the be none to quite and/or of components but the of occurrences words com- phrases, “syntactic” in phrase component phrases, best and as a statistical tionships the The number most ueing of factors statistical may criteria statistical Introduction the phrase It should 1 A on of the indexing “statistical” occurrences between syntactic is one a number of its component A that both varied constraints ity in used occurrences phrases model. (1987) of automatic process. 
to build retrieval a natural he and where phrases as manually that phrasee formation effectiveness, theeis studies are used phrases that model, on phrases, retrieval in used 01003 in othersl, recent prehensive sys- been retrieval a probabilistic from as well have an approach and history in commercial queries language for a long of research we describe queries have a statistical majority in natural structured form Boolean improvement paper, identified Our the little queries particularly work, of phrases work, sulted In Boolean D. Lewist MA effectiveness phrases Retrieval Department Amherst, Fagan’s Both in Information 4.1. collection lists consists of the relevant of the ACM of a set documents (CACM) of documents, for collection each a query. is described set of The in prqximity, level for sentence-level The proximity. example, tried), may or by formation query is used tify model can phrases operator how incorporating term phrases phrase, were used dl in the or other d2 paper, (Croft, 1986), as specifying identified struct in a natural a structured bilistic model 199 1). goal, rl rz based representation In the following section, start by describing of our phrases, been the inference emphasizing in retrieval these models ables the similarities clearly in seen. Boolean In section building structured used presents the work. network reviews work content of for an overview and phrases in of our that The uses in sta- results Finally, in approach describe this and section the paper. the importance to are and other the use are query information difference of document nodes, and need. ). need, paper, future document and collections. values true and of the into of forms emphasizes to calculate of the of the such under model. 
For this structured queries in the inference model can of the net model be shown model docu- informa- as a thesaurus this are that of the features net model it evidence knowledge account advantages These that representations interpretation different is representations domain a natural diagram. sources different and the inference models Different the major have possible as representations between multiple can all be taken of with regarded probabilistic content, tion 4 a discussion of large d~’s are qi’s need. major ment specific Section 5, we indicate Network: nodes, user’s Queries P(I]Document queries, and discuss the information to be as proximity Inference of a document, false. en- them 1: Basic v+’s are concept on have each among Figure nodes, 1 represents is the phrases of an inference such We research Instantiating differences experimental results. directions through models. 3, we give techniques ways operators renet- which different subsection and retrieval previous describe form last need net model models. and The queries tistical those the Croft, overall inference We then the P interaction. we review experiments. treated user ‘-lk ‘Amx to con- and our information and ‘m ... .. ql Phrases (Turtle a complex, of an analysis . .. . . used in a proba- towards is to build language basis nets a step rs In term are used is then on inference which approach. query which represents search natural language query, based This a different d in a probabilistic interpreted we take d-l Y’Jx~J to iden- dependencies. In this .. . . lintext. queries dependency were as (in- in a document Boolean re- A such Structure the used that retrieval, (information be detected we have p=wwh- information 3wordsofretrieva~. work, work, and by a proximity to describe potential that using construct, In previous concept be expressed within guistic proximity, using are discussed a be- low. Previous 2 2.1 The The inference used as the basis of phrases, ments tion it 4. It follows son, 1977). 
net that it particular the comparisons for probability Typically, decides model information (Turtle of different ranking principle a document query is relevant (Fuhr, a slightly P(I lDocument information different More need as a complex that trix specifically, node inferin probagiven we consider about on the of parents its potential for the can to compute be used associated paper. with 1 shows It the consists of the the all Given the probability or de- all nodes a set these multiof that and DAG, remaining basic has characterizes node a a ma- all possible a node that causes. roots q, we draw the dependence and between probabilities this 33 set If or im- q contains P(g Ip) for When specifies relationship Figure an the matrix propositions. node specifies nodes and edges p “causes” by node The variables. representing belief a that two the pendence a par- The approach proposition ple parents, y represented p to q. matrix) of the between by a node is a di- in which or constants relations from 1989) (DAG) variables proposition (a link (Pearl, graph represented edge values calculates is satisfied the in probability is the dependence directed given 1989). ), which need document. is the represent sec- (Robert- model which propositional in network dependency represent plies treat- model inference acyclic a proposition 199 1) is experiments a probabilistic takes Croft, retrieval ,Query), and a user’s and the a probabilistic computes that Model Net and document ence bility for Bayesian rected, model IDocument a user ticular net the P(Relevant that Inference is A Work of prior networks degree of nodes. inference of a document network network used and in a For retrieval, teraction network. that a query with the This and, and allows the information ument network user, us is to to through to the compute need is met consequently, built attached the for any produce in- document probability particular doc- a ranked list of can be documents. 2.2 Phrases The use of phrases discussed 1. 
What 2. Are phrases (information query network for the V Tfiies A ret?’ieva~) the terms to determine if a phrase concepts or are they relation- concepts? weighting use of phrases are systems issues: or query? is an appropriate Should 4. query used separate between 3. What Structured is IR following in a document ships 2: of the evidence exists Figure in experimental in terms used for for affect phrases? which indexing single word and docu- queries ments? query network. a collection query gle The and its processing. node and or more each information tive query The and and document been cific to (i.e. the content represent a document signed. A representation given its The gle leaf tion the query that need query networks expressions. plex Figure query Boolean operators matrix form (Turtle showed that queries is at least version of the vector this such as those 2 shows has event that that be used the A retrieval) (information with DAG roots may inference as effective space been as- the need. formed with as the model set 1991). model to both for parse phrase indexing about pairs (Van (1990) tend to same phrases. Sparck phrasal synonymous used together for other concept. or nearly in documents research on term evidence parser techniques, for together to idenusing information than has been the If may part of hypothesis words the mea- words being synonymous other of words associated two Tait queries. example, For instance, clustering. and on the Of course, reasons noun as phrases the co-occurrence mutual co-occur grammar. (e.g. to analyze such as the expected 1979). document Jones a syntactic strongly Rijsbergen, used the PLNLP of the identified are that siderable that of and grammars document and to use semantic It is possible, of words a library Parser-based constructs information use information sure the used cate- and patterns a simpler are then Dil- of the syntactic against used extraction. semantic used. 
the ( 1987) linguistic tree example, general link 34 specific been is typical parse (1988) phrase the measures 1983). Smeaton to refine to identify Wu, Fagan a complete lin- template- noun>), It is also possible tify Boolean” where use Both are identified are matched For example, in the canonical Fox and have (1983) as <adjective produce Statistical of structured approaches, cases, of the Turtle system for indexing. hand, the techniques FASIT between As mentioned techniques phrases, in documents whereas Each “extended (Salton, for text. of com- Boolean network V ~files. network A query with to indexing identify have used parsing techniques sophistication to analyze the (1984), an informa- phrases. approaches of varying phrases) a sin- to describe query Croft, with phrase categories parser or only distinguishing (such text, node for syntactic parser-based of words indexing templates In correspond a corresponding and repre- a specification the information nodes has document basis and to Gray’s of adjacent nodes. to the express query has associated multiple the of a spenode and gories by a directed contains document in model) concept is an “inverted” and that concepts intermediate the probability corresponds is met simple each syntactic evidence and of the search? is the phrases template-based Each a specific assignment from node network that to a document to which set of parent this nodes (rk’s). issue before, based part during guistic lon document event the node for interac- of document actual in concept representation conditional through first statistical express is built nodes to the We The need feedback. consists text senting of the is modified an during which phrases identified of a sin- information or relevance represents change network representation representation arc and query corresponds observed. user’s A network node collection the 5. Are once for consists representations concept document not query need is built network need. 
formulation document (dt’s) does query represents information network structure The which one that document basis linguistic will be of conor statistical clues types of then the choice tionship can 1991; of Yu pairs used on Croft, of words indexing various mation about term specificity words in text that is provided prove criteria. and phrase collection formed from queries. more than pair form and Das, In these to identify phrases ments relevant very and accurate heavy tent, burden the users be generally but in the collection that used initial Croft were asked User it and, input state- 1988; phrase identification a to some been (Harman, a places during has ‘% ex- query shown Croft A to and Das, have been ‘i 2 through 4 are best the inference net nets show alternative probabilistic the text In text As an example, currence r~ then information In the first correspond a in the pk is also a rep- to a phrase in the correspond retrieval, a query. to the respectively, to occurrence model (Figure corresponding belief in the phrase dence about the to the concept ocand of the them, presence of a query including the probability information Smeaton’s work a phrase Figure phrase component words and linguistic phrase concept that need). (1986), is treated independent the using relationship The will is the all satisfies model phrases had in phrase (b) Belief in phrase (c) Phrase (d) evi- in a document 3: Alternative (a) Belief Phrase Models: independent of belief in components dependent on belief in components is a dependency relationship between com- ponents The relationships. document This where the as of the words. can be estimated component between (or 3(a)), concept, (d) repre- words Q represents (c) to model, are rj two phrase may rj and space ri, to words. representation concepts in vector in be used retrieval. a separate query and also El to inference phrases networks, corresponds two of information pk would increase that of the These can in the O!m. 
The concept consisting 3. and small by referring modeling corresponding document resentation of model, these concepts of the Figure use of phrases example. sentation in ways retrieval describe for models discussed ‘j T? mixed. Issues (b) (a) is potentially designer feedback in pre- 1986; query although with rejected. users of system. and occurring (Croft, evidence, the “phrases” pairs were in im- with has been words) relevance results not effective- documents that This effective frequency, with documents. of the and inforamong best were was on the interface formulation 1990), 4.1) experiments, (and form the did the is user judgments 1990). that in his experiments of evidence where relationships example, in the experiments texts, co-occurrence found of words Yang selecting by document restriction 90 times Another vious For obtained only query and (see section every The used by Fa- of co-occurrence selection. CACM (Lewis, by Salton, of their Fagan frequency ness improvements rela- basis involves and and the form satisfy proximity, two as well, thesaurus used algorithm document words or procedure of that basic from the individual model these others 1991). phrase The distinguish possibly a case-by-case is an extension (1975). to and a phrase and statistical gan (1987) and be be made Krovetz The can co-occurrences, the used the 35 Belief in components dependent on belief in phrase same belief (or experiments The the second belief in the in is the tical and the relative (idf ). Specific Salton and work, McGill the score document is also used where cepts the in Figure concept may document tic phrases, observed The rate specific in the third model. For model Here phrase but to that both contains the Ti to query rj. We with now than concepts, such important r] for due to this The final in model, the the ditional and the in not should both in section that and We address 4. This model in the phrase of these retrieval. 
Although contains for to makes the in number also issue has the word a lesser explicit concepts, that models the Wu, 1983) do not, however, can such phrase for large col- of using vir- overheads used. If phrase however, for it will the documents example, and weight storing word access to the full an text. a user Treating such the structured by the processed indexing There specify translation 36 a query may by computer, expressed has been into be to some in natural some Boolean need the capture, sim- in a variety of terms advantages in the by their in the query. of weighted of than specify, the words One and Croft, terms. relationships, as a set connections. Fox and effective they between be experiments information query, linguistic query in (Turtle more an language can (Salton, of a set of weighted connections these were contain- of information that model that effectiveness proximity model network difficult query and consider presents of particular search a structured OR, be evidence representations queries of a natural nores that Boolean the may good as AND, consisting meaningful rep- achieve example, and languages is considerable accurate structured choice there extended queries normally of phrase Some to identify for query use, It appears As with When experiments is limited to them. form For examin the Boolean operators pler con- concept. model involve, Queries searchers 1991), extent, the disadvantage concepts in in a document, the time. represen- storage was organization Structured needs. us- of a phrase be used in at query as an indexing search, information may used to describe are derived component occurs the file to by doc- little. in unreasonable any out scanning technique terms This that document very or providing people ing this the beliefs words then in a document prior of docu- in phrases and case, of index sufficient part be carried and Fagan’s done for can information trained used In representation. 
number this been words and, model is a formal The and all component prime belief and 2.3 case, the is established text, phrase prime in the component concept file This position for belief justification. component between idea belief Each some document This phrase resentation? that not with indexing. also be impractical of words result are decision, query may is not phrases. using be used in the text if the In this has relatively working query of those example, would to contain ex- Although 4 use the phrases in the query is, in this pair be necessary between a significant 3(d)) has the text. dependence might ple, but representing the in and queries. relationships generate in the phrase belief also Boolean a document (Figure from document and (1986) In infrequently inaccurate. described between For indexing from pa- weights collections. occur or just occurrences every phrase dependence. the belief the concepts Croft in a thesaurus. also work, evidence from will coming phrases if an inverted sat- is less appropriate other is that model previous ing as those ri by model capturing behavior concept used belief to phrase estimating in section is whether methods tually likely a tech- in this we are currently models concepts to suggest of test al- of models further with an implementation for lections. this size indexing, difference tation A document more increased this uments the concepts words. be language that The as a sepa- between will was natural believe phrases only model identifying dependence represented component to the con- be issue query phrase indexing had is a term is not words due This periments the of the to be present. as a dependence corresponding is essentially from (1990) phrases reported context the appropriate problems collection, final and syntac- the address limited be essome collection. The The in are consequently CACM model on evidence relationships 3(c)) estimates a larger in the not should investigate Buckley most experiments ment 2). 
and we will belief will be used to learn is the to the of two Fagan’s but of phrases we schemes Fuhr could collections, small the belief with (b). phrases collecby weights) paper, weighting these the queries, Figure that a phrase (Figure the concept, isfy for AND partially syntactic text structured as the instance, is (or this One of the major for In Fagan’s This In that per. of the was added with depend text. weight, are discussed words. indicates statis- weight inverse retriemzlin 3(b) beliefs and nique frequency (1990). the individual (a) words. “tf.idf” document phrase experiments the ternative the weight tf.idf the Turtle to information arrow phrase and both relative weight are represented (similar gray of this how the belief of the A in the a matching in our for calculates and forms from phrases (tf) (1983) score (1987) of the word for on the component words. of the in the case where as the average a document used depends Fagan component frequency shows by Fagan a combination in be timated. to the phrases, the of a word 3(b)) the phrase using also 4. concept used with of (Figure phrase syntactic weights will corresponding model associated formed and in section model concepts This tion weight), described a form of the relational igof a easily structure language. preliminary operators research (AND on the best and OR) of linguistic relationships pared the as sources model. from use of Boolean of word pairs Das-Gupta both syntactic when the natural interpreted OR. proposed term an algorithm, and must when should a high (1990) presented translating a full query Boolean into Boolean and better Smith’s to fectiveness To draw understand the of the from capturing in the query. from for the makes relative capturing ef- the that 50% 82% to relatively words Boolean the jective and to certain it ANDed of into that by trans- Boolean op- above studies linguistic Proximity phrases queries to operating queries. 
ing and on the the system another which may occurring query languages, tured queries On the means text specification The structured that for structured goal, As an the phrase several of the not other times two hand, ex- ORS, and and lin- proximity than imposing a Queries Language accurate avoiding our user descriptions the research input of query queries and will represent goal and with will is natural to complex build (e.g. strucanaly- that Anick, be represented relationships informa- of language information information of an interface structure contain the problems as inference about need, aids 1990). concepts their relative im- them (Croft and between 1990). this paper, which phrases. ments we address is to evaluate Specifically, designed Hypothesis ope~at- of each to the all in particular we to test one part structured report the of this queries the following results research containing of experi- hypotheses: in CACM words 3 words referring Das, be- In consider in with of phrases rather by a combining user operators. help, within 3 were connections Boolean relationships. occurs 64 instances only system. and linguistic proximity system, Of in documents, operating provide capture of how all focused relationships obtain while portance, The ample to networks erators. tween be equiv- experiments Structured Natural needs sis of free an ad- suggests order tion natural groups be produced relationships In relationships This might paragraph) on all phrases. from of words or between modifies. same we will ANDs, on a case by case basis, model proximity will translation ab- per doc- level In future into only stems We correspond in the modification phrases, queries syntactic (ANDs) of the grad- 1990). of words (e.g. 
the is the small unlimited documents relationships with contain 20 unique prox- weighting, evaluate which document for Building 3 CACM Cornell structures 61% queries, the (Smith, (ORS) direct noun from that them of noun Boolean methods ef- queries in structured conjunctions Indeed, the reasonable of full-text single between language queries linguistic heads connection queries from of the correspond between used disjunctions queries. the in natural language simple language about derived of the collections to using phrase to documents, the documents. investigate improve case, the problem proximity full-text phrases operators with to restricted guistic different As with of only that as requiring not an approach documents, at did Fa- showed proximity. is difficult small accurately phrases unlimited a proximity. hand, of words an average describe corresponding simple other and pres- in CACM be very produced In this (1990) concepts using the Shapiro, the groups) could queries. In these longer, its use of ad queries, about the on CACM and alent the and to infer (noun 4, we describe or co-occurrence syntac- Croft allowing collection. ument. and proximity over of the stracts ef- However, including operators natural students lating some reflected conclusions more Boolean collection and comes rejected relationships we compared found be size Gay texts of proximity CACM Booleans ones. that use used that in structured the are statistically In section sub- op- on co-occurrence text compounds document imity relationships. linguistic uate produced algorithm, to a vari- apart concept interpretation (Tong based compounds (1987), fectiveness operators. in found results system rules in document close performed produced language of Boolean linguistic and nominal interpreting using Boolean Booleans are directly of words difficult for queries of Smith’s lists collections suggests of a natural complexity the They to these resulting statistically about use of a proximity concepts. queries. 
gan’s for were RUBRIC of nominal language the and statistically strongly which structure hoc test of structured relationships it three than of a natural manually to manually work fectiveness both produced comparably algorithm She compared interpretations syntactically stantially tic on p-norm complex parse form. with ones of The syntactic queries produced a distances the of words identified Smith at greater not simple ence of query Gupta. ety where study by Das- occurring that of the is the proximity due to discussed example phrases 1985), experts, tentative, limitations be words sy9tem. An for degree of human two in documents erating as a Boolean showed be considered were deciding and translations of these using for conjunction of experimental comqueries dependency information, AND the (1986) language same comparisons with the results a variety the language Preliminary Croft natural semantic as a Boolean of agreement though for (1987) and queries. and ing other phrases structured concept 27 instances Hypothesis 37 1: Structured will be queries more incorporat- effective than queries. 2: Phrases selected automatically un- Number manually No filtering Corpus filtering Table will perform as well (Yo manual) parsed tagged + diet 407 (71%) 148 (69%) 191 (89%) 151 221 (50%) 119 (52%) 159 (72%) of phrases selected generated using various next section, a variety of phrase methods for tic the models and ---i%am man- are syntactic phrase and system Croft 54.3 61.8 (+13.8) Two 30 48.7 53.4 (+9.7) from 40 42.5 43.8 (+3.0) 50 35.8 37.4 (+4.6) 60 28.3 29.1 (+2.7) 70 19.7 22.3 (+13.2) and the These 80 15.7 18.1 (+15.6) from 90 10.6 12.7 (+19.6) 100 8.0 8.9 (+11.8) 33.1 35.6 (+7.5) stochas- (1988). selected are procedure by Church to phrases (+1.2) 20 using These extraction (1990), developed are compared 68.5 phrases investigated. manually queries. The 4 Experiments Table - Two sets of experiments hypotheses. 
The to test methods queries and retrieval for The representing to test whether phrase These test was concept in directly set of experiments Hypothesis 2. of different with for on (Section These methods that were (Salton, Fox collection nications 4.2) tests for obtained was Three de- compare having portant se- manually in the a list Table the various will be referred the and One student This query compare used the lected in this that were to used accommodate percentage contained and set is the basis Stan1983) retrieval of the in the The This first model 3(b), for 4.1.1 supporting of a phrase the The table in a document is based relation, belief and depends terms on the the on as well the as the phrase other two model shown are based on the 3(a). se- various sets terms approach, represents any operator. terms, ponent Since com~onent the and by remaining product beliefs will term. mean Using anding terms the using network beliefs lie in the range be lower than The models beliefs a in a component subex- a probabilistic a two-term for assigned model and the individ- [0..1], that .rmobabilistic and-based terms the phrasal of the of the this (1990), component combining In an inference as the terms. in Turtle of the is formed to a phrase computes 38 a query with assigned of a representation reported for each phrase pression sum phrases as a co-occurrence document; figures manually first phrase by phrases assignment terms phrase. Conjunctive either occurrence which individual whereas in Figure is modeled estimates the method in Figure results. produced The were each and McGill, in phrases of single as a proximity in of the For belief in a document), method frequency im- paper. 3. a hybrid ual evidence is similar techniques as a conjunction a phrase both in queries sections. 2. treating of phrases. The a phrase extended estimating a phrase set of queries is provided. the to in the following show Belief be co-occurrence of identify of phrases for frequency experiments. 
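The conjunctive (and-based) estimate and the sum operator described above can be illustrated with a minimal sketch. It assumes the usual inference-network reading in which the and operator takes the product of the component-term beliefs and the sum operator takes their mean. The function names and the example belief values are hypothetical and are not taken from the experiments.

```python
from math import prod


def bel_and(beliefs):
    """And-based estimate: product of the component-term beliefs."""
    if not beliefs:
        raise ValueError("at least one component belief is required")
    return prod(beliefs)


def bel_sum(beliefs):
    """Sum operator: mean of the component-term beliefs."""
    if not beliefs:
        raise ValueError("at least one component belief is required")
    return sum(beliefs) / len(beliefs)


# Hypothetical beliefs for the two component terms of a phrase.
term_beliefs = [0.6, 0.8]
phrase_and = bel_and(term_beliefs)
phrase_sum = bel_sum(term_beliefs)
```

Because every belief lies in the range [0..1], the product can never exceed the smallest component belief, whereas the mean always lies between the smallest and the largest.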
(Salton numbers in parentheses 4.1 sense, estimation of Commu- 50 queries form. documents techniques phrases the must methods (term col- version from language tables to evaluate 1 shows with phrase of relevant recall-precision are used In this but terms 1. treating se- test Our science text. selected CACM natural computer phrases query, the the 1983). abstracts Boolean by taking manually using along and a M.S. of the queries tested: the automatically using Wong, 3204 ACM, language was formed done and contains of the natural term, normal phrases. experiments lection constructed phrases. phrases All to a document. to a normal improve bear of manually phrases our information approaches results 2: Performance and-based intended 1. performance lected to 4.1) these to test lecting conducted (Section second signed dard set performance. Hypothesis this were first and-based 67.6 estimates, indexing – 50 queries NL 10 experiments belief syntactic by Lewis tagging describe queries a parser-based described we deriving language phrases techniques Recall the natural queries tagged ually. In -50 197 1: Numbers as phrases of Phrases select ed the belief assigned sum to the with to operator com- queries PI Recall :ision (Yo change -50 queries and-based NL 65.9 (-2.6) 10 68.5 68.5” (+0,0) 69.2 (+1.1) 20 .54.3 61.8 (+13.8) 56.2 (+3.6) 20 61.8 57.4 (-7,1) 62.2 (+0.8) 30 48.7 53.4 (+9.7) 50.0 (+2.7) 30 53.4 51.0 (-4.5) 53.0 (-0.7) 40 42.5 43.8 (+3.0) 39.5 (-7.1) 40 43.8 44.3 (+1.2) 43.6 (-0.3) 50 35.8 37.4 (+4.6) 33.9 (-5.2) 50 37.4 37.6 (+0.4) 37.4 (-0.0) 60 28.3 29.1 (+2.7) 28.3 (-0.2) 60 29.1 33.5 (+15,3) 29.6 (+1.8) 70 19.7 22.3 (+13.2) 21.4 (+8.6) 70 22.3 24.5 (+10.1) 22.4 (+0.3) 80 15.7 18.1 (+15.6) 16.1 (+2,5) 80 18.1 19.9 (+9.6) 18.2 (+0.2) 90 10.6 12.7 (+19.6) 10.3 (-3.0) 90 12.7 13.2 (+3,8) 12,7 (-0.2) 100 8.0 8.9 (+11.8) 7.0 (–12.3) 100 8.9 10.0 (+12.3) 9.0 (+0.8) 33.1 35.6 (+7.5) 32.9 (-0.8) 35.6 36.0 (+1.1) 35.7 (+0,4) 35.2 38.0 (+8.0) . 37.2 (+5.7) ./ 10 3: Performance . 
of manually constructed average queries Table 4: Comparison belief estimates ually phrases resulted the and natural Proximity In the second of terms model, and-based fault belief. in of the in close This is viewed the only in the a belief greater estimate depends on the number model, resentation a phrase concept document quency details for concepts on the results of Gay As shown significantly worse performs guage query. cause the that do the about same principally rather the belief The than one hybrid, poor and Fagan’s due as the the to not an are documents. of imity in for 4.1.3 hance single that model brids it lan- suffers containing be- phrase the hypothesis term would collections containing by recent Indeed, ten documents top queries significantly better than nearly as well exif we re- using queries and ex- effective 1990). collection, the estimate is present mate term or based of the model is matches prox- the origas the 39 belief based a document single that estimate of the beliefs, Of the maximum these, proximity met, were (the and of the and use some single constraints other belief if the based include the of the beliefs), hybrid term were of the phrase beliefs not attempt proximity on singlearm!- the single mean term operator for if esti- original met. hy- phrase estimates maximum en- recall, of these proximity tested best that All on the product the the enhance models of both, term The phrases phrases phrase features on the liefs. if the satisfy of hybrid are not beliefs proximity-based and-based best in based that while a series constraints assigned none do not We approaches precision to combine – documents are col- phrases. 
Hybrid To test performs the with fre- natural model sacrifices belief phrase “strict” the query to be more is supported in CACM collec- use of an and-based Candela, precision language and-based within and perform natural which effective for view the use of proxim- device, estimate the phrases inal based of the proximity documents raw from The estimate This (Harman compare trieved enhancing ardbased and mechanism The doc- are present legitimate particularly documents. collec- (many that if at all. it is not of short are short abstracts Many is also CACM paragraph) enhancing As such, we tested constraints recognizing is a precision window model of the records. infrequently, than of the proxim- original terms occur and 3204 proximity proximity collection of a single large rep- experiments, this is too consist contains all (1990) model in this tion periments of the no abstract is a recall (1990). as a document to ignoring width records generally the matches. composition proximity-based belief to estimate conjunctive proximity performance idf the the In the proxon the to pect in which document performance have only Relaxing additional estimate is independent proximity with of This terms. these the as well The lections the proximity See Turtle Croft model only satisfy terms. than tion. few part, recall. only In inverse The 3, the Performance proximity contain not and in Table is based is based unit. for of as an independent the nets. is set to three number terms. default. use of tf and in inference window de- containing and belief and phrasal on the in ity term the of documents is viewed (tf) (idf ) of the more the the single whose frequency but required than with the single is satisfied associated poor due, phrases In any than documents assigned imity with the those relation containing greater phrase, with as a sequence in a document. increases in a phrase proximity The 2). very uments a belief with model, the beliefs than (Table constraints. 
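The proximity model discussed here treats a phrase as matched only when its component terms occur close together in a document, and the text mentions a window of three. The following is a minimal sketch of such a test, not the implementation used in the experiments. The function name `count_proximity_matches`, the ordered matching, and the reading of "window" as a maximum distance between successive phrase terms are all assumptions made for illustration.

```python
from typing import List, Sequence


def count_proximity_matches(tokens: Sequence[str],
                            phrase: Sequence[str],
                            window: int = 3) -> int:
    """Count occurrences of `phrase` whose terms appear in order,
    each at most `window` positions after the previous one.

    Assumption: this ordered-window definition is illustrative only.
    """
    if not phrase:
        return 0
    matches = 0
    for start, token in enumerate(tokens):
        if token != phrase[0]:
            continue
        position = start
        satisfied = True
        for term in phrase[1:]:
            # Look at the next `window` positions after the last matched term.
            upper = min(position + window, len(tokens) - 1)
            following: List[int] = [
                j for j in range(position + 1, upper + 1) if tokens[j] == term
            ]
            if not following:
                satisfied = False
                break
            position = following[0]
        if satisfied:
            matches += 1
    return matches


# Hypothetical example text, not taken from the CACM collection.
tokens = "the operating system schedules each operating and file system".split()
wide = count_proximity_matches(tokens, ["operating", "system"], window=3)
narrow = count_proximity_matches(tokens, ["operating", "system"], window=2)
```

A count of this kind could serve as the within-document frequency of the phrase when the phrase is treated as an independent concept with its own tf and idf.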
produces man- performance proximity associated terms identified queries a document belief common proximity better is assigned beliefs were language a phrase model, a phrase on the terms phrases occurring from terms single in significantly original 4.1.2 of and-based, phrases proximity in which and Fagan (+1.2) – proximity-based ity queries 68.5 Table the -50 f37.t3 top the (% change hvbrid 10 average with Precision .— and Recall -— proximity beused a document The perfor- mance of the better than CACM To test the 4), gle terms, can no single were and terms from retrieval performs manually the (and-based, by terms single from original sections. relatively phrases with summarized here. As all only manually The shown in de- single last the selected of and proximity-based lection. 4.1.4 Comparison with earlier phrase model results To compare mented work a phrase a phrase the our model is simply component tially the as we would query. By in effect, acts term when estimate belief beliefs phrase As The for for terms in a nificantly in a the concepts in this way so that with other it tive terms in Table CACM suggests form better than mate on collections point out that the average than reported precision nets Fagan of .36 compared with estimate figures (1987). lent to that terms Automatic versus also well as simply using all query. set of single One strategy For an selected The and results of the improve last about retrieval kind vide this we are of course identify useful effort. In three methods for parser based primarily a stochastic section phrase 4.2.3), (Section 4.2.4). and a user we describe bracketer sparing syntax which obtained from obtained can pro- , such frequency terms should to be correct, Unless of this original query 4.2.1 Corpus the influence as those section use all in addition to the terms that should results terms in the from set of selected dicthe phrases all single in collection, For noted, the quality. 
found in the as original is phrase the component otherwise phrases about from be dropped. be retained. all single for select- with probably but equiva- performed terms high are less likely a training quality the phrases. user erate ural a the phrases attempt some part that can pus phrase 40 and form of test out are exceeds some retained threshold only it to generated (Fagan, if (in ones. spurious 1987) their our varies used to gen- of the original cases quality to eliminate filtering phrases many low phrases the technique on the quality In be used selected upon query, to screen corpus a dictionary automatically depending language to apply 4.2.2), filtering of considerably with phrases: incorporates The could the (Section from ses- that experiments recognizing on phrase phrases could an interactive automatically, should effecelimi- query, contain terms single for Strategies to be included phrases with all of the sinqueries that of de- incorporatphrases in techniques automatically information (Section during interested phrases that selected While of information this show performance. sion, of speech section manually that select high-quality remainder information to not is essentially strategies single are not the original queries terms factor used tionaries phrases ing with these ri- from sig- many are required queries to be terms since query to the phrases). but of .32. manually terms examined, terms constructed of these were higher of improve on retrieval single that query manually obtained component 4.2 It is also clear ing obtained col- and-based impact performance the original (in addition the will that eliminating per- Specifically, to an average on the CACM performance and in the original 60’?ZOof the single be- sim- than a set of single retrieval The performance the a major general, esti- We should are significantly has by phrases. the larger will or Fagan’s documents. 
inference in estimates work hybrid and-based of large with difference Fagan initial the the is little or Again, that either sing phrases those there hybrid, collection. collections average 4, and-based, to select contained retrieval. are general, behave on the that documents used from In estimates phrases In gle terms described better used be degrade. degrades scribed we when vari- will phrases estimates estimate of large with performance. hybrid significantly will and variables, perform fo- in the two these estimates proximity-based method included nate shown the collections We will selected we hypothesize and we use essen- weight combined for for for terms weight hybrid performance imple- estimates beliefs phrase the belief model, use to combine the we query. tween the this normalizing as a single in the of the With to combine computing (1987), the mean method phrase Fagan in which terms. the same are, with Again, the belief used terms filtering remaining of manually the is used single subset). the remaining and both the for the and-based ilarly terms. of section independent terms for all corpus independent performance the filter- estimate terms, and Results three corpus and the method or some technique are belief single queries, ables significantly what (no method, whether or hybrid), terms with Adding as using 4,2. 1) is used, to select the recognition considered: proximity, phrases queries. single (Section following identi- phrase were cus on recognition and sin- manually identified original the as well ing that identified performance. about with precision to the variables work of the phrases using addition other estimate. 
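The belief estimates compared in this section differ mainly in how the beliefs for a phrase's component terms are combined. The sketch below contrasts a product (the usual interpretation of an AND operator over independent beliefs), a mean (the Fagan-style estimate referred to in the text), and a maximum. The `hybrid_belief` function is only one hypothetical hybrid, which uses the proximity-based belief when the proximity constraint is met and falls back to the product otherwise. All function names and the sample belief values are assumptions for illustration and do not reproduce the system used in the experiments.

```python
from math import prod
from typing import Sequence


def and_belief(term_beliefs: Sequence[float]) -> float:
    """And-based estimate: product of the component term beliefs."""
    return prod(term_beliefs)


def mean_belief(term_beliefs: Sequence[float]) -> float:
    """Mean of the component term beliefs (Fagan-style estimate)."""
    return sum(term_beliefs) / len(term_beliefs)


def max_belief(term_beliefs: Sequence[float]) -> float:
    """Maximum of the component term beliefs."""
    return max(term_beliefs)


def hybrid_belief(term_beliefs: Sequence[float],
                  proximity_belief: float,
                  proximity_met: bool) -> float:
    """One hypothetical hybrid: proximity belief if the constraint is met,
    otherwise the and-based product. Assumption, not the paper's best hybrid."""
    return proximity_belief if proximity_met else and_belief(term_beliefs)


# Hypothetical beliefs for a two-word phrase in one document.
beliefs = [0.6, 0.5]
estimates = {
    "and": and_belief(beliefs),
    "mean": mean_belief(beliefs),
    "max": max_belief(beliefs),
    "hybrid_met": hybrid_belief(beliefs, 0.8, True),
    "hybrid_unmet": hybrid_belief(beliefs, 0.8, False),
}
```

The product is never larger than the smallest component belief, which is one way to see why an and-based phrase acts as a stricter condition than the mean of the same term beliefs.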
manually 5, dropping In on the suggests average conducted terms, single in Table improve importance and terms, significantly initial documents and-based tests is not formulation However, larger the relative fied phrases grades (Table estimate 10-20% over estimate and-based containing hybrid all phrase original collection a collection the hybrid the in may phrases One more in an technique phrases which collection case, nat- be useful is cor- candidate frequency than one

Precision (% change) - 50 queries
Recall    man.    man. phrases,    man. phrases,
                  no terms         all terms
10        68.5    65.2 (-4.8)      68.5 (+0.0)
20        57.4    54.1 (-5.7)      58.6 (+2.1)
30        51.0    45.7 (-10.5)     51.0 (-0.2)
40        44.3    39.7 (-10.3)     43.7 (-1.5)
50        37.6    33.6 (-10.6)     38.4 (+2.2)
60        33.5    28.4 (-15.3)     31.9 (-4.9)
70        24.5    20.2 (-17.8)     23.7 (-3.4)
80        19.9    15.4 (-22.4)     18.0 (-9.6)
90        13.2    11.2 (-14.8)     12.8 (-3.2)
100       10.0     9.3 (-7.5)      10.2 (+1.4)
average   36.0    32.3 (-10.3)     35.7 (-0.9)

Table 5: Performance effect of single terms

Precision (% change) - 50 queries
Recall    unfiltered    filtered
10        68.5          68.2 (-0.5)
20        58.6          58.0 (-1.0)
30        51.0          50.1 (-1.8)
40        43.7          43.0 (-1.6)
50        38.4          37.1 (-3.5)
60        31.9          30.8 (-3.3)
70        23.7          22.9 (-3.6)
80        18.0          17.7 (-1.8)
90        12.8          11.9 (-6.8)
100       10.2           9.3 (-8.8)
average   35.7          34.9 (-2.2)

Table 6: Effect of corpus phrase filtering

be term good tends able tering sistency, otherwise phrase performance tering. constructed selected phrases tend phrases and improves performance results reduce the number but it is also clear are eliminated that of phrases that corpus with for each a number (using does technique used, of reasonable manually selected in the operating to match query only slight and performance frequen- word terms high frequency are occur as a receive Again, this with no credit technique the in- system) they documents gains corpus collection when fil- technique, computer only occurrences. that this very system, the in a document term high better phrase suggests Essentially, unless selected since set of single fil- For con- corpus With reli- phrase filtering understated very the query.
(e.g., phrase with to be as manually without effective. from of the phrases, that collections be phrases removed for single phrases phrases larger not corpus phrase mean somewhat can from cluded corpus will can be achieved filtering words filtering are inclusion for these use This results phrases 1 shows will noted. Work manually the performance. selected term with and overall as manually assumed Table descriptors to help Automatically queries occurrence). selection content phrases cies are gives CACM collec- tion. as a guideline). The effectiveness heavily ually on the selected do not hurts occur to treat it dow mand The interpreter, using in the phrase, quality even by the small complexity words strict are class, comprising proximity, assigned the text belief during (e.g., sim- (e.g., com- compatibility phrases is unaffected for the from of automatically a partial which generate all pairs parse since subject, and system analyzes from is the Longman noun a noun only level, phrases (LDOCE) phrase head (Boguraev words that and a modifying sentence of and Briscoe, which adjective. to The 1987) which a is its The below produce lexicon Contemporary reare constituents of heuristics constituents. a to are heads Examples phrase those Dictionary we used attempts by a grammatical 1990). a set relationships which of a noun relying adjacent is to all pairs experiment, system Croft, extract syntactic this connected and phrases and of non-function structures the clause For generation (Lewis and text, in specified output. phrase the to occur parser generating or document syntactic verb y).
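Corpus phrase filtering, as described in this section, retains a candidate phrase only if its frequency in the collection exceeds some threshold. The sketch below illustrates that idea under simplifying assumptions: a phrase is counted as present only when its words are adjacent and in order, frequency is measured as the number of documents containing the phrase, and the names `contains_phrase` and `filter_phrases`, the threshold value, and the toy documents are hypothetical.

```python
from typing import Iterable, List, Sequence, Tuple

Phrase = Tuple[str, ...]


def contains_phrase(tokens: Sequence[str], phrase: Phrase) -> bool:
    """True if the words of `phrase` occur adjacently and in order."""
    n = len(phrase)
    return any(tuple(tokens[i:i + n]) == phrase
               for i in range(len(tokens) - n + 1))


def filter_phrases(candidates: Iterable[Phrase],
                   documents: Sequence[Sequence[str]],
                   threshold: int = 1) -> List[Phrase]:
    """Keep candidates whose document frequency exceeds `threshold`.

    Assumption: document frequency with adjacency matching stands in for
    whatever frequency measure the original experiments used.
    """
    kept: List[Phrase] = []
    for phrase in candidates:
        doc_freq = sum(1 for doc in documents if contains_phrase(doc, phrase))
        if doc_freq > threshold:
            kept.append(phrase)
    return kept


# Toy collection for illustration only.
docs = [
    "the operating system schedules processes".split(),
    "an operating system manages memory".split(),
    "horizontal microcode is rare".split(),
]
candidates = [("operating", "system"),
              ("horizontal", "microcode"),
              ("window", "manager")]
kept = filter_phrases(candidates, docs, threshold=1)
```

With this threshold, a phrase that occurs in only one document, or in none, is dropped even if it is a reasonable phrase, which is the trade-off the filtering results point to.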
tend the lationship win- others type phrases query of syntactic while performance same appear sample these in reason not the of words de- user-produced collection microcode) the does method parse actually a user terminology CACM One that is strong if it used phrases Since of these Syntactic For man- collection 6). there Some commonly in query 4.2.2 depends phrases. (Table horizontal occur individual documents the covered not 3 When all not manager, do once collection. were period ply than filtering original eliminating slightly3 as high CACM phrases the more produced in the phrase of the phrases, performance liberately of corpus quality used English provides syntactic categories for analyzer for cabulary considerably. As shown phrases form tic Table by phrases (Table a very many nated by corpus remain phrase programs, separation Since contains most queries, it could to the tagged tified vo- of which do not if phrases that phrases. More comprehensive the when these abbreviated per- phrases 4.2.5 ral performance also using that signs parts lexical tagger language queries a tent help. a stochastic developed of speech and to contextual addition, the by words Church based on probabilities boundaries of tag- (1988) knowledge investigated phrases ing tagged by the phrases. Church The tagger produced higher noun Church number of phrases number results both generates than tors is the remains of phrases from its produced lower bigger from number 1), but a much based phrases. and fewer system. contributor the system because linguistic Which The using than the worse original 7, but better tagged struc- of these to improved natural than the recall wit bout with appears to fac- performance perform language manually slightly queries and selected that manual phrases corpus phrase filtering). perform Phrases from work on the retrieval effectiveness. slightly these and others, evidence about or thesaurus is to specific general ACM version). 
source, possibility use for identifying phrases in a machine-readable lexicon. source For of Computing Because we used these dictionary experiments, phrases for Review Classification there those are identified or few domain- we used computer very queries a very science System phrases in queries in cepts related - the tween tures this concepts experts as additions not 42 the simple thus far. queries) the hyimprove also support the sec- difference between selected we are currently a much that, and will larger on more phrases re- corpus. larger useful collecsource contribute concentrate structured of information importance those of more in the that go beyond form the to include in the query, concepts capturing of two This and query. con- We relationships simple For example, AND the We are look- original for components. on improving queries. of query methods as phrases. often phrase with phrases, to studying such schemes and automatically a much types as relative (1987 phrases hybrid than is little observing research such also results becomes for building ing at other Another in- retrieval. future techniques the identifying and previously, are that queries 4, we can accept there experiments proximity the set Proximity investigated of manually better selecting size. structured The in that these with in better (and be importhat suggests for in section phrases As mentioned to effective a dictionary results pro- query, collection techniques that hypothesis for work substantially pothesis tions, and tool con- selected may improved proximity-based Conclusion Our 4.2.4 and 5 We, (Ta- and as and is likely structure be an effective to which initial document co-occurrence peating some- phrases earlier, overall manually in the representing the effectiveness better phrases in natu- automatically It can be further term ond levels, results phrases the to be included of phrases. phrases ocwith original as well The perspective. selecting mentioned creases 9). 
precision levels for Based it to be established. what high terms appear the are reasonable, rate, from that phrases the not outperform a user’s in documents indexby by the tagger error phrases than selected phrases selected (Table did of single As noun produced (Table phrases the parser-based Queries at from although terms importance are simple as syntactic lower the manually indexing tures tagger parser of these with the of phrases by the syntactic proportion lower exactly is considerably on comparison ble using Us- words further, than manually phrases strategies In identified. We slightly. Performance dictionary better queries, containing bearing tant as- of occurrence. simple in improved and perform performance stochastic queries tagged queries ging The contained of to the tagged single 50 documents. all set. improves we removed were than almost selected are added the dictionary Summary Combining original parsing, from phrases idenin and phrases Syntactic they the phrases, performance filtering, in more duced 4.2.3 term curred technique the might corpus 8), of manually was occurred although number in the phrase phrase that, the dictionary (Table queries is large filtering accurate grammar, a small present elimi- degrade from improve When of the algorithms which a better to significantly were phrases of questionable phrases phrases added A dictionary form 1 shows query are work, corresponding) useful describe exact Table them ing phrases. the only lansyntac- of candidate phrases the set of syntactic per- natural The number noise syntactic above the large filtering, is possible of syntactic more this the phrases. implementations of the be used using either of these (e. g., formance. A simple described than 1), many While value queries system constructed produces content. 7, the worse manually parser words. augments query. 
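The recall-precision tables in this paper report precision at ten recall levels, an average over those levels, and the percentage change relative to a baseline column. As a small worked example, the sketch below recomputes the average and the percentage change from the natural language (NL) and manual columns of Table 7. The function names are assumptions, and because the published percentages were presumably computed from unrounded precision values, the recomputed change may differ from the printed one in the last digit.

```python
from typing import Sequence


def average_precision(precisions: Sequence[float]) -> float:
    """Mean precision over the reported recall levels."""
    return sum(precisions) / len(precisions)


def percent_change(baseline: float, value: float) -> float:
    """Relative change of `value` with respect to `baseline`, in percent."""
    return (value - baseline) / baseline * 100.0


# Precision at recall 10, 20, ..., 100 (NL and manual columns of Table 7).
nl = [67.6, 54.3, 48.7, 42.5, 35.8, 28.3, 19.7, 15.7, 10.6, 8.0]
manual = [68.2, 58.0, 50.1, 43.0, 37.1, 30.8, 22.9, 17.7, 11.9, 9.3]

nl_avg = average_precision(nl)
manual_avg = average_precision(manual)
change = percent_change(nl_avg, manual_avg)

print(f"NL average: {nl_avg:.1f}")
print(f"manual average: {manual_avg:.1f}")
print(f"change: {change:+.1f}%")
```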
Precision (% change) - 50 queries
Recall    NL      manual          parsed          tagged
10        67.6    68.2 (+0.8)     59.2 (-12.5)    68.4 (+1.2)
20        54.3    58.0 (+6.9)     51.4 (-5.3)     57.3 (+5.6)
30        48.7    50.1 (+2.8)     41.5 (-14.7)    49.4 (+1.5)
40        42.5    43.0 (+1.1)     36.4 (-14.4)    41.4 (-2.5)
50        35.8    37.1 (+3.7)     30.5 (-14.8)    35.4 (-1.0)
60        28.3    30.8 (+8.8)     26.1 (-7.9)     29.0 (+2.3)
70        19.7    22.9 (+16.1)    19.7 (+0.3)     21.3 (+8.3)
80        15.7    17.7 (+12.5)    15.6 (-0.7)     17.0 (+8.4)
90        10.6    11.9 (+12.0)    10.1 (-4.5)     11.3 (+6.8)
100        8.0     9.3 (+16.1)     7.2 (-9.4)      8.5 (+6.0)
average   33.1    34.9 (+5.3)     29.8 (-10.1)    33.9 (+2.4)

Table 7: Performance of automatically selected phrases

Precision (% change) - 50 queries
Recall    NL      tagged          tagged (with corpus    tagged plus dictionary
                                  phrase filtering)      (with term filtering)
10        67.6    68.4 (+1.2)     69.4 (+2.7)            71.5 (+5.7)
20        54.3    57.3 (+5.6)     57.0 (+5.1)            59.1 (+8.8)
30        48.7    49.4 (+1.5)     49.5 (+1.7)            50.3 (+3.4)
40        42.5    41.4 (-2.5)     40.7 (-4.2)            42.1 (-0.8)
50        35.8    35.4 (-1.0)     36.0 (+0.8)            36.5 (+2.0)
60        28.3    29.0 (+2.3)     28.8 (+1.6)            29.3 (+3.6)
70        19.7    21.3 (+8.3)     21.6 (+9.5)            21.9 (+11.4)
80        15.7    17.0 (+8.4)     17.5 (+11.8)           17.8 (+13.4)
90        10.6    11.3 (+6.8)     11.7 (+10.4)           12.1 (+13.8)
100        8.0     8.5 (+6.0)      9.0 (+12.8)            9.3 (+16.4)
average   33.1    33.9 (+2.4)     34.1 (+3.1)            35.0 (+5.7)

Table 8: Performance of tagged phrases

Precision (% change) - 50 queries
Recall    NL      manual          tagged + dictionary
                  (unfiltered)    (filtered)
10        67.6    68.5 (+1.3)     71.5 (+5.7)
20        54.3    58.6 (+8.0)     59.1 (+8.8)
30        48.7    51.0 (+4.6)     50.3 (+3.4)
40        42.5    43.7 (+2.7)     42.1 (-0.8)
50        35.8    38.4 (+7.4)     36.5 (+2.0)
60        28.3    31.9 (+12.6)    29.3 (+3.6)
70        19.7    23.7 (+20.4)    21.9 (+11.4)
80        15.7    18.0 (+14.6)    17.8 (+13.4)
90        10.6    12.8 (+20.2)    12.1 (+13.8)
100        8.0    10.2 (+27.3)     9.3 (+16.4)
average   33.1    35.7 (+7.7)     35.0 (+5.7)

Table 9: Performance of manual versus automatic phrase selection

in or 35,000 morphology produced significantly guage about inflectional linguistic concepts are be- struc in Boolean queries, which implies a strong relationship between those concepts in the text, but it is not clear what type of relationship. An important part of our work in the near future will be the design and implementation of interfaces appropriate for structured queries.

Acknowledgments

This research was supported in part by the Air Force Office of Scientific Research under contract AFOSR 90-0110 and by NSF Grant IRI-8814790. The authors wish to thank Bob Krovetz and Penelope Sibun for discussions about this work. We also wish to thank Longman Group, Ltd. for making LDOCE available, as well as Ken Church and Bell Laboratories for making the stochastic tagger and bracketer available.

References

Anick, P. G.; Brennan, J. D.; Flynn, R. A.; Hanssen, D. R.; Alvey, B.; Robbins, J. M. "A Direct Manipulation Interface for Boolean Information Retrieval via Natural Language Query", Proceedings of the 13th International Conference on Research and Development in Information Retrieval, 135-150, 1990.

Boguraev, B.; Briscoe, T. "Large Lexicons for Natural Language Processing: Utilizing the grammar coding system of LDOCE", Computational Linguistics, 13: 203-218; 1987.

Church, K. "A Stochastic Parts Program and Noun Phrase Parser for Unrestricted Text", Proceedings of the Second Conference on Applied Natural Language Processing, 136-143, 1988.

Cleverdon, C. W.; Keen, E. M. Factors Determining the Performance of Indexing Systems, Vol. 1,2. Aslib Cranfield Research Project, Cranfield, England. 1966.

Croft, W. B. "Boolean Queries and Term Dependencies in Probabilistic Retrieval Models". Journal of the American Society for Information Science, 37: 71-77; 1986.

Croft, W. B.; Das, R. "Experiments with query acquisition and use in document retrieval systems", Proceedings of the 13th International Conference on Research and Development in Information Retrieval, 349-368, 1990.

Das-Gupta, P. "Boolean Interpretation of Conjunctions for Document Retrieval", Journal of the American Society for Information Science, 38: 245-254; 1987.

Dillon, M.; Gray, A. S. "FASIT: A Fully Automatic Syntactically Based Indexing System", Journal of the American Society for Information Science, 34: 99-108; 1983.

Fagan, J. Experiments in Automatic Phrase Indexing for Document Retrieval: A Comparison of Syntactic and Non-Syntactic Methods. Ph.D. Thesis, Computer Science Department, Cornell University, Technical Report 87-868, 1987.

Fuhr, N. "Models for Retrieval with Probabilistic Indexing", Information Processing and Management, 25, 55-72, 1989.

Fuhr, N.; Buckley, C. "Probabilistic Document Indexing from Relevance Feedback Data", Proceedings of the 13th International Conference on Research and Development in Information Retrieval, 45-62, 1990.

Gay, L.; Croft, W. B. "Interpreting Nominal Compounds for Information Retrieval", Information Processing and Management, 26(1), 21-38, 1990.

Harman, D. "Towards interactive query expansion", Proceedings of 11th International Conference on Research and Development in Information Retrieval, 321-332, 1988.

Harman, D.; Candela, G. "Retrieving records from a gigabyte of text on a minicomputer using statistical ranking", Journal of the American Society for Information Science, 41, 581-589, 1990.

Krovetz, R.; Croft, W. B. "Lexical Ambiguity and Information Retrieval", ACM Transactions on Information Systems, (to appear).

Lewis, D. Representation and Learning in Information Retrieval. Ph.D. Thesis, in preparation, 1991.

Lewis, D. D.; Croft, W. B. "Term Clustering of Syntactic Phrases", Proceedings of the 13th International Conference on Research and Development in Information Retrieval, 385-404, 1990.

Pearl, J. Probabilistic Reasoning in Intelligent Systems. Morgan Kaufmann, California, 1989.

Van Rijsbergen, C. J. Information Retrieval. Butterworths, London; 1979.

Robertson, S. E. "The Probability Ranking Principle in IR", Journal of Documentation, 33, 294-304, 1977.

Salton, G. Automatic Information Organization and Retrieval. McGraw-Hill, New York; 1968.

Salton, G.; McGill, M. Introduction to Modern Information Retrieval. McGraw-Hill, New York; 1983.

Salton, G.; Fox, E. A.; Wu, H. "Extended Boolean Information Retrieval", Communications of the ACM, 26, 1022-1036, 1983.

Salton, G.; Yang, C. S.; Yu, C. T. "A Theory of Term Importance in Automatic Text Analysis", Journal of the American Society for Information Science, 26: 33-44; 1975.

Smeaton, A.; Van Rijsbergen, C. J. "Experiments on Incorporating Syntactic Processing of User Queries into a Document Retrieval Strategy". Proceedings of ACM SIGIR International Conference on Research and Development in Information Retrieval, 31-52, 1988.

Smith, M. E. Aspects of the P-Norm Model of Information Retrieval: Syntactic Query Generation, Efficiency and Theoretical Properties. Ph.D. Thesis, Computer Science Department, Cornell University, 1990.

Sparck Jones, K.; Tait, J. I. "Automatic Search Term Variant Generation", Journal of Documentation, 40: 50-66; 1984.

Tong, R. M.; Shapiro, D. G. "Experimental Investigations of Uncertainty in a Rule-Based System for Information Retrieval", International Journal of Man-Machine Studies, 22: 265-282; 1985.

Turtle, H. Inference Networks for Document Retrieval, Ph.D. Thesis, University of Massachusetts, COINS Technical Report 90-92, 1990.

Turtle, H. R.; Croft, W. B. "Evaluation of an Inference Network-Based Retrieval Model", ACM Transactions on Information Systems, (to appear).