Quantifying the Challenges in Parsing Patent Claims

Suzan Verberne (s.verberne@let.ru.nl)
Eva D'hondt (e.dhondt@let.ru.nl)
Nelleke Oostdijk (n.oostdijk@let.ru.nl)
Cornelis H.A. Koster (kees@cs.ru.nl)

Affiliations: Suzan Verberne, Eva D'hondt and Nelleke Oostdijk: Information Foraging Lab / Centre for Language and Speech Technology, Radboud University Nijmegen; Suzan Verberne also: Research Group in Computational Linguistics, University of Wolverhampton; Cornelis H.A. Koster: Information Foraging Lab / Computing Science Institute, Radboud University Nijmegen.

Copyright is held by the author/owner(s). 1st International Workshop on Advances in Patent Information Retrieval (AsPIRe'10), March 28, 2010, Milton Keynes.

ABSTRACT

In this paper, we aim to verify and quantify the challenges of patent claim processing that have been identified in the literature. We focus on the following three challenges that, judging from the number of mentions in papers concerning patent analysis and patent retrieval, are central to patent claim processing: (1) the sentences are much longer than in general language use; (2) many novel terms are introduced in patent claims that are difficult to understand; (3) the syntactic structure of patent claims is complex. We find that the challenges of patent claim processing that are related to syntactic structure are much more problematic than the challenges at the vocabulary level. The sentence length issue only causes problems indirectly, by resulting in more structural ambiguities for longer noun phrases.

Keywords

Patent Claim Processing, Challenges in Patent Search, Vocabulary Issues, Syntactic Parsing

1. INTRODUCTION

Patent retrieval is a rising research topic in the Information Retrieval (IR) community. One of the most salient search tasks performed on patent databases is prior art retrieval. The task of prior art retrieval is: given a patent application, find existing patent documents that describe inventions which are similar or related to the new application. For every patent application that is filed at the European Patent Office, prior art retrieval is performed by qualified patent examiners. Their goal is to determine whether the claimed invention fulfills the criterion of novelty compared to earlier similar inventions [1].

In its classic set-up, prior art searching involves a large amount of human effort: through careful examination of potential keywords in the patent application, the patent examiner composes a query and retrieves a set of documents. The documents are then analyzed one by one to judge their relevance. From the relevant documents, new keywords are added to the query and the process is repeated until relevant information has been found or the search possibilities have been exhausted. Since professional searchers are expensive, it is worthwhile investigating how the prior art searching process can be facilitated by retrieval engines.
Previous work suggests that for prior art search, the claims section is the most informative part of a patent, but it is also the most difficult to parse [12, 25, 14, 13]. Among the language processing tasks that can support the patent search and analysis process are term extraction, summarization and translation [27]. In order to perform these tasks (semi-)automatically, at least sentence splitting and morphological analysis are needed, but in many cases also some form of syntactic parsing. Existing natural language parsers may fail to properly analyze patent claims because the language used in patents differs from the 'regular' English language for which the tools have been developed [25]. Patent claims have a fixed structure: they consist of one long sentence, starting with "We claim:" or "What is claimed is:", followed by item lists ('series of specified elements', see http://en.wikipedia.org/wiki/Claim_(patent)), which are realized by noun phrases. The terminology used in patent claims is highly dependent on the specific topic domain of the patent (e.g. mechanical engineering).

The challenges related to patent claim processing have been identified by a number of researchers in the patent retrieval field (see Section 2), but these studies lack any kind of quantification of the challenges: most of them do not provide statistics on sentence length, sentence structure, lexical distributions, or the differences between the language used in patent claims and the language used in large non-patent corpora.

In this paper, we aim to verify and quantify the challenges of patent claim processing that have been identified in the literature. We focus on the three challenges that are listed in the often-cited paper by Shinmori et al. (2003) [25] about patent claim processing for Japanese (research on patent processing and retrieval has a somewhat longer history in Japan than in Europe and the U.S. because of the patent retrieval track in the NTCIR evaluation campaign [15]):

1. The sentences are much longer than in general language use;

2. Many novel terms are introduced in patent claims that are difficult to understand;

3. The structure of the patent claims is complex. Consequently, most syntactic parsers (even those that achieve good results on general language texts) fail to correctly analyze patent claims.

We chose these challenges because we think they are central to patent claim processing, which may be concluded from the frequent mentions of these challenges in other papers concerning patent analysis and patent retrieval (see Section 2). We expect that the challenges that Shinmori et al. found for Japanese will also hold for English patent claims. We will verify this in Section 4. In the same section, we will quantify the challenges related to sentence length, vocabulary issues and syntactic structure, using a number of (patent and non-patent) corpora and NLP tools. First, in Section 2 we provide a background for the current paper. Then, in Section 3 we describe the data that we used.

2. BACKGROUND: PATENT PROCESSING

In this section, we discuss previous work on patent processing. The papers that we discuss here stress the complexity of the language used in patents, especially in the claims sections.
Most of the work is directed at facilitating human patent processing, in many cases by improving the readability of patent texts.

Bonino et al. (2010) explain that in patent searching, both recall and precision are highly important [5]. Because of legal repercussions, no relevant information should be missed. On the other hand, retrieving fewer (irrelevant) documents makes the search process more efficient. In order to have full control over precision and recall, patent search professionals generally employ an iterative search process. This process can be supported by NLP tasks such as query synonym expansion (which is already commonly used in patent text searches), sentence focus identification and machine translation.

Mille and Wanner (2008) stress that of all sections in a patent document, the claims section is the most difficult to read for human readers [22]. This is especially due to the fact that, in accordance with international patent writing guidelines, each claim must consist of one single sentence. Mille and Wanner mention similar challenges to the ones listed by Shinmori et al. (2003): sentence length, terminology and syntactic structure. However, they describe the terminology challenge not as an issue of understanding complex terms (as Shinmori et al. do [25]) but as the problem of 'abstract vocabulary', which is not further specified in their paper.

In their introduction to the special issue on patent processing, Fujii et al. (2007) state that from a legal point of view, the claims section of a patent is the most important [12]. They describe the language used in patent claims as a very specific sublanguage and state that specialized NLP methods are needed for analyzing and generating patent claims.

Wanner et al. (2008) describe their advanced patent processing service PATExpert [27]. PATExpert is aimed at facilitating patent analysis by the use of knowledge bases (ontologies) and a set of NLP techniques such as tokenizers, lemmatizers, taggers and syntactic parsers. Moreover, it offers a paraphrasing module which accounts for a two-step simplification of the text: (1) splitting the text into smaller units, taking into account its discourse structure, and (2) transforming the smaller units into easily understandable clauses with the use of 'predefined well-formedness criteria'.

Tseng et al. (2007) experiment with a number of text mining techniques for patent analysis that are related to the analytical procedures applied by professional searchers to patent texts [26]. They perform automatic summarization using text surface features (such as position and title words). Moreover, they extend the Porter stemmer algorithm and an existing stopword list, both focusing on the specifics of patent language. Tseng et al. identify the extraction of key phrases as one of the main challenges in patent claim analysis because "single words alone are often too general in meanings or ambiguous to represent a concept". This relates to the 'abstract vocabulary' problem identified by Mille and Wanner (see above). Tseng et al. find that multi-word strings that are repeated throughout a patent are good key phrases and likely to be legal terms.

Finally, Sheremetyeva (2003) uses predicate-argument structures to improve the readability of the claims section [24].
In her system, readability improvement is the first step in a suggested patent summarization method.

All of the papers mentioned in this section use some form of NLP to facilitate patent analysis by humans. In the IR field, however, patent retrieval is generally addressed as a text retrieval task that only uses word-level information without deeper linguistic processing. Academic research on patent retrieval has mainly focused on the relative weighting of the index terms and on exploiting the patent document structure to boost retrieval [21]. For an overview of the state of the art in academic and commercial patent retrieval systems, we refer to Bonino et al. (2010) [5].

A small number of approaches to patent retrieval use linguistic processing to improve retrieval. The systems developed by Escorsa et al. (2008) and Chen et al. (2003) perform a combination of syntactic and semantic analysis on the documents [11, 8]. The work described by Koster et al. (2009) and D'hondt et al. (2010) aims at developing an interactive patent retrieval engine that uses dependency relations as index and search terms [18, 10]. In order to generate these dependency relations, a syntactic parser is being developed that is especially adapted to analyzing patent texts. We will come back to this parser in Section 3.2.

3. DATA

For the experiments reported in this paper, we use the subset of 400,000 documents of the MAtrixware REsearch Collection (MAREC) that was supplied by Matrixware (http://www.matrixware.com/) for use in the AsPIRe'10 workshop. In the remainder of this paper, we will refer to this corpus of 400,000 patents as the 'MAREC subcorpus'.

3.1 Preprocessing the corpus

Since the aim of the current paper is to quantify the challenges of parsing patent claims, we first extracted the claims sections from the MAREC subcorpus, disregarding the other fields of the XML documents. Moreover, as we are developing techniques for mining English patent texts, we are only interested in those patents that are written in English. Using a Perl script, we extracted all English claims sections (marked with <claims lang="EN">) from the directory tree of the MAREC subcorpus and removed the XML markup. This resulted in 67,292 claims sections (the other documents in the subcorpus either do not contain a claims section or are in a language other than English) with 56,117,443 words in total.

Having extracted and cleaned up all claims sections, we used a sentence splitter to split the claims sections into smaller units. As pointed out by [25], sentence splitting for claims sections is not a trivial task. Many sentences have been glued together using semicolons (;). We therefore decided to use not only full stops but also semicolons as split characters in our sentence splitter. We found that the 67,292 claims sections consist of 1,051,040 sentences. (Recall from Section 1 that patent claims are composed of noun phrases (NPs), not clauses. In the remainder of this paper, we use the word 'sentences' to refer to the units, mostly NPs, that are separated by semicolons and full stops in patent claims. We use the term 'noun phrase (NP)' when we refer to the syntactic characteristics of such units.)
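To make the splitting policy concrete, the following is a minimal Python sketch of such a splitter. The actual splitter used in our experiments is a Perl script that is not reproduced in this paper, so the function and its regular expression below are our own illustration of the full-stop-plus-semicolon policy, not the original implementation.

```python
import re

def split_claims_section(text: str) -> list[str]:
    """Split a claims section into 'sentences', treating both full stops
    and semicolons as terminators (the policy described above)."""
    # Naive: a claim number such as "1. A method" is also split off.
    parts = re.split(r"[.;]\s+", text)
    return [p.strip(" .;") for p in parts if p.strip(" .;")]

claims = ("What is claimed is: 1. A method comprising steps A and B; "
          "wherein step A precedes step B. 2. The method of claim 1 "
          "wherein step B is repeated.")
for unit in split_claims_section(claims):
    print(unit)
```

As the example output shows, such a splitter mostly yields the semicolon-separated NP units discussed above, at the price of some noise around claim numbers.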
3.2 Parsing the corpus

In order to assess and quantify the third challenge listed in Section 1 (the complex syntactic structure of patent claims), we need a syntactic analysis of the MAREC subcorpus. To this end, we use the baseline version of the syntactic parser that is under development in the 'Text Mining for Intellectual Property' (TM4IP) project [18]. The aim of this project is to develop an approach to interactive retrieval for patent texts, in which dependency triplets instead of single words are used as indexing terms. In the TM4IP project, a dependency triplet has been defined as two terms that are syntactically related through one of a limited set of relators (SUBJ, OBJ, PRED, MOD, ATTR, ...), where a term is usually the lemma of a content word [10]. For example, the sentence "The system consists of four separate modules" will be analyzed into the following set of dependency triplets:

[system,SUBJ,consist]
[consist,PREPof,module]
[module,ATTR,separate]
[module,QUANT,four]

Using dependency triplets as indexing terms in a classification experiment, Koster and Beney (2009) have recently achieved good results for classifying patent applications into their correct IPC classes [17].

The dependency parser that generates the triplets is called AEGIR ('Accurate English Grammar for Information Retrieval'). In its baseline version, AEGIR is a rule-based dependency parser that combines a set of hand-written rules with an extensive lexicon. The resolution of lexical ambiguities is guided by lexical frequency information stored in the parser lexicon. These lexical frequencies provide information on the possible parts of speech that can be associated with a particular word form. For example, in general English, we can expect zone as a noun to have a higher frequency than zone as a verb. For the current paper, we collected lexical frequency information from a number of different sources in order to examine the lexical differences between the English language use in patent claims and the language use in different contexts. We will come back to this in Section 4.2.

For the current paper, we decided to parse 10,000 of the 67,292 English claims sections in the MAREC subcorpus. These 10,000 claims sections contain a total of 6.9 million words. Splitting them with the sentence splitter described in Section 3.1 results in 207,946 sentences.
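Before moving on, the triplet representation introduced in this section can be made concrete with a small sketch. The NamedTuple and the toy inverted index below are our own illustration; only the relator names and the four example triplets come from the text.

```python
from typing import NamedTuple

class Triplet(NamedTuple):
    term1: str    # usually the lemma of a content word
    relator: str  # one of a limited set: SUBJ, OBJ, PRED, MOD, ATTR, ...
    term2: str

# The analysis of "The system consists of four separate modules"
# given above, written out as data:
analysis = {
    Triplet("system", "SUBJ", "consist"),
    Triplet("consist", "PREPof", "module"),
    Triplet("module", "ATTR", "separate"),
    Triplet("module", "QUANT", "four"),
}

# Triplets are hashable and can therefore serve directly as index terms,
# e.g. as keys of an inverted index (document id 42 is invented):
index: dict[Triplet, set[int]] = {t: {42} for t in analysis}
```

The design point is that a triplet is an atomic, comparable unit: it can be counted, stored and matched exactly like a single-word index term.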
4. VERIFYING AND QUANTIFYING PATENT CLAIM CHALLENGES

The three challenges of patent claim processing mentioned in Section 1 are: (1) the sentences are much longer than in general language use; (2) many novel terms are introduced in patent claims that are difficult to understand; and (3) the structure of the patent claims is complex, as a result of which syntactic parsers fail to correctly analyze patent claims. In the following subsections 4.1, 4.2 and 4.3 we perform a series of analyses and experiments in order to verify and quantify these three challenges.

4.1 Challenge 1: Sentence length

After splitting the MAREC subcorpus into sentences (see Section 3.1), we extracted a number of sentence-level statistics from the corpus. As already reported in the previous section, the 67,292 claims sections of the MAREC subcorpus consist of 1,051,040 sentences. There is much overlap between the sentences: after removing duplicates, 580,866 unique sentences remain. The median sentence length is 22 words; the average length is 53 words.

Binning the sentences from MAREC with the same length together and counting the number of sentences in each bin results in a long-tail distribution. The peak of the distribution lies around 20 words (25,000 occurrences), with outliers for sentence lengths 3 (20,637 occurrences) and 5 (32,849 occurrences). In Figure 1, the MAREC sentence length distribution is compared to the sentence length distribution of the British National Corpus (BNC) [19], which we preprocessed using the same sentence splitter as we used on the MAREC subcorpus.

[Figure 1: Distribution of sentence lengths in the MAREC subcorpus, compared to the BNC: percentage of MAREC sentences and percentage of BNC sentences per sentence length, with a logarithmic x-axis from 1 to 10,000 words.]

Figure 1 shows that sentences in MAREC are, as the literature suggests, longer than the sentences in the BNC (the early peak is the BNC, the later peak is MAREC), even though we use the semicolon for sentence splitting in addition to the full stop.
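The statistics reported above can be reproduced with a few lines of Python; this is a minimal sketch under the assumption that sentence length is measured in whitespace-separated words, and the helper function and toy input are ours.

```python
from collections import Counter
from statistics import mean, median

def length_stats(sentences: list[str]):
    """Median and mean sentence length in words, plus a histogram that
    bins sentences of equal length together (cf. Figure 1)."""
    lengths = [len(s.split()) for s in sentences]
    return median(lengths), mean(lengths), Counter(lengths)

med, avg, histogram = length_stats([
    "The method of claim 4 wherein n is zero",
    "A device comprising a first member and a second member",
])
print(med, avg, histogram.most_common(3))
```

On the MAREC data, the same computation yields the median of 22, the mean of 53 and the long-tail histogram discussed above.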
4.2 Challenge 2: Vocabulary

Shinmori et al. (2003) state that many novel terms are used in Japanese patent claims. We performed four types of analysis at the vocabulary level to verify this for English patent claims: (1) a lexical coverage test of single-word terms from a lexicon of general English on the MAREC subcorpus, (2) an overview of the most frequent words in the MAREC subcorpus compared to the BNC, (3) frequency counts on ambiguous lexical items (as introduced in Section 3.2), and (4) an analysis of multi-word terms in the MAREC corpus.

The coverage of general English vocabulary

In order to quantify the differences between the vocabulary used in patent claims and general English vocabulary, we performed a lexical coverage test of the CELEX lexical database [2] on the MAREC subcorpus. The CELEX file EMW.CD contains 160,568 English word forms that are supposed to cover general English vocabulary: according to the CELEX readme file (http://www.ldc.upenn.edu/Catalog/readme_files/celex.readme.html), the lexicon contains the word forms derived from all lemmata in the Oxford Advanced Learner's Dictionary (1974) and the Longman Dictionary of Contemporary English (1978). The CELEX documentation reports that on the 17.9 million word corpus of Birmingham University/COBUILD, the token coverage of CELEX is 92%.

We measured the coverage of CELEX entries on the MAREC subcorpus using a so-called corpus filter written in AGFL (http://www.agfl.cs.ru.nl/). A corpus filter takes as input a corpus in plain text and a wordform lexicon. The corpus text is split up into tokens, which are matched to the lexicon using a smart form of matching with respect to capitalization: if a word is in the lexicon in lowercase, then it may match both an uppercase and a lowercase variant in the corpus. If a word in the lexicon has one or more uppercase letters, then it only matches identically uppercased forms in the corpus. This accommodates sentence-initial capitalization in the corpus for lowercase lexicon forms such as the, while it prevents proper names from the lexicon from being matched to common nouns in the corpus. Moreover, the corpus filter allows us to skip special tokens such as single characters, numerals and formulae. If we disregard these special tokens, we get a more lenient lexical coverage measurement. We measured lexical coverage both at the token level (counting duplicate words separately) and at the type level (counting duplicate words once). A type-level count always gives a lower lexical coverage because the words that are not covered by the lexicon are generally lower-frequency words. The lexical coverage (both type and token counts) of the CELEX lexicon on the MAREC subcorpus can be found in Table 1.

Table 1: Lexical coverage of the CELEX wordform lexicon on the MAREC subcorpus, both measured strictly and leniently (disregarding single characters, numerals and chemical formulae), and both at the type level and the token level.

  CELEX-MAREC strict type coverage     55.3%
  CELEX-MAREC lenient type coverage    60.4%
  CELEX-MAREC strict token coverage    95.9%
  CELEX-MAREC lenient token coverage   98.8%

The strict token coverage in Table 1 (95.9%) is the percentage that can be compared to the token coverage reported in the CELEX documentation for the COBUILD corpus (92%, see above). These percentages are comparable, with the MAREC subcorpus giving a slightly higher coverage than the COBUILD corpus. Unfortunately, we cannot compare the type coverages of the CELEX lexicon on the two corpora because we do not know the type coverage of the CELEX lexicon on the COBUILD corpus.

If we look at the top-frequency tokens from MAREC that are not in the CELEX lexicon, we see that the first 26 of these are numerals (which we excluded in our lenient approach). If we disregard these, the ten most frequent tokens are: indicia, U-shaped, cross-section, cross-sectional, flip-flop, L-shaped, spaced-apart, thyristor, cup-shaped, and V-shaped. (This small set of terms shows that hyphenation is a productive and frequent phenomenon in patent claims. For that reason, the AEGIR grammar is equipped with a set of rules that accurately analyze different types of compositional hyphenated forms. In this paper, we will not go into specifics on this subject; it will be covered in future work.)

The lexical coverage of the CELEX lexicon on the MAREC corpus compared to the COBUILD corpus shows that patent claims do not use many words that are not covered by a lexicon of general English. The next three subsections should make clear what vocabulary differences do exist between patent claims and general English language use.
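The case-sensitive matching rule and the strict/lenient coverage measurement described above can be sketched as follows. This is our own approximation, not the AGFL corpus filter itself, and the regular expression for 'special tokens' is a crude stand-in for the actual filter.

```python
import re

# Crude stand-in for the filter on single characters, numerals and
# chemical-formula-like tokens (e.g. "H2O", "NaCl").
SPECIAL = re.compile(r"^(?:.|\d+(?:[.,]\d+)*|(?:[A-Z][a-z]?\d*)+)$")

def covered(token: str, lexicon: set) -> bool:
    """A lowercase lexicon entry matches both cased variants of the token
    (sentence-initial capitals); an entry containing uppercase letters
    only matches identically cased tokens."""
    return token in lexicon or token.lower() in lexicon

def coverage(tokens, lexicon: set, lenient: bool = False) -> float:
    """Token-level coverage; for type-level coverage, pass set(tokens)."""
    toks = [t for t in tokens if not (lenient and SPECIAL.match(t))]
    return sum(covered(t, lexicon) for t in toks) / len(toks)

lexicon = {"the", "method", "of", "claim", "Smith"}
print(coverage("The method of claim 9".split(), lexicon))        # 0.8
print(coverage("The method of claim 9".split(), lexicon, True))  # 1.0
```

Note how "The" is covered through its lowercase lexicon entry, while a corpus token "smith" would not be covered by the capitalized entry "Smith", exactly as the matching rule above requires.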
Frequent words

We extracted a word frequency list from the MAREC subcorpus. An overview of the 20 most frequent words in both the MAREC subcorpus and the BNC already shows remarkable differences (Table 2). The counts are normalized to the relative frequency per 10,000 words of running text.

Table 2: The 20 most frequent tokens in the MAREC subcorpus with their relative frequencies per 10,000 words of patent claims, and the 20 most frequent tokens in the BNC with their relative frequencies per 10,000 words of BNC texts.

  MAREC claims                BNC
  freq./10,000  token         freq./10,000  token
  674           the           715           the
  480           a             376           of
  457           said          303           and
  450           of            266           to
  278           and           206           in
  261           to            202           a
  158           in            129           is
  128           claim         120           that
  124           wherein       87            it
  121           for           86            for
  115           is            81            be
  102           an            70            on
  101           first         68            with
  100           means         67            are
  90            second        63            by
  63            from          62            as
  62            with          57            was
  57            one           57            this
  56            1             55            s
  53            comprising    52            I

Three lexical items in Table 2 need some explanation:

• In patent claims, said is used as a definite determiner referring back to a previously defined entity (AEGIR treats this use of said as an adjective, as we will see later in this section). In everyday English, it could be rephrased as 'the previously mentioned', e.g. "The condensation nucleus counter of claim 6 wherein said control signal further is a function of the differential of said error signal." Said has a strong reference function and can be used for the identification of anaphora in patent texts. The word occurs in 47% of all sentences in the MAREC subcorpus.

• The word wherein is used very frequently in claims for the specification of devices, methods and processes. A brief, prototypical example is "The method of claim 4 wherein n is zero." Wherein occurs in 61% of all sentences in the MAREC subcorpus. If we only consider the 122,925 sentences that are around the median sentence length (21-25 words), as many as 71% contain the word wherein. The frequent use of wherein is strongly connected to the nature and aims of patent claims: to define and specify all characteristics of an invention.

• The same holds for the word comprising, which is frequently used to specify a device or method, e.g. "The heat exchanger of claim 1 further comprising metal warp fibers..."

Table 2 shows a clear difference in the most frequently used words in patent claims (MAREC) compared to general English (the BNC). Thus, when we take the frequency of words into account, the language use in patent texts definitely differs from that found in general English (cf. the previous subsection).

Lexical frequencies for ambiguous words

As explained in Section 3.2, we consulted several resources to obtain lexical frequencies. For the aim of the current paper, it is interesting to analyze the differences between the frequencies obtained from different types of sources. For development and analysis purposes, we obtained lexical frequencies from the following sources: (a) the Penn Treebank [20], (b) the British National Corpus (BNC), (c) 79 million words from the UKWAC web corpus [3], POS-tagged with the TreeTagger, (d) 7 million words of patent claims from the CLEF-IP corpus [23], parsed with the Connexor CFG parser [16], and (e) the 6.9 million words of patent claims from the MAREC corpus, parsed with the AEGIR dependency parser (see Section 3.2). We converted the annotations in each of the corpora to the AEGIR tagset (for some tags this was not possible, for example where there was a many-to-many match between the labels used in a corpus and the labels used in the AEGIR tagset).

We extracted from the AEGIR lexicon the 28,917 wordforms that occur in the lexicon with more than one part of speech (POS) and counted the frequencies of these wordforms for each of the POSs that occur in the corpora.
For each wordform w with parts of speech p_1, ..., p_n in source s, we calculated the relative frequency for each POS p as:

  relfreq(w, p, s) = count(w, p, s) / \sum_{i=1}^{n} count(w, p_i, s)    (1)

We took the average relative frequency over the sources s_1, ..., s_m as:

  avgrelfreq(w, p) = (1/m) \sum_{j=1}^{m} relfreq(w, p, s_j)    (2)

We calculated the average relative frequency (Equation 2) for two sets of sources: Penn/BNC/UKWAC (PBU) on the one hand (representing general English language use), and MAREC/CLEF-IP (MC) on the other hand. We then considered wordforms for which

  avgrelfreq_{w,p,MC} - avgrelfreq_{w,p,PBU} > 0.5    (3)

holds to be typical for patent claims. (The threshold of 0.5 was chosen because a difference higher than 0.5 means that the majority word class for the word differs between the two text types.)

For example, the wordform said with part of speech 'adjective' comes out as typical for patent language, whereas the same word with the part of speech 'verb' is labeled as atypical for patent language. (Interestingly, the Connexor CFG parser labeled only 55% of the occurrences of said in the CLEF-IP corpus as an adjective, and the other occurrences as a verb. We conjecture that these parsing errors are due to the fact that the Connexor parser was not tuned for patent data but for general English.) However, apart from this example it is difficult to draw any conclusions from the output of our lexical frequency analysis. Only 4% of the ambiguous wordforms for which we obtained lexical frequencies are labeled as typical for patent language.

One problem in the identification of typical wordforms is that it is difficult to distinguish between peculiarities caused by a different descriptive model of the parser/tagger used (e.g. one parser may prefer the label 'adjective' over the label 'past participle' for word forms such as closed in a phrase such as 'the closed door') and an actual difference in language use in the corpus (e.g. said as an adjective vs. said as a verb). Most of the examples in the list of typical wordforms are difficult for us to interpret (e.g. adhesive as an adjective is labeled as typical while adhesive as a noun is labeled as atypical). Therefore, and because only a fraction (4%) of the words come out as typical for patent language, we consider the lexical frequencies for ambiguous words to be inconclusive: they do not show a clear difference between patent vocabulary and regular English vocabulary.
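For concreteness, Equations (1) to (3) can be sketched in a few lines of Python. The counts below are invented toy numbers for one wordform and two sources per set, not the real frequencies from the five sources listed above.

```python
from statistics import mean

# counts[source][(wordform, pos)] = frequency; toy numbers, not real data
counts = {
    "PENN":    {("said", "verb"): 95, ("said", "adj"): 5},
    "BNC":     {("said", "verb"): 97, ("said", "adj"): 3},
    "MAREC":   {("said", "verb"): 2,  ("said", "adj"): 98},
    "CLEF-IP": {("said", "verb"): 4,  ("said", "adj"): 96},
}

def relfreq(word, pos, source):
    """Equation (1): count of (word, pos), normalized by the summed
    counts over all POSs of that word in the same source."""
    total = sum(f for (w, p), f in counts[source].items() if w == word)
    return counts[source].get((word, pos), 0) / total if total else 0.0

def avgrelfreq(word, pos, sources):
    """Equation (2): mean relative frequency over a set of sources."""
    return mean(relfreq(word, pos, s) for s in sources)

# Equation (3): typical for patent claims if the difference exceeds 0.5
diff = (avgrelfreq("said", "adj", ["MAREC", "CLEF-IP"])
        - avgrelfreq("said", "adj", ["PENN", "BNC"]))
print("typical for patent language:", diff > 0.5)   # True for these toy counts
```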
Multi-word terms

We include the topic of multi-word terms here because in Section 1 we referred to 'novel terms' (following Shinmori et al. [25]) without distinguishing between single-word terms and multi-word terms. Since we found no difference between the single-word vocabulary in general English and the English used in patent texts (see 'The coverage of general English vocabulary' above), we hypothesize that the authors of patent claims introduce complex multi-word NPs that constitute new (technical) terms.

To verify this, we make use of the SPECIALIST lexicon [6]. According to the developers, this lexicon covers both commonly occurring English words and biomedical vocabulary discovered in the NLM Test Collection and the UMLS Metathesaurus. By using lexical items from a reliable lexicon, we do not rely on syntactic annotation of the corpus; instead, we assume that every occurrence of a word sequence from the lexicon in the corpus is a multi-word term. The SPECIALIST lexicon contains approximately 200,000 compound nouns consisting of two words, 30,000 nouns consisting of three words, and around 10,000 nouns consisting of four or more words. We used these multi-word terms as input for a corpus filter as described earlier in this section. We found that fewer than 2% of the two-word NPs from SPECIALIST occur in the MAREC subcorpus. For the three-word NPs, this percentage is lower than 1%, and for the longer NPs it is negligible. The ten most frequent multi-word NPs from SPECIALIST in the MAREC corpus are carbon atoms, alkyl group, hydrogen atom, amino acid, molecular weight, combustion engine, control device, nucleic acid, semiconductor device and storage means. However, their frequencies are still relatively small. Moreover, the large majority of multi-word terms in patent claims are compositional in the sense that they are formed from two or more lexicon words, combined following regular compositional rules. This means that for the purpose of syntactic parsing, it is not necessary to add these multi-word terms to the parser lexicon.

What does this mean? It seems that lexicalized multi-word NPs (terms from the SPECIALIST lexicon) do not occur very frequently in patent claims. This could be because the topic domains covered by the MAREC subcorpus differ from the domains included in the SPECIALIST lexicon. However, this is not very likely, since we found that at the single-word level the patent domain does not contain many words that are not in the general English vocabulary. We conjecture that patent authors write claims in which they create novel NPs (not captured by terminologies such as SPECIALIST). This is also found by D'hondt (2009), who reports that "these [multi-word] terms are invented and defined ad hoc by the patent writers and will generally not end up in any dictionary or lexicon" [9]. This would confirm the introduction of novel terms by patent authors, but only with respect to multi-word terms.
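The multi-word filtering step amounts to n-gram lookup of lexicon entries in the tokenized corpus. Below is a minimal sketch of that idea under our own simplifying assumptions (lowercase tuple terms, whitespace tokenization); the two-entry toy lexicon is invented, and the real filtering was done with the AGFL corpus filter.

```python
from collections import Counter

def multiword_hits(tokens, terms):
    """Count occurrences of lexicalized multi-word terms in a tokenized
    text by simple n-gram lookup (terms are tuples of lowercase words)."""
    hits = Counter()
    for n in {len(t) for t in terms}:
        for i in range(len(tokens) - n + 1):
            gram = tuple(w.lower() for w in tokens[i:i + n])
            if gram in terms:
                hits[" ".join(gram)] += 1
    return hits

terms = {("carbon", "atoms"), ("amino", "acid")}   # toy subset of SPECIALIST
tokens = "cycloalkyl with 5-7 ring carbon atoms".split()
print(multiword_hits(tokens, terms))               # Counter({'carbon atoms': 1})
```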
4.3 Challenge 3: Syntactic structure

According to international patent writing guidelines, patent claims are built out of noun phrases instead of clauses (see Section 2). This can be problematic for patent processing techniques that are based on syntactic analysis. Syntactic parsers are generally designed to analyze clauses, not noun phrases. This means that if there is a possible interpretation of the input string as a clause, the parser will try to analyze it as such: in case of lexical ambiguity, one of the words will be interpreted as a finite verb where it should be a noun or participle.

An analysis of the output of the baseline version of the AEGIR parser on a subset of the MAREC corpus can provide insight into the challenges relating to syntactic structures that occur in patent claims. To this end, we created a small sample from the complete set of MAREC sentences: a random sample of 100 sentences that are five to nine words in length. The motivation for this short sentence length was twofold: on the one hand we wanted to capture most NP constructions that occur in patent claims, while at the same time we wanted to minimize structural ambiguity. For evaluation purposes, we manually created ground truth dependency analyses for the 100 randomly selected sentences. We found that only 4% of the short sentences are clauses (e.g. "F2 is the preselected operating frequency.").

The ground truth annotations were created by two assessors: both created annotations for 60 sentences, with an overlap of 20 sentences. We measured the inter-annotator agreement by counting the number of identical dependency triplets between the two annotators. Dividing this number by the total number of triplets created by one annotator gives accuracy1; dividing it by the total number of triplets created by the other annotator gives accuracy2. We take the average of the two accuracies as the inter-annotator agreement. (Cohen's Kappa cannot be determined for these data since there exists no chance agreement for the creation of dependency triplets.) This way, we found an inter-annotator agreement of 0.83, which is considered substantial.

For the 20 sentences that were annotated by both assessors, a consensus annotation was agreed upon with the help of a third (expert) assessor. After that, we adapted the annotations of the 80 sentences that had been annotated by one of the two assessors in accordance with the consensus annotation. This resulted in a consistently annotated set of 100 sentences. We used these annotations to evaluate the baseline version of the AEGIR parser. We calculated precision as the number of correct triplets in the AEGIR output divided by the total number of triplets created by AEGIR, and recall as the number of correct triplets in the AEGIR output divided by the number of triplets created by the human assessor.

In order to compare the baseline version of the AEGIR parser to a state-of-the-art dependency parser, we ran the Connexor CFG parser [16] on the same set of short patent claim sentences. We converted the output of the parser to dependency triplets according to the AEGIR descriptive model (a one-to-one conversion was possible to a large extent; the only problematic construction was the phrasal preposition according to, which is treated differently by the Connexor parser and the AEGIR descriptive model) and then evaluated it using the same procedure as described for the AEGIR parser above. The results for both AEGIR and the Connexor parser are in Table 3.

Table 3: Evaluation of the baseline AEGIR parser and the state-of-the-art Connexor CFG parser for a set of 100 short (5-9 words) sentences from the MAREC subcorpus.

                               AEGIR   Connexor CFG
  precision                    0.45    0.71
  recall                       0.50    0.71
  F1-score                     0.47    0.71
  inter-annotator agreement    0.83    0.83

Table 3 shows that the performance of the baseline version of the AEGIR parser on short patent sentences is still moderate, and lower than that of the state-of-the-art Connexor parser.
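The evaluation measures used for Table 3 (triplet precision, recall and F1, and the averaged-accuracy agreement) can be sketched as follows. The triplet sets in the example are invented, not taken from the annotated sample.

```python
def evaluate(system: set, gold: set):
    """Precision, recall and F1 over dependency triplets (exact match),
    as computed for Table 3."""
    correct = len(system & gold)
    p = correct / len(system) if system else 0.0
    r = correct / len(gold) if gold else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1

def agreement(a: set, b: set) -> float:
    """Inter-annotator agreement: the mean of the two directed accuracies
    (shared triplets divided by each annotator's triplet total)."""
    shared = len(a & b)
    return (shared / len(a) + shared / len(b)) / 2

gold = {("method", "PREPof", "claim"), ("claim", "ATTR", "4")}
parser_output = {("method", "PREPof", "n"), ("claim", "ATTR", "4")}
print(evaluate(parser_output, gold))   # (0.5, 0.5, 0.5)

annotator1 = {("method", "PREPof", "claim"), ("claim", "ATTR", "4")}
annotator2 = annotator1 | {("n", "PRED", "zero")}
print(agreement(annotator1, annotator2))   # (2/2 + 2/3) / 2 = 0.83
```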
The errors made by AEGIR can provide valuable insights into the peculiarities of patent language. The most frequent parsing mistakes made by AEGIR are (1) the wrong choice of the head of a dependency relation (e.g. [9,ATTR,claim] for "claim 9") and (2) incorrect attachment of postmodifiers in NPs. For example, for the sentence "The method of claim 4 wherein n is zero.", the parser incorrectly generates [method,PREPof,n] instead of [method,PREPof,claim], and it labels wherein as a modifier of n: [n,MOD,X:wherein]. The former of these errors is repeated frequently in the data: the regular expression "claim [0-9]+" occurs in 96% of the sentences in the MAREC subcorpus. The latter case (ambiguities caused by postmodifier attachment) is known to be problematic for syntactic parsing. In patent claims, however, the problem is even more frequent than in general English because the NPs in patent claims are often very long (recall the median sentence length of 22 words). This brings us back to the central syntactic challenge mentioned several times in this paper: patent claims are composed of NPs instead of clauses.

In order to find other syntactic differences between patent claims and general English, we plan to evaluate the baseline version of the AEGIR parser on a set of sentences from the BNC and compare the outcome to the results obtained for MAREC sentences (Table 3). Of course, we can expect some problems when we run a parser that is being developed specifically for patent texts on BNC data, such as the generation of the triplet [Betty,ATTR,said] for the last two words of '"Oh, that is sad," said Betty.'

5. CONCLUSIONS AND FUTURE WORK

We have analyzed three challenges of patent claim processing that are mentioned in the literature: (1) the sentences are much longer than in general language use; (2) many novel terms are introduced in patent claims that are difficult to understand; and (3) the structure of the patent claims is complex, as a result of which syntactic parsers fail to correctly analyze patent claims. Where possible, we supported our analyses with quantifications of the findings, using a number of (patent and non-patent) corpora and NLP tools.

With respect to (1), we verified that sentences in English patent claims are longer than in general English, even if we split the claims not only on full stops but also on semicolons. The median sentence length in the MAREC subcorpus is 22 words; the average length is 53 words.

With respect to (2), we performed a number of analyses related to the vocabulary of patent claims. We found that at the level of single words, not many novel terms are introduced by patent authors. Instead, they tend to use words from the general English vocabulary, as demonstrated by a token coverage of 96% of the CELEX lexicon on the MAREC subcorpus. However, the frequency distribution of words in patent claims does differ from that in general English, as can be seen especially from the lists of top-frequency words from MAREC and the BNC. Moreover, it seems that the authors of patent claims do introduce novel terms, but only at the multi-word level: we found that the lexicalized multi-word terms from the SPECIALIST lexicon have low frequencies in the MAREC subcorpus.

With respect to (3), we parsed 10,000 claims sections from the MAREC subcorpus using the baseline version of the AEGIR dependency parser and we performed a manual evaluation of the parser output for 100 short sentences from the corpus.
We can confirm that syntactic parsing for patent claims is a challenge, especially because the claims consist of sequences of noun phrases instead of clauses, while syntactic parsers are designed for analyzing clauses. As a result, the parser will try to label at least one word in the sentence as a finite verb. In conclusion, we can say that the challenges of patent claim processing that are related to syntactic structure are even more problematic than the challenges at the vocabulary level. The sentence length issue only causes problems indirectly, by resulting in more structural ambiguities for longer noun phrases.

In the near future, we will further develop the AEGIR dependency parser into a hybrid parser (hybrid in the sense that it combines rule-based and probabilistic information) that incorporates information on the frequencies of dependency triplets. These frequencies (which are stored in the triplet database that is connected to AEGIR) guide the resolution of structural ambiguities. For example, the information that 'carbon atoms' is a highly frequent NP with the structure [atom,ATTR,carbon] guides the disambiguation of a complex NP such as "cycloalkyl with 5-7 ring carbon atoms substituted by a member selected from the group consisting of amino and sulphoamino" (taken from the MAREC subcorpus), which contains many structural ambiguities. The same holds for the frequent error [9,ATTR,claim] that we mentioned in Section 4.3. Given the high frequency of this error type, it is relatively easy to solve using triplet frequencies.

In order to collect reliable frequency information on dependency relations, we use a bootstrap process. As the starting point of the bootstrap, we use reliably annotated corpora for general English such as the Penn Treebank [20] and the British National Corpus (BNC) [7]. We then use parts of patent corpora such as MAREC and CLEF-IP [23], which we annotate syntactically using automatic parsers. Moreover, we harvest terminology lists and thesauri such as the biomedical thesaurus UMLS [4], which contain many multi-word NPs and can therefore provide us with reliable ATTR relations (such as [atom,ATTR,carbon]). The addition of this information allows us to tune the AEGIR parser specifically to the language used in patent texts. We expect that a number of the parsing problems described in this paper will be solved by incorporating frequency information that is extracted from patent data. To what extent this will be successful is to be seen from the further development and evaluation of the AEGIR parser.
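As an illustration of how stored triplet frequencies could guide disambiguation in the planned hybrid parser, consider the following sketch. The database contents, the log-frequency scoring and the candidate analyses are entirely invented; the actual AEGIR triplet database and its scoring are still under development.

```python
from math import log

# Invented triplet frequencies, as might be harvested from parsed
# corpora and thesauri during the bootstrap process described above.
triplet_db = {
    ("atom", "ATTR", "carbon"): 5321,
    ("claim", "ATTR", "9"): 4810,
    ("9", "ATTR", "claim"): 2,
}

def score(analysis):
    """Sum of log-frequencies of the triplets in a candidate analysis;
    unseen triplets contribute zero."""
    return sum(log(1 + triplet_db.get(t, 0)) for t in analysis)

candidates = [
    [("9", "ATTR", "claim")],    # the frequent baseline error
    [("claim", "ATTR", "9")],    # the intended analysis of "claim 9"
]
print(max(candidates, key=score))   # [('claim', 'ATTR', '9')]
```

Under this scheme, the high corpus frequency of [claim,ATTR,9]-type triplets would outweigh the erroneous head choice, which is why the [9,ATTR,claim] error type should be relatively easy to solve.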
6. ACKNOWLEDGMENTS

The TM4IP project is funded by Matrixware.

7. REFERENCES

[1] N. Akers. The European Patent System: an introduction for patent searchers. World Patent Information, 21(3):135-163, 1999.
[2] R. Baayen, R. Piepenbrock, and H. van Rijn. The CELEX Lexical Database (CD-ROM). Linguistic Data Consortium, University of Pennsylvania, Philadelphia, USA, 1993.
[3] M. Baroni, S. Bernardini, A. Ferraresi, and E. Zanchetta. The WaCky wide web: a collection of very large linguistically processed web-crawled corpora. Language Resources and Evaluation, 43(3):209-226, 2009.
[4] O. Bodenreider. The unified medical language system (UMLS): integrating biomedical terminology. Nucleic Acids Research, 32:D267-D270, 2004.
[5] D. Bonino, A. Ciaramella, and F. Corno. Review of the state-of-the-art in patent information and forthcoming evolutions in intelligent patent informatics. World Patent Information, 32:30-38, 2010.
[6] A. Browne, A. McCray, and S. Srinivasan. The Specialist Lexicon. National Library of Medicine Technical Reports, pages 18-21, 2000.
[7] L. Burnard. Users reference guide for the British National Corpus. Technical report, Oxford University Computing Services, 2000.
[8] L. Chen, N. Tokuda, and H. Adachi. A patent document retrieval system addressing both semantic and syntactic properties. In Proceedings of the ACL-2003 workshop on Patent corpus processing, pages 1-6, Morristown, NJ, USA, 2003. Association for Computational Linguistics.
[9] E. D'hondt. Lexical Issues of a Syntactic Approach to Interactive Patent Retrieval. In Proceedings of the 3rd BCS IRSG Symposium on Future Directions in Information Access, pages 102-109, 2009.
[10] E. D'hondt, S. Verberne, N. Oostdijk, and L. Boves. Re-ranking based on Syntactic Dependencies in Prior-Art Retrieval. In Proceedings of the Dutch-Belgium Information Retrieval Workshop 2010, 2010. To appear.
[11] E. Escorsa, M. Giereth, Y. Kompatsiaris, S. Papadopoulos, E. Pianta, G. Piella, I. Puhlmann, G. Rao, M. Rotard, P. Schoester, L. Serafini, and V. Zervaki. Towards content-oriented patent document processing. World Patent Information, 30(1):21-33, 2008.
[12] A. Fujii, M. Iwayama, and N. Kando. Introduction to the special issue on patent processing. Information Processing and Management, 43(5):1149-1153, 2007.
[13] E. Graf and L. Azzopardi. A methodology for building a test collection for prior art search. In Proceedings of the 2nd International Workshop on Evaluating Information Access (EVIA), pages 60-71, 2008.
[14] M. Iwayama, A. Fujii, N. Kando, and Y. Marukawa. Evaluating patent retrieval in the third NTCIR workshop. Information Processing and Management, 42(1):207-221, 2006.
[15] M. Iwayama, A. Fujii, N. Kando, and A. Takano. Overview of patent retrieval task at NTCIR-3. In Proceedings of the third NTCIR workshop on research in information retrieval, automatic text summarization and question answering, 2003.
[16] T. Jarvinen and P. Tapanainen. Towards an implementable dependency grammar. In Proceedings of COLING-ACL, volume 98, pages 1-10, 1998.
[17] C. Koster and J. Beney. Phrase-Based Document Categorization Revisited. In Proceedings of the 2nd International Workshop on Patent Information Retrieval (PaIR'09), 2009.
[18] C. Koster, N. Oostdijk, S. Verberne, and E. D'hondt. Challenges in Professional Search with PHASAR. In Proceedings of the Dutch-Belgium Information Retrieval workshop, 2009.
[19] G. Leech. 100 million words of English: the British National Corpus (BNC). Language Research, 28(1):1-13, 1992.
[20] M. Marcus, B. Santorini, and M. Marcinkiewicz. Building a large annotated corpus of English: The Penn Treebank. Computational Linguistics, 19(2):313-330, 1994.
[21] H. Mase, T. Matsubayashi, Y. Ogawa, M. Iwayama, and T. Oshio. Proposal of two-stage patent retrieval method considering the claim structure. ACM Transactions on Asian Language Information Processing (TALIP), 4:190-206, 2005.
[22] S. Mille and L. Wanner. Making text resources accessible to the reader: The case of patent claims. In Proceedings of the International Language Resources and Evaluation Conference (LREC), pages 1393-1400, Marrakech, Morocco, 2008.
[23] G. Roda, J. Tait, F. Piroi, and V. Zenz. CLEF-IP 2009: retrieval experiments in the Intellectual Property domain. In CLEF working notes 2009, pages 1-16, 2009.
[24] S. Sheremetyeva. Natural language analysis of patent claims. In Proceedings of the ACL-2003 workshop on Patent corpus processing, pages 66-73, 2003.
[25] A. Shinmori, M. Okumura, Y. Marukawa, and M. Iwayama. Patent claim processing for readability: structure analysis and term explanation. In Proceedings of the ACL-2003 workshop on Patent corpus processing - Volume 20, page 65. Association for Computational Linguistics, 2003.
[26] Y. Tseng, C. Lin, and Y. Lin. Text mining techniques for patent analysis. Information Processing and Management, 43(5):1216-1247, 2007.
[27] L. Wanner, R. Baeza-Yates, S. Brugmann, J. Codina, B. Diallo, E. Escorsa, M. Giereth, Y. Kompatsiaris, S. Papadopoulos, E. Pianta, et al. Towards content-oriented patent document processing. World Patent Information, 30(1):21-33, 2008.