PDF hosted at the Radboud Repository of the Radboud University Nijmegen.
The following full text is an author's version which may differ from the publisher's version.
For additional information about this publication, follow this link: http://hdl.handle.net/2066/84168
Please be advised that this information was generated on 2017-06-17 and may be subject to change.
Quantifying the Challenges in Parsing Patent Claims

Suzan Verberne*
s.verberne@let.ru.nl

Eva D'hondt†
e.dhondt@let.ru.nl

Nelleke Oostdijk
n.oostdijk@let.ru.nl

Cornelis H.A. Koster‡
kees@cs.ru.nl
ABSTRACT
In this paper, we aim to verify and quantify the challenges of patent claim processing that have been identified in the literature. We focus on the following three challenges that, judging from the numbers of mentions in papers concerning patent analysis and patent retrieval, are central to patent claim processing: (1) sentences are much longer than in general language use; (2) many novel terms that are difficult to understand are introduced in patent claims; (3) the syntactic structure of patent claims is complex. We find that the challenges of patent claim processing that are related to syntactic structure are much more problematic than the challenges at the vocabulary level. The sentence length issue only causes problems indirectly, by resulting in more structural ambiguities for longer noun phrases.
Keywords
Patent Claim Processing, Challenges in Patent Search, Vocabulary Issues, Syntactic Parsing
1. INTRODUCTION
Patent retrieval is a rising research topic in the Information Retrieval (IR) community. One of the most salient search tasks performed on patent databases is prior art retrieval. The task of prior art retrieval is: given a patent application, find existing patent documents that describe inventions which are similar or related to the new application. For every patent application that is filed at the European Patent Office, prior art retrieval is performed by qualified patent examiners. Their goal is to determine whether the claimed invention fulfills the criterion of novelty compared to earlier similar inventions [1].
In its classic set-up, prior art searching involves a large amount of human effort: through careful examination of potential keywords in the patent application, the patent examiner composes a query and retrieves a set of documents.
*Information Foraging Lab/Centre for Language and Speech Technology, Radboud University Nijmegen and Research Group in Computational Linguistics, University of Wolverhampton
†Information Foraging Lab/Centre for Language and Speech Technology, Radboud University Nijmegen
‡Information Foraging Lab/Computing Science Institute, Radboud University Nijmegen
Copyright is held by the author/owner(s).
1st International Workshop on Advances in Patent Information Retrieval (AsPIRe'10), March 28, 2010, Milton Keynes.
The retrieved documents are then analyzed one by one to judge their relevance. From the relevant documents, new keywords are added to the query and the process is repeated until relevant information has been found or the search possibilities have been exhausted. Since professional searchers are expensive, it is worthwhile investigating how the prior art searching process can be facilitated by retrieval engines. Previous work suggests that for prior art search, the claims section is the most informative part of a patent, but it is also the most difficult to parse [12, 25, 14, 13].
Among the language processing tasks that can support the patent search and analysis process are term extraction, summarization and translation [27]. In order to perform these tasks (semi-)automatically, at least sentence splitting and morphological analysis are needed, but in many cases also some form of syntactic parsing. Existing natural language parsers may fail to properly analyze patent claims because the language used in patents differs from the 'regular' English language for which the tools have been developed [25]. Patent claims have a fixed structure: they consist of one long sentence, starting with "We claim:" or "What is claimed is:", followed by item lists ('series of specified elements'¹), which are realized by noun phrases. The terminology used in patent claims is highly dependent on the specific topic domain of the patent (e.g. mechanical engineering).
The challenges related to patent claim processing have been identified by a number of researchers in the patent retrieval field (see Section 2), but these studies lack any kind of quantification of the challenges: most of them do not provide statistics on sentence length, sentence structure, lexical distributions and the differences between the language used in patent claims and the language used in large non-patent corpora.
In this paper, we aim to verify and quantify the challenges of patent claim processing that have been identified in the literature. We focus on the three challenges that are listed in the often-cited paper by Shinmori et al. (2003) [25] about patent claim processing for Japanese:²
1. The length of the sentences is much longer than for general language use;
2. Many novel terms are introduced in patent claims that are difficult to understand;
¹http://en.wikipedia.org/wiki/Claim_(patent)
²The research on patent processing and retrieval has a somewhat longer history in Japan than in Europe and the U.S. because of the patent retrieval track in the NTCIR evaluation campaign [15].
3. The structure of the patent claims is complex.
Consequently, most syntactic parsers — even those that achieve good results on general language texts — fail to correctly analyze patent claims.
We chose these challenges because we think they are central to patent claim processing, which may be concluded from the frequent mentions of these challenges in other papers concerning patent analysis and patent retrieval (see Section 2). We expect that the challenges that Shinmori et al. found for Japanese will also hold for English patent claims. We will verify this in Section 4. In the same section, we will quantify the challenges related to sentence length, vocabulary issues and syntactic structure, using a number of patent and non-patent corpora and NLP tools.
First, in Section 2 we provide a background for the current paper. Then, in Section 3 we describe the data that we used.
2. BACKGROUND: PATENT PROCESSING
In this section, we discuss previous work on patent processing. The papers that we discuss here stress the complexity of the language used in patents, especially in the claims sections. Most of the work is directed at facilitating human patent processing, in many cases by improving the readability of patent texts.
Bonino et al. (2010) explain that in patent searching, both recall and precision are highly important [5]. Because of legal repercussions, no relevant information should be missed. On the other hand, retrieving fewer (irrelevant) documents makes the search process more efficient. In order to have full control over precision and recall, patent search professionals generally employ an iterative search process. This process can be supported by NLP tasks such as query synonym expansion (which is already commonly used in patent text searches), sentence focus identification and machine translation.
Mille and Wanner (2008) stress that of all sections in a patent document, the claims section is the most difficult to read for human readers [22]. This is especially due to the fact that, in accordance with international patent writing guidelines, each claim must consist of one single sentence. Mille and Wanner mention similar challenges to the ones listed by Shinmori et al. (2003): sentence length, terminology and syntactic structure. However, they describe the terminology challenge not as an issue of understanding complex terms (as Shinmori does [25]) but as the problem of 'abstract vocabulary', which is not further specified in their paper.
In their introduction to the special issue on patent processing, Fujii et al. (2007) state that from a legal point of view, the claims section of a patent is the most important [12]. They describe the language used in patent claims as a very specific sublanguage and state that specialized NLP methods are needed for analyzing and generating patent claims.
Wanner et al. (2008) describe their advanced patent processing service PATExpert [27]. PATExpert is aimed at facilitating patent analysis through the use of knowledge bases (ontologies) and a set of NLP techniques such as tokenizers, lemmatizers, taggers and syntactic parsers. Moreover, it offers a paraphrasing module which performs a two-step simplification of the text: (1) splitting the text into smaller units, taking into account its discourse structure, and (2) transforming the smaller units into easily understandable clauses with the use of 'predefined well-formedness criteria'.
Tseng et al. (2007) experiment with a number of text mining techniques for patent analysis that are related to the analytical procedures applied by professional searchers to patent texts [26]. They perform automatic summarization using text surface features (such as position and title words). Moreover, they extend the Porter stemming algorithm and an existing stopword list, both focusing on the specifics of patent language. Tseng et al. identify the extraction of key-phrases as one of the main challenges in patent claim analysis because "single words alone are often too general in meanings or ambiguous to represent a concept". This relates to the 'abstract vocabulary' problem identified by Mille and Wanner (see above). Tseng et al. find that multi-word strings that are repeated throughout a patent are good key-phrases and likely to be legal terms.
Finally, Sheremetyeva (2003) uses predicate-argument structures to improve the readability of the claims section [24]. In her system, readability improvement is the first step in a suggested patent summarization method.
All of the papers mentioned in this section use some form of NLP to facilitate patent analysis by humans. In the IR field, however, patent retrieval is generally addressed as a text retrieval task that only uses word-level information without deeper linguistic processing. Academic research on patent retrieval has mainly focused on the relative weighting of the index terms and on exploiting the patent document structure to boost retrieval [21]. For an overview of the state of the art in academic and commercial patent retrieval systems, we refer to Bonino et al. (2010) [5].
A small number of approaches to patent retrieval use linguistic processing to improve retrieval. The systems developed by Escora et al. (2008) and Chen et al. (2003) perform a combination of syntactic and semantic analysis on the documents [11, 8]. The work described by Koster et al. (2009) and D'hondt et al. (2010) aims at developing an interactive patent retrieval engine that uses dependency relations as index and search terms [18, 10]. In order to generate these dependency relations, a syntactic parser is developed that is especially adapted to analyzing patent texts. We will come back to this parser in Section 3.2.
3. DATA
For the experiments reported in this paper, we use the subset of 400,000 documents of the MAtrixware REsearch Collection (MAREC) that was supplied by MatrixWare³ for use in the AsPIRe'10 workshop. In the remainder of this paper, we will refer to this corpus of 400,000 patents as the 'MAREC subcorpus'.
3.1 Preprocessing the corpus
Since the aim of the current paper is to quantify the challenges of parsing patent claims, we first extracted the claims sections from the MAREC subcorpus, disregarding the other fields of the XML documents. Moreover, as we are developing techniques for mining English patent texts, we are only interested in those patents that are written in English.
Using a Perl script, we extracted all English claims sections (marked with <claims lang="EN">) from the directory tree of the MAREC subcorpus and removed the XML markup. This resulted in 67,292 claims sections⁴ with 56,117,443 words in total.
Having extracted and cleaned up all claims sections, we used a sentence splitter to split the claims sections into smaller units. As pointed out by [25], sentence splitting for claims sections is not a trivial task: many sentences have been glued together using semicolons (;). We therefore decided to use not only full stops as split characters in our sentence splitter, but also semicolons.
We found that the 67,292 claims sections consist of 1,051,040 sentences.⁵
³http://www.matrixware.com/
⁴The other documents in the subcorpus either do not contain a claims section or are in a language other than English.
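The extraction and splitting steps described above can be sketched in a few lines. This is a hypothetical re-implementation (the paper used a Perl script whose code is not given), and the regex-based markup removal is a simplification, not a full XML parser:

```python
import re

def extract_english_claims(xml_text):
    """Pull the text of every <claims lang="EN"> element and strip the
    remaining XML markup (a rough sketch, not a full XML parser)."""
    sections = re.findall(r'<claims lang="EN">(.*?)</claims>', xml_text, re.DOTALL)
    return [re.sub(r"<[^>]+>", " ", s).strip() for s in sections]

def split_sentences(text):
    """Split a claims section on full stops *and* semicolons,
    following the splitting decision described above."""
    return [p.strip() for p in re.split(r"[.;]", text) if p.strip()]
```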
3.2 Parsing the corpus
In order to assess and quantify the third challenge listed in Section 1 (the complex syntactic structure of patent claims), we need a syntactic analysis of the MAREC subcorpus. To this end, we use the baseline version of the syntactic parser that is under development in the 'Text Mining for Intellectual Property' (TM4IP) project [18]. The aim of this project is to develop an approach to interactive retrieval for patent texts, in which dependency triplets instead of single words are used as indexing terms.
In the TM4IP project, a dependency triplet has been defined as two terms that are syntactically related through one of a limited set of relators (SUBJ, OBJ, PRED, MOD, ATTR, ...), where a term is usually the lemma of a content word [10]. For example, the sentence
"The system consists of four separate modules"
will be analyzed into the following set of dependency triplets:
[system,SUBJ,consist] [consist,PREPof,module] [module,ATTR,separate] [module,QUANT,four]
Using dependency triplets as indexing terms in a classification experiment, Koster and Beney (2009) have recently achieved good results for classifying patent applications into their correct IPC classes [17].
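As an illustration, the triplet analysis shown above can be represented as a simple data structure; the `Triplet` type and the serialization format below are our own hypothetical choices, not part of the TM4IP specification:

```python
from collections import namedtuple

# A dependency triplet: two content-word lemmas linked by a relator
# (SUBJ, OBJ, PRED, MOD, ATTR, ...), as defined in the TM4IP project.
Triplet = namedtuple("Triplet", ["head", "relator", "dependent"])

# The parser output for "The system consists of four separate modules",
# transcribed by hand from the example in the text:
triplets = [
    Triplet("system", "SUBJ", "consist"),
    Triplet("consist", "PREPof", "module"),
    Triplet("module", "ATTR", "separate"),
    Triplet("module", "QUANT", "four"),
]

def as_index_terms(triplets):
    """Serialize triplets into strings usable as indexing terms."""
    return ["{}|{}|{}".format(t.head, t.relator, t.dependent) for t in triplets]
```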
The dependency parser that generates the triplets is called AEGIR ('Accurate English Grammar for Information Retrieval'). In its baseline version, AEGIR is a rule-based dependency parser that combines a set of hand-written rules with an extensive lexicon.
The resolution of lexical ambiguities is guided by lexical frequency information stored in the parser lexicon. These lexical frequencies provide information on the possible parts of speech that can be associated with a particular word form. For example, in general English, we can expect zone as a noun to have a higher frequency than zone as a verb. For the current paper, we collected lexical frequency information from a number of different sources in order to examine the lexical differences between the English language use in patent claims and the language use in different contexts. We will come back to this in Section 4.2.
For the current paper, we decided to parse 10,000 of the 67,292 English patent claims in the MAREC subcorpus. These 10,000 claims contain a total of 6.9 million words. Sentencing these claims using the sentence splitter described in Section 3.1 results in 207,946 sentences.
⁵Recall from Section 1 that patent claims are composed of noun phrases (NPs), not clauses. In the remainder of this paper, we use the word 'sentences' to refer to the units (mostly NPs) that are separated by semicolons and full stops in patent claims. We use the term 'noun phrase (NP)' if we refer to the syntactic characteristics of such units.
Figure 1: Distribution of sentence lengths in the MAREC subcorpus, compared to the BNC (percentage of sentences per sentence length; logarithmic x-axis).
4. VERIFYING AND QUANTIFYING PATENT CLAIM CHALLENGES
The three challenges of patent claim processing mentioned in Section 1 are: (1) the length of the sentences is much longer than for general language use; (2) many novel terms are introduced in patent claims that are difficult to understand; and (3) the structure of the patent claims is complex, as a result of which syntactic parsers fail to correctly analyze patent claims. In the following subsections 4.1, 4.2 and 4.3 we perform a series of analyses and experiments in order to verify and quantify these three challenges.
4.1 Challenge 1: Sentence length
After splitting the MAREC subcorpus into sentences (see Section 3.1), we extracted the following sentence-level statistics from the corpus. As already reported in the previous section, the 67,292 claims sections of the MAREC-400000 subcorpus consist of 1,051,040 sentences. There is much overlap between the sentences: after removing duplicates, 580,866 unique sentences remain. The median sentence length is 22 words; the average length is 53 words.
Binning the sentences from MAREC with the same length together and counting the number of sentences in each group results in a long tail distribution. The peak of the distribution lies around 20 words (25,000 occurrences), with outliers for sentence lengths 3 (20,637 occurrences) and 5 (32,849 occurrences). In Figure 1, the MAREC sentence length distribution is compared to the sentence length distribution of the British National Corpus (BNC) [19], which we preprocessed using the same sentence splitter as we used on the MAREC subcorpus.
Figure 1 shows that sentences in MAREC are, as the literature suggests, longer than the sentences in the BNC (the early peak is the BNC, the later peak is MAREC), even if we use the semicolon for sentence splitting in addition to the full stop.
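The sentence-length statistics reported above (median, average, and the binned distribution plotted in Figure 1) can be computed with a short sketch like the following (the function name and interface are our own):

```python
from collections import Counter
from statistics import mean, median

def length_distribution(sentences):
    """Return (median, mean, Counter) of sentence lengths in words, where
    the Counter bins together all sentences with the same length."""
    lengths = [len(s.split()) for s in sentences]
    return median(lengths), mean(lengths), Counter(lengths)
```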
4.2 Challenge 2: Vocabulary
Shinmori et al. (2003) state that many novel terms are used in Japanese patent claims. We performed four types of analysis on the vocabulary level to verify this for English patent claims: (1) a lexical coverage test of single-word terms from a lexicon of general English on the MAREC subcorpus, (2) an overview of the most frequent words in the MAREC subcorpus compared to the BNC, (3) frequency counts on ambiguous lexical items (as introduced in Section 3.2) and (4) an analysis of multi-word terms in the MAREC corpus.

Table 1: Lexical coverage of the CELEX wordform lexicon on the MAREC subcorpus, both measured strictly and leniently (disregarding single characters, numerals and chemical formulae), and both on the type level and the token level.

CELEX-MAREC strict type coverage      55.3%
CELEX-MAREC lenient type coverage     60.4%
CELEX-MAREC strict token coverage     95.9%
CELEX-MAREC lenient token coverage    98.8%
The coverage of general English vocabulary
In order to quantify the differences between the vocabulary used in patent claims and general English vocabulary, we performed a lexical coverage test of the CELEX lexical database [2] on the MAREC subcorpus. The CELEX file EMW.CD contains 160,568 English word forms that are supposed to cover general English vocabulary: according to the CELEX readme file⁶, the lexicon contains the word forms derived from all lemmata in the Oxford Advanced Learner's Dictionary (1974) and the Longman Dictionary of Contemporary English (1978). The CELEX documentation reports that on the 17.9 million word corpus of Birmingham University/COBUILD, the token coverage of CELEX is 92%.
We measured the coverage of CELEX entries on the MAREC subcorpus using a so-called corpus filter written in AGFL.⁷ A corpus filter takes as input a corpus in plain text and a wordform lexicon. The corpus text is split up into tokens. These are matched to the lexicon using a smart form of matching with respect to capitalization: if a word is in the lexicon in lowercase, then it may match both an uppercase and a lowercase variant in the corpus. If a word in the lexicon has one or more uppercase letters, then it only matches equally uppercased forms in the corpus. This accommodates sentence-initial capitalization in the corpus for lowercase lexicon forms such as the, while it prevents proper names from the lexicon from being matched to common nouns in the corpus.
Moreover, the corpus filter allows us to skip over special tokens such as single characters, numerals and formulae. If we disregard these special tokens, we get a more lenient lexical coverage measurement. We measured lexical coverage both on the token level (counting duplicate words separately) and the type level (counting duplicate words once). A type-level count always gives a lower lexical coverage because the words that are not covered by the lexicon are generally lower-frequency words. The lexical coverage (both type and token counts) for the CELEX lexicon on the MAREC subcorpus can be found in Table 1.
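A minimal sketch of such a corpus filter, assuming the capitalization rule described above; this is our own Python approximation of the AGFL tool, and its lenient mode only skips single characters and numerals (the formula detection of the real filter is omitted):

```python
def matches_lexicon(token, lexicon):
    """Capitalization-aware match: a lowercase lexicon entry also matches
    capitalized corpus variants (e.g. sentence-initial 'The'), while a
    lexicon entry containing uppercase letters must match exactly."""
    return token in lexicon or token.lower() in lexicon

def coverage(tokens, lexicon, lenient=False):
    """Return (type_coverage, token_coverage) of `lexicon` on `tokens`.
    Lenient mode disregards single characters and numerals."""
    if lenient:
        tokens = [t for t in tokens
                  if len(t) > 1 and not t.replace(",", "").replace(".", "").isdigit()]
    token_cov = sum(matches_lexicon(t, lexicon) for t in tokens) / len(tokens)
    types = set(tokens)
    type_cov = sum(matches_lexicon(t, lexicon) for t in types) / len(types)
    return type_cov, token_cov
```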
The strict token coverage in Table 1 is the percentage (95.9%) that can be compared to the token coverage reported by the CELEX documentation on the COBUILD corpus (92%, see above). We can see that these percentages are comparable, the MAREC subcorpus giving a slightly higher coverage than the COBUILD corpus. Unfortunately, we cannot compare the type coverages of the CELEX lexicon for the two corpora because we do not know the type coverage of the CELEX lexicon on the COBUILD corpus.
⁶http://www.ldc.upenn.edu/Catalog/readme_files/celex.readme.html
⁷http://www.agfl.cs.ru.nl/
If we look at the top-frequency tokens from MAREC that are not in the CELEX lexicon, we see that the first 26 of these are numerals (which we excluded in our lenient approach). If we disregard these, the ten most frequent tokens are: indicia, U-shaped, cross-section, cross-sectional, flip-flop, L-shaped, spaced-apart, thyristor, cup-shaped, and V-shaped.⁸
The lexical coverage of the CELEX lexicon on the MAREC corpus compared to the COBUILD corpus shows that patent claims do not use many words that are not covered by a lexicon of general English. The next three subsections should make clear what vocabulary differences do exist between patent claims and general English language use.
Frequent words
We extracted a word frequency list from the MAREC subcorpus. An overview of the 20 most frequent words in both the MAREC subcorpus and the BNC already shows remarkable differences (Table 2). The counts are normalized to the relative frequency per 10,000 words of running text. Three lexical items in Table 2 need some explanation:
• In patent claims, said is used as a definite determiner referring back to a previously defined entity.⁹ In everyday English, it could be rephrased as 'the previously mentioned', e.g. "The condensation nucleus counter of claim 6 wherein said control signal further is a function of the differential of said error signal." Said has a strong reference function and can be used for the identification of anaphora in patent texts. The word occurs in 47% of all sentences in the MAREC subcorpus.
• The word wherein is used very frequently in claims for the specifications of devices, methods and processes. A brief, prototypical example is "The method of claim 4 wherein n is zero." Wherein occurs in 61% of all sentences in the MAREC subcorpus. If we only consider the 122,925 sentences that are around median sentence length (21-25 words), even 71% contain the word wherein. The frequent use of wherein is strongly connected to the nature and aims of patent claims: to define and specify all characteristics of an invention.
• The same holds for the word comprising, which is frequently used to specify a device or method, e.g. "The heat exchanger of claim 1 further comprising metal warp fibers..."
⁸This small set of terms shows that hyphenation is a productive and frequent phenomenon in patent claims. For that reason, the AEGIR grammar is equipped with a set of rules that accurately analyse different types of compositional hyphenated forms. In this paper, we will not go into specifics on this subject; it will be covered in future work.
⁹AEGIR treats this use of said as an adjective, as we will see later in this section.
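The percentages of sentences containing said and wherein quoted above can be computed with a simple whole-word filter like the following (a sketch; the function name is ours):

```python
import re

def share_containing(sentences, word):
    """Fraction of sentences that contain `word` as a whole token,
    matched case-insensitively (as for 'said', 'wherein', 'comprising')."""
    pattern = re.compile(r"\b%s\b" % re.escape(word), re.IGNORECASE)
    return sum(1 for s in sentences if pattern.search(s)) / len(sentences)
```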
Table 2: The 20 most frequent tokens in the MAREC subcorpus with their relative frequencies per 10,000 words of patent claims, and the 20 most frequent tokens in the BNC with their relative frequencies per 10,000 words of BNC texts.

MAREC claims                BNC
token        freq./10,000   token   freq./10,000
the          674            the     715
a            480            of      376
said         457            and     303
of           450            to      266
and          278            in      206
to           261            a       202
in           158            is      129
claim        128            that    120
wherein      124            it       87
for          121            for      86
is           115            be       81
an           102            on       70
first        101            with     68
means        100            are      67
second        90            by       63
from          63            as       62
with          62            was      57
one           57            this     57
1             56            s        55
comprising    53            I        52
Table 2 shows a clear difference in the most frequently used words in patent claims (MAREC) compared to general English (the BNC). Thus, when we take into account the frequency of words, the language use in patent texts definitely differs from that found in general English (see the previous subsection).
Lexical frequencies for ambiguous words
As explained in Section 3.2, we consult several resources to obtain lexical frequencies. For the aim of the current paper, it is interesting to analyze the differences between the frequencies obtained from different types of sources. For development and analysis purposes, we obtained lexical frequencies from the following sources: (a) the Penn Treebank [20], (b) the British National Corpus (BNC), (c) 79 million words from the UKWAC webcorpus [3], POS-tagged by the TreeTagger, (d) 7 million words of patent claims from the CLEF-IP [23] corpus parsed with the Connexor CFG parser [16], and (e) the 6.9 million words of patent claims from the MAREC corpus parsed with the AEGIR dependency parser (see Section 3.2).
We converted the annotations in each of the corpora to the AEGIR tagset.¹⁰ We extracted from the AEGIR lexicon the 28,917 wordforms that occur in the lexicon with more than one part of speech (POS) and counted the frequencies of the wordforms for each of the POSs that occur in the corpora.
¹⁰For some tags this was not possible, for example where there was a many-to-many match between the labels used in a corpus and the labels used in the AEGIR tagset.
For each wordform w with parts of speech p_1..p_n in source s, we calculated the relative frequency for each POS p as:

    relfreq(w, p, s) = count(w, p, s) / Σ_{i=1}^{n} count(w, p_i, s)    (1)

We took the average relative frequency over the sources s_1..s_m as:

    avgrelfreq(w, p) = ( Σ_{j=1}^{m} relfreq(w, p, s_j) ) / m    (2)

We calculated the average relative frequency (Equation 2) for two sets of sources: Penn/BNC/UKWAC (PBU) on the one hand (representing general English language use), and MAREC/CLEF-IP (MC) on the other hand. Then we considered wordforms for which

    avgrelfreq(w, p, MC) − avgrelfreq(w, p, PBU) > 0.5    (3)

holds to be typical for patent claims.¹¹
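Equations (1)-(3) can be implemented directly. The sketch below assumes per-source counts stored as a mapping from (wordform, POS) pairs to frequencies; the data layout and function names are our own:

```python
def relfreq(counts, word, pos, source):
    """Equation (1): frequency of (word, pos) in `source`, relative to the
    summed frequency of `word` over all its parts of speech in `source`."""
    total = sum(c for (w, p), c in counts[source].items() if w == word)
    return counts[source].get((word, pos), 0) / total

def avgrelfreq(counts, word, pos, sources):
    """Equation (2): relative frequency averaged over a set of sources."""
    return sum(relfreq(counts, word, pos, s) for s in sources) / len(sources)

def typical_for_patent_claims(counts, word, pos,
                              mc=("MAREC", "CLEF-IP"),
                              pbu=("Penn", "BNC", "UKWAC")):
    """Equation (3): (word, pos) is considered typical for patent claims
    when its MC average exceeds its PBU average by more than 0.5."""
    return avgrelfreq(counts, word, pos, mc) - avgrelfreq(counts, word, pos, pbu) > 0.5
```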
For example, the wordform said with part of speech 'adjective' comes out as being typical for patent language, whereas the same word with the part of speech 'verb' is labeled as atypical for patent language.¹² However, apart from this example it is difficult to draw any conclusions from the output of our lexical frequency analysis. Only 4% of the ambiguous wordforms for which we obtained lexical frequencies are labeled as typical for patent language.
One problem in the identification of typical wordforms is that it is difficult to distinguish between peculiarities caused by a different descriptive model of the parser/tagger used (e.g. one parser may prefer the label 'adjective' over the label 'past participle' for word forms such as closed in a phrase such as 'the closed door') and an actual difference in language use in the corpus (e.g. said as an adjective vs. said as a verb).
Most of the examples in the list of typical wordforms are difficult for us to interpret (e.g. adhesive as an adjective is labeled as typical while adhesive as a noun is labeled as atypical). Therefore, and because only a fraction (4%) of the words come out as typical for patent language, we consider the lexical frequencies for ambiguous words to be inconclusive. They do not show a clear difference between patent vocabulary and regular English vocabulary.
Multi-word terms
We include the topic of multi-word terms here because in Section 1 we referred to 'novel terms' (following Shinmori [25]) without distinguishing between single-word terms and multi-word terms. Since we found no difference between the single-term vocabulary in general English and the English used in patent texts (see 'The coverage of general English vocabulary'), we hypothesize that the authors of patent claims introduce complex multi-word NPs that constitute new (technical) terms.
To verify this, we make use of the SPECIALIST lexicon [6]. According to the developers, this lexicon covers both commonly occurring English words and biomedical vocabulary discovered in the NLM Test Collection and the UMLS Metathesaurus. By using lexical items from a reliable lexicon, we do not rely on syntactic annotation of the corpus; instead we assume that every occurrence of a word sequence from the lexicon in the corpus is a multi-word term.
¹¹The threshold of 0.5 was chosen because a difference value higher than 0.5 means that in the two text types the other of the two word classes for the same word is the majority word class.
¹²Interestingly, the Connexor CFG parser only labeled 55% of the occurrences of said in the CLEF-IP corpus as an adjective, and the other occurrences as a verb. We conjecture that these parsing errors are due to the fact that the Connexor parser was not tuned for patent data but for general English.
The SPECIALIST lexicon contains approximately 200,000 compound nouns consisting of two words, 30,000 nouns consisting of three words, and around 10,000 nouns consisting of four or more words. We used these multi-word terms as input for a corpus filter as described in Section 4.2. We found that fewer than 2% of the two-word NPs from SPECIALIST occur in the MAREC subcorpus. For the three-word NPs, this percentage is lower than 1%, and for the longer NPs it is negligible. The ten most frequent multi-word NPs from SPECIALIST in the MAREC corpus are carbon atoms, alkyl group, hydrogen atom, amino acid, molecular weight, combustion engine, control device, nucleic acid, semiconductor device and storage means. However, their frequencies are still relatively small. Moreover, the large majority of multi-word terms in patent claims are compositional in the sense that they are formed from two or more lexicon words, combined in one word form following regular compositional rules. This means that for the purpose of syntactic parsing, it is not necessary to add these multiwords to the parser lexicon.
What does this mean? It seems that lexicalized multi-word NPs (terms from the SPECIALIST lexicon) do not occur very frequently in patent claims. This could be because the topic domains covered by the MAREC subcorpus differ from the domains included in the SPECIALIST lexicon. However, this is not very likely, since we found that on the single-word level the patent domain does not contain many words that are not in the general English vocabulary. We conjecture that patent authors write claims in which they create novel NPs (not captured by terminologies such as SPECIALIST). This is also found by D'hondt (2009), who reports that "these [multi-word] terms are invented and defined ad hoc by the patent writers and will generally not end up in any dictionary or lexicon" [9]. This would confirm the introduction of novel terms by patent authors, but only with respect to multi-word terms.
4.3 Challenge 3: Syntactic structure
According to international patent writing guidelines, patent claims are built out of noun phrases instead of clauses (see Section 2). This can be problematic for patent processing techniques that are based on syntactic analysis. Syntactic parsers are generally designed to analyze clauses, not noun phrases. This means that if there is a possible interpretation of the input string as a clause, the parser will try to analyze it as such: in case of lexical ambiguity, one of the words will be interpreted as a finite verb whereas it should be a noun or participle.
An analysis of the output of the baseline version of the AEGIR parser on a subset of the MAREC corpus can provide insight into the challenges relating to syntactic structures that occur in patent claims. To this end, we created a small sample from the complete set of MAREC sentences: a random sample of 100 sentences that are five to nine words in length. The motivation for this short sentence length in the sample was twofold: on the one hand we wanted to capture most NP constructions that occur in patent claims, but at the same time we wanted to minimize structural ambiguity.

Table 3: Evaluation of the baseline AEGIR parser and the state-of-the-art Connexor CFG parser for a set of 100 short (5-9 words) sentences from the MAREC subcorpus.

                               AEGIR   Connexor CFG
  precision                    0.45    0.71
  recall                       0.50    0.71
  F1-score                     0.47    0.71
  Inter-annotator agreement    0.83    0.83
For evaluation purposes, we manually created ground truth dependency analyses for 100 randomly selected sentences from this set. We found that only 4% of the short sentences are clauses (e.g. "F2 is the preselected operating frequency.").
The ground truth annotations were created by two assessors: both created annotations for 60 sentences, with an overlap of 20 sentences. We measured the inter-annotator agreement by counting the number of identical dependency triplets among the two annotators. Dividing this number by the total number of triplets created by one annotator gives accuracy1; dividing the number by the total number of triplets created by the other annotator gives accuracy2. We take the average accuracy as inter-annotator agreement.13 This way, we found an inter-annotator agreement of 0.83, which is considered substantial.
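The agreement computation described above amounts to the following minimal Python sketch. The example triplets are invented for illustration; the real annotations cover full sentences and use the AEGIR descriptive model.

```python
# Sketch of the inter-annotator agreement measure: identical triplets divided
# by each annotator's total number of triplets, then averaged.
def inter_annotator_agreement(triplets_a, triplets_b):
    """Average of accuracy1 and accuracy2 over two sets of dependency triplets."""
    identical = len(set(triplets_a) & set(triplets_b))
    accuracy1 = identical / len(triplets_a)  # relative to annotator A's total
    accuracy2 = identical / len(triplets_b)  # relative to annotator B's total
    return (accuracy1 + accuracy2) / 2

# Invented example: the annotators disagree only on the analysis of 'wherein'.
a = [("method", "PREPof", "claim"), ("claim", "ATTR", "4"), ("n", "MOD", "wherein")]
b = [("method", "PREPof", "claim"), ("claim", "ATTR", "4"), ("wherein", "MOD", "n")]
agreement = inter_annotator_agreement(a, b)  # 2 of 3 triplets match on each side
```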
For the 20 sentences that were annotated by both assessors, a consensus annotation was agreed upon with the help of a third (expert) assessor. After that, we adapted the annotations of the 80 sentences that had been annotated by one of the two assessors in accordance with the consensus annotation. This resulted in a consistently annotated set of 100 sentences. We used these annotations to evaluate the baseline version of the AEGIR parser. We calculated precision as the number of correct triplets in the AEGIR output divided by the total number of triplets created by AEGIR, and recall as the number of correct triplets in the AEGIR output divided by the number of triplets created by the human assessor.
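The triplet-based evaluation described above can be sketched compactly in Python. The toy triplets below are invented (the dependency labels merely imitate the style of the AEGIR model) and serve only to illustrate the set-intersection computation.

```python
# Sketch of the triplet-based evaluation: precision and recall are computed
# by intersecting the parser output with the gold-standard triplets.
def evaluate_triplets(parser_triplets, gold_triplets):
    """Precision, recall and F1-score over dependency triplets."""
    correct = len(set(parser_triplets) & set(gold_triplets))
    precision = correct / len(parser_triplets)
    recall = correct / len(gold_triplets)
    f1 = 2 * precision * recall / (precision + recall) if correct else 0.0
    return precision, recall, f1

# Invented toy data: the parser produces 4 triplets, 2 of which appear in the
# 5-triplet gold standard.
parser = [("method", "PREPof", "n"), ("n", "MOD", "wherein"),
          ("claim", "ATTR", "4"), ("is", "SUBJ", "n")]
gold = [("method", "PREPof", "claim"), ("claim", "ATTR", "4"),
        ("is", "SUBJ", "n"), ("is", "PRED", "zero"), ("zero", "MOD", "wherein")]
precision, recall, f1 = evaluate_triplets(parser, gold)  # 0.5, 0.4, ~0.44
```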
In order to compare the baseline version of the AEGIR parser to a state-of-the-art dependency parser, we ran the Connexor CFG parser [16] on the same set of short patent claim sentences. We converted the output of the parser to dependency triplets according to the AEGIR descriptive model14 and then evaluated it using the same procedure as described for the AEGIR parser above. The results for both AEGIR and the Connexor parser are in Table 3.
Table 3 shows that the performance of the baseline version of the AEGIR parser on short patent sentences is still moderate, and lower than that of the state-of-the-art Connexor parser. The errors made by AEGIR can provide valuable insights into the peculiarities of patent language. The most frequent parsing mistakes made by AEGIR are (1) the wrong choice for the head of a dependency relation (e.g. [9,ATTR,claim] for "claim 9") and (2) incorrect attachment of postmodifiers in NPs. For example, for the sentence "The method of claim 4 wherein n is zero.", the parser incorrectly generates [method,PREPof,n] instead of [method,PREPof,claim] and it labels wherein as a modifier to n: [n,MOD,X:wherein].
13 Cohen's Kappa cannot be determined for these data since there exists no chance agreement for the creation of dependency triplets.
14 A one-to-one conversion was possible to a large extent. The only problematic construction was the phrasal preposition according to, which is treated differently by the Connexor parser and the AEGIR descriptive model.
The former of these errors is repeated frequently in the data: the regular expression "claim [0-9]+" occurs in 96% of the sentences in the MAREC subcorpus. The latter case (ambiguities caused by postmodifier attachment) is known to be problematic for syntactic parsing. In patent claims, however, the problem is even more frequent than in general English because the NPs in patent claims are often very long (recall the median sentence length of 22 words). This brings us back to the central syntactic challenge mentioned several times in this paper: patent claims are composed of NPs instead of clauses.
In order to find other syntactic differences between patent claims and general English, we plan to evaluate the baseline version of the AEGIR parser on a set of sentences from the BNC and compare the outcome to the results obtained for MAREC sentences (Table 3).15
5. CONCLUSIONS AND FUTURE WORK
We have analyzed three challenges of patent claim processing that are mentioned in the literature: (1) the sentences are much longer than in general language use; (2) many novel terms are introduced in patent claims that are difficult to understand; and (3) the structure of the patent claims is complex, as a result of which syntactic parsers fail to correctly analyze patent claims. Where possible, we supported our analyses with quantifications of the findings, using a number of (patent and non-patent) corpora and NLP tools.
With respect to (1), we verified that sentences in English patent claims are longer than in general English, even if we split the claims not only on full stops but also on semi-colons. The median sentence length in the MAREC subcorpus is 22 words; the average length is 53 words.
With respect to (2), we performed a number of analyses related to the vocabulary of patent claims. We found that at the level of single words, not many novel terms are introduced by patent authors. Instead, they tend to use words from the general English vocabulary, which was demonstrated by a token coverage of 96% of the CELEX lexicon on the MAREC subcorpus. However, the frequency distribution of words in patent claims does differ from that in general English, which can especially be seen from the list of top-frequency words from MAREC and the BNC. Moreover, it seems that the authors of patent claims do introduce novel terms, but only at the multi-word level: we found that the lexicalized multi-word terms from the SPECIALIST lexicon have low frequencies in the MAREC subcorpus.
With respect to (3), we parsed 10,000 claims from the MAREC subcorpus using the baseline version of the AEGIR dependency parser and we performed a manual evaluation of the parser output for 100 short sentences from the corpus. We can confirm that syntactic parsing for patent claims is a challenge, especially because the claims consist of sequences of noun phrases instead of clauses, while syntactic parsers are designed for analyzing clauses. As a result, the parser will try to label at least one word in the sentence as a finite verb.
15 Of course, we can expect some problems when we run a parser that is being developed specifically for patent texts on BNC data, such as the generation of the triplet [Betty,ATTR,said] for the last two words of "Oh, that is sad," said Betty.
In conclusion, we can say that the challenges of patent claim processing that are related to syntactic structure are even more problematic than the challenges at the vocabulary level. The sentence length issue only causes problems indirectly, by resulting in more structural ambiguities for longer noun phrases.
In the near future, we will further develop the AEGIR dependency parser into a hybrid16 parser that incorporates information on the frequencies of dependency triplets. These frequencies (which are stored in the triplet database that is connected to AEGIR) guide the resolution of structural ambiguities. For example, the information that 'carbon atoms' is a highly frequent NP with the structure [atom,ATTR,carbon] guides the disambiguation of a complex NP such as "cycloalkyl with 5-7 ring carbon atoms substituted by a member selected from the group consisting of amino and sulphoamino" (taken from the MAREC subcorpus), which contains many structural ambiguities. The same holds for the frequent error [9,ATTR,claim] that we mentioned in Section 4.3. Given the high frequency of this error type, it is relatively easy to solve using triplet frequencies.
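This frequency-guided disambiguation strategy can be illustrated with a simplified Python sketch. The tiny triplet database and its counts below are invented for illustration and are not taken from the actual database connected to AEGIR.

```python
# Sketch: among the candidate triplets that a structural ambiguity allows,
# prefer the one that is most frequent in the triplet database.
# All frequencies below are invented, illustrative values.
triplet_db = {
    ("atom", "ATTR", "carbon"): 1520,   # 'carbon atoms' is a frequent NP
    ("carbon", "ATTR", "ring"): 3,
    ("claim", "ATTR", "9"): 870,        # the intended reading of 'claim 9'
    ("9", "ATTR", "claim"): 0,          # the frequent parser error
}

def pick_reading(candidates):
    """Return the candidate triplet with the highest database frequency."""
    return max(candidates, key=lambda t: triplet_db.get(t, 0))

# Disambiguating 'claim 9': the head-dependent order is ambiguous to the parser.
best = pick_reading([("claim", "ATTR", "9"), ("9", "ATTR", "claim")])
```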
In order to collect reliable frequency information on dependency relations, we use a bootstrap process. As the starting point of the bootstrap we use reliably annotated corpora for general English such as the Penn Treebank [20] and the British National Corpus (BNC) [7]. We then use parts of patent corpora such as MAREC and CLEF-IP [23], which we annotate syntactically using automatic parsers. Moreover, we harvest terminology lists and thesauri such as the biomedical thesaurus UMLS [4], which contain many multi-word NPs and therefore can provide us with reliable ATTR relations (such as [atom,ATTR,carbon]).
The addition of this information allows us to tune the AEGIR parser specifically to the language used in patent texts. We expect that a number of the parsing problems described in this paper will be solved by incorporating frequency information that is extracted from patent data. To what extent this will be successful remains to be seen from the further development and evaluation of the AEGIR parser.
6. ACKNOWLEDGMENTS
The TM4IP project is funded by Matrixware.
7. REFERENCES
[1] N. Akers. The European Patent System: an introduction for patent searchers. World Patent Information, 21(3):135-163, 1999.
[2] R. Baayen, R. Piepenbrock, and H. van Rijn. The CELEX Lexical Database (CD-ROM). Linguistic Data Consortium, University of Pennsylvania, Philadelphia, USA, 1993.
16 Hybrid in the sense that it combines rule-based and probabilistic information.
[3] M. Baroni, S. Bernardini, A. Ferraresi, and E. Zanchetta. The WaCky wide web: a collection of very large linguistically processed web-crawled corpora. Language Resources and Evaluation, 43(3):209-226, 2009.
[4] O. Bodenreider. The Unified Medical Language System (UMLS): integrating biomedical terminology. Nucleic Acids Research, 32:D267-D270, 2004.
[5] D. Bonino, A. Ciaramella, and F. Corno. Review of the state-of-the-art in patent information and forthcoming evolutions in intelligent patent informatics. World Patent Information, 32:30-38, 2010.
[6] A. Browne, A. McCray, and S. Srinivasan. The SPECIALIST Lexicon. National Library of Medicine Technical Reports, pages 18-21, 2000.
[7] L. Burnard. Users Reference Guide for the British National Corpus. Technical report, Oxford University Computing Services, 2000.
[8] L. Chen, N. Tokuda, and H. Adachi. A patent document retrieval system addressing both semantic and syntactic properties. In Proceedings of the ACL-2003 Workshop on Patent Corpus Processing, pages 1-6, Morristown, NJ, USA, 2003. Association for Computational Linguistics.
[9] E. D'hondt. Lexical issues of a syntactic approach to interactive patent retrieval. In Proceedings of the 3rd BCS IRSG Symposium on Future Directions in Information Access, pages 102-109, 2009.
[10] E. D'hondt, S. Verberne, N. Oostdijk, and L. Boves. Re-ranking based on syntactic dependencies in prior-art retrieval. In Proceedings of the Dutch-Belgium Information Retrieval Workshop 2010, 2010. To appear.
[11] E. Escorsa, M. Giereth, Y. Kompatsiaris, S. Papadopoulos, E. Pianta, G. Piella, I. Puhlmann, G. Rao, M. Rotard, P. Schoester, L. Serafini, and V. Zervaki. Towards content-oriented patent document processing. World Patent Information, 30(1):21-33, 2008.
[12] A. Fujii, M. Iwayama, and N. Kando. Introduction to the special issue on patent processing. Information Processing and Management, 43(5):1149-1153, 2007.
[13] E. Graf and L. Azzopardi. A methodology for building a test collection for prior art search. In Proceedings of the 2nd International Workshop on Evaluating Information Access (EVIA), pages 60-71, 2008.
[14] M. Iwayama, A. Fujii, N. Kando, and Y. Marukawa. Evaluating patent retrieval in the third NTCIR workshop. Information Processing and Management, 42(1):207-221, 2006.
[15] M. Iwayama, A. Fujii, N. Kando, and A. Takano. Overview of patent retrieval task at NTCIR-3. In Proceedings of the Third NTCIR Workshop on Research in Information Retrieval, Automatic Text Summarization and Question Answering, 2003.
[16] T. Jarvinen and P. Tapanainen. Towards an implementable dependency grammar. In Proceedings of COLING-ACL, volume 98, pages 1-10, 1998.
[17] C. Koster and J. Beney. Phrase-based document categorization revisited. In Proceedings of the 2nd International Workshop on Patent Information Retrieval (PaIR'09), 2009.
[18] C. Koster, N. Oostdijk, S. Verberne, and E. D'hondt. Challenges in professional search with PHASAR. In Proceedings of the Dutch-Belgium Information Retrieval Workshop, 2009.
[19] G. Leech. 100 million words of English: the British National Corpus (BNC). Language Research, 28(1):1-13, 1992.
[20] M. Marcus, B. Santorini, and M. Marcinkiewicz. Building a large annotated corpus of English: the Penn Treebank. Computational Linguistics, 19(2):313-330, 1994.
[21] H. Mase, T. Matsubayashi, Y. Ogawa, M. Iwayama, and T. Oshio. Proposal of two-stage patent retrieval method considering the claim structure. ACM Transactions on Asian Language Information Processing (TALIP), 4:190-206, 2005.
[22] S. Mille and L. Wanner. Making text resources accessible to the reader: the case of patent claims. In Proceedings of the International Language Resources and Evaluation Conference (LREC), pages 1393-1400, Marrakech, Morocco, 2008.
[23] G. Roda, J. Tait, F. Piroi, and V. Zenz. CLEF-IP 2009: retrieval experiments in the Intellectual Property domain. In CLEF Working Notes 2009, pages 1-16, 2009.
[24] S. Sheremetyeva. Natural language analysis of patent claims. In Proceedings of the ACL-2003 Workshop on Patent Corpus Processing, pages 66-73, 2003.
[25] A. Shinmori, M. Okumura, Y. Marukawa, and M. Iwayama. Patent claim processing for readability: structure analysis and term explanation. In Proceedings of the ACL-2003 Workshop on Patent Corpus Processing, page 65. Association for Computational Linguistics, 2003.
[26] Y. Tseng, C. Lin, and Y. Lin. Text mining techniques for patent analysis. Information Processing and Management, 43(5):1216-1247, 2007.
[27] L. Wanner, R. Baeza-Yates, S. Brugmann, J. Codina, B. Diallo, E. Escorsa, M. Giereth, Y. Kompatsiaris, S. Papadopoulos, E. Pianta, et al. Towards content-oriented patent document processing. World Patent Information, 30(1):21-33, 2008.