
Term Distillation in Patent Retrieval

Hideo Itoh, Hiroko Mano, Yasushi Ogawa
Ricoh Company, Ltd.
Cross-DB Retrieval (1)
• The query domain differs from that of the retrieval target.
• Setting: a query article from the query domain (news articles) is given to the
  retrieval system, which returns a ranking of patents from the target domain.

2003/7/14  NTCIR-3 Workshop Meeting
Cross-DB Retrieval (2)
• Problem
  – Incorrect query-term weighting, caused by the difference in term-occurrence
    distributions between the query and target domains.
  – Example
    • The query term “社長 (president)” would be given a large weight,
      because its df in patents is very low.
    • However, “社長” is not a good term for patent retrieval.
Term distillation
• A general framework for query-term selection for cross-DB retrieval:

  query document
    → term extraction (morphological analysis + stopword list)
    → candidate query terms
    → term selection using TDV (Term Distillation Value)
    → query terms
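The selection step above can be sketched in a few lines of Python. The whitespace tokenizer and tiny stopword list here are hypothetical stand-ins for the morphological analyzer and stopword list the slide describes:

```python
# Sketch of the term-distillation pipeline. The tokenizer and stopword
# list are placeholders, not the system's actual morphological analysis.

STOPWORDS = {"the", "a", "of", "and", "in"}

def extract_candidates(query_document: str) -> set[str]:
    """Term extraction: tokenize and drop stopwords."""
    tokens = query_document.lower().split()
    return {t for t in tokens if t not in STOPWORDS}

def distill(query_document: str, tdv, k: int = 8) -> list[str]:
    """Select the top-k candidate terms by their Term Distillation Value."""
    candidates = extract_candidates(query_document)
    return sorted(candidates, key=tdv, reverse=True)[:k]
```

With any TDV function that scores target-domain terms above query-domain terms, the k best candidates (k = 8 in the experiments below) become the query.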
Term Distillation Value
• TDV represents the “goodness” of a query term.
• Generic model:
    TDV = QV ・ TV
  where
    QV : conventional term-selection value
    TV : newly introduced selection value for cross-DB retrieval
• Two probabilities are used to estimate TV:
    p = Prob( term | target domain )
    q = Prob( term | query domain )
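The generic model is a single multiplication; as a minimal sketch, TV is illustrated here with the Swets estimate (p − q) from the table of TV instances:

```python
def term_distillation_value(qv: float, tv: float) -> float:
    """Generic model: TDV = QV * TV."""
    return qv * tv

def tv_swets(p: float, q: float) -> float:
    """Swets estimate of TV: p - q, where p = Prob(term | target domain)
    and q = Prob(term | query domain)."""
    return p - q
```

A term frequent in the target domain but rare in the query domain (p = 0.2, q = 0.01) keeps most of its conventional weight QV, while a query-domain-only term (p = 0, q = 0.2) gets a negative TV and drops out.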
Instances of TV

  Distillation Model        Estimation of TV
  Zero                      constant = 1
  Swets                     p − q
  Naïve Bayes               p / q
  Bayesian classification   p / (p + α・q + ε)
  Binary independence       log [ p(1 − q) / q(1 − p) ]
  Target domain             p
  Query domain              1 − q
  Binary                    1 (p > 0) or 0 (p = 0)
  Joint probability         p (1 − q)
  Decision theoretic        log (p / q)
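Several TV instances from the table can be written directly from p and q. The guard constant below is an assumption, since the slides leave α and ε unspecified:

```python
import math

EPS = 1e-9  # assumed small guard constant; the slides do not fix epsilon

def tv_bayes(p: float, q: float, alpha: float = 1.0) -> float:
    """Bayesian classification: p / (p + alpha*q + epsilon)."""
    return p / (p + alpha * q + EPS)

def tv_binary_independence(p: float, q: float) -> float:
    """Binary independence: log[ p(1-q) / q(1-p) ], with p and q clamped
    to the open interval (0, 1) to keep the logarithm finite."""
    p = min(max(p, EPS), 1 - EPS)
    q = min(max(q, EPS), 1 - EPS)
    return math.log(p * (1 - q) / (q * (1 - p)))

def tv_query_domain(q: float) -> float:
    """Query domain: 1 - q."""
    return 1 - q
```

All three agree qualitatively: a term with p > q (characteristic of the target domain) scores high, and a term with q > p scores low or negative.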
Instances of QV

  Conventional Model        Estimation of QV
  Zero                      constant = 1
  Approximated 2-Poisson    tf / (tf + β)
  Term frequency            tf
  IDF                       log (N / df + 1)
  Probabilistic tf * idf    tf / (tf + β) ・ log (N / df + 1)
  tf * idf                  tf ・ log (N / df + 1)
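The QV instances combine within-document term frequency tf with collection statistics N (collection size) and df (document frequency). A sketch of the two idf-based weights; β = 0.5 is an assumed value, as the slides leave it unspecified:

```python
import math

def qv_idf(df: int, N: int) -> float:
    """IDF: log(N/df + 1)."""
    return math.log(N / df + 1)

def qv_prob_tfidf(tf: int, df: int, N: int, beta: float = 0.5) -> float:
    """Probabilistic tf*idf: tf/(tf + beta) * log(N/df + 1).
    beta = 0.5 is an assumed default, not a value from the slides."""
    return tf / (tf + beta) * qv_idf(df, N)
```

Unlike plain tf·idf, the approximated 2-Poisson factor tf/(tf + β) saturates, so doubling an already-large tf barely changes QV.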
The Cross-DB Retrieval System
• Query DB : IREX news articles, giving  q = df_Q / N_Q
• Target DB : NTCIR-3 patents, giving  p = df_T / N_T
• A query document is given to the cross-DB retrieval system, which searches
  the target documents and returns a ranking of documents.
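As the slide shows, p and q are estimated as document-frequency ratios in each collection (p = df_T / N_T, q = df_Q / N_Q). A sketch with toy tokenized collections standing in for the IREX news articles and NTCIR-3 patents:

```python
from collections import Counter

def domain_probs(docs: list[list[str]]) -> dict[str, float]:
    """Estimate Prob(term | domain) as df / N over a tokenized collection."""
    N = len(docs)
    df = Counter()
    for doc in docs:
        df.update(set(doc))  # count each term at most once per document
    return {t: c / N for t, c in df.items()}

# Toy stand-ins for the query DB (news) and the target DB (patents).
news = [["president", "company"], ["president", "market"]]
patents = [["apparatus", "emulsion"], ["apparatus", "particle"]]

q_probs = domain_probs(news)     # q = df_Q / N_Q
p_probs = domain_probs(patents)  # p = df_T / N_T
```

Here "president" gets q = 1.0 but p = 0, so any of the TV estimates from the earlier table pushes it down the ranking, matching the 社長 example.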
Experimental Results
Topic = article-only; number of query terms = 8;
automatic retrieval with pseudo-relevance feedback.

  QV        TV                      p      q      AveP
  tf        log(p(1−q)/q(1−p))      TITLE  IREX   0.1953
  tf        log(p/q)                TITLE  IREX   0.1948
  tf * idf  p / (p + αq + ε)        TITLE  IREX   0.1844
  tf        p / (p + αq + ε)        TITLE  IREX   0.1843
  1         p / (p + αq + ε)        TITLE  IREX   0.1816
  tf        1 − q                   TITLE  IREX   0.1730
  tf        p / q                   TITLE  IREX   0.1701
  tf        p / (p + αq + ε)        ABST   WHOLE  0.1694
  tf        1                       ー     ー     0.1645
Samples of query terms
• Topic 0001
  – 装置 (apparatus), サブミクロン (submicron), 液体 (liquid), 工業 (industry), 特殊 (special), 特殊機 (special machine), 分離 (separation), 粒子 (particle)
  – 装置 (apparatus), 乳化 (emulsification), 撹拌 (agitation), 液体 (liquid), 粒子 (particle), 撹拌機 (agitator), 微粒子 (fine particle), サブミクロン (submicron)
• Topic 0002
  – 種子 (seed), 植物 (plant), 福岡 (Fukuoka), 農法 (farming method), 単行本 (book), 引用 (quotation), 漫画 (comic), SEED
  – 種子 (seed), 植物 (plant), 粘土団子 (clay ball), 団子 (ball), 粘土 (clay), 農法 (farming method), 農業 (agriculture), 編集 (editing)
• Topic 0003
  – 機器 (equipment), 湯場, 日本 (Japan), 制御 (control), 下請け (subcontracting), 発注 (ordering), 仕事 (work), 時代 (era)
  – 制御 (control), 5相 (5-phase), モータ (motor), ステツピングモータ (stepping motor), 電子機器 (electronic equipment), 機器 (equipment), 電子 (electronics)
• Topic 0004
  – エポック (Epoch), エポック社 (Epoch Co.), バンダイ (Bandai), 製造 (manufacturing), 訴訟 (lawsuit), 地裁 (district court), 東京 (Tokyo), 万円 (10,000 yen)
  – 製造 (manufacturing), 玩具 (toy), ゲイム機 (game machine), カードゲーム (card game), 小型 (compact), ゲーム (game), 技術的 (technical), 指摘 (pointing out)

upper : conventional method (tf)
lower : with term distillation (tf ・ log(p(1−q)/(q(1−p))))
NTCIR-3 mandatory runs
Topic = article+supplement; number of query terms = 8;
automatic retrieval with pseudo-relevance feedback.

  Run   QV        TV                  p      q      AveP
  f021  tf        p / (p + αq + ε)    TITLE  IREX   0.2794
  f020  1         p / (p + αq + ε)    TITLE  IREX   0.2701
  f022  tf        p / (p + αq + ε)    ABST   WHOLE  0.2688
  f019  tf * idf  p / (p + αq + ε)    TITLE  IREX   0.2637

• Retrieval performance is strong in comparison with the other submitted runs.
• Because of the influence of the supplemental data, the effect of
  term distillation is unclear.
NTCIR-3 optional runs
• Automatic retrieval with pseudo-relevance feedback

  Run   fields    AveP    P@10    Rret
  f018  t,d,c     0.3262  0.4323  1197
        t,d,c,n   0.3056  0.4258  1182
        d         0.3039  0.4032  1133
        t,d       0.2801  0.3581  1100
        t,d,n     0.2753  0.4000  1140
        d,n       0.2750  0.4323  1145
        s         0.2712  0.3806   991
        t,d       0.1283  0.1968   893

  (t=title, d=desc, c=concept, n=narrative, s=supplement)
Conclusions
• We proposed “term distillation”, a general framework for
  cross-DB retrieval.
• Our experiments using the NTCIR-3 patent retrieval test collection
  demonstrate that term distillation is effective for cross-DB retrieval.