The Principle of
Information Retrieval
Department of Information Management
School of Information Engineering
Nanjing University of Finance & Economics
2017
Chapter 6: Evaluation in IR
How to measure user happiness
• The key utility measure is user happiness
• Relevance of results (effectiveness)
• Speed of response
• Size of the index
• User interface design
• Speed, index size, and interface design are independent of the quality of the results
• Utility, success, completeness, satisfaction, worth, value, time, cost, …
1 Introduction
The role of evaluation
• Evaluation plays an important role in information retrieval research
• Evaluation has always been at the core of IR system development, to the point that an algorithm and the way its effectiveness is evaluated are regarded as inseparable (Saracevic, SIGIR 1995)
The origins of IR system evaluation
• Kent et al. were the first to propose the concepts of Precision and Recall (initially called "relevance") (Kent, 1955)
The origins of IR system evaluation
• Cranfield-like evaluation methodology
• From the late 1950s to the early 1960s, the Cranfield experiments introduced an evaluation scheme based on a set of sample queries, a set of ground-truth relevance judgments, and a document corpus; it is known as the "grand-daddy" of IR evaluation
• It established the central position of evaluation in information retrieval research
• Cranfield is a place name and also the name of a research institute
The origins of IR system evaluation
• Gerard Salton and the SMART system
• Gerard Salton was the principal developer of the SMART system
• SMART provided both a research platform and an evaluation tool: once the relevance answers were supplied, it could compute the common evaluation metrics
The origins of IR system evaluation
• Sparck Jones's book "Information retrieval experiment"
• Devoted mainly to IR experiments and evaluation
How to measure information
retrieval effectiveness
• We need a test collection consisting of three
things
• A document collection
• A test suite of information needs, expressible as
queries
• A set of relevance judgments, standardly a binary
assessment of either relevant or not relevant for
each query-document pair
How to measure information
retrieval effectiveness
• And in this test collection
• Collection and suite of information needs have to
be of a reasonable size
• Results are highly variable over different documents
and information needs
• At least 50 information needs are usually required
Difficulties
• The difference between the stated information need and the query
• Relevance is assessed relative to an information need, not a
query
• The subjectivity of relevance decision
• Many systems contain various parameters that
can be adjusted to tune system performance
• The correct procedure is to have one or more development
test collections
Difficulties
• Voorhees estimated that judging the relevance of a collection of 8 million documents against a single query topic would take one assessor about nine months
• TREC introduced the pooling method, which greatly reduces the judging effort while keeping the evaluation results reliable
• Drawback: only a small number of queries can be handled; even for a small query set, judging still takes more than ten assessors one to two months
2 Standard test collections
• The Cranfield collection
• Text Retrieval Conference (TREC)
• GOV2
• NII Test Collections for IR Systems (NTCIR)
• Reuters-21578 and Reuters-RCV1
The Cranfield collection
• Collected in the United Kingdom starting in the late 1950s, it contains 1398 abstracts of aerodynamics journal articles, a set of 225 queries, and exhaustive relevance judgments of all (query, document) pairs
• Allowing precise quantitative measures
• But too small
Text Retrieval Conference (TREC)
• The U.S. National Institute of Standards and
Technology (NIST) has run a large IR test bed
evaluation series since 1992
• Over a range of different test collections
• But the best known test collections are the
ones used for the TREC Ad Hoc track during
the first 8 TREC evaluations from 1992 to 1999
• Comprise 6 CDs containing 1.89 million
documents (mainly, but not exclusively,
newswire articles)
Text Retrieval Conference (TREC)
• TRECs 6–8 provide 150 information needs over
about 528,000 newswire and Foreign
Broadcast Information Service articles
• Relevance judgments are available only for the
documents that were among the top k
returned for some system which was entered
in the TREC evaluation
GOV2
• Contains 25 million web pages crawled from the .gov domain
• GOV2 is one of the largest Web test collections, but it is still more than three orders of magnitude smaller than the document collections indexed by the large web search companies
NII Test Collections for IR Systems
(NTCIR)
• Similar sizes to the TREC collections
• Focusing on East Asian languages and cross-language information retrieval
Reuters-21578 and Reuters-RCV1
• Most used for text classification
• Reuters-21578 collection of 21578 newswire
articles
• Reuters Corpus Volume 1 (RCV1) is much
larger, consisting of 806,791 documents
3 Evaluation of unranked retrieval
sets
• Precision
• Recall
• Accuracy
• F measure
Precision
• Precision (P) is the fraction of retrieved
documents that are relevant
Recall
• Recall (R) is the fraction of relevant documents
that are retrieved
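A minimal sketch of these two set-based measures in Python; the function name and the example document IDs are illustrative, not taken from the slides:

def precision_recall(retrieved, relevant):
    # Set-based precision and recall for a single query.
    # retrieved: iterable of document IDs returned by the system
    # relevant:  iterable of document IDs judged relevant
    retrieved, relevant = set(retrieved), set(relevant)
    hits = retrieved & relevant            # relevant documents that were retrieved
    p = len(hits) / len(retrieved) if retrieved else 0.0
    r = len(hits) / len(relevant) if relevant else 0.0
    return p, r

# Example: 3 of the 5 retrieved documents are relevant; 6 documents are relevant overall.
print(precision_recall({1, 2, 3, 4, 5}, {2, 3, 5, 8, 9, 10}))   # (0.6, 0.5)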
Another way to define P and R
The comparison of P and R
• Typical web surfers prefer P to R
• Various professional searchers such as
paralegals and intelligence analysts prefer R to
P
• Individuals searching their hard disks prefer R
to P
The comparison of P and R
• The two quantities clearly trade off against one
another
• Recall is a non-decreasing function of the number
of documents retrieved
• Can always get a recall of 1 (but very low precision)
by retrieving all documents for all queries
• Precision usually decreases as the number of
documents retrieved is increased
Which is more difficult to measure?
• Precision
• Recall
Questions about recall
• The denominator is a value that cannot be determined, so a recall measure built on it is not practical
• Why were the relevant documents not retrieved?
• Is it the design of the database or the user's searching skill?
• Is it an indexing problem or a retrieval problem?
• Does every database have some inherent coefficient of relevance?
Questions about precision
• When a database contains many documents that should have been retrieved but were not, is it really meaningful for the retrieved documents to be 100% relevant?
• How do the nonrelevant documents in the denominator of precision arise?
• Are they produced by the system?
• By the user's poorly expressed query?
• Or by the user's final selection?
Relative precision
Questions about the relationship between recall and precision
• a: relevant and retrieved
• b: irrelevant and retrieved
• c: relevant and unretrieved
• d: irrelevant and unretrieved
Questions about the relationship between recall and precision
• The two are generally assumed to be inversely related
• If a/(c+a) increases, c must decrease
• c can decrease for two reasons: line c moves down or line b moves right; either way b increases, so a/(b+a) decreases, and vice versa
• But this rests on the assumption that a is a constant
• The constant is not a but (c+a): if c decreases, a must increase, so a is a variable
• Therefore there is no necessary link between b and c; one can imagine lines b and c moving toward the edges at the same time. Both b and c being 0 is unlikely, but not impossible
• In fact, a retrieval system can improve both measures at the same time
Contingency table
                      Relevant                  Not relevant
    Retrieved         true positives (tp)       false positives (fp)
    Not retrieved     false negatives (fn)      true negatives (tn)
The other way to define P and R
• P = tp/(tp + fp)
• R = tp/(tp + fn)
F measure
• A single measure that trades off precision versus recall
• F = 1 / (α/P + (1−α)/R) = (β² + 1)PR / (β²P + R), where β² = (1−α)/α
• α ∈ [0, 1] and thus β² ∈ [0, ∞)
• The default balanced F measure uses β = 1
F measure
• Values of β < 1 emphasize precision, while
values of β > 1 emphasize recall
• It is a harmonic mean of precision and recall rather than their simple arithmetic average
F measure
• The harmonic mean is always less than or equal to the arithmetic and geometric means, and it is often quite close to the minimum of the two numbers
• This strongly suggests that the arithmetic mean is an unsuitable measure, because it is closer to the maximum of the two numbers than the harmonic mean is
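A small sketch of the formula above, F = (β² + 1)PR / (β²P + R); the helper name and the sample values are illustrative. The last two lines also illustrate why the harmonic mean is preferred: retrieving everything gives recall 1 but near-zero precision, and the arithmetic mean still reports about 0.5 while F1 stays near the minimum.

def f_measure(p, r, beta=1.0):
    # Weighted harmonic mean of precision p and recall r.
    # beta < 1 emphasizes precision, beta > 1 emphasizes recall; beta = 1 gives the balanced F1.
    if p == 0.0 and r == 0.0:
        return 0.0
    b2 = beta * beta
    return (b2 + 1) * p * r / (b2 * p + r)

print(f_measure(0.6, 0.5))              # balanced F1 ≈ 0.545
p, r = 0.0001, 1.0                      # "retrieve everything": perfect recall, negligible precision
print((p + r) / 2, f_measure(p, r))     # arithmetic mean ≈ 0.50005 vs. F1 ≈ 0.0002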
F measure
Accuracy
• Accuracy is the fraction of its classifications that
are correct
• An information retrieval system can be thought of
as a two-class classifier
• Accuracy=(tp+tn)/(tp+fp+fn+tn)
Accuracy
• Often used for evaluating machine learning
classification problems
• Not an appropriate measure for information
retrieval problems
• Normally over 99.9% of the documents are in the
not relevant category
• A system can maximize accuracy simply by deeming all documents nonrelevant to all queries, because tn dominates all the other counts
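A small illustration of this point with made-up counts (a hypothetical collection of 1,000,000 documents with only 50 relevant to the query):

def accuracy(tp, fp, fn, tn):
    # Fraction of classification decisions (relevant / nonrelevant) that are correct.
    return (tp + tn) / (tp + fp + fn + tn)

# A system that retrieves nothing at all: tp = fp = 0, fn = 50, tn = 999,950.
print(accuracy(0, 0, 50, 999_950))      # 0.99995 -- "99.995% accurate", yet it returns no results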
4 Evaluation of ranked retrieval results
• Precision, recall, and the F measure are set-based measures and are computed using unordered sets of documents
• In a ranked retrieval context, appropriate sets
of retrieved documents are naturally given by
the top k retrieved documents
The types
• Precision-recall curve
• Mean Average Precision (MAP)
• Precision at k
• R-precision
• ROC curve
Precision-recall curve
Precision-recall curve
• If the (k + 1)th document retrieved is irrelevant
then recall is the same as for the top k
documents, but precision has dropped
• If it is relevant, then both precision and recall
increase, and the curve jags up and to the right
How to evaluate?
• The more sharply the curve drops (the more "cliff-like" it is), the less effective the system
Precision-recall curve with
interpolated precision
• Interpolation removes the jiggles (sawtooth shape) of the curve
• The interpolated precision at a certain recall
level r is defined as the highest precision found
for any recall level q ≥ r
• The justification is that almost anyone would be
prepared to look at more documents if it would
increase the percentage of the viewed set that
were relevant
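A sketch of interpolated precision under this definition; the (recall, precision) pairs are assumed to have been collected while walking down a ranked result list, and the names and sample values are illustrative:

def interpolated_precision(points, r):
    # Interpolated precision at recall level r: the highest precision observed
    # at any recall level q >= r.  points: list of (recall, precision) pairs.
    return max((p for q, p in points if q >= r), default=0.0)

# Example curve with a sawtooth dip; interpolation smooths it away.
curve = [(0.2, 1.0), (0.4, 0.67), (0.4, 0.5), (0.6, 0.6), (0.8, 0.57)]
print(interpolated_precision(curve, 0.4))   # 0.67, not the dip value 0.5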
Precision-recall curve with 11-point interpolated average precision
• Can boil curve’s information down to a few
numbers
• 11 points means 0.0, 0.1, 0.2, … , 1.0
• For each recall level, we then calculate the
arithmetic mean of the interpolated precision
at that recall level for each information need in
the test collection
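A sketch of the 11-point computation, assuming one list of (recall, precision) observations per information need; all names are illustrative:

def eleven_point_curve(runs):
    # runs: one list of (recall, precision) pairs per information need.
    # Returns the mean interpolated precision at recall 0.0, 0.1, ..., 1.0.
    levels = [i / 10 for i in range(11)]
    def interp(points, r):                 # interpolated precision at level r
        return max((p for q, p in points if q >= r), default=0.0)
    return [sum(interp(pts, r) for pts in runs) / len(runs) for r in levels]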
Precision-recall curve with 11-point interpolated average precision
Mean Average Precision (MAP)
• Most standard among the TREC community
• A single-figure measure of quality across recall
levels
• Has especially good discrimination and stability
Mean Average Precision (MAP)
• For a single information need, Average Precision is the average of the precision values obtained for the set of top k documents after each relevant document is retrieved
Mean Average Precision (MAP)
• The set of relevant documents for an information need qj ∈ Q is {d1, . . . , dmj}
• Rjk is the set of ranked retrieval results from the top result until you get to document dk
• MAP(Q) = (1/|Q|) Σj=1..|Q| (1/mj) Σk=1..mj Precision(Rjk)
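A sketch of Average Precision and MAP as defined above; relevant documents that are never retrieved contribute a precision of 0, and the function names are illustrative:

def average_precision(ranking, relevant):
    # ranking: documents in ranked order; relevant: the set of relevant documents (size m_j).
    relevant = set(relevant)
    hits, precisions = 0, []
    for k, doc in enumerate(ranking, start=1):
        if doc in relevant:
            hits += 1
            precisions.append(hits / k)    # precision of the top k right after this relevant hit
    return sum(precisions) / len(relevant) if relevant else 0.0

def mean_average_precision(runs):
    # runs: list of (ranking, relevant) pairs, one per information need q_j in Q.
    return sum(average_precision(rank, rel) for rank, rel in runs) / len(runs)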
The disadvantage of MAP
• Fixed recall levels are not chosen; the score is roughly the average precision across all recall levels
• Calculated scores normally vary widely across information needs (e.g., from 0.1 to 0.7)
• So a set of test information needs must be large
and diverse enough to be representative of
system effectiveness across different queries
Precision at k
• What matters is how many good results there
are on the first page or the first k pages,
especially for search engine
• This method measures precision at fixed low
levels of retrieved results, such as 10 or 30
documents
• It has the advantage of not requiring any estimate of the size of the set of relevant documents, but the disadvantage that it is the least stable of the commonly used evaluation measures
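A sketch of precision at k (names illustrative):

def precision_at_k(ranking, relevant, k=10):
    # Fraction of the top k ranked documents that are relevant.
    relevant = set(relevant)
    return sum(doc in relevant for doc in ranking[:k]) / k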
R-precision
• It requires having a set of known relevant
documents of size Rel, from which we calculate
the precision of the top Rel documents
returned
• Like Precision at k, R-precision describes only
one point on the precision-recall curve
• If r of the top Rel documents are relevant, R-precision and recall are both equal to r/Rel, which is identical to the break-even point of the precision-recall curve
R-precision
• Unlike precision at k, R-precision adjusts for the size of the relevant set: even a perfect system could only achieve a precision at 20 of 0.4 if there were only 8 relevant documents
• Averaging R-precision across queries thus makes more sense
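A sketch of R-precision under the same assumptions (names illustrative):

def r_precision(ranking, relevant):
    # Precision of the top |Rel| results, where Rel is the known set of relevant documents.
    relevant = set(relevant)
    cutoff = len(relevant)
    if cutoff == 0:
        return 0.0
    return sum(doc in relevant for doc in ranking[:cutoff]) / cutoff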
5 Assessing relevance
About information needs
• Information needs must be germane to the
documents
• Information needs are best designed by
domain experts
• Using random combinations of query terms as
an information need is generally not a good
idea because typically they will not resemble
the actual distribution of information needs
About assessment
• Time-consuming and expensive process
involving human beings
• For tiny collections like Cranfield, exhaustive judgments of
relevance for each query and document pair were obtained
• For large modern collections, it is usual for relevance to be
assessed only for a subset of the documents for each query
• The most standard approach is pooling, where
relevance is assessed over a subset of the
collection that is formed from the top k
documents returned
About assessment
• Humans and their relevance judgments are
quite variable
• But this is not a problem to be solved
• In the final analysis, the success of an IR system
depends on how good it is at satisfying the needs
of these variable humans
• But we still need to measure the agreement between these assessments
Kappa statistic
• It measures how much the relevance judgments of two judges agree, correcting for the agreement expected by chance
• kappa = (P(A) − P(E)) / (1 − P(E))
• P(A) is the proportion of the times the judges agreed
• P(E) is the proportion of the times they would be expected to agree by chance, which needs to be estimated
Kappa statistic
                    Judge II: Yes    Judge II: No    Total
    Judge I: Yes         300               20         320
    Judge I: No           10               70          80
    Total                310               90         400
Kappa statistic
• P(A) = (300 + 70)/400 = 370/400 = 0.925
• P(nonrelevant) = (80 + 90)/(400 + 400) = 170/800 = 0.2125
• P(relevant) = (320 + 310)/(400 + 400) = 630/800 = 0.7875
• P(E) = P(nonrelevant)² + P(relevant)² = 0.2125² + 0.7875² = 0.665
• kappa = (P(A) − P(E))/(1 − P(E)) = (0.925 − 0.665)/(1 − 0.665) = 0.776
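The calculation above can be reproduced with a short sketch; the 2x2 table layout and the pooled-marginal estimate of P(E) follow the slides, while the function name is illustrative:

def kappa(table):
    # table[i][j]: number of documents judge I labeled i and judge II labeled j,
    # with index 0 = "Yes" (relevant) and 1 = "No" (nonrelevant).
    total = sum(sum(row) for row in table)
    p_agree = (table[0][0] + table[1][1]) / total                     # P(A)
    # Pool the two judges' marginals to estimate chance agreement, as on the slide.
    p_rel = (sum(table[0]) + table[0][0] + table[1][0]) / (2 * total)
    p_nonrel = 1 - p_rel
    p_chance = p_rel ** 2 + p_nonrel ** 2                             # P(E)
    return (p_agree - p_chance) / (1 - p_chance)

# The table above: I-Yes/II-Yes = 300, I-Yes/II-No = 20, I-No/II-Yes = 10, I-No/II-No = 70.
print(kappa([[300, 20], [10, 70]]))     # ≈ 0.776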
Kappa statistic
• Result
• Kappa value above 0.8 is taken as good agreement
• Kappa value between 0.67 and 0.8 is taken as fair
agreement
• Kappa value below 0.67 is seen as dubious basis for
evaluation
• If there are more than two judges, it is normal
to calculate an average pairwise kappa value
Kappa statistic
• The level of agreement in TREC normally falls in
the range of “fair” (0.67–0.8)
• This does not mean that more fine-grained relevance labeling is required
• The variation has in general been found to have little impact on the relative effectiveness ranking of systems, despite the differences between individual assessors' judgments
About relevance
• Some dubious assumptions
• The relevance of one document is treated as
independent of the relevance of other documents
in the collection
• Assessments are binary and not nuanced
• Information need is treated as an absolute,
objective decision
• So any results are heavily skewed by the choice
of collection, queries, and relevance judgment
set
Marginal relevance
• It means whether a document still has
distinctive usefulness after the user has looked
at certain other documents
• The most extreme case of this is documents that are
duplicates – a phenomenon that is actually very
common on the Web
• Maximizing marginal relevance requires returning
documents that exhibit diversity and novelty
6 Evaluation of Search Engines
• An evaluation of nine popular search engines
• 华军软件园 Evaluation Center
• 2008-07-21
Main types of search services
Average search speed
Display of search results
Search performance tests: web page search
Search performance tests: image search
Search performance tests: audio search
Search performance tests: video search
Search performance tests: file search
An evaluation of the nine major U.S. search engines
Automatic evaluation based on user behavior analysis
• A combination of the standard test collection approach and query log analysis
• Objectively reflects changes in search engine retrieval quality from the perspective of macro-level user behavior
• Provides a fast, semi-real-time platform for comparing the retrieval effectiveness of the major search engines
Click concentration
• A feature derived from query history information
• Assumption: when different users have the same search need, their clicks concentrate on one page or a few pages
• Navigational needs: a single target page
• Informational needs: diverse target pages
• Feature: the distribution of user clicks
• User click ratio p = (number of user clicks on page Page for query Q) / (total number of user clicks for query Q)
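A sketch of the user click ratio defined above, assuming the click log has already been parsed into (query, clicked_page) pairs; the log format and names are assumptions, not part of the original evaluation platform:

from collections import Counter

def click_distribution(click_log, query):
    # User click ratio p for every page clicked for the given query:
    # p = clicks on that page / total clicks recorded for the query.
    counts = Counter(page for q, page in click_log if q == query)
    total = sum(counts.values())
    return {page: n / total for page, n in counts.items()} if total else {}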
Click concentration
• Different search engines show different user click distributions
• Example: the query "电影" (movie)
[Figure: user click ratio over result ranks 1-27 for the query "电影" on baidu, google, yahoo, and sogou; y-axis from 0 to 0.35]