
A Proposal to Extend and Enrich
the Scientific Data Curation of Evaluation Campaigns
Maristella Agosti, Giorgio M. Di Nunzio, and Nicola Ferro
Information Management Systems (IMS) Research Group
Department of Information Engineering – University of Padova
{agosti, dinunzio, ferro}@dei.unipd.it
Methodology

The current approach to the laboratory evaluation of information access systems relies on the Cranfield methodology, which makes use of experimental collections.
◮ An experimental collection C, typically a triple C = (D, T, J) of a set of documents D, a set of topics T, and the relevance judgements J, allows the comparison of information access systems according to some measurements which quantify their performances.
◮ If we reason about this evaluation paradigm, a first step is to point out that experimental evaluation in the Information Retrieval (IR) field is a scientific activity and, as such, its outcomes are different kinds of valuable scientific data.
◮ Using the experimental data, we produce different performance measurements, such as precision and recall, which are the standard measures used to evaluate the performances of an Information Retrieval System (IRS) for a given experiment. Starting from these performance measurements, we can compute descriptive statistics, such as the mean or the median, to summarize the overall performances achieved by an experiment or by a collection of experiments. Finally, we can perform hypothesis tests and other statistical analyses to conduct an in-depth analysis and comparison over a set of experiments.
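As a concrete illustration of how such measurements are obtained from the experimental data, the following minimal sketch (in Python, with hypothetical run and relevance-judgement structures rather than the campaign's actual formats) computes precision, recall, and average precision for a single experiment:

```python
# Minimal sketch: set-based precision/recall and average precision for one experiment.
# The data structures (ranked_docs, relevant_docs) are hypothetical placeholders,
# not the formats actually used by the evaluation campaign.

def precision_recall(ranked_docs, relevant_docs):
    """Set-based precision and recall of a ranked result list against the judgements J."""
    hits = len(set(ranked_docs) & relevant_docs)
    precision = hits / len(ranked_docs) if ranked_docs else 0.0
    recall = hits / len(relevant_docs) if relevant_docs else 0.0
    return precision, recall

def average_precision(ranked_docs, relevant_docs):
    """Average precision: mean of the precision values at the ranks of relevant documents."""
    hits, precision_sum = 0, 0.0
    for rank, doc in enumerate(ranked_docs, start=1):
        if doc in relevant_docs:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / len(relevant_docs) if relevant_docs else 0.0

if __name__ == "__main__":
    run = ["d3", "d7", "d1", "d9", "d4"]   # ranked result list for one topic
    qrels = {"d1", "d4", "d8"}             # relevant documents for that topic (from J)
    print(precision_recall(run, qrels))    # (0.4, 0.666...)
    print(average_precision(run, qrels))   # 0.244...
```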
Extending Evaluation

◮ Scientific data, their curation, enrichment, and interpretation are essential components of scientific research. These issues are better faced and framed in the wider context of the curation of scientific data, which plays an important role in the systematic definition of a proper methodology to manage and promote the use of data.
◮ Therefore, we have to take into consideration the possibility of information enrichment of scientific data, meant as the archiving and preservation of scientific data so that the experiments, records, and observations will be available for future research, as well as the provenance, curation, and citation of scientific data items.
◮ Furthermore, some of the many possible reasons for which keeping data is important are, for example:
⊲ re-use of data for new research, including collection-based research to generate new science;
⊲ retention of unique observational data which is impossible to re-create;
⊲ retention of expensively generated data which is cheaper to maintain than to re-generate;
⊲ enhancing existing data available for research projects;
⊲ validating published research results.
[Figure: the Data, Information, Knowledge, Wisdom hierarchy applied to experimental evaluation, where Experiments correspond to Data, Measurements to Information, Statistics to Knowledge, and Papers to Wisdom.]

The different outcomes of experimental evaluation can be placed in this hierarchy as follows:
◮ data: the experimental collections and the experiments correspond to the “data level” in the hierarchy, since they are the raw, basic elements needed for any further investigation and they would have little meaning by themselves. In fact, an experiment is almost useless without a relationship with the experimental collection with respect to which it has been conducted and with the list of results it produced; those data constitute the basis for any subsequent computation;
◮ information: the performance measurements correspond to the “information level” in the hierarchy, since they are the result of computations and processing on the data, so that a meaning is associated with the data by way of some kind of relational connection. For example, precision and recall measures are obtained by relating the list of results contained in an experiment with the relevance judgements J;
◮ knowledge: the descriptive statistics and the hypothesis tests correspond to the “knowledge level” in the hierarchy, since they are a further elaboration of the information carried by the performance measurements and provide us with some insights about the experiments;
◮ wisdom: theories, models, algorithms, techniques, and observations, which are usually communicated by means of papers, talks, and seminars, correspond to the “wisdom level” in the hierarchy, since they provide interpretation, explanation, and formalization of the content of the previous levels.

Key Points

Conceptual model and metadata: the information space implied by an evaluation campaign needs an appropriate conceptual model which takes into consideration and describes all the entities involved in the evaluation campaign. From a conceptual model we can also derive appropriate data formats for exchanging information among organizers and participants.

Unique Identification Mechanism: the lack of a conceptual model also implies that there is no common mechanism for uniquely identifying the different digital objects involved in an evaluation campaign. Indeed, the possibility of citing scientific data and their further elaborations is an effective way of making scientists and researchers an active part of the digital curation process.
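As an illustration of what a conceptual-model-derived exchange format and a unique, citable identifier might look like, here is a minimal sketch; the field names, the identifier scheme, and the participant value are illustrative assumptions, not the format actually adopted by the campaign:

```python
# Minimal sketch of an exchange record for an experiment, derived from a conceptual
# model of the evaluation campaign. Field names and the identifier scheme are
# illustrative assumptions only.

import json

def experiment_id(campaign, track, run):
    """Build a unique, citable identifier for an experiment (hypothetical scheme)."""
    return f"{campaign}/{track}/{run}"

record = {
    "identifier": experiment_id("CLEF2006", "AH-MONO-FR", "30dfrexp"),
    "campaign": "CLEF2006",
    "track": "AH-MONO-FR",
    "run": "30dfrexp",
    "participant": "hypothetical-group",      # placeholder value
    "topic_fields": ["title", "description"],
    "query_construction": "AUTOMATIC",
    "pooled": True,
    "measurements": {"MAP": 0.3713, "GMAP": 0.1941, "BPREF": 0.3453},
}

# A plain JSON serialization is one possible format for exchanging such records
# between organizers and participants.
print(json.dumps(record, indent=2))
```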
Statistical Analyses: in developing an infrastructure, it is advisable to add some form of support and guidance for participants in adopting a more uniform way of performing statistical analyses on their own experiments. If this support is added, participants can not only benefit from standard experimental collections which make their experiments comparable, but they can also exploit standard tools for the analysis of the experimental results, which would make the analysis and assessment of their experiments comparable too.
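As a sketch of the kind of uniform statistical analysis meant here, the following compares the per-topic average precision of two experiments with descriptive statistics and a paired t-test; the numbers are made up, and the SciPy-based test is just one reasonable choice, not the infrastructure's actual tooling:

```python
# Minimal sketch: descriptive statistics and a paired significance test over the
# per-topic average precision of two experiments. Values are made up for
# illustration; the paired t-test is one common choice of test.

from statistics import mean, median, stdev
from scipy.stats import ttest_rel

ap_run_a = [0.31, 0.18, 0.36, 0.63, 0.46, 0.05, 0.39, 0.69, 0.08, 0.16]
ap_run_b = [0.28, 0.22, 0.30, 0.58, 0.41, 0.07, 0.35, 0.61, 0.10, 0.12]

for name, ap in (("run A", ap_run_a), ("run B", ap_run_b)):
    print(f"{name}: MAP={mean(ap):.4f} median={median(ap):.4f} std={stdev(ap):.4f}")

# Paired test: both runs are evaluated on the same topics, so observations are paired.
t_stat, p_value = ttest_rel(ap_run_a, ap_run_b)
print(f"paired t-test: t={t_stat:.3f}, p={p_value:.3f}")
```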
Running System
[Figure: screenshots of the running system for the CLEF 2006 Ad-Hoc Monolingual French track (AH-MONO-FR-CLEF2006), showing the experiment 30dfrexp (priority 1, AUTOMATIC query construction, source language French, topic fields title and description, pooled, described as "DFR with query expansion") together with its measurements, statistics, and graphs.

Overall statistics for 50 queries (total number of documents over all queries): Retrieved 50,000; Relevant 2,148; Relevant retrieved 1,804. Average precision (non-interpolated) for all relevant documents, averaged over queries: 0.3713. Geometric Mean Average Precision: 0.1941. Binary Preference (BPREF): 0.3453.

Interpolated Recall (%) vs Precision Averages (%): 0: 69.05; 10: 58.28; 20: 51.78; 30: 46.37; 40: 43.29; 50: 37.97; 60: 34.57; 70: 30.60; 80: 27.44; 90: 20.03; 100: 13.73.

Statistics of the per-topic performances of the experiment: Maximum 1.0000; Minimum 0.0000; First Quartile 0.1571; Second Quartile 0.3405; Third Quartile 0.5802; Interquartile range 0.4230; Mean 0.3713; Standard Deviation 0.2653; Lower Outlier Threshold 0.0000; Upper Outlier Threshold 1.0000; Mean With No Outliers 0.3713; Std With No Outliers 0.2653.]
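The interpolated recall-precision averages above are typically computed with the usual TREC-style interpolation, where the precision at recall level r is the maximum precision observed at any recall greater than or equal to r; a minimal sketch for a single topic, assuming this standard definition and hypothetical input:

```python
# Minimal sketch: interpolated precision at the 11 standard recall levels
# (0%, 10%, ..., 100%) for a single topic. The input is a hypothetical ranked
# list of relevance flags plus the total number of relevant documents.

def interpolated_precision(rel_flags, total_relevant):
    """Return precision interpolated at the 11 standard recall levels."""
    # (recall, precision) observed after each retrieved document
    points, hits = [], 0
    for rank, is_rel in enumerate(rel_flags, start=1):
        if is_rel:
            hits += 1
        points.append((hits / total_relevant, hits / rank))
    levels = [i / 10 for i in range(11)]
    # interpolated precision at r = max precision over all points with recall >= r
    return [max((p for r, p in points if r >= level), default=0.0) for level in levels]

if __name__ == "__main__":
    flags = [True, False, True, True, False, False, True, False, False, False]
    print(interpolated_precision(flags, total_relevant=5))
```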
[Figure (Track Overview Results and Graphs): graphs produced for the experiment and for the track:
⊲ Ad−Hoc Monolingual French track − Interpolated Recall vs Average Precision: interpolated recall (from 0% to 100%) against average precision, with one curve per pooled experiment of the track and its Mean Average Precision in the legend, ranging from UniNEfr3 [MAP 44.68%; Pooled] and UniNEfr1 [MAP 44.58%; Pooled] down to RIMAM06TL [MAP 28.51%; Pooled] and RIMAM06TDNL [MAP 26.09%; Pooled];
⊲ Ad−Hoc Monolingual French track − Box Plot of the Topics and Box plot of the Topics of the Experiment: distribution of the per-topic performances over Mean Average Precision (from 0% to 100%);
⊲ Ad−Hoc Monolingual French track − Distribution of the Topics of the Experiment: number of topics of the experiment falling in each Mean Average Precision bin (from 0% to 100%);
⊲ Precision averages (%) for individual queries: per-topic figures for Topics 301 to 350;
⊲ Ad−Hoc Monolingual French track − Comparison to Median Mean Average Precision by Topic (Topics 301 to 325 and Topics 326 to 350): per-topic difference (from −1 to 1) between the experiment 30dfrexp and the median of the track, by Topic Identifier.]
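The comparison-to-median panels plot, for one experiment, the per-topic difference between its average precision and the median average precision over all experiments of the track on that topic (this is the usual reading of such plots); a minimal sketch with hypothetical data:

```python
# Minimal sketch: difference between one run's per-topic average precision and the
# per-topic median over all runs of the track. Data are hypothetical.

from statistics import median

# per-topic average precision, keyed by topic identifier, for a few runs of the track
track_runs = {
    "runA": {"301": 0.31, "302": 0.18, "303": 0.36},
    "runB": {"301": 0.25, "302": 0.40, "303": 0.30},
    "runC": {"301": 0.52, "302": 0.12, "303": 0.44},
}

def diff_from_median(track_runs, run_name):
    """Per-topic difference between run_name's AP and the median AP over all runs."""
    diffs = {}
    for topic in track_runs[run_name]:
        med = median(run[topic] for run in track_runs.values())
        diffs[topic] = track_runs[run_name][topic] - med
    return diffs

print(diff_from_median(track_runs, "runC"))
# e.g. {'301': 0.21, '302': -0.06, '303': 0.08}, up to floating-point rounding
```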
Infrastructure

◮ The experimental evaluation is usually carried out in important international evaluation campaigns, which bring research groups together, provide them with the means for measuring the performances of their systems, and let them discuss and compare their results.
◮ There are issues raised at the international level which suggest that IR experimental evaluation, as a source of scientific data, and the evaluation methodology itself need to be reconsidered and properly supported by an organizational, hardware, and software infrastructure which allows for the management, search, access, curation, enrichment, and citation of the produced scientific data:
⊲ The EC in the i2010 Digital Library Initiative clearly states that “digital repositories of scientific information are essential elements to build European eInfrastructure for knowledge sharing and transfer, feeding the cycles of scientific research and innovation up-take”.
⊲ The US National Scientific Board points out that “organizations make choices on behalf of the current and future user community on issues such as collection access; collection structure; technical standards and processes for data curation; ontology development; annotation; and peer review”.
⊲ The Australian Working Group on Data for Science suggests to “establish a nationally supported long-term strategic framework for scientific data management, including guiding principles, policies, best practices and infrastructure”.
References
[1] Agosti, M., Di Nunzio, G. M., and Ferro, N. (2006a). A Data Curation Approach to Support In-depth Evaluation Studies. In Proc. International Workshop on New Directions in Multilingual Information Access (MLIA 2006), pages 65–68. http://ucdata.berkeley.edu/sigir2006-mlia.htm
[2] Agosti, M., Di Nunzio, G. M., and Ferro, N. (2006b). Scientific Data of an Evaluation Campaign: Do We Properly Deal With Them? In Working Notes for the CLEF 2006 Workshop. http://www.clef-campaign.org/2006/working_notes/workingnotes2006/agostiCLEF2006.pdf
[3] Di Nunzio, G. M. and Ferro, N. (2005). DIRECT: a System for Evaluating Information Access Components of Digital Libraries. In Proc. 9th European Conference on Research and Advanced Technology for Digital Libraries (ECDL 2005), pages 483–484. Lecture Notes in Computer Science (LNCS) 3652, Springer, Heidelberg, Germany.
[4] Di Nunzio, G. M. and Ferro, N. (2006). Scientific Evaluation of a DLMS: a service for evaluating information access components. In Proc. 10th European Conference on Research and Advanced Technology for Digital Libraries (ECDL 2006), pages 536–539. Lecture Notes in Computer Science (LNCS) 4172, Springer, Heidelberg, Germany.

The 6th NTCIR Workshop 2007, May 15-18, National Center of Science, Tokyo, Japan