A Proposal to Extend and Enrich the Scientific Data Curation of Evaluation Campaigns

Maristella Agosti, Giorgio M. Di Nunzio, and Nicola Ferro
Information Management Systems (IMS) Research Group
Department of Information Engineering – University of Padova
{agosti, dinunzio, ferro}@dei.unipd.it

Extending Evaluation

◮ The current approach to the laboratory evaluation of information access systems relies on the Cranfield methodology, which makes use of experimental collections.
◮ An experimental collection C allows the comparison of information access systems according to some measurements which quantify their performances.
◮ Experimental evaluation is usually carried out in important international evaluation campaigns, which bring research groups together, provide them with the means for measuring the performances of their systems, and allow them to discuss and compare their results.
◮ If we reason about this evaluation paradigm, a first step is to point out that experimental evaluation in the Information Retrieval (IR) field is a scientific activity and, as such, its outcomes are different kinds of valuable scientific data.
◮ Using the experimental data, we produce different performance measurements, such as precision and recall, which are the standard measures used to evaluate the performance of an Information Retrieval System (IRS) for a given experiment. Starting from these performance measurements, we can compute descriptive statistics, such as the mean or the median, to summarize the overall performance achieved by an experiment or by a collection of experiments. Finally, we can perform hypothesis tests and other statistical analyses to conduct an in-depth analysis and comparison over a set of experiments; the sketch below illustrates the first step of this chain.
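As a concrete illustration of the measurement step just described, here is a minimal sketch, assuming Python, of how precision, recall, and average precision can be computed for a single topic from a ranked result list and the relevance judgements J. The run, the judgements, and the function name are illustrative assumptions, not data or tooling from an actual campaign.

```python
def precision_recall_ap(ranked_docs, relevant_docs):
    """Compute precision, recall, and average precision for one topic.

    ranked_docs   -- list of document ids, in the order retrieved by the IRS
    relevant_docs -- set of document ids judged relevant (the judgements J)
    """
    hits = 0
    sum_precision = 0.0
    for rank, doc in enumerate(ranked_docs, start=1):
        if doc in relevant_docs:
            hits += 1
            sum_precision += hits / rank  # precision at each relevant document
    precision = hits / len(ranked_docs) if ranked_docs else 0.0
    recall = hits / len(relevant_docs) if relevant_docs else 0.0
    avg_precision = sum_precision / len(relevant_docs) if relevant_docs else 0.0
    return precision, recall, avg_precision

# Toy example: five retrieved documents, three judged relevant overall.
run = ["d3", "d7", "d1", "d9", "d5"]
qrels = {"d3", "d1", "d4"}
print(precision_recall_ap(run, qrels))  # (0.4, 0.6666..., 0.5555...)
```

Averaging the per-topic average precision over all the topics of a task yields the Mean Average Precision (MAP) reported in the Results and Graphs section.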
Methodology

◮ Scientific data, their curation, enrichment, and interpretation are essential components of scientific research. These issues are better faced and framed in the wider context of the curation of scientific data, which plays an important role in the systematic definition of a proper methodology to manage and promote the use of data.
◮ Therefore, we have to take into consideration the possibility of information enrichment of scientific data, meant as archiving and preserving scientific data so that the experiments, records, and observations will be available for future research, as well as the provenance, curation, and citation of scientific data items.
◮ Furthermore, some of the many possible reasons for which keeping data is important are, for example:
⊲ re-use of data for new research, including collection-based research to generate new science;
⊲ retention of unique observational data which is impossible to re-create;
⊲ retention of expensively generated data which is cheaper to maintain than to re-generate;
⊲ enhancing existing data available for research projects;
⊲ validating published research results.

[Figure: the Data-Information-Knowledge-Wisdom pyramid, with Experiments at the data level, Measurements at the information level, Statistics at the knowledge level, and Papers at the wisdom level.]

◮ data: the experimental collections and the experiments correspond to the “data level” in the hierarchy, since they are the raw, basic elements needed for any further investigation, and they would have little meaning by themselves. In fact, an experiment and the list of results obtained by conducting it are almost useless without a relationship to the experimental collection with respect to which the experiment has been conducted and the list of results produced; those data constitute the basis for any subsequent computation;
◮ information: the performance measurements correspond to the “information level” in the hierarchy, since they are the result of computations and processing on the data, so that we have associated a meaning to the data by way of some kind of relational connection. For example, the precision and recall measures are obtained by relating the list of results contained in an experiment with the relevance judgements J;
◮ knowledge: the descriptive statistics and the hypothesis tests correspond to the “knowledge level” in the hierarchy, since they are a further elaboration of the information carried by the performance measurements and provide us with some insights about the experiments;
◮ wisdom: theories, models, algorithms, techniques, and observations, which are usually communicated by means of papers, talks, and seminars, correspond to the “wisdom level” in the hierarchy, since they provide interpretation, explanation, and formalization of the content of the previous levels.

Key Points

◮ Conceptual Model and Metadata: the information space implied by an evaluation campaign needs an appropriate conceptual model which takes into consideration and describes all the entities involved in the evaluation campaign. From a conceptual model we can also derive appropriate data formats for exchanging information among organizers and participants (see the first sketch below).
◮ Unique Identification Mechanism: the lack of a conceptual model also implies that there is no common mechanism for uniquely identifying the different digital objects involved in an evaluation campaign. Indeed, the possibility of citing scientific data and their further elaborations is an effective way of making scientists and researchers an active part of the digital curation process (see the second sketch below).
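To illustrate the Conceptual Model and Metadata point, the following is a purely hypothetical sketch of what a derived exchange format for a single experiment record might look like. Every field name is an assumption made for illustration; only the track, experiment, and measurement values echo the 30dfrexp figures reported in the Results and Graphs section.

```python
import json

# Hypothetical experiment record derived from a conceptual model; the
# schema is illustrative, not the format actually used by any campaign.
experiment_record = {
    "campaign": "CLEF 2006",
    "track": "AH-MONO-FR-CLEF2006",
    "experiment": "30dfrexp",
    "query_construction": "AUTOMATIC",
    "topic_fields": ["title", "description"],
    "pooled": True,
    "measurements": {"map": 0.3713, "gmap": 0.1941, "bpref": 0.3453},
}
print(json.dumps(experiment_record, indent=2))
```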
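For the Unique Identification Mechanism, here is a minimal sketch of one possible hierarchical naming scheme for the digital objects of a campaign; the URI layout and the mint_identifier helper are illustrative assumptions, not an existing standard.

```python
def mint_identifier(campaign, track, experiment, measurement=None):
    """Build a hierarchical, citable identifier for a campaign object.

    Identifiers nest from the campaign down to a single measurement, so
    that every level of the hierarchy can be cited unambiguously.
    """
    parts = [campaign, track, "experiments", experiment]
    if measurement is not None:
        parts += ["measurements", measurement]
    return "/".join(parts)

print(mint_identifier("clef2006", "ah-mono-fr", "30dfrexp"))
# clef2006/ah-mono-fr/experiments/30dfrexp
print(mint_identifier("clef2006", "ah-mono-fr", "30dfrexp", "map"))
# clef2006/ah-mono-fr/experiments/30dfrexp/measurements/map
```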
◮ Statistical Analyses: in developing an infrastructure, it is advisable to add some form of support and guidance so that participants adopt a more uniform way of performing statistical analyses on their own experiments. If this support is added, participants not only benefit from standard experimental collections which make their experiments comparable, but can also exploit standard tools for the analysis of the experimental results, which would make the analysis and assessment of their experiments comparable too (the sketch below illustrates the kind of analysis such tools could standardize).
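A minimal sketch, assuming Python with SciPy, of the kind of uniform analysis such standard tools could provide: a paired significance test on the per-topic average precision of two experiments run on the same topics. The AP values are illustrative toy data.

```python
from scipy import stats

# Per-topic average precision of two experiments over the same ten topics.
ap_run_a = [0.31, 0.18, 0.36, 0.63, 0.46, 0.05, 0.39, 0.69, 0.08, 0.16]
ap_run_b = [0.28, 0.21, 0.30, 0.60, 0.41, 0.07, 0.35, 0.62, 0.10, 0.12]

# Paired t-test (assumes roughly normal per-topic differences) ...
t_stat, t_p = stats.ttest_rel(ap_run_a, ap_run_b)
# ... and the non-parametric Wilcoxon signed-rank test as an alternative
# when the normality assumption is doubtful.
w_stat, w_p = stats.wilcoxon(ap_run_a, ap_run_b)

print(f"paired t-test: p = {t_p:.3f}; Wilcoxon signed-rank: p = {w_p:.3f}")
```

If every participant ran the same tests through a shared service, the resulting significance figures would be directly comparable across groups, which is exactly the uniformity argued for above.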
Infrastructure

◮ There are issues raised at the international level which suggest that IR experimental evaluation, as a source of scientific data, and the evaluation methodology itself need to be reconsidered, so as to be properly supported by organizational, hardware, and software infrastructures which allow for the management, search, access, curation, enrichment, and citation of the produced scientific data.
⊲ The EC, in the i2010 Digital Library Initiative, clearly states that “digital repositories of scientific information are essential elements to build European eInfrastructure for knowledge sharing and transfer, feeding the cycles of scientific research and innovation up-take”.
⊲ The US National Scientific Board points out that “organizations make choices on behalf of the current and future user community on issues such as collection access; collection structure; technical standards and processes for data curation; ontology development; annotation; and peer review”.
⊲ The Australian Working Group on Data for Science suggests to “establish a nationally supported long-term strategic framework for scientific data management, including guiding principles, policies, best practices and infrastructure”.

Track Overview: Results and Graphs

◮ The experiment 30dfrexp, submitted to the Ad-Hoc Monolingual French track (AH-MONO-FR-CLEF2006), is used as a running example: priority 1, AUTOMATIC query construction, French source language, title and description topic fields, pooled, using DFR with query expansion.

Overall statistics for 50 queries (total number of documents over all queries):

Retrieved:          50,000
Relevant:           2,148
Relevant retrieved: 1,804

Average precision (non-interpolated) for all relevant documents, averaged over queries (MAP): 37.13%
Geometric Mean Average Precision (GMAP): 0.1941
Binary Preference (BPREF): 0.3453

Interpolated recall vs. average precision for 30dfrexp:

Recall (%):    0      10     20     30     40     50     60     70     80     90     100
Precision (%): 69.05  58.28  51.78  46.37  43.29  37.97  34.57  30.60  27.44  20.03  13.73

Precision averages (%) for individual queries (topics 301–350):

Topic:  301    302    303    304    305    306    307    308    309    310
AP (%): 30.52  17.87  35.75  63.32  46.12  5.24   39.06  68.95  8.19   16.03
Topic:  311    312    313    314    315    316    317    318    319    320
AP (%): 52.25  18.55  17.01  9.47   14.60  35.49  13.23  47.20  0.67   0.14
Topic:  321    322    323    324    325    326    327    328    329    330
AP (%): 32.83  1.22   25.41  27.80  21.75  69.84  72.34  53.80  60.60  87.67
Topic:  331    332    333    334    335    336    337    338    339    340
AP (%): 15.71  0.00   86.90  42.72  54.23  1.18   74.85  9.49   62.21  58.02
Topic:  341    342    343    344    345    346    347    348    349    350
AP (%): 100.00 51.45  27.03  27.15  35.26  24.70  8.93   73.92  69.33  40.52

Box-plot statistics of the per-topic average precision of the experiment:

Maximum                  1.0000
Minimum                  0.0000
First Quartile           0.1571
Second Quartile          0.3405
Third Quartile           0.5802
Interquartile Range      0.4230
Mean                     0.3713
Standard Deviation       0.2653
Lower Outlier Threshold  0.0000
Upper Outlier Threshold  1.0000
Mean With No Outliers    0.3713
Std With No Outliers     0.2653

[Figures: “Ad-Hoc Monolingual French track – Interpolated Recall vs Average Precision” (legend of pooled runs ranging from UniNEfr3 [MAP 44.68%] and UniNEfr1 [MAP 44.58%] down to RIMAM06TDNL [MAP 26.09%]); “Box Plot of the Topics”; “Distribution of the Topics of the Experiment” (number of topics per MAP bin); “Comparison to Median Mean Average Precision by Topic” for topics 301–325 and 326–350.]

A sketch of how such box-plot statistics can be derived from the per-topic values follows.
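To make the derivation explicit, here is a minimal sketch, assuming Python with NumPy, of how the box-plot statistics above can be computed from the per-topic average precision values of an experiment. The input values are toy data, not the actual 30dfrexp per-topic scores.

```python
import numpy as np

# Per-topic average precision values of one experiment (toy data).
ap = np.array([0.31, 0.18, 0.36, 0.63, 0.46, 0.05, 0.39, 0.69, 0.08, 0.16])

q1, q2, q3 = np.percentile(ap, [25, 50, 75])
iqr = q3 - q1
# Values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR] are flagged as outliers,
# the usual box-plot convention.
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
inliers = ap[(ap >= lower) & (ap <= upper)]

print(f"min={ap.min():.4f}  max={ap.max():.4f}")
print(f"Q1={q1:.4f}  Q2={q2:.4f}  Q3={q3:.4f}  IQR={iqr:.4f}")
print(f"mean={ap.mean():.4f}  std={ap.std(ddof=1):.4f}")
print(f"mean/std with no outliers: {inliers.mean():.4f} / {inliers.std(ddof=1):.4f}")
```

With average precision bounded in [0, 1], the outlier thresholds are effectively clamped to that interval, which would explain why the reported thresholds for 30dfrexp are 0.0000 and 1.0000 and why the statistics with and without outliers coincide.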
References

[1] Agosti, M., Di Nunzio, G. M., and Ferro, N. (2006a). A Data Curation Approach to Support In-depth Evaluation Studies. In Proc. International Workshop on New Directions in Multilingual Information Access (MLIA 2006), pages 65–68. http://ucdata.berkeley.edu/sigir2006-mlia.htm
[2] Agosti, M., Di Nunzio, G. M., and Ferro, N. (2006b). Scientific Data of an Evaluation Campaign: Do We Properly Deal With Them? In Working Notes for the CLEF 2006 Workshop. http://www.clef-campaign.org/2006/working_notes/workingnotes2006/agostiCLEF2006.pdf
[3] Di Nunzio, G. M. and Ferro, N. (2005). DIRECT: a System for Evaluating Information Access Components of Digital Libraries. In Proc. 9th European Conference on Research and Advanced Technology for Digital Libraries (ECDL 2005), pages 483–484. Lecture Notes in Computer Science (LNCS) 3652, Springer, Heidelberg, Germany.
[4] Di Nunzio, G. M. and Ferro, N. (2006). Scientific Evaluation of a DLMS: a service for evaluating information access components. In Proc. 10th European Conference on Research and Advanced Technology for Digital Libraries (ECDL 2006), pages 536–539. Lecture Notes in Computer Science (LNCS) 4172, Springer, Heidelberg, Germany.

The 6th NTCIR Workshop 2007, May 15-18, National Center of Science, Tokyo, Japan