Computing Resources

20 May 2010
LREC 2010
Building a Domain-Specific Document Collection for Evaluating Metadata Effects on Information Retrieval
Walid Magdy, Jinming Min, Johannes Leveling, Gareth Jones
School of Computing, Dublin City University, Ireland
Outline
CNGL
Objective
Data collection preparation and overview
IR test collection design
Baseline Experiments
Summary
CNGL
Centre for Next Generation Localisation (CNGL)
4 Universities: DCU, TCD, UCD, and UL
Team: 120 PhD students, PostDocs, and PIs
Supported by Science Foundation Ireland (SFI)
9 Industrial Partners: IBM, Microsoft, Symantec, …
Objective: Automation of the localisation process
Technologies: MT, AH, IR, NLP, Speech, and Dev.
Objective
Create a collection of data that is:
1. Suitable for IR tasks
2. Suitable for other research fields (AH, NLP)
3. Large enough to produce conclusive results
4. Associated with defined evaluation strategies
Prepare the collection from freely available data
YouTube
Domain specific (Basketball)
Build a standard IR test collection (document set + topic set + relevance assessments)
YouTube Videos Features
Document
- Video URL
- Video Title
Posting User
Posting date
Description
Tags
Category
Comments
Responded Videos
Related Videos
Length
Number of Favorited
Number of Ratings
Number of Views
Methodology for Crawling Data
50 NBA related queries used to search YouTube
First 700 results per query crawled with related videos
Crawled pages parsed and metadata extracted.
Extracted data represented in XML format
Non-sport category results filtered out
Used Queries:
NBA - NBA Highlights - NBA All Stars - NBA fights
Top ranked 15 NBA players in 2008 + Jordan + Shaq
29 NBA teams
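The category filtering step in the methodology above can be illustrated with a minimal sketch. The XML element names (`videos`, `video`, `url`, `title`, `category`), the example URLs, and the category label `Sports` are assumptions made for illustration; the slides do not specify the actual schema of the crawled records.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample of extracted records; the real schema is not given in the slides.
SAMPLE = """<videos>
  <video><url>http://example.org/v1</url><title>NBA Highlights</title><category>Sports</category></video>
  <video><url>http://example.org/v2</url><title>NBA theme song remix</title><category>Music</category></video>
</videos>"""


def keep_sport_videos(xml_text, wanted="Sports"):
    """Return only the records whose category matches `wanted` (case-insensitive)."""
    root = ET.fromstring(xml_text)
    kept = []
    for video in root.iter("video"):
        category = (video.findtext("category") or "").strip()
        if category.lower() == wanted.lower():
            kept.append({
                "url": video.findtext("url"),
                "title": video.findtext("title"),
            })
    return kept


sport_videos = keep_sport_videos(SAMPLE)
```

Here `sport_videos` holds only the first record, since the second is tagged with a non-sport category.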
Data Collection Overview
Crawled video pages: 61,340 pages
Max crawled related/responded video pages: 20
Max crawled comments for a given video page: 500
Comments associated with contributing user’s ID
Crawled user profiles ≈ 250k
XML sample
Topics Creation
40 topics (queries) created
Specific topics related to NBA
TREC topic = query (title) + description + narrative
<title>Michael Jordan best dunks</title>
<description>Find the best dunks through the career of Michael
Jordan in NBA. It can be a collection of dunks in matches, or dunk
contest he participated in. </description>
<narrative>A relevant video should contain at least one dunk for
Jordan. Videos of dunks for other players are not relevant. And other
plays for Jordan other than dunks are not relevant as
well</narrative>
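As a rough illustration of how such a topic could be consumed by an experiment script, the sketch below parses the three fields into a dictionary. The enclosing `<topic>` element is an assumption added so the fragment is well-formed XML; the slides only show the three inner fields, and the field texts are abbreviated here.

```python
import xml.etree.ElementTree as ET

# The <topic> wrapper is hypothetical; field texts are shortened from the example above.
TOPIC = """<topic>
<title>Michael Jordan best dunks</title>
<description>Find the best dunks through the career of Michael
Jordan in NBA.</description>
<narrative>A relevant video should contain at least one dunk for
Jordan.</narrative>
</topic>"""


def parse_topic(xml_text):
    """Extract title, description and narrative, collapsing internal line breaks."""
    root = ET.fromstring(xml_text)
    fields = ("title", "description", "narrative")
    return {f: " ".join((root.findtext(f) or "").split()) for f in fields}


topic = parse_topic(TOPIC)
query = topic["title"]  # the title field serves as the query
```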
Relevance Assessment
4 indexes created:
Title
Title + Tags
Title + Tags + Description
Title + Tags + Description + Related videos titles
5 different retrieval models used
20 different result lists, each contains 60 documents
Result lists merged with random ranking
122 to 466 documents assessed per topic
1 to 125 relevant documents per topic (avg. = 23)
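The pooling procedure described above (4 indexes × 5 retrieval models = 20 result lists, top 60 documents each, merged in random order) might look roughly like the following sketch. The function name, the fixed random seed, and the toy document IDs are assumptions for illustration, not the authors' actual scripts.

```python
import random


def build_pool(result_lists, depth=60, seed=0):
    """Merge the top `depth` documents of each ranked list into one
    de-duplicated pool, then shuffle it so assessors cannot infer ranks."""
    seen = set()
    pool = []
    for ranked in result_lists:
        for doc_id in ranked[:depth]:
            if doc_id not in seen:
                seen.add(doc_id)
                pool.append(doc_id)
    random.Random(seed).shuffle(pool)
    return pool


# Toy data: 20 overlapping result lists of 60 document IDs each.
runs = [[f"doc{(i * 7 + j) % 150}" for j in range(60)] for i in range(20)]
pool = build_pool(runs)
```

Because the lists overlap, the pool is much smaller than the 20 × 60 = 1,200 upper bound, which is consistent with the 122 to 466 documents assessed per topic.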
Baseline Experiments
Search 4 different indexes:
Title
Title + Tags
Title + Tags + Description
Title + Tags + Description + Related videos titles
Indri retrieval model used to rank results
1000 results retrieved for each search
Mean average precision (MAP) used to compare the results
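For reference, MAP can be computed as in the sketch below, using the standard definition: average precision per topic is the mean of the precision values at the ranks of relevant documents, divided by the total number of relevant documents, and MAP is the mean over topics. The toy runs and relevance judgements are invented for illustration; a ranked list is assumed to contain no duplicate document IDs.

```python
def average_precision(ranked, relevant):
    """Average precision of one ranked list against a set of relevant doc IDs."""
    if not relevant:
        return 0.0
    hits = 0
    total = 0.0
    for rank, doc_id in enumerate(ranked, start=1):
        if doc_id in relevant:
            hits += 1
            total += hits / rank  # precision at this rank
    return total / len(relevant)


def mean_average_precision(runs, qrels):
    """Mean of per-topic average precision over all judged topics."""
    scores = [average_precision(runs.get(t, []), rel) for t, rel in qrels.items()]
    return sum(scores) / len(scores)


# Toy example with two topics (hypothetical IDs).
runs = {"t1": ["a", "b", "c", "d"], "t2": ["x", "y", "z"]}
qrels = {"t1": {"a", "c"}, "t2": {"z"}}
map_score = mean_average_precision(runs, qrels)
```

In the toy example, topic `t1` scores (1/1 + 2/3) / 2 and topic `t2` scores 1/3, giving a MAP of 7/12.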
Results
[Figure: bar chart of MAP (y-axis, 0.00 to 0.45) for the four indexes: Title, Title+Tags, Title+Tags+Desc, All text fields]
Top bigrams in “Tags” field
Kobe Bryant
NBA Basketball
Lebron James
Michael Jordan
Los Angeles
All Star
Chicago Bulls
Boston Celtics
Allen Iverson
Angeles Lakers
Slam Dunk
Basketball NBA
Dwight Howard
Vince Carter
Dwyane Wade
Kevin Garnett
Toronto Raptors
Houston Rockets
Miami Heat
O’Neal
Phoenix Suns
Detroit Pistons
Tracy Mcgrady
Yao Ming
Chris Paul
Amazing Highlights
New York
Pau Gasol
Cleveland Cavaliers
NBA Amazing
Summary (new language resource)
61,340 XML docs
Metadata
Tags
Comments
Ratings
# Views
250,000 user profiles
Videos
IR test set
40 topics + rel. assess.
Sentiment Analysis
NER
Reranking using ML
AH/Personalisation
Multimedia processing
Questions & Answers
Q: Is this collection available for free?
A: No
Q: Nothing could be provided?
A: Scripts + Topics + Rel. assess. (needs updating)
Q: Any other questions?
A: …
Thank you