
CS 430 / INFO 430
Information Retrieval
Lecture 22
Metadata 4
Course Administration
Automated Creation of Metadata
Records
Sometimes it is possible to generate metadata automatically from
the content of a digital object. The effectiveness varies from field
to field.
Examples
• Images -- characteristics of color, texture, shape, etc. (crude)
• Music -- optical recognition of score (good)
• Bird song -- spectral analysis of sounds (good)
• Fingerprints (good)
Automated Information Retrieval
Using Feature Extraction
Example: features extracted from images
• Spectral features: color or tone, gradient, spectral parameters, etc.
• Geometric features: edge, shape, size, etc.
• Textural features: pattern, spatial frequency, homogeneity, etc.
Features can be recorded in a feature vector space (as in a term
vector space). A query can be expressed in terms of the same
features.
Machine learning methods, such as a support vector machine, can be used with training data to create a similarity metric between image and query (see the sketch below).
Example: Searching satellite photographs for dams in California
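As a minimal sketch of retrieval in a feature vector space (the image names, features, and values below are invented for illustration), each image and the query are represented as vectors over the same features and ranked by cosine similarity; a trained model such as an SVM could replace this fixed similarity metric:

```python
# Minimal sketch of image retrieval in a feature vector space.
# All feature values are invented; a real system would extract
# spectral, geometric, and textural features from the pixel data.
import numpy as np

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Each image: a vector of (mean color tone, edge density, texture homogeneity).
images = {
    "photo_017": np.array([0.31, 0.84, 0.22]),
    "photo_042": np.array([0.75, 0.10, 0.68]),
    "photo_099": np.array([0.29, 0.79, 0.30]),
}

# A query expressed in the same feature space, e.g., derived from an
# example image of a dam.
query = np.array([0.30, 0.80, 0.25])

# Rank images by similarity to the query.
for name, vec in sorted(images.items(), key=lambda kv: -cosine(query, kv[1])):
    print(f"{name}: {cosine(query, vec):.3f}")
```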
Example: Blobworld
[Screenshots of the Blobworld image retrieval system]
Effective Information Discovery
With Homogeneous Digital Information
Comprehensive metadata with Boolean retrieval
Can be excellent for well-understood categories of material, but
requires standardized metadata and relatively homogeneous
content (e.g., MARC catalog).
Full text indexing with ranked retrieval
Can be excellent, but the methods were developed and validated for relatively homogeneous textual material (e.g., the TREC ad hoc track).
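A minimal sketch of the contrast (all records and field names are invented for illustration): Boolean retrieval over standardized metadata returns an exact, unranked set, while full-text retrieval scores and ranks every document against the query.

```python
# Minimal sketch: Boolean metadata retrieval versus ranked full-text
# retrieval. All records and field names are invented for illustration.
from collections import Counter

records = [
    {"id": 1, "subject": "geology", "year": 1998,
     "text": "earthquake fault data for California"},
    {"id": 2, "subject": "geology", "year": 2001,
     "text": "volcano eruption records"},
    {"id": 3, "subject": "ornithology", "year": 2001,
     "text": "bird song recordings from California"},
]

# Boolean retrieval over metadata fields: an exact, unranked set.
boolean_hits = [r["id"] for r in records
                if r["subject"] == "geology" and r["year"] >= 2000]
print(boolean_hits)  # [2]

# Ranked full-text retrieval: score every document by term overlap
# (a crude stand-in for tf-idf ranking).
def score(query, text):
    q, t = Counter(query.lower().split()), Counter(text.lower().split())
    return sum(min(q[w], t[w]) for w in q)

query = "California earthquake data"
ranked = sorted(records, key=lambda r: score(query, r["text"]), reverse=True)
print([r["id"] for r in ranked])  # [1, 3, 2]
```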
Mixed Content
Examples: NSDL-funded collections at Cornell
Atlas. Data sets of earthquakes, volcanoes, etc.
Reuleaux. Digitized kinematic models from the nineteenth century.
Laboratory of Ornithology. Sound recordings, images, and videos of birds and other animals.
Nuprl. Logic-based tools to support programming and to
implement formal computational mathematics.
Mixed Metadata: the Chimera of
Standardization
Technical reasons
(a) Characteristics of formats and genres
(b) Differing user needs
Social and cultural reasons
(a) Economic factors
(b) Installed base
Information Discovery in a Messy World
Building blocks
Brute force computation
The expertise of users -- human in the loop
Methods
(a) Better understanding of how and why users seek information
(b) Relationships and context information
(c) Multi-modal information discovery
(d) User interfaces for exploring information
Understanding How and Why Users
Seek Information
Homogeneous content
All documents are assumed equal
Criterion is relevance (binary measure)
Goal is to find all relevant documents (high recall)
Hits ranked in order of similarity to query
Mixed content
Some documents are more important than others
Goal is to find the most useful documents on a topic and then browse
Hits ranked in an order that combines importance and similarity to the query (see the sketch below)
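A minimal sketch of such a combined ranking (the weight and all scores are invented for illustration): each hit's rank comes from a weighted combination of a query-independent importance score and a similarity score.

```python
# Minimal sketch: rank hits by a weighted combination of document
# importance (query-independent) and similarity to the query.
# The weight and all scores are invented for illustration.
ALPHA = 0.4  # weight given to importance versus similarity

hits = [
    # (document, importance, similarity to query)
    ("overview article", 0.9, 0.6),
    ("raw data set",     0.3, 0.9),
    ("course homepage",  0.7, 0.5),
]

def combined(importance, similarity, alpha=ALPHA):
    return alpha * importance + (1 - alpha) * similarity

for doc, imp, sim in sorted(hits, key=lambda h: -combined(h[1], h[2])):
    print(f"{doc}: {combined(imp, sim):.2f}")
```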
Automatic Creation of Surrogates for
Non-textual Materials
Discovery of non-textual materials usually requires surrogates
• How far can these surrogates be created automatically?
• Automatically created surrogates are much less expensive than manually created ones, but have high error rates.
• If surrogates have high rates of error, is it possible to have
effective information discovery?
Example: Informedia Digital Video
Library
Collections: Segments of video programs, e.g., TV and radio news and documentary broadcasts from the Cable News Network (CNN), the British Open University, and WQED television.
Segmentation: Automatically broken into short segments of
video, such as the individual items in a news broadcast.
Size: More than 4,000 hours, 2 terabytes.
Objective: Research into automatic methods for organizing and
retrieving information from video.
Funding: NSF, DARPA, NASA and others.
Principal investigator: Howard Wactlar (Carnegie Mellon
University).
Informedia Digital Video Library
History
• Carnegie Mellon has broad research programs in speech recognition, image recognition, and natural language processing.
• 1994. Basic mock-up demonstrated the general concept of a
system using speech recognition to build an index from a sound
track matched against spoken queries. (DARPA funded.)
• 1994-1998. Informedia developed the concept of multi-modal information discovery with a series of user interface experiments. (NSF/DARPA/NASA Digital Libraries Initiative.)
• 1998- . Continued research, particularly in human-computer interaction. A commercial spin-off failed.
The Challenge
A video sequence is awkward for information discovery:
• Textual methods of information retrieval cannot be applied
• Browsing requires the user to view the sequence. Fast skimming
is difficult.
• Computing requirements are demanding (MPEG-1 requires 1.2
Mbits/sec).
Surrogates are required
Multi-Modal Information Discovery
The multi-modal approach to information retrieval:
Computer programs analyze video materials for clues, e.g., changes of scene.
• methods from artificial intelligence, e.g., speech recognition, natural language processing, image recognition.
• analysis of the video track, sound track, closed captioning if present, and any other information.
Each mode gives imperfect information. Therefore use many approaches and combine the evidence (see the sketch below).
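A minimal sketch of combining evidence across modes (the modes, weights, and scores are invented for illustration; the weights might reflect each mode's reliability): each mode produces an imperfect relevance score for a segment, and a weighted sum pools them, so errors in any single mode are damped.

```python
# Minimal sketch: pool imperfect relevance evidence from several modes.
# Modes, weights, and scores are invented for illustration; weights
# might reflect each mode's reliability (e.g., on-screen OCR text is
# often noisier than closed captions).
WEIGHTS = {"speech_transcript": 0.4, "closed_captions": 0.4, "screen_ocr": 0.2}

def pooled_score(mode_scores):
    """Weighted sum over whichever modes produced evidence."""
    return sum(WEIGHTS[mode] * s for mode, s in mode_scores.items())

segment_evidence = {
    "news_item_3": {"speech_transcript": 0.7, "closed_captions": 0.9},
    "news_item_8": {"speech_transcript": 0.8, "screen_ocr": 0.4},
}

# Rank segments by their pooled evidence.
for seg, ev in sorted(segment_evidence.items(),
                      key=lambda kv: -pooled_score(kv[1])):
    print(f"{seg}: {pooled_score(ev):.2f}")
```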
Multi-Modal Information Discovery
With mixed content and mixed metadata, the amount of information about the various resources varies greatly, but clues from many different sources can be combined.
"The fundamental premise of the research was that the
integration of these technologies, all of which are
imperfect and incomplete, would overcome the
limitations of each, and improve the overall performance
in the information retrieval task."
[Wactlar, 2000]
Informedia Library Creation
[Diagram: video, audio, and text are analyzed by speech recognition, image extraction, and natural language interpretation, then segmented into segments with derived metadata.]
Text Extraction
Source
Sound track: Automatic speech recognition using the Sphinx II and III recognition systems. (Unrestricted vocabulary, speaker independent, multi-lingual, background sounds.) Error rates of 25% and up.
Closed captions: Digitally encoded text. (Not on all video. Often
inaccurate.)
Text on screen: Can be extracted by image recognition and optical
character recognition. (Matches speaker with name.)
Query
Spoken query: Automatic speech recognition using the same system as is used to index the sound track.
Typed query: Typed directly by the user.
Multimodal Metadata Extraction
Informedia: Information Discovery
[Diagram: the user queries via natural language and browses via multimedia surrogates; the system returns requested segments and metadata from the library of segments with derived metadata.]
Limits to Scalability
Informedia has demonstrated effective information discovery
with moderately large collections
Problems with increased scale:
• Technical -- storage, bandwidth, etc.
• Diversity of content -- difficult to tune heuristics
• User interfaces -- complexity of browsing grows with scale
Lessons Learned
• Searching and browsing must be considered integrated parts
of a single information discovery process.
• Data (content and metadata), computing systems (e.g.,
search engines), and user interfaces must be designed
together.
• Multi-modal methods compensate for incomplete or error-prone data.
Interoperability
The Problem
Conventional approaches require partners to support
agreements (technical, content, and business)
But a Web-based digital library program needs thousands of very different partners
... most of whom are not directly part of the program
The challenge is to create incentives for independent
digital libraries to adopt agreements
Approaches to interoperability
The conventional approach
• Wise people develop standards: protocols, formats, etc.
• Everybody implements the standards.
• This creates an integrated, distributed system.
Unfortunately ...
• Standards are expensive to adopt.
• Concepts are continually changing.
• Systems are continually changing.
• Different people have different ideas.
Interoperability is about agreements
Technical agreements cover formats, protocols, security systems, etc., so that messages can be exchanged.
Content agreements cover the data and metadata, and include
semantic agreements on the interpretation of the messages.
Organizational agreements cover the ground rules for access,
for changing collections and services, payment, authentication,
etc.
The challenge is to create incentives for independent digital
libraries to adopt agreements
Function versus cost of acceptance
[Graph: function plotted against cost of acceptance. Approaches offering more function cost more to accept and have few adopters; simple, low-cost approaches have many adopters.]
Example: security
[Graph: security mechanisms plotted by function versus cost of acceptance, in increasing order: IP address, login ID and password, public key infrastructure.]
Example: metadata standards
[Graph: metadata standards plotted by function versus cost of acceptance, in increasing order: free text, Dublin Core, MARC.]
NSDL: The Spectrum of Interoperability
Level        Agreements                                        Example

Federation   Strict use of standards                           AACR, MARC, Z39.50
             (syntax, semantics, and business)

Harvesting   Digital libraries expose metadata;                Open Archives metadata harvesting
             simple protocol and registry

Gathering    Digital libraries do not cooperate;               Web crawlers and search engines
             services must seek out information
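As a minimal sketch of the harvesting level: the verb, parameters, and XML namespaces below are defined by the actual OAI-PMH protocol, but the repository URL is hypothetical. A service can fetch a repository's exposed Dublin Core records with a single HTTP request.

```python
# Minimal sketch of the "Harvesting" level using OAI-PMH.
# The repository URL is hypothetical; ListRecords, metadataPrefix,
# and the namespaces below are defined by the real protocol.
from urllib.parse import urlencode
from urllib.request import urlopen
import xml.etree.ElementTree as ET

BASE_URL = "https://example.org/oai"  # hypothetical OAI-PMH endpoint
params = urlencode({"verb": "ListRecords", "metadataPrefix": "oai_dc"})

with urlopen(f"{BASE_URL}?{params}") as response:
    tree = ET.parse(response)

# Print the Dublin Core title of each harvested record.
OAI = "{http://www.openarchives.org/OAI/2.0/}"
DC = "{http://purl.org/dc/elements/1.1/}"
for record in tree.iter(f"{OAI}record"):
    for title in record.iter(f"{DC}title"):
        print(title.text)
```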