200501010PSU - Edward A. Fox

PSU/Villanova/VT Discussion
Virginia Tech’s Digital Library
Research Laboratory
Jan. 10, 2005 -- PSU
Edward A. Fox, [email protected]
Virginia Tech, Blacksburg, VA 24061 USA
http://fox.cs.vt.edu/talks/
http://fox.cs.vt.edu/cv.htm
Acknowledgements (Selected)
• Sponsors: ACM, Adobe, AOL, CAPES,
CNI, CONACyT, DFG, IBM, Microsoft,
NASA, NDLTD, NLM, NSF (IIS-9986089,
0086227, 0080748, 0325579; ITR0325579; DUE-0121679, 0136690,
0121741, 0333601), OCLC, SOLINET,
SUN, SURA, UNESCO, US Dept. Ed.
(FIPSE), VTLS
Acknowledgements: Faculty, Staff
• Lillian Cassel, Debra Dudley, Roger
Ehrich, Joanne Eustis, Weiguo Fan,
James Flanagan, C. Lee Giles, Eberhard
Hilf, John Impagliazzo, Filip Jagodzinski,
Rohit Kelapure, Neill Kipp, Douglas
Knight, Deborah Knox, Aaron Krowne,
Alberto Laender, Gail McMillan, Claudia
Medeiros, Manuel Perez, Naren
Ramakrishnan, Layne Watson, …
Acknowledgements: Students
• Pavel Calado, Yuxin Chen, Fernando
Das Neves, Shahrooz Feizabadi, Robert
France, Marcos Goncalves, Nithiwat
Kampanya, S.H. Kim, Aaron Krowne,
Bing Liu, Ming Luo, Paul Mather,
Saverio Perugini, Unni Ravindranathan,
Ryan Richardson, Rao Shen, Ohm
Sornil, Hussein Suleman, Ricardo
Torres, Wensi Xi, Xiaoyan Yu, Baoping
Zhang, Qinwei Zhu, …
Stepping Stones &
Pathways:
Improving retrieval by
chains of relationships
between
document topics
Fernando Das-Neves, Virginia Tech DLRL
A Little Experiment
(Compare a simple query with a longer version that explicitly includes
stepping stones)
• “Literary Style in Sherlock Holmes
stories”
No. of rel. docs.
Literary Style
2
Sherlock Holmes
VS.
4
Conan Doyle
20
Sherlock Holmes
Literary Style
5
Victorian Novel
5
• Note: Numbers are total relevant web pages in top 20 Google
results for the query made up of terms on either end of the link.
Another Example
• “What is the Relationship between Data Mining and Recommender
Systems?”
Data Mining
7
Recommender
Systems
VS.
10
Data Mining
Machine Learning
Collaborative
Filtering
Social Networks
9
11
10
15
Recommender
Systems
• Naïve Results: There are many matches that are possible answers.
• Discussion: But, many of the pages with co-occurrences give no real
information about the requested relationship.
An Alternative Interpretation of a
Query in IR:
• A query represents two related, separable
concepts.
• Objective: Retrieve a sequence of documents
that support a valid set of chains of relationships
between the two concepts.
• Input: a query representing two concepts.
• Output: two groups of documents + a set of
stepping stones (document groups, i.e.,
clusters) connecting the topics by pathways
(relations among clusters).
Type of Questions Matching
Alternative Interpretation
• Ill-defined questions, with non-enumerated
answers:
– “How or why is X related to Y?”
– “What is the X of Y?”
• Even if queries with form “give me
something about X” lead to relevant docs, it
is possible to increase the quantity and
quality of information in the query result,
when relations are explicit (as a result of our
semi-automatic method).
Why is this useful?
• Questions of this type are common.
– For example, such questions often occur
during research studies.
– These occur often in educational settings,
e.g., for homework.
– These occur often in workplace settings,
requiring gathering and relating of information.
• Handling of this type of question by current
systems often is inadequate.
How to Build Stepping Stones
and Pathways?
• Our approach involves a belief network, to combine
content+structure in document similarity calculation,
including citation and co-citation similarities.
• Find two relevant document sets, each related to one of
the two original sub-queries.
• Find a diverse set of strong candidates, each connecting
the two subsets, but as different as possible from other
candidates.
• Create stepping stones by finding similar documents to
those candidates; keep the clusters that are heavily cited,
or whose documents are highly correlated (in all aspects).
• Repeat the process, finding a new stepping stone in
between each pair of clusters that are weakly related, until
the pathway length is too long, or the similarity is sufficient.
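The candidate-selection step above can be sketched in a few lines. This is a minimal illustration, not the belief-network method from the slides: it assumes plain bag-of-words cosine similarity in place of the combined content, citation, and co-citation similarity, and the function names, the toy documents, and the `redundancy` threshold are all hypothetical.

```python
import math
from collections import Counter

def vectorize(text):
    """Bag-of-words term counts (a stand-in for the belief-network similarity)."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def find_stepping_stones(docs, query_a, query_b, max_candidates=2, redundancy=0.7):
    """Pick documents that connect both sub-queries while staying different from each other."""
    qa, qb = vectorize(query_a), vectorize(query_b)
    vecs = {name: vectorize(text) for name, text in docs.items()}
    # A connecting document must relate to BOTH endpoints, so score it by its weaker link.
    scored = sorted(
        ((min(cosine(v, qa), cosine(v, qb)), name) for name, v in vecs.items()),
        reverse=True,
    )
    chosen = []
    for score, name in scored:
        if score == 0.0:
            continue  # touches only one of the two concepts
        # Diversity: skip a candidate that is too similar to one already chosen.
        if all(cosine(vecs[name], vecs[c]) < redundancy for c in chosen):
            chosen.append(name)
        if len(chosen) == max_candidates:
            break
    return chosen

# Hypothetical toy corpus for the query pair used earlier in the talk.
docs = {
    "d1": "data mining finds patterns in large data",
    "d2": "collaborative filtering applies data mining to recommender systems",
    "d3": "recommender systems suggest items to users",
    "d4": "collaborative filtering uses data mining for recommender systems ratings",
}
stones = find_stepping_stones(docs, "data mining", "recommender systems")
print(stones)
```

Scoring by the weaker of the two similarities is one simple way to prefer documents that genuinely bridge both concepts over documents that match only one of them strongly.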
Streams, Structures, Spaces,
Scenarios, and Societies (5S): A
Formal Digital Library Framework
and Its Applications
Marcos André Gonçalves
Doctoral defense
Virginia Tech, Blacksburg, VA 24061 USA
Informal 5S Definition:
DLs are complex systems that
• help satisfy info needs of users (societies)
• provide info services (scenarios)
• organize info in usable ways (structures)
• present info in usable ways (spaces)
• communicate info with users (streams)
5Ss
Ss | Examples | Objectives
Streams | Text; video; audio; image | Describes properties of the DL content such as encoding and language for textual material or particular forms of multimedia data
Structures | Collection; catalog; hypertext; document; metadata | Specifies organizational aspects of the DL content
Spaces | Measure; measurable, topological, vector, probabilistic | Defines logical and presentational views of several DL components
Scenarios | Searching, browsing, recommending | Details the behavior of DL services
Societies | Service managers, learners, teachers, etc. | Defines service managers, responsible for running DL services; actors, that use those services
Hypotheses
• A formal theory for DLs can be built
based on 5S.
• The formalization can serve as a
basis for modeling and building high-quality DLs.
5S Framework and DL Development
(Gonçalves)
[Figure: DL development driven by the 5S formal theory/metamodel. Phases: Requirements, Analysis, Design, Implementation, Test, Evaluation. Tools and artifacts: 5SGraph, 5SL, 5SLGen, OO classes, workflow, components, DL, DL XML log.]
5SLGen: Automatic DL Generation
[Figure: four-phase generation process.]
• Requirements (1): 5S Meta Model, DL Expert
• Analysis (2): 5SLGraph, DL Designer, 5SL DL Model
• Design (3): 5SLGen, component pool (ODLSearch, ODLBrowse, ODLRate, ODLReview, …)
• Implementation (4): Tailored DL Services for Practitioner, Teacher, Researcher
Research Questions
1. Can we formally elaborate 5S?
2. How can we use 5S to formally describe digital libraries?
3. What are the fundamental relationships among the Ss
and high-level DL concepts?
4. How can we allow digital librarians to easily express
those relationships?
5. Which are the fundamental quality properties of a DL?
Can we use the formalized DL framework to
characterize those properties?
6. Where in the life cycle of digital libraries can key aspects
of quality be measured and how?
Outline
• Motivation: the problem
– Hypotheses and research questions
• Part 1: Theory
– 5S: introduction, formal definitions
– The formal ontology
• Part 2: Tools/Applications
– Language
– Visualization
– Generation
– Logging
• Part 3: Quality
• Conclusions, Future Work
5S and DL formal definitions and compositions (April 2004 TOIS)
[Figure: dependency graph of the formal definitions; "d." is the definition number as printed on the slide.]
• Base concepts: relation (d. 1), function (d. 2), sequence (d. 3), tuple (d. 4)*, language (d. 5), graph (d. 6), grammar (d. 7), event (d. 10), state (d. 18); measurable (d. 12), measure (d. 13), probability (d. 14), vector (d. 15), and topological (d. 16) spaces
• 5S: streams (d. 9), structures (d. 10), spaces (d. 18), scenarios (d. 21), societies (d. 24)
• Derived concepts: services (d. 22), transmission (d. 23), structural metadata specification (d. 25), descriptive metadata specification (d. 26), structured stream (d. 29), digital object (d. 30), collection (d. 31), metadata catalog (d. 32), repository (d. 33), indexing service (d. 34), searching service (d. 35), hypertext (d. 36), browsing service (d. 37)
• Digital library (minimal) (d. 38)
Digital Library Formal Ontology
[Figure: ontology diagram linking concepts under Streams (text, image, video, audio), Structures, Spaces (Measurable, Measure, Top, Pr, Vec, Metric), Scenarios, and Societies. Relations shown: is_version_of/cites/links_to, contains, describes, belongs_to, stores, is_a, employs, produces, inherits_from/includes, runs, extends, reuses, uses, precedes, happens_before, participates_in, recipient, association, executes, redefines, invokes.]
Composition of key
infrastructure services
[Figure: composition graph of the infrastructure services Authoring, Digitizing, Describing, Cataloguing, Acquiring, Submitting, Indexing, and Linking, starting from a universal collection and involving C, DMC, Ic, and Hypertext. Edges are labeled p and e.]
Composition of additional
services
[Figure: composition graph of additional services. Edges are labeled p and e; the legend distinguishes fundamental, composite, and transformer.]
• Infrastructure Services (Add_Value): Rating, Indexing, Training
• Information Satisfaction Services: Browsing, Recommending, Searching, Requesting, Filtering, Binding, Visualizing, Expanding query
Ontology: Taxonomy of Services
• Infrastructure Services
– Repository-Building, Creational: Acquiring, Authoring, Cataloging, Crawling (focused), Describing, Digitizing, Harvesting, Submitting
– Repository-Building, Preservational: Conserving, Converting, Copying/Replicating, Translating (format)
– Add Value: Annotating, Classifying, Clustering, Evaluating, Extracting, Indexing, Linking, Logging, Measuring, Rating, Reviewing (peer), Surveying, Training (classifier), Translating, Visualizing
• Information Satisfaction Services
– Binding, Browsing, Customizing, Disseminating, Expanding (query), Filtering, Recommending, Requesting, Searching
5SL: a DL Modeling language
• Domain specific languages
– Address a particular class of problems by offering
specific abstractions and notations for the domain
at hand
– Advantages: domain-specific analysis, program
management, visualization, testing, maintenance,
modeling, and rapid prototyping.
• XML-based realization of 5S
– Interoperability
– Use of many standard sub-languages (e.g., MIME
types, XML Schemas, UML notations)
Overview of 5SGraph
Workspace
(instance model)
Structured
toolbox
(metamodel)
5SGen – Version 2: ODL,
Services, Scenarios
[Figure: 5SGen generation pipeline; numbers are the step labels on the slide.]
• Inputs from the DL Designer: 5SL-Societies Model (1), 5SL-Scenario Model (6)
• From the societies model: XPath/JDOM Transform (2), XMI:Class Model (3), Xmi2Java (4), Java Classes Model (5)
• From the scenario model: XPath/JDOM Transform (7), StateChart Model (8), Scenario Synthesis (9), Deterministic FSM (10), SMC (11), Java Finite State Machine Class Controller (12)
• Component Pool (imported): ODL Search Wrapping, ODL Browse Wrapping, …
• Generated DL Services: Model (5), Controller (12), JSP User Interface View (13)
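The slide names a Deterministic FSM (10) and a Finite State Machine Class Controller (12) but does not show generated code. The following is only a hand-written sketch of what a deterministic FSM controller for one scenario could look like; the class name, the states, and the events of the search scenario are hypothetical, and Python is used here for brevity even though 5SGen emits Java.

```python
class ScenarioFSM:
    """Deterministic FSM: each (state, event) pair has at most one successor state."""

    def __init__(self, start, transitions):
        self.state = start
        self.transitions = transitions  # maps (state, event) -> next state

    def fire(self, event):
        """Apply one event and return the new state; reject events the scenario does not allow."""
        key = (self.state, event)
        if key not in self.transitions:
            raise ValueError(f"event {event!r} not allowed in state {self.state!r}")
        self.state = self.transitions[key]
        return self.state

# Hypothetical transition table for a search scenario.
SEARCH_SCENARIO = {
    ("idle", "submit_query"): "searching",
    ("searching", "results_ready"): "showing_results",
    ("showing_results", "select_record"): "showing_record",
    ("showing_record", "back"): "showing_results",
    ("showing_results", "submit_query"): "searching",
}

fsm = ScenarioFSM("idle", SEARCH_SCENARIO)
trace = [fsm.fire(e) for e in ("submit_query", "results_ready", "select_record")]
print(trace)
```

Keeping the transition table as data separates the scenario description from the controller logic, which is the general idea behind generating the controller from a scenario model.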
The XML Log Format
Log
Transaction SessionId MachineInfo Timestamp
Event
StatusInfo
Search
SearchBy
SessionInfo
RegisterInfo
ErrorInfo
Action
Browse
QueryString
Statement
Update
Collection Catalog
StoreSysInfo
Timeout
PresentationInfo
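As a rough illustration of a log entry built from the element names above, the sketch below writes one search event with Python's standard `xml.etree.ElementTree`. The nesting of the elements is an assumption, since the flattened slide does not preserve the hierarchy, and the function name and sample values are hypothetical.

```python
import xml.etree.ElementTree as ET

def make_search_log(session_id, machine, timestamp, collection, query):
    """Build a Log element holding one search transaction (nesting is assumed, not specified)."""
    log = ET.Element("Log")
    tx = ET.SubElement(log, "Transaction")
    ET.SubElement(tx, "SessionId").text = session_id
    ET.SubElement(tx, "MachineInfo").text = machine
    ET.SubElement(tx, "Timestamp").text = timestamp
    event = ET.SubElement(tx, "Event")
    action = ET.SubElement(event, "Action")
    search = ET.SubElement(action, "Search")
    ET.SubElement(search, "Collection").text = collection
    search_by = ET.SubElement(search, "SearchBy")
    ET.SubElement(search_by, "QueryString").text = query
    return log

log = make_search_log("s-001", "192.0.2.10", "2005-01-10T09:30:00",
                      "CITIDEL", "digital libraries")
xml_text = ET.tostring(log, encoding="unicode")
print(xml_text)
```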
Quality and the Information
Life Cycle
[Figure: information life cycle annotated with quality dimensions.]
• Life-cycle states: Active, Semi-Active, Inactive
• Phases: Creation, Distribution, Seeking, Utilization
• Activities: Authoring, Modifying, Describing, Organizing, Indexing, Storing, Accessing, Filtering, Archiving, Networking, Searching, Browsing, Recommending, Mining, Retention, Discard
• Quality dimensions: Accuracy, Completeness, Conformance, Timeliness, Similarity, Preservability, Pertinence, Significance, Accessibility, Relevance
Rao Shen’s Preliminary Exam:
Hypothesis and Research Questions
• The 5S framework provides effective
solutions to DL integration.
– Formally define the DL integration problem?
– Guide integration of domain-focused DLs?
  • How to formally model such domain-specific DLs?
  • How to integrate formally defined DL models into a
  union DL model?
  • How to use the union DL model to help design and
  implement high-quality integrated DLs?
– Assess the integration?
Related Work
DL interoperability approach
Consists of
Intermediary-based
Interrelated with
mapping-based
use
mediator
wrapper
use
agent
schema mapping
used in
two architectures
Consists of
federation
Union Archiving
use
hybrid mapper
has an example
SemInt
composite mapper
has an example
LSD
DL integration formalization
based on
DL interoperability approach
Consists of
Intermediary-based
Interrelated with
mapping-based
use
mediator
wrapper
use
agent
schema mapping
used in
two architectures
Consists of
federation
Union Archiving
use
hybrid mapper
composite mapper
trained by
GA
Formal Definition of DL Integration
• $DL_i = (R_i, DM_i, Serv_i, Soc_i)$, $1 \le i \le n$
– $R_i$ is a network accessible repository
– $DM_i$ is a set of metadata catalogs for all collections
– $Serv_i$ is a set of services
– $Soc_i$ is a society
• UnionRep
• UnionCat
• UnionServices
• UnionSociety
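The 4-tuple definition above can be made concrete with a small data structure. The sketch below is an assumption-laden illustration: it treats each component as a set and builds the union DL by plain set union, which the slide does not state; the class, field, and object names are hypothetical.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class DL:
    """One digital library DL_i = (R_i, DM_i, Serv_i, Soc_i), each part modeled as a set."""
    repository: frozenset  # R_i: identifiers of stored digital objects
    catalogs: frozenset    # DM_i: metadata catalog names
    services: frozenset    # Serv_i
    society: frozenset     # Soc_i

def union_dl(dls):
    """Combine DLs component by component using set union (a simplifying assumption)."""
    return DL(
        repository=frozenset().union(*(d.repository for d in dls)),  # UnionRep
        catalogs=frozenset().union(*(d.catalogs for d in dls)),      # UnionCat
        services=frozenset().union(*(d.services for d in dls)),      # UnionServices
        society=frozenset().union(*(d.society for d in dls)),        # UnionSociety
    )

# Hypothetical component DLs.
dl1 = DL(frozenset({"obj-a"}), frozenset({"catalog1"}),
         frozenset({"searching"}), frozenset({"archaeologists"}))
dl2 = DL(frozenset({"obj-b"}), frozenset({"catalog2"}),
         frozenset({"browsing"}), frozenset({"general public"}))
union = union_dl([dl1, dl2])
print(sorted(union.services))
```

A real union DL would likely do more than merge sets; for instance, metadata from different catalogs would need mapping to a common format before the union catalog is usable.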
Architecture of a Union DL
[Figure: DL1 and DL2 combined into a Union DL, layer by layer.]
• Society: DL1 (Archaeologists) and DL2 (General Public) form the Union Society (Archaeologists, General Public)
• Service: DL1 (Searching) and DL2 (Browsing) feed the Union Service (Harvesting, Mapping, Searching, Browsing, Clustering, Visualization)
• Catalog: Catalog1 and Catalog2 form the Union Catalog
• Repository: Repository1 and Repository2 form the Union Repository
Example of Union Service: CitiViz
CitiViz:
A Visual User Interface to the
CITIDEL System
ECDL 2004, Bath, England,
September 2004
Nithiwat Kampanya, Rao Shen,
Seonho Kim, Chris North, and
Edward A. Fox
[email protected]
http://fox.cs.vt.edu
A Minimal DL in the 5S Framework
Streams
Structured
Stream
Structures
Spaces
Structural
Metadata
Specification
Scenarios
Societies
services
Descriptive
Metadata
Specification
indexing
browsing searching
hypertext
Digital Object
Collection
Metadata Catalog
Repository
Minimal DL
A Minimal ArchDL in the 5S Framework
Streams
Structures
Structured
Stream
Spaces
Descriptive
Metadata
specification
Scenarios
Societies
services
SpaTemOrg
StraDia
Arch Descriptive
Metadata specification
ArchObj
indexing
browsing searching
hypertext
ArchDO
Arch Metadata catalog
ArchColl
ArchDColl
ArchDR
Minimal ArchDL
ArchDL Expert
Scenario
Sub-model
ETANA-DL
Union Services
Descriptions
5S Archaeology
MetaModel
Structure
Sub-model
VN Metadata Format
HD
Catalog
Mapping Tool
Wrapper4VN
Wrapper4HD
Inverted Files
Search
Service
XOAI
Browse DB
Browse
Service
Component
Pool
Services DB
5SGen
Other
XOAI
ETANA-DL
Services
Web Interface
Union
Catalog
Browsing
…
HD Metadata Format
ETANA-DL Metadata Format
VN
Catalog
Harvesting
Mapping
Searching
Browsing
…
ArchDL Designer
5SGraph
Computing and Information
Technology Interactive Digital
Educational Library (CITIDEL)
• Domain: computing / information
technology
• Genre: one-stop-shopping for teachers &
learners: courseware (CSTC, JERIC),
leading DLs (ACM, IEEE-CS, DB&LP,
CiteSeer), PlanetMath.org, NCSTRL
(technical reports), …
• Submission & Collection: sub/partner
collections → www.citidel.org
www.CITIDEL.org
• Led by Virginia Tech, with co-PIs:
– Fox (director, DL systems)
– Lee (history)
– Perez (user interface, Spanish support)
– Students: Ryan Richardson, Kate McDevitt,
Jon Pryor, Baoping Zhang
• Partners
– College of New Jersey (Knox)
– Hofstra (Impagliazzo)
– Villanova (Cassel)
– Penn State (Giles)
Digital library architecture for local
and interoperable CITIDEL services
EDUCATORS
Multilingual
Searching
LEARNERS
Browsing
Union Metadata
Filtering
Filtering Profiles
OAI
Data
Provider
Annotating
ADMINISTRATORS
Revising
Administering
User Profiles
Annotations
OAI
Data
Harvester
Remote and Peer Digital Libraries (e.g., NSDL-CIS)
PORTALS
SERVICES
REPOSITORIES
CITIDEL Technology Features
• Component architecture (Open Digital Library)
– Re-use and compose re-deployable digital library components
• Built using open standards & technologies
– OAI: used to collect DL resources and for DL interoperability
– XSL and XML: interface rendering with multilingual, community-based
translation of screens and content (Spanish, …)
– Perl: component integration
• ESSEX: search engine functionality
– Very fast, utilizing in-memory processing
– Includes snapshots for persistence
• Multi-scheming (Aaron Krowne, now at Emory U. Library)
– Integrates multiple classifications / views through maps, closure
• Extensions: clustering, visualization, personalization, …
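To make the OAI item in the list above concrete, here is a small sketch that builds an OAI-PMH `ListRecords` request URL of the kind a harvester would issue. The verb and argument names (`metadataPrefix`, `set`, `from`, `resumptionToken`) come from the OAI-PMH protocol; the function name and the base URL are hypothetical, and no network request is made.

```python
from urllib.parse import urlencode

def list_records_url(base_url, metadata_prefix="oai_dc", set_spec=None,
                     from_date=None, resumption_token=None):
    """Build an OAI-PMH ListRecords request URL for a data provider."""
    if resumption_token:
        # In OAI-PMH, resumptionToken is exclusive: no other arguments accompany it.
        params = {"verb": "ListRecords", "resumptionToken": resumption_token}
    else:
        params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
        if set_spec:
            params["set"] = set_spec
        if from_date:
            params["from"] = from_date  # selective harvesting by datestamp
    return f"{base_url}?{urlencode(params)}"

# Hypothetical data provider endpoint.
url = list_records_url("http://example.org/oai", from_date="2005-01-01")
print(url)
```

A harvester would fetch this URL, parse the returned XML records, and repeat with each `resumptionToken` until the provider returns none.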
Cluster Search Results from CITIDEL
Cluster NDLTD-Computing
Naren Ramakrishnan and Saverio Perugini (U. Dayton)
CITIDEL + PIPE
• Adds interaction
personalization to
CITIDEL
• Automatically
handles multi-modal
conversion to cell
phone, PDA, etc.
• Can be adapted to
any digital data set;
only requires an XML file
of content with
hierarchy maintained.
OCKHAM Library Network (NSDL)
NSDL
Services
NSDL
OCKHAM
Library
Network
OCKHAM
Services
Library
Services
Teachers
Learners
Librarians
OCKHAM (Ming Luo)
• Simplicity (à la Occam’s razor)
• Support by Mellon and DLF
• Four main ideas:
1. Components
2. Lightweight protocols
3. Open reference models (e.g., 5S, OAIS)
4. Community perspective and involvement
• Funded by NSF in NSDL, with P2P, with
Emory, Notre Dame, Oregon State, …
OCKHAM Proposed Services
• Alerting
• Browsing
• Cataloging
• Conversion
• OAI – Z39.50
• Pathfinding
• Registry
• (plus others such as from adapted ODL)
A Digital Library Case Study
• Domain: graduate
education, research
• Genre: ETDs = electronic
theses & dissertations
• Submission:
http://etd.vt.edu
• Collection:
http://www.theses.org
Project:
Networked Digital
Library of Theses
& Dissertations
(NDLTD)
http://www.ndltd.org
(supported by
Ming Luo)
OCLC SRU Interface => Dr. A.K. Tyagi
ETD Union Search Mirror Site in China (CALIS)
(http://ndltd.calis.edu.cn – popular site!)
LOCKSS Extensions:
Bing Liu, Xiaoyu Zhang, Ji-Sun Kim
• Lots of copies keep stuff safe
• Stanford (Vicky Reich)
• Initial focus on lower levels, journals
• Shift to OAI, esp. for ETDs
• Collab with Emory (Martin Halbert)
– NDIIPP: AmericanSouth, MetaArchive
– Help deploy and adapt, apply in other contexts
• Another registry
• Set of publisher manifests (information providers)
• Set of storage systems (archival storage)
Hussein Suleman
(Cape Town, South Africa)
[Figure: open digital library. Document, program, image, and video collections are connected through components that communicate over OAI-PMH and XOAI-PMH interfaces.]
Example Open Digital Library
[Figure: ETD DL for the Networked Digital Library of Theses and Dissertations (www.ndltd.org). ETD collections (ETD-1 documents, ETD-2 programs, ETD-3 images, ETD-4 video) are connected over PMH to ODLUnion and Filter components, which feed ODLSearch, ODLBrowse, and ODLRecent. The user interface offers Search, Browse, and Recent to students and researchers.]
Open Digital Library Deployments
• NDLTD (www.ndltd.org)
• Computer Science Teaching Center
(www.cstc.org)
• Computing and Information Technology
Interactive Digital Educational Library
(www.citidel.org)
• Open Archives Distributed (NSF, DFG) –
enhancements to PhysNet
• OCKHAM
• Open to others through DL-in-a-box
Interest-based User Grouping
Model
for Collaborative Filtering
in Digital Libraries
7th ICADL 2004
Shanghai, P.R. China
Dec. 15, 2004
Edward A. Fox, Seonho Kim
Virginia Tech, Blacksburg, VA 24061 USA
Some Other Students/Projects
• Wensi Xi: Matrices, reinforcement, clusters (Microsoft)
• Paul Mather: mod/sim of large DLs on clusters;
characterization: uses, files (NASA)
• Ming Luo: personalization aided by demographics
• Ryan Richardson: CLIR with concept maps
• Xiaoyan Yu: Stepping Stones and Pathways (NSF,
Fernando Das Neves completed & returned to Argentina)
• Baoping Zhang: Physics and classification (NSF, DFG)
• Several: TREC with GP
• New projects:
– Superimposed information w. PSU (NSF NSDL)
– Quality and metasearch and structure w. Emory (IMLS)
• …
Conclusion
• Many DL/IR: areas, projects, students
• Theory
• Architecture
• Modeling and simulation
• Systems development and testing to:
validate above, demonstrate innovations
• Users, interfaces, visualization, usability