Introduction to Digital Libraries
Week 1: (Digital) Libraries Defined
Old Dominion University
Department of Computer Science
CS 695 Fall 2003
Michael L. Nelson <[email protected]>
09/26/03
ODU CS CS 695 Fall 2003 Michael L. Nelson [email protected]
Overview
• As much as possible, this class will be handled
electronically
• http://www.cs.odu.edu/~mln/teaching/cs695-f03/
• http://list.odu.edu/listinfo/cs695-f03/
• Contact me anytime:
• [email protected]
• 683 6393
Objective
•An interactive class
•digital libraries is (still) too immature a subject
for traditional lecture style classes
•The students should gain
•insight and overview of the DL field
•ideas for areas of DL research and development
•preparation for the development, use,
application or management of DLs to their
work environment
Michael Lesk, Practical Digital Libraries: Books, Bytes & Bucks
very highly recommended book!!!
it is possible to get through the class without it, but it is an
excellent addition to your collection
Frakes & Baeza-Yates, Information Retrieval: Data Structures & Algorithms
will be used in lecture #3. not required, but a very nice book with real code
Baeza-Yates & Ribeiro-Neto, Modern Information Retrieval
a follow-on to the above; most likely the book to be used in Dr. Bollen’s
IR class in the spring
Caveats
•I’m a computer scientist
•no traditional library experience
•I’m an information radical
•“information wants to be free”
•Projects I’ve been involved with will receive
preferential treatment ;-)
Vannevar Bush (1890-1974)
• Director of the Office of Scientific Research and Development
  • led 6000 scientists in R&D for WWII
  • previously, science lacked large scale teams
  • also director of NACA (1939)!
• Predicted many technological advances
  • the “memex” is one whose spirit we are implementing
  • the purpose was to provide scientists the capability to exchange information; to have access to the totality of recorded information
Memex
• Integrated computer, keyboard, and desk
• “mechanized private file and library”
  • remove drudgery from information retrieval
  • suggested implementation was microfilm
  • various user operations are suggested
• Associative indexing was the main purpose
  • “the process of tying two items together is the important thing”
  • prelude to hypertext...
Memex
• Information could come pre-associatively indexed, but the key point was user customization
  • WWW still does not provide that today
• Bush observes that tools change our way of doing, and expand the horizons before us
  • full impact of WWW and DLs still not known
• Interesting: Bush’s Atlantic Monthly article did not predict free-text searching...
  • knowledge trails only; Yahoo minus keyword searching
from Lesk,
http://community.bellcore.com/lesk/columbia/session2/
from Lesk,
http://community.bellcore.com/lesk/columbia/session1/
What is a Digital Library (DL)?
•“a collection of information that is both
digitized and organized” (Lesk, p. 1)
•there are any number of alternate definitions,
but this seems fair enough
•no mention of architecture, implementation,
content, etc.
from:
http://community.bellcore.com/lesk/libbuzz.gif
M. Lesk
I’m unaware of new data, but “digital library” seems to have totally replaced “electronic library” -- MLN
How is a DL different from a
database?
• A traditional SQL database has as its basic element data items in a relation:
    select name
    from employee, project
    where employee.deptnumber = '25' AND
          project.number = '100'
• databases exploit known structures and relations
• DBMS retrieval is not probabilistic (Frakes, Baeza-Yates, p. 3)
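A minimal sketch of what "not probabilistic" means in practice, using Python's built-in sqlite3 module. The table layout and rows are invented for illustration, and the query is simplified to a single table. A row either satisfies the predicate or it does not; there is no score and no ranking.

```python
import sqlite3

# Hypothetical schema and rows, invented for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employee (name TEXT, deptnumber TEXT)")
conn.executemany(
    "INSERT INTO employee VALUES (?, ?)",
    [("Ada", "25"), ("Grace", "25"), ("Edsger", "30")],
)


def names_in_department(dept):
    """Exact-match retrieval: every row satisfying the predicate, nothing else."""
    rows = conn.execute(
        "SELECT name FROM employee WHERE deptnumber = ? ORDER BY name",
        (dept,),
    )
    return [name for (name,) in rows]


print(names_in_department("25"))  # rows whose deptnumber is exactly '25'
print(names_in_department("2"))   # a near miss matches nothing
```

Contrast this with the ranked, partial matching of IR systems discussed on the following slides.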
How is a DL different from
traditional IR systems?
• The difference is less clear
• IR systems can be considered the precursors to DLs
• The basic unit of an IR system is a document, and the focus is on textual retrieval
  • exact matching - Boolean, text pattern searching
  • inexact matching - probabilistic, vector space, clustering
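The two matching styles above can be sketched side by side. This is an illustrative toy, not how any particular IR system is implemented: the three documents are invented, Boolean matching is reduced to AND over whitespace-separated terms, and the vector space model uses raw term frequencies with cosine similarity (no stemming, stop words, or tf-idf weighting).

```python
import math
from collections import Counter

# Hypothetical three-document collection, invented for illustration.
DOCS = {
    "d1": "digital library of technical reports",
    "d2": "library card catalog",
    "d3": "digital video and digital images",
}


def tokens(text):
    return text.lower().split()


def boolean_and(query):
    """Exact matching: ids of documents that contain every query term."""
    terms = set(tokens(query))
    return sorted(d for d, text in DOCS.items() if terms <= set(tokens(text)))


def cosine(a, b):
    """Cosine similarity between two term-frequency Counters."""
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def vector_rank(query):
    """Inexact matching: all documents with a nonzero score, best first."""
    q = Counter(tokens(query))
    scored = [(cosine(q, Counter(tokens(text))), d) for d, text in DOCS.items()]
    return [d for score, d in sorted(scored, reverse=True) if score > 0]


print(boolean_and("digital library"))  # only documents containing both terms
print(vector_rank("digital library"))  # partial matches are returned too, ranked
```

The Boolean query returns a set with no order, while the vector space query returns every partially matching document in ranked order.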
How is a DL different from
traditional IR systems?
•We will consider DLs to be a superset of IR
systems
•Nomenclature change partly due to change
in implementation technology
•IR -> DL generally coincides with the spread of
WWW
•Typically, IRs provide metadata access only,
whereas DLs provide access to metadata + data
How is a DL different from the
WWW?
•The keyword is organization
•The WWW as a whole has no real organization
•Some meta searchers (Yahoo, Lycos)
attempt to add an organizational framework
to their web holdings
•However, most are focused on keyword
searching (e.g., AltaVista)
How is a DL different from the
WWW?
•Another key difference is who controls the
input into the system
•most meta searchers hunt down their holdings
• Lycos is short for Lycosidae lycosa (the “wolf spider”), which pursues its prey and does not
build a web (Mauldin, IEEE Expert, 1/97)
•some (Yahoo) have humans in the loop for
review and classification
•To date, DLs are generally more tightly
controlled, and have a targeted customer set
DL = Content + Services
digital library = collection of information both digitized and organized
-- M. Lesk, 1997
[Figure: layered diagram with the following labels]
Access: WWW (http) access (most common); non-WWW access (now uncommon)
Digital Library Services (searching, browsing, citation analysis, usage analysis, alerts)
Technologies: vector and/or Boolean search engines (traditional IR); RDBMS; file systems; other technologies
Content
• “Why not just use the WWW?”
  • WWW by itself has low archival & management characteristics
• “Why not use a RDBMS?”
  • In the same way that a card catalog is not a TL, a RDBMS is candidate technology for use in DLs
• DL is the union of the content and services defined on the content
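One way to picture "DL = Content + Services" is as a content store with services registered on top of it. The sketch below is only an illustration of that idea: the class name, the two services, and the three records are all hypothetical, and the storage is a plain dictionary standing in for whatever technology (search engine, RDBMS, file system) actually holds the content.

```python
class DigitalLibrary:
    """Toy model: a DL is its content plus the services defined on it."""

    def __init__(self, content):
        self.content = dict(content)  # identifier -> metadata record
        self.services = {}            # service name -> function over the content

    def add_service(self, name, func):
        self.services[name] = func

    def call(self, name, *args):
        return self.services[name](self.content, *args)


def search(content, term):
    """Searching service: identifiers whose title contains the term."""
    term = term.lower()
    return sorted(i for i, rec in content.items() if term in rec["title"].lower())


def browse(content, field):
    """Browsing service: identifiers grouped by the value of one field."""
    groups = {}
    for ident in sorted(content):
        groups.setdefault(content[ident][field], []).append(ident)
    return groups


# Hypothetical records, invented for illustration.
dl = DigitalLibrary({
    "r1": {"title": "Wind Tunnel Test Results", "year": 1998},
    "r2": {"title": "Digital Library Interoperability", "year": 1998},
    "r3": {"title": "Wind Shear Detection", "year": 2001},
})
dl.add_service("search", search)
dl.add_service("browse", browse)

print(dl.call("search", "wind"))
print(dl.call("browse", "year"))
```

New services (citation analysis, usage analysis, alerts) would be added the same way, without changing how the content is stored.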
How is a DL Different from a
Traditional Library?
• TL has as its focus physical objects
  • even if the card catalog (metadata) is electronic, the purpose is to point you to a physical location
  • trafficking in physical objects has both obvious and subtle implications
  • object can exist only in 1 place
  • if you have it, I can’t have it (zero-sum distribution)
  • I have to go to the object, or wait for it to come to me
TLs vs. DLs
•DLs clearly better than TLs at:
•Dissemination, storing a variety of information
•However, TL objects are more survivable
•Who will archive the research information?
•the publishers?
•the institutions?
•the authors?
•Will the average DL object still be accessible in
10 years?
How is a DL Different from a
Traditional Library?
• Digital Library
  • removing the physical restriction has obvious benefits
  • multiple access, multiple listings, electronic transmission
  • also complicates many other issues...
  • intellectual property, terms and conditions, etc.
• Note that a TL offers additional social and educational benefits
  • Most TLs offer hybrid services too.
from Lesk,
http://community.bellcore.com/lesk/columbia/session1/
TLs vs. DLs
•Where does publishing stop, and libraries
begin?
•there have always been tensions between TLs and
traditional publishers, but the roles were fairly
well defined
•DLs can muddle the separation of these
responsibilities
•result: conflict, and/or new models
Public and Special Libraries
• A service like Yahoo is more like a public library
  • accession is still controlled by the staff
  • the customer set is the general public
  • the holdings are broadly scoped
• Most DLs that we will study are more like Special Libraries
  • customer set is small and focused
  • holdings are narrowly scoped
  • accession is perhaps even more important
Public and Special Libraries
•Special Libraries
•traditional - NASA LaRC Technical Library
•digital - NASA Technical Report Server
•Public Libraries
•traditional - Norfolk Public Library
•digital - Yahoo
•When in doubt, apply the “Popular
Mechanics” test...
Scientific and Technical
Information (STI)
• Unless otherwise specified, a DL in the context of this class will be short for “Special DL”
• Specifically, we will investigate DLs that serve STI
• STI uses generally represent the vanguard of application of information technology
  • but almost never the commercial driver
  • examples: Internet, Mosaic, etc.
STI DLs
•The technology used for STI DLs today will
soon be used in broader, more general
interest applications
•So while limiting our discussion to STI DLs
is generally helpful, we must remember that
STI DLs are a subset of the application of
DL technology
DL Economic Drivers
from Lesk,
http://community.bellcore.com/lesk/columbia/session1/
DL Economic Drivers
from Lesk,
http://community.bellcore.com/lesk/columbia/session3/
DL Economic Drivers
from Lesk,
http://community.bellcore.com/lesk/columbia/session1/
What is STI?
•STI is the collection of materials,
independent of format, used in research,
development, and other technical activities
•reports, data sets, images, videos, software, etc.
•It is also the output of such R&D activities
•STI includes both white and grey literature
White and Grey Literature
•The line between the two is not always clear
•Grey Net offers an admittedly obsolete
definition of grey literature:
•“that type of publication unavailable through
normal book-selling channels, often produced
in small quantities with limited distribution,
promotion, and exploitation”
•http://www.greynet.org/
•(no longer supported by MCB as of late 2000)
White and Grey Literature
• Grey Net also admits that electronic publishing has changed this definition, and a suitable replacement is still under debate
• Intuitively though:
  • White: author and publisher are often different, the work has been independently reviewed, obtaining the work is straightforward
  • Grey: may not be reviewed, often “published” from the source origin, may be difficult to obtain
Literature Examples
•White
•Journals, books, edited conference proceedings,
etc.
•Grey
•technical reports, government reports, unedited
proceedings, non-document STI, etc.
So Why Worry About
Grey Literature?
• While White is generally perceived as having a higher pedigree, being easier to obtain (in a sense), etc., it is generally less timely and is often a summary or abstract of a larger body of work
• Some technologies can become obsolete in the time it takes to move from Grey to White
  • NASA LaRC wind tunnel example
Pyramid of STI
[Figure 2: Pyramid of Publications Rests on Unpublished STI. Publication layers: journal articles, conference papers, technical reports. Unpublished base: software, raw data, notes, video / images. Axis label: time.]
History of STI Distribution
•Originally, scientists published books to
document their findings
•but the delay was terribly long
•Then, scientists exchanged personal letters
among themselves for rapidity
•but this is point-to-point communication, not
broadcast
History of STI Distribution
•The current system of journals evolved in
the 17th century as the synthesis of both
previous models
•more timely than books, more available than
letters
•in fact, some journals with the emphasis on
“speed” still have “Letters” in their title
• historical information from (Odlyzko, 1995)
But Are Journals Still Relevant?
•People still publish in them (tenure and
promotions are still largely “count the
journal publications” exercises)
•But do people read them?
•The current use of journals is now:
•“a medium for priority claiming, quality
control, and archiving scientific work”
(Bennion, 1994)
Unavailable, or Not Worth Citing?
from Lesk,
http://community.bellcore.com/lesk/columbia/session13/
figure 9.7 in text
But Are Journals Still Relevant?
• How important is refereeing anyway?
• Most rejected papers end up published somewhere else (Lesk, p. 214)
• Referees have rejected many worthy papers, including some that are the most cited in their respective journals (Campanario, 1996)
  • this is another well studied problem, contact me for more details
But Are Journals Still Relevant?
•Different disciplines have adapted:
•physics - “the small amount of filtering
provided by refereed journals plays no effective
role in our research” (Ginsparg, 1994)
•math - “it is rare for experts in any
mathematical subject to learn of a major new
development in their area through a journal
publication” (Odlyzko, 1995)
But Are Journals Still Relevant?
•computer science
•“in his area, journals have become irrelevant”
(Odlyzko, quoting Rob Pike)
•“if it did not happen at a conference, it didn’t
happen” (Odlyzko, quoting Joan Feigenbaum)
•“if I read it in a journal, I’m not in the loop” (Grycz,
1992)
Solutions by Discipline
• Physics
  • pre-prints
• Mathematics
  • pre-prints
• Computer Science
  • technical reports, conference proceedings
• Chemistry
  • still mainly journals, but review is cursory (Quinn, 1995)
• Economics
  • working papers
Journal System - Economic Problems
•20,000 primary research journals (Bennion,
1994)
•the number of scientific papers published
annually doubles every 10-15 years (Price, 1956)
•STI does not enjoy economies of scale
•intended audiences are generally static; the content
becomes more specialized (Odlyzko, 1995)
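To put the doubling claim above in more familiar terms: if output doubles every T years at a constant rate, the implied annual growth rate is 2^(1/T) - 1. A small worked calculation, assuming steady exponential growth (the slide gives only the doubling time, so the constant-rate assumption is ours):

```python
def annual_growth_rate(doubling_years):
    """Constant annual growth rate implied by a given doubling time."""
    return 2 ** (1 / doubling_years) - 1


for years in (10, 15):
    rate = annual_growth_rate(years)
    print(f"doubling every {years} years = {rate:.1%} growth per year")
```

So a doubling time of 10-15 years corresponds to roughly 5-7% more papers each year, compounding, against a library budget that is flat or shrinking.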
Journal System - Economic Problems
•Because of the academic pressures, journals
tend to stay the same size, but the number
of titles goes up (Quandt, 1996)
•The acquisition budget of a library is
constant (or decreasing), so it must be more
selective in which titles it provides
•If libraries cancel subscriptions, the cost to
the remainder of the subscribers goes up
Journal System - Economic Problems
• The rising cost causes other libraries to cancel subscriptions, causing the price to go up further...
• Journals driving themselves out of business is a well studied problem - contact me for more information
• Odlyzko estimates that:
  American universities spend as much buying mathematics journals as the NSF spends doing mathematical research
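The cancellation feedback loop described above can be illustrated with a deliberately simple toy model. Everything in it is an assumption made for illustration, not data from the cited studies: the publisher recovers one fixed production cost by splitting it evenly among subscribers, each library cancels once the price exceeds its own limit, and the cost and limits are made-up numbers.

```python
def simulate_spiral(fixed_cost, price_limits):
    """Return the (subscriber count, price) history of a cost-recovery journal.

    Toy model, assumed for illustration: the fixed cost is split evenly
    among current subscribers, and a library cancels when the price
    exceeds the most it is willing to pay.
    """
    subscribers = sorted(price_limits)
    history = []
    while subscribers:
        price = fixed_cost / len(subscribers)
        history.append((len(subscribers), price))
        remaining = [limit for limit in subscribers if limit >= price]
        if len(remaining) == len(subscribers):
            break  # nobody cancels at this price, so the system is stable
        subscribers = remaining
    return history


# Hypothetical numbers: a journal costing 1000 to produce and ten libraries
# with different maximum prices.
limits = [95, 105, 115, 125, 140, 160, 190, 240, 330, 490]
for count, price in simulate_spiral(1000, limits):
    print(f"{count:2d} subscribers at {price:7.2f} each")
```

With these numbers each round of cancellations pushes the price past the next library's limit until no subscribers remain; if every library could afford the initial price, the loop would stop after the first round.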
Journal System - Economic Problems
•Chemical Abstracts (Lesk, pp. 203-204)
•in the 1950s, it cost “dozens of dollars per
year, and individual chemists subscribed”
•today, it costs $17,400 / year.
•Okerson & Stubbs, 1992
•university book purchases down 15% 1986-1991
•journals/faculty 14 -> 12 in same period
•by year 2017, libraries would buy nothing at all!
from Lesk,
http://community.bellcore.com/lesk/columbia/session1/
figure 9.2 in text
Journal System - Coverage Problems
•But journals only cover a fraction of
available STI
•approximately 100K domestic, unrestricted STI
technical reports (grey literature) produced
annually (Esler & Nelson, 1998)
•Print journals, by definition, cannot provide
access to non-report STI
•software, datasets, etc.
Electronic Journals?
•An experiment that most scholars agree is
good, is the eventual path, and is a great
idea for everyone else’s papers...
•until tenure is given based on publications in
electronic journals, they will not be fully
accepted
Electronic Journals
from McEldowney,
http://poe.acc.virginia.edu/~pm9k/libsci/charts.html
But How Much is STI?
from McEldowney,
http://poe.acc.virginia.edu/~pm9k/libsci/charts.html
no data for electronic journals is given, but it seems likely that it follows a similar distribution
Many DL Projects Are
“Journal Centric”
•Many DL projects (JSTOR, TULIP, etc.) are
focused on automating the traditional
journal methods
•this is acceptable for archiving past issues, but
seems unsatisfying for future STI
My Prediction for Journals
• Highly specialized titles will go completely electronic, driven by the rising cost and static readership
  • economics and academic acceptance will determine when this happens
• “Popular” titles with broader appeal will exist in a hybrid format, with both paper and electronic versions
  • “subscribers” are likely to receive the value added material (soft copy, additional materials, etc.)
Common Shortcomings of
Current DLs
•Focused on journals, despite their
decreasing relevance to some fields
•Inadequate treatment of grey literature, the
grist of technical exchange
•Non-document STI (software, datasets, etc.)
not handled