From eprint archives to open archives and OAI

Lessons from the
Open Citation Project
Presented by Steve Hitchcock, Southampton University
These slides prepared for The Open Archives Initiative: application and
exploitation, a one-day seminar on the application and exploitation of the
OAI Protocol for Metadata Harvesting, May 14, 2003, London
A joint JISC-NSF
International Digital Libraries Project 1999-2002
A post-Google information
environment
Electronic journals exist in a post-Gutenberg and a post-Google
information environment
The ability to locate a specified item of information precisely and
instantly among the mass of information available on the Web has
profound implications.
In the electronic environment the search engine has become the de
facto interface to information, rather than the fragmented packages
that have migrated from the print world.
About this presentation
• Citebase: citation-ranked search and impact discovery service
– New scientometric indices
– Evaluating Citebase
• EPrints.org software: free software to build and manage OAI-compliant eprint archives
• Growth of OAI, Eprints.org and institutional archives
• How to accelerate the growth of OAI eprint archives
Citebase, a discovery service with
usage- and citation-based ranking
http://citebase.eprints.org/
“Google for the refereed literature”
Citebase is based on a citation database
• Harvests metadata using OAI-PMH
• Extracts and indexes citations from published research papers stored
in the larger open access, OAI disciplinary archives - currently arXiv,
CogPrints and BioMed Central
• Provides impact (and other)-ranked search based on reference data
• Re-exports metadata + references
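The harvesting step above can be illustrated with a minimal sketch. It assumes a hypothetical archive endpoint (`archive.example.org`) and an invented sample response; only the `ListRecords` verb, the `oai_dc` metadata prefix and the XML namespaces come from the OAI-PMH specification. It is not Citebase's actual harvester, and the sample is parsed locally so no network access is needed.

```python
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

# Hypothetical base URL; a real archive publishes its own OAI-PMH endpoint.
BASE_URL = "http://archive.example.org/oai"

# Namespaces defined by OAI-PMH 2.0 and unqualified Dublin Core.
NS = {
    "oai": "http://www.openarchives.org/OAI/2.0/",
    "oai_dc": "http://www.openarchives.org/OAI/2.0/oai_dc/",
    "dc": "http://purl.org/dc/elements/1.1/",
}


def list_records_url(base_url, metadata_prefix="oai_dc", resumption_token=None):
    """Build a ListRecords request URL.

    A resumptionToken is an exclusive argument in OAI-PMH, so when one is
    given it replaces metadataPrefix rather than accompanying it.
    """
    if resumption_token:
        params = {"verb": "ListRecords", "resumptionToken": resumption_token}
    else:
        params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    return base_url + "?" + urlencode(params)


# Invented sample response, trimmed to the elements used below.
SAMPLE_RESPONSE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record>
      <header><identifier>oai:archive.example.org:1</identifier></header>
      <metadata>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:title>An example eprint</dc:title>
          <dc:creator>Author, A.</dc:creator>
        </oai_dc:dc>
      </metadata>
    </record>
  </ListRecords>
</OAI-PMH>"""


def parse_records(xml_text):
    """Extract identifier, titles and creators from a ListRecords response."""
    root = ET.fromstring(xml_text)
    records = []
    for rec in root.iterfind(".//oai:record", NS):
        identifier = rec.findtext(
            "oai:header/oai:identifier", default="", namespaces=NS
        )
        titles = [e.text for e in rec.iterfind(".//dc:title", NS)]
        creators = [e.text for e in rec.iterfind(".//dc:creator", NS)]
        records.append(
            {"identifier": identifier, "titles": titles, "creators": creators}
        )
    return records


if __name__ == "__main__":
    print(list_records_url(BASE_URL))
    for record in parse_records(SAMPLE_RESPONSE):
        print(record)
```

A real harvester would fetch each URL, follow any resumptionToken until the list is complete, and then pass the full texts to a separate citation-extraction step, which this sketch does not attempt.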
Some old and new scientometric
(“publish or perish”) indices of
research impact
• Quality-level and citation-counts of the journal in which
the article appears
• Citation-counts for the article
• Citation-counts for the researcher
• Co-citations, co-text (cited with whom/what else?)
• Citation-counts for the preprint
• Usage-measures (“hits”, Webmetrics)
• Time-course analyses, early predictors, etc.
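Two of the indices listed above, citation counts per article and co-citations, can be sketched on a toy citation graph. The paper names and reference lists are invented for illustration, and the functions are a simple assumption about how such counts might be computed, not a description of how Citebase calculates them.

```python
from collections import Counter
from itertools import combinations

# Hypothetical data: each citing paper maps to the papers in its reference list.
REFERENCES = {
    "paper_A": ["paper_X", "paper_Y"],
    "paper_B": ["paper_X", "paper_Y", "paper_Z"],
    "paper_C": ["paper_X"],
}


def citation_counts(references):
    """Count how many distinct citing papers refer to each cited paper."""
    counts = Counter()
    for cited in references.values():
        # set() ensures a paper cited twice in one reference list counts once.
        counts.update(set(cited))
    return counts


def cocitation_counts(references):
    """Count how often each pair of papers appears in the same reference list."""
    pairs = Counter()
    for cited in references.values():
        for a, b in combinations(sorted(set(cited)), 2):
            pairs[(a, b)] += 1
    return pairs


if __name__ == "__main__":
    print(citation_counts(REFERENCES).most_common())
    print(cocitation_counts(REFERENCES).most_common())
```

Sorting by these counts gives a basic citation-ranked result list; counts per researcher or per journal follow by grouping the cited papers accordingly.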
Citebase, a new interface to the
scholarly literature
Time-Course of Citations (red)
and Usage (hits, green)
Witten, Edward (1998) String Theory and Noncommutative Geometry Adv. Theor. Math. Phys. 2 : 253
1. Preprint or Postprint appears.
2. It is downloaded (and sometimes read).
3. Eventually citations may follow (for more important papers).
4. This generates more downloads, etc.
“Perhaps the most important new information to become available for bibliometric studies is the per article
readership information.”
Kurtz et al. (2003) "The NASA Astrophysics Data System: Sociology, Bibliometrics and Impact"
http://cfa-www.harvard.edu/~kurtz/jasist-submitted.ps
Evaluating Citebase
http://opcit.eprints.org/opcitevaluation.shtml
• First detailed user evaluation of an open access Web citation
indexing service
• The evaluation was aimed at users of arXiv, and all others who use
bibliographic services to access the refereed journal literature.
• Citebase was evaluated by nearly 200 users from different
backgrounds between June and October 2002
• Just prior to the evaluation Citebase had records for 230,000
papers, indexing 5.6 million references.
• By discipline, approximately 200,000 of these papers are classified
within arXiv physics archives.
Results of Citebase evaluation
• Web-based citation indexing of open access eprint archives is closer to a
state of readiness for serious use than had previously been realised
• Within the scope of its primary components, the search interface and
services available from its rich bibliographic records, Citebase can be used
simply and reliably for the purpose intended
• Tasks can be accomplished efficiently with Citebase regardless of the
background of the user
• Links to citing and co-citing papers are features of Citebase that are valued
by users
• Citebase compares favourably with other bibliographic services
• Coverage is seen as a limiting factor. Non-physicists were frustrated at the
lack of papers from other sciences
Accomplishing tasks with Citebase
Tasks can be accomplished
efficiently with Citebase
regardless of the
background of the user.
A key part of the evaluation
assessed the usability of
Citebase with a practical
exercise to build a short
bibliography based on a
series of questions
[Figure: results of the practical exercise, in two panels: All users; Physicists only. Legend: yellow line, T = true; blue, F = false; purple, N = no response.]
Most useful features of Citebase
Links to citing and co-citing papers are features of Citebase that are
valued by users
Citebase compares favourably
with other bibliographic services
Growth of OAI, Eprints.org and
Institutional Archives
How OAI Archives for institutional research output have been
growing – and how to accelerate their growth
The following slides are taken from the presentation The Research
Impact Cycle, which contains key data on the growth of open access
through the self-archiving of institutional (peer-reviewed) research.
These data can be freely used or adapted for other talks. Copy this
PPT version for reuse.
http://www.ecs.soton.ac.uk/~harnad/Temp/self-archiving.ppt
Data collected and analysed by Tim Brody, Electronics and
Computer Science, Southampton University
Growth in number of OAI Archives
(now 140+ Archives, but the average number of papers per Archive (9000)
needs to grow faster!)
[Figure: Number of Archives and Mean Number of Papers Per Archive (all OAI Archives). Series: Cumulative Archives to Date (axis 0 to 160); Cumulative Mean Records per Archive (axis 0 to 10,000). Horizontal axis: date, in quarterly steps.]
EPrints.org software
http://www.eprints.org/
Generates eprint archives that are compliant with the OAI Protocol for
Metadata Harvesting.
Eprints.org software has been used to build institutional archives, and
disciplinary archives.
In conjunction with OAI, Eprints.org has been a primary motivator
for institutional archives
Eprints.org v. 2.0 released February 2002 (now on v. 2.2.1)
EPrints is free (GPL) software, aimed at organisations and communities.
Growth in number of Eprints.org Archives (c. 70)
(again, average number of papers per Archive [c. 120] needs to grow faster!)
[Figure: Cumulative Number of Eprints.org Archives and Mean Number of Papers Per Archive (- top 3). Series: Cumulative Archives to Date; Mean Records per Archive. Horizontal axis: date, in two-month steps from March 2001 to November 2002.]
Work that needs to be done to
accelerate growth per archive
These curves must become convex upward:
Institutional self-archiving policies are needed
[Figure: Latency of Record Additions to New EPrints Archives. Series: New Records in Latency Period; Mean New Records per Archive. Horizontal axis: Latency (30 Day Periods), 0 to 23.]
What have we learned from the
Open Citation Project?
• OAI is gathering momentum
• Software for building OAI repositories is available
• Institutional archives are beginning to be created, but need to be filled
by authors
• Attracting authors requires evidence of services that will improve the
visibility and impact of their works
• Citation-ranked search and reference linking are examples of OAI
services that do this
“Online or Invisible?”
(Lawrence 2001)
“average of 336% more citations to online articles compared to offline
articles published in the same venue”
Lawrence, S. (2001) “Free online availability substantially increases a
paper's impact”. Nature, 411 (6837): 521
http://www.neci.nec.com/~lawrence/papers/online-nature01/
What is needed to fill the archives
1. Universities: Adopt a university-wide policy of self-archiving all university research output, e.g. Southampton (ECS) Research Self-Archiving Policy
http://www.ecs.soton.ac.uk/~lac/archpol.html
2. Departments: Create Departmental OAI-compliant Eprint Archives
3. University Libraries: Provide digital library support for research self-archiving and archive-maintenance
4. Promotion Committees: Request a standardized online CV from all candidates, with refereed publications all linked to their full-texts in the Departmental Archives
5. Research Funders: Assess research impact online (from the online CVs)
Mandating online UK Research
Assessment CVs linked to
university eprint archives
"will set an example for the rest of the world that will almost
certainly be emulated in terms of research assessment and
research access"
Ariadne, issue 35, April 30, 2003
http://www.ariadne.ac.uk/issue35/harnad/
Exploiting OAI
• OAI has become the critical technical infrastructure for open
access to author self-archived papers in institutional archives
• OAI enables cross-archive services such as Citebase
• Open access data and services promise increased visibility
and impact for authors
• OAI resources will begin to grow significantly when authors
realise this, and when research councils start mandating open
access to the publication of results of funded research
Credits: Open Citation Project @
Southampton
• Principal Investigator is Stevan Harnad
• Technical development at Southampton is directed by Les Carr
• EPrints.org software is being developed by Chris Gutteridge
• Citebase is produced and managed by Tim Brody
• Project manager is Steve Hitchcock
A copy of these slides can be found on the OpCit Web site
http://opcit.eprints.org/. Look for Papers and Presentations
Contact Steve Hitchcock: [email protected]