Challenges in Building Federation Services over Harvested Metadata
Hesham Anan, Jianfeng Tang, Kurt Maly, Michael Nelson, Mohammad Zubair, and Zhao Yang
Old Dominion University, Norfolk, VA 23529
{anan, tang_j, maly, mln, zubair, yzhao}@cs.odu.edu
Abstract. The Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH) is a de facto
standard for the federation of digital repositories. OAI-PMH defines how repositories (data providers)
expose their metadata as well as how harvesters (service providers) extract the metadata. In this paper
we concentrate on our experience as service providers. Although OAI-PMH makes it easy to harvest
data from data providers, adding specialized services requires significant work beyond the OAI-PMH
capabilities: harvesting provides only the basic service of getting metadata from repositories, while
processing these data or retrieving related metadata is not part of OAI-PMH. We present the impact of
value-added services such as citation and reference processing, equation searching, and data
normalization on the process of dynamic harvesting in Archon, an NSF-funded NSDL digital library
that federates physics collections. Dynamic harvesting introduces the challenge of keeping specialized
services consistent with the ingestion of new metadata records. We present performance data for the
overhead these services add to regular harvesting. We also present the implementation of Simple
Object Access Protocol (SOAP) web services (and a corresponding client) that enhance the usability of
our collections by allowing third parties to provide their own clients to search our databases directly.
1 Introduction
Using the Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH) standardizes and
facilitates the process of federating multiple digital libraries (DLs). However, by design this protocol does
not provide support for services beyond basic harvesting of metadata records. For example, a service that
provides reference information related to a document as well as links from references to their
corresponding documents needs the original metadata along with the metadata of the references.
Furthermore, the full text corresponding to the original metadata might be retrieved and scanned for
references. Such correlated or full text harvesting is not directly supported by OAI-PMH. Additional
processing is required to store reference-linking information and to update references to and from existing
documents. Another service example is normalization, such as populating missing metadata fields (e.g.
subject). We have implemented services to deduce the subject if missing from the harvested Dublin Core
metadata. Other examples of value-added services include displaying and searching equations that are
embedded in metadata fields such as title or abstract in formats such as LaTeX [4] or MathML [8].
Equations in these formats are not easy for users to browse or view, so we have added a service to render
these equations. However, this requires metadata post-processing to identify and extract equations and to
convert them to easily displayable images. Processes like these complicate harvesting, and some are
time-consuming enough to constrain how and when they can be scheduled. In this paper we review our experiences
with these issues in the implementation of Archon (a DL that federates physics collections) that is part of
NSF’s National Science Digital Library (NSDL) project [9]. We also discuss how the addition of these
specialized services affects the harvesting time.
After harvesting, Archon can work as an OAI aggregator that enables other DLs to harvest from it using
the OAI-PMH. However, while the OAI-PMH provides harvesting capabilities, it does not provide
distributed searching capabilities. To that end we developed web services to support searching of our
collection by third parties. To implement this, we used Simple Object Access Protocol (SOAP) based Web
services. As an illustration, we developed the Personal Digital Library Assistant (PDLA), a reference
implementation of a client that uses these services while adding its own services of book-shelving and
reference creation.
2 Process Automation
In Archon, we harvest records using OAI-PMH. This protocol provides the capability to harvest records
incrementally from OAI-enabled repositories using simple commands. However, at the core of Archon we
have high-level services that require post-processing of harvested metadata. To keep these specialized
services consistent with the ingestion of new metadata records, we implement Archon's post-harvesting
processes as tasks that can be run incrementally and automatically.
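To make the incremental flavor of this concrete, the following minimal Python sketch harvests records and follows resumption tokens; the endpoint in the usage comment and the absence of error handling and storage are simplifications, not a description of our production harvester.

import urllib.request
import xml.etree.ElementTree as ET

OAI = "{http://www.openarchives.org/OAI/2.0/}"

def harvest(base_url, from_date=None, prefix="oai_dc"):
    """Incrementally harvest records, following OAI-PMH resumption tokens."""
    url = f"{base_url}?verb=ListRecords&metadataPrefix={prefix}"
    if from_date:
        url += f"&from={from_date}"   # only records added/changed since the last run
    while url:
        tree = ET.parse(urllib.request.urlopen(url))
        for record in tree.iter(OAI + "record"):
            yield record              # hand each record to the post-processing tasks
        token = tree.find(f".//{OAI}resumptionToken")
        url = (f"{base_url}?verb=ListRecords&resumptionToken={token.text}"
               if token is not None and token.text else None)

# Usage (endpoint shown for illustration):
# for rec in harvest("http://export.arxiv.org/oai2", from_date="2003-06-01"):
#     post_process(rec)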
The Archon post-processing conceptually consists of tasks for citation and equation processing,
normalization, and a subject resolver. Some of the metadata fields (e.g. dates) have to be normalized to
compensate for many conventions used at the various repositories. Normalization is not time-consuming
and integration with the harvester is straightforward. However, citation processing, equation processing and
subject resolution have to be implemented as tasks either because they are difficult to integrate into other
processes or because of performance issues. In the following subsections, we will describe these tasks in
detail.
2.1 Citation Processing
The Archon reference-linking service provides the user with a list of the references for a specific metadata
record. Where possible the service provides links to the documents at external source archives and within
Archon. Figure 1 shows a screenshot of the bottom of a metadata record that includes a list of references
with some references not being resolved and several resolved to both external sources and internal Archon
links. The external links are generated by using the script from CERN [6].
Fig. 1. Archon Reference-linking service
Figure 2 shows a screenshot of Archon’s citation service. This service allows the user to search the
records by citation relationship. When a user clicks the "Citations" link next to a record (visible at the
right of Figure 1), a window appears that allows the user to specify a search either for all records citing
this record (citation to) or all records cited by this record (citation from) in either the full Archon collection
or the current result set.
Citation Processing is the background process that prepares and organizes data for Archon's reference-linking
service automatically after a harvest of metadata records. As shown in Figure 3, its main parts are
the reference collector, the bibliographic collector, and the reference process.
The reference collector is responsible for getting the references and uses separate approaches for
different sources. One approach is to harvest reference information using OAI-PMH. For arXiv and APS
records, the reference information is available as parallel metadata records. The second approach is to
download the full-text document and use reference extraction software. For the CERN records we use a
reference extraction script from CERN [2]. It can extract references from PDF files.
Fig. 2. Archon’s Interface for Citation Service
The bibliographic collector is responsible for getting bibliographic information. This is necessary
because most DC records currently do not provide enough information to establish relations between
metadata records and references. Where available, we harvest parallel metadata records and obtain
bibliographic information from them. Bibliographic information from different sources may differ for
the same item. For example, some may use "Phys. Rev. D" while others use "Physics Review D". To
address this problem, we normalize the bibliographic information.
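A minimal sketch of this normalization step, assuming a hand-built alias table (the entries below are illustrative, not our full mapping):

# Illustrative journal-title normalization: map variant spellings seen in
# harvested bibliographic data onto one canonical abbreviation.
JOURNAL_ALIASES = {
    "physics review d": "Phys. Rev. D",
    "physical review d": "Phys. Rev. D",
    "phys rev d": "Phys. Rev. D",
    "physics letters b": "Phys. Lett. B",
}

def normalize_journal(title: str) -> str:
    # Lower-case, strip punctuation, and collapse whitespace before lookup.
    key = " ".join(title.lower().replace(".", " ").split())
    return JOURNAL_ALIASES.get(key, title)  # unknown titles pass through unchanged

assert normalize_journal("Physics Review D") == "Phys. Rev. D"
assert normalize_journal("Phys. Rev. D") == "Phys. Rev. D"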
Fig. 3. Citation Processing Modules (reference collector: OAI harvester for parallel records plus full-text retrieval and reference extraction for other sources; bibliographic collector: harvester and parser feeding raw bibliographic data through normalization; reference process: reference resolver, crosslink, external link, and old link adjustment)
The reference process part is the main part of citation processing. It includes the reference resolver,
crosslink, external link, and old link adjustment modules. The reference resolver resolves the reference
strings for newly added records: it extracts the information in the reference string that is useful for
setting up the link to the corresponding record. For example, from the string "A. Donnachie, Phys. Lett. B 296
(1992) 227", it will identify the journal as "Phys. Lett. B", the volume as 296, and the starting page as 227.
Currently, we use the CERN software package for reference resolving. Crosslink sets up links from newly
resolved references to their corresponding records in Archon. External link sets up URLs for newly
resolved references. We use a rule-based heuristic algorithm that is based on the URL generator of the
CERN software package.
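The following Python sketch shows the flavor of this extraction for the common "journal volume (year) page" pattern; the actual resolver is the rule-based CERN package, not this single regular expression:

import re

# Simplified reference resolution: pull journal, volume, year, and starting
# page out of a reference string of the form "<authors>, <journal> <volume> (<year>) <page>".
REF_PATTERN = re.compile(
    r"(?P<journal>[A-Z][\w.\s]+?)\s+(?P<volume>\d+)\s*"
    r"\((?P<year>\d{4})\)\s*(?P<page>\d+)")

def resolve(reference: str):
    m = REF_PATTERN.search(reference)
    if m is None:
        return None            # unresolved; counted as such in Table 1
    return m.groupdict()

print(resolve("A. Donnachie, Phys. Lett. B 296 (1992) 227"))
# {'journal': 'Phys. Lett. B', 'volume': '296', 'year': '1992', 'page': '227'}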
Old link adjustment updates references that have been processed by previous modules. Because the
Archon reference-linking service links references within its federation, we have to make corresponding
changes to the links when we ingest new records. Therefore, when we add a new record to Archon, we
need to check whether there are old references that can now be linked to this record.
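A minimal sketch of this adjustment, assuming references are stored with a normalized bibliographic key (the table and column names here are hypothetical):

# Hypothetical schema: a 'reference' table with a normalized bibliographic
# key and a nullable internal link to an Archon record identifier.
def adjust_old_links(db, new_record_id, biblio_key):
    """Point previously unlinked references at a newly ingested record."""
    db.execute(
        "UPDATE reference SET internal_link = ? "
        "WHERE internal_link IS NULL AND biblio_key = ?",
        (new_record_id, biblio_key))
    db.commit()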
The current status of references in Archon is shown in Table 1. It gives the total number of references
for each archive as well as how many references have been resolved successfully, how many resolved
references have external links (to an outside source), how many have internal links (within Archon), and
how many have links of any kind (external, internal, or both). "Resolved" is higher than "Linked" because we
may be able to resolve all components to a valid internal representation while the URL generator is still
unable to construct a link. In summary, we have a success rate ranging from roughly 50% to 80%.
Table 1. Archon Reference Status

Archives   Total       External    Internal    Linked      Resolved
arXiv      4,838,158   2,191,419   1,257,367   2,790,904   2,900,347
APS        686,521     427,601     195,187     432,604     520,843
CERN       58,105      24,345      9,115       25,513      27,753
To test the validity of our rule-based link generator, we randomly selected 21 records from Archon and
analyzed their reference resolution. Table 2 shows the test results. We divide the resulting links into three
categories:
• Success: the link points to the appropriate document.
• Success (but not to an individual document): the link goes to a page containing a document list.
• Failure: the link points to neither the corresponding document nor a document list.
Table 2. External Link Success/Failure Ratio

Records                              Total external  Success  Success (not to  Failure
                                     links                    individual)
oai:aps.org:PhysRevD.59.104013             8             6          0             2
oai:aps.org:PhysRevD.16.3242              15            15          0             0
oai:arXiv.org:cond-mat/9906406            25            25          0             0
oai:aps.org:PhysRevD.64.017701             8             8          0             0
oai:arXiv.org:cond-mat/9906415             8             8          0             0
oai:arXiv.org:astro-ph/9902112            50            44          0             6
oai:arXiv.org:chao-dyn/9607003            19            17          1             1
oai:arXiv.org:hep-lat/9312002              9             9          0             0
oai:arXiv.org:nucl-th/9312002             19            16          1             2
oai:arXiv.org:hep-th/9311181              12            10          0             2
oai:arXiv.org:hep-ph/9312204              24            23          1             0
oai:arXiv.org:gr-qc/9312026               14            12          1             1
oai:arXiv.org:patt-sol/9402001            13             5          7             1
oai:arXiv.org:hep-th/9403119              25            20          4             1
oai:arXiv.org:nucl-th/9404010             13            12          0             1
oai:arXiv.org:cond-mat/9404026            17            14          1             2
oai:cds.cern.ch:CERN-PS-2002-011AE         1             1          0             0
oai:cds.cern.ch:CERN-PPE-96-008           11             9          0             2
oai:aps.org:PhysRevD.10.357                4             4          0             0
oai:aps.org:PhysRevD.20.211                9             9          0             0
oai:aps.org:PhysRevD.30.2674               6             6          0             0
Total                                    310           273         16            21
In our test, the failure rate is less than 10%, which we have found to be fairly constant in larger random
samples. The main reasons for failed links are:
• The link is to another resolver which fails to resolve it (see the sketch after this list). For example, we
generate the URL for the reference "I. G. Koh, W. Troost, and A. Van Proeyen, Nucl. Phys. B 292 (1987) 201"
(a reference of oai:aps.org:PhysRevD.59.104013) as
http://doi.crossref.org/resolve?sid=xxx:xxx&issn=05503213&volume=292&spage=201.
However, the destination resolver, CrossRef, cannot resolve it.
• The requested object does not exist. For example, we succeed in generating the right URL to APS for
"Phys. Rev. A 26, 1179 (1982)" (a reference of oai:arXiv.org:chao-dyn/9607003). However, the cited object
does not exist in APS.
• The requested page does not exist. This may indicate that the URL we generated is not right.
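For reference, a hedged sketch of how such a resolver URL can be assembled; the ISSN lookup table below is an illustrative stand-in for the rule base adapted from the CERN package:

from urllib.parse import urlencode

# Illustrative ISSN lookup; the real generator uses the rule base adapted
# from the CERN software package.
ISSN = {"Nucl. Phys. B": "05503213", "Phys. Lett. B": "03702693"}

def external_link(journal, volume, spage, sid="xxx:xxx"):
    """Build a CrossRef OpenURL-style query for a resolved reference."""
    issn = ISSN.get(journal)
    if issn is None:
        return None   # no rule applies; the reference stays unlinked
    return ("http://doi.crossref.org/resolve?" +
            urlencode({"sid": sid, "issn": issn,
                       "volume": volume, "spage": spage}))

print(external_link("Nucl. Phys. B", "292", "201"))
# http://doi.crossref.org/resolve?sid=xxx%3Axxx&issn=05503213&volume=292&spage=201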
2.2 Equation Processing
In our federation we may find equations in some parts of the metadata, represented in formats such as
LaTeX or MathML. Viewing such representations is not as intuitive as viewing the equations themselves,
so it is very useful to provide a visual tool to view the equations. We render the equations as images and
display these images when the metadata records are displayed. This requires tasks to be performed after
harvesting new metadata records. Figure 4 shows the tasks involved in equation processing. These tasks
include:
• Identifying equations: depending on the format of the equations, the boundaries of each equation within
the text of the metadata are identified; for example, LaTeX equations are delimited by special characters
such as $ (a sketch of this step follows below).
• Filtering equations: incomplete or incorrectly identified equations are removed from the list of equations.
• Equation storage: for fast browsing, images for the equations are generated and stored in a database.
Details on using equations in search can be found in [7]. None of these tasks differs between the historic
harvest of a collection and a subsequent dynamic harvest. In contrast to the citation services, we do not
have to rely on other metadata and/or full-text documents, since all the information is in the regular
metadata records. Thus, automating this service was simply a matter of invoking the appropriate module
after harvesting.
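As an illustration of the identification step, the sketch below pulls $-delimited LaTeX fragments out of a metadata field; the real extractor also handles other delimiters and formats, and the crude length filter merely stands in for EqnFilter:

import re

# Minimal sketch of the identification step: find $...$-delimited LaTeX
# fragments in a metadata field such as a title or abstract.
LATEX_EQN = re.compile(r"\$([^$]+)\$")

def extract_equations(text):
    candidates = LATEX_EQN.findall(text)
    # Crude filter standing in for EqnFilter: drop trivial fragments.
    return [eqn for eqn in candidates if len(eqn.strip()) > 1]

title = r"Solutions of $\nabla^2 \phi = 0$ on compact manifolds"
print(extract_equations(title))   # ['\\nabla^2 \\phi = 0']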
Fig. 4. Equation Processing (modules: EqnExtractor, EqnFilter with a formula filter, EqnCleaner, EqnRecorder, and an image converter (Eqn2Gif/Img2Gif) built on MathEqn, cHotEqn, and Acme.JPM.Encoders.GifEncoder)
2.3 Subject Resolvers
Some of the metadata in the archives may not be available through OAI-PMH. For example, in arXiv and
APS we encountered many missing subjects and had to either guess them or obtain them from other
resources and metadata archives. In this section, we describe our subject resolver, which tries to fill the
subject field for APS and arXiv DC records (CERN and NASA records already have appropriate subject
fields). For each arXiv record without a subject field, the subject resolver tries to guess a subject based on
its OAI identifier.
Fig. 5. Subject Resolving for APS Records (get the parallel metadata record, parse it to get PACS codes, map the codes to subject strings using the PACS specification; otherwise guess the subject from the OAI identifier)
Figure 5 shows the process of resolving the subject for an APS record. Some APS parallel metadata records
contain PACS (Physics and Astronomy Classification Scheme) codes; PACS is a subject classification
scheme for physics and astronomy. Therefore, for an APS DC record without a subject field, we take the
PACS codes from its corresponding parallel metadata record (already obtained for the citation service) and
map them to strings from the PACS schema. If the parallel metadata record does not contain PACS codes
at all, we try to guess the subject from the record's OAI identifier.
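The following sketch illustrates both paths; the category and PACS tables are abridged stand-ins for the full arXiv category list and PACS specification:

# Abridged stand-ins for the arXiv category list and the PACS specification.
ARXIV_SUBJECTS = {
    "hep-th": "High Energy Physics - Theory",
    "astro-ph": "Astrophysics",
    "cond-mat": "Condensed Matter",
}
PACS_SUBJECTS = {"04.70.Dy": "Quantum aspects of black holes"}

def guess_arxiv_subject(oai_id):
    """Guess a subject from the category embedded in an arXiv OAI identifier."""
    category = oai_id.rsplit(":", 1)[-1].split("/")[0]
    return ARXIV_SUBJECTS.get(category)

def map_pacs(codes):
    """Map PACS codes from a parallel metadata record to subject strings."""
    return [PACS_SUBJECTS[c] for c in codes if c in PACS_SUBJECTS]

print(guess_arxiv_subject("oai:arXiv.org:hep-th/9311181"))
# High Energy Physics - Theory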
3 Web Services and Applications
Web services are self-contained, self-describing, modular applications that can be described, published,
located, and remotely invoked. They can perform anything from simple requests to complicated business
processes. Once a web service is deployed, other applications (and other web services) can discover and
invoke the deployed service [14]. Web services promote interoperability by minimizing the requirements
for shared understanding as well as by enabling just-in-time integration.
We implemented several web services to expose and extend the use of our collections. Two of these web
services are:
• Search service: provides access to all search functionality without the need to use the Archon interface
(a sketch of a client follows this list).
• Bookshelf service: allows each user to have a personalized collection that is a subset of the overall
archives. It can be used for different purposes; for example, teachers can use it to collect course materials
and package them in a personalized collection.
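As an illustration only, a third-party search client might look like the sketch below, written with the Python zeep SOAP library; the WSDL URL, operation name, and result fields are hypothetical placeholders, not the published Archon interface:

from zeep import Client

# Hypothetical WSDL location, operation name, and result fields; the real
# interface is whatever the published Archon WSDL describes.
WSDL = "http://archon.cs.odu.edu/services/search?wsdl"

client = Client(WSDL)
results = client.service.Search(query="Maly", maxRecords=20)
for record in results:
    print(record.title, record.identifier)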
Fig. 6. PDLA Search Interface
Using the above web services, clients can be developed that provide special features according to the
requirements of different user communities. We developed such a client, the Personal Digital Library
Assistant (PDLA), using the search web service. This Windows client provides a customized interface for
searching the Archon collections. It can be used to select search results and store a special sub-collection
locally. It can also export selected documents' metadata in formats such as XML, text, or Microsoft Word
using selected templates such as the ACM or IEEE styles. This allows an author to create her own
bibliography of frequently used references and have them presented in formats appropriate for the
intended journal. Figure 6 shows the results of using the PDLA client to search Archon for papers written
by 'Maly'. In Figure 7 we specify that references to a selected group of papers on the bookshelf be
exported in Word format appropriate for an ACM publication, with the result shown in Figure 8.
Fig. 7. PDLA Export Interface
Fig. 8. References generated by PDLA
4 Performance
Performance was an important consideration when we implemented the Archon background processing,
since it has to complete before the next dynamic harvest starts. We ran performance tests on the OAI
harvester, citation processing, and the subject resolver for APS records. The objective was to identify any
bottlenecks. The results for harvesting DC records from NCSTRL-NCSU (Table 3), arXiv via Arc
(Table 4), and APS (Table 5), and for harvesting parallel metadata records from APS (Table 6), are shown
below.
Table 3. NCSTRL-NCSU¹

Operation    Operation Time (s)  Number of Times  Average Time (s)
Identify     0.6                 1                0.6
DB           7.0                 143              0.1
Resumption   0.1                 2                0.1
ListRecords  46.2                2                23.1
ListSets     24.8                1                24.8
Total        80.5                143              0.6

Table 4. arXiv (from Arc)

Operation    Operation Time (s)  Number of Times  Average Time (s)
Identify     0.17                1                0.2
DB           46                  1,000            0.1
Resumption   0.50                10               0.1
ListRecords  3,805.8             10               380.6
ListSets     5.7                 1                5.7
Total        3,858.3             1,000            3.9

¹ An entry of '0.1' means a time less than 0.1 s.
Table 5. APS (DC)

Operation    Operation Time (s)  Number of Times  Average Time (s)
Identify     0.2                 1                0.2
DB           9.7                 220              0.1
Resumption   0.1                 4                0.1
ListRecords  10.3                11               0.9
ListSets     0.6                 1                0.6
Total        22.5                220              0.1

Table 6. APS (Parallel)

Operation    Operation Time (s)  Number of Times  Average Time (s)
Identify     0.2                 1                0.2
DB           45.1                906              0.1
Resumption   0.4                 10               0.1
ListRecords  72.1                11               6.6
ListSets     0.6                 1                0.6
Total        125.6               906              0.1
These tables show how many times each operation was executed and how much time was spent on each,
where Identify, Resumption, ListRecords, and ListSets stand for sending the corresponding request and
getting the result, and DB stands for database operations.
From these tables, we can see that the most time-consuming operation is ListRecords, and that the
performance of the OAI harvester greatly depends on the performance of the data providers. Harvesting
one DC record took on average 0.1 seconds for APS, 0.6 seconds for NCSTRL, and 3.9 seconds for Arc.
In addition, the size of the record affects the performance of the OAI harvester: harvesting one APS
parallel metadata record, which is on average much larger than its corresponding DC record, takes more
time than harvesting one APS DC record. Table 7 and Table 8 show the performance results for citation
processing of APS and arXiv records.
Table 7. Citation Processing for APS

Operation             Operation Time (s)  Number of Times  Average Time (s)
Adjustment            13.2                895              0.1
Biblio Parsing        21.8                906              0.1
Biblio Normalization  N/A                 N/A              N/A
Ref Parsing           106.4               902              0.1
Ref Resolving         624.6               13,129           0.1
Cross-linking         103                 13,129           0.1
Total                 868.9               906              1.0

Table 8. Citation Processing for arXiv

Operation             Operation Time (s)  Number of Times  Average Time (s)
Adjustment            8.2                 614              0.1
Biblio Parsing        10.1                614              0.1
Biblio Normalization  21.9                453              0.1
Ref Parsing           134.2               614              0.2
Ref Resolving         923.7               18,797           0.1
Cross-linking         123.1               18,563           0.1
Total                 1,221.3             614              2.0
From these two tables, we can see that the most time-consuming operation is reference resolving.
Processing took about 1 second per APS record and about 2 seconds per arXiv record. The difference
arises because, for these test records, an APS record has 14 references on average while an arXiv record
has 30.
Table 9. Citation Processing for CERN

Operation       Operation Time (s)  Number of Times  Average Time (s)
Adjustment      13.6                972              0.1
Download HTML   327.9               256              1.3
Download PDF    468.8               181              2.6
Ref Extraction  117.7               179              0.7
Ref Resolving   75                  1,397            0.1
Cross-linking   9.8                 1,397            0.1
DB              16.8                2,369            0.1
Total           1,029.6             972              1.1

Table 10. APS Subject Resolving

Operation              Operation Time (s)  Number of Times  Average Time (s)
Initial operation      0.1                 1                0.1
Get parallel metadata  14.3                1,000            0.1
Parse metadata         7.1                 996              0.1
Map                    2.1                 514              0.1
Update                 19.5                996              0.1
Flag                   7.1                 996              0.1
Index                  18.5                1                18.5
Total                  68.5                1,000            0.1
Table 9 shows the results of citation processing for CERN records, which differs from processing APS or
arXiv records: CERN citation processing extracts references from downloaded full-text documents instead
of getting them from parallel metadata records. The identifier field of a CERN record contains a URL for
an HTML page, which in turn contains a URL for the PDF file. Hence, downloading the full-text
document involves two steps: downloading the HTML page and then downloading the PDF file. However,
not all CERN records have extractable PDF files; to improve performance, our program does not try to
download the full-text documents for records without them. A sketch of the two-step download follows.
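A minimal sketch of that two-step download, assuming the PDF is linked with an href ending in .pdf (the real pages may differ):

import re
import urllib.request
from urllib.parse import urljoin

def fetch_cern_pdf(identifier_url):
    """Two-step full-text download: the HTML page first, then the linked PDF."""
    html = urllib.request.urlopen(identifier_url).read().decode("latin-1")
    match = re.search(r'href="([^"]+\.pdf)"', html, re.IGNORECASE)
    if match is None:
        return None   # no extractable PDF; the record is skipped
    return urllib.request.urlopen(urljoin(identifier_url, match.group(1))).read()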
From the tables, we see that the bottlenecks are downloading HTML pages, downloading PDFs, and
reference extraction. It took about 1 second to process one CERN record. Table 10 shows the results of
subject resolving for APS records. In the table, Update stands for updating the subject field of the DC
records, Flag for updating flags in the database, Map for mapping PACS codes to subject strings, and
Index for re-indexing the database. From the table, we can see that it took about 68 milliseconds to
process one record.
5 Conclusions
This paper reports on our experience implementing higher-level services for a collection that federates a
number of other collections using OAI-PMH. Since there are few, if any, such services besides Archon,
this paper is more a study in itself than a comparative analysis. Our performance analysis shows that we
can comfortably set the OAI harvester's scheduler to about one day and still have a safety margin for
human intervention should the automatic process break down. Whether the search web service is indeed a
useful tool depends on the usefulness of the underlying federated collection. We are in the process of
adding NASA to the current production service, which federates CERN, arXiv, and APS, and we plan to
collaborate with AIP (American Institute of Physics) to include their collections as well. Once all these are
federated and operating at this service level on a dynamic basis, the web services should prove attractive,
particularly to authors of papers who can thereby maintain their own bibliographies.
Acknowledgement: We are extremely grateful for the cooperation and collaboration of Corrado Pettenati
and Jean-Yves Le Meur from CERN, who graciously provided us with their code for reference processing.
They worked with us to adapt their process to Archon; their code is one of the reasons for the high
success rate of reference resolution.
References
1. CERN Document Server. Available at: http://cdsweb.cern.ch.
2. Claivaz, J., Le Meur, J., Robinson, N., "From Fulltext Documents to Structured Citations: CERN's
   Automated Solution", High Energy Phys. Libr. Webzine: 5 (2001). Available at:
   http://doc.cern.ch/heplw/5/papers/2/.
3. Lagoze, C., Van de Sompel, H., Nelson, M. & Warner, S. (2002), "The Open Archives Initiative
   Protocol for Metadata Harvesting (Version 2.0)". Available at:
   http://www.openarchives.org/OAI/openarchivesprotocol.html.
4. LaTeX project home page, http://www.latex-project.org/.
5. Liu, X., Maly, K., Zubair, M. & Nelson, M. L. (2001), "Arc - An OAI service provider for digital
   library federation", D-Lib Magazine, 7(4). Available at:
   http://www.dlib.org/dlib/april01/liu/04liu.html.
6. Lodi, E., Vesely, M., Vigen, J., "Link managers for grey literature", CERN-AS-99-006, 4th
   International Conference on Grey Literature: New Frontiers in Grey Literature, Washington, DC,
   USA, 4-5 Oct 1999. Ed. by Farace, D. J. and Frantzen, J., GreyNet, Amsterdam, 2000, pp. 116-134.
7. Maly, K., Zubair, M., Nelson, M., Liu, X., Anan, H., Gao, J., Tang, J., & Zhao, Y., "Archon - a
   digital library that federates physics collections", in DC-2002: Metadata for e-Communities:
   Supporting Diversity and Convergence, Florence, Italy, October 13-17, 2002, pp. 25-37.
8. Mathematical Markup Language (MathML) Version 2.0, W3C Recommendation, 21 February 2001,
   http://www.w3.org/TR/MathML2/.
9. NSDL project 2002. Available at: http://www.nsdl.org.
10. PACS 2003. Available at: http://www.aip.org/pacs/.
11. SOAP Version 1.2, W3C Working Draft, 9 July 2001, Martin Gudgin, Marc Hadley, Jean-Jacques
    Moreau, Henrik Frystyk Nielsen, http://www.w3.org/TR/soap12/.
12. Van de Sompel, H. and Lagoze, C., "The Santa Fe Convention of the Open Archives Initiative",
    D-Lib Magazine, 6(2), February 2000.
    http://www.dlib.org/dlib/february00/vandesompel-oai/02vandesompel-oai.html.
13. UDDI Specifications, http://www.uddi.org/specification.html.
14. Web Services Description Language (WSDL) 1.1, W3C Note, 15 March 2001, Erik Christensen,
    Francisco Curbera, Greg Meredith, Sanjiva Weerawarana, http://www.w3.org/TR/wsdl.