Challenges in Building Federation Services over Harvested Metadata

Hesham Anan, Jianfeng Tang, Kurt Maly, Michael Nelson, Mohammad Zubair, and Zhao Yang
Old Dominion University, Norfolk, VA 23529
{anan, tang_j, maly, mln, zubair, yzhao}@cs.odu.edu

Abstract. The Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH) is a de facto standard for the federation of digital repositories. OAI-PMH defines how repositories (data providers) expose their metadata and how harvesters (service providers) extract it. In this paper we concentrate on our experience as service providers. Although OAI-PMH makes it easy to harvest data from data providers, adding specialized services requires significant work beyond the OAI-PMH capabilities: harvesting provides only the basic mechanism for getting metadata from repositories, while processing these data or retrieving related metadata is not part of the protocol. We present the impact of value-added services such as citation and reference processing, equation searching, and data normalization on the process of dynamic harvesting in Archon, an NSF-funded NSDL digital library that federates physics collections. Dynamic harvesting introduces the challenge of keeping specialized services consistent with the ingestion of new metadata records. We present performance data for the overhead these services add to regular harvesting. We also present the implementation of Simple Object Access Protocol (SOAP) web services (and a corresponding client) that enhance the usability of our collections by allowing third parties to provide their own clients to search our databases directly.

1 Introduction

Using the Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH) standardizes and facilitates the process of federating multiple digital libraries (DLs). However, by design this protocol does not support services beyond basic harvesting of metadata records. For example, a service that provides reference information for a document, together with links from references to their corresponding documents, needs the original metadata along with the metadata of the references. Furthermore, the full text corresponding to the original metadata might have to be retrieved and scanned for references. Such correlated or full-text harvesting is not directly supported by OAI-PMH; additional processing is required to store reference-linking information and to update references to and from existing documents. Another example is normalization, such as populating missing metadata fields (e.g., subject); we have implemented services to deduce the subject when it is missing from the harvested Dublin Core metadata. Other examples of value-added services include displaying and searching equations that are embedded in metadata fields such as title or abstract in formats such as LaTeX [4] or MathML [8]. Equations in these formats are not easy for users to browse or view, so we have added a service to render them; however, this requires metadata post-processing to identify and extract the equations and to convert them to easily displayable images. Processes like these introduce challenges into harvesting, and some may consume enough time to constrain how and when they must be scheduled.
In this paper we review our experiences with these issues in the implementation of Archon (a DL that federates physics collections), which is part of NSF's National Science Digital Library (NSDL) project [9]. We also discuss how the addition of these specialized services affects harvesting time. After harvesting, Archon can act as an OAI aggregator from which other DLs can in turn harvest using OAI-PMH. However, while OAI-PMH provides harvesting capabilities, it does not provide distributed searching. To that end we developed web services, based on the Simple Object Access Protocol (SOAP), to support searching of our collection by third parties. As an illustration, we developed the Personal Digital Library Assistant (PDLA), a reference implementation of a client that uses these services while adding its own services of book-shelving and reference creation.

2 Process Automation

In Archon, we harvest records using OAI-PMH. This protocol provides the capability to harvest records incrementally from OAI-enabled repositories using simple commands. However, at the core of Archon are high-level services that require post-processing of harvested metadata. To keep these specialized services consistent with the ingestion of new metadata records, we implement Archon's post-harvesting processes as tasks that can be run incrementally and automatically.

The Archon post-processing conceptually consists of tasks for citation and equation processing, normalization, and a subject resolver. Some of the metadata fields (e.g., dates) have to be normalized to compensate for the many conventions used at the various repositories. Normalization is not time-consuming, and its integration with the harvester is straightforward. However, citation processing, equation processing, and subject resolution have to be implemented as separate tasks, either because they are difficult to integrate into other processes or because of performance issues. In the following subsections, we describe these tasks in detail.
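As a concrete illustration of the incremental harvesting described above, the following sketch issues an OAI-PMH ListRecords request with a `from` date and follows resumption tokens until the repository reports no more batches. It is a minimal sketch, not Archon's actual harvester; the endpoint URL and datestamp in the usage comment are placeholders.

```python
# Minimal OAI-PMH incremental harvester sketch (illustrative; not Archon's
# actual implementation). Issues ListRecords with a 'from' date and follows
# resumptionTokens until the repository reports no more batches.
import urllib.request
import urllib.parse
import xml.etree.ElementTree as ET

OAI_NS = "{http://www.openarchives.org/OAI/2.0/}"

def harvest(base_url, from_date, metadata_prefix="oai_dc"):
    """Yield <record> elements added or changed since from_date (YYYY-MM-DD)."""
    params = {"verb": "ListRecords",
              "metadataPrefix": metadata_prefix,
              "from": from_date}
    while True:
        url = base_url + "?" + urllib.parse.urlencode(params)
        with urllib.request.urlopen(url) as response:
            root = ET.parse(response).getroot()
        for record in root.iter(OAI_NS + "record"):
            yield record
        token = root.find(f"{OAI_NS}ListRecords/{OAI_NS}resumptionToken")
        if token is None or not (token.text or "").strip():
            break  # no more batches
        # A resumptionToken request carries only the verb and the token.
        params = {"verb": "ListRecords", "resumptionToken": token.text.strip()}

# Example (placeholder endpoint and date): harvest everything since the last run.
# for rec in harvest("http://export.arxiv.org/oai2", "2003-01-01"):
#     process(rec)
```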
2.1 Citation Processing

The Archon reference-linking service provides the user with a list of the references for a given metadata record. Where possible, the service provides links to the referenced documents at external source archives and within Archon. Figure 1 shows a screenshot of the bottom of a metadata record that includes a list of references, some unresolved and several resolved to both external sources and internal Archon links. The external links are generated using the script from CERN [6].

Fig. 1. Archon Reference-linking Service

Figure 2 shows a screenshot of Archon's citation service, which allows the user to search records by citation relationship. When a user clicks the "Citations" link next to a record (visible at the right of Figure 1), a window appears that allows the user to search either for all records citing this record (citation to) or all records cited by this record (citation from), in either the full Archon collection or the current result set.

Citation processing is the background process that automatically prepares and organizes data for Archon's reference-linking service after a harvest of metadata records. As shown in Figure 3, its main parts are the reference collector, the bibliographic collector, and the reference process. The reference collector is responsible for obtaining the references and uses separate approaches for different sources. One approach is to harvest reference information using OAI-PMH: for arXiv and APS records, the reference information is available as parallel metadata records. The second approach is to download the full-text document and use reference-extraction software: for CERN records we use a reference-extraction script from CERN [2], which can extract references from PDF files.

Fig. 2. Archon's Interface for the Citation Service

The bibliographic collector is responsible for obtaining bibliographic information. This is necessary because most DC records currently do not provide enough information to establish relations between metadata records and references. Where available, we harvest parallel metadata records and obtain bibliographic information from them. Bibliographic information from different sources may differ for the same item; for example, some sources may use "Phys. Rev. D" while others use "Physics Review D". To address this problem, we normalize the bibliographic information.

Fig. 3. Citation Processing Modules (diagram: the reference collector and bibliographic collector feed raw references and raw bibliographic data to the reference process, which comprises the reference resolver, crosslink, external link, old link adjustment, and normalization components)

The reference process is the main part of citation processing. It includes the reference resolver, crosslink, external link, and old link adjustment. The reference resolver resolves the reference strings of newly added records: it extracts from each reference string the information needed to set up a link to the corresponding record. For example, from the string "A. Donnachie, Phys. Lett. B 296 (1992) 227", it identifies the journal as "Phys. Lett. B", the volume as 296, and the starting page as 227. Currently, we use the CERN software package for reference resolving. Crosslink sets up links from newly resolved references to their corresponding records in Archon. External link sets up URLs for newly resolved references; here we use a rule-based heuristic algorithm based on the URL generator of the CERN software package. Old link adjustment updates references that have been processed by previous runs: because the Archon reference-linking service links references within its federation, whenever we ingest a new record into Archon we need to check whether existing references can now be linked to it.

The current status of references in Archon is shown in Table 1. It gives the total number of references for each archive, as well as how many references have been resolved successfully, how many resolved references have external links (to an outside source), how many have internal links (within Archon), and how many have links at all (external, internal, or both). "Resolved" is higher than "Linked" because we may be able to resolve all components of a reference to a valid internal representation while the URL generator is still unable to construct a link. In summary, we achieve a success rate ranging from 50% to 80%.

Table 1. Archon Reference Status

Archive | Total | External | Internal | Linked | Resolved
arXiv | 4,838,158 | 2,191,419 | 1,257,367 | 2,790,904 | 2,900,347
APS | 686,521 | 427,601 | 195,187 | 432,604 | 520,843
CERN | 58,105 | 24,345 | 9,115 | 25,513 | 27,753
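The following sketch illustrates the kind of parsing the reference resolver performs and how a CrossRef-style resolver query (of the form discussed in the failure analysis below) can be built from the parsed fields. It is our own simplified illustration, not the CERN package Archon actually uses; the regular expression handles only one common reference pattern, and the ISSN lookup is assumed to happen elsewhere.

```python
# Illustrative reference-string resolver sketch (the production system uses
# CERN's package; this regex is a simplification for one common pattern).
import re
import urllib.parse

# Matches strings like "A. Donnachie, Phys. Lett. B 296 (1992) 227":
# authors, journal abbreviation, volume, (year), starting page.
REF_PATTERN = re.compile(
    r"^(?P<authors>.+?),\s*"
    r"(?P<journal>[A-Z][\w.\s]+?)\s+"
    r"(?P<volume>\d+)\s*"
    r"\((?P<year>\d{4})\)\s*"
    r"(?P<spage>\d+)"
)

def parse_reference(ref):
    """Return journal/volume/year/starting-page fields, or None on no match."""
    m = REF_PATTERN.match(ref.strip())
    return m.groupdict() if m else None

def crossref_url(issn, fields, sid="xxx:xxx"):
    """Build a CrossRef-style resolver query from the parsed fields."""
    query = urllib.parse.urlencode({
        "sid": sid,
        "issn": issn,
        "volume": fields["volume"],
        "spage": fields["spage"],
    })
    return "http://doi.crossref.org/resolve?" + query

fields = parse_reference("A. Donnachie, Phys. Lett. B 296 (1992) 227")
# fields -> {'authors': 'A. Donnachie', 'journal': 'Phys. Lett. B',
#            'volume': '296', 'year': '1992', 'spage': '227'}
```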
To test the validity of our rule-based link generator, we randomly selected 20 records from Archon and analyzed their reference resolution. Table 2 shows the test result. We divide the links into three categories:

- Success: the link points to the appropriate document.
- Success (but not to an individual document): the link goes to a page that contains a document list.
- Failure: the link points neither to the corresponding document nor to a document list.

Table 2. External Link Success/Failure Ratio

Sampled records: oai:aps.org:PhysRevD.59.104013, oai:aps.org:PhysRevD.16.3242, oai:arXiv.org:cond-mat/9906406, oai:aps.org:PhysRevD.64.017701, oai:arXiv.org:cond-mat/9906415, oai:arXiv.org:astro-ph/9902112, oai:arXiv.org:chao-dyn/9607003, oai:arXiv.org:hep-lat/9312002, oai:arXiv.org:nucl-th/9312002, oai:arXiv.org:hep-th/9311181, oai:arXiv.org:hep-ph/9312204, oai:arXiv.org:gr-qc/9312026, oai:arXiv.org:patt-sol/9402001, oai:arXiv.org:hep-th/9403119, oai:arXiv.org:nucl-th/9404010, oai:arXiv.org:cond-mat/9404026, oai:cds.cern.ch:CERN-PS-2002-011AE, oai:cds.cern.ch:CERN-PPE-96-008, oai:aps.org:PhysRevD.10.357, oai:aps.org:PhysRevD.20.211, oai:aps.org:PhysRevD.30.2674.

Totals over the sample: 310 external links; 273 success; 16 success (document list only); 21 failure.

In our test, the failure rate is less than 10%, a rate we have found to be fairly constant in larger random samples. The main reasons for failed links are:

- The link is to another resolver which fails to resolve it. For example, we generate the URL for the reference "I. G. Koh, W. Troost, and A. Van Proeyen Nucl. Phys., B 292 (1987) 201" (a reference of oai:aps.org:PhysRevD.59.104013) as http://doi.crossref.org/resolve?sid=xxx:xxx&issn=05503213&volume=292&spage=201; however, the destination resolver, CrossRef, cannot resolve it.
- The requested object does not exist. For example, we correctly generate the URL at APS for "Phys. Rev. A 26, 1179 (1982)" (a reference of oai:arXiv.org:chao-dyn/9607003); however, the cited object does not exist in APS.
- The requested page does not exist. This may indicate that the URL we generated is not right.

2.2 Equation Processing

In our federation, some metadata fields contain equations represented in formats such as LaTeX or MathML. Viewing such representations is not as intuitive as viewing the equations themselves, so it is very useful to provide a visual tool for the equations. We represent the equations as images and display these images when the metadata records are displayed. This requires tasks to be performed after harvesting new metadata records. Figure 4 shows the tasks involved in equation processing:

- Identifying equations: depending on the format, the boundaries of each equation within the metadata text are identified; for example, LaTeX equations are delimited by special characters such as $.
- Filtering equations: incomplete or incorrectly identified equations are removed from the list.
- Equation storage: for fast browsing, images of the equations are generated and stored in a database.

Fig. 4. Equation Processing (module pipeline: EqnExtractor, EqnFilter, EqnCleaner, and EqnRecorder, feeding image converters such as MathEqn/cHotEqn and Acme.JPM.Encoders.GifEncoder to produce GIF images from DC metadata)

Details on using equations in search can be found in [7]. None of these tasks differ between the historic harvest of a collection and a subsequent dynamic harvest. In contrast to the citation services, we do not have to rely on other metadata and/or full-text documents, since all the information is in the regular metadata records. Thus, automating this service was simply a matter of invoking the appropriate module after harvesting. A sketch of the identification and filtering steps follows.
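The following is a minimal sketch of the identification and filtering tasks, under the assumption that equations are $-delimited LaTeX; Archon's EqnExtractor and EqnFilter modules (Figure 4) handle more formats and edge cases. Each surviving equation would then be rendered to an image by the converters shown in the figure.

```python
# Sketch of equation identification and filtering for $-delimited LaTeX
# (illustrative; the production modules also handle MathML and edge cases).
import re

# A LaTeX inline equation: text between two unescaped '$' characters.
EQN_PATTERN = re.compile(r"(?<!\\)\$(.+?)(?<!\\)\$", re.DOTALL)

def identify_equations(text):
    """Return candidate equation bodies found between $...$ delimiters."""
    return [m.group(1).strip() for m in EQN_PATTERN.finditer(text)]

def is_well_formed(eqn):
    """Filter out fragments that are empty or have unbalanced braces."""
    if not eqn:
        return False
    depth = 0
    for ch in eqn:
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth < 0:
                return False
    return depth == 0

abstract = r"We study the decay $\Gamma(H \to b\bar{b})$ at one loop."
equations = [e for e in identify_equations(abstract) if is_well_formed(e)]
# equations -> ['\\Gamma(H \\to b\\bar{b})']
```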
2.3 Subject Resolvers

Some of the metadata in the archives may not be available through OAI-PMH. For example, in arXiv and APS we encountered many missing subjects and had either to guess them or to use other resources and metadata archives to obtain them. In this section, we describe our subject resolver, which tries to fill the subject field for APS and arXiv DC records (CERN and NASA records already have appropriate subject fields). For each arXiv record without a subject field, the subject resolver tries to guess a subject based on the record's OAI identifier.

Figure 5 shows the process of resolving the subject of an APS record. Some APS parallel metadata records contain PACS (Physics and Astronomy Classification Scheme) codes; PACS is a subject classification scheme for physics and astronomy. Therefore, for an APS DC record without a subject field, we take the PACS codes from its corresponding parallel metadata record (already obtained for the citation service) and map them to subject strings from the PACS scheme. If the parallel metadata record contains no PACS codes at all, we try to guess the subject from the record's OAI identifier. A sketch of both strategies is shown below.

Fig. 5. Subject Resolving for APS Records (diagram: get parallel metadata, parse to get PACS code, map code to subject string via the PACS specification, or guess the subject from the OAI identifier)
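The following sketch shows both strategies. The PACS and arXiv mapping tables here are illustrative placeholder excerpts, not the actual PACS specification or arXiv taxonomy; the identifier heuristic relies on arXiv identifiers of the form oai:arXiv.org:hep-th/9311181, whose prefix names the subject area.

```python
# Subject resolver sketch (illustrative). Strategy 1: map PACS codes from the
# parallel metadata record to subject strings. Strategy 2: guess the subject
# from the OAI identifier prefix (e.g., oai:arXiv.org:hep-th/9311181).

# Placeholder excerpt of a PACS code -> subject mapping (not the real spec).
PACS_TO_SUBJECT = {
    "04": "General Relativity and Gravitation",
    "98": "Stellar Systems; Galactic and Extragalactic Objects",
}

ARXIV_PREFIX_TO_SUBJECT = {  # placeholder excerpt
    "hep-th": "High Energy Physics - Theory",
    "cond-mat": "Condensed Matter",
    "astro-ph": "Astrophysics",
}

def subject_from_pacs(pacs_codes):
    """Map PACS codes (e.g., '04.20.-q') to subjects via the top-level code."""
    subjects = {PACS_TO_SUBJECT.get(code.split(".")[0]) for code in pacs_codes}
    return sorted(s for s in subjects if s)

def subject_from_identifier(oai_id):
    """Guess a subject from an arXiv OAI identifier, or return None."""
    local_part = oai_id.rsplit(":", 1)[-1]   # e.g., 'hep-th/9311181'
    prefix = local_part.split("/", 1)[0]     # e.g., 'hep-th'
    return ARXIV_PREFIX_TO_SUBJECT.get(prefix)

print(subject_from_identifier("oai:arXiv.org:hep-th/9311181"))
# -> High Energy Physics - Theory
```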
3 Web Services and Applications

Web services are self-contained, self-describing, modular applications that can be described, published, located, and remotely invoked. They can perform anything from simple requests to complicated business processes. Once a web service is deployed, other applications (and other web services) can discover and invoke it [14]. Web services promote interoperability by minimizing the requirements for shared understanding, and they enable just-in-time integration. We implemented several web services to expose and extend the use of our collections. Two of them are:

- Search service: provides access to all search functionality without the need to use the Archon interface.
- Bookshelf service: allows each user to maintain a personalized collection that is a subset of the overall archives. It can be used for different purposes; for example, teachers can use it to collect course materials and package them in a personalized collection.

Using these web services, clients can be developed that provide special features according to the requirements of different user communities. We developed such a client, the Personal Digital Library Assistant (PDLA), using the search web service. PDLA is a Windows client that provides a customized interface for searching the Archon collections. It can be used to select search results and store them locally as a special sub-collection. It can also export the metadata of selected documents in different formats, such as XML, text, or Microsoft Word, using selected templates such as the ACM or IEEE style. This allows an author to create her own bibliography of frequently used references and have it presented in formats appropriate for the intended journal. Figure 6 shows the results of using the PDLA client to search Archon for papers written by 'Maly'.

Fig. 6. PDLA Search Interface

In Figure 7 we specify that references to a selected group of papers on the bookshelf be exported in Word format appropriate for an ACM publication, with the result shown in Figure 8.

Fig. 7. PDLA Export Interface

Fig. 8. References generated by PDLA
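To give a flavor of how a third-party client talks to the search service, the following sketch posts a SOAP 1.1 request using Python's standard library. The endpoint URL, operation name, namespace, and message schema are all hypothetical placeholders, not Archon's real interface; an actual client would obtain them from the service's WSDL description [14].

```python
# Hypothetical SOAP 1.1 call to an Archon-style search service (endpoint,
# action, and message schema are placeholders, not the real interface).
import urllib.request

ENDPOINT = "http://archon.example.edu/services/search"  # placeholder URL

def soap_search(query):
    """POST a SOAP envelope to the search service and return the response XML."""
    envelope = f"""<?xml version="1.0" encoding="utf-8"?>
<soap:Envelope xmlns:soap="http://schemas.xmlsoap.org/soap/envelope/">
  <soap:Body>
    <Search xmlns="urn:archon:search">   <!-- placeholder namespace -->
      <query>{query}</query>
    </Search>
  </soap:Body>
</soap:Envelope>"""
    request = urllib.request.Request(
        ENDPOINT,
        data=envelope.encode("utf-8"),  # POST body
        headers={"Content-Type": "text/xml; charset=utf-8",
                 "SOAPAction": "urn:archon:search#Search"},  # placeholder
    )
    with urllib.request.urlopen(request) as response:
        return response.read().decode("utf-8")

# print(soap_search("Maly"))  # would return matching metadata records as XML
```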
4 Performance

Performance was an important consideration when we implemented Archon's background processing, since it has to complete before the next dynamic harvest starts. We ran performance tests on the OAI harvester, on citation processing, and on the subject resolver for APS records; the objective was to identify any bottlenecks. The results for harvesting DC records from NCSTRL-NCSU (Table 3), arXiv via Arc (Table 4), and APS (Table 5), and for harvesting parallel metadata records from APS (Table 6), are shown below. In these tables, Identify, Resumption, ListRecords, and ListSets denote sending the corresponding OAI-PMH request and receiving the result, and DB denotes database operations. An entry of '0.1' denotes a time of less than 0.1 s.

Table 3. NCSTRL-NCSU

Operation | Time (s) | Number of Times | Average Time (s)
Identify | 0.6 | 1 | 0.6
DB | 7.0 | 143 | 0.1
Resumption | 0.1 | 2 | 0.1
ListRecords | 46.2 | 2 | 23.1
ListSets | 24.8 | 1 | 24.8
Total | 80.5 | 143 | 0.6

Table 4. arXiv (from Arc)

Operation | Time (s) | Number of Times | Average Time (s)
Identify | 0.17 | 1 | 0.2
DB | 46.0 | 1,000 | 0.1
Resumption | 0.50 | 10 | 0.1
ListRecords | 3,805.8 | 10 | 380.6
ListSets | 5.7 | 1 | 5.7
Total | 3,858.3 | 1,000 | 3.9

Table 5. APS (DC)

Operation | Time (s) | Number of Times | Average Time (s)
Identify | 0.2 | 1 | 0.2
DB | 9.7 | 220 | 0.1
Resumption | 0.1 | 4 | 0.1
ListRecords | 10.3 | 11 | 0.9
ListSets | 0.6 | 1 | 0.6
Total | 22.5 | 220 | 0.1

Table 6. APS (Parallel)

Operation | Time (s) | Number of Times | Average Time (s)
Identify | 0.2 | 1 | 0.2
DB | 45.1 | 906 | 0.1
Resumption | 0.4 | 10 | 0.1
ListRecords | 72.1 | 11 | 6.6
ListSets | 0.6 | 1 | 0.6
Total | 125.6 | 906 | 0.1

These tables show how many times each operation was executed and how much time was spent on it. The most time-consuming operation is ListRecords, and the performance of the OAI harvester depends largely on the performance of the data providers: harvesting one DC record took on average 0.1 seconds from APS, 0.6 seconds from NCSTRL, and 3.9 seconds from Arc. In addition, record size affects harvester performance: harvesting one APS parallel metadata record, which is on average much larger than its corresponding DC record, takes more time than harvesting one APS DC record.

Tables 7 and 8 show the performance of citation processing for APS and arXiv records.

Table 7. Citation Processing for APS

Operation | Time (s) | Number of Times | Average Time (s)
Adjustment | 13.2 | 895 | 0.1
Biblio Parsing | 21.8 | 906 | 0.1
Biblio Normalization | N/A | N/A | N/A
Ref Parsing | 106.4 | 902 | 0.1
Ref Resolving | 624.6 | 13,129 | 0.1
Cross-linking | 103.0 | 13,129 | 0.1
Total | 868.9 | 906 | 1.0

Table 8. Citation Processing for arXiv

Operation | Time (s) | Number of Times | Average Time (s)
Adjustment | 8.2 | 614 | 0.1
Biblio Parsing | 10.1 | 614 | 0.1
Biblio Normalization | 21.9 | 453 | 0.1
Ref Parsing | 134.2 | 614 | 0.2
Ref Resolving | 923.7 | 18,797 | 0.1
Cross-linking | 123.1 | 18,563 | 0.1
Total | 1,221.3 | 614 | 2.0

From these two tables, we can see that the most time-consuming operation is reference resolving. Processing took about 1 second per APS record and about 2 seconds per arXiv record; the difference arises because, among these test records, an APS record has 14 references on average while an arXiv record has 30.

Table 9 shows the results of citation processing for CERN records. This differs from processing APS or arXiv records: CERN citation processing extracts references from downloaded full-text documents instead of obtaining them from parallel metadata records. The identifier field of a CERN record contains a URL for an HTML page, which in turn contains a URL for the PDF file; downloading the full-text document therefore involves two steps, downloading the HTML page and downloading the PDF file. However, not all CERN records have extractable PDF files; to improve performance, our program does not try to download the full text for records without them. The table shows that the bottlenecks are downloading HTML pages, downloading PDFs, and reference extraction. Processing took about 1 second per CERN record.

Table 9. Citation Processing for CERN

Operation | Time (s) | Number of Times | Average Time (s)
Adjustment | 13.6 | 972 | 0.1
Download HTML | 327.9 | 256 | 1.3
Download PDF | 468.8 | 181 | 2.6
Ref Extraction | 117.7 | 179 | 0.7
Ref Resolving | 75.0 | 1,397 | 0.1
Cross-linking | 9.8 | 1,397 | 0.1
DB | 16.8 | 2,369 | 0.1
Total | 1,029.6 | 972 | 1.1

Table 10 shows the results of subject resolving for APS records. In the table, Update stands for updating the subject field of the DC records, Flag for updating flags in the database, Map for mapping PACS codes to subject strings, and Index for re-indexing the database. The table shows that it took about 68 milliseconds to process one record.

Table 10. APS Subject Resolving

Operation | Time (s) | Number of Times | Average Time (s)
Initial operation | 0.1 | 1 | 0.1
Get parallel metadata | 14.3 | 1,000 | 0.1
Parse metadata | 7.1 | 996 | 0.1
Map | 2.1 | 514 | 0.1
Update | 19.5 | 996 | 0.1
Flag | 7.1 | 996 | 0.1
Index | 18.5 | 1 | 18.5
Total | 68.5 | 1,000 | 0.1

5 Conclusions

This paper reports on our experience with implementing higher-level services for a collection that federates a number of other collections using OAI-PMH. Since there are few, if any, comparable services besides Archon, this paper is a case study rather than a comparative analysis. Our performance analysis shows that we can comfortably set the OAI harvester's schedule to about one day and still retain a safety margin for human intervention should the automatic process break down. Whether the search web service is indeed a useful tool depends on the usefulness of the underlying federated collection. We are in the process of adding NASA to the current production service, which federates CERN, arXiv, and APS, and we plan to collaborate with the American Institute of Physics (AIP) to include their collections as well. Once all of these are federated and operating at this service level on a dynamic basis, the web services should prove attractive, particularly to authors who can thus maintain their own bibliographies.

Acknowledgements. We are extremely grateful for the cooperation and collaboration of Corrado Pettenati and Jean-Yves Le Meur from CERN, who graciously provided us their code for reference processing. They worked with us to adapt their process to Archon; their code is one of the reasons for the high success rate of our reference resolution.

References

1. CERN Document Server. Available at: http://cdsweb.cern.ch.
2. Claivaz, J., Le Meur, J., Robinson, N.: "From Fulltext Documents to Structured Citations: CERN's Automated Solution", High Energy Physics Libraries Webzine, 5 (2001).
Available at: http://doc.cern.ch/heplw/5/papers/2/.
3. Lagoze, C., Van de Sompel, H., Nelson, M., Warner, S.: "The Open Archives Initiative Protocol for Metadata Harvesting (Version 2.0)" (2002). Available at: http://www.openarchives.org/OAI/openarchivesprotocol.html.
4. LaTeX project home page. http://www.latex-project.org/.
5. Liu, X., Maly, K., Zubair, M., Nelson, M. L.: "Arc: An OAI Service Provider for Digital Library Federation", D-Lib Magazine, 7(4) (2001). Available at: http://www.dlib.org/dlib/april01/liu/04liu.html.
6. Lodi, E., Vesely, M., Vigen, J.: "Link Managers for Grey Literature", CERN-AS-99-006, in: Farace, D. J., Frantzen, J. (eds.), 4th International Conference on Grey Literature: New Frontiers in Grey Literature, Washington, DC, USA, October 4-5, 1999. GreyNet, Amsterdam (2000), pp. 116-134.
7. Maly, K., Zubair, M., Nelson, M., Liu, X., Anan, H., Gao, J., Tang, J., Zhao, Y.: "Archon: A Digital Library that Federates Physics Collections", in: DC-2002: Metadata for e-Communities: Supporting Diversity and Convergence, Florence, Italy, October 13-17, 2002, pp. 25-37.
8. Mathematical Markup Language (MathML) Version 2.0, W3C Recommendation, 21 February 2001. http://www.w3.org/TR/MathML2/.
9. NSDL project (2002). Available at: http://www.nsdl.org.
10. PACS (2003). Available at: http://www.aip.org/pacs/.
11. Gudgin, M., Hadley, M., Moreau, J.-J., Nielsen, H. F.: SOAP Version 1.2, W3C Working Draft, 9 July 2001. http://www.w3.org/TR/soap12/.
12. Van de Sompel, H., Lagoze, C.: "The Santa Fe Convention of the Open Archives Initiative", D-Lib Magazine, 6(2) (2000). http://www.dlib.org/dlib/february00/vandesompel-oai/02vandesompel-oai.html.
13. UDDI Specifications. http://www.uddi.org/specification.html.
14. Christensen, E., Curbera, F., Meredith, G., Weerawarana, S.: Web Services Description Language (WSDL) 1.1, W3C Note, 15 March 2001. http://www.w3.org/TR/wsdl.