Building collections with Greenstone
How to Build a Digital Library
Ian H. Witten and David Bainbridge

Digital Library Collections
  There is a distinction between
    BUILDING collections
    DELIVERING information to users
  Similar to the ‘compile-time’ versus ‘runtime’ distinction in computer programming
  Information structures should usually be prepared in advance

Building a Collection

The Collector
  A subsystem that takes you step by step through building a simple collection
  Conceals details behind the scenes
  First locate information on your computer or the Web
    Plain text, HTML, Word, PDF, email file, etc.

Plug-ins
  Plug-ins are software modules that handle
    Format conversion
    Metadata extraction
  Plug-ins promote extensibility

Greenstone Archive Format
  XML-based file format
  File format for:
    Documents
    Metadata

Collection Configuration File
  Defines the structure of a collection
  Governs how the collection is built
  Specifies how the collection will appear to users

Greenstone Extended Capabilities

Extending the Capabilities of Greenstone
  Plug-ins
    Handle different document and metadata formats
  Classifiers
    Handle different kinds of browsing structures
  Format statements and Macros
    Govern the user interface content and appearance

Why Greenstone?
Benefits of Greenstone
  General system for constructing and presenting digital collections
  Handles millions of documents, text, images, audio, video
  User interfaces identical in Web-based and CD-ROM versions
  Installs on Windows and Linux
  Access locally or remotely using a web browser

Organization of Collections
  Each collection can be organized differently:
    Format of source documents
    Metadata
    Directory structure
    Document structure
    Searching and browsing services
    Presentation
    Auxiliary services

Variation of Source Format
  Source documents can be supplied in:
    Plain text
    HTML
    PostScript
    PDF
    Word
    E-mail
    Other file types
    Images
    Video
    Audio

Variation of Metadata
  Different types of metadata
  Metadata can be supplied differently
    ‘fields’ in MS Word
    <meta> tags in HTML
    Information coded into filename and directories
    Spreadsheet or other data file
    Explicit metadata format like MARC

Variation of Directory Structure
  Collections can vary in the directory structure in which the information is located

Variation of Document Structure
  Document structure
    Flat
    Divided sequentially into pages
    Hierarchical organization
  Title or other metadata available at each level

Variation of Services
  Searching
    Metadata
    Indexes
    Hierarchical levels
  Browsing
    Metadata
    Browser type

Variation of Presentation
  Results can be presented to users in various ways:
    Format that target documents are shown in
    Search results page
    Metadata browsers
    Interface language

Variation of Auxiliary Services
  A collection may require additional services
    User logging
    Etc.
Collection Configuration File Allows Variation
  A digital library collection is made by
    Gathering raw material
    Designing the collection
    Putting design information about the structure and presentation of the collection in the Collection Configuration File

Front Page of Collection
  Statement of collection’s purpose
  Statement of collection’s coverage
  Explanation of how collection is organized

Searching Involves Indexes
  Searching is provided by indexes built from different parts of the documents
    Entire documents
    Paragraphs
    Titles
    Sections
    Section headings
    Figure captions

Indexes
  Indexes can be created automatically using
    Documents
    Supporting files
  Indexes can be rebuilt automatically
    New document in the same format becomes available
    A process can wake, check for new material, and rebuild the indexes

Plug-ins for Indexing
  Source documents are converted into standard XML form for indexing using plug-ins
  Standard plug-ins process
    Plain text
    HTML
    Word
    PDF
    Usenet and email messages
  New plug-ins can be written for other document types

Browsing Involves Lists
  Browsing involves lists that can be examined by the user
    Authors
    Titles
    Dates
    Hierarchical classification structures

Classifier Modules
  Modules called classifiers are used to create browsers and build browsing structures from metadata
    Scrollable lists
    Alphabetic selectors
    Dates
    Hierarchies
  Programmers can write new classifiers to create novel browsing capabilities

Search Terms
  Search Terms in Greenstone:
    Alphabetic characters
    Digits
  Separated by white space
  Punctuation acts as white space

Two Types of Queries
  Query for ALL of the words
    Boolean AND
  Query for SOME of the words
    Ranked

Indexes to Search
  In most collections, you can choose different indexes to search
  Examples:
    Author and title indexes
    Chapter and paragraph indexes
  Usually the full matching document is returned regardless of index searched

Preferences Page
  Allows advanced control over search operation:
    Case-folding and stemming
    Advanced query mode where users specify Boolean operators
    Large-query interface
    Display search history

Preferences Page
  Specify subcollections to be included in searches
  Specify presentation language
  Customize interface
    Textual vs. standard interface
    Suppress navigation bar
    Suppress alert system

Using the Collector

The Greenstone Collector
  Easiest way to build a simple collection
  The Collector allows you to:
    Create a new collection
    Modify or add to an existing collection
    Delete a collection

Starting the Collector
  Click the Collector link from the default Greenstone home page
  Log in
    When Greenstone is installed, an account called admin is set up with a password chosen during installation
  The Collector works through a standard web interface

Creating a New Collection
  Collector’s main purpose is to build a new collection
  Structure of a collection is determined when the collection is set up
  Simplest to copy the structure of an existing collection and then edit

Collection Building Steps
  1. Collection Information
  2. Source Data
  3. Configuration
  4. Building
  5. Viewing

Collection Building Steps
  ☐ Collection Information
  ☐ Source Data
  ☐ Configuration
  ☐ Building
  ☐ Viewing

1. Collection Information
  Give the collection a name and provide associated information
    Title
      Short phrase used to identify the collection within the digital library
    Contact e-mail address
    Brief description
      Sets out the principles that govern what is included in the collection

Collection Building Steps
  ☑ Collection Information
  ☐ Source Data
  ☐ Configuration
  ☐ Building
  ☐ Viewing

2. Source Data
  Specify the location of the sources
    Clone existing collection
      Specify on a pull-down menu the existing collection
    Create a completely new collection

2. Source Data
  In the provided boxes, indicate where Source Documents are located
  Specification of sources
    file://
    http://
    ftp://

file://
  File name on the Greenstone server system
    That file will be included in collection
  Directory name on the Greenstone server
    Everything in the folder and its subfolders will be included

http://
  Web page
    The web page will be downloaded
    All pages it links to (and all pages they link to) that reside on the same site, below the URL, will also be downloaded
  URL that leads to a list of files
    Everything in the folder and its subfolders will be included in collection

ftp://
  File to be downloaded using FTP
  Directory name on the FTP server
    Downloads everything in the folder and its subfolders

Collection Building Steps
  ☑ Collection Information
  ☑ Source Data
  ☐ Configuration
  ☐ Building
  ☐ Viewing

3. Configuration
  This step can be bypassed
  Allows adjustment of configuration options
  The construction and presentation of all collections are controlled by specifications in a special collection configuration file

Collection Building Steps
  ☑ Collection Information
  ☑ Source Data
  ☑ Configuration
  ☐ Building
  ☐ Viewing

4. Building
  The computer does the work of the building process
  Indexes are built:
    For browsing
    For searching
  Following specifications in the collection configuration file
  Status line shows progress
  Warnings shown if files can’t be found

Collection Building Steps
  ☑ Collection Information
  ☑ Source Data
  ☑ Configuration
  ☑ Building
  ☐ Viewing

5. Viewing
  View the collection that has just been created
  E-mail can be sent to the collection’s contact address
    Must enable by editing main.cfg configuration file

Working with Existing Collections
  Add more material and rebuild the collection
  Edit the configuration file to modify the collection’s structure
  Delete the collection
  Put the collection on CD-ROM

Adding Material to a Collection
  Do not re-specify files that are already in the collection
    Files would be included twice
  If the building process fails, the old version remains unchanged
  Structure of collection can be changed
    Edit the configuration file
    May add plug-ins or an option to a plug-in

Plug-ins & Document Formats
  Plug-ins are specified in the collection configuration file
  File name determines document format
  Widely used document formats:
    TEXTPlug
    HTMLPlug
    WORDPlug
    PDFPlug
    PSPlug
    EMAILPlug
    ZIPPlug

Text Files
  TEXTPlug Plug-In
    *.txt *.text
  Plain text file
  Title metadata based on the first line of the file

HTML Files
  HTMLPlug Plug-In
    *.htm *.html *.shtml *.shm *.asp *.php *.cgi

HTML Files
  HTMLPlug Plug-In
  Imports HTML files
  Title metadata extracted from the HTML <title> tag
  Other HTML <meta> tag data can be extracted
  Parses and processes any links in the file
  Links to other files in the collection are trapped and replaced by references to the document

HTML Files
  file_is_url
    Optional switch within the HTML plug-in
    Causes URL metadata to be inserted into each document, based on the file-name convention that is adopted by the mirroring package.
    The collection uses this metadata to allow readers to refer to the original source material rather than a local copy

Microsoft Word Files
  WORDPlug Plug-In
    *.doc
  Imports Microsoft Word documents
  Greenstone uses independent programs to convert Word files to HTML
  Many variants on the Word format
    Older Word formats use a simple text string extraction

PDF Files
  PDFPlug Plug-In
    *.pdf
  Imports PDF Files
    Adobe’s Portable Document Format
  Greenstone uses independent programs to convert PDF files to HTML

PostScript Files
  PSPlug Plug-In
    *.ps
  Imports PostScript Files
  Works best when a standard conversion program is already installed on the computer
  Uses simple text extraction algorithm if no conversion program is present

Email Files
  EMAILPlug
    *.email
  Imports files containing email
  Each source is checked for e-mail contents
  Extracts metadata:
    Subject
    To
    From
    Date
  Deals with common formats
    Netscape, Eudora, Unix mail readers

Compressed & Archived Files
  ZIPPlug Plug-In
    *.zip *.tar *.gz *.z *.tgz *.bz
  Relies on standard utility programs being present

Building Collections Manually

Building a Collection
  Building a Collection: The process of taking a set of documents and metadata information and creating all the indexes and data structures that support the searching, browsing, and viewing operations that the collection offers

Building a Collection
  Four Phases in Building a Collection
    Make
      Make a skeleton framework structure to contain the collection
    Import
      Import the documents and metadata, convert to a Greenstone standard form
    Build
      Build the required indexes and data structures
    Install
      Make the collection operational

Building Collections Manually
  ☐ Getting Started
  ☐ Making a framework for the collection
  ☐ Importing the documents
  ☐ Building the indexes
  ☐ Installing the collection

Getting Started
  Locate the command prompt
  Go to the directory where Greenstone was installed
    cd "C:\Program Files\gsdl"
  Tell system where to find Greenstone files
    setup.bat
    Sets the variable GSDLHOME to the Greenstone home directory
  To return later
    cd "%GSDLHOME%"

Building Collections Manually
  ☑ Getting Started
  ☐ Making a framework for the collection
  ☐ Importing the documents
  ☐ Building the indexes
  ☐ Installing the collection

Make a framework for the collection
  Use the Perl program mkcol.pl to ‘make a collection’
  Get description of usage and arguments
    perl -S mkcol.pl
    mkcol.pl
      May leave off the first part (perl -S) if the system recognizes that .pl files are associated with Perl

Make a framework for the collection
    perl -S mkcol.pl -creator emailAddress collectionName

Make a framework for the collection
  Examine the file structure
    cd "%GSDLHOME%\collect\collectionName"
  List directory contents
    dir
  Seven subdirectories are created:
    archives
    building
    etc (contains collect.cfg file)
    images
    import
    index
    perllib

Make a framework for the collection
  collect.cfg File
    emailAddress placed in the creator and maintainer lines
    collectionName placed in collection-meta lines
    Plug-ins are inserted

Building Collections Manually
  ☑ Getting Started
  ☑ Making a framework for the collection
  ☐ Importing the documents
  ☐ Building the indexes
  ☐ Installing the collection

Importing the documents
  The collection’s import directory should contain the source material
  Drag the directory containing the source material into the import directory
  You may drag several source directories and hierarchies

Importing the documents
  The import process:
    Brings documents into the Greenstone system
    Standardizes document format (the way that metadata is specified)
    Standardizes the file structure (that contains the documents)

Importing the documents
  To get a list of options for the import program:
    perl -S import.pl
  The basic import command is:
    perl -S import.pl collectionName

Importing the documents
  You may be in any directory when the import command is issued
    The software works by knowing the collection’s name and the Greenstone home directory
  Warnings may appear
    When files are found without corresponding plugins
    These files will be ignored

Building Collections Manually
  ☑ Getting Started
  ☑ Making a framework for the collection
  ☑ Importing the documents
  ☐ Building the indexes
  ☐ Installing the collection

Building the indexes
  Use the program buildcol.pl

Building the indexes
  Modify collect.cfg file to customize the collection’s appearance
    collectionname
      Web browsers receive this name as the title of the collection’s front page
    collectionextra
      Description of the collection
      Appears under “About this collection” on the collection’s home page
      Enter as a single line in the editor

Building the indexes
  Modify collect.cfg file to customize the collection’s appearance
    iconcollection
      Give the collection an icon image
      Put the location of the image between quotes
      If absent, the collection’s name will be used
      Use _httpprefix_ as a shorthand way of beginning any URL that points within the Greenstone file area
      Example: _httpprefix_/collect/collectionName/images/icon.gif

Building the indexes
  To get a list of options for the build program:
    perl -S buildcol.pl
  The basic build command is:
    perl -S buildcol.pl collectionName

Building the indexes
  The building process takes about a minute on small collections and can take much longer for very large collections
  You may ignore most warning messages
  Serious problems will cause the program to terminate

Building Collections Manually
  ☑ Getting Started
  ☑ Making a framework for the collection
  ☑ Importing the documents
  ☑ Building the indexes
  ☐ Installing the collection

Installing the collection
  Building is done in the building directory
  Collection must be moved to the index directory before users can see it
    Drag contents of the building directory to the index directory
    If index already contains files, remove them first
  Forgetting to move the contents of building to index is a common mistake

Installing the collection
  To view the newly built collection:
    Restart Greenstone
      If using the Local Library version
    Reload Greenstone Home Page
      If using the Web version

Importing and Building

General Information
  Two Main Parts to Collection Building:
    Importing (import.pl)
    Building (buildcol.pl)

Files and Directories

Collection Specific Directories
  GSDLHOME
    collect – all the digital library collections
      collectionName – directory of collection
        import – original source material
        archives – result of import process
        building – temporary, contents manually moved to index
        index – bulk of info served to users (import, archives and building can be deleted)
        etc – contains collect.cfg file
        images – icons used for the collection
        perllib – Perl programs specific to collection

Other Greenstone Directories
  GSDLHOME
    lib – common software for both the collection server and receptionist
    bin – programs used for building process
      script – Perl programs used (mkcol.pl, import.pl, buildcol.pl)
    perllib – Perl modules
      plugins – Perl plugins
      classify – Perl classifiers
    cgi-bin – Greenstone runtime system (absent in Local Library version)
    src – source code in C++
      colservr – the collection server
      recpt – the receptionist

Other Greenstone Directories
  GSDLHOME
    packages – source code for external software packages used by Greenstone (indexing and compression program, database manager program, etc.)
      (each package is stored in a directory of its own with a readme file)
    bin – executables
    mappings – Unicode translation tables
    etc – configuration files for the entire system, initialization and error logs, user authorization database
    images – user interface images and icons
    macros – small code fragments that drive the user interface
    tmp – temporary files
    docs – documentation for the system

Object Identifiers
  Document’s permanent name in the system
  Remain the same when collection rebuilt
  Assigned by the import process
  Stored as an attribute in the document archive file
  Character strings starting with the letters HASH (HASH0109d3850a6de440c4d1ca2)
  Used to name directory where archive file is stored

Plug-Ins
  Plug-ins do most of the work of the import process
  Operate in the order in which they are listed in the collect.cfg file
  Input file is passed to each plug-in until one is found that can process it
  If there is no plug-in that can process a file, a warning is printed
  Plug-ins determine the traversal of the subdirectory structure in the import directory
    RecPlug – processes directories, recurses through directory structures and passes the name through the plug-in list
    GAPlug – processes Greenstone Archive Format documents (in the archives directory structure)
    ArcPlug – used during building, processes list of document OIDs produced during import (list is stored in archives.inf file)

The Import Process
  Brings documents and metadata into the system in a standardized XML form
  Original material placed in import directory
  Import process transforms it to files in the archives directory
  The original material can be deleted
  New material added to collection by placing it in import directory and re-executing the import process
  Collection can be rebuilt from archive files
  The new material finds its way into archives along with existing files
  To keep the source form of collections
    Do not delete the archives
    “Source” form can be augmented and rebuilt later

The Build Process
  Creates the indexes and data structures that make the collection operational
  Indexes for the whole collection are built all at once
  Build process does not work incrementally
  Adding new material to archives requires that entire collection be rebuilt (by issuing buildcol.pl)
  Most collections can be rebuilt overnight

Options for Import and Build
  Additional Options for Import
  Additional Options for Build

Options for Import and Build
  To see options for any Greenstone script, type its name at the command prompt

Options for Import and Build
  Help with debugging (see Table 6.5 on page 310):
    verbosity
    archivedir
    maxdocs
    collectdir
    out
    keepold
    debug

Greenstone Archive Documents

Greenstone Archive Format
    <!DOCTYPE GreenstoneArchive [
      <!ELEMENT Section (Description,Content,Section*)>
      <!ELEMENT Description (Metadata*)>
      <!ELEMENT Content (#PCDATA)>
      <!ELEMENT Metadata (#PCDATA)>
      <!ATTLIST Metadata name CDATA #REQUIRED>
    ]>

Document Metadata
  Metadata – descriptive information about author, title, date and keywords
  Stored with metadata name
  Stored at the beginning of the section
  Example:
    <Metadata name="Title">Freshwater Resources in Arid Lands</Metadata>

Document Metadata
  Dublin Core – a metadata standard
  New metadata types can be invented
  Metadata can be assigned by an automatic process rather than manually entered

The Dublin Core

Collection Configuration File

Default Configuration File

Getting the Most Out of Your Documents

Basic Plug-In Options

Document Processing Plug-ins

Assigning Metadata from a File
  XML Document Type Definition (DTD)
  Example XML Metadata File

Document Type Definition (DTD)
    <!DOCTYPE GreenstoneDirectoryMetadata [
      <!ELEMENT DirectoryMetadata (FileSet*)>
      <!ELEMENT FileSet (FileName+,Description)>
      <!ELEMENT FileName (#PCDATA)>
      <!ELEMENT Description (Metadata*)>
      <!ELEMENT Metadata (#PCDATA)>
      <!ATTLIST Metadata name CDATA #REQUIRED>
      <!ATTLIST Metadata mode (accumulate|override) "override">
    ]>

Example XML Metadata File
    <?xml version="1.0" ?>
    <!DOCTYPE GreenstoneDirectoryMetadata SYSTEM "http://greenstone.org/dtd/GreenstoneDirectoryMetadata/1.0/GreenstoneDirectoryMetadata.dtd">
    <DirectoryMetadata>
      <FileSet>
        <FileName>nugget.*</FileName>
        <Description>
          <Metadata name="Title">Nugget Point Lighthouse</Metadata>
          <Metadata name="Place" mode="accumulate">Nugget Point</Metadata>
        </Description>
      </FileSet>
      <FileSet>
        <FileName>nugget-point-1.jpg</FileName>
        <Description>
          <Metadata name="Title">Nugget Point Lighthouse</Metadata>
          <Metadata name="Subject">Lighthouse</Metadata>
        </Description>
      </FileSet>
    </DirectoryMetadata>

Tagging Document Files
    <!--
    <Section>
      <Description>
        <Metadata name="Title">Realizing human rights for poor people: Strategies for achieving the international development targets</Metadata>
      </Description>
    -->
    (text of section goes here)
    <!--
    </Section>
    -->

Classifiers

Format Statements

Examples of Format Strings
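
Worked example: resolving metadata from the example XML metadata file

The slides give the DTD and a sample file but do not spell out how the FileSet entries combine for a given document. The Python sketch below is one plausible reading, not Greenstone's own code. It rests on assumptions that are labeled again in the comments: each FileName is treated as a regular expression matched against the whole file name, FileSets are applied in file order, mode="override" (the DTD default) replaces earlier values, and mode="accumulate" appends to them. The function name metadata_for is hypothetical, and the DOCTYPE line is left out of the embedded sample so nothing depends on the external DTD.

```python
import re
import xml.etree.ElementTree as ET

# Sample taken from the "Example XML Metadata File" slide (DOCTYPE omitted).
SAMPLE = """<DirectoryMetadata>
  <FileSet>
    <FileName>nugget.*</FileName>
    <Description>
      <Metadata name="Title">Nugget Point Lighthouse</Metadata>
      <Metadata name="Place" mode="accumulate">Nugget Point</Metadata>
    </Description>
  </FileSet>
  <FileSet>
    <FileName>nugget-point-1.jpg</FileName>
    <Description>
      <Metadata name="Title">Nugget Point Lighthouse</Metadata>
      <Metadata name="Subject">Lighthouse</Metadata>
    </Description>
  </FileSet>
</DirectoryMetadata>"""


def metadata_for(filename, xml_text=SAMPLE):
    """Return {metadata name: [values]} for one file (hypothetical helper)."""
    root = ET.fromstring(xml_text)
    result = {}
    for file_set in root.findall("FileSet"):
        # ASSUMPTION: FileName is a regular expression that must match the
        # whole file name; the DTD allows several FileName elements per set.
        patterns = [(fn.text or "").strip() for fn in file_set.findall("FileName")]
        if not any(re.fullmatch(p, filename) for p in patterns):
            continue
        description = file_set.find("Description")
        if description is None:
            continue
        for meta in description.findall("Metadata"):
            name = meta.get("name")
            value = (meta.text or "").strip()
            # The DTD declares "override" as the default mode. The external
            # DTD is not loaded here, so the default is applied by hand.
            mode = meta.get("mode", "override")
            if mode == "accumulate":
                # ASSUMPTION: accumulate appends to values already collected.
                result.setdefault(name, []).append(value)
            else:
                # ASSUMPTION: override discards values already collected.
                result[name] = [value]
    return result
```

Calling metadata_for("nugget-point-1.jpg") applies both FileSets, so the file picks up Title and Place from the nugget.* pattern and Subject from its own entry; a file such as nugget-point-2.jpg matches only the first FileSet.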