Joint Conference on Digital Libraries 2007
Generating Best Effort Preservation Metadata
for Web Resources at Time of Dissemination
Joan A. Smith & Michael L. Nelson
Old Dominion University
Department of Computer Science
Norfolk, VA 23529
{jsmit, mln}@cs.odu.edu
JCDL 2007
Presented: 20 June 2007
What’s In A Web Page?
20 June 2007
{jas,mln}@odu.edu
Slide # 2
A Simple Web Page: Behind the Scenes
HTTP: Behind the Scenes
Non-Text Resource example: http://foo.edu/jackJill.jpg
• Note the sparse metadata from the HTTP GET request
• Binary content is not human-readable and does not even display properly in the terminal window
% telnet foo.edu 80
Trying 82.165.199.160...
Connected to foo.edu.
Escape character is '^]'.
GET /jackJill.jpg HTTP/1.1
Host: foo.edu
HTTP/1.1 200 OK
Date: Mon, 11 Jun 2007 16:49:25 GMT
Server: Apache/1.3.33 (Unix)
Last-Modified: Mon, 29 Aug 2005 12:01:40 GMT
ETag: "5800535-3e72-4312f924"
Accept-Ranges: bytes
Content-Length: 15986
Content-Type: image/jpeg
ÿØÿà
"#2s¡ 35Rq‘±³ÁÂ$%Ccruƒ“¢ÃÒÿÄ
¬ê@‘XÑ9÷'M½ÂšX¬4ýÃÆ{ç ÉÎ Ð ?~‰· õÔÓ!RÓ@Š’û¡·TÓ`r ’pz{ ëÖ. éhéQ)Ùè5üb»[g¨øx ^zè ²
Connection closed by foreign host.
We really need more metadata for the digital archeologist of the future:
– Color map
– NISO information
– Base64 encoding of resource
– MD5 or other hash function
– Subject matter
And more metadata would help preserve the Jack and Jill document, too:
– Language
– Document summary/abstract
– Keyword extraction
– Lexical signature
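To make the point about sparse metadata concrete, the following minimal sketch parses the response headers from the telnet session above and lists everything the server disclosed about the image. The header text is copied from the slide; `parse_headers` is a hypothetical helper name introduced only for this illustration.

```python
# Header block copied from the slide's response for /jackJill.jpg.
RAW_HEADERS = """HTTP/1.1 200 OK
Date: Mon, 11 Jun 2007 16:49:25 GMT
Server: Apache/1.3.33 (Unix)
Last-Modified: Mon, 29 Aug 2005 12:01:40 GMT
ETag: "5800535-3e72-4312f924"
Accept-Ranges: bytes
Content-Length: 15986
Content-Type: image/jpeg
"""


def parse_headers(raw):
    """Split a raw HTTP header block into (status_line, {name: value})."""
    lines = raw.strip().splitlines()
    status_line, fields = lines[0], {}
    for line in lines[1:]:
        # Split on the first colon only; values such as dates contain colons.
        name, _, value = line.partition(":")
        fields[name.strip()] = value.strip()
    return status_line, fields


status, headers = parse_headers(RAW_HEADERS)
for name, value in headers.items():
    print(f"{name}: {value}")
```

Of these fields, only Content-Type and Content-Length describe the resource itself, and neither says anything about color map, subject matter, or a content hash.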
Preservation & Metadata
[Figure: Probability of Preservation (Low to High) plotted against Resource Metadata Available (Less to More). Labels: "What I get from the HTTP/HTML"; "What I need to make an Archival Information Package (AIP)".]
Post-Harvest Processing (at Ingest)
Harvest → Analyze/Examine/Process → Archive
Often a combination of manual and automated input
Metadata Generation Utility Examples
Name          Description
Jhove         Analysis by type (img, audio, text)
Kea           Key phrase extraction
OTS           Open Text Summarizer
ExifTool      Image/video metadata extractor
PDFlib-pCOS   Extract PDF metadata
MP3-Tag       Extract audio file tags
Essence       Customized information extraction
GDFR          MIME++
MD5           Message Digest
File Magic    Uses content-identification bits of the file
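Two of the simplest entries in the list, the message digest and the "file magic" check, can be sketched with the Python standard library alone. This is an illustration under stated assumptions, not how the listed utilities are implemented: the two-entry `MAGIC` table and the names `sniff_type` and `basic_metadata` are hypothetical, and the sample bytes are a stand-in rather than a real image.

```python
import base64
import hashlib

# Leading "magic" bytes for two formats. A tiny hand-picked table for
# illustration only; the database used by file(1) is far larger.
MAGIC = {
    b"\xff\xd8\xff": "image/jpeg",   # shows up as "ÿØÿ" in a terminal
    b"%PDF-": "application/pdf",
}


def sniff_type(data):
    """Guess a MIME type from the first bytes of the content."""
    for prefix, mime in MAGIC.items():
        if data.startswith(prefix):
            return mime
    return "application/octet-stream"


def basic_metadata(data):
    """Collect a digest, a type guess, the size, and a Base64 copy."""
    return {
        "md5": hashlib.md5(data).hexdigest(),
        "mimeType": sniff_type(data),
        "size": len(data),
        "base64": base64.b64encode(data).decode("ascii"),
    }


sample = b"\xff\xd8\xff\xe0" + b"\x00" * 12  # stand-in bytes, not a real image
print(basic_metadata(sample))
```

The richer analyses in the list (key phrases, summaries, image properties) need the dedicated tools; the sketch covers only what can be derived from the raw bytes.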
The Conscientious Webmaster
He who waits to do a great deal of good will never do anything. -- Samuel Johnson
Preservation is important…
But I’m soooo busy…
How to help???
Configuring the Web-Server for Automatic Metadata
http://foo.edu/example.html
• No impact to everyday users
• Regular “GET” => “regular” response
• OAI-PMH “Get Record” => “crate” response
• Standard Apache “Location” directive
• mod_oai module configured with “plug-ins”
• Scripts, utilities, etc. can vary by MIME type
http://foo.edu/modoai/?verb=getRecord&identifier=http://foo.edu/example.html&metadataPrefix=crate
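A harvester would typically build the request URL above programmatically. The sketch below is one way to do so with the standard library; `crate_request_url` is a hypothetical helper name. Two choices here are assumptions rather than something the slide states: the verb is written `GetRecord`, as in the OAI-PMH specification, where the slide's URL writes `getRecord`, and the identifier is percent-encoded, which is the usual practice for a URL carried inside a query string.

```python
from urllib.parse import parse_qs, urlencode, urlsplit


def crate_request_url(base_url, resource_url):
    """Build an OAI-PMH GetRecord URL asking for the 'crate' metadata format."""
    query = urlencode({
        "verb": "GetRecord",
        "identifier": resource_url,   # percent-encoded by urlencode
        "metadataPrefix": "crate",
    })
    return f"{base_url}?{query}"


url = crate_request_url("http://foo.edu/modoai/", "http://foo.edu/example.html")
print(url)

# Decoding the query string recovers the original identifier unchanged.
print(parse_qs(urlsplit(url).query)["identifier"][0])
```

A plain GET to http://foo.edu/example.html is unaffected; only requests sent to the mod_oai location receive the "crate" response.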
Harvest with Metadata (at Dissemination)
Harvest
Pre-processed resource
Metadata Magic: Get the resource together with its metadata
Automatic Metadata via mod_oai
http://foo.edu/modoai/?verb=getRecord&identifier=http://foo.edu/jackJill.jpg&metadataPrefix=crate
<?xml version="1.0" encoding="UTF-8"?>
<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.openarchives.org/OAI/2.0/ http://www.openarchives.org/OAI/2.0/OAI-PMH.xsd">
<responseDate>2007-06-18T18:21:46Z</responseDate>
<request verb="GetRecord" identifier="http://foo.edu/jackJill.jpg" metadataPrefix="crate">http://foo.edu/crate/</request>
<GetRecord>
<record>
<header> <identifier>http://foo.edu/jackJill.jpg</identifier>
<datestamp>2007-01-17T04:09:07Z</datestamp>
<setSpec>mime:image:jpeg</setSpec>
</header>
<crateContent> <mimeType>image/jpeg encoding="base64"</mimeType>
<data>JVBERi0xLjQKMyAwIG9iaiA8PAovTGVuZ3RoIDM5MjAgICCAKL0ZpI+hlzHdxHZ56diZdOiXjHNfEq9jOuDTzEc</data>
</crateContent>
<crateMetadata>
<description><label>"file magic"</label> <exec>/usr/bin/file jackJill.jpg</exec>
<version>file-4.16</version>
<data>JPEG image data, JFIF standard 1.00, resolution (DPI), "LEAD Technologies Inc. V1.01", 33 x 26</data>
</description>
<description><label>"jhove"</label> <exec>/opt/jhove/jhove -m jpeg-hul</exec>
<version>Jhove (Rel. 1.1, 2006-06-05)</version>
<data> Date: 2007-06-18 14:35:50 EDT RepresentationInformation: /home/crate/apache/htdocs/jackJill.jpg
ReportingModule: JPEG-hul, Rel. 1.2 (2005-08-22) LastModified: 2007-01-16 23:09:07 EST Size: 27750
Format: JPEG Version: 1.00 Status: Well-Formed and valid SignatureMatches: JPEG-hul
MIMEtype: image/jpeg Profile: JFIF JPEGMetadata: CompressionType: Huffman coding, Baseline DCT
Images: Number: 1 Image: NisoImageMetadata: MIMEType: image/jpeg ByteOrder: big-endian
CompressionScheme: JPEG ColorSpace: YCbCr SamplingFrequencyUnit: inch XSamplingFrequency: 33
YSamplingFrequency: 26 ImageWidth: 172 ImageLength: 146 BitsPerSample: 8, 8, 8 SamplesPerPixel: 3
Scans: 1 QuantizationTables: QuantizationTable: Precision: 8-bit DestinationIdentifier: 0
Comments: LEAD Technologies Inc. V1.01 ApplicationSegments: APP0</data>
</description>
</crateMetadata>
</record>
</GetRecord> </OAI-PMH>
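On the harvesting side, a response like the one above can be read with any namespace-aware XML parser. The sketch below uses Python's `xml.etree.ElementTree` on an abbreviated, hand-cleaned copy of the slide's response: the XML declaration, the Base64 payload, and most of the utility output are omitted, and the quotation marks around the labels are dropped. `summarize` is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

# All elements in the response inherit the OAI-PMH default namespace.
OAI_NS = {"oai": "http://www.openarchives.org/OAI/2.0/"}

# Abbreviated copy of the response shown on the slide.
SAMPLE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <GetRecord>
    <record>
      <header>
        <identifier>http://foo.edu/jackJill.jpg</identifier>
        <datestamp>2007-01-17T04:09:07Z</datestamp>
        <setSpec>mime:image:jpeg</setSpec>
      </header>
      <crateMetadata>
        <description><label>file magic</label>
          <data>JPEG image data, JFIF standard 1.00</data>
        </description>
        <description><label>jhove</label>
          <data>Format: JPEG Version: 1.00 Status: Well-Formed and valid</data>
        </description>
      </crateMetadata>
    </record>
  </GetRecord>
</OAI-PMH>
"""


def summarize(xml_text):
    """Return the record identifier and a {label: raw output} mapping."""
    root = ET.fromstring(xml_text)
    record = root.find("oai:GetRecord/oai:record", OAI_NS)
    identifier = record.findtext("oai:header/oai:identifier", namespaces=OAI_NS)
    reports = {}
    for desc in record.iterfind("oai:crateMetadata/oai:description", OAI_NS):
        label = desc.findtext("oai:label", namespaces=OAI_NS)
        reports[label] = desc.findtext("oai:data", namespaces=OAI_NS)
    return identifier, reports


identifier, reports = summarize(SAMPLE)
print(identifier)
for label, output in reports.items():
    print(f"{label}: {output}")
```

Because each utility's output arrives as uninterpreted text, the harvester keeps it keyed by label and leaves any interpretation to later processing.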
Preservation & Metadata
[Figure: Probability of Preservation (Low to High) plotted against Resource Metadata Available (Less to More). Labels: "HTTP/HTML"; "Automatic metadata utilities/CRATE"; "Archival Information Package (AIP)".]
Automatic, Best-Effort Metadata
• Unverified
– Utility results are not cross-checked
– Output of analyses goes directly into the XML response
• Undifferentiated
– No categorization of output
– Resource and metadata cohabit response
• Automatic
– Generated at time of dissemination
– Integrates preservation functions with the web server
A simple, easy-to-implement option for improving
preservation metadata for web resources
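The "unverified, undifferentiated, automatic" idea can be illustrated with a short sketch: each plug-in is run, and whatever it prints is placed as-is inside a `description` element, with no cross-checking or categorization. This is not mod_oai's implementation (mod_oai is an Apache module); the function names and the stand-in plug-in are assumptions made so the example runs anywhere, and the `version` element from the slide is left out for brevity.

```python
import subprocess
import sys
import xml.etree.ElementTree as ET


def describe(label, argv):
    """Run one utility and wrap its raw stdout in a <description> element."""
    result = subprocess.run(argv, capture_output=True, text=True, check=False)
    desc = ET.Element("description")
    ET.SubElement(desc, "label").text = label
    ET.SubElement(desc, "exec").text = " ".join(argv)
    # Best effort: the output is stored verbatim, neither parsed nor verified.
    ET.SubElement(desc, "data").text = result.stdout.strip()
    return desc


def crate_metadata(plugins):
    """Build <crateMetadata> from (label, argv) pairs, in the order given."""
    meta = ET.Element("crateMetadata")
    for label, argv in plugins:
        meta.append(describe(label, argv))
    return meta


# Stand-in plug-in: the Python interpreter itself, so the sketch is portable.
# A real deployment would list tools such as file or jhove instead.
plugins = [("echo", [sys.executable, "-c", "print('hello from a plug-in')"])]
print(ET.tostring(crate_metadata(plugins), encoding="unicode"))
```

Choosing a different plug-in list per MIME type would only require selecting `plugins` before calling `crate_metadata`.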
Further Information
• The mod_oai project home page:
http://www.modoai.org/
• IWAW 2007:
“CRATE: A Simple Model for Self-Describing Web Resources”
• Authors’ websites:
• http://www.cs.odu.edu/~mln/pubs/
• http://www.joanasmith.com/pubs.html
I Helped!