International Web Archiving Workshop 2007
CRATE: A Simple Model for Self-Describing Web Resources
Joan A. Smith & Michael L. Nelson
Old Dominion University
Department of Computer Science
Norfolk, VA 23529
{jsmit, mln}@cs.odu.edu
WWW and Digital Libraries: Vastly Different Worlds
Digital Library
– Organized
– Groomed content
– Lots of metadata
– Structured changes
– Active preservation policies
World Wide Web
– A disorganized free-for-all
– Near-zero metadata
– Unpredictable additions, deletions, modifications
– No preservation policy
Harvester Home Companion
IWAW ‘07
{jsmit,mln}@cs.odu.edu
Crawlapalooza
Slide # 2
Web Sites: Metadata Challenged
HTML metadata
JPEG metadata
% telnet foo.edu 80
Trying 82.165.199.160...
Connected to foo.edu.
Escape character is '^]'.
GET /jackJill.jpg HTTP/1.1
Host: foo.edu

HTTP/1.1 200 OK
Date: Mon, 11 Jun 2007 16:49:25 GMT
Server: Apache/1.3.33 (Unix)
Last-Modified: Mon, 29 Aug 2005 12:01:40 GMT
ETag: "5800535-3e72-4312f924"
Accept-Ranges: bytes
Content-Length: 15986
Content-Type: image/jpeg

[binary JPEG data follows]
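The response above carries only a handful of header fields as metadata for the image. A minimal sketch, assuming the raw header block has already been read from the connection, of how a harvester might collect those fields; `parse_http_headers` is a hypothetical helper for illustration and not part of mod_oai.

```python
from typing import Dict, Tuple


def parse_http_headers(raw: str) -> Tuple[str, Dict[str, str]]:
    """Split a raw HTTP response head into its status line and header fields."""
    lines = [line for line in raw.strip().splitlines() if line.strip()]
    status_line = lines[0].strip()
    headers: Dict[str, str] = {}
    for line in lines[1:]:
        # Split on the first colon only; values such as dates contain more colons.
        name, _, value = line.partition(":")
        headers[name.strip()] = value.strip()
    return status_line, headers


# Header block copied from the telnet session on this slide.
RAW_RESPONSE = """\
HTTP/1.1 200 OK
Date: Mon, 11 Jun 2007 16:49:25 GMT
Server: Apache/1.3.33 (Unix)
Last-Modified: Mon, 29 Aug 2005 12:01:40 GMT
ETag: "5800535-3e72-4312f924"
Accept-Ranges: bytes
Content-Length: 15986
Content-Type: image/jpeg
"""

status, headers = parse_http_headers(RAW_RESPONSE)
print(status)
for name, value in headers.items():
    print(f"{name}: {value}")
```

Everything a client learns about the JPEG comes from these seven fields; nothing describes the image itself.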
Archives: Metadata-Rich
YAMM?! (Yet Another Metadata Model?)
The MPEG-21 DIDL Model
Preservation & Metadata
[Figure: Probability of Preservation (Low to High) plotted against Resource Metadata Available (Less to More). Labeled points: HTTP/HTML; Automatic metadata utilities/CRATE; Archival Information Package (AIP).]
# Webs >> # Archivists
Typical ingest scenario
Archivist
Web Sites
Harnessing the Web Server
User: standard GET request and response
Archivist: mod_oai GetRecord request and response
Self-describing resource
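The archivist's request differs from the user's only in its URL. A minimal sketch, assuming the `/modoai/` base path and the `crate` metadata prefix used in the example later in these slides; `getrecord_url` is a hypothetical helper. It percent-encodes the identifier, which the slides show unencoded.

```python
from urllib.parse import parse_qs, urlencode, urlsplit


def getrecord_url(base_url: str, identifier: str, metadata_prefix: str = "crate") -> str:
    """Build an OAI-PMH GetRecord URL for one web resource."""
    query = urlencode(
        {
            "verb": "GetRecord",
            "identifier": identifier,
            "metadataPrefix": metadata_prefix,
        }
    )
    return f"{base_url}?{query}"


url = getrecord_url("http://foo.edu/modoai/", "http://foo.edu/jackJill.jpg")
print(url)

# Decoding the query string recovers the original arguments.
print(parse_qs(urlsplit(url).query))
```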
What is a “Self-Describing” Resource?
Standard HTTP Headers:
Last-Modified: Mon, 29 Aug 2005 12:01:40 GMT
ETag: "5800535-3e72-4312f924"
Content-Length: 15986
Content-Type: image/jpeg
PLUS: Output from built-in utilities:
EXIF TOOL:
File Name: 103_0315.JPG
Camera Model Name: Canon EOS DIGITAL REBEL
Date/Time Original: 2003:09:30 13:37:51
Shooting Mode: Sports
Shutter Speed: 1/2000
Aperture: 7.1
Metering Mode: Evaluative
Exposure Compensation: 0
ISO: 400
Lens: 75.0 - 300.0mm
Focal Length: 300.0mm
Image Size: 3072x2048
Quality: Normal
Flash: Off
White Balance: Auto
Focus Mode: AI Servo AF
Contrast: +1
Sharpness: +1
Saturation: +1
Color Tone: Normal
File Size: 1606 kB
File Number: 103-0315
JHOVE TOOL:
Date: 2007-06-18 14:35:50 EDT RepresentationInformation: /home/crate/apache/htdocs/jackJill.jpg
ReportingModule: JPEG-hul, Rel. 1.2 (2005-08-22) LastModified: 2007-01-16 23:09:07 EST Size: 27750
Format: JPEG Version: 1.00 Status: Well-Formed and valid SignatureMatches: JPEG-hul
MIMEtype: image/jpeg Profile: JFIF JPEGMetadata: CompressionType: Huffman coding, Baseline DCT
Images: Number: 1 Image: NisoImageMetadata: MIMEType: image/jpeg ByteOrder: big-endian
CompressionScheme: JPEG ColorSpace: YCbCr SamplingFrequencyUnit: inch XSamplingFrequency: 33
YSamplingFrequency: 26 ImageWidth: 172 ImageLength: 146 BitsPerSample: 8, 8, 8 SamplesPerPixel: 3
Scans: 1 QuantizationTables: QuantizationTable: Precision: 8-bit DestinationIdentifier: 0
Comments: LEAD Technologies Inc. V1.01 ApplicationSegments: APP0
File/Magic:
JPEG image data, JFIF standard 1.00, resolution (DPI), "LEAD Technologies Inc. V1.01", 33 x 26
MD5 Hash:
58a54e8638db432f4515eedf89f44505
…CRATE: Wrapped together with the resource in simple XML
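Of the utility outputs listed on this slide, the MD5 hash is the simplest to reproduce. A minimal sketch, assuming the resource bytes are already in memory; `md5_hex` and `describe` are hypothetical names, and the sample bytes are a stand-in rather than the image from the slides.

```python
import hashlib
from typing import Dict


def md5_hex(data: bytes) -> str:
    """Return the MD5 message digest of the resource as lowercase hex."""
    return hashlib.md5(data).hexdigest()


def describe(data: bytes, content_type: str) -> Dict[str, str]:
    """Combine two header-style fields with one utility-style field."""
    return {
        "Content-Length": str(len(data)),
        "Content-Type": content_type,
        "MD5 Hash": md5_hex(data),
    }


# Stand-in bytes; a real call would pass the file contents.
sample = b"\xff\xd8\xff\xe0 stand-in bytes, not a real image"
for field, value in describe(sample, "image/jpeg").items():
    print(f"{field}: {value}")
```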
Metadata Generation Utility Examples
Name          Description
Jhove         Analysis by type (img, audio, text)
Kea           Key phrase extraction
OTS           Open Text Summarizer
ExifTool      Image/video metadata extractor
PDFlib-pCOS   Extract PDF metadata
MP3-Tag       Extract audio file tags
Essence       Customized information extraction
GDFR          MIME++
MD5           Message Digest
File Magic    Uses content-identification bits of the file
Web Server Configuration: “conf” file
### Section 1: Global Environment
#
ServerType standalone
ServerRoot "/etc/httpd"
PidFile /var/run/httpd.pid
ResourceConfig /dev/null
AccessConfig /dev/null
Timeout 300
KeepAlive On
MaxKeepAliveRequests 0
KeepAliveTimeout 15
MinSpareServers 16
MaxSpareServers 64
StartServers 16
MaxClients 512
MaxRequestsPerChild 100000
### Section 2: 'Main' server configuration
#
Port 80
<IfDefine SSL>
Listen 80
Listen 443
</IfDefine>
User www
Group www
ServerAdmin [email protected]
ServerName www.openna.com
DocumentRoot "/home/httpd/ona"
<Directory />
Options None
AllowOverride None
Order deny,allow
Deny from all
</Directory>
<Directory "/home/httpd/ona">
Options None
AllowOverride None
Order allow,deny
Allow from all
</Directory>
<Files .pl>
Options None
AllowOverride None
Order deny,allow
Deny from all
</Files>
<IfModule mod_dir.c>
DirectoryIndex index.htm index.html index.php index.php3 default.html index.cgi
</IfModule>
#<IfModule mod_include.c>
#Include conf/mmap.conf
#</IfModule>
UseCanonicalName On
<IfModule mod_mime.c>
TypesConfig /etc/httpd/conf/mime.types
</IfModule>
DefaultType text/plain
HostnameLookups Off
• Operational Rules
• Modules (mod_perl, etc.)
• Security
• Virtual Hosts
Apache: mod_oai Location Directive
<Location /modoai>
SetHandler modoai-handler
modoai_oai_active ON
<modoai_plugin>
label "md5sum"
exec "/usr/bin/md5sum %s"
version "/usr/bin/md5sum --version"
mime "*/*"
</modoai_plugin>
<modoai_plugin>
label "file"
exec "/usr/bin/file -kz %s"
version "/usr/bin/file -v"
mime "*/*"
</modoai_plugin>
<modoai_plugin>
label "jhove"
exec "/opt/jhove/jhove -m pdf-hul %s"
version "/opt/jhove/jhove -v"
mime "application/pdf"
</modoai_plugin>
<modoai_plugin>
label "pronom"
exec "java -jar DROID.jar -L %s"
version "java -jar DROID.jar -V"
mime "*/*"
</modoai_plugin>
</Location>
• Scripts
• Pipes
• Executables
• MIME-based selective processing
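MIME-based selective processing means each plugin runs only on resources whose type matches its `mime` pattern. A minimal sketch of that selection, using the four plugins from the directive above. Glob-style matching and the helper names `select_plugins` and `command_for` are assumptions for illustration; the slides do not say how mod_oai matches patterns internally.

```python
from dataclasses import dataclass
from fnmatch import fnmatchcase
from typing import List


@dataclass(frozen=True)
class Plugin:
    label: str
    exec_template: str  # "%s" stands for the resource path, as in the directive
    mime: str           # pattern such as "*/*" or "application/pdf"


PLUGINS = [
    Plugin("md5sum", "/usr/bin/md5sum %s", "*/*"),
    Plugin("file", "/usr/bin/file -kz %s", "*/*"),
    Plugin("jhove", "/opt/jhove/jhove -m pdf-hul %s", "application/pdf"),
    Plugin("pronom", "java -jar DROID.jar -L %s", "*/*"),
]


def select_plugins(mime_type: str, plugins: List[Plugin] = PLUGINS) -> List[Plugin]:
    """Return the plugins whose mime pattern matches the resource's type."""
    wanted = mime_type.lower()
    return [p for p in plugins if fnmatchcase(wanted, p.mime.lower())]


def command_for(plugin: Plugin, path: str) -> str:
    """Substitute the resource path for the %s placeholder."""
    return plugin.exec_template.replace("%s", path)


for mime_type in ("image/jpeg", "application/pdf"):
    labels = [p.label for p in select_plugins(mime_type)]
    print(mime_type, "->", labels)
```

With this configuration a JPEG is processed by three plugins, while a PDF is also passed to Jhove.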
Building a CRATE
CRATE
CRATE ID
• URI, UUID
METADATA
• Standard HTTP Headers
• Plug-In Metadata
RESOURCE
• Base64-Encoded Resource
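The three parts of a CRATE can be sketched as a small XML document. This is an illustration under stated assumptions, not the CRATE schema: the element names `crateContent`, `crateMetadata`, `description`, `label`, `mimeType`, and `data` follow the mod_oai example in these slides, while `crate`, `crateId`, `uri`, `uuid`, and `httpHeader` are hypothetical names chosen here.

```python
import base64
import uuid
import xml.etree.ElementTree as ET
from typing import Dict


def build_crate(
    uri: str,
    resource: bytes,
    mime_type: str,
    headers: Dict[str, str],
    plugin_output: Dict[str, str],
) -> ET.Element:
    """Wrap a resource and its metadata in one XML element tree."""
    crate = ET.Element("crate")

    # CRATE ID: the resource URI plus a freshly generated UUID.
    crate_id = ET.SubElement(crate, "crateId")
    ET.SubElement(crate_id, "uri").text = uri
    ET.SubElement(crate_id, "uuid").text = str(uuid.uuid4())

    # METADATA: standard HTTP headers followed by plug-in output.
    metadata = ET.SubElement(crate, "crateMetadata")
    for name, value in headers.items():
        ET.SubElement(metadata, "httpHeader", {"name": name}).text = value
    for label, output in plugin_output.items():
        description = ET.SubElement(metadata, "description")
        ET.SubElement(description, "label").text = label
        ET.SubElement(description, "data").text = output

    # RESOURCE: the bytes themselves, Base64-encoded.
    content = ET.SubElement(crate, "crateContent")
    ET.SubElement(content, "mimeType").text = mime_type
    ET.SubElement(content, "data").text = base64.b64encode(resource).decode("ascii")
    return crate


resource_bytes = b"\xff\xd8\xff\xe0"  # stand-in bytes, not a full image
crate = build_crate(
    "http://foo.edu/jackJill.jpg",
    resource_bytes,
    "image/jpeg",
    {"Content-Type": "image/jpeg"},
    {"file": "JPEG image data"},
)
print(ET.tostring(crate, encoding="unicode"))
```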
CRATE example from mod_oai
http://foo.edu/modoai/?verb=GetRecord&identifier=http://foo.edu/jackJill.jpg&metadataPrefix=crate
<?xml version="1.0" encoding="UTF-8"?>
<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.openarchives.org/OAI/2.0/ http://www.openarchives.org/OAI/2.0/OAI-PMH.xsd">
<responseDate>2007-06-18T18:21:46Z</responseDate>
<request verb="GetRecord" identifier="http://foo.edu/jackJill.jpg" metadataPrefix="crate">http://foo.edu/crate/</request>
<GetRecord>
<record>
<header>
<identifier>http://foo.edu/jackJill.jpg</identifier>
<datestamp>2007-01-17T04:09:07Z</datestamp>
<setSpec>mime:image:jpeg</setSpec>
</header>
<crateContent>
<mimeType>image/jpeg encoding="base64"</mimeType>
<data>JVBERi0xLjQKMyAwIG9iaiA8PAovTGVuZ3RoIDM5MjAgICCAKL0ZpI+hlzHdxHZ56diZdOiXjHNfEq9jOuDTzEc </data>
</crateContent>
<crateMetadata>
<description>
<label>"file magic"</label>
<exec>/usr/bin/file jackJill.jpg</exec>
<version>file-4.16</version>
<data>JPEG image data, JFIF standard 1.00, resolution (DPI), "LEAD Technologies Inc. V1.01", 33 x 26</data>
</description>
<description>
<label>"jhove"</label>
<exec>/opt/jhove/jhove -m jpeg-hul</exec>
<version>Jhove (Rel. 1.1, 2006-06-05)</version>
<data><![CDATA[ Date: 2007-06-18 14:35:50 EDT RepresentationInformation: /home/crate/apache/htdocs/jackJill.jpg
ReportingModule: JPEG-hul, Rel. 1.2 (2005-08-22) LastModified: 2007-01-16 23:09:07 EST Size: 27750
Format: JPEG Version: 1.00 Status: Well-Formed and valid SignatureMatches: JPEG-hul
MIMEtype: image/jpeg Profile: JFIF JPEGMetadata: CompressionType: Huffman coding, Baseline DCT
Images: Number: 1 Image: NisoImageMetadata: MIMEType: image/jpeg ByteOrder: big-endian
CompressionScheme: JPEG ColorSpace: YCbCr SamplingFrequencyUnit: inch XSamplingFrequency: 33
YSamplingFrequency: 26 ImageWidth: 172 ImageLength: 146 BitsPerSample: 8, 8, 8 SamplesPerPixel: 3
Scans: 1 QuantizationTables: QuantizationTable: Precision: 8-bit DestinationIdentifier: 0
Comments: LEAD Technologies Inc. V1.01 ApplicationSegments: APP0 ]]></data>
</description>
</crateMetadata>
</record>
</GetRecord>
</OAI-PMH>
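A harvester receiving such a response needs to separate the plug-in output from the OAI-PMH envelope. A minimal sketch, assuming the CRATE elements sit in the OAI-PMH default namespace as they do in the example on this slide. `SAMPLE` is a trimmed stand-in for a real response and `plugin_descriptions` is a hypothetical helper.

```python
import xml.etree.ElementTree as ET
from typing import Dict

OAI_NS = {"oai": "http://www.openarchives.org/OAI/2.0/"}

# Trimmed stand-in for a GetRecord response with metadataPrefix=crate.
SAMPLE = """\
<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <GetRecord>
    <record>
      <header>
        <identifier>http://foo.edu/jackJill.jpg</identifier>
      </header>
      <crateMetadata>
        <description>
          <label>"file magic"</label>
          <version>file-4.16</version>
          <data>JPEG image data, JFIF standard 1.00</data>
        </description>
        <description>
          <label>"jhove"</label>
          <version>Jhove (Rel. 1.1, 2006-06-05)</version>
          <data><![CDATA[ Format: JPEG Version: 1.00 ]]></data>
        </description>
      </crateMetadata>
    </record>
  </GetRecord>
</OAI-PMH>
"""


def plugin_descriptions(xml_text: str) -> Dict[str, str]:
    """Map each plug-in label to the output text it contributed."""
    root = ET.fromstring(xml_text)
    results: Dict[str, str] = {}
    for description in root.iterfind(".//oai:crateMetadata/oai:description", OAI_NS):
        label = description.findtext("oai:label", default="", namespaces=OAI_NS)
        data = description.findtext("oai:data", default="", namespaces=OAI_NS)
        # Labels in the example are wrapped in literal quotation marks.
        results[label.strip().strip('"')] = data.strip()
    return results


identifier = ET.fromstring(SAMPLE).findtext(
    ".//oai:header/oai:identifier", namespaces=OAI_NS
)
print(identifier)
for label, data in plugin_descriptions(SAMPLE).items():
    print(f"{label}: {data}")
```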
Automatic, Best-Effort Metadata
• Automatic
– Generated at time of dissemination
– Integrates preservation functions with the web server
• Unverified
– Utility results are not cross-checked
– Output of analyses goes directly into XML response
• Undifferentiated
– No categorization of output
– Resource and metadata form complex-object response
A simple, easy-to-implement option for improving available preservation metadata for web resources
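The "automatic" and "unverified" properties can be sketched as a loop that runs each utility at request time and passes its output through unchanged. This is an illustration only: `run_best_effort` is a hypothetical function, and recording a failure marker instead of aborting is an assumption, since the slides do not say how mod_oai handles a failing utility. The demo commands use the Python interpreter so the example runs anywhere.

```python
import subprocess
import sys
from typing import Dict, List


def run_best_effort(plugins: Dict[str, List[str]], timeout: float = 10.0) -> Dict[str, str]:
    """Run each utility and keep whatever it prints, without cross-checking."""
    results: Dict[str, str] = {}
    for label, argv in plugins.items():
        try:
            completed = subprocess.run(
                argv, capture_output=True, text=True, timeout=timeout
            )
        except (OSError, subprocess.TimeoutExpired) as exc:
            # Best effort: a broken or slow utility does not block the response.
            results[label] = f"[plugin failed: {type(exc).__name__}]"
            continue
        # Output is passed through as-is; nothing verifies or categorizes it.
        results[label] = completed.stdout.strip()
    return results


demo_plugins = {
    "echo": [sys.executable, "-c", "print('JPEG image data')"],
    "missing": ["/nonexistent/utility"],
}
outputs = run_best_effort(demo_plugins)
for label, text in outputs.items():
    print(f"{label}: {text}")
```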
Issues - Or Not?
• Web Server Performance
– Academic vs dot-com expectations
– Solution options
• Utility Efficiency
– Java-based vs C-based
– Market pressures
• Security
– Metadata vs risk
– Access controls
Next Up…
• mod_oai Open Source release
• Formalize/release CRATE schema definition (XSD)
• Metrics Collection & Evaluation
– Academic sites
– Dot-Com sites
– Examine utility compatibility and issues
– Address security concerns
Demo
TODAY:
• http://beatitude.cs.odu.edu:8080/modoaitest/diag.jpg
• http://beatitude.cs.odu.edu:8080/modoai/?verb=GetRecord&metadataPrefix=crate&identifier=http://localhost/modoaitest/diag.jpg
AT MODOAI.ORG:
• http://www.modoai.org/demos.html
Further Information
• The mod_oai project home page:
http://www.modoai.org/
• JCDL 2007:
Generating Best Effort Preservation Metadata For Web Resources At Time Of Dissemination
• Authors’ webs:
• http://www.cs.odu.edu/~mln/pubs/
• http://www.joanasmith.com/pubs.html