Using Scalable and Secure Web Technologies
to Design Global Format Registry
Muluwork Geremew, Sangchul Song and Joseph JaJa
Institute for Advanced Computer Science Studies
Department of ECE, University of Maryland
Sponsored by Library of Congress and NSF
1
Motivation
• Handling of digital formats is an essential part of
long-term preservation.
• Format obsolescence
– Technology evolution and the obsolescence of systems and
applications software may leave users unable to access their old
files.
– Software developers may go out of business and no longer
support the applications.
• Digital preservation requires
– Identifying the essential aspects of digital objects.
– Tools for capturing the essential format characteristics of
information stored as digital objects and for processing it.
2
Existing Methodologies
• Standardizing digital content on a few
common formats.
– JPEG2000, OMF, and PDF/A are among the few
selected open standard formats.
• Migration
– Transforms older versions to newer formats.
– Tends to be costly and prone to errors.
• Emulation
– The original bit-streams are executed using an
emulator.
– Implementing such a strategy is extremely
challenging and can be viewed as a transformation.
3
Our Goal
• A flexible framework for incorporating advances
achieved through the existing approaches.
• Development of an efficient, scalable and
platform independent prototype to enable the
tracking and handling of format obsolescence.
– Development of a Global Digital Format Registry
(GDFR) – FOrmat CUration Service (FOCUS)
– Development of enabler modules that can interface
between GDFR and end-user applications.
4
FOCUS Architecture
5
FOCUS on LDAP and SOAP
• Interoperability
– Protocols are platform independent
• Performance
– Most operations are read-only queries. LDAP gives high
performance in this environment.
• Extensibility
– LDAP schema can be easily extended
• Scalability
– By the use of Distributed LDAP
• Security
– SOAP can run on top of SSL (HTTPS)
– LDAP-based Format Registry can be easily integrated with any
other LDAP-based authentication/authorization mechanisms.
6
Global Digital Format Registry
• GDFR serves to provide detailed information
about formats.
• Existing Format Registries:
– UPenn’s FRED (http://tom.library.upenn.edu/fred)
– PRONOM (http://www.nationalarchives.gov.uk/pronom/)
– Wotsit’s Format (http://www.wotsit.org)
• It is not clear how extensible or scalable they are, or
how they can be interfaced with existing preservation
systems.
7
FOCUS
[Diagram: Software, Web Service Agent, Global Digital Format Registry, Software]
• The registry contains information about
– File formats
– Software tools
• Multiple ways to access GDFR in FOCUS
are provided.
– Directly through LDAP interface
– Indirectly through SOAP interface
8
GDFR-Internal Structure
dc=umiacs, dc=umd, dc=edu
  ou=Format-Registry
    ou=Applications
      General descriptive properties.
      Processing: formats taken as input and/or output.
      Entries: Adobe Acrobat v6.0, Adobe Photoshop v7.0, Jhove 1.0
    ou=Formats
      General descriptive properties.
      Processing: rendering, editing, conversion and validation services/systems.
      Entries: Adobe PDF v1.4, CompuServe GIF 1989a, JPEG Image Format 2000
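As a minimal sketch of how a client might address entries in this tree, the Java fragment below builds distinguished names with the JDK's javax.naming.ldap.LdapName class. It assumes that ou=Formats and ou=Applications sit under ou=Format-Registry and that entries are named by a "cn" attribute; the slide does not state the naming attribute, so both the attribute and the RegistryNames helper are illustrative only.

```java
import javax.naming.InvalidNameException;
import javax.naming.ldap.LdapName;
import javax.naming.ldap.Rdn;

class RegistryNames {
    // Base DNs assembled from the components shown on the slide.
    // Assumption: both subtrees live under ou=Format-Registry.
    static final String FORMATS_BASE =
        "ou=Formats,ou=Format-Registry,dc=umiacs,dc=umd,dc=edu";
    static final String APPLICATIONS_BASE =
        "ou=Applications,ou=Format-Registry,dc=umiacs,dc=umd,dc=edu";

    // Builds the DN of one entry under the given base.
    // Assumption: entries are named by "cn"; the real schema may differ.
    static LdapName entryDn(String base, String entryName) {
        try {
            LdapName dn = new LdapName(base);
            // add(Rdn) appends the most specific (leftmost) component.
            dn.add(new Rdn("cn", entryName));
            return dn;
        } catch (InvalidNameException e) {
            throw new IllegalArgumentException("Invalid DN component: " + entryName, e);
        }
    }

    public static void main(String[] args) {
        System.out.println(entryDn(FORMATS_BASE, "Adobe PDF v1.4"));
        System.out.println(entryDn(APPLICATIONS_BASE, "Jhove 1.0"));
    }
}
```

Using LdapName rather than string concatenation lets the library escape special characters in entry names.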
9
Web-Service Agent
[Diagram: Client, Format Inquiry, Web Service Agent, Global Digital Format Registry]
• Mediator between user and registry
• Serviced via SOAP
• Contains a file format identifier module, FIDER
– Java module for format identification
– Uses file magic numbers
– Tests signatures sequentially, from the most restrictive to the most general
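The restrictive-to-general ordering described for FIDER can be illustrated with a small Java sketch. This is not FIDER's code: the MagicSniffer class, its Signature type, and the rule that a longer magic number counts as more restrictive are all assumptions made for illustration. The byte signatures themselves ("%PDF-", "GIF89a", and the 12-byte JPEG 2000 signature box) are the standard ones for those formats.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

class MagicSniffer {
    // Hypothetical signature entry: a format label plus the bytes expected at offset 0.
    static final class Signature {
        final String label;
        final byte[] magic;
        Signature(String label, byte[] magic) { this.label = label; this.magic = magic; }
    }

    private final List<Signature> ordered;

    MagicSniffer(List<Signature> signatures) {
        ordered = new ArrayList<>(signatures);
        // Assumption: a longer magic number is more restrictive, so it is tried first.
        ordered.sort(Comparator.comparingInt((Signature s) -> s.magic.length).reversed());
    }

    // Returns the label of the first matching signature, or "unknown".
    String identify(byte[] header) {
        for (Signature s : ordered) {
            if (startsWith(header, s.magic)) {
                return s.label;
            }
        }
        return "unknown";
    }

    private static boolean startsWith(byte[] data, byte[] prefix) {
        if (data.length < prefix.length) return false;
        for (int i = 0; i < prefix.length; i++) {
            if (data[i] != prefix[i]) return false;
        }
        return true;
    }

    private static byte[] ascii(String s) {
        return s.getBytes(StandardCharsets.US_ASCII);
    }

    // A few sample signatures; a real registry would supply these.
    static MagicSniffer withSampleSignatures() {
        List<Signature> sigs = new ArrayList<>();
        sigs.add(new Signature("Adobe PDF (any version)", ascii("%PDF-")));
        sigs.add(new Signature("Adobe PDF v1.4", ascii("%PDF-1.4")));
        sigs.add(new Signature("CompuServe GIF 1989a", ascii("GIF89a")));
        sigs.add(new Signature("JPEG 2000 (JP2)", new byte[] {
            0x00, 0x00, 0x00, 0x0C, 0x6A, 0x50, 0x20, 0x20, 0x0D, 0x0A, (byte) 0x87, 0x0A }));
        return new MagicSniffer(sigs);
    }

    public static void main(String[] args) {
        MagicSniffer sniffer = withSampleSignatures();
        System.out.println(sniffer.identify(ascii("%PDF-1.4\n...")));
        System.out.println(sniffer.identify(ascii("%PDF-1.6\n...")));
    }
}
```

Because the version-specific PDF signature is tried before the generic one, a PDF 1.4 header resolves to the specific entry, while other PDF versions fall through to the general match.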
10
Web-Service Agent
• Tailorability
– Specific needs of an existing preservation system can
be met by custom-tailoring Web-Service.
• Interoperability
– Independent of OS and languages
• Convenience
– Multiple LDAP queries can be reduced to one Web
Service function call.
– Any updates can be done in a single place, not having
to distribute new modules to end users
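To illustrate the convenience point, the sketch below shows how one agent call could hide two registry lookups from the client. It is a hedged illustration only: FormatInquiryAgent, its method names, and the in-memory maps standing in for the LDAP subtrees are hypothetical, and the example URL is a placeholder rather than a real service.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class FormatInquiryAgent {
    // In-memory stand-ins for the two registry subtrees (hypothetical layout).
    private final Map<String, List<String>> validatorsByFormat = new LinkedHashMap<>();
    private final Map<String, String> locationByApplication = new LinkedHashMap<>();

    // Records that an application at the given location can validate a format.
    void registerValidator(String format, String application, String location) {
        validatorsByFormat.computeIfAbsent(format, k -> new ArrayList<>()).add(application);
        locationByApplication.put(application, location);
    }

    // One call for the client; internally two lookups per result:
    // format -> applications, then application -> service location.
    Map<String, String> findValidators(String format) {
        Map<String, String> result = new LinkedHashMap<>();
        List<String> apps = validatorsByFormat.getOrDefault(format, Collections.emptyList());
        for (String app : apps) {
            result.put(app, locationByApplication.get(app));
        }
        return result;
    }

    public static void main(String[] args) {
        FormatInquiryAgent agent = new FormatInquiryAgent();
        // Placeholder location, not a real endpoint.
        agent.registerValidator("Adobe PDF v1.4", "Jhove 1.0", "http://validator.example/jhove");
        System.out.println(agent.findValidators("Adobe PDF v1.4"));
    }
}
```

If the lookup logic changes, only the agent needs updating; clients keep calling the same function.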
11
FOCUS- Supplementary Tools
• Validation Software
– Verifies and validates the format of a given file.
• Rendering Software
– Interprets bit streams of files into human-friendly
representation on the screen.
• Editing Software
– Adds, deletes, or modifies the contents of a given file
while preserving its file format.
• Conversion Software
– Converts a file from its format to a current or emerging format.
12
FOCUS Service Model
[Diagram: the Web Service Agent and Format Registry, connected to four services]
• Identification Service
– Identifies format of a specific DO using the internal signature.
• Conversion Software
– Locates transformation services to convert DO from source
format to format of interest.
• Validation Software
– Determines a verification service to verify the format of a specific DO.
• Rendering Software
– Identifies current rendering conditions for specific digital format.
13
Example Scenario: Digital Object Format
Verification
[Diagram: User, Web Service Agent (ID Service), Format Registry, Validation Service, Rendering Service, Conversion service. Message labels: "Format?" / "Format ID / Format Info", "Verifier?" / "App ID / App Info", "Verify this?" / "Valid/Well-formed"]
• Step 1: User connects to the ID service via Web Service to identify the format of a file.
• Step 2: Registry returns the format ID and information on the format.
• Step 3: User requests a verifier available for this format.
• Step 4: Registry returns validation service information, such as its service location.
• Step 5: User requests the validation service to verify the format.
• Step 6: Validation service returns the verification result.
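The verification scenario can be sketched as a short Java routine that chains three calls: identify the format, ask the registry for a verifier, then call that verifier. This is only an outline under stated assumptions: the VerificationScenario class and its three function parameters are hypothetical stand-ins for the SOAP calls, which the slides do not specify.

```java
import java.util.function.BiFunction;
import java.util.function.Function;

class VerificationScenario {
    // Runs the identify -> find verifier -> verify sequence.
    // Each parameter is a hypothetical stand-in for a remote service call.
    static String verify(byte[] file,
                         Function<byte[], String> identifyFormat,
                         Function<String, String> findVerifierLocation,
                         BiFunction<String, byte[], String> callVerifier) {
        // Ask the ID service for the format of the file.
        String formatId = identifyFormat.apply(file);

        // Ask the registry where a verifier for that format is located.
        String location = findVerifierLocation.apply(formatId);
        if (location == null) {
            return "No verifier registered for " + formatId;
        }

        // Send the file to the validation service and return its verdict.
        return callVerifier.apply(location, file);
    }

    public static void main(String[] args) {
        byte[] file = {0x25, 0x50, 0x44, 0x46, 0x2D}; // "%PDF-"
        String verdict = verify(
            file,
            f -> "Adobe PDF v1.4",                        // stubbed ID service
            id -> "http://validator.example/jhove",       // placeholder location
            (loc, f) -> "Valid/Well-formed");             // stubbed validation service
        System.out.println(verdict);
    }
}
```

Passing the services in as functions keeps the control flow testable without a live registry or validation service.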
14
Demo
15
Conclusion
• The FOCUS design offers
– Flexibility – The Web Service Agent can be easily tailored
to meet the various needs of different preservation
institutions.
– Scalability – The format registry can also be distributed.
• FOCUS integrates current format preservation
techniques and makes them available through a SOAP-based web interface.
• In summary, we believe that the FOCUS prototype
represents a significant advance towards the
development of a secure and scalable digital format
registry.
16