Report - Metadata Extraction Home Page

Spring 09
__________________________________________________________
Project Report
for
Non-Form Template Interpreter for MetaData Extraction
Student Name: Divya Josyala
Advisor: Dr. Steven Zeil
___________________________________________________________________
Abstract
An effective way to access large and diverse document collections is to make them
searchable via metadata fields such as author, title, and publishing organization. Manual
metadata generation is extremely time-consuming and expensive compared with automated
metadata extraction. The digital library group at ODU is currently developing an automated
metadata extraction system to support document collections with diverse structure and layout.
One of the key steps of this system is the design of a rule-based template that defines rules
for extracting metadata from a single class of similar PDF documents. An OCR engine first
converts these PDF documents into XML files, from which the metadata is then extracted. This
project implements a Java-based interpreter that interprets the rules in the template and
extracts the corresponding metadata from the XML documents. The template rules introduced
in this project have a clear model of language semantics, a simple and concise
representation, and are easy to extend.
Overview of Meta-data Extraction System
Many federal organizations, such as the Defense Technical Information Center (DTIC), the National
Aeronautics and Space Administration (NASA), and the U.S. Government Printing Office (GPO),
manage large document collections with diverse layout and structure. These collections
can be accessed easily by making them searchable via metadata fields such as author,
title, and publishing organization. Manually creating metadata for large collections is very
expensive and time-consuming compared with automated metadata extraction. Existing
approaches to automating metadata extraction fail to support the heterogeneous
collections handled by these federal organizations. The digital library group at ODU is developing
an automated metadata extraction process for large, diverse, and evolving document
collections. The goal of this system is to extract metadata in XML format from
PDF documents.
The automated metadata extraction system developed by the digital library group employs
a template-based approach in which documents are classified into homogeneous
collections, i.e., sets of documents with similar layout and structure, whose pages
containing the title and other metadata fields look alike (see Figure 1). A rule-based
template that defines rules for extracting metadata is associated with each class of documents.
For example, a template might state that the text in the first line of the first page is,
in that layout, the document title, and that this text is associated with the metadata field “title”.
Figure 1 : Documents with similar layout
Architecture of the meta-data extraction system
Figure 2 : Meta-Data Extraction Flow Diagram
The PDF documents enter the input processing system, where they are truncated
and processed by an OCR program that converts them into XML format. These
XML documents are first sent to the form processing system, which searches them
for any of the standardized RDP forms. If a form is found, the metadata is
extracted from the document using the corresponding form template; otherwise the document
enters the non-form processing system, which generates a candidate extraction solution
from the available templates. The extracted metadata then enters the output processor,
which consists of a post-processing module and a validation module. The post-processing
module handles cleanup and normalization of the metadata. The final automated step of
the process is the validation module, which uses an array of deterministic and statistical
tests to determine the acceptability of the extracted metadata. Any document that fails to
meet the validation criteria is flagged for human review and correction.
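The cleanup and validation steps described above might look roughly like the following Java sketch. It is only an illustration: the class name, the whitespace-collapsing rule, and the length-based check are assumptions made here, not the modules actually used in the system, whose tests are not detailed in this report.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical post-processing helpers; names and rules are assumptions, not the system's code. */
final class PostProcessSketch {

    /** Cleanup: trim and collapse runs of whitespace (including OCR line breaks) to one space. */
    static String normalize(String raw) {
        return raw.trim().replaceAll("\\s+", " ");
    }

    /** Assumed deterministic test: a field passes if it is non-empty and not implausibly long. */
    static boolean passes(String value, int maxLength) {
        return !value.isEmpty() && value.length() <= maxLength;
    }

    public static void main(String[] args) {
        Map<String, String> extracted = new LinkedHashMap<>();
        extracted.put("CorporateAuthor", "AIR\nCOMMAND\nAND\nSTAFF\nUNIVERSITY");
        extracted.put("ReportDate", "  April 1999 ");
        for (Map.Entry<String, String> field : extracted.entrySet()) {
            String clean = normalize(field.getValue());
            String status = passes(clean, 200) ? "ok" : "flag for review";
            System.out.println(field.getKey() + " = " + clean + " [" + status + "]");
        }
    }
}
```

A value that fails such a check would be the kind of result that gets routed to human review.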
Non-Form Processing
This module can be divided into two components: the classification system and the template
interpreter (the non-form extraction engine). Initially, the documents are compared against
known document layouts and classified based on structural or visual similarity. A template is
selected for the closest matching layout. The template and the document then enter the
interpreter, where the metadata is extracted from the document using the rules defined in
the template. The figures below show a sample PDF document, the corresponding
template, and the metadata extracted from it.
Figure 3 : Non-Form PDF Document
<template id="au">
  <let name="page1" select="doc/page[1]">
    <bind name="identifier" select="$page1//line[wd[starts-with(.,'AU/')]]"/>
    <let name="CApar" select="$page1//para[line[(wd = 'COMMAND' and wd = 'AIR') or
                              (wd = 'WAR' and wd = 'AIR')]]">
      <bind name="CorporateAuthor" select="$CApar"/>
      <let name="afterCA" select="$page1//para[. >> $CApar]">
        <let name="authorPar" select="$afterCA//line[wd[starts-with(.,'by')]]">
          <bind name="UnclassifiedTitle" select="$afterCA//line[$authorPar >> .]"/>
          <bind name="PersonalAuthor" select="$afterCA//line[. >> $authorPar]"/>
        </let>
      </let>
    </let>
    <bind name="advisor" select="$page1//para//line[wd[starts-with(.,'Advisor')]]"/>
    <bind name="ReportDate" select="$page1//line[wd[matches(.,$Date)]]"/>
  </let>
  <bind name="CurrentDate" select="date:to-string(date:new())"/>
</template>
Figure 4: Non-form template used to extract meta-data for PDF document shown in Figure 3
__________________________________________________________________________________________________________
<au>
<identifier>AU/ACSC/012/1999-04</identifier>
<CorporateAuthor>AIR
COMMAND
AND
STAFF
UNIVERSITY</CorporateAuthor>
<UnclassifiedTitle>INTEGRATING COMMERCIAL
ELECTRONIC EQUIPMENT TO IMPROVE
MILITARY CAPABILITIES
</UnclassifiedTitle>
<PersonalAuthor>Jeffrey A. Bohler LCDR, USN</PersonalAuthor>
<advisor>Advisor: CDR Albert L. St.Clair</advisor>
<ReportDate>April 1999</ReportDate>
</au>
Figure 5 : Meta Data extracted from the document in Figure 3
Limitations of the Current Template Design
The template structure currently used for non-form processing has several limitations:
1. Non-concise representation.
2. Ad hoc semantics. The template rules are designed for specific tasks and cannot be
easily extended with further functionality.
3. Limited ability to express or exploit document structure. For example, the rules can
express “find the line containing AIR COMMAND” but cannot express “find the
paragraph containing AIR COMMAND”.
4. Inability to combine basic search/test functions. For example, the rules cannot express
that a title comes after a header AND starts with the phrase “Fact Sheet”. They can
express only one of these conditions, not both.
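For contrast with limitation 4, an XPath predicate can join two tests with "and". The Java sketch below is an illustration only: it uses the JDK's built-in XPath 1.0 engine, and the flat <line> layout and the 'NEWS RELEASE' header text are assumptions chosen for the example, not the OCR output format or a rule from an actual template.

```java
import java.io.StringReader;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.xml.sax.InputSource;

/** Hypothetical example: one rule that requires two conditions at once. */
final class CombinedTestSketch {

    // The line must directly follow the (assumed) header line AND start with "Fact Sheet".
    static final String TITLE_RULE =
        "//line[preceding-sibling::line[1][. = 'NEWS RELEASE'] and starts-with(., 'Fact Sheet')]";

    /** Returns the text of the first matching line, or "" when no line satisfies both tests. */
    static String findTitle(String xml) {
        XPath xpath = XPathFactory.newInstance().newXPath();
        try {
            return xpath.evaluate(TITLE_RULE, new InputSource(new StringReader(xml)));
        } catch (XPathExpressionException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String page = "<page><line>NEWS RELEASE</line>"
                    + "<line>Fact Sheet: Radar Upgrade</line>"
                    + "<line>Fact Sheet archive index</line></page>";
        System.out.println(findTitle(page));
    }
}
```

The third line in the sample also starts with "Fact Sheet" but does not directly follow the header, so only the second line satisfies the combined rule.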
Template Design
The proposed language consists of high-level syntactic elements and lower-level text
expressions. The syntactic elements are the <template> element, which serves as the root of
the entire template; <let> elements, which define variable names; and <bind> elements, which
establish a value for a metadata field. The text expressions are XPath expressions,
which are evaluated in a context consisting of the document root and the variable bindings
from the enclosing <let> elements.
<template id="au">
  <let name="page1" select="doc/page[1]">
    <bind name="identifier" select="$page1//line[wd[starts-with(.,'AU/')]]"/>
    <let name="CApar" select="$page1//para[line[(wd = 'COMMAND' and wd = 'AIR') or
                              (wd = 'WAR' and wd = 'AIR')]]">
      <bind name="CorporateAuthor" select="$CApar"/>
      <let name="afterCA" select="$page1//para[. >> $CApar]">
        <let name="authorPar" select="$afterCA//line[wd[starts-with(.,'by')]]">
          <bind name="UnclassifiedTitle" select="$afterCA//line[$authorPar >> .]"/>
          <bind name="PersonalAuthor" select="$afterCA//line[. >> $authorPar]"/>
        </let>
      </let>
    </let>
    <bind name="advisor" select="$page1//para//line[wd[starts-with(.,'Advisor')]]"/>
    <bind name="ReportDate" select="$page1//line[wd[matches(.,$Date)]]"/>
  </let>
  <bind name="CurrentDate" select="date:to-string(date:new())"/>
</template>
A. <template>
The <template> element serves as the root of the template structure.
Attributes:
• @id: an identifier for the template and, by implication, for the class of documents processed by
this template
Children:
• a list of <let> and/or <bind> elements
Semantics:
• Each child is evaluated in sequence.
• If any child fails, the template fails.
• If all children succeed, return the concatenation of the metadata bindings from all children.
B. <bind>
The <bind> element establishes a binding between a metadata field name and a string denoting a
value for that field. It can also supply attribute values to annotate that binding.
Attributes:
• @name: an identifier for a metadata field
• @select: an XPath expression
Children:
• none
Semantics:
• The @select attribute expression is evaluated.
• If evaluation of this expression succeeds, return a list containing one binding associating the
text value of the selected nodes with the field name indicated in the @name attribute. Any
other attributes of <bind> are attached to this binding as descriptive attributes.
C. <let>
The <let> element establishes a binding between an XPath variable name and the value of an XPath
expression.
Attributes:
• @name: an identifier for an XPath variable
• @select: an XPath expression
Children:
• a list of <let> and/or <bind> elements
Semantics:
• The @select attribute expression is evaluated.
• If evaluation of @select succeeds, evaluate each child in a context augmented by the
binding of the @select expression result to the XPath variable indicated by the @name
attribute.
  o If all children succeed, return the concatenation of the lists of bindings returned by
  the children.
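The three sets of semantics above can be read as one recursive evaluation. The Java sketch below is a minimal illustration under stated assumptions, not the interpreter described in this report. It uses the JDK's built-in DOM and XPath 1.0 classes rather than JDOM and Saxon, so XPath 2.0 operators such as >> are not available. It treats an empty selection as "failure", which the rules above do not define. It ignores extra <bind> attributes, and all class and method names are hypothetical.

```java
import java.io.StringReader;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

/** Hypothetical sketch of the <template>/<let>/<bind> rules; not the report's interpreter. */
final class TemplateSketch {

    static final String TEMPLATE =
        "<template id='au'><let name='page1' select='/doc/page[1]'>"
        + "<bind name='identifier' select=\"$page1//line[wd[starts-with(.,'AU/')]]\"/>"
        + "</let></template>";
    static final String DOC =
        "<doc><page><para><line><wd>AU/ACSC/138/2000-04</wd></line></para></page></doc>";

    /** Evaluates e's @select against doc with the <let> variables in vars in scope. */
    static NodeList select(Element e, Document doc, Map<String, Object> vars) throws Exception {
        XPath xpath = XPathFactory.newInstance().newXPath();
        xpath.setXPathVariableResolver(name -> vars.get(name.getLocalPart()));
        return (NodeList) xpath.evaluate(e.getAttribute("select"), doc, XPathConstants.NODESET);
    }

    /** Returns the bindings produced by e, or null if e fails (assumed: empty selection). */
    static Map<String, String> eval(Element e, Document doc, Map<String, Object> vars) throws Exception {
        Map<String, String> bindings = new LinkedHashMap<>();
        Map<String, Object> scope = vars;
        if (!e.getTagName().equals("template")) {
            NodeList hits = select(e, doc, vars);
            if (hits.getLength() == 0) return null;        // @select did not succeed
            if (e.getTagName().equals("bind")) {            // <bind>: one field binding
                StringBuilder text = new StringBuilder();
                for (int i = 0; i < hits.getLength(); i++) {
                    text.append(hits.item(i).getTextContent()).append(' ');
                }
                bindings.put(e.getAttribute("name"), text.toString().trim());
                return bindings;
            }
            scope = new HashMap<>(vars);                    // <let>: augment the context
            scope.put(e.getAttribute("name"), hits);
        }
        for (Node c = e.getFirstChild(); c != null; c = c.getNextSibling()) {
            if (!(c instanceof Element)) continue;          // skip whitespace text nodes
            Map<String, String> child = eval((Element) c, doc, scope);
            if (child == null) return null;                 // a failing child fails the parent
            bindings.putAll(child);                         // concatenate the children's bindings
        }
        return bindings;
    }

    static Document parse(String xml) throws Exception {
        return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
    }

    /** Applies a template to a document; returns null when the template fails. */
    static Map<String, String> run(String templateXml, String docXml) {
        try {
            return eval(parse(templateXml).getDocumentElement(), parse(docXml), new HashMap<>());
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(run(TEMPLATE, DOC));
    }
}
```

Copying the variable map for each <let> keeps a binding visible only to that element's children, which mirrors the "context augmented by the binding" wording above.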
Implementation Details
Technologies used: JDOM, XPath 2.0, and XML.
Tools: Eclipse
External classes/interfaces: the s9api (“snappy”) interface of the Saxon XSLT/XPath 2.0 processor, used to
evaluate XPath expressions
Both the template and the non-form XML document enter the JDOM interpreter. An instance of
SAXBuilder builds the template (which is in XML) into a JDOM document tree, and
the DocumentBuilder provided by Saxon's s9api builds the non-form XML
document in tree form. Saxon uses a tiny tree model as its internal data
structure, which is faster to build and occupies less space, but is slower to navigate, than
the linked tree model. After reading the documents, the interpreter processes the template,
starting with the root node <template> and continuing with its child <let> and <bind> nodes.
The functionality provided for each of these elements in the JDOM
interpreter is as follows.
Function template() for <template>:
• Each child is evaluated in sequence.
• If any child fails, the template fails.
• If all children succeed, return the concatenation of the metadata bindings from all
children.
Function Let() for <let>:
• The @select attribute expression is evaluated.
• If evaluation of @select succeeds, evaluate each child in a context augmented by the
binding of the @select expression result to the XPath variable indicated by the @name
attribute.
  o If all children succeed, return the concatenation of the lists of bindings returned by
  the children.
Example 1 :
<let name ="page1" select="doc/page[1]">


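Code Snippet to Evaluate XPath Expression:
// Requires: import net.sf.saxon.s9api.*;
Processor proc = new Processor(false);
XPathCompiler xpath = proc.newXPathCompiler();
// Compile the @select expression and load it into an executable selector.
XPathSelector selector = xpath.compile(expression).load();
// Evaluate relative to the parsed OCR document (an XdmNode).
selector.setContextItem(Doc);
XdmValue val = selector.evaluate();
// Record the result under the <let> variable name for use by child elements.
m.put(name, val);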
The variables defined by <let> nodes take values of type XdmValue, which here is a sequence of
XdmNode items. Each node is the root of a subtree.
Value of the XPath expression in Example 1:
<page width="12240" height="15840" x-res="300" y-res="300" orientation="0" pgno="1">
  <region left="1356" top="1409" right="10451" bottom="14400">
    <vert-white-space t="1409" b="1488" pct="0.499" loc="top" unit="px" />
    <para t="1488" l="1440" r="3773" b="1661" li="0" ri="0" align="left" linespacing="300">
      <line l="1440" t="1488" r="3773" b="1661" ff="Times New Roman" fs="1200">
        <wd l="1440" t="1488" r="3773" b="1661">AU/ACSC/138/2000-04</wd>
      </line>
    </para>
    <vert-white-space b="14400" t="13925" loc="bottom" unit="px" pct="2.999" />
  </region>
</page>
Function Bind() for <bind>:
• The @select attribute expression is evaluated.
• If evaluation of this expression succeeds, return a list containing one binding
associating the text value of the selected nodes with the field name indicated in the
@name attribute. Any other attributes of <bind> are attached to this binding as
descriptive attributes.
Example 2:
<bind name ="identifier" select ="$page1//line[wd[starts-with(.,'AU/')]]"/>
Code Snippet to Evaluate XPath expression with variables:
// Requires: import net.sf.saxon.s9api.*;
Processor proc = new Processor(false);
XPathCompiler xpath = proc.newXPathCompiler();
// Each variable must be declared before the expression that references it is compiled.
xpath.declareVariable(new QName(var[i]));
XPathSelector selector = xpath.compile(expression).load();
selector.setContextItem(Doc);
// Supply the value bound by the enclosing <let> (e.g. $page1).
selector.setVariable(new QName(var[i]), var_value[i]);
XdmValue val = selector.evaluate();
Value :
<line l="1440" t="1488" r="3773" b="1661" ff="Times New Roman" fs="1200">
<wd l="1440" t="1488" r="3773" b="1661">AU/ACSC/138/2000-04</wd>
</line>
Meta-Data Extracted:
<identifier>AU/ACSC/138/2000-04</identifier>
Example 3:
<bind name="CurrentDate" select="date:to-string(date:new())"/>
Code Snippet to Evaluate XPath expression with functions:
// Requires: import net.sf.saxon.s9api.*;
Processor proc = new Processor(false);
XPathCompiler xpath = proc.newXPathCompiler();
// Bind the "date" prefix to java.util.Date so that date:new() and date:to-string()
// resolve to the Date constructor and Date.toString().
xpath.declareNamespace("date", "java:java.util.Date");
XPathSelector s = xpath.compile(path).load();
s.setContextItem(Doc);
XdmAtomicValue val = (XdmAtomicValue) s.evaluate();
Output:
<CurrentDate>Mon Apr 06 14:59:52 EDT 2009</CurrentDate>
Future Work:
Example 3 above shows how to call extension functions through the Saxon XPath
processor. In the same way, user-defined extension functions could be developed in the future to
extract metadata, which would further simplify the templates and make it possible to handle a wider
variety of document layouts. As discussed earlier, the classification system assigns a
template to each document. It does this by applying several templates in the extraction
system and validating the results. This phase still has to be tested against the
new template language. Finally, once the interpreter has been integrated with the new system, there is scope
for further research and development toward an automated intelligent assistant
program for writing templates.
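As a rough idea of what a user-defined extension function could look like, the Java sketch below registers one with the JDK's built-in XPath 1.0 engine through an XPathFunctionResolver. This is a stand-in for illustration only: it does not use Saxon's extension mechanism, and the function name meta:looks-like-date, the namespace URI, the date pattern, and the flat <line> layout are all assumptions.

```java
import java.io.StringReader;
import java.util.Collections;
import java.util.Iterator;
import javax.xml.namespace.NamespaceContext;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import javax.xml.xpath.XPathFunction;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

/** Hypothetical user-defined XPath function; names and pattern are assumptions. */
final class ExtensionFunctionSketch {

    static final String NS = "urn:example:metadata";   // hypothetical namespace URI

    /** meta:looks-like-date(x): true if x reads like "April 1999". */
    static final XPathFunction LOOKS_LIKE_DATE = args -> {
        Object arg = args.get(0);
        // A node-set argument arrives as a NodeList; other values are converted via toString.
        String text = (arg instanceof NodeList && ((NodeList) arg).getLength() > 0)
                ? ((NodeList) arg).item(0).getTextContent()
                : String.valueOf(arg);
        return text.trim().matches(
            "(January|February|March|April|May|June|July|August|September"
            + "|October|November|December) \\d{4}");
    };

    /** Returns the text of the first line accepted by the function, or "" if there is none. */
    static String firstDateLine(String xml) {
        XPath xpath = XPathFactory.newInstance().newXPath();
        xpath.setNamespaceContext(new NamespaceContext() {
            public String getNamespaceURI(String prefix) { return "meta".equals(prefix) ? NS : ""; }
            public String getPrefix(String uri) { return NS.equals(uri) ? "meta" : null; }
            public Iterator<String> getPrefixes(String uri) {
                return NS.equals(uri) ? Collections.singletonList("meta").iterator()
                                      : Collections.<String>emptyIterator();
            }
        });
        xpath.setXPathFunctionResolver((name, arity) ->
            NS.equals(name.getNamespaceURI()) && "looks-like-date".equals(name.getLocalPart())
                && arity == 1 ? LOOKS_LIKE_DATE : null);
        try {
            return xpath.evaluate("//line[meta:looks-like-date(.)]",
                                  new InputSource(new StringReader(xml)));
        } catch (XPathExpressionException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(firstDateLine(
            "<page><line>AU/ACSC/138/2000-04</line><line>April 1999</line></page>"));
    }
}
```

A template rule could then use such a function inside a @select predicate in place of a longer pattern-matching expression.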
References:

Paper: “Automated Template-Based Meta-Data Extraction Architecture” by Paul Flynn, Li
Zhou, Kurt Maly, Steven Zeil, and Mohammad Zubair, ICADL 2007

White Paper: “An XPath based Template Language for Describing the placement of metadata within the document” by Dr. Steven Zeil

Book: “XSLT 2.0 and XPath 2.0 – A Programmer’s Reference” by Michael Kay
