DIATHESIS: OCR based semantic annotation of newspapers.

Martin Doerr, Georgios Markakis, Maria Theodoridou, Minas Tsikritzis
June 2007
Center for Cultural Informatics | Information Systems Laboratory | Institute of Computer Science | Foundation for Research and Technology - Hellas (FORTH) | Vassilika Vouton, P.O. Box 1385, GR-71110 Heraklion, Crete, Greece
Abstract

Digitization of historical newspapers nowadays constitutes an essential means for the preservation and dissemination of this material and for the creation of large scale, distributed digital archives. Currently there are several approaches for rendering this type of digitized material searchable. The first approach relies on the manual completion of metadata fields regarding the physical aspects of the digitized material. The second relies on the manual creation of semantic relations describing and linking the digitized content. Finally, the last approach makes use of OCR (optical character recognition) technology for full text indexing purposes.
Each of the above mentioned approaches has its advantages and disadvantages. In the DIATHESIS digital library system we tried to implement a novel hybrid approach based on the three previous ones. The system provides the user with a fully web based interface that enables her to annotate specific segments of a document, extract the OCR text that corresponds to each segment and describe the segment with detailed metadata based on the CIDOC conceptual reference model.
1 Introduction.
Historical newspapers are one of the most significant sources of information for researchers due to the wealth of information they provide regarding every aspect of everyday political, social and intellectual life. Access to this type of archival material is usually obstructed by the following factors:
• In order to protect the archival material from potential damage, some archives prohibit access to the largest part of their collection.

• Direct contact with the original archival material constitutes a potential health hazard (due to dust and fungi).

• The lack of indexes to newspapers, combined with the vastness of information contained in them, makes research a very time consuming task.

• This type of material is usually dispersed geographically among many archives or private collections.
Many archives adopted digitization of newspapers as a straightforward method
to deal with the above problems. Digitized material is easier to preserve and
much easier to distribute via the Web. However, conversion of archival material into a digital image format (e.g. JPEG, TIFF, PDF or DJVU) does not by itself solve the problem of rapid access to this material. Digitization is inadequate if it does not provide the means of accessing the digitized material in a timely and accurate manner (also known as the searchability issue).
A common practice for rendering digitized newspaper material searchable is to produce a basic set of manually completed metadata tags that are meant to describe the whole document/issue and enable the user to access the digitized archive via basic keyword search applied to these fields. A more advanced variation of this practice is the use of ontologies in order to create semantic relations that describe and semantically link the digitized content with other digital objects, or semantically annotate parts of it.
Another practice frequently met in digital archive digitization projects is the conversion of the original material into a digital format, accompanied by the extraction of the free text contained in the original image via the use of Optical Character Recognition techniques. In this case the extracted full text is used for indexing and presentation purposes. For future reference we call the first case the physical features based classification approach, the second the conceptual classification approach and the third the OCR based full text indexing approach to newspaper archival material.
1.1 The physical features based classification approach.
Some real world examples of this approach are the cases of the National Library of Austria [1], the German National Library [7] and several other European institutions [4, 9, 6]. In these cases archival material was converted into multimedia material (JPEG or DJVU images) and classified with a basic set of metadata (number of issue, date of publication, newspaper name, number of pages etc.). The result of these digitization attempts resembled a browsing mechanism rather than a conventional search engine. The final user could browse through an existing catalog of online newspapers in a similar manner to the way she would access the original material via a conventional catalog.
The advantages of this approach lie in its simplicity: the main task in the digitization process is the conversion of the original material into a digital format, while maintaining a relatively simple table in a relational database that holds the mappings between the digitized material and its metadata. However, this approach suffers from some serious drawbacks. The final user is unable to conduct full-text searches on an article or issue level basis. Given the plethora of information contained in a single newspaper issue, this approach maintains the "find the needle in the haystack" problems encountered in non digital newspaper archives. In addition, systems of this kind do not provide the researcher with an exact map representing the conceptual structure of the archive.
1.2 The conceptual classification approach.
The conceptual classification approach overcomes many of the above weaknesses by enabling the user to perform a knowledge engineering task upon the already digitized material via the use of ontologies. The latter differ radically in scope and nature compared to the metadata used in the previous approach:

• Ontologies are used to express a specific conceptual view over the digitized material. The often quoted definition of an ontology is "the specification of one's conceptualization of a knowledge domain". In contrast to the flat metadata that describe physical aspects of the archival material (newspaper name, number of issue), ontologies use different conceptual schemes in order to declare relations between instances of an a priori given set of classes (top level ontologies).
• The use of top level ontologies guarantees to a certain extent the semantic interoperability among different archives. The main idea is that via the adoption of the common conceptual framework provided by a top level ontology, several institutions could easily create a common semantic based search mechanism. Currently there are several top level domain ontologies that can be used for this purpose (CIDOC CRM [17], Dublin Core [5]).
• In some cases the produced metadata may allow the researcher or intelligent agents to perform logical inference queries exploiting logical relations declared between conceptual entities. Tim Berners-Lee stressed on several occasions this need for machine understandable metadata as the foundation of his Semantic Web vision [26, 27].
• In addition, the user may use concepts to classify the document that are not initially contained within the document itself (i.e. classifying the document as a pre-WWII era document or a document concerning educational policies). This is more an act of interpretation of the document by an expert rather than a description of its contents, an act which remains to this day an exclusively human virtue.
There have already been some attempts to use this semantic approach for archival purposes. The Neptuno project [15] for instance (although it did not involve the preservation and classification of existing archival material but the creation of new material) was an attempt to infuse the article creation process with semantic properties. Neptuno provided an authoring environment where the article editor was able to create content along with its semantic description.
On the cultural documentation/preservation systems domain there have already been some attempts to adopt this ontology oriented approach. The ARCHON system [23, 32] was such an initial attempt¹ of the Institute of Computer Science at FORTH to perform annotation tasks upon digitized material. In this approach the user was encouraged to mark up parts of the representation on his/her medium of choice (e.g. areas on a document image) and associate local information (e.g. file name, spatial or temporal co-ordinates etc.) with text, images, sound clips, video or sub-parts of them without altering the original material.

¹ A Multimedia System for Archival, Annotation and Retrieval of Historical Documents, Jan. 1997 - June 1998. An ICS-FORTH project originally used for the digitization of 100,000 documents of the Vikelaia Municipal Library of Heraklion.

Figure 1: The ARCHON System Architecture.
The Encyclopedia of Chicago [34] managed to create an online encyclopedia by semantically linking information comprising historical articles, images, videos and audio clips. This material originated from the original contents of the Encyclopedia of Chicago, which was digitized and semantically interrelated via the use of semantic web technologies and Dublin Core metadata. The same approach was followed in the Pergamos system in the University of Athens Digital Library digitization project [20], where original manuscripts were digitized and interlinked via the use of SW technologies.
Despite the indisputable precision advantages of ontology based systems, this approach suffers from some serious drawbacks, most of them concerning the efficiency of the knowledge building process:
• Given the density of information in a newspaper, production of metadata is a notoriously time consuming task (the knowledge engineering bottleneck).
• Due to the above reason it is almost impossible to manually define all the semantic relations or entities contained even in a single article. For instance, one can imagine an article containing a long list of participants in a demonstration (more than 50 names in total). The manual conversion of all these names into entities in an RDF graph is an almost prohibitive task. Information Extraction techniques could speed up the whole process by enabling the semi-automatic extraction of named entities from full text, but unfortunately in historical newspapers the required data are usually not available in a textual electronic format. On the other hand, if one is obliged to omit valuable information in order to speed up the classification process (i.e. omit some names of no particularly great historical significance), this information will not be accessible via a search mechanism.
• Article level search is usually not supported or is hard to implement. Manually defining areas of the document is a possible option; however, this is a time consuming activity that further narrows the existing knowledge acquisition bottleneck.
1.3 The OCR based Full Text Indexing Approach.
The knowledge acquisition deficiencies of the latter approach are handled efficiently by automatic digitization approaches that make use of OCR analysis of digitized newspapers. Several institutions [2, 3, 11, 12, 10, 13, 8, 14] adopted this approach in order to digitize efficiently a large volume of their newspaper archives, either by use of commercial software [2] (like Olive Software's Active Paper Reader or dtSearch) or by implementing their own tailor made solutions [12]. There are two main trends in this approach:
1. The semi-automatic full text indexing approach is conducted roughly as follows: in a first phase the newspaper page is converted into a digital format, and then via a supervised OCR process it is broken down into its constituent parts (articles). In the process the page and its constituent articles are transformed into PDF files that contain both the full text and its correlated image segment. In the next phase a group of humans manually creates an XML file that links all the separate article PDF files with the specific page, and contains specific metadata regarding that page. Finally the produced PDF and XML files are bundled into a ZIP file and they are imported en masse into a full text indexing mechanism [24].
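The bundling step above can be sketched with the standard library; the file names and the XML layout here are illustrative, since the paper does not specify the actual package format:

```python
import io
import zipfile
import xml.etree.ElementTree as ET

def bundle_page(page_id: str, article_pdfs: dict[str, bytes]) -> bytes:
    """Bundle article PDFs and a linking XML file into one import ZIP."""
    # Build the XML file that links the article PDFs to their page.
    page = ET.Element("page", id=page_id)
    for name in article_pdfs:
        ET.SubElement(page, "article", pdf=name)
    # Write everything into an in-memory ZIP archive.
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        zf.writestr("page.xml", ET.tostring(page, encoding="unicode"))
        for name, data in article_pdfs.items():
            zf.writestr(name, data)
    return buf.getvalue()

package = bundle_page("1923-05-04-p1", {"article-01.pdf": b"%PDF-..."})
```

One such ZIP per page is then handed to the full text indexing mechanism, which explains the bulky-import drawback discussed later.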
2. The fully automated full text indexing approach: the archival material is again analyzed via an OCR process and imported en masse into a full text indexing mechanism in a similar manner to the previous approach. The main difference in this case is that the linking of the page segments that articles consist of, as well as the partial extraction of metadata contained in certain fields (i.e. the article title), is conducted in an automatic, unsupervised manner via the application of a heuristic algorithm. However, this algorithm is to a large extent context dependent: it depends on the specific layout structure of a given newspaper as well as the language the articles are written in.²

² A fine example of this approach was the British Library's newspaper digitization project [2, 28]. In this case a Bitmap Zoning algorithm was used to automatically identify article regions within the newspaper issue. However, different versions of the same algorithm were used for the analysis of newspapers printed before 1900, due to significant structural differences of these newspapers compared to the ones published during the 20th century.

Full Text Indexing techniques are currently considered to be the state of the art in the area of newspaper digitization, mainly for the following reasons:

• Creation of a searchable full text index via OCR is a much faster process compared to the manual creation of metadata. In the case study of the British Library's digitization project it is estimated that over 20,000 pages of historic documents and newspaper pages containing some 500,000 articles have been processed and rendered searchable in only two months [28]. In the case of Utah's digitization project a total of 30,000 pages have been processed in less than a year.
• Separation of searchability and readability [28]: by the term searchability we refer to the means by which the digitized material is rendered accessible via the use of conventional keyword search methods. A common practice in these techniques is that the full text produced via the OCR analysis of the document is used for full text indexing purposes. It is generally acknowledged that OCR analysis of historical documents (although it has greatly improved during the last few years) never produces a 100% exact textual replica of the original digitized image [21]. In order to overcome the inadequacies of OCR recognition, most of these systems implemented fuzzy search methods that produce satisfactory results even in the presence of slightly corrupted text. However, most of the texts produced via an OCR process cannot be used for presentation purposes (the readability issue) due to their poor quality. In order to overcome this problem, image segmentation techniques were used to extract the appropriate image segment from the issue image and bundle it with the appropriate text (usually in a PDF format). This approach solves both the searchability and the readability issues, since the textual part of the PDF is used for indexing purposes and the image part for presentation purposes.
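The fuzzy search methods mentioned above can be approximated with a small sketch: score each indexed token against the query term by string similarity and accept near misses, so that slightly corrupted OCR tokens still match.

```python
from difflib import SequenceMatcher

def fuzzy_match(query: str, token: str, threshold: float = 0.8) -> bool:
    """Accept a token if it is 'close enough' to the query term."""
    return SequenceMatcher(None, query.lower(), token.lower()).ratio() >= threshold

def fuzzy_search(query: str, ocr_text: str) -> list[str]:
    """Return all OCR tokens that approximately match the query."""
    return [t for t in ocr_text.split() if fuzzy_match(query, t)]

# 'newspaper' still matches the OCR-corrupted token 'newspap3r'.
hits = fuzzy_search("newspaper", "the newspap3r archive of 1923")
```

Production systems use far more efficient indexed structures (n-gram or edit-distance indexes), but the tolerance principle is the same.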
• It is possible to conduct searches at a page/issue/article level: due to the flexibility and the segmentation capabilities of this OCR based approach, it is possible for these systems to fetch the exact article the researcher was looking for. This fact alone dramatically increases the precision/recall factor compared to the two previous approaches.
• The search is conducted via keywords in a manner that is familiar to the average user of contemporary Web search engines (Google, Yahoo etc.).
• This approach also addresses the problem of efficient content dissemination over the Web: since the page is segmented into its constituent parts, the latter can be used to represent a specific search result item to the final user. By following this practice, the overall browsing experience is greatly improved and network congestion problems are avoided.
Despite its advantages this approach still suffers from some serious drawbacks:
• Even though nowadays the bag of words approach constitutes the predominant search paradigm for the Web, it still suffers from some well known precision/recall issues. In conventional search engines the user has to cope with thousands of results that are essentially irrelevant to the user's needs. Some search engines have invented ranking/filtering mechanisms in order to deal with the above problem (see for instance Google's PageRank mechanism). Unfortunately, such mechanisms are not currently implemented in this OCR based full text indexing approach. This fact, combined with the inherent imperfection of the OCR
produced text (even though this imperfection is handled to a certain degree by fuzzy search methods), means that the retrieval capabilities of full text based systems of this kind are necessarily inferior to the retrieval capabilities of their original counterparts (web search engines).
• Newspaper archives are not as chaotic as the Web: the Web itself is by nature an unstructured, continuously expanding universe of HTML documents that refer to conceptually diverse material. Under these circumstances, the use of purely information retrieval solutions inevitably appeared as the most pragmatic approach to bringing order to chaos. Historical archives, on the other hand, are not as chaotic in nature: in most cases the digitized material is produced by a usually overwhelming, but yet finite, number of newspaper issues of a collection. It is common practice for humanities scholars to classify historical references contained in newspaper articles under different categories and seek information by referring to these categories [19] (i.e. by historical periods, types of activities, types of actors etc.). This potential categorization essentially requires the scholar's intervention during the indexing process of the digitized material and constitutes in some sense the conceptual structure of the archive. Such an underlying structure is almost entirely absent in OCR based information retrieval systems.
• The search for information in OCR based information retrieval systems is conceptually blind: the user usually attempts to perform a blind keyword search within the contents of the digitized archive, without an explicit notion of the semantics conveyed by the archive content. A semantic structure of this kind could be used to provide the user with a concise description of the archive's contents. It could also provide a conceptual guide used in conjunction with full text search queries in order to improve the overall precision factor of the system.
• This approach uses bulky import files, thus rendering the import process a computationally expensive procedure. The ZIP files used for the import process consist of the original image file and its segments in PDF format, and an XML file that links these files together. This practically means that the same page is indexed and stored in the content repository twice: once as a whole document and once as the sum of its constituent parts.
2 A fourth approach: hybrid annotation upon OCR results.
The main rationale behind the creation of the DIATHESIS system [30] was to extend the conceptual classification approach in order to perform a conceptual modeling task upon the already digitized material by adding semantic value to newspaper segments. This system attempts to implement a realistic conceptual classification approach by combining the best elements from the three approaches mentioned above:
1. It permits searches on a newspaper issue basis (newspaper issue name, number, publication date) in a similar manner to the physical features based classification approach.

2. It permits searches on an article level basis via the use of full text queries in a similar manner to the OCR based full text indexing approach.

3. It permits searches on an article level basis via the semantic relationships assigned to each segment.

4. It permits searches that combine all of the above elements.
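The four combined search modes can be illustrated with a toy filter over annotated segments; the field names are illustrative, not the system's actual schema:

```python
# Toy corpus of annotated segments (illustrative field names).
segments = [
    {"issue": "1923-05-04", "type": "demonstration",
     "place": "Heraklion", "text": "workers marched through the city"},
    {"issue": "1923-05-04", "type": "election",
     "place": "Chania", "text": "candidates addressed the voters"},
]

def search(keyword: str = "", **metadata: str) -> list[dict]:
    """Match segments on full text and/or on any metadata field."""
    return [s for s in segments
            if keyword.lower() in s["text"].lower()
            and all(s.get(k) == v for k, v in metadata.items())]

# Mode 4: a full text keyword restricted by a semantic field.
hits = search("marched", place="Heraklion")
```

Passing only `keyword` gives mode 2, passing only metadata fields gives modes 1 and 3, and passing both gives the combined mode 4.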
Since this approach is basically an extension of the conceptual classification approach, our main concern was to provide efficient means of speeding up the knowledge creation process. It does so by implementing the following strategy:
• Document annotation is based on a single faceted model based on the CIDOC CRM RDF schema. The newspaper page is modeled as an instance of the Document class (E31.Document) that refers to instances of the Activity class (E7.Activity). By adopting such a fixed conceptualization, we managed to speed up the annotation process to a large extent and deal with the semantic interoperability issues encountered in purely full text indexing implementations.
• The system adopts a rather shallow semantic annotation approach: the resulting RDF triples do not aim at the formation of a coherent semantic network. Instead, they combine full text search and metadata based search in a manner that improves the average precision factor of the system. In simple words, DIATHESIS tries to improve information filtering by assigning semantic properties to full text search. At the same time the system produces a robust semantic backbone that can be used for the construction of a coherent semantic network in future implementations.
• OCR results have a dual use in our system: they are used for searchability and presentation purposes in a similar manner to the one described in the full text indexing approach. In addition, they become building materials for a user friendly annotation environment that allows the user to easily define areas of interest (an article or a set of articles) given the OCR produced segments.
At the core of this system lies the concept of document annotation. It is generally acknowledged that annotation is a frequently encountered scholarly practice during the document understanding process [29]. Because of its popularity as a practice in the real world, annotation interfaces constitute a design pattern commonly met in cultural documentation systems. According to a distinction made by Doerr et al. [33] we can distinguish annotation types in such systems under the following three main categories:
1. Annotations by medium:
(a) Lexical or hyperlink annotations.
(b) Visual (icon or highlighting) annotations.
(c) Acoustic (audio signals) annotations.
2. Annotations by locality of reference: annotations may refer to

(a) Entire texts.
(b) Parts of texts.
(c) Both.
3. Annotations by process:

(a) Textual annotation: involves adding some form of free text commentary to a resource, aimed primarily at human readers.
(b) Link annotation: provides information in the form of the contents of a link destination, other than an explicit piece of text or other data.
(c) Semantic annotation: assigns markup elements according to a specified model which takes values from controlled vocabularies; it aims at both human readers and software agents.
According to the first two parts of this classification, the type of annotation performed via the DIATHESIS interface is a partial visual annotation of newspaper material. However, according to the third part of this classification, the type of annotation encountered in this system has multiple dimensions:
• It is textual: upon annotation of a specific segment, the user assigns the full text that corresponds to this selection as an attribute of the created object.

• It represents a link: the annotation indicates a link to the specific coordinates that define a segment of the annotated document.

• It is semantic: the annotation indicates a semantic relationship on two levels. The first level concerns the relationship of the annotated segment to its parent document (newspaper). The second level concerns the relationship of the segment to other semantic entities (Actor, Stuff, Place etc.).
Due to its multidimensional nature we could classify this type of annotation as a hybrid one. Hybrid annotations are of particularly great importance for they enable us to assign semantic properties to whole regions of text. By assigning such properties to text we are able to create a semantic context that dramatically increases the precision of full text queries. The importance of such a semantic context in the querying process has also been acknowledged in the COLLATE project [35].
2.1 The CIDOC CRM Core as a conceptual basis of document annotation.
As mentioned above, the underlying semantics of DIATHESIS are based on the CIDOC Conceptual Reference Model [31]. The CIDOC Conceptual Reference Model (CRM) is a high-level ontology intended to enable information integration for cultural heritage data and their correlation with library and archive information. It is the culmination of over 10 years of work by an interdisciplinary team, and has been accepted by the International Organization for Standardization as standard ISO 21127. The CIDOC CRM is intended to promote a shared understanding of cultural heritage information by providing a common and extensible semantic
Figure 2: CIDOC based semantic relationships between the parent document
and its constituent parts.
framework to which any cultural heritage information can be mapped. The CRM provides definitions and a formal structure for describing the implicit and explicit concepts and relationships utilized in cultural heritage documentation. It is intended as a common language for domain experts and implementers to formulate requirements for information systems, and to serve as a guide for good practice in conceptual modeling. The CRM can thus provide the semantic glue that is needed to mediate between different sources of cultural heritage information, such as museums, libraries and archives.
In the DIATHESIS system the newspaper page is modeled as an instance of the Document class (E31.Document) that refers to instances of the Activity class (E7.Activity). This essentially means that each annotation upon the newspaper issue indicates a refers_to (reference) relationship between a document and an activity or a sum of activities mentioned within the document's text. Practically this means that each article in the text is modeled as an activity or a sum of activities. As a consequence, in this implementation a Document object may link to more than one Activity object.
The second type of semantic relationship encountered in this implementation concerns the relationships stated between the declared activity and other CIDOC CRM classes. More specifically, the following classes of the CIDOC CRM have been used:

1. E2.Temporal_Entity: indicates the specific time period when the activity mentioned in the document took place. Usually this is the time of the edition of the newspaper issue; however, a single article might refer to a different time period (as in the case of historical references). A P4F.has_time-span attribute is used to establish the semantic relationship with the parent Activity class and contains a pair of long numbers that indicate a specific time-span.
2. E39.Actor: this class refers to all the actors that are considered participants in the mentioned activity. There is a P14F.carried_out_by relationship between the Actor and Activity objects.

3. E70.Stuff: this class refers to all the objects that were used in the mentioned activity. There is a P16F.used_specific_object relationship between the Stuff and Activity classes.

4. E53.Place: this class refers to the geographic location where the activity took place. It has a P7F.took_place_at relationship with the Activity class.

5. E55.Type: this class indicates the type of activity that took place. It has a P2F.has_type relationship with the Activity class.

6. A string containing the full text included in the original segment. This has a P3F.has_note relationship with the Activity class, which is basically a pseudo-semantic relationship indicating that the extracted text is a note upon the defined segment. Although this does not indicate an actual semantic relationship, the inclusion of the full text in this semantic model enables us to submit queries that exploit both the full text and the metadata contained in each Activity object.
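The modeling above can be sketched as plain subject-predicate-object triples; the identifiers below are illustrative, not the system's actual URIs:

```python
# A minimal sketch of the CRM-based triples produced for one annotated
# segment; identifiers and the refers_to property ID are illustrative.
triples = [
    # Document-to-segment level: the issue refers to the annotated activity.
    ("doc:issue-1923-05-04", "crm:refers_to", "act:demonstration-01"),
    # Activity-to-entity level: metadata attached to the activity.
    ("act:demonstration-01", "crm:P4F.has_time-span", "(19230501, 19230504)"),
    ("act:demonstration-01", "crm:P14F.carried_out_by", "actor:workers-union"),
    ("act:demonstration-01", "crm:P7F.took_place_at", "place:heraklion"),
    ("act:demonstration-01", "crm:P2F.has_type", "type:demonstration"),
    # Pseudo-semantic note: the raw OCR text of the segment, as a literal.
    ("act:demonstration-01", "crm:P3F.has_note", "Yesterday the workers..."),
]

def objects_of(subject: str, predicate: str) -> list[str]:
    """Return all objects linked to `subject` via `predicate`."""
    return [o for s, p, o in triples if s == subject and p == predicate]
```

A metadata query then reduces to matching on a predicate, e.g. `objects_of("act:demonstration-01", "crm:P7F.took_place_at")`, while the P3F.has_note literal remains available for full text matching.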
Despite the fact that this implementation is based on a purely ontocentric model, it does not aim at the creation of a closed semantic network in the fashion of the ARCHON system. In a purely semantic network implementation the leaves of this semantic tree structure would be links to distinct RDF resources that could be linked to each other via property nodes, thus deepening the produced semantic net to an infinite length. In this implementation, however, the leaves are not resources but plain literals. The literals are either assigned directly by the user (directly typewritten or copied from the full text) or they are contained in reserved vocabularies (thesauri). The reserved vocabularies are centrally managed and stored in a thesaurus repository (SIS-TMS [18]), arranged in a hierarchical structure that reflects the broader term - narrower term relationships that exist between them. These hierarchies represent the domain specific (context dependent) vocabulary that is used and maintained by a group of annotators and serves as the conceptual basis for the interpretation of a selected event.
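The broader term - narrower term arrangement can be sketched as a simple parent map; the terms used here are invented examples, not entries of the actual SIS-TMS vocabularies:

```python
# Sketch of a hierarchical thesaurus: each term maps to its broader term.
broader = {
    "demonstration": "political activity",
    "election": "political activity",
    "political activity": "activity",
}

def broader_chain(term: str) -> list[str]:
    """Walk from a term up to the top of its hierarchy."""
    chain = []
    while term in broader:
        term = broader[term]
        chain.append(term)
    return chain
```

Such a chain lets a query for a broad term (e.g. "political activity") also retrieve segments annotated with any of its narrower terms.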
3 Technical Implementation
DIATHESIS uses the Fedora digital library system as its backend. The latter is
an ontocentric content management system specially designed for the creation
of digital library collections [25] that permits the creation of semantic net structures linking digitized cultural material. At its core is a digital object model
that supports multiple views of each digital object and the relationships among
these objects. Digital objects can encapsulate locally-managed content or make
reference to remote content.
Three kinds of Fedora object prototypes were used in this implementation:
Figure 3: CIDOC based semantic properties of a user defined newspaper segment.
1. Entry Objects: These objects contain the information regarding the newspaper issue. They consist of one RELS-EXT datastream that contains the
semantic information described in the E31.Document class and an auxiliary datastream called RELATIONS that contains the relations between
the current issue, its constituent parts (pages), and its declared segments
(articles).
2. Metadata Objects: these objects contain the information regarding each newspaper segment defined by the user. They consist of one RELS-EXT datastream that contains the semantic information described in the E7.Activity class. They also contain an internally managed XML datastream (MAPPINGS) that holds information regarding the exact coordinates of the selected segment and the exact pages where the segment is located.
3. Image Objects: this type of object contains the data required for the visualization of the newspaper page. It contains three types of datastreams: one internally managed IMAGE datastream that contains the JPEG image thumbnail used for presentation purposes, an IMAGEBIG datastream that holds a reference to the original scan (TIFF image) stored in a remote location for preservation purposes, and an IMGXML datastream that contains an internally managed XML file representing the document segments and the text produced by the OCR process. In addition this object uses a disseminator, whose main purpose is to bind the IMAGE and IMGXML datastreams together. Upon invocation, the disseminator performs an XSL transformation upon the contents of the two datastreams in order to produce the interactive SVG element that visualizes the segmented document page. The produced SVG element is the semantic canvas upon which the document annotation task will take place.
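The disseminator's output can be approximated as follows: given segment coordinates taken from the IMGXML datastream, emit an SVG overlay with one selectable rectangle per segment. The attribute names and dimensions are illustrative; the actual transformation is an XSL stylesheet, not Python:

```python
import xml.etree.ElementTree as ET

def segments_to_svg(image_href: str, segments: list[dict]) -> str:
    """Render the page image with one selectable <rect> per OCR segment."""
    svg = ET.Element("svg", xmlns="http://www.w3.org/2000/svg",
                     width="800", height="1100")
    # The page thumbnail forms the background of the canvas.
    ET.SubElement(svg, "image", href=image_href, width="800", height="1100")
    # Each OCR segment becomes a clickable outline on top of it.
    for seg in segments:
        ET.SubElement(svg, "rect", id=seg["id"],
                      x=str(seg["x"]), y=str(seg["y"]),
                      width=str(seg["w"]), height=str(seg["h"]),
                      style="fill:none;stroke:red")
    return ET.tostring(svg, encoding="unicode")

canvas = segments_to_svg("page1.jpg",
                         [{"id": "seg-1", "x": 10, "y": 20, "w": 300, "h": 150}])
```

The annotation interface can then attach click handlers to the `rect` elements so that selecting one (or several) of them defines a new Metadata Object.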
3.1 System Workflow and the Import Process.
The initial goal is to populate the system with Entry objects, which represent the newspaper issue as a whole. Entry objects are in some sense the raw material from which the annotator can extract segments manually. The ingest process is initiated via the administration menu and has two prerequisite elements:
1. An XML file that defines the location of the directories containing the digitized material per issue (and optionally some metadata concerning this issue) and
2. A specific folder structure that holds the digitized material (thumbnail JPEG and original TIFF images).
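An ingest description file of the kind referred to in point 1 might look roughly like the following; the element names and folder layout are assumptions for illustration, not the system's actual schema:

```xml
<!-- Hypothetical ingest description; element names are illustrative. -->
<import>
  <issue>
    <directory>/archive/vikelaia/1935-01-12/</directory>
    <metadata>
      <title>Sample Newspaper</title>
      <date>1935-01-12</date>
    </metadata>
  </issue>
  <!-- one <issue> element per digitized newspaper issue -->
</import>
```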
These two elements are the product of the digitization process carried out by a team of digitization specialists. Upon completion, the import process produces a series of Entry objects, their corresponding Image objects, as well as the semantic relationships that link these objects together. The produced material is stored in the Fedora repository and forms the raw material upon which another group of annotators will perform the knowledge extraction task. The document annotation team uses the annotation interface of DIATHESIS to isolate specific segments of text on the produced SVG canvas, producing in this way a series of Metadata Objects. The produced objects are used by the search mechanism to conduct searches at the article level or the document level.

Figure 4: The DIATHESIS system workflow.
3.2 The System Architecture.
The system consists of two separate web applications that are bundled in a customized Fedora installation package. The data are stored in and retrieved from a Fedora repository.
The system consists of three main modules:
1. The Image Annotation Application: This is an AJAX-based application that is divided into three tabs. The first tab (Document Annotation Tab) permits the user to select a specific Entry Object (newspaper issue) and use a specially designed GUI to define new Metadata Objects and correlate them with specific regions of each page. The second tab (Issue Metadata Tab) contains metadata that describe the document issue as a whole. Finally, the third tab (Metadata Tab) displays the fields required for the classification of a specific segment. There are two kinds of fields used in the Metadata Tab: free text fields (where data is typed directly or copied from the OCR text) and reserved vocabulary fields (where keywords are selected from SIS-TMS generated SVG tree structures).
Upon submitting the document and its annotated regions, the system recursively creates and ingests into the Fedora repository a series of Metadata Objects which hold the OCRed text contained in the specific page, the coordinates of the specific segment on the given page, and a set of CIDOC compliant metadata describing the annotated segment.

Figure 5: The Document Annotation Tab.

Figure 6: The Metadata Tab.
2. The Administration Application: This is the administration portion of the application. Through it, the administrator can import new material into the database, supervise the import process, modify the structure of the metadata used by the system, and obtain statistical information regarding the use of the system (document insertion rate, metadata creation rate, etc.). The administration screen uses an embedded ActiveX control in order to integrate an instance of the ABBYY FineReader OCR software with the import process.
3. The Viewer Application: This is an AJAX-based application designed to provide rapid access to the digitized material over the web. The user can conduct searches at either the article level or the issue level. The download mechanism is based on a flexible architecture that allows the timely download of the retrieved material. As in most OCR based implementations, the system allows the user to retrieve the exact segment of the document that contains the submitted search criteria.
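What the annotation submission produces per segment can be sketched roughly as follows; the field names are illustrative assumptions, not the actual RELS-EXT / MAPPINGS datastream schema:

```python
# A rough sketch of the information bundled into one Metadata Object when an
# annotated region is submitted. All field names here are hypothetical.
def build_metadata_object(segment_id, issue_pid, page, coords,
                          ocr_text, cidoc_props):
    return {
        "pid": segment_id,
        # Semantic information modelled on the CIDOC-CRM E7.Activity class,
        # plus the structural link back to the parent issue.
        "RELS-EXT": {"partOf": issue_pid, **cidoc_props},
        # Exact location of the segment: page number and pixel coordinates.
        "MAPPINGS": {"page": page, "coords": coords},
        # Text extracted by the OCR process for the selected region.
        "ocr_text": ocr_text,
    }

obj = build_metadata_object(
    "diathesis:seg-001", "diathesis:issue-042", page=3,
    coords=(120, 340, 410, 520), ocr_text="…article text…",
    cidoc_props={"P4_has_time-span": "1935-01-12"})
```

The search mechanism can then answer article-level queries from such objects and issue-level queries from the Entry objects they point back to.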
4 Future Research
The system uses an efficient user interface in order to speed up the annotation process. There is undoubtedly a significant improvement in speed of classification over the manual ontocentric classification approach. However, this approach is still significantly slower than fully automated OCR based systems, since it requires the user's involvement during the classification process.
In order to overcome this problem and narrow the cost/production gap between the hybrid annotation and the OCR based full text indexing approach, we plan to implement information extraction techniques for the semi-automatic extraction of named entities from full text. The successful implementation of such an approach could radically speed up the annotation process and relieve the user from the repetitive task of manually locating the names of actors, places, or objects within the text of countless articles. Information extraction techniques have already been used successfully in digital library systems for this purpose (see the Perseus Digital Library project [22]). We currently focus on adaptive information extraction techniques [16] due to their relative independence from fixed, domain- or language-specific gazetteers.
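As a toy illustration of what such semi-automatic extraction could offer, even a naive capitalization heuristic (a stand-in for, not an instance of, the adaptive techniques of [16]) can surface candidate names for the annotator to confirm or reject:

```python
import re

def candidate_entities(text):
    """Naive stand-in for information extraction: propose runs of
    capitalized words as candidate named entities. A real adaptive IE
    system would learn such patterns from annotated examples instead."""
    return re.findall(r"[A-Z][a-z]+(?:\s+[A-Z][a-z]+)*", text)

# Candidates are suggestions only; the annotator keeps the final word.
print(candidate_entities(
    "the mayor of Heraklion met Martin Doerr"))  # → ['Heraklion', 'Martin Doerr']
```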
5 Conclusion
In DIATHESIS we managed to implement a novel approach to the digitization of historical newspapers. This approach takes into account the advantages and disadvantages of most digitization approaches for this type of material. It uses OCR technology for the structural and textual analysis of scanned newspaper material, in a similar manner to the OCR based approach. However, unlike that approach, the data produced by the OCR processing of scanned material are not imported directly into a full text indexing mechanism. Instead, they are used for the creation of a highly flexible annotation interface. The system allows users to perform hybrid annotations on the digitized material, in this way assigning semantic properties to specific regions of text.
The search interface enables the user to conduct searches at the document level and the article level via the execution of full text queries. In addition, it provides a semantic framework that enables the user to combine full text and metadata based searches. By allowing this type of query execution, the system provides a semantic filter that greatly improves the precision of the conducted searches. It also uses a robust top level domain ontology (CIDOC-CRM) in order to ensure that the produced knowledge can be exchanged between different institutions. Finally, it provides a flexible presentation mechanism that allows the partial download of the digitized material in order to improve the overall user experience and reduce download time.
Over time the DIATHESIS system has evolved into a stable, lightweight, easily deployable and highly configurable newspaper digitization suite. DIATHESIS is currently being used for the digitization of the Vikelaia Municipal Library's newspaper collection in Heraklion, Greece, where over 30,000 pages have already been classified and indexed. The same system has also been used successfully for the digitization of the Filekpaideytiki Etaireia Archive in Athens, and there are also plans to use it for the digitization of the I AVGHI newspaper archive.
References
[1] ANNO: Austrian newspapers online project (http://deposit.ddb.de/online/exil/exil.htm).
[2] British Library online newspaper archive (http://www.uk.olivesoftware.com/).
[3] The Brooklyn Daily Eagle online (http://www.brooklynpubliclibrary.org/eagle/).
[4] Denmark: Digitaliserede danske aviser 1759-1865 (http://www.statsbiblioteket.dk).
[5] The Dublin Core metadata initiative: http://dublincore.org/.
[6] Estonia: Digiteeritud eesti ajalehed (http://dea.nlib.ee/).
[7] "Exilpresse digital. Deutsche Exilzeitschriften 1933-1945" project (http://deposit.ddb.de/online/exil/exil.htm).
[8] Historical newspapers in Washington (http://www.secstate.wa.gov/history/newspapersname.aspx).
[9] Iceland: the VESTNORD project (1696-2002) (http://www.timarit.is/listi.jsp?lang=4&t=).
[10] Krueger Library Winona newspaper project (http://www.winona.edu/library/databases/winonanewspaperproject.htm).
[11] Northern New York historical newspapers (http://news.nnyln.net/).
[12] Utah digital newspapers (http://www.lib.utah.edu/digital/unews/).
[13] Wisconsin local history and biography articles (http://www.wisconsinhistory.org/wlhba/).
[14] Balakrishnan. Universal digital library: future research directions. Journal of Zhejiang University SCIENCE, 11:1204-1205, 2005.
[15] Pablo Castells, F. Perdrix, E. Pulido, Mariano Rico, V. Richard Benjamins, Jesús Contreras, and J. Lorés. Neptuno: Semantic web technologies for a digital newspaper archive. In Christoph Bussler, John Davies, Dieter Fensel, and Rudi Studer, editors, ESWS, volume 3053 of Lecture Notes in Computer Science, pages 445-458. Springer, 2004.
[16] F. Ciravegna, A. Dingli, D. Petrelli, and Y. Wilks. Timely and non-intrusive active document annotation via adaptive information extraction. In Workshop Semantic Authoring Annotation and Knowledge Management (European Conf. Artificial Intelligence), 2002.
[17] Martin Doerr and Nicholas Crofts. Electronic Esperanto: The role of the object oriented CIDOC reference model. In Proc. of ICHIM '99, Washington DC, September 22-26 1999.
[18] Martin Doerr and Irini Fundulaki. SIS-TMS: A thesaurus management system for distributed digital collections. In Proc. of the 2nd European Conference, ECDL'98, pages 215-234, Heraklion, Crete, Greece, September 1998.
[19] George Buchanan, Sally Jo Cunningham, Ann Blandford, Jon Rimmer, and Claire Warwick. Information seeking by humanities scholars. Lecture Notes in Computer Science, 3652/2005:218-229, September 2005.
[20] George Pyrounakis, Kostas Saidis, Mara Nikolaidou, and Vassilios Karakoidas. Introducing Pergamos: A Fedora-based DL system utilizing digital object prototypes. Lecture Notes in Computer Science, 4172/2006:500-503, September 2006.
[21] Susan Haigh. Optical character recognition (OCR) as a digitization technology. Technical report, Information Technology Services, National Library of Canada, November 15 1996.
[22] Jeffrey A. Rydberg-Cox, Robert F. Chavez, David A. Smith, Anne Mahoney, and Gregory R. Crane. Knowledge management in the Perseus digital library. Ariadne Journal, (25), September 2000.
[23] K. Chandrinos, J. Immerkaer, M. Doerr, and P. Trahanias. A visual tagging technique for annotating large-volume multimedia databases: a tool for adding semantic value to improve information rating. 5th DELOS Workshop "Filtering and Collaborative Filtering", European Research Consortium for Informatics and Mathematics (ERCIM), November 10-12 1997.
[24] Kenning Arlitsch, L. Yapp, and Karen Edge. The Utah digital newspapers project. D-Lib Magazine, 9(3), 2003.
[25] Carl Lagoze, Sandy Payette, Edwin Shin, and Chris Wilper. Fedora: An architecture for complex objects and their relationships, Aug 2005.
[26] Tim Berners-Lee. The world wide web: Past, present and future. IEEE Computer special issue of October 1996, 1996.
[27] Tim Berners-Lee. Realising the full potential of the web. Based on a talk presented at the W3C meeting, London, 1997.
[28] Marilyn Deegan, Edmund King, and Emil Steinvil. British Library microfilmed newspapers and Oxford grey literature online. Technical report, Oxford University, British Library, Olive Software Inc., 2003.
[29] C. Marshall. Annotation: from paper books to the digital library. In Proceedings of the ACM Digital Libraries '97 Conference, Philadelphia, pages 23-26, July 1997.
[30] Martin Doerr, Georgios Markakis, and Maria Theodoridou. Digital library of historical newspapers. ERCIM News: Special Issue on Digital Libraries, (66), July 2006.
[31] Nick Crofts, Martin Doerr, Tony Gill, Stephen Stead, and Matthew Stiff. Definition of the CIDOC Conceptual Reference Model. ICOM/CIDOC CRM Special Interest Group, 4.2.1 edition, October 2006.
[32] Panos Constantopoulos, Martin Doerr, Maria Theodoridou, and Manolis Tzobanakis. Historical documents as monuments and as sources. In Computer Applications and Quantitative Methods in Archaeology Conference, Heraklion, Greece, April 2002.
[33] Panos Constantopoulos, Martin Doerr, Maria Theodoridou, and Manolis Tzobanakis. On information organization in annotation systems. Intuitive Human Interface, LNAI 3359:189-200, 2004.
[34] Bill Parod. Encyclopedia of Chicago Fedora implementation. Technical report, Northwestern University, May 15 2005.
[35] Ulrich Thiel, Holger Brocks, Andrea Dirsch-Weigand, Andre Everts, Ingo Frommholz, and Adelheit Stein. Queries in context: Access to digitized historic documents in a collaboratory for the humanities. From Integrated Publication and Information Systems to Information and Knowledge Environments, pages 117-127, 2005.