DIATHESIS: OCR based semantic annotation of newspapers.

Martin Doerr, Georgios Markakis, Maria Theodoridou, Minas Tsikritzis

June 2007

Center for Cultural Informatics | Information Systems Laboratory | Institute of Computer Science | Foundation for Research and Technology - Hellas (FORTH) | Vassilika Vouton, P.O. Box 1385, GR-71110 Heraklion, Crete, Greece

Abstract

Digitization of historical newspapers nowadays constitutes an essential means for the preservation and dissemination of this material and for the creation of large scale, distributed digital archives. Currently there are several approaches for rendering this type of digitized material searchable. The first approach relies on the manual completion of metadata fields regarding the physical aspects of the digitized material. The second relies on the manual creation of semantic relations describing and linking the digitized content. Finally, the last approach makes use of OCR (optical character recognition) technology for full text indexing purposes. Each of these approaches has its advantages and disadvantages. In the DIATHESIS digital library system we implemented a novel hybrid approach that builds on all three. The system provides the user with a fully web based interface that enables her to annotate specific segments of a document, extract the OCR text that corresponds to each segment and describe the segment with detailed metadata based on the CIDOC Conceptual Reference Model.

1 Introduction.

Historical newspapers are one of the most significant sources of information for researchers due to the wealth of information they provide regarding every aspect of everyday political, social and intellectual life. Access to this type of archival material is usually obstructed by the following factors:

• In order to protect the archival material from potential damage, some archives prohibit access to the largest part of their collection.

• Direct contact with the original archival material constitutes a potential health hazard (due to dust and fungi).

• The lack of indexes to newspapers, combined with the vastness of information contained in them, makes research a very time consuming task.

• This type of material is usually dispersed geographically among many archives or private collections.

Many archives adopted digitization of newspapers as a straightforward method to deal with the above problems. Digitized material is easier to preserve and much easier to distribute via the Web. However, conversion of archival material into a digital image format (e.g. JPEG, TIFF, PDF or DJVU) does not by itself solve the problem of rapid access to this material. Digitization is inadequate if it does not provide the means of accessing the digitized material in a timely and accurate manner (also known as the searchability issue).

A common practice for rendering digitized newspaper material searchable is to produce a basic set of manually completed metadata tags that describe the whole document/issue and enable the user to access the digitized archive via basic keyword search applied to these fields. A more advanced variation of this practice is the use of ontologies in order to create semantic relations that describe and semantically link the digitized content with other digital objects, or semantically annotate parts of it.
Another practice frequently met in digitization projects is the conversion of the original material into a digital format, accompanied by the extraction of the free text contained in the original image via the use of Optical Character Recognition techniques. In this case the extracted full text is used for indexing and presentation purposes. For future reference, we call the first case the physical features based classification approach, the second the conceptual classification approach, and the third the OCR based full text indexing approach to newspaper archival material.

1.1 The physical features based classification approach.

Some real world examples of this approach are the National Library of Austria [1], the German National Library [7] and several other European institutions [4, 9, 6]. In these cases archival material was converted into multimedia material (JPEG or DJVU images) and classified with a basic set of metadata (number of issue, date of publication, newspaper name, number of pages etc). The result of these digitization attempts resembled a browsing mechanism more than a conventional search engine: the final user could browse through an existing catalog of online newspapers much as she would access the original material via a conventional catalog.

The advantages of this approach lie in its simplicity: the main task in the digitization process is the conversion of the original material into a digital format, while maintaining a relatively simple table in a relational database that holds the mappings between the digitized material and its metadata. However, this approach suffers from some serious drawbacks. The final user is unable to conduct full-text searches on an article or issue level basis. Given the plethora of information contained in a single newspaper issue, this approach retains the find-the-needle-in-the-haystack problem encountered in non-digital newspaper archives. In addition, systems of this kind do not provide the researcher with an exact map representing the conceptual structure of the archive.

1.2 The conceptual classification approach.

The conceptual classification approach overcomes many of the above weaknesses by enabling the user to perform a knowledge engineering task upon the already digitized material via the use of ontologies. The latter differ radically in scope and nature from the metadata used in the previous approach:

• Ontologies are used to express a specific conceptual view over the digitized material. The often quoted definition of an ontology is "the specification of one's conceptualization of a knowledge domain". In contrast to the flat metadata that describe physical aspects of the archival material (newspaper name, number of issue), ontologies use different conceptual schemes in order to declare relations between instances of an a priori given set of classes (top level ontologies); a minimal sketch of such instance-level relations is given after this list.

• The use of top level ontologies guarantees to a certain extent the semantic interoperability among different archives. The main idea is that, via the adoption of the common conceptual framework provided by a top level ontology, several institutions could easily create a common semantic based search mechanism. Currently there are several top level domain ontologies that can be used for this purpose (CIDOC CRM [17], Dublin Core [5]).

• In some cases the produced metadata may allow the researcher or intelligent agents to perform logical inference queries exploiting logical relations declared between conceptual entities. Tim Berners-Lee has stressed on several occasions this need for machine understandable metadata as the foundation of his semantic web vision [26, 27].

• In addition, the user may classify the document with concepts that are not initially contained within the document itself (e.g. classifying the document as a pre-WWII era document or a document concerning educational policies). This is an act of interpretation of the document by an expert rather than a description of its contents, an act which remains to this day an exclusively human virtue.
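To illustrate the contrast between flat metadata and ontological relations, the following sketch declares typed relations between instances rather than filling in fields. It is a minimal illustration using the rdflib Python library; the namespace URIs, class names and property names are assumptions chosen for readability, not those of any particular archive.

```python
# A flat metadata record only fills in fields describing physical aspects:
flat_record = {"newspaper": "The Cretan Herald", "issue": 1234, "date": "1923-05-04"}

# An ontology instead relates instances of declared classes to each other.
from rdflib import Graph, Literal, Namespace

EX = Namespace("http://example.org/archive/")    # hypothetical instance namespace
ONT = Namespace("http://example.org/ontology/")  # hypothetical top level ontology

g = Graph()
issue = EX["issue-1234"]
event = EX["event-dock-strike"]

g.add((issue, ONT["documents"], event))                # relation between instances
g.add((event, ONT["took_place_at"], EX["Heraklion"]))  # further typed relations
g.add((event, ONT["has_time_span"], Literal("1923-05-04")))

print(g.serialize(format="turtle"))
```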
There have already been some attempts to use this semantic approach for archival purposes. The Neptuno project [15], for instance (although it involved the creation of new material rather than the preservation and classification of existing archival material), was an attempt to infuse the article creation process with semantic properties. Neptuno provided an authoring environment where the article editor was able to create content along with its semantic description.

In the domain of cultural documentation and preservation systems there have also been attempts to adopt this ontology oriented approach. The ARCHON system [23, 32] (A Multimedia System for Archival, Annotation and Retrieval of Historical Documents, Jan. 1997 - June 1998, an ICS-FORTH project originally used for the digitization of 100,000 documents of the Vikelaia Municipal Library of Heraklion) was such an initial attempt of the Institute of Computer Science at FORTH to perform annotation tasks upon digitized material. In this approach the user was encouraged to mark up parts of the representation on his/her medium of choice (e.g. areas on a document image) and associate local information (e.g. file name, spatial or temporal co-ordinates etc.) with text, images, sound clips, video or sub-parts of them, without altering the original material.

Figure 1: The ARCHON system architecture.

The Encyclopedia of Chicago [34] managed to create an online encyclopedia by semantically linking information comprising historical articles, images, videos and audio clips. This material originated from the original contents of the Encyclopedia of Chicago, which was digitized and semantically interrelated via the use of semantic web technologies and Dublin Core metadata. The same approach was followed in the Pergamos system in the University of Athens Digital Library digitization project [20], where original manuscripts were digitized and interlinked via the use of SW technologies.

Despite the indisputable precision capabilities of ontology based systems, this approach suffers from some serious drawbacks, most of them concerning the efficiency of the knowledge building process:

• Given the density of information in a newspaper, production of metadata is a notoriously time consuming task (the knowledge engineering bottleneck).

• For the above reason it is almost impossible to manually define all the semantic relations or entities contained even in a single article. For instance, one can imagine an article containing a long list of participants in a demonstration (more than 50 names in total). The manual conversion of all these names into entities in an RDF graph is an almost prohibitive task. Information Extraction techniques could speed up the whole process by enabling the semi-automatic extraction of named entities from full text, but unfortunately in historical newspapers the required data are usually not available in a textual electronic format. On the other hand, if one is obliged to omit valuable information in order to speed up the classification process (i.e. omit some names of no particularly great historical significance), this information will not be accessible via a search mechanism.

• Article level search is usually not supported, or is hard to implement. Manually defining areas of the document is a possible option; however, this is a time consuming activity that further narrows the existing knowledge acquisition bottleneck.
1.3 The OCR based Full Text Indexing Approach.

The knowledge acquisition deficiencies of the latter approach are handled efficiently by automatic digitization approaches that make use of OCR analysis of digitized newspapers. Several institutions [2, 3, 11, 12, 10, 13, 8, 14] adopted this approach in order to digitize efficiently a large volume of their newspaper archives, either by use of commercial software [2] (like Olive Software's Active Paper Reader or dtSearch) or by implementing their own tailor made solutions [12]. There are two main trends in this approach:

1. The semi-automatic full text indexing approach is conducted roughly as follows. In a first phase the newspaper page is converted into a digital format and then, via a supervised OCR process, broken down into its constituent parts (articles). In the process the page and its constituent articles are transformed into PDF files that contain both the full text and its correlated image segment. In the next phase a group of humans manually creates an XML file that links all the separate article PDF files with the specific page and contains specific metadata regarding that page. Finally, the produced PDF and XML files are bundled into a ZIP file and imported massively into a full text indexing mechanism [24] (a sketch of such a bundle is given after this list).

2. The fully automated full text indexing approach: the archival material is again analyzed via an OCR process and massively imported into a full text indexing mechanism in a similar manner to the previous approach. The main difference in this case is that the linking of the page segments that articles consist of, as well as the partial extraction of metadata contained in certain fields (e.g. article title), is conducted in an automatic, unsupervised manner via the application of a heuristic algorithm. However, this algorithm is to a large extent context dependent: it depends on the specific layout structure of a given newspaper as well as the language the articles are written in. (A fine example of this approach was the British Library's newspaper digitization project [2, 28]. In this case a Bitmap Zoning algorithm was used to automatically identify article regions within the newspaper issue. However, different versions of the same algorithm were used for the analysis of newspapers printed before 1900, due to significant structural differences of these newspapers compared to the ones published during the 20th century.)
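The following sketch shows the shape of the import bundle described in the semi-automatic approach: one PDF per article plus an XML file linking them to their page, packed into a ZIP file. The file names and XML element names are illustrative assumptions, not the exact format used by any of the cited systems.

```python
import zipfile
import xml.etree.ElementTree as ET

# Manually authored linking file: one <page> element referencing its articles.
page = ET.Element("page", {"issue": "1923-05-04", "number": "1"})
articles = ["article-001.pdf", "article-002.pdf"]  # each PDF holds image + OCR text
for name in articles:
    ET.SubElement(page, "article", {"file": name})

# Bundle everything for massive import into the full text indexing mechanism.
with zipfile.ZipFile("page-1923-05-04-p1.zip", "w") as bundle:
    bundle.writestr("page.xml", ET.tostring(page, encoding="unicode"))
    for name in articles:
        bundle.write(name)  # assumes the article PDFs exist in the working directory
```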
Full Text Indexing techniques are currently considered to be the state of the art in the area of newspaper digitization, mainly for the following reasons:

• Creation of a searchable full text index via OCR is a much faster process than the manual creation of metadata. In the case of the British Library's digitization project it is estimated that over 20.000 pages of historic documents and newspaper pages, containing some 500.000 articles, were processed and rendered searchable in only two months [28]. In the case of the Utah digitization project, a total of 30.000 pages were processed in less than a year.

• Separation of searchability and readability [28]: by the term searchability we refer to the means by which the digitized material is rendered accessible via conventional keyword search methods. A common practice in these techniques is that the full text produced via the OCR analysis of the document is used for full text indexing purposes. It is generally acknowledged that OCR analysis of historical documents (although it has greatly improved during the last few years) never produces a 100% exact replica of the original digitized image in textual format [21]. In order to overcome the inadequacies of OCR recognition, most of these systems implement fuzzy search methods that produce satisfactory results even in the presence of slightly corrupted text (a sketch of this idea is given after this list). However, most of the texts produced via an OCR process cannot be used for presentation purposes (the readability issue) due to their poor quality. To overcome this problem, image segmentation techniques are used to extract the appropriate image segment from the issue image and bundle it with the corresponding text (usually in a PDF format). This approach solves both the searchability and the readability issues, since the textual part of the PDF is used for indexing purposes and the image part for presentation purposes.

• It is possible to conduct searches on a page/issue/article level basis: due to the flexibility and the segmentation capabilities of this OCR based approach, it is possible for these systems to fetch the exact article the researcher was looking for. This fact alone increases dramatically the precision/recall factor compared to the two previous approaches.

• The search is conducted via keywords in a manner that is familiar to the average user of contemporary Web search engines (Google, Yahoo etc).

• This approach also addresses the problem of efficient content dissemination over the Web: since the page is segmented into its constituent parts, the latter can be used to represent a specific search result item to the final user. By following this practice the overall browsing experience is greatly improved and network congestion problems are avoided.
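As a minimal illustration of fuzzy search over noisy OCR output, the following sketch accepts a token when its edit-distance similarity to the query exceeds a threshold. Production systems use far more elaborate fuzzy indexes; the function name and threshold are assumptions for illustration only.

```python
from difflib import SequenceMatcher

def fuzzy_hits(query: str, ocr_tokens: list, threshold: float = 0.8) -> list:
    """Return the OCR tokens that approximately match the query term."""
    return [token for token in ocr_tokens
            if SequenceMatcher(None, query.lower(), token.lower()).ratio() >= threshold]

# 'demonstratlon' (an OCR confusion of i and l) still matches 'demonstration'.
print(fuzzy_hits("demonstration", ["demonstratlon", "harbour", "minlster"]))
```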
Despite its advantages, this approach still suffers from some serious drawbacks:

• Even though the bag of words approach nowadays constitutes the predominant search paradigm for the Web, it still suffers from some well known precision/recall issues. In conventional search engines the user has to cope with thousands of results that are essentially irrelevant to the user's needs. Some search engines have invented ranking/filtering mechanisms in order to deal with the above problem (see for instance Google's PageRank mechanism). Unfortunately, such mechanisms are not currently implemented in the OCR based full text indexing approach. This fact, combined with the inherent imperfection of the OCR produced text (even though this imperfection is handled to a certain degree by fuzzy search methods), necessarily means that the retrieval capabilities of full text based systems of this kind are inferior to the retrieval capabilities of their original counterparts (web search engines).

• Newspaper archives are not as chaotic as the Web: the Web itself is by nature an unstructured, continuously expanding universe of HTML documents that refer to conceptually diverse material. Under these circumstances, the use of purely information retrieval solutions inevitably appeared as the most pragmatic approach to bringing order to chaos. Historical archives, on the other hand, are not as chaotic in nature: in most cases the digitized material is produced by an overwhelming, but finite, number of newspaper issues of a collection. It is a common practice for humanities scholars to classify historical references contained in newspaper articles under different categories and to seek information by referring to these categories [19] (i.e. by historical periods, types of activities, types of actors etc). This potential categorization essentially requires the scholar's intervention during the indexing process of the digitized material and constitutes in some sense the conceptual structure of the archive. Such an underlying structure is almost entirely absent from OCR based information retrieval systems.

• The search of information in OCR based information retrieval systems is conceptually blind: the user usually attempts to perform a blind keyword search within the contents of the digitized archive, without an explicit notion of the semantics conveyed by the archive content. A semantic structure of this kind could be used to provide the user with a concise description of the archive's contents. It could also provide a conceptual guide to be used in conjunction with full text search queries in order to improve the overall precision factor of the system.

• This approach uses bulky import files, rendering the import process a computationally expensive procedure. The ZIP files used for the import consist of the original image file and its segments in PDF format, and an XML file that links these files together. This practically means that the same page is indexed and stored in the content repository twice: once as a whole document and once as the sum of its parts.

2 A fourth approach: hybrid annotation upon OCR results.

The main rationale behind the creation of the DIATHESIS system [30] was to extend the conceptual classification approach in order to perform a conceptual modeling task upon the already digitized material by adding semantic value to newspaper segments. The system attempts to implement a realistic conceptual classification approach by combining the best elements of the three approaches mentioned above:

1. It permits searches on a newspaper issue basis (newspaper issue name, number, publication date), in a similar manner to the physical features based classification approach.

2. It permits searches on an article level basis via the use of full text queries, in a similar manner to the OCR based full text indexing approach.

3. It permits searches on an article level basis via the semantic relationships assigned to each segment.

4. It permits searches that combine all of the above elements.

Since this approach is basically an extension of the conceptual classification approach, our main concern was to provide efficient means for speeding up the knowledge creation process. It does so by implementing the following strategy:

• Document annotation is based on a single faceted model based on the CIDOC CRM RDF schema. The newspaper page is modeled as an instance of a Document class (E31.Document) that refers to instances of the Activity class (E7.Activity). By adopting such a fixed conceptualization, we managed to speed up the annotation process to a large extent and to deal with the semantic interoperability issues encountered in purely full text indexing implementations.

• The system adopts a rather shallow semantic annotation approach: the resulting RDF triples do not aim at the formation of a coherent semantic network. Instead they combine full text search and metadata based search in a manner that improves the average precision factor of the system (a sketch of such a combined query is given after this list). In simple words, DIATHESIS tries to improve information filtering by assigning semantic properties to full text search. At the same time the system produces a robust semantic backbone that can be used for the construction of a coherent semantic network in future implementations.

• OCR results have a dual use in our system: they are used for searchability and presentation purposes in a similar manner to the one described in the full text indexing approach. In addition, they become the building material for a user friendly annotation environment that allows the user to easily define areas of interest (an article or a group of articles) given the OCR produced segments.
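The following sketch illustrates the combined query idea: a full text match filtered by the semantic properties attached to each segment. The record layout and field names are assumptions for illustration; they do not reflect the actual DIATHESIS query interface.

```python
def search(segments, text_query, activity_type=None, place=None):
    """Keyword search restricted by the semantic properties of each segment."""
    for seg in segments:
        if text_query.lower() not in seg["full_text"].lower():
            continue  # full text filter
        if activity_type and seg.get("type") != activity_type:
            continue  # semantic filters narrow the keyword hits
        if place and seg.get("place") != place:
            continue
        yield seg["id"]

segments = [
    {"id": "seg-1", "full_text": "... strike at the harbour ...",
     "type": "Strike", "place": "Heraklion"},
    {"id": "seg-2", "full_text": "... harbour repairs completed ...",
     "type": "Construction", "place": "Chania"},
]
print(list(search(segments, "harbour", activity_type="Strike")))  # ['seg-1']
```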
At the core of this system lies the concept of document annotation. It is generally acknowledged that annotation is a frequently encountered scholarly practice during the document understanding process [29]. Because of its popularity as a practice in the real world, annotation interfaces constitute a design pattern frequently met in cultural documentation systems. Following a distinction made by Doerr et al. [33], we can distinguish annotation types in such systems under three main categories:

1. Annotations by medium:
(a) Lexical or hyperlink annotations.
(b) Visual (icon or highlighting) annotations.
(c) Acoustic (audio signals) annotations.

2. Annotations by locality of reference; annotations may refer to:
(a) Entire texts.
(b) Parts of texts.
(c) Both.

3. Annotations by process:
(a) Textual annotation: involves adding some form of free text commentary to a resource, aimed primarily at human readers.
(b) Link annotation: provides information in the form of the contents of a link destination, rather than an explicit piece of text or other data.
(c) Semantic annotation: assigns markup elements according to a specified model which takes values from controlled vocabularies; it is aimed both at human readers and at software agents.

According to the first two parts of this classification, the type of annotation performed via the DIATHESIS interface is a partial visual annotation of newspaper material. According to the third part, however, the type of annotation encountered in this system has multiple dimensions:

• It is textual: upon annotation of a specific segment, the user assigns the full text that corresponds to this selection as an attribute of the created object.

• It represents a link: the annotation indicates a link to the specific coordinates that define a segment of the annotated document.

• It is semantic: the annotation indicates a semantic relationship on two levels. The first level concerns the relationship of the annotated segment to its parent document (newspaper). The second level concerns the relationship of the segment to other semantic entities (Actor, Stuff, Place etc).

Due to its multidimensional nature we classify this type of annotation as a hybrid one. Hybrid annotations are of particularly great importance, for they enable us to assign semantic properties to whole regions of text. By assigning such properties to text we are able to create a semantic context that dramatically increases the precision of full text queries. The importance of such a semantic context in the querying process has also been acknowledged in the COLLATE project [35].
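The following sketch shows what a single hybrid annotation could carry, reflecting the three dimensions listed above in one record. The class and field names are illustrative assumptions, not the internal data model of DIATHESIS.

```python
from dataclasses import dataclass, field

@dataclass
class HybridAnnotation:
    segment_id: str
    parent_document: str                 # link: the annotated newspaper issue
    coordinates: tuple                   # link: (page, x, y, width, height)
    full_text: str                       # textual: OCR text of the selection
    actors: list = field(default_factory=list)   # semantic: related entities
    places: list = field(default_factory=list)

annotation = HybridAnnotation(
    segment_id="seg-42",
    parent_document="issue-1923-05-04",
    coordinates=(1, 120, 340, 800, 260),
    full_text="The dockworkers assembled at the harbour ...",
    actors=["Dockworkers' Union"],
    places=["Heraklion"],
)
```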
2.1 The CIDOC CRM Core as a conceptual basis of document annotation.

As mentioned above, the underlying semantics of DIATHESIS are based on the CIDOC Conceptual Reference Model [31]. The CIDOC Conceptual Reference Model (CRM) is a high-level ontology intended to enable information integration for cultural heritage data and their correlation with library and archive information. It is the culmination of over 10 years of work by an interdisciplinary team and has been accepted by the International Organization for Standardization as standard ISO 21127. The CIDOC CRM is intended to promote a shared understanding of cultural heritage information by providing a common and extensible semantic framework to which any cultural heritage information can be mapped. The CRM provides definitions and a formal structure for describing the implicit and explicit concepts and relationships used in cultural heritage documentation. It is intended as a common language for domain experts and implementers to formulate requirements for information systems, and to serve as a guide for good practice of conceptual modeling. The CRM can thus provide the semantic glue needed to mediate between different sources of cultural heritage information, such as museums, libraries and archives.

Figure 2: CIDOC based semantic relationships between the parent document and its constituent parts.

In the DIATHESIS system the newspaper page is modeled as an instance of a Document class (E31.Document) that refers to instances of the Activity class (E7.Activity). This essentially means that each annotation upon the newspaper issue indicates a refers_to (reference) relationship between a document and an activity, or a sum of activities, mentioned within the document's text. Practically, this means that each article in the text is modeled as an activity or a sum of activities. As a consequence, in this implementation a Document object links to more than one Activity object.

The second type of semantic relationship encountered in this implementation concerns the relationships stated between the declared activity and other CIDOC CRM classes. More specifically, the following CIDOC classes have been used:

1. E2.Temporal_Entity: indicates the specific time period when the activity mentioned in the document took place. Usually this is the time of the edition of the newspaper issue; however, a single article might refer to a different time period (as in the case of historical references). A P4F.has_time-span attribute is used to establish the semantic relationship with the parent Activity class and contains a pair of long numbers that indicate a specific time-span.

2. E39.Actor: refers to all the actors that are considered participants in the mentioned activity. There is a P14F.carried_out_by relationship between the Actor and Activity objects.

3. E70.Stuff: refers to all the objects that were used in the mentioned activity. There is a P16F.used_specific_object relationship between the Stuff and Activity classes.

4. E53.Place: refers to the geographic location where the activity took place. It has a P7F.took_place_at relationship with the Activity class.

5. E55.Type: indicates the type of activity that took place. It has a P2F.has_type relationship with the Activity class.

6. A string containing the full text included in the original segment. It has a P3F.has_note relationship with the Activity class, which is basically a pseudo-semantic relationship indicating that the extracted text is a note upon the defined segment. Although this does not indicate an actual semantic relationship, the inclusion of the full text in this semantic model enables us later to submit queries that exploit both the full text and the metadata contained in each Activity object.
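A minimal rdflib sketch of the triples produced for one annotated segment under this model follows. The namespace URI and resource identifiers are assumptions for illustration; the property names are taken from the list above.

```python
from rdflib import Graph, Literal, Namespace

CRM = Namespace("http://example.org/cidoc-crm/")   # assumed base URI for the schema
EX = Namespace("http://example.org/diathesis/")

g = Graph()
document = EX["issue-1923-05-04"]                  # E31.Document
activity = EX["activity-7"]                        # E7.Activity

g.add((document, CRM["refers_to"], activity))              # document -> activity
g.add((activity, CRM["P4F.has_time-span"], EX["span-1"]))  # E2.Temporal_Entity
g.add((activity, CRM["P14F.carried_out_by"], EX["dockworkers-union"]))  # E39.Actor
g.add((activity, CRM["P7F.took_place_at"], EX["Heraklion"]))            # E53.Place
g.add((activity, CRM["P2F.has_type"], EX["Strike"]))                    # E55.Type
g.add((activity, CRM["P3F.has_note"],
       Literal("The dockworkers assembled at the harbour ...")))        # full text
```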
Despite the fact that this implementation is based on a purely ontocentric model, it does not aim at the creation of a closed semantic network in the fashion of the ARCHON system. In a purely semantic network implementation, the leaves of this semantic tree structure would be links to distinct RDF resources that could be linked to each other via property nodes, deepening the produced semantic net to an infinite length. In this implementation, however, the leaves are not resources but plain literals. The literals are either assigned directly by the user (directly typewritten or copied from the full text) or contained in reserved vocabularies (thesauri). The reserved vocabularies are centrally managed and stored in a thesaurus repository (SIS-TMS [18]), arranged in a hierarchical structure that reflects the broader term - narrower term relationships that exist between them. These hierarchies represent the domain specific (context dependent) vocabulary that is maintained by a group of annotators and used as the conceptual basis for the interpretation of a selected event.

3 Technical Implementation

DIATHESIS uses the Fedora digital library system as its backend. The latter is an ontocentric content management system specially designed for the creation of digital library collections [25] that permits the creation of semantic net structures linking digitized cultural material. At its core is a digital object model that supports multiple views of each digital object and the relationships among these objects. Digital objects can encapsulate locally managed content or make reference to remote content.

Figure 3: CIDOC based semantic properties of a user defined newspaper segment.

Three kinds of Fedora object prototypes were used in this implementation:

1. Entry Objects: these objects contain the information regarding the newspaper issue. They consist of one RELS-EXT datastream that contains the semantic information described in the E31.Document class, and an auxiliary datastream called RELATIONS that contains the relations between the current issue, its constituent parts (pages) and its declared segments (articles).

2. Metadata Objects: these objects contain the information regarding each newspaper segment defined by the user. They consist of one RELS-EXT datastream that contains the semantic information described in the E7.Activity class. They also contain an internally managed XML datastream (MAPPINGS) that holds information regarding the exact coordinates of the selected segment and the exact pages where the segment is located.

3. Image Objects: these objects contain the data required for the visualization of the newspaper page. Each contains three types of datastreams: an internally managed IMAGE datastream that contains the JPEG image thumbnail used for presentation purposes, an IMAGEBIG datastream that holds a reference to the original scan (TIFF image) stored in a remote location for preservation purposes, and an IMGXML datastream that contains an internally managed XML file representing the document segments and the text produced by the OCR process. In addition, this object uses a disseminator whose main purpose is to bind the IMAGE and IMGXML datastreams together. Upon invocation, the disseminator performs an XSL transformation upon the contents of the two datastreams in order to produce the interactive SVG element that visualizes the segmented document page. The produced SVG element is the semantic canvas upon which the document annotation task will take place.
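The disseminator's role can be illustrated with the following sketch, which produces an SVG overlay binding the page image to clickable rectangles for the OCR produced segments. The real system derives this via an XSL transformation of the IMAGE and IMGXML datastreams; the Python version below, with its made-up segment data and JavaScript handler name, only illustrates the shape of the output.

```python
# (id, x, y, width, height) for each segment, as reported by the OCR analysis.
segments = [
    ("seg-1", 40, 60, 700, 220),
    ("seg-2", 40, 300, 700, 180),
]

rects = "\n".join(
    f'  <rect id="{sid}" x="{x}" y="{y}" width="{w}" height="{h}"'
    f' fill="transparent" stroke="red" onclick="annotate(\'{sid}\')"/>'
    for sid, x, y, w, h in segments
)

svg = (
    '<svg xmlns="http://www.w3.org/2000/svg" width="800" height="1200">\n'
    '  <image href="page-1.jpg" width="800" height="1200"/>\n'
    f"{rects}\n"
    "</svg>"
)
print(svg)  # an interactive canvas: each rectangle selects a segment for annotation
```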
3.1 System Workflow and the Import Process.

The initial goal is to populate the system with Entry objects, which represent the newspaper issue as a whole. Entry objects are in some sense the raw material from which the annotator can extract segments manually. The ingest process is initiated via the administration menu and has two prerequisites:

1. An XML file that defines the location of the directories containing the digitized material per issue (and optionally some metadata concerning each issue), and

2. A specific folder structure that holds the digitized material (thumbnail JPEG and original TIFF images).

These two elements are the product of the digitization process carried out by a team of digitization specialists. Upon completion, the import process produces a series of Entry objects, their corresponding Image objects, and the semantic relationships that link these objects together (a sketch of this step is given below). The produced material is stored in the Fedora repository and forms the raw material upon which another group of annotators performs the knowledge extraction task.

Figure 4: The DIATHESIS system workflow.

The document annotation team uses the annotation interface of DIATHESIS in order to isolate specific segments of text upon the produced SVG canvas, producing this way a series of Metadata Objects. The produced objects are used by the search mechanism to conduct searches on an article level or a document level basis.
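The following sketch illustrates the ingest step: it reads a manifest that names one directory per issue and pairs each thumbnail JPEG with its TIFF original, at which point the corresponding Entry and Image objects would be created. The manifest element names and folder layout are assumptions for illustration, not the actual DIATHESIS import format.

```python
import os
import xml.etree.ElementTree as ET

manifest = ET.parse("import.xml").getroot()        # the XML file of prerequisite 1
for issue in manifest.findall("issue"):
    folder = issue.get("dir")                      # e.g. "issues/1923-05-04"
    date = issue.get("date")                       # optional issue metadata
    pages = sorted(f for f in os.listdir(folder) if f.endswith(".jpg"))
    for jpg in pages:
        tiff = jpg.replace(".jpg", ".tif")         # original scan, kept for preservation
        # Here the system would create an Entry object for the issue, one Image
        # object per page, and the relationships linking them together.
        print(f"ingest issue {date}: {folder}/{jpg} (thumbnail), {folder}/{tiff} (original)")
```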
3.2 The system architecture

The system consists of two separate web applications that are bundled in a customized Fedora installation package. The data are stored and retrieved within a Fedora repository. It comprises three main modules:

1. The image annotation application: this is an AJAX based application that is divided into three tabs. The first tab (Document Annotation Tab) permits the user to select a specific Entry Object (newspaper issue) and use a specially designed GUI to define new Metadata Objects and correlate them with specific regions of each page. The second tab (Issue Metadata Tab) contains metadata that describe the document issue as a whole. Finally, the third tab (Metadata Tab) displays the fields required for the classification of a specific segment. There are two kinds of fields used in the Metadata Tab: free text fields (where data is directly typewritten or copied directly from the OCR text) and reserved vocabulary fields (where keywords are selected from SIS-TMS generated SVG tree structures). Upon submission of the document and its annotated regions, the system recursively creates and ingests into the Fedora repository a series of Metadata Objects which hold information regarding the OCRed text contained in the specific page, the coordinates of the specific segment on the given page, and a set of specific CIDOC compliant metadata describing the annotated segment.

Figure 5: The Document Annotation Tab.

Figure 6: The Metadata Tab.

2. The administration application: this is the administration portion of the system. Through it, the administrator can import new material into the database, supervise the import process, modify the structure of metadata used by the system, and obtain statistical information regarding the use of the system (document insertion rate, metadata creation rate etc). The administrator screen uses an embedded ActiveX control in order to integrate an instance of the ABBYY FineReader OCR software with the import process.

3. The viewer application: this is an AJAX based application designed to provide rapid access to the digitized material over the Web. The user can conduct searches either on an article level or an issue level basis. The download mechanism is based on a flexible architecture that allows the timely download of the retrieved material. In a similar manner to most OCR based implementations, the system allows the user to retrieve the exact segment of the document that contains the search criteria submitted by the user.

4 Future Research

The system uses an efficient user interface in order to speed up the annotation process. Undoubtedly there is a significant improvement in terms of speed of classification over the manual ontocentric classification approach. However, this approach is still significantly slower than the fully automated OCR based systems, to the extent that it requires the user's involvement during the classification process. In order to overcome this problem and narrow the cost/production gap between hybrid annotation and the OCR based full text indexing approach, we plan to implement information extraction techniques for the semi-automatic extraction of named entities from full text. The successful implementation of such an approach could speed up the annotation process radically and relieve the user from the repetitive task of manually locating names of actors, places or objects within the text of countless articles. Information extraction techniques have already been used successfully in digital library systems for this purpose (see the Perseus Digital Library project [22]). We currently focus on Adaptive Information Extraction techniques [16] due to their relative independence from fixed, domain/language specific gazetteers.
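A minimal sketch of this intended semi-automatic step is shown below: candidate named entities are extracted from a segment's OCR text and proposed as Actor and Place annotations. It uses spaCy with an off-the-shelf English model purely as an illustrative NER engine; the project itself targets adaptive information extraction techniques [16], which are not shown here.

```python
import spacy

nlp = spacy.load("en_core_web_sm")  # illustrative pretrained pipeline, not the planned one

def suggest_annotations(ocr_text: str) -> dict:
    """Propose Actor/Place candidates for the annotator to confirm or reject."""
    doc = nlp(ocr_text)
    return {
        "actors": [ent.text for ent in doc.ents if ent.label_ in ("PERSON", "ORG")],
        "places": [ent.text for ent in doc.ents if ent.label_ == "GPE"],
    }

print(suggest_annotations("Mr. Venizelos addressed the dockworkers in Heraklion."))
```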
5 Conclusion

In DIATHESIS we managed to implement a novel approach to the digitization of historical newspapers. This approach takes into account the advantages and the disadvantages of most digitization approaches for this type of material. It uses OCR technology for the structural and textual analysis of scanned newspaper material, in a similar manner to the OCR based approach. However, unlike that approach, the data produced by the OCR processing of scanned material are not imported directly into a full text indexing mechanism. Instead they are used for the creation of a highly flexible annotation interface. The system allows the users to perform hybrid annotations upon the digitized material, assigning this way semantic properties to specific regions of text. The search interface enables the user to conduct searches on a document level and an article level basis via the execution of full text queries. In addition, it provides a semantic framework that enables the user to combine full text and metadata based searches. By allowing this type of query execution the system provides a semantic filter that greatly improves the precision of the conducted searches. It also uses a robust top level domain ontology (CIDOC CRM) in order to ensure that the produced knowledge can be exchanged between different institutions. Finally, it provides a flexible presentation mechanism that allows the partial download of the digitized material in order to improve the overall user experience and reduce download time.

Over time the DIATHESIS system has evolved into a stable, lightweight, easily deployable and highly configurable newspaper digitization suite. DIATHESIS is currently being used for the digitization of the Vikelaia Municipal Library's newspaper collection in Heraklion, Greece, where over 30.000 pages have already been classified and indexed. The same system has also been successfully used for the digitization of the Filekpaideytiki Etaireia archive in Athens, and there are plans to use it for the digitization of the I AVGHI newspaper archive.

References

[1] ANNO: Austrian Newspapers Online project (http://anno.onb.ac.at/).

[2] British Library online newspaper archive (http://www.uk.olivesoftware.com/).

[3] The Brooklyn Daily Eagle online (http://www.brooklynpubliclibrary.org/eagle/).

[4] Denmark: Digitaliserede danske aviser 1759-1865 (http://www.statsbiblioteket.dk).

[5] The Dublin Core Metadata Initiative (http://dublincore.org/).

[6] Estonia: Digiteeritud eesti ajalehed (http://dea.nlib.ee/).

[7] "Exilpresse digital. Deutsche Exilzeitschriften 1933-1945" project (http://deposit.ddb.de/online/exil/exil.htm).

[8] Historical newspapers in Washington (http://www.secstate.wa.gov/history/newspapersname.aspx).

[9] Iceland: the VESTNORD project (1696-2002) (http://www.timarit.is/listi.jsp?lang=4&t=).

[10] Krueger Library Winona newspaper project (http://www.winona.edu/library/databases/winonanewspaperproject.htm).

[11] Northern New York historical newspapers (http://news.nnyln.net/).

[12] Utah digital newspapers (http://www.lib.utah.edu/digital/unews/).

[13] Wisconsin local history and biography articles (http://www.wisconsinhistory.org/wlhba/).

[14] Balakrishnan. Universal digital library: future research directions. Journal of Zhejiang University SCIENCE, 11:1204-1205, 2005.

[15] Pablo Castells, F. Perdrix, E. Pulido, Mariano Rico, V. Richard Benjamins, Jesús Contreras, and J. Lorés. Neptuno: Semantic web technologies for a digital newspaper archive. In Christoph Bussler, John Davies, Dieter Fensel, and Rudi Studer, editors, ESWS, volume 3053 of Lecture Notes in Computer Science, pages 445-458. Springer, 2004.

[16] F. Ciravegna, A. Dingli, D. Petrelli, and Y. Wilks. Timely and non-intrusive active document annotation via adaptive information extraction. In Workshop on Semantic Authoring, Annotation and Knowledge Management (European Conf. Artificial Intelligence), 2002.

[17] Martin Doerr and Nicholas Crofts. Electronic Esperanto: the role of the object oriented CIDOC reference model. In Proc. of ICHIM '99, Washington DC, September 22-26, 1999.

[18] Martin Doerr and Irini Fundulaki. SIS-TMS: a thesaurus management system for distributed digital collections. In Proc. of the 2nd European Conference on Digital Libraries, ECDL'98, pages 215-234, Heraklion, Crete, Greece, September 1998.

[19] George Buchanan, Sally Jo Cunningham, Ann Blandford, Jon Rimmer and Claire Warwick. Information seeking by humanities scholars. Lecture Notes in Computer Science, 3652/2005:218-229, September 2005.
[20] George Pyrounakis, Kostas Saidis, Mara Nikolaidou and Vassilios Karakoidas. Introducing Pergamos: a Fedora-based DL system utilizing digital object prototypes. Lecture Notes in Computer Science, 4172/2006:500-503, September 2006.

[21] Susan Haigh. Optical character recognition (OCR) as a digitization technology. Technical report, Information Technology Services, National Library of Canada, November 15, 1996.

[22] Jeffrey A. Rydberg-Cox, Robert F. Chavez, David A. Smith, Anne Mahoney and Gregory R. Crane. Knowledge management in the Perseus digital library. Ariadne Journal, (25), September 2000.

[23] K. Chandrinos, J. Immerkaer, M. Doerr and P. Trahanias. A visual tagging technique for annotating large-volume multimedia databases: a tool for adding semantic value to improve information rating. In 5th DELOS Workshop "Filtering and Collaborative Filtering", European Research Consortium for Informatics and Mathematics (ERCIM), November 10-12, 1997.

[24] Kenning Arlitsch, L. Yapp and Karen Edge. The Utah digital newspapers project. D-Lib Magazine, 9(3), 2003.

[25] Carl Lagoze, Sandy Payette, Edwin Shin, and Chris Wilper. Fedora: an architecture for complex objects and their relationships, August 2005.

[26] Tim Berners-Lee. The World Wide Web: past, present and future. IEEE Computer, special issue of October 1996, 1996.

[27] Tim Berners-Lee. Realising the full potential of the Web. Based on a talk presented at the W3C meeting, London, 1997.

[28] Marilyn Deegan, Edmund King and Emil Steinvil. British Library microfilmed newspapers and Oxford grey literature online. Technical report, Oxford University, British Library, Olive Software Inc., 2003.

[29] C. Marshall. Annotation: from paper books to the digital library. In Proceedings of the ACM Digital Libraries '97 Conference, Philadelphia, pages 23-26, July 1997.

[30] Martin Doerr, Georgios Markakis and Maria Theodoridou. Digital library of historical newspapers. ERCIM News: Special Issue on Digital Libraries, (66), July 2006.

[31] Nick Crofts, Martin Doerr, Tony Gill, Stephen Stead and Matthew Stiff. Definition of the CIDOC Conceptual Reference Model. ICOM/CIDOC CRM Special Interest Group, 4.2.1 edition, October 2006.

[32] Panos Constantopoulos, Martin Doerr, Maria Theodoridou and Manolis Tzobanakis. Historical documents as monuments and as sources. In Computer Applications and Quantitative Methods in Archaeology Conference, Heraklion, Greece, April 2002.

[33] Panos Constantopoulos, Martin Doerr, Maria Theodoridou and Manolis Tzobanakis. On information organization in annotation systems. In Intuitive Human Interface, LNAI 3359, pages 189-200, 2004.

[34] Bill Parod. Encyclopedia of Chicago Fedora implementation. Technical report, Northwestern University, May 15, 2005.

[35] Ulrich Thiel, Holger Brocks, Andrea Dirsch-Weigand, Andre Everts, Ingo Frommholz, and Adelheit Stein. Queries in context: access to digitized historic documents in a collaboratory for the humanities. In From Integrated Publication and Information Systems to Information and Knowledge Environments, pages 117-127, 2005.