OR2015_paper_panel_templatex

tranScriptorium: computer-aided, crowd-sourced
transcription of handwritten text, for repositories?
Rory McNicholl, University of London, [email protected]
Dr Tim Miles-Board, University of London, [email protected]
Session Type (select one)
 Panel
 Presentation
Abstract
Over the past 10 or so years, cultural heritage organisations across Europe have invested
significantly in digitising historical collections of handwritten documents. If well planned,
the output of these digitisation projects may end up in a repository or similar system, thus
improving access to document images. Can this improvement in access be further enhanced?
The tranScriptorium project is a European Commission FP7-funded project (2013-2015) that
brings together a suite of tools for computer-aided transcription and enhancement
of digitised handwritten material. These software tools include those for document image
analysis (DIA) developed by the National Centre for Scientific Research (Greece), handwritten text
recognition (HTR) developed by the Universitat Politecnica de Valencia (Spain) and natural
language models (NLM) developed by the Institute of Dutch Lexicology, Universiteit Leiden
(Netherlands). Because the project required that these tools be available to other systems, they
have been developed to operate as software services.
The project included the development of a desktop application (University of Innsbruck, Austria)
and a crowd-sourcing platform (University College London and University of London Computer
Centre, UK) that use the DIA, HTR and NLM outputs to arrive at computer-aided transcription
solutions, designed to improve the efficiency and reduce the cost of transcribing
handwritten documents.
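The service-based design described above can be pictured as a chain of three stages, each of which could be replaced by a call to a remote service. The following is only a minimal sketch under stated assumptions: the function names, the `Line` data shape and the stubbed return values are all hypothetical and do not reflect the project's actual service interfaces, which are not described here.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple


@dataclass
class Line:
    """One text line located on a page image (hypothetical data shape)."""
    region: Tuple[int, int, int, int]                     # x, y, width, height in pixels
    candidates: List[str] = field(default_factory=list)   # HTR reading hypotheses
    text: str = ""                                        # reading chosen by the language model


def analyse_layout(image_id: str) -> List[Line]:
    """Stand-in for a DIA service: returns the text lines found on a page."""
    return [Line(region=(0, 0, 800, 40)), Line(region=(0, 50, 800, 40))]


def recognise(lines: List[Line]) -> List[Line]:
    """Stand-in for an HTR service: attaches candidate readings to each line."""
    for number, line in enumerate(lines):
        line.candidates = [f"reading {number}a", f"reading {number}b"]
    return lines


def rank(lines: List[Line]) -> List[Line]:
    """Stand-in for an NLM service: keeps the first candidate of each line."""
    for line in lines:
        line.text = line.candidates[0] if line.candidates else ""
    return lines


def transcribe(
    image_id: str,
    dia: Callable[[str], List[Line]] = analyse_layout,
    htr: Callable[[List[Line]], List[Line]] = recognise,
    nlm: Callable[[List[Line]], List[Line]] = rank,
) -> List[str]:
    """Chain the three stages; each stage can be swapped independently."""
    return [line.text for line in nlm(htr(dia(image_id)))]
```

Because each stage is passed in as a plain callable, a desktop client or a crowd-sourcing platform could reuse the same chain while substituting its own implementation of any single stage, and could hand the resulting draft to a human transcriber for correction.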
Conference Themes
Select the conference theme(s) your proposal best addresses:
 Supporting Open Scholarship, Open Science, and Cultural Heritage
 Managing Research (and Open) Data
 Integrating with External Systems
 Re-using Repository Content
 Exploring Metrics and Assessment
 Managing Rights
 Developing and Training Staff
 Building the Perfect Repository
Keywords
Handwritten text recognition, transcription, crowd-sourcing, cultural heritage
Audience
Librarians, archivists, repository managers, historians, digital humanists, philologists and
linguists.
Background
“Looking back” aligns with the target material of this technology: digitised handwritten
manuscripts, papers, letters and similar documents. Transcription of cultural heritage material
enhances discovery and enables new avenues of research. “Looking forward”, the technologies
developed by the project partners are cutting edge and have the potential to significantly
enhance the discovery, reuse and interoperability of digitised historical (and other) handwritten
texts held in repositories.
Presentation content
- Some context: a huge cultural heritage resource remains “hidden” even after digitisation,
  because automatic transcription is often not possible and manual transcription is too costly.
- Many repositories contain significant amounts of digitised historical documents with
  either no transcription or only patchy transcription.
- Outline of the tranScriptorium project, the partners involved and some of their
  previous projects.
- The individual technologies involved in the project and how they are combined to form a
  transcription workflow.
- Demonstration of the tranScriptorium platform(s).
- Document management for the transcription platforms and how repository platforms may
  play a part (providing the source materials, managing review of crowd-sourced
  transcriptions, etc.).
- A precursor to tranScriptorium, the Transcribe Bentham project, does not involve any
  automation, yet achieves a rate of 100 submitted transcripts per week from
  volunteers. Combining automation with a manual crowd-sourcing element can
  make transcription of large collections an affordable reality.
- What can we do with transcriptions? Enhanced discoverability (via indexed handwritten
  documents), searching within documents, TEI, readability and accessibility.
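The discoverability point in the list above can be illustrated with a small inverted index built over page transcripts. This is a sketch only: the page identifiers and sample text are invented for illustration, the function names are hypothetical, and a repository would more likely delegate this work to its existing search engine.

```python
import re
from collections import defaultdict
from typing import Dict, List, Set


def tokenize(text: str) -> List[str]:
    """Lower-case the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(transcripts: Dict[str, str]) -> Dict[str, Set[str]]:
    """Map each token to the set of page identifiers whose transcript contains it."""
    index: Dict[str, Set[str]] = defaultdict(set)
    for page_id, text in transcripts.items():
        for token in tokenize(text):
            index[token].add(page_id)
    return index


def search(index: Dict[str, Set[str]], query: str) -> Set[str]:
    """Return the pages whose transcript contains every token of the query."""
    tokens = tokenize(query)
    if not tokens:
        return set()
    pages = set(index.get(tokens[0], set()))
    for token in tokens[1:]:
        pages &= index.get(token, set())
    return pages


# Invented sample transcripts, keyed by hypothetical page identifiers.
SAMPLE = {
    "page-1": "A letter concerning the design of the prison",
    "page-2": "Notes on the reform of the law",
    "page-3": "A second letter on the reform of the prison",
}
```

Without a transcript, none of these pages could be found by their content; with one, a query such as `search(build_index(SAMPLE), "prison letter")` can be answered from the index alone.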
Conclusion
Looking back - there has been much effort to digitise, describe, store and publish historical
written material, and repositories of various kinds have played an important role in this effort.
Looking further back, there is still a vast amount of human knowledge inside such documents that
remains hidden to some degree from recent communication revolutions.
Looking forwards - although repositories have already played a role in safeguarding and
enhancing the description and cataloguing of historical documents, the next step for such
repositories is to interact with new tools that have the potential to unlock the whole document.
They can do so both by providing a platform from which transcription tools can access resources
and by playing a part in capturing and disseminating the enhancements those tools provide.