
Born-Digital Cultural Content
Preservation and Permanent Access
Ben Fino-Radin
It can no longer be said that our world is becoming digital. In an era where the
names of websites are vernacular verbs, we have arrived. We are already swimming
in a vast sea of information and cultural content that has never existed in any form
but digital. Everything – from journalism, books, and music, to visual art – is
becoming digital in creation and consumption. We are entering the era of a born-digital cultural legacy. The term born-digital1 has been used to describe the current generation of youth. These future adults are growing up in a world completely saturated with digital devices, and cannot imagine a time before YouTube existed.
While this is not the respect in which we will be using the term, its use is evocative
of our current transitional point in history. When we talk about born-digital
cultural content, this describes content that was created digitally, and is
experienced within the digital environment. Specific examples would include
emails, web pages, digital photographs, text messages, text files, and this paper
itself. Some strictly define born-digital material as having no analog counterpart; however, this is too closed a definition. Rather, we can say that this paper is born-digital in the sense that its primary form is a file on a computer – its physical form is simply an analog duplication of its primary form. This paper will explore the biases of the digital medium, and the unique challenges it presents to the field of preservation.
1 Palfrey, John G., and Urs Gasser. Born Digital: Understanding the First Generation of Digital Natives. New York: Basic, 2008. Print.
Let us first investigate the born-digital manuscript, and the process of writing.
Except for a dwindling number of traditionalists, most writers compose their works digitally. Early adopters of the personal computer and word processing, such as Salman Rushdie, have been making use of the digital environment as their creative medium since the early 1980s. As is the case with any medium, the computer carries its own biases and tendencies. The computer affects everyone from creators to consumers in a myriad of ways that we are only beginning to understand2. The
manuscript, and the process of writing are heavily affected by what are possibly the
medium's greatest strengths, and greatest challenges: malleability and
impermanence of editing3. All physical forms of writing allow for an evocative
glimpse into the writer’s process. With handwritten text, we can observe erasures; with typewritten text, whiteouts or strikethroughs. With the digital medium, the difference (some see this as its beauty, some as a problem) is its lack of evidence. I
just erased an entire sentence. Currently, if writers wish to preserve the process of creating their works, they must save a new copy of their file every time they wish to preserve its current state. Whereas the physical medium
provided a window into the writer’s process (edits, or lack thereof) by default, the
digital medium does not. This dictates a form of process to the creative individual
that requires a self-conscious form of preservation. It need not be this way; this is simply a reality that has been made the default by software developers. The digital written word is one of the simplest forms of data, and its files are very small. This
is one file format that has not significantly bloated in bytes over the course of time.
2 A topic of much research and writing. Some recent publications include “The Shallows” by Nicholas Carr, “Cognitive Surplus” by Clay Shirky, and “Program or be Programmed” by Douglas Rushkoff.
3 Schmitz, Dawn. “The Born-Digital Manuscript as Cultural Form and Intellectual Record.” Proc. of Time Will Tell, But Epistemology Won't: A Conference on Richard Rorty's Archive, University of California, Irvine. Web. <http://virtualpolitik.org/rorty/Schmitz_Rorty_paper.pdf>
Simultaneously, the cost of storage is constantly decreasing as drive capacities
steadily increase. Why not take advantage of this discrepancy and offer writers an authoring tool that archives the evolution of their work? There are a few word processors that are targeted towards writers4; however, most focus on "features"
such as organizing the document into chapters, and offering a place for notes and
character development. Rather than mediating the creator’s environment, we
should be focused on helping them to preserve their process and legacy. There
would be no understanding or appreciation of the flawless perfection of Mozart's
edit-free manuscripts if they were digital. It would not enter our minds to consider
whether they had been edited heavily or written flawlessly.
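To make the idea of such an archiving authoring tool concrete, here is a minimal sketch in Python of what it might do behind the scenes: every time the writer saves, the tool quietly files away a timestamped copy of the manuscript, skipping snapshots whose content is unchanged. The file names and archive folder are purely illustrative, and a real tool would bury this logic invisibly inside the word processor.

```python
import hashlib
import shutil
import time
from pathlib import Path

def snapshot(manuscript: Path, archive_dir: Path):
    """Copy the manuscript into the archive if its content has changed."""
    archive_dir.mkdir(parents=True, exist_ok=True)

    # Hash the current content so identical versions are not stored twice.
    digest = hashlib.sha256(manuscript.read_bytes()).hexdigest()[:12]
    if list(archive_dir.glob(f"*_{digest}{manuscript.suffix}")):
        return None  # this exact version is already archived

    stamp = time.strftime("%Y%m%d-%H%M%S")
    copy_path = archive_dir / f"{manuscript.stem}_{stamp}_{digest}{manuscript.suffix}"
    shutil.copy2(manuscript, copy_path)  # copy2 also preserves file timestamps
    return copy_path

if __name__ == "__main__":
    # Hypothetical manuscript file and archive folder.
    saved = snapshot(Path("novel.txt"), Path("novel_versions"))
    print("archived:", saved if saved else "no change since last snapshot")
```

Because plain text is so small, even decades of such snapshots would occupy a trivial amount of storage.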
A common misconception is that digital content is more stable than traditional media. This is patently false. Consider the amount of ancient sculpture, pottery, painting, and manuscripts contained in the world's museums. What success might you have opening ten-year-old text documents on your current computer? With traditional media there is essentially one form of degradation: physical. Paintings fade, paper acidifies, sculptures fall apart. Born-digital content is affected by four forms of degradation: 1) physical obsolescence, 2) physical deterioration, 3) data obsolescence, and 4) data deterioration. To illustrate these aspects, let's say that you are a born-digital preservation specialist, and you work in the archives of a major library. A renowned writer donates their papers to your institution – wonderful. However, their papers are born-digital, taking the form of a spectrum of disks, hard drives, and computers.

First consideration would be given to physical obsolescence. This refers to the fact that storage media is in constant evolution. Without a way of interfacing with a computer, data retrieval would be rendered impossible. Try reading a floppy disk with your current computer – chances are there isn't a place to insert one. If there is a place for one, it is for a 3 ½” disk. If the hypothetical materials in question included 5 ¼” floppy disks, you would need to find hardware that could read this storage media – either a computer with a built-in drive, or an external floppy drive. This is of course to assume that the storage media has not suffered from physical deterioration. Most forms of storage are quite delicate5. In fact, the most common form of data storage – the hard drive – is the most tenuous. The inside of a hard drive closely resembles a record player, including metal platters that contain data, and a small arm which reads the data. These delicate mechanical devices are highly prone to failure. Similarly, if a portion of our hypothetical born-digital collection included a CD, this media may be unreadable due to any number of physical factors (writable optical media will inevitably be unreadable in as few as five to ten years due to exposure to light).

4 yWriter, PageFour, Manuscript, to name a few.
5 Storage is becoming less and less physically delicate, though. Flash-based memory is entirely digital and has no mechanical parts. Although currently rather expensive per gigabyte, it will eventually completely replace hard drives and other forms of portable storage.

If neither physical obsolescence nor deterioration befall the collection, there are two remaining hurdles. Data obsolescence refers to the fact that all data is encoded in a particular format, structured so that it is readable by whatever software it was created with. This is a problem because, as software evolves, support for formats is discontinued, and files become reliant on the software they were created with. Compounding this, all software works within the framework of the particular hardware and operating system it was intended to run on. Since software and hardware exist in a perpetual symbiotic relationship of development, there are inevitably formats of data in our hypothetical collection that will require significant digital archeology to achieve a successful recovery.

As our fourth and final challenge, data deterioration presents the threat that even if we successfully elude all three of the previous forms of degradation, the data in question may have become corrupt at some point. Any time a file is moved, edited, or transferred to different storage media, there is a very real risk of one bit being put out of its proper place, or written incorrectly. In some cases this creates minor but permanent glitches; in the worst cases, it renders the data completely useless – unreadable by the computer. This illustrates
the main difference between degradation of traditional materials and degradation
of born-digital materials. In most cases born-digital materials are rendered
completely inaccessible by degradation of any scale. This is why born-digital
preservation is referred to as permanent access. Without proper preservation, we
simply lose access to the data. Forever. Imagine a point in time where the English
language has been forgotten. Books containing the language, immense cultural
legacies, masterpieces, all would still exist in physical form - people would be able
to see English words on the written page, but they would not be able to read,
understand, or derive any knowledge from the dead language. Not only has history
witnessed near occurrences of this hypothetical, but most people who have used
computers for at least a few years have experienced this on a personal level. As
time passes, software is re-written. People decide that they need a faster computer
for a new piece of software they are using, only to find that another piece of older software they still rely on does not work on the new machine. Incompatibilities are simply a fact of life, fed by an endless cycle of upgrades to hardware and software that is itself driven by a combination of innovation and commerce-driven planned obsolescence6.
6 Laforet, Anne, Aymeric Mansoux, and Marloes De Valk. “Rock, Paper, Scissors And Floppy Disk.” Pi.kuri.mu. July 2010. Web. 30 Oct. 2010. <http://pi.kuri.mu/rock/>
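A first, modest step in that digital archeology is simply identifying what format a mystery file is in, without trusting its name or the software that once created it. The sketch below is a deliberately tiny illustration, with a handful of well-known file signatures chosen for the example; real format registries and identification tools are vastly more thorough, but the principle is the same.

```python
from pathlib import Path

# A few well-known file signatures ("magic numbers"). A real identification
# tool consults a far larger registry of format signatures than this.
SIGNATURES = {
    b"%PDF-": "PDF document",
    b"\x89PNG\r\n\x1a\n": "PNG image",
    b"GIF87a": "GIF image",
    b"GIF89a": "GIF image",
    b"PK\x03\x04": "ZIP container (also .docx, .odt, .epub)",
    b"\xd0\xcf\x11\xe0\xa1\xb1\x1a\xe1": "OLE2 container (legacy MS Office formats)",
    b"{\\rtf": "Rich Text Format",
}

def identify(path: Path) -> str:
    """Guess a file's format from its first bytes, ignoring name and extension."""
    header = path.read_bytes()[:16]
    for magic, name in SIGNATURES.items():
        if header.startswith(magic):
            return name
    return "unknown format: a candidate for deeper digital archeology"

if __name__ == "__main__":
    # Hypothetical file recovered from the donated media.
    print(identify(Path("chapter_one.wpd")))
```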
The Internet is arguably the most complicated of all born-digital materials because all of the aspects of degradation apply, and in addition to this, materials simply
disappear. Content creators may discontinue their website for any number of
reasons. However, there is something that has a far more severe impact than an
individual taking down their site: large content publishing systems being
discontinued. For example, imagine if WordPress or Tumblr suddenly announced that they were closing down – all content would be gone as soon as they took down the site and disconnected their servers. This can and does happen. It usually occurs after the site has fallen into obscurity, and its inhabitants have moved on to
the next platform. Without initiatives focused on the preservation of the Internet, a
vast and diverse legacy would slowly but surely disappear. One recent example
occurred on October 26th, 2009, the day that Yahoo discontinued the web hosting service Geocities. During the mid-1990s, Geocities was a vibrant community of DIY websites. This was an important period in the evolution of the Internet, as it was one of the largest early examples of like-minded people gathering and forming virtual relationships, communities, and discussions online. Thankfully, word of
the discontinuation of Geocities spread, and a few key organizations effectively
preserved its contents.
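In its simplest form, preserving a page before its servers go dark amounts to fetching the page and filing the bytes away with a timestamp. The Python sketch below illustrates that bare-bones idea (the URL and folder are placeholders); serious web archiving efforts use dedicated crawlers that also capture linked images, stylesheets, and scripts, and store everything in standardized container formats such as WARC.

```python
import time
import urllib.request
from pathlib import Path

def capture(url: str, archive_dir: Path) -> Path:
    """Fetch a URL and store the raw response body alongside a capture timestamp."""
    archive_dir.mkdir(parents=True, exist_ok=True)
    with urllib.request.urlopen(url, timeout=30) as response:
        body = response.read()

    # Encode the capture moment and the source in the file name itself.
    stamp = time.strftime("%Y%m%d%H%M%S")
    safe_name = url.replace("://", "_").rstrip("/").replace("/", "_")
    snapshot_path = archive_dir / f"{stamp}_{safe_name}.html"
    snapshot_path.write_bytes(body)
    return snapshot_path

if __name__ == "__main__":
    # Placeholder URL; a real crawl would also follow links, images, and stylesheets.
    print(capture("http://example.com/", Path("web_snapshots")))
```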
Despite these seemingly endless challenges, there are also great and unique
opportunities offered to the conservator. As institutions are faced with these
challenges, and best practices develop, incredible innovations have begun to
emerge. One major contributor to the field has been the born-digital team at the
Manuscript, Archives, and Rare Book Library (MARBL) at Emory University. In 2007 renowned author and critic Salman Rushdie donated his papers. Included in this collection were four computers and an external hard drive. These computers
contained manuscripts, unfinished projects, and correspondence – all things of
great interest to researchers. This situation raises not only the question of how to recover, stabilize, and preserve the data (most of which was over twenty years old), but also of how to provide it to researchers in a practical and useful way that respects the privacy and wishes of a living writer. One of the solutions to emerge
was a searchable and browse-able database of all of the files in a modern format
useful to scholars - including file descriptions and full text access. Not only can
Rushdie scholars read drafts of his written works, character studies, and emails to
his publisher, but all of this data is available instantly. A search for “Vena”, a character from Rushdie’s novel “The Ground Beneath Her Feet”, yields over 100 full-text document results. By far the greatest and most unique innovation achieved by
the team at Emory is an emulation of Rushdie’s Performa 5400. MARBL’s team of
technologists was able to create an intact, bootable disk image of the computer.
This in combination with custom designed software running on a contemporary
computer allows researchers to essentially browse Rushdie’s computer as he left it.
This offers the opportunity for an unprecedented human understanding of the way
that the author used the tool of his trade. This level of perspective on the author’s
creative process is like being able to see how Camus used a typewriter, or how he
organized his desk.
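The disk image underpinning this kind of access is, at its core, a byte-for-byte copy of the original media accompanied by a checksum that lets the copy be verified later. The sketch below shows that core operation in Python; the device path is a placeholder, and practitioners would normally work through a write-blocker with dedicated forensic imaging tools rather than a script like this.

```python
import hashlib

def image_media(source: str, destination: str, chunk_size: int = 1024 * 1024) -> str:
    """Copy a device or media file byte for byte and return the SHA-256 of the copy."""
    sha256 = hashlib.sha256()
    with open(source, "rb") as src, open(destination, "wb") as dst:
        while True:
            chunk = src.read(chunk_size)
            if not chunk:
                break
            dst.write(chunk)
            sha256.update(chunk)  # checksum recorded at the moment of imaging
    return sha256.hexdigest()

if __name__ == "__main__":
    # Placeholder paths: a raw device node (read access required) and an output image file.
    print("sha256 of image:", image_media("/dev/disk2", "performa5400.img"))
```

Because the checksum is recorded at the moment of imaging, any later deterioration of the image can be detected by simply recomputing it.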
Now that we are in the midst of an era where our global cultural legacy is born-digital, we must plan for the future of our data footprint. If we don't, we run the risk of partial and permanent loss of a moment in history7.
7 Kuny, Terry. "A Digital Dark Ages? Challenges in the Preservation of Electronic Information." Proc. of 63rd IFLA Council and General Conference. 1997.
Planning and
implementation must occur on two levels, personal and public. Personal, in the sense that creators must be custodians of their digital content, maintaining redundant backups and a thorough understanding of the medium. Otherwise, their born-digital
materials may not last long enough to reach the hands of an institution with
specialists on hand. Beyond the personal scale, it must be recognized that such materials are very much at the mercy of the corporations that develop the hardware and software on which the data is created and preserved.
Consortiums must be developed in order to establish standards, legislation even,
for the future of data. There must be a global commitment to the standard of
permanent access. This has been recognized by the U.S. Federal Government and the Library of Congress, which in December of 2000 founded the National Digital Information Infrastructure and Preservation Program (NDIIPP).
"NDIIPP is based on an understanding that digital stewardship on a national scale depends on
public and private communities working together. The Library has built a preservation network
of over 130 partners from across the nation to tackle the challenge…" 8
The legislation appropriated $100 million, which has funded a broad range of
projects devoted to preserving our cultural legacy. Archive-It9 is a subscription
service that allows institutions to build customized archives of web-based
resources. All of the data is hosted by the Internet Archive10, and is searchable as
any other database subscription a library would use. The Web Archiving Service
(WAS) is a “Web-based curatorial tool that enables libraries and archivists to capture, curate, analyze, and preserve Web-based government and political information.”11 This is significantly different from the Internet Archive and Archive-It, as it gives more control to the institution, hosting the archive on their own servers if they so desire.
8 “About the Program.” Digital Preservation (Library of Congress). Web. Oct. 2010. <http://www.digitalpreservation.gov/library/>
9 “About Archive-It.” Archive-It.org. Web. Oct. 2010. <http://www.archive-it.org/public/about-us.html>
10 “The Internet Archive is a nonprofit organization founded in 1996 to build an Internet library, with the purpose of offering permanent access for researchers, historians, and scholars to historical collections that exist in digital format.” - http://www.digitalpreservation.gov/partners/ia/ia.html
Audit Control Environment (ACE) is a project that
developed a tool for validating the integrity of digital files in migration, and helps
institutions to perform audits of their collections.12
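I cannot speak to ACE's internal design, but the heart of any such fixity audit can be expressed in a few lines: record a checksum for every file in the collection, then periodically recompute and compare. The Python sketch below (with illustrative paths and a made-up manifest name) walks through that cycle.

```python
import hashlib
import json
from pathlib import Path

def file_digest(path: Path) -> str:
    """SHA-256 of a file, read in chunks so large files are never loaded whole."""
    sha256 = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(1024 * 1024), b""):
            sha256.update(chunk)
    return sha256.hexdigest()

def build_manifest(collection: Path, manifest: Path) -> None:
    """Record a checksum for every file currently in the collection."""
    digests = {
        str(path.relative_to(collection)): file_digest(path)
        for path in sorted(collection.rglob("*"))
        if path.is_file()
    }
    manifest.write_text(json.dumps(digests, indent=2))

def audit(collection: Path, manifest: Path) -> list:
    """Return files that are missing or whose checksum no longer matches the manifest."""
    recorded = json.loads(manifest.read_text())
    problems = []
    for name, digest in recorded.items():
        current = collection / name
        if not current.is_file() or file_digest(current) != digest:
            problems.append(name)
    return problems

if __name__ == "__main__":
    # Illustrative paths for a small local collection and its checksum manifest.
    build_manifest(Path("collection"), Path("manifest.json"))
    print("changed, corrupted, or missing files:", audit(Path("collection"), Path("manifest.json")))
```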
If there is anything certain about born-digital preservation, it is that the field itself is in
constant evolution. The fact is that the very tools we use to preserve born-digital
materials are born-digital materials themselves. The software, systems, theories,
methods, and best practices of born-digital preservation will perpetually remain in
a state of flux. Libraries, museums, archives, and private collections will encounter
more and more born-digital materials. The key for any organization that finds itself dealing with born-digital materials is to approach their immediate preservation with care and expediency. An institution with any type of born-digital collection must seek the expertise of a specialist – whether a small collection seeking a part-time consultant, or a large institution employing a diverse team. The tenuous nature of
the digital medium’s stability is perhaps the conservator’s worst enemy – as
without continuous development and stewardship, there is great risk for the loss of
a vast cultural legacy, and a moment in history.
11 “Web Archives: Yesterday's Web; Today's Archives.” Web Archiving Service. Web. Oct. 2010. <http://webarchives.cdlib.org/>
12 “An Approach to Digital Archiving and Preservation Technology.” ACE. Web. Oct. 2010. <https://wiki.umiacs.umd.edu/adapt/index.php/Ace>