Collecting and Archiving the Federal Web
Mark Phillips
Assistant Dean for Digital Libraries
UNT Libraries
May 19, 2015
In most areas of life the Web is the primary
platform for the creation and dissemination of
information
May 19, 2015
Open Access Symposium
This rings true for the governmental Web as well.
Municipal, County, State, and Federal
Governments reach their constituents via the Web
The United States Federal Government has used
the Internet as a publishing platform for over 20
years.
Making it the single largest platform for the
dissemination of federal information in existence
The ability to publish content to the Web quickly
and cost-effectively has made it possible for all
branches of the Federal government to reach
those in need of governmental information.
But with great abilities to create also come
opportunities for loss.
Content that was so easy to make available can
just as easily be removed.
Planned or unplanned denial of access is as easy
as a few keystrokes.
It is important for libraries, archives, and cultural
heritage institutions to select, collect, organize,
provide access to and preserve this important
Web content.
End of Term
2004
National Archives and Records Administration
(NARA)
Conducted an “End of Term” Web harvest at the
end of the first Bush term.
Available at http://webharvest.gov/
In 2008 NARA announced it wouldn't be doing the
same at the end of the second Bush term
A call to action!
US members of the International Internet
Preservation Consortium (IIPC) formed an ad-hoc
group to shoulder the task of doing the .gov crawl
California Digital Library
Internet Archive
Government Printing Office
Library of Congress
University of North Texas
Worked to establish a list of URLs for the “Federal
Domain”
Primarily .gov and .mil
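
As a rough illustration of scoping a seed list to the "Federal Domain", the sketch below keeps only candidate URLs whose hostname ends in .gov or .mil. The function names and the suffix list are assumptions made for this example; the slides do not describe the tooling the group actually used.

```python
from urllib.parse import urlsplit

# Assumption: the "Federal Domain" is approximated by these two suffixes only.
FEDERAL_SUFFIXES = (".gov", ".mil")


def is_federal(url: str) -> bool:
    """Return True when the URL's hostname ends in .gov or .mil."""
    host = (urlsplit(url).hostname or "").lower()
    return host.endswith(FEDERAL_SUFFIXES)


def federal_seeds(urls):
    """Filter candidate URLs to federal hosts, dropping blanks and duplicates."""
    seen = set()
    seeds = []
    for url in urls:
        url = url.strip()
        if url and is_federal(url) and url not in seen:
            seen.add(url)
            seeds.append(url)
    return seeds
```

URLs without a scheme have no parsed hostname and are dropped, so a real nomination list would likely need normalizing first.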
Initiated a series of three crawls
Pre Election
Post Election
Post Inauguration
With two goals in mind.
Capture the expected change of the Federal Web
caused by the transition in the executive branch
Serve as a catalyst event to create a large-scale
archive of .gov and .mil content
In the end 16TB of Web content was harvested
160,211,356 URIs
Copies of all data available to each of the partners
In 2012 the same group initiated the project again
This time Harvard was added to the roster.
A slightly modified process was used to collect
candidate URLs for crawling
We also worked with Library Science students
from the Pratt Institute to nominate government
social media sites
1,476 URLs from 31 nominators were collected
Because President Obama was re-elected,
additional crawls were not conducted after the
initial pre-election crawl
Once compiled the EOT 2012 Crawl data
measured 31TB
Almost double that of EOT 2008
We are beginning the discussions for EOT 2016
What has been done with the data to date?
IMLS Funded: Classification of the End-of-Term
Archive: Extending Collection Development to
Web Archives (eotcd), 2010-2012
Attempt to overlay the SuDoc Classification
System on the EOT 2008 Web archive
End of Term 2008 Presidential Web Archive:
PDF Content Analysis
An analysis of the 4.5 million unique PDF
documents in the EOT 2008 archive
The PDF dataset contains over 60 million pages
of content.
1.5 million PDFs (33%) were 1 page in length
Longest PDF is 17,584 pages
Average PDF Length = 13.8 pages
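
Page-length figures like those above could be derived from a list of per-PDF page counts. The sketch below is a minimal illustration, assuming the page counts have already been extracted; the function name and the shape of the result are hypothetical and not taken from the original analysis.

```python
def summarize_page_counts(page_counts):
    """Summarize a collection of per-PDF page counts.

    Returns the number of PDFs, total pages, share of single-page PDFs,
    the longest PDF, and the average length in pages.
    """
    counts = list(page_counts)
    if not counts:
        raise ValueError("page_counts must not be empty")
    single_page = sum(1 for n in counts if n == 1)
    return {
        "pdfs": len(counts),
        "total_pages": sum(counts),
        "single_page_share": single_page / len(counts),
        "longest": max(counts),
        "average": sum(counts) / len(counts),
    }
```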
We investigated the number of domains that a
PDF was served from.
PDFs hosted on 1 – 25 unique domains
Average = 1.1 domains per PDF
PDFs hosted on 1 – 4 Top Level Domains
PDFs hosted on 1 – 1763 unique URLs
Average = 1.2 URLs per PDF
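
One way to count how widely a single PDF is hosted is to group capture records by a content digest and tally the distinct domains and URLs for each digest. The sketch below assumes that "unique PDF" means "unique content digest" and that a domain can be approximated by the last two labels of the hostname; both are assumptions for illustration, not details confirmed by the slides.

```python
from collections import defaultdict
from urllib.parse import urlsplit


def hosting_spread(records):
    """Map each content digest to (distinct domains, distinct URLs).

    `records` is an iterable of (content_digest, url) pairs, for example
    taken from a crawl index (hypothetical input format).
    """
    urls = defaultdict(set)
    domains = defaultdict(set)
    for digest, url in records:
        urls[digest].add(url)
        host = (urlsplit(url).hostname or "").lower()
        # Assumption: the last two hostname labels stand in for the domain,
        # e.g. "www.gpo.gov" and "access.gpo.gov" both count as "gpo.gov".
        domains[digest].add(".".join(host.split(".")[-2:]))
    return {d: (len(domains[d]), len(urls[d])) for d in urls}
```

Averaging the first and second element of each tuple over all digests would give figures comparable to the per-PDF domain and URL averages.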
What domains had the most PDFs?
gpo.gov with 1,082,735 PDFs
Followed by:
usda.gov
house.gov
army.mil
bea.gov
census.gov
With PDFs over 1 page?
gpo.gov = 594,430
usda.gov
house.gov
uscis.gov
uscourts.gov
army.mil
treas.gov
noaa.gov
What about 20+ pages?
gpo.gov
gao.gov
epa.gov
usda.gov
army.mil
noaa.gov
The PDF work is one small example of ways we
have reused this Web archive data internally in
the UNT Libraries
We have an interest in mining these Web archives
to extract the discrete “publications” and move
them into our traditional digital library
infrastructure.
We did this in a trial collection
The Defense Base Closure and Realignment
Commission
Known as BRAC
This is in addition to giving students, researchers
and the general public the ability to go back in
time and view the Federal government through
the lens of its websites.
In 2014 the UNT Libraries acquired a 110 TB Web
archive of all .gov content from the Internet
Archive's Global Wayback Machine
We worked with IA to extract this content, which
they made available to researchers as a
demonstration of the opportunities for research
that Web archives offer.
This content represents the .gov domain from
1996 to 2013
We plan to provide local access to this data and
to research ways of extracting meaningful
subsets of information for integration into the
UNT Digital Library and our local government
information collections.
In closing...
As the Web continues to play an important role in
people's everyday lives,
And as governments at all levels seek to engage
with constituents via the Web
It is increasingly important that cultural heritage
institutions select, collect, describe, provide
access to and preserve the Web publications of
our governments.
So that citizens, students, scholars and
researchers in the future can view and use these
resources as they seek to understand how laws,
policies, practices and processes...
Have changed over time.
Questions?
Mark Phillips
[email protected]