Collecting and Archiving the Federal Web
Mark Phillips, Assistant Dean for Digital Libraries, UNT Libraries
Open Access Symposium, May 19, 2015

In most areas of life the Web is the primary platform for the creation and dissemination of information.
This rings true for the governmental Web as well.
Municipal, County, State, and Federal Governments reach their constituents via the Web.
The United States Federal Government has used the Internet as a publishing platform for over 20 years, making it the single largest platform for the dissemination of federal information in existence.
The ability to publish content to the Web quickly and cost-effectively has made it possible for all branches of the Federal government to reach those in need of governmental information.
But with great abilities to create also come opportunities for loss.
Content that was so easy to make available can just as easily be removed.
Planned or unplanned denial of access is as easy as a few keystrokes.
It is important for libraries, archives, and cultural heritage institutions to select, collect, organize, provide access to, and preserve this important Web content.

End of Term

In 2004 the National Archives and Records Administration (NARA) conducted an “End of Term” Web harvest at the end of the first Bush term.
Available at http://webharvest.gov/
In 2008 NARA announced it wouldn't be doing the same at the end of the second Bush term.
A call to action!
US members of the International Internet Preservation Consortium (IIPC) formed an ad hoc group to shoulder the task of doing the .gov crawl:
California Digital Library, Internet Archive, Government Printing Office, Library of Congress, University of North Texas.
The group worked to establish a list of URLs for the “Federal Domain”, primarily .gov and .mil.
It initiated a series of three crawls: pre-election, post-election, and post-inauguration.
With two goals in mind:
Capture the expected change of the Federal Web caused by the transition in the executive branch.
Serve as a catalyst event to create a large-scale archive of .gov and .mil content.
In the end 16 TB of Web content was harvested: 160,211,356 URIs.
Copies of all data are available to each of the partners.
In 2012 the same group initiated the project again. This time Harvard was added to the roster.
A slightly modified process was used to collect candidate URLs for crawling.
We also worked with Library Science students from the Pratt Institute to nominate government social media sites.
1,476 URLs from 31 nominators were collected.
Because President Obama was re-elected, additional crawls were not conducted after the initial pre-election crawl.
Once compiled, the EOT 2012 crawl data measured 31 TB, almost double that of EOT 2008.
We are beginning the discussions for EOT 2016.

What has been done with the data to date?

IMLS funded: Classification of the End-of-Term Archive: Extending Collection Development to Web Archives (eotcd), 2010-2012.
An attempt to overlay the SuDoc Classification System on the EOT 2008 Web archive.

End of Term 2008 Presidential Web Archive: PDF Content Analysis
An analysis of the 4.5 million unique PDF documents in the EOT 2008 archive.
The PDF dataset contains over 60 million pages of content.
1.5 million PDFs (33%) were 1 page in length.
Longest PDF: 17,584 pages.
Average PDF length: 13.8 pages.
We investigated the number of domains that a PDF was served from.
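A measurement like this can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the method used for the EOT analysis: it assumes each capture is available as a (URL, content digest) pair, treats captures with the same digest as one unique PDF, and approximates a "domain" as the last two labels of the hostname. The function names and the sample records are hypothetical.

```python
from collections import defaultdict
from urllib.parse import urlsplit


def registered_domain(url):
    """Naive 'domain' for a URL: the last two labels of the hostname.

    Assumption for illustration only; it maps www.gpo.gov to gpo.gov but
    ignores multi-label public suffixes.
    """
    host = (urlsplit(url).hostname or "").lower()
    labels = host.split(".")
    return ".".join(labels[-2:]) if len(labels) >= 2 else host


def hosting_stats(records):
    """Count distinct URLs, domains, and TLDs for each unique PDF.

    `records` is an iterable of (url, digest) pairs, where identical
    digests are assumed to mean identical PDF content.
    """
    urls_by_digest = defaultdict(set)
    for url, digest in records:
        urls_by_digest[digest].add(url)

    stats = {}
    for digest, urls in urls_by_digest.items():
        domains = {registered_domain(u) for u in urls}
        tlds = {d.rsplit(".", 1)[-1] for d in domains}
        stats[digest] = {
            "urls": len(urls),
            "domains": len(domains),
            "tlds": len(tlds),
        }
    return stats


def average(stats, key):
    """Mean of one counter (e.g. 'domains') across all unique PDFs."""
    return sum(s[key] for s in stats.values()) / len(stats)


# Hypothetical sample data, not drawn from the EOT archive.
SAMPLE = [
    ("http://www.example.gov/a.pdf", "digest-1"),
    ("http://files.example.gov/a.pdf", "digest-1"),
    ("http://www.example.mil/copy.pdf", "digest-1"),
    ("http://www.example.gov/b.pdf", "digest-2"),
]

if __name__ == "__main__":
    stats = hosting_stats(SAMPLE)
    for digest, counts in sorted(stats.items()):
        print(digest, counts)
    print("average domains per PDF:", average(stats, "domains"))
```

Grouping by digest first keeps the per-PDF counts independent of how many times a URL was captured, since each URL is stored once in a set.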
PDFs were hosted on 1 to 25 unique domains (average 1.1 domains per PDF).
PDFs were hosted on 1 to 4 top-level domains.
PDFs were hosted on 1 to 1,763 unique URLs (average 1.2 URLs per PDF).

What domains had the most PDFs?
gpo.gov with 1,082,735 PDFs, followed by usda.gov, house.gov, army.mil, bea.gov, census.gov.

With PDFs over 1 page?
gpo.gov = 594,430
usda.gov, house.gov, uscis.gov, uscourts.gov, army.mil, tres.gov, noaa.gov

What about 20+ pages?
gpo.gov, gao.gov, epa.gov, usda.gov, army.mil, noaa.gov

The PDF work is one small example of the ways we have reused this Web archive data internally in the UNT Libraries.
We have an interest in mining these Web archives to extract the discrete “publications” and move them into our traditional digital library infrastructure.
We did this in a trial collection: the Defense Base Closure and Realignment Commission, known as BRAC.
This is in addition to enabling students, researchers, and the general public to go back in time and view the Federal government through the lens of its websites.
In 2014 the UNT Libraries acquired a 110 TB Web archive of all .gov content from the Internet Archive's Global Wayback Machine.
We worked with IA to extract this content, which they made available to researchers as a demonstration of the opportunities for research that Web archives offer.
This content represents the .gov domain from 1996 to 2013.
We have plans to provide local access to this data, as well as to research ways of extracting meaningful subsets of information for integration into the UNT Digital Library and our local government information collections.

In closing...

As the Web continues to play an important role in people's everyday lives, and as governments at all levels seek to engage with constituents via the Web, it is increasingly important that cultural heritage institutions select, collect, describe, provide access to, and preserve the Web publications of our governments, so that citizens, students, scholars, and researchers in the future can view and use these resources as they seek to understand how laws, policies, practices, and processes have changed over time.

Questions?
Mark Phillips [email protected]