Web Archiving
Lauren Ko
Web Archiving Programmer
Digital Projects Unit
What is web archiving?
Identifying, capturing, and preserving important pieces of the Web in an archive for future audiences.
Why web archiving?
● Web pages and sites change and disappear all the time
● Example: whitehouse.gov
  – http://www.whitehouse.gov/
  – http://www.cybercemetery.unt.edu/archive/allcollections/20090119205216/http://www.whitehouse.gov/
How do we archive the Web?
● Determine the scope of a desired collection
● Harvest the content
● Store the content persistently
● Make it available
Determining the scope
● What do you want to collect?
  – Crawl based on topic, organization, domain, etc.
● To what extent do you want to collect it?
  – Shallowly crawl many domains
    ● Web portal type sites that exist to link to many other sites
    ● Only go a specified number of hops from the starting point
  – Deeply crawl a more controlled set
    ● Harvest entire domains
Determining the scope (continued)
● Navigate around the sites you want to collect to help determine the criteria by which you need to configure the crawl
  – Do you want the entire domain?
    ● Example: everything at unt.edu
  – All subdomains?
    ● unt.edu plus library.unt.edu, international.unt.edu, and the rest
  – A subset of URIs based on a regular expression
  – Only certain types of files (PDFs, HTML, images, etc.)
  – Linked content from other sites
● Create a list of seed URIs from which the crawl spans out
What is web harvesting?
This is where we instruct a Web crawler (a.k.a. spider, robot, bot) to move through the part of the Web we are interested in and download the content that we want in our collection.
What does this crawler do?
● It is a program that begins at our seed URLs, reading through page content to discover other URIs (hyperlinks)
● The newly discovered URIs are added to a list called the frontier so they can be visited and their links discovered as well; the crawler continues recursively until no in-scope URIs remain (see the sketch below)
● The crawl operator configures the crawler, telling it which URIs should be added to the frontier for inspection and which content should be downloaded to become part of the archive
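A minimal sketch of the frontier loop described above, in Python; the fetch_and_extract_links helper and the single-host scope test are hypothetical stand-ins, not Heritrix's actual machinery.

    # Minimal frontier-based crawl loop (illustrative only).
    from collections import deque
    from urllib.parse import urlparse

    def in_scope(uri, allowed_host="orgs.unt.edu"):
        # Toy scope rule: stay on a single host (example host, not a constraint of the real tools).
        return urlparse(uri).netloc == allowed_host

    def crawl(seeds, fetch_and_extract_links):
        frontier = deque(seeds)     # URIs waiting to be visited
        seen = set(seeds)           # avoid queueing the same URI twice
        while frontier:
            uri = frontier.popleft()
            for link in fetch_and_extract_links(uri):   # download the page, return discovered URIs
                if link not in seen and in_scope(link):
                    seen.add(link)
                    frontier.append(link)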
What else does the operator configure?
● Aside from telling the crawler what to download, the operator also tells it (sketched below):
  – How many threads (processes) should run at once
  – How long the crawler should wait between retrievals
  – How many times URIs should be retried
  – Whether the crawler should comply with robots.txt
  – How content should be written
● This configuration not only instructs the crawler, but also keeps it from using more resources than its machine can handle.
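Purely as an illustration of the kinds of knobs listed above, such settings might be collected like this; the key names are hypothetical and are not Heritrix's real configuration options.

    # Hypothetical crawler settings (illustrative names only).
    crawler_settings = {
        "max_threads": 25,        # how many worker threads run at once
        "wait_seconds": 5.0,      # delay between retrievals from the same host
        "max_retries": 3,         # how many times a failed URI is retried
        "obey_robots_txt": True,  # whether to honor robots.txt exclusions
        "writer": "warc",         # how downloaded content is written to disk
    }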
Challenges of harvesting
● Crawler misses some links buried in JavaScript, XML, or other unusual places
● Crawler cannot copy the database that powers a dynamic site
● May not be able to access embedded content without custom processors (YouTube videos and Flickr slideshows)
● Crawler traps such as calendars (generate links ad infinitum)
● Must be “polite”
  – Respect robots.txt (see the sketch below)
  – Limit the number of simultaneous downloads to avoid putting a heavy load on host servers
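A minimal sketch of one politeness check using Python's standard urllib.robotparser; the user-agent string is a made-up example, and real crawlers also throttle per-host request rates.

    from urllib import robotparser
    from urllib.parse import urlparse

    def allowed_by_robots(uri, user_agent="ExampleArchiveBot"):
        """Return True if the host's robots.txt permits fetching this URI."""
        parts = urlparse(uri)
        rp = robotparser.RobotFileParser()
        rp.set_url(f"{parts.scheme}://{parts.netloc}/robots.txt")  # exclusion rules live at the site root
        rp.read()                                                  # download and parse robots.txt
        return rp.can_fetch(user_agent, uri)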
Free web crawlers
● wget
● HTTrack
● Heritrix
wget
● Program to download files from the Web that can be used as a web crawler when run with the --recursive or --mirror flag
● Several options allow flexibility of use pertaining to:
  – log files; retry limits; overwriting or saving duplicates on revisits; download rates; download wait times; accepting or rejecting specific suffixes, patterns, domains, directories; depth of crawl; security certificates; resuming downloads
● Runs from the command line, so it is easy to call from other scripts
● Example call:
  – wget --mirror --convert-links --wait=5 http://orgs.unt.edu/lissa/
HTTrack
● Software specifically for copying a website to a local directory
  – Recursively builds all directories and rearranges the original site's relative link structure to work locally
● Runs from the command line
● Features:
  – User can specify number of requests per connection (up to 8), number of retries, depth of search, maximum file length, and transfer rates; mark downloaded pages with footer information about when and where they were collected and under what name they were stored; resume past downloads; set simple filters with + and – on URL or MIME type; index all words contained in the HTML files
● Example call:
  – httrack orgs.unt.edu/lissa -O lissa -A50000 -c1 --index
Why do we not use these tools?
These tools by default store the downloaded sites in a directory/file structure that is NOT suited to archival purposes.
We use Heritrix
● From the Heritrix site (http://crawler.archive.org/):
  – “Heritrix is the Internet Archive's open-source, extensible, web-scale, archival-quality web crawler project.”
● Reasons we like it:
  – Easy to install from a precompiled package
  – Archival: writes content to WARC files (an ISO standard format)
  – Web user interface
  – High level of flexibility and control over: number of threads, job metadata, download rates, wait times, maximum file length, crawl scope, adding modules, writing data
  – Set up, save, reuse, and relaunch profiles and jobs
  – And especially because it handles very large crawls
Setting the Scope in Heritrix
● The operator gives Heritrix a set of scope rules; Heritrix compares each URI to the rules in order and assigns an accept, reject, or pass. When it reaches the end of the rules, the final accept/reject value sticks.
● Usually we start with an accept-all or reject-all rule.
● Our most used rules (a sketch of this ordered evaluation follows below):
  – SURT prefix
  – Matches regular expression
  – Too many path segments
  – Number of hops from seed
  – Transclusion rule
  – Scope plus one
  – Prerequisite rule
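A rough sketch of how an ordered rule chain like this can be evaluated; each rule returns "accept", "reject", or "pass", and the last non-pass answer wins. This mimics the idea described above and is not Heritrix's actual API.

    def evaluate_scope(uri, rules):
        """Apply scope rules in order; the last rule that says accept or reject wins."""
        decision = "reject"                 # a reject-all (or accept-all) rule usually comes first
        for rule in rules:
            verdict = rule(uri)             # each rule returns "accept", "reject", or "pass"
            if verdict in ("accept", "reject"):
                decision = verdict          # later rules override earlier ones
        return decision == "accept"

    # Toy rules for illustration only.
    rules = [
        lambda uri: "accept" if uri.startswith("http://www.library.unt.edu/") else "pass",
        lambda uri: "reject" if uri.count("/") > 20 else "pass",  # crude "too many path segments"
    ]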
SURT prefix
● SURT form
  – A rearranged version of a URI that puts the focus on the domain name hierarchy
  – Helpful for sorting/comparing and for determining inclusion of subdomains
  – Example: the SURT form of http://archive.org is http://(org,archive, which will match the URIs http://crawler.archive.org and http://webteam.archive.org
● SURT prefix
  – Take the SURT form; if there are at least three slashes, strip off everything after the last slash
  – The SURT prefix of http://www.library.unt.edu/librariesandcollections/digitalcollections is http://(edu,unt,library,www,)/librariesandcollections/
● Heritrix's SURT prefix rule by default uses the seed list for SURT prefixes, so the example above would accept all URIs beginning with 'http://www.library.unt.edu/librariesandcollections' (see the sketch below)
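A simplified sketch of deriving a SURT form and SURT prefix from a URI, following the description above; it ignores ports, userinfo, and other details a full implementation handles.

    from urllib.parse import urlparse

    def surt_form(uri):
        """Rearrange a URI so the host is written from the top-level domain outward."""
        parts = urlparse(uri)
        reversed_host = ",".join(reversed(parts.netloc.split(".")))  # www.library.unt.edu -> edu,unt,library,www
        return f"{parts.scheme}://({reversed_host},){parts.path}"

    def surt_prefix(uri):
        """If the SURT form has at least three slashes, strip everything after the last slash."""
        s = surt_form(uri)
        if s.count("/") >= 3:
            s = s[: s.rfind("/") + 1]
        return s

    # surt_prefix("http://www.library.unt.edu/librariesandcollections/digitalcollections")
    # -> "http://(edu,unt,library,www,)/librariesandcollections/"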
Regular expressions
● Regular expressions are a way of matching sets of strings that meet certain criteria (contain certain patterns of characters)
● Heritrix will compare each URI to a supplied regular expression and decide whether to accept or reject it
● Special characters used in Java regular expressions:
  – ([{\^$|]})?*+.
([{\^$|]})?*+.
● ( ) are for grouping
● [ ] defines a set from which to select
● \ means treat the following character literally
● ^ means at the beginning, or negation, depending on position
● - is used in defining ranges
● $ means at the end
● | means or
● ? means the preceding character or group may or may not be there
● * means the preceding character zero or more times
● + means the preceding character one or more times
● . means any character
More regular expression stuff!!
● (red|blue)ball means 'redball' or 'blueball'
● ^ab{5}$ means the string begins with 'a' followed by five 'b's at the end
● [a-gA-G]\. means any lowercase or uppercase letter between 'a' and 'g' followed by a literal period
  – 'a.' matches, 'D.' matches, 'A' with no dot fails, 'X.' fails
● a+b*(hello)? means one or more 'a's followed by zero or more 'b's and then maybe the word 'hello'
  – 'abbbhello' matches, 'a' matches, 'ab' matches, 'bbbb' fails (these are verified in the sketch below)
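These patterns can be checked quickly; the deck describes Java regex syntax, but these particular expressions behave the same way in this Python sketch.

    import re

    # Full-string matches for the toy patterns above (Java's matches() is approximated with fullmatch()).
    assert re.fullmatch(r"(red|blue)ball", "redball")
    assert re.fullmatch(r"^ab{5}$", "abbbbb")
    assert re.fullmatch(r"[a-gA-G]\.", "D.")
    assert not re.fullmatch(r"[a-gA-G]\.", "X.")
    assert re.fullmatch(r"a+b*(hello)?", "abbbhello")
    assert not re.fullmatch(r"a+b*(hello)?", "bbbb")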
Heritrix regular expression examples
● We tell Heritrix to accept all URIs that match (checked in the sketch below):
  – ^https?://.*globalchange\.gov/.*$
  – Accepts: http://www.globalchange.gov/whatsnew/265reportonsealevelrise
  – Rejects: http://www.globalworld.com
● We tell Heritrix to reject all URIs that match:
  – ^https?://(www\.)?(iii\.library\.unt\.edu|texashistory\.unt\.edu|web3\.unt\.edu/.*calendar/.*php.*(year|date)=(19|200[05]|20[19])|international\.unt\.edu.*(/text/|/javascript/|/1.2/|lang/en_US|login|auth/|/id/)).*$
  – Rejects: http://iii.library.unt.edu/search~S12/?searchtype=X&searcharg=if+chins+could+kill
  – Accepts: http://www.library.unt.edu/libraryinstruction
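A quick check of the first pattern above (this expression behaves the same under Java and Python regex engines):

    import re

    accept_pattern = re.compile(r"^https?://.*globalchange\.gov/.*$")

    assert accept_pattern.match("http://www.globalchange.gov/whatsnew/265reportonsealevelrise")
    assert not accept_pattern.match("http://www.globalworld.com")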
Back to the other scope rules (briefly)
● Too many path segments rule – rejects URIs if there are too many path segments ('/')
  – Helps avoid crawler traps
● Too many hops rule – rejects URIs too many hops from the seed
● Transclusion rule – accepts URIs that have up to a set number of non-navigational link hops at the end of the hop path
  – Facilitates the collecting of embedded content (images, etc. hosted elsewhere)
● Scope plus one – accepts a URI if it comes directly from a page that is in scope
  – Good for sites that link to a lot of offsite articles
● Prerequisite rule – accepts a URI if it is a prerequisite for one that is in scope
WARC files
● We tell Heritrix to write Web ARChive (WARC) files
● The format supports writing many digital resources of varying types into an aggregate archival file with related metadata
● Different types of WARC records make up the WARC file (see the sketch below)
  – Each record has metadata and a payload
    ● warcinfo – contains metadata about the WARC file; we put crawl information in the payload
    ● request – payload contains the HTTP GET request sent to the host
    ● response – metadata includes payload digest, timestamp, etc.; payload is the HTTP response headers plus the actual file content
    ● metadata – payload contains metadata about the affiliated downloaded item, including outlinks from HTML pages
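A short sketch of walking over these record types with the third-party warcio package (assuming it is installed, and using example.warc.gz as a placeholder filename):

    from warcio.archiveiterator import ArchiveIterator

    # Print the type and target URI of each record in a WARC file.
    with open("example.warc.gz", "rb") as stream:
        for record in ArchiveIterator(stream):
            rec_type = record.rec_type  # warcinfo, request, response, metadata, ...
            uri = record.rec_headers.get_header("WARC-Target-URI")
            print(rec_type, uri)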
What do we do with all these WARCs?
We feed them to the Wayback machine!!
Wayback machine
● Wayback is the open source Java implementation of the Internet Archive Wayback Machine (http://archive-access.sourceforge.net/projects/wayback/)
● It allows us to share our WARCs with our users
● We install the software and edit some XML configuration files to:
  – Tell it how to serve/display things (mode of playback, etc.)
  – Tell it where to find things (paths to needed indexes)
  – Set URLs for accessing various collections (access points)
● Give it an index where each record in each WARC is indexed, and another file saying where to find each WARC
● http://www.cybercemetery.unt.edu/archive/allcollections/
Wayback replay issues
● Not a perfect replay experience
● Links must be rewritten to stay within the archive, but this does not always happen correctly
● Some elements may not work:
  – JavaScript
  – Dynamically generated content
  – Original search functionality
  – Forms
What else can we do?
Full-text indexing...
NutchWAX
● Java-based open source software for full-text indexing of Web archive collections; an extension of the web search software Nutch
● Simplified steps to get your full-text search going:
  – Create a manifest file with paths to all WARCs and a directory for what the indexing steps create
  – Run some commands to generate indexes (import, updatedb, invertlinks, index)
  – Install and configure the NutchWAX web interface and tell it about the indexes
● Resource intensive
  – Need twice as much disk space available for the indexing (it generates scratch files)
    ● If you have 500 GB of WARCs, you need 1 TB of space for indexing
  – A couple of GB of spare memory
NutchWAX search embedded in Wayback
NutchWAX results
Did that make sense?