Culling Data Down to Size

E-Discovery Tip Sheet
We always mean to go through our closets, our sock drawers, our magazine
racks (to the extent we still have them), and winnow out the stuff that is just taking up
space. You know, the shirt you don’t love and haven’t worn in two years but it’s in
decent shape, so you keep it? The baseball faux stirrup socks that no longer entirely
encompass your toes? All those old print issues of The Onion?
The day you finally go after all that dreck you wonder where all the open space
came from, and why you put it off for so long.
You know where I’m going with this.
It’s the same with data storage, but more so. There is a cost for storage, just as
there is a cost per square foot of what a landlord calls living space. That used to be
physical space, so it was at least easier to see. Now that it lives in electronic devices,
possibly not even located in your downtown office, it is further out of sight and out of
mind… until you need to find something, or are compelled to do so by a production
request or subpoena.
Searching, retrieval, backup, transfer, power, bandwidth, security, everything
becomes more complicated, more time consuming, and more expensive if you do not
have rules governing data retention and retirement. Add potential litigation to the
scenario, and the costs rise by an order of magnitude; at that point, you don’t have the
time for an orderly inventory and rulemaking process. You just need to freeze, find,
preserve and collect responsive data, dynamic data, privileged data, possibly damaging
data, now.
Policy Options
Let’s say that you do not (yet) have a document retention / document destruction
policy, which would govern the prescribed handling and life cycles of various document
November 2015
types. Such a policy, if scrupulously applied, may be submitted to explain why a
certain class of requested documents is not available from the period of the actions
described in the complaint. But without a policy, that explanation is not available and
the data must be produced.
If you don’t have a formal corporate policy, you may at least have an IT-based
policy regarding maintenance and storage of backups, or an email archive system.
Backup software, particularly now that tapes are fading from the scene, may be set to
run full/differential backup cycles of a defined length and retention period. This is
acceptable, as long as backups are retained once the possibility of litigation is
acknowledged. Archiving systems that retain emails in an alternate location would
need to be examined to make sure that no emails are being removed, and then be
accounted for in collection planning.
Funnelling Data
Okay, so you have no policy of any kind and have to produce from everything
you’ve ever committed to electronic storage (and, one hopes, not deleted ad hoc). Let’s
say a data map of servers and custodial user systems turns up a tidy Terabyte (1,024
Gigabytes) of raw data. How can human beings possibly analyze this much
information, let alone under deadline?
Think of a kitchen funnel: wide at the top, narrow at the bottom, to aid in
transferring a big bowl of stuff into a bottle or jar. That is the model for culling, except
that the means used to cull electronic information are also electronic in nature. They rely
on properties of the files called metadata (information describing aspects of an electronic
file and file system), and sometimes on searching for terms in the textual body of each file.
Here are a few prime examples of how one would funnel a Terabyte of data:
DeNISTing – If an entire volume is collected (generally a smaller volume such as
a custodian’s computer, where possibly relevant data may have been deleted), as
opposed to selectively searching for relevant information, known program files
and operating system components may well have been swept up. Applying lists
of hashes of such files (for example, winword.exe), as registered with a body such
as the National Institute of Standards and Technology (NIST), can help remove
these artifacts from your collected data.
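As a rough illustration of the hash-list idea, the sketch below hashes each collected file and drops any whose hash appears on a known-file list. The function names, the choice of SHA-1, and the in-memory known_hashes set are assumptions made for this example; a real workflow would load the list from a published reference set, and a commercial processing tool would handle this step internally.

```python
import hashlib
from pathlib import Path


def file_sha1(path, chunk_size=1 << 20):
    """Return the SHA-1 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha1()
    with Path(path).open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def denist(paths, known_hashes):
    """Keep only the files whose hash is NOT on the known-file list.

    known_hashes is a hypothetical iterable of hex digests for known
    program and operating system files.
    """
    known = {h.lower() for h in known_hashes}
    return [p for p in paths if file_sha1(p) not in known]
```

Because the match is on content rather than file name, a renamed copy of a known program file would still be removed, and a user document that happens to be named like a program file would be kept.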
Deduplication – A mathematical calculation of the content of each file is made,
so that when the final representation (the hash value) is rendered, all identical
hashes represent identical files. This comes up a lot in email with multiple
custodians who are on the same distribution list, for example, and can
significantly reduce the amount of information to be processed and reviewed.
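A minimal sketch of hash-based deduplication follows, assuming whole-file SHA-256 hashing and a "keep the first copy seen" rule. Both choices, and the function names, are illustrative assumptions. The sketch does not show how any particular tool hashes individual email messages inside a mailbox.

```python
import hashlib
from pathlib import Path


def content_hash(path):
    """Hash the full content of a file (reads the whole file into memory)."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()


def deduplicate(paths):
    """Split paths into unique files and duplicates.

    Returns (unique, duplicates). unique holds the first file seen for
    each hash; duplicates holds (duplicate_path, kept_path) pairs so the
    relationship to the retained copy is not lost.
    """
    first_seen = {}
    duplicates = []
    for path in paths:
        digest = content_hash(path)
        if digest in first_seen:
            duplicates.append((path, first_seen[digest]))
        else:
            first_seen[digest] = path
    return list(first_seen.values()), duplicates
```

Recording which copy each duplicate matched, rather than simply discarding it, keeps a trail showing which custodians held the same file.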
Date Range – The subject matter of litigation is alleged to have occurred at some
point in time. Only files created or modified within that date range need be
collected for consideration.
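A date-range filter can be sketched as below. This version looks only at the file system's last-modified time, because how a creation time is exposed differs between operating systems; that limitation, the function names, and the use of UTC are assumptions of the example rather than a description of any particular tool.

```python
from datetime import datetime, timezone
from pathlib import Path


def modified_utc(path):
    """Return a file's last-modified time as a timezone-aware UTC datetime."""
    return datetime.fromtimestamp(Path(path).stat().st_mtime, tz=timezone.utc)


def in_date_range(paths, start, end):
    """Keep files last modified within [start, end], inclusive.

    start and end must be timezone-aware datetimes.
    """
    return [p for p in paths if start <= modified_utc(p) <= end]
```

Note that file system dates can change when files are copied or restored, so the dates seen at culling time depend on how the collection was made.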
File Types – Based upon the circumstances of a case, you may select the kinds of
file types you wish to exclude, or collect.
● Some file types are commonly associated with user-created data:
  .DOC/.DOCX (MS Word), .XLS/.XLSX (MS Excel), .PPT/.PPTX (MS
  PowerPoint), .PDF (Adobe Acrobat), sometimes .TXT (Text).
● Containers such as MS Outlook PST or OST files, or archive formats such
  as ZIP, RAR, 7z or TAR.gz, are collected for extraction and further culling
  and examination.
● Some may be user content, or something downloaded: think .JPG (a
  photo or a Web picture) or .HTML (a Web page, or a user-generated
  report). These may also be collected, depending on the case.
● Others are program-related and will almost never have user data: .EXE,
  .DLL, .OCX, .LIB, and so on.
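The four groups above can be expressed as a simple lookup on file extension. The sketch below is only that: the category labels and the classify function are invented for illustration, the extension sets hold just the examples named above, and a file's extension is taken at face value even though it can be renamed.

```python
from pathlib import PurePath

# Extension sets taken from the examples in the text; extend per case.
USER_CREATED = {".doc", ".docx", ".xls", ".xlsx", ".ppt", ".pptx", ".pdf", ".txt"}
CONTAINERS = {".pst", ".ost", ".zip", ".rar", ".7z", ".gz"}
CASE_DEPENDENT = {".jpg", ".html"}
PROGRAM = {".exe", ".dll", ".ocx", ".lib"}


def classify(name):
    """Return a culling category for a file name based on its last extension."""
    suffix = PurePath(name).suffix.lower()
    if suffix in USER_CREATED:
        return "user"
    if suffix in CONTAINERS:
        return "container"
    if suffix in CASE_DEPENDENT:
        return "case-dependent"
    if suffix in PROGRAM:
        return "program"
    return "unknown"
```

For example, classify("Budget.XLSX") returns "user" and classify("backup.tar.gz") returns "container". Anything that falls into "unknown" is left for a person to decide rather than being silently dropped.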
Custodian Names – Where general file locations or mailbox names are known,
specific folder branches or mailbox folders may be collected.
The amount of chaff removed by these means will vary, but we see culling rates of 60% to
80%, and sometimes better. Additional means, such as conversational analytics
and concept clustering, may provide further improvement.
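To put numbers on that range, the short calculation below applies it to the one-terabyte (1,024 gigabyte) example used earlier. The function name is made up for the illustration, and the result is simple arithmetic, not a prediction for any particular matter.

```python
TOTAL_GB = 1024  # one terabyte, as in the example above


def remaining_gb(total_gb, culled_percent):
    """Gigabytes left for review after culling the given percentage."""
    return total_gb * (100 - culled_percent) / 100


for pct in (60, 80):
    print(f"{pct}% culled leaves {remaining_gb(TOTAL_GB, pct):.1f} GB for review")
```

In other words, a terabyte at the top of the funnel comes out as roughly 205 to 410 gigabytes at the bottom.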
This leads us to keyword searching within the body of the text that remains. My
principal concern with keywords is that, if used in the collection process rather than in
culling a collected set, you would need to go back to the source and collect again if the
search terms changed – which they tend to do as one learns more about a case. But
there are other caveats to bear in mind when using keywords:
(a) You need text, or the ability within your culling (or collection) tool to OCR
    on the fly. The same applies, by the way, to technology-assisted clustering/culling
    tools.
(b) Your tool must be able to search within the aforementioned archive
    container files.
(c) Not all OCR is good OCR – think PDFs from images, or pure images.
(d) Not all possible relevant search terms, or their synonyms, abbreviations or
    codewords, are known at the point of culling (or collecting).
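Caveat (b) can be illustrated with a small sketch that searches for terms inside the members of a ZIP archive without extracting it to disk. It is deliberately limited: it assumes the members are plain text in a single encoding, it does not open nested archives or PST files, and it does no OCR, so caveats (a) and (c) still apply. The function name and its parameters are assumptions for the example.

```python
import zipfile


def search_zip(zip_source, terms, encoding="utf-8"):
    """Return {member name: set of matching terms} for a ZIP archive.

    zip_source may be a file path or a file-like object. Matching is a
    case-insensitive substring test on each member's decoded text.
    """
    wanted = [t.lower() for t in terms]
    hits = {}
    with zipfile.ZipFile(zip_source) as archive:
        for info in archive.infolist():
            if info.is_dir():
                continue
            text = archive.read(info).decode(encoding, errors="ignore").lower()
            found = {t for t in wanted if t in text}
            if found:
                hits[info.filename] = found
    return hits
```

Because the search runs against the collected set, the term list can be revised and the search rerun without returning to the source, which is the advantage of keyword culling over keyword collection noted above.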
The best policy is always to have a policy beforehand. This makes it easier to
impose a legal hold, to undertake a risk assessment, and to prepare a data map for
preservation and collection. It also costs a lot less overall.
When contemplating the cost of collecting and culling data, bear in mind that the
less material at the top of the funnel, the lower the cost of the funnel required. Culling a
banker’s box worth of hard drives is going to cost a lot, but nothing compared to the
cost of reviewing more data than is absolutely necessary. And that is the bottom line.
-- Andy Kass
917-512-7503
The views expressed in this E-Discovery Tip Sheet are solely the views of the author, and do not necessarily
represent the opinion of U.S. Legal Support, Inc.
www.uslegalsupport.com
Copyright © 2015 U.S. Legal Support, Inc., 90 Broad Street, New York NY 10004 (800) 824-9055. All rights reserved.