E-Discovery Tip Sheet
November 2015

Culling Data Down to Size

We always mean to go through our closets, our sock drawers, our magazine racks (to the extent we still have them), and winnow out the stuff that is just taking up space. You know, the shirt you don’t love and haven’t worn in two years but it’s in decent shape, so you keep it? The baseball faux stirrup socks that no longer entirely encompass your toes? All those old print issues of The Onion? The day you finally go after all that dreck, you wonder where all the open space came from, and why you put it off for so long.

You know where I’m going with this. It’s the same with data storage, but more so. There is a cost for storage, just as there is a cost per square foot of what a landlord calls living space. Storage used to mean physical space, so the cost was at least easier to see. Now that data lives in electronic devices, possibly not even located in your downtown office, it is further out of sight and out of mind… until you need to find something, or are compelled to do so by a production request or subpoena.

Searching, retrieval, backup, transfer, power, bandwidth, security: everything becomes more complicated, more time-consuming, and more expensive if you do not have rules governing data retention and retirement. Add potential litigation to the scenario, and the costs rise by an order of magnitude; at that point, you don’t have the time for an orderly inventory and rulemaking process. You just need to freeze, find, preserve and collect responsive data, dynamic data, privileged data, possibly damaging data, now.

Policy Options

Let’s say that you do not (yet) have a document retention / document destruction policy, which would govern the prescribed handling and life cycles of various document types. Such a policy, if scrupulously applied, may be submitted to explain why a certain class of requested documents is not available from the period of the actions described in the complaint.
But without a policy, that explanation is not available and the data must be produced.

If you don’t have a formal corporate policy, you may at least have an IT-based policy regarding maintenance and storage of backups, or an email archive system. Backup software, particularly now that tapes are fading from the scene, may be set to run full/differential backup cycles of a defined length and retention period. This is acceptable, as long as backups are retained once the possibility of litigation is acknowledged. Archiving systems that retain emails in an alternate location would need to be examined to make sure that no emails are being removed, and then be accounted for in collection planning.

Funnelling Data

Okay, so you have no policy of any kind and have to produce from everything you’ve ever committed to electronic storage (and, one hopes, not deleted ad hoc). Let’s say a data map of servers and custodial user systems turns up a tidy Terabyte (1,024 Gigabytes) of raw data. How can human beings possibly analyze this much information, let alone under deadline?

Think of a kitchen funnel: wide at the top, narrow at the bottom, to aid in transferring a big bowl of stuff into a bottle or jar. That is the model for culling, except that the means used to cull electronic information are also electronic in nature. Culling relies on properties of the files called metadata (information describing aspects of an electronic file and file system), and sometimes on searching for terms in the textual body of each file.

Here are a few prime examples of how one would funnel a Terabyte of data:

DeNISTing – If an entire volume is collected (generally a smaller volume such as a custodian’s computer, where possibly relevant data may have been deleted), as opposed to selectively searching for relevant information, known program files and operating system components may well have been swept up.
Applying lists of hashes of such files (for example, winword.exe), as registered with a body such as the National Institute of Standards and Technology (NIST), can help remove these artifacts from your collected data.

Deduplication – A mathematical calculation (a hash) is run over the content of each file; files that produce identical hash values are treated as identical, so only one copy needs to be kept. This comes up a lot in email with multiple custodians who are on the same distribution list, for example, and can significantly reduce the amount of information to be processed and reviewed.

Date Range – The subject matter of litigation is alleged to have occurred at some point in time. Only files created or modified within that date range need be collected for consideration.

File Types – Based upon the circumstances of a case, you may select the kinds of file types you wish to exclude, or collect.
● Some file types are commonly associated with user-created data: .DOC/.DOCX (MS Word), .XLS/.XLSX (MS Excel), .PPT/.PPTX (MS PowerPoint), .PDF (Adobe Acrobat), sometimes .TXT (Text).
● Containers such as MS Outlook PST or OST files, or archive formats such as ZIP, RAR, 7z or TAR.gz, are collected for extraction and further culling and examination.
● Some may be user content, or something downloaded: think .JPG (a photo or a Web picture) or .HTML (a Web page, or a user-generated report). These may also be collected, depending on the case.
● Others are program-related and will almost never have user data: .EXE, .DLL, .OCX, .LIB, and so on.

Custodian Names – Where general file locations or mailbox names are known, specific folder branches or mailbox folders may be collected.

The amount of chaff removed by these means will vary, but we see culling rates of 60% to 80%, sometimes better. Additional means, such as conversational analytics and concept clustering, may provide further improvement.
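To make the funnel concrete, here is a minimal Python sketch of four of the steps above (DeNISTing, file types, date range, deduplication) applied to files already collected into memory. It is an illustration under stated assumptions, not how any particular processing tool works: the names Item, content_hash, cull and USER_EXTENSIONS are hypothetical, and SHA-256 is an arbitrary choice, since in practice the hash algorithm has to match whatever the reference hash list uses.

```python
import hashlib
from dataclasses import dataclass
from datetime import datetime
from pathlib import PurePath


@dataclass(frozen=True)
class Item:
    """One collected file (hypothetical structure for this sketch)."""
    path: str
    content: bytes
    modified: datetime


def content_hash(data: bytes) -> str:
    # Assumption: SHA-256. Use the algorithm your reference list was built with.
    return hashlib.sha256(data).hexdigest()


# Extensions the tip sheet associates with user-created data.
USER_EXTENSIONS = {".doc", ".docx", ".xls", ".xlsx", ".ppt", ".pptx", ".pdf", ".txt"}


def cull(items, known_hashes, start, end, keep_extensions=USER_EXTENSIONS):
    """Return the items that survive all four filters, in original order."""
    seen = set()
    kept = []
    for item in items:
        digest = content_hash(item.content)
        # DeNISTing: drop files whose hash is on the known-program list.
        if digest in known_hashes:
            continue
        # File types: keep only the extensions selected for this case.
        if PurePath(item.path).suffix.lower() not in keep_extensions:
            continue
        # Date range: keep only files modified inside the relevant period.
        if not (start <= item.modified <= end):
            continue
        # Deduplication runs last, so a copy rejected by an earlier filter
        # cannot suppress an identical copy that would have survived.
        if digest in seen:
            continue
        seen.add(digest)
        kept.append(item)
    return kept
```

Calling cull(items, known_hashes, start, end) with a set of known hash strings and two datetime bounds returns the surviving items. A real workflow would also expand containers (PST, ZIP and so on) before filtering and would log every exclusion so the cull can be defended later; both are left out here for brevity.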
This leads us to keyword searching within the body of the text that remains. My principal concern with keywords is that, if they are used in the collection process rather than in culling a collected set, you would need to go back to the source and collect again if the search terms changed, which they tend to do as one learns more about a case. But there are other caveats to bear in mind when using keywords:

(a) You need text, or the ability within your culling (or collection) tool to OCR on the fly. The same applies, by the way, to technology assisted clustering/culling tools.
(b) Your tool must be able to search within the aforementioned archive container files.
(c) Not all OCR is good OCR – think PDFs from images, or pure images.
(d) Not all possible relevant search terms, or their synonyms, abbreviations or codewords, are known at the point of culling (or collecting).

The best policy is always to have a policy beforehand. This makes it easier to impose a legal hold, to undertake a risk assessment, and to prepare a data map for preservation and collection. It also costs a lot less overall.

When contemplating the cost of collecting and culling data, bear in mind that the less material at the top of the funnel, the lower the cost of the funnel required. Culling a banker’s box worth of hard drives is going to cost a lot, but nothing compared to the cost of reviewing more data than is absolutely necessary. And that is the bottom line.

-- Andy Kass
[email protected]
917-512-7503

The views expressed in this E-Discovery Tip Sheet are solely the views of the author, and do not necessarily represent the opinion of U.S. Legal Support, Inc.

U.S. LEGAL SUPPORT, INC.
ESI & Litigation Services
www.uslegalsupport.com
Copyright © 2015 U.S. Legal Support, Inc., 90 Broad Street, New York NY 10004 (800) 824-9055. All rights reserved.