Keywords And Concepts: Searching For The Legal Needle In The Cyber Haystack

Ellen S. Pyle

“If Edison had a needle to find in a haystack, he would proceed at once with the diligence of the bee to examine straw after straw until he found the object of his search.... I was a sorry witness of such doings, knowing that a little theory and calculation would have saved him ninety per cent of his labor.”
Nikola Tesla (1857 - 1943), New York Times, October 19, 1931

Ellen S. Pyle is an attorney and freelance writer based in Washington, D.C. Over the past decade, her litigation practice has increasingly emphasized project management for large e-discovery and electronic production. She can be reached at elle.pyle@gmail.com.

The Practical Lawyer | 60

VIRTUALLY ALL FIRMS have heard about (and experienced) the escalating costs of electronic document review. The problem can so overburden a small client or law firm that legal disputes risk becoming a contest over who can best afford the e-discovery, rather than a contest over the actual legal issues.

To illustrate the problem, and its potential financial gravity, let’s take a recent case that I was involved in. The case’s initial electronic document pull (copying the relevant portions of the client’s hard drives onto CD-ROM) yielded a set of 52 CD-ROMs. My review team included a variety of seasoned reviewers. Over the course of the first year of the review, the average reviewer reviewed 200 pages a day. Taking this average speed and extrapolating it over the 250 days worked that year, I estimated that each reviewer covered approximately 50,000 pages, or 500 megabytes, over the course of the year. 500 megabytes is basically the equivalent of one CD-ROM of information. Each reviewer worked 50 hours a week for that year and billed a total of 2,500 hours at the rate of $200 per hour. That is, just one reviewer billed $500,000. (One CD is about 650 MB, or 50,000 pages.
One DVD is about 4.7 GB, or 350,000 pages. One digital linear tape (DLT) is about 40/80 GB, or three to six million pages. An average email is 1.5 pages (100,099 pages per GB). The average experienced reviewer can read and code between 100 and 200 pages in an eight-hour period.)

So, for our unfortunate client, the cost of reviewing just one of the 52 CDs was $500,000. Extrapolating this to all of the documents in the case, its duration of a few years, and the involvement of several full-time reviewers/coders, associates, and partners, it is easy to see why the client would want the review shortened in length and narrowed in breadth.

To find a way to do that, we need to examine the two most common search methodologies, keyword and concept, and weigh their relative strengths and weaknesses. We must first understand what each search method really is. That is, what are “keyword” and “concept” searching?

KEYWORD SEARCHING • Keyword searching is the most common form of textual search in e-discovery, and indeed, on the Web. Most Web search engines operate from a simple keyword search platform. A keyword forms the heart of the search query and retrieval. A keyword can be any word that might exist in an Internet search database. To be most effective, it should not be a common word such as “the,” “in,” “an,” or “I.” Such a word would return too many hits when we ran the search. More importantly, because such a word is not truly keyed to our search intent, it would return a lot of hits but not much substance related to our research.

Our keyword, rather, should be a unique word that is keyed to our search intent. If, for example, we were researching the topic of this article, we would be looking for substantive information on database searching, and our keyword might be “keyword” or “concept.” Both words are relatively uncommon.
Both words are also uniquely keyed to our search intent: database searching. Both words will pull substantive hits that are truly related, or keyed, to our intent.

Keyword searches are familiar to most litigators. Keyword searching has been around for decades in legal research. You can run a keyword search in Lexis, Westlaw, or any number of other legal research platforms. In e-discovery, keyword searches are often run over vast databases of documents, usually emails. One can use a particular name as a keyword to limit emails to just those involving a particular person, or run a search on a particular legal term, or on a particular product in a products liability case. In such a case, the keyword search will certainly cull the database, leaving a smaller set to formally review. In each case, though, there may be a high percentage of irrelevant documents.

As anyone who has run a search based solely on a name can attest, the high number of irrelevant hits is one of the main drawbacks of keyword searching, in e-discovery as on any other search platform. The search generally retrieves a high volume of irrelevant documents and email, since it captures the named person’s personal emails, junk mail, and messages not remotely related to the case. Such a search can also be underinclusive, missing many emails that might be quite relevant simply because they do not include the keyword or name, or misspell it.

In short, keyword searching has a major plus: it is so intuitive that it is simple to use. Keyword searches also have distinct limitations. They tend to be overinclusive, retrieving an inordinate number of irrelevant documents. They are also distinctly underinclusive, failing to capture those emails that relate to the intended subject of the search but simply do not include, or misspell, the keyword itself.
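Both limitations can be seen in a few lines of code. The following is a minimal sketch, not the behavior of any particular e-discovery platform: the four emails, the name “Smith,” and the function keyword_search are all invented for illustration, and the search is assumed to be a simple case-insensitive whole-word match.

```python
import re

# Hypothetical mini-corpus; the name and messages are invented for illustration.
EMAILS = {
    "e1": "Smith approved the revised wafer specification on Friday.",
    "e2": "Lunch with Smith at noon? The new deli is open.",
    "e3": "Smtih signed off on the wafer specification change.",  # name misspelled
    "e4": "The wafer specification change was approved by the lead engineer.",
}


def keyword_search(docs, keyword):
    """Return the ids of documents containing the keyword as a whole word."""
    pattern = re.compile(r"\b" + re.escape(keyword) + r"\b", re.IGNORECASE)
    return sorted(doc_id for doc_id, text in docs.items() if pattern.search(text))


hits = keyword_search(EMAILS, "Smith")
print(hits)
```

Under these assumptions the search returns e1 and e2. The lunch invitation (e2) is an irrelevant hit, which is the overinclusive problem. The misspelled email (e3) and the email that never names the person (e4) are both about the specification change yet are missed, which is the underinclusive problem.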
CONCEPT SEARCHING • Judge Facciola recently delivered an opinion that distinguishes keyword searching from concept searching, Disability Rights Council of Greater Wash. v. Wash. Metro. Area Transit Auth., 242 F.R.D. 139 (D.D.C. 2007). After plaintiff proposed a keyword search of emails based on persons’ names, Judge Facciola suggested instead that the parties consider concept searching. Ordering the parties to meet and discuss as required by the court’s rules, he further encouraged them to discuss how to search the emails, either through keyword searching or, as he suggested might be more efficient, through concept searching.

So what is concept searching? Concept searches endeavor to go a little further than keyword searches. Instead of focusing on the exact word used, the concept search focuses on the meaning behind the word or words. Often the concept search is based on a series of words, or word clusters. The researcher generally broadens the keyword search by including synonyms, a thesaurus of case-specific vocabulary, or both. Word clusters are chosen to represent those word groupings that together might best retrieve documents dealing with the subject matter or theme of the research. The words in a document may not precisely match the set in the query; they may match only a few of the set, or only one. Documents that hit on the majority of the terms in the search set are generally ranked highest in statistical relevance upon retrieval. Those that hit on just one or a few of the terms are ranked as less relevant. The concept search method functions by forming and building upon word clusters based on the search subject. Words are chosen which, when placed in proximity to each other, present a meaning unique to the research.

June 2010

An Example
Let’s suppose, for example, that we are working on a trademark case involving the parameters of semiconductor wafers.
We are trying to retrieve a company’s research and emails related to the “bow” and warp of its semiconductor substrates. We run a search on “bow” and retrieve an extraordinary number of hits, in the hundreds of thousands. At first we are buoyed by our successful results. However, as we begin to review the emails and documents retrieved, we find an equally extraordinary number of hits on irrelevant items. Emails discussing persons taking a bow, or bowing to pressure, are in the cull, as are documents about an employee’s multiple issues with his boat repairs (on the bow). Emails about an employee’s daughter’s tap dancing costume are also among the hits, as the discussion touches on the color of the bow for the youngster’s hair.

Same Problem?
This exemplifies the problem we discussed earlier with keyword searching, and with e-discovery searches in general: numerous irrelevant hits. The problem often stems from homonyms. A homonym is a word that shares its spelling, and often its pronunciation, with another word of different meaning. In the case of our word “bow,” the documents retrieved could use the word in any of its senses. The word may refer to the weapon used in archery, a device used to play music, a part of a boat, a type of knot, a city in England or Kentucky, or, on some occasions, a parameter of semiconductors (as our research intended).

We can quickly see how this has become an unproductive and potentially very expensive exercise. As we are paying by the hour to review these documents, each and every irrelevant document costs us time and hence money. As the irrelevant hits number in the hundreds of thousands, so also may our review costs.
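The word-cluster ranking described under Concept Searching can be sketched against this same “bow” example. This is only an illustration under stated assumptions: the cluster terms, the four documents, and the function concept_rank are hypothetical, and the score is a plain count of distinct cluster terms found in each document, which is far simpler than what commercial concept-search tools do.

```python
import re

# Hypothetical word cluster for the semiconductor "bow" example (an assumption).
CLUSTER = {"bow", "warp", "wafer", "substrate", "flatness"}

# Invented documents echoing the kinds of hits described in the example.
DOCS = {
    "d1": "Measured bow and warp on the latest wafer lot; substrate flatness is within spec.",
    "d2": "The bow of my boat needs repair again this weekend.",
    "d3": "She wants a red bow for her hair at the tap recital.",
    "d4": "Wafer warp increased after the anneal step.",
}


def concept_rank(docs, cluster):
    """Rank documents by how many distinct cluster terms each one contains."""
    scored = []
    for doc_id, text in docs.items():
        words = set(re.findall(r"[a-z]+", text.lower()))
        score = len(words & cluster)
        if score > 0:
            scored.append((doc_id, score))
    # Highest score first; ties are broken by document id for a stable order.
    return sorted(scored, key=lambda pair: (-pair[1], pair[0]))


for doc_id, score in concept_rank(DOCS, CLUSTER):
    print(doc_id, score)
```

With these inputs, the substrate measurement email (d1) matches all five terms and ranks first. The anneal email (d4) ranks second even though it never uses the word “bow.” The boat and hair-bow emails match only one term each and fall to the bottom, where a reviewer could give them lower priority.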