Keywords And Concepts:
Searching For The Legal Needle In
The Cyber Haystack
Ellen S. Pyle
“If Edison had a needle to find in a haystack,
he would proceed at once with the diligence of
the bee to examine straw after straw until he
found the object of his search.... I was a sorry
witness of such doings, knowing that a little
theory and calculation would have saved him
ninety per cent of his labor.” Nikola Tesla (1856-1943),
New York Times, October 19, 1931
Ellen S. Pyle is an attorney and freelance writer based in Washington, D.C. Over the past decade, her litigation
practice has increasingly emphasized project
management for large e-discovery and electronic
production. She can be reached at elle.pyle@gmail.com.
The Practical Lawyer | 60
VIRTUALLY ALL FIRMS have heard about (and experienced) the escalating costs of electronic document
review. The problem can so overburden a small client or
law firm that legal disputes risk becoming a contest
over who can best afford the e-discovery, rather than a
contest over the actual legal issues.
To illustrate the problem, and its potential financial
gravity, let’s take a recent case that I was involved in. The
case’s initial electronic document pull (copying the relevant portions of the client’s hard drives onto CD-ROM)
yielded a set of 52 CD-ROMs. My review team included
a variety of seasoned reviewers. Over the course of the
first year of the review, the average reviewer reviewed 200
pages a day. Taking this average speed, and extrapolating
it over the 250 days worked that year, I estimated that each
reviewed approximately 50,000 pages, or 500 megabytes,
over the course of the year. 500 megabytes is roughly
the equivalent of one CD-ROM of information. Each reviewer
worked 50 hours a week for that year, and
billed a total of 2,500 hours at the rate of $200 per
hour. That is, just one reviewer billed $500,000.
(One CD is about 650 MB, or 50,000 pages. One
DVD is about 4.7 GB, or 350,000 pages. One digital linear tape (DLT) is about 40/80 GB, or three
to six million pages. An average email is 1.5 pages
(100,099 pages per GB). The average experienced
reviewer can read and code between 100 and 200
pages in an eight-hour period.) So, for our unfortunate client, the cost of reviewing just one of the
52 CDs was $500,000. Extrapolating this out to the
entire case’s documents, its length over the course
of a few years, and the involvement of several full
time reviewers/coders, associates, and partners, it
is easy to see why the client would want the review
shortened in length and narrowed in breadth.
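The arithmetic above can be checked with a short Python sketch. The figures are those given in the text; the function and variable names are illustrative only. The 52-CD total is a simple linear extrapolation that assumes every CD takes one reviewer-year at the same rate, a figure the text implies but does not itself state.

```python
# Figures taken from the example in the text.
PAGES_PER_DAY = 200      # average pages reviewed per reviewer per day
DAYS_PER_YEAR = 250      # days worked in the first year
HOURS_BILLED = 2_500     # 50 hours a week over the year
RATE_PER_HOUR = 200      # dollars per hour
CDS_IN_PULL = 52         # CD-ROMs in the initial document pull


def pages_per_reviewer_year(pages_per_day, days):
    """Pages one reviewer gets through in a year."""
    return pages_per_day * days


def cost_per_reviewer_year(hours, rate):
    """Dollars billed by one reviewer in a year."""
    return hours * rate


pages = pages_per_reviewer_year(PAGES_PER_DAY, DAYS_PER_YEAR)
cost_one_cd = cost_per_reviewer_year(HOURS_BILLED, RATE_PER_HOUR)

# Assumption: each CD takes about one reviewer-year, so cost scales linearly.
cost_all_cds = cost_one_cd * CDS_IN_PULL

print(f"Pages per reviewer-year: {pages:,}")
print(f"Cost to review one CD:   ${cost_one_cd:,}")
print(f"Cost to review 52 CDs:   ${cost_all_cds:,}")
```

On these assumptions, first-pass review of the full pull would run to roughly $26 million before any associate or partner time is counted.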
To determine a way to shorten its length and
breadth, we need to examine the two most common
search methodologies, keyword and concept, and
determine their relative strengths and weaknesses.
We must first understand what exactly each search
method really is. That is, what are “keyword” and
“concept” searching?
KEYWORD SEARCHING • Keyword searching
is the most common form of textual search in e-discovery, and indeed, on the Web. Most Web search
engines operate from a simple keyword search platform.
A keyword forms the heart of the search query and retrieval. A keyword can be any word that
might exist in an Internet search database. To be
most effective, it should not be a common word
such as “the,” “in,” “an,” or “I.” Such a word would
return too many hits when we ran the search. More importantly, because such a word is not truly keyed to
the intent of our search, it would
return many hits but little substance related
to our research. Our keyword, rather, should be a
unique word that is keyed to our search intent.
If, for example, we were researching the topic of
this article, we might be looking
for substantive information on database searches,
and our keyword might be “keyword,” or it might
be “concept.” Both words are relatively uncommon. Both words are also uniquely keyed to our
search intent: database searching. Both words will
pull information and substantive hits that are truly
related or keyed to our intent.
Keyword searches are familiar to most litigators. Keyword searching has been
around for decades in legal research. You can run
a keyword search in Lexis, Westlaw, or any number
of other legal research platforms. In e-discovery,
keyword searches are often run over vast databases of
documents, usually emails. You can use a keyword
that is a particular name to limit emails to just those
involving a particular person, or you can run a
search on a particular legal term, or on a particular
product in a products liability case. In such a case,
the keyword search will certainly cull the database,
leaving a smaller set of documents for formal review. In each case, though, there may be a
high percentage of irrelevant documents.
In e-discovery, as on any other search platform, keyword searches produce a high number of irrelevant hits.
As anyone who has run a
search based solely on a name can attest, this is
one of the main drawbacks of keyword searching.
The search generally returns a high volume of irrelevant
documents and email because it retrieves the
named person’s personal emails, junk mail, and messages that
are not remotely related to the case. Such a search
can also be underinclusive, missing many emails
that might be quite relevant simply because they
do not include the keyword or name, or
misspell it.
In short, keyword searching has one major advantage:
it is so intuitive that it is simple to use. Keyword
searches also have distinct limitations. They tend
to be overinclusive, retrieving an inordinate number of irrelevant documents. They
are also distinctly underinclusive, failing to
capture those emails that relate to the intended subject of the search but simply do not include,
or misspell, the keyword itself.
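Both limitations can be seen in a minimal Python sketch of a name-based keyword search. The four sample emails, the name “Smith,” and the `keyword_search` helper are all hypothetical, invented only for illustration; commercial review platforms are considerably more elaborate.

```python
import re

# Common words that make poor keywords, as noted earlier in the article.
STOP_WORDS = {"the", "in", "an", "i"}


def keyword_search(documents, keyword):
    """Return ids of documents containing the keyword as a whole word,
    ignoring case."""
    if keyword.lower() in STOP_WORDS:
        raise ValueError(f"'{keyword}' is too common to be a useful keyword")
    pattern = re.compile(r"\b" + re.escape(keyword) + r"\b", re.IGNORECASE)
    return [doc_id for doc_id, text in documents.items() if pattern.search(text)]


# Hypothetical emails, invented for this sketch.
emails = {
    "e1": "Smith approved the revised wafer specification.",           # relevant
    "e2": "Lunch Friday? Smith is bringing the cake.",                 # personal
    "e3": "Per Smtih, the specification change is approved.",          # misspelled
    "e4": "The specification change was approved by the engineering VP.",  # no name
}

hits = keyword_search(emails, "Smith")
print(hits)  # ['e1', 'e2']
```

The search is overinclusive because it returns the personal email `e2`, and underinclusive because it misses both the misspelled `e3` and the relevant `e4`, which never mentions the name.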
CONCEPT SEARCHING • Judge Facciola recently delivered an opinion that distinguishes between keyword searching and concept searching, Disability
Rights Council of Greater Wash. v. Wash. Metro. Area
Transit Auth., 242 F.R.D. 139 (D.D.C. 2007). After the plaintiff proposed a keyword search of emails
based on persons’ names, Judge Facciola suggested
instead that the parties consider utilizing concept
searching. Ordering the parties to meet and discuss
as required by the court’s rules, he further encouraged them
to discuss how to search the emails, either through
keyword searching or, as he suggested might be
more efficient, through concept searching.
So what is concept searching? Concept searches
endeavor to go a little further than keyword searches. Instead of focusing on the exact word used, the
concept search focuses on the meaning behind the
word(s). Often the concept search is based on a series of words, or word clusters. The researcher generally broadens the keyword search, including synonyms, and/or a thesaurus of specific case-related
vocabulary. Word clusters are chosen to represent
those word groupings that together might best retrieve hits on documents that deal with the subject
matter or theme of the research. The words may
not precisely match the set in the query, may match
only a few of the set, or may match only one. Those
documents that hit on the majority of the terms in
the search set are generally ranked highest statistically in relevance upon search retrieval. Those that
hit on just one or a few of the terms in the search
set are ranked as less relevant. The concept search
method functions by forming and building upon
word clusters based on the search subject. Words
are chosen which, when placed in proximity to each
other, present a meaning unique to the research.
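The ranking idea described above can be sketched in a few lines of Python. This is a deliberate simplification that assumes relevance is just the share of cluster terms a document contains; actual concept search tools rely on more sophisticated statistical and linguistic methods. The word cluster (from a hypothetical products liability matter), the sample documents, and the function names are all invented for illustration.

```python
import re


def tokenize(text):
    """Lower-case a document and return its set of distinct words."""
    return set(re.findall(r"[a-z]+", text.lower()))


def rank_by_cluster(documents, cluster):
    """Score each document by the fraction of cluster terms it contains,
    and return (id, score) pairs from most to least relevant.
    Documents matching no term are left out."""
    cluster = {term.lower() for term in cluster}
    scored = []
    for doc_id, text in documents.items():
        matched = cluster & tokenize(text)
        if matched:
            scored.append((doc_id, len(matched) / len(cluster)))
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Hypothetical word cluster and documents.
cluster = {"brake", "defect", "recall", "failure"}
documents = {
    "d1": "The brake defect led to a recall after repeated failure reports.",
    "d2": "Engineering flagged a possible brake defect.",
    "d3": "Please recall that the meeting moved to Tuesday.",
    "d4": "Quarterly sales figures attached.",
}

ranked = rank_by_cluster(documents, cluster)
print(ranked)  # [('d1', 1.0), ('d2', 0.5), ('d3', 0.25)]
```

The document that matches every term in the set ranks first, while `d3`, which matches only “recall” and uses it in an unrelated sense, falls to the bottom of the list rather than being treated as an equal hit.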
June 2010
An Example
Let’s suppose, for example, that we are working on a
trademark case involving the parameters of semiconductor wafers. We are trying to retrieve a company’s research and emails related to the “bow”
and warp of its semiconductor substrates. We
run a search on “bow,” and retrieve an extraordinary number of hits, in the hundreds of thousands.
At first we are buoyed by our successful results.
However, as we begin to review the emails and
documents retrieved in our hits, we find that there
is just as extraordinary a number of hits on irrelevant items. Emails discussing persons taking a
bow, or bowing to pressure, are in the cull, as are
documents relating to an employee’s multiple issues
with his boat repairs (on the bow). Emails relating to
an employee’s daughter’s tap dancing costume are
also among the hits, as the discussion touches on
the color of the bow for the youngster’s hair.
Same Problem?
This exemplifies the problem we discussed earlier with keyword searching, and with e-discovery searches
in general: numerous irrelevant hits. The problem
often stems from homonyms. A homonym is a
word or group of words which, though they share
the same spelling and pronunciation, have different
meanings.
In the case of our word “bow,” the hits
could include documents using the word in any of its
senses. The word may refer to the weapon
used in archery, a device used to play music, a part
of a boat, a type of knot, a place in England or Kentucky, or, on some occasions, parameters of semiconductors (as our research intended).
We can quickly see how this has become an unproductive and potentially very expensive exercise.
As we are paying by the hour to review these documents, each and every irrelevant document is costing us time and hence money. As the irrelevant hits
number in the hundreds of thousands, so too may
our review costs.
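The “bow” problem, and the earlier point that words take on a case-specific meaning when they appear in proximity to one another, can be illustrated with a rough Python sketch. The sample emails, the list of context terms, and the eight-word window are all assumptions made for this illustration, not a description of any particular review tool.

```python
import re

# Assumed case-specific vocabulary; a real matter would need a fuller list.
CONTEXT_TERMS = {"warp", "wafer", "wafers", "substrate", "substrates",
                 "semiconductor"}


def tokens(text):
    """Split a document into lower-case words, in order."""
    return re.findall(r"[a-z]+", text.lower())


def hits_near_context(text, keyword, context_terms, window=8):
    """True if the keyword appears within `window` words of a context term."""
    words = tokens(text)
    for i, word in enumerate(words):
        if word == keyword:
            nearby = words[max(0, i - window): i + window + 1]
            if context_terms & set(nearby):
                return True
    return False


# Hypothetical emails modeled on the examples in the text.
emails = {
    "m1": "Measured bow and warp on the new wafer lot exceed the spec.",
    "m2": "She took a bow after the recital, and the bow in her hair matched.",
    "m3": "The boat needs repairs on the bow again.",
    "m4": "He is bowing to pressure from the board.",
}

# A plain text match on "bow" returns every email, including "bowing".
plain_hits = [k for k, text in emails.items() if "bow" in text.lower()]

# Requiring a nearby context term keeps only the semiconductor email.
narrowed = [k for k, text in emails.items()
            if hits_near_context(text, "bow", CONTEXT_TERMS)]

print(plain_hits)  # ['m1', 'm2', 'm3', 'm4']
print(narrowed)    # ['m1']
```

In this toy set the plain search sends four emails to review and the proximity filter sends one. A filter this crude would also discard a relevant email that mentions bow without any of the listed context terms, so it trades overinclusion for some risk of underinclusion.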