Demystifying Analytics in eDiscovery

FOCUSED DISCOVERY
The facts you need to succeed and thrive in an electronic world
By Steven Toole, vice president of marketing at Content Analyst Company
A recent eDiscovery Journal blog
entry by Greg Buckles pointed out
that the number of companies
providing the analytics technology
in early case assessment (ECA) and
review platforms is small, and the specific
analytics capabilities can vary wildly
from one platform to the next. While
this may not be breaking news to those
closest to analytics in eDiscovery, it did
reveal some mystery about how the
market uses analytics in the early stages
of eDiscovery. In addition, it raised
questions about how analytics fit into
the eDiscovery workflow, and ultimately,
the return on investment that analytics
have on eDiscovery and information
governance in general.
This overview is designed to dispel
the mystery surrounding
analytics in eDiscovery and information
governance, and provide insights about
the return on investment analytics can
enable for those who embrace these
capabilities. Corporate counsel that get
ahead of the curve today with forward-thinking strategies such as these will be
the ultimate heroes and beneficiaries
of eDiscovery analytics, leading their
field with a much more proactive and
cost-effective approach to information
governance and legal technology.
The Analytics Land Grab
While there are plenty of stakes in the
ground across the eDiscovery landscape,
law firms and service providers are
looking “to the West” for unclaimed
territory. The gigabyte gold rush is all
about applying analytics to the data
further upstream in order to “own” the
data long before it’s needed in a matter.
ECA was the first frontier law firms and
service providers looked toward in order
to move upstream, but the greenest
pastures are still further upstream, in what
some call “pre-discovery.” The Golden
Rule in eDiscovery is simple: Those
who “rule” the content get the gold.
Translation: the vendor that can apply
analytics to the data earliest, before
that data is needed in a matter, provides
the most value to corporations, and
therefore has a great competitive advantage.
Recipe for Success
This all sounds good, but what does
this really mean? You can’t bake a cake
until you know what the ingredients are,
what they do, and how they can affect
the output. And since each eDiscovery
solution has a different set of analytics
capabilities, here’s a brief tutorial on
the key ingredients of analytics for
eDiscovery and information governance.
Dynamic Clustering — This is a good
place to start, especially if you know little
to nothing about the content. Clustering
“buckets” the content (documents,
emails, etc.) into natural groupings of
conceptually related materials. One major
benefit of clustering is that it provides a
very fast map of the document landscape
in a highly objective, consistent, concept-aware fashion. As a result, the reviewer
can jump straight to the cluster that’s of
most interest (conceptually relevant),
and avoid spending time in clusters of
no conceptual relevance. In terms of
information governance, it can also help
identify and weed out the ROT (redundant,
outdated and trivial) content quickly
and easily, thus reducing costs and helping
to increase ROI.
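As a rough illustration of the bucketing idea above, the sketch below groups documents in a single greedy pass. It is not how any particular eDiscovery platform clusters content: plain word overlap (cosine similarity on term counts) stands in for a true conceptual model, and the `cluster` function, the stopword list, the 0.3 threshold, and the sample documents are all assumptions made for this example.

```python
import math
import re
from collections import Counter

# Hypothetical, deliberately tiny stopword list for this example.
STOPWORDS = {"the", "a", "of", "for", "to", "and", "is", "in", "on"}

def vectorize(text):
    """Turn text into a bag-of-words term-frequency vector."""
    words = re.findall(r"[a-z]+", text.lower())
    return Counter(w for w in words if w not in STOPWORDS)

def cosine(a, b):
    """Cosine similarity between two sparse term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def cluster(docs, threshold=0.3):
    """Greedy single pass: join the most similar cluster or start a new one."""
    clusters = []  # each cluster: {"centroid": Counter, "members": [indices]}
    for i, text in enumerate(docs):
        vec = vectorize(text)
        best, best_sim = None, threshold
        for c in clusters:
            sim = cosine(vec, c["centroid"])
            if sim >= best_sim:
                best, best_sim = c, sim
        if best is None:
            clusters.append({"centroid": Counter(vec), "members": [i]})
        else:
            best["centroid"].update(vec)  # centroid = summed term counts
            best["members"].append(i)
    return [c["members"] for c in clusters]

docs = [
    "Quarterly revenue forecast for the sales team",
    "Sales team revenue forecast revised",
    "Lunch menu for the office party",
    "Office party lunch menu options",
]
groups = cluster(docs)
print(groups)  # [[0, 1], [2, 3]]
```

A reviewer could then open only the group that matters (here, the revenue documents) and skip the rest.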
Term Expansion — You have a keyword,
name or technical term, abbreviation,
or acronym, and you want a list of all
similar, or highly related, terms so you
can expand your search for documents
containing those terms as well. Term
expansion identifies conceptually related
terms, customized to your content, and
ranked in order of relevance. For example,
Barack Obama might produce a list such
as President Obama, Commander-in-Chief, Senator Obama, Michelle Obama,
the Oval Office, Office of the President,
POTUS, etc. In a matter, that means
finding more conceptually related content
faster and easier, saving time and money.
In information governance, it helps
identify content related to corporate
records, intellectual property, and
compliance, as well as, of course, more
ROT for defensible deletion.
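One simple way to picture term expansion is to rank terms by how often they appear in the same documents as the seed term. The sketch below does only that. It is a co-occurrence approximation, not the conceptual model a commercial engine would use, and `expand_term` and the four sample documents are hypothetical.

```python
import math
import re
from collections import defaultdict

def tokenize(text):
    """Unique lowercase words in a document."""
    return set(re.findall(r"[a-z]+", text.lower()))

def expand_term(term, docs, top_n=3):
    """Rank other terms by how strongly they co-occur with `term`."""
    postings = defaultdict(set)  # term -> indices of documents containing it
    for i, text in enumerate(docs):
        for word in tokenize(text):
            postings[word].add(i)
    target = postings.get(term, set())
    scores = {}
    for word, doc_ids in postings.items():
        if word == term:
            continue
        shared = len(target & doc_ids)
        if shared:
            # Cosine similarity between the two terms' document sets.
            scores[word] = shared / math.sqrt(len(target) * len(doc_ids))
    # Highest score first; ties broken alphabetically for stable output.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))[:top_n]

docs = [
    "potus signs budget",
    "potus president signs treaty",
    "president potus speech",
    "senator budget vote",
]
related = [word for word, _ in expand_term("potus", docs)]
print(related)  # ['president', 'signs', 'speech']
```

Because the ranking is learned from the documents themselves, the expansion list is specific to the collection being searched, which is the "customized to your content" property described above.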
Conceptual Search — You’ve identified
a key document or paragraph, and now you
want to find similar ones. Keyword search
returns only documents that contain the
specific keywords in your Boolean search
string, so the results are only as good as
the string you write. Writing Boolean
search strings can be time-consuming and
may still miss key documents containing
the “unknown” terms not included in your
search string. To find the documents you’d
otherwise miss with keyword searches,
you’ll need to use conceptual search.
Applying mathematical algorithms to
your example document or text selection,
conceptual search looks for matching
patterns in the “map” of the data called
the conceptual space. The benefit is
that conceptual search can find similar
results even if the matching document
doesn’t contain any of the same terms
as the example text. Think abbreviations,
misspellings, acronyms, synonyms, code
words, and related terms you hadn’t heard
of. Then take away the false positives
caused by polysemes (words with more than
one meaning). Translation: uncover highly
relevant yet latent relationships, saving
time and costs.
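To make the "no shared terms" point concrete, here is a toy sketch in which each term is represented by the documents it occurs in, and any piece of text is represented by the sum of its terms' vectors. This is a crude stand-in for a real conceptual space, chosen only because it fits in a few lines. The function names and the sample documents are assumptions for this example, not any vendor's method.

```python
import math
import re
from collections import Counter, defaultdict

def tokenize(text):
    return re.findall(r"[a-z]+", text.lower())

def build_term_vectors(docs):
    """Map each term to a vector over documents it occurs in."""
    vectors = defaultdict(Counter)
    for i, text in enumerate(docs):
        for word in tokenize(text):
            vectors[word][i] += 1
    return vectors

def embed(text, term_vectors):
    """Represent text as the sum of its terms' document-occurrence vectors."""
    total = Counter()
    for word in tokenize(text):
        if word in term_vectors:
            total.update(term_vectors[word])
    return total

def cosine(a, b):
    dot = sum(a[k] * b[k] for k in a if k in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def conceptual_search(example, docs, top_n=3):
    """Return (doc index, score) pairs ranked by similarity to the example."""
    term_vectors = build_term_vectors(docs)
    query = embed(example, term_vectors)
    scored = [(cosine(query, embed(d, term_vectors)), i)
              for i, d in enumerate(docs)]
    ranked = sorted(scored, key=lambda s: (-s[0], s[1]))
    return [(i, round(score, 3)) for score, i in ranked[:top_n] if score > 0]

docs = [
    "automobile car dealership",
    "car engine repair invoice",
    "board meeting agenda",
    "meeting agenda minutes",
]
hits = conceptual_search("automobile", docs)
print([i for i, _ in hits])  # [0, 1]
```

Document 1 never mentions "automobile", yet it is returned because "car" and "automobile" appear together elsewhere in the collection. A keyword search for "automobile" would have found only document 0.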
Auto-Categorization — Predictive coding
is one area that’s gained a lot of attention
in eDiscovery over the past two years.
Auto-categorization is what makes
predictive coding possible. Predictive
coding applies machine learning to
a corpus of documents to categorize
them in any number of ways, such as
privileged, responsive, or
nonresponsive. For example, users can
categorize documents as responsive,
then categorize the responsive
documents even further into relevant
issue sub-categories. Auto-categorization
uses the same conceptual space and
sample document exemplars to find
conceptually similar documents and
label them as appropriate. Again, the big
benefit here is a tremendous amount of
time saving (and cost saving) by letting
the technology bring the most relevant
documents to the forefront, and into the
hands of the domain experts, as quickly
and easily as possible.
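The exemplar-based labeling described above can be sketched as a nearest-exemplar lookup: each document receives the category of the sample document it most resembles. This is a simplified illustration, assuming word-overlap similarity in place of a conceptual space. The `categorize` function, the 0.2 cutoff, and the sample texts are hypothetical.

```python
import math
import re
from collections import Counter

def vectorize(text):
    """Bag-of-words term-frequency vector."""
    return Counter(re.findall(r"[a-z]+", text.lower()))

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def categorize(documents, exemplars, min_similarity=0.2):
    """Label each document with the category of its most similar exemplar."""
    exemplar_vectors = [(label, vectorize(text)) for label, text in exemplars]
    labels = []
    for text in documents:
        vec = vectorize(text)
        best_label, best_sim = "uncategorized", min_similarity
        for label, ex_vec in exemplar_vectors:
            sim = cosine(vec, ex_vec)
            if sim > best_sim:
                best_label, best_sim = label, sim
        labels.append(best_label)
    return labels

# Sample exemplars a reviewer might have coded by hand (made up for this sketch).
exemplars = [
    ("responsive", "pricing agreement with the distributor contract terms"),
    ("privileged", "attorney client advice regarding litigation strategy"),
    ("nonresponsive", "office holiday party catering schedule"),
]
documents = [
    "revised contract terms and pricing for the distributor",
    "legal advice from our attorney on litigation",
    "catering order for the holiday party",
    "weather looks nice today",
]
labels = categorize(documents, exemplars)
print(labels)
# ['responsive', 'privileged', 'nonresponsive', 'uncategorized']
```

Documents that resemble no exemplar closely enough are left uncategorized rather than forced into a label, so they can be routed to a human reviewer.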
Email Threading — The concept of email
threading is fairly simple: find the
subset of emails at the end of each branch
of a conversation thread. Rather than
reading 30 emails back and forth, as
well as sideways among forwarded
branches of the conversation, email
threading finds the subset of emails that
include all of the previous replies (called
“inclusive” emails because these six, for
example, include the whole history). Time
and cost savings of using email threading
are self-evident, but it’s also important
to note that threading reveals exactly
who knew what, when, which is pretty critical in
piecing together the course of events that
unfolded surrounding a matter.
Near-Duplicate Identification — A
similar benefit to email threading exists
with near-duplicate identification.
While conceptual search, clustering or
categorization can identify documents
that are relevant to the case, many
could be various versions of the same
document. Knowing that they’re near
duplicates of each other can save the
time of having to review each one. If it’s
important to know what changed from
one version to the next, when, and by
whom, difference highlighting shows
these changes, again saving time and
reducing cost. Batching near duplicates
together from the outset of a matter also
provides reviewers with a more focused
set of documents.
Putting Your Data on a Diet
Putting these analytics capabilities to
work for you may cause serious weight
loss in your data. In eDiscovery, that
means fewer documents to review by
expensive reviewers. It also means
that the documents they are reviewing
are nothing but the absolutely most
conceptually relevant documents to the
case. Moreover, reviewers are being
presented with documents that are not
batched haphazardly, allowing for a more
focused review, driving accuracy to an all-time high and costs and time even lower.
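As one illustration of this kind of data reduction, the near-duplicate identification described earlier can be approximated by comparing overlapping word sequences ("shingles") and batching documents whose overlap is high. The sketch below is a minimal version of that idea. The three-word shingle size, the 0.5 Jaccard threshold, and the sample documents are assumptions for this example rather than settings from any product.

```python
import re

def shingles(text, size=3):
    """Set of overlapping word n-grams ("shingles") for a document."""
    words = re.findall(r"\w+", text.lower())
    if len(words) < size:
        return {tuple(words)}
    return {tuple(words[i:i + size]) for i in range(len(words) - size + 1)}

def jaccard(a, b):
    """Share of shingles two documents have in common."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def group_near_duplicates(docs, threshold=0.5):
    """Put each document in the first group whose representative it resembles."""
    groups = []  # list of (representative shingles, [document indices])
    for i, text in enumerate(docs):
        sh = shingles(text)
        for rep, members in groups:
            if jaccard(sh, rep) >= threshold:
                members.append(i)
                break
        else:
            groups.append((sh, [i]))  # no match: start a new group
    return [members for _, members in groups]

docs = [
    "The supplier agrees to deliver the goods by the first of March",
    "The supplier agrees to deliver the goods by the first of April",
    "Please find attached the minutes from the board meeting",
]
batches = group_near_duplicates(docs)
print(batches)  # [[0, 1], [2]]
print(f"{len(docs)} documents -> {len(batches)} review batches")
```

Here the two contract versions land in one batch, so a reviewer can read one and check only the differences in the other.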
But Wait — There’s More!
Remember the gigabyte gold rush from
above? The Golden Rule of data? This
is where analytics really are the key to
unlocking all those hidden insights in
a company’s data — long before they’re
needed in eDiscovery. Applying text
analytics to a company’s electronic
records proactively through pre-discovery
means that data is already organized,
reduced, and ready to be presented if and
when a matter arises. Corporate counsel
love this idea because it keeps litigation
costs as low as possible, decreases the
crucial time it takes to investigate or
decide whether to settle a case, and helps
them present their side in the very best
possible light.
The Bottom Line: ROI for
eDiscovery Analytics
Measuring the ROI of using analytics
in eDiscovery comes down to this:
Review is the greatest cost factor in
eDiscovery. Expert reviewers don’t
come cheap — their expertise is clearly
of utmost value in a case. But if a
large percentage of their time is spent
reviewing documents not relevant to a
case, then you’re not getting the most
value out of them in the first place. Their
hours cost the same whether they’re
looking at the smoking gun document
in a case or something completely
unrelated. You wouldn’t wear gloves
during a palm reading — they’d just get
in the way of the psychic doing her job.
Presenting a corpus of documents to
expensive reviewers without applying
machine learning first makes no more
sense. Further, missing documents
that analytics would have surfaced
can also hinder the experts’ ability to
formulate your case strategy.
But the ROI of applying analytics to your
documents and email pre-discovery
goes even further. The cost benefits of
organizing and analyzing your content
proactively are huge, helping to drive
decision making and information
governance practices for compliance,
risk mitigation and cost avoidance.
Applying these strategies early can
provide a tremendous advantage long
term through a much more proactive and
cost-effective approach to information
governance and eDiscovery.
Steven Toole is vice president of marketing at Content Analyst Company where his unique
combination of business acumen, creativity, strategic vision and tactical execution yields impactful
results toward the company’s mission.
Discover More. Review Less.®
2301 Columbia Pike | Suite 121 | Arlington, VA 22204 | www.mindseyesolutions.com | 1.888.770.3876