TAR 2.0 CASE STUDY
Is your TAR temperature 98.6?
We’re getting hot results using Catalyst’s Insight Predict platform to significantly streamline
document review.
Challenge: Find the responsive documents in an already filtered population of 2.1 million in a
secure, timely and cost-effective manner
Solution: Using DSi and Catalyst’s Insight Predict, a robust TAR 2.0 platform, the review team found
approximately 98% of the relevant documents after reviewing only about 6% of the document
population
Overview
A large financial institution alleged that it had been defrauded by a borrower. The details aren’t important to
this discussion, but assume the borrower employed a variety of creative accounting techniques to make
its financial position look better than it really was. And, as is often the case, the problems were missed
by the accounting and other financial professionals conducting due diligence. Indeed, there were strong
factual suggestions that one or more of the professionals were in on the scam.
As the fraud came to light, litigation followed. Perhaps in retaliation, or simply to mount a counteroffensive, the defendants hit the bank with lengthy document requests. After collection and best-efforts
culling, DSi was left with more than 2.1 million potentially responsive documents. Neither the deadlines
nor the budget allowed for review of that volume. Keyword search offered some help, but the problem
remained: how do you work with 2.1 million potentially responsive documents?
Process
DSi loaded the documents into Insight Predict, Catalyst’s proprietary system for Technology Assisted
Review. Predict uses an advanced form of Continuous Active Learning (CAL), which was developed by
Catalyst over the past few years. The process takes advantage of Insight’s ability to rank documents in
the review population on a continuous basis. As reviewers tag documents, the system takes into
account the new judgments and re-ranks the remaining, unseen documents. This review/train/rank
cycle repeats until the review is complete.
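The review/train/rank cycle can be sketched in a few lines of Python. This is a deliberately tiny illustration: the vocabulary, the one-weight-per-word model, the batch size of 50, and the labels are all invented for demonstration, and the sketch is not Catalyst’s actual Insight Predict ranking engine.

```python
import random

random.seed(7)

# Toy corpus: 150 "relevant" documents all share the word "fraud";
# 850 irrelevant documents use a disjoint vocabulary.
rel_words = ["ledger", "loan", "audit"]
irr_words = ["memo", "lunch", "picnic", "golf"]
docs = [frozenset({"fraud"} | set(random.sample(rel_words, 2))) for _ in range(150)]
docs += [frozenset(random.sample(irr_words, 3)) for _ in range(850)]
random.shuffle(docs)
relevant = {i for i, d in enumerate(docs) if "fraud" in d}

weights = {w: 0.0 for w in rel_words + irr_words + ["fraud"]}
reviewed, found = set(), set()

def score(d):
    return sum(weights[w] for w in d)

# Start from a single judgmental seed, e.g. a keyword-search hit.
batch = [next(i for i, d in enumerate(docs) if "fraud" in d)]
while batch:
    for i in batch:                          # reviewers tag the batch
        reviewed.add(i)
        label = 1.0 if i in relevant else -1.0
        for w in docs[i]:
            weights[w] += label              # naive online weight update
        if label > 0:
            found.add(i)
    # Re-rank every unseen document, then pull the next batch from the top.
    unseen = sorted(set(range(len(docs))) - reviewed,
                    key=lambda i: score(docs[i]), reverse=True)
    batch = unseen[:50]
    if batch and score(docs[batch[0]]) <= 0:
        break                                # relevance has petered out

recall = len(found) / len(relevant)
```

Because every reviewed batch updates the weights before the next ranking, the loop finds all 150 relevant documents after reviewing only 151 of the 1,000 — the same prioritization effect, in miniature, that the case study describes.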
Catalyst’s Continuous Active Learning process often starts with documents found through keyword
search or other methods of locating relevant documents, such as witness interviews, key custodian
review and the like. These are fed into the system as seeds for an initial ranking to get the process
started. They can also be added later to aid in training the algorithm. No matter how documents are
found or where they are coded, those additional judgments can be fed back into Predict as judgmental
seeds/training documents.
CONTINUOUS ACTIVE LEARNING
There are two aspects to continuous active learning. The first is that the process is “continuous”:
training doesn’t stop until the review finishes. The second is that the training is “active”: the computer
feeds documents to the review team with the goal of making the review as efficient as possible,
minimizing the total cost of review. Because training and review are part of the same process, there is
no requirement that a separate subject matter expert review 3,000 or so documents as “training” in
advance of the review. The system also includes contextual diversity samples to combat bias, and the
QC algorithm continues to learn as the review progresses.
CAL also allows rolling document collections, which occurred in this case (and are common in most
cases). Since the system’s training is not based on a separate control set, but instead on measuring
the fluctuation in ranking across all the files, newly collected documents can be added on the fly. They
are immediately ranked, and any new subject matter introduced by the new collections is identified for
review by Catalyst’s contextual diversity algorithm.
Results: 98% Recall; Only 6% Reviewed
Our CAL protocol allowed the review team to find and review approximately 136,000 documents out of
the total population of approximately 2.1 million. Of the documents reviewed, the team marked 23,950
as relevant. A systematic sample of just under 6,000 documents confirmed that the team had found
approximately 98% of the relevant documents in the collection.
We can illustrate these results through a yield curve, which is drawn from the systematic sample taken
at the end of the review.
Yield Curve from Systematic Sample Taken at the End of the Review
Yield curves are relatively easy to understand. The X-axis shows the number of documents actually
reviewed by the team as a percentage of the total documents. The Y-axis shows the percentage of
relevant/responsive documents found as the review progressed. The red line shows the expected
outcome of a linear review where documents would be presented randomly. The blue line shows the
progress of the review team in finding relevant documents.
In total, the team reviewed approximately 6% of the document population and found 98% of the
relevant files.
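The yield-curve construction described above is straightforward to express in code. A minimal sketch, using a hypothetical 100-document review order invented for illustration:

```python
def yield_curve(review_order, relevant):
    """Return (fraction of docs reviewed, fraction of relevant found)
    points, one per document, in the order the team reviewed them."""
    total_rel = len(relevant)
    points, found = [], 0
    for n, doc_id in enumerate(review_order, start=1):
        if doc_id in relevant:
            found += 1
        points.append((n / len(review_order), found / total_rel))
    return points

# Hypothetical example: 100 documents, with the 5 relevant ones reviewed
# first, as a prioritized (CAL-style) review aims to do.
curve = yield_curve(list(range(100)), relevant={0, 1, 2, 3, 4})
```

In this toy ordering, `curve[4]` is `(0.05, 1.0)`: 100% of the relevant documents found after reviewing 5% of the population. A linear review over a random order would instead track the diagonal, the red line in the figure.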
Seed Sets
TAR 1.0 products require that a senior attorney, often called a Subject Matter Expert (SME), conduct
initial training before review can begin. Training is iterative: the SME works through a series of training
rounds before the process is complete. It is also a one-time process: once training concludes, that is it.
The review team jumps in to look at documents, but there is no easy mechanism to return their
judgments to the algorithm to make it smarter. One-time training means a one-time ranking.
Catalyst’s Insight Predict is built using a TAR 2.0 engine, which allows but does not require that an
SME do initial training. It encourages the use of review teams for training and the use of senior
attorneys to find relevant documents using keyword search, witness interviews and any other means at
their disposal.
In this case, the senior attorneys used Insight’s powerful search tools to find initial seeds for training.
They were also able to use relevant documents from an earlier production as examples of positive
seeds. Predict allows these judgment seeds to be added at any stage of the process.
Prioritized Review
After the initial ranking based on keyword and tagged seed documents, the review team began
reviewing batches containing a mixture of highly ranked documents, plus a smaller number of
exploratory documents chosen by the system through a “contextual diversity” algorithm.
The purpose of the contextual diversity process is to find documents that are markedly different from
the ones already reviewed. The platform’s proprietary algorithm identifies the most diverse sets of
documents, pulls a representative document from each and presents it to the reviewer as part of the
review batch. If the reviewer tags it as relevant, Predict uses this new information to promote other
similar documents for review.
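One way to picture this kind of selection is greedy farthest-point sampling: repeatedly pick the unreviewed document least similar to everything seen so far. This is only a sketch of the general idea; the word-set representation, Jaccard distance, and greedy strategy are all illustrative assumptions, not the platform’s proprietary algorithm.

```python
def jaccard_dist(a, b):
    """1 - |intersection| / |union|: 0 for identical word sets, 1 for disjoint."""
    return 1 - len(a & b) / len(a | b)

def diversity_picks(unreviewed, reviewed, k=2):
    """Greedily pick the k unreviewed docs farthest from everything seen."""
    seen, picks, pool = list(reviewed), [], list(unreviewed)
    for _ in range(min(k, len(pool))):
        best = max(pool, key=lambda d: min(jaccard_dist(d, s) for s in seen))
        picks.append(best)
        seen.append(best)
        pool.remove(best)
    return picks

# Hypothetical documents as word sets: the pool contains one doc similar
# to what reviewers have seen and one from a genuinely new topic.
reviewed = [frozenset({"loan", "audit"}), frozenset({"loan", "ledger"})]
pool = [frozenset({"loan", "memo"}), frozenset({"picnic", "lunch"})]
picks = diversity_picks(pool, reviewed, k=1)
```

The pick is the disjoint `{"picnic", "lunch"}` document: the new topic surfaces for review even though nothing like it has been tagged relevant yet, which is exactly the bias-fighting role the text describes.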
Through sampling, one in 100 documents was estimated to be relevant, indicating a richness of 1%.
As the CAL review progressed and the training took hold, reviewers received higher volumes of
relevant documents, reflecting CAL’s objective to move relevant documents to the top of the order. We
saw relevance rates between 10% and 25% (occasionally 35%), which represented a large increase
over what could be expected from a linear review.
Ultimately, the review team continued the review process until the percentage of relevant documents in
their batches petered out. Then, a systematic sample was conducted to determine the success at
finding relevant documents.
Validation
We built a yield curve based on a systematic sample of approximately 6,000 documents. We then
focused on 5,354 sample documents that had not been reviewed and which came from the
approximately 1.8 million documents left in the discard pile (i.e. below the cutoff). The purpose was to
confirm that we were not leaving too many relevant documents in the discard pile and to calculate
recall.
Out of the 5,354 not-reviewed samples, the attorneys found only one document that they tagged as
relevant. Using a binomial calculator, we can determine a point estimate for richness in the discard
pile, along with a confidence interval around that point estimate. Here are the figures we obtained:
With a point estimate of 0.02%, we estimate there could be 371 relevant documents in the discard pile
of 1,852,589. Using the upper bound of the confidence interval (0.0010, or 0.10%) to calculate a worstcase scenario, we estimate that there could be as many as 1,853 relevant documents in the discard
pile. Note that we are using a confidence level of 95%, which is an industry standard.
As noted earlier, 23,950 documents were found relevant. Using the point estimate, the team found
approximately 98% of the relevant documents (23,950 out of 24,321). Using the upper bound of the
confidence interval, the team found at least 93% of the relevant documents (23,950 out of 25,803).
Both figures are markedly higher than the recall values approved by the courts, which are closer
to 75%.
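The arithmetic behind these estimates can be reproduced with an exact (Clopper-Pearson) binomial bound, computed here by simple bisection rather than with a commercial binomial calculator:

```python
def binom_cdf_le1(p, n):
    """P(X <= 1) for X ~ Binomial(n, p): the chance of seeing at most one
    relevant document in a sample of n when the true richness is p."""
    return (1 - p) ** n + n * p * (1 - p) ** (n - 1)

def clopper_pearson_upper(n, alpha=0.05):
    """Upper bound of the 95% CI on richness after observing 1 in n."""
    lo, hi = 0.0, 1.0
    for _ in range(60):                  # bisection to ample precision
        mid = (lo + hi) / 2
        if binom_cdf_le1(mid, n) > alpha / 2:
            lo = mid
        else:
            hi = mid
    return hi

SAMPLE, DISCARD, FOUND = 5354, 1852589, 23950

point = 1 / SAMPLE                       # ~0.019%, rounded to 0.02% above
upper = clopper_pearson_upper(SAMPLE)    # ~0.0010, matching the text
recall_point = FOUND / (FOUND + point * DISCARD)   # ~98%
recall_worst = FOUND / (FOUND + upper * DISCARD)   # ~93%
```

Multiplying the rounded 0.02% point estimate by the discard pile gives the 371-document figure quoted above; the unrounded 1/5,354 gives about 346. The difference does not change the conclusion either way.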
TAR at 98.6? Pretty Hot
The protocol allowed the review team to find and review approximately 136,000 documents out of the
total population of approximately 2.1 million. Of the documents reviewed, the team marked 23,950 as
relevant. The team found approximately 98% of the relevant documents in the collection after viewing
only about 6% of the document population – skipping the review of over 1.8 million documents. That’s a
pretty hot result.
About DSi
Serving law firms and corporate legal departments worldwide since 1999, DSi (formerly Document Solutions, Inc.) is a
litigation support services company that provides advanced eDiscovery and digital forensics services. Through five core
business processes—DSicollect, DSintake, DSinsight, DSireview, DSisupport—DSi’s highly trained staff will help you
harness today’s most forward technology to gain a competitive advantage. DSi is headquartered in Nashville, Tenn.
with offices in Knoxville, Tenn., Cincinnati, Ohio, Charlotte, N.C., Minneapolis, Minn., Atlanta, Ga. and Washington D.C.
For more information, please visit DSi at www.dsicovery.com or follow us on Twitter at: @DSicovery.
About Catalyst
Catalyst designs, hosts and services the world’s fastest and most powerful document repositories for large-scale
discovery and regulatory compliance. For more than 15 years, corporations and their counsel have relied on Catalyst to
help reduce litigation costs and take control of complex legal matters. Catalyst provides secure, scalable multi-language
document repositories specifically built to manage Big Discovery. Through Catalyst Insight, its next-generation e-discovery platform, and Insight Predict, its advanced technology-assisted review tool, Catalyst enables corporations to
reduce the cost and risk of discovery, achieve greater control and predictability in workflows, and gain greater visibility
and accountability across all their matters. To learn more about Catalyst, visit catalystsecure.com or follow the company
on Twitter at @CatalystSecure.
877-797-4771 • DSicovery.com