Data Mining Approach to the Discovery of Responsive Documents
Nicole Pennington
Christopher Briere
Wei Wu
CS 526 Data Mining
10.29.2012
I. Problem Description
The objective of this e-discovery challenge was to apply data mining techniques to classify legal documents in a fictional lawsuit based on document requests made by UT law students. Our group represented the plaintiff, Boone Radley, and our task was to determine which of his documents were responsive to the requests made by the defendant, Charlotte Harris.
II. Approach
Our approach followed four steps:
A. Convert pst Files & Filter Emails
B. Keyword Generation
C. Use TMG for Classification
D. Analyze Results & Refine Keywords If Necessary
C. Text to Matrix Generator (TMG)
Indexing (Term-Document Matrix Generation)
• A term-document matrix is a mathematical matrix that describes the frequency of terms occurring in a collection of documents. Its rows correspond to terms and its columns correspond to documents in the collection.
• The stop list contains words excluded from the matrix; usually these are common words.
• As the semantic content of a document is generally determined by the relative
frequencies of terms, the elements of the term-document matrix are sometimes
scaled so that the Euclidean norm of each column is 1.
• We used inverse document frequency (IDF) as a global weighting scheme. This ensures that when only a few documents contain a term, those documents get a larger relevance boost than when many documents contain the term.
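As a concrete illustration, the indexing step above can be sketched in pure Python (TMG itself is a MATLAB toolkit; the function name and example documents below are our own, not part of the project):

```python
import math

def term_document_matrix(docs, stoplist):
    """Build an IDF-weighted term-document matrix.
    Rows correspond to terms, columns to documents."""
    tokenized = [[w for w in d.lower().split() if w not in stoplist]
                 for d in docs]
    terms = sorted({w for doc in tokenized for w in doc})
    n_docs = len(docs)
    # document frequency: number of documents containing each term
    df = {t: sum(1 for doc in tokenized if t in doc) for t in terms}
    A = [[0.0] * n_docs for _ in terms]
    for j, doc in enumerate(tokenized):
        for i, t in enumerate(terms):
            tf = doc.count(t)
            if tf:
                # IDF boosts terms that appear in few documents
                A[i][j] = tf * math.log(n_docs / df[t])
    # scale each column to unit Euclidean norm
    for j in range(n_docs):
        norm = math.sqrt(sum(A[i][j] ** 2 for i in range(len(terms))))
        if norm:
            for i in range(len(terms)):
                A[i][j] /= norm
    return terms, A
```

The column normalization here implements the unit Euclidean norm scaling described above.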
Figure: Keyword Refinement Process
Query Matching in Vector Space Model
• To retrieve documents matching a given keyword, the frequency of that term in the query specifies the value of the corresponding nonzero entry in the query vector.
• Query matching in the vector space model can be viewed as a search in the
column space of matrix A for the documents most similar to the query.
• The similarity measure used for query matching in this project is the cosine of the angle between the query vector and each document vector (the columns of matrix A).
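A minimal sketch of this query-matching step, assuming a term-document matrix stored as a list of rows (the names are illustrative, not TMG's API):

```python
import math

def cosine_scores(terms, A, query_terms):
    """Cosine of the angle between the query vector and each
    document column of the term-document matrix A."""
    q = [1.0 if t in query_terms else 0.0 for t in terms]
    qnorm = math.sqrt(sum(x * x for x in q))
    scores = []
    n_docs = len(A[0])
    for j in range(n_docs):
        col = [A[i][j] for i in range(len(terms))]
        dot = sum(qi * ci for qi, ci in zip(q, col))
        cnorm = math.sqrt(sum(c * c for c in col))
        scores.append(dot / (qnorm * cnorm) if qnorm and cnorm else 0.0)
    return scores
```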
III. Results
Sample Request and Production
Produce all documents relating to any cards or invitations sold or produced by
Radley or Simply Stated after September 1, 2010, in Georgia, North Carolina, or
Tennessee, other than those sold or produced on behalf of Sassy Sentiments.
• Use TMG for Classification
• Documents whose corresponding document vectors produce cosines greater
than some threshold value are judged as relevant to the query.
• Analyze Results & Refine Keywords If Necessary
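The thresholding step can be sketched as follows; the cutoff value shown is illustrative, not the one used in the project:

```python
def responsive(scores, threshold=0.2):
    """Indices of documents judged relevant to the query:
    those whose cosine score exceeds the threshold."""
    return [j for j, s in enumerate(scores) if s > threshold]
```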
Our approach, as outlined above, involves a good deal of preprocessing and
iterative searching. The next few sections describe each step and give background
where necessary. Ultimately, our approach was a sophisticated keyword search over a term-document matrix, ranking documents by their cosine similarity to a query vector. We experimented with SVD, but its results were not as accurate.
Latent Semantic Indexing via SVD
• SVD can reduce computation, but as with any dimensionality reduction, it can also reduce accuracy. Assume A is decomposed as A = UΣVᵀ; truncating to the k largest singular values gives the rank-k approximation Ak = UkΣkVkᵀ. The error in approximating A by Ak is determined by the discarded singular values:
  ||A − Ak||F = sqrt(σk+1² + … + σr²)
• When those discarded singular values are very small, the error is negligible.
• The cosine of the angle between the query vector q and the j-th document vector of Ak can be represented by:
  cos θj = (Ak ej)ᵀ q / (||Ak ej|| ||q||)
where ej is the j-th standard basis vector, so Ak ej is the j-th column of Ak.
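A short sketch of the truncated SVD, using NumPy's `numpy.linalg.svd` (the matrix here is a toy example, not our term-document matrix):

```python
import numpy as np

def rank_k_approx(A, k):
    """Truncated SVD: keep the k largest singular values.
    Returns A_k and the Frobenius-norm error, i.e. the square root
    of the sum of the squared discarded singular values."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    Ak = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]
    err = np.sqrt(np.sum(s[k:] ** 2))
    return Ak, err
```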
A. Preprocessing
Convert pst Files & Filter Emails
• Of the 2,635 files provided by the law professor, 2,613 were emails in Outlook
pst format.
• After several attempts to convert the emails to text files, we settled on the use of
software called OutlookConverter. This program removed unnecessary email
header information.
• Next, we filtered the emails into separate folders based on the date range in the
document production requests. This ensured that the results returned for each
request would be within the proper date range.
• Since so few files were not emails, we classified those by inspection rather than taking the time to convert them all to plain text.
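The date filtering might be sketched as follows, assuming the converted emails retain an RFC 2822 `Date:` header (the function name and message are hypothetical):

```python
from datetime import date
from email.utils import parsedate_to_datetime

def in_range(raw_email, start, end):
    """True if the email's Date header falls inside [start, end]."""
    for line in raw_email.splitlines():
        if line.lower().startswith("date:"):
            d = parsedate_to_datetime(line[5:].strip()).date()
            return start <= d <= end
    # no Date header: leave the email out of this request's folder
    return False
```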
B. Keyword Generation
• We read through some of the emails and the meta-data word documents to find
candidates for keywords and the stop list. A few important finds for these lists
were:
• Stoplist: Added "Law Pre-Trial Litigation", the email account suffix for these fictional characters, as well as common email header terms like "Subject" and "Original Message".
• Keywords: We noticed “Pickle” was a common nickname for Harris.
“Radley” was a poor keyword because there was also a Nathan Radley.
• After trying Σ dimensions of 50 and 100, we found the results were worse than those of the plain vector space model. A larger dimension of Σ would reduce the error; however, since the document collection is not large, SVD is not needed to reduce computation. We therefore decided not to use the SVD results.
D. Refine Keywords
Why is this Necessary?
• The initial keywords were generated from the problem description, common sense, or educated guesses about a specific request. However, the results from this approach were inaccurate since we did not know the contents of the document set. To improve the results, we iteratively refined our keywords based on the results returned by the initial ones.
How to Refine?
• Some queries were too general and could be split into several smaller queries. For example, in the query "Produce all documents and communications that relate to Simply Stated's customers," the phrase "all customers" is general and cannot be represented by keywords. We therefore listed the customers and queried each one separately rather than querying them together. This also mattered because the products sold to each customer differed, so each customer's query needed its own product keywords.
IV. Credits
Nicole – Relayed the project assignment to the group and led discussion of project goals; organized meetings and task assignments; extracted mail files; explored the features of the TMG MATLAB toolkit; handled Requests for Production #8-10; edited, formatted, and aggregated changes to the poster; wrote the Problem Description and some of the Approach section of the poster.
Wei – Handled Requests for Production #1-3; contacted a law student for keyword suggestions; wrote up details of the classifier design and the conclusion for the poster; explored the features of the TMG MATLAB toolkit.
Chris – Handled Requests for Production #4-7; wrote a Python script to insert our results into the given Excel format; contacted a law student about our results; created the Results section of the poster.
Figure: Color map of the term-document matrix (left) and a reduced version using 100 dimensions (right) achieved via SVD.
V. Conclusion
In this project, we successfully applied data-mining techniques, specifically a vector space model, to discover responsive documents in a simulated lawsuit provided by the UT law school. The power of a vector space model is that it enables us to search the document space by keywords in a way that the integrated search function of a program like Outlook cannot. Another significant advantage of a vector space model is its ability to merge similar words, such as "babe" and "baby", into a single term, which saves computation and reduces human error. The methodology proved fairly successful, but it would require further modification before it could be used for actual production: some of the documents returned were not relevant, and there is still no adequate way to judge whether a document is privileged.
Future work could take two directions. First, filtering of forwarded and replied emails could be added, since most emails contain previous emails in their text. This can interfere with the results because TMG can mistakenly return emails based on keywords in the quoted text. The catch-22 is that the same information can also be relevant to classification, though it arguably does not deserve the same weight. Second, a method to measure whether or not a document is privileged should be developed.
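The first direction, stripping quoted reply and forward text before indexing, might look like the following sketch; the patterns are common email conventions, not an exhaustive set:

```python
import re

def strip_quoted(body):
    """Remove quoted reply/forward text so only the new message is indexed."""
    lines = []
    for line in body.splitlines():
        # stop at a forwarded/replied block
        if re.match(r"\s*-+ ?Original Message ?-+", line) or \
           re.match(r"\s*On .* wrote:", line):
            break
        # drop '>'-quoted lines
        if line.lstrip().startswith(">"):
            continue
        lines.append(line)
    return "\n".join(lines).strip()
```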