Data Mining Approach to the Discovery of Responsive Documents
Nicole Pennington, Christopher Briere, Wei Wu
CS 526 Data Mining, 10.29.2012

I. Problem Description
The objective of this e-discovery challenge was to apply data mining techniques to classify legal documents in a fictional lawsuit based on document requests made by UT law students. Our group represented the plaintiff, Boone Radley, and our task was to determine which of his documents were responsive to the requests made by the defendant, Charlotte Harris.

II. Approach
Our approach consisted of four steps:
A. Convert pst Files & Filter Emails
B. Keyword Generation
C. TMG Indexing & Classification
D. Refine Keywords

C. Text to Matrix Generator (TMG) Indexing (Term-Document Matrix Generation)
• A term-document matrix describes the frequency of terms occurring in a collection of documents. In our matrix, rows correspond to terms and columns correspond to documents, so each column is a document vector.
• A stop list contains words excluded from the matrix; usually these are common words.
• Because the semantic content of a document is generally determined by the relative frequencies of its terms, the elements of the term-document matrix are scaled so that the Euclidean norm of each column is 1.
• We used inverse document frequency (IDF) as a global weighting scheme. This ensures that a term contained in only a few documents gets a higher relevance boost than a term contained in many documents.

Figure: Keyword Refinement Process

Query Matching in the Vector Space Model
• To retrieve documents matching a given keyword, the frequency of that keyword in the query specifies the value of the corresponding nonzero entry in the query vector.
• Query matching in the vector space model can then be viewed as a search in the column space of matrix A for the documents most similar to the query.
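As a rough illustration of the indexing and query-matching scheme described above (not the actual TMG toolkit, which runs in MATLAB), a term-document matrix with IDF weighting and unit-norm columns might be built as in the following sketch; the function and variable names are our own:

```python
import math
from collections import Counter

def build_term_document_matrix(docs, stoplist):
    """Build an IDF-weighted term-document matrix A whose rows are terms
    and whose columns are documents, each column scaled to unit norm."""
    tokenized = [[w for w in d.lower().split() if w not in stoplist] for d in docs]
    counts = [Counter(doc) for doc in tokenized]
    terms = sorted({w for doc in tokenized for w in doc})
    n = len(docs)
    # df[t] = number of documents containing term t (used for the IDF weight)
    df = {t: sum(1 for c in counts if t in c) for t in terms}
    A = [[c[t] * math.log(n / df[t]) for c in counts] for t in terms]
    # Scale each column (document vector) to Euclidean norm 1
    for j in range(n):
        norm = math.sqrt(sum(A[i][j] ** 2 for i in range(len(terms))))
        if norm:
            for i in range(len(terms)):
                A[i][j] /= norm
    return terms, A

def cosine_scores(terms, A, query_words):
    """Cosine of the angle between the query vector and each column of A."""
    q = [1.0 if t in query_words else 0.0 for t in terms]
    qnorm = math.sqrt(sum(x * x for x in q)) or 1.0
    return [sum(A[i][j] * q[i] for i in range(len(terms))) / qnorm
            for j in range(len(A[0]))]
```

Documents whose cosine score exceeds a chosen threshold would then be flagged as candidates for the production request.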
• The similarity measure used for query matching in this project is the cosine of the angle between the query vector and each document vector (each column of matrix A).
• Documents whose corresponding document vectors produce cosines greater than some threshold value are judged relevant to the query.

III. Results
Sample Request and Production: "Produce all documents relating to any cards or invitations sold or produced by Radley or Simply Stated after September 1, 2010, in Georgia, North Carolina, or Tennessee, other than those sold or produced on behalf of Sassy Sentiments."

Our approach, as outlined above, involves a good deal of preprocessing and iterative searching. The next few sections describe each step and give background where necessary. Ultimately, our approach was a sophisticated keyword search operating on a term-document matrix, scoring documents by their cosine similarity to a query vector. We experimented with the use of SVD, but its results were not as accurate.

Latent Semantic Indexing via SVD
• SVD can reduce computation, but as with any dimensionality reduction, it can also reduce accuracy. Assume A is decomposed as A = UΣV^T; keeping only the k largest singular values gives the rank-k approximation A_k = U_k Σ_k V_k^T. The error in approximating A by A_k is ||A − A_k||_F = sqrt(σ_{k+1}^2 + … + σ_r^2), where σ_i are the singular values of A and r is its rank.
• Since the discarded singular values here are very small, the approximation error can be neglected.
• The cosine of the angle between the query vector q and the j-th document vector a_j is cos θ_j = (a_j^T q) / (||a_j||_2 ||q||_2), where a_j is the j-th column of A (or of A_k when using the reduced model).

A. Preprocessing: Convert pst Files & Filter Emails
• Of the 2,635 files provided by the law professor, 2,613 were emails in Outlook pst format.
• After several attempts to convert the emails to text files, we settled on software called OutlookConverter, which removed unnecessary email header information.
• Next, we filtered the emails into separate folders based on the date ranges in the document production requests.
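The date-range filtering step might look like the following minimal sketch; the folder layout, the `*.txt` filename pattern, and the assumption that the converted text files retain a "Date:" header are ours, not part of the actual pipeline:

```python
import email.utils
import shutil
from datetime import datetime
from pathlib import Path

def filter_by_date_range(email_dir, out_dir, start, end):
    """Copy converted plain-text emails whose Date: header falls inside
    [start, end] into a per-request folder (hypothetical layout)."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    kept = []
    for path in sorted(Path(email_dir).glob("*.txt")):
        for line in path.read_text(errors="ignore").splitlines():
            if line.lower().startswith("date:"):
                try:
                    sent = email.utils.parsedate_to_datetime(line[5:].strip())
                except (TypeError, ValueError):
                    break  # unparseable date header; skip this file
                if start <= sent.replace(tzinfo=None) <= end:
                    shutil.copy(path, out / path.name)
                    kept.append(path.name)
                break  # only examine the first Date: header
    return kept
```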
This ensured that the results returned for each request would be within the proper date range.
• With so few files that were not emails, we decided to classify them by inspection, since converting all of them to plain text would have taken more time than it was worth.

B. Keyword Generation
• We read through some of the emails and the metadata Word documents to find candidates for keywords and the stop list. A few important finds for these lists were:
• Stop list: Added "Law Pre-Trial Litigation", the email account suffix for these fictional characters, as well as common email header terms like "Subject" and "Original Message".
• Keywords: We noticed "Pickle" was a common nickname for Harris. "Radley" was a poor keyword because there was also a Nathan Radley.

SVD Results
• After setting the dimension of Σ to 50 and then 100, we found the results were worse than the plain vector space model. A larger dimension may reduce the approximation error; however, since the document collection is not large, SVD is not needed to reduce computation. We therefore decided not to use the SVD results.

D. Refine Keywords
Why Is This Necessary?
• The initial keywords were generated from the problem description, common sense, or educated guesses about a specific request. The results from this approach were inaccurate because we did not know the content of the document set. To improve our results, we iteratively refined our keywords based on the results returned by the initial keywords.
How to Refine?
• Some queries were too general and could be split into several smaller queries. For example, in the query "Produce all documents and communications that relate to Simply Stated's customers," "all customers" is too general to be represented by keywords, so we listed the customers and queried each one individually rather than querying them all together.
Another reason to split queries is that the products sold to each customer were different, so each customer's products required different keywords.

IV. Credits
Nicole – Related the project assignment to the group and led discussion of project goals; organized meetings and task assignments; extracted mail files; explored the features of the TMG MATLAB toolkit; handled Requests for Production #8–10; edited, formatted, and aggregated changes to the poster; wrote the Project Description and some of the Approach section of the poster.
Wei – Handled Requests for Production #1–3; contacted a law student for keyword suggestions; wrote up details of the classifier design and the conclusion for the poster; explored the features of the TMG MATLAB toolkit.
Chris – Handled Requests for Production #4–7; wrote a Python script to insert our results into the given Excel format; contacted a law student about our results; created the Results section of the poster.

Figure: Color map of the term-document matrix (left) and a reduced version using 100 dimensions (right), achieved via SVD.

V. Conclusion
In this project, we successfully applied data mining techniques, specifically a vector space model, to discover responsive documents in a simulated lawsuit provided by the UT law school. The power of a vector space model is that it enables us to search the document space by keywords in a way that differs from the integrated search function in a program like Outlook. Another significant advantage of a vector space model is its ability to merge similar words, such as "babe" and "baby," into a single term, which saves computation and reduces human error. The methodology we used proved fairly successful, but it would require further modification before it could be used in actual production: the documents returned were not always relevant, and there is still no adequate way to judge whether a document is privileged. Future work could take two directions.
First, filtering of forwarded and replied emails could be added, since most emails contain previous emails in their text. These quoted messages can interfere with the results because TMG can mistakenly return an email based on keywords that appear only in the earlier messages it quotes. The catch-22 is that the same information could also be relevant to classification, though it arguably does not deserve the same weight. Second, a method to determine whether or not a document is privileged should be developed.
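The first direction could start from a simple heuristic like the sketch below; the quote markers are common email conventions (including the "Original Message" header we already added to our stop list), not an exhaustive list:

```python
import re

# Common markers that introduce quoted text in replies and forwards
# (heuristic patterns; real emails vary widely)
QUOTE_MARKERS = [
    re.compile(r"^-+\s*Original Message\s*-+", re.IGNORECASE),
    re.compile(r"^On .+ wrote:$"),
    re.compile(r"^>"),  # '>'-prefixed quoted lines
    re.compile(r"^-+\s*Forwarded message", re.IGNORECASE),
]

def strip_quoted_text(body):
    """Keep only the new text of an email, dropping everything from the
    first quoted-reply marker onward (a rough sketch, not a complete
    solution to the weighting question raised above)."""
    kept = []
    for line in body.splitlines():
        if any(p.match(line.strip()) for p in QUOTE_MARKERS):
            break
        kept.append(line)
    return "\n".join(kept).rstrip()
```

Running the converted emails through such a filter before TMG indexing would keep keywords in quoted history from triggering false matches, at the cost of discarding context that a weighted scheme might instead down-weight.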