
I427 Assignment 1: Analyzing text documents
Fall 2013
Due: Monday, September 30, 2013, 11:59PM
(You may submit up to 48 hours late for a 10% penalty.)
For this assignment you will build programs to analyze text documents. This is a first step towards building
programs that can analyze web documents, and eventually to building a web search engine. This assignment
is also an opportunity to gain experience writing Perl programs.
Please read this document carefully: it is filled with details meant to help you, and skipping them may
cost you time and effort on issues that are already addressed here. Also, please start this assignment
early and feel free to ask questions on the OnCourse Forum. You may work on this assignment alone
or in a partnership with another student in the class. If you work with a partner, please submit
only one copy of your source code. To help get you started on the assignment, we’ve created some skeleton
code and sample data files which you can use as a basis for this assignment. This code is available via
OnCourse.
Part 1: Spelling check
Modern word processing programs like Microsoft Word include spelling and grammar checkers to help detect
and correct typographic errors. These algorithms have become quite complex, including large dictionaries
of common errors and sophisticated natural language processing (NLP) analysis techniques.
In this problem, you will develop a Perl program to detect errors using a much simpler approach. The
program should take the name of the file to spell check as a command-line parameter. Then, the program
should scan the document for typos, alerting the user if any potential mistakes are detected. In particular,
the program should enforce the following rules:
1. Each word should contain at least one vowel.
2. The letter ‘q’ should always be followed by the letter ‘u’ (e.g. qickly is not allowed, but quickly is).
3. Capital letters should only occur at the beginning of a word (e.g. States is allowed, but STates is
not).
4. The pronoun I should always be capitalized (e.g. i ran is illegal).
5. The letter sequence cie should never appear in a word (e.g. recieved is illegal, but received is
allowed).
6. If the letter sequence ei appears in a word, the letter before it must be a c (e.g. receive is okay, but
freind is not).
7. The number of left parentheses in the document should equal the number of right parentheses.
8. If a word begins with a vowel, the preceding word must not be “a”. If a word begins with a consonant,
the preceding word must not be “an”.
9. If a word is plural, the next word must not be “is.” You can assume that a word is plural if and only
if it ends with an ‘s’.
When an error is detected, the program should print a helpful message and then continue to check the rest
of the document. Here is an example of how your program would be run from the UNIX command line:
perl spellcheck.pl myessay.txt
and here is what your program might produce:
Running spell check on myessay.txt:
- Potential problem with word: "freind"
  'ei' only allowed after the letter "c"!
- Potential problem with word: "An"
  The next word ("good") does not start with a vowel!
- Potential problem with word: "qickly"
  The word has a "q" not followed by "u"!
- Potential problem with word: "brght"
  The word does not have a vowel!
Your output format need not match this example exactly. Of course, these spelling and grammar rules are
very simplistic; you may ignore the fact that some correct English words will be flagged as incorrect by your
program.
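As a rough illustration of how a few of these checks might look, here is a minimal Perl sketch covering
rules 1, 2, and 6 using regular expressions. The tokenization, the variable names, and the choice not to
count 'y' as a vowel are assumptions made for this example, not requirements; the provided skeleton code
may organize things differently.

#!/usr/bin/perl
use strict;
use warnings;

# Minimal sketch: checks rules 1, 2, and 6 only.
my $filename = shift @ARGV or die "Usage: perl spellcheck.pl <file>\n";
open(my $fh, '<', $filename) or die "Cannot open $filename: $!\n";

print "Running spell check on $filename:\n";

while (my $line = <$fh>) {
    foreach my $word (split /\s+/, $line) {
        # Strip leading/trailing punctuation; skip empty tokens.
        (my $clean = $word) =~ s/^[^A-Za-z]+|[^A-Za-z]+$//g;
        next unless length $clean;

        # Rule 1: every word should contain at least one vowel
        # (whether 'y' counts as a vowel is a design decision left to you).
        if ($clean !~ /[aeiou]/i) {
            print "- Potential problem with word: \"$clean\"\n";
            print "  The word does not have a vowel!\n";
        }

        # Rule 2: 'q' must always be followed by 'u'.
        if ($clean =~ /q(?!u)/i) {
            print "- Potential problem with word: \"$clean\"\n";
            print "  The word has a \"q\" not followed by \"u\"!\n";
        }

        # Rule 6: 'ei' is only allowed immediately after 'c'.
        if ($clean =~ /(?<!c)ei/i) {
            print "- Potential problem with word: \"$clean\"\n";
            print "  'ei' only allowed after the letter \"c\"!\n";
        }
    }
}
close $fh;

The same pattern (tokenize, then test each word) extends to the remaining rules; note that rules 7 through 9
also require keeping some state across words, such as a running parenthesis count or the previous word.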
Part 2: Spam filtering
A large fraction of traffic on the Internet is ‘spam’ — fake or misleading emails and web pages that promote
dubious businesses, spread malware, and support other nefarious enterprises. An important feature of modern
search engines is to identify these pages so that they do not pollute search results. In this assignment, you’ll
implement a simple spam filter, using a technique called a Bayesian classifier. You’ll study Bayesian classifiers
in detail if you take a machine learning class; here, we’ll use a simple version that nonetheless works relatively
well. The basic idea is that some words (like ‘weightloss’, ‘debt’, or ‘cialis’) appear much more often in spam
than in legitimate documents. Other words appear more often in legitimate documents than in spam, and
some words (like ‘the’, ‘internet’, or ‘company’) appear roughly equally in either one.
To know which words are likely to appear in which documents, we can use a training corpus, or set of
documents that are known to be spam or not spam. Then we can count how many times each word appears
in each document to identify the words associated with spam and the ones not associated with spam. Here
is a particular way to do this. Let S(word) be the number of times that word appears in the spam training
corpus, and let N(word) be the number of times that word appears in the non-spam training corpus. Given
a new document, we can evaluate a ‘spaminess score’ according to the following formula:
Score = sum_{word in Document} [ log S(word) − log N(word) ].
In other words, one initially sets Score to 0, then loops through each word of the document, adding
log S(word) − log N(word) to Score. This score should be high for documents that are spam and low
for documents that aren’t. A search engine can then use this score to decide whether or not a document is
spam: if Score is greater than 0, it can be declared to be spam, and if less than 0 it can be declared to be
not spam. Of course, as with any kind of artificial intelligence, the program will sometimes make mistakes.
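As a concrete (and hedged) illustration, here is a minimal Perl sketch of this computation. Two details are
assumptions on top of the assignment text: words are lowercased and stripped of non-letters before counting,
and one is added to every count so that log(0) never occurs for a word missing from one of the training
files. The subroutine name count_words is a placeholder, not part of the skeleton code.

#!/usr/bin/perl
use strict;
use warnings;

# Count how many times each word appears in a file, returned as a hash ref.
sub count_words {
    my ($filename) = @_;
    my %counts;
    open(my $fh, '<', $filename) or die "Cannot open $filename: $!\n";
    while (my $line = <$fh>) {
        foreach my $word (split /\s+/, lc $line) {
            $word =~ s/[^a-z]//g;              # keep letters only (an assumption)
            $counts{$word}++ if length $word;
        }
    }
    close $fh;
    return \%counts;
}

my ($spam_file, $notspam_file, $test_file) = @ARGV;
die "Usage: perl spam.pl known_spam.txt known_notspam.txt new_document.txt\n"
    unless defined $test_file;

my $spam_counts    = count_words($spam_file);
my $notspam_counts = count_words($notspam_file);
my $test_counts    = count_words($test_file);

# Score = sum over each word occurrence of log S(word) - log N(word).
# Adding 1 to each count avoids log(0) for unseen words; multiplying by the
# number of occurrences in the test document is equivalent to summing over
# every occurrence one at a time.
my $score = 0;
foreach my $word (keys %$test_counts) {
    my $s = ($spam_counts->{$word}    || 0) + 1;
    my $n = ($notspam_counts->{$word} || 0) + 1;
    $score += $test_counts->{$word} * (log($s) - log($n));
}

print "Evaluating the spaminess of $test_file...\n";
print "Spam score: $score\n";
if ($score > 0) {
    print "Document is probably spam.\n";
} else {
    print "Document is probably not spam.\n";
}

How to handle unseen words (and how aggressively to normalize punctuation and case) is exactly the kind of
design decision the assignment leaves to you, so document whatever choice you make.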
What to do. Implement a program that analyzes a file and decides whether it is spam or not. Your spam
filter should take three filenames as command-line parameters: a document known to be spam, a document
known not to be spam, and then a document (with unknown spaminess) for which we’d like to compute
a spam score. The program should calculate and display the spam score and then say whether or not the
document is spam. For example,
[djcran@raichu i427]$ perl spam.pl known_spam.txt known_notspam.txt new_document.txt
Evaluating the spaminess of new_document.txt...
Spam score: 346.017439426476
Document is probably spam.
To help get you started, we’ve provided skeleton code and some files: a spam document (known_spam.txt), a
legitimate document (known_notspam.txt), and some test documents (test_*.txt). In your report, please
give the spam scores computed by your program for each test document. How well does your detector work?
Important hints and warnings
Getting started. In a project like this one with many components, it’s good software development practice
to break the problem into smaller pieces, each of which you can write and test independently. To help you
get started, we’ve prepared files with skeleton code, which is available via OnCourse. Download the skeleton
code and begin filling in some of the missing code (marked with comments), testing each function as you
write it. Feel free to modify the skeleton code (or even ignore it completely) if you’d like.
Design decisions. You’ll likely encounter some decisions as you write your programs that are not specified
in this assignment. Feel free to make some reasonable assumptions in these cases, or to ask on the OnCourse
forum for clarification.
Academic integrity. You and your partner may discuss the assignment with other people at a high level,
e.g. discussing general strategies to solve the problem, talking about Perl syntax and features, etc. You
may also consult printed and/or online references, including books, tutorials, etc., but you must cite these
materials in the documentation of your source code. However, the code that you (and your partner, if
working in a group) submit must be your own work, which you personally designed and wrote. You may not
share written code with any other students except your own partner, nor may you possess code written by
another student who is not your partner, either in whole or in part, regardless of format.
What to turn in
Turn in three files, via OnCourse:
1. Two Perl source code files for part 1 and part 2 of the assignment. Make sure your code is thoroughly
debugged and tested. Make sure your code is thoroughly legible, understandable, and commented.
Please use meaningful variable and function names. Cryptic or uncommented code is not acceptable.
2. A separate text file or PDF with brief documentation about your programs. Explain, at a high level,
how you implemented them, what design decisions and other assumptions you made, etc. Give credit
to any source of assistance (students with whom you discussed your assignments, instructors, books,
online sources, etc.). Also give the output details requested above (on the results of your testing on
our sample datasets).
Grading
We’ll grade based on the correctness, style, and documentation of the code. In particular: Does the code work
as expected? Does it use reasonable algorithms and data structures as discussed in class? Does it produce
reasonable results? Was it tested thoroughly? Does it check for appropriate input and fail gracefully? Is the
code legible and understandable? Does it use subroutines for clarity, reuse and encapsulation? Is the code
thoroughly and clearly commented? Is there adequate documentation in the report?
Extra credit. We’ll give extra credit to assignments that implement more sophisticated features than the
ones described here. For example, in part 1 you might add some additional spelling check rules (maybe to
catch more misspelled words, or to prevent some correct words from being flagged as incorrect) or automatically suggest how to fix some of the errors. Or you might implement some more sophisticated spam
detection features in part 2.