
P2P INSPECTOR GADGET PROJECT
Algorithm Description Document
PortAuthority® Technologies & AMOS team
Ben Gurion University – Software Engineering department
Table of Contents
1 ABSTRACT
2 ALGORITHM HIGH LEVEL EXPLANATION
  2.1 THE LEARNING STAGE
  2.2 THE CLASSIFICATION STAGE – DONE FOR EACH DOCUMENT WE ANALYZE
    2.2.1 Build Probabilities
    2.2.2 Combine Probabilities
3 ALGORITHM DETAILED
  3.1 THE LEARNING STAGE
  3.2 GENERATING WORD PROBABILITIES
  3.3 DEALING WITH RARE WORDS
  3.4 COMBINING THE PROBABILITIES
4 CLASSIFICATION SYSTEM DESIGN
5 APPENDIX A - BAYESIAN FILTERING
6 RESOURCES
1 ABSTRACT
This document describes the categorization algorithm used in P2P Inspector Gadget®.
The algorithm is derived from Bayes' probability theorem and from the “SpamBayes”
project, which uses this algorithm to separate spam mail from wanted mail.
2 ALGORITHM HIGH LEVEL EXPLANATION
The Bayesian filtering algorithm has two main stages of work:
2.1 THE LEARNING STAGE
In the learning stage we use two databases, one of classified files and the other of
non-classified files, in order to teach the system which words in a document imply that it
belongs to a certain group (either classified or non-classified).
We do this by parsing the databases and counting, for every word in the databases:
- In how many classified documents it appears.
- In how many non-classified documents it appears.
2.2 THE CLASSIFICATION STAGE – DONE FOR EACH DOCUMENT WE ANALYZE.
This is the “running” state of the algorithm, when we want to analyze an unknown file and
classify it into either the classified or the non-classified group.
This is done in two phases:
2.2.1 BUILD PROBABILITIES
In this phase we use the databases built at the learning stage to find, for each word in the
document, the probability that the document is classified given that this word appears in it.
2.2.2 COMBINE PROBABILITIES
In this phase we combine the probabilities from the previous phase into a total grade for
the document that reflects our confidence that the file is classified.
3 ALGORITHM DETAILED
3.1 THE LEARNING STAGE
The learning stage is fairly straightforward: we keep three hash tables of [Key, Value] pairs:
Classified hash table: [word in classified files, number of classified files it appears in]
Non-classified hash table: [word in non-classified files, number of non-classified files it appears in]
Corpus: [word in any document, number of documents in the database it appears in]
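As a rough illustration (a minimal sketch only, with hypothetical class and method names, not the project's actual classes), the counting step of the learning stage could look like this in Java:

import java.util.HashMap;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch of the learning stage: for every word, count in how many
// classified / non-classified files it appears (document frequency, not term frequency).
public class LearningStageSketch {
    private final Map<String, Integer> classifiedCounts = new HashMap<>();
    private final Map<String, Integer> nonClassifiedCounts = new HashMap<>();
    private final Map<String, Integer> corpusCounts = new HashMap<>();

    // wordsInFile holds the distinct words of one training file,
    // so each file adds at most 1 to each word's count.
    public void learnFile(Set<String> wordsInFile, boolean isClassified) {
        for (String word : wordsInFile) {
            corpusCounts.merge(word, 1, Integer::sum);
            if (isClassified) {
                classifiedCounts.merge(word, 1, Integer::sum);
            } else {
                nonClassifiedCounts.merge(word, 1, Integer::sum);
            }
        }
    }
}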
3.2 GENERATING WORD PROBABILITIES:
For each word that appears in the corpus, we calculate:
b(w) = (the number of classified files containing the word w) / (the total number of
classified files).
g(w) = (the number of unclassified files containing the word w) / (the total number of
unclassified files).
p(w) = b(w) / (b(w) + g(w))
p(w) can be roughly interpreted as the probability that a randomly chosen file containing
the word w will be a classified file. This information is the basis for further calculations to
determine whether the file is classified or not.
However, there is one notable wrinkle. This approach relies on the number of classified
files and unclassified files learned from being about the same.
We used roughly twice as many unclassified files, so in order to “straighten” that wrinkle
we doubled the classified-side probability, b(w), when computing p(w).
This has worked fairly well.
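For illustration only, this per-word calculation, including the correction described above (reading "doubled the probability for the classified words" as doubling b(w) is an assumption), might be sketched in Java as:

// Sketch: p(w) from document-frequency counts. b(w) is doubled to compensate for the
// training set containing about twice as many unclassified files (assumed reading).
public class WordProbabilitySketch {
    public static double wordProbability(int classifiedFilesWithWord, int totalClassifiedFiles,
                                         int unclassifiedFilesWithWord, int totalUnclassifiedFiles) {
        double b = 2.0 * classifiedFilesWithWord / totalClassifiedFiles;
        double g = (double) unclassifiedFilesWithWord / totalUnclassifiedFiles;
        if (b + g == 0.0) {
            return 0.5; // word seen in neither group: fall back to a neutral value
        }
        return b / (b + g);
    }
}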
3.3 DEALING WITH RARE WORDS
There is a problem with probabilities calculated as above when words are very rare. For
instance, if a word appears in exactly one file and that file is classified, the calculated p(w)
is 1.0. But clearly it is not absolutely certain that every future file containing that word will
be classified. In fact, we simply don't have enough data to know the real probability.
Bayesian statistics gives us a powerful technique for dealing with such cases. This branch
of statistics is based on the idea that we are interested in a person's degree of belief in a
particular event.
That is the Bayesian probability of the event.
When exactly one file contains a particular word and that file is classified, our degree of
belief that the next time we see that word it will be in a classified file is not 100%. That's
because we also have our own background information that guides us. We know from
experience that virtually any word can appear in either a classified or a non-classified
context, and that one or a handful of data points are not enough to be completely certain
that we know the real probability. The Bayesian approach lets us combine our general background
information with the data we have collected for a word in such a way that both aspects are
given their proper importance. In this way, we determine an appropriate degree of belief
about whether, when we see the word again, it will be in a classified file.
We calculate this degree of belief, f(w), as follows:
f(w) = (s*x + n*p(w)) / (s + n)
where:
s is the strength we want to give to our background information.
x is our assumed probability, based on our general background information, that a word
we don't have any other experience of will first appear in a classified file.
n is the number of files we have received that contain word w.
This gives us the convenient use of x to represent our assumed probability from
background information and s as the strength we will give that assumption. In practice, the
values for s and x were found through testing to optimize performance.
We will use f(w) rather than p(w) in our calculations so that we are working with
reasonable probabilities rather than the unrealistically extreme values that can often occur
when we don't have enough data. This formula gives us a simple way of handling the case
of no data at all; in that case, f(w) is exactly our assumed probability based on background
information.
In testing, replacing p(w) with f(w) in all calculations where p(w) would otherwise be used
has uniformly resulted in more reliable classification.
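A one-line Java sketch of this smoothing formula (s and x are the tuning constants defined above; any concrete values shown are assumptions, since the document only says they were found through testing):

// f(w) = (s*x + n*p(w)) / (s + n): blends the background assumption x, with strength s,
// into the observed probability p(w) computed from the n files containing the word.
public class DegreeOfBeliefSketch {
    public static double degreeOfBelief(double pw, int n, double s, double x) {
        return (s * x + n * pw) / (s + n);
    }
}

For example, with s = 1 and x = 0.5, a word seen in a single classified file (p(w) = 1.0, n = 1) gives f(w) = (0.5 + 1.0) / 2 = 0.75 rather than the unrealistically extreme 1.0.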
3.4 COMBINING THE PROBABILITIES
At this point, we can compute a probability, f(w), for each word that may appear in a new
file. This probability reflects our degree of belief, based on our background knowledge and
on the data in our training corpus, that a file chosen randomly from those that contain w
will be classified.
So each file is represented by a set of probabilities. We want to combine these individual
probabilities into an overall indicator of classification for the file as a whole.
“In the field of statistics known as meta-analysis, probably the most common way of
combining probabilities is due to R. A. Fisher. If we have a set of probabilities, p1, p2, ..., pn,
we can do the following. First, calculate -2*ln(p1 * p2 * ... * pn). Then, consider the result to
have a chi-square distribution with 2n degrees of freedom, and use a chi-square table to
compute the probability of getting a result as extreme, or more extreme, than the one
calculated. This "combined" probability meaningfully summarizes all the individual
probabilities.”
Let our null hypothesis be "The f(w)s are accurate, and the present file is a random
collection of words, each independent of the others, such that the f(w)s are uniformly
distributed." Now suppose we have a word, "Python", for which f(w) is .01: we believe that
a file containing it is classified only 1% of the time. Then, to the extent that our belief is
correct, an unlikely event occurred, one with a probability of .01. Similarly, every word in
the file has a probability associated with it. We then use the Fisher calculation to compute
an overall probability for the whole set of words. If the file is non-classified, it is likely to
have a number of very low probabilities and relatively few very high probabilities to
balance them, with the result that the Fisher calculation will give a very low combined
probability. This allows us to reject the null hypothesis and assume instead the alternative
hypothesis that the file is non-classified.
Let us call this combined probability H:
H = C-1( -2*ln( prod(f(w)) ), 2*n )
where C-1() is the inverse chi-square function, used to derive a p-value from a chi-square-distributed random variable.
In the real world, the Fisher combined probability is not a probability at all, but rather an
abstract indicator of how much (or little) we should be inclined to accept the null
hypothesis. It is not meant to be a true probability when the null hypothesis is false, so the
fact that it isn't doesn't cause computational problems for us.
The individual f(w)s are only approximations to real probabilities (i.e., when there is very
little data about a word, our best guess about its classification probability as given by f(w)
may not reflect its actual reality).
The final result is obtained with the technique below, which is the most effective one
identified in our testing.
First, "reverse'' all the probabilities by subtracting them from 1 (that is, for each word,
calculate 1 - f(w)). Because f(w) represents the probability that a randomly chosen file
from the set of files containing w is a classified, 1 - f(w) represents the probability that
such a randomly chosen file will be non-classified.
Now do the same Fisher calculation as before, but on the (1 - f(w))s rather than on the
f(w)s. This will result in near-0 combined probabilities, in rejection of the null hypothesis,
when a lot of very classified: words are present. Call this combined probability S.
Now calculate the final indicator I:
I = (1 + H - S) / 2
The “reverse” calculation reduces the number of classified files that are mistakenly
classified as non-classified, rather than reducing the number of non-classified files that
are mistakenly classified as classified (which is much less important).
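As an illustration of the whole combining step, here is a minimal Java sketch, under the assumption that C-1 is implemented with the closed-form chi-square survival function for an even number of degrees of freedom; the names are hypothetical and this is not the project's RobinsonFisherCombiner code:

// chi2Q(m, 2n): probability that a chi-square variable with 2n degrees of freedom
// is >= m. For an even number of degrees of freedom this has a simple closed form.
public class FisherCombinerSketch {
    public static double chi2Q(double m, int degreesOfFreedom) {
        int halfDf = degreesOfFreedom / 2;   // assumes degreesOfFreedom = 2*n is even
        double mOver2 = m / 2.0;
        double term = Math.exp(-mOver2);
        double sum = term;
        for (int i = 1; i < halfDf; i++) {
            term *= mOver2 / i;
            sum += term;
        }
        return Math.min(sum, 1.0);
    }

    // Combines the per-word f(w) values into the final indicator I = (1 + H - S) / 2.
    // The smoothed f(w) values are assumed to lie strictly between 0 and 1.
    public static double combine(double[] fw) {
        int n = fw.length;
        double sumLnF = 0.0;          // accumulates ln( prod(f(w)) )
        double sumLnOneMinusF = 0.0;  // accumulates ln( prod(1 - f(w)) )
        for (double f : fw) {
            sumLnF += Math.log(f);
            sumLnOneMinusF += Math.log(1.0 - f);
        }
        double h = chi2Q(-2.0 * sumLnF, 2 * n);          // Fisher on the f(w)s
        double s = chi2Q(-2.0 * sumLnOneMinusF, 2 * n);  // Fisher on the (1 - f(w))s
        return (1.0 + h - s) / 2.0;
    }
}

A result close to 1 indicates a classified file, a result close to 0 indicates a non-classified file, and values near 0.5 indicate that the evidence is inconclusive.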
4 CLASSIFICATION SYSTEM DESIGN
System Class Diagram
System Classes Description
Classifier
Goal: Main class and interface of the classification system.
Description: Initiates the system; starts the learning process; loads/saves learned data to files; classifies files.
Status: Implemented + tested. Comments: none.

ClassifierTest
Goal: Test class for the classifier.
Description: Tests the Classifier class.
Status: Implemented + tested. Comments: JUnit-like system.

File
Goal: A txt file representation.
Description: Parses a file and represents it.
Status: Implemented + tested. Comments: none.

Directory
Goal: A directory representation in the system.
Description: Reads a directory's files and holds the File objects.
Status: Implemented + tested. Comments: none.

FilesHandlerTest
Goal: Tests the Directory and File logic.
Description: Tests the Directory and File classes.
Status: Implemented + tested. Comments: JUnit-like system.

TokenPool
Goal: Represents a token pool.
Description: This class represents a pool of tokens.
Status: Implemented + tested. Comments: none.

TokenPoolFactory
Goal: Creates the token pools.
Description: Creates a TokenPool for directories; loads/saves TokenPools to files.
Status: Implemented + tested. Comments: none.

PoolClassesTest
Goal: Test class for TokenPoolFactory and TokenPool.
Description: Tests the logic for TokenPoolFactory and TokenPool.
Status: Implemented + tested. Comments: JUnit-like system.

Tokenizer_IF
Goal: Interface for a tokenizer.
Description: Interface for a tokenizer.
Status: none. Comments: none.

SimpleTokenizer
Goal: Tokenizer implementation.
Description: A simple regex-based whitespace tokenizer. It expects a string and can return all tokens lowercased or in their existing case.
Status: Implemented + tested. Comments: implements Tokenizer_IF.

ClassifyAlgorithm_IF
Goal: Interface of the classification algorithm.
Description: Classifies a txt message to a relevant pool.
Status: none. Comments: none.

BayesClassifyAlgo
Goal: Bayes classification algorithm.
Description: Classifies a txt message to a relevant pool.
Status: Implemented + tested. Comments: implements ClassifyAlgorithm_IF.

CombinerAlgo_IF
Goal: Combiner algorithm interface.
Description: Computes the probability of a message being part of a pool.
Status: none. Comments: none.

RobinsonCombiner
Goal: Robinson combiner.
Description: Computes the probability of a message being part of a pool (Robinson's method):
P = 1 - prod(1-p)^(1/n)
Q = 1 - prod(p)^(1/n)
S = (1 + (P-Q)/(P+Q)) / 2
Status: Implemented + tested. Comments: implements CombinerAlgo_IF.

RobinsonFisherCombiner
Goal: Robinson-Fisher combiner.
Description: Computes the probability of a message being classified (Robinson-Fisher method):
H = C-1( -2*ln(prod(p)), 2*n )
S = C-1( -2*ln(prod(1-p)), 2*n )
I = (1 + H - S) / 2
Status: Implemented + tested. Comments: implements CombinerAlgo_IF.
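To make the relations in the table above concrete, here is a minimal Java sketch of how the three interfaces might be declared; the method names and signatures are illustrative assumptions, not the project's actual API:

import java.util.List;

interface Tokenizer_IF {
    // Splits the text into tokens, optionally lower-casing them.
    List<String> tokenize(String text, boolean lowerCase);
}

interface CombinerAlgo_IF {
    // Combines per-word probabilities into one indicator for the whole message,
    // e.g. a Robinson or Robinson-Fisher combination.
    double combine(List<Double> wordProbabilities);
}

interface ClassifyAlgorithm_IF {
    // Produces the indicator that the given text belongs to the classified pool,
    // using a tokenizer and a combiner (as BayesClassifyAlgo would).
    double classify(String text, Tokenizer_IF tokenizer, CombinerAlgo_IF combiner);
}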
5 APPENDIX A - BAYESIAN FILTERING
Bayesian filtering is the process of using Bayesian statistical methods to classify documents
into categories.
Bayesian filtering gained attention when it was described in the paper A Plan for Spam by
Paul Graham, and has become a popular mechanism to distinguish illegitimate spam email
from legitimate "ham" email.
Bayesian filtering takes advantage of Bayes' theorem, which says that the probability that a
document belongs to a certain group (e.g., confidential documents), given that it contains
certain words, is equal to the probability of finding those words in a document from that
group, times the probability that any document belongs to that group, divided by the
probability of finding those words in any group:
P(group | words) = P(words | group) * P(group) / P(words)
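As a small numeric illustration (the numbers are invented for the example): if 30% of all documents are confidential, the word "budget" appears in 40% of confidential documents and in 20% of all documents, then P(confidential | "budget") = 0.4 * 0.3 / 0.2 = 0.6.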
Furthermore, Bayes' theorem is part of statistical inference, which is a large area of
statistics and mathematics.
The canonical example in the Bayesian world is a man who is in doubt about whether to
take an umbrella: if he takes the umbrella and it does not rain, he will carry it in vain, and if
it rains and he does not take the umbrella, he will get wet.
The man looks out of the window, sees black clouds and decides to take the umbrella…
In Bayesian terms we use the following definitions:
- World states: Ω = {ω_i}, the states that can occur. The states are mutually exclusive and
together they make up Ω: for every i ≠ j, ω_i ∩ ω_j = ∅, and the union of all ω_i equals Ω.
- Observations: X = {x_1, ..., x_n}, the data we have, the facts. From these observations
we try to deduce the state of the world.
- Probabilistic model of the world: P = {P_0(ω_i), P(X | ω_i)}. In the Bayesian approach
we assume that we hold complete probabilistic knowledge of the world. This knowledge
includes the a priori probability P_0(ω_i) of being in world state ω_i.
- Available activities: A = {a_1, ..., a_k}, the set of activities we can choose from. Every
activity has its "price", and we would like to choose the activity best suited to the state of
the world.
- Activity "prices": Λ = {λ(a_k, ω_i)}. Every activity we choose has a price that depends on
the state of the world: activities that do not suit the current state of the world have a
positive price, and activities that suit it have a negative price.
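Under these definitions, the usual Bayesian decision rule (stated here only as an illustration, since the text above does not spell it out) is to choose the activity with the smallest expected price given the observations:

choose the a in A that minimizes the sum over i of λ(a, ω_i) * P(ω_i | X)

In the umbrella example, X is the sight of black clouds, the world states are "rain" and "no rain", and taking the umbrella becomes the cheaper choice once P(rain | black clouds) is high enough.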
We can use Bayes' theorem in the "naïve" way to classify the downloaded documents: we
treat each "token" as independent of the others, multiply their conditional probabilities and
obtain the result.
Tokens can be almost anything, starting from specific words and ending in the amount of
sequential capital letters in a document (anything that can help classifying the document).
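A minimal Java sketch of this "naïve" multiplication (hypothetical names; log probabilities are used only to avoid numerical underflow when many tokens are multiplied):

import java.util.List;
import java.util.Map;

// Sketch: naive Bayes score of one group for a tokenized document. Tokens are treated
// as independent, so their conditional probabilities are multiplied (summed in log space).
public class NaiveBayesSketch {
    public static double logScore(List<String> tokens,
                                  double priorOfGroup,
                                  Map<String, Double> probTokenGivenGroup) {
        double logProb = Math.log(priorOfGroup);
        for (String token : tokens) {
            // Tokens never seen for this group get a small default probability instead of zero.
            double p = probTokenGivenGroup.getOrDefault(token, 1e-6);
            logProb += Math.log(p);
        }
        return logProb; // compare this score across groups and pick the largest
    }
}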
6 RESOURCES
http://www.paulgraham.com/spam.html - the original article that used Bayes filtering to filter spam mail.
http://radio.weblogs.com/0101454/stories/2002/09/16/spamDetection.html - an improvement on that algorithm.
http://www.linuxjournal.com/article/6467 - another improvement, and the one we used.