Weka Textbook Figures

If tear production rate = reduced then recommendation = none.
If age = young and astigmatic = no and tear production rate = normal
  then recommendation = soft
If age = pre-presbyopic and astigmatic = no and tear production rate = normal
  then recommendation = soft
If age = presbyopic and spectacle prescription = myope and astigmatic = no
  then recommendation = none
If spectacle prescription = hypermetrope and astigmatic = no and
  tear production rate = normal then recommendation = soft
If spectacle prescription = myope and astigmatic = yes and
  tear production rate = normal then recommendation = hard
If age = young and astigmatic = yes and tear production rate = normal
  then recommendation = hard
If age = pre-presbyopic and spectacle prescription = hypermetrope
  and astigmatic = yes then recommendation = none
If age = presbyopic and spectacle prescription = hypermetrope
  and astigmatic = yes then recommendation = none
Figure 1.1 Rules for the contact lens data.
Figure 1.2 Decision tree for the contact lens data.
(a)
(b)
Figure 1.3 Decision trees for the labor negotiations data.
[Figure residue: a family tree with Peter (M) = Peggy (F) and Grace (F) = Ray (M), and the children Steven (M), Graham (M), Pam (F), Ian (M), Pippa (F), Brian (M), Anna (F), and Nikki (F); plus two tables of first person, second person, and sister of? The pairs marked yes are Steven-Pam, Graham-Pam, Ian-Pippa, Brian-Pippa, Anna-Nikki, and Nikki-Anna; all the rest are no.]
Figure 2.1 A family tree and two ways of expressing the sister-of relation.
% ARFF file for the weather data with some numeric features
%
@relation weather

@attribute outlook { sunny, overcast, rainy }
@attribute temperature numeric
@attribute humidity numeric
@attribute windy { true, false }
@attribute play? { yes, no }

@data
%
% 14 instances
%
sunny, 85, 85, false, no
sunny, 80, 90, true, no
overcast, 83, 86, false, yes
rainy, 70, 96, false, yes
rainy, 68, 80, false, yes
rainy, 65, 70, true, no
overcast, 64, 65, true, yes
sunny, 72, 95, false, no
sunny, 69, 70, false, yes
rainy, 75, 80, false, yes
sunny, 75, 70, true, yes
overcast, 72, 90, true, yes
overcast, 81, 75, false, yes
rainy, 71, 91, true, no
Figure 2.2 ARFF file for the weather data.
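As a brief illustration of how a file like this is read programmatically, the sketch below loads an ARFF file into a weka.core.Instances object and makes the last attribute the class; the file name weather.arff is an assumption, and this is a minimal sketch rather than anything taken from the figures.

import weka.core.Instances;
import java.io.BufferedReader;
import java.io.FileReader;

public class LoadArff {
  public static void main(String[] args) throws Exception {
    // Read the ARFF file into an Instances object (file name assumed).
    Instances data =
      new Instances(new BufferedReader(new FileReader("weather.arff")));
    // Make the final attribute (play?) the class attribute.
    data.setClassIndex(data.numAttributes() - 1);
    System.out.println("Loaded " + data.numInstances() + " instances with "
        + data.numAttributes() + " attributes.");
  }
}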
Figure 3.1 Decision tree for a simple disjunction.
If x=1 and y=0 then class = a
If x=0 and y=1 then class = a
If x=0 and y=0 then class = b
If x=1 and y=1 then class = b
Figure 3.2 The exclusive-or problem.
If x=1 and y=1
then class = a
If z=1 and w=1
then class = a
Otherwise class = b
Figure 3.3 Decision tree with a replicated subtree.
Default: Iris-setosa
except if petal-length >= 2.45 and petal-length < 5.355
          and petal-width < 1.75
       then Iris-versicolor
            except if petal-length >= 4.95 and petal-width < 1.55
                   then Iris-virginica
                   else if sepal-length < 4.95 and sepal-width >= 2.45
                        then Iris-virginica
       else if petal-length >= 3.35
            then Iris-virginica
                 except if petal-length < 4.85 and sepal-length < 5.95
                        then Iris-versicolor
Figure 3.4 Rules for the Iris data.
[Figure residue: twelve shapes numbered 1-12. Shaded: standing; unshaded: lying.]
Figure 3.5 The shapes problem.
PRP = -56.1 + 0.049 MYCT + 0.015 MMIN + 0.006 MMAX
      + 0.630 CACH - 0.270 CHMIN + 1.46 CHMAX
Figure 3.6(a) Models for the CPU performance data: linear regression.
Figure 3.6(b) Models for the CPU performance data: regression tree.
Figure 3.6(c) Models for the CPU performance data: model tree.
(a)
(b)
(c)
(d)
Figure 3.7 Different ways of partitioning the instance space.
(a)
(b)
(c)
(d)
[Figure residue: instances labeled a-k, a table of per-cluster membership probabilities for clusters 1, 2, and 3, and a dendrogram over the instances g a c i e d k b j f h.]
Figure 3.8 Different ways of representing clusters.
For each attribute,
  For each value of that attribute, make a rule as follows:
    count how often each class appears
    find the most frequent class
    make the rule assign that class to this attribute-value.
  Calculate the error rate of the rules.
Choose the rules with the smallest error rate.
Figure 4.1 Pseudo-code for 1R.
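As a concrete rendering of this procedure, here is a minimal, self-contained Java sketch of 1R for nominal attributes; the class name, the data layout (rows whose last column is the class), and the tiny example dataset are illustrative assumptions rather than Weka code.

import java.util.HashMap;
import java.util.Map;

public class OneRSketch {

  /** Returns the index of the attribute whose one-level rules make the fewest errors. */
  public static int bestAttribute(String[][] rows, int numAttributes) {
    int bestAtt = -1;
    int bestErrors = Integer.MAX_VALUE;
    for (int att = 0; att < numAttributes; att++) {
      // For each value of the attribute, count how often each class appears.
      Map<String, Map<String, Integer>> counts = new HashMap<>();
      for (String[] row : rows) {
        counts.computeIfAbsent(row[att], v -> new HashMap<>())
              .merge(row[row.length - 1], 1, Integer::sum);
      }
      // The rule for each value predicts its most frequent class; all other
      // instances with that value count as errors.
      int errors = 0;
      for (Map<String, Integer> classCounts : counts.values()) {
        int total = 0, max = 0;
        for (int c : classCounts.values()) { total += c; max = Math.max(max, c); }
        errors += total - max;
      }
      if (errors < bestErrors) { bestErrors = errors; bestAtt = att; }
    }
    return bestAtt;
  }

  public static void main(String[] args) {
    // A tiny, purely illustrative slice of the weather data: outlook, windy, play.
    String[][] rows = {
      {"sunny", "false", "no"}, {"sunny", "true", "no"}, {"overcast", "false", "yes"},
      {"rainy", "false", "yes"}, {"rainy", "true", "no"}, {"overcast", "true", "yes"}
    };
    System.out.println("1R chooses attribute index " + bestAttribute(rows, 2));
  }
}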
(a)
(b)
(c)
(d)
Figure 4.2 Tree stumps for the weather data.
(a)
(b)
(c)
Figure 4.3 Expanded tree stumps for the weather data.
Figure 4.4 Decision tree for the weather data.
Figure 4.5 Tree stump for the ID code attribute.
[Figure residue: panels (a) and (b) plot instances of classes a and b against axes x and y, with splits at x = 1.2 and y = 2.6.]
(a)
(b)
Figure 4.6 (a) Operation of a covering algorithm; (b) decision tree for the same problem.
space of examples
rule so far
rule after adding new term
Figure 4.7 The instance space during operation of a covering algorithm.
For each class C
  Initialize E to the instance set
  While E contains instances in class C
    Create a rule R with an empty left-hand side that predicts class C
    Until R is perfect (or there are no more attributes to use) do
      For each attribute A not mentioned in R, and each value v,
        Consider adding the condition A=v to the LHS of R
      Select A and v to maximize the accuracy p/t
        (break ties by choosing the condition with the largest p)
      Add A=v to R
    Remove the instances covered by R from E
Figure 4.8 Pseudo-code for a basic rule learner.
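The selection step in the inner loop can be read as the small comparison below, where for each candidate condition p counts the covered instances of the target class and t counts all covered instances; the Candidate class and its fields are purely illustrative, not part of the figure or of Weka.

import java.util.List;

/** Illustrative candidate condition with its coverage counts. */
class Candidate {
  String attribute;
  String value;
  int p;  // instances of the target class covered by the condition
  int t;  // all instances covered by the condition
}

class ConditionChooser {
  /** Picks the candidate with the highest accuracy p/t, breaking ties by larger p. */
  static Candidate best(List<Candidate> candidates) {
    Candidate best = null;
    for (Candidate c : candidates) {
      if (best == null
          || (long) c.p * best.t > (long) best.p * c.t            // higher p/t (cross-multiplied)
          || ((long) c.p * best.t == (long) best.p * c.t && c.p > best.p)) {  // tie: larger p
        best = c;
      }
    }
    return best;
  }
}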
Figure 5.1 A hypothetical lift chart.
Figure 5.2 A sample ROC curve.
Figure 5.3 ROC curves for two learning schemes.
(a)
(b)
Figure 6.1 Example of subtree raising, where node C is “raised” to subsume node B.
Figure 6.2 Pruning the labor negotiations decision tree.
Initialize E to the instance set
Until E is empty do
  For each class C for which E contains an instance
    Use the basic covering algorithm to create the best perfect rule
      for class C
    Calculate the probability measure m(R) for the rule, and for the
      rule with the final condition omitted m(R-)
    While m(R-) < m(R), remove the final condition from the rule and
      repeat the previous step
  From the rules generated, select the one with the smallest m(R)
  Print the rule
  Remove the instances covered by the rule from E
Continue
Figure 6.3 Generating rules using a probability measure.
[Diagram: a universe containing n examples; the rule selects s examples; the class contains k examples; of the selected examples, z are in the class.]
p   number of instances of that class that the rule selects;
t   total number of instances that the rule selects;
P   total number of instances of that class in the dataset;
T   total number of instances in the dataset.
Figure 6.4 Definitions for deriving the probability measure.
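The figure gives definitions only; one natural way to turn them into the measure m(R) used in Figure 6.3 (an interpretation, not something stated in the figure) is the hypergeometric tail probability that a rule selecting t of the T instances at random would include at least p of the P class members:

$$ m(R) \;=\; \sum_{i=p}^{\min(t,\,P)} \frac{\binom{P}{i}\binom{T-P}{\,t-i\,}}{\binom{T}{t}}. $$

Under this reading, a smaller m(R) means the rule's accuracy is less likely to have arisen by chance, which matches Figure 6.3's preference for the rule with the smallest m(R).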
Initialize E to the instance set
Split E into Grow and Prune in the ratio 2:1
For each class C for which Grow and Prune both contain an instance
  Use the basic covering algorithm to create the best perfect rule
    for class C
  Calculate the worth w(R) for the rule on Prune, and of the rule
    with the final condition omitted w(R-)
  While w(R-) > w(R), remove the final condition from the rule and
    repeat the previous step
From the rules generated, select the one with the largest w(R)
Print the rule
Remove the instances covered by the rule from E
Continue
Figure 6.5 Algorithm for forming rules by incremental reduced error pruning.
Expand-subset (S):
  Choose a test T and use it to split the set of examples into subsets
  Sort subsets into increasing order of average entropy
  while (there is a subset X that has not yet been expanded
         AND all subsets expanded so far are leaves)
    expand-subset(X)
  if (all the subsets expanded are leaves
      AND estimated error for subtree >= estimated error for node)
    undo expansion into subsets and make node a leaf
Figure 6.6 Algorithm for expanding examples into a partial tree.
(a)
(c)
(b)
Figure 6.7 Example of building a partial tree.
(d)
(e)
Figure 6.7 (continued) Example of building a partial tree.
Exceptions are represented as dotted paths, alternatives as solid ones.
Figure 6.8 Rules with exceptions, for the Iris data.
Figure 6.9 A maximum margin hyperplane.
Figure 6.10 A boundary between two rectangular classes.
MakeModelTree (instances)
{
  SD = sd(instances)
  for each k-valued nominal attribute
    convert into k-1 synthetic binary attributes
  root = newNode
  root.instances = instances
  split(root)
  prune(root)
  printTree(root)
}

split(node)
{
  if sizeof(node.instances) < 4 or sd(node.instances) < 0.05*SD
    node.type = LEAF
  else
    node.type = INTERIOR
    for each attribute
      for all possible split positions of the attribute
        calculate the attribute's SDR
    node.attribute = attribute with maximum SDR
    split(node.left)
    split(node.right)
}

prune(node)
{
  if node = INTERIOR then
    prune(node.leftChild)
    prune(node.rightChild)
    node.model = linearRegression(node)
    if subtreeError(node) > error(node) then
      node.type = LEAF
}

subtreeError(node)
{
  l = node.left; r = node.right
  if node = INTERIOR then
    return (sizeof(l.instances)*subtreeError(l)
            + sizeof(r.instances)*subtreeError(r))/sizeof(node.instances)
  else return error(node)
}
Figure 6.11 Pseudo-code for model tree induction.
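The SDR quantity used in split() is not spelled out in the figure; in the standard M5 formulation (stated here as background rather than taken from the figure), the standard deviation reduction for a split of the instance set T into subsets T_1, ..., T_k is

$$ \mathrm{SDR} \;=\; \mathrm{sd}(T) \;-\; \sum_{i} \frac{|T_i|}{|T|}\,\mathrm{sd}(T_i), $$

and split() chooses the attribute and split position that maximize it.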
Figure 6.12 Model tree for a dataset with nominal attributes.
(a)
(b)
(c)
Figure 6.13 Clustering the weather data.
(d)
(e)
Figure 6.13 (continued) Clustering the weather data.
(f)
Figure 6.13 (continued) Clustering the weather data.
(a)
Figure 6.14 Hierarchical clusterings of the Iris data.
(b)
Figure 6.14 (continued) Hierarchical clusterings of the Iris data.
data
A 51  A 43  B 62  B 64  A 45  A 42  A 46  A 45  A 45
B 62  A 47  A 52  B 64  A 51  B 65  A 48  A 49  A 46
B 64  A 51  A 52  B 62  A 49  A 48  B 62  A 43  A 40
A 48  B 64  A 51  B 63  A 43  B 65  B 66  B 65  A 46
A 39  B 62  B 64  A 52  B 63  B 64  A 48  B 64  A 48
A 51  A 48  B 64  A 42  A 48  A 41

model
μA = 50, σA = 5, pA = 0.6
μB = 65, σB = 2, pB = 0.4

Figure 6.15 A two-class mixture model.
Figure 7.1 Attribute space for the weather dataset.
Figure 7.2 Discretizing temperature using the entropy method.
[Figure residue: the sorted temperature values 64 65 68 69 70 71 72 75 80 81 83 85 with their yes/no play classes, and cut points labeled A 84, B 80.5, C 77.5, D 73.5, E 70.5, and F 66.5.]
Figure 7.3 The result of discretizing temperature.
Figure 7.4 Class distribution for a two-class, two-attribute problem.
Figure 7.5 Number of international phone calls from Belgium, 1950–1973.
model generation
Let n be the number of instances in the training data.
For each of t iterations:
  Sample n instances with replacement from training data.
  Apply the learning algorithm to the sample.
  Store the resulting model.

classification
For each of the t models:
  Predict class of instance using model.
Return class that has been predicted most often.
Figure 7.6 Algorithm for bagging.
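A minimal Java sketch of the same two phases, written against the Weka classes that appear in the other listings in these figures (Instances, Instance, Classifier); the hand-rolled bootstrap sampling, the fixed random seed, and the class name are illustrative assumptions rather than the book's code.

import weka.core.Instance;
import weka.core.Instances;
import weka.classifiers.Classifier;
import java.util.Random;

public class BaggingSketch {
  private Classifier[] models;

  /** model generation: build each classifier on a bootstrap sample of the training data. */
  public void buildModels(Instances data, Classifier[] freshModels) throws Exception {
    Random random = new Random(1);
    int n = data.numInstances();
    models = freshModels;
    for (Classifier model : models) {
      // Sample n instances with replacement from the training data.
      Instances sample = new Instances(data, n);
      for (int i = 0; i < n; i++) {
        sample.add(data.instance(random.nextInt(n)));
      }
      model.buildClassifier(sample);
    }
  }

  /** classification: return the class predicted most often by the stored models. */
  public double classify(Instance instance, int numClasses) throws Exception {
    int[] votes = new int[numClasses];
    for (Classifier model : models) {
      votes[(int) model.classifyInstance(instance)]++;
    }
    int best = 0;
    for (int c = 1; c < numClasses; c++) {
      if (votes[c] > votes[best]) best = c;
    }
    return best;
  }
}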
model generation
Assign equal weight to each training instance.
For each of t iterations:
  Apply learning algorithm to weighted dataset and store resulting model.
  Compute error e of model on weighted dataset and store error.
  If e equal to zero, or e greater or equal to 0.5:
    Terminate model generation.
  For each instance in dataset:
    If instance classified correctly by model:
      Multiply weight of instance by e / (1 - e).
  Normalize weight of all instances.

classification
Assign weight of zero to all classes.
For each of the t (or less) models:
  Add -log(e / (1 - e)) to weight of class predicted by model.
Return class with highest weight.
Figure 7.7 Algorithm for boosting.
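Written out as formulas, the two update steps above are simply: a correctly classified instance has its weight scaled down before renormalization, and each stored model votes with the corresponding log-odds weight,

$$ w_i \leftarrow w_i \cdot \frac{e}{1-e} \quad\text{(then renormalize)}, \qquad \text{vote weight} \;=\; -\log\frac{e}{1-e} \;=\; \log\frac{1-e}{e}. $$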
(a)
(b)
Figure 8.1 Weather data: (a) in spreadsheet; (b) comma-separated.
(c)
Figure 8.1 Weather data: (c) in ARFF format.
J48 pruned tree
------------------

outlook = sunny
|   humidity <= 75: yes (2.0)
|   humidity > 75: no (3.0)
outlook = overcast: yes (4.0)
outlook = rainy
|   windy = TRUE: no (2.0)
|   windy = FALSE: yes (3.0)

Number of Leaves  :   5
Size of the tree  :   8

=== Error on training data ===

Correctly Classified Instances          14             100      %
Incorrectly Classified Instances         0               0      %
UnClassified Instances                   0               0      %
Mean absolute error                      0
Root mean squared error                  0
Total Number of Instances               14

=== Confusion Matrix ===

 a b   <-- classified as
 9 0 | a = yes
 0 5 | b = no

=== Stratified cross-validation ===

Correctly Classified Instances           9              64.2857 %
Incorrectly Classified Instances         5              35.7143 %
UnClassified Instances                   0               0      %
Mean absolute error                      0.3036
Root mean squared error                  0.4813
Total Number of Instances               14

=== Confusion Matrix ===

 a b   <-- classified as
 7 2 | a = yes
 3 2 | b = no
Figure 8.2 Output from the J4.8 decision tree learner.
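Output of this form is obtained by evaluating the classifier from the command line on an ARFF file. The class name below follows the package layout used in the other listings in these figures and the file name is illustrative, so both should be read as assumptions rather than a fixed recipe (later Weka releases moved J48 to weka.classifiers.trees.J48):

java weka.classifiers.j48.J48 -t weather.arff

Here -t names the training file; when no separate test file is supplied, the evaluation reports both the error on the training data and a stratified cross-validation, as shown above.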
(a)
(b)
Figure 8.3 Using Javadoc: (a) the front page; (b) the weka.core package.
Figure 8.4 A class of the weka.classifiers package.
Pruned training model tree:

MMAX <= 14000 : LM1 (141/4.18%)
MMAX > 14000 : LM2 (68/51.8%)

Models at the leaves (smoothed):

LM1:  class = 4.15
      - 2.05'vendor=honeywell,ipl,ibm,cdc,ncr,basf,
             gould,siemens,nas,adviser,sperry,amdahl'
      + 5.43'vendor=adviser,sperry,amdahl'
      - 5.78'vendor=amdahl' + 0.00638MYCT
      + 0.00158MMIN + 0.00345MMAX
      + 0.552CACH + 1.14CHMIN + 0.0945CHMAX

LM2:  class = -113
      - 56.1'vendor=honeywell,ipl,ibm,cdc,ncr,basf,
             gould,siemens,nas,adviser,sperry,amdahl'
      + 10.2'vendor=adviser,sperry,amdahl'
      - 10.9'vendor=amdahl'
      + 0.012MYCT + 0.0145MMIN + 0.0089MMAX
      + 0.808CACH + 1.29CHMAX

=== Error on training data ===

Correlation coefficient                  0.9853
Mean absolute error                     13.4072
Root mean squared error                 26.3977
Relative absolute error                 15.3431 %
Root relative squared error             17.0985 %
Total Number of Instances              209

=== Cross-validation ===

Correlation coefficient                  0.9767
Mean absolute error                     13.1239
Root mean squared error                 33.4455
Relative absolute error                 14.9884 %
Root relative squared error             21.6147 %
Total Number of Instances              209
Figure 8.5 Output from the M5´ program for numeric prediction.
J48 pruned tree
------------------
: yes (14.0/0.74)

Number of Rules   :   1
Size of the tree  :   1

=== Confusion Matrix ===

 a b   <-- classified as
 9 0 | a = yes
 5 0 | b = no

=== Stratified cross-validation ===

Correctly Classified Instances            9             64.2857 %
Incorrectly Classified Instances          5             35.7143 %
UnClassified Instances                    0              0      %
Correctly Classified With Cost           90             94.7368 %
Incorrectly Classified With Cost          5              5.2632 %
UnClassified With Cost                    0              0      %
Mean absolute error                       0.3751
Root mean squared error                   0.5714
Total Number of Instances                14
Total Number With Cost                   95

=== Confusion Matrix ===

 a b   <-- classified as
 9 0 | a = yes
 5 0 | b = no
Figure 8.6 Output from J4.8 with cost-sensitive classification.
@relation weather-weka.filters.DeleteFilter-R1_2
@attribute humidity real
@attribute windy { TRUE,FALSE}
@attribute play {yes,no}
@data
85,FALSE,no
90,TRUE,no

Figure 8.7 Effect of AttributeFilter on the weather dataset.
Apriori
=======

Minimum support: 0.2
Minimum confidence: 0.9
Number of cycles performed: 17

Generated sets of large itemsets:

Size of set of large itemsets L(1): 12
Size of set of large itemsets L(2): 47
Size of set of large itemsets L(3): 39
Size of set of large itemsets L(4): 6

Best rules found:

 1. humidity=normal windy=FALSE 4 ==> play=yes 4 (1)
 2. temperature=cool 4 ==> humidity=normal 4 (1)
 3. outlook=overcast 4 ==> play=yes 4 (1)
 4. temperature=cool play=yes 3 ==> humidity=normal 3 (1)
 5. outlook=rainy windy=FALSE 3 ==> play=yes 3 (1)
 6. outlook=rainy play=yes 3 ==> windy=FALSE 3 (1)
 7. outlook=sunny humidity=high 3 ==> play=no 3 (1)
 8. outlook=sunny play=no 3 ==> humidity=high 3 (1)
 9. temperature=cool windy=FALSE 2 ==> humidity=normal play=yes 2 (1)
10. temperature=cool humidity=normal windy=FALSE 2 ==> play=yes 2 (1)
Figure 8.8 Output from the APRIORI association rule learner.
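As a worked reading of rule 1 (taking the 14-instance weather data and interpreting the trailing (1) as the rule's confidence, which is an assumption rather than something stated in the output): humidity=normal together with windy=FALSE covers 4 instances, and play=yes holds for all 4 of them, so

$$ \text{support} \;=\; \tfrac{4}{14} \approx 0.29 \;\ge\; 0.2, \qquad \text{confidence} \;=\; \tfrac{4}{4} = 1 \;\ge\; 0.9, $$

which is why the rule clears both thresholds reported at the top of the output.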
EM
==

Number of clusters: 2

Cluster: 0 Prior probability: 0.2816

Attribute: outlook
Discrete Estimator. Counts = 2.96 2.98 1 (Total = 6.94)
Attribute: temperature
Normal Distribution. Mean = 82.2692 StdDev = 2.2416
Attribute: humidity
Normal Distribution. Mean = 83.9788 StdDev = 6.3642
Attribute: windy
Discrete Estimator. Counts = 1.96 3.98 (Total = 5.94)
Attribute: play
Discrete Estimator. Counts = 2.98 2.96 (Total = 5.94)

Cluster: 1 Prior probability: 0.7184

Attribute: outlook
Discrete Estimator. Counts = 4.04 3.02 6 (Total = 13.06)
Attribute: temperature
Normal Distribution. Mean = 70.1616 StdDev = 3.8093
Attribute: humidity
Normal Distribution. Mean = 80.7271 StdDev = 11.6349
Attribute: windy
Discrete Estimator. Counts = 6.04 6.02 (Total = 12.06)
Attribute: play
Discrete Estimator. Counts = 8.02 4.04 (Total = 12.06)

=== Clustering stats for training data ===

Cluster Instances
0        4 (29 %)
1       10 (71 %)

Log likelihood: -9.01881
Figure 8.9 Output from the EM clustering scheme.
/**
 * Java program for classifying short text messages into two classes.
 */
import weka.core.*;
import weka.classifiers.*;
import weka.filters.*;
import java.io.*;
import java.util.*;

public class MessageClassifier implements Serializable {

  /* Our (rather arbitrary) set of keywords. */
  private final String[] m_Keywords = {"product", "only", "offer",
    "great", "amazing", "phantastic", "opportunity", "buy", "now"};

  /* The training data. */
  private Instances m_Data = null;

  /* The filter. */
  private Filter m_Filter = new DiscretizeFilter();

  /* The classifier. */
  private Classifier m_Classifier = new IBk();

  /**
   * Constructs empty training dataset.
   */
  public MessageClassifier() throws Exception {

    String nameOfDataset = "MessageClassificationProblem";

    // Create numeric attributes.
    FastVector attributes = new FastVector(m_Keywords.length + 1);
    for (int i = 0; i < m_Keywords.length; i++) {
      attributes.addElement(new Attribute(m_Keywords[i]));
    }

    // Add class attribute.
    FastVector classValues = new FastVector(2);
    classValues.addElement("miss");
    classValues.addElement("hit");
    attributes.addElement(new Attribute("Class", classValues));

    // Create dataset with initial capacity of 100, and set index of class.
    m_Data = new Instances(nameOfDataset, attributes, 100);
    m_Data.setClassIndex(m_Data.numAttributes() - 1);
  }

  /**
   * Updates model using the given training message.
   */
  public void updateModel(String message, String classValue)
    throws Exception {
Figure 8.10 Source code for the message classifier.
    // Convert message string into instance.
    Instance instance = makeInstance(cleanupString(message));

    // Add class value to instance.
    instance.setClassValue(classValue);

    // Add instance to training data.
    m_Data.add(instance);

    // Use filter.
    m_Filter.inputFormat(m_Data);
    Instances filteredData = Filter.useFilter(m_Data, m_Filter);

    // Rebuild classifier.
    m_Classifier.buildClassifier(filteredData);
  }

  /**
   * Classifies a given message.
   */
  public void classifyMessage(String message) throws Exception {

    // Check if classifier has been built.
    if (m_Data.numInstances() == 0) {
      throw new Exception("No classifier available.");
    }

    // Convert message string into instance.
    Instance instance = makeInstance(cleanupString(message));

    // Filter instance.
    m_Filter.input(instance);
    Instance filteredInstance = m_Filter.output();

    // Get index of predicted class value.
    double predicted =
      m_Classifier.classifyInstance(filteredInstance);

    // Classify instance.
    System.err.println("Message classified as : " +
      m_Data.classAttribute().value((int) predicted));
  }

  /**
   * Method that converts a text message into an instance.
   */
  private Instance makeInstance(String messageText) {

    StringTokenizer tokenizer = new StringTokenizer(messageText);
    Instance instance = new Instance(m_Keywords.length + 1);
    String token;

    // Initialize counts to zero.
    for (int i = 0; i < m_Keywords.length; i++) {
Figure 8.10 (continued)
      instance.setValue(i, 0);
    }

    // Compute attribute values.
    while (tokenizer.hasMoreTokens()) {
      token = tokenizer.nextToken();
      for (int i = 0; i < m_Keywords.length; i++) {
        if (token.equals(m_Keywords[i])) {
          instance.setValue(i, instance.value(i) + 1.0);
          break;
        }
      }
    }

    // Give instance access to attribute information from the dataset.
    instance.setDataset(m_Data);
    return instance;
  }

  /**
   * Method that deletes all non-letters from a string, and lowercases it.
   */
  private String cleanupString(String messageText) {

    char[] result = new char[messageText.length()];
    int position = 0;
    for (int i = 0; i < messageText.length(); i++) {
      if (Character.isLetter(messageText.charAt(i)) ||
          Character.isWhitespace(messageText.charAt(i))) {
        result[position++] =
          Character.toLowerCase(messageText.charAt(i));
      }
    }
    return new String(result);
  }

  /**
   * Main method.
   */
  public static void main(String[] options) {

    MessageClassifier messageCl;
    byte[] charArray;

    try {

      // Read message file into string.
      String messageFileString = Utils.getOption('m', options);
      if (messageFileString.length() != 0) {
        FileInputStream messageFile = new
          FileInputStream(messageFileString);
        int numChars = messageFile.available();
Figure 8.10 (continued)
        charArray = new byte[numChars];
        messageFile.read(charArray);
        messageFile.close();
      } else {
        throw new Exception("Name of message file not provided.");
      }

      // Check if class value is given.
      String classValue = Utils.getOption('c', options);

      // Check for model file. If existent, read it, otherwise create new one.
      String modelFileString = Utils.getOption('t', options);
      if (modelFileString.length() != 0) {
        try {
          FileInputStream modelInFile = new
            FileInputStream(modelFileString);
          ObjectInputStream modelInObjectFile =
            new ObjectInputStream(modelInFile);
          messageCl = (MessageClassifier)
            modelInObjectFile.readObject();
          modelInFile.close();
        } catch (FileNotFoundException e) {
          messageCl = new MessageClassifier();
        }
      } else {
        throw new Exception("Name of data file not provided.");
      }

      // Check if there are any options left.
      Utils.checkForRemainingOptions(options);

      // Process message.
      if (classValue.length() != 0) {
        messageCl.updateModel(new String(charArray), classValue);
      } else {
        messageCl.classifyMessage(new String(charArray));
      }

      // If class has been given, updated message classifier must be saved.
      if (classValue.length() != 0) {
        FileOutputStream modelOutFile =
          new FileOutputStream(modelFileString);
        ObjectOutputStream modelOutObjectFile =
          new ObjectOutputStream(modelOutFile);
        modelOutObjectFile.writeObject(messageCl);
        modelOutObjectFile.flush();
        modelOutFile.close();
      }
    } catch (Exception e) {
      e.printStackTrace();
    }
  }
}
Figure 8.10 (continued)
import weka.classifiers.*;
import weka.core.*;
import java.io.*;
import java.util.*;

/**
 * Class implementing an Id3 decision tree classifier.
 */
public class Id3 extends DistributionClassifier {

  /** The node's successors. */
  private Id3[] m_Successors;

  /** Attribute used for splitting. */
  private Attribute m_Attribute;

  /** Class value if node is leaf. */
  private double m_ClassValue;

  /** Class distribution if node is leaf. */
  private double[] m_Distribution;

  /** Class attribute of dataset. */
  private Attribute m_ClassAttribute;

  /**
   * Builds Id3 decision tree classifier.
   */
  public void buildClassifier(Instances data) throws Exception {

    if (!data.classAttribute().isNominal()) {
      throw new Exception("Id3: nominal class, please.");
    }
    Enumeration enumAtt = data.enumerateAttributes();
    while (enumAtt.hasMoreElements()) {
      Attribute attr = (Attribute) enumAtt.nextElement();
      if (!attr.isNominal()) {
        throw new Exception("Id3: only nominal attributes, please.");
      }
      Enumeration enumInsts = data.enumerateInstances();
      while (enumInsts.hasMoreElements()) {
        if (((Instance) enumInsts.nextElement()).isMissing(attr)) {
          throw new Exception("Id3: no missing values, please.");
        }
      }
    }
    data = new Instances(data);
    data.deleteWithMissingClass();
    makeTree(data);
  }

  /**
   * Method building Id3 tree.
   */
  private void makeTree(Instances data) throws Exception {

    // Check if no instances have reached this node.
Figure 8.11 Source code for the ID3 decision tree learner.
    if (data.numInstances() == 0) {
      m_Attribute = null;
      m_ClassValue = Instance.missingValue();
      m_Distribution = new double[data.numClasses()];
      return;
    }

    // Compute attribute with maximum information gain.
    double[] infoGains = new double[data.numAttributes()];
    Enumeration attEnum = data.enumerateAttributes();
    while (attEnum.hasMoreElements()) {
      Attribute att = (Attribute) attEnum.nextElement();
      infoGains[att.index()] = computeInfoGain(data, att);
    }
    m_Attribute = data.attribute(Utils.maxIndex(infoGains));

    // Make leaf if information gain is zero.
    // Otherwise create successors.
    if (Utils.eq(infoGains[m_Attribute.index()], 0)) {
      m_Attribute = null;
      m_Distribution = new double[data.numClasses()];
      Enumeration instEnum = data.enumerateInstances();
      while (instEnum.hasMoreElements()) {
        Instance inst = (Instance) instEnum.nextElement();
        m_Distribution[(int) inst.classValue()]++;
      }
      Utils.normalize(m_Distribution);
      m_ClassValue = Utils.maxIndex(m_Distribution);
      m_ClassAttribute = data.classAttribute();
    } else {
      Instances[] splitData = splitData(data, m_Attribute);
      m_Successors = new Id3[m_Attribute.numValues()];
      for (int j = 0; j < m_Attribute.numValues(); j++) {
        m_Successors[j] = new Id3();
        m_Successors[j].buildClassifier(splitData[j]);
      }
    }
  }

  /**
   * Classifies a given test instance using the decision tree.
   */
  public double classifyInstance(Instance instance) {

    if (m_Attribute == null) {
      return m_ClassValue;
    } else {
      return m_Successors[(int) instance.value(m_Attribute)].
        classifyInstance(instance);
    }
  }

  /**
   * Computes class distribution for instance using decision tree.
   */
  public double[] distributionForInstance(Instance instance) {
Figure 8.11 (continued)
    if (m_Attribute == null) {
      return m_Distribution;
    } else {
      return m_Successors[(int) instance.value(m_Attribute)].
        distributionForInstance(instance);
    }
  }

  /**
   * Prints the decision tree using the private toString method from below.
   */
  public String toString() {

    return "Id3 classifier\n==============\n" + toString(0);
  }

  /**
   * Computes information gain for an attribute.
   */
  private double computeInfoGain(Instances data, Attribute att)
    throws Exception {

    double infoGain = computeEntropy(data);
    Instances[] splitData = splitData(data, att);
    for (int j = 0; j < att.numValues(); j++) {
      if (splitData[j].numInstances() > 0) {
        infoGain -= ((double) splitData[j].numInstances() /
                     (double) data.numInstances()) *
          computeEntropy(splitData[j]);
      }
    }
    return infoGain;
  }

  /**
   * Computes the entropy of a dataset.
   */
  private double computeEntropy(Instances data) throws Exception {

    double[] classCounts = new double[data.numClasses()];
    Enumeration instEnum = data.enumerateInstances();
    while (instEnum.hasMoreElements()) {
      Instance inst = (Instance) instEnum.nextElement();
      classCounts[(int) inst.classValue()]++;
    }
    double entropy = 0;
    for (int j = 0; j < data.numClasses(); j++) {
      if (classCounts[j] > 0) {
        entropy -= classCounts[j] * Utils.log2(classCounts[j]);
      }
    }
    entropy /= (double) data.numInstances();
    return entropy + Utils.log2(data.numInstances());
  }

  /**
   * Splits a dataset according to the values of a nominal attribute.
Figure 8.11 (continued)
   */
  private Instances[] splitData(Instances data, Attribute att) {

    Instances[] splitData = new Instances[att.numValues()];
    for (int j = 0; j < att.numValues(); j++) {
      splitData[j] = new Instances(data, data.numInstances());
    }
    Enumeration instEnum = data.enumerateInstances();
    while (instEnum.hasMoreElements()) {
      Instance inst = (Instance) instEnum.nextElement();
      splitData[(int) inst.value(att)].add(inst);
    }
    return splitData;
  }

  /**
   * Outputs a tree at a certain level.
   */
  private String toString(int level) {

    StringBuffer text = new StringBuffer();
    if (m_Attribute == null) {
      if (Instance.isMissingValue(m_ClassValue)) {
        text.append(": null");
      } else {
        text.append(": " + m_ClassAttribute.value((int) m_ClassValue));
      }
    } else {
      for (int j = 0; j < m_Attribute.numValues(); j++) {
        text.append("\n");
        for (int i = 0; i < level; i++) {
          text.append("| ");
        }
        text.append(m_Attribute.name() + " = " + m_Attribute.value(j));
        text.append(m_Successors[j].toString(level + 1));
      }
    }
    return text.toString();
  }

  /**
   * Main method.
   */
  public static void main(String[] args) {

    try {
      System.out.println(Evaluation.evaluateModel(new Id3(), args));
    } catch (Exception e) {
      System.out.println(e.getMessage());
    }
  }
}
Figure 8.11 (continued)
import weka.filters.*;
import weka.core.*;
import java.io.*;

/**
 * Replaces all missing values for nominal and numeric attributes in a
 * dataset with the modes and means from the training data.
 */
public class ReplaceMissingValuesFilter extends Filter {

  /** The modes and means */
  private double[] m_ModesAndMeans = null;

  /**
   * Sets the format of the input instances.
   */
  public boolean inputFormat(Instances instanceInfo)
    throws Exception {

    m_InputFormat = new Instances(instanceInfo, 0);
    setOutputFormat(m_InputFormat);
    b_NewBatch = true;
    m_ModesAndMeans = null;
    return true;
  }

  /**
   * Input an instance for filtering. Filter requires all
   * training instances be read before producing output.
   */
  public boolean input(Instance instance) throws Exception {

    if (m_InputFormat == null) {
      throw new Exception("No input instance format defined");
    }
    if (b_NewBatch) {
      resetQueue();
      b_NewBatch = false;
    }
    if (m_ModesAndMeans == null) {
      m_InputFormat.add(instance);
      return false;
    } else {
      convertInstance(instance);
      return true;
    }
  }

  /**
   * Signify that this batch of input to the filter is finished.
   */
  public boolean batchFinished() throws Exception {

    if (m_InputFormat == null) {
      throw new Exception("No input instance format defined");
    }
Figure 8.12 Source code for a filter that replaces the missing values in a dataset.
    if (m_ModesAndMeans == null) {

      // Compute modes and means.
      m_ModesAndMeans = new double[m_InputFormat.numAttributes()];
      for (int i = 0; i < m_InputFormat.numAttributes(); i++) {
        if (m_InputFormat.attribute(i).isNominal() ||
            m_InputFormat.attribute(i).isNumeric()) {
          m_ModesAndMeans[i] = m_InputFormat.meanOrMode(i);
        }
      }

      // Convert pending input instances.
      for (int i = 0; i < m_InputFormat.numInstances(); i++) {
        Instance current = m_InputFormat.instance(i);
        convertInstance(current);
      }
    }
    b_NewBatch = true;
    return (numPendingOutput() != 0);
  }

  /**
   * Convert a single instance over. The converted instance is
   * added to the end of the output queue.
   */
  private void convertInstance(Instance instance) throws Exception {

    Instance newInstance = new Instance(instance);
    for (int j = 0; j < m_InputFormat.numAttributes(); j++) {
      if (instance.isMissing(j) &&
          (m_InputFormat.attribute(j).isNominal() ||
           m_InputFormat.attribute(j).isNumeric())) {
        newInstance.setValue(j, m_ModesAndMeans[j]);
      }
    }
    push(newInstance);
  }

  /**
   * Main method.
   */
  public static void main(String[] argv) {

    try {
      if (Utils.getFlag('b', argv)) {
        Filter.batchFilterFile(new ReplaceMissingValuesFilter(), argv);
      } else {
        Filter.filterFile(new ReplaceMissingValuesFilter(), argv);
      }
    } catch (Exception ex) {
      System.out.println(ex.getMessage());
    }
  }
}
Figure 8.12 (continued)
Figure 9.1 Representation of Iris data: (a) one dimension.
Figure 9.1 Representation of Iris data: (b) two dimensions.
Figure 9.2 Visualization of classification tree for grasses data.