
Data mining project
Phase (2)
Index:
1.1) Introduction .................................................. 2
1.2) Associations .................................................. 2
1.3) Classifications ............................................... 4
1.4) Conclusion .................................................... 7
1.1) Introduction:
In this phase we apply association and classification methods to the two datasets,
"White wine" and "Breast tissue", which we described in the previous phase.
We apply association mining to find the basic rules in each dataset, and we apply
classification methods (Rule Induction and K-nearest on white wine, Decision Tree
and Naïve Bayes on breast tissue) to find the best accuracy for prediction.
1.2) Associations:
1.2.1) White wine dataset:
First we select the attributes [alcohol, free sulfur dioxide, pH, quality, residual
sugar, volatile acidity], which we know are relevant from the preprocessing phase. Then
we convert real values to integers to prepare the data for conversion to binominal. Then
we use FP-Growth to generate frequent itemsets at min support = 0.8. Finally we create
the association rules.
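The frequent-itemset step above can be sketched in a few lines. The code below is a minimal illustration, not the RapidMiner workflow: it runs a brute-force search that returns the same itemsets FP-Growth would (FP-Growth is simply a faster algorithm for the same job), over a handful of invented binominal transactions, not the real wine records.

```python
from itertools import combinations

def frequent_itemsets(transactions, min_support):
    """Brute-force frequent-itemset search over a list of item sets.
    Returns {itemset: support}; FP-Growth computes the same result faster."""
    n = len(transactions)
    items = sorted(set().union(*transactions))
    result = {}
    for k in range(1, len(items) + 1):
        found = False
        for cand in combinations(items, k):
            support = sum(1 for t in transactions if set(cand) <= t) / n
            if support >= min_support:
                result[frozenset(cand)] = support
                found = True
        if not found:
            break  # no frequent k-itemsets, so no larger ones can be frequent
    return result

# Hypothetical binominal transactions (NOT the real wine data):
# each set lists the attributes that are "true" for one record.
tx = [
    {"alcohol_high", "pH_high", "quality_high"},
    {"alcohol_high", "pH_high"},
    {"alcohol_high", "pH_high", "quality_high"},
    {"alcohol_high", "quality_high"},
    {"alcohol_high", "pH_high", "quality_high"},
]
fi = frequent_itemsets(tx, min_support=0.8)
```

With min support = 0.8, only itemsets present in at least 4 of the 5 transactions survive, which mirrors how the 0.8 setting prunes the white wine itemsets.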
Figure 1.2.1.1: building the association rules for the white wine dataset
Figure 1.2.1.2: table view of the results for the white wine dataset
Figure 1.2.1.3: text view of the association rules for the white wine dataset
Explanation of results:
Figure 1.2.1.3 shows that the confidence is greater than or equal to 0.9, which
suggests these rules are important. But figure 1.2.1.2 shows that the lift is equal
to 1, which means the items are not correlated.
1.2.2) Breast tissue dataset:
First we select the attributes [AD/A, DR, IO, MAX IP, P, CASE#, CLASS],
which we know are relevant from the preprocessing phase. Then we convert real values to
integers to prepare the data for conversion to binominal. Then we use FP-Growth to
generate frequent itemsets at min support = 0.95. Finally we create the association rules.
Figure 1.2.2.1: building the association rules for the breast tissue dataset
Figure 1.2.2.2: table view of the results for the breast tissue dataset
Figure 1.2.2.3: text view of the results for the breast tissue dataset (association rules)
Explanation of results:
Figure 1.2.2.3 shows that the confidence is equal to 100%, which suggests these
rules are important. But figure 1.2.2.2 shows that the lift is equal to 1, which
means the items are not correlated.
1.3) Classifications:
1.3.1) White wine dataset:
In this dataset, before using any model for classification, we need to apply
discretization on both the training and testing sides, because there is no nominal target class.
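The discretization step can be sketched as below. The bin edges are illustrative assumptions, since the report does not state them, but the class labels (bad, good, very good, excellent) are the ones that appear in the classification results.

```python
def discretize_quality(q):
    """Map a numeric wine quality score (0-10) onto the nominal class
    labels used in the classification results. The bin edges here are
    illustrative assumptions; only the labels come from the report."""
    if q <= 4:
        return "bad"
    elif q <= 6:
        return "good"
    elif q <= 7:
        return "very good"
    else:
        return "excellent"

labels = [discretize_quality(q) for q in (3, 5, 7, 9)]
```

After this mapping, the target is nominal and any of the classifiers below can be trained on it.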
1.3.1.1) Rule Induction:
Figure 1.3.1.1.1: the first step in building the classifier for the white wine dataset
Figure 1.3.1.1.2: the second step, building the training and testing sides for the white wine dataset using Rule Induction
Figure 1.3.1.1.3: the classification result for the white wine dataset
1.3.1.2) K-nearest:
Figure 1.3.1.2.1: the first step in building the classifier for the white wine dataset
Figure 1.3.1.2.2: the second step, building the training and testing sides for the white wine dataset using K-NN
Figure 1.3.1.2.3: the result for the white wine dataset
Explanation of results:
Comparing the accuracies (Rule Induction = 73.27%, Decision Tree = 60.22%,
K-nearest = 72.45%, Naïve Bayes = 65.31%), we found that Rule Induction is the best.
It predicts true good and true very good with a large ratio of correct predictions,
predicts true excellent with fewer correct predictions, and makes no predictions for
true bad. K-NN, in contrast, predicts true very good with a large ratio of correct
predictions and true good with a smaller ratio, but makes no correct predictions for
true bad and true excellent.
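The sensitivity of K-NN to the choice of k can be seen on a toy example. The points and labels below are invented, but they show how the predicted class can flip as k grows, which is one reason K-NN accuracy varies on the wine data.

```python
import math
from collections import Counter

def knn_predict(train, query, k):
    """Plain k-nearest-neighbours: majority label among the k training
    points closest to the query (Euclidean distance)."""
    nearest = sorted(train, key=lambda fl: math.dist(fl[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Invented 1-D points: two "good" wines near 0, three "very good" near 1.
train = [((0.0,), "good"), ((0.2,), "good"),
         ((1.0,), "very good"), ((1.1,), "very good"), ((1.2,), "very good")]

# The single nearest neighbour of 0.5 is labelled "good", but with k=5
# the majority vote flips the prediction to "very good".
k1 = knn_predict(train, (0.5,), k=1)
k5 = knn_predict(train, (0.5,), k=5)
```

Because the answer depends on k like this, K-NN results should be reported together with the k that produced them.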
1.3.2) Breast tissue dataset:
1.3.2.1) Decision Tree:
Figure 1.3.2.1.1: the first step in building the classifier for the breast tissue dataset
Figure 1.3.2.1.2: the second step, building the training and testing sides for the breast tissue dataset using Decision Tree
Figure 1.3.2.1.3: the result for the breast tissue dataset
1.3.2.2) Naïve Bayes:
Figure 1.3.2.2.1: the first step in building the classifier for the breast tissue dataset
Figure 1.3.2.2.2: the second step, building the training and testing sides for the breast tissue dataset using Naïve Bayes
Figure 1.3.2.2.3: the result for the breast tissue dataset
Explanation of results:
Comparing the accuracies of the models on the breast tissue dataset, Decision Tree
and Rule Induction are the best at 100%, followed by Naïve Bayes at 93.75% and
K-NN at 71.88%. In the decision tree, every column correctly predicts the target
class, but Naïve Bayes makes prediction errors in true con and true mas.
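The accuracy figures above come straight off the confusion matrices: accuracy is the diagonal total divided by the grand total. A small sketch, with class names from the breast tissue dataset but invented counts:

```python
def accuracy(confusion):
    """Accuracy from a confusion matrix given as
    {true_label: {predicted_label: count}}: diagonal sum / grand total."""
    correct = sum(row.get(lbl, 0) for lbl, row in confusion.items())
    total = sum(sum(row.values()) for row in confusion.values())
    return correct / total

# A fully diagonal matrix gives 100%, like the decision tree result;
# off-diagonal counts in the con/mas rows lower it, like Naive Bayes.
# (All counts here are made up for illustration.)
perfect = {"car": {"car": 4}, "con": {"con": 3}, "mas": {"mas": 5}}
with_errors = {"car": {"car": 4},
               "con": {"con": 2, "mas": 1},
               "mas": {"mas": 4, "con": 1}}
acc_tree = accuracy(perfect)
acc_nb = accuracy(with_errors)
```

Reading the per-class rows this way also shows where a model fails, not just how often, which is what the true con / true mas remark above refers to.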
1.4) Conclusion:
1. If the confidence is equal to 100%, the left- and right-hand sides of the rule
appear together in the transaction set exactly as often as the left-hand side
appears on its own.
2. If the confidence is less than 100%, the left- and right-hand sides appear
together in the transaction set less often than the left-hand side appears on its own.
3. Rule Induction is better than Decision Tree on large, numerical data, because
rule induction can generate more rules than a decision tree, which gives only
the rules that can be read off the tree.
4. The white wine dataset is large and noisy, yet applying K-nearest does not give
good accuracy, because the result depends on the value of k and on the attributes
being independent.
5. Naïve Bayes also gives low accuracy on the white wine dataset. Its prediction of
true very good is the strongest one, because very good is repeated in the dataset's
target class more often than the other classes.
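Points 1 and 2 can be checked numerically. In the sketch below (invented transactions), B occurs in every transaction, so the rule A -> B reaches 100% confidence while its lift is exactly 1: the same pattern seen in the association results, high confidence but no correlation.

```python
def rule_metrics(transactions, lhs, rhs):
    """Confidence and lift of the rule lhs -> rhs.
    confidence = support(lhs and rhs together) / support(lhs)
    lift       = confidence / support(rhs)"""
    n = len(transactions)
    def sup(items):
        return sum(1 for t in transactions if items <= t) / n
    confidence = sup(lhs | rhs) / sup(lhs)
    lift = confidence / sup(rhs)
    return confidence, lift

# B is in all four transactions, so "A implies B" is trivially true:
tx = [{"A", "B"}, {"A", "B"}, {"B"}, {"B", "C"}]
conf, lift = rule_metrics(tx, {"A"}, {"B"})
```

A lift above 1 would indicate a genuine positive correlation between the two sides; lift of exactly 1 means the right-hand side is just as common with or without the left-hand side.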