
306-622 Business Intelligence
Case Based Reasoning
Go to the following URL (also available via the LMS):
http://app.lms.unimelb.edu.au/bbcswebdav/courses/306622_2006_2/CBRApplet5
It may take a moment to load but then you should see the following:
From left to right, the program has buttons to open the various data files (Height/Weight 1,
Height/Weight 2, Loan, Iris, Cancer & Heart) and one to enter data for classification (New Case).
In the middle is the Processing (or output) text area, and on the right is an Input Data text
area.
Laurens
306-622
August 2006
1. Start with the top button, Height/Weight 1. You should see this display:
The files that are opened are listed along with the number of cases (10) on the left; the data
appears on the right. The data has the format number, number, text (height in cm, weight in kg,
gender), so the first case is 185cm, 90kg, Male.
Also shown are references to a file called hw.key, which specifies data weights and types
(number or text), and the goalfield(s). There are 3 fields (height, weight & gender).
2. Use the New Case button to see this dialog:
Example data is displayed to show the format required: two numbers separated by a comma.
Shown here is just the first case (185cm and 90kg) without the last attribute or value (Male).
Don’t change anything; choose OK to process the data and confirm the decision is Male.
The result was predictable because it’s an exact match with an existing case (this may not
always be true). Choose New Case again, but first try to predict the result for the following
data: 170, 70. (Note that there is no instance/case of 170, 70 in the data.)
Q2.1 Do you predict that 170cm and 70kg is Male or Female?
Use New Case to see the decision, then repeat with 170 & 60.
Q2.2 Do you predict that 170cm and 60kg is Male or Female?
Q2.3 Now enter your own height and weight – is the decision correct?
Q2.4 Would this process work for young children?
Note that there is no crossover in the data, i.e. the two sets (Male, Female) are distinct so we
could apply simple rules to determine most cases (rather than use CBR).
Q2.5 What are the two simple rules?
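The retrieval step behind the New Case decision can be sketched as a nearest-neighbour lookup. This is only an illustration: the handout does not say which similarity measure the applet uses, so plain Euclidean distance is an assumption, and every case below except the first (185, 90, Male) is hypothetical rather than taken from the Height/Weight 1 file.

```python
import math

# Hypothetical case base: only the first row (185, 90, Male) comes from the handout.
CASES = [
    (185, 90, "Male"),
    (180, 85, "Male"),
    (178, 80, "Male"),
    (160, 55, "Female"),
    (165, 58, "Female"),
    (158, 50, "Female"),
]


def nearest_case(height, weight, cases=CASES):
    """Return the stored case with the smallest Euclidean distance to the query."""
    return min(cases, key=lambda c: math.hypot(c[0] - height, c[1] - weight))


def classify(height, weight, cases=CASES):
    """Reuse the gender of the nearest stored case as the decision."""
    return nearest_case(height, weight, cases)[2]


# An exact match with a stored case is at distance 0, so it is always retrieved.
print(classify(185, 90))
```

Because height and weight are on different scales, an unweighted distance like this lets the attribute with the larger spread dominate; the weights in a key file such as hw.key are one way to correct for that.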
3. Try the next data set, Height/Weight 2. Note that our rules will no longer apply because
there is crossover in both height and weight (there’s a 170cm Male and a 175cm Female, a 65kg
Male and a 75kg Female). Let’s see if the CBR process works: try the data from Q2.1 & Q2.2
again.
Q3.1 Do you predict that 170cm and 70kg is Male or Female?
Use New Case to see what the decision is, then repeat with 170 & 60.
Q3.2 Do you predict that 170cm and 60kg is Male or Female?
Now try the outlying data, 170cm and 65kg first.
Q3.3 Do you predict that 170, 65 is Male or Female?
Try the other outlying data, then extreme or unusual data, e.g. 1, 1 and 100, 100. You
should always get a decision if the data is numeric, even if it’s unrealistic. You won’t get a
decision if the data is invalid (try ‘a’, ‘b’). These are limitations of the current system: it
could be modified to check input, but for now the user is expected to know data ranges and
meanings.
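As an illustration of the kind of input check described above, the sketch below parses a "number, number" entry and tests it against plausible ranges. The function names and the range values are hypothetical; nothing here reflects how the applet itself is written.

```python
# Hypothetical plausible ranges: height in cm, weight in kg.
RANGES = [(50, 250), (2, 300)]


def parse_case(text, n_fields=2):
    """Parse comma-separated numbers; raise ValueError with a clear message if invalid."""
    parts = [p.strip() for p in text.split(",")]
    if len(parts) != n_fields:
        raise ValueError(f"expected {n_fields} values separated by commas, got {len(parts)}")
    values = []
    for p in parts:
        try:
            values.append(float(p))
        except ValueError:
            raise ValueError(f"not a number: {p!r}") from None
    return values


def in_range(values, ranges=RANGES):
    """True if every value lies inside its (low, high) range."""
    return all(lo <= v <= hi for v, (lo, hi) in zip(values, ranges))


for entry in ["170, 70", "1, 1", "a, b"]:
    try:
        values = parse_case(entry)
        print(entry, "->", "plausible" if in_range(values) else "numeric but out of range")
    except ValueError as err:
        print(entry, "-> rejected:", err)
```

Separating "is it a number" from "is it a plausible value" mirrors the two situations in the exercise: 1, 1 is numeric but unrealistic, while ‘a’, ‘b’ is not numeric at all.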
4. Now move on to the next data set, called Loan. Shown left to right are age, married, salary,
size of business, type of business, and loan approved (this is the data from the lecture). Use
New Case to enter the fourth case: Age = 26, Married = No, Salary = 50,000, Employer = Mid
Size, and Industry = Manufacturing.
Q4.1 Do you predict that the loan will be approved?
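The Loan data mixes numbers and text, so a similarity measure has to treat the two types differently. The sketch below shows one common approach: scaled difference for numbers, exact match for text, combined with per-field weights. The weights, numeric ranges and stored cases are all hypothetical; only the query values come from the handout, and the applet may compute similarity differently.

```python
# Hypothetical field descriptions: (name, type, weight, numeric range used for scaling).
FIELDS = [
    ("age", "number", 1.0, (18, 70)),
    ("married", "text", 1.0, None),
    ("salary", "number", 1.0, (0, 200_000)),
    ("employer", "text", 1.0, None),
    ("industry", "text", 1.0, None),
]

# Hypothetical stored cases; the last value is the loan decision.
CASES = [
    (45, "Yes", 80_000, "Large", "Finance", "Yes"),
    (23, "No", 30_000, "Small", "Retail", "No"),
    (30, "No", 55_000, "Mid Size", "Manufacturing", "Yes"),
]

# Query values from the handout (the decision is what we want to predict).
QUERY = (26, "No", 50_000, "Mid Size", "Manufacturing")


def similarity(query, case, fields=FIELDS):
    """Weighted average of per-field similarities, each between 0 and 1."""
    total = 0.0
    weight_sum = 0.0
    for value_q, value_c, (_, kind, weight, rng) in zip(query, case, fields):
        if kind == "number":
            lo, hi = rng
            local = 1.0 - min(abs(value_q - value_c) / (hi - lo), 1.0)
        else:
            local = 1.0 if value_q == value_c else 0.0
        total += weight * local
        weight_sum += weight
    return total / weight_sum


def most_similar(query, cases=CASES):
    """Return the stored case (including its decision) most similar to the query."""
    return max(cases, key=lambda c: similarity(query, c[:-1]))


print(most_similar(QUERY))
```

Raising the weight of one field makes a mismatch on that field count for more, which is the role the data weights in the key file are described as playing.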
5. Next is a famous CBR data set, Iris. There are 4 numeric attributes and a flower type or
class; the numbers are measurements of the sepal (length and width) and the petal (length and
width). The sepal is the leaf-like part of the plant under the petal.
There are 3 types of Iris associated with these measurements: Iris Setosa, Iris Versicolour and
Iris Virginica. Try a few cases to see if you can identify any relationships within the data.
Suppose we found a small iris which was actually a juvenile of what is usually a large variety.
Identify the biggest iris (probably a virginica), halve the data then enter it as a New Case.
Q5.1 Do you predict your data will be classified as a virginica?
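The "find the biggest, then halve it" step can be done by hand, but a small helper makes the arithmetic explicit. In this sketch "biggest" is taken to mean the largest sum of the four measurements, which is only one possible reading. The three rows are values as commonly published for the Iris data set and may not match the applet's file, and the comma-separated output format is assumed from the Height/Weight dialog.

```python
# Three rows as commonly published for the Iris data set; the applet's file may differ.
ROWS = [
    (5.1, 3.5, 1.4, 0.2, "Iris-setosa"),
    (7.0, 3.2, 4.7, 1.4, "Iris-versicolor"),
    (7.9, 3.8, 6.4, 2.0, "Iris-virginica"),
]


def biggest(rows):
    """Pick the row with the largest sum of measurements (one reading of 'biggest')."""
    return max(rows, key=lambda r: sum(r[:4]))


def halved_input(row):
    """Halve the four measurements and format them as comma-separated text."""
    return ", ".join(f"{v / 2:g}" for v in row[:4])


row = biggest(ROWS)
print(row[4], "->", halved_input(row))
```

The resulting text can be pasted into New Case; whether the decision matches the original class is what Q5.1 asks you to predict.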
6. Next we can look at two bigger and more complicated data sets, Cancer and Heart.
Cancer is a breast cancer dataset with 683 cases and 10 attributes. The attributes are as
follows:
    Attribute                         Values
 1. Clump Thickness                   1 - 10
 2. Uniformity of Cell Size           1 - 10
 3. Uniformity of Cell Shape          1 - 10
 4. Marginal Adhesion                 1 - 10
 5. Single Epithelial Cell Size       1 - 10
 6. Bare Nuclei                       1 - 10
 7. Bland Chromatin                   1 - 10
 8. Normal Nucleoli                   1 - 10
 9. Mitoses                           1 - 10
10. Class                             benign or malignant
When pathologists examine FNA (Fine Needle Aspirate) tissue samples in breast cancer
diagnosis, they consider nine characteristics: clump thickness, uniformity of cell size and
shape, marginal adhesion, single epithelial cell size, bare nuclei, bland chromatin, normal
nucleoli and mitoses. Each of these characteristics is assigned a number from 1 to 10 by the
pathologist. The larger the number the greater the likelihood of malignancy. No single
measurement can be used to determine whether a sample is benign or malignant.
Heart is a heart disease dataset with 297 cases and 14 attributes. The 14th attribute was
originally a number from 0 to 4 (0 meaning no disease); I changed these to None, Low,
Medium, High, and Severe. The attributes are as follows:
1. age
2. gender (male, female)
3. chest pain type (1 = typical angina, 2 = atypical angina, 3 = non-anginal pain, 4 = asymptomatic)
4. resting blood pressure (in mm Hg on admission to the hospital, 1 - 500)
5. serum cholesterol in mg/dl (1 - 50)
6. fasting blood sugar > 120 mg/dl (high, low)
7. resting electrocardiograph results (0 = normal, 1 = having ST-T wave abnormality,
2 = showing probable or definite left ventricular hypertrophy by Estes' criteria)
8. maximum heart rate achieved (1 – 500)
9. exercise induced angina (yes, no)
10. ST depression induced by exercise relative to rest (0 – 10)
11. slope of the peak exercise ST segment (1 = upsloping, 2 = flat, 3 = downsloping)
12. number of major vessels (0-3) coloured by fluoroscopy
13. heart rate (3 = normal, 6 = fixed defect, 7 = reversible defect)
14. the class or predicted heart disease attribute, was 0 to 4 (where 0 was no disease)
Attributes 8, 9, 10 and 11 come from TMT (TreadMill Test)
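The recoding of the 14th attribute from numbers to text labels can be sketched as a simple lookup. The handout gives the five labels and says 0 means no disease; pairing 1 to 4 with Low, Medium, High and Severe in that order is an assumption based on the order in which the labels are listed.

```python
# Assumed pairing: labels applied in the order the handout lists them.
SEVERITY = {0: "None", 1: "Low", 2: "Medium", 3: "High", 4: "Severe"}


def recode(value):
    """Map the original 0-4 class number (int or numeric text) to its text label."""
    return SEVERITY[int(value)]


print([recode(v) for v in range(5)])
```

Using text labels makes the class a "text" attribute, so a decision is reported as a category rather than treated as a quantity.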