306-622 Business Intelligence
Case Based Reasoning

Go to the following URL (also available via the LMS):

http://app.lms.unimelb.edu.au/bbcswebdav/courses/306622_2006_2/CBRApplet5

It may take a moment to load, but then you should see the following:

The program has the following features, from left to right: buttons to open the various data files (Height/Weight 1, Height/Weight 2, Loan, Iris, Cancer & Heart) and one to enter data for classification (New Case); in the middle is the Processing (or output) text area; on the right is an Input Data text area.

Laurens 306-622 August 2006

1. Start with the top button, Height/Weight 1. You should see this display:

The files that are opened are listed on the left, along with the number of cases (10); the data appears on the right. We can see data of the format: number, number, text (height in cm, weight in kg, gender). So the first case is: 185cm, 90kg, Male. Also shown are references to a file called hw.key, which specifies data weights and types (number or text), and the goalfield(s); there are 3 fields (height, weight & gender).

2. Use the New Case button to see this dialog:

Example data is displayed to show the format required: two numbers separated by a comma. Shown here is just the first case (185cm and 90kg) without the last attribute or value (Male). Don’t change anything; choose OK to process the data and confirm the decision is Male. The result was predictable because it’s an exact match with an existing case (this may not always be true).

Choose New Case again, but first try to predict the result for the following data: 170, 70. (Note that there is no instance/case of 170, 70 in the data.)

Q2.1 Do you predict that 170cm and 70kg is Male or Female?

Use New Case to see the decision, then repeat with 170 & 60.

Q2.2 Do you predict that 170cm and 60kg is Male or Female?

Q2.3 Now enter your own height and weight. Is the decision correct?

Q2.4 Would this process work for young children?

Note that there is no crossover in the data, i.e.
the two sets (Male, Female) are distinct, so we could apply simple rules to determine most cases (rather than use CBR).

Q2.5 What are the two simple rules?

3. Try the next data set, Height/Weight 2. Note that our rules will no longer apply because there is crossover in both height and weight (there’s a 170cm Male and a 175cm Female, a 65kg Male and a 75kg Female). Let’s see if the CBR process works: try the data from Q2.1 & Q2.2 again.

Q3.1 Do you predict that 170cm and 70kg is Male or Female?

Use New Case to see what the decision is, then repeat with 170 & 60.

Q3.2 Do you predict that 170cm and 60kg is Male or Female?

Now try the outlying data, 170cm and 65kg first.

Q3.3 Do you predict that 170, 65 is Male or Female?

Try the other outlying data, then extreme or unusual data, e.g. 1, 1 and 100, 100. You should always get a decision if the data is numeric, even if it’s unrealistic. You won’t get a decision if the data is invalid (try ‘a’, ‘b’). These are limitations of the current system; it could be modified to check input, but for now the user is expected to know data ranges and meanings.

4. Now move on to the next data set, called Loan. Shown left to right are age, married, salary, size of business, type of business, and loan approved (this is the data from the lecture). Use New Case to add the fourth case: Age = 26, Married = No, Salary = 50,000, Employer = Mid Size, and Industry = Manufacturing.

Q4.1 Do you predict that the loan will be approved?

5. Next is a famous CBR data set, Iris. There are 4 numeric attributes and a flower type or class; the numbers are measurements of the sepal (length and width) and the petal (length and width). The sepal is the leaf-like part of the plant under the petal. There are 3 types of Iris associated with these measurements: Iris Setosa, Iris Versicolour and Iris Virginica. Try a few cases to see if you can identify any relationships within the data.
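
The behaviour seen with the numeric data sets (an exact match, a decision for any numeric input, no decision for invalid input) can be illustrated with a minimal sketch of nearest-case retrieval. This assumes the applet compares a new case with each stored case using a weighted distance; the handout does not describe the applet's actual algorithm. The function name `classify`, the equal weights, and every case except the first (185, 90, Male) are hypothetical.

```python
import math

# Hypothetical case base: (height_cm, weight_kg, gender).
# Only the first case comes from the handout; the others are made up.
CASES = [
    (185.0, 90.0, "Male"),
    (178.0, 80.0, "Male"),
    (160.0, 55.0, "Female"),
    (165.0, 60.0, "Female"),
]


def classify(text, cases=CASES, weights=(1.0, 1.0)):
    """Return the goal value of the nearest stored case, or None if input is invalid."""
    try:
        query = [float(part) for part in text.split(",")]
    except ValueError:
        return None  # non-numeric input such as "a, b" gives no decision
    if len(query) != len(weights):
        return None  # wrong number of values

    def distance(case):
        # Weighted Euclidean distance over the numeric attributes only;
        # zip stops before the goal field because weights has two entries.
        return math.sqrt(
            sum(w * (q - c) ** 2 for w, q, c in zip(weights, query, case))
        )

    nearest = min(cases, key=distance)
    return nearest[-1]


print(classify("185, 90"))  # exact match with the first stored case
print(classify("1, 1"))     # unrealistic, but still numeric
print(classify("a, b"))     # invalid input
```

In this sketch any numeric input returns the goal value of whichever stored case is closest, however unrealistic the input is, which is why range checking is left to the user.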
Suppose we found a small iris which was actually a juvenile of what is usually a large variety. Identify the biggest iris (probably a virginica), halve the data, then enter it as a New Case.

Q5.1 Do you predict your data will be classified as a virginica?

6. Next we can look at two more complicated & bigger data sets, Cancer and Heart.

Cancer is a breast cancer dataset with 683 cases and 10 attributes. The attributes are as follows:

1. Clump Thickness (1 - 10)
2. Uniformity of Cell Size (1 - 10)
3. Uniformity of Cell Shape (1 - 10)
4. Marginal Adhesion (1 - 10)
5. Single Epithelial Cell Size (1 - 10)
6. Bare Nuclei (1 - 10)
7. Bland Chromatin (1 - 10)
8. Normal Nucleoli (1 - 10)
9. Mitoses (1 - 10)
10. Class (benign or malignant)

When pathologists examine FNA (Fine Needle Aspirate) tissue samples in breast cancer diagnosis, they consider nine characteristics: clump thickness, uniformity of cell size and shape, marginal adhesion, single epithelial cell size, bare nuclei, bland chromatin, normal nucleoli and mitoses. Each of these characteristics is assigned a number from 1 to 10 by the pathologist. The larger the number, the greater the likelihood of malignancy. No single measurement can be used to determine whether a sample is benign or malignant.

Heart is a heart disease dataset with 297 cases and 14 attributes. The 14th attribute was originally a number from 0 to 4 (0 meaning no disease); I changed these to None, Low, Medium, High, and Severe. The attributes are as follows:

1. age
2. gender (male, female)
3. chest pain type (1 = typical angina, atypical angina, non-anginal pain, asymptomatic)
4. resting blood pressure (in mm Hg on admission to the hospital, 1 - 500)
5. serum cholesterol in mg/dl (1 - 50)
6. fasting blood sugar > 120 mg/dl (high, low)
7. resting electrocardiograph results (0 = normal, 1 = having ST-T wave abnormality, 2 = showing probable or definite left ventricular hypertrophy by Estes' criteria)
8.
maximum heart rate achieved (1 - 500)
9. exercise induced angina (yes, no)
10. ST depression induced by exercise relative to rest (0 - 10)
11. slope of the peak exercise ST segment (1 = upsloping, 2 = flat, 3 = downsloping)
12. number of major vessels (0 - 3) coloured by fluoroscopy
13. heart rate (3 = normal, 6 = fixed defect, 7 = reversible defect)
14. the class or predicted heart disease attribute, originally 0 to 4 (where 0 was no disease)

Attributes 8, 9, 10 and 11 come from TMT (TreadMill Test).
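
Data sets such as Heart mix number and text attributes with very different ranges, so a raw difference in one attribute could swamp the others. The following is a minimal sketch of one common way to compare such cases, assuming numeric differences are scaled by the attribute's range and text attributes count as either matching or not. It is not taken from the applet: the names `SCHEMA`, `distance` and `retrieve`, the choice of five attributes, and both stored cases are hypothetical. Only the ranges come from the attribute list above.

```python
# Each entry: (attribute name, kind, width of the numeric range or None).
SCHEMA = [
    ("gender", "text", None),
    ("resting_bp", "number", 499.0),      # listed range 1 - 500
    ("max_heart_rate", "number", 499.0),  # listed range 1 - 500
    ("exercise_angina", "text", None),
    ("st_depression", "number", 10.0),    # listed range 0 - 10
]

# Hypothetical stored cases; the last value is the class (goal field).
HEART_CASES = [
    ("male", 140.0, 150.0, "yes", 2.0, "Medium"),
    ("female", 120.0, 170.0, "no", 0.0, "None"),
]


def distance(a, b, schema=SCHEMA):
    """Average per-attribute difference, each scaled to the interval 0..1."""
    total = 0.0
    for (_name, kind, width), x, y in zip(schema, a, b):
        if kind == "number":
            total += abs(x - y) / width     # scale by the attribute's range
        else:
            total += 0.0 if x == y else 1.0  # text either matches or not
    return total / len(schema)


def retrieve(query, cases=HEART_CASES):
    """Return the stored case closest to the query (query has no class value)."""
    return min(cases, key=lambda case: distance(query, case[:-1]))


new_case = ("female", 125.0, 165.0, "no", 0.5)
print(retrieve(new_case)[-1])  # class of the closest stored case
```

Scaling by range is a design choice rather than a requirement; per-attribute weights like those mentioned for hw.key could be multiplied into each term in the same loop.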