CS 640 Assignment #3
Disease Prediction using Weka
Use Weka to analyze a data set of 768 examples, each having 8 attributes and a class designation
indicating diabetes diagnosis (case or control). These examples represent data collected from
768 women aged 21 or older of Pima Indian heritage. The women were classified as having
diabetes either by medical diagnosis or as the result of a 2-hour post-load plasma glucose test.
More information on the 8 attributes is below.
Select 3 prediction methods from among Weka's options. One of these methods should be
either linear or logistic regression.
For each of the 3 prediction methods, pre-process the data by attribute (for example, you can
choose NominalToBinary, Normalize, or Standardize), attempting to improve results. Compare
results with and without data pre-processing.
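
To make the difference between two of these filters concrete, here is a minimal Python sketch of min-max normalization and standardization applied to a single attribute. It is an illustration of the general techniques only, not Weka's implementation: the function names and the sample glucose values are made up, and the use of the population standard deviation is an assumption that may differ from what Weka's Standardize filter does.

```python
from statistics import mean, pstdev

def normalize(values):
    """Rescale values linearly so the smallest maps to 0 and the largest to 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]  # constant attribute: nothing to spread out
    return [(v - lo) / (hi - lo) for v in values]

def standardize(values):
    """Shift values to zero mean and scale them to unit standard deviation."""
    mu = mean(values)
    sigma = pstdev(values)  # population std dev (an assumption, see lead-in)
    if sigma == 0:
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]

# Made-up glucose readings, for illustration only.
glucose = [85.0, 148.0, 183.0, 89.0, 137.0]
print(normalize(glucose))
print(standardize(glucose))
```

Methods that depend on distances or on the scale of the inputs are the ones most likely to respond to this kind of rescaling, which is one thing to look for when you compare your runs.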
For one of your prediction methods try removing some features (inputs) to attempt to improve
results.
Use information from class or do some research on your selected methods to find out how you
can manipulate the most influential parameters for your learning method (e.g., kernel and C for
a support vector machine, gamma for an RBF kernel). Write a brief description of each prediction
method. Describe the parameters you can vary.
For each of the 3 prediction methods you chose, paste a summary of its results: the prediction
accuracy, confusion matrix, and area under the ROC curve. Then write a paragraph discussing
results for that method: assessing and comparing the outputs of about 4 runs for that method.
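
As an aid to interpreting the area under the ROC curve, the following Python sketch computes it directly from its pairwise definition: the fraction of (positive, negative) example pairs in which the positive example receives the higher score, with ties counted as one half. The function name `auc` and the example scores are hypothetical, and Weka may compute the value by a different but equivalent route.

```python
def auc(scores, labels):
    """Area under the ROC curve from pairwise comparisons.

    scores: predicted score for the positive class, one per example
    labels: true class, 1 for positive (case) and 0 for negative (control)
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # tie counts as half
    return wins / (len(pos) * len(neg))

# Made-up scores: three of the four positive/negative pairs are ordered correctly.
print(auc([0.9, 0.4, 0.6, 0.2], [1, 1, 0, 0]))  # 0.75
```

A value of 0.5 corresponds to random ranking and 1.0 to a perfect ranking, which gives you a baseline when comparing runs.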
Write one paragraph summarizing and comparing the work you did using the 3 methods: what
strategies were the best in attaining a good predictor?
Your above discussions should show understanding of data preprocessing, varying of
parameters, and evaluation of results as they impact machine learning and data analysis:
- how data preprocessing affects your prediction method. (For example, you may have
  fewer input nodes for your neural network.)
- the machine learning methods you chose and the role of parameters (e.g., gamma for a
  support vector machine), including in overfitting or underfitting the predictive relation.
- evaluating the accuracy of your prediction model, and measures of prediction
  accuracy such as TP, FP, etc.
- how evaluation of prediction accuracy is impacted by the test set you selected. (For
  example, what are the implications of testing with the training set?)
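
The measures named above can all be derived from the four cells of a two-class confusion matrix. The Python sketch below shows one way to do that; the function names and the tiny label lists are hypothetical and are not Weka output.

```python
def confusion_counts(actual, predicted, positive=1):
    """Count true/false positives and negatives for a two-class problem."""
    tp = fp = tn = fn = 0
    for a, p in zip(actual, predicted):
        if p == positive and a == positive:
            tp += 1          # predicted case, really a case
        elif p == positive:
            fp += 1          # predicted case, really a control
        elif a == positive:
            fn += 1          # predicted control, really a case
        else:
            tn += 1          # predicted control, really a control
    return tp, fp, tn, fn

def summarize(tp, fp, tn, fn):
    """Derive common accuracy measures from the four confusion-matrix cells."""
    total = tp + fp + tn + fn
    return {
        "accuracy": (tp + tn) / total,
        "tp_rate": tp / (tp + fn) if tp + fn else 0.0,    # sensitivity / recall
        "fp_rate": fp / (fp + tn) if fp + tn else 0.0,    # 1 - specificity
        "precision": tp / (tp + fp) if tp + fp else 0.0,
    }

# Made-up labels for eight examples.
actual    = [1, 1, 1, 0, 0, 0, 0, 0]
predicted = [1, 1, 0, 1, 0, 0, 0, 0]
tp, fp, tn, fn = confusion_counts(actual, predicted)
print(tp, fp, tn, fn)               # 2 1 4 1
print(summarize(tp, fp, tn, fn))
```

Computing these on the same examples used for training tends to give optimistic values, because the model has already seen those examples; that is the issue the last item in the list asks you to discuss.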
Please do not just copy/paste in everything that was output by Weka; this shows no thought or
understanding. A copy/paste of prediction accuracy and confusion matrix as a basis of your
discussion is important. However, any long copy/paste that would distract from the flow of your
discussion should be confined to an appendix at the end of your document.
The scientific method:
The scientific method is based upon repeatability: given your description (including relevant
parameters), another researcher should be able to rerun your experiments and verify that they get
results that agree with yours.
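
One parameter that is easy to forget when reporting an experiment is the random seed used to split the data. The Python sketch below illustrates the idea with a simple, unstratified fold assignment: the same seed always reproduces the same folds, while a different seed generally does not. The function `fold_assignment` is hypothetical and does not reproduce how Weka itself partitions data.

```python
import random

def fold_assignment(n_examples, n_folds, seed):
    """Shuffle example indices with a fixed seed and deal them into folds."""
    indices = list(range(n_examples))
    random.Random(seed).shuffle(indices)   # private generator, so the seed fully determines the order
    return [indices[f::n_folds] for f in range(n_folds)]

folds_a = fold_assignment(768, 10, seed=1)
folds_b = fold_assignment(768, 10, seed=1)
folds_c = fold_assignment(768, 10, seed=2)

print(folds_a == folds_b)  # same seed, same folds
print(folds_a == folds_c)  # different seed, different folds
```

Reporting the seed, the number of folds, and every filter and classifier option alongside your results is what lets someone else rerun the experiment.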
The data is in Attribute-Relation File Format (ARFF), a format for storing data that is usable by Weka and
other machine learning packages. You may retrieve the file from the UC Irvine machine learning
repository, or from here: http://www.cs.usfca.edu/~pfrancislyon/uci-diabetes.arff
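
If you want to inspect the file outside Weka, the following Python sketch reads a simple ARFF document. It is a minimal illustration that handles only `%` comments, `@attribute` lines with unquoted names, and comma-separated `@data` rows; it does not handle quoted values or the sparse format. The attribute names in the sample text are made up and are not necessarily the names used in the diabetes file.

```python
def read_arff(text):
    """Parse a simple ARFF document and return (attribute_names, rows).

    Rows are returned as lists of strings; converting them to numbers
    is left to the caller.
    """
    names, rows, in_data = [], [], False
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("%"):
            continue                       # skip blank lines and comments
        lower = line.lower()               # ARFF keywords are case-insensitive
        if in_data:
            rows.append([field.strip() for field in line.split(",")])
        elif lower.startswith("@attribute"):
            names.append(line.split()[1])  # second token is the attribute name
        elif lower.startswith("@data"):
            in_data = True
    return names, rows

# Tiny hypothetical file; attribute names are illustrative only.
sample = """\
% comment line
@relation demo
@attribute glucose numeric
@attribute age numeric
@attribute class {0,1}
@data
148,50,1
85,31,0
"""

names, rows = read_arff(sample)
print(names)  # ['glucose', 'age', 'class']
print(rows)   # [['148', '50', '1'], ['85', '31', '0']]
```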
Information about the diabetes data set:
Abstract: From National Institute of Diabetes and Digestive and Kidney Diseases; Includes cost data
(donated by Peter Turney)
Data Set Characteristics:    Multivariate
Number of Instances:         768
Area:                        Life
Attribute Characteristics:   Integer, Real
Number of Attributes:        8
Date Donated:                1990-05-09
Associated Tasks:            Classification
Missing Values?              Yes
Number of Web Hits:          125423
Attribute Information:
1. Number of times pregnant
2. Plasma glucose concentration at 2 hours in an oral glucose tolerance test
3. Diastolic blood pressure (mm Hg)
4. Triceps skin fold thickness (mm)
5. 2-Hour serum insulin (mu U/ml)
6. Body mass index (weight in kg/(height in m)^2)
7. Diabetes pedigree function
8. Age (years)
9. Class variable (0 or 1)
** UPDATE: Until 02/28/2011 this web page indicated that there were no missing values in the dataset.
As pointed out by a repository user, this cannot be true: there are zeros in places where they are
biologically impossible, such as the blood pressure attribute. It seems very likely that zero values encode
missing data. However, since the dataset donors made no such statement we encourage you to use your
best judgement and state your assumptions.
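
One way to act on this note is to treat zeros as missing in the attributes where zero is not a plausible measurement, and to count how many values are affected before deciding what to do. The Python sketch below does this on made-up rows. Which columns to treat this way is an assumption you must state: here columns 2 to 6 of the attribute list (glucose, blood pressure, skin fold, insulin, BMI) are flagged, while the number of pregnancies is left alone because zero is a legitimate value there.

```python
# 0-based positions of attributes 2-6 in the list above.
# Flagging exactly these columns is an assumption, not something the donors stated.
SUSPECT_COLUMNS = (1, 2, 3, 4, 5)

def mark_missing(rows, columns=SUSPECT_COLUMNS):
    """Return a copy of rows with zeros in the suspect columns replaced by None."""
    return [
        [None if i in columns and v == 0 else v for i, v in enumerate(row)]
        for row in rows
    ]

def count_missing(rows):
    """Count None entries per column."""
    n_cols = len(rows[0])
    return [sum(1 for r in rows if r[i] is None) for i in range(n_cols)]

# Made-up rows in attribute order 1..8 plus class; not taken from the data set.
rows = [
    [0, 120, 70, 30, 0, 25.0, 0.5, 33, 0],
    [2, 0, 0, 0, 0, 0.0, 0.2, 21, 1],
]
cleaned = mark_missing(rows)
print(cleaned[0])
print(count_missing(cleaned))  # [0, 1, 1, 1, 2, 1, 0, 0, 0]
```

In an ARFF file a missing value is written as `?`, so after a replacement like this the data can be saved back with `?` in place of each `None`.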