CS 640 Assignment #3
Disease Prediction using Weka

Use Weka to analyze a data set of 768 examples, each having 8 attributes and a class designation indicating diabetes diagnosis (case or control). These examples represent data collected from 768 women aged 21 or older of Pima Indian heritage. The women were classified as having diabetes either by medical diagnosis or as the result of a 2-hour post-load plasma glucose test. More information on the 8 attributes is below.

Select 3 prediction methods from among Weka's options. One of these methods should be either linear or logistic regression.

For each of the 3 prediction methods, pre-process the data by attribute (for example, you can choose NominalToBinary, Normalize, or Standardize), attempting to improve results. Compare results with and without data pre-processing.

For one of your prediction methods, try removing some features (inputs) to attempt to improve results.

Use information from class or do some research on your selected methods to find out how you can manipulate the most influential parameters for your learning method (ex: kernel and C for support vector machine, gamma for RBF kernel).

Write a brief description of each prediction method. Describe the parameters you can vary.

For each of the 3 prediction methods you chose, paste a summary of its results: the prediction accuracy, confusion matrix, and area under the ROC curve. Then write a paragraph discussing results for that method, assessing and comparing the outputs of about 4 runs for that method.

Write one paragraph summarizing and comparing the work you did using the 3 methods: what strategies were the best in attaining a good predictor?

Your above discussions should show understanding of data preprocessing, varying of parameters, and evaluation of results as they impact machine learning and data analysis:

- how data preprocessing affects your prediction method. (For example, you may have fewer input nodes for your neural network.)
- the machine learning methods you chose and the role of parameters (ex: gamma for a support vector machine), including in overfitting or underfitting the predictive relation.
- evaluating the accuracy of your prediction model, and measures of prediction accuracy such as TP, FP, etc.
- how evaluation of prediction accuracy is impacted by the test set you selected. (Ex: what are the implications of testing with the training set?)

Please do not just copy/paste in everything that was output by Weka: this shows no thought or understanding. A copy/paste of prediction accuracy and confusion matrix as a basis of your discussion is important. However, any long copy/paste that would distract from the flow of your discussion should be confined to an appendix at the end of your document.

The scientific method: The scientific method is based upon repeatability. Given your description (including relevant parameters), another researcher should be able to rerun your experiments and verify that they get results that agree with yours.

The data is in Attribute-Relation File Format (arff), a format for storing data that is usable by Weka and other machine learning packages. You may retrieve the file from the UC Irvine machine learning repository, or from here:
http://www.cs.usfca.edu/~pfrancislyon/uci-diabetes.arff

Information about the diabetes data set:

Abstract: From National Institute of Diabetes and Digestive and Kidney Diseases; Includes cost data (donated by Peter Turney)

Data Set Characteristics: Multivariate
Number of Instances: 768
Area: Life
Attribute Characteristics: Integer, Real
Number of Attributes: 8
Date Donated: 1990-05-09
Associated Tasks: Classification
Missing Values? Yes
Number of Web Hits: 125423

Attribute Information:
1. Number of times pregnant
2. Plasma glucose concentration at 2 hours in an oral glucose tolerance test
3. Diastolic blood pressure (mm Hg)
4. Triceps skin fold thickness (mm)
5. 2-Hour serum insulin (mu U/ml)
6. Body mass index (weight in kg/(height in m)^2)
7. Diabetes pedigree function
8. Age (years)
9. Class variable (0 or 1)

** UPDATE: Until 02/28/2011 this web page indicated that there were no missing values in the dataset. As pointed out by a repository user, this cannot be true: there are zeros in places where they are biologically impossible, such as the blood pressure attribute. It seems very likely that zero values encode missing data. However, since the dataset donors made no such statement, we encourage you to use your best judgement and state your assumptions.
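One way to act on the note about zeros is to count how many suspicious zeros each attribute contains before deciding how to treat them. The sketch below is only an illustration under stated assumptions: the column names are hypothetical (the ARFF file may name its attributes differently), the choice of which columns treat zero as missing is an assumption you would need to state in your report, and the sample rows are made up rather than taken from the data set. In an ARFF file, a missing value is written as a question mark (?).

```python
# Hypothetical column names; the actual ARFF file may name its attributes differently.
COLUMNS = ["pregnancies", "glucose", "blood_pressure", "skin_fold",
           "insulin", "bmi", "pedigree", "age"]

# Assumption: a zero in these columns is biologically implausible and is treated
# as missing. Zero pregnancies is a valid value, so that column is excluded.
ZERO_MEANS_MISSING = {"glucose", "blood_pressure", "skin_fold", "insulin", "bmi"}


def mark_missing(rows):
    """Return copies of the rows with suspicious zeros replaced by None."""
    cleaned = []
    for row in rows:
        cleaned.append([
            None if name in ZERO_MEANS_MISSING and value == 0 else value
            for name, value in zip(COLUMNS, row)
        ])
    return cleaned


def count_missing(rows):
    """Count None entries per column."""
    counts = {name: 0 for name in COLUMNS}
    for row in rows:
        for name, value in zip(COLUMNS, row):
            if value is None:
                counts[name] += 1
    return counts


# Made-up rows for illustration only; these are not records from the data set.
sample = [
    [2, 120, 70, 30, 100, 28.5, 0.40, 35],
    [0, 110, 0, 0, 0, 31.0, 0.25, 24],
    [5, 0, 80, 25, 0, 0.0, 0.60, 50],
]

cleaned = mark_missing(sample)
missing_counts = count_missing(cleaned)
print(missing_counts)
```

Whether to drop rows with missing values, replace them with a mean or median, or leave the zeros in place is a judgement call. Whichever you choose, state it so that your experiments remain repeatable.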
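When discussing measures of prediction accuracy such as TP and FP, it can help to see how the usual summary numbers follow from the four cells of a two-class confusion matrix. The sketch below is a minimal illustration, not Weka's own code: the function name summarize is hypothetical and the counts are invented, not results from any classifier on this data set.

```python
def summarize(tp, fn, fp, tn):
    """Derive common measures from the four cells of a two-class confusion matrix.

    tp: cases predicted as cases        fn: cases predicted as controls
    fp: controls predicted as cases     tn: controls predicted as controls
    """
    total = tp + fn + fp + tn
    return {
        "accuracy": (tp + tn) / total,   # fraction of all examples classified correctly
        "tp_rate": tp / (tp + fn),       # sensitivity (recall) for the case class
        "fp_rate": fp / (fp + tn),       # fraction of controls wrongly flagged as cases
        "precision": tp / (tp + fp),     # fraction of predicted cases that are real cases
    }


# Invented counts for illustration only.
measures = summarize(tp=40, fn=10, fp=20, tn=30)
for name, value in measures.items():
    print(f"{name}: {value:.3f}")
```

Two classifiers with the same accuracy can have very different TP and FP rates, which is one reason to report the confusion matrix alongside the accuracy figure.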