
Data Mining Techniques for Predicting Breast Cancer Survivability Among Women in the United States
Sultanah M. Alshammari, Tawfiq M. Shah
Faculty Advisor: Yan Huang, Ph.D.
Department of Computer Science and Engineering, University of North Texas
Introduction
Breast cancer is the most common invasive cancer in females worldwide. In this study we investigate the use of several data mining techniques to automatically predict breast cancer survival status based on specific features. Several measurements were used to evaluate the accuracy and the performance of the selected data mining methods. Furthermore, by implementing feature selection analysis we were able to identify the set of attributes that contributed the most to breast cancer survivability prediction. This study is based on the Surveillance, Epidemiology, and End Results (SEER) breast cancer database.[3]

Materials and Methods
Figure 2: SEER breast cancer dataset pre-processing. (Diagram labels: SEER Documentation, SEER Breast Cancer Dataset, Extracting and Formatting Tools, ARFF File, Data Mining Tools.)

The selected breast cancer survivability attributes:
Nominal attributes:
- Race
- Primary site code
- Behavior code
- Extension
- Site specific surgery code
- Tumor size
- Marital status
- Histologic type
- Grade
- Lymph node involvement
- Radiation
Numerical attributes:
- Age
- Number of nodes
- Number of positive nodes
- Number of primaries

PREDICTION MODELS: In this study, the selected data mining techniques for predicting breast cancer survivability are:[6]
• Naive Bayes (NB) classifier is a simple probabilistic classifier based on applying Bayes' theorem with strong independence assumptions.
• Decision tree (C4.5) algorithm is a predictive model that maps observations about an item to conclusions about that item's target value.
• Artificial neural network (ANN) is an analytical system inspired by the structure of biological neural networks and their way of encoding and solving problems. For the ANN implementations, two types of ANN were used: Multilayer Perceptron and Radial Basis Function (RBF) Network.
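As an illustration of the Naive Bayes description above, the following minimal Python sketch trains a classifier on nominal attributes using frequency counts with add-one (Laplace) smoothing. It is not the WEKA implementation used in this study; the function names, the two example attributes, and the six toy records are hypothetical and serve only to make the independence assumption concrete.

```python
from collections import Counter, defaultdict
import math


def train_naive_bayes(rows, labels):
    """Count class frequencies and per-class attribute-value frequencies."""
    class_counts = Counter(labels)
    value_counts = defaultdict(Counter)  # (attribute index, class) -> value counts
    values_seen = defaultdict(set)       # attribute index -> distinct values
    for row, label in zip(rows, labels):
        for i, value in enumerate(row):
            value_counts[(i, label)][value] += 1
            values_seen[i].add(value)
    return class_counts, value_counts, values_seen


def predict(model, row):
    """Return the class with the highest log-posterior for one record."""
    class_counts, value_counts, values_seen = model
    total = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label, n_label in class_counts.items():
        score = math.log(n_label / total)  # log prior
        for i, value in enumerate(row):
            count = value_counts[(i, label)][value]
            # Independence assumption: each attribute contributes its own
            # smoothed conditional probability, multiplied (added in log space).
            score += math.log((count + 1) / (n_label + len(values_seen[i])))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Hypothetical toy records: (grade, lymph node involvement)
rows = [
    ("low", "negative"),
    ("low", "negative"),
    ("low", "positive"),
    ("high", "positive"),
    ("high", "positive"),
    ("high", "negative"),
]
labels = ["survived", "survived", "survived",
          "not survived", "not survived", "survived"]

model = train_naive_bayes(rows, labels)
print(predict(model, ("high", "positive")))  # not survived
print(predict(model, ("low", "negative")))   # survived
```

Working in log space avoids numeric underflow when many attributes are multiplied together, which matters once all fifteen selected attributes are used rather than two.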
DATASET: The SEER dataset of breast cancer incidence was used. SEER is a program of the National Cancer Institute that provides information on cancer statistics in an effort to reduce the burden of cancer among the U.S. population. Each record in the SEER dataset is described with 142 fields. A full description of each field is provided in the SEER documentation. Based on the descriptive statistics provided by the SEER program and several related works, only the attributes considered important in predicting breast cancer survivability were included.

PERFORMANCE METRICS: Three commonly used performance metrics were used to compare the selected classification techniques:[6]
- Accuracy: the percentage of correctly classified instances, = (TP + TN) / (TP + FP + TN + FN)
- Sensitivity: the true positive rate, = TP / (TP + FN)
- Specificity: the true negative rate, = TN / (TN + FP)
where the confusion matrix is used to compute the true positives (TP), false positives (FP), true negatives (TN), and false negatives (FN).
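The three formulas above translate directly into code. The sketch below is a minimal Python version; the function name and the confusion-matrix counts are hypothetical and are not results from this study.

```python
def classification_metrics(tp, fp, tn, fn):
    """Compute accuracy, sensitivity, and specificity from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    sensitivity = tp / (tp + fn)  # true positive rate
    specificity = tn / (tn + fp)  # true negative rate
    return accuracy, sensitivity, specificity


# Hypothetical counts, chosen only for illustration.
accuracy, sensitivity, specificity = classification_metrics(tp=80, fp=10, tn=90, fn=20)
print(accuracy, sensitivity, specificity)  # 0.85 0.8 0.9
```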
Results
In this study, the goal is to achieve high accuracy as well as high sensitivity and specificity. The selected classifiers were run on the final version of the dataset, with missing values and detected outliers removed. In comparison with the other selected models, as shown in Figure 4, the decision tree (C4.5) algorithm outperformed the other classifiers with a classification accuracy of 0.8930, sensitivity of 0.891, and specificity of 0.985. The performance measurements for each prediction algorithm are listed in the following table:

Prediction model               Accuracy    Sensitivity   Specificity
Naive Bayes                    87.3744%    0.874         0.927
C4.5                           89.301%     0.891         0.985
ANN (Multilayer Perceptron)    87.4202%    0.874         0.973
ANN (RBFNetwork)               88.8734%    0.889         0.968
Figure 3: The ranking of the survivability attributes based on information gain (IG) and gain ratio (GR).

Conclusion
In this work, the use of several data mining techniques for predicting breast cancer survivability was studied. The obtained results are promising for applying data mining methods to the survivability prediction problem. These results suggest that, among the prediction algorithms tested, the C4.5 decision tree achieved the best results, confirming published results of related work.[4], [5] Moreover, based on the feature selection analysis and the evaluation of the accuracy of the different prediction models, this study shows that the set of top-ranked features contributes most to prediction accuracy.

Future Work
Due to time constraints, this work has limitations and could be improved in several ways. First, we need all records in the SEER 1973–2010 dataset to be in the same electronic format. Second, acquire more powerful computing resources to run different classification algorithms. Third, use different strategies for dealing with missing values. Finally, the integration of statistical functions should be examined to provide more reliable survivability predictions.
Table: Classification accuracy of each prediction model using the full attribute set and the top 9 attributes.

For the implementation of the selected data mining tools, WEKA (Waikato Environment for Knowledge Analysis) software was used.
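Information gain (IG) and gain ratio (GR), the two measures used in this study to rank the attributes, can be sketched in a few lines of Python. The poster does not spell out the formulas, so the version below assumes the standard textbook definitions: IG is the class entropy minus the class entropy remaining after splitting on the attribute, and GR divides IG by the entropy of the attribute itself (the split information). The function names and the toy data are hypothetical, and this is not WEKA's implementation.

```python
from collections import Counter
import math


def entropy(values):
    """Shannon entropy (in bits) of a list of nominal values."""
    total = len(values)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(values).values())


def information_gain(attribute_values, labels):
    """Class entropy minus the weighted class entropy within each attribute value."""
    total = len(labels)
    by_value = {}
    for value, label in zip(attribute_values, labels):
        by_value.setdefault(value, []).append(label)
    conditional = sum(len(subset) / total * entropy(subset)
                      for subset in by_value.values())
    return entropy(labels) - conditional


def gain_ratio(attribute_values, labels):
    """Information gain normalized by the entropy of the attribute (split information)."""
    split_info = entropy(attribute_values)
    if split_info == 0:
        return 0.0  # the attribute takes a single value and cannot split the data
    return information_gain(attribute_values, labels) / split_info


# Hypothetical toy data: attr_a separates the classes perfectly, attr_b not at all.
labels = ["survived", "survived", "not survived", "not survived"]
attributes = {
    "attr_a": ["x", "x", "y", "y"],
    "attr_b": ["x", "y", "x", "y"],
}
ranking = sorted(attributes,
                 key=lambda name: information_gain(attributes[name], labels),
                 reverse=True)
print(ranking)  # ['attr_a', 'attr_b']
```

Gain ratio is often preferred when attributes differ widely in their number of distinct values, because plain information gain tends to favor attributes with many values.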
SELECTED ATTRIBUTES ANALYSIS: Further analysis of the selected features was conducted to identify which attributes had a higher impact on the prediction results. The ranking method, based on information gain (IG) and gain ratio (GR) measurements, was used. Information gain measures how much additional information an attribute contributes about the class, computed as the difference in entropy. Based on the feature selection analysis, the accuracy of the selected models was evaluated using the top nine attributes (Histologic type, Tumor size, Number of nodes, Grade, Site specific surgery code, Behavior code, Extension, Number of positive nodes, and Lymph node involvement). As shown in the accuracy comparison table, the accuracy achieved by the prediction models when using the top nine features is very similar to the accuracy when using the full set of features. Based on these results, we can infer that this set of top-ranked features contributed most to prediction accuracy.

Figure 3 attribute axis (in plotted order): Number of primaries, Primary site, Marital status, Age, Race, Radiation, Histologic type, Tumor size, Number of nodes, Grade, Site specific surgery code, Behavior code, Extension, Number of positive nodes, Lymph node involvement.

Figure 1: Breast cancer represents 14.1% of all new cancer cases in the U.S.

Breast cancer accounts for 16% of all cancers in females, 22.9% of invasive cancers in women, and 18.2% of all cancer deaths worldwide, including both males and females.[3] According to the National Cancer Institute (NCI), an estimated 232,340 women will be diagnosed with breast cancer and 39,620 women will die of it in 2013.[2] An accurate prediction of breast cancer survivability is very important, as it often directs the treatment process and provides reliable indications for public health authorities. Predicting the outcome of a disease is one of the most interesting and challenging tasks where data mining techniques can be used.

Figure 4: The classification accuracy of the used prediction algorithms. (x-axis: Prediction model: Naïve Bayes, C4.5, Multilayer Perceptron, RBFNetwork; y-axis: Accuracy.)

References
[1] Howlader N, et al. SEER Cancer Statistics Review, 1975-2010, National Cancer Institute. Bethesda, MD, http://seer.cancer.gov/csr/1975_2010/, based on November 2012 SEER data submission.
[2] Ries LAG, et al. Chapter 13: Cancer of the Female Breast. National Cancer Institute, SEER Program, NIH Pub. No. 07-6215, Bethesda, MD, 2007.
[3] Surveillance, Epidemiology, and End Results (SEER) Program (www.seer.cancer.gov) Research Data (1973-2010), National Cancer Institute, DCCPS, Surveillance Research Program, Surveillance Systems Branch, released April 2013.
[4] S. Vijiyarani, S. Sudha. Disease Prediction in Data Mining Technique: A Survey. International Journal of Computer Applications & Information Technology. Vol. II, Issue I, 2013. ISSN: 2278-7720.
[5] Delen D, Walker G, Kadam A. Predicting breast cancer survivability: a comparison of three data mining methods. Artificial Intelligence in Medicine. 34(2), 2005, pages 113-27.
[6] Pang-Ning Tan, Michael Steinbach, Vipin Kumar. Introduction to Data Mining. Addison-Wesley. 2005.