Data Mining and Information Discovery
Final Project

Advisor: Professor 黃三益
Team members:
M964011062 Vipin Saini
M964020009 許博淞
M964020043 陳昀志

Contents
Step 1. Introduction
Step 2. Select appropriate data
Step 3. Data Preparation
Step 4. Creating a Subset
Step 5. Initial Exploration
Step 6. The Training and Testing Approach
Step 7. Assess models
Step 8. Deploy models
Step 9. Assess result
Appendix

Step 1. Introduction
Direct marketing companies have to spend money on every lead they contact, whether by telephone, mail, fax, or in person. Most leads never respond. The small percentage that do respond is called the response rate. If the response rate is very low, a company may find itself spending more money on contacting leads than it makes in sales to those leads, resulting in a negative return on investment. The goal is therefore to improve the response rate before spending money on contacting leads. To reach this goal, companies can collect data and analyze it to identify the best leads.
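The return-on-investment argument above can be made concrete with a little arithmetic. The sketch below is only an illustration: the function names and all the numbers (10,000 contacts, a cost of 2 per contact, a revenue of 100 per sale) are hypothetical and do not come from the project data.

```python
def campaign_profit(contacts, response_rate, revenue_per_sale, cost_per_contact):
    """Expected profit when every contact costs money and only responders buy."""
    expected_sales = contacts * response_rate
    return expected_sales * revenue_per_sale - contacts * cost_per_contact


def break_even_response_rate(revenue_per_sale, cost_per_contact):
    """Response rate at which the campaign neither gains nor loses money."""
    return cost_per_contact / revenue_per_sale


if __name__ == "__main__":
    # Hypothetical campaign: 10,000 contacts, cost 2 per contact, revenue 100 per sale.
    print(break_even_response_rate(100.0, 2.0))
    print(campaign_profit(10_000, 0.01, 100.0, 2.0))  # below break-even: a loss
    print(campaign_profit(10_000, 0.05, 100.0, 2.0))  # above break-even: a gain
```

With these made-up numbers the campaign loses money at a 1% response rate and makes money at 5%, which is why selecting likely responders before contacting them matters.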
The analysis is based on a real consulting project for a telecommunications company. The data and names have been disguised. The company was planning to introduce a new DSL service, and wanted to call prospective customers and send out mass mailings. Because the marketing campaign was so expensive, the company turned to data mining to reduce the cost of the campaign and, at the same time, increase the potential revenue. The data consists of publicly available business data on customers contacted in the past, along with a column indicating whether or not each customer responded to the offer. The company selected a list of known responders, and then added a few thousand random companies (not contacted in the past) to serve as examples of non-responders. We use PolyAnalyst to generate a response model that helps the direct marketer reduce marketing expenses. Contacting only predicted responders, rather than contacting everyone at random, reduces the campaign cost.

Step 2. Select appropriate data
The dataset contains 13,117 records, each corresponding to a prospective business. Each prospect is described by characteristics such as:
- Number of employees at a particular office
- Number of employees for the entire company
- Annual sales (in thousands) at a particular office
- Annual sales (in thousands) for the entire company
- Whether or not the company does business outside the United States
- Annual advertising expense
- Whether the company has moved recently or is a new business
- The type of ownership
- Specific industry code
- General industry code
- Age of the company (in years)

Correcting the data types:
- Make sure "Buyer" is of type Yes/No.
- Change the type of "Age" to integer.
- Make sure the "International" type is string or Boolean.
- Change "Local Employees" to integer.
- Change "Local Sales" to integer.
- Change "Industry Type" to categorical.
- Change "Total Employees" to integer.
- Change "Total Sales" to integer.
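In this project the type corrections are made inside PolyAnalyst. For readers who want to see the same idea outside the tool, here is a minimal plain-Python sketch. The column names are the ones quoted in the list above; the helper names (`parse_yes_no`, `parse_int`, `coerce_record`) and the raw encodings they accept (for example "1"/"0" or "Yes"/"No") are assumptions made for illustration, not the actual file format.

```python
# Columns that the list above says should be integers.
INTEGER_COLUMNS = (
    "Age", "Local Employees", "Local Sales", "Total Employees", "Total Sales",
)


def parse_yes_no(value):
    """Map common yes/no spellings to True/False; anything else becomes None."""
    text = str(value).strip().lower()
    if text in {"yes", "y", "1", "true"}:
        return True
    if text in {"no", "n", "0", "false"}:
        return False
    return None


def parse_int(value):
    """Convert to int; missing or unparsable values become None."""
    try:
        return int(float(value))
    except (TypeError, ValueError, OverflowError):
        return None


def coerce_record(raw):
    """Return a copy of one raw record (a dict of strings) with corrected types."""
    record = dict(raw)
    record["Buyer"] = parse_yes_no(raw.get("Buyer"))
    record["International"] = parse_yes_no(raw.get("International"))
    for column in INTEGER_COLUMNS:
        record[column] = parse_int(raw.get(column))
    # A categorical value is kept as a plain label.
    category = raw.get("Industry Type")
    record["Industry Type"] = None if category is None else str(category).strip()
    return record


if __name__ == "__main__":
    # Hypothetical raw row; the values are invented.
    print(coerce_record({"Buyer": "1", "Age": "7", "International": "No",
                         "Local Employees": "23", "Local Sales": "",
                         "Industry Type": "C", "Total Employees": "100",
                         "Total Sales": "5000"}))
```

Missing values are kept as None rather than dropped, which matches the choice described in the Conclusion of keeping records with missing values.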
The pie charts below describe the distribution of the records.

Age in years:
1~4: 21%
5~8: 34%
9~12: 45%
Figure 1. Pie chart of age.

Industry Category:
[Pie chart over Industry Category values A to I. Category C is the largest slice at 58%; the remaining slices are 12%, 11%, 7%, 5%, 4%, 2%, 1% and 0%.]
Figure 2. Pie chart of Industry Category.

Buyer (the class label: 0 for No and 1 for Yes):
1 (Yes): 44%
0 (No): 56%
Figure 3. Pie chart of Buyer.

Step 3. Data Preparation
Employee counts and sales figures vary with the size of the company, so all of these characteristics mainly describe company size. Employees and sales are also influenced by other company characteristics such as productivity, profit margin, and the number of branch offices. The current data contains too many measures of size. A huge company with low productivity might still have a higher sales figure than a small company. To make these measures independent of company size, we use ratios of branch sales and branch employees. We create a few rules that give us more helpful attributes.
Figure 4. Creating the rules.

Step 4. Creating a Subset
With the newly applied rules, the World dataset now has redundant columns. For convenience, we create a new dataset with a different set of columns.
Figure 5. Creating the data subset without some attributes.

Step 5. Initial Exploration
Next we look at the statistics for the different attributes in the Explored dataset. This requires a basic understanding of the dataset statistics panel, which is covered in the earlier Mpg analysis tutorial.
Figure 6. Removing the attributes with too many different values.

Step 6. The Training and Testing Approach
A big problem in data analysis is that sometimes the models we develop on the data are not accurate.
Sometimes they really only represent patterns in the one dataset we analyze, and the patterns do not generalize to new data. This is sometimes referred to as "overfitting", because the model only fits the current data. If you are working with a sample of all your leads and build a prediction on that sample, you want to make sure that the prediction that works for the sample also works for all of your leads.
To deal with this problem, a common data analysis practice is to partition the data into two subsets. One set is used to develop the model and is called the training set; its data is used to "train" the model. The second set is called the testing set. We take the model developed on the training set, apply it to the testing set, and evaluate how well it works. This helps support the validity of the model. If the model does not work on the testing set, it is probably because it only reflects patterns in the training set and not patterns that hold for all the data. We create the training and testing sets before we continue.
Figure 7. Selecting the training set.

Step 7. Assess models
Figure 8. Decision tree.

Decision tree leaves and nodes for the training set:
Number of non-terminal nodes: 41
Number of leaves: 91
Depth of the tree: 8

Result of the decision tree on the training set:
Real/predict   No     Yes    undefined
No             3018   528    49
Yes            379    2535   49

Some information about the decision tree:
Total classification error: 14.04%
Percentage of undefined prediction cases: 1.49%
Classification accuracy: 85.96%
Classification error for class No: 14.89%
Classification error for class Yes: 13.01%

The classification error is low (14.04%) and the classification accuracy is above 80% (85.96%), so this appears to be a good model for prediction.
Figure 9. Decision tree report for the training set.

The lift chart below shows the performance of the classification model (decision tree).
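The percentages in the training-set report of this step can be re-derived from its confusion matrix. The sketch below is a plain Python check, not part of the PolyAnalyst workflow. It assumes that the error rates are computed only over records with a defined prediction; this is an inference from the reported numbers, not something stated in the report. The function name `tree_report` and the dictionary layout are illustrative.

```python
def tree_report(matrix):
    """Summarize a 2-class confusion matrix with an 'undefined' column.

    matrix[actual][predicted] holds record counts, where actual is "No" or
    "Yes" and predicted is "No", "Yes" or "undefined".
    Assumption: error rates ignore records whose prediction is undefined.
    """
    defined = sum(row["No"] + row["Yes"] for row in matrix.values())
    undefined = sum(row["undefined"] for row in matrix.values())
    wrong = matrix["No"]["Yes"] + matrix["Yes"]["No"]
    error = wrong / defined
    return {
        "total_error_pct": 100 * error,
        "accuracy_pct": 100 * (1 - error),
        "undefined_pct": 100 * undefined / (defined + undefined),
        "error_no_pct": 100 * matrix["No"]["Yes"]
        / (matrix["No"]["No"] + matrix["No"]["Yes"]),
        "error_yes_pct": 100 * matrix["Yes"]["No"]
        / (matrix["Yes"]["No"] + matrix["Yes"]["Yes"]),
    }


# Training-set confusion matrix as reported in this step.
TRAINING = {
    "No": {"No": 3018, "Yes": 528, "undefined": 49},
    "Yes": {"No": 379, "Yes": 2535, "undefined": 49},
}

if __name__ == "__main__":
    for name, value in tree_report(TRAINING).items():
        print(f"{name}: {value:.2f}")
```

Rounded to two decimals, these values agree with the five percentages listed in the report, which supports the assumption that undefined predictions are left out of the error rates.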
According to the lift chart, the top 40% of the records ranked by the model contain about 80% of the actual responders, so the decision tree performs well as a predictor.
Figure 10. Lift chart of the decision tree model.

Figure 11 shows the structure of the decision tree. Some nodes are omitted from the figure because they do not give a conspicuous classification result. The root of the decision tree splits on the number of Local Employees.
[Figure: partial decision tree. The root splits on Local Employee < 23 versus Local Employee >= 23. Lower levels split on Age (< 2 / >= 2 and < 3 / >= 3), Sales Ratio (< 0.0027 / >= 0.0027 / N/A), Industry Category (A to I), Local Employee Ratio (threshold 0.214), and Employee (< 10 / >= 10).]
Figure 11. Some nodes of the decision tree. (A larger version is in the Appendix.)

Step 8. Deploy models
We also need to consider the performance of the decision tree on the testing set. The testing set is a randomly selected 50% of the records in the whole dataset. Figure 12 is the performance report.
Figure 12. Report of the decision tree on the testing set.

We apply the decision tree model to the testing set to see how accurate the model is.
Real/predict   No     Yes    undefined
No             3074   610    45
Yes            396    2395   39

Some information about the model applied to the testing set:
Total classification error: 15.54%
Percentage of undefined prediction cases: 1.28%
Classification accuracy: 84.46%
Classification error for class No: 16.56%
Classification error for class Yes: 14.19%

Step 9. Assess result
At level 2 of the tree, one subtree covers companies with 23 or more local employees, split further by Industry Category (Figure 13). Almost every company in this subtree has a higher tendency to respond (class label Yes, 75.5%). In other words, a bigger company with more employees is more likely to respond.
Figure 13. General view of the decision tree.

On the other hand, companies with fewer than 23 local employees are in general likely not to respond (class label No, 72.9%), as shown in Figure 14. In other words, a small company does not tend to respond.
Figure 14. General view of the decision tree.

So far we have shown a general view of the decision tree. There are also some special cases. Companies with 23 or more local employees are likely to respond in almost every Industry Category, but one Industry Category behaves a little differently from the others. For a bigger company in Industry Category E, if the Local Employee Ratio is smaller than 0.214, the response ratio is low (class label No, 85.7%); if the Local Employee Ratio is larger than 0.214, the response ratio is high (class label Yes, 66.2%). So for bigger companies in Industry Category E, the response depends on the Local Employee Ratio. These results are shown in Figure 15 and Figure 16.
Figure 15. The Local Employee Ratio is smaller than 0.214 and the response ratio is low.
Figure 16. The Local Employee Ratio is more than 0.214 and the response ratio is high.
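The branches described so far in this step can be written as a small rule function. This is only a coarse reading of the text, not the full tree (which has 91 leaves): the function name `predict_branch` is hypothetical, the returned percentages are the class ratios quoted above, and treating a ratio of exactly 0.214 as the "high" branch is an assumption, because the text only states "smaller than" and "more than".

```python
def predict_branch(local_employees, industry_category, local_employee_ratio):
    """Return (class label, quoted class ratio in %) for the branches discussed above.

    Assumption: a Local Employee Ratio of exactly 0.214 falls in the "Yes" branch.
    """
    if local_employees < 23:
        # Small companies are in general unlikely to respond.
        return ("No", 72.9)
    if industry_category == "E":
        # Special case: bigger companies in Industry Category E.
        if local_employee_ratio < 0.214:
            return ("No", 85.7)
        return ("Yes", 66.2)
    # Bigger companies in the other industry categories.
    return ("Yes", 75.5)


if __name__ == "__main__":
    # Hypothetical prospects, invented for illustration.
    print(predict_branch(10, "C", 0.50))
    print(predict_branch(50, "E", 0.10))
    print(predict_branch(50, "E", 0.30))
    print(predict_branch(50, "C", 0.30))
```

Writing the branches this way makes it easy to see that Industry Category E is the only case above in which a bigger company is predicted not to respond.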
Another special case concerns companies that have fewer than 23 local employees and are less than 2 years old. Under these conditions, if the Sales Ratio is more than 0.27% (0.0027), the response ratio is high (class label Yes, 98.2%). In other words, a newly founded company with a good sales ratio tends to respond. A possible reason is that a new company tries to build a competitive advantage over other new companies, so it tries to obtain every resource it can get.
Figure 17. Special nodes in the decision tree.

According to these results, when we decide which target market to advertise to, we can pay more attention to companies with these special characteristics and take them into consideration.

Conclusion
We propose a classification approach to target marketing. The final result is in line with what we hoped to find, but there are some limitations. First, we lack the information on how the values of the Industry Category attribute map to actual industries. With this information, we could obtain more detailed and more interesting mining results. Second, we did not remove the records with missing values, because we think such records may still carry important information, and even with them we can still build a decision tree with good performance. Overall, the classification model (decision tree) we built still has room for improvement.

Appendix
[Figure: larger version of the decision tree in Figure 11. The root splits on Local Employee < 23 versus Local Employee >= 23. Lower levels split on Age (< 2 / >= 2 and < 3 / >= 3), Sales Ratio (< 0.0027 / >= 0.0027 / N/A), Industry Category (A to I), Local Employee Ratio (threshold 0.214), and Employee (< 10 / >= 10).]