Data Mining and Information Discovery
Final Project
Advisor: Professor 黃三益
Team members:
M964011062 Vipin Saini
M964020009 許博淞
M964020043 陳昀志
Data Mining and Information Discovery-Final Project
Contents
Step 1. Introduction
Step 2. Select appropriate data
Step 3. Data Preparation
Step 4. Creating a Subset
Step 5. Initial Exploration
Step 6. The Training and Testing Approach
Step 7. Assess models
Step 8. Deploy models
Step 9. Assess result
Appendix
Step 1. Introduction
Direct marketing companies have to spend money on every lead they
contact, whether by telephone, mail, fax, or in person. Most leads never
respond. The small percentage that do respond is called the response rate.
If the response rate is very low, a company may find itself spending more
money on contacting leads than it makes in sales to those leads, resulting
in a negative return on investment. The goal is then to improve the
response rate before spending money on contacting leads. To reach this goal,
companies can collect data and analyze it to identify the best leads.
The analysis is based on a real consulting project for a
telecommunications company. The data and names have been disguised. The
company was planning to introduce a new DSL service, and wanted to call
prospective customers and send out mass-mailings. Since the marketing
campaign was so expensive, the company turned to data mining to reduce
the cost of the campaign, and at the same time increase the potential
revenue.
The data consists of publicly available business data on customers
contacted in the past, along with a column indicating whether or not
customers responded to the offer. The company selected a list of known
responders, and then added to the list a few thousand random companies
(not contacted in the past) to serve as examples of non-responders. We will
use PolyAnalyst to generate a response model that would help the direct
marketer reduce marketing expenses. Contacting only predicted responders,
rather than randomly contacting everyone, will reduce campaign cost.
Step 2. Select appropriate data
We obtained 13,117 records, each corresponding to a prospective business.
Each prospect is described by characteristics such as:

- Number of employees at a particular office
- Number of employees for the entire company
- Annual sales (in thousands) at a particular office
- Annual sales (in thousands) for the entire company
- Whether or not the company does business outside the United States
- Annual advertising expense
- Whether the company has moved recently or is a new business
- The type of ownership
- Specific industry code
- General industry code
- Age of the company (in years)
We then corrected the data types:
- Make sure "Buyer" is of type Yes/No.
- Change the type of "Age" to integer.
- Make sure the "International" type is string or Boolean.
- Change "Local Employees" to integer.
- Change "Local Sales" to integer.
- Change "Industry Type" to categorical.
- Change "Total Employees" to integer.
- Change "Total Sales" to integer.
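The same type corrections can be sketched outside PolyAnalyst. The following is a minimal Python sketch, assuming each record arrives as a dict of raw strings keyed by the column names quoted above. The helper names (to_int, to_bool, coerce_record) and the accepted Yes/No spellings are illustrative assumptions, not part of the project.

```python
from typing import Optional

# Columns the report converts to integer.
INT_COLUMNS = ["Age", "Local Employees", "Local Sales",
               "Total Employees", "Total Sales"]


def to_int(value) -> Optional[int]:
    """Parse an integer; blank or unparsable values become None (missing)."""
    try:
        return int(float(value.replace(",", "")))
    except (ValueError, AttributeError, OverflowError):
        return None


def to_bool(value) -> Optional[bool]:
    """Map common Yes/No spellings to a Boolean (assumed encodings)."""
    text = str(value).strip().lower()
    if text in {"yes", "y", "1", "true"}:
        return True
    if text in {"no", "n", "0", "false"}:
        return False
    return None


def coerce_record(raw: dict) -> dict:
    """Return a copy of one raw record with the corrected data types."""
    record = dict(raw)
    for column in INT_COLUMNS:
        record[column] = to_int(raw.get(column, ""))
    record["Buyer"] = to_bool(raw.get("Buyer", ""))
    record["International"] = to_bool(raw.get("International", ""))
    # Categorical values are kept as trimmed labels.
    record["Industry Type"] = str(raw.get("Industry Type", "")).strip()
    return record
```

Missing or malformed values are kept as None rather than dropped, which matches the decision (stated in the Conclusion) not to remove records with missing values.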
The pie charts below describe the distribution of the records.

Age in years: 1~4: 21%; 5~8: 34%; 9~12: 45%.
Figure 1 a pie chart with age.

Industry Category: C: 58%; F: 12%; G: 11%; E: 2%; D: 0%. The remaining
categories A, B, H, and I account for the other 17% (slices of 7%, 5%, 4%,
and 1%).
Figure 2 a pie chart with Industry category.

Buyer (the class label, 0 for No and 1 for Yes): 1: 44%; 0: 56%.
Figure 3 a pie chart with buyer.
Step 3. Data Preparation
The number of employees and the sales figures vary with the size of the
company, so all of these characteristics paint a picture of company size.
Employees and sales are also influenced by other company characteristics
such as productivity, profit margin, and the number of branch offices. The
current data therefore contains too many measures of size: a huge company
with low productivity might still have a higher sales figure than a small,
productive company. Let's make the measures proportional to company size by
using ratios of branch sales and employees. We will create a few rules to
give us more helpful attributes.
Figure 4 creating the rules
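The rules themselves appear only in the screenshot, so the sketch below is a guess at their shape rather than the project's actual rules. It assumes that "Local Employee Ratio" and "Sales Ratio" (the attribute names that appear later in the decision tree) are the branch value divided by the company-wide value. The function names safe_ratio and add_ratio_attributes are hypothetical.

```python
from typing import Optional


def safe_ratio(numerator: Optional[float],
               denominator: Optional[float]) -> Optional[float]:
    """Return numerator / denominator, or None when the ratio is undefined."""
    if numerator is None or denominator is None or denominator == 0:
        return None
    return numerator / denominator


def add_ratio_attributes(record: dict) -> dict:
    """Return a copy of the record with two size-independent attributes."""
    enriched = dict(record)
    # Assumed definition: share of the company's staff working at this office.
    enriched["Local Employee Ratio"] = safe_ratio(
        record.get("Local Employees"), record.get("Total Employees"))
    # Assumed definition: share of the company's sales made by this office.
    enriched["Sales Ratio"] = safe_ratio(
        record.get("Local Sales"), record.get("Total Sales"))
    return enriched
```

Returning None for an undefined ratio keeps such records in the dataset, in line with the "Sales Ratio = N/A" branch that shows up in the tree.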
Step 4. Creating a Subset
With our newly applied rules, the World dataset now has redundant
columns. For convenience, let's create a new dataset without the redundant columns.
Figure 5 creating the data subset without some attributes.
Step 5. Initial Exploration
Let's take a look at the statistics for the different attributes in the
Explored dataset. This requires a basic understanding of working with the
dataset statistics panel, which is covered in the earlier Mpg analysis tutorial.
Figure 6 remove the attributes with too many different values.
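Removing attributes with too many different values can be expressed as a simple filter. The sketch below is an illustration under assumptions: the threshold of 50 distinct values, the function names, and the choice to look only at non-numeric columns are not taken from the project, which performed this step interactively in PolyAnalyst.

```python
def high_cardinality_columns(records, max_distinct=50, skip=("Buyer",)):
    """Return the non-numeric columns that have more than max_distinct values."""
    distinct = {}
    for record in records:
        for column, value in record.items():
            # Numeric attributes are skipped: many distinct values are normal there.
            if column in skip or isinstance(value, (int, float)):
                continue
            distinct.setdefault(column, set()).add(value)
    return sorted(column for column, values in distinct.items()
                  if len(values) > max_distinct)


def drop_columns(records, columns):
    """Return copies of the records without the given columns."""
    return [{key: value for key, value in record.items() if key not in columns}
            for record in records]
```

A categorical attribute with almost as many values as records cannot generalize, so a tree that splits on it would mostly memorize the training data.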
Step 6. The Training and Testing Approach
A big problem in data analysis is that sometimes the models we develop
on the data are not accurate. Sometimes they really only represent patterns
in the one dataset we analyze, and the patterns do not generalize to new
data. This is sometimes referred to as "overfitting", as our model only fits the
current data. If you are working with a sample of all your leads, and find a
prediction for a sample of the leads, you want to make sure that the
prediction that works for the sample works for all of your leads.
To deal with this problem, a common data analysis practice is to
partition the data into two subsets. One set is used to develop the model,
and is called the training set. The data in the training set is used to "train" the
model. The second set is called the testing set. We take the model developed
by the training set, and apply it to the testing set, and evaluate how well the
model works. This helps support the validity of the model. If it does not work,
it's probably because our model only reflects patterns in the training set, and
not patterns for all the data. Let's create training and testing sets before we
continue.
Figure 7 selecting the training set.
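The partition described above can be sketched in a few lines of Python. This is a minimal illustration, assuming a random 50/50 split as stated in Step 8; the function name split_train_test and the fixed seed are assumptions made for reproducibility, not settings taken from PolyAnalyst.

```python
import random


def split_train_test(records, train_fraction=0.5, seed=42):
    """Randomly partition records into a training set and a testing set."""
    indices = list(range(len(records)))
    random.Random(seed).shuffle(indices)  # seeded, so the split is repeatable
    cut = int(len(records) * train_fraction)
    train = [records[i] for i in indices[:cut]]
    test = [records[i] for i in indices[cut:]]
    return train, test
```

With 13,117 records and a fraction of 0.5, this yields 6,558 training and 6,559 testing records, which matches the row totals of the result tables in Steps 7 and 8.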
Step 7. Assess models
Figure 8 decision Tree.
Decision tree's leaves and nodes for the training set:
Number of non-terminal nodes: 41
Number of leaves: 91
Depth of the tree: 8
Here is the result of the decision tree on the training set (rows are the
real class, columns the predicted class):
Real/predict    No      Yes     undefined
No              3018    528     49
Yes             379     2535    49
Some information about the decision tree we made:
- Total classification error: 14.04%
- Percentage of undefined prediction cases: 1.49%
- Classification accuracy: 85.96%
- Classification error for class No: 14.89%
- Classification error for class Yes: 13.01%
The classification error is low (14.04%) and the classification accuracy is
over 80%, so this appears to be a good predictive model.
Figure 9 decision Tree with training set report.
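The reported percentages can be recomputed from the result table. The sketch below assumes the convention that appears to be in use: error and accuracy are measured over the records that received a defined prediction, while the undefined percentage is measured over all records. The function name classification_report is illustrative; the counts are those of the training-set table above.

```python
def classification_report(matrix):
    """matrix[actual][predicted] holds counts; predicted may be 'undefined'."""
    total = sum(sum(row.values()) for row in matrix.values())
    undefined = sum(row.get("undefined", 0) for row in matrix.values())
    defined = total - undefined
    wrong = sum(count
                for actual, row in matrix.items()
                for predicted, count in row.items()
                if predicted not in (actual, "undefined"))
    report = {
        "total_error_pct": 100 * wrong / defined,
        "undefined_pct": 100 * undefined / total,
        "accuracy_pct": 100 * (defined - wrong) / defined,
    }
    for actual, row in matrix.items():
        # Per-class error also ignores the undefined predictions.
        class_defined = sum(c for p, c in row.items() if p != "undefined")
        report["error_%s_pct" % actual] = (
            100 * (class_defined - row[actual]) / class_defined)
    return report


training = {
    "No": {"No": 3018, "Yes": 528, "undefined": 49},
    "Yes": {"No": 379, "Yes": 2535, "undefined": 49},
}
for name, value in classification_report(training).items():
    print("%s: %.2f" % (name, value))
```

Under this convention the five figures listed above are reproduced to two decimal places, for example 907 misclassified out of 6,460 defined predictions for the total error.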
Figure 10 shows the performance of the classification model (decision
tree) as a lift chart. If we contact only the top 40% of the records as
ranked by the model, we capture about 80% of the actual responders.
According to the lift chart, this decision tree predicts well.
Figure 10 a lift chart of the decision model.
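The reading of a lift chart can be made concrete with a small computation. The following sketch assumes each record has a 0/1 response label and a model score where higher means more likely to respond. The function names and the toy data in the usage note are illustrative and are not the project's data.

```python
def cumulative_gains(labels, scores, fractions=(0.1, 0.2, 0.4, 0.6, 0.8, 1.0)):
    """For each fraction of records contacted (best scores first), return the
    share of all responders captured. Assumes at least one responder."""
    ranked = [label for _, label in
              sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)]
    total_responders = sum(ranked)
    gains = {}
    for fraction in fractions:
        contacted = round(len(ranked) * fraction)
        gains[fraction] = sum(ranked[:contacted]) / total_responders
    return gains


def lift(gains):
    """Lift = share of responders captured / share of records contacted."""
    return {fraction: captured / fraction for fraction, captured in gains.items()}
```

For a toy list of ten records with labels [1, 1, 1, 1, 0, 1, 0, 0, 0, 0] already sorted by descending score, contacting the top 40% captures 4 of the 5 responders, a gain of 0.8 and a lift of 2 over random contact. This is the same kind of statement as the 40% and 80% reading above.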
Figure 11 shows the structure of the decision tree we made. Some nodes are
omitted from the figure because they do not give a conspicuous classification
result. The root of the decision tree splits on the number of local employees.
[Decision tree diagram. Node labels: Root; Local Employee < 23; Local
Employee >= 23; Age < 3; Age < 2; Age >= 3; Age >= 2; Sales Ratio < 0.0027;
Sales Ratio >= 0.0027; Sales Ratio = N/A; Industry Category = C, H, F, E, D,
A, B, G, I; Local Employee Ratio < 0.214 (shown on two nodes); Local
Employee < 10; Local Employee >= 10.]
Figure 11 some nodes of the decision tree we made [1]
Step 8. Deploy models
We also need to evaluate the performance of the decision tree on the
testing set. The testing set is a randomly selected 50% of the records from
the whole dataset. Figure 12 is the performance report.
[1] A bigger graph is in the Appendix.
Figure 12 the report of the decision tree with testing set.
We apply this decision tree model to the testing set to see how accurate
the model is.
Real/predict    No      Yes     undefined
No              3074    610     45
Yes             396     2395    39
Some information about applying the model to the testing set:
- Total classification error: 15.54%
- Percentage of undefined prediction cases: 1.28%
- Classification accuracy: 84.46%
- Classification error for class No: 16.56%
- Classification error for class Yes: 14.19%
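Comparing the training and testing error is the overfitting check described in Step 6. The sketch below recomputes both error rates from the counts in the two result tables (Steps 7 and 8), ignoring undefined predictions, and reports the gap. The function name error_rate is illustrative.

```python
def error_rate(matrix):
    """Misclassified share of the records that received a defined prediction."""
    wrong = matrix["No"]["Yes"] + matrix["Yes"]["No"]
    right = matrix["No"]["No"] + matrix["Yes"]["Yes"]
    return wrong / (wrong + right)


# Counts copied from the result tables; matrix[actual][predicted].
training = {"No": {"No": 3018, "Yes": 528}, "Yes": {"No": 379, "Yes": 2535}}
testing = {"No": {"No": 3074, "Yes": 610}, "Yes": {"No": 396, "Yes": 2395}}

train_error = error_rate(training)
test_error = error_rate(testing)
gap_points = 100 * (test_error - train_error)
print("train %.2f%%, test %.2f%%, gap %.2f points"
      % (100 * train_error, 100 * test_error, gap_points))
```

The testing error is about 1.5 percentage points higher than the training error. A gap this small suggests that the tree reflects patterns that carry over to unseen records and not only patterns of the training set.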
Step 9. Assess result
At level 2 of the tree, Figure 13 shows the subtree for companies with 23 or
more local employees and how each Industry Category tends to respond.
Almost every company with 23 or more local employees has a higher tendency
to respond (class label Yes, ratio 75.5%). In other words, bigger companies
with more employees are more likely to respond.
Figure 13 general view in Decision tree.
On the other hand, companies with fewer than 23 local employees are, in
general, unlikely to respond (class label No, ratio 72.9%), as shown in
Figure 14. In other words, small companies tend not to respond.
Figure 14 general view in Decision tree
So far we have shown a general view of the decision tree. Next come some
special cases in the tree.
Companies with 23 or more local employees are likely to respond in almost
every Industry Category, but one category behaves a little differently. For
bigger companies in Industry Category E, if the Local Employee Ratio is
smaller than 0.214 the response ratio is low (class label No, ratio 85.7%);
if the Local Employee Ratio is 0.214 or larger the response ratio is high
(class label Yes, ratio 66.2%). In other words, for bigger companies in
Industry Category E the response depends on the Local Employee Ratio. We
show these results in Figure 15 and Figure 16.
Figure 15 the Local Employee Ratio is smaller than 0.214 and the response ratio is low.
Figure 16 the Local Employee Ratio is more than 0.214 and the response ratio is high.
Next, consider companies that have fewer than 23 local employees and are
not older than 2 years. Under these conditions, if the Sales Ratio is 0.27%
or more, the response ratio is high (class label Yes, ratio 98.2%). We can
say that a newly founded company with a good sales ratio is likely to
respond. A possible reason is that a new company tries to build a
competitive advantage over other new companies, so it tries to obtain any
resource it can get.
Figure 17 special nodes in decision tree.
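The branches discussed in this step can be summarized as a small rule function. This is a simplification for illustration only: it encodes just the majority class of the branches described in the text, not the full tree with its 91 leaves. The function name, the reading of "not older than 2" as age <= 2, and the treatment of missing ratios are assumptions.

```python
from typing import Optional


def highlighted_rule(local_employees: int, age: int, industry_category: str,
                     local_employee_ratio: Optional[float],
                     sales_ratio: Optional[float]) -> str:
    """Return 'Yes' or 'No' using only the branches discussed in the text."""
    if local_employees >= 23:
        if industry_category == "E" and local_employee_ratio is not None:
            # Special case: bigger companies in Industry Category E.
            # Below 0.214 the majority is No (85.7%), otherwise Yes (66.2%).
            return "No" if local_employee_ratio < 0.214 else "Yes"
        return "Yes"  # general tendency of bigger companies (75.5%)
    if age <= 2 and sales_ratio is not None and sales_ratio >= 0.0027:
        return "Yes"  # young small companies with a good sales ratio (98.2%)
    return "No"  # general tendency of smaller companies (72.9%)
```

For example, a company with 50 local employees in Industry Category E and a Local Employee Ratio of 0.1 is labeled No, while the same company with a ratio of 0.3 is labeled Yes.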
According to these results, when we decide which target market to
advertise to, we can pay more attention to companies with these special
characteristics and take them into consideration.
Conclusion
We propose a classification approach to target marketing. The final
result is as good as we had hoped, but there are some limitations. First, we
lack information about how industries are grouped into the values of the
Industry Category attribute. With this information we could obtain more
detailed and more interesting mining results. Second, we did not remove the
records with missing values, because we think such records may still carry
important information, and without removing them we can still build a
decision tree with good performance.
All in all, the classification model (decision tree) we made still has
room for improvement.
Appendix
[Larger version of the decision tree in Figure 11. Node labels: Root; Local
Employee < 23; Local Employee >= 23; Age < 3; Age < 2; Age >= 3; Age >= 2;
Sales Ratio < 0.0027; Sales Ratio >= 0.0027; Sales Ratio = N/A; Industry
Category = C, H, F, E, D, A, B, G, I; Local Employee < 10; Local
Employee >= 10; Local Employee Ratio < 0.214 (shown on two nodes).]