eBusiness Tutorial

Issues in Data Mining Applications
- Tutorial
How to Make a Decision About Your Own Data Mining Tool?
Authors:
Nemanja Jovanovic, [email protected]
Valentina Milenkovic, [email protected]
Prof. Dr. Veljko Milutinovic, [email protected]
Data Mining vs. Knowledge Mining = ?
Evolution Of Data Mining
Data Collection (1960s)
– Business question: What was my average total revenue over the last 5 years?
– Enabling technologies: computers, tapes, disks
– Product providers: IBM, CDC
– Characteristics: retrospective, static data delivery

Data Access (1980s)
– Business question: What were unit sales in New England last March?
– Enabling technologies: RDBMS, SQL, ODBC
– Product providers: Oracle, Sybase, Informix, IBM, Microsoft
– Characteristics: retrospective, dynamic data delivery at record level

Data Navigation (1990s)
– Business question: What were unit sales in New England last March? Drill down to Boston.
– Enabling technologies: OLAP, multidimensional databases, data warehouses
– Product providers: Pilot, IRI, Arbor, Redbrick, Evolutionary Technologies
– Characteristics: retrospective, dynamic data delivery at multiple levels

Data Mining (2000)
– Business question: What’s likely to happen to Boston unit sales next month? Why?
– Enabling technologies: advanced algorithms, multiprocessors, massive databases
– Product providers: Lockheed, IBM, SGI, numerous startups
– Characteristics: prospective, proactive information delivery
Examples of DM projects to stimulate your imagination

Here are six examples of how data mining is helping corporations
to operate more efficiently and profitably in today's business environment.
– Targeting a set of consumers
who are most likely to respond to a direct mail campaign
– Predicting the probability of default for consumer loan applications
– Reducing fabrication flaws in VLSI chips
– Predicting audience share for television programs
– Predicting the probability that a cancer patient
will respond to radiation therapy
– Predicting the probability that an offshore oil well is actually going
to produce oil
Comparison of fourteen DM tools

• Evaluated by four undergraduates inexperienced at data mining,
a relatively experienced graduate student, and
a professional data mining consultant
• Run under MS Windows 95, MS Windows NT, or
Macintosh System 7.5
• Each uses one of four technologies:
Decision Trees, Rule Induction, Neural Networks, or Polynomial Networks
• Solve four problems: two binary classification problems,
a multi-class classification problem, and a noiseless estimation problem
• Prices range from $75 to $25,000
Comparison of fourteen DM tools




The Decision Tree products were
- CART
- Scenario
- See5
- S-Plus
The Rule Induction tools were
- WizWhy
- DataMind
- DMSK
Neural Networks were built from three programs
- NeuroShell2
- PcOLPARS
- PRW
The Polynomial Network tools were
- ModelQuest Expert
- Gnosis
- a module of NeuroShell2
- KnowledgeMiner
Criteria for evaluating DM tools
A list of 20 criteria for evaluating DM tools, put into 4 categories:

Capability measures what a desktop tool can do,
and how well it does it
- Handles missing data
- Considers misclassification costs
- Allows data transformations
- Quality of testing options
- Has programming language
- Provides useful output reports
- Visualisation
Visualisation
+ excellent capability
✓ good capability
- some capability
“blank” no capability
Criteria for evaluating DM tools

Learnability/Usability shows how easy a tool is to learn and use
- Tutorials
- Wizards
- Easy to learn
- User’s manual
- Online help
- Interface
Criteria for evaluating DM tools

Interoperability shows a tool’s ability to interface
with other computer applications
- Importing data
- Exporting data
- Links to other applications

Flexibility
- Model adjustment flexibility
- Customizable work environment
- Ability to write or change code
Data Input & Output Model
+ excellent capability
✓ good capability
- some capability
“blank” no capability
A classification of data sets

Pima Indians Diabetes data set
– 768 cases of Native American women from the Pima tribe,
some of whom are diabetic, most of whom are not
– 8 attributes plus the binary class variable for diabetes per instance

Wisconsin Breast Cancer data set
– 699 instances of breast tumors,
some of which are malignant, most of which are benign
– 10 attributes plus the binary malignancy variable per case

The Forensic Glass Identification data set
– 214 instances of glass collected during crime investigations
– 10 attributes plus the multi-class output variable per instance

Moon Cannon data set
– 300 solutions to the equation:
x = 2 v^2 sin(θ) cos(θ) / g
– the data were generated without adding noise
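A noiseless data set of this kind is easy to regenerate. The sketch below reads the Moon Cannon equation as the standard projectile range formula, with v the muzzle velocity, the sine/cosine argument the launch angle, and g the gravitational acceleration. That reading, the value of g, the sampling ranges, and the function names are all assumptions; the slide states only the equation and the count of 300.

```python
import math
import random

MOON_GRAVITY = 1.62  # m/s^2; assumed value, the slide does not state g


def cannon_range(v, theta, g=MOON_GRAVITY):
    """Horizontal range x = 2 v^2 sin(theta) cos(theta) / g."""
    return 2.0 * v ** 2 * math.sin(theta) * math.cos(theta) / g


def make_moon_cannon(n=300, seed=0):
    """Generate n noiseless (v, theta, x) rows; sampling ranges are assumed."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        v = rng.uniform(10.0, 100.0)            # assumed velocity range
        theta = rng.uniform(0.0, math.pi / 2)   # launch angle in radians
        rows.append((v, theta, cannon_range(v, theta)))
    return rows


data = make_moon_cannon()
```

Because no noise is added, a tool that can represent the product of v squared and the trigonometric term should be able to fit this target almost exactly, which is what makes it a useful estimation benchmark.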
Evaluation of fourteen DM tools
Strengths and Weaknesses
Strengths
• Ease of use
(Scenario, WizWhy...)
• Data visualisation
(S-Plus, MineSet...)
• Depth of algorithms (tree options)
(CART, See5, S-Plus...)
• Multiple neural network architectures
(NeuroShell)
Weaknesses
• Difficult file I/O
(OLPARS, CART)
• Limited visualisation
(PRW, See5, WizWhy)
• Narrow analysis path
(Scenario)
How to improve existing DM applications
The top ten points:
• Database integration
– no more flat files
– use the millions of dollars spent on data warehousing
• Automated model scoring
– without scoring, DM is pretty useless
– should be integrated with the driving applications
• Exporting models to other applications
– close the loop between DM and the applications
that need to use the results (scores)
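One way to picture exporting a model and scoring with it elsewhere is sketched below. It assumes the model is a logistic regression whose coefficients are exported as JSON and then applied by a separate application. The model form, field names, and coefficient values are hypothetical and not taken from any of the tools discussed.

```python
import json
import math

# Hypothetical exported model: coefficients of a logistic regression.
exported = json.dumps({
    "intercept": -2.0,
    "coefficients": {"balance_k": 0.03, "num_products": 0.5},
})


def score(record, model_json):
    """Apply the exported model to one record and return a probability."""
    model = json.loads(model_json)
    z = model["intercept"]
    for name, weight in model["coefficients"].items():
        z += weight * record[name]
    return 1.0 / (1.0 + math.exp(-z))


# The "driving application" only needs the JSON, not the mining tool.
customers = [
    {"id": 1, "balance_k": 10.0, "num_products": 1},
    {"id": 2, "balance_k": 80.0, "num_products": 3},
]
scores = {c["id"]: score(c, exported) for c in customers}
```

Keeping the exported form this small is a design choice: the consuming application can score new records without linking against the tool that built the model.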
How to improve existing DM applications

• Business templates
– a specific cross-selling application is more valuable
than a general modeling tool
• Effort knob
– it is meaningful to the user in a way that tuning parameters are not
• Incorporate financial information
– financial information is very important and often available,
and should be provided as input to the DM application
How to improve existing DM applications

• Computed target columns
– allow the user to interactively create a new target variable
• Time-series data
– a year’s worth of monthly balance information is qualitatively
different from twelve distinct non-time-series variables
• Use versus View
– do not present the full model visually to the user,
only the most important levels
• Wizards
– not necessary, but desirable
– prevent human error by keeping the user on track
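The points about computed target columns and time-series data can be illustrated together. The sketch below treats twelve monthly balances as one ordered series, summarises it with a mean and a least-squares trend, and derives a new target from the trend. The column names, the toy balances, and the definition of "declining" are hypothetical.

```python
def trend_slope(values):
    """Least-squares slope of values against their index 0..n-1."""
    n = len(values)
    mean_x = (n - 1) / 2.0
    mean_y = sum(values) / n
    num = sum((i - mean_x) * (y - mean_y) for i, y in enumerate(values))
    den = sum((i - mean_x) ** 2 for i in range(n))
    return num / den


def derive_columns(monthly_balances, threshold=0.0):
    """Summarise a balance series and compute a new target column."""
    slope = trend_slope(monthly_balances)
    return {
        "mean_balance": sum(monthly_balances) / len(monthly_balances),
        "balance_trend": slope,
        # Computed target (hypothetical definition): balance is falling.
        "declining": 1 if slope < threshold else 0,
    }


# Toy series: the balance drops by 50 every month.
row = derive_columns(
    [1200, 1150, 1100, 1050, 1000, 950, 900, 850, 800, 750, 700, 650]
)
```

The trend is only available because the order of the twelve values is preserved, which is the sense in which the series differs from twelve unrelated variables.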
Potential Applications
Data mining has many varied fields of application,
some of which are listed below.

• Retail/Marketing

Identify buying patterns from customers

Find associations among customer demographic characteristics

Predict response to mailing campaigns

Market basket analysis
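As a minimal sketch of market basket analysis, the code below counts how often pairs of items occur together and reports the two usual rule measures: support (share of baskets containing both items) and confidence (share of baskets with the left-hand item that also contain the right-hand item). The baskets are made-up toy data, and real tools handle larger item sets and prune by minimum support.

```python
from collections import Counter
from itertools import combinations

# Made-up toy baskets for illustration only.
baskets = [
    {"bread", "milk"},
    {"bread", "butter", "milk"},
    {"beer", "bread"},
    {"butter", "milk"},
]


def pair_rules(baskets):
    """Return support and confidence for every rule lhs -> rhs over item pairs."""
    n = len(baskets)
    item_count = Counter(item for b in baskets for item in b)
    pair_count = Counter(
        pair for b in baskets for pair in combinations(sorted(b), 2)
    )
    rules = []
    for (a, b), count in pair_count.items():
        for lhs, rhs in ((a, b), (b, a)):
            rules.append({
                "rule": (lhs, rhs),
                "support": count / n,
                "confidence": count / item_count[lhs],
            })
    return rules


rules = {r["rule"]: r for r in pair_rules(baskets)}
```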
Potential Applications
• Banking

Detect patterns of fraudulent credit card use

Identify ‘loyal’ customers

Determine credit card spending by customer groups

Find hidden correlations between different financial indicators

Identify stock trading rules from historical market data
Potential Applications
• Insurance and Health Care

Claims analysis - i.e., which medical procedures are claimed together

Predict which customers will buy new policies

Identify behaviour patterns of risky customers

Identify fraudulent behaviour
Potential Applications
• Transportation

Determine the distribution schedules among outlets

Analyse loading patterns
• Medicine

Characterise patient behaviour to predict office visits

Identify successful medical therapies for different illnesses

Predict the effectiveness of surgical procedures or
medical tests
Potential Applications
• Sport

Make the best choice of players in different circumstances

Predict the results of relevant matches

Produce a better list of seeded players in groups or tournaments
• DM report from an NBA game
When Price was Point-Guard, J. Williams missed 0% (0)
of his jump field-goal attempts and made 100% (4)
of his jump field-goal attempts.
The total number of such field-goal attempts was 4.
DM and Customer Relationship Management

• CRM is a process that manages the interactions
between a company and its customers
• Users of CRM software applications are database marketers
• Goals of database marketers are:
– identifying market segments, which requires significant data
about prospective customers and their buying behaviors
– building and executing campaigns
• Tightly integrating the two disciplines (DM and CRM) presents an opportunity
for companies to gain competitive advantage
DM and Customer Relationship Management





How Data Mining helps Database Marketing
Scoring
The role of Campaign Management Software
Increasing the customer lifetime value
Combining Data Mining and Campaign Management
DM and Customer Relationship Management

Evaluating the benefits of a Data Mining model
Gains chart
Profitability chart
Data Mining Examples

• Bass Brewers
“We’ve been brewing beer since 1777. With increased competition
comes a demand to make faster, better-informed decisions.”
• Northern Bank
“The information is now more accessible, paperless and timely.”
• TSB Group Plc
“We are using Holos because of its flexibility and its excellent
multidimensional database.”
Data Mining Examples

• Delphic Universities
“Real value is added to data by multidimensional manipulation
(being able to easily compare many different views
of the available information in one report) and by modeling.”
• Harvard - Holden
“Sybase technology has allowed us to develop an information
system that will preserve this legacy into the twenty-first century.”
• J.P. Morgan
“The promise of data mining tools like Information Harvester is
that they are able to quickly wade through massive amounts
of data to identify relationships or trending information
that would not have been available without the tool.”
Case study of Breast Cancer Survival Analysis

• Case study of the influence of various patient characteristics
on survival rates for breast cancer
• The survival analysis technique employed is Cox Regression
(this technique is useful in situations
where some of the patients do not die during the observation
period)
• A linear regression technique could be used instead
if all patients had died during the observation period
Case study of Breast Cancer Survival Analysis

• The observation period runs for 133.8 months
• The modeling sample contains 746 patients
(50 patients who died during the observation period and 696
who survived beyond the end of the observation period)
• In this example, we are testing only four predictors:
– Age, in years, at the start of the observation period (22 to 88)
– Pathological tumor size, in centimeters (0.10 to 7.00)
– Number of positive axillary lymph nodes (0 to 35)
– Estrogen receptor status (positive vs. negative)
Case study of Breast Cancer Survival Analysis

• The Cox Regression used a backward stepwise likelihood-ratio
variable selection method
• Significance criteria were set at 0.05 for inclusion in the model,
and 0.10 for removal from the model
• Printout from the final step of the stepwise regression analysis:
________________ Variables in the Equation ______________
Variable        B    S.E.     Wald  df    Sig       R  Exp(B)
AGE        -.0314   .0121   6.7486   1  .0094  -.0893   .9691
PATHSIZE    .3975   .1175  11.4476   1  .0007   .1259  1.4881
LNPOS       .1372   .0361  14.4100   1  .0001   .1443  1.1471
_______________________________________________________
The column labeled "Sig" shows the statistical significance of included variables
The column labeled "R" shows the degree of unique correlation with the dependent variable
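The Exp(B) column of the printout can be checked directly from the B column. In Cox regression, Exp(B) is the hazard ratio for a one-unit increase in the predictor. The sketch below recomputes it and adds an approximate 95% interval from B and S.E. The interval uses a normal approximation and is an addition for illustration, not part of the original printout.

```python
import math

# (B, S.E., reported Exp(B)) copied from the printout.
coefficients = {
    "AGE": (-0.0314, 0.0121, 0.9691),
    "PATHSIZE": (0.3975, 0.1175, 1.4881),
    "LNPOS": (0.1372, 0.0361, 1.1471),
}


def hazard_ratio(b, units=1.0):
    """Multiplicative change in hazard for an increase of `units`."""
    return math.exp(b * units)


for name, (b, se, reported) in coefficients.items():
    hr = hazard_ratio(b)
    low = math.exp(b - 1.96 * se)   # normal-approximation bounds
    high = math.exp(b + 1.96 * se)
    print(f"{name}: Exp(B)={hr:.4f} (reported {reported}), "
          f"approx. 95% interval {low:.3f} to {high:.3f}")
```

A ratio above 1 (tumor size, positive nodes) means the hazard rises as the predictor increases; a ratio below 1 (age) means the hazard falls. Calling `hazard_ratio(0.1372, units=5)` gives the ratio associated with five additional positive nodes under the same model.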
Case study of Breast Cancer Survival Analysis
Some key things to note are:





• Estrogen status was removed as a predictor because
it did not reach the 0.05 significance criterion for inclusion
• Number of positive axillary lymph nodes was the strongest
predictor of survival rates over the course of the observation
period (R=.1443 / Sig.=.0001), followed by pathological tumor size
(R=.1259 / Sig.=.0007)
• Age, although significant, is somewhat less influential
than the other two predictors (R=-.0893 / Sig.=.0094)
• Note that both the number of positive axillary lymph nodes and
the pathological tumor size are positively correlated with the
dependent variable, which means that they are directly associated
with more rapid mortality.
• Age is negatively correlated with the dependent variable, which
means that older age is predictive of somewhat longer survival.
Case study of Breast Cancer Survival Analysis




The chart shows the cumulative survival function during the observation
period:
• All patients survive through the tenth month of the observation
period
• At the fortieth month, the mortality rate increases and
continues at this fairly constant increased rate
through the forty-fifth month
• At the forty-fifth month, there is a five-month period
without additional mortality
• 11% of the original sample has died
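To make the idea of a cumulative survival function concrete, here is a minimal sketch of the Kaplan-Meier estimator. This is a simpler, nonparametric technique than the Cox Regression used in the case study, and it is shown only because it handles the same situation: patients who are still alive at the end of observation (censored) remain in the risk set until they drop out. The toy observations are invented and are not the case study's data.

```python
def kaplan_meier(observations):
    """Estimate cumulative survival from (time, died) pairs.

    died=False marks a censored patient (alive when last observed).
    Returns [(time, survival_probability)] at each time a death occurs.
    """
    event_times = sorted({t for t, died in observations if died})
    curve = []
    survival = 1.0
    for t in event_times:
        at_risk = sum(1 for time, _ in observations if time >= t)
        deaths = sum(1 for time, died in observations if died and time == t)
        survival *= 1.0 - deaths / at_risk
        curve.append((t, survival))
    return curve


# Invented toy data: times in months, three deaths, three censored patients.
toy = [(12, True), (40, True), (42, True),
       (45, False), (60, False), (133.8, False)]
curve = kaplan_meier(toy)
```

The curve only steps down at months where a death occurs, which matches the flat stretches and steeper stretches described for the chart.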
Case study of Breast Cancer Survival Analysis
Conclusions and Implications

• The case study presented here is relatively simple,
and is for illustrative purposes only.
• With the addition of more candidate predictors
(progesterone receptor status, histologic grade, blood type
etc.), an even more powerful model could emerge.
• By understanding the influence of patient characteristics
on mortality rates over time, we are in a better position to
estimate survival times for individual patients, and to defend
using different or more aggressive therapeutic approaches for
some patients.
Securities Brokerage Case Study

• The following four pages are derived
from a copyrighted case study
originally created by SmartDrill Data Mining
(Marlborough, MA, U.S.A.).
• Their website is:
http://smartdrill.com
• And the original case study appears in its entirety here:
http://smartdrill.com/CHAID.html
Securities Brokerage Case Study

• Predictive market segmentation model designed to identify
and profile high-value brokerage customer segments
as targets for special marketing communications efforts.
• The dependent variable for this ordinal CHAID model
is brokerage account commission dollars during the past 12 months
• We begin by splitting the client's entire customer file
into a modeling sample and a validation sample.
(Once the model is built using the modeling sample,
we apply it to the validation sample to see how well it works
on a sample other than the one on which it was built).
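The split into a modeling sample and a validation sample can be sketched as a seeded random shuffle. The 50/50 proportion, the seed, and the stand-in customer IDs are assumptions; the case study does not say how the file was divided.

```python
import random


def split_sample(records, validation_fraction=0.5, seed=42):
    """Randomly split records into (modeling, validation) samples."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * (1.0 - validation_fraction))
    return shuffled[:cut], shuffled[cut:]


customers = list(range(1000))  # stand-in customer IDs
modeling, validation = split_sample(customers)
```

Fixing the seed makes the split reproducible, so the same validation sample can be reused when comparing alternative models.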
Securities Brokerage Case Study

• The resulting CHAID model has 55 segments.
• The results are summarized in the following comb chart,
showing the segment indexes (indexes of average dollar value)
Securities Brokerage Case Study
Part of the Gains Chart: Average Annual Brokerage Commission Dollars
• The gains chart provides quantitative detail useful
for financial and marketing planning.
• We have highlighted the top 20% of the file in blue.
• The top 20% of the file is worth an average
of about $334 per account, which is nearly three times
the average account value for the entire sample.
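A gains chart of this kind can be built by ranking segments from best to worst and accumulating accounts and dollars. The sketch below assumes each segment is described by a name, an account count, and an average dollar value, and it computes a segment index as 100 times the segment average divided by the overall average. The four segments are invented for illustration and do not reproduce the case study's 55 segments or its figures.

```python
def gains_table(segments):
    """Build gains-chart rows from (name, n_accounts, avg_dollars) tuples."""
    total_n = sum(n for _, n, _ in segments)
    total_dollars = sum(n * avg for _, n, avg in segments)
    overall_avg = total_dollars / total_n
    rows, cum_n, cum_dollars = [], 0, 0.0
    # Best segments first, as in a gains chart.
    for name, n, avg in sorted(segments, key=lambda s: s[2], reverse=True):
        cum_n += n
        cum_dollars += n * avg
        rows.append({
            "segment": name,
            "index": round(100 * avg / overall_avg),
            "cum_pct_file": 100 * cum_n / total_n,
            "cum_avg": cum_dollars / cum_n,
        })
    return overall_avg, rows


# Invented segments: (name, accounts, average commission dollars).
segments = [("A", 100, 400.0), ("B", 100, 250.0),
            ("C", 300, 100.0), ("D", 500, 50.0)]
overall_avg, rows = gains_table(segments)
```

Reading down the `cum_pct_file` column to 20% and comparing `cum_avg` with `overall_avg` gives the same kind of statement as the slide makes about the top 20% of the file.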

Securities Brokerage Case Study

• Using the information in the gains chart,
we can better plan our communications/promotion budget.
• In general, the best segments represent customers
who are experienced, aggressive, self-directed traders.
• Other decisions that the gains chart
and the segmentation rules can help us make:
– We might wish to conduct some market research among customers
in under-performing segments, or among under-performing customers
in the better segments
– We can use the segment definitions to help us identify possible issues
and question areas to include in the survey
• Before we try to apply such a model, we perform a validation
against a holdout sample, to confirm that it is a good model.
The End