Harvesting Hidden Information in Structured Data

Mark Grundy, ConSolve Pty Ltd
In this talk
• What is Data Mining?
• Who is doing it
• What they’re finding
• How you do it
• Techy buzzwords
• Sample tools
• Tricks and traps
• Data Mining and Knowledge Management
• Conclusions
Data Mining: Definition
• KDD: “Knowledge Discovery in Databases”
• “The process of finding valid, novel, potentially useful and understandable patterns in data”
Usama Fayyad, The KDD Process for Extracting Useful Knowledge from Volumes of Data, CACM 1996, v39, 11
What does it mean?
• Process: it is iterative and interactive
• Pattern: models, trends, links, causes, behaviours
• Valid: you want the truth
• Novel: you didn’t already know it
• Useful: should help you with some problem
• Understandable: need to know why the pattern is there
Management Reporting vs Data Mining
Management Reporting | Data Mining
Periodic | Ad-hoc
Automated | Interactive, iterative
Recent events | Potential future events
Well defined problem, eg: profit, operational performance | Ill defined problems, eg: fraud, customer behaviours
Simple tools, simple processes | Complex tools, complex processes
Separated data sets | “Joined up” data sets
Low risk, medium value | High risk, high value
Who is doing it
Sector | Why
Financial | Customer prediction
Communications | Customer prediction
Retail | Customer prediction
Health | Customer prediction, epidemics
Manufacturing | Production failures, supply chain management
Science | Modelling complex relationships
Business in general | Efficiency, performance, waste minimisation, compliance & audit
What you can find
• Hidden customer behaviours:
  • Buying patterns (eg, beer and nappies)
  • Life events (eg, childbirth, divorce)
  • Churn (eg, mobile phones)
  • Health trends (eg, preventative medicine)
• Performance opportunities
  • Underdeveloped markets
  • Emerging trends
  • Inefficiencies
  • Fraud and non-compliance
=> “Nuggets” of gold: new discoveries
=> New predictive models
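A buying pattern such as "beer and nappies" is usually expressed as an association rule with a support, a confidence and a lift. The sketch below shows one way those three numbers could be computed for item pairs. The baskets, the function name `pair_rules` and the `min_support` threshold are all invented for illustration and are not taken from any product mentioned in this talk.

```python
from collections import Counter
from itertools import combinations

# Hypothetical toy baskets, invented purely for illustration.
BASKETS = [
    {"beer", "nappies", "crisps"},
    {"beer", "nappies"},
    {"nappies", "milk"},
    {"beer", "crisps"},
    {"beer", "nappies", "milk"},
    {"milk", "bread"},
]


def pair_rules(baskets, min_support=0.3):
    """Return (antecedent, consequent, support, confidence, lift) for item pairs."""
    n = len(baskets)
    # How many baskets contain each single item.
    item_counts = Counter(item for basket in baskets for item in basket)
    # How many baskets contain each unordered pair of items.
    pair_counts = Counter(
        pair for basket in baskets for pair in combinations(sorted(basket), 2)
    )
    rules = []
    for (a, b), count in pair_counts.items():
        support = count / n  # share of baskets holding both items
        if support < min_support:
            continue
        for x, y in ((a, b), (b, a)):
            confidence = count / item_counts[x]      # P(y in basket | x in basket)
            lift = confidence / (item_counts[y] / n)  # confidence relative to P(y)
            rules.append((x, y, support, confidence, lift))
    # Strongest lift first.
    return sorted(rules, key=lambda r: r[4], reverse=True)


RULES = pair_rules(BASKETS)
for x, y, support, confidence, lift in RULES:
    print(f"{x} -> {y}: support={support:.2f} confidence={confidence:.2f} lift={lift:.2f}")
```

A lift above 1 suggests the two items appear together more often than chance alone would predict. Whether such a rule is a "nugget" still depends on the validity and usefulness tests in the definition earlier in the talk.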
KDD Process – How you do it
1. Learn the domain
2. Locate the right data
3. Clean the data
4. Develop the model
5. Analyse
6. Interpret
7. Apply the learning
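As a rough illustration of the "clean the data" step, the sketch below tidies a handful of records before any modelling is attempted. The field names (`customer_id`, `state`, `spend`), the sample rows and the rule that a row with any empty field is rejected are assumptions made for this example only. Real cleansing rules come out of the domain study.

```python
# Hypothetical raw records; field names and values are invented for illustration.
RAW = [
    {"customer_id": " 101 ", "state": "nsw", "spend": "120.50"},
    {"customer_id": "101", "state": "NSW", "spend": "120.50"},  # duplicate once tidied
    {"customer_id": "102", "state": "Vic", "spend": ""},         # missing spend
    {"customer_id": "103", "state": " qld", "spend": "80"},
]


def clean(records):
    """Tidy fields, set aside incomplete rows, and drop repeat customers."""
    seen = set()
    cleaned = []
    rejected = []
    for rec in records:
        cid = rec.get("customer_id", "").strip()
        state = rec.get("state", "").strip().upper()
        spend = rec.get("spend", "").strip()
        # Assumed rule: any empty field makes the row unusable.
        if not cid or not state or not spend:
            rejected.append(rec)
            continue
        # Keep only the first row seen for each customer.
        if cid in seen:
            continue
        seen.add(cid)
        cleaned.append({"customer_id": cid, "state": state, "spend": float(spend)})
    return cleaned, rejected


CLEANED, REJECTED = clean(RAW)
print(len(CLEANED), "clean rows,", len(REJECTED), "rejected")
```

Rejected rows are kept rather than silently discarded, so that they can be reviewed with the people who own the data.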
Key KDD Tools
Simplified Process Schema: Goal, Dictionary, Database, Cleansing, Modelling, Analysis, Output
Supporting Tools: Query Tools; Stats, AI Tools; Visualisation Tools; Presentation Tools
Sample Products
• Red Brick data warehouses
• SQL Server Analysis Services
• Datastage ETL, Microsoft DTS
• SAS
• Intelligent Miner
• Cognos, Hyperion, Business Objects
• Netmap
Some Techy KDD Buzzwords
• Data mart, data warehouse, cube
• Extract, Transform, Load (ETL)
• Classification and Regression Trees (CART)
• OLAP, ROLAP, MOLAP, HOLAP, DOLAP
• Clusters and centroids
• Genetic algorithms
• Univariate & multivariate analyses
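To make "clusters and centroids" concrete, here is a minimal k-means sketch. It groups points around k centroids, then moves each centroid to the mean of its group, and repeats until nothing changes. The sample points are invented, and seeding the centroids with the first k points is a simplification chosen to keep the example deterministic. The tools named in this talk may cluster quite differently.

```python
import math


def kmeans(points, k, iterations=10):
    """Plain k-means on 2-D points; the first k points seed the centroids."""
    centroids = [points[i] for i in range(k)]
    clusters = []
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Update step: each centroid moves to the mean of its members.
        new_centroids = []
        for i, members in enumerate(clusters):
            if members:
                new_centroids.append(
                    (
                        sum(x for x, _ in members) / len(members),
                        sum(y for _, y in members) / len(members),
                    )
                )
            else:
                new_centroids.append(centroids[i])  # keep an empty cluster in place
        if new_centroids == centroids:
            break  # converged
        centroids = new_centroids
    return centroids, clusters


# Hypothetical points forming two obvious groups.
POINTS = [(1.0, 1.0), (9.0, 9.0), (1.0, 2.0), (2.0, 1.0), (8.0, 9.0), (9.0, 8.0)]
CENTROIDS, CLUSTERS = kmeans(POINTS, k=2)
print(CENTROIDS)
```

In a customer setting each point might be a customer described by two measures, and each centroid a "typical" customer for that segment.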
KDD vs Knowledge Management
[Diagram labels: Prediction; Formal Knowledge; Informal Knowledge; Tacit Knowledge; Domain Study, Model Interpretation; Knowledge Discovery]
Key issues
• Finding the right problems
• Data quality
• Data meaning and interpretation
• Effective models (FABRIC criteria)
Theory vs. practice: what can go wrong
• Poor sponsorship
• Expectations
• Politics, exposure, accountability
• Scope & success criteria
• Business & data definitions
• Poor model designs
• Technology vs. users
• Process vs. technologies
• Process vs. outcome
• Focus on just internal records
What IM can contribute
• Text mining (work in progress)
• Image/sound mining?
• Metadata, meanings
• Business information definitions
• Inter-organisational standards
• Registration of data sets, reports
Questions & Discussion