Derive Value From Excellence

ODMAD Algorithm for Mixed
Attribute Outlier Detection
GCE Solutions
Issues with Common Outlier Detection Approaches
• Many are limited to numeric data only
• Many are limited to Supervised data
• What if you don’t have predictive data?
• What if you don’t know the supervised relationships?
• Limited options for Mixed-Attribute, Unsupervised data
• Unreliable results
• Often costly run times
Solution? The ODMAD Algorithm.
• Outlier Detection for Mixed Attribute Data
• Introduced in Anna Koufakou’s dissertation, “Scalable and Efficient Outlier Detection in Large Distributed Data Sets with Mixed-Type Attributes”
• Combines simple frequency-count ideas with well-known numeric outlier detection methods
• Dr. Koufakou’s paper suggests only cosine similarity, but we expanded this to a more general numeric approach
• 2 Phases of ODMAD:
• Phase 1 – Categorical phase: Finds infrequencies in categorical values and
combinations of categorical values
• Phase 2 – Numeric phase: Subsets the numeric portion of the data by categorical value and applies numeric outlier detection methods to each subset
• Subsetting reduces the risk of “masking”
Definitions
• Categorical Score – measures the “infrequentness” for each
observation
• Low values = not a categorical outlier
• High values = categorical outlier
• Dimension – number of variables used at a given time
• E.g. if looking at total frequencies of 2 variables combined, SEX and RACE,
this would be checking categorical outliers at a dimension of 2.
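To make the notion of dimension concrete, here is a minimal Python sketch that counts frequencies at dimension 1 (SEX alone) and dimension 2 (SEX and RACE combined). The records and the helper name `frequencies` are made up for illustration; they are not the dataset or code used in this presentation.

```python
from collections import Counter

# Hypothetical toy records; the values are illustrative only.
records = [
    {"SEX": "F", "RACE": "White"},
    {"SEX": "F", "RACE": "White"},
    {"SEX": "M", "RACE": "Asian"},
    {"SEX": "M", "RACE": "White"},
    {"SEX": "F", "RACE": "Asian"},
]

def frequencies(rows, variables):
    """Count how often each combination of values occurs for the given variables."""
    return Counter(tuple(row[v] for v in variables) for row in rows)

dim1 = frequencies(records, ["SEX"])          # dimension 1: one variable
dim2 = frequencies(records, ["SEX", "RACE"])  # dimension 2: two variables combined
```

Each key is a tuple with one entry per variable, so the length of the key is the dimension being checked.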
Example
• Suppose we have data:
• And we want to find outliers
• We have both categorical and numeric variables
• We know nothing about this dataset
• Want to treat as unsupervised
Getting started with Phase 1 - Categorical
• Initialize the Categorical Score to 0 for every observation
• Set thresholds:
• Need minimum allowed frequency
• Typically set to 2% of the total observations
• Need a cut-off value for the Categorical Score above which an observation is considered an outlier
• An observation’s Categorical Score will increase if there is an infrequency
• Categorical Score is weighted according to frequency and number of variables that caused
the infrequency:
                     Low Frequency    High Frequency
Smaller Dimension    Higher Impact    No Impact
Higher Dimension     Lower Impact     No Impact
Example: Phase 1 - Categorical
• First it will scan all the individual categorical variables for
infrequencies
Total Frequencies:
‘F’: 717
‘M’: 776
‘Rather not Say’: 5
Example: Phase 1 - Categorical
• Set frequency threshold:
• Total Observations in dataset = 1498
• Frequency threshold = 0.02 × 1498 ≈ 30
• Frequencies:
• 717 occurrences of ‘F’
• 776 occurrences of ‘M’
• 5 occurrences of ‘Rather not Say’
• If a frequency is less than the threshold:
• Need to increment the observation’s categorical score:
• $\mathit{Score} = \mathit{Score} + \frac{1}{\mathit{Frequency} \times \mathit{Dimensions}}$
• Since ‘Rather not Say’ has frequency less than 30, need to increment score for all
observations with ‘Rather not Say’:
• $\mathit{Score} = 0 + \frac{1}{5 \times 1} = 0.2$
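The arithmetic on this slide can be reproduced with a short Python sketch. It rebuilds the SEX column from the three counts given above and applies the 2% threshold and the score increment; the variable names are illustrative assumptions.

```python
from collections import Counter

# SEX values rebuilt from the counts on this slide.
sex = ["F"] * 717 + ["M"] * 776 + ["Rather not Say"] * 5

threshold = 0.02 * len(sex)   # 2% of 1498 observations = 29.96
counts = Counter(sex)

scores = [0.0] * len(sex)     # Categorical Score starts at 0 for every observation
dimension = 1                 # a single variable is being scanned
for i, value in enumerate(sex):
    if counts[value] < threshold:
        # Infrequent value: add 1 / (frequency * dimensions)
        scores[i] += 1.0 / (counts[value] * dimension)
```

Only the five ‘Rather not Say’ observations receive a non-zero score, and each of them ends at 0.2.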
Example: Phase 1 - Categorical
• Then we finish the same process on the rest of the categorical
variables
Total Frequencies:
‘F’: 717
‘M’: 776
‘Rather not Say’: 5
Total Frequencies:
‘American Indian or Alaska Native’: 237
‘Asian’: 776
‘Black or African American’: 297
…Etc…
Total Frequencies:
‘America’: 861
‘Canada’: 251
‘India’: 388
Total Frequencies:
‘Hispanic or Latino’: 779
‘Not Hispanic or Latino’: 721
Total Frequencies:
‘Reference’: 721
‘Test’: 779
Example: Phase 1 - Categorical
• Once all individual variables have been scanned, check combinations of
those variables (loop to the next dimension)
SEX    RACE               FREQUENCY
F      Asian              1
F      Black              142
F      White              274
M      American Indian    237
M      Asian              384
M      Black              154
…etc…
Example: Phase 1 - Categorical
• Once all individual variables have been scanned, check combinations of
those variables
SEX    ETHNIC          FREQUENCY
F      Hispanic        1
F      Not Hispanic    716
M      Hispanic        775
M      Not Hispanic    1
Example: Phase 1 - Categorical
• Once all individual variables have been scanned, check combinations of
those variables
SEX    ACTARM       FREQUENCY
F      Reference    716
M      Reference    1
M      Test         775
…etc…
Example: Phase 1 - Categorical
• Perform the same process across all variable combinations:
• Check the frequencies of each combo
• If frequency less than threshold, increment categorical score
• Set dimensions to number of variables used
• Recall: $\mathit{Score} = \mathit{Score} + \frac{1}{\mathit{Frequency} \times \mathit{Dimensions}}$
• E.g. if you used SEX and RACE, then Dimensions equal 2
• E.g. if you used SEX, RACE, and ETHNIC, then Dimensions equal 3
• Typically stop at a dimension of 3
• Higher dimensions stop providing additional information
• Too many dimensions can over-flag infrequencies (increasing type I error)
• Keeps run time low
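The full Phase 1 scan described on this slide can be sketched in Python as follows. This is a minimal illustration that follows the slides as written: it scores every infrequent combination up to a maximum dimension and does not reproduce any pruning or optimization the dissertation may apply. The function name, parameter names, and toy rows are assumptions made for the example.

```python
from collections import Counter
from itertools import combinations

def categorical_scores(rows, variables, min_fraction=0.02, max_dimension=3):
    """Sum 1 / (frequency * dimension) over every infrequent value
    combination each row belongs to, for dimensions 1..max_dimension."""
    threshold = min_fraction * len(rows)
    scores = [0.0] * len(rows)
    for dimension in range(1, max_dimension + 1):
        for combo in combinations(variables, dimension):
            keys = [tuple(row[v] for v in combo) for row in rows]
            counts = Counter(keys)
            for i, key in enumerate(keys):
                if counts[key] < threshold:
                    scores[i] += 1.0 / (counts[key] * dimension)
    return scores

# Hypothetical toy data: 100 rows, so the 2% threshold is 2 observations.
rows = (
    [{"SEX": "F", "ACTARM": "Reference"}] * 49
    + [{"SEX": "M", "ACTARM": "Test"}] * 49
    + [{"SEX": "M", "ACTARM": "Reference"}]       # rare only as a combination
    + [{"SEX": "Rather not Say", "ACTARM": "Test"}]  # rare on its own and combined
)
scores = categorical_scores(rows, ["SEX", "ACTARM"])
```

In this toy data the ‘M’/‘Reference’ row is frequent on each variable separately and only picks up a score at dimension 2, while the ‘Rather not Say’ row is penalized at both dimensions, with the dimension-1 hit weighing more.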
Example: Phase 1 - Categorical
• When the entire scan is finished:
• Look at final categorical scores:
• Set a cut-off for the categorical score to identify outlying values
[Figure: final categorical scores, with the infrequent observations labeled]
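Applying the cut-off is a simple comparison. The sketch below uses made-up scores and an assumed cut-off of 0.1; the slides do not state a specific cut-off value, so in practice it would be chosen by inspecting the score distribution.

```python
# Hypothetical final Categorical Scores for six observations.
scores = [0.0, 0.0, 0.2, 0.0, 1.5, 0.0]

cutoff = 0.1  # assumed value for illustration only

# Indices of observations whose score exceeds the cut-off.
categorical_outliers = [i for i, s in enumerate(scores) if s > cutoff]
```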
Example: Phase 2 - Numeric
• Subset numeric variables by categorical values and test for outliers
• Using subsets versus full numeric data reduces risk of masking
• Only subset on frequent categorical values
• Any numeric outlier detection technique may be used on the subsets
• If there is an outlier in any of the subsets, that observation gets flagged
• With many numeric variables, cosine similarity is suggested because it makes no distributional assumptions
• It does not perform as well in low dimensions
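Since any numeric outlier detection technique may be used on the subsets, the sketch below picks one for illustration: Tukey's interquartile-range fences. That choice, the function name, the `VALUE` column, and the toy numbers are all assumptions made for the example, not the method used in this presentation.

```python
from collections import defaultdict
from statistics import quantiles

def numeric_outliers(rows, cat_vars, num_var, min_fraction=0.02, k=1.5):
    """Flag row indices whose num_var value lies outside Tukey's fences
    within any subset defined by a frequent categorical value."""
    threshold = min_fraction * len(rows)
    flagged = set()
    for var in cat_vars:
        groups = defaultdict(list)
        for i, row in enumerate(rows):
            groups[row[var]].append(i)
        for indices in groups.values():
            # Skip infrequent values, and subsets too small for quartiles.
            if len(indices) < threshold or len(indices) < 4:
                continue
            values = [rows[i][num_var] for i in indices]
            q1, _, q3 = quantiles(values, n=4)
            low, high = q1 - k * (q3 - q1), q3 + k * (q3 - q1)
            flagged.update(i for i in indices
                           if not low <= rows[i][num_var] <= high)
    return flagged

# Hypothetical toy data: 'F' values sit near 10-19, 'M' values near 55-65.
rows = (
    [{"SEX": "F", "VALUE": v} for v in range(10, 20)]
    + [{"SEX": "F", "VALUE": 60}]   # typical for 'M', unusual for 'F'
    + [{"SEX": "M", "VALUE": v} for v in range(55, 66)]
)
flagged = numeric_outliers(rows, ["SEX"], "VALUE")
```

In this toy data the value 60 is ordinary among the ‘M’ rows and would fall inside fences computed on the pooled data, yet it is flagged within the ‘F’ subset. That is the masking effect that subsetting is meant to reduce.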
Example: Phase 2 - Numeric
• Subset observations of numeric variables according to categorical values
and look for outliers within each subset
Subset where SEX = ‘M’
Subset where SEX = ‘F’
Example: Phase 2 - Numeric
• Subset observations of numeric variables according to categorical values
and look for outliers within each subset
Subset where RACE = ‘American Indian’
Subset where RACE = ‘White’
…etc…
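When a subset has many numeric variables, cosine similarity is the suggested option. One loose way to illustrate the idea in Python is to compare each observation's numeric vector with the mean vector of its subset and flag low similarities. This is a sketch under that assumption and may differ from the exact formulation in the dissertation; the function names, the 0.9 threshold, and the vectors are hypothetical.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors (assumes neither is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def low_similarity(vectors, min_similarity):
    """Return indices of vectors whose cosine similarity to the subset mean
    falls below min_similarity."""
    n = len(vectors)
    mean = [sum(col) / n for col in zip(*vectors)]
    return [i for i, v in enumerate(vectors)
            if cosine_similarity(v, mean) < min_similarity]

# Hypothetical numeric vectors for one categorical subset.
subset = [[1, 2, 3], [2, 4, 6], [1, 2, 3.1], [3, 1, 0]]
suspects = low_similarity(subset, min_similarity=0.9)
```

Cosine similarity compares direction rather than magnitude, so `[2, 4, 6]` is treated the same as `[1, 2, 3]`, while `[3, 1, 0]` points in a different direction from the rest of the subset.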
Concluding Remarks
• Provides flexibility
• Categorical and numeric data
• Diverse numerical approaches
• Simple calculation
• Useful for many applications, for example:
• Detecting typos
• Finding how identifiable a patient is with sensitive data
Acknowledgements
• Anna Koufakou
• Author of dissertation introducing ODMAD
• Abhay Thakur
• Contributor towards implementing ODMAD at GCE Solutions
Questions?