GCE Solutions: Derive value from excellence…

ODMAD Algorithm for Mixed Attribute Outlier Detection

Issues with Common Outlier Detection Approaches
• Many are limited to numeric data only
• Many are limited to supervised data
  • What if you don’t have predictive data?
  • What if you don’t know the supervised relationships?
• Limited options for mixed-attribute, unsupervised data
  • Unreliable results
  • Often costly run times

Solution? The ODMAD Algorithm.
• Outlier Detection for Mixed Attribute Data
• Found in Anna Koufakou’s dissertation, “Scalable and Efficient Outlier Detection in Large Distributed Data Sets with Mixed-Type Attributes”
• Combines simple frequency-count ideas with commonly known numeric outlier detection methods
  • Dr. Koufakou’s paper suggests just cosine similarity, but we expanded to a more general numeric approach
• 2 phases of ODMAD:
  • Phase 1, categorical phase: finds infrequencies in categorical values and in combinations of categorical values
  • Phase 2, numeric phase: subsets the numeric portion of the data based on categorical values and performs numeric outlier detection on each subset
    • Subsetting reduces the risk of “masking”

Definitions
• Categorical Score: measures the “infrequentness” of each observation
  • Low values = not a categorical outlier
  • High values = categorical outlier
• Dimension: number of variables used at a given time
  • E.g. if looking at total frequencies of 2 variables combined, SEX and RACE, this would be checking categorical outliers at a dimension of 2.
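To make the Dimension definition concrete, here is a minimal Python sketch that counts value frequencies for one variable alone and for two variables combined. The toy records and the helper name `frequencies` are illustrative assumptions; they are not taken from the slides or from Dr. Koufakou's implementation.

```python
from collections import Counter
from itertools import combinations

# Hypothetical toy records; the variable names echo the SEX/RACE example.
records = [
    {"SEX": "F", "RACE": "Asian"},
    {"SEX": "F", "RACE": "White"},
    {"SEX": "M", "RACE": "Asian"},
    {"SEX": "M", "RACE": "Asian"},
]


def frequencies(records, variables):
    """Count how often each combination of values of `variables` occurs."""
    return Counter(tuple(r[v] for v in variables) for r in records)


# Dimension 1: a single categorical variable on its own.
freq_sex = frequencies(records, ("SEX",))

# Dimension 2: SEX and RACE combined.
freq_sex_race = frequencies(records, ("SEX", "RACE"))

# Every variable set at a given dimension can be enumerated with combinations().
dimension_2_sets = list(combinations(["SEX", "RACE"], 2))

print(freq_sex)
print(freq_sex_race)
```

Keying each count by a tuple of values keeps the same helper usable at any dimension: only the length of `variables` changes.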
Example
• Suppose we have data: [data table shown on slide]
• And we want to find outliers
  • We have both categorical and numeric variables
  • We know nothing about this dataset
  • Want to treat it as unsupervised

Getting started with Phase 1 - Categorical
• Initialize the Categorical Score to 0 for every observation
• Set thresholds:
  • Need a minimum allowed frequency
    • Typically set to 2% of the total observations
  • Need a cut-off value for the Categorical Score, above which an observation is considered an outlier
• An observation’s Categorical Score will increase if there is an infrequency
• The Categorical Score is weighted according to the frequency and the number of variables that caused the infrequency:

                       Low Frequency    High Frequency
  Smaller Dimension    Higher Impact    No Impact
  Higher Dimension     Lower Impact     No Impact

Example: Phase 1 - Categorical
• First it will scan all the individual categorical variables for infrequencies
  Total Frequencies: ‘F’: 717, ‘M’: 776, ‘Rather not Say’: 5

Example: Phase 1 - Categorical
• Set the frequency threshold:
  • Total observations in dataset = 1498
  • Frequency threshold = 0.02 × 1498 ≈ 30
• Frequencies:
  • 717 occurrences of ‘F’
  • 776 occurrences of ‘M’
  • 5 occurrences of ‘Rather not Say’
• If a frequency is less than the threshold:
  • Need to increment the observation’s categorical score:
  • $\text{Score} = \text{Score} + \dfrac{1}{\text{Frequency} \times \text{Dimensions}}$
• Since ‘Rather not Say’ has a frequency less than 30, need to increment the score for all observations with ‘Rather not Say’:
  • $\text{Score} = 0 + \dfrac{1}{5 \times 1} = 0.2$

Example: Phase 1 - Categorical
• Then we finish the same process on the rest of the categorical variables
  Total Frequencies: ‘F’: 717, ‘M’: 776, ‘Rather not Say’: 5
  Total Frequencies: ‘American Indian or Alaska Native’: 237, ‘Asian’: 776, ‘Black or African American’: 297, …Etc…
  Total Frequencies: ‘America’: 861, ‘Canada’: 251, ‘India’: 388
  Total Frequencies: ‘Hispanic or Latino’: 779, ‘Not Hispanic or Latino’: 721
  Total Frequencies: ‘Reference’: 721, ‘Test’: 779

Example: Phase 1 - Categorical
• Once all individual variables have been scanned, check combinations of those variables (loop to the next dimension)

  SEX    RACE               FREQUENCY
  F      Asian              1
  F      Black              142
  F      White              274
  M      American Indian    237
  M      Asian              384
  M      Black              154
  …etc…

Example: Phase 1 - Categorical

  SEX    ETHNIC          FREQUENCY
  F      Hispanic        1
  F      Not Hispanic    716
  M      Hispanic        775
  M      Not Hispanic    1

Example: Phase 1 - Categorical

  SEX    ACTARM       FREQUENCY
  F      Reference    716
  M      Reference    1
  M      Test         775
  …etc…

Example: Phase 1 - Categorical
• Perform the same process across all variable combinations:
  • Check the frequencies of each combination
  • If a frequency is less than the threshold, increment the categorical score
  • Set Dimensions to the number of variables used
    • Recall: $\text{Score} = \text{Score} + \dfrac{1}{\text{Frequency} \times \text{Dimensions}}$
    • E.g. if you used SEX and RACE, then Dimensions equals 2
    • E.g. if you used SEX, RACE, and ETHNIC, then Dimensions equals 3
• Typically stop at Dimensions of 3
  • Beyond that, stops providing additional information
  • Too many Dimensions = potential to overly flag infrequencies (increases Type I error)
  • Want to keep run time low

Example: Phase 1 - Categorical
• When the entire scan is finished:
  • Look at the final categorical scores
  • Set a cut-off for the categorical score to identify outlying values
  [Plot of final categorical scores; annotation: Infrequent Observations]

Example: Phase 2 - Numeric
• Subset numeric variables by categorical values and test for outliers
  • Using subsets rather than the full numeric data reduces the risk of masking
  • Only subset on frequent categorical values
• Any numeric outlier detection technique may be used on the subsets
  • If there is an outlier in any of the subsets, that observation gets flagged
  • If there are a lot of numeric variables, cosine similarity is suggested, as it does not have distributional assumptions
    • Doesn’t perform as well in low dimensions

Example: Phase 2 - Numeric
• Subset observations of numeric variables according to categorical values and look for outliers within each subset
  [Plot panels: subset where SEX = ‘M’; subset where SEX = ‘F’]

Example: Phase 2 - Numeric
• Subset observations of numeric variables according to categorical values and look for outliers within each subset
  [Plot panels: subset where RACE = ‘American Indian’; subset where RACE = ‘White’; …etc…]
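The two phases walked through above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the GCE Solutions or dissertation implementation: the function names, the toy rows, the AGE variable, and the use of Tukey-style IQR fences as the numeric method are all choices made here for illustration (the slides allow any numeric outlier technique). Phase 1 applies the update Score = Score + 1 / (Frequency × Dimensions) to every infrequent value combination up to dimension 3, with no pruning of combinations.

```python
from collections import Counter
from itertools import combinations
from statistics import quantiles


def categorical_scores(rows, cat_vars, min_freq, max_dim=3):
    """Phase 1: add 1 / (frequency * dimension) for each infrequent value set."""
    scores = [0.0] * len(rows)
    for dim in range(1, max_dim + 1):
        for var_set in combinations(cat_vars, dim):
            keys = [tuple(row[v] for v in var_set) for row in rows]
            freq = Counter(keys)
            for i, key in enumerate(keys):
                if freq[key] < min_freq:
                    scores[i] += 1.0 / (freq[key] * dim)
    return scores


def numeric_flags(rows, cat_vars, num_var, min_freq, k=1.5):
    """Phase 2: flag rows outside the IQR fences of any frequent subset."""
    flagged = set()
    for var in cat_vars:
        groups = {}
        for i, row in enumerate(rows):
            groups.setdefault(row[var], []).append(i)
        for members in groups.values():
            # Only subset on frequent categorical values (and need >= 2 points).
            if len(members) < max(min_freq, 2):
                continue
            values = [rows[i][num_var] for i in members]
            q1, _, q3 = quantiles(values, n=4, method="inclusive")
            low, high = q1 - k * (q3 - q1), q3 + k * (q3 - q1)
            flagged.update(
                i for i in members if not low <= rows[i][num_var] <= high
            )
    return flagged


# Hypothetical toy data: SEX is categorical, AGE is numeric.
rows = (
    [{"SEX": "F", "AGE": a} for a in (30, 31, 32, 33, 95)]
    + [{"SEX": "M", "AGE": a} for a in (40, 41, 42, 43)]
    + [{"SEX": "Rather not Say", "AGE": 35}]
)

# min_freq=2 suits this tiny example; the slides suggest 2% of observations.
scores = categorical_scores(rows, ["SEX"], min_freq=2)
flagged = numeric_flags(rows, ["SEX"], "AGE", min_freq=2)
print(scores[9], sorted(flagged))
```

In this toy run the single ‘Rather not Say’ row receives a categorical score of 1 / (1 × 1) = 1, and the AGE of 95 is flagged only because it is compared against the other SEX = ‘F’ rows rather than the full dataset. A cut-off on `scores` would still need to be chosen separately, as described in the Phase 1 slides.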
Concluding Remarks
• Provides flexibility
  • Categorical and numeric data
  • Diverse numerical approaches
• Simple calculation
• Useful for many applications, for example:
  • Detecting typos
  • Finding how identifiable a patient is with sensitive data

Acknowledgements
• Anna Koufakou
  • Author of the dissertation introducing ODMAD
• Abhay Thakur
  • Contributor towards implementing ODMAD at GCE Solutions

Questions?