Structure Learning by Clustering on Joint Probability Models
Ahmed El-Kishky, The University of Tulsa
Faculty Advisor: David Jensen
Graduate Mentors: David Arbour, James Atwood

ABSTRACT
While several constraint-based algorithms exist to discover the underlying joint probability distribution of feature data, the data is often best modeled by a combination of several joint probability distributions. We present ASIA (Automatic Subgroup Identification Algorithm), a novel non-parametric algorithm for partitioning sets of data by explicitly applying clustering algorithms to the underlying joint probability models (JPMs).

HYPOTHESIS
Datasets lack homogeneity and are best modeled by a combination of several joint probability distributions. Clustering on the joint probability distributions of feature data exploits the data's dynamics and thus provides a better partitioning than clustering on the feature data itself.

RESEARCH OBJECTIVES
Develop Algorithm – Discover an appropriate partitioning of the data and learn the joint probability distribution over each partition.
Demonstrate Effectiveness – Show that the algorithm finds a better partitioning of the data than naïve partitions.
Justify Use – Show that large datasets are generally better modeled by multiple joint probability distributions.

STRUCTURE LEARNING WITH NAÏVE PARTITIONS
Natural partitions often arise within data, and a naïve assumption is that these partitions reflect subpopulations. This approach presumes that a naturally defined partitioning of the given data exists, which is often not the case. [Figures: naïvely partitioned data set; JPM learned for each partition.]

CLUSTERING ON FEATURE DATA
Another natural partitioning can be created by clustering over the feature data. By defining a notion of "distance" with a metric, individuals in the dataset are grouped into clusters, and a joint probability model is learned for each cluster. [Figure: JPM learned on each cluster.]

CONSTRAINT-BASED STRUCTURE LEARNING
To learn the joint probability models from data, the hybrid Max-Min Hill-Climbing (MMHC) algorithm is used: a constraint-based phase learns the undirected graph skeleton, and a score-based hill-climbing search then orients the edges (see the third sketch below).

CLUSTERING ON JOINT PROBABILITY MODELS
ASIA groups sets of data by explicitly considering their underlying joint distributions during clustering. This allows accurate groupings in which cluster membership corresponds to dependency relationships between variables. With each iteration, data instances are moved to the cluster whose model is most likely to have generated them, and each cluster's joint probability model is then recalculated. The process repeats for a fixed number of iterations or until convergence, after which constraint-based structure learning is performed on each partition, yielding a JPM for each (see the second sketch below).

DISTANCE OVER LIKELIHOOD MATRIX
ASIA's central data structure is a "likelihood matrix" containing the probability that each data point was generated by each of the models. The Euclidean distance between the vectors of these probabilities defines a natural metric, so clustering considers both the models likely to have generated a data point and those unlikely to have generated it (see the first sketch below).
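The following is a minimal sketch of the likelihood matrix and the distance it induces, assuming model objects that expose a scipy-style .pdf() method; the function names likelihood_matrix and likelihood_distance are illustrative, not from the poster.

```python
import numpy as np

def likelihood_matrix(data, models):
    """n_points x n_models matrix L, where L[i, k] is the (row-normalized)
    probability that model k generated data point i. `models` are assumed
    to expose a scipy-style .pdf() method."""
    L = np.column_stack([m.pdf(data) for m in models])
    return L / L.sum(axis=1, keepdims=True)  # each row sums to 1

def likelihood_distance(L, i, j):
    """Euclidean distance between the likelihood vectors of points i and j:
    small when the same models are likely *and* unlikely for both points."""
    return np.linalg.norm(L[i] - L[j])
```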
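A minimal sketch of the ASIA iteration under stated assumptions: multivariate Gaussians stand in for the constraint-learned JPMs (the poster learns Bayesian-network models per partition), every cluster is assumed to stay non-empty, and all names are hypothetical.

```python
import numpy as np
from scipy.stats import multivariate_normal

def asia(data, n_clusters, max_iters=50, seed=0):
    """Sketch of the ASIA loop: refit one model per cluster, move each
    instance to the model most likely to have generated it, repeat until
    assignments stop changing or the iteration budget runs out."""
    rng = np.random.default_rng(seed)
    labels = rng.integers(n_clusters, size=len(data))  # naive initial partition
    d = data.shape[1]
    for _ in range(max_iters):
        # Gaussian stand-in for the per-partition JPM; assumes each
        # cluster keeps at least two points so the covariance is defined.
        models = [
            multivariate_normal(
                data[labels == k].mean(axis=0),
                np.cov(data[labels == k].T) + 1e-6 * np.eye(d),  # regularize
            )
            for k in range(n_clusters)
        ]
        L = np.column_stack([m.pdf(data) for m in models])  # likelihood matrix
        new_labels = L.argmax(axis=1)  # reassign to the likeliest model
        if np.array_equal(new_labels, labels):
            break  # converged: no instance changed cluster
        labels = new_labels
    return labels, models
```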
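For the per-partition structure learning step, the poster names MMHC but no implementation; pgmpy's MmhcEstimator is one off-the-shelf option, shown here as an assumption rather than the authors' tooling.

```python
import pandas as pd
from pgmpy.estimators import MmhcEstimator

def learn_jpm(partition: pd.DataFrame):
    """Learn a Bayesian-network structure for one partition with MMHC:
    a constraint-based skeleton phase followed by score-based
    hill-climbing edge orientation."""
    return MmhcEstimator(partition).estimate()
```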
RESULTS AND CONCLUSION
Preliminary results suggest that partitioning and structure learning with ASIA outperform clustering over feature data. With each iteration of ASIA, the algorithm appears to provide a better representation of the data's underlying JPM. This may be explained by the lack of homogeneity when clustering on feature data: because feature data often mixes several data types, such clustering is very sensitive to the chosen distance metric.

FUTURE WORK
Making ASIA fully non-parametric by clustering with a Dirichlet process extension (one possible prototype is sketched below).
Running ASIA on various real-world and synthetic datasets.
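One way the Dirichlet-process extension could be prototyped is with scikit-learn's truncated Dirichlet-process mixture, which infers the number of active clusters from the data; this is a sketch under that assumption, not the poster's planned method, and the data here is a placeholder.

```python
import numpy as np
from sklearn.mixture import BayesianGaussianMixture

# Placeholder data; a truncated Dirichlet-process mixture lets the number
# of active clusters be inferred rather than fixed in advance.
X = np.random.default_rng(0).normal(size=(500, 4))
dpgmm = BayesianGaussianMixture(
    n_components=10,  # truncation level, not the final cluster count
    weight_concentration_prior_type="dirichlet_process",
    random_state=0,
).fit(X)
labels = dpgmm.predict(X)  # initial partition for per-cluster JPM learning
```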