Structure Learning by Clustering on Joint Probability Models

Ahmed El-Kishky, The University of Tulsa
Faculty Advisor: David Jensen
Graduate Mentors: David Arbour, James Atwood
ABSTRACT
While several constraint-based algorithms exist to discover the underlying joint probability distribution of feature data, the data is often best modeled by a combination of several joint probability distributions. We present ASIA (Automatic Subgroup Identification Algorithm), a novel non-parametric algorithm for partitioning sets of data by explicitly applying clustering algorithms to the underlying joint probability models (JPMs).
STRUCTURE LEARNING WITH NAÏVE PARTITIONS
Natural partitions often arise within data. A naïve assumption is that these natural partitions reflect subpopulations. This approach presumes that a naturally defined partitioning of the given data exists, which is often not the case.
[Figure: Naively partitioned data set]
CLUSTERING ON JOINT PROBABILITY MODELS
For each data point, the vector of probabilities that the point was generated by each model defines a natural metric under Euclidean distance. As such, both models likely to have generated the data point and models unlikely to have generated it are considered when clustering.
Constraint-based structure learning is then performed for each partition, resulting in a JPM for each.
[Figure: JPM learned for each partition]
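As a minimal sketch of this metric, the snippet below represents candidate JPMs as Gaussian densities (a stand-in for the learned graphical models; the model parameters and helper names are hypothetical) and compares two points by the Euclidean distance between their likelihood vectors.

```python
import numpy as np
from scipy.stats import multivariate_normal

# Two hypothetical candidate JPMs, represented here by Gaussian densities.
models = [(np.zeros(2), np.eye(2)), (3.0 * np.ones(2), np.eye(2))]

def likelihood_vector(x, models):
    # Probability density that point x was generated by each model.
    return np.array([multivariate_normal.pdf(x, mean=m, cov=c)
                     for m, c in models])

# The distance between two points is the Euclidean distance between
# their likelihood vectors, not between their raw features.
x1, x2 = np.array([0.1, -0.2]), np.array([2.9, 3.2])
d = np.linalg.norm(likelihood_vector(x1, models) - likelihood_vector(x2, models))
```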
HYPOTHESIS
 Datasets lack homogeneity and are best modeled by a combination of several joint probability distributions.
 Clustering on the joint probability distributions of feature data takes advantage of the data’s dynamics and thus provides a better partitioning than clustering on feature data.
RESEARCH OBJECTIVES
 Develop Algorithm – Discover an appropriate partitioning of the data and learn the joint probability distribution over each partition.
 Demonstrate Effectiveness – Show that the algorithm finds a better partitioning of the data than naïve partitions.
 Justify Use – Show that large data sets are generally better modeled by multiple joint probability distributions.
CLUSTERING ON FEATURE DATA
Another natural partitioning can be created by clustering over feature data. By defining a notion of “distance” through a metric, individuals from the dataset are grouped into clusters. For each cluster, a joint probability model is learned.
[Figure: JPM learned on each cluster]
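A minimal sketch of this baseline, assuming toy Gaussian feature data and using k-means as the feature-space clustering step (per-cluster Gaussian estimates stand in for the learned graphical models):

```python
import numpy as np
from sklearn.cluster import KMeans

# Toy feature data; in practice this would be the real dataset.
X = np.random.default_rng(0).normal(size=(300, 4))

# Cluster directly on feature vectors using an ordinary distance metric.
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

# Fit one joint probability model per cluster (Gaussians as stand-ins).
cluster_models = {
    k: (X[labels == k].mean(axis=0), np.cov(X[labels == k], rowvar=False))
    for k in range(3)
}
```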
CONSTRAINT-BASED STRUCTURE LEARNING
To learn the joint probability models from data, a hybrid approach is used: constraint-based structure learning recovers the undirected graph skeleton, which is then oriented using the Max-Min Hill-Climbing algorithm.
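The full hybrid procedure is beyond a short example, but the sketch below illustrates the constraint-based skeleton phase under simplifying assumptions: Gaussian data, Fisher-z partial-correlation independence tests, and conditioning sets of size at most one. It is a stand-in, not the poster’s full MMHC machinery, and the edge-orientation phase is omitted.

```python
import numpy as np
from itertools import combinations
from scipy.stats import norm

def skeleton(X, alpha=0.05):
    """Simplified constraint-based skeleton learner: drop edge i-j
    whenever i and j test independent marginally or given any one
    other variable."""
    n, p = X.shape
    R = np.corrcoef(X, rowvar=False)
    crit = norm.ppf(1 - alpha / 2)

    def independent(i, j, k=None):
        if k is None:
            r, cond = R[i, j], 0
        else:
            # First-order partial correlation of i and j given k.
            r = (R[i, j] - R[i, k] * R[j, k]) / np.sqrt(
                (1 - R[i, k] ** 2) * (1 - R[j, k] ** 2))
            cond = 1
        z = 0.5 * np.log((1 + r) / (1 - r))  # Fisher z-transform
        return np.sqrt(n - cond - 3) * abs(z) < crit

    edges = set(combinations(range(p), 2))
    for i, j in list(edges):
        others = (k for k in range(p) if k not in (i, j))
        if independent(i, j) or any(independent(i, j, k) for k in others):
            edges.discard((i, j))
    return edges
```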
AUTOMATIC SUBGROUP IDENTIFICATION ALGORITHM (ASIA)
ASIA groups sets of data by explicitly considering their underlying joint distributions during clustering. This approach allows for accurate groupings in which membership corresponds to dependency relationships between variables.
With each iteration, data instances are moved to the clusters most likely to have generated them, and the mean joint probability model of each cluster is then recalculated. This process is repeated for a fixed number of iterations or until convergence.
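A minimal sketch of this loop, assuming Gaussian JPMs as stand-ins for the learned graphical models (the function and parameter names are hypothetical):

```python
import numpy as np
from scipy.stats import multivariate_normal

def asia_iterations(X, K, n_iter=20, seed=0):
    """EM-style sketch of ASIA's reassignment loop."""
    rng = np.random.default_rng(seed)
    labels = rng.integers(0, K, size=len(X))
    for _ in range(n_iter):
        # Re-fit the mean joint probability model of each cluster.
        models = []
        for k in range(K):
            Xk = X[labels == k]
            if len(Xk) < 2:  # guard against (near-)empty clusters
                Xk = X[rng.choice(len(X), size=5, replace=False)]
            models.append((Xk.mean(axis=0),
                           np.cov(Xk, rowvar=False) + 1e-6 * np.eye(X.shape[1])))
        # Likelihood of every instance under every cluster's model.
        L = np.column_stack([multivariate_normal.pdf(X, mean=m, cov=c)
                             for m, c in models])
        # Move each instance to the model most likely to have generated it.
        new_labels = L.argmax(axis=1)
        if np.array_equal(new_labels, labels):  # converged
            break
        labels = new_labels
    return labels
```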
DISTANCE OVER LIKELIHOOD MATRIX
ASIA’s central data structure is a “likelihood
matrix” containing the probability that each data
point was generated by each of the models.
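Concretely, for N data points and K current models this is an N x K matrix; a minimal construction under the same Gaussian stand-in assumption as above:

```python
import numpy as np
from scipy.stats import multivariate_normal

def likelihood_matrix(X, models):
    """Entry (i, k): probability density of data point i under model k.
    Rows are the likelihood vectors that ASIA clusters over."""
    return np.column_stack([multivariate_normal.pdf(X, mean=m, cov=c)
                            for m, c in models])
```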
RESULTS AND CONCLUSION
Preliminary results suggest that partitioning and structure learning with ASIA outperform clustering over feature data. With each iteration, ASIA appears to provide a better representation of the data’s underlying JPM. This may be explained by the lack of homogeneity when clustering on feature data: because feature data often consists of various data types, clustering on it is highly sensitive to the distance metric chosen.
FUTURE WORK
 Making ASIA fully non-parametric by clustering based on a Dirichlet process extension.
 Running ASIA on various real-world and synthetic datasets.