Finite Mixture Models for Regression Problems
Hien Duy Nguyen
Bachelor of Economics and Bachelor of Science (Hons I)
A thesis submitted for the degree of Doctor of Philosophy at
The University of Queensland in 2015
School of Mathematics and Physics
Abstract
Finite mixture models (FMMs) are a ubiquitous tool for the analysis of heterogeneous data
across a broad number of fields including agriculture, bioinformatics, botany, cell biology, economics, fisheries research, genetics, genomics, geology, machine learning, medicine, palaeontology, psychology, and zoology, among many others. Due to their flexibility, FMMs can be used
to cluster data, classify data, estimate densities, and increasingly, they are also being used to
conduct regression analysis and to analyze regression outcomes. There is now an expansive
literature on the usage of FMMs for regression, as well as a broad demand for the development
of such methods for the analysis of new and complex data.
This thesis begins with a summary of the current literature on FMMs and their applications
to regression problems. Here, the mixture of regression models (MRMs), cluster-weighted models
(CWMs), mixtures of experts (MoEs), and mixtures of linear mixed effects models (MLMMs),
as well as other variants of FMMs for regression analysis are introduced. Various properties
such as denseness and identifiability, as well as maximum likelihood (ML) estimation techniques
such as the expectation–maximization (EM) and minorization–maximization (MM) algorithms
are discussed, and a review is presented regarding asymptotic inference and model selection in
FMMs. A new result on the characterization of a t linear CWM (LCWM) is also presented.
Some new applications of FMMs to regression problems are then discussed.
Firstly, a series of models based on FMMs are presented for the clustering and classification
of sparsely sampled bivariate functional data. These methods are named mixture of spatial
spline regression (MSSR) and MSSR discriminant analysis (MSSRDA). MSSR is constructed
using the theory of MLMMs and spatial splines, and an EM algorithm for the ML estimation
of the model is presented. MSSRDA is then constructed by combining MSSR with the mixture
discriminant analysis framework for classification. The methods are tested on their ability to
cluster and classify simulated data. An example application to handwritten digit recognition
is then presented. Here, it is shown that MSSR and MSSRDA perform comparably to currently
available methods, and outperform said methods in missing data scenarios.
Secondly, an FMM is used to produce a false discovery rate (FDR) control procedure for magnetic resonance imaging (MRI) data. In MRI data analysis, millions of hypotheses are often
tested simultaneously, resulting in inflated numbers of false positive results. Many of the available FDR techniques for MRI data either do not take into account the spatial structure or rely
on difficult-to-verify assumptions and user-specified parameters. To address these shortcomings,
the Markov random field (MRF) FDR (MRF-FDR) technique is presented. MRF-FDR uses a
Gaussian mixture model (GMM) to perform FDR control based on empirical-Bayesian principles. An MRF is then used to make the outcome of the GMM spatially coherent. The MRF is
fitted using maximum pseudolikelihood estimation, and the pseudolikelihood information criterion
is used to automatically specify the MRF model. MRF-FDR is shown to perform favorably
in simulations against some currently used methods. An application to the PATH study data
is presented, which shows that MRF-FDR generates inference that is clinically consistent with
the available literature on brain aging.
Thirdly, a new mixture of linear experts (MoLE) model is constructed using the Laplace error
model; this is named the Laplace MoLE (LMoLE) model. An MM algorithm for the ML estimation of the LMoLE model is constructed, which can be generalized for the monotonic likelihood
maximization of any MoE. Theoretical properties such as identifiability and consistency are
proven for the LMoLE, and connections are drawn between the LMoLE and the least absolute
deviation regression criterion. Through simulations, the consistency of the ML estimator for
the LMoLE is demonstrated. Results regarding the robustness of the LMoLE model against
the more popular Gaussian MoLE are also provided. An application to a climate science data
set is used to demonstrate the utility of the model.
Finally, the Gaussian LCWM is extended via the linear regression characterization (LRC) of
the GMM. The LRC is shown to be equivalent to the Gaussian LCWM. An MM algorithm that
requires no matrix operations is constructed for the ML estimation of the LRC GMM. The MM
algorithm is shown to monotonically increase the likelihood and to converge to a stationary
point of the likelihood function. The ML estimator of the LRC GMM is proven to be consistent
and asymptotically normal, thus providing an alternative proof of these properties for GMMs. A simple procedure
for the mitigation of singularities in ML estimation via the LRC is discussed. Simulations are
used to provide evidence that the MM algorithm based on the LRC may improve upon the
performance of the traditional EM algorithm for GMMs, in some practically relevant scenarios.
Declaration by author
This thesis is composed of my original work, and contains no material previously published or
written by another person except where due reference has been made in the text. I have clearly
stated the contribution by others to jointly-authored works that I have included in my thesis.
I have clearly stated the contribution of others to my thesis as a whole, including statistical
assistance, survey design, data analysis, significant technical procedures, professional editorial
advice, and any other original research work used or reported in my thesis. The content of my
thesis is the result of work I have carried out since the commencement of my research higher
degree candidature and does not include a substantial part of work that has been submitted
to qualify for the award of any other degree or diploma in any university or other tertiary
institution. I have clearly stated which parts of my thesis, if any, have been submitted to
qualify for another award.
I acknowledge that an electronic copy of my thesis must be lodged with the University Library
and, subject to the policy and procedures of The University of Queensland, the thesis be made
available for research and study in accordance with the Copyright Act 1968 unless a period of
embargo has been approved by the Dean of the Graduate School.
I acknowledge that copyright of all material contained in my thesis resides with the copyright
holder(s) of that material. Where appropriate I have obtained copyright permission from the
copyright holder to reproduce material in this thesis.
Publications during candidature
Published
Nguyen, H. D. and Wood, I. A. (2012). Variable selection in statistical models using population-based incremental learning with applications to genome-wide association studies. In Proceedings
of the 2012 IEEE Congress on Evolutionary Computation (CEC).
Nguyen, H. D., Hill, M. M., and Wood, I. A. (2012). A robust permutation test for quantitative
SILAC proteomics experiments. Journal of Integrated OMICS, 2:80–93.
Inder, K. L., Zheng, Y. Z., Davis, M. J., Moon, H., Loo, D., Nguyen, H., Clements, J. A.,
Parton, R. G., Foster, L. J., and Hill, M. M. (2012). Expression of PTRF in PC-3 cells modulated cholesterol dynamics and actin cytoskeleton impacting secretion pathways. Molecular
and Cellular Proteomics, 11:M111.012245.
Nguyen, H. D., Janke, A. L., Cherbuin, N., McLachlan, G. J., Sachdev, P., and Anstey, K.
J. (2013). Spatial false discovery rate control for magnetic resonance imaging studies. In
Proceedings of the 2013 Digital Imaging: Techniques and Applications (DICTA) conference.
Nguyen, H. D., McLachlan, G. J., Cherbuin, N., and Janke, A. L. (2014). False discovery rate
control in magnetic resonance imaging studies via Markov random fields. IEEE Transactions
on Medical Imaging, 33:1735–1748.
Nguyen, H. D. and McLachlan, G. J. (2014). Asymptotic inference for hidden process regression
models. In Proceedings of the 2014 IEEE Statistical Signals Processing Workshop.
Lloyd-Jones, L. R., Nguyen, H. D., Wang, Y-G., and O’Neill, M. F. (2014). Improved estimation
of size-transition matrices using tag-recapture data. Canadian Journal of Fisheries and Aquatic
Sciences, 71:1385–1394.
Chen, D., Shah, A., Nguyen, H., Loo, D., Inder, K., and Hill, M. (2014). Online quantitative proteomics p-value calculator for permutation-based statistical testing of peptide ratios.
Journal of Proteome Research, 13:4184–4191.
Nguyen, H. D., McLachlan, G. J., and Wood, I. A. (2015). Mixtures of spatial spline regressions
for clustering and classification. Computational Statistics and Data Analysis, In press.
Nguyen, H. D. and McLachlan, G. J. (2015). Laplace mixture of linear experts. Computational
Statistics and Data Analysis, In press.
Nguyen, H. D. and Wood, I. A. (2015). Asymptotic normality of the maximum pseudolikelihood
estimator for fully-visible Boltzmann machines. IEEE Transactions on Neural Networks and
Learning Systems, In press.
Under Review
Nguyen, H. D. and McLachlan, G. J. Maximum likelihood estimation of Gaussian mixture
models without matrix operations. Submitted to Advances in Data Analysis and Classification.
Oyarzun, C., Sanjuro, A., and Nguyen, H. Response functions. Submitted to Journal of Economic Theory.
Publications included in this thesis
Nguyen, H. D., McLachlan, G. J., and Wood, I. A. (2015). Mixtures of spatial spline regressions for clustering and classification. Computational Statistics and Data Analysis, In press.
Incorporated as Chapter 2.
Contributor: Statement of contribution
Author Nguyen: Designed methodology (100%); wrote and edited the article (90%); analyzed data (90%)
Author McLachlan: Wrote and edited the article (5%); analyzed data (5%)
Author Wood: Wrote and edited the article (5%); analyzed data (5%)
Nguyen, H. D., McLachlan, G. J., Cherbuin, N., and Janke, A. L. (2014). False discovery rate
control in magnetic resonance imaging studies via Markov random fields. IEEE Transactions
on Medical Imaging, 33:1735–1748. Incorporated as Chapter 3.
Contributor: Statement of contribution
Author Nguyen: Designed methodology (90%); wrote and edited the article (85%); analyzed data (80%)
Author McLachlan: Designed methodology (5%); wrote and edited the article (5%); analyzed data (10%)
Author Cherbuin: Wrote and edited the article (5%); collected and processed data (50%)
Author Janke: Designed methodology (5%); wrote and edited the article (5%); analyzed data (10%); collected and processed data (50%)
Nguyen, H. D. and McLachlan, G. J. (2015). Laplace mixture of linear experts. Computational
Statistics and Data Analysis, In press. Incorporated as Chapter 4.
Contributor: Statement of contribution
Author Nguyen: Designed methodology (100%); wrote and edited the article (90%); analyzed data (90%)
Author McLachlan: Wrote and edited the article (10%); analyzed data (10%)
Nguyen, H. D. and McLachlan, G. J. Maximum likelihood estimation of Gaussian mixture
models without matrix operations. Submitted to Advances in Data Analysis and Classification.
Incorporated as Chapter 5.
Contributor: Statement of contribution
Author Nguyen: Designed methodology (100%); wrote and edited the article (90%); analyzed data (90%)
Author McLachlan: Wrote and edited the article (10%); analyzed data (10%)
Contributions by others to the thesis
Chapters 1 and 6 were written entirely by the candidate, with editorial advice provided by Geoffrey
McLachlan, Ian Wood, and Andrew Janke. Along with the listed authors, three anonymous
reviewers contributed to Chapter 2, three anonymous reviewers contributed to Chapter 3, two
anonymous reviewers contributed to Chapter 4, and one anonymous reviewer contributed to
Chapter 5.
Statement of parts of the thesis submitted to qualify for the award of another
degree
None
Acknowledgments
The completion of this thesis could not have been achieved without the help and support of
many friends, colleagues, and teachers whom I wish to thank wholeheartedly. I shall start by
acknowledging my advisors Professor Geoff McLachlan, Dr. Ian Wood, and Dr. Andrew Janke.
Thank you Geoff for being a wonderful supervisor. Your enthusiasm for statistics, your wealth
of knowledge, your diligence, and your productiveness inspire me to learn more and work
harder, in an attempt to emulate your achievements. I greatly appreciate all of the time you
have spent politely listening to all of my bad ideas, and all of the effort that you have put
into fixing my bad grammar. Most of all, I appreciate your friendship; a supervisor who is as
passionate about statistical history as he is about Formula One racing is hard to find. I am
overjoyed that you have given me the opportunity to continue our relationship into the future.
Thank you Ian for being a role model and a friend. You were the first academic that I felt
comfortable speaking to casually, and the first teacher to introduce me to the wealth of statistics
beyond what is presented in undergraduate courses. I am thankful for all of the times you let
me rant and the way you respectfully acknowledge my crazy opinions. I would be less of a
researcher if you hadn’t guided me through both my honors and my PhD studies. I greatly
appreciate all of your support and your encouragement for me to do better, even when I am
resistant to the idea.
Thank you Andrew for not being a statistician. It is always refreshing to hear your point of
view and to discuss the relevance, or lack thereof, of whatever idiosyncratic ideas that I am
currently obsessing over. I am indebted to you for the financial help that you have been able to
provide me, and for the many opportunities that you have given me, to look beyond my familiar
horizon. I appreciate all of the things you have taught me, from the writing of a grant, to the
reading of one’s document backwards to check for errors. I believe that both of these skills will
be invaluable in my future research career. I look forward to our future work together.
Thank you to my fellow PhD students Andrew Jones, Kate Helmstedt, Luke Lloyd-Jones, Julia
Kuhn, Nirav Shah, Chris van der Heide, Will Pettersson, and Amy Chan, and to my office
mates Mingzhu Sun, Sharina Ismail, and Sam Kault. It has been a pleasure to spend the last
three years or more getting to know you; your company makes the usual drudgery of writing a
PhD thesis almost pleasant and enjoyable.
Luke, you are my oldest friend at the university, and I am very glad that you have decided to
stay on campus upon finishing your PhD. I have thoroughly enjoyed our many conversations
over coffee, and our numerous discussions on the philosophy of statistics, the state of modern
politics, the virtues of instant coffee, and everything in between. Your thoughts on statistics,
and your life advice have often proven much more useful than I care to admit. I hope that we
can be friends and colleagues into the future, mainly for pasta and pizza benefits.
Thank you Andy and Kate, for your enduring friendship and your comradery throughout the
candidature. Having been through both our honors and PhD years together, I feel that we share
a sense of understanding and solidarity that could not have been forged in any other situation.
The number of conversations that we have had over coffees and drinks is innumerable, and I
anticipate that the same can be said in the future tense.
Next, to my friends Han Mai, Cora Chen, Lam Le-Huy, and Nicholas Culpitt. Thank you for
your perseverance with me, in the face of absence and seeming apathy. I am very fortunate to have
such strong friendships, and such tolerant friends.
Han, thank you for your companionship through these trying years. You, more than anyone
else, have had to endure me at my worst. I cannot thank you enough for your support, which
I could not have done without. You have been a constant friend in turbulent times.
I would like to extend a thank you to Dr Michelle Hill, Dr Carlos Oyarzun, and Dr Nicholas
Cherbuin, for giving me the chance to collaborate on your research. Special thanks must be
given to Dr Walter Abhayaratna for providing me with financial support and allowing me the
opportunity to participate in the ASPREE-ENVISION research project.
Thank you to everyone at the School of Mathematics and Physics, for providing me with such a
wonderful environment in which to do my research. I would like to extend a special thank you
to Professor Joseph Grotowski and Professor You-Gan Wang, who have always been supportive
of my research and activities.
Finally, to my parents Duyen Nguyen and Luyen Vo, and my sisters Melissa and Catherine,
thank you for being such strong anchors in my life. I am very fortunate to have such strong
support, and to be so loved. Thank you for always being there for me. There is not enough
space for me to fully convey my appreciation for all of you.
Keywords
Finite mixture models, linear regression models, mixed-effect models, maximum likelihood estimation, expectation–maximization algorithm, minorization–maximization algorithm, mixture
of experts, false discovery rate.
Australian and New Zealand Standard Research Classifications (ANZSRC)
ANZSRC code: 010401 Applied Statistics, 40%
ANZSRC code: 010405 Statistical Theory, 40%
ANZSRC code: 080106 Image Processing, 20%
Fields of Research (FoR) Classification
FoR code: 0104 Statistics, 80%
FoR code: 0801 Artificial Intelligence and Image Processing, 20%
Contents

Abstract
Contents
List of Figures
List of Tables
List of Abbreviations
1 Overview and Mathematical Background
  1.1 Overview
  1.2 Finite Mixture Models
    1.2.1 Gaussian Mixture Models
    1.2.2 Non-Gaussian Mixture Models on R^d
    1.2.3 Mixture Models on Subsets of R^d
    1.2.4 Cluster Analysis
  1.3 Mixture Models for Regression Data
    1.3.1 Linear Mixtures of Regression Models
    1.3.2 Generalized Linear Mixtures of Models
    1.3.3 Mixtures of Experts
    1.3.4 Mixtures of Linear Mixed Models
  1.4 Likelihood-Based Inference
    1.4.1 Expectation–Maximization Algorithms
    1.4.2 Minorization–Maximization Algorithms
    1.4.3 Identifiability
    1.4.4 Asymptotic Theory
    1.4.5 Determining the number of components
2 Mixtures of Spatial Spline Regressions for Clustering and Classification
  2.1 Introduction
  2.2 Spatial Spline Regressions
    2.2.1 Nodal Basis Functions
    2.2.2 Surface Approximation
  2.3 Spatial Spline Mixed Models
    2.3.1 Fitting SSMMs
  2.4 Mixture of Spatial Spline Regressions
    2.4.1 Fitting MSSRs
    2.4.2 Choosing g
  2.5 Clustering and Classification
    2.5.1 Clustering with MSSR
    2.5.2 Classification by MSSRDA
  2.6 Simulated Examples
    2.6.1 Surface Fitting
    2.6.2 Clustering
    2.6.3 Classification
  2.7 Handwritten Character Recognition
    2.7.1 Clustering
    2.7.2 Classification
  2.8 Conclusions
3 False Discovery Rate Control in Magnetic Resonance Imaging Studies via Markov Random Fields
  3.1 Introduction
  3.2 Noncontextual FDR Control
    3.2.1 Empirical Null Density Function
    3.2.2 Rejection Rule
    3.2.3 FDR Control
  3.3 Spatial FDR Control
    3.3.1 Rejection Rule
    3.3.2 Theoretical Justification
    3.3.3 FDR Control
  3.4 Estimation of Parameters and Quantities
    3.4.1 Noncontextual Model
    3.4.2 Spatial Model
    3.4.3 FDR Estimates
    3.4.4 Range Estimation
  3.5 Simulation Study
    3.5.1 Setup
    3.5.2 Threshold Selection
    3.5.3 False Discovery Rate Control
    3.5.4 Assessing Performance
    3.5.5 Results
  3.6 PATH Study
    3.6.1 Data Description
    3.6.2 Hypothesis Testing
    3.6.3 Technical Results
    3.6.4 Clinical Results
  3.7 Conclusion
  3.8 Proof of Theorems
  3.9 Algorithm Details
4 Laplace Mixture of Linear Experts
  4.1 Introduction
  4.2 Model Description and Parameter Estimation
    4.2.1 MM Algorithm
    4.2.2 Relationship to Least Absolute Deviations Regression
  4.3 Statistical Inference
    4.3.1 Identifiability
    4.3.2 Consistency of the MLE
    4.3.3 Estimation of Quantities of Interest
    4.3.4 Determining the Number of Components
  4.4 Simulation Studies
    4.4.1 Consistency of the MLE
    4.4.2 Robustness of the LMoLE
    4.4.3 Selection of g via Information Criteria
  4.5 Application
    4.5.1 Linear Mixtures of Laplace Experts
  4.6 Conclusions
5 Maximum Likelihood Estimation of Gaussian Mixture Models without Matrix Operations
  5.1 Introduction
  5.2 Linear Regression Characterization
    5.2.1 Gaussian mixture model
  5.3 Maximum Likelihood Estimation
    5.3.1 Minorization–Maximization Algorithms
    5.3.2 Covariance Constraints
  5.4 Statistical Properties
  5.5 Numerical Simulations
  5.6 Conclusions
6 Discussions and Conclusions
  6.1 Summary
  6.2 Discussions and Future Work
    6.2.1 Chapter 2
    6.2.2 Chapter 3
    6.2.3 Chapter 4
    6.2.4 Chapter 5
  6.3 Final Remarks
Bibliography
List of Figures

1.1 Histogram of Weldon's crab data, which was analyzed in Pearson (1894).
1.2 Fitted densities and component densities to Weldon's crab data, using estimates from Pearson (1894). The black curve is the overall ratio density, and the red curves are the two component densities.
1.3 In order from 1–4, each plot displays the 95% confidence ellipse of the components of a g = 4 component bivariate GMM with the model restrictions λI, λ_i I, λDAD^T, and λ_i D_i A_i D_i^T, respectively.
1.4 Univariate GMMs from Marron and Wand (1992). The order of the plots is the same as in Table 1.2. The black curve is the overall density, and the colored curves are the component densities.
1.5 Plot of n = 1000 uniformly sampled observations, colored by the value of τ_1(y), as per model (1.15). The blue and red circles are 95% confidence ellipses of the component 1 and 2 densities from model (1.15), respectively.
1.6 Realizations of the 5 simulation scenarios from Quandt (1972), as per Table 1.3. The blue and red lines are the conditional expectations of components 1 and 2, respectively.
1.7 Realizations of n = 450 observations from Example 2 of Ingrassia et al. (2012), using the parameters presented in Table 1.4. The red, blue, and green lines are the regression lines β_{i0} + β_{i1} x for components 1 to 3, respectively.
1.8 Mean functions of form (1.35) for models M1 and M2, from Chamroukhi et al. (2010). The red and blue curves correspond to M1 and M2, respectively.
2.1 Nodal basis function s(x; c, δ_1, δ_2), where c = (0, 0), δ_1 = 1 and δ_2 = 1.
2.2 The top row displays surface plots of the functions from (2.14) over R = [−1, 1] × [−1, 1] (in the order i = 1, 2, 3 from left to right). The bottom row displays respective example SSMM fitted mean surfaces for each i = 1, 2, 3, under scenario S2.
2.3 Cluster mean surfaces for data set B (without missing data) estimated by MSSR with g = 12.
3.1 Reference schematics for simulation study.
3.2 Sub-figures A1–A3 and B1–B3 display the relationships between c_1 and c_2, versus \widehat{mFDR}^0 and \hat{N}_R/n × 100% (percentage of hypotheses rejected), respectively, for simulations of scenarios S1E2–S3E2. The white space indicates values for which \widehat{mFDR}^0 could not be evaluated (i.e. when there are no rejections).
3.3 Sub-figures A and B display the p-value and z-score histograms for the PATH data hypothesis tests, where the blue, red and black lines are graphs of π_0 f_0(z_i), (1 − π_0) f_1(z_i), and f(z_i), respectively (as per Section 3.2). Sub-figures C and D show \widehat{mFDR}^0 and \hat{N}_R/n × 100% for various threshold combinations, respectively.
3.4 Sub-figures A1, A2 and A3 show FDR control by MRF-FDR using rule (3.11) at the α = 0.1 level, and B1, B2 and B3 show control by NC-FDR using rule (3.6) at the sample level. Sub-figures A1 and B1 are midsagittal slices, A2 and B2 are midcoronal slices, and A3 and B3 are midhorizontal slices. Red indicates rejected voxels, and vice versa for black.
4.1 Example of an n = 512 observation simulation (as described in Section 4.4.1), where each point has abscissa x_j and ordinate y_j, for j = 1, ..., n. The black curve is the mean function E_{θ_0}(Y | v, x) and the red lines are the component mean functions E_{θ_0}(Y | Z = i, x) for i = 1, 2 (as described in Section 4.3.3). The green curve and the blue lines are the respective functions, each evaluated at θ̂_n instead.
4.2 Plot of the estimated variance function (i.e. (4.20)) and estimated probability function versus the respective true functions. Here, the data and coordinates are the same as those which appear in Figure 4.1. The black curve is π_1(v; α_0) and the red curve is V_{θ_0}(Y | v, x). The green and blue curves are the respective functions, each evaluated at θ̂_n instead.
4.3 Example of an n = 512 observation simulation from the LMoLE model, with outlier probability c = 0.05 (as described in Section 4.4.2), where each point has abscissa x_j and ordinate y_j. The black curve is the mean function E_{θ_0}(Y | v, x). The green and red curves are the estimated mean functions via the LMoLE and GMoLE models, respectively.
4.4 Examples of n = 512 observation simulations of the four scenarios from Section 4.4.3, where each point has abscissa x_j and ordinate y_j, for j = 1, ..., n. The colored lines are the mean functions of the various components from each scenario (as per Section 4.3.3).
4.5 Temperature anomalies data from Hansen et al. (2001), where each point has abscissa t_j (year) and ordinate y_j (temperature anomaly, in degrees Celsius). The black curve is the estimated mean function E_{θ̂}(Y | v, x). The green and red curves are the estimated mean functions of components 1 and 2, respectively.
5.1 Pairwise marginal plots of data generated from the D = 2, G = 3, and N = 15 case of the simulations. The 8 colors indicate the different origins of each of the generated data points.
5.2 Plots of the average ratio of EM to MM algorithm computational times. The panels are separated into the G values of each scenario, and the N values are indicated by the line colors. Here, red, green, and blue indicate the values N = 15, 16, 17, respectively.
List of Tables

1.1 Characterizations of the possible restrictions on the component covariance decomposition of form (1.7) for the GMM. Here A = g(d + 1) − 1 and B = d(d + 1)/2.
1.2 Various univariate GMMs from Marron and Wand (1992).
1.3 Simulation study parameters from Quandt (1972).
1.4 Simulation study parameters from Example 2 of Ingrassia et al. (2012).
1.5 Models, sample spaces, link functions, densities, and mean parametrizations for the GLMs originally considered in Nelder and Wedderburn (1972).
1.6 Characterizations of the MGLMs from Wedel and DeSarbo (1995), by sample space, density function, mean characterization, and other parameters.
2.1 Average ISE and AI-ISE performances (over ten runs) for SSMM estimations.
2.2 Digit frequencies in ZIP code data sets.
2.3 Average ARIs over 10 repetitions for clusters of data set B. Bold entries indicate the best number of clusters at each missingness level.
2.4 Estimated error rates for the classification of data set B. Classifiers were trained on set A. Bold entries indicate the best performance at each missingness level.
2.5 Average number of subpopulations selected at each missingness level.
3.1 Average oFDR and oTPR values (over 10 repetitions) for the tested rules under various simulation scenarios. Bold values indicate the best performance within each scenario (lowest FDR and highest TPR). Dashed lines indicate no rejections made by the method in the specified scenario.
3.2 Anatomical decomposition with proportions of rejected voxels (FDR level α = 0.1) and volumes (in voxels).
4.1 Estimated mean squared errors (i.e. \widehat{MSE}_{n,l}) between each component of θ̂_n and θ_0, for each n = 64, 128, 256, 512, 1024. Here, θ_0 is as given in Section 4.4.1. Each \widehat{MSE}_{n,l} is obtained using 1000 trials.
4.2 Estimated integrated mean squared errors (i.e. \widehat{MISE}_c) of the estimated LMoLE and GMoLE models, for both the Laplace and Gaussian cases. Here c ∈ {0, 0.01, ..., 0.05} is the probability of outliers in each simulation. Each \widehat{MISE}_c is obtained using 1000 trials.
4.3 Average IC(l) values for the AIC, BIC, CLC, and ICL, and the number of times g = l according to rule (4.21), for each number of components l = 2, ..., 5. The averages were taken over 1000 repetitions of the simulation from Section 4.4.3. Bold numbers in the l column indicate the true number of components. Bold numbers in the mean and number columns indicate the value of l with the lowest average IC(l) and highest number of selections, respectively.
5.1 Results of the simulations for assessing the performance of the MM and EM algorithms. The columns N, D, and G indicate the simulation scenario, and the MM and EM columns display the average computational time, in seconds, of the respective algorithms (over the 100 trials). The Ratio column displays the average ratio between the EM and the MM algorithm computation times.
List of Abbreviations

AIC      Akaike Information Criterion
AI-ISE   Averaged Individual Integrated Squared-Error
ARI      Adjusted Rand Index
AWE      Approximate Weight of Evidence
BIC      Bayesian Information Criterion
CLC      Classification Likelihood Information Criterion
CML      Composite Marginal Likelihood
CWM      Cluster-Weighted Model/Modeling
ECM      Expectation–Conditional Maximization
EE       Extremum Estimator
EM       Expectation–Maximization
FDA      Functional Data Analysis
FDR      False Discovery Rate
FMM      Finite Mixture Model
GLM      Generalized Linear Model
GLMM     Generalized Linear Mixed Model
GMM      Gaussian Mixture Model
GMoLE    Gaussian Mixture of Linear Experts
IC       Information Criterion/Criteria
ICL      Integrated Classification Likelihood Information Criterion
IID      Independent and Identically Distributed
ISE      Integrated Squared-Error
LAD      Least Absolute Deviation
LCWM     Linear Cluster-Weighted Model/Modeling
LDA      Linear Discriminant Analysis
LMM      Linear Mixed Model
LMoLE    Laplace Mixture of Linear Experts
LMRM     Linear Mixture of Regression Model
LRC      Linear Regression Characterization
MCMLE    Maximum Composite Marginal Likelihood Estimate/Estimation/Estimator
MDA      Mixture Discriminant Analysis
mFDR     Marginal False Discovery Rate
MGLM     Mixture of Generalized Linear Model
ML       Maximum Likelihood
MLE      Maximum Likelihood Estimate/Estimation/Estimator
MLMM     Mixture of Linear Mixed Models
MM       Minorization–Maximization
MoE      Mixture of Experts
MoLE     Mixture of Linear Experts
MPLE     Maximum Pseudo-Likelihood Estimate/Estimation/Estimator
MRF      Markov Random Field
MRI      Magnetic Resonance Imaging
MRM      Mixture of Regression Models
MSSR     Mixture of Spatial Spline Regressions
MSSRDA   Mixture of Spatial Spline Regressions Discriminant Analysis
NB       Naive Bayes
NBF      Nodal Basis Function
oFDR     Observed False Discovery Rate
oTPR     Observed True Positive Rate
PL       Pseudo-Likelihood
PLIC     Pseudo-Likelihood Information Criterion
RFT      Random Field Theory
SSMM     Spatial Spline Mixed Model
SSR      Spatial Spline Regression
SVM      Support Vector Machine
SVT      Singular-Value Threshold
Chapter 1
Overview and Mathematical Background
1.1 Overview
Since their introduction into the popular statistical literature by Pearson (1894), finite mixture models (FMMs)
have been utilized in numerous areas of research for the modeling of heterogeneous and multipopulational
data. In Quandt (1972), FMMs were extended to model regression data, and thus allowed for the modeling of
heterogeneous regression relationships. This thesis utilizes and extends upon the work of Quandt (1972), and
the literature on both FMMs and regression analysis thereafter, to produce novel methods that utilize FMMs
for the construction of heterogeneous regression models, and for the analysis of regression outputs.
In this chapter, we give a historical account and a review of the various relevant FMM topics that are useful for
the understanding of the remainder of the thesis. Firstly, in Section 1.2, we provide a short historical account
and describe the various types of FMMs that are used in density estimation. Furthermore, we discuss the
theoretical properties of some of these FMMs and describe the application of FMMs to the problem of model-based clustering. Next, in Section 1.3, we introduce the notion of regression models and discuss the use of FMMs
for the regression of heterogeneous data. Here, we also discuss the variants of regression-based FMMs such as
those based on generalized linear models and mixed-effects models, as well as mixtures of experts (MoEs). We
also present a new result regarding the relationship between the multivariate t FMM and the t cluster-weighted
model (CWM), which amends an incorrect result from Ingrassia et al. (2012). Lastly, in Section 1.4, we describe
the maximum likelihood (ML) approach to estimating model parameters in FMMs and their variants. Here, we
describe the use of the expectation–maximization (EM) and minorization–maximization (MM) algorithms for
the construction of ML estimation algorithms. Furthermore, we discuss the theoretical issues of identifiability,
asymptotic inference, and model selection, in the FMM context. The rest of the thesis is organized as follows.
In Chapter 2 we present the mixture of spatial spline regression (MSSR) and the MSSR discriminant analysis
(MSSRDA). These methods are novel extensions of the functional data discrimination and clustering techniques
of James and Hastie (2001) and James and Sugar (2003), for application to functional data with bivariate
support. The extensions are made possible via the spatial spline constructions of Sangalli et al. (2013) and the
mixture discriminant framework of Hastie and Tibshirani (1996). Here, we utilize the new methods to cluster
and classify handwritten numerals from the MNIST data set from Hastie et al. (2009) (originally from Le Cun
et al. (1990)), and show that they are particularly useful in a missing data setting.
Next, in Chapter 3, we present an FMM approach to the empirical-Bayes estimation of false discovery rates
(FDRs) in magnetic resonance imaging (MRI) experiments, where millions of hypotheses are often tested simultaneously. This is done through the use of Markov random fields (MRFs) for the construction of a spatial model.
We utilize a two-stage estimation approach, where we first estimate a two-component FMM for the separation
of significant and insignificant z-scores, obtained by transforming the p-values from the hypothesis tests of a
covariate effect, on each of the MRI voxels. The first phase is adapted from the approach of McLachlan et al.
(2006), which we justify via theoretical results. In the second phase, we use an MRF construction similar to
that of Geman and Graffigne (1986), in order to produce a spatial relationship structure between the significance results from the FMM clustering. The FMM and MRF results are then combined to create a spatially
coherent and conservative estimate of the FDR. Empirical validation is then conducted via simulations and an
application to the PATH data of Anstey et al. (2012).
In Chapter 4, we describe the construction and estimation of a Laplace mixture of linear experts model which
extends the mixture of Laplace regressions of Song et al. (2014) via the mixture of experts (MoE) framework
of Jordan and Jacobs (1994). Here, we present a minorization–maximization (MM) algorithm for the monotonic
maximization of the likelihood function for the model and in the process, to the best of our knowledge, we have
produced the first likelihood increasing algorithm for any MoE model. Furthermore, we relate the new method
to the least absolute deviation regression model, which is known to be robust among various linear regression
estimation processes, and we provide theoretical results regarding the identifiability and consistency of the
model and its ML estimator. Finally, we provide empirical evidence towards the robustness and consistency
of the parameter estimates, via simulations, and give an example application by analyzing the climate data of
Hansen (1982).
In Chapter 5, we extend upon the Gaussian CWM of Ingrassia et al. (2012) by providing an alternative characterization of the multivariate Gaussian mixture model (GMM), via the linear regression characterization (LRC).
By doing this, we are able to show that the likelihood of a GMM can be maximized without using matrix operations. Such a maximization process was constructed via the MM algorithm framework. Furthermore, we
show that the LRC yields a simple procedure for the handling of singularities in the GMM likelihood function.
Empirical results are provided for the comparison of the performances between the new MM algorithm and the
traditional EM algorithm for ML estimation of GMMs. These results show that the MM algorithm appears to
be faster than the EM algorithm in the R environment (R Development Core Team, 2013).
Finally, conclusions and discussions regarding the previous chapters are provided in Chapter 6. We also propose
future directions of research that we may proceed towards, from the starting position of this thesis.
Before we proceed, we would like to make a few notes to the reader regarding the content and style of this
thesis. Firstly, due to the composition of this thesis from published and submitted articles, the notation in
each of Chapters 1–5 may be different, as are some definitions and descriptions. Although Chapter 1 attempts
to maintain as much consistency as possible, Chapters 2–5 can be very different due to either the authors’
perspectives at the time of writing, or the notational and stylistic requirements of the target audience and
journal. We have chosen to retain the original notation of the published and submitted articles, in case the
reader has read the articles before this thesis.
In Chapter 6, we discuss the contents of each of the previous chapters individually. As such, in each discussion,
we shall use the notation of the chapter being discussed.
Secondly, on the point of perspective, there is one key change in perspective to note. In each of Chapters 2–5,
we refer to “the EM” or “the MM” algorithm or algorithms, without distinction. However, in Chapter 1, we
often write “an EM” or “an MM” algorithm and “the EM” or “the MM” algorithms, which should now be read
with distinction.
The reason for this change is to make the philosophical point that the EM and MM algorithms are classes of meta-algorithms for optimization problems. Thus we shall refer to general instances of such algorithms as an EM or
an MM algorithm, and refer to the classes themselves as the EM or the MM algorithms. If we make a specific
reference to an algorithm constructed from the EM or MM paradigm, we shall refer to it as the EM or the MM
algorithm, respectively.
Lastly, because of the composition of this thesis, there is no correct order of reading. My only recommendation
is that Chapter 1 should be read first and Chapter 6 last. We shall now proceed in presenting the mathematical
background.
1.2 Finite Mixture Models
Finite mixture models (FMMs) are a class of probability distributions, which can be characterized in the
following way. Let Y ∈ Ω be a random variable such that Ω ⊂ Rd for some d ∈ N, and let ρi (y) be a probability
density function over Ω for each i = 1, ..., g, where g ∈ N. The random variable Y is said to arise from a finite
mixture model if it has a density function of the form
    f(y) = \sum_{i=1}^{g} π_i ρ_i(y),    (1.1)

where π_i > 0 for each i, and \sum_{i=1}^{g} π_i = 1.
We can characterize FMMs via a hierarchical construction by considering a latent random variable Z, where
P (Z = i) = πi , for each i. If we suppose that Y |Z = i has density function ρi (y), then the joint density of Y
and Z can be written as
    f(y, z) = \prod_{i=1}^{g} [π_i ρ_i(y)]^{I(z=i)}.    (1.2)

Summing (1.2) over z = 1, ..., g yields (1.1): the marginal density of Y.
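The latent-variable construction above also gives a direct recipe for simulating from an FMM: first draw the label Z with P(Z = i) = π_i, then draw Y from the corresponding component density. The following is a minimal base-R sketch of this two-stage sampler for a g = 2 Gaussian mixture; the parameter values and the object names (prop, mu, sigma, f) are illustrative choices, not quantities taken from the thesis.

# Two-stage simulation from a finite mixture, following the latent-variable
# construction: draw Z with P(Z = i) = pi_i, then draw Y | Z = i from rho_i.
set.seed(1)
n     <- 1000
prop  <- c(0.4, 0.6)      # mixing proportions (sum to 1)
mu    <- c(0, 3)          # component means
sigma <- c(1, 0.5)        # component standard deviations

z <- sample(seq_along(prop), size = n, replace = TRUE, prob = prop)  # latent labels
y <- rnorm(n, mean = mu[z], sd = sigma[z])                           # observations

# The marginal density (1.1), for comparison with a histogram of y.
f <- function(x) prop[1] * dnorm(x, mu[1], sigma[1]) + prop[2] * dnorm(x, mu[2], sigma[2])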
Although FMM constructions were considered earlier (see Section 1.18 of McLachlan and Peel (2000) for an in-depth history of FMMs), it was the seminal work of Pearson (1894) that brought FMMs into prominence, where he utilized an FMM to model the distribution of the forehead breadth to body length ratios for 1000 Neapolitan crabs provided by his colleague W. F. R. Weldon.
In his article, Pearson fitted a g = 2 component univariate Gaussian mixture model (GMM) to the data presented in Figure 1.1, of the form

    f(y; θ) = π_1 φ_1(y; µ_1, σ_1^2) + π_2 φ_1(y; µ_2, σ_2^2),    (1.3)

where

    φ_d(y; µ, Σ) = (2π)^{-d/2} det(Σ)^{-1/2} exp{ -(1/2)(y − µ)^T Σ^{-1}(y − µ) }    (1.4)

is the d-variate Gaussian density with mean vector µ ∈ R^d and positive-definite covariance matrix Σ ∈ R^{d×d}; note that when d = 1, Σ = σ^2 > 0 is the variance. Here, a superscript T indicates matrix transposition, and θ = (π_1, µ_1, µ_2, σ_1^2, σ_2^2)^T is the vector of all model parameters. Note that π_1, π_2 > 0 and π_2 = 1 − π_1, since π_1 and π_2 are collectively exhaustive proportions.

Using the method of moments, Pearson produced the estimates π̂_1 = 0.4145, π̂_2 = 0.5855, µ̂_1 = 0.63262, µ̂_2 = 0.65666, σ̂_1^2 = 0.01788^2, and σ̂_2^2 = 0.01248^2; the estimated ratio density and component densities are displayed in Figure 1.2.
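As a concrete check of (1.3), the overall and component curves of Figure 1.2 can be reproduced by plugging Pearson's method-of-moments estimates into the fitted density; the base-R sketch below does exactly this (the grid of ratio values and the object names are my own).

# Pearson's (1894) method-of-moments estimates for the crab ratio data.
prop  <- c(0.4145, 0.5855)
mu    <- c(0.63262, 0.65666)
sigma <- c(0.01788, 0.01248)

# Weighted component densities and the overall fitted density (1.3).
ratio <- seq(0.58, 0.70, length.out = 400)
comp1 <- prop[1] * dnorm(ratio, mu[1], sigma[1])
comp2 <- prop[2] * dnorm(ratio, mu[2], sigma[2])
fit   <- comp1 + comp2

plot(ratio, fit, type = "l", xlab = "Ratio", ylab = "Probability Density")
lines(ratio, comp1, col = "red")
lines(ratio, comp2, col = "red")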
It is easy to motivate Pearson’s usage of an FMM in two ways. Firstly, FMMs have the capacity to distinguish
subpopulations within an overall population, as a result of the latent variable construction, which we presented
earlier. This was the primary motivation of the analysis in Pearson (1894), wherein Weldon was interested in
the potential divergence of his Neapolitan crabs into two distinct subspecies. The use of FMMs to distinguish
subpopulations shall be discussed further in Section 1.2.4.
The second motivation for using FMMs is the simple expression of moments, as functions of the moments of
the component densities. For instance, if Y has mixture density (1.1), then the expectation of η (Y ) can be
written as
    E[η(Y)] = \int_Ω η(y) f(y) dy
            = \int_Ω η(y) \sum_{i=1}^{g} π_i ρ_i(y) dy
            = \sum_{i=1}^{g} π_i \int_Ω η(y) ρ_i(y) dy    (1.5)
            = \sum_{i=1}^{g} π_i E_i[η(Y)],
where η(·) is a function over Ω, and E_i(·) is the expectation over the component density ρ_i(y).

[Figure 1.1: Histogram of Weldon's crab data, which was analyzed in Pearson (1894). Axes: Ratio (horizontal) versus Probability Density (vertical).]

[Figure 1.2: Fitted densities and component densities to Weldon's crab data, using estimates from Pearson (1894). The black curve is the overall ratio density, and the red curves are the two component densities.]

Using the
expression above, for univariate Y with mixture density (1.1), we can write the mth central moment of Y as
    E{[Y − E(Y)]^m} = \sum_{i=1}^{g} π_i E_i{[Y − E(Y)]^m}
                    = \sum_{i=1}^{g} π_i E_i{[Y − E_i(Y) + E_i(Y) − E(Y)]^m}
                    = \sum_{i=1}^{g} π_i \sum_{j=0}^{m} \binom{m}{j} E_i{[Y − E_i(Y)]^j} [E_i(Y) − E(Y)]^{m−j},

using the binomial expansion; this derivation is given in Section 1.2.4 of Frühwirth-Schnatter (2006). A version of the g = 2 case of the result was used in Pearson (1894) to produce the moment conditions for estimation.
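Identity (1.5) is also easy to verify numerically: a Monte Carlo average of η(Y) over draws from the mixture should agree with the π-weighted combination of the component expectations. Below is a small base-R sketch of such a check for η(y) = y^2, using the same illustrative two-component Gaussian mixture as in the earlier sampling sketch; none of the parameter values come from the thesis.

# Monte Carlo check of E[eta(Y)] = sum_i pi_i E_i[eta(Y)] for eta(y) = y^2.
set.seed(2)
prop  <- c(0.4, 0.6)
mu    <- c(0, 3)
sigma <- c(1, 0.5)

z <- sample(1:2, 1e6, replace = TRUE, prob = prop)
y <- rnorm(1e6, mean = mu[z], sd = sigma[z])

lhs <- mean(y^2)                        # Monte Carlo estimate of E[Y^2]
rhs <- sum(prop * (mu^2 + sigma^2))     # exact value, since E_i[Y^2] = mu_i^2 + sigma_i^2
c(MonteCarlo = lhs, Weighted = rhs)     # the two values should agree closely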
FMMs are now utilized in a variety of applications. As listed in Table 2.1.3 of Titterington et al. (1985), such
applications include agriculture, botany, cell biology, economics, fisheries research, genetics, geology, medicine,
palaeontology, psychology, and zoology. Since the publication of Titterington et al. (1985), however, there
has been an explosion of interest in FMMs in many modern facets of scientific research, most notably in
bioinformatics, pattern recognition, and machine learning.
Due to their popularity, FMMs are also widely studied from a variety of perspectives. Book-length treatments of FMMs can be found in Everitt and Hand (1981), Frühwirth-Schnatter (2006), Lindsay (1995), McLachlan and Basford (1988), McLachlan and Peel (2000), and Titterington et al. (1985). Furthermore, prominent discussions of FMMs can be found in Chapter 9 of Bishop (2006), Chapter 8 of Clarke et al. (2009), Chapter 10 of Duda et al.
(2001), Section 14.3 of Hastie et al. (2009), and Section 9.3 of Ripley (1996).
1.2.1 Gaussian Mixture Models
The most common FMMs are those similar to (1.3), where the component densities arise from the same parametric family; namely both components of (1.3) are univariate Gaussian densities. We shall refer to such FMMs
as homogeneous-parametric FMMs, to distinguish them from heterogeneous FMMs and nonparametric FMMs.
Some heterogeneous FMMs include the zero-inflated Poisson distribution (Lambert, 1992), uniform-Gaussian
FMMs for outlier detection (Fraley and Raftery, 2003), and Laplace-Gaussian FMM for wind shear clustering
(Jones and McLachlan, 1990). Examples of nonparametric FMMs can be found in Benaglia et al. (2009), Hall
and Zhou (2003), and Hall et al. (2005).
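To make the notion of a heterogeneous FMM concrete, the zero-inflated Poisson distribution mentioned above can be written as a two-component mixture of a point mass at zero and a Poisson component. The following base-R sketch evaluates this mass function; the function name dzip and the parameter values are illustrative only.

# Zero-inflated Poisson mass function, written as the two-component mixture
# prop0 * delta_0(y) + (1 - prop0) * Poisson(y; lambda).
dzip <- function(y, prop0, lambda) {
  prop0 * (y == 0) + (1 - prop0) * dpois(y, lambda)
}
dzip(0:5, prop0 = 0.3, lambda = 2)   # note the inflated mass at y = 0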
Of the homogeneous-parametric FMMs, the most popular is the GMM for modeling random variables in Ω = R^d.
In general, GMMs have density functions of the form
    f(y; θ) = \sum_{i=1}^{g} π_i φ_d(y; µ_i, Σ_i),    (1.6)

where θ = (π_1, ..., π_{g−1}, µ_1^T, ..., µ_g^T, vech^T(Σ_1), ..., vech^T(Σ_g))^T, and vech(·) is the vectorization of the lower triangle of a matrix. Note that the restriction that \sum_{i=1}^{g} π_i = 1 implies that π_g = 1 − \sum_{i=1}^{g−1} π_i is determined given π_1, ..., π_{g−1}. Further, unless otherwise stated, θ will refer to the parameter vector of a density function.
Table 1.1: Characterizations of the possible restrictions on the component covariance decomposition of form (1.7) for the GMM. Here A = g(d + 1) − 1 and B = d(d + 1)/2.

Model                Complexity                 Sphericity   Volume     Shape      Orientation
λI                   A + 1                      Spherical    Equal      Equal      None
λ_i I                A + d                      Spherical    Variable   Equal      None
λDAD^T               A + B                      Elliptical   Equal      Equal      Equal
λ_i DAD^T            A + B + g − 1              Elliptical   Variable   Equal      Equal
λDA_i D^T            A + B + (g − 1)(d − 1)     Elliptical   Equal      Variable   Equal
λD_i AD_i^T          A + gB − (g − 1)d          Elliptical   Equal      Equal      Variable
λ_i DA_i D^T         A + B + (g − 1)d           Elliptical   Variable   Variable   Equal
λ_i D_i AD_i^T       A + gB − (g − 1)(d − 1)    Elliptical   Variable   Equal      Variable
λD_i A_i D_i^T       A + gB − g + 1             Elliptical   Equal      Variable   Variable
λ_i D_i A_i D_i^T    A + gB                     Elliptical   Variable   Variable   Variable
It is not difficult to motivate the use of GMMs considering that many natural measurements and processes tend
to have Gaussian distributions, and thus populations containing subpopulations of such measurements will tend to
have densities resembling GMMs. Furthermore, GMMs are among the simplest FMMs to estimate and are thus
straightforward to apply in practice.
One major criticism of the GMMs is that they require a large number of parameters. That is, a d-variate GMM
with g components of form (1.6) requires A + gB parameters, where A = g (d + 1) − 1 and B = d (d + 1) /2. In
order to reduce the parametric complexity of (1.6), Banfield and Raftery (1993) and Celeux and Govaert (1995)
considered the decomposition
    Σ_i = λ_i D_i A_i D_i^T,    (1.7)
for each i, where λi ∈ R, Di ∈ Rd×d is an orthogonal matrix, and Ai ∈ Rd×d is a diagonal matrix with unit
determinant. The number of parameters can be decreased by different amounts by making various restrictions,
such as fixing λi = λ, Di = D, or Ai = A, for each i, or setting Di Ai Di^T = I; here I is the identity matrix
of appropriate dimensionality. Each of the restrictions also enforces conditions on the confidence ellipses of
the component densities; namely the sphericity, volume, shape, and orientation. Table 1.1 presents possible
combinations of these restrictions, along with the parameter complexity and confidence ellipse conditions of
each restriction; visualizations of some of the restrictions can be found in Figure 1.3. These restrictions have
been successfully implemented in the popular Mclust package for the R programming environment; see Fraley and
Raftery (1999) and Fraley and Raftery (2003) regarding Mclust.
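As a rough illustration of how such covariance restrictions are used in practice, the following sketch (assuming the Python package scikit-learn is available; this is not the Mclust software discussed above) fits GMMs whose covariance_type options correspond to a subset of the restrictions in Table 1.1 and compares them by BIC.

# A minimal sketch: fitting GMMs with restricted component covariances via scikit-learn.
# Its covariance_type options cover only a subset of the decompositions in Table 1.1;
# the mclust package in R implements the full family.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# Simulate n = 300 points from a two-component bivariate GMM.
y = np.vstack([rng.multivariate_normal([-2, -2], 2 * np.eye(2), 150),
               rng.multivariate_normal([2, 2], 2 * np.eye(2), 150)])

# 'spherical' ~ lambda_i I, 'tied' ~ a common full Sigma, 'full' ~ lambda_i D_i A_i D_i^T.
for cov_type in ["spherical", "diag", "tied", "full"]:
    gmm = GaussianMixture(n_components=2, covariance_type=cov_type,
                          random_state=0).fit(y)
    print(cov_type, "BIC:", round(gmm.bic(y), 1))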
Besides their natural motivation, GMMs are also extremely flexible with respect to the possible shapes that they can take.
Figure 1.3: In order from 1–4, each plot displays the 95% confidence ellipse of the components of a g = 4
component bivariate GMM with the model restrictions λI, λi I, λDAD T , and λi Di Ai DiT , respectively.
Table 1.2: Various univariate GMMs from Marron and Wand (1992).
Model Name | f(y) = Σ_{i=1}^g πi φ1(y; µi, σi^2)
1 Skewed unimodal | (1/5) φ1(y; 0, 1^2) + (1/5) φ1(y; 1/2, (2/3)^2) + (3/5) φ1(y; 13/12, (5/9)^2)
2 Strongly skewed | Σ_{i=0}^7 (1/8) φ1(y; 3[(2/3)^i − 1], (2/3)^{2i})
3 Outlier density | (1/10) φ1(y; 0, 1^2) + (9/10) φ1(y; 0, (1/10)^2)
4 Bimodal | (1/2) φ1(y; −1, (2/3)^2) + (1/2) φ1(y; 1, (2/3)^2)
5 Claw | (1/2) φ1(y; 0, 1^2) + Σ_{i=0}^4 (1/10) φ1(y; i/2 − 1, (1/10)^2)
6 Smooth Comb | Σ_{i=0}^5 (2^{5−i}/63) φ1(y; [65 − 96(1/2)^i]/21, (32/63)^2/2^{2i})
For instance, GMMs are able to take on shapes with varying skewness, kurtosis, and modality. In Table 1.2, we present examples of univariate GMMs from Marron and Wand (1992). Visualizations of these
densities are presented in Figure 1.4.
The richness and flexibility of the GMMs can theoretically be proven. Consider the following theorem from
Section 33.1 of DasGupta (2008).
Theorem 1.1 (Denseness of FMMs). Let 1 ≤ d < ∞ and f0 (y) be a continuous density on Rd . If ρ (y) is any
continuous density on Rd, then given ε > 0 and a compact set C ⊂ Rd, there exists an FMM of the form
f(y) = Σ_{i=1}^g πi (1/σi^d) ρ((y − µi)/σi),
such that sup_{y∈C} |f(y) − f0(y)| < ε, for some g ∈ N, where πi > 0, µi ∈ Rd, σi > 0 for each i, and Σ_{i=1}^g πi = 1.
By setting ρ (y) = φd (y; 0, I) in Theorem 1.1, we get the result that the GMMs are dense in the class of
continuous densities on Rd . Thus, for any density f0 (y) of interest, there exists a GMM that is arbitrarily close
to f0 (y), in the sup-norm sense, over some compact subset of Rd . This is a powerful result, as it motivates the
use of GMMs as an approximation to any unknown distribution. However, Theorem 1.1 can also be applied to
any family of densities, ρ (y), that are defined over Rd and thus also motivates the use of non-Gaussian mixture
models which may have more desirable properties, in some sense.
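The following sketch (an assumed illustration, not part of the thesis) gives a numerical flavor of this denseness property: GMMs with increasing g are fitted, via scikit-learn, to samples from a skewed gamma target, and the sup-norm discrepancy on a compact grid is reported.

# A rough numerical illustration of the denseness idea (not a proof): GMMs with
# increasing g fitted to samples from a skewed (gamma) target give progressively
# better sup-norm agreement on a compact interval. Sample size and grid are arbitrary.
import numpy as np
from scipy import stats
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(1)
y = rng.gamma(shape=2.0, scale=1.0, size=5000).reshape(-1, 1)
grid = np.linspace(0.1, 8.0, 200).reshape(-1, 1)
f0 = stats.gamma.pdf(grid.ravel(), a=2.0, scale=1.0)      # target density on the grid

for g in (1, 2, 4, 8):
    gmm = GaussianMixture(n_components=g, random_state=0).fit(y)
    f_hat = np.exp(gmm.score_samples(grid))                # fitted GMM density on the grid
    print("g =", g, "sup|f - f0| ~", round(np.max(np.abs(f_hat - f0)), 4))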
1.2.2
Non-Gaussian Mixture Models on Rd
Recently, there has been an interest in the formulations of FMMs for random variables in Ω = Rd , utilizing
non-Gaussian component densities. The most popular of these FMMs are those which utilize multivariate t
component densities to produce mixtures of the form
f(y; θ) = Σ_{i=1}^g πi td(y; µi, Σi, vi),   (1.8)
Figure 1.4: Univariate GMMs from Marron and Wand (1992). The order of the plots is the same as in Table
1.2. The black curve is the overall density, and the colored curves are the component densities.
where θ = (π1, ..., πg−1, µ1^T, ..., µg^T, vech^T(Σ1), ..., vech^T(Σg), v1, ..., vg)^T, Γ(·) is the Gamma function, and
td(y; µ, Σ, v) = Γ((v + d)/2) |Σ|^{−1/2} / {(πv)^{d/2} Γ(v/2) [1 + (y − µ)^T Σ^{−1}(y − µ)/v]^{(v+d)/2}}   (1.9)
is the λ = v case of the generalized t density of Arellano-Valle and Bolfarine (1995),
td(y; µ, Σ, λ, v) = Γ((v + d)/2) |Σ|^{−1/2} / {(πv)^{d/2} Γ(v/2) [1 + (y − µ)^T Σ^{−1}(y − µ)/λ]^{(v+d)/2}}.   (1.10)
Here µ ∈ Rd , Σ ∈ Rd×d is a positive-definite matrix, λ > 0, and v > 0. Section 5.5 of Kotz and Nadarajah
(2004) provides a comprehensive account of the properties of model (1.10).
The multivariate t FMM was first considered in McLachlan and Peel (1998) and Peel and McLachlan (2000),
for the clustering of data that arise from non-Gaussian subpopulations. Furthermore, model (1.8) allows for the
robust estimation of subpopulation covariances in the presence of outliers and atypical data; see Chapter 7 of
McLachlan and Peel (2000) for an in depth discussion.
In Sun et al. (2010), mixtures of Pearson type VII component densities were considered, as an alternative to
the multivariate t FMMs. Such models have the form
f(y; θ) = Σ_{i=1}^g πi pd(y; µi, Λi, ui),   (1.11)
where θ = (π1, ..., πg−1, µ1^T, ..., µg^T, vech^T(Λ1), ..., vech^T(Λg), u1, ..., ug)^T and
pd(y; µ, Λ, u) = Γ(u) |Λ|^{−1/2} / {π^{d/2} Γ(u − d/2) [1 + (y − µ)^T Λ^{−1}(y − µ)]^{u}}   (1.12)
is the multivariate Pearson type VII density function; see Section 3.3 of Fang et al. (1990) for further details
regarding Pearson type VII densities. Here µ ∈ Rd , Λ ∈ Rd×d is a positive-definite matrix, and u > 0.
It is not difficult to show that there is a one-to-one mapping between models (1.9) and (1.12) by setting
u = (v + d) /2 and Λ = vΣ. Thus, the approximation capacity of (1.8) and (1.11) is identical, for any fixed g.
Furthermore, due to Theorem 1.1, both models are dense in the class of continuous densities on Rd .
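The mapping can be checked numerically; the sketch below (illustrative only, with arbitrary parameter values) evaluates the log of densities (1.9) and (1.12) directly from their formulas and confirms that they agree under u = (v + d)/2 and Λ = vΣ.

# A small check that the Pearson type VII density (1.12) with u = (v + d)/2 and
# Lambda = v * Sigma coincides with the t density (1.9). Parameter values are arbitrary.
import numpy as np
from scipy.special import gammaln

def log_t(y, mu, Sigma, v):
    d = len(mu)
    delta = (y - mu) @ np.linalg.solve(Sigma, y - mu)
    return (gammaln((v + d) / 2) - gammaln(v / 2) - (d / 2) * np.log(np.pi * v)
            - 0.5 * np.linalg.slogdet(Sigma)[1] - ((v + d) / 2) * np.log1p(delta / v))

def log_pearson7(y, mu, Lam, u):
    d = len(mu)
    delta = (y - mu) @ np.linalg.solve(Lam, y - mu)
    return (gammaln(u) - gammaln(u - d / 2) - (d / 2) * np.log(np.pi)
            - 0.5 * np.linalg.slogdet(Lam)[1] - u * np.log1p(delta))

mu = np.array([1.0, -1.0]); Sigma = np.array([[2.0, 0.3], [0.3, 1.0]]); v = 5.0
y = np.array([0.5, 0.7])
print(log_t(y, mu, Sigma, v), log_pearson7(y, mu, v * Sigma, (v + len(mu)) / 2))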
Recently there has been an interest in the use of skewed densities in FMMs, to model data where the component
densities have non-elliptical confidence sets. Examples of such models are the multivariate skew Gaussian and
t variant FMMs, and the shifted asymmetric Laplace FMM; see Lee and McLachlan (2013) and Franczak et al.
(2014), respectively.
1.2.3
Mixture Models on Subsets of Rd
In the case where Ω = [0, 1], the unit interval, FMMs of beta densities are commonly used for density estimation.
Such mixtures have the form
f(y; θ) = Σ_{i=1}^g πi β(y; ai, bi),
where θ = (π1, ..., πg−1, a1, ..., ag, b1, ..., bg)^T and
β(y; a, b) = [Γ(a + b)/(Γ(a)Γ(b))] y^{a−1} (1 − y)^{b−1}
is a beta density function, with parameters a > 0 and b > 0. Example applications of beta mixtures include
modeling gene correlations in bioinformatics studies (Ji et al., 2005) and for modeling p-values distributions in
large-scale hypothesis testing (Allison et al., 2002).
FMMs of beta densities are attractive as they are dense on the set of continuous densities over the interval
[0, 1]. This fact can be established via the following theorem from Petrone (2002).
Theorem 1.2 (Denseness of Bernstein polynomials). Let F0 (y) be a continuous distribution function on [0, 1],
and let
B(y; g, F0) = Σ_{i=0}^g F0(i/g) [g!/(i!(g − i)!)] y^i (1 − y)^{g−i};
the relationship
lim_{g→∞} B(y; g, F0) = F0(y)
holds uniformly for all y ∈ [0, 1].
The function B (y; g, F0 ) is known as the Bernstein polynomial of order g for the function F0 , and has been
studied broadly; for example, see Babu et al. (2002), Ghosal (2001), and Petrone (2002). Theorem 1.2 can be
used to show the denseness of FMMs of beta densities by noting that the density function of B (y; g, F0 ) is
B'(y; g, F0) = Σ_{i=1}^g [F0(i/g) − F0((i − 1)/g)] β(y; i, g − i + 1).
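As a small illustration (not from the thesis), the sketch below evaluates B'(y; g, F0) as a beta FMM whose mixing proportions are increments of F0, taking F0 to be a Beta(2, 5) distribution purely for demonstration.

# A sketch of the Bernstein-polynomial density B'(y; g, F0): a g-component beta FMM
# whose weights are increments of F0. The choice F0 = Beta(2, 5) is arbitrary.
import numpy as np
from scipy import stats

def bernstein_density(y, g, F0_cdf):
    i = np.arange(1, g + 1)
    weights = F0_cdf(i / g) - F0_cdf((i - 1) / g)            # mixing proportions
    comps = np.array([stats.beta.pdf(y, a, g - a + 1) for a in i])
    return weights @ comps

y = np.linspace(0.01, 0.99, 5)
for g in (5, 20, 80):
    approx = bernstein_density(y, g, lambda t: stats.beta.cdf(t, 2, 5))
    print("g =", g, np.round(approx, 3), "target:", np.round(stats.beta.pdf(y, 2, 5), 3))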
A similar result to Theorem 1.2 can be used to show that, for random variables Y = (y1, ..., yd)^T in Ω = [0, 1]^d, the product beta FMM
f(y; θ) = Σ_{i=1}^g πi Π_{j=1}^d β(yj; aij, bij)
is dense in the class of continuous densities on the hypercube [0, 1]^d. Here θ = (π1, ..., πg−1, a1^T, ..., ag^T, b1^T, ..., bg^T)^T, where ai = (ai1, ..., aid)^T ∈ (0, ∞)^d and bi = (bi1, ..., bid)^T ∈ (0, ∞)^d, for each i; see Babu and Chaubey (2006).
We finally note that the product beta FMMs can be used to model densities over any hyperrectangle via scaling of
the random variable.
Next, we consider random variables Y on Ω = [0, ∞). In such situations, the mixture of gamma densities
f(y; θ) = Σ_{i=1}^g πi γ(y; ai, bi)
is a popular choice, where θ = (π1, ..., πg−1, a1, ..., ag, b1, ..., bg)^T and
γ(y; a, b) = [b^a/Γ(a)] y^{a−1} exp(−by)
is the gamma density function, with shape parameter a > 0 and rate parameter b > 0.
The gamma FMMs are used in Bayesian statistics for the estimation of posterior densities, as well as in modeling
complex queuing scenarios; see Wiper et al. (2001) for details. Like the GMMs and the beta FMMs, gamma
FMMs are also dense in the class of continuous functions on [0, ∞). Such a fact can be established via the
following result from DeVore and Lorentz (1993) (Problem 5.6).
Theorem 1.3 (Denseness of gamma FMMs). If f0 (y) is a continuous function on [0, ∞) and has limit zero
as y → ∞, then
lim_{b→∞} Sb(y) = f0(y)
uniformly for all y ∈ [0, ∞), where b > 0 and
Sb(y) = exp(−by) Σ_{i=0}^∞ [(by)^i/i!] f0(i/b).
In Webb (2000), a product gamma FMM of the form
f(y; θ) = Σ_{i=1}^g πi Π_{j=1}^d γ(yj; aij, bij)
was used to model radar range profiles on Ω = [0, ∞)^d. Here θ = (π1, ..., πg−1, a1^T, ..., ag^T, b1^T, ..., bg^T)^T, where ai = (ai1, ..., aid)^T ∈ (0, ∞)^d and bi = (bi1, ..., bid)^T ∈ (0, ∞)^d, for each i. Although we could not find a result, we conjecture that a denseness property exists for the product gamma FMM, similar to the property of the product beta FMM.
Both the Gaussian density and the gamma density are members of the class of exponential family density functions
ρ(y; θ) = h(y) exp{η^T(θ) T(y) − A(θ)},   (1.13)
where h(·) and A(·) are univariate functions, and η(·) and T(·) are vector-variate functions. The functions η(·) and T(·) are usually referred to as the natural parameter and sufficient statistic, respectively. We can represent the Gaussian density in form (1.13) by setting
h(y) = (2π)^{−d/2}, A(θ) = (1/2) µ^T Σ^{−1} µ + (1/2) log|Σ|, η(θ) = (Σ^{−1}µ, −(1/2)Σ^{−1}), and T(y) = (y, yy^T),
where θ = (µ^T, Σ)^T. Similarly, we can represent the gamma density in form (1.13) by setting
h(y) = 1, A(θ) = log Γ(a) − a log b, η(θ) = (a − 1, −b)^T, and T(y) = (log y, y)^T,
where θ = (a, b)^T.
The class of exponential family FMMs of the form
f(y; θ) = Σ_{i=1}^g πi ρ(y; θi),
where θ = (π1, ..., πg−1, θ1^T, ..., θg^T)^T, has been broadly explored in the literature. See for instance Bohning et al. (1994), Hasselblad (1969), and Lutful Kabir (1968) for coverage of exponential family FMMs in generality; Chapters 2 and 3 of Lindsay (1995) also contain discussions and analysis of such FMMs.
We now consider some exponential family FMMs that are useful for modeling random variables on countable
subsets of Rd (i.e. discrete random variables). Firstly, for count data (i.e. Ω = N), the Poisson FMM with
density
ρ(y; θi) = (λi^y/y!) exp(−λi)
is commonly used, where θi = λi > 0 for each i; see Karlis and Xekalaki (2001), Margulis (1993), and Rider (1962) for examples and detail. For categorical data (i.e. Ω = {y ∈ N^d : Σ_{j=1}^d yj = N}, for some fixed N ∈ N), the multinomial FMM with density
ρ(y; θi) = [N!/Π_{j=1}^d yj!] Π_{j=1}^d pij^{yj}
is popular, where θi = (pi1, ..., pid)^T such that pij > 0 and Σ_{j=1}^d pij = 1 for each i and j. Examples of such
FMMs can be found in Li and Zhang (2008), Novovicova and Malik (2003), and Rigouste et al. (2007). Such
models have also been used for binned estimation problems, such as for fitting GMMs to histogram data; see
Cadez et al. (2002), McLachlan and Jones (1988), and Chapter 9 of McLachlan and Peel (2000). In the case
where d = 1, we have the binomial FMMs, which have been studied in Blischke (1964) and Rider (1962). See
Section 8.2 of Johnson et al. (2005) for a general treatment on FMMs for discrete random variables.
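A minimal sketch of a discrete FMM of this kind (with arbitrary parameter values, purely for illustration) is the following two-component Poisson mixture pmf.

# A sketch of a two-component Poisson FMM for count data; mixing proportions and
# rates are arbitrary illustrative values.
import numpy as np
from scipy import stats

pi = np.array([0.3, 0.7])          # mixing proportions pi_i
lam = np.array([2.0, 9.0])         # component rates lambda_i
y = np.arange(0, 6)
pmf = sum(p * stats.poisson.pmf(y, l) for p, l in zip(pi, lam))
print(np.round(pmf, 4))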
1.2.4
Cluster Analysis
Clustering, in Hartigan (1985), is defined simply as the placing of objects into similarity classes. This, of course,
depends on the definition of similarity, which is a point of contention that has made cluster analysis a broad
and rich field of research. This breadth can be seen in classic books such as Hartigan (1975) and Kaufman and
Rousseeuw (1990), which present numerous examples of clustering algorithms that utilize various philosophies
regarding similarities.
Suppose that we observe an observation Y from an FMM of form (1.1). One can consider each of the component
densities ρi (y) as being a subpopulation density of the overall population defined by density (1.1). If so, then
a reasonable notion of similarity is the probability of Y belonging to one of the g densities, which can be
considered as clusters. This is the notion of similarity explored in Wolfe (1970); see McLachlan and Basford
(1988) for a book length treatment on the topic.
Using the probabilistic characterization of an FMM, presented at the beginning of the section, we can consider
the label Z = i∗ as being the true cluster that the observation Y belongs to (i.e. Y is sampled from the
component density ρi∗ (y)). Furthermore, through an application of Bayes’ rule, we can write the a posteriori
probability that Z = i given the observation of Y as
P(Z = i|Y = y) = πi ρi(y) / Σ_{i'=1}^g πi' ρi'(y) = τi(y),   (1.14)
for each i. The set of a posteriori probabilities τ1 (y),...,τg (y) is considered as the soft-clustering of Y , as it
assigns a probability to each of the outcomes i∗ = i.
However, it is often desirable to perform a hard-clustering, and make a precise statement regarding i∗ = R (y),
for some rule R (y), which outputs in the range {1, ..., g}. In FMMs, this is done using the a posteriori
probabilities (1.14), via the Bayes’ assignment rule
R(y) = arg max_{i=1,...,g} τi(y);   (1.15)
rule (1.15) can be shown to be optimal via Theorem 22.6 from Wasserman (2004).
Theorem 1.4 (Optimality of Bayes’ rule). Rule (1.15) minimizes the loss function
L(R) = P(R(Y) ≠ i∗)
over the class of all assignment rules R (y).
We note that the loss function from Theorem 1.4 is equivalent to the conditional risk criterion discussed in
Section 1.4 of McLachlan (1992). See Chapter 1 of McLachlan (1992), and Chapters 1 and 2 of Devroye et al.
(1996) for further details regarding the implications and applications of Theorem 1.4.
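In software, the soft and hard clusterings described above are routinely available; the sketch below (assuming scikit-learn, as an illustration rather than any method from the thesis) fits a two-component GMM and reports the a posteriori probabilities of (1.14) and the Bayes assignments of (1.15).

# A sketch of soft and hard clustering with a fitted GMM: predict_proba returns the
# a posteriori probabilities tau_i(y) of (1.14); predict applies the assignment rule (1.15).
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(2)
y = np.vstack([rng.multivariate_normal([-2, -2], 2 * np.eye(2), 500),
               rng.multivariate_normal([2, 2], 2 * np.eye(2), 500)])

gmm = GaussianMixture(n_components=2, random_state=0).fit(y)
tau = gmm.predict_proba(y[:3])     # soft clustering: a posteriori probabilities
labels = gmm.predict(y[:3])        # hard clustering: arg max_i tau_i(y)
print(np.round(tau, 3), labels)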
As with most FMM scenarios, the most common model for clustering is the GMM. In Figure 1.5, we present n = 1000 observations generated uniformly in the rectangle [−5, 5]^2, along with the a posteriori probabilities τ1(y) from a bivariate g = 2 component GMM with parameters
π1 = 1/2, µ1 = (−2, −2)^T, µ2 = (2, 2)^T, and Σ1 = Σ2 = 2I.   (1.16)
1.3
Mixture Models for Regression Data
In data analysis, it is often more interesting to explore the relationship of some random variable Y against
some covariate vector X ∈ Rq, than to simply explore the distribution of Y on its own.
Figure 1.5: Plot of n = 1000 uniformly sampled observations, colored by the value of τ1(y), as per (1.14) for the GMM with parameters (1.16). The blue and red circles are 95% confidence ellipses of the component 1 and 2 densities of this GMM, respectively.
Table 1.3: Simulation study parameters from Quandt (1972).
Scenario | [a, b] | n | σ1^2 | σ2^2 | π1
1 | [10, 20] | 60 | 2 | 2.5 | 0.5
2 | [10, 20] | 120 | 2 | 2.5 | 0.5
3 | [10, 20] | 60 | 2 | 2.5 | 0.75
4 | [10, 20] | 60 | 2 | 25 | 0.5
5 | [0, 40] | 60 | 2 | 2.5 | 0.5
Regression analysis is the process of modeling such relationships via density functions for Y|X = x of the form f(y|x); see Greene
(2003) and McCulloch and Searle (2001) for comprehensive treatments on regression modeling and analysis.
Henceforth, we shall refer to such density functions as regression models.
Like density estimation in general, regression analysis is most commonly conducted via parametric models of the
form f (y|x) = f (y|x; θ), where θ is the parameter vector. The most popular of such models is the Gaussian
linear regression model of the form
f(y|x; β) = φ1(y; x^T β, σ^2),   (1.17)
for univariate Y , where β ∈ Rq and σ 2 > 0. However, there is also a broad literature on nonparametric
and semiparametric regression models; see for example Green and Silverman (1994), Gyorfi et al. (2002), and
Ruppert et al. (2003).
Like the general FMM, if we suppose that there exists some additional latent random variable Z with probability
density P (Z = i) = πi , for each i, such that Y |X = x, Z = i has the density ρi (y|x), then we have the mixture
of regression model (MRM)
f(y|x) = Σ_{i=1}^g πi ρi(y|x)   (1.18)
via the same argument as that of (1.2).
To the best of our knowledge, the MRMs were first introduced to the statistical literature in Quandt (1972),
where they were proposed for use in the estimation of regression relationships in econometric data. In Quandt
(1972), a simulation study was performed whereupon data was generated from a univariate g = 2 component
Gaussian linear MRM (LMRM) with one covariate, with the form
f(y|x; θ) = π1 φ1(y; β10 + β11 x, σ1^2) + π2 φ1(y; β20 + β21 x, σ2^2),   (1.19)
where θ = (π1, β10, β11, β20, β21, σ1^2, σ2^2)^T, and x is a realization from a uniform distribution over [a, b]. In the
study, the parameters were set at β10 = 1, β11 = 1, β20 = 0.5, and β21 = 1.5. The article considered 5 scenarios
for the rest of the parameters, a, b, and the sample size n, as per Table 1.3. Visualizations of the 5 simulation
scenarios are provided in Figure 1.6.
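The scenarios are straightforward to reproduce; the following sketch (illustrative code, not Quandt's original program) simulates Scenario 1 of Table 1.3 from model (1.19) with the regression coefficients stated above.

# A sketch simulating Scenario 1 of Table 1.3 from the two-component Gaussian LMRM (1.19):
# beta_10 = 1, beta_11 = 1, beta_20 = 0.5, beta_21 = 1.5, sigma_1^2 = 2, sigma_2^2 = 2.5,
# pi_1 = 0.5, x ~ Uniform[10, 20], n = 60.
import numpy as np

rng = np.random.default_rng(3)
n, pi1 = 60, 0.5
a, b = 10.0, 20.0
x = rng.uniform(a, b, n)
z = rng.random(n) < pi1                          # latent component labels
mean = np.where(z, 1.0 + 1.0 * x, 0.5 + 1.5 * x)
sd = np.where(z, np.sqrt(2.0), np.sqrt(2.5))
y = rng.normal(mean, sd)
print(np.column_stack([x, y])[:5])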
Figure 1.6: Realizations of the 5 simulation scenarios from Quandt (1972), as per Table 1.3. The blue and red
lines are the conditional expectations of components 1 and 2, respectively.
The more general univariate Gaussian LMRM of the form
f(y|x; θ) = Σ_{i=1}^g πi φ1(y; βi^T x, σi^2),   (1.20)
where θ = (π1, ..., πg−1, β1^T, ..., βg^T, σ1^2, ..., σg^2)^T,
was also proposed in Quandt (1972), followed by a moment
based estimation process for such models in Quandt and Ramsey (1978). An alternative maximum likelihood
(ML) estimation technique was suggested in Hosmer (1974), for models of form (1.19).
Due to the computational complexity of the estimation procedures at the time, MRMs were not embraced
until the advent and widespread adoption of the expectation–maximization (EM) algorithms of Dempster et al.
(1977), which we shall discuss further in Section 1.4.1.
1.3.1
Linear Mixtures of Regression Models
The number of articles pertaining to the estimation and application of MRMs has increased since the publication of Dempster et al. (1977), due to the reduced complexity of ML estimation via EM algorithms. As in
all domains of the regression modeling literature, the most prolific of publications are those regarding linear
models (i.e. LMRMs). Here, for clarity, we shall define an LMRM as any model for the case of Y ∈ Rd and
E (Y |X = x, Z = i) = BiT x, for each i, where Bi ∈ Rq×d .
Following on from the works of Quandt (1972), Hosmer (1974), and Quandt and Ramsey (1978), De Veaux (1989) considered the use of EM algorithms for the ML estimation of model (1.19). The EM algorithm-based ML estimation of the more general model of form (1.20) was given in DeSarbo and Cron (1988).
In Jones and McLachlan (1992), the multivariate Gaussian LMRM of the form
f(y|x; θ) = Σ_{i=1}^g πi φd(y; Bi^T x, Σi),   (1.21)
was considered, where θ = (π1, ..., πg−1, vec^T(B1), ..., vec^T(Bg), vech^T(Σ1), ..., vech^T(Σg))^T, and Bi ∈ Rq×d is a matrix of coefficients, for each i. Here, the jth column of Bi corresponds to the relationship between vector component
yj and the covariates x. Model (1.21) was used in Jones and McLachlan (1992) to investigate the heterogeneous
relationships between the ratings assigned to cat food attributes, under assessment by different individuals. A
similar model to (1.21) was presented in Jedidi et al. (1996), for the modeling of simultaneous regression data.
The use of t densities for error models is also popular in the LMRM literature. For example, Yao et al. (2014)
considered the univariate t LMRM model
f(y|x; θ) = Σ_{i=1}^g πi t1(y; βi^T x, σi^2, vi),   (1.22)
as a robust version of (1.20). Similarly, Galimberti and Soffritti (2014) considered the multivariate t LMRM model
f(y|x; θ) = Σ_{i=1}^g πi td(y; Bi^T x, Σi, vi),   (1.23)
for robust estimation of heteroskedastic multivariate models. Here model (1.22) is the d = 1 case of (1.23), and θ = (π1, ..., πg−1, vec^T(B1), ..., vec^T(Bg), vech^T(Σ1), ..., vech^T(Σg), v1, ..., vg)^T.
Besides the Gaussian and t models, there have been many other error models proposed for use in LMRMs. Of
particular interest is the Laplace LMRM
f(y|x; θ) = Σ_{i=1}^g πi λ(y; βi^T x, ξi)
of Song et al. (2014), where
λ(y; µ, ξ) = exp(−√2 |y − µ|/ξ)/(√2 ξ)   (1.24)
is the Laplace density with mean µ ∈ R and scale parameter ξ > 0, and θ = (π1, ..., πg−1, β1^T, ..., βg^T, ξ1, ..., ξg)^T.
The Laplace LMRM is naturally interesting considering the relationship between the Laplace error model and
the least absolute deviation criterion, which is known to be more robust than least squares under various
circumstances; see for example Forsythe (1972).
Recently, there has been an interest in the modeling of the densities of both a univariate Y and its covariates
X jointly within the mixture model framework. Such relationships are referred to as cluster-weight models
(CWMs) and were first introduced in Gershenfeld (1997) and Gershenfeld et al. (1999), for the modeling of
time-series data pertaining to the parameters of musical instruments.
Using the LMRM framework, Ingrassia et al. (2012) considered the Gaussian linear CWM (LCWM) and the t LCWM of the forms
f(x, y; θ) = Σ_{i=1}^g πi φd(x; µi, Σi) φ1(y; βi0 + βi^T x, ξi^2)   (1.25)
and
f(x, y; θ) = Σ_{i=1}^g πi td(x; µi, Σi, ui) t1(y; βi0 + βi^T x, ξi^2, ui),   (1.26)
respectively. Here the parameter vectors are
θ = (π1, ..., πg−1, µ1^T, ..., µg^T, vech^T(Σ1), ..., vech^T(Σg), β̃1^T, ..., β̃g^T, ξ1^2, ..., ξg^2)^T
and
θ = (π1, ..., πg−1, µ1^T, ..., µg^T, vech^T(Σ1), ..., vech^T(Σg), u1, ..., ug, β̃1^T, ..., β̃g^T, ξ1^2, ..., ξg^2)^T,
respectively, where β̃i = (βi0, βi^T)^T, for each i. Notice that LCWMs are simply products of density functions on X and regression models of Y|X = x.
In Ingrassia et al. (2012), it was shown that the Gaussian LCWM occurs as a special case of many regression
models. For example, if µ1 = µi and Σ1 = Σi , for each i, model (1.25) is equivalent to (1.20). Additionally, if
πi φd (x; µi , Σi ) is taken as a gating function, then model (1.25) is equivalent to the Gaussian gated experts of
Xu et al. (1995). Furthermore, let µ̃i and Σ̃i take the forms
µ̃i = (µi^T, µY,i)^T and Σ̃i = [Σi, ΣXY,i; ΣXY,i^T, σY,i^2],   (1.27)
for each i, where µY,i and σY,i^2 are the mean and variance of Y|Z = i, and ΣXY,i ∈ Rd is the covariance of Y and X. Proposition 1 of Ingrassia et al. (2012) shows the following relationship between the Gaussian LCWM and the (d + 1)-dimensional GMM.
Theorem 1.5 (Equivalence of the Gaussian LCWM). There exists a mapping between the parameter vector of
model (1.25) and that of the model
f(ỹ; θ) = Σ_{i=1}^g πi φd+1(ỹ; µ̃i, Σ̃i),
where Ỹ = (X^T, Y)^T and θ = (π1, ..., πg−1, µ̃1^T, ..., µ̃g^T, vech^T(Σ̃1), ..., vech^T(Σ̃g))^T.
Thus, Theorem 1.5 suggests that the Gaussian LCWM is a special case of the (d + 1) -variate GMM.
In Proposition 6 of Ingrassia et al. (2012), it is also suggested that model (1.26) is equivalent to a (d + 1) -variate
t FMM. However, it is well known that the multivariate t densities of form (1.9) does not retain its conditional
distributions; see Equation 5.13 of Kotz and Nadarajah (2004). Thus, the result presented in Ingrassia et al.
(2012) is impossible.
Let
f(x, y; θ) = Σ_{i=1}^g πi td(x; µi, Σi, λi, ui) t1(y; βi0 + βi^T x, ξi^2, λi + δ(x; µi, Σi), ui + d)   (1.28)
be the generalized t LCWM, where δ(x; µ, Σ) = (x − µ)^T Σ^{−1}(x − µ) and
θ = (π1, ..., πg−1, µ1^T, ..., µg^T, vech^T(Σ1), ..., vech^T(Σg), λ1, ..., λg, u1, ..., ug, β̃1^T, ..., β̃g^T, ξ1^2, ..., ξg^2)^T.
Now, let µ̃i and Σ̃i have the same form as in (1.27), except without the covariance interpretation. The following
lemma establishes the relationship between each of the component densities of (1.28) and the (d + 1) -variate
generalized t FMM.
Lemma 1.1. There exists a mapping between the parameter vector of the generalized t LCWM component density
td(x; µ, Σ, λ, u) t1(y; β0 + β^T x, ξ^2, λ + δ(x; µ, Σ), u + d)
and that of the model td+1(ỹ; µ̃, Σ̃, λ, u), where Ỹ = (X^T, Y)^T.
Proof. Using the results from Section 5.5 of Kotz and Nadarajah (2004), we can write
td+1(ỹ; µ̃, Σ̃, λ, u) = td(x; µ, Σ, λ, u) t1(y; µY|X(x), σY|X^2, λ + δ(x; µ, Σ), u + d),
Table 1.4: Simulation study parameters from Example 2 of Ingrassia et al. (2012).
i | πi | µi | σi^2 | βi0 | βi1 | ξi^2
1 | 2/9 | 5 | 1^2 | 40 | 6 | 2^2
2 | 4/9 | 10 | 2^2 | 40 | −1.5 | 1^2
3 | 1/3 | 20 | 3^2 | 150 | −7 | 2^2
where
µY|X(x) = µY + ΣXY^T Σ^{−1}(x − µ) = µY − ΣXY^T Σ^{−1} µ + ΣXY^T Σ^{−1} x
and
σY|X^2 = σY^2 − ΣXY^T Σ^{−1} ΣXY.
We get the desired result by setting β0 = µY − ΣXY^T Σ^{−1} µ, β = Σ^{−1} ΣXY, and ξ^2 = σY|X^2.
We can apply Lemma 1.1 to get the following result regarding the equivalence between the generalized t LCWM
and the (d + 1) -variate generalized t FMM.
Theorem 1.6 (Equivalence of the generalized t LCWM). There exists a mapping between the parameter vector
of model (1.28) and that of the model
f(ỹ; θ) = Σ_{i=1}^g πi td+1(ỹ; µ̃i, Σ̃i, λi, ui),
where Ỹ = (X^T, Y)^T and θ = (π1, ..., πg−1, µ̃1^T, ..., µ̃g^T, vech^T(Σ̃1), ..., vech^T(Σ̃g), λ1, ..., λg, u1, ..., ug)^T.
We note that both models (1.25) and (1.26) require large numbers of parameters to characterize. This issue is
addressed in Ingrassia et al. (2014) where a number of parameter restrictions are considered.
We provide an example of data generated from a Gaussian LCWM, as presented in Example 2 of Ingrassia et al.
(2012). Here, a g = 3 component Gaussian LCWM with univariate X is considered, where the model has the
form
f(x, y; θ) = Σ_{i=1}^3 πi φ1(x; µi, σi^2) φ1(y; βi0 + βi1 x, ξi^2),
and the parameters are set to the values from Table 1.4. A visualization of n = 450 realizations from the model
is presented in Figure 1.7.
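A sketch of the generating process (using the Table 1.4 values as reconstructed here; illustrative only, not the authors' code) is as follows.

# A sketch generating n = 450 observations from the g = 3 component Gaussian LCWM above,
# with the Table 1.4 parameter values as reconstructed in this document.
import numpy as np

rng = np.random.default_rng(4)
n = 450
pi = np.array([2 / 9, 4 / 9, 1 / 3])
mu = np.array([5.0, 10.0, 20.0]); sigma = np.array([1.0, 2.0, 3.0])
beta0 = np.array([40.0, 40.0, 150.0]); beta1 = np.array([6.0, -1.5, -7.0])
xi = np.array([2.0, 1.0, 2.0])

z = rng.choice(3, size=n, p=pi)                        # latent component labels
x = rng.normal(mu[z], sigma[z])                        # covariate from phi_1(x; mu_i, sigma_i^2)
y = rng.normal(beta0[z] + beta1[z] * x, xi[z])         # response from the component regression
print(np.column_stack([x, y])[:5])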
Figure 1.7: Realizations of n = 450 observations from Example 2 of Ingrassia et al. (2012), using the parameters
presented in Table 1.4. The red, blue, and green lines are the regression lines βi0 + βi1 x for components 1 to 3,
respectively.
Table 1.5: Models, sample spaces, link functions, densities, and mean parametrization for the GLMs originally
considered in Nelder and Wedderburn (1972).
Model | Ω | h^{−1}(β^T x) | f(y) | E(Y|X = x)
Binomial | {0, ..., N} | exp(β^T x)/[1 + exp(β^T x)] | [N!/((N − y)! y!)] p^y (1 − p)^{N−y} | Np
Gamma | (0, ∞) | −1/(β^T x) | γ(y; a, b) | a/b
Gaussian | R | β^T x | φ1(y; µ, σ^2) | µ
Poisson | N | exp(β^T x) | (λ^y/y!) exp(−λ) | λ
1.3.2
Generalized Linear Mixtures of Models
Thus far, we have considered the case of Y ∈ Rd. We now proceed to describe the case where Ω ⊂ Rd through the use of generalized linear models (GLMs). GLMs were first introduced in Nelder and Wedderburn (1972) as a method for unifying various disparate regression methods, such as Poisson, logistic, binomial, and gamma regressions. In their formulation, Nelder and Wedderburn (1972) considered various univariate cases where the expectation of Y given the covariates x can be expressed as E(Y|X = x) = h^{−1}(β^T x), where h(·) is a univariate invertible link function.
For example, if h^{−1}(β^T x) = β^T x and Y ∈ R, we simply have the Gaussian linear regression model (1.17).
In Table 1.5, we present the link functions, densities, and the mean parametrization for each of the regression
models originally considered in Nelder and Wedderburn (1972). See Chapter 5 of McCulloch and Searle (2001)
for further details regarding GLMs.
To the best of our knowledge, the first general treatment of mixtures of GLMs (MGLMs) is Jansen (1993). Here,
the author concentrated on computational issues, as opposed to modeling considerations. A more comprehensive treatment of MGLMs is given in Wedel and DeSarbo (1995), where descriptions are given of mixtures of the original GLMs proposed by Nelder and Wedderburn (1972). Along with the original GLMs, Wedel and DeSarbo (1995)
also reported on the mixture of inverse Gaussian regression models as an alternative to the gamma model. See
also Section 5.5 of McLachlan and Peel (2000) for a review on the topic.
To characterize the MGLMs, we consider the following notation. Suppose that any MGLM can be expressed in
the form
f(y|x; θ) = Σ_{i=1}^g πi ρ(y; η(βi^T x; ψi), ψi),   (1.29)
where η(·) is a function, ρ(·) is a density function, and θ = (π1, ..., πg−1, θ1^T, ..., θg^T)^T. Here θi = (βi^T, ψi^T)^T and ψi is a vector of parameters other than βi, for each i. Using the notation in (1.29), we present the characterizations of the binomial, gamma, Gaussian, inverse Gaussian, and Poisson MGLMs in Table 1.6.
There is also a rich literature regarding the development and use of specific MGLMs. For example, Follman
and Lambert (1989), Follman and Lambert (1991), and Kamakura and Russell (1989) considered binomial
GLMs; Schlattmann and Bohning (1993), Seeber (1989), and Wedel et al. (1993) considered Poisson GLMs;
Table 1.6: Characterizations of the MGLMs from Wedel and DeSarbo (1995), by sample space, density function,
mean characterization, and other parameters.
Model | Ω | ρ(y; η(βi^T x; ψi), ψi) | η(βi^T x; ψi) | ψi
Binomial | {0, ..., N} | [N!/((N − y)! y!)] η(βi^T x; ψi)^y [1 − η(βi^T x; ψi)]^{N−y} | exp(βi^T x)/[1 + exp(βi^T x)] | None
Gamma | (0, ∞) | γ(y; η(βi^T x; ψi), bi) | −bi/(βi^T x) | bi
Gaussian | R | φ1(y; η(βi^T x; ψi), σi^2) | βi^T x | σi^2
Inverse Gaussian | (0, ∞) | [λi/(2π η^3(βi^T x; ψi))]^{1/2} exp{−λi [y − η(βi^T x; ψi)]^2/[2 η^2(βi^T x; ψi) y]} | βi^T x | λi
Poisson | N | [η(βi^T x; ψi)^y/y!] exp(−η(βi^T x; ψi)) | exp(βi^T x) | None
and Farewell (1982), Haughton (1997), Hougaard (1991), and McLachlan and McGriffin (1994) considered the
gamma and inverse Gaussian GLMs.
The binomial MGLM is particularly popular, due to its link to logistic regression (i.e. when N = 1). Besides
the use of the logistic link
η(βi^T x) = exp(βi^T x)/[1 + exp(βi^T x)],   (1.30)
the probit link
η(βi^T x) = Φ(βi^T x; 0, 1)   (1.31)
is also prominently utilized in the literature as an alternative characterization; see for example Ashford and Walker (1972), Kamakura (1991), and Lwin and Martin (1989). Here, Φ(x; µ, σ^2) is the Gaussian distribution function with mean µ and variance σ^2.
The multinomial MGLM for Ω = {y ∈ N^d : Σ_{j=1}^d yj = N} of the form
f(y|x; θ) = Σ_{i=1}^g πi [N!/Π_{j=1}^d yj!] Π_{j=1}^d ηj(θi)^{yj},   (1.32)
is also considered in Kamakura and Russell (1989) and Kamakura (1991), as a generalization of the binomial MGLM. In (1.32),
ηj(θi) = exp(βij^T x)/Σ_{j'=1}^d exp(βij'^T x),
and θi = (βi1, ..., βid)^T, where βij ∈ Rq for j = 1, ..., d − 1 and βid = 0 for each i.
1.3.3
Mixtures of Experts
It is often useful to consider that the latent class Z also has a conditional relationship with some covariate vector
W ∈ Rp as well as the relationship between Y and X. For such situations, we usually model the probability
26
of Z = i|W = w via the logistic model
P (Z = i|W = w)
=
exp αTi w
Pg
T
i0 =1 exp αi0 w
= πi (w; α) ,
(1.33)
where αi ∈ Rp for i = 1, ..., g − 1 and αg = 0.
Using the marginal density notation from (1.29), we can write the density function of an MGLM with mixing proportions of form (1.33) as
f(y|w, x; θ) = Σ_{i=1}^g πi(w; α) ρ(y; η(βi^T x; ψi), ψi),   (1.34)
where α = (α1^T, ..., αg−1^T)^T and θ = (α^T, θ1^T, ..., θg^T)^T.
Models of form (1.34) are usually referred to as Mixtures of Experts (MoEs), due to the terminology first
introduced in Jacobs et al. (1991). However, such models have been studied much earlier in the statistical
literature under various guises; see for example Dayton and Macready (1988), Farewell (1982), Farewell (1986),
and Larson and Dinse (1985). More recent treatments of MoEs can be found in Grun and Leisch (2007), Grun
and Leisch (2008), Section 5.13.1 of McLachlan and Peel (2000), and Wedel (2002). A comprehensive literature
review of the area can be found in Yuksel et al. (2012).
MoEs have become popular due to their flexibility. This is especially true for the univariate mixture of linear
experts (MoLE) with Gaussian component density
ρ(y; η(βi^T x; ψi), ψi) = φ1(y; βi^T x, σi^2).
Using the expectation expression (1.5), it is not difficult to show that the expectation of a MoLE (i.e. any case where E(Y|Z = i, x) = βi^T x) can be written as
E(Y|W = w, X = x; θ) = Σ_{i=1}^g πi(w; α) βi^T x.   (1.35)
The flexibility of the Gaussian MoLE is exemplified in Chamroukhi et al. (2009), Chamroukhi et al. (2010), and Same et al. (2011), where such models were considered for the estimation, discrimination, and clustering of time series data arising from train switch signals. To demonstrate the flexibility of the model, Chamroukhi et al. (2010) simulated data arising from various Gaussian MoLE models. In Figure 1.8, we display the mean functions of two Gaussian MoLEs, M1 and M2, with w = x = (1, t, t^2)^T for t ∈ [0, 5]. The parameters for M1 are set to
β1 = (1, 0, 0)^T and α1 = (334.133, −170.696, 0)^T,
β2 = (10, 0, 0)^T and α2 = (243.697, −81.007, 0)^T,
β3 = (5, 0, 0)^T and α3 = (0, 0, 0)^T,
and the parameters for M2 are set to
β1 = (23, −36, 18)^T and α1 = (92.72, −46.72, 0)^T,
β2 = (−3.9, 11.08, −2.2)^T and α2 = (61.16, −15.28, 0)^T,
β3 = (−337, 141.5, −14)^T and α3 = (0, 0, 0)^T,
respectively.
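The mean function (1.35) of such a model is easy to evaluate directly; the sketch below (illustrative only, using the M1 parameters listed above) computes the softmax gating probabilities of (1.33) and the resulting mean over a grid of t.

# A sketch evaluating the MoLE mean function (1.35) with softmax gating (1.33), using the
# M1 parameters as reconstructed above; w = x = (1, t, t^2)^T.
import numpy as np

t = np.linspace(0.0, 5.0, 6)
X = np.column_stack([np.ones_like(t), t, t ** 2])        # w = x = (1, t, t^2)^T

alpha = np.array([[334.133, -170.696, 0.0],              # alpha_1
                  [243.697, -81.007, 0.0],               # alpha_2
                  [0.0, 0.0, 0.0]])                       # alpha_3 = 0
beta = np.array([[1.0, 0.0, 0.0],                         # beta_1
                 [10.0, 0.0, 0.0],                        # beta_2
                 [5.0, 0.0, 0.0]])                        # beta_3

a = X @ alpha.T                                           # gating linear predictors
gate = np.exp(a - a.max(axis=1, keepdims=True))
gate /= gate.sum(axis=1, keepdims=True)                   # pi_i(w; alpha) of (1.33)
mean = np.sum(gate * (X @ beta.T), axis=1)                # E(Y | w, x) of (1.35)
print(np.round(mean, 3))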
In the case when w = x ∈ K and x1 = 1, where K is a compact subset of Rq, it can be shown that, under some regularity conditions, the mean function (1.35) can closely approximate any function η(x) in the Sobolev space
Wk^r(L) = {η(x) : ||η||_{Wk^r} = Σ_{|a|≤r} ||η^(a)(x)||_k ≤ L}
for some L < ∞, where
η^(a)(x) = ∂^{|a|} η(x)/(∂x1^{a1} ∂x2^{a2} ... ∂xq^{aq})
and a ∈ Nq. The following result from Zeevi et al. (1998) establishes bounds on the error of approximation via equation (1.35).
Theorem 1.7 (Approximation error of MoLE mean functions). Let µg(x) be the mean function of a g-component MoLE of form (1.35), and let η(x) ∈ Wk^r. If w = x ∈ K, where K is a compact subset of Rq and x1 = 1, and if Assumption 2 from Zeevi et al. (1998) holds, then
sup_{η∈Wk^r} inf_{µg∈Qg} ||η(x) − µg(x)||_k ≤ c/g^{r/q},
for 1 ≤ k ≤ ∞, where c is a constant and Qg is a restriction on the set of parameters of µg(x), imposed by Assumption 2 of Zeevi et al. (1998).
Thus, by Theorem 1.7, we can verify that, under the stated regularity conditions and as g → ∞, there exists a MoLE model that can approximate a mean function of any form in Wk^r arbitrarily closely.
Aside from the MoLEs, MoEs based on GLMs have also been widely considered in the literature. For example,
Grun and Leisch (2007), Grun and Leisch (2008), Jiang and Tanner (1999b), Jiang and Tanner (1999a), Jiang
and Tanner (1999c), Jiang and Tanner (2000), and Jordan and Jacobs (1994) all considered the use of GLM
based MoEs from various perspectives. Such MoEs utilized component densities with forms such as those from
Table 1.6, and are also provably flexible among a wide class of models.
Let h(y; η) and h(y; η, ψ) be a one- or two-parameter exponential family probability density, with parameters η ∈ R and ψ ∈ (0, ∞), and let H be the set of all such models where η ∈ W∞^2 is a function of x ∈ K, where K is a compact subset of Rq and x1 = 1. Further, let fg be a univariate MoE with component densities in H (e.g. any of the densities from Table 1.6), where w = x and η = β^T x. The following theorem from Jiang and Tanner (1999a) establishes the approximation error when using fg to approximate any density in class H.
Theorem 1.8 (Approximation error of GLM based MoEs). If w = x ∈ K, where K is a compact subset of Rq and x1 = 1, then
sup_{h∈H} inf_{fg∈Πg} KL(fg, h) ≤ c/g^{4/q},
where c is a constant and Πg is a restriction on the set of parameters of fg, imposed by Jiang and Tanner (1999a).
Figure 1.8: Mean functions of form (1.35) for models M1 and M2, from Chamroukhi et al. (2010). The red and blue curves correspond to M1 and M2, respectively.
In Theorem 1.8,
KL(fg, h) = ∫∫ fg(y|x; θ) log[fg(y|x; θ)/h(y; η(x), ψ)] dκ(x, y)
is the Kullback–Leibler divergence between fg and h, and κ(x, y) is a probability measure over Ω = K × R. Thus, Theorem 1.8 shows that GLM based MoEs are able to approximate any density in H arbitrarily closely.
Although our discussions have concentrated on MoE models for cases where Y is univariate, there have also been
various descriptions of MoEs for Ω ⊂ Rd . For example, Peng et al. (1996) and Chen et al. (1999) considered
the use of multinomial MoEs of the form
f(y|w, x; θ) = Σ_{i=1}^g πi(w; α) [N!/Π_{j=1}^d yj!] Π_{j=1}^d ηj(θi)^{yj},   (1.36)
and Chamroukhi et al. (2013) considered the use of multivariate Gaussian MoLEs of the form
f(y|w, x; θ) = Σ_{i=1}^g πi(w; α) φd(y; Bi^T x, Σi),   (1.37)
where the parameters of (1.36) and (1.37) are the same as those from (1.32) and (1.21), respectively.
1.3.4
Mixtures of Linear Mixed Models
Suppose that the elements of Y = (Y1, ..., Yd)^T are such that Y1, ..., Yd are dependent random variables characterized by the expression
Yj|b = β^T xj + b^T uj + Ej,   (1.38)
for each j = 1, ..., d, where E = (E1, ..., Ed)^T has expectation 0. Here xj and uj are rows of the covariate matrices X ∈ Rd×q and U ∈ Rd×p, respectively, and B ∈ Rp has expectation 0. We shall refer to B as the random effect, and assume that U and X are fixed. This characterization is adapted from that of Laird and Ware (1982).
The model described is generally known as a linear mixed model (LMM) and has been studied extensively; see
for example Chapters 1 and 2 of Jiang (2007), Chapters 6 and 7 of McCulloch and Searle (2001), Part 1 of
Pinheiro and Bates (2000), and Verbeke and Molenberghs (2000).
As usual, the Gaussian case, whereupon f(e; θ) = φd(e; 0, Σ) and f(b; θ) = φp(b; 0, D), is the most common variant of the LMMs. In such models, the conditional density of Y given B = b has the form
f(y|b; U, X, θ) = φd(y; Xβ + Ub, Σ),   (1.39)
and thus, the joint density of Y and B can be written as
f(y, b; U, X, θ) = φd(y; Xβ + Ub, Σ) φp(b; 0, D).
Integrating out b yields the marginal density with respect to Y:
f(y; U, X, θ) = ∫_{Rp} f(y|b; U, X, θ) f(b; θ) db = ∫_{Rp} φd(y; Xβ + Ub, Σ) φp(b; 0, D) db.   (1.40)
The integral form of (1.40) usually causes difficulty in the analysis of mixed models, due to its potential of being unevaluatable. However, in the case of the Gaussian LMM, the properties of the multivariate Gaussian imply that the density of Y is multivariate Gaussian with mean
E(Y) = E(Xβ + UB + E) = Xβ + U E(B) + E(E) = Xβ
and covariance
V(Y) = V(Xβ + UB + E) = V(Xβ) + U V(B) U^T + V(E) = U D U^T + Σ.
Thus, (1.40) can be equivalently written as
f(y; U, X, θ) = φd(y; Xβ, U D U^T + Σ),
which allows for simple estimation and interpretation. Here, θ = (β^T, vech^T(Σ), vech^T(D))^T is the vector of all model parameters. This is the major reason for the popularity of the Gaussian LMM. See Section 20.5 of Seber (2008) regarding the properties of the multivariate Gaussian density.
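This identity is easy to check by simulation; the sketch below (illustrative, with arbitrary X, U, β, D, and Σ) compares the empirical covariance of simulated responses with U D U^T + Σ.

# A small simulation sketch checking that Y = X beta + U B + E has marginal covariance
# U D U^T + Sigma, as derived above. All design matrices and parameters are arbitrary.
import numpy as np

rng = np.random.default_rng(5)
d, q, p, n = 4, 2, 2, 200_000
X = rng.normal(size=(d, q)); U = rng.normal(size=(d, p))
beta = np.array([1.0, -0.5])
D = np.array([[1.0, 0.3], [0.3, 0.5]])
Sigma = 0.4 * np.eye(d)

B = rng.multivariate_normal(np.zeros(p), D, size=n)       # random effects
E = rng.multivariate_normal(np.zeros(d), Sigma, size=n)   # errors
Y = X @ beta + B @ U.T + E

print(np.round(np.cov(Y, rowvar=False), 3))                # empirical covariance of Y
print(np.round(U @ D @ U.T + Sigma, 3))                    # theoretical U D U^T + Sigma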
A popular variant of the LMMs is to utilize the t density in describing either the marginal densities or the conditional densities of Y|B = b. For example, Pinheiro et al. (2001) suggest an LMM defined via the marginal densities
f(y; U, X, θ) = td(y; Xβ, U D U^T + Σ, v)
and f(b; θ) = tp(b; 0, D, v), where θ = (β^T, vech^T(Σ), vech^T(D), v)^T. Alternatively, Wakefield (1996) considered the Gaussian-t LMM, defined via the conditional density (1.39) and the marginal density f(b; θ) = tp(b; 0, D, v) over the random effect. The LMMs of both Pinheiro et al. (2001) and Wakefield (1996) are generalized in Song et al. (2007), where various combinations of conditional densities of forms (1.39) or
f(y|b; U, X, θ) = td(y; Xβ + Ub, Σ, v1)
are combined with random effect densities of forms f(b; θ) = φp(b; 0, D) or f(b; θ) = tp(b; 0, D, v2). Aside from the Gaussian and t models, other variants include the skew Gaussian and t LMMs; see Arellano-Valle et al. (2005) and Lachos et al. (2010), and Ho and Lin (2010) and Zhou and He (2008), respectively.
To the best of our knowledge, the first appearance of a mixture of LMMs (MLMM) is the Gaussian MLMM of Verbeke and Lesaffre (1996). Here, the model is characterized by the conditional density (1.39) and the mixture marginal density
f(b; θ) = Σ_{i=1}^g πi φp(b; δi, Di)
for the random effect, where δi ∈ Rp, for each i, and Σ_{i=1}^g πi δi = 0. This characterization yields the marginal Gaussian MLMM model
f(y; U, X, θ) = Σ_{i=1}^g πi φd(y; Xβ + Uδi, U Di U^T + Σ),   (1.41)
where
θ = (π1, ..., πg−1, β^T, δ1^T, ..., δg−1^T, vech^T(Σ), vech^T(D1), ..., vech^T(Dg))^T.
Model (1.41) is elaborated upon in Chapter 12 of Verbeke and Molenberghs (2000), and extended via overparametrization and penalized smoothing in Ghidey et al. (2004).
As an alternative, Celeux et al. (2005) suggest characterizing the Gaussian MLMM by the conditional relationships
f(b|Z = i; θ) = φp(b; 0, Di)
and
f(y|b, Z = i; θ) = φd(y; Xβi + Ub, Σi),
which imply that the marginal component density and the marginal density of Y have the forms
f(y|Z = i; θ) = φd(y; Xβi, U Di U^T + Σi)
and
f(y; U, X, θ) = Σ_{i=1}^g πi φd(y; Xβi, U Di U^T + Σi),   (1.42)
where
θ = (π1, ..., πg−1, β1^T, ..., βg^T, vech^T(Σ1), ..., vech^T(Σg), vech^T(D1), ..., vech^T(Dg))^T.
Model (1.42) was applied in Celeux et al. (2005) for the clustering of time-course gene expression data. In Ng et al. (2006) and Wang et al. (2012), (1.42) is extended to incorporate multilevel and nested random effects, and autoregressive random effects, respectively. Furthermore, James and Sugar (2003) and Luan and Li (2003) applied model (1.42) to the problem of curve clustering in the context of functional data analysis (FDA); see Ramsay and Silverman (1997) for details regarding FDA.
Aside from the Gaussian MLMM, Pyne et al. (2014) suggest the t MLMM and the skew t MLMM, where the t MLMM can be described via the marginal density
f(y; U, X, θ) = Σ_{i=1}^g πi td(y; Xβi, U Di U^T + Σi, vi),
and conditional characterizations f(b|Z = i; θ) = tp(b; 0, Di, vi), where
θ = (π1, ..., πg−1, β1^T, ..., βg^T, vech^T(Σ1), ..., vech^T(Σg), vech^T(D1), ..., vech^T(Dg), v1, ..., vg)^T.
The two t MLMM variants were used for clustering flow-cytometric data, which have non-Gaussian underlying component densities.
Further to modeling heterogeneity of LMMs, FMMs can be used to nonparametrically model the random effects in LMMs. For instance, using model (1.38), one can set the conditional density for Y|B = b to be of form (1.39) and the random effect density to be of form
f(b; θ) = Σ_{i=1}^g πi I(b = δi),   (1.43)
where δi ∈ Rp, Σ_{i=1}^g πi δi = 0, and I(x = y) is an indicator variable that takes a value of 1 if x = y and 0 otherwise. Using (1.39) and (1.43), the marginal density of Y can then be given as
f(y; U, X, θ) = Σ_{i=1}^g πi φd(y; Xβ + Uδi, Σ),
with parameters θ = (π1, ..., πg−1, β^T, δ1^T, ..., δg−1^T, vech^T(Σ))^T, which is a kind of LMRM. The density (1.43) is generally referred to as a nonparametric mixed effects model, and has been investigated in Aitkin (1999) and Laird (1978).
Finally, we note that in the same way that GLMs extend and generalize LMs, LMMs can be extended via the generalized LMMs (GLMMs); see Chapters 3 and 4 of Jiang (2007) and Chapter 8 of McCulloch and
Searle (2001), for details. Mixtures of GLMMs have also been considered, albeit less widely than MLMMs; see
for example Hall and Wang (2005), Komarek and Lesaffre (2008), and Muthen and Shedden (1999). Furthermore, Ng and McLachlan (2007), Ng and McLachlan (2014), and Yau et al. (2003) have also considered the
incorporation of random effects in the MoE setting.
1.4
Likelihood-Based Inference
In conducting statistical inference, one is usually presented with data y1 , ..., yn that arise from a population
defined via a parametric probability density f (y1 , ..., yn ; θ0 ), where θ0 is an unknown population parameter
vector. To draw inferences regarding θ0 , a point estimator θ̃n is generally necessary to proceed.
Point estimation is a broad and varied aspect of statistical analysis. Traditional means of point estimation
include the method of moments, Bayes’ estimators, loss minimization estimators (e.g. least squares and least
absolute deviations), and ML estimators; see Lehmann and Casella (1998) for an introduction to point estimation. More modern point estimators include the empirical likelihood estimators (Owen, 1988, 2001), generalized method of moments (Hall, 2005; Hansen, 1982), and quasi-likelihood estimators (Heyde, 1997; Liang
and Zeger, 1986; Wedderburn, 1974), which generalize upon the method of moments; and the extremum estimators (Amemiya, 1985, Ch. 4), maximum Lq -likelihood estimators (Ferrari and Yang, 2010), maximum
pseudo-likelihood estimators (Arnold and Strauss, 1991; Varin et al., 2011), and M-estimators (Huber, 1964;
Huber and Ronchetti, 2009), which generalize the loss minimization estimators and ML estimators.
ML estimators are the most commonly used estimators in the FMM literature, although instances of use of
other estimators include the empirical likelihoods (Huang et al., 2007; Zou et al., 2002), Lq -likelihoods (Qin
and Priebe, 2013), method of moments (Day, 1969; Lindsay and Basak, 1993; Pearson, 1894), and M-estimators
(De Veaux and Krieger, 1990; Yao et al., 2014), as well as more ad hoc methods such as the classification
EM algorithm (Celeux and Govaert, 1992; Ganesalingam and McLachlan, 1980; McLachlan, 1982), and various
metaheuristic based methods (Ingrassia, 1991; Pernkopf and Bouchaffra, 2005; Andrews and McNicholas, 2013).
We now proceed to describe the method of ML estimation, in generality.
Let y1, ..., yn be a realization of the random variables Y1, ..., Yn. The likelihood and log-likelihood functions constructed from the data are given as
Ln(θ; y1, ..., yn) = f(y1, ..., yn; θ)   (1.44)
and
log Ln(θ; y1, ..., yn) = log f(y1, ..., yn; θ),   (1.45)
respectively. Note that although the likelihood and log-likelihood functions are simply the probability density and log density, respectively, both functions are viewed as functions of θ with y1, ..., yn held constant, rather than the other way around.
It is common to consider the realizations y1, ..., yn as an identically and independently distributed (IID) sample of size n from some random variable Y with probability density f(y; θ0). In the IID case, the likelihood and log-likelihood can instead be written as
Ln(θ) = Π_{j=1}^n f(yj; θ)
and
log Ln(θ) = Σ_{j=1}^n log f(yj; θ),
respectively. We drop the conditioning on y1, ..., yn in Ln(θ), for brevity.
Using equations (1.44) or (1.45), the ML estimator can be defined as
θ̂n = arg max_{θ∈Θ} Ln(θ),
or equivalently
θ̂n = arg max_{θ∈Θ} log Ln(θ),   (1.46)
since log(·) is a strictly increasing function.
The ML estimator θ̂n has numerous desirable properties as an estimator, under certain regularity conditions.
For instance, in the IID case, under the following assumptions, the ML estimator can be shown to be consistent, asymptotically normal, and asymptotically efficient.

EL1 If θ ≠ θ′, then there exists a y ∈ Ω such that f(y; θ) ≠ f(y; θ′) (i.e. f(y; θ) is identifiable).

EL2 The support Ω of f(y; θ) is not a function of θ.

EL3 There exists an open subset N of Θ containing the true parameter vector θ0, such that for all y, the density f(y; θ) admits all third derivatives

∂³f(y; θ) / (∂θ_{k1} ∂θ_{k2} ∂θ_{k3}),

for each k1, k2, k3 = 1, ..., r.
EL4 The first and second logarithmic derivatives of f(y; θ) satisfy the equations

Eθ[∂ log f(Y; θ)/∂θ] = 0

and

Eθ[(∂ log f(Y; θ)/∂θ) · (∂ log f(Y; θ)/∂θ^T)] = Eθ[−∂² log f(Y; θ)/(∂θ ∂θ^T)].
EL5 The matrix

I(θ) = Eθ[−∂² log f(Y; θ)/(∂θ ∂θ^T)]

is finite and positive definite.
EL6 There exist functions M_{k1 k2 k3}(y) such that

|∂³ log f(y; θ) / (∂θ_{k1} ∂θ_{k2} ∂θ_{k3})| ≤ M_{k1 k2 k3}(y)

for all θ ∈ N, where Eθ0[M_{k1 k2 k3}(Y)] < ∞, for each k1, k2, k3 = 1, ..., r.
Here, the parameter space Θ is assumed to be a subset of R^r, θk is the kth element of θ, and

Eθ′[η(Y; θ)] = ∫_Ω η(y; θ) f(y; θ′) dy

for any θ′ ∈ Θ. Under assumptions EL1–EL6, Lehmann and Casella (1998, p. 463) gives the following result.
Theorem 1.9 (Asymptotic properties of the ML estimator). Let Y1, ..., Yn be IID with density f(y; θ0). If assumptions EL1–EL6 are satisfied, then the ML estimator is such that

1. θ̂n →^P θ0 (i.e. θ̂n is a consistent estimator of θ0);

2. √n(θ̂n − θ0) →^D N(0, I^{−1}(θ0)) (i.e. √n(θ̂n − θ0) is asymptotically normal);

3. √n(θ̂n,k − θ0k) →^D N(0, [I^{−1}(θ0)]_{kk}) (i.e. θ̂n,k is an asymptotically efficient estimator of the kth element θ0k of θ0).

In Theorem 1.9, →^P denotes convergence in probability, →^D denotes convergence in distribution, and [A]_{k1 k2} denotes the k1th row and k2th column element of A. Further, N(μ, Σ) denotes a Gaussian law with mean μ and covariance Σ. See Chapter 3 of Amemiya (1985) regarding the definitions for the modes of convergence.
The direct technique for ML estimation is to let θ̂n be a solution of the score equation

∂ log Ln(θ)/∂θ = 0.   (1.47)

The Hessian matrix

∂² log Ln(θ)/(∂θ ∂θ^T)

is then evaluated at θ̂n to verify that the solution is a maximum. The described process can be difficult in some
mixture model settings. For example, consider an IID sample y1 , ..., yn from a population with a two component
univariate Gaussian density of form (1.3), where π2 = 1 − π1 . In this case, the likelihood and log-likelihood
functions can be written as
Ln(θ) = Π_{j=1}^n [π1 φ1(yj; μ1, σ1²) + (1 − π1) φ1(yj; μ2, σ2²)]   (1.48)

and

log Ln(θ) = Σ_{j=1}^n log[π1 φ1(yj; μ1, σ1²) + (1 − π1) φ1(yj; μ2, σ2²)],   (1.49)
respectively. The score vector components of (1.49) can then be given as

∂ log Ln(θ)/∂π1 = Σ_{j=1}^n [φ1(yj; μ1, σ1²) − φ1(yj; μ2, σ2²)] / [π1 φ1(yj; μ1, σ1²) + (1 − π1) φ1(yj; μ2, σ2²)],

∂ log Ln(θ)/∂μ1 = Σ_{j=1}^n π1 (yj − μ1) φ1(yj; μ1, σ1²) / {σ1² [π1 φ1(yj; μ1, σ1²) + (1 − π1) φ1(yj; μ2, σ2²)]},

∂ log Ln(θ)/∂μ2 = Σ_{j=1}^n (1 − π1)(yj − μ2) φ1(yj; μ2, σ2²) / {σ2² [π1 φ1(yj; μ1, σ1²) + (1 − π1) φ1(yj; μ2, σ2²)]},

∂ log Ln(θ)/∂σ1² = Σ_{j=1}^n π1 [(yj − μ1)² − σ1²] φ1(yj; μ1, σ1²) / {2σ1⁴ [π1 φ1(yj; μ1, σ1²) + (1 − π1) φ1(yj; μ2, σ2²)]},

and

∂ log Ln(θ)/∂σ2² = Σ_{j=1}^n (1 − π1) [(yj − μ2)² − σ2²] φ1(yj; μ2, σ2²) / {2σ2⁴ [π1 φ1(yj; μ1, σ1²) + (1 − π1) φ1(yj; μ2, σ2²)]}.
We note that the system of score equations is highly nonlinear and thus a solution to (1.47) for the maximization
of (1.3) cannot be represented in closed form.
Another situation where the direct method of ML estimation cannot be performed is the case of a nondifferentiable likelihood, for instance when y1, ..., yn is an IID sample from a population with a Laplace density of form (1.24). In this scenario, the log-likelihood function

log Ln(θ) = Σ_{j=1}^n log λ(yj; μ, ξ) = −(n/2) log 2 − n log ξ − (√2/ξ) Σ_{j=1}^n |yj − μ|   (1.50)
is not differentiable and thus unsolvable via the direct method.
In difficult problems such as (1.49) and (1.50), indirect methods are necessary. Two broadly useful techniques for
the ML estimation of FMMs are the EM algorithms and the minorization–maximization (MM) algorithms, to which we shall devote the following subsections. Other indirect optimization techniques include derivative-free methods
such as Nelder-Mead (Nelder and Mead, 1965), as well as various Newton and quasi-Newton techniques; see
Lange (2013) and Nocedal and Wright (2006) for details regarding numerical optimization theory.
1.4.1 Expectation–Maximization Algorithms
The EM algorithms are a class of iterative algorithms that were first considered in Dempster et al. (1977), and
can be described as follows. Let f (y; θ) be the density function of the random variable Y that we wish to
maximize and suppose that f(y; θ) can be written as

f(y; θ) = ∫_{ΩZ} f^c(y, z; θ) dz,

where f^c(y, z; θ) is the joint density of (Y^T, Z^T)^T and ΩZ is the sample space of Z, for some random variable Z. We call f^c(y, z; θ) the complete-data likelihood of f(y; θ).
Let θ (m) denote the mth iterate of an algorithm, and let θ (0) be some initialization value. In the E-step
(expectation-step) at the mth iteration of an EM algorithm, one computes the conditional expectation
Q(θ; θ^(m)) = Eθ^(m) [log f^c(Y, Z; θ) | Y = y],

and in the M-step (maximization-step), Q(θ; θ^(m)) is maximized with respect to θ, and the (m + 1)th iterate of the algorithm is defined as

θ^(m+1) = arg max_{θ∈Θ} Q(θ; θ^(m)).   (1.51)

The E- and M-steps are iterated until a convergence criterion is met, such as the absolute or relative convergence criteria

log f(y; θ^(m+1)) − log f(y; θ^(m)) < ε   (1.52)

or

[log f(y; θ^(m+1)) − log f(y; θ^(m))] / log f(y; θ^(m)) < ε,   (1.53)

respectively, where ε > 0 is a threshold value. Upon convergence, the algorithm is terminated and the final
iterate is declared to be a candidate ML estimate. Due to the potential multimodality of the log-likelihood
function, multiple initialization values θ (0) are required to obtain a set of candidate estimates. Once the set is
obtained, the local maximum corresponding to the largest log-likelihood value is declared the ML estimate θ̂n ;
see Section 2.12 of McLachlan and Peel (2000) for a discussion regarding initialization in the FMM context,
and see Gan and Jiang (1999) for a more sophisticated approach to selecting the ML estimate from a set of
candidates.
If f (y; θ) is reinterpreted as the likelihood of the data, then the sequence of likelihood functions corresponding
to the sequence of iterates θ (m) can be shown to be ascending, in the sense that
log f(y; θ^(m+1)) ≥ log f(y; θ^(m)),   (1.54)
for each m = 0, 1, .... This property can be shown via the following results from Chapter 9 of Lange (2013).
Theorem 1.10 (Information inequality). Let f (y) and g(y) be probability densities with respect to a measure
κ, and suppose that f (y) > 0 and g (y) > 0 almost everywhere with respect to κ. If Ef denotes expectation with
respect to the measure f (y) dκ, then
Ef [log f (y)] ≥ Ef [log g (y)] ,
with equality if and only if f (y) = g (y), almost everywhere with respect to κ.
Using Theorem 1.10, we can show the ascent property (1.54) by demonstrating the inequality
log f(y; θ^(m+1)) ≥ log f(y; θ^(m)) + Q(θ^(m+1); θ^(m)) − Q(θ^(m); θ^(m)),   (1.55)

since Q(θ^(m+1); θ^(m)) − Q(θ^(m); θ^(m)) ≥ 0 by definition (1.51). Inequality (1.55) can be demonstrated by

Q(θ^(m); θ^(m)) − log f(y; θ^(m)) = Eθ^(m) [ log( f^c(Y, Z; θ^(m)) / f(Y; θ^(m)) ) | Y = y ]
 ≥ Eθ^(m) [ log( f^c(Y, Z; θ^(m+1)) / f(Y; θ^(m+1)) ) | Y = y ]
 = Q(θ^(m+1); θ^(m)) − log f(y; θ^(m+1)),   (1.56)

where Theorem 1.10 is utilized in the second line of (1.56).
The motivation behind the EM algorithms is that it is often possible to construct conjugate functions Q(θ; θ^(m)) which are easier to maximize than f(y; θ); thus the repeated maximization of Q(θ; θ^(m)) over many iterations is often computationally more feasible than the direct maximization of f(y; θ). Furthermore, as we have demonstrated, the EM algorithms exhibit monotonic ascent. This property provides a stability in the optimization iterations that is not guaranteed by other techniques, such as Newton algorithms.
Let θ* be a limit point of an EM algorithm, such that θ^(m) → θ* as m → ∞ (alternatively, as ε → 0, under
criteria (1.52) or (1.53)). Under certain regularity conditions, the following result from Wu (1983) (via Vaida
(2005)) relates the limit points to the solutions of the score equation (1.47).
Theorem 1.11 (Convergence of EM sequence). For appropriate starting values θ (0) , if θ ∗ is a limit point of
the sequence θ (m) , then
1. θ ∗ is a stationary point of the log-likelihood function log f (y; θ).
2. the sequence log f y; θ (m) is monotonically ascending and converges to log f (y; θ ∗ ).
We note that the EM algorithms we have described are in fact members of a broader class of algorithms known
as the generalized EM (GEM) algorithms. The GEM algorithms were considered in Dempster et al. (1977), and
extend upon the EM algorithms by simply allowing the update θ^(m+1) to be defined as

θ^(m+1) ∈ {θ ∈ Θ : Q(θ; θ^(m)) ≥ Q(θ^(m); θ^(m))}.   (1.57)
It is not difficult to see that the use of update (1.57) preserves the monotonic ascent property of the EM
algorithms.
Other variants of the EM algorithms include the expectation–conditional maximization (ECM) algorithms
(Meng and Rubin, 1993), the ECM either algorithms (Liu and Rubin, 1994), and the alternating ECM algorithm
(Meng and van Dyk, 1997), among many others. See McLachlan and Krishnan (2008) for a comprehensive review
of the EM algorithms and their variants.
We shall now consider EM algorithms for the maximization of (1.49) and (1.50).
Using the latent variable characterization from (1.2), we can write the complete data log-likelihood for (1.49)
as
log Lc_n(θ) = Σ_{j=1}^n I(zj = 1) [log π1 + log φ1(yj; μ1, σ1²)]   (1.58)
            + Σ_{j=1}^n [1 − I(zj = 1)] [log(1 − π1) + log φ1(yj; μ2, σ2²)].
In the E-step, we construct the conditional expectation over the latent variables Z1 , ..., Zn with respect to the
mth iterate θ^(m) to get the conjugate function

Q(θ; θ^(m)) = Eθ^(m) [log Lc_n(θ) | Y1 = y1, ..., Yn = yn]   (1.59)
            = Σ_{j=1}^n Eθ^(m) [I(Zj = 1) | Yj = yj] [log π1 + log φ1(yj; μ1, σ1²)]
            + Σ_{j=1}^n [1 − Eθ^(m) [I(Zj = 1) | Yj = yj]] [log(1 − π1) + log φ1(yj; μ2, σ2²)]
            = Σ_{j=1}^n τ(yj; θ^(m)) [log π1 + log φ1(yj; μ1, σ1²)]
            + Σ_{j=1}^n [1 − τ(yj; θ^(m))] [log(1 − π1) + log φ1(yj; μ2, σ2²)],

where

τ(yj; θ) = π1 φ1(yj; μ1, σ1²) / [π1 φ1(yj; μ1, σ1²) + (1 − π1) φ1(yj; μ2, σ2²)].   (1.60)
Here the expression τ (yj ; θ) = Pθ (Zj = 1|Yj = yj ), where the notation Pθ indicates that the probability is
conditioned on the parameter θ.
Using the conjugate function (1.59), the M-step can be conducted by solving the first-order condition
∂Q(θ; θ^(m))/∂θ = 0,   (1.61)
which yields the parameter component updates
π1^(m+1) = n^{−1} Σ_{j=1}^n τ(yj; θ^(m)),

μ1^(m+1) = Σ_{j=1}^n τ(yj; θ^(m)) yj / Σ_{j=1}^n τ(yj; θ^(m)),

μ2^(m+1) = Σ_{j=1}^n [1 − τ(yj; θ^(m))] yj / [n − Σ_{j=1}^n τ(yj; θ^(m))],

σ1^{2(m+1)} = Σ_{j=1}^n τ(yj; θ^(m)) (yj − μ1^(m+1))² / Σ_{j=1}^n τ(yj; θ^(m)),

and

σ2^{2(m+1)} = Σ_{j=1}^n [1 − τ(yj; θ^(m))] (yj − μ2^(m+1))² / [n − Σ_{j=1}^n τ(yj; θ^(m))].

It is routine to show that the update θ^(m+1) = (π1^(m+1), μ1^(m+1), μ2^(m+1), σ1^{2(m+1)}, σ2^{2(m+1)})^T is the global maximizer of Q(θ; θ^(m)). We note that this result is well known and is generally used as an introductory example of an EM algorithm; see for example Section 2.7 of McLachlan and Krishnan (2008).
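To make the E- and M-steps above concrete, the following Python sketch implements the iterations (1.59)–(1.61) for the two-component univariate GMM. It is a minimal illustration only, not the implementation used elsewhere in this thesis; the function name and stopping rule are our own choices, and the stopping rule is the absolute criterion (1.52).

```python
import numpy as np
from scipy.stats import norm

def em_two_component_gmm(y, pi1, mu1, mu2, s1, s2, max_iter=500, tol=1e-8):
    """EM iterations for the two-component univariate GMM likelihood (1.48)."""
    y = np.asarray(y, dtype=float)
    loglik_old = -np.inf
    for _ in range(max_iter):
        # E-step: posterior probabilities tau(yj; theta^(m)) of component 1, as in (1.60).
        d1 = pi1 * norm.pdf(y, mu1, np.sqrt(s1))
        d2 = (1 - pi1) * norm.pdf(y, mu2, np.sqrt(s2))
        tau = d1 / (d1 + d2)
        # M-step: closed-form updates for the component parameters.
        pi1 = tau.mean()
        mu1 = np.sum(tau * y) / np.sum(tau)
        mu2 = np.sum((1 - tau) * y) / np.sum(1 - tau)
        s1 = np.sum(tau * (y - mu1) ** 2) / np.sum(tau)
        s2 = np.sum((1 - tau) * (y - mu2) ** 2) / np.sum(1 - tau)
        # Absolute convergence criterion (1.52) on the log-likelihood (1.49).
        loglik = np.sum(np.log(pi1 * norm.pdf(y, mu1, np.sqrt(s1))
                               + (1 - pi1) * norm.pdf(y, mu2, np.sqrt(s2))))
        if loglik - loglik_old < tol:
            break
        loglik_old = loglik
    return pi1, mu1, mu2, s1, s2, loglik

# Example usage on a simulated two-component sample.
rng = np.random.default_rng(0)
y = np.concatenate([rng.normal(-1, 1, 300), rng.normal(2, 0.5, 200)])
print(em_two_component_gmm(y, 0.5, -2.0, 1.0, 1.0, 1.0))
```

In practice, the routine would be run from several initialization values, as discussed above, with the candidate attaining the largest log-likelihood retained as the ML estimate.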
Now, we shall consider an EM algorithm for the maximization of (1.50). Consider that the Laplace distribution
can be written as the scale mixture

λ(y; μ, ξ) = √2 ∫_0^∞ v φ1(y; μ, ξ²/(2v²)) fV(v) dv,   (1.62)

where

fV(v) = (1/v³) exp(−1/(2v²))   (1.63)
is the mixing density function over the latent variable V . Using (1.62), the complete-data log-likelihood of
(1.50) can be written as
log Lc_n(θ) = Σ_{j=1}^n log[ √2 vj φ1(yj; μ, ξ²/(2vj²)) fV(vj) ]
            = −(n/2) log ξ² − (1/ξ²) Σ_{j=1}^n vj² (yj − μ)² + C,

where v1, ..., vn are IID latent variables sampled from a population with density (1.63) and C is a function of the data that does not depend on θ = (μ, ξ)^T. From Phillips (2002), we have the following result,

Eθ^(m) [Vj² | Yj = yj] = ξ^(m) / (√2 |yj − μ^(m)|) = ω(yj; θ^(m)),
which we can use to obtain the conjugate function
Q(θ; θ^(m)) = Eθ^(m) [log Lc_n(θ) | Y1 = y1, ..., Yn = yn]   (1.64)
            = −(n/2) log ξ² − (1/ξ²) Σ_{j=1}^n Eθ^(m) [Vj² | Yj = yj] (yj − μ)² + C̄
            = −(n/2) log ξ² − (1/ξ²) Σ_{j=1}^n ω(yj; θ^(m)) (yj − μ)² + C̄,

where C̄ is a constant that does not depend on θ. In the M-step, we solve the first-order condition (1.61), which yields the updates

μ^(m+1) = Σ_{j=1}^n ω(yj; θ^(m)) yj / Σ_{j=1}^n ω(yj; θ^(m))

and

ξ^(m+1) = sqrt[ (2/n) Σ_{j=1}^n ω(yj; θ^(m)) (yj − μ^(m+1))² ].

Again, it is simple to check that θ^(m+1) = (μ^(m+1), ξ^(m+1))^T maximizes (1.64). We make a final remark that
the maximum of (1.50) can be obtained by setting µ̂ to be the median of the data y1 , ..., yn and by setting
ξ̂ = n^{−1} √2 Σ_{j=1}^n |yj − μ̂|, although this solution arises via a nonstandard argument; see Norton (1984) for
details.
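The Laplace EM above admits an equally short implementation. The following Python sketch is illustrative only; it assumes that no observation coincides exactly with the current iterate of μ (otherwise the weight ω is undefined), and the function name is ours.

```python
import numpy as np

def em_laplace(y, mu=0.0, xi=1.0, max_iter=500, tol=1e-10):
    """EM iterations for the Laplace ML problem (1.50), using the
    scale-mixture representation (1.62)."""
    y = np.asarray(y, dtype=float)
    n = y.size
    for _ in range(max_iter):
        # E-step: omega(yj; theta^(m)) = E[Vj^2 | Yj = yj], assuming no yj equals mu exactly.
        omega = xi / (np.sqrt(2.0) * np.abs(y - mu))
        # M-step updates for mu and xi.
        mu_new = np.sum(omega * y) / np.sum(omega)
        xi_new = np.sqrt(2.0 * np.sum(omega * (y - mu_new) ** 2) / n)
        converged = abs(mu_new - mu) + abs(xi_new - xi) < tol
        mu, xi = mu_new, xi_new
        if converged:
            break
    return mu, xi

rng = np.random.default_rng(1)
y = rng.laplace(loc=2.0, scale=0.5, size=1000)
print(em_laplace(y))  # the estimate of mu should be close to the sample median of y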
1.4.2 Minorization–Maximization Algorithms
Although the EM algorithms can provide simple optimization procedures for difficult ML estimation problems,
the necessity to construct a complete-data likelihood as well as the requirement for a probabilistic interpretation
of the maximization problem can restrict the application of these methods. To resolve the restrictions of the
EM algorithms, Becker et al. (1997) suggested the MM algorithms framework as an alternative; see Hunter and
Lange (2004) and Chapter 8 of Lange (2013) for details. The MM algorithms can be described as follows.
Let η(θ) be an objective function to be maximized with respect to θ ∈ Θ, for some set Θ ⊂ R^r, and let θ′ ∈ Θ. Further, suppose that η(θ) cannot be maximized directly (e.g. the first-order conditions cannot be solved, or the function is nondifferentiable). In such a case, we can seek a minorizer of η(θ) instead. The minorizer h(θ; θ′) is a function of θ that takes a conditional argument θ′, such that η(θ) = h(θ; θ) and η(θ) ≥ h(θ; θ′) whenever θ ≠ θ′. If h(θ; θ′) satisfies the two conditions above, we say that h(θ; θ′) minorizes η(θ).
Given the existence of a minorizer h(θ; θ′) of η(θ), we denote the mth iterate of an MM algorithm for the maximization of η(θ) as θ^(m) and we define the (m + 1)th iterate as

θ^(m+1) = arg max_{θ∈Θ} h(θ; θ^(m)).   (1.65)
The motivation behind the MM algorithms is the same as that of the EM algorithms; that is, it is often easier to
iteratively maximize the minorizer than it is to maximize η (θ) directly. Furthermore, like the EM algorithms,
the MM algorithms can also be shown to exhibit the ascent property since the inequality
η(θ^(m+1)) ≥ h(θ^(m+1); θ^(m)) ≥ h(θ^(m); θ^(m)) = η(θ^(m))
is satisfied by the definition of a minorizer and the definition of the iterate (1.65).
Let θ ∗ be the limit point of the sequence θ (m) (i.e. θ (m) → θ ∗ as m → ∞). The following result from Vaida
(2005) establishes the link between the limit point and the first-order condition,
∂η(θ)/∂θ = 0.
Theorem 1.12 (Convergence of MM sequence). For appropriate starting values θ (0) , if θ ∗ is a limit point of
the sequence θ (m) , then
1. θ ∗ is a stationary point of the function η (θ).
2. the sequence η θ (m) is monotonically ascending and converges to η (θ ∗ ).
We note that both the EM and the MM algorithms share the ascent property and that Theorems 1.11 and 1.12 are identical in implication. As such, the MM algorithms are generally considered to be a generalization of the EM algorithms to nonprobabilistic problems.

We shall now consider MM algorithms for the maximization of (1.49) and (1.50).

Let ψ = (ψ1^T, ..., ψn^T)^T and ψj = (ψ1j, ..., ψgj)^T, such that ψij > 0 for all i = 1, ..., g and j = 1, ..., n, and let

η(ψ) = Σ_{j=1}^n log ( Σ_{i=1}^g ψij ).   (1.66)
In Zhou and Lange (2010), it is shown that η (ψ) can be minorized by
h(ψ; ψ^(m)) = Σ_{j=1}^n { Σ_{i=1}^g [ψij^(m) / Σ_{i′=1}^g ψi′j^(m)] log ψij + Cj },   (1.67)

where

Cj = − Σ_{i=1}^g [ψij^(m) / Σ_{i′=1}^g ψi′j^(m)] log [ψij^(m) / Σ_{i′=1}^g ψi′j^(m)].
Using (1.67), we can minorize (1.49) by
h(θ; θ^(m)) = Σ_{j=1}^n τ(yj; θ^(m)) [log π1 + log φ1(yj; μ1, σ1²)]   (1.68)
            + Σ_{j=1}^n [1 − τ(yj; θ^(m))] [log(1 − π1) + log φ1(yj; μ2, σ2²)]
            + Σ_{j=1}^n Cj,

where

Cj = −τ(yj; θ^(m)) log τ(yj; θ^(m)) − [1 − τ(yj; θ^(m))] log[1 − τ(yj; θ^(m))]
and τ(yj; θ^(m)) is the same as in (1.60). Expression (1.68) can be obtained by setting ψ1j = π1 φ1(yj; μ1, σ1²) and ψ2j = (1 − π1) φ1(yj; μ2, σ2²), for each j = 1, ..., n. It is easy to see that since Cj is independent of θ for all j, the solution to the first-order condition

∂h(θ; θ^(m))/∂θ = 0   (1.69)
is the same as the EM updates for (1.49). Thus, both the EM and the MM algorithm for the maximization of
(1.49) are identical.
Now, let ψ = (ψ1, ..., ψn)^T such that ψj > 0 for each j = 1, ..., n, and let

η(ψ) = Σ_{j=1}^n aj − Σ_{j=1}^n bj √ψj,   (1.70)

where aj ∈ R and bj ∈ (0, ∞). It is simple to show that (1.70) can be minorized by

h(ψ; ψ^(m)) = Σ_{j=1}^n aj − Σ_{j=1}^n bj √(ψj^(m)) − Σ_{j=1}^n bj (ψj − ψj^(m)) / [2 √(ψj^(m))].   (1.71)
Using (1.71), we can minorize (1.50) by
h(θ; θ^(m)) = −(n/2) log 2 − n log ξ − [√2/(2ξ)] Σ_{j=1}^n |yj − μ^(m)| − [√2/(2ξ)] Σ_{j=1}^n (yj − μ)² / |yj − μ^(m)|   (1.72)
            = −(n/2) log 2 − n log ξ − [√2/(2ξ)] Σ_{j=1}^n ω′(yj; θ^(m)) − [√2/(2ξ)] Σ_{j=1}^n (yj − μ)² / ω′(yj; θ^(m)),

where ω′(yj; θ) = |yj − μ|. Expression (1.72) can be obtained by setting ψj = (yj − μ)² = [ω′(yj; θ)]², aj = −2^{−1} log 2 − log ξ, and bj = ξ^{−1}√2, for each j.
Solving the first-order conditions (1.69) for (1.72) yields the updates

μ^(m+1) = [ Σ_{j=1}^n yj / ω′(yj; θ^(m)) ] / [ Σ_{j=1}^n 1 / ω′(yj; θ^(m)) ]

and

ξ^(m+1) = [√2/(2n)] Σ_{j=1}^n [ ω′(yj; θ^(m)) + (yj − μ^(m+1))² / ω′(yj; θ^(m)) ].
We note that although the updates µ(m+1) are identical between the EM and the MM algorithms, the updates
ξ (m+1) are slightly different. Thus, although both EM and MM algorithms can be constructed for this problem,
the resulting algorithms may differ from one another. An example comparison between the performances of the
EM and MM algorithms, when both are applicable, can be found in Zhou and Zhang (2012). Lastly, like the
EM algorithms, the MM algorithms are generalizable into a broader class; see for instance the block successive
methods of Razaviyayn et al. (2013).
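For comparison with the EM sketch given earlier, the following Python fragment implements the MM iterations as reconstructed above. It is a sketch under our reconstruction of the ξ update; as in the EM case, it assumes that no observation coincides exactly with the current iterate of μ.

```python
import numpy as np

def mm_laplace(y, mu=0.0, xi=1.0, max_iter=500, tol=1e-10):
    """MM iterations for the Laplace ML problem (1.50), based on minorizer (1.72)."""
    y = np.asarray(y, dtype=float)
    n = y.size
    for _ in range(max_iter):
        w = np.abs(y - mu)                       # omega'(yj; theta^(m))
        mu_new = np.sum(y / w) / np.sum(1.0 / w) # identical to the EM update for mu
        xi_new = np.sqrt(2.0) / (2.0 * n) * np.sum(w + (y - mu_new) ** 2 / w)
        converged = abs(mu_new - mu) + abs(xi_new - xi) < tol
        mu, xi = mu_new, xi_new
        if converged:
            break
    return mu, xi
```

Running mm_laplace and em_laplace on the same data illustrates the remark above: the μ iterates coincide while the ξ iterates differ slightly along the path, even though both sequences ascend the same log-likelihood.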
1.4.3 Identifiability
In Theorem 1.9, one of the assumptions required for deducing the consistency, asymptotic normality, and
asymptotic efficiency of an ML estimator is for the probability density of the statistical model to be identifiable
(i.e. EL1). Unfortunately, FMMs are not identifiable in the strict sense of EL1. For example, consider model
(1.3); if we set

θ = (π1, μ1, μ2, σ1², σ2²)^T = (1/2, −1, 1, 1, 1)^T

and

θ′ = (π1′, μ1′, μ2′, σ1′², σ2′²)^T = (1/2, 1, −1, 1, 1)^T,

then it is not difficult to see that f(y; θ) = f(y; θ′) for every y ∈ R.
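The label-switching example above can be checked numerically; the short Python fragment below (illustrative only) evaluates the two parameterizations on a grid and confirms that they produce the same density.

```python
import numpy as np
from scipy.stats import norm

def mixture_density(y, pi1, mu1, mu2, s1, s2):
    # Two-component univariate GMM density of form (1.3).
    return pi1 * norm.pdf(y, mu1, np.sqrt(s1)) + (1 - pi1) * norm.pdf(y, mu2, np.sqrt(s2))

y = np.linspace(-5, 5, 2001)
d_theta = mixture_density(y, 0.5, -1.0, 1.0, 1.0, 1.0)
d_theta_prime = mixture_density(y, 0.5, 1.0, -1.0, 1.0, 1.0)
print(np.allclose(d_theta, d_theta_prime))  # True: both labelings give the same density
```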
Because of the lack of strict identifiability, weaker notions of identifiability have been suggested in the FMM
literature. Here, we shall follow the conventions set out in Chapter 3 of Titterington et al. (1985); see Redner
(1981) for an asymptotic notion of identifiability. From Titterington et al. (1985), let R be a family of density
functions defined as
R = { ρ(y; θ) : θ ∈ Θ, y ∈ R^d },

and let Fg be the family of FMMs defined as

Fg = { f(y; θ) : f(y; θ) = Σ_{i=1}^g πi ρ(y; θi), πi > 0, Σ_{i=1}^g πi = 1, ρ(y; θi) ∈ R, i = 1, ..., g },

where θ = (π1, ..., π_{g−1}, θ1^T, ..., θg^T)^T. Suppose that f(y; θ) and f(y; θ′) are two members of Fg, such that f(y; θ) = f(y; θ′). We say that Fg is identifiable if and only if we can order the summations such that πi = πi′ and ρ(y; θi) = ρ(y; θi′), for each i = 1, ..., g.
Using the notion of identifiability above, Teicher (1961) and Teicher (1963) gave the following results for specific
families of FMMs.
Theorem 1.13 (Teicher (1961,1963)). The following families of density functions R generate identifiable FMM
families Fg :
1. Poisson, ρ(y; θ) = (λ^y / y!) exp(−λ), where θ = λ.

2. Univariate Gaussian, ρ(y; θ) = φ1(y; μ, σ²), where θ = (μ, σ²)^T.

3. Gamma, ρ(y; θ) = γ(y; a, b), where θ = (a, b)^T.

4. Binomial, ρ(y; θ) = [N! / ((N − y)! y!)] p^y (1 − p)^(N−y), where θ = p and N ≥ 2g − 1.
The following result from Yakowitz and Spragins (1968) provides an alternative definition of identifiability.
Theorem 1.14 (Characterization of identifiability). A necessary and sufficient condition that the class Fg of
FMMs of the family R be identifiable is that the family R be a linearly independent set of functions.
Using Theorem 1.14, Yakowitz and Spragins (1968) gave the following identifiability results.
Theorem 1.15 (Yakowitz and Spragins (1968)). The following families of density functions R generate identifiable FMM families Fg :
1. Product exponential, ρ(y; θ) = Π_{j=1}^d λj exp(−λj yj), where θ = (λ1, ..., λd)^T ∈ (0, ∞)^d.

2. Multivariate Gaussian, ρ(y; θ) = φd(y; μ, Σ), where θ = (μ^T, vech^T(Σ))^T.

3. Univariate Cauchy, ρ(y; θ) = t1(y; μ, σ², 1), where θ = (μ, σ²)^T.

4. Negative binomial, ρ(y; θ) = [(y + r − 1)! / ((r − 1)! y!)] (1 − p)^r p^y, where θ = (p, r)^T ∈ (0, 1) × (0, ∞).
In Holzmann et al. (2006), the techniques developed in Teicher (1961), Teicher (1963), and Yakowitz and
Spragins (1968) were generalized and applied to the class of FMMs of elliptical symmetric densities to obtain
the following results.
Theorem 1.16 (Identifiability of elliptical symmetric FMMs). The following families of density functions R
generate identifiable FMM families Fg :
1. Multivariate t, ρ(y; θ) = td(y; μ, Σ, v), where θ = (μ^T, vech^T(Σ), v)^T.

2. Kotz and exponential power,

ρ(y; θ) = { s Γ(d/2) / (π^{d/2} Γ[(2r + 1)/(2s)]) } [(y − μ)^T Σ^{−1} (y − μ)]^r exp{ −[(y − μ)^T Σ^{−1} (y − μ)]^s },

where θ = (μ^T, vech^T(Σ), r, s)^T, r ∈ (−d/2, ∞) and s ∈ (0, ∞).
3. Multivariate symmetric stable; see Section 3.5 of Fang et al. (1990) for details.
We note that the Kotz and exponential power family includes many common models such as the multivariate
Gaussian, multivariate t, and the Laplace densities. Further results regarding the identifiability of some other
classes of FMMs can be found in Ahmad (1988) and Al-Hussaini and Ahmad (1981).
For MLRMs and MGLMs, notions of identifiability have been established in articles such as Follman and
Lambert (1991) and Hennig (2000), where conditions for FMMs of linear and logistic regressions were considered,
respectively. The results of Follman and Lambert (1991) and Hennig (2000) have been subsumed by Jiang and
Tanner (1999b), who discuss the general identifiability of MoEs.
In Jiang and Tanner (1999b), the ordered and initialized identifiability of nondegenerate MoEs is considered.
Here, the family of MoEs is defined as

Fg = { f(y|w, x; θ) : f(y|w, x; θ) = Σ_{i=1}^g πi(w; α) ρ(y|x; θi), αi ∈ R^p, ρ(y|x; θi) ∈ R, i = 1, ..., g },

where ρ(y|x; θi) = ρ(y; η(βi^T x), ψi) is a univariate GLM (e.g. component models from Table 1.6), such that θi = (βi^T, ψi)^T ∈ R^p × (0, ∞). In context, ordered, initialized, and nondegenerate are to be interpreted as follows.

Ordered implies that there exists an ordering relationship ≺, such that (α1^T, θ1^T)^T ≺ (α2^T, θ2^T)^T ≺ ... ≺ (αg^T, θg^T)^T; initialized implies that αg = 0; and nondegenerate implies that θi ≠ θi′ for any i ≠ i′. Using this notion of
identifiability, Jiang and Tanner (1999b) give the following results.
Theorem 1.17 (Identifiability of MoEs). The following families of MoEs Fg are ordered and initialized identifiable:
1. Poisson, ρ(y|x; θ) = ([η(β^T x)]^y / y!) exp(−η(β^T x)), where η(β^T x) = exp(β^T x).

2. Univariate Gaussian, ρ(y|x; θ) = φ1(y; η(β^T x), σ²), where η(β^T x) = β^T x and ψ = σ².

3. Gamma, ρ(y|x; θ) = γ(y; η(β^T x; b), b), where η(β^T x; b) = −b/(β^T x) and ψ = b.

4. Binomial, ρ(y|x; θ) = [N! / ((N − y)! y!)] [η(β^T x)]^y [1 − η(β^T x)]^(N−y), where η(β^T x) = exp(β^T x) / [1 + exp(β^T x)] and N ≥ 2g − 1.
We note that the literature on identifiability is broad and disparate, with respect to the FMMs of interest; that
is, each type of FMM is generally considered individually for the purpose of the research at hand. We also note
that there are many FMMs that are unidentifiable and thus can create problems in inference. For example, the
binomial FMM with N < 2g − 1 and triangular FMMs are unidentifiable; see Teicher (1963) and Holzmann
et al. (2006), respectively. Results on the lack of identifiability in Beta and Pearson Type VI FMMs can be
found in Ahmad and Al-Hussaini (1982).
1.4.4 Asymptotic Theory
Due to the lack of strict sense identifiability (i.e. assumption EL1), it is not possible to utilize Theorem 1.9 to
establish the asymptotic properties of the ML estimates for FMMs. Furthermore, in many FMMs, the likelihood
and log-likelihood functions are unbounded and thus no ML estimate of form (1.46) need exist. To see how this
can occur, consider the two component univariate GMM likelihood (1.48) for a single observation y1 , such that
π1 , µ1 and σ12 are fixed at π1∗ , µ∗1 and σ1∗2 , and µ2 = y1 . Under this setup, the only free parameter is σ22 , and
the likelihood function, with respect to σ22 , can be simplified to
L1(σ2²) = π1* φ1(y1; μ1*, σ1*²) + (1 − π1*) / √(2πσ2²).

It is not difficult to see that lim_{σ2²→0} L1(σ2²) = ∞, thus showing that the likelihood function is unbounded.
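The degeneracy above is easy to observe numerically; the following fragment (illustrative only) evaluates L1(σ2²) along a sequence of shrinking variances.

```python
import numpy as np
from scipy.stats import norm

y1 = 0.3                          # a single observation
pi1, mu1, s1 = 0.5, 0.0, 1.0      # fixed values pi1*, mu1*, sigma1*^2
for s2 in [1.0, 1e-2, 1e-4, 1e-8]:
    L1 = pi1 * norm.pdf(y1, mu1, np.sqrt(s1)) + (1 - pi1) * norm.pdf(y1, y1, np.sqrt(s2))
    print(s2, L1)                 # L1 grows without bound as sigma2^2 -> 0
```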
The problem of asymptotic inference (i.e. consistency, asymptotic normality, and asymptotic efficiency) of
FMMs has been approached in various ways. Classically, Redner and Walker (1984) considered the consistency
and asymptotic normality of FMMs via the notion of asymptotic identifiability established in Redner (1981).
This was extended upon in Hathaway (1985), where the consistency and existence of ML estimators for GMMs
were considered when the likelihood is bounded from above via variance restrictions. More recent works include
empirical processes arguments for the asymptotic normality of FMMs (Van De Geer, 1997), asymptotic results
47
for the case where πi = 0 for some i (Cheng and Liu, 2001; Feng and McCulloch, 1996), and refinements of the
assumptions of Redner and Walker (1984) (Atienza et al., 2007).
Here, we prefer to address the problem of asymptotic inference through the more general perspective of extremum
estimators (EEs); see Chapter 4 of Amemiya (1985) and Chapter 24 of Gourieroux and Monfort (1995). We
can define an EE as follows.
Let Qn (θ) = Q (y1 , ..., yn ; θ) be a function of the data y1 , ..., yn and the parameter vector θ ∈ Θ ⊂ Rr , and
define the EE with respect to Qn (θ) as
θ̂n = arg max_{θ∈Θ} Qn(θ).   (1.73)

The class of extremum estimators subsumes a large number of statistical estimators. For example, if we set Qn(θ) =
log Ln (θ), then we simply have ML estimation. Let θ0 ∈ Θ be the true value of the parameter θ (i.e. the data
y1 , ..., yn arises from a population with density of form f (y1 , ..., yn ; θ0 )); we shall firstly consider the conditions
for consistency of the EE, when the parameter space Θ is compact. Make the following assumptions.
TA1.1 The parameter space Θ ⊂ Rr is compact and θ0 ∈ Θ.
TA1.2 The function Qn (θ) is continuous in θ ∈ Θ for all y1 , ..., yn and is a measurable function of y1 , ..., yn
for all θ ∈ Θ.
TA1.3 n−1 Qn (θ) converges to a nonstochastic function Q (θ) in probability uniformly in θ ∈ Θ as n → ∞,
and Q (θ) attains a unique maximum at Q (θ0 ).
Under assumptions TA1.1–TA1.3, Amemiya (1985) gives the following result.
Theorem 1.18 (Theorem 4.1.1, Amemiya (1985)). Under assumptions TA1.1–TA1.3, if θ̂n is defined as in (1.73), then θ̂n →^P θ0.

Theorem 1.18 is useful for FMMs since one can bound the parameter space of an FMM parameter vector in
order to break the potential permutation symmetries. For example, in the two component univariate GMM,
the parameter space can be set to
Θ = { θ = (π1, μ1, μ2, σ1², σ2²)^T ∈ (0, 1) × R² × (0, ∞)² : a ≤ π1 ≤ 1 − a, b1 ≤ μ1 ≤ μ2 ≤ b2, c1 ≤ σ1², σ2² ≤ c2 },
where a ∈ (0, 1/2), c1 , c2 ∈ (0, ∞) and b1 , b2 ∈ R, such that b1 ≤ b2 and c1 ≤ c2 .
Although functional, it is unsatisfactory and often computationally difficult to enforce compactness restrictions
on the parameter space. As such, an alternative result is required for the case where Θ is not compact. Make
the following assumptions.
TA2.1 The parameter space Θ ⊂ Rr is an open set and θ0 ∈ Θ.
TA2.2 The function Qn (θ) is a measurable function of y1 , ..., yn , and ∂Qn (θ) /∂θ exists and is continuous in
an open neighborhood N1 (θ0 ) of θ0 .
48
TA2.3 There exists an open neighborhood N2 (θ0 ) of θ0 such that n−1 Qn (θ) converges to a nonstochastic
function Q (θ) in probability uniformly in θ ∈ N2 (θ0 ) as n → ∞, and Q (θ) attains a strict local
maximum at Q (θ0 ).
Under assumptions TA2.1–TA2.3, Amemiya (1985) gives the following result.
Theorem 1.19 (Theorem 4.1.2, Amemiya (1985)). Under assumptions TA2.1–TA2.3, if

Θn = { θ̂ ∈ Θ : ∂Qn(θ)/∂θ |_{θ=θ̂} = 0 and θ̂ is a local maximum },

where Θn = {0} if there are no roots, then for any ε > 0,

lim_{n→∞} P[ inf_{θ̂∈Θn} (θ̂ − θ0)^T (θ̂ − θ0) > ε ] = 0.
Theorem 1.19 is usually taken to mean that there exists a consistent sequence of roots to the first-order condition
∂Qn(θ)/∂θ = 0. Note that the consistent root need only be a local maximum of Qn(θ), and thus, for example, all of the permutations of θ ∈ (0, 1) × R² × (0, ∞)² which correspond to the same likelihood value obtained via
the EM algorithm for maximizing (1.49) can be considered simultaneously. Furthermore, since the consistent
root is assumed to be a local maximum, the unboundedness of the likelihood function is circumvented.
We shall refer to the estimator defined by

θ̂n ∈ Θn = { θ̂ ∈ Θ : ∂Qn(θ)/∂θ |_{θ=θ̂} = 0 and θ̂ is a local maximum }   (1.74)
as a local EE, and make the following additional assumptions.
TA3.1 The Hessian ∂ 2 Qn (θ) /∂θ∂θ T is continuous in an open and convex neighborhood of θ0 .
TA3.2 Let

n^{−1} ∂²Qn(θ)/(∂θ ∂θ^T) |_{θn*} →^P A(θ0) = lim_{n→∞} Eθ0 [ n^{−1} ∂²Qn(θ)/(∂θ ∂θ^T) |_{θ0} ],

for any sequence θn* such that θn* →^P θ0, where A(θ0) is finite and nonsingular.

TA3.3 Let

n^{−1/2} ∂Qn(θ)/∂θ |_{θ0} →^D N(0, B(θ0)),

where

B(θ0) = lim_{n→∞} Eθ0 [ n^{−1} ∂Qn(θ)/∂θ |_{θ0} · ∂Qn(θ)/∂θ^T |_{θ0} ].
Under assumptions TA2.1–TA2.3 and TA3.1–TA3.3, Amemiya (1985) gives the following result regarding asymptotic normality.
Theorem 1.20 (Theorem 4.1.3, Amemiya (1985)). Under assumptions TA2.1–TA2.3 and TA3.1–TA3.3, if θ̂n is a sequence of elements as defined in (1.74), such that θ̂n →^P θ0, then

n^{1/2}(θ̂n − θ0) →^D N(0, A^{−1}(θ0) B(θ0) A^{−1}(θ0)).
Now, we shall narrow our focus to the case where Qn (θ) = log Ln (θ) and Y1 , ..., Yn are an IID random sample
from a population with density f (y; θ0 ) by making the following assumptions.
TA4.1 The function f (y; θ) is a measurable function of y1 , ..., yn , and ∂f (y; θ) /∂θ exists and is continuous
in an open neighborhood N1 (θ0 ) of θ0 .
TA4.2 There exists an open neighborhood N2 (θ0 ) of θ0 such that n−1 log Ln (θ) converges to a nonstochastic
function Eθ0[log f(y; θ)] in probability uniformly in θ ∈ N2(θ0) as n → ∞, and Eθ0[log f(y; θ)] attains a strict local maximum at Eθ0[log f(y; θ0)].

TA4.3 The matrix

n^{−1} Σ_{j=1}^n ∂² log f(yj; θ)/(∂θ ∂θ^T)
converges in probability uniformly in θ in an open neighborhood of θ0 .
Under assumptions TA4.1–TA4.3, EL2, EL4, and EL5, Amemiya (1985) gives the following result for local ML
estimators.
Theorem 1.21 (Theorem 4.2.4, Amemiya (1985)). Let θ̂n be a sequence of elements as defined by

θ̂n ∈ Θn = { θ̂ ∈ Θ : ∂ log Ln(θ)/∂θ |_{θ=θ̂} = 0 and θ̂ is a local maximum }.   (1.75)

If assumptions TA4.1–TA4.3, EL2, EL4, and EL5 hold, and θ̂n →^P θ0, then

n^{1/2}(θ̂n − θ0) →^D N(0, I^{−1}(θ0)).
Thus, Theorem 1.21 shows that the local ML estimator, as defined by (1.75), is both asymptotically normal
and asymptotically efficient. This is a powerful result for the FMMs since neither uniqueness nor existence of
an ML estimator (i.e. (1.46)) are guaranteed.
Lastly, we note that Theorems 1.18–1.21 all depend on the uniform convergence in probability of various functions. Such convergence can be established via uniform laws of large numbers, such as the following result from
Jennrich (1969).
Theorem 1.22 (Uniform law of large numbers). Let g (y; θ) be a measurable function of y and a continuous
function of θ for each θ ∈ Θ ⊂ Rr , such that Θ is compact. Assume that Eθ0 [g (y; θ)] = 0, and let y1 , ..., yn be
a sample of IID random vectors with density f (y; θ0 ), such that Eθ0 supθ∈Θ |g (yi ; θ)| < ∞, for each i = 1, ..., n.
Then, Σ_{i=1}^n n^{−1} g(yi; θ) converges to 0 in probability uniformly in θ ∈ Θ.
In situations where Theorem 1.22 does not apply (e.g. when the data are not IID), various forms of uniform
laws of large numbers can be used in its place. For example, such alternatives can be found in Andrews (1992), Chapter 29 of Devroye et al. (1996), and Dudley (1999).
1.4.5 Determining the number of components
Thus far, we have considered the problem of ML estimation for the parameters of an FMM in the context of a
known number of components g. However, in practice, there is often no basis upon which to form a prior hypothesis regarding the value of g. As such, methodology must be considered for the selection of g from data. In the
FMM literature, there has been strong interest in the problem; see Fonseca and Cardoso (2007), Fraley and
Raftery (1998), Chapter 6 of McLachlan and Peel (2000), and McLachlan and Rathnayake (2014) for reviews
on the topic.
With respect to ML estimation, the most popular methodology for the selection of g is via information criteria
(IC). See Burnham and Anderson (2002) and Claeskens and Hjort (2008) for general treatments on IC. An
IC-based selection process can be described as follows.

Let fγ(y; θ) = Σ_{i=1}^γ πi ρ(y; θi) be an FMM with component densities of form ρ(y; θi), and suppose that we observe a sample y1, ..., yn that arises from a population with density fg(y; θ0), such that g ∈ G ⊂ N. In an IC process, for each γ ∈ G, we compute the ML estimate θ̂n,γ and the IC for n samples and γ components,

IC(n, γ) = −2 log Ln(θ̂n,γ) + η(n, θ̂n,γ),

where η is the penalty function and is dependent on the sample size and parameter estimates. Upon computing the IC for each γ, an estimate of the number of components ĝn is selected via the rule

ĝn = arg min_{γ∈G} IC(n, γ).   (1.76)
The most popular IC used in the literature are the Akaike IC (AIC) (Akaike, 1974) and the Bayesian IC
(BIC) (Schwarz, 1978), which use the penalties η n, θ̂n,γ = 2D θ̂n,γ and η n, θ̂n,γ = D θ̂n,γ log n,
respectively, where D (θ) is the dimension of the vector θ (i.e. the number of parameter elements). Under
regularity conditions, rule (1.76) can be shown to correctly select g via various measures of correctness. For
example, Leroux (1992) gives the following result for the case where Ω ⊂ R.
Theorem 1.23 (Leroux (1992)). Let Y be a random variable with density fg(y; θ0), and let Ŷn be a random variable with density fĝn(y; θ̂n,ĝn). Under regularity conditions, if for every γ < g, η(n, θ̂n,γ+1) ≥ η(n, θ̂n,γ) for all n and lim sup_{n→∞} n^{−1} η(n, θ̂n,γ) = 0, with probability 1, then under rule (1.76), lim inf_{n→∞} ĝn ≥ g, with probability 1, and Ŷn →^D Y.
Theorem 1.23 shows that under favorable conditions, an IC process will not underestimate the number of
components in an FMM, and furthermore, the random variable implied by the estimated model approaches the
random variable from the generative distribution, in law. It is simple to validate that both the AIC and BIC
penalties meet the conditions of Theorem 1.23. Theorem 1.23 is extended upon in Keribin (2000).
Theorem 1.24 (Keribin (2000)). Under regularity conditions, suppose that η(n, θ̂n,γ) satisfies the conditions η(n, θ̂n,γ+1) ≥ η(n, θ̂n,γ) > 0, lim_{n→∞} η(n, θ̂n,γ) = ∞, lim_{n→∞} n^{−1} η(n, θ̂n,γ) = 0, and

lim_{n→∞} η(n, θ̂n,γ) / η(n, θ̂n,γ′) > 1,

for each γ′ < γ ≤ ḡ, where ḡ = max G; and

lim_{n→∞} (log log n) / η(n, θ̂n,γ) = 0.

If ḡ < ∞, then under rule (1.76), ĝn →^P g and θ̂n,ĝn →^P θ0.
Theorem 1.24 states that under stronger regularity on the penalty function, the IC process does in fact correctly
select the number of components as the sample size becomes large. Furthermore, in the process of selecting the
correct number of components, we also obtain consistent estimates of the parameter vector. We note that the
BIC meets the conditions of Theorem 1.24, but not the AIC.
The regularity conditions on the component densities are difficult to check in both Theorems 1.23 and 1.24. However, it has been shown that these theorems are valid for both the univariate GMM and the Poisson FMM in Keribin (2000).
Due to the difficulty in validating regularity, and the lack of generalization to the multivariate setting, many
authors in the area have focused on the construction of FMM specific IC with good empirical performance in
simulations. Examples of these IC are the entropy-based methods.
Define the entropy of the n-sample likelihood for a g-component FMM as

EN(θ) = − Σ_{i=1}^g Σ_{j=1}^n τi(yj; θ) log τi(yj; θ),

where

τi(yj; θ) = πi ρ(yj; θi) / Σ_{i′=1}^g πi′ ρ(yj; θi′).

It is not difficult to show that, for any FMM, the log-likelihood can be written as

log Ln(θ) = log Lc_n(θ) + EN(θ),

where log Lc_n(θ) is the complete-data log-likelihood from Section 1.4.1.
The entropy is used in three popular IC: the approximate weight of evidence (AWE) (Banfield and Raftery,
1993), the classification likelihood information criterion (CLC) (Biernacki and Govaert, 1997), and the integrated
CLC (ICL) (Biernacki et al., 2000). The AWE, CLC, and ICL set the penalty to be
η(n, θ̂n,γ) = 2 EN(θ̂n,γ) + 2 D(θ̂n,γ) (3/2 + log n),

η(n, θ̂n,γ) = 2 EN(θ̂n,γ),

and

η(n, θ̂n,γ) = 2 EN(θ̂n,γ) + D(θ̂n,γ) log n,

respectively. The rationale behind the use of the entropy as a penalty is that it is small when component densities
are separated and larger when they are close. Thus, the penalty on complexity is greater when it appears as if
the components are similar and less when they are different.
We note that although the CLC utilizes the entropy as the only penalty factor, the AWE and ICL use both entropy and dimensionality penalties, which causes the behavior of these criteria to be closer to those of the AIC or BIC. Empirical evidence from simulation studies in Section 6.11 of McLachlan and Peel (2000) shows that the BIC and ICL tend to perform better than the other discussed methods, in the setting of estimating GMMs. This evidence is supported by the performance results in Grun and Leisch (2007), which show that the BIC and ICL perform similarly in the regression setting, and tend to outperform the AIC.
Apart from IC methods, there is also strong research in the use of hypothesis testing for the estimation of the
number of components in an FMM. Such methods generally utilize the sequential hypotheses H0 : g = g0
versus H1 : g = g1, for some g0 ≥ 1 and g1 > g0, by using the likelihood ratio statistic

T(y1, ..., yn) = −2 log[ Ln(θ̂n,g0) / Ln(θ̂n,g1) ] = 2 log Ln(θ̂n,g1) − 2 log Ln(θ̂n,g0).
Under usual regularity conditions (i.e. the assumptions of Theorems 1.9 or 1.21), it is known that under H0, T has an asymptotic χ² distribution. However, due to the nature of the tests described, H0 generally
corresponds to a nonidentifiable subset of the parameter space or a boundary point of the parameter space. As
such, the null distribution of T is either unknown or difficult to establish; see Andrews (1999) and Andrews
(2001) for details regarding boundary problems, and see Section 6.5 of McLachlan and Peel (2000) for discussions
regarding hypothesis testing in FMMs.
Even when hypothesis tests are possible, the test statistics tend to be heavily modified from T , the null distributions tend to be difficult to specify or analyse, and the tests are only valid in specific situations. See for
example Chen (1998), Chen and Chen (2001), Chen et al. (2001), Chen and Li (2009), Cho and Han (2009),
Ghosh and Sen (1985), and Zhu and Zhang (2004).
In McLachlan (1987), a bootstrap methodology is suggested for the approximation of the distribution of T , and
some justification of the methodology is provided in Feng and McCulloch (1996). Section 6.7 of McLachlan
and Peel (2000) shows that the bootstrapped p-values for T tend to be reliable in simulations. Besides the IC
and hypothesis testing methods, regularized likelihoods have also become increasingly popular for component
selection in FMMs, see for instance Bhattacharya and McNicholas (2014), Chen and Khalili (2009), and Pan
and Shen (2007).
Chapter 2
Mixtures of Spatial Spline Regressions for
Clustering and Classification
Citation
Nguyen, H. D., McLachlan, G. J., and Wood, I. A. (2015). Mixtures of spatial spline regressions for clustering
and classification. Computational Statistics and Data Analysis, In press.
Abstract
Classification and clustering of functional data arise in many areas of modern research. Currently, the literature
on techniques for performing such tasks has been concentrated on applications to univariate functions. In this
article, we extend the literature into the domain of classifying and clustering bivariate functions (i.e. surfaces)
over rectangular domains. This is achieved by combining the current techniques in spatial spline regression (SSR)
with finite mixture models and mixed-effects models. As a result, we have developed three novel techniques:
spatial spline mixed models (SSMM) for fitting populations of surfaces, mixtures of SSR (MSSR) for clustering
surfaces, and MSSR discriminant analysis (MSSRDA) for classification of surfaces.
Through simulations and applications to problems in handwritten character recognition, we show that SSMM,
MSSR, and MSSRDA are effective in performing their desired tasks. We also show that in the context of
handwritten character recognition, MSSR and MSSRDA are comparable to established methods, and are able
to outperform competing approaches in missing-data situations.
2.1 Introduction
In recent years, functional data analysis (FDA) (Ramsay and Silverman, 1997) has become a popular tool for
statistical analysis and pattern recognition. The popularity of FDA techniques has largely come from their
ability to drastically reduce the dimensionality of curve-type data. Because of this fact, FDA has lent itself to
numerous modern applications in the areas of biology, economics, medicine, machine learning, and sociology.
Examples of such applications can be found in Ramsay and Silverman (2002).
Of particular interest in this article are the applications of FDA to problems of clustering and classifying
functional data. Here, to dispel confusion, we define classification as the learning of decision rules based on
labeled data from different populations, and we define clustering as the partitioning of a single population into
separate subpopulations.
There is currently a large literature on techniques for clustering and classifying functional data. Interesting
developments in the area include B-spline regression for classification by linear discriminant analysis (James and Hastie, 2001) and for clustering by mixtures of linear mixed models (MLMMs) (James and Sugar, 2003),
Fourier basis regression for clustering by MLMMs (Ng et al., 2006), piecewise polynomial regression for clustering
(Chamroukhi et al., 2010) and for classification (Chamroukhi et al., 2013), Gaussian process regression for
classification by principal component analysis (Hall et al., 2001) and by centroid-based methods (Delaigle and
Hall, 2012), support vector machines (SVMs) for classification (Rossi and Villa, 2006), and nonparametric
density estimation for clustering (Boulle, 2012).
Upon review of the literature, it is evident that the concentration in the area is towards the analysis of univariate
functions. Although not as prolific, the analysis of surfaces generated from bivariate functions is also a well-developed area. Some important works in this direction are: Kriging (Matheron, 1963), thin-plate splines
(Duchon, 1977), multivariate adaptive regression splines (Friedman, 1991), soap film smoothing (Wood et al.,
2008), and spatial spline regression (SSR) (Malfait and Ramsay, 2003; Ramsay et al., 2011; Sangalli et al., 2013).
The developments in bivariate FDA appear focused on the aspects of smoothing and estimating data from a
single surface population rather than performing inference regarding data from a population of functions.
In this article we aim to bridge the gap by extending the functional clustering technique of James and Sugar
(2003) and the classification technique of James and Hastie (2001) to the case of bivariate functional data. This
is done through a novel application of bivariate spline functions from Malfait and Ramsay (2003) to the problem
of probability density estimation for populations of bivariate functions.
The approach that we developed uses a linear mixed-effects model (LMM) framework for modeling the distributions of bivariate functions. When finite mixtures of such distributions are constructed, the approach naturally
leads to an MLMM clustering technique, similar to those of James and Sugar (2003), Celeux et al. (2005) and
Ng et al. (2006). We refer to the technique described as mixtures of spatial spline regressions (MSSR).
In the cases where we have multiple populations of surfaces, we can apply the MSSR method to model each
population. These population distributions can then be used to produce discrimination rules of the mixture
discriminant analysis (MDA) type (Hastie and Tibshirani, 1996). We name this approach: mixtures of spatial
spline regressions discriminant analysis (MSSRDA), and apply both it and MSSR to the problems of clustering
and classifying handwritten characters.
We shall proceed as follows. A brief overview of the SSR approach will be given in Section 2.2. We describe the
spatial spline mixed model framework (SSMM) which extends SSR to model populations of surfaces in Section
2.3. The MSSR extension of the SSMM framework is presented in Section 2.4, and its application to clustering
and classification is described in Section 2.5. Simulated examples of SSMM, MSSR, and MSSRDA are given in
Section 2.6, and an application to problems in handwritten character recognition is presented in Section 2.7.
Finally, conclusions are drawn in Section 2.8.
2.2 Spatial Spline Regressions
Spatial spline regression (SSR) is a technique first introduced in Malfait and Ramsay (2003) and later elaborated
upon in Ramsay et al. (2011) and Sangalli et al. (2013). In this article, we concentrate on the application of
SSR in a rectangular domain as applied in the original article. The method is outlined as follows.
Let Y1, ..., Ym be an i.i.d. random sample with realizations y1, ..., ym such that, given the coordinates xk = (x1k, x2k)^T, we have the relationship Yk = μ(xk) + Ek, where Ek ~ N(0, σ²) and k = 1, ..., m. Here, the superscript T indicates matrix transposition, and μ is an unknown function which maps from R = [x1^−, x1^+] × [x2^−, x2^+] to R.
If µ has a known parametric form, then techniques from nonlinear regression may be used to estimate the
function from data (see Bates and Watts (1988) for details). However, we are only concerned with the case
where µ is unknown in this article.
2.2.1 Nodal Basis Functions
In the univariate literature, a popular approximation for µ in a bounded domain is through the use of B-splines
(de Boor, 1978). Specific details of such use can be found in Ramsay and Silverman (1997).
The idea of B-splines can be extended to the domain of surface approximation through applications of nodal
basis functions (NBFs) from the finite elements literature (see for example Braess (2001)). In this article we
use the linear “tent shaped” NBFs of Malfait and Ramsay (2003), which are sufficient for the tasks of clustering
and classification. Higher order spatial splines for applications where smoothness is important are discussed in
Sangalli et al. (2013).
The linear NBF is a function s with parameters c = (c1, c2)^T (center), δ1 (horizontal shape parameter), and δ2 (vertical shape parameter). For completeness, the exact form of the linear NBF is

s(x; c, δ1, δ2) =

−x2/δ2 + (c2 + δ2)/δ2,   if x ∈ {(x1, x2) : c1 < x1 ≤ c1 + δ1, (δ2/δ1)x1 + (δ1c2 − δ2c1)/δ1 ≤ x2 ≤ c2 + δ2},

−x1/δ1 + (c1 + δ1)/δ1,   if x ∈ {(x1, x2) : c1 < x1 ≤ c1 + δ1, c2 ≤ x2 < (δ2/δ1)x1 + (δ1c2 − δ2c1)/δ1},

−x1/δ1 + x2/δ2 + (δ1δ2 + δ2c1 − δ1c2)/(δ1δ2),   if x ∈ {(x1, x2) : c1 ≤ x1 ≤ c1 + δ1, (δ2/δ1)x1 + (δ1c2 − δ2c1 − δ1δ2)/δ1 ≤ x2 < c2},

x2/δ2 + (δ2 − c2)/δ2,   if x ∈ {(x1, x2) : c1 − δ1 ≤ x1 < c1, c2 − δ2 ≤ x2 ≤ (δ2/δ1)x1 + (δ1c2 − δ2c1)/δ1},

x1/δ1 + (δ1 − c1)/δ1,   if x ∈ {(x1, x2) : c1 − δ1 ≤ x1 < c1, (δ2/δ1)x1 + (δ1c2 − δ2c1)/δ1 < x2 ≤ c2},

x1/δ1 − x2/δ2 + (δ1δ2 + δ1c2 − δ2c1)/(δ1δ2),   if x ∈ {(x1, x2) : c1 − δ1 ≤ x1 ≤ c1, c2 < x2 ≤ (δ2/δ1)x1 + (δ1c2 + δ1δ2 − δ2c1)/δ1},

0,   otherwise,   (2.1)

and an example of one appears in Figure 2.1.
57
1.0
0.8
0.6
1.0
0.5
0.4
0.8
0.0
x2
0.6
µ
0.4
−0.5
0.2
0.2
−1.0
0.0
−1.0
−0.5
0.0
0.5
1.0
0.0
x1
Figure 2.1: Nodal basis function s (x; c, δ1 , δ2 ), where c = (0, 0), δ1 = 1 and δ2 = 1.
58
Like B-splines, the user needs to determine the number of basis functions required to approximate μ. As mentioned earlier, we only consider the case where μ is defined over a rectangular domain R. As such, a regular grid of d1 rows and d2 columns of NBFs can be used. This results in a representation with d = d1 × d2 bases. To clarify, a basis function s(x; cl, δ1, δ2) (l = 1, ..., d) is centered on each of the points c1, ..., cd on a d1 × d2 grid, where cl is labeled from the bottom-left to the top-right corner of R in increments of δ1 = (d1 − 1)^{−1}(x1^+ − x1^−) horizontally and δ2 = (d2 − 1)^{−1}(x2^+ − x2^−) vertically. For example, c1 = (x1^−, x2^−), c2 = (x1^− + δ1, x2^−), and so on.
2.2.2 Surface Approximation
The grid of NBFs described above leads to an approximation scheme for μ of the form μ(x) = Σ_{l=1}^d βl s(x; cl), where β = (β1, ..., βd)^T is an unknown coefficient vector and s(x; cl) is a linear NBF with the shape parameters suppressed. Upon inputting the data, the coefficients of the linear relationships Yk = Σ_{l=1}^d βl s(xk; cl) + Ek can be fitted by least squares. This is exactly the method employed in Malfait and Ramsay (2003).
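The least-squares fit just described amounts to regressing the observations on the NBF design matrix. The following Python sketch illustrates this on simulated data; it reuses the nbf function sketched in Section 2.2.1, and the grid size, domain, and test surface (the first function of (2.14)) are choices made purely for illustration.

```python
import numpy as np

def nbf(x1, x2, c1, c2, d1, d2):
    # Compact form of the linear NBF (2.1); see the sketch in Section 2.2.1.
    u, v = (np.asarray(x1) - c1) / d1, (np.asarray(x2) - c2) / d2
    return np.maximum(np.minimum.reduce([1 - u, 1 + u, 1 - v, 1 + v, 1 - u + v, 1 + u - v]), 0.0)

def nbf_design_matrix(coords, centres, d1, d2):
    """Design matrix with (k, l)th entry s(x_k; c_l)."""
    return np.column_stack([nbf(coords[:, 0], coords[:, 1], c1, c2, d1, d2)
                            for (c1, c2) in centres])

# Regular 5 x 5 grid of centres on R = [-1, 1] x [-1, 1].
grid = np.linspace(-1.0, 1.0, 5)
centres = [(a, b) for b in grid for a in grid]
delta = grid[1] - grid[0]

rng = np.random.default_rng(3)
coords = rng.uniform(-1.0, 1.0, size=(200, 2))
y = 1.0 - 0.5 * (coords[:, 0] ** 2 + coords[:, 1] ** 2) + rng.normal(0, 0.25, 200)

S = nbf_design_matrix(coords, centres, delta, delta)
beta_hat, *_ = np.linalg.lstsq(S, y, rcond=None)   # least-squares SSR coefficients
mu_hat = S @ beta_hat                              # fitted surface at the observed coordinates
```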
2.3 Spatial Spline Mixed Models
T
Let us now consider the case where we have independent random samples Yj = Yj1 , ..., Yjmj
of mj i.i.d.
T
observations with realization yj = yj1 , ..., yjmj
such that Yjk = µj (xjk ) + Ejk for each j = 1, ..., n and
T
k = 1, ..., mj . Here Ejk ∼ N 0, σ 2 , xjk = (x1jk , x2jk ) is the coordinate of Yjk , and µj is a function mapping
R to R such that E (µj (x)) = µ (x) for all x ∈ R, for some mean function µ. We can view each µj as random
functions with mean µ and Yj as random samples from the respective functions.
Upon letting µ and µj be unknown, we can fulfill the equation E (µj (x)) = µ (x) by approximating the functions
with the SSRs
µ (x) =
d
X
βl s (x; cl ) , and µj (x) =
l=1
d
X
(βl + Bjl ) s (x; cl ) ,
(2.2)
l=1
respectively, where Bjl ∼ N 0, ξ 2 . Because Bjl can be viewed as a random effect, we appropriately name this
approach: spatial spline mixed models (SSMMs).
2.3.1 Fitting SSMMs
For each sample Yj, the SSMM approximation yields the LMM Yj = Sj(β + Bj) + Ej, where

Sj = [ s(xj1; c1)    ···   s(xj1; cd)
         ⋮            ⋱        ⋮
       s(xjmj; c1)   ···   s(xjmj; cd) ],   (2.3)

Ej = (Ej1, ..., Ejmj)^T, and Bj = (Bj1, ..., Bjd)^T.
From the LMM literature (see McCulloch and Searle (2001) for details), it is known that Yj has density

f(yj; Ψ, Sj) = φ_{mj}(yj; Sj β, ξ² Sj Sj^T + σ² Imj),

where φm(y; μ, Σ) is an m-variate normal density function with mean μ and covariance Σ, Im is an identity matrix of the subscripted dimension, and Ψ = (β^T, σ², ξ²)^T is a vector of model parameters.
The likelihood function based on the samples y1 , ..., yn is
L(Ψ; y1, ..., yn, S1, ..., Sn) = Π_{j=1}^n φ_{mj}(yj; Sj β, ξ² Sj Sj^T + σ² Imj),   (2.4)
which we will write as L (Ψ), for brevity.
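The marginal density above is straightforward to evaluate directly; the following Python sketch computes log L(Ψ) for a set of samples, given candidate parameter values. It is illustrative only and is not the EM fitting routine referenced below; the function name is ours.

```python
import numpy as np
from scipy.stats import multivariate_normal

def ssmm_loglik(y_list, S_list, beta, sigma_sq, xi_sq):
    """Log of the SSMM likelihood (2.4): each sample yj is mj-variate Gaussian
    with mean Sj beta and covariance xi^2 Sj Sj^T + sigma^2 I."""
    total = 0.0
    for yj, Sj in zip(y_list, S_list):
        mj = Sj.shape[0]
        cov = xi_sq * Sj @ Sj.T + sigma_sq * np.eye(mj)
        total += multivariate_normal.logpdf(yj, mean=Sj @ beta, cov=cov)
    return total
```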
Because of the matrices Sj occurring in both the mean and the covariance of the densities, it is not possible to
give in closed form the parameters that maximize (2.4). As such, the maximum likelihood estimates (MLEs)
must be obtained by means of a numerical method.
In this article, we choose to use the expectation–maximization (EM) algorithm of Dempster et al. (1977). We
note here that the EM algorithm for the SSMM is the same as the g = 1 case of the algorithm for MSSR which
can be found in the appendix.
Upon convergence of the EM algorithm, we obtain the MLE Ψ̂ and the estimated posterior expectations b̂j = EΨ̂(Bj | yj, Sj) (see (2.15) in the appendix). These estimates can be used in (2.2) to yield the estimated mean and sample-specific approximation equations,

μ̂(x) = Σ_{l=1}^d β̂l s(x; cl)   and   μ̂j(x) = Σ_{l=1}^d (β̂l + b̂jl) s(x; cl),   (2.5)

respectively.
2.4 Mixture of Spatial Spline Regressions
We now consider the case where the random samples Y1, ..., Yn arise from g populations with mean functions μ(i), for i = 1, ..., g, rather than a single population with mean function μ. If we do not assume knowledge of the population of each sample, we can suppose Yj to have arisen from one of g possible functions μ(i)j(x) with some probability πi, where E(μ(i)j(x)) = μ(i)(x) for all x ∈ R, Σ_{i=1}^g πi = 1, and πi > 0. The situation described can be viewed as a finite mixture model (see McLachlan and Peel (2000) for details), and thus we term our methodology for approximating such functions via SSRs: mixtures of spatial spline regressions (MSSR).
2.4.1 Fitting MSSRs
Assuming the model described above, we can approximate μ(i) and μ(i)j with the SSMM-type equations

μ(i)(x) = Σ_{l=1}^d βil s(x; cl)   and   μ(i)j(x) = Σ_{l=1}^d (βil + Bijl) s(x; cl),   (2.6)

respectively, where βi = (βi1, ..., βid)^T is a vector of coefficients and Bij = (Bij1, ..., Bijd)^T ~ N(0d, ξi² Id). Here 0d is a zero vector of the subscripted dimension.

If we assume the realizations are given as Yjk = μ(i)j(xjk) + Eijk, conditional on Yj belonging to population i, then we can write the density function for Yj as

f(yj; Ψ, Sj) = Σ_{i=1}^g πi φ_{mj}(yj; Sj βi, ξi² Sj Sj^T + σ² Imj),   (2.7)

where Ψ = (β1^T, ..., βg^T, σ², ξ1², ..., ξg², π1, ..., πg)^T and Eijk ~ N(0, σ²).
Since (2.7) is a finite mixture density, the corresponding likelihood
L(Ψ) = Π_{j=1}^n Σ_{i=1}^g πi φ_{mj}(yj; Sj βi, ξi² Sj Sj^T + σ² Imj),   (2.8)

does not have closed-form MLEs. As such, we once again utilize the EM algorithm to obtain the MLEs (Ψ̂) of (2.8) (see Appendix).
With the MLEs obtained from the EM algorithm, we can estimate the conditional expectations
b̂ij = EΨ̂ (Bij |yj , Sj , Zij = 1) ,
where Zij is a random variable which equals 1 if Yj arose from population i, and 0 otherwise; see (2.15) in the
appendix. These results can be substituted into (2.6) to yield the estimated approximation equations
μ̂(i)(x) = Σ_{l=1}^d β̂il s(x; cl)   and   μ̂(i)j(x) = Σ_{l=1}^d (β̂il + b̂ijl) s(x; cl).   (2.9)
Additionally, the posterior probabilities PΨ̂ (Zij = 1|yj , Sj ) can be estimated via the expression
τ̂ij = π̂i φ_{mj}(yj; Sj β̂i, ξ̂i² Sj Sj^T + σ̂² Imj) / Σ_{i′=1}^g π̂i′ φ_{mj}(yj; Sj β̂i′, ξ̂i′² Sj Sj^T + σ̂² Imj).   (2.10)
This expression is of importance in Section 2.5.
2.4.2 Choosing g
In the discussions above, it is assumed that g is known. It is not possible to determine an unknown g within the
likelihood maximizing procedure of the appendix. In this paper, we suggest the Bayesian information criterion
(BIC) of Schwarz (1978) for choosing g. Under the BIC, the estimation method in Section 2.4.1 is performed
to obtain the MLEs Ψ̂γ for various population numbers γ = 1, 2, .... We then choose
g = arg min_{γ=1,2,...} [ −2 log L(Ψ̂γ) + γ(d + 2) log(n) ],   (2.11)

where L(Ψ̂γ) is as in (2.8).

2.5 Clustering and Classification

2.5.1 Clustering with MSSR

Like all mixture models, clustering arises naturally from the definition of MSSR by using the posterior probabilities (2.10). That is, if we view the g subpopulations of the MSSR as clusters, then we can apply the plug-in Bayes' rule by allocating the realized sample yj into cluster i if

i = arg max_{i′=1,...,g} τ̂i′j.   (2.12)
See Page 14 of McLachlan (1992) regarding the use of plug-in rules.
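For illustration, the posterior probabilities (2.10) and the allocation rule (2.12) can be computed as follows once parameter estimates are available (e.g. from the EM algorithm in the appendix). This Python sketch is ours and assumes the fitted quantities are supplied as plain arrays.

```python
import numpy as np
from scipy.stats import multivariate_normal

def mssr_posteriors(y_list, S_list, pis, betas, xi_sqs, sigma_sq):
    """Posterior probabilities tau_ij of (2.10) and the plug-in cluster
    allocations of (2.12), given fitted MSSR parameter estimates."""
    tau = np.zeros((len(y_list), len(pis)))
    for j, (yj, Sj) in enumerate(zip(y_list, S_list)):
        mj = Sj.shape[0]
        for i, (pi_i, beta_i, xi_sq_i) in enumerate(zip(pis, betas, xi_sqs)):
            cov = xi_sq_i * Sj @ Sj.T + sigma_sq * np.eye(mj)
            tau[j, i] = pi_i * multivariate_normal.pdf(yj, mean=Sj @ beta_i, cov=cov)
        tau[j] /= tau[j].sum()
    clusters = tau.argmax(axis=1)   # rule (2.12), with clusters labeled 0, ..., g-1
    return tau, clusters
```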
2.5.2
Classification by MSSRDA
Suppose now that we have $q$ independent samples $y_1^{(h)}, ..., y_{n_h}^{(h)}$ observed at coordinates $x_1^{(h)}, ..., x_{n_h}^{(h)}$ for population $h = 1, ..., q$. Here $q$ is the number of distinct populations, with potentially more than one subpopulation in each. If $y^* = (y_1^*, ..., y_{m^*}^*)^T$ is a realized sample with corresponding coordinates $x^*$ arising from one of the $q$ populations, then the plug-in Bayes' rule for classification is to allocate $y^*$ to population $h^*$ if
\[
h^* = \underset{h = 1, ..., q}{\arg\max}\; \frac{\hat{\nu}_{h}\, f\!\left(y^*; \hat{\Psi}^{(h)}, S^*\right)}{\sum_{h'=1}^{q} \hat{\nu}_{h'}\, f\!\left(y^*; \hat{\Psi}^{(h')}, S^*\right)}, \qquad (2.13)
\]
where $\hat{\nu}_h = n_h / \sum_{h'=1}^{q} n_{h'}$, $S^*$ is the basis matrix (2.3) of $x^*$, and $\hat{\Psi}^{(h)}$ is the MLE of the $m^*$-variate MSSR density $f(y^*; \Psi^{(h)}, S^*)$ (see (2.7)) estimated using $y_1^{(h)}, ..., y_{n_h}^{(h)}$. We name this method MSSR discriminant analysis (MSSRDA) because of its similarities to the MDA technique of Hastie and Tibshirani (1996).
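A corresponding sketch of the MSSRDA allocation rule (2.13) follows; mssr_density is a hypothetical function returning the fitted MSSR mixture density (2.7) for a given population, and the denominator of (2.13) is omitted since it is common to all populations.

```python
import numpy as np

def mssrda_classify(y_star, S_star, nu_hat, mssr_density):
    """Sketch of the plug-in Bayes classification rule (2.13).

    nu_hat is the length-q array of estimated prior proportions n_h / sum(n_h'),
    and mssr_density(y, S, h) is a hypothetical function evaluating the fitted
    MSSR density f(y; Psi_hat^(h), S) of (2.7) for population h.
    """
    q = len(nu_hat)
    scores = [nu_hat[h] * mssr_density(y_star, S_star, h) for h in range(q)]
    return int(np.argmax(scores))
```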
2.6
Simulated Examples
We now demonstrate the methods of Sections 2.3 to 2.5 via some simulations. In each study, we will make reference to the three surfaces
\[
\mu_{(1)}(x) = 1 - \frac{x_1^2 + x_2^2}{2}, \quad \mu_{(2)}(x) = x_1 + x_2, \quad \text{and} \quad \mu_{(3)}(x) = x_1^2 - 1. \qquad (2.14)
\]
Figure 2.2: The top row displays surface plots of the functions from (2.14) over R = [−1, 1] × [−1, 1] (in the order i = 1, 2, 3 from left to right). The bottom row displays respective example SSMM fitted mean surfaces for each i = 1, 2, 3, under scenario S2.
2.6.1
Surface Fitting
To assess the performance of SSMM, we simulated 100 random surfaces for each i = 1, ..., 3, by computing $\mu_{(i)j}(x) = \mu_{(i)}(x) + b_{ij}$, where $B_{ij} \sim N(0, 0.25^2)$ and j = 1, ..., 100. We then simulated m observations from each $\mu_{(i)j}(x)$ by computing $y_{jk} = \mu_{(i)j}(x_{jk}) + e_{ijk}$, where $E_{ijk} \sim N(0, 0.25^2)$ and $x_{jk}$ is distributed uniformly over the domain R = [−1, 1] × [−1, 1]. A d = 5 × 5 nodal SSMM over R is then fitted for each i = 1, 2, 3 to estimate the functions $\mu_{(i)}(x)$ and $\mu_{(i)j}(x_{jk})$. The simulations are conducted under two scenarios, S1 and S2, where m = 50 and 100, respectively.
To evaluate the performance of SSMM, we computed 51 × 51 point approximations to the integrated squared error (ISE) $\int_R \{\mu_{(i)}(x) - \hat{\mu}_{(i)}(x)\}^2 dx$ and the averaged individual ISE (AI-ISE) $100^{-1} \sum_{j=1}^{100} \int_R \{\mu_{(i)j}(x) - \hat{\mu}_{(i)j}(x)\}^2 dx$ for each i. We note that lower ISE and AI-ISE imply better model fits, and report the average results over ten runs in Table 2.1. As expected, increasing the number of observations (m) for each i results in lower ISEs and AI-ISEs. Example fitted mean functions for scenario S2 are given in Figure 2.2.
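The ISE computations above can be approximated numerically as in the following sketch, which evaluates the squared error of a fitted surface on a 51 × 51 grid over R and scales by the area of R; the callables mu and mu_hat are assumed to be vectorized and are placeholders for a true surface from (2.14) and its SSMM estimate.

```python
import numpy as np

def approximate_ise(mu, mu_hat, lower=-1.0, upper=1.0, n_points=51):
    """Grid approximation to the ISE of Section 2.6.1 over R = [lower, upper]^2."""
    grid = np.linspace(lower, upper, n_points)
    x1, x2 = np.meshgrid(grid, grid)
    sq_err = (mu(x1, x2) - mu_hat(x1, x2)) ** 2
    return sq_err.mean() * (upper - lower) ** 2  # mean squared error times area of R

# Example with the first surface of (2.14) and a hypothetical estimate mu_hat_1:
# ise_1 = approximate_ise(lambda x1, x2: 1 - (x1**2 + x2**2) / 2, mu_hat_1)
```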
2.6.2
Clustering
Next, we illustrate the use of MSSR on the problem of clustering. To do this, we simulated 100 random surfaces
from each function in (2.14). This simulation was conducted under S1 and S2, as per Section 2.6.1. For each
scenario, we combined the 300 surfaces into a single set of samples.
Without using knowledge of the population from which each observation arose, we clustered the samples into
Table 2.1: Average ISE and AI-ISE performances (over ten runs) for SSMM estimations.

                        µ(1)(x)            µ(2)(x)            µ(3)(x)
Scenario   m      ISE      AI-ISE     ISE      AI-ISE     ISE      AI-ISE
S1         50     0.031    0.281      0.018    0.284      0.020    0.272
S2         100    0.025    0.179      0.013    0.179      0.014    0.182
g = 3 classes using MSSR. We then tested the quality of clustering by means of the adjusted Rand index (ARI)
of Hubert and Arabie (1985) which has the range of −1 to 1. Here, an ARI of 1 indicates perfect concordance
between the clustering outcome and the generating function for each sample, and vice versa for −1. The process
was repeated ten times for each simulation scenario, and the average ARIs found were 0.696 and 0.699 for S1
and S2, respectively. The ARI values were similarly high for both simulations, indicating that the method has
produced an effective clustering in each case.
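For reference, the ARI used above is available in standard software; the toy example below (using scikit-learn, one possible implementation) illustrates that the index is invariant to a relabeling of the clusters.

```python
from sklearn.metrics import adjusted_rand_score

# A perfect clustering up to a permutation of the labels attains an ARI of 1.
true_labels    = [1, 1, 2, 2, 3, 3]   # generating surface for each sample
cluster_labels = [2, 2, 3, 3, 1, 1]   # cluster allocations via rule (2.12)
print(adjusted_rand_score(true_labels, cluster_labels))  # 1.0
```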
2.6.3
Classification
Lastly, we demonstrate the performance of MSSRDA in a classification problem. We firstly simulated 100
random samples from each surface of (2.14) twice. We called the first 300 samples the training set, and the
second 300 samples the test set. Once again, this simulation was conducted under S1 and S2, as per Section
2.6.1.
Using knowledge of the population from which each observation in the training set arose, we fitted a q = 3
population MSSRDA rule with just g = 1 subpopulation each. We then applied the MSSRDA rule to classify the 300 samples in the test set without using knowledge of the generating functions. The estimated error rates ($\widehat{\mathrm{err}}$) are then computed, where $\widehat{\mathrm{err}}$ is the number of incorrect classifications divided by the total number of classifications. The average $\widehat{\mathrm{err}}$ results over ten simulation runs are 0.061 and 0.059 for S1 and S2, respectively. The error rates were similarly low for both scenarios, suggesting a successful implementation of MSSRDA for the problem of classification in each case.
2.7
Handwritten Character Recognition
Given the successful simulation studies, we now apply the methods of Sections 2.3 to 2.5 to the real data
problems of clustering and classifying handwritten characters data. For this application, we use the ZIP code
data of Hastie et al. (2009); a subset of the MNIST handwritten digits data used in Le Cun et al. (1990). The
ZIP code data consist of 9298 images of Hindu-Arabic numbers, split into two subsets: A and B. The frequencies
of the ten classes of digits appearing in the two subsets are presented in Table 2.2.
Each observation in the data set is a normalized and deslanted 16 by 16 pixel greyscale image with intensities
between −1 and 1. Here, −1 is perfectly white, and 1 is perfectly black. We can represent each image as a sample
Table 2.2: Digit frequencies in ZIP code data sets.

Set    0      1      2     3     4     5     6     7     8     9     Total
A      1194   1005   731   658   652   556   664   645   542   644   7291
B      359    264    198   166   200   160   170   147   166   177   2007
Table 2.3: Average ARIs over 10 repetitions for clusters of data set B. Bold entries indicate the best number of clusters at each missingness level.

                              MSSR                                    k-means
Missingness (%)   g = 8    9       10      11      12        8       9       10      11      12
0                 0.376    0.352   0.370   0.390   0.400     0.443   0.45    0.464   0.467   0.455
50                0.387    0.408   0.404   0.413   0.406     0.433   0.424   0.449   0.434   0.447
75                0.381    0.392   0.387   0.386   0.401     0.369   0.393   0.386   0.378   0.371
90                0.235    0.241   0.249   0.286   0.278     0.223   0.224   0.216   0.222   0.217
95                0.143    0.152   0.158   0.154   0.163     0.099   0.104   0.084   0.080   0.091
of observed intensities y1 , ..., y256 with corresponding spatial coordinates xk ∈ R, where R = [1, 16] × [1, 16] and
k = 1, ..., 256. With this representation, we can apply the methods from Sections 2.4 and 2.5, using d = 8 × 8
NBFs.
2.7.1
Clustering
Here, we apply MSSR to the problem of clustering on data set B. We do so by fitting a g = 8, ..., 12 population
MSSR model and applying rule (2.12). As in Section 2.6.2, the performances of the clusters are assessed by
means of evaluating the ARIs using the known digit classes as references.
In addition to data set B, we sought to increase the difficulty of the problem by introducing missingness to the
data. We did this by removing 50, 75, 90, and 95% of the pixels at random from each image (rounded up).
MSSR was then applied, and the ARI for each case described above was evaluated.
For comparison, we also clustered the data in set B by k-means (MacQueen, 1967) as applied through the stats
package in the R programming environment (R Development Core Team, 2013). For the missing-data cases,
k-means could only be applied after data imputation. For this purpose, we used the singular-value thresholding
(SVT) algorithm of Cai et al. (2010) as applied through the imputation package (Wong, 2011) in R. The average
ARIs over ten repetitions for the MSSR and k-means clusters are presented in Table 2.3.
We observe in the results that MSSR performs better than k-means at high levels of missingness (75 to 95%),
however, it is less able to produce the correct clusters at the 0 and 50% levels. Considering that MSSR can
be applied without the aid of data imputation techniques, this result shows that MSSR is a viable means of
Figure 2.3: Cluster mean surfaces for data set B (without missing data) estimated by MSSR with g = 12.
clustering sparsely sampled surface data. Additionally, we note that for each g, MSSR only requires g (d + 2) =
66g parameters, whereas k-means requires 256g. In light of this four-fold reduction in representation size, the
differences in ARIs at lower levels of missingness do not appear to be serious. We also observe that MSSR
appears to perform better with larger numbers of clusters than does k-means. This is especially true for the
higher levels of missingness, where using fewer clusters than the number of true classes is better for k-means. Using rule
(2.11), the average optimal number of clusters for missingness levels 0 to 95% are 11.8, 11.8, 11.6, 10.0, and
8.6, respectively. We see that a model with g ≥ 10 is selected on average in all but the 95% case. This is
encouraging since there are ten known classes and multiple subclasses within each class (for example, the class
of zeros can consist of thin ellipses or wide circles). An example clustering using MSSR with g = 12 for data
set B is given in Figure 2.3. The mean surfaces in Figure 2.3 show that MSSR is able to identify the 10 digits
as well as separate subpopulations of zeros.
2.7.2
Classification
We now consider the classification performance of MSSRDA. This was done by estimating an MSSRDA classifier
using data set A. We allowed the MSSR density function (2.7) for each numeral class h = 0, ..., 9 to have an
individual number of subpopulations gh , which is chosen by forward selection under rule (2.11). We then applied
the constructed MSSRDA rule to classify data set B. As in Section 2.6.3, the estimated error rate is used to
assess performance. As in Section 2.7.1, we also created missing-data versions of set A with 50, 75, 90 and 95%
of missing pixels. MSSRDA was then estimated for each of these cases to classify the respective missing-data
versions of set B.
Table 2.4: Estimated error rates for the classification of data set B. Classifiers were trained on set A. Bold entries indicate the best performance at each missingness level.

Missingness (%)   MSSRDA   LDA     NB      1-NN    10-NN   SVM (linear)   SVM (radial)
0                 0.111    0.115   0.262   0.056   0.064   0.070          0.062
50                0.142    0.175   0.626   0.125   0.131   0.171          0.160
75                0.168    0.484   0.801   0.476   0.442   0.497          0.508
90                0.288    0.788   0.911   0.771   0.761   0.785          0.818
95                0.435    0.910   0.920   0.829   0.824   0.845          0.816
Table 2.5: Average number of subpopulations selected at each missingness level.

                                       Numeral (h)
Missingness (%)   0     1     2     3     4     5     6     7     8     9
0                 6.4   9.2   4.6   4.0   7.0   4.6   6.2   5.0   5.8   7.6
50                4.4   3.2   3.0   2.0   2.0   1.4   2.6   2.8   2.4   4.8
75                4.4   1.0   2.4   1.6   1.8   2.4   2.6   2.6   3.2   2.4
90                4.8   1.0   2.2   1.4   2.8   2.8   2.6   2.4   2.0   2.8
95                4.0   1.0   1.8   1.8   2.2   2.0   2.2   2.4   2.0   2.6
For comparison, we also perform classifications using the linear discriminant analysis (LDA), naive Bayes (NB),
k-nearest neighbors (k-NN), and SVM methods. Here, LDA and k-NN were applied through MASS (Venables
and Ripley, 2002), and NB and SVM were applied through e1071 (Meyer et al., 2012). The k-NN classifier was
applied using k = 1 and 10, and SVM was applied using linear and Gaussian radial basis kernels. Missing-data
imputation by SVT was once again performed to enable the application of these classifiers in the missing-data
cases. The $\widehat{\mathrm{err}}$ values for MSSRDA and the competing methods are presented in Table 2.4. The results for MSSRDA
are averaged over five repetitions. The average number of subpopulations for each numeral (h), and each
percentage of missingness is presented in Table 2.5.
The results of Table 2.4 appear to indicate that MSSRDA performs very well in the presence of missing data. It can be seen that in the 75 to 95% missingness cases, MSSRDA has estimated error rates that are half (or less) of those of the competing methods. The average $\widehat{\mathrm{err}}$ of 0.435 in the 95% missingness case is particularly remarkable given the number of classes in this problem. Additionally, at lower levels of missingness, MSSRDA produces competitive values of $\widehat{\mathrm{err}}$. Upon considering that MSSRDA was applied without imputation in the missing-data situations, we
conclude that MSSRDA is a capable classification method when data can be modeled as random samples from
surfaces. Additionally, the results of Table 2.5 suggest that in most cases, the populations of numerals had
multiple subpopulations each, and thus allowing for subpopulations is a useful feature of MSSRDA. We also see
that lower levels of missingness tended to result in the identification of more subpopulations. This is likely due
to the larger number of observations allowing for more detailed model fits.
2.8
Conclusions
The literature on classification and clustering of functional data has focused on the analysis of univariate
functions. In this article, we present the MSSR and MSSRDA methods to expand the scope of the literature
to bivariate functional data. These two methods can be viewed as two-dimensional extensions of the spline-based methods of James and Hastie (2001) and James and Sugar (2003), using the linear NBFs of Malfait and Ramsay (2003) in place of the usual B-splines. As an intermediate step to constructing the MSSR and MSSRDA methods, we also introduce the SSMM models, which can be viewed as mixed-effects versions of the SSRs of
Ramsay et al. (2011) and Sangalli et al. (2013). It is shown in this article that MSSR and MSSRDA can be
constructed within a mixture model framework and fitted using the EM algorithm. We also establish MSSRDA
as a functional analog to the MDA technique of Hastie and Tibshirani (1996).
In order to demonstrate the performance of our methods, we performed a set of simulation studies in which we
found that the new methods effectively performed the desired tasks. A further study of performance was then
conducted using the ZIP code data of Hastie et al. (2009). Using the ZIP code data, we applied MSSR and
MSSRDA to problems in handwritten character recognition. We found that in both clustering and classification,
MSSR and MSSRDA were comparable to a number of competing methods. We also assessed the performances
of both methods in the case where the handwritten character images were sparsely sampled by introducing
missing data. Both MSSR and MSSRDA were found to surpass the competition in such circumstances. This
result is particularly notable considering that the methods were able to be applied without the need of data
imputation, unlike the competitors. The overall performances of MSSR and MSSRDA are encouraging, and the
methods appear to be particularly useful in sparse-data contexts.
Appendix
The EM algorithm for computing the MLE of an MSSR model with g populations (as in Section 2.4) proceeds
as follows. Initialize the parameter vector Ψ with some starting values Ψ(0) , and denote the parameter vector
iterate on the kth iteration as Ψ(k) .
On the (k + 1)th iteration of the algorithm, the E-step is performed by computing the conditional expectations
\[
\tau_{ij}^{(k+1)} = \frac{\pi_i^{(k)} \phi_{m_j}\!\left(y_j; S_j \beta_i^{(k)},\; \xi_i^{(k)2} S_j S_j^T + \sigma^{(k)2} I_{m_j}\right)}{\sum_{i'=1}^{g} \pi_{i'}^{(k)} \phi_{m_j}\!\left(y_j; S_j \beta_{i'}^{(k)},\; \xi_{i'}^{(k)2} S_j S_j^T + \sigma^{(k)2} I_{m_j}\right)},
\]
\[
b_{ij}^{(k+1)} = \xi_i^{(k)2} S_j^T \left(\xi_i^{(k)2} S_j S_j^T + \sigma^{(k)2} I_{m_j}\right)^{-1} \left(y_j - S_j \beta_i^{(k)}\right),
\]
\[
\chi_{ij}^{(k+1)} = \mathrm{tr}\left\{\xi_i^{(k)2} \left[I_d - \xi_i^{(k)2} S_j^T \left(\xi_i^{(k)2} S_j S_j^T + \sigma^{(k)2} I_{m_j}\right)^{-1} S_j\right]\right\},
\]
and
\[
\lambda_{ij}^{(k+1)} = \mathrm{tr}\left\{\xi_i^{(k)2} S_j \left[I_d - \xi_i^{(k)2} S_j^T \left(\xi_i^{(k)2} S_j S_j^T + \sigma^{(k)2} I_{m_j}\right)^{-1} S_j\right] S_j^T\right\}.
\]
Following the E-step, the M-step is performed by computing the parameter estimates
\[
\pi_i^{(k+1)} = n^{-1} \sum_{j=1}^{n} \tau_{ij}^{(k+1)},
\]
\[
\beta_i^{(k+1)} = \left(\sum_{j=1}^{n} \tau_{ij}^{(k+1)} S_j^T S_j\right)^{-1} \left(\sum_{j=1}^{n} \tau_{ij}^{(k+1)} S_j^T \left(y_j - S_j b_{ij}^{(k+1)}\right)\right),
\]
\[
\xi_i^{(k+1)2} = \frac{\sum_{j=1}^{n} \tau_{ij}^{(k+1)} \left(b_{ij}^{(k+1)T} b_{ij}^{(k+1)} + \chi_{ij}^{(k+1)}\right)}{d \sum_{j=1}^{n} \tau_{ij}^{(k+1)}},
\]
and
\[
\sigma^{(k+1)2} = \frac{\sum_{j=1}^{n} \sum_{i=1}^{g} \tau_{ij}^{(k+1)} \left[\left(y_j - S_j \left(\beta_i^{(k+1)} + b_{ij}^{(k+1)}\right)\right)^T \left(y_j - S_j \left(\beta_i^{(k+1)} + b_{ij}^{(k+1)}\right)\right) + \lambda_{ij}^{(k+1)}\right]}{\sum_{j=1}^{n} m_j}.
\]
The E- and M-steps are alternated until convergence is reached, that is, until
\[
\frac{\left|\log L\!\left(\Psi^{(k)}\right) - \log L\!\left(\Psi^{(k-1)}\right)\right|}{\left|\log L\!\left(\Psi^{(k-1)}\right)\right|} < 10^{-4},
\]
whereupon we define the final iterates as the MLE ($\hat{\Psi}$). The random-effect terms $B_{ij}$ are estimated by the conditional expectations
\[
E_{\hat{\Psi}}\left(B_{ij} \mid y_j, S_j, Z_{ij} = 1\right) = \hat{\xi}_i^2 S_j^T \left(\hat{\xi}_i^2 S_j S_j^T + \hat{\sigma}^2 I_{m_j}\right)^{-1} \left(y_j - S_j \hat{\beta}_i\right). \qquad (2.15)
\]
See McLachlan and Krishnan (2008) for further details on the application of the EM algorithm to mixture models.
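For illustration, a compact Python sketch of the E- and M-steps above is given below; initialization, numerical safeguards, and efficiency considerations are deliberately omitted, so this should be read as a didactic transcription of the updates rather than the implementation used in this chapter.

```python
import numpy as np
from scipy.stats import multivariate_normal

def em_mssr(y_list, S_list, g, n_iter=200, tol=1e-4, seed=0):
    """Minimal sketch of the EM algorithm of the Appendix for a g-population MSSR.

    y_list[j] is the m_j-vector of observations for surface j and S_list[j]
    the corresponding m_j x d basis matrix (2.3)."""
    rng = np.random.default_rng(seed)
    n, d = len(y_list), S_list[0].shape[1]
    pi, xi2, sigma2 = np.full(g, 1.0 / g), np.full(g, 1.0), 1.0
    beta = rng.normal(scale=0.1, size=(g, d))
    loglik_old = -np.inf
    for _ in range(n_iter):
        # E-step: responsibilities tau, conditional means b, and traces chi, lambda
        tau, b = np.zeros((n, g)), np.zeros((n, g, d))
        chi, lam = np.zeros((n, g)), np.zeros((n, g))
        loglik = 0.0
        for j, (y, S) in enumerate(zip(y_list, S_list)):
            m_j, dens = len(y), np.zeros(g)
            for i in range(g):
                V = xi2[i] * S @ S.T + sigma2 * np.eye(m_j)
                dens[i] = pi[i] * multivariate_normal.pdf(y, mean=S @ beta[i], cov=V)
                Vinv = np.linalg.inv(V)
                b[j, i] = xi2[i] * S.T @ Vinv @ (y - S @ beta[i])
                C = xi2[i] * (np.eye(d) - xi2[i] * S.T @ Vinv @ S)  # Cov(B_ij | y_j)
                chi[j, i], lam[j, i] = np.trace(C), np.trace(S @ C @ S.T)
            loglik += np.log(dens.sum())
            tau[j] = dens / dens.sum()
        # M-step: update pi, beta_i, xi_i^2, and sigma^2 as in the Appendix
        pi = tau.mean(axis=0)
        num_sigma = 0.0
        for i in range(g):
            A = sum(tau[j, i] * S_list[j].T @ S_list[j] for j in range(n))
            c = sum(tau[j, i] * S_list[j].T @ (y_list[j] - S_list[j] @ b[j, i]) for j in range(n))
            beta[i] = np.linalg.solve(A, c)
            xi2[i] = (tau[:, i] * (np.einsum('jl,jl->j', b[:, i], b[:, i]) + chi[:, i])).sum() \
                     / (d * tau[:, i].sum())
            for j in range(n):
                r = y_list[j] - S_list[j] @ (beta[i] + b[j, i])
                num_sigma += tau[j, i] * (r @ r + lam[j, i])
        sigma2 = num_sigma / sum(len(y) for y in y_list)
        # relative-change stopping rule, as above
        if abs(loglik - loglik_old) < tol * abs(loglik_old):
            break
        loglik_old = loglik
    return pi, beta, xi2, sigma2, tau
```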
Acknowledgments
We thank the associate editor and the three reviewers for their helpful comments, which led to many improvements in the article.
Chapter 3
False Discovery Rate Control in Magnetic
Resonance Imaging Studies via Markov
Random Fields
Citation
Nguyen, H. D., McLachlan, G. J., Cherbuin, N., and Janke, A. L. (2014). False discovery rate control in magnetic
resonance imaging studies via Markov random fields. IEEE Transactions on Medical Imaging, 33:1735–1748.
Abstract
Magnetic resonance imaging (MRI) is widely used to study population effects of factors on brain morphometry.
Inference from such studies often requires the simultaneous testing of millions of statistical hypotheses. Such
scale of inference is known to lead to large numbers of false positive results. Control of the false discovery
rate (FDR) is commonly employed to mitigate against such outcomes. However, current methodologies in
FDR control only account for the marginal significance of hypotheses, and are not able to explicitly account
for spatial relationships, such as those between MRI voxels. In this article, we present novel methods that
incorporate spatial dependencies into the process of controlling FDR through the use of Markov random fields.
Our method is able to automatically estimate the relationships between spatially dependent hypotheses by
means of maximum pseudo-likelihood estimation and the pseudo-likelihood information criterion. We show
that our methods have desirable statistical properties with regards to FDR control and are able to outperform
noncontextual methods in simulations of dependent hypothesis scenarios. Our method is applied to investigate
the effects of aging on brain morphometry using data from the PATH study. Evidence of whole brain and
component level effects that correspond to similar findings in the literature is found in our investigation.
3.1
Introduction
Magnetic resonance imaging (MRI) is often used by neurological researchers to assess the population effects of
factors on brain morphometry. Examples of such studies include investigations regarding the effects of aging
(e.g. Coffey et al. (1999); Sowell et al. (2003); Smith et al. (2007); Tisserand et al. (2002); Pfefferbaum et al.
(1992, 1997)), alcoholism (e.g. Pfefferbaum et al. (1992, 1997)), education (e.g. Coffey et al. (1999)), and gender
(e.g. Smith et al. (2007)) on MRI-measurable morphological features.
In such studies, a test of hypothesis may be carried at each individual voxel of the MRI scans by means of
regression or similar statistical procedures. Aggregating across an entire brain, the number of such simultaneous
hypothesis tests can number in the millions.
Such a large scale of simultaneous hypothesis testing is known to result in large numbers of false positive results
(i.e. incorrectly rejected hypothesis tests). For example, if there are no factor effects, one million hypothesis
tests at the usual 5% significance would yield fifty thousand incorrectly rejected hypotheses on average. In the
neuroimaging community, this pitfall was famously alluded to in Bennett et al. (2009a), where brain activity
was found to be significant in a functional MRI study of a dead salmon.
The problem of false positives in simultaneous hypothesis testing has been well studied in both the statistics
literature (e.g. Miller (1981); Dudoit and van der Laan (2008); Efron (2010)) and the neuroimaging literature
(e.g. Turkheimer et al. (2001); Genovese et al. (2002); Nichols and Hayasaka (2003)). Among the solutions
developed, controlling the false discovery rate (FDR) has emerged to be particularly popular and practical.
The principle of FDR control was first introduced in Benjamini and Hochberg (1995), where it was suggested
that researchers should reject only as many hypotheses as to maintain the FDR below a predetermined level,
where FDR = E (N01 /NR |NR > 0) P (NR > 0) with N01 and NR denoting the number of false positives and
the number of rejected hypotheses, respectively. Since the publication of Benjamini and Hochberg (1995),
many techniques have been proposed for the control of FDR (e.g. Benjamini and Yekutieli (2001); Genovese
and Wasserman (2002); Storey (2002); Efron (2004); McLachlan et al. (2006); Sun and Cai (2007); Xie et al.
(2011)).
Although these techniques are able to correctly control the FDR when hypotheses have implicit dependencies,
they are not able to model the explicit nature of such dependencies. As such, the spatial relationships between
hypothesis tests on neighboring voxels cannot be utilized to inform the significance of said voxels.
In the neuroimaging literature, the exploitation of spatial relationships for false positive mitigation is often
conducted under the random field theory (RFT) framework (cf. Friston et al. (1994, 1995); Worsley and Friston
(1995); Worsley et al. (2004)). Some examples of RFT techniques for FDR control include Pacifico et al. (2004)
and Chumbley and Friston (2009).
While widely used, RFT techniques are not without their caveats. These include issues such as test dependency
(e.g. χ2 , Gaussian, F , T , or T 2 distributed test statistics require different field models), the assumption of a
continuous field over a discrete lattice, and requirements of user-specified smoothing parameters (cf. Nichols
and Hayasaka (2003) and Bennett et al. (2009b) for discussions of these issues). As such, RFT techniques can
produce incorrect inferences, or be misapplied.
In order to resolve the shortcomings of RFT techniques, we introduce a novel method that uses a test-independent discrete-lattice model to spatially inform FDR control. Our method utilizes the mixture model-based procedure of McLachlan et al. (2006) to control FDR at the marginal (spatially uninformed and noncontextual) level. Spatial dependencies between marginally controlled hypothesis tests are then modeled by a
Markov random field (MRF); a technique previously used for model-based segmentation of MRIs (e.g. Zhang
et al. (2001); Ng and McLachlan (2004); Ashburner and Friston (2005); Leemput et al. (1999)). As such, we
name our collection of techniques: Markov random field FDR control (MRF-FDR).
Usual problems in the application of MRFs are the determination of spatial relationship magnitudes and the
distance at which voxels can be considered neighbors. We resolve these issues by using a data-driven process
through applications of the pseudo-likelihood (PL) (cf. Besag (1974)) and the PL information criterion (PLIC)
(cf. Ji and Seymour (1996)), respectively.
The MRF-FDR methods that we present here are a continuation of the work from Nguyen et al. (2013). In this
article, we suggest and theoretically justify a refined version of the technique in Nguyen et al. (2013).
In order to demonstrate and assess the performance of MRF-FDR, we perform a set of simulation studies with
the aim of replicating brain MRI-like scenarios, where features of various sizes are statistically significant. We
find that our method is able to outperform some commonly available techniques. We then apply our technique to
study the effect of aging on the morphometry of the brain in an elderly cohort study using data from the PATH
project (cf. Anstey et al. (2012)). Here, we are able to produce inferences consistent with the neuroimaging
literature on aging.
3.2
Noncontextual FDR Control
Suppose that we test a statistical hypothesis at each of the n voxels of an MRI, where the voxels are indexed by i = 1, ..., n. Let $H_i$ be a binary random variable, where $H_i = 0$ if the null hypothesis at voxel i is true, and 1 otherwise, and let $P_i$ be the p-value associated with the hypothesis test at voxel i. If $P_i$ arises from a well-behaved hypothesis test (i.e. $P_i \mid H_i = 0 \overset{\mathrm{i.d.}}{\sim} U(0, 1)$), then the probit transformation of $P_i$ yields a z-score $Z_i = \Phi^{-1}(1 - P_i)$, which has the property $Z_i \mid H_i = 0 \overset{\mathrm{i.d.}}{\sim} N(0, 1)$, under the well-behaved test assumption. Here, i.d. denotes identically distributed.
In Efron (2004), it was suggested that if $0 < \pi_0 < 1$, then $Z_i$ is marginally distributed according to a two-component mixture distribution with density
\[
f(z_i) = \pi_0 f_0(z_i) + (1 - \pi_0) f_1(z_i), \qquad (3.1)
\]
when $H_i$ is unobserved. Here $\pi_0 = P(H_i = 0)$, $f_0$ is the density of $Z_i \mid H_i = 0$, $f_1$ is the density of $Z_i \mid H_i = 1$, and $h_i$ and $z_i$ are realizations of $H_i$ and $Z_i$, respectively. Since $Z_i \mid H_i = 0 \overset{\mathrm{i.d.}}{\sim} N(0, 1)$ under the well-behaved test assumption, we can write $f_0(z_i) = \phi(z_i; 0, 1)$, where $\phi(x; \mu, \sigma^2)$ is a normal density function with mean $\mu$ and variance $\sigma^2$. The determination of $f_1$ is flexible and often depends on situation specifics. However, it was shown in McLachlan et al. (2006) that using $f_1(z_i) = \phi(z_i; \mu_1, \sigma_1^2)$ performs well in general settings, where $\mu_1 \in \mathbb{R}$ and $\sigma_1^2 > 0$. It can be seen that under such conditions, density (3.1) is a two-component normal mixture model (cf. Chapter 3 of McLachlan and Peel (2000) for details).
3.2.1
Empirical Null Density Function
The density $f_0(z_i) = \phi(z_i; 0, 1)$ arising from the well-behaved test assumption is referred to as the theoretical null density by Efron (2004), and has been broadly tested for veracity. It was found in Efron (2004, 2007b,a) that in some testing procedures (e.g. permutation tests and correlated test statistics), the assumption that $Z_i \mid H_i = 0 \overset{\mathrm{i.d.}}{\sim} N(0, 1)$ is unreasonable.
As an alternative, it was suggested in Efron (2004) that the marginal density of $Z_i \mid H_i = 0$ should be modeled by $f_0(z_i) = \phi(z_i; \mu_0, \sigma_0^2)$ instead, where $\mu_0 \in \mathbb{R}$, $\sigma_0^2 > 0$, and $\mu_0 < \mu_1$. This model for $Z_i \mid H_i = 0$ is referred to as the empirical null density, and has been shown to improve the fit of the two-component normal mixture model in McLachlan et al. (2006). As such, we adopt the two-component normal mixture with an empirical null density,
\[
f(z_i; \theta) = \pi_0 \phi(z_i; \mu_0, \sigma_0^2) + (1 - \pi_0) \phi(z_i; \mu_1, \sigma_1^2), \qquad (3.2)
\]
for noncontextual FDR (NC-FDR) control in the following sections, and shall refer to (3.2) as the NC-FDR model. Here $\theta = (\pi_0, \mu_0, \sigma_0^2, \mu_1, \sigma_1^2)^T$ is the vector of model parameters, where the T superscript indicates matrix transposition.
3.2.2
Rejection Rule
Under the NC-FDR model, Bayes' rule allows us to write the conditional probability of the event $H_i = 0 \mid Z_i = z_i$ as
\[
P(H_i = 0 \mid Z_i = z_i) = \frac{\pi_0 \phi(z_i; \mu_0, \sigma_0^2)}{f(z_i; \theta)} = \tau(z_i; \theta), \qquad (3.3)
\]
which can be used to derive the rejection rule
\[
r_1(z_i; \theta, c_1) = \begin{cases} 1 & \text{if } \tau(z_i; \theta) \le c_1, \\ 0 & \text{otherwise}, \end{cases} \qquad (3.4)
\]
where $0 \le c_1 \le 1$. Here, $r_1(z_i; \theta, c_1) = 1$ if the null hypothesis at voxel i is rejected (declared statistically significant), and 0 otherwise.
Expression (3.3) is referred to as the local FDR by Efron (2004), and it can be interpreted as the probability that the null hypothesis at voxel i is true, given knowledge of its z-score. Rejection rule (3.4) requires that the probability of voxel i being null is sufficiently low (defined through the threshold $c_1$) before rejecting the hypothesis at this voxel. For brevity, we will suppress the parameter vector $\theta$ when convenient, and write $\tau(z_i; \theta)$ and $r_1(z_i; \theta, c_1)$ as $\tau(z_i)$ and $r_1(z_i; c_1)$, respectively.
Let $I\{A\}$ be the indicator variable, where $I\{A\} = 1$ if proposition A is true, and 0 otherwise. Using the indicator notation, we can express the number of hypotheses rejected by rule (3.4) as $N_R = \sum_{i=1}^{n} I\{r_1(Z_i; c_1) = 1\}$, and the number of false positives as $N_{01} = \sum_{i=1}^{n} I\{r_1(Z_i; c_1) = 1, H_i = 0\}$. Furthermore, we can show that the marginal FDR ($\mathrm{mFDR} = E N_{01} / E N_R$) can be consistently estimated (in the probabilistic sense) by
\[
\overline{\mathrm{mFDR}} = \frac{\bar{N}_{01}}{\bar{N}_R}, \qquad (3.5)
\]
where we define $\bar{N}_R = \sum_{i=1}^{n} I\{r_1(z_i; c_1) = 1\}$ and $\bar{N}_{01} = \sum_{i=1}^{n} I\{r_1(z_i; c_1) = 1\} \tau(z_i)$. Here, $\overline{\mathrm{mFDR}} = \overline{\mathrm{mFDR}}(c_1)$ is a function of $c_1$, although we will drop the notation for brevity.
Theorem 3.1. For each i, assume that there are at most $\nu < \infty$ indices $i' \ne i$ ($i' = 1, ..., n$), such that $Z_i$ and $H_i$ are dependent on $Z_{i'}$ and $H_{i'}$. If the probabilities $P(r_1(Z_i; c_1) = 1, H_i = 0)$ and $P(r_1(Z_i; c_1) = 1) > 0$ are constant for all i, then $\overline{\mathrm{mFDR}} \xrightarrow{P} \mathrm{mFDR}$.
The assumption from Theorem 3.1 states that although the voxels are not assumed to be independent, we only permit a finite and fixed amount of dependencies. Under the same assumption, we can show that mFDR approaches FDR as n gets large.
Theorem 3.2. Under the same assumption as Theorem 3.1, if the probabilities $P(r_1(Z_i; c_1) = 1, H_i = 0)$, $P(r_1(Z_i; c_1) = 1, H_i = 1)$, and $P(r_1(Z_i; c_1) = 1) > 0$ are constant for all i, then $\lim_{n \to \infty} |\mathrm{mFDR} - \mathrm{FDR}| = 0$.
Proofs of all theoretical results can be found in Appendix 3.8.
3.2.3
FDR Control
Using Theorems 3.1 and 3.2, we can apply equation (3.5) to approximate the FDR for any given threshold $c_1$. Alternatively, we can also use $\overline{\mathrm{mFDR}}$ to approximately control the FDR at a specified level $0 < \alpha \le 1$ by choosing $c_1$ using the threshold rule
\[
c_1 = \underset{c_1' \in (0, 1]}{\arg\max}\left\{\overline{\mathrm{mFDR}}(c_1') \le \alpha\right\}. \qquad (3.6)
\]
For brevity, we will refer to the applications of (3.4) and (3.6) for estimation and control of FDR as NC-FDR collectively.
Although rule (3.6) asymptotically controls FDR at the correct level for hypotheses behaving as conjectured by the NC-FDR model, it cannot exploit the spatial structure of MRI data. However, the NC-FDR model does provide a useful framework for incorporating spatial information, as discussed in the following section.
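The following sketch illustrates the NC-FDR workflow just described: fit the two-component normal mixture (3.2) to the z-scores, form the local FDRs (3.3), and choose c1 via rule (3.6). For brevity, scikit-learn's GaussianMixture is used as a stand-in for the MM algorithm of Appendix 3.9, with the lower-mean component taken as the empirical null; this is an illustrative approximation rather than the estimation procedure of Section 3.4.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def nc_fdr(z, alpha=0.1, grid=None):
    """Sketch of NC-FDR: fit (3.2), compute tau(z_i) of (3.3), choose c1 via (3.6)."""
    z = np.asarray(z, dtype=float).reshape(-1, 1)
    gm = GaussianMixture(n_components=2, random_state=0).fit(z)
    null = int(np.argmin(gm.means_.ravel()))   # empirical null: component with mu_0 < mu_1
    tau = gm.predict_proba(z)[:, null]         # local FDR: posterior P(H_i = 0 | z_i)
    grid = np.linspace(0.01, 1.0, 100) if grid is None else grid
    c1 = 0.0
    for c in grid:                             # rule (3.6): largest c1 with est. mFDR <= alpha
        rejected = tau <= c
        if rejected.any() and tau[rejected].mean() <= alpha:
            c1 = max(c1, c)
    return tau, c1, tau <= c1
```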
3.3
Spatial FDR Control
Consider for any voxel i, $\tau(z_i) < 1/2$ indicates that the null hypothesis at the voxel is more likely to be false than not, given that $z_i$ is observed. Using this notion we can define the binary random variable $T_i = I\{\tau(Z_i) < 1/2\}$, which equals 0 if the null hypothesis at voxel i is more likely to be true, and 1 otherwise.
Let $s_i = (s_{1i}, s_{2i}, s_{3i})^T$ be the spatial coordinates and $S_i^d = \{i' \ne i : \delta(s_i, s_{i'}) \le d\}$ be the d-range Moore-type neighborhood of voxel i, where $\delta(s_i, s_{i'}) = \max\{|s_{1i} - s_{1i'}|, |s_{2i} - s_{2i'}|, |s_{3i} - s_{3i'}|\}$ is the maximum distance function. Using an MRF argument, a spatial relationship between $T_i$ and its neighbors $T_{(i)} = \{T_{i'} : i' \in S_i^d\}$ can be induced through the conditional probability statement
\[
P\!\left(T_i = t_i \mid T_{(i)} = t_{(i)}\right) = \frac{\exp\!\left[a t_i + b t_i \eta\!\left(t_{(i)}\right)\right]}{1 + \exp\!\left[a + b \eta\!\left(t_{(i)}\right)\right]} = g\!\left(t_i, t_{(i)}; \psi\right), \qquad (3.7)
\]
where $a, b \in \mathbb{R}$ and $\psi = (a, b)^T$ is the parameter vector. Here $\eta(t_{(i)}) = |S_i^d|^{-1} \sum_{i' \in S_i^d} t_{i'}$ is the mean over the neighborhood $S_i^d$ of voxel i, where $|S|$ denotes the cardinality of set S.
Model (3.7) allows for the probability of $T_i$ to be dependent upon the probability that its neighboring hypotheses are null, and thus provides a way of incorporating spatial dependencies for FDR control.
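On a regular 3D lattice, the neighborhood mean η(t(i)) can be computed for all voxels at once with a moving-average filter, as in the sketch below; boundary voxels are handled by reflection, which is a simplifying assumption made for illustration only.

```python
import numpy as np
from scipy.ndimage import uniform_filter

def neighbourhood_mean(t, d):
    """Mean eta(t_(i)) over the d-range Moore neighborhood S_i^d of every voxel,
    excluding the voxel itself, for a 3D array t of 0/1 indicators T_i."""
    w = 2 * d + 1
    window_sum = uniform_filter(t.astype(float), size=w, mode='reflect') * w**3
    return (window_sum - t) / (w**3 - 1)
```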
3.3.1
Rejection Rule
Using (3.7), we can write the probability of $T_i = 0 \mid T_{(i)} = t_{(i)}$ as
\[
P\!\left(T_i = 0 \mid T_{(i)} = t_{(i)}\right) = \frac{1}{1 + \exp\!\left[a + b \eta\!\left(t_{(i)}\right)\right]} = \xi\!\left(t_{(i)}; \psi\right). \qquad (3.8)
\]
Combining expressions (3.3) and (3.8), we can specify the spatially informed rejection rule
\[
r_2\!\left(z_{(i)}; \theta, \psi, c_1, c_2\right) = \begin{cases} 1 & \text{if } \tau(z_i; \theta) \le c_1 \text{ and } \xi\!\left(t_{(i)}; \psi\right) \le c_2, \\ 0 & \text{otherwise}, \end{cases} \qquad (3.9)
\]
where $0 < c_1, c_2 \le 1$, and $z_{(i)} = (z_i, t_{(i)})^T$. As with rule (3.4), $r_2(z_{(i)}; c_1, c_2) = 1$ if the null hypothesis at voxel i is rejected, and 0 otherwise.
Rule (3.9) extends rule (3.4) to also take into account the spatial relationship between nearby voxels by rejecting the null hypothesis at voxel i only if the local FDR is below the specified threshold $c_1$, and the probability that the null hypothesis at the voxel is true, given the state of its neighbors, is also sufficiently low (defined through the threshold $c_2$). The second condition, $\xi(t_{(i)}) \le c_2$, guarantees that the rejection of the hypothesis at voxel i is spatially coherent with the decisions made at the $(2d + 1)^3 - 1$ voxels in its neighborhood $S_i^d$. Note that if $c_2 = 1$, then rule (3.9) is exactly rule (3.4).
3.3.2
Theoretical Justification
Under rule (3.9), we can write the number of rejections and false positives as $N_R = \sum_{i=1}^{n} I\{r_2(Z_{(i)}; c_1, c_2) = 1\}$ and $N_{01} = \sum_{i=1}^{n} I\{r_2(Z_{(i)}; c_1, c_2) = 1, H_i = 0\}$, respectively. As with equation (3.5), we can show that
\[
\overline{\mathrm{mFDR}}' = \frac{\bar{N}_{01}}{\bar{N}_R} \qquad (3.10)
\]
can conservatively estimate (i.e. overestimate) the mFDR when we assume that nearby hypotheses are spatially coherent, where $\bar{N}_R = \sum_{i=1}^{n} I\{r_2(z_{(i)}; c_1, c_2) = 1\}$ and $\bar{N}_{01} = \sum_{i=1}^{n} I\{r_2(z_{(i)}; c_1, c_2) = 1\} \tau(z_i)$, respectively.
Theorem 3.3. For each i, assume that there are at most $\nu < \infty$ indices $i' \ne i$ ($i' = 1, ..., n$), such that $Z_{(i)}$ and $H_i$ are dependent on $Z_{(i')}$ and $H_{i'}$. Additionally, assume that $\gamma_{1i} \le \gamma_{2i}$ for all i, where
\[
\gamma_{1i} = E\!\left(I\{H_i = 0\} \mid r_2\!\left(Z_{(i)}; c_1, c_2\right) = 1\right),
\]
and
\[
\gamma_{2i} = E\!\left(I\{H_i = 0\} \mid r_1(Z_i; c_1) = 1\right) = \tau(Z_i).
\]
If $P(r_2(Z_{(i)}; c_1, c_2) = 1) > 0$, $P(r_2(Z_{(i)}; c_1, c_2) = 1, H_i = 0)$, and $E[I\{r_2(Z_{(i)}; c_1, c_2) = 1\} \tau(Z_i)]$ are constant for all i, then $\overline{\mathrm{mFDR}}' \xrightarrow{P} \mathrm{mFDR}' \ge \mathrm{mFDR}$.
As with rule (3.4), we can show that the mFDR for rule (3.9) also approaches the FDR as n gets large.
Theorem 3.4. Under the same assumptions as Theorem 3.3, if the probabilities $P(r_2(Z_{(i)}; c_1, c_2) = 1, H_i = 0)$, $P(r_2(Z_{(i)}; c_1, c_2) = 1, H_i = 1)$, and $P(r_2(Z_{(i)}; c_1, c_2) = 1) > 0$ are constant for all i, then $\lim_{n \to \infty} |\mathrm{mFDR} - \mathrm{FDR}| = 0$.
Note that the assumption that $\gamma_{1i} \le \gamma_{2i}$ in Theorem 3.3 is a formal statement of the conjecture of local coherency among voxels. That is, the assumption states that the null hypothesis at voxel i is less likely to be true if it is known to be in a region of false hypotheses, compared to when this is unknown. Here, we again assume limited dependencies between the voxels; furthermore, notice that $\nu$ is only assumed to be finite and may be greater than the size of $S_i^d$.
3.3.3
FDR Control
Combined, Theorems 3.3 and 3.4 can be applied to conservatively estimate the FDR by (3.10), for any given thresholds $c_1$ and $c_2$. Additionally, like (3.5), equation (3.10) can be used to conservatively control the FDR by choosing $c_1$ and $c_2$ such that
\[
(c_1, c_2) = \underset{(c_1', c_2') \in (0, 1]^2}{\arg\max}\left\{\overline{\mathrm{mFDR}}'(c_1', c_2') \le \alpha\right\}. \qquad (3.11)
\]
As with NC-FDR, the usage of (3.9), (3.10), and (3.11) for control or estimation of FDR will henceforth be referred to as MRF-FDR, for brevity. We now proceed to discuss the estimation of the parameter vectors $\theta$ and $\psi$ for application in NC-FDR and MRF-FDR.
3.4
Estimation of Parameters and Quantities
3.4.1
Noncontextual Model
Let $z_1, ..., z_n$ be a set of observed z-scores as defined in Section 3.2, assuming that $Z_i$ is marginally distributed as per (3.2). The parameter vector (i.e. $\theta$) can be estimated by maximizing the composite marginal likelihood (CML) function
\[
CL(\theta; z_1, ..., z_n) = \prod_{i=1}^{n} f(z_i; \theta). \qquad (3.12)
\]
Because of the mixture form of (3.2), the parameter vector $\theta$ that maximizes (3.12) cannot be expressed in closed form. However, it is possible to apply the minorization–maximization (MM) algorithm (cf. Hunter and Lange (2004)) to compute the maximum CML estimate (MCMLE) $\hat{\theta}$ through an iterative scheme. Particulars of the algorithm used are given in Appendix 3.9. Here, the MCMLE is preferred over the maximum likelihood estimate because it can be computed without explicit declaration of the dependencies between the z-scores; see Varin (2008) for a treatment on CMLs and Varin et al. (2011) for a general overview of composite likelihood methods.
3.4.2
Spatial Model
Upon computing the MCMLE, the conditional probabilities $\tau(z_i; \theta)$ can be estimated by $\tau(z_i; \hat{\theta})$, and thus $t_i$ can be approximated by $\hat{t}_i = I\{\tau(z_i; \hat{\theta}) < 1/2\}$. Under the hypothesis that the random variables $\hat{T}_i$ are related to their neighbors $\hat{T}_{(i)}$ through the conditional model (i.e. (3.7)), we can estimate the parameter vector $\psi$ by maximizing the PL
\[
PL\!\left(\psi; \hat{t}_1, ..., \hat{t}_n\right) = \prod_{i=1}^{n} g\!\left(\hat{t}_i, \hat{t}_{(i)}; \psi\right). \qquad (3.13)
\]
Maximum PL estimation was introduced in Besag (1974), and the maximum PL estimate (MPLE) can be shown to have desirable statistical properties in a broad range of circumstances (cf. Geman and Graffigne (1986) and Mase (1995)). As with the estimation of $\theta$, the MPLE $\hat{\psi}$ cannot be expressed in closed form due to the intractability of equation (3.13). However, we can implement a Newton-type algorithm to iteratively compute $\hat{\psi}$; see Appendix 3.9 for details.
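Because (3.7) assigns $T_i$ a Bernoulli distribution with success probability $\exp[a + b\eta(t_{(i)})] / \{1 + \exp[a + b\eta(t_{(i)})]\}$, maximizing (3.13) amounts to a logistic-regression-type fit of the $\hat{t}_i$ on the neighborhood means. The sketch below uses a generic quasi-Newton optimizer in place of the Newton scheme of Appendix 3.9.

```python
import numpy as np
from scipy.optimize import minimize

def mple(t_hat, eta):
    """Maximum pseudo-likelihood estimate of psi = (a, b) for the MRF (3.7),
    given the indicators t_hat_i and their neighborhood means eta(t_hat_(i))."""
    t_hat, eta = np.asarray(t_hat, float), np.asarray(eta, float)

    def neg_log_pl(psi):
        a, b = psi
        lin = a + b * eta
        # log PL = sum_i [ t_i * (a + b*eta_i) - log(1 + exp(a + b*eta_i)) ]
        return -(t_hat * lin - np.logaddexp(0.0, lin)).sum()

    res = minimize(neg_log_pl, x0=np.zeros(2), method='BFGS')
    return res.x  # (a_hat, b_hat)
```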
3.4.3
FDR Estimates
Using the estimates $\hat{\theta}$ and $\hat{\psi}$, we can estimate the marginal FDR expressions (3.5) and (3.10) by
\[
\widehat{\mathrm{mFDR}} = \frac{\sum_{i=1}^{n} I\!\left\{r_1\!\left(z_i; \hat{\theta}, c_1\right) = 1\right\} \tau\!\left(z_i; \hat{\theta}\right)}{\sum_{i=1}^{n} I\!\left\{r_1\!\left(z_i; \hat{\theta}, c_1\right) = 1\right\}} \qquad (3.14)
\]
and
\[
\widehat{\mathrm{mFDR}}' = \frac{\sum_{i=1}^{n} I\!\left\{r_2\!\left(\hat{z}_{(i)}; \hat{\theta}, \hat{\psi}, c_1, c_2\right) = 1\right\} \tau\!\left(z_i; \hat{\theta}\right)}{\sum_{i=1}^{n} I\!\left\{r_2\!\left(\hat{z}_{(i)}; \hat{\theta}, \hat{\psi}, c_1, c_2\right) = 1\right\}}, \qquad (3.15)
\]
respectively, where $\hat{z}_{(i)} = (z_i, \hat{t}_{(i)})^T$. We can show that (3.14) and (3.15) asymptotically approach (3.5) and (3.10), respectively, when $\hat{\theta}$ and $\hat{\psi}$ are consistent estimates of the relevant parameters.
Theorem 3.5. If $\sum_{i=1}^{n} I\{r_1(Z_i; \theta, c_1) = 1\} > 0$ and $\hat{\theta} \xrightarrow{P} \theta$, then $\widehat{\mathrm{mFDR}} \xrightarrow{D} \overline{\mathrm{mFDR}}$.
Proof. The proof is similar to that of Theorem 3.6.
Theorem 3.6. If $\sum_{i=1}^{n} I\{r_2(Z_{(i)}; \theta, \psi, c_1, c_2) = 1\} > 0$, $\hat{\theta} \xrightarrow{P} \theta$, and $\hat{\psi} \xrightarrow{P} \psi$, then $\widehat{\mathrm{mFDR}}' \xrightarrow{D} \overline{\mathrm{mFDR}}'$.
The assumptions that $\hat{\theta}$ and $\hat{\psi}$ be consistent are well justified for our models. In the case of the density function (3.2), it is well known that if $0 < \pi_0 < 1$ and $\mu_0 < \mu_1$, then the model is identifiable. Under the assumption of Theorem 3.1, it can be shown that (3.12) satisfies the regularity conditions for the consistency of an extremum estimator (cf. Theorem 4.1.2 of Amemiya (1985)) with the aid of an appropriate uniform law of large numbers (e.g. Theorem 4 of Andrews (1992)). Hence, there exists a consistent MCMLE.
To obtain the consistency of $\hat{\psi}$, we need to first establish the identifiability (in the sense of Geman and Graffigne (1986)) of the MRF equation (3.7).
Lemma 3.1. If $\psi^* \ne \psi$, then there exist $t_i$ and $t_{(i)}$ such that $g(t_i, t_{(i)}; \psi^*) \ne g(t_i, t_{(i)}; \psi)$, where $\psi^* = (a^*, b^*)^T$.
The consistency of the MPLE $\hat{\psi}$ can be obtained as follows.
Proposition 3.1. If $T_1, ..., T_n$ are random observations from the MRF (3.7), and if (3.7) is identifiable, then $\hat{\psi} \xrightarrow{P} \psi$.
Proof. See the consistency of pseudo-likelihood theorem of Geman and Graffigne (1986).
The results above allow for the use of equations (3.14) and (3.15) in the place of (3.5) and (3.10) for FDR estimation and control, respectively.
3.4.4
Range Estimation
Thus far, the range (i.e. d) of the neighborhoods $S_i^d$ has been assumed constant. Due to the intractability of its inclusion in the MPLE process, it is necessary to determine d by means of an external method. We can use the PLIC for such a purpose.
The PLIC at any d is defined as
\[
\mathrm{PLIC}(d) = \log PL\!\left(\hat{\psi}_d; \hat{t}_1, ..., \hat{t}_n\right) - 2^{-1} k_d \log n,
\]
where $\hat{\psi}_d$ is the MPLE, and $k_d$ is the number of parameters for a d-ranged MRF (here, $k_d = 2$ for all d). According to Ji and Seymour (1996), the optimal range can be estimated consistently via the rule
\[
d = \underset{d' = 1, 2, ...}{\arg\max}\; \mathrm{PLIC}(d'). \qquad (3.16)
\]
Rule (3.16) has been applied successfully in imaging applications (e.g. Stanford and Raftery (2002)), and can be viewed as the PL equivalent of the Bayesian information criterion of Schwarz (1978). Because an exhaustive search over all d = 1, 2, ... is not feasible, we can instead apply a greedy version of (3.16) in practice. We do this by initializing at $d' = 1$ and iteratively increasing $d'$ until $\mathrm{PLIC}(d' + 1) - \mathrm{PLIC}(d') \le 0$, whereupon we set $d = d'$. The use of this method is implicit in the ensuing discussions.
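Combining the previous two sketches, the greedy version of rule (3.16) can be illustrated as follows; neighbourhood_mean and mple refer to the hypothetical routines sketched above, and since $k_d = 2$ for every d the penalty term is constant across candidate ranges.

```python
import numpy as np

def select_range_plic(t_hat, volume_shape, max_d=10):
    """Greedy version of rule (3.16): increase d until the PLIC stops improving."""
    t = np.asarray(t_hat, float).reshape(volume_shape)
    n = t.size

    def plic(d):
        eta = neighbourhood_mean(t, d).ravel()
        a, b = mple(t.ravel(), eta)
        lin = a + b * eta
        log_pl = (t.ravel() * lin - np.logaddexp(0.0, lin)).sum()
        return log_pl - 0.5 * 2 * np.log(n)   # k_d = 2 for every d

    d, current = 1, plic(1)
    while d < max_d:
        nxt = plic(d + 1)
        if nxt - current <= 0:
            break
        d, current = d + 1, nxt
    return d
```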
3.5
Simulation Study
In order to discuss the application and assess the performance of MRF-FDR, we explore a number of simulation
scenarios. These scenarios are coded by the size of their spatial features (S1-S3, where S1 has the smallest
spatial features and S3 has the largest), and the size of their statistical effect (E1-E3, where E1 has the smallest
statistical effect, and E3 has the largest). For example, scenario S1E1 has small spatial features and statistical
effects.
3.5.1
Setup
In each of the 9 scenarios, a cube of $n = 120^3$ voxels was simulated with intensities generated independently from the N(0, 1) distribution. The total size of the simulation was chosen to replicate the number of voxels in a typical MRI study. For S3, the cube is partitioned into 8 sub-cubes (as per schema B of Fig. 3.1), each of size $60^3$. The dark sub-cubes are then replaced with intensities generated independently from $N(\Delta, 1)$. In scenario S2, the cube is partitioned into 27 sub-cubes (as per schema A), each of size $40^3$, whereby the dark sub-cubes are replaced with independent intensities from $N(\Delta, 1)$.
Scenario S1 is a modification of S3. In S1, a type B schema is used, whereby each of the black sub-cubes is replaced with a type A sub-cube with sub-sub-cubes of size $20^3$. Here, the white sub-sub-cubes are unchanged, and the black sub-sub-cubes are replaced by independent intensities from $N(\Delta, 1)$. If we call the voxels with intensities from N(0, 1) null, and those with intensities from $N(\Delta, 1)$ alternative, then there are $56 \times 20^3$, $14 \times 40^3$, and $4 \times 60^3$ alternative voxels in S1-S3, respectively.
The statistical effect can be thought of as the difference between a null and an alternative voxel. Here, the statistical effect is determined by the $\Delta$ in the alternative intensity generation. We use $\Delta$ = 1, 2, 3 for scenarios E1-E3, respectively.
Let $\zeta_1, ..., \zeta_n$ be the intensities of the n voxels, as per the simulation described. Using $\zeta_i$, we can write the p-value for the test of the null hypothesis $\Delta = 0$, against the alternative hypothesis $\Delta > 0$, as $p_i = 1 - \Phi(\zeta_i)$, for each voxel i. This is simply a one-sided z-test for a normal mean.
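A sketch of the S2 and S3 data-generating mechanisms is given below, under the assumption (consistent with the stated counts of 14 × 40³ and 4 × 60³ alternative voxels) that the schemas of Fig. 3.1 are alternating, checkerboard-like patterns of sub-cubes; adding ∆ to standard-normal intensities is distributionally equivalent to drawing the dark sub-cubes from N(∆, 1).

```python
import numpy as np

def simulate_cube(scenario='S3', delta=2.0, side=120, seed=0):
    """Sketch of scenarios S2/S3 of Section 3.5.1, assuming checkerboard schemas."""
    rng = np.random.default_rng(seed)
    cube = rng.standard_normal((side, side, side))
    sub = 40 if scenario == 'S2' else 60          # sub-cube edge length
    k = side // sub
    alt = np.zeros((side, side, side), dtype=bool)
    for i in range(k):
        for j in range(k):
            for l in range(k):
                if (i + j + l) % 2 == 0:          # "dark" sub-cubes
                    alt[i*sub:(i+1)*sub, j*sub:(j+1)*sub, l*sub:(l+1)*sub] = True
    cube[alt] += delta                            # alternative voxels ~ N(delta, 1)
    # One-sided p-values for the Delta = 0 tests: p = 1 - scipy.stats.norm.cdf(cube)
    return cube, alt
```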
3.5.2
Threshold Selection
Integral to the application of MRF-FDR is the selection of the marginal and spatial thresholds, c1 and c2 ,
respectively. In practice, the choice of combinations of the two thresholds can be difficult due to the number
of options. However, a decision can be made based on two primary statistics. Firstly, (3.15) can be used to
conservatively estimate the FDR for the combination of thresholds, and secondly,
\[
\hat{N}_R = \sum_{i=1}^{n} I\!\left\{r_2\!\left(\hat{z}_{(i)}; \hat{\theta}, \hat{\psi}, c_1, c_2\right) = 1\right\} \qquad (3.17)
\]
can be used to evaluate the number of voxels rejected under the choice of thresholds. The tradeoff in these two quantities with respect to the thresholds can be assessed by graphical means.
To demonstrate the process of threshold selection, we simulated a single instance of S1E2, S2E2, and S3E2. The MRF-FDR method was then applied in each case using threshold levels $c_1, c_2 = 0.01, 0.02, ..., 1$. Level plots of $\widehat{\mathrm{mFDR}}'$ and $\hat{N}_R / n \times 100\%$ (the percentage of hypotheses rejected) were then produced, and are given in Fig. 3.2. We can make some general observations from the figure. Firstly, for a fixed value of $c_2$, higher values of $c_1$ correspond to increases in both $\widehat{\mathrm{mFDR}}'$ and $\hat{N}_R$. Secondly, the effect is the same for higher levels of $c_2$ when $c_1$ is fixed. We also observe that in all scenarios, $\widehat{\mathrm{mFDR}}'$ and $\hat{N}_R$ appear to behave strangely when $c_2 < 0.1$. This is potentially due to a discordance between $\gamma_{1i}$ and $\gamma_{2i}$ (from Theorem 3.3) at very small threshold levels. Lastly, we note that the joint effects of $c_1$ and $c_2$ appear to be linear in the sense that there are no paradoxical decreases in FDR or rejections when both thresholds are increased in tandem.
These results appear to confirm the common-sense notion that rejecting hypotheses under more stringent conditions would decrease the number of false positives, but also diminish the ability to reject potentially significant hypotheses. If the desired level of FDR and number of rejections are known, then the level plots from Fig. 3.2 can be used to decide upon a combination of thresholds for the study. For example, if one chooses
Figure 3.1: Reference schematics for simulation study.
Figure 3.2: Sub-figures A1-A3 and B1-B3 display the relationships between $c_1$ and $c_2$, versus $\widehat{\mathrm{mFDR}}'$ and $\hat{N}_R / n \times 100\%$ (percentage of hypotheses rejected), respectively, for simulations of scenarios S1E2-S3E2. The white space indicates values for which $\widehat{\mathrm{mFDR}}'$ could not be evaluated (i.e. when there are no rejections).
to control FDR at the 0.1 level, and make N̂R /n × 100% = 40% in scenario S2E2, then a possible choice of
thresholds is c1 = 0.3 and c2 = 0.5. However, there are many other choices that would yield approximately the
same combination of outcomes.
3.5.3
False Discovery Rate Control
In practice, one does not determine the number of rejections to make. As such, the only decision to make is to
control the FDR at an appropriate level α. Here, as in the previous section, the choice is not obvious due to the
multitude of combinations of thresholds which lead to similar FDR levels. To alleviate the issue, we propose
the use of three thresholding rules: (3.6), (3.11), and
\[
c_1 = \underset{c_1' \in (0, 1]}{\arg\max}\left\{\overline{\mathrm{mFDR}}'(c_1', 0.5) \le \alpha\right\}. \qquad (3.18)
\]
The first rule is the NC-FDR marginal rule which ignores the spatial information available for FDR control.
The second is the MRF-FDR method discussed in Section 3.3, which allows for dependencies in the rejections
of neighboring voxels, and thirdly, rule (3.18) is a stringent criterion, whereby it is insisted that voxels are only
rejected if their neighborhood indicates a lower chance of the null being true than false.
3.5.4
Assessing Performance
In order to assess the performance of suggested rules, we simulated 10 repetitions each of scenarios S1E1 to
S3E3 controlling α at the scientifically-typical levels 0.005, 0.01, 0.05 and 0.1. For each repetition, we compute
the observed FDR (oFDR = oN01 /oNR ), and observed true positive rate (oTPR = oN11 /oN1 ), where oN01 ,
oN11 , oNR and oN1 are the numbers of observed false positives, true positives (correctly rejected hypotheses),
rejections and number of false null hypotheses, respectively. We then average over the ten repetitions of each
scenario and α level.
The oFDR and oTPR measurements were selected for their interpretability. Intuitively, oFDR measures the
number of false positives per rejected hypothesis and should be near α if the technique assessed is correctly
controlling the FDR level. The oTPR instead measures the power of the procedure to reject hypotheses where
the null is false and should be as large as possible. The most desirable property for a control method is to
achieve oFDR close to α, and a high oTPR simultaneously, over the scenarios.
For additional comparison, we also assess the performances of the Benjamini and Hochberg (1995) and Benjamini
and Yekutieli (2001) methods, which we shall refer to as BH and BY, respectively.
3.5.5
Results
The results of the simulation described above are reported in Table 3.1. From the results we can make a number
of observations. We can firstly rank the rules based on the number of scenarios where they achieved the best
results. In terms of oFDR, the order from best to worst (counting all ties, and not counting E1 scenarios, where
α = 0.005 or 0.01) is (3.18), BY, (3.11), and BH tied with (3.6). For oTPR, the order from best to worst is
(3.11), (3.6), (3.18), BH, and BY.
We firstly note that in E1 scenarios, where α = 0.005 or 0.01, all methods were either incapable of rejecting any
hypotheses, or were only able to reject negligible numbers of alternative voxels. This is due to the conservative
nature of all the tested methods, when α is very low. When ignoring these conditions, we find that the mixture
model family of rules (i.e. (3.6), (3.11), and (3.18)) tended to outperform BH and BY in terms of oTPR, and
that the two spatially informed rules (i.e. (3.11) and (3.18)) tended to outperform or equal the marginal rules
in many scenarios, while also achieving lower oFDR values.
Additionally, we note some specific observations. It can be seen that only BH and BY were meaningful in S1E2
for α = 0.005, indicating that these methods may be more appropriate than the mixture model family when
controlling for very low levels of FDR when there are smaller spatial features. This observation is supported
by observing the performance of BH in terms of oTPR in S1E3. However, we also observe that BH and BY
tend to conservatively control FDR over all other scenarios, and thus suffer in terms of oTPR due to the lack
of rejections made.
Next, we observe that rules (3.6) and (3.11) tend to behave similarly in terms of both oFDR and oTPR for low α
values and low levels of effects, although increasing either of these two quantities results in (3.11) outperforming
(3.6) in both criteria.
Table 3.1: Average oFDR and oTPR values (over 10 repetitions) for the tested rules under various simulation scenarios. Bold values indicate the best performance within each scenario (lowest FDR and highest TPR). Dashed lines indicate no rejections made by the method in the specified scenario.
Unlike rule (3.6) which controls FDR at the desired α level, rules (3.11) and (3.18) tend to be conservative
like BH and BY, supporting the assumption that γ1i ≤ γ2i , in Theorem 3.3. This does not appear to impact
upon the oTPR results, as exemplified by the numerous scenarios where (3.18) achieved the best results in both
criteria. Upon inspection, we observe that the aforementioned scenarios where (3.18) performs well are cases
where the effect sizes or spatial features are large (E3 or S3); both of which are rare in practice. Furthermore, the
oFDR is always much smaller than the allowed level α, indicating that rule (3.18) may be overly conservative.
Finally, we note that in the scenarios where (3.18) achieves the best oTPR, (3.11) tends to produce a comparable
result. The closeness in oTPR in combination with the ability of rule (3.11) to better use the FDR allowance
indicates that (3.11) may be the better choice across all scenarios and α levels.
Thus, we conclude that the mixture family of rules tends to be better than BH and BY, and should be preferred
to these methods in almost all scenarios. In cases where there are small effects or when controlling at moderately
low α levels, the spatial relationships can be ignored since (3.6) tends to perform well. However, when there
are very large effect sizes or spatial features, rule (3.18) produces both low FDR levels and high power. In all
other circumstances, and in general, rule (3.11) is able to control FDR at below the desired level, and provides
highly comparable power. Thus, we recommend rule (3.11) for general use, and apply it to a real data study in
the following section.
3.6
PATH Study
PATH is a large longitudinal study of aging aimed at investigating the course of mood disorders, cognition,
health, and other individual characteristics across the lifespan (cf. Anstey et al. (2012)). It surveys 7485
individuals in three age groups of 20 to 24, 40 to 44, and 60 to 64 years at baseline. Follow-up is every
four years over a period of 20 years. The data from the PATH project were obtained from the normative
older age group (2551 at baseline). PATH surveys residents of the city of Canberra and the adjacent town
of Queanbeyan, Australia, who were randomly recruited through the electoral roll. Enrollment to vote is
compulsory for Australian citizens; consequently this cohort is representative of the population. The study was
approved by the Australian National University Ethics Committee.
3.6.1
Data Description
Of the 2551 randomly selected older PATH participants included in the study in wave 1, 2076 consented to be
contacted regarding an MRI scan. From these, a randomly selected sub-sample of 622 participants were offered
an MRI scan and 478 participants eventually completed MRI scanning; 360 participants were rescanned at wave
3.
From the 360 rescanned individuals, we excluded 18 participants from the current study due to gross brain
abnormalities evident in their MRI scans; another 30 participants were excluded due to a history of epilepsy,
Parkinson’s disease or stroke. Furthermore, 46 individuals were diagnosed with mild cognitive impairment or
dementia (based on full neurological assessment meeting clinical criteria), and were subsequently excluded. Our
data consists of the wave 1 scans of the remaining 266 participants.
The 266 participants were scanned using a 1.5 Tesla Philips Gyroscan ACS-NT scanner (Philips Medical Systems, Best, the Netherlands) for T1-weighted three-dimensional structural MRI, and all participant identifiers
were stripped from the resulting scans. This 266 subjects sub-sample of scans constitutes the data for our
investigation. The sub-sample has mean age 63.16 (standard deviation 1.47) and a comparable gender ratio to
that of the greater PATH cohort.
All data were first converted to the MINC format and then intensity non-uniformity corrected using the N3
algorithm (cf. Sled et al. (1998)). Data were then roughly normalized by clamping the image between histogram
percent critical thresholds of 0.5 and 99.5% and rescaling the resulting data between 0 and 100. The histogram
clamping step is to remove small peripheral image artifacts. Data were then linearly registered to an internal
average. The resulting MRIs have dimensions 264 × 264 × 199 with a total brain volume of 1387661 voxels, as
determined by atlas based registration to a common template (cf. Fonov et al. (2009) and Fonov et al. (2011)).
3.6.2
Hypothesis Testing
Let xj be the age of individual j = 1, ..., m, and yij be the intensity of voxel i from the MRI of individual j.
Here, m = 266 and n = 1387661.
At each voxel, we model the relationship between age and intensity by the linear regression model $Y_{ij} = \beta_{i0} + \beta_{i1} x_j + E_{ij}$, where $\beta_{i0}$ and $\beta_{i1}$ are voxel-specific parameters, and $E_{ij} \sim N(0, \sigma_i^2)$. This model allows us
to perform n separate regression computations.
Since we only model the intensities by age, there is a potential for inferential bias due to the absence of other
relevant factors such as education, gender, and neighboring voxel intensities in each regression. Thus, instead of
the usual least-squares based inference, we opted to use the misspecification-adjusted standard errors of White
(1982) to correct for these effects (see Appendix B of Nguyen et al. (2013) for computational details).
The misspecification-adjusted standard errors allow us to perform the n voxel-specific regressions separately
even though they may be related, and in absence of additional factors; this is necessary since modeling the
full dependency structure among the n regression models would not be computationally feasible. The major
drawbacks of the approach are as follows: the true effect-size of the modeled factor upon the voxel intensity will
be biased, and the standard errors of the regression parameters are conservative (i.e. inefficient) in comparison
to the correctly-specified model. The first issue is not a concern considering we are only interested in the
significance (i.e. p-value) of the effect of aging upon each voxel; the p-value is still correct even if the size of the
effect is biased. The second issue is not problematic considering it is unlikely that the correctly-specified model
is determinable (with regards to the inclusion of all relevant factors, considering the complexity of biological
processes); see White (1982) and White (1994) for further details regarding the analysis of misspecified models.
Based on the misspecification-adjusted standard errors, we can compute a p-value $p_i$ for each voxel to test the null hypothesis that $\beta_{i1} = 0$, using a Wald-type test. That is, we test whether or not age has an effect on the intensity at each voxel, correcting for the potential effects of excluding other intensity-affecting factors and between-voxel dependencies.

Figure 3.3: Sub-figures A and B display the p-value and z-score histograms for the PATH data hypothesis tests, where the blue, red and black lines are graphs of $\pi_0 f_0(z_i)$, $(1 - \pi_0) f_1(z_i)$, and $f(z_i)$, respectively (as per Section 3.2). Sub-figures C and D show $\widehat{\mathrm{mFDR}}{}'$ and $\hat{N}_R/n \times 100\%$ for various threshold combinations, respectively.
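For concreteness, the voxel-wise testing step can be sketched in R (the language used for the computations elsewhere in this thesis). The toy data and variable names below are hypothetical placeholders rather than PATH intensities, and the sandwich form shown is the standard White-type (HC0) estimator, which may differ in detail from the exact adjustment described in Appendix B of Nguyen et al. (2013).

# Sketch: misspecification-adjusted (sandwich) Wald test for one voxel.
# The toy data below are simulated; in the application, 'intensity' would be
# the observed intensities y_ij at voxel i and 'age' the covariate x_j.
set.seed(1)
m <- 266
age <- rnorm(m, mean = 63.16, sd = 1.47)
intensity <- 50 - 0.3 * age + rnorm(m, sd = 2)

X <- cbind(1, age)                      # design matrix (intercept, age)
fit <- lm.fit(X, intensity)             # least-squares estimates of (beta_i0, beta_i1)
beta_hat <- fit$coefficients
res <- fit$residuals

# White-type sandwich estimator: (X'X)^{-1} X' diag(res^2) X (X'X)^{-1}
bread <- solve(crossprod(X))
meat <- crossprod(X * res)              # = X' diag(res^2) X
sandwich_cov <- bread %*% meat %*% bread

# Wald-type test of H0: beta_i1 = 0
z <- beta_hat[2] / sqrt(sandwich_cov[2, 2])
p_value <- 2 * pnorm(-abs(z))
p_value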
3.6.3 Technical Results
After performing the hypothesis tests described above, we then assessed the p-values for indications of voxels for which the null hypothesis is false. In Fig. 3.3 A, we note that the p-values do not appear uniformly distributed, implying that not all null hypotheses are true, under the well-behaved assumption. This suggests that there may be an effect of age on intensity across some of the analyzed voxels.
The z-scores were then computed and the NC-FDR model with an empirical null density was fitted to the data (see Fig. 3.3 B). The MCMLE of the parameter vector for density (3.2) was found to be
$$\hat{\theta} = \left(\hat{\pi}_0, \hat{\mu}_0, \hat{\sigma}_0^2, \hat{\mu}_1, \hat{\sigma}_1^2\right)^T = (0.306, -0.492, 0.860, 0.915, 0.752)^T,$$
which indicates that there is a large proportion of age-affected voxels, and that the choice to use an empirical null was appropriate in this situation.
The MRF of Section 3.3 was then estimated with the PLIC-selected neighborhood range of d = 1, indicating that the relationship between age-affected voxels appears to be very local in nature. The potential FDR and number of rejections for various threshold levels were then assessed via level plots (Fig. 3.3 C and D). The graphs suggest that controlling the FDR at α = 0.1 would yield a rejection proportion of approximately 60% of the tested voxels.

Figure 3.4: Sub-figures A1, A2 and A3 show FDR control by MRF-FDR using rule (3.11) at the α = 0.1 level, and B1, B2 and B3 show control by NC-FDR using rule (3.6) at the same level. Sub-figures A1 and B1 are midsagittal slices, A2 and B2 are midcoronal slices, and A3 and B3 are midhorizontal slices. Red indicates rejected voxels, and vice versa for black.
We chose to control the FDR at the α = 0.1 level due to the potentially high power arising from the large proportion of rejections, and the conservative nature of rule (3.11). This resulted in the choice of thresholds c1 = 0.38 and c2 = 0.21, which indicates a strong preference towards spatial coherency of the rejected hypotheses.
Applying rule (3.11) with the chosen thresholds yields 826687 (59.6%) rejections, compared to NC-FDR, which yields 883117 (63.6%) rejections at the same α level. Both of these results could be anticipated from Fig. 3.3 D. Images of the rejected voxels under MRF-FDR and NC-FDR can be found in Fig. 3.4.
On inspection of Fig. 3.4, we note that the NC-FDR images show a large number of isolated rejected regions when compared to the MRF-FDR images, which appear to be more spatially coherent. This is exemplified in the comparison of Sub-figure A1 to B1. In order to numerically assess the spatial coherency of the two sets of rejections, we can compute the number of isolated voxels (voxels with no same-class neighbors in their 1-range Moore-type neighborhoods). Using this metric, we find that MRF-FDR produces 177 (0.013%) isolated voxels, compared to 336 (0.024%) for NC-FDR. Thus, MRF-FDR has roughly half as many isolated voxels as NC-FDR, which implies lower FDR levels under the spatial coherency assumption of Theorem 3.3.
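A minimal R sketch of the isolated-voxel count is given below; the rejection array and its dimensions are hypothetical placeholders, and the function simply counts voxels whose 26 Moore-type neighbors all carry the opposite label.

# Sketch: count voxels with no same-class neighbor in their 26-voxel
# (1-range Moore-type) neighborhood. 'reject' is a hypothetical 3D logical array.
count_isolated <- function(reject) {
  d <- dim(reject)
  same <- array(0L, dim = d)           # number of same-class neighbors per voxel
  shifts <- expand.grid(dx = -1:1, dy = -1:1, dz = -1:1)
  shifts <- shifts[!(shifts$dx == 0 & shifts$dy == 0 & shifts$dz == 0), ]
  pad <- array(NA, dim = d + 2)        # pad with NA so border voxels are handled
  pad[2:(d[1] + 1), 2:(d[2] + 1), 2:(d[3] + 1)] <- reject
  for (s in seq_len(nrow(shifts))) {
    nb <- pad[2:(d[1] + 1) + shifts$dx[s],
              2:(d[2] + 1) + shifts$dy[s],
              2:(d[3] + 1) + shifts$dz[s]]
    same <- same + (!is.na(nb) & nb == reject)
  }
  sum(same == 0)                       # isolated voxels (no same-class neighbor)
}

# Hypothetical example on a small random map:
set.seed(2)
reject <- array(runif(20 * 20 * 20) < 0.6, dim = c(20, 20, 20))
count_isolated(reject)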
Table 3.2: Anatomical decomposition with proportions of rejected voxels (FDR level α = 0.1) and volumes (in voxels).

Component          Left                    Right                   Total
                   Proportion  Volume      Proportion  Volume      Proportion  Volume
Brainstem          -           -           -           -           0.603       43109
Caudate Nucleus    0.733       4242        0.108       4362        0.416       8604
Cerebellum         0.617       79762       0.617       80091       0.617       159853
Fornix             0.160       873         0.105       962         0.131       1835
Frontal Lobe       0.499       259165      0.495       256852      0.497       516017
Globus Pallidus    0.998       936         0.991       932         0.995       1868
Occipital Lobe     0.555       77162       0.588       75281       0.571       152443
Parietal Lobe      0.426       100907      0.519       128996      0.478       229903
Putamen            0.989       4049        0.968       4076        0.979       8125
Temporal Lobe      0.585       153812      0.589       128952      0.587       282764
Thalamus           0.674       5636        0.435       7589        0.537       13225

3.6.4 Clinical Results
Using the ANIMAL+INSECT technique for structural segmentation (cf. Collins et al. (1999)), we were able to
compute the proportion of rejected voxels in various major anatomical components (see Table 3.2).
Our finding of whole-brain effects due to age in the previous section is clinically consistent with the brain
aging literature (i.e. the clinical inferences drawn from our results are compatible with those drawn from the
quantitative results of articles in the literature; e.g. Sowell et al. (2003); Smith et al. (2007); Raz et al. (2005);
Raz and Rodrigue (2006); Courchesne et al. (2000); Salat et al. (2004); Walhovd et al. (2005)), as are the age
effects on the caudate nucleus (e.g. Raz et al. (2005); Walhovd et al. (2005); Yamashita et al. (2011)), cerebellum
(e.g. Smith et al. (2007); Raz et al. (2005); Luft et al. (1999)), frontal lobe (e.g. Sowell et al. (2003); Smith
et al. (2007); Tisserand et al. (2002)), globus pallidus (e.g. Walhovd et al. (2005)), occipital lobe (e.g. Salat
et al. (2004)), parietal lobe (e.g. Sowell et al. (2003); Smith et al. (2007); Courchesne et al. (2000)), putamen
(e.g. Walhovd et al. (2005)), temporal lobe (e.g. Sowell et al. (2003); Smith et al. (2007)), and thalamus (e.g.
Walhovd et al. (2005); Hughes et al. (2012); Sullivan et al. (2004)).
3.7 Conclusion
We have developed a spatially informed technique that is able to conservatively control the FDR by complementing the technique of McLachlan et al. (2006) with an MRF structure. This was achieved through the
use of MPLE and PLIC for estimation and model selection, respectively. The resulting method was shown to
outperform traditional marginal techniques in terms of both false positive mitigation and power over various
simulations of spatially dependent hypotheses.
Applications of MRF-FDR to the PATH study data showed that it was able to produce more spatially coherent rejections than those made by noncontextual methods. This implies that there is less potential for false discovery when adopting an assumption that voxel effects are spatially dependent.
Using an anatomic segmentation, we also investigated clinical findings of our study. Our results corresponded
well with the literature in terms of both whole brain aging and aging in various anatomic components.
This article has improved upon the work of Nguyen et al. (2013) by suggesting a set of more interpretable rules
and proving a number of technical theorems regarding the workings of the methodology; however, there are still
some limitations, which need addressing in future work. For instance, we have assumed uniformity with respect
to the spatial dependencies between voxels which may be untrue for the effects of some factors. To address
such issues, we may allow for a separate set of MRF parameters (i.e. ψ) for different regions of interest (e.g.
the different anatomical components from Section 3.6.4). Another solution may be to allow for each voxel to
have a different set of MRF parameters (i.e. ψi ) which can be treated as a random effect, or a random variable
with a prior distribution (in the Bayesian sense); this would allow for each voxel to be differently related to its
neighbors.
Furthermore, our current method has been developed under an assumption of information austerity (i.e. we only
assume knowledge of the p-values for the effect of the factor of interest at each voxel and the spatial location of
said voxels) and therefore we have not discussed any issues regarding the use of complementary data. However,
given complementary data is available, there are many ways in which our methodology can be expanded upon.
For example, given contiguous data (i.e. information regarding the voxels which is independent of the factor of
interest), we could perform a clustering of the voxels using the method of Heller et al. (2006) before computing
p-values, and conducting FDR control using MRF-FDR. Additionally, factors that are not directly of interest
can be incorporated into the prior probability for each z-score via a mixture-of-experts construction within the
noncontextual-phase of our method (cf. Jordan and Jacobs (1994) and Section 5.12 of McLachlan and Peel
(2000) regarding mixtures-of-experts); this would allow for individual-specific FDR control. We believe that
these avenues represent interesting future directions.
Beyond the aforementioned future directions, we hope to apply this work to other brain study scenarios to assess its robustness under different experimental conditions, and to implement the method in major open-source software projects to encourage its use.
3.8 Proof of Theorems
Proof of Theorem 3.1
The equation
$$I\left\{r_1\left(Z_i; c_1\right) = 1\right\}\tau\left(Z_i\right) = I\left\{r_1\left(Z_i; c_1\right) = 1\right\}E\left(I\left\{H_i = 0\right\} \mid I\left\{r_1\left(Z_i; c_1\right) = 1\right\}\right)$$
implies
$$E\left(I\left\{r_1\left(Z_i; c_1\right) = 1\right\}\tau\left(Z_i\right)\right) = E\,I\left\{r_1\left(Z_i; c_1\right) = 1, H_i = 0\right\},$$
by the law of iterated expectations.
Under the theorem's assumptions, we have $\mathrm{var}\left(n^{-1}\bar{N}_{01}\right) \to 0$ and $\mathrm{var}\left(n^{-1}\bar{N}_R\right) \to 0$, as $n \to \infty$. Thus, by application of a dependent-variable law of large numbers (e.g. Theorem 5.3 of Boos and Stefanski (2013)), we have $n^{-1}\bar{N}_{01} \xrightarrow{P} E\,I\left\{r_1\left(Z_i; c_1\right) = 1, H_i = 0\right\}$ and $n^{-1}\bar{N}_R \xrightarrow{P} E\,I\left\{r_1\left(Z_i; c_1\right) = 1\right\}$. It follows that
$$\widehat{\mathrm{mFDR}} = \frac{\bar{N}_{01}}{\bar{N}_R} = \frac{n^{-1}\bar{N}_{01}}{n^{-1}\bar{N}_R} \xrightarrow{P} \frac{E\,I\left\{r_1\left(Z_i; c_1\right) = 1, H_i = 0\right\}}{E\,I\left\{r_1\left(Z_i; c_1\right) = 1\right\}} = \frac{n\,E\,I\left\{r_1\left(Z_i; c_1\right) = 1, H_i = 0\right\}}{n\,E\,I\left\{r_1\left(Z_i; c_1\right) = 1\right\}} = \frac{E\sum_{i=1}^n I\left\{r_1\left(Z_i; c_1\right) = 1, H_i = 0\right\}}{E\sum_{i=1}^n I\left\{r_1\left(Z_i; c_1\right) = 1\right\}} = \frac{E\,N_{01}}{E\,N_R} = \mathrm{mFDR},$$
by continuous mapping.
Proof of Theorem 3.2
Let $U_0 = \sum_{i=1}^n I\left\{r_1\left(Z_i; c_1\right) = 1, H_i = 0\right\}$ and $U_1 = \sum_{i=1}^n I\left\{r_1\left(Z_i; c_1\right) = 1, H_i = 1\right\}$, and define $V_0 = n^{-1} E U_0$ and $V_1 = n^{-1} E U_1$. Using this notation, we can write
$$\mathrm{mFDR} = \frac{E U_0}{E U_0 + E U_1} = \frac{V_0}{V_0 + V_1}$$
and
$$\mathrm{FDR} = E\left(\frac{U_0}{U_0 + U_1}\right).$$
Let $\kappa(u_0, u_1) = u_0/(u_0 + u_1)$. By Taylor's theorem, we can write
$$\mathrm{FDR} = E\,\kappa(U_0, U_1) = \frac{V_0}{V_0 + V_1} + R_1 + R_2 = \mathrm{mFDR} + R_1 + R_2,$$
where
$$R_1 = \frac{n V_1}{(n V_0 + n V_1)^2}\,E(U_0 - n V_0) - \frac{n V_0}{(n V_0 + n V_1)^2}\,E(U_1 - n V_1),$$
$$R_2 = \sum_{k_0 + k_1 = 2} \frac{2}{k_0!\,k_1!}\,E\left[(U_0 - n V_0)^{k_0}(U_1 - n V_1)^{k_1}\,\rho_{k_0 k_1}\right],$$
and
$$\rho_{k_0 k_1}(U_0, U_1, V_0, V_1) = \int_0^1 (1 - t)^2\,\frac{\partial^2 \kappa}{\partial u_0^{k_0}\,\partial u_1^{k_1}}\bigl(t U_0 + (1 - t) n V_0,\; t U_1 + (1 - t) n V_1\bigr)\,dt.$$
Firstly, note that $R_1 = 0$, because $E(U_0 - n V_0) = E U_0 - n V_0 = 0$ and, similarly, $E(U_1 - n V_1) = 0$, by the definitions of $V_0$ and $V_1$.
Now, we need to bound $|R_2|$ by a function which shrinks to zero as $n \to \infty$. Consider that for each $k_0$ and $k_1$ such that $k_0 + k_1 = 2$, we have
$$\left|(1 - t)^2\,\frac{\partial^2 \kappa}{\partial u_0^{k_0}\,\partial u_1^{k_1}}\bigl(t U_0 + (1 - t) n V_0,\; t U_1 + (1 - t) n V_1\bigr)\right| \le \frac{2 C (1 - t)^2 \max\left\{t U_0 + (1 - t) n V_0,\; t U_1 + (1 - t) n V_1\right\}}{\bigl(t (U_0 + U_1) + (1 - t) n (V_0 + V_1)\bigr)^3} \le \frac{2 C (1 - t)^2}{\bigl(t (U_0 + U_1) + (1 - t) n (V_0 + V_1)\bigr)^2} \le \frac{2 C}{(V_0 + V_1)^2 n^2},$$
and hence $\left|\rho_{k_0 k_1}(U_0, U_1, V_0, V_1)\right| \le n^{-2} C'(V_0, V_1)$, where $C$ is a constant and $C'$ is a function of $V_0$ and $V_1$ which does not depend on $n$. Thus, by the Cauchy–Schwarz inequality, for each $k_0$ and $k_1$,
$$\left|E\left[(U_0 - n V_0)^{k_0}(U_1 - n V_1)^{k_1}\,\rho_{k_0 k_1}\right]\right| \le n^{-2} C'(V_0, V_1)\,E\left|(U_0 - n V_0)^{k_0}(U_1 - n V_1)^{k_1}\right|.$$
Observe that for $k_0 = k_1 = 1$, we have
$$\left|E\bigl((U_0 - n V_0)(U_1 - n V_1)\bigr)\right| = \left|E\sum_{i=1}^n \sum_{i'=1}^n W_{i0} W_{i'1}\right| = \left|\sum_{i=1}^n \sum_{i'=1}^n E\left(W_{i0} W_{i'1}\right)\right| \le n(\nu + 1),$$
where $W_{ij} = I\left\{r_1\left(Z_i; c_1\right) = 1, H_i = j\right\} - V_j$ for $j = 0, 1$, since each voxel $i$ is dependent on at most $\nu$ other voxels, $\left|E\left(W_{i0} W_{i'1}\right)\right| \le 1$, and $E W_{i0}\,E W_{i'1} = 0$. By similar reasoning, we also have $E(U_0 - n V_0)^2 \le n(\nu + 1)$ and $E(U_1 - n V_1)^2 \le n(\nu + 1)$.
Finally, by the triangle inequality,
$$|R_2| \le \sum_{k_0 + k_1 = 2} \frac{2}{k_0!\,k_1!}\,\frac{C'(V_0, V_1)}{n^2}\,n(\nu + 1) = \frac{4\,C'(V_0, V_1)(\nu + 1)}{n} \to 0,$$
as $n \to \infty$, as required.
Proof of Theorem 3.3
Like Theorem 3.1, the limited dependency assumption implies that $\mathrm{var}\left(n^{-1}\bar{N}_{01}\right) \to 0$ and $\mathrm{var}\left(n^{-1}\bar{N}_R\right) \to 0$, as $n \to \infty$. Theorem 5.3 of Boos and Stefanski (2013) is then applied to yield
$$n^{-1}\bar{N}_{01} \xrightarrow{P} E\left[I\left\{r_2\left(Z_{(i)}; c_1, c_2\right) = 1\right\}\tau\left(Z_i\right)\right] \quad \text{and} \quad n^{-1}\bar{N}_R \xrightarrow{P} E\,I\left\{r_2\left(Z_{(i)}; c_1, c_2\right) = 1\right\}.$$
Similarly to Theorem 3.1, we have
$$\widehat{\mathrm{mFDR}}{}' \xrightarrow{P} \frac{E\sum_{i=1}^n I\left\{r_2\left(Z_{(i)}; c_1, c_2\right) = 1\right\}\tau\left(Z_i\right)}{E\sum_{i=1}^n I\left\{r_2\left(Z_{(i)}; c_1, c_2\right) = 1\right\}} = \mathrm{mFDR}',$$
by continuous mapping.
Now, $\gamma_{1i} \le \gamma_{2i}$ implies
$$E\sum_{i=1}^n I\left\{r_2\left(Z_{(i)}; c_1, c_2\right) = 1\right\}\tau\left(Z_i\right) = E\sum_{i=1}^n I\left\{r_2\left(Z_{(i)}; c_1, c_2\right) = 1\right\}\gamma_{2i} \ge E\sum_{i=1}^n I\left\{r_2\left(Z_{(i)}; c_1, c_2\right) = 1\right\}\gamma_{1i} = E\sum_{i=1}^n I\left\{r_2\left(Z_{(i)}; c_1, c_2\right) = 1, H_i = 0\right\},$$
by the law of iterated expectations. Finally, we have
$$\mathrm{mFDR}' \ge \frac{E\sum_{i=1}^n I\left\{r_2\left(Z_{(i)}; c_1, c_2\right) = 1, H_i = 0\right\}}{E\sum_{i=1}^n I\left\{r_2\left(Z_{(i)}; c_1, c_2\right) = 1\right\}} = \mathrm{mFDR},$$
as required. Here, mFDR must exist by the theorem's assumption.
Proof of Theorem 3.4
In the proof of Theorem 3.2, set
$$U_0 = \sum_{i=1}^n I\left\{r_2\left(Z_{(i)}; c_1, c_2\right) = 1, H_i = 0\right\}, \qquad U_1 = \sum_{i=1}^n I\left\{r_2\left(Z_{(i)}; c_1, c_2\right) = 1, H_i = 1\right\},$$
and $W_{ij} = I\left\{r_2\left(Z_{(i)}; c_1, c_2\right) = 1, H_i = j\right\} - V_j$, for each $i$ and $j = 0, 1$.
Proof of Theorem 3.6
Observe that $\hat{\theta} \xrightarrow{P} \theta$ implies $\tau\left(Z_i; \hat{\theta}\right) \xrightarrow{D} \tau\left(Z_i; \theta\right)$ by Slutsky's theorem, and thus $\hat{T}_i \xrightarrow{D} T_i$ by the portmanteau lemma. Now consider that $\hat{T}_i \xrightarrow{D} T_i$ and $\hat{\psi} \xrightarrow{P} \psi$ imply that $\xi\left(\hat{T}_i, \hat{T}_{(i)}; \hat{\psi}\right) \xrightarrow{D} \xi\left(T_i, T_{(i)}; \psi\right)$ by Slutsky's theorem, and hence, by the portmanteau lemma,
$$I\left\{r_2\left(\hat{Z}_{(i)}; \hat{\theta}, \hat{\psi}, c_1, c_2\right) = 1\right\} \xrightarrow{D} I\left\{r_2\left(Z_{(i)}; \theta, \psi, c_1, c_2\right) = 1\right\}.$$
Lastly, as required, $\widehat{\mathrm{mFDR}}{}' \xrightarrow{D} \mathrm{mFDR}'$ by the continuous mapping theorem.
Proof of Lemma 3.1
If we suppose that $g\left(t_i, t_{(i)}; \psi^*\right) = g\left(t_i, t_{(i)}; \psi\right)$ for all $t_i$ and $t_{(i)}$, and $\psi^* \ne \psi$, then we have
$$\frac{g\left(1, t_{(i)}; \psi^*\right)}{g\left(0, t_{(i)}; \psi^*\right)} = \frac{g\left(1, t_{(i)}; \psi\right)}{g\left(0, t_{(i)}; \psi\right)},$$
which implies
$$\exp\left(a^* + b^*\,\eta\left(t_{(i)}\right)\right) = \exp\left(a + b\,\eta\left(t_{(i)}\right)\right).$$
Upon simplification of the expression above, we have
$$\left(a^* - a\right) + \eta\left(t_{(i)}\right)\left(b^* - b\right) = 0,$$
which only holds if $\psi^* = \psi$, or $\eta\left(t_{(i)}\right) = 0$. Thus, the result follows by contradiction.
3.9 Algorithm Details
MM Algorithm for Mixture Models
The MM algorithm for estimating $\theta$ proceeds by initializing the parameter vector with $\theta^{(0)}$. Upon letting $\theta^{(k)} = \left(\pi_0^{(k)}, \mu_0^{(k)}, \sigma_0^{(k)2}, \mu_1^{(k)}, \sigma_1^{(k)2}\right)^T$ be the estimate at the $k$th iteration, the $(k+1)$th iteration of the algorithm is conducted in two steps: the minorization (min-) and maximization (max-) step. In the min-step, $\tau_{i0}^{(k+1)} = \tau\left(z_i; \theta^{(k)}\right) = 1 - \tau_{i1}^{(k+1)}$ is computed for each $i$. The max-step then requires the evaluation of $\pi_0^{(k+1)} = n^{-1}\sum_{i=1}^n \tau_{i0}^{(k+1)}$, and
$$\mu_j^{(k+1)} = \frac{\sum_{i=1}^n \tau_{ij}^{(k+1)} z_i}{\sum_{i=1}^n \tau_{ij}^{(k+1)}} \quad \text{and} \quad \sigma_j^{(k+1)2} = \frac{\sum_{i=1}^n \tau_{ij}^{(k+1)}\left(z_i - \mu_j^{(k+1)}\right)^2}{\sum_{i=1}^n \tau_{ij}^{(k+1)}},$$
for each $j = 0, 1$.
The steps are iterated until convergence is achieved, whereupon the final vector of estimates is declared the MCMLE $\hat{\theta}$. Upon letting $\pi_1 = 1 - \pi_0$, the algorithm arises through applications of the minorizer
$$\log\left(\sum_{j=0}^1 A_{ij}\right) \ge \sum_{j=0}^1 \frac{B_{ij}}{\sum_{j'=0}^1 B_{ij'}}\,\log\left(\frac{\sum_{j'=0}^1 B_{ij'}}{B_{ij}}\,A_{ij}\right),$$
where $A_{ij} = \pi_j\,\phi\left(z_i; \mu_j, \sigma_j^2\right)$ and $\tau_{ij}^{(k+1)} = B_{ij}/\sum_{j'=0}^1 B_{ij'}$, with $B_{ij}$ equal to $A_{ij}$ evaluated at $\theta^{(k)}$, for each $i$ and $j$ (cf. Zhou and Lange (2010)). Because the algorithm is constructed via minorization, it must monotonically increase the CML at each iteration.
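To illustrate the min- and max-steps above, the following R sketch runs the two-component updates on simulated z-scores; the data and starting values are hypothetical, and the sketch omits the empirical-null and NC-FDR machinery of Section 3.2.

# Sketch: MM (EM-type) updates for a two-component Gaussian mixture of z-scores.
set.seed(3)
z <- c(rnorm(3000, -0.5, 0.9), rnorm(7000, 0.9, 0.85))   # hypothetical z-scores

# initial values theta^(0) = (pi0, mu0, sigma0^2, mu1, sigma1^2)
pi0 <- 0.5; mu <- c(-1, 1); sigma2 <- c(1, 1)

for (k in 1:200) {
  # min-step: responsibilities tau_i0 and tau_i1
  a0 <- pi0 * dnorm(z, mu[1], sqrt(sigma2[1]))
  a1 <- (1 - pi0) * dnorm(z, mu[2], sqrt(sigma2[2]))
  tau0 <- a0 / (a0 + a1)
  tau1 <- 1 - tau0
  # max-step: closed-form updates of pi0, mu_j and sigma_j^2
  pi0 <- mean(tau0)
  mu <- c(sum(tau0 * z) / sum(tau0), sum(tau1 * z) / sum(tau1))
  sigma2 <- c(sum(tau0 * (z - mu[1])^2) / sum(tau0),
              sum(tau1 * (z - mu[2])^2) / sum(tau1))
}
c(pi0 = pi0, mu0 = mu[1], sigma2_0 = sigma2[1], mu1 = mu[2], sigma2_1 = sigma2[2])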
Newton Algorithm for Spatial Model
The PL of equation (3.13) can be maximized indirectly by considering the logarithm
$$\log PL\left(\psi; \hat{t}_1, ..., \hat{t}_n\right) = \sum_{i=1}^n \log g\left(\hat{t}_i, \hat{t}_{(i)}; \psi\right) = a\sum_{i=1}^n \hat{t}_i + b\sum_{i=1}^n \hat{t}_i\,\eta\left(\hat{t}_{(i)}\right) - \sum_{i=1}^n \log\left[1 + \exp\left(a + b\,\eta\left(\hat{t}_{(i)}\right)\right)\right]. \quad (3.19)$$
Let $\nabla \log PL$ and $\nabla^2 \log PL$ be the gradient and Hessian of (3.19), respectively, with components
$$\nabla_a \log PL = \sum_{i=1}^n \hat{t}_i - \sum_{i=1}^n \bar{\xi}\left(\hat{t}_i, \hat{t}_{(i)}; \psi\right), \qquad \nabla_b \log PL = \sum_{i=1}^n \hat{t}_i\,\eta\left(\hat{t}_{(i)}\right) - \sum_{i=1}^n \eta\left(\hat{t}_{(i)}\right)\bar{\xi}\left(\hat{t}_i, \hat{t}_{(i)}; \psi\right),$$
$$\nabla^2_{a^2} \log PL = -\sum_{i=1}^n \bar{\xi}\left(\hat{t}_i, \hat{t}_{(i)}; \psi\right) + \sum_{i=1}^n \bar{\xi}^2\left(\hat{t}_i, \hat{t}_{(i)}; \psi\right),$$
$$\nabla^2_{ab} \log PL = -\sum_{i=1}^n \eta\left(\hat{t}_{(i)}\right)\bar{\xi}\left(\hat{t}_i, \hat{t}_{(i)}; \psi\right) + \sum_{i=1}^n \eta\left(\hat{t}_{(i)}\right)\bar{\xi}^2\left(\hat{t}_i, \hat{t}_{(i)}; \psi\right),$$
and
$$\nabla^2_{b^2} \log PL = -\sum_{i=1}^n \eta^2\left(\hat{t}_{(i)}\right)\bar{\xi}\left(\hat{t}_i, \hat{t}_{(i)}; \psi\right) + \sum_{i=1}^n \eta^2\left(\hat{t}_{(i)}\right)\bar{\xi}^2\left(\hat{t}_i, \hat{t}_{(i)}; \psi\right),$$
where $\bar{\xi}\left(t_i, t_{(i)}; \psi\right) = 1 - \xi\left(t_i, t_{(i)}; \psi\right)$.
Using an initial estimate $\psi^{(0)}$, equation (3.19) can be maximized through iterative computation of the Newton step
$$\psi^{(k+1)} = \psi^{(k)} - \left[\nabla^2 \log PL\left(\psi^{(k)}\right)\right]^{-1} \nabla \log PL\left(\psi^{(k)}\right),$$
where $\psi^{(k)} = \left(a^{(k)}, b^{(k)}\right)^T$ is the parameter estimate at the $k$th iteration. Upon convergence of the sequence of estimates, the final iterate is declared to be the MPLE $\hat{\psi}$.
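The Newton iteration can be written in a few lines of R. The sketch below uses simulated placeholders for the indicators $\hat{t}_i$ and the neighborhood summaries $\eta(\hat{t}_{(i)})$, and takes $\bar{\xi}$ to be the logistic function of $a + b\,\eta(\hat{t}_{(i)})$, as implied by the gradient and Hessian expressions above.

# Sketch: Newton-Raphson maximization of the log-pseudolikelihood (3.19)
# for psi = (a, b). 't' plays the role of the indicators t_i-hat and 'eta'
# of the neighborhood summaries eta(t_(i)-hat); both are simulated here.
set.seed(4)
n <- 5000
eta <- rbinom(n, 26, 0.5) / 26                     # hypothetical neighborhood summary
t <- rbinom(n, 1, plogis(-1 + 3 * eta))            # hypothetical indicators

psi <- c(a = 0, b = 0)
for (k in 1:50) {
  xi_bar <- plogis(psi["a"] + psi["b"] * eta)      # logistic mean, as in the gradients
  grad <- c(sum(t - xi_bar), sum(eta * (t - xi_bar)))
  w <- xi_bar * (1 - xi_bar)
  hess <- -rbind(c(sum(w), sum(eta * w)),
                 c(sum(eta * w), sum(eta^2 * w)))
  step <- solve(hess, grad)
  psi <- psi - step
  if (sum(step^2) < 1e-12) break
}
psi                                                 # MPLE of (a, b)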
Acknowledgment
The authors are grateful to Kaarin Anstey, Peter Butterworth, Helen Christensen, Simon Easteal, Patricia
Jacomb, Luke Lloyd-Jones, Karen Maxwell, Perminder Sachdev and Ian Wood for helpful discussions on various
aspects of the manuscript, and to the PATH interviewers. Furthermore, the authors are indebted to the reviewers
for providing many useful comments, which have led to numerous improvements in the manuscript.
Chapter 4
Laplace Mixture of Linear Experts
Citation
Nguyen, H. D. and McLachlan, G. J. (2015). Laplace mixture of linear experts. Computational Statistics and
Data Analysis, In press.
Abstract
Mixture of Linear Experts (MoLE) models provide a popular framework for modeling nonlinear regression
data. The majority of applications of MoLE models utilizes a Gaussian distribution for regression error. Such
assumptions are known to be sensitive to outliers. The use of a Laplace distributed error is investigated. This
model is named the Laplace MoLE (LMoLE). Links are drawn between the Laplace error model and the least
absolute deviations regression criterion, which is known to be robust among a wide class of criteria. Through
application of the minorization-maximization algorithm framework, an algorithm is derived that monotonically
increases the likelihood in the estimation of the LMoLE model parameters. It is proven that the maximum
likelihood estimator (MLE) for the parameter vector of the LMoLE is consistent. Through simulation studies,
the robustness of the LMoLE model over the Gaussian MoLE model is demonstrated, and support for the
consistency of the MLE is provided. An application of the LMoLE model to the analysis of a climate science
data set is described.
4.1 Introduction
Mixture of experts (MoE) models were first introduced in Jacobs et al. (1991) as a model for nonlinear regression
relationships; see Jordan and Jacobs (1994) and Section 5.12 of McLachlan and Peel (2000) for details. Since
their inception, the development in MoE research has been rich, and the framework has been successfully applied
to problems of clustering, classification, and regression in a variety of fields. A review of the current state of
the art can be found in Yuksel et al. (2012). The MoE framework can be defined as follows.
Let $Z \in \{1, ..., g\}$ be a categorical random variable such that
$$P(Z = i \mid v) = \pi_i(v; \alpha) = \begin{cases} \dfrac{\exp\left(\alpha_i^T v\right)}{1 + \sum_{i'=1}^{g-1} \exp\left(\alpha_{i'}^T v\right)} & \text{if } i = 1, ..., g - 1, \\[2ex] \dfrac{1}{1 + \sum_{i'=1}^{g-1} \exp\left(\alpha_{i'}^T v\right)} & \text{otherwise,} \end{cases} \quad (4.1)$$
for some covariate $v \in \mathbb{R}^p$, and let $Y \in \mathbb{R}$ be a random variable such that $Y \mid Z = i$ (for $i = 1, ..., g$) has density function $f_i(y \mid x)$, which we refer to as the component density, for some covariate $x \in \mathbb{R}^q$. Here, $T$ denotes matrix transposition and $\alpha = \left(\alpha_1^T, ..., \alpha_{g-1}^T\right)^T \in \mathbb{R}^{(g-1)p}$.
If one observes $Y$ without observing $Z$ (i.e. $Z$ is a latent variable), then the density function of $Y \mid v, x$ can be written as
$$f_Y(y \mid v, x) = \sum_{i=1}^g \pi_i(v; \alpha)\,f_i(y \mid x). \quad (4.2)$$
Density functions of form (4.2) are known as MoE models. In this article, we concentrate on the case where $Y \in \mathbb{R}$ and $E(Y \mid Z = i, x) = \beta_i^T x$, where $\beta_i \in \mathbb{R}^q$. We shall refer to such densities as mixture of linear experts (MoLE) models.
MoLE models have recently received strong interest from the computational statistics and neural computation
communities. For example, Wedel (2002) and Grun and Leisch (2008) considered MoLE densities for modeling
concomitant variables; Ingrassia et al. (2012) showed them to be related to cluster-weighted modeling; and Chamroukhi et al. (2009), Chamroukhi et al. (2010), and Same et al. (2011) applied them to fit, classify, and cluster time-series data, respectively.
Although a rich class, the current research in MoLE models has been restrictive in the sense that $f_i(y \mid x)$ is always considered to be Gaussian (i.e. $f_i(y \mid x) = \phi\left(y; \beta_i^T x, \sigma_i^2\right)$, where $\phi\left(y; \mu, \sigma^2\right)$ is the Gaussian density function with mean $\mu \in \mathbb{R}$ and variance $\sigma^2 \in (0, \infty)$). We will call this model the Gaussian MoLE (GMoLE). When misapplied, the Gaussian assumption is known to incur problems in mixture models (e.g. misspecification error and outlier sensitivity; see p. 221 of McLachlan and Peel (2000) for a brief discussion), which can often lead to incorrect inference. As a remedy to these problems, the mixture model literature has extended in scope to using robust generalizations of the Gaussian densities as component functions (e.g. Lee and McLachlan (2013) recently reviewed a variety of skewed generalizations of Gaussian mixture models). Outside of Gaussian generalizations, Jones and McLachlan (1990) have considered Laplace distribution components for modeling data that depart from the Gaussian assumption, and Franczak et al. (2014) have considered mixtures of asymmetric Laplace distributions for density estimation.
In the mixtures of linear regression context, Galimberti and Soffritti (2014), Ingrassia et al. (2014), and Yao
et al. (2014) have suggested the use of the t-distribution as an error model in various settings; and Song
et al. (2014) have considered the use of Laplace distributed errors. When g ≥ 2, the density function for the Laplace mixture from Song et al. (2014) can be expressed in the form of (4.2) by taking v = 1 and setting $f_i(y \mid x) = \lambda\left(y; \beta_i^T x, \xi_i\right)$, where
$$\lambda(y; \mu, \xi) = \frac{\exp\left(-|y - \mu|/\xi\right)}{2\xi} \quad (4.3)$$
is the Laplace density function with mean $\mu \in \mathbb{R}$ and scale parameter $\xi \in (0, \infty)$. Unlike MoLE models, the aforementioned mixtures of linear regression models do not allow for covariate dependencies in the component probabilities.
Considering the state of current research in MoLE models (with respect to distributional assumptions), we
believe that an extension of the Laplace mixture to the more general MoLE setting (i.e. model (4.2) with
$f_i(y \mid x) = \lambda\left(y; \beta_i^T x, \xi_i\right)$) is timely and pertinent. We name our new model the Laplace MoLE (LMoLE). The
following considerations regarding the model shall be discussed in this article.
Firstly, we will discuss maximum likelihood estimation (MLE) for the LMoLE model parameters. Like other
MoE models, MLE for LMoLE models cannot be conducted in closed form; as such, an iterative numerical
scheme is required for MLE. Unlike previous works (e.g. Jordan and Jacobs (1994) and Grun and Leisch
(2008)), we do not use an expectation-maximization (EM) algorithm for the task; see ? for a treatment on EM
algorithms. This is because EM algorithms require specialist knowledge of probabilistic characterizations in
order to express the iterative updates (e.g. Song et al. (2014) required a Gaussian scale mixture representation
to express their updates). Furthermore, to the best of our knowledge, all current EM algorithms for MLE of
MoE model parameters require a Newton or quasi-Newton update step (e.g. Jordan and Jacobs (1994) and
Grun and Leisch (2008), respectively), which can violate the usual monotonicity property of EM algorithms.
Instead of an EM algorithm, we suggest a monotonic iterative scheme using the minorization-maximization
(MM) algorithm framework; see Hunter and Lange (2004) for a concise introduction to MM algorithms. The
MM algorithms are attractive due to their use of analytic inequalities, rather than probabilistic characterizations,
in order to construct iterative schemes.
We then show the relationship between LMoLE and least absolute deviations (LAD) regression; treatments on
LAD and related regression methods can be found in Maronna et al. (2006) and Section 2.3 of Amemiya (1985).
Such a relationship indicates that LMoLE models should be more robust than GMoLE models, in the sense of
Huber and Ronchetti (2009, Ch. 7).
Next, we show that the maximum likelihood estimator (MLE) for the parameter vector of the LMoLE model
is consistent. Estimates for various quantities of interest are also given. We then use simulations to provide
empirical validation that the estimates appear to converge to their true values (as predicted by the consistency
of the MLE), and demonstrate situations whereby the LMoLE is robust in comparison to the GMoLE.
Lastly, we demonstrate the LMoLE model via an application to climate science data. Here, we describe an
analysis of temperature anomalies data from Hansen et al. (2001).
The article will proceed as follows. The LMoLE is defined, and the MM algorithm for its MLE is presented in
Section 4.2. The relationship between LMoLE and LAD regressions is also given here. Theoretical results and
derivations of quantities of interest are then given in Section 4.3. Empirical evidence of practical and theoretical
claims is provided via simulations in Section 4.4. A short application of LMoLE to climate data is described in
Section 4.5. Finally, conclusions are drawn in Section 4.6.
4.2 Model Description and Parameter Estimation
Using expressions (4.1), (4.2), and (4.3), the LMoLE density function can be written as
$$f_Y(y \mid v, x; \theta) = \sum_{i=1}^g \pi_i(v; \alpha)\,\lambda\left(y; \beta_i^T x, \xi_i\right), \quad (4.4)$$
where $\theta = \left(\alpha^T, \beta_1^T, ..., \beta_g^T, \xi_1, ..., \xi_g\right)^T$ is the vector containing the model parameters. For brevity, we shall put $\beta = \left(\beta_1^T, ..., \beta_g^T\right)^T$ and $\xi = \left(\xi_1, ..., \xi_g\right)^T$.
Let $Y_1, ..., Y_n$ be an i.i.d. random sample, with corresponding covariates $v_j$ and $x_j$ ($j = 1, ..., n$), arising from a distribution with density of form (4.4). The likelihood and log-likelihood functions for such a sample can be expressed as
$$L_n(\theta) = \prod_{j=1}^n \sum_{i=1}^g \pi_i\left(v_j; \alpha\right)\lambda\left(y_j; \beta_i^T x_j, \xi_i\right)$$
and
$$\log L_n(\theta) = \sum_{j=1}^n \log \sum_{i=1}^g \pi_i\left(v_j; \alpha\right)\lambda\left(y_j; \beta_i^T x_j, \xi_i\right), \quad (4.5)$$
respectively. We denote the MLE for a sample of size $n$ as $\hat{\theta}_n$, where $\hat{\theta}_n$ is a local maximizer of (4.5).
Notice from (4.5) that $\hat{\theta}_n$ cannot be expressed in closed form. As such, we now present an MM algorithm to compute $\hat{\theta}_n$ iteratively.
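Before turning to the algorithm, the following R sketch shows how the gating probabilities (4.1), the Laplace density (4.3), and the log-likelihood (4.5) can be evaluated; the function names (gate_probs, dlaplace, loglik_lmole) and argument layout are our own illustrative choices and not part of any package.

# Sketch: evaluating pi_i(v; alpha), lambda(y; mu, xi) and log L_n(theta)
# for an LMoLE with g components. 'alpha' is a p x (g-1) matrix, 'beta' a
# q x g matrix and 'xi' a length-g vector; these conventions are ours.
gate_probs <- function(V, alpha) {              # V is n x p
  eta <- cbind(exp(V %*% alpha), 1)             # unnormalized weights; last class is baseline
  eta / rowSums(eta)                            # n x g matrix of pi_i(v_j; alpha)
}
dlaplace <- function(y, mu, xi) exp(-abs(y - mu) / xi) / (2 * xi)

loglik_lmole <- function(y, V, X, alpha, beta, xi) {
  P <- gate_probs(V, alpha)                     # n x g gating probabilities
  M <- X %*% beta                               # n x g component means beta_i' x_j
  dens <- sapply(seq_along(xi), function(i) dlaplace(y, M[, i], xi[i]))
  sum(log(rowSums(P * dens)))                   # log L_n(theta), as in (4.5)
}
# Hypothetical usage with g = 2 and p = q = 2:
# x <- runif(100, -1, 1); V <- X <- cbind(1, x); y <- rnorm(100)
# loglik_lmole(y, V, X, alpha = matrix(c(0, 10), 2, 1),
#              beta = cbind(c(0, 1), c(0, -1)), xi = c(0.1, 0.1))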
4.2.1 MM Algorithm
Let $h(a)$ be a function to be maximized with respect to $a \in S$, for some set $S$, and let $b \in S$. If $h(a)$ cannot be maximized directly, then an MM algorithm can be constructed by finding another function $\eta(b; a)$ such that $h(a) = \eta(a; a)$ and $h(b) \ge \eta(b; a)$ when $a \ne b$; here, $\eta(b; a)$ is said to minorize $h(b)$. Upon finding an appropriate $\eta(b; a)$, and letting $a^{(k)}$ be the $k$th iterate of the algorithm, an MM algorithm can be defined by the update scheme
$$a^{(k+1)} = \arg\max_{b \in S}\,\eta\left(b; a^{(k)}\right). \quad (4.6)$$
By the relationship between $h(b)$ and $\eta(b; a)$, and by definition (4.6), the following inequality can be shown:
$$h\left(a^{(k+1)}\right) \ge \eta\left(a^{(k+1)}; a^{(k)}\right) \ge \eta\left(a^{(k)}; a^{(k)}\right) = h\left(a^{(k)}\right).$$
This implies that the sequence $h\left(a^{(k)}\right)$ is monotonically increasing, and thus the MM algorithm defined by update scheme (4.6) can be shown to converge to either a local maximum or a saddle point of $h(a)$, given an appropriate starting value $a^{(0)}$; see Hunter and Lange (2004) for further details.
The following proposition suggests a minorizer of (4.5) for use in computing θ̂n .
Proposition 4.1. The function $\log L_n(\theta)$ is minorized at $\theta^{(k)}$ by
$$\eta\left(\theta; \theta^{(k)}\right) = C + \eta_1\left(\alpha; \theta^{(k)}\right) + \eta_2\left(\beta, \xi; \theta^{(k)}\right), \quad (4.7)$$
where $C$ is independent of $\theta$, and where
$$\eta_1\left(\alpha; \theta^{(k)}\right) = \sum_{i=1}^g \sum_{j=1}^n \tau_{ij}^{(k+1)} \log \pi_i\left(v_j; \alpha\right), \quad (4.8)$$
$$\eta_2\left(\beta, \xi; \theta^{(k)}\right) = -\sum_{i=1}^g \sum_{j=1}^n \tau_{ij}^{(k+1)} \log \xi_i - \sum_{i=1}^g \sum_{j=1}^n \tau_{ij}^{(k+1)}\,\frac{1}{\xi_i}\left|y_j - \beta_i^T x_j\right|, \quad (4.9)$$
and
$$\tau_{ij}^{(k+1)} = \frac{\pi_i\left(v_j; \alpha^{(k)}\right)\lambda\left(y_j; \beta_i^{(k)T} x_j, \xi_i^{(k)}\right)}{\sum_{i'=1}^g \pi_{i'}\left(v_j; \alpha^{(k)}\right)\lambda\left(y_j; \beta_{i'}^{(k)T} x_j, \xi_{i'}^{(k)}\right)} = \tau_i\left(y_j, v_j, x_j; \theta^{(k)}\right). \quad (4.10)$$
Proof. The expression is derived via application of the minorizer
$$\log\left(\sum_{i=1}^g a_i\right) \ge \sum_{i=1}^g \frac{b_i}{\sum_{i'=1}^g b_{i'}}\,\log\left(\frac{\sum_{i'=1}^g b_{i'}}{b_i}\,a_i\right),$$
where $a_i = \pi_i\left(v_j; \alpha\right)\lambda\left(y_j; \beta_i^T x_j, \xi_i\right)$ and $\tau_{ij}^{(k+1)} = b_i/\sum_{i'=1}^g b_{i'}$.
Note that (4.8) and (4.9) are dependent on disjoint subsets of the elements of θ. As such, (4.7) can be
maximized by considering (4.8) and (4.9) separately. The following propositions suggest minorizers for the
noted components.
Proposition 4.2. For each $i = 1, ..., g - 1$, let
$$\eta_{1i}\left(\alpha_i; \theta^{(k)}\right) = \sum_{i'=1}^g \sum_{j=1}^n \tau_{i'j}^{(k+1)} \log \pi_{i'}\left(v_j; \alpha_{(i)}\right), \quad (4.11)$$
where $\alpha_{(i)} = \left(\alpha_1^{*T}, ..., \alpha_{i-1}^{*T}, \alpha_i^T, \alpha_{i+1}^{*T}, ..., \alpha_{g-1}^{*T}\right)^T$ and $\alpha_{i'}^*$ is a constant vector for each $i' \ne i$. The function $\eta_{1i}\left(\alpha_i; \theta^{(k)}\right)$ is minorized by
$$\eta_{1i}^*\left(\alpha_i; \theta^{(k)}\right) = C + \delta_i^{(k+1)T}\left(\alpha_i - \alpha_i^{(k)}\right) - \frac{1}{8}\left(\alpha_i - \alpha_i^{(k)}\right)^T \Delta\left(\alpha_i - \alpha_i^{(k)}\right), \quad (4.12)$$
where $C$ is independent of $\theta$,
$$\delta_i^{(k+1)} = \sum_{j=1}^n \left[\tau_{ij}^{(k+1)} - \pi_i\left(v_j; \alpha_{(i)}^{(k)}\right)\right] v_j, \qquad \Delta = \sum_{j=1}^n v_j v_j^T,$$
and $\alpha_{(i)}^{(k)} = \left(\alpha_1^{*T}, ..., \alpha_{i-1}^{*T}, \alpha_i^{(k)T}, \alpha_{i+1}^{*T}, ..., \alpha_{g-1}^{*T}\right)^T$.
Proof. Consider
$$\frac{\partial \eta_{1i}}{\partial \alpha_i} = \sum_{j=1}^n \left[\tau_{ij}^{(k+1)} - \pi_i\left(v_j; \alpha_{(i)}\right)\right] v_j$$
and
$$\frac{\partial^2 \eta_{1i}}{\partial \alpha_i\,\partial \alpha_i^T} = -\sum_{j=1}^n \pi_i\left(v_j; \alpha_{(i)}\right)\left[1 - \pi_i\left(v_j; \alpha_{(i)}\right)\right] v_j v_j^T,$$
for each $i = 1, ..., g - 1$. Since $\pi_i\left(v_j; \alpha_{(i)}\right)\left[1 - \pi_i\left(v_j; \alpha_{(i)}\right)\right] \le 1/4$,
$$-4^{-1}\Delta - \frac{\partial^2 \eta_{1i}}{\partial \alpha_i\,\partial \alpha_i^T}$$
is negative-semidefinite. Because $\Delta$ is positive-definite, except in pathological cases, we get the result by application of the bounded-curvature quadratic minorization inequality (see Hunter and Lange (2004)).
Proposition 4.3. The function $\eta_2\left(\beta, \xi; \theta^{(k)}\right)$ is minorized by
$$\eta_2^*\left(\beta, \xi; \theta^{(k)}\right) = -\sum_{i=1}^g \sum_{j=1}^n \tau_{ij}^{(k+1)} \log \xi_i - \sum_{i=1}^g \frac{1}{2\xi_i} \sum_{j=1}^n \tau_{ij}^{(k+1)} w_{ij}^{(k+1)} - \sum_{i=1}^g \frac{1}{2\xi_i} \sum_{j=1}^n \tau_{ij}^{(k+1)}\,\frac{\left(y_j - \beta_i^T x_j\right)^2}{w_{ij}^{(k+1)}}, \quad (4.13)$$
where $w_{ij}^{(k+1)} = \left|y_j - \beta_i^{(k)T} x_j\right|$.
Proof. The expression is derived via application of the minorizer
$$-\sqrt{a} \ge -\sqrt{b} - \frac{a - b}{2\sqrt{b}},$$
where $a = \left(y_j - \beta_i^T x_j\right)^2$ and $b = \left(y_j - \beta_i^{(k)T} x_j\right)^2$.
Propositions 4.1–4.3 suggest the following MM algorithm for maximizing (4.5).
Let $\theta^{(0)}$ be an initial parameter vector. At the $(k+1)$th iteration, the algorithm proceeds through two steps: the min (minorization)-step and the max (maximization)-step. In the min-step, we compute $\tau_{ij}^{(k+1)}$ and $w_{ij}^{(k+1)}$ for each $i$ and $j$, as per Propositions 4.1 and 4.3, respectively. We split the max-step into two phases: the $\alpha$ phase and the $\beta$ (and $\xi$) phase.
In the $\alpha$ phase, at each $i = 1, ..., g - 1$ (in order), we set $\alpha_{i'}^* = \alpha_{i'}^{(k+1)}$ for $i' < i$ and $\alpha_{i'}^* = \alpha_{i'}^{(k)}$ for $i' > i$ (here, $\alpha_{i'}^*$ is as in Proposition 4.2), and make the update
$$\alpha_i^{(k+1)} = 4\Delta^{-1}\delta_i^{(k+1)} + \alpha_i^{(k)}. \quad (4.14)$$
In the $\beta$ phase, we make the updates
$$\beta_i^{(k+1)} = \left(\sum_{j=1}^n \frac{\tau_{ij}^{(k+1)}}{w_{ij}^{(k+1)}}\,x_j x_j^T\right)^{-1} \sum_{j=1}^n \frac{\tau_{ij}^{(k+1)}}{w_{ij}^{(k+1)}}\,y_j x_j \quad (4.15)$$
and
$$\xi_i^{(k+1)} = \frac{\sum_{j=1}^n \tau_{ij}^{(k+1)}\left[\dfrac{w_{ij}^{(k+1)}}{2} + \dfrac{\left(y_j - \beta_i^{(k+1)T} x_j\right)^2}{2\,w_{ij}^{(k+1)}}\right]}{\sum_{j=1}^n \tau_{ij}^{(k+1)}}, \quad (4.16)$$
for each $i$.
Updates (4.14), (4.15), and (4.16) can be shown to maximize the functions (4.9) and (4.11). Additionally, the sequence of $\alpha$ phase updates can be viewed as a block-ascent, and thus can be shown not to decrease (4.8) at each iteration. Further details can be found in Appendix A.
The algorithm is iterated until convergence is reached, where convergence can be defined as either
$$\log L_n\left(\theta^{(k+1)}\right) - \log L_n\left(\theta^{(k)}\right) < \epsilon$$
or
$$\frac{\log L_n\left(\theta^{(k+1)}\right) - \log L_n\left(\theta^{(k)}\right)}{\left|\log L_n\left(\theta^{(k)}\right)\right|} < \epsilon, \quad (4.17)$$
where $\epsilon > 0$ is a small constant.
Upon convergence, we declare the final iterate the MLE $\hat{\theta}_n$. Due to the monotonicity property of the MM algorithm (as noted above), it follows that, given an appropriate $\theta^{(0)}$, $\hat{\theta}_n \to \hat{\theta}_n^*$ as $\epsilon \to 0$, where $\hat{\theta}_n^*$ is a local maximum of (4.5). We choose to use criterion (4.17) due to its scale invariance with respect to the size of the likelihood, and we set $\epsilon = 10^{-5}$ throughout this article; see Section 11.5 of Lange (2013) for a discussion of convergence criteria.
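A minimal R sketch of one $(k+1)$th iteration, using the quantities of Propositions 4.1–4.3 and updates (4.14)–(4.16), is given below. It reuses the hypothetical helpers gate_probs, dlaplace and loglik_lmole sketched earlier, and is an illustration of the update formulas rather than a full implementation.

# Sketch: one MM iteration (min-step then max-step) for the LMoLE.
mm_iteration <- function(y, V, X, alpha, beta, xi) {
  n <- length(y); g <- length(xi)
  # min-step: tau_ij^(k+1) as in (4.10) and w_ij^(k+1) as in Proposition 4.3
  P <- gate_probs(V, alpha)
  M <- X %*% beta
  dens <- sapply(seq_len(g), function(i) dlaplace(y, M[, i], xi[i]))
  tau <- (P * dens) / rowSums(P * dens)
  w <- pmax(abs(y - M), 1e-8)                    # |y_j - beta_i^(k)T x_j|, guarded away from zero
  # max-step, alpha phase: update (4.14), cycling over i = 1, ..., g - 1
  Delta_inv <- solve(crossprod(V))
  for (i in seq_len(g - 1)) {
    delta <- crossprod(V, tau[, i] - gate_probs(V, alpha)[, i])
    alpha[, i] <- alpha[, i] + 4 * Delta_inv %*% delta
  }
  # max-step, beta and xi phase: updates (4.15) and (4.16)
  for (i in seq_len(g)) {
    u <- tau[, i] / w[, i]
    beta[, i] <- solve(crossprod(X * sqrt(u)), crossprod(X, u * y))
    r2 <- (y - X %*% beta[, i])^2
    xi[i] <- sum(tau[, i] * (w[, i] / 2 + r2 / (2 * w[, i]))) / sum(tau[, i])
  }
  list(alpha = alpha, beta = beta, xi = xi,
       loglik = loglik_lmole(y, V, X, alpha, beta, xi))
}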
An appropriate $\theta^{(0)}$ can be obtained via a modified version of the randomized initial assignment method described in Section 2.12.2 of McLachlan and Peel (2000). This initialization is conducted by first uniformly generating $n$ random integers $s_1, ..., s_n \in \{1, ..., g\}$. Next, for each $i$ and $j$, set $\tau_{ij}^{(0)} = 1$ if $i = s_j$ and 0 otherwise, and set $w_{ij}^{(0)} = 1$. We then set $\alpha^{(0)} = 0$, and compute $\beta^{(0)}$ and $\xi^{(0)}$ via the $\beta$ and $\xi$ phases of the max-step, to obtain the initial parameter vector $\theta^{(0)}$. The log-likelihood under the initialization, $\log L_n\left(\theta^{(0)}\right)$, can then be evaluated.
This process can be repeated multiple times, whereupon the initialization with the greatest log-likelihood is used as the best available initialization. A similar protocol is used to identify appropriate starting parameters for GMoLEs in Grun and Leisch (2008). In this article, we choose to repeat the process ten times.
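The randomized initialization can be sketched similarly; each call below produces one candidate $\theta^{(0)}$, and, over several repeats, the candidate with the greatest log-likelihood would be retained. The function name and numerical details are our own illustrative choices.

# Sketch: one randomized initialization theta^(0) for the LMoLE.
random_init <- function(y, V, X, g) {
  n <- length(y); p <- ncol(V); q <- ncol(X)
  s <- sample.int(g, n, replace = TRUE)          # random component assignments
  tau0 <- outer(s, seq_len(g), "==") * 1         # tau_ij^(0) = 1 if i = s_j
  w0 <- matrix(1, n, g)                          # w_ij^(0) = 1
  alpha0 <- matrix(0, p, g - 1)                  # alpha^(0) = 0
  beta0 <- matrix(0, q, g); xi0 <- numeric(g)
  for (i in seq_len(g)) {                        # beta and xi phases of the max-step
    u <- tau0[, i] / w0[, i]
    beta0[, i] <- solve(crossprod(X * sqrt(u)), crossprod(X, u * y))
    r2 <- (y - X %*% beta0[, i])^2
    xi0[i] <- sum(tau0[, i] * (w0[, i] / 2 + r2 / (2 * w0[, i]))) / sum(tau0[, i])
  }
  list(alpha = alpha0, beta = beta0, xi = xi0)
}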
We conclude this section with the following two remarks: firstly, to the best of our knowledge, our MM algorithm is the first provably monotonic algorithm for any MoE of form (4.2). Secondly, the updates for $\tau_{ij}^{(k+1)}$, $\beta_i^{(k+1)}$, and $\xi_i^{(k+1)}$ bear strong resemblance to their corresponding quantities in the EM algorithm of Song et al. (2014).

4.2.2 Relationship to Least Absolute Deviations Regression
Let
$$\mathrm{LAD}(\beta_1) = \sum_{j=1}^n \left|y_j - \beta_1^T x_j\right| \quad (4.18)$$
be the LAD regression criterion. The relationship between (4.18) and (4.5) was alluded to in Song et al. (2014). We will now make a more formal connection and extend upon their observations.
Suppose that $g$ is fixed, $\beta_i = \beta_1$, and $\xi_i = \xi_1$ is a constant, for each $i$ (in (4.5)). For any $v_j$, it is simple to show that (4.5) can be expressed as
$$C_1 - C_2\,\mathrm{LAD}(\beta_1),$$
where $C_1$ and $C_2$ are independent of $\beta_1$. As such, maximizing (4.5) is equivalent to minimizing (4.18) with respect to $\beta_1$.
It is well known that LAD regression is most robust in designed experiments (i.e. LAD has the highest maximum breakdown point; see Ch. 7 of Huber and Ronchetti (2009)). This implies that, under unfavorable conditions, LAD regression is less influenced by outliers than other regression methods (e.g. least-squares regression). Furthermore, simulation studies (e.g. Forsythe (1972)) suggest that LAD is preferable to least-squares in the presence of outliers. Since it can be shown that the MLE of the GMoLE (when $g$ is fixed, $\beta_i = \beta_1$, and $\sigma_i^2 = \sigma_1^2$ is a constant) is equivalent to least-squares regression, we can infer that LMoLE models are more robust than GMoLE models, at least in this restricted case.
Furthermore, we can show that the maximization of (4.9) with respect to $\beta_i$ (holding all else fixed, at the $(k+1)$th iteration of the MM algorithm) is equivalent to minimizing the weighted-LAD criterion
$$\mathrm{WLAD}(\beta_i) = \sum_{j=1}^n \omega_j \left|y_j - \beta_i^T x_j\right|,$$
where $\omega_j = \xi_i^{-1}\tau_{ij}^{(k+1)}$. Here, the weights $\omega_j$ can be interpreted as local-relevance factors. Upon inspecting the EM algorithm for GMoLE models (e.g. that of Chamroukhi et al. (2009)), we see that the update for $\beta_i$ at the $(k+1)$th iteration is equivalent to minimizing a weighted least-squares criterion (again, holding all else constant, and with weights corresponding to the local relevance of observations). As such, we infer that LMoLE models may be more robust than GMoLE models in the unrestricted setting as well; for further details regarding weighted regressions, see Maronna et al. (2006).
4.3 Statistical Inference
4.3.1 Identifiability
At the core of much of statistical inference is the notion of identifiability. It is well known that mixture models
are not absolutely identifiable in general. However, in some mixture model settings, it is possible to establish a
weaker sense of identifiability (i.e. ordered identifiability or identifiability up to a permutation); see Section 3.1
of Titterington et al. (1985) for details.
In the case of MoE models, Jiang and Tanner (1999b) showed that ordered and initialized identifiability can be
established in the class of irreducible MoE models with location-scale parametrized components. Here, in the
context of LMoLE models, irreducible implies that if i 6= i0 , then either βi 6= βi0 or ξi 6= ξi0 ; initialized implies
that the component probabilities are as in (4.1); and ordered implies that there exist some absolute ordering
T
T
T
relation ≺, such that β1T , ξ1T
≺ β2T , ξ2T
≺ ... ≺ βgT , ξgT .
By applying Lemma 2 of Jiang and Tanner (1999b), the ordered and initialized identifiability of LMoLE models
can be established via the validation of the following nondegeneracy condition.
A1 The set {λ (y; µ1 , ξ1 ) , ..., λ (y; µ2g , ξ2g )} contains 2g linearly independent functions of y, for any 2g distinct
pairs (µi , ξi ) ∈ R × (0, ∞), for i = 1, ..., 2g.
Theorem 3 of Holzmann et al. (2006) can be used to show the identifiability up to a permutation for irreducible
finite mixtures of Laplace distributions; see Example 3 of the same article for details. The result can be shown
to be equivalent to condition A1 via the main theorem of Yakowitz and Spragins (1968). Thus, we have the
following result via Lemma 2 of Jiang and Tanner (1999b).
Proposition 4.4. Any irreducible, ordered, and initialized LMoLE is identifiable.
4.3.2 Consistency of the MLE
It is known that both the LAD criterion and the MLE of MoLE models lack unique or globally optimal solutions
(i.e. the LAD solution is generally not unique, and MoLE models have unbounded likelihood functions). As
such, we would expect the LMoLE to suffer from the same issues, as well as issues induced by the lack of
absolute identifiability.
In this situation, the consistency of the global maximizer cannot be sought, since (4.5) is unbounded. However,
the existence of a consistent local maximizer can still be proven. Such a proof can be given as a consequence of
the following modification of Theorem 4.1.2 from Amemiya (1985).
Lemma 4.1. Let $h_n(a; Y)$ be a measurable function of the random variable $Y$, indexed by $n$, with parameter vector $a \in \mathbb{A}$, and let $\mathbb{A}_n$ be the set of all strict local maximizers of $h_n(a; Y)$ with respect to $a$. Upon letting $a^0$ be the true value of $a$, and making Assumptions B1–B4 (from Appendix B), we have the following result. For any $\epsilon > 0$,
$$\lim_{n \to \infty} P\left(\inf_{a \in \mathbb{A}_n}\left(a - a^0\right)^T\left(a - a^0\right) > \epsilon\right) = 0.$$
Here we take $\mathbb{A}_n = \{0\}$ if there are no strict local maxima of $h_n(a; Y)$.
Proof. Let $N$ be a compact subset of $N_1 \cap N_2$. If we denote $a_n^*$ to be the value of $a$ that globally maximizes $h_n(a; Y)$ in $N$, then by Assumptions B1–B3, Theorem 4.1.1 of Amemiya (1985) states that $a_n^*$ is consistent. Assumption B4 then implies that, as $n \to \infty$, the probability that $n^{-1}h_n(a; Y)$ attains a strict local maximum at $a_n^*$ approaches one. Therefore, we can conclude that $\lim_{n \to \infty} P\left(a_n^* \in \mathbb{A}_n\right) = 1$. The result then follows.
Upon assuming an appropriate probability distribution for V and X (i.e. the probability density function
fV X (v, x) is bona fide) it is not difficult to validate the assumptions of Lemma 4.1 for the maximization of
(4.5). This yields the following theorem.
Theorem 4.1. Let $\theta^0$ be the true value of $\theta$, and assume that $\theta^0$ is a strict local maximizer of $E \log f_Y(Y \mid V, X; \theta)$ (here, $f_Y(y \mid v, x; \theta)$ is as defined in (4.4)). Let $\Theta_n$ be the set of all strict local maximizers of $\log L_n(\theta)$ (as defined in (4.5)). Then we have the following result. For any $\epsilon > 0$,
$$\lim_{n \to \infty} P\left(\inf_{\theta \in \Theta_n}\left(\theta - \theta^0\right)^T\left(\theta - \theta^0\right) > \epsilon\right) = 0.$$
Here we take $\Theta_n = \{0\}$ if there are no strict local maxima of $\log L_n(\theta)$.
Proof. Assumptions B1 and B2 hold for hn = log Ln and a = θ. By a uniform law of large numbers (e.g.
Theorem 4.2.2 of Amemiya (1985)) it can be shown that n−1 log Ln (θ) approaches E log fY (Y |V , X; θ) in a
neighborhood N2 , thus fulfilling B3. Lastly, B4 is validated by the Theorem’s statement. The result then follows
from an application of Lemma 4.1.
Theorem 4.1 implies that the set of strict local maximizers of (4.5) will contain the true parameter, as n gets
large. This is a good result considering the capacity of the MM algorithm to search for local maxima given
good initial values θ (0) , and considering the existence of multiple local maxima of (4.5).
We note that a similar theorem can be stated for the case when v and x are nonstochastic. Such a result
requires Assumption B3 be made explicitly.
4.3.3 Estimation of Quantities of Interest
Given the MLE $\hat{\theta}_n$, we can estimate the component-specific means and variances by $E_{\hat{\theta}_n}(Y \mid Z = i, x) = \hat{\beta}_{n,i}^T x$ and $V_{\hat{\theta}_n}(Y \mid Z = i, x) = 2\hat{\xi}_{n,i}^2$, respectively. The component probability can also be estimated by $P_{\hat{\theta}_n}(Z = i \mid v) = \pi_i\left(v; \hat{\alpha}_n\right)$, for each $i$.
The component-specific estimates can then be combined to compute the LMoLE mean and variance estimates
$$E_{\hat{\theta}_n}(Y \mid v, x) = \sum_{i=1}^g \pi_i\left(v; \hat{\alpha}_n\right)\hat{\beta}_{n,i}^T x \quad (4.19)$$
and
$$V_{\hat{\theta}_n}(Y \mid v, x) = \sum_{i=1}^g \pi_i\left(v; \hat{\alpha}_n\right)\left[\hat{\beta}_{n,i}^T x - E_{\hat{\theta}_n}(Y \mid v, x)\right]^2 + 2\sum_{i=1}^g \pi_i\left(v; \hat{\alpha}_n\right)\hat{\xi}_{n,i}^2, \quad (4.20)$$
respectively.
We can also estimate the a posteriori probability of component membership, given observed values $y$, $v$, and $x$, by
$$P_{\hat{\theta}_n}(Z = i \mid y, v, x) = \tau_i\left(y, v, x; \hat{\theta}_n\right).$$
This expression is a consequence of Bayes' rule, and can be applied in model-based clustering; see Section 1.15 of McLachlan and Peel (2000) for details regarding the relationship between mixture models and the clustering problem. Note that each of the aforementioned estimates is continuous in $\theta$ and thus consistent, in the sense that each quantity approaches its respective value with $\theta^0$ replacing $\hat{\theta}_n$, as $n \to \infty$ (under the assumption that $\hat{\theta}_n$ is consistent, and by application of the continuous mapping theorem). We will refer to the functions evaluated at $\theta^0$ as the true functions.
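The estimates (4.19) and (4.20) and the posterior membership probabilities are direct to compute from a fitted parameter vector, as the following R sketch (using the hypothetical helpers from Section 4.2) illustrates.

# Sketch: mean (4.19), variance (4.20) and posterior membership probabilities
# at given covariate values, for fitted alpha, beta and xi.
lmole_moments <- function(V, X, alpha, beta, xi) {
  P <- gate_probs(V, alpha)                  # pi_i(v; alpha_n-hat)
  M <- X %*% beta                            # beta_n,i-hat' x
  mean_fn <- rowSums(P * M)                  # (4.19)
  var_fn <- rowSums(P * (M - mean_fn)^2) + 2 * drop(P %*% xi^2)   # (4.20)
  list(mean = mean_fn, variance = var_fn)
}
posterior_probs <- function(y, V, X, alpha, beta, xi) {
  P <- gate_probs(V, alpha)
  dens <- sapply(seq_along(xi), function(i) dlaplace(y, (X %*% beta)[, i], xi[i]))
  (P * dens) / rowSums(P * dens)             # tau_i(y, v, x; theta_n-hat)
}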
4.3.4 Determining the Number of Components
In all preceding discussions, it has been assumed that the number of components of the LMoLE (i.e. g) is fixed.
It is not possible to determine an unknown g within the MLE procedure described in Section 4.2.1. Due to the
computational demands of estimating LMoLE model parameters, information-theoretic criteria tend to be more
appropriate than hypothesis testing techniques, which often require resampling to conduct; see Section 6.3.1 of
McLachlan and Peel (2000) for a discussion of this issue in the mixture model context.
As per Chapter 6 of McLachlan and Peel (2000), we propose the use of the general Akaike information criterion
(AIC) (Akaike, 1974) and Bayesian information criterion (BIC) (Schwarz, 1978), and the mixture model specific
classification likelihood criterion (CLC) (Biernacki and Govaert, 1997) and integrated CLC (ICL) (Biernacki
et al., 2000).
Suppose that g0 ∈ {γ1 , ..., γm } is the true value of g. For each model l = 1, ..., m, we fit an LMoLE with γl
components and compute its MLE θ̂(l)n . The AIC, BIC, CLC, and ICL for each l can then be given as
$$\mathrm{AIC}(l) = -2\log L_n\left(\hat{\theta}_{(l)n}\right) + 2\left[\gamma_l(1 + p + q) - p\right],$$
$$\mathrm{BIC}(l) = -2\log L_n\left(\hat{\theta}_{(l)n}\right) + \log n\left[\gamma_l(1 + p + q) - p\right],$$
$$\mathrm{CLC}(l) = -2\log L_n\left(\hat{\theta}_{(l)n}\right) + 2\,\mathrm{EN}\left(\hat{\theta}_{(l)n}\right),$$
and
$$\mathrm{ICL}(l) = -2\log L_n\left(\hat{\theta}_{(l)n}\right) + \log n\left[\gamma_l(1 + p + q) - p\right] + 2\,\mathrm{EN}\left(\hat{\theta}_{(l)n}\right),$$
respectively, where
$$\mathrm{EN}(\theta) = -\sum_{i=1}^g \sum_{j=1}^n \tau_i\left(y_j, v_j, x_j; \theta\right)\log \tau_i\left(y_j, v_j, x_j; \theta\right)$$
and $\gamma_l(1 + p + q) - p$ is the total number of parameter elements in model $l$. Under all four criteria, the number of components is selected by setting $g = \gamma_{\hat{l}}$ using the rule
$$\hat{l} = \arg\min_{l = 1, ..., m}\,\mathrm{IC}(l), \quad (4.21)$$
where any of the four criteria can be substituted in the place of IC.
There is some empirical evidence (e.g. Section 6.11 of McLachlan and Peel (2000)) to show that the BIC and
ICL outperform the other criteria in the mixture model context, and simulations show that the BIC and ICL
show similar performance in GMoLE models; see Grun and Leisch (2007). Furthermore, theoretical studies have
found that under certain conditions, the BIC tends to correctly select the number of components in mixture
models (i.e. γl̂ tends towards g0 ); for example, see Keribin (2000).
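Given a fitted model, the four criteria can be computed directly. The R sketch below assumes the hypothetical loglik_lmole and posterior_probs helpers from the earlier sketches, and a list fit containing alpha, beta and xi.

# Sketch: AIC, BIC, CLC and ICL for a fitted LMoLE with gamma_l components.
info_criteria <- function(y, V, X, fit) {
  g <- length(fit$xi); p <- ncol(V); q <- ncol(X); n <- length(y)
  ll <- loglik_lmole(y, V, X, fit$alpha, fit$beta, fit$xi)
  tau <- posterior_probs(y, V, X, fit$alpha, fit$beta, fit$xi)
  en <- -sum(tau * log(pmax(tau, 1e-300)))        # entropy term EN(theta), guarded against log(0)
  d <- g * (1 + p + q) - p                        # number of free parameters
  c(AIC = -2 * ll + 2 * d,
    BIC = -2 * ll + log(n) * d,
    CLC = -2 * ll + 2 * en,
    ICL = -2 * ll + log(n) * d + 2 * en)
}
# Rule (4.21): fit models for each candidate g and keep the one minimizing the
# chosen criterion, e.g.
# which.min(sapply(fits, function(f) info_criteria(y, V, X, f)["BIC"]))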
4.4 Simulation Studies
To demonstrate our claims, we performed three sets of simulations for the following purposes: to provide
empirical evidence on the consistency of the MLE in small samples, to demonstrate the robustness of LMoLE
versus GMoLE models, and to assess which of the information criteria is most appropriate for choosing the
number of components in the LMoLE context. We report on the findings of our simulations in the order of
listing.
4.4.1 Consistency of the MLE
We simulated n = 64, 128, 256, 512, 1024 observations $y_j$ from a g = 2 component LMoLE with parameter $\theta^0$, as per model (4.4), where the parameter components are $\alpha_1^0 = (0, 10)^T$, $\beta_1^0 = (0, 1)^T$, $\beta_2^0 = (0, -1)^T$, and $\xi_1^0 = \xi_2^0 = 0.1$. Here $j = 1, ..., n$, and the covariates are simulated such that $v_j = x_j = (1, x_j)^T$, where $x_j$ is simulated uniformly over the interval (−1, 1). For each sample of n observations, we compute the MLE $\hat{\theta}_n$ of model (4.4) with g = 2, using the MM algorithm as described in Section 4.2.1.
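For reference, a short R sketch of this data-generating process is given below; Laplace variates are drawn as differences of exponentials, and the parameter values are those listed above.

# Sketch: simulate n observations from the g = 2 LMoLE of Section 4.4.1.
rlaplace <- function(n, mu, xi) mu + rexp(n, 1 / xi) - rexp(n, 1 / xi)  # Laplace(mu, xi)
simulate_lmole <- function(n) {
  x <- runif(n, -1, 1)
  v <- cbind(1, x)                                  # v_j = x_j = (1, x_j)'
  p1 <- exp(v %*% c(0, 10)) / (1 + exp(v %*% c(0, 10)))  # pi_1(v; alpha_1^0), alpha_1^0 = (0, 10)'
  z <- ifelse(runif(n) < as.vector(p1), 1, 2)       # latent component labels
  beta <- cbind(c(0, 1), c(0, -1))                  # beta_1^0 and beta_2^0
  y <- rlaplace(n, mu = rowSums(v * t(beta)[z, ]), xi = 0.1)
  data.frame(y = y, x = x)
}
set.seed(6)
dat <- simulate_lmole(512)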
To demonstrate the consistency of the MLE in this particular case, we generated 1000 trials of the simulation for each n. At each of the simulations, we computed the squared error (SE) $\mathrm{SE}_{n,l} = \left(\hat{\theta}_{n,l} - \theta_l^0\right)^2$, where $\theta_l$ is the $l$th component of $\theta$, for $l = 1, ..., g(1 + p + q) - p$. We then average over the 1000 simulation trials for each n, to get the estimated mean SE ($\widehat{\mathrm{MSE}}_{n,l}$) for each component. The results are presented in Table 4.1.
Table 4.1: Estimated mean squared errors (i.e. $\widehat{\mathrm{MSE}}_{n,l}$) between each component of $\hat{\theta}_n$ and $\theta^0$, for each n = 64, 128, 256, 512, 1024. Here, $\theta^0$ is as given in Section 4.4.1. Each $\widehat{\mathrm{MSE}}_{n,l}$ is obtained using 1000 trials.

n      α_10^0        α_11^0        β_10^0        β_11^0        β_20^0        β_21^0        ξ_1^0         ξ_2^0
64     2.31 × 10^0   4.46 × 10^1   2.67 × 10^-3  7.35 × 10^-3  2.64 × 10^-3  7.53 × 10^-3  3.91 × 10^-4  3.48 × 10^-4
128    6.36 × 10^-1  11.5 × 10^1   1.08 × 10^-3  3.00 × 10^-3  1.99 × 10^-3  2.73 × 10^-3  1.74 × 10^-4  1.87 × 10^-4
256    2.45 × 10^-1  3.90 × 10^0   4.58 × 10^-4  1.29 × 10^-3  4.24 × 10^-4  1.10 × 10^-3  8.96 × 10^-5  8.50 × 10^-5
512    8.96 × 10^-2  1.41 × 10^0   2.09 × 10^-4  5.96 × 10^-4  1.99 × 10^-4  5.55 × 10^-4  4.59 × 10^-5  4.75 × 10^-5
1024   4.19 × 10^-2  7.43 × 10^-1  9.56 × 10^-5  2.65 × 10^-4  9.75 × 10^-5  2.68 × 10^-4  2.26 × 10^-5  2.07 × 10^-5

Observing Table 4.1, we notice that $\widehat{\mathrm{MSE}}_{n,l}$ is monotonically decreasing as n increases, for each element of the parameter vector. We believe this to be in support of the conclusions of Theorem 4.1. Furthermore, it can be observed that the $\widehat{\mathrm{MSE}}_{n,l}$ of each parameter element is approximately halved for each doubling in n.
In addition to the results from Table 4.1, we plotted the estimated mean function and estimated component
mean functions for an instance of the n = 512 scenario, along with the respective true functions, in Figure 4.1.
In Figure 4.2, we plotted the estimated variance function (i.e. (4.20)) and component probability π1 (v; α̂n ),
along with the respective true functions. The estimated functions from Figures 4.1 and 4.2 closely resemble
their true counterparts. This provides further evidence towards the consistency of the MLE, in small samples.
4.4.2 Robustness of the LMoLE
To examine the robustness of the LMoLE versus the GMoLE, we performed simulation experiments in two
cases: the Laplace case and the Gaussian case. In the Laplace case, we simulated n = 512 observations,
whereby observations were generated in the same way as in Section 4.4.1 with probability 1 − c and from
an outlier class with probability c. Here, we set c = 0, 0.01, ..., 0.05. Outliers are generated by simulating x
T
uniformly over the interval (−1, 1), and setting v = x = (1, x) and y = −2.
In the Gaussian case, we simulated n = 512 observations, where observations from a g = 2 component GMoLE
with α and β exactly the same as in the Laplace case are simulated with probability 1 − c, but with σ12 =
σ22 = 2 × 0.12 (this is done to match the component variances to those from the LMoLE case). Again, with
probability c, outliers are simulated in the same way as in the Laplace case.
To assess robustness, we first computed the MLE $\hat{\theta}_{n,c}$ for 1000 simulation trials, for each c. We then evaluated the difference between the estimated mean function from each simulation (i.e. (4.19)) and the true mean function (note that the mean functions are the same in both the LMoLE and the GMoLE). This difference is computed via the integrated SE (ISE),
$$\mathrm{ISE}_c = \int_{-1}^{1}\left[E_{\theta^0}(Y \mid v, x) - E_{\hat{\theta}_{n,c}}(Y \mid v, x)\right]^2 dx;$$
recall that we defined $v = x = (1, x)^T$. The average over the 1000 simulation trials is then computed for each c, to get the estimated mean ISE ($\widehat{\mathrm{MISE}}_c$). This is done for both cases, and the results are presented in Table 4.2. Here, the integrals are each computed numerically via the integrate function in R, and the GMoLEs are estimated via the flexmix package; see R Core Team (2013) and Grun and Leisch (2008), respectively.

Figure 4.1: Example of an n = 512 observation simulation (as described in Section 4.4.1), where each point has abscissa $x_j$ and ordinate $y_j$, for j = 1, ..., n. The black curve is the mean function $E_{\theta^0}(Y \mid v, x)$ and the red lines are the component mean functions $E_{\theta^0}(Y \mid Z = i, x)$ for i = 1, 2 (as described in Section 4.3.3). The green curve and the blue lines are the respective functions, each evaluated at $\hat{\theta}_n$ instead.

Figure 4.2: Plot of the estimated variance function (i.e. (4.20)) and estimated probability function versus the respective true functions. Here, the data and coordinates are the same as those which appear in Figure 4.1. The black curve is $\pi_1\left(v; \alpha^0\right)$ and the red curve is $V_{\theta^0}(Y \mid v, x)$. The green and blue curves are the respective functions, each evaluated at $\hat{\theta}_n$ instead.

Table 4.2: Estimated integrated mean squared errors (i.e. $\widehat{\mathrm{MISE}}_c$) of the estimated LMoLE and GMoLE models, for both the Laplace and Gaussian cases. Here c ∈ {0, 0.01, ..., 0.05} is the probability of outliers in each simulation. Each $\widehat{\mathrm{MISE}}_c$ is obtained using 1000 trials.

Case       Model   c = 0         c = 0.01      c = 0.02      c = 0.03      c = 0.04      c = 0.05
Laplace    LMoLE   2.85 × 10^-4  3.17 × 10^-4  4.64 × 10^-4  5.54 × 10^-4  5.58 × 10^-4  7.91 × 10^-4
Laplace    GMoLE   4.21 × 10^-4  3.15 × 10^-3  1.33 × 10^-2  2.90 × 10^-2  4.52 × 10^-2  8.05 × 10^-2
Gaussian   LMoLE   5.62 × 10^-4  6.37 × 10^-4  6.91 × 10^-4  9.02 × 10^-4  1.07 × 10^-3  1.28 × 10^-3
Gaussian   GMoLE   4.24 × 10^-4  2.41 × 10^-3  1.05 × 10^-2  2.50 × 10^-2  4.19 × 10^-2  6.94 × 10^-2
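The ISE integrals can be computed with R's integrate function; a minimal sketch, assuming hypothetical fitted parameter lists theta0 and theta_hat and the mean function of (4.19) (via the hypothetical gate_probs helper sketched in Section 4.2), is as follows.

# Sketch: ISE between the true and an estimated mean function via integrate().
mean_fn <- function(x, alpha, beta) {           # E_theta(Y | v, x) with v = x = (1, x)'
  V <- cbind(1, x)
  rowSums(gate_probs(V, alpha) * (V %*% beta))
}
ise <- function(theta0, theta_hat) {
  integrate(function(x) (mean_fn(x, theta0$alpha, theta0$beta) -
                         mean_fn(x, theta_hat$alpha, theta_hat$beta))^2,
            lower = -1, upper = 1)$value
}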
Upon inspection of Table 4.2, we notice that $\widehat{\mathrm{MISE}}_c$ for the LMoLE is less than that of the GMoLE for every c > 0, in both the Laplace and Gaussian cases. This indicates that the LMoLE is more robust than the GMoLE in the simulated scenario. We observe that the LMoLE is more robust than the GMoLE even when the data are simulated from the Gaussian model (except in the case where there are no outliers). We also note that $\widehat{\mathrm{MISE}}_c$ for both the LMoLE and GMoLE models increases as c increases. This is expected, as there are more outliers to influence the shape of the curves. Figure 4.3 displays a typical example of estimated LMoLE and GMoLE mean functions in the Laplace case, where c = 0.05. In this example, we see that the LMoLE estimated mean function more closely resembles the true mean function than does the GMoLE estimate.
4.4.3 Selection of g via Information Criteria
To assess the suitability of rule (4.21) for determining g, we simulate n = 512 observations from four different scenarios, S1–S4. In S1, we simulate from a g = 3 component LMoLE, with parameters $\alpha_1^0 = \alpha_2^0 = (1, 0)^T$, $\beta_1^0 = (0, 0)^T$, $\beta_2^0 = (0, 1)^T$, $\beta_3^0 = (0, -1)^T$, and $\xi_i^0 = 0.1$, for each i = 1, 2, 3. S2 is the same as S1, except we set $\xi_1^0 = 1$.
In S3, we simulate from a g = 4 component LMoLE, with parameters $\alpha_1^0 = \alpha_2^0 = (0, 10)^T$, $\alpha_3^0 = (1, 0)^T$, $\beta_1^0 = \beta_3^0 = (0, 1)^T$, $\beta_2^0 = \beta_4^0 = (0, -1)^T$, $\xi_1^0 = \xi_2^0 = 0.1$, and $\xi_3^0 = \xi_4^0 = 0.25$. S4 is the same as S3, except we set $\xi_3^0 = \xi_4^0 = 0.5$.
Simulations S1 and S2 were designed to assess the capacity of the information criteria to assess the differences
114
0.5
●
●●
●
● ● ●
●
● ●
●
●
●
● ●●
●
●
●
●
●
●
●
●
● ● ●●
●
● ● ●●
●
● ● ●●
●●
●
●
● ●
●● ●
●
●●
● ●
●
●
●
●
●
●
● ●
●
● ● ●●
●
●● ●
●
●●
●
● ● ●● ●
●
●●
● ●●
●
●●
● ●
●
● ●● ● ●
●
●
●
● ● ●●
●
●
●
●
●
●●
●
●
●
●
●
●
●
● ●
●●
●● ●
●●
●● ●
●
●● ●●
●
●● ●
●●●● ●
●●●
● ●
●● ●
●
●●
●●●
●
●
●
●
●
●
●
●
●●●
●
● ●
●● ●
●●
●
●
●
●
●
●
●●
●●
●
●
●
●
●
●
●
●
●●
●●
●
●
● ●●
●
●
●●●
●
●
●
●
●
●
●
●
●
●
●
●
●
● ●●
●
●
●
●
●
●
●
●●
●
●
●
●
●
● ●●
●
●
● ● ●●
●
●
●●
●● ● ●● ●
●● ●
● ●● ● ●
● ● ● ●●●
●
●
●●
●
●
● ●●●
● ●
● ●●
●●
● ●● ●
●
● ● ●●
●
●●
●
●
●
● ●
●
● ● ●●●
● ●●
●
●
●
●
●
● ●
● ●●
● ●●● ●● ●
● ●
● ●
●●
●
● ●●●
●
●
●●●
● ●● ●●●
● ●
●●●
●
●
●
●●
● ●
●
● ●
●
●●●
●●●● ●●
●●
● ●●
●
●
●
●
●
●
●
●
● ●
● ●
●●● ● ● ●
●● ●
● ● ●●
●●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●
● ●
● ●
●
●
●
●● ●
●
●●
● ● ●●
● ●●●
●●
●●
●
●
●
●
●
−0.5
0.0
●
−2.0
−1.5
−1.0
y
●
●
●
●
●●
−1.0
● ●● ●
●
●
−0.5
●●
●
●● ●
●●
0.0
●
●●●
●●
0.5
●●
1.0
x
Figure 4.3: Example of an n = 512 observations simulation from the LMoLE model, with outlier probability
c = 0.05 (as described in Section 4.4.2), where each point has abscissa xj and ordinate yj . The black curve is
the mean function Eθ0 (Y |v, x). The green and red curves are the estimated mean functions via the LMoLE
and GMoLE models, respectively.
Figure 4.4: Examples of n = 512 observation simulations of the four scenarios from Section 4.4.3, where each point has abscissa xj and ordinate yj, for j = 1, ..., n. The colored lines are the mean functions of the various components in each scenario (as per Section 4.4.3).
in means of the components. The variance of component 1 was set large to resemble random noise rather than
a separate component.
Simulations S3 and S4 were designed to assess the capacity to distinguish changes in component variances. The difference in variance in S3 is more subtle than that in S4, although components 3 and 4 of S4 are designed to resemble a single high-variance component with constant mean.
In each of the four scenarios, the covariates are simulated such that vj = xj = (1, xj)^T, where xj is simulated uniformly over the interval (−1, 1). Examples of the four scenarios are presented in Figure 4.4.
We performed 1000 simulation trials and estimated LMoLE models with l = 2, ..., 5 components for each trial (here, γl = l). We then computed the AIC, BIC, CLC, and ICL criteria for each simulation and each l. The average information criterion values were then computed and are presented in Table 4.3. We also recorded the number of times each l was selected as the number of components (i.e. the number of times g = l according to rule (4.21)).
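For reference, this tabulation can be produced as in the following R sketch, assuming that rule (4.21) selects the l that minimizes IC(l); the matrix ic of criterion values below is a synthetic placeholder rather than output from the actual simulation.

set.seed(1)
l_grid <- 2:5
ic <- matrix(rnorm(4 * 1000, mean = rep(c(40, 10, 12, 15), each = 1000)), ncol = 4)  # placeholder IC(l) values for l = 2, ..., 5
g_hat    <- l_grid[apply(ic, 1, which.min)]               # selected number of components in each trial
n_chosen <- sapply(l_grid, function(l) sum(g_hat == l))   # number of times g = l
cbind(l = l_grid, mean_IC = colMeans(ic), times_selected = n_chosen)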
Table 4.3: Average IC(l) values for the AIC, BIC, CLC, and ICL, and the number of times g = l according to rule (4.21), for each number of components l = 2, ..., 5. The averages were taken over 1000 repetitions of the simulation from Section 4.4.3. An asterisk in the l column indicates the true number of components.

            AIC              BIC              CLC              ICL
Scenario l  Mean     #g=l    Mean     #g=l    Mean     #g=l    Mean     #g=l
S1       2   530.6      0     564.5      0     803.6      7     853.5     30
         3*  218.4    701     273.5    901     522.7    894     603.8    911
         4   199.0    164     275.3     88     628.2     85     740.5     54
         5   200.6    135     298.1     11     730.1     14     873.6      5
S2       2  1021.1      0    1055.0      0    1254.1    692    1304.0    885
         3*  872.0    910     927.1    992    1269.1    308    1350.2    115
         4   887.7     63     954.0      8    1472.9      0    1585.1      0
         5   884.2     27     981.6      0    1622.5      0    1766.0      0
S3       2   464.2      0     498.1    202     639.2    983     689.2    986
         3   428.8     31     484.7    310     775.0     12     856.1     10
         4*  408.4    500     483.9    441     864.8      5     977.1      3
         5   404.5    469     502.0     47     948.5      0    1091.9      0
S4       2   877.1      0     911.0      0    1134.0     57    1183.9    138
         3   721.0     14     776.1    266    1025.7    888    1106.8    853
         4*  690.8    673     767.1    708    1126.5     49    1238.8      9
         5   691.9    313     789.4     26    1232.6      6    1376.1      0
The results from Table 4.3 indicate that the BIC is the only criterion that chooses g correctly, on average, in each of the four scenarios, although it has a tendency to select a simpler model than is required, with noticeable frequency, in S3 and S4.
We notice that in S1, only AIC performed poorly, where it had a noticeable tendency towards complexity. In S2, both AIC and BIC performed well, whereas CLC and ICL had a tendency towards simplicity. In S3, all four criteria performed poorly, although BIC was the only criterion to be minimized, on average, at the correct number of components. In S4, both AIC and BIC performed well, although AIC tended to favor complexity, whereas BIC favored simplicity; here, both CLC and ICL had a tendency towards simplicity.
We find that Table 4.3 presents good evidence towards the usefulness of the BIC for selecting the number of
components in MoLE models, as it performed the best out of the criteria in all four scenarios. We finally
note that the MM algorithm monotonically increased the likelihood in each of its applications across the three
simulation studies.
4.5 Application
To demonstrate the LMoLE model in its application to real data, we consider the climate science data set from
Hansen et al. (2001). The data consist of n = 129 yearly measurements of temperature anomalies (a measure of temperature deviations from a long-range average, in degrees Celsius) over the period of 1882 to 2010.
In their scientific analysis, Hansen et al. (2001) found that the data could be segmented into two periods of
global warming (i.e. 1882 to 1940 and 1965 to 2010), separated by a transition period where there was a slight
global cooling (i.e. 1940 to 1965). It was found that during the period of 1882 to 1940, there was net global
warming, partially due to greenhouse gases but also contributed to by solar forcing. From 1940 to 1965, a net
cooling occurred via increased aerosol forcing as well as fluctuations in ocean heat transport. The period of
1965 to 2010 was observed to have exhibited net warming, at a greater rate than from 1882 to 1940; it was
speculated that greenhouse gas forcing was stronger here, than in the previous period of warming.
4.5.1 Linear Mixtures of Laplace Experts
To analyze these data, we set y as the temperature anomalies and vj = xj = (1, tj)^T, where tj = j + 1881 is the year of each observation and j = 1, ..., 129. We then fitted a two- and a three-component LMoLE model to the data, which yielded BIC values of −143.847 and −117.790, respectively. As such, we choose to utilize the two-component LMoLE for our analysis.
The components of the MLE are α̂n = (536.717, 0.273)^T, β̂1,n = (−13.956, 0.007)^T, β̂2,n = (−43.075, 0.022)^T, ξ̂1,n = 0.089, and ξ̂2,n = 0.090. A plot of the data, along with the estimated LMoLE mean function and component means, is given in Figure 4.5.
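For illustration, the fitted mean function of a two-component MoLE has the form Eθ(Y|v,x) = π1(v) β1^T x + {1 − π1(v)} β2^T x. The short R sketch below evaluates such a function under an assumed logistic gating π1(v) = exp(α^T v)/{1 + exp(α^T v)}; the gating convention and the coefficient values used are purely illustrative assumptions and need not match the parametrization or the estimates reported above.

# A minimal sketch of a two-component MoLE mean function, assuming logistic gating;
# the coefficient values below are hypothetical and for illustration only.
mole_mean <- function(t, alpha, beta1, beta2) {
  v  <- rbind(1, t)                         # v_j = x_j = (1, t_j)^T
  p1 <- plogis(drop(crossprod(alpha, v)))   # assumed gating probability of component 1
  p1 * drop(crossprod(beta1, v)) + (1 - p1) * drop(crossprod(beta2, v))
}
t_grid <- 1882:2010
m_hat  <- mole_mean(t_grid,
                    alpha = c(500, -0.255),   # hypothetical gating coefficients
                    beta1 = c(-14.0, 0.007),  # hypothetical component 1 coefficients
                    beta2 = c(-43.1, 0.022))  # hypothetical component 2 coefficients
plot(t_grid, m_hat, type = "l", xlab = "t", ylab = "Estimated mean function")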
Upon observing the mean function from Figure 4.5, we note that the average temperature anomaly undergoes
two phases of warming, broken by a short period of cooling. By inspection of the derivative of the mean function,
Figure 4.5: Temperature anomalies data from Hansen et al. (2001), where each point has abscissa tj (year)
and ordinate yj (temperature anomaly, in degrees Celsius). The black curve is the estimated mean function
Eθ̂ (Y |v, x). The green and red curves are the estimated mean functions of components 1 and 2, respectively.
dEθ̂(Y|v,x)/dt, we find that the mean is decreasing in the period between 1957 and 1966. This cooling period
corresponds with the period noted by Hansen et al. (2001). Furthermore, we notice that the mean function of
component 2 is steeper than that of component 1. This indicates that warming was more gradual prior to 1957
than it was after 1966. Again, this agrees with the analysis by Hansen et al. (2001).
4.6 Conclusions
In this article, we have described a new MoLE model, which is based on the Laplace distribution, and we have
derived an MM algorithm that monotonically increases the likelihood for the process of MLE. We were able to
prove that the MLE for the LMoLE parameter vector is consistent, and thus, estimators of interesting functions
(e.g. the mean and variance functions) are consistent as well. The LMoLE was also shown to be identifiable
under some restrictions, and links were established between the LMoLE model and the LAD regression model.
Using a set of simulation studies, we were able to support our theoretical result regarding the consistency of
the MLE for LMoLE models, as well as show that the LMoLE was more robust than the GMoLE under the
studied scenarios. We also found that the BIC outperformed other information criteria, in the role of selecting
between LMoLE models with differing numbers of components. An example application of the LMoLE model
was then described, via the analysis of a climate science data set.
To the best of our knowledge, this article is the first to explore the use of non-Gaussian MoLE models. Instead
of the Laplace distribution, we could utilize generalizations of the Gaussian distribution (e.g. t and skew-t
distributions), which may be more suitable in different contexts; these generalizations constitute an interesting
direction for future work. Furthermore, it would be interesting to investigate the combination of the MoLE
framework for incorporating covariate dependencies with the semiparametric mixtures of linear regression models
of Hunter and Young (2012) and Vandekerkhove (2013). We hope that this article provokes further research
into the application of LMoLE models and the extension of the MoLE framework.
Acknowledgments
We thank the associate editor and the two reviewers for their helpful comments, which led to improvements in
the article.
Appendices
A. Derivation of MM Algorithm Updates
For each i, we can write the first-order partial derivatives of (4.12) and (4.13) as

∂η*_{1i}/∂α_i = δ_i^{(k+1)} − (1/4) Δ (α_i − α_i^{(k)}),

∂η*_2/∂β_i^T = (1/ξ_i) Σ_{j=1}^n τ_{ij}^{(k+1)} (y_j − β_i^T x_j) x_j^T / w_{ij}^{(k+1)},

and

∂η*_2/∂ξ_i = −(1/ξ_i) Σ_{j=1}^n τ_{ij}^{(k+1)} + (1/(2ξ_i²)) Σ_{j=1}^n τ_{ij}^{(k+1)} [ w_{ij}^{(k+1)} + (y_j − β_i^T x_j)² / w_{ij}^{(k+1)} ].
Solving for the roots of the partial derivatives yields updates (4.14), (4.15), and (4.16), respectively.
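For reference, solving these first-order conditions explicitly gives roots of the following form (a sketch only, assuming that Δ and the weighted sum-of-squares matrix below are invertible; these expressions should coincide with updates (4.14)–(4.16) as stated in the main text):

α_i^{(k+1)} = α_i^{(k)} + 4 Δ^{−1} δ_i^{(k+1)},

β_i^{(k+1)} = [ Σ_{j=1}^n (τ_{ij}^{(k+1)}/w_{ij}^{(k+1)}) x_j x_j^T ]^{−1} Σ_{j=1}^n (τ_{ij}^{(k+1)}/w_{ij}^{(k+1)}) x_j y_j,

and

ξ_i^{(k+1)} = Σ_{j=1}^n τ_{ij}^{(k+1)} [ w_{ij}^{(k+1)} + (y_j − β_i^{(k+1)T} x_j)²/w_{ij}^{(k+1)} ] / ( 2 Σ_{j=1}^n τ_{ij}^{(k+1)} ).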
Additionally, we can write the second-order partial derivatives as

∂²η*_{1i}/(∂α_i ∂α_i^T) = −(1/4) Δ,

∂²η*_2/(∂β_i ∂β_i^T) = −(1/ξ_i) Σ_{j=1}^n τ_{ij}^{(k+1)} x_j x_j^T / w_{ij}^{(k+1)},

∂²η*_2/∂ξ_i² = (1/ξ_i²) Σ_{j=1}^n τ_{ij}^{(k+1)} − (1/ξ_i³) Σ_{j=1}^n τ_{ij}^{(k+1)} [ w_{ij}^{(k+1)} + (y_j − β_i^T x_j)² / w_{ij}^{(k+1)} ],

and

∂²η*_2/(∂ξ_i ∂β_i^T) = −(1/ξ_i²) Σ_{j=1}^n τ_{ij}^{(k+1)} (y_j − β_i^T x_j) x_j^T / w_{ij}^{(k+1)}.

Let H_{1i}(α_i) = −4^{−1} Δ be the Hessian of η*_{1i}. Notice that H_{1i}(α_i) is negative-definite for all α_i (except in pathological cases). This implies that update (4.14) maximizes (4.11). Similarly, let

H_{2i}(β_i, ξ_i) = [ ∂²η*_2/(∂β_i ∂β_i^T)   ∂²η*_2/(∂β_i ∂ξ_i) ; ∂²η*_2/(∂ξ_i ∂β_i^T)   ∂²η*_2/∂ξ_i² ]

be the Hessian of (4.13) with respect to β_i and ξ_i. It is not difficult to show that H_{2i}(β_i^{(k+1)}, ξ_i^{(k+1)}) is also negative-definite, which implies that updates (4.15) and (4.16) simultaneously maximize (4.13). This is true since β_i and ξ_i are separable from β_{i'} and ξ_{i'} in (4.13), for each i ≠ i'.
B. Assumptions of Lemma 4.1
The following assumptions utilize the same notation as appears in Lemma 4.1.
B1 The set A is an open subset of R^q, where q is the dimension of a.

B2 For each n, h_n(a; Y) is continuous in an open neighborhood N_1 of a_0.

B3 There exists an open neighborhood N_2 of a_0 in which n^{−1} h_n(a; Y) converges to a nonstochastic function h(a) in probability, uniformly with respect to a.

B4 The function h(a) attains a strict local maximum at a_0.
Chapter 5
Maximum Likelihood Estimation of Gaussian Mixture Models without Matrix Operations
Citation
Nguyen, H. D. and McLachlan, G. J. Maximum likelihood estimation of Gaussian mixture models without
matrix operations. Submitted to Advances in Data Analysis and Classification.
Abstract
The Gaussian mixture model (GMM) is a popular tool for multivariate analysis, in particular, cluster analysis. The expectation–maximization (EM) algorithm is generally used to perform maximum likelihood (ML)
estimation for GMMs, due to the M-step existing in closed form and its desirable numerical properties,
such as monotonicity. However, the EM algorithm has been criticized as being slow to converge and thus computationally expensive in some situations. In this article, we introduce the linear regression characterization
(LRC) of the GMM. We show that the parameters of an LRC of the GMM can be mapped back to the natural
parameters, and that a minorization–maximization (MM) algorithm can be constructed, which retains the desirable numerical properties of the EM algorithm, without the use of matrix operations. We prove that the ML
estimators of the LRC parameters are consistent and asymptotically normal, like their natural counterparts.
Furthermore, we show that the LRC allows for simple handling of singularities in the ML estimation of GMMs.
Using numerical simulations, we then demonstrate that the MM algorithm can be faster than the EM algorithm
in various practical scenarios when implemented in the R programming environment.
5.1 Introduction
The Gaussian mixture model (GMM) is a ubiquitous tool in the domain of model-based cluster analysis; for
instance, see McLachlan and Basford (1988) and Chapter 3 of McLachlan and Peel (2000) for a statistical
perspective, and Chapter 9 of Bishop (2006) and Chapter 10 of Duda et al. (2001) for a machine learning point
of view. Furthermore, discussions of applications of GMMs and comparisons of GMMs to other cluster analysis
methods can be found in Chapter 8 of Clarke et al. (2009), Section 14.3 of Hastie et al. (2009), and Section 9.3
of Ripley (1996), as well as Hartigan (1985), Jain et al. (1999), and Jain (2010). The GMM framework can be
defined as follows.
Let Z ∈ {1, ..., g} be a categorical random variable such that

P(Z = i) = π_i > 0

for i = 1, ..., g − 1 and P(Z = g) = 1 − Σ_{i=1}^{g−1} π_i = π_g, and let X ∈ R^d be such that X|Z = i has density φ_d(x; µ_i, Σ_i), where

φ_d(x; µ, Σ) = (2π)^{−d/2} det(Σ)^{−1/2} exp[ −(1/2) (x − µ)^T Σ^{−1} (x − µ) ]   (5.1)

is a d-dimensional multivariate Gaussian density function with mean vector µ ∈ R^d and positive-definite covariance matrix Σ ∈ R^{d×d}. Here, the superscript T represents matrix/vector transposition.
If we suppose that Z is unobserved (i.e. Z is a latent variable), then the density of X can be written as

f(x; θ) = Σ_{i=1}^g π_i φ_d(x; µ_i, Σ_i),   (5.2)

where θ = (π^T, µ_1^T, ..., µ_g^T, vech^T(Σ_1), ..., vech^T(Σ_g))^T is the model parameter vector and π = (π_1, ..., π_{g−1})^T. Densities of form (5.2) are known as GMMs, and we refer to each φ_d(x; µ_i, Σ_i) as a component density.

Let X_1, ..., X_n be a random sample from a population characterized by density f(x; θ_0), where the parameter vector θ_0 is unknown. In such cases, the estimation of θ_0 is required for further inference regarding the data. Given an observed sample x_1, ..., x_n, such estimation can be conducted via maximum likelihood (ML) estimation to yield the ML estimator θ̂_n, where θ̂_n is an appropriate local maximizer of the likelihood function L_n(θ) = ∏_{j=1}^n f(x_j; θ).
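For reference, the density (5.2) and the likelihood L_n(θ) can be evaluated directly, as in the following R sketch; the two-component, bivariate parameter values and the two observations used are illustrative.

# A small sketch evaluating the GMM density (5.2) and the log-likelihood;
# all parameter values and observations below are illustrative.
dmvn <- function(x, mu, Sigma) {          # d-variate Gaussian density, as in (5.1)
  d <- length(mu)
  q <- drop(t(x - mu) %*% solve(Sigma) %*% (x - mu))
  exp(-q / 2) / sqrt((2 * pi)^d * det(Sigma))
}
gmm_dens <- function(x, pi_v, mu, Sigma)  # f(x; theta) = sum_i pi_i phi_d(x; mu_i, Sigma_i)
  sum(sapply(seq_along(pi_v), function(i) pi_v[i] * dmvn(x, mu[[i]], Sigma[[i]])))

pi_v  <- c(0.4, 0.6)
mu    <- list(c(0, 0), c(3, 3))
Sigma <- list(diag(2), matrix(c(2, 0.5, 0.5, 1), 2, 2))
xs    <- rbind(c(0.2, -0.1), c(2.8, 3.3))  # two observations x_1 and x_2
sum(log(apply(xs, 1, gmm_dens, pi_v = pi_v, mu = mu, Sigma = Sigma)))  # log L_n(theta)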
Due to the summation form of (5.2), the computation of θ̂_n cannot be conducted in closed form. However, since its introduction by Dempster et al. (1977), the expectation–maximization (EM) algorithm has provided a stable and monotonic iterative method for computing the ML estimator; see Section 3.2 of McLachlan and Peel (2000) for details. Although effective, the EM algorithm for GMMs is not without some criticisms; for example, it is known that the convergence of EM algorithms can be very slow, and may be computationally expensive in some applications; see Chapters 3 and 4 of McLachlan and Krishnan (2008) for details regarding these issues.

To remedy the aforementioned issues, there have been broad developments in constructing modifications of the GMM EM algorithm, as well as suggestions of alternative methodologies for estimation. For example, Andrews and McNicholas (2013), Botev and Kroese (2004), Ingrassia (1991), and Pernkopf and Bouchaffra (2005) considered the use of metaheuristic algorithms; and Andrews and McNicholas (2013), Celeux and Govaert (1992), Ganesalingam and McLachlan (1980), and McLachlan (1982) considered alternatives to the likelihood criterion (see Chapter 12 of McLachlan and Peel (2000) for a literature review of other developments in this direction). Each of the aforementioned methods has been shown to improve upon the performance of the EM algorithm via simulation studies, although there are no theoretical results to show that any of the methods uniformly outperforms the EM algorithm. Furthermore, each of the methods requires either randomization, which relinquishes the monotonicity of the EM algorithm, or replacement of the likelihood criterion, which abandons the statistical properties of the ML estimators.
In this article, we devise a new algorithm for the estimation of GMMs that retains the monotonicity properties of the EM algorithm whilst not utilizing matrix operations. Our approach extends from the following
characterization of the multivariate Gaussian density function.
Consider the following decomposition of µ and Σ, from (5.1), in which

µ = (µ_{1:d−1}^T, µ_d)^T and Σ = [ Σ_{1:d−1,1:d−1}  Σ_{d,1:d−1}^T ; Σ_{d,1:d−1}  Σ_{dd} ],   (5.3)

where µ_{1:k} is the first k elements of µ, Σ_{1:k,1:k} is the submatrix made up of the first k rows and columns of Σ, Σ_{k,1:l} is the first l elements from row k of Σ, and Σ_{kl} is the element from the kth row and lth column of Σ, for k, l = 1, ..., d. Furthermore, let X̃_k = (1, X_{1:k−1}^T)^T, where X_{1:k} is the first k elements of X, and let β_k = (β_{k0}, ..., β_{k,k−1})^T ∈ R^k and σ_k² > 0 be parameters. Using this notation, Ingrassia et al. (2012) showed that for every µ and Σ, there exists a parametrization β_d, σ_d², µ_{1:d−1}, and Σ_{1:d−1,1:d−1} such that the density function

f_CW(x; β_d, σ_d², µ_{1:d−1}, Σ_{1:d−1,1:d−1}) = φ_1(x_d; β_d^T x̃_d, σ_d²) × φ_{d−1}(x_{1:d−1}; µ_{1:d−1}, Σ_{1:d−1,1:d−1})   (5.4)

is equal to φ_d(x; µ, Σ), for all values of x. This alternative parametrization allows for the d-variate Gaussian distribution to be considered in two parts: a linear regression (LR) component and a density estimation component. In Ingrassia et al. (2012) and Ingrassia et al. (2013), this parametrization was used within the cluster-weighted modeling framework for clustering data arising from LR processes.
We extend upon the regression decomposition of Ingrassia et al. (2012) to characterize the multivariate Gaussian
distribution entirely in terms of LR components. In doing so, we are able to apply a minorization–maximization
(MM) algorithm presented in Becker et al. (1997), for the estimation of LR models without matrix operations via
an iterative scheme; see Hunter and Lange (2004) for further details regarding MM algorithms. By application
of the aforementioned MM algorithm, we are able to devise a method for estimating GMM densities without
matrix operations. Furthermore, by leveraging its MM construction, the algorithm that we present is proven to
be monotonic in its iterations as well as convergent to a stationary point of the log-likelihood function. We are able to show via simulations that our algorithm compares favorably with the EM algorithm in various practical scenarios, when implemented in the R programming environment (R Core Team, 2013).
Aside from the numerical properties that we derive, we also address the statistical properties of our LR characterization (LRC). We are able to establish both the consistency and asymptotic normality of the LRC ML
estimators, as well as devise a simple procedure for the satisfactory handling of singularities, which can arise in
the ML estimation of GMMs.
The remainder of the article is organized as follows. Firstly, we describe the LRC of the multivariate normal
distribution in Section 5.2. In Section 5.3, we devise the MM algorithm for ML estimation as well as establish its
numerical properties. In Section 5.4, the statistical properties of the ML estimators and the model are derived,
and in Section 5.5, the performance of the MM algorithm is demonstrated via numerical simulations. Lastly,
conclusions are drawn in Section 5.6.
5.2 Linear Regression Characterization
Using the same notation as in (5.3) and (5.4), the LRC of the multivariate Gaussian density is

λ(x; B, σ²) = ∏_{k=1}^d φ_1(x_k; β_k^T x̃_k, σ_k²),   (5.5)

where B = (β_1^T, ..., β_d^T)^T and σ² = (σ_1², ..., σ_d²)^T. Here, we define x̃_1 = 1 and β_1 = β_{10}. We now show that (5.5) is a d-dimensional multivariate Gaussian density and that there is a one-to-one correspondence between the LRC parameters B and σ² and the natural parameters µ and Σ of (5.2). To attain such a result, we require the following lemma.

Lemma 5.1. Using the same notation as in (5.3) and (5.4), if X has density function (5.1), then for each k, X_{1:k} and X_k | X_{1:k−1} = x_{1:k−1} have density functions

φ_k(x_{1:k}; µ_{1:k}, Σ_{1:k,1:k}) and φ_1(x_k; µ_{k|1:k−1}(x_{1:k−1}), Σ_{k|1:k−1}),

respectively, where

µ_{k|1:k−1}(x_{1:k−1}) = µ_k + Σ_{k,1:k−1} Σ_{1:k−1,1:k−1}^{−1} (x_{1:k−1} − µ_{1:k−1}),

and

Σ_{k|1:k−1} = Σ_{kk} − Σ_{k,1:k−1} Σ_{1:k−1,1:k−1}^{−1} Σ_{k,1:k−1}^T.

Lemma 5.1 can be seen as a special case of Theorems 2.4.1 and 2.5.1 from Anderson (2003). We can apply Lemma 5.1 to derive the following result.

Theorem 5.1. Every density function of form (5.2) can be expressed in form (5.5) via a bijective mapping between the parameters µ and Σ, and B and σ².
The proof of Theorem 5.1 and other major results can be found in the Appendix.
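The correspondence is easy to check numerically. The following R sketch verifies, at an arbitrary point, that the product of the univariate conditional densities from Lemma 5.1 reproduces φ_d(x; µ, Σ); the particular µ, Σ, and x used are illustrative.

# A small sketch checking Lemma 5.1/Theorem 5.1 numerically; mu, Sigma, and x are illustrative.
set.seed(1)
d     <- 3
mu    <- c(1, -1, 0.5)
A     <- matrix(rnorm(d * d), d, d)
Sigma <- crossprod(A) + diag(d)     # an arbitrary positive-definite covariance matrix
x     <- c(0.3, -0.2, 1.1)

# Joint density phi_d(x; mu, Sigma), computed directly from (5.1).
phi_d <- exp(-0.5 * drop(t(x - mu) %*% solve(Sigma) %*% (x - mu))) /
  sqrt((2 * pi)^d * det(Sigma))

# Product of the conditional densities of x_k given x_{1:k-1}, via Lemma 5.1.
lrc <- dnorm(x[1], mu[1], sqrt(Sigma[1, 1]))
for (k in 2:d) {
  S11   <- Sigma[1:(k - 1), 1:(k - 1), drop = FALSE]
  S21   <- Sigma[k, 1:(k - 1), drop = FALSE]          # Sigma_{k,1:k-1}
  m_con <- mu[k] + drop(S21 %*% solve(S11, x[1:(k - 1)] - mu[1:(k - 1)]))
  s_con <- Sigma[k, k] - drop(S21 %*% solve(S11, t(S21)))
  lrc   <- lrc * dnorm(x[k], m_con, sqrt(s_con))
}
all.equal(lrc, phi_d)   # should return TRUE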
5.2.1 Gaussian mixture model
We now consider the LRC of the GMM density function

f_R(x; ψ) = Σ_{i=1}^g π_i λ(x; B_i, σ_i²),   (5.6)

where ψ = (π^T, B_1^T, ..., B_g^T, σ_1²^T, ..., σ_g²^T)^T is the vector of model parameters. Here, B_i = (β_{i1}^T, ..., β_{id}^T)^T, σ_i² = (σ_{i1}², ..., σ_{id}²)^T, and β_{ik} = (β_{ik0}, ..., β_{ik,k−1})^T, for each i and k. Furthermore, π is restricted in the same way as in (5.2). Using Theorem 5.1, the following result can be shown.

Corollary 5.1. For each natural parameter vector θ, there exists a mapping to an LRC parameter vector ψ, and vice versa, such that f(x; θ) = f_R(x; ψ) at every x ∈ R^d.

Corollary 5.1 can be seen as an extension of Proposition 1 from Ingrassia et al. (2012). Unfortunately, unlike Theorem 5.1, the mapping between θ and ψ is not bijective, due to the non-identifiability of GMMs; this issue is well documented in Section 3.1 of Titterington et al. (1985). Nevertheless, Corollary 5.1 allows us to consider the density estimation of data generated from a GMM using an LRC of the GMM instead. We shall show that this representation permits the construction of a matrix-free algorithm for ML estimation.
5.3 Maximum Likelihood Estimation
Upon observing data x_1, ..., x_n, the likelihood and the log-likelihood that the data arise from an LRC of the GMM are L_{R,n}(ψ) = ∏_{j=1}^n f_R(x_j; ψ) and

log L_{R,n}(ψ) = Σ_{j=1}^n log f_R(x_j; ψ) = Σ_{j=1}^n log Σ_{i=1}^g π_i λ(x_j; B_i, σ_i²),   (5.7)

respectively.

As with the natural parametrization of the GMM, we generally assume that the data were generated from a process with density function f_R(x; ψ_0), where ψ_0 is unknown. In such cases, ψ_0 can be estimated via the ML estimator ψ̂_n, where ψ̂_n is an appropriate local maximizer of (5.7).

Like θ̂_n, ψ̂_n cannot be computed in closed form. Thus, we must devise an iterative computation scheme. We now present an MM algorithm for such a purpose.
5.3.1 Minorization–Maximization Algorithms
Suppose that we wish to maximize some objective function η(t), where t ∈ S for some set S ⊂ R^r. If we cannot obtain the maximizer of η(t) directly, then we can seek a minorizer of η(t) over S, instead. A minorizer of η(t) is a function h(s; t) such that η(t) = h(t; t) and η(s) ≥ h(s; t), whenever s ≠ t and s ∈ S. Here, h(s; t) is said to minorize η(t).

Upon finding an appropriate minorizer and denoting t^{(m)} as the mth iteration of the algorithm, an MM algorithm can be defined using the update scheme

t^{(m+1)} = arg max_{s∈S} h(s; t^{(m)}).   (5.8)

By the properties of the minorizer and by definition of (5.8), iterative applications of the MM update scheme yield the inequalities

η(t^{(m+1)}) ≥ h(t^{(m+1)}; t^{(m)}) ≥ h(t^{(m)}; t^{(m)}) = η(t^{(m)}).

This shows that the sequence of objective function evaluations η(t^{(m)}) is monotonically increasing in each step. Furthermore, under some regularity conditions, it can also be shown that the sequence of iterates t^{(m)} converges to some stationary point t* of η(t).

In this article, we consider minorizers for the functions

η_1(t) = log( Σ_{i=1}^r t_i ),

where S = {t : t_i ≥ 0, i = 1, ..., r}, and

η_2(t) = −(a − t^T b)²,

where a ∈ R, b ∈ R^r, and S = R^r. These objective functions (i.e. η_1(t) and η_2(t)) can be minorized via the functions

h_1(s; t) = Σ_{i=1}^r ( t_i / Σ_{i'=1}^r t_{i'} ) [ log( Σ_{i'=1}^r t_{i'} / t_i ) + log(s_i) ]   (5.9)

and

h_2(s; t) = −Σ_{i=1}^r α_i [ a − (b_i/α_i)(s_i − t_i) − t^T b ]²,   (5.10)

respectively, where α_i = (|b_i| + δ)^p / Σ_{i'=1}^r (|b_{i'}| + δ)^p, p > 0, and δ > 0 is a small coefficient. The two minorizers were devised in Zhou and Lange (2010) and Becker et al. (1997), respectively; the latter was applied to perform LR without matrix operations. Here, we choose p = 2 in α_i for use throughout the article, as per a suggestion from Becker et al. (1997).
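Both minorization properties are straightforward to verify numerically, as in the following R sketch (with p = 2, as above, and an illustrative δ and randomly drawn arguments).

# A small sketch checking numerically that h_1 and h_2 minorize eta_1 and eta_2;
# delta and the randomly drawn arguments are illustrative.
set.seed(1)
r <- 4; delta <- 1e-3

eta1 <- function(t) log(sum(t))
h1   <- function(s, t) sum((t / sum(t)) * (log(sum(t) / t) + log(s)))

eta2 <- function(t, a, b) -(a - sum(t * b))^2
h2   <- function(s, t, a, b) {
  alpha <- (abs(b) + delta)^2 / sum((abs(b) + delta)^2)
  -sum(alpha * (a - (b / alpha) * (s - t) - sum(t * b))^2)
}

t1 <- runif(r); s1 <- runif(r)        # nonnegative arguments for eta_1
print(c(eta1(s1) >= h1(s1, t1), isTRUE(all.equal(eta1(t1), h1(t1, t1)))))

a <- rnorm(1); b <- rnorm(r); t2 <- rnorm(r); s2 <- rnorm(r)
print(c(eta2(s2, a, b) >= h2(s2, t2, a, b), isTRUE(all.equal(eta2(t2, a, b), h2(t2, t2, a, b)))))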
Let ψ^{(m)} be the mth MM iterate. By setting s_i = π_i λ(x_j; B_i, σ_i²) and t_i = π_i^{(m)} λ(x_j; B_i^{(m)}, σ_i^{(m)2}) in (5.9), for each i and j, we get the minorizer for (5.7),

Q(ψ; ψ^{(m)}) = C(ψ^{(m)}) + Σ_{i=1}^g Σ_{j=1}^n τ_i(x_j; ψ^{(m)}) log[ π_i λ(x_j; B_i, σ_i²) ]
             = C(ψ^{(m)}) + Q_1(ψ; ψ^{(m)}) + Σ_{i=1}^g Σ_{k=1}^d (1/(2σ_{ik}²)) Σ_{j=1}^n τ_i(x_j; ψ^{(m)}) Q_{2ijk}(β_{ik}; β_{ik}^{(m)}),   (5.11)

where

Q_1(ψ; ψ^{(m)}) = Σ_{i=1}^g Σ_{j=1}^n τ_i(x_j; ψ^{(m)}) log π_i − (1/2) Σ_{i=1}^g Σ_{k=1}^d log(σ_{ik}²) Σ_{j=1}^n τ_i(x_j; ψ^{(m)}),

Q_{2ijk}(β_{ik}; β_{ik}^{(m)}) = −( x_{jk} − β_{ik}^T x̃_{jk} )²,   (5.12)

and

τ_i(x; ψ) = π_i λ(x; B_i, σ_i²) / Σ_{i'=1}^g π_{i'} λ(x; B_{i'}, σ_{i'}²).   (5.13)

Here, x_j = (x_{j1}, ..., x_{jd})^T and x̃_{jk} = (1, x_{j1}, ..., x_{j,k−1})^T for each j and k, and C(ψ^{(m)}) is a constant that does not depend on ψ.

Now, by setting a = x_{jk}, b = x̃_{jk}, s = β_{ik}, and t = β_{ik}^{(m)} in (5.10), we obtain the minorizer for (5.12),

Q'_{2ijk}(β_{ik}; β_{ik}^{(m)}) = −Σ_{l=0}^{k−1} α_{jl} [ x_{jk} − (x_{jl}/α_{jl})(β_{ikl} − β_{ikl}^{(m)}) − β_{ik}^{(m)T} x̃_{jk} ]²,   (5.14)

where α_{jl} = (|x_{jl}| + δ)^p / Σ_{l'=0}^{k−1} (|x_{jl'}| + δ)^p and x_{j0} = 1 by definition, for each j and l = 0, ..., k − 1.

Using (5.14), we can further minorize (5.11), and thus (5.7), by

Q'(ψ; ψ^{(m)}) = C(ψ^{(m)}) + Q_1(ψ; ψ^{(m)}) + Σ_{i=1}^g Σ_{k=1}^d (1/(2σ_{ik}²)) Σ_{j=1}^n τ_i(x_j; ψ^{(m)}) Q'_{2ijk}(β_{ik}; β_{ik}^{(m)}).   (5.15)

We now consider the partition of ψ into ψ_1 = (π^T, B_1^T, ..., B_g^T)^T and ψ_2 = (σ_1²^T, ..., σ_g²^T)^T, where ψ = (ψ_1^T, ψ_2^T)^T. By fixing ψ_2 at ψ_2^{(m)}, Q'(ψ_1, ψ_2^{(m)}; ψ^{(m)}) is additively separable in the subsets of ψ_1; furthermore, for each i, the elements of B_i are additively separable as well. Similarly, by fixing ψ_1 at ψ_1^{(m)}, Q(ψ_1^{(m)}, ψ_2; ψ^{(m)}) is additively separable in the elements of the subsets of ψ_2. This result suggests the following block successive MM update scheme.

Let ψ^{(0)} be an initial parameter vector. At the (m + 1)th iteration, the algorithm proceeds in two steps: the min (minorization)-step and the max (maximization)-step. In the min-step, we either construct (5.15) if m is odd, or (5.11) if m is even; in either case, the min-step requires the computation of τ_i(x_j; ψ^{(m)}), for each i and j.
In the max-step, if m is odd, then we set ψ_2^{(m+1)} = ψ_2^{(m)} and solve for the root of

∂Q'(ψ_1, ψ_2^{(m)}; ψ^{(m)}) / ∂ψ_1 = 0,

to get the updates

π_i^{(m+1)} = Σ_{j=1}^n τ_i(x_j; ψ^{(m)}) / n   (5.16)

and

β_{ikl}^{(m+1)} = β_{ikl}^{(m)} + [ Σ_{j=1}^n x_{jl} τ_i(x_j; ψ^{(m)}) ( x_{jk} − β_{ik}^{(m)T} x̃_{jk} ) ] / [ Σ_{j=1}^n (x_{jl}²/α_{jl}) τ_i(x_j; ψ^{(m)}) ],   (5.17)

for each i, k, and l = 0, ..., k − 1. Here, 0 is a zero matrix/vector of appropriate dimensionality.

Similarly, the max-step for even m proceeds by setting ψ_1^{(m+1)} = ψ_1^{(m)} and solving for the root of

∂Q(ψ_1^{(m)}, ψ_2; ψ^{(m)}) / ∂ψ_2 = 0

to obtain the updates

σ_{ik}^{(m+1)2} = Σ_{j=1}^n τ_i(x_j; ψ^{(m)}) [ x_{jk} − β_{ik}^{(m)T} x̃_{jk} ]² / Σ_{j=1}^n τ_i(x_j; ψ^{(m)})   (5.18)

for each i and k. We now show that together, updates (5.16)–(5.18) generate a sequence of monotonically increasing log-likelihood values.
Theorem 5.2. If m is odd and ψ_2^{(m+1)} = ψ_2^{(m)}, then updates (5.16) and (5.17) result in the inequalities

log L_{R,n}(ψ^{(m+1)}) ≥ Q'(ψ_1^{(m+1)}, ψ_2^{(m)}; ψ^{(m)}) ≥ Q'(ψ_1^{(m)}, ψ_2^{(m)}; ψ^{(m)}) ≥ log L_{R,n}(ψ^{(m)}).   (5.19)

If m is even and ψ_1^{(m+1)} = ψ_1^{(m)}, then update (5.18) results in the inequalities

log L_{R,n}(ψ^{(m+1)}) ≥ Q(ψ_1^{(m)}, ψ_2^{(m+1)}; ψ^{(m)}) ≥ Q(ψ_1^{(m)}, ψ_2^{(m)}; ψ^{(m)}) ≥ log L_{R,n}(ψ^{(m)}).   (5.20)

In general, the MM algorithm is iterated until some convergence criterion is met. Usually, this is either the absolute convergence criterion

|log L_{R,n}(ψ^{(m+1)}) − log L_{R,n}(ψ^{(m)})| ≤ ε,

or the relative convergence criterion

|log L_{R,n}(ψ^{(m+1)}) − log L_{R,n}(ψ^{(m)})| / |log L_{R,n}(ψ^{(m)})| ≤ ε,   (5.21)

for some small ε > 0. In either case, upon convergence, the final iterate of the algorithm is declared the ML estimator ψ̂_n. Let ψ* be a limit point of the MM algorithm, such that ψ̂_n → ψ* as ε → 0 for some starting parameter ψ^{(0)}.
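The block scheme above is simple to implement. The following R sketch carries out one odd-step and one even-step update for a toy two-component, two-dimensional data set; the data, the initialization, the value of δ, and the variance floor are illustrative assumptions, and the sketch is not the implementation used for the numerical work reported later in this chapter.

# A minimal sketch of one odd-step (pi and B) and one even-step (sigma^2) update for
# the LRC of a GMM, following (5.16)-(5.18); data, initialization, and delta are illustrative.
set.seed(1)
n <- 200; d <- 2; g <- 2; delta <- 1e-3
x <- rbind(matrix(rnorm(n * d / 2, mean = -2), ncol = d),
           matrix(rnorm(n * d / 2, mean =  2), ncol = d))

lrc_dens <- function(xj, Bi, s2i) {            # component density (5.5)
  out <- 1
  for (k in 1:d) {
    xt  <- c(1, xj[seq_len(k - 1)])            # x-tilde_k = (1, x_1, ..., x_{k-1})^T
    out <- out * dnorm(xj[k], mean = sum(Bi[[k]] * xt), sd = sqrt(s2i[k]))
  }
  out
}
loglik <- function(pi_v, B, s2)                # log-likelihood (5.7)
  sum(log(sapply(1:n, function(j)
    sum(pi_v * sapply(1:g, function(i) lrc_dens(x[j, ], B[[i]], s2[i, ]))))))
tau <- function(pi_v, B, s2)                   # posterior probabilities (5.13)
  t(sapply(1:n, function(j) {
    w <- pi_v * sapply(1:g, function(i) lrc_dens(x[j, ], B[[i]], s2[i, ]))
    w / sum(w)
  }))

# Illustrative initialization.
pi_v <- c(0.5, 0.5)
B    <- replicate(g, lapply(1:d, function(k) rnorm(k, sd = 0.1)), simplify = FALSE)
s2   <- matrix(1, g, d)
ll0  <- loglik(pi_v, B, s2)

# Odd step: update pi via (5.16) and B via (5.17), holding sigma^2 fixed.
tt   <- tau(pi_v, B, s2)
pi_v <- colMeans(tt)
for (i in 1:g) for (k in 1:d) {
  xt    <- cbind(rep(1, n), x[, seq_len(k - 1), drop = FALSE])   # rows are x-tilde_jk
  alpha <- (abs(xt) + delta)^2 / rowSums((abs(xt) + delta)^2)
  resid <- x[, k] - drop(xt %*% B[[i]][[k]])
  B[[i]][[k]] <- B[[i]][[k]] +
    colSums(tt[, i] * xt * resid) / colSums(tt[, i] * xt^2 / alpha)
}

# Even step: update sigma^2 via (5.18), holding pi and B fixed.
tt <- tau(pi_v, B, s2)
for (i in 1:g) for (k in 1:d) {
  xt    <- cbind(rep(1, n), x[, seq_len(k - 1), drop = FALSE])
  resid <- x[, k] - drop(xt %*% B[[i]][[k]])
  s2[i, k] <- sum(tt[, i] * resid^2) / sum(tt[, i]) + 1e-10  # small variance floor (see Section 5.3.2)
}

c(before = ll0, after = loglik(pi_v, B, s2))  # the log-likelihood should not decrease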
The algorithm that we have devised is an instance of the block successive lower-bound maximization algorithms
of Razaviyayn et al. (2013). As such, we can apply Theorem 2 of Razaviyayn et al. (2013) to obtain the following
result regarding its limit points.
Theorem 5.3. Let ψ* be a limit point of the algorithm conducted via the steps ψ_2^{(m+1)} = ψ_2^{(m)}, (5.16), and (5.17), when m is odd; and ψ_1^{(m+1)} = ψ_1^{(m)} and (5.18), when m is even, for some initial vector ψ^{(0)}. If ψ* is a finite and regular point of (5.7), as defined by Razaviyayn et al. (2013), then ψ* is a stationary point of (5.7).
Theorem 5.3 shows that, given a suitable initial parameter vector ψ^{(0)}, the MM algorithm generates a sequence ψ^{(m)} that converges to a stationary point of (5.7); this is a good result considering the nonlinearity and multimodality of (5.7).
5.3.2 Covariance Constraints
We note that Theorem 5.3 requires that the limit points be finite values. This cannot always be guaranteed, since log L_{R,n}(ψ^{(m)}) → ∞ if any of the sequences σ_{ik}^{(m)2} → 0. This is equivalent to the problem of component covariance matrices Σ_i becoming singular in the natural parametrization (i.e. in L_n(θ)). In the natural parametrization, the usual approach is to restrict the component covariance matrices to be positive definite via conditioning on the eigenvalues of the matrices. Such approaches were pioneered in Hathaway (1985); examples of recent developments include Greselin and Ingrassia (2008), Ingrassia (2004), Ingrassia and Rocci (2007), and Ingrassia and Rocci (2011). We proceed to provide a simple alternative to the aforementioned approaches, based upon the LRC parametrization.

In the LRC, ensuring finite limit points amounts to guaranteeing that, for each i and k, σ_{ik}^{(m)2} → ξ_{ik} for some ξ_{ik} > 0. This can be implemented by adding a small ξ > 0 to the right-hand side of update (5.18) at each iteration. Through doing this, we ensure that each σ_{ik}^{*2} is positive, as well as retaining the monotonicity of the likelihood, for each update. The following result is then applicable.

Theorem 5.4. In (5.5), if σ_k² > 0 for each k, then the corresponding covariance matrix Σ, of the natural parametrization, is positive-definite.

Theorem 5.4 implies that the covariance matrices in the natural parametrization will always be positive-definite, if we apply the described process. We use ξ = 10^{−10} for all numerical applications presented in this article.
5.4 Statistical Properties
The consistency and asymptotic normality of ML estimators for GMMs under the natural parametrization have been proven in many instances; see, for example, Redner and Walker (1984), Hathaway (1985), and Atienza et al. (2007). We now seek the consistency of the ML estimators of the LRC of the GMM. Such a result can be obtained via Theorem 4.1.2 of Amemiya (1985).

Theorem 5.5. Let X_1, ..., X_n be independent and identically distributed random samples from a distribution with density f_R(x; ψ_0), and let Ψ_n be the set of roots of the equation ∂(log L_{R,n}(ψ))/∂ψ = 0, where Ψ_n = {0} if there are no roots. If ψ_0 is a strict local maximizer of E[log f_R(X; ψ)], then for any ε > 0,

lim_{n→∞} P( inf_{ψ∈Ψ_n} (ψ − ψ_0)^T (ψ − ψ_0) > ε ) = 0.

Theorem 5.5 is an adequate result, considering that the log-likelihoods of GMMs are often multimodal and unbounded, and that the MM algorithm is able to locate local maximizers when started from suitable values. We note that the result similarly holds when a lower bound is enforced for each of the variance limit points σ_{ik}^{*2}, as in Section 5.3.2.
We now seek to establish the asymptotic normality of the ML estimators. Upon making some assumptions (see
the proof in the Appendix), we are able to utilize Theorem 4.2.4 of Amemiya (1985) to get the following result.
Theorem 5.6. Under Assumption B4, the ML estimator ψ̂_n (as in Theorem 5.5) satisfies

√n (ψ̂_n − ψ_0) →_D N( 0, −E[ ∂² log f_R(x; ψ)/∂ψ∂ψ^T |_{ψ=ψ_0} ]^{−1} ).
Theorems 5.5 and 5.6 allow for inferences to be drawn from the ML estimators and their functions. For example, if x is an observation that arises from a distribution with density f_R(x; ψ_0), then we can compute the conditional probability of its latent component variable Z = i, given X = x, by τ_i(x; ψ_0) via an application of Bayes' rule. Furthermore, if ψ̂_n →_P ψ_0, then by continuous mapping, we have τ_i(x; ψ̂_n) →_P τ_i(x; ψ_0), for each i. Thus, the estimated allocation rule that assigns x into component ẑ ∈ {1, ..., g},

ẑ = arg max_{i∈{1,...,g}} τ_i(x; ψ̂_n),

is asymptotically correct (i.e. it asymptotically assigns x to the component which maximizes the a posteriori probability).
5.5 Numerical Simulations
To assess the performance of the MM algorithm proposed in Section 5.3.1, we simulated 2^{N−G} observations from each of 2^G Gaussian distributions of dimensionality 2^D, where D = 1, ..., 4, G = 1, ..., 4, and N = 15, 16, 17. These sample sizes were chosen since performance improvements are most relevant in large data sets.

In each scenario, each of the 2^G distribution means is randomly sampled from a Gaussian distribution with mean 0 and covariance matrix 20 × I, where I is the identity matrix of appropriate dimensionality. The covariance matrices of the Gaussian distributions are each sampled from a Wishart distribution with scale matrix I and 2^D + 2 degrees of freedom. An example of the D = 2, G = 3, and N = 15 case is shown in Figure 5.1.
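For concreteness, the following R sketch generates one such data set; a reduced setting (D = 2, G = 2, N = 10) and the particular seed are used purely for illustration.

# A small sketch generating one simulation configuration as described above;
# the reduced setting and the seed are illustrative.
set.seed(1)
D <- 2; G <- 2; N <- 10
d <- 2^D; g <- 2^G; n_per <- 2^(N - G)

mus    <- lapply(1:g, function(i) rnorm(d, mean = 0, sd = sqrt(20)))               # means ~ N(0, 20 I)
Sigmas <- lapply(1:g, function(i) rWishart(1, df = d + 2, Sigma = diag(d))[, , 1]) # Wishart(I, 2^D + 2)

x <- do.call(rbind, lapply(1:g, function(i) {
  z <- matrix(rnorm(n_per * d), n_per, d)            # standard Gaussian draws
  sweep(z %*% chol(Sigmas[[i]]), 2, mus[[i]], "+")   # transform to N(mu_i, Sigma_i)
}))
dim(x)   # 2^N observations of dimension 2^D in total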
Figure 5.1: Pairwise marginal plots of data generated from the D = 2, G = 3, and N = 15 case of the
simulations. The 8 colors indicate the different origins of each of the generated data points.
For each D, G, and N, 100 trials are simulated. In each trial, the MM algorithm is used to compute the ML estimates ψ̂_n for the LRC parameters. Here, the algorithm is terminated using criterion (5.21) with ε = 10^{−5}, and the computational time is recorded. The traditional EM algorithm (see Section 3.2 of McLachlan and Peel (2000)) is then used to compute the ML estimates θ̂_n for the natural parameters, using the same starting values as for the MM algorithm. The EM algorithm is terminated using the criterion

|log L_n(θ̂_n^{(m)}) − log L_{R,n}(ψ̂_n)| < ε,

using ε = 10^{−5}, and the computational time is recorded. The k-means algorithm was used to initialize parameters, as per Section 2.12 of McLachlan and Peel (2000); see MacQueen (1967) regarding the k-means algorithm.
The algorithms were applied via implementations in the R programming environment (version 3.0.2) on an Intel
Core i7-2600 CPU running at 3.40 GHz with 16 GB internal RAM, and the timing was conducted using the
proc.time function from said environment. The computational times, in seconds, for both algorithms were then
averaged over the trials, for each scenario, and the results are reported in Table 5.1. In Figure 5.2, we also plot
the average ratio of EM to MM algorithm computational times, for each scenario.
Upon inspection, Table 5.1 suggests that both the MM and the EM algorithms behave as expected, with regards
to the increases in computation times with respect to increases in D, G, and N . Furthermore, we notice that
in each scenario, the MM algorithm is faster than the EM algorithm on average. In the best case, the MM
algorithm is approximately 35 times faster than the EM (i.e. case D = 1, G = 2, and N = 15), and in the worst
case, the MM and EM are approximately at parity (i.e. case D = 4, G = 1, and N = 17).
In Figure 5.2, the performance of the MM algorithm over the EM decreases due to increases in D, G, and N ,
with D decreasing this gain more severely than the other two variables. This pattern may be explained by some
of the additional computation overhead of the MM algorithm. For instance, notice that the MM algorithm
requires the computation and storage of α_{jl} for each j, l, and k. The number of these computations increases quadratically in D, but only linearly in G and N. Due to this effect, we cannot recommend the MM algorithm
in all situations. However, it is noticeable that in the low D cases, the MM algorithm appears to be distinctly
faster than the EM.
We finally note that these results are dependent on the specific performances of the subroutines, algorithms,
and hardware. Thus, the results may vary when conducting performance tests under different settings. As such,
we believe that this simulation study serves to demonstrate the potential computational gains in R, rather than
to suggest gains in all settings.
5.6 Conclusions
In this article, we introduced the LRC of the GMM, and showed that there is a mapping between the LRC parameters and the natural parameters of a GMM. Using the LRC, we devised an MM algorithm for ML
Table 5.1: Results of the simulations for assessing the performance of the MM and EM algorithms. The columns N, D, and G indicate the simulation scenario, and the MM and EM columns display the average computational time, in seconds, of the respective algorithms (over the 100 trials). The Ratio column displays the average ratio between the EM and the MM algorithm computation times.

N   D  G     MM      EM    Ratio
15  4  4   8.36   11.92     1.43
15  4  3   3.83    5.98     1.66
15  4  2   1.67    3.00     1.80
15  4  1   1.22    1.52     1.25
15  3  4   3.96   10.64     2.78
15  3  3   1.73    5.39     3.50
15  3  2   0.64    2.72     4.35
15  3  1   0.39    1.47     3.84
15  2  4   2.60   13.18     5.14
15  2  3   1.05    7.24     8.91
15  2  2   0.23    3.07    14.20
15  2  1   0.12    1.51    13.00
15  1  4   1.75   29.13    16.62
15  1  3   0.64   14.01    25.09
15  1  2   0.10    2.95    34.93
15  1  1   0.05    1.46    31.99
16  4  4  20.01   24.17     1.21
16  4  3   8.48   12.08     1.50
16  4  2   3.54    6.06     1.72
16  4  1   2.27    3.05     1.35
16  3  4   7.83   21.26     2.81
16  3  3   3.19   10.96     3.81
16  3  2   1.10    5.34     4.86
16  3  1   0.68    2.69     3.97
16  2  4   4.85   25.99     5.36
16  2  3   2.00   12.59     7.43
16  2  2   0.56    4.98     9.38
16  2  1   0.28    2.51     9.38
16  1  4   3.69   56.46    15.00
16  1  3   1.11   19.70    20.01
16  1  2   0.25    6.48    29.53
16  1  1   0.11    2.97    28.57
17  4  4  42.20   48.55     1.15
17  4  3  19.11   24.30     1.34
17  4  2   8.25   12.20     1.48
17  4  1   5.83    6.18     1.06
17  3  4  17.43   43.60     2.58
17  3  3   6.32   21.98     3.93
17  3  2   2.12   10.69     5.06
17  3  1   1.31    5.38     4.13
17  2  4   9.31   48.43     5.23
17  2  3   3.67   25.38     8.10
17  2  2   0.91    9.73    10.69
17  2  1   0.54    5.47    10.38
17  1  4   6.80  121.18    17.51
17  1  3   2.85   45.24    18.90
17  1  2   0.38   10.22    28.00
17  1  1   0.20    5.71    29.56
Figure 5.2: Plots of the average ratio of EM to MM algorithm computational times. The panels are separated
into the G values of each scenario, and the N values are indicated by the line colors. Here, red, green, and blue
indicate the values N = 15, 16, 17, respectively.
estimation, which does not depend on matrix operations. We then proved that the MM algorithm monotonically
increases the log-likelihood in ML estimation, and that the sequence of estimators obtained from the algorithm is
convergent to a stationary point of the log-likelihood function, under regularity conditions. Through simulations,
we were able to demonstrate that the computational speed of the MM algorithm for the LRC parameter estimates
was faster than the traditional EM algorithm for estimating GMMs in some large data situations, when both
algorithms are implemented in the R programming environment.
We also proved that the ML estimators of the LRC parameters, like those of the natural parameters, are consistent and asymptotically normal. This allows for asymptotically valid statistical inference, such as using
the LRC of the GMM for clustering data. Furthermore, we showed that the LRC allows for a simple method
for handling singularities in the ML estimation of GMM parameters.
To the best of our knowledge, we are the first to apply the LRC for constructing a matrix operation-free
algorithm for estimating GMMs. In the future, we hope to extend our matrix operation-free approach to the
ML estimation of mixtures of t-distributions, as well as skew variants of the GMM.
Appendix
Proof of Theorem 5.1
We shall show the result by construction. Firstly, set

β_{10} = µ_1 and σ_1² = Σ_{11},   (5.22)

followed by

β_{k0} = µ_k − Σ_{k,1:k−1} Σ_{1:k−1,1:k−1}^{−1} µ_{1:k−1},   (5.23)

(β_{k1}, ..., β_{k,k−1}) = Σ_{k,1:k−1} Σ_{1:k−1,1:k−1}^{−1},   (5.24)

and

σ_k² = Σ_{kk} − Σ_{k,1:k−1} Σ_{1:k−1,1:k−1}^{−1} Σ_{k,1:k−1}^T,   (5.25)

for each k = 2, ..., d, in order, to get

β_k^T x̃_k = β_{k0} + (β_{k1}, ..., β_{k,k−1}) x_{1:k−1} = µ_k + Σ_{k,1:k−1} Σ_{1:k−1,1:k−1}^{−1} (x_{1:k−1} − µ_{1:k−1}) = µ_{k|1:k−1}(x_{1:k−1}),

and σ_k² = Σ_{k|1:k−1}.

Now, by Lemma 5.1, and by definition of conditional densities,

φ_1(x_1; µ_1, Σ_{11}) ∏_{k=2}^d φ_1(x_k; µ_{k|1:k−1}(x_{1:k−1}), Σ_{k|1:k−1}) = φ_d(x; µ, Σ),

for all x ∈ R^d, which implies λ(x; B, σ²) = φ_d(x; µ, Σ) by application of the mappings (5.22)–(5.25). Note that µ and vech(Σ), and B and σ², have equal numbers of elements, and (5.22)–(5.25) are unique for each k.
Thus, there is an injective mapping between the LRC and the natural parameters. The inverse mapping can
also be constructed by setting
µ_1 = β_{10} and Σ_{11} = σ_1²,   (5.26)

followed by

Σ_{k,1:k−1} = (β_{k1}, ..., β_{k,k−1}) Σ_{1:k−1,1:k−1},   (5.27)

Σ_{kk} = σ_k² + Σ_{k,1:k−1} Σ_{1:k−1,1:k−1}^{−1} Σ_{k,1:k−1}^T,   (5.28)

and

µ_k = β_{k0} + Σ_{k,1:k−1} Σ_{1:k−1,1:k−1}^{−1} µ_{1:k−1},   (5.29)

for each k = 2, ..., d, in order. The mappings (5.26)–(5.29) are also unique for each k, and thus constitute a surjective mapping.
Proof of Theorem 5.2
The first and last inequalities of (5.19) and (5.20) are due to the definition of minorization (i.e. (5.11) and (5.14) are of forms (5.9) and (5.10), respectively). The middle inequality of (5.19) is due to the concavity of Q'(ψ_1, ψ_2^{(m)}; ψ^{(m)}). This can be shown by firstly noting that

Σ_{i=1}^{g−1} Σ_{j=1}^n τ_i(x_j; ψ^{(m)}) log(π_i) + Σ_{j=1}^n τ_g(x_j; ψ^{(m)}) log( 1 − Σ_{i=1}^{g−1} exp[log(π_i)] )

is concave in log(π_i), since 1 − Σ_{i=1}^{g−1} exp[log(π_i)] is concave and log is an increasing concave function. Secondly, note that

∂²Q'(ψ_1, ψ_2^{(m)}; ψ^{(m)}) / ∂β_{ikl}² = −(1/(2σ_{ik}^{(m)2})) Σ_{j=1}^n x_{jl}² τ_i(x_j; ψ^{(m)}) / α_{jl}

is negative, and thus Q'(ψ_1, ψ_2^{(m)}; ψ^{(m)}) is concave with respect to each β_{ikl}, for each i, k, and l = 0, ..., k − 1. Thus, Q'(ψ_1, ψ_2^{(m)}; ψ^{(m)}) is the additive composition of concave functions and is therefore concave with respect to a bijection of ψ_1. Furthermore, the system of equations

∂Q'(ψ_1, ψ_2^{(m)}; ψ^{(m)}) / ∂ log(π_i) = 0,

for i = 1, ..., g − 1, has a unique root that is equivalent to update (5.16), which always satisfies the positivity restrictions on each π_i.

The middle inequality of (5.20) is due to the concavity of Q(ψ_1^{(m)}, ψ_2; ψ^{(m)}). This can be shown by noting that

−(1/2) log(σ_{ik}²) Σ_{j=1}^n τ_i(x_j; ψ^{(m)}) + (1/(2 exp[log σ_{ik}²])) Σ_{j=1}^n τ_i(x_j; ψ^{(m)}) Q_{2ijk}(β_{ik}^{(m)}; β_{ik}^{(m)})

is concave in log(σ_{ik}²), for each i and k, since the inverse of exp(x) is convex. Thus, Q(ψ_1^{(m)}, ψ_2; ψ^{(m)}) is concave with respect to a bijection of ψ_2. Furthermore, the system of equations

∂Q(ψ_1^{(m)}, ψ_2; ψ^{(m)}) / ∂ log(σ_{ik}²) = 0

has a unique root that is equivalent to update (5.18).
Proof of Theorem 5.3
This result follows from part (a) of Razaviyayn et al. (2013, Theorem 2), which assumes that Q'(ψ_1, ψ_2^{(m)}; ψ^{(m)}) and Q(ψ_1^{(m)}, ψ_2; ψ^{(m)}) both satisfy the definition of a minorizer, and are quasi-concave and have unique critical points, with respect to the parameters ψ_1 and ψ_2, respectively.

Firstly, the definition of a minorizer is satisfied via construction (i.e. (5.11) and (5.14) are of forms (5.9) and (5.10), respectively). Secondly, from the proof of Theorem 5.2, both functions are concave with respect to some bijective mappings, and are therefore quasi-concave under said mappings (see Section 3.4 of Boyd and Vandenberghe (2004) regarding quasi-concavity). Finally, since both functions are concave with respect to some bijective mapping, the critical points obtained must be unique.
Proof of Theorem 5.4
We show this result via induction. Firstly, using (5.26), we see that σ_1² = det(Σ_{11}) > 0 is the first leading principal minor of Σ, and is positive. Now, by definition of (5.28), σ_2² is the Schur complement of Σ_{1:k,1:k}, for k = 2, where

Σ_{1:k,1:k} = [ Σ_{1:k−1,1:k−1}  Σ_{k,1:k−1}^T ; Σ_{k,1:k−1}  Σ_{kk} ].   (5.30)

Since σ_2² is positive and Σ_{11} is positive-definite, we have the result that

det(Σ_{1:2,1:2}) = det(Σ_{11}) σ_2² > 0,

via the partitioning of the determinant. Thus, Σ_{1:2,1:2} is also positive-definite because both the first and second leading principal minors are positive.

Now, for each k = 3, ..., d, we assume that Σ_{1:k−1,1:k−1} is positive-definite. Since σ_k² > 0 is the Schur complement of the partitioning (5.30), we have the result that

det(Σ_{1:k,1:k}) = det(Σ_{1:k−1,1:k−1}) σ_k² > 0.

Thus, the kth leading principal minor is positive, for all k. The result follows by the property of positive-definite matrices; see Chapters 10 and 14 of Seber (2008) for all relevant matrix results.
Proof of Theorem 5.5
Theorem 5.5 can be established from Theorem 4.1.2 of Amemiya (1985), which requires the validation of the assumptions:

A1 The parameter space Ψ is an open subset of some Euclidean space.

A2 The log-likelihood log L_{R,n}(ψ) is a measurable function for all ψ ∈ Ψ, and ∂(log L_{R,n}(ψ))/∂ψ exists and is continuous in an open neighborhood N_1(ψ_0) of ψ_0.

A3 There exists an open neighborhood N_2(ψ_0) of ψ_0, where n^{−1} log L_{R,n}(ψ) converges to E[log f_R(X; ψ)] in probability, uniformly in ψ in any compact subset of N_2(ψ_0).

Assumptions A1 and A2 are fulfilled by noting that the parameter space Ψ = (0, 1)^{g−1} × R^{g(d²+d)/2+gd} is an open subset of R^{(g−1)+g(d²+d)/2+gd}, and that log L_{R,n}(ψ) is smooth with respect to the parameters ψ. Using Theorem 2 of Jennrich (1969), we can show that A3 holds by verifying that

E[ sup_{ψ∈N̄} |log f_R(X; ψ)| ] < ∞,   (5.31)

where N̄ is a compact subset of N_2(ψ_0). Since f_R(X; ψ) is smooth, this is equivalent to showing that E|log f_R(X; ψ)| < ∞, for any fixed ψ ∈ N̄. This is achieved by noting that

E|log f_R(X; ψ)| = E| log Σ_{i=1}^g π_i λ(X; B_i, σ_i²) |
               ≤ Σ_{i=1}^g E| log λ(X; B_i, σ_i²) |
               = Σ_{i=1}^g E| Σ_{k=1}^d log φ_1(X_k; β_{ik}^T X̃_k, σ_{ik}²) |
               ≤ Σ_{i=1}^g Σ_{k=1}^d E| log φ_1(X_k; β_{ik}^T X̃_k, σ_{ik}²) |.   (5.32)

The first inequality in (5.32) is due to Lemma 1 of Atienza et al. (2007). Since log φ_1(X_k; β_{ik}^T X̃_k, σ_{ik}²) is a polynomial function of Gaussian random variables, we have E|log φ_1(X_k; β_{ik}^T X̃_k, σ_{ik}²)| < ∞ for each i and k. The result then follows.
Proof of Theorem 5.6
Theorem 5.6 can be established from Theorem 4.2.4 of Amemiya (1985), which requires the validation of the assumptions:

B1 The Hessian ∂²(log L_{R,n}(ψ))/∂ψ∂ψ^T exists and is continuous in an open neighborhood of ψ_0.

B2 The equations

∫ ∂ log f_R(x; ψ)/∂ψ dx = 0 and ∫ ∂² log f_R(x; ψ)/∂ψ∂ψ^T dx = 0

hold, for any ψ ∈ Ψ.

B3 The averaged Hessian satisfies

n^{−1} ∂² log L_{R,n}(ψ)/∂ψ∂ψ^T →_P E[ ∂² log f_R(X; ψ)/∂ψ∂ψ^T ],

uniformly in ψ, in all compact subsets of an open neighborhood of ψ_0.

B4 The Fisher information

−E[ ∂² log f_R(x; ψ)/∂ψ∂ψ^T |_{ψ=ψ_0} ]

is positive-definite.

Assumption B1 is validated via the smoothness of log L_{R,n}(ψ), and it is mechanical to check the validity of B2. Assumption B3 can be shown via Theorem 2 of Jennrich (1969). Unlike the others, B4 must be taken as given.
Chapter 6
Discussions and Conclusions
6.1 Summary
This thesis has been written to introduce the many variants of FMMs and FMMs for regression data; to discuss the various aspects of analysis, estimation, and inference in such models; and to report on our various FMM-based constructions. The introduction in Chapter 1 was written to provide context and precedence to the main aim of this thesis, that is, to utilize and extend upon the literature on both FMMs and regression analysis, and to produce novel methods that utilize FMMs for the construction of heterogeneous regression models and for the analysis of regression outputs. We believe that this aim was achieved via the four new methods that we have constructed, which are reported on in Chapters 2–5. We shall now summarize the content of these four chapters.
In Chapter 2, we introduced the MSSR and MSSRDA models, for the modeling and clustering of heterogeneous
functions with rectangular support, and the classification of such functions, respectively. Our methodology
utilizes the regular spatial spline NBFs of Malfait and Ramsay (2003) and Sangalli et al. (2013) to produce rank-reduced representations of sparsely sampled functional data on a rectangular support. Using this representation,
we were able to construct the likelihood of a sample from a population of functions, in the style of Rice and Wu
(2001), under the framework of Gaussian LMMs; we termed this technique SSMM. Via an FMM argument, the
SSMM was extended to capture the effects of heterogeneous subpopulations via a mixture of Gaussian LMM
constructions, similar to those of Celeux et al. (2005), Luan and Li (2003), Ng et al. (2006), and James and
Sugar (2003). Lastly, a classification technique for functions on rectangular supports was constructed using the
mixture discriminant analysis framework of Hastie and Tibshirani (1996) in conjunction with the functional
discriminant framework of James and Hastie (2001); we named this technique MSSRDA.
The performances of MSSR and MSSRDA were assessed via a simulation study and an application to the ZIP
code data of Hastie et al. (2009) (which is a subset of the data from Le Cun et al. (1990)). Our simulation study
showed that MSSR was able to accurately estimate the mean functions from a heterogeneous population of
simulated functions, as well as satisfactorily cluster the samples into the correct family of functions. MSSRDA
was also demonstrably accurate in the task of classifying the sample functions into subpopulation classes. Using the
ZIP code data, we showed that MSSR was able to cluster and reconstruct the mean functions for the different
numeral classes from the data, and that MSSRDA was able to classify the data with comparable accuracy
to currently used discriminant methods. Furthermore, MSSR and MSSRDA were both shown to outperform
conventional methods, when applied to data with moderate to high levels of missingness.
In Chapter 3, we considered the problem of FDR control in voxel-based population MRI experiments, wherein
millions of spatially correlated hypotheses are tested simultaneously. Using the empirical Bayes approach of
Efron and Tibshirani (2002), we modeled the marginal distribution of the z-scores (obtained from the probit
transformation of the p-values) using a two-component GMM, as suggested by McLachlan et al. (2006); we
termed this the NC-FDR model. As an extension to McLachlan et al. (2006), we demonstrated that NC-FDR
was valid even in the context of correlated hypotheses, as long as the hypotheses have a limited range of
dependence. Furthermore, we proved that a thresholding method based on the a posteriori probabilities from
NC-FDR was able to asymptotically control the FDR at a given level.
The a posteriori probabilities from NC-FDR were then used to construct an MRF, which provided an explicit
model for the correlations between spatially neighboring hypotheses. The inference from the MRF and NC-FDR
were combined into the MRF-FDR technique for spatially coherent FDR control. We showed that all of the
relevant parameters in the MRF-FDR model could be consistently estimated, and that MRF-FDR was able to
conservatively control the FDR at a given level, asymptotically.
Using a simulation study, we were able to empirically demonstrate the improvement in performance of MRF-FDR against conventional methods such as the NC-FDR and the methods of Benjamini and Hochberg (1995) and Benjamini and Yekutieli (2001). Furthermore, an empirical study of the normative effects of aging on the
brain was conducted using the PATH data described in Anstey et al. (2012). Here, we found that MRF-FDR
was able to produce more spatially coherent groupings of significant voxels when compared to NC-FDR, and
that the inference from MRF-FDR was consistent with the general literature on brain aging.
In Chapter 4, we constructed a MoLE model based on the Laplace error model, which we named the LMoLE.
The LMoLE model can be considered as an extension to both the GMoLE of Jordan and Jacobs (1994) and the
Laplace MRM of Song et al. (2014). We showed that the LMoLE is related to the LAD regression criterion, and
demonstrated that any irreducible, ordered, and initialized LMoLE is identifiable via the results of Holzmann
et al. (2006) and Jiang and Tanner (1999b).
Using the MM algorithm, we showed that the likelihood function could be maximized monotonically, which is
a new result in the area of MoEs. Further, we also proved the consistency of the ML estimator obtained via the MM algorithm, and demonstrated this consistency in finite samples via a simulation study. Other simulation studies
were used to provide evidence for the robustness of the LMoLE against the GMoLE model, as well as the
usefulness of various information criteria, such as the BIC, for the determination of the number of components
in an LMoLE model. As a demonstration, a two-component LMoLE was fitted to the climate data of Hansen
et al. (2001), and the conclusions reached by the LMoLE model were consistent with those of the original study.
Lastly, in Chapter 5, we introduced the LRC of the GMM, which can be seen as a natural extension to the
Gaussian LCWM. Using the LRC, we constructed a matrix-operation-free MM algorithm for the ML estimation
of the GMM parameters. This MM algorithm is proven to monotonically increase the likelihood function, and to
be convergent to a stationary point of the log-likelihood function, under some regularity conditions. Furthermore,
we demonstrated that the LRC allows for a simple method for handling singularities in the ML estimation of
GMMs and thus provides an alternative to the methods of Hathaway (1985), Ingrassia (2004), and Ingrassia
and Rocci (2011). We then showed that the ML estimators of the LRC of the GMM, as obtained via the MM algorithm,
are consistent and asymptotically normal, thus providing an alternative proof to those of Hathaway (1985),
Redner and Walker (1984), and Van De Geer (1997). The MM algorithm was then demonstrated to be faster
than the EM algorithm for GMM parameter estimation in some practically relevant simulation studies, when both
algorithms are implemented in the R programming environment.
6.2 Discussions and Future Work
Although we sought to give a thorough account of the FMM literature in Chapter 1, it is nevertheless impossible
to provide an exhaustive account, due to the breadth and depth of the literature. We must admit to our bias
against the presentation of Bayesian methods and Bayesian approaches to mixture modeling. This was a
deliberate choice due to the fact that the methods developed in Chapters 2–5 are all Frequentist constructions;
although, one may view the FDR control techniques from Chapter 3 from the perspective of empirical-Bayesian
inference (see for instance Efron (2010), Efron and Tibshirani (2002), and Efron et al. (2001)).
There is a rich literature on Bayesian inference in FMMs, as noted in Frühwirth-Schnatter (2006), Chapter 4 of
McLachlan and Peel (2000) and Section 4.4 of Titterington et al. (1985). The Bayesian inference of FMMs has
been traditionally plagued by two key problems: the computational complexity of the simulation techniques
required, such as Markov chain Monte Carlo (MCMC), and the problem of label switching; for details regarding
MCMC, see Chapter 6 of Kroese et al. (2011), Robert and Casella (1999), and the references therein. These
two problems have been largely addressed in the modern statistical literature, both directly and indirectly.
Examples of the direct addressing of computational complexity in MCMC methods can be found in articles
such as Richardson and Green (1997) and Zhang et al. (2004), where reversible jump MCMC was used to
both estimate the parameters and determine the number of components of an FMM simultaneously; see also
Stephens (2000a) for a birth–death MCMC approach. A general discussions regarding label switching can be
found in Jasra et al. (2005), and examples of approaches to address the problem include random permutation
relabeling (Fruwirth-Schatter, 2001), lexicographic identifiability constraints (Jasra et al., 2005, Section 1.3),
risk-minimization over label permutations (Stephens, 2000b), and cluster-based relabeling (Yao and Lindsay,
2009), among many others; see Geweke (2007) for an argument as to why label switching is not a problem at
all.
Furthermore, the recent arrival of variational Bayesian inference in the statistical literature has the capacity to
address both of the aforementioned problems; see Ormerod and Wand (2010) for a recent tutorial, and Jaakkola
and Jordan (2000) and Rustagi (1976) for historical references within the statistical literature. Applications
of variational methods to the analysis of FMMs include the estimation of GMMs with a known number of
components (Teschendorff et al., 2005; Wang and Titterington, 2006), GMMs with an unknown number of
components (McGrory and Titterington, 2007; Ueda and Ghahramani, 2002), MoEs with an unknown number
of components (Ueda and Ghahramani, 2002), and t FMMs (Svensen and Bishop, 2005), among others.
Even further, we note that there is an ever-expanding literature regarding the Frequentist inference of Bayesian
estimates, such as the works of Diaconis and Freedman (1986), Efron (2015), Johnson (2013), and Little (2006).
We believe that such approaches, along with the new advances in Bayesian analysis, will combine to increase
the attractiveness of Bayesian inference of FMMs to Frequentist and mainstream statisticians.
Aside from Bayesian inference, there are a number of other interesting topics regarding FMMs and their variants
that we did not cover in Chapter 1, due to lack of direct relevance to the core of the thesis. Among these topics
are the mixture of factor analyzers (MFA) models for high-dimensional clustering, which come in many varieties,
such as the Gaussian MFA (McLachlan et al., 2003), the common-factor Gaussian MFA (Baek et al., 2010), and
the t MFA (McLachlan et al., 2007); see also Baek and McLachlan (2011) and Xie et al. (2010) for applications
of MFAs in bioinformatics. Also of interest are the many results regarding the geometry and modality of FMMs,
and the problem of variable selection in FMMs and related models; see Lindsay (1983), Lindsay (1995), Marriott
(2002), Ray and Lindsay (2005), and Ray and Ren (2012); and Khalili (2010), Khalili and Chen (2007), Khalili
and Lin (2013), Maugis et al. (2009), and Raftery and Dean (2006), respectively.
Next, we shall critically discuss the results of Chapters 2–5, and suggest possible future directions of research
that we may pursue with regards to each.
6.2.1
Chapter 2
Although a reasonable first step, the results of Chapter 2 are fairly incomplete as a generalization of the
classification and clustering works of James and Hastie (2001) and James and Sugar (2003) to the case of
functions on R2. The most obvious caveat of this work is the fact that we only consider rectangular subsets
of R2. Whereas it is sufficient to consider intervals on R, where all convex and compact sets are intervals, the
variety of potential convex and compact sets on R2 is much more abundant. As such, a natural extension to
the work of Chapter 2 is to consider convex polygons on R2, as suggested in Sangalli et al. (2013), which may
be directly inferred from data via Delaunay triangulations (see Chapter 9 of de Berg et al. (2008) and references
therein) or other related methods.
In the context of observing functions from multiple samples, the extension would introduce a new problem
that was not anticipated by Sangalli et al. (2013); namely, it is nontrivial to obtain a convex polygon that
is representative of the support of all of the samples, when there is no a priori knowledge of the shape. We
propose that this issue can be addressed via the following protocol. Firstly, we pool the support data from
all of the observed samples. Secondly, we remove the supporting points corresponding to the convex hull of
the pooled data. Thirdly, we use a clustering algorithm such as an FMM or k-means to cluster the remaining
support points, and lastly, we utilize the cluster centers and the convex hull points to construct a Delaunay
triangulation. Under regularity assumptions, and the supposition that the support points are randomly sampled
from a common distribution, we can infer that the triangulation obtained will have nodes at the location of the
underlying means.
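To make the protocol concrete, the following R sketch illustrates the four steps under some simplifying assumptions: each sample's support is stored as a two-column matrix of coordinates, k-means stands in for the FMM-based clustering, and the CRAN package geometry is assumed to be available for the Delaunay triangulation. The function and argument names are illustrative rather than prescriptive.

library(geometry)  # assumed available for delaunayn()

common_triangulation <- function(support_list, n_interior = 20) {
  # Step 1: pool the support points from all observed samples.
  pooled <- do.call(rbind, support_list)
  # Step 2: set aside the points on the convex hull of the pooled data.
  hull_idx <- chull(pooled)
  hull_pts <- pooled[hull_idx, , drop = FALSE]
  interior <- pooled[-hull_idx, , drop = FALSE]
  # Step 3: cluster the remaining support points (k-means is used here in
  # place of an FMM purely for brevity).
  centres <- kmeans(interior, centers = n_interior)$centers
  # Step 4: triangulate the cluster centres together with the hull points.
  nodes <- rbind(hull_pts, centres)
  list(nodes = nodes, triangles = delaunayn(nodes))
}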
Furthermore, we note that the NBFs used in Chapter 2 are linear and thus only provide a crude approximation
of the functional shapes. To address this, we may consider polynomial extensions of these NBFs, such as
those of Sangalli et al. (2013), or novel constructions based on the methods outlined in Chapter 5 of Braess
(2001).
More ambitiously, we may also consider extensions of our methods to the cases of surface topological manifolds,
and to hyperrectangles in Rd. The former of these suggestions has been progressed for the case of a single function
in Ettinger et al. (2012), where conformal mappings of such manifolds onto R2 were considered. However, to the
best of our knowledge, the latter of these suggestions has not been considered in the literature; the execution of
such an extension would require the definition and computation of NBFs on Rd , which may be expensive as d
increases. We note, however, that the extension to the case of d = 3 would yield a method for the whole-brain
analysis of functional MRI data.
Next, we shall discuss the statistical aspects of Chapter 2. Firstly, our analysis has focused exclusively on the
construction and estimation of a statistical model to represent the heterogeneous population of functions on
rectangular domains that are observed with some error. The inferences conducted on these models are performed
under the assumption that our estimation processes are correct and that our models are good approximations
of arbitrary functions on rectangular domains. The former of these assumptions can be assessed by checking the
asymptotic properties of our estimators. We believe that under some regularity conditions (e.g. assuming each
function is observed an equal number of times, and assuming that the underlying error model is Gaussian),
we may apply Theorem 1.21 to prove the required consistency and normality of our estimators. The latter
of these assumptions can be addressed by showing that the class of NBFs that we have considered is, in some
sense, dense over the class of continuous functions on R2 . Possible avenues for obtaining such a result are the
convergence theorems of Buhmann (2003) and Lange (1999, Chapter 9).
Finally, there is much scope to apply the methodology beyond the handwriting recognition problem that is
addressed in Chapter 2. For example, it is possible to use MSSR and MSSRDA for facial recognition (Zhao
et al., 2003) using publicly available databases such as SCface (Grgic et al., 2011), and to interpolate and
cluster images and textures in the style of Yu et al. (2012).
6.2.2
Chapter 3
In Chapter 3, we sought to improve upon the FMM based NC-FDR technique of McLachlan et al. (2006) by
way of an MRF construction over the obtained clustering results from NC-FDR, for the purpose of analyzing
data from MRI studies. As a first attempt, this methodology is functional, effective, and comparable
to currently used methods, as we showed via simulations and with an example application. However, there is
room for improvement here, with regards to the utilization of more recent techniques in both the noncontextual
and the spatial modeling phases.
The two component GMM that we used is adapted from that of McLachlan et al. (2006), with minor modification
in interpretation to allow for the analysis of correlated hypotheses. This model is among the earliest of schemes
for the empirical-Bayesian estimation of FDR and has a number of shortcomings. For example, it is possible
to observe paradoxical results whereby very large z-scores have high a posteriori probabilities of belonging
to the null class (cf. Qu et al. (2012)). Furthermore, the z-score distribution cannot always be fitted well by a
two component GMM, and the GMM suffers from identifiability issues (both theoretical and asymptotic) when
the empirical null is used in place of the theoretical null, especially when the null and alternative means are
close.
There has been some progress in the literature in addressing these issues. For example, McLachlan et al.
(2006) considered an alternative density that is itself a GMM to better fit the z-scores, Ghosal and Roy (2011)
and Bean et al. (2013) proposed skew Gaussian and skew Gaussian FMM alternative densities to provide better
goodness-of-fit, Qu et al. (2012) considered the use of t FMMs and spline-registered GMMs for the alternative
density to both address identifiability issues and avoid paradoxical results, and Efron (2004) utilized a nonparametric
empirical-Bayes approach to avoid declaring a form for the alternative density altogether. All of the examples
given can be explored as potential models for the noncontextual stage of our methodology.
Next, in the spatial stage of our method, we utilize a discrete MRF, in the style of Geman and Graffigne (1986),
over the clustering results from NC-FDR. We deliberately took this approach to exploit the simple MPLE
consistency results of Geman and Graffigne (1986) and Ji and Seymour (1996), since they were necessary to
establish a number of our results. We note that the same consistency results can be applied to an MPLE
estimator for a categorical conditional MRF, which means that we could instead construct a more complex relationship structure (e.g. cluster the NC-FDR result into multiple groups rather than simply null or alternative)
to produce a more nuanced representation of the relationship between neighboring voxels.
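As a simple illustration of the pseudolikelihood machinery involved, the following R sketch fits a binary autologistic MRF (a two-dimensional, first-order-neighborhood simplification of the models discussed above, not the exact construction of Chapter 3) by maximum pseudolikelihood. For this model the MPLE reduces to a logistic regression of each site's label on the sum of its neighbors' labels (cf. Besag, 1974); the function name and the choice of a 2D lattice are illustrative assumptions only.

# z: a binary matrix of labels from the noncontextual stage
# (1 = alternative, 0 = null), on a regular 2D lattice.
mple_autologistic <- function(z) {
  n <- nrow(z); m <- ncol(z)
  pad <- matrix(0, n + 2, m + 2)
  pad[2:(n + 1), 2:(m + 1)] <- z
  # Sum of the four first-order neighbours of each voxel.
  nbr <- pad[1:n, 2:(m + 1)] + pad[3:(n + 2), 2:(m + 1)] +
    pad[2:(n + 1), 1:m] + pad[2:(n + 1), 3:(m + 2)]
  # The pseudolikelihood factorizes over sites, so its maximization amounts
  # to a logistic regression of each label on its neighbourhood sum.
  glm(as.vector(z) ~ as.vector(nbr), family = binomial())
}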
Furthermore, we may also construct an MRF over the raw a posteriori probabilities using the results from
Comets (1992), or over the z-scores using the Gaussian MRF results in Dryden et al. (2002) and Rue and Held
(2005); the latter approach can be viewed as a Frequentist alternative to the Bayesian Gaussian MRF mixture
model approaches of Wei and Pan (2008) and Wei and Pan (2010). Additionally, we can combine both the
noncontextual and MRF phases of estimation into one, in the manner of Ng and McLachlan (2004), Qian and
Titterington (1989), and Qian and Titterington (1991) (see Section 13.8 of McLachlan (1992) for details); this
approach can also be viewed as a Frequentist version of the Bayesian discrete MRF mixture model of Wei and
Pan (2010).
Aside from the modeling considerations, there are various theoretical results left to consider. For example, it is
generally provable that controlling the FDR will simultaneously control the false negative rate; see for instance
Genovese and Wasserman (2002), Sun and Cai (2007), and Xie et al. (2011). We conjecture that such a result
would also be provable in our framework via the methods from Xie et al. (2011).
As noted in Section 3.7, we may also consider the incorporation of spatial information directly into the
clustering process. This can be achieved via MoE constructions and spline-based field registration such as in
Ashburner and Friston (2005), or by allowing the neighborhood effect sizes in the MRF to vary via a Bayesian
setup. Finally, there is much work available in the application of this methodology to new data, and in making
the algorithms and processes more readily available via software implementations.
6.2.3
Chapter 4
The presentation of the LMoLE model in Chapter 4 served two purposes; firstly, we sought to construct an MoLE
extension to the recent work of Song et al. (2014), and secondly to communicate the new MM algorithm step
for estimating the mixture proportion function. We believe that this second purpose is perhaps more important
than the first, since it is highly generalizable, and allows for the construction of likelihood-increasing MM
algorithms for many other MoE constructions.
Although Chapter 4 addressed various theoretical problems of concern, such as identifiability and consistency,
there are still a number of theoretical considerations to explore. For example, we may be able to obtain results
regarding the asymptotic normality of the ML estimates via the results for the asymptotics of LAD regressions
from Phillips (1991) and Pollard (1991). Furthermore, we may also be able to obtain some denseness results by
modifying the Theorems of Jiang and Tanner (1999a), for compatibility with the Laplace density. We also seek
to provide more theoretical evidence for the robustness of the LMoLE in comparison to the GMoLE. This may
be possible under the influence curve framework expounded in Huber and Ronchetti (2009); see also Hampel
(1974).
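As a heuristic sketch of why such robustness might be expected (the formal influence-curve analysis being future work), one may compare the log-density contributions of a single residual under the two error models, writing the Laplace density here as \(f_L\) (our notation, with scale \(b\)):
\[
\log \phi_1\!\left(y; \mu, \sigma^{2}\right) = -\tfrac{1}{2}\log\left(2\pi\sigma^{2}\right) - \frac{\left(y - \mu\right)^{2}}{2\sigma^{2}},
\qquad
\log f_{L}\!\left(y; \mu, b\right) = -\log\left(2b\right) - \frac{\left|y - \mu\right|}{b}.
\]
A large residual is thus penalized quadratically under the Gaussian model but only linearly under the Laplace model, so ML estimation under the latter corresponds to an LAD-type criterion whose gradient with respect to the location is bounded; this is the usual heuristic behind the robustness claim.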
Since the LMoLE is effectively the same as the GMoLE but with a Laplace error model instead of a Gaussian
one, it is possible to apply the LMoLE to any problem where the GMoLE is currently used. For instance,
it is possible to use the LMoLE to conduct clustering and discrimination of curves in the same manner as in
Chamroukhi et al. (2010) and Chamroukhi et al. (2013), respectively. However, unlike the model of Chamroukhi
et al. (2013), we cannot trivially extend our methodology to handle the modeling of multiple correlated series
simultaneously, although it may be possible to construct such a model using the multivariate generalization of
Eltoft et al. (2006) (see also Fang et al. (1990, Section 3.5)); these densities are generally difficult to work with
due to the modified Bessel function in their definitions.
Additionally, to model families of curves simultaneously, we may be able to construct mixed-effects versions of
the LMoLE in a similar manner to the model in Ng and McLachlan (2014). Unfortunately, unlike Gaussian
variates, the conditionals of a Laplace (or multivariate Laplace) density are not also Laplace. This problem can
be somewhat circumvented via the nonparametric mixed-effects approach of Aitkin (1999), as discussed in Section 1.3.4.
We may also consider an autoregressive version of the LMoLE, in the same vein as the autoregressive GMoLE
models of Carvalho and Tanner (2005), to better account for the correlation structures in time series data.
Lastly, as mentioned in the conclusion of Chapter 4, we can extend upon the theme of the method and construct
MoLE models based on the t error model and its related skewed variants, and incorporate the mixing
proportion of the MoLE model into nonparametric FMRM frameworks such as those of Hunter and Young
(2012) and Vandekerkhove (2013). This research direction has the potential to generate a large class of flexible
models for the analysis of nonlinear data across various disciplines.
6.2.4
Chapter 5
In Chapter 5, we considered the LRC characterization as a kind of alternative parametrization, using the
relationships between conditional densities and regression expressions in the characterization of random variates.
This representation can be seen as a generalization of the Gaussian LCWM approach of Ingrassia et al. (2012)
and Ingrassia et al. (2013). Recently, Punzo (2014) suggested the construction of a Gaussian polynomial CWM
of the form
\[
f\left(x, y; \theta\right) = \sum_{i=1}^{g} \pi_i \, \phi_d\left(x; \mu_i, \Sigma_i\right) \phi_1\!\left(y;\, \beta_{i0} + \beta_{1i}^{T} x + \beta_{2i}^{T} x^{2} + \dots + \beta_{mi}^{T} x^{m},\ \xi_i^{2}\right), \qquad (6.1)
\]
where \(x^{l} = \left(x_{1}^{l}, \dots, x_{d}^{l}\right)^{T}\) for \(l = 1, \dots, m\), \(\tilde{\beta}_{i} = \left(\beta_{i0}, \beta_{1i}^{T}, \beta_{2i}^{T}, \dots, \beta_{mi}^{T}\right)^{T}\), and
\[
\theta = \left(\pi_{1}, \dots, \pi_{g-1}, \mu_{1}^{T}, \dots, \mu_{g}^{T}, \mathrm{vech}^{T}\left(\Sigma_{1}\right), \dots, \mathrm{vech}^{T}\left(\Sigma_{g}\right), \tilde{\beta}_{1}^{T}, \dots, \tilde{\beta}_{g}^{T}, \xi_{1}^{2}, \dots, \xi_{g}^{2}\right)^{T}.
\]
We notice that there are two ways in which to extend upon model (6.1). Firstly, in (6.1), we may replace
\(\phi_1\!\left(y;\, \beta_{i0} + \beta_{1i}^{T} x + \beta_{2i}^{T} x^{2},\ \xi_i^{2}\right)\) by \(\phi_1\!\left(y;\, \eta\left(x; \psi\right),\ \xi_i^{2}\right)\), where \(\eta\left(x; \psi\right)\) is an arbitrary function of \(x\) that may be
controlled by the parameter vector \(\psi\). This approach would have applications in situations where the covariates \(X\)
have joint Gaussian densities that are localized with respect to each mixture component, and \(E\left(Y \mid X = x\right)\) has
a complex and nonlinear form with respect to \(x\).
Secondly, we may consider a polynomial regression characterization model with form
\[
\Pi\left(x; B, \sigma^{2}\right) = \prod_{k=1}^{d} \phi_1\!\left(x_{k};\, \beta_{1k}^{T} \tilde{x}_{k} + \dots + \beta_{mk}^{T} \tilde{x}_{k}^{m},\ \sigma_{k}^{2}\right), \qquad (6.2)
\]
where \(B = \left(\beta_{11}^{T}, \dots, \beta_{1d}^{T}, \dots, \beta_{m1}^{T}, \dots, \beta_{md}^{T}\right)^{T}\) and \(\tilde{x}_{k}^{l} = \left(x_{1}^{l}, \dots, x_{k-1}^{l}\right)^{T}\), for \(k = 2, \dots, d\) and \(l = 2, \dots, m\). It is easy
to see that model (6.2) contains the LRC (and thus the Gaussian density) as a special case, and mixtures of
(6.2) would contain model (6.1) as a special case. We believe that (6.2) may be useful in general nonlinear and
nonparametric model-based clustering, and especially in situations where the shapes of clusters are not convex;
see for example Beliakov and King (2006) and Mitra et al. (2003).
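As a small illustration of the product-of-conditionals structure of (6.2), the following R sketch evaluates such a density at a point x under the simplifying assumption of a first-order (m = 1) specification, so that each coordinate is regressed linearly on its predecessors; all names and argument conventions here are hypothetical.

# x: numeric vector of length d at which to evaluate the density.
# beta: list of length d; beta[[k]] holds the coefficients (intercept first)
#       of the regression of x_k on x_1, ..., x_{k-1}.
# sigma2: numeric vector of the d conditional variances.
regression_characterization <- function(x, beta, sigma2) {
  d <- length(x)
  # The first coordinate has no predecessors, so its conditional mean is
  # simply the intercept.
  dens <- dnorm(x[1], mean = beta[[1]][1], sd = sqrt(sigma2[1]))
  if (d >= 2) {
    for (k in 2:d) {
      mu_k <- beta[[k]][1] + sum(beta[[k]][-1] * x[1:(k - 1)])
      dens <- dens * dnorm(x[k], mean = mu_k, sd = sqrt(sigma2[k]))
    }
  }
  dens
}

A mixture of g such components, weighted by mixing proportions, would then give the kind of model-based clustering scheme alluded to above.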
Although we have found some evidence of computational improvements of the matrix-free MM algorithm
over the traditional EM algorithm for ML estimation, this evidence does not generalize across programming
languages, computers, and data generation setups. As such, some more theoretical evidence is needed. Although it is not
difficult to establish that the two algorithms have the same order of computational complexity, up to a constant,
it is unclear what the constants of complexity for the two algorithms are; we may utilize tools of algorithm
analysis such as those from Cormen et al. (2002) to answer this question. However, the complexity constants
may themselves be implementation specific and thus would provide no clearer resolution to the question of
computation improvement.
As well as an alternative algorithm for ML estimation, we also deduced a simple technique for mitigating
singularities in the ML estimation of GMMs by using the LRC. It is known that many component densities of
interest, such as the skew Gaussian, skew t, and t, have conditionally Gaussian characterizations; for example,
see Lee and McLachlan (2013) and Lee and McLachlan (2014) for such results. As such, the LRC can be
modified for use in such scenarios to mitigate singularities as well. To the best of our knowledge, there is no
formal study in the literature on the available techniques for bounding the likelihood in such cases.
6.3
Final Remarks
When the aim of a thesis is to further the frontiers of a research discipline, it is arguable whether or not
the thesis can be completed at all. There is always one more question to answer, one more example to give,
or one more facet to discuss, and thus there is always one more thing to write or to append to the thesis at any
given stage. However, although the thesis cannot be completed, it must be concluded, which is where I find
myself now.
FMMs have become a ubiquitous tool in data analysis, and there are now multitudes of FMM techniques for
a broad range of purposes. In this thesis, I have contributed to the collection of these techniques by developing
an FMM for the clustering and classification of surfaces, and for the spatially consistent control of FDR in
MRI studies. Furthermore, I have developed new algorithms for the robust estimation of MoLEs using the
Laplace error model, and the matrix-free estimation of GMMs. I believe that each of these developments has
the potential to lead to highly impactful research.
Although I have covered a large variety of applications in writing this thesis, there are still many areas of
research to explore. I am sure that what I have learnt from writing this thesis will serve me well in my future
endeavors.
Bibliography
Ahmad, K. E.-D. (1988). Identifiability of finite mixtures using a new transform. Annals of the Institute of
Statistical Mathematics, 40:261–265.
Ahmad, K. E.-D. and Al-Hussaini, E. K. (1982). Remarks on the non-identifiability of mixtures of distributions.
Annals of the Institute of Statistical Mathematics, 34:543–544.
Aitkin, M. (1999). A general maximum likelihood analysis of variance components in generalized linear models.
Biometrics, 55:117–128.
Akaike, H. (1974). A new look at the statistical model identification. IEEE Transactions on Automatic Control,
19:716–723.
Al-Hussaini, E. K. and Ahmad, K. E.-D. (1981). On the identifiability of finite mixtures of distributions. IEEE
Transactions on Information Theory, 27:664–668.
Allison, D. B., Gadbury, G. L., Heo, M., Fernandez, J. R., Lee, C.-K., Prolla, T. A., and Weindruch, R. (2002).
A mixture model approach for the analysis of microarray gene expression data. Computational Statistics and
Data Analysis, 39:1–20.
Amemiya, T. (1985). Advanced econometrics. Harvard University Press, Cambridge.
Anderson, T. W. (2003). An introduction to multivariate statistical analysis. Wiley, New York.
Andrews, D. W. K. (1992). Generic uniform convergence. Econometric Theory, 8:241–257.
Andrews, D. W. K. (1999). Estimation when a parameter is on a boundary. Econometrica, 67:1341–1383.
Andrews, D. W. K. (2001). Testing when a parameter is on the boundary of the maintained hypothesis.
Econometrica, 69:683–734.
Andrews, J. L. and McNicholas, P. D. (2013). Using evolutionary algorithms for model-based clustering. Pattern
Recognition Letters, 34:987–992.
Anstey, K. J., Christensen, H., Butterworth, P., Easteal, S., Mackinnon, A., Jacomb, T., Maxwell, K., Rodgers,
B., Windsor, T., Cherbuin, N., and Jorm, A. F. (2012). Cohort profile: The PATH through life project.
International Journal of Epidemiology, 41:951–960.
Arellano-Valle, R. B. and Bolfarine, H. (1995). On some characterizations of the t-distribution. Statistics and
Probability Letters, 25:79–85.
Arellano-Valle, R. B., Bolfarine, H., and Lachos, V. H. (2005). Skew-normal linear mixed models. Journal of
Data Science, 3:415–438.
Arnold, B. C. and Strauss, D. (1991). Pseudolikelihood estimation: some examples. Sankhya B, 53:233–243.
Ashburner, J. and Friston, K. J. (2005). Unified segmentation. NeuroImage, 26:839–851.
Ashford, J. R. and Walker, P. J. (1972). Quantal response analysis for a mixture of populations. Biometrics,
28:981–988.
Atienza, N., Garcia-Heras, J., Munoz-Pichardo, J. M., and Villa, R. (2007). On the consistency of MLE in finite
mixture models of exponential families. Journal of Statistical Planning and Inference, 137:496–505.
Babu, G. J., C, A. J., and Chaubey, Y. P. (2002). Application of Bernstein polynomials for smooth estimation
of a distribution and density function. Journal of Statistical Planning and Inference, 105:377–392.
Babu, G. J. and Chaubey, Y. P. (2006). Smooth estimation of a distribution and density function on a hypercube
using Bernstein polynomials for dependent random vectors. Statistics and Probability Letters, 76:959–969.
Baek, J. and McLachlan, G. J. (2011). Mixtures of common t-factor analyzers for clustering high-dimensional
microarray data. Bioinformatics, 27:1269–1276.
Baek, J., McLachlan, G. J., and Flack, L. (2010). Mixtures of factor analyzers with common factor loadings:
applications to the clustering and visualisation of high-dimensional data. IEEE Transactions on Pattern
Analysis and Machine Intelligence, 32:1298–1309.
Banfield, J. D. and Raftery, A. E. (1993). Model-based Gaussian and non-Gaussian clustering. Biometrics,
49:803–821.
Bates, D. M. and Watts, D. G. (1988). Nonlinear regression analysis and its applications. Wiley, New York.
Bean, G. J., Dimarco, E. A., Mercer, L. D., Thayer, L. K., Roy, A., and Ghosal, S. (2013). Finite skew-mixture
models for estimation of positive false discovery rates. Statistical Methodology, 10:46–57.
Becker, M. P., Yang, I., and Lange, K. (1997). EM algorithms without missing data. Statistical Methods in
Medical Research, 6:38–54.
Beliakov, G. and King, M. (2006). Density based fuzzy c-means clustering of non-convex patterns. European
Journal of Operational Research, 173:717–728.
Benaglia, T., Chauveau, D., and Hunter, D. R. (2009). An EM-like algorithm for semi- and non-parametric
estimation in multivariate mixtures. Journal of Computational and Graphical Statistics, 18:505–526.
Benjamini, Y. and Hochberg, Y. (1995). Controlling the false discovery rate: a practical and powerful approach
to multiple testing. Journal of the Royal Statistical Society Series B, 57:289–300.
Benjamini, Y. and Yerutieli, D. (2001). The control of the false discovery rate in multiple testing under
dependency. Annals of Statistics, 29:1165–1188.
Bennett, C. M., Baird, A. A., Miller, M. B., and Wolford, G. L. (2009a). Neural correlates of interspecies
perspective taking in the post-mortem Atlantic Salmon: An argument for multiple comparisons correction.
In 15th Annual meeting of the Organization for Human Brain Mapping.
Bennett, C. M., Wolford, G. L., and Miller, M. B. (2009b). The principled control of false positives in neuroimaging. Social Cognitive and Affective Neuroscience, 4:417–422.
Besag, J. (1974). Spatial interaction and the statistical analysis of lattice systems. Journal of the Royal Statistical
Society Series B, 36:192–236.
Bhattacharya, S. and McNicholas, P. D. (2014). A LASSO-penalized BIC for mixture model selection. Advances
in Data Analysis and Classification, 8:45–61.
Biernacki, C., Celeux, G., and Govaert, G. (2000). Assessing a Mixture Model for Clustering with the Integrated
Completed Likelihood. IEEE Transactions on Pattern Analysis and Machine Intelligence, 22:719–725.
Biernacki, C. and Govaert, G. (1997). Using the classification likelihood to choose the number of clusters.
Computer Science and Statistics, 29:451–457.
Bishop, C. M. (2006). Pattern recognition and machine learning. Springer, New York.
Blischke, W. R. (1964). Estimating the parameters of mixtures of binomial distributions. Journal of the
American Statistical Association, 59:510–528.
Bohning, D., Dietz, E., Schaub, R., Schlattmann, P., and Lindsay, B. G. (1994). The distribution of the
likelihood ratio for mixtures of densities from the one-parameter exponential family. Annals of the Institute
of Statistical Mathematics, 46:273–388.
Boos, D. D. and Stefanski, L. A. (2013). Essential statistical inference: theory and methods. Springer, New
York.
155
Botev, Z. and Kroese, D. P. (2004). Global likelihood optimization via the cross-entropy method with an
application to mixture models. In Proceedings of the Winter Simulation Conference.
Boulle, M. (2012). Functional data clustering via piecewise constant nonparametric density estimation. Pattern
Recognition, 45:4389–4401.
Boyd, S. and Vandenberghe, L. (2004). Convex optimization. Cambridge University Press, Cambridge.
Braess, D. (2001). Finite elements. Cambridge University Press, Cambridge.
Buhmann, M. D. (2003). Radial basis functions: theory and implementation. Cambridge University Press.
Burnham, K. P. and Anderson, D. R. (2002). Model selection and multimodel inference: a practical information-theoretic approach. Springer, New York.
Cadez, I. V., Smyth, P., McLachlan, G. J., and McLaren, C. E. (2002). Maximum likelihood estimation of
mixture densities for binned and truncated multivariate data. Machine Learning, 47:7–34.
Cai, J. F., Candes, E. J., and Shen, Z. (2010). A singular value thresholding algorithm for matrix completion.
SIAM Journal on Optimization, 20:1956–1982.
Carvalho, A. X. and Tanner, M. A. (2005). Mixtures-of-experts of autoregressive time series: asymptotic
normality and model specification. IEEE Transactions on Neural Networks, 16:39–56.
Celeux, G. and Govaert, G. (1992). A classification EM algorithm for clustering and two stochastic versions.
Computational Statistics and Data Analysis, 14:315–332.
Celeux, G. and Govaert, G. (1995). Gaussian parsimonious clustering models. Pattern Recognition, 28:781–793.
Celeux, G., Martin, O., and Lavergne, C. (2005). Mixture of linear mixed models for clustering gene expression
profiles from repeated microarray experiments. Statistical Modelling, 5:243–267.
Chamroukhi, F., Glotin, H., and Same, A. (2013). Model-based functional mixture discriminant analysis with
hidden process regression for curve classification. Neurocomputing, 112:153–163.
Chamroukhi, F., Same, A., Govaert, G., and Aknin, P. (2009). Time series modeling by a regression approach
based on a latent process. Neural Networks, 22:593–602.
Chamroukhi, F., Same, A., Govaert, G., and Aknin, P. (2010). A hidden process regression model for functional
data description. Application to curve discrimination. Neurocomputing, 73:1210–1221.
Chen, H. and Chen, J. (2001). Large sample distribution of the likelihood ratio test for normal mixtures.
Statistics and Probability Letters, 52:125–133.
Chen, H., Chen, J., and Kalbfleisch, J. D. (2001). A modified likelihood ratio test for homogeneity in finite
mixture models. Journal of the Royal Statistical Society Series B, 63:19–29.
Chen, J. (1998). Penalized likelihood-ratio test for finite mixture models with multinomial observations. Canadian Journal of Statistics, 26:583–599.
Chen, J. and Khalili, A. (2009). Order selection in finite mixture models with a nonsmooth penalty. Journal
of the American Statistical Association, 104:187–196.
Chen, J. and Li, P. (2009). Hypothesis test for normal mixture models: the EM approach. Annals of Statistics,
37:2523–2542.
Chen, K., Xu, L., and Chi, H. (1999). Improved learning algorithms for mixture of experts in multiclass
classification. Neural Networks, 12:1229–1252.
Cheng, R. C. H. and Liu, W. B. (2001). The consistency of estimators in finite mixture models. Scandinavian
Journal of Statistics, 28:603–616.
Cho, J. S. and Han, C. (2009). Testing for the mixture hypothesis of geometric distributions. Journal of
Economic Theory and Econometrics, 20:31–55.
Chumbley, J. R. and Friston, K. J. (2009). False discovery rate revisited: FDR and topological inference using
Gaussian random fields. NeuroImage, 44:62–70.
Claeskens, G. and Hjort, N. L. (2008). Model selection and model averaging. Cambridge University Press,
Cambridge.
Clarke, B., Fokoue, E., and Zhang, H. H. (2009). Principles and theory for data mining and machine learning.
Springer, New York.
Coffey, C. E., Saxton, J. A., Ratcliff, G., Bryan, R. N., and Lucke, J. F. (1999). Relation of education to brain
size in normal aging. Neurology, 53:189–196.
Collins, D. L., Zijdenbos, A. P., Baare, W. F. C., and Evans, A. C. (1999). ANIMAL+INSECT: improved
cortical structure segmentation. In IPMI Lecture Notes in Computer Science, volume 1613, pages 210–223.
Springer.
Comets, F. (1992). On consistency of a class of estimators for exponential families of Markov random fields.
Annals of Statistics, 20:455–468.
Cormen, T. H., Leiserson, C. E., Rivest, R. L., and Stein, C. (2002). Introduction to Algorithms. MIT Press.
Courchesne, E., Chisum, H. J., Townsend, J., Cowles, A., Covington, J., Egaas, B., Hinds, S., and Press,
G. A. (2000). Normal brain development and aging: quantitative analysis at in vivo MR imaging in healthy
volunteers. Neuroradiology, 216:672–682.
DasGupta, A. (2008). Asymptotic theory of statistics and probability. Springer, New York.
Day, N. E. (1969). Estimating the components of a mixture of normal distributions. Biometrika, 56:463–474.
Dayton, C. M. and Macready, G. B. (1988). Concomitant-variable latent-class models. Journal of the American
Statistical Association, 83:173–178.
de Berg, M., Cheong, O., van Kreveld, M., and Overmars, M. (2008). Computational geometry: algorithms and
applications. Springer.
de Boor, C. (1978). A practical guide to splines. Springer, New York.
De Veaux, R. D. (1989). Mixtures of linear regressions. Computational Statistics and Data Analysis, 8:227–245.
De Veaux, R. D. and Krieger, A. M. (1990). Robust estimation of a normal mixture. Statistics and Probability
Letters, 10:1–7.
Delaigle, A. and Hall, P. (2012). Achieving near-perfect classification for functional data. Journal of the Royal
Statistical Society Series B, 40:131–158.
Dempster, A. P., Laird, N. M., and Rubin, D. B. (1977). Maximum likelihood from incomplete data via the
EM algorithm. Journal of the Royal Statistical Society Series B, 39:1–38.
DeSarbo, W. S. and Cron, W. L. (1988). A maximum likelihood methodology for clusterwise linear regressions.
Journal of Classification, 5:249–282.
DeVore, R. A. and Lorentz, G. G. (1993). Constructive approximation. Springer.
Devroye, L., Gyorfi, L., and Lugosi, G. (1996). A probabilistic theory of pattern recognition. Springer, New
York.
Diaconis, P. and Freedman, D. (1986). On the consistency of Bayes Estimates. Annals of Statistics, 14:1–26.
Dryden, I., Ippoliti, L., and Romagnoli, L. (2002). Adjusted maximum likelihood and pseudo-likelihood estimation for noisy Gaussian Markov random fields. Journal of Computational and Graphical Statistics,
11:370–388.
Duchon, J. (1977). Splines minimizing rotation-invariant semi-norms in Sobolev spaces. In Schempp, W. and
Karl, Z., editors, Constructive Theory of Functions of Several Variables, volume 571 of Lecture Notes in
Mathematics, pages 85–100. Springer.
Duda, R. O., Hart, P. E., and Stork, D. G. (2001). Pattern classification. Wiley, New York.
Dudley, R. M. (1999). Uniform central limit theorems. Cambridge University Press, Cambridge.
Dudoit, S. and van der Laan, M. J. (2008). Multiple testing procedures with applications to genomics. Springer,
New York.
Efron, B. (2004). Large-scale simultaneous hypothesis testing: the choice of a null hypothesis. Journal of the
American Statistical Association, 99:96–104.
Efron, B. (2007a). Correlation and large-scale simultaneous significance testing. Journal of the American
Statistical Association, 102:93–103.
Efron, B. (2007b). Size, power and false discovery rates. Annals of Statistics, 35:1351–1377.
Efron, B. (2010). Large-scale inference. Cambridge University Press, Cambridge.
Efron, B. (2015). Frequentist accuracy of Bayesian estimates. Journal of the Royal Statistical Society Series B,
In Press.
Efron, B. and Tibshirani, R. (2002). Empirical Bayes methods and false discovery rates in microarrays. Genetic
Epidemiology, 23:70–86.
Efron, B., Tibshirani, R., Storey, J. D., and Tusher, V. (2001). Empirical Bayes Analysis of a Microarray
Experiment. Journal of the American Statistical Association, 96:1151–1160.
Eltoft, T., Kim, T., and Lee, T.-W. (2006). On the multivariate Laplace distribution. IEEE Signal Processing
Letters, 13:300–303.
Ettinger, B., Perotto, S., and Sangalli, L. M. (2012). Spatial regression models over two-dimensional manifolds.
Technical Report MOX-Report No. 54, Politecnico di Milano.
Everitt, B. S. and Hand, D. J. (1981). Finite mixture distributions. Chapman and Hall, London.
Fang, K. T., Kotz, S., and Ng, K. W. (1990). Symmetric multivariate and related distributions. Chapman and
Hall.
Farewell, V. T. (1982). The use of mixture models for the analysis of survival data with long-term survivors.
Biometrics, 38:1041–1046.
Farewell, V. T. (1986). Mixture models in survival analysis: are they worth the risk? Canadian Journal of
Statistics, 14:257–262.
Feng, Z. D. and McCulloch, C. E. (1996). Using bootstrap likelihood ratios in finite mixture models. Journal
of the Royal Statistical Society Series B, 58:609–617.
Ferrari, D. and Yang, Y. (2010). Maximum Lq-likelihood estimation. Annals of Statistics, 38:753–783.
Follman, D. A. and Lambert, D. (1989). Generalizing logistic regression by nonparametric mixing. Journal of
the American Statistical Association, 84:295–300.
Follman, D. A. and Lambert, D. (1991). Identifiability of finite mixtures of logistic regression models. Journal
of Statistical Planning and Inference, 27:375–381.
Fonov, V., Evans, A. C., Botteron, K., Almli, C. R., McKinstry, R. C., Collins, D. L., and the Brain Development
Cooperative Group (2011). Unbiased average age-appropriate atlases for pediatric studies. NeuroImage,
54:313–327.
Fonov, V. S., Evans, A. C., McKinstry, R. C., Almli, C. R., and Collins, D. L. (2009). Unbiased nonlinear
average age-appropriate brain templates from birth to adulthood. NeuroImage, 47:S102.
Fonseca, J. R. S. and Cardoso, M. G. M. S. (2007). Mixture-model cluster analysis using information theoretical
criteria. Intelligent Data Analysis, 11:155–173.
Forsythe, A. B. (1972). Robust estimation of straight line regression coefficients by minimizing pth power
deviations. Technometrics, 14:159–166.
Fraley, C. and Raftery, A. E. (1998). How many Clusters? Which clustering method? Answers via model-based
cluster analysis. Computer Journal, 41:578–588.
Fraley, C. and Raftery, A. E. (1999). MCLUST: software for model-based cluster analysis. Journal of Classification, 16:297–306.
Fraley, C. and Raftery, A. E. (2003). Enhanced model-based clustering, density estimation, and discriminant
analysis software: MCLUST. Journal of Classification, 20:263–286.
Franczak, B., Browne, R., and McNicholas, P. (2014). Mixtures of shifted asymmetric Laplace distributions.
IEEE Transactions on Pattern Analysis and Machine Intelligence, 36:1149–1157.
Friedman, J. H. (1991). Multivariate adaptive regression splines. The Annals of Statistics, 19:1–67.
Friston, K. J., Holmes, A. P., Worsley, K. J., Poline, J.-P., Frith, C. D., and Frackowiak, R. S. J. (1995).
Statistical parametric maps in functional imaging: a general linear approach. Human Brain Mapping, 2:189–
210.
Friston, K. J., Jezzard, P., and Turner, R. (1994). Analysis of functional MRI time-series. Human Brain
Mapping, 1:153–171.
Fruwirth-Schatter, S. (2001). Markov chain Monte Carlo estimation of classical and dynamic switching and
mixture models. Journal of the American Statistical Association, 96:194–209.
Fruwirth-Schatter, S. (2006). Finite mixture and Markov switching models. Springer, New York.
Galimberti, G. and Soffritti, G. (2014). A multivariate linear regression analysis using finite mixtures of t
distributions. Computational Statistics and Data Analysis, 71:138–150.
Gan, L. and Jiang, J. (1999). A test for global maximum. Journal of the American Statistical Association,
94:847–854.
Ganesalingam, S. and McLachlan, G. J. (1980). A comparison of the mixture and classification approaches to
cluster analysis. Communications in Statistics - Theory and Methods, 9:923–933.
Geman, S. and Graffigne, C. (1986). Markov random field image models and their applications to computer
vision. In Proceedings of the International Congress of Mathematicians, pages 1496–1517.
Genovese, C. and Wasserman, L. (2002). Operating characteristics and extensions of the false discovery rate
procedure. Journal of the Royal Statistical Society Series B, 64:499–517.
Genovese, C. R., Lazar, N. A., and Nichols, T. (2002). Thresholding of statistical maps in functional neuroimaging using the false discovery rate. NeuroImage, 15:870–878.
Gershenfeld, N. (1997). Nonlinear inference and cluster-weighted modeling. Annals of the New York Academy
of Sciences, 808:18–24.
Gershenfeld, N., Schoner, B., and Metois, E. (1999). Cluster-weighted modelling for time-series analysis. Nature,
397:329–332.
Geweke, J. (2007). Interpretation and inference in mixture models: Simple MCMC works. Computational
Statistics and Data Analysis, 51:3529–3550.
Ghidey, W., Lesaffre, E., and Eilers, P. (2004). Smooth random effects distribution in linear mixed model.
Biometrics, 60:945–953.
Ghosal, S. (2001). Convergence rates for density estimation with Bernstein polynomials. Annals of Statistics,
29:1264–1280.
Ghosal, S. and Roy, A. (2011). Identifiability of the proportion of null hypotheses in skew-mixture models for
the p-value distribution. Electronic Journal of Statistics, 5:329–341.
Ghosh, J. H. and Sen, P. K. (1985). On the asymptotic performance of the log likelihood ratio statistic for the
mixture model and related results. In Proceedings of the Berkeley Conference in Honour of Jerzy Neyman
and Jack Kiefer, volume 2, pages 789–806.
Gourieroux, C. and Monfort, A. (1995). Statistics and econometrics volume 2: testing, confidence regions, model
selection, and asymptotic theory. Cambridge University Press, Cambridge.
Green, P. J. and Silverman, B. B. (1994). Nonparametric regression and generalized linear models. Chapman
and Hall, London.
Greene, W. H. (2003). Econometric analysis. Prentice Hall.
Greselin, F. and Ingrassia, S. (2008). A note on constrained EM algorithms for mixtures of elliptical distributions.
In Proceedings of the 32nd Annual Conference of the German Classification Society.
Grgic, M., Delac, K., and Grgic, S. (2011). SCface – surveillance cameras face database. Multimedia Tools and
Applications, 51:863–879.
Grun, B. and Leisch, F. (2007). Fitting finite mixtures of generalized linear regressions in R. Computational
Statistics and Data Analysis, 51:5247–5252.
Grun, B. and Leisch, F. (2008). Flexmix version 2: finite mixtures with concomitant variables and varying and
constant parameters. Journal of Statistical Software, 28:1–35.
Gyorfi, L., Kohler, M., Krzyzak, A., and Walk, H. (2002). A distribution-free theory of nonparametric regression.
Springer, New York.
Hall, A. R. (2005). Generalized method of moments. Oxford University Press, Oxford.
Hall, D. B. and Wang, L. (2005). Two-component mixtures of generalized linear mixed effects models for cluster
correlated data. Statistical Modelling, 5:21–37.
Hall, P., Neeman, A., Pakyari, R., and Elmore, R. (2005). Nonparametric inference in multivariate mixtures.
Biometrika, 92:667–678.
Hall, P., Poskitt, D. S., and Presnell, B. (2001). A functional data-analytic approach to signal discrimination.
Technometrics, 43:1–9.
Hall, P. and Zhou, X.-H. (2003). Nonparametric estimation of component distributions in a multivariate mixture.
Annals of Statistics, 31:201–224.
Hampel, F. R. (1974). The influence curve and its role in robust estimation. Journal of the American Statistical
Association, 69:383–393.
Hansen, J., Ruedy, R., Sato, M., Imhoff, M., Lawrence, W., Easterling, D., Peterson, T., and Karl, T. (2001).
A closer look at United States and global surface temperature change. Journal of Geophysical Research,
106:23947–23963.
Hansen, L. P. (1982). Large sample properties of generalized method of moments estimators. Econometrica,
50:1029–1054.
Hartigan, J. A. (1975). Clustering algorithms. Wiley, New York.
Hartigan, J. A. (1985). Statistical theory in clustering. Journal of Classification, 2:63–76.
Hasselblad, V. (1969). Estimation of finite mixtures of distributions from the exponential family. Journal of
the American Statistical Association, 64:1459–1471.
Hastie, T. and Tibshirani, R. (1996). Discriminant analysis by Gaussian mixtures. Journal of the Royal
Statistical Society Series B, 58:155–176.
Hastie, T., Tibshirani, R., and Friedman, J. (2009). The elements of statistical learning. Springer, New York.
Hathaway, R. J. (1985). A constrained formulation of maximum-likelihood estimation for normal mixture
distributions. Annals of Statistics, 13:795–800.
Haughton, D. (1997). Packages for estimating finite mixtures: a review. The American Statistician, 51:194–205.
Heller, R., Stanley, D., Yekutieli, D., Rubin, N., and Benjamini, Y. (2006). Cluster-based analysis of FMRI
data. NeuroImage, 33:599–608.
Hennig, C. (2000). Identifiability of models for clusterwise linear regression. Journal of Classification, 17:273–
296.
Heyde, C. C. (1997). Quasi-likelihood and its applications: a general approach to optimal parameter estimation.
Springer, New York.
Ho, H. J. and Lin, T.-I. (2010). Robust linear mixed models using the skew t distribution with application to
schizophrenia data. Biometrical Journal, 52:449–469.
Holzmann, H., Munk, A., and Gneiting, T. (2006). Identifiability of finite mixtures of elliptical distributions.
Scandinavian Journal of Statistics, 33:753–763.
Hosmer, D. W. (1974). Maximum likelihood estimates of the parameters of a mixture of two regression lines.
Communications in Statistics, 3:995–1006.
Hougaard, P. (1991). Modelling heterogeneity in survival data. Journal of Applied Probability, 28:695–701.
Huang, C.-Y., Qin, J., and Zou, F. (2007). Empirical likelihood-based inference for genetic mixture models.
Canadian Journal of Statistics, 35:563–674.
Huber, P. J. (1964). Robust estimation of a location parameter. Annals of Mathematical Statistics, 35:73–101.
Huber, P. J. and Ronchetti, E. M. (2009). Robust statistics. Wiley, New York.
Hubert, L. and Arabie, P. (1985). Comparing partitions. Journal of Classification, 2:193–218.
Hughes, E. J., Bond, J., Svrckova, P., Makropoulos, A., Ball, G., Sharp, D. J., Edwards, A. D., Hajnal, J. V.,
and Counsell, S. J. (2012). Regional changes in thalamic shape and volume with increasing age. NeuroImage,
63:1134–1142.
Hunter, D. R. and Lange, K. (2004). A tutorial on MM algorithms. The American Statistician, 58:30–37.
Hunter, D. R. and Young, D. S. (2012). Semiparametric Mixtures of Regressions. Journal of Nonparametric
Statistics, 24:19–38.
Ingrassia, S. (1991). Mixture decomposition via the simulated annealing algorithm. Applied Stochastic Models
and Data Analysis, 7:317–325.
Ingrassia, S. (2004). A likelihood-based constrained algorithm for multivariate normal mixture models. Statistical
Methods and Applications, 13:151–166.
Ingrassia, S., Minotti, S. C., and Punzo, A. (2013). Model-based clustering via linear cluster-weighted models.
Computational Statistics and Data Analysis, In Press.
Ingrassia, S., Minotti, S. C., and Punzo, A. (2014). Model-based clustering via linear cluster-weighted models.
Computational Statistics and Data Analysis, 71:159–182.
Ingrassia, S., Minotti, S. C., and Vittadini, G. (2012). Local statistical modeling via a cluster-weighted approach
with elliptical distributions. Journal of Classification, 29:363–401.
Ingrassia, S. and Rocci, R. (2007). Constrained monotone EM algorithms for finite mixture of multivariate
Gaussians. Computational Statistics and Data Analysis, 51:5339–5351.
Ingrassia, S. and Rocci, R. (2011). Degeneracy of the EM algorithm for the MLE of multivariate Gaussian
mixtures and dynamic constraints. Computational Statistics and Data Analysis, 55:1714–1725.
Jaakola, T. S. and Jordan, M. I. (2000). Bayesian parameter estimation via variational methods. Statistics and
Computing, 10:25–37.
Jacobs, R. A., Jordan, M. I., Nowlan, S. J., and Hinton, G. E. (1991). Adaptive mixtures of local experts.
Neural Computation, 3:79–87.
Jain, A. K. (2010). Data clustering: 50 years beyond K-means. Pattern Recognition Letters, 31:651–666.
Jain, A. K., Murty, M. N., and Flynn, P. J. (1999). Data clustering: a review. ACM Computing Surveys,
31:264–323.
James, G. and Hastie, T. (2001). Functional linear discriminant analysis for irregularly sampled curves. Journal
of the Royal Statistical Society Series B, 63:533–550.
James, G. and Sugar, C. (2003). Clustering for sparsely sampled functional data. Journal of the American
Statistical Association, 98:397–408.
Jansen, R. C. (1993). Maximum likelihood in a generalized linear finite mixture model by using the EM
algorithm. Biometrics, 49:227–231.
Jasra, A., Holmes, C. C., and Stephens, D. A. (2005). Markov chain Monte Carlo methods and the label
switching problem in Bayesian Mixture Modeling. Statistical Science, 20:50–67.
Jedidi, K., Ramaswamy, V., and DeSarbo, W. S. (1996). On estimating finite mixtures of multivariate regression
and simultaneous equation models. Structural Equation Modeling, 3:266–289.
Jennrich, R. I. (1969). Asymptotic properties of non-linear least squares estimators. Annals of Mathematical
Statistics, 40:633–643.
Ji, C. and Seymour, L. (1996). A consistent model selection procedure for Markov random fields based on
penalized pseudolikelihood. Annals of Applied Probability, 6:423–443.
Ji, Y., Wu, C., Liu, P., Wang, J., and Coombes, K. R. (2005). Applications of beta-mixture models in bioinformatics. Bioinformatics, 21:2118–2122.
Jiang, J. (2007). Linear and generalized linear mixed models and their applications. Springer, New York.
Jiang, W. and Tanner, M. A. (1999a). Hierarchical mixtures-of-experts for exponential family regression models:
approximation and maximum likelihood estimation. Annals of Statistics, 27:987–1011.
Jiang, W. and Tanner, M. A. (1999b). On the approximation rate of hierarchical mixtures-of-experts for generalized linear models. Neural Computation, 11:1183–1198.
Jiang, W. and Tanner, M. A. (1999c). On the identifiability of mixture-of-experts. Neural Networks, 12:1253–
1258.
Jiang, W. and Tanner, M. A. (2000). On the asymptotic normality of hierarchical mixtures-of-experts for
generalized linear models. IEEE Transactions on Information Theory, 46:1005–1013.
Johnson, N. L., Kemp, A. W., and Kotz, S. (2005). Univariate discrete distributions. Wiley, New York.
Johnson, V. E. (2013). Uniformly most powerful Bayesian tests. Annals of Statistics, 41:1716–1741.
Jones, P. N. and McLachlan, G. J. (1990). Laplace-normal mixtures fitted to wind shear data. Journal of
Applied Statistics, 17:271–276.
Jones, P. N. and McLachlan, G. J. (1992). Fitting finite mixture models in a regression context. Australian
Journal of Statistics, 34:233–240.
Jordan, M. I. and Jacobs, R. A. (1994). Hierarchical mixtures of experts and the EM algorithm. Neural
Computation, 6:181–214.
Kamakura, W. A. (1991). Estimating flexible distributions of ideal-points with external analysis of preferences.
Psychometrika, 56:419–431.
Kamakura, W. A. and Russell, G. J. (1989). A probabilistic choice model for market segmentation and elasticity
structure. Journal of Marketing Research, 26:379–390.
Karlis, D. and Xekalaki, E. (2001). Robust inference for finite poisson mixtures. Journal of Statistical Planning
and Inference, 93:93–115.
Kaufman, L. and Rousseeuw, P. J. (1990). Finding groups in data: an introduction to cluster analysis. Wiley,
New York.
Keribin, C. (2000). Consistent estimation of the order of mixture models. Sankhya A, 62:49–65.
Khalili, A. (2010). New estimation and feature selection methods in mixture-of-experts models. Canadian
Journal of Statistics, 38:519–539.
Khalili, A. and Chen, J. (2007). Variable selection in finite mixture of regression models. Journal of the
American Statistical Association, 102:1025–1038.
Khalili, A. and Lin, S. (2013). Regularization in Finite Mixture of Regression Models with Diverging Number
of Parameters. Biometrics, 69:436–446.
Komarek, A. and Lesaffre, E. (2008). Generalized linear mixed model with a penalized Gaussian mixture as a
random effects distribution. Computational Statistics and Data Analysis, 52:3441–3458.
Kotz, S. and Nadarajah, S. (2004). Multivariate t distributions and their applications. Cambridge University
Press, Cambridge.
Kroese, D. P., Taimre, T., and Botev, Z. I. (2011). Handbook for Monte Carlo methods. Wiley.
Lachos, V. H., Ghosh, P., and Arellano-Valle, R. B. (2010). Likelihood based inference for skew-normal independent linear mixed models. Statistica Sinica, 20:303–322.
Laird, N. (1978). Nonparametric maximum likelihood estimation of a mixing distribution. Journal of the
American Statistical Association, 73:805–811.
Laird, N. M. and Ware, J. H. (1982). Random-effects models for longitudinal data. Biometrics, 38:963–974.
Lambert, D. (1992). Zero-inflated poisson regression, with an application to defects in manufacturing. Technometrics, 34:1–14.
Lange, K. (1999). Numerical analysis for statisticians. Springer.
Lange, K. (2013). Optimization. Springer, New York.
Larson, M. G. and Dinse, G. E. (1985). A mixture model for the regression analysis of competing risks data.
Journal of the Royal Statistical Society Series C, 34:201–211.
Le Cun, Y. L., Boser, B., Denker, J. S., Howard, R. E., Habbard, W., Jackel, L. D., and Henderson, D. (1990).
Handwritten digit recognition with a back-propagation network. In Touretzky, D., editor, Advances in neural
information processing systems 2, pages 396–404. Morgan Kaufman.
Lee, S. X. and McLachlan, G. J. (2013). On mixtures of skew normal and skew t-distributions. Advances in
Data Analysis and Classification, 7:241–266.
Lee, S. X. and McLachlan, G. J. (2014). Finite mixtures of multivariate skew t-distributions: some recent and
new results. Statistics and Computing, 24:181–202.
Leemput, K. V., Maes, F., Vandermeulen, D., and Suetens, P. (1999). Automated model-based tissue classification of MR images of the brain. IEEE Transactions on Medical Imaging, 18:897–908.
Lehmann, E. L. and Casella, G. (1998). Theory of point estimation. Springer, New York.
Leroux, B. G. (1992). Consistent estimation of a mixing distribution. Annals of Statistics, 20:1350–1360.
Li, M. and Zhang, L. (2008). Multinomial mixture model with feature selection for text clustering. Knowledge-Based Systems, 21:704–708.
Liang, K.-Y. and Zeger, S. L. (1986). Longitudinal data analysis using generalized linear models. Biometrika,
1:13–22.
Lindsay, B. G. (1983). The geometry of mixture likelihoods: a general theory. Annals of Statistics, 11:86–94.
Lindsay, B. G. (1995). Mixture models: theory, geometry and applications. In NSF-CBMS Regional Conference
Series in Probability and Statistics.
Lindsay, B. G. and Basak, P. (1993). Multivariate normal mixtures: a fast consistent method of moments.
Journal of the American Statistical Association, 88:468–476.
Little, R. J. (2006). Calibrated Bayes. The American Statistician, 60:213–223.
Liu, C. and Rubin, D. B. (1994). The ECME algorithm: a simple extension of EM and ECM with faster
monotone convergence. Biometrika, 81:633–648.
Luan, Y. and Li, H. (2003). Clustering of time-course gene expression data using a mixed-effects model with
B-splines. Bioinformatics, 19:474–482.
Luft, A. R., Skalej, M., Schulz, J. B., Welte, D., Kolb, R., Burk, K., Klockgether, T., and Voigt, K. (1999).
Patterns of age-related shrinkage in cerebellum and brainstem observed in vivo using three-dimensional mri
volumetry. Cerebral Cortex, 9:712–721.
Lutful Kabir, A. B. M. (1968). Estimation of parameters of a finite mixture of distributions. Journal of the
Royal Statistical Society Series B, 30:472–482.
Lwin, T. and Martin, P. J. (1989). Probits of Mixtures. Biometrics, 45:721–732.
MacQueen, J. (1967). Some methods for classification and analysis of multivariate observations. In Proceedings
of the fifth Berkeley symposium on mathematical statistics and probability.
Malfait, N. and Ramsay, J. O. (2003). The historical functional linear model. The Canadian Journal of Statistics,
31:115–128.
Margulis, E. L. (1993). Modelling documents with multiple poisson distributions. Information Processing and
Management, 29:215–227.
Maronna, R. A., Martin, R. D., and Yohai, V. J. (2006). Robust statistics: theory and methods. Wiley, New
York.
Marriott, P. (2002). On the local geometry of mixture models. Biometrika, 89:77–93.
Marron, J. S. and Wand, M. P. (1992). Exact mean integrated squared error. Annals of Statistics, 20:712–736.
Mase, S. (1995). Consistency of the maximum pseudo-likelihood estimator of continuous state space Gibbsian
processes. Annals of Applied Probability, 5:603–612.
Matheron, G. (1963). Principles of geostatistics. Economic Geology, 58:1246–1266.
Maugis, C., Celeux, G., and Martin-Magniette, M.-L. (2009). Variable selection for clustering with Gaussian
mixture models. Biometrics, 65:701–709.
McCulloch, C. E. and Searle, S. R. (2001). Generalized, linear, and mixed models. Wiley, New York.
McGrory, C. A. and Titterington, D. M. (2007). Variational approximation in Bayesian model selection for
finite mixture distributions. Computational Statistics and Data Analysis, 51:5352–5367.
McLachlan, G. J. (1982). The classification and mixture maximum likelihood approaches to cluster analysis. In
Krishnaiah, P. R. and Kanal, L., editors, Handbook of Statistics, volume 2. North-Holland.
McLachlan, G. J. (1987). On bootstrapping the likelihood ratio test statistic for the number of components in
a normal mixture. Journal of the Royal Statistical Society Series C, 36:318–324.
McLachlan, G. J. (1992). Discriminant analysis and statistical pattern recognition. Wiley, New York.
McLachlan, G. J. and Basford, K. E. (1988). Mixture models: inference and applications to clustering. Marcel
Dekker, New York.
McLachlan, G. J., Bean, R. W., and Ben-Tovim Jones, L. (2006). A simple implementation of a normal mixture
approach to differential gene expression in multiclass microarrays. Bioinformatics, 22:1608–1615.
McLachlan, G. J., Bean, R. W., and Ben-Tovim Jones, L. (2007). Extension of the mixture of factor analyzers
model to incorporate the multivariate t-distribution. Computational Statistics and Data Analysis, 51:5327–
5338.
McLachlan, G. J. and Jones, P. N. (1988). Fitting mixture models to grouped and truncated data via the EM
algorithm. Biometrics, 44:571–578.
McLachlan, G. J. and Krishnan, T. (2008). The EM algorithm and extensions. Wiley, New York, 2 edition.
McLachlan, G. J. and McGriffin, D. C. (1994). On the role of finite mixture models in survival analysis.
Statistical Methods in Medical Research, 3:211–226.
McLachlan, G. J. and Peel, D. (1998). Robust cluster analysis via mixtures of multivariate t-distributions. In
Amin, A., Dori, D., Pudil, P., and Freeman, H., editors, Lecture Notes in Computer Science, volume 1451.
Springer.
McLachlan, G. J. and Peel, D. (2000). Finite mixture models. Wiley, New York.
McLachlan, G. J., Peel, D., and Bean, R. W. (2003). Modelling high-dimensional data by mixtures of factor
analyzers. Computational Statistics and Data Analysis, 41:379–388.
McLachlan, G. J. and Rathnayake, S. (2014). On the number of components in a Gaussian mixture model.
WIREs Data Mining and Knowledge Discovery, 4:341–355.
Meng, X.-L. and Rubin, D. B. (1993). Maximum likelihood estimation via the ECM algorithm: a general
framework. Biometrika, 80:267–278.
Meng, X.-L. and van Dyk, D. (1997). The EM algorithm – an old folk-song sung to a fast new tune. Journal
of the Royal Statistical Society Series B, 59:511–567.
Meyer, D., Dimitriadou, E., Hornik, K., Weingessel, A., and Leisch, F. (2012). e1071: misc functions of the Department of Statistics (e1071), TU Wien.
Miller, R. G. (1981). Simultaneous statistical inference. Springer, New York.
Mitra, P., Pal, S. K., and Siddiqi, M. A. (2003). Non-convex clustering using expectation maximization algorithm
with rough set initialization. Pattern Recognition Letters, 24:863–873.
Muthen, B. and Shedden, K. (1999). Finite mixture modeling with mixture outcomes using the EM algorithm.
Biometrics, 55:463–469.
Nelder, J. A. and Mead, R. (1965). A simplex method for function minimization. Computer Journal, 7:308–313.
Nelder, J. A. and Wedderburn, R. W. M. (1972). Generalized linear models. Journal of the Royal Statistical
Society Series A, 135:370–384.
Ng, S.-K. and McLachlan, G. J. (2004). Speeding up the EM algorithm for mixture model-based segmentation
of magnetic resonance images. Pattern Recognition, 37:1573–1589.
Ng, S. K. and McLachlan, G. J. (2007). Extension of mixture-of-experts networks for binary classification of
hierarchical data. Artificial Intelligence in Medicine, 41:57–67.
Ng, S. K. and McLachlan, G. J. (2014). Mixture of random effects models for clustering multilevel growth
trajectories. Computational Statistics and Data Analysis, 71:43–51.
Ng, S. K., McLachlan, G. J., Wang, K., Jones, L. B.-T., and Ng, S.-W. (2006). A mixture model with random-effects components for clustering correlated gene-expression profiles. Bioinformatics, 22:1745–1752.
Nguyen, H. D., McLachlan, G. J., Janke, A. L., Cherbuin, N., Sachdev, P., and Anstey, K. J. (2013). Spatial false
discovery rate control for magnetic resonance imaging studies. In Proceedings of the International Conference
on Digital Image Computing: Techniques and Applications.
Nichols, T. and Hayasaka, S. (2003). Controlling the familywise error rate in functional neuroimaging: a
comparative review. Statistical Methods in Medical Research, 12:419–446.
Nocedal, J. and Wright, S. J. (2006). Numerical optimization. Springer, New York.
Norton, R. M. (1984). The double exponential distribution: using calculus to find a maximum likelihood
estimator. The American Statistician, 38:135–136.
Novovicova, J. and Malik, A. (2003). Application of multinomial mixture model to text classification. In Lecture
Notes in Computer Science, volume 2652. Springer.
Ormerod, J. T. and Wand, M. P. (2010). Explaining variational approximations. The American Statistician,
64:140–153.
Owen, A. B. (1988). Empirical likelihood ratio confidence intervals for a single functional. Biometrika, 75:237–
249.
Owen, A. B. (2001). Empirical likelihood. CRC Press, Boca Raton.
Pacifico, M. P., Genovese, C., Verdinelli, I., and Wasserman, L. (2004). False discovery control for random
fields. Journal of the American Statistical Association, 99:1002–1014.
Pan, W. and Shen, X. (2007). Penalized model-based clustering with application to variable selection. Journal
of Machine Learning Research, 8:1145–1164.
Pearson, K. (1894). Contributions to the mathematical theory of evolution. Philosophical Transactions of the
Royal Society of London A, 185:71–110.
Peel, D. and McLachlan, G. J. (2000). Robust mixture modelling using the t distribution. Statistics and
Computing, 10:335–344.
Peng, F., Jacobs, R. A., and Tanner, M. A. (1996). Bayesian inference in mixtures-of-experts and hierarchical
mixtures-of-experts models with an application to speech recognition. Journal of the American Statistical
Association, 91:953–960.
Pernkopf, F. and Bouchaffra, D. (2005). Genetic-based EM algorithm for learning Gaussian mixture models.
IEEE Transactions on Pattern Analysis and Machine Intelligence, 27:1344–1348.
Petrone, S. (2002). Random Bernstein polynomials. Scandinavian Journal of Statistics, 26:373–393.
Pfefferbaum, A., Lim, K. O., Zipursky, R. B., Mathalon, D. H., Rosenbloom, M. J., Lane, B., Ha, C. N., and
Sullivan, E. V. (1992). Brain gray and white matter volume loss accelerates with aging in chronic alcoholics:
a quantitative MRI study. Alcoholism: Clinical and Experimental Research, 16:1078–1089.
Pfefferbaum, A., Sullivan, E. V., Mathalon, D. H., and Lim, K. O. (1997). Frontal lobe volume loss observed
with magnetic resonance imaging in older chronic alcoholics. Alcoholism: Clinical and Experimental Research,
21:521–529.
Phillips, P. C. B. (1991). A shortcut to LAD estimator asymptotics. Econometric Theory, 7:450–463.
Phillips, R. F. (2002). Least absolute deviations estimation via the EM algorithm. Statistics and Computing,
12:281–285.
Pinheiro, J. C. and Bates, D. M. (2000). Mixed-effects models in S and S-PLUS. Springer, New York.
Pinheiro, J. C., Liu, C., and Wu, Y. N. (2001). Efficient algorithms for robust estimation in linear mixed-effects
models using the multivariate t distribution. Journal of Computational and Graphical Statistics, 10:249–276.
Pollard, D. (1991). Asymptotics for least absolute deviation regression estimators. Econometric Theory, 7:186–
199.
Punzo, A. (2014). Flexible mixture modelling with the polynomial Gaussian cluster-weighted model. Statistical
Modelling, In Press.
Pyne, S., Lee, S. X., Wang, K., Irish, J., Tamayo, P., Nazaire, M.-D., Duong, T., Ng, S.-K., Hafler, D., Levy, R.,
Nolan, G. P., Mesirov, J., and McLachlan, G. J. (2014). Joint modeling and registration of cell populations
in cohorts of high-dimensional flow cytometric data. PLOS One, 9:e100334.
Qian, W. and Titterington, D. M. (1989). On the use of Gibbs Markov chain models in the analysis of images
based on second-order pairwise interactive distributions. Journal of Applied Statistics, 16:267–281.
Qian, W. and Titterington, D. M. (1991). Multidimensional Markov chain models for image textures. Journal
of the Royal Statistical Society Series B, 53:661–674.
Qin, Y. and Priebe, C. E. (2013). Maximum Lq-likelihood estimation via the expectation–maximization algorithm: a robust estimation of mixture models. Journal of the American Statistical Association, 108:914–928.
Qu, L., Nettleton, D., and Dekkers, J. C. M. (2012). Improved estimation of the noncentrality parameter distribution from a large number of t-statistics, with applications to false discovery rate estimation in microarray
data analysis. Biometrics, 68:1178–1187.
Quandt, R. E. (1972). A new approach to estimating switching regressions. Journal of the American Statistical
Association, 67:306–310.
Quandt, R. E. and Ramsey, J. B. (1978). Estimating mixtures of normal distributions and switching regressions.
Journal of the American Statistical Association, 73:730–738.
R Core Team (2013). R: a language and environment for statistical computing. R Foundation for Statistical
Computing.
R Development Core Team (2013). R: a language and environment for statistical computing. R Foundation for
Statistical Computing, Vienna, Austria.
Raftery, A. E. and Dean, N. (2006). Variable selection for model-based clustering. Journal of the American
Statistical Association, 101:168–178.
Ramsay, J. O., Ramsay, T. O., and Sangalli, L. M. (2011). Spatial functional data analysis. In Ferraty, F.,
editor, Recent Advances in Functional Data Analysis and Related Topics, pages 269–275. Springer.
Ramsay, J. O. and Silverman, B. W. (1997). Functional data analysis. Springer, New York.
Ramsay, J. O. and Silverman, B. W. (2002). Applied functional data analysis: methods and case studies.
Springer, New York.
Ray, S. and Lindsay, B. G. (2005). The topography of multivariate normal mixtures. Annals of Statistics,
33:2042–2065.
Ray, S. and Ren, D. (2012). On the upper bound of the number of modes of a multivariate normal mixture.
Journal of Multivariate Analysis, 108:41–52.
Raz, N., Lindenberger, U., Rodrigue, K. M., Kennedy, K. M., Head, D., Williamson, A., Dahle, C., Gerstorf, D., and Acker, J. D. (2005). Regional brain changes in aging healthy adults: general trends, individual differences
and modifiers. Cerebral Cortex, 15:1676–1689.
Raz, N. and Rodrigue, K. M. (2006). Differential aging of the brain: patterns, cognitive correlates and modifiers.
Neuroscience and Biobehavioral Reviews, 30:730–748.
Razaviyayn, M., Hong, M., and Luo, Z.-Q. (2013). A unified convergence analysis of block successive minimization methods for nonsmooth optimization. SIAM Journal on Optimization, 23:1126–1153.
Redner, R. A. (1981). Note on the consistency of the maximum likelihood estimate for nonidentifiable distributions. Annals of Statistics, 9:225–228.
Redner, R. A. and Walker, H. F. (1984). Mixture densities, maximum likelihood and the EM algorithm. SIAM
Review, 26:195–239.
Rice, J. A. and Wu, C. O. (2001). Nonparametric mixed effects models for unequally sampled noisy curves.
Biometrics, 57:253–259.
Richardson, S. and Green, P. J. (1997). On Bayesian analysis of mixtures with an unknown number of components (with discussion). Journal of the Royal Statistical Society Series B, 59:731–792.
Rider, P. (1962). Estimating the parameters of mixed Poisson, binomial and Weibull distributions by method
of moments. Bulletin of the International Statistical Institute, 39:225–232.
Rigouste, L., Cappe, O., and Yvon, F. (2007). Inference and evaluation of the multinomial mixture model for
text clustering. Information Processing and Management, 43:1260–1280.
Ripley, B. D. (1996). Pattern recognition and neural networks. Cambridge University Press, Cambridge.
Robert, C. P. and Casella, G. (1999). Monte Carlo statistical methods. Springer.
Rossi, F. and Villa, N. (2006). Support vector machine for functional data classification. Neurocomputing,
69:730–742.
Rue, H. and Held, L. (2005). Gaussian Markov random fields: theory and applications. Chapman and Hall.
Ruppert, D., Wand, M. P., and Carroll, R. J. (2003). Semiparametric regression. Cambridge University Press,
Cambridge.
Rustagi, J. S. (1976). Variational methods in statistics. Academic Press, New York.
Salat, D. H., Buckner, R. L., Snyder, A. Z., Greve, D. N., Desikan, R. S. R., Busa, E., Morris, J. C., Dale,
A. M., and Fischl, B. (2004). Thinning of the cerebral cortex in aging. Cerebral Cortex, 14:721–730.
Same, A., Chamroukhi, F., Govaert, G., and Aknin, P. (2011). Model-based clustering and segmentation of
time series with change in regime. Advances in Data Analysis and Classification, 5:301–321.
Sangalli, L. M., Ramsay, J. O., and Ramsay, T. O. (2013). Spatial spline regression models. Journal of the
Royal Statistical Society Series B, 75:681–703.
Schlattmann, P. and Bohning, D. (1993). Mixture models and disease mapping. Statistics in Medicine, 12:1943–
1950.
Schwarz, G. (1978). Estimating the dimensions of a model. Annals of Statistics, 6:461–464.
Seber, G. A. F. (2008). A matrix handbook for statisticians. Wiley, New York.
Seeber, G. U. H. (1989). On the regression analysis of tumour recurrence rates. Statistics in Medicine, 8:1363–
1369.
Sled, J. G., Zijdenbos, A. P., and Evans, A. C. (1998). A nonparametric method for automatic correction of
intensity nonuniformity in MRI data. IEEE Transactions on Medical Imaging, 17:87–97.
Smith, C. D., Chebrolu, H., Wekstein, D. R., Schmitt, F. A., and Markesbery, W. R. (2007). Age and gender
effects on human brain anatomy: a voxel-based morphometric study in healthy elderly. Neurobiology of Aging,
28:1075–1087.
Song, P. X.-K., Zhang, P., and Qu, A. (2007). Maximum likelihood inference in robust linear mixed-effects
models using multivariate t distributions. Statistica Sinica, 17:929–943.
Song, W., Yao, W., and Xing, Y. (2014). Robust mixture regression model fitting by Laplace distributions.
Computational Statistics and Data Analysis, 71:128–137.
Sowell, E. R., Peterson, B. S., Thompson, P. M., Welcome, S. E., Henkenius, A. L., and Toga, A. W. (2003).
Mapping cortical change across the human life span. Nature Neuroscience, 6:309–315.
Stanford, C. D. and Raftery, A. E. (2002). Approximate Bayes factors for image segmentation: the pseudolikelihood information criterion (PLIC). IEEE Transactions on Pattern Analysis and Machine Intelligence,
24:1517–1520.
Stephens, M. (2000a). Bayesian analysis of mixture models with an unknown number of components—an
alternative to reversible jump methods. Annals of Statistics, 28:40–74.
Stephens, M. (2000b). Dealing with label switching in mixture models. Journal of the Royal Statistical Society
Series B, 62:795–809.
Storey, J. D. (2002). A direct approach to false discovery rates. Journal of the Royal Statistical Society Series
B, 64:479–498.
Sullivan, E. V., Rosenbloom, M., Serventi, K. L., and Pfefferbaum, A. (2004). Effects of age and sex on volumes
of the thalamus, pons, and cortex. Neurobiology of Aging, 25:185–192.
Sun, J., Kaban, A., and Garibaldi, J. M. (2010). Robust mixture clustering using Pearson type VII distribution.
Pattern Recognition Letters, 31:2447–2454.
Sun, W. and Cai, T. T. (2007). Oracle and adaptive compound decision rules for false discovery rate control.
Journal of the American Statistical Association, 102:901–912.
Svensen, M. and Bishop, C. M. (2005). Robust Bayesian mixture modelling. Neurocomputing, 64:235–252.
Teicher, H. (1961). Identifiability of mixtures. Annals of Mathematical Statistics, 32:244–248.
Teicher, H. (1963). Identifiability of finite mixtures. Annals of Mathematical Statistics, 34:1265–1269.
Teschendorff, A. E., Wang, Y., Barbosa-Morais, N. L., Brenton, J. D., and Caldas, C. (2005). A variational
Bayesian mixture modelling framework for cluster analysis of gene-expression data. Bioinformatics, 21:3025–
3033.
Tisserand, D. J., Pruessner, J. C., Sanz Arigita, E. J., van Boxtel, M. P. J., Evans, A. C., Jolles, J., and Uylings,
H. B. M. (2002). Regional frontal cortical volumes decrease differentially in aging: an MRI study to compare
volumetric approaches and voxel-based morphometry. NeuroImage, 17:657–689.
Titterington, D. M., Smith, A. F. M., and Makov, U. E. (1985). Statistical analysis of finite mixture distributions.
Wiley, New York.
Turkheimer, F. E., Smith, C. B., and Schmidt, K. (2001). Estimation of the number of "true" null hypotheses
in multivariate analysis of neuroimaging data. Neuroimage, 13:920–930.
Ueda, N. and Ghahramani, Z. (2002). Bayesian model search for mixture models based on optimizing variational
bounds. Neural Networks, 15:1223–1241.
Vaida, F. (2005). Parameter convergence for EM and MM algorithms. Statistica Sinica, 15:831–840.
Van De Geer, S. (1997). Asymptotic normality in mixture models. ESAIM: Probability and Statistics, 1:17–33.
Vandekerkhove, P. (2013). Estimation of a semiparametric mixture of regressions model. Journal of Nonparametric Statistics, 25:181–208.
Varin, C. (2008). On composite marginal likelihoods. Advances in Statistical Analysis, 92:1–28.
Varin, C., Reid, N., and Firth, D. (2011). An overview of composite likelihood methods. Statistica Sinica,
21:5–42.
Venables, W. N. and Ripley, B. D. (2002). Modern applied statistics with S. Springer, New York.
Verbeke, G. and Lesaffre, E. (1996). A linear mixed-effects model with heterogeneity in the random-effects
population. Journal of the American Statistical Association, 91:217–221.
Verbeke, G. and Molenberghs, G. (2000). Linear mixed models for longitudinal data. Springer, New York.
Wakefield, J. (1996). The Bayesian analysis of population pharmacokinetic models. Journal of the American
Statistical Association, 91:62–75.
Walhovd, K. B., Fjell, A. M., Reinvang, I., Lundervold, A., Dale, A. M., Eilertsen, D. E., Quinn, B. T., Salat,
D., Makris, N., and Fischl, B. (2005). Effects of age on volumes of cortex, white matter and subcortical
structures. Neurobiology of Aging, 26:1261–1270.
Wang, B. and Titterington, D. M. (2006). Convergence properties of a general algorithm for calculating variational Bayesian estimates for a normal mixture model. Bayesian Analysis, 1:625–650.
Wang, K., Ng, S. K., and McLachlan, G. J. (2012). Clustering of time-course gene expression profiles using
normal mixed models with autoregressive random effects. BMC Bioinformatics, 13:300.
Wasserman, L. (2004). All of statistics: a concise course in statistical inference. Springer, New York.
Webb, A. R. (2000). Gamma mixture models for target recognition. Pattern Recognition, 33:2045–2054.
Wedderburn, R. W. M. (1974). Quasi-likelihood functions, generalized linear models, and the Gauss-Newton
method. Biometrika, 61:439–447.
Wedel, M. (2002). Concomitant variables in finite mixture models. Statistica Neerlandica, 56:362–375.
Wedel, M. and DeSarbo, W. S. (1995). A mixture likelihood approach for generalized linear models. Journal of
Classification, 12:21–55.
Wedel, M., DeSarbo, W. S., Bult, J. R., and Ramaswamy, V. (1993). A latent class poisson regression model
for heterogeneous count data. Journal of Applied Econometrics, 8:397–411.
Wei, P. and Pan, W. (2008). Incorporating gene networks into statistical tests for genomic data via a spatially
correlated mixture model. Bioinformatics, 24:404–411.
Wei, P. and Pan, W. (2010). Network-based genomic discovery: application and comparison of Markov random field models. Journal of the Royal Statistical Society Series C, 59:105–125.
White, H. (1982). Maximum likelihood estimation of misspecified models. Econometrica, 50:1–25.
White, H. (1994). Estimation, Inference and Specification Analysis. Cambridge University Press.
Wiper, M., Insua, D. R., and Ruggeri, F. (2001). Mixtures of gamma distributions with applications. Journal
of Computational and Graphical Statistics, 10:440–454.
Wolfe, J. H. (1970). Pattern clustering by multivariate mixture analysis. Multivariate Behavioral Research,
5:329–350.
Wong, J. (2011). imputation: an R package. http://CRAN.R-project.org/package=imputation.
Wood, S. N., Bravington, M. V., and Hedley, S. L. (2008). Soap film smoothing. Journal of the Royal Statistical
Society Series B, 70:931–956.
Worsley, K. J. and Friston, K. J. (1995). Analysis of fMRI time-series revisited-again. NeuroImage, 2:173–181.
Worsley, K. J., Taylor, J. E., Tomaiuolo, F., and Lerch, J. (2004). Unified univariate and multivariate random
field theory. NeuroImage, 23:S189–S195.
Wu, C. F. J. (1983). On the convergence properties of the EM algorithm. Annals of Statistics, 11:95–103.
Xie, B., Pan, W., and Shen, X. (2010). Penalized mixtures of factor analyzers with application to clustering
high-dimensional microarray data. Bioinformatics, 26:501–508.
Xie, J., Cai, T. T., Maris, J., and Li, H. (2011). Optimal false discovery rate control for dependent data.
Statistics and Its Interface, 4:417–430.
Xu, L., Jordan, M. I., and Hinton, G. E. (1995). An alternative model for mixtures of experts. In Advances in
Neural Information Processing Systems 7. MIT Press.
Yakowitz, S. J. and Spragins, J. D. (1968). On the identifiability of finite mixtures. Annals of Mathematical
Statistics, 39:209–214.
Yamashita, K., Yoshiura, T., Hiwatash, A., Noguchi, T., Togao, O., Takayama, Y., Nagao, E., Kamano, H.,
Hatakenaka, M., and Honda, H. (2011). Volumetric asymmetry and differential aging effect of the human
caudate nucleus in normal individuals: a prospective MR imaging study. Journal of Neuroimaging, 21:34–37.
Yao, W. and Lindsay, B. G. (2009). Bayesian mixture labeling by highest posterior density. Journal of the
American Statistical Association, 104:758–767.
Yao, W., Wei, Y., and Yu, C. (2014). Robust mixture regression using the t-distribution. Computational
Statistics and Data Analysis, 71:116–127.
Yau, K. K. W., Lee, A. H., and Ng, A. S. K. (2003). Finite mixture regression model with random effects:
application to neonatal hospital length of stay. Computational Statistics and Data Analysis, 41:359–366.
Yu, G., Sapiro, G., and Mallat, S. (2012). Solving inverse problems with piecewise linear estimators: from
Gaussian mixture models to structured sparsity. IEEE Transactions on Image Processing, 21:2481–2499.
Yuksel, S. E., Wilson, J. N., and Gader, P. D. (2012). Twenty years of mixture of experts. IEEE Transactions
on Neural Networks and Learning Systems, 23:1177–1193.
Zeevi, A. J., Meir, R., and Maiorov, V. (1998). Error bounds for functional approximation and estimation using
mixtures of experts. IEEE Transactions on Information Theory, 44:1010–1025.
Zhang, Y., Brady, M., and Smith, S. (2001). Segmentation of brain MR images through a hidden Markov
random field model and the expectation–maximization algorithm. IEEE Transactions on Medical Imaging,
20:45–57.
Zhang, Z., Chan, K. L., Wu, Y., and Chen, C. (2004). Learning a multivariate Gaussian mixture model with
the reversible jump MCMC algorithm. Statistics and Computing, 14:343–355.
Zhao, W., Chellappa, R., Phillips, P. J., and Rosenfeld, A. (2003). Face recognition: a literature survey. ACM
Computing Surveys, 35:399–458.
Zhou, H. and Lange, K. (2010). MM algorithms for some discrete multivariate distributions. Journal of
Computational and Graphical Statistics, 19:645–665.
Zhou, H. and Zhang, Y. (2012). EM vs MM: a case study. Computational Statistics and Data Analysis,
56:3909–3920.
Zhou, T. and He, X. (2008). Three-step estimation in linear mixed models with skew-t distributions. Journal
of Statistical Planning and Inference, 138:1542–1555.
Zhu, H.-T. and Zhang, H. (2004). Hypothesis testing in mixture regression models. Journal of the Royal
Statistical Society Series B, 66:3–16.
Zou, F., Fine, J. P., and Yandell, B. S. (2002). On empirical likelihood for a semiparametric mixture model.
Biometrika, 89:61–75.