
A CLASS OF MULTIVARIATE SKEW DISTRIBUTIONS: PROPERTIES
AND INFERENTIAL ISSUES
Deniz Akdemir
A Dissertation
Submitted to the Graduate College of Bowling Green
State University in partial fulfillment of
the requirements for the degree of
DOCTOR OF PHILOSOPHY
May 2009
Committee:
Arjun K. Gupta, Advisor
James M. McFillen,
Graduate Faculty Representative
John T. Chen
Hanfeng Chen
ABSTRACT
Arjun K. Gupta, Advisor
Flexible parametric distribution models that can represent both skewed and symmetric distributions, namely skew symmetric distributions, can be constructed by skewing symmetric kernel densities with weighting distributions. In this dissertation, we study a multivariate skew family that has either a centrally symmetric or a spherically symmetric kernel. Specifically, we define multivariate skew symmetric forms of the uniform, normal, Laplace, and logistic distributions by using the cdf's of the same distributions as weighting distributions. Matrix variate extensions of these distributions are also introduced herein. To bypass the unbounded likelihood problem related to inference about this model, we propose an estimation procedure based on the maximum product of spacings method. This idea also leads to bounded model selection criteria that can be considered as alternatives to Akaike's and other likelihood based criteria when the unbounded likelihood may be a problem. Applications of skew symmetric distributions to data are also considered.
AMS 2000 subject classification: Primary 60E05, Secondary 62E10
Key words and phrases: Multivariate Skew-Symmetric Distribution, Matrix Variate
Skew-Symmetric Distribution, Inference, Maximum Product of Spacings, Unbounded Likelihood, Model Selection Criterion
To little Pelin
ACKNOWLEDGMENTS
This is a great opportunity to express my respect and thanks to Prof. Arjun K. Gupta
for his continuous support and invaluable input for this work. It was a pleasure to be his
student. I am also grateful to my wife, my parents and my children.
Table of Contents

CHAPTER 1: Notation, Introduction and Preliminaries
  1.1 Notation
  1.2 Preliminaries
    1.2.1 Vectors and Matrices
    1.2.2 Random Vectors and Matrices
    1.2.3 Univariate, Multivariate, Matrix Variate Normal Distribution
    1.2.4 Univariate Normal Distribution
    1.2.5 Multivariate Normal Distribution
    1.2.6 Matrix Variate Normal Distribution
    1.2.7 Symmetric Distributions
    1.2.8 Centrally Symmetric Distributions
    1.2.9 Spherically Symmetric Distributions
    1.2.10 Location-Scale Families from Spherically Symmetric Distributions

CHAPTER 2: A Class of Multivariate Skew Symmetric Distributions
  2.1 Background: Multivariate Skew Normal Distributions
  2.2 Skew-Centrally Symmetric Densities
    2.2.1 Independent Observations, Location-Scale Family
  2.3 Multivariate Skew-Normal Distribution
    2.3.1 Motivation: Skew-Spherically Symmetric Densities
    2.3.2 Multivariate Skew-Normal Distribution
    2.3.3 Generalized Multivariate Skew-Normal Distribution
    2.3.4 Stochastic Representation, Random Number Generation
  2.4 Extensions: Some Other Skew Symmetric Forms

CHAPTER 3: A Class of Matrix Variate Skew Symmetric Distributions
  3.1 Distribution of Random Sample
  3.2 Matrix Variate Skew Symmetric Distribution
  3.3 Matrix Variate Skew-Normal Distribution
  3.4 Generalized Matrix Variate Skew Normal Distribution
  3.5 Extensions

CHAPTER 4: Estimation and Some Applications
  4.1 Maximum products of spacings (MPS) estimation
  4.2 A Bounded Information Criterion for Statistical Model Selection
  4.3 Estimation of parameters of sn_k(α, µ, Σ)
    4.3.1 Case 1: MPS estimation of sn_1(α, 0, 1)
    4.3.2 Case 2: Estimation of µ, Σ_c^{1/2} given α
    4.3.3 Case 3: Simultaneous estimation of µ, Σ_c^{1/2} and α

BIBLIOGRAPHY

List of Figures
  1.1 Centrally symmetric t distribution
  2.1 Surface of skew uniform density in two dimensions
  2.2 Surface of skew uniform density in two dimensions
  2.3 Surface of skew uniform density in two dimensions
  2.4 The contours for 2-dimensional skew normal pdf
  2.5 The surface for 2-dimensional pdf for skew normal variable
  2.6 The surface for 2-dimensional pdf for skew Laplace variable
  2.7 The surface for 2-dimensional pdf for skew-normal variable
  2.8 The surface for 2-dimensional pdf for skew-normal variable
  2.9 The surface for 2-dimensional pdf for skew-normal variable
  2.10 The contours for 2-dimensional pdf for skew-normal variable
  2.11 The surface for 2-dimensional pdf for skew-normal variable
  2.12 The contours for 2-dimensional pdf for skew-normal variable
  4.1 Frontier data
  4.2 Histogram of 5000 simulated values for α̂ by MPS
  4.3 1150 height measurements
  4.4 The scatter plot for 100 generated observations
  4.5 The scatter plot for LBM and BMI in Australian Athletes Data
  4.6 The scatter plot for LBM and SSF in Australian Athletes Data

List of Tables
  4.1 MPSIC, MPSIC_2, and MPSIC_3 are compared
  4.2 MPS estimation of α
  4.3 ML estimation of α
  4.4 Estimators of α, µ and σ
  4.5 The simulation results about µ̂ and α̂
  4.6 The simulation results about Σ̂_c^{1/2}
CHAPTER 1
Notation, Introduction and Preliminaries
1.1 Notation
We denote matrices by capital letters, vectors by small bold letters, and scalars by small letters. We usually use letters from the beginning of the alphabet to denote constants; sometimes we use Greek letters. For random variables we choose letters from the end of the alphabet. We do not follow the convention of making a distinction between a random variable and its values. The following notation is also used in this dissertation:
R^k: k-dimensional real space.
(A)_i.: the ith row vector of the matrix A.
(A)_.j: the jth column vector of the matrix A.
(A)_ij: the element located at the ith row and jth column of matrix A.
A′: transpose of matrix A.
A⁺: Moore-Penrose inverse of matrix A.
|A|: determinant of square matrix A.
tr(A): trace of square matrix A.
etr(A): exp(tr(A)) for square matrix A.
A^{1/2}: symmetric square root of symmetric matrix A.
A_c^{1/2}: square root of positive definite matrix A by Cholesky decomposition.
I_k: k-dimensional identity matrix.
S_k: the surface of the unit sphere in R^k.
0_{k×n}: k × n dimensional matrix of zeros.
e_j (or sometimes c_j): a vector that has zero elements except for one 1 at its jth row.
1_k: k-dimensional vector of ones.
0_k: k-dimensional vector of zeros.
x ∼ F: x is distributed as F.
x =d y: x and y have the same distribution.
N_k(µ, Σ^{1/2}): k-dimensional normal distribution with mean µ and covariance Σ.
φ_k(x, µ, Σ): density of N_k(µ, Σ^{1/2}) distribution evaluated at x.
Φ(x, µ, σ): cumulative distribution function of N_1(µ, σ) distribution evaluated at x.
φ_k(x): density of N_k(0, I_k) distribution evaluated at x.
Φ(x): cumulative distribution function of N_1(0, 1) distribution evaluated at x.
Φ_k(x, µ, Σ): cumulative distribution function of N_k(µ, Σ^{1/2}) distribution evaluated at x.
Φ_k(x): cumulative distribution function of N_k(0, I_k) distribution evaluated at x.
χ²_k: central χ² distribution with k degrees of freedom.
χ²_k(δ): non-central χ² distribution with k degrees of freedom and non-centrality parameter δ.
W_k(n, Σ): Wishart distribution with parameters n and Σ.
1.2 Preliminaries
In this section, we provide certain definitions and results involving vectors and matrices as well as random vectors and matrices. Most of these results are stated here without proof; proofs can be found in ([4], [25]) or in most other textbooks on matrix algebra and multivariate statistics.
1.2.1 Vectors and Matrices
A k-dimensional real vector a is an ordered array of real numbers a_i, i = 1, 2, ..., k, organized in a single column, written as a = (a_i). The set of all k-dimensional real vectors is denoted by R^k. A real matrix A of dimension k × n is an ordered rectangular array of real numbers a_ij arranged in rows i = 1, 2, ..., k and columns j = 1, 2, ..., n, written as A = (a_ij).

Two matrices A and B of the same dimension are equal if a_ij = b_ij for i = 1, 2, ..., k and j = 1, 2, ..., n. If all a_ij = 0, we write A = 0. A vector of dimension k that has all zero components except for a single component equal to 1 at its ith row is called the ith elementary vector of R^k and is denoted by e_i^{(k)}; we drop the superscript k when no confusion is possible.

For two k × n matrices A and B, the sum is defined as A + B = (a_ij + b_ij). For a real number c, scalar multiplication is defined as cA = (c a_ij). The product of two matrices A of dimension k × n and B of dimension n × m is defined as AB = (c_ij) where c_ij = ∑_{ℓ=1}^{n} a_iℓ b_ℓj.
A matrix is called a square matrix of order k if its numbers of rows and columns are both equal to k. The transpose of a matrix A is obtained by interchanging the rows and columns of A and is represented by A′. When A = A′, we say A is symmetric. A k × k symmetric matrix A for which a_ij = 0, i ≠ j, for i, j = 1, 2, ..., k is called a diagonal matrix of order k and is represented by A = diag(a_11, a_22, ..., a_kk); in this case if a_ii = 1 for all i, then we denote it by I_k and call it the identity matrix. A square matrix of order k is orthogonal if A′A = AA′ = I_k.

A symmetric matrix A of order k is positive definite, denoted by A > 0, if a′Aa > 0 for all nonzero vectors a ∈ R^k. A symmetric matrix A of order k is positive semidefinite, denoted by A ≥ 0, if a′Aa ≥ 0 for all vectors a ∈ R^k.
Given a square matrix A of order k, the equation Ax = λx can be solved for k pairs (λ, x) (λ ∈ R and x ∈ R^k); the first member of a pair of solutions is called an eigenvalue, and the second member is called a corresponding eigenvector. Eigenvalues and eigenvectors of a matrix are complex in general, but a real symmetric matrix always has real eigenvalues and eigenvectors. The additional assumption of positive definiteness implies positive eigenvalues; the assumption of nonnegative definiteness implies nonnegative eigenvalues. The number of nonzero eigenvalues of a square matrix A is the rank of A, written rk(A). We can define the rank of a k × n matrix A as rk(A) = rk(AA′) or equivalently as rk(A) = rk(A′A); if rk(A) = min(k, n) then A is called a full rank matrix.

The determinant of a square matrix A of order k is defined as the product of its eigenvalues and written as |A|. If |A| ≠ 0, A is called nonsingular, and we can calculate the unique matrix B such that AB = BA = I_k; B is called the inverse of A and written as A^{-1}. A positive definite matrix is nonsingular and has positive determinant. The trace of a square matrix A of order k is defined as the sum of its eigenvalues and written as tr(A).
A symmetric matrix A of order k can be written as A = GDG′ where G is an orthogonal matrix of order k and D is a diagonal matrix of order k. The symmetric square root matrix A^{1/2} is defined as A^{1/2} = GD^{1/2}G′ where D^{1/2} is the diagonal matrix with diagonal elements equal to the square roots of the elements of D. If A is also positive definite, there exists a unique lower triangular matrix L with positive diagonal elements such that A = LL′. This is called the Cholesky decomposition of A. In the following, we denote the square root of A obtained through the Cholesky decomposition as A_c^{1/2}.
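The two square roots can be computed directly; the following is a minimal NumPy sketch (ours, not from the text) contrasting them for a small positive definite A.

```python
# Illustration only: symmetric square root A^{1/2} vs. Cholesky root A_c^{1/2}.
import numpy as np

A = np.array([[4.0, 2.0],
              [2.0, 3.0]])

# Symmetric square root A^{1/2} = G D^{1/2} G' from the spectral decomposition.
d, G = np.linalg.eigh(A)
A_half = G @ np.diag(np.sqrt(d)) @ G.T

# Cholesky square root: lower triangular L with A = L L'.
L = np.linalg.cholesky(A)

assert np.allclose(A_half @ A_half, A)   # symmetric root reproduces A
assert np.allclose(L @ L.T, A)           # Cholesky root reproduces A
```

Both factor A, but only the first is symmetric; the dissertation uses the Cholesky root when a triangular scale parameter is convenient.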
Suppose that A is a symmetric matrix of order k and rank r ≤ k. In this case the unique pseudo inverse of A can be written as A⁻ = GD⁻G′, where D⁻ is the diagonal matrix of order r whose diagonal elements are the inverses of the nonzero eigenvalues of A, and G is a k × r matrix whose jth column is the normalized eigenvector corresponding to the inverse of the jth diagonal element of D⁻. A and A⁻ satisfy AA⁻A = A and A⁻AA⁻ = A⁻. Let B be a k × n matrix of rank r, r ≤ k ≤ n. The unique Moore-Penrose inverse of B is defined as B⁺ = B′(BB′)⁻. A symmetric matrix A of order k is idempotent if it satisfies AA′ = A′A = A. BB⁺ and B⁺B are idempotent matrices.

If A is idempotent of order k and rank q, then it can be written as

A = G [ I_q  0 ; 0  0 ] G′

where G is an orthogonal matrix of order k and the 0's denote zero matrices of appropriate dimensions. If A_1, A_2, ..., A_m are each idempotent of order k with ranks r_1, r_2, ..., r_m such that ∑_{j=1}^{m} r_j = k, and if A_i A_j = 0 for i ≠ j, then we can find an orthogonal matrix G of order k such that

A_1 = G [ I_{r_1}  0 ; 0  0 ] G′,  A_2 = G [ 0  0  0 ; 0  I_{r_2}  0 ; 0  0  0 ] G′,  ...,  A_m = G [ 0  0 ; 0  I_{r_m} ] G′

where the 0's denote zero matrices of appropriate dimensions.
1.2.2 Random Vectors and Matrices
k random variables x_1, x_2, ..., x_k observable on a sampling unit are represented by a k-dimensional column vector x, which is called a multivariate random variable. The matrix variate random variable obtained by observing k variables on n sampling units selected as a random sample is represented by a k × n matrix, say X.
The distribution of a random variable x can be characterized by its cumulative distribution function (cdf) F_x(.), defined as F_x(c) = P(x ≤ c). Given a cdf F(.), if there exists a nonnegative function f(.) such that F(x) = ∫_{-∞}^{x} f(t) dt for every x ∈ R, we say that the random variable x with cdf F_x(.) has probability density function (pdf) dF(x) = f(x). We can also represent a distribution using the moment generating function (mgf) of the random variable, defined as M_x(t) = E(e^{tx}), if it exists for all t in some neighborhood of 0. The mgf of a random variable does not always exist, but the characteristic function (cf) always exists; it is defined as Ψ_x(t) = E(e^{itx}). More generally, the distribution of a random variable x in R^k can be characterized by its joint cumulative distribution function (jcdf) F_x(.). For c in R^k, it is defined as F_x(c) = P(x_1 ≤ c_1, x_2 ≤ c_2, ..., x_k ≤ c_k). Given a jcdf F(.), if there exists a nonnegative function f(.) such that F(x) = ∫_{-∞}^{x_1} ∫_{-∞}^{x_2} ... ∫_{-∞}^{x_k} f(t_1, t_2, ..., t_k) dt_1 dt_2 ... dt_k for every x ∈ R^k, we say that the random vector x with jcdf F_x(.) has joint probability density function (jpdf) dF(x) = f(x). The joint moment generating function (jmgf) of the random variable is defined as M_x(t) = E(e^{t′x}), if it exists for all t in some neighborhood of 0_k in R^k. The joint characteristic function (jcf) is defined as Ψ_x(t) = E(e^{it′x}) for t ∈ R^k. We write x ∼ F(x), x ∼ f(x), x ∼ M_x(t), or x ∼ Ψ_x(t) equivalently to say that a random vector has a certain distribution. When we say that two random vectors x and y have the same distribution, we refer to the equivalence of either the jcdf's, jpdf's, jmgf's, or jcf's of these variables; in this case we write x =d y.
Let x be a k-dimensional random variable. Suppose we know any one of the following: x ∼ F(x), x ∼ f(x), x ∼ M_x(t), or x ∼ Ψ_x(t). Let k denote the set of indices for the random variables in x, and let m be a subset of k. The marginal distribution of the m-dimensional vector of random variables with indices in m, say x_(m), can be found in one of the following ways: x_(m) ∼ F(x)|_{x_(m^c) = ∞}, x_(m) ∼ f(x_(m)) = ∫...∫_{R^{k-m}} f(x) dx_(m^c), x_(m) ∼ M_x(t)|_{t_(m^c) = 0}, or x_(m) ∼ Ψ_x(t)|_{t_(m^c) = 0}, where m^c denotes the complement of m, x_(m^c) is the corresponding sub vector of x, and similarly t_(m^c) denotes the sub vector of t corresponding to the indices in m^c. If the random variable x has a jpdf, the conditional jpdf of x_(m) given x_(m^c) is

f(x_(m) | x_(m^c)) = f(x) / f(x_(m^c)).

When x does not have a density, the distribution of x_(m) | x_(m^c) is still defined from the definition of conditional probability. Two random vectors x and y are said to be independent if F(x, y) = G(x)H(y), where F(x, y), G(x), and H(y) are the jcdf's of (x, y), x, and y respectively. If they have densities f(x, y), g(x), and h(y), then x and y are independent if and only if f(x, y) = g(x)h(y).
R
The expectation of a random variable x with cdf F (x) is defined by E(x) = R xdF (x).
R
For r = 1, 2, . . ., µra = R (x − a)r dF (x) is called the moment of order r about a. In general,
the expected value of an arbitrary function of x, g(X) with respect to the probability density
R
function dF (x) is given by E(g(x)) = R g(x)dF (x). We can write E(x) = µ10 for the mean.
The moment of order 2 about E(x) is called the variance of the random variable x, and
write it as var(x). Let x = (x1 , x2 , . . . , xk )′ be a k−dimensional random vector. If every
element xj of x has finite expectation µj , we write E(x) = (E(x1 ), E(x2 ), . . . , E(xk ))′ =
µ = (µ1 , µ2 , . . . , µk )′ for the mean vector. A k variable moment µra1 r2 ...rk about an origin a =
(a1 , a2 , . . . , ak )is defined as µra1 r2 ...rk = E((x1 − a1 )r1 (x2 − a2 )r2 . . . (xk − ak )rk ) when it exists.
R
Expectation of an arbitrary function of x, g(x) is defined as E(g(x)) = Rk g(x)dF (x).
Covariance matrix Σ for a k dimensional random vector x is the expectation of the k × k
matrix (x − µ)(x − µ)′ . Obviously, Σ is symmetric, that is, the class of covariance matrices
coincides with the class of positive semi definite matrices. The following is very useful: If
x is a k−dimensional random vector with mean vector µ and covariance matrix Σ, then
the variable y = Ax + ξ for A a m × k matrix and ξ ∈ Rm has mean vector Aµ + ξ, and
covariance matrix AΣA′ .
Let x ∈ χ be a k-dimensional random vector described by its jcdf F(x), and let h(x) be a measurable function from χ to Υ ⊂ R^m. Then y = h(x) is an m-dimensional random vector with its distribution described by H(y) = ∫_Θ dF(x), where Θ = {x : h(x) ≤ y}. The jcf of y can be obtained as Ψ_y(t) = ∫_{R^k} e^{it′h(x)} dF(x) for t ∈ R^m. Similarly, when it exists, the jmgf of y is given by M_y(t) = ∫_{R^k} e^{t′h(x)} dF(x) for t ∈ R^m.

Let x ∈ χ be a k-dimensional random vector having density function f(x). Let h(x) be an invertible function from χ to Υ ⊂ R^m such that h^{-1} has continuous partial derivatives in Υ. Then y = h(x) is an m-dimensional random vector with density function f(h^{-1}(y))|J(y)|, where |J(y)| is called the Jacobian of the transformation h^{-1}; it is the determinant of the k × k matrix whose ijth component is the partial derivative of the ith component of h^{-1}(y) with respect to the jth component of y. For example, the Jacobian of the linear transformation y = Ax + µ, for A a nonsingular matrix of order k and a constant vector µ ∈ R^k, is given by |A|^{-1}. Also, for the k × n matrix variate random variable X, the Jacobian of the transformation Y = AXB + M, for nonsingular matrices A and B of orders k and n and a k × n constant matrix M, is given by |A|^{-n}|B|^{-k}. We also have the following useful results for transformations of variables: if x ∼ M_x(t), then y = Ax + µ ∼ e^{µ′t} M_x(A′t) for any m × k matrix A and µ ∈ R^m; if X ∼ M_X(T), then Y = AXB + M ∼ etr(M′T) M_X(A′TB′) for any m × k matrix A, n × l matrix B, and m × l matrix M.
We define a model to be a k-dimensional random vector x taking values in a set χ ⊂ R^k and having jcdf F(x; θ), where θ is a p-dimensional vector of parameters taking values in Θ ⊂ R^p. χ is called the sample space, Θ is called the parameter space. In parametric statistical inference, we assume that the sample space, the parameter space, and the jcdf F(x; θ) are all known. Usually we refer to a model family in one of the following ways: F = [χ, F_x(x; θ), Θ] or F = [χ, f_x(x; θ), Θ]. We try to make some inference about θ after we observe a sample of observations, say X.

A group of transformations G is a collection of transformations from χ to χ that is closed under inversion and closed under composition. The model F = [χ, f_x(x; θ), Θ] is invariant with respect to G when, for x ∼ f_x(x; θ), we have g(x) ∼ f_{g(x)}(g(x); ḡ(θ)), where g ∈ G and ḡ belongs to a corresponding group of transformations from Θ to Θ. The family of normal distributions, and more generally the family of elliptical distributions, is closed under nonsingular linear transformations.

Invariance under a group of transformations G can be useful for reducing the dimension of the data, since inference about θ should not depend on whether the simple random sample X = (x_1, x_2, ..., x_n) is observed or g(X) = (g(x_1), g(x_2), ..., g(x_n)) is observed for g ∈ G. Suppose that F is invariant with respect to G and that the statistic T = T(X) is an estimator of θ; that is, T is a function that maps χ^n to Θ. Then T is an equivariant estimator if T(g(X)) = ḡ(T(X)) for all g ∈ G and all X ∈ χ^n. It can also be shown that if F is invariant with respect to G and the mle is unique, then the mle of θ is equivariant; if the mle is not unique, then the mle can be chosen to be equivariant. A statistic T(X) is invariant under a group of transformations G if T(g(X)) = T(X) for all g ∈ G and all X ∈ χ^n. A function M(X) is maximal invariant under a group of transformations G if M(g(X)) = M(X) for all g ∈ G and all X ∈ χ^n, and if M(X_1) = M(X_2) for some X_1 ∈ χ^n and X_2 ∈ χ^n implies X_1 = g(X_2) for some g ∈ G. Suppose that T is a statistic and that M is a maximal invariant. Then T is invariant if and only if T is a function of M. The invariance property is usually utilized for obtaining suitable test statistics for hypothesis testing problems that stay invariant under a group of transformations.
1.2.3 Univariate, Multivariate, Matrix Variate Normal Distribution
The normal distribution arises in many areas of theoretical and applied statistics. The use of
the normal model is usually justified by assuming many small, independent effects additively
contributing to each observation. Also, in probability theory, normal distributions arise as
the limiting distributions of several continuous and discrete families of distributions. In
addition, the normal distribution maximizes information entropy among all distributions
with known mean and variance, which makes it the natural choice of underlying distribution
for data summarized in terms of sample mean and variance. In the following, we give the
definition and review some important results about the normal distribution for univariate,
multivariate and matrix variate cases.
1.2.4 Univariate Normal Distribution

Each member of the univariate normal distribution family is defined by two parameters, location and scale: the mean ("average") µ and the variance (standard deviation squared) σ². To indicate that a real-valued random variable x is normally distributed with mean µ and variance σ², we write x ∼ N(µ, σ). The pdf of the normal distribution is the Gaussian function

φ(x; µ, σ) = (1/(√(2π)σ)) e^{-(x-µ)²/(2σ²)},

where σ > 0 is the standard deviation and the real parameter µ is the expected value. φ(z; 0, 1) = φ(z) is the density function of the "standard" normal random variable. The cumulative distribution function of the normal distribution is expressed in terms of the density function as

Φ(x; µ, σ) = ∫_{-∞}^{x} (1/σ) φ((u − µ)/σ) du.

The standard normal cdf, Φ(z), is just the general cdf evaluated with µ = 0 and σ = 1. The values Φ(z) may be approximated very accurately by a variety of methods, such as numerical integration, Taylor series, asymptotic series, and continued fractions. The mgf exists for x ∼ N(µ, σ) and is given by M_x(t) = e^{µt + t²σ²/2}. The characteristic function is h_x(t) = e^{iµt − t²σ²/2}.

1.2.5 Multivariate Normal Distribution
In classical multivariate analysis the central distribution is the multivariate normal distribution. There seem to be at least two reasons for this: first, due to the effect of the multivariate central limit theorem, in most cases multivariate observations are at least approximately normal; second, the multivariate normal distribution and its sampling distributions are usually tractable.
A k-dimensional random vector x is said to have a k-dimensional normal distribution if the distribution of α′x is univariate normal for every α ∈ R^k. A random vector x with k components having the normal distribution with parameters µ ∈ R^k and a k × k positive definite matrix A has jpdf

φ_k(x; µ, A) = (1/((2π)^{k/2}|A|)) e^{-(1/2)(x-µ)′(AA′)^{-1}(x-µ)}.  (1.1)

We say that x is distributed according to N_k(µ, A), and write x ∼ N_k(µ, A). The mgf corresponding to the density in (1.1) is

M_x(t) = e^{t′µ + (1/2)t′AA′t}.  (1.2)
If we define an m-dimensional random variable y as y = Bx + ξ, where B is any m × k matrix and ξ ∈ R^m, then by the moment generating function we can write y ∼ N_m(Bµ + ξ, BA); note that a pdf representation does not exist when BB′ is singular, i.e. when rk(B) = r < m.

If x ∼ N_k(µ, A) where A is nonsingular, then z = A^{-1}(x − µ) has the distribution N_k(0_k, I_k) with density

φ_k(z) = (1/(2π)^{k/2}) e^{-(1/2)z′z}  (1.3)

and moment generating function

E(e^{t′z}) = e^{(1/2)t′t}.  (1.4)

We denote the jpdf of z by φ_k(z) and its jcdf by Φ_k(z). v = z′z has the chi-square distribution with k degrees of freedom, denoted by χ²_k; the density of v can be written as

dF(v) = (1/(2^{k/2}Γ(k/2))) v^{k/2 − 1} e^{-(1/2)v}.  (1.5)

Let u have the uniform distribution on the surface of the k-dimensional unit sphere, and let r =d v^{1/2}. Then z =d ru, and x = Az + µ satisfies x =d µ + rAu; in general, if y has the N_m(ξ, B) distribution for any m × k matrix B and any m-dimensional vector ξ, we have y =d ξ + rBu.
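The representation x =d µ + rAu suggests a direct sampler. The sketch below (our illustration; variable names are ours) draws r² from χ²_k and u uniformly from S_k, and checks the first two sample moments against µ and AA′.

```python
# Sampling N_k(mu, AA') via the stochastic representation x = mu + r A u.
import numpy as np

rng = np.random.default_rng(0)
k, n = 3, 200_000
mu = np.array([1.0, -1.0, 0.5])
A = np.array([[1.0, 0.0, 0.0],
              [0.5, 1.0, 0.0],
              [0.2, 0.3, 1.0]])

z = rng.standard_normal((n, k))
u = z / np.linalg.norm(z, axis=1, keepdims=True)   # u uniform on S_k
r = np.sqrt(rng.chisquare(k, size=n))              # r = sqrt(v), v ~ chi^2_k
x = mu + (r[:, None] * u) @ A.T                    # x = mu + r A u, row-wise

print(np.round(x.mean(axis=0), 2))                 # approx mu
print(np.round(np.cov(x, rowvar=False), 2))        # approx A A'
```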
The parameters of the N_k(µ, A) distribution have a direct interpretation as the expectation and the covariance of the variable x with this distribution, since E(x) = µ and E((x − µ)(x − µ)′) = AA′. All odd moments of x − µ are zero. The fourth order moments are E[(x_i − µ_i)(x_j − µ_j)(x_k − µ_k)(x_l − µ_l)] = σ_ij σ_kl + σ_ik σ_jl + σ_il σ_jk, where AA′ = (σ_ij).
The family of normal distributions is closed under linear transformations, marginalization, and conditioning. A characterization of the multivariate normal distribution within the elliptical family of distributions is due to the fact that it is the only distribution in this class for which a zero covariance matrix between two partitions of a random vector implies independence.

The following are important for x ∼ N_k(µ, A), where A is nonsingular:

1. (x − µ)′(AA′)^{-1}(x − µ) ∼ χ²_k;
2. x′(AA′)^{-1}x ∼ χ²_k(δ), where χ²_k(δ) denotes the non-central χ² distribution with non-centrality parameter δ = (1/2)µ′(AA′)^{-1}µ.

When x ∼ N_k(µ, A) where A is singular of rank r, we can prove similar results:

1. (x − µ)′(AA′)⁺(x − µ) ∼ χ²_r;
2. x′(AA′)⁺x ∼ χ²_r(δ), where χ²_r(δ) denotes the non-central χ² distribution with non-centrality parameter δ = (1/2)µ′(AA′)⁺µ.

Assume that x ∼ N_k(µ, A). Let y = Bx + b and z = Cx + c, where B is m × k, C is l × k, b is m × 1, and c is l × 1. Then y and z are independent if and only if BAA′C′ = 0.
1.2.6 Matrix Variate Normal Distribution

If a random sample of n observations is independently selected from a N_k(µ, A) distribution, the joint density of X = (x_1, x_2, ..., x_n) can be written as

dF(X) = |A|^{-n} ∏_{j=1}^{n} φ_k((AA′)^{-1/2}(x_j − µ)) = (1/((2π)^{nk/2}|A|^n)) e^{-(1/2)∑_{j=1}^{n}(x_j − µ)′(AA′)^{-1}(x_j − µ)}.

A more general matrix variate normal distribution can be defined through linear transformations of an iid random sample of k-dimensional standard normal vectors, from both the right and the left. Let z ∼ φ_k(z), and let Z = (z_1, z_2, ..., z_n) denote an iid random sample from this distribution. The joint density of Z is given by

dF(Z) = ∏_{j=1}^{n} φ_k(z_j) = (1/(2π)^{nk/2}) e^{-(1/2)∑_{j=1}^{n} z_j′z_j} = (1/(2π)^{nk/2}) etr(-(1/2)ZZ′).

We denote this by Z ∼ φ_{k×n}(Z). Now let X = AZB + M for nonsingular matrices A and B of orders k and n and a k × n constant matrix M. The Jacobian of this transformation is |A|^{-n}|B|^{-k}. The density of X is

φ_{k×n}(X; M, A, B) = (1/((2π)^{nk/2}|A|^n|B|^k)) etr(-(1/2)(AA′)^{-1}(X − M)(B′B)^{-1}(X − M)′).
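As a sketch of how the construction X = AZB + M and the resulting density fit together, the following Python function (our own naming; not from the dissertation) evaluates φ_{k×n}(X; M, A, B) directly.

```python
# Hedged illustration: matrix variate normal density via etr(C) = exp(tr(C)).
import numpy as np

def matrix_normal_pdf(X, M, A, B):
    """Evaluate phi_{k x n}(X; M, A, B) as defined above."""
    k, n = X.shape
    R = X - M
    S1 = np.linalg.inv(A @ A.T)          # (AA')^{-1}
    S2 = np.linalg.inv(B.T @ B)          # (B'B)^{-1}
    quad = np.trace(S1 @ R @ S2 @ R.T)
    const = (2 * np.pi) ** (n * k / 2) \
        * abs(np.linalg.det(A)) ** n * abs(np.linalg.det(B)) ** k
    return np.exp(-0.5 * quad) / const

rng = np.random.default_rng(1)
k, n = 2, 3
A, B, M = np.eye(k), np.eye(n), np.zeros((k, n))
X = A @ rng.standard_normal((k, n)) @ B + M       # X = A Z B + M
print(matrix_normal_pdf(X, M, A, B))
```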
1.2.7 Symmetric Distributions
A random variable x is symmetric if x =d −x. Multivariate symmetry is usually defined in terms of invariance of the distribution of a centered random vector x in R^k under a suitable family of transformations. A random vector x has a centrally symmetric distribution about the origin if x =d −x. An equivalent alternative description of central symmetry is to require v′x =d −v′x for every vector v ∈ R^k; that is, any projection of x onto a line through the origin has a symmetric distribution. The joint distribution of k independent symmetrically distributed random variables is centrally symmetric. A random vector x has a spherically symmetric distribution about the origin 0 if an orthogonal rotation of x about 0 does not alter the distribution, i.e., x =d Ax for all orthogonal k × k matrices A. A random vector that has the uniform distribution on a k-cube of the form [−c, c]^k is centrally but not spherically symmetric, so central symmetry is more general than spherical symmetry. Both of these definitions of symmetry reduce to the usual notion of symmetry in the univariate case.
1.2.8 Centrally Symmetric Distributions

The most obvious way of constructing centrally symmetric distributions over R^k is by considering the joint distribution of k independent symmetric random variables. Because of independence, the joint density is the product of the univariate marginals. Some examples of densities of this kind follow:

1. Multivariate uniform distribution on the k-cube [−1, 1]^k. Density:

g(x) = 1/2^k,  x ∈ [−1, 1]^k.

2. Multivariate symmetric beta distribution on the k-cube [−1, 1]^k. Density:

g(x) = ((Γ(2θ)/((Γ(θ))² 2)))^k ∏_{j=1}^{k} ((1 + e_j′x)/2)^{θ−1} ((1 − e_j′x)/2)^{θ−1},  x ∈ [−1, 1]^k.

3. Multivariate centrally symmetric t-distribution with v degrees of freedom. Density:

g(x) = (Γ((v+1)/2)/(√(vπ) Γ(v/2)))^k ∏_{j=1}^{k} (1 + (e_j′x)²/v)^{-(v+1)/2},  x ∈ R^k.

4. Multivariate Laplace distribution. Density:

g(x) = (1/2^k) exp(-∑_{j=1}^{k} |e_j′x|),  x ∈ R^k.
1.2.9 Spherically Symmetric Distributions

A spherically symmetric random vector x has a jcf of the form h(t′t) for some cf h(.), and a density, if it exists, of the form f(x′x) for some pdf f(.). For a spherically symmetric random vector x, the random variable defined as u^{(k)} = x/√(x′x) is always distributed uniformly over the surface of the unit sphere in R^k, denoted by S_k. A stochastic representation for a random vector x which has a spherically symmetric distribution is given by x =d ru, where r is a nonnegative random variable with cdf K(r) that is independent of u, which is uniformly distributed over S_k. In this case r =d √(x′x) and u =d x/√(x′x), and they are independent. If both K′(r) = k(r), the pdf of r, and f(x′x), the jpdf of x, exist, then they are related as follows:

k(r) = (2π^{k/2}/Γ(k/2)) r^{k−1} f(r²)  (1.6)

or, in the sometimes preferable form for the pdf of v = r²,

k(v) = (π^{k/2}/Γ(k/2)) v^{k/2 − 1} f(v).  (1.7)
Some examples follow:

(a) The standard multivariate normal N_k(0_k, I_k) distribution with jpdf

φ_k(x) = (1/(2π)^{k/2}) e^{-(1/2)x′x};

we denote the jcdf of x by Φ_k(x).

(b) The multivariate spherical t distribution with v degrees of freedom, with density function

dF(x) = (Γ((v+k)/2)/(Γ(v/2)(vπ)^{k/2})) (1 + (1/v)x′x)^{-(v+k)/2}.

When v = 1 the multivariate t distribution is called the spherical Cauchy distribution.

Both the centrally symmetric and the spherically symmetric multivariate t distributions approach the multivariate normal density. In Figure 1.1 we display the centrally symmetric t distribution for various choices of the degrees of freedom v. As v gets larger the contours approach the contours of the multivariate normal density.
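A spherical t sample can be generated through a normal-over-chi-square mixture, a standard representation not stated in this subsection (so treat it as an assumption here): x = z/√(w/v) with z ∼ N_k(0, I_k) and w ∼ χ²_v independent. The sketch below also checks that u = x/||x|| behaves like a uniform vector on S_k.

```python
# Illustration: spherical t via x = z / sqrt(w / v); directions are uniform.
import numpy as np

rng = np.random.default_rng(2)
k, v, n = 3, 5, 100_000
z = rng.standard_normal((n, k))
w = rng.chisquare(v, size=n)
x = z / np.sqrt(w / v)[:, None]

# u = x / ||x|| is uniform on S_k: mean 0_k and covariance I_k / k.
u = x / np.linalg.norm(x, axis=1, keepdims=True)
print(np.round(u.mean(axis=0), 3))
print(np.round(np.cov(u, rowvar=False) * k, 2))   # approx identity
```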
1.2.10 Location-Scale Families from Spherically Symmetric Distributions

Elliptical distributions, which are obtained from an affine transformation of a spherically symmetric random vector, are usually utilized in the study of robustness of procedures developed under the normal distribution assumption. The family of elliptical distributions is closed under linear transformations, marginalization, and conditioning.
Figure 1.1: Centrally symmetric t distribution for various choices of degrees of freedom, v (panels: v = 1, 5, 15, 100). As v gets larger the contours approach the contours of the multivariate normal density.
Let z be a k-dimensional random vector with spherical distribution F. A random vector x =d Az + µ, for A an m × k constant matrix and µ an m-dimensional constant vector, has an elliptically contoured distribution with parameters µ ∈ R^m and A with kernel F; we denote this distribution by E_m(µ, A, F). x has a jcf of the form e^{iµ′t} h_z(A′t), where h_z(t) = h(t′t) is the jcf of the spherically distributed variable z. When it exists, the jmgf is given by e^{µ′t} M_z(A′t). When AA′ is nonsingular and the jpdf F′ = f exists, the jpdf of x is written as follows:

E_k(x; µ, A, f) = |A|^{-1} f((x − µ)′(AA′)^{-1}(x − µ)).  (1.8)

A characterization of x is given by x =d Az + µ =d rAu^{(k)} + µ, where r is a nonnegative random variable with cdf K(r) that is independent of u^{(k)}, which is uniformly distributed over S_k. If x ∼ E_k(µ, A, f) has a finite first moment, then it is given by E(x) = µ. All odd moments of x − µ are zero. The covariance matrix, if it exists, is given by E[(x − µ)(x − µ)′] = (E(r²)/k) AA′. If fourth order moments exist, they are E[(x_i − µ_i)(x_j − µ_j)(x_k − µ_k)(x_l − µ_l)] = (E(r⁴)/(k(k+2)))(σ_ij σ_kl + σ_ik σ_jl + σ_il σ_jk), where AA′ = (σ_ij).

If a random sample of n observations is independently selected from an E_k(µ, A, f) distribution, the density of X = (x_1, x_2, ..., x_n) can be written as

E_{k×n}(X; µ, A, f) = |A|^{-n} ∏_{j=1}^{n} f((x_j − µ)′(AA′)^{-1}(x_j − µ)).

Gupta and Varga give a generalization of this model ([37]):

E_{k×n}(X; M, A, B, f) = |A|^{-n}|B|^{-k} f(tr((X − M)′(AA′)^{-1}(X − M)(B′B)^{-1})).
CHAPTER 2
A Class of Multivariate Skew Symmetric Distributions
The interest in flexible distribution models that can adequately accommodate asymmetric populations has increased during the last few years. Many distributions are skewed and so do not satisfy standard assumptions such as normality or even elliptical symmetry. A reasonable approach is to develop models that can represent both skewed and symmetric distributions. The book edited by Genton supplies a variety of applications for such models in areas such as economics, engineering, and the biological sciences ([27]).
2.1 Background: Multivariate Skew Normal Distributions
The univariate skew normal density first appears in Roberts ([55]) as an example of weighted densities. In Aigner et al., the skew normal model is obtained in the context of stochastic frontier production function models ([1]). Azzalini provides a formal study of this distribution ([10]). The density of this model is given by

SN_1(y, σ², α) = f(y, σ², α) = 2φ(y, σ²)Φ(αy),  (2.1)

where α is a real scalar, φ(., σ²) is the univariate normal density function with variance σ², and Φ(.) denotes the cumulative distribution function of the univariate standard normal variable.
Some basic properties of the univariate skew normal distribution are given in ([10]):

(a) SN_1(y, σ², α = 0) = φ(y, σ²);
(b) if y ∼ SN_1(y, σ², α), then −y ∼ SN_1(y, σ², −α);
(c) as α → ±∞, the density SN_1(y, σ², α) approaches the half normal density, i.e. the distribution of ±|z| where z ∼ φ(z, σ²);
(d) if y ∼ SN_1(y, σ² = 1, α), then y² ∼ χ²_1.

Properties (a), (b), and (c) follow directly from the definition, while property (d) follows immediately from the following lemma:

Lemma 2.1.1. ([55]) w² ∼ χ²_1 if and only if the p.d.f. of w has the form f(w) = h(w)exp(−w²/2) where h(w) + h(−w) = √(2/π).
The univariate skew normal distribution family extends the widely employed family of normal distributions by introducing a skewness parameter α. The advantage of this distribution family is that it maintains many statistical properties of the normal distribution family. The study of skew normal distributions explores an approach to statistical analysis without the assumption of symmetry for the underlying population distribution; the skew normal family emerges precisely to take the skewness property into account. Skew symmetric distributions are useful in many practical situations.
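Sampling from (2.1) is simple via a well-known stochastic representation (an assumption here; it is not stated in this section): with (u_0, u_1) iid N(0, 1) and δ = α/√(1 + α²), z = δ|u_0| + √(1 − δ²)u_1 has density 2φ(z)Φ(αz). A minimal sketch:

```python
# Hedged sketch: sample SN_1(alpha) and compare with 2 phi(z) Phi(alpha z).
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(3)
alpha, n = 3.0, 200_000
delta = alpha / np.sqrt(1 + alpha ** 2)
u0, u1 = rng.standard_normal(n), rng.standard_normal(n)
z = delta * np.abs(u0) + np.sqrt(1 - delta ** 2) * u1

grid = np.linspace(-2, 4, 7)
emp = np.histogram(z, bins=grid, density=True)[0]   # empirical density
mid = (grid[:-1] + grid[1:]) / 2
print(np.round(emp, 3))
print(np.round(2 * norm.pdf(mid) * norm.cdf(alpha * mid), 3))  # theoretical
```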
A more general form of the density in (2.1) arises as

f(y, α) = 2g(y)H(αy)  (2.2)

where α is a real scalar, g(.) is a univariate density function symmetric around 0, and H(.) is an absolutely continuous cumulative distribution function with H′(.) symmetric around 0. This family of densities is called the skew symmetric family. Taking α equal to zero in (2.2) gives a symmetric density; we obtain a skewed density for any other value of α. The parameter α controls both skewness and kurtosis.
When applying the skew normal distribution family in statistical inference, we frequently need to discuss the joint distribution of a random sample from the population. This consequently necessitates the study of the multivariate skew normal distribution.

Several generalizations of the densities (2.1) and (2.2) to the multivariate case have been studied. For example, Azzalini introduced the k-dimensional skew-normal density in two alternative forms ([10], [12]):

f(y, α, Σ) = cφ(y, Σ) ∏_{i=1}^{k} Φ(α_i y_i)  (2.3)

f(y, α, Σ) = 2φ(y, Σ)Φ(α′y)  (2.4)

where α = (α_1, α_2, ..., α_k)′ is a k-dimensional real vector, φ(., Σ) is the k-dimensional normal density function with covariance matrix Σ, Φ(.) denotes the cumulative distribution function of the univariate standard normal variable, and c is the normalizing constant.
Chang and Gupta generalize the idea in (2.4) to cases other than the normal ([31]). A function

f(y, α) = 2g(y)H(α′y)  (2.5)

defines a probability density function for y ∈ R^k when α is a k-dimensional real vector, g(.) is a k-dimensional density function symmetric around 0_k, and H(.) is an absolutely continuous cumulative distribution function with H′(.) symmetric around 0.
The skew symmetric family generated by Equation 2.5 has been studied by many authors. For instance, one can define skew-t distributions ([16], [41], [56]), skew-Cauchy distributions ([9]), skew-elliptical distributions ([13], [6]), or other skew symmetric distributions ([60]). The models in ([35]) are obtained by taking both g(.) and H(.) in (2.2) to belong to one of the normal, Cauchy, Laplace, logistic, or uniform families. In ([49]), g(.) is taken to be a normal pdf while the cumulative distribution function H(.) is taken as the cumulative distribution function of the normal, Student's t, Cauchy, Laplace, logistic, or uniform distribution. Alternatively, Gomez, Venegas, and Bolfarine consider the situation where H(.) is fixed to be the cumulative distribution function of the normal distribution while g(.) is taken as the density function of the normal, Student's t, logistic, Laplace, or uniform distribution ([30]).
Univariate and multivariate skew symmetric families of distributions appear as a subset of the so-called selection models. Suppose x is a k-dimensional random vector with density f(x). The usual statistical analysis involves using a random sample x_1, x_2, ..., x_n to make inferences about f(x). Nevertheless, there are situations for which a weighted sample instead of a random sample is selected from f(x), because it is either difficult, costly, or even impossible to observe certain parts of the distribution. If we assume that a weight function g(x) is used in selection, then the sample of observations may be thought of as coming from the following weighted density:

h(x) = f(x)g(x) / ∫_{R^k} g(x)f(x)dx.  (2.6)

When the sample is only a subset of the population, the associated model is called a selection model. Densities of this kind originate from Fisher ([26]). Patil, Rao, and Zelen provide a survey of this family of distributions ([51]). In their recent article, Arellano-Valle, Branco, and Genton show that a very general class of skew symmetric densities emerges from selection models, called fundamental skew distributions ([5]).
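When the weight satisfies 0 ≤ g(x) ≤ 1, the weighted density (2.6) can be sampled by rejection: draw x ∼ f and accept with probability g(x). The sketch below (our own illustration) uses f = φ and g = Φ(α·), in which case h is exactly the skew normal density 2φ(x)Φ(αx).

```python
# Illustration: rejection sampling from the weighted density (2.6).
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(4)

def weighted_sample(n, alpha=2.0):
    out = []
    while len(out) < n:
        x = rng.standard_normal(n)                        # proposal from f
        keep = rng.uniform(size=n) < norm.cdf(alpha * x)  # accept w.p. g(x)
        out.extend(x[keep].tolist())
    return np.array(out[:n])

s = weighted_sample(50_000)
print(round(s.mean(), 3))   # approx sqrt(2/pi) * alpha / sqrt(1 + alpha^2)
```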
2.2 Skew-Centrally Symmetric Densities

In this section, we study a family of multivariate skew symmetric densities generated by multivariate centrally symmetric densities.
Theorem 2.2.1. Let g(.) be the k-dimensional jpdf of k independent variables centrally symmetric around 0, let H(.) be an absolutely continuous cumulative distribution function with H′(.) symmetric around 0, let α = (α_1, α_2, ..., α_k)′ be a k-dimensional real vector, and let e_j for j = 1, 2, ..., k be the elementary vectors of R^k. Then

f(y, α) = 2^k g(y) ∏_{j=1}^{k} H(α_j e_j′y)

defines a probability density function.
Proof. First note that f(y) ≥ 0 for all y ∈ R^k. We need to show that

k(α_1, α_2, ..., α_k) = ∫_{R^k} 2^k g(y) ∏_{j=1}^{k} H(α_j e_j′y) dy = 1.  (2.7)

Observe that

∂/∂α_ℓ k(α_1, α_2, ..., α_k) = d/dα_ℓ ∫_{R^k} 2^k g(y) ∏_{j=1}^{k} H(α_j e_j′y) dy
= ∫_{R^k} 2^k y_ℓ H′(α_ℓ e_ℓ′y) ∏_{j≠ℓ} H(α_j e_j′y) g(y) dy
= 0.

The first equality is true because of the Lebesgue dominated convergence theorem; the last equality is due to independence, through

E_{g(y)}(y_ℓ H′(α_ℓ e_ℓ′y) ∏_{j≠ℓ} H(α_j e_j′y)) = E_{g(y)}(y_ℓ H′(α_ℓ e_ℓ′y)) E_{g(y)}(∏_{j≠ℓ} H(α_j e_j′y)) = 0,

since g(.) is centrally symmetric around 0, so y_ℓ H′(α_ℓ e_ℓ′y) is an odd function of y_ℓ. Hence, k(α_1, α_2, ..., α_k) is constant as a function of α_j for all j = 1, 2, ..., k; and when all α_j = 0, k(α_1, α_2, ..., α_k) = 1. This concludes the proof.
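The normalization (2.7) can also be checked numerically. The following sketch (illustration only) does so for k = 2 with a standard normal product kernel and H = Φ; the integral comes out to 1 for any choice of α_1, α_2.

```python
# Numerical check of (2.7) for k = 2, g = phi x phi, H = Phi.
import numpy as np
from scipy import integrate
from scipy.stats import norm

def f(y1, y2, a1=2.0, a2=-1.0):
    g = norm.pdf(y1) * norm.pdf(y2)   # centrally symmetric kernel
    return 2 ** 2 * g * norm.cdf(a1 * y1) * norm.cdf(a2 * y2)

val, err = integrate.dblquad(f, -8, 8, -8, 8)
print(val)   # approx 1.0
```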
In the next theorem, we relate the distribution of the even powers of a skew symmetric random variable to those of its kernel.

Theorem 2.2.2. Let x be a random variable with probability density function g(x), and let y be the random variable with probability density function

f(y, α) = 2^k g(y) ∏_{j=1}^{k} H(α_j e_j′y)

where g(.), H(.), and α are defined as in Theorem 2.2.1. Then,

(a) the even moments of y and x are the same, i.e. E(yy′)^p = E(xx′)^p for p even and E(y′y)^m = E(x′x)^m for any natural number m;
(b) y′y and x′x have the same distribution.
Proof. It suffices to show that

E(x_1^{n_1} x_2^{n_2} ... x_k^{n_k}) = E(y_1^{n_1} y_2^{n_2} ... y_k^{n_k})

for n_1, n_2, ..., n_k even. Let Ψ_y(t) be the characteristic function of y. Then,

Ψ_y(t) = ∫_{R^k} e^{it′y} 2^k g(y) ∏_{j=1}^{k} H(α_j e_j′y) dy.  (2.8)

Let n_1 + n_2 + ... + n_k = n. Taking the n_j-th partial derivatives of (2.8) with respect to t_j for j = 1, 2, ..., k and putting t = 0_k,

∂^n Ψ_y(t) / (∂t_1^{n_1} ∂t_2^{n_2} ... ∂t_k^{n_k}) |_{t=0_k}
= ∫_{R^k} ∂^n/(∂t_1^{n_1} ∂t_2^{n_2} ... ∂t_k^{n_k}) e^{it′y} 2^k ∏_{j=1}^{k} H(α_j e_j′y) g(y) dy |_{t=0_k}
= ∫_{R^k} [e^{it′y} 2^k i^n ∏_{j=1}^{k} H(α_j e_j′y)] [∏_{ℓ=1}^{k} y_ℓ^{n_ℓ}] g(y) dy |_{t=0_k}
= ∫_{R^k} [2^k i^n ∏_{j=1}^{k} H(α_j e_j′y)] [∏_{ℓ=1}^{k} y_ℓ^{n_ℓ}] g(y) dy.  (2.9)

Taking the derivative of (2.9) with respect to α_m,

∂/∂α_m (∫_{R^k} [2^k i^n ∏_{j=1}^{k} H(α_j e_j′y)] [∏_{ℓ=1}^{k} y_ℓ^{n_ℓ}] g(y) dy)
= 2^k i^n E_{g(y)}[y_m^{(n_m+1)} H′(α_m e_m′y) ∏_{j≠m} y_j^{n_j} H(α_j e_j′y)]
= 2^k i^n E_{g(y)}[y_m^{(n_m+1)} H′(α_m e_m′y)] E_{g(y)}[∏_{j≠m} y_j^{n_j} H(α_j e_j′y)]
= 0.

The first equality is true because of the Lebesgue dominated convergence theorem, and the second equality is due to the independence of the components. The last equality is due to the fact that y_m^{(n_m+1)} H′(α_m y_m) is an odd function of y_m (n_m being even) and g(.) is centrally symmetric around 0. Therefore E(y_1^{n_1} y_2^{n_2} ... y_k^{n_k}), for n_1, n_2, ..., n_k even, is constant as a function of α_m. If all α_m = 0 then f(x) = g(x), and therefore

E(x_1^{n_1} x_2^{n_2} ... x_k^{n_k}) = E(y_1^{n_1} y_2^{n_2} ... y_k^{n_k})

is true for all α_m. The required results follow immediately.
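Part (b) is easy to probe by simulation. In one dimension with a normal kernel it says y² ∼ χ²_1; the sketch below (our illustration, reusing the hedged sampling representation from Section 2.1) checks this with a Kolmogorov-Smirnov statistic.

```python
# Simulation check of Theorem 2.2.2(b): y^2 ~ chi^2_1 when y ~ 2 phi Phi(alpha .).
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)
alpha, n = 4.0, 100_000
delta = alpha / np.sqrt(1 + alpha ** 2)
y = delta * np.abs(rng.standard_normal(n)) \
    + np.sqrt(1 - delta ** 2) * rng.standard_normal(n)

print(stats.kstest(y ** 2, stats.chi2(df=1).cdf).statistic)  # small value
```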
The first theorem aids in constructing families of densities. As an example, take the family of jpdf's generated by the uniform kernel.

Example 2.2.1. Multivariate skew uniform distribution. Take the uniform density on the interval (−1, 1):

u(z) = 1/2,  z ∈ (−1, 1).  (2.10)

A k-dimensional extension of this density is obtained by considering k independent variables z = (z_1, z_2, ..., z_k)′, each with density u(.). The density of z is

z ∼ u*(z) = (1/2)^k,  z ∈ (−1, 1)^k.  (2.11)
A skew symmetric density can be obtained by using the cdf of the standard normal distribution, Φ(.). In Theorem 2.2.1 above, use g(.) = u*(.) and H(.) = Φ(.); the skew symmetric density becomes

f(x, α) = { ∏_{j=1}^{k} Φ(α_j e_j′x),  x ∈ (−1, 1)^k,
          { 0,  elsewhere.

Figures 2.1 to 2.3 show the flexibility of the skew uniform distribution.

Figure 2.1: Surface of skew uniform density in two dimensions; α_1 = 2, α_2 = −2, H(.) = Φ(.).
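The skew uniform density is straightforward to evaluate; the sketch below (our own helper, not from the text) computes it on a grid, which is how surfaces like those in Figures 2.1-2.3 can be reproduced.

```python
# Illustration: evaluate the 2-d skew uniform density of Example 2.2.1.
import numpy as np
from scipy.stats import norm

def skew_uniform_pdf(x, alpha):
    x, alpha = np.atleast_2d(x), np.asarray(alpha)
    inside = np.all(np.abs(x) < 1, axis=1)       # support (-1, 1)^k
    vals = np.prod(norm.cdf(alpha * x), axis=1)  # prod_j Phi(alpha_j x_j)
    return np.where(inside, vals, 0.0)

xs = np.linspace(-0.99, 0.99, 5)
grid = np.array([(a, b) for a in xs for b in xs])
print(skew_uniform_pdf(grid, alpha=[2.0, -2.0]).round(3))
```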
A skew normal density is obtained from the normal kernel in the following example.

Example 2.2.2. Multivariate skew normal densities. In Theorem 2.2.1 above, let g(.) = φ_k(.), where φ_k(.) is the k-dimensional standard normal density function. Also let H(.) and α be defined as in Theorem 2.2.1. We can construct a density for k-dimensional joint p.d.f.'s of the form

f(y, α) = 2^k φ_k(y) ∏_{j=1}^{k} H(α_j e_j′y).  (2.12)

Figure 2.2: Surface of skew uniform density in two dimensions; α_1 = 2, α_2 = 0.

Figure 2.3: Surface of skew uniform density in two dimensions; α_1 = 2, α_2 = 10.

Figure 2.4: The contours for the 2-dimensional skew normal pdf for different values of α and for H(.) = Φ(.).
Using Theorem 2.2.2 we can relate some properties of the density introduced in Example 2.2.2 to its kernel density, φ_k(.). Let x ∼ φ_k(x), and let y be the random variable with probability density function

f(y, α) = 2^k φ_k(y) ∏_{i=1}^{k} H(α_i e_i′y).

Then,

(a) the even moments of y and x are the same, i.e. E(yy′)^p = E(xx′)^p for p even and E(y′y)^m = E(x′x)^m for any natural number m;
(b) y′y and x′x both have the χ²_k distribution.

Figure 2.5: The surface for the 2-dimensional pdf of a skew normal variable for α_1 = 3, α_2 = 5, H(.) = Φ(.).
Example 2.2.3. Multivariate skew-Laplace densities. Take g(.) as the joint density of k iid Laplace variables:

g(x) = (1/2^k) exp(-∑_{j=1}^{k} |x_j|).

Let H(.) and α be defined as in Theorem 2.2.1. We get the following density:

f(x, α) = exp(-∑_{j=1}^{k} |x_j|) ∏_{j=1}^{k} H(α_j e_j′x).
2.2.1 Independent Observations, Location-Scale Family

In this section, we consider a location-scale family of skew symmetric random variables. But first observe the following:

Remark 2.2.1. The family of densities introduced by Equation 2.5 and the family of densities introduced in Theorem 2.2.1 are different models, except for the case k = 1.
Remark 2.2.2. (Joint Density for Random Sample). Let g(.) be a density function symmetric about 0, and let H(.) be an absolutely continuous cumulative distribution function with H′(.) symmetric around 0. Let y_1, y_2, ..., y_n constitute a random sample from a distribution with density f(y, α) = 2g(y)H(αy). Then the joint p.d.f. of y = (y_1, y_2, ..., y_n)′ is

f(y, α) = 2^n g*(y) ∏_{i=1}^{n} H(αy_i) = 2^n g*(y) ∏_{i=1}^{n} H(α e_i′y),

where g*(y) = ∏_{i=1}^{n} g(y_i); this belongs to the family of densities introduced in Theorem 2.2.1. Observe that the density of the random sample can not be represented by the family of multivariate skew symmetric densities introduced by Equation 2.5, i.e., in the form f(y, α) = 2g(y)H(α′y). This casts doubt on the usefulness of the latter.

Figure 2.6: The surface for the 2-dimensional pdf of a skew Laplace variable for α_1 = 3, α_2 = 5, H(.) = Φ(.).
Remark 2.2.3. (Joint Density of Independent Skew-Symmetric Variables). Let g(.) be a density function symmetric about 0, and let H(.) be an absolutely continuous cumulative distribution function with H′(.) symmetric around 0. Let

z_j ∼ f(z, α_j) = 2g(z)H(α_j z),

with mean µ(α_j) and variance σ²(α_j), for j = 1, 2, ..., k be independent variables. Then the joint p.d.f. of z = (z_1, z_2, ..., z_k)′ is

f(z, α) = 2^k g*(z) ∏_{j=1}^{k} H(α_j e_j′z)

where α = (α_1, α_2, ..., α_k)′, g*(z) = ∏_{j=1}^{k} g(z_j), and the e_j are the elementary vectors of the coordinate system R^k. This density belongs to the family of densities introduced in Theorem 2.2.1. Note that the density of z can not be represented by the family of multivariate skew symmetric densities introduced by Equation 2.5.
Let z ∼ f(z, α) = 2^k g*(z) ∏_{j=1}^{k} H(α_j e_j′z). By construction, the mean of z is

E(z) = µ_0 = (µ(α_1), µ(α_2), ..., µ(α_k))′,

and the covariance is

C(z) = Σ_0 = diag(σ²(α_1), σ²(α_2), ..., σ²(α_k)).

Now let Σ be a positive definite covariance matrix of dimension k, and let y = Σ^{1/2}z. Then

y ∼ f(y, α, Σ) = (2^k/|Σ|^{1/2}) g*(Σ^{-1/2}y) ∏_{j=1}^{k} H(α_j e_j′Σ^{-1/2}y).

By letting α_j Σ^{-1/2}e_j = λ_j for j = 1, 2, ..., k, we get

f(y, λ_1, λ_2, ..., λ_k, Σ) = (2^k/|Σ|^{1/2}) g*(Σ^{-1/2}y) ∏_{j=1}^{k} H(λ_j′y).

This time, the mean of y is E(y) = Σ^{1/2}µ_0, and the covariance matrix is C(y) = Σ^{1/2}Σ_0Σ^{1/2}.

Next, let x = y + µ where µ ∈ R^k. The probability density function of x is

f(x, α, Σ, µ) = (2^k/|Σ|^{1/2}) g*(Σ^{-1/2}(x − µ)) ∏_{j=1}^{k} H(α_j e_j′Σ^{-1/2}(x − µ)).

The mean of x is E(x) = Σ^{1/2}µ_0 + µ, and the covariance matrix of x is C(x) = Σ^{1/2}Σ_0Σ^{1/2}.
The preceding results lead to the following definition:

Definition 2.2.1. (Multivariate Skew-Symmetric Density). Let g(.) be a density function symmetric about 0, and let H(.) be an absolutely continuous cumulative distribution function with H′(.) symmetric around 0. A variable x has a skew-symmetric distribution if it has probability density function

f(x, α, Σ, µ) = (2^k/|Σ|^{1/2}) g*(Σ^{-1/2}(x − µ)) ∏_{j=1}^{k} H(α_j e_j′Σ^{-1/2}(x − µ))  (2.13)

where the α_j are scalars, α = (α_1, α_2, ..., α_k)′, µ ∈ R^k, Σ is a positive definite covariance matrix, and g*(y) = ∏_{j=1}^{k} g(y_j) is a k-dimensional density function symmetric around 0. We call this density the skew-symmetric density with location parameter µ, scale parameter Σ^{1/2}, and shape parameter α, and denote it by ss_k^{g,H}(µ, Σ^{1/2}, α).

Note that we could have taken the matrix Σ^{1/2} in the linear form x = Σ^{1/2}z + µ to be any m × k matrix (for µ ∈ R^m). Then we do not always have a density for the random variable x; nevertheless, the distribution of x is defined by the moment generating function if it exists.
Let z ∼ ss_k^{g,H}(0_k, I_k, α). Then the moment generating function of z evaluated at t_k ∈ R^k, M_z(t_k), can be obtained as follows:

M_z(t_k) = E(e^{t_k′z}) = ∫_{R^k} e^{t_k′z} 2^k g*(z) ∏_{j=1}^{k} H(α_j e_j′z) dz = E_{g*(z)}(e^{t_k′z} ∏_{j=1}^{k} H(α_j e_j′z)).

Let x = Az + µ for a constant m × k matrix A and an m-dimensional constant vector µ. The moment generating function of x evaluated at t_m ∈ R^m, M_x(t_m), can be obtained as follows:

M_x(t_m) = e^{t_m′µ} M_z(A′t_m) = e^{t_m′µ} E_{g*(z)}(e^{t_m′Az} ∏_{j=1}^{k} H(α_j e_j′z)).
Definition 2.2.2. (Multivariate Skew Symmetric Random Variable). Let g(.) be a density function symmetric about 0, and let H(.) be an absolutely continuous cumulative distribution function with H′(.) symmetric around 0. Let z_j ∼ f(z, α_j) = 2g(z)H(α_j z) for j = 1, 2, ..., k be independent variables. Then

z = (z_1, z_2, ..., z_k)′ ∼ f(z, α) = 2^k g*(z) ∏_{j=1}^{k} H(α_j e_j′z)

where g*(z) = ∏_{j=1}^{k} g(z_j) and the e_j for j = 1, 2, ..., k are the elementary vectors of the coordinate system R^k. Let A be an m × k constant matrix, and let µ be an m-dimensional constant real vector. The random variable x = Az + µ has the multivariate skew symmetric distribution with location parameter µ, scale parameter A, and shape parameter α. We denote this by x ∼ SS_{m,k}^{g,H}(µ, A, α).

By this definition we can write the following properties:
Property 2.2.1. Let z_j ∼ f(z, α_j) = 2g(z)H(α_j z), with mean µ(α_j) and variance σ²(α_j), for j = 1, 2, ..., k be independent variables. Let z = (z_1, z_2, ..., z_k)′, x = Az + µ, and α = (α_1, α_2, ..., α_k)′. Then x ∼ SS_{m,k}^{g,H}(µ, A, α). By construction, the mean of x is E(x) = Aµ_0 + µ, and the covariance matrix of x is C(x) = AΣ_0A′, where µ_0 = (µ(α_1), µ(α_2), ..., µ(α_k))′ and Σ_0 = diag(σ²(α_1), σ²(α_2), ..., σ²(α_k)).
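Property 2.2.1 is easy to verify by simulation. The sketch below (illustration only) uses the skew normal kernel, for which µ(α_j) = √(2/π) δ_j and σ²(α_j) = 1 − 2δ_j²/π with δ_j = α_j/√(1 + α_j²), and compares the sample mean and covariance of x = Az + µ with Aµ_0 + µ and AΣ_0A′.

```python
# Simulation check of Property 2.2.1 for the skew normal kernel.
import numpy as np

rng = np.random.default_rng(6)
alpha = np.array([2.0, -1.0])
delta = alpha / np.sqrt(1 + alpha ** 2)
mu0 = np.sqrt(2 / np.pi) * delta                 # mu(alpha_j)
Sig0 = np.diag(1 - 2 * delta ** 2 / np.pi)       # sigma^2(alpha_j)

n = 200_000
z = delta * np.abs(rng.standard_normal((n, 2))) \
    + np.sqrt(1 - delta ** 2) * rng.standard_normal((n, 2))
A = np.array([[1.0, 0.5], [0.0, 2.0]])
mu = np.array([1.0, -1.0])
x = z @ A.T + mu

print(np.round(x.mean(axis=0), 2), np.round(A @ mu0 + mu, 2))
print(np.round(np.cov(x, rowvar=False), 2), np.round(A @ Sig0 @ A.T, 2))
```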
2.3 Multivariate Skew-Normal Distribution

2.3.1 Motivation: Skew-Spherically Symmetric Densities

Theorem 2.3.1. Let g(.) be a k-dimensional spherical jpdf, i.e. one satisfying g(x) = k(x′x) for all x ∈ R^k for some density k(r²) defined for r ≥ 0; let H(.) be an absolutely continuous cumulative distribution function with H′(.) symmetric around 0; let α = (α_1, α_2, ..., α_k)′ be a k-dimensional real vector; and let e_j for j = 1, 2, ..., k be the elementary vectors of R^k. Then

f(y, α) = c^{-1} g(y) ∏_{j=1}^{k} H(α_j e_j′y)  (2.14)

defines a probability density function, where c = E_x(∏_{j=1}^{k} P(z_j ≤ α_j x_j | x)) and z_1, z_2, ..., z_k are identically distributed random variables with cdf H(.), independent of x ∼ g(x).
Proof. f(y) is nonnegative; we have to prove ∫_{R^k} f(y) dy = 1. Write

c ∫_{R^k} f(y) dy = ∫_{R^k} g(y) ∏_{j=1}^{k} H(α_j e_j′y) dy = E_x(∏_{j=1}^{k} P(z_j ≤ α_j x_j | x)) = c.

The last equality holds because z_1, z_2, ..., z_k are identically distributed random variables with cdf H(.), independent of x ∼ g(x). This concludes the proof.

When we choose g(.) = φ_k(.) and (z_1, z_2, ..., z_k)′ independent, this reduces to the density introduced in Example 2.2.2. To see this, observe that E_x(∏_{j=1}^{k} P(z_j ≤ α_j x_j | x)) = ∏_{j=1}^{k} P(z_j ≤ α_j x_j) = ∏_{j=1}^{k} P(z_j − α_j x_j ≤ 0) = 1/2^k, since both H′(.) and φ_1(.) are symmetric, so the z_j − α_j x_j, j = 1, 2, ..., k, have iid symmetric distributions.
2.3.2 Multivariate Skew-Normal Distribution

Definition

In a recent article, Chen and Gupta ([32]) pointed out that neither of the multivariate skew-normal models (2.3), (2.4) coheres with the joint distribution of a random sample from a univariate skew-normal distribution, and introduced an alternative multivariate skew-normal model which overcomes this problem ([32]):

SN_k(y, α, Σ) = f(y, α, Σ) = 2^k φ_k(y, Σ) ∏_{i=1}^{k} Φ(λ_i′y)  (2.15)

where α ∈ R^k, Λ = (λ_1, λ_2, ..., λ_k) = Σ^{-1/2} diag(α_1, α_2, ..., α_k), φ_k(., Σ) is the k-dimensional normal density function with covariance Σ, and Φ(.) denotes the cumulative distribution function of the univariate standard normal variable. Chen and Gupta's skew normal model ([32]) is in the form of the skew symmetric density introduced in Definition 2.2.1.
f (z, α) = 2φ(z)Φ(αz)
where α is a real scalar, φ(.) is the univariate standard normal density function and
Φ(.) denotes the cumulative distribution function of the univariate standard normal
q
α
2α2
2
variable has mean µ(α) = π2 √1+α
2 , variance σ (α) = (1 − π(1+α2 ) ). (Skewness: γ1 =
4−π (µ(α))3
,
2 (σ 2 (α))3/2
4
(µ(α))
Kurtosis: γ2 = 2(π − 3) (σ
2 (α))2 .)
Let z_j ∼ f(z, α_j) = 2φ(z)Φ(α_j z) for j = 1, 2, . . . , k be independent variables. Then the
joint p.d.f. of z = (z_1, z_2, . . . , z_k)′ is

f(z, α_1, α_2, . . . , α_k) = 2^k φ_k(z) ∏_{j=1}^{k} Φ(α_j e_j′ z),

where φ_k(z) = ∏_{j=1}^{k} φ(z_j) is the k-dimensional standard normal density and the e_j
are the elementary vectors of the coordinate system R^k. The mean of z is

E(z) = µ_0 = (µ(α_1), µ(α_2), . . . , µ(α_k))′,

and the covariance is

C(z) = Σ_0 = diag(σ^2(α_1), σ^2(α_2), . . . , σ^2(α_k)).

Now let Σ be a positive definite covariance matrix, and let y = Σ^{1/2} z. Then

y ∼ f(y, α, Σ) = (2^k/|Σ|^{1/2}) φ_k(Σ^{−1/2} y) ∏_{j=1}^{k} Φ(α_j e_j′ Σ^{−1/2} y).

Let x = y + µ where µ ∈ R^k. The probability density function of x is

f(x, α, Σ, µ) = (2^k/|Σ|^{1/2}) φ_k(Σ^{−1/2}(x − µ)) ∏_{j=1}^{k} Φ(α_j e_j′ Σ^{−1/2}(x − µ)).
Density

Definition 2.3.1. (Multivariate Skew Normal Density). We call the density

f(x, α, Σ, µ) = (2^k/|Σ|^{1/2}) φ_k(Σ^{−1/2}(x − µ)) ∏_{j=1}^{k} Φ(α_j e_j′ Σ^{−1/2}(x − µ))   (2.16)

the density of a multivariate skew normal variable with location parameter µ, scale
parameter Σ^{1/2}, and shape parameter α = (α_1, α_2, . . . , α_k)′ ∈ R^k, and denote it by
sn_k(µ, Σ^{1/2}, α).

Remark 2.3.1. Neither of the models (2.2, 2.3) can represent the joint density of
independent univariate variables x_j ∼ SN_1(σ_j, α_j), except in some special cases. The
joint density of independent univariate skew normal variables belongs to the new family
of densities introduced by Definition 2.3.1.

We can take the matrix Σ^{1/2} in the linear form x = Σ^{1/2} z + µ to be any (m × k)
matrix. In that case x need not have a density; nevertheless, the distribution of x is
defined by the moment generating function if it exists.
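As an illustration of Definition 2.3.1, the density (2.16) can be evaluated directly. The sketch below (assuming NumPy and SciPy are available; the function names are ours) computes f(x, α, Σ, µ) through the symmetric square root of a positive definite Σ.

```python
import numpy as np
from scipy.stats import norm

def sym_sqrt_inv(Sigma):
    """Inverse symmetric square root of a positive definite Sigma."""
    w, V = np.linalg.eigh(Sigma)
    return V @ np.diag(w ** -0.5) @ V.T

def msn_pdf(x, alpha, Sigma, mu):
    """Density (2.16): (2^k/|Sigma|^{1/2}) phi_k(S(x-mu)) prod_j Phi(alpha_j e_j' S (x-mu)),
    with S = Sigma^{-1/2}."""
    x, alpha, mu = map(np.asarray, (x, alpha, mu))
    k = x.size
    z = sym_sqrt_inv(Sigma) @ (x - mu)
    phi_k = np.exp(-0.5 * z @ z) / (2.0 * np.pi) ** (k / 2.0)
    return 2.0**k / np.sqrt(np.linalg.det(Sigma)) * phi_k * np.prod(norm.cdf(alpha * z))

# e.g. msn_pdf([0.5, -0.2], alpha=[2.0, 1.0], Sigma=np.eye(2), mu=[0.0, 0.0])
```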
Moment Generating Function

We need the following lemmas; see Zacks ([61]) and Chen and Gupta ([17]).

Lemma 2.3.1. Let z ∼ φ_k(z). For a scalar b ∈ R, a vector a ∈ R^k, and Σ a positive
definite matrix of order k, E(Φ(b + a′Σ^{1/2} z)) = Φ(b/(1 + a′Σa)^{1/2}).

Lemma 2.3.2. Let Z ∼ φ_{k×n}(Z). For a scalar b, a vector a ∈ R^k, and A and B positive
definite matrices of order k and n respectively, E(Φ_n(b + a′AZB)) = Φ_n(a, (1 + a′AA′a)^{1/2} B).
Let z ∼ sn_k(0_k, I_k, α). The moment generating function of z evaluated at t_k ∈ R^k,
M_z(t_k), can be obtained as follows:

M_z(t_k) = E_{φ_k(z)}(e^{t_k′z} ∏_{j=1}^{k} Φ(α_j e_j′ z))
= (2^k/(2π)^{k/2}) ∫_{R^k} e^{−(1/2)z′z + t_k′z} ∏_{j=1}^{k} Φ(α_j e_j′ z) dz
= (2^k/(2π)^{k/2}) ∫_{R^k} e^{−(1/2)(z′z − 2t_k′z)} ∏_{j=1}^{k} Φ(α_j e_j′ z) dz
= (2^k e^{(1/2)t_k′t_k}/(2π)^{k/2}) ∫_{R^k} e^{−(1/2)(z′z − 2t_k′z + t_k′t_k)} ∏_{j=1}^{k} Φ(α_j e_j′ z) dz
= (2^k e^{(1/2)t_k′t_k}/(2π)^{k/2}) ∫_{R^k} e^{−(1/2)(z − t_k)′(z − t_k)} ∏_{j=1}^{k} Φ(α_j e_j′ z) dz
= 2^k e^{(1/2)t_k′t_k} ∏_{j=1}^{k} ∫_R (1/√(2π)) e^{−(1/2)(z_j − (t_k)_j)^2} Φ(α_j z_j) dz_j
= 2^k e^{(1/2)t_k′t_k} ∏_{j=1}^{k} ∫_R (1/√(2π)) e^{−(1/2)y_j^2} Φ(α_j y_j + α_j (t_k)_j) dy_j
= 2^k e^{(1/2)t_k′t_k} ∏_{j=1}^{k} Φ(α_j (t_k)_j / √(1 + α_j^2)).

Let x = Az + µ for a constant (m × k) matrix A and an m-dimensional constant vector µ.
Then the moment generating function of x evaluated at t_m ∈ R^m, M_x(t_m), can be
obtained as follows:

M_x(t_m) = e^{t_m′µ} M_z(A′t_m) = 2^k e^{t_m′µ + (1/2)t_m′AA′t_m} ∏_{j=1}^{k} Φ(α_j (A′t_m)_j / √(1 + α_j^2)).
Hence the following definition and theorem.
Definition 2.3.2. (Multivariate Skew Normal Random Variable) Let z_j ∼ 2φ(z)Φ(α_j z)
for j = 1, 2, . . . , k be independent univariate skew normal random variables. Then

z = (z_1, z_2, . . . , z_k)′ ∼ 2^k φ_k(z) ∏_{j=1}^{k} Φ(α_j e_j′ z),

where φ_k(z) = ∏_{j=1}^{k} φ(z_j) and the e_j are the elementary vectors of the coordinate
system R^k. Let A be an m × k constant matrix, and let µ be an m-dimensional constant
real vector. The m-dimensional random variable x = Az + µ is distributed according to
the multivariate skew normal distribution with location parameter µ, scale parameter A,
and k-dimensional shape parameter α = (α_1, α_2, . . . , α_k)′. We denote this by
x ∼ SN_{m,k}(µ, A, α).
Theorem 2.3.2. If x has the multivariate skew-normal distribution SN_{m,k}(µ, A, α),
then the moment generating function of x evaluated at t_m ∈ R^m is given by

M_x(t_m) = 2^k e^{t_m′µ + (1/2)t_m′AA′t_m} ∏_{j=1}^{k} Φ(α_j (A′t_m)_j / √(1 + α_j^2)).   (2.17)
Moments

The expectation and covariance of a multivariate skew normal variable x with distribution
SN_{m,k}(µ, A, α) are easily calculated from the expectation and variance of the univariate
skew normal variable. Let z ∼ sn_k(0_k, I_k, α). Then

E(x) = AE(z) + µ = √(2/π) A (α_1/√(1 + α_1^2), α_2/√(1 + α_2^2), . . . , α_k/√(1 + α_k^2))′ + µ,

Cov(x) = A Cov(z) A′ = A(I_k − (2/π) diag(α_1^2/(1 + α_1^2), α_2^2/(1 + α_2^2), . . . , α_k^2/(1 + α_k^2)))A′.
Given its jmgf, the cumulants of the multivariate skew normal distribution with µ = 0
can be calculated. The cumulant generating function is given by

K_x(t) = log(M_x(t)) = k log(2) + (1/2) t′AA′t + Σ_{j=1}^{k} log(Φ(α_j (A′t)_j / √(1 + α_j^2))).

Differentiating once and evaluating at t = 0_k,

∂K_x(t)/∂t |_{t=0_k} = [ AA′t + Σ_{j=1}^{k} (α_j/√(1 + α_j^2)) (φ(u_j)/Φ(u_j)) Ae_j ]_{t=0_k}
= √(2/π) A (α_1/√(1 + α_1^2), α_2/√(1 + α_2^2), . . . , α_k/√(1 + α_k^2))′,

where u_j = α_j e_j′A′t/√(1 + α_j^2). Differentiating twice,

∂^2 K_x(t)/∂t∂t′ |_{t=0_k} = [ AA′ + Σ_{j=1}^{k} (α_j^2/(1 + α_j^2)) ( φ′(u_j)/Φ(u_j) − (φ(u_j)/Φ(u_j))^2 ) Ae_j ⊗ e_j′A′ ]_{t=0_k}
= A(I_k − (2/π) diag(α_1^2/(1 + α_1^2), α_2^2/(1 + α_2^2), . . . , α_k^2/(1 + α_k^2)))A′.

Differentiating a third time,

∂^3 K_x(t)/∂t∂t′∂t |_{t=0_k} = √(2/π)(4/π − 1) Σ_{j=1}^{k} (α_j/√(1 + α_j^2))^3 Ae_j ⊗ e_j′A′ ⊗ Ae_j.
Linear Forms

By Definition 2.3.2 we can write z ∼ SN_{k,k}(0_k, I_k, α), and prove the following theorems.

Theorem 2.3.3. Assume that y ∼ SN_{m,k}(µ, A, α) and x = By + d, with B an (l × m)
matrix and d an l-dimensional real vector. Then x ∼ SN_{l,k}(Bµ + d, BA, α).

Proof. By assumption y = Az + µ, and so x = (BA)z + (Bµ + d), i.e.,
x ∼ SN_{l,k}(Bµ + d, BA, α).

Corollary 2.3.1. Assume that y ∼ sn_k(µ, A, α) where A is a full rank square matrix of
order k. Let Σ = AA′, and let Σ^{1/2} be the square root of Σ as defined in Section 2.1.1.
Assume x ∼ SN_{k,k}(µ, Σ^{1/2}, α). Then x − µ =_d O(y − µ), where O = Σ^{1/2}A^{−1}.

Corollary 2.3.2. Assume that y ∼ SN_{m,k}(µ, A, α) and partition

y = (y^{(1)′}, y^{(2)′})′,  µ = (µ^{(1)′}, µ^{(2)′})′,  A = [A_{11} A_{12}; A_{21} A_{22}],

where y^{(1)}, µ^{(1)} are l-dimensional and A_{11} is (l × l). Then
y^{(1)} ∼ SN_{l,k}(µ^{(1)}, (A_{11} A_{12}), α).

Proof. In the theorem above, put B = (I_l 0_{l×(m−l)}) and d = 0_l.

Corollary 2.3.3. If y^{(1)} ∼ SN_{m,k}(µ^{(1)}, A, α^{(1)}) and y^{(2)} ∼ SN_{m,k}(µ^{(2)}, B, α^{(2)})
are independent, then x = y^{(1)} + y^{(2)} has the
SN_{m,2k}(µ^{(1)} + µ^{(2)}, [A, B], (α^{(1)′}, α^{(2)′})′) distribution.

Corollary 2.3.4. Let y^{(i)} ∼ SN_{m,k}(µ, A, α) be independent for i = 1, 2, . . . , n. Then
S = Σ_{i=1}^{n} y^{(i)} has the SN_{m,nk}(nµ, [A, A, . . . , A], (α′, α′, . . . , α′)′) distribution. The
jmgf can be written as

M_S(t_m) = 2^{nk} e^{n t_m′µ + (n/2) t_m′AA′t_m} ∏_{j=1}^{k} (Φ(α_j (A′t_m)_j / √(1 + α_j^2)))^n.

The jmgf of S/n is

M_{S/n}(t_m) = 2^{nk} e^{t_m′µ + (1/(2n)) t_m′AA′t_m} ∏_{j=1}^{k} (Φ(α_j (A′t_m)_j / (n√(1 + α_j^2))))^n.
Some Illustrations

To illustrate the preceding theory, we consider the bivariate skew normal distribution.
Assume (x_1, x_2)′ ∼ SN_{2,k}((µ_1, µ_2)′, A, (α_1, . . . , α_k)′) with

A = [a_{11} . . . a_{1k}; a_{21} . . . a_{2k}].

The mean vector is

E((x_1, x_2)′) = A (µ(α_1), . . . , µ(α_k))′ + (µ_1, µ_2)′
= (a_{11}µ(α_1) + . . . + a_{1k}µ(α_k) + µ_1, a_{21}µ(α_1) + . . . + a_{2k}µ(α_k) + µ_2)′,   (2.18)

and the covariance can be written as

C((x_1, x_2)′) = A diag(σ^2(α_1), . . . , σ^2(α_k)) A′
= [ a_{11}^2σ^2(α_1) + . . . + a_{1k}^2σ^2(α_k),  a_{11}a_{21}σ^2(α_1) + . . . + a_{1k}a_{2k}σ^2(α_k);
    a_{11}a_{21}σ^2(α_1) + . . . + a_{1k}a_{2k}σ^2(α_k),  a_{21}^2σ^2(α_1) + . . . + a_{2k}^2σ^2(α_k) ],   (2.19)

where µ(α) = √(2/π) α/√(1 + α^2) and σ^2(α) = 1 − 2α^2/(π(1 + α^2)).

First, let us take k = 1. Then the mean vector becomes

E((x_1, x_2)′) = (a_{11}, a_{21})′µ(α_1) + (µ_1, µ_2)′
= (√(2/π) α_1 a_{11}/√(1 + α_1^2) + µ_1, √(2/π) α_1 a_{21}/√(1 + α_1^2) + µ_2)′,   (2.20, 2.21)

and the covariance becomes

C((x_1, x_2)′) = (a_{11}, a_{21})′ σ^2(α_1) (a_{11}, a_{21})
= (1 − 2α_1^2/(π(1 + α_1^2))) [ a_{11}^2, a_{11}a_{21}; a_{11}a_{21}, a_{21}^2 ].   (2.22)

For k = 2, the mean vector becomes

E((x_1, x_2)′) = [a_{11} a_{12}; a_{21} a_{22}] (µ(α_1), µ(α_2))′ + (µ_1, µ_2)′   (2.23)
= (√(2/π) α_1 a_{11}/√(1 + α_1^2) + √(2/π) α_2 a_{12}/√(1 + α_2^2) + µ_1,
   √(2/π) α_1 a_{21}/√(1 + α_1^2) + √(2/π) α_2 a_{22}/√(1 + α_2^2) + µ_2)′,   (2.24)

and the covariance can be written as

C((x_1, x_2)′) = [a_{11} a_{12}; a_{21} a_{22}] diag(σ^2(α_1), σ^2(α_2)) [a_{11} a_{12}; a_{21} a_{22}]′
= [ σ^2(α_1)a_{11}^2 + σ^2(α_2)a_{12}^2,  a_{11}a_{21}σ^2(α_1) + a_{12}a_{22}σ^2(α_2);
    a_{11}a_{21}σ^2(α_1) + a_{12}a_{22}σ^2(α_2),  σ^2(α_1)a_{21}^2 + σ^2(α_2)a_{22}^2 ].   (2.25)

For k = 3, the mean is

( a_{11}µ(α_1) + a_{12}µ(α_2) + a_{13}µ(α_3) + µ_1,
  a_{21}µ(α_1) + a_{22}µ(α_2) + a_{23}µ(α_3) + µ_2 )′,   (2.27)

and the covariance is

[ a_{11}^2σ^2(α_1) + a_{12}^2σ^2(α_2) + a_{13}^2σ^2(α_3),  a_{11}a_{21}σ^2(α_1) + a_{12}a_{22}σ^2(α_2) + a_{13}a_{23}σ^2(α_3);
  a_{11}a_{21}σ^2(α_1) + a_{12}a_{22}σ^2(α_2) + a_{13}a_{23}σ^2(α_3),  a_{21}^2σ^2(α_1) + a_{22}^2σ^2(α_2) + a_{23}^2σ^2(α_3) ].   (2.28)
Independence, Conditional Distributions

Theorem 2.3.4. Assume that y ∼ sn_k(µ, Σ^{1/2}, α) and partition

y = (y^{(1)′}, y^{(2)′})′,  µ = (µ^{(1)′}, µ^{(2)′})′,  α = (α^{(1)′}, α^{(2)′})′,
Σ^{1/2} = [Σ^{1/2}_{11} Σ^{1/2}_{12}; Σ^{1/2}_{21} Σ^{1/2}_{22}],

where y^{(1)}, µ^{(1)}, α^{(1)} are l-dimensional and Σ^{1/2}_{11} is (l × l). If Σ^{1/2}_{12} = 0 and
Σ^{1/2}_{21} = 0, then y^{(1)} ∼ sn_l(µ^{(1)}, Σ^{1/2}_{11}, α^{(1)}) is independent of
y^{(2)} ∼ sn_{k−l}(µ^{(2)}, Σ^{1/2}_{22}, α^{(2)}).

Proof. The inverse of Σ^{1/2} is

Σ^{−1/2} = [Σ^{−1/2}_{11} 0; 0 Σ^{−1/2}_{22}].

Note that the quadratic form in the exponent of sn_k(µ, Σ^{1/2}, α) is

Q = (y − µ)′Σ^{−1}(y − µ)
= (y^{(1)} − µ^{(1)})′Σ^{−1}_{11}(y^{(1)} − µ^{(1)}) + (y^{(2)} − µ^{(2)})′Σ^{−1}_{22}(y^{(2)} − µ^{(2)}).

Also

∏_{j=1}^{k} Φ(α_j e_j′Σ^{−1/2}(y − µ))
= ∏_{j=1}^{l} Φ(α_j e_j′Σ^{−1/2}_{11}(y^{(1)} − µ^{(1)})) ∏_{j=l+1}^{k} Φ(α_j e_j′Σ^{−1/2}_{22}(y^{(2)} − µ^{(2)})),

and |Σ| = |Σ_{11}||Σ_{22}|. The density of y can therefore be written as

sn_k(µ, Σ^{1/2}, α) = (2^l/|Σ_{11}|^{1/2}) φ_l(Σ^{−1/2}_{11}(y^{(1)} − µ^{(1)})) ∏_{j=1}^{l} Φ(α_j e_j′Σ^{−1/2}_{11}(y^{(1)} − µ^{(1)}))
× (2^{k−l}/|Σ_{22}|^{1/2}) φ_{k−l}(Σ^{−1/2}_{22}(y^{(2)} − µ^{(2)})) ∏_{j=1}^{k−l} Φ(α_{l+j} e_j′Σ^{−1/2}_{22}(y^{(2)} − µ^{(2)}))
= sn_l(µ^{(1)}, Σ^{1/2}_{11}, α^{(1)}) sn_{k−l}(µ^{(2)}, Σ^{1/2}_{22}, α^{(2)}).

Since the ordering of the variables is irrelevant, the above discussion also proves the
following corollary:
Corollary 2.3.5. Let y ∼ sn_k(µ, Σ^{1/2}, α) and assume that [i, j] partitions the indices
{1, 2, . . . , k} such that (Σ^{1/2})_{(ij)} = 0 for all i ∈ i and all j ∈ j. Then the marginal
joint distribution of the variables with indices in i is skew normal, with location, scale,
and shape parameters obtained by taking the corresponding components of µ, Σ^{1/2}, and
α, respectively.

We can also prove the following:

Corollary 2.3.6. Let y ∼ sn_k(µ, A, α), write A = (a_1′, a_2′, . . . , a_k′)′, and assume that
[i, j] are two disjoint subsets of the indices {1, 2, . . . , k} such that a_i′a_j = 0 for all i ∈ i
and all j ∈ j. Let A_i denote the matrix of rows of A with indices in i and A_j the matrix
of rows of A with indices in j. Then there exists a k-dimensional skew normal variable,
say w, obtained from y by a nonsingular linear transformation, such that for w the
variables with indices in i are independent of the variables with indices in j.

Proof. The hypothesis about the matrix of the linear transformation A requires that
Σ = AA′ be such that (Σ)_{ij} = 0 for all i ∈ i and all j ∈ j. Let x ∼ sn_k(µ, Σ^{1/2}, α);
for this distribution the variables with indices in i are independent of the variables with
indices in j by Corollary 2.3.5. By Corollary 2.3.1, we can write y − µ =_d O(x − µ) for
O = Σ^{1/2}A^{−1}. This concludes the proof.
Corollary 2.3.7. Assume that x ∼ sn_k(µ, A, α) and partition

x = (x^{(1)′}, x^{(2)′})′,  µ = (µ^{(1)′}, µ^{(2)′})′,  A = [A_{11} A_{12}; A_{21} A_{22}],

where x^{(1)}, µ^{(1)} are l-dimensional and A_{11} is (l × l). Let

C = [I_l −A_{12}A_{22}^{−1}; 0 I_{k−l}].

Consider the variable

y = (CAA′C′)^{1/2}(CA)^{−1}(Cx − Cµ).

Write y = (y^{(1)′}, y^{(2)′})′, where y^{(1)} is l-dimensional. Then
y ∼ sn_k(0_k, (CAA′C′)^{1/2}, α), and y^{(1)} is independent of y^{(2)}.
Corollary 2.3.8. Assume that x ∼ sn_k(µ, Σ_c^{1/2}, α) and partition

x = (x^{(1)′}, x^{(2)′})′,  µ = (µ^{(1)′}, µ^{(2)′})′,  α = (α^{(1)′}, α^{(2)′})′,
Σ_c^{1/2} = [A_{11} 0; A_{21} A_{22}],

where x^{(1)}, µ^{(1)} are l-dimensional, A_{11} is an (l × l) lower triangular matrix with
positive diagonal elements, and A_{22} is a ((k−l) × (k−l)) lower triangular matrix with
positive diagonal elements. Let

C = [I_l 0; −A_{21}A_{11}^{−1} I_{k−l}].

Consider the variable

y = Cx = (x^{(1)′}, y^{(2)′})′.

Then x^{(1)} is independent of y^{(2)} = x^{(2)} − A_{21}A_{11}^{−1}x^{(1)}, which has the
sn_{k−l}(µ^{(2)} − A_{21}A_{11}^{−1}µ^{(1)}, A_{22}, α^{(2)}) distribution. The joint distribution of y is
given by

sn_l(µ^{(1)}, A_{11}, α^{(1)}) sn_{k−l}(µ^{(2)} − A_{21}A_{11}^{−1}µ^{(1)}, A_{22}, α^{(2)}).

The density of x can then be obtained by substituting y^{(2)} = x^{(2)} − A_{21}A_{11}^{−1}x^{(1)};
the density of x is

(2^l/|A_{11}|) φ_l(A_{11}^{−1}(x^{(1)} − µ^{(1)})) ∏_{j=1}^{l} Φ(α_j e_j′A_{11}^{−1}(x^{(1)} − µ^{(1)}))
× (2^{k−l}/|A_{22}|) φ_{k−l}(A_{22}^{−1}(x^{(2)} − (µ^{(2)} + A_{21}A_{11}^{−1}(x^{(1)} − µ^{(1)}))))
× ∏_{j=1}^{k−l} Φ(α_{l+j} e_j′A_{22}^{−1}(x^{(2)} − (µ^{(2)} + A_{21}A_{11}^{−1}(x^{(1)} − µ^{(1)})))).

The conditional distribution of x^{(2)} given x^{(1)} is
sn_{k−l}(µ^{(2)} + A_{21}A_{11}^{−1}(x^{(1)} − µ^{(1)}), A_{22}, α^{(2)}).
2.3.3 Generalized Multivariate Skew-Normal Distribution

We can define a multivariate skew normal density with any choice of skewing cdf H that
has the properties in Theorem 2.2.1. In this section, we define and study some properties
of this generalized multivariate skew normal density.

Definition 2.3.3. (Generalized Multivariate Skew Normal Density) Let x have a
multivariate skew symmetric distribution with kernel φ_k(.) and skewing distribution H(.),
defined as in Definition 2.2.1, with location parameter µ = 0_k, scale parameter Σ^{1/2},
and shape parameter α = (α_1, α_2, . . . , α_k)′ ∈ R^k; i.e., the density of x is

f(x, α, Σ, µ = 0_k) = (2^k/|Σ|^{1/2}) φ_k(Σ^{−1/2}x) ∏_{j=1}^{k} H(α_j e_j′Σ^{−1/2}x).

We call this density the generalized skew normal density, and denote it by gsn^H_k(µ, Σ, α).
Definition 2.3.4. (Generalized Multivariate Skew Normal Random Variable) Let H(.)
be defined as in Theorem 2.2.1. Let z_j ∼ 2φ(z)H(α_j z) for j = 1, 2, . . . , k be independent
univariate skew symmetric random variables. Then

z = (z_1, z_2, . . . , z_k)′ ∼ 2^k φ_k(z) ∏_{j=1}^{k} H(α_j e_j′ z),

where φ_k(z) = ∏_{j=1}^{k} φ(z_j) and the e_j are the elementary vectors of the coordinate
system R^k. Let A be an m × k constant matrix, and let µ be an m-dimensional constant
real vector. The m-dimensional random variable x = Az + µ is distributed according to
the generalized multivariate skew normal distribution with location parameter µ, scale
parameter A, and k-dimensional shape parameter α = (α_1, α_2, . . . , α_k)′. We denote this
by x ∼ GSN^H_{m,k}(µ, A, α).
Theorem 2.3.5. If x has the generalized multivariate skew normal distribution
GSN^H_{m,k}(µ, A, α), then the moment generating function of x evaluated at t_m ∈ R^m is
given by

M_x(t_m) = 2^k e^{t_m′µ + (1/2)t_m′AA′t_m} ∏_{j=1}^{k} E_{φ(y)}(H(α_j y_j + α_j (A′t_m)_j)).   (2.29)
Proof. If z has the generalized multivariate skew normal density gsn^H_k(0_k, I_k, α), then
the moment generating function of z evaluated at t_k ∈ R^k is given by

M_z(t_k) = E_{φ_k(z)}(e^{t_k′z} ∏_{j=1}^{k} H(α_j e_j′ z))
= (2^k/(2π)^{k/2}) ∫_{R^k} e^{−(1/2)z′z + t_k′z} ∏_{j=1}^{k} H(α_j e_j′ z) dz
= (2^k e^{(1/2)t_k′t_k}/(2π)^{k/2}) ∫_{R^k} e^{−(1/2)(z − t_k)′(z − t_k)} ∏_{j=1}^{k} H(α_j e_j′ z) dz
= 2^k e^{(1/2)t_k′t_k} ∏_{j=1}^{k} ∫_R (1/√(2π)) e^{−(1/2)(z_j − (t_k)_j)^2} H(α_j z_j) dz_j
= 2^k e^{(1/2)t_k′t_k} ∏_{j=1}^{k} ∫_R (1/√(2π)) e^{−(1/2)y_j^2} H(α_j y_j + α_j (t_k)_j) dy_j
= 2^k e^{(1/2)t_k′t_k} ∏_{j=1}^{k} E_{φ(y)}(H(α_j y_j + α_j (t_k)_j)).

Let x = Az + µ for a constant (m × k) matrix A and an m-dimensional constant vector µ.
Then the moment generating function of x evaluated at t_m ∈ R^m is

M_x(t_m) = e^{t_m′µ} M_z(A′t_m) = 2^k e^{t_m′µ + (1/2)t_m′AA′t_m} ∏_{j=1}^{k} E_{φ(y)}(H(α_j y_j + α_j (A′t_m)_j)).
By Definition 2.3.4 we can write z ∼ gsn^H_{k,k}(0_k, I_k, α), and prove the following theorem.

Theorem 2.3.6. Assume that y ∼ GSN^H_{m,k}(µ, A, α) and x = By + d, with B an (l × m)
matrix and d an l-dimensional real vector. Then x ∼ GSN^H_{l,k}(Bµ + d, BA, α).

Proof. By assumption y = Az + µ, and so x = (BA)z + (Bµ + d), i.e.,
x ∼ GSN^H_{l,k}(Bµ + d, BA, α).
Theorem 2.3.7. Assume that y ∼ GSN^H_{m,k}(µ, A, α) and partition

y = (y^{(1)′}, y^{(2)′})′,  µ = (µ^{(1)′}, µ^{(2)′})′,  A = [A_{11} A_{12}; A_{21} A_{22}],

where y^{(1)}, µ^{(1)} are l-dimensional and A_{11} is (l × l). Then
y^{(1)} ∼ GSN^H_{l,k}(µ^{(1)}, (A_{11} A_{12}), α).

Proof. In the theorem above, put B = (I_l 0_{l×(m−l)}) and d = 0_l. Then we obtain
y^{(1)} ∼ GSN^H_{l,k}(µ^{(1)}, (A_{11} A_{12}), α).
In the rest of this section we study some members of the generalized multivariate skew
normal family. Different choices of the skewing function H(.) give different distribution
families. We will take H(.) to be the cdf of the Laplace, logistic, or uniform distribution
in the sequel, and study some properties of the resulting distributions.

Multivariate Skew Normal-Laplace Distribution

The cdf of the Laplace distribution is given by

H_1(x) = 1 − (1/2) exp(−x) for x ≥ 0, and H_1(x) = (1/2) exp(x) for x < 0.
If we put H(.) = H_1(.) in the definition of the generalized skew normal density in 2.3.3,
we get the skew normal-Laplace density. To evaluate the jmgf of a skew normal-Laplace
random vector we use the following lemma ([40]):

Lemma 2.3.3. Let z ∼ φ(z). Then

E_z(H_1(a + bz)) = (1/2) exp(a + b^2/2) Φ(−(a + b^2)/|b|) + Φ(a/|b|)
− (1/2) exp(−a + b^2/2) Φ((a − b^2)/|b|).

Corollary 2.3.9. The jmgf of a k-dimensional skew normal-Laplace random vector x
with location parameter µ, scale parameter A, and shape parameter α is given by

M_x(t_m) = 2^k e^{t_m′µ + (1/2)t_m′AA′t_m}
× ∏_{j=1}^{k} ( (1/2) e^{a_j + α_j^2/2} Φ(−(a_j + α_j^2)/|α_j|) + Φ(a_j/|α_j|)
− (1/2) e^{−a_j + α_j^2/2} Φ((a_j − α_j^2)/|α_j|) ),

where a_j = α_j (A′t_m)_j and b_j = α_j are the quantities entering Lemma 2.3.3.
Proof. By Theorem 2.3.5 and Lemma 2.3.3.
The odd moments of the standard univariate skew normal-Laplace random variable are
given in ([50]); the even moments are the same as the even moments of the univariate
standard normal random variable. Suppose x is a random variable with the standard
skew normal-Laplace distribution with shape parameter α. The first moment of x is
given by

µ(α) = E(x) = 2α e^{α^2/2} Φ(−α),

and we can calculate the variance from the first and second moments:

σ^2(α) = var(x) = E(x^2) − E^2(x) = 1 − (2α e^{α^2/2} Φ(−α))^2.

Let x have the k-dimensional skew normal-Laplace distribution with location parameter µ,
scale parameter A, and shape parameter α. Then

E(x) = A (µ(α_1), µ(α_2), . . . , µ(α_k))′ + µ,

Cov(x) = A(I_k − diag(µ(α_1)^2, µ(α_2)^2, . . . , µ(α_k)^2))A′.
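As a sanity check on µ(α) = 2αe^{α^2/2}Φ(−α), one can compare the closed form against direct numerical integration of the standard skew normal-Laplace density. A sketch assuming SciPy (function names are ours); scipy.stats.laplace.cdf is the standard Laplace cdf H_1:

```python
import numpy as np
from scipy.stats import norm, laplace
from scipy.integrate import quad

def snl_mean_closed(alpha):
    # mu(alpha) = 2*alpha*exp(alpha^2/2)*Phi(-alpha)
    return 2.0 * alpha * np.exp(alpha**2 / 2.0) * norm.cdf(-alpha)

def snl_mean_numeric(alpha):
    # E(x) for the density 2*phi(x)*H1(alpha*x), H1 the Laplace cdf
    integrand = lambda x: x * 2.0 * norm.pdf(x) * laplace.cdf(alpha * x)
    return quad(integrand, -np.inf, np.inf)[0]

# snl_mean_closed(1.0) and snl_mean_numeric(1.0) should agree (about 0.5232)
```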
Multivariate Skew Normal-Logistic Distribution

The cdf of the logistic distribution is given by

H_2(x) = 1/(1 + e^{−x}) = Σ_{ℓ=0}^{∞} binom(−1, ℓ) e^{−ℓx} for x ≥ 0, and
H_2(x) = e^x Σ_{ℓ=0}^{∞} binom(−1, ℓ) e^{ℓx} for x < 0,

where binom(−1, ℓ) = (−1)^ℓ. If we put H(.) = H_2(.) in the definition of the generalized
skew normal density in 2.3.3, we get the skew normal-logistic density. To evaluate the
jmgf of a skew normal-logistic random vector we use the following lemma ([40]):

Lemma 2.3.4. Let z ∼ φ(z). Then

E_z(H_2(a + bz)) = Σ_{ℓ=0}^{∞} binom(−1, ℓ) ( e^{−ℓa + (ℓb)^2/2} Φ((a − ℓb^2)/|b|)
+ e^{(ℓ+1)a + ((ℓ+1)b)^2/2} Φ((−a − (ℓ+1)b^2)/|b|) ).

Corollary 2.3.10. The jmgf of a k-dimensional skew normal-logistic random vector x
with location parameter µ, scale parameter A, and shape parameter α is given by

M_x(t_m) = 2^k e^{t_m′µ + (1/2)t_m′AA′t_m}
× ∏_{j=1}^{k} Σ_{ℓ=0}^{∞} binom(−1, ℓ) ( e^{−ℓa_j + (ℓα_j)^2/2} Φ((a_j − ℓα_j^2)/|α_j|)
+ e^{(ℓ+1)a_j + ((ℓ+1)α_j)^2/2} Φ((−a_j − (ℓ+1)α_j^2)/|α_j|) ),

where a_j = α_j (A′t_m)_j.

Proof. By Theorem 2.3.5 and Lemma 2.3.4.
The odd moments of the standard univariate skew normal-logistic random variable are
given in ([50]); the even moments are the same as the even moments of the univariate
standard normal random variable. Suppose x is a random variable with the standard
skew normal-logistic distribution with shape parameter α. The first moment of x is
given by

µ(α) = E(x) = −2α ( Σ_{ℓ=0}^{∞} ℓ binom(−1, ℓ) e^{(ℓα)^2/2} Φ(−ℓα)
− Σ_{ℓ=0}^{∞} (ℓ+1) binom(−1, ℓ) e^{((ℓ+1)α)^2/2} Φ(−(ℓ+1)α) ),

and we can calculate the variance from the first and second moments:

σ^2(α) = var(x) = E(x^2) − E^2(x) = 1 − (µ(α))^2.

Let x have the k-dimensional skew normal-logistic distribution with location parameter µ,
scale parameter A, and shape parameter α. Then

E(x) = A (µ(α_1), µ(α_2), . . . , µ(α_k))′ + µ,

Cov(x) = A(I_k − diag(µ(α_1)^2, µ(α_2)^2, . . . , µ(α_k)^2))A′.
Multivariate Skew Normal-Uniform Distribution

The cdf of the uniform distribution on (−h, h) is given by

H_3(x) = 0 for x < −h; H_3(x) = (x + h)/(2h) for −h ≤ x ≤ h; H_3(x) = 1 for x > h.

If we put H(.) = H_3(.) in the definition of the generalized skew normal density in 2.3.3,
we get the skew normal-uniform density. To evaluate the jmgf of a skew normal-uniform
random vector we use the following lemma ([40]):

Lemma 2.3.5. Let z ∼ φ(z). Then

E_z(H_3(a + bz)) = (|b|/(2h√(2π))) exp(−(h + a)^2/(2b^2))
− (|b|/(2h√(2π))) exp(−(h − a)^2/(2b^2))
+ (a/(2h) − 1/2)(Φ((h − a)/|b|) − 1)
+ (a/(2h) + 1/2)Φ((h + a)/|b|).
Corollary 2.3.11. The jmgf of a k-dimensional skew normal-uniform random vector x
with location parameter µ, scale parameter A, and shape parameter α is given by

M_x(t_m) = 2^k e^{t_m′µ + (1/2)t_m′AA′t_m}
× ∏_{j=1}^{k} ( (|α_j|/(2h√(2π))) exp(−(h + a_j)^2/(2α_j^2))
− (|α_j|/(2h√(2π))) exp(−(h − a_j)^2/(2α_j^2))
+ (a_j/(2h) − 1/2)(Φ((h − a_j)/|α_j|) − 1)
+ (a_j/(2h) + 1/2)Φ((h + a_j)/|α_j|) ),

where a_j = α_j (A′t_m)_j.

Proof. By Theorem 2.3.5 and Lemma 2.3.5.

The odd moments of the standard univariate skew normal-uniform random variable are
given in ([50]); the even moments are the same as the even moments of the univariate
standard normal random variable. Suppose x is a random variable with the standard
skew normal-uniform distribution with shape parameter α. The first moment of x is
given by

µ(α) = E(x) = (α/h)(2Φ(h/α) − 1),

and we can calculate the variance from the first and second moments:

σ^2(α) = var(x) = E(x^2) − E^2(x) = 1 − (µ(α))^2.

Let x have the k-dimensional skew normal-uniform distribution with location parameter µ,
scale parameter A, and shape parameter α. Then

E(x) = A (µ(α_1), µ(α_2), . . . , µ(α_k))′ + µ,

Cov(x) = A(I_k − diag(µ(α_1)^2, µ(α_2)^2, . . . , µ(α_k)^2))A′.
2.3.4 Stochastic Representation, Random Number Generation

We have defined the k-dimensional multivariate skew symmetric random variable through
a location-scale family based on k independent univariate skew symmetric random
variables. This fact can be utilized to give a stochastic representation for the multivariate
skew symmetric random variable, which in turn can be used for random number
generation.

By putting k = 1 in Theorem 2.2.1, we get the density of a univariate skew symmetric
variable,

f(y, α) = 2g(y)H(αy)   (2.30)

where α is a real scalar, g(.) is a univariate density function symmetric around 0, and
H(.) is an absolutely continuous cumulative distribution function with H′(.) = h(.)
symmetric around 0.

A probabilistic representation of a univariate skew symmetric random variable is given
in the following ([10]).
Theorem 2.3.8. Let h(.) and g(.) be defined as above. If v ∼ h(v) and u ∼ g(u) are
independently distributed random variables, then the random variable

y = u if v ≤ αu, and y = −u if v > αu,

has density f(y, α) as given in (2.30) above.
Corollary 2.3.12. Let

z_j ∼ f(z, α_j) = 2g(z)H(α_j z)

for j = 1, 2, . . . , k be independent variables. The joint p.d.f. of z = (z_1, z_2, . . . , z_k)′ is

f(z, α) = 2^k g^*(z) ∏_{j=1}^{k} H(α_j e_j′ z),

where g^*(z) = ∏_{j=1}^{k} g(z_j) and the e_j are the elementary vectors of the coordinate
system R^k. For µ ∈ R^k and Σ^{1/2} a positive definite matrix, x = Σ^{1/2} z + µ has p.d.f.

f(x, α, Σ^{1/2}, µ) = (2^k/|Σ|^{1/2}) g^*(Σ^{−1/2}(x − µ)) ∏_{j=1}^{k} H(α_j e_j′Σ^{−1/2}(x − µ)).   (2.31)

A random variable y with density f(y, α, Σ^{1/2}, µ) has the same distribution as x.

Using Theorem 2.3.8 and Corollary 2.3.12, we can generate skew symmetric random
vectors from the sn_k(µ, Σ^{1/2}, α) density.
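Theorem 2.3.8 and Corollary 2.3.12 translate directly into a sampler. The sketch below (NumPy assumed; function names are ours) generates draws from sn_k(µ, Σ^{1/2}, α) by generating the k independent univariate skew normal coordinates and applying the location-scale map.

```python
import numpy as np

rng = np.random.default_rng(0)

def r_sn1(alpha, size, rng):
    """Theorem 2.3.8 with g = phi and H' = phi: u, v iid N(0,1);
    return u where v <= alpha*u, else -u."""
    u = rng.standard_normal(size)
    v = rng.standard_normal(size)
    return np.where(v <= alpha * u, u, -u)

def r_snk(mu, Sigma_half, alpha, n, rng):
    """Corollary 2.3.12: x = Sigma^{1/2} z + mu, z_j independent univariate skew normal."""
    k = len(alpha)
    Z = np.stack([r_sn1(alpha[j], n, rng) for j in range(k)])  # k x n
    return (np.asarray(Sigma_half) @ Z).T + np.asarray(mu)

# e.g. X = r_snk(mu=[0., 0.], Sigma_half=np.eye(2), alpha=[5.0, 2.0], n=1000, rng=rng)
```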
2.4 Extensions: Some Other Skew Symmetric Forms

In Theorems 2.2.1 and 2.2.2 we have seen that a function of the form

f(y, α) = c^{−1} g(y) ∏_{j=1}^{k} H(α_j y_j)

is a jpdf for a k-dimensional random variable if certain assumptions are met. Ma and
Genton ([45]) obtain a flexible family of densities by replacing αy in this density by an
odd function, say w(y). This motivates the following theorem.

Theorem 2.4.1. Let g(.) be a k-dimensional density centrally symmetric around 0_k
(i.e., g(x) = g(−x)), let H(.) be an absolutely continuous cumulative distribution
function with H′(.) symmetric around 0, and let w(.) be an odd function. Then

f(y) = c^{−1} g(y) ∏_{j=1}^{k} H(w(e_j′ y))   (2.32)

is a jpdf for a k-dimensional random variable, where c = E_x(∏_{j=1}^{k} P(z_j ≤ w(x_j)) | x)
for (z_1, z_2, . . . , z_k)′ iid random variables with cdf H(.), independent of x ∼ g(x).

Proof. g(y) ∏_{j=1}^{k} H(w(y_j)) is nonnegative. We need to calculate c:

c = ∫_{R^k} g(y) ∏_{j=1}^{k} H(w(y_j)) dy = E_x(∏_{j=1}^{k} P(z_j ≤ w(x_j)) | x),

since (z_1, z_2, . . . , z_k)′ are iid random variables with cdf H(.), independent of x ∼ g(x).

Observe that when g(x) is a density for k independent variables x_1, x_2, . . . , x_k (for
example, when we choose g(x) = 1/2^k, g(x) = φ_k(x), or g(x) = (1/2^k) exp(−Σ_{j=1}^{k} |x_j|)
as in Examples 2.2.1, 2.2.2, and 2.2.3, respectively), ∏_{j=1}^{k} P(z_j ≤ w(x_j)) =
∏_{j=1}^{k} P(z_j − w(x_j) ≤ 0) becomes 1/2^k, since H′(x) being symmetric and the x_j being
symmetric (with w odd) imply that w(x_j), and therefore z_j − w(x_j), for j = 1, 2, . . . , k,
have independent symmetric distributions.
Some examples of models motivated by Theorem 2.4.1 are presented next.

Example 2.4.1. Taking w(x) = λ_1 x/√(1 + λ_2 x^2), we obtain a multivariate form of
the skew symmetric family introduced by Arellano-Valle et al. ([7]). This density is
illustrated in two dimensions in Figure 2.7.

Example 2.4.2. Taking w(x) = αx + βx^3, we obtain a multivariate form of the skew
symmetric family introduced by Ma and Genton ([45]). This density is illustrated in two
dimensions in Figure 2.8.
Figure 2.7: The surface of the 2-dimensional pdf for the skew-normal variable in Example 2.4.1, for λ_{11} = 2, λ_{12} = −4, λ_{21} = 50, λ_{22} = 15.
Figure 2.8: The surface of the 2-dimensional pdf for the skew-normal variable in Example 2.4.2, for α_1 = 2, α_2 = 2, β_1 = 5, β_2 = 10.
Figure 2.9: The surface of the 2-dimensional pdf for the skew-normal variable in Example 2.4.3, for α_1 = 2, α_2 = 4, λ_1 = 1, λ_2 = 2.
Example 2.4.3. Taking w(x) = sign(x)|x|^{α/2} λ (2/α)^{1/2}, we obtain a multivariate
form of the skew symmetric family introduced by DiCiccio and Monti ([23]). This density
is illustrated in two dimensions in Figures 2.9, 2.10, 2.11, and 2.12; Figures 2.11 and 2.12
show the density after a linear transformation of this variable.
Figure 2.10: The contours of the 2-dimensional pdf for the skew-normal variable in Example 2.4.3, for α_1 = 2, α_2 = 4, λ_1 = 1, λ_2 = 2.
Figure 2.11: The surface of the 2-dimensional pdf for the skew-normal variable in Example 2.4.3, for α_1 = 2, α_2 = 4, λ_1 = 1, λ_2 = 2, with location µ = (0, 0)′ and Σ^{1/2} = [1 0; 1 2].
Figure 2.12: The contours of the 2-dimensional pdf for the skew-normal variable in Example 2.4.3, for α_1 = 2, α_2 = 4, λ_1 = 1, λ_2 = 2, with location µ = (0, 0)′ and Σ^{1/2} = [1 0; 1 2].
69
where Γ is a matrix of dimension m × k. In this case, the normalizing constant 2 will
k
have to be changed to
Ez (P (y < ΓA−1 (z − µ))|z)
and for y ∼ φk (0k , Im ) and z ∼ φk (µ, AA′ ) this is equal to
Φm (0; Im + ΓΓ′ ).
This extension to multivariate skew normal density is given by Gupta et al. [34] and
is similar to the fundamental skew distribution of Arellano-Valle, Branco, and Genton
([5]). Note that if k = m and Γ is diagonal then we are back to the density in Definition
2.3.1.
70
CHAPTER 3

A Class of Matrix Variate Skew Symmetric Distributions

3.1 Distribution of a Random Sample

In order to make inferences about the population parameters of a k-dimensional
distribution, we work with a random sample of n individuals from this population, which
can be represented by a k × n matrix X. Typical multivariate analysis assumes
independence among the individuals. However, in many situations this assumption may
be too restrictive. Matrix variate distributions can account for dependence among
individual observations.

In the normal case, the matrix variate density of X is written as ([36])

φ_{k×n}(X; M, AA′, B′B) = etr(−(1/2)(AA′)^{−1}(X − M)(B′B)^{−1}(X − M)′) / ((2π)^{nk/2}|AA′|^{n/2}|B′B|^{k/2}).   (3.1)
As in the multivariate case, the matrix A determines how the variables are related, and
the matrix B is introduced to account for dependence among individuals. Note that, for
the density in Equation 3.1 to be defined, AA′ and B′B should be positive definite. In
practice, however, there will be natural restrictions on the parameter space. First,
knowledge about the sample design through which the variable X is being observed can
be reflected in the parameter space. Second, simplifying assumptions about the
relationships among the k variables, such as independence or zero correlation, can be
used to achieve model parsimony. Finally, if there are no restrictions on AA′ or B′B,
then these parameters are confounded ([37]).

A detailed investigation of the sources, magnitude, and impact of errors is necessary to
identify how survey design and procedures may be improved. For example, most data
collection and sample designs involve some overlap between interviewer workloads and
the sampling units (clusters). In those cases, a proportion of the measurement variance
due to interviewers is reflected to some degree in the sampling variance calculations. The
variable effects that interviewers have on respondent answers are sometimes labeled the
correlated response variance ([14]). In the next example, we illustrate the use of the
matrix variate normal distribution to account for the correlated response variance.
Example 3.1.1. Suppose that the k-dimensional random variable x has a spherical
multivariate normal density over a population, i.e., x ∼ (1/σ^k) φ_k((σI_k)^{−1}x). In order
to make inferences about the scale parameter σ^2, a random sample of individuals from
this population is to be observed. In a realistic setting there will be more than one
interviewer to accomplish the work; each interviewer observes a subset of the random
sample of individuals. It is reasonable to assume that each interviewer introduces a
degree of correlation into their observations and that observations from different
interviewers are uncorrelated. Now, for simplicity, assume that there are only two
interviewers available to observe the random sample of n individuals, say I-1 and I-2.
I-1 is going to observe n_1 of the individuals and introduces a correlation ρ_1; I-2 is
responsible for observing the remainder of the sample and introduces a correlation of ρ_2.
As the following results show, the measurement variance due to the interviewers is
reflected to some degree in the sampling variance calculations.

The likelihood function given the observations can be represented with the matrix variate
normal density: X = (x_1, x_2, . . . , x_n) ∼ φ_{k×n}(X; M, AA′, B′B) with M = 0_{k×n},
AA′ = σ^2 I_k, and

B′B = [Ψ_1 0; 0 Ψ_2] = Ψ,

where Ψ_1 is an n_1 × n_1 symmetric matrix with all diagonal entries equal to one and all
off-diagonal entries equal to ρ_1, and Ψ_2 is an (n − n_1) × (n − n_1) symmetric matrix
with all diagonal entries equal to one and all off-diagonal entries equal to ρ_2.
The log-likelihood is

l(σ, ρ_1, ρ_2; X) = −(nk/2) log(2π) − (nk/2) log(σ^2) − (k/2) log|Ψ_1| − (k/2) log|Ψ_2|
− (σ^{−2}/2) tr(X′XΨ^{−1}).

In order to find the ML estimators of ρ_1, ρ_2, and σ^2, we equate the partial derivatives
of l(σ, ρ_1, ρ_2; X) to zero. According to this, if both ρ_1 and ρ_2 are known, the
asymptotically unbiased maximum likelihood estimator of σ^2 is given by

σ̂^2_{(Ψ)} = tr(X′XΨ^{−1})/(nk).   (3.2)

The usual maximum likelihood estimator from the uncorrelated model,
σ̂^2 = tr(X′X)/(nk), is asymptotically biased for σ^2, where the size of the bias depends
on the magnitudes of ρ_1 and ρ_2. A detailed investigation of the sources, magnitude, and
impact of errors is therefore necessary to identify how survey design and procedures may
be improved.
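The two estimators contrasted in this example are one-liners. A sketch (NumPy assumed; names are ours) that builds the block diagonal Ψ from known ρ_1 and ρ_2 and computes σ̂^2_{(Ψ)} of (3.2) alongside the uncorrelated-model estimator tr(X′X)/(nk):

```python
import numpy as np

def equicorr(m, rho):
    """m x m matrix with unit diagonal and all off-diagonal entries equal to rho."""
    return (1.0 - rho) * np.eye(m) + rho * np.ones((m, m))

def sigma2_hat(X, n1, rho1, rho2):
    """Equation (3.2): tr(X' X Psi^{-1}) / (n k), Psi block diagonal with known rho's."""
    k, n = X.shape
    Psi = np.block([
        [equicorr(n1, rho1), np.zeros((n1, n - n1))],
        [np.zeros((n - n1, n1)), equicorr(n - n1, rho2)],
    ])
    return np.trace(X.T @ X @ np.linalg.inv(Psi)) / (n * k)

def sigma2_hat_uncorrelated(X):
    k, n = X.shape
    return np.trace(X.T @ X) / (n * k)
```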
Chen and Gupta extend the matrix normal distribution to accommodate skewness in the
following form ([17]):

f_1(X, Σ ⊗ Ψ, b) = c_1^* φ_{k×n}(X; Σ ⊗ Ψ) Φ_n(X′b, Ψ)   (3.3)

where c_1^* = (Φ_n(0, (1 + b′Σb)Ψ))^{−1}. A drawback of this definition is that it allows
independence only over its rows or columns, but not both. Harrar ([39]) gives two more
definitions for the matrix variate skew normal density:

f_2(X, Σ, Ψ, b, Ω) = c_2^* φ_{k×n}(X; Σ ⊗ Ψ) Φ_n(X′b, Ω)   (3.4)

and

f_3(X, Σ, Ψ, b, B) = c_3^* φ_{k×n}(X; Σ ⊗ Ψ) Φ(tr(B′X))   (3.5)

where c_2^* = (Φ_n(0, Ω + (b′Σb)Ψ))^{−1} and c_3^* = 2; Σ, Ψ, and Ω are positive definite
covariance matrices of dimensions k, n, and n respectively, and B is a matrix of dimension
k × n. Note that if Ω = Ψ then f_2 is the same as f_1. Although more general than f_1,
the densities f_2 and f_3 still do not permit independence of rows and columns
simultaneously.
A very general definition of a skew symmetric variable for the matrix case can be obtained
from matrix variate selection models. Suppose X is a k × n random matrix with density
f(X), and let g(X) be a weight function. A weighted form of the density f(X) is given by

h(X) = f(X)g(X) / ∫_{R^{k×n}} g(X)f(X) dX.   (3.6)

When the sample is only a subset of the population, the associated model is called a
selection model. In their recent article, Domínguez-Molina, González-Farías,
Ramos-Quiroga, and Gupta introduced the matrix variable closed skew normal
distribution in this form ([8]).

In this chapter, a construction for a family of matrix variate skew-symmetric densities
that allows for independence among both variables and individuals is studied. We also
consider certain extensions and the relation to the matrix variable closed skew normal
distribution.
3.2 Matrix Variate Skew Symmetric Distribution

To define a matrix variate distribution from the multivariate skew symmetric distribution,
first assume that z_i ∼ ss^{g,H}_k(0_k, I_k, α_i) for i = 1, 2, . . . , n are independently
distributed random variables. Write Z = (z_1, z_2, . . . , z_n). We can write the density of Z
as a product:

∏_{i=1}^{n} 2^k g^*(z_i) ∏_{j=1}^{k} H(α_{ji} e_j′ z_i).

This is equal to

2^{nk} g^{**}(Z) ∏_{i=1}^{n} ∏_{j=1}^{k} H(α_{ji} e_j′ Z c_i).

Let A and B be nonsingular symmetric matrices of order k and n respectively, and let M
be a k × n matrix. Define the matrix variate skew symmetric variable as X = AZB + M.

Definition 3.2.1. (Matrix Variate Skew-Symmetric Density) Let g(.) be a density
function symmetric about 0, and H(.) an absolutely continuous cumulative distribution
function with H′(.) symmetric around 0. A variable X has a matrix variate skew
symmetric distribution if it has probability density function

(2^{nk}/(|A|^n |B|^k)) g^{**}(A^{−1}(X − M)B^{−1}) ∏_{i=1}^{n} ∏_{j=1}^{k} H(α_{ji} e_j′(A^{−1}(X − M)B^{−1})c_i)   (3.7)

where the α_{ji} are real scalars, M ∈ R^{k×n}, and A and B are nonsingular symmetric
matrices of order k and n respectively. Here g^{**}(X) = ∏_{i=1}^{n} ∏_{j=1}^{k} g(x_{ji}). The
density is called the matrix variate skew-symmetric density with location parameter M,
scale parameters (A, B), and shape parameter ∆ = (α_{ji}), and it is denoted by
mss^{g,H}_{k×n}(M, A, B, ∆).
Let Z ∼ mss^{g,H}_{k×n}(0_{k×n}, I_k, I_n, ∆). The moment generating function of Z evaluated
at T_{k×n} ∈ R^{k×n} is M_Z(T_{k×n}) and can be obtained as follows:

M_Z(T_{k×n}) = E(etr(T_{k×n}′Z))
= ∫_{R^{k×n}} etr(T_{k×n}′Z) 2^{nk} g^{**}(Z) ∏_{i=1}^{n} ∏_{j=1}^{k} H(α_{ji} e_j′ Z c_i) dZ
= E_{g^{**}(Z)}(etr(T_{k×n}′Z) ∏_{i=1}^{n} ∏_{j=1}^{k} H(α_{ji} e_j′ Z c_i)).

Let X = AZB + M for a constant (k × k) matrix A, an (n × n) matrix B, and a k × n
constant matrix M. The moment generating function of X evaluated at T_{k×n} is
M_X(T_{k×n}):

M_X(T_{k×n}) = etr(T_{k×n}′M) M_Z(A′T_{k×n}B′)
= etr(T_{k×n}′M) E_{g^{**}(Z)}(etr((BT_{k×n}′A)Z) ∏_{i=1}^{n} ∏_{j=1}^{k} H(α_{ji} e_j′ Z c_i)).
Definition 3.2.2. (Matrix Variate Skew Symmetric Random Variable) Let g(.) be a
density function symmetric about 0, and H(.) an absolutely continuous cumulative
distribution function with H′(.) symmetric around 0. Let
z_{ij} ∼ f(z_{ij}, α_{ji}) = 2g(z_{ij})H(α_{ji} z_{ij}) for i = 1, 2, . . . , n and j = 1, 2, . . . , k be
independent variables. Then the matrix variate random variable Z = (z_{ij}) has density
2^{nk} g^{**}(Z) ∏_{i=1}^{n} ∏_{j=1}^{k} H(α_{ji} e_j′ Z c_i), where g^{**}(Z) = ∏_{i=1}^{n} ∏_{j=1}^{k} g(z_{ij}),
and the e_j and c_i are the elementary vectors of the coordinate systems R^k and R^n
respectively. Let A be a k × k constant matrix, B an n × n constant matrix, and M a
k × n constant matrix. The random variable X = AZB + M is distributed according to
the matrix variate skew symmetric distribution with location parameter M, scale
parameters (A, B), and shape parameter ∆ = (α_{ji}). We denote this by
X ∼ MSS^{g,H}_{k×n}(M, A, B, ∆).
3.3 Matrix Variate Skew-Normal Distribution

Definition 3.3.1. (Matrix Variate Skew Normal Density). We call the density

f(X, M, A, B, ∆) = (2^{kn}/(|A|^n |B|^k)) φ_{k×n}(A^{−1}(X − M)B^{−1})
∏_{j=1}^{k} ∏_{i=1}^{n} Φ(α_{ji} e_j′(A^{−1}(X − M)B^{−1})c_i)   (3.8)

the density of a matrix variate skew normal variable with location parameter M, scale
parameters (A, B), and shape parameter ∆. We denote it by msn_{k×n}(M, A, B, ∆).
Let Z ∼ msn_{k×n}(0_{k×n}, I_k, I_n, ∆). Then the moment generating function of Z
evaluated at T_{k×n} is M_Z(T_{k×n}). Writing t_i = T_{k×n}c_i and z_i for the i-th columns of
T_{k×n} and Z, it can be obtained as follows:

M_Z(T_{k×n}) = E_{φ_{k×n}(Z)}(etr(T_{k×n}′Z) ∏_{i=1}^{n} ∏_{j=1}^{k} Φ(α_{ji} e_j′ Z c_i))
= ∏_{i=1}^{n} (2^k/(2π)^{k/2}) ∫_{R^k} e^{−(1/2)z_i′z_i + t_i′z_i} ∏_{j=1}^{k} Φ(α_{ji} e_j′ z_i) dz_i
= ∏_{i=1}^{n} (2^k e^{(1/2)t_i′t_i}/(2π)^{k/2}) ∫_{R^k} e^{−(1/2)(z_i − t_i)′(z_i − t_i)} ∏_{j=1}^{k} Φ(α_{ji} e_j′ z_i) dz_i
= ∏_{i=1}^{n} 2^k e^{(1/2)t_i′t_i} ∏_{j=1}^{k} ∫_R (1/√(2π)) e^{−(1/2)(z_{ij} − (T)_{ij})^2} Φ(α_{ji} z_{ij}) dz_{ij}
= ∏_{i=1}^{n} 2^k e^{(1/2)t_i′t_i} ∏_{j=1}^{k} ∫_R (1/√(2π)) e^{−(1/2)y_{ij}^2} Φ(α_{ji} y_{ij} + α_{ji}(T)_{ij}) dy_{ij}
= ∏_{i=1}^{n} 2^k e^{(1/2)t_i′t_i} ∏_{j=1}^{k} Φ(α_{ji}(T)_{ij}/√(1 + α_{ji}^2))
= 2^{nk} etr((1/2) T_{k×n}′T_{k×n}) ∏_{i=1}^{n} ∏_{j=1}^{k} Φ(α_{ji}(T)_{ij}/√(1 + α_{ji}^2)).

Let X = AZB + M for a constant (k × k) matrix A, an (n × n) matrix B, and a k × n
constant matrix M. Then the moment generating function of X evaluated at
T_{k×n} ∈ R^{k×n} is

M_X(T_{k×n}) = 2^{nk} etr(T_{k×n}′M + (1/2)(A′T_{k×n}B′)′A′T_{k×n}B′)
∏_{i=1}^{n} ∏_{j=1}^{k} Φ(α_{ji}(A′T_{k×n}B′)_{ij}/√(1 + α_{ji}^2)).
Hence the following definition and theorems.
Definition 3.3.2. (Matrix Variate Skew Normal Random Variable) Let

z_{ij} ∼ 2φ(z_{ij})Φ(α_{ji} z_{ij})

for i = 1, 2, . . . , n and j = 1, 2, . . . , k be independent univariate skew normal random
variables. Then the matrix variate random variable Z = (z_{ij}) has density

2^{nk} φ_{k×n}(Z) ∏_{i=1}^{n} ∏_{j=1}^{k} Φ(α_{ji} e_j′ Z c_i),

where φ_{k×n}(Z) = ∏_{i=1}^{n} ∏_{j=1}^{k} φ(z_{ij}), and the e_j and c_i are the elementary
vectors of the coordinate systems R^k and R^n respectively. Let A be a k × k constant
matrix, B an n × n constant matrix, and M a k × n constant matrix. The random
variable X = AZB + M is distributed according to the matrix variate skew normal
distribution with location parameter M, scale parameters (A, B), and shape parameter
∆ = (α_{ji}). We denote this by X ∼ MSN_{k×n}(M, A, B, ∆). If the density exists, it is
given in Equation 3.8; we denote this case by writing X ∼ msn_{k×n}(M, A, B, ∆).
Theorem 3.3.1. If X has the matrix variate skew-normal distribution
MSN_{k×n}(M, A, B, ∆), then the moment generating function of X evaluated at T_{k×n}
is given by

M_X(T_{k×n}) = 2^{nk} etr(T_{k×n}′M + (1/2)(A′T_{k×n}B′)′A′T_{k×n}B′)
× ∏_{i=1}^{n} ∏_{j=1}^{k} Φ(α_{ji}(A′T_{k×n}B′)_{ij}/√(1 + α_{ji}^2)).   (3.9)
By Definition 3.3.2 we can write Z ∼ MSN_{k×n}(0, I_k, I_n, ∆), and prove the following
theorems.

Theorem 3.3.2. Assume that Y ∼ MSN_{k×n}(M, A, B, ∆) and X = CY D + N. Then
X ∼ MSN_{k×n}(CMD + N, CA, BD, ∆).

Proof. By assumption Y = AZB + M, and so X = CAZBD + (CMD + N), i.e.,
X ∼ MSN_{k×n}(CMD + N, CA, BD, ∆).

Theorem 3.3.3. Let x_1, x_2, . . . , x_n be independent, where x_i is distributed according
to sn_k(0_k, Σ^{1/2}, α). Then

Σ_{j=1}^{n} x_j′Σ^{−1}x_j ∼ χ^2_{kn}.

Proof. Let y ∼ N_k(µ = 0, Σ). Then y′Σ^{−1}y ∼ χ^2_k, and x_j′Σ^{−1}x_j and y′Σ^{−1}y have
the same distribution by Theorem 2.2.2. Moreover, the x_j′Σ^{−1}x_j are independent. The
desired property then follows from the addition property of the χ^2 distribution.
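Theorem 3.3.3 can be probed by simulation: generate x_i ∼ sn_k(0_k, Σ^{1/2}, α) using the representation of Theorem 2.3.8 and compare the simulated mean of Σ_j x_j′Σ^{−1}x_j with the χ^2_{kn} mean kn. A sketch, assuming NumPy (all names are ours):

```python
import numpy as np

rng = np.random.default_rng(1)

def r_sn1(alpha, size, rng):
    # univariate skew normal via Theorem 2.3.8 (u, v iid N(0,1))
    u, v = rng.standard_normal(size), rng.standard_normal(size)
    return np.where(v <= alpha * u, u, -u)

k, n, reps = 3, 5, 20000
alpha = np.array([2.0, -1.0, 5.0])
A = np.array([[1.0, 0.0, 0.0], [0.5, 1.0, 0.0], [0.2, 0.3, 1.0]])  # Sigma^{1/2}
Sigma_inv = np.linalg.inv(A @ A.T)

q = np.empty(reps)
for r in range(reps):
    Z = np.stack([r_sn1(a, n, rng) for a in alpha])   # k x n, columns are the z_i
    X = A @ Z                                          # x_i = A z_i
    q[r] = np.einsum("ji,jl,li->", X, Sigma_inv, X)    # sum_i x_i' Sigma^{-1} x_i

print(q.mean(), "vs", k * n)   # simulated mean should be close to kn = 15
```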
It is well known that if X ∼ φ_{k×n}(0_{k×n}, AA′, Ψ = I_n), then the matrix variable XX′
has the Wishart distribution, with moment generating function |I − 2(AA′)T|^{−n/2} for
(AA′)^{−1} − 2T a positive definite matrix. The following theorem implies that the
decomposition of a Wishart matrix is not unique.

Theorem 3.3.4. If a k × n matrix variate random variable X has the
msn_{k×n}(0_{k×n}, A, I_n, ∆) distribution for a constant positive definite matrix A of
order k and ∆ = α1_n′, then XX′ ∼ W_k(n, AA′).
Proof. The moment generating function of the quadratic form XX′ can be obtained as
follows, for any T ∈ R^{k×k} with (AA′)^{−1} − 2T a positive definite matrix:

E(etr(XX′T)) = ∫_{R^{k×n}} etr(XX′T) dF_X
= ∫_{R^{k×n}} (2^{nk}/((2π)^{nk/2}|A|^n)) etr(−(1/2)(AA′)^{−1}XX′ + XX′T) ∏_{i=1}^{n} ∏_{j=1}^{k} Φ(α_{ji} e_j′A^{−1}Xc_i) dX
= ∫_{R^{k×n}} (2^{nk}/((2π)^{nk/2}|A|^n)) etr(−(1/2)X′((AA′)^{−1} − 2T)X) ∏_{i=1}^{n} ∏_{j=1}^{k} Φ(α_{ji} e_j′A^{−1}Xc_i) dX
= ∫_{R^{k×n}} (2^{nk}/((2π)^{nk/2}|I − 2(AA′)T|^{n/2})) etr(−(1/2)Z′Z) ∏_{i=1}^{n} ∏_{j=1}^{k} Φ(α_{ji} e_j′A^{−1}((AA′)^{−1} − 2T)^{−1/2}Zc_i) dZ
= (2^k E_z(∏_{j=1}^{k} Φ(α_j e_j′A^{−1}((AA′)^{−1} − 2T)^{−1/2}z)))^n |I − 2(AA′)T|^{−n/2}
= |I − 2(AA′)T|^{−n/2},

where the fourth equality uses the substitution X = ((AA′)^{−1} − 2T)^{−1/2}Z and the
fifth uses ∆ = α1_n′, so that the columns contribute identical factors.
3.4 Generalized Matrix Variate Skew Normal Distribution

Definition 3.4.1. (Generalized Matrix Variate Skew Normal Density). We call the
density

f(X, M, A, B, ∆) = (2^{kn}/(|A|^n |B|^k)) φ_{k×n}(A^{−1}(X − M)B^{−1})
∏_{i=1}^{n} ∏_{j=1}^{k} H(α_{ji} e_j′(A^{−1}(X − M)B^{−1})c_i)   (3.10)

the density of a generalized matrix variate skew normal variable with location parameter
M, scale parameters (A, B), and shape parameter ∆. We denote it by
gmsn^H_{k×n}(M, A, B, ∆).
Let Z ∼ gmsn^H_{k×n}(0_{k×n}, I_k, I_n, ∆). The moment generating function of Z evaluated
at T_{k×n} is M_Z(T_{k×n}); writing t_i = T_{k×n}c_i and z_i for the i-th columns of T_{k×n}
and Z, it can be obtained as follows:

M_Z(T_{k×n}) = E_{φ_{k×n}(Z)}(etr(T_{k×n}′Z) ∏_{i=1}^{n} ∏_{j=1}^{k} H(α_{ji} e_j′ Z c_i))
= ∏_{i=1}^{n} (2^k/(2π)^{k/2}) ∫_{R^k} e^{−(1/2)z_i′z_i + t_i′z_i} ∏_{j=1}^{k} H(α_{ji} e_j′ z_i) dz_i
= ∏_{i=1}^{n} (2^k e^{(1/2)t_i′t_i}/(2π)^{k/2}) ∫_{R^k} e^{−(1/2)(z_i − t_i)′(z_i − t_i)} ∏_{j=1}^{k} H(α_{ji} e_j′ z_i) dz_i
= ∏_{i=1}^{n} 2^k e^{(1/2)t_i′t_i} ∏_{j=1}^{k} ∫_R (1/√(2π)) e^{−(1/2)(z_{ij} − (T)_{ij})^2} H(α_{ji} z_{ij}) dz_{ij}
= ∏_{i=1}^{n} 2^k e^{(1/2)t_i′t_i} ∏_{j=1}^{k} ∫_R (1/√(2π)) e^{−(1/2)y_{ij}^2} H(α_{ji} y_{ij} + α_{ji}(T)_{ij}) dy_{ij}
= 2^{nk} etr((1/2) T_{k×n}′T_{k×n}) ∏_{i=1}^{n} ∏_{j=1}^{k} E_{φ(y)}(H(α_{ji} y + α_{ji}(T)_{ij})).

Let X = AZB + M for a constant (k × k) matrix A, an (n × n) matrix B, and a k × n
constant matrix M. Then the moment generating function of X evaluated at
T_{k×n} ∈ R^{k×n} is

M_X(T_{k×n}) = 2^{nk} etr(T_{k×n}′M + (1/2)(A′T_{k×n}B′)′A′T_{k×n}B′)
∏_{i=1}^{n} ∏_{j=1}^{k} E_{φ(y)}(H(α_{ji} y + α_{ji}(A′T_{k×n}B′)_{ij})).

Hence the following definition and theorems.
Definition 3.4.2. (Matrix Variate Generalized Skew Normal Random Variable) Let

z_{ij} ∼ 2φ(z_{ij})H(α_{ji} z_{ij})

for i = 1, 2, . . . , n and j = 1, 2, . . . , k be independent univariate generalized skew
normal random variables. The matrix variate random variable Z = (z_{ij}) has density

2^{nk} φ_{k×n}(Z) ∏_{i=1}^{n} ∏_{j=1}^{k} H(α_{ji} e_j′ Z c_i),

where φ_{k×n}(Z) = ∏_{i=1}^{n} ∏_{j=1}^{k} φ(z_{ij}), and the e_j and c_i are the elementary
vectors of the coordinate systems R^k and R^n respectively. Let A be a k × k constant
matrix, B an n × n constant matrix, and M a k × n constant matrix. The random
variable X = AZB + M is distributed according to the generalized matrix variate skew
normal distribution with location parameter M, scale parameters (A, B), and shape
parameter ∆ = (α_{ji}). We denote this by X ∼ GMSN^H_{k×n}(M, A, B, ∆). If the density
exists, it is given in Equation 3.10; we denote this case by writing
X ∼ gmsn^H_{k×n}(M, A, B, ∆).
Theorem 3.4.1. If X has the generalized matrix variate skew-normal distribution
GMSN^H_{k×n}(M, A, B, ∆), then the moment generating function of X evaluated at
T_{k×n} is given by

M_X(T_{k×n}) = 2^{nk} etr(T_{k×n}′M + (1/2)(A′T_{k×n}B′)′A′T_{k×n}B′)
× ∏_{i=1}^{n} ∏_{j=1}^{k} E_{φ(y)}(H(α_{ji} y + α_{ji}(A′T_{k×n}B′)_{ij})).   (3.11)

By Definition 3.4.2 we can write Z ∼ GMSN^H_{k×n}(0, I_k, I_n, ∆), and prove the following
theorems.
Theorem 3.4.2. Assume that Y ∼ GMSN^H_{k×n}(M, A, B, ∆) and X = CY D + N.
Then X ∼ GMSN^H_{k×n}(CMD + N, CA, BD, ∆).

Proof. By assumption Y = AZB + M, and so X = CAZBD + (CMD + N), i.e.,
X ∼ GMSN^H_{k×n}(CMD + N, CA, BD, ∆).

Theorem 3.4.3. Let x_1, x_2, . . . , x_n be independent, where x_i is distributed according
to gsn^H_k(0_k, Σ^{1/2}, α). Then

Σ_{j=1}^{n} x_j′Σ^{−1}x_j ∼ χ^2_{kn}.

Proof. Let y ∼ N_k(µ = 0, Σ). Then y′Σ^{−1}y ∼ χ^2_k, and x_j′Σ^{−1}x_j and y′Σ^{−1}y have
the same distribution by Theorem 2.2.2. Moreover, the x_j′Σ^{−1}x_j are independent. The
desired property then follows from the addition property of the χ^2 distribution.
3.5 Extensions

As in Chapter 2, an odd function, say w(x), can be used to replace the terms of the form
α_{ji}x in the skewing function to give more flexible families of densities. As before, we
can take w(x) = λ_1 x/√(1 + λ_2 x^2) to obtain a matrix variate form of the skew
symmetric family introduced by Arellano-Valle et al. ([7]); take w(x) = αx + βx^3 to
obtain a matrix variate form of the skew symmetric family introduced by Ma and Genton
([45]); or take w(x) = sign(x)|x|^{α/2} λ (2/α)^{1/2} to obtain a matrix variate form of the
skew symmetric family introduced by DiCiccio and Monti ([23]).

Also note that, by the definition of the matrix variate weighted distribution in Equation
3.6, the skewing function of the matrix variate skew normal density in Equation 3.8, i.e.

∏_{j=1}^{k} ∏_{i=1}^{n} Φ(α_{ji} e_j′(A^{−1}(X − M)B^{−1})c_i),

can be replaced by

Φ_{k^*×n^*}(ΓA^{−1}(Z − M)B^{−1}Λ),

where Γ and Λ are matrices of dimensions k^* × k and n × n^* respectively. In this case,
the normalizing constant 2^{kn} will have to be changed to

E_Z(P(Y < ΓA^{−1}(Z − M)B^{−1}Λ | Z))

for Y ∼ φ_{k^*×n^*}(0_{k^*×n^*}, I_{k^*}, I_{n^*}) and Z ∼ φ_{k×n}(M, AA′, B′B). This extension
of the matrix variate skew normal density is similar to the matrix variable closed skew
normal distribution ([8]). Also, if k^* = k, n^* = n, and both Γ and Λ are diagonal such
that α_{ji} = Γ_{jj}Λ_{ii}, we return to the family introduced in Definition 3.4.1.
CHAPTER 4

Estimation and Some Applications

Pewsey discusses some of the problems related to inference about the parameters of the
skew normal distribution ([52]). The method of moments approach fails for samples
whose sample skewness index is outside the admissibility range of the skew normal
model. Also, the maximum likelihood estimator of the shape parameter α may take
infinite values with positive probability. To deal with this problem, certain methods have
been proposed. Azzalini and Capitanio recommend that maximization of the
log-likelihood function be stopped when the likelihood reaches a value not significantly
lower than the maximum ([12]). Sartori uses bias prevention of maximum likelihood
estimates for scalar skew normal and skew-t distributions and obtains finite valued
estimators ([57]). Bayesian estimation of the shape parameter is studied by Liseo and
Loperfido ([44]). A minimum χ^2 estimation method and an asymptotically equivalent
maximum likelihood method are proposed by Monti ([46]). Chen and Gupta discuss
goodness of fit procedures for the skew normal distribution ([33]).

Maximum product of spacings (MPS) estimation was independently introduced by Cheng
and Amin ([18]) and Ranneby ([53]). It is a general method for estimating parameters in
univariate continuous distributions and is known to give consistent and asymptotically
efficient estimates under general conditions. In this chapter, we consider estimation of
the parameters of the sn_k(α, µ, Σ_c^{1/2}) distribution introduced in Chapter 2, using the
MPS procedure. We also show that the MPS estimation procedure can be extended to
give a class of bounded statistical model selection criteria which is suitable even for cases
where Akaike's and other likelihood based model selection criteria do not exist.
4.1 Maximum Product of Spacings (MPS) Estimation

Consider the common statistical inference problem of estimating a parameter θ from
observed data x_1, . . . , x_n, where the x_i's are mutually independent observations from a
distribution function F_{θ_0}(x). The distribution function F_θ(x) is assumed to belong to
a family F = {F_θ(x) : θ ∈ Θ} of mutually different distribution functions, where the
parameter θ may be vector-valued.

Let x_{(1)}, x_{(2)}, . . . , x_{(n)} be the ordered random sample from F_{θ_0}(x). The ML
method arises from maximization of the average log density,

l(θ; X) = (1/n) Σ_{i=1}^{n} log f_θ(x_{(i)}) = (1/n) Σ_{i=1}^{n} l_i(θ).   (4.1)

Let the maximum of this be denoted by l(θ̂). The MPS method arises from maximization
of the average log spacing,

S(θ; X) = (1/(n+1)) Σ_{i=1}^{n+1} s_i(θ) = (1/(n+1)) Σ_{i=1}^{n+1} log(F_θ(x_{(i)}) − F_θ(x_{(i−1)})),   (4.2)

with respect to θ, where it is assumed that F_θ(x_{(0)}) = 0 and F_θ(x_{(n+1)}) = 1. Let the
maximum of this function be denoted by S(θ̆). Under very general conditions, Cheng and
Amin ([18]) prove that the estimators produced in this way are consistent, asymptotically
normal, and asymptotically efficient, since the ML and MPS methods are asymptotically
equivalent. For the details of this proof and the necessary conditions see ([18]).
An intuitive comparison of the log likelihood with the log product of spacings is helpful
in understanding the similarities and differences. By the mean value theorem,

log(s_{(i)}) = log(∫_{x_{(i−1)}}^{x_{(i)}} f_θ(x) dx) = l_i(θ) + log(x_{(i)} − x_{(i−1)}) + R(x_{(i−1)}, x_{(i)}, θ)   (4.3)

for each i. Here R(x_{(i−1)}, x_{(i)}, θ) is essentially of order O(x_{(i)} − x_{(i−1)}). Now let
the notation y = O_p(g(n)) mean that, given ε > 0, constants K and N can be found,
depending possibly on θ_0, such that P(|y| < Kg(n)) > 1 − ε for all n > N. Similarly,
let y = o_p(g(n)) mean that, given ε > 0 and δ > 0, an N can be found, depending
possibly on θ_0, such that P(|y| < δg(n)) > 1 − ε for all n > N. Under standard
regularity assumptions, the following result holds ([22]): given any ε > 0, there are
K = K(ε, θ_0) and N such that sup_i(x_{(i)} − x_{(i−1)}) < K log(n)/n with probability
1 − ε for all n ≥ N. Therefore, the difference between l(θ; X) and S(θ; X) is
approximately Σ_{i=1}^{n+1} log(x_{(i)} − x_{(i−1)})/(n + 1), which is O_p(log(n)/n).
Consequently, MPS and ML estimation are asymptotically equivalent and have the same
asymptotic sufficiency, efficiency, and consistency properties. However, since S(θ; X) is
always bounded above by log(1/(n + 1)), the MPS approach gives consistent estimators
even when MLE fails.
The following theorem is proved under mild conditions in Cheng and Stephens ([19]); it
justifies the use of the MPS approach through its relation to the likelihood and the Moran
statistic.

Theorem 4.1.1. Let θ̆ be the MPS estimator of θ. Then the MPS estimator θ̆ exists in
probability, and the following hold:

1. n^{1/2}(θ̆ − θ_0) ∼ N(0, (−E[∂^2 l(θ)/∂θ^2 |_{θ_0}])^{−1}) asymptotically.

2. If θ̂ is the maximum likelihood estimator, then θ̆ − θ̂ = o_p(n^{−1/2}).

3. Also

S(θ̆) = S(θ_0) + (1/(2n)) Q + o_p(n^{−1}),

where Q is distributed as χ^2_k.
The ML and MPS methods can both be seen as maximizing objective functions that
approximate the Kullback-Leibler divergence ([24]). Ranneby ([53]) gives approximations
of information measures other than the Kullback-Leibler information. In Ranneby and
Ekström ([54]), a class of estimation methods, called generalized maximum spacing
(GMPS) methods, is derived from approximations, based on simple spacings, of so-called
ϕ-divergences. Ghosh and Jammalamadaka ([28]) show in a simulation study that GMPS
methods other than the MPS method can perform better in terms of mean square error.
Ekström ([24]) gives strong consistency theorems for GMPS estimators.

There are two disadvantages to MPS estimation. First, the MPS procedure fails if there
are ties in the sample. Although ties occur with probability zero for continuous densities,
this may become an issue when data are grouped or rounded. Titterington ([59]) and
Ekström and Ranneby ([54]) give methods to deal with ties in data using higher order
spacings. The second disadvantage is that the optimization underlying the MPS approach
usually requires computational methods for its solution.
4.2 A Bounded Information Criterion for Statistical Model Selection

A fundamental problem in statistical analysis is the choice of an appropriate model. The
area of statistical model identification, model evaluation, or model selection deals with
choosing the best approximating model among a set of competing models to describe a
given data set. In a series of papers, Akaike ([2] and [3]) develops the field of statistical
data modeling from the principle of Boltzmann's ([15]) generalized entropy, or the
Kullback-Leibler ([42]) divergence. As a measure of the goodness of fit of an estimated
statistical model among several competing models, Akaike proposes the so-called Akaike
information criterion (AIC). AIC provides a trade-off between the precision and the
complexity of a model; given a data set, competing models may be ranked according to
their AIC, and the model with the smallest AIC is considered the best approximating
model.

The Kullback-Leibler divergence between the true density f_{θ_0}(x) and its estimate
f_θ(x) is given by

KLD(θ_0, θ) = ∫ f_{θ_0}(x) log(f_{θ_0}(x)/f_θ(x)) dx.   (4.4)
n
X
fθ (x(i) )
\ 0 , θ) = 1
KLD(θ
log( 0
).
n i=1
fθ (x(i) )
(4.5)
Minimization of the expression in (4.5) with respect to θ is equivalent to maximization
of the log likelihood, in Equation 4.6. The maximum log likelihood method can thus
be used to estimate the values of parameters. However, it cannot be used to compare
different models without some corrections because it is biased. An unbiased estimate
of −2 times the mean expected maximum value of log likelihood yields the famous
89
AIC:
b + 2k/n.
AIC = −2l(θ)
(4.6)
For small sample sizes, the penalty term penalty term of AIC, i.e. 2k/n, is usually
not sufficient and may be replaced in search for improvement. For example the Bayes
information criterion (BIC) of Schwarz ([58]) is obtained for 2k log(n)/n, HannanQuinn ([38]) information criterion is obtained for 2k log(log(n))/n. Many other model
selection criteria are obtained by similar adjustments to the penalty term. However,
all these different criteria share the same problem: There are many cases for which
the likelihood function is unbounded as a function of the parameters (for example see
([21])), therefore in such situations likelihood based inference and model selection is
not appropriate. Next, we develop a model selection criterion which overcomes the
unbounded likelihood problem based on the MPS estimation.
In Akaike ([2]), AIC is recommended as an approximation to the Kullback-Leibler
divergence between the true density and its estimate ([2]). The connection of S(θ, X)
to the Kullback-Leibler divergence and minimum information principle is studied by
Ranneby [53] and Lind ([43]). Ranneby introduces MPS from an approximation to the
Kullback-Leibler divergence [53]. Under mild assumptions the following holds
b < sup (−γ − KLD(θo , θ) + ǫ)) → 1
P (S(θ)
|{z}
θ
as n → ∞ for every ǫ > 0 where γ ∼
= 0.57722 is the Euler’s constant.
In Cheng and Stephens ([19]), the relation of the MPS method to Moran's goodness of fit statistic ([47]) is studied. Moran's statistic is the negative of $S(\theta, X)$, and its distribution evaluated at the true value of the parameter does not depend on the particular distribution function for which it is calculated. The percentage points of Moran's statistic are given by Cheng and Thornton ([20]).
Cheng and Stephens ([19]) extend Moran's test ([48]) to the situation where parameters have to be estimated from sample data. Under mild assumptions, when $\hat{\theta}$ is an efficient estimate of $\theta$ (this may be the maximum likelihood estimate, or the value of $\theta$ that maximizes $S(\theta; X)$), the following holds asymptotically:
$$S(\hat{\theta}) = S(\theta_0; X) + \frac{1}{2n} Q + o_p(n^{-1}) \quad (4.7)$$
where $Q$ is a $\chi^2_k$ random variable with expected value $k$. As $n \to \infty$, the expected value of the statistic $-S(\hat{\theta}) + \frac{k}{2n}$ does not depend on the particular distribution function for which it is calculated.
Because of the above reasons, we propose the statistic
$$MPSIC = -S(\hat{\theta}) + \frac{k}{2n} \quad (4.8)$$
as a model selection criterion and refer to it as the maximum product of spacings information criterion (MPSIC) from here on. Note that the penalty term in (4.8) can be replaced with $\frac{k\log(n)}{2n}$ or $\frac{k\log(\log(n))}{2n}$ to obtain other criteria of this kind; let us call these variants $MPSIC_2$ and $MPSIC_3$.
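A minimal sketch of (4.8) and its variants follows. It assumes the definition of $S(\theta; X)$ from Section 4.1 as the mean of the logarithms of the $(n+1)$-scaled spacings of the fitted cdf, the scaling consistent with the Euler constant limit quoted above; `cdf_vals` denotes the fitted cdf evaluated at the observations.

```python
import numpy as np

def mps_objective(cdf_vals):
    """S(theta; X): mean log of the (n+1)-scaled spacings of the fitted
    cdf, with F = 0 and F = 1 appended at the ends.  Ties in the data
    produce zero spacings and send the objective to -infinity."""
    n = len(cdf_vals)
    u = np.concatenate(([0.0], np.sort(cdf_vals), [1.0]))
    return np.mean(np.log((n + 1) * np.diff(u)))

def mpsic(cdf_vals, k, variant=1):
    """The MPSIC family of (4.8): -S(theta_hat) plus a penalty term."""
    n = len(cdf_vals)
    penalty = {1: k / (2 * n),
               2: k * np.log(n) / (2 * n),
               3: k * np.log(np.log(n)) / (2 * n)}[variant]
    return -mps_objective(cdf_vals) + penalty
```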
In Section 4.1, we argued that $S(\theta_0; X)$ should be close to the log likelihood in (4.1) when the latter exists; however, the main difference between $S(\theta_0; X)$ and the log likelihood in (4.1) is the term $\sum_{i=1}^{n+1} \log(x_{(i)} - x_{(i-1)})/(n+1)$, whose distribution depends on the unknown true parameter value $\theta_0$. Because of this term, the distribution of the log likelihood depends on the unknown true parameter value and is therefore not suitable for model identification purposes.
As mentioned earlier, in the skew normal model the likelihood function is often unbounded. Therefore, estimation in the skew normal model leaves a lot of room for testing MPSIC as a model selection criterion. In Table 4.1, we display the results of the following experiment: we generate $n$ observations from $sn_1(\alpha, 0, 1)$, and then we compare the fit of this model to the standard normal and half normal models with $MPSIC$, $MPSIC_2$, and $MPSIC_3$. We record the percentage of times the skew normal model is selected by these model selection criteria.
  n    α    MPSIC   MPSIC2   MPSIC3
 20    0    0.272    0.056    0.251
 50    0    0.265    0.030    0.190
100    0    0.294    0.030    0.182
 20    2    0.961    0.953    0.961
 50    2    1.000    1.000    1.000
100    2    1.000    1.000    1.000
 20    5    0.806    0.777    0.803
 50    5    0.960    0.959    0.960
100    5    0.999    0.999    0.999
 20   10    0.536    0.483    0.529
 50   10    0.842    0.822    0.835
100   10    0.972    0.966    0.972

Table 4.1: MPSIC, MPSIC2, and MPSIC3 are compared for samples of size 20, 50 and 100 for different levels of α.
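The following sketch outlines the experiment behind Table 4.1, reusing `mps_objective` and `mpsic` from the previous sketch. The replication count, the search bounds for $\alpha$, and the treatment of the standard normal and half normal competitors as having no free parameters ($k = 0$) are assumptions of mine.

```python
import numpy as np
from scipy.stats import skewnorm, halfnorm
from scipy.optimize import minimize_scalar

def mps_fit_alpha(x):
    """MPS estimate of alpha for sn1(alpha, 0, 1), by bounded search."""
    res = minimize_scalar(lambda a: -mps_objective(skewnorm.cdf(x, a)),
                          bounds=(-50.0, 50.0), method="bounded")
    return res.x

rng = np.random.default_rng(1)
reps, n, alpha = 200, 50, 2.0
wins = 0
for _ in range(reps):
    x = skewnorm.rvs(alpha, size=n, random_state=rng)
    crit_sn = mpsic(skewnorm.cdf(x, mps_fit_alpha(x)), k=1)
    crit_norm = mpsic(skewnorm.cdf(x, 0.0), k=0)  # standard normal
    crit_half = mpsic(halfnorm.cdf(x), k=0)       # half normal: zero mass
    wins += crit_sn <= min(crit_norm, crit_half)  # below 0 penalizes it
print("skew normal selected:", wins / reps)
```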
4.3 Estimation of parameters of $sn_k(\alpha, \mu, \Sigma)$
We need the following result in this section: let $x_{(1)}, x_{(2)}, \ldots, x_{(n)}$ be the ordered sample from $sn_1(\alpha, 0, 1)$, and let the cdf of this density be denoted by $F_\alpha(\cdot)$. Then
$$\frac{\partial}{\partial\alpha} S(\alpha; X) = -\frac{\sqrt{\tfrac{2}{\pi}}\, \phi\!\left(x_{(1)}\sqrt{\alpha^2+1}\right)}{F_\alpha(x_{(1)})(\alpha^2 + 1)} + \frac{\sqrt{\tfrac{2}{\pi}}\, \phi\!\left(x_{(n)}\sqrt{\alpha^2+1}\right)}{\left(1 - F_\alpha(x_{(n)})\right)(\alpha^2 + 1)} + \sum_{i=2}^{n} \frac{\sqrt{\tfrac{2}{\pi}}\left(\phi\!\left(x_{(i-1)}\sqrt{\alpha^2+1}\right) - \phi\!\left(x_{(i)}\sqrt{\alpha^2+1}\right)\right)}{\left(F_\alpha(x_{(i)}) - F_\alpha(x_{(i-1)})\right)(\alpha^2 + 1)}. \quad (4.9)$$
4.3.1 Case 1: MPS estimation of $sn_1(\alpha, 0, 1)$

Assume $X \sim sn_{k \times n}(\mu 1_k', \Sigma_c^{1/2}, I_k, \alpha 1_k')$. When $\mu$ and $\Sigma_c^{1/2}$ are known, we can write $z_i = \Sigma_c^{-1/2}(x_i - \mu)$ and estimate the components $\alpha_j$ of $\alpha$ by using the corresponding i.i.d. $sn_1(\alpha_j, 0, 1)$ random variables $z_{j1}, z_{j2}, \ldots, z_{jn}$.
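A sketch of this estimation step, built directly on the gradient formula (4.9); each term uses the identity $\partial F_\alpha(x)/\partial\alpha = -\sqrt{2/\pi}\,\phi(x\sqrt{1+\alpha^2})/(1+\alpha^2)$. The root bracket below is a guess: for samples where the gradient has no sign change (the estimate drifting toward the half normal limit), `brentq` raises an error, and a bounded maximization of $S$ is the natural fallback.

```python
import numpy as np
from scipy.stats import norm, skewnorm
from scipy.optimize import brentq

def dS_dalpha(alpha, x):
    """The derivative (4.9) of the MPS objective for sn1(alpha, 0, 1)."""
    xs = np.sort(x)
    c = np.sqrt(2.0 / np.pi) / (1.0 + alpha ** 2)
    F = skewnorm.cdf(xs, alpha)
    phi = norm.pdf(xs * np.sqrt(1.0 + alpha ** 2))
    mid = np.sum((phi[:-1] - phi[1:]) / (F[1:] - F[:-1]))
    return c * (-phi[0] / F[0] + phi[-1] / (1.0 - F[-1]) + mid)

# Illustration: solve dS/dalpha = 0 on an assumed bracket.
x = skewnorm.rvs(5, size=50, random_state=3)
alpha_hat = brentq(dS_dalpha, -0.5, 60.0, args=(x,))
print(alpha_hat)
```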
Example 4.3.1. The "frontier data", 50 observations generated from a $sn_1(0, 1, \alpha = 5)$ distribution and reported in Azzalini and Capitanio ([12]), is an example for which the usual estimation methods, such as the method of moments or maximum likelihood, fail. Figure 4.1 gives the histogram, the kernel density estimate, and the density curves for the $sn_1(0, 1, \alpha = 5)$ and $sn_1(0, 1, \hat{\alpha} = 4.9632)$ distributions. For the frontier data, the Bayesian estimate recommended by Liseo and Loperfido ([44]) gives $\hat{\alpha} = 2.12$, and the estimate through the bias prevention method of Sartori ([57]) gives $\hat{\alpha} = 6.243$. Moreover, the MPSIC for the skew normal model is 4.3999, while the MPSIC for the half normal model is 5.0495. Therefore, MPSIC identifies the correct model.
For any value of the shape parameter $\alpha$, the absolute value of a skew normal variable has the half normal distribution. If we take the absolute values of the frontier data, the appropriate model should therefore be the half normal model rather than a skew normal model with a large shape parameter. This is supported by MPSIC: the MPS estimate of $\alpha$ is 38.3906, and the MPSIC for this model is 4.4136, while the MPSIC for the half normal model is 4.4057.
Table 4.2 gives summary statistics for the distribution of the MPS estimator for different values of $n$ and $\alpha$, obtained from 5000 simulations. These results can be compared with the simulation results for the ML estimator in Table 4.3.
Although MPS estimation provides an improvement over ML estimation, the highly skewed distribution of the estimator, seen in Figure 4.2, points to a tendency in some samples to give extremely large estimates of the shape parameter. Since the density $sn_1(y, \alpha)$ approaches the half normal density as $\alpha \to \pm\infty$, the question arises whether the skew normal model is appropriate for the data at all.
Figure 4.1: The histogram, kernel density estimate (blue), $sn_1(0, 1, \alpha = 5)$ curve (red), and $sn_1(0, 1, \hat{\alpha} = 4.9681)$ curve (black) for the "frontier data".
  n    α   Mean(α̂)    std(α̂)   Median(α̂)   MAD(α̂)
 20    0    0.0028    0.2702      0.0045    0.1578
 50    0    0.0040    0.1572      0.0094    0.1048
100    0    0.0057    0.1193      0.0065    0.0849
 20    2    2.9409    7.9874      1.6648    0.4516
 50    2    1.9662    1.1007      1.8120    0.2981
100    2    1.9363    0.3932      1.8913    0.2446
 20    5   10.2334   19.1683      3.6313    1.1751
 50    5    5.5383    6.7330      4.2047    0.9611
100    5    4.7936    1.4561      4.4462    0.7202
 20   10   21.5252   29.5315      6.2980    2.6961
 50   10   11.4908   15.1492      7.3934    1.9900
100   10   10.0525    7.0268      8.3873    1.8476

Table 4.2: MPS estimation of α: mean, standard deviation, median and mean absolute deviation of α̂ from 5000 simulations.
  n    α   Mean(α̂)    std(α̂)   Median(α̂)   MAD(α̂)
 20    0    0.0108    0.3469      0.0105    0.1969
 50    0    0.0110    0.1827      0.0156    0.1142
100    0    0.0017    0.1308      0.0010    0.0895
 20    2    3.3901    6.5156      2.2049    0.6616
 50    2    2.2586    0.8532      2.1192    0.3817
100    2    2.1468    0.4518      2.0744    0.2737
 20    5   16.6467   25.4334      5.7100    2.2305
 50    5    7.2979    7.8039      5.5378    1.4243
100    5    5.7956    3.5357      5.2114    0.8775
 20   10   30.9972   32.6160     12.0381    6.5854
 50   10   16.0803   16.9440     10.8762    3.5005
100   10   12.8935    9.2488     10.4143    2.3308

Table 4.3: ML estimation of α: mean, standard deviation, median and mean absolute deviation of α̂ from 5000 simulations.
When $\hat{\alpha}$ is large, it may be better to use the half normal model, which has one less parameter, for the sake of model parsimony. This can always be checked by comparing the half normal and skew normal models with one of the model selection criteria $MPSIC$, $MPSIC_2$, or $MPSIC_3$.
4.3.2 Case 2: Estimation of $\mu$, $\Sigma_c^{1/2}$ given $\alpha$

For $X \sim sn_{k \times n}(\mu 1_k', \Sigma_c^{1/2}, I_k, \alpha 1_k')$, when $\alpha$ is known, we can use the method of moments to estimate the parameters $\mu$ and $\Sigma_c^{1/2}$. Note that
$$E(x) = \sqrt{\frac{2}{\pi}}\, \Sigma_c^{1/2} \left(\frac{\alpha_1}{\sqrt{1+\alpha_1^2}}, \frac{\alpha_2}{\sqrt{1+\alpha_2^2}}, \ldots, \frac{\alpha_k}{\sqrt{1+\alpha_k^2}}\right)' + \mu = \Sigma_c^{1/2} \mu_0(\alpha) + \mu,$$
$$Cov(x) = \Sigma_c^{1/2} \left(I_k - \frac{2}{\pi}\, \mathrm{diag}\!\left(\frac{\alpha_1^2}{1+\alpha_1^2}, \frac{\alpha_2^2}{1+\alpha_2^2}, \ldots, \frac{\alpha_k^2}{1+\alpha_k^2}\right)\right) \Sigma_c^{1/2\,\prime} = \Sigma_c^{1/2} \Sigma_0(\alpha) \Sigma_c^{1/2\,\prime}.$$
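These moment formulas can be checked numerically; the sketch below assumes, as in Case 1, the representation $x = \mu + \Sigma_c^{1/2} z$ with independent $sn_1(\alpha_j, 0, 1)$ components $z_j$, and the parameter values are arbitrary.

```python
import numpy as np
from scipy.stats import skewnorm

rng = np.random.default_rng(0)
alpha = np.array([5.0, -3.0])
Sc = np.array([[1.0, 0.0], [0.5, 2.0]])   # lower triangular Sigma_c^{1/2}
mu = np.array([0.0, 0.0])

# x = mu + Sc z with z_j i.i.d. sn1(alpha_j, 0, 1)
z = np.column_stack([skewnorm.rvs(a, size=200000, random_state=rng)
                     for a in alpha])
x = z @ Sc.T + mu

d = alpha / np.sqrt(1.0 + alpha ** 2)
mu0 = np.sqrt(2.0 / np.pi) * d                        # mu_0(alpha)
Sigma0 = np.eye(2) - (2.0 / np.pi) * np.diag(d ** 2)  # Sigma_0(alpha)

print(x.mean(axis=0), Sc @ mu0 + mu)      # E(x) vs the formula
print(np.cov(x.T), Sc @ Sigma0 @ Sc.T)    # Cov(x) vs the formula
```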
Figure 4.2: Histogram of 5000 simulated values of $\hat{\alpha}$ by MPS. The MPS estimator is highly skewed, suggesting that the half normal model might be more suitable for cases where extreme values of $\hat{\alpha}$ are observed.
Therefore, given $\alpha$, we can estimate $\Sigma$ and $\mu$ by solving the moment equations
$$S = \frac{1}{n} \sum_{j=1}^{n} (x_j - \bar{x})(x_j - \bar{x})' = \Sigma_c^{1/2} \Sigma_0(\alpha) \Sigma_c^{1/2\,\prime}, \quad (4.10)$$
$$\bar{x} = \frac{1}{n} \sum_{j=1}^{n} x_j = \Sigma_c^{1/2} \mu_0(\alpha) + \mu. \quad (4.11)$$
The solution to the above equations is
$$\widehat{\Sigma_c^{1/2}} = S_c^{1/2} \left(I_k - \frac{2}{\pi}\, \mathrm{diag}\!\left(\frac{\alpha_1^2}{1+\alpha_1^2}, \frac{\alpha_2^2}{1+\alpha_2^2}, \ldots, \frac{\alpha_k^2}{1+\alpha_k^2}\right)\right)^{-1/2}, \quad (4.12)$$
$$\hat{\mu} = \bar{x} - S_c^{1/2} \left(I_k - \mathrm{diag}\!\left(\mu(\alpha_1)^2, \mu(\alpha_2)^2, \ldots, \mu(\alpha_k)^2\right)\right)^{-1/2} \sqrt{\frac{2}{\pi}} \left(\frac{\alpha_1}{\sqrt{1+\alpha_1^2}}, \frac{\alpha_2}{\sqrt{1+\alpha_2^2}}, \ldots, \frac{\alpha_k}{\sqrt{1+\alpha_k^2}}\right)'. \quad (4.13)$$
Note that $(\widehat{\Sigma_c^{1/2}}, \hat{\mu})$ is an equivariant estimator of $(\Sigma_c^{1/2}, \mu)$ under the group of affine transformations obtained by scaling with a lower triangular matrix with positive diagonal elements.
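A sketch of (4.12)-(4.13), under the convention suggested by the equivariance remark that $S_c^{1/2}$ is the lower triangular Cholesky factor of the sample covariance matrix; the function name and the data layout (rows as observations) are mine.

```python
import numpy as np

def moment_estimates(X, alpha):
    """Method of moments for (mu, Sigma_c^{1/2}) given alpha, per
    (4.12)-(4.13); X is an n-by-k data matrix."""
    alpha = np.asarray(alpha, dtype=float)
    xbar = X.mean(axis=0)
    S = np.cov(X.T, bias=True)        # (1/n) sum (x_j - xbar)(x_j - xbar)'
    Sc = np.linalg.cholesky(S)        # S_c^{1/2}, lower triangular
    d2 = alpha ** 2 / (1.0 + alpha ** 2)
    Sc_hat = Sc @ np.diag(1.0 / np.sqrt(1.0 - (2.0 / np.pi) * d2))  # (4.12)
    mu0 = np.sqrt(2.0 / np.pi) * alpha / np.sqrt(1.0 + alpha ** 2)
    mu_hat = xbar - Sc_hat @ mu0                                    # (4.13)
    return mu_hat, Sc_hat
```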
4.3.3 Case 3: Simultaneous estimation of $\mu$, $\Sigma_c^{1/2}$ and $\alpha$

Given an initial guess of $\alpha$, the following estimation procedure gives reasonably good results (a sketch of this iteration is given after the list).

(a) Start with an initial estimate for $\alpha$, say $\alpha_0$.

(b) Given an estimate of $\alpha$, estimate $\Sigma_c^{1/2}$ and $\mu$ by (4.12) and (4.13).

(c) Given estimates of $\mu$ and $\Sigma_c^{1/2}$, estimate $\alpha$ by MPS.

(d) Repeat steps (b)-(c) until convergence.
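A sketch of this loop, reusing `moment_estimates` from the Case 2 sketch and `mps_objective` from Section 4.2; the fixed iteration count stands in for a genuine convergence test, and the bounds on the components of $\alpha$ are assumptions.

```python
import numpy as np
from scipy.stats import skewnorm
from scipy.optimize import minimize_scalar

def fit_sn(X, alpha0, iters=20):
    """Alternate the moment equations (4.12)-(4.13) with componentwise
    MPS estimation of alpha, starting from the initial guess alpha0."""
    alpha = np.asarray(alpha0, dtype=float)
    for _ in range(iters):
        mu, Sc = moment_estimates(X, alpha)        # step (b)
        Z = np.linalg.solve(Sc, (X - mu).T).T      # z_i = Sc^{-1}(x_i - mu)
        alpha = np.array([minimize_scalar(         # step (c), per component
            lambda a, j=j: -mps_objective(skewnorm.cdf(Z[:, j], a)),
            bounds=(-50.0, 50.0), method="bounded").x
            for j in range(X.shape[1])])
    return mu, Sc, alpha
```

Componentwise MPS in step (c) is justified because, after standardization by $\Sigma_c^{-1/2}$, the coordinates are independent $sn_1(\alpha_j, 0, 1)$ variables, as in Case 1.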
First, we apply this technique to univariate data.
Example 4.3.2. Gomez et al. ([29]) consider 1150 height measurements taken at 1 millisecond intervals along the drum of a roller (i.e., parallel to the axis of the roller). The data are part of a study on the surface roughness of rollers. Since there are ties in the sample, we adjusted the data for ties before the analysis. We apply the procedure described above; Figure 4.3 summarizes the results.
Figure 4.3: The histogram, kernel density estimate (blue), and estimated density curve $sn_1(\mu = 4.2865, \sigma^{1/2} = 0.9938, \alpha = -2.9928)$ (black) for the 1150 height measurements.
Table 4.4 gives summary statistics for the distributions of the estimators of $\alpha$, $\mu = 0$ and $\sigma = 1$ for different values of $n$ and $\alpha$.
  n    α   mean α̂   std α̂   median α̂   MAD α̂   mean µ̂    std µ̂   mean σ̂   std σ̂
 30  1.5   2.6013   2.1136     1.9277   1.4755   -0.0722   0.2303    1.0560   0.2003
 50  1.5   2.0914   1.2355     1.7920   0.8614   -0.0536   0.1971    1.0432   0.1634
100  1.5   1.8096   0.6817     1.6584   0.5292   -0.0372   0.1443    1.0228   0.1157
 30  3.0   4.0288   2.7718     3.2239   2.0730    0.0229   0.1848    0.9836   0.1840
 50  3.0   3.7638   2.0545     3.2340   1.5469    0.0072   0.1426    0.9934   0.1509
100  3.0   3.5131   1.5431     3.1676   1.1380    0.0012   0.1090    0.9981   0.1090

Table 4.4: Summary statistics for the distributions of the estimators of α, µ = 0 and σ = 1 for different values of n and α.
Example 4.3.3. We generate 100 observations from the $sn_2(\mu = (0, 0)', \Sigma_c^{1/2} = \begin{pmatrix} 1 & 0 \\ .5 & 2 \end{pmatrix}, \alpha = (5, -3)')$ distribution. We estimate the parameters with the procedure described above, with initial estimate $\alpha_0 = (10, -10)'$ for $\alpha$. Figure 4.4 gives the scatter plot and the contour plots of the true and estimated densities.
Example 4.3.4. Australian Athletes Data: these data were collected on 202 athletes and are discussed in Azzalini ([11]). We take the variables LBM and BMI and fit a bivariate skew normal model with the procedure described above, with initial estimate $\alpha_0 = (10, 10)'$ for $\alpha$. Figure 4.5 gives the scatter plot and the contour plot of the estimated density.
Example 4.3.5. Australian Athletes Data 2: again using the 202 athletes data of Azzalini ([11]), we take the variables LBM and SSF and fit a bivariate skew normal model with the procedure described above, with initial estimate $\alpha_0 = (10, 10)'$ for $\alpha$. Figure 4.6 gives the scatter plot and the contour plot of the estimated density.
Figure 4.4: The scatter plot of the 100 generated observations (black dots), contours of $sn_2(\mu = (0, 0)', \Sigma_c^{1/2} = \begin{pmatrix} 1 & 0 \\ .5 & 2 \end{pmatrix}, \alpha = (5, -3)')$ (blue), and contours of the estimated density $sn_2(\mu = (0.0795, 0.0579)', \Sigma_c^{1/2} = \begin{pmatrix} 0.9404 & 0 \\ 0.5845 & 1.1092 \end{pmatrix}, \alpha = (5.6825, -3.6566)')$ (red).
Figure 4.5: The scatter plot for LBM and BMI in the Australian Athletes Data, with contours of the estimated density $sn_2(\mu = (19.7807, 48.8092)', \Sigma_c^{1/2} = \begin{pmatrix} 4.2760 & 0 \\ 13.9306 & 10.7933 \end{pmatrix}, \alpha = (2.5429, 0.8070)')$ (red).
Figure 4.6: The scatter plot for LBM and SSF in the Australian Athletes Data, with contours of the estimated density $sn_2(\mu = (19.7807, 17.3347)', \Sigma_c^{1/2} = \begin{pmatrix} 4.2760 & 0 \\ 15.6127 & 50.5834 \end{pmatrix}, \alpha = (2.5429, 8.6482)')$ (red).
n = 10:   mean(α̂) = (3.9836, 3.7503)′,  cov(α̂) = ( 17.6933  -0.1927 ; -0.1927  16.2678 ),
          mean(µ̂) = (0.8143, 0.7769)′,  cov(µ̂) = ( 0.2056  0.0879 ; 0.0879  0.2981 )
n = 20:   mean(α̂) = (2.6878, 2.5243)′,  cov(α̂) = ( 16.1777  -0.7700 ; -0.7700  12.7421 ),
          mean(µ̂) = (0.8812, 0.8466)′,  cov(µ̂) = ( 0.1227  0.0597 ; 0.0597  0.1670 )
n = 30:   mean(α̂) = (2.0305, 1.7223)′,  cov(α̂) = ( 8.9830  -0.3011 ; -0.3011  3.9441 ),
          mean(µ̂) = (0.9149, 0.9044)′,  cov(µ̂) = ( 0.0854  0.0424 ; 0.0424  0.1235 )
n = 50:   mean(α̂) = (1.2905, 1.3733)′,  cov(α̂) = ( 1.1753  -0.0115 ; -0.0115  1.2555 ),
          mean(µ̂) = (0.9553, 0.9345)′,  cov(µ̂) = ( 0.0644  0.0347 ; 0.0347  0.0859 )
n = 100:  mean(α̂) = (1.1840, 1.1238)′,  cov(α̂) = ( 0.4377  -0.0227 ; -0.0227  0.3385 ),
          mean(µ̂) = (0.9506, 0.9537)′,  cov(µ̂) = ( 0.0354  0.0178 ; 0.0178  0.0500 )
n = 200:  mean(α̂) = (1.0341, 1.0575)′,  cov(α̂) = ( 0.1308  0.0064 ; 0.0064  0.1485 ),
          mean(µ̂) = (0.9796, 0.9630)′,  cov(µ̂) = ( 0.0187  0.0093 ; 0.0093  0.0261 )

Table 4.5: The simulation results for $\hat{\mu}$ and $\hat{\alpha}$ in the estimation of $sn_2(\mu = (1, 1)', \Sigma_c^{1/2} = \begin{pmatrix} 1 & 0 \\ .5 & 1 \end{pmatrix}, \alpha = (1, 1)')$.
n = 10:   mean(vec(Σ̂c^{1/2})) = (1.1266, 0.5627, 1.0634)′,
          cov = ( 0.1179  0.0460  -0.0015 ; 0.0460  0.1867  0.0108 ; -0.0015  0.0108  0.1107 )
n = 20:   mean(vec(Σ̂c^{1/2})) = (1.0797, 0.5229, 1.0560)′,
          cov = ( 0.0656  0.0349  -0.0042 ; 0.0349  0.0841  0.0024 ; -0.0042  0.0024  0.0621 )
n = 30:   mean(vec(Σ̂c^{1/2})) = (1.0582, 0.5038, 1.0338)′,
          cov = ( 0.0440  0.0226  -0.0002 ; 0.0226  0.0514  0.0018 ; -0.0002  0.0018  0.0396 )
n = 50:   mean(vec(Σ̂c^{1/2})) = (1.0319, 0.5201, 1.0236)′,
          cov = ( 0.0293  0.0147  0.0029 ; 0.0147  0.0300  0.0014 ; 0.0029  0.0014  0.0290 )
n = 100:  mean(vec(Σ̂c^{1/2})) = (1.0268, 0.5168, 1.0162)′,
          cov = ( 0.0159  0.0078  0.0003 ; 0.0078  0.0147  0.0011 ; 0.0003  0.0011  0.0159 )
n = 200:  mean(vec(Σ̂c^{1/2})) = (1.0179, 0.5112, 1.0198)′,
          cov = ( 0.0085  0.0039  -0.0003 ; 0.0039  0.0068  0.0002 ; -0.0003  0.0002  0.0090 )

Table 4.6: The simulation results for $\widehat{\Sigma_c^{1/2}}$ in the estimation of $sn_2(\mu = (1, 1)', \Sigma_c^{1/2} = \begin{pmatrix} 1 & 0 \\ .5 & 1 \end{pmatrix}, \alpha = (1, 1)')$.
BIBLIOGRAPHY
[1] D.J. Aigner, C.A.K. Lovell, and P. Schmidt. Formulation and estimation of
stochastic frontier production function models. Rand Corp., 1976.
[2] H. Akaike. A new look at the statistical model identification. Automatic Control,
IEEE Transactions on, 19(6):716–723, 1974.
[3] H. Akaike. On entropy maximization principle. Applications of Statistics, 1(1):27–
41, 1977.
[4] T.W. Anderson. An Introduction to Multivariate Statistical Analysis. Wiley Series in Probability and Mathematical Statistics. Wiley-Interscience, 2003.
[5] R.B. Arellano-Valle, M.D. Branco, and M.G. Genton. A unified view on skewed
distributions arising from selections. Canadian Journal of Statistics, 34(4):581,
2006.
[6] R.B. Arellano-Valle and M.G. Genton. On fundamental skew distributions. Journal of Multivariate Analysis, 96(1):93–116, 2005.
[7] R.B. Arellano-Valle, H.W. Gomez, and F.A. Quintana. A new class of skew-normal
distributions. Communications in Statistics-Theory and Methods, 33(7):1465–
1480, 2004.
[8] J.A. Domínguez-Molina, G. González-Farías, R. Ramos-Quiroga, and A.K. Gupta. A matrix variate closed skew-normal distribution with applications to stochastic frontier analysis. Communications in Statistics - Theory and Methods, 36(9):1691–1703, 2007.
[9] B.C. Arnold and R.J. Beaver. The skew-Cauchy distribution. Statistics & Probability Letters, 49(3):285–290, SEP 15 2000.
[10] A. Azzalini. A class of distributions which includes the normal ones. Scandinavian
Journal of Statistics, 12:171–178, 1985.
[11] A. Azzalini. The skew-normal distribution and related multivariate families. Scandinavian Journal of Statistics, 32(2):159–188, 2005.
[12] A. Azzalini and A. Capitanio. Statistical applications of the multivariate skew
normal distribution. Journal of the Royal Statistical Society. Series B (Statistical
Methodology), 61(3):579–602, 1999.
[13] A. Azzalini and A. Capitanio. Distributions generated by perturbation of symmetry with emphasis on a multivariate skew t-distribution. Journal of the Royal
Statistical Society. Series B (Statistical Methodology), 65(2):367–389, 2003.
[14] L. Bailey, T.F. Moore, and B.A. Bailar. An interviewer variance study for the eight
impact cities of the national crime survey cities sample. Journal of the American
Statistical Association, 73(1):16–23, 1978.
[15] L. Boltzmann. On Zermelo's paper "On the mechanical explanation of irreversible processes". Annalen der Physik, 60:392–398, 1897.
[16] M.D. Branco and D.K. Dey. A general class of multivariate skew-elliptical distributions. Journal of Multivariate Analysis, 79(1):99–113, 2001.
[17] J.T. Chen and A.K. Gupta. Matrix variate skew normal distributions. Statistics,
39(3):247–253, 2005.
[18] R.C.H. Cheng and N.A.K. Amin. Estimating parameters in continuous univariate
distributions with a shifted origin. Journal of the Royal Statistical Society. Series
B. Methodological, 45(3):394–403, 1983.
[19] R.C.H. Cheng and M.A. Stephens. A goodness-of-fit test using Moran’s statistic
with estimated parameters. Biometrika, 76(2):385–392, 1989.
[20] R.C.H. Cheng and K.M. Thornton. Selected percentage points of the Moran
statistic. Journal of Statistical Computation and Simulation, 30(3):189–194, 1988.
[21] R.C.H. Cheng and L. Traylor. Non-regular maximum likelihood problems. Journal
of the Royal Statistical Society. Series B. Methodological, 57(1):3–44, 1995.
[22] D.A. Darling. On a class of problems related to the random division of an interval.
The Annals of Mathematical Statistics, 24:239–253, 1953.
[23] T.J. DiCiccio and A.C. Monti. Inferential aspects of the skew exponential power
distribution. Journal of American Statistical Association, 99(466):439–450, 2004.
[24] M. Ekström. Alternatives to maximum likelihood estimation based on spacings
and the Kullback–Leibler divergence. Journal of Statistical Planning and Inference, 138(6):1778–1791, 2008.
[25] K. Fang. Generalized Multivariate Analysis. Science Press, Beijing, 1990.
[26] R.A. Fisher. The effect of methods of ascertainment upon the estimation of frequencies. Annals of Eugenics, 6:13–25, 1934.
[27] M.G. Genton. Skew-Elliptical Distributions and Their Applications: A Journey
Beyond Normality. Chapman & Hall/CRC, 2004.
[28] K. Ghosh and S.R. Jammalamadaka. A general estimation method using spacings.
Journal of Statistical Planning and Inference, 93(1-2):71–82, 2001.
[29] H.W. Gomez, H.S. Salinas, and H. Bolfarine. Generalized skew-normal models:
properties and inference. Statistics, 40(6):495–505, 2006.
[30] H.W. Gomez, O. Venegas, and H. Bolfarine. Skew-symmetric distributions generated by the distribution function of the normal distribution. Environmetrics,
18(4):395–407, 2007.
[31] A.K. Gupta and F.C. Chang. Multivariate skew-symmetric distributions. Applied
Mathematics Letters, 16(5):643–646, 2003.
[32] A.K. Gupta and J.T. Chen. A class of multivariate skew-normal models. Annals
of the Institute of Statistical Mathematics, 56(2):305–315, 2004.
[33] A.K. Gupta and T. Chen. Goodness-of-fit tests for the skew-normal distribution.
Communications in Statistics-Simulation and Computation, 30(4):907–930, 2001.
[34] A.K. Gupta, G. González-Farı́as, and J.A. Domı́nguez-Molina. A multivariate
skew normal distribution. Journal of Multivariate Analysis, 89(1):181–190, 2004.
[35] A.K. Gupta and W.J. Huang. Quadratic forms in skew normal variates. Journal
of Mathematical Analysis and Applications, 273(2):558–564, 2002.
[36] A.K. Gupta and D.K. Nagar. Matrix Variate Distributions. Chapman & Hall/CRC, 2000.
[37] A.K. Gupta and T. Varga. Elliptically Contoured Models in Statistics. Kluwer
Series In Mathematics and Its Applications, 1993.
[38] E.J. Hannan and B.G. Quinn. The determination of the order of an autoregression.
Journal of the Royal Statistical Society, 41(2):190–195, 1979.
[39] S.W. Harrar and A.K. Gupta. On matrix variate skew-normal distributions. Statistics, 42(2):179–194, 2008.
[40] W.J. Huang and Y.H. Chen.
Quadratic forms of multivariate skew normal-
symmetric distributions. Statistics & Probability Letters, 76(9):871–879, 2006.
[41] M.C. Jones and M.J. Faddy. A skew extension of the t-distribution, with applications. Journal of the Royal Statistical Society. Series B (Statistical Methodology),
65(1):159–174, 2003.
[42] S. Kullback and R.A. Leibler. On information and sufficiency. Annals of Mathematical Statistics, 22(1):79–86, 1951.
[43] N.C. Lind. Information theory and maximum product of spacings estimation.
Journal of the Royal Statistical Society. Series B. Methodological, 56(2):341–343,
1994.
[44] B. Liseo and N. Loperfido. Default Bayesian analysis of the skew-normal distribution. Journal of Statistical Planning and Inference, 136(2):373–389, 2004.
[45] Y.Y. Ma and M.G. Genton. Flexible class of skew-symmetric distributions. Scandinavian Journal of Statistics, 31(3):459–468, 2004.
[46] A.C. Monti. A note on the estimation of the skew normal and the skew exponential
power distributions. Metron, 61(2):205–219, 2003.
[47] P.A.P. Moran. The random division of an interval. Supplement to the Journal of
the Royal Statistical Society, pages 92–98, 1947.
[48] P.A.P. Moran. The random division of an interval. Part II. Journal of the Royal
Statistical Society, 13(1):92–98, 1951.
[49] S. Nadarajah and S. Kotz. Skewed distributions generated by the normal kernel.
Statistics & Probability Letters, 65(3):269–277, 2003.
[50] S. Nadarajah and S. Kotz. Skew distributions generated from different families.
Acta Applicandae Mathematicae, 91(1):1–37, 2006.
[51] G.P. Patil, C.R. Rao, and M. Zelen. Weighted distributions. Encyclopedia of
Statistical Sciences, 9:565–571, 1988.
[52] A. Pewsey. Problems of inference for Azzalini’s skewnormal distribution. Journal
of Applied Statistics, 27(7):859–870, 2000.
[53] B. Ranneby. The maximum spacing method. An estimation method related to
the maximum likelihood method. Scandinavian Journal of Statistics, 11(2):93–
112, 1984.
[54] B. Ranneby and M. Ekström. Maximum spacing estimates based on different
metrics. Institute of Mathematical Statistics, Umea University, 1997.
[55] C. Roberts and S. Geisser. A necessary and sufficient condition for the square of
a random variable to be gamma. Biometrika, 53(1/2):275–278, 1966.
[56] S.K. Sahu, D.K. Dey, and M.D. Branco. A new class of multivariate skew distributions with applications to bayesian regression models. The Canadian Journal
of Statistics, 31(2):129–150, 2003.
[57] N. Sartori. Bias prevention of maximum likelihood estimates for scalar skew
normal and skew t distributions. Journal of Statistical Planning and Inference,
136(12):4259–4275, 2006.
[58] G. Schwarz. Estimating the dimension of a model. Annals of Statistics, 6(2):461–
464, 1978.
[59] D.M. Titterington. Comment on estimating parameters in continuous univariate
distributions. Journal of the Royal Statistical Society. Series B. Methodological,
47(1):115–116, 1985.
[60] J.Z. Wang, J. Boyer, and M.G. Genton. A skew-symmetric representation of
multivariate distributions. Statistica Sinica, 14(4):1259–1270, 2004.
[61] S. Zacks. Parametric Statistical Inference. Pergamon Press, New York, 1981.