An Object-Oriented Solution for
Robust Factor Analysis ∗
Ying-Ying Zhang
Chongqing University
Abstract
Taking advantage of the S4 class system of the programming environment R, which
facilitates the creation and maintenance of reusable and modular components, an
object-oriented solution for robust factor analysis is developed. The solution resides
in the package robustfa. In the solution, new S4 classes "Fa", "FaClassic", "FaRobust",
"FaCov", and "SummaryFa" are created. The robust factor analysis function FaCov() can
deal with the maximum likelihood estimation, principal component analysis, and principal
factor analysis methods. Considering the factor analysis methods (classical or robust),
the data input (data or the scaled data), and the running matrix (covariance or
correlation) together, there are 8 combinations. We recommend classical and robust
factor analysis using the correlation matrix of the scaled data as the running matrix,
for theoretical and computational reasons. The design of the solution follows common
statistical design patterns. The application of the solution to multivariate data
analysis is demonstrated on the hbk data and the stock611 data, which themselves are
parts of the package robustfa.
Keywords: robustness, factor analysis, object-oriented solution, R, statistical design patterns.
1. Introduction
Outliers exist in virtually every data set in any application domain. In order to avoid the
masking effect, robust estimators are needed. The classical estimators of multivariate location
and scatter are the sample mean x̄ and the sample covariance matrix S. These estimates are
optimal if the data come from a multivariate normal distribution but are extremely sensitive
to the presence of even a few outliers. If outliers are present in the input data, they will
influence the estimates x̄ and S and subsequently worsen the performance of the classical
factor analysis (Pison et al. 2003). Therefore it is important to consider robust alternatives
to these estimators. There are several robust estimators in the literature: MCD (Rousseeuw
1985; Rousseeuw and Driessen 1999), OGK (Maronna and Zamar 2002), MVE (Rousseeuw
1985), M (Rocke 1996), S (Davies 1987; Ruppert 1992; Woodruff and Rocke 1994; Rocke 1996;
He and Wang 1997; Salibian-Barrera and Yohai 2006) and Stahel-Donoho (Stahel 1981a,b;
Donoho 1982; Maronna and Yohai 1995). Substituting the classical location and scatter
estimates by their robust analogues is the most straightforward method for robustifying many
multivariate procedures (Maronna et al. 2006; Todorov and Filzmoser 2009), which is our
choice for robustifying the factor analysis procedure.
∗ The research was supported by Natural Science Foundation Project of CQ CSTC CSTC2011BB0058.
Taking advantage of the new S4 class system (Chambers 1998) of R (R Development Core
Team 2009), which facilitates the creation of reusable and modular components, an
object-oriented solution for robust factor analysis is implemented. The application of the
solution to multivariate data analysis is demonstrated on the hbk data and the stock611
data, which themselves are parts of the package robustfa (Zhang 2013). We follow the
object-oriented paradigm as applied to the R object model (naming conventions, access
methods, coexistence of S3 and S4 classes, usage of UML, etc.) described in Todorov and
Filzmoser (2009).
The rest of the paper is organized as follows. In Section 2 the design approach and
structure of the solution are given. Section 3 describes the robust factor analysis method,
some theoretical results, the computations, and the implementations. Sections 3.1, 3.2,
3.3, and 3.4 are dedicated to the object model, the method of robust factor analysis, an hbk
data example, and a stocks data example, respectively. Section 4 concludes.
2. Design approach and structure of the solution
We follow the route of Todorov and Filzmoser (2009). The main part of the solution is
implemented in the package robustfa, but it relies on code in the packages rrcov (Todorov
2013), robustbase (Rousseeuw et al. 2013), and pcaPP (Filzmoser et al. 2013). The structure
of the solution and its relation to other R packages are shown in Figure 1. The package
robustfa extends rrcov with options for dealing with robust factor analysis problems, as
the package robust (Wang et al. 2013) does.
Figure 1: Class diagram: structure of the solution and relation to other R packages.
3. Robust factor analysis
3.1. Object model
The object model for the S4 classes and methods implementing the robust factor analysis
follows the Unified Modeling Language (UML) (OMG 2009a,b) class diagram and is presented
in Figure 2. A class is denoted by a box with three compartments which contain the name,
the attributes (slots) and operations (methods) of the class, respectively. The class names,
Fa and FaRobust, in italics indicate that the classes are abstract. Each attribute is followed
by its type and each operation is followed by the type of its return value. We use R
types like numeric, vector, matrix, etc., but the type can also be the name of an S4 class
(Fa, FaClassic, or FaCov). The classes Ulogical, Unumeric, etc. are class unions for optional
slots, e.g., for the definition of slots which will be computed on demand. Relationships between
classes are denoted by arrows of different forms. The inheritance relationship is depicted by
a large empty triangular arrowhead pointing to the base class. We see that both FaClassic
and FaRobust inherit from Fa, and that FaCov inherits from FaRobust. Composition means that
one class contains another one as a slot. This relation is represented by an arrow with a solid
diamond on the side of the composed class. We see that SummaryFa is composed with Fa (it
contains an Fa object as a slot).
As in Todorov and Filzmoser (2009), all UML diagrams of the solution were created with the
open source UML tool ArgoUML (Robbins 1999; Robbins and Redmiles 2000), which is available
for download from http://argouml.tigris.org/. The naming conventions of the solution
follow the recommended Sun Java coding style (see http://java.sun.com/docs/codeconv/).
3.2. Factor analysis based on a robust covariance matrix
As in Todorov and Filzmoser (2009), the most straightforward and intuitive method to obtain
robust factor analysis is to replace the classical estimates of location and covariance by their
robust analogues. The package stats in base R contains the function factanal() which
performs a factor analysis on a given numeric data matrix and returns the results as an object
of S3 class factanal. This function has an argument covmat which can take a covariance
matrix, or a covariance list as returned by cov.wt, and if supplied, it is used rather than
the covariance matrix of the input data. This makes it possible to obtain robust factor analysis by
supplying the covariance matrix computed by cov.mve or cov.mcd from the package MASS.
The reason to include this type of function in the solution is the unification of the interfaces
by leveraging the object orientation provided by the S4 classes and methods. The function
FaCov() computes robust factor analysis by replacing the classical covariance matrix with
one of the robust covariance estimators available in the package rrcov—MCD, OGK, MVE,
M, S or Stahel-Donoho, i.e., the parameter cov.control can be any object of a class derived
from the base class CovControl. This control class will be used to compute a robust estimate
of the covariance matrix. If this parameter is omitted, MCD will be used by default. Of
course, any newly developed estimator following the concepts of the framework in Todorov
and Filzmoser (2009) can be used as input to the function FaCov().
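Because any CovControl-derived object can be plugged in, switching estimators is a
one-argument change. The following minimal sketch (an illustration, not package code;
it assumes the hbk data from robustbase and the hbk.x notation of Section 3.3) fits
the same model under three different estimators:

library(robustfa)
data(hbk, package = "robustbase")   # Hawkins-Bradu-Kass data
hbk.x <- hbk[, 1:3]                 # X-part only, as in Section 3.3
faMcd <- FaCov(x = hbk.x, factors = 2, method = "pca",
               cov.control = CovControlMcd())   # MCD (the default)
faOgk <- FaCov(x = hbk.x, factors = 2, method = "pca",
               cov.control = CovControlOgk())   # OGK
faCls <- FaCov(x = hbk.x, factors = 2, method = "pca",
               cov.control = NULL)              # classical covariance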
There are three methods to perform a factor analysis: maximum likelihood estimation (MLE),
principal component analysis (PCA), and principal factor analysis (PFA). The function
factanal() performs a factor analysis using MLE. The function FaCov() deals with all three
methods through an argument method = c("mle", "pca", "pfa"). When the data set is large,
factanal() usually generates errors, and thus one can only resort to the PCA and PFA
methods.

Figure 2: Object model for robust factor analysis.
When comparing the classical and robust factor analysis methods, there are two other issues
to consider: whether we should use data or scale(data) as the data input, and whether
we should use the covariance matrix or the correlation matrix as the running matrix (used
matrix). Considering the factor analysis methods, the data input, and the running matrix
together, there are 8 combinations, i.e.,

(1) classical, x = data, cor = FALSE
(2) classical, x = data, cor = TRUE
(3) classical, x = scale(data), cor = FALSE
(4) classical, x = scale(data), cor = TRUE
(5) robust, x = data, cor = FALSE
(6) robust, x = data, cor = TRUE
(7) robust, x = scale(data), cor = FALSE
(8) robust, x = scale(data), cor = TRUE

Here cor is a logical value indicating whether the calculation should use the covariance matrix
(cor = FALSE) or the correlation matrix (cor = TRUE).
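As a sketch (not part of the package; it assumes robustfa is attached and x holds the data
input), all 8 combinations can be generated in one loop; note that the row order of combos
below differs from the numbering (1)-(8) above:

combos <- expand.grid(robust = c(FALSE, TRUE),
                      scaled = c(FALSE, TRUE),
                      cor    = c(FALSE, TRUE))
fits <- lapply(seq_len(nrow(combos)), function(i) {
  FaCov(x = if (combos$scaled[i]) scale(x) else x,
        factors = 2, method = "pca", cor = combos$cor[i],
        cov.control = if (combos$robust[i]) CovControlMcd() else NULL)
})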
There are 4 classical and robust factor analysis comparisons, i.e., (1) vs (5), (2) vs (6), (3) vs
(7), and (4) vs (8). We recommend (4) vs (8). The reasons are as follows. First, when the
variables have different units, it is common to standardize the variables; the sample covariance
matrix of the standardized variables is the correlation matrix of the original variables. Thus
it is common to use the correlation matrix as the running matrix. Second, we need to explain
the factors from the loading matrix. The entries of the loading matrix from the sample
covariance matrix are not limited between 0 and 1, which makes the factors hard to
interpret. These first two reasons suggest choosing (2) vs (6) and (4) vs (8). However,
(2) and (4) ((6) and (8)) have the same running matrices, eigenvalues, loadings, uniquenesses,
scoring coefficients, scaled data matrices, and score matrices; see Theorem 3.2. That is, (2)
vs (6) and (4) vs (8) give us the same comparisons, and we can choose either pair for the
comparison. Third, we may not be able to compute the robust covariance matrix, and thus
the robust correlation matrix, of the original data, as the stocks data example will illustrate.
Consequently, we should choose (4) vs (8).
The running matrices of the 8 combinations are given in Table 1. In the table,
correlation = cov2cor(covariance).

Table 1: The running matrices.

                       data                             scale(data)
             covariance      correlation      covariance           correlation
Classical    (1) $S^c$       (2) $R^c$        (3) $\tilde{S}^c$    (4) $\tilde{R}^c$
Robust       (5) $S^r$       (6) $R^r$        (7) $\tilde{S}^r$    (8) $\tilde{R}^r$
Let $S = (s_{ij})_{p \times p}$ be a square matrix. Let $v = \mathrm{diag}(S) = (s_{11}, s_{22}, \cdots, s_{pp})^\top$ be a vector
holding the diagonal elements of $S$. Let
$$D = \mathrm{diag}(v) = \mathrm{diag}(\mathrm{diag}(S)) = \begin{pmatrix} s_{11} & & & \\ & s_{22} & & \\ & & \ddots & \\ & & & s_{pp} \end{pmatrix}$$
be a diagonal matrix whose diagonal is the vector $v$. In fact,
$$\mathrm{diag}(\text{matrix}) = \text{vector}, \qquad \mathrm{diag}(\text{vector}) = \text{diagonal matrix}.$$
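Both behaviors of diag() can be checked directly in R:

S <- matrix(c(4, 2, 2, 9), nrow = 2)  # a square matrix
v <- diag(S)                          # diag(matrix) = vector: c(4, 9)
D <- diag(v)                          # diag(vector) = diagonal matrix
Dinvhalf <- diag(1 / sqrt(v))         # D^(-1/2), used below to scale the data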
Let
$$X_{n \times p} = \begin{pmatrix} X_{(1)}^\top \\ X_{(2)}^\top \\ \vdots \\ X_{(n)}^\top \end{pmatrix}$$
be the original data matrix, where $X_{(k)} = (x_{k1}, x_{k2}, \cdots, x_{kp})^\top$ ($k = 1, 2, \cdots, n$) is the $k$th
observation and $n$ is the number of observations. Let $M$ be the number of regular observations
(excluding the outliers); without loss of generality, we can take $X_{(1)}, X_{(2)}, \cdots, X_{(M)}$ as
the regular observations. Let
$$\bar{X} = \frac{1}{n}\sum_{k=1}^{n} X_{(k)} = (\bar{x}_1, \bar{x}_2, \cdots, \bar{x}_p)^\top$$
be the sample mean of $X$, where
$$\bar{x}_j = \frac{1}{n}\sum_{k=1}^{n} x_{kj}, \quad j = 1, 2, \cdots, p.$$
Let
$$S = (s_{ij})_{p \times p} = \frac{1}{n-1}\sum_{k=1}^{n} \left(X_{(k)} - \bar{X}\right)\left(X_{(k)} - \bar{X}\right)^\top$$
be the sample covariance matrix, where
$$s_{ij} = \frac{1}{n-1}\sum_{k=1}^{n} (x_{ki} - \bar{x}_i)(x_{kj} - \bar{x}_j), \quad i, j = 1, 2, \cdots, p.$$
Let $X^*$ be the scaled matrix of $X$:
$$X^* = \begin{pmatrix} X_{(1)}^{*\top} \\ X_{(2)}^{*\top} \\ \vdots \\ X_{(n)}^{*\top} \end{pmatrix}
= \begin{pmatrix} \left(D^{-\frac{1}{2}}\left(X_{(1)} - \bar{X}\right)\right)^\top \\ \vdots \\ \left(D^{-\frac{1}{2}}\left(X_{(n)} - \bar{X}\right)\right)^\top \end{pmatrix}
= \begin{pmatrix} \left(X_{(1)} - \bar{X}\right)^\top D^{-\frac{1}{2}} \\ \vdots \\ \left(X_{(n)} - \bar{X}\right)^\top D^{-\frac{1}{2}} \end{pmatrix}
= \left(X - \mathbf{1}\bar{X}^\top\right) D^{-\frac{1}{2}},$$
where $\mathbf{1} = \mathbf{1}_{n \times 1} = (1, 1, \cdots, 1)^\top$,
$$D = \mathrm{diag}(\mathrm{diag}(S)) = \begin{pmatrix} s_{11} & & \\ & \ddots & \\ & & s_{pp} \end{pmatrix}, \qquad
D^{-\frac{1}{2}} = \begin{pmatrix} 1/\sqrt{s_{11}} & & \\ & \ddots & \\ & & 1/\sqrt{s_{pp}} \end{pmatrix},$$
and
$$X^*_{(k)} = D^{-\frac{1}{2}}\left(X_{(k)} - \bar{X}\right)
= \left(\frac{x_{k1} - \bar{x}_1}{\sqrt{s_{11}}}, \frac{x_{k2} - \bar{x}_2}{\sqrt{s_{22}}}, \cdots, \frac{x_{kp} - \bar{x}_p}{\sqrt{s_{pp}}}\right)^\top.$$
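As a quick numeric check (using an arbitrary numeric data matrix), the scaled matrix $X^*$
coincides with R's scale() up to floating point error:

x <- as.matrix(iris[, 1:4])                # any numeric data matrix
Dinvhalf <- diag(1 / sqrt(diag(cov(x))))   # D^(-1/2)
xStar <- sweep(x, 2, colMeans(x)) %*% Dinvhalf
max(abs(xStar - scale(x)))                 # essentially zero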
To compare the 8 combinations, we summarize some useful results in Tables 2 and 3. The
entries of Tables 2 and 3 labeled (a)-(h) are proved below the tables.
Table 2: Results of (1), (2), (5), and (6).

quantity                   $X$, classical (1), (2)                                                              $X$, robust (5), (6)
data matrix                $X = (X_{(1)}, X_{(2)}, \cdots, X_{(n)})^\top$                                       the same $X$
number of used obs.        $n$                                                                                  $M$
sample used                $X_{(1)}, X_{(2)}, \cdots, X_{(n)}$                                                  $X_{(1)}, X_{(2)}, \cdots, X_{(M)}$
$k$th observation          $X_{(k)}$                                                                            $X_{(k)}$
sample mean                $\bar{X}^c = \frac{1}{n}\sum_{k=1}^{n} X_{(k)}$                                      $\bar{X}^r = \frac{1}{M}\sum_{k=1}^{M} X_{(k)}$
sample covariance          $S^c = \frac{1}{n-1}\sum_{k=1}^{n} (X_{(k)}-\bar{X}^c)(X_{(k)}-\bar{X}^c)^\top$      $S^r = \frac{1}{M-1}\sum_{k=1}^{M} (X_{(k)}-\bar{X}^r)(X_{(k)}-\bar{X}^r)^\top$
sample correlation         $R^c$                                                                                $R^r$
$D$                        $D^c = \mathrm{diag}(\mathrm{diag}(S^c))$                                            $D^r = \mathrm{diag}(\mathrm{diag}(S^r))$
scaledX                    $X^{*c} = (X - \mathbf{1}\bar{X}^{c\top})(D^c)^{-1/2}$                               $X^{*r} = (X - \mathbf{1}\bar{X}^{r\top})(D^r)^{-1/2}$
$k$th scaled observation   $X^{*c}_{(k)} = (D^c)^{-1/2}(X_{(k)} - \bar{X}^c)$                                   $X^{*r}_{(k)} = (D^r)^{-1/2}(X_{(k)} - \bar{X}^r)$
mean of scaledX            $\bar{X}^{*c} = \frac{1}{n}\sum_{k=1}^{n} X^{*c}_{(k)} = 0$                          $\bar{X}^{*r} = \frac{1}{M}\sum_{k=1}^{M} X^{*r}_{(k)} = 0$  (a)
cov(scaledX)               $S^{*c} = R^c = (D^c)^{-1/2} S^c (D^c)^{-1/2}$                                       $S^{*r} = R^r = (D^r)^{-1/2} S^r (D^r)^{-1/2}$  (b)
Table 3: Results of (3), (4), (7), and (8). In both columns, the data matrix is
$Y = X^{*c} = \mathrm{scale}(X) = (X - \mathbf{1}\bar{X}^{c\top})(D^c)^{-1/2}$.

quantity                   $Y$, classical (3), (4)                                                                   $Y$, robust (7), (8)
number of used obs.        $n$                                                                                       $M$
sample used                $Y_{(1)}, Y_{(2)}, \cdots, Y_{(n)}$                                                       $Y_{(1)}, Y_{(2)}, \cdots, Y_{(M)}$
$k$th observation          $Y_{(k)} = (D^c)^{-1/2}(X_{(k)} - \bar{X}^c) = X^{*c}_{(k)}$                              the same $Y_{(k)}$
sample mean                $\bar{Y}^c = \frac{1}{n}\sum_{k=1}^{n} Y_{(k)} = 0$  (c)                                  $\bar{Y}^r = \frac{1}{M}\sum_{k=1}^{M} Y_{(k)} = (D^c)^{-1/2}(\bar{X}^r - \bar{X}^c)$  (f)
sample covariance          $\tilde{S}^c = \frac{1}{n-1}\sum_{k=1}^{n} (Y_{(k)}-\bar{Y}^c)(Y_{(k)}-\bar{Y}^c)^\top = \frac{1}{n-1}\sum_{k=1}^{n} Y_{(k)} Y_{(k)}^\top$      $\tilde{S}^r = \frac{1}{M-1}\sum_{k=1}^{M} (Y_{(k)}-\bar{Y}^r)(Y_{(k)}-\bar{Y}^r)^\top$
sample correlation         $\tilde{R}^c$                                                                             $\tilde{R}^r$
$D$                        $\tilde{D}^c = \mathrm{diag}(\mathrm{diag}(\tilde{S}^c)) = E$  (d)                        $\tilde{D}^r = \mathrm{diag}(\mathrm{diag}(\tilde{S}^r))$
scaledX                    $Y^{*c} = (Y - \mathbf{1}\bar{Y}^{c\top})(\tilde{D}^c)^{-1/2}$                            $Y^{*r} = (Y - \mathbf{1}\bar{Y}^{r\top})(\tilde{D}^r)^{-1/2}$
$k$th scaled observation   $Y^{*c}_{(k)} = (\tilde{D}^c)^{-1/2}(Y_{(k)} - \bar{Y}^c) = Y_{(k)}$  (e)                 $Y^{*r}_{(k)} = (\tilde{D}^r)^{-1/2}(Y_{(k)} - \bar{Y}^r)$
mean of scaledX            $\bar{Y}^{*c} = \frac{1}{n}\sum_{k=1}^{n} Y^{*c}_{(k)} = 0$                               $\bar{Y}^{*r} = \frac{1}{M}\sum_{k=1}^{M} Y^{*r}_{(k)} = 0$  (g)
cov(scaledX)               $\tilde{S}^{*c} = \tilde{R}^c = (\tilde{D}^c)^{-1/2}\tilde{S}^c(\tilde{D}^c)^{-1/2}$      $\tilde{S}^{*r} = \tilde{R}^r = (\tilde{D}^r)^{-1/2}\tilde{S}^r(\tilde{D}^r)^{-1/2}$  (h)
Proof. (a)
$$\bar{X}^{*r} = \frac{1}{M}\sum_{k=1}^{M} X^{*r}_{(k)} = \frac{1}{M}\sum_{k=1}^{M} (D^r)^{-\frac{1}{2}}\left(X_{(k)} - \bar{X}^r\right) = (D^r)^{-\frac{1}{2}}\,\frac{1}{M}\sum_{k=1}^{M}\left(X_{(k)} - \bar{X}^r\right) = (D^r)^{-\frac{1}{2}}\left(\bar{X}^r - \bar{X}^r\right) = 0.$$

(b)
$$S^{*r} = R^r = \frac{1}{M-1}\sum_{k=1}^{M}\left(X^{*r}_{(k)} - \bar{X}^{*r}\right)\left(X^{*r}_{(k)} - \bar{X}^{*r}\right)^\top = \frac{1}{M-1}\sum_{k=1}^{M} X^{*r}_{(k)} X^{*r\top}_{(k)}$$
$$= \frac{1}{M-1}\sum_{k=1}^{M} (D^r)^{-\frac{1}{2}}\left(X_{(k)} - \bar{X}^r\right)\left(X_{(k)} - \bar{X}^r\right)^\top (D^r)^{-\frac{1}{2}}
= (D^r)^{-\frac{1}{2}}\left[\frac{1}{M-1}\sum_{k=1}^{M}\left(X_{(k)} - \bar{X}^r\right)\left(X_{(k)} - \bar{X}^r\right)^\top\right](D^r)^{-\frac{1}{2}}
= (D^r)^{-\frac{1}{2}} S^r (D^r)^{-\frac{1}{2}}.$$

(c)
$$\bar{Y}^{c} = \frac{1}{n}\sum_{k=1}^{n} Y_{(k)} = \frac{1}{n}\sum_{k=1}^{n}(D^c)^{-\frac{1}{2}}\left(X_{(k)} - \bar{X}^c\right) = (D^c)^{-\frac{1}{2}}\left(\bar{X}^c - \bar{X}^c\right) = 0.$$

(d) By Theorem 3.1, we have $\tilde{S}^c = R^c$, thus
$$\tilde{D}^c = \mathrm{diag}\left(\mathrm{diag}\left(\tilde{S}^c\right)\right) = \mathrm{diag}\left(\mathrm{diag}\left(R^c\right)\right) = \mathrm{diag}\left((1, 1, \cdots, 1)\right) = E.$$

(e)
$$Y^{*c}_{(k)} = \left(\tilde{D}^c\right)^{-\frac{1}{2}}\left(Y_{(k)} - \bar{Y}^c\right) = E^{-\frac{1}{2}}\left(Y_{(k)} - 0\right) = Y_{(k)}.$$

(f)
$$\bar{Y}^{r} = \frac{1}{M}\sum_{k=1}^{M} Y_{(k)} = \frac{1}{M}\sum_{k=1}^{M}(D^c)^{-\frac{1}{2}}\left(X_{(k)} - \bar{X}^c\right) = (D^c)^{-\frac{1}{2}}\left(\bar{X}^r - \bar{X}^c\right).$$

(g)
$$\bar{Y}^{*r} = \frac{1}{M}\sum_{k=1}^{M} Y^{*r}_{(k)} = \frac{1}{M}\sum_{k=1}^{M}\left(\tilde{D}^r\right)^{-\frac{1}{2}}\left(Y_{(k)} - \bar{Y}^r\right) = \left(\tilde{D}^r\right)^{-\frac{1}{2}}\left(\bar{Y}^r - \bar{Y}^r\right) = 0.$$

(h)
$$\tilde{S}^{*r} = \tilde{R}^r = \frac{1}{M-1}\sum_{k=1}^{M}\left(Y^{*r}_{(k)} - \bar{Y}^{*r}\right)\left(Y^{*r}_{(k)} - \bar{Y}^{*r}\right)^\top = \frac{1}{M-1}\sum_{k=1}^{M} Y^{*r}_{(k)} Y^{*r\top}_{(k)}$$
$$= \left(\tilde{D}^r\right)^{-\frac{1}{2}}\left[\frac{1}{M-1}\sum_{k=1}^{M}\left(Y_{(k)} - \bar{Y}^r\right)\left(Y_{(k)} - \bar{Y}^r\right)^\top\right]\left(\tilde{D}^r\right)^{-\frac{1}{2}}
= \left(\tilde{D}^r\right)^{-\frac{1}{2}} \tilde{S}^r \left(\tilde{D}^r\right)^{-\frac{1}{2}}.$$
Define the multiplication of two vectors by
$$x \cdot y = (x_1, x_2, \cdots, x_p) \cdot (y_1, y_2, \cdots, y_p) = (x_1 y_1, x_2 y_2, \cdots, x_p y_p).$$
To prove Theorem 3.1, we need the following lemma.
Lemma 3.1 Let
$$\Lambda_1 = \begin{pmatrix} \lambda_1 & & \\ & \ddots & \\ & & \lambda_p \end{pmatrix}, \qquad
\Lambda_2 = \begin{pmatrix} \mu_1 & & \\ & \ddots & \\ & & \mu_p \end{pmatrix}, \qquad
A = (a_{ij})_{p \times p}.$$
Then
$$\mathrm{diag}(\Lambda_1 A \Lambda_2) = \mathrm{diag}(\Lambda_1) \cdot \mathrm{diag}(A) \cdot \mathrm{diag}(\Lambda_2), \qquad
\mathrm{diag}(\Lambda_1 \Lambda_2) = \mathrm{diag}(\Lambda_1) \cdot \mathrm{diag}(\Lambda_2).$$
Proof. Since
$$\Lambda_1 A \Lambda_2 = \begin{pmatrix} \lambda_1 a_{11}\mu_1 & * & \cdots & * \\ * & \lambda_2 a_{22}\mu_2 & \cdots & * \\ \vdots & \vdots & \ddots & \vdots \\ * & * & \cdots & \lambda_p a_{pp}\mu_p \end{pmatrix},$$
we have
$$\mathrm{diag}(\Lambda_1 A \Lambda_2) = (\lambda_1 a_{11}\mu_1, \lambda_2 a_{22}\mu_2, \cdots, \lambda_p a_{pp}\mu_p)
= (\lambda_1, \cdots, \lambda_p) \cdot (a_{11}, \cdots, a_{pp}) \cdot (\mu_1, \cdots, \mu_p)
= \mathrm{diag}(\Lambda_1) \cdot \mathrm{diag}(A) \cdot \mathrm{diag}(\Lambda_2).$$
Similarly, $\Lambda_1 \Lambda_2$ is the diagonal matrix with diagonal $(\lambda_1\mu_1, \lambda_2\mu_2, \cdots, \lambda_p\mu_p)$, so
$$\mathrm{diag}(\Lambda_1 \Lambda_2) = (\lambda_1\mu_1, \lambda_2\mu_2, \cdots, \lambda_p\mu_p)
= (\lambda_1, \cdots, \lambda_p) \cdot (\mu_1, \cdots, \mu_p)
= \mathrm{diag}(\Lambda_1) \cdot \mathrm{diag}(\Lambda_2).$$
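The lemma is easy to check numerically; the vector product "·" is the element-wise * in R:

p <- 4
Lambda1 <- diag(runif(p)); Lambda2 <- diag(runif(p))
A <- matrix(rnorm(p^2), p, p)
all.equal(diag(Lambda1 %*% A %*% Lambda2),
          diag(Lambda1) * diag(A) * diag(Lambda2))  # TRUE
all.equal(diag(Lambda1 %*% Lambda2),
          diag(Lambda1) * diag(Lambda2))            # TRUE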
Theorem 3.1 (a) $R^c = \tilde{S}^c = \tilde{R}^c$. (b) $R^r = \tilde{R}^r$.
Proof. (a) We prove $R^c = \tilde{S}^c$ first. Since $Y_{(k)} = (D^c)^{-\frac{1}{2}}\left(X_{(k)} - \bar{X}^c\right) = X^{*c}_{(k)}$,
$$R^c = \frac{1}{n-1}\sum_{k=1}^{n} X^{*c}_{(k)} X^{*c\top}_{(k)} = \frac{1}{n-1}\sum_{k=1}^{n} Y_{(k)} Y_{(k)}^\top = \tilde{S}^c.$$
Next we prove $\tilde{R}^c = \tilde{S}^c$. From Table 3 (e), we have $Y^{*c}_{(k)} = Y_{(k)}$, thus
$$\tilde{R}^c = \frac{1}{n-1}\sum_{k=1}^{n} Y^{*c}_{(k)} Y^{*c\top}_{(k)} = \frac{1}{n-1}\sum_{k=1}^{n} Y_{(k)} Y_{(k)}^\top = \tilde{S}^c.$$

(b) We have
$$R^r = S^{*r} = \frac{1}{M-1}\sum_{k=1}^{M} X^{*r}_{(k)} X^{*r\top}_{(k)}, \qquad
\tilde{R}^r = \tilde{S}^{*r} = \frac{1}{M-1}\sum_{k=1}^{M} Y^{*r}_{(k)} Y^{*r\top}_{(k)}.$$
To prove that $R^r = \tilde{R}^r$, it suffices to prove that
$$X^{*r}_{(k)} = Y^{*r}_{(k)}. \qquad (3.1)$$
Write
$$X^{*r}_{(k)} = (D^r)^{-\frac{1}{2}}\left(X_{(k)} - \bar{X}^r\right), \qquad
Y^{*r}_{(k)} = \left(\tilde{D}^r\right)^{-\frac{1}{2}}\left(Y_{(k)} - \bar{Y}^r\right), \qquad
Y_{(k)} = (D^c)^{-\frac{1}{2}}\left(X_{(k)} - \bar{X}^c\right).$$
From Table 3 (f), we have
$$\bar{Y}^r = (D^c)^{-\frac{1}{2}}\left(\bar{X}^r - \bar{X}^c\right).$$
Thus
$$Y^{*r}_{(k)} = \left(\tilde{D}^r\right)^{-\frac{1}{2}}\left(Y_{(k)} - \bar{Y}^r\right)
= \left(\tilde{D}^r\right)^{-\frac{1}{2}}\left[(D^c)^{-\frac{1}{2}}\left(X_{(k)} - \bar{X}^c\right) - (D^c)^{-\frac{1}{2}}\left(\bar{X}^r - \bar{X}^c\right)\right]
= \left(\tilde{D}^r\right)^{-\frac{1}{2}}(D^c)^{-\frac{1}{2}}\left(X_{(k)} - \bar{X}^r\right).$$
To prove (3.1), it suffices to prove
$$\left(\tilde{D}^r\right)^{-\frac{1}{2}}(D^c)^{-\frac{1}{2}} = (D^r)^{-\frac{1}{2}},$$
which reduces to proving
$$\tilde{D}^r D^c = D^r,$$
that is,
$$\mathrm{diag}\left(\mathrm{diag}\left(\tilde{S}^r\right)\right)\,\mathrm{diag}\left(\mathrm{diag}\left(S^c\right)\right) = \mathrm{diag}\left(\mathrm{diag}\left(S^r\right)\right),$$
which is equivalent to
$$\mathrm{diag}\left(\tilde{S}^r\right) \cdot \mathrm{diag}\left(S^c\right) = \mathrm{diag}\left(S^r\right).$$
We have just proved
$$Y_{(k)} - \bar{Y}^r = (D^c)^{-\frac{1}{2}}\left(X_{(k)} - \bar{X}^r\right),$$
and thus
$$\tilde{S}^r = \frac{1}{M-1}\sum_{k=1}^{M}\left(Y_{(k)} - \bar{Y}^r\right)\left(Y_{(k)} - \bar{Y}^r\right)^\top
= (D^c)^{-\frac{1}{2}}\left[\frac{1}{M-1}\sum_{k=1}^{M}\left(X_{(k)} - \bar{X}^r\right)\left(X_{(k)} - \bar{X}^r\right)^\top\right](D^c)^{-\frac{1}{2}}
= (D^c)^{-\frac{1}{2}} S^r (D^c)^{-\frac{1}{2}}.$$
Consequently, by Lemma 3.1,
$$\mathrm{diag}\left(\tilde{S}^r\right) = \mathrm{diag}\left((D^c)^{-\frac{1}{2}} S^r (D^c)^{-\frac{1}{2}}\right)
= \mathrm{diag}\left((D^c)^{-\frac{1}{2}}\right) \cdot \mathrm{diag}\left(S^r\right) \cdot \mathrm{diag}\left((D^c)^{-\frac{1}{2}}\right)
= \mathrm{diag}\left((D^c)^{-1}\right) \cdot \mathrm{diag}\left(S^r\right).$$
Finally, using $\mathrm{diag}(S^c) = \mathrm{diag}(D^c)$ and Lemma 3.1,
$$\mathrm{diag}\left(\tilde{S}^r\right) \cdot \mathrm{diag}\left(S^c\right)
= \mathrm{diag}\left((D^c)^{-1}\right) \cdot \mathrm{diag}\left(S^r\right) \cdot \mathrm{diag}\left(S^c\right)
= \mathrm{diag}\left((D^c)^{-1}\right) \cdot \mathrm{diag}\left(D^c\right) \cdot \mathrm{diag}\left(S^r\right)
= \mathrm{diag}(E) \cdot \mathrm{diag}\left(S^r\right) = \mathrm{diag}\left(S^r\right).$$
The proof is complete.
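For the classical estimates, part (a) can be illustrated numerically with any numeric data
matrix:

x <- as.matrix(iris[, 1:4])
all.equal(cor(x), cov(scale(x)))            # R^c equals S~^c
all.equal(cor(x), cov2cor(cov(scale(x))))   # R^c equals R~^c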
The score of the kth observation is summarized in Table 4. The scores method can be
“regression” or “Bartlett”. The running matrix can be the covariance matrix (S) or the
correlation matrix (R).
Table 4: The score of the $k$th observation.

              regression                                                  Bartlett
cor = FALSE   $\tilde{f}_{(k)} = \hat{L}^\top S^{-1}\left(X_{(k)} - \bar{X}\right)$      $\hat{f}_{(k)} = \left(\hat{L}^\top \hat{\Psi}^{-1} \hat{L}\right)^{-1} \hat{L}^\top \hat{\Psi}^{-1}\left(X_{(k)} - \bar{X}\right)$
cor = TRUE    $\tilde{f}_{(k)} = \hat{L}^{*\top} R^{-1} X^*_{(k)}$                       $\hat{f}_{(k)} = \left(\hat{L}^{*\top} \hat{\Psi}^{*-1} \hat{L}^*\right)^{-1} \hat{L}^{*\top} \hat{\Psi}^{*-1} X^*_{(k)}$
If we define the scoring coefficient $S_c$ by
$$S_c = \begin{cases}
\hat{L}^\top S^{-1}, & \texttt{cor = FALSE}, \texttt{scoresMethod = "regression"}, \\
\left(\hat{L}^\top \hat{\Psi}^{-1} \hat{L}\right)^{-1} \hat{L}^\top \hat{\Psi}^{-1}, & \texttt{cor = FALSE}, \texttt{scoresMethod = "Bartlett"}, \\
\hat{L}^{*\top} R^{-1}, & \texttt{cor = TRUE}, \texttt{scoresMethod = "regression"}, \\
\left(\hat{L}^{*\top} \hat{\Psi}^{*-1} \hat{L}^*\right)^{-1} \hat{L}^{*\top} \hat{\Psi}^{*-1}, & \texttt{cor = TRUE}, \texttt{scoresMethod = "Bartlett"},
\end{cases}$$
then the score of the $k$th observation simplifies to
$$f_{(k)} = \begin{cases}
S_c\left(X_{(k)} - \bar{X}\right), & \texttt{cor = FALSE}, \\
S_c X^*_{(k)}, & \texttt{cor = TRUE}.
\end{cases}$$
If cor = FALSE, then the score matrix is
$$F = \begin{pmatrix} f_{(1)}^\top \\ f_{(2)}^\top \\ \vdots \\ f_{(n)}^\top \end{pmatrix}
= \begin{pmatrix} \left(X_{(1)} - \bar{X}\right)^\top S_c^\top \\ \left(X_{(2)} - \bar{X}\right)^\top S_c^\top \\ \vdots \\ \left(X_{(n)} - \bar{X}\right)^\top S_c^\top \end{pmatrix}
= \left(X - \mathbf{1}\bar{X}^\top\right) S_c^\top.$$
If cor = TRUE, then the score matrix is
$$F = \begin{pmatrix} f_{(1)}^\top \\ f_{(2)}^\top \\ \vdots \\ f_{(n)}^\top \end{pmatrix}
= \begin{pmatrix} X^{*\top}_{(1)} S_c^\top \\ X^{*\top}_{(2)} S_c^\top \\ \vdots \\ X^{*\top}_{(n)} S_c^\top \end{pmatrix}
= X^* S_c^\top,$$
where $X^* = \left(X - \mathbf{1}\bar{X}^\top\right) D^{-\frac{1}{2}}$. If we define
$$\mathrm{scaledX} = \begin{cases} X - \mathbf{1}\bar{X}^\top, & \texttt{cor = FALSE}, \\ X^*, & \texttt{cor = TRUE}, \end{cases}$$
then
$$F = \mathrm{scaledX} \cdot S_c^\top. \qquad (3.2)$$
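As an illustration of (3.2), the following sketch computes the classical regression scores
with cor = TRUE by hand, using the PCA-method loadings (the variable names are ours, not the
package's):

x <- as.matrix(iris[, 1:4])
R <- cor(x)                                           # running matrix
e <- eigen(R)
L <- e$vectors[, 1:2] %*% diag(sqrt(e$values[1:2]))   # loadings, PCA method
Sc <- t(L) %*% solve(R)                               # scoring coefficient
F <- scale(x) %*% t(Sc)                               # F = scaledX %*% t(Sc)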
Theorem 3.2 The running matrices ($R$), eigenvalues ($\lambda$), loadings ($L$), uniquenesses ($\Psi$),
scoring coefficients ($S_c$), scaled data matrices (scaledX), and score matrices ($F$) are the same for

(a) combinations (2) and (4),

(b) combinations (6) and (8).

Proof. If the running matrices ($R$) are the same, then the eigenvalues ($\lambda$), loadings ($L$),
and uniquenesses ($\Psi$) are the same. If $R$, $L$, and $\Psi$ are the same, then by the definition of the
scoring coefficient $S_c$, the $S_c$ are the same. If the scaled data matrices (scaledX) are the
same, then by (3.2), the score matrices ($F$) are the same. Thus it suffices to show that $R$
and scaledX are the same.

(a) The running matrices satisfy $R^c = \tilde{R}^c$ by Theorem 3.1. From Table 3, we see that
$Y^{*c}_{(k)} = Y_{(k)} = X^{*c}_{(k)}$, thus the scaledX are the same.

(b) The running matrices satisfy $R^r = \tilde{R}^r$ by Theorem 3.1. By (3.1), we have $X^{*r}_{(k)} = Y^{*r}_{(k)}$, thus
the scaledX are the same.
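Part (a) can also be verified directly with the constructors of the solution (a sketch; hbk.x
denotes the X-part of the hbk data introduced in Section 3.3):

fa2 <- FaClassic(x = hbk.x, factors = 2, cor = TRUE,
                 method = "pca", scoresMethod = "regression")
fa4 <- FaClassic(x = scale(hbk.x), factors = 2, cor = TRUE,
                 method = "pca", scoresMethod = "regression")
all.equal(fa2@loadings, fa4@loadings)  # same loadings
all.equal(fa2@scores, fa4@scores)      # same score matrices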
3.3. Example: hbk data
In this subsection, the data set hbk is used to show the base functionalities of the robust factor
analysis solution. The Hawkins, Bradu and Kass data set hbk from the package robustbase
consists of 75 observations in 4 dimensions (one response and three explanatory variables).
The first 10 observations are bad leverage points, and the next four points are good leverage
points (i.e., their x-values are outlying, but the corresponding y-values fit the model quite
well). We will consider only the X-part of the data.
Robust factor analysis is represented by the class FaCov, which inherits from class Fa through
class FaRobust (an inheritance distance of 2) and uses all slots and methods defined in Fa. The
function FaCov() serves as a constructor (generating function) of the class. It can be called
either by providing a data frame or matrix, or by a formula with no response variable referring
only to numeric variables. The usage of the default method of FaCov is:
FaCov(x, factors = 2, cor = FALSE, cov.control = CovControlMcd(),
method = c("mle", "pca", "pfa"),
scoresMethod = c("none", "regression", "Bartlett"), ...)
where x is a numeric matrix or an object that can be coerced to a numeric matrix. factors
is the number of factors to be fitted. cor is a logical value indicating whether the calculation
should use the correlation matrix (cor = TRUE) or the covariance matrix (cor = FALSE).
cov.control specifies which covariance estimator to use by providing a CovControl-class
object. The default is CovControlMcd-class, which will indirectly call CovMcd. If cov.control
= NULL is specified, the classical estimates will be used by calling CovClassic. method is the
method of factor analysis, one of "mle" (the default), "pca", and "pfa". scoresMethod specifies
which type of scores to produce. The default is "none"; "regression" gives Thomson's
scores, and "Bartlett" gives Bartlett's weighted least-squares scores (Johnson and Wichern
2007; Xue and Chen 2007). The usage of the formula interface of FaCov is:
FaCov(formula, data = NULL, factors = 2, cor = FALSE,
method = "mle", scoresMethod = "none", ...)
where formula is a formula with no response variable, referring only to numeric variables,
and data is an optional data frame containing the variables in the formula. Classical factor
analysis is represented by the class FaClassic, which inherits directly from Fa and uses all
slots and methods defined there. The function FaClassic() serves as a constructor (generating
function) of the class. Like FaCov(), it has a default method and a formula interface.
The code lines
R> ##
R> ## faCovPcaRegMcd is obtained from FaCov.default
R> ##
R> faCovPcaRegMcd = FaCov(x = hbk.x, factors = 2, method = "pca",
+    scoresMethod = "regression", cov.control = CovControlMcd());
R> faCovPcaRegMcd
generate an object faCovPcaRegMcd of class FaCov. In fact, it is equivalent to using the formula
interface
R> faCovForPcaRegMcd = FaCov(~., data = as.data.frame(hbk.x),
+ factors = 2, method = "pca", scoresMethod = "regression",
+ cov.control = CovControlMcd())
That is, faCovForPcaRegMcd = faCovPcaRegMcd. Type
R> class(faCovPcaRegMcd)
[1] "FaCov"
attr(,"package")
[1] "robustfa"
We see that the class FaCov is defined in the package robustfa. For an object obj of class Fa,
we have
obj = show(obj) = print(obj) = myFaPrint(obj).
Here show() and print() are S4 generic functions, while myFaPrint() is an S3 function
acting as a function definition for setMethod of the generic functions show and print.
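A minimal sketch of this wiring (an assumption for illustration, not a copy of robustfa's
actual source) might look like:

myFaPrint <- function(x, ...) {  # hypothetical body of the S3 workhorse
  cat("Call:\n"); print(x@call)
  cat("\nStandard deviations:\n"); print(sqrt(x@eigenvalues))
  cat("\nLoadings:\n"); print(x@loadings)
  invisible(x)
}
setMethod("show", "Fa", function(object) myFaPrint(object))
setMethod("print", "Fa", function(x, ...) myFaPrint(x, ...))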
R> show(faCovPcaRegMcd)
Call:
FaCov(x = hbk.x, factors = 2, cov.control = CovControlMcd(),
method = "pca", scoresMethod = "regression")
Standard deviations:
[1] 1.391072 1.261732 1.170626
Loadings:
Factor1
Factor2
X1 -0.004150242 1.2079697
X2 1.180797866 -0.1246965
X3 0.646169668 0.4903801
From Figure 2 we see that summary() generates an object of class SummaryFa which has its
own show() method.
R> summaryFaCovPcaRegMcd = summary(faCovPcaRegMcd);
R> summaryFaCovPcaRegMcd
Call:
FaCov(x = hbk.x, factors = 2, cov.control = CovControlMcd(),
method = "pca", scoresMethod = "regression")
Importance of components:
               Factor1 Factor2
SS loadings      1.812   1.715
Proportion Var   0.370   0.350
Cumulative Var   0.370   0.720
From the summary result of faCovPcaRegMcd, we see that the first two factors account for
about 72.0% of its total variance.
Next we calculate predictions/scores using predict(). The usage is predict(object, ...),
where object is an object of class Fa. If ... (newdata) is missing, the scores slot of object is
used. To save space, only the first five and last five rows of the scores are displayed.
R> predict(faCovPcaRegMcd)
       Factor1     Factor2
1   20.1391070  10.4105484
2   20.9770640  10.0136763
3   21.3980990  11.4405912
4   22.5484289  10.8723961
5   22.0558985  11.0589128
71   0.6381775   0.5281550
72   0.1462415  -0.7438318
73   0.2005665  -0.7383855
74   0.3525364  -1.1721414
75  -0.5293472  -0.4594629
If ... is not missing, then it must have the name newdata. Moreover, newdata should
be scaled first. For example, the original data is x = hbk.x, and newdata is a one-row
data.frame.
R> newdata = hbk.x[1, ]
R> cor = FALSE # the default
R> newdata = { if (cor == TRUE)
+     # standardized transformation
+     scale(newdata, center = faCovPcaRegMcd@center,
+           scale = sqrt(diag(faCovPcaRegMcd@covariance)))
+   else # cor == FALSE
+     # centralized transformation
+     scale(newdata, center = faCovPcaRegMcd@center, scale = FALSE)
+ }
Finally we get
R> prediction = predict(faCovPcaRegMcd, newdata = newdata)
R> prediction
Factor1 Factor2
1 20.13911 10.41055
One can easily check that
prediction = predict(faCovPcaRegMcd)[1,] = faCovPcaRegMcd@scores[1,]
To visualize the factor analysis results, the plot method can be used. We have a setMethod
for plot with signature x = "Fa", y = "missing". The usage is

plot(x, which = c("factorScore", "screeplot"), choices = 1:2)

where x is an object of class Fa. The argument which indicates what kind of plot to produce:
if which = "factorScore", then a scatterplot of the factor scores is produced; if which =
"screeplot", then an eigenvalues plot, helpful for selecting the number of factors, is created.
The argument choices is an integer vector of length two indicating which columns of the
factor scores to plot. To see how plot functions, we first generate an Fa object.
R> faClassicPcaReg = FaClassic(x = hbk.x, factors = 2, method = "pca",
+ scoresMethod = "regression"); faClassicPcaReg
Call:
FaClassic(x = hbk.x, factors = 2, method = "pca", scoresMethod = "regression")
Standard deviations:
[1] 14.7024532  1.4075073  0.9572508
Loadings:
Factor1 Factor2
X1 3.411036 0.936662
X2 7.278529 3.860723
X3 11.270812 3.273734
R> summary(faClassicPcaReg)
Call:
FaClassic(x = hbk.x, factors = 2, method = "pca", scoresMethod = "regression")
Importance of components:
               Factor1 Factor2
SS loadings    191.643  26.500
Proportion Var   0.875   0.121
Cumulative Var   0.875   0.996
faClassicPcaReg is an object of class FaClassic. From the summary result of faClassicPcaReg,
we see that the first two factors account for about 99.6% of its total variance. Next we generate
an object faCovPcaRegMcd of class FaCov using the same data set. The show and summary
results have already been shown before.
The following code lines generate classical and robust scatterplots of the first two factor scores.
See Figure 3.
R> usr <- par(mfrow = c(1, 2))
R> plot(faClassicPcaReg, which = "factorScore", choices = 1:2)
R> plot(faCovPcaRegMcd, which = "factorScore", choices = 1:2)
R> par(usr)
Figure 3: Classical and robust scatterplots of the first two factor scores of the hbk data.
The following code lines generate classical and robust scree plots. See Figure 4.
R> usr <- par(mfrow = c(1, 2))
R> plot(faClassicPcaReg, which = "screeplot")
R> plot(faCovPcaRegMcd, which = "screeplot")
R> par(usr)
Next we impose a 97.5% tolerance ellipse on the scatterplot of the first two factor scores of
the hbk data by using the function rrcov:::.myellipse. See Figure 5. The left panel shows
the classical 97.5% tolerance ellipse, which is very large and contains the outliers 1-13; the
regular points are not well separated from the outliers. The right panel shows the robust
97.5% tolerance ellipse, which clearly separates the regular points from the outliers. We see
that the estimate of the center of the regular points is located at the origin (where the mean
of the scores should be located). The following code lines compute the means of the classical
and robust scores. We see that the mean over all observations of the classical scores equals 0,
while the mean over the good observations (regular points, excluding the outliers) of the
robust scores equals 0.
R> colMeans(faClassicPcaReg@scores)
      Factor1       Factor2
-1.452310e-16  3.376871e-16

R> colMeans(faCovPcaRegMcd@scores)
Figure 4: Classical and robust scree plots of the hbk data.
Figure 5: Classical and robust scatterplot of the first two factor scores of the hbk data with
a 97.5% tolerance ellipse.
 Factor1  Factor2
4.272569 2.072818

R> colMeans(faClassicPcaReg@scores[15:75,])
   Factor1    Factor2
-0.4523799 -0.1358260

R> colMeans(faCovPcaRegMcd@scores[15:75,])
      Factor1       Factor2
-8.167419e-17 -3.185066e-18
Figure 6: A distance-distance plot for hbk data.
Next we plot a Cov-class object. See Figure 6. The figure shows a distance-distance plot. We
see that the robust distances are far larger than the (classical) Mahalanobis distances, and
the outliers have large robust distances. The figure is generated by myplotDD(x = covMcd).
myplotDD() is a revised version of .myddplot() in plot-utils.R in the package rrcov. In
myplotDD(), id.n and ind are printed out. Here id.n is the number of observations to identify
by a label; by default, the number of observations with robust distances larger than cutoff
is used, where by default cutoff = sqrt(qchisq(0.975, p)). ind contains the indices of the
observations whose robust distances are larger than cutoff.
cutoff = 3.057516
id.n <- length(which(rd>cutoff))
id.n = 14
24
An OOS for Robust Factor Analysis
Here y is the robust distance (rd).

sort.y = (To save space, only the smallest five and largest five
elements of sort.y$x and sort.y$ix are shown.)

$x
        67         18         72         71         28
 0.4553524  0.6505517  0.7631246  0.8306446  0.8570290
         4         11         13         12         14
27.1972578 30.3225470 30.5536038 31.4123574 34.0079508

$ix
[1] 67 18 72 71 28 ...  4 11 13 12 14

ind =
[1]  1  8  2  6  7 10  3  9  5  4 11 13 12 14
From the above results we see that the cutoff is computed as 3.057516. There are id.n
= 14 observations with robust distance larger than cutoff. sort.y is a list containing the
sorted values of y (the robust distance). sort.y$x is arranged in increasing order. To save
space, only the smallest five and largest five robust distances with their indices are shown.
sort.y$ix contains the indices. ind shows the indices of the largest id.n = 14 observations
whose robust distances are larger than cutoff.
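For reference, a sketch of how these quantities fit together (assuming rd holds the robust
distances and p the number of variables, here p = 3 for hbk.x):

p <- 3
cutoff <- sqrt(qchisq(0.975, p))    # = 3.057516
id.n <- length(which(rd > cutoff))  # = 14 for the hbk data
ord <- order(rd)                    # increasing robust distances
ind <- ord[(length(rd) - id.n + 1):length(rd)]  # indices of the id.n largest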
The accessor functions getCenter(), getEigenvalues(), getLoadings(), getQuan(), getScores(),
and getSdev() are used to access the corresponding slots. For instance
R> getEigenvalues(faCovPcaRegMcd)
[1] 1.935081 1.591967 1.370366
The function getFa(), which is used in predict() and screeplot(), returns a list of class
"fa".
Note that our previous comparisons in this subsection are just (1) vs (5), i.e., classical and
robust factor analysis comparisons using x = hbk.x as the data input and the covariance
matrix (cor = FALSE) as the running matrix.

For x = hbk.x, we checked that $S^r \neq \tilde{S}^r$, and that $R^r = \tilde{R}^r$ for control = "mcd", "ogk", "m",
"mve", "sfast", "bisquare", "rocke", with small differences between $R^r$ and $\tilde{R}^r$ for control =
"sde", "surreal". The results illustrate Theorem 3.1.
The eigenvalues of the running matrices of the hbk data for the 8 combinations are given in
Table 5. From Table 5 we see that the eigenvalues of (2), (3), and (4) are the same, and the
eigenvalues of (6) and (8) are the same. The results illustrate Theorems 3.1 and 3.2.

Table 5: The eigenvalues of the running matrices for the hbk data.

                             Classical                                    Robust (MCD)
hbk.x, covariance            (1) 216.162129  1.981077  0.916329          (5) 1.935081  1.591967  1.370366
hbk.x, correlation           (2) 2.92435228  0.05668483  0.01896288      (6) 1.1890834  0.9563255  0.8545911
scale(hbk.x), covariance     (3) 2.92435228  0.05668483  0.01896288      (7) 0.12408511  0.02501676  0.01089401
scale(hbk.x), correlation    (4) 2.92435228  0.05668483  0.01896288      (8) 1.1890834  0.9563255  0.8545911

Classical and robust (MCD) scatterplots of the first two factor scores of the hbk data with
97.5% tolerance ellipses are plotted in Figure 7. We see that the scores of (2), (3), and (4)
are the same, and the scores of (6) and (8) are the same, in agreement with Theorem 3.2. Note that
the tolerance ellipse is very large for (1), since the outliers severely affect the eigenvalues of
the running matrix $S^c$, while the tolerance ellipses are very small for (2), (3), and (4), also
because the outliers severely affect the eigenvalues of the running matrices $R^c = \tilde{S}^c = \tilde{R}^c$.
The tolerance ellipse for (7) is very small and does not cover the regular points, because the
first two eigenvalues of (7) are very small. This exemplifies that results based on the robust
covariance matrix of the scaled data are not very reliable. The tolerance ellipses of (6) and (8)
separate the regular points and the outliers well.
3.4. Example: Stocks data
In this subsection, we apply the robust factor analysis solution to a real data set stock611.
This data set consists of 611 observations with 12 variables. The data set is from Chinese
stock market in the year 2001. It is used in Wang (2009) to illustrate factor analysis methods.
For x = stock611[,3:12],

cov_x = CovRobust(x = x, control = control)

produces an error message for control = "mcd", "m", "mve", "sde", "sfast"; thus we cannot
compute cov_x, $S^r$, and $R^r$ for these robust estimators. That is, we cannot get results for
combinations (5) and (6) for these robust estimators. However, for x = scale(stock611[,3:12]),
we can compute cov_x, $\tilde{S}^r$, and $\tilde{R}^r$ for these robust estimators, and we can get results for
combinations (7) and (8). Although (6) and (8) have the same running matrices ($R$), eigenvalues
($\lambda$), loadings ($L$), uniquenesses ($\Psi$), scoring coefficients ($S_c$), scaled data matrices
(scaledX), and score matrices ($F$), as was proved in Theorem 3.2, we may not be able to get
results for (6) due to a computational error in cov_x, while for (8) the computational error
does not occur. That is why we recommend (4) vs (8) for comparing classical and robust factor
analysis.
The first two eigenvalues of the running matrices of the stock611 data for the 8 combinations
are given in Table 6. From Table 6 we see that the eigenvalues of (2), (3), and (4) are the
same, and the eigenvalues of (6) and (8) are the same. The results again illustrate Theorems
3.1 and 3.2.
Classical and robust (OGK) scatterplots of the first two factor scores of the stock611 data
with 97.5% tolerance ellipses are plotted in Figure 8. The scatterplots of the first two factor
scores of combinations (1) and (5) are not shown, because errors occur in

solve.default(S): system is computationally singular.

To get a clearer view, we zoom in on the scatterplots. We see that the scores
of (2), (3), and (4) are the same, and the scores of (6) and (8) are the same, in agreement with Theorem
Figure 7: Classical and robust (MCD) scatterplots of the first two factor scores of the hbk
data with 97.5% tolerance ellipses. First row: (1) vs (5); (3) vs (7). Second row: (2) vs (6);
(4) vs (8).
Table 6: The first two eigenvalues of the running matrices of the stock611 data.

                                        Classical                            Robust (MCD)
stock611[,3:12], covariance             (1) 4.272384e+20  1.985165e+19       (5) 3.990230e+17  7.358056e+16
stock611[,3:12], correlation            (2) 5.7900498488  2.3189552681       (6) 5.15840672    2.40502329
scale(stock611[,3:12]), covariance      (3) 5.7900498488  2.3189552681       (7) 7.527778e-01  2.967432e-01
scale(stock611[,3:12]), correlation     (4) 5.7900498488  2.3189552681       (8) 5.15840672    2.40502329
3.2. Note that the tolerance ellipses for (2), (3), and (4) cover the outliers, because the outliers
severely affect the eigenvalues of the running matrices $R^c = \tilde{S}^c = \tilde{R}^c$. The tolerance ellipses
of (6) and (8) separate the regular points and the outliers well.
Next we plot a Cov-class object for the stock611 data using the function myplotDD(). See Figure 9.
The figure shows a distance-distance plot. We see that the robust distances are
far larger than the (classical) Mahalanobis distances. The outliers have large robust distances.
cutoff = 4.525834
id.n <- length(which(rd>cutoff))
id.n = 287
Here y is the robust distance (rd).

sort.y = (To save space, only the smallest five and largest five
elements of sort.y$x and sort.y$ix are shown.)

$x
         281          368          239          312          229
   0.6880160    0.8109610    0.9234880    0.9763902    1.0429306
         379          539          337           79          338
 180.7741644  264.6799442  531.0887434  923.0346401 1315.6544592

$ix
[1] 281 368 239 312 229 ... 379 539 337 79 338
ind =
[1] 601 ... (the full vector of the id.n = 287 flagged indices is omitted here to save space)
Figure 8: Classical and robust (OGK) scatterplots of the first two factor scores of the stock611
data with 97.5% tolerance ellipses. First row: (2) vs (6). Second row: (3) vs (7). Third row:
(4) vs (8).
Figure 9: A distance-distance plot for stock611 data.
From the above results we see that the cutoff is computed as 4.525834. There are id.n =
287 observations with robust distances larger than cutoff. sort.y is a list containing the
sorted values of y (the robust distance). sort.y$x is arranged in increasing order. To save
space, only the smallest five and largest five robust distances with their indices are shown.
sort.y$ix contains the indices. ind shows the indices of the largest id.n = 287 observations
whose robust distances are larger than cutoff.
From the above results, we recommend (4) vs (8) when one needs to compare classical and
robust factor analysis. The following code lines generate an object faClassic4 of class
FaClassic.
R> ## (4) classical, x = scale(stock611[,3:12]), cor = TRUE (correlation matrix)
R> faClassic4 = FaClassic(x = scale(stock611[,3:12]), factors = 2, cor = TRUE,
+ method = "pca", scoresMethod = "regression"); faClassic4
Call:
FaClassic(x = scale(stock611[, 3:12]), factors = 2, cor = TRUE,
method = "pca", scoresMethod = "regression")
Standard deviations:
[1] 2.40625224 1.52281163 1.00433419 0.75771973 0.37846924 0.31490830
[7] 0.22719326 0.09697058 0.06300416 0.02779799
Loadings:
        Factor1     Factor2
X1   0.98781949 -0.02611587
X2   0.99154548 -0.00633362
X3   0.99222167  0.05353998
X4   0.98486045  0.07865320
X5   0.05473434  0.95570764
X6  -0.01321817  0.62527581
X7   0.01924642  0.48022545
X8   0.04578987  0.88138600
X9   0.94053884 -0.01792872
X10  0.99068004 -0.04474676
R> summary(faClassic4)
Call:
FaClassic(x = scale(stock611[, 3:12]), factors = 2, cor = TRUE,
method = "pca", scoresMethod = "regression")
Importance of components:
               Factor1 Factor2
SS loadings      5.785   2.324
Proportion Var   0.579   0.232
Cumulative Var   0.579   0.811
Since we use the correlation matrix (cor = TRUE) as the running matrix, the entries of the
loading matrix are limited between 0 and 1, which makes the factors easy to interpret. From
the show result of faClassic4, we see that its Factor1 explains variables x1-x4, x9, x10; its
Factor2 explains variables x5-x8 (with loadings larger than 0.48). From the summary result
of faClassic4, we see that the first two factors account for about 81.1% of its total variance.
Next we generate an object faCov8 of class FaCov using the same data set.
R> ## (8) robust, x = scale(stock611[,3:12]), cor = TRUE (correlation matrix)
R> faCov8 = FaCov(x = scale(stock611[,3:12]), factors = 2, cor = TRUE, method = "pca",
+ scoresMethod = "regression", cov.control = CovControlOgk()); faCov8
Call:
FaCov(x = scale(stock611[, 3:12]), factors = 2, cor = TRUE, cov.control = CovControlOgk(),
method = "pca", scoresMethod = "regression")
Standard deviations:
[1] 2.2712126 1.5508138 1.0753673 0.8269330 0.4632001 0.4302182
[7] 0.3448025 0.1828619 0.1768218 0.1144628
Loadings:
        Factor1      Factor2
X1    0.1192720  0.771765320
X2    0.3585258  0.808017502
X3    0.7582825  0.596037255
X4    0.7604049  0.586322722
X5    0.9546105  0.072062014
X6    0.4300313  0.111234443
X7    0.8787653 -0.002537877
X8    0.9349237 -0.028219013
X9    0.0457114  0.931164519
X10  -0.0712708  0.827516094
R> summary(faCov8)
Call:
FaCov(x = scale(stock611[, 3:12]), factors = 2, cor = TRUE, cov.control = CovControlOgk(),
method = "pca", scoresMethod = "regression")
Importance of components:
               Factor1 Factor2
SS loadings      4.046   3.518
Proportion Var   0.405   0.352
Cumulative Var   0.405   0.756
From the show result of faCov8, we see that its Factor1 explains variables x3-x8 (with
loadings larger than 0.43); its Factor2 explains variables x1-x4, x9, x10. Thus Factor1
(Factor2) of faClassic4 corresponds to Factor2 (Factor1) of faCov8. In fact, this is not the
general case; in most situations, the explanations of the factors of faClassic4 and faCov8
should be the same. From the summary result of faCov8, we see that the first two factors
account for about 75.6% of its total variance.
The classical and robust scatterplots of the first two factor scores of the stock611 data are
shown in Figure 10. From the figure we see that Factor1 (Factor2) of faClassic4 and
Factor2 (Factor1) of faCov8 are similar in patterns.
Figure 10: Classical and robust scatterplots of the first two factor scores of the stock611 data.
Furthermore, by inspecting the classical and robust ordered scores, we find that they are
quite different. In the following, orderedFsC[[1]] and orderedFsOgk[[1]] are decreasing
according to their first column; orderedFsC[[2]] and orderedFsOgk[[2]] are decreasing
according to their second column.
R> orderedFsC = fsOrder(faClassic4@scores[,2:1]); orderedFsC
R> Lst=list(orderedFsC[[1]][1:10,], orderedFsC[[2]][1:10,]); Lst
[[1]]
     Factor2       Factor1
583 3.898163  1.442520e-01
337 3.681513 -1.520371e-01
606 3.263503 -5.357695e-02
486 2.503726 -1.257069e-01
563 2.252890 -8.903074e-03
598 2.247478 -1.065788e-01
454 2.129092 -1.118824e-01
572 2.123974  5.649819e-01
588 2.006510  1.021676e-01
526 1.995515  1.440485e-05

[[2]]
        Factor2    Factor1
338 -1.22930368 23.8967397
379 -0.10141001  3.2591570
539  1.48465281  3.0500510
79   0.17308239  2.3347627
571  0.74880375  0.9807258
59  -0.89391045  0.8830250
74  -0.92530280  0.7621726
444  0.01917557  0.7219354
589  1.81077486  0.6714384
113 -0.78396701  0.6627497
R> orderedFsOgk = fsOrder(faCov8@scores[,1:2]); orderedFsOgk
R> Lst=list(orderedFsOgk[[1]][1:10,], orderedFsOgk[[2]][1:10,]); Lst
[[1]]
      Factor1    Factor2
337 38.809139 -16.399416
539 22.810633  91.478265
589  9.979105  20.071925
576  8.842042  15.095968
583  7.537310   6.476780
569  6.961884  16.455123
571  6.904237  34.271241
572  6.224583  21.041747
606  6.135567  -0.857162
581  5.437048  17.934172

[[2]]
        Factor1   Factor2
338 -17.8843840 899.42807
379   3.1144936 117.91379
79  -19.1832143 109.18835
539  22.8106327  91.47826
59  -13.9358175  48.77925
74  -10.2725106  41.54875
113  -8.9311254  35.37131
571   6.9042366  34.27124
444   0.1627833  28.57266
75   -5.7843270  23.48737
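Although fsOrder() is provided by robustfa, the ordering it performs is easy to describe: each list component is the full score matrix sorted in decreasing order of one of its columns. A minimal base R sketch, assuming scores is a numeric matrix with row names (fsOrderSketch is a hypothetical stand-in, not the package source):

## Sort the score matrix in decreasing order of each column in turn;
## component j of the result is ordered by column j.
fsOrderSketch <- function(scores) {
  lapply(seq_len(ncol(scores)), function(j)
    scores[order(scores[, j], decreasing = TRUE), , drop = FALSE])
}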
4. Conclusions
We have presented an object-oriented solution for robust factor analysis built on the S4 class system of the programming environment R. The solution introduces the new S4 classes "Fa", "FaClassic", "FaRobust", "FaCov", and "SummaryFa". The classical factor analysis function FaClassic() and the robust factor analysis function FaCov() can both handle the maximum likelihood, principal component, and principal factor estimation methods. Taking the factor analysis method (classical or robust), the data input (the raw data or the scaled data), and the running matrix (covariance or correlation) together, there are eight combinations. For theoretical and computational reasons we recommend combination (4), classical factor analysis using the correlation matrix of the scaled data as the running matrix, and combination (8), robust factor analysis using the correlation matrix of the scaled data as the running matrix. The application of the solution to multivariate data analysis is demonstrated on the hbk and stock611 data sets, both of which are included in the package robustfa. A major design goal of the object-oriented solution was openness to extension: new robust factor analysis methods can be developed within robustfa itself or in other packages that depend on robustfa.
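To make the recommendation concrete, the two preferred combinations correspond to calls of roughly the following form. This is a sketch only: the argument names factors, cor, and cov.control follow the rrcov-style conventions that robustfa adopts and should be checked against the package documentation; CovControlOgk() is the OGK control object from rrcov.

## Combination (4): classical FA on the correlation matrix of the scaled data.
## Combination (8): robust (OGK-based) FA on the same running matrix.
library(robustfa)   # rrcov provides CovControlOgk()
xs <- scale(x)      # x: a numeric data matrix
fa4 <- FaClassic(x = xs, factors = 2, cor = TRUE)
fa8 <- FaCov(x = xs, factors = 2, cor = TRUE, cov.control = CovControlOgk())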
References
Chambers JM (1998). Programming with Data: A Guide to the S Language. Springer-Verlag,
New York.
Davies PL (1987). “Asymptotic Behavior of S-Estimators of Multivariate Location Parameters
and Dispersion Matrices.” The Annals of Statistics, 15, 1269–1292.
Donoho DL (1982). “Breakdown Properties of Multivariate Location Estimators.” Technical report, Harvard University, Boston. URL http://www-stat.stanford.edu/~donoho/Reports/Oldies/BPMLE.pdf.
Filzmoser P, Fritz H, Kalcher K (2013). pcaPP: Robust PCA by Projection Pursuit. R
package version 1.9-49, URL http://CRAN.R-project.org/package=pcaPP.
He X, Wang G (1997). “A Qualitative Robustness of S*-Estimators of Multivariate Location and Dispersion.” Statistica Neerlandica, 51, 257–268.
Johnson RA, Wichern DW (2007). Applied Multivariate Statistical Analysis. Pearson, New
Jersey. 6th edition.
Maronna RA, Martin D, Yohai VJ (2006). Robust Statistics: Theory and Methods. John
Wiley & Sons, New York.
Maronna RA, Yohai VJ (1995). “The Behavior of the Stahel-Donoho Robust Multivariate Estimator.” Journal of the American Statistical Association, 90, 330–341.
Maronna RA, Zamar RH (2002). “Robust Estimation of Location and Dispersion for High-Dimensional Datasets.” Technometrics, 44, 307–317.
OMG (2009a). “OMG Unified Modeling Language (OMG UML), Infrastructure, V2.2.” Current formally adopted specification, Object Management Group. URL http://www.omg.org/spec/UML/2.2/Infrastructure/PDF.
OMG (2009b). “OMG Unified Modeling Language (OMG UML), Superstructure, V2.2.” Current formally adopted specification, Object Management Group. URL http://www.omg.org/spec/UML/2.2/Superstructure/PDF.
Pison G, Rousseeuw PJ, Filzmoser P, Croux C (2003). “Robust Factor Analysis.” Journal of
Multivariate Analysis, 84, 145–172.
R Development Core Team (2009). R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria. ISBN 3-900051-07-0, URL
http://www.R-project.org/.
Robbins JE (1999). “Cognitive Support Features for Software Development Tools.” PhD thesis, University of California, Irvine. URL http://argouml.tigris.org/docs/robbins-dissertation/.
Robbins JE, Redmiles DF (2000). “Cognitive Support, UML Adherence, and XMI Interchange
in Argo/UML.” Information and Software Technology, 42, 79–89.
Rocke DM (1996). “Robustness Properties of S-Estimators of Multivariate Location and
Shape in High Dimension.” The Annals of Statistics, 24, 1327–1345.
Rousseeuw PJ (1985). “Multivariate Estimation with High Breakdown Point.” In W Grossmann, G Pflug, I Vincze, W Wertz (eds.), Mathematical Statistics and Applications Vol. B,
pp. 283–297. Reidel Publishing, Dordrecht.
Rousseeuw PJ, Croux C, Todorov V, Ruckstuhl A, Salibian-Barrera M, Verbeke T, Maechler M (2013). robustbase: Basic Robust Statistics. R package version 0.9-10, URL http://CRAN.R-project.org/package=robustbase.
Rousseeuw PJ, Driessen KV (1999). “A Fast Algorithm for the Minimum Covariance Determinant Estimator.” Technometrics, 41, 212–223.
Ruppert D (1992). “Computing S-Estimators for Regression and Multivariate Location/Dispersion.” Journal of Computational and Graphical Statistics, 1, 253–270.
Salibian-Barrera M, Yohai VJ (2006). “A Fast Algorithm for S-regression Estimates.” Journal
of Computational and Graphical Statistics, 15, 414–427.
Stahel WA (1981a). “Breakdown of Covariance Estimators.” Research Report 31, ETH Zurich.
Fachgruppe für Statistik.
Stahel WA (1981b). “Robuste Schätzungen: Infinitesimale Optimalität und Schätzungen von Kovarianzmatrizen.” PhD thesis no. 6881, Swiss Federal Institute of Technology (ETH), Zürich. URL http://e-collection.ethbib.ethz.ch/view/eth:21890.
Todorov V (2013). rrcov: Scalable Robust Estimators with High Breakdown Point. R package
version 1.3-4, URL http://CRAN.R-project.org/package=rrcov.
Todorov V, Filzmoser P (2009). “An Object-Oriented Framework for Robust Multivariate Analysis.” Journal of Statistical Software, 32(3), 1–47. URL http://www.jstatsoft.org/v32/i03/.
Wang JH, Zamar R, Marazzi A, Yohai VJ, Salibian-Barrera M, Maronna R, Zivot E, Rocke
D, Martin D, Konis K (2013). robust: Insightful Robust Library. R package version 0.4-15,
URL http://CRAN.R-project.org/package=robust.
Wang XM (2009). Applied Multivariate Analysis. Shanghai University of Finance & Economics Press, Shanghai. 3rd edition. (In Chinese).
Woodruff DL, Rocke DM (1994). “Computable Robust Estimation of Multivariate Location
and Shape in High Dimension Using Compound Estimators.” Journal of the American
Statistical Association, 89, 888–896.
Xue Y, Chen LP (2007). Statistical Modeling and R Software. Tsinghua University Press, Beijing. (In Chinese).
Zhang YY (2013). robustfa: An Object Oriented Solution for Robust Factor Analysis. R
package version 1.0-5, URL http://CRAN.R-project.org/package=robustfa.
Affiliation:
Ying-Ying Zhang
Department of Statistics and Actuarial Science
College of Mathematics and Statistics
Chongqing University
Chongqing, China
E-mail: [email protected]
URL: http://baike.baidu.com/view/7694173.htm