The use of Yanai`s Generalized Coefficient of Determination

13
The use of Yanai's Generalized
Coefficient of Determination to
reduce the number of variables in
DEA models
Maurício Benegas
FORTALEZA
AGOSTO
2016
UNIVERSIDADE FEDERAL DO CEARÁ
CURSO DE PÓS-GRADUAÇÃO EM ECONOMIA - CAEN
SÉRIE ESTUDOS ECONÔMICOS – CAEN
Nº 13
The use of Yanai's Generalized Coefficient of Determination to reduce
the number of variables in DEA models
FORTALEZA – CE
AGOSTO – 2016
THE USE OF YANAI’S GENERALIZED COEFFICIENT OF DETERMINATION TO REDUCE THE NUMBER
OF VARIABLES IN
DEA MODELS
Maurício Benegas
Graduate Program in Economics - CAEN/UFC
[email protected]
ABSTRACT
This paper proposes a novel method of reducing the number of inputs and outputs in DEA models. The
method is based on Yanai’s Generalized Coefficient of Determination and on the concept of pseudo-rank
of a matrix. It is also proposed a rule to determine the cardinality of the subset of selected variables so as
to optimize discretionary power without significant loss of information.
Keywords: DEA; Yanai's Generalized Coefficient of Determination; pseudo-rank.
JEL Codes: C6; C8
Série ESTUDOS ECONÔMICOS - CAEN
Nº 13
1 INTRODUCTION
DEA (Data Envelopment Analysis) model is a nonparametric method of estimating
production frontiers. DEA involves solving a linear programming (LP) problem to determine a
production frontier against which the technical efficiency of Decision Making Units (DMU) will be
calculated. A basic DEA model was originally proposed by Charnes, Cooper and Rhodes
(1978). That initial model is also known as “the CCR model”.
The basic CCR model proposes maximizing the ratio between a weighted sum of
outputs and a weighted sum of inputs. The weights of these sums are chosen according to
feasibility conditions and assuming a hypothesis of constant returns to scale. Charnes, Cooper
and Rhodes have transformed the fractional (primal) CCR model into a linear (dual) model,
usually referred to in the literature as the DEA 1 model.
The CCR model and its variants have been increasingly used since the first
description by Charnes, Cooper and Rhodes in 1978. Since then, researchers worldwide have
used the model as a tool to assess technical efficiency. The DEA model has become the
method of choice for this type of study, especially in the absence of an explicit production
function to define the relationship between inputs and outputs 2
One of the most frequent problems associated with the CCR model is the lack of
discrimination between DMUs when the number of inputs and outputs is very large in relation to
the number of DMUs. A large number of variables relative to the number of observations may
entail a large number of efficient DMUs in the sample – thus reducing the model’s ranking
capability. This is a characteristic of the CCR model: the lower the number of DMUs, the less
active the restrictions imposed on maximum efficiency multipliers.
Several alternatives have been proposed to increase the ranking capacity of the
CCR model 3, including the super-efficiency (Andersen and Petersen, 1993) and cross-efficiency
evaluation methods (Sexton et al., 1986; Doyle and Green, 1994; and Green et al., 1996). Other
methods that use additional information, usually characterized by adding restrictions, include
the cone-ratio and assurance-region approaches.
In the present work we propose a simple and objective method to address a
situation in which the original inputs and outputs have been correctly selected, but the low
number of observations has translated into low discrimination power. This approach proposes to
reduce the dimensions of inputs and outputs and, therefore, does not require additional
information. In addition, it does not require post-estimation procedures (as in the case of the
super-efficiency and cross-efficiency approaches). The approach we propose relies on
1
For details on this procedure it is recommended to consult the work of Cooper, Seiford and Tone (2006).
If there is evidence that inputs and outputs can be linked through a function, then the stochastic frontier model of production can be
used as an alternative to DEA. Please refer to the work of Kumbhakar and Lovell on models of stochastic frontier production models
(2000).
3
See Adler et al. (2002), Angulo-Meza and Lins (2002), Podinovski and Thanassoulis (2007) and Senra et al. (2007).
2
4
Agosto 2016
T HE USE OF Y ANAI’ S GENERALIZED COEFFICIENT OF D ETERMINATION TO REDUCE THE NUMBER OF VARIABLES IN DEA MODELS
multivariate statistical techniques, notably Principle Components Analysis, as on the use of a
correlation matrix. Again it should be underscored that the approach is not intended for
selection of inputs or outputs. Rather, it is a support tool to increase the discriminatory power
without significant loss of information – that is, discriminatory power is increased regardless of
the knowledge concerning which variables are essential for the model.
Following this Introduction, the work is divided as follows: Section 2 briefly reviews
multivariate variable selection in DEA; Section 3 introduces the proposed methodology; section
4 discusses applications of the proposed method to the CCR model; and finally, Section 5
presents conclusions and suggestions for future research.
2
VARIABLE SELECTION/REDUCTION IN DEA
One of the issues relating to the CCR model is that of the dimensions of inputs and outputs.
A consequence of using a large number of variables relative to the number of observations is
the loss of discrimination power due to the generation of a large number of efficient DMU´s.
There is no consensus on the optimal number of inputs and outputs to be used. However,
Cooper, Seiford, and Tone (2006) suggest the following rule: J:J>max{M x N, 3(M + N)}, where
M and N correspond to the number of outputs and inputs respectively.
One of the most common ways of selecting variables in DEA is the use of
correlation matrices for inputs and outputs. When two variables (inputs or outputs) are highly
correlated, one is discarded, usually on the basis of ad hoc criteria. However, eliminating one or
the other variable could have a dramatic impact on estimated efficiency (Dyson et al., 2001).
In recent years, the application of multivariate statistical methods, especially
Principal Components Analysis (PCA), has appeared as a satisfactory alternative for variable
reduction. The formulation of a DEA model in which inputs and outputs are summarized as
principal components is the focus of the work of Ueda and Hoshiai (1997) and Adler and Golany
(2002). One limitation of using PCs instead of inputs and outputs is that these “new” variables
may take a negative value, and therefore must be transformed for PCA. That transformation,
however, may impact results. One way of overcoming the problem is to use the additive model
of DEA (Ali and Seiford, 1990) that is invariant to translation of inputs and outputs. Pastor
(1996) has also shown that an input-oriented BCC model (after Banker, Charnes and Cooper,
1984) is invariant to output translation.
There is also a practical difficulty. Even if the problem of negative PCs is resolved,
the question remains of how to interpret the results in terms of the projection of inputs and
outputs (that is, predicting quantities). The fact is that the only satisfactory way of accomplishing
this is back transformation to original variables - which might require a considerable
computational effort. Some authors select variables based upon their contribution to PCs.
5
Série ESTUDOS ECONÔMICOS - CAEN
Nº 13
Specifically, the variables with the largest absolute linear combination coefficient are selected.
Because it is common for the first few components to explain most data variance, the result is a
considerably reduced subset of variables.
Jenkins and Anderson (2003) employ the method of partial covariance analysis to
identify the correlation between variables as well as the contribution of each variable to these
correlations. With this technique, the authors demonstrate that the removal of variables with
little contribution to the correlations does not significantly change the results.
The method we propose is based on the work of Cadima (2001) and Cadima and
Jollife (2001) and combines PCA with elements of the Jenkins and Anderson (2003) method. It
consists of generating a correlation matrix between two data sets – the orthogonal projection of
data onto the subspace generated by a subset of PCs and the orthogonal projection of data
onto the subspace generated by a subset of original variables. This matrix measure of the
closeness of the two subspaces is known as Yanai’s Generalized Coefficient of Determination
(GCD) (Yanai, 1974). The resulting subset of PCs (which usually includes PCs that explain
eighty to ninety percent of data variance) corresponds to the original variables that maximize
the GCD.
The number of PCs is determined prior to the generation of the correlation matrix,
estimating the pseudo-rank for the matrix, as proposed by Cadima (2001) based on the specific
geometric structure of the cone of positive semidefinite matrices.
In the following section a brief discussion of the geometrical structure of the cone of
positive semidefinite matrices and of PC analysis is presented. GCD is then defined.
3
THE PSEUDO-RANK OF A MATRIX AND YANAI’S GENERALIZED COEFFICIENT OF
DETERMINATION
This section briefly describes basic concepts described in detail in the work of Jolliffe (1986),
Cadima (2000), and Cadima and Jolliffe (2001).
Assuming C p
to be the cone of positive semidefinite matrices with dimension
p × p provided with the Frobenius inner product ⋅,⋅
A, B
F
F
: C p × C p → ℜ such that
= tr ( A' B) for any A, B ∈ C p
The norm induced by (1) will be denoted by
F
(1)
. For any matrix V ∈ C p , the set
Ray (V ) = { A ∈ C p ; A = λV , λ ≥ 0} is called the ray associated with V. The ray associated with
the identity matrix of dimension p is called the central ray of C p
6
Agosto 2016
T HE USE OF Y ANAI’ S GENERALIZED COEFFICIENT OF D ETERMINATION TO REDUCE THE NUMBER OF VARIABLES IN DEA MODELS
With the definitions above, it is possible to find the angle between the rays
associated with any two matrices A and B. This angle is given by the arc whose cosine is
cos( A, B) =
A, B
A
F
B
F
(2)
F
In Cadima (2001), the author demonstrates that C p has a layered structure, with
containing several cones fitted inside the other:
Figure 1. - Cone of positive semidefinite matrices
Based on this observation, the author argues that the region close to the central ray
contains only matrices of full rank, or at least with rank p − 1 (which can occur at the boundary
of such regions). The farther away from the central ray the lower the rank of matrices.
However, matrices of full rank are also found outside of the core. Because may have
eigenvalues close to zero, they behave as low-rank matrices. The question then is how far (into
the core of the cone) does one have to move to avoid such matrices? The answer to this
question lies in the concept of the pseudo-rank of a matrix. According to Cadima (2001), the
pseudo-rank of a V ∈ C p matrix is the smallest integer k * ≤ p such that
cos(V , I p ) ≤
k*
p
(3)
7
Série ESTUDOS ECONÔMICOS - CAEN
Nº 13
The author shows that this value is given by
 tr (V ) 2 
k* = 
2 
 tr (V ) 
(4)
where ⌈z⌉ is the nearest to z greater integer. Letting V the covariance/correlation matrix of a
−1
data matrix A, with p variables and n observations, or V = n A' A . In this case the pseudo-rank
of V corresponds to the number of components to be used as representative of all accumulated
variance associated with A.
3.1. Yanai’s Generalized Coefficient of Determination
Let A represent a data matrix with dimension (n×p), where p indicates the number of
variables and n the number of observations for each variable. In this context, A may refer to a
matrix of discretionary/nondiscretionary outputs/inputs. It is important to keep in mind that in
DEA models, the number of observations indicates the number of columns in the pertinent
matrices. Thus, for the implementation of the proposed method, matrix A should be considered
the transposed output/input matrix.
−1
Given the covariance matrix/correlation of data S = n A' A , let Λ and P be the
diagonal matrix of eigenvalues (arranged in decreasing order) and the matrix of normalized
eigenvectors of S respectively. The PCs are the columns of the matrix (n×p) given by C = AP
Using the spectral decomposition of S, it is easy to show that the covariance matrix of C is
exactly Λ , so that the variables in C are uncorrelated. For this reason the AP transformation is
sometimes called data “decorrelation.”
Consider K to be a subset of indices associated with k ≤ p PCs arranged in
decreasing order of eigenvalues (in general the first k’s). Similarly, let Q be defined as a subset
of indexes associated with the q ≤ p original variables. The sets K and Q are the subspaces
generated by vectors with indices K and Q respectively. The following matrices are then
defined:
8
•
AK is the submatrix of A in which columns with indexes in K are maintained;
•
S K = n −1 AK' AK is the covariance matrix associated with AK ;
•
Λ K is the matrix of the eigenvalues associated with S K ;
•
PK is the matrix of eigenvectors associated with eigenvalues in Λ K ;
Agosto 2016
T HE USE OF Y ANAI’ S GENERALIZED COEFFICIENT OF D ETERMINATION TO REDUCE THE NUMBER OF VARIABLES IN DEA MODELS
Let us assume PK to be the matrix of orthogonal projection on the subspace K such
that
PK = n −1 AS K−1 A'
where S K−1 is the Moore-Penrose generalized inverse of S K . Similarly, PQ is the matrix of
orthogonal projection on the subspace Q defined as:
PQ = n −1 AI Q S Q−1 I Q A'
where I Q is the identity matrix of the submatrix obtained by selecting the q columns with indices
in Q and S Q = n −1 I Q A' AI Q .
Given the definitions above, Yanai’s GCD between subspaces Q and K is defined as:
GCD(Q, K ) =
PQ , PK
PQ
F
PK
F
(5)
F
Supposing that K = {1,2,..., k * } remains fixed, where k
*
is the pseudo-rank of the
covariance matrix, in this case the selection of variables exhibiting the greatest contribution to
~
the principal components selected in K is the set of indices Q such that
~
Q = arg max GCD (Q, K )
Q
4 REDUCED CCR MODEL 4
A practical example of the proposed method is provided in this section. For that, the CCR
model will be presented with the original variables (hereafter called the “general model” and
denoted by CCRg), and with the subset of selected variables (“reduced model,” denoted by
CCRr). For the sake of simplicity, only product-oriented models will be discussed below.
N
Let us suppose that there are J DMUs under study, each using a vector x ∈ ℜ + of
M
inputs to produce a vector y ∈ ℜ + of outputs with a technology defined by
4
The use of the CCR model is an example. The method can actually be applied to any DEA model.
9
Série ESTUDOS ECONÔMICOS - CAEN
Nº 13
TCCR = {( x, y ); x ≥ Xλ , y ≤ Yλ , λ ≥ 0}
g
where X ( N× J ) , Y( M × J ) and λ( J ×1) are input and output matrices and the vector of intensities
respectively.
Given a DMU j, its technical efficiency in the model CCRg, denoted by ET jg , is
estimated by solving the following linear programming problem (the minimization of slacks is
omitted for simplicity)
max θ
 θ ,λ
g
ET j = subject to
(x , θy ) ∈ T
j
CCR g
 j
(6)
Let us now suppose that a set of variables was selected following the procedure
presented in Section 3. For simplicity’s sake, let us assume that only outputs were selected. Let
us denote by y qj the vector product of a DMU j = 1,2,…J , with selection of q < N outputs, and
let us denote by Y q the respective reduced output matrix. The technology in the CCRr model is
defined as
TCCR = {( x, y q ); x ≥ Xλ , y q ≤ Y q λ , λ ≥ 0}
g
Given a DMU j, its technical efficiency in the model CCRr, denoted by ET jr , is
estimated by solving the following linear programming problem
max θ
 θ ,λ
r
ET j = subject to
(x , θy q ) ∈ T
j
CCRr
 j
(7)
The reduced model is obtained in three stages, as summarized below:
10
Agosto 2016
T HE USE OF Y ANAI’ S GENERALIZED COEFFICIENT OF D ETERMINATION TO REDUCE THE NUMBER OF VARIABLES IN DEA MODELS
Figure 2 - Steps for Obtaining the Reduced Model
First Stage
Selection of the number of principal components through the pseudo-rank
covariance (or correlation) matrix of original data
Second Stage
Having selected the number of PCs in the first stage, the variables of the
subset of cardinality k that maximizes GCD are selected
Third Stage
The subset of variables of cardinality k is used to obtain technical
efficiency estimates through a (reduced) DEA model
One issue not covered by the procedure presented above is that of defining the
cardinality of the subset of selected variables. One suggestion would be to combine the gain in
discrimination with some measure that would reveal loss of information. The gain in discrimination
would be obtained by the difference between the percentage of efficient DMUs in the general
model and in the reduced model. The complexity of this issue relates to what measure of
informational loss should be used. One possibility is to use the Kolmogorov-Smirnov statistic,
which quantifies the difference between distributions. In this context, this statistic is given by
KS (q ) = sup F ( x) − Fq ( x)
x
where F and Fq are the empirical cumulative distribution functions of technical efficiency
estimated by the general and reduced models (the latter with a subset q of selected variables)
respectively.
Let K * and K q* be the number of efficient DMUs in the general and reduced
models respectively; let δ q = ( K * − K q* ) / K such that δ q represents, in proportional terms, the
gain in discrimination power of the reduced model in relation to the general model. Then the
optimal cardinality would be given by q * such that
 δq
2
; KS (q ) ≤ 1.36
q * ∈ arg max 

K
q
 KS (q )
(8)
In which KS (q ) ≤ 1.36 2 / K is included so that optimal cardinality will depend on acceptance
of the null hypothesis of the Kolmogorov-Smirnov test. The amount 1.36 2 / K represents the
11
Série ESTUDOS ECONÔMICOS - CAEN
Nº 13
nullity condition of the Kolmogorov-Smirnov test with a significance level of 0.05, such that if
KS (q ) > 1.36 2 / K
the null hypothesis of equality between the distributions is rejected. It
should be noted that 1.36 2 / K is generally valid for K ≥ 8 , otherwise it is necessary to
consult tabulated values.
5
If there are multiple solutions to (8) the lowest maximize q * is selected, so that

 δq
2 
; KS (q ) ≤ 1.36
q * = min qˆ ∈ arg max 

qˆ 
K 
q
 KS (q )

(9)
4.1. Example
This example employs real-world data previously described by Benegas and Silva (2010),
who examined the technical efficiency of the public health care system in Brazil.
The database uses one input (x), represented by the annual per capita expenditure
on health of the three levels of government, and 12 outputs (yi, i = 1,...12), , representing health
indicators available in the Ministry of Health’s Information System DATASUS.
6
For the present example, to reduce the power of discrimination, only 12 of the 27 states
studied by Benegas and Silva (2010) were selected. The data and some of the descriptive statistics
are shown in Table 1 (Appendix Table A1 provides a description of the variables used in the example).
Table 1 - Data of example
State
x
y1
y2
y3
y4
y5
y6
y7
y8
y9
y10
y11
y12
RO
AC
AM
RR
PA
AP
TO
MA
PI
CE
RN
PB
Max
Min
Mean
SE
387.5
556.6
513.4
674.5
263.8
512.5
503.8
285.3
315.8
291.4
405
335.1
674.5
263.8
420.4
130.1
68.2
68.6
68.4
67.2
68.8
66.2
68.8
63.4
65.6
65.7
66.3
65.2
68.8
63.4
66.9
1.73
73.8
73.8
74.4
72.1
74.7
74.1
73.3
71.3
71.7
74.4
74.1
72.2
74.7
71.3
73.3
1.18
70.9
71.1
71.3
69.6
71.7
70.1
71
67.2
68.6
69.9
70.1
68.6
71.7
67.2
70
1.33
80
71
78
83
76
79
78
69
73
74
69
68
83
68
75
4.8
0.8
0.8
0.9
1.1
0.8
0.8
1.1
0.6
0.8
0.9
1.2
1.1
1.2
0.6
0.9
0.2
1.7
2
1.5
1.6
1.6
1.5
1.8
2.4
2.5
1.9
2.3
2.7
2.7
1.5
2
0.4
111.2
86.5
106.3
93.13
116.6
94.12
99.57
108.2
101.9
108.3
98.87
105.7
116.6
86.5
102.5
8.529
106.9
83.64
93.88
88.62
112.3
96.3
107.2
106.6
102.1
108.3
95.45
103.6
112.3
83.64
100.4
8.772
107.5
110.5
131.1
114.1
144.5
126.7
110.1
135.6
106
113.3
108.2
117.8
144.5
106
118.8
12.64
108.3
90.14
92.36
89.85
119.8
98.17
107.4
111.9
104.6
112.3
94.61
104
119.8
89.85
102.8
9.732
47.6
40.3
57.1
71.9
55
28.2
21
50.1
61.7
40.8
45.1
48.5
71.9
21
47.3
13.9
70
67.4
72.8
79.3
76.9
90.5
69.9
59
49.6
71.3
82.3
74.7
90.5
49.6
72
10.6
Source: Benegas and Silva (2010)
5
6
See Gibbons and Chakraborti (2003) for details.
See site www2.datasus.gov.br.
12
Agosto 2016
T HE USE OF Y ANAI’ S GENERALIZED COEFFICIENT OF D ETERMINATION TO REDUCE THE NUMBER OF VARIABLES IN DEA MODELS
In this example, the Kolmogorov-Smirnov null hypothesis is accepted with a
significance level of 0.05 for a value up to 0.5552, such that the condition to select the
cardinality of the subset of selected outputs becomes

 δq

q * = min qˆ ∈ arg max 
; KS (q ) ≤ 0.5552
qˆ 
q
 KS (q )


Table 2 shows the results obtained by applying the proposed variable selection
method to the output matrix. Variables were selected into subsets with cardinality from 1 to 10.
Using the pseudo-rank of the output covariance matrix (eq. (4)), the first four PCs were used.
These for PCs accounted for approximately 84% of the total variability of the sample.
Table 2 - Results of general and reduced models *
State
GM
RM1
RM2
RM3
RM4
RM5
RM6
RM7
RM8
RM9
RM10
RO
AC
AM
RR
PA
AP
TO
MA
PI
CE
RN
PB
0.71
0.51
0.59
0.53
1.00
0.61
0.64
1.00
1.00
1.00
0.88
1.00
-
0.62
0.42
0.49
0.40
1.00
0.61
0.48
0.71
0.54
0.84
0.70
0.76
0.33
0.42
0.80
0.65
0.42
0.49
0.40
1.00
0.61
0.48
0.86
0.73
0.84
0.70
0.76
0.33
0.42
0.80
0.67
0.47
0.51
0.40
1.00
0.61
0.52
0.87
0.80
0.88
0.70
0.76
0.33
0.42
0.80
0.69
0.49
0.57
0.48
1.00
0.61
0.63
0.87
0.85
1.00
0.88
1.00
0.17
0.17
1.00
0.69
0.49
0.59
0.53
1.00
0.61
0.63
0.87
0.94
1.00
0.88
1.00
0.17
0.17
1.00
0.71
0.49
0.59
0.53
1.00
0.61
0.64
0.88
0.94
1.00
0.88
1.00
0.17
0.17
1.00
0.71
0.49
0.59
0.53
1.00
0.61
0.64
0.88
0.94
1.00
0.88
1.00
0.17
0.17
1.00
0.71
0.49
0.59
0.53
1.00
0.61
0.64
0.87
0.94
1.00
0.88
1.00
0.17
0.17
1.00
0.71
0.49
0.59
0.53
1.00
0.61
0.64
0.87
0.94
1.00
0.88
1.00
0.17
0.17
1.00
0.71
0.49
0.59
0.53
1.00
0.61
0.64
0.87
0.94
1.00
0.88
1.00
0.17
0.17
1.00
δq
KS(q)
δq / KS(q)
Source: Author estimates
*The initials GM and RMq refer to general and reduced models with cardinality q
{
}
Note that in this example, arg max δ q / KS (q ); KS (q ) ≤ 0.5552 = {4,5,6,7,8,9,10}
q
and therefore q * = 4 . The last part of the example appears in Figure 3, which shows the
estimated densities for the general and reduced models with cardinality from 1 to 8. The
procedure to estimate the densities uses the Gaussian kernel. The bandwidth was selected
minimizing the mean integrated square error. 7
7
See Silverman, B. W. (1986) for details.
13
Série ESTUDOS ECONÔMICOS - CAEN
Nº 13
Figure 3 - Estimated densities for TE in general and reduced models
1.6
2
1.8
1.4
1.6
1.2
1.4
1
1.2
0.8
1
0.8
0.6
0.6
0.4
0.4
0.2
0.2
MG
0
0
10
20
30
40
MR1
50
60
MG
70
80
90
100
0
0
10
20
30
40
0
10
20
30
40
0
10
20
30
40
0
10
20
30
40
MR2
50
60
70
80
90
100
70
80
90
100
70
80
90
100
70
80
90
100
1.4
1.6
1.4
1.2
1.2
1
1
0.8
0.8
0.6
0.6
0.4
0.4
0.2
0.2
MG
0
0
10
20
30
40
MR3
50
60
MG
70
80
90
100
0
1.4
1.4
1.2
1.2
1
1
0.8
0.8
0.6
0.6
0.4
0.4
0.2
60
0.2
MG
0
MR4
50
0
10
20
30
40
MR5
50
60
MG
70
80
90
100
0
60
1.4
1.4
1.2
1.2
1
1
0.8
0.8
0.6
0.6
0.4
0.4
0.2
0.2
MR7
MG
0
MR6
50
0
10
20
30
40
50
60
MG
70
80
90
100
0
MR8
50
60
The results shown in Figure 3 suggest that in the presence of four selected
variables, the inclusion of an additional variable does not have a significant impact on the
comparison between the general model and the reduced model. This observation confirms the
conclusions obtained by applying the method of subset cardinality selection suggested by
equation (9).
5
CONCLUSIONS AND FURTHER RESEARCH
This paper proposes a method for reducing the dimension of input/output matrices used to
estimate production frontiers through the CCR model (or its variants). The method is based on
Yanai’s Generalized Coefficient of Determination (GCD) and on the concept of pseudo-rank of a
matrix. Additionally, a rule is suggested to support the choice of cardinality of the subset of
14
Agosto 2016
T HE USE OF Y ANAI’ S GENERALIZED COEFFICIENT OF D ETERMINATION TO REDUCE THE NUMBER OF VARIABLES IN DEA MODELS
selected variables. This rule seeks to combine maximum gain in discriminatory power with
minimal loss of information.
Through an example that employs real-world data, it was found that the pseudo-rank
of the output correlation matrix indicates that the first four PCs should be maintained. The GCD
for output subsets with cardinality from 1 to 10 was then calculated. The cardinality rule
indicated the subset with four of the 12 original outputs. Finally, the estimation of densities of
the general and reduced models suggested that the cardinality decision rule can support
decisions concerning the number of variables required in the model to obtain maximum
discrimination with minimal loss of information.
Further research is suggested on the cardinality decision rule taking into
consideration various measures of loss of information. Also warranted are studies comparing
the proposed method with other methods of summarization and selection.
References
ADLER, N. & GOLANY, B. (2001). Evaluation of deregulated airline networks using Data
Envelopment Analysis combined with principal component Analysis with an Application to
Western Europe. European Journal of Operational Research, 132, p. 260-273.
ADLER, N & GOLANY, B. (2002). Including principal component weights to improve
discrimination in Data Envelopment Analysis. Journal of the Operational Research Society,
53, p. 985-991
ADLER, N & YAZHEMSKY, E. (2010). Improving discrimination in Data Envelopment Analysis:
PCA–DEA or variable reduction. European Journal of Operational Research, 202(1), p. 27384.
ALI, A.I. & SEIFORD, L.M. (1990). Translation invariance in Data Envelopment Analysis.
Operations Research Letters, 9, p. 403-405.
ÂNGULO-MEZA, L. & LINS, M.P.E. (2002). Review of methods for increasing discrimination in
Data Envelopment Analysis. Annals of Operations Research, 116, p. 225-42.
BENEGAS, M. & SILVA, F.G. (2010). Estimação da eficiência técnica do SUS nos Estados
Brasileiros na presença de variáveis contextuais. Texto para Discussão, CAEN-UFC.
CADIMA, J.F.L. & JOLLIFE, I.T. (2001). Variable selection and the interpretation of principal
subspaces. Journal of Agricultural, Biological and Environmental Statistics, 6(1), p. 62-79.
CADIMA, J.F.L. (2001). Redução de dimensionalidade através duma análise em componentes
principais: um critério para o número de componentes principais a reter. Revista de Estatística
(INE), 1, p. 37-49.
CHARNES, A.; COOPER, W.W. & RHODES, E. (1978). Measuring the efficiency of decision
making units. European Journal of Operational Research, 2(6) November, p. 429-444.
15
Série ESTUDOS ECONÔMICOS - CAEN
Nº 13
DYSON, R.G.; ALLEN, R.; CAMANHO, A.S.; PODINOVSKI, V.V.; SARRICO, C.S. & SHALE,
E.A. (2001). Pitfalls and protocols in DEA. European Journal of Operational Research, 132,
p. 245-259.
GIBBONS, J.D. & CHAKRABORTI, S. (2003). Nonparametric Statistical Inference, 4th Ed.,
CRC Press, London.
JENKINS, L. & ANDERSON, M. (2003). A multivariate statistical approach to reducing the
number of variables in Data Envelopment Analysis. European Journal of Operational
Research, 147, p. 51-61
JOLLIFE, I.T. (1986). Principal Component Analysis, Springer-Verlag, New York, USA.
KUMBHAKAR, S.C. & LOVELL, C.A.K. (2000). Stochastic Frontier Analysis. Cambridge
University Press, Cambridge.
PASTOR, J. (1996). Translation invariance in Data Envelopment Analysis: a generalization.
Annals of Operations Research, 66, p. 93-102.
SENRA, L.F.A.C.; NANCI, L.C.; SOARES DE MELO, J.C.B. & ANGULO-MEZA, L. (2007).
Estudo sobre métodos de seleção de variáveis em DEA. Pesquisa Operacional, 27(2), p. 191207
SILVERMAN, B.W. (1986). Density Estimation for Statistics and Data Analysis, Chapman
and Hall, London.
UEDA, T. & HOSHIAI, Y. (1997). Application of principal component analysis for parsimonious
summarization of DEA inputs and/or outputs. Journal of Operational Research Society of
Japan, 40, p. 466-478.
YANAI, H. (1974). Unification of various techniques of multivariate analysis by means of
Generalized Coefficient of Determination (GCD), Journal of Behaviormetrics, 1, p. 45-54.
16
Agosto 2016
T HE USE OF Y ANAI’ S GENERALIZED COEFFICIENT OF D ETERMINATION TO REDUCE THE NUMBER OF VARIABLES IN DEA MODELS
Appendix
Table A1: Input and outputs used in example
Product
Description
x
Public spending per capita on health
y1
Male life expectancy
y2
Female life expectancy
y3
Combined life expectancy
y4
Survival rate (%)
y5
Physicians per 1000 inhabitants
y6
Hospital beds per 1000 inhabitants
y7
Coverage MMR (%)
y8
Tetravalent vaccine coverage (%)
y9
BCG vaccination coverage (%)
y10
Polio vaccine coverage (%)
y11
Sanitation coverage (%)
y12
Garbage collection coverage (%)
Source: Benegas and Silva (2010)
17
This paper proposes a novel method of
reducing the number of inputs and outputs in DEA
models. The method is based on Yanai's Generalized
Coefficient of Determination and on the concept of
pseudo-rank of a matrix. It is also proposed a rule to
determine the cardinality of the subset of selected
variables so as to optimize discretionary power
without significative loss of information.
Av. da Universidade, 2700 Benfica Fone/Fax:(085) 3366.7751
Cep: 60.020-181 Fortaleza - CE - Brasil
http://www.caen.ufc.br
e-mail: [email protected]