
Theoretical Note: Bounds on Variances of Estimators for
Multinomial Processing Tree Models¹
Pierre Baldi
Department of Information and Computer Science
California Institute for Telecommunications and Information Technology
University of California, Irvine
Irvine, CA 92697-3425
[email protected]
(949) 824-5809
(949) 824-4056 (FAX)
William H. Batchelder
School of Social Sciences
Institute for Mathematical Behavioral Sciences
University of California, Irvine
Irvine, CA 92697-5100
[email protected]
(949) 824-7271
(949) 824-2307 (FAX)
Running title: Bounds on Variances of Estimators
¹ Copy communication to both authors.
ABSTRACT
When there are order constraints among the parameters of a binary multinomial
processing tree (MPT) model, methods have been developed for reparameterizing the
constrained MPT model into an equivalent unconstrained one. This note provides a theorem
that is useful for computing bounds on the estimator variances for the parameters of the
constrained model in terms of the estimator variances for the parameters of the
unconstrained model. In particular, we show that if X and Y are random variables taking
values in [0, 1], then

    Var[XY] ≤ 2(Var[X] + Var[Y]).
Key words: Multinomial processing tree models, parametric order constraints, estimator
variances, reparameterization.
1. Introduction
This note provides some inequalities concerning random variables taking values in
[0, 1]. The inequalities are used to compute constraints on the variances of estimators of
parameters for binary multinomial processing tree (hereafter called MPT) models (see
Batchelder & Riefer, 1999, for discussion of MPT models). In particular, Knapp and Batchelder
(2001) provide general methods of accommodating parametric order constraints of the form
    0 ≤ θ_N ≤ … ≤ θ_1 ≤ 1        (1)
in MPT models. Their solution is to reparameterize the MPT into an equivalent MPT without
any order constraints, that is, the new parameters are functionally independent and each is free
to vary in [0, 1]. Standard procedures based on a special application of the EM (expectation
maximization) algorithm have been developed to obtain maximum likelihood estimates (MLEs)
of the new parameters as well as estimates of their variances (Hu & Batchelder, 1994; Hu &
Phillips, 1999). It is straightforward to use the MLEs of the new model to obtain MLEs of the
original constrained parameters; however, there is no general procedure for using the variances
of the estimators of the new parameters to obtain variances of the MLEs of the constrained
parameters. Our results provide some useful inequalities for bounding the variances of the
estimators of the constrained parameters.
Following the introduction, this note is organized into four main sections. First, we
review the definition of MPT models, and second, we review some of the reparametrization
methods in Knapp and Batchelder (2001). Third, we provide our inequality results and, finally,
we apply them to bound estimator variances of parameters in MPT models with parametric
order constraints.
2. Binary Multinomial Processing Tree Models
In its simplest form, a (binary) MPT model is defined by a set of categories
C = {C_1, …, C_J}, a directed rooted tree T, and a set of parameters Θ = {θ_1, …, θ_S}.
The parameters are functionally independent, each with full range in [0, 1], and to each
internal vertex v of T is associated a parameter θ_{s(v)} ∈ Θ, where θ_{s(v)} and
1 − θ_{s(v)} represent the transition probabilities to the two children of v. The same
parameter can be associated with several vertices of the tree, and the tree has a single
root.
Each leaf of the tree is associated with a category C in C. If β represents a branch
(path) of the tree ending in a leaf associated with category C, we have

    Pr(β) = ∏_s θ_s^{n(β,s)} (1 − θ_s)^{m(β,s)},        (2)

where n(β, s) and m(β, s) are, respectively, the numbers of times the parameters θ_s and
(1 − θ_s) appear along branch β. In other words, the probability of choosing or observing
category C in conjunction with path β is given by the product of the transition
probability parameters along β. The probability of category C_j is then the sum over all
branches that lead to it, that is,

    P(C_j) = ∑_{β : C_j} P(β),        (3)

where "β : C_j" denotes the set of all branches associated with category C_j, j = 1, …, J.
The data D = (N_1, N_2, …, N_J) consist of the counts N_j recording how many times
category C_j is observed. The multinomial data likelihood is

    P(D) = N! ∏_{j=1}^{J} P(C_j)^{N_j} / N_j!,        (4)

where N = ∑_j N_j. A number of algorithms are available to estimate the parameters θ_s
from D, including maximizing the likelihood or suitable posterior functions using gradient
descent or the EM algorithm (Hu & Batchelder, 1994), either on-line (example by example)
or off-line.
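As an illustration of Eqs. 2-4, the following Python sketch computes category
probabilities and the multinomial log-likelihood for a hypothetical three-category tree
with two parameters; the tree structure and all numerical values are invented for this
example and are not taken from the literature.

    import math

    # Hypothetical binary MPT with parameters theta = (theta_1, theta_2) and
    # three categories. A branch is a list of (parameter index, outcome) pairs;
    # outcome True contributes theta_s, outcome False contributes (1 - theta_s).
    BRANCHES = {
        0: [[(0, True), (1, True)]],    # C_1: theta_1 * theta_2
        1: [[(0, True), (1, False)]],   # C_2: theta_1 * (1 - theta_2)
        2: [[(0, False)]],              # C_3: 1 - theta_1
    }

    def branch_prob(branch, theta):
        """Eq. 2: product of transition probabilities along one branch."""
        p = 1.0
        for s, success in branch:
            p *= theta[s] if success else 1.0 - theta[s]
        return p

    def category_probs(theta):
        """Eq. 3: sum the probabilities of all branches leading to each category."""
        return [sum(branch_prob(b, theta) for b in BRANCHES[j])
                for j in sorted(BRANCHES)]

    def log_likelihood(counts, theta):
        """Logarithm of the multinomial likelihood in Eq. 4."""
        ll = math.lgamma(sum(counts) + 1)
        for n_j, p_j in zip(counts, category_probs(theta)):
            ll += n_j * math.log(p_j) - math.lgamma(n_j + 1)
        return ll

    print(category_probs([0.7, 0.4]))              # [0.28, 0.42, 0.30], sums to 1
    print(log_likelihood([30, 45, 25], [0.7, 0.4]))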
3. Representing Parametric Order Constraints
There are a number of situations where it is reasonable to apply an MPT model to data
with constraints among the S parameters. Knapp and Batchelder (2001) analyze the example of
a multi-trial experiment wherein a particular parameter θ is constrained to be
non-increasing (or non-decreasing) over successive experimental trials, for example, as
depicted in Eq. 1. They
discuss several ways to reparameterize the model to reflect the order constraints where the
reparameterized model has an equal number of parameters, can be represented as an MPT
model (with no parameter constraints), and is equivalent to the original model with the order
constraints. These methods are applied to two data sets in Riefer et al. (2002).
One example of a reparameterization that handles Eq. 1 is to define new parameters
0 ≤ β_i ≤ 1, for i = 1, …, N, and satisfy Eq. 1 by

    θ_s = ∏_{i=1}^{s} β_i,        (5)

where β_1 = θ_1. Equation 5 defines a one-to-one transformation of
Θ = {θ_i | 1 ≥ θ_1 ≥ … ≥ θ_N ≥ 0} onto the subset of B = {β_i | 0 ≤ β_i ≤ 1, i = 1, …, N}
in which β_i = 0 implies β_{i+k} = 0, k = 1, …, (N − i). The inverse is given by β_1 = θ_1
and

    β_i = θ_i / θ_{i−1}   if θ_{i−1} > 0,
    β_i = 0               if θ_{i−1} = 0,        (6)

for i = 2, …, N.
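A minimal Python sketch of the transformation in Eqs. 5 and 6 (our illustration; the
numerical values are arbitrary):

    from itertools import accumulate
    import operator

    def beta_to_theta(beta):
        """Eq. 5: theta_s = prod_{i=1}^{s} beta_i (cumulative products)."""
        return list(accumulate(beta, operator.mul))

    def theta_to_beta(theta):
        """Eq. 6: beta_1 = theta_1; beta_i = theta_i / theta_{i-1}, or 0 when
        theta_{i-1} = 0 (in which case all later thetas are 0 as well)."""
        beta = [theta[0]]
        for prev, cur in zip(theta, theta[1:]):
            beta.append(cur / prev if prev > 0 else 0.0)
        return beta

    beta = [0.9, 0.5, 0.8]
    theta = beta_to_theta(beta)       # [0.9, 0.45, 0.36]: ordered as in Eq. 1
    assert all(abs(b - r) < 1e-12 for b, r in zip(beta, theta_to_beta(theta)))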
Knapp and Batchelder (2001) show that if Θ is a subset of the parameters of an MPT
model subject to Eq. 1, then a new MPT model, with B replacing Θ, can be constructed. The
practical thrust of this result is that the new model without order constraints can be
analyzed with the EM algorithm implemented in Hu and Phillips (1999), yielding MLEs β̂_i,
and the MLEs of the original model with constraints are then given through Eq. 5 by
replacing the β_i by the β̂_i. Further, if the original model with order constraints is
identified, then there will be a unique MLE of the new model that leads via Eq. 5 to a
unique MLE of the constrained model that satisfies the order constraints.
Knapp and Batchelder (2001) provide other reparameterizations with properties similar
to those of Eq. 5. For example, another reparameterization method is obtained by noting
from Eq. 1 that

    0 ≤ 1 − θ_1 ≤ … ≤ 1 − θ_N ≤ 1.        (7)

Then one can introduce 0 ≤ λ_i ≤ 1, for i = 1, …, N, and satisfy Eq. 1 by

    θ_s = 1 − ∏_{i=s}^{N} λ_i,        (8)

where λ_N = 1 − θ_N. Equation 8 has an inverse like that of Eq. 5, given by λ_N = 1 − θ_N
and

    λ_i = (1 − θ_i) / (1 − θ_{i+1})   if θ_{i+1} < 1,
    λ_i = 0                           if θ_{i+1} = 1,        (9)

for i = 1, …, N − 1.
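The complementary transformation of Eqs. 8 and 9 admits the same kind of sketch, built on
suffix products of the λ_i (again our illustration):

    def lam_to_theta(lam):
        """Eq. 8: theta_s = 1 - prod_{i=s}^{N} lambda_i (suffix products)."""
        theta, prod = [], 1.0
        for l in reversed(lam):
            prod *= l
            theta.append(1.0 - prod)
        theta.reverse()
        return theta

    def theta_to_lam(theta):
        """Eq. 9: lambda_N = 1 - theta_N; for i < N,
        lambda_i = (1 - theta_i)/(1 - theta_{i+1}), or 0 when theta_{i+1} = 1."""
        lam = []
        for i in range(len(theta) - 1):
            denom = 1.0 - theta[i + 1]
            lam.append((1.0 - theta[i]) / denom if denom > 0 else 0.0)
        lam.append(1.0 - theta[-1])
        return lam

    theta = [0.9, 0.45, 0.36]
    recovered = lam_to_theta(theta_to_lam(theta))
    assert all(abs(a - b) < 1e-12 for a, b in zip(theta, recovered))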
4. An Inequality
Theorem: Let X and Y be two random variables taking values in the interval [0, 1]. Then

    Var[XY] ≤ Var[X] + Var[Y] + 2√(Var[X] Var[Y]) ≤ 2(Var[X] + Var[Y]).        (10)

Proof: Consider a pair (X′, Y′) of auxiliary random variables, independent of, and
distributed identically to, (X, Y). Note immediately that Var[X] = (1/2) E[(X − X′)²].
Then, similarly,

    Var[XY] = (1/2) E[(XY − X′Y′)²]
            = (1/2) E[(XY − XY′ + XY′ − X′Y′)²] = (1/2) E[(X(Y − Y′) + Y′(X − X′))²]
            = (1/2) (E[X²(Y − Y′)²] + E[Y′²(X − X′)²] + 2E[XY′(X − X′)(Y − Y′)]).        (11)

Now, using the fact that 0 ≤ X, Y′ ≤ 1, we get

    Var[XY] ≤ (1/2) (E[(Y − Y′)²] + E[(X − X′)²] + 2E[|X − X′| |Y − Y′|])
            = Var[X] + Var[Y] + E[|X − X′| |Y − Y′|].        (12)

This inequality is in general strict unless X and Y are equal to 1 with probability 1.
Applying Schwarz's inequality (Billingsley, 1995) yields

    E[|X − X′| |Y − Y′|] ≤ √(E[(X − X′)²] E[(Y − Y′)²]) = 2√(Var[X] Var[Y]).        (13)

Finally, the arithmetic-geometric mean inequality gives
2√(Var[X] Var[Y]) ≤ Var[X] + Var[Y]. Thus,

    Var[XY] ≤ 2(Var[X] + Var[Y]).        (14)
As an example, suppose X has a two-dimensional Dirichlet (Beta) distribution with
parameters 2 and 3, and Y has a similar distribution with parameters 4 and 6. Then
E(X) = 0.4 = E(Y), Var(X) = 0.040, and Var(Y) ≈ 0.022. Using the theorem,
Var(XY) ≤ 2(0.040 + 0.022) ≈ 0.124.
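These figures are easy to check by simulation, for example with the following Python
sketch (ours):

    import random

    random.seed(0)
    n = 200_000
    xs = [random.betavariate(2, 3) for _ in range(n)]
    ys = [random.betavariate(4, 6) for _ in range(n)]

    def var(v):
        m = sum(v) / len(v)
        return sum((x - m) ** 2 for x in v) / len(v)

    vx, vy = var(xs), var(ys)                    # about 0.040 and 0.022
    vxy = var([x * y for x, y in zip(xs, ys)])   # about 0.011 for this example
    print(vx, vy, vxy, 2 * (vx + vy))            # the bound of Eq. 14, about 0.124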
The bound in Eq. 14 is general and simple; however, it is not always tight (the bound in
Eq. 13 is, of course, tighter). For example, when X = 0 with probability 1, Var(XY) = 0,
while the right-hand side of Eq. 14 equals 2Var[Y], which need not vanish.
With additional assumptions, tighter bounds or useful formulae are easy to obtain. For
example, if X and Y are independent, then

    Var[XY] = E[X²] E[Y²] − E²[X] E²[Y]        (15)
            = Var[X] Var[Y] + E²[X] Var[Y] + E²[Y] Var[X].

It is easy to see that Eq. 15 is much tighter than the bound in Eq. 14.
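For the Beta example above, where X and Y are independent, Eq. 15 can be evaluated
directly; a quick check of the arithmetic in Python:

    vx, vy, ex, ey = 0.040, 24 / 1100, 0.4, 0.4   # Beta(2,3) and Beta(4,6) moments
    var_xy = vx * vy + ex**2 * vy + ey**2 * vx    # Eq. 15
    print(var_xy)                                 # about 0.011, versus 0.124 from Eq. 14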
Another case that is useful for applications to parametric order constraints is where
(X, Y) is (approximately) bivariate normal with known means, variances, and correlation ρ.
Kotz, Balakrishnan, and Johnson (2000) state that for standardized variables z_1 and z_2
with a bivariate normal distribution,

    E[z_1² z_2²] = 1 + 2ρ²,        (16)

where ρ is the Pearson correlation. Inserting z_1² = (X − E[X])² / Var[X] and
z_2² = (Y − E[Y])² / Var[Y] into Eq. 16 and performing routine operations with
expectations yields

    Var[XY] = Var[X] Var[Y] (1 + ρ²) + E²[X] Var[Y] + E²[Y] Var[X]
              + 2ρ E[X] E[Y] √(Var[X] Var[Y]).        (17)
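Equation 17 can also be verified by simulation; in the sketch below (ours), the means,
standard deviations, and correlation are arbitrary illustrative values:

    import math, random

    random.seed(1)
    mx, my, sx, sy, rho = 0.5, 0.6, 0.1, 0.2, 0.3
    prods = []
    for _ in range(200_000):
        z1 = random.gauss(0.0, 1.0)
        z2 = rho * z1 + math.sqrt(1.0 - rho**2) * random.gauss(0.0, 1.0)
        prods.append((mx + sx * z1) * (my + sy * z2))

    m = sum(prods) / len(prods)
    mc_var = sum((p - m) ** 2 for p in prods) / len(prods)
    eq17 = (sx**2 * sy**2 * (1 + rho**2) + mx**2 * sy**2 + my**2 * sx**2
            + 2 * rho * mx * my * sx * sy)
    print(mc_var, eq17)   # the two values agree up to Monte Carlo error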
5. Obtaining Bounds on Estimator Variances
From a Bayesian perspective, the parameters are random variables, so when the
parameters β_i in Eq. 5 are independent of each other, we obviously have

    E[θ_i] = E[∏_{k=1}^{i} β_k] = ∏_{k=1}^{i} E[β_k],        (18)

and Eq. 15 can be applied iteratively if the E[β_k] and E[β_k²] are known.
For the maximum likelihood estimates, the relationship θ̂_i = ∏_{k=1}^{i} β̂_k holds
even when the β̂_i are not independent. This is because the likelihood function and,
therefore, its maxima are preserved under one-to-one parametric transformations of a
model. In order to bound variance and covariance terms for the θ̂_s, we can employ the
theorem of the previous section.
The theorem can be applied iteratively to find upper bounds for the variances and
covariances of the θ̂_s. Suppose we have estimates of Var[β̂_i], i = 1, …, N. These can be
obtained using the program described in Hu and Phillips (1999), either from asymptotic
approximations based on the observed Fisher information matrix or by simulation from the
MPT model calibrated by the β̂_i. It is then straightforward to use the theorem to obtain
bounds on the Var[θ̂_i]; for example (hereinafter suppressing the "hat" on MLEs),

    Var[θ_3] ≤ 2(Var[β_1] + Var[β_2 β_3])
             ≤ 2 Var[β_1] + 2²(Var[β_2] + Var[β_3]).
For each θ_i there are many decompositions into pairwise products and, therefore, we
can bound the variance of each θ_i by

    Var[θ_i] = Var[∏_{k=1}^{i} β_k] ≤ inf_τ ∑_k 2^{ℓ(k)} Var[β_k],        (19)

where τ runs over all possible binary tree decompositions of the product
∏_{k=1}^{i} β_k, and ℓ(k) is the length of the branch in the tree decomposition
corresponding to β_k. For instance, θ_3 = β_1 β_2 β_3 can be decomposed in three ways, as
θ_3 = β_1(β_2 β_3), θ_3 = (β_1 β_2)β_3, or θ_3 = β_2(β_1 β_3). Therefore,

    Var[θ_3] ≤ inf{2 Var[β_1] + 2² Var[β_2] + 2² Var[β_3],
                   2 Var[β_3] + 2² Var[β_1] + 2² Var[β_2],
                   2 Var[β_2] + 2² Var[β_1] + 2² Var[β_3]}.
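For small products, the infimum in Eq. 19 can be computed by brute-force enumeration of
the binary decompositions; a Python sketch (ours):

    def product_var_bound(variances):
        """Eq. 19: bound Var[prod_k beta_k] by recursively splitting the index set
        in two and applying Var[XY] <= 2(Var[X] + Var[Y]) at every split; the
        minimum over all binary tree decompositions is returned. (Exponential
        enumeration: intended only for small products.)"""
        def bound(idx):
            if len(idx) == 1:
                return variances[idx[0]]
            rest = idx[1:]
            best = float("inf")
            for mask in range(1 << len(rest)):   # idx[0] stays on the left side
                left = [idx[0]] + [r for i, r in enumerate(rest) if mask >> i & 1]
                right = [r for i, r in enumerate(rest) if not mask >> i & 1]
                if right:
                    best = min(best, 2.0 * (bound(left) + bound(right)))
            return best
        return bound(list(range(len(variances))))

    # The three decompositions of theta_3 in the text, evaluated automatically:
    print(product_var_bound([0.01, 0.02, 0.03]))   # 0.18 = 2(0.03) + 4(0.01 + 0.02)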
This bound tends to get weaker as the number of β's in the product increases. Many
models used in practice have a small number of parameters and, therefore, the blow-up
caused by the factors of the form 2^{ℓ(k)} can be contained provided the variances of the
β's are comparatively small. For large numbers of parameters N, the individual variances
of the β's must be exponentially small in N for the bound to be useful.
Similar estimates and bounds can be derived for covariance products of the form
θ_i θ_j, with i ≤ j, by using

    θ_i θ_j = (∏_{k=1}^{i} β_k²) ∏_{k=i+1}^{j} β_k.        (20)
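Under the Bayesian reading above, with independent β_i, Eq. 20 yields cross moments, and
hence covariances, directly; a small Python sketch (ours, with 1-based indices i ≤ j as in
the text):

    def cross_moment(e_beta, e_beta_sq, i, j):
        """E[theta_i * theta_j] via Eq. 20, assuming the beta_k are independent:
        prod_{k<=i} E[beta_k^2] * prod_{i<k<=j} E[beta_k]."""
        out = 1.0
        for k in range(i):          # k = 1, ..., i
            out *= e_beta_sq[k]
        for k in range(i, j):       # k = i + 1, ..., j
            out *= e_beta[k]
        return out

    # Cov(theta_i, theta_j) then follows as cross_moment(...) minus
    # E[theta_i] * E[theta_j], with the expectations given by Eq. 18.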
When the parameterization of the (1 − θ)'s by the λ's in Eq. 8 is used, the previous
theorem applies again in the same way, mutatis mutandis. For instance,

    Var(1 − θ_s) = Var(θ_s) = Var(∏_{i=s}^{N} λ_i).
If the number of data points for an MPT model is large, one may obtain tighter bounds
by using the fact that MLEs are asymptotically multivariate normal, with
variance-covariance matrix approximated by the inverse of the observed Fisher information
matrix. The implementation of the EM algorithm discussed in Hu and Batchelder (1994)
provides these approximations. An approximate bound for Var[∏_{i=1}^{k} β̂_i], for
k = 2, …, N, can then be computed by iterative use of Eq. 17. In this application, E(β̂_j)
is approximated by β̂_j, and the Var(β̂_j) and ρ(β̂_j, β̂_l) are approximated by the
corresponding entries of the inverse of the observed Fisher information matrix.
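One way to organize this iteration is sketched below in Python. This is our reading, not
the implementation in the cited programs: the running partial product and the next factor
are treated as approximately bivariate normal, the needed covariances are propagated by a
first-order (delta-method) linearization, which is an additional assumption, and Eq. 17 is
applied at each step.

    import math

    def iterative_product_variances(beta_hat, cov):
        """Approximate Var[theta_hat_k], theta_hat_k = prod_{i<=k} beta_hat_i, by
        iterating Eq. 17. cov[i][j] approximates Cov(beta_hat_i, beta_hat_j), e.g.
        from the inverse observed Fisher information matrix. Each partial product P
        is treated as approximately normal (an assumption), and Cov(P*beta_k, beta_j)
        is linearized as E[P]*cov[k][j] + E[beta_k]*Cov(P, beta_j)."""
        n = len(beta_hat)
        mean_p, var_p = beta_hat[0], cov[0][0]
        cov_p = list(cov[0])                  # Cov(P, beta_j) for the current P
        out = [var_p]
        for k in range(1, n):
            m, v, c = beta_hat[k], cov[k][k], cov_p[k]
            rho = c / math.sqrt(var_p * v) if var_p > 0 and v > 0 else 0.0
            new_var = (var_p * v * (1 + rho**2) + mean_p**2 * v + m**2 * var_p
                       + 2 * rho * mean_p * m * math.sqrt(var_p * v))   # Eq. 17
            cov_p = [mean_p * cov[k][j] + m * cov_p[j] for j in range(n)]
            mean_p, var_p = mean_p * m, new_var
            out.append(var_p)
        return out

    # Illustrative numbers only:
    print(iterative_product_variances([0.9, 0.8],
                                      [[0.010, 0.002], [0.002, 0.020]]))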
References
Batchelder, W.H. & Riefer, D.M. (1999). Theoretical and empirical review of multinomial
processing tree modeling. Psychonomic Bulletin & Review, 6, 57-86.
Billingsley, P. (1995). Probability and Measure, Third Edition. New York: Wiley.
Hu, X. & Batchelder, W.H. (1994). The statistical analysis of general processing tree
models with the EM algorithm. Psychometrika, 59, 21-47.
Hu, X. & Phillips, G.A. (1999). Multinomial processing tree models: An implementation.
Behavior Research Methods, Instruments & Computers, 31, 689-695.
Knapp, B. & Batchelder, W.H. (2001). Representing parametric order constraints in
multi-trial applications of multinomial processing tree models. Technical Report
MBS-01-14, Institute for Mathematical Behavioral Sciences, University of California,
Irvine.
Kotz, S., Balakrishnan, N., & Johnson, N.L. (2000). Continuous Multivariate Distributions,
Vol. 1. New York: Wiley.
Riefer, D.M., Knapp, B.R., Batchelder, W.H., Bamber, D., & Manifold, V. (2002). Cognitive
psychometrics: Assessing storage and retrieval deficits in special populations with multinomial
processing tree models. Psychological Assessment, 14, 184-201.
The work of Pierre Baldi is supported by a Laurel Wilkening Faculty Innovation award and a
Sun Microsystems award at UC Irvine. Pierre Baldi acknowledges useful discussions with
Y. Rinott. The work of William Batchelder is supported by NSF Grant SES-0136115; William
Batchelder also acknowledges the support of the Santa Fe Institute, where he was a Visiting
Professor during some of the work.