MODELLING VARIABILITY IN THE HIV GENOME
by
Hildete Prisco Pinheiro
Department of Biostatistics
University of North Carolina
Institute of Statistics
Mimeo Series No. 2186T
December 1997
MODELLING VARIABILITY IN THE HIV GENOME
by
Hildete Prisco Pinheiro
A dissertation submitted to the faculty of the University of North Carolina at Chapel Hill in partial fulfillment of the requirements for the degree of Doctor of Philosophy in the Department of Biostatistics, School of Public Health.
Chapel Hill
1997
Approved by:
Advisor

Advisor

Reader
© 1997
Hildete Prisco Pinheiro
ALL RIGHTS RESERVED
ABSTRACT

HILDETE PRISCO PINHEIRO: Modelling Variability in the HIV Genome. (Under the direction of Dr. Françoise Seillier-Moiseiwitsch and Dr. Pranab Kumar Sen.)
In modelling the mutational process in DNA sequences of the human immunodeficiency virus (HIV) one cannot assume independence among positions along the
sequences. Sites are analyzed, at the nucleotide or amino-acid level, by comparing
each sequence to the consensus sequence. The state at each site is regarded as a
binary event (i.e., there is a mutation or not). Two families of models are considered.
Each sequence can be thought of as a degenerate lattice, and then autologistic models
are applicable. The probability of mutation at a specific site, given all others, has an
exponential form with, as predictor, a function of neighboring sites. Since there is
no closed form for the likelihood function, a Markov-chain Monte-Carlo procedure is
called upon. The second model is based on the Bahadur representation for the joint
distribution of dichotomous responses and assumes only pairwise dependence among
sites. Parameter estimation is performed via the maximum-likelihood method. An
extension to three categories is suggested to evaluate the type of mutation (transition
or transversion) or the absence of mutation. For the autologistic model, two dummy
variables represent the three categories. The estimates of the parameters can also be
obtained by Markov-chain Monte-Carlo methods.
It is also of interest to compare sequence variability between and within groups.
These groups can be HIV-infected individuals from different geographical regions, different high-risk groups or different subtypes of the virus. Two analyses of variance for
categorical data are proposed. One is based on an approach developed by Simpson
(1949). We consider the variability in the distribution of the categorical response.
Sequences are not considered on an individual basis. A test statistic is developed assuming independence among positions. The other is based on the Hamming distance,
which is the proportion of positions where there is a difference between two aligned
sequences. Sequences are considered on an individual basis. The interest is now in
estimating the variability between, within and across groups. U-statistics represent average distances. The total sum of squares is decomposed into within-, between- and across-group sums of squares. The latter term does not appear in the classical decomposition of sums of squares. In order to find the distributions of these sums of squares, we use generalized U-statistics theory. Test statistics are constructed to test
the hypothesis of homogeneity among the groups.
Acknowledgments
I would like to thank my advisors, Drs. Seillier-Moiseiwitsch and Sen, for their guidance, patience, and encouragement, and all my committee members, Drs. Margolin, Newman and Swanstrom, for their helpful comments and suggestions. In particular, I would like to thank Dr. Seillier for her friendship and for providing me with the opportunity to work on the HIV and the Brain Tumor projects. I am grateful for the financial support provided by CAPES.

I am thankful for the love and encouragement of my parents, my in-laws and friends, especially Pai-Lien Chen and Hsiao-Chuan Tien, for their support during my stressful periods. Most of all, I would like to thank my loving and understanding husband for his unconditional love, encouragement and the sharing of the family work. My special thanks go to my daughter Tals for providing me such wonderful moments with her smiles, first steps and all the joys of motherhood.
Contents

1 Introduction and Literature Review
1.1 Introduction
1.2 Biological Background
1.3 Standard Statistical Procedures
1.3.1 Markov Chain Models
1.3.2 Genetic Variability
1.4 Specific Biological Problems and Synopsis of the Solutions

2 Modelling the Mutation Process
2.1 The Autologistic Model
2.1.1 Estimation Procedure
2.2 Model based on the Bahadur Representation
2.2.1 Estimation Procedure
2.3 Models for Three Categories
2.3.1 Overview of the Literature
2.3.2 An Alternative Extension of the Autologistic Model

3 Analyzing the Variability in DNA Sequences
3.1 Variation in Categorical Data
3.2 Partitioning the Measures of Diversity
3.3 The Probabilistic Model
3.4 Moments of Diversity Measures
3.5 The Test Statistic
3.6 The Power of the Test

4 Analysis of Variance based on the Hamming Distance
4.1 The Total Sum of Squares and its Decomposition
4.2 Connections Between Sums of Squares and U-Statistics
4.3 Asymptotic Distributions
4.3.1 One-Sample U-Statistics
4.3.2 Multiple-Sample U-Statistics
4.4 Combining the U-Statistics
4.5 Test Statistics
4.5.1 Power of the Tests

5 Numerical Studies and Data Analysis
5.1 Modelling the Mutation Process
5.1.1 The Autologistic Model
5.1.2 Model based on the Bahadur Representation
5.2 Analyzing the Variability in DNA Sequences
5.2.1 Simulations
5.2.2 Data Analysis

6 Conclusion and Future Research
6.1 Concluding Summary
6.2 Future Research
6.2.1 An Application of the Autologistic Model to Sequences from the nef Gene
6.2.2 Extension of the Autologistic Model to More than Two Categories
6.2.3 An Alternative Null Hypothesis for the Analysis of Diversity Measures
6.2.4 Diversity Measures for Sequences with Dependent Positions
6.2.5 Inclusion of Covariates in the Analysis of Diversity Measures
6.2.6 Analysis of Variance based on the Hamming Distance when Sequences are not Independent
6.2.7 Tests Based on Contrasts in the Analysis of Variance for the Hamming Distances

Appendix A

Appendix B

Bibliography
List of Tables

1.1 Analysis of Variance for Heterozygosity
1.2 Formulae for the Number of Nucleotide Substitutions
1.3 Pattern of Nucleotide Substitution
3.1 Summary of the Data (one position)
3.2 Contingency Table (K positions)
5.1 Convergence Results
5.2 Model 1
5.3 Model 2
5.4 Bahadur Representation Model (35 positions)
5.5 Bahadur Representation Model (12 positions)
5.6 Results of Simulations for Diversity Measures ($p_{ck} = p_c$)
5.7 Results of Simulations for Diversity Measures ($p_{ck} \ne p_c$)
5.8 Percentiles of the Bootstrap Distribution for Diversity Measures
6.1 Contingency Table for Two Categories (dependent positions)
6.2 Contingency Table for C categories (dependent positions)
Chapter 1
Introduction and Literature Review
1.1 Introduction

A brief review of molecular biology and genetic variability is presented in Section 1.2.
The primary interest is to model the mutation process exhibited in DNA sequences of the human immunodeficiency virus (HIV). The data set consists of sequences from different individuals, with n sites per sequence. We cannot assume
independence among sites. Each sequence is compared to the consensus sequence at
the nucleotide or amino-acid level.
Statistical methods to measure variability of DNA sequences, in general, are
available and are discussed in Section 1.3.1, but one needs to be careful when modelling such data because of the known dependencies among neighboring nucleotide positions. An interesting question arises: how does one quantify the notion of neighbor?
Tavaré & Giddings (1989) determine the distance over which there is base-frequency
dependency by modelling the sequence as a Markov chain and estimating the order of
the chain. For a high-order chain, the number of parameters may be too large to be
reliably estimated with sequences of moderate length. Also, the dependency is located
on one side only and this may not be true in reality. Positions along a sequence are not ordered like time points. Raftery & Tavaré (1994) proposed an alternative model to reduce the number of parameters, the mixture transition distribution model. They
apply it to repeated patterns in a DNA sequence. In these papers, the authors look
at a single long sequence and try to see if there is a stochastic pattern for base-pair
composition along the sequence. Dependencies on one side only are allowed. However, positions along a sequence cannot be regarded as a time-series: the sequence
codes for a three-dimensional structure.
In our case we want to look at several sequences and compare them with the
consensus one. We also would like to look at dependencies on both sides. Another question of interest is the location of the "hot spots" in the sequence. If we have an
estimate of the rate of mutation at each site along the sequences, we might be able
to answer that question.
Another interest is to compare sequence variability between and within groups.
Comparisons can be performed between and within individuals as well as between
and within groups.
These groups can be HIV-infected individuals from different
geographical areas, different high-risk groups or different subtypes of the virus. The interest is in whether the variability is similar in each group.
The analysis of variance for continuous variables is a well-established procedure
to compare a number of means. The experimental design guides its format (Cochran
& Cox, 1957). When the variables are categorical, but ordered, rank analysis of
variance can be used as a nonparametric method to test for mean differences (Daniel,
1978). In our case the variables are categorical and not ordered. Therefore, we need
to find alternative methods.
Weir (1990a) describes an analysis of variance for the genetic variation in the
population, in particular for the amount of observed heterozygosity (Section 1.3.2).
The variance of the estimate of the average heterozygosity is broken down to show the
contribution of populations, loci and individuals by setting out the calculations in a
framework similar to that of an analysis of variance. Our situation is a little different
because we would like to construct a categorical analysis of variance based on Hamming distances (Seillier-Moiseiwitsch et al., 1994 and references therein), assuming
that the sequences are independent, but the positions may not be. The Hamming
distance is the proportion of positions at which two aligned sequences differ.
In Section 1.4 the biological problems and solutions given in this manuscript
are discussed.
1.2 Biological Background
Nucleotides are the building blocks of genomes and each nucleotide has three
components: a sugar, a phosphate and a base. The sugar may be one of two kinds:
ribose or deoxyribose. In any given nucleic acid macromolecule, all the sugars are
of the same kind. A nucleic acid with ribose is called Ribonucleic Acid or RNA,
one with deoxyribose, Deoxyribonucleic Acid or DNA. DNA has four bases: Adenine
(A), Cytosine (C), Guanine (G) and Thymine (T), where Adenine fits together with
Thymine and Guanine with Cytosine. These are the so-called base pairs. A sequence
of base pairs may be thought of as a series of "words" specifying the order of amino
acids (each coded by three nucleotides) in a protein. To transform the DNA "words"
into amino acids, some sophisticated molecular machinery is needed. Now, RNA
comes into play. RNA also has four bases: A, C, G and Uracil (U) in place of T.
Adenine is now complementary to Uracil. Transcription is the process by which a
region of DNA is teased apart and a molecule of RNA is built along one strand by
an enzyme, the RNA Polymerase, to begin protein synthesis. Each base of RNA is
complementary to the corresponding base of DNA. The messenger RNA or mRNA
then carries the genetic information from the DNA to the protein factory (Gonick &
Wheelis, 1991).
A retrovirus has the ability to reverse the normal flow of genetic information
from genomic DNA to mRNA (Varmus & Brown, 1989). Its genomic RNA encodes an
enzyme that makes a DNA copy of its RNA and incorporates this DNA into the host
genome (Gonick & Wheelis, 1991). HIV belongs to this family. Retroviruses have
been subdivided into three subgroups on the basis of the in-vivo disease they produce
and, more recently, on the basis of sequence homology (Teich, 1984; Varmus & Brown,
1989): oncovirus, lentivirus and spumavirus. HIV is classified as a lentivirus. These
cause slow, chronic diseases (Cullen, 1991).
The human immunodeficiency virus evolves rapidly. HIV-1 and HIV-2 are two
strains of HIV, with HIV-1 being the most common in the world and in the U.S.
HIV-1 has nine genes (gag, pol, vif, vpr, vpu, tat, rev, env and nef), of which eight are conserved among the other primate lentiviruses. Most of the internal part of the genome is densely packed with protein-coding domains, such that some genes overlap. The DNA form of the genome is bounded by a repeated sequence, the LTR.
The 5' copy of the LTR contains important transcription signals, while the 3' copy
is used to encode part of the nef gene and the 3' end-defining polyadenylation site in
the transcribed RNA (Seillier-Moiseiwitsch et al., 1994). The pattern of nucleotide
variation is not constant over the whole genome. For instance, the genes encoding
internal virion proteins, gag and pol, are more conserved than the gene coding for the
envelope of the virus, env. Also, the nucleotide differences in env change the encoded
amino acids more frequently, and therefore, the amino acid sequence exhibits more
variation in env than in gag and pol (Coffin, 1986).
The viral envelope is the only gene product in direct contact with the host
environment. It is thus not surprising that this gene has the most variable sequence
(Hahn et al., 1985; Coffin, 1986; Seillier-Moiseiwitsch et al., 1994). To illustrate
the diversity of HIV, env sequences were isolated at different times from each of a
number of individuals (Hahn et al., 1986) and the greatest variation is within the
so-called hypervariable regions V1-V5. The viruses from an individual differ from one
another, but not as much as viruses from different individuals. Such high variation
seems most likely due to a combination of two aspects. First, there may be a strong
selection by immunological pressure for variation in these regions and, second, a lack
of selection against variation may permit the presence of almost any sequence that
does not interrupt translation.
In different organisms most genes do not follow a pattern in the first two codon positions, but in HIV there is a preference for A, at the expense of C in particular. In the third codon position the shift towards A is even greater, while in most other organisms A is rare in the third codon position. What is particular to HIV is that purines (A and G) predominate over pyrimidines (C and T). The overrepresentation
of A is highest in pol and lowest in env, and may be related to the great genetic
variability of HIV. As A does not appear to be concentrated in the hypervariable
segments of env, there is so far no plausible explanation for this peculiar coding
strategy (Kypr & Mrazek, 1987).
The genetic variability of HIV-1 is relatively high compared to other retroviruses (Mansky & Temin, 1995). Error rates of purified HIV-1 reverse transcriptase determined with a DNA template (of the lacZα peptide gene) range from $5 \times 10^{-4}$ to $6.7 \times 10^{-4}$ (Roberts et al., 1988). To test the hypothesis that the mutation rate for HIV-1 is comparable to that of purified HIV-1 reverse transcriptase, Mansky & Temin (1995) developed a system to measure forward mutation rates with an HIV-1 vector containing the lacZα peptide gene as a reporter for mutations. They found that the forward mutation rate of HIV-1 in a single cycle of replication is $3.4 \times 10^{-5}$ mutations per base pair per cycle. The in-vivo mutation rate of HIV-1 is therefore lower than the error rate of purified reverse transcriptase by a factor of 20 (Mansky & Temin, 1995). Explanations for this difference are: the association of viral or nonviral accessory proteins during reverse transcription, the influence of cellular mismatch-repair mechanisms, and/or differences between the reverse transcriptase produced in vivo and that assayed in vitro.
Decomposing DNA sequences into tables of codon usage (i.e., frequency tables of codons among organisms) loses a lot of information, and the interpretation of such tables can be misleading (Sharp, 1986). For example, looking at codon similarities between HIV and other viruses, one might think that the Moloney murine leukemia virus is the most closely related to HIV, when in fact HIV codon usage is most similar to that of the influenza virus and cauliflower mosaic virus. It is also important to note that, unlike
divergence in protein and DNA sequences, differentiation at the level of codon usage
is not linear in time.
Genomic comparisons of virus isolates have shown that HIV-1 variants in Africa are both highly diverse and generally distinct from those in North America and Europe. Analyses based on gag or env indicate that African isolates are more heterogeneous than the North American/European group. For instance, McCutchan et al. (1992) compare 22 HIV-1 isolates from Zambia and 16 from North America. The Zambian isolates are most distant from a North American virus isolate, HIVMN: the mean difference from HIVMN was 13.6%, with a range of 1.4.4-17.1%. Among the Zambian isolates the mean pairwise difference was 7.1%, with a range of 4.4-9.8%. PCR and sequencing results indicate that HIV-1 isolates from Zambia are relatively homogeneous, but they are distinct from previously described HIV-1 isolates, clustering in a separate branch of the phylogenetic tree.
Sequences can be compared at either the nucleotide or the amino-acid level. Nucleotide substitutions can be evaluated for mutations that cause changes in amino acids (nonsynonymous) vs. mutations that do not (silent or synonymous). Furthermore, we can have substitutions between purines only (A ↔ G) or pyrimidines only (C ↔ T), termed transitions, or we can have mutations between a purine and a pyrimidine (A ↔ C, A ↔ T, G ↔ C, or G ↔ T), called transversions.
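To make this classification concrete, here is a minimal Python sketch (the function name and structure are ours, purely illustrative) that labels the substitution observed at one aligned site:

```python
# Sketch: classify a nucleotide substitution as a transition or a transversion.
# The sets below encode the purines (A, G) and the pyrimidines (C, T).
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}

def substitution_type(consensus_base: str, observed_base: str) -> str:
    """Return 'none', 'transition' or 'transversion' for one aligned site."""
    if consensus_base == observed_base:
        return "none"
    both_purine = consensus_base in PURINES and observed_base in PURINES
    both_pyrimidine = consensus_base in PYRIMIDINES and observed_base in PYRIMIDINES
    return "transition" if (both_purine or both_pyrimidine) else "transversion"

print(substitution_type("A", "G"))  # transition
print(substitution_type("A", "C"))  # transversion
```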
1.3 Standard Statistical Procedures

1.3.1 Markov Chain Models
Markov chain models have been used to analyze DNA sequences because of the format of these data. Tavaré & Giddings (1989) estimate the order of the chain.
Let $X = \{X_n,\ n = 1, 2, \ldots\}$ be a stochastic process with $m$ states. If these states represent nucleotides, then $m$ is 4 (A = 1, C = 2, G = 3 and T = 4). $X$ is called a Markov chain of order $k$ if
$$\Pr\{X_{n+1} = i_{n+1} \mid X_n = i_n, \ldots, X_{n-k+1} = i_{n-k+1}, \ldots, X_1 = i_1\} = \Pr\{X_{n+1} = i_{n+1} \mid X_n = i_n, \ldots, X_{n-k+1} = i_{n-k+1}\}$$
for all $n > k - 1$ and for all choices of states $i_1, i_2, \ldots, i_{n+1}$ from $\{1, 2, \ldots, m\}$. In other words, the distribution of the next base in the sequence is determined by the $k$ previous ones. When $k = 0$, the bases are independently distributed.
The transition probabilities are denoted by
$$p(i_1, i_2, \ldots, i_k; i_{k+1}) = \Pr\{X_{n+1} = i_{k+1} \mid X_n = i_k, \ldots, X_{n-k+1} = i_1\}. \quad (1.3.1)$$
The goals are to estimate the order $k$ of the Markov chain as well as the transition probabilities, and to test various hypotheses about the DNA sequence(s). For a single sequence, a typical hypothesis is that of independence among the bases, i.e., one tests whether $k = 0$.
Suppose the sequence of interest has length $N$. Let $n(i_1, i_2, \ldots, i_r)$ be the number of transitions $i_1 \to i_2 \to \cdots \to i_r$ observed in the sequence, $r = 1, 2, \ldots$. Since we are dealing with a multinomial distribution, it is known that the maximum likelihood estimator of $p(i_1, i_2, \ldots, i_k; i_{k+1})$ is
$$\hat p(i_1, i_2, \ldots, i_k; i_{k+1}) = \frac{n(i_1, i_2, \ldots, i_k, i_{k+1})}{n(i_1, i_2, \ldots, i_k, +)} \quad (1.3.2)$$
where
$$n(i_1, i_2, \ldots, i_k, +) = \sum_j n(i_1, i_2, \ldots, i_k, j).$$
Note that for a $k$-th order Markov chain with $m$ states, there are $m^k(m-1)$ independent parameters to estimate. So, for a high-order chain, the number of parameters may be too big, and very long sequences are required to get reliable estimates of the parameters.
To reduce the parameter space, Raftery (1985) suggests the following reparametrization for a $k$-th order Markov-chain model:
$$p(i_1, i_2, \ldots, i_k; i_{k+1}) = \sum_{j=1}^k \lambda_j\, q(i_j, i_{k+1}) \quad (1.3.3)$$
where $Q = \{q(i,j),\ 1 \le i, j \le m\}$ is a stochastic matrix,
$$q(i,j) \ge 0 \quad\text{and}\quad \sum_{l=1}^m q(j, l) = 1, \quad j = 1, \ldots, m,$$
and
$$\lambda_j \ge 0, \quad \sum_{j=1}^k \lambda_j = 1.$$
The number of independent parameters is now reduced to $m(m-1) + k - 1$.
The only disadvantage with this reduced model is the estimation procedure. While in model (1.3.1) there is a simple expression for the maximum-likelihood estimate of $p(i_1, i_2, \ldots, i_k; i_{k+1})$, given by (1.3.2), in model (1.3.3) the maximum-likelihood estimates of the $\lambda$'s and $q(i,j)$'s must be found by numerically maximizing the log-likelihood
$$L = \sum n(i_1, i_2, \ldots, i_k, i_{k+1}) \ln\Big\{\sum_{j=1}^k \lambda_j\, q(i_j, i_{k+1})\Big\}.$$
Raftery & Tavaré (1994) propose a computational algorithm for maximum-likelihood estimation of the parameters in model (1.3.3) by reducing the large number of constraints.
1.3.2 Genetic Variability
Weir (1990a) introduces a simple measure of genetic variation in a population: the observed heterozygosity. Let $n_{l,uv}$ be the observed number of heterozygotes $A_u A_v$, $u \ne v$, at a locus $l$ in a sample of size $n$. Then the sample heterozygote frequency at locus $l$ is
$$\tilde H_l = \sum_u \sum_{v \ne u} \frac{n_{l,uv}}{n}.$$
If there are $m$ loci, the average heterozygosity is
$$\tilde H = \frac{1}{m} \sum_{l=1}^m \tilde H_l.$$
Since $\tilde H_l$ is the sum of heterozygote counts that are multinomially distributed, each $\tilde H_l$ is binomially distributed with
$$E(\tilde H_l) = H_l \quad\text{and}\quad \mathrm{Var}(\tilde H_l) = \frac{1}{n} H_l (1 - H_l),$$
where $H_l$ is the proportion of heterozygotes at locus $l$ in the population. $\tilde H_l$ can also be written as
$$\tilde H_l = \frac{1}{n} \sum_{j=1}^n x_{jl},$$
where
$$x_{jl} = \begin{cases} 1 & \text{if individual } j \text{ is heterozygous at locus } l \\ 0 & \text{otherwise,} \end{cases}$$
and, for one population, we have
$$E(x_{jl}^2) = E(x_{jl}) = H_l \quad\text{and}\quad E(x_{jl}\, x_{j'l}) = H_l^2 \quad (j' \ne j),$$
assuming that the individuals are independent within a sample.
Let $\tilde H = \frac{1}{m}\sum_l \tilde H_l$ be the estimate of the average heterozygosity within a population. Then,
$$E(\tilde H) = \frac{1}{m}\sum_l H_l = \bar H.$$
To compute the variance of $\tilde H$, we need to take into account the covariance between heterozygosities at different loci, since they are not independent:
$$\mathrm{Var}(\tilde H) = \frac{1}{m^2 n}\Big[\sum_l H_l(1 - H_l) + \sum_l \sum_{l' \ne l} (H_{ll'} - H_l H_{l'})\Big],$$
where the two-locus heterozygosity $H_{ll'} = E(x_{jl}\, x_{jl'})$ is the probability that a random individual is heterozygous at loci $l$ and $l'$.
The sample variance of single-locus heterozygosities is
$$s_H^2 = \frac{1}{m-1}\sum_l (\tilde H_l - \tilde H)^2 = \frac{1}{m}\sum_l \tilde H_l^2 - \frac{1}{m(m-1)}\sum_l\sum_{l' \ne l} \tilde H_l \tilde H_{l'},$$
with
$$E(s_H^2) = \frac{1}{m}\sum_l \Big[H_l^2 + \frac{1}{n} H_l(1 - H_l)\Big] - \frac{1}{m(m-1)}\sum_l\sum_{l' \ne l}\Big[H_l H_{l'} + \frac{1}{n}(H_{ll'} - H_l H_{l'})\Big].$$
To get the variance between populations, we need to take into account the dependence between members of the same population, caused by the founder effect. Let
$$M_l = E(x_{jl}\, x_{j'l}), \quad j \ne j',$$
where $M_l$ is the probability that two individuals in the same population are heterozygous at locus $l$. Then,
$$E(\tilde H_l) = H_l$$
and
$$\mathrm{Var}(\tilde H_l) = \frac{1}{n^2} E\Big(\sum_j x_{jl}^2 + \sum_j\sum_{j' \ne j} x_{jl}\, x_{j'l}\Big) - [E(\tilde H_l)]^2 = \frac{1}{n}(H_l - M_l) + (M_l - H_l^2).$$
Averaging over all $m$ loci,
$$\tilde H = \frac{1}{nm}\sum_j\sum_l x_{jl} = \frac{1}{m}\sum_l \tilde H_l.$$
Denote by $M_{ll'}$ the probability that two random individuals from the same population are heterozygous, one at locus $l$ and the other at locus $l'$:
$$M_{ll'} = E(x_{jl}\, x_{j'l'}), \quad j \ne j'.$$
Then,
$$E(\tilde H) = \frac{1}{m}\sum_l E(\tilde H_l) = \frac{1}{m}\sum_l H_l$$
and
$$\mathrm{Var}(\tilde H) = \frac{1}{m^2 n^2} E\Big(\sum_l\sum_j x_{jl}^2 + \sum_l\sum_j\sum_{j' \ne j} x_{jl}\, x_{j'l} + \sum_j\sum_l\sum_{l' \ne l} x_{jl}\, x_{jl'} + \sum_j\sum_{j' \ne j}\sum_l\sum_{l' \ne l} x_{jl}\, x_{j'l'}\Big) - \bar H^2$$
$$= \frac{1}{m^2}\Big[\sum_l (M_l - H_l^2) + \sum_l\sum_{l' \ne l}(M_{ll'} - H_l H_{l'})\Big] + \frac{1}{m^2 n}\Big[\sum_l (H_l - M_l) + \sum_l\sum_{l' \ne l}(H_{ll'} - M_{ll'})\Big] \quad (1.3.4)$$
The four terms in expression (1.3.4) can be rearranged to show how populations,
loci, and individuals contribute to the variance of the average heterozygosity by setting
out the calculations in a framework similar to that used for an analysis of variance.
Now, an index $i$ is added to the indicator variable to denote the population being sampled, i.e.,
$$x_{ijl} = \begin{cases} 1 & \text{if individual } j \text{ from population } i \text{ is heterozygous at locus } l \\ 0 & \text{otherwise.} \end{cases}$$
When $m$ loci are scored on $n$ individuals taken from each of $r$ populations, the indicator variables can be represented by the linear model
$$x_{ijl} = \alpha_i + \beta_{ij} + \gamma_l + (\alpha\gamma)_{il} + (\beta\gamma)_{ijl}, \quad i = 1, \ldots, r, \; j = 1, \ldots, n, \; l = 1, \ldots, m,$$
where $\alpha_i$ represents the population effect, $\beta_{ij}$ the individual-within-population effect, $\gamma_l$ the locus effect, $(\alpha\gamma)_{il}$ the population-by-locus interaction and $(\beta\gamma)_{ijl}$ the locus-by-individual-within-population interaction. $\gamma_l$ is a fixed effect, since the same loci are repeatedly scored (the investigator is interested in these particular loci only), and all the other effects are considered random:
$$E(\alpha_i) = 0, \quad \mathrm{Var}(\alpha_i) = \sigma_p^2$$
$$E(\beta_{ij}) = 0, \quad \mathrm{Var}(\beta_{ij}) = \sigma_{i/p}^2$$
$$E(\gamma_l) = H_l$$
$$E((\alpha\gamma)_{il}) = 0, \quad \mathrm{Var}((\alpha\gamma)_{il}) = \sigma_{pl}^2$$
$$E((\beta\gamma)_{ijl}) = 0, \quad \mathrm{Var}((\beta\gamma)_{ijl}) = \sigma_{li/p}^2$$
The analysis of variance format is shown in Table 1.1, and expectations of the terms defined in this table are
$$E(x_{ijl}^2) = \sigma_p^2 + \sigma_{i/p}^2 + H_l^2 + \sigma_{pl}^2 + \sigma_{li/p}^2$$
$$E(x_{ij\cdot}^2) = m^2\sigma_p^2 + m^2\sigma_{i/p}^2 + \Big(\sum_l H_l\Big)^2 + m\sigma_{pl}^2 + m\sigma_{li/p}^2$$
$$E(x_{i\cdot l}^2) = n^2\sigma_p^2 + n\sigma_{i/p}^2 + n^2 H_l^2 + n^2\sigma_{pl}^2 + n\sigma_{li/p}^2$$
$$E(x_{i\cdot\cdot}^2) = n^2 m^2\sigma_p^2 + n m^2\sigma_{i/p}^2 + n^2\Big(\sum_l H_l\Big)^2 + n^2 m\sigma_{pl}^2 + n m\sigma_{li/p}^2$$
$$E(x_{\cdot\cdot l}^2) = r n^2\sigma_p^2 + r n\sigma_{i/p}^2 + r^2 n^2 H_l^2 + r n^2\sigma_{pl}^2 + r n\sigma_{li/p}^2$$
$$E(x_{\cdot\cdot\cdot}^2) = r n^2 m^2\sigma_p^2 + r n m^2\sigma_{i/p}^2 + r^2 n^2\Big(\sum_l H_l\Big)^2 + r n^2 m\sigma_{pl}^2 + r n m\sigma_{li/p}^2$$

Table 1.1: Analysis of Variance for Heterozygosity

Source | d.f. | Sum of Squares | Expected Mean Square
Populations | $r-1$ | $SS_1 - C$ | $\sigma^2_{li/p} + m\sigma^2_{i/p} + n\sigma^2_{pl} + mn\sigma^2_p$
Individuals within populations | $r(n-1)$ | $SS_2 - SS_1$ | $\sigma^2_{li/p} + m\sigma^2_{i/p}$
Loci | $m-1$ | $SS_3 - C$ | $\sigma^2_{li/p} + n\sigma^2_{pl} + rnL$
Loci by populations | $(r-1)(m-1)$ | $SS_4 - SS_1 - SS_3 + C$ | $\sigma^2_{li/p} + n\sigma^2_{pl}$
Loci by individuals within populations | $r(n-1)(m-1)$ | $SS_5 - SS_2 - SS_4 + SS_1$ | $\sigma^2_{li/p}$
Total | $mnr - 1$ | $SS_5 - C$ |

where
$$SS_1 = \frac{1}{nm}\sum_i x_{i\cdot\cdot}^2, \quad SS_2 = \frac{1}{m}\sum_i\sum_j x_{ij\cdot}^2, \quad SS_3 = \frac{1}{rn}\sum_l x_{\cdot\cdot l}^2,$$
$$SS_4 = \frac{1}{n}\sum_i\sum_l x_{i\cdot l}^2, \quad SS_5 = \sum_i\sum_j\sum_l x_{ijl}^2, \quad C = \frac{1}{mnr}\, x_{\cdot\cdot\cdot}^2,$$
with marginal totals
$$x_{ij\cdot} = \sum_l x_{ijl}, \quad x_{i\cdot l} = \sum_j x_{ijl}, \quad x_{\cdot\cdot l} = \sum_i\sum_j x_{ijl}, \quad x_{i\cdot\cdot} = \sum_j\sum_l x_{ijl}, \quad x_{\cdot\cdot\cdot} = \sum_i\sum_j\sum_l x_{ijl}.$$
In particular,
$$E(SS_5) = rnm\,\sigma_p^2 + rnm\,\sigma_{i/p}^2 + rn\sum_l H_l^2 + rnm\,\sigma_{pl}^2 + rnm\,\sigma_{li/p}^2$$
and
$$E(C) = nm\,\sigma_p^2 + m\,\sigma_{i/p}^2 + \frac{rn}{m}\Big(\sum_l H_l\Big)^2 + n\,\sigma_{pl}^2 + \sigma_{li/p}^2.$$
These expressions lead to the expected mean squares in Table 1.1, where
$$L = \frac{1}{m-1}\sum_l (H_l - \bar H)^2, \quad m > 1.$$
The four variance components are: for the population effect,
$$\sigma_p^2 = \frac{1}{m(m-1)}\sum_l\sum_{l' \ne l}(M_{ll'} - H_l H_{l'});$$
for the individual-within-population effect,
$$\sigma_{i/p}^2 = \frac{1}{m(m-1)}\sum_l\sum_{l' \ne l}(H_{ll'} - M_{ll'});$$
for the population-by-locus interaction,
$$\sigma_{pl}^2 = \frac{1}{m}\sum_l (M_l - H_l^2) - \frac{1}{m(m-1)}\sum_l\sum_{l' \ne l}(M_{ll'} - H_l H_{l'});$$
and for the locus-by-individual-within-population interaction,
$$\sigma_{li/p}^2 = \frac{1}{m}\sum_l (H_l - M_l) - \frac{1}{m(m-1)}\sum_l\sum_{l' \ne l}(H_{ll'} - M_{ll'}).$$
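For readers who want to reproduce Table 1.1 numerically, the following Python sketch (illustrative only; the array layout and names are ours) computes the sums of squares from the indicator array $x_{ijl}$:

```python
import numpy as np

def heterozygosity_ss(x):
    """Sums of squares of Table 1.1 from an indicator array x with shape
    (r populations, n individuals, m loci); x[i, j, l] = 1 if individual j of
    population i is heterozygous at locus l."""
    r, n, m = x.shape
    C   = x.sum() ** 2 / (r * n * m)                 # correction term
    SS1 = (x.sum(axis=(1, 2)) ** 2).sum() / (n * m)  # population totals
    SS2 = (x.sum(axis=2) ** 2).sum() / m             # individual totals
    SS3 = (x.sum(axis=(0, 1)) ** 2).sum() / (r * n)  # locus totals
    SS4 = (x.sum(axis=1) ** 2).sum() / n             # population-by-locus totals
    SS5 = (x ** 2).sum()                             # individual cells
    return {"Populations": SS1 - C,
            "Individuals within populations": SS2 - SS1,
            "Loci": SS3 - C,
            "Loci by populations": SS4 - SS1 - SS3 + C,
            "Loci by individuals within populations": SS5 - SS2 - SS4 + SS1,
            "Total": SS5 - C}

rng = np.random.default_rng(0)
print(heterozygosity_ss(rng.integers(0, 2, size=(3, 10, 5)).astype(float)))
```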
For comparative studies of DNA sequences, statistical methods for estimating
the number of nucleotide substitutions are required as are models for the molecular
evolution of the sequences (Gojobori et al., 1990).
Denote by $I(t)$ the probability that two nucleotide bases at corresponding (homologous) sites at time $t$ are identical to each other. First, let us assume that the substitution rate is the same for all pairs of nucleotides and constant over time. The substitution process is then described by a single parameter, $\alpha$ say.

Consider the probability that two homologous nucleotide sites are identical to each other at time $t$ and are also identical at time $t+1$. In this case, there are two mutually exclusive events: one when both bases change into two other identical bases, having probability $3\alpha^2$, and the other when both nucleotide sites remain unchanged, having probability $(1-3\alpha)^2$. Therefore,
$$\Pr\{\text{two bases remain identical at time } t+1 \text{ when identical at time } t\} = [(1-3\alpha)^2 + 3\alpha^2]\, I(t) \quad (1.3.5)$$

Now, consider the case where the bases are different from each other at time $t$ and become identical at time $t+1$. There are also two mutually exclusive events in this case. The first event is when a change occurs at one of the two corresponding sites but the other site remains unchanged; this occurs with probability $2\alpha(1-3\alpha)$. The second is when both nucleotide bases change into two other identical bases simultaneously; this occurs with probability $2\alpha^2$. Therefore,
$$\Pr\{\text{two sites become identical at time } t+1 \text{ when different at time } t\} = [2\alpha(1-3\alpha) + 2\alpha^2]\,[1 - I(t)] \quad (1.3.6)$$

Thus, from (1.3.5) and (1.3.6) we obtain
$$I(t+1) = [(1-3\alpha)^2 + 3\alpha^2]\, I(t) + [2\alpha(1-3\alpha) + 2\alpha^2]\,[1 - I(t)] \quad (1.3.7)$$
Satisfying the initial condition that $I(0) = 1$, the solution of the above equation is
$$I(t) = \frac{1}{4}\big[1 + 3(1 - 8\alpha + 16\alpha^2)^t\big] \quad (1.3.8)$$
Since $\alpha$ is generally very small, the terms in $\alpha^2$ in (1.3.8) are negligible and we get
$$I(t) = \frac{1}{4}\big[1 + 3(1 - 8\alpha)^t\big] \quad (1.3.9)$$
By solving a differential equation derived by substituting $dI(t)/dt$ for $I(t+1) - I(t)$ in equation (1.3.7) (Nei, 1975), we get
$$I(t) = 1 - \frac{3}{4}\big(1 - e^{-8\alpha t}\big) \quad (1.3.10)$$
The mean number of nucleotide substitutions accumulated per site at time $t$ is $2 \times 3\alpha t$. The factor 2 comes from the fact that the divergence of two sequences involves two evolutionary lines, each having the time length $t$. Let $K = 2 \times 3\alpha t$ be the evolutionary distance in terms of accumulated changes (i.e., the number of nucleotide substitutions) and $F_D = 1 - I(t)$. Then, from (1.3.10),
$$K = -\frac{3}{4}\ln\Big(1 - \frac{4}{3}F_D\Big) \quad (1.3.11)$$
The standard error of $K$ is given by
$$\sigma_K = \frac{\sqrt{\frac{1}{n} F_D (1 - F_D)}}{1 - \frac{4}{3} F_D},$$
where $n$ is the total number of sites compared (Kimura & Ohta, 1972).
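A minimal Python sketch of the one-parameter estimate (1.3.11) and its standard error (the function name and the example figures are ours, for illustration only):

```python
import math

def jukes_cantor_distance(fd: float, n_sites: int):
    """One-parameter estimate (1.3.11) of K and its standard error
    (Kimura & Ohta, 1972), given the observed fraction fd of differing sites."""
    k = -0.75 * math.log(1.0 - (4.0 / 3.0) * fd)
    se = math.sqrt(fd * (1.0 - fd) / n_sites) / (1.0 - (4.0 / 3.0) * fd)
    return k, se

# Example: 13.6% observed difference over 1000 compared sites.
print(jukes_cantor_distance(0.136, 1000))
```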
Gojobori et al. (1990) also construct two-, three-, four- and six-parameter methods for estimating the total number of nucleotide substitutions. A summary of the formulae for $K$ appears in Table 1.2, and the pattern of nucleotide substitution according to each model in Table 1.3.
Table 1.2: Formulae for the Number of Nucleotide Substitutions

Model | Estimating Formula
One Parameter | $K = -\frac{3}{4}\ln\big(1 - \frac{4}{3}F_D\big)$
Two Parameters | $K = -\frac{1}{2}\ln\big\{(1 - 2P - Q)\sqrt{1 - 2Q}\big\}$
Three Parameters | $K = -\frac{1}{4}\ln\big\{(1 - 2P - 2Q)(1 - 2P - 2R)(1 - 2Q - 2R)\big\}$
Four Parameters | $K = -\frac{1}{4}\ln\Big\{\frac{(S_{13} - Q_1)(S_{24} - Q_2) - [(P - R)/2]^2}{w(1-w)}\Big[1 - \frac{P + R}{2w(1-w)}\Big]^{8w(1-w) - 1}\Big\}$
Six Parameters | $K = -pq\ln\big(\frac{B_1}{pq}\big) - \frac{1}{4}\ln\Big[\Big(F_{12} - B_1 + \frac{3E_{12}}{B_1}\Big)\Big(F_{34} - B_1 + \frac{3E_{34}}{B_1}\Big)\Big]$
For the two-parameter model, $\alpha$ and $\beta$ are the rates of transition and transversion, respectively. Let $P + Q = F_D$, where
$$P = P(t) = \frac{1}{4} - \frac{1}{2}e^{-4(\alpha+\beta)t} + \frac{1}{4}e^{-8\beta t}$$
and
$$Q = Q(t) = \frac{1}{2} - \frac{1}{2}e^{-8\beta t}.$$
$P$ and $Q$ represent, respectively, the fractions of nucleotide sites with transition and transversion differences between the two sequences compared. Then $k = \alpha + 2\beta$ is the number of nucleotide substitutions per site per year, and $K = 2kt$ is the total number of nucleotide substitutions per site between the two sequences, which diverged from the common ancestor $t$ years ago.
Table 1.3: Pattern of Nucleotide Substitution (rows: original nucleotide; columns: substituted nucleotide)

One-parameter model:

      A  T  C  G
A     —  α  α  α
T     α  —  α  α
C     α  α  —  α
G     α  α  α  —

Two-parameter model (transitions at rate α, transversions at rate β):

      A  T  C  G
A     —  β  β  α
T     β  —  α  β
C     β  α  —  β
G     α  β  β  —

Three-parameter model (transitions at rate α, transversions at rates β and γ):

      A  T  C  G
A     —  β  γ  α
T     β  —  α  γ
C     γ  α  —  β
G     α  γ  β  —

Four-parameter model (Takahata & Kimura, 1981): substitution rates α, β, γ, δα and δβ.

Six-parameter model (Kimura, 1981; Gojobori et al., 1982): substitution rates α, β, α₁, α₂, β₁ and β₂.
For the three-parameter model (Kimura, 1981), $P(t)$ is the fraction of sites (in the two sequences being compared) having TC or AG nucleotide pairs at time $t$, $Q(t)$ the fraction of sites having TA or CG nucleotide pairs, and $R(t)$ the fraction of sites having TG or CA nucleotide pairs:
$$P = P(t) = \big[1 - e^{-4(\alpha+\beta)t} - e^{-4(\alpha+\gamma)t} + e^{-4(\beta+\gamma)t}\big]/4,$$
$$Q = Q(t) = \big[1 - e^{-4(\alpha+\beta)t} + e^{-4(\alpha+\gamma)t} - e^{-4(\beta+\gamma)t}\big]/4,$$
$$R = R(t) = \big[1 + e^{-4(\alpha+\beta)t} - e^{-4(\alpha+\gamma)t} - e^{-4(\beta+\gamma)t}\big]/4,$$
and $K = 2(\alpha + \beta + \gamma)t$ is the total number of nucleotide substitutions per site.

For the four-parameter model proposed by Takahata & Kimura (1981), $w$ is the fraction of A+T in the two DNA sequences compared. $S_{13}$ is the fraction of sites having AA or TT nucleotide pairs. $S_{24}$ represents the fraction of sites having CC or GG nucleotide pairs. $Q_1$ is the fraction of sites having AT pairs. $Q_2$ is the fraction of sites having GC pairs. $P$ is the fraction of sites having CT or AG pairs, and $R$ is the fraction of sites having GT or AC pairs.
The six-parameter method is based on the model proposed by Kimura (1981), and Gojobori et al. (1982) derived its exact formulation. $q_A$, $q_T$, $q_C$ and $q_G$ stand, respectively, for the contents of A, T, C and G in the DNA sequences compared. Also,
$$p = q_A + q_T, \quad q = q_C + q_G,$$
$$B_1 = pq - (x_{AC} + x_{AG} + x_{TC} + x_{TG}),$$
$$E_{12} = (q_A q - x_{AC} - x_{AG})(q_T q - x_{TC} - x_{TG}),$$
$$E_{34} = (q_C p - x_{AC} - x_{TC})(q_G p - x_{AG} - x_{TG}),$$
$$F_{12} = x_{AA} + x_{TT} - x_{AT} - p^2 + 3q_A q_T,$$
$$F_{34} = x_{CC} + x_{GG} - x_{CG} - q^2 + 3q_C q_G,$$
where $x_{ii}$ represents the fraction of sites having the same base pair $i$, and $2x_{ij}$ ($i \ne j$) is the fraction of sites having different base pairs $i$ and $j$ ($i, j = A, C, T, G$).
1.4 Specific Biological Problems and Synopsis of the Solutions

In modelling the mutation process exhibited in DNA sequences of the human immunodeficiency virus (HIV) one cannot assume independence among sites. Suppose the data set consists of a set of nucleotide (or amino-acid) sequences from different individuals, with n sites per sequence. Each sequence is compared to the consensus sequence at the nucleotide (or amino-acid) level. First, we consider the mutation process at each site as a binary event (i.e., there is a mutation or not).
We formulate two models (Chapter 2). In the first model (Section 2.1), each
sequence is thought of as a degenerate lattice and we use results pertaining to spatial
theory to build up an autologistic model, where the probability of mutation at a
specific site, given all others, has an exponential form with as predictor a function of
neighboring sites. Also, since we have sequences from different individuals, we have
several independent lattices, not just one as is the case in image reconstruction where
this theory is used. Parameter estimation is problematic: there is no closed form for
the likelihood function to get maximum likelihood estimates. We therefore resort to
Markov-chain Monte-Carlo procedures to estimate the parameters. In particular, we
use the Metropolis algorithm to generate random vectors, with the pseudolikelihood
estimates as initial values.
The second model (Section 2.2) is based on a Bahadur representation for the
joint distribution of the responses for the n dichotomous sites, assuming pairwise
dependence only among sites. The probability of mutation is no longer assumed to
have an exponential form. Parameter estimation is performed via the maximumlikelihood method.
Then, we extend the model to three categories denoting the type of mutation (transition or transversion) or the absence of mutation. We construct an extension of the autologistic model (Section 2.3), keeping in mind that these categories are not ordered. We define two dummy variables to represent the three categories. The model is similar to the autologistic model used in the binary case, with vectors as the response variable for each site. The estimates are again obtained via Markov-chain Monte-Carlo simulation.
The other problem of interest is the comparison of sequences. Comparisons can
be performed between and within individuals as well as between and within groups.
For example, we may be interested in comparing DNA sequences from different geographical areas to see whether the variability is similar in each. We develop two
approaches.
In Chapter 3 we propose a categorical analysis of variance extending the work of Simpson (1949) and Light & Margolin (1971) to consider a number of sites (because, in the context of interest, a single position yields little information). The sequences are not considered on an individual basis, i.e., at each position we count the number of sequences in each category, and the variability in the distribution of these counts is considered. A test statistic is developed, assuming independence among positions, to test the hypothesis that the probabilities of being in a certain category at a specific position are the same for all groups.
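Chapter 3 develops the formal statistic; as a rough illustration of the quantity underlying it, here is a sketch of Simpson's index of diversity at a single aligned position (one common form of the index; the function is ours):

```python
from collections import Counter

def simpson_diversity(column):
    """Simpson's (1949) index of diversity, 1 minus the sum of squared category
    frequencies, for the residues observed at one alignment position."""
    counts = Counter(column)
    total = sum(counts.values())
    return 1.0 - sum((c / total) ** 2 for c in counts.values())

# One aligned position across six sequences (categories are amino acids here).
print(simpson_diversity(["K", "K", "R", "K", "T", "K"]))
```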
The analysis of variance proposed in Chapter 4 is based on Hamming distances. In comparing two aligned sequences, the Hamming distance is the proportion of positions where they differ. Our interest now is in estimating the variability between, within and across groups. We make all possible pairwise sequence comparisons within and across groups. Note that now sequences are considered on an individual basis. We use U-statistics to represent the average distance between and within groups as well as the overall distance. The total sum of squares is decomposed into within-, between- and across-group sums of squares. The latter term is new: it does not appear in the usual decomposition. In order to find the distributions of these terms, we use generalized U-statistics theory (Puri & Sen, 1971; Lee, 1990; Sen & Singer, 1993). We develop statistics to test the hypothesis of homogeneity among groups.
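A sketch of the basic building blocks used in Chapter 4 (the Hamming distance and the U-statistic averaging of all within-group pairs); the function names are ours:

```python
from itertools import combinations

def hamming(s1: str, s2: str) -> float:
    """Proportion of positions at which two aligned sequences differ."""
    assert len(s1) == len(s2)
    return sum(a != b for a, b in zip(s1, s2)) / len(s1)

def average_within(group):
    """U-statistic estimate of the mean within-group distance: the average of
    the Hamming distances over all pairs of distinct sequences in the group."""
    pairs = list(combinations(group, 2))
    return sum(hamming(a, b) for a, b in pairs) / len(pairs)

print(average_within(["ACGT", "ACGA", "TCGA"]))
```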
Numerical studies for the test statistics developed in Chapter 3 and data analyses, using the methodologies proposed in this dissertation, are shown in Chapter 5.
The conclusions and some directions for future research are discussed in Chapter
6.
Chapter 2
Modelling the Mutation Process
Our primary interest is in modelling whether there is a mutation or not at each
site. We compare each sequence to the consensus sequence and look for differences.
The response is thus binary and two alternative models are formulated: an autologistic model (Section 2.1), and a Bahadur representation for the joint distribution
of Bernoulli trials (Section 2.2). We assume only pairwise dependence among sites.
Parameter estimation is discussed in each case.
2.1 The Autologistic Model
The autologistic model was introduced by Besag (1972, 1974, 1975) and is well suited to spatial binary data (Cressie, 1993). This parametric family can handle situations involving both spatial correlation and dependence on covariates. Applications include modelling the distribution of plant species in terms of climate variables like temperature and rainfall (Huffer & Wu, 1995a), the effect of soil variables on disease incidence in plants (Gumpertz & Graham, 1995) and network autocorrelation data (Smith, Calloway & Morrisey, 1995). It is important to point out that in these papers the whole data set consists of a single lattice, but in our case we have a number of independent lattices (i.e., sequences).
Let
$$y(i) = \begin{cases} 0 & \text{if there is no mutation at site } i \\ 1 & \text{if there is a mutation at site } i. \end{cases}$$
Construct $1 \times n$ vectors, where $n$ is the total number of sites along a sequence:
$$\mathbf{y} = (y(1)\ \ldots\ y(n)) \quad\text{and}\quad \mathbf{0} = (0\ 0\ \ldots\ 0).$$
The model formulation requires that we define the concept of neighborhood and introduce the positivity condition.
The model formulation requires that we define the concept of neighborhood and introduce the positivity condition.
Definition 2.1
A site j is a neighbor of site i if the conditional distribution of Y(i), given all other
site values, depends functionally on y(j), for j
i- i.
Also define
Ni = {j : j is a neighbor of i}
(2.1.1)
•
to be the neighborhood of site i.
Positivity Condition
Let $Y$ be a discrete variable associated with $n$ sites. Define $\zeta = \{\mathbf{y} : \Pr(\mathbf{y}) > 0\}$ and $\zeta_i = \{y(i) : \Pr(y(i)) > 0\}$, $i = 1, \ldots, n$. Then the positivity condition is satisfied if $\zeta = \zeta_1 \times \cdots \times \zeta_n$. For a continuous variable, the same definition applies except that $\Pr(\cdot)$ is replaced by $f(\cdot)$.
This condition states that the support of the joint distribution is the cartesian product of the supports for the marginal distributions. It implies that considering the elements $\{y(i) : i = 1, \ldots, n\}$ jointly does not rule out combinations allowed in the set $\{y(1)\} \times \{y(2)\} \times \cdots \times \{y(n)\}$. For instance, this condition is invalidated for an infectious-disease model. Because of the one-way nature of infection, the following situation is not allowed: $y(i) = 1$ when $\{y(j) = 0,\ j \in N_i\}$, i.e., $y(i)$ is affected when all the neighbors of $i$ are not affected.
Without loss of generality, assume that 0 can occur at each site. Let
$$Q(\mathbf{y}) = \log\{\Pr(\mathbf{y})/\Pr(\mathbf{0})\}, \quad \mathbf{y} \in \zeta, \quad (2.1.2)$$
where $\zeta$ is the support of the distribution of $\mathbf{Y}$. Then the knowledge of $Q(\cdot)$ is equivalent to the knowledge of $\Pr(\cdot)$, because
$$\Pr(\mathbf{y}) = \exp(Q(\mathbf{y})) \Big/ \sum_{\mathbf{y} \in \zeta} \exp(Q(\mathbf{y}))$$
in the case of discrete $y$'s. A similar formula, with integrals replacing summations, applies for continuous $y$'s.
Proposition 2.1 (Cressie, 1993)
The function $Q$ satisfies the following two properties:
I.
$$\frac{\Pr(y(i) \mid \{y(j) : j \ne i\})}{\Pr(0(i) \mid \{y(j) : j \ne i\})} = \frac{\Pr(\mathbf{y})}{\Pr(\mathbf{y}_i)} = \exp(Q(\mathbf{y}) - Q(\mathbf{y}_i)) \quad (2.1.3)$$
where $0(i)$ denotes the event $Y(i) = 0$ and $\mathbf{y}_i = (y(1), \ldots, y(i-1), 0, y(i+1), \ldots, y(n))$.
II. $Q$ can be expanded uniquely on $\zeta$ as
$$Q(\mathbf{y}) = \sum_{1\le i\le n} y(i)\, G_i(y(i)) + \sum_{1\le i<j\le n} y(i)\, y(j)\, G_{ij}(y(i), y(j)) + \sum_{1\le i<j<k\le n} y(i)\, y(j)\, y(k)\, G_{ijk}(y(i), y(j), y(k)) + \cdots + y(1)\cdots y(n)\, G_{1\ldots n}(y(1), \ldots, y(n)), \quad \mathbf{y} \in \zeta. \quad (2.1.4)$$
Note that although the expansion (2.1.4) is unique, the functions $\{G_{ij\ldots}\}$ are not uniquely specified. By defining $G_{ij\ldots}(y(i), y(j), \ldots) = 0$ whenever one of the arguments is 0, uniqueness is obtained.
Example (Liang, Zeger & Qaqish, 1992)
The representation of $Q(\cdot)$ in (2.1.4) has been used to represent the probability distribution for a vector $\mathbf{x}$ of binary responses in a saturated log-linear model:
$$\Pr(\mathbf{x}) = \exp\Big\{u_0 + \sum_{j=1}^n u_j x_j + \sum_{j<k} u_{jk} x_j x_k + \cdots + u_{12\ldots n}\, x_1 \cdots x_n\Big\} \quad (2.1.5)$$
where there are $2^n - 1$ parameters $\mathbf{u} = (u_1, \ldots, u_n, u_{12}, \ldots, u_{n-1\,n}, \ldots, u_{12\ldots n})'$. These have straightforward interpretations in terms of conditional probabilities. For example,
$$u_j = \mathrm{logit}\{\Pr(x_j = 1 \mid x_k = 0,\ k \ne j)\}, \quad j = 1, \ldots, n,$$
$$u_{jk} = \log \mathrm{OR}(x_j, x_k \mid x_l = 0,\ l \ne j, k), \quad j < k = 1, \ldots, n,$$
and
$$u_{123} = \log \mathrm{OR}(x_1, x_2 \mid x_3 = 1,\ x_l = 0,\ l > 3) - \log \mathrm{OR}(x_1, x_2 \mid x_3 = 0,\ x_l = 0,\ l > 3) \quad (2.1.6)$$
where
$$\mathrm{OR}(v, w) = \frac{\Pr(v = w = 1)\,\Pr(v = w = 0)}{\Pr(v = 1, w = 0)\,\Pr(v = 0, w = 1)}.$$
The implication of properties (I) and (II) is that the expansion (2.1.4) for $Q(\mathbf{y})$ is actually made up of conditional probabilities. For instance,
$$y(i)\, G_i(y(i)) = \log\Big[\frac{\Pr(y(i) \mid \{0(j) : j \ne i\})}{\Pr(0(i) \mid \{0(j) : j \ne i\})}\Big]$$
$$y(i)\, y(j)\, G_{ij}(y(i), y(j)) = \log\Big\{\frac{\Pr(y(i) \mid y(j), \{0(l) : l \ne i, j\})}{\Pr(0(i) \mid y(j), \{0(l) : l \ne i, j\})} \times \frac{\Pr(0(i) \mid \{0(l) : l \ne i\})}{\Pr(y(i) \mid \{0(l) : l \ne i\})}\Big\}$$
From (2.1.2), the joint probability distribution of $\mathbf{y}$, $\Pr(\mathbf{y})$, is proportional to $\exp(Q(\mathbf{y}))$. Finding the proportionality constant as a function of the parameters enables us to write down the full likelihood and to obtain the maximum likelihood estimates of the parameters. Unfortunately, this is not always possible. Further, there is a powerful theorem regarding the form the function $Q$ must take so that the conditional expressions combine consistently into a proper joint distribution. We must first define a clique.
Definition 2.2
A clique is a set of sites that consists either of a single site or of sites that are all neighbors of each other.

Theorem 2.1 (Hammersley-Clifford, 1971)
Suppose that $\mathbf{Y}$ is distributed according to a Markov random field on $\zeta$ that satisfies the positivity condition. Then the negpotential function $Q(\cdot)$ given by (2.1.4) must satisfy the property that if sites $i, j, \ldots, n$ do not form a clique, then $G_{ij\ldots n}(\cdot) = 0$, where cliques are defined by the neighborhood structure $\{N_i : i = 1, \ldots, n\}$.
Assuming only pairwise dependence between sites,
$$Q(\mathbf{y}) = \sum_{1\le i\le n} y(i)\, G_i(y(i)) + \sum_{1\le i<j\le n} y(i)\, y(j)\, G_{ij}(y(i), y(j)) \quad (2.1.7)$$
Since $y(i)$ is either 1 or 0, the only values for the functions $G$ needed in (2.1.7) are $G_i(1) = \alpha_i$ and $G_{ij}(1,1) = \gamma_{ij}$. Thus,
$$Q(\mathbf{y}) = \sum_{i=1}^n \alpha_i\, y(i) + \sum_{1\le i<j\le n} \gamma_{ij}\, y(i)\, y(j) = \log\{\Pr(\mathbf{y})/\Pr(\mathbf{0})\} \quad (2.1.8)$$
where $\gamma_{ij} = 0$ unless sites $i$ and $j$ are neighbors. Then,
$$Q(\mathbf{y}) - Q(\mathbf{y}_i) = \alpha_i\, y(i) + \sum_{j=1}^n \gamma_{ij}\, y(i)\, y(j),$$
where $\gamma_{ij} = \gamma_{ji}$ and, to maintain identifiability of the parameters, $\gamma_{ii} = 0$. Therefore, from (2.1.3),
$$\frac{\Pr(y(i) \mid \{y(j) : j \ne i\})}{\Pr(0(i) \mid \{y(j) : j \ne i\})} = \exp\Big\{\alpha_i\, y(i) + \sum_j \gamma_{ij}\, y(i)\, y(j)\Big\}$$
Because $y(i) = 0$ or 1,
$$\Pr(0(i) \mid \{y(j) : j \ne i\}) = \frac{1}{1 + \exp\big(\alpha_i + \sum_{j=1}^n \gamma_{ij}\, y(j)\big)}$$
and
$$\Pr(y(i) \mid \{y(j) : j \ne i\}) = \frac{\exp\big(\alpha_i\, y(i) + \sum_{j=1}^n \gamma_{ij}\, y(i)\, y(j)\big)}{1 + \exp\big(\alpha_i + \sum_{j=1}^n \gamma_{ij}\, y(j)\big)} \quad (2.1.9)$$
Even with just pairwise dependence this formulation may involve too many parameters for data sets of moderate size, and we need to reduce the number of parameters by imposing some constraints. For instance,

For the $\alpha$'s:
1. $\alpha_i$'s all different;
2. $\alpha_i = \alpha$, for all $i$;
3. a different $\alpha_i$ for each specific amino acid (or nucleotide) in the consensus sequence.

For the $\gamma$'s:
1. $\gamma$'s all different;
2. $\gamma_{ij} = 0$ if $|i - j| > d$, where $d$ is some chosen threshold;
3. $\gamma_{ij} = \gamma\, \rho^{|i-j|}$;
4. equal $\gamma_{ij}$ for each pair of specific amino acids (or nucleotides) in the consensus sequence;
5. $\gamma_{ij} = \gamma$, for all $i, j$.

2.1.1 Estimation Procedure
Suppose the data consist of $m$ independent sequences with $n$ sites. In fact, we have $y_k(i)$, where $i = 1, \ldots, n$ and $k = 1, \ldots, m$. Then, $\mathbf{y}(i)$ is an $m \times 1$ vector.
We can get approximate estimates by using the pseudolikelihood function $L_p$ (Besag, 1975), i.e., the product of the conditional probabilities, which specifies the model assuming independence among sites:
$$L_p(\theta;\, \mathbf{y}(1), \ldots, \mathbf{y}(n)) = \prod_{i=1}^n \Pr(y(i) \mid \{y(j) : j \ne i\};\, \theta),$$
where $\theta = (\alpha_1, \ldots, \alpha_n, \gamma_{12}, \ldots, \gamma_{n-1\,n})'$ is the parameter vector. Since the sequences are independent, we can write
$$L_p(\theta;\, \mathbf{y}(1), \ldots, \mathbf{y}(n)) = \prod_{k=1}^m \prod_{i=1}^n \Pr(y_k(i) \mid \{y_k(j) : j \ne i\}) = \prod_{k=1}^m \prod_{i=1}^n \frac{\exp\big[\alpha_i\, y_k(i) + \sum_{j=1}^n \gamma_{ij}\, y_k(i)\, y_k(j)\big]}{1 + \exp\big[\alpha_i + \sum_{j=1}^n \gamma_{ij}\, y_k(j)\big]} \quad (2.1.10)$$
The maximum pseudolikelihood estimate (MPLE) is the value of $\theta$ which maximizes $L_p$, i.e., (2.1.10). The MPLE is consistent and asymptotically normal (Comets, 1992; Comets & Gidas, 1992). It has some disadvantages: its efficiency is unknown and expected to be inferior to that of the maximum likelihood estimate (MLE). Also, there is no direct way of obtaining standard errors for the estimates.
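A sketch of the log-pseudolikelihood corresponding to (2.1.10), assuming the binary 0/1 coding above (maximizing this over $(\alpha, \gamma)$ with a numerical optimizer would give the MPLE; the array layout is ours):

```python
import numpy as np

def log_pseudolikelihood(alpha, gamma, y):
    """Log of the pseudolikelihood (2.1.10) for the autologistic model.
    y: (m sequences, n sites) 0/1 array; alpha: (n,); gamma: symmetric (n, n)
    with zero diagonal (gamma[i, j] = 0 unless i and j are neighbors)."""
    eta = alpha + y @ gamma  # eta[k, i] = alpha_i + sum_j gamma_ij * y_k(j)
    # Each factor contributes y_k(i)*eta - log(1 + exp(eta)).
    return float((y * eta - np.logaddexp(0.0, eta)).sum())

rng = np.random.default_rng(1)
y = rng.integers(0, 2, size=(5, 8)).astype(float)
alpha = np.zeros(8)
gamma = np.zeros((8, 8))
print(log_pseudolikelihood(alpha, gamma, y))
```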
In order to calculate the MLE we need to proceed by Markov-chain Monte-Carlo simulation, as proposed by Geyer & Thompson (1992). We can write the joint distribution of $\mathbf{y} = (y(1), \ldots, y(n))$ as
$$\Pr\{y(1), \ldots, y(n);\, \theta\} = C(\theta)\, F(y(1), \ldots, y(n);\, \theta),$$
where $F$ is an explicitly computable function and $C(\theta) = \Pr\{0(1), \ldots, 0(n);\, \theta\}$. For the autologistic model,
$$F(y(1), \ldots, y(n);\, \theta) = \exp\Big\{\sum_{i=1}^n \alpha_i\, y(i) + \sum_{1\le i<j\le n} \gamma_{ij}\, y(i)\, y(j)\Big\}. \quad (2.1.11)$$
Fix some specific value of $\theta$, $\theta_0$ say. Suppose we can generate a random vector $(Y'(1), \ldots, Y'(n))$ from the distribution with $\theta = \theta_0$. The random variable
$$\frac{F(Y'(1), \ldots, Y'(n);\, \theta)}{F(Y'(1), \ldots, Y'(n);\, \theta_0)}$$
has mean
$$\sum_{y'(1),\ldots,y'(n)} \frac{F(y'(1), \ldots, y'(n);\, \theta)}{F(y'(1), \ldots, y'(n);\, \theta_0)}\, C(\theta_0)\, F(y'(1), \ldots, y'(n);\, \theta_0) = C(\theta_0) \sum_{y'(1),\ldots,y'(n)} F(y'(1), \ldots, y'(n);\, \theta) = \frac{C(\theta_0)}{C(\theta)},$$
since
$$C(\theta) \sum_{y'(1),\ldots,y'(n)} F(y'(1), \ldots, y'(n);\, \theta) = 1.$$
Then, suppose we have a (not necessarily independent) set of random vectors $(y^{(r)}(1), \ldots, y^{(r)}(n))$, $r = 1, 2, \ldots, R$, each drawn from the distribution $C(\theta_0)\, F(\cdot\,;\, \theta_0)$ for some fixed $\theta_0$. Denote the observations by $(Y(1), \ldots, Y(n))$. The function
$$H(\theta) = \frac{1}{R} \sum_{r=1}^R \frac{F(y^{(r)}(1), \ldots, y^{(r)}(n);\, \theta)}{F(y^{(r)}(1), \ldots, y^{(r)}(n);\, \theta_0)} \times \frac{F(Y(1), \ldots, Y(n);\, \theta_0)}{F(Y(1), \ldots, Y(n);\, \theta)} \quad (2.1.12)$$
is an unbiased estimator of the likelihood ratio of $\theta_0$ to $\theta$,
$$\frac{C(\theta_0)\, F(Y(1), \ldots, Y(n);\, \theta_0)}{C(\theta)\, F(Y(1), \ldots, Y(n);\, \theta)},$$
since
$$E[H(\theta)] = \frac{1}{R}\, \frac{F(Y(1), \ldots, Y(n);\, \theta_0)}{F(Y(1), \ldots, Y(n);\, \theta)} \sum_{r=1}^R E\Big[\frac{F(y^{(r)}(1), \ldots, y^{(r)}(n);\, \theta)}{F(y^{(r)}(1), \ldots, y^{(r)}(n);\, \theta_0)}\Big] = \frac{C(\theta_0)\, F(Y(1), \ldots, Y(n);\, \theta_0)}{C(\theta)\, F(Y(1), \ldots, Y(n);\, \theta)}.$$
To obtain the MLE of $\theta$, use equation (2.1.12), based on a single simulated sequence, and minimize it with respect to $\theta$ (Geyer & Thompson, 1992).
Here is a summary of the procedure:

1. Choose a model, as in equation (2.1.9).

2. Use the MPLE to generate an initial point estimate ($\theta_0$) for the unknown parameter vector.

3. Use Gibbs sampling or the Metropolis algorithm to generate a simulated sequence of random vectors from the distribution with parameter vector $\theta_0$. Discard an initial warm-up sample, then generate random vectors $y^{(1)}, y^{(2)}, \ldots$. Select $R$ vectors ($R$ should be at least 1000) by sampling every $k$-th vector (every 50th, say) because of the dependence between successive vectors.

4. Use equation (2.1.12) to define $H(\theta)$, the simulated likelihood ratio of $\theta_0$ to $\theta$.

5. Using numerical optimization, maximize the function $-\log H(\theta)$ to obtain the MLE $\hat\theta$, and use the equivalent form of the observed information matrix to evaluate standard errors for these estimates.

Alternatively in step 3, instead of sampling every $k$-th vector, $R$ samplers can be initiated at different values, discarding an initial warm-up at each.
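A sketch of steps 4-5, i.e., the Monte-Carlo estimate $-\log H(\theta)$ of (2.1.12). For the toy model used here (a single common $\alpha$ and $\gamma = 0$), $\theta_0 = 0$ makes the model uniform on $\{0,1\}^n$, so fair coin flips are exact draws from the $\theta_0$ distribution; everything else is an illustrative assumption:

```python
import numpy as np

def F_log(y, theta):
    """log F for a toy autologistic model with one common alpha and gamma = 0,
    so F(y; theta) = exp(theta * sum_i y(i))."""
    return theta * y.sum()

def neg_log_H(theta, theta0, sims, y_obs):
    """-log H(theta) from (2.1.12): an importance-sampling estimate of the
    likelihood ratio, based on R vectors simulated under theta0."""
    log_ratios = np.array([F_log(s, theta) - F_log(s, theta0) for s in sims])
    log_H = (np.log(np.exp(log_ratios).mean())
             + F_log(y_obs, theta0) - F_log(y_obs, theta))
    return -log_H

rng = np.random.default_rng(2)
sims = rng.integers(0, 2, size=(1000, 8)).astype(float)  # exact draws at theta0 = 0
y_obs = rng.integers(0, 2, size=8).astype(float)
print(neg_log_H(0.3, 0.0, sims, y_obs))  # maximize -log H over theta for the MLE
```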
The Gibbs sampler algorithm (Geman & Geman, 1984; Gelfand & Smith, 1990) and the Metropolis algorithm (Metropolis et al., 1953) are special cases of the Hastings algorithm (Hastings, 1970; Peskun, 1973). An explanation of how these algorithms work and of the differences between them follows.
Hastings' Algorithm
Suppose we want to generate a discrete or continuous random variable (scalar or vector) $Y$ which takes values in a space $\mathcal{T}$ with density $\pi(t)$. We assume that $\pi(\cdot)$ is completely specified up to a normalizing constant. We also select a Markov transition kernel $q(s,t)$, $s, t \in \mathcal{T}$, which is used to generate trial values of $Y$. In the discrete case, the interpretation is that if the current value of $Y$ is $s$, the next trial value is $t$ with probability $q(s,t)$. The choice of $q(\cdot,\cdot)$ is almost arbitrary, and the performance of the algorithm may vary according to this choice. The algorithm follows.

Step 0: Take an arbitrary starting value $Y = t_0$ and set $i = 0$.

Step 1: Given $Y = t_i = s$, choose a new trial value according to the probability distribution $q(s,t)$, $s, t \in \mathcal{T}$, where $\mathcal{T} = \{0, 1\}$ in our binary setting and
$$q(s,t) = \Pr(Y_1 = t \mid Y_0 = t_0 = s)$$
is an arbitrary Markov kernel (generating an irreducible and aperiodic chain).

Step 2: Calculate
$$a(s,t) = \min\Big\{\frac{\pi(t)\, q(t,s)}{\pi(s)\, q(s,t)},\ 1\Big\},$$
where $\pi(t)$ is a distribution specified up to a normalizing constant. Here $\pi(\cdot) \equiv F(\cdot\,;\, \theta_0)$, where $F$ is given by (2.1.11).

Step 3: Generate a random variable
$$U = \begin{cases} 1 & \text{with probability } a(s,t) \\ 0 & \text{with probability } 1 - a(s,t). \end{cases}$$

Step 4: If $U = 1$, move to $t$, i.e., $t_{i+1} = t$. Otherwise $t_{i+1} = t_i$.

Step 5: Set $i := i + 1$.
Metropolis' Algorithm
For the Metropolis algorithm, $q$ is taken to be a symmetric random walk, i.e., $q(s,t) = q(t,s)$. Then, in Step 2,
$$a(s,t) = \min\Big\{\frac{\pi(t)}{\pi(s)},\ 1\Big\}.$$
The Gibbs Sampler Algorithm
For the Gibbs sampler, each of the updating steps corresponds to generating a new value from a particular conditional distribution, i.e., we always accept the new value. Then, in Step 2, $a(s,t) = 1$. In our specific case, we are generating $m$ independent sequences of length $n$, and the Gibbs sampler proceeds as follows:

Step 0: Take arbitrary starting values, say $(y_1, y_2, \ldots, y_n)$. Define $y_j^{(0)} = y_j$, $j = 1, \ldots, n$. Let $i = 0$.

Step 1: Given the current values of $y_2^{(i)}, y_3^{(i)}, \ldots, y_n^{(i)}$, generate a new pseudo-random value from the conditional distribution of $Y_1$ given $Y_j = y_j^{(i)}$, $j = 2, \ldots, n$, and call it $y_1^{(i+1)}$. This probability distribution is given in equation (2.1.9), i.e.,
$$\Pr(y(i) \mid \{y(j) : j \ne i\}) = \frac{\exp\big(\alpha_i\, y(i) + \sum_{j=1}^n \gamma_{ij}\, y(i)\, y(j)\big)}{1 + \exp\big(\alpha_i + \sum_{j=1}^n \gamma_{ij}\, y(j)\big)}.$$

Step 2: Given the current values of $y_1^{(i+1)}, y_3^{(i)}, \ldots, y_n^{(i)}$, generate a pseudo-random value from the conditional distribution of $Y_2$ given $Y_1 = y_1^{(i+1)}$, $Y_j = y_j^{(i)}$, $j = 3, \ldots, n$, and call it $y_2^{(i+1)}$.

Continue updating one component at a time until...

Step n: Given the current values of $y_1^{(i+1)}, y_2^{(i+1)}, \ldots, y_{n-1}^{(i+1)}$, generate a pseudo-random value from the conditional distribution of $Y_n$ given $Y_j = y_j^{(i+1)}$, $j = 1, \ldots, n-1$, and call it $y_n^{(i+1)}$.

Step n+1: Set $i := i + 1$ and return to Step 1.

A single cycle through Steps 1 to n+1 completes one iteration of the algorithm. We will discard an initial warm-up sample, then generate $R$ random vectors by sampling the chain every $k$ cycles.
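A sketch of one Gibbs cycle for a single binary sequence under the conditional distribution (2.1.9), assuming $\gamma$ has a zero diagonal (parameter values in the example are arbitrary; for $m$ sequences, the cycle would be run independently per sequence):

```python
import numpy as np

def gibbs_cycle(y, alpha, gamma, rng):
    """One full Gibbs cycle (Steps 1 to n) for one sequence y (0/1 array),
    updating each site from its conditional distribution (2.1.9)."""
    n = y.size
    for i in range(n):
        # gamma[i, i] = 0 by construction, so this sums over j != i.
        eta = alpha[i] + gamma[i] @ y
        p1 = 1.0 / (1.0 + np.exp(-eta))   # Pr(y(i) = 1 | all other sites)
        y[i] = float(rng.random() < p1)
    return y

rng = np.random.default_rng(3)
n = 10
alpha, gamma = np.full(n, -1.0), np.zeros((n, n))
y = rng.integers(0, 2, size=n).astype(float)
for _ in range(500):   # warm-up cycles, to be discarded
    y = gibbs_cycle(y, alpha, gamma, rng)
print(y)
```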
2.2 Model based on the Bahadur Representation
Let $\mathbf{Y}_k = (Y_k(1)\ \ldots\ Y_k(n))$ be the $1 \times n$ vector representing the binary responses for the $n$ sites along sequence $k$, i.e., whether there is a mutation or not at each site along the sequence. Since each $Y_k(i)$ ($i = 1, \ldots, n$) assumes the value 1 or 0, the random variable $Y_k(i)$ is a Bernoulli trial with parameter $\xi_i = \Pr\{Y_k(i) = 1\}$. Then, $E[Y_k(i)] = \xi_i$ and $\mathrm{Var}[Y_k(i)] = \xi_i(1 - \xi_i)$.
Let
$$z_k(i) = \frac{Y_k(i) - \xi_i}{\sqrt{\xi_i(1 - \xi_i)}}$$
and
$$r_{ij} = E[z_k(i)\, z_k(j)], \quad r_{ijl} = E[z_k(i)\, z_k(j)\, z_k(l)], \quad \ldots$$
So, the $r_{ij}$ are second-order correlations, the $r_{ijl}$ third-order correlations, and so on. Hence, there are $\binom{n}{2} + \binom{n}{3} + \cdots + \binom{n}{n} = 2^n - n - 1$ correlation parameters.
Denote by $P_{[1]k}$ the joint distribution of the $Y_k(i)$'s when they are independent, i.e.,
$$P_{[1]k}(y_k(1), \ldots, y_k(n)) = \prod_{i=1}^n \xi_i^{y_k(i)} (1 - \xi_i)^{1 - y_k(i)} \quad (2.2.1)$$
Let $P(\mathbf{y}_k)$ be the distribution of $\mathbf{Y}_k$, where $\mathbf{Y}_k = (Y_k(1)\ \ldots\ Y_k(n))$ is a $1 \times n$ vector denoting the response random vector for the $k$-th sequence.

Proposition (Bahadur, 1961)
For every $\mathbf{y}_k = (y_k(1), \ldots, y_k(n))$,
$$P(\mathbf{y}_k) = P_{[1]k}(\mathbf{y}_k)\, f(\mathbf{y}_k) \quad (2.2.2)$$
where
$$f(\mathbf{y}_k) = 1 + \sum_{i<j} r_{ij}\, z_k(i)\, z_k(j) + \cdots + r_{12\ldots n}\, z_k(1)\, z_k(2) \cdots z_k(n) \quad (2.2.3)$$
If we assume pairwise dependence only, the distribution of $\mathbf{Y}_k$ is
$$P(\mathbf{y}_k) = P_{[1]k}(\mathbf{y}_k)\Big[1 + \sum_{i<j} r_{ij}\, z_k(i)\, z_k(j)\Big] = \prod_{i=1}^n \xi_i^{y_k(i)} (1 - \xi_i)^{1 - y_k(i)} \Big[1 + \sum_{i<j} r_{ij}\, z_k(i)\, z_k(j)\Big] \quad (2.2.4)$$
2.2.1 Estimation Procedure

Let $\theta = (\xi_1, \ldots, \xi_n, r_{12}, \ldots, r_{1n}, r_{23}, \ldots, r_{2n}, \ldots, r_{n-1\,n})'$. An estimate of $\theta$ can be obtained by maximum likelihood. The likelihood function for sequence $k$ is given by (2.2.4). For $m$ independent sequences, the likelihood function is
$$L(\theta) = \prod_{k=1}^m \prod_{i=1}^n \xi_i^{y_k(i)} (1 - \xi_i)^{1 - y_k(i)} \Big[1 + \sum_{i<j} r_{ij}\, z_k(i)\, z_k(j)\Big] \quad (2.2.5)$$
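A sketch evaluating the log of (2.2.5) for given parameter values (our own illustrative code; note that the Bahadur correction factor must stay positive, which restricts the admissible correlations):

```python
import numpy as np

def bahadur_log_likelihood(xi, r, y):
    """Log-likelihood (2.2.5) under the pairwise Bahadur representation.
    xi: (n,) site-wise mutation probabilities; r: symmetric (n, n) matrix of
    second-order correlations with zero diagonal; y: (m, n) 0/1 data array."""
    z = (y - xi) / np.sqrt(xi * (1.0 - xi))      # standardized responses
    bern = (y * np.log(xi) + (1 - y) * np.log(1 - xi)).sum(axis=1)
    # 1 + sum_{i<j} r_ij z_k(i) z_k(j), via the quadratic form (diag(r) = 0)
    corr = 1.0 + 0.5 * np.einsum('ki,ij,kj->k', z, r, z)
    if np.any(corr <= 0):                        # outside the valid region
        return -np.inf
    return float((bern + np.log(corr)).sum())

rng = np.random.default_rng(4)
y = rng.integers(0, 2, size=(6, 5)).astype(float)
xi = np.full(5, 0.4)
r = np.zeros((5, 5))
print(bahadur_log_likelihood(xi, r, y))
```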
2.3 Models for Three Categories

In this section, we extend the number of categories of interest, investigating not only whether there is a mutation, but also the type of mutation. The events can fall into one of three categories: transitions (A ↔ G or C ↔ T), transversions (A ↔ C, A ↔ T, G ↔ C, or G ↔ T) or no mutation. An extension of the autologistic model is proposed by creating dummy variables to represent the categories (Section 2.3.2).
2.3.1 Overview of the Literature

Little has been done to extend the classical autologistic model. Strauss (1977) thought of a black-and-white picture as a binary lattice and generalized his analysis to a multicolored representation. With a set of $n$ sites, associate with site $i$ a random variable $Y_i$ determining its color. The 'color' $y_i = 0$ is available at each site. Therefore $\Pr(\mathbf{0}) > 0$ and, as in equation (2.1.2), we can define
$$Q(\mathbf{y}) = \log\{\Pr(\mathbf{y})/\Pr(\mathbf{0})\}, \quad (2.3.1)$$
where $\mathbf{y} = (y_1, \ldots, y_n)$. Following equation (2.1.4), he expands $Q(\mathbf{y})$ as
$$Q(\mathbf{y}) = \sum_{1\le i\le n} y_i\, G_i(y_i) + \sum_{1\le i<j\le n} y_i\, y_j\, G_{ij}(y_i, y_j) + \cdots + y_1 \cdots y_n\, G_{1\ldots n}(y_1, \ldots, y_n) \quad (2.3.2)$$
Consider now the special case where the sites are arranged in a regular lattice, and each site has one of $c+1$ colors, denoted by $y_i = 0, 1, \ldots, c$. Two assumptions are needed to obtain a workable statistical model.

(a) Homogeneity
If $S$ is a set of sites, then $\Pr\{Y_i = y_i, Y_j = y_j, \ldots, Y_s = y_s : (i, j, \ldots, s) \in S\}$ should be the same for any set of sites derived from $S$ by translation, rotation or reflexion. This condition requires that all subscripts be dropped from the functions $G$.

(b) Pairwise dependence only

With assumptions (a) and (b), we have
$$Q(\mathbf{y}) = \sum_{1\le i\le n} y_i\, G(y_i) + \sum_{1\le i<j\le n} y_i\, y_j\, G(y_i, y_j).$$
For the multicolored case, it is more convenient to write
$$u_r = r\, G(r) \quad\text{and}\quad v_{rs} = r\, s\, G(r, s).$$
Then
$$Q(\mathbf{y}) = \sum_{r=1}^c m_r u_r + \sum_{r=1}^c \sum_{s=1}^c n_{rs} v_{rs} \quad (2.3.3)$$
where $m_r$ is the number of sites with color $r$ and $n_{rs}$ is the number of pairs of neighboring sites where one has color $r$ and the other color $s$, and
$$\Pr(\mathbf{y}) \propto \exp\Big\{\sum_{r=1}^c m_r u_r + \sum_{r=1}^c \sum_{s=1}^c n_{rs} v_{rs}\Big\}.$$
Equation (2.3.3) usually involves too many parameters, and in analyzing actual data one hopes that the following assumption is valid.

(c) Color indifference
Suppose that for any pair of neighboring sites $i$ and $j$,
$$Q(0, \ldots, 0, y_i, 0, \ldots, 0, y_j, 0, \ldots, 0) - Q(0, \ldots, 0, y_j, 0, \ldots, 0)$$
is independent of $y_j$, for $y_j \ne y_i$. This means that the affinity between any two colors is the same as that between any other two, which may be appropriate when the colors are unordered. It follows from (2.3.3) that $v_{rs}$ is constant for all $r, s$ ($r \ne s$). Since $\sum_{r \ne s} n_{rs} + \sum_r n_{rr}$ is fixed, for a given lattice, reparametrize and take $v_{rs} = 0$ for $r \ne s$.
Thus, with assumptions (a)-(c), the model becomes
$$\Pr(\mathbf{y}) \propto \exp\Big(\sum_r m_r u_r + \sum_s n_s v_s\Big), \quad (2.3.4)$$
where the subscripts of the $n$'s and $v$'s have been dropped, and here $n_s$ represents the number of pairs of sites with the same color. It is possible to express (2.3.4) in a conditional form: for any internal site $y_{rs}$,
$$\Pr(y_{rs} = j \mid \text{other sites}) = \frac{\exp(u_j + v_j\, n_{jrs})}{1 + \sum_{i=1}^c \exp(u_i + v_i\, n_{irs})} \quad (2.3.5)$$
where $n_{jrs}$ is the number of neighbors of site $(r, s)$ colored $j$.
Because of limitations in the estimation procedure, the model is simplified even
further and reduces to a single parameter v or u. When conditioning on the observed
mr's, the ur's become nuisance parameters and Vr is the focus of attention, where Vr
measures the" clustering tendency". Then
Pr(y; vb"" Vc) ex exp
(~nsvs)
(2.3.6)
Assuming that the strength of attraction is the same for all colors; I.e., VI =
... = V c = V,
V2
(2.3.6) becomes
Pr(y; v) ex exp( v y),
(2.3.7)
where Y is the number of adjacencies of like color. When conditioning on the observed
ns's, the v/s become the nuisance parameters and Ur is the" coloring tendency".
Then,
Pr(y; UI,'
•• ,
uc) ex exp
(~mrur)
(2.3.8)
If we assume that the" coloring tendency" is the same for all colors, i.e, UI =
... = U c = u,
U2
=
(2.3.8) becomes
Pr(y; u) ex exp( U y),
(2.3.9)
where Y is the number of adjacencies with the same color.
A x2-approximation to the distribution of Y is obtained after calculating the
first two cumulants of y. The MLE for v or
U
is calculated from this distribution.
The resulting models are so simplistic that they are of little practical use.
33
2.3.2
An Alternative Extension of the Autologistic Model
Let Yk(i)
= (Y1k(i) Y2k(i))',
where
{1 if site i of sequence k is in category 1
.
Y1k(Z) =
o otherwise
and
.
Y2k(Z)
Note that if Y1k(i)
=
{1 if site i of sequence k is in category 2
o otherwise
= 0 and Y2k(i) = 0 there is no mutation at site i
of sequence k.
Thus, we have Yk(i) = (00)' if there is no mutation at site i of sequence k, Yk(i) =
(1 0)' if the substitution falls in category 1 and Yk(i) = (0 1)' if the mutation falls in
category 2.
Then,
n
Q(y) -
L:y~(i) Gi(Yk(i)) + L: Y1k(i) Y1k(j) G~~/)(Y1k(i), Y1k(j))
i=1
1$i<i$n
+ L: Y2k(i) Y2k(j) G~J2)(Y2k(i), Y2k(j))
1$i<i$n
+ L: Y1k( i) Y2k(j) G~J2) (Y1k( i), Y2k(j))
1$i<i$n
n
L:[Y1k(i) GP)(Y1k(i)) + Y2k(i) G?\Y2k(i))]
i=1
+
Y1k(i) Y1k(j) G~J1)(Y1k(i)" Y1k(j))
1$i<i$n
L:
+ L: Y2k(i) Y2k(j) G~~/)(Y2k(i),
1$i<i$n
+ L: Y1k(i) Y2k(j) G~J2)(Y1k(i),
Y2k(j))
Y2k(j))
1$i<i$n
Assuming homogeneity, all subscripts can be dropped from the functions G, which
yields
n
Q(y) -
L:[Y1k(i) G(1) (Y1k(i)) + Y2A:(i) G(2) (Y2k(i))]
i=1
+
Y1k( i) Y1k(j) G(1,1) (Y1k( i) Y1k(j))
1$i<i$n
L:
34
'E
+ 'E
+
Y2k(i) Y2k(j) G(2,2) (Y2k(i), Y2k(j))
l~i<j~n
Ylk( i) Y2k(j) G(1,2) (Ylk( i) Y2k(j))
1$i<j~n
m1k a1
where G(1)
+ m2k a2 + n11k III + n22k 122 + n12k 112
= a1 , G(2) = a2 , G(1,l) = III , G(2,2) = 122 , G(1,2) = 112.
mlk and m2k are
the numbers of sites along sequence k in category 1 and category 2, respectively.
n11k, n22k and n12k are the numbers of pairs of sites from sequence k with both in
category 1, with both in category 2, and with one in category 1 and the other in
category 2, respectively.
If we do not assume homogeneity among sites,
Q(Yk) - Q(Yki) =
Ylk(i) ali
+ Y2k(i) a2i
n
+ 'E[Y1k( i) Ylk(j) IB,l) + Y2k( i) Y2k(j) Ig,2) + Ylk( i) Y2k(j) Ig,2)]
j=l
h
GIP)
were
-
. GP) -
ah ,
I
-
. ' IG~~,l)
)
a21
=
=
P,l) G~~,2)
II)'
I)
~~,2) G~~,2)
II)'
I)
_
-
~~,2)
II)
Therefore,
Pr(Yk(i) I {Yk(j) : j
Pr(Ok(i) I {Yk(j) : J"
-I i})
-I i})
= exp{Q(y) - Q(Yi)}
n
= exp{y'(i) a
+ 'E(Ylk(i) Y1k(j)
Y2k(i) Y2k(j) Ylk(i) Y2k(j)) 1'}
j=l
= (ali a2i )' ,Yk (")Z = (Y1k (")Z Y2k ("))'
Z
,1' = (lij(1,1) lij(2,2) lij(1,2)),
and O(i) = (00)'.
W h ere
a
So,
Pr(Yk(i) I {Yk(j):j
-I i}) =
exp(B'Yk(i))
1 + exp(B1 ) + exp(B 2 )
where Yk(i) E {(O 0)', (1 0)', (0 1)'} and B = (B 1 B 2 )' with
n
n
j=l
j=l
B 1 = ali
L..J Y1 (")
J lij(1,1) +'"'
L..J Y2 (")
J lij(1,2)
+ '"'
B 2 = a2i
(") (2,2) + '"'
(") (1,2)
L..J Y2 J lij
L..J Y1 J lij .
+ '"'
and
n
.
j=l
n
j=l
35
(2.3.10)
Estimation Procedure
The estimation procedure in the three-category case is analogous to that of the
classical autologistic model (Section 2.2). Only now, we have to generate vectors on
the set {(O 0)', (1 0)', (0 I)'}. Recall that the vector (0 0)' represents no mutation,
(1 0)' category 1 and (0 1)' category 2. These vectors are generated from the conditional distribution (2.3.10), with initial estimates obtained by maximization of the
pseudo-likelihood function.
•
36
Chapter 3
Analyzing the Variability in DNA
Sequences
The focus here is the comparison of sets of sequences. For example, we are
interested in comparing DNA sequences from the human immunodeficiency virus
(HIV) from different geographical areas to see whether the variability is similar in
each. Similarly, when we study several individuals and obtain a set of sequences from
each individual at different time points, our interest lies in estimating the variability
between and within individuals.
Simpson (1949) proposed a measure of diversity for categorical data in terms of
frequencies for each category. On the basis of a similar measure of variation, Light &
Margolin (1971) developed an analysis of variance (CATANOVA), for one-way tables,
suitable for categorical variables. The properties of the components of variation are
investigated under a common multinomial modeL This framework can be used to
compare the variability of the response variable at a single position between and
within groups. We consider a number of sites because, in the context of interest (the
analysis of HIV-l sequences), a single position yields little information. Components
of variation are derived from the fact that the sum of squares of deviations from the
mean can be expressed as a function of the squares of the pairwise differences for all
possible pairs (Section 3.1). The sequences are not considered on an individual basis
but only as contributing to the overall variability in the distribution of the categorical
37
response. We partition the measures of diversity according to the factors considered
(Section 3.2) assuming independence among positions (Section 3.3). We develop a
test statistic for the null hypothesis of homogeneity among groups (Sections 3.4 and
3.5) and assess its power (Section 3.6).
3.1
Variation in Categorical Data
For categorical data, the mean is an ill-defined concept. Therefore, measures
of variation, such as the variance, which are meaningful for continuous variables, no
longer apply. Gini (1912) found an alternative way of characteryzing variation and
developed a measure of variation for categorical data.
Let Xl, X 2, ... , X N denote measurements of N independent experimental units.
The variance of X may be expressed as E4>(X1 , X2)~, where 4>(a, b)
= t(a -
b)2 (Ho-
effding, 1948). In a similarfashion, the sum of squares is
ss
1
N
i=l
N
2N i=l j=l
1
N N
1
""
""
d~.
=
LJLJ IJ
N
2N i=l
j=l
-
N
I:(Xi -X? = -I:I)Xi
_Xj )2
(3.1.1)
""
LJ d~IJ.
l~i~N
N
where X = 2:i=l Xi/N and dij = Xi - Xj.
In the present context, each Xi falls into one of C possible categories. Define
d(Xi , Xj)
= dij
as
d,; = {
1 if Xi and X j belong to different categories
o if Xi
and X j belong to the same category.
(3.1.2)
Definition 3.1
The variation for categorical responses Xl, . .. , X N i.s
(3.1.3)
•
where dij is defined in (3.1.2).
38
As each response assumes one and only one of the C possible categories, the data
is summarized by the vector 4t = (nl"'" nc) where ni is the number of responses
in the ith category (i
= 1, ... , C),
so that L:~1 nj
= N.
Then the variation in the
responses is defined as
(3.1.4)
If PI, ... ,Pc stand for the probabilities of X belonging to these C categories, the
Simpson's Index of ecological diversity is defined as
c
Is(p) = 1 - p'p = 1-
LP;
(3.1.5)
i=1
and its corresponding sample counterpart is
c
is(p) = 1 - p'p = 1 -
L P~,
(3.1.6)
i=1
where
Pi = n;f N,
i = 1, ... , C relate to the sample proportion. Therefore, we have
The definitions (3.1.4) and (3.1.5) are motivated by two properties.
1. The variation of N categorical responses is minimized if and only if they all
belong to the same category, i.e., Pi
= 1,
Vi
= 1, ... , C.
2. The variation of N responses is maximized when the responses are distributed
among the available categories as evenly as possible, i.e., Pi = l/C, V i
=
1, ... ,C.
3.2
Partitioning the Measures of Diversity
Let
Xr =
(XYl' X~,
group g. Suppose i
= 1,
, XYK)' be a random vector representing sequence i of
, N, k
= 1, ... , K
39
and 9
= 1, ... , G.
So, XYk represents
position k of sequence i of group g. Xrk is a categorical variable assuming C (unordered) categories. For instance, if comparisons are made at the nucleotide level,
x~k
E {A, C, T, G} and there are 4 categories.
First, assume there is only one position for each sequence. We summarize the
data in Table 3.1.
Table 3.1: Summary of the Data (one position)
Group
....
Xl1
2
x2
1
3
x3
2
Xl2
x2
x3
3
Xl3
x 32
... 2
x 33 ... x G
3
N
xlN
2
xN
xt
Sequence
1
1
2
....
1
G
xG
1
xG
2
...
xG
N
Now dij is defined as
do;
1 if XflI
...J.
X!!J
= { o if Xf I# XJ .
The total number of responses is
Gee
G
NG = L:n. g = L:ne. = ~=L:neg
g=l
e=l
e::::1 g=l
where neg is the number of responses in category c for group 9 and N
= n.g = L:~=1 neg
is the number of responses for group g, which here i.s simply the number of sequences
in each group. The Total Simpson Index (TSI) is
(n
C _.e.)2
TSI=l-L:
e=l NG
(3.2.1)
The dispersion within group 9 (i.e., within xf, x~, ... , x!j.,) is
(3.2.2)
40
Therefore, the within-group Simpson Index (WSI) is found by averaging (3.2.2) over
all g's:
W SI
1
G{1 - ~C()2}
~~; = 1 -
= G];
GC(~c~) 2
G ]; ~
(3.2.3)
The between-group Simpson Index (BSI) is
(3.2.4)
Now, assume there are K positions along each sequence.
(Xfl' XY2' ... ,XYK)' and
We have X~ =
xg = (Xi, X~, ... ,X~)'. The data are summarized in Table
3.2.
The total number of responses is
eKe G K
G
NGK
= 2: n.g. = 2: n c.· = 2: n.·k = 2: 2: 2: ncgk
g=1
k=1
c=1
c=1 g=1 k=1
The interest is in assessing the homogeneity among groups: the null hypothesis
is that
Pcgk
=
Pck
where
Pcgk
is the population probability of belonging to category c
in group g at position k. One could argue that this is the classical Pearson's X2 test,
but note that the classical Pearson's X 2 statistic for Table 3.2 is
G
C
K
2:2:2:
g=1 c=1 k=1
G
C
K
ncgk
nc.k ) 2
( -----
N
NG
nc·k / N G
G ( ncgk
_
n~k) 2
- 2: 2: 2: ~_--=--<.g=1 c=1 k=1
N nc·k
with K(G - 1)(C - 1) degrees of freedom. The limiting X2-distribution is a close
approximation only when the cell frequencies
ncgk's
are all large (at least 5). In
analyzing amino-acid sequences, we know that these conditions are not met. The
distribution at a single position usually exhibits a few polymorphisms with very low
frequencies. Just as for Fisher's exact test, the exact null distribution is difficult to
implement for small values of N, when G or K is not small. Moreover, if the number
of degrees of freedom of the x2-statistic is large but the noncentrality parameter is
not proportionally so, the resulting test is likely to have less power than some tests
41
Table 3.2: Contingency Table (K positions)
Group
Position
1
2
...
C
1
1
nlll
n2l1
...
nCll
1
2
nll2
n212
...
nCl2
nllK
n2lK
.. .
...
nCIK
nll·
n21·
1
K
2
1
nl21
n221
...
...
2
2
nl22
n222
...
nC22
nl2K
n22K
.. .
...
nC2K
n12·
n22·
Total
2
K
Total
.. .
.. .
.. .
.. .
G
1
nlGI
n2GI
G
2
nlG2
n2G2
nlGK
n2GK
Total
nlG·
n2G·
TOTAL
nl··
n2··
:
G
K
42
nCI·
nC2l
.
.. .
...
...
...
nCGl
...
nCGK
.,
...
...
nC2·
. ..
nCG2
nCG·
nc..
Total
=N
n.12 = N
n.ll
=N
n.l. = NK
n.21 = N
n.22 = N
n.IK
=N
n.2. = NK
n.2K
"
.
=N
n.G2 = N
n.Gl
=N
n.G. = NK
n ... = NGK
n.GK
directed towards specific alternatives. Note that the number of degrees of freedom
here is usually large: for instance, comparing two groups of sequences 100 nucleotide
long yields 300 degrees of freedom. For these reasons we use another approach to
assess homogeneity among groups.
The variation within the gth group at the kth position is
C(n egk ) 2,
EC()2
negk
=1- E
1-
e=1
since
n.gk
= N.
n'gk
e=1
N
The variation within the gth group is
1-
t (n .)2 t (~)2
eg
e=1
= 1-
n.g .
e=1
NK
since n.g . = N K. The measures of dispersions are
W SI =
~
t {I - t (N K
neg. )
G g=1
TSI - 1 _
-
and
t(
e=1
n e .. )
NGK
B SI = T SI - W SI = G
e=1
neg.
)
NGK
2
(3.2.5)
(3.2.6)
2
'
LL
G
C (
g=1 e=1
3.3
t(
2} = 1_G
e=1
neg.
)2
-
NGK
EC (
e=1
2
e
n ..)
•
NGK
(3.2.7)
The Probabilistic Model
Assuming that responses in different groups are independent, for each group and
each position, the responses
where E~=IPegk = 1,
Pegk
E( negk)
(nlgk, n2gk, . •. , nCgk)
follow a multinomial distribution:
> 0, c= 1, ... ,C, k = 1, ... ,K and 9 = 1, ... ,G
= N Pegk
if gl =
g2
otherwise
43
and k1 = k2
ncgk
denotes number of responses in category c at position k for group 9 and
Pcgk
the probability of being at category c at position k for group g.
If we assume that the positions are independent, the model is
K
II Pr{(nllk,' .. ,nClk, n12k,· .. , nC2k,· .. ,nlGk, ... , nCGk)}
k=l
K
G
=
II II Pr{(nlgk, n2gk,· .. , nCgk)}
g=lk=l
=
N
K (
II II
G
g=l k=l
)
nlgk··· nCgk
Then V 9 = (nlgl ...
nCgl nl g 2 '"
IIC (Pcgk tCg/c
c=l
nCg2 ... nlgK ... nCgK)'
=(VI V V is a GC K
E(V) = P =N Po = N(Plll ...
and V
2 ...
G)'
X
is a C K X 1 vector
1 vector.
(3.3.1)
PCll ... PlGK ••. PCGK )'
Let Ell denote the direct-sum operation, i.e., if C
= A Ell B, C = ( : :).
Then
Cov(V) =:E where :E gk is a C
N:E°
..
N(:E ll EB :E 12 EB ••• EB :ElK EB :E 2l E3 ••• EB :E 2K EB .•. EB :E GK ) (3.3.2)
X
C matrix of the form
(3.3.3)
with D gk being a C
Pogk
= (Plgk
3.4
X
C diagonal matrix with elem.ents Plgk,
. " , PCgk
and
•.• PCgk)'.
Moments of Diversity Measures
Definition 3.2
Let A
= (aij)
and B
= (bij )
be m
X n
and P
Kronecker product
44
X
q matrices, respectively. Then the
.
is an mp x nq matrix expressible as a partitioned matrix with aijB as the (i,j)th
partition, i = 1, ... ,m and j
= 1, ... ,n.
•
Let
(3.4.1)
where UKG is a KG x KG matrix of 1's, Ie is the C x C identity matrix and
TO
= (NGK)2T is a
CKG x CKG matrix, having KG x KG partitions with each
partition being C x C identity matrix. Let M be a G x G diagonal matrix with diagonal
elements Gn~g.(= G(NK)2 here), i.e., M = G(NK)2IG Then M- 1 = G(1.JK)2IG'
o
W
[(M- 1 0 UK) 0 Ie]
W
(NgK)2 [(IG 0 UK) 0 Ie] =
G(~K)2 WO
(3.4.2)
Then
TSI
1- V'TV
(3.4.3)
WSI
1- V'WV
(3.4.4)
Therefore,
BSI -
TSI - WSI
= -V'TV + V'WV = V'(-T + W)V
V'BV
(3.4.5)
where
G
-1
-T + W = (NGKp(UKG 0 Ie) + (NGK)2[(IG 0 UK) 0 Ie]
B
-
(NgK)2 [(IG 0U K ) -
~UKG) o Ie] =G(~K)2BO
Since
E(V'TV) = trace(T~)
E(TSI)
1 - trace(T~) - p.'Tp.
45
+ p.'Tp.
,
(3.4.6)
E(WSI)
And E(BSI)
Define the population variation within the 9th group at the kth position as
c
Is(Pgk) = 1 - 'LP~gk
c=l
H o : Pcgk = Pck for all 9 implies that
46
.,.
(3.4.7)
i.e., within-group variation at the kth position is the same over all the groups and
this implies that
II Plk 11=11 P2k 11= '" \I PGk II
where Pgk = (Plgk P2gk ... PCgk)' is a C
(3.4.8)
1 vector representing the probabilities of
X
belonging to categories c = 1, ... ,C in group 9 and position k.
If one is interested only in the hypothesis stated in (3.4.8), the hypothesis of
homogeneity among the groups (Pcgk = Pck) is not necessarily true. Here, we consider
Ho : Pcgk = Pck·
Eo(TSI)
1
1- NGK
1
~[ ~ 2
+ N(GK)2
~ G f:tPck -
1
1- NGK
1
C
+ NGK2?;
1
Eo(WSI)
1- NK
Eo(BSI)
NGK
G- 1
r'liK
~~P~k - NGp~.
1 ~ [~2
+ NK2
~ f:tPCk
~
1
+ N(GK)2 ~
[
2 2]
NG Pc.
~
N
-
]
2]
(3.4.10)
Pc·
2
(3.4.9)
2 2]
G f:tPCk - NG Pc.
G
C [K
]
NGK2?; {;P~k - Np~.
G-1
1
C K
1 C K
NGK+ NGK2EEp~k-NK2EEp~
c=l k=l
G-1[
1
NGK 1- K
c=l k=l
C K 2]
~(;PCk
(3.4.11)
Since V follows a multinomial distribution, from (3.3.1) and (3.3.2), asymptotically,
(3.4.12)
.
where :Eo
= :E l EB :E 2 EB ... EB :E G , :E g =
:E gl EB :Eg2 EB ... EB :EgK, 9
= 1, ... , G and :E gk
is given by (3.3.3). Under Ha , for any 9 = 1, ... , G,
:Egk = :E ok
where
:Eok
is a C
X
and
:E g =
:E~ =
:E Ol EB :E 02 EB ... EB :E oK
(3.4.13)
C matrix
(3.4.14)
47
with Dk being a C x C diagonal matrix with elements Plk, ... , PCk and ILok
(P1k .. , PCk)'.
Therefore, under Ho,
(3.4.15)
Now,
i. Cov(RSI, WSI)
Cov(RSI, TSI - RSI)
Cov(RSI, TSI) - Var(RSI)
ii. Cov(TSI, WSI)
Cov(TSI,TSI - RSI)
-
3.5
(3.4.16)
Var(TSI) - Cov(TSI, RSI)
(3.4.17)
The Test Statistic
Note that RSI can be written as
tt [
neg. _ n e.. NKVG]2
g=le=l NKVG
(NGK)2
RSI = V'BV =
Let 8egk
= negk -
Npek
= negk -
EO(negk). Then
G
8e.. =
K
K
L: L: 8egk = ne.. -
NG
g=lk=l
Also, under H o, 8
= (8111 ... 8Cg1
~~
L: Pek
k=l
... 8CGK )' is asymptotically
Jw '"
where
(3.5.1)
N(O, ~~)
(3.5.2)
is as in (3.4.15). So,
Hence, under Ho R S I is asymptotically
CGK
RSI",
Ai
i=l
L: (xO.,
48
I
(3.5.3)
where (XDi's are independent x2-random variables with 1 degree of freedom and
P'i, i
= 1, ... , CGK}
is the set of characteristic roots of
NB~g = N~K2Bo~g
=
N~K2
[((I G 0 UK) -
~UKG) 0 Ie] (IG 0 ~~)
by (3.4.6) and (3.4.15).
Looking at ~Ok, it is easy to see that its rank is at most (C - 1), because of the
restriction that ~~=l pck
= 1.
In fact, the rank of each ~Ok is (C - 1), since any of
its (C -1) columns are linearly independent. Therefore, the rank of ~~ is K(C -1).
1 BO~oo is a G x G partition matrix with elements the
NGK2
premultiplied by some constants (G - 1 or -1).
Further,
~Ok 's
matrices
Note that for any k
(G-l)
...terms
and so
(G-l)
(G -l)(U K 0
terms
Ie)~~ -[(UK 0 Ie)~~ + .~. + (UK 0 Ie)~~f
= (G -l)(U K 0 Ie)~~ - (G -l)(U K 0 Ie)~~ = 0
In order to get the characteristic roots of
N~K2 Bo~g we need to solve the equation
IN~K2 Bo~g -
AIeKGI
=0
(3.5.4)
But C K of the characteristic roots are zero and the others are the characteristic roots
of G(U K 0
Ie)~~,
with multiplicity (G -1). Thus,
1
BSI '" NGK2
trKe Ai
(X(G-l)) i
(3.5.5)
where {Ai : 1, ... , KC} is the set of characteristic roots of (UK 0 Ie )~~. Since
{Pck' c = 1, ... , C} is unknown and need to be estimated, so are {Aik' i = 1, ... , C-
1}. Determining the characteristic roots of a multinomial covariance matrix is not
straightforward. Royet al. (1960) studied this problem without actually presenting
the closed-form expression for the roots. The characteristic equation for each k is
(3.5.6)
49
It is easy to see that ).
= 0 is
a root, but identifying the other roots must proceed
numerically.
Now,
TSI =
1- V'TV
Since NT~g is not idempotent, the distribution of V'TV is not X~rank(T)'I-"TI-")' Under
H o, however,
1
,
°
~
1
2
(NGK)2 V T V = (NGK)2 ~ ne..
V'TV
-
1
c
(NGK)2 ~[Be.. + NGpe.]2
-
1
~(2
2"
(NGK)2
~
Be.. + (NG)
P;:.
-
1
c
1 c
2
c
(NGK)2 ~8~.. + K2 Ep~, + NGK2 EBe.. Pe.
6'T6
+
;2 L P~. +
+ 28e .. NGpe. )
c
A'6
e=l
1
(NGK)2
6'T o 6
1
~
2
+ K2 ~Pe. + A
,
6
(3.5.7)
where A = (A* j\* ... A*)' is a CGK x 1 vector and A* is a 1 x CK vector of the
form
A*
=
2 (Pl· '" Pc· Pl· ... Pc. '" Pl... , pc.)
NGK2
Lemma 3.1
6'T6 and A'6 are not independent.
Proof:
6'T6
1
= (NGK)26'T06
and A'6 would be independent if and only if
A'N~g(N~K)2TO = 0
1
N(GK)2
A'~o °
LJoT
50
(Searle, 1971).
N(~J{)2 (A* A*
... A*)(IG ® :E~)(UKG ® Ie)
N(gJ{)2 (A*:E~(UK @ Ie)
Let a
= (Pl .
... pe.)'. Recall
A*:E~(U K ® Ie)
:E~
A*:E~(UK ® Ie)
A*:E~(UK ® Ie))
= :EO! EB ". EB :E oK
N~J{2 (a':Eol a':Eo2
=
...
'" a':EoK )(U K ® Ie)
For each k, from (3.4.14)
e
e
e
a':E ok = [Plk(Pl. - 'EPc.Pck) P2k(P2. - 'EPc.Pck) ... pek(pe. - 'EPc.Pck)]
c=l
c=l
c=l
and the first element of the vector A *:E~(U K ® Ie) is
N~J{2
(Pll(Pl. - t,pc.pcI)
=
+ P12(PI. -
t,PC.PC2)
N~J{2 (t.Plk(Pl. -
= N~K2
+ ... + PlK(Pl. -
t,PC.PCK))
t,PC.PCk))
(pi - t,P, EMd) # 0
•
Hence, B'TB and A'B are not independent.
Now,
KeG
B'TB '"
'E
Ai (xO.,
i=l
A'B '" N(O, N A':E~A)
I
and
KeG
V'TV",
'E
Ai (xO. + N(O, N A':E~A) + 8
i=l
1
I
where
P'i'
i = 1, ... , CGJ{} is the set of characteristic roots of
NT:E~ = N(~J{)2 T°:E~ and
r
Ul
1 ~ 2
,
= J{2 LJPc. = P Tp under No
(3.5.8)
c=l
Recall that TO
= (U KG ® Ie).
Thus, the characteristic roots of T°:E~ are the char-
acteristic roots of (UK ® Ie ):E~ with multiplicity G. Therefore,
V'TV =
(N~J{)2 V'ToV '" N(~J{)2 ~Ai (X~)i + N(O,NA':E~A) + 8
1
51
(3.5.9)
where {Ai: i = 1, ... ,KC} is the set of characteristic roots of (UK ® Ie )~~.
An alternative of the distribution of Y'TY goes along the following lines. Let
R be a CGK x CGK matrix such that
= RY::} Y
y
R~R' =
IeGK and
= R-1y
Then,
and
Let P be an orthogonal matrix such that (p-l )'Cp-l is a diagonal matrix and
y*
=py::} y = p-ly* = p'y*
Then,
Y* '" N(PRIL, IeGK)
y'TY
= y'CY = (Y*)'PCP'y* = (y*)'C*y*
Hence,
eGK
y'TY = (y*)'C*y* '"
L
c; (X~(lSi))
with lSi
= ci(lli?
(3.5.10)
i=l
where crs are the diagonal elements of C* and
Ili
is the ith row of the vector
,.,,* = PR,.".
As for
WSI
1-Y'WV,
under H o
Y'WY
1
'0
G(N K)2 Y W Y
1
.~~
2
= G(N K)2 ;~ ~ neg.
52
from (3.2.5)
1
G
e
G(NK)2 {;~(Beg. + NpeY
1
Ge
2
e
G(NK)2 {; ~ f)~g_ + NGK2 ~ Be..Pe.
1 Ge
+ GK2 : ; ~P~.
,
,
1 ~ 2
8 W8 + A 8 + K2 LiPee=1
8'W8 + A'8 + 81
from (3.5.8)
(3.5.11)
Again,
tt
eGK
V'WV = G(;K)2 V'WoV '"
Ai
(xO i + N(O, NA'~~A) + 81
where {Ai, i = 1, ... , CGK} is the set of characteristic roots of
NW~~ = N~K2 WO~~.
Recall that WO
= [(IG ® UK) ® Ie].
Hence, the characteristic roots of
Wo~~
have
multiplicity G. Then,
,
1, °
1
~
VWV= G(NK)2 VW V'" NGK2{;tAi XG
(2) i+ N(O,NA~oA
, 0) +°1r
(3.5.12)
where {Ai: i = 1, ... , K C} is the set of characteristic roots of (UK ® le):E; and 81
is given by (3.5.8).
Alternatively, similarly to the derivation of (3.5.10)
eGK
V'WV
= (X*)'D*X* '"
~
eli (x~( 8i ))
with 8i = df(vi)2
(3.5.13)
i=1
where df's are the diagonal elements of D* = P 2(R;I)'WR;lp; and vi is the ith
row of the vector v* = P 2R 2p. Note that
X* '" N(v*, leGK)
where R 2 is a CGKx CGK matrix such that R2~R; = leGK and P 2 is an orthogonal
matrix such that (p;1 )'(R;1 ),WR2 1p;1 is a diagonal matrix.
As the above distributional results for the sums of squares do not readily yield
the first moments, we calculate them directly.
53
Let
G-1[
1
C
K
Eo(BSI) = NGK 1- K ~{;P~k
(' [K
+ NGK2 ~~
{;PCk - NGpc. ]
1
1
Eo(TSI) = 1- NGK
1
Eo(WSI) = 1- NK
]
"Iii:::"'"
2
2
1
C [K
]
+ NK2
~ {;P~k - Np~.
from (3.4.11), (3.4.9) and (3.4.10). The variances are calculated using the following
theorem.
Theorem 3.1 (Searle, 1971)
When X is N(J', 'E), the rth cumulant of X'AX is
•
Var(TSI) Var(WSI)
= Var(8'B8) = 2trace(BN'E°)2
Var(V'TV) = 2 trace(T N'E°)2 + 4N J'~ T N'E°T N 1'0
Var(V'WV) = 2 trace(W N'E°)2 + 4N J'~ W N'E°W N Po
Var(V'BV)
Var(BSI)
-
(3.5.14)
(3.5.15)
(3.5.16)
and under H o
trace(N~K2 BO'E~) 2 = (NG2K2)2 trace(B°'E~)2
Varo(BSI) -
2
Va:co(TSI)
2trace(N(~K)' T'I::)
2
N2(GK)4trace
Varo(WSI)
(T0 'E 0 )2
0
2
+ N(~K)4/(T.I::T'It.
4
, TO~o
+ N(GK)4J'0
LJoT
0
Po
(NG~2K4trace(W°'E~)2 -I- NG;K4J'~WO'E~W°J'0
Let
54
(3.5.17)
(
3.5.18
)
(3.5.19)
Note that
KGG
(i)
?= Ai (X~)i = Op(N-
1
)
1=1
since {Ai: i = 1, ... ,KC} is the set of characteristic roots of
N(dK)2(U K
® IG):E~ and :E~
= 0(1).
(ii) A'B = Op(N- 1 / 2 ) since A'B '" N(O, N A':E~A) and A
(iii)
(h =
;2 tp~. =
= 0(N- 1 )
0(1)
c=1
Then,
TN ,2
-
TSI -
-
1 - (Op(N-
V'TV -
(}2 = 1 -
- (1 + 0(N-
1
_
Op(N- 1 / 2 )
-
WSI -
-
1 - (Op(N-
1
)
(}2
+ Op(N- 1 / 2 ) + ~2 t,p~.)
;2 t,p~.)
) -
Similarly,
TN ,3
(}3
= 1 - V'WV - (}3
- (1 + 0(N_
1
1
)
+ Op(N- 1 / 2 ) +
) -
Op(N- 1 / 2 )
and
55
;2 t,p~.)
;2 t,p~.)
.
smce
TN,3
1/2
= Op(N-),
83 =
N,3
N(BSI)T
8
2
3
()
d
= Op (-1/2)"
N
, N ( BSI ) = 0 plan
1 (
1
1 - NK 1 - K
1-
K
')
l: kl:
P~k =l,
C
1
K2
c=1
;2 tp~. +
C
l: P~.
c=1
O(N- 1 )
c=1
8~
+ O(N- 1 )
By (3.5.3), asymptotically
BSI
1 CGK
F 1 -- N80- '" -80 "~ A"1
3
2
(v 1)i
(3.5.20)
1\
3 1=1
1
where {Ai: 1, ... , KGC} is the set of characteristic roots of NGK2Bo~;.
Under Ho, asymptotically
N [Eo(BSI)]
= N8 l
8°3
830
2]
(G-1)[
1~~~
GK80 1- K"~ LJPck
3
c=l k=l
2trace(Bo~g)2
(GK 283)2
Since Pck'S are unknown, one can only get estimates for the
Ai's, i.e., the charac-
teristic roots of ~~. To derive the distribution of PI, estimate {pck}, then get the
characteristic roots of ~~ based on those estimates.
Alternatively, since each of the elements on the R.H.S. of (3.5.20) are i.i.d. X~'s,
if max(Ai)
. /"f?GK A~
V LJI=l
--t 0, we can apply the C.L.T. for CGK large,
1
K (Fl - N:~) '" N(0,c
2
2
where
(7
=
2trace(Bo~g)2
(G8 )2
3
r2
)
(3.5.21)
.
Using a similar approach, Light & Margolin (1971) developed an analysis of
variance for categorical data (CATANOVA) and these two approaches are equivalent.
For a one-way table, the Total Sum of Squares (TSS) is
TSS = NG _ _
2
1_~n2c· == NG
2 TS1
2NG LJ
c=l
56
()
3.5.22
xi, x~, ... ,x}v) is
The variation within group g (i.e., within
(3.5.23)
Therefore, the within-group sum of squares (WSS) is found by summing (3.5.23) over
g:
WSS
G (n. g
= L: 2 g=1
1
c 2) = 2GN - 2N1 L:L:n
G 2 N G WSI
Cg = -2C
2n. L:n Cg
g c=1
(3.5.24) ~.,
g=1 c=1
The between sum of squares (BSS) is
NG
1 C
TSS-WSS=---~n2
BSS =
2N G
2
1
2NG
{
G
(C~~n~g
G
)
LJ
=1
c·
GN
__
C }=
- ~n~.
2
1 G C
+_~~n2
2N
LJ LJ
Fl=1
NG
-2- BSI
cg
(3.5.25)
Extending these for several positions, the variation within the g-th group at the
k-th position is
n'gk
-2- -
~ 2
ncgk
·gk c=1
1
~ LJ
N
= 2" -
1 ~
2N
LJ
c=1
2
ncgk'
since n.gk = N. The variation within the g-th group is
since n.g .
= N K.
The sum of squares are
WSS =
(3.5.26)
TSS
and
BSS
= NGK
2
_
~n2 = NGK TSI
1
2NGK c=1
L.J
1
TSS - WSS
1
2NGK
(
Coo
2
Gel
= 2NK EEn~g. g=lc=1
Gee
)
G~~n~g. - ~n~.. =
57
(3.5.27)
'
C
2NGK Ln~oo
c=1
NGK
2
BSI. (3.5.28)
In this setup a test statistic could be
BSS/(G -1)
BSI/(G -1)
F = WSS/(NGK _ G) - WSI/(NGK -1)
*
l
=
(NGK - G)
N(G -1) Ft
(3.5.29)
Let
()r -
Eo(BSS)
=
()*2
-
Eo(TSS)
=
()*3
-
Eo(WSS)
=
NGK
2
NGK
2
Eo(BSI)
C K
]
= -G2- -1 [ 1-- K1 ~(;P~k
Eo(TSI)
=
NGK
2
NGK -1
2
1 C [K 2
+ 2K ~ {;PCk -
NGK - G
Eo(WSI) =
2
2]
NGpc.
G C [K 2
+ 2K ~ (;PCk -
2]
N Pc.
from (3.4.11), (3.4.9) and (3.4.10).
Under H o
Varo(BSS) -
( 1
)2
1
2 trace 2K BO~~ = ~~K2 trace(BO~~)2
(3.5.30)
Varo(TSS) -
1
(TO~0)2
2( G K)2 trace
0
(3.5.31)
Varo(WSS) -
1
(WO~0)2
2K2
trace
0
+ (GNK)2 Po'To ~o0To""0
N , Wo~oWo
+ J{2
Po
0
Po
(3.5.32)
Let
SI
=BSS - ()t,
SN,2
=TSS - ();,
SN,3
=W"SS - ();
and
(3.5.33)
a2
-
(3.5.34)
-
a3 -
-
(3.5.35)
58
Now
(NGK -1)
Similarly,
(NGK -G)
Then
or
F*
1
-
Sl
BSS/(G - 1)
(G - 1) + (G - 1)
WSS/(NGK - G) 0;
SN,3
(NGK - G) + (NGK - G)
59
By (3.5.3), asymptotically
Ft
BSS
1
= a*(G -1) ,.... a*(G -1)
2
OGK
?= Ai (Xni
2
(3.5.36)
1=1
where {Ai: 1, ... , KGC} is the set of characteristic roots of
2~BO~~.
...
Under Ho, asymptotically
a1
E(BSS)
01
a2
a2(G - 1)
a2(G -1)
~
[1- ~K c=1tf=P~k]
2a 2
k=l
Varo(Ft) =
Varo(BSS)
t:race(BO~~)2
(a2(G-1))2
2( Ka2(G -1))2
Applying the C.L.T. for CGK large,
K
:i) ,. . N(O, u;)
(Ft -
(3.5.37)
2 trace(BO~~)2
where u* = 2(a2(G _ 1))2'
1
Note that a~ = 20~ and that for large N, we can say that
2
.ltcan b esh ownt h atu*=
2 (G_1)2U'
G
2
Note that, as N -+
a1
00, a3
1
K2
GK
(G _ 1) Fl. Therefore,
a1
-+ -; and
1 ,,0 "K
2
1- K
L..Jc=1 L..Jk=1 Pck
1-
Ft =
0 (K
~c=1 ~k=1 Pck
a2
)2
"K
2
1 - K1 ,,0
L..Jc=1 L..Jk=1 Pck
1-
~~=1 (~f=1 ~)2
1-
~~=1
(k ~f=1 (Pck -
Pc)2
+ pn
1- ~~=1P~
1 ,,0 "K (
,
h
were
pc
K
= '"'
L..J Pck
k=1 K
- )2
1 _ K L..Jc=1 L..Jk=l pck - pc
a:
1- ~~=1~
is bounded by 0 and 1, since ~~=:1 ~f=1 P~k ~
a2
Also, if there is homogeneity among positions within. a group, Pel
Note that
_
a1
Pc and -
a2
k
= 1.
60
~~=1
(k ~f=1 Pck) 2.
= pc2 = ... = pcK =
Note that
(~f=1 Pck f]
1 + K(NJK-I) ~~=I [~f=IP~k - NG (~f=IPCkr]
1
a3
a2
+ K(Nk-l) ~~=I [~f=1P~k -
N
2
I ("K
)2]
L..c=1 ["K
L..k=1 Pck - K L..k=1 Pck
1+
1 + K(NJK-I) ~~=I [~f=IP~k - NG (~f=IPCkf]
N(G-I)
,,0 "K (
- )2
(NK-I)(NGK-I) L..c=1 L..k=1 Pck - pc
N(G-I)
,,0
(NK-I)(NGK-I)
----------;,...=:---------~
=
1+
I
K
0
(
1 + K(NGK-I) ~c=1 ~k=1 pck - Pc
)2
-
0
2
~c=1 Pc
Again, if there is homogeneity among positions within a group, Pel
_
PcK
= Pc
3.6
a3
and -
a2
= Pc2 = ...
=
= 1.
The Power of the Test
Let us consider an alternative hypothesis, i.e.,
1
Pcgk
Thus,
Icgk
case where
= y'N,cgk + pck
= 0 yields the null hypothesis
Icgk
=I-
o.
H o : Pcgk =
pck.
Then,
Bcgk =
ncgk -
N
(..Jw'C9k
and
RBI
61
+ PCk)
The interest here is in the
where B is as in (3.4.6), B
= (B 111
.. , BCg1 ... BCOK )', Al and A 2 are CGK
X
1
vectors of the form
Al
= K2;N3/2 (A;l
... A;G) with
A;g = (,lg. ... 'YCg· 'Y1g' .. , 'YCg· ... 'Y1g. ... 'YCg.) for each 9 = 1, ... , G
A2
=(KG~N3/2(A;
A; =
A;) with
(,1..... 'Yc .. 'Y1
Recall that
~ '" N(O, ~O)
teristic roots of
'Yc .. ... 'Y1 .. ... 'Yc.. )
and B'BB '"
I:f=~K Ai (XDi' with Ai'S being the charac-
N~K2BO~O, where
~O
= (~11 EB ~12 EB ... EB ~lK EB ~21 EB ..• EB ~GK)
and ~gk is as in (3.3.3). Now, let A~
(3.6.1)
= N3/2A 1 and A; = N 3/2A 2.
Then
Lemma 3.2
B'BB and
A~ B
are not independent.
Proof:
B'BB and
A~ B are independent if and only if A~ NEoB
A~N~oB
-
Let
atg
A;g~g
= (,lg. '"
= O.
A~ N~K2~OBO
N~K2 [((IG ® UK) - ~ U KG ) ® I c ]
K3G;N5/2(At1~1 ... AtG~G) [((IG ® UK) - ~UKG) ® I c ]
K2;N3/2 (At1 ... AtG)
'YCg.) be a 1 x C vector. Then write Atg =
can be written as
62
(atg
•.•
atg )
and
Now,
a19:Elk = [Plgk b,g- -
~ 1,g- P,gk)
- - - Pegk be..
-
~ 1",,- p""k)]
for each 9 = 1, ... , G and k = 1, ... ,K. The first element of
A~ N:E<>B
is
•
Since 8'B8 and A~8 are not independent, 8'B8 and (AI - A 2)'8 are also
not independent. So, the distribution of V'BV is not the convolution of a linear
combination of x2-random variables and a normal distribution.
V'BV
_
8'B8 + A'8 + N 2 A'A
(B I/ 28 + ~B-I/2 A)'(B I/ 28 + ~B-I/2 A) - ~A'B-I A + N 2 A'A
2 2 4
(B I/ 28 + ~B-I/2 A)'(B I/ 28 + ~B-I/2 A) + ~A'(4N21 _ B-I)A
2 2 4
and let X = (B I/ 28
+ ~B-I/2 A), r
=
G~ K2 (B<»1/2:E<>[(B<»1/2], and ILB
= tBI/2A.
Then,
Let P be an orthogonal matrix (i.e., P'P = I) such that prp' = A, where A
is a diagonal matrix, and
Y = PX =? X = P'Y
Then,
and
G
X'X = y'PP'Y = y'Y
row
2: Ai (X~(8i)).
i=1
63
'
(3.6.2)
a~
where ,Vs are the diagonal elements of A, Oi = A'.' with ai being the ith row of the
,
vector ~PB-l/2A, which are linear combination of the
Let c be a constant
~A'(4N2I- B- l )A.
Pr(FI ?:
/cgk'S.
.
Then,
u) _ Pr (N(X~ + c) ?: u)
Pr(NX'X?: 8~u - Nc)
(3.6.3)
Note that Nc = 0(1) since B- 1 = 0(N 2 ) and A'A= 0(N- 3 ). As the noncentrality
parameter Oi increases, the distribution of each of the noncentral x2-random variables
shifts to the right, therefore the probability in
(3.6.3~)
goes to 1 and the power of the
test converges to 1.
.
64
Chapter 4
Analysis of Variance based on the
Hamming Distance
The interest here still lies in the comparison of sequences. Now they are considered on an individual basis in that they are compared to each other: all possible pairwise comparisons within and across groups are performed. We develop an
analysis-of-variance framework for Hamming distances and estimate the variability
between, within and across groups. In the within sum of squares, we are 'estimating
the variability among individuals within a group around the average distance within
this group. In the across sum of squares, we are estimating the variability of individuals across two groups with respect to the average distance between those groups. In
the between sum of squares, we estimate the variability in the group average distances
around the overall distance.
Weir (1990a) describes an analysis of variance for the genetic variation in the
population, in particular for the amount of observed heterozygosity. The variance of
the estimate of the average heterozygosity is broken down to show the contribution
of populations, loci and individuals by setting out the calculations in a framework
similar to that of an analysis of variance. Our situation is a little different because
we would like to construct a categorical analysis of variance based on Hamming
distances (Seillier-Moiseiwitsch et al., 1994 and references therein), assuming that the
sequences are independent, but the positions may not be. The Hamming distance is
65
the proportion of positions at which two aligned sequences differ.
In this context V-statistics are utilized to represent the average distance between and within groups as well as the overall distance. The total sum of squares
is decomposed into within-, between- and across-group sums of squares. The latter
term is new: it does not appear in the classical set-up. Generalized-V-statistics theory
(Puri & Sen, 1971; Lee, 1990; Sen & Singer, 1993) is used to find the asymptotic distributions of each sum of squares. Test statistics are developed to assess homogeneity
among groups.
4.1
The Total Sum of Squares and its decomposition
Let Xf = (XYl' X~, ... , XYk)' be a random vector representing sequence i of group 9.
Suppose i
= 1, ... , N, k = 1, ... , K
and 9
= 1, ...., G.
So, X3c represents either the
amino acid or the nucleotide present at position k of sequence i in group 9 (e.g., at
the nucleotide level, Xfk E {A, C, T, G}).
Consider Xf' and X~2 .
Definition 4.1
The Hamming Distance D~Jl!12) is a descriptive statistic for sequence comparison:
D~1,g2)
K
=
-
!- L: I(XYk i= XJn
K
k=1
~
X
(number of positions where
(4.1.1)
Xl'
and
Xr
differ),
and when 91 = 92 = 9,
•
Definition 4.2
Let XI, X 2 , ••• be independent observations with distribution F. They may be vec-
66
tors, but for simplicity we confine our attention to the scalar case. Consider a parameter junction e = e(F) for which there is an unbiased estimator,i.e., e(F) may be
represented as
8(F)
= EF{cP(XI , •.. , X m)} =
f ... ! cP(XI, ... , xm)dF(xd··· dF(x m),
(4.1.2)
for some function cP = cP(XI, ... , x m ), called a kernel. Without loss of generality,
assume that cP is symmetric. For, if not, it may be replaced by the symmetric kernel
(4.1.3)
where ~p denotes the summation over the m! permutations (iI, ... ,im ) of (1, ... ,m) .
•
Definition 4.3
For any kernel cP, the corresponding U-statistic for estimation of 8 on the basis of a
sample Xl, ... ,Xn of size n
~
m is obtained by averaging the kernel cP symmetrically
over the observations:
(4.1.4)
where
Ec
denotes the summation over the
(::J combinations of m distinct elements
{iI, ... ,i m } from {1, ... ,n}. This V-statistic is said to be of degree m. Clearly, Un is
an unbiased estimate of e.
•
Let
ef = P {X;k # X%k}
and jj~
1
E[Dfj] = K
= k ~f=l ef . Then,
K
L E[I(Xit # X;k)] =
k=l
1
K
K
k=l
Define the average distance within a group as
67
_
L ef = 8~.
which is a V-statistic of degree 20 The average distance between two groups is
which, as we will see later, is a two-sample V-statistics of degree (1,1). The overall
distance is
which is a linear combination of V-statistics.
The Total Sum of Squares
G
=L
TSS
N
(Dfj - D.)2 +
L
g=ll~i<j~N
N
~~L(D~1'92) - D.?
L
l~gl <g2~G
(4.1.5)
i:=l j=l
can be decomposed as follows
G
L
G
+L
(Dfj - D~?
L
g=ll5: i <j5:N
2:
(D~ - D.)2
g=ll5: i <j5:N
G
+2
L
(Dfj - D~)(D~ - D.)
L
g=119<j~N
N N
+
LL(D~1'92) - D~91'92)?
L
l~gl <g2~G
L
2:2:(D~91'92) - Do)2
1~1l'1 <92~G i=l
i=l j=l
j=l
N N
2:
+2
N N
+
2:2:(D}r,g2) - D~91'92»)(D~~'1,92) - Do).
l~gl <g2~G
i=l j=l
Since
G
L
L
(Dfj - D~)(D~ -- D.) =
0
g=ll~i<j~N
and
N N
L
l~gl
G
TSS
L
L L(D~.il'92) - D~91'92»)(D~91,92) - Do)
<925:G i=l j=l
2:
g=119<j~N
G
(Dfj - Dn 2 + 2:
2:
g=119<j5:N
68
(D~ - D.)2
= 0,
N
+
N
LL(D~Jl,g2) - D~gl,g2))2
L
l::;gl <g2~G i=l j=l
Within Sum of Squares (WSS)
+
+
4.2
N
+
N
LL(D~gl,g2) - Dl
L
i=l j=l
+ Between Sum of Squares(BSS)
l~gl <g2~G
Across Within Sum of Squares (AWS S)
Across Between Sum of Squares (AB S S)
Connections Between Sums of Squares and UStatistics
The extension of U-statistics to the case of several samples is straightforward.
Definition 4.4
Consider k independent collections of independent observations {XP), X~l), . ..}, ... ,
2 , . . . } t a ken f rom d'IS t n'b u t'IOns F(l) , ... , F(k) , respect'lve1y.
{X 1(k) , X(k)
Let () = (}(F(1), ... , F(k)) denote a parametric function for which there is an unbiased
estimator, i.e.,
() = E{ </>(X~l), ... , X~!, ... ; X~k), . .. ,X!:~)},
where </> is assumed, without loss of generality, to be symmetric within each of its k
blocks of arguments. Corresponding to the kernel</> and assuming nl
~
ml, ... , nk
~
mk, the k-sample U-statistic of degree (ml, m2, ... , mk) for estimation of () is
(4.2.1)
Here {i jl, ... , i jm j} denotes a set of mj distinct elements of the set {I, 2, ... , n j },
1 ~ j ~ k, and ~c denotes the summation over all such combinations.
•
Since we have G groups of N sequences, we can disregard the group clustering and
think of the sequences as a random sample of size NG. Then
TSS
~
(Dij -
by
l::;i<j::;NG
(NG(~G -1) -1)
CT-1r ,<;21<;,
.:::;.1 or j:::;jI
69
(D ij - D i l j /)2
2
NG(:G _ 1)
{t [ .~.
9=1
+
'"
~
(D fIJ!. _ DfI,1 J.)2
l$i<i'<j$N
+
$N
1 $1<J<3'
+
'"
~
+
'"
~
L
l$i<i'<j'<j$N
+
L
1$91<g2$G
+
+
'"
~
'"
[I: (D~rl,g2)
-
l$i<i'$N
15-,;, i/~N
1
1<i, i',j'
~i''''i'
l:Si,j~j':SN(D~r,g2) - D~fJ,'Y2»)2]
",j'
I: [I: (D~r,g2)
1:S91,92dIS:S G
91 "'92"'9S
N
L
-
l<i,j'<N
+
L
l<i,j'<N
-,"'j'
(D~rl,g2) - D~J:,gs)?
(D~Jl'92) - D~Jl'9S)?
L
l<"j<N
-·",r
(D~I'92) - D~r,gS»)2
+ L
-i",r
L
1<",'<N
I:
(DJJ1,92) - D;~1'9S»)2
1<',i<N
-,,,,r
l<',j<N
-,,,,.'
D)~j:9S»)2 +
-''''i'
L(D~rl,g2) - D~rl,gS)?
i=l
+
(D~r,g2) - D~J:'92»)2
i':SN
l$i<j$N
i"'i'
'" (D~f?1'92) _ D\y'}'92»)2 + '" (D~f?1'92) _ D~f?:'Y2»)2
~
~
n
~. ~
~
l:Si,i,i':SN
l:Si,i,i':SN
i"'j"'j'
i"'Nj'
'"
(D~!!l ,g2) _ D~f~ ,92))2 + '" (D~!!1 ,g2) _ D~f~,'92»)2
~
IJ
1J
~
It
1J
+
+
L
l:Si,j':SN
i"'j'
(D 1IJ(!!1,92) _ D(!!1,92»)2
JI
~
11
It
D~~~'92»)2 -+-
+ '"
(D~f!1'92) _ D~f.l'92))2
~
l:Si,
i"'j'" i'
+
•
(Drj - Df,j' )2]
i"'Ni'
+
(D!!.~ - D!!,IJ.,)2
l$i<i'<j<j'$N
lStn~SG [,sESN 19~5N (D'!; -Df,j')']
+
+
(D!!IJ. ._ D!J.JJ.,)2
l$i<j<j'$N
g
(D ~. - D!!,IJ.,)2
l$i<j<i'<j'$N
+
(D;j - Dfi,)2
(D~rl,g2) - DJ~I'9S»)2
+
1<',j<N
-''''1
+
+
70
(D~I'92) - D;f1,gS»)2
+
+
"
~
l<i,i,i'<N
-i~#i'
+
D(f~,g3))2
(D(gl,g2) _
t)
t)
l<i,j~j'<N(D~I,g2)
"
~
l$i, i',j'$N
i~i'~j'
(D(gl,g2) _ D(fl,'g3))2
t)
It
Pg3
- D;f ))2]
-i~#i'~jT
+
'<"
J:.... [t, t, ~ El(D!J"") - D!!7,&'))']
<G
91 <92 , 93 <94
L
+
[L
91~92
+
-
-
-
I ,g2))2
(Df!~
- D(f.
t)
t t
"~
+
15 i ,j,i'5N
+ "
"(Df!~ - D(f!1 ,g2))2
(Df!~ - D\fl.},g2))2
')
JJ
+
I)
I)
L [L t t
+
t)
"~
(Df!~ _ D(!!,I,g2))2
')
t)
15i ,j,i'5N
i~#j', i<j
"(Df!~ - D(f~,g2))2 +
~
l<',j,i'<N
i;j#i', i<j
I)
15,i<j5,N
15i,j,i'5N
i~j~j', i<j
+
(Df!~ _ D(f!I,g2))2
~
t))1
"~
(Df!~
_ D\r~,'g2))2
t)
))
j'~i~#j', i<j
19<j5,N
+
D}~I,g2))2
-
"~
l<i,j,i'<N
~
(Dr; -
l<i<j<N
i'~i~j~i'-:i<j
+
+ L
(Dr; - D;r,g2))2
l<i<j<N
1591,925 G
1591~92~93:S;G
92<93
"
~
l<',j,i.',j'<N
ii#i'~j', I<j
(Dr; -
15,i<j5,N i'=1 j'=1
(Df!!
13
D(f~,'g2))2]
13
D~f]/g3))2] }
(4.2.2)
Note that the total number of terms in (4.1.5) is
NG(1G-
1
(
))
1
=
sNG(NG - 1)[NG(NG - 1) - 2]
_
NG[N 3 G3
8
_
2N 2G2
-
NG + 2]
Separating the different U-statistics and adding up the number of terms we get
G
(
N{~-l))
2
+
+
G(G - 1) (N)2
2
2
G(G - 1)(G - 2) N 4
2
+G(G -1)
+
+
G(G -1) (N 2 )
2
2
G(G -1)(G - 2)(G - 3) N 4
8
[(~)N2] + G(G -~(G -
= NG [N3G3 _ 2N 2 G2
8
-
NG + 2]
71
2)
[(~) N2]
We now characterize each term in the above decampastian (4.2.2):
• one-sample V-statistics of degree 3:
• one-sample V-statistics of degree 4:
u~~l = (~)-1 ~
i<j<i' <j'
(Drj _ Df,j,)2,
u~~l = (~) ..~
ut~ = (~)-1 ~
i<i' <j<j'
.(Drj - Dr'j')2
I<I'<)'<}
• two-sample V-statistics of degree (2,2):
smce
+
"
~
(D~!?1'92) _ D~f~,'92))2
I}
I}
'~N"~i'
j<,'
+
"
~
(D~!?l ,92) _ D~f~,'92))2
I}
'~N"~j'
j>"
72
I}
(Drj - Df,j'? and
• two-sample V-statistic of degree (1,2):
smce
+
since
N
~ ~(D~!!1,g2)
L..J L..J
D~g~,g2»)2
_
I)
=
I')
~:~, i:;:i '
has N(N - 1)
L..J
L..J
_
D~g.1,g2»)2
I',
II
+ N(N-~)(N-2)
N
~(D!!~
L..J
l$.i<j$N j'=l
I)
-
D~!!/l,g2»)2
I)
+
~ (D~!!1,g2)
L..J
_
D~g~'~»)2
I)
i::f:.#i '
i::f:.i '
Jr'
~
~(D~!!1,g2)
= (~) (~) terms,
= ~(D!!~
L..J
I)
-
D~gl,g2»)2
II
i<j
I)
i<j
+ I: I:
i<j
+ ~(D!!~
L..J
(Dr] - D~,1,g2»)2
i'
i',##i
73
_
D~!!1,g2»)2
I)
I')
• two-sample V-statistics of degree (3,1):
U(S,I)
7,1
U(I,I)
7,2
U(I,I)
7,1
=
=
=
[(N)3 (N)]
1
[(N)3 (N)]
1
[(N)3 (N)]
1
-1
~ (D!!~ _ D~9~'92»)2
"
L...i L...i
IJ
I')'
,
i<j<i' j'=1
-1
~ (D!!~ _ D~9~'92»)2
"
L...i L...i
IJ
I'J'
and
i'<i<j j'=1
-1
~ (D!!~
"
L...i L...i
IJ
_
D~91'92»)2
I'J'
i<i'<j j'=1
sInce
N
L
L(Dr1 - D~rj,'92»)2
i'l-#i' j'=1
i<i
=
L (Dr1 - D)¥j,'92»)2 +
i'l-#i'
i<i
I: (Dr; -
D~!'J'92»)2
i'l-#i'
i<i
+ I:
I:
(Dr; - D~~1'92»)2 +
i'l-#i'
i<i
(Dr1 - D~rj/92»)2
i'l-i'l-i''l-i'
i<i
terms, and
N
L
L(Dr; - D~rj,'92»)2 = L
i'l-#i l j'=1
i<i
N
L(Df; - D!rj,,92)? + L
i<j<i' j'=1
N
L(Dr; - DJrj/92»)2
i'<i<j j'=1
N
+L L
(DrJ - D~;j,'92»)2 has 3(~) (~) terms.
i<i'<j j'=1
• three-sample V-statistics of degree (2,1,1):
sInce
74
= '""'(D~91 ,92)
~
It
_ D~f.~ '9a))2
It
i::f.i'
+ '""'(D~91 ,92) _
~
It
i::f.i'
+ I:
+ I:
(D}r'92) - D~f}/9a))2
i::f.i'::f.j'
has N(N - 1)
D~f.l '9a))2
U
(D~1'92) - D~~1'9a))2
i~#i'
i<i'
+ N(N -
+ N(N -
1)
l)(N - 2)
+ N(N-~)(N-2) + N(N -
l)(N - 2)
+N(N-l)(~-2)(N-3) = (~) (~) (~) terms, and
=
+ '""'(D~!!1'92)
_
~
Q
'""'(D~91'92) _ D(!!1'9a))2
~
~
Q
i::f.j
+ '""'
~
i::f.j
+ '""'
~
(D(!!1,92) _ D(fJ.}'9a))2
I)
))
i::f.#j'
+ I:
(D~1'92) - D~~1'9a))2
I)
+ I:
I)
(D~;1'92) - D~fF9a))2
i~#i'~jI
j>P
..
(D(!!1,92) _ D(f~'9a))2
i::f.#i'
i~#i'
has N(N - 1)
D(fJ.1'9a))2
n
j>P
+ N(N -
1)
+ N(N -
l)(N - 2)
+ N(N -
l)(N - 2)
+ N(N-~)(N-2)
+N(N-l)(~-2)(N-3) = (~) (~) (~) terms.
• three-sample V-statistic of degree (1,1,1):
N
N
N
I:I: I: (D~1'92) -
D~,1'9a))2
i=1 j=1 i'=1
= '""'(D~91'92) _ D(!!1,9a))2
~
I)
I)
i::f.j
N
+ '""'(D~f/l'92) _
~
II
D~f/l'9a))2
II
i=1
+ '""'(D~f/l'92) _
~
II
D~!!l'9a))2
I)
+ '""'(D~!!1'92)
~
i::f.j'
+ L
I)
_ D(f/l,9a))2
II
i::f.j
(D~1'92) - D~,1'9a))2
i::f.#j'
has N (N - 1)
+ N + N (N -
1)
+ N (N -
1)
terms.
75
+ N (N -
1)(N - 2)
= (~) (~) (~)
• four-sample V-statistic of degree (1,1,1,1):
U~;,l,l,l) = N- 4
N
N
N
N
L L L L (D}Jl,92) i=1
D}~~'94»)2
;=1 i'=1 ;'=1
Further,
G
WSS =
L L
(Df; - [)~)2
g=ll~i<j~N
can be shown to be a V-statistic.
For each 9, we can write
L
=
WSSg
(Dfj - [)~)2
19<;5. N
(N(N2-1) -1)
(
+
+
crr
;<;~<;,
-1) (N(~-I»)-I~ [
2
2
L
(Dfj-Dfjl?
2 l~i<;<j'~N
+
(D£!o'3 - D!!,, 30)2
2
.::>,' or j::>j'
N(N-1)
'"
L..J
g
9
l'D '3
0 0 - D 01 01 )2
, 3
_'
l~i<i'<j~N
'"
L..J
(D£!o'3 _ D9.JJ01)2
19<j<j'~N
(D£!o' 3
- D~
01)2
'3
'"
L..J
+
1 ~i<j<i' <j' ~N
'"
L..J
(D£!o' 3
_ D~
01)2
'3
19<i'<j<j' ~N
•
+
L
(Drj - Dr.}' )2]
1 ~i<i' <j' <j~N
Therefore,
wss
=
(N(N2-1)-1)crl
t ~ [ ~o
4
N(N _1)
-
0
1~'<3<3'~N
g=1
+
'"
L..J
(D£!o'3 _
l~i<i'<j~N
+
rt
L
D~, 30)2
''5: i '
or
( D£!o'3 -
D~
01)2
, 3
2
;5:';'
(Dfj - Dfjl)2
+
'"
L..J
(D£!o'3 ._ D9.JJ,,)2
l~i<j<j'~N
L
(Drj - Df,jl)2 +
19<j<i'<j'~N
+
;<;~<;,
L
(Drj - Dr. jl)2
l~i<il<j<j'~N
(Drj - Df,j' )2]
1 ~i<i' <j' <j~N
2
~
N(N - 1) ~
[(N)3 (3)
(3) (3») + (N)
(4) (4) (4»)]
U +U +U
4 U +U +U
1,1
1 ,2
76
1 ,3
2 ,1
2 ,2
2,3
Now,
N
N
2: 2: 2: (D;;I ,92) -
AWSS
[)~91'92))2
1$91 <92$G i=1 j=1
(N' -1)
(~2rl<~~<Gt.t,it,
-
4.3
4.3.1
t,
i~il or
-
2
i-5i'
Asymptotic Distributions
One sample V-statistics
Let Un be a U-statistic of degree m with kernel </J(XI, ... , X m) and E(Un ) = e(F)
Un
=
U(XI, ... ,Xn) =
2:
n-[m]
= e.
</J(Xil, ... ,Xim )
l$il~ .. ·~im$n
(n)-1
m
where n-[ml
2:
</J(Xill ... ,Xim ),
n~m
l$il <...<im$n
= (n[m1)-1 = {n ... (n -
m
+ 1)}-1.
77
From (4.1.2) we know that
(4.3.1)
Definition 4.5
von Mises (1947)
Define a von Mises' differentiable statistical junction as
n -m
n
n
i1=1
im=l
L ... L
¢(Xi 11 . . . , Xim )
n 2: 1
(4.3.2)
where Fn(x) is the empirical dJ.
with c( u) being 1 if all p coordinates of u are nonnegative and 0 otherwise.
•
Let
(4.3.3)
(4.3.4)
Theorem 4.1
The function Wc defined in (4.3.3) has the properties
(i) Wc(Xl, ... ,xc)
= E{Wd(Xl, ... ,xc,Xc+ll '"
(ii) E{Wc(Xl'''''Xc)}
,Xd)} for 1 :s c < d:S m,
= E{¢(X1, ... ,Xm )}.
The proof appears in Lee (1990, p. 11).
By (4.3.1) and (4.3.4),
Var(Un )
where
E(c)
=
stands for summation over all subscripts such that
78
•
and exactly c equations i k =
jh
to ~c. The number of terms in
are satisfied. By (4.3.5), each term in
r:(c)
~o =
0,
is equal
is
n(n - 1) ... (n - 2m + c + 1) = (m) (n - m) (n)
c!(m - c)!(m - c)!
c
m- c m
Since
r:(c)
(:rtot.(:)(:) (:) (:
(4.3.6)
~:)~,
(:-_:)~,
( : ) -1
(4.3.7)
Hoeffding (1948) obtained the following inequality:
(4.3.8)
which leads to
Now, from (4.3.7) and (4.3.6)
- (:r' Lm-D~~n~~~+l)!~d
...
Hm}
m(m-1) ... 1
n(n-1) ... (n-m+1)
m(n-m)(n-m-1) ... (n-m-(m-1)+1)
x {
(m _ 1)!
. 6
_
m 2 (n - m) (n - m - 1) ... (n - 2m + 1 + 1)
n(n - 1) ... (n - m + 1)
+
m(m-1) ... 1
~m
n(n - 1) ... (n - m + 1)
m
2
n
+
+ ... + ~m
}
6 + ...
(nn-1
- m) '" (nn-m+1
- 2m + 2) ~I +...
,
m.
~m
n(n - 1) ... (n - m
+ 1)
Hence nVar(Un ) is a decreasing function of n which tends to its lower bound m 26 as
.
.
n Increases, I.e.,
79
or
(4.3.9)
00 and el > 0,
Therefore, if E(cP 2 ) <
(4.3.10)
In order to compute the covariance of two V-statistics, say U~-r) and U~6) (V-statistics
of degrees mh) and
m(6)
respectively), we need to take into account the common
elements in the two V-statistics. Let
E(U~-r))
= E{cPh)(XI,
E(U~6)) = E{ cP(6) (Xl ,
,Xm(-y))}
= 8h)'
,Xm(6))} = 8(6)l
E{ cPh)(Xl'
,Xc, Xc+l,
,Xm(-y)) - 8 h )}
E{ cPh) (XI,
, Xc, Xc+I,
, Xm(-y))} - 8h)
and
e~-r·6) -
E{ tPh)c(XI,
, Xc) tP(6)c(Xl "
E{'1t (-r)c(XI,
,Xc) tP(6)c(X l , ... ,Xc)} - 8 h ) 8(6)
.. ,
Xc)}
where c = 1, ... ,mh) and " 8 = 1, ... , q.
If, = 8, we have
c(-r) - c(-r,-r) - E{o,.2 (X
X c)}
l"c
- l"c
'P h)c
1, ... ,
which is the term needed for the variance of a V-statistic.
Let
80
If m(-y)
:s m(<i)l
u(U~"I), U~<i)) =
(
n )
-1
m(-y)
~
(n -
(m(<i))
c=1
m(<i))
m(<i) - C
C
e~"I,<i).
(4.3.11)
From (4.3.11)
lim nu(U("I)
n-+oo
n' U(<i))
n = m( "I )m«)t(-Y,<I")
\'1
(43
12)
. .
0
Therefore, by (4.3.12) it is sufficient to compute
Ei"l,<i),
i.e., it is enough to consider
only the case of one element in common between the kernels.
To obtain a decomposition of ()(Fn) and Un, for every 1
:s h :s m, let
(4.3.13)
with each x being a p dimensional vector and since there are h of them, the integral
is in
Rph.
Then,
n
Vn,1
= n- 1 ~]Wl(Xi)
- ()(F)]
i=1
Writing dFn(Xi)
= dF(Xi) + d[Fn(Xi) 8(Fn) = 8(F)
+
F(Xi)], i
i; (~)
= 1, ... ,m, we
Vn,h,
n2 1
(4.3.14)
Similarly, we may rewrite (4.3.1) as
Writing dc(xj - XiJ
= dF(xj) + d[c(xj Un
= 8(F) + ~
XiJ - F(xj)], 1
(~)Un'h
:s j :s m, we obtain
n2m
(4.3.15)
where
for 1
:s h :s m. Further, if we write
h
Wh(Xl"'" Xh)
=
Wh(XI, ... , Xh) -
L Wh-l(Xl, ... , Xj-l, XjH,""
Xh)
j=1
(4.3.16)
81
for 1 :::; h :::; m, we obtain
(~)-l
Un,h =
. L:.
1:::; h:::;
Wh(Xill ... ,Xih)'
(4.3.17)
l~'l <'''<'h~n
= 2, we have
So, the Un,h are themselves V-statistics. Note that for h
E(Un ,2) -
m
E(W~(Xl,X2)) = E(W2(Xl~.X2))
- E(Wl(Xt})
+ 8(F)
-
E(Wl(X2))
-
8(F) - 8(F) - 8(F)
+ 8(F) =
0
Let
Then
W~l(Xt}
-
I Xd
E[W 2(Xl ,X2) I Xd -
-
Wl(Xt} - Wl(Xl ) - E(W l (X2))
-
8(F) - 8(F)
-
E[W~(Xl,X2)
E[Wl(Xd I Xl] - E[W l (X2) I Xl]
+ 8(F)
+ ,O(F)
=0
and by (4.3.7),
(4.3.18)
Consequently, Un ,2
For h
E(Un ,3)
= Op(n- l ).
= 3 we have
-
E(W~(Xl,
-
E(W3(Xl, X 2 , X 3 ))
X 2 , X 3 ))
-
E(W2(Xl , X 2 ))
-
E(W2(Xl, X 3 ))
- E(W2(X2,X3 )) + E(Wl(Xd) + E(W l (X2)) + E(W l (X3)) - 8(F)
-
48(F) - 48(F)
-
0
82
Ixd
= E[W3(XI , X 2, X 3) Ixd - E[W2(XI , X 2) Ixd - E['1I 2(XI , X 3) I Xl]
- E['1I 2 (X2 , X 3 ) Ixd + E[WI(XI ) I Xl] + E['1I I (X2) Ixd
+ E['1I I (X3 ) I Xl] - 8(F)
- WI(XI ) - wl(Xd - '11 1(Xl) - 8(F) + wI(Xd + 8(F) + 8(F) - 8(F)
w~I(Xd
=
E[W~(XI,X2,X3)
=
0
W~2(XI' X 2 )
= E[W~(Xt,X2,X3) I X I ,X2]
=
E['1I 3 (Xt, X 2, X 3 ) I Xl, X 2]- E[W2(XI , X 2) I Xl, X 2]
- E[W2(Xt,X3 ) I Xt,X 2]E[W 2(X2,X3 ) I Xt,X2] + E['1I I (Xd I Xt,X2]
+ E[WI(X2 ) I Xl, X 2 ] + E['1I I(X3 ) I Xt, X 2 ] =
W2(Xt,X2) - '1I 2 (XI ,X2 )
+ 8(F) -
=
-
wI(Xd - '1I I (X2 )
8(F)
+ '1I I(Xd + '11 1(X2)
8(F)
0
Then,
e~
-
n(n
-1~(n _ 2) [3(n; 3)e~ + (D (n ~ 3)e~ + e~]
e~
E[W31(XI)]2 - (E(Un ,3))2 = 0
E[W 32 (XI , X 2)]2 - (E(Un ,3))2
=0
By (4.3.7),
Var(Un ,3) -
-
n(n _
-
O(n- 3 )
1~(n _ 2) (9(n -
From direct computation, E(Un,h)
Var(Un,h)
3)(n -
4)e~ + 18(n - 3)e~ + 6e~)
(4.3.19)
= 0,
V1
~ h ~
= E(U~,h) = O(n- h),
m and
h
= 1,2, ... , m;
(4.3.20)
Since Un,l = Vn,l, we can write
(4.3.21)
83
Theorem 4.2
For
C
= 1,2, ... ,h-1, and h = 1,2, ... ,m, Whc (X1,.".,X c) = 0
Proof:
Lee (1990)
Using the integral representation
W~(X1, ..
. , Xh)
rn
h
¢(U1' .. '" urn) II[d(c(Uj - XiJ) - dF(uj)] II dF(uj)
(4.3.22)
j=1
j=h+1
h-1
rn
=
¢(u1" .. ,xh, ... ,urn )II[d(c(Uj-XiJ)-dF(Uj)] II dF(uj)
j=1
j=h+1
h-1
rn
- / ... / ¢(U1, ... ,urn) II [d(c(uj - XiJ) - dF(uj)] II dF(uj)
j=1
j=h
h-1
rn
=
¢(u1, ... ,urn )II[d(c(Uj-Xij ))-dF(Uj)] II dF(uj)
j=1
j=h+1
- W~_1 (Xll ... , xh-d
= / ... /
f···/
f···f
Integrating both sides with respect to Uh with measure dF(Uh) yields
o
and so Wh,h_1(Xll ... ,Xh-d
= E{wh(xll ... ,Xh-llXh)} = 0 from (4.3.3).
It follows from Theorem 4.1 (i) that
•
Theorem 4.3
(i) Let h
< h' and let 3 1 and 3 2 be a h-subset of {1, 2, ... , n} and a h'-subset of
{l, 2, ... , n} respectively. Then
(ii) Let 3 1 and 3 2 be two distinct h-subsets of {l, 2:, ... , n}. Then
84
Proof:
Lee (1990)
(i) Since E(Un,h)
= E(W h(X l , . .• ,Xh)) = 0, we only need to prove that
Since h < h', there is an element in S2 that is not in SI, therefore there is a r.v., X h,
say, that appears in Wh,(S2) but not in wh(Sd. Thus we can write
E{E(W h(SI)W h,(S2) I X h,)}
E{w h(Sd E(w h,(S2) I X h,)}
since wh(Sd and X h, are independent. But E(w h,(S2)
I Xh') = Wh'l (Xh,) =
0 by
Theorem 4.2 and so E{wh(Sdwh,(S2)} = 0, proving the first part.
COV(Un,h, Un,h')
=
(~) (~,) L L
-1
-1
(n,h) (n,h')
where the sum L(n,h) is taken over all subsets I
(ii) If SI
~
Cov(Wh(Sd, Wh,(S2))
il
< ... < i h ~ n of {I, 2, ... ,n}.
nS2 is empty, the result follows by independence.
S2 - SI = {it, ... ,i c } with c < h. (If SI
~
=0
Otherwise, suppose that
S2 then consider SI - S2). Then
E{E(wh(Sd wh(S2) I Xii"· ., X ic )}
E(W h(SI)W h(S2))
E{W h(SI)E(W h(S2) I Xii' ... ,Xic )}
-
o
•
since Whc (Xl' . .. ,Xc) = 0 by Theorem 4.2.
For the one-sample U-statistics of degree 3,
JJ~~I = (~)-1. 2;. (Drj _ Drj,)2, U~~~ = (~)-1. ~ .(Drj 1<3<3'
Df,j)2
,<,,<3
U~~~ = (~)-1 .L. (Drj _ Djj')2,
and
1<3<3'
P9l
= E(u~~I)
-
E(U~~~) = E(U~~~) = E(Drj - Drj ,)2
:2
[t 6f + L
k=1
ki
=f.k2
6tk2 -
85
t
k=1
P(Xfk =I- X:k , Xfk =I- X:'k)
where, for any 9,
C-1
8f(i,j)
= 8f = P(XYk =I- X]k) = L
(4.3.23)
pf(c)[l - pf(c)],
C=O
8fl k2(i, j) = 8t k2
P (XYk l =I- X]kl ' XYk2 =I- X]k2)
-
,,~=o~k,(C"C21 [~ ~ fl,k,(C3,
1]'
(4.3.24)
pf(c)[l - pf(C)]2,
(4.3.25)
c4
ca;!Cl c4;!c2
C-1
8f(i,j; i,j')
= P(XYk =I- X]kl
XYk =I- X]lk)
=L
C=O
P(XYkl =I- X]kl' XYk 2 =i X]lk 2)
C-1
L
pfl
k2 (C1, C2) [1 -
pfl
(C1)] [1 -
pf2 ( C2 ) ] ,
(4.3.26)
Cl,C2=O
(4.3.27)
If the null hypothesis of interest is that there is no between-group difference in Ham-
ming distances, i.e., H o: 81
= ... = 8f = 8k and 8t Jc = .. , = 8~k2 = 8klk2 , then
Note that ¢l,l(Xf, X~, X~I)
2
= (Dfj -
Dfjl)2 is not a symmetric kernel. Therefore,
we need to replace it by the symmetric kernel
Since the sequences are i.i.d.,
86
E[<P~,l (xf,
xJ, X;,) -
JLgl]
3~2 {2 k=l
t P(X%k i= xfk) + 2kd:k2
l: P(X%k
=I-
1
Xfk 1 ,
X%k2
i= Xfk 2)
K
l: (P(XJk i= Xfk)f - l:
k1 ,#k2
k=l
P(XJk1
i= Xfk
1
,X;'k2 i= Xfk 2)
K
2l: P(X%k i= XJ'k' XJk i= Xfk)
k=l
K
2 l:
k1 ,#k2
P(XJkl
i= X%'kl' XJk 2 i= Xfk 2) - 2l: Of
k=l
K
2 l:
kl,#k2
+
By
3
l:
kl,#k2
0tk2
+ 3l: P(XYk i= XJk, XYk i= XJ'k)
k=l
P(Xf,.l =I- XJk 1 ' X!k 2 =I- X%'k2)}
(4.3.21),
ur2 = JLgl + ~
Example:
t
(W(l,k)l(Xn -
JLgl)
+ Op(N- 1)
for k
= 1,2,3.
i=l
Two categories
Let, for any 9,
pf(O)
= P(XJk = 0) , pf(l) = P(XJk
pf1 k2(1,1) = P(X%kl =1,X%k2 =1) ,
pf1 k2(1,0)
= 1)
Ff k2(0,1)=P(XJk
1
1
= P(XJk = 1,X%k2 = 0) , pf k2(0,0)
1
1
= P(XJk1
=0,XJk2 =1)
= 0,XJk2 = 0).
•
Therefore, there are 4 parameters for each group.
Now,
'I/J(l,l)l (x!!)
-
9~4 {4 (t P(XJk =I- xfk )) + 4(kl,#k2
l: P(XJkl i=
2
k=l
87
Xfkl'
XJk 2 =I- Xfk 2)) 2
+
(tk=l (P(XJk
2
=1=
+(
XIk)f)
I:
kd=k2
P(XJk 1
XIk1 ' XJ'k2 =1= XIk 2)) 2
=1=
2
2
+ 4 (tp(XJk
k=l
XJ'k' XJk
(t Of + I: Of
(t
+ [-2
+
=1=
3
k=l
k=l
P(XYk
k1 =#=k2
=1=
=1=
XIk)) +4
=1=
XJ'k)
XJk' XYk
+L
kl=#=k2
(tk=l P(XJk
- 4
(t. P(XJd f.») (f; (P(XJd
- 4
-8l~
=1=
Xfk)) (
f.»)
P(XJ. # X
- 8 ( t P(XJk
k=l
4 k=l
+ ( t P(XJk
+ 3
Xfk)) (
(t
k=l
P(XJk 1
L
k1 =#=k2
P(XJkl
=1=
XJkl' X!k 2 =1= XJ'kJ)] 2
XJk2
=1=
Xfk2))
f.)')
=1= Xfkl'
=1=
XIk)) (
=1=
XIk))
I:
k1 =#=k2
P(XJkl
=1=
XJ'k2
Xfk 2))
=1=
XJ.kl' XJk 2 =1= Xfk 2))
[-2 (k=lt Of + L:=#=k2 0t k2').
k1
P(Xfk
=1=
XJk' X!k
=1=
XJ'k) +
L
kl=#=k2
=1=
L:
P(XJk1
=1= Xfkl '
(L
P(XJk1
=1= Xfkl'
XJk2
=1=
Xfk 2))
L
P(X;k1
=1= Xfkl'
X;k2
=1=
Xfk 2)) (
- 8(
XJ'k 1 ' XJk2 =1= Xfk 2))
(f; P(XJ. # X:"' XJ. # Xf.»)
k1 =#=k2
XIkl' XJk2
~~
kl=#=k2
XJ~
4( L=#=k2 P(X;k
1
=1= Xfkl'
XJk2
P(Xi11
Xfk2)) ( t (P(XJk
k=l
=1=
Xfk2)) (
=1=
Xfk 2))
kl
L:
=1=
=1=
P(XJk1
Xfk))2)
=1= Xfkl'
XJ'k2
~~
(tk=l P(X;k
L
=1=
P(X;kl
X;'k, X;k
=1=
~~
=1=
kl
=1=
Xfk 2))
Xfk))
X;'kl' X;k 2 =1= Xfk2))
[-2 (~=k:= Of + L=#=k2 Ot k2)
1
88
X~kl' X;k2 XJ'~))]
=1=
=1=
~~
+
X
P(XJk1
- 4(
=1=
P(XYk l
=1= Xfk1 '
L:
- 4(
- 8
L
k1 =#=k2
X
( t P(XJk
k=l
1
1 k2 )
+8
=1=
(I:
P(XJk
kd=k2
.
+
(t
3
+2
k=l
P(XYk
-I- XJk'
-I- Xh) +
XYk
L:
P(XYk l
kl::j.k2
(tk=l (P(XJk -I- X;k)r) (L:
P(X;k -I- X;k
kl::j.k2
1
1,
-I- XJk
1 '
X[k2
-I- X;'k 2))]
X;'k 2 -I- X;k 2))
(f, (p(Xr, f- xf,l)') (E p(Xr, f- xr", xr, f- xf'l)
+4
(tk=l (P( X;k -I- X;k)) 2) (kL:::j.k2 P(X;k -I- X;'k X;k2 -I- X;k2))
- 2(tk=l (P(X;k -I- X;k) r) [-2 (tk=l fJf + L:::j.k2 t k2 )
+4
1'
1
1
fJ
k1
+
3
(t
k=l
P(XYk
+ 4 (L:
k1 ::j.k2
-I- X;k'
-I- X;'k) +
XYk
L:
kl::j.k2
+ 4 (L: P(XJkl -I- Xfk X;'k 2 =I- Xik2)) (
1'
~::j.~
+
3
(t
k=l
1
-I- Xfk
1'
XJ'k2
Xfk2
-I- X;'kJ)]
L:
~::j.~
P(X;k1 =I- X;'k 1 , XJk 2 =I- Xik2))
[-2 (t
fJf + L: fJf1k2)
k=l
kl ::j.k2
-I- Xik2))
-I- X;'k) +
P(XYk =I- XJk' XYk
-I- X;kl'
(tk=l P(X;k -I- X;'k' XJk -I- Xfk))
P(X;k1 =I- X;k1 , X;'k 2 -I- Xik2))
- 2(k1L:::j.k2 P(XJk
P(X[kl
L:
kl::j.k2
P(XYkl =I- XJk 1 , XYk2 =I- X;'k 2))]
(tk=l P(XJk =I- X;'k' XJk -I- X;k)) ( kl::j.k2
2: P(XJk =I- XJ.kl' XJk2 =I- Xfk2))
- 4(t
P(XJk =I- XJ'k' XJk =I- Xfk)) [-2 (t fJf + 2: fJf1k2)
k=l
k=l
kl ::j.k2
+8
+
3
1
(t
k=l
-I- X;'k) + 2:
P(X?k =I- XJk' X?k
- 4( 2:::j.k2 P(XJk
1
-I- XJ'kl'
kl::j.k2
XJk2
k1
+
•
3
(t
k=l
P(XYk
-I- X;k'
XYk
-I- XJk
1 ,
X?k2 =I- XJ.k 2))]
[-2 (tk=l fJf + 2:::j.k2fJt k2 )
-I- Xfk2)
-I- X;'k) +
P(X!k1
k1
L:
k1 ::j.k2
and
89
P(X[kl
-I- X;k
1 '
X!k2 =I- XJ.k2))]}
For the I-sample V-statistics of degree 4,
u~~i = (~)-1, .~
(Drj
Dr'j')2,
_
'<3<"<3'
and
u~~~ = (~)-1. ,L
.(Drj
-
Dr'j,)2,
'<"<3'<3
j-tg2 -
E(u~~i)
- ;2 {t
k==l
= E(U~~~) = E(U~~~) = E[(Drj -- Dr'j' )2]
8f(1 - 8f) +
L
kl'Fk2
(4.3.28)
(8tk2 - 8t 8f2)}
and under H o,
E[cP2,1(xf, X~, Xf" X~,) - j-tg2]
1P(2,l)1(xf)
;, {t(l- 28f)P(XJ. '" xf.) +
+
L
k1:#k2
P(X%kl
L
2
-I
Xfk 1,X%k2
-I
t.
Xfk 2) +
9:{28f -1)
L (28f 8~ -
k1:#k2
1
8f1 ~)
8f2P (X%k1 -I XfkJ}
(4.3.29)
"1 ;o!"2
which becomes under H o
K
1P(2,l)1(Xi)
-
;2
+
{
L
kl :#k2
2
K
{;(1 - 28k)P(Xjk =I- :r:ik) + {; 8k(28k - 1)
P(X jk1 =I- Xikl' X jk2 =I- ~rik2)
L
+L
k1:#k2
(28 k1 8k2 - 8k1k2 )
8k2P(Xjkl =I- Xik1)}
•
"1 ;o!"2
90
By (4.3.21),
UJ~ = I-lg2 + ~ t (W (2,k)1 (Xf) -
1-l92)
+ Op( N- 1 )
for k
= 1,2,3.
i=1
4.3.2
Multiple-Sample V-statistics
Let {X~j); i ~ 1}, j = 1, ... , c (~ 2) be independent sequences of independent
random vectors, where x~j) has a distribution function F(j)(x), x E RP, for j =
1, ... , c. Let F = (F(I), ... , F(e)) and ¢(X~j), 1 ~ i ~
measurable kernel of degree m =
(ml, ... ,
1 ~ j ~ c) be a Borel-
mj,
me), where without loss of generality we
assume that ¢ is symmetric in the m j (~ 1) arguments of the j th set, for j = 1, ... , c.
Let -rno
= ml + ... + me and
Definition 4.6
For a set of samples of sizes n =
(nl' n2, . .. , n e )
with
nj
>
mj,
1
~ j
< c, the
generalized U-statistic for 8(F) is
U(n)
IT;::.
e
=
(
j=1
) -1
1~ j
~ c),
(4.3.31)
(n)
3
where the summation
L* ¢(X~), a = ijI, ... , i jmjl
'Ltn) extends over all 1
~ i j1
< '" <
i jmi ~
nj,
1 ~ j ~ c.
U(n) is an unbiased estimator of 8(F). The generalized von Mises' functional is
8(F(·, n)) =
(u
i
n;m )
3=1
U{f:... f: ¢(X~),
3=1
lil
a
= i j1 , ... , i jmj ,
'imi
1
~ j ~ C)}
(4.3.32)
•
•
Now, for every dj
Wdl ...
:
0
~
dj
~ mj,
1~j
~
c, let d = (d 1 , ... , de) and
dc(X~j), ... ,X~), 1 ~j ~ c) = E(¢(x~j),.,.. ,x~),X~~I'''''X~~, 1 ~j ~ c))
(4.3.33)
91
so that Wo = O(F) and
so that eo(F)
= O.
Var[U(n)] =
wm
= <p. Then
~
Then, for every n
m (Sen, 1981.),
[u (:::.)]-1 {~d=O ... dc=O~ U(;3)1 (~.-_~~)edl, . .'dc(F)}
3=1
c
3
~ njla} [1.
j=1
3=1
1
3
3
3
+ O(no1 )]
(4.3.35)
where no = min(nl, ... , n c ) and
(J'~3
with
oO/{3
= m~tr.
3
r.
<"0Jl,· •• ,OJc
(F)
J'
= 1., .... ,c
(4.3.36)
= 1 or 0 according to whether 0: = f3 or not.
For a two-sample U-statistic of degree (ml, m2), the kernel is
and
where E(n,m) extends over all 1. ~
i j1
< ... <
ijmj ~
nj, j = 1,2. U(nb n2) is an
unbiased estimator of O(P(1), p(2)).
Define
"1
"2
"2
~ ~
~
i1ml=1 i2 1=1
<p(Xil b ... , XiI mIl 1'i2 1, ..• , 1'i2m2)
i2 m 2=1
Then,
(4.3.37)
provided the variance of U"1 ,"2 exists.
92
E{~~ld2(X1"" ,Xdl , Yi,
edl d2
-
E{'11~ld2 (Xl, ... ,Xdl , Yi,
= 0, ... , mI, d2 = 0, ... , m2.
for d1
(eOG
= 0).
, Yd 2)}
, Yd2 )} - (j2(F(l), F(2))
(4.3.38)
Then,
(4.3.39)
where
(4.3.40)
The decomposition for U(n) can be developed similarly to the one-sample Ustatistic. For a two-sample U-statistic, let
= f···
f ¢(U1"'"
hi
u ml ; VI,
... , vm2 ) J![d(c(Uj - Xij)) - dF 1(Uj)]
3=1
ml
IT
x
~
dF 1(Uj)
j=hl +1
IT [d(c(vi -
liz)) - dF 2(vI)]
1=1
~
IT
dF 2(v,)
l=h2+1
then it follows as in the one-sample case that
¢(XI, ... , X ml ; Y1,· .. , Ym2)
ml
=L
hi =0
where
l:ml,h l
m2
L L L
h2=0 ml ,hi m2,h2
'11 hl ,h2(Xjp ... , Xjhl; Yip"" YlhJ
(4.3.41)
is taken over all subsets 1 ~ j1 < ... < jhl ~ m1 of {1, 2, ... , m1}'
Theorem 4.4 (Lee, 1990)
The two-sample U-statistic admits the representation
(4.3.42)
where U~hl~h2)
is the generalized U-statistic based on the kernel '11 ho I.h2 and is given by
I, 2
93
The functions Whl ,h2 implicitly defined by (4.3.41) satisfy
(i) E{W hl ,h2 (X I , ... , X hl ; Yi, ... , Yh 2 )} = 0
(ii) Cov(W h 1,2
h (Sl, S2), Who,l ' h'2 (S~, S~)) = 0 for all integers hI, h 2, h~, h~ and sets Sl,
O
S2,
S~, S~
unless hI
= h~,
h 2 = h~ and SI
= S~
and 8 2
= S~.
•
The generalized V-statistics U~~~~~2) are thus all uncorrelated. Their variances are
given by
(4.3.43)
From (4.3.42) we can write
where
and U*(nl, n2) is also a generalized V-statistic for which
Var(U*(nl, n2))
E[U*(nl, n2)]2
- E[U(nl, n2) - 9(F) - UI(nl) - U2(n2)]2
Var(U(nl,n2))
+ E(Uf(nI)) + E(U;(n2))
- 2 E[(U(nl' n2) - 9(F))UI (nl)] - 2 E[(U(nl, n2) - 9(F))U2(n2)]
+ 2E(UI(ndU2(n2))
= Var(U(nl, n2))
- 2 E[Uf(nd
+ E(Uf(nd) + E(U;(n2))
+ UI(nl)U*(nl, n2)] 94
2 E[U;(n2)
+ U2(n2)U*(nl, n2)]
since the generalized V-statistics U~~:;'~2) are all uncorrelated (Theorem 4.4).
If Var(U*(n1' n2)) = O(ni)2), then U*(n1, n2) = Op(ni)l). Also, U1(n1) - 8(F)
and U2(n2) - 8(F) are one-sample V-statistics. Therefore, U(n1' n2) can be written
as
The above expression can be generalized for multiple-sample V-statistics. For
instance, the decomposition for a three-sample and four-sample V-statistics are as
follows
m
m
~
~
8(F) + _1 L[WlOO(Xi) - 8(F)] + _2 L[WOlO(l'i) - 8(F)]
n1 i=l
n2 i=l
"3
+
m3 L[W OO1(Zi) - 8(F)]
n3 i=l
+ Op(no1 )
(4.3.45)
For the two-sample V-statistic of degree (2,2),
U~2,2) = (N)
2
-1
(N)
2
E(D!!~
- D~2.,)2
= E [(D!!~)2
I)
1 )
I)
-1
L
L
(Df) - Df,j,?,
l<i<j<N
- l<i'<j'<N
-
+ (D~2.,)2 1 )
95
2D!!~
D!!,2.,)
I)
I)
•
(4.3.47)
1
E(Dfj)2
E [K
t; I(Xfk :f: X]k)
K
[t
- ;2 {t
I~2 E
k=1
k=1
1
K2
2
1 (Xfk
] 2
~
:f: X]k) + 2
I (X[k 1
:f: X]k
1)
~<~
P(XYk
~<k2 P(XYk
:f: X]k) + 2
1
:f: X;k
1 ;
I (Xfk 2 :f: X]k 2)]
XYk2
:f: X;k 2)}
k1
~
eg 2 " eg
L.J k + K2 L.J k k2
k=1
k1
<k2
1
And, for any 91 and 92,
E(D!!~
D~2.,)
I)
1 )
=
=
{;,It, I(Xrh~ XJLl] [t,
~2 [t
:f:
:f:
E
{E
{t
~2 {t
= ~2
=
I(Xf.i
k=1
k=1
k=1
I (XY':
XJk) I (XY,1
i
X::k )]}
~
XJ,2k)+
I (XYr:1
kd:k2
P(XYr:
:f: XJf, XY,1 :f: XJrk)+ ~
p(Xr,:
:f: X]np(X;,~ :f: Xjrk)+
k1 #k2
~
kd:k2
= 2= 2K2 {~
L.J ek ek + "
L.J ek 1 ek2
K2 {~
L.J
g1
g2
g1
g2
}
k=1
k1
#k2
P( X!f1
k 1 =1
:f: XJlc
~
L.J e ek2
k2=1
':1
g2
}
since the sequences in groups 91 and 92 are independent.
Therefore,
96
1
,XY,~2 :f: Xf~2)}
:f: Xjt )P(X;'~2 :f: Xj'~2)}
p(Xrr:1
g1
:f: XJt )I (XY,12 :f: X Jrk2 )] }
(4.3.49)
1f(3)10(Xfl )
- E{<P3(xf\ X~l; X~2, X~n - P(gl,g2)3}
{[~
E
k=l
L
+2
~<~
P(Xfk =I Xjk)
8f:k2 - 2
{t
~2
~
I(xik ,. XIk) -
{t
~2
=
E
k=l
E
I(Xf,j, ,. XFk)r -1'(0".,)3}
+2 L
P(Xfk1 =I Xjk 1 , Xfk 2 =I Xjk 2) +
k1 <k2
t 8f2 P(Xfk =I XJk) - 2 L
k=l
~¢~
P(Xfk =I Xjk)
+2 L
k1
<k2
P(Xfk1 =I Xjkl , Xfk2 =I Xjk 2)
k=l
k1
g1
- 2 '"
L.J 8k1 k2
~<~
K
¢k2
8f~P(xfkl
=I Xjkl) -
L 8fl
k=l
g1 g2
g1 g2
+ 2~
L.J 8k 8k + 2 '"
L.J 8kl 8k2 }
k=l
g2
=~
K2 {~
L.J 8k (28 k g1
k=l
~¢~
1) + 2 '"
..j. X!!l
X!!l..j.
X!!l
L.J P(X!!l
Ik 1 T
]k1 '
Ik2 T
]k2 )
kl <k2
K
+
+
L(1- 28f2)P(xfk =I XJk) - 2 L
k=l
kl
¢k2
8f~P(xfkl =I XjkJ
L (28f~ 8f~ - 8f~ k2) }
k1 ¢k2
Under Ho,
1f(3)1O(Xi)
E{<P3(Xi, Xj; XiI, Xjl) - P3}
=
1
K2
+2
{K
{; 8k(28k ~ P(Xik1
K
1) - ~(28k - l)P(xik =I X jk )
=I X jk1 , Xik2 =I Xjk 2) - 2
~<~
L
~~~
97
8Y:
8f~P(xfkl =I XJkJ} -
K
L 8f2 P(Xfk =I Xjk) - 2 L
- 2
t
k=l
8k2 P(Xik1 =I XjkJ
P(gl,g2)3
Similarly,
'IjJ(3)Ol (Xf,2)
E {4>3 (Xf! , X~! , xfl , X~~) - J-l(9! ,92)3}
{[~
E
E
;2 {t
# XJi) -
+2
L Of~k2 +
Of!
~<~
k=l
~<~
-
=
2
L 0f~ P( xflk! =I- X Jlk!)} -
J-l(9! ,92)3
k! #;k2
K
L
P(xf:t =I- XJlk )
k=l
k=l
P (xflk =I- XJlk) + 2
k=l
Of! P(xf,~ =I- XJlk) - 2
k=l
92
_ 2 '"'
L.J 0k! k2
~<~
=
t
;2 {t
P (xflk! =I- XJlk! ;
L
Of~P(xf,t =I- Xjr,2k!) -
k! #;k2
k=l
Of2 (20f! - 1) + 2
L
~<~
+
K
Lor
k=l
92
'"'
L.J 09!
k! 0k2 }
~#;~
P (xflk! =I- XJlk! Xf,2k2 =I- XJ,2k2 )
K
L(l - 20f! )P(Xf,2k =I- XJ,2k ) - 2
~l
Xf'~2 =I- XJlk2 )
L
k! <k2
92
+2~
L.J 09!
k 0k + 2
~l
+
-1'(""")3)}
P (Xf,2k! =I- XJlk! ' Xf,2k2 =I- XJ,2k2 ) - 2 L Of! P (xflk =I- XJlk )
;2 {t
- 2
I(xf.d XJlk)r
K
L
+2
E
~
I(Xr..
L
~~
Of~P(xf,~! =I- XJ,2kJ
L (20t Of~ - O~kJ}
k! #;k2
98
99
/;4
(tk=l 8fl(28f2 -1)) ( L::pk28f~P(xfkl -I X:'~I))
k1
g1 g2
~
K4 (~
L..J 8k (28 k k=l
+
:4 [L
leI < 1e 2
+
+
L
1)) (""::pk2(28g1 8kg22 - 8g1klk2,'l)
L..J
kl
kl
P(xfkl -I XJk l ; xfk l -I XJk 2)(1 - 28Z:)P(xffl -I XJfl )
P(xfk -I XJf Xff -I XJkJ(I- 28Z~)P(xfk2 -I XJk 2)
1 ;
l
kl <k2
L
p(xnl =f
leI ;/ 1e2;/IeS
leI <1e2
l
XJf xnl -I XJk2)(Il
;
28Z~)p(xns =I XJk
S )]
g1
gl
l g2
8 [""
gl 1 -'gl -Igl -I- X
K4
L..J P( xik
r xgjk2 )8kl P( xik2
r jk2)
r X jk1 '. xik2
lei < 1e2
gl
g1
g1
P( xikl
gl -'gl 2 -'gl -Ir X jk2){j92P(
r X jk1 ., xik
17k2 xikl
r X jkl )
L..J
+ ""
kl
+
<k2
L
lei ;/ 1e2;/ le a
lei <1e2
+
L
P(Xffl -I
XJf Xfk 2 =I XJf 2)8Z: P(xfks -I XJk s)
l
;
P (Xfkl -I XJk 1 ; Xfk2 =f XJk 2)8f~ P (xft =f XJt )
lei ;/1e2;/IeS
lei <1e2
+
L
P( Xfkl =I- XJkl ; Xfk2 -I XJk2 )8f~ P( xfks
=I XJks)
leI ;/1e2;/lea
lei < 1e 2
+
L
P(Xfk1 =l X Jk 1 ; Xfk 2
leI ;/1e2;/IeS
lei <1e2
+
~
K4 (""
L..J
~<~
-IXJkJ8f~P(xfk2 -IXJk2)]
g2
g1
P(X!!1
\:""" (28 g1
.kl -Ir X!!1
)kl .' X!!I
.k2 -')k2)) ( 'L-J
k l 8k2 - 8kl k2 ))
r X!!1
~::p~
;4 [L
+
+
+
L
k1::pk2
(1 - 28f:)P(xfkl -I XJkJ 8f: P(xfk2 :1= XJkJ
k1::pk2
(1 - 28f:)P(xfk1 =I XJkJ8f~P(xfkl =I XJ~I)
L
(1 - 28f:)P(xfkl
k1::pk2::pks
=I XJk l )8f~P(xfks =I XJk
~
- 2(}92)P(X!!1
K4 (~(1
L..J
k
.k -I X!!I))
)k
k=l
S )]
( ""
_ (}glk1k2))
L..J (2(}91(}92
k 1 k2
k l ::pk2
100
Similarly,
'IjJ(3)Ol (xf~)
;4
(tOf2(20f1 _1))2
+
~l
;4 (L
~<~
P(Xf,2k1 =I XPk 1 ; Xf,1 2 =I
Xf~k2))2
+
~
(t(1 K
k=l
+
;4 ( L Of~P(xfii:l =I Xf,2k1 )) 2+ ;. ( L (20f~Of~ _ Of~k2)) 2
+
;4 (tO~2(20~1 -1)) (L
+
~. (t. er(20:' - 1)) (t,(1 - 20:' )P(xf,'. '" Xj'.'.))
20fl
)P(Xf~k =I Xf~k))2
k1::f:.k2
kl :#2
kl <k2
k=l
P(Xf,2k1 =I
Xf,t; Xf'1 2 =I XJ,2k2 ))
1)) (k1L::f:.k2Of~P(xf~kl =I Xf,2k1 ))
g1
g1 g2
~
K4 (~ Og2
k (20 k _ 1)) (""' (20 kl 0k2 _ Og2
k1k2 ))
k=l
kl::f:.k2
;4 ( t Of2(20fl k=l
LJ
LJ
;4 (L
:4 (L
+ ;4 (L P(Xf~kl
;4 (E(1 -
+
~<~
P(xf,11 =I XJ,2k1 ; Xf'11 =I XJ,2k2 ))
P(Xfikl =I
~<~
~<~
+
-I- Xf,2k1 ;
;4 (t(1k=l
-
~::f:.~
k1::f:.k2
Of2(20fl _'1))2
k=l
-I-
20~1 )P(Xf'1 =I Xf?k)) ( k1L::f:.k2 (20~~ O~~ - O~:k2))
;4 (L Of~P(xf,t -I-Xf~kl)) (L (20tO~~
(f.
+ (L
;4
-I- XJ,2k1 ))
~::f:.~
=I
kl ::f:.k2
kl::f:.~
-
~l
LOf~P(xf,11
Xf~k2 Xf~k2)) (L (20f~Of~ Of~k2))
Xf~k)) (L Of~P(xf,2kl Xf~kl))
Xf~kl ; xf,12 -I- XPk2)) (
20fl )P(Xf,2k =I
k=l
(E(1- 20fl )P(Xf~k -I- Xf~k))
;4
k1<k2
101
-Of:k2)
P(xf?kl -I- Xf,2k1 ; xfii:2 =I XPk 2)) 2
+ ;. (t.(1 -29f' JP(Xf'k # XJikJ)'
+
+
+
;4 (~ 9f~P(xf,t
;4 (t -1)) (~
9f2(29f1
k=l
(t.
k1 <k2
8f' (20:'
- 1
...t.
L..J
L..J
L..J
[~
P(xf,11 -:J XJfk1 ; Xf,t -:J X],1 2)(1-
~ P (Xf?k1 -:J XJ?k1 ; Xf,2k1
k1<k2
~
"1 #"2#"S
=I XJ?k 2)(1 -
P( Xf?k 1 -:J XJ?k1 ; Xf?k 1 -:J XJ?k 2)(1 -
:4 [~
"1 <"2
=I XJ,2kJ
2(}:~ )P( xf,1 -:J X],2ks )]
s
)9f~ P( xf,12 =I X],2k2 )
P( Xf,2k1 -:J XJ?k1 ; Xf?k2 -:J X],1 2
~ P(Xf,2k1
k1<k2
29f~ )P(xf,t =I XJ~1)
29f~) P(Xf?k2
"1 <"2
+
XJ~2))
P(Xf,2k1 -:J XJ,2k1 ; Xf?k 2 =I
L..J
"1 <"2
+
(~(29f:9f~
-9f:k2))2
k1;j;k2
J) (t. (1 - 20:' JP(Xf,'k # XJikJ)
92
92 91
~
K4 (~9 k (29 k _ 1)) ('" 9k2 P(X!!2
l'k1 r X!!2
.7'k1))
k=l
k1;j;k2
92
91 92
92 91
~
K4 (~ 9k (29k _ 1)) ('" (29 k1 9k2 _ 9k1k2))
k=l
k1 ;j;k2
~.
+ ;4
+
=lXJ,t))' + ;.
k1;j;k2
=I XJ,2k1
; Xf?k 2 =I XJ,2k2)9f~P(xf,2k1
=I XJ,2k1 )
+
'"
P( Xi'k
92 ...t.
X92 . 92 ...t. X92 )LJ91 P( 92 ...t. X92 )
L..J
1 r j'k1' Xi'k 2 r j'k2 Uk 1 Xi'ks r j'ks
+
92
92
92
'"
92 ...t.
92 2 ...t.
.92 ...t.
L..J P( Xi'k1
r X j'k1 ., Xi'k
r X j'k2 )LJ91
Uk sP( J'i'k2
r X j'k2 )]
"1#"2#"S
"1 <"2
+
;4 (~
~<~
P(Xf?k1 -:J XJ?k 1 ; Xf?k 2 =I X Jfk2 )) ('
~ (29f:9f~ - 9f~kJ)
~;j;~
102
."
- ;4 [L
(1 -
kl=#2
2et~ )P(xf,t -I- X;,2k 1)et~ P(xf'k2 -I- X;,2kJ
+ L
k1
(1 - 2et )P(Xf,2k1
:f;k2
kl
:f;k2:f;ka
L
+
+
(1-
-I- x;'kJet~p(xf,2kl -I- X;,2kJ
2ef~)p(xf,2kl -I- x;'kJef~p(xf,2ka
-I- Xj,2ka )]
~4 (t(1
- 2e~1 )P(Xf,2k -I- Xj,2k)) ( L (2e~~ef~ - e~k2))
k=l
:f;k2
k1
- ;4 (L ef~p(xf,2kl -Ik1
e~~)
-
:f;k2
X Yk1 )) (
L (2ef~ef~ - ef~k2)
k1
:f;k2
E{ 1/J~o(Xrl )}
- ;. (E or
+
;4{L [t (~~k2(U, ~~k2(1L [t ~~
~~
~~
L [t ~~ka ~~ka ~~k2ka
L: [t rfIc~k2
rfIc~ka t)rfIc~k2ka
k1
+
+
+
+
X
+
(2e:' - 1l)'
2
V))2
<k2<ka
u, v t=o
k1
<k2 <ka
u, v t=o
k1
<k2<ka
u, v t=o
2
L:
u, 1 - V)]
U,v=O
k1
2
2
<k2
k1:f;k2:f;ka:f;k4
[t
k2(u, v)
u,v,t,z=O
ka ( u, t)
k2 ka (1 - u, 1 - v, 1 - t)]
(u, t)
(v, t)
(1 - u, 1 - v, 1 - t)]
(u, v)
(v,
(1 - u, 1 - v, 1 - t)]
~~ka (u, t)P~~k4 (v, z)
~~k2kak4(1-U,1-v,1-t,1-Z)]}
~4 {E(1- 2ef2)2
[1;
(1-
~1(U))2P~I(U)]
+
-28tH1 - 28t) [.to (1 - pf. (u») (1 - v:: (v)) pf..,(u, vl]}
;. L~., (Of:)' [to (1 - pf. (u»)' pf. l]
+
2
+ 2.~y
(u
E"f... (8t)' [.tJ
1-
1"': (u l)(1 -
1<2<l<a
103
v::(v)) p.,... (u, v)]
+
+
t
[t (1 - pf~
L of~of~
k=l k1 i=k2 i=ks
2.~r., 9::~
+ 2
u=O
(u))2 pf~ (U)]
[uto (l-lfo:{u)) (1- P"';{V))lfo:r.,{u,V)]
tk=l k1 i=k2i=ks
L Of:Of~ [t (1 - pf~ (u)) (1 - pf~ (v)) pf~ks (u, V)]
v=o
u,
+ 2
L
k1i=k2 i=ks i=k4
+ ~
K4 (""
L..J
;4 (t of
1
(20
k=l
+ ;.
[t (1-Pf~(U))(1-Pt(V))Pf~k4(U'V)]}
u,
v=o
92
91
(20 91
k10k2 _ 0k1k2)) 2
k1i=k2
+
Of~Of~
[t pf~j:2(U,V)pf~k2(1-u,1-V)])
r -1)) (L
k1 <k2 v=O
u,
(E ef'{29f - 1)) (E{l -29f) [t.(1 -lfo'
(u))Jf
(U)])
;4 (t
Of1 (20f2 - 1)) ( L Of~ [t (1 - pf1 (u ))pf1 (U)])
k=l
k1 i=k2
u=O
91 92
~
K (~
L..J 0k (20 k 4
+
+
+
1)) (""
(20 91
k10k2
k1 i=k2
L..J
k=l
;4 {L
fJ2
L
(1-
~
091
k1k2))
t pf~k2(U,v)(1- pf~
(1- 20f:) [
k1<k2
u,v=O
k1<k2
-"
(1 -
U))pf~k2(1- u, 1- V)]
t
20f~) [u,V=O pf~k2(U,v)(1- pf~(1 - V))pf~k2(1- u, 1- V)]
20f~) [
(1 -
u,
"1 ;/"2;/"S
"1 <"2
t
vt=O
pf~k2(U, v) (1 -- pf~ (t)) rJl~k2ks (1 -
u, 1 - v, t)] }
- :4 {L Of~ [t pf~k2(u,V)(1-pf~(1-V))pf~k2(I-U,I-V)]
k1 <k2
+
+
L Of~
k1<k2
~
"1 ;/"27I "S
"1 <"2
+
tv=o rJl~ k2 (U, V) rJl~ 'i~)) rJl~ k2 u,
Of: [ t fi~k2 (u, v)(1 - rJl~ (t))fi~k2ks (1 - u,
Vt=O
[
L Of~
"1 ;/"2 ;/"S
"1 <"2
uv=O
(1 -
(1 -
(1 -
1 - V)]
u,
1 - v, t)]
u,
[t rJl~k2(u,V)(I-rJl~(I.-U))pf~k2(I-u,I-V)]
u,
v=O
104
(}f~ [ t rfl~k2 (U, V) (1 - rfl~ (t)) rfl~k2ka (1 -
2:
+
kl ;ilk2 ;ilka
kl <k2
(}f~
2:
+
u, 1 - v, t)]
u, v t=o
kl ;ilk2;i1ka
[t
U,V=O
rfl~k2(U,V) (1- rfl~(l-v))pf~k2(1- u, 1- V)]}
k l <k2
+
;4 (2: (2(}f~ (}f~
;4 {2:
kl
+
-
i'k2
(}f~ k2)) (2:<k2 [ t rfl~ k2(U, v) rfl~ k2 (1 kl
(1 - 2(}f: )Of: [ t (1 kli'k2
U,V=O
+ 2:
i'
k1 k2
(1 - 2(}f: )of~
2:
+
kl
i'k2 i'ks
[t (1 - pf~ (
kl
i' k 2
kl
i'k2
-
-
rfl~ (u ))( 1 - rfl~ (v) ) rfl~ k2 (u, V)]
? pf~ (u )]
[t (l-rfl~(U))(l-Pf~(V))rfl~ka(U'V)]}
(1-2(}f:)(}f~
;4 (2: (2(}f~(}f~
- ;4 (2: (2(}f~ (}f~
+
u)
u=o
u, 1 - V)] )
u, v=o
U,V=O
(}f~k2))
(£:(1- 2(}f2) [t(l- rfll(U))rfll(U)])
k=l
U=O
(}f~k2)) ( 2:i'k2 (}f~ [t(l - rfl~ (u) )rfl~ (U)])
kl
U=O
Similarly,
e~~)
-
E{1/J51(Xrn}
-
~4(t8f(2~' -1))'
;4{ 2:<k2 [t (rfl~k2(U'V)frfl~k2(1-U,1-V)]
+
kl
+
+
+
+
X
2:
2
kl
<k2 <ks
kl
<k2 <ks
kl
<k2 <ks
2:
2
2:
2
2
2:
U,V=O
[t
u, v t=o
[t
u, v t=o
[t
u, v t=o
k1i'k2i'kai'k4
rfl~k2(U'V)rfl~kS(U,t)rfl~k2ka(1-U'1-V'1-t)]
rfl~ks(U,t)pf~kS(V,t)pf~k2kS(1-U,1-V,1-t)]
pf~k2 (u, v) pf~ks (v, t)pf~k2ks (1 -
[t
U,v,t,z=O
pf~ks(U,t)pf~k4(V,Z)
pf~k2ksk4(1- u, 1- v, 1- t, 1- Z)]}
105
u, 1 - v, 1 - t)]
+ ~4
R
{t(l - 28fl)2 [t (1 - pfc2(U))2 pfc2(U)]
k=l
u=O
+ 2k~" (1 +
;4{L
(8f~r [t (1 - pf~(u)r pf~(U)]
k l :f;k2
t. .,f..,
+ 2
+
u=O
(8::)' [.~O (1 - pr,(u)) (1 - pr,(v)) pr,k,(u, V)]
le2<le3
t
U)
+ 2 k~" 8r,o::
t.
u=O
[.t
k,¢t:¢k' 8r,8f;
L
+ 2
k l :f;k2:f;k3 :f;k4
+
(1- rl:(u)) (1- pr,(v)) pr,.,(U,V)]
[.~O (1 - pr, (u)) (1 - Jt(v)) rl:k,(U, V)]
8f~ 8f~ [ t
pfc~ (u)) (1 -- pfc~ (v)) pfc~k4 (u, V)] }
(1 -
u, V=O
;4 (L (28f~8~ _ 8f~k2)) 2
k l :f;k2
+
r
L 8f~ 8f~ [t (1 - ~ ( pfc~ (u)]
k=l k l :f;k2:f;k3
+ 2
[.~O (1 - rl:(u)) (1 - pr,(v)) pr,.,(u, V)]}
28f;)
20::)(1 -
;4 (t
8f2(28fl -1))
(L [t pfc~k:!(u,v)pfc~k2(1k 1 <k2
k=l
:4 (~8f2(28fl
U, 1- V)])
u, v=o
1)) (~(1 - 20fl) [t(1- pfc2(U))pfc2(U)])
- ;4 (t Of2(20fl - 1)) ( L Of~ [t(1- pfc2(U))pfc2(U)])
+
-
k l :f;k2
k=l
92 91
~ (~
L..J 8 (20 -
- . K4
k
k
1)) ('" (2091 892 - 892
L..J
k=l
+
;4{L
(1 -
k l <k2
+
L
(1 -
k 1 <k2
+
L
lei ;l le2;l le 3
lei < le 2
u=O
k 1 k2
k l k 2 ))
k 1 :f;k2
28f~)
[t pfc~k2(U,
u,v=O
v)(l--
pfc~(l - u))pfc~k2(1 -
20f~) [ t pfc~ k2 (U, v) (1 - pfc~ (1 - V))pfc~ k2(1 -
u, 1 - V)]
u, 1 - V)]
U,v=O
(1-20f~)
[t pfc~k2(U'V)(1-pfc~(t))pfc~k2k3(1-U,1-V,t)]}
u, v t=O
106
- :4{ L e~~ [t P~~k2(U,V)(1-fi~(1-V))fi~k2(1-U,1-V)]
k l <k2
e~~ [ t fi~k2(U'V)(1-fi~(1-U))fi~k2(1-U,1-V)]
+ L
k l <k2
"I
+
U,V=O
L
+
uv=O
>t"2 >t"3
"I <"2
ef: [ t
u, t=O
V
L e~~
"I >t "2 >t "3
"I <"2
fi~k2(U'V)(1-fi~(t))fi~k2k3(1-U,1-V,t)]
fi~ k2 (U, V) (1 - fi~ (1 - u)) fi~ k2 (1 -
[t
u, v=O
u, 1 - V)]
[t fi~k2(U,V)(1-Pf~(t))fi~k2ka(1-U,1-v,t)]
+ L e~~ [t fi~k2(u,V)(1-fi~(1-V))fi~k2(1-u,1-V)]}
+ ;4 (L(2ef:ef~ ef~k2)) (L [t P~~k2(U, V)fi~k2(1
+ ;4{ L (1-2ef~)ef~
(1-fi~(U))(1-fi~(V))Pf~k2(U'V)]
+
L ef~
"I >t"2>t"3
"I <"2
u, V
"I >t"2 >t"a
"I <"2
U,V=O
t=O
-
- u, 1 - V)])
k l :j::k2
k l <k2
U,V=O
[t
k l :j::k2
+ L
(1 -
k l :j::k2
+
L
u, v=o
[t(l
- fi~ (u))2 fi~ (U)]
U=O
2e~~ )e~~
(1-
k l :j::k2 :j::ka
2ef~)e~~
;4 (L (2ef~e~~
- ;4 (L (2e~~e~~
+
kl :j::k2
k 1 :j::k2
[t
(1-
U,V=O
fi~(u)) (1- fi~(V))fi~ka(U'V)]}
(£:(12e~l)
[t(1- fi 2(u))fi2(U)])
k=l
U=O
- ef~k2)) ( L ef~ [t(l- fi~(U))fi~(U)])
-
ef~k2))
k 1 :j::k2
u=o
U~2,2) (4.3.47) can be decomposed as
U~2,2)
-
J.l(91,g2)3
+ ~E['lT(3)lO(Xfl) -
J.l(91,92)3]
i=l
+ ~ E['lT(3)Ol(Xrn i'=l
107
J.l(91,92)3]
+ Op(N- 1 )
(4.3.50)
The other two-sample V-statistics of degree (2,2) are
(4.3.51)
(4.3.52)
Let
where
Oi
91
1
,92)
= p(Xr~ # X:n = L~I(U)p~2(1- u)
u=o
and
(4.3.53)
t!'(4)1O(xf1 )
E( <P4(xfl ,X~2 ,Xfl ,X~~) -
=
J1(91 ,92)4)
{~(1
_ 20(91,92))P(Xf!2
.../.. xf!l) + ""
L...J
k
Jk r
,k
L...J
_1
K2
k=l
K
+ ""
0(91,92)(20(91,92) _
L...J k
k
k=l
P(Xf!2 .../.. xf!1
Jkl
r
,kl'
kl:f;k2
1) +
""
L...J
(20(91,92)0(91,92) __ 0(91,92))
kl
k2
k 1 :f;k2
- 2 ""
0(91,92)P(Xf!2 .../.. xf!1 )}
L...J k2
Jk 1 r
'kl
k 1 :f;k2
108
kl~
Xf!2
.../..
Jk2 r
xf!1 )
,k2
Under Ho,
VJ(4)10(Xi)
E( 4>4 (Xi , Xj, XiI,
Xi') - 114)
{KL(l - 2(h)P(Xj k i= Xik) + L
1
= K2
k=1
+
K
L (h(28 k -
1) +
k=1
- 2
L
P(Xjki
i= Xikll
X jk2
i= Xik 2 )
kl'l:k2
L
(28 k1 8 k2 - (JkikJ
k i 'l:k2
8k2P(Xjki
i= Xiki)}
k i 'l:k2
VJ( 4)01 (x12 )
=
In - 11 (gi ,g2)4)
~
- 28(gl
,g2) )P(X~l .../.. X~2) + "
. ./. X~2
X!!i.../..
X~2Jk2 )
K2 {~(1
L
k
Ik r
Jk
L P(X~l
Iki r
Jki'
Ik2 r
E ( 4>4 (xyt, xJ\ Xf,t, X
+
k=1
K
"8(gl,g2)(28(gl,g2) _
L k
k
k=1
- 2
ki'l:k2
1) +
"L
(28(gl,g2)8(gl,g2) _ 8(gl,g2))
k1
k2
k i k2
kl'l:k 2
L 8i~l,g2) P(X!kl
i= XJk
1
)}
k 1 'l:k2
The decomposition of U~~k2), k = 1,2, is then
U~~r)
l1(gl,g2)4
+ ~t[W(4)1O(Xfi) -11(gl,g2)4]
1=1
+ ~ E[W(4)01(X1
2
) -
l1(gl,g2)4]
+ Op(N- 1 )
(4.3.54)
J=1
For the two-sample U-statistic of degree (1,2),
(4.3.55)
Let
109
where
I
9ig1 'f/2j91'92)(i,jji,j') = P(XYk f. Xfk, XYk i= XJ.2k) == Lpt1(u)[l- pt2(U)]2
u=o
2 Xfl 1 ~ Xf!,2 )1
P(XflI k11 ~
r Xf!k
J l'
I k2 r
J k2
9(91,92j91,92)(i
k1k2
' J" , i , J")
I
- L
U,v=o
p~~ k2 (U, v)[1 - p~~ (u)][1 .- p~~ (v )]
and under H o,
J.L (91 ,g2)S
J.L S
=
9k + L 9k1k2 ~2 {f.
k=l
kd;k
2
f.9 k(i,j;i,j') - L 9k1k2(i,j;i,j')}
k=l
k1¢k2
Note that under Ho, J.LI = J.LS·
.,p(S)lo(xf1) = E[</>s(xf1, X~2, Xi:) - J.L(91,92)S]
;2 {f.
=
k=l
P(X%k
i= xrl) +
L P(Xfk1 f. xrl1, X%k2 f. xrl2)
k1 ¢k2
K
L P(X%k
k=l
i= xrk)p(X%1k f. xro -
L P(X%k 1
k1¢k2
i= xrlJ P(X%,2k2 i= Xrk2)
K
+ L P(XY': i= X%k' XY': f. XJ.2k) + L P(Xfk1 i= X%k 1' Xfk2 f. XJ.~2)
k1 ¢k2
k=l
-
~
1I(91 ,92) 1I(91,92)}
L- Uk
- Uk1k2
k=l
.,p(S)01(X~2) =
=
;2 {f.
k=l
E[</>s(Xf1, X~2, X~:) - J.L(91,92)S]
P(XY': f.
x~~) + k1L¢k2 P(XY':1 i= X~~1' XYk2 i- X~~2)
K
- 2 L P(XYk
bl
i= xi~,
XYk f. X%,2k) - 2 L
~~
110
P(Xfk1 i= X~~1' XYk2 i= X%,2k2 )
K
+L
P(Xlk =I-
- k=1't
8i9l ,92) -
k=1
X%f,
Xlk =I- X%?k)
+L
kl'l:k2
P(X[kl =I-
X%f
l ,
Xlk2 =I- X%?k 2)
L 8i~~~2)}
kl
#:k2
For the two-sample V-statistics of degree (2,1)
U(2,l)
(N)]
-1 L..J
~ L..J
"(D(!!1,92)
_
5,2 = [(N)
2
1
1)
j=1
Ni'
D(9~ ,92»)2
1')
i#:i'
,
(4.3.56)
E(U~~21») = E(U~~i2») = Jl(91,92)5
and the decompositions for U~\2)
and U~221)
,
, are
U~~i2) =
+
Jl(9l,92)5 +
~f:[W(5)10(Xfl) -
Jl(91,92)5]
i=1
~ f:[W(5)OI(X~2 -
Jl(91,92)5]
+ Op(N- 1)
(4.3.57)
)=1
U~~21)
=
Jl(91,92)5
+ ~f:[ W(5)lO(Xfl -
Jl(91,92)5]
1=1
+ ~[W(5)01(X~2) -
Jl(91,92)5]
+ Op(N- 1)
(4.3.58)
The other two-sample V-statistics of degree (2,1) are
U~~il) = [(N) (N)]-1 L
2
U~~21)
1
f:(DfJ - D}Jl'92»)2 and
l~i<j~N j'=1
= [(N) (N)]-1
2
1
1
L 'E(DfJ _
<i<j<N j'=1
D}~}'92»)2
(4.3.59)
(4.3.60)
Now
Jl(91,92)6
E(DfJ - D~,I ,92»)2 = E(DfJ - D~~}'92»)2
2{~(891 + 8(91,92»)
+
K2 L..J k
k
k=1
+ 8(91,92»)
kl k2
"(891
L..J
kl
#:k2
kl~
- 2 'tP(X[k =I- XJl:, Xlk =I- XJ?k) - 2
k=1
L
kl
= 2+ 8(91,92»)
+
K2 {~(891
L..J k
k
k=1
P(X[kl =I- XJl:l' X!k2 =I#:k2
kl 2 + 8(91,92»)
klk2
"(891 k
L..J
k1 #:k2
L
L
k=1
kl #:k2
.
- 2 K 8i9l,91;9l,92)(i,j;
i,j') - 2
111
8i~~:lj91'92)(i,j; i,j') }
XJ;~)}
and under H o,
Since the kernels
are not symmetric, we need to work with the symmetric kernels
= ~
{(Df!~'} 2
'+"6
'f' 1
= ~2 {(Df!~'}
,+.'
'f'62
D~~;
,92))2 + (D9.~
'}
}'
- D(fl.},92))2
33
+ (D9.~}'
- D\!1.: ,92))2} and
33
-
D~t!;'92))2}
with
'}
=
E[<P~l (Xfl, X~\ X~n - 1l(91 ,92)6]
1/;(6)lO(xfl)
t
2~2 {2 k=l P(X;l: # xfl:) + 2 kd'k2
L: P(X;l:l # xfl
=
1,
# xfl:2)
X;l:2
K
+ L: P(Xj,2k # xfl:) + L:
k=l
P(Xj,2k1
# xfl:
K
- 2
L: P(X;l: # xfl:, Xff,2k # xfl:) -
# xfl:2)
L:
2
P(XJl:l
# xfl:
~A
bl
K
- 2 L: P(Xjl:
# xfI:, Xjl: # Xj:k) -
2
L:
P(XJ~l
~A
bl
K
+ 4 L: P(Xfl # XJI:,
Xfl
# Xff,2k) + 4 L:
~A
bl
_
X;~2
,
1
k 1 :;:k2
~(28g1k + 8(91,92))
_
k
~
k=l
"(28g1
~
~:;:~
~~
1
,
Xff,2k2
# xfl:2)
# xft, XJl:2 # Xj:k2)
P(Xfl1 # XJl:l' Xfl2 # XJ?k 2)
+ 8(91,Y2))}
~~
and
1/;(6)01(X~n
=
=
;2 {t
k=l
E[<P~l(Xf\X~\~n -1l(91,92)6]
P(Xfl
# x~~k) + L:
k1:;:k2
P(Xfll
# X~~kl'
Xfl2 # Xf,k2)
K
- 2
L: P(Xfl # XjI:, Xfl # X~~k) - 2 L:
bl
~A
112
p(X,rl1 # XJ~l' Xfl2 # Xf,k2)
K
+ 22: P(X;k i= X;/.,
X;k
~1
i= X;,q , X;k2 i= xj;kJ
P(X;kl
~¢~
- 't oigl ,92) k=1
U~2r),
for k
,
U~~kl)
i= X;,2k)+ 2 2:
2: Oi~~:2)}
kl ¢k2
= 1,2, can be decomposed as
=
+
J-l(9l ,g2)6 + ~
t[W
(6)10(Xf
,=1
~ t[W(6)01(X~n -
l
) -
J-l(9l,92)6]
J-l(9l,92)6] + Op(n- 1)
(4.3.61)
j'=1
For the two-sample V-statistics of degree (3,1)
7,1
=
U(8,1)
=
U(8,1)
7,2
U(8,1)
7,8
[(N)
"L..J ;-'(D!!~
_
3 (N)]-1
1
L..J
')
D~9~'92))2
,')'
,
[(N)
"L..J ;-'(D!!~
_
3 (N)]-1
1
L..J
')
D~9~'92))2
,')'
i<j<i' j'=1
i'<i<j j'=1
= [(N) (N)]
3
1
-1
"
;-.
L..J L..J
i<i'<j j'=1
(D!!~') _ D~9~'92))2
,')'
(4.3.62)
and
(4.3.63)
,
(4.3.64)
which becomes under Ho
J-l(gl g2)7 = J-l7
= J-l2
= J-l3 = J-l4
~2 {'t[Ok(1Ok) + L (Oklk2 k=1
kl ¢k2
Since the kernels
,/,. (X!!l
X~2) = (D!!~
i < J' < i' ,
'f'71
" X~l
)' X!!l
,")'
') - D~9~'92))2
,')"
cP72(Xrl,X~\Xrl,X~n = (Dr] - D}fj/92))2, i' < i < j and
,/,. (X!!l
'f'73
" X~l
)' Xf!l
,', X~2)
)' = (D!!~
') - D~g~
")' ,92))2 , i < i' < J'
are not symmetric, we switch to the symmetric ones, i.e.,
,/,.' (Xgl
X gl X gl Xg2)
i ' j , i', j'
'f'71
113
OklOk2)}
91
= ~
3 {(D I) _ D~fl,'92))2
1 )
g1
g1
g1
"" (X i ' X j ' X i" xgj'2)
'fJ72
= ~3 {(D!!1
I)
-
D~9:'92))2 + (DfJ.;,
-
D~9,1'92))2}
,
D~f~,'92))2 + (DfI.1, - D~9.:'92))2
+ (D!J.\
JJ
-
D~9.,1'92))2}
,
+ (D!J.~, - D~9.:'92))2}
,
+ (Df!.l,
-
II))
1 )
)1
II
)1
I)
I)
i < J' < i' ,
i' < i < J' and
g2
g1
1 g1
"" (X
'fJ73
. i ' xgj ' X i" X j' )
= ~
3 {(D!!1
I)
D~fl,'92))2 + (D!!~,
- D(9.:,92))2
II))
1 )
)1
I)
i < i' < J"
with
= E[¢~l(xfl, X~I, Xfl, X~n - JL(91,92)7]
_1_
{~(2 _46(91,92))P(Xf!1
-I.. xf!l) + 2'""
3K2 L..J
k
)k I
Ik
L..J
VJ(7)1O(xf1 )
=
~A
bl
K
_ '""(6(91,92)
L..J k
k=l
+ 2691k
_
66 g1 6(91,92)) _
k k
P(X!!1 -I.. X!!1 X!!1 -I.. X!!1 )
)k1 I
Ik1' )k2 I
Ik2
'"" (6(91,92)
L..J kl k2
k1::f:.k2
+ 2691k1k2 _
66916(91,92))
k1 k2
K
- 4 '""
L..J
(6(91>92)
P(X!!1
k
)k
2
~::f:.~
+
L
k1::f:.k2
1
-I..
I
X!!1
)
Ik
1
+ '""(126 g1
L..J
k )PI(X!!2
)'k I-I.. X!!I)
Ik
k=l
P(XJ?k1 =I- Xfk 1' XJ,2k2 =I- Xfk2) - 2
~ 6f~ P(XJ?k2 =I- Xfk2)}
.,
k1::f:.k2
and
VJ(7)Ol(X~n
=
= E[¢~l (Xft, X~I, Xf,1, x~n -
;2 {f:(1 -
28fl )P(Xf,k =I-
k=l
JL(91,92)7]
X~?k) + k1::f:.k2
~ P(Xf,~1 =I- Xf,k1' Xf,~2 =I- X~~k2)
K
g1
- 2 ~ 8f~ P( Xf, k2 =I- X~?k2) - ~ 8i ,92) (1 - 28~1 )
kl::f:.k2
k=l
+ '""
L..J
(28g1 8(91 ,92) _ 8(91 ,92)) }
k1 k2
kl k2
k1::f:.k2
Note that under Ho, VJ(7)lO(Xi) = VJ(7)01(Xj') and e~~)i = e~~) E(VJ(7)lO(Xi)). The decomposition of U~~kl), for k
U~~kl) =
JL(gI92)7
+~
= 1,2,3 is
+ ~t[W(7)lO(X?I) -
t[W(7)01(Xf,) - JL(9192)7]
j'=l
JL(9192)7]
i=l
+Op(N-
114
1
)
(4.3.65)
The three-sample V-statistics of degree (2,1,1) is
U~2,1,1)
=
[(N)2 (N)1 (N)]
1
-1
l: t
i<j i'=1
t (Dr; _ D~f}'9s»)2
(4.3.66)
j'=1
with
which becomes under Ho
J.L(91929S)S = J.Ls = J.L2 = J.L3 = J.L4 = J.L7
:2
E
{E[(h(1-9k )+
(9k1k2 -9kI 9k2)}
k=1
k1"l-k2
Also,
tP(S)loo(xf1) = E[¢s(xf\ X~I, Xf[, X~n - J.L(91,92,9S)S]
=
...!K2 {~(1
L.-i
k=1
-
29(92,9S»)P(X!!1 ../.. X!!I)
k
]k" ,k
+ "L.-i
k1"1-k2
P(X!!1 ../.. X!!1 X!!1 ../.. X!!1 )
]k1 " ,k1' ]k2 r 'k2
_ 2 "L.-i 9(92,9S)
P(X!!1 ../.. X!!1 ) - ~ 991 (1 - 29(92,9S») + " (2991 9(92,9S) - 991 )}
k2
]k1 r ,k1
L.-i k
k
L.-i
kl k2
kl k2
~~
tP(s)olO(xfn
=
~1
= E[¢s(Xfl , X~I, xf[, X~n -
~2 {E(1
k=1
29fl ) P
~~
J.L(91,92,9S)S]
(X:~k -I xf'k) + kE"l-k P (X:~kl -I Xf,2k1 ' X:~k2 =I Xf,2k2 )
1 2
_ 2 "L.-i 9k91 P(X!!S
../.. X!!2 ) _ ~ 9(92,9S)(1 _ 2991 ) + " (2991 9(92,9S) _ 9(92,9S»)}
]'k2 ., ,'k2
L.-i k
k
L.-i
k1 k2
k1k2
1
~~
tP(S)OOI(~n
=
~1
= E[¢s(Xfl, X~\ Xf[, x~n -
~2 {E(1 -
29fl )P(Xf,1
~1
=I X~~k) +
_ 2 "L.-i 991
P(Xf!2 ../.. X~s )
k1
,'k2 ., ]'k2
~~
Vnder H o, tPlS)100(Xi)
_
~~
J.L(91,92,9S)S]
l:
~"I-~
P(Xf,11 -I
X~~kl' Xf,12 =I X~~kJ
~
9(92,9S)(1_ 2991
) + "L.-i
L.-i k
k
~1
~~
= tPlS)OI0(Xi') = tPlS)OOl(Xj') and
d~b = e~~b = e~~l
= E(tPls)100(Xi))
115
(2991
9(92,9S) _ 9(92,9S))}
k1 k2
k1k2
The decomposition for U~2,1,1) is
U~2,1,1)
=
J.l(glg2ga)S
+ ~I:[W (s)lOo(Xf
1
J.l(gI92ga)S]
) -
,=1
+
~ I: [W(S)OlO(Xrn -
J.l(glg2ga)S]
i'=1
+
+ ~ I: [W(S)OO1(X~n -
J.l(glg2ga)S]
.i'=1
Op( N- 1 )
(4.3.67)
The other three-sample V-statistics of degree (2,1,1) are
(4.3.68)
(4.3.69)
with
•
which becomes under H o
J.l9 = J.l2 = J.l3 = J.l4 = J.l7=
J.l(glg2ga)9
- :'
J.ls
{~[9.(1 - 9.) + .?1" (9"" - 9" 9,,) }
Since
A-,_
0/91
(X!!l
X!!l
"
,"
X~2 x~a)
),)'
= (D~!?I,g2)
')
(X!!l
X!?l
X~2 x~a) =
'f'92
"
,', ) ' ) '
A-,
D~g~ ,ga»)2 i...,J. i' J'...,J. i' and
,')"
I
,
I
(D~!?I,g2) _ D~g~,ga»)2 ;...,J. j: ; -J. ;,
')
,')"
• I
,. I •
are not symmetric, we consider the symmetric kernels
(X!!l
X!?l
X~2 x~a) =
"
,', ) , ) '
![(D~!?I,g2) _ D~g~,ga))2
+ (D~g~,g2)
_
. ,')
D~I!I,ga»)2]
i to i', j to i' and
A-,' (X!!l X!?l X~2 x~a) =
'f'92
"
,', ) ' ) '
ito j, ito i' with
![(D~!?I,g2) _ D~g~,ga»)2
+ (D~g~,g2)
_
. ,')
D~!?I,ga»)2]
A-,'
'f'91
2
2
')
')
,')'
,')'
116
')'
')'
,
,
2~2
=
{t
P( XJlc
k=l
# xf~) +
L
P(
kl=f;k2
XJlcl # Xf~l'
# Xf~2)
XJlc 2
K
2 ~ 8(gl,ga) P(X!!2 .../.. X!!l) - 2 ~ 8(gl,ga) P(X!!2 .../.. X!!l )
-
~
k
~
Jk -;-.k
k=l
K
L
+
k2
Jk l -;-
.k l
k l =f;k2
P(XJ~k
# xfk) +
k=l
L
P(XJ/ak l
# Xf~l' XJ~2 # xf~J
k l =f;k2
K
_ 2~
~
8(gl ,g2) p(x!!a .../.. X!!l) k
J'k -;-.k
k=l
2 ~
~
kl=f;k2
K
+ L(48igl ,g2)8igl ,ga) -
8(gl,g2) p(x!!a .../.. X!!l )
kl
J ' k 2 -;- ak 2
8igl ,ga) - 8igl ,g2))
k=l
+ ~
(48(gl,g2)8(gl,ga) _ 28(gl,ga) _ 28(gl,g2))}
kl
k2
k 1k2
k 1k 2
~
k l =f;k2
= E[¢~l (Xft, Xf,l , X~2 ,X~n - J.L(gl ,g2,ga)9]
= K2 {~(1 28(gl,ga))p(X!!1
.../.. X~2) + ~ P(X!!1 .../.. X~2
X!!l.../.. X~2 )
k
.k -;- Jk
ak l -;- Jk l '
.k2 ., Jk2
1P(9)010( ~2)
_1
_
~
~
k=l
k l =f;k2
K
_ 2 ~ 8(gl,ga) P(X!!1 .../.. X~2 ) _ ~(8(gl,g2) _ 28(gl,g2)8(gl,ga))
~
k2
.kl
-;-
Jkl
k l =f;k2
~
_
~
k
k
k
k=l
(8(gltg2) _ 28(gl,g2)8(gl,ga))}
klk2
kl
k2
~
k l =f;k2
1P(9)001(~n
=
E[¢~l (Xfl, Xf,l, X~2, ~n -
;2 {t(1 -
gl
28i ,g2))P(Xf,k
=
~1
J.L(gl,g2,ga)9]
# X~~k) +
L
P(Xf,kl
~~
=I X~:kl'
K
.../.. x~a ) _ ~(8(gl,ga) _ 28(gl,g2)8(gl,ga))
a'k2 -;- J'k2
~
k
k
k
k=l
_ 2 ~ 8(gl,g2) P(X!!l
~
kl
k l =f;k2
_
~
~
(8(gl,ga) _ 28(gl,g2)8(gl,ga))}
kl k2
kl
k2
kl =f;k2
Under Ho, 1P(9)100(Xi)
= 1P(9)010(Xj) = 1P(9)OOl(Xj/) and
ef~~ = e~i~ = e~~~ = E( 1P(9)100(Xi))
The decomposition for U~~kl,l) for k = 1,2 is
U(2,l,l)
I,k
J.L(glg2ga)9
+ ~I:[W(9)100(Xfl) i=l
117
J.L(glg2ga)9]
Xf,k2
=I X~:k2)
+ ~ f:['11(9)010(X~2) +
J.l(9Ig2ga)9]
j=l
1
+~
t
['11 (9)001 (X;n - J.l(9Ig2ga)9]
j'=l
Op(N- 1 )
(4.3.70)
For the three-sample V-statistic of degree (1,1,1)
(4.3.71)
_
E(D~!?I ,92)
13
D~!?,I ,ga))2
_
13
.2{~(0(91'92) + o(gl,ga))
+ "L.J
K2 L.J k
k
k1:f;k2
k=l
(0(91,92)
k1k2
+ o(gl,ga))
k1k2
K
2
L
P(Xfk
f. XJ~,
Xfk
=f. XJ,ak)
k=l
2
L
kl:f;k2
_1_
K2
P(Xfkl
f. XJ~I'
Xfk2
=f. XJ?k 2)}
{~(0(91'92)
+ o(gl,ga))
+ "L.J
L.J k
k
k=l
kl
Oig1 ,92jgl,ga)(i,j; i,j') - 2
2t
:f;k2
(0(91,92)
k1k2
+ 0(gl,9a))
k1k2
L Oi~~:2jgl,ga)(i,j; i,j')}
kl:f;k2
k=l
where
oig1 ,92;gl ,9a) (i, j; i, j')
= P(Xfk =f. XJk,
1
=f. XJ~k) = L zlf/ (u )pf2 (1 - u )pfa (1 - u)
Xfk
u=o
P(Xfk1 f. XJ~I' Xfk 2 f. XJ,2k2 )
1
L
u,v=o
pf~k2(u,v)pf~(1- u)pf~(l- v)
then under Ho
J.l(91g29a)1O -
J.l1O
= J.ll = J.ls = J.l6
~2 { t Ok + k1L:f;k2 Ok1k2 k=l
tOk(i,j;i,j')k=l
Also,
118
L
k1:f;k2
Ok1k2(i,j;i,j')}
{KL
1
K2
=
k=1
P(XJ~
# xffc) +
K
+L
k=1
P(X;,3k
# xfk) +
L
L
k1=l=k2
k1=1=k2
P(XJ~1
# xfic1, XJ~2 # xfic2)
# xfic1, XJ'k2 # xfleJ
P(X;,3k1
K
- 2
L
k=1
P(XJ~
# xffc)P(XJ/3k # xfic) -
P( XJ~
# Xf~ , XJ/3k # xfic) + 2
2
K
+2L
k=1
~((/91'92)
+ 0(91,93))
_
k
k
_
L
P(XJ~1
k1=1=k2
~
k=1
""" (0(91,92)
~
~=I=~
~~
L
k1=1=k2
P( XJ~1
# xficJP(XJik2
=1=
xfic2 )
# Xf~1 ' XJ/3k2 =1= Xf~2 )
+ 0(91,93))}
~~
'I/J(1O)010(X~2) - E[cPlO(Xf , x~\ X~n - P(91,92,93)1O]
1
~ {~(1
= K2
~
-
1 -'- X!?2)
(lg1,93))P(Xfl
k
ak,.
Jk
+ """
~
~A
bl
- 2 """
P(Xflak1 -'X!?2 )(/91,93)
~
1 ,. Jk 1 k2
K
1 -'- X!!2 Xfl 1 -'- X!!3 )
+ 2 """ P(Xflak,.
Jk , ak,. J' k
~
k=1
~=I=~
+
1 # X!?2 Xfl 1 =1= X!?2 )
P(Xflak1
Jk1' ak2
Jk2
~
2 """ P(Xfl1 -'- X!!2 Xfl 1 -'- X!!3 ) _
0(91,92) _ """ (/91,92)}
~
ak1 ,. Jk1 ' ak2" J' k2
~ k
~
k1k2
k1=1=k2
k=1
k1=1=k2
and
= E[cPlO(Xf1, X~2, x~n - P(91,g2,ga)10]
1 -'- x!?a ) + """ P(Xfl1 -'- x!?a Xfl 1 -'- x!?a )
{~(1 (lg1,92))P(Xfl
k
ak ,. J'k
ak1 ,. J' k1' ak2" J'k2
'I/J(lO)OO1(X~n
=
_1_
K2
~
_
~
k=1
k1=1=k2
1 -'- x!?a )0(91,92)
- 2 """
P(Xflak1"
~
J' k1 k2
k1=1=k2
+
K
1 -'- X g2
+ 2 """
P(Xflak,.
~
Jk'
k=1
2 """ P(X g1 -'- X!!2 Xfl 1 -'- x!!a ) _
~
ak1 ,. Jk1' ak2" J' k2
~=I=~
~ e(g1
,ga) _
k
~
k=1
1 -'- x!!a )
Xflak,.
J'k
""" e(g1,ga)}
~
k1k2
~=I=~
The decomposition for U~~l,l) is
U~~l,l)
=
P(91g2ga)1O
+ ~E[\l1 (1O)lOO(Xf1) -
J.L(91g2ga)10]
1=1
+
~ E[\l1(10)010(X~2) -
P(91g2ga)10]
J=1
+
Op( N- 1 )
+
~
t [\l1(10)001(X~n
- P(91g2ga)10]
J'=1
(4.3.72)
119
For the four-sample V-statistic of degree (1,1,1,1)
N
N
N
N
= N- L L L L
U~;,l,l,l)
4
(D~I'92)
-
D~fF94»)2
(4.3.73)
,
i=l j=l i'=l j'=l
and under H o
Now
.,p(U)10oo(xfl)
=E[</>u(xfl, Xf, Xf,3,
= _1
K2 {f-(1
L.J _
X~:) -11(91,92,98,94)u]
2fi 91 ,92»)P(X/!2 -J. xf!l)
k
Jk r
,k
k=l
+ '"
L.J
P(Xf!2 -J. xf!1
-J. xf!1 )
. Jk l r ,k1 ' X/!2
Jk2 r
'k2
kd:k2
K
_ 2 '" 8(98,94) P(X/!2
L.J
-J. X/!I )
Jkl r
,kl
k2
+ '"
8(91,92)(28(98,94) L.J k
k
~~~
+ '"
L.J
1)
k=l
(28(91,92)8(98,94) _ 8(91,92»)}
k2
kl
k 1 k2
kl~k2
.,p(U)0100(X~2)
=
E[</>u(Xft,~\Xf,8 ,X~n -11(91,92,98,l14)u]
= -.!.K2 {~(1
L.J
-
28(9a,94»)p(X!!1 -J.
k
,k r
X~2)
+ '"
Jk
L.J
kl ~k2
k=l
-J. X~2 )
,kl r
Jkl
_ 2 '" 8(98,94) P(X!!I
L.J
k2
kl~k2
+ '"
L.J
K
+'"
8(91,92) (28(98,94) L.J k
k
k=l
(28(91,92)8(98,94) _ 8(91,92»)}
kl
P(X!!I -J.
..' 'kl r
k2
kl k2
kl~k2
120
X~2Jk1 , X!!l,k2 -J.r X~2Jk2 )
1)
d 11 )
',,1000
=
t(11)
',,0100
t(11)
t(11)
(Xi )]
= ',,0010
= ',,0001
= E[ol.2
'1-'(11)1000
The decomposition for U~;,1,l,1) is then
U~;,I,I,I) =
J1(91929a94)11
+
+ ~I:['11(11)1000(Xfl) -
J1(91929a94)11]
i=1
~ I:['11(11)0100(X~2) -
J1(91929a94)11]
j=1
+
~ I : ['11 (11)OOlO(Xrn -
J1(91929S94)11]
i'=1
+ ~I: ['11(11)OOOl(X~t) -
J1(91929S94)11]
+ Op(N- 1 )
(4.3.74)
j'=1
4.4
Combining the U -statistics
We know that
wss =
2
~
N(N - 1) LJ
9=1
[(N)3 (U(8)1,1 + U(8)1,2 + U(8»)
+ (N)(U(4)
+ U(4)
+ U(4»)]
I,S
4
2,1
2,2
2,S
121
(N - 2) ~[U(3)
3
L....J
1,1
-
+ U(3)
+ U(3)]
1,2
1,3
g=1
+
(N - 2)(N - 3) ~[U(4) + U(4) +
12
L....J
2,1
g=1
2,2
U(4)j.
2,3
with expected value
G
E(WSS) = E(N - 2)J.l9l
g=1
=
G
+E
(N - 2)(N - 3)
4
J.lg2
g=1
G
{
E(N - 2) J.lgl
g=1
(N - 3)
}
+ -4-J.l
92
Under H o, there is homogeneity among groups, i.e., for any 9,
.
8f = 8k
and
8t k2 = 8k1k2 , thus
=
Eo(WSS)
G(N - 2) {J.ll
+ (N ~ 3) J.l2}
where
and
(4.4.2)
Note that
C-l
8k = P(Xik
# X jk ) =
E Pk(c)[l -
Pk(C)]
(4.4.3)
Pk1k2(C3, C4)]
(4.4.4)
c=o
8k1k2
= P(Xikl # X jk1 ; X ik2 #- X jk2 )
<,~/,""(C"C2) [~ ~
ca'jlcl
c4'j1c2
Decomposing W SS,
WSS
(N - 2)
3
-
G
E 3{J.l9l
g=1
+
+
(N - 2)(N - 3)
12
3
+N
N
+ Op(N- 1)}
E[W(I)l(Xf) - J.l9l]
i=1
G
{
: ; 3 J.lg2
Op(N- 1 ) }
4
N
9
+ N £;[W(2)1(Xi ) -
J.l92]
(4.4.5)
122
and under Ho,
WSS
G(N - 2) (Ji1
+
+ (N; 3) Ji2)
3 N
(N - 2) NG~:::rW(l)l(Xi) - Ji1]
+ Op(1)
i=l
+
(N - 2)(N - 3) N
N
GI:[W(2)1(X;) - Ji2]
+ Op(N)
(4.4.6)
}-1
+ Op(N )
(4.4.7)
i=l
and the associated mean square expression is
WSS
WMS
=G(~)
=
2WSS
= GN(N -1)
(N - 2)(N - 3) {
N(N _ 1)2
Ji2
4
N
+ N t;[W(2)1(X i ) -
Ji2]
with
(N - 2)(N - 3)
2N(N - 1) Ji2
Ji2 + O(N- 1)
Eo(WMS)
+
O(N- 1)
2
Let U W88
(3) U(3)
(3)
(4) U(4) U(4»), d
b h'
.
= (U 1,1'
1,2' U 1,3' U 2,1'
2,2'
2,3 an ~W88 e t e variance- covari-
ance matrix of U W88' In order to get the variance of W SS, we need to know the the el·
.. d COVU
((3),k,U (3»)
d CovU
((4),k,U (4»)
ementsof~w88' SmceXI,X
,l an
,l
2 , ... ,XN arel.l..,
1
(for 1
1
2
1
:5 k < 1 :5 3) are nothing but Var(U~~~) and Var(U~~~), respectively. To
illustrate this result, compute Cov(U~~~,U~~~). From (4.3.29)
E[cP2,l(xf, X~, X~, X~) - Ji92]
tP(2,l)1(xf)
~2 {E(1 - 28f)P(X~k i- xfk) +
+
+
I: P(X~kl
k1 ::f;k2
2 I:
k1 ::f;k2
i- Xfk
1'
X~k2
E
8f(28f - 1)
i- Xfk 2) + I: (28f18f2 - 8f k2)
kl ::f;k2
8~2P(X~kl i- xfkJ}
and
E[cP2,2(xf, X~, X~, X~) - Ji92]
~2 {E(1 - 2/p')p(X:. i' 3!,.) + E/P.(2/P. - 1)
123
1
-+-
L
P (Xlk1
#- Xfk
1
,Xlk2
#- Xfk 2 ) -+- L (28f 8f
1
~¢~
L
-+- 2
2 -
8t k 2 )
~¢~
8f 2 P (Xlkl
#- X fk1 )}
k 1 ¢k2
Now, let
~(I)I(Xf)
~(2)I(Xf)
= ~(I,I)I(Xf) = ~(1,2)I(Xf) = 'tP(I,3)I(Xf),
= ~(2,1)I(xf) = ~(2,2)I(Xf) = ~(2,3)I(xf),
2
1
d ) = E(~t2)I(Xn) and d ,2}
1
d ) = E(~tl)I(Xn),
= E(~(I)I(Xn~(2)I(Xn)
Therefore, the variance-covariance matrix of
(8)
(8)
(8)
(4) U(4)
(4)), .
U WBB = ( U 1,1, U 1,2, U 1,8, U 2,1, 2,2' U 2,8 IS
where J 3 is a 3 X 3 matrix of l's. So,
Var(WSS)
= ~
{(N -9 2)2 Var(U(3)1,1 -+- U(3)1,2 -+- U(3))
L.J
1,3
g=1
-+- (N - 2)2(N - 3)2 Var(U(4) -+- U(4) -+- U(3))
144
2,1
2,2
2,3
-+- 2 (N - 2)2(N - 3) Cov[U(3) -+- U(3) -+- U(3) U(4) -+- U(4) -+- U(3)l}
36
1,1
1,2
1,3'
2,1
t,{(N;2)2 (3X ~~ll)+2(3X ~dl)))
-+-+-
(N-2)2(N-3? (3
144
16 t (2) 2(3
x N"'1 -+-
X
16 t (2)))
N"'1
2 (N - 2?(N - 3) (9 x 12 t(I,2))}
36
N"'1
(N - 2)2
N
t {geP)
-+- (N - 3)2d2) -+- 6(N - 3)ep,2)}
g=1
Var(WMS)
124
2,2
2,3
For AWSS,
AWSS
and under Ho
G(G - 1) [(N - 1)2
2
2
J.L4
Eo(AWSS)
+
(N _ 1)
G(G - 1)(N - 1) ((N - 1)
2
2
J.L4
where J.L4
= J.L2 is given by (4.4.2)
and J.L5
J.L5
+ J.L5
]
)
= J.Ll is given by (4.4.1).
AW S S can be decomposed as
AWSS
=
E
(N - 1)2 (
[
2
2
J.L(91,92)4
1:591 <92:5G
N
91
?=(W(4)10(Xi ) - J.L(91,92)4)
,=1
+
~ ~(W(')01(Xfl -1'1.. ..,).) + Op(N-l l)
+
(N -1) (
2
2J.L(91,92)5
+
~t(W(5)01(X~I)
j=1
+
(91)
+ N1 ~(
~ W(5)10 Xi -
J.L(91,92)5)
J.L(91,92)5
)
+ ~t(W(5)10(Xfl) -
J.L(91,92)5)
i=1
~ ~(WI5)01(Xfl -1'19".,)5) + Op(N-l)) ]
The associated mean-square expression is
AWMS =
+N
AWSS _ 2AWSS
(;)N2
N2G(G - 1)
125
(4.4.8)
(N - 1)2
N2G(G _ 1)
-
L
1$gl<g2$G
[
/1(91,92)4
2
+N
N
gl
?=(W(4)10(Xi )
,=1
~ ~(W(')OI(Xf) -1'(....,).)] + Op(N-
+
Eo(AWMS)
Var(AWSS)
(4.4.9)
)
N2G(G _1)E o(AWSS)
(N - 1) ((N -- 1)
N2
-2-/14
-
(N - 1)2
2N2 /14
=
Jl4
2
= Eo(WMS)
(N - 1)2
-
/1(91,92)4)
2
=
-
Note that Eo(AWMS)
I
-
since under H o, /14
{(N -
L..J
l$gl
)
1
)
+ O(N- 1 )
"
4
+ O(N-
+ /15
4
<92$G
= /12
lY Var (U(2,2)
+ U(2,2))
4,1
4,2
+ Var(U(1,2)
5,1 + U(2,1))
5,2
+ (N -
1)COV[U(2,2)
4,1
(N - 1)2
4
+ U(2,2)
4,2, U(1,2)
5,1 + U(2,l)]}
5,2
{(N -
"
L..J
1$gl<g2$G
4
lY (4 X
(.!c(4)
N"'10
•
.!C(4)))
+ N"'Ol
+ ~d5,1) + .!d5,1) + .!c(5,2) + ~c(5,2)
+
N"'10
N"'Ol
N"'10
N"'Ol
2
(~t(5,1),(5,2) + ~c(5,1),(5,2))
X
N"'l0
N"'Ol
+
(N - 1) (2
+
2x
X
(~d~),(5,1) + ~e~~)'(5'1))
(~ e~~),(5,2) + ~ e~~),(5'2))) }
4
Var(AWMS)
N4G2(G _ 1)2 Var(AWSS)
(N-1)4
(4N elO + N4eOl
(4)
2N4G(G _ 1)
For TSS,
TSS
=
1
NG(NG -1)
{~
~
[(N)3
(3)
U1,1
+ U1(3),2 + U1(3))
,3
126
(4))
+ O(N
-2
)
+
3
N (N - 1)
L
J.l(91,g2,ga)9
1:$91,92,9a:$G
n_92_n
L
+ N4
J.l(91,92,ga,g4)11
1:$91,92,9a,94:$G
n-92-n_~
and under Ho,
Eo(TSS) =
1
NG - 1
+ (N -
{(N - 1)(N - 2)
2
J.ll
127
1)(N - 2)(N - 3)
8
J.l2
+ (G - 1) N(N _ 1)2j.l3 + (G -1) N(N _ 1)2j.l4
8
4
(G - 1) N(N
)
(G - 1)(G - 2) N 2
+
+
2
- 1
j.ls
+
(G - 1)(G - 2) N 2(N
2
-
2
1)
j.l9
+
N (N -1) }
(G - 1) (G - 2)
4
j.ls
T S S can be decomposed as
TSS
128
(G - 1)(G - 2) (G - 3) N 3
+-
+ (G-1)N(N-1)j.l6+(G-1)
2
j.l1O
8
N(N - 1)(N - 2)
2
j.ln
j.l7
+
~ it, (w (10)001 (X~I) ~ 1'(" ,,,,,,,,)10) + Op( N-
+
N (N - 1)
3
1
)]
,<"E'<G [1'19""'~')9 + ~ ~(WI9)lOO(Xr) -1'(""",..)9)
91#92#9a
+ ~
+
~
+
N
f: ('1T (9)010(X~2) -
p(gl ,g2 ,ga)9)
j=1
it,
(WI9)OO1(X11) - 1'(" "",,,,)9) + Op(N-
L
4
1
)]
+ ~ f:('1T(11)1000(Xr
[P(gl,g2,ga,g4)1l
l
)
-P(gl,g2,ga,g4)1l)
,=1
1~91 ,92,93,94~G
91#92#9a#94
N
+ ~ L('1T(1l)0100(Xf,2) -P(gl,gMa,g4)11)
i'=1
+
~f:('1T(11)OOlO(X~3) -
P(gl ,g2,ga,g4)11)
j=1
+
~ ;t,(WIll)OOOl(x~n -I'(""","~.)ll) + Op(N-
+
N'( N - 1)
'sf<G
[1'1" ",,)6 +
1
)]
~ ~(W (6)1O(Xf) -
1'1.. ,,,,)6)
91 #92
+
+
~;t, (W (6)01 (X'ft) N 2 (N - 1)(N - 2)
2
I'lg, ,,,,)6) + 0 p(N-l )]
1~9~~G
[
3
P(gl,g2)7
N
gl
+ N ~('1T(7)1O(Xi
) -P(gl,g2)7)
91#92
+
+
~ it,(WI1)01(X'ft) - 1'1"
N 3 (N - 1)
2
1~91~93~G
",,)1)
+ Op( N- 1 )]
[
2
P(gl,g2,g3)S
gl
N
+ N ~('1T(S)100(Xi
91 #92#93
+
~ f:('1T(S)010(Xrn -P(gl,g2,ga)S)
i'=1
+
~it, (W(8)001 (X~n - 1'1" ""~.)8) + Op( N-
129
1
)] }
) -P(gl,g2,g3)S)
Define the Total Mean Square (T M S) as
TSS
TMS
2TSS
NG(NG -1)
= (~G) =
2
NG(NG _1)E o(TSS)
Eo(TMS) -
(N -1)(N - 2)(N - 3)
(G -1)N(N - 1)2
4NG(NG - 1)2
/12 + 4NG(NG - 1)2 /13
(G -1)N(N _1)2
(G - 1)(G - 2)N2(N -1)
2NG(NG - 1)2 /14 +
NG(NG _ 1)2
/19
3
(G - I)(G - 2)(G - 3)N
(G - I)N(N - I)(N - 2)
4NG(NG -1)2
/111 +
NG(NG _ 1)2
/17
(G - 1)(G - 2)N2(N - 1)
2NG(NG - 1)2
/18
+
+
+
But, under Ho,
/12 = /13
=
= /14 =
;2 {t
/17
= /18 =
fh(l - 8k )
= 1-£11
/19
+L
k=1
(8 k1k2
-
8k 1 8k 2 )}
k 1 -::f;k2
Therefore,
Eo(TMS) _
-
+
+
[(N-l)(N-2)(N-3)
1
NG(NG - 1)2/12
4
(G - l)N(N - 1)2
4
+
(G - I)N(N - 1)2
2
3
(G _ l)(G _ 2)N 2(N _ 1) + (G - I)(G - 2)(G - 3)N
4
+
-
+ !:G -
(G _ l)N(N -1)(N _ 2)
2
I)(G -22)N (N - 1)]
:~3 [1 + (G -1)(7 + (G - 2}(G + 3))] + O(N- 1 )
Now
BSS _
N(~ -
1)
t(l)~ _ D.)2
g=1
-
where D 1 is the G
X
N(N -1)D'D
2
1
Jl
1 vector
D1
= (D~ - D. ...
130
D~
- D.)'
Note that
E(D9) = _1_ "
E(DfI.) = ~ ~ (j9 = e9
. K (N)
L.t
IJ
K L.t k
.
2 1$.i<j$N
k=1
and
E(D.)
Therefore,
VI
= E(D~ -
D.) =
e~ -
t
(N -1)
e~ + 2N
G(NG - 1) 9=1
G(NG - 1)
L ij~91,92)
1<
<G
_91 < 92_
Let U~~ = D~ and U~~,l) = D~91,92). Then,
cPI2(Xf,X~)
1 K
= Dfj = K
LI(XYk
i- Xfk)
k=1
cPI3(Xf\X~2) = D!f '92) = ~ t
1
I(XYf
i- Xfn
k=1
\II(13)10(xf1)
= E[cPI3(Xf1,X~2) I Xf1] = ~ tp(Xf~ =I Xfk)
k=1
\II(13)OI(~2) = E[cPI3(Xf\X~2) I X~2] = ~ t
P(XYf
k=1
i- x~i)
Under H o,
\II (13)1O(Xi )
= \II (13)01 (Xj )
1 K
K L P(Xjk
k=1
131
i- Xik)
(4.4.10)
since the sequences Xl, X 2 , ••• , XN are i.i.d. and P(Xjk =/:- Xik)
= P(Xik
=/:- Xjk).
and under Ho
Example: Two categories
P(Xjk =/:- Xik)
P(Xjk =/:- O)I(xik
-
Pk(1)I(xik
= 0) + P(Xjk
= 0) + (1 -
=/:- 1)I(xik
Pk(1))I(xik
= 1)
= 1)
P(Xjk1 =/:- Xik 1 ; Xjk2 =/:- XikJ
= P(Xjk1 =/:- 0, Xik2 =/:- 0)I(Xik1 = 0, Xik2 = 0)
+ P(Xjk1
+ P(Xjk1
+ P(Xjk1
=/:- 0,Xik2 =/:- 1)I(Xik1
=/:- 1, Xik 2 =/:- O)I(Xikl
=/:- 1,Xik2 =/:- 1)I(Xik1
= 0,Xik2 = 1)
= 1, Xik2 = 0)
= 1,Xik2 = 1)
= Pk1 k2(1,1)I(Xik 1 =0,Xik2 =0)+Pklk2(1,0)I(Xikl =0,Xik2 =1)
+ Pklk2 (0, 1)I(Xik1 = 1, Xik2 = 0)
1
=
K2
+ Pklk2 (0, O)I(Xikl
= 1, Xik2 = 1)
K
2: [(pk(1)2I(Xik = 0) + (1 -
Pk(1)? I(Xik
= 1)]
k=l
+
(e.-)2 -
2- ~
Ke. L..J [Pk(1)I(Xik
= 0) + (1 -
Pk(1))I(Xik
= 1)]
k=l
1
+ K2
2: [Pklk2(1,1)I(Xikl = 0,Xik2 := 0)
k1 ::f;k2
+ Pklk2(1,0)I(Xikl = 0,Xik2 = 1) + Pklk2(0,1)I(Xikl = 1,Xik2 = 0)
+
Pklk2 (0, 0)I(Xik1
= 1, X ik2 = 1)]
and
132
1
K
K2
=
L
[(Pk(1) )2pk (0)
k=1
-
2
K
e.LK
+ (1 -
[Pk(1)Pk(O)
k=1
+
1
L
K2
+ (1 -
Pk(1) )2pk (1)]
+ (e.)
2
Pk(1))Pk(1)]
[Pk 1k2 (1, 1)Pk1k2 (0,0)
+ Pk1k2 (1, O)Pk k2 (0,1)
l
kl'#2
+
Pk1k2(0,1 )Pk1k2 (1,0)
1
+ Pk1k2(0, O)Pkl k2 (1,1)]
K
K2
L
[(pk(1))2 pk (0)
k=1
-
2
K
e.LK
[Pk(1)Pk(0)
+ (1- pk(1))2pk (1)] + (e.)
+ (1 -
2
Pk(1))Pk(1)]
k=1
+
2
K2
L
[Pk 1k2(1,1)Pklk2(0,0)
+ pk1k2(1,0)pk 1k2(0,1)]
k 1'#2
•
Since D~ is a U-statistic of degree 2,
Var(Dn =
where ~P2)
~~P2) + O(N- 2 )
= E[¢(12)1(Xn], and since D~91,92) is a two-sample U-statistic of degree
(1,1),
Var(D(gl,92)) =
.
where ~~~3)
=E[¢(13)10(Xf
1
)]
.!-t(13)
Nl"l0
and ~~~3)
+ .!-t(13)
+ O(N- 2)
Nl,,01
=E[¢(13)01(X1
2
(4.4.11)
)].
We are assuming that under H o there is homogeneity across or within groups,
.
f)1k -- f)2k -- ... -- f)G
-- f)9k -- f) k· There£ore, un der no,
U
l.e.,
k -- f) k an d f)(91,92)
k
(4.4.12)
and
/131 (D~91'92) where /;3
e.)
~ N(O, 1)
= 1~~~3) + 1~~~3) = ~~P2) by (4.4.11)
(4.4.13)
and (4.4.10).
If D. is a linear combination of normal variables, then D. also follows a normal
distribution.
D.
133
G~Z;~)1)
+ G:~
(
t, [9" + ~
_L
1
)
l~gl <g2~G
t,(WI12)1(Xf) - 8?)]
[8~91'92) + ~ I:(W(13)1O(X;1) - 8~91'92))
,,=1
+ ~ t,(WI13)Ol(Xr') - 81M')] + Op(N- 1 )
Under H o,
= E (D)
TIl 0
.
(1~
-
+ Op(N-1)
= (N -
1)8. + N(G -- 1)8.
( N G - 1)
Varo( D.)
(N-1)2
G
_
G2(NG -1)2 ~ Varo(D~)
4N 2
134
=8
.
c(12) _ c(13) _ c(31) _ c(13,1;13,2) _ t(12,13) d
<"1 - <"10 - <"01 - <"10
- <"1
an
0"2
1
_
(N - 1)2 4 e(12)
G(NG _ 1)2 N 1
+
+ 2N2(G -
1)
G(NG _ 1)2
[(~e(12»)
N 1
+ 2(G _ 2)~e(12)]
N 1
2N(N - 1) G(G _ 1)~e(12)
G2(NG - 1)2
N 1
-
4e(12)
2
[(N - 1)2 + N (G - 1)2 + N(N - l)(G -1)] NG(N~ _ 1)2
-
[(N _1)2
4e(12)
+ N(G -l)(NG -1)] NG(N~ -1)2
(4.4.14)
Hence, under H o,
0"1 1 (Do -
(0)
~ N(O, 1)
Now
(4.4.15)
and
Varo( D~ - Do)
r[
-
Varo(D~)
-
Ne1
4 (12)
+ Var(Do) 2
+ 0"1
2Covo(D~,Do)
( (N - 1) ~ - 2Covo D., G(NG _ 1) ~ Do
9
9)
(D~,
D~91'92»)
2N
L
G(NG -1) 1~91<92~G
4 (12)
2
(N - 1)
-9
N e1 + 0"1 - 2 G(NG _ 1) Varo(D o)
- 2Covo
_
4N
G(NG-1)
LG
1/2=1
COVO(D91
D-(91 ,92 ) )
0'
0
91'j1!92
[
where e~12,13)
1 _ 2 (N - 1) ] ~t(12)
G(NG - 1) N<"l
=E{1P(12)1 (Xfl )1P(13)1O(Xf
1
2 _ 4N(G - 1) 2 t(12,13)
G(NG - 1) N<"l
(4.4.16)
+ 0"1
)}
under H o.
135
=
d12),
since 1P(12)1 (Xi)
=
1P(13)10(Xi )
Then,
72
1
[
1- 2
-
(N-1+N(G-1))] 4 t (12)
G(NG _ 1)
N"'1
4e(12)
+ [(N - 1)2 + N(G -1)(NG -1)] NG(N~ _ 1)2
(12)
= [(G - 2)(NG - 1)2 + (N _1)2 + N(G - 1)(NG - 1)] NG(~~ _ 1)2
4 (12)
=
{(N - 1)2
+ (NG - 1)[N(G - 1) + (NG - 1)(G - 2)]} NG(~~ _ 1)2
(4.4.17)
So,
-1
71
- 9
(D.
- D.)
d
~
N(O, 1)
Therefore, as in Chapter 3,
1)D~Dl
BBB = N(N -
2
'" N(N - 1)
2
where Ay'S are the characteristic roots of Var(Dt}
elements of ~1 are
Cov( D~l _
=
=
tAg (Xn
g=1
= ~1'
9
Note that the diagonal
7f and the off-diagonal elements are
D., D~2 - D.)
Cov( D~l , D~2)
-
-2Cov(D~,D.)
Cov( D~l , D.)
-
Cov( D~2, D.)
+ Var( D.)
+ Var(D.)
_ -2 (N - 1) .!t(12) _ 4N(G - 1) ~t(12,13)
G(NG - 1) N"'1
G(NG - 1) N"'1
2
+ 0"1
which becomes under H o
Covo( D~l _
D., D~2 - D.)
4ei 12 )
= [-2(N - 1 + N(G - 1))] NG(NG -1)
4e(12)
+ [(N - 1)2 + N(G - 1)(NG - 1)] NG(N~ _ 1)2
-
(N - 1)2 + N(G - 1)(NG - 1)]
4eP2)
(NG - 1)
NG(NG - 1)
(N - 1)2 - (NG - 1)(NG + N 4ei 12 )
[
(NG - 1)
NG(NG - 1) < 0
-2(NG _ 1)
[
+
2)]
-
136
since (NG - l)(NG
+N
- 2) > (N _1)2.
Now,
) _ N(N - 1) G 2
Eo(BSS) -- N(N2- 1) trace (~
~1 2
71
and
Let
BMS = BSS
G(~)
Then
1
Eo(BMS) = G Eo(BSS) =
2
71
and
For AB SS we have,
N
ABSS =
N
L
LL(.i)~91t92) - Dy = N 2 D 2 D 2
1:591 <92:5G i=1 j=1
where D 2 = (D~lt2) - D., D~lt3) - D., ... ,D~G-l,G) - Do)' is a G(~-I) x 1 vector.
Let
V2
=E(D~91t92)
- Do)
= 8~91t92) _
t
(N - 1)
8~ _
2N
8~91t92)
G(NG - 1) 9=1
G(NG - 1) 1<_91 < 92_<G
L
Under H o,
(4.4.18)
and
7i
Var(D~91t92) - D.)
Var(D~91,92»)
+ Var(D.) -
2Cov(D~91,92),Do)
137
=
~
(,,(13) + ,,(13») + 0'2 _ 2Cov (jj(91,92) (N - 1) ~ jjg)
N <"10
<"01
1
. , G(NG - 1) ~
.
g=1
_ 2Cov (jj(91,92)
.
~ (,,(13)
N
_
2N
, G(N G - 1)
+ <"01
,,(13») + 0'2 _
1
<"10
'"
~
l~gl <g2~G
jj(91,92»)
.
2( N - 1) -COV(jj(91,92) jjgl)
G(NG - 1)
.
,.
2(N - 1) COV(jj(91,92) jj92)
G(NG -1)
.
,.
4N
COV(jj~91'92), ~ l)~91,92»)
G(NG - 1)
l~gl<g2~G
~ (C<13)
N <"10
+ C(13») +
<"01
2 _
0'1
2( N - 1) _~C<12,13}
G(NG _ 1) N<"1
2( N - 1) 2 e(12,13}
G(NG -1) N 1
4N
[var(D(gl,92}
G(NG - 1)
.
t
+
9S=1
COV(D~91'92}, D~91,9S})
9S;o'91 ;0'92
+
t;.,
Cov(Dlg.,g,),
D(g,~,»)]
9S;o'91 ;0'92
-
~ (,,(13)
N <"10
+ C<13}) +
<"01
2 _
0'1
4(N - 1) ~C<12,13}
G(NG _ 1) N<,,1
4N
[~(,,(13) + C<13})
G(NG - 1) N <"10
<"01
+ 2(G _
2)~C<13,1;13'2}]
N<,,10
Note that
W(13,1}1O(xfl)
W(13,2}10(xf
l
)
t
~t
~t
~
k=1
k=1
W(13}10(xfl) -
k=1
W(13}01(X~2)
~t
P(X:k i- xfk)
P(X:k i- xfk)
P(X:k i- xfk)
p(Xrk
k=1
and under Ho there is homogeneity among groups,
138
i- x~i)
(4.4.19)
since the sequences are i.i.d.
Therefore, \]!(13,1)10(Xi)\]!(13,2)1O(Xi) = \]!~13)lO(Xi) and
((13,1;13,2) _ c(13) _ ((13) _ c(12) _ c(12,13)
<"10
- <"10 - <"01 - <"1 - <"1
So, under H o,
r;
_
2 (12)
N e1
_
+
[(
N - 1)
2
+ N(G -1)(NG -1)]
12
4d )
NG(NG _ 1)2
4(N - 1) 2 c(12) _
4N
(G _ )~ ((12)
G(NG - 1) N <"1
G(NG _ 1)
1 N <"1
2(G - 4)d 12 )
[(N - 1) + N(G - l)(NG - 1)] NG(NG _ 1)2 +
NG
2
4eP2)
- {2(N - 1)2 + (NG - 1)[2N(G - 1) + (NG -1)(G - 4)]}
e(12)
NG(~~ _
1)2
(4.4.20)
As in BSS,
G(G-1)/2
ABSS
ro.J
N2
L
i=l
where Ai's are the characteristic roots of ~2
are
Ai
(xO.
I
= Var(D 2).
The diagonal elements of ~2
r; and, if all groups are different, the off-diagonal elements are
COV(D~91.92) -
=
D., D~93,94) - D.)
COV(D~91'92),D~93'94»)
COV(D~91'92),D.)
-
+ Var( D.)
-2Cov(D~91'92), D.) + Var(D.)
- Cov( D~93.94) , D.)
-
_ -4(N - 1) 2 c(12,13) _
4N
[~(((13) ((13»)
- G(NG - 1) N <"1
G(NG - 1) N <"10 + <"01
+
2(G - 2) ~ e~~3.1;13'2)]
= -4(N - 1) 2 ((12) _
G(NG - 1) N <"1
+ [(N -
_Sd12)
-
NG
2
1)
+
+ O"~
4N
(G _ )~ c(12)
G(NG _ 1)
1 N <"1
+ N(G - 1)(NG [(N - I? + N(G -
12
4d )
1)] NG(NG _ 1)2
I)(NG - 1)]
4eP2)
(NG - 1)
NG(NG - 1)
12
4d )
2
[(N - 1) - (NG - 1)(NG + N - 2)] NG(NG _ 1)2 < 0
139
and if 91
= 92
COV(D~91'92)
or 91
= 93 or 92 = 93,
- D., D~91,93) - D.)
COV(D~91'92)
_
- D., D~92,93) - D.)
- 2Cov(D~91'92),D.) + Var(D.)
_ ~ c(13,l;13,2) _ 4(N - 1) 2 c(12,13) _
4N
[~(c(13) t(13»)
- N 1,,10
G(NG _ 1) N 1,,1
G(NG _ 1) N 1,,10 + 1,,01
=
COV(D~91'92),D~91'93»)
+
=
=
2( G - 2) ~
e1~3'l;13'2)] + (j~
~ e(12) _ 4(N - 1) ~ e(12) _
4N
(G _ 1)~ e(12)
N 1
G(NG - 1) N 1
G(NG - 1)
N 1
4e(12)
+ [(N - I? + N(G - 1)(NG - 1)] NG(N~ __ 1)2
t(12)
(G _ 8)_1,,1_
NG
+ [(N _1)2 + N(G - {4(N _1)2
4e(12)
1)(NG - 1)] NG(N~ __ 1)2
+ (NG -1)[4N(G -1) + (G -
e(12)
8)(NG -I)]} NG(rJG -1)2
Now
Eo(ABSS)
2
= N 2 trace(E 2 ) = N 2 G(G-l)
2
72
The corresponding mean-square term is defined as
Then
Eo(ABMS)
2
= G(G _
1) trace(E 2 ) =
2
72
4
2
Varo ( ABMS) = G2(G _ 1)2 trace(E2)
140
..
4.5
Test Statistics
One alternative is to compare WMS with AWMS. Let T1 = A~"iiS' Under
H a,
WMS
AWMS
But, A~"iiS
.Et
(,T. (2)1 (X.)
_ J.l2 )} + 0 p (N- 1 )
+ .1.N "N
L..ti=l
a
(N-2)(N-3) {
2N(N -1)
J.l2
1 as N -+
00,
'£
i.e, asymptotically the distribution of Al1~"iis is degen-
erate.
Let ~1
= ~~~
and ~2
= ~~;.
BSS
"-J
Under Ha,
(N - 1)
2
~ A*19 (x 21 ) g
L...J
g=l
where Atg's are the characteristic roots of ~~.
BMS =
BSS
(N)
G 2
1
"-J
NG
~ *1g (
L...J
g=l
G(G-1)/2
ABSS"-J N
L:
i=l
A;i
A
Xl2) g
(Xn.
a
where A;i's are the characteristic roots of ~;.
Also, under Ha, by theoretical results pertaining to U-statistics
VN(WMS -
J.ld2)
-+
and
Thus,
141
N(0, ~d2))
while
WMS
= Op(N- 1/ 2)
AWMS
and
= Op(N- 1/ 2)
Define
TN,2 =N (BMS)
WMS
( ABMS)
TN,3= N AWMS
and
•
Since, BMS and ABMS are the dominating terms in TN,2 and TN,3, respectively, we
can write
N [
BMS
]
W M S - J.t2/ 2 + J.td2
2N(BMS) + 0 (N- 1/ 2)
J.t2
p
= N (BM~)
J.td2
[1 + (WMS - J.td 2)]-1
J.td2
and
Therefore,
and
4
TN ,3
Because the elements of
f'J
:E~
G(G-1}/2
G(G -1)
J.t2
~
1=1
A;i
(xO i
and :E; are unknown, the characteristic roots of these
matrices are also unknown. Therefore, the above distributions do not have a closed
analytic form and we call upon resampling methods, such as the bootstrap, to generate
the reference distribution for the test statistic.
4.5.1
Power of the Tests
Lemma 4.1
Let Tn be a vector of random variables that can be expressed as
Tn =
1
V
+ yinU + ~I
n
142
where R n = Op(n- I ).
If Q(T)
= T'AT is a quadratic form on T.
Q(T)
Then,
T'AT
{v
In our case, T
JnUn+ Rn}'A{v + JnUn + Rn}
+
2
1
Q(v)
+
+ -Q(U n ) + 2v'ARn + Op(n- 3 / 2 )
=DI
and the quadratic form is Q(Dd
;;:-v'AUn
n
yn
= D~DI'
Note that we
can write,
G
D~DI
G
L(i)~
- D.)2 = L(D~ - D. -
g=1
G
- D. - vI)2 + 2 VI L(D~ - D. - vd + G V;
g=1
=D1-
and Var(V N)
vI,
g=1
where
VI
is a vector G x 1 with elements
= :E I = ~:E~ = O(N- I ).
Q(DI)
Therefore,
= D~DI = V~VN + 2V~VN + V~VI
NV~ V N
where
+ vI)2
G
L(D~
Let V N
VI
g=1
G
'"
L
g=1
A; are the characteristic roots of :E~.
Now,
143
A;
(X~) g
Also,
VI.
Then, E(V N)
=0
TN,2 - 2NV;/J-l2)
( 4.,fNVt/(GJ-l2)
=
2
N~~=I(D~ - D. -vd + ";1'1 t(D~ _ D. -vd + Op(N- 1)
2.,fNVI
g=1
Note that
G
-
-
2
G
N ~g=1 (D~ - D. - vd = 0 (N- 1/ 2) slllce N 2:(D~ - D. - VI?
2.,fNVI
p,
g=1
= Op(l)
and
G
.;NIJD~ - D. - VI)
g=1
So, for a fixed VI
G
Slllce I:(D~
= Op(l),
=J 0, as N -+
g=1
- D. - vd = Op(N- I/ 2)
00,
TN,2 - 2Nv; / J-l2) = .;N t(D~ - D. -- vd
( 4.,fNVt!(GJ-l2)
g=1
+ Op(N- I/ 2)
Thus,
(J-l2 - 2NV1))
4.,fN
-+ 1, as N -+
P(TN ,2 > vd = P ( Z > G
00,
i.e., this test is consistent.
Now, consider a local alternative hypothesis. Let VI
=
7N'Yt, where 'Yt is a
constant. Then,
+
N ~;=1
(D~ -
.;Nt
g=1
(D~ -
D. - ?N'Yt
2'Yt
D. -
r
.-!-'Y~) + Op(N- I/2)
..jN
Note that
G (N~g=1
D~ - D.
-
1 *)2
v;l'Y1
---=--~---~--'--
2'Yt
= Op(l)
and
.17lT'\.C!.....(N ~ D~ - D.
V
Therefore, TN ,2 no longer follows a Normal distribution as N -+
1
) = Op(l)
- .,fN'Y~
00.
It is a convolution
of a linear combination of chi-square random variables and a normal random variable:
TN,2 =
2N
G V~V N + 4.,fN ("'Y~)' V N -+- 2
GJ-l2
J-l2
J-l2
hn
144
2
+ Op( N- 1/ 2)
2 L..J
~
TN,2 '" -G
1-l2
g=l
\* (2) + N ( 0, G216 2
A1g Xl
1-l2
9
(1'1*)' ,u
~*11'*)
1
+2
bn
2
1-l2
Now, let us find out whether V~VN and ("'Y1)'V N are independent. V~VN and
("'Y1)'V N are independent if and only if ("'Y1)':E 1 = 0 (Searle, 1971).
Recall that
:E 1 =
712 712
712 712
712
712 712
712
712
where
7{ = {(N _1)2
+ (NG -
4e(12)
1)[N(G - 1) + (NG - 1)(G - 2)]} NG(N~ _ 1)2
and
4e(12)
712 = {(N _1)2 - (NG - 1)(NG + N - 2)} NG(N~ _ 1)2
Then,
and
7{
+ (G <=?
1)712 = 0
G(N - 1)2 + (NG - 1)[N(G - 1) + (NG - 1)(G - 2)
- (G - 1)(NG + N - 2)] = 0
<=?
=0
<=?
G(N - 1)2 + (NG -1?(G - 2) - (NG -1)(NG - 2)(G - 1) = 0
N 2G + 2NG - 2NG2 - 2N2G2 + 3NG2 + N 2G2 - 3NG = 0
<=?
N - 1 + G - N G = 0 <=? N (1 - G)
<=?
.
G(N - 1)2 + (NG - 1)[(NG - 1)(G - 2) - (NG - 2)(G - 1)]
+G -
1= 0
<=?N=1
So, V~ V Nand (1'1)' V N are independent if and only if N = 1, which is not the case
here.
145
Now, write
2
GP2
[NV~ V N + 2VlV ('"Y~)' V N]
=
2
G [( VlVV N
P2
+ '"Yt)'( VlVV N + ,n - ('"Y~)' '"Y~]
and
2N (BMS) = 2N D~Dl
P2
Gp2
_2_( VlVV N
Gp2
P2
Gp2
2
(VlVV N
GP2
Note that VlVVN
+ '"Yrn VlVVN + In _2Ght)2 + 2ht)2 + Op( N- 1/ 2)
+ '"Y~)'(VlVV N + It) + Op(N- 1/ 2 )
+ '"Y; '" N('"Y;, ~;)
and
The distribution of JND~ D 1 can also be derived the following way.
Let P be a G x G orthogonal matrix (i.e., P'P
= I) such that P~;P' = A, where A
is a diagonal matrix, and
Then,
and
ND~ D 1
= y'PP'Y =
Y'y,
Hence,
ND~ D 1
G
= Y'y '" E Ai (X~( 8i))
(4.5.1)
i=1
where 8i
= (vti)2,
Ai
Ai'S are the diagonal elements of the diagonal matrix A and vti is
the ith row of the vector vr
TN ,2
= P'"Yr.
By (4.5.1),
2N,
= aD1D1
P2
'" a
2
P2
~
(2 )
L...t Ai xl(8i ) .
i=1
a
Since we have a linear combination of non-central chi-square random variables,
1*
when VI =
IN'
..
As the distribution of T N ,3 is similar to the distribution of TN,2, the above results
about consistency and power of the test apply to TN ,3'
146
Chapter 5
Numerical Studies and Data
Analysis
5.1
Modelling the Mutation Process
As an example, we consider a set of HIV-1 sequences from LaRosa et al. (1990)
and Myers et al. (1992) which span the V3 loop of the envelope gene. The data
consist of 87 sequences with 35 amino acids each. They are all from the B clade
(North America, Western Europe, Brazil and Thailand). Within the V3 loop are
found determinants for T-cell-adaptation and macrophage tropism (the wild-type
phenotype). These strains (T-cell adapted and macrophage tropic) are subject to
different selection pressures and therefore linkage patterns within the V3 loop are
most likely to differ as well.
5.1.1
The Autologistic Model
Recall from Chapter 2 that y( i) = 0 or 1,
•
Pr(O(i) I {y(j) : j EN;}) =
1 + exp
.
147
(1+ L
(l:i
. jENj
)
lij
y(j)
and
exp ( n, y(i) +
P(y(i) I {y(j) : j E Ni })
=
L: /';; y(i) y(j))
(JENj
1 + exp
ai
(5.1.1)
)
+L
lij
y(j)
jENj
where N i
= {j :
j is a neighbor of i} is the neighborhood of site i.
Since the data set is of moderate size and this model involves too many parameters, we reduce the number of parameters by imposing some restrictions.
Modell
• The neighborhood is defined as the immediate positions, i.e, the neighborhood of
site i is i - I and i
+ 1.
At the two extremities of the sequences, the neighborhood
is only the next one for the left extremity and only the previous one for the right
extremity, i.e., the neighborhood of site 1 is site 2 and the neighborhood of site n is
n-l.
• ai
• lij
= a, Vi
="
Vi, j.
•
The model is then,
P(y(i) I {y(j): 0 <\ i - j I~ I}) =
exp(ny(i) + L: /,y(i)y(j))
(
1 + exp a
jENj
+
L
)
(5.1.2)
I y(j)
jeNj
The maximum pseudo-likelihood estimates (MPLE) and the simulated maximum likelihood estimates (SMLE) are shown in Table 5.2. We used the Metropolis
algorithm to obtain the MCMC maximum-likelihood estimates: 1000 samplers were
generated with different initial values, discarding an initial warm-up of 500 iterations
in each.
To verify if we reached convergence after a burn-in of 500, we considered the
first 100 samplers and compared at each of the 33 positions the empirical probabilities
with the probabilities under model (5.1.2).
148
For each position we computed the statistic Z
Pi - Pi
Zi
= JPi (1 _ Pi) / m '
.
Z
= 1, ... , 33
where,
Pi = P(y( i) I {y(j) : 0 < I i - j
I~
exp(iiy(i) + 2: iY(i)y(j»)
I}) =
(
1 + exp a
jEN.
+
L
)
:y y(j)
jEN.
a and :y are the MPLE of a
and /, respectively, and
Pi
is the empirical distribution
computed from the first 100 samplers.
For each position we compute 4 probabilities:
= 0 I y(i -1) = O,y(i + 1) = 0) ,
P(y(i) = 0 I y(i -1) = 1,y(i+ 1) = 0) and
P(y(i)
= 0 I y(i -1) = O,y(i + 1) = 1)
P(y(i) = 0 I y(i -1) = 1,y(i + 1) = 1)
P(y(i)
For the empirical distribution we have,
A
Pil
A
Pi3
Pil
(#y(i) = O,y(i -1) = O,y(i + 1) = 0)
(#y(i -1) = O,y(i + 1) = 0)
(#y(i) = O,y(i -1) = O,y(i + 1) = 1)
(#y( i) - 1 = 0, y( i + 1) = 1)
(#y(i) = 0, y(i - 1) = 1, y(i + 1) = 0)
and
(#y(i -1) = 1,y(i + 1) = 0)
(#y(i) = 0, y(i - 1) = 1, y(i + 1) = 1)
(#y(i -1) = 1,y(i + 1) = 1)
We only looked at 33 positions, because the first and last positions remain constant.
So, we then have 4 x 33 statistics to compute and the results in Table 5.1 show that
only 6 out of 132 (4%) of those statistics have a significant result. Hence, if we adopt
a level of significance of 5%, a burn-in of 500 is satisfactory.
For this model,
I { P(y(i) = 1 I y(i - 1) = 0, y(i + 1) = 0) }_
og 1 _ P(y(i) = 1 I y(i - 1) = 0, y(i + 1) = 0) - a
Hence, a is the log-odds of mutation at a certain site when there is no mutation in
its neighborhood. If a is negative, the probability of mutation at site i is less than
149
Table 5.1: Convergence Results
Positions
Statistic
2
3
4
5
6
7
8
9
10
11
12
Zit
1.41
0.06
0.88
0.30
1.89
-1.44
0.05
0.98
-0.42
-1.15
1.65
Zi2
-0.36
-0.87
1.62
-1.29
0.82
-0.38
·-1.65
1.25
0.59
-1.03
-0.27
Zi3
0.65
-0.83
-0.62
-0.35
1.37
-1.19
0.03
-0.63
0.08
0.78
-0.82
Zi4
-0.80
-1.47
-0.55
2.62*
0.54
-0.88
0.07
-0.50
1.34
-0.78
0.77
Statistic
13
14
15
16
17
18
19
20
21
22
23
Zit
0.65
-0.27
0.71
0.14
2.45*
-0.89
0.83
-0.83
1.06
-0.43
0.55
Zi2
-0.89
0.44
-0.10
-1.59
1.03
-0.25
0.12
0.12
-0.32
0.09
-1.54
Zi3
0.56
-1.77
0.61
-0.50
0.06
0.47
-1.13
0.83
0.47
-0.80
-0.33
Zi4
-0.95
-0.67
0.04
0.26
0.54
-2.34*
0.96
-0.64
-0.40
-1.57
-0.40
Statistic
24
25
26
27
28
29
30
31
32
33
34
Zit
0.92
-2.01*
-0.55
1.01
-0.17
0.89
0.10
-1.55
-0.17
-2.26*
-0.69
Zi2
1.52
-0.22
-0.57
1.32
0.09
-0.10
0.99
0.79
0.82
-1.72
0.88
Zi3
-0.74
0.21
0.47
0.37
0.07
0.85
0.20
0.27
-1.17
1.27
0.02
Zi4
-0.28
-0.76
-1.19
-0.24
-0.25
0.27
-1.02
-1.20
3.91*
1.22
-1.43
* I Zi I> 1.96 .
Table 5.2: Modell
a
,
MPLE
SMLE
-1.7117
-1.7117
0.7813
0.8204
150
•
the probability of no mutation at that site, given no mutation at its neighborhood.
Since we are assuming that
•
ai = a,
the probability of no mutation is 5.53 (exp(1.71))
times higher than the probability of mutation given that there is no mutation in
the neighborhood. When there is mutation in the neighborhood of site i, the log
odds of mutation is a function of both a and ,. So, when there is mutation in the
neighborhood of site i, if, is large and positive the log-odds of mutation at this site
increases, and if , is negative the log-odds of mutation at this site decreases.
Model 2
• The neighborhood is defined as in model 1.
• ai = a,
Vi
• 'ii = ,pli-il.
The model is then,
Table 5.3: Model 2
MPLE
SMLE
,
-0.0582
-0.0582
1.5715
1.5715
p
0.4679
0.4679
a
There is no difference between the pseudolikelihood estimates and the simulated
maximum-likelihood estimates, indicating that the dependency among neighboring
positions is weak. Again we are assuming that
ai = a
and the probability of no
mutation is 1.05 (exp(0.06)) times higher than the probability of mutation given that
there is no mutation in the neighborhood. When there is mutation in the neighborhood of site i, the log odds of mutation is a function of a, " p and the distance
between sites i and j (since p has power I i - j I). So, when there is mutation in the
neighborhood of site i, the log-odds of mutation at this site increases, since, and p
are positive.
151
5.1.2
Model based on the Bahadur Representation
=
From Section 2.3, Yk
(Yk(I), ... , Yk(n)) is a n
X
1 vector representing the
binary responses for the n sites along sequence k, i.e, whether there is a mutation
from the consensus or not at each site along the sequence. Here, we consider k
sequences and n
•
= 87
= 35 sites.
Probabilistic model
n
P(Yk) =
II er(i) (1 -
ei)1-YIo(i)
;=1
(5.1.4)
The number of parameters to be estimated in the above model is too big (35
+
e:)) compared with the number of observations (87) in the data set. Therefore, we
cannot work with this general model. To get reliable estimates we need to reduce the
parameter space. First, let
Tij
= 0 if I i -
j
12: 1 and ei = e, Vi.
Then, for illustration
purposes, let us discard all positions with less than 20% of mutation and still consider
Tij
= 0 if 1i
- j
12: 1.
In this case we have 12 positions and 23 (12+11) parameters.
The 12 positions we are now dealing with are: 9, 10, 11, 13, 14, 19, 20, 22, 23, 24, 25
and 32. The results are shown in Tables 5.4 and 5.5.
Table 5.4: Bahadur Representation Model (35 positions)
Parameter
e
MLE
T1,2
T2,3
T3,4
T4,5
T5,6
T6,7
T7,8
T8,9
0.24
0.57
0.47
0.75
0.48
0.65
0.32
0.40
-0.02
Parameter
T9,10
T10,1l
Tll,12
T12,13
T13,14
T14,15
T15,16
T16,17
T17,18
MLE
0.11
0.14
-0.11
-0.05
0.25
0.01
0.53
0.70
0.46
Parameter
T18,19
T19,20
T20,21
T21,22
T22,23
T23,24
T24,25
T25,26
T26,27
MLE
-0.07
0.36
-0.21
0.13
0.11
0.31
0.26
-0.20
0.22
Parameter
T27,28
T28,29
T29,30
T30,31
T31,32
T32,33
T33,34
T34,35
MLE
-0.01
0.49
0.00
0.49
-0.11
-0.26
0.42
0.47
Comparing the results of Tables 5.4 and 5.5, we see that the assumption of
equal probability of mutation for all positions (model 1) is not tenable. For instance,
152
.
Table 5.5: Bahadur Representation Model (12 positions)
.
Parameter
e9
60
el1
63
64
69
60
62
63
MLE
0.30
0.35
0.53
0.73
0.18
0.27
0.55
0.40
0.26
Parameter
64
65
62
r9,10 rlO,l1
rl1,13
r13,14
r14,19
r19,20
MLE
0.58
0.73
0.30
-0.15
0.16
-0.16
0.77
0.49
0.30
Parameter
r20,22
r22,23
r23,24
r24,25
r25,32
MLE
-0.22
0.09
0.31
-0.27
0.14
'
in Table 5.5, the estimated probability of mutation at position 14 is 0.18 while at
position 13 it is 0.73, despite the fact that
r13,14
is 0.77, indicating high correlation
between these adjacent sites. A negative estimate of rij means that the mutation rate
at position i is lower than at position j.
5.2
Analyzing the Variability in DNA Sequences
5.2.1
Simulations
In order to look at the behavior of the asymptotic distributions, we generated
N sequences, with K positions each, in G groups and computed the test statistic
Ft
(3.5.29) and the standardized
Ft
(i.e.,
K(F: - :i) /u*).
We performed 500
simulations (i.e., the above procedure was repeated 500 times). In Table 5.6 the data
were generated using the same probability across all positions, i.e., pck = Pc, while
for Table 5.7, different probabilities were used across positions. The tables show the
number below and above some quantiles of the standard normal distribution.
The number of groups seems to make a big difference on the asymptotic results.
When we increase the number of groups to 10, the results are much better (see the
lines for N
= 100,
K = 10, G = 10 and N = 200, K
= 10,
G
= 10).
The number
of sequences also plays an important role, but we need to keep in mind that the
ratio of the number of sequences (N) to the number of positions (K) should not be
small (ideally it should be at least 5). Therefore, for nucleotide sequences we need
Table 5.6: Results of Simulations for Diversity Measures ($p_{ck} = p_c$)

                      Percentiles of the Std. Normal Dist.
   N    K    G      1    2.5      5     10     90     95   97.5     99
  20   10    2     0*    0**    0**    0**     52    36*   27**   16**
  20   10    5     0*    1**    2**   20**     48     30    21*   16**
  20   50    2     0*    0**    0**    0**     59   49**   38**   27**
  50   10    2     0*    0**    0**    0**    67*   43**   28**   19**
  50   10    5     0*    0**    4**    33*     51     34   23**   13**
  50   50    2     0*    0**    0**    0**     59   41**   32**   27**
 100   10    2     0*    0**    0**    0**     52    38*   30**   24**
 100   10    5     0*    1**    5**   26**     63    38*   26**    11*
 100   10   10     0*     4*    11*     40     42     30     14    10*
 100   50    2     0*    0**    0**    0**     49    36*    21*   16**
 100   50    5     0*    0**    3**   23**     56     32     18   12**
 200   10    5     0*    0**    9**    30*     48     32   24**   18**
 200   10   10      3     14     40     52     31     17      7      1

* between 2SD and 3SD or -2SD and -3SD.
** greater than 3SD or smaller than -3SD.
Table 5.7: Results of Simulations for Diversity Measures ($p_{ck} \neq p_c$)

                      Percentiles of the Std. Normal Dist.
   N    K    G      1    2.5      5     10     90     95   97.5     99
  20   10    2     0*    0**    0**    0**     59    36*   23**   16**
  20   10    5     0*    0**    4**   24**     54    37*    22*   13**
  20   50    2     0*    0**    0**    0**     63    39*   29**   20**
  50   10    2     0*    0**    0**    0**     60    36*   28**   19**
  50   10    5     0*    1**    6**     43     51     33     17     10
  50   50    2     0*    0**    0**    0**     53    37*   24**   14**
 100   10    2     0*    0**    0**    0**     53   42**   31**   21**
 100   10    5     0*    1**   10**   27**     41     28     19   12**
 100   10   10      1      3    13*     49     53     30     18      7
 100   50    2     0*    0**    0**    0**     46     28    22*   17**
 100   50    5     0*    1**    9**   28**     57    35*    22*    10*
 200   10    5     0*    1**    8**    34*     54     30     18   12**
 200   10   10      1     17     40     54     30     17      9      8

* between 2SD and 3SD or -2SD and -3SD.
** greater than 3SD or smaller than -3SD.
5.2.2 Data Analysis

The data set consists of two groups (subtype B and not B) with 46 sequences each. The nucleotide sequences are all from different individuals and span the protease region. There are therefore four categories. After aligning the sequences and discarding the positions with no change, we end up with 155 positions.
Looking at the simulation results, we see that our data set is not large enough
for the asymptotic results to apply. We therefore rely on resampling techniques, such
as the bootstrap. Here is a summary of the procedure:
1. Estimate $p_{ck}$ from the data, i.e., $\hat{p}_{ck} = \dfrac{n_{c1k} + n_{c2k}}{2N}$, and compute the statistic $F_1$.

2. Generate N = 46 sequences, with K = 155 positions each, in each of the G = 2 groups, using $\hat{p}_{ck}$.

3. Recompute the test statistic $F_1$ from the generated data and store it.

4. Repeat steps 2 and 3 1,000 times.

The p-value is then $\#\{F_1^* \geq F_{1,\mathrm{obs}}\}/1000$.
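A minimal sketch of this procedure follows. The exact statistic $F_1$ is defined by (3.5.29) in Chapter 3 and is not reproduced here, so the sketch substitutes an illustrative ratio of between- to within-group Gini-Simpson diversity; the function names and this stand-in statistic are assumptions, not the original implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def diversity_ratio(groups):
    """Illustrative stand-in for F_1: between- over within-group Gini-Simpson
    diversity, averaged over positions. groups: list of (N_g x K) category arrays."""
    pooled = np.vstack(groups)
    N, K = pooled.shape
    cats = np.unique(pooled)
    tsi = wsi = 0.0
    for k in range(K):
        tsi += 1.0 - sum(np.mean(pooled[:, k] == c) ** 2 for c in cats)
        for g in groups:
            wsi += (len(g) / N) * (1.0 - sum(np.mean(g[:, k] == c) ** 2 for c in cats))
    tsi, wsi = tsi / K, wsi / K
    return (tsi - wsi) / wsi  # BSI / WSI

def bootstrap_pvalue(groups, stat=diversity_ratio, B=1000):
    """Steps 1-4 above: resample every group from the pooled per-position
    category frequencies (the null of homogeneity) and count how often the
    resampled statistic reaches the observed one."""
    pooled = np.vstack(groups)
    N, K = pooled.shape
    cats = np.unique(pooled)
    p_hat = np.array([[np.mean(pooled[:, k] == c) for c in cats] for k in range(K)])
    f_obs = stat(groups)
    hits = 0
    for _ in range(B):
        boot = [np.stack([rng.choice(cats, size=len(g), p=p_hat[k])
                          for k in range(K)], axis=1) for g in groups]
        hits += stat(boot) >= f_obs
    return hits / B
```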
The results are

WSI = 0.7005,   TSI = 0.7007,   BSI = 0.0002,

and $F_{1,\mathrm{obs}} = 0.011$.
The percentiles of the bootstrap distribution are given in Table 5.8 and the
observed p-value is less than 1/1001. This means that relative to the within-clade
variation, there is significant variability between the two clades.
Table 5.8: Percentiles of the Bootstrap Dist. for Diversity Measures

   90%      95%    97.5%      99%    99.5%    99.9%
0.0015   0.0020   0.0023   0.0027   0.0030   0.0040
Chapter 6
Conclusion and Future Research
6.1 Concluding Summary
The autologistic model formulated in Section 2.1 is able to handle situations
involving spatial binary data and dependence on covariates. One problem is that the
estimation procedure is not straightforward and computer-intensive MCMC procedures are needed. Also, the general model formulation involves too many parameters
and when applied to data sets of moderate size, we need to reduce the number of parameters by imposing restrictions. For the data set we use, the number of sequences is
small compared to the number of positions, and we end up with a very simple model
because of the restrictions we impose on the parameter space. For instance, we assume equal mutation rate and equal correlation structure over all positions, which
may not be true in reality.
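To make the estimation difficulty concrete: even simulating from the autologistic model requires iterative methods such as Gibbs sampling, since the joint likelihood has no closed form. The sketch below draws one sequence from a one-dimensional (degenerate-lattice) autologistic model; the two-parameter predictor is an illustrative choice, not necessarily the exact parameterization of Section 2.1.

```python
import numpy as np

def autologistic_gibbs(beta0, beta1, n_sites, n_sweeps=500, seed=0):
    """Draw one binary sequence from a 1-D autologistic model via Gibbs sampling.

    Conditional model (illustrative): P(Y_i = 1 | rest) =
    expit(beta0 + beta1 * (y_{i-1} + y_{i+1})), neighbors beyond the ends
    treated as 0.
    """
    rng = np.random.default_rng(seed)
    expit = lambda t: 1.0 / (1.0 + np.exp(-t))
    y = rng.integers(0, 2, size=n_sites)
    for _ in range(n_sweeps):
        for i in range(n_sites):
            nb = (y[i - 1] if i > 0 else 0) + (y[i + 1] if i < n_sites - 1 else 0)
            y[i] = int(rng.random() < expit(beta0 + beta1 * nb))
    return y
```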
The model based on the Bahadur representation (Section 2.2) can also handle
dependent binary data, but the inclusion of dependent covariates is not as easy as in
the autologistic model. An advantage of this model is that the likelihood function has
a closed form and the parameter estimates are obtained by the maximum-likelihood
approach using numerical optimization methods, such as the Newton-Raphson procedure. Again, we have to reduce the parameter space when applying this model
to our data set, because the sample size (the number of independent sequences) is
small compared to the number of positions. Ideally, the ratio between the number
of sequences and the number of positions should be at least 5. When we discard all
positions with frequency of mutation less than 20%, we see that the mutation rate
varies a lot from one position to the other, confirming the fact that the assumption
of equal mutation rate over all positions does not hold. For the extension of the
autologistic model to three categories (Section 2.3) we also need large data sets, since
the general formulation of this model involves even more parameters than the one for
a binary response.
When using the diversity measures (Chapter 3) to analyze the variability in
DNA sequences, we assume independence among positions and we only consider sequences from independent individuals. The power of the test developed in Chapter
3 is evaluated and we conclude that the test is consistent, i.e., as the sample size
increases, the power of the test goes to 1. Since we know that positions in DNA
sequences may not be independent, more work needs to be done relaxing this assumption and this is discussed in Section 6.2.4. The simulations (Section 5.2.1) show
that the asymptotic test statistic (3.5.21), developed in Chapter 3, behaves better
when the number of groups is large (at least 10) and the number of sequences (N) is
large compared to the number of positions (K). The ratio between N and K should
be at least 10. Also, when the data are generated using different probabilities across
positions the asymptotic results are as good as those obtained when generated using
equal probabilities. For small sample sizes we can use resampling techniques, such as
the bootstrap, to generate the reference distribution and see where the observed test
statistic falls. The data set used for illustration consists of two groups (subtype B and
not B) with 46 sequences in each. Sequences originate from different individuals
and the sequences span the protease region. After aligning the sequences, we discard
the positions with no change and we end up with 155 positions. The results show
that relative to the within-clade variation, there is significant variability between the
two clades.
The analysis of variance based on Hamming distances (Chapter 4) has the advantage of considering the sequences on an individual basis, since we make all pairwise
comparisons within and across groups. The sequences are assumed to be independent
in each group, but we should also think about developing tests when the sequences
cannot be considered as independent. For instance, if the interest is in comparisons
of sequences within and between individuals, the sequences within the same individual cannot be considered as independent. This is discussed in Section 6.2.6. We
decompose the total sum of squares into within-, between- and across-group sums of
squares, with the latter term being new: it does not appear in the usual decomposition. The assumption of independence among positions is relaxed and test statistics
are developed based on U-statistics theory. We found that these tests are consistent
and their power goes to 1 as the sample size increases. The distributions of these test
statistics do not have a closed analytical form, and therefore, we need to call upon
resampling techniques, such as the bootstrap, to perform the test.
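As an illustration of the quantities involved, a minimal sketch of the Hamming distance and its U-statistic averages follows; the full within-, between- and across-group sums-of-squares decomposition of Chapter 4 is not reproduced, and the function names are illustrative.

```python
import numpy as np
from itertools import combinations

def hamming(a, b):
    """Proportion of positions at which two aligned sequences differ."""
    a, b = np.asarray(a), np.asarray(b)
    return float(np.mean(a != b))

def mean_within(group):
    """U-statistic: average Hamming distance over all pairs within one group."""
    return float(np.mean([hamming(s, t) for s, t in combinations(group, 2)]))

def mean_between(g1, g2):
    """Generalized U-statistic: average distance over all cross-group pairs."""
    return float(np.mean([hamming(s, t) for s in g1 for t in g2]))
```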
6.2 Future Research

6.2.1 An Application of the Autologistic Model to Sequences from the nef Gene
In the application of the autologistic model of Section 2.1 no covariates are
considered. Now, we would like to include covariates and define neighborhoods in
a biologically meaningful manner. In particular, the three-dimensional molecular
structure of the nef gene is known and it is possible to define neighborhoods according
to molecular distances. Having the spatial coordinates for each sequence position, we
can compute genetic distances. At the amino-acid level, possible covariates are size
(e.g., small or large) and group (hydrophobic or hydrophilic). The goal is to estimate
the probability of mutation taking into account the neighborhood of the site and the
most significant covariates. A mutation occurs when there is a change in amino acid
with respect to the consensus sequence.
6.2.2 Extension of the Autologistic Model to More than Two Categories
The theoretical aspects of an extension of the autologistic model to three categories are developed in Section 2.3, but no application is shown. Now, we will extend this model to any C (C > 2) categories and apply it to some data. Markov-chain Monte-Carlo procedures will be used to estimate the model parameters.
6.2.3 An Alternative Null Hypothesis for the Analysis of Diversity Measures
The null hypothesis of interest in Chapter 3 is that there is homogeneity among
the groups, i.e., the category probability at a certain position is the same over all
groups. Now, we would like to test a less restrictive hypothesis. Recall from Chapter
3 that the population variation within the gth group at the kth position is

$$I_S(P_{gk}) = 1 - \sum_{c=1}^{C} p_{cgk}^2 \qquad (6.2.1)$$

If the null hypothesis of interest is just

$$H_0: I_S(P_{1k}) = I_S(P_{2k}) = \cdots = I_S(P_{Gk}),$$

i.e., the within-group variation at the kth position is the same over all groups, it implies that

$$\| P_{1k} \| = \| P_{2k} \| = \cdots = \| P_{Gk} \| \qquad (6.2.2)$$
where $P_{gk} = (p_{1gk}\ p_{2gk}\ \cdots\ p_{Cgk})'$ is a $C \times 1$ vector representing the probabilities of belonging to categories $c = 1, \ldots, C$ in group $g$ at position $k$. Note that if this hypothesis is true, the hypothesis of Chapter 3 ($H_0: p_{cgk} = p_{ck}$) is not necessarily true. For instance, with $C = 3$, $P_{1k} = (0.7\ 0.2\ 0.1)'$ and $P_{2k} = (0.2\ 0.7\ 0.1)'$ have equal norms, and hence equal Simpson indices, yet different category probabilities.

The distribution of the test statistic under this less restrictive null hypothesis needs to be derived, and we need to be aware of the structure of the covariance matrix in this situation. The latter is now a little more complicated since we cannot assume that the individual probabilities are the same over all groups.
6.2.4 Diversity Measures for Sequences with Dependent Positions

The test statistic developed in Chapter 3 assumes independence among positions along the sequences. A simple scenario for dependence among positions is depicted by a first-order Markov chain.
First, assume that there are only 2 categories, i.e., C = 2. For a first-order Markov chain, let

$$p_k^g(2 \mid 1) = \Pr\{\text{being in category 2 at position } k \text{ for group } g \mid \text{at position } k-1 \text{ it is in category 1}\}.$$

Then,

$$p_k^g(1 \mid 1) = 1 - p_k^g(2 \mid 1) \quad \text{and} \quad p_k^g(1 \mid 2) = 1 - p_k^g(2 \mid 2).$$

Let $n_{gk} = n_{gk}(1)$ denote the number of responses in category 1 at position $k$ for group $g$, $N$ the total number of responses at position $k$, $N - n_{gk} = n_{gk}(2)$ the number of responses in category 2 at position $k$ for group $g$, and $n_k^g(a \mid b)$ the number of responses in category $a$ at position $k$ for group $g$ given that at position $k-1$ it is in category $b$, with $a, b = 1, 2$.
The contingency table for two categories is shown in Table 6.1 and for C categories in Table 6.2.
For each group's responses $(n_{g1}, n_{g2}, \ldots, n_{gK})$,

$$\begin{aligned}
\Pr&\{(n_{g1}, n_2^g(1 \mid 1), n_2^g(1 \mid 2), \ldots, n_K^g(1 \mid 1), n_K^g(1 \mid 2))\} \\
&= \binom{N}{n_{g1}} p_1^g(1)^{n_{g1}} \left(1 - p_1^g(1)\right)^{N - n_{g1}} \\
&\quad \times \binom{n_{g1}}{n_2^g(1 \mid 1)} p_2^g(1 \mid 1)^{n_2^g(1 \mid 1)} \left(1 - p_2^g(1 \mid 1)\right)^{n_{g1} - n_2^g(1 \mid 1)} \\
&\quad \times \binom{N - n_{g1}}{n_2^g(1 \mid 2)} p_2^g(1 \mid 2)^{n_2^g(1 \mid 2)} \left(1 - p_2^g(1 \mid 2)\right)^{N - n_{g1} - n_2^g(1 \mid 2)} \times \cdots \\
&\quad \times \binom{n_{g,K-1}}{n_K^g(1 \mid 1)} p_K^g(1 \mid 1)^{n_K^g(1 \mid 1)} \left(1 - p_K^g(1 \mid 1)\right)^{n_{g,K-1} - n_K^g(1 \mid 1)} \\
&\quad \times \binom{N - n_{g,K-1}}{n_K^g(1 \mid 2)} p_K^g(1 \mid 2)^{n_K^g(1 \mid 2)} \left(1 - p_K^g(1 \mid 2)\right)^{N - n_{g,K-1} - n_K^g(1 \mid 2)}
\end{aligned}$$

Table 6.1: Contingency Table for Two Categories (dependent positions)

                                Position k
                        Category 1     Category 2     Total
Posit. k-1  Category 1  n_k^g(1|1)     n_k^g(2|1)     n_{g,k-1}(1)
            Category 2  n_k^g(1|2)     n_k^g(2|2)     n_{g,k-1}(2)
Total                   n_{gk}(1)      n_{gk}(2)      N

Table 6.2: Contingency Table for C Categories (dependent positions)

                                    Position k
                        Category 1   Category 2   ...   Category C   Total
Position k-1  Categ. 1  n_k^g(1|1)   n_k^g(2|1)   ...   n_k^g(C|1)   n_{g,k-1}(1)
              Categ. 2  n_k^g(1|2)   n_k^g(2|2)   ...   n_k^g(C|2)   n_{g,k-1}(2)
              ...
              Categ. C  n_k^g(1|C)   n_k^g(2|C)   ...   n_k^g(C|C)   n_{g,k-1}(C)
Total                   n_{gk}(1)    n_{gk}(2)    ...   n_{gk}(C)    N
Since the groups are independent, we can take the product over all groups. The probabilistic model is

$$\prod_{g=1}^{G} \Pr\{(n_{g1}, n_2^g(1 \mid 1), n_2^g(1 \mid 2), \ldots, n_K^g(1 \mid 1), n_K^g(1 \mid 2))\},$$

with each factor as displayed above. Let

$V_1 = (n_{11}(1)\ n_{11}(2)\ \cdots\ n_{G1}(1)\ n_{G1}(2))'$ be a $2G \times 1$ vector,

$V_k = (n_k^1(1 \mid 1)\ n_k^1(2 \mid 1)\ n_k^1(1 \mid 2)\ n_k^1(2 \mid 2)\ \cdots\ n_k^G(1 \mid 2)\ n_k^G(2 \mid 2))'$ be a $4G \times 1$ vector for $k = 2, \ldots, K$,

and $V = (V_1\ V_2\ \cdots\ V_K)'$ be a $2G(2K - 1) \times 1$ vector. Then

$$\mu_1 = E(V_1) = (N p_1^1(1)\ N p_1^1(2)\ \cdots\ N p_1^G(1)\ N p_1^G(2))' \qquad (6.2.3)$$

$$\mu_k = E(V_k) = (n_{1,k-1}(1)\, p_k^1(1 \mid 1)\ \cdots\ n_{1,k-1}(2)\, p_k^1(2 \mid 2)\ \cdots\ n_{G,k-1}(2)\, p_k^G(2 \mid 2))' \quad \text{for } k = 2, \ldots, K \qquad (6.2.4)$$

$$\mu = E(V) = (\mu_1\ \mu_2\ \cdots\ \mu_K)' \qquad (6.2.5)$$

$$\Sigma = \operatorname{Cov}(V) = \Sigma_{11} \oplus \Sigma_{21} \oplus \cdots \oplus \Sigma_{G1} \oplus \Sigma_{12|1} \oplus \Sigma_{12|2} \oplus \cdots \oplus \Sigma_{G2|1} \oplus \Sigma_{G2|2} \oplus \cdots \oplus \Sigma_{GK|1} \oplus \Sigma_{GK|2} \qquad (6.2.6)$$

where $\Sigma_{g1}$ is a $2 \times 2$ matrix,

$$\Sigma_{g1} = N \begin{pmatrix} p_1^g(1)\left(1 - p_1^g(1)\right) & -p_1^g(1)\, p_1^g(2) \\ -p_1^g(1)\, p_1^g(2) & p_1^g(2)\left(1 - p_1^g(2)\right) \end{pmatrix},$$

and $\Sigma_{gk|c}$, for $c = 1, 2$, is a $2 \times 2$ matrix,

$$\Sigma_{gk|c} = n_{g,k-1}(c) \begin{pmatrix} p_k^g(1 \mid c)\left(1 - p_k^g(1 \mid c)\right) & -p_k^g(1 \mid c)\, p_k^g(2 \mid c) \\ -p_k^g(1 \mid c)\, p_k^g(2 \mid c) & p_k^g(2 \mid c)\left(1 - p_k^g(2 \mid c)\right) \end{pmatrix}.$$

Therefore, $\Sigma$ is a square matrix of dimension $2G(2K - 1)$.

Note that for C categories $V$ is a $CG(C(K-1) + 1) \times 1$ vector. The probabilistic model for C categories is

$$\prod_{g=1}^{G} \Pr\{(n_{g1}(1), \ldots, n_{g1}(C), \ldots, n_K^g(1 \mid C), \ldots, n_K^g(C \mid C))\}.$$
So, we need to find the distribution of the sums of squares under this model and
construct a test statistic.
More general dependency will be considered: e.g., kth order Markov chains and
autologistic models.
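As a small illustration of this setup, the sketch below simulates two-category sequences from a first-order Markov chain and tabulates the transition counts $n_k^g(a \mid b)$ entering Table 6.1; the parameterization and function names are illustrative assumptions.

```python
import numpy as np

def simulate_markov_group(N, K, p1, trans, seed=0):
    """Simulate N two-category sequences (coded 1, 2) of length K from a
    first-order Markov chain.

    p1: P(category 1 at position 1); trans[b - 1] = (P(1|b), P(2|b)) for
    previous category b (an illustrative encoding).
    """
    rng = np.random.default_rng(seed)
    X = np.empty((N, K), dtype=int)
    X[:, 0] = rng.choice([1, 2], size=N, p=[p1, 1.0 - p1])
    for k in range(1, K):
        for b in (1, 2):
            idx = X[:, k - 1] == b
            X[idx, k] = rng.choice([1, 2], size=int(idx.sum()), p=trans[b - 1])
    return X

def transition_counts(X, k):
    """n_k(a | b): responses in category a at (0-based) position k given
    category b at position k - 1."""
    return {(a, b): int(np.sum((X[:, k] == a) & (X[:, k - 1] == b)))
            for a in (1, 2) for b in (1, 2)}
```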
6.2.5 Inclusion of Covariates in the Analysis of Diversity Measures
If we want to include covariates in the setup of the analysis of diversity measures and these covariates are categorical, we can treat them as additional factors in the analysis of variance. If the covariates are continuous, the appropriate model for this situation would be a log-linear model, and it does not seem to be possible to extend this model to the case of non-independent positions. However, most of the time the variables of interest are categorical. Even variables like viral load or CD4 count are usually categorized.
6.2.6 Analysis of Variance Based on the Hamming Distance when Sequences are not Independent
The theory of generalized U-statistics for independent random vectors (i.e., sequences) was used to get the distribution of sums of squares and develop test statistics
in Chapter 4. In the case of dependence among sequences, the theory of U-statistics
for non-independent random vectors may be called upon to find the distribution of
sums of squares.
6.2.7 Tests Based on Contrasts in the Analysis of Variance for the Hamming Distances
If the test of homogeneity among groups turns out significant, we would like to
identify the most heterogeneous groups. So, we need to construct tests to compare
all possible pairs of groups and to order them according to their degree of variability.
Appendix A

[Equations (A.4)-(A.12): algebraic identities expanding squares and cross-products of sums of the form $(\sum_{k_1 \neq k_2} a_{k_1} b_{k_2})^2$ and $(\sum_{k_1 \neq k_2} a_{k_1} b_{k_2})(\sum_{k_1 < k_2} c_{k_1} c_{k_2})$ into sums over configurations of distinct indices $k_1, k_2, k_3, k_4$.]
Appendix B

Example: Two categories

Let, for any $g$,

$$p_k^g(0) = P(X_{jk}^g = 0), \quad p_k^g(1) = P(X_{jk}^g = 1),$$
$$p_{k_1 k_2}^g(1,1) = P(X_{jk_1}^g = 1, X_{jk_2}^g = 1), \quad p_{k_1 k_2}^g(0,1) = P(X_{jk_1}^g = 0, X_{jk_2}^g = 1),$$
$$p_{k_1 k_2}^g(1,0) = P(X_{jk_1}^g = 1, X_{jk_2}^g = 0), \quad p_{k_1 k_2}^g(0,0) = P(X_{jk_1}^g = 0, X_{jk_2}^g = 0).$$

Also, let

$$\delta_k^g = P(X_{ik}^g \neq X_{jk}^g) = \sum_{u=0}^{1} p_k^g(u)\,[1 - p_k^g(u)] \qquad (B.1)$$

$$\delta_{k_1 k_2}^g = P(X_{ik_1}^g \neq X_{jk_1}^g,\ X_{ik_2}^g \neq X_{jk_2}^g) = \sum_{u,v=0}^{1} p_{k_1 k_2}^g(u, v)\, p_{k_1 k_2}^g(1-u, 1-v) \qquad (B.2)$$

$$P(X_{ik}^g \neq X_{jk}^g,\ X_{ik}^g \neq X_{j'k}^g) = \sum_{u=0}^{1} p_k^g(u)\,[1 - p_k^g(u)]^2 \qquad (B.3)$$

$$P(X_{ik_1}^g \neq X_{jk_1}^g,\ X_{ik_2}^g \neq X_{j'k_2}^g) = \sum_{u,v=0}^{1} p_{k_1 k_2}^g(u, v)\,[1 - p_{k_1}^g(u)]\,[1 - p_{k_2}^g(v)] \qquad (B.4)$$

$$P(X_{jk}^g \neq x_{ik}^g) = \sum_{u=0}^{1} [1 - p_k^g(u)]\, I(x_{ik}^g = u) \qquad (B.5)$$

$$\left( P(X_{jk}^g \neq x_{ik}^g) \right)^2 = \sum_{u=0}^{1} [1 - p_k^g(u)]^2\, I(x_{ik}^g = u) \qquad (B.6)$$

$$P(X_{jk}^g \neq X_{j'k}^g,\ X_{jk}^g \neq x_{ik}^g) = \sum_{u=0}^{1} p_k^g(u)\,[1 - p_k^g(u)]\, I(x_{ik}^g = u) \qquad (B.7)$$

[Equations (B.8)-(B.52): analogous identities for the remaining products of mismatch probabilities, each expanded as a sum over $u, v, z, l \in \{0, 1\}$ of products of the marginal probabilities $p_k^g(\cdot)$, the joint probabilities $p_{k_1 k_2}^g(\cdot,\cdot)$, the $\delta$ terms, and indicator functions of the observed $x_{ik}^g$, culminating in the expansions (B.50)-(B.52) arising in the moment computations for the sums of squares.]
Bibliography

Amfoh, K. K., Shaw, R. F. and Bonney, G. E. (1994). The use of logistic models for the analysis of codon frequencies of DNA sequences in terms of explanatory variables. Biometrics 50, 1054-1063.

Anderson, R. J. and Landis, J. R. (1980). CATANOVA for multidimensional contingency tables: Nominal-scale response. Communications in Statistics - Theory and Methods A9(11), 1191-1206.

Anderson, R. J. and Landis, J. R. (1982). CATANOVA for multidimensional contingency tables: Ordinal-scale response. Communications in Statistics - Theory and Methods 11(3), 257-270.

Bahadur, R. R. (1961). A representation of the joint distribution of responses to n dichotomous items. In Studies in Item Analysis and Prediction, H. Solomon (ed.). Stanford University Press. pp. 158-176.

Berger, J. O. (1980). Statistical Decision Theory and Bayesian Analysis. Second edn. Springer-Verlag.

Besag, J. (1972). Nearest-neighbor systems and the auto-logistic model for binary data. Journal of the Royal Statistical Society, Series B 34, 75-83.

Besag, J. (1974). Spatial interaction and the statistical analysis of lattice systems (with discussion). Journal of the Royal Statistical Society, Series B 36, 192-236.

Besag, J. (1975). Statistical analysis of non-lattice data. The Statistician 24, 179-195.

Besag, J. (1986). On the statistical analysis of dirty pictures. Journal of the Royal Statistical Society, Series B 48, 259-302.

Besag, J. (1989). Digital image processing: towards Bayesian image analysis. Journal of Applied Statistics 16, 395-407.

Bose, R. C., Chakravarti, I. M., Mahalanobis, P. C., Rao, C. R. and Smith, K. J. C. (1970). A representation of the joint distribution of responses to n dichotomous items. In Essays in Probability and Statistics, R. C. Bose (ed.). The University of North Carolina Press. pp. 111-132.

Broemeling, L. D. (1985). Bayesian Analysis of Linear Models. Marcel Dekker, Inc.

Casella, G. and George, E. I. (1992). Explaining the Gibbs sampler. The American Statistician 46, 167-174.

Chakraborty, R. and Rao, C. R. (1991). Measurement of genetic variation for evolutionary studies. In Handbook of Statistics 8, C. R. Rao and R. Chakraborty (eds). North-Holland. pp. 271-316.

Cliff, A. D. and Ord, J. K. (1981). Spatial Processes: Models and Applications. Pion.

Cochran, W. G. and Cox, G. M. (1957). Experimental Designs. John Wiley & Sons, Inc.

Coffin, J. M. (1986). Genetic variation in AIDS viruses. Cell 46, 1-4.

Cressie, N. A. C. (1993). Statistics for Spatial Data. Second edn. John Wiley & Sons, Inc.

Cullen, B. R. (1991). Human immunodeficiency virus as a prototypic complex retrovirus. Journal of Virology 65, 1053-1056.

Daniel, W. W. (1978). Applied Nonparametric Statistics. Houghton Mifflin Company.

Davidson, R. R. and Bradley, R. A. (1969). Multivariate paired comparisons: The extension of a univariate model and associated estimation and test procedures. Biometrika 56, 81-95.

Dawid, A. P. (1983). Introduction to Bayesian statistics. In Encyclopedia of Statistics 4, S. Kotz and N. L. Johnson (eds). John Wiley & Sons. pp. 89-105.

Gelfand, A. E. and Smith, A. F. M. (1990). Sampling-based approaches to calculating marginal densities. Journal of the American Statistical Association 85, 398-409.

Gelfand, A. E., Hills, S. E., Racine-Poon, A. and Smith, A. F. M. (1990). Illustration of Bayesian inference in normal data models using Gibbs sampling. Journal of the American Statistical Association 85, 972-985.

Geman, S. and Geman, D. (1984). Stochastic relaxation, Gibbs distributions and the Bayesian restoration of images. IEEE Transactions on Pattern Analysis and Machine Intelligence 6, 721-741.

Geyer, C. J. and Thompson, E. A. (1992). Constrained Monte Carlo maximum likelihood for dependent data. Journal of the Royal Statistical Society, Series B 54, 657-699.

Gini, C. W. (1912). Variabilità e Mutabilità. Studi Economico-Giuridici della R. Università di Cagliari 3(2), 3-159.

Gojobori, T., Ishii, K. and Nei, M. (1982). Estimation of average number of nucleotide substitutions when the rate of substitution varies with nucleotide. Journal of Molecular Evolution 18, 414-423.

Gojobori, T., Moriyama, E. N. and Kimura, M. (1990). Statistical methods for estimating sequence divergence. Methods in Enzymology 183, 531-550.

Gonick, L. and Wheelis, M. (1991). The Cartoon Guide to Genetics. Updated edn. Harper Perennial.

Graybill, F. A. (1961). An Introduction to Linear Statistical Models. McGraw-Hill, New York.

Hahn, B. H., Gonda, M. A., Shaw, G. M., Popovic, M. and Hoxie, J. A. (1985). Genomic diversity of the acquired immune deficiency syndrome virus HTLV-III: Different viruses exhibit greatest divergence in their envelope genes. Proc. Natl. Acad. Sci. USA 82, 4813-4817.

Hahn, B. H., Shaw, G. M., Taylor, M. E., Redfield, R. R. and Markham, P. D. (1986). Genetic variation in HTLV-III/LAV over time in patients with AIDS. Science 232, 1548-1553.

Hammersley, J. M. and Clifford, P. (1971). Markov fields on finite graphs and lattices. Unpublished manuscript, Oxford University.

Hastings, W. K. (1970). Monte Carlo sampling methods using Markov chains and their applications. Biometrika 57, 97-109.

Hoeffding, W. (1948). A class of statistics with asymptotically normal distribution. Annals of Mathematical Statistics 19, 293-325.

Huffer, F. W. and Wu, H. (1995a). Markov chain Monte Carlo for autologistic regression models with application to the distribution of plant species. Unpublished manuscript.

Huffer, F. W. and Wu, H. (1995b). Variable selection in auto-logistic models. Unpublished manuscript.

Kimura, M. (1981). Estimation of evolutionary distances between homologous nucleotide sequences. Proc. Natl. Acad. Sci. USA 78, 454-458.

Kimura, M. and Ohta, T. (1972). On the stochastic model for estimation of mutational distance between homologous proteins. Journal of Molecular Evolution 2, 87-90.

Kullback, S. (1983). Kullback information. In Encyclopedia of Statistics 4, S. Kotz and N. L. Johnson (eds). John Wiley & Sons. pp. 421-425.

Kypr, J. and Mrazek, J. (1987). Unusual codon usage of HIV. Nature 327, 20.

Lee, A. J. (1990). U-Statistics: Theory and Practice. Marcel Dekker, Inc.

Liang, K., Zeger, S. L. and Qaqish, B. (1992). Multivariate regression analyses for categorical data. Journal of the Royal Statistical Society, Series B 54, 3-40.

Light, R. J. and Margolin, B. H. (1971). An analysis of variance for categorical data. Journal of the American Statistical Association 66, 534-544.

Light, R. J. and Margolin, B. H. (1974). An analysis of variance for categorical data II: Small sample comparisons with chi square and other competitors. Journal of the American Statistical Association 69, 755-764.

Mansky, L. M. and Temin, H. M. (1995). Lower in vivo mutation rate of human immunodeficiency virus type 1 than that predicted from the fidelity of purified reverse transcriptase. Journal of Virology 69, 5087-5094.

McCutchan, F. E., Ungar, B. L. P., Hegerich, P., Roberts, C. R., Fowler, A. K., Hira, S. K., Perine, P. L. and Burke, D. S. (1992). Genetic analysis of HIV-1 isolates from Zambia and an expanded phylogenetic tree for HIV-1. Journal of Acquired Immune Deficiency Syndromes 5, 441-449.

Mises, R. V. (1947). On the asymptotic distribution of differentiable statistical functions. Annals of Mathematical Statistics 18, 309-348.

Nei, M. (1975). Molecular Population Genetics and Evolution. North-Holland, Amsterdam.

Parzen, E. (1962). Stochastic Processes. Holden-Day, Inc.

Puri, M. L. and Sen, P. K. (1985). Nonparametric Multivariate Analysis.

Raftery, A. E. (1985). A model for high-order Markov chains. Journal of the Royal Statistical Society, Series B 47, 528-539.

Raftery, A. E. and Tavare, S. (1994). Estimation and modeling repeated patterns in high order Markov chains with the mixture transition distribution model. Applied Statistics 43, 179-199.

Roberts, J. D., Bebenek, K. and Kunkel, T. A. (1988). The accuracy of reverse transcriptase from HIV-1. Science 242, 1171-1173.

Roy, S. N., Greenberg, B. G. and Sarhan, A. E. (1960). Evaluation of determinants, characteristic equations, and their roots for a class of patterned matrices. Journal of the Royal Statistical Society, Series B 22(2), 348-359.

Searle, S. R. (1971). Linear Models. John Wiley & Sons.

Searle, S. R. (1982). Matrix Algebra Useful for Statistics. John Wiley & Sons.

Seillier-Moiseiwitsch, F., Margolin, B. H. and Swanstrom, R. (1994). Genetic variability of human immunodeficiency virus: Statistical and biological issues. Annual Review of Genetics 28, 559-596.

Sen, P. K. (1967). On some multisample permutation tests based on a class of U-statistics. Journal of the American Statistical Association 62, 1201-1213.

Sen, P. K. (1981). Sequential Nonparametrics: Invariance Principles and Statistical Inference. John Wiley & Sons.

Sen, P. K. (1995). Paired comparisons for multiple characteristics: An ANOCOVA approach. In Statistical Theory and Practice: Papers in Honor of H. A. David, H. N. Nagaraja, P. K. Sen and D. F. Morrison (eds). Springer-Verlag, New York. pp. 247-264.

Sen, P. K. and Singer, J. M. (1993). Large Sample Methods in Statistics. Chapman & Hall.

Sharp, P. M. (1986). What can AIDS virus codon usage tell us? Nature 324, 114.

Simpson, E. H. (1949). The measurement of diversity. Nature 163, 688.

Smith, R. L. (1994). A tutorial on Markov chain Monte Carlo. Unpublished manuscript.

Smith, R. L., Calloway, M. O. and Morrissey, J. P. (1994). Network autocorrelation with a binary dependent variable: A method and an application. Unpublished manuscript.

Smith, T. F., Srinivasan, A., Schochetman, G., Marcus, M. and Myers, G. (1988). The phylogenetic history of immunodeficiency viruses. Nature 333, 573-575.

Strauss, D. J. (1977). Clustering on coloured lattices. Journal of Applied Probability 14, 135-143.

Takahata, N. and Kimura, M. (1981). Genetics 98, 641.

Tanner, M. A. (1993). Tools for Statistical Inference. Second edn. Springer-Verlag.

Tavare, S. and Giddings, B. W. (1989). Some statistical aspects of the primary structure of nucleotide sequences. In Mathematical Methods for DNA Sequences, M. S. Waterman (ed.). CRC Press. pp. 117-132.

Teich, N. (1984). Taxonomy of retroviruses. In RNA Tumor Viruses vol. 2, R. Weiss, N. Teich, H. Varmus and J. Coffin (eds). Cold Spring Harbor Laboratory, Cold Spring Harbor, N.Y. pp. 25-207.

Varmus, H. and Brown, P. (1989). Retroviruses. In Mobile DNA, D. E. Berg and M. M. Howe (eds). American Society for Microbiology. pp. 53-108.

Weir, B. S. (1990a). Genetic Data Analysis. Sinauer Associates.

Weir, B. S. (1990b). Variation in sequence distances. Molecular Evolution pp. 281-288.

Wu, H. and Huffer, F. W. (1995). Modeling the distribution of plant species using the autologistic regression model. Unpublished manuscript.