Combining multivariate Markov chains
Jesús E. García
University of Campinas
Department of Statistics
Abstract. In this paper we address the problem of modelling multivariate finite order Markov
chains when the dataset is not large enough to apply the usual methodology. The number of
parameters needed for a multivariate Markov chain grows exponentially with the process order
and the dimension of the chain's alphabet. Usually, when the dataset is small, the order of the fitted
model is reduced compared to the true process order. In this paper we introduce a strategy to
estimate a multivariate process; through this new strategy the estimated order will be greater than
the order achievable with standard statistical procedures. We apply the partition Markov models,
a family of models in which each member is identified by a partition of the state space. The
procedure consists in obtaining a partition of the state space constructed from a combination
of the partitions corresponding to the marginal processes of the multivariate chain and the partition
corresponding to the multivariate Markov chain.
Keywords: Multivariate Markov chains; Dependent stochastic processes; Model selection; Discrete copula.
PACS: 02.50.Ey; 02.50.Ga; 02.50.Tt.
INTRODUCTION
In the field of estimation for multivariate Markov processes, some aspects that may
compromise the quality of the estimation should be noted. On one hand, for a fixed alphabet,
the number of parameters needed to specify a discrete Markov chain grows exponentially
with the memory of the chain. On the other hand, for a fixed order of the chain, the
number of parameters grows exponentially with the dimension of the alphabet in which
the process takes values. For instance, a k-dimensional multivariate Markov chain of
order o, with k marginal processes on the finite alphabet B, requires (|B|^k − 1)(|B|^k)^o
parameters. This number grows very fast with k and o, making the task of fitting a model
highly complicated. To simplify the explanation of the idea introduced in this paper,
we will assume that the k marginal processes are order om Markov chains and that the
multivariate Markov process also has order om. The following strategy is proposed to
find an approximate order om Markov model for the multivariate process.
The dataset is split into two subsets. One subset is used to fit, independently
for each of the k one-dimensional marginal processes, a Markov model of order
om, assuming that the sample size is large enough to do so. In this step we
estimate k(|B| − 1)|B|^om parameters. Note that we do not get any information about the
dependence or interaction between the marginals. The second subset is used to fit a
multivariate Markov model with memory length oc. Here, oc is such that the size of the
second subset of data is large enough to model the joint process as an order oc Markov
chain. We will assume that oc < om; otherwise this procedure is not needed, since
we are assuming that the joint process has order om. In this second step, we recover
part of the information about the dependence between the marginals by modelling
the multivariate process as a Markov chain of order oc, which requires an additional
(|B|^k − 1)(|B|^k)^oc parameters.
The information obtained from the modelling of the marginals and from the modelling
of the joint process is combined using a copula, which allows us to build a Markov model
of order om. The total number of parameters required by the strategy is k(|B| − 1)|B|^om +
(|B|^k − 1)(|B|^k)^oc, which is smaller than (|B|^k − 1)(|B|^k)^om for oc < om. We further reduce
the number of parameters by using a new family of models called the Partition Markov
models (PMM), see [4], [3] and [5], which is designed to fit Markov models with the
minimal number of parameters.
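The parameter counts above can be compared numerically; the following is a hypothetical illustration (the function names are ours, not from the paper):

```python
# Hypothetical sketch: parameter counts for a full order-om joint model versus
# the combined strategy (k marginal order-om chains plus one joint order-oc chain).
# B = marginal alphabet size, k = dimension, om/oc = orders with oc < om.

def full_model_params(B, k, o):
    """Parameters of a full order-o Markov chain on the alphabet B^k."""
    return (B**k - 1) * (B**k) ** o

def combined_strategy_params(B, k, om, oc):
    """k marginal order-om chains plus one joint order-oc chain."""
    marginals = k * (B - 1) * B**om
    joint = (B**k - 1) * (B**k) ** oc
    return marginals + joint

# Example: binary alphabet, 3 coordinates, marginal order 4, joint order 1.
print(full_model_params(2, 3, 4))            # (2^3 - 1) * (2^3)^4 = 28672
print(combined_strategy_params(2, 3, 4, 1))  # 3*1*2^4 + 7*8 = 104
```

Even in this small example, the combined strategy needs two orders of magnitude fewer parameters than the full model.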
PARTITION MARKOV MODEL
In this section we introduce the notion of Partition Markov Models (PMM), see [5] for
a more complete explanation. Let X_t be a discrete time Markov chain with memory o
on a finite alphabet A, with state space S = A^o. Denote the string a_m a_{m+1} ... a_n by a_m^n,
where a_i ∈ A, m ≤ i ≤ n. For each a ∈ A and s ∈ S, P(a|s) = Prob(X_t = a | X_{t−o}^{t−1} = s).
Definition 1 We will say that s, r ∈ S are equivalent (denoted by s ∼_p r) if P(a|s) =
P(a|r) ∀a ∈ A.
Definition 2 We will say that X_t is a Markov chain with partition L if this partition is
the one defined by the equivalence relationship ∼_p introduced in Definition 1.
The set of parameters for a Markov chain over the alphabet A with partition L can
be denoted by {P(a|L) : a ∈ A, L ∈ L }. If we know the equivalence relationship for
a given Markov chain, then we need (|A| − 1) transition probabilities for each part of
the partition, to specify the model. As a consequence, the total number of parameters is
|L|(|A| − 1). To choose a model in this family in a consistent way (see [4]), we can use
the distance d_n, where n is the sample size. Define
N(s) = |{t : o < t ≤ n, X_{t−o}^{t−1} = s}| and
N(s, a) = |{t : o < t ≤ n, X_{t−o}^{t−1} = s, X_t = a}|.
Definition 3 Let n be the sample size. For any s, r ∈ S,

d_n(s, r) = (2 / ((|A| − 1) ln(n))) ∑_{a∈A} [ N(s, a) ln(N(s, a)/N(s)) + N(r, a) ln(N(r, a)/N(r))
− (N(s, a) + N(r, a)) ln((N(s, a) + N(r, a))/(N(s) + N(r))) ].
d_n can be generalized to subsets of S, and it has the property of being equivalent to the
BIC criterion for deciding whether s ∼_p r for any s, r ∈ S (see [4]).
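A minimal sketch of the distance d_n of Definition 3, assuming states are represented as tuples, counts are taken from a single observed sequence, and the convention 0 ln 0 = 0 is used (function names are ours):

```python
import math
from collections import Counter

def transition_counts(sample, order):
    """Count N(s) and N(s, a) from a sequence, for context length `order`."""
    Ns, Nsa = Counter(), Counter()
    for t in range(order, len(sample)):
        s = tuple(sample[t - order:t])
        Ns[s] += 1
        Nsa[(s, sample[t])] += 1
    return Ns, Nsa

def d_n(s, r, Ns, Nsa, alphabet, n):
    """The distance of Definition 3 between states s and r (0 ln 0 = 0)."""
    total = 0.0
    for a in alphabet:
        nsa, nra = Nsa[(s, a)], Nsa[(r, a)]
        if nsa > 0:
            total += nsa * math.log(nsa / Ns[s])
        if nra > 0:
            total += nra * math.log(nra / Ns[r])
        if nsa + nra > 0:
            total -= (nsa + nra) * math.log((nsa + nra) / (Ns[s] + Ns[r]))
    return 2.0 * total / ((len(alphabet) - 1) * math.log(n))
```

By construction d_n(s, s) = 0, while states with very different empirical transition probabilities give large values of d_n.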
MIXED MARKOV PARTITIONS PROCEDURE
Let X_t be the state of the k-dimensional Markov process at time t, X_t = (X(1)_t, ..., X(k)_t),
where X(i)_t ∈ B is the state of the i-th source at time t for i = 1, ..., k; X_t ∈ A, where
A = B^k and B is the finite alphabet of the one-dimensional marginals. We will assume
that for 1 ≤ i ≤ k, X(i)_t is an order om Markov chain, with om < ∞. The marginal
state space is B^om. For each s ∈ B^om and b ∈ B, we denote the marginal conditional
probability for the i-th coordinate by P_i(b|s) = Prob(X(i)_t = b | X(i)_{t−om}^{t−1} = s).
The first step of the procedure is to use a proportion of the dataset (say half) to fit a
PMM to the multivariate process X_t with a maximum order equal to oc, obtaining L^{oc},
the partition of A^{oc} corresponding to the fitted model.
The partition L^{oc} is then extended to a partition of A^{om}, denoted by P_c, in the following
way. If L^{oc} = {L^{oc}_1, ..., L^{oc}_{m_c}}, then

P_c = {L^c_1, ..., L^c_{m_c}}, with L^c_j = ∪_{s ∈ L^{oc}_j} {w.s : w ∈ A^{om−oc}}, 1 ≤ j ≤ m_c,

where "." denotes the concatenation of strings.
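The extension of L^{oc} to P_c can be sketched as follows; this is a hypothetical illustration (names and representations are ours) in which states are tuples over the joint alphabet A and concatenation is tuple addition:

```python
from itertools import product

def extend_partition(parts_oc, A, om, oc):
    """Extend a partition of A^oc to A^om: each part L_j^{oc} becomes
    {w.s : w in A^(om-oc), s in L_j^{oc}}, with w.s the concatenation."""
    prefixes = list(product(A, repeat=om - oc))
    return [{w + s for w in prefixes for s in part} for part in parts_oc]

# Example with A = {0, 1}, oc = 1, om = 2: each part of A^1 is lifted to A^2.
parts = [{(0,)}, {(1,)}]
extended = extend_partition(parts, [0, 1], 2, 1)
# extended[0] = {(0, 0), (1, 0)}, extended[1] = {(0, 1), (1, 1)}
```

Each extended part lumps together all length-om strings sharing an equivalent length-oc suffix, so the extended partition carries exactly the information of the order-oc model.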
In the second step we divide the remaining proportion of the data into k independent subsets
of equal length. We fit a PMM to each marginal process using the corresponding subset
of the dataset. Call L^i = {L^i_1, ..., L^i_{m_i}} the partition of B^om corresponding to the model
fitted to the marginal process X(i)_t.
From the collection of partitions {L^1, ..., L^k} define the following partition of A^om:

P_m = {L^1_{j_1} × ... × L^k_{j_k} : 1 ≤ j_1 ≤ m_1, ..., 1 ≤ j_k ≤ m_k}.
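A small sketch of how P_m could be formed from the marginal partitions; representations and names are our own assumptions, with each joint state stored as a tuple of its k marginal strings:

```python
from itertools import product

def product_partition(marginal_partitions):
    """Build P_m: all products L^1_{j1} x ... x L^k_{jk} over one part
    chosen from each marginal partition."""
    return [set(product(*choice))
            for choice in product(*marginal_partitions)]

# Example: two binary marginals of order 1.
L1 = [{(0,)}, {(1,)}]        # first marginal distinguishes its two contexts
L2 = [{(0,), (1,)}]          # second marginal lumps both contexts together
Pm = product_partition([L1, L2])
# Pm has 2 * 1 = 2 parts covering all 4 joint states (s(1), s(2)).
```

The number of parts of P_m is m_1 · ... · m_k, which can be far smaller than |A|^om when the marginal partitions are coarse.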
The last step is to build a partition P of A^om combining P_c and P_m. P is the refinement
of the partitions P_m and P_c corresponding to the following equivalence relationship
in A^om:

s ∼ r if there exist parts L ∈ P_m and L′ ∈ P_c such that s, r ∈ L ∩ L′.

This means that two states s and r belong to the same part of P if and only if they
belong to the same part of both P_m and P_c.
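The common refinement above amounts to intersecting every part of P_m with every part of P_c; a minimal sketch (names are ours), with states written as strings for brevity:

```python
def refine(Pm, Pc):
    """Common refinement of two partitions of the same state space:
    intersect every pair of parts and drop the empty intersections."""
    return [Lm & Lc for Lm in Pm for Lc in Pc if Lm & Lc]

# Example: P_m groups states by first letter, P_c by second letter.
Pm = [{"aa", "ab"}, {"ba", "bb"}]
Pc = [{"aa", "ba"}, {"ab", "bb"}]
P = refine(Pm, Pc)
# The refinement separates all four states into singleton parts.
```

In the worst case |P| = |P_m| · |P_c|, but whenever the two partitions agree on some states the refinement stays coarser than that bound.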
In the next section we show how the conditional probabilities are computed to
fully define the model parameters. We use the estimated probabilities obtained when
fitting the marginal models of order om and the joint model of order oc; all these
probabilities are combined using a copula.
TRANSITION PROBABILITIES ESTIMATION
Given s ∈ A^om and a ∈ A, we show how to compute P(a|s). Let w be the length-oc suffix
of s; that is, s = q.w for an appropriate string q. Consider the estimator P̂_c(a|w) of
the joint probability P_c(a|w) of the process of order oc, obtained in the first step of
our procedure.
For 1 ≤ i ≤ k, let s(i) ∈ B^om be the sequence formed by the i-th coordinates of the
elements of s. Denote by P̂_i(a(i)|s(i)) the estimate of the marginal probability of the
i-th process of order om, obtained in the second step of our procedure, where a(i) is
the i-th coordinate of a. Denote by F̂_i(a(i)|s(i)) the corresponding marginal
distribution function.
The two sets of probabilities are combined in the following way. First we define a
k-dimensional copula distribution Ĉ((u_1, ..., u_k)|w) from the joint probabilities P̂_c(·|w),
following, for example, the idea presented in [3]. There is more than one way of choosing
a copula distribution in the case of discrete random variables; see [6]. Finally, the copula
distribution is evaluated at the marginal distributions, as follows:

P̂(a|s) = Ĉ((F̂_1(a(1)|s(1)), ..., F̂_k(a(k)|s(k)))|w).
It is easy to check that for any L ∈ P, if s, r ∈ L then P̂(a|s) = P̂(a|r) for all a ∈ A. In the
approach proposed in this paper, the number of parameters to estimate is (|A| − 1)|L^{oc}| +
∑_{i=1}^{k} (|B| − 1)|L^i|, which gives a notion of the computational complexity of the procedure.
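The evaluation of the copula at the marginal distribution functions can be sketched as follows. This is a hedged illustration (names are ours): the pmf of a discrete vector is recovered from a copula by inclusion-exclusion over the marginal CDF values, and the independence copula C(u) = u_1 · ... · u_k stands in for the data-driven copula Ĉ(·|w) built from the order-oc joint model in the actual procedure:

```python
from itertools import product
import math

def independence_copula(u):
    """Placeholder copula C(u1, ..., uk) = u1 * ... * uk."""
    return math.prod(u)

def joint_pmf(a, cdfs, pmfs, copula):
    """P(a) by inclusion-exclusion over the copula evaluated at the
    marginal CDF values F_i(a_i) and F_i(a_i^-) = F_i(a_i) - p_i(a_i)."""
    k = len(a)
    total = 0.0
    for eps in product((0, 1), repeat=k):
        u = [cdfs[i](a[i]) - eps[i] * pmfs[i](a[i]) for i in range(k)]
        total += (-1) ** sum(eps) * copula(u)
    return total

# Example: two binary marginals combined with the independence copula.
pmfs = [{0: 0.3, 1: 0.7}.get, {0: 0.6, 1: 0.4}.get]
cdfs = [lambda b: 0.3 if b == 0 else 1.0,
        lambda b: 0.6 if b == 0 else 1.0]
p = joint_pmf((1, 0), cdfs, pmfs, independence_copula)  # 0.7 * 0.6 = 0.42
```

Replacing `independence_copula` with a copula estimated from P̂_c(·|w), as in [3], injects the dependence recovered in the first step while keeping the order-om marginals.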
CONCLUSION
In this paper, we present a procedure to fit an approximate PMM to a multivariate process
when the dataset is not large enough to apply the usual methodology. The procedure
combines individual PMMs for the marginal processes with a PMM for the joint
process, the latter with an order smaller than the real one. We show how to combine the
corresponding partitions into a single partition and how to combine the probabilities
of all the models into a single set of transition probabilities for the fitted model. This
methodology can be modified to be used with other families of Markov models, such
as the fixed and variable length Markov chains, for which there exist several model
selection methods (see [7], [1] and [2]).
ACKNOWLEDGMENTS
This article was produced as part of the activities of the FAPESP Center for Neuromathematics (grant 2013/07699-0, S. Paulo Research Foundation). The authors acknowledge the
support provided by the USP project "Mathematics, computation, language and the brain"
and the FAPESP project "Portuguese in time and space: linguistic contact, grammars in
competition and parametric change."
REFERENCES
1. Csiszár, I. and Talata, Z., "Context tree estimation for not necessarily finite memory processes, via
BIC and MDL", IEEE Trans. Inform. Theory 52, 1007–1016, 2006.
2. Galves, A., Galves, C., Garcia, J. E., Garcia, N. L. and Leonardi, F., "Context tree selection and
linguistic rhythm retrieval from written texts", Annals of Applied Statistics 6(1), 186–209, 2012.
3. Garcia, J. E. and Fernandez, M., "Copula based model correction for bivariate Bernoulli financial
series", AIP Conference Proceedings 1558, 1487–1490, 2013.
4. Garcia, J. E. and Gonzalez-Lopez, V. A., "Minimal Markov Models", arXiv:1002.0729v1, 2010.
5. Garcia, J. E., Gonzalez-Lopez, V. A. and Viola, M. L. L., "Robust model selection and the statistical
classification of languages", AIP Conference Proceedings 1490, 160–170, 2012.
6. Joe, H., Multivariate Models and Dependence Concepts (Vol. 73), CRC Press, 1997.
7. Rissanen, J., "A universal data compression system", IEEE Trans. Inform. Theory 29(5), 656–664,
1983.