STUDY ON INFORMATION THEORY: CONNECTION
TO CONTROL THEORY, APPROACH AND ANALYSIS
FOR COMPUTATION
by
WANCHAT THEERANAEW
Submitted in partial fulfillment of the requirements
For the degree of Doctor of Philosophy
Dissertation Advisor: Dr. Kenneth A. Loparo
Department of Electrical Engineering & Computer Science
CASE WESTERN RESERVE UNIVERSITY
January 2015
CASE WESTERN RESERVE UNIVERSITY
SCHOOL OF GRADUATE STUDIES
We hereby approve the dissertation of
Wanchat Theeranaew
candidate for the Doctor of Philosophy degree *
Committee Chair:
Kenneth A. Loparo, PhD
Dissertation Advisor,
Professor,
Department of Electrical Engineering & Computer Science
Committee:
Vira Chankong, PhD
Associate Professor,
Department of Electrical Engineering & Computer Science
Committee:
Marc Buchner, PhD
Associate Professor,
Department of Electrical Engineering & Computer Science
Committee:
Richard Kolacinski, PhD
Research Associate Professor,
Department of Mechanical and Aerospace Engineering
OCT 13, 2014
*We also certify that written approval has been obtained for any proprietary
material contained therein.
Contents

1 Introduction

2 Basic Concept of Information Theory
  2.1 Entropy and Differential Entropy
  2.2 Mutual Information

3 Connection between Information Theory and Control Theory
  3.1 Literature review for state estimation and feedback control in an information theory context
  3.2 Information Theory for General Systems
  3.3 Connection between Information Theory and Control Theory in Linear Systems
    3.3.1 Controllability and Observability in Linear Systems under an Information Theory Framework
    3.3.2 Computation of Some Information Measures in Linear Systems

4 Computation of Mutual Information: Continuous-Value Representation using Hidden Markov Model
  4.1 Hidden Markov Model with Gaussian Emission
    4.1.1 Parameter estimation for Hidden Markov Model with Gaussian Emission
    4.1.2 Entropy and Mutual Information of Data under Hidden Markov Model with Gaussian Emission
  4.2 Entropy and Mutual Information of Gaussian Mixtures
    4.2.1 Review of Upper and Lower Bound Computations for Entropy of a Gaussian Mixture
    4.2.2 Improvement of Upper and Lower Bound Computations for Entropy of a Gaussian Mixture
    4.2.3 Simulations and Results for Upper and Lower Bound Computations for Entropy of a Gaussian Mixture
  4.3 Mutual Information of Random Processes using Hidden Markov Models
    4.3.1 Model order selection for Hidden Markov Models
    4.3.2 Simulations and Discussion for Computation of Mutual Information using Hidden Markov Models

5 Computation of Mutual Information: Discrete-Value Representation
  5.1 Investigation of Shannon Mutual Information: Theoretical Perspective
    5.1.1 Mutual Information and Quantization
    5.1.2 Shannon Mutual Information and Independent Random Processes
    5.1.3 Shannon Mutual Information for Continuous Random Variables
  5.2 Investigation of Shannon Mutual Information: Computational Perspective
    5.2.1 Quantization Methods for Computing Shannon Mutual Information
    5.2.2 Investigation of the Effects of Quantization Methods and Number of Quantized States on Shannon Mutual Information
    5.2.3 Comparative Study of Modifications of Shannon Mutual Information
    5.2.4 Additional Study of Mutual Information on Jump Systems and Systems with Varying Parameters

6 Conclusions

Appendices

A Alternative Information Measure and Relationship to Mutual Information

Bibliography
List of Figures

4.1 Upper and lower bounds of entropy for Gaussian mixtures of the form 0.2(N(0, 1) + N(X, 0.8) + N(0.75, 0.6) + N(1.5, 0.4)) + 0.1(N(2, 0.2) + N(2.2, 0.4)) where X ∈ {0, 0.05, . . . , 20}.
4.2 Upper and lower bounds of entropy for Gaussian mixtures of the form 0.2(N(0, 0.8) + N(0.5, 0.8) + N(1, 0.8) + N(1.5, 0.8) + N(X, 0.8)) where X ∈ {0, 0.05, . . . , 20}.
4.3 Upper and lower bounds of entropy for Gaussian mixtures of the form 0.2(N(0, X) + N(0.5, X) + N(1, X) + N(1.5, X) + N(2, X)) where X ∈ {0.01, 0.02, . . . , 2}.
4.4 Upper and lower bounds of entropy for Gaussian mixtures of the form Y(N(0, 0.6) + N(0.5, 0.6) + N(1, 0.6) + N(1.5, 0.6)) + XN(2, 0.6) where X ∈ {0.01, 0.02, . . . , 0.99} and Y = (1 − X)/4.
4.5 Upper and lower bounds of entropy for Gaussian mixtures of the form Y(N(0, 0.6) + N(0.5, 0.6) + N(1, 0.6) + N(1.5, 0.6)) + XN(4, 0.6) where X ∈ {0.01, 0.02, . . . , 0.99} and Y = (1 − X)/4.
4.6 Upper and lower bounds of entropy for Gaussian mixtures of the form Y(N(0, 0.6) + N(0.5, 0.6) + N(1, 0.6) + N(1.5, 0.6)) + XN(6, 0.6) where X ∈ {0.01, 0.02, . . . , 0.99} and Y = (1 − X)/4.
4.7 The flow chart for hidden Markov model construction.
4.8 The data for the Hénon system, in which each coupling strength consists of 100 realizations, with 1,000 data points used to construct the HMM.
5.1 Estimation of (Shannon) mutual information for coupled Hénon maps with different coupling strengths and numbers of bins. The top figure uses equal-space partitioning and the bottom figure uses equal-frequency partitioning.
5.2 Estimation of (Shannon) mutual information for coupled Lorenz systems with different coupling strengths and numbers of bins. The top figure uses equal-space partitioning and the bottom figure uses equal-frequency partitioning.
5.3 Estimation of (Shannon) mutual information for coupled Rossler systems with different coupling strengths and numbers of bins. The top figure uses equal-space partitioning and the bottom figure uses equal-frequency partitioning.
5.4 Estimation of Shannon mutual information for coupled Hénon maps with different coupling strengths using long time-series data and averaging over different windows. The top figure contains windows with overlap size 0 - 3,000; the bottom figure contains windows with overlap size 4,000 - 7,000.
5.5 Estimation of Shannon mutual information for coupled Lorenz systems with different coupling strengths using long time-series data and averaging over different windows. The top figure contains windows with overlap size 0 - 3,000; the bottom figure contains windows with overlap size 4,000 - 7,000.
5.6 Estimation of Shannon mutual information for coupled Rossler systems with different coupling strengths using long time-series data and averaging over different windows. The top figure contains windows with overlap size 0 - 3,000; the bottom figure contains windows with overlap size 4,000 - 7,000.
5.7 Estimation of Shannon mutual information for coupled Hénon maps with varying coupling strength Ct: Ct = 0.8 for 0 ≤ t ≤ 150,000, Ct = 0.4 for 150,000 < t ≤ 300,000 and Ct = 0 for 300,000 < t ≤ 500,000.
5.8 Estimation of Shannon mutual information for coupled Lorenz systems with varying coupling strength Ct: Ct = 2.0 for 0 ≤ t ≤ 150,000, Ct = 1.0 for 150,000 < t ≤ 300,000 and Ct = 0.0 for 300,000 < t ≤ 500,000.
5.9 Estimation of Shannon mutual information for coupled Rossler systems with varying coupling strength Ct: Ct = 2.0 for 0 ≤ t ≤ 150,000, Ct = 1.0 for 150,000 < t ≤ 300,000 and Ct = 0.0 for 300,000 < t ≤ 500,000.
5.10 Estimation of Shannon mutual information for coupled Hénon maps with varying coupling strength Ct = 1 − 1/(1 + e^{−α(t−250,000)}) where α = 0.00005.
5.11 Estimation of Shannon mutual information for coupled Lorenz systems with varying coupling strength Ct = 1 − 1/(1 + e^{−α(t−250,000)}) where α = 0.00005.
5.12 Estimation of Shannon mutual information for coupled Rossler systems with varying coupling strength Ct = 1 − 1/(1 + e^{−α(t−250,000)}) where α = 0.00005.
Acknowledgements
First of all, I would like to express my deep gratitude to my adviser, Professor Kenneth A. Loparo, for providing me with his support, valuable advice and encouragement to complete this research. His guidance in the areas of random signals, stochastic processes and the construction of models to replicate real-world data was critical for the completion of this research. In addition, he is a person who inspires and motivates other people, including me, through his work ethic, thoroughness and kindness. I believe he is one of the best role models in my life.
I am also indebted to Professor Vira Chankong for providing me with his advice on constructive problem solving and his guidance on many aspects of life. I am also grateful to him and his wife, Mrs. Pornpituk Chankong, for their support and care during my study in the United States. Without their support, my daily living and academic life would have been much more difficult. Professor Vira is also one of the best role models in my life.
I would like to thank Professor Richard Kolacinski for sharing his working experience to broaden my perspective on work outside academia, which will be useful in the future. He also greatly helped in revising this thesis.
I would like to thank all of my friends in Cleveland who made my life in Cleveland more enjoyable, although I cannot mention all of them. I am thankful for my close friends in Cleveland, Thunyaseth Sethaput and Assistant Professor Adirak Kanchanaharuthai. Their discussions, advice and friendship are an important part of my academic life in Cleveland. Their help in times of need made the hardship more bearable.
Finally, I would like to express my sincere gratitude to my parents, who have provided me with all of their support since I was born. Their patience in teaching and supporting me is the greatest factor in every aspect of my life. My special thanks go to my little sister, who has taken an important role in caring for our parents; otherwise, I would not have been able to focus on my study and research as I have. Her constant conversation makes me feel comfortable, particularly in times of need.
I would also like to acknowledge the Department of Energy (DOE) for the funding support of this research.
Study on Information Theory: Connection To Control Theory,
Approach and Analysis for Computation
Abstract
by
WANCHAT THEERANAEW
This thesis consists of various studies in information theory, including its connection with control theory and the computational aspects of information measures. The first part of the research
investigates the connection between control theory and information theory. This part extends previous results that mainly focused on this connection in the context of state estimation and feedback
control. For linear systems, mutual information, along with the concepts of controllability and observability, is used to derive a tight connection between control theory and information theory. For
nonlinear systems, a weaker statement of this connection is established. Some explicit calculations
for linear systems and interesting observations about these calculations are presented. The second
part investigates the computation of mutual information. An innovative method to compute the
mutual information between two collections of time series data based on a Hidden Markov Model
(HMM) is proposed. For continuous-valued data, a HMM with Gaussian emission is used to estimate the underlying dynamics of the original data. Mutual information is computed based on the
approximate dynamics provided by the HMM. This work improves the estimation of the upper and
lower bounds of entropy for Gaussian mixtures, which is one of the key components in this proposed
method. These improved bounds are shown to be robust compared to existing methods in all of the synthetic data experiments conducted. In addition, this research includes the study of the computation of Shannon mutual information, in which the strong assumption of independent and identically distributed (i.i.d.) data is imposed. This research shows that even if this assumption is violated, the results possess a meaningful interpretation. The study of the computation of Shannon mutual information for continuous-valued random variables is included in this research. Three
coupled chaotic systems are used as exemplars to show that the computation of normalized mutual
information is relatively insensitive to the number of quantized states, although quantization resolution does significantly affect the unnormalized mutual information. The same coupled chaotic
systems are used to show that the quantization method also does not significantly affect the normal-
ized mutual information. Simulations from these chaotic systems also show that normalized Shannon
mutual information can be used to detect the different (fixed) coupling strengths between two subsystems. Two modified information measures, which enforce sensitivity to time permutation, are
compared on these three systems. By using piecewise constant coupling and monotonically decaying
coupling, the simulation results show that normalized mutual information can track time-varying
changes in coupling strength for these chaotic systems to a certain degree.
CHAPTER 1

Introduction
In 1928, the word ‘information’ was first formally introduced by Hartley [16]. In this work, Hartley introduced a measure of uncertainty, the Hartley function or Hartley entropy, applied to communication systems. This measure is directly related to the number of possible
symbol sequences transmitted via a communication channel. In 1948, Shannon extended the concept
of information and introduced the concept of entropy using a probabilistic framework [45]. The main
focus of the work was on discrete-valued random variables that are associated with a message to
be sent through a communication channel. In the same work, the generalization of entropy into
differential entropy for continuous-valued random variables was also presented. This concept has been further extended, motivated by the fact that uncertainty is associated with most real-world problems.
One interesting and useful measure developed from entropy/differential entropy is called mutual
information. Mutual information is essentially a measure of the statistical dependence between two sets of random variables. Because mutual information can capture relationships beyond linear ones in a stochastic context, such information measures are popular in many fields of study. The concept of entropy
and mutual information from Shannon was generalized by Rényi [38] in 1961. This work extended
the existing entropy measure to the generalized entropy measure, Rényi entropy. Similar to Shannon
entropy, Rényi entropy is also directly computed from the probability mass (or density) function.
However, the computation of Rényi entropy does not involve the expectation operator. In addition,
it includes a weighting factor as an exponent. One can consider both Hartley and Shannon entropy
as special cases of Rényi entropy. Similar to the Shannon framework, Rényi mutual information can
be computed based on the concept of Rényi divergence that measures the dissimilarity between two
probability distributions.
In 1959, Sinai defined the entropy of a space partition [47, 48] and developed Kolmogorov-Sinai
entropy (KS entropy) of a dynamical system. Adler, Konheim and McAndrew extended this concept
and introduced topological entropy [1] in 1965. A slightly different definition of topological entropy
is given by Bowen and Dinaburg [3, 7]. Both Kolmogorov-Sinai entropy and topological entropy
are directly related to the complexity of deterministic systems. These concepts were limited to
theoretical investigations until the introduction of computational methods for their calculation by
Grassberger and Procaccia [15] in 1983. A similar concept was developed in the same year by
Takens [50]. Further development on the computation of entropy was done by Cohen and Procaccia
[5] in 1985; and by Olofsen, Degoede and Heijungs [31] in 1992. Since computational methods are
available for Kolmogorov-Sinai entropy, it is used as a complexity measure for deterministic chaotic
systems [9, 26]. Based on the ideas presented in [9, 15, 50], Pincus introduced another measure of
complexity called approximate entropy (ApEn) [33]. One interesting point that emerged in this
work is that ApEn is identical to entropy in information theory for Markov processes. This shows
that, although the study of complexity emerged from different fields of study, it converged to a
similar quantity and interpretation. Despite its different origin, the study of complexity in chaotic systems eventually coincides with the study of information in information theory.
In thermodynamics, the term ‘entropy’ emerges from the study of the Carnot cycle by Clausius [4].
In this context, entropy was used as a state variable for heat transfer in thermodynamically reversible
processes and is defined in terms of the change of state. Boltzmann generalized the term ‘entropy’
as a measure of the number of possible microstates corresponding to a given macrostate in 1872.
In his work, entropy is equal to the product of the Boltzmann constant and the log of the number
of microstates consistent with the given macrostate. In 1878, Gibbs introduced a probabilistic
interpretation of Boltzmann’s result (generalized Boltzmann entropy) [12], known as Gibbs entropy.
This clearly intersects with Shannon entropy and the difference between these two quantities is only
that Shannon entropy does not contain the Boltzmann constant. In 1988, Tsallis generalized Gibbs
entropy by introducing an additional parameter. When this parameter is equal to one, Tsallis entropy
is identical to Gibbs entropy. His concept is similar to the work of Rényi in an information theory
context. The development of entropy in thermodynamics is similar to the developments in
information theory.
As previously mentioned, the study of uncertainty and the development of uncertainty measures
is not limited to information theoretic frameworks. Regardless, information theoretic frameworks
are widely used in many areas of study because of their association with the commonly used probabilistic framework. In this work, the word ‘entropy’ will refer to entropy in an information theory
context unless another entropy is explicitly mentioned. Due to the fact that entropy is a measure
of uncertainty, the amount of uncertainty of some quantities, e.g., noise or error, can also be used
as a measurement in some applications. For example, low entropy of error is clearly associated with
accuracy. In addition, there are many applications in which one is interested in quantifying the
connectivity between two quantities that can be measured by mutual information. Many researchers
who are working in the fields of estimation and feedback control are also interested in the information
theoretic framework for their investigations. For instance, in the area of estimation, the reduction
of estimation error is the main focus. Thus, the entropy of the estimation error can be used as an
indicator of performance of an estimator. Estimation is always based on observations, so the mutual information among the desired unknown quantities, the estimated values, the estimation errors
and the observations can be used to determine the performance of the estimators as well. Many
researchers [10, 11, 22, 41, 51–53, 55] have followed this line of thought. Their research studies are
extremely valuable from a theoretical perspective, and demonstrate the tight connection between
information theory and control theory.
In 1976, Conant constructed a framework by applying information theory on the intrinsic dynamics of complex systems [6]. In this work, the system is decomposed into a set of subsystems on
the basis of information flow. These subsystems are categorized into internal subsystems and output
connected subsystems. In addition, the inputs to and outputs from the system are considered as
another system known as the environment. This description views complex systems as a communication network. Conant described dynamical systems as a flow of information from the input to
output. From this concept, Conant defined the Partition Law of Information Rates (PLIR) under
an information theoretic framework. This law described the total activity of the systems as the
sum of throughput rate, blockage rate, coordination rate, and noise rate. The throughput rate is an
input-output flow rate that measures the connection strength between the input and output of the
system. The blockage rate is the rate at which information from the input that is irrelevant to the
output is blocked within the system. The coordination rate measures the total internal communication among subsystems that the system requires to perform its function(s). Finally, the noise rate is
the information generated by the system that is independent from an input. The underlying mathematics of these quantities provide interpretations of different information measures and how they
are related to each other. This work is extremely interesting because it shows that complex systems
can be explained by the concept of communication within an information theoretic framework.
The work of Conant showed that systems can be viewed as communication channels using an
information theoretic framework, and the work presented in [10, 11, 22, 41, 51–53, 55] focuses on
estimation and control problems that are known to be directly related to the structure and underlying
properties of the system being studied. It is clear that information measures should be able to reflect
the intrinsic structure of systems. For these reasons, this work investigates the connection between
an information theoretic framework and the structure of the system from a systems perspective. The
objective of this research is to reinforce previous research results, so the properties of the systems
of interest are consistent with this objective. As a result, this work is focused on the structure of
systems that are strongly related to estimation and control. Although a primary goal of this work
is to show the connection between these two fields of study for general systems, analytical solutions
for nonlinear systems are extremely difficult to obtain. In addition, limitations from the use of
continuous-valued representations of the data also add complexity to the problem. Therefore, the
main focus in this research is on relating the concepts of controllability and observability in linear
systems to information theory concepts such as entropy and mutual information. Some closed form
representations of information measures for linear systems are given in this work.
The concept of information theory cannot be widely used without a method to compute the
related measures. For discrete-valued random variables, the computation is straightforward (and
inexpensive in the case of finite alphabets). However, the underlying probability mass function
is needed. With a finite set of observation data, the mismatch between the actual values and
estimated values cannot be avoided [14, 17, 18, 39]. Correction terms have been proposed to reduce
the estimation error for entropy [14, 17, 39] and mutual information [39]. When the data is continuous-valued, the problem becomes much more complicated. If the underlying density function is known,
the computation of differential entropy is simply an integration problem. Otherwise, one needs to
either estimate the underlying distribution by using a numerical approximation such as a histogram
or assume the form of the distribution, e.g. using a kernel method [49]. Although the latter method
is said to be better [49], numerical integration of the estimated distribution can be burdensome. If a
histogram is used to compute information theoretic measures, the correction terms in [14, 17, 39] can also be applied. For histogram-based methods, Kraskov et al. proposed an alternative to directly
counting the frequency of data inside each bin for computing entropy and mutual information [25].
However, all of the aforementioned methods assume that the data consists of independent and
identically distributed (i.i.d.) random variables. To our knowledge, this assumption is always imposed on the computation of information measures using these methods. If the data is assumed
to be stationary, entropy in the sense of averaging uncertainty over time can be computed by the
methods presented in [13, 23, 32, 35, 46, 54]. This average uncertainty, called the entropy rate, can also be used to compute rates of information. Although this is, in fact, the measure of interest in our work, the imposed stationarity assumption is too strong. Moreover, it is difficult to find a proper
interpretation when stationarity is not satisfied. Note, a random process defined as a sequence of
i.i.d. random variables is stationary and i.i.d. is a stricter assumption than stationarity. However, it
can be shown that the basic histogram method can give meaningful results even if the i.i.d. assumption
is violated. Thus, an investigation of this simple method is one of the main aspects of this research.
It is well-known that the computation of mutual information by the histogram method is greatly
influenced by the resolution and boundaries of the bins. We show that, by normalizing the information measure, these two factors do not significantly affect the normalized mutual information.
Both synthetic and real world data are usually generated from some underlying dynamics with
unknown properties. The generator of this underlying data can be taken into account when computing the information measures. Specifically, if the dynamics of the underlying data generator is known
and the unknown factors can be modeled as aggregated random variables, the analytical representation of differential entropy and mutual information is straightforward. In general, the dynamics and
related factors are either partially or completely unknown. Thus, an interesting question is whether one or both of them can be estimated or approximated. This line of thought is presented in this work and
the development of innovative methods for computing mutual information is investigated. The first
step in this approach is to estimate the underlying dynamics via Hidden Markov Models (HMMs).
By using a doubly stochastic modeling approach, the modeling flexibility to handle the uncertainty
that comes from the unknown dynamics and related factors of the actual data are obtained. Then,
the mutual information of the data is estimated using the approximated dynamics. Due to the fact
that continuous-valued time-series data are commonly encountered in engineering applications, this
continuous nature is addressed directly and a HMM whose hidden states emit continuous-valued
random variables is selected. Because any probability distribution can be approximated by a Gaussian mixture, Gaussian emission HMM are chosen in this research. As a result, the computation
of mutual information for a Gaussian mixture becomes one of the key aspects of this approach.
The study and improvement of the computation of entropy and mutual information for Gaussian
mixtures is also included in this thesis. The final results, including a discussion of difficulties and
extensions of previous work are included.
This dissertation is organized as follows. Chapter 2 reviews the fundamental concepts of information theory including entropy, differential entropy, mutual information, etc. Important identities,
equalities and relationships related to measures in information theory that are used in this work
are also included. Chapter 3 presents an investigation of the connection between control theory
and information theory using mutual information. A theoretical framework that establishes the
connection between these two fields for linear systems is the main focus of the chapter. Some analytical results for related information measures for linear systems are also given and discussed in
the chapter. The implementation of mutual information estimation using a Hidden Markov Model
(HMM) will be discussed in chapter 4. This chapter also introduces the basic concepts of HMMs. The
existing computation of upper and lower bounds of entropy for Gaussian mixtures and our approach
to improve these bounds are also presented in this chapter. Finally, the computational results and
issues associated with this method are discussed in this chapter. Chapter 5 presents research on
information measures for discrete-valued and discretized continuous-valued random variables. Shannon entropy and mutual information are studied in terms of their validity for sequences of non-i.i.d.
random variables. Quantization and its effect on mutual information is also studied in this chapter.
This chapter also contains the study of the modification of information measures to create time
permutation sensitivity using coupled chaotic systems as exemplars. Finally, chapter 6 contains the
conclusions and suggestions for future work.
CHAPTER 2

Basic Concept of Information Theory
This chapter discusses basic concepts of information theory under the assumption that readers are
familiar with the basic concepts of probability and random processes. The foundation and some
important equalities and inequalities are included in this chapter. Due to the fact that information
theory is well-studied, many details are omitted. Interested readers are encouraged to review the
book [29] for further details on the fundamentals of information theory.
2.1 Entropy and Differential Entropy
Let X be a discrete random variable with probability mass function p^D_X(x) = Prob(X = x). Without loss of generality, assume that X ∈ {1, 2, . . . , n_X}. Entropy, H(X), is a measure of the uncertainty of X defined by

H(X) = −∑_{i=1}^{n_X} p^D_X(i) log p^D_X(i).    (2.1)
It is well known that

H(X) ≥ 0.    (2.2)
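For concreteness, the definition in (2.1) can be evaluated directly from a probability mass function. The following Python sketch is purely illustrative; the pmf values are arbitrary and are not taken from the text.

```python
import numpy as np

def entropy(pmf, base=np.e):
    """Shannon entropy of a discrete pmf per (2.1); zero-probability terms are skipped."""
    p = np.asarray(pmf, dtype=float)
    p = p[p > 0]                      # 0 * log 0 is treated as 0
    return -np.sum(p * np.log(p)) / np.log(base)

p = [0.5, 0.25, 0.25]                 # example pmf on {1, 2, 3}
print(entropy(p))                     # about 1.0397 nats
print(entropy(p, base=2))             # 1.5 bits
print(entropy([1.0, 0.0]) >= 0)       # entropy is non-negative, cf. (2.2)
```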
The inequality (2.2) will be important in later chapters, and in this section it is shown that the analog
of entropy for continuous-valued random variables does not have this property. For any continuous
random variable Z with probability density function pZ (z), a generalized measure of uncertainty of
Z is called differential entropy, h(Z), and is defined by

h(Z) = −∫ p_Z(z) log p_Z(z) dz.    (2.3)
To observe the relationship between entropy and differential entropy, we define Z^Δ as a discrete random variable with probability mass function defined by

p^D_{Z^Δ}(i) = ∫_{iΔ}^{(i+1)Δ} p_Z(z) dz.
In other words, Z^Δ is a uniformly quantized version of the continuous random variable Z with step size Δ. The entropy of Z^Δ can be computed by

H(Z^Δ) = −∑_{i=−∞}^{∞} p^D_{Z^Δ}(i) log p^D_{Z^Δ}(i).

If the probability density function p_Z(z) is Riemann integrable, we have

H(Z^Δ) + log Δ → h(Z)  as  Δ → 0.    (2.4)
Equation (2.4) shows that differential entropy is not an “exact” generalization of entropy due to the additive term log Δ, which approaches −∞ as Δ → 0. Thus, unlike entropy, h(Z) can take any value in R and, in particular, need not be non-negative. The unboundedness of differential entropy creates considerable difficulty in derivations related to continuous random variables. For example, if Z is deterministic and represented as a continuous-valued random variable, h(Z) = −∞. From an application standpoint, it is not likely that the data yields h(Z) = ∞. In contrast, any degenerate random variable or vector always gives h(Z) = −∞, and it is worth noting that the latter case is common in practice.
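The limit in (2.4) can be illustrated numerically. The sketch below assumes Z ~ N(0, 1), for which h(Z) = ½ log(2πe) is known in closed form; the truncation range of the quantization grid is an arbitrary choice.

```python
import numpy as np
from scipy.stats import norm

def quantized_entropy(dist, delta, lo=-20.0, hi=20.0):
    """Entropy H(Z^Delta) of a uniformly quantized version of dist with bin width delta."""
    edges = np.arange(lo, hi + delta, delta)
    p = np.diff(dist.cdf(edges))      # p_i = P(i*delta <= Z < (i+1)*delta)
    p = p[p > 0]
    return -np.sum(p * np.log(p))

h_exact = 0.5 * np.log(2 * np.pi * np.e)          # h(Z) for Z ~ N(0, 1), in nats
for delta in (1.0, 0.1, 0.01):
    approx = quantized_entropy(norm(0, 1), delta) + np.log(delta)
    print(f"delta={delta:5.2f}  H(Z^D)+log D = {approx:.5f}  h(Z) = {h_exact:.5f}")
```

As Δ decreases, the printed values of H(Z^Δ) + log Δ approach h(Z), consistent with (2.4).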
For a pair of discrete random variables (X1, X2) with joint probability mass function p^D_{X1X2}(x1, x2), the joint entropy H(X1, X2) can be defined by

H(X1, X2) = −∑_{x1} ∑_{x2} p^D_{X1X2}(x1, x2) log p^D_{X1X2}(x1, x2).    (2.5)
This joint entropy H(X1, X2) measures the uncertainty of the random variables X1 and X2, and can also be interpreted as the uncertainty of the associated two-dimensional random vector. Similarly, for a pair of continuous random variables (Z1, Z2) with probability density p_{Z1Z2}(z1, z2), the joint differential entropy h(Z1, Z2) can be defined by

h(Z1, Z2) = −∫∫ p_{Z1Z2}(z1, z2) log p_{Z1Z2}(z1, z2) dz1 dz2.    (2.6)
The generalization to n random variables (or n dimensional random vector) is straightforward for
both the discrete and continuous cases.
For a pair of discrete random variables (X1, X2) with joint probability mass function p^D_{X1X2}(x1, x2) and conditional probability mass function p^D_{X1|X2}(x1, x2), the conditional entropy of X1 given X2 can be defined by

H(X1|X2) = −∑_{x1} ∑_{x2} p^D_{X1X2}(x1, x2) log p^D_{X1|X2}(x1, x2).    (2.7)
The result (2.7) can be interpreted as an average of the entropy of X1 given a specific value of X2 = x2. This average is computed over all values of x2 using the joint and conditional probability measures. Likewise, for continuous random variables (Z1, Z2) with probability density p_{Z1Z2}(z1, z2) and conditional density p_{Z1|Z2}(z1, z2), the conditional differential entropy of Z1 given Z2 can be defined by

h(Z1|Z2) = −∫∫ p_{Z1Z2}(z1, z2) log p_{Z1|Z2}(z1, z2) dz1 dz2.    (2.8)
With the definitions in (2.1), (2.5) and (2.7), the following important equalities hold:

H(X1, X2) = H(X1) + H(X2|X1) = H(X2) + H(X1|X2).    (2.9)

The equalities in (2.9) define a chain rule for entropy. The more general form of this chain rule can be stated as follows. For n random variables (X1, X2, . . . , Xn) with joint probability mass function p^D_{X1X2...Xn}(x1, x2, . . . , xn) we have

H(X1, X2, . . . , Xn) = ∑_{i=1}^{n} H(Xi|Xi−1, . . . , X1).    (2.10)
Likewise, differential entropy for continuous random variables (Z1, Z2, . . . , Zn) has a similar chain rule given by

h(Z1, Z2) = h(Z1) + h(Z2|Z1) = h(Z2) + h(Z1|Z2)    (2.11)

h(Z1, Z2, . . . , Zn) = ∑_{i=1}^{n} h(Zi|Zi−1, . . . , Z1).    (2.12)
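As a small numerical check of the chain rule (2.9), the sketch below computes H(X1), H(X1, X2) and the conditional entropy H(X2|X1) from its definition for an arbitrary 2 × 2 joint pmf.

```python
import numpy as np

def H(p):
    """Entropy of a pmf stored in an array (joint entropy if the array is 2-D)."""
    p = np.asarray(p, float).ravel()
    p = p[p > 0]
    return -np.sum(p * np.log(p))

# arbitrary joint pmf of (X1, X2) on {1, 2} x {1, 2}
p12 = np.array([[0.30, 0.20],
                [0.10, 0.40]])
p1 = p12.sum(axis=1)                       # marginal pmf of X1

# conditional entropy H(X2|X1) computed directly from its definition
H_cond = -sum(p12[i, j] * np.log(p12[i, j] / p1[i])
              for i in range(2) for j in range(2))

# chain rule (2.9): H(X1, X2) = H(X1) + H(X2|X1)
print(np.isclose(H(p12), H(p1) + H_cond))  # True
```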
2.2 Mutual Information
Information theory defines a quantity called relative entropy or Kullback-Leibler divergence that measures the closeness between the probability mass functions associated with discrete random variables. The relative entropy between two probability mass functions, p^D_X(x) and q^D_X(x), is defined by

D(p||q) = ∑_{x} p^D_X(x) log [ p^D_X(x) / q^D_X(x) ].    (2.13)
Similarly, for continuous random variables, the relative entropy between two density functions p_Z(z) and q_Z(z) can be defined by

D(p||q) = ∫ p_Z(z) log [ p_Z(z) / q_Z(z) ] dz.    (2.14)

When p^D_X(x) = q^D_X(x) and/or p_Z(z) = q_Z(z), the term inside the log is equal to 1 and D(p||q) = 0. This computation omits regions of probability zero, e.g., p^D_X(x) = 0 or p_Z(z) = 0. This relative entropy is also called Kullback-Leibler divergence, which has the same interpretation for both discrete and continuous random variables (or vectors). An important property of relative entropy is that

D(p||q) ≥ 0.    (2.15)

The inequality in (2.15) holds for both (2.13) and (2.14). It is worth mentioning that D(p||q) ≠ D(q||p). This is the reason that Kullback-Leibler divergence is not a distance measure.
Using the concept of Kullback-Leibler divergence, we can measure the divergence between a joint distribution and the product of its marginal distributions. This idea gives rise to the concept of mutual information, which measures the (statistical) dependence between two random variables. Strictly speaking, for two discrete random variables X1 and X2 with joint probability mass function p^D_{X1X2}(x1, x2) and marginal probability mass functions p^D_{X1}(x1) and p^D_{X2}(x2), the mutual information between these two random variables, I(X1; X2), is given by

I(X1; X2) = ∑_{x1} ∑_{x2} p^D_{X1X2}(x1, x2) log [ p^D_{X1X2}(x1, x2) / (p^D_{X1}(x1) p^D_{X2}(x2)) ].    (2.16)
Similarly, for two continuous random variables Z1 and Z2 with joint probability density function p_{Z1Z2}(z1, z2) and marginal probability density functions p_{Z1}(z1) and p_{Z2}(z2), the mutual information between these two random variables, I(Z1; Z2), is given by

I(Z1; Z2) = ∫∫ p_{Z1Z2}(z1, z2) log [ p_{Z1Z2}(z1, z2) / (p_{Z1}(z1) p_{Z2}(z2)) ] dz1 dz2.    (2.17)
From (2.15), it follows that

I(X1; X2) ≥ 0    (2.18)
I(Z1; Z2) ≥ 0.    (2.19)

Mutual information is equal to zero when the two random variables (or vectors) are independent, i.e., p^D_{X1X2}(x1, x2) = p^D_{X1}(x1) p^D_{X2}(x2) (or p_{Z1Z2}(z1, z2) = p_{Z1}(z1) p_{Z2}(z2)).
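A direct evaluation of (2.16) from a joint pmf is sketched below. The joint pmfs are arbitrary examples: one dependent pair, and one constructed as a product of marginals so that the mutual information is (numerically) zero.

```python
import numpy as np

def mutual_information(p12):
    """I(X1; X2) from a joint pmf array, following (2.16)."""
    p12 = np.asarray(p12, float)
    p1 = p12.sum(axis=1, keepdims=True)   # marginal of X1 (column vector)
    p2 = p12.sum(axis=0, keepdims=True)   # marginal of X2 (row vector)
    mask = p12 > 0
    return np.sum(p12[mask] * np.log(p12[mask] / (p1 @ p2)[mask]))

# dependent pair: mutual information is strictly positive
print(mutual_information([[0.4, 0.1],
                          [0.1, 0.4]]))

# independent pair (joint = product of marginals): result is zero up to round-off
p1, p2 = np.array([[0.6], [0.4]]), np.array([[0.3, 0.7]])
print(mutual_information(p1 @ p2))
```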
For discrete random variables, the following equalities hold:

I(X1; X2) = H(X1) − H(X1|X2) = H(X2) − H(X2|X1)    (2.20)
I(X1; X2) = H(X1) + H(X2) − H(X1, X2)    (2.21)
I(X1; X2) = I(X2; X1)    (2.22)
I(X; X) = H(X).    (2.23)

In a similar manner, the following equalities hold for continuous random variables:

I(Z1; Z2) = h(Z1) − h(Z1|Z2) = h(Z2) − h(Z2|Z1)    (2.24)
I(Z1; Z2) = h(Z1) + h(Z2) − h(Z1, Z2)    (2.25)
I(Z1; Z2) = I(Z2; Z1).    (2.26)

Observe that three of these equalities are “identical” for both discrete and continuous random variables. However, the last equality (2.23) does not hold for continuous random variables due to the fact that h(Z|Z) = −∞ because the conditional density is the Dirac delta distribution.
Similar to conditional entropy, the conditional mutual information between two discrete random variables X1 and X2 given another discrete random variable X3 can be computed by

I(X1; X2|X3) = E_{X1X2X3}{ log [ p^D_{X1X2|X3}(x1, x2, x3) / (p^D_{X1|X3}(x1, x3) p^D_{X2|X3}(x2, x3)) ] }    (2.27)

where E_{X1X2X3}{·} is the expectation operator defined with respect to the three random variables X1, X2 and X3.
The conditional mutual information also satisfies the following equalities:

I(X1; X2|X3) = H(X1|X3) − H(X1|X2, X3) = H(X2|X3) − H(X2|X1, X3)    (2.28)
I(X1; X2|X3) = H(X1|X3) + H(X2|X3) − H(X1, X2|X3)    (2.29)
I(X1; X2|X3) = I(X2; X1|X3)    (2.30)
I(X1; X1|X3) = H(X1|X3).    (2.31)
For continuous-valued random variables, conditional mutual information is given by

I(Z1; Z2|Z3) = E_{Z1Z2Z3}{ log [ p_{Z1Z2|Z3}(z1, z2, z3) / (p_{Z1|Z3}(z1, z3) p_{Z2|Z3}(z2, z3)) ] }.    (2.32)

Again, E_{Z1Z2Z3}{·} is the expectation operator defined with respect to the three random variables Z1, Z2 and Z3. In this case, conditional mutual information satisfies the following equalities:

I(Z1; Z2|Z3) = h(Z1|Z3) − h(Z1|Z2, Z3) = h(Z2|Z3) − h(Z2|Z1, Z3)    (2.33)
I(Z1; Z2|Z3) = h(Z1|Z3) + h(Z2|Z3) − h(Z1, Z2|Z3)    (2.34)
I(Z1; Z2|Z3) = I(Z2; Z1|Z3).    (2.35)

An equality analogous to (2.31) is missing because h(Z1|Z1, Z3) = −∞.
Similar to entropy, for n discrete random variables (X1, X2, . . . , Xn), a chain rule for mutual information is given by

I((X1, X2, . . . , Xn−1); Xn) = ∑_{i=1}^{n−1} I(Xi; Xn|Xi−1, . . . , X1).    (2.36)

Likewise, for n continuous random variables (Z1, Z2, . . . , Zn), we have

I((Z1, Z2, . . . , Zn−1); Zn) = ∑_{i=1}^{n−1} I(Zi; Zn|Zi−1, . . . , Z1).    (2.37)

For the special case of (2.37), with three random vectors Z1, Z2 and Z3, we have

I(Z1; (Z2, Z3)) = I(Z1; Z2|Z3) + I(Z1; Z3).    (2.38)
It is notable that a bijective mapping does not change the entropy of a discrete random variable; in other words, entropy is invariant under permutation and scaling. In contrast, a bijective mapping of continuous random variables generally changes the differential entropy. However, for two continuous random vectors X and Y and an invertible linear transformation A, the important equality [37]

I(X; Y) = I(AX; Y)    (2.39)

follows. The fact that mutual information is preserved under a nonsingular linear transformation of the data plays a key role in later chapters.
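The invariance in (2.39) can be illustrated in the jointly Gaussian case, where mutual information has the closed form I(X; Y) = ½ log[det(Σ_X) det(Σ_Y)/det(Σ)] with Σ the joint covariance of (X, Y). The covariance and the matrix A below are arbitrary choices used only for this check; they are not part of the original development.

```python
import numpy as np

def gaussian_mi(cov, nx):
    """I(X; Y) for jointly Gaussian (X, Y) with joint covariance cov; X is the first nx coordinates."""
    cx = cov[:nx, :nx]
    cy = cov[nx:, nx:]
    return 0.5 * np.log(np.linalg.det(cx) * np.linalg.det(cy) / np.linalg.det(cov))

rng = np.random.default_rng(0)
nx, ny = 2, 2
L = rng.standard_normal((nx + ny, nx + ny))
cov = L @ L.T + np.eye(nx + ny)          # a valid joint covariance for (X, Y)

A = np.array([[2.0, 1.0],
              [0.0, 3.0]])               # invertible transformation applied to X
T = np.block([[A, np.zeros((nx, ny))],
              [np.zeros((ny, nx)), np.eye(ny)]])
cov_AX = T @ cov @ T.T                   # joint covariance of (AX, Y)

print(gaussian_mi(cov, nx))              # I(X; Y)
print(gaussian_mi(cov_AX, nx))           # I(AX; Y), identical up to round-off
```

The transformation scales det(Σ_X) and det(Σ) by the same factor det(A)², so the ratio, and hence the mutual information, is unchanged.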
In addition to these properties, the following lemma is useful for deriving the main result in the
next chapter.
Lemma 2.1. For continuous-valued random vectors X, Y and Z with an invertible matrix A, the following identity holds:

I(X; Y|Z) = I(X; Y|AZ).    (2.40)

Proof. Applying (2.38) and (2.39) yields

I(X; Y|AZ) = I((AZ, X); Y) − I(AZ; Y)
           = I((Z, X); Y) − I(Z; Y)
           = I(X; Y|Z).

This completes the proof.
CHAPTER 3

Connection between Information Theory and Control Theory
3.1 Literature review for state estimation and feedback control in an information theory context
Since 1966, there has been a considerable amount of research that has investigated the relationship
between information theory and control theory, often interpreting control theory concepts in the
context of information theory, with much of the research focused on feedback control and/or state
estimation problems. In 1966, Zaborszky outlined the connection between entropy and system
identification using the concept of mutual information between unknown states and its relationship
to identification and state estimation for subsets of system states [55]. In 1970, Weidemann and Stear
investigated state estimation in an information theoretic context. In the noise free case, they derived
an inequality that involves the entropy of estimation errors, the entropy of system outputs and the
mutual information between estimated values of system states and system inputs. For systems
with noisy observations, the inequality included the entropy of estimation errors, the entropy of
noisy observations, and the entropies of system inputs and noise processes [53]. In 1976, Tomita,
Omatu and Soeda studied filtering problems in both continuous-time and discrete-time [51]. For
both continuous-time and discrete-time, they showed that for linear stochastic filtering problems, the
maximum mutual information between states and their estimated values is equivalent to the minimal
entropy of estimation errors. They also showed that the Kalman-Bucy filter maximizes mutual
information between states and their estimates. For continuous-time nonlinear stochastic filtering
problems, they showed that maximum mutual information between states and their estimated values
is equivalent to minimal entropy of Gaussian random variables, which have the same covariance as
the estimation errors. They also showed that the nonlinear filter derived by Kushner maximizes
mutual information between states and their estimated values. In 1979, Kalata and Priemer [22]
showed that the entropy of estimation errors is greater than the conditional entropy of states given
observations. In addition, minimizing the entropy of estimation errors is equivalent to minimizing
the mutual information between estimation errors and observed signals. Moreover, the entropy of
estimation errors is minimum when these errors are statistically independent of the observations.
Finally, the Minimum Squared Error (MSE) estimator minimizes the entropy of a Gaussian random
variable with the same covariance as the estimation errors and this leads to an alternative derivation
of the Kalman filter.
In 1988, Saridis studied optimal control system design for continuous-time systems [41] by assigning a distribution function that represents the uncertainty of selecting the appropriate control
law over the space of admissible controls. The distribution was selected to satisfy Jaynes maximum
entropy criterion and the optimal control law minimizes the differential entropy. This result was
demonstrated for linear stochastic systems. For noise-free adaptive control problems, the optimal
criteria consists of the entropy of estimated states given estimated parameters and the entropy of
estimated parameters. For noisy nonlinear systems, the optimal criteria consists of the entropy of
control inputs given estimated states and estimated parameters, the entropy of estimated states
given estimated parameters, and the entropy of estimated parameters. In 1992, Tsai, Casiello and
Loparo extended the result from Saridis to discrete-time systems [52]. In 1997, Feng, Loparo and
Fang studied adaptive control design and state estimation based on criterion using seven different
information measures [11]. For a nonlinear system they showed that some of the criteria are in
conflict with each other. In addition, they compared these criteria with the traditional minimum
mean-square measure and showed that all these criteria are equivalent for linear Gaussian systems.
Linear systems with non-Gaussian noise were also intensively studied in the same work. In the same
year, Feng and Loparo used the same approach as in [11] to design controller and estimator pairs for quantized feedback systems [10]. An intensive study of one-dimensional quantized feedback systems was presented in their work.
In control theory, the study of the structure of the system realization is well defined for linear systems.
Feedback control and state estimation are both directly related to certain properties of the system
realization, such as, controllability and observability. This past work shows that both control and
state estimator design can be studied using information theory. The question arises of how controllability and observability affect information measures, especially the mutual information between the inputs and outputs of the system. In other words, the primary interest here is the role of information theory in understanding the structure of any system realization that is reflected in the relationship between inputs and outputs. The next section shows that the answer to this specific question can be made explicit for linear systems. The connection between information measures and system structure can also be loosely stated for nonlinear systems.
Definition 3.1. For any discrete-time vector process z_k, ∀k = 1, . . . , t_f, define

z_{1:t_f} ≜ [z_1^T, z_2^T, . . . , z_{t_f}^T].
3.2 Information Theory for General Systems
To begin the investigation of the interconnection between information and control theory, consider nonlinear systems of the form

x_{k+1} = f(x_k, u_k, v_k)    (3.1)
y_{k+1} = h(x_k, w_k)    (3.2)

where x_k is the state vector, u_k is the control input, y_k is the observation, v_k is the process noise and w_k is the observation noise. Note that all five quantities are considered as vectors of appropriate dimension. In addition, it is assumed that the observation noise process, w_{1:t_f}, is statistically independent from the control input, u_{1:t_f}; this assumption is reasonable for most problem formulations.
Consider

I(y_{1:t_f}; u_{1:t_f}) = h(u_{1:t_f}) − h(u_{1:t_f}|y_{1:t_f})
= h(u_{1:t_f}) − h(u_{1:t_f}|y_{1:t_f}) + h(u_{1:t_f}|x_{1:t_f}, y_{1:t_f}) − h(u_{1:t_f}|x_{1:t_f}, y_{1:t_f})
= (h(u_{1:t_f}) − h(u_{1:t_f}|x_{1:t_f}, y_{1:t_f})) − (h(u_{1:t_f}|y_{1:t_f}) − h(u_{1:t_f}|x_{1:t_f}, y_{1:t_f}))
= (h(u_{1:t_f}) − h(u_{1:t_f}|x_{1:t_f}, y_{1:t_f})) − I(x_{1:t_f}; u_{1:t_f}|y_{1:t_f}).    (3.3)
Now,

h(U_{1:t_f}|X_{1:t_f}, Y_{1:t_f}) = −E_{U_{1:t_f}, X_{1:t_f}, Y_{1:t_f}}{ log p_{U_{1:t_f}|X_{1:t_f}, Y_{1:t_f}} }

where E_{U_{1:t_f}, X_{1:t_f}, Y_{1:t_f}}{·} is an expectation operator over U_{1:t_f}, X_{1:t_f} and Y_{1:t_f} using Definition 3.1. Consider

p_{U_{1:t_f}|X_{1:t_f}, Y_{1:t_f}} = (p_{Y_{1:t_f}|X_{1:t_f}, U_{1:t_f}} / p_{Y_{1:t_f}|X_{1:t_f}}) p_{U_{1:t_f}|X_{1:t_f}} = p_{U_{1:t_f}|X_{1:t_f}}.

This follows from the fact that p_{Y_{1:t_f}|X_{1:t_f}, U_{1:t_f}} = p_{Y_{1:t_f}|X_{1:t_f}} for all (X_{1:t_f}, U_{1:t_f}) such that p_{U_{1:t_f}|X_{1:t_f}} ≠ 0, under the assumption that u_{1:t_f} and w_{1:t_f} are independent random processes. Therefore,

h(U_{1:t_f}|X_{1:t_f}, Y_{1:t_f}) = −E_{U_{1:t_f}, X_{1:t_f}, Y_{1:t_f}}{ log p_{U_{1:t_f}|X_{1:t_f}, Y_{1:t_f}} }
= −E_{U_{1:t_f}, X_{1:t_f}, Y_{1:t_f}}{ log p_{U_{1:t_f}|X_{1:t_f}} }
= −E_{U_{1:t_f}, X_{1:t_f}}{ log p_{U_{1:t_f}|X_{1:t_f}} }
= h(U_{1:t_f}|X_{1:t_f}).    (3.4)
Using (3.3) and (3.4), I(y_{1:t_f}; u_{1:t_f}) can be expressed as

I(y_{1:t_f}; u_{1:t_f}) = I(x_{1:t_f}; u_{1:t_f}) − I(x_{1:t_f}; u_{1:t_f}|y_{1:t_f}).    (3.5)
From (3.5), it can be seen how the input process, u1:tf , and output process, y1:tf , are related
to the system internal state process, x1:tf . This result shows that one aspect of the information
theory connection is based on the amount of interaction between the internal states of the system
and the control inputs. Observe that I(x1:tf ; u1:tf ) ≥ I(y1:tf ; u1:tf ) ≥ 0 due to the non-negativity
of information measures. Therefore, I(x_{1:t_f}; u_{1:t_f}|y_{1:t_f}) is inversely related to the amount of information about the internal system variables that is contained in the observation process. Furthermore, the equality in (3.5) indicates that I(y_{1:t_f}; u_{1:t_f}) must be related to the influence of the controls on the system and the ability of the observations to estimate the internal states. Note that these two aspects of the relationship of the inputs and outputs to the internal states are rigorously defined, and easily computable, for linear systems in terms of controllability and observability. In the next section, the effect of controllability and observability on the mutual information between the input and output processes of a linear system is made explicit.
3.3 Connection between Information Theory and Control Theory in Linear Systems

In this section, evidence that the mutual information between input and output processes is directly related to the concepts of controllability and observability is provided. Thus, this mutual information is directly related to the structure of the system. This can be shown using the controllable and observable forms given in section 3.3.1. Section 3.3.2 provides analytical solutions for some computations in linear systems that could be useful for future applications.
3.3.1 Controllability and Observability in Linear Systems under an Information Theory Framework

In this section, the tight connection between control theory and information theory in linear systems is shown. The connection between these two fields of study explicitly depends on the structure of the realizations, that is, the controllable and observable forms. The result holds for both open- and
closed-loop systems with reasonable assumptions. Consider linear systems of the form

x_{k+1} = [A_{11}  A_{12}; A_{21}  A_{22}] x_k + [B_{11}  B_{12}; B_{21}  B_{22}] [u_k^{(1)}; u_k^{(2)}] + v_k    (3.6)

[y_k^{(1)}; y_k^{(2)}] = [C_{11}  C_{12}; C_{21}  C_{22}] x_k + w_k.    (3.7)
For open-loop systems, assume that
(LSOL1) w1:k is independent from x1:k , u1:k and v1:k .
For closed-loop systems, assume that
(LSCL1) w1:k is independent from u1:k .
The following derivation shows that the mutual information between any subset of inputs and outputs of the system, I(u^{(i)}_{1:k}; y^{(j)}_{1:k}), is directly related to the controllable and observable realizations of the system. Without loss of generality, consider the input/output pair (u^{(1)}_{1:k}; y^{(1)}_{1:k}), since any row pivoting operation can transform the system to any desired subset of inputs and outputs while I(u^{(i)}_{1:k}; y^{(j)}_{1:k}) is invariant under row pivoting operations due to (2.39).
First, consider

I(u^{(1)}_{1:k}; y^{(1)}_{1:k}) = h(u^{(1)}_{1:k}) − h(u^{(1)}_{1:k}|y^{(1)}_{1:k})
= h(u^{(1)}_{1:k}) − h(u^{(1)}_{1:k}|y^{(1)}_{1:k}) + h(u^{(1)}_{1:k}|x_{1:k}, y^{(1)}_{1:k}) − h(u^{(1)}_{1:k}|x_{1:k}, y^{(1)}_{1:k}) + h(u^{(1)}_{1:k}|x_{1:k}) − h(u^{(1)}_{1:k}|x_{1:k})
= (h(u^{(1)}_{1:k}) − h(u^{(1)}_{1:k}|x_{1:k})) + (h(u^{(1)}_{1:k}|x_{1:k}) − h(u^{(1)}_{1:k}|x_{1:k}, y^{(1)}_{1:k})) − (h(u^{(1)}_{1:k}|y^{(1)}_{1:k}) − h(u^{(1)}_{1:k}|x_{1:k}, y^{(1)}_{1:k}))
= I(u^{(1)}_{1:k}; x_{1:k}) + I(u^{(1)}_{1:k}; y^{(1)}_{1:k}|x_{1:k}) − I(u^{(1)}_{1:k}; x_{1:k}|y^{(1)}_{1:k}).    (3.8)
Let T̆ be a nonsingular transformation that transforms (3.6) and (3.7) into observable form with respect to (x_{1:k}; y^{(1)}_{1:k}) as follows:

[x̆^{(o)}_{k+1}; x̆^{(uo)}_{k+1}] = [Ă_{11}  0; Ă_{21}  Ă_{22}] [x̆^{(o)}_k; x̆^{(uo)}_k] + [B̆_{11}  B̆_{12}; B̆_{21}  B̆_{22}] [u^{(1)}_k; u^{(2)}_k] + T̆ v_k    (3.9)

y^{(1)}_k = (C̆_{11}  0) [x̆^{(o)}_k; x̆^{(uo)}_k] + w̆_k    (3.10)

where w̆_k = (I_{p×p}  0) w_k, p is equal to the dimension of x̆^{(o)}_k, T̆ x_k = x̆_k = (x̆^{(o)T}_k  x̆^{(uo)T}_k)^T, and (Ă_{11}, C̆_{11}) is an observable pair with respect to the observation y^{(1)}_{1:k}. Note that y^{(2)}_k is omitted from the observation equation in (3.10) since it is not related to the computation of interest.
By (2.39), there are identities for two of the terms in (3.8), that is,

I(u^{(1)}_{1:k}; x_{1:k}) = I(u^{(1)}_{1:k}; x̆_{1:k})    (3.11)
I(u^{(1)}_{1:k}; x_{1:k}|y^{(1)}_{1:k}) = I(u^{(1)}_{1:k}; x̆_{1:k}|y^{(1)}_{1:k}).    (3.12)

Combining (2.38) and (3.11) yields

I(u^{(1)}_{1:k}; x̆_{1:k}) = I(u^{(1)}_{1:k}; (x̆^{(o)}_{1:k}, x̆^{(uo)}_{1:k})) = I(u^{(1)}_{1:k}; x̆^{(o)}_{1:k}) + I(u^{(1)}_{1:k}; x̆^{(uo)}_{1:k}|x̆^{(o)}_{1:k}).    (3.13)

Similarly, by (2.38) and (3.12), I(u^{(1)}_{1:k}; x̆_{1:k}|y^{(1)}_{1:k}) can be expressed as

I(u^{(1)}_{1:k}; x̆_{1:k}|y^{(1)}_{1:k}) = I(u^{(1)}_{1:k}; (x̆^{(o)}_{1:k}, x̆^{(uo)}_{1:k})|y^{(1)}_{1:k}) = I(u^{(1)}_{1:k}; x̆^{(o)}_{1:k}|y^{(1)}_{1:k}) + I(u^{(1)}_{1:k}; x̆^{(uo)}_{1:k}|x̆^{(o)}_{1:k}, y^{(1)}_{1:k}).    (3.14)
Now, consider

I(u^{(1)}_{1:k}; x̆^{(uo)}_{1:k}|x̆^{(o)}_{1:k}) − I(u^{(1)}_{1:k}; x̆^{(uo)}_{1:k}|x̆^{(o)}_{1:k}, y^{(1)}_{1:k})
= (h(u^{(1)}_{1:k}|x̆^{(o)}_{1:k}) − h(u^{(1)}_{1:k}|x̆^{(uo)}_{1:k}, x̆^{(o)}_{1:k})) − (h(u^{(1)}_{1:k}|x̆^{(o)}_{1:k}, y^{(1)}_{1:k}) − h(u^{(1)}_{1:k}|x̆^{(uo)}_{1:k}, x̆^{(o)}_{1:k}, y^{(1)}_{1:k}))
= (h(u^{(1)}_{1:k}|x̆^{(o)}_{1:k}) − h(u^{(1)}_{1:k}|x̆^{(o)}_{1:k}, y^{(1)}_{1:k})) − (h(u^{(1)}_{1:k}|x̆^{(uo)}_{1:k}, x̆^{(o)}_{1:k}) − h(u^{(1)}_{1:k}|x̆^{(uo)}_{1:k}, x̆^{(o)}_{1:k}, y^{(1)}_{1:k}))
= I(u^{(1)}_{1:k}; y^{(1)}_{1:k}|x̆^{(o)}_{1:k}) − I(u^{(1)}_{1:k}; y^{(1)}_{1:k}|x̆^{(uo)}_{1:k}, x̆^{(o)}_{1:k})
= I(u^{(1)}_{1:k}; y^{(1)}_{1:k}|x̆^{(o)}_{1:k}) − I(u^{(1)}_{1:k}; y^{(1)}_{1:k}|x̆_{1:k}).    (3.15)
Applying Lemma 2.1, (3.8), (3.11), (3.12), (3.13), (3.14) and (3.15) yields

I(u^{(1)}_{1:k}; y^{(1)}_{1:k})
= (I(u^{(1)}_{1:k}; x̆^{(o)}_{1:k}) + I(u^{(1)}_{1:k}; x̆^{(uo)}_{1:k}|x̆^{(o)}_{1:k}) + I(u^{(1)}_{1:k}; y^{(1)}_{1:k}|x_{1:k})) − (I(u^{(1)}_{1:k}; x̆^{(o)}_{1:k}|y^{(1)}_{1:k}) + I(u^{(1)}_{1:k}; x̆^{(uo)}_{1:k}|x̆^{(o)}_{1:k}, y^{(1)}_{1:k}))
= (I(u^{(1)}_{1:k}; x̆^{(o)}_{1:k}) − I(u^{(1)}_{1:k}; x̆^{(o)}_{1:k}|y^{(1)}_{1:k}) + I(u^{(1)}_{1:k}; y^{(1)}_{1:k}|x_{1:k})) + (I(u^{(1)}_{1:k}; x̆^{(uo)}_{1:k}|x̆^{(o)}_{1:k}) − I(u^{(1)}_{1:k}; x̆^{(uo)}_{1:k}|x̆^{(o)}_{1:k}, y^{(1)}_{1:k}))
= (I(u^{(1)}_{1:k}; x̆^{(o)}_{1:k}) − I(u^{(1)}_{1:k}; x̆^{(o)}_{1:k}|y^{(1)}_{1:k}) + I(u^{(1)}_{1:k}; y^{(1)}_{1:k}|x_{1:k})) + (I(u^{(1)}_{1:k}; y^{(1)}_{1:k}|x̆^{(o)}_{1:k}) − I(u^{(1)}_{1:k}; y^{(1)}_{1:k}|x̆_{1:k}))
= (I(u^{(1)}_{1:k}; x̆^{(o)}_{1:k}) − I(u^{(1)}_{1:k}; x̆^{(o)}_{1:k}|y^{(1)}_{1:k}) + I(u^{(1)}_{1:k}; y^{(1)}_{1:k}|x̆^{(o)}_{1:k})) + (I(u^{(1)}_{1:k}; y^{(1)}_{1:k}|x_{1:k}) − I(u^{(1)}_{1:k}; y^{(1)}_{1:k}|x̆_{1:k}))
= I(u^{(1)}_{1:k}; x̆^{(o)}_{1:k}) − I(u^{(1)}_{1:k}; x̆^{(o)}_{1:k}|y^{(1)}_{1:k}) + I(u^{(1)}_{1:k}; y^{(1)}_{1:k}|x̆^{(o)}_{1:k}).    (3.16)

Therefore, I(u^{(1)}_{1:k}; y^{(1)}_{1:k}|x̆^{(o)}_{1:k}) can be written as

I(u^{(1)}_{1:k}; y^{(1)}_{1:k}|x̆^{(o)}_{1:k}) = h(y^{(1)}_{1:k}|x̆^{(o)}_{1:k}) − h(y^{(1)}_{1:k}|x̆^{(o)}_{1:k}, u^{(1)}_{1:k}) = h(w̆_k) − h(w̆_k|u^{(1)}_{1:k}).    (3.17)

Under assumption (LSOL1), h(w̆_k) = h(w̆_k|u^{(1)}_{1:k}) for open-loop systems. Similarly, for closed-loop systems, h(w̆_k) = h(w̆_k|u^{(1)}_{1:k}) under assumption (LSCL1). Therefore, I(u^{(1)}_{1:k}; y^{(1)}_{1:k}|x̆^{(o)}_{1:k}) = 0 and the intermediate result

I(u^{(1)}_{1:k}; y^{(1)}_{1:k}) = I(u^{(1)}_{1:k}; x̆^{(o)}_{1:k}) − I(u^{(1)}_{1:k}; x̆^{(o)}_{1:k}|y^{(1)}_{1:k})    (3.18)

follows.
As can be seen from (3.18), the unobservable part of x̆_{1:k} can be discarded because it is unrelated to I(u^{(1)}_{1:k}; y^{(1)}_{1:k}). Now, let T̃ be a nonsingular transformation that transforms (3.9) and (3.10) into controllable form with respect to (x̆^{(o)}_{1:k}; u^{(1)}_{1:k}) as follows:

[x̃^{(c,o)}_{k+1}; x̃^{(uc,o)}_{k+1}] = [Ã_{11}  Ã_{12}; 0  Ã_{22}] [x̃^{(c,o)}_k; x̃^{(uc,o)}_k] + [B̃_{11}  B̃_{12}; 0  B̃_{22}] [u^{(1)}_k; u^{(2)}_k] + T̃ (I_{p×p}  0) T̆ v_k    (3.19)

y^{(1)}_k = (C̃_1  C̃_2) [x̃^{(c,o)}_k; x̃^{(uc,o)}_k] + w̃_k    (3.20)

where p is equal to the dimension of x̆^{(o)}_k, T̃ x̆^{(o)}_k = x̃^{(o)}_k = (x̃^{(c,o)T}_k  x̃^{(uc,o)T}_k)^T, and (Ã_{11}, B̃_{11}) is a controllable pair with respect to the input u^{(1)}_{1:k}.

By (2.38) and (2.39), I(u^{(1)}_{1:k}; x̆^{(o)}_{1:k}) can be rewritten as

I(u^{(1)}_{1:k}; x̆^{(o)}_{1:k}) = I(u^{(1)}_{1:k}; x̃^{(o)}_{1:k}) = I(u^{(1)}_{1:k}; x̃^{(c,o)}_{1:k}) + I(u^{(1)}_{1:k}; x̃^{(uc,o)}_{1:k}|x̃^{(c,o)}_{1:k}).    (3.21)

Likewise, from (2.38) and (2.39), I(u^{(1)}_{1:k}; x̆^{(o)}_{1:k}|y^{(1)}_{1:k}) can be rewritten as

I(u^{(1)}_{1:k}; x̆^{(o)}_{1:k}|y^{(1)}_{1:k}) = I(u^{(1)}_{1:k}; x̃^{(o)}_{1:k}|y^{(1)}_{1:k}) = I(u^{(1)}_{1:k}; x̃^{(c,o)}_{1:k}|y^{(1)}_{1:k}) + I(u^{(1)}_{1:k}; x̃^{(uc,o)}_{1:k}|x̃^{(c,o)}_{1:k}, y^{(1)}_{1:k}).    (3.22)
Applying (3.18), (3.21) and (3.22) yields

I(u_{1:k}^{(1)}; y_{1:k}^{(1)})
  = \left( I(u_{1:k}^{(1)}; \tilde{x}_{1:k}^{(c,o)}) + I(u_{1:k}^{(1)}; \tilde{x}_{1:k}^{(uc,o)} \mid \tilde{x}_{1:k}^{(c,o)}) \right) - \left( I(u_{1:k}^{(1)}; \tilde{x}_{1:k}^{(c,o)} \mid y_{1:k}^{(1)}) + I(u_{1:k}^{(1)}; \tilde{x}_{1:k}^{(uc,o)} \mid \tilde{x}_{1:k}^{(c,o)}, y_{1:k}^{(1)}) \right)
  = \left( I(u_{1:k}^{(1)}; \tilde{x}_{1:k}^{(c,o)}) - I(u_{1:k}^{(1)}; \tilde{x}_{1:k}^{(c,o)} \mid y_{1:k}^{(1)}) \right) + \left( I(u_{1:k}^{(1)}; \tilde{x}_{1:k}^{(uc,o)} \mid \tilde{x}_{1:k}^{(c,o)}) - I(u_{1:k}^{(1)}; \tilde{x}_{1:k}^{(uc,o)} \mid \tilde{x}_{1:k}^{(c,o)}, y_{1:k}^{(1)}) \right).    (3.23)

For open-loop systems, I(u_{1:k}^{(1)}; \tilde{x}_{1:k}^{(uc,o)} \mid \tilde{x}_{1:k}^{(c,o)}) = I(u_{1:k}^{(1)}; \tilde{x}_{1:k}^{(uc,o)} \mid \tilde{x}_{1:k}^{(c,o)}, y_{1:k}^{(1)}) = 0. These equalities follow from the fact that \tilde{x}_{1:k}^{(uc,o)} cannot be driven by u_{1:k}^{(1)}; therefore, u_{1:k}^{(1)} and \tilde{x}_{1:k}^{(uc,o)} are statistically (conditionally) independent. For closed-loop systems, I(u_{1:k}^{(1)}; \tilde{x}_{1:k}^{(uc,o)} \mid \tilde{x}_{1:k}^{(c,o)}) = 0 by the same reasoning as in the open-loop case. However, I(u_{1:k}^{(1)}; \tilde{x}_{1:k}^{(uc,o)} \mid \tilde{x}_{1:k}^{(c,o)}, y_{1:k}^{(1)}) = 0 due to the fact that

I(u_{1:k}^{(1)}; \tilde{x}_{1:k}^{(uc,o)} \mid \tilde{x}_{1:k}^{(c,o)}, y_{1:k}^{(1)}) = h(\tilde{x}_{1:k}^{(uc,o)} \mid \tilde{x}_{1:k}^{(c,o)}, y_{1:k}^{(1)}) - h(\tilde{x}_{1:k}^{(uc,o)} \mid \tilde{x}_{1:k}^{(c,o)}, y_{1:k}^{(1)}, u_{1:k}^{(1)})
  = h(\tilde{x}_{1:k}^{(uc,o)} \mid y_{1:k}^{(1)}) - h(\tilde{x}_{1:k}^{(uc,o)} \mid y_{1:k}^{(1)})
  = 0.
With these facts and (3.23), the final result is

I(u_{1:k}^{(1)}; y_{1:k}^{(1)}) = I(u_{1:k}^{(1)}; \tilde{x}_{1:k}^{(c,o)}) - I(u_{1:k}^{(1)}; \tilde{x}_{1:k}^{(c,o)} \mid y_{1:k}^{(1)}).    (3.24)
The result in (3.24) is similar to (3.5) except that the mutual information now depends explicitly on the controllable and observable states of the system. The maximum possible information that can be shared is determined by I(u_{1:k}^{(1)}; \tilde{x}_{1:k}^{(c,o)}). In contrast, I(u_{1:k}^{(1)}; \tilde{x}_{1:k}^{(c,o)} \mid y_{1:k}^{(1)}) is inversely related to the amount of information about \tilde{x}_{1:k}^{(c,o)} that is contained in the observations. This result elucidates the tight connection between control theory and information theory for linear systems. Although only weaker results can be shown for nonlinear systems, it is strongly believed that similar results, as given in (3.24), also hold in the nonlinear case.
3.3.2
Computation of Some Information Measures in Linear System
In this section, various analytic computations for different information measures are stated focusing
primarily on mutual information. Although this computation is not used directly in the present
work, it is included because it is potentially useful for future work and/or related research. For
linear systems of the form

x_{k+1} = A x_k + B u_k + v_k    (3.25)
y_{k+1} = C x_k + D u_k + w_k    (3.26)

under the assumptions that

(LSC1) v_{1:k} and w_{1:k} are independent white noise processes,
(LSC2) v_{1:k} and w_{1:k} are independent of x_0 and u_{1:k},

the following results hold.
Lemma 3.2. For the system in (3.25) and (3.26), with the additional assumption that v_l is a non-degenerate process for l = 1, \ldots, k, the mutual information between inputs and states can be computed as follows:

I(u_{1:k}; x_{1:k}) = \sum_{l=1}^{k} \left[ h(B u_{l-1} + v_{l-1}) - h(v_{l-1}) \right].    (3.27)

Proof.

I(u_{1:k}; x_{1:k}) = h(x_{1:k}) - h(x_{1:k} \mid u_{1:k})
  = \sum_{l=1}^{k} \left[ h(x_l \mid x_{1:l-1}) - h(x_l \mid x_{1:l-1}, u_{1:k}) \right]
  = \sum_{l=1}^{k} \left[ h(x_l \mid x_{l-1}) - h(x_l \mid x_{l-1}, u_{l-1}) \right]
  = \sum_{l=1}^{k} \left[ h(B u_{l-1} + v_{l-1}) - h(v_{l-1}) \right].    (3.28)
Note that the third line in the proof comes from (LSC1) and (LSC2). Also, the computation in
(3.28) is always possible due to the assumption that vl is non-degenerate.
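As a simple illustration of (3.27), the sketch below evaluates the sum in closed form under the additional, purely illustrative assumption that the inputs and noise are i.i.d. zero-mean Gaussians, u_{l-1} ~ N(0, Σ_u) and v_{l-1} ~ N(0, Σ_v); in that case both differential entropies are the standard Gaussian ones. The function names and the example matrices are assumptions of this sketch, not part of the lemma.

```python
import numpy as np

def gaussian_entropy(cov):
    """Differential entropy (nats) of N(0, cov): 0.5*log((2*pi*e)^n det(cov))."""
    n = cov.shape[0]
    _, logdet = np.linalg.slogdet(cov)
    return 0.5 * (n * np.log(2.0 * np.pi * np.e) + logdet)

def input_state_mi(B, cov_u, cov_v, k):
    """Evaluate (3.27) when u and v are i.i.d. zero-mean Gaussians (illustrative case):
    I(u_{1:k}; x_{1:k}) = sum_l [h(B u_{l-1} + v_{l-1}) - h(v_{l-1})]."""
    cov_bu_plus_v = B @ cov_u @ B.T + cov_v          # covariance of B*u + v (u, v independent)
    per_step = gaussian_entropy(cov_bu_plus_v) - gaussian_entropy(cov_v)
    return k * per_step                              # identical contribution at every step

# Hypothetical example: 2-dimensional state, scalar input
B = np.array([[1.0], [0.5]])
print(input_state_mi(B, cov_u=np.array([[2.0]]), cov_v=0.1 * np.eye(2), k=10))
```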
Lemma 3.3. For the system in (3.25) and (3.26), with the additional assumption that w_l is non-degenerate for l = 1, \ldots, k, the mutual information between inputs and outputs can be computed as

I(u_{1:k}; y_{1:k}) = h(Mz) - h(\tilde{M}z)    (3.29)

where

• x_k \in \mathbb{R}^n, u_k \in \mathbb{R}^p and y_k \in \mathbb{R}^m,
• z = [x_0^T, u_{k-1}^T, \ldots, u_0^T, v_{k-1}^T, \ldots, v_0^T, w_k^T, \ldots, w_1^T]^T,
• M(i,1) = CA^{i-1},
• M(i,2) = [0_{m \times p(k-i)}, CB, CAB, \ldots, CA^{i-1}B],
• M(i,3) = [0_{m \times n(k-i)}, C, CA, \ldots, CA^{i-1}],
• M(i,4) = [0_{m \times m(k-i)}, I_{m \times m}, 0_{m \times m(i-1)}],
• M(i,5) = [0_{m \times p(k-i)}, D, 0_{m \times p(i-1)}],
• M_{(i)} = [M(i,1), M(i,2) + M(i,5), M(i,3), M(i,4)],
• M = [M_{(1)}^T, \ldots, M_{(k)}^T]^T,
• \tilde{M}_{(i)} = [M(i,1), 0_{m \times pk}, M(i,3), M(i,4)],
• \tilde{M} = [\tilde{M}_{(1)}^T, \ldots, \tilde{M}_{(k)}^T]^T.
Proof. Write (3.25) and (3.26) as

x_{k+1} = A^k x_0 + \sum_{l=1}^{k-1} A^{k-1-l} (B u_l + v_l)    (3.30)
y_{k+1} = C A^k x_0 + \sum_{l=1}^{k-1} C A^{k-1-l} (B u_l + v_l) + D u_k + w_k    (3.31)

with

I(u_{1:k}; y_{1:k}) = h(y_{1:k}) - h(y_{1:k} \mid u_{1:k}).

From (3.30) and (3.31), it follows that

h(y_{1:k}) = h(Mz),
h(y_{1:k} \mid u_{1:k}) = h(\tilde{M}z).

This is always possible by the assumption that w_l is non-degenerate.
Note that the result in lemma 3.3 follows directly from the solution of the linear systems in (3.25)
and (3.26). However, the results in lemma 3.2 and 3.3 give rise to one interesting observation. The
correlation structure of u1:k , v1:k (and w1:k ) affects the computation of mutual information between
the control inputs and outputs but not the computation of mutual information between the control
inputs and states. This implies that I(u1:k ; x1:k ) only depends on the sum of the instantaneous
information from the control inputs and additive noises. For the computation of I(u1:k ; y1:k ), direct
computation can be avoided by computing the upper and lower bounds. This can be done using the
following lemma.
Lemma 3.4. For the system in (3.25) and (3.26), with the additional assumption that w_k is non-degenerate, upper and lower bounds on the mutual information between the inputs and outputs are given by

I(u_{1:k}; y_{1:k}) \leq I(u_{1:k}; x_{1:k})    (3.32)
I(u_{1:k}; y_{1:k}) \geq h(u_{1:k}) - \sum_{l=1}^{k} \left[ h(u_l \mid u_{l-1}) + h(y_{l+1}, y_l \mid u_l) - h(y_{l+1}, y_l \mid u_{l-1}) \right].    (3.33)
Proof. The upper bound given in (3.32) follows from the fact that h(u_{1:k} \mid y_{1:k}) \geq h(u_{1:k} \mid x_{1:k}). Focusing attention on the lower bound, the following inequality holds:

h(u_{1:k} \mid y_{1:k}) = \sum_{l=1}^{k} h(u_l \mid u_{1:l-1}, y_{1:k})
  \leq \sum_{l=1}^{k} h(u_l \mid u_{1:l-1}, y_{l+1}, y_l)
  \leq \sum_{l=1}^{k} h(u_l \mid u_{l-1}, y_{l+1}, y_l).

Note that, \forall l \in [1, k],

h(u_l \mid u_{l-1}, y_{l+1}, y_l) = h(u_l, u_{l-1}, y_{l+1}, y_l) - h(u_{l-1}, y_{l+1}, y_l)
  = h(u_l, u_{l-1}, y_{l+1}, y_l) - h(u_{l-1}, y_{l+1}, y_l) + h(u_l, u_{l-1}) - h(u_l, u_{l-1})
  = h(y_{l+1}, y_l \mid u_l, u_{l-1}) - h(u_{l-1}, y_{l+1}, y_l) + h(u_l, u_{l-1})
  = h(y_{l+1}, y_l \mid u_l) - h(u_{l-1}, y_{l+1}, y_l) + h(u_l, u_{l-1})
  = h(y_{l+1}, y_l \mid u_l) - h(u_{l-1}, y_{l+1}, y_l) + h(u_l, u_{l-1}) + h(u_{l-1}) - h(u_{l-1})
  = h(y_{l+1}, y_l \mid u_l) - h(y_{l+1}, y_l \mid u_{l-1}) + h(u_l \mid u_{l-1}).

Thus,

h(u_{1:k} \mid y_{1:k}) \leq \sum_{l=1}^{k} \left[ h(u_l \mid u_{l-1}) + h(y_{l+1}, y_l \mid u_l) - h(y_{l+1}, y_l \mid u_{l-1}) \right].

Therefore,

I(u_{1:k}; y_{1:k}) \geq h(u_{1:k}) - \sum_{l=1}^{k} \left[ h(u_l \mid u_{l-1}) + h(y_{l+1}, y_l \mid u_l) - h(y_{l+1}, y_l \mid u_{l-1}) \right].

This completes the proof of the lower bound given in (3.33).
CHAPTER 4

Computation of Mutual Information: Continuous-Value Representation using Hidden Markov Model
4.1 Hidden Markov Model with Gaussian Emission
In a hidden Markov model, it is assumed that there exists a set of hidden states, usually finite, where each hidden state emits a random variable with a specific probability distribution. In addition, there is a hidden process whose values are associated with these hidden states. This hidden process cannot be observed and is Markovian. The observable sequence is determined from the hidden process and hidden states as follows: at any given time, if the hidden state takes the value i, the observation is a random variable with distribution p_i, which is an emission from the i-th hidden state.
Mathematically, let
• q1:T be a trajectory of hidden state where qt ∈ {1, 2, . . . , M } ∀t = 1, 2, . . . , T
• O1:T be an observation process with Ot ∈ R ∀t = 1, 2, . . . , T .
In addition, q1:T is Markovian with initial probability Ph (i) = P rob(q1 = i) and state transition
matrix A. The number of hidden states is M. The l-th state has distribution p_l, and the nodes visited by the hidden process emit independent random variables drawn from the associated distributions. If q_t = l and its value is known, (O_t \mid q_t = l) is a random variable with probability distribution p_l. It follows that

O_t \sim \sum_{l=1}^{M} p_l(O_t)\, Prob(q_t = l).    (4.1)
Equation (4.1) shows that Ot is a mixture of different distributions that depend on emissions from
the hidden states.
Three basic problems of hidden Markov models that have well-known solutions are as follows.
• What is the probability of observation sequences given model parameters, P (O1:T |M odel).
– This problem is solved by Forward-backward procedure (Baum).
• What is the most probable trajectory of the system (in terms of the hidden states) given
observations and model parameters, P (q1:T |O1:T , M odel).
– This problem is solved by the Viterbi algorithm.
• Determining the model parameters.
– This problem is solved by Baum-Welch method.
The Viterbi algorithm is not discussed in this chapter since it is not related to the present work; full details can be found in [36]. The main concern within the current context is with the first and third problems, and the solutions to these problems are summarized in the next section. Note that these three problems are related to classical implementations of hidden Markov models. In addition, the number of hidden states has to be specified a priori in classical implementations. There exist implementations of the hidden Markov model in which the number of hidden states is determined during the construction [2, 43, 44]. The advantage of these algorithms is significant, since determining an appropriate model order for HMMs is an extremely difficult task. However, these implementations are designed for discrete-valued random processes. Since the objective at hand is to reconstruct the underlying dynamics of a continuous-valued process, the approaches presented in [2, 43, 44] cannot be applied to the current problem without major modification.
In this work, it is assumed that the emitted random variables associated with the hidden states are Gaussian, i.e., p_l = N(m_l, C_l). This representation reflects the continuous-valued nature of the data. Following from (4.1), the distribution of the data given the model at each time instant is a Gaussian mixture. Since any distribution can be approximated by a Gaussian sum, this choice of emission density is reasonable. Detailed information on hidden Markov models with Gaussian or Gaussian mixture emissions can be found in [21, 27, 34].
4.1.1 Parameters estimation for Hidden Markov model with Gaussian Emission

For a hidden Markov model with Gaussian Emission, the following model parameters are necessary:
• M , number of nodes,
• p1 , p2 , . . . , pM , emissions (probability distributions) for each of the nodes described by
– m1 , m2 , . . . , mM , set of mean vectors for each of the distributions,
– C1 , C2 , . . . , CM , set of covariance matrices for each of the distributions,
• Ph , initial probability for the hidden state where Ph (j) = P rob(q1 = j),
• A, state transition matrix for the Markov chain where A_{ij} = Prob(q_t = j \mid q_{t-1} = i) \quad \forall t = 2, 3, \ldots, T.
Note that the number of nodes M must be selected a priori in classical implementations. A method for determining an appropriate value for M is discussed in a later section.

For the forward-backward procedure, let b_j(O_t) denote the distribution function of O_t given that q_t = j. The collection of model parameters for the hidden Markov model is gathered into a single vector \lambda \triangleq \{\{m_1, m_2, \ldots, m_M\}, \{C_1, C_2, \ldots, C_M\}, P_h, A\}. The distribution of O_{1:T} is denoted by L_\lambda(O_{1:T}) and the joint distribution of O_{1:T} and q_t is denoted by L_\lambda(O_{1:T}, q_t). The following relationships hold:
L_\lambda(O_{1:T}) = \sum_{j=1}^{M} L_\lambda(O_{1:T}, q_t = j)

where

L_\lambda(O_{1:T}, q_t = j) = L_\lambda(O_{1:t}, q_t = j)\, L_\lambda(O_{t+1:T} \mid O_{1:t}, q_t = j).

The forward \alpha_t(j) and backward \beta_t(j) variables for the hidden Markov model can be defined by

\alpha_t(j) = L_\lambda(O_{1:t}, q_t = j)
\beta_t(j) = L_\lambda(O_{t+1:T} \mid O_{1:t}, q_t = j).

These two variables satisfy the recursions

\alpha_t(j) = \sum_{i=1}^{M} \alpha_{t-1}(i)\, b_j(O_t)\, A_{ij}    (4.2)
\beta_t(j) = \sum_{i=1}^{M} \beta_{t+1}(i)\, b_i(O_{t+1})\, A_{ji}    (4.3)

where b_j(O_t) = L_\lambda(O_t \mid q_t = j), A is the transition matrix with A_{ij} = Prob(q_t = j \mid q_{t-1} = i) and, \forall j = 1, \ldots, M,

\alpha_1(j) = b_j(O_1)\, P_h(j), \qquad \beta_T(j) = 1.

Note that, by these definitions, L_\lambda(O_{1:T}, q_T = j) = \alpha_T(j).
Parameter estimation is done using an iterative procedure that begins with an initial guess for the parameters, P_h^{(1)}, A^{(1)}, \{m_1^{(1)}, \ldots, m_M^{(1)}\} and \{C_1^{(1)}, \ldots, C_M^{(1)}\}. The forward and backward variables are then computed from (4.2) and (4.3) using the current model parameters to obtain \alpha_t^{(1)}(i) and \beta_t^{(1)}(j) for i, j = 1, \ldots, M and t = 1, \ldots, T. Each model parameter is updated until the change in all parameters is less than some threshold, \delta_{max}, or the number of iterations exceeds a preset maximum number of iterations. The estimates of the model parameters at the v-th iteration are given by

A_{ij}^{(v)} = \frac{\sum_{t=1}^{T-1} \alpha_t^{(v-1)}(i)\, A_{ij}^{(v-1)}\, b_j^{(v-1)}(O_{t+1})\, \beta_{t+1}^{(v-1)}(j)}{\sum_{t=1}^{T-1} \alpha_t^{(v-1)}(i)\, \beta_t^{(v-1)}(i)} \qquad \forall i, j = 1, 2, \ldots, M    (4.4)

m_i^{(v)} = \frac{\sum_{t=1}^{T} \alpha_t^{(v-1)}(i)\, \beta_t^{(v-1)}(i)\, O_t}{\sum_{t=1}^{T} \alpha_t^{(v-1)}(i)\, \beta_t^{(v-1)}(i)} \qquad \forall i = 1, 2, \ldots, M    (4.5)

C_i^{(v)} = \frac{\sum_{t=1}^{T} \alpha_t^{(v-1)}(i)\, \beta_t^{(v-1)}(i)\, (O_t - m_i^{(v-1)})(O_t - m_i^{(v-1)})^T}{\sum_{t=1}^{T} \alpha_t^{(v-1)}(i)\, \beta_t^{(v-1)}(i)} \qquad \forall i = 1, 2, \ldots, M    (4.6)

P_h^{(v)}(i) = \frac{\alpha_1^{(v-1)}(i)}{b_i^{(v-1)}(O_1)} \qquad \forall i = 1, 2, \ldots, M.    (4.7)
Note that, for a long sequence of observations O_{1:T}, it is well known that underflow commonly occurs in the calculation of \alpha_t(j) for large t and of \beta_t(j) for small t. In case of underflow, the computation of the quantities defined by (4.4)-(4.7) can produce a spurious result of zero. A suggested solution to this problem is to multiply \alpha_t(j) and \beta_t(j) by appropriate constants [36]. However, there is no specific rule for determining these constants, and selecting appropriate values for them is non-trivial. In 2006, Mann suggested that the log scale should be used to store and compute both the forward and backward variables to overcome this underflow problem [28]. Mann's method is applied here to manage the computation of these variables in this work.
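A minimal sketch of the forward recursion (4.2) carried out in the log domain, in the spirit of Mann's suggestion; the array layout and function name are assumptions of this sketch, and log_b is assumed to already contain the log emission densities log b_j(O_t).

```python
import numpy as np

def log_forward(log_b, log_A, log_Ph):
    """Log-domain forward recursion for (4.2).
    log_b:  (T, M) array with log b_j(O_t)
    log_A:  (M, M) array with log A_{ij} = log Prob(q_t = j | q_{t-1} = i)
    log_Ph: (M,)   array with log P_h(j)
    Returns log alpha_t(j) as a (T, M) array."""
    T, M = log_b.shape
    log_alpha = np.empty((T, M))
    log_alpha[0] = log_Ph + log_b[0]                 # alpha_1(j) = b_j(O_1) P_h(j)
    for t in range(1, T):
        # log of sum_i alpha_{t-1}(i) A_{ij}, accumulated with logaddexp to avoid underflow
        log_alpha[t] = np.logaddexp.reduce(
            log_alpha[t - 1][:, None] + log_A, axis=0) + log_b[t]
    return log_alpha
```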
4.1.2 Entropy and Mutual Information of Data under Hidden Markov Model with Gaussian Emission
For any process O_{1:T} generated by a hidden Markov model with parameters \lambda, the entropy of this process is given by

h(O_{1:T} \mid \lambda) = \sum_{t=1}^{T} h(O_t \mid O_{1:t-1}, \lambda).

The distributions required to compute h(O_t \mid O_{1:t-1}, \lambda) are given by

L_\lambda(O_t \mid O_{1:t-1}) = \frac{L_\lambda(O_{1:t})}{L_\lambda(O_{1:t-1})}
  = \frac{\sum_{i=1}^{M} \alpha_t(i)}{\sum_{i=1}^{M} \alpha_{t-1}(i)}
  = \frac{\sum_{i=1}^{M} \sum_{j=1}^{M} \alpha_{t-1}(i)\, b_j(O_t)\, A_{ij}}{\sum_{i=1}^{M} \alpha_{t-1}(i)}
  = \sum_{j=1}^{M} b_j(O_t)\, \frac{\sum_{i=1}^{M} \alpha_{t-1}(i)\, A_{ij}}{\sum_{i=1}^{M} \alpha_{t-1}(i)}
  = \sum_{j=1}^{M} b_j(O_t)\, K_t(j).    (4.8)

From (4.8), L_\lambda(O_t \mid O_{1:t-1}) is a Gaussian mixture because b_j(O_t) is Gaussian and K_t(j) is a multiplier at time t associated with hidden state j. From (4.8), K_t(j) can be explicitly calculated using the forward variables by

K_t(j) = \frac{\sum_{i=1}^{M} \alpha_{t-1}(i)\, A_{ij}}{\sum_{i=1}^{M} \alpha_{t-1}(i)}.    (4.9)

Now, assume that O_{1:T} can be partitioned as

O_{1:T} = \begin{bmatrix} O_{1:T}^{(1)} \\ O_{1:T}^{(2)} \end{bmatrix}.    (4.10)

The mutual information between the two subsets of O_{1:T} defined in (4.10) is given by

I(O_{1:T}^{(1)}; O_{1:T}^{(2)} \mid \lambda) = h(O_{1:T}^{(1)} \mid \lambda) + h(O_{1:T}^{(2)} \mid \lambda) - h(O_{1:T} \mid \lambda).    (4.11)
The term h(O_{1:T} \mid \lambda) in (4.11) is given by (4.8) and (4.9). The computation of h(O_{1:T}^{(1)} \mid \lambda) and h(O_{1:T}^{(2)} \mid \lambda) can be performed as follows. For O_{1:T}^{(p)}, where p \in \{1, 2\}, the appropriate parameters are extracted from \lambda as \lambda^{(p)}; in effect, the mean vectors and covariance matrices are truncated. To be precise, \forall i = 1, \ldots, M, m_i and C_i can be written as

m_i = \begin{bmatrix} m_i^{(1)} \\ m_i^{(2)} \end{bmatrix}    (4.12)

C_i = \begin{bmatrix} C_i^{(11)} & C_i^{(12)} \\ C_i^{(21)} & C_i^{(22)} \end{bmatrix}.    (4.13)

With (4.12) and (4.13), the truncated model parameters \lambda^{(1)} and \lambda^{(2)} are given by

\lambda^{(1)} = \left\{ \{m_1^{(1)}, m_2^{(1)}, \ldots, m_M^{(1)}\}, \{C_1^{(11)}, C_2^{(11)}, \ldots, C_M^{(11)}\}, P_h, A \right\}    (4.14)
\lambda^{(2)} = \left\{ \{m_1^{(2)}, m_2^{(2)}, \ldots, m_M^{(2)}\}, \{C_1^{(22)}, C_2^{(22)}, \ldots, C_M^{(22)}\}, P_h, A \right\}    (4.15)

Then, for p = 1, 2, the forward variables \alpha_t^{(p)}(i) are recomputed using the model parameters \lambda^{(p)}. Finally, \alpha_t(i) in (4.8) and (4.9) are replaced with \alpha_t^{(p)}(i) to obtain h(O_{1:T}^{(p)} \mid \lambda) for p = 1, 2.
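The following sketch (illustrative only) shows the two small computations used above: the mixture weights K_t(j) of (4.9) obtained from the forward variables, and the truncation of the model parameters to \lambda^{(1)} in (4.14) by slicing the means and covariances. The index set idx1 for the coordinates belonging to O^{(1)} is a hypothetical argument of this sketch.

```python
import numpy as np

def mixture_weights(log_alpha_prev, A):
    """K_t(j) = sum_i alpha_{t-1}(i) A_{ij} / sum_i alpha_{t-1}(i), cf. (4.9)."""
    # rescale by the max log value; the constant cancels in numerator and denominator
    alpha_prev = np.exp(log_alpha_prev - np.max(log_alpha_prev))
    return alpha_prev @ A / np.sum(alpha_prev)

def truncate_model(means, covs, Ph, A, idx1):
    """Extract lambda^(1) of (4.14): keep only the coordinates listed in idx1."""
    means1 = [m[idx1] for m in means]
    covs1 = [C[np.ix_(idx1, idx1)] for C in covs]
    return means1, covs1, Ph, A   # P_h and A are unchanged by the truncation
```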
4.2 Entropy and Mutual Information of Gaussian mixture
As mentioned before, the computation of the mutual information of data from a hidden Markov model with Gaussian emission requires computing the entropy of a Gaussian mixture. In 2008, Michalowicz et al. investigated the computation of this entropy when the Gaussian mixture contains only two components with equal variance [30]. The result from [30] is very restrictive and cannot be applied in the present work. In the same year, Huber and his colleagues provided an estimation of this entropy for arbitrary Gaussian mixtures [19]. In that work, Huber et al. applied a Taylor series expansion to the estimation of the entropy of Gaussian mixtures. However, there is no solid proof that establishes the region of convergence for this computation. Although they have a scheme to reduce the approximation error, the computation can be overwhelming for higher dimensional random vectors. For these reasons, the focus in this work is restricted to the computation of upper and lower bounds on the entropy of Gaussian mixtures using results from the same work [19]. In this section, it is assumed that the random variable X has a distribution of the form

\sum_{i=1}^{M} \alpha_i N(m_i, C_i)    (4.16)

where N(m_i, C_i) is a Gaussian distribution with mean m_i and covariance matrix C_i.
4.2.1 Review of Upper and Lower Bound Computations for Entropy of a Gaussian Mixture
In [19], Huber and his colleagues present upper and lower bounds on the entropy of the Gaussian mixture given in (4.16). The lower bound, resulting from the log-sum inequality, can be computed by

h(X) \geq -\sum_{i=1}^{M} \alpha_i \log \sum_{j=1}^{M} \alpha_j z_{ij}    (4.17)

where z_{ij} = N(m_j, C_i + C_j)\big|_{m_i}.

Note that the derivation of (4.17) is based on Jensen's inequality, E\{-\log x\} \geq -\log E\{x\}. As a result, h(X) will rarely attain the lower bound presented in (4.17). In the same work, they propose a loose upper bound given by

h(X) \leq \sum_{i=1}^{M} \alpha_i h_{N(m_i, C_i)} - \sum_{i=1}^{M} \alpha_i \log \alpha_i    (4.18)

where h_{N(m_i, C_i)} is the differential entropy of the normal distribution with mean m_i and covariance C_i, which can be easily computed.

However, the result in (4.18) does not depend on the location (or mean) of each Gaussian in the mixture (4.16). Thus, the bound is inaccurate when the components are clustered near each other. Huber and his colleagues propose another inequality, which is used to refine the result in (4.18), as follows. For X with the distribution given in (4.16) and \tilde{X} \sim \sum_{l \neq i,j} \alpha_l N(m_l, C_l) + \tilde{\alpha} N(\tilde{m}, \tilde{C}), with \tilde{\alpha} = (\alpha_i + \alpha_j) and N(\tilde{m}, \tilde{C}) having the same mean and covariance as \frac{\alpha_i}{\tilde{\alpha}} N(m_i, C_i) + \frac{\alpha_j}{\tilde{\alpha}} N(m_j, C_j),

h(X) \leq h(\tilde{X}).    (4.19)

In other words, if any pair of Gaussians inside \sum_{i=1}^{M} \alpha_i N(m_i, C_i) is merged, the entropy of the resulting distribution is larger than that of the original one. Huber and his colleagues use this fact to construct a procedure to tighten the bound given in (4.18):
(HStep 1) At the i-th iteration, for X^{(i)} \sim \sum_l \alpha_l^{(i)} N(m_l^{(i)}, C_l^{(i)}), compute h_U(X^{(i)}) using (4.18).

(HStep 2) For all pairs r and s where r \neq s, compute \tilde{\alpha}^{(i)} N(\tilde{m}^{(i)}, \tilde{C}^{(i)}) such that

  (a) \tilde{\alpha}^{(i)} = \alpha_r^{(i)} + \alpha_s^{(i)},
  (b) N(\tilde{m}^{(i)}, \tilde{C}^{(i)}) has the same mean and covariance as \frac{\alpha_r^{(i)}}{\tilde{\alpha}^{(i)}} N(m_r^{(i)}, C_r^{(i)}) + \frac{\alpha_s^{(i)}}{\tilde{\alpha}^{(i)}} N(m_s^{(i)}, C_s^{(i)}).

(HStep 3) Pick the pair (r^*, s^*) that is a solution of

  \min_{r,s,\, r \neq s} \; \tilde{\alpha}^{(i)} \log\det \tilde{C}^{(i)} - \alpha_r^{(i)} \log\det C_r^{(i)} - \alpha_s^{(i)} \log\det C_s^{(i)}.

  This criterion comes from Runnalls's work, which is based on an upper bound of the Kullback-Leibler divergence [40] between the merged and original versions of the two Gaussian sums.

(HStep 4) Iterate over (HStep 1) through (HStep 3) until all components in the Gaussian mixture are merged into a single Gaussian.

(HStep 5) h_U(X) = \min_i h_U(X^{(i)}).

The upper bound computed by this procedure is significantly improved from that provided by (4.18).
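A minimal sketch of two ingredients of this refinement: the loose bound (4.18) and the moment-matched merge of a pair of components used in (HStep 2). The array-based representation of the mixture is an assumption of this sketch, not the authors' implementation.

```python
import numpy as np

def gaussian_entropy(C):
    """Differential entropy (nats) of a Gaussian with covariance C."""
    d = C.shape[0]
    return 0.5 * (d * np.log(2.0 * np.pi * np.e) + np.linalg.slogdet(C)[1])

def loose_upper_bound(w, covs):
    """Upper bound (4.18): sum_i w_i h_{N(m_i,C_i)} - sum_i w_i log w_i (means not needed)."""
    return sum(wi * gaussian_entropy(Ci) for wi, Ci in zip(w, covs)) - np.sum(w * np.log(w))

def merge_pair(w, means, covs, r, s):
    """Moment-matched merge of components r and s, cf. (HStep 2)."""
    a = w[r] + w[s]
    m = (w[r] * means[r] + w[s] * means[s]) / a
    C = (w[r] * (covs[r] + np.outer(means[r] - m, means[r] - m))
         + w[s] * (covs[s] + np.outer(means[s] - m, means[s] - m))) / a
    return a, m, C
```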
4.2.2 Improvement of Upper and Lower Bound Computations for Entropy of a Gaussian Mixture
It is known that the computation of the mutual information between two subsets of a Gaussian mixture involves the calculation of three entropies. To be specific, assume that

X = \begin{bmatrix} X^{(1)} \\ X^{(2)} \end{bmatrix}.

The mutual information between X^{(1)} and X^{(2)} is given by

I(X^{(1)}; X^{(2)}) = h(X^{(1)}) + h(X^{(2)}) - h(X).

If X is a Gaussian mixture, both X^{(1)} and X^{(2)} are also Gaussian mixtures. The gap between the upper and lower bounds on mutual information is roughly three times larger than the gap between the bounds on entropy. Thus, it is desirable to close the gap between the lower and upper bounds on the entropy estimate as much as possible. In this section, a method to tighten the lower and upper bounds from the original work of Huber and his colleagues is proposed. Details on how to modify and improve the upper and lower bound computations are provided, and simulation results using synthetic data are given in the next section.
The criterion in Runnalls's work is based on the Kullback-Leibler divergence between the Gaussian mixture and its merged version [40]. Although this criterion is ideal for Gaussian merging, the exact computation of the Kullback-Leibler divergence between two different distributions is not possible. In [40], Runnalls computes an upper bound of the Kullback-Leibler divergence, and Huber uses this upper bound for Gaussian merging. Although it is agreed that the Kullback-Leibler divergence is the best criterion, its upper bound might not be an appropriate criterion. Based on this speculation, Runnalls's criterion [40] was investigated and found not to be ideal; a better merging policy exists. Thus, the merging criterion (HStep 3) of the previous section is changed to

\min_{r,s,\, r \neq s} h\left( \tilde{X}^{(i)}(r, s) \right)

where

• \tilde{X}^{(i)}(r, s) \sim \sum_{l \neq r,s} \alpha_l^{(i)} N(m_l^{(i)}, C_l^{(i)}) + \tilde{\alpha}^{(i)} N(\tilde{m}^{(i)}, \tilde{C}^{(i)}),
• \tilde{\alpha}^{(i)}, \tilde{m}^{(i)} and \tilde{C}^{(i)} are defined in the same way as in (HStep 2) of the previous section.

In other words, the pair of Gaussians in the mixture whose merging yields the smallest upper bound is merged. With this criterion, the upper bound computed by the same procedure as in Huber's work is improved.
The lower bound can be very loose in some instances since the computation does not depend on the means of the components in the mixture. Before proceeding, note that, for any random variable X such that X \sim \alpha f(X) + (1-\alpha) g(X) = p(X), where both f(X) and g(X) are probability distributions, the differential entropy may be expressed as

h(X) = -\int p(x) \log p(x)\, dx
  = -\int \left( \alpha f(x) + (1-\alpha) g(x) \right) \log p(x)\, dx
  = -\alpha \int f(x) \log p(x)\, dx - (1-\alpha) \int g(x) \log p(x)\, dx
  = -\alpha \int f(x) \log \frac{p(x) f(x)}{f(x)}\, dx - (1-\alpha) \int g(x) \log \frac{p(x) g(x)}{g(x)}\, dx
  = \alpha \left( h_f + D(f \| p) \right) + (1-\alpha) \left( h_g + D(g \| p) \right)    (4.20)

where h_f is the entropy associated with the distribution f(X), h_g is the entropy associated with the distribution g(X), and D(\cdot \| \cdot) is the Kullback-Leibler divergence. Since the Kullback-Leibler divergence is nonnegative, the inequality

h(X) \geq \alpha h_f + (1-\alpha) h_g    (4.21)

follows from (4.20).
The result in (4.21) can be generalized to probability distributions of the form p = \sum_i \alpha_i f_i, where \sum_i \alpha_i = 1, \alpha_i \geq 0 and each f_i is a probability distribution. Thus, the associated differential entropy is bounded from below by

h_p \geq \sum_i \alpha_i h_{f_i}.    (4.22)

From (4.19), there exists a method to create groups from the collection of components in p_X for the random variable X with distribution p_X = \sum_{i=1}^{M} \alpha_i N(m_i, C_i) \triangleq \sum_{i=1}^{M} \alpha_i N_i, such that

\max_{\Gamma} \sum_i \beta_i \phi(g_i).    (4.23)

The notation in (4.23) is as follows.

• Each g_i is a Gaussian mixture.
• \sum_i \beta_i g_i = \sum_{i=1}^{M} \alpha_i N_i and \beta_i is the normalization constant so that g_i is a valid probability distribution.
• \phi(\cdot) is the lower bound estimate given in (4.17).
• \Gamma = \{\Gamma_1, \Gamma_2, \ldots, \Gamma_p\} is a collection of Gaussian mixtures such that \sum_{i=1}^{p} \Gamma_i = p_X and each component in \Gamma_i has a different mean and covariance for all i = 1, 2, \ldots, p.

In other words, the Gaussian mixture is split into different groups. The lower bound is then estimated using the inequality (4.22) and the lower bound given by (4.17), and the maximum lower bound is given by (4.23). However, finding an optimal grouping is extremely tedious and impractical. Using the upper bound refinement from Huber's work as a guideline, a similar method can be used to construct a lower bound with the following iterative procedure.
(LB Step 1) For X^{(i)} \sim \sum_{l \in \Gamma_1} \alpha_l^{(i)} N(m_l^{(i)}, C_l^{(i)}) + \sum_{l \in \Gamma_2} \alpha_l^{(i)} N(m_l^{(i)}, C_l^{(i)}), compute h_L(X^{(i)}) as follows:

  (a) \sum_{l \in \Gamma_1} \alpha_l^{(i)} N(m_l^{(i)}, C_l^{(i)}) = \beta_1 \sum_{l \in \Gamma_1} \frac{\alpha_l^{(i)}}{\beta_1} N(m_l^{(i)}, C_l^{(i)}) where \beta_1 = \sum_{l \in \Gamma_1} \alpha_l^{(i)},
  (b) \sum_{l \in \Gamma_2} \alpha_l^{(i)} N(m_l^{(i)}, C_l^{(i)}) = \beta_2 \sum_{l \in \Gamma_2} \frac{\alpha_l^{(i)}}{\beta_2} N(m_l^{(i)}, C_l^{(i)}) where \beta_2 = \sum_{l \in \Gamma_2} \alpha_l^{(i)},
  (c) Use (4.17) to compute the lower bound for \sum_{l \in \Gamma_1} \frac{\alpha_l^{(i)}}{\beta_1} N(m_l^{(i)}, C_l^{(i)}) to obtain h_{l1}^{(i)},
  (d) Use (4.22) to compute the lower bound for \sum_{l \in \Gamma_2} \frac{\alpha_l^{(i)}}{\beta_2} N(m_l^{(i)}, C_l^{(i)}) to obtain h_{l2}^{(i)},
  (e) h_L(X^{(i)}) = \beta_1 h_{l1}^{(i)} + \beta_2 h_{l2}^{(i)}.

(LB Step 2) Pick the r \in \Gamma_1 such that

  (a) \tilde{X}^{(i)}(r) \sim \sum_{l \in \Gamma_1 \setminus \{r\}} \alpha_l^{(i)} N(m_l^{(i)}, C_l^{(i)}) + \sum_{l \in \Gamma_2 \cup \{r\}} \alpha_l^{(i)} N(m_l^{(i)}, C_l^{(i)}),
  (b) X^{(i+1)} = \tilde{X}^{(i)}(r^*), where r^* = \arg\max_r h_L(\tilde{X}^{(i)}(r)) with the computation in (LB Step 1), \Gamma_1(r) = \Gamma_1 \setminus \{r\} and \Gamma_2(r) = \Gamma_2 \cup \{r\}.

(LB Step 3) Repeat (LB Step 1) and (LB Step 2) until all components are in \Gamma_2.

(LB Step 4) h_L(X) = \max_i h_L(X^{(i)}).
In other words, a procedure similar to the upper bound refinement in [19] is performed, starting with all of the components in the mixture belonging to the first group. The difference is that the Gaussian components are first separated into two groups. Then, for the first group, (4.17) is used to compute the group lower bound. For the second group, the lower bound is computed using a weighted sum of the entropies of the individual Gaussians as in (4.22). The weighted sum of these two terms is then used to compute the final lower bound. Next, a member is selected from the first group such that, when it is moved to the second group, it provides the largest lower bound. This procedure is continued until all of the components belong to the second group, and the largest lower bound from all iterations is used as the final lower bound estimate.
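For completeness, a small sketch of the base lower bound (4.17), which only requires evaluating z_{ij} = N(m_j, C_i + C_j) at m_i; the group-based refinement above repeatedly applies such a routine to sub-mixtures. The list-of-arrays representation of the mixture is an assumption of this sketch.

```python
import numpy as np

def gaussian_pdf_at(x, mean, cov):
    """Evaluate the multivariate normal density N(mean, cov) at the point x."""
    d = x.shape[0]
    diff = x - mean
    _, logdet = np.linalg.slogdet(cov)
    quad = diff @ np.linalg.solve(cov, diff)
    return np.exp(-0.5 * (d * np.log(2.0 * np.pi) + logdet + quad))

def lower_bound(w, means, covs):
    """Lower bound (4.17): -sum_i w_i log sum_j w_j N(m_j, C_i + C_j)|_{m_i}."""
    M = len(w)
    total = 0.0
    for i in range(M):
        z = sum(w[j] * gaussian_pdf_at(means[i], means[j], covs[i] + covs[j])
                for j in range(M))
        total -= w[i] * np.log(z)
    return total
```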
4.2.3 Simulations and Results for Upper and Lower Bound Computations for Entropy of a Gaussian Mixture
In this section, different data sets corresponding to Gaussian mixture random variables are generated. All figures have the following format: the y-axis is the estimated value of the entropy and the x-axis is the adjusted parameter. The blue line represents the upper bound given by [19] and the green line is the corresponding lower bound. The red and teal lines are the improved upper and lower bounds, respectively.

The first data set is simply a Gaussian mixture with different weights. It consists of six Gaussian distributions that have different fixed variances. The means of five components are fixed and the mean of one component is varied. The distribution is of the form

0.2\left( N(0, 1) + N(X, 0.8) + N(0.75, 0.6) + N(1.5, 0.4) \right) + 0.1\left( N(2, 0.2) + N(2.2, 0.4) \right)

where X \in \{0, 0.05, \ldots, 20\}.

Figure 4.1: Upper and lower bounds of entropy for Gaussian mixtures of the form 0.2(N(0,1) + N(X,0.8) + N(0.75,0.6) + N(1.5,0.4)) + 0.1(N(2,0.2) + N(2.2,0.4)) where X ∈ {0, 0.05, ..., 20}.
Figure 4.1 shows that the lower bound can only be slightly improved in this case. This is due to
the fact that the mixtures are widely separated relative to their variances. Loose clustering provides
an advantage for the computation given by [19]. In contrast, the improvement in the upper bound
in this method is significant. This improvement is more pronounced when the second component is
moved further from the cluster.
The second data set is a Gaussian mixture with five evenly weighted components that have fixed variances. The means of the first four distributions are fixed while the mean of the last component is varied. The distribution is of the form

0.2\left( N(0, 0.8) + N(0.5, 0.8) + N(1, 0.8) + N(1.5, 0.8) + N(X, 0.8) \right)

where X \in \{0, 0.05, \ldots, 20\}.

Figure 4.2: Upper and lower bounds of entropy for Gaussian mixtures of the form 0.2(N(0,0.8) + N(0.5,0.8) + N(1,0.8) + N(1.5,0.8) + N(X,0.8)) where X ∈ {0, 0.05, ..., 20}.
In Figure 4.2, the improvement for the upper bound is significant when the mean of the fifth
Gaussian is clearly separated from the means of the remaining Gaussians. This result is similar to
the result given in Figure 4.1. However, it can be seen that the improvement for the lower bound in
this case is better than the result from Figure 4.1. This is due to the fact that the distances among
the mixtures are smaller relative to their variances.
From the previous result, another data set was generated in which the mixture consists of five evenly weighted Gaussian components with equal variances that are varied. To be specific, the distribution is of the form

0.2\left( N(0, X) + N(0.5, X) + N(1, X) + N(1.5, X) + N(2, X) \right)

where X \in \{0.01, 0.02, \ldots, 2\}.

Figure 4.3: Upper and lower bounds of entropy for Gaussian mixtures of the form 0.2(N(0,X) + N(0.5,X) + N(1,X) + N(1.5,X) + N(2,X)) where X ∈ {0.01, 0.02, ..., 2}.
The result, in Figure 4.3, confirms the observation from the two previous experiments. Observe
that in this case the methods explained here only slightly improve the estimate for the lower bound
when the variances of all mixtures are increased. The upper bound computations are identical to
those given in [19].
How the weights of the components affect the bound computations is also of interest in this framework. To that end, a mixture with fixed means and variances for all components is considered. The mixture consists of five Gaussian distributions in which four components have the same weight. The weight of the last component is adjusted along with the rest of the distribution so that the sum remains a valid distribution. The composite distribution is of the form

Y\left( N(0, 0.6) + N(0.5, 0.6) + N(1, 0.6) + N(1.5, 0.6) \right) + X N(2, 0.6)

where X \in \{0.01, 0.02, \ldots, 0.99\} and Y = \frac{1-X}{4}.

Figure 4.4: Upper and lower bounds of entropy for Gaussian mixtures of the form Y(N(0,0.6) + N(0.5,0.6) + N(1,0.6) + N(1.5,0.6)) + XN(2,0.6) where X ∈ {0.01, 0.02, ..., 0.99} and Y = (1−X)/4.
In Figure 4.4, the upper bound computations are identical for the two methods. In contrast, the lower bound given by the modified method performs significantly better than (4.17), especially when the mixture is close to a single Gaussian. This is due to the fact that the computation in (4.17) does not converge to the true value. It is clear that the method presented here can prevent an overly conservative lower bound estimate in this scenario.

Next, scenarios similar to the previous case, in which the last component is moved away from the rest of the mixture, are considered. In other words, the two cases considered have distributions

Y\left( N(0, 0.6) + N(0.5, 0.6) + N(1, 0.6) + N(1.5, 0.6) \right) + X N(4, 0.6)
Y\left( N(0, 0.6) + N(0.5, 0.6) + N(1, 0.6) + N(1.5, 0.6) \right) + X N(6, 0.6)

where X \in \{0.01, 0.02, \ldots, 0.99\} and Y = \frac{1-X}{4}.

Figure 4.5: Upper and lower bounds of entropy for Gaussian mixtures of the form Y(N(0,0.6) + N(0.5,0.6) + N(1,0.6) + N(1.5,0.6)) + XN(4,0.6) where X ∈ {0.01, 0.02, ..., 0.99} and Y = (1−X)/4.

Figure 4.6: Upper and lower bounds of entropy for Gaussian mixtures of the form Y(N(0,0.6) + N(0.5,0.6) + N(1,0.6) + N(1.5,0.6)) + XN(6,0.6) where X ∈ {0.01, 0.02, ..., 0.99} and Y = (1−X)/4.
From Figures 4.5 and 4.6, when the mean of the fifth component is far from the rest, the lower bound computations from the modified method and those produced by (4.17) are nearly identical. In contrast, the estimate of the upper bound in [19] becomes very conservative, while the method presented here provides tighter upper bounds, as shown by all of the results in this section.

In conclusion, the closeness of the upper and lower bounds depends on the structure of the Gaussian mixture. The proposed methods for computing the upper and lower bounds improve upon those from [19] in all cases examined. In the next section, the estimation results from the modified method are used due to the robustness of these bounds as observed in this section.
4.3 Mutual Information of random process using Hidden Markov Model
4.3.1 Model order selection for Hidden Markov Model
As mentioned in section 4.1.1, the model order, or the number of hidden states, has to be specified a priori. Clearly, an HMM with a single hidden state is undesirable. Likewise, too many hidden nodes will require a very large amount of data to obtain a "precise" model. An iterative procedure for determining the number of hidden nodes is proposed. The basic idea is summarized in the following steps.

(Step 1) At the i-th iteration, generate a hidden Markov model with i hidden states.

(Step 2) If the model in (Step 1) is valid, keep the current model and repeat the procedure for the next iteration, i + 1.

A valid model in (Step 2) means that all of the model parameters are finite real numbers and, in addition, that the initial probability and state transition matrix satisfy their constraints. Note that the method to construct a hidden Markov model given in section 4.1.1 depends on an initial guess for all of the parameters. There is no specific rule for choosing these initial values [36]; therefore, random initial model parameters are used. As a result, an invalid model can be produced in (Step 1) due to a "bad" initial guess. Thus, the basic procedure is modified by constructing models of the same order multiple times. If all of the attempts in the i-th iteration fail to converge to a legitimate model, execution is halted and the model from the (i−1)-th iteration is selected. The procedure is summarized in Figure 4.7. The maximum number of attempts per iteration, n_max, is chosen to be 10, and random initial parameters are used for each generation of the hidden Markov model.
Figure 4.7: The flow chart for hidden Markov model construction.
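A schematic version of the loop summarized in Figure 4.7, assuming the Baum-Welch routine of section 4.1.1 and the validity check described above are supplied by the caller as fit_model and is_valid; both names are placeholders used only for this sketch, not library functions.

```python
def select_hmm_order(data, fit_model, is_valid, n_max=10, max_states=20):
    """Increase the number of hidden states until no valid model of that order is found.
    fit_model(data, M) and is_valid(model) are caller-supplied placeholders, e.g. a
    Baum-Welch fit with random initialization and the validity check in the text."""
    best = None
    for M in range(1, max_states + 1):
        candidate = None
        for _ in range(n_max):            # n_max attempts with random initial parameters
            model = fit_model(data, M)
            if is_valid(model):           # finite parameters, valid P_h and A
                candidate = model
                break
        if candidate is None:             # every attempt at order M failed: stop
            break
        best = candidate
    return best                           # model from the last order that succeeded
```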
4.3.2 Simulations and Discussion for Computation of Mutual Information using Hidden Markov Model
After obtaining the hidden Markov model from the data, the computation is straightforward using the results in sections 4.1.2 and 4.2.2. In this section, the coupled Hénon maps given in (4.24) are used as a test system.

x_{n+1}^{(1)} = 1.4 - (x_n^{(1)})^2 + 0.3\, x_n^{(2)}
x_{n+1}^{(2)} = x_n^{(1)}
x_{n+1}^{(3)} = 1.4 - \left( C x_n^{(1)} x_n^{(3)} + (1 - C)(x_n^{(3)})^2 \right) + 0.3\, x_n^{(4)}    (4.24)
x_{n+1}^{(4)} = x_n^{(3)}

The parameter C in (4.24) determines the coupling strength between the two systems. The pair x_n^{(2)} and x_n^{(4)} is used to estimate the mutual information between the two systems; only data corresponding to x_n^{(2)} and x_n^{(4)} are used in model construction. This reflects the fact that, most of the time, one can only obtain partial observations of the whole system. The coupling strength C is varied from 0 to 0.8 with a step size of 0.04, as in [20]. 100 realizations are used at each coupling strength, each with different initial values (x_0^{(1)}, x_0^{(2)}, x_0^{(3)}, x_0^{(4)}). The number of time steps (data points) is 1,000. The acceptable threshold for the change in the estimated model parameters during hidden Markov model construction is 10^{-6}. The maximum number of iterations for each hidden Markov model construction is set to 1,000. The results are given in Figure 4.8 below.

Figure 4.8: Results for the coupled Hénon system, in which each coupling strength consists of 100 realizations and 1,000 data points are used to construct the HMM.
The results in Figure 4.8 show that the computation cannot give any meaningful results for low coupling strengths, 0 ≤ C ≤ 0.4. The algorithm produces an increasing sequence of average extrema for the upper bound for coupling strengths greater than or equal to 0.44. Similarly, this behavior repeats for the average and maximum lower bounds for coupling strengths at or above 0.4. In contrast, the minimum values for the upper bound do not show mutual information increasing with coupling strength until the two systems are tightly coupled, i.e., C ≥ 0.72. The minimum values for the lower bound stay at zero until C = 0.68 and monotonically increase with coupling strength after that. The gap between the upper and lower bounds is large in most cases. For C ≥ 0.72, the hidden Markov model degenerates into a model with a single hidden state, as can be seen from the equal upper and lower bounds.
As can be seen from the results in Figure 4.8, the proposed approach to computing the hidden Markov model (HMM) is not reliable. The following analysis is provided for any future researchers who use a similar approach. First, one of the well-known problems of HMM construction is the underflow of the forward and backward variables. The remedy, which is applied herein, is to use the log scale in the internal computations instead of the linear scale on the value space [28]. The method was only tested for 1,000 data points due to the computational time required. Although no studies on the use of the log scale on large sets of data (i.e., ≥ 5,000 points) have been identified, it is suspected that the internal variables suffer from cumulative error in the log-scale computation of the forward and backward variables. Second, using a continuous-valued representation implies that h(X) = −∞ is allowed in the degenerate case, which makes the computation of the upper and lower bounds of mutual information impossible. This problem is addressed in the same manner as invalid solutions for the model parameters, which makes selecting the number of iterations required to construct a "legitimate" hidden Markov model of the specified order much harder. Reducing the number of iterations while increasing the number of model construction attempts may be needed; this is a tradeoff between model accuracy and computational time. Another important problem is that the proposed approach relies on the computation of lower and upper bounds on the entropy of Gaussian mixtures to estimate bounds on the mutual information. It turns out that the gap in the bound on mutual information can be almost three times larger than the gap in the bound on entropy. This is not surprising since I(X; Y) = H(X) + H(Y) − H(X, Y). Further improvement in estimating the entropy of Gaussian mixtures may resolve this issue. The last problem is the required computational time, which is unreasonably long given the number of data points, 1,000. It is doubtful that this amount of data is sufficient to construct a meaningful continuous-valued hidden Markov model. Although this problem can be alleviated by using a machine with more computational power, increasing the training data can give rise to the first problem, i.e., cumulative numerical error.

In conclusion, it can be seen that using a hidden Markov model for mutual information estimation is still an immature technique. Although the underlying concept of this approach is interesting and potentially meaningful, there is still much work needed to make this computational approach viable. In the next chapter, the use of discretization and the approximation of mutual information using a discrete-valued representation are investigated.
CHAPTER 5

Computation of Mutual Information: Discrete-Value Representation
The previous chapter illustrated that the computation of mutual information using the hidden Markov model approach with a continuous-value representation is insufficient for practical application. It was also observed that the problem of degenerate solutions, which results from using continuous-valued representations of the data, further degrades viability. Thus, the use of discrete representations of continuous-valued data, along with quantization of the continuous-valued data, is the primary focus of this chapter. In this chapter, the histogram-based estimation of mutual information is investigated. The meaning of Shannon mutual information and its relation to the actual relationship between two time series are studied. In addition, the relationship between quantization and the estimation of mutual information is also investigated. Simulation results and discussions are also included in this chapter.
5.1 Investigation of Shannon Mutual information: Theoretical perspective
It must be emphasized that the assumption of stationarity cannot be satisfied for practical real-world data, where only finite data sets are available, and that the computation of Shannon entropy and mutual information requires even stronger assumptions, e.g., independent and identically distributed (i.i.d.) random processes. Regardless, the computation of Shannon mutual information offers meaningful benefits, as shown by the results in this section.
5.1.1 Mutual Information and Quantization
Let X \in \{1, \ldots, N\} and Y \in \{1, \ldots, M\} be two discrete random variables with well-defined distributions, and let Z be a function of Y, e.g., Z = f(Y). By the data processing inequality [29],

I(X; Y) \geq I(X; Z).    (5.1)

In other words,

I(X; Y) \geq I(Q_1(X); Q_2(Y))    (5.2)

where Q_1(\cdot) and Q_2(\cdot) are quantizers.

Further, let X_1 = Q_1^X(X), Y_1 = Q_1^Y(Y), X_2 = Q_2^X(X) and Y_2 = Q_2^Y(Y), where (Q_1^X, Q_1^Y, Q_2^X, Q_2^Y) is a set of different quantizers. If there exist two additional quantizers, Q_{12}^X and Q_{12}^Y, such that X_2 = Q_{12}^X(X_1) and Y_2 = Q_{12}^Y(Y_1), then it follows from equation (5.2) that

I(X_1; Y_1) \geq I(X_2; Y_2).    (5.3)

The pair (X_1, Y_1) represents high-resolution quantizations of (X, Y) while the pair (X_2, Y_2) represents low-resolution quantizations of (X, Y). Equation (5.3) states that the finer the quantizer, the higher the value of the mutual information; in other words, quantization destroys information. The generalization to multiple levels of resolution is straightforward.
5.1.2 Shannon Mutual Information and independent random processes
Fixing attention on the most commonly used entropy measure, Shannon entropy, the computation of mutual information based on Shannon entropy is now investigated. Assuming X_k and Y_k are two independent discrete-valued random processes for k = 1, \ldots, T, define \tilde{X} and \tilde{Y} as two different random variables with the following probability mass functions

Prob(\tilde{X} = i) = \frac{1}{T} \sum_{t=1}^{T} \delta_{X_t, i}
Prob(\tilde{Y} = i) = \frac{1}{T} \sum_{t=1}^{T} \delta_{Y_t, i}    (5.4)

where \delta_{i,j} is the Kronecker delta. By construction, since X_k and Y_k are independent random processes, it follows that

I(\tilde{X}; \tilde{Y}) = 0.    (5.5)

The mutual information between \tilde{X} and \tilde{Y} is a computation of mutual information in the context of Shannon entropy for the original processes X_{1:T} and Y_{1:T}, in which any two independent random processes have zero mutual information. However, the converse is not true. Although the stronger result cannot be proven, one can conjecture that a high value of I(\tilde{X}; \tilde{Y}) implies a strong connection between X_{1:T} and Y_{1:T}. This conjecture comes from the fact that, if the occurrence of X_{1:T} given Y_{1:T} is high for all possible values of Y_t, it directly implies that the instantaneous value of Y_t can be used to reduce the uncertainty of X_t, possibly by a significant amount. The uncertainty reduction on Y_t from X_t produces the same conclusion. Low mutual information, in this sense, can be interpreted as an inability to determine a relationship between two sets of time-series data. This effect is similar to the measurement of the correlation coefficient between two time series. However, mutual information measures the interconnection between two random variables (vectors) beyond the notion of linear dependence.
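In practice, I(\tilde{X}; \tilde{Y}) reduces to a plug-in computation on the joint histogram of the two symbol sequences. The sketch below assumes the sequences have already been quantized to nonnegative integer symbols of equal length; it is illustrative rather than the exact implementation used later.

```python
import numpy as np

def shannon_mi(x_symbols, y_symbols):
    """Plug-in Shannon mutual information (nats) of two equally long symbol sequences."""
    x = np.asarray(x_symbols)
    y = np.asarray(y_symbols)
    joint = np.zeros((x.max() + 1, y.max() + 1))
    for xi, yi in zip(x, y):
        joint[xi, yi] += 1.0
    joint /= joint.sum()                      # empirical joint pmf, cf. (5.4)
    px = joint.sum(axis=1)                    # marginal of X~
    py = joint.sum(axis=0)                    # marginal of Y~
    nz = joint > 0                            # avoid log(0) for empty cells
    return float(np.sum(joint[nz] * np.log(joint[nz] / np.outer(px, py)[nz])))
```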
5.1.3 Shannon Mutual Information for Continuous Random Variables
It was shown in the first chapter that differential entropy is not an exact analog of the entropy defined for discrete random variables. However, time series data are stored in digital format before any further processing or analysis can be performed on them. Using this rationale, one can always view continuous-valued data stored on a digital computer as discrete-valued data regardless of their original nature. The range space of the discrete representation becomes the set of possible floating point representations of the data. Although the machine representation of floating point numbers is non-uniform, i.e., machine numbers are not uniformly distributed, data stored in this manner share the same range space. Therefore, one can always consider continuous-valued random variables in digital format as discrete ones with the same range space. The number of discrete states is generally large compared to the time windows in which the mutual information is computed. Thus, the use of quantization is unavoidable due to the lack of data to accurately compute mutual information. With the results from sections 5.1.1 and 5.1.2, the relationship between quantization and mutual information, along with the significance of mutual information in the context of Shannon entropy, is known. For the reasons given in this section, a computation of Shannon mutual information under quantization has significant value for the analysis of the interconnection between two random variables/vectors/processes.
5.2 Investigation of Shannon Mutual information: Computational perspective
In the previous section, the necessity of using a discrete representation of continuous-valued random variables was illustrated. However, quantization is needed in order to perform the computations. In this section, the effect of various quantization methods and resolutions on the computation of mutual information is investigated. It is shown that, by selecting any reasonable number of quantized states, reliable results are obtained with normalization of the information measure. In addition, some additional computational methods for estimating Shannon mutual information are proposed that impose a constraint on the ordering of the time-series data. Coupled chaotic systems are used as test beds for evaluating the performance of all computational methods, so that the performance of methods based on probabilistic concepts can be determined on deterministic systems. The chaotic systems contain rich dynamics and, by assuming that the data generation process is unknown to the estimator, the uncertainty is mainly influenced by the rich system dynamics. In addition, chaotic systems are sensitive to initial conditions that are also assumed to be unknown to the estimator. This property induces additional complexity in the estimation and reflects real-world data in which both the parameters and the model governing the system are generally unknown. Thus, chaotic systems can be used to quantify and verify estimator performance for estimating the connectivity or coupling in the systems.

The coupled chaotic systems used for the rest of this chapter are from [20, 26]. These three systems are coupled Hénon maps, coupled Lorenz systems and coupled Rossler systems. The information on how data are collected and which parts of the systems are used is as follows.
• Coupled Hénon maps

x_{n+1}^{(1)} = 1.4 - (x_n^{(1)})^2 + 0.3\, x_n^{(2)}
x_{n+1}^{(2)} = x_n^{(1)}
x_{n+1}^{(3)} = 1.4 - \left( C x_n^{(1)} x_n^{(3)} + (1 - C)(x_n^{(3)})^2 \right) + 0.3\, x_n^{(4)}    (5.6)
x_{n+1}^{(4)} = x_n^{(3)}

where

– C \in \{0, 0.04, 0.08, \ldots, 0.8\},
– Measured pair: x_{n+1}^{(2)} and x_{n+1}^{(4)}.

• Coupled Lorenz systems

\dot{x}_t^{(1)} = 10\left( x_t^{(2)} - x_t^{(1)} \right)
\dot{x}_t^{(2)} = x_t^{(1)}\left( 28 - x_t^{(3)} \right) - x_t^{(2)}
\dot{x}_t^{(3)} = x_t^{(1)} x_t^{(2)} - \frac{8}{3} x_t^{(3)}
\dot{x}_t^{(4)} = 10\left( x_t^{(5)} - x_t^{(4)} \right)    (5.7)
\dot{x}_t^{(5)} = x_t^{(4)}\left( 28.001 - x_t^{(6)} \right) - x_t^{(5)}
\dot{x}_t^{(6)} = x_t^{(4)} x_t^{(5)} - \frac{8}{3} x_t^{(6)} + C\left( x_t^{(3)} - x_t^{(6)} \right)

where

– C \in \{0, 0.1, 0.2, \ldots, 2\},
– Measured pair: x_{n+1}^{(1)} and x_{n+1}^{(4)},
– Numerical integration: Runge-Kutta with step size 0.01 seconds.

• Coupled Rossler systems

\dot{x}_t^{(1)} = -0.95\, x_t^{(2)} - x_t^{(3)}
\dot{x}_t^{(2)} = 0.95\, x_t^{(1)} + 0.15\, x_t^{(2)}
\dot{x}_t^{(3)} = 0.2 + x_t^{(3)}\left( x_t^{(1)} - 10 \right)
\dot{x}_t^{(4)} = -1.05\, x_t^{(5)} - x_t^{(6)} + C\left( x_t^{(1)} - x_t^{(4)} \right)    (5.8)
\dot{x}_t^{(5)} = 1.05\, x_t^{(4)} + 0.15\, x_t^{(5)}
\dot{x}_t^{(6)} = 0.2 + x_t^{(6)}\left( x_t^{(4)} - 10 \right)

where

– C \in \{0, 0.1, 0.2, \ldots, 2\},
– Measured pair: x_{n+1}^{(1)} and x_{n+1}^{(4)},
– Numerical integration: Runge-Kutta with step size 0.1 seconds.
Since the coupled Hénon maps are discrete-time in nature, the data generated from the system can
be used directly in digital computer computations. In contrast, the coupled Lorenz systems and the
coupled Rossler systems are continuous-time systems. Therefore, numerical integration is required
to obtain the trajectories of these two systems. A similar setup to the work in [20] is applied here so
that any future work on these three systems can directly use these results for comparison purposes.
5.2.1 Quantization methods for Computing Shannon Mutual Information
The focus in this section is on the fundamental concept of quantization: the conversion of continuous-valued data into discrete-valued data. The underlying notion is to divide the value space into different segments, each associated with a discrete or quantized value. Mathematically, for x \in \mathbb{R}, define the regions r_i \subseteq \mathbb{R}, \forall i = 0, \pm 1, \pm 2, \ldots, such that \bigcup_{i=-\infty}^{\infty} r_i = \mathbb{R} and r_i \cap r_j = \emptyset when i \neq j. The quantization operator is such that Q_r(x) = i for x \in r_i. The principal question that must be addressed in quantization is how to select \{\ldots, r_{-2}, r_{-1}, r_0, r_1, r_2, \ldots\}. Quantization methods have been extensively researched in machine learning in the form of discretization. Despite the vast amount of research on this topic, there is no generally accepted "best" method for quantization; the selection of a quantizer is more or less application specific. Discretization techniques can be categorized into unsupervised and supervised methods [8, 24]. The main focus here is on unsupervised methods, because the primary aim is to use the information measure to gain an understanding of the data; the goal or criteria needed in supervised methods are not explicitly present in the current framework. The two unsupervised methods used in this work are as follows.
1. Equal spacing: This is the simplest version of discretization/quantization. Each r_i in this method is of the form [a_i, a_{i+1}) where a_{i+1} − a_i = d.

2. Equal frequency: In this method, each r_i is of the form [a_i, a_{i+1}) where a_{i+1} − a_i = d_i. For a data set \{x_1, x_2, \ldots, x_T\}, the values of a_i and a_{i+1} are chosen for all i so that \sum_{t=1}^{T} \delta_{i, Q(x_t)} is constant, where \delta_{i,j} is the Kronecker delta. In other words, each partition of the value space contains the "same" number of points from the data set. One variation of this method is to construct the quantizer that yields the maximum entropy of the quantized data. In most cases, the original method and its variant give identical results.
For unsupervised discretization, the number of quantized states needs to be specified manually. It is safe to assume that the data take finite values, since this holds true for the applications under consideration. Although there exist more sophisticated quantization techniques, attention is restricted to these two basic methods, because the results in the next section demonstrate that the normalized version of Shannon mutual information has low sensitivity to the quantization method.
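Minimal sketches of the two quantizers for one-dimensional data with a prescribed number of bins; these are illustrative implementations of the definitions above, not the exact routines used in the experiments.

```python
import numpy as np

def equal_spacing_quantize(x, n_bins):
    """Partition [min(x), max(x)] into n_bins intervals of equal width."""
    x = np.asarray(x, dtype=float)
    edges = np.linspace(x.min(), x.max(), n_bins + 1)
    return np.clip(np.digitize(x, edges[1:-1]), 0, n_bins - 1)

def equal_frequency_quantize(x, n_bins):
    """Choose the bin edges as empirical quantiles so each bin holds about T/n_bins points."""
    x = np.asarray(x, dtype=float)
    edges = np.quantile(x, np.linspace(0.0, 1.0, n_bins + 1))
    return np.clip(np.digitize(x, edges[1:-1]), 0, n_bins - 1)
```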
5.2.2 Investigation of the effects of Quantization Methods and number of Quantized states on Shannon Mutual Information
As shown in section 5.1.1, the estimate of mutual information increases with the resolution of the quantization. Also, having too many quantized states will reduce the accuracy of the estimate unless the cardinality of the data is sufficiently large. Thus, a test using data generated from the systems (5.6), (5.7) and (5.8) is constructed to observe the relationship between the number of quantization bins and the estimated mutual information. The same number of quantized states is used for the data from the two subsystems. Although it is not necessary to have the same number of quantized states for the two sets of data, the cases considered are restricted to those holding the number of quantized states constant across subsystems for clarity of presentation.
Figure 5.1: Estimation of (Shannon) mutual information for coupled Hénon maps with different coupling strengths and numbers of bins. The figure on the top uses equal space partitioning and the figure on the bottom uses equal frequency partitioning.
Figure 5.2: Estimation of (Shannon) mutual information for coupled Lorenz systems with different coupling strengths and numbers of bins. The figure on the top uses equal space partitioning and the figure on the bottom uses equal frequency partitioning.
Figure 5.3: Estimation of (Shannon) mutual information for coupled Rossler systems with different coupling strengths and numbers of bins. The figure on the top uses equal space partitioning and the figure on the bottom uses equal frequency partitioning.
From Figures 5.1, 5.2 and 5.3, it can be seen that maximum marginal entropy partitioning yields slightly higher estimates than equal space partitioning; overall, the two quantization methods examined here yield nearly identical results. From Figure 5.2, it is arguable that equal frequency partitioning gives a "better" or smoother surface for the coupled Lorenz system. These results imply that the estimation is relatively insensitive to the partitioning method. For any fixed coupling strength, the line associated with each surface shows a similar increasing behaviour as the number of bins increases, consistent with (5.3). However, this trend is not a straight line, which implies that the relationship between the resolution of the quantization and the mutual information is nonlinear. As can be observed from the results shown in Figures 5.1, 5.2 and 5.3, the basic relationship between mutual information and coupling strength appears to be relatively insensitive to the number of bins. However, the number of bins used in the estimation process does affect the magnitude of the estimated mutual information. The use of normalized mutual information, presented in the next section, is proposed to improve the robustness of the entropy estimate to quantization. By using normalization, different numbers of bins can be applied to different data sets and still produce meaningful results. The number of quantized states still needs to be specified, and it must be reasonable relative to the amount of data being analyzed. The method given in [42] is proposed for selecting the number of quantized states. The criterion is as follows. For time series data X_{1:T}, the number of bins is chosen by

n_{bin} = \frac{(X_{max} - X_{min})\, T^{1/3}}{3.49\, \sigma_{sample}}    (5.9)

where

• X_{max} and X_{min} are the maximum and minimum values of X_{1:T}, respectively,
• \sigma_{sample} is the standard deviation of X_{1:T}.

Although this criterion is designed for a Gaussian i.i.d. random process, the results presented in this section strongly suggest that consistent estimation results will be achieved even when it is applied to a different probability density. This criterion is adopted because it selects a reasonable number of bins. In addition, the ratio between (X_{max} − X_{min}) and \sigma_{sample} makes the selection of the number of bins scale-invariant. This property is strongly desirable since any scaling of the value space of the data should not affect the partition; thus, this criterion makes normalization of the original data unnecessary. Note that the numbers of quantization states for X_{1:T} and Y_{1:T} can be different when the Shannon mutual information between X_{1:T} and Y_{1:T} is computed. The results in the following section use (5.9) to compute the numbers of bins for any pair of data for mutual information estimation.
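The criterion (5.9) is straightforward to evaluate; a minimal sketch, using the sample standard deviation, is given below.

```python
import numpy as np

def n_bins(x):
    """Number of quantized states for a time series, following (5.9)."""
    x = np.asarray(x, dtype=float)
    T = x.size
    width = 3.49 * np.std(x, ddof=1) * T ** (-1.0 / 3.0)   # histogram bin width rule of [42]
    return max(1, int(np.ceil((x.max() - x.min()) / width)))
```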
5.2.3 Comparative study of Modification on Shannon Mutual Information
It is well-known that scrambling data does not affect the computation of Shannon entropy (and
Shannon mutual information). One way to make Shannon entropy sensitive to ordering of the data
sequence is to use the notion of embedding as follows. For any time series data X1:T , construct
T
T
Z1:T −n where Zi = [XiT , Xi+d
, . . . , Xi+d
]T . Basically, this is an injective mapping between
1
( n−1)
X1:T and Z1:T −n where the vector Z1:T −n lies in a higher dimensional space. Z1:T −n can be used
to compute the entropy instead of original data. Theoretically, the number of possible values of
Zi is exponentially larger than that of Xi . However, practically, the trajectories of Z1:T −n do not
span all of Rn . Therefore, using Z1:T −n to estimate entropy and mutual information is possible.
An alternative transformation of X1:T is also proposed in the form of the vector Z̃1:T where Z̃i =
[X1T , X2T − X1T , , XTT − XTT−1 ]T . This Z̃1:T describes X1:T in term of the initial point and the changes
in values along trajectory. These changes can be interpreted as an attempt to capture the derivative
of X1:T for the case when X1:T is a sample version of continuous-time data. This construction also
guarantees that the scrambled version of the original data will not have the same Shannon entropy
and/or mutual information as the original data. Arguably, if T is large, omitting X1 from Z̃1:T will
not significantly affect the computation of entropy. A testbed using the data generated from the
systems (5.6), (5.7) and (5.8) is constructed with the specified parameters and the performance of
these two modifications on the computation of Shannon mututal information is observed. For the
time series data X1:T and Y1:T , the following vectors are constructed,
1. $Z^{(1)}_{1:T-1}$ where $Z^{(1)}_i = [\, X_i^{\top},\ X_{i+1}^{\top} \,]^{\top}$,

2. $Z^{(2)}_{1:T-1}$ where $Z^{(2)}_i = [\, Y_i^{\top},\ Y_{i+1}^{\top} \,]^{\top}$,

3. $\tilde{Z}^{(1)}_{1:T}$ where $\tilde{Z}^{(1)}_i = X_{i+1}^{\top} - X_i^{\top}$,

4. $\tilde{Z}^{(2)}_{1:T}$ where $\tilde{Z}^{(2)}_i = Y_{i+1}^{\top} - Y_i^{\top}$.

The computations of the normalized versions of $I(X_{1:T}; Y_{1:T})$, $I(Z^{(1)}_{1:T-1}; Z^{(2)}_{1:T-1})$
and $I(\tilde{Z}^{(1)}_{1:T}; \tilde{Z}^{(2)}_{1:T})$ are compared. For each computation of mutual
information, the binning is performed using the criterion in (5.9) for each time series involved. Then the
Shannon mutual information is computed from the symbol sequences resulting from the quantization.
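To make the three estimators concrete, the following sketch computes the quantities being compared, building on the quantization sketch given earlier. The geometric-mean normalization $I(X;Y)/\sqrt{H(X)H(Y)}$ is assumed here purely for illustration; the normalization actually used is the one defined in the preceding section. For the embedded vectors (items 1 and 2), each coordinate would be quantized with (5.9) and the resulting pair of bin indices fused into a single symbol before calling mutual_information.

```python
def entropy(sym):
    """Shannon entropy (in nats) of a discrete symbol sequence."""
    _, counts = np.unique(sym, return_counts=True)
    p = counts / counts.sum()
    return float(-np.sum(p * np.log(p)))

def mutual_information(a, b):
    """Shannon mutual information between two equal-length integer symbol sequences."""
    joint = np.zeros((int(a.max()) + 1, int(b.max()) + 1))
    np.add.at(joint, (a, b), 1.0)          # joint histogram of symbol pairs
    joint /= joint.sum()
    px, py = joint.sum(axis=1), joint.sum(axis=0)
    nz = joint > 0
    return float(np.sum(joint[nz] * np.log(joint[nz] / np.outer(px, py)[nz])))

def normalized_mi(a, b):
    """Normalized MI (geometric-mean normalization, assumed here for illustration)."""
    return mutual_information(a, b) / np.sqrt(entropy(a) * entropy(b))

def embedded(x):
    """Items 1 and 2 above: Z_i = [x_i, x_{i+1}] for a scalar series x."""
    return np.column_stack((x[:-1], x[1:]))

def differenced(x):
    """Items 3 and 4 above: the 'derivative' series, x_{i+1} - x_i."""
    return np.diff(x)
```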
For the first test, a very long data sequence is generated for each system and each coupling
strength. The number of data points in each trajectory is 400,000. For each trajectory, a window size
of 8,000 is used, and the data in each window is used to compute the normalized mutual information. Instead
of using the value obtained from an individual window, an average of the estimated values
from different windows is computed. The effect of choosing different windows for mutual information
estimation is examined, and the results from choosing 50 non-overlapping windows are compared
with the results when these windows overlap. The results for overlap sizes of 0, 1,000, 2,000, . . . ,
7,000 data points are presented. Note that the actual amount of data required for mutual
information estimation is reduced with increasing overlap; for example, only 56,000 data points
are needed when the overlap size is 7,000 using 50 consecutive overlapping windows. This provides
the motivation for the first test. The results for the three chaotic systems and their coupling strengths,
given in (5.6), (5.7), and (5.8), are shown in figures 5.4-5.6.
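Before turning to those figures, the windowed averaging just described can be sketched as follows, using the quantize and normalized_mi functions from the earlier sketches; the window size, overlap, and number of windows are parameters whose defaults are chosen to match the description above (here the whole series is quantized once for simplicity).

```python
def averaged_nmi(x, y, window=8000, overlap=0, n_windows=50):
    """Average normalized MI over consecutive windows with a given overlap."""
    x_sym, y_sym = quantize(x), quantize(y)
    step = window - overlap
    values = []
    for k in range(n_windows):
        start = k * step
        if start + window > len(x_sym):
            break
        values.append(normalized_mi(x_sym[start:start + window],
                                    y_sym[start:start + window]))
    return float(np.mean(values))
```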
Figure 5.4: Estimation of Shannon mutual information for coupled Hénon maps with different coupling
strengths using long time-series data and averaging over different windows. Top figure (a) contains windows
with overlap size 0 - 3,000. Bottom figure (b) contains windows with overlap size 4,000 - 7,000.
Figure 5.5: Estimation of Shannon mutual information for coupled Lorenz systems with different coupling
strengths using long time-series data and averaging over different windows. Top figure (a) contains the
windows with overlap size 0 - 3,000. Bottom figure (b) contains the windows with overlap size 4,000 - 7,000.
Figure 5.6: Estimation of Shannon mutual information for coupled Rossler systems with different coupling
strengths using long time-series data and averaging over different windows. Top figure (a) contains windows
with overlap size 0 - 3,000. Bottom figure (b) contains windows with overlap size 4,000 - 7,000.
First of all, among the three systems, the coupled Rossler systems are the least sensitive to the
selected windowing: there is almost no difference in the computed normalized mutual information across
the different window overlaps for any of the three methods. Even though the three methods yield different
estimated values, each shows a (desirable) monotone increasing trend. The results from direct computation
of mutual information and from the mutual information of the "derivative" have increasing trends that
are more linear than when using embedding. For the coupled Hénon maps, all methods show non-decreasing
trends for all overlapping windows. There are no apparent differences for low coupling,
C ∈ {0, 0.04, . . . , 0.66}, when using different overlapping window sizes. However, for overlap sizes of
6,000 and 7,000, all estimators give the maximum value of normalized mutual information for high coupling
levels, C ≥ 0.7. In other words, all methods lose their ability to distinguish high coupling for the coupled
Hénon maps when using large overlapping windows. This is a trade-off for using smaller data sets to
estimate the connection between two subsystems.
Next, the performance of the three estimators on the coupled Hénon maps is considered. First,
the embedding method clearly provides better discrimination for low coupling strength, C ≤ 0.7, and it
also provides a more "linear" increasing trend. The trade-off for this method is that, when the two
subsystems are uncoupled, it gives non-zero normalized mutual information. The mutual information
of the "derivative" gives a slightly more linear increasing trend than the regular mutual information
computation for low coupling, C ≤ 0.6, but the differences are not significant. There is no noticeable
difference between the latter two methods at high coupling, C ≥ 0.68.
The last system to be discussed is the coupled Lorenz system. In this system, when averaging over
non-overlapping windows, the embedding and "derivative" estimation methods give non-decreasing trends.
The usual Shannon mutual information also has a non-decreasing trend, except for the region
1.5 ≤ C ≤ 1.6, where there is a slight decrease in the estimated value. In the same coupling region, the
performance of all three methods starts to suffer when the windows overlap. The V-shaped trend in this
region is deeper when the overlap is larger; this can be seen clearly at an overlap size of 7,000. In
addition, the coupling range 1.6 ≤ C ≤ 1.9 shows the same effect with smaller magnitude. In
this system, there are no apparent differences between regular mutual information estimation and
the estimator using the difference. The embedding method is slightly worse than the rest for low
coupling levels; its lowest value is non-zero and the increasing trend is not as revealing.
In conclusion, the embedding method is preferable if the coupled Hénon maps are under consideration,
because it has better resolution for low coupling; there are no significant differences between the other
two methods. For the coupled Rossler systems, all three methods can be used despite their different
increasing trends. For the coupled Lorenz systems, the regular mutual information estimation and the
mutual information using the "derivative" have minor differences in performance, whereas the embedding
method is not suitable for this system. In addition, the ordinary Shannon mutual information, even
though it is invariant to shuffling of the data, still provides meaningful results. The mutual information
computation with the "derivative" can be applied since the additional computations are trivial; however,
there is no significant difference from the ordinary computation of mutual information between two data
streams. Finally, for a complex system such as the coupled Lorenz system, estimation of coupling using
normalized mutual information averaged over consecutive time windows greatly depends on the overlap of
the windows. The trade-off between requiring fewer data points and obtaining better estimates is shown in
the results. In contrast, for less complicated systems, overlapping windows allow the estimation of the
coupling with far fewer data points.
5.2.4 Additional Study of Mutual Information for Jump Systems and Systems with Varying Parameters
The previous section provided a comparison between three different methods for computing normalized
mutual information, using long data sequences with fixed coupling strength for (5.6), (5.7), and (5.8).
This setup does not reflect the real world, where measurements can be corrupted by (additive) noise, as
studied in [26]. Also, the parameters of the system can be time-varying; this is of primary interest here
since it can mimic sudden changes in the system dynamics due to faults, disturbances, and changes in
operating conditions. In this study, data is generated from (5.6), (5.7), and (5.8), and a window size of
8,000 points is used to compute the normalized mutual information. Each estimate is updated every 1,000
data points so that two consecutive windows overlap by 7,000 data points. The coupling strengths for each
system are now time-varying. Normalized versions of all three estimators, $I(X_{1:T}; Y_{1:T})$,
$I(Z^{(1)}_{1:T-1}; Z^{(2)}_{1:T-1})$ and $I(\tilde{Z}^{(1)}_{1:T}; \tilde{Z}^{(2)}_{1:T})$, given in
the previous section, are compared. In this section, each result is shown as a set of four panels. The
top-left panel is the estimate of normalized mutual information from single windows at different times.
The bottom-left panel is the estimate of normalized mutual information averaged over 50 consecutive
windows. The top-right panel shows the time-varying coupling strength. The bottom-right panel gives the
standard deviation of the estimates from 50 consecutive windows.
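A sketch of the per-window estimates and the 50-window running statistics used in these panels, again building on quantize and normalized_mi from the earlier sketches (names and defaults are illustrative):

```python
def sliding_estimates(x, y, window=8000, step=1000):
    """Normalized MI of each window, updated every `step` samples (7,000-sample overlap)."""
    x_sym, y_sym = quantize(x), quantize(y)
    return np.array([normalized_mi(x_sym[s:s + window], y_sym[s:s + window])
                     for s in range(0, len(x_sym) - window + 1, step)])

def running_stats(estimates, n=50):
    """Mean and standard deviation over each block of `n` consecutive estimates."""
    means = np.array([estimates[i - n:i].mean() for i in range(n, len(estimates) + 1)])
    stds = np.array([estimates[i - n:i].std() for i in range(n, len(estimates) + 1)])
    return means, stds
```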
The first experiment considers the case when the coupling undergoes two step changes over the
course of each trajectory. Each time series contains 500,000 points. The coupling starts at
0.8 for the coupled Hénon maps and at 2.0 for the coupled Lorenz and coupled Rossler systems. The
coupling strength is constant for 150,000 time steps before it experiences a step change to half of its
former value. The coupling strength stays at that value for another 150,000 data points and then
drops to zero for the rest of the trajectory. The coupling strength as a function of time, $C_t$, is given
by the step function
\[
C_t =
\begin{cases}
C_{\max}, & 0 \le t \le 150{,}000, \\
\tfrac{1}{2}\, C_{\max}, & 150{,}000 < t \le 300{,}000, \\
0, & 300{,}000 < t \le 500{,}000,
\end{cases}
\]
where
• Cmax = 0.8 for coupled Hénon maps,
• Cmax = 2.0 for coupled Lorenz and coupled Rossler systems.
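For reference, this step schedule can be written as a simple function of the time index (a sketch only; c_max is 0.8 or 2.0 as listed above):

```python
def coupling_step(t, c_max):
    """Piecewise-constant coupling strength used in the first experiment."""
    if t <= 150_000:
        return c_max
    if t <= 300_000:
        return 0.5 * c_max
    return 0.0
```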
Figure 5.7: Estimation of Shannon mutual information for coupled Hénon maps with varying coupling
strength, Ct. Ct = 0.8 for 0 ≤ t ≤ 150,000, Ct = 0.4 for 150,000 < t ≤ 300,000 and Ct = 0 for
300,000 < t ≤ 500,000.
Figure 5.8: Estimation of Shannon mutual information for coupled Lorenz systems with varying coupling
strength, Ct. Ct = 2.0 for 0 ≤ t ≤ 150,000, Ct = 1.0 for 150,000 < t ≤ 300,000 and Ct = 0.0 for
300,000 < t ≤ 500,000.
Figure 5.9: Estimation of Shannon mutual information for coupled Rossler systems with varying coupling
strength, Ct. Ct = 2.0 for 0 ≤ t ≤ 150,000, Ct = 1.0 for 150,000 < t ≤ 300,000 and Ct = 0.0 for
300,000 < t ≤ 500,000.
For the coupled Hénon maps and coupled Rossler systems, the results are similar, as can be seen from
figures 5.7 and 5.9. There are some fluctuations in the estimated values associated with the individual
windows. Regardless, all three methods can detect the instantaneous changes in coupling. The average
over 50 consecutive windows makes the estimation smoother, with the sharp transitions becoming monotone
decreasing trends. One interesting aspect is the standard deviation of the normalized mutual information
estimates from the 50 consecutive windows: the standard deviation graph consists of two lobes that are
directly associated with the step changes in coupling, and the distance between the peak of each lobe and
the actual change point is equal to the window size (8,000). In addition, the only difference among the
three methods is the magnitude of the estimates. Thus, in this scenario, all methods can be used to detect
sudden changes in system parameters. For the coupled Lorenz systems, there are fluctuations in the
instantaneous estimates of mutual information from all methods, as shown in figure 5.8. Even with an
average of 50 adjacent windows, the fluctuations remain, though with reduced magnitude. In addition, all
methods are unable to detect the second change in coupling strength, i.e., from 1.0 to 0. However, this
result is expected since all methods have limited sensitivity in distinguishing the different coupling
strengths for this system, as shown in figure 5.5. In addition, for the change in coupling from 2.0 to
1.0, there exists a lobe in the standard deviation of the 50 adjacent windows, with the peak occurring at
a time directly related to the time of the sudden change.
As can be seen from the previous section, information measures can be used to detect sudden
changes in system parameters as long as they have the ability to distinguish systems with different
constant parameters. However, in many real world applications, the parameters may be changing
more continuously with time. A second experiment is conducted in which the coupling strength decays with
time from its maximum value to zero, with the coupling strength as a function of time, $C_t$, given by
\[
C_t = 1 - \frac{1}{1 + e^{-\alpha (t - 250{,}000)}}
\]
where
• Cmax = 0.8 for coupled Hénon maps,
• Cmax = 2.0 for coupled Lorenz and coupled Rossler systems,
• α = 0.00005.
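A direct transcription of this decaying profile, with α and the midpoint 250,000 as given above (as written, the profile decays from approximately 1 to 0; any scaling by the listed values of Cmax is left to the simulation itself):

```python
import numpy as np

def coupling_sigmoid(t, alpha=0.00005, t0=250_000):
    """Smoothly decaying coupling strength used in the second experiment."""
    return 1.0 - 1.0 / (1.0 + np.exp(-alpha * (t - t0)))
```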
Figure 5.10: Estimation of Shannon mutual information for coupled Hénon maps with varying coupling
strength, $C_t = 1 - \frac{1}{1 + e^{-\alpha (t - 250{,}000)}}$, where α = 0.00005.
Figure 5.11: Estimation of Shannon mutual information for coupled Lorenz systems with varying coupling
strength, $C_t = 1 - \frac{1}{1 + e^{-\alpha (t - 250{,}000)}}$, where α = 0.00005.
Figure 5.12: Estimation of Shannon mutual information for coupled Rossler systems with varying coupling
strength, $C_t = 1 - \frac{1}{1 + e^{-\alpha (t - 250{,}000)}}$, where α = 0.00005.
Similar to the previous experiment, the results for the coupled Hénon maps and coupled Rossler systems
from all methods exhibit similar behavior, as shown in figures 5.10 and 5.12. There are some minor
fluctuations in the trend of the direct estimation for each time window, but the trend in this direct
estimation replicates the time-varying behavior of the coupling. The plot of the average value over 50
adjacent windows contains a smooth curve that is adequately close to the instantaneous coupling
strength. The plot of the standard deviation still contains a large lobe; however, there are fluctuations
within this lobe that make determining the change point difficult. This result is not unexpected since
there is no sudden change in this scenario. For the coupled Lorenz systems, the results are more complex,
as shown in figure 5.11. There are large fluctuations in the estimated mutual information during high
coupling and minor fluctuations during low coupling; all three methods suffer from this effect, in
different time windows. Estimation using averaging has smoother trends, with the fluctuations during high
coupling persisting at reduced magnitude and the fluctuations during low coupling almost totally
eliminated. Aside from ripples in the trend, these graphs track the time-varying coupling of the systems.
The standard deviation plots contain one relatively large lobe whose peak occurs during the transition of
the coupling strength from its maximum value to 0.
In conclusion, there are no significant differences among the three methods for estimating normalized
mutual information when the system parameters are varied. All three methods can be used to detect
both sudden changes and monotone decay for the coupled Hénon maps and coupled Rossler systems.
For the coupled Lorenz systems, only the transition in coupling from 2.0 to 1.0 can be detected. This
is not surprising given the results for distinguishing different (constant) coupling strengths in this
system. The standard deviation of the estimates over 50 consecutive windows has a relatively large lobe
associated with the change point, and its peak can be used to determine the exact time of a (step) change
in the system parameters. In contrast, the lobe contains some fluctuations if the change occurs more
smoothly. Although it is not ideal, the normalized mutual information computed by averaging 50
consecutive overlapping time windows is applicable for detecting both sudden and smooth changes
in the coupling parameters.
CHAPTER 6
Conclusions
In this thesis, the relationship between control theory and information theory was studied from
the systems perspective. The results from previous research focusing on feedback control and
state estimation were extended. Details were provided on how the system structure is directly related to
the mutual information between inputs and outputs. Focusing on the structure of the system
highlights the relationship to control and observer designs, and extends previous work on how
control theory and information theory are connected. The tight connection between mutual information
and the concepts of controllability and observability in linear systems has been shown explicitly in this
work, where the mutual information between inputs and outputs depends only on the controllable and
observable states of the system. Computation of the information measures for linear systems shows that
the mutual information between inputs and internal states depends only on data at each time instant.
This result is related to the Markov property of the evolution of the system states, and future research
is needed to further clarify this relation. The main limitation of the research on this topic is that the
results relating information measures to controllability and observability can only be proven rigorously
for linear systems. For nonlinear systems, only weaker results relating control inputs, observation
processes and internal states are presented; one can only conjecture that the same results on
controllability and observability as in linear systems hold. This missing piece is another challenging
problem that can be pursued in future research.
The method proposed for computing mutual information using the Hidden Markov Model as an
approximation of the underlying dynamical system does not give satisfactory results. This is due
to a variety of reasons. One of the main reasons is that the computation of the entropy of a Gaussian
mixture is still an open problem. Estimating this quantity using upper and lower bounds is not
sufficient for the computation of mutual information, because the resulting gap between the bounds on
mutual information is approximately three times larger than the gap between the bounds on entropy. There
are also problems associated with data-driven model construction. Degeneration of the model
parameters and inability to reasonably specify the model order are two of the main issues. If these
problems can be resolved, this approach to computing information measures will be of practical
value. Future research on the computation of entropy for Gaussian mixtures or different forms of
mixtures can potentially improve the results. In addition, further research on data-driven models
for continuous-valued random processes, including further development of infinite hidden Markov
Models [2] or Causal State Splitting Reconstruction (CSSR) [43, 44], will be extremely useful for
this approach. Despite the unsatisfying nature of the final result, this work significantly improved the
computation of upper and lower bounds on the entropy of Gaussian mixtures relative to existing methods.
This improvement is an important component of the proposed method and supports the robust computation of
entropy bounds for Gaussian mixtures in all situations considered. Finally, this approach for computing
mutual information is based on the assumption that the data is generated from an unknown discrete
dynamical system. If this assumption is invalid, this line of investigation is not recommended.
The results in this thesis also show that Shannon mutual information has a meaningful interpretation even
when the i.i.d. assumption is not satisfied. Independent sequences have zero Shannon mutual information,
while high values of mutual information can be interpreted as suggesting strong (statistical)
relationships/connectivity. In addition, the computation of mutual information for continuous-valued data
using quantization depends on the number of quantized states. In contrast, normalized Shannon mutual
information is shown, using computational examples, to be relatively insensitive to the number of
quantized states. In addition, the method of quantization does not significantly affect the computation
of Shannon mutual information. The results have been shown for coupled chaotic systems (coupled Hénon
maps, coupled Lorenz systems, and coupled Rossler systems). This work also showed that the average of
mutual information over overlapping and non-overlapping windows can be used to determine the relative
coupling strength of these three chaotic systems. Although the trends of the estimated mutual information
are nonlinear, they are generally non-decreasing with increased coupling strength. This thesis also showed
that, for data generated by these systems with time-varying coupling strength, normalized mutual
information computed by averaging sliding windows of data follows the changes in system coupling strength.
This result can potentially be useful in many applications in which a smooth or sudden change of
parameters is indicative of a fault that needs to be detected and isolated. Further study of this
result on different synthetic and real data is an interesting direction for future research.
Finally, an alternative information measure within the existing information-theoretic framework is
presented in Appendix A. It has been shown that this measure is related to mutual information, is
non-negative, and can be decomposed into four non-negative terms. Although an intensive study of this
measure has yet to be performed, further study of both the theoretical and application aspects of this
measure is an interesting direction for research.
Appendices
APPENDIX A
Alternative Information Measure and Relationship to Mutual Information
In this section, a different way of determining the relationship between two sets of time-series data
is investigated. This new measure uses concepts from information theory, focusing on information
related to the range of the random variables. The theoretical connection between this measure and
mutual information is also presented.
For random variables X ∈ {1, 2, . . . , $n_X$} and Y ∈ {1, 2, . . . , $n_Y$} with probability mass function
$p^D_{XY}(x, y)$, we can define an information measure based on the values of X and Y as follows:
\[
M_I(X, Y, x, y) = I(Z_X(x); Z_Y(y))
\tag{A.1}
\]
where
• $Z_X(x)$ is a Bernoulli random variable such that $\mathrm{Prob}(Z_X(x) = 1) = p^D_X(x)$,
• $Z_Y(y)$ is a Bernoulli random variable such that $\mathrm{Prob}(Z_Y(y) = 1) = p^D_Y(y)$.
The relationship between these random variables in terms of the value space can be computed by
\[
M_{total} = \sum_{x=1}^{n_X} \sum_{y=1}^{n_Y} M_I(X, Y, x, y).
\tag{A.2}
\]
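As a numerical illustration of (A.1) and (A.2), the following sketch computes $M_I$ and $M_{total}$ from a joint probability mass function given as a matrix. This is a minimal Python sketch; the function names and the matrix representation of the pmf are illustrative.

```python
import numpy as np

def mi_from_joint(joint):
    """Mutual information (in nats) of a discrete joint distribution given as a matrix."""
    joint = np.asarray(joint, dtype=float)
    px = joint.sum(axis=1, keepdims=True)
    py = joint.sum(axis=0, keepdims=True)
    nz = joint > 0
    return float(np.sum(joint[nz] * np.log(joint[nz] / (px * py)[nz])))

def m_indicator(p_xy, x, y):
    """M_I(X, Y, x, y) in (A.1): MI of the Bernoulli pair Z_X(x), Z_Y(y)."""
    px, py, pxy = p_xy[x, :].sum(), p_xy[:, y].sum(), p_xy[x, y]
    # 2x2 joint distribution of the indicator variables (rows: Z_X, columns: Z_Y).
    joint2 = np.array([[1.0 - px - py + pxy, py - pxy],
                       [px - pxy, pxy]])
    return mi_from_joint(joint2)

def m_total(p_xy):
    """M_total in (A.2): sum of M_I(X, Y, x, y) over all value pairs (x, y)."""
    nx, ny = p_xy.shape
    return sum(m_indicator(p_xy, x, y) for x in range(nx) for y in range(ny))
```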
The information measure $M_{total}$ in (A.2) is closely related to the mutual information between X and Y,
and this relationship can be shown by expanding (A.2) as follows:
\begin{align}
M_{total} &= \sum_{x=1}^{n_X} \sum_{y=1}^{n_Y} M_I(X, Y, x, y), \tag{A.3} \\
M_{total} &= \sum_{x=1}^{n_X} \sum_{y=1}^{n_Y} \mathrm{Prob}(X = x, Y = y) \log \frac{\mathrm{Prob}(X = x, Y = y)}{\mathrm{Prob}(X = x)\,\mathrm{Prob}(Y = y)} \nonumber \\
&\quad + \sum_{x=1}^{n_X} \sum_{y=1}^{n_Y} \mathrm{Prob}(X \neq x, Y = y) \log \frac{\mathrm{Prob}(X \neq x, Y = y)}{\mathrm{Prob}(X \neq x)\,\mathrm{Prob}(Y = y)} \nonumber \\
&\quad + \sum_{x=1}^{n_X} \sum_{y=1}^{n_Y} \mathrm{Prob}(X = x, Y \neq y) \log \frac{\mathrm{Prob}(X = x, Y \neq y)}{\mathrm{Prob}(X = x)\,\mathrm{Prob}(Y \neq y)} \nonumber \\
&\quad + \sum_{x=1}^{n_X} \sum_{y=1}^{n_Y} \mathrm{Prob}(X \neq x, Y \neq y) \log \frac{\mathrm{Prob}(X \neq x, Y \neq y)}{\mathrm{Prob}(X \neq x)\,\mathrm{Prob}(Y \neq y)}. \tag{A.4}
\end{align}
Now, define
\begin{align}
C(\bar{X}, Y) &= \sum_{x=1}^{n_X} \sum_{y=1}^{n_Y} \mathrm{Prob}(X \neq x, Y = y) \log \frac{\mathrm{Prob}(X \neq x, Y = y)}{\mathrm{Prob}(X \neq x)\,\mathrm{Prob}(Y = y)}, \tag{A.5} \\
C(X, \bar{Y}) &= \sum_{x=1}^{n_X} \sum_{y=1}^{n_Y} \mathrm{Prob}(X = x, Y \neq y) \log \frac{\mathrm{Prob}(X = x, Y \neq y)}{\mathrm{Prob}(X = x)\,\mathrm{Prob}(Y \neq y)}, \tag{A.6} \\
C(\bar{X}, \bar{Y}) &= \sum_{x=1}^{n_X} \sum_{y=1}^{n_Y} \mathrm{Prob}(X \neq x, Y \neq y) \log \frac{\mathrm{Prob}(X \neq x, Y \neq y)}{\mathrm{Prob}(X \neq x)\,\mathrm{Prob}(Y \neq y)}. \tag{A.7}
\end{align}
With (A.5), (A.6), (A.7) and (A.4), $M_{total}$ can be written as
\[
M_{total} = I(X; Y) + C(\bar{X}, Y) + C(X, \bar{Y}) + C(\bar{X}, \bar{Y}). \tag{A.8}
\]
By the definition of $M_{total}$ in (A.2), it follows that
\[
M_{total} \geq 0. \tag{A.9}
\]
The interesting part is that all four quantities on the right-hand side of (A.8) are non-negative.
Clearly, I(X; Y) ≥ 0 follows from (2.19); the non-negativity of the remaining three terms is formally
stated in the following two lemmas.
Lemma A.1. With the definitions in (A.5) and (A.6), C(X̄, Y) ≥ 0 and C(X, Ȳ) ≥ 0.

Proof. The proof is only required for either C(X̄, Y) or C(X, Ȳ). Consider
\begin{align*}
C(\bar{X}, Y) &= \sum_{x=1}^{n_X} \sum_{y=1}^{n_Y} \mathrm{Prob}(X \neq x, Y = y) \log \frac{\mathrm{Prob}(X \neq x, Y = y)}{\mathrm{Prob}(X \neq x)\,\mathrm{Prob}(Y = y)} \\
&\geq \left( \sum_{x=1}^{n_X} \sum_{y=1}^{n_Y} \mathrm{Prob}(X \neq x, Y = y) \right) \log \frac{\sum_{x=1}^{n_X} \sum_{y=1}^{n_Y} \mathrm{Prob}(X \neq x, Y = y)}{\sum_{x=1}^{n_X} \sum_{y=1}^{n_Y} \mathrm{Prob}(X \neq x)\,\mathrm{Prob}(Y = y)} && \text{[log sum inequality]} \\
&= \left( \sum_{x=1}^{n_X} \mathrm{Prob}(X \neq x) \right) \log \frac{\sum_{x=1}^{n_X} \mathrm{Prob}(X \neq x)}{\sum_{x=1}^{n_X} \mathrm{Prob}(X \neq x)} \\
&= \left( \sum_{x=1}^{n_X} \mathrm{Prob}(X \neq x) \right) \log 1 \\
&= 0.
\end{align*}
Therefore, this completes the proof of the lemma.
Lemma A.2. With the definition in (A.7), C(X̄, Ȳ ) ≥ 0.
Proof.
\begin{align*}
C(\bar{X}, \bar{Y}) &= \sum_{x=1}^{n_X} \sum_{y=1}^{n_Y} \mathrm{Prob}(X \neq x, Y \neq y) \log \frac{\mathrm{Prob}(X \neq x, Y \neq y)}{\mathrm{Prob}(X \neq x)\,\mathrm{Prob}(Y \neq y)} \\
&\geq \left( \sum_{x=1}^{n_X} \sum_{y=1}^{n_Y} \mathrm{Prob}(X \neq x, Y \neq y) \right) \log \frac{\sum_{x=1}^{n_X} \sum_{y=1}^{n_Y} \mathrm{Prob}(X \neq x, Y \neq y)}{\sum_{x=1}^{n_X} \sum_{y=1}^{n_Y} \mathrm{Prob}(X \neq x)\,\mathrm{Prob}(Y \neq y)} && \text{[log sum inequality]} \\
&= \left( \sum_{x=1}^{n_X} \sum_{y=1}^{n_Y} \bigl( \mathrm{Prob}(X \neq x) - \mathrm{Prob}(X \neq x, Y = y) \bigr) \right) \log \frac{\sum_{x=1}^{n_X} \sum_{y=1}^{n_Y} \bigl( \mathrm{Prob}(X \neq x) - \mathrm{Prob}(X \neq x, Y = y) \bigr)}{\sum_{x=1}^{n_X} \mathrm{Prob}(X \neq x) \sum_{y=1}^{n_Y} \mathrm{Prob}(Y \neq y)} \\
&= \left( \sum_{x=1}^{n_X} (n_Y - 1)\, \mathrm{Prob}(X \neq x) \right) \log \frac{\sum_{x=1}^{n_X} (n_Y - 1)\, \mathrm{Prob}(X \neq x)}{\sum_{x=1}^{n_X} \bigl( 1 - \mathrm{Prob}(X = x) \bigr) \sum_{y=1}^{n_Y} \bigl( 1 - \mathrm{Prob}(Y = y) \bigr)} \\
&= \left( \sum_{x=1}^{n_X} (n_Y - 1) \bigl( 1 - \mathrm{Prob}(X = x) \bigr) \right) \log \frac{\sum_{x=1}^{n_X} (n_Y - 1) \bigl( 1 - \mathrm{Prob}(X = x) \bigr)}{\sum_{x=1}^{n_X} \bigl( 1 - \mathrm{Prob}(X = x) \bigr) \sum_{y=1}^{n_Y} \bigl( 1 - \mathrm{Prob}(Y = y) \bigr)} \\
&= (n_X - 1)(n_Y - 1) \log \frac{(n_X - 1)(n_Y - 1)}{(n_X - 1)(n_Y - 1)} \\
&= (n_X - 1)(n_Y - 1) \log 1 \\
&= 0.
\end{align*}
Therefore, this completes the proof of the lemma.
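The decomposition (A.8) and the non-negativity of its terms can also be checked numerically, building on the m_total and mi_from_joint sketch given after (A.2). The joint pmf below is an arbitrary illustrative example.

```python
def c_term(p_xy, flip_x, flip_y):
    """C(X-bar, Y), C(X, Y-bar) or C(X-bar, Y-bar) in (A.5)-(A.7), chosen by the flags."""
    nx, ny = p_xy.shape
    px, py = p_xy.sum(axis=1), p_xy.sum(axis=0)
    total = 0.0
    for x in range(nx):
        for y in range(ny):
            mx = 1.0 - px[x] if flip_x else px[x]        # Prob(X != x) or Prob(X = x)
            my = 1.0 - py[y] if flip_y else py[y]        # Prob(Y != y) or Prob(Y = y)
            if flip_x and flip_y:
                pj = 1.0 - px[x] - py[y] + p_xy[x, y]    # Prob(X != x, Y != y)
            elif flip_x:
                pj = py[y] - p_xy[x, y]                  # Prob(X != x, Y = y)
            elif flip_y:
                pj = px[x] - p_xy[x, y]                  # Prob(X = x, Y != y)
            else:
                pj = p_xy[x, y]                          # Prob(X = x, Y = y)
            if pj > 0:
                total += pj * np.log(pj / (mx * my))
    return total

p = np.array([[0.20, 0.05, 0.10],
              [0.05, 0.30, 0.05],
              [0.05, 0.05, 0.15]])                       # an arbitrary joint pmf
terms = [mi_from_joint(p), c_term(p, True, False), c_term(p, False, True), c_term(p, True, True)]
assert all(t >= 0 for t in terms)                        # each term in (A.8) is non-negative
assert abs(m_total(p) - sum(terms)) < 1e-9               # (A.8) holds numerically
```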
Bibliography
[1] Roy L Adler, Alan G Konheim, and M Harry McAndrew. Topological entropy. Transactions
of the American Mathematical Society, pages 309–319, 1965.
[2] Matthew J Beal, Zoubin Ghahramani, and Carl E Rasmussen. The infinite hidden markov
model. In Advances in neural information processing systems, pages 577–584, 2001.
[3] Rufus Bowen. Entropy for group endomorphisms and homogeneous spaces. Transactions of the
American Mathematical Society, 153:401–414, 1971.
[4] R. Clausius and T.A. Hirst. The Mechanical Theory of Heat: With Its Applications to the
Steam-engine and to the Physical Properties of Bodies. J. Van Voorst, 1867.
[5] Aviad Cohen and Itamar Procaccia. Computing the kolmogorov entropy from time signals of
dissipative and conservative dynamical systems. Phys. Rev. A, 31:1872–1882, Mar 1985.
[6] Roger C. Conant. Laws of information which govern systems. Systems, Man and Cybernetics,
IEEE Transactions on, SMC-6(4):240–255, April 1976.
[7] E. I. Dinaburg. The relation between topological entropy and metric entropy. Soviet Math.
Dokl., 11:13–16, 1970.
[8] James Dougherty, Ron Kohavi, and Mehran Sahami. Supervised and unsupervised discretization of continuous features. In Machine Learning: Proceedings Of The Twelfth International
Conference, pages 194–202. Morgan Kaufmann, 1995.
[9] J-P Eckmann and David Ruelle. Ergodic theory of chaos and strange attractors. Reviews of
modern physics, 57(3):617–656, 1985.
[10] Xiangbo Feng and K.A. Loparo. Active probing for information in control systems with quantized state measurements: a minimum entropy approach. Automatic Control, IEEE Transactions on, 42(2):216–238, 1997.
[11] Xiangbo Feng, K.A. Loparo, and Yuguang Fang. Optimal state estimation for stochastic systems: an information theoretic approach. Automatic Control, IEEE Transactions on, 42(6):771–
785, 1997.
[12] Josiah Willard Gibbs. On the equilibrium of heterogeneous substances. 1879.
[13] P. Grassberger. Estimating the information content of symbol sequences and efficient codes.
Information Theory, IEEE Transactions on, 35(3):669–675, 1989.
[14] Peter Grassberger. Finite sample corrections to entropy and dimension estimates. Physics
Letters A, 128(6–7):369–373, 1988.
[15] Peter Grassberger and Itamar Procaccia. Estimation of the kolmogorov entropy from a chaotic
signal. Physical review A, 28(4):2591–2593, 1983.
[16] Ralph V. L. Hartley. Transmission of information. Bell System Technical Journal, 7(3):535–563,
1928.
[17] H. Herzel, A.O. Schmitt, and W. Ebeling. Finite sample effects in sequence analysis. Chaos, Solitons & Fractals, 4(1):97–113, 1994 (Chaos and Order in Symbolic Sequences).
[18] Hanspeter Herzel and Ivo Große. Measuring correlations in symbol sequences. Physica A:
Statistical Mechanics and its Applications, 216(4):518–542, 1995.
[19] M.F. Huber, Tim Bailey, H. Durrant-Whyte, and U.D. Hanebeck. On entropy approximation
for gaussian mixture random vectors. In Multisensor Fusion and Integration for Intelligent
Systems, 2008. MFI 2008. IEEE International Conference on, pages 181–188, 2008.
[20] S. Janjarasjitt and K.A. Loparo. An approach for characterizing coupling in dynamical systems.
Physica D: Nonlinear Phenomena, 237(19):2482 – 2486, 2008.
[21] B.-H. Juang, S.E. Levinson, and M.M. Sondhi. Maximum likelihood estimation for multivariate
mixture observations of markov chains (corresp.). Information Theory, IEEE Transactions on,
32(2):307–309, 1986.
[22] Paul Kalata and Roland Priemer. Linear prediction, filtering, and smoothing: An informationtheoretic approach. Information Sciences, 17(1):1 – 14, 1979.
[23] I. Kontoyiannis, P.H. Algoet, Yu M. Suhov, and A.J. Wyner. Nonparametric entropy estimation
for stationary processes and random fields, with applications to english text. Information
Theory, IEEE Transactions on, 44(3):1319–1327, 1998.
[24] Sotiris Kotsiantis and Dimitris Kanellopoulos. Discretization techniques: A recent survey, 2006.
[25] Alexander Kraskov, Harald Stögbauer, and Peter Grassberger. Estimating mutual information.
Phys. Rev. E, 69:066138, Jun 2004.
[26] Thomas Kreuz, Florian Mormann, Ralph G. Andrzejak, Alexander Kraskov, Klaus Lehnertz,
and Peter Grassberger. Measuring synchronization in coupled model systems: A comparison of
different approaches. Physica D: Nonlinear Phenomena, 225(1):29 – 42, 2007.
[27] L. Liporace. Maximum likelihood estimation for multivariate observations of markov sources.
Information Theory, IEEE Transactions on, 28(5):729–734, 1982.
[28] Tobias P. Mann. Numerically stable hidden markov model implementation. An HMM scaling
tutorial, pages 1–8, 2006.
[29] Thomas M. Cover and Joy A. Thomas. Elements of Information Theory. John Wiley & Sons,
Inc., New Jersey, 2nd edition, 2006.
[30] Joseph V Michalowicz, Jonathan M Nichols, and Frank Bucholtz. Calculation of differential
entropy for a mixed gaussian distribution. Entropy, 10(3):200–206, 2008.
[31] E. Olofsen, J. Degoede, and R. Heijungs. A maximum likelihood approach to correlation dimension and entropy estimation. Bulletin of Mathematical Biology, 54(1):45–58, 1992.
[32] D.S. Ornstein and B. Weiss. Entropy and data compression schemes. Information Theory,
IEEE Transactions on, 39(1):78–83, 1993.
[33] Steven M Pincus. Approximate entropy as a measure of system complexity. Proceedings of the
National Academy of Sciences, 88(6):2297–2301, 1991.
[34] A. Poritz. Linear predictive hidden markov models and the speech signal. In Acoustics, Speech,
and Signal Processing, IEEE International Conference on ICASSP ’82., volume 7, pages 1291–
1294, 1982.
[35] A. Quas. An entropy estimator for a class of infinite alphabet processes. Theory of Probability
& Its Applications, 43(3):496–507, 1999.
[36] L. Rabiner. A tutorial on hidden markov models and selected applications in speech recognition.
Proceedings of the IEEE, 77(2):257–286, 1989.
[37] Fazlollah M. Reza. An introduction to Information Theory. McGraw-Hill, New York, 1962.
[38] Alfréd Rényi. On measures of entropy and information, 1961.
[39] Mark S. Roulston. Estimating the errors on measured entropy and mutual information. Physica
D: Nonlinear Phenomena, 125(3–4):285–294, 1999.
[40] A.R. Runnalls. Kullback-leibler approach to gaussian mixture reduction. Aerospace and Electronic Systems, IEEE Transactions on, 43(3):989–999, 2007.
[41] G.N. Saridis. Entropy formulation of optimal and adaptive control. Automatic Control, IEEE
Transactions on, 33(8):713–721, 1988.
[42] David W. Scott. On optimal and data-based histograms. Biometrika, 66(3):pp. 605–610, 1979.
[43] Cosma Rohilla Shalizi and Kristina Lisa Shalizi. Blind construction of optimal nonlinear recursive predictors for discrete sequences. CoRR, cs.LG/0406011, 2004.
[44] Cosma Rohilla Shalizi, Kristina Lisa Shalizi, and James P. Crutchfield. An algorithm for pattern
discovery in time series. CoRR, cs.LG/0210025, 2002.
[45] C. E. Shannon. A mathematical theory of communication. The Bell System Technical Journal,
5(1):3–55, January 2001.
[46] Paul C. Shields. String matching: The ergodic case. The Annals of Probability, 20(3):pp.
1199–1203, 1992.
[47] Ya G Sinai. Metric entropy of dynamical system. Preprint.
[48] Ya. G. Sinai. On the notion of entropy of a dynamical system. Doklady of Russian Academy of
Sciences, 124:768–771, 1959.
[49] Ralf Steuer, Jürgen Kurths, Carsten O Daub, Janko Weise, and Joachim Selbig. The mutual
information: detecting and evaluating dependencies between variables. Bioinformatics, 18(suppl
2):S231–S240, 2002.
[50] F. Takens. Invariants related to dimension and entropy. In Atas do 13º Colóquio Brasileiro de
Matemática, Rio de Janeiro.
[51] Yutaka Tomita, Shigeru Omatu, and Takashi Soeda. An application of the information theory
to filtering problems. Information Sciences, 11(1):13 – 27, 1976.
[52] Y.A. Tsai, Francisco A. Casiello, and K.A. Loparo. Discrete-time entropy formulation of optimal
and adaptive control problems. Automatic Control, IEEE Transactions on, 37(7):1083–1088,
1992.
[53] H. Weidemann and Edwin B. Stear. Entropy analysis of estimating systems. Information
Theory, IEEE Transactions on, 16(3):264–270, 1970.
[54] A.D. Wyner and J. Ziv. Some asymptotic properties of the entropy of a stationary ergodic
data source with applications to data compression. Information Theory, IEEE Transactions
on, 35(6):1250–1258, 1989.
[55] J. Zaborszky. An information theory viewpoint for the general identification problem. Automatic
Control, IEEE Transactions on, 11(1):130–131, 1966.