Entropy and Compressibility of Symbol Sequences

PhysComp96
Full paper
Draft, February 23, 1997
Werner Ebeling
[email protected]
Thorsten Pöschel
[email protected]
Alexander Neiman
[email protected]
Institute of Physics, Humboldt University
Invalidenstr. 110, D-10115 Berlin, Germany
The purpose of this paper is to investigate long-range correlations in symbol sequences using methods of statistical physics and nonlinear dynamics. Besides the principal interest in the analysis of
correlations and fluctuations comprising many letters, our main aim here is related to the problem
of sequence compression. In spite of the great progress in this field achieved in the work of Shannon,
Fano, Huffman, Lempel, Ziv and others [1], many questions still remain open. In particular one must
note that since the basic work by Lempel and Ziv the improvement of the standard compression
algorithms has been rather slow, not exceeding a few percent per decade. On the other hand, several experts
have expressed the idea that the long-range correlations, which clearly exist in texts, computer programs
etc., are not sufficiently taken into account by the standard algorithms [1]. Thus, our interest in
compressibility is twofold:
(i) We would like to explore how far compressibility is able to measure correlations. In particular
we apply the standard algorithms to model sequences with known correlations such as, for instance,
repeat sequences and symbolic sequences generated by maps.
(ii) We aim to detect correlations which are not yet exploited by the standard compression algorithms
and which therefore belong to potential reservoirs for compression algorithms.
First, the higher-order Shannon entropies are calculated. For repeat sequences analytic estimates are
derived, which apply in some approximation to DNA sequences too. For symbolic strings obtained
from special nonlinear maps and for several long texts a characteristic root law for the entropy
scaling is detected. Then the compressibilities are estimated by using grammar representations and
several standard computing algorithms. In particular we use the Lempel-Ziv compression algorithm.
Further, the mean square fluctuations of the composition with respect to the letter content and
several characteristic scaling exponents are calculated. We show finally that all these measures
are able to detect long-range correlations. However, as demonstrated by shuffling experiments,
different measuring methods operate on different length scales. The algorithms based on entropy or
compressibility operate mainly on the word and sentence level. The characteristic scaling exponents
reflect mainly the long-wave fluctuations of the composition which may comprise a few hundreds or
thousands of letters.
1
Correlation measures and investigated sequences

Characteristic quantities which measure long correlations are dynamic entropies [2, 3, 4, 5, 6], correlation functions and mean square deviations, 1/f^δ noise [7, 8], scaling exponents [9, 10], higher order cumulants [11] and mutual information [12, 13]. Our working hypothesis, which we formulated in earlier papers [3, 8], is that texts and DNA show some structural analogies to strings generated by nonlinear processes at bifurcation points. This is demonstrated here first by the analysis of the behaviour of the higher order entropies. Further analysis is based on the mapping of the text to random walk and to “fluctuating gas” models as well as on the spectral analysis. These methods have found several applications to DNA sequences [9, 14, 12] and to human writings [11, 15].

At first we study model sequences with simple construction rules (nonlinear maps such as the logistic map and the circle map, stochastic sequences containing repeating blocks, etc.). A repeat sequence is defined as a random (Bernoulli-type) sequence on an alphabet A_1 . . . A_λ into which a given word of length s, written on the same alphabet, is introduced ν times (ν being large). This simple type of sequence, which most easily admits analytical calculations, was introduced by Herzel et al. [13] for modelling the structure of DNA strings. Repeat sequences may be generated in the following way. First we define the repeat by a given string of length s,

R = A_1 \ldots A_s

(i.e., len(R) = s). Then we
generate a Bernoulli sequence of length L0 with N0 = L0
letters on the alphabet R, A1 . . . Aλ . This sequence consists of ν repeats and N0 −ν “random” letters. Finally we
replace the letters R by the defining string. In this way
a string with N letters is obtained which is like a sea of random
letters with interspersed “ordered repeats”.
This string contains sν repeat letters. Going along the
string, each time we meet a repeat we first have to identify it. Let us assume we need the first s_c ≪ s letters
for the identification. After any identification of a repeat we know where we are in the string and know how
to continue, i.e. the uncertainty is decreased. The repeat
sequences defined in the way described above have well
defined correlations with a range of s. Beyond the distance s the correlations are destroyed by the stochastic
symbols inter-dispersed between the repeats. This procedure may be continued in a hierarchical way in order to
generate longer correlations and hierarchical structures
with some similarity to texts. In the special case that the
sequence consists only of one kind of repeats we get periodic sequences with the period s. In this case the range
of correlations is infinite. Sequences with slowly decaying long correlations are obtained from the logistic map
and the circle map at critical points [3, 8]. Further we
study long standard information carriers, e.g. books
and DNA strings, and compare them with the model sequences. In particular we studied the book “Moby Dick”
by Melville (L ≈ 1,170,200 letters), the German edition of the Bible (L ≈ 4,423,030 letters), Grimm's Tales
(L ≈ 1,435,820 letters), “The Brothers Karamazov” by
Dostoevsky (L ≈ 1,896,000 letters), the DNA sequence
of the lambda-virus (L ≈ 50,000 letters) and that of yeast
(L ≈ 300,000 letters).
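For concreteness, here is a minimal sketch of the repeat-sequence construction in Python; the function name and the example parameters are ours, since the paper prescribes no implementation:

    import random

    def repeat_sequence(alphabet, repeat, n0, nu, seed=0):
        """Sketch of the construction described above: a Bernoulli string
        of n0 symbols on the alphabet {R, A1, ..., A_lambda}, in which the
        marker R occurs about nu times; every R is then replaced by the
        fixed repeat word, giving "ordered repeats" in a sea of random
        letters."""
        rng = random.Random(seed)
        p = nu / n0                        # probability of drawing the marker R
        out = []
        for _ in range(n0):
            if rng.random() < p:
                out.append(repeat)         # one ordered repeat of length s
            else:
                out.append(rng.choice(alphabet))
        return "".join(out)

    # Example: lambda = 4, repeat length s = 10, about nu = 200 repeats:
    seq = repeat_sequence("ACGT", "ACCGTTAGCA", n0=10_000, nu=200)
    # N = len(seq) is about n0 + nu*(s - 1) letters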
In order to find out the origin of the long-range correlations we studied the effects of shuffling long texts on
different levels (letters, words, sentences, pages and chapters). The shuffled texts were always compared
with the original one (without any shuffling). Of course
the original and the shuffled files have the same letter distribution. However, only the correlations on scales below
the shuffling level are conserved. The correlations (fluctuations) on higher levels, which are based on the large-scale
structure of texts as well as on semantic relations, are destroyed by shuffling.
2
Entropy and complexity of sequences

Let A_1 A_2 . . . A_n be the letters of a given substring of length n ≤ L. Let further p^{(n)}(A_1 . . . A_n) be the probability to find in the string a block with the letters A_1 . . . A_n. Then we may introduce the entropy per block of length n:

H_n = -\sum p^{(n)}(A_1 \ldots A_n) \log p^{(n)}(A_1 \ldots A_n),   (1)

h_n = H_{n+1} - H_n,   (2)

where the summation runs over all possible words of length n, i.e. over all words which could be found if the text had infinite length. For the case of repeat sequences we can carry out analytical calculations for the entropies [13]. For the periodic case N = νs the higher order entropies H_n are constant and the uncertainties h_n are zero,

h_n = 0, \quad H_n = H_s,   (3)

if n ≥ s. The lower order entropies for n ≤ s depend on the concrete structure of the individual repeat and can easily be calculated by simple counting. For the case of proper repeat sequences N ≥ sν approximative formulae for the entropies are available [13]. We find, for example (in log λ units),

h_n = 1 - (s - s_c)\nu/N,   (4)

H_n = H_s + (n - s)\,[1 - (s - s_c)\nu/N],   (5)

if n ≥ s. Our methods for the analysis of the entropy of natural sequences were explained in detail elsewhere [16]. We have shown that, at least in a reasonable approximation, the scaling of the entropy with the word length is given by a root law. Our best fit of the data obtained for texts on the 32-letter alphabet (measured in log 32 units) reads [17]

H_n \approx 0.5 \sqrt{n} + 0.05\,n + 1.7,   (6)

h_n \approx 0.25/\sqrt{n} + 0.05.   (7)

The dominating term is given by a root law corresponding to a rather long memory tail. We mention that a scaling law of the root type was first found by Hilberg, who made a new fit to Shannon's original data. We used our own data for n = 1, . . . , 26 but included Shannon's result for n = 100 as well.
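For illustration, a naive estimator of eqs. (1) and (2) in Python (plain maximum-likelihood counting; the finite-sample corrections of ref. [16], which become essential for large n, are deliberately omitted in this sketch):

    from collections import Counter
    from math import log

    def block_entropy(seq, n, lam=32):
        """Naive estimate of the block entropy H_n, eq. (1), in
        log-lambda units."""
        assert len(seq) > n, "sequence too short"
        counts = Counter(seq[i:i + n] for i in range(len(seq) - n + 1))
        total = sum(counts.values())
        return -sum(c / total * log(c / total, lam) for c in counts.values())

    def uncertainty(seq, n, lam=32):
        """h_n = H_{n+1} - H_n, eq. (2)."""
        return block_entropy(seq, n + 1, lam) - block_entropy(seq, n, lam)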
The concept of algorithmic entropy goes back to
Kolmogorov and Chaitin and is based on the idea that
sequences have a “shortest representation”. The relation between the Shannon entropy and the algorithmic
entropy is given by the theorems of Zvonkin and Levin.
Several concrete procedures to construct shorter representations are used in data compression algorithms [1].
One may assume that those algorithms are still far from
being optimal, since as a rule they do not take into account long correlations.
A rather simple algorithm for finding
compressed representations, which was proposed by Thiele and
Scheidereiter, is called the grammar complexity [6].
Let us consider a word p on a certain alphabet and let
K(p) be the length of the shortest representation of the given
string. Then we define the compressibility with respect
to a given algorithm by

c(p) = K(p)/K_0(p),   (8)

where K_0(p) is the length of the shortest representation of the corresponding
Bernoulli string on the same alphabet. For repeat sequences of length N we may find c(p) by using grammar
representations [6]. We get the following compressed representation:

S → A_1 . . . A_k R A_1 . . . A_s R A_1 . . . A_n R . . . ,
R → A_1 . . . A_s .
Figure 1: Lempel–Ziv complexities (dashed line) and scaling exponents of diffusion (full line) in dependence on the shuffling level.
From this representation we calculate the “grammar compressibility”. The length of the compressed representation is bounded by

K(p) \le (N - \nu(s - 1)) \log(\lambda + 1) + s \log \lambda,   (9)

and with K_0(p) = N \log \lambda this yields

c(p) = \frac{\log(\lambda + 1)}{\log \lambda}\,\bigl(1 - \rho(s - 1)\bigr) + \frac{s}{N},   (10)

where ρ = ν/N is the density of repeats.
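A small numerical check of eq. (10) as reconstructed above, with illustrative parameters of ours:

    from math import log

    def grammar_compressibility(lam, s, nu, N):
        """Evaluate eq. (10); rho = nu/N is the density of repeats."""
        rho = nu / N
        return log(lam + 1) / log(lam) * (1 - rho * (s - 1)) + s / N

    # lambda = 4, repeat length s = 10, nu = 200 repeats, N = 10000 letters:
    print(grammar_compressibility(4, 10, 200, 10_000))
    # about 0.95 < 1: the repeats make the string compressible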
The algorithmic entropy according to Lempel and Ziv is
introduced as the ratio of the length of the compressed
sequence (with respect to a Lempel–Ziv compression algorithm) to the original length.
Explicit results obtained for the Lempel–Ziv complexities (entropies) of several sequences were given in earlier
work [17] (tab. 1). The complexities in dependence on
the shuffling level are represented graphically for the text
“Moby Dick” in fig. 1.
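In practice any dictionary-based compressor can serve as a rough proxy; the sketch below uses zlib (DEFLATE, i.e. LZ77 plus Huffman coding), which is not the algorithm of [17] but rests on the same dictionary principle:

    import zlib

    def lz_compressibility(text):
        """Ratio of compressed to original length; values well below
        the value for a shuffled (Bernoulli-like) version of the same
        text indicate exploitable correlations."""
        raw = text.encode("utf-8")
        return len(zlib.compress(raw, 9)) / len(raw)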
Table 1: The Lempel–Ziv complexities for several original and
for the corresponding shuffled sequences.
3
Mean square fluctuations and correlation functions

In this part we follow the methods proposed by Peng et al. [9, 10] and the invariant representation proposed by Voss [7]. However, we shall use another language: instead of formulating the problem in terms of random walks we express it in terms of the fluctuation theory of statistical physics. Let us consider a sequence of total length L. Then the total number of letters is N = L and the density is equal to “1”. However, the number density of the different symbols may fluctuate along the string. In an earlier work [11] we considered, for example, the fluctuating local density of blanks in “Moby Dick” and pointed out the existence of rather long-wave fluctuations. We represented there the local frequency of the blanks (and other letters), averaged over windows of length 4,000, in dependence on the position along the text. The original text shows a large-scale structure extending over many windows. This reflects the fact that in some parts of the text we have many short words, e.g. in conversations (yielding the peaks of the space frequency), and in others we have more long words, e.g. in descriptions and in philosophical considerations (yielding the minima of the space frequency). The shuffled text shows a much weaker non-uniformity; the lower the shuffling level, the larger is the uniformity. More uniformity means fewer fluctuations and more similarity to a Bernoulli sequence. For the case of DNA sequences no analogues of pages, chapters, etc. are known. Nevertheless the reaction to shuffling is similar to that of texts.

In order to quantify these findings let us define the number of letters of kind k inside a substring of length l as N(k, l). In the limit l → ∞ we get the average density, n(k) = lim_{l→∞} N(k, l)/l. Since we have λ different symbols we get in this way a λ-dimensional composition space. Let us now consider the fluctuations of N(k, l) as a function of l. We expect that N(k, l) fluctuates around the mean value ⟨N(k)⟩ = n(k) l. Further we assume that the mean square fluctuations scale with a certain power of the mean (particle) numbers,

\langle [N(k, l) - \langle N(k) \rangle]^2 \rangle = \mathrm{const} \cdot \langle N(k) \rangle^{2\alpha_k}.   (11)

The exponent α_k is called the characteristic mean square fluctuation exponent. In an analogous way we consider the sum of the mean square fluctuations, defining an exponent α by

\sum_k \langle [N(k, l) - \langle N(k) \rangle]^2 \rangle = \mathrm{const} \cdot N^{2\alpha}.   (12)
The case α(k) = 0.5 corresponds to the normal behaviour
of mean square fluctuations which statistical mechanics
predicts in the absence of long-range correlations. If
α(k) > 0.5 we have an anomalous fluctuation behaviour
which reflects the existence of long-range correlations.
In this respect we may use the term “coherent fluctuations” [17].
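A sketch of how α_k may be estimated from eq. (11); window placement and the fitting procedure are our choices, and since ⟨N(k)⟩ = n(k)·l the slope against log l equals the slope against log ⟨N(k)⟩:

    from math import log

    def fluctuation_exponent(seq, k, window_sizes):
        """Estimate alpha_k of eq. (11).  The variance of N(k, l) over
        non-overlapping windows of length l scales as l**(2*alpha_k),
        so alpha_k is half of the log-log slope; seq should be much
        longer than the largest window."""
        n_k = seq.count(k) / len(seq)              # average density n(k)
        pts = []
        for l in window_sizes:
            sq = [(seq[i:i + l].count(k) - n_k * l) ** 2
                  for i in range(0, len(seq) - l, l)]
            pts.append((log(l), log(sum(sq) / len(sq))))
        mx = sum(x for x, _ in pts) / len(pts)
        my = sum(y for _, y in pts) / len(pts)
        slope = (sum((x - mx) * (y - my) for x, y in pts)
                 / sum((x - mx) ** 2 for x, _ in pts))
        return slope / 2                           # ~0.5 for a Bernoulli string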
One can easily see that the above definitions give the
same numbers for the α-coefficients as the mapping to a
random walk in a λ-dimensional space [11]. Let us briefly sketch this procedure: instead of the original string
consisting of λ different symbols we generate λ strings on
the binary alphabet (0, 1) (λ = 32 for texts). In the first
string we place a “1” on all positions where there is an “a”
in the original string and a “0” on all other positions. The
same procedure is carried out for the remaining symbols
too. Then we generate random processes corresponding
to these strings, moving one step upwards for any “1” and
remaining on the same level for any “0”. The resulting
move over a distance l is called y(k, l), where k denotes the
symbol. Then, by defining a λ-dimensional vector space
and considering y(k, l) as component k of the state vector
at the (discrete) “time” l, we can map the text to a trajectory. The corresponding procedure is carried out e.g. for
the DNA sequences, which are mapped to a random walk
on a 4-dimensional discrete space.
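A minimal sketch of this mapping (the naming is ours):

    def letter_walk(seq, k):
        """y(k, l) for symbol k: one step up for every occurrence of k,
        no move otherwise, as described above."""
        y, walk = 0, []
        for ch in seq:
            if ch == k:
                y += 1
            walk.append(y)
        return walk                     # walk[l - 1] = y(k, l)

    # The components y(k, l) for all lambda symbols form the state vector
    # at "time" l; for DNA (lambda = 4) this is a 4-dimensional trajectory.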
Let us now study the anomalous diffusion coefficients [10]. The mean-square displacement for symbol k
is determined by

F^2(k, l) = \langle y^2(k, l) \rangle - \langle y(k, l) \rangle^2,   (13)
where the brackets ⟨·⟩ denote the averaging over all initial
positions. The behaviour of F^2(k, l) for l ≫ 1 is the focus
of interest. It is expected that F(k, l) follows a power
law [10],
F(k, l) \propto l^{\alpha(k)},   (14)

where α(k) is the diffusion exponent for symbol k. We note that the diffusion exponent is related to the exponent of the power spectrum [10, 11]. Besides the individual diffusion exponents for the letters we get an averaged diffusion exponent α for the state space. The comparison of the formulae given above shows the complete equivalence of the fluctuation picture and the random walk picture.

We have investigated the characteristic exponents for several typical symbol sequences which were described above [17]. The data are summarized in tab. 1. In the same way we can obtain other important statistical quantities, such as higher-order moments and cumulants of y(k, l) (see [11]). By calculating the Hölder exponents D_q up to q = 6 we have shown that the higher order moments have (within the limits of accuracy) the same scaling behaviour as the second moment. We repeated the procedure described above for the shuffled files. The results of these calculations in dependence on the shuffling level are also shown in tab. 1; a graphical representation of the results for Moby Dick is given in fig. 1.

Figure 2: Double-logarithmic plot of the power spectrum for “Moby Dick”. The full line corresponds to the original text, the dashed line corresponds to the text shuffled on the chapter level and the dotted line to the text shuffled on the page level. Shuffling on a level below pages destroys the low-frequency branch.

We see from the numbers in tab. 1 and from fig. 2 that the original texts and DNA sequences show strong long-range correlations, i.e. the coefficients of anomalous diffusion are clearly different from 1/2. After shuffling below the page level the sequences become practically Bernoullian in comparison with the original ones, since the diffusion exponents decrease to a value of about 1/2. The decrease occurs in the shuffling regime between the page level and the chapter level. For DNA sequences the characteristic level of shuffling where the diffusion exponent goes to 1/2 is about 500–1000. Our result demonstrates that shuffling on the level of symbols, words, sentences or pages, or of segments of length 500–1000 in the DNA case, destroys the long-range correlations which are felt by the mean square deviations.
4
Conclusions
Our results show that the dynamic entropies, the compressibilities and the scaling of the mean square deviations are appropriate measures for the long-range correlations in symbolic sequences. However, as demonstrated by shuffling experiments, different measures operate on different length scales. The longest correlations found in our analysis comprise a few hundreds or thousands of letters and may be understood as long-wave fluctuations of the composition. These correlations (fluctuations) give rise to the anomalous diffusion and to coherent fluctuations extending over several hundreds or thousands of letters. There is some evidence that these correlations are based on the hierarchical organization of the sequences and on the structural relations between the levels. In other words, these correlations are connected with the grouping of sentences into hierarchical structures such as paragraphs, pages, chapters etc. Usually, inside a certain substructure, the text shows a greater uniformity on the letter level. Possibly a more careful comparison of the correlations in texts and in DNA sequences may contribute to a better understanding of the informational structure of DNA, in particular of its modular structure.

Our results clearly demonstrate that the longest-range correlations in information carriers are of structural origin. The entropy-like measures studied in section 2 operate on the sentence and the word level. In some sense entropies are the most complete quantitative measures of correlation relations, since the entropies include many-point correlations. On the other hand, the calculation of the higher order entropies is extremely difficult, and at the present moment there is no hope of extending the entropy analysis to the level of thousands of letters.

Hopefully the analysis of entropies, compressibilities and scaling exponents can be developed into useful instruments for studies of the large-scale structure of information-carrying sequences and may finally contribute to finding improved compression algorithms.
References

[1] Storer, James A., Data Compression: Methods and Theory, Computer Science Press (1988).

[2] Hilberg, W., “title???”, Frequenz 44 (1990), 243–???; 45 (1991), 1–???.

[3] Ebeling, Werner, and Gregoire Nicolis, “title???”, Chaos, Solitons & Fractals 2 (1992), 635–???.

[4] Ebeling, Werner, and Thorsten Pöschel, “Entropy and long-range correlations in literary English”, Europhys. Lett. 26 (1994), 241–246.

[5] Ebeling, Werner, Thorsten Pöschel, and Karl Friedrich Albrecht, “Transinformation and Word Distribution of Information-Carrying Sequences”, Int. J. Bifurcation & Chaos 5 (1995), 51–61.

[6] Ebeling, Werner, and Miguel Angel Jiménez-Montaño, “title???”, Math. Biosci. 52 (1980), 53–???.

[7] Voss, Richard F., “title???”, Phys. Rev. Lett. 68 (1992), 3805–???; Voss, Richard F., “title???”, Fractals 2 (1994), 1–???.

[8] Anishchenko, Vadim S., Werner Ebeling, and Aleksander B. Neiman, “title???”, Chaos, Solitons & Fractals 4 (1994), 69–???.

[9] Peng, C.-K., et al., “title???”, Nature 356 (1992), 168–???.

[10] Stanley, H. Eugene, et al., “title???”, Physica A 205 (1994), 214–???.

[11] Ebeling, Werner, and Aleksander Neiman, “title???”, Physica A 215 (1995), 233–???.

[12] Li, Wentian, and Kunihiko Kaneko, “title???”, Europhys. Lett. 17 (1992), 655–???.

[13] Herzel, Hans-Peter, Werner Ebeling, and Armin O. Schmitt, “Entropies of biosequences: the role of repeats”, Phys. Rev. E 50 (1994), 5061–???.

[14] Peng, C.-K., et al., “title???”, Phys. Rev. E 49 (1994), 1685–???.

[15] Schenkel, A., J. Zhang, and Y. Zhang, “title???”, Fractals 1 (1993), 47–???.

[16] Pöschel, Thorsten, Werner Ebeling, and Helge Rosé, “Guessing probability distributions from small samples”, J. Stat. Phys. 80 (1995), 1443–1452.

[17] Ebeling, Werner, Aleksander Neiman, and Thorsten Pöschel, “Dynamic entropies, long-range correlations and fluctuations in complex linear structures”. In: Suzuki, M., and Kawashima, N. (eds.), Coherent Approaches to Fluctuations, World Scientific (1995), ???–???.