
ISIT 1998, Cambridge, MA, USA, August 16 - August 21
Deterministic Computation of Complexity, Information and Entropy
Mark R. Titchener
Dept of Computer Science,
The University of Auckland,
Auckland, New Zealand.
Email: mark@tcode.auckland.ac.nz
Abstract - A new measure of string complexity [3] for finite strings is presented, based on a specific recursive hierarchical string production process (cf. [2]). From the maximal bound we deduce a relationship between complexity and total information content.
Given an alphabet A and a prefix-free code W ⊂ A^+, we define the generalized T-augmentation of W by:

    W_p^k = \bigcup_{i=0}^{k} p^i \, (W \setminus \{p\}) \;\cup\; \{p^{k+1}\}.    (1)

p ∈ W is referred to as the T-prefix, and k ∈ N as the corresponding T-expansion parameter. Applying Eqn (1) recursively, starting with A and subject to the recursive constraints p_1 ∈ A and p_i ∈ A^{k_1,...,k_{i-1}}_{p_1,...,p_{i-1}}, i = 2,...,n, yields the T-code set A^{k_1,...,k_n}_{p_1,...,p_n}.
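As an illustration of Eqn (1), the following Python sketch applies a generalized T-augmentation to an explicit prefix-free set; the function name, the binary alphabet and the chosen T-prefixes are illustrative assumptions, not part of the original development.

def t_augment(W, p, k):
    """Generalized T-augmentation of a prefix-free code W, per Eqn (1):
    the union over i = 0..k of p^i (W minus {p}), together with p^(k+1).
    W is a set of strings, p in W the T-prefix, k >= 1 the T-expansion
    parameter."""
    assert p in W and k >= 1
    rest = W - {p}                        # W with the T-prefix removed
    out = set()
    for i in range(k + 1):                # i = 0, 1, ..., k
        out |= {p * i + w for w in rest}  # prepend i copies of p
    out.add(p * (k + 1))                  # the codeword p^(k+1)
    return out

# Example: two augmentations starting from the alphabet A = {'0', '1'}.
A = {'0', '1'}
W1 = t_augment(A, '1', 1)     # {'0', '10', '11'}
W2 = t_augment(W1, '10', 2)   # T-prefix '10', T-expansion parameter 2
print(sorted(W1))
print(sorted(W2))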
We view generalized T-augmentation as a production process for the maximal-length strings in A^{k_1,...,k_n}_{p_1,...,p_n}, each having the form x = p_n^{k_n} p_{n-1}^{k_{n-1}} ... p_1^{k_1} a, a ∈ A. Conversely, given any x ∈ A^+, it is straightforward to derive the vectors (p_1,...,p_n) and (k_1,...,k_n) respectively.

We define our string complexity, denoted C_T(x), as the effective number of T-augmentation steps required to generate x from A. More formally:

    C_T(x) = \sum_{i=1}^{\mathrm{arity}(k)} \log_2(k_i + 1).    (2)
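Continuing the sketch above, and assuming the decomposition vectors (p_1,...,p_n) and (k_1,...,k_n) are already known, the following Python fragment rebuilds the maximal-length string x = p_n^{k_n} ... p_1^{k_1} a and evaluates Eqn (2); the particular prefixes and parameters are hypothetical, and the T-decomposition step that would recover them from an arbitrary x is not shown.

import math

def maximal_string(prefixes, ks, a):
    """Build x = p_n^{k_n} ... p_1^{k_1} a from T-prefixes (p_1,...,p_n),
    T-expansion parameters (k_1,...,k_n) and a final symbol a in A."""
    x = a
    for p, k in zip(prefixes, ks):   # p_1^{k_1} is prepended first
        x = p * k + x
    return x

def t_complexity(ks):
    """String complexity C_T in taugs, per Eqn (2)."""
    return sum(math.log2(k + 1) for k in ks)

# Hypothetical decomposition over A = {'0', '1'}, consistent with the
# augmentations sketched earlier: p = ('1', '10', '100'), k = (1, 2, 1).
prefixes, ks = ['1', '10', '100'], [1, 2, 1]
x = maximal_string(prefixes, ks, '0')
print(x, t_complexity(ks))   # '100101010', log2(2) + log2(3) + log2(2)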
An upper bound for C_T(x) as a function of string length is deduced from understanding the growth in the length of the maximal-length strings with minimal T-augmentation. Let d ∈ N^∞, i.e. d = (d_1, d_2, ..., d_m, ...), denote a distribution vector of unbounded arity whose elements d_i ∈ N represent the number of code strings of length i in A^{k_1,...,k_n}_{p_1,...,p_n}. More particularly, let d^{(l)} denote the distribution resulting from exhaustive simple T-augmentation, that is, where all code strings of length less than l are consumed in turn as T-prefixes p_i with corresponding k_i = 1. We assume a unit vector g = (1, 0, 0, ...) and define a shift operator σ_j such that d' = σ_j(d) is given by:

    d'_i = \begin{cases} 0 & \text{for } i < j \\ d_{i-j} & \text{for } i \ge j \end{cases}
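A minimal sketch of the shift operator σ_j follows, representing a finite prefix of the unbounded distribution vector as a Python list; the truncation and the 0-based indexing are practical concessions, not part of the original notation.

def shift(d, j):
    """Shift operator sigma_j: counts move up j length positions, so the
    entry at (1-based) position i becomes d_{i-j} for i > j and 0 below."""
    return [0] * j + list(d)

# Example: the distribution (2, 1) shifted by j = 2 becomes (0, 0, 2, 1).
print(shift([2, 1], 2))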
We recursively determine d^{(l)}, l = 1, 2, ..., in terms of d^{(l-1)} and m_{l-1}, the value of the left-most non-zero element of d^{(l-1)}, which is the number of smallest T-prefixes available of length l-1, taking d^{(1)} = (#A, 0, 0, ...) for l = 1. The right-most non-zero element position in d^{(l)} is the length of the maximal-length strings. The complexity of the strings of this length is simply C_T(x) = \sum_i m_i and is found to be very accurately described by li(log_e(#A) · |x|), where li(n) = \int du / \log(u). Conversely, a lower bound for C_T(x) is obtained by observing that for a single repeating symbol, Eqn (2) reduces to n = 1, k_1 = |x| - 1. Thus:

    \log_2(|x|) \;\le\; C_T(x) \;\le\; \mathrm{li}(|x| \log_e \#A).
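The bounds can be evaluated numerically as in the sketch below; since the text leaves the integration limits of li unspecified, the standard (principal-value) logarithmic integral is assumed, computed here via SciPy's exponential integral, and the example length and alphabet size are arbitrary.

import math
from scipy.special import expi        # exponential integral Ei(x)

def li(y):
    """Logarithmic integral, assuming the standard li: li(y) = Ei(ln y)."""
    return expi(math.log(y))

def complexity_bounds(length, alphabet_size):
    """Bounds on string complexity in taugs:
    log2(|x|) <= C_T(x) <= li(|x| * ln(#A))."""
    lower = math.log2(length)
    upper = li(length * math.log(alphabet_size))
    return lower, upper

# Example: a 10,000-symbol string over a binary alphabet.
print(complexity_bounds(10_000, 2))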
In computing C_T(x) as a function of file length for printed texts, for example, one finds a function closely approximated by C_T(x) ≈ li(C·|x|), where C is a constant for the source. Recognising that, in the context of the present model, C represents a bound on the expected compression for the source, to be achieved by mapping the source string onto a maximal-complexity string of equivalent complexity, we write C_T(x) = li(E(x) · |x|), where E is the expected entropy for the string x in nats/symbol. Thus we conclude that the complexity of a string (in taugs) is simply the logarithmic integral of the total information E(x) · |x| (in nats/symbol × symbols = nats).

Given a file, we may easily compute C_T(x) from Eqn (2) and from this E(x) = li^{-1}(C_T(x))/|x|. This was done for a number of English texts (alphabet sizes of 75-90 symbols), yielding entropy values ranging from 1.6-1.9 bits/char. This compares well with [1], in which an 'upper bound' of 1.75 bits per character for full text is arrived at by constructing a word trigram model from 583 million words of training text and then estimating the cross-entropy. This correspondence is interpreted as corroborating evidence of the new deterministic theory.
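The entropy estimate E(x) = li^{-1}(C_T(x))/|x| can be sketched by inverting li numerically, as below; the bisection bracket and the sample values of C_T(x) and |x| are illustrative assumptions (in practice C_T(x) would be computed from a T-decomposition of the file).

import math
from scipy.special import expi

def li(y):
    """Standard logarithmic integral li(y) = Ei(ln y), as in the previous sketch."""
    return expi(math.log(y))

def li_inverse(target, lo=2.0, hi=1e12):
    """Solve li(y) = target for y by bisection; li is increasing for y > 1."""
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if li(mid) < target:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def estimated_entropy(ct, length):
    """E(x) = li^{-1}(C_T(x)) / |x| in nats/symbol, also given in bits/symbol."""
    nats = li_inverse(ct) / length
    return nats, nats / math.log(2)

# Illustrative numbers only: a 100,000-character text with C_T(x) = 9,000 taugs.
print(estimated_entropy(9_000.0, 100_000))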
REFERENCES
[1] P. F. Brown, V. J. Della Pietra, S. A. Della Pietra, J. C. Lai, and R. L. Mercer, "An Estimate of an Upper Bound for the Entropy of English", Computational Linguistics, vol. 18, no. 1, pp. 31-40, 1992.

[2] A. Lempel and J. Ziv, "On the Complexity of Finite Sequences", IEEE Trans. Inform. Theory, vol. 22, no. 1, pp. 75-81, January 1976.

[3] M. R. Titchener, "A Deterministic Theory of Complexity, Information and Entropy", in Recent Results, IEEE Information Theory Workshop ITW-98, San Diego, February 1998.