Mehran University of Engineering and
Technology, Jamshoro
Institute of Information Technology
Third term CSN and IT
Neural Networks
Stochastic Machines and their
Approximates Rooted in Statistical
Mechanics
Information:
We define the amount of information gained after
observing the event X = xk with probability pk
as the logarithmic function
I(xk) = log(1/pk) = -log pk    (1)

where the base of the logarithm is arbitrary. When
the natural logarithm is used, the units of
information are nats; when the base-2 logarithm is
used, the units are bits.
Entropy: The average amount of information
conveyed per message is called the entropy
H(X). Mathematically
H(X) = - Σ (k=0 to K) pk log pk    (2)
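As a quick illustration (a minimal sketch of mine, not from the slides), the following Python computes the information of a single event (eq. 1) and the entropy of a discrete distribution (eq. 2):

```python
import math

def information(p, base=2):
    """Information gained by observing an event of probability p (eq. 1)."""
    return math.log(1.0 / p, base)

def entropy(probs, base=2):
    """Average information per message, H(X) = -sum pk log pk (eq. 2)."""
    return -sum(p * math.log(p, base) for p in probs if p > 0)

# A fair coin conveys 1 bit per toss; a biased coin conveys less.
print(information(0.5))       # 1.0 bit
print(entropy([0.5, 0.5]))    # 1.0 bit
print(entropy([0.9, 0.1]))    # ~0.469 bits
```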
Statistical Mechanics:
Consider a system with many degrees of freedom that can
reside in any one of a large number of states. Let pi
denote the probability of occurrence of state i, with the
following properties:

pi ≥ 0 for all i

and

Σi pi = 1
Let Ei denote the energy of the system when it is in
state i. A fundamental result from statistical mechanics
tells us that when the system is in thermal equilibrium
with its surrounding environment, state i occurs with a
probability defined by
pi = (1/Z) exp(-Ei / kB T)    (3)
where T is the absolute temperature in kelvins, kB = 1.38 ×
10^-23 joules per kelvin is Boltzmann's constant, and Z is
the partition function (also known as the sum over states),
defined by

Z = Σi exp(-Ei / kB T)    (4)
The probability distribution of equation (3) is called the
canonical distribution or Gibbs distribution.
The exponential factor exp(-Ei/kBT) is called the
Boltzmann factor.
The following two points, illustrated by the sketch below,
are noteworthy about the Gibbs distribution:
States of low energy have a higher probability of
occurrence than states of high energy.
As the temperature T is reduced, the probability is
concentrated on a smaller subset of low-energy states.
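A short sketch of mine makes both points concrete: it evaluates the Gibbs distribution for three energy levels at a high and a low temperature (with kB = 1, as adopted on the next slide):

```python
import math

def gibbs(energies, T):
    """Gibbs distribution pi = exp(-Ei/T) / Z, with kB = 1."""
    factors = [math.exp(-E / T) for E in energies]
    Z = sum(factors)                 # partition function (eq. 4)
    return [f / Z for f in factors]

energies = [0.0, 1.0, 2.0]
print(gibbs(energies, T=10.0))  # ~[0.37, 0.33, 0.30]: nearly uniform
print(gibbs(energies, T=0.5))   # ~[0.87, 0.12, 0.02]: mass on low energy
```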
In the context of neural networks, the
parameter T may be viewed as a
pseudotemperature that controls thermal
fluctuations representing the effect of
“synaptic noise” in a neuron. Its precise
scale is therefore irrelevant. Accordingly,
we may choose to measure it by setting the
constant kB equal to unity, and thereby
redefine the probability pi and partition
function Z as follows:
pi = (1/Z) exp(-Ei / T)

and

Z = Σi exp(-Ei / T)    (5)
Free energy and entropy:
The Helmholtz free energy of a physical system,
denoted by F, is defined as
F = -T log Z    (6)

The average energy of the system is defined by

<E> = Σi pi Ei    (7)

Thus, using equations (5) to (7), we have

<E> - F = -T Σi pi log pi    (8)

Using equation (2), we have

<E> - F = TH    (9)

or

F = <E> - TH    (10)
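A numerical spot-check of equations (6) through (10) (my sketch, assuming kB = 1 and natural logarithms, so that H is measured in nats) confirms the identity F = <E> - TH:

```python
import math

def free_energy_check(energies, T):
    """Verify F = <E> - T*H for a Gibbs distribution (eqs. 6-10)."""
    Z = sum(math.exp(-E / T) for E in energies)
    p = [math.exp(-E / T) / Z for E in energies]
    F = -T * math.log(Z)                                 # eq. (6)
    avg_E = sum(pi * Ei for pi, Ei in zip(p, energies))  # eq. (7)
    H = -sum(pi * math.log(pi) for pi in p)              # eq. (2), in nats
    print(F, avg_E - T * H)                              # two sides of eq. (10)

free_energy_check([0.0, 1.0, 2.0], T=0.5)  # both ≈ -0.0715
```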
Principle of Minimal Free Energy
The minimum of the free energy of a stochastic
system with respect to the variables of the
system is achieved at thermal equilibrium,
at which point the system is governed by
the Gibbs distribution.
Nature likes to find a physical system with
minimum free energy.
Markov Chains:
Consider a system whose evolution is described
by a stochastic process {Xn, n = 1,2,3,…},
consisting of a family of random variables. The
value xn assumed by a random variable Xn at
discrete time n is called the state of the system
at that time instant. The space of all possible
values that the random variables can assume is
called the state space of the system. If the
structure of the stochastic process {Xn, n =
1,2,3,…} is such that the conditional
probability distribution of Xn+1 depends only on
the value of Xn and is independent of all
previous values, we say that the process is a
Markov Chain.
Markov Chains:
More precisely, we have
P(Xn+1 = xn+1 | Xn = xn, …, X1 = x1) = P(Xn+1 = xn+1 | Xn = xn)    (11)

which is called the Markov property. In other
words, a sequence of random variables
X1, X2, …, Xn, Xn+1 forms a Markov
Chain if the probability that the system is
in state xn+1 at time n+1 depends
exclusively on the probability that the
system is in state xn at time n.
Transition Probabilities:
In a Markov Chain, the transition from one
state to another is probabilistic, but the
production of an output symbol is
deterministic. Let
pij = P(Xn+1 = j | Xn = i)    (12)

denote the transition probability from state i
at time n to state j at time n+1. All the pij
must satisfy the following two conditions:

pij ≥ 0 for all (i, j)    (13)

Σj pij = 1 for all i    (14)
If the transition probabilities are fixed and do not
change with time, then the Markov Chain is
said to be homogeneous in time.
In the case of a system with a finite number of
possible states K, the transition probabilities
constitute a K by K matrix:
    P = [ p11  p12  …  p1K ]
        [ p21  p22  …  p2K ]
        [  ⋮    ⋮         ⋮ ]
        [ pK1  pK2  …  pKK ]    (15)
This is called a stochastic matrix.
The definition of one step transition probability
may be generalized to cases where the
transition from one state to another takes place
in some fixed number of steps.
Let pij(m) denote the m-step transition
probability from state i to state j:

pij(m) = P(Xn+m = xj | Xn = xi),  m = 1, 2, …    (16)

We may view pij(m) as the sum over all
intermediate states k through which the
system passes in its transition from state i
to state j. Specifically, pij(m+1) is related to
the m-step probabilities by the recursive relation

pij(m+1) = Σk pik(m) pkj

with

pik(1) = pik
The above equation may be generalized as follows:

pij(m+n) = Σk pik(m) pkj(n)    (17)

When a state of the chain can only recur at
time intervals that are multiples of d, where d is
the largest such integer, we say that the state
has period d. A Markov chain is said to be aperiodic
if all of its states have period 1.
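Equation (17) is the Chapman-Kolmogorov identity, and in matrix form it is just P^(m+n) = P^m P^n. A small sketch of mine checks it numerically (the particular stochastic matrix is the one used in the worked example below):

```python
import numpy as np

P = np.array([[0.25, 0.75],   # a stochastic matrix: rows sum to 1
              [0.50, 0.50]])

# Chapman-Kolmogorov (eq. 17): P^(m+n) = P^m @ P^n
m, n = 2, 3
lhs = np.linalg.matrix_power(P, m + n)
rhs = np.linalg.matrix_power(P, m) @ np.linalg.matrix_power(P, n)
print(np.allclose(lhs, rhs))  # True
```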
Recurrent properties of the Markov
Chain
Suppose that a Markov chain starts in state i.
The state i is said to be a recurrent state if
the Markov chain returns to state i with
probability 1. That is,
fi = P(ever returning to state i) = 1
If the probability fi is less than 1, state i is
said to be a transient state.
Irreducible Markov Chains:
The state j of a Markov chain is said to be
accessible from state i if there is a finite
sequence of transitions from i to j with positive
probability.
If the states i and j are accessible to each other,
the states i and j of the Markov chain are said
to communicate with each other.
If two states of a Markov chain communicate
with each other, they are said to belong to the
same class.
If all the states form a single class, the
Markov chain is said to be indecomposable or
irreducible.
The mean recurrence time of state i is defined as
the expectation of the recurrence time Ti[k] over
the returns k.
The steady-state probability of state i, denoted by
πi, is equal to the reciprocal of the mean recurrence
time E[Ti[k]], as shown by

πi = 1 / E[Ti[k]]

If E[Ti[k]] < ∞, that is, πi > 0, the state i is said to
be a positive recurrent (persistent) state.
If E[Ti[k]] = ∞, that is, πi = 0, the state i is said to
be a null recurrent state.
Ergodic Markov Chains:
In principle, ergodicity means that we
may substitute time averages for
ensemble averages. In the context of a
Markov chain, ergodicity means that the
long-term proportion of time spent by the
chain in state i corresponds to the
steady-state probability πi.
A sufficient, but not necessary, condition
for a Markov chain to be ergodic is for it
to be both irreducible and aperiodic.
Convergence to stationary distributions:
Consider an ergodic Markov chain characterized by
a stochastic matrix P. Let the row vector π(n-1)
denote the state distribution vector of the chain at
time n-1. The state distribution vector at time n is
defined by

π(n) = π(n-1) P    (18)

By iteration of equation (18), we obtain

π(n) = π(n-1) P = π(n-2) P^2 = π(n-3) P^3 = …

and finally we may write

π(n) = π(0) P^n

where π(0) is the initial value of the state
distribution vector.
Let Pij(n) denote the ij-th element of P^n.
Suppose that as time n approaches infinity,
Pij(n) tends to πj independently of i, where πj is
the steady-state probability of state j.
Correspondingly, for large n, the matrix P^n
approaches the limiting form of a square
matrix with identical rows, as shown by

    lim (n→∞) P^n = [ π1  π2  …  πK ]
                    [ π1  π2  …  πK ]
                    [ ⋮    ⋮        ⋮ ]
                    [ π1  π2  …  πK ]    (19)
Ergodicity Theorem for Markov Chains:
Let an ergodic Markov chain with states x1, x2, …, xK and
stochastic matrix P be irreducible. The chain then has a
unique stationary distribution to which it converges from
any initial state; that is, there is a unique set of numbers
{πj}, j = 1, 2, …, K, such that

1. lim (n→∞) Pij(n) = πj    (20)

2. πj > 0 for all j    (21)

3. Σ (j=1 to K) πj = 1    (22)

4. πj = Σ (i=1 to K) πi pij   for j = 1, 2, …, K    (23)

Conversely, suppose that the Markov chain is irreducible and
aperiodic, and there exist numbers {πj} satisfying equations
(21) through (23). Then the chain is ergodic, the πj are given
by (20), and the mean recurrence time of state j is 1/πj.
In light of the ergodicity theorem, we may
say the following:
Starting from an arbitrary initial
distribution, the transition probabilities
of a Markov chain will converge to a
stationary distribution, provided that
such a distribution exists.
The stationary distribution of the
Markov chain is completely independent
of the initial distribution if the chain is
ergodic.
Example: Show that the Markov chain
shown in the following figure is ergodic.

[Figure: a two-state chain with states x1 and x2;
transition probabilities p11 = 1/4, p12 = 3/4,
p21 = 1/2, p22 = 1/2]

Solution:
The stochastic matrix of the chain is

    P = [ 1/4  3/4 ]
        [ 1/2  1/2 ]

Raising P to successive powers gives

    P^2 = [ 0.4375  0.5625 ]     P^3 = [ 0.3906  0.6094 ]
          [ 0.3750  0.6250 ]           [ 0.4063  0.5938 ]

    P^4 = [ 0.4023  0.5977 ]     P^5 = [ 0.3994  0.6006 ]
          [ 0.3984  0.6016 ]           [ 0.4004  0.5996 ]

    P^6 = [ 0.4001  0.5999 ]     P^7 = P^8 = [ 0.4  0.6 ]
          [ 0.3999  0.6001 ]                 [ 0.4  0.6 ]

Thus π1 = 0.4 and π2 = 0.6. Convergence to the stationary
distribution is accomplished in n = 7 iterations. With both
π1 and π2 being greater than zero, both states are positive
recurrent, and the chain is therefore irreducible. Note also
that the chain is aperiodic, since the greatest common
divisor of the return times is equal to one. Therefore the
Markov chain is ergodic.
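The power iteration in this example is easy to reproduce (a sketch of mine using NumPy):

```python
import numpy as np

P = np.array([[0.25, 0.75],
              [0.50, 0.50]])

Pn = P.copy()
for n in range(2, 9):           # compute P^2 through P^8
    Pn = Pn @ P
    print(n, np.round(Pn, 4))   # rows converge to (0.4, 0.6) by n = 7
```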
Principle of Detailed Balance:
The principle of detailed balance states that at thermal
equilibrium, the rate of occurrence of any transition
equals the corresponding rate of occurrence of the
inverse transition, as shown by

πi pij = πj pji    (23)

or

πj / πi = pij / pji    (24)

The only outstanding requirement is how to choose the
ratio πj/πi. To meet this requirement, we choose the
probability distribution to which we want the Markov chain
to converge to be a Gibbs distribution, as shown by

πj = (1/Z) exp(-Ej / T)

Therefore

πj / πi = exp(-ΔE / T)    (25)

where ΔE = Ej - Ei    (26)
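A one-line check of mine confirms equation (25): for a Gibbs distribution, the partition function Z cancels in the ratio, leaving exp(-ΔE/T):

```python
import math

def gibbs_ratio_check(Ei, Ej, T):
    """Check eq. (25): pi_j / pi_i = exp(-dE/T) for Gibbs probabilities."""
    wi, wj = math.exp(-Ei / T), math.exp(-Ej / T)  # 1/Z cancels in the ratio
    dE = Ej - Ei                                   # eq. (26)
    print(wj / wi, math.exp(-dE / T))              # the two sides agree

gibbs_ratio_check(Ei=1.0, Ej=2.5, T=0.5)  # both ≈ 0.0498
```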
Simulated Annealing:
Annealing refers to a physical process that proceeds
as follows:
A solid in a heat bath is heated by raising the temperature
to a maximum value at which all particles of the solid
arrange themselves randomly in the liquid phase.
Then the temperature of the heat bath is lowered,
permitting all particles of the solid to arrange themselves
in the low-energy ground state of a corresponding lattice.
It is presumed that the maximum temperature in phase
1 is sufficiently high, and the cooling in phase 2 is carried
out sufficiently slowly. However, if the cooling is too
rapid, that is, if the solid is not allowed enough time to
reach thermal equilibrium at each temperature value, the
resulting crystal will have many defects.
Metropolis Algorithm:
This is an algorithm for efficient simulation of the
evolution to thermal equilibrium of a solid for
a given temperature.
In each step of the algorithm, an atom (unit) of a
system is subjected to a small random
displacement, and the resulting change ΔE in
the energy of the system is computed. If we
find that the change ΔE ≤ 0, the displacement
is accepted, and the new system configuration
with the displaced atom is used as the starting
point for the next step of the algorithm. If, on
the other hand, we find that the change ΔE > 0,
the algorithm proceeds in a probabilistic
manner, as described next.
The probability that the configuration with the displaced
atom is accepted is given by

P(ΔE) = exp(-ΔE / T)    (27)

where T is the temperature. To implement the
probabilistic part of the algorithm, we may use a
generator of random numbers distributed uniformly in
the interval (0,1). Specifically, one such number is
selected and compared with the probability P(ΔE) of
equation (27). If the random number is less than the
probability P(ΔE), the new configuration with the
displaced atom is accepted. Otherwise, the original
system configuration is reused for the next step of the
algorithm.
Provided that the temperature is lowered in a
sufficiently slow manner, the system can reach
thermal equilibrium at each temperature. In the
Metropolis algorithm, this condition is achieved by
having a large number of transitions at each
temperature.
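A minimal sketch of one Metropolis step in Python (mine; the energy function and the proposal mechanism are illustrative placeholders, not from the slides):

```python
import math, random

def metropolis_step(state, energy, propose, T):
    """One Metropolis step: accept downhill moves always, and
    uphill moves with probability exp(-dE/T) (eq. 27)."""
    candidate = propose(state)
    dE = energy(candidate) - energy(state)
    if dE <= 0 or random.random() < math.exp(-dE / T):
        return candidate   # displacement accepted
    return state           # original configuration reused

# Toy usage: a random walker settling near the minimum of E(x) = x^2.
energy = lambda x: x * x
propose = lambda x: x + random.uniform(-0.5, 0.5)
x = 5.0
for _ in range(2000):
    x = metropolis_step(x, energy, propose, T=0.1)
print(round(x, 2))         # close to 0, the low-energy state
```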
Thus, by repeating the basic steps of the
Metropolis algorithm, we effectively simulate
the motion of the atoms in a physical system in
thermal equilibrium with a heat bath of
absolute temperature T. Moreover, the choice
of P(ΔE) defined in equation (27) ensures that
thermal equilibrium is characterized by the
Gibbs distribution, just as in statistical
mechanics.
Since, in simulated annealing, the current state
of a system that has experienced a transition
depends only on the previous state, it follows
that simulated annealing has the Markov
property.
To implement a finite time approximation of the
simulated annealing algorithm, we need to
specify a set of parameters governing the
convergence of the algorithm. These
parameters are combined in a so-called
annealing schedule or cooling schedule.
An annealing schedule specifies a finite
sequence of values of the temperature and a
finite number of transitions attempted at each
value of temperature. The annealing schedule
due to Kirkpatrick et al. [1983] is described
below:
Initial value of the temperature: The initial value T0 of
the temperature is chosen high enough to ensure that
virtually all proposed transitions are accepted by the
annealing algorithm.
Decrement of the temperature: The decrement function
is defined by

Tk = α Tk-1,   k = 1, 2, 3, …

where α is a constant smaller than, but close to, unity.
Typical values of α lie between 0.8 and 0.99. At each
temperature, enough transitions are attempted so that
there are 10 accepted transitions per experiment on
average.
Final value of the temperature: The system is frozen and
annealing stops if the desired number of acceptances is
not achieved at three successive temperatures.
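Putting the schedule together with the Metropolis step gives a complete, if toy-sized, annealing loop. This sketch of mine uses illustrative values of T0 and α, and adds a temperature floor as a practical safeguard, since the slide's three-failure freezing criterion may never trigger on a smooth toy problem:

```python
import math, random

def anneal(state, energy, propose, T0=10.0, alpha=0.9,
           accepted_target=10, max_attempts=200):
    """Simulated annealing with the geometric schedule Tk = alpha*T(k-1);
    stop after 3 successive temperatures short of the acceptance target."""
    T, failures = T0, 0
    while failures < 3 and T > 1e-3:   # T floor: safeguard, not from slides
        accepted = 0
        for _ in range(max_attempts):
            candidate = propose(state)
            dE = energy(candidate) - energy(state)
            if dE <= 0 or random.random() < math.exp(-dE / T):
                state, accepted = candidate, accepted + 1
        failures = failures + 1 if accepted < accepted_target else 0
        T *= alpha                     # decrement the temperature
    return state

# Toy usage: minimize E(x) = x^2 starting from x = 5.
print(anneal(5.0, lambda x: x * x,
             lambda x: x + random.uniform(-0.5, 0.5)))
```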
Boltzmann Machine:
The Boltzmann machine is a neural network that relies on
a stochastic form of learning. Basic to the operation of the
Boltzmann machine is the idea of simulated annealing
described earlier.
The Boltzmann machine and the Hopfield network share
the following common features:
Their processing units take binary values (say ±1) for
their states.
All the synaptic connections between their units are
symmetric.
The units are picked at random, one at a time, for
updating.
They have no self-feedback.
Boltzmann Machine:
The Boltzmann machine and the Hopfield
network differ from each other in the
following respects:
The Boltzmann machine permits the
use of hidden neurons, whereas no such
neurons exist in the Hopfield network.
The Boltzmann machine uses stochastic
neurons with a probabilistic firing
mechanism, whereas the standard
Hopfield network uses neurons based on
the McCulloch-Pitts model and a
deterministic firing mechanism.
The stochastic neurons of the Boltzmann
machine partition into two functional
groups: visible and hidden, as shown in
the following figure.

[Figure not reproduced: the visible and hidden
neurons of a Boltzmann machine]
The visible neurons provide an interface
between the network and the environment in
which it operates. During the training phase of
the network, the visible neurons are all
clamped onto specific states determined by the
environment.
The hidden neurons, on the other hand, always
operate freely; they are used to explain
underlying constraints contained in the
environmental input vectors.
The primary goal of Boltzmann learning is to
produce a neural network that correctly
models input patterns according to a
Boltzmann distribution.
The energy of the Boltzmann machine is
defined by

E(x) = -(1/2) Σi Σj (j≠i) wji xi xj    (28)

where xj is the state of neuron j and wji is the
synaptic weight connecting neuron j to neuron
i.
The machine operates by choosing a neuron at
random, for example neuron i, at some step
of the learning process and then flipping the
state of neuron i from xi to -xi at some
temperature T with probability

P(xi → -xi) = 1 / (1 + exp(ΔEi / T))

where ΔEi is the energy change (final minus
initial) resulting from the flip.
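A compact sketch of mine implementing equation (28) and the flip probability, for a small network with a symmetric, zero-diagonal weight matrix:

```python
import numpy as np

def energy(W, x):
    """Boltzmann machine energy (eq. 28); W symmetric with zero diagonal,
    so the double sum over j != i equals the quadratic form below."""
    return -0.5 * x @ W @ x

def flip_probability(W, x, i, T):
    """Probability of flipping neuron i from xi to -xi at temperature T."""
    x_flipped = x.copy()
    x_flipped[i] = -x_flipped[i]
    dE = energy(W, x_flipped) - energy(W, x)  # energy change of the flip
    return 1.0 / (1.0 + np.exp(dE / T))

W = np.array([[0.0, 0.5], [0.5, 0.0]])  # symmetric, no self-feedback
x = np.array([1.0, -1.0])
print(flip_probability(W, x, i=0, T=1.0))  # ~0.73: downhill flip is likely
```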
There are two modes of operation to be considered:
Clamped condition, in which the visible
neurons are all clamped onto specific states
determined by the environment.
Free-running condition, in which all the
neurons (visible and hidden) are allowed to
operate freely.
Let ρ+ij denote the correlation between the states of
neurons i and j, with the network in its clamped
condition. Let ρ-ij denote the correlation between
the states of neurons i and j, with the network in
its free-running condition. Both correlations are
averaged over all possible states of the machine
when it is in thermal equilibrium.
Then, according to the Boltzmann learning rule,
the change Δwij applied to the synaptic weight
wij from neuron i to neuron j is defined by

Δwij = η (ρ+ij - ρ-ij)    (29)

where η is the learning-rate parameter.
Equation (29) is called the Boltzmann
learning rule. A useful feature of
Boltzmann learning is that the rule for
adjusting the synaptic weight from neuron i
to neuron j is independent of whether these
two neurons are both visible, both hidden, or
one of each.
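A minimal sketch of mine for the weight update of equation (29); in practice the two correlation matrices would be estimated by sampling the network at thermal equilibrium in the clamped and free-running phases:

```python
import numpy as np

def boltzmann_update(W, rho_clamped, rho_free, eta=0.1):
    """Boltzmann learning rule (eq. 29): dW = eta * (rho+ - rho-).
    rho_clamped and rho_free hold the equilibrium correlations
    <xi xj> in the clamped and free-running conditions."""
    dW = eta * (rho_clamped - rho_free)
    np.fill_diagonal(dW, 0.0)  # no self-feedback
    return W + dW

# Illustrative correlations for a two-neuron machine.
rho_plus = np.array([[1.0, 0.8], [0.8, 1.0]])    # clamped phase
rho_minus = np.array([[1.0, 0.3], [0.3, 1.0]])   # free-running phase
W = np.zeros((2, 2))
print(boltzmann_update(W, rho_plus, rho_minus))  # off-diagonals grow by 0.05
```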