
Cryptanalysis of ciphertext substitution
using optimization heuristics
Tahar MEKHAZNIA1, M.Bachir MENAI2, Abdelmadjid ZIDANI3
1 Computing department, TEBESSA University, ALGERIA, [email protected]
2 Computing science department, CCIS, King Saud University, RIADH, KSA, [email protected]
3 Computing department, BATNA University, ALGERIA, [email protected]
Abstract
This document presents a first step towards automating, by means of heuristic algorithms and without manual intervention, several classical techniques for the cryptanalysis of ciphertexts produced by substitution and transposition. The tests presented are limited because of the large number of parameters involved, including the statistical tables of the literary languages to which the texts belong. The study focuses mainly on the choice of initial parameters, a question that remains unanswered so far.
Key words:
Heuristic algorithm, substitution cipher,
cryptanalysis
1. Introduction
Cryptanalysis is the art of transforming a ciphertext into its readable equivalent without a priori knowledge of the decryption key. It is one of the important challenges of current research in data security.

Attack techniques against ciphertexts are varied. The most common, and the most costly, is brute force, in which a multitude of keys must be tried until a sufficiently readable plaintext is obtained. The technique is reliable; nevertheless, it consumes abundant resources and turns out to be of little interest in practice.

This research focuses on heuristic techniques. These are not guaranteed to succeed, but in practice they are the most commonly used for solving a wide range of combinatorial problems. In cryptanalysis, they consist in progressively eliminating, within the search space, the key modifications deemed unnecessary for obtaining a plaintext, based on the characteristics of the literary language used.

In this paper, an overview of these techniques is presented in Section 2, followed by a summary of work in the field in Section 3. Sections 4, 5 and 6 describe the mechanism of cryptanalysis by substitution and its adaptation to some heuristic algorithms. The final sections present the experimental evaluation of these algorithms and the associated results.
2. Encryption techniques :
Classical techniques produce ciphertexts by simultaneously using the substitution and the transposition of characters within a text. Ultimately, each ASCII character is replaced by another of the same set using a key. The key is represented by a table with two rows (vectors of at most 255 characters), the first containing the characters in natural order and the second the same characters in a different (permuted) order.

Modern techniques use iterative algorithms to produce complex encryption keys, where substitution and transposition are applied at the bit level after the text has been converted to binary code.

Heuristic techniques do not guarantee finding the best solution (or, if it is found, it would be difficult to prove that it is the best), but they find a good solution in a reasonable time [1].

Ant colony algorithms are a class of metaheuristics intended to solve difficult optimization problems. They are inspired by the collective behavior of ants, including the tracking and deposition of pheromone. As it moves, every ant in the colony communicates indirectly with its neighbors through dynamic changes in their environment and thus builds a solution that improves over time.

Genetic algorithms are a class of stochastic optimization methods based on the mechanisms of natural evolution: crossover, mutation, selection, etc. They belong to the evolutionary methods. As part of the family of metaheuristic algorithms, their purpose is to obtain a suitable solution in a reasonable time.
3. Previous works :
The use of heuristics for solving optimization problems in cryptanalysis has become an important part of the research of recent years, starting with Peleg & Rosenfeld [2], who modeled the cryptanalysis problem probabilistically. Carroll & Martin [3] developed an expert-system approach to decryption using relaxation methods. Safavi-Naini & Forsyth [4], Spillman et al. [5] and Clark [6] used simulated annealing and genetic algorithms to solve various decryption instances. Bahler & King [7] reimplemented the work of Peleg & Rosenfeld using various statistics on the occurrence of characters in the literary language. Uddin & Youssef [8] implemented various heuristics for the cryptanalysis of substitution ciphers. Dimovski & Gligoroski [9] used the same techniques for the cryptanalysis of transposition ciphers.

The work listed above, and many others, clearly shows that little research has been devoted to cryptanalysis by substitution, and that it used only basic heuristic methods such as simulated annealing and tabu search. Complex algorithms, including GA and ACO, were often not treated in depth, and their results were not competitive. This is understandable, given that they involve a large number of parameters that are sensitive to change and can only be adjusted experimentally.
4. Cryptanalysis by substitution :
4.1 Definition
Let $A(n) = (a_0, a_1, \ldots, a_{n-1})$ be an alphabet and $B(n) = (b_0, b_1, \ldots, b_{n-1})$ another alphabet obtained from A by a bijective function $K: A \to B$, which substitutes for each character $a_i$ of A another character of the same set, $b_i$. The function K is called the encryption key; it performs a permutation of the entire alphabet A to obtain the cipher alphabet B. The function $K^{-1}$ performs the inverse operation. Cryptanalysis (or decoding) consists in obtaining a plaintext from a ciphertext without, in general, knowing the key K.
The following example illustrates the sets A and B and the function K. Naturally, K can be extended, depending on the use, to a predefined set of characters of the ASCII table (for example, by including digits when commercial messages are processed).

K:
A           ABCDEFGHIJKLMNOPQRSTUVWXYZ
B           POIUYTREZAMSKJHGFDLQNBVCXW

Plaintext   CRYPTANALYSEPARSUBSTITUION
Ciphertext  IDXGQPJPSXLYGPDLNOLQZQNZHJ
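The example above can be reproduced with a few lines of Python. This is only an illustrative sketch: the names `make_key`, `encrypt` and `decrypt` are hypothetical and do not come from the paper; the two alphabets are the rows A and B of the example.

```python
import string

PLAIN_ALPHABET = string.ascii_uppercase           # set A (natural order)
CIPHER_ALPHABET = "POIUYTREZAMSKJHGFDLQNBVCXW"    # set B, copied from the example

def make_key(plain=PLAIN_ALPHABET, cipher=CIPHER_ALPHABET):
    """Return the translation tables for K and for its inverse K^-1."""
    if sorted(plain) != sorted(cipher):
        raise ValueError("the key must be a permutation of the alphabet")
    return str.maketrans(plain, cipher), str.maketrans(cipher, plain)

K, K_INV = make_key()

def encrypt(text):
    """Apply K to every character of the text."""
    return text.translate(K)

def decrypt(text):
    """Apply K^-1 to every character of the text."""
    return text.translate(K_INV)

print(encrypt("CRYPTANALYSEPARSUBSTITUION"))  # IDXGQPJPSXLYGPDLNOLQZQNZHJ
```

Because K is a bijection, `decrypt(encrypt(t))` returns `t` for any text `t` written with the 26 letters.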
4.2 Appearance of characters :
The frequency of appearance of a given character of the alphabet within a text differs from one language to another. It also differs between texts in the same language, as in the case of literary, political or commercial texts.

In an English text, the letters (unigrams) rank by decreasing frequency in the following order: ETAON RISHD LFCMU GYPWB VKXJQ Z [10]. In other words, the letter E is the one that appears most often in a text. The most frequent pairs of letters (bigrams) are, in order: TH HE AN RE ER IN ON AT ND ST ES EN OF TE ED OR TI HI AS TO, and the most frequent doubled letters within a word are: LL EE SS OO TT FF RR NN PP CC.

This ranking is not fixed in all cases. Various other rankings are given in the literature [11] [12]. The ICE (International Corpus of English) is the best known; it includes statistics for several variants of the English language from a dozen English-speaking countries [13]. However, exceptions are always present in specific texts: the rule is distorted if we process a text on X-ray technology or a story about the life of zebras in Qatar, where the usually infrequent characters appear most often.

In the general case, statistics on the average appearance of characters have been compiled in tables, called character frequency tables. They are used as references when deciphering a text in order to determine the identity of a character according to its frequency of appearance in the text.
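As an illustration of how such a table can be used, the sketch below ranks the letters of a ciphertext by frequency and pairs them with the English unigram order quoted above. The function names are hypothetical, and this naive pairing is only a possible starting key, not the method evaluated in the paper.

```python
from collections import Counter

ENGLISH_ORDER = "ETAONRISHDLFCMUGYPWBVKXJQZ"  # unigram order quoted in the text [10]

def letter_ranking(text):
    """Letters of the text, most frequent first (ties broken alphabetically)."""
    counts = Counter(ch for ch in text.upper() if "A" <= ch <= "Z")
    return "".join(sorted(counts, key=lambda ch: (-counts[ch], ch)))

def initial_key_guess(ciphertext):
    """Map the i-th most frequent cipher letter to the i-th letter of ENGLISH_ORDER."""
    return dict(zip(letter_ranking(ciphertext), ENGLISH_ORDER))
```

On short texts the guess is usually wrong for many letters, which is why the heuristics described in the following sections refine the key instead of relying on a single frequency match.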
4.3 Index of coincidence :
Let t be a text of length n. The index of coincidence of t is given by the relation:

$$I_c(t) = \frac{1}{n(n-1)} \sum_{i=a}^{z} p_i (p_i - 1)$$

where $p_i$ is the number of occurrences of the character i in the text t, which is composed of the 26 letters of the alphabet. For a text of another type, such as a commercial text, other characters, such as digits, should also be taken into account.
In theory, if all characters were equally likely, the index of coincidence would be equal to 1/26 ≈ 0.04. In reality, some characters appear more often than others, as mentioned above. The global index of coincidence of a language is given by the relation:

$$I_c = \sum_{i=a}^{z} p_i^2$$

where $p_i$ denotes here the relative frequency of the letter i in the language. The value of this index varies from one language to another. It is, for example, equal to 0.065 for English and 0.074 for French.
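A minimal Python sketch of the index of coincidence of a text, assuming the usual normalisation by n(n − 1) and counting only the 26 letters (the function name is hypothetical):

```python
from collections import Counter

def index_of_coincidence(text):
    """Sum of p_i * (p_i - 1) over the letters, divided by n * (n - 1)."""
    letters = [ch for ch in text.upper() if "A" <= ch <= "Z"]
    n = len(letters)
    if n < 2:
        raise ValueError("at least two letters are needed")
    counts = Counter(letters)
    return sum(p * (p - 1) for p in counts.values()) / (n * (n - 1))
```

Since a substitution only renames the letters, the counts are merely permuted and the index of a ciphertext equals that of its plaintext. This is what allows the index to hint at the language of a substitution-encrypted text.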
4.4 Cost function :
When decrypting with the help of frequency tables, the smaller the difference between the reference frequency of a character and the frequency of the character substituted for it, the closer we are to the exact substitution that yields a clear text. If this difference is zero, that character is the right choice. Of course, this treatment can be extended to bigrams and trigrams. The cost function of a text is given by the relation:

$$\mathrm{cost}(K) = \alpha \sum |R_U - D_U| + \beta \sum |R_B - D_B| + \gamma \sum |R_T - D_T|$$

where K denotes the key with which the text has been deciphered, R and D denote, respectively, the reference statistics and the statistics of the text obtained after decryption, and the subscripts U, B and T refer to the tables used: unigram, bigram and trigram.

The coefficients α, β and γ, between 0 and 1, weight the terms and can improve the function. Their values will be justified in the experiments.
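The cost function can be sketched as follows. The sketch assumes that each term is a sum of absolute differences between reference and observed relative frequencies; the helper names, the dictionary representation of the tables and the default weights are illustrative choices, not taken from the paper.

```python
from collections import Counter

def ngram_freqs(text, n):
    """Relative frequencies of the n-grams of the text (letters only, spaces dropped)."""
    letters = "".join(ch for ch in text.upper() if "A" <= ch <= "Z")
    grams = [letters[i:i + n] for i in range(len(letters) - n + 1)]
    total = len(grams)
    return {g: c / total for g, c in Counter(grams).items()} if total else {}

def cost(decrypted, ref_tables, weights=(1.0, 1.0, 1.0)):
    """alpha*sum|R_U - D_U| + beta*sum|R_B - D_B| + gamma*sum|R_T - D_T|.

    ref_tables is a (unigram, bigram, trigram) tuple of frequency dictionaries.
    """
    total = 0.0
    for n, (weight, ref) in enumerate(zip(weights, ref_tables), start=1):
        obs = ngram_freqs(decrypted, n)
        grams = set(ref) | set(obs)
        total += weight * sum(abs(ref.get(g, 0.0) - obs.get(g, 0.0)) for g in grams)
    return total
```

In practice the reference tables would come from a corpus of the target language; a text whose statistics match the tables exactly has a cost of zero.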
5. The ACO Technique :
5.1 Definition
As it moves, an ant deposits a uniform and continuous quantity of pheromone on its way. The choice of its direction is governed by the pheromone trail left by its predecessors. The pheromone also evaporates at a constant rate on contact with air, so that tracks with less pheromone gradually disappear.

After a number of movements, the ants tend to frequent the paths richest in pheromone, which resist evaporation and provide the shortest distance between the nest and the food.
5.2 Adapting to the problem :
a. Initial data
- The exploration field is a strongly connected graph of 26 nodes (the letters of the alphabet). It can also be extended to other characters (the space, for example). Each move of an ant from one node to another corresponds to the substitution of one character by another. The journey ends when all the nodes have been visited; in practice, all the characters of the ciphertext have then been substituted by other characters in order to obtain a clearer version of the text.
- To obtain a homogeneous movement, the ants are initially distributed randomly on the nodes of the graph: this is the basic key K0.
- The distance d(i,j) between two nodes, a parameter that has no direct meaning in cryptanalysis, can be obtained from the cost function by the relation:

$$d(i,j) = \mathrm{cost}(i,j)^{c}$$

with c a tuning parameter between -1 and 1, justified in the experiments.

b. Iterations
The ants move in a discrete manner, leaving an initial node to build a Hamiltonian path. A control function is needed to avoid moves on the arcs already visited. Starting from a node i, the choice of the next node j depends on the distance d(i,j) between them and on the amount of pheromone τ(i,j) on the arc (i,j). It is defined by the equation:

$$p(i \to j) = \frac{\tau(i,j)^{a}\, d(i,j)^{b}}{\sum_{l} \tau(i,l)^{a}\, d(i,l)^{b}}$$

where the sum runs over the candidate nodes l, and a and b are tuning parameters between 0 and 1 that will be justified in the experiments.

c. Updating the pheromone
At the end of each movement, the pheromone on the arc in question is updated by the relation:

$$\tau(i,j) = \tau(i,j) + \Delta\tau(i,j)$$

where Δτ(i,j) is a positive quantity that depends on the version of the algorithm used. For example, in the case of a virtual ant, it is equal to Q/L, where L is the length of the Hamiltonian path visited by the ant and Q the cost of the text generated.
The amount of pheromone is inversely proportional to the cost of the text: it grows as we approach the plaintext.

d. Evaporation
Evaporation is carried out in a discrete way at regular time intervals, in other words after a set number of ant movements, according to the equation:

$$\tau(i,j) = (1-\rho)\,\tau(i,j)$$

where ρ is a constant between 0 and 1 that must be chosen with care: if it is close to 1, the arc in question tends to be abandoned because it is devoid of pheromone; if it is close to 0, the arc becomes saturated and is therefore visited permanently, which leads to a rapid convergence of the solution and hence to the persistence of bad solutions.
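The three rules of this section (choice of the next node, pheromone update and evaporation) can be sketched in Python as below. The matrices `tau` and `d` are assumed to be lists of lists of positive numbers indexed by node, and the function names and default values are illustrative only.

```python
import random

def choose_next(i, candidates, tau, d, a=0.5, b=0.5, rng=random):
    """Pick the next node j with probability proportional to tau[i][j]**a * d[i][j]**b.

    candidates is the list of nodes not yet visited (the control function).
    """
    weights = [(tau[i][j] ** a) * (d[i][j] ** b) for j in candidates]
    return rng.choices(candidates, weights=weights, k=1)[0]

def deposit(tau, path, delta):
    """tau(i,j) <- tau(i,j) + delta on every arc of the path."""
    for i, j in zip(path, path[1:]):
        tau[i][j] += delta

def evaporate(tau, rho):
    """tau(i,j) <- (1 - rho) * tau(i,j) on every arc."""
    for row in tau:
        for j in range(len(row)):
            row[j] *= (1.0 - rho)
```

With this form of the evaporation rule, a value of `rho` close to 1 wipes out almost all the pheromone at each evaporation step, while a value close to 0 leaves the trails almost untouched.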
5.3 Algorithm AntSystem
The proposed algorithm implicitly contains the stages that reflect the movement of the ants as well as the update of the pheromone. It is defined as follows:

Build an initial solution (generally random)
Repeat
    Improve the solution by choosing new roads
    Update the pheromone
Until (better solution or maximum number of iterations)

More explicitly, the algorithm dedicated to cryptanalysis takes the following form:

Calculate the cost of the initial text (this is S_opt)
Determine the distances between the various nodes
Fix the evaporation period Evap
Place m ants randomly on the nodes of the graph
For nb_iter = 1 to max_iter do
    For nb_ant = 1 to m do
        Build a Hamiltonian path S(nb_ant)
        Calculate the cost C of the solution S(nb_ant)
        If (S(nb_ant) is better than S_opt) S_opt = S(nb_ant)
    Endfor
    If (nb_iter % Evap) = 0 evap_pheromone
Endfor
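The loop above can be sketched in Python under simplifying assumptions that are not taken from the paper: the pheromone is stored per (cipher letter, plain letter) pair instead of per arc of a Hamiltonian path, the distance term is omitted, the deposit is Q/(1 + cost) to avoid a division by zero, and the cost function is passed in as `cost_fn`. All names are hypothetical.

```python
import random
import string

ALPHABET = string.ascii_uppercase

def build_key(tau, a, rng):
    """One ant: give each cipher letter a distinct plain letter, guided by pheromone."""
    free = list(ALPHABET)
    key = {}
    for c in ALPHABET:
        weights = [tau[c][p] ** a for p in free]
        p = rng.choices(free, weights=weights, k=1)[0]
        key[c] = p
        free.remove(p)
    return key

def ant_system(ciphertext, cost_fn, m=10, max_iter=50, evap=5, rho=0.1,
               a=1.0, q=1.0, seed=0):
    rng = random.Random(seed)
    tau = {c: {p: 1.0 for p in ALPHABET} for c in ALPHABET}
    best_key, best_cost = None, float("inf")
    for nb_iter in range(1, max_iter + 1):
        for _ in range(m):
            key = build_key(tau, a, rng)
            c = cost_fn(ciphertext.translate(str.maketrans(key)))
            if c < best_cost:                       # keep the best solution S_opt
                best_key, best_cost = key, c
            for ciph, plain in key.items():         # deposit: more for cheaper texts
                tau[ciph][plain] += q / (1.0 + c)
        if nb_iter % evap == 0:                     # periodic evaporation
            for row in tau.values():
                for p in row:
                    row[p] *= (1.0 - rho)
    return best_key, best_cost
```

The returned key maps each cipher letter to a plain letter; `cost_fn` would typically be a frequency-based cost such as the one of §4.4.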
6. Genetic Algorithm
6.1 Definition :
A genetic algorithm uses the concept of natural evolution. Starting from an initial population of individuals, the selection, crossover and mutation operations are applied to the individuals to produce a new generation, part of which is, according to certain criteria, included in the initial population.

Thus, after a certain number of iterations, the initial population is transformed into a new one whose characteristics are considered satisfactory with regard to the defined objective. Genetic algorithms have been used successfully to break complex ciphers such as Enigma [15].
6.2 Adaptation to the problem:
a. Basic data :
- The basic population is a table containing a finite number of keys.
- A key K (chromosome) is a character string of 26 or 27 letters (the alphabet and the space). A character within the key is a gene.
- Each key Ki is evaluated with the cost function defined in §4.4, its value being Cost(Ki).
b. Initialisation :
A function that generates random keys is launched. The population table is updated, making sure that duplicates are removed. Each key is evaluated using a ciphertext from the test database.
c. Iterations:
All the keys contain the same set of characters; only the position of the characters within the key distinguishes one key from another. At each iteration, the following operations are executed in order:
• Selection:
The population table is first sorted by the cost of each key. Np keys are then selected to reproduce the next generation. Whether the selection is by rank, roulette, tournament or simply natural, the best choice will be justified in the experiments.
• Crossing:
It consists in swapping gene segments between the parents. This process encourages the exploration of the search space and provides a broad mixing of genetic material; however, it may cause the solution to diverge or generate duplicates if the selection operator is poorly chosen. The number of crossover points (loci) and the crossover probability Pc are variable and can introduce more diversity among individuals.
Swapping characters between keys can duplicate some characters within a key while other characters go missing. A function that checks the ordering of the key is necessary in this case.
Two tests were made:
- A bilateral cross, where segments are exchanged between two parents. To limit the exploration space, a single-point cross was chosen.
- A unilateral cross, which swaps segments within a single parent. In this case, the population able to breed is halved.
• Mutation
The mutation operator alters a gene with a low probability (of the order of 10^-2). In our case, a gene is a character; it must be replaced by another of the 26 (or 27, as appropriate) characters, which amounts to exchanging two distant positions within the same chromosome.
• Replacement:
Whether stationary, elitist or otherwise, various alternatives are tested using a corpus of the language used. The replacement of the population is conditional on the absence of duplicates: a checking function is triggered after the completion of each operator.
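The crossing and mutation operators can be sketched as follows for keys stored as 26-letter strings. The repair step shown here (completing the child with the missing letters in the order of the second parent) is one possible way of removing duplicates and is an assumption, as are the function names.

```python
import random
import string

ALPHABET = string.ascii_uppercase

def crossover(parent1, parent2, rng=random):
    """Single-point crossing followed by a repair step that removes duplicates."""
    cut = rng.randrange(1, len(parent1))
    head = parent1[:cut]
    # Repair: append the letters of parent2 that are not already in the head,
    # in the order in which parent2 lists them, so the child stays a permutation.
    tail = "".join(ch for ch in parent2 if ch not in head)
    return head + tail

def mutate(key, pm=0.01, rng=random):
    """With probability pm per gene, swap that gene with a randomly chosen position."""
    genes = list(key)
    for i in range(len(genes)):
        if rng.random() < pm:
            j = rng.randrange(len(genes))
            genes[i], genes[j] = genes[j], genes[i]
    return "".join(genes)
```

Both operators return a permutation of the alphabet, so no additional duplicate check is needed inside a key; duplicates between keys of the population still have to be filtered separately.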
6.3 Algorithm GeneticSystem:
The proposed GS algorithm carries out the various operations that generate populations by natural genetic evolution. It includes the following tasks:

Create a population of random initial keys
Evaluate each key (according to the cost function used)
Repeat
    Select Np keys
    Cross the keys
    Mutate characters within each key
    Evaluate the new keys
    Replace the population
Until (acceptable solution or MaxGeneration)
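A compact Python sketch of this loop is given below. It assumes selection by rank, single-point crossing with repair, a mutation applied with probability `pm` per child (rather than per gene) and an elitist replacement without duplicates; these choices, like the names and default values, are illustrative and not prescribed by the paper.

```python
import random
import string

ALPHABET = string.ascii_uppercase

def genetic_system(ciphertext, cost_fn, ind=20, n_parents=10, gen=50, pm=0.01, seed=0):
    rng = random.Random(seed)

    def rank(key):
        # Cost of the text deciphered with `key`; the key itself breaks ties.
        return cost_fn(ciphertext.translate(str.maketrans(ALPHABET, key))), key

    # Create a population of distinct random keys and evaluate it.
    population = set()
    while len(population) < ind:
        population.add("".join(rng.sample(ALPHABET, len(ALPHABET))))
    population = sorted(population, key=rank)

    for _ in range(gen):
        parents = population[:n_parents]              # selection by rank
        children = set()
        for p1, p2 in zip(parents, parents[1:]):
            cut = rng.randrange(1, len(ALPHABET))     # single-point crossing
            head = p1[:cut]
            child = list(head + "".join(ch for ch in p2 if ch not in head))
            if rng.random() < pm:                     # mutation: swap two genes
                i, j = rng.sample(range(len(child)), 2)
                child[i], child[j] = child[j], child[i]
            children.add("".join(child))
        # Elitist replacement without duplicates: keep the `ind` best keys.
        population = sorted(set(population) | children, key=rank)[:ind]

    best = population[0]
    return best, rank(best)[0]
```

The key returned is the decryption alphabet: the i-th letter of `ALPHABET` in the ciphertext is replaced by the i-th letter of the key.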
The best results, obtained with a text of 150 characters and a colony of 25 to 90 ants, are given in the table of Section 7.2 a (real ants).
7. Experiments
7.1 Test parameters:
The algorithms tested include a substantial number of settings that would be difficult to treat simultaneously. It would also be difficult to set some parameters in the absence of an effective mathematical model to justify them. Exhaustive testing consumes considerable resources; however, preliminary tests were made to fix, approximately, the parameters necessary for running the algorithms. Similarly, the results of some tests were reused as a baseline for further testing. The end results look more interesting, especially in terms of resource consumption.

The following table shows the values of some parameters:
Parameter   Range       Step   Signification
α, β, γ     0-1         0.1    Parameters of the cost function
a, b        0-1         0.05   Probability of choice of the next direction
c           -1 to 1     0.1    Parameter of the length of arcs
τ, ρ        0-1         0.1    Quantity of pheromone deposited; evaporation
cevap       10-100      5      Evaporation cycle
Nb_ant      5-100       5      Number of ants
gen         50-600      10     Number of generations
ind         20-200      10     Number of individuals
Np          1-(ind/2)   2      Number of parents
Pc          1-3         1      Number of crossing points
Pm          0.005              Probability of mutation
Maxcars     50-150      10     Size of the ciphertext
The experiments were carried out on diverse texts encrypted with three kinds of keys: simple, such as Caesar or Atbash keys; medium, such as Vigenère keys; and more difficult, such as those of Delastelle.
7.2 Variant algorithms:
a. Real ants:
A real ant deposits pheromone during its movement in a homogeneous and continuous manner. Its path ends at the last node of the graph. However, it may hit a dead end (a dry arc, for example) and cause a chain blocking nearby, which progressively attracts other ants because of the excess of pheromone on the arcs in the same portion of the graph.

In cryptanalysis, this algorithm is used in a reduced way. It ends prematurely if part of the plaintext has been revealed, since continuing its execution causes an undesirable re-encryption of the text. The results serve as a baseline for further tests with the other algorithms.
Ant   Parameters                              Key         Match Car. (Max)   Match Car. (Avg)
30    α=0.7, β=0.4, γ=0.5, τ=0.5, Cevap=77    Simple      26                 21.4
22    α=0.2, β=0.4, γ=1.0, τ=0.2, Cevap=52    Middle      18                 16.3
77    α=0.9, β=0.6, γ=1.0, τ=0.1, Cevap=52    Difficult   12                 10.8
b. Virtual ants:
To avoid the stagnation of pheromone on portions of the search space, the main drawback of the real ant, virtual ants deposit the pheromone only on the return path, once the path has been completed successfully. Similarly, the amount deposited is proportional to the length of this path.

Of course, this requires a memory of the path travelled and an additional step for the return. In practice, this algorithm makes it possible to know the cost of the decrypted text before the decryption operation, so that the corresponding key can be ignored if it is not interesting.

Concerning the deposit of pheromone, two alternatives were tested:
- The amount deposited is proportional to the length of the path. It is defined by the relation:

$$\tau(i,j) = \tau(i,j) + \frac{\mathrm{cost}(i,j)}{\sum_{(i,j) \in K} \mathrm{cost}(i,j)}$$

- Only the best path among those identified by all the ants receives pheromone. In this case a fixed amount is deposited.
With more than 100 characters and fewer than 600 iterations, the average results are as follows:

Ant   Parameters                              Key         Match Car. (Max)   Match Car. (Avg)
30    α=0.7, β=0.5, γ=0.5, τ=0.5, Cevap=65    Simple      18                 13.51
22    α=0.1, β=0.7, γ=1.0, τ=0.2, Cevap=20    Middle      15                 11.3
77    α=0.8, β=0.5, γ=1.0, τ=0.4, Cevap=41    Difficult   11                 9.41
c. Elitist ants:
Proposed in [16], the idea of this algorithm is to grant an additional amount of pheromone to the arcs involved in an interesting path. In other words, some ants, called elitist ants, are allowed to retrace these arcs so that they remain rich in pheromone and invite other ants to pass through.

In practice, the key giving the clearest text is kept for the next iterations, in which only a few characters are changed in an attempt to get a better result; otherwise, the previous key is restored.
Under the same conditions, the average of the unigram and bigram results is illustrated by the following figure:

[Figure: number of correct characters (Car-correct)]

The amount of pheromone granted is defined by the relation:

$$\tau(i,j) = \tau(i,j) + \frac{\mathrm{Bonus}}{\sum_{(i,j) \in K} \mathrm{cost}(i,j)}$$
The experiments show that this amount is significantly higher than, but close to, the value of pheromone deposited by the other ants. The best results, obtained with a number of iterations close to 850, a population of 60 ants and a text of 180 characters, are given in the following table:
Key         Parameters
difficult   α=0.8, β=0.5, γ=-1.0, τ=0.2, Evap=1, BI=40, BS=200, Cvp=55
In the genetic algorithm, only changes to the basic parameters give effective results. However, various alternatives were tested, including the manner of selecting individuals (elitist, by tournament, etc.) and of replacing them within the base population (stationary, elitist, etc.).
Key         Max   Avg
Simple      23    18.31
Middle      17    11.3
Difficult   10    9.04
7.3 Synthesis:
As mentioned above, each algorithm can give its best results under specific conditions, including the choice of initial parameters.
A comparison of the various algorithms with the same parameters, a text of 120 characters and a number of iterations close to 550 gave the following results:
[Figure: number of correct characters (Char Corrects) versus message size]
Each test was started with a random choice of keys. Each iteration of the various algorithms gives birth to one or more new keys, obtained by changing the order of a few characters of the keys available in the previous iteration.
Thus, the number of keys handled increases as the processing progresses, which consumes more resources.
The number of keys generated by the different algorithms is shown in the following figure:
[Figure: number of generated keys versus message size, for real ants (F-reelles), virtual ants (F-virtuelles), elitist ants (elitistes) and the genetic algorithm (A-Génétique)]
The processing was carried out on a dual processor 2.0. The execution time of the various algorithms was as follows:

[Figure: execution time, Time (ms), versus message size]

d. Genetic algorithm:
The average results obtained are shown in the following table:

Key   Parameters             Match Car. (Max, Avg)
      Gen=80, Np=40, Pc=1
      Gen=360, Np=60, Pc=1
      Gen=120, Np=61, Pc=1
8. Conclusion
In this paper, we presented a comparison of some algorithms belonging to the class of heuristics. The field of exploration is a set of texts encrypted by substitution and transposition techniques of medium difficulty, and made more difficult by various modern cryptosystems.

The first achievement is the control of most of the parameters of the algorithms used. This was obtained by transferring the results of some algorithms so that they serve as input data for others, which improved these results distinctly.

The second achievement is obtaining results equivalent to those reported in the literature under minimal conditions, including short texts and a fairly reasonable processing time.

The synthesis of the tests showed that the ACO algorithms can yield better results than those produced by genetic algorithms; the latter, however, are more efficient in terms of resource consumption and remain competitive for deciphering texts of significant volume.

The major problem in this kind of research is the existence of various statistical tables derived from the corpora of several languages, whose use makes the results vary distinctly. A statistical study and a classification of these tables according to the specific texts to be deciphered are therefore essential for satisfactory results in this area.

9. References
[1] A. Malapert, G. Jeantet, « Métaheuristique d’un ordonnancement Juste à temps », Université Pierre et Marie Curie, 2005.
[2] S. Peleg and A. Rosenfeld, "Breaking substitution ciphers using a relaxation algorithm", Communications of the ACM, vol. 22(11), 1979.
[3] J. Carroll and S. Martin, "The automated cryptanalysis of substitution ciphers", Cryptologia, vol. 10(4), 1986.
[4] W. S. Forsyth and R. Safavi-Naini, "cryptanalysis of substitution ciphers", Cryptologia, vol. 17(4), 1993.
[5] R. Spillman, M. Janssen, B. Nelson and M. Kepner, "Use of a genetic algorithm in the cryptanalysis of simple substitution ciphers", Cryptologia, vol. 17(1), 1993.
[6] A.J. Clark, "Optimisation Heuristics for Cryptology", PhD thesis, Queensland University of Technology, 1998.
[7] D. Bahler and J. King, "An implementation of probabilistic relaxation in the cryptanalysis of simple substitution systems", Cryptologia, vol. 16(3), 1992.
[8] M. Faisal Uddin, Amr M. Youssef, "Life Technique for the Cryptanalysis of Simple Substitution Ciphers", IEEE CCECE/CCGEI, Ottawa, May 2006.
[9] A. Dimovski, D. Gligoroski, "Alphabetic substitution cipher using a parallel genetic algorithm domain cooperation through SCOPES PROJECT", Ohrid, Macedonia, 2003.
[10] Zim, Herbert Spencer. Codes and secret writing (abridged edition). Scholastic Book Services, fourth printing, 1962.
[11] Beker, Henry; Piper, Fred (1982). Cipher Systems: The Protection of Communications.
[12] Lewand, Robert (2000). Cryptological Mathematics. The Mathematical Association of America.
[13] Nelson, Gerald, Wallis, Sean, and Aarts, Bas (2002). Exploring Natural Language. Working with the British Component of the International Corpus of English.
[14] Christophe RITZENTHALER, "The cryptology Course", Université de Marseille, 2006.
[15] A.J. Bagnall, Les applications des algorithmes génétiques en cryptanalyse, 1996.
[16] M. Dorigo, V. Maniezzo, A. Colorni, "Ant System: Optimization by a colony of cooperating agents", IEEE Transactions on Systems, Man, and Cybernetics-Part B, 26(1):29-41.