Solving Quadratic Equations with XL
on Parallel Architectures

Chen-Mou Cheng¹, Tung Chou², Ruben Niederhagen², Bo-Yin Yang²

¹ National Taiwan University, Taipei, Taiwan
² Academia Sinica, Taipei, Taiwan

Leuven, Sept. 11, 2012
The XL algorithm

Some cryptographic systems can be attacked by solving a system of
multivariate quadratic equations, e.g.:

- AES:
  - 8000 quadratic equations in 1600 variables over F2
    (Courtois and Pieprzyk, 2002)
  - 840 sparse quadratic equations and 1408 linear equations in
    3968 variables over F256 (Murphy and Robshaw, 2002)
- multivariate cryptographic systems, e.g. the QUAD stream cipher
  (cryptanalysis by Yang, Chen, Bernstein, and Chen, 2007)
The XL algorithm

- XL is an acronym for eXtended Linearization:
  - extend the quadratic system by multiplying it with appropriate
    monomials
  - linearize by treating each monomial as an independent variable
  - solve the linearized system
- special case of Gröbner basis algorithms
- first suggested by Lazard (1983)
- reinvented by Courtois, Klimov, Patarin, and Shamir (2000)
- alternative to Gröbner basis solvers such as Faugère's F4 (1999,
  implemented e.g. in Magma) and F5 (2002)
- suitable for solving generic/random systems
The XL algorithm

For b ∈ ℕⁿ denote by x^b the monomial x1^b1 x2^b2 ··· xn^bn and by
|b| = b1 + b2 + ··· + bn the total degree of x^b.

given:      finite field K = F_q;
            system A of m multivariate quadratic equations:
            ℓ1 = ℓ2 = ··· = ℓm = 0, ℓi ∈ K[x1, x2, ..., xn]
choose:     operational degree D ∈ ℕ
extend:     system A to the system
            R^(D) = {x^b ℓi = 0 : |b| ≤ D − 2, ℓi ∈ A}
linearize:  treat each monomial x^d, |d| ≤ D, as a new variable
            to obtain a linear system M
solve:      the linear system M
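To make the extend/linearize/solve steps concrete, here is a toy sketch in Python over F2 (the example system, the degree D = 3, and all names are illustrative choices of mine, not from the slides); for this tiny instance the linearized system is solved by enumerating the 2^8 monomial assignments instead of by Gaussian elimination:

```python
# Toy XL over F2: extend a quadratic system to degree D = 3,
# linearize, and solve.  Over F2 we reduce x_i^2 = x_i, so every
# monomial is a squarefree subset of the variables.
from itertools import combinations

n = 3  # variables x0, x1, x2

# all squarefree monomials of degree <= 3, as frozensets of indices
monos = [frozenset(c) for d in range(4) for c in combinations(range(n), d)]
col = {mon: j for j, mon in enumerate(monos)}

# quadratic system with unique common root (x0, x1, x2) = (1, 0, 1);
# an equation is the list of monomials whose F2-sum is zero
system = [
    [frozenset(), frozenset({1}), frozenset({2}), frozenset({0, 1})],  # 1 + x1 + x2 + x0*x1
    [frozenset({0}), frozenset({1}), frozenset({0, 2})],               # x0 + x1 + x0*x2
    [frozenset(), frozenset({0}), frozenset({1, 2})],                  # 1 + x0 + x1*x2
    [frozenset({2}), frozenset({0, 1}), frozenset({0, 2})],            # x2 + x0*x1 + x0*x2
]

# extend: multiply every equation by every monomial x^b, |b| <= D - 2 = 1
rows = []
multipliers = [frozenset()] + [frozenset({i}) for i in range(n)]
for b in multipliers:
    for e in system:
        r = [0] * len(monos)
        for mon in e:
            r[col[mon | b]] ^= 1   # x_i^2 = x_i; equal terms cancel over F2
        rows.append(r)

# linearize + solve: for this tiny system, enumerate all assignments to
# the 8 monomial variables whose constant monomial is 1
solutions = []
for bits in range(1, 256, 2):      # bit 0 (constant monomial) fixed to 1
    v = [(bits >> j) & 1 for j in range(len(monos))]
    if all(sum(ri * vi for ri, vi in zip(r, v)) % 2 == 0 for r in rows):
        solutions.append(v)

assert len(solutions) == 1         # the linearized kernel pins a unique assignment
sol = [solutions[0][col[frozenset({i})]] for i in range(n)]
print(sol)   # -> [1, 0, 1]
```

Note how linearization forgets that, say, the variable for x0*x1 must equal the product of the variables for x0 and x1; the extension step adds enough equations that the kernel nevertheless collapses to the evaluation vector of the solution.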
Minimum degree D0 for reliable termination (Yang and Chen):

    D0 := min{ D : [λ^D] ((1 − λ)^(m−n−1) (1 + λ)^m) ≤ 0 },

where [λ^D] denotes the coefficient of λ^D in the series expansion.
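One possible way to evaluate this formula (a small helper of mine, not from the slides, assuming m > n): expand (1 − λ)^(m−n−1) (1 + λ)^m with exact integer arithmetic and return the first D whose coefficient is non-positive:

```python
# Compute D0 = min{ D : [x^D] (1 - x)^(m-n-1) * (1 + x)^m <= 0 }.
def d0_coeffs(m, n, terms):
    # coefficients of (1 - x)^(m-n-1) * (1 + x)^m up to x^(terms-1);
    # exact integer arithmetic, assumes m > n
    c = [1] + [0] * (terms - 1)
    for _ in range(m - n - 1):                       # multiply by (1 - x)
        c = [c[i] - (c[i - 1] if i > 0 else 0) for i in range(terms)]
    for _ in range(m):                               # multiply by (1 + x)
        c = [c[i] + (c[i - 1] if i > 0 else 0) for i in range(terms)]
    return c

def d0(m, n, bound=64):
    c = d0_coeffs(m, n, bound)
    return next(D for D, cD in enumerate(c) if cD <= 0)

print(d0(10, 5))   # -> 3
```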
The XL algorithm

We use the Wiedemann algorithm instead of a Gauss solver for the
linear system M. Thus, we do not compute a complete Gröbner basis
but distinguished solutions!
The Wiedemann algorithm – the basic idea

given:     A ∈ K^(N×N)
wanted:    v ∈ K^N such that Av = 0
solution:  compute the minimal polynomial f of A of degree d: f(A) = 0

    Σ_{i=0}^{d} f_i A^i = 0
    Σ_{i=0}^{d} f_i A^i z = 0          (choose z ∈ K^N randomly)
    Σ_{i=1}^{d} f_i A^i z + f_0 z = 0
    Σ_{i=1}^{d} f_i A^i z = 0          (since f_0 = 0 for singular A)
    A · ( Σ_{i=1}^{d} f_i A^(i−1) z ) = 0

so v = Σ_{i=1}^{d} f_i A^(i−1) z is the desired kernel vector.
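The derivation can be checked on a toy instance. In the sketch below the matrix, the field GF(7), and the brute-force search for the minimal polynomial of the Krylov sequence z, Az, A²z, ... are illustrative choices of mine, not the paper's method:

```python
# Basic Wiedemann idea over GF(7): find the minimal polynomial f of
# the Krylov sequence of a singular A, then use f_0 = 0 to build a
# kernel vector v with A v = 0.
from itertools import product

P = 7
A = [[1, 2, 3], [4, 5, 6], [0, 0, 0]]   # singular over GF(7)
z = [1, 1, 1]                            # "random" vector for the demo

def matvec(M, v):
    return [sum(a * b for a, b in zip(row, v)) % P for row in M]

# Krylov vectors t_i = A^i z
t = [z]
for _ in range(len(A)):
    t.append(matvec(A, t[-1]))

# smallest d with t_d = c_0 t_0 + ... + c_{d-1} t_{d-1};
# then f(x) = x^d - sum_i c_i x^i annihilates the sequence
for d in range(1, len(t)):
    for cs in product(range(P), repeat=d):
        if all(t[d][k] == sum(c * t[i][k] for i, c in enumerate(cs)) % P
               for k in range(len(z))):
            f = [(-c) % P for c in cs] + [1]   # coefficients f_0, ..., f_d
            break
    else:
        continue
    break

assert f[0] == 0                 # A is singular, so x divides f
# v = sum_{i >= 1} f_i A^(i-1) z satisfies A v = 0
v = [sum(f[i] * t[i - 1][k] for i in range(1, len(f))) % P
     for k in range(len(z))]
print(v, matvec(A, v))   # -> [4, 6, 4] [0, 0, 0]
```

The brute-force search stands in for Berlekamp–Massey, which recovers the same polynomial from the scalar sequence introduced on the next slide.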
The Berlekamp–Massey algorithm

Given a linearly recurrent sequence

    S = {a_i}_{i=0}^∞,  a_i ∈ K,

compute an annihilating polynomial f of degree d such that

    Σ_{i=0}^{d} f_i a_{j+i} = 0  for all j ∈ ℕ.

The Berlekamp–Massey algorithm requires the first 2·d elements of S
as input and computes f_i ∈ K, 0 ≤ i ≤ d.
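For reference, here is a textbook scalar Berlekamp–Massey over a prime field GF(p) in Python (a sketch of mine, not the block variant used later). It returns the connection polynomial C with C[0] = 1 and the recurrence length L; the annihilating f above is the reverse of C, i.e. f_i = C[L − i]:

```python
# Berlekamp-Massey over GF(p): given a sequence s, find C with
# C[0] = 1 and s[j] + C[1]*s[j-1] + ... + C[L]*s[j-L] = 0 for j >= L.
def berlekamp_massey(s, p):
    C, B = [1], [1]       # current and previous connection polynomials
    L, m, b = 0, 1, 1     # L: recurrence length, b: last non-zero discrepancy
    for j in range(len(s)):
        # discrepancy of C against s at position j
        d = s[j] % p
        for i in range(1, L + 1):
            d = (d + C[i] * s[j - i]) % p
        if d == 0:
            m += 1
            continue
        T = C[:]                              # save C before updating
        coef = d * pow(b, p - 2, p) % p       # d / b in GF(p)
        C = C + [0] * max(0, len(B) + m - len(C))
        for i, Bi in enumerate(B):
            C[i + m] = (C[i + m] - coef * Bi) % p
        if 2 * L <= j:
            L, B, b, m = j + 1 - L, T, d, 1
        else:
            m += 1
    return C, L

# demo: Fibonacci mod 7 satisfies s[j] - s[j-1] - s[j-2] = 0
fib = [1, 1]
for _ in range(8):
    fib.append((fib[-1] + fib[-2]) % 7)
C, L = berlekamp_massey(fib, 7)
print(C, L)   # -> [1, 6, 6] 2
```

With 2·d = 4 sequence elements already sufficient here, the remaining elements only confirm the recurrence (all further discrepancies are zero).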
For Wiedemann, the scalar sequence is obtained from the matrix A as

    a_i = x A^i z,  with a random row vector x ∈ K^(1×N).
The block Wiedemann algorithm

Due to Coppersmith (1994), three steps:

Input: A ∈ K^(N×N), parameters m, n ∈ ℕ, κ ∈ ℕ of size
N/m + N/n + O(1).

BW1: Compute the sequence {a_i}_{i=0}^{κ} of matrices a_i ∈ K^(n×m)
     using random matrices x ∈ K^(m×N) and z ∈ K^(N×n):
     a_i = (x A^i y)^T, for y = Az.               O(N²(w_A + m))
BW2: Use block Berlekamp–Massey to compute a polynomial f with
     coefficients in K^(n×n).
     Coppersmith's version: O(N² · n)
     Thomé's version: O(N log² N · n)
BW3: Evaluate the reverse of f:
     W = Σ_{j=0}^{deg(f)} A^j z (f_{deg(f)−j})^T.  O(N²(w_A + n))
Parallelization of BW1

Input: A ∈ K^(N×N), parameters m, n ∈ ℕ, κ ∈ ℕ of size
N/m + N/n + O(1).

Compute the sequence {a_i}_{i=0}^{κ} of matrices a_i ∈ K^(n×m) using
random matrices x ∈ K^(m×N) and z ∈ K^(N×n):

    a_i = (x A^i y)^T,  for y = Az,

using {t_i}_{i=0}^{κ}, t_i ∈ K^(N×n):

    t_i = y = Az       for i = 0,
    t_i = A t_{i−1}    for 0 < i ≤ κ,
    a_i = (x t_i)^T.
Parallelization of BW1

INPUT: macaulay_matrix<N, N> A;
       sparse_matrix<N, n> z;

matrix<N, n> t_new, t_old;
matrix<m, n> a[N/m + N/n + O(1)];
sparse_matrix<m, N, weight> x;

x.rand();
t_old = z;
for (unsigned i = 0; i <= N/m + N/n + O(1); i++)
{
  t_new = A * t_old;
  a[i] = x * t_new;
  swap(t_old, t_new);
}
RETURN a;
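The same loop in plain Python (a sequential toy of mine over F2, with dense matrices and made-up dimensions, only to pin down the data flow; the real implementation uses a sparse Macaulay matrix and parallelizes the product):

```python
# Sequential rendering of the BW1 loop: iterate t <- A*t and project
# each iterate with x to collect the sequence of m x n blocks.
import random

P = 2                     # toy field F2
N, m, n = 6, 2, 2         # illustrative sizes
random.seed(1)

def rand_mat(rows, cols):
    return [[random.randrange(P) for _ in range(cols)] for _ in range(rows)]

def matmul(X, Y):
    return [[sum(X[i][k] * Y[k][j] for k in range(len(Y))) % P
             for j in range(len(Y[0]))] for i in range(len(X))]

A = rand_mat(N, N)        # stands in for the sparse Macaulay matrix
x = rand_mat(m, N)        # random projection (sparse in the real code)
z = rand_mat(N, n)

kappa = N // m + N // n + 2          # N/m + N/n + O(1)
a, t_old = [], z
for i in range(kappa + 1):
    t_new = matmul(A, t_old)         # t_new = A * t_old
    a.append(matmul(x, t_new))       # a[i] = x * t_new, an m x n block
    t_old = t_new

print(len(a))   # -> 9 blocks, each of size m x n
```

Only the current iterate t is kept in memory; the output of BW1 is the short sequence of small m × n blocks, which is why the stage parallelizes over the big product t = A·t.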
Parallelization of BW1 – multicore processor

[Figure: the matrix multiplication t_i = A × t_{i−1} is split into
blocks of rows computed in parallel on 2 and then 4 cores, using
OpenMP.]
Parallelization of BW1 – cluster system

[Figure: t_i = A × t_{i−1} distributed over 2 and then 4 computing
nodes; each node computes one block of the product and contributes a
block a^(i) ∈ K^(n×m) of width n.]
Parallelization of BW1 – cluster system

[Figure: runtime in seconds of BW1, BW2 (Thomé), BW2 (Coppersmith),
and BW3 for block widths m = n from 32 to 1024.]
Parallelization of BW1 – cluster system

[Figure: memory consumption in GB of BW2 (Thomé) and BW2
(Coppersmith) for block widths m = n from 32 to 1024; a 36 GB level
is marked.]
Parallelization of BW1 – cluster system

[Figure: animation of t_i = A × t_{i−1} on 2 and then 4 computing
nodes; each node holds a block of A and of t_{i−1}, computes its part
of t_i, and exchanges the blocks of t_i needed by the other nodes,
using non-blocking MPI communication (ISend, IRecv, ...) or,
alternatively, the InfiniBand Verbs API.]
Parallelization of BW1 – cluster system

[Figure: runtime ratio (0 to 1) of the Verbs API and MPI
implementations for 2 to 256 nodes.]
Parallelization of BW1 – cluster system

[Figure: runtime split into communication and computation, and data
sent per node, for 2 to 256 cluster nodes
(InfiniBand MT26428, 2 ports of 4×QDR, 32 Gbit/s).]
Runtime n = 16, m = 18, F16

[Figure: runtime in minutes of BW1, BW2 (Thomé), BW2 (Coppersmith),
and BW3 on a cluster (8 to 64 cores) and on a NUMA machine (6 to 48
cores).]
Comparison to Magma F4

[Figure: time in hours and memory in GB of XL (64 cores, and scaled)
versus Magma F4, for n = 18 to 25 and m = 2n.]
Conclusions

XL with block Wiedemann as system solver is an alternative to
Gröbner basis solvers, because

- in about 80% of the cases it operates at the same degree
  (for generic systems),
- it scales well on multicore systems and moderate cluster sizes,
  and
- it has a relatively small memory demand.
Thank you!