Improved Lattice Basis Reduction Algorithms and their Efficient Implementation on Parallel Systems

Timo Bartkewitz
October 30, 2009

Diploma Thesis
Ruhr-University Bochum
Department of Electrical Engineering and Information Sciences
Chair for Embedded Security
Prof. Dr.-Ing. Christof Paar
Advised by Dr.-Ing. Tim Güneysu
Statement / Erklärung
I hereby declare that the work presented in this thesis is my own work and that to the best of my knowledge it is original, except where indicated by references to other authors.
I hereby declare that I have written this diploma thesis independently, that I have used no sources or aids other than those indicated, and that I have marked all quotations as such.
Date / Datum
Timo Bartkewitz
Abstract / Zusammenfassung
In this work, we present the first implementation of a complete lattice basis reduction algorithm on graphics cards. Sequential algorithms for lattice basis reduction offer good results concerning runtime and reduction quality, but unfortunately not at the same time. For graphics cards, we refrain from these sequential algorithms and improve parallel algorithms that are known to deliver much better runtime results, however at the cost of a weaker reduction quality. Therefore, we introduce a variant of the parallel All-swap algorithm, the ordered All-swap algorithm, an advance to meet the parallel computing capabilities of modern streaming processors. With our novel approach we can achieve both sufficient reduction quality and improved runtime.
Keywords: Lattice Basis Reduction, Parallelization, Graphics Cards, CUDA, Ordered All-Swap

In this thesis we present the first implementation of a complete lattice basis reduction algorithm on graphics cards. Sequential algorithms for lattice basis reduction already offer good results with respect to reduction quality and runtime, but usually not both at the same time. With graphics cards in mind, we move away from these algorithms and turn to parallel algorithms which, suitably improved, promise a substantially better runtime at the cost of reduction quality. To this end we introduce a variant of the All-swap algorithm, the ordered All-swap algorithm, a further development intended to match the parallel processing capabilities of modern streaming processors. With our new approach we can achieve a satisfying reduction quality together with an improved runtime.
Keywords: Lattice Basis Reduction, Parallelization, Graphics Cards, CUDA, Ordered All-Swap
Contents

1. Introduction
   1.1. Motivation
   1.2. Aim of This Thesis
   1.3. Thesis Outline

2. Lattice Theory
   2.1. Linear Algebra - Notations and Definitions
   2.2. Introduction to Lattices
   2.3. Lattice Problems

3. QR Decomposition
   3.1. Gram-Schmidt Process
   3.2. Givens Rotations
   3.3. Householder Reflections

4. Lattice Basis Reduction I: Basics and Sequential Approach
   4.1. Fundamentals
        4.1.1. Orthogonalization
        4.1.2. Size Reduction
        4.1.3. Basis Permutation
   4.2. Sequential Algorithms
        4.2.1. LLL-Algorithm
        4.2.2. Floating Point LLL-Algorithm
        4.2.3. Other Sequential Algorithms
   4.3. Lattice Attacks and Applications

5. Parallel Computing
   5.1. Fundamentals of Parallel Computing
   5.2. GPU Computing
        5.2.1. Compute Unified Device Architecture (CUDA)
        5.2.2. Hardware Overview
        5.2.3. CUDA Programming Model
        5.2.4. CUDA Programming Interface
        5.2.5. CUDA Performance Guidelines
   5.3. Outlook: Open Computing Language (OpenCL)

6. Lattice Basis Reduction II: Parallel Approach
   6.1. Parallel Orthogonalization
   6.2. Parallel Size Reduction
   6.3. Parallel Basis Permutation
   6.4. Parallel Algorithms
        6.4.1. All-Swap Algorithm
        6.4.2. Ordered All-Swap Proposal
        6.4.3. Parallelized Sequential Algorithms

7. Implementation
   7.1. Overview
   7.2. Phase I: Orthogonalization
        7.2.1. Standard Householder Reflections
        7.2.2. Standard Givens Rotations
        7.2.3. Givens Rotations with Combined Stages
        7.2.4. Transformation
   7.3. Phase II: Size Reduction
        7.3.1. Transposition
        7.3.2. Size Reduction I: Coefficients
        7.3.3. Size Reduction II: Basis
   7.4. Phase III: Permutation
        7.4.1. Usual All-Swap Phase
        7.4.2. Ordered All-Swap Phase
        7.4.3. Swap Indication

8. Practical Results and Discussion

9. Conclusion and Future Work

A. Short Vectors

List of Figures
List of Tables
List of Algorithms
Bibliography
1. Introduction
In this introductory chapter we outline why we are concerned with this subject, in particular the related work, and what we aim for in this thesis. A brief thesis outline concludes the chapter.
1.1. Motivation
Lattice basis reduction is an important and interesting tool in linear algebra. Various applications, such as the factorization of polynomials and integers, the solving of knapsack problems, and the hidden number problem [Hin04], are enabled by finding a relatively short lattice basis and especially the shortest vector of a given lattice. The latter can be used to break instances of the public-key cryptosystem RSA [RSA78]. Beside this cryptosystem, which is based on factoring, there is the class of lattice-based cryptosystems that are assumed to be secure against attacks using a quantum computer. Considering those crypto schemes, lattice basis reduction can be applied to determine the hardness of the respective problems. Well-known lattice-based cryptosystems are NTRU [HPS98], Merkle-Hellman [MMI+78], Ajtai-Dwork [AD97], and Goldreich-Goldwasser-Halevi [GGH96]. However, the problem of finding such a short basis, respectively a short basis vector, is NP-complete [vEB81] and thus can only be solved in exponential runtime. Nevertheless, there is a weakened problem, the so-called approximated version. The approximated version can be solved in almost polynomial runtime, and fortunately it can be shown that solving this version instead is sufficient for the applications mentioned above.
The first practical algorithm to find a short lattice basis vector was proposed in 1982 by Lenstra, Lenstra and Lovász and is called the LLL-algorithm [LLL82]. It offers polynomial runtime but involves a quality factor which depends exponentially on the magnitude of the lattice. Additionally, the LLL-algorithm suffers from a further disadvantage: for correctness reasons it applies exact integer arithmetic, which has shown itself to be very inefficient. Hence many efforts were made to improve this algorithm concerning runtime behavior and the reachable reduction quality. This already implies that an optimal approach must be a tradeoff between performance and the expected reduction results. Schnorr and Euchner therefore proposed a variant of the LLL-algorithm that applies floating point arithmetic and thus offers much better runtime results, better known as the Schnorr-Euchner algorithm [SE94]. Simultaneously, Schnorr and Euchner presented the hybrid BKZ-algorithm [SE94], which is most widely used for lattice basis reduction today because it delivers the best reduction quality.
Concerning the runtime, another usual approach in practice is to parallelize common algorithms. Especially nowadays it makes sense to use this paradigm: modern parallel systems, i.e. multicore CPUs and many-core based graphics cards, are available to everyone and offer impressive computational performance. Massive parallelism became a keyword in many scientific fields to facilitate the computation of very complex tasks. Lattice basis reduction in general belongs to this scope, too. As already suggested, it is usual to parallelize common sequential algorithms, but of course there exist algorithms which are natively designed for parallel computing structures. The All-swap algorithm [Vil92] is such a native parallel algorithm, but it met with little response in practice due to its demand for huge computational resources, i.e. processors. Furthermore, it offers a weaker reduction quality than its sequential counterparts.
1.2. Aim of This Thesis
In this work we concentrate on already existing algorithms that can be parallelized to a large extent. There are virtually no previous works considering lattice basis reduction algorithms on many-core parallel systems, except the usual way of involving systems with a small number of cores in the range of 4 (multicore CPU) to 16 (multicore CPU cluster). Here we turn to architectures based on graphics cards that offer numerous processors and enable employing algorithms that were almost of theoretical interest so far due to their great demand for computational resources, the All-swap algorithm for instance. Ultimately, we want to present the first, to the best of our knowledge, implementation of lattice basis reduction on graphics cards and provide an incentive for future works. Already known algorithms and concepts will be adapted to fit the graphics card architecture, but new concepts and improvements will be proposed as well to form an optimal solution for lattice basis reduction.
1.3. Thesis Outline
In chapter two we introduce lattice theory, conveying significant terms and notions. The lattice is defined and the corresponding important problems are illustrated.
In chapter three the first component of the reduction is introduced and its several methods are explained. The QR decomposition constitutes the foundation of the lattice basis reduction, and thus this chapter also includes examples to facilitate a better understanding.
In chapter four we provide an overview of sequential algorithms and their fundamentals. Furthermore, we show an exemplary application, an attack on a well-known lattice-based public-key cryptosystem.
Afterwards, in chapter five, the approach of parallel computing is introduced, especially the advantages to be expected, as well as the limits dictated by the underlying system. The parallel system employed in this work is described in detail because its properties are crucial to the entire procedure.
Chapter six shows how to parallelize the respective algorithms and concepts and how they form the lattice basis reduction as a whole.
Chapter seven contains our main work. The implementation is apportioned into three phases in respect of the major steps implied in the reduction. All implementations are tailor-made to fit the given computational architecture and to deliver the best performance.
Finally, chapter eight depicts in which sections we achieve satisfying results and which sections still accommodate weak spots.
2. Lattice Theory
In this chapter we first provide notations and basic mathematical fundamentals to facilitate understanding of lattice theory and the QR decomposition. We then introduce the definitions, properties, and the most important problems concerning a lattice.
2.1. Linear Algebra - Notations and Definitions
According to usual practice, R denotes the set of real numbers, Q the set of rational numbers, and Z the set of integer numbers.
If a capital letter is used, we denote an m-by-n matrix A within the vector space R^(m×n) in row-major order. The corresponding lower case a_{i,j} refers to the (i, j)th entry of A.
Given a matrix A, A^T denotes the corresponding transpose and A^(−1) the corresponding inverse. A matrix is called orthogonal if and only if its transpose is equal to its inverse, A^T = A^(−1).
If a lower case letter is used, we denote a column vector x = (x_1, x_2, . . . , x_n)^T within the vector space R^(n×1), respectively a row vector x = (x_1, x_2, . . . , x_n) within R^(1×n). We refer to x_i as the ith element of x.
Furthermore, we write ⌈r⌋ := ⌊r + 1/2⌋ for the nearest integer of r ∈ R, whereas ⌊r⌋ denotes the greatest integer smaller than or equal to r ∈ R. By |r| we denote the absolute value of r ∈ R.
Subsequent considerations on lattice theory and the QR decomposition demand some important definitions.
Definition 2.1.1 (Inner Product) Let u, v ∈ R^n; the inner product (scalar product) is defined by

⟨u, v⟩ = Σ_{i=1}^{n} u_i v_i = u^T v.
Definition 2.1.2 (Linear Hull) Given a vector space V, the set span(S) is defined to be the intersection of all subspaces of V spanned by S. Let S be the set

S = { v_i | 1 ≤ i ≤ d, v_i ∈ R^n },

then

span(S) = span(v_1, v_2, . . . , v_d) = { Σ_{i=1}^{d} a_i v_i | a_i ∈ R }

is called the linear hull of the set S.
In other words, the linear hull may also be defined as the collection of all (finite) linear combinations of the elements of S. The linear hull span(S)^⊥ is given by a set S consisting of pairwise orthogonal vectors.
Definition 2.1.3 (p-Norm) Let u ∈ R^n and 1 ≤ p < ∞; the p-norm is defined by

||u||_p = ( Σ_{i=1}^{n} |u_i|^p )^(1/p).

p = 2 yields the 2-norm, which is also known as the Euclidean norm, respectively Euclidean length:

||u||_2 = ( Σ_{i=1}^{n} u_i² )^(1/2) = sqrt(u_1² + u_2² + . . . + u_n²).

For p = ∞ the maximum norm is defined by

||u||_∞ = max(|u_1|, |u_2|, . . . , |u_n|).
2.2. Introduction to Lattices
In linear algebra a lattice in R^n is a discrete, additive, abelian subgroup of R^n consisting of points. Discrete signifies that there are no cluster points; all points have a minimum Euclidean distance from each other. Lattices compose the solution space of a linear, real system of equations. The construction of short solutions is of particular importance; as we shall see later, this means finding short solutions algorithmically. Further detailed information can be found in [Sch06, May08].
Definition 2.2.1 (Lattice) Let b_1, b_2, . . . , b_k ∈ R^d, k ≤ d, be linearly independent; the set

L = { u ∈ R^d | u = Σ_{i=1}^{k} a_i b_i, a_i ∈ Z }

is called a lattice.
Hence, every lattice L can be represented by a set B = {b_1, b_2, . . . , b_k} of column vectors. We call B the basis of the lattice L; thus L(B) is the set of all finite, integer linear combinations of the basis vectors b_i.
The addition of vectors is associative and commutative, and further a lattice satisfies the following abelian group axioms:
- Closure. For all points a, b ∈ L, (a + b) is also a point in L.
- Identity element. Every lattice L contains the zero vector 0_d.
- Inverse element. For each point a ∈ L the inverse −a is also in L.
Definition 2.2.2 (Lattice Dimension and Embedding Dimension) The parameter

dim(L) := rank(B) = k

is called the lattice dimension, while the parameter d is called the embedding dimension. A lattice where k = d is referred to as a lattice with full rank.
Figure 2.1.: 2-dimensional lattice Z² with distinct bases
A basis B is never unique. Every basis B can be transformed into another basis C such that L(B) = L(C) (Fig. 2.1) by
- permuting columns, or
- multiplying a column by −1, or
- adding an integer multiple of one vector to another vector.
These operations are also referred to as a unimodular transformation. By such a transformation we multiply the basis matrix B with a unimodular matrix U ∈ Z^(d×d) which has the determinant det(U) = ±1.
Example: As seen in Figure 2.1, the two distinct bases

B = ( 1 2 )     C = (  3  1 )
    ( 2 0 )         ( −2 −2 )

generate the 2-dimensional lattice Z². We now want to verify that actually L(B) = L(C) by finding two unimodular matrices U_1, U_2 ∈ Z^(2×2) such that

B = C · U_1,   C = B · U_2.

We find

U_1 = (  1  1 )     U_2 = ( −1 −1 )
      ( −2 −1 )           (  2  1 )

each with determinant det = 1.
Definition 2.2.3 (Gram Matrix) The matrix consisting of the inner products of the basis vectors b_i is called the Gram matrix:

G = B^T B = [⟨b_i, b_j⟩]_{1≤i,j≤k} ∈ R^(k×k).
Definition 2.2.4 (Lattice Determinant) If the lattice has full rank, the absolute value of the determinant of the basis B,

det(L) = |det(B)| = sqrt(det(G)),

is called the lattice determinant. In case the lattice does not have full rank (cf. Lattice Dimension) the determinant is defined by means of the so-called Gram-Schmidt orthogonalization.
The lattice determinant det(L) remains unchanged under a unimodular transformation and is not bound to a particular basis; it is an invariant of the lattice. For our example we obtain |det(B)| = |det(C)| = 4. An upper bound on the determinant is given by Hadamard's inequality.
Figure 2.2.: Volume of the basis mesh (parallelepiped)
Proposition 2.2.5 (Hadamard's Inequality) Let L be a d-dimensional lattice with corresponding basis B ∈ R^(d×k); then it holds that

det(L) ≤ Π_{i=1}^{k} ||b_i||_2.

Equality det(L) = Π_{i=1}^{k} ||b_i||_2 holds when the basis vectors b_i are pairwise orthogonal. Moreover, the determinant can be interpreted geometrically as the volume of the spanned parallelepiped

P(B) := { u ∈ R^d | u = Σ_{i=1}^{k} a_i b_i, 0 ≤ a_i ≤ 1 }.   (2.1)

P(B) is also called the basis mesh of basis B (Fig. 2.2).
Due to the infinite set of unimodular transformations there exists an infinite number of bases. Consequently, the question arises whether a basis is "good" or more useful than another. For many algorithmic problems the interesting bases consist of short (in respect of the Euclidean length) and pairwise orthogonal vectors.
A basis is short if the basis vectors compose an orthogonal basis, by which we mean that the vectors are pairwise orthogonal and normalized. Logically, it is possible to transform a basis B into such an orthogonal basis B^⊥, but this does not necessarily mean that L(B) = L(B^⊥), nor that B^⊥ generates a lattice at all. Therefore, we need metrics to define short bases and in particular short vectors. Short vectors can be identified by the successive minima.
Figure 2.3.: Successive minima λ_1 = ||b_2||_2, λ_2 = ||b_1||_2
Definition 2.2.6 (Successive Minima) Let L be a d-dimensional lattice. For each i ≤ d we denote by λ_i(L) the minimal radius of a ball centered at the zero point of L such that the ball contains i linearly independent vectors. We call λ_i(L) the ith successive minimum.
Figure 2.3 exemplarily depicts the two successive minima in Z².
The shortest vector, which has length λ_1(L), obeys an upper bound that only depends on the lattice determinant det(L) and its dimension dim(L). We know that the lattice determinant is invariant and equal to the volume of the spanned parallelepiped. Hence, the shortest vector should be the shortest edge of the parallelepiped, with a length less than or equal to the dth root of det(L). This corresponds to the first Minkowski proposition, up to a small factor, the Hermite constant.
Definition 2.2.7 (Hermite Invariant and Hermite Constant) Let L be a d-dimensional lattice; the Hermite invariant is defined by

γ(L) := λ_1(L)² / det(L)^(2/d).

The Hermite constant of dimension d is defined by

γ_d := max { γ(L) | dim(L) = d }.
The Hermite constant γ_d is bounded above by

γ_d ≤ (4/π) · Γ(1 + d/2)^(2/d) ≤ 2d/(eπ) + O(1) ≈ 0.2342·d + O(1).   (2.2)
If the corresponding Hermite constant is known, it facilitates formulating the following upper bound on the first successive minimum.

Proposition 2.2.8 (Minkowski 1) Let L be a d-dimensional lattice; then

λ_1(L) ≤ √γ_d · det(L)^(1/d).
Minkowski's second proposition delivers an upper bound for the product of all successive minima.

Proposition 2.2.9 (Minkowski 2) Let L be a d-dimensional lattice; then

Π_{i=1}^{d} λ_i(L) ≤ √(γ_d^d) · det(L).
Beside the Minkowski propositions there exists a second metric to help decide whether a basis is short compared to another. Again, a basis is short if the basis vectors are pairwise orthogonal, but such a basis may not generate the lattice. The aim is to generate a lattice basis whose basis vectors are nearly orthogonal. The quality of the basis vector orthogonalization is stated by the orthogonality defect.
Definition 2.2.10 (Orthogonality Defect) Let L be a d-dimensional lattice with corresponding basis B ∈ R^(d×k); the orthogonality defect is defined by

dft(B) = ( Π_{i=1}^{k} ||b_i||_2 ) / det(L).

It holds that 1 ≤ dft(B) because of Hadamard's inequality.
The orthogonality defect is closely related to Hadamard's inequality. When dft(B) = 1 holds, the basis vectors are pairwise orthogonal. Otherwise, it states the relative deviation from a fully orthogonalized basis.
2.3. Lattice Problems
Lattice theory is concerned with three major problems that are briefly discussed in the following. As already mentioned, one aim is to find a nearly orthogonalized basis for a given basis, because many algorithmic problems of interest require bases with exactly that property.
The first problem was already outlined above: the problem of finding a short vector within a given lattice basis. This is called the shortest vector problem (SVP).

Definition 2.3.1 (Shortest Vector Problem) Let L be a d-dimensional lattice with corresponding basis B ∈ R^(d×k). In the shortest vector problem (SVP) one obtains the basis B as input and searches for a vector of Euclidean length λ_1 within the lattice.
It is often sufficient to find a short vector that is almost as short as the real shortest vector of length λ_1. That implies the approximated version.

Definition 2.3.2 (Approximated γ-Shortest Vector Problem) The approximated version γ-SVP is equal to the exact version except for the factor γ; thus one searches for a short vector that has at most the length γλ_1.
A more general case of the shortest vector problem is the closest vector problem. It deals with the problem of finding a lattice vector that is as close as possible to a given vector.

Definition 2.3.3 (Closest Vector Problem) Let L be a d-dimensional lattice with corresponding basis B ∈ R^(d×k). In the closest vector problem (CVP) one searches for a vector u ∈ L that is closest to a given target vector t, meaning

||u − t|| = min_{v∈L} ||v − t||.

Analogous to the γ-SVP, the approximated version of the closest vector problem involves the factor γ to find a vector that is near the real closest vector.

Definition 2.3.4 (Approximated γ-Closest Vector Problem) The approximated version γ-CVP is equal to the exact version except for the factor γ; thus one searches for a vector u ∈ L to a given target vector t such that

||u − t|| ≤ γ · min_{v∈L} ||v − t||.
The last major problem we want to introduce is the shortest basis problem. This problem represents the consequent extension of the shortest vector problem. We know that a d-dimensional lattice contains d successive minima, which implies that the shortest vector has length λ_1, the second shortest the length λ_2, and so on.

Definition 2.3.5 (Shortest Basis Problem) Let L be a d-dimensional lattice with corresponding basis B ∈ R^(d×k). In the shortest basis problem (SBP) one obtains the basis B as input and searches for a basis such that

||b_i||_2 = λ_i, b_i ∈ B,

or, equivalently,

dft(B) = min { dft(C) | C is a basis of L }.

Definition 2.3.6 (Approximated γ-Shortest Basis Problem) The approximated version γ-SBP is equal to the exact version except for the factor γ; thus one searches for a basis such that

||b_i||_2 ≤ γλ_i, b_i ∈ B,

or, equivalently,

dft(B) ≤ γ · min { dft(C) | C is a basis of L }.
In this thesis we concentrate mainly on the shortest basis problem. This is due to the existing algorithms, which handle the bases as a whole. Nevertheless, the probably most important problem, the SVP, is also covered by this work. However, as is customary, we work on the approximated versions.
3. QR Decomposition
Given a vector space, it is possible to state a corresponding orthogonal basis without great effort. Generally, this is not true for a given lattice basis B = (b_1, b_2, . . . , b_k), because an orthogonal basis does not exist for every lattice. Only the aforementioned unimodular transformations lead to a new lattice basis that generates the given lattice. That is why we have to consider a way to construct a basis whose basis vectors are pairwise, relatively near orthogonal.
The so-called QR decomposition is the tool that generates an orthogonal basis in the sense of a unimodular transformation. It is also often denoted as QR factorization. As these names indicate, the QR decomposition splits a given matrix A ∈ R^(d×k) into two matrices. The first matrix Q ∈ R^(d×k) contains the orthogonalized portion of A. The second matrix R ∈ R^(k×k) has upper triangular form and shapes the relation between A and Q. Detailed information on the QR decomposition can be found in [GHM04], section 4.1.
Proposition 3.0.7 (QR Decomposition) Let A ∈ R^(d×k) be nonsingular; then the QR decomposition A = QR is unique, with Q unitary and R an upper triangular matrix with positive diagonal elements:

A = (a_1, a_2, . . . , a_k) = QR = (q_1, q_2, . . . , q_k) · ( r_{1,1} r_{1,2} . . . r_{1,k} )
                                                            (    0    r_{2,2} . . . r_{2,k} )
                                                            (    :       :    . .      :    )
                                                            (    0     . . .   0    r_{k,k} )
Thereby we annotate that the orthogonalized portion Q is not our focus of interest, but rather the upper triangular matrix R. As said above, R shapes the relation between A and Q, or in other words how A is created resting upon the orthogonal matrix Q. As we shall see later, R, more precisely the matrix of Gram-Schmidt coefficients (see next section), enables us to orthogonalize a lattice basis in the sense of a unimodular transformation.
There are several methods to compute the QR decomposition, namely the Gram-Schmidt process, Givens rotations, Householder reflections, and the Cholesky decomposition. Each has a number of advantages and disadvantages.
Due to the characteristics of the chosen parallel system, we pass on the Cholesky decomposition because of its disadvantages concerning our intention to handle arbitrarily natured basis matrices: for one thing it demands the matrix to be symmetric, and for another the matrix has to be positive definite. These properties would have to be checked before processing, and in addition not every matrix satisfies them. All other methods are presented hereafter.
Note. The orthogonalized version of a basis B is denoted by B* and by implication the basis vectors b_i by b*_i.
3.1. Gram-Schmidt Process
The method is named for Jørgen Gram and Erhard Schmidt and is probably the most common QR decomposition method. It is also part of the first lattice basis reduction algorithm, the LLL-algorithm.
Definition 3.1.1 (Orthogonal Projection) Let u, v ∈ R^n; then

π_u(v) = ( ⟨v, u⟩ / ⟨u, u⟩ ) · u

is called the orthogonal projection of the vector v onto the vector u.
The orthogonal versions b*_i of the basis vectors b_i are then determined by the iterated process

b*_1 := b_1,   (3.1)
b*_i := b_i − Σ_{j=1}^{i−1} π_{b*_j}(b_i) = b_i − Σ_{j=1}^{i−1} µ_{i,j} b*_j,   (3.2)

whereas

µ_{i,j} := ⟨b_i, b*_j⟩ / ⟨b*_j, b*_j⟩  if i > j,   1  if i = j,   0  else   (3.3)

are called the Gram-Schmidt coefficients.

Example: This example illustrates the Gram-Schmidt process for the lattice basis

B = ( 3 1 )
    ( 2 2 ),

as seen in Figure 3.1.
b*_1 = b_1 = ( 3 )
             ( 2 )

µ_{2,1} = ⟨b_2, b*_1⟩ / ⟨b*_1, b*_1⟩ = 7/13

b*_2 = b_2 − µ_{2,1} b*_1 = ( 1 ) − (7/13) · ( 3 ) = (1/13) · ( −8 )
                            ( 2 )            ( 2 )            ( 12 )

After applying the Gram-Schmidt process, the vectors b*_1 and b*_2 are pairwise orthogonal. In this example we see that the orthogonal vectors b*_i do not form a lattice basis; this is due to the non-integer value of µ_{2,1}.
Figure 3.1.: Illustrative Gram-Schmidt process in Z²
The Gram-Schmidt process gives the QR decomposition of a basis B = (b_1, b_2, . . . , b_k) such that

Q = (q_1, q_2, . . . , q_k) = ( b*_1/||b*_1||_2 , b*_2/||b*_2||_2 , . . . , b*_k/||b*_k||_2 )   (3.4)

and

R = ( r_{1,1} r_{1,2} . . . r_{1,k} )
    (    0    r_{2,2} . . . r_{2,k} )
    (    :       :    . .      :    )                                                  (3.5)
    (    0     . . .   0    r_{k,k} )

  = ( r_{1,1}    0    . . .    0    )   ( 1  µ_{2,1}  . . .  µ_{k−1,1}  µ_{k,1}  )
    (    0    r_{2,2} . . .    0    )   ( 0     1     µ_{3,2}  . . .    µ_{k,2}  )
    (    :       :    . .      :    ) · ( :           . .               :        )      (3.6)
    (    0     . . .   0    r_{k,k} )   ( 0   . . .     0       1    µ_{k,k−1}   )
                                        ( 0     0     . . .     0        1      )
We now present the algorithms involving the Gram-Schmidt process to compute the QR decomposition. Algorithm 3.1 is the classical version (CGS), which suffers from very poor numerical properties: typically there is a severe loss of orthogonality of the computed vectors q_i beyond a certain basis dimension, due to roundoff errors on the computer. The second algorithm embodies the so-called modified Gram-Schmidt (MGS) method, which yields much better results. This is due to a rearrangement of the computation within the process. All algorithms concerning the QR decomposition and its methods are taken from [GvL96], chapter five.
Algorithm 3.1 Classical Gram-Schmidt (CGS)
Input: Lattice basis B = (b_1, b_2, . . . , b_k) ∈ R^(d×k)
Output: Q = (q_1, q_2, . . . , q_k), R = [r_{i,j}]_{1≤i,j≤k}
1: for i = 1 to k do
2:   b*_i = b_i
3:   for j = 1 to i − 1 do
4:     r_{j,i} = ⟨q_j, b_i⟩
5:     b*_i = b*_i − r_{j,i} q_j
6:   end for
7:   r_{i,i} = ||b*_i||_2
8:   q_i = b*_i / r_{i,i}
9: end for
Algorithm 3.2 states the modified Gram-Schmidt method. Both the classical and the modified version require 2dk² floating point operations (flops).
3.2. Givens Rotations
Givens rotations were introduced to numerical analysis in the 1950s by Wallace Givens. By a slight amendment they are also suitable to compute the QR decomposition. The Givens rotations produce the same result but in another way: in the first place they do not focus on the orthogonal matrix Q but on the matrix R. They bring the basis matrix B into upper triangular form step by step by inserting lower triangular zeros.
One Givens rotation is represented by a usual two-dimensional rotation matrix of the form
Algorithm 3.2 Modified Gram-Schmidt (MGS)
Input: Lattice basis B = (b_1, b_2, . . . , b_k) ∈ R^(d×k)
Output: Q = (q_1, q_2, . . . , q_k), R = [r_{i,j}]_{1≤i,j≤k}
1: for i = 1 to k do
2:   b*_i = b_i
3: end for
4: for i = 1 to k do
5:   r_{i,i} = ||b*_i||_2
6:   q_i = b*_i / r_{i,i}
7:   for j = i + 1 to k do
8:     r_{i,j} = ⟨q_i, b*_j⟩
9:     b*_j = b*_j − r_{i,j} q_i
10:  end for
11: end for

G(i, j, θ) = ( 1 ...  0  ...  0  ... 0 )
             ( :  ..  :       :      : )
             ( 0 ...  c  ...  s  ... 0 )
             ( :      :  ..   :      : )   (3.7)
             ( 0 ... −s  ...  c  ... 0 )
             ( :      :       :  ..  : )
             ( 0 ...  0  ...  0  ... 1 )

where c = cos(θ) and s = sin(θ). That is, a Givens rotation matrix is equivalent to the identity matrix except that

g_{i,i} = c,   (3.8)
g_{i,j} = s,   (3.9)
g_{j,i} = −s,   (3.10)
g_{j,j} = c.   (3.11)
Note that, contrary to the ordinary convention, i determines the column and j the row. Usually, the matrix is used to rotate a vector x counterclockwise in the (i, j)-plane by θ radians, namely as G(i, j, θ)^T x, which is where the name rotation comes from.
Since we are interested in inserting zeros in the vector x, we need to find values for c = cos(θ) and s = sin(θ) such that
Figure 3.2.: Rotation of a vector x by θ radians in Z²
(  c s ) ( x_i )   ( r )
( −s c ) ( x_j ) = ( 0 ).   (3.12)
It is obvious that the explicit calculation of θ is neither necessary nor desirable. In a direct way one obtains

r = sqrt(x_i² + x_j²),   (3.13)
c = x_i / r,   (3.14)
s = x_j / r.   (3.15)
However, the calculation of r may overflow or underflow due to roundoff errors on a computer. The alternative Algorithm 3.3 is used to obtain c and s.
Example: This example illustrates the Givens rotations for the lattice basis

B = ( 3 1 )
    ( 2 2 ).

In Figure 3.2 the first column is shown as it is rotated by the Givens rotation. The second column of B is indeed also altered, but not zeroised.

G(1, 2) = (  c s )  = (1/√13) · (  3 2 )
          ( −s c )              ( −2 3 )

G(1, 2) B = (1/√13) · ( 13 7 )
                      (  0 4 )
Algorithm 3.3 Computation of c and s
Input: x_i, x_j
Output: c, s
1: if x_j = 0 then
2:   c = 1, s = 0
3: else
4:   if |x_i| < |x_j| then
5:     t = x_i / x_j,  s = 1/sqrt(1 + t²),  c = s·t
6:   else
7:     t = x_j / x_i,  c = 1/sqrt(1 + t²),  s = c·t
8:   end if
9: end if
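A Python sketch of Algorithm 3.3 (the function name is ours): the quotient t is bounded by 1 in absolute value, so squaring it cannot overflow the way x_i² + x_j² can for large inputs:

    import math

    def givens_cs(xi, xj):
        # Compute c, s of a Givens rotation without forming r = sqrt(xi^2 + xj^2) directly.
        if xj == 0.0:
            return 1.0, 0.0                      # nothing to zero out: G = I
        if abs(xi) < abs(xj):
            t = xi / xj
            s = 1.0 / math.sqrt(1.0 + t * t)
            return s * t, s                      # c = s*t
        t = xj / xi
        c = 1.0 / math.sqrt(1.0 + t * t)
        return c, c * t                          # s = c*t

    # For (x_i, x_j) = (3, 2) this returns (3/sqrt(13), 2/sqrt(13)), matching the example above.
    print(givens_cs(3.0, 2.0))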
When a Givens rotation is multiplied from the left to the basis matrix B, only the rows i and j of B are affected. This process is iteratively repeated until the lower triangular part is zeroised (see also the example above). For instance, given a basis B ∈ R^(3×3), this approach of applying the Givens rotations is shown below.

    ( × × × )    ( ∗ ∗ ∗ )    ( ∗ ∗ ∗ )    ( × × × )
B = ( × × × ) ⇒ ( 0 ∗ ∗ ) ⇒ ( 0 × × ) ⇒ ( 0 ∗ ∗ )
    ( × × × )    ( × × × )    ( 0 ∗ ∗ )    ( 0 0 ∗ )

B ⇒ G(1, 2)B ⇒ G(1, 3)G(1, 2)B ⇒ G(2, 3)G(1, 3)G(1, 2)B

Entries labeled with "∗" were altered by the last rotation but not zeroised, while entries labeled with "×" were not affected. The order is important to obtain the correct result. Columns have to be zeroised step by step from left to right, but it makes no difference whether we start with the first or the last entry within a column. All in all, for a basis matrix B ∈ R^(d×k) there are ((2d − 1)k − k²)/2 entries to zeroise.
Obviously, after applying the last Givens rotation G(l, d) with

l := k       if d > k,
     k − 1   else (d = k),

one obtains the matrix R of the QR decomposition

R = G(l,d) G(l−1,d) . . . G(l−1,d−1) G(l−2,d−1) . . . G(1,3) G(1,2) B   (3.16)

and the matrix Q of the form

Q = (G(l,d) G(l−1,d) . . . G(l−1,d−1) G(l−2,d−1) . . . G(1,3) G(1,2))^T.   (3.17)
Algorithm 3.4 Standard Givens Rotations
Input: Lattice basis B = (b_1, b_2, . . . , b_k) ∈ R^(d×k)
Output: Q = (q_1, q_2, . . . , q_k), R = [r_{i,j}]_{1≤i,j≤k}
1: Q = I
2: R = B
3: for i = 1 to k do
4:   for j = i + 1 to d do
5:     Use Algorithm 3.3 to compute c, s with input r_{i,i}, r_{j,i}
6:     R = G(c, s) R
7:     Q = G(c, s) Q
8:   end for
9: end for
10: Q = Q^T
These equations directly lead to Algorithm 3.4, which describes the standard Givens rotations.
The standard method requires 3k²(d − k/3) floating point operations.
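To make the row-wise structure explicit, here is a compact Python sketch of Algorithm 3.4 (numpy; givens_cs is the helper from the previous sketch, repeated so the code is self-contained). Instead of forming the full d-by-d rotation matrix, each rotation is applied only to the two affected rows:

    import numpy as np

    def givens_cs(xi, xj):
        # Alg. 3.3 (see the previous sketch).
        if xj == 0.0:
            return 1.0, 0.0
        if abs(xi) < abs(xj):
            t = xi / xj; s = 1.0 / np.sqrt(1.0 + t * t)
            return s * t, s
        t = xj / xi; c = 1.0 / np.sqrt(1.0 + t * t)
        return c, c * t

    def givens_qr(B):
        # Standard Givens QR (Alg. 3.4): zero the subdiagonal column by column.
        R = B.astype(float)
        d, k = R.shape
        Q = np.eye(d)
        for i in range(k):                       # columns, left to right
            for j in range(i + 1, d):            # zero out entry (j, i)
                c, s = givens_cs(R[i, i], R[j, i])
                G = np.array([[c, s], [-s, c]])
                R[[i, j], :] = G @ R[[i, j], :]  # a rotation touches rows i and j only
                Q[[i, j], :] = G @ Q[[i, j], :]
        return Q.T, R                            # Q = (product of all rotations)^T

    B = np.array([[3.0, 1.0], [2.0, 2.0]])
    Q, R = givens_qr(B)
    assert np.allclose(Q @ R, B)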
Fast Givens Rotations However, it is possible to nearly halve that number by an appropriate representation of the matrix Q [Gen73]. Therefore, the matrix Q is represented by a pair (M, D) where M^T M = D = diag(d_i), with D a diagonal matrix and each d_i positive. Hence,

Q = M D^(−1/2).   (3.18)

The matrices M and D are updated in a manner such that

D_new = F^T D F   (3.19)
M_new = M F   (3.20)

where F is a simplified Givens rotation matrix dependent on the two pivot elements that determine the values c and s. The matrix F is either

F_1(i, j) = ( 1 ...  0  ...  0  ... 0 )
            ( :  ..  :       :      : )
            ( 0 ...  s  ...  1  ... 0 )
            ( :      :  ..   :      : )   (3.21)
            ( 0 ...  1  ...  c  ... 0 )
            ( :      :       :  ..  : )
            ( 0 ...  0  ...  0  ... 1 )

or

F_2(i, j) = ( 1 ...  0  ...  0  ... 0 )
            ( :  ..  :       :      : )
            ( 0 ...  1  ...  s  ... 0 )
            ( :      :  ..   :      : )   (3.22)
            ( 0 ...  c  ...  1  ... 0 )
            ( :      :       :  ..  : )
            ( 0 ...  0  ...  0  ... 1 )
The matrix F_1 is referred to as type-1 transformation, and consequently F_2 as type-2 transformation. This implies Algorithm 3.5 to compute c, s for use with the Givens rotations with the alternative representation of the matrix Q.
Algorithm 3.5 New Computation of c and s, Update of D
Input: x_i, x_j, D
Output: c, s, type, D_new
1: if x_j = 0 then
2:   c = 0, s = 0, type = 2 (no transformation necessary)
3: else
4:   if |x_i| < |x_j| then
5:     c = x_i / x_j,  s = c · d_j/d_i,  t = c·s + 1
6:     v = d_i,  d_i = t·d_j,  d_j = t·v
7:     type = 1
8:   else
9:     c = x_j / x_i,  s = c · d_j/d_i,  t = c·s + 1
10:    d_i = t·d_i,  d_j = t·d_j
11:    type = 2
12:  end if
13: end if
In contrast to Algorithm 3.3, this algorithm does not involve square roots in the computation. That is why this kind of Givens rotations is called square root free Givens rotations or fast Givens rotations. According to the alternative representation of Q and the computation of c, s, we are able to derive the algorithm to compute the QR decomposition from Algorithm 3.4. Thus, Algorithm 3.6 states the QR decomposition considering the fast Givens rotations method.
Beside the fast Givens rotations there are many variants based on this concept, all summarized in [GS91]. They differ from each other in the way c, s and Q, R are computed. Some have advantages regarding value overflow, respectively underflow, in higher matrix dimensions, but with negligible effects. One of these also forgoes divisions and should be well suited for devices without corresponding logical units.
Algorithm 3.6 Fast Givens Rotations
Input: Lattice basis B = (b_1, b_2, . . . , b_k) ∈ R^(d×k)
Output: Q = (q_1, q_2, . . . , q_k), R = [r_{i,j}]_{1≤i,j≤k}
1: D = diag(d_i), each d_i = 1
2: M = I
3: R = B
4: for i = 1 to k do
5:   for j = i + 1 to d do
6:     Use Algorithm 3.5 to compute c, s, type, D_new with input r_{i,i}, r_{j,i}, D
7:     if type = 1 then
8:       R = F_1(c, s) R
9:       M = M F_1(c, s)
10:    else
11:      R = F_2(c, s) R
12:      M = M F_2(c, s)
13:    end if
14:  end for
15: end for
16: Q = M D^(−1/2)
3.3. Householder Reflections
Householder reflections describe a reflection of a vector about a hyperplane containing the origin. Introduced in 1958 by Alston Householder, they are primarily intended to perform the QR decomposition.

Definition 3.3.1 (Householder Reflection) Applying a matrix of the form

H = I − (2/(v^T v)) · v v^T

to a vector x means performing a Householder reflection. H is called the Householder matrix and v is called the Householder vector. The vector x is said to be reflected in the hyperplane span(v)^⊥.

Although the Householder reflections definitely follow a different algebraic approach, there exist similarities to the Givens rotations. They also triangulate the lattice basis B to obtain the matrix R, but with the difference that the Householder reflections zeroise one entire column at once instead of introducing zeros step by step.
The Householder matrix H is multiplied to a vector x from the left. The vector x is then reflected in the hyperplane such that

H x = (a, 0, . . . , 0)^T,  a ∈ R.   (3.23)

As depicted above, the Householder matrix H is determined only by the Householder vector v, which is easy to create by aid of the vector x. The Householder vector can be expressed as

v = x + sign(x_1) ||x||_2 e_1,   (3.24)

where e_1 is the unit vector (1, 0, . . . , 0)^T. Usually, the Householder vector is normalized in practice so that v_1 = 1. Apparently, this relation causes a cancellation if x is a close multiple of e_1. This is why in practice Algorithm 3.7 is used to determine v, because the underlying formula by Parlett [Par71] does not suffer from this defect.
Algorithm 3.7 Computation of the Householder Vector
Input: x ∈ R^n
Output: Householder vector v
1: if x_1 > 0 then
2:   α = − ⟨(x_2, x_3, . . . , x_n)^T, (x_2, x_3, . . . , x_n)^T⟩ / (x_1 + ||x||_2)
3: else
4:   α = x_1 − ||x||_2
5: end if
6: v = (1, x_2/α, x_3/α, . . . , x_n/α)^T
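A Python sketch of Algorithm 3.7 (numpy; our own helper name, and it assumes x is not already a positive multiple of e_1, since α would then vanish):

    import numpy as np

    def house(x):
        # Householder vector v, normalized so that v[0] = 1 (Alg. 3.7, Parlett's formula).
        x = np.asarray(x, dtype=float)
        tail = x[1:] @ x[1:]                     # <(x_2,...,x_n), (x_2,...,x_n)>
        norm = np.sqrt(x[0] ** 2 + tail)         # ||x||_2
        if x[0] > 0:
            alpha = -tail / (x[0] + norm)        # cancellation-free form of x_1 - ||x||_2
        else:
            alpha = x[0] - norm
        v = np.empty_like(x)
        v[0] = 1.0
        v[1:] = x[1:] / alpha
        return v

    # For x = (3, 2)^T this yields v = (1, -3.3028)^T, matching the example that follows.
    print(house([3.0, 2.0]))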
Example: This example illustrates the Householder reflections for the lattice basis

B = ( 3 1 )
    ( 2 2 ).

In Figure 3.3 the first column is shown as it is reflected by the Householder reflection. The second column of B is indeed altered, but not zeroised.

v = (    1    )
    ( −3.3028 )

H = I − (2/(v^T v)) v v^T = ( 0.8321  0.5547 )
                            ( 0.5547 −0.8321 )

H B = ( 3.6056  1.9415 )
      (    0   −1.1094 )
Figure 3.3.: Reflection of a vector x in Z²
Similar to the Givens rotations, the Householder matrix is multiplied from the left to the lattice basis matrix. Moreover, this approach is also iteratively repeated until the lower triangular part of the basis matrix is zeroised. A significant difference is stressed by the part that is taken into account when applying a Householder reflection: in contrast to a Givens rotation, which only affects two rows, one Householder reflection affects a submatrix of the basis matrix B that is getting smaller each time. For example, a basis B ∈ R^(4×4) is handled as follows.

     ( ∗ ∗ ∗ ∗ )    ( × × × × )    ( × × × × )
B ⇒  ( 0 ∗ ∗ ∗ ) ⇒ ( 0 ∗ ∗ ∗ ) ⇒ ( 0 × × × )
     ( 0 ∗ ∗ ∗ )    ( 0 0 ∗ ∗ )    ( 0 0 ∗ ∗ )
     ( 0 ∗ ∗ ∗ )    ( 0 0 ∗ ∗ )    ( 0 0 0 ∗ )

B ⇒ H_1 B_{4,4} ⇒ H_2 H_1 B_{3,3} ⇒ H_3 H_2 H_1 B_{2,2}
Again, entries labeled with "∗" were altered by the last reflection but not zeroised, while entries labeled with "×" were not affected. B_{i,i} denotes the lower right submatrix that is altered by the (k − i + 1)th Householder reflection. Once more the order is crucial to obtain the correct result. Columns have to be zeroised step by step from left to right. Clearly, for a basis matrix B ∈ R^(d×k) there are k columns to zeroise (k − 1 if k = d).
After applying the last Householder reflection H_k one obtains the matrix R of the QR decomposition

R = H_k H_{k−1} . . . H_1 B   (3.25)
and the matrix Q of the form

Q = H_1^T H_2^T . . . H_k^T.   (3.26)

The QR decomposition based on the Householder reflections is given by Algorithm 3.8. This algorithm requires 2k²(d − k/3) flops.
Algorithm 3.8 Householder Reflections
Input: Lattice basis B = (b_1, b_2, . . . , b_k) ∈ R^(d×k)
Output: Q = (q_1, q_2, . . . , q_k), R = [r_{i,j}]_{1≤i,j≤k}
1: Q = I
2: R = B
3: for i = 1 to k − 1 do
4:   Use Algorithm 3.7 to compute the Householder vector v with input (r_{i,i}, . . . , r_{d,i})^T
5:   H = I − (2/(v^T v)) v v^T
6:   R = H R
7:   Q = Q H^T
8: end for
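A Python sketch of Algorithm 3.8 (numpy; house is the helper from the sketch after Algorithm 3.7, repeated here for self-containment). Note that the reflection is applied to the trailing submatrix only, which is what zeroises an entire column at once:

    import numpy as np

    def house(x):
        # Householder vector per Alg. 3.7; assumes x is not already a positive multiple of e_1.
        x = np.asarray(x, dtype=float)
        tail = x[1:] @ x[1:]
        norm = np.sqrt(x[0] ** 2 + tail)
        alpha = -tail / (x[0] + norm) if x[0] > 0 else x[0] - norm
        v = np.empty_like(x); v[0] = 1.0; v[1:] = x[1:] / alpha
        return v

    def householder_qr(B):
        # Householder QR (Alg. 3.8): returns Q (d x d) and upper triangular R (d x k).
        R = B.astype(float)
        d, k = R.shape
        Q = np.eye(d)
        for i in range(min(k, d - 1)):
            v = house(R[i:, i])                              # reflector for column i
            beta = 2.0 / (v @ v)
            R[i:, i:] -= beta * np.outer(v, v @ R[i:, i:])   # H applied to the submatrix
            Q[:, i:] -= beta * np.outer(Q[:, i:] @ v, v)     # accumulate Q = H_1^T H_2^T ...
        return Q, R

    B = np.array([[3.0, 1.0], [2.0, 2.0]])
    Q, R = householder_qr(B)
    assert np.allclose(Q @ R, B)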
Blocked Householder Reflections The above algorithm is quite simple and dominated by matrix-vector products. To improve the algorithm to the effect that mainly matrix-matrix products dominate the process, Bischof and Van Loan proposed a generalized version of the Householder QR decomposition [BvL87]. In that version, multiple Householder reflections are represented as a single transformation. Since each Householder matrix is a rank-one modification of the identity matrix,

H = H_1 H_2 . . . H_r   (3.27)
  = (I − (2/(v_1^T v_1)) v_1 v_1^T)(I − (2/(v_2^T v_2)) v_2 v_2^T) . . . (I − (2/(v_r^T v_r)) v_r v_r^T)   (3.28)

can be rewritten as

H_wy = H_1 H_2 . . . H_r = I + W Y^T   (3.29)

where W and Y are k-by-r matrices.

Lemma 3.3.2 Let H_wy = I + W Y^T with W, Y ∈ R^(k×r). If H = I − (2/(v^T v)) v v^T with v ∈ R^k and z = −(2/(v^T v)) H_wy v, then

H_wy H = I + (W z)(Y v)^T,

where (W z) and (Y v) denote the matrices W and Y with the columns z and v appended.
Algorithm 3.9 Computation of W and Y
Input: Lattice basis B = (b_1, b_2, . . . , b_k) ∈ R^(d×k), Householder vectors v_i, 1 ≤ i ≤ r
Output: W, Y ∈ R^(k×r)
1: Y = v_1
2: W = −(2/(v_1^T v_1)) v_1
3: for i = 2 to r do
4:   z = −(2/(v_i^T v_i)) (I + W Y^T) v_i
5:   W = (W z)
6:   Y = (Y v_i)
7: end for

By repeatedly applying the lemma, the block representation H_wy can be generated by Algorithm 3.9.
By using the blocked Householder QR decomposition the idea is to apply clusters of Householder transformations that are represented in the WY form. The cluster size is determined by the "blocking parameter" r. Unlike the unblocked Householder QR decomposition, where a Householder matrix H is multiplied to all of the lattice basis B, we first generate the Householder matrices H_1, H_2, . . . , H_r, multiply them to the submatrices of the first r columns of B, generate the block representation I + W_1 Y_1^T = H_1 H_2 . . . H_r, and then apply it to the remainder of B. Algorithm 3.10 illustrates this pattern.
Algorithm 3.10 Blocked Householder Reflections
Input: Lattice basis B = (b_1, b_2, . . . , b_k) ∈ R^(d×k), blocking parameter r
Output: Q = (q_1, q_2, . . . , q_k), R = [r_{i,j}]_{1≤i,j≤k}
1: Q = I
2: R = B
3: i = 1
4: while i ≤ k do
5:   l = min(i + r, k)
6:   for j = i + 1 to l do
7:     Use Algorithm 3.7 to compute the Householder vector v
8:     H = I − (2/(v^T v)) v v^T
9:     R = H R (applied to the current block of columns)
10:  end for
11:  Use Alg. 3.9 to compute the WY representation of H_{i+1} H_{i+2} . . . H_l
12:  R = (I + W Y^T) R (applied to the remaining columns of R)
13:  Q = Q (I + W Y^T)
14:  i = i + r
15: end while
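The following numpy sketch (didactic, with our own function name) shows the WY accumulation of Lemma 3.3.2 / Algorithm 3.9 and verifies it against the explicit product of Householder matrices:

    import numpy as np

    def wy_accumulate(vs):
        # Build W, Y such that H_1 H_2 ... H_r = I + W Y^T from Householder vectors vs.
        v = vs[0]
        Y = v.reshape(-1, 1)
        W = (-2.0 / (v @ v)) * v.reshape(-1, 1)
        for v in vs[1:]:
            z = (-2.0 / (v @ v)) * (v + W @ (Y.T @ v))   # z = -(2 / v^T v)(I + W Y^T) v
            W = np.hstack([W, z.reshape(-1, 1)])          # W = (W z)
            Y = np.hstack([Y, v.reshape(-1, 1)])          # Y = (Y v)
        return W, Y

    # Check on three random reflectors in R^5.
    rng = np.random.default_rng(1)
    vs = [rng.standard_normal(5) for _ in range(3)]
    H = np.eye(5)
    for v in vs:
        H = H @ (np.eye(5) - (2.0 / (v @ v)) * np.outer(v, v))
    W, Y = wy_accumulate(vs)
    assert np.allclose(H, np.eye(5) + W @ Y.T)

Applying the whole cluster to a matrix A then costs two matrix-matrix products, A + W (Y^T A), which is the point of the blocked variant.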
4. Lattice Basis Reduction I: Basics and Sequential Approach
In this section we present the most established lattice basis reduction algorithms. Lattice basis reduction deals with the problem of finding a short lattice basis for a given lattice basis. As outlined in chapter 2, this means finding a basis whose basis vectors satisfy the successive minima condition, namely the shortest basis problem (SBP). As customary in practice, the focus lies on finding one short vector with a length within the first successive minimum λ_1, the shortest vector problem (SVP). To avoid an exponential runtime, each algorithm solves the approximated γ-version. A lattice basis reduction on one hand follows the paradigm of reducing a lattice basis of high dimension while neglecting a strong definition of the shortest vector at the same time; on the other hand it reduces smaller bases, but with the aim of finding valuable short vectors. Justifiably, there exists no perfect lattice basis reduction algorithm, since this depends on many factors like runtime, dimension of the basis, the given problem to solve, and the expected quality of the solution. Naturally, the runtime plays an important role due to the fact that a perfect algorithm should be able to handle high dimensions and output appropriate solutions in acceptable time. In reality we need a scalable algorithm that offers the necessary tradeoff between the runtime and a good solution for a given problem.
4.1. Fundamentals
To lay the foundation we start by describing the orthogonalization, the size reduction, and the basis permutation, which constitute the three fundamentals of lattice basis reduction.
4.1.1. Orthogonalization
In chapter 3 we explained how to obtain an orthogonal basis from a given basis. In this section we clarify how to benefit from that process considering the lattice basis reduction. A lattice basis B and the matrix R acquired by the QR decomposition are related such that
B = QR = Q T T^(−1) R

⇔ (b_1, b_2, . . . , b_k) = Q · ( r_{1,1}    0    . . .    0    )   ( r_{1,1}^(−1)      0        . . .       0        )
                                (    0    r_{2,2} . . .    0    ) · (      0       r_{2,2}^(−1) . . .       0        ) · R   (4.1)
                                (    :       :    . .      :    )   (      :            :        . .        :        )
                                (    0     . . .   0    r_{k,k} )   (      0          . . .      0    r_{k,k}^(−1)   )

with T := diag(r_{1,1}, r_{2,2}, . . . , r_{k,k}).
Since Q is orthogonal, the columns of Q have the 2-norm 1, and by virtue of

Q · ( r_{1,1}    0    . . .    0    )
    (    0    r_{2,2} . . .    0    ) = (b*_1, b*_2, . . . , b*_k)   (4.2)
    (    :       :    . .      :    )
    (    0     . . .   0    r_{k,k} )

it follows that

||b*_i||²_2 = r²_{i,i}.   (4.3)
In section 3.1 we saw that

R = ( r_{1,1}    0    . . .    0    )   ( 1  µ_{2,1}  . . .  µ_{k−1,1}  µ_{k,1}  )
    (    0    r_{2,2} . . .    0    )   ( 0     1     µ_{3,2}  . . .    µ_{k,2}  )
    (    :       :    . .      :    ) · ( :           . .               :        )
    (    0     . . .   0    r_{k,k} )   ( 0   . . .     0       1    µ_{k,k−1}   )
                                        ( 0     0     . . .     0        1      )

and with (4.3) we have

R = ( ||b*_1||_2  ||b*_1||_2 µ_{2,1}   . . .   ||b*_1||_2 µ_{k−1,1}   ||b*_1||_2 µ_{k,1}     )
    (     0           ||b*_2||_2      ||b*_2||_2 µ_{3,2}   . . .      ||b*_2||_2 µ_{k,2}     )
    (     :                               . .                                 :              )   (4.4)
    (     0           . . .       0       ||b*_{k−1}||_2      ||b*_{k−1}||_2 µ_{k,k−1}       )
    (     0           . . .                     0                       ||b*_k||_2           )
If we omit the multiplication of each Gram-Schmidt coefficient µ_{i,j} with the 2-norms of the orthogonalized lattice basis vectors b*_i, then we can consider each µ_{i,j} as the relation between the vector b_i and the preceding vectors b_j with j < i, more precisely as the ratio ||b_j||_2 / ||b_i||_2. This shall not mean that those are equivalent, they are not. This statement should only give a hint of how the Gram-Schmidt coefficients, respectively the matrix R, are used in the lattice basis reduction. Regardless of whether we use the Gram-Schmidt process, the Givens rotations, or the Householder reflections, we are only interested in the Gram-Schmidt coefficients.
Hence, we have to transform the matrix R into our desired matrix R̃, where each entry of the ith row, except r_{i,i}, is divided by ||b*_i||_2. Later on the squared 2-norms ||b*_i||²_2 become important as well, so the entries r_{i,i} are squared afterwards. After this transformation we obtain

R̃ = ( ||b*_1||²_2   µ_{2,1}    . . .    µ_{k−1,1}      µ_{k,1}    )
    (      0      ||b*_2||²_2  µ_{3,2}   . . .          µ_{k,2}    )
    (      :                    . .                        :       )   (4.5)
    (      0        . . .     0   ||b*_{k−1}||²_2     µ_{k,k−1}    )
    (      0          0       . . .      0           ||b*_k||²_2   )

With the help of the QR decomposition and the definition of R̃ we are now able to state the orthogonalization algorithm in the form of Algorithm 4.1.
Algorithm 4.1 Basis Orthogonalization
Input: Lattice basis B = (b_1, b_2, . . . , b_k) ∈ R^(d×k)
Output: The Gram-Schmidt coefficients µ_{i,j} and the squared 2-norms ||b*_i||²_2, in the shape of R̃
1: Use Algorithm 3.1, 3.4 or 3.8 to compute R
2: for i = 1 to k do
3:   for j = i + 1 to k do
4:     r_{i,j} = r_{i,j} / r_{i,i}
5:   end for
6:   r_{i,i} = r²_{i,i}
7: end for
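A numpy sketch of Algorithm 4.1 (our own function name; any of the QR methods from chapter 3 can supply R, here numpy's built-in QR is used and its diagonal signs are normalized first, full rank assumed):

    import numpy as np

    def orthogonalize(B):
        # Alg. 4.1: return R~ with mu_{j,i} off the diagonal and ||b_i*||_2^2 on it.
        R = np.linalg.qr(B.astype(float), mode='r')
        R = R * np.sign(np.diag(R))[:, None]     # enforce a positive diagonal
        k = R.shape[0]
        for i in range(k):
            R[i, i + 1:] /= R[i, i]              # r_{i,j} / ||b_i*||_2 = mu_{j,i}
            R[i, i] **= 2                        # keep the squared 2-norm ||b_i*||_2^2
        return R

    # For the basis of the Gram-Schmidt example this yields mu_{2,1} = 7/13 ~ 0.5385
    # off the diagonal and ||b_1*||^2 = 13, ||b_2*||^2 = 16/13 on the diagonal.
    B = np.array([[3.0, 1.0], [2.0, 2.0]])
    print(orthogonalize(B))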
4.1.2. Size Reduction
Recapitulating the unimodulary transformation and its operations as dened in
chapter 2.2 a basis vector bi can be altered by multiplying −1 or add another basis
vector bj to it. The above orthogonalization already implies the exact approach,
a vector bi becomes smaller in respect to its Euclidean length while subtracting a
multiple of another vector bj . The factor is determined by the dedicated GramSchmidt coecient µi,j .
Definition 4.1.1 (Size Reduced Basis) Given a basis B = (b_1, b_2, . . . , b_k); if it holds for each Gram-Schmidt coefficient that

|µ_{i,j}| ≤ 1/2,  1 ≤ j < i ≤ k,

then the basis B is called size reduced.
The upper bound 1/2 is due to the unimodular transformation, since only integer multiples are allowed. So each Gram-Schmidt coefficient µ_{i,j} is rounded to the nearest integer ⌈µ_{i,j}⌋. If |µ_{i,j}| ≤ 1/2, a subtraction is stated to be not worthwhile, since it may be possible that ||b_i||_2 becomes larger instead of shorter.
Acting by this definition and involving each basis vector b_i, one immediately discerns that, if b_i is reduced for the first time, a second reduction may produce the exact opposite, because the second µ_{i,j} does not hold the correct ratio anymore. To counteract this, the µ_{i,j} have to be size reduced analogously. These deliberations directly lead to Algorithm 4.2, the size reduction algorithm.
Algorithm 4.2 Basis Size Reduction
Input: Lattice basis B = (b_1, b_2, . . . , b_k) ∈ R^(d×k) and the Gram-Schmidt coefficients [µ_{i,j}]_{1≤i,j≤k}
Output: Size reduced basis B
1: for i = 1 to k do
2:   for j = i − 1 downto 1 do
3:     b_i = b_i − ⌈µ_{i,j}⌋ b_j
4:     for l = 1 to j − 1 do
5:       µ_{i,l} = µ_{i,l} − ⌈µ_{i,j}⌋ µ_{j,l}
6:     end for
7:     µ_{i,j} = µ_{i,j} − ⌈µ_{i,j}⌋
8:   end for
9: end for
Example: We consider the Gram-Schmidt coefficients given by the 4-by-4 matrix

( 1 0.3  0.1 1.2 )
( 0  1  −0.4 2.1 )
( 0  0    1  0.8 )
( 0  0    0   1  ).

The first three columns are already size reduced, since

|µ_{i,j}| ≤ 1/2,  i = 2, 3 and j < i,

and so are the first three basis vectors of the corresponding basis matrix, which is not shown here. For the next reduction we obtain µ_{4,3} = 0.8. This means it is worthwhile to subtract (⌈µ_{4,3}⌋ = ⌈0.8⌋ = 1) times column three from column four. Analogously we proceed with the basis vectors b_2 and b_4. Now we have

( 1 0.3  0.1  1.1 )
( 0  1  −0.4  2.5 )
( 0  0    1  −0.2 )
( 0  0    0    1  ).

For the last reduction we subtract (⌈µ_{4,2}⌋ = ⌈2.5⌋ = 3) times column two from column four. We finally obtain

( 1 0.3  0.1  0.2 )
( 0  1  −0.4 −0.5 )
( 0  0    1   0.2 )
( 0  0    0    1  ).

A reduction concerning column one is no longer necessary, since the Gram-Schmidt coefficient already satisfies the condition µ_{4,1} = 0.2 ≤ 1/2.
Although each Gram-Schmidt coefficient satisfies the condition |µ_{i,j}| ≤ 1/2, the size reduction is a weak reduction. Calling to mind that a basis vector b_i is optimally short when it is equivalent to its orthogonalized version b*_i, the following proposition gives an upper bound for the size reduction.

Proposition 4.1.2 For a size reduced basis B it holds that

||b_i||²_2 ≤ ||b*_i||²_2 + (1/4) Σ_{j=1}^{i−1} ||b*_j||²_2.

There are also other size reduction conditions, but they are not relevant in practice due to their complexity and the resulting runtime.
4.1.3. Basis Permutation
Observing the orthogonalization and the size reduction, one recognizes that both are dependent on the order of the basis vectors b_i. Because permuting two basis vectors is a unimodular transformation operation, it naturally changes the order of the b_i and therefore the dedicated orthogonalized versions b*_i. However, this fact is helpful to gain a better size reduction.
From Algorithm 4.2 we know that the backmost vector b_k can be size reduced, dependent on the Gram-Schmidt coefficients, by k − 1 vectors, the vector b_{k−1} by k − 2 vectors, and finally b_2 by one vector, namely b_1. In this manner the vector b_1 itself is not size reduced. Hence it suggests itself that relatively short vectors are designated to be taken to the front and long vectors to the end of the basis. According to this, a relatively long vector has a better chance to get size reduced. The other way around, the shortest vector cannot be size reduced by other vectors at all, so it is sound to put it in the first place. So it is useful to swap two vectors b_i and b_j when the dedicated orthogonalized version b*_i gets shorter. Usually, this is done when

||b*_{i,new}||²_2 < (3/4) ||b*_{i,old}||²_2.
Deduced from this inequality we can find the swap condition for consecutive vectors b_i and b_{i+1}, the so-called Lovász condition:

(3/4) ||b*_i||²_2 > ||b*_{i+1}||²_2 + µ²_{i+1,i} ||b*_i||²_2.   (4.6)

Note that ||b*_i||²_2 is referred to as the squared 2-norm of b*_i.
This enables us to formulate the following definition of a reduced basis.

Definition 4.1.3 A lattice basis is called reduced when
1. it is size reduced and
2. the swap condition (4.6) does not hold for any 1 ≤ i < k.
For reduced bases we have

(3/4) ||b*_i||²_2 ≤ ||b*_{i+1}||²_2 + µ²_{i+1,i} ||b*_i||²_2

and, with |µ_{i+1,i}| ≤ 1/2,

⇒ (3/4) ||b*_i||²_2 ≤ ||b*_{i+1}||²_2 + (1/4) ||b*_i||²_2
⇔ (1/2) ||b*_i||²_2 ≤ ||b*_{i+1}||²_2
⇔ ||b*_i||²_2 ≤ 2 ||b*_{i+1}||²_2.

This implies the simplified swap condition

||b*_i||²_2 > 2 ||b*_{i+1}||²_2.   (4.7)
4.2. Sequential Algorithms
In the adjoining sections we briefly introduce common sequential algorithms, including the originator, the LLL-algorithm. Each algorithm owns different properties which recommend it for various applications, whereas each is an evolution of the former one. Nevertheless, there are numerous other lattice reduction algorithms, whether with sequential or parallel processing, but with negligible success concerning runtime or reduction quality. So we introduce the best-known sequential algorithms: the LLL-algorithm and its practical and often implemented floating point based variant, the SE-algorithm. Subsequently, we summarize other algorithms that are of theoretical interest and comprise new approaches.
4.2.1. LLL-Algorithm
In 1982 Lenstra, Lenstra and Lovász proposed the first lattice basis reduction algorithm [LLL82] that terminates in polynomial time with respect to the lattice dimension. It is known as the LLL-algorithm, named for its three inventors. The algorithm uses the classical Gram-Schmidt process (Alg. 3.1) for the orthogonalization and the size reduction algorithm (Alg. 4.2). Algorithm 4.3 shows the entire LLL-algorithm.
Algorithm 4.3 LLL-Algorithm
Input: Lattice basis B = (b_1, b_2, . . . , b_k) ∈ R^(d×k) and reduction parameter δ with 1/4 < δ < 1
Output: δ-LLL-reduced basis B
1: Use Algorithm 3.1 to compute the Gram-Schmidt coefficients µ_{i,j}
2: i = 2
3: while i ≤ k do
4:   Use Algorithm 4.2 to size reduce the vector b_i
5:   if δ ||b*_{i−1}||²_2 > ||b*_i||²_2 + µ²_{i,i−1} ||b*_{i−1}||²_2 then
6:     Swap basis vectors b_i and b_{i−1}
7:     Update Gram-Schmidt coefficients µ_{l,j} for l > i
8:     i = max(i − 1, 2)
9:   else
10:    i = i + 1
11:  end if
12: end while
The LLL-algorithm iteratively works in stages, where each such stage processes the vector b_i. During the process, the algorithm can return to a certain stage several times before it terminates. A single stage consists of the following operations. First of all, the vector b_i is size reduced by the preceding vectors b_{i−1}, b_{i−2}, . . . , b_1 in compliance with Algorithm 4.2. Afterwards it is checked whether the swap condition is true. If so, then vector b_i is interchanged with vector b_{i−1}. This is why the algorithm starts in stage 2. Being in stage i, the basis vectors b_{i−1}, b_{i−2}, . . . , b_1 are already LLL-reduced.
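To make the stage control flow concrete, here is a compact Python sketch of Algorithm 4.3 (numpy; for clarity it recomputes the Gram-Schmidt data from scratch after every change, which is far more wasteful than the incremental updates a real implementation uses, and it ignores the exact integer arithmetic issue discussed in section 4.2.2; the basis is assumed nonsingular):

    import numpy as np

    def lll(B, delta=0.75):
        # Textbook LLL sketch (Alg. 4.3); columns of B are the basis vectors, 1/4 < delta < 1.
        B = B.astype(float)
        k = B.shape[1]

        def gso(B):
            # Gram-Schmidt data: mu[i, j] for j < i and the squared norms ||b_i*||_2^2.
            Bs = B.copy(); mu = np.eye(k)
            for i in range(k):
                for j in range(i):
                    mu[i, j] = (B[:, i] @ Bs[:, j]) / (Bs[:, j] @ Bs[:, j])
                    Bs[:, i] -= mu[i, j] * Bs[:, j]
            return mu, (Bs * Bs).sum(axis=0)

        i = 1                                           # 0-based index: stage 2
        while i < k:
            mu, c = gso(B)
            for j in range(i - 1, -1, -1):              # size reduce b_i (Alg. 4.2)
                r = np.floor(mu[i, j] + 0.5)
                if r != 0:
                    B[:, i] -= r * B[:, j]
                    mu, c = gso(B)                      # lazy: recompute everything
            if delta * c[i - 1] > c[i] + mu[i, i - 1] ** 2 * c[i - 1]:
                B[:, [i - 1, i]] = B[:, [i, i - 1]]     # swap condition holds: swap
                i = max(i - 1, 1)
            else:
                i += 1
        return B

    # Example: reduce a random (almost surely nonsingular) 5-dimensional integer basis.
    rng = np.random.default_rng(0)
    print(lll(rng.integers(-50, 50, size=(5, 5))))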
Definition 4.2.1 (LLL-Reduced Basis) A lattice basis is called δ-LLL-reduced when
1. it is size reduced and
2. δ ||b*_i||²_2 ≤ ||b*_{i+1}||²_2 + µ²_{i+1,i} ||b*_i||²_2 for 1 ≤ i < k,
where δ is the reduction parameter.
LLL-reduced simply means reduced as in Definition 4.1.3, with the reduction parameter δ instead of the factor 3/4 defined above. The process terminates in stage k if there is no vector left that satisfies the swap condition. After a complete run one can expect that the shortest vector is embodied by b_1.
For a δ-LLL-reduced basis we can state the following properties, which expose the quality of the reduction.

Corollary 4.2.2 Let B ∈ R^(d×k) be a δ-LLL-reduced basis with 1/4 < δ < 1 and α = (δ − 1/4)^(−1); then

||b_1||²_2 ≤ α^((k−1)/2) det(L)^(2/k),   (4.8)
||b_i||²_2 ≤ α^(k−1) λ_i(L)²,   (4.9)
||b_j||²_2 ≤ α^(i−1) ||b*_i||²_2,  1 ≤ j ≤ i ≤ k,   (4.10)
Π_{i=1}^{k} ||b_i||²_2 ≤ α^(k(k−1)/2) det(L)².   (4.11)
In Section 4.1.3 we said that we can expect a much better reduction quality when the basis vectors b_i are ordered by the squared 2-norms of their dedicated orthogonalized versions b*_i, which is ensured by the Lovász condition. Unfortunately, the LLL-algorithm performs only one swap to achieve that order. It is also left unanswered whether the vector is placed at the optimal position at all; perhaps it should actually be moved further because it was shortened considerably.

These reflections lead to a swap condition that inserts the vector b_i at the optimal position determined by its dedicated squared 2-norm. This technique is called deep insertion since it places b_i deeper between the preceding vectors. Concurrently, there is no need to perform an additional swap concerning this vector unless it is later reduced by a succeeding vector that was not handled yet.
Algorithm 4.4 Deep Insertion
1: c := ‖b_i‖₂²
2: m = 1
3: while m < i do
4:   if δ‖b*_m‖₂² ≤ c then
5:     c = c − µ²_{i,m}‖b*_m‖₂²
6:     m = m + 1
7:   else
8:     (b_1, b_2, ..., b_i) = (b_1, ..., b_{m−1}, b_i, b_m, ..., b_{i−1})
9:     i = max(m − 1, 2)
10:  end if
11: end while
12: i = i + 1
The deep insertion method can be applied by replacing lines 5−11 of Algorithm 4.3 with Algorithm 4.4.
Definitely, one gains a better reduction quality than with the original LLL-algorithm as stated in the corollary above, but of course at the expense of runtime. Naturally, it is possible to use other methods to obtain the intended order, but probably with negative effects concerning the runtime. Involving the deep insertion method, the LLL-algorithm still remains a (super) polynomial runtime algorithm.
4.2.2. Floating Point LLL-Algorithm

Considering the speed, the LLL-algorithm possesses a bottleneck: the required exact arithmetic on large integers. It has therefore been proposed to keep only the basis in exact integer representation and to perform the orthogonalization in (approximated) floating point arithmetic to gain a speedup. As mentioned in Section 3.1, the classical Gram-Schmidt process that is used in the LLL-algorithm is numerically unstable. More exactly, this means that using it on a computer with constrained accuracy, we suffer a loss of orthogonality of the basis, and moreover we perform unnecessary or incorrect swaps. Hence, the LLL-algorithm is rather a theoretical construction, but the crucial fundament for practical algorithms.

The so-called Schnorr-Euchner algorithm (SE) [SE94] is such an evolution, a practical floating point LLL-algorithm with good numerical stability. For this purpose, the basis is kept both in exact and in floating point representation. The trick is to check whether an overflow or underflow with respect to the accuracy provided by the system has occurred. The step is then corrected by recalculating it using exact arithmetic. The SE-algorithm is exhibited by Algorithm 4.5. A value v′
denotes the floating point value corresponding to an exact value v. The integer τ denotes the number of precision bits in the floating point arithmetic.

To ensure the stability of the algorithm, dedicated conditions are added at two important positions. For the first condition in lines 12−16, it is checked whether the scalar product is too small in relation to the lengths of the involved vectors. If the condition is true, the scalar product is recalculated in exact arithmetic and approximated afterwards. Secondly, in line 21 it is checked whether the Gram-Schmidt coefficient is too big. Then, as a result, the swapping in line 33 is skipped and the coefficient is recalculated.
Although the SE-algorithm is a variant of the original LLL-algorithm, there is no proof of a polynomial runtime nor that it terminates at all. Due to the bit precision, two cases are possible: on the one hand, a too low accuracy can lead to an endless loop, and on the other hand, a very high accuracy could cause an exponential runtime. However, in practice one can expect a notable speedup, which was shown by reference to numerous practical examples. This is why the SE-algorithm is the one mostly implemented in practice, for example in the NTL [Sho], which offers the best performance for sequential algorithms.
4.2.3. Other Sequential Algorithms

Here we present other interesting sequential algorithms that are of more or less practical relevance. They are, however, not within the focus of this thesis.
Block Korkin-Zolotarev-Algorithm The BKZ-algorithm is a very promising one since it searches for short vectors instead of a short basis only. It was also invented by Schnorr and Euchner in the same paper [SE94] in which they proposed the SE-algorithm. The algorithm iterates over blocks which consist of a variable number β of basis vectors, where β also determines the runtime. To facilitate a block-wise reduction, a dedicated reduction notion is established in the form of the Korkin-Zolotarev basis.
Definition 4.2.3 (β-Korkin-Zolotarev-Reduced Basis) A lattice basis is called β-Korkin-Zolotarev-reduced when
1. it is size reduced and
2. ‖b*_i‖₂ ≤ λ₁(π_i(b_1, b_2, ..., b_{min(i+β−1,k)})), i = 1, 2, ..., k − 1,
where π_i : R^k → span(b_1, b_2, ..., b_i)^⊥.
The strength of a Korkin-Zolotarev reduction compared to an LLL-reduction is depicted by the following theorem.
Algorithm 4.5 Floating Point LLL-Algorithm (SE-Algorithm)
Input: Lattice basis B = (b_1, b_2, ..., b_k) ∈ R^{d×k} and reduction parameter δ with 1/4 < δ < 1
Output: δ-LLL-reduced basis B
1: i = 2, Fc = false
2: for i = 1 to k do
3:   Approximate basis b′_i = (b_i)′
4: end for
5: while i ≤ k do
6:   Use Algorithm 3.1 or 3.2 to compute the Gram-Schmidt coefficients µ_{i,1}, ..., µ_{i,i−1}
7:   if i = 2 then
8:     c_1 = ‖b′_1‖₂²
9:   end if
10:  c_i = ‖b′_i‖₂²
11:  for j = 1 to i − 1 do
12:    if |⟨b′_i, b′_j⟩| < 2^{−τ/2} ‖b′_i‖₂ ‖b′_j‖₂ then
13:      s = (⟨b_i, b_j⟩)′
14:    else s = ⟨b′_i, b′_j⟩
15:    end if
16:    µ_{i,j} = (s − Σ_{l=1}^{j−1} µ_{j,l} µ_{i,l} c_l)/c_j
17:    c_i = c_i − µ²_{i,j} c_j
18:  end for
19:  for j = i − 1 downto 1 do
20:    if |µ_{i,j}| > 1/2 then
21:      if |⌈µ_{i,j}⌋| > 2^{τ/2} then
22:        Fc = true
23:      end if
24:      for l = 1 to j − 1 do
25:        µ_{i,l} = µ_{i,l} − ⌈µ_{i,j}⌋ µ_{j,l}
26:      end for
27:      µ_{i,j} = µ_{i,j} − ⌈µ_{i,j}⌋, b_i = b_i − ⌈µ_{i,j}⌋ b_j, b′_i = (b_i)′
28:    end if
29:  end for
30:  if Fc then
31:    Fc = false, i = max(i − 1, 2)
32:  else
33:    if δ c_{i−1} > c_i + µ²_{i,i−1} c_{i−1} then
34:      Swap basis vectors b_i, b_{i−1} and b′_i, b′_{i−1}, i = max(i − 1, 2)
35:    else
36:      i = i + 1
37:    end if
38:  end if
39: end while
Theorem 4.2.4 Every Korkin-Zolotarev basis b_1, b_2, ..., b_k satisfies

$$\frac{4}{i+3} \le \frac{\|b_i\|_2^2}{\lambda_i^2} \le \frac{i+3}{4}, \quad i = 1, 2, \ldots, k.$$
Usually β is chosen to be at most 25. Values clearly greater than this maximum probably cause an exponential runtime.
Semi Block 2k-Reduction A further lattice basis reduction by Schnorr [Sch87]. Similar to the BKZ-algorithm, this reduction algorithm handles blocks, but this time of size 2k, where k should be around 24 to offer an acceptable runtime. Roughly speaking, it also targets a Korkin-Zolotarev basis to generate short vectors, but yields a different reduction quality.
Primal-Dual Reduction This reduction by Koy represents a variant of Schnorr's Semi Block 2k-reduction [Koy04]. Due to some improvements it guarantees, within the same time bound, better approximations of the shortest lattice vector.
Random Sample Reduction Another algorithm by Schnorr [Sch02]. This reduction method samples short lattice vectors and inserts them into the lattice basis by a global transform. The BKZ-algorithm is then used to reduce the new basis. Compared to the shortest approximation factor achievable in a given time by known algorithms, it decreases the approximation factor of the shortest vector that can be found to less than its fourth root.
Segment LLL-Reduction This method was proposed by Schnorr and Koy [KS01]. It can be seen as a very efficient variant of the original LLL-reduction. For this purpose, the basis is divided into segments of size k. The speedup of a segment LLL-reduction algorithm comes at the expense of the reduction quality, which is slightly weaker than for the original LLL-algorithm.
4.3. Lattice Attacks and Applications

In this section we briefly expose the practical value of lattice basis reduction in the shape of an attack on the NTRU cryptosystem [HPS98]. NTRU is a recent high performance public-key cryptosystem, especially suited for constrained devices, and a lattice based alternative to RSA or ECC.

The setup is as follows. Operations are based on objects in the truncated polynomial ring R = Z[x]/(x^N − 1), where N is an odd prime. The ring is
identified with the set of polynomials that have integer coefficients and degree at most N − 1. The sets D_F, D_g and D_r are subsets of R whose elements have d_f, d_g and d_r coefficients equal to 1, respectively, and the rest equal to 0. Two parameters p and q are chosen such that gcd(p, q) = 1. Typical parameters are N = 251, q = 239, p = 2, d_f = 72, d_g = 72, and d_r = 72 (NTRU251).
The encryption method NTRUEncrypt will be described hereafter. The private key is chosen to be of the form

$$f = 1 + p * F \qquad (4.12)$$

for random F ∈ D_F, where "*" denotes the convolution multiplication in R. Moreover, f has to be invertible modulo q. The inverse is denoted by f_q. The public key is

$$h \equiv p * f_q * g \pmod{q}, \qquad (4.13)$$

with a randomly chosen polynomial g ∈ D_g.

To encrypt a message m, represented by a binary polynomial in R, a randomly chosen r ∈ D_r gives the ciphertext

$$e = r * h + m \pmod{q}. \qquad (4.14)$$
Coppersmith and Shamir proposed a lattice attack [CS97] that can be seen as a kind of message recovery attack on NTRUEncrypt. The NTRU lattice L is a (2N + 1)-dimensional lattice generated by the public key h = (h_0, h_1, ..., h_{N−1}), the parameter q, and the ciphertext e = (e_0, e_1, ..., e_{N−1}) corresponding to the message m. Additionally, a dummy constant c is necessary.

$$L = \begin{pmatrix}
1 & 0 & \cdots & 0 & h_0 & h_1 & \cdots & h_{N-1} & 0 \\
0 & 1 & \cdots & 0 & h_{N-1} & h_0 & \cdots & h_{N-2} & 0 \\
\vdots & \vdots & \ddots & \vdots & \vdots & \vdots & \ddots & \vdots & \vdots \\
0 & 0 & \cdots & 1 & h_1 & h_2 & \cdots & h_0 & 0 \\
0 & 0 & \cdots & 0 & q & 0 & \cdots & 0 & 0 \\
\vdots & \vdots & \ddots & \vdots & \vdots & \vdots & \ddots & \vdots & \vdots \\
0 & 0 & \cdots & 0 & 0 & 0 & \cdots & q & 0 \\
0 & 0 & \cdots & 0 & e_0 & e_1 & \cdots & e_{N-1} & c
\end{pmatrix} \qquad (4.15)$$
The lattice L contains the short vector v = (−r, m, c). With high probability v is the shortest vector in L, and thus an attacker can use a lattice basis reduction to find v and recover m. This attack works well for smaller degrees, and numerous experiments have shown that the attack is exponential in the degree N.
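To illustrate how the basis (4.15) could be assembled in code, the following hedged C sketch fills a row-major (2N + 1) × (2N + 1) integer matrix. The function name and the choice to pass the dummy constant c as a parameter are our own illustrative assumptions, not prescribed by [CS97].

#include <stdlib.h>

/* Sketch: build the Coppersmith-Shamir lattice (4.15). h and e are the
   public key and ciphertext coefficient vectors of length N; c is the
   dummy constant. Returns a row-major (2N+1)x(2N+1) matrix or NULL. */
long *build_cs_lattice(int N, const long *h, const long *e, long q, long c)
{
    int dim = 2 * N + 1;
    long *L = calloc((size_t)dim * dim, sizeof *L);
    if (!L) return NULL;
    for (int i = 0; i < N; i++) {
        L[i * dim + i] = 1;                        /* identity block */
        for (int j = 0; j < N; j++)                /* rotated copies of h */
            L[i * dim + N + j] = h[(j - i + N) % N];
    }
    for (int i = 0; i < N; i++)
        L[(N + i) * dim + N + i] = q;              /* q-block */
    for (int j = 0; j < N; j++)
        L[2 * N * dim + N + j] = e[j];             /* ciphertext row */
    L[2 * N * dim + 2 * N] = c;                    /* dummy constant */
    return L;
}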
Further lattice-based attacks and applications can be found in [Hin04, JS94].
5. Parallel Computing
In general, larger problems can often be divided into smaller problems which are then solved concurrently, i.e., in parallel. But these problems consist of both a serial and a parallel part. Serial here means that the data to be processed depend on each other in a certain order and hence cannot be processed concurrently. The challenge to be faced is to examine a given problem and isolate those two parts from each other. Afterwards, the parallel part has to be adapted to fit the desired parallel system. There are many distinct types of parallel systems, each of which has its right to exist because of a unique property that qualifies the system for certain tasks or problems.

Although the advantages of parallelization have been known for a long time, the attempts to gain a computational speed-up were limited to improving the serial systems. Frequency scaling was the major source of improvements in computational performance until the beginning of this century, since the runtime of a program decreases linearly as the frequency increases. The accompanying rise in power consumption then contributed to the paradigm shift towards parallel computing.

In this chapter we want to summarize the fundamentals of parallel computing and introduce our chosen platform that offers massively parallel computing.
5.1. Fundamentals of Parallel Computing
Speed-Up Ratio First of all we need to understand to what extent one profits from a parallelization of a given algorithm. The question is not whether it is worthwhile to examine the algorithm at all, since the performance improves even if there is only a small parallelizable portion. The question is rather to state the maximum gain in performance. As already implied, this is done with the help of a dedicated ratio, the speed-up. The speed-up takes the serial and parallel portions and the number of processors into account. The first idea goes back to Amdahl, who formulated the so-called Amdahl's law. His law is given by

$$S = \frac{1}{(1-P) + \frac{P}{N}} \qquad (5.1)$$

where S is the speed-up ratio, P the parallel fraction, and N the number of processors. Theoretically, the speed-up is infinite, but in reality an algorithm always
consists of sequential parts, e.g., initialization phases. Beyond a certain threshold depending on P, the speed-up will stagnate near its theoretical maximum, which gives a useful upper bound on the number of processors. Figure 5.1 shows Amdahl's law for different P. It unambiguously depicts the limits of parallelization: even an algorithm with a parallel portion of nearly 95%, which is an almost excellent value, gains a maximum speed-up of only 20. Of course, the speed-up depends on many further properties, e.g., the intended parallel system.
Figure 5.1.: Amdahl's law for different P (speed-up plotted against 1 to 4096 processors for parallel fractions of 50%, 75%, 90%, and 95%, together with their respective limits)
There is a second law that addresses the shortcomings of Amdahl's law. The so-called Gustafson's law implies that the sequential part of an algorithm also changes with respect to the number of processors. Gustafson proposed that the speed-up scales up with a larger problem size, in contrast to Amdahl, who assumes a fixed problem size and hence a fixed sequential part regardless of the number of processors. Gustafson's law is defined by

$$S(N) = P(N-1) + 1 \qquad (5.2)$$

where S(N) is the speed-up ratio dependent on N, the number of processors, and P the parallel fraction.

As Figure 5.2 clarifies, there is no upper bound anymore and the speed-up scales nearly with the number of processors. Consequently, the question arises: who is right? Actually, the real speed-up for a given algorithm lies in between
Figure 5.2.: Gustafson's law for different P (speed-up plotted against 1 to 4096 processors for parallel fractions of 50%, 75%, 90%, and 95%)
these two laws. If the problem is scalable, meaning that the sequential portion depends on the problem size, then the speed-up is closer to Gustafson's law, otherwise to Amdahl's law.
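To make the two laws tangible, the small C helper below (an illustrative sketch, not part of any benchmark in this thesis) evaluates both formulas side by side for P = 95%:

#include <stdio.h>

/* Amdahl (5.1): S = 1 / ((1 - P) + P/N) -- fixed problem size. */
static double amdahl(double P, double N) { return 1.0 / ((1.0 - P) + P / N); }

/* Gustafson (5.2): S(N) = P(N - 1) + 1 -- problem scales with N. */
static double gustafson(double P, double N) { return P * (N - 1.0) + 1.0; }

int main(void)
{
    double P = 0.95;                     /* 95% parallel fraction */
    for (int N = 1; N <= 4096; N *= 2)
        printf("N=%5d  Amdahl S=%7.2f  Gustafson S=%8.2f\n",
               N, amdahl(P, N), gustafson(P, N));
    return 0;  /* Amdahl saturates near 1/(1-P) = 20, Gustafson keeps growing */
}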
Classification of Computational Architectures Further on we want to present Flynn's taxonomy, a classification defined by Flynn which is based upon the number of concurrent instruction and data streams in the available computation architecture. Basically, the two indicators, the instruction stream and the data stream, give four classes.

• Single Instruction, Single Data (SISD). This architecture contains exactly one processor that executes an instruction stream to operate on a data stream from a single memory. It is referred to as the classical sequential von Neumann architecture.

• Multiple Instruction, Single Data (MISD). This type of architecture contains many processors that perform different operations on the same data stream. It is an uncommon architecture which is merely used for fault tolerance systems.

• Single Instruction, Multiple Data (SIMD). The same instructions are executed on different data streams. Hence, the data streams are processed concurrently (in parallel). As an example, a vector computer shall be mentioned.
• Multiple Instruction, Multiple Data (MIMD). Multiple independent processors concurrently execute different instructions on different data streams. MIMD structures can be either shared memory (the processors share one memory space) or distributed memory (every processor has its own private memory space). Most of the TOP500 supercomputers¹ are based on this type.
5.2. GPU Computing

GPU computing, in general referred to as general-purpose computing on graphics processing units (GPGPU), is the shift of computations that are traditionally handled by the central processing unit (CPU), the host processor, to the graphics processing unit (GPU). A GPU is typically made for handling computations of computer graphics, but with growing demands for higher programmability it became feasible to process non-graphics data, too. The GPU is a typical representative of the SIMD architecture and suitable for vector computing.

Normally, graphics data is passed through the graphics pipeline to transform a 3D model into frames that can be shown on a display. The pipeline consists of the following stages [Cut09]:
• Modeling Transformation. The input 3D models, consisting of vertices, are described in their own coordinate system (object space) and have to be transformed into a common coordinate frame (world space).

• Illuminating (Shading). The vertices are lit (shaded) according to material and surface properties and light sources by applying a local lighting model (diffuse, ambient, Phong, etc.).

• Viewing Transformation. The world space is mapped to the eye space. The eye space depends on the view point from which the scene is seen, mainly the viewing angle, distance and so on.

• Clipping. The eye space in turn is transformed to normalized device coordinates, since the eye space merges into one point in the distance, the vanishing point. Additionally, objects that are not within the eye space but described in the model are removed.

• Projection. The three-dimensional objects within the space are projected onto a 2D image plane (screen space).

• Scan conversion (Rasterization). Objects are rasterized into pixels by masking the 2D image with different triangles. Afterwards, properties like color, depth, etc. are interpolated.
• Visibility. Each pixel remembers its closest object (depth buffer).

¹ Detailed information on http://www.top500.org/
Crucial for GPGPU is the unit that is responsible for the first three stages, the vertex shader, today commonly denoted as stream processor. However, GPGPU used to be a costly process since the programming interface was built especially to implement the graphics pipeline. All data had to be inconveniently adapted to fit this interface. For example, the data had to be encoded in textures, arrays of pixels, which are unfortunately read-only objects. Hence, the finished processed data was written to the frame buffer, which contains the frames to be shown on the display, and looped back into the texture space.

As of 2007, addressing these shortcomings and the growing popularity of the GPGPU concept, the main GPU manufacturers ATI/AMD and nVidia released frameworks to make GPU computing more attractive. ATI/AMD's concept, first called Close To Metal (CTM), aimed to enable GPGPU by giving direct access to the native instruction set and memory of ATI/AMD's graphics cards. The framework is now called Stream SDK [ATI] and is still in beta status. It features Brook+, a C language derivative optimized for stream computing. nVidia's approach is named Compute Unified Device Architecture (CUDA), which is introduced hereafter.
5.2.1. Compute Unified Device Architecture (CUDA)

In this thesis we decided to work with nVidia's CUDA since it enjoys wide acceptance in the GPGPU community, is regularly updated, and is the common framework in scientific research. CUDA features C for CUDA, likewise a C language derivative, that is, C with special extensions. Moreover, it is possible through third party wrappers to use Python, Fortran, Java, or even MATLAB program code. All information on CUDA is mostly taken from [nVi09b, nVi09a].
5.2.2. Hardware Overview

CUDA works with all nVidia GPUs from the G8x series onwards (8800 GTX/GTS/GT, etc.). The latest supported architecture is the GT2xx series, which includes the GTX 280/285, the GTX 260 and the 2-GPU solution GTX 295, but also other products from the Quadro or Tesla lines, of which the latter is a dedicated GPU computing solution. The most important difference between these generations and models is the number of stream processors and the compute capability. Obviously, the number of stream processors is crucial for the computing performance, but the compute capability determines the qualification for a certain problem. The compute capability is subject to the technological improvement process, which recently yielded the ability of double precision floating point arithmetic (GT2xx series only).
Due to this essential ability of double precision floating point arithmetic, we will employ a graphics card from the GT200 series, and in order to have unrestricted hardware resources we take the high-end model GTX 280. Table 5.1 summarizes the given resources, which differ for other models.
Compute capability                               1.3
Global memory                                    1 GiB
Number of multiprocessors                        30
Number of stream processors                      240
Constant memory                                  64 KiB
Shared memory per block                          16 KiB
Registers available per block                    16384
Warp size                                        32
Maximum number of threads per block              512
Maximum size of each dimension of a block        512 × 512 × 64
Maximum size of each dimension of a grid         65535 × 65535 × 1

Table 5.1.: GTX 280 Resources
The device's main unit, the multiprocessor, is a set of eight stream processors which share memory, caches, and an instruction unit. The multiprocessor creates, manages, and executes threads in hardware with zero scheduling overhead and lightweight thread creation. A thread in CUDA is, as in other parallel architectures, the smallest unit of parallelism that is executed together with other threads in hardware. Despite the fact that a GPU is a representative of the SIMD architecture, nVidia calls the architecture single-instruction, multiple-thread (SIMT), which is akin to SIMD. In contrast with SIMD machines, SIMT enables programmers to write thread-level parallel code for independent, scalar threads, as well as data-parallel code for coordinated threads.
5.2.3. CUDA Programming Model

The CUDA programming model involves three key abstractions: first, a hierarchy of threads and thread groups; secondly, a hierarchy of different memory types; and finally, a barrier synchronization. We will now introduce the model by covering each abstraction.
Thread Hierarchy

Thread As mentioned above, the thread is the smallest unit of parallelism that is executed physically alongside other threads. Threads concurrently run different pieces of code, or the same piece of code, gaining parallel instruction execution on a data stream. Compared to threads on a multicore CPU, threads on a GPU need far fewer resources concerning creation and switching effort. On the other hand, GPU computing is efficient only when involving a large number of threads, which primarily ensures a constant workload.
Thread Block Threads are organized in a thread block, a group of threads in which the threads can communicate with each other and synchronize their state. A thread can be identified by its thread index, a scalar that is unique within the block but not beyond. Additionally, it is possible to define a two or three dimensional block, with the thread index defined accordingly.

Thread Grid A group of thread blocks is called a thread grid. A thread grid forms the execution unit in the CUDA model since it is not possible to execute a thread or thread block solely. Again, a block is identified within the grid by its block index, which is unique within the grid. The grid can be at most two dimensional because the third dimension is currently fixed to 1. Multiple grids are executed sequentially.

The complete hierarchy is exemplarily depicted in Figure 5.3 for a two dimensional thread block and grid.
Figure 5.3.: Thread hierarchy of the CUDA programming model (a grid of two-dimensionally indexed blocks, each containing two-dimensionally indexed threads)
Memory Model and Hierarchy

Threads in the CUDA programming model can access data from various memory spaces that differ in size and access time. However, these memory spaces are abstractions rather than direct hardware implementations. The memory model is illustrated in Figure 5.4.
Figure 5.4.: CUDA memory model (per-thread registers and local memory, per-block shared memory, and the global, constant, and texture memory spaces connected to the host memory)
The memory model shows the accessible memory spaces from the viewpoint of the smallest execution unit, the thread. According to that, at the lowest level a thread has read and write access to its own registers and additionally to its own copy of local memory. Threads within the same block have read and write access to a shared memory on the next higher level. Beyond the block, all threads have access to the largest read- and writable memory space, the global memory. Beside the global memory, there are two further spaces that are read-only: the constant memory and the texture memory. The latter three memory spaces can also be read from and written to by the host via its main memory. Table 5.2 summarizes the most important features of the device memory.
Memory     Location (on/off chip)   Cached   Access   Scope                  Lifetime
Register   On                       n/a      R/W      1 thread               Thread
Local      Off (RAM)                No       R/W      1 thread               Thread
Shared     On                       n/a      R/W      All threads in block   Block
Global     Off (RAM)                No       R/W      All threads and host   Host allocation
Constant   Off (RAM)                Yes      R        All threads and host   Host allocation
Texture    Off (RAM)                Yes      R        All threads and host   Host allocation

Table 5.2.: Features of device memory
Synchronization

Memory spaces that are shared by threads contain potential hazards such as read-after-write, write-after-read, or write-after-write conflicts. Thus the programming model implements a barrier in the shape of a synchronization instruction. When a thread has reached this instruction, it halts its execution until every thread has reached the instruction. This is provided for threads within a block only, so one has to take care of threads that are not within the block but access the same memory address.
5.2.4. CUDA Programming Interface

There are currently two supported interfaces to write CUDA programs. A program must use either the already mentioned C for CUDA or the CUDA driver API. The driver API is a low-level C API that provides functions to load binary or assembly code; it is not considered here.
Kernels
The piece of code that is executed by one thread is called a kernel. A kernel is
written in a way such that every thread in a block or grid performs the same
operation stream on a dedicated data stream.
Host and Device

The CUDA programming model assumes two parties involved in GPU computing: the CPU and the GPU itself. The CPU acts as a host which manages the computations, and the GPU is a slave that is given the instruction and data stream and then performs the claimed tasks. The CPU first transfers the data from its host memory (main memory) to the GPU's global memory. The execution of a heterogeneous CUDA program is shown in Figure 5.5.
Figure 5.5.: Heterogeneous CUDA program (serial code executes on the host, while parallel kernels execute on the device as grids of thread blocks)
Execution on Device

Due to the architecture of a multiprocessor, which contains shared memory, registers and an instruction unit, a block will be executed on exactly one multiprocessor. Prior to execution, a block is split into units processed at once, called warps. Warps are equally sized block slices, with the first thread of the block being the first thread of the first warp. The warp size depends on the graphics card model; its current value is 32. The scheduler unit of a multiprocessor constantly switches among these warps, but the order is not specified; moreover, warps execute asynchronously. Knowing that a multiprocessor possesses eight stream processors, one warp will be handled in four cycles.
Language Extensions, Compilation, and Limitations

C for CUDA provides a minimal set of extensions to the C language to enable GPU computing. We want to introduce the most common extensions that are crucial for CUDA applications.
Function Type Qualifiers Function type qualifiers specify whether a function is callable or executed on the host or the device, respectively.

• __device__ The function is executed on the device and callable from the device only.
• __global__ Declares the function as being a kernel, executed on the device and callable from the host only.
• __host__ States the opposite of the device declaration: executed on the host and callable from the host only. This is a normal C function and thus the qualifier can be omitted.
Variable Type Qualifiers Variable type qualifiers specify the memory location on the device.

• __device__ The variable resides in the global memory space and is accessible from all threads in the grid.
• __constant__ The variable resides in the constant memory space and is accessible from all threads in the grid, but read-only.
• __shared__ The variable is a shared variable within the block. Initialization of such a variable is not allowed.
• Variables declared without a type qualifier reside in the registers of the thread. A register variable is accessible from its dedicated thread only.
Built-in Variables Built-in variables specify the grid and block dimensions, as well as the block and thread indices. They are only defined for functions that are executed on the device.

• gridDim This variable represents a three dimensional array containing the grid dimensions.
• blockIdx This variable is a three dimensional integer array and contains the block index within the grid.
• blockDim Like the block index, it is a three dimensional integer array; it contains the number of threads within the block.
• threadIdx A three dimensional integer array containing the thread index.
• warpSize This variable contains the warp size in threads.

The dimensions are accessed via .{x,y,z}, where x represents the first dimension. The z component is fixed to 1 for the grid dimension.
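As a minimal sketch of how these built-in variables are typically combined (the kernel name and its purpose are illustrative assumptions), a unique global index for a one-dimensional grid is computed as follows:

__global__ void scale(float *x, float a, int n)
{
    /* One unique global index per thread: blocks of blockDim.x threads
       laid out along the x dimension of the grid. */
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)              /* guard: the last block may be partially filled */
        x[i] = a * x[i];
}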
Barrier Instruction The barrier instruction, which is used to synchronize threads within their block, is realized by the function __syncthreads().
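A small hedged sketch of the barrier in action: the kernel below (illustrative, assuming a single block of at most 512 threads) stages data in shared memory before reading it back in reverse order. Without __syncthreads(), a thread could read an element its neighbour has not yet written.

__global__ void reverse_block(float *d)
{
    __shared__ float s[512];          /* one element per thread of the block */
    int t = threadIdx.x;
    int n = blockDim.x;
    s[t] = d[t];                      /* stage the input into shared memory */
    __syncthreads();                  /* barrier: all writes must be visible */
    d[t] = s[n - 1 - t];              /* now reading neighbours is safe */
}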
Mathematical Functions C for CUDA provides a mathematical standard library analogous to the C language, with functions such as sqrt(x) or log(x). By default, those functions handle single precision arithmetic only, and even if the device offers double precision support, the functions are mapped to their single-precision equivalents. One has to add the compiler option -arch sm_13 in order to use double-precision arithmetic. To distinguish both types, single precision functions end with f, e.g. sqrtf(x). Additionally, there are some functions that have faster counterparts with reduced accuracy, prefixed with __, such as __logf(x).
Memory Transfer and Allocation Instructions Linear memory is allocated using cudaMalloc(), which allocates memory in the global address space of the device. Data is transferred with cudaMemcpy(), which is bidirectional: the instruction is used to copy data to be processed from the host into the device's global memory space and vice versa. The allocated memory is freed after processing with cudaFree().
Execution Configuration The execution configuration specifies any call to a kernel prefixed with the __global__ function type qualifier. It is an expression of the form <<< Dg, Db >>> where Dg is the dimension of the grid and Db the dimension of the block. For example, <<< 30, 512 >>> means that a kernel is executed by 30 blocks, each containing 512 threads.
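Combining the allocation, transfer, and launch instructions, a hedged host-side sketch (error checking omitted; it reuses the illustrative scale kernel from above) looks as follows:

#include <stdlib.h>

int main(void)
{
    const int n = 30 * 512;
    const size_t bytes = n * sizeof(float);
    float *h_x = (float *)malloc(bytes);          /* host buffer */
    float *d_x;                                   /* device buffer */
    /* ... fill h_x with input data ... */
    cudaMalloc((void **)&d_x, bytes);             /* global memory on device */
    cudaMemcpy(d_x, h_x, bytes, cudaMemcpyHostToDevice);
    scale<<<30, 512>>>(d_x, 2.0f, n);             /* 30 blocks x 512 threads */
    cudaMemcpy(h_x, d_x, bytes, cudaMemcpyDeviceToHost);
    cudaFree(d_x);                                /* free device memory */
    free(h_x);
    return 0;
}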
5.2.5. CUDA Performance Guidelines

To achieve optimal performance from a CUDA application, many details have to be considered. Some of them have a crucial impact on performance, especially when they concern the global memory space. Hereafter we want to raise awareness of how to deal with CUDA in order to avoid performance collapses.
Memory Transfers

Dealing correctly with the various memory types provided by the CUDA architecture is essential. To begin GPU computing at all, the data must be transferred from the host memory into the device memory. Host-to-device transfers are limited by the bandwidth of the PCIe 16x interface, which offers a theoretical peak bandwidth of 4 GB per second. In reality we achieve about 2.87 GB per second with our device. In contrast, device-to-device transfers provide a theoretical peak bandwidth of 160 GB per second and, in reality, an effective bandwidth of 129.23 GB per second². Hence the strong suggestion is to avoid host-to-device transfers except the first one, which is mandatory. Afterwards, one should try to process the data on the GPU only and not shift it back and forth.
Memory Access

Memory accesses on the device should be handled very carefully, since one could give away a large part of the performance. First of all, Table 5.3 gives a glimpse of the memory latency.
Type                        Latency                   Condition
Instructions                hidden due to caching
Registers                   single cycle
Shared memory               single cycle              no bank conflicts
                            n cycles                  n-way bank conflict
Global memory               400-600 cycles
Constant / texture memory   1-100 cycles              cache hit
                            400-600 cycles            cache miss

Table 5.3.: Latency for memory actions on the device
As expected, the global memory has the highest latency concerning read and write accesses. This is not surprising because the global memory space is the largest of all spaces and accessible to all threads. During a global memory fetch, all further instructions in the thread are deferred. Thus one should obey the following concepts.
Shared Memory The shared memory is on-chip and much faster than global memory if there are no conflicts. It is organized in equally sized banks that can be accessed simultaneously. Therefore, any load or store that addresses n distinct memory banks can be done concurrently, yielding a bandwidth that is n times higher than the bandwidth of a single bank. Nevertheless, multiple accesses to the same bank cause a bank conflict and hence a serialization. Shared memory should always be used when processing the fetched data more than once, i.e., not only performing a single instruction on it and then storing it back. The size is hardware dependent and severely limited per block.
² Both bandwidths measured with the Bandwidth Test CUDA code sample from the GPU Computing SDK.
Memory Access Coalescing Memory coalescing has the highest priority in order to avoid performance issues. A coalesced memory access is a coordinated access that results in as few memory transactions (read or write) as possible. It is achieved by global memory loads and stores by the threads of a half-warp (half the warp size, e.g., 16 threads) such that the threads access the words in sequence, meaning the ith thread in the half-warp accesses the ith word. Secondly, the addressed words have to lie in the same segment. Figure 5.6 depicts some examples of memory accesses.
Figure 5.6.: Examples of global memory access patterns (warp size 16; the non-sequential 4-byte accesses result in 8 memory transactions each, while the sequential 4-byte accesses result in 1 or 2 memory transactions)
The size of the segment, the range that can be accessed at once, is determined by the size of the word:

• 32 bytes if all threads access 1-byte words,
• 64 bytes if all threads access 2-byte words,
• 128 bytes if all threads access 4-byte or 8-byte words.

If a half-warp addresses words in n different segments, then n memory transactions will be issued. Otherwise, if all words lie within one segment, only one memory transaction will be issued.
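The difference can be made explicit in code. In the hedged sketch below (illustrative kernels, not from our implementation), the first kernel lets the ith thread read the ith word, while the second reads with a stride and thus scatters a half-warp across several segments:

__global__ void copy_coalesced(const float *in, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = in[i];              /* i-th thread touches the i-th word */
}

__global__ void copy_strided(const float *in, float *out, int n, int stride)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i * stride < n)              /* neighbours now hit different segments */
        out[i * stride] = in[i * stride];
}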
Latency Hiding Memory latency can be hidden by intensive computations that overlap memory transactions. The threads should therefore feature a high arithmetic workload and, additionally, the multiprocessor should run a high number of threads concurrently. However, this technique depends on the problem to be solved by the CUDA application.
Warp Divergence

Flow control is inevitable for sequential code, but it can significantly affect the overall performance when using warps. Due to data-dependent control flow instructions, the threads of a warp may follow different execution paths; the warp has then diverged from the uniform execution path. The different execution paths must be serialized, and the total number of instructions for this warp increases. A possible countermeasure is to make branching dependent on the thread identification such that different warps diverge, but not the threads within a warp.
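A hedged sketch of this countermeasure: branching on (threadIdx.x / warpSize) assigns whole warps to one path, whereas branching on (threadIdx.x % 2) would split every warp into two serialized paths.

__global__ void branch_on_warp(float *x)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if ((threadIdx.x / warpSize) % 2 == 0)   /* uniform within each warp */
        x[i] += 1.0f;
    else
        x[i] -= 1.0f;
    /* if ((threadIdx.x % 2) == 0) ... would diverge inside every warp */
}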
Occupancy

To achieve the highest possible bandwidth, one definitely wants to execute as many warps in parallel as possible. A single multiprocessor processes the warps of a block, and therefore the block dimension should be a multiple of the warp size in order not to waste computational resources. The occupancy is a metric to determine how efficiently the device is kept busy.

Definition 5.2.1 (Occupancy) The occupancy is the ratio of the warps concurrently running on a single multiprocessor to the number of warps that could possibly run on a multiprocessor.
The number of warps that can be executed at once by a single multiprocessor is based upon the registers available. The occupancy is calculated in the following manner. For example, a device with compute capability 1.2 or higher has 16384 32-bit registers per multiprocessor and can have a maximum of 1024 simultaneous threads. This means each thread could apply a maximum of 16384/1024 = 16 registers to achieve an occupancy of 100%. If the threads apply more than 16 registers, the occupancy decreases as the number of resident warps decreases.
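The same arithmetic can be written down as a small helper function (a hedged sketch using only the register limit; the shared memory and block-size limits, which also cap occupancy, are ignored here):

/* Register-limited occupancy estimate for compute capability 1.2/1.3.
   Assumes regs_per_thread >= 1. */
double occupancy(int regs_per_thread)
{
    const int regs_per_mp = 16384;   /* 32-bit registers per multiprocessor */
    const int max_threads = 1024;    /* resident threads per multiprocessor */
    int threads = regs_per_mp / regs_per_thread;
    if (threads > max_threads) threads = max_threads;
    return (double)threads / max_threads;  /* e.g. 16 regs -> 1.0, 32 -> 0.5 */
}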
5.3. Outlook: Open Computing Language (OpenCL)

OpenCL [Gro09], developed by the Khronos Group, a collaboration including Apple, nVidia, and AMD, is akin to CUDA with some differences. It represents a framework that enables writing programs to be executed across heterogeneous platforms, also involving other kinds of processors instead of CPUs or GPUs solely. OpenCL applies a language based on C, simply called OpenCL C. Only a short time ago, nVidia released their first public software development kit, OpenCL for nVidia, supporting the OpenCL 1.0 specification. So we decided not to involve OpenCL in this work.
Similar to CUDA, there is a host and one or more OpenCL devices which execute kernels. By contrast, kernels are translated at runtime, and hence the programmer does not have to consider for which device the kernel is intended. Considering OpenCL for nVidia, the programming procedure is akin to CUDA: kernels are executed by threads which belong to thread blocks and grids. The host beforehand allocates global memory and transfers the data to it; afterwards the kernel is created and executed.
Concerning lattice basis reduction, OpenCL becomes interesting in future versions because of the reserved data type quad, a quadruple precision floating point data type (128 bit).
6. Lattice Basis Reduction II: Parallel Approach

In the following we introduce common parallel, or rather parallelized, lattice basis reduction algorithms. For sure, parallel methods are not a new research subject, but they become more and more attractive as modern parallel systems like multicore CPUs or many-core graphics cards become available to everyone. In addition, those modern systems are easy to deal with and attain an impressive computing performance.

There are two kinds of parallel algorithms: those which are natively designed for parallel execution and those which are derived from sequential algorithms by a partial parallelization process. Obviously, it is impossible to find an equivalent parallel structure for every part of each algorithm. In contrast, native parallel methods were until recently impractical due to their massive demand for processors, so to speak a highly parallel platform, but just not in the sense of a multicore processor cluster.

Recently, a couple of efforts were made to attune the well-known sequential approaches to parallel platforms; we nevertheless concentrate on the native parallel methods.
6.1. Parallel Orthogonalization

In accordance with Chapter 3, there are several methods to compute the orthogonalization of a given basis, and so there are several ways to obtain a parallelized orthogonalization. However, we only deal with the Givens rotations and the Householder reflections; the Gram-Schmidt process is omitted here because of its numerical weaknesses.

Again, we are merely interested in the computation of the matrix R to obtain the Gram-Schmidt coefficients, and not in the computation of the matrix Q.
Parallel Givens Rotations Referring to Chapter 3.2, a Givens rotation matrix is multiplied from the left to a basis. As intended, only two rows are affected at once. This is iteratively repeated until the lower triangular part of the basis is zeroised. Nevertheless, the order of applying the Givens rotations is important to obtain the correct result. The Givens rotations algorithm (Alg. 3.4) itself cannot be brought into a parallel structure, nor is this necessary. By knowledge of the required serial order, and exploiting that only two rows are affected at once, we achieve the following virtually parallel order (Fig. 6.1).

Figure 6.1.: Parallel pattern for the Givens rotations (each entry of the lower triangular part is labeled with the stage in which it is zeroised; all entries on the same antidiagonal, i.e., with equal i + j, share a stage and can be processed in parallel)
Entries labeled with "∗" were altered by the last rotation but not zeroised. All other entries are zeroised in the indicated stage. In the first two stages, the parallel pattern is equal to the serialized pattern because the serial order dictates applying the Givens rotations G(1, 2) and G(1, 3) first. In stage three it is possible to apply G(2, 3) besides G(1, 4). Thus one exploits the difference between the last concerned i's and j's: the distance has to be greater than two so that two rows fit in. This is also why the second column starts in stage three, the third in stage five, and so on.
For instance, given a 10-by-7 matrix, the pattern in written form, stressed by the i's and j's, is shown in Figure 6.2. The example clearly illustrates how the number of rows processed in parallel first increases and then linearly decreases. The maximum is reached in stage d − 1 with min(k, d/2) Givens rotations to apply.

These considerations on the parallel pattern lead to Algorithm 6.1, a parallelized version of the Givens rotations.
Parallel Householder Reflections The Householder matrix is multiplied from the left to a basis to obtain the updated basis. This is iteratively repeated until the lower right submatrix is reached (see Chapter 3.3).
1st:  (1,2)
2nd:  (1,3)
3rd:  (1,4)  (2,3)
4th:  (1,5)  (2,4)
5th:  (1,6)  (2,5)  (3,4)
6th:  (1,7)  (2,6)  (3,5)
7th:  (1,8)  (2,7)  (3,6)  (4,5)
8th:  (1,9)  (2,8)  (3,7)  (4,6)
9th:  (1,10) (2,9)  (3,8)  (4,7)  (5,6)
10th: (2,10) (3,9)  (4,8)  (5,7)
11th: (3,10) (4,9)  (5,8)  (6,7)
12th: (4,10) (5,9)  (6,8)
13th: (5,10) (6,9)  (7,8)
14th: (6,10) (7,9)
15th: (7,10)

Figure 6.2.: 10-by-7 matrix with parallel pattern for the Givens rotations
The update step is performed by virtue of

$$HB = \left(I - \frac{2}{v^Tv}vv^T\right)B = B - \frac{2}{v^Tv}v(v^TB). \qquad (6.1)$$

The term 2/(v^Tv) =: β is a scalar multiplied to the Householder vector v. From the equation above we see that we need three operations to perform the update: the vector-matrix multiplication w^T = v^TB, the vector-vector multiplication B′ = βvw^T, and the matrix subtraction B − B′.
The vector-matrix multiplication is defined as

$$w^T = v^TB = \begin{pmatrix} v_1 & v_2 & \cdots & v_d \end{pmatrix} \begin{pmatrix} b_{1,1} & b_{1,2} & \cdots & b_{1,k} \\ b_{2,1} & b_{2,2} & \cdots & b_{2,k} \\ \vdots & \vdots & \ddots & \vdots \\ b_{d,1} & b_{d,2} & \cdots & b_{d,k} \end{pmatrix} \qquad (6.2)$$

$$= \begin{pmatrix} v_1b_{1,1} + \ldots + v_db_{d,1} & \cdots & v_1b_{1,k} + \ldots + v_db_{d,k} \end{pmatrix}, \qquad (6.3)$$
Algorithm 6.1 Parallel Standard Givens Rotations
Input: Lattice basis B = (b_1, b_2, ..., b_k) ∈ R^{d×k}
Output: R = (r_{i,j})_{1≤i,j≤k}
1: R = B
2: if d = k then
3:   l = k − 1
4: else
5:   l = k
6: end if
7: for e = 3 to d + l do
8:   for i = max(1, e − d) to ⌊(e−1)/2⌋ − max(0, ⌊(e−1)/2⌋ − k) parallel do
9:     j = e − i
10:    Use Algorithm 3.3 to compute c, s with input r_{i,j}, r_{i,i}
11:    R = G(c, s)R
12:  end for parallel
13: end for
the vector-vector multiplication, which yields a d-by-k matrix,

$$\beta vw^T = \begin{pmatrix} \beta v_1w_1 & \beta v_1w_2 & \cdots & \beta v_1w_k \\ \beta v_2w_1 & \beta v_2w_2 & \cdots & \beta v_2w_k \\ \vdots & \vdots & \ddots & \vdots \\ \beta v_dw_1 & \beta v_dw_2 & \cdots & \beta v_dw_k \end{pmatrix} \qquad (6.4)$$
and finally the matrix subtraction

$$B - \beta vw^T = \begin{pmatrix} b_{1,1} - \beta v_1w_1 & b_{1,2} - \beta v_1w_2 & \cdots & b_{1,k} - \beta v_1w_k \\ b_{2,1} - \beta v_2w_1 & b_{2,2} - \beta v_2w_2 & \cdots & b_{2,k} - \beta v_2w_k \\ \vdots & \vdots & \ddots & \vdots \\ b_{d,1} - \beta v_dw_1 & b_{d,2} - \beta v_dw_2 & \cdots & b_{d,k} - \beta v_dw_k \end{pmatrix} = HB. \qquad (6.5)$$
With the exception of the Householder vector v, which is the scaled basis vector b_1 (actually the foremost basis vector of the current step), every column can be handled separately. In this form it is possible to parallelize the Householder reflections, but with some disadvantages. We need to calculate two scalar products, which is often not efficient in contrast to other operations, e.g., a matrix-matrix multiplication. Furthermore, in the iterative process of computing the orthogonalization, the number of parallel operations linearly decreases each time until the lower right submatrix is reached. For example, for a given lattice basis consisting of k basis vectors, every vector can be processed in parallel for the first update step. For the next step there are k − 1 vectors left. For the last update step there are only two vectors left to be handled in parallel.

Figure 6.3.: Parallel pattern for the Householder reflections (in the ith update step, the ith Householder matrix is applied to the columns i through k in parallel)
The above pattern (Fig. 6.3) depicts the progress of the linearly decreasing parallel vector update. More precisely, it shows in which Householder matrices the associated vector is involved. With these observations we are able to formulate Algorithm 6.2, a parallelized version of the Householder reflections.
Algorithm 6.2 Parallel Householder Reflections
Input: Lattice basis B = (b_1, b_2, ..., b_k) ∈ R^{d×k}
Output: R = (r_{i,j})_{1≤i,j≤k}
1: R = B
2: for i = 1 to k − 1 do
3:   for j = i to k parallel do
4:     Use Algorithm 3.7 to compute the Householder vector v
5:     β = 2/(v^Tv)
6:     w = ⟨b_j, v⟩
7:     b_j = b_j − βwv
8:   end for parallel
9: end for
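Mapped onto CUDA, the parallel loop over the columns (lines 3−8) could be sketched as follows. The kernel is an illustrative fragment under our own assumptions, namely column-major storage with leading dimension d and the Householder vector v of step i together with the scalar β precomputed on the host; it is not the implementation evaluated later in this thesis.

/* One thread per column j = i..k-1: w = <b_j, v>, then b_j = b_j - beta*w*v.
   v has zeros above position i, so only rows i..d-1 are touched. */
__global__ void householder_update(double *B, const double *v,
                                   double beta, int d, int i, int k)
{
    int j = i + blockIdx.x * blockDim.x + threadIdx.x;
    if (j < k) {
        double *bj = B + (size_t)j * d;          /* column j of B */
        double w = 0.0;
        for (int l = i; l < d; l++) w += bj[l] * v[l];   /* w = <b_j, v> */
        for (int l = i; l < d; l++) bj[l] -= beta * w * v[l];
    }
}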
Parallel Approach The parallelism is mainly achieved by the abovementioned parallelized versions of the Givens rotations and the Householder reflections. This makes clear that the parallel basis orthogonalization algorithm differs only slightly from Algorithm 4.1. Solely the transformation of the matrix R into R̃ can be performed in parallel, too, as Algorithm 6.3 stresses.
6.2. Parallel Size Reduction

Just like for the basis orthogonalization, there does not exist a native parallel method to size reduce a basis.
Algorithm 6.3 Parallel Basis Orthogonalization
Input: Lattice basis B = (b_1, b_2, ..., b_k) ∈ R^{d×k}
Output: The Gram-Schmidt coefficients µ_{i,j} and the squared norms ‖b*_i‖₂² in shape of R̃
1: Use Algorithm 6.1 or 6.2 to compute R
2: for i = 1 to d parallel do
3:   for j = i + 1 to k do
4:     r_{i,j} = r_{i,j}/r_{i,i}
5:   end for
6:   r_{i,i} = r²_{i,i}
7: end for parallel
This is due to the circumstance that a basis vector b_i has to be size reduced by a previous vector b_j prior to reducing b_j itself. Again we face the problem of bringing a serial order into a corresponding parallel order. Obviously, this order is very closely related to the serial order of the basis orthogonalization using the Givens rotations, but with the crucial difference that a vector b_{i−1} becomes available shortly after it was utilized to size reduce the adjoining vector b_i. Commonly, in former works [Wie94, Dag09], one therefore mostly finds the suggestion to use the pattern which was introduced for the Givens rotations, but beginning with the basis vector b_k (Alg. 6.4). This is also for the reason that only disjoint basis vector sets shall be size reduced, in order to decrease the communication effort on older parallel systems. The number of parallel size reductions grows by two every second stage, since two further basis vectors join in the stage after next.
Algorithm 6.4 Parallel Basis Size Reduction 1
Input: Lattice basis B = (b_1, b_2, ..., b_k) ∈ R^{d×k} and the Gram-Schmidt coefficients (µ_{i,j})_{1≤i,j≤k}
Output: Size reduced basis B
1: for e = 2k − 1 downto 3 do
2:   for j = max(1, e − k) to ⌊(e−1)/2⌋ parallel do
3:     i = e − j
4:     b_i = b_i − ⌈µ_{i,j}⌋ b_j
5:     µ_{i,j} = µ_{i,j} − ⌈µ_{i,j}⌋
6:     for l = 1 to j − 1 do
7:       µ_{i,l} = µ_{i,l} − ⌈µ_{i,j}⌋ µ_{j,l}
8:     end for
9:   end for parallel
10: end for
To exploit the fact that a vector b_{i−1} becomes available shortly after it was utilized to size reduce the adjoining vector b_i, we want to propose the most obvious pattern. Clearly, the most obvious pattern size reduces vector b_k by b_{k−1} in the first stage, the vectors b_k, b_{k−1} by b_{k−2} in the second stage, and so on until the vectors b_k, b_{k−1}, ..., b_2 are size reduced by b_1. Figure 6.4 elucidates this approach.
1st:      (k, k−1)
2nd:      (k, k−2)  (k−1, k−2)
3rd:      (k, k−3)  (k−1, k−3)  (k−2, k−3)
 ⋮
nth:      (k, k−n)  (k−1, k−n)  ...  (k−n+1, k−n)
 ⋮
(k−1)th:  (k, 1)  (k−1, 1)  ...  (2, 1)

Figure 6.4.: Generalized parallel pattern for the size reduction
The approach is advantageous since we have far fewer stages and many more parallel size reductions. But we suggest that whether one benefits from it probably depends on the desired parallel system. Algorithm 6.5 illustrates how to utilize the pattern.
Algorithm 6.5 Parallel Basis Size Reduction 2
Input: Lattice basis B = (b_1, b_2, ..., b_k) ∈ R^{d×k}, the Gram-Schmidt coefficients (µ_{i,j})_{1≤i,j≤k}
Output: Size reduced basis B
1: for e = 1 to k − 1 do
2:   for i = k downto k − e + 1 parallel do
3:     j = k − e
4:     b_i = b_i − ⌈µ_{i,j}⌋ b_j
5:     µ_{i,j} = µ_{i,j} − ⌈µ_{i,j}⌋
6:     for l = 1 to j − 1 do
7:       µ_{i,l} = µ_{i,l} − ⌈µ_{i,j}⌋ µ_{j,l}
8:     end for
9:   end for parallel
10: end for
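A hedged CUDA fragment for one stage e of this pattern could let one thread handle one vector b_i, so that all reductions by the common vector b_j run concurrently. The storage layout and names are our own illustrative assumptions (double precision requires -arch sm_13, see Section 5.2.4), and ⌈µ⌋ denotes rounding to the nearest integer.

/* Stage e of Algorithm 6.5: every thread size reduces one vector b_i,
   i = k-e+1..k (1-based), by the common vector b_j with j = k-e.
   B is column-major (leading dimension d); mu is row-major (k x k). */
__global__ void size_reduce_stage(double *B, double *mu, int d, int k, int e)
{
    int i = (k - e) + blockIdx.x * blockDim.x + threadIdx.x;  /* 0-based i */
    if (i < k) {
        int j = k - e - 1;                        /* 0-based j = k - e */
        double r = rint(mu[i * k + j]);           /* nearest integer of mu_ij */
        if (r != 0.0) {
            for (int l = 0; l < d; l++)           /* b_i = b_i - r * b_j */
                B[(size_t)i * d + l] -= r * B[(size_t)j * d + l];
            for (int l = 0; l < j; l++)           /* update mu_{i,l} */
                mu[i * k + l] -= r * mu[j * k + l];
            mu[i * k + j] -= r;
        }
    }
}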
Concluding, we exemplarily contrast the commonly used pattern with the obvious pattern to emphasize the differences. This is done for a 10-by-10 lattice basis matrix (b_1, b_2, ..., b_{10}) in Figure 6.5.
Commonly suggested pattern (17 stages):
1st:  (10,9)
2nd:  (10,8)
3rd:  (10,7)  (9,8)
4th:  (10,6)  (9,7)
5th:  (10,5)  (9,6)  (8,7)
6th:  (10,4)  (9,5)  (8,6)
7th:  (10,3)  (9,4)  (8,5)  (7,6)
8th:  (10,2)  (9,3)  (8,4)  (7,5)
9th:  (10,1)  (9,2)  (8,3)  (7,4)  (6,5)
10th: (9,1)   (8,2)  (7,3)  (6,4)
11th: (8,1)   (7,2)  (6,3)  (5,4)
12th: (7,1)   (6,2)  (5,3)
13th: (6,1)   (5,2)  (4,3)
14th: (5,1)   (4,2)
15th: (4,1)   (3,2)
16th: (3,1)
17th: (2,1)

Generalized (obvious) pattern (9 stages):
1st:  (10,9)
2nd:  (10,8)  (9,8)
3rd:  (10,7)  (9,7)  (8,7)
4th:  (10,6)  (9,6)  (8,6)  (7,6)
5th:  (10,5)  (9,5)  (8,5)  (7,5)  (6,5)
6th:  (10,4)  (9,4)  (8,4)  (7,4)  (6,4)  (5,4)
7th:  (10,3)  (9,3)  (8,3)  (7,3)  (6,3)  (5,3)  (4,3)
8th:  (10,2)  (9,2)  (8,2)  (7,2)  (6,2)  (5,2)  (4,2)  (3,2)
9th:  (10,1)  (9,1)  (8,1)  (7,1)  (6,1)  (5,1)  (4,1)  (3,1)  (2,1)

Figure 6.5.: Generalized parallel pattern compared to the usually suggested pattern
6.3. Parallel Basis Permutation

Because the orthogonalization and size reduction are not bound to a local set of basis vectors but work on the entire lattice basis, we consequently demand a likewise parallel basis permutation. Seeing that the Lovász condition has been established and used in almost every algorithm, there is no reason to repudiate this simple condition. It is expected that a parallel basis permutation delivers a similar reduction quality. At this point we do not want to present such parallel methods, since they are closely related to the parallel lattice basis reduction algorithms and hence are presented together with them.
6.4. Parallel Algorithms

At the beginning of this chapter we said that there are two kinds of parallel methods: those which embody real parallel structures and those which represent partially parallelized sequential algorithms. Both will be introduced, but again the focus lies on the natives.
6.4.1. All-Swap Algorithm

The idea of a parallel basis permutation is closely related to the concept of a true parallel method: a concept that is not constrained to work in stages and thus merely on a small excerpt of the lattice basis. The so-called All-swap [Vil92] lattice basis reduction attempts to process as much as possible of the entire basis concerning orthogonalization, size reduction, and permutation, i.e., swapping. First introduced by Villard, it remained a theoretical concept in exact arithmetic until Heckler and Thiele proposed a version in floating point arithmetic [HT93]. Analogous to the SE-algorithm, the basis is kept in both representations, exact arithmetic and floating point arithmetic, to speed up the established computations. The algorithm works iteratively in competitive alternating phases, an odd and an even phase. These phases facilitate performing as many vector swaps as possible involving disjoint vector sets. To cushion the appearance of round-off errors, the approximated basis is deduced anew from the exact basis each time, and concurrently the orthogonalization and the size reduction are performed each time on the entire lattice basis. Nevertheless, it is nothing but a parallelization of the LLL-algorithm. The All-swap algorithm by Heckler and Thiele is presented in Algorithm 6.6.
Here, too, the algorithm terminates if there is no swap left considering the Lovász condition. Compared to a stage-based algorithm, the All-swap method requires fewer phases to terminate, since not only one large vector moves backwards but almost k/2 of them (considering the Lovász condition), which is equivalent to performing k/2 local LLL-reductions. However, the reduction quality is roughly the same as for the LLL-algorithm with δ = 3/4. This is plausible since the All-swap algorithm uses the Lovász condition for this δ value (see Section 4.1.3).
For an All-swap-reduced basis, Heckler [Hec95] specifies the following properties, similar to an LLL-reduced basis, to expose the reduction quality.

Corollary 6.4.1 Let B ∈ R^{d×k} be an All-swap-reduced basis, then

$$\|b_1\|_2 \le \sqrt{c}\,\left(\frac{4}{3}\right)^{\frac{k-1}{4}} \det(L)^{\frac{1}{k}} \qquad (6.6)$$

$$\prod_{i=1}^{l} \|b_i^*\|_2^2 \le c^{\frac{l}{2}}\left(\frac{4}{3}\right)^{\frac{l(k-l)}{2}} \det(L)^{\frac{2l}{k}}, \quad l \le k, \qquad (6.7)$$

where c = 32/9.
Compared to the LLL-algorithm, the All-swap algorithm promises a slightly inferior reduction quality. By default, the phase to begin with is set to odd, but due to properties of the given lattice basis it may be useful to set it to even in some cases, rerun the algorithm, and compare the results to figure out which gives the better reduction quality.
Nevertheless, the Lovász condition is not absolutely fixed to the value 2. In Section 4.1.3 we saw that

$$\frac{3}{4}\|b_i^*\|_2^2 > \|b_{i+1}^*\|_2^2 + \mu_{i+1,i}^2\|b_i^*\|_2^2 \;\Leftrightarrow\; \|b_i^*\|_2^2 > 2\|b_{i+1}^*\|_2^2.$$
Algorithm 6.6 All-Swap Lattice Basis Reduction
Input: Lattice basis B = (b_1, b_2, ..., b_k) ∈ R^{d×k}
Output: All-swap-reduced basis B
1: phase = odd
2: swapped = true
3: while swapped parallel do
4:   B′ = (B)′
5:   Use Algorithm 6.3 to compute the Gram-Schmidt coefficients µ_{i,j}
6:   Use Algorithm 6.4 or 6.5 to size reduce the basis B
7:   if phase = odd then
8:     if ‖b*_i‖₂² > 2‖b*_{i+1}‖₂² and i is odd then
9:       Swap basis vectors b_i and b_{i+1}
10:      phase = even
11:      swapped = true
12:    else
13:      swapped = false
14:    end if
15:  else
16:    if ‖b*_i‖₂² > 2‖b*_{i+1}‖₂² and i is even then
17:      Swap basis vectors b_i and b_{i+1}
18:      phase = odd
19:      swapped = true
20:    else
21:      swapped = false
22:    end if
23:  end if
24: end while parallel
25: B′ = (B)′
26: Use Algorithm 6.3 to compute the Gram-Schmidt coefficients µ_{i,j}
27: Use Algorithm 6.4 or 6.5 to size reduce the basis B
In general, with the notion of a δ-LLL-reduced basis, δ‖b*_i‖₂² > ‖b*_{i+1}‖₂² + µ²_{i+1,i}‖b*_i‖₂² and 1/4 < δ < 1, we obtain

$$\|b_i^*\|_2^2 > \delta'\|b_{i+1}^*\|_2^2 \quad\text{with}\quad \frac{4}{3} < \delta' = \frac{1}{\delta - \frac{1}{4}}. \qquad (6.8)$$
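The phase logic itself maps naturally onto a kernel in which each thread inspects one pair of adjoining vectors. The sketch below is an illustrative assumption on our part (using the simplified condition with δ′ = 2, a column-major basis, and precomputed squared norms c_i = ‖b*_i‖₂²); the Gram-Schmidt data is recomputed from scratch in the next iteration of the host loop (lines 4−6 of Alg. 6.6), so the norms are not updated here.

/* One odd or even All-swap phase: thread i handles the pair (i, i+1)
   when i matches the 0-based parity of the phase, so all swaps are
   disjoint. B is column-major with leading dimension d. */
__global__ void allswap_phase(double *B, const double *c, int *swapped,
                              int d, int k, int parity)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;   /* 0-based pair index */
    if (i + 1 < k && (i & 1) == parity) {
        if (c[i] > 2.0 * c[i + 1]) {                 /* Lovasz, delta' = 2 */
            for (int l = 0; l < d; l++) {            /* swap b_i and b_{i+1} */
                double t = B[(size_t)i * d + l];
                B[(size_t)i * d + l] = B[(size_t)(i + 1) * d + l];
                B[(size_t)(i + 1) * d + l] = t;
            }
            *swapped = 1;  /* benign race: every writer stores the same value */
        }
    }
}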
Blocked Variant Wetzel presented a blocked version of the All-swap algorithm [Wet98] which reduces over blocks. The original All-swap algorithm locally reduces two vectors, which means block size 2 for the blocked version. This version is generalized such that m disjoint basis vector blocks of size l are reduced locally. The borderlines are then examined to expose possible overlapping swaps. So m = 2 will generate the original All-swap algorithm and m = k will give the original LLL-algorithm. Both the original and the blocked version share the same complexity with respect to arithmetic operations, but tests have shown that the speedup is best for m = 2, making the generalization unprofitable in comparison with the original All-swap method.
6.4.2. Ordered All-Swap Proposal
We now want to discuss further variants of the All-swap approach. To the best of our knowledge there is no similar proposal concerning a blocked ordered, respectively a fully ordered, basis permutation. Ordered here means that the basis is sorted such that we perform every possible swap within a certain range.
In Section 4.2.1 we depicted a method to enhance the LLL-algorithm regarding the reduction quality. The deep insertion method can be regarded as a search algorithm that looks for the best position at which to put the basis vector b_i. For this purpose the method involves the partial lattice basis (b₁, b₂, ..., b_{i−1}); the other part (b_{i+1}, b_{i+2}, ..., b_k) is not considered. Of course, this is due to the staged fashion of the LLL-algorithm, but with the assertion that (b₁, b₂, ..., b_{i−1}) is already LLL-reduced, the vector b_i is inserted at the position that is optimal so far. Aside from that, the effort is considerable, keeping in mind that only one vector is moved. Nevertheless, no comparably appropriate method exists that works with the entire lattice basis, i.e. with the All-swap algorithm. Sticking to the thought of processing as much of the lattice basis as possible as a whole, we want to extend the concept of the deep insertion and put each basis vector to its optimal position.
The blocked ordered All-swap phase is a generalized variant of the usual All-swap phase. Instead of swapping two adjoining vectors, by which means sorting them according to their squared 2-norms, blocks of 2^l vectors, with 2^l ≤ k, will be sorted. Clearly, the metric by which the position of a vector is determined is defined by the Lovász condition, here δ′. Obviously, l = 1 will give
the usual All-swap phase. The blocked ordered approach is stated in the shape of Algorithm 6.7.
The idea of a fully ordered All-swap phase is to perform every possible swap at once with the help of a sorting algorithm. Again, the metric by which the position of a vector is determined is defined by δ′. Following this idea, the phases become obsolete and Algorithm 6.8 arises.
Due to the numerous sorting algorithms with diverse properties in respect of their fit to the given parallel system, we intentionally do not provide a specific algorithm. To minimize the swaps we definitely do not want to include the basis vectors themselves in the sorting algorithm, but the squared 2-norms only. We suggest two ways to perform the sorting, assuming a shared memory for the whole lattice basis. On the one hand, a pointer table is used containing the memory addresses of the first entry of each basis vector. This table is then rearranged according to the squared 2-norms. However, such a table has to be applied for the orthogonalization and the size reduction and is dependent on the given parallel system, which could mean a greater administration effort. On the other hand, such a table can be generated anew in each iteration. For this, the table holds numbered entries which are rearranged afterwards; next to that, the basis vectors are rearranged accordingly. Incidentally, in contrast to the deep insertion method, it is sufficient to consider the squared 2-norms only, since in the next iteration the whole basis is processed again, whereas the LLL-algorithm does not do this. The swapping by itself is not costlier at all, since in the worst case of a usual All-swap phase we move every basis vector as well. Compared to the other significantly consuming operations, a sorting algorithm is probably of no consequence. All in all, we expect a better reduction quality and a shorter runtime in the shape of fewer iterations due to more performed swaps. Of course, the latter expectation is optimistic and true for some lattice bases, but probably not for most of them. On the other hand, sorting a larger range of basis vectors could result in an endless loop, hence one needs to adjust the δ′ value by increasing it while the range of involved vectors increases.
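To illustrate the second, index-table way, a minimal host-side sketch in C (our own simplification, not code from this thesis; it sorts plainly by squared 2-norm and omits the δ′ tolerance used in the algorithms below):

#include <stdlib.h>

/* One entry per basis vector: its squared 2-norm and its index. */
typedef struct { double normSq; int idx; } KeyedNorm;

static int cmpKeyedNorm(const void *a, const void *b){
    double x = ((const KeyedNorm *)a)->normSq;
    double y = ((const KeyedNorm *)b)->normSq;
    return (x > y) - (x < y);   /* ascending by squared 2-norm */
}

/* Rearranges only the small key table; the basis vectors themselves are
   permuted in a single pass afterwards, following keys[j].idx. */
void orderIndexTableByNorms(KeyedNorm *keys, int k){
    qsort(keys, k, sizeof(KeyedNorm), cmpKeyedNorm);
}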
6.4.3. Parallelized Sequential Algorithms
Quite recently there have been some attempts to develop new methods to parallelize well-known reduction algorithms. These attempts mainly concentrate on the SE-algorithm, respectively the BKZ-algorithm. The former is an implementation on a multicore processor system and promises good results in respect of runtime compared to its sequential counterpart. The latter is the difficult attempt to parallelize parts of the BKZ lattice basis reduction, delivering, however, no speed-up in contrast to the sequential SE-algorithm. The implementation of the SE-algorithm from the NTL is mostly consulted as a reference.
Algorithm 6.7 Blocked Ordered All-Swap Lattice Basis Reduction
Input: Lattice basis B = (b₁, b₂, ..., b_k) ∈ R^{d×k}, reduction parameter δ′ with δ′ > 4/3 and block size 2^l
Output: δ′-All-swap-reduced basis B
 1: phase = odd
 2: reordered = true
 3: while reordered do
 4:   B′ = (B)′
 5:   Using Algorithm 6.3 to compute the Gram-Schmidt coefficients μ_{i,j}
 6:   Using Algorithm 6.4 or 6.5 to size reduce the basis B
 7:   if phase = odd then
 8:     Split the basis into m blocks of size 2^l starting with b₁
 9:     for i = 1 to m parallel do
10:       Using an appropriate sorting algorithm to sort the squared 2-norms ‖b*_{2^l(i−1)+1}‖₂², ..., ‖b*_{2^l i}‖₂² by their size, i.e. ‖b*_r‖₂² ≤ δ′‖b*_s‖₂², 2^l(i−1)+1 ≤ r < s ≤ 2^l i
11:       Reorder the block b_{2^l(i−1)+1}, ..., b_{2^l i} accordingly
12:       phase = even
13:       if order has been changed then
14:         reordered = true
15:       else
16:         reordered = false
17:       end if
18:     end for parallel
19:   else
20:     Split the basis into m blocks of size 2^l starting with b_{1+2^{l−1}}
21:     for i = 1 to m parallel do
22:       Using an appropriate sorting algorithm to sort the squared 2-norms ‖b*_{2^l(i−1)+1+2^{l−1}}‖₂², ..., ‖b*_{2^l i+2^{l−1}}‖₂² by their size, i.e. ‖b*_r‖₂² ≤ δ′‖b*_s‖₂², 2^l(i−1)+1+2^{l−1} ≤ r < s ≤ 2^l i+2^{l−1}
23:       Reorder the block b_{2^l(i−1)+1+2^{l−1}}, ..., b_{2^l i+2^{l−1}} accordingly
24:       phase = odd
25:       if order has been changed then
26:         reordered = true
27:       else
28:         reordered = false
29:       end if
30:     end for parallel
31:   end if
32: end while
33: B′ = (B)′
34: Using Algorithm 6.3 to compute the Gram-Schmidt coefficients μ_{i,j}
35: Using Algorithm 6.4 or 6.5 to size reduce the basis B
Algorithm 6.8 Fully Ordered All-Swap Lattice Basis Reduction
Input: Lattice basis B = (b₁, b₂, ..., b_k) ∈ R^{d×k} and reduction parameter δ′ with δ′ > 4/3
Output: δ′-All-swap-reduced basis B
 1: reordered = true
 2: while reordered parallel do
 3:   B′ = (B)′
 4:   Using Algorithm 6.3 to compute the Gram-Schmidt coefficients μ_{i,j}
 5:   Using Algorithm 6.4 or 6.5 to size reduce the basis B
 6:   Using an appropriate sorting algorithm to sort the squared 2-norms ‖b*_i‖₂² by their size, i.e. ‖b*_i‖₂² ≤ δ′‖b*_j‖₂², 1 ≤ i < j ≤ k
 7:   Reorder the basis b₁, b₂, ..., b_k agreeing to the sorted ‖b*_i‖₂²
 8:   if order has been changed then
 9:     reordered = true
10:   else
11:     reordered = false
12:   end if
13: end while parallel
14: B′ = (B)′
15: Using Algorithm 6.3 to compute the Gram-Schmidt coefficients μ_{i,j}
16: Using Algorithm 6.4 or 6.5 to size reduce the basis B
For the parallel variant of the SE-algorithm [BW09], POSIX threads are used to make effective use of today's multicore computer architectures. In a shared memory setting, consuming inter-process communication is replaced by synchronization points (barriers) and locks (mutexes). The implementation is intended to reduce high dimensional lattices with big entries that require a multi-precision floating point arithmetic to approximate the lattice basis. In experiments with sparse and dense lattice bases the algorithm shows a speed-up factor of about 1.75 for the 2-thread and close to factor 3 for the 4-thread version.
The aim of the parallel variant of the BKZ-algorithm [Dag09] was to raise the limit for the block size. This new variant is based upon a self-made parallel SE-algorithm. In fact, just a few parts were sufficiently parallelized, but with noticeable success. Experiments were done on a cluster of 16 processors, which represents an old-fashioned approach compared to modern multicore systems due to the required inter-process communication. In our view the greatest disadvantage is the usage of JAVA, whereby the implementation is unable to compete with the SE-algorithm from the NTL.
7. Implementation
In this chapter we present our approach concerning the implementation of the lattice basis reduction and the implied operations. We do this in three phases corresponding to the three major components: the orthogonalization, the size reduction, and the permutation. Incidentally, the lattice basis reduction is advantageous here since these components are independent of each other and leave space to try out different variants without touching the entire reduction algorithm. It is our primary objective to implement the lattice basis reduction in a way such that it handles arbitrarily sized basis matrices, subject only to k ≤ d.
7.1. Overview
In order to create programs with CUDA we do not require many aids. nVidia itself does not provide much information on minimum requirements. The only requirement is a CUDA capable graphics card that is equipped with a minimum of 256 MB video RAM. That is true for most graphics cards since the GeForce 8 series, but, as we know, not every card fully supports the latest version of CUDA due to its compute capability. Thus the choice of the card depends on the intended calculations. In our case it is crucial to have double precision floating point arithmetic, which is only supported by the latest 200 series. Our setup possesses the following key data.
Hardware
• GTX 280 with 1 GiB video RAM, GPU clock: 700 MHz, memory clock: 1250 MHz, shader clock: 1400 MHz.
• CPU: Intel Core 2 Quad Q9550 @3.6 GHz, RAM: 4 GiB DDR-3 @1333 MHz.

Software
• Windows 7 64-bit,
• CUDA toolkit 2.3 for Windows 7 (64-bit),
• CUDA SDK 2.3 code samples for Windows 7 (64-bit),
• CUDA driver 190.38 for Windows 7 (64-bit) with CUDA support, and
• Microsoft Visual Studio 2008 Professional Edition.
Conventions  A function that is intended to be executed on the device only will be prefixed by d__, otherwise by h__ for functions run on the host. Due to the different memory spaces, variables which are accessible from a single thread only will be prefixed by thread_, variables that are available within the thread block by block_, and all other variables are denoted without any prefix. Additionally we employ the pre-processor defines:
• REAL, placeholder for either float or double to define the floating point precision,
• INT, placeholder for integer data types such as int or long,
• DIM_ROW, lattice basis dimension d,
• DIM_COL, number of lattice basis vectors k.
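For illustration, such a configuration could look as follows (a sketch; the concrete values are our own placeholders, not settings from this thesis):

// Illustrative pre-processor configuration; the REAL/INT choices and the
// dimensions are placeholders and would be set per reduction problem.
#define REAL double    // floating point precision
#define INT  long      // exact integer arithmetic type
#define DIM_ROW 1000   // lattice basis dimension d
#define DIM_COL 1000   // number of lattice basis vectors k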
In the case that the number of entries of a row or column processed by a block exceeds the number of applied threads (e.g. 512), a pattern is used such that a single thread handles entries at a distance of #threads per block (#tpb).
[Diagram: basis vector entries 1, 2, 3, ..., 2560 mapped onto threads 0 to 511; thread t handles the entries t+1, t+513, t+1025, ..., i.e. entries at a distance of #tpb.]
When a CUDA device function is called from the host, the function is initialized with the correct number of thread blocks. As a result, within the kernel we do not have to care about the respective boundaries.
7.2. Phase I: Orthogonalization
For sure, the orthogonalization part will be the most time consuming one, thus we invest much effort in the respective algorithms. We implement both the Givens rotations and the Householder reflections, where the latter serves as a reference for runtime comparisons. The usual way when implementing the Householder reflections is to choose the blocked version. As this approach was the focus of almost every work concerning the QR decomposition on graphics cards, we concentrate on the Givens rotations. The Givens rotations are often stated to be inefficient and not practical on graphics cards, yet we are aware of no work so far that examines and implements them. With this work we want to eliminate this grievance and present both an implementation and a concluding comparison. We start by describing the implementation of the Householder reflections and then continue with the implementation of various variants of the Givens rotations.
We also intend to keep the basis in exact integer as well as in floating point arithmetic. Therefore we use the following simple CUDA kernel (List. 7.1) to approximate the lattice basis and copy it row-wise to another global memory address space.
__global__ void d__approximateBasis(INT *A, REAL *B){
    int thread_dedRowBase = __umul24(blockIdx.x, DIM_COL);

    for (int i = threadIdx.x; i < DIM_COL; i += blockDim.x)
        B[thread_dedRowBase + i] = (REAL)A[thread_dedRowBase + i];
}
Listing 7.1: C for CUDA code to approximate the basis
The integer value thread_dedRowBase holds the offset address of the thread block's dedicated row. __umul24() is a fast multiplication that operates on the 24 least significant bits of its operands.
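A host-side call might look as follows (a sketch; the pointer names dA and dB, the thread count, and the synchronization call are our assumptions, not code from this thesis):

// One thread block per basis row, 512 threads per block (illustrative).
// dA: exact integer basis in global memory, dB: its floating point copy.
d__approximateBasis<<<DIM_ROW, 512>>>(dA, dB);
cudaThreadSynchronize();   // wait for the kernel before using dB (CUDA 2.3 API)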
Function name           utilized shared memory   registers   required calls
d__approximateBasis()   0                        1           1
d__transformGsCoeff()   0                        2           1
Table 7.1.: Implemented helper functions for basis orthogonalization
For the implementation of the transformation kernel see Section 7.2.4.
Apart from implementing the QR decomposition methods on the device, we additionally implemented all methods, including the Gram-Schmidt process, on the host. These implementations serve as a sequential reference.
7.2.1. Standard Householder Reflections
At the beginning of any CUDA implementation one has to consider how to distribute the data over the thread blocks. Since in our case the data is supplied in matrices and we know that the Householder reflections process columns, a single thread block is responsible for a single column; moreover, a thread is responsible for one column entry. The basis is therefore stored column-wise in the global memory. According to the sequence (Section 3.3), in each iteration the number of thread blocks is decreased by 1, as well as the first considered entry. The first column in each iteration, marked with two vertical lines at each side in the following sequence, is the column from which the Householder vector is computed.

Function name
h__modGramSchmidt()
h__givensRotations_standard()
h__givensRotations_fast()
h__householderReflections()
Table 7.2.: Implemented QR decomposition methods on host

[Matrix sequence: in each step the leading column of the remaining submatrix (marked with two vertical lines) yields the Householder vector; the submatrix still to be processed shrinks by one row and one column from the top left until only the last column remains.]
The kernel implements Algorithm 6.2 and is given a pointer to the basis residing in the global memory space and a stage number that holds the current iteration number. At the beginning of the kernel each thread determines the position of the top entry of the foremost column in the stage. This is done by calculating an offset involving the stage number, which is added to the basis pointer. Afterwards the Householder vector has to be computed corresponding to Algorithm 3.7, but with the adjoining calculation of the factor

    β = 2 / (⟨(x₂, x₃, ..., x_n)ᵀ, (x₂, x₃, ..., x_n)ᵀ⟩ / α² + 1),           (7.1)

where the required values are already computed by the algorithm. From now on, the scalar product always has to be calculated.
The calculation of scalar products in CUDA requires a change of thinking. Reminding that each thread possesses its own registers and has access to a shared memory that cannot be initialized, the question arises how to compute the scalar product involving each thread in the thread block. The approach is as follows. Given two vectors b_i and b_j, each thread computes a number of pair multiplications such that every pair is covered if the number of threads is smaller than the dimension of the vectors. The partial sums are now stored in the registers of the threads and have to be added up. Therefore an array of a size equal to the number of threads is declared in shared memory and all partial sums are stored in this array. Soon after, the partial sums are added up in a binary tree fashion
such that the shared memory is divided into two halves; the partial sums of the second half are added to their counterparts in the first half. This approach is repeated until the final sum resides at the first position, as illustrated in Figure 7.1. In general, the lower binary tree like addition will be referred to as the binary tree like reduction method.
[Diagram: 16 basis vector entries, pair-wise multiplied by threads 0 to 7; the partial sums in shared memory are halved repeatedly (8 → 4 → 2 → 1) until the final sum resides at position 0.]
Above: pair-wise multiplications performed by threads
Below: binary tree like additions
Figure 7.1.: Exemplary computation of a scalar product in CUDA
After the scalar product ⟨(x₂, x₃, ..., x_n)ᵀ, (x₂, x₃, ..., x_n)ᵀ⟩ has been calculated, α is calculated according to Algorithm 3.7 and β as stated above. This is done by a single thread of each corresponding block only, to avoid bank conflicts in the shared memory when accessing the scalar product result. Merely the values of β and the reciprocal of α are stored, instead of the whole Householder vector, and provided in the shared memory.
The code sample below (List. 7.2) shows the computation of a scalar product. The pointers thread_vecBase and thread_vecEnd mark the beginning, respectively the end, of a column vector within a matrix A. The included synchronization is necessary for the binary tree fashioned sum-up to ensure that all threads have finished before starting the next iteration.
thread_sum = 0.0f;
for (int i = thread_vecBase + threadIdx.x; i < thread_vecEnd; i += blockDim.x)
    thread_sum += A[i] * A[i];
block_sum[threadIdx.x] = thread_sum;

for (int i = blockDim.x >> 1; i > 0; i >>= 1){
    __syncthreads();
    if (threadIdx.x < i)
        block_sum[threadIdx.x] += block_sum[threadIdx.x + i];
}
Listing 7.2: C for CUDA code to compute a scalar product
Again a scalar product, namely w, has to be calculated next (line 6 in Alg. 6.2), this time involving the foremost column in the stage and the dedicated column of the thread block. The dedicated column pointer is determined with the help of the block identification number. Subsequently the sum is multiplied by the reciprocal of α, since we did not save the Householder vector, which is indeed involved in this scalar product.
In the last step (line 7 in Alg. 6.2) we update each column entry such that

    b_{i,j} = b_{i,j} − βw (1/α) b_{i,j}.                                    (7.2)

Function name                 utilized shared memory   registers   required calls
d__householderReflections()   #tpb + 2                 4           k − 1
Table 7.3.: Implemented functions for Householder reflections
7.2.2. Standard Givens Rotations
For the Givens rotations there are several patterns to implement. Initially we have to consider the distribution of the thread blocks. Here, too, this is easy to decide because Givens rotations affect two rows per iteration; hence a single thread block is responsible for a single row, and thus a thread for a row entry. In contrast to the Householder implementation, the data is stored row-wise in the global memory space. The two leftmost entries (pivot values) in each row in a certain iteration are determined by the number of the upper row. For instance, if we process the rows 3 and 4, the third entry of each row represents the row's pivot value, as derived from the sequence in Section 3.2.

[Matrix sequence: the Givens rotations rotate two adjoining rows at a time, eliminating one further subdiagonal entry per rotation until the matrix is upper triangular.]
The sequence enables three useful patterns to be implemented considering Figure 6.1. The first conceivable pattern is the one often proposed when the Givens rotations were examined theoretically for parallel structures: a single thread really affects a single row entry, and hence one row requires ⌈k / #threads per block⌉ thread blocks instead of one (one-thread-per-entry). The second pattern is equal to that, but reduces the administrative effort by reducing pointer calculations (one-block-per-row). The third continues to decrease the effort by applying one block to two associated rows (one-block-per-associated-rows).
However, we need a function that computes the values c and s. This is done in a straight fashion complying with Algorithm 3.3. The function is given the two pivot values and returns c and s by reference (List. 7.3).
__device__ void d__getGrCoeff(REAL upperPivot, REAL lowerPivot,
                              REAL *c, REAL *s){
    REAL t;

    if (int(ceil(fabs(lowerPivot))) == 0){
        *c = 1.0f;
        *s = 0.0f;
    }
    else if (fabs(lowerPivot) < fabs(upperPivot)){
        t = lowerPivot / upperPivot;
        *c = rsqrt(1 + t * t);
        *s = (*c) * t;
    }
    else {
        t = upperPivot / lowerPivot;
        *s = rsqrt(1 + t * t);
        *c = (*s) * t;
    }
}
Listing 7.3: C for CUDA code to compute c, s
The first check, whether the pivot value from the lower row is zero, is done for two reasons. On the one hand, if this value is zero, the two affected rows do not have to be altered, thus c and s are set exactly to their respective values. On the other hand this avoids an overflow in the case that both values are zero.
Hereafter we describe the differences between the three implementation variants, which build on each other from the first to the last while becoming more lightweight every time.
Function name               utilized shared memory   registers   required calls
d__getGrCoeff()             0                        1           1
d__givensRotations_otpe()   2                        9           ≈ d + k − 2
d__givensRotations_obpr()   2                        5           ≈ d + k − 2
d__givensRotations_obpa()   2                        5           ≈ d + k − 2
Table 7.4.: Implemented functions for Givens rotations I
One-Thread-Per-Entry
The kernel usually starts with a calculation to determine to which row, and to which part of it, each block belongs. For this purpose the kernel is given the number of required blocks per row,

    blocksPerRow = ⌈k / #threads per block⌉.

The row is then determined by

    rowBaseNr = ⌊block index / blocksPerRow⌋

and the position within the row by

    block index − rowBaseNr · blocksPerRow.
Next we have to assign to each block the number of the row that it should handle and the number of the associated row needed to compute c, s. Therefore we exploit the sequence of the parallel pattern of the Givens rotations, exemplarily shown in Figure 6.2. The minimum of each stage, sMin, is used as an offset to determine the dedicated row number, and the stage maximum, sMax, as an offset to determine the associated row.
Dedicated row:   0 + sMin   1 + sMin   2 + sMin   3 + sMin   4 + sMin   5 + sMin   ...
Associated row:  sMax − 0   sMax − 1   sMax − 2   sMax − 3   sMax − 4   sMax − 5   ...
Again, the relation between two rows is crucial to calculate c, s; hence a comparison of the two row numbers reveals which of them is the upper row.
The kernel proceeds with the computation of c, s, and in order to reduce the latency when accessing the same address in the global memory space, this is done by a single thread that calls the function d__getGrCoeff(). The values are stored in the shared memory. Afterwards each thread updates its assigned entry by
    b_{i,l} = c b_{i,l} + s b_{j,l}                                          (7.3)
or
    b_{i,l} = c b_{i,l} − s b_{j,l},                                         (7.4)

where b_{i,l} is the dedicated entry value and b_{j,l} the entry of the associated row. The upper equation is used if the dedicated row is the upper row, otherwise the lower one. Listing 7.4 depicts the update of the dedicated value, where thread_dedRowThreadOffset refers to the position of the thread's dedicated entry and thread_assocRowThreadOffset to the position of the thread's associated value within the basis matrix A. The synchronization is necessary since two distinct thread blocks perform this update, and hence the values have to be saved prior to updating.
...
if (thread_dedBaseOffset < thread_assocBaseOffset){
    if (threadIdx.x == 0)
        d__getGrCoeff(...);
    ...
    thread_assocValue = A[thread_assocRowThreadOffset];
    __syncthreads();
}
if (thread_dedRowThreadOffset < thread_dedRowEnd)
    A[thread_dedRowThreadOffset] = block_c
        * A[thread_dedRowThreadOffset] + block_s * thread_assocValue;
...
Listing 7.4: C for CUDA code to update the dedicated row
The one-thread-per-entry method is not of practical relevance, as the synchronization barrier synchronizes threads within the same block only. Since two affected rows are processed by two different blocks, the synchronization is not guaranteed and therefore the results are indeterminate.
One-Block-Per-Row
The implementation of this pattern is very similar to the above pattern. Since a single thread block is responsible for a single row, the initial pointer calculations are excluded. Additionally, due to fewer thread blocks, we can continue reducing the latency when accessing the same address in global memory to load the pivot values. Further on, a single thread now has to compute more than one entry update, more precisely ⌈k / #threads per block⌉ on average. However, this method exposes the same problem concerning the synchronization, since two affected rows are processed by two different blocks (see above).
One-Block-Per-Associated-Rows
This pattern is by far the simplest implementation of the Givens rotations. A single thread block is here responsible for the two rows affected by the Givens rotations. The simple process is as follows. At the beginning the offsets are added to obtain the pointers to both the dedicated and the associated row. Then the values c, s are calculated by a single thread. Again this means less latency when accessing the global memory. Like in the one-block-per-row pattern, a single thread computes more than one entry update, and thus a thread updates the entries in both rows. As a result, the comparison to determine which is the upper row is also omitted. The code sample below (List. 7.5) once again shows the entry update, but this time no synchronization is needed since both associated values are updated at a time.
...
if (threadIdx.x == 0)
    d__getGrCoeff(...);
for (...){
    ...
    A[thread_dedRowThreadOffset] = block_c * thread_dedValue
        + block_s * thread_assocValue;
    A[thread_assocRowThreadOffset] = block_c * thread_assocValue
        - block_s * thread_dedValue;
}
Listing 7.5: C for CUDA code to update associated rows
7.2.3. Givens Rotations with Combined Stages
Re-examining the parallel Givens rotations pattern (Fig. 6.2), one recognizes that in the upper half every column begins with the same row number and, moreover, the corresponding associated row numbers are numbered consecutively in increasing order; in the second half it is the other way round. This observation is advantageous, keeping in mind that fewer accesses to the global memory are a crucial factor for the entire performance of a CUDA application.
Evidently, at first sight it makes sense to combine exactly two stages, since the stages grow for odd numbers. When combining more than two stages, the last thread block has to perform a check such that its computations begin deferred. Indeed this is the case for the last thread block only, but the kernel is written to be executed by all thread blocks. A further reason that contributes to performance declines could be the fact that the serial effort increases analogously as more stages are combined. The optimal balance can only be found by benchmarking. So, without anticipating the results, we implement a kernel that combines two stages and a second that combines four stages. The following code sample (List. 7.6) depicts the entry update for the 2-staged kernel. Of course the approach for the 4-staged and a general n-staged kernel is exactly the same. Both are based on the one-block-per-associated-rows method. Due to the diverging row numbers, which are fixed in the upper half compared to the lower half, we need to implement two respective kernels for each method, an upper and a lower variant.
...
if (threadIdx.x == 0)
    d__getGrCoeff(...);
for (...){
    ...
    thread_dedPreValue = block_c * thread_dedValue
        + block_s * thread_assocValue;
    A[thread_assocRowThreadOffset] = block_c * thread_assocValue
        - block_s * thread_dedValue;
    __syncthreads();
    thread_assocValue = A[thread_assocSuccRowThreadOffset];
    if (...)
        d__getGrCoeff(...);
    __syncthreads();
    A[thread_dedRowThreadOffset] = block_c * thread_dedPreValue
        + block_s * thread_assocValue;
    A[thread_assocSuccRowThreadOffset] = block_c * thread_assocValue
        - block_s * thread_dedPreValue;
}
Listing 7.6: C for CUDA code to update the dedicated row (2-staged)
Figure 7.2 illustrates the combined parallel Givens rotations pattern.
[Diagram: the rotation pairs (1,2), (2,3), (3,4), ... of consecutive stages are grouped two (respectively four) at a time; within a combined group the intermediate update of the shared row is kept in registers and not written back to global memory.]
r = updated entries of the row r are not stored in this stage
Figure 7.2.: Saved storages with 2-staged, respectively 4-staged method
The 2-staged method enables us to save one out of four storages to the global memory space compared to the 1-staged version (one-block-per-associated-rows). Assuming a lattice basis represented by a d-by-k matrix, the 2-staged method leaves at most one stage uncovered, i.e. there is at most one stage that cannot be combined with another stage. Hence we can expect to save about 25% of the storages in total.
Using the 4-staged method, we first assume that all four involved stages have the same length; then we can save three out of eight storages, which means we can expect to save a little less than 37.5% of the storages in total. However, both estimations embody the best case. The real portion that can be saved depends on the matrix size and on how the two methods cover the stages, but for large matrices these estimations are quite sufficient.
Theoretically, the n-staged method achieves, under the same assumptions as for the 4-staged method, a saving of (n−1)/(2n) of the storages in total. For large n values the saving tends to 50%, but again without considering the omitted updates in the backmost thread blocks.
7.2.4. Transformation
The QR decomposition gives the matrix R, which now has to be transformed into the matrix R̃. In respect of Algorithm 6.3 we made a straightforward implementation (List. 7.7) to obtain the Gram-Schmidt coefficients μ_{i,j} and the squared 2-norms ‖b*_i‖₂².
Function name                   utilized shared memory   registers   required calls
d__givensRotations_obpa_2_u()   2                        7           ≈ 0.5(d + k − 2)
d__givensRotations_obpa_2_l()   2                        7           ≈ 0.5(d + k − 2)
d__givensRotations_obpa_4_u()   2                        9           ≈ 0.25(d + k − 2)
d__givensRotations_obpa_4_l()   2                        9           ≈ 0.25(d + k − 2)
Table 7.5.: Implemented functions for Givens rotations II
__global__ void d__transformGsCoeff(REAL *A){
    REAL thread_pivot;
    int thread_dedRowBase = __umul24(blockIdx.x, DIM_COL);

    thread_pivot = A[thread_dedRowBase + blockIdx.x];
    for (int i = threadIdx.x + blockIdx.x + 1; i < DIM_COL; i += blockDim.x)
        A[thread_dedRowBase + i] =
            A[thread_dedRowBase + i] / thread_pivot;
    __syncthreads();   // all threads must have read the pivot before it is overwritten
    if (threadIdx.x == 0)
        A[thread_dedRowBase + blockIdx.x] = thread_pivot * thread_pivot;
}
Listing 7.7: C for CUDA code to obtain Gram-Schmidt coefficients
7.3. Phase II: Size Reduction
In this section we describe the implementation approach for the size reduction. The size reduction implies the reduction of the basis and of the Gram-Schmidt coefficients at the same time, just as Algorithm 6.5 clarifies. But we propose a more efficient procedure that decouples these related size reductions. If we perform a size reduction on the Gram-Schmidt coefficients first, we are able to size reduce a single basis vector in only one step. However, the modified Gram-Schmidt coefficient matrix R̃ has to be prepared first.
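The decoupling rests on the following observation (our own summary of the idea, not a formula printed in this thesis): once the rounded coefficients are fixed, the reduction of a basis vector collapses into a single weighted-sum update,

    % with pre-computed rounded coefficients \lceil\mu_{i,j}\rfloor:
    b_i \;\leftarrow\; b_i \;-\; \sum_{j=1}^{i-1} \lceil \mu_{i,j} \rfloor \, b_j ,

which is exactly the operation implemented in Sections 7.3.2 and 7.3.3.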
Function name                utilized shared memory   registers   required calls
d__trianTransposeGsCoeff()   #tpb                     4           ≈ k/#tpb
d__getNearInt()              0                        1           1
d__sizeReduceGsCoeff_4()     4                        10          ≈ k/4
d__sizeReduceBasis_8()       8 · #tpb                 18          k − 1
Table 7.6.: Implemented functions for size reduction
7.3.1. Transposition
In view of CUDA and its performance considerations, the matrix R̃ reveals a severe disadvantage for further processing. The coefficients relevant for a lattice basis vector form a column at the corresponding position. But loading columns causes a performance penalty since the global memory access is not coalesced; in particular this means that each entry requires its own memory transaction. Hence we suggest transposing the matrix first. In this case a transposition is very efficient because the lower triangular part is not of interest. The diagonal transposition works in staged fashion and copies k columns with at most n entries at once in the first stage. In the second stage it copies k − n columns with at most n entries, and so on. Furthermore, the squared 2-norms are copied to the top row (see Phase III). For instance, if R̃ is given by a 7-by-7 matrix and n = 4, the diagonal transposition proceeds as follows.

R̃ = ( ‖b*₁‖₂²  μ2,1     μ3,1     μ4,1     μ5,1     μ6,1     μ7,1    )
    ( 0        ‖b*₂‖₂²  μ3,2     μ4,2     μ5,2     μ6,2     μ7,2    )
    ( 0        0        ‖b*₃‖₂²  μ4,3     μ5,3     μ6,3     μ7,3    )
    ( 0        0        0        ‖b*₄‖₂²  μ5,4     μ6,4     μ7,4    )
    ( 0        0        0        0        ‖b*₅‖₂²  μ6,5     μ7,5    )
    ( 0        0        0        0        0        ‖b*₆‖₂²  μ7,6    )
    ( 0        0        0        0        0        0        ‖b*₇‖₂² )

⇒ ( ‖b*₁‖₂²  ‖b*₂‖₂²  ‖b*₃‖₂²  ‖b*₄‖₂²  μ5,1     μ6,1     μ7,1    )
  ( μ2,1     ‖b*₂‖₂²  μ3,2     μ4,2     μ5,2     μ6,2     μ7,2    )
  ( μ3,1     μ3,2     ‖b*₃‖₂²  μ4,3     μ5,3     μ6,3     μ7,3    )
  ( μ4,1     μ4,2     μ4,3     ‖b*₄‖₂²  μ5,4     μ6,4     μ7,4    )
  ( μ5,1     μ5,2     μ5,3     μ5,4     ‖b*₅‖₂²  μ6,5     μ7,5    )
  ( μ6,1     μ6,2     μ6,3     μ6,4     0        ‖b*₆‖₂²  μ7,6    )
  ( μ7,1     μ7,2     μ7,3     μ7,4     0        0        ‖b*₇‖₂² )  ⇒ ...
Clearly, we set n = #threads per block. The transposed matrix enables us to perform coalesced loads from the global memory space.
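For orientation, a deliberately naive (non-staged) transposition kernel could look as follows. This is our own sketch, not the thesis kernel d__trianTransposeGsCoeff(); it ignores both the staging that keeps the accesses coalesced and the copying of the squared 2-norms to the top row:

// Naive upper-triangle transpose of R~ (one block per destination row i).
// The column reads R[j * DIM_COL + i] are exactly the non-coalesced
// accesses that the staged variant avoids.
__global__ void d__naiveTransposeUpper(REAL *R){
    int i = blockIdx.x;                            // destination row
    for (int j = threadIdx.x; j < i; j += blockDim.x)
        R[i * DIM_COL + j] = R[j * DIM_COL + i];   // mu_{i,j} moves below the diagonal
}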
7.3.2. Size Reduction i: Coefficients
This is achieved by pre-computing the Gram-Schmidt coefficients in a manner such that coefficients that contribute to a basis reduction are not reduced but only rounded to their nearest integer; in other words, we replace line 4 (Alg. 6.5) by

    μ_{i,j} = ⌈μ_{i,j}⌋.                                                     (7.5)
7.3 Phase II: Size Reduction
89
It has to be ensured that μ_{i,j} is not altered within further iterations. Listing 7.8 shows the device function that calculates the nearest integer to a given Gram-Schmidt coefficient.
__device__ void d__getNearInt(REAL *gsCoeff, REAL *nearInt){
    REAL frac;

    *nearInt = trunc(*gsCoeff);
    frac = (*gsCoeff - (*nearInt)) * 10.0f;
    if (frac > 5.0f)
        (*nearInt)++;
    else if (frac < -5.0f)
        (*nearInt)--;
}
Listing 7.8: C for CUDA code to obtain the nearest integer value
Here, too, we can apply the same approach as for the Givens rotations with combined stages. Fortunately, the size reduction affects one row only, which means that updated entries are stored at the end of a combined stage and no entry has to be stored within the combined stage. Figure 7.3 illustrates the combined parallel size reduction.
[Diagram: the coefficient updates (10,9), (10,8), (9,8), (10,7), ... are grouped into combined stages of four; only the last update of each group is written back to global memory.]
r = updated entries of the row r are not stored in this stage
Figure 7.3.: Saved storages with 4-staged method
Without combining stages, k(k+1)/2 storages have to be performed. With n stages combined, storages are performed only every nth time, which results in s(s+1)n/2 storages in total, where s = k/n. For n = 4 this amounts to a saving of about 75%; larger n values increase the theoretical saving further, but the 4-staged method embodies the optimum in practice.
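A quick check of these counts (our own arithmetic, under the same assumptions):

    % Ratio of storages with n combined stages (s = k/n) to the uncombined case:
    \frac{s(s+1)\,n/2}{k(k+1)/2}
      = \frac{\frac{k}{n}\bigl(\frac{k}{n}+1\bigr)\,n}{k(k+1)}
      = \frac{k+n}{n(k+1)} \;\approx\; \frac{1}{n} \qquad (k \gg n),

so an n-staged method performs roughly a 1/n fraction of the storages, i.e. saves about 1 − 1/n of them, which matches the 75% stated above for n = 4.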
Implementation
90
7.3.3. Size Reduction ii: Basis
To benefit from the pre-calculated nearest integer versions of the Gram-Schmidt coefficients, we suggest size reducing the basis in a binary tree fashioned way, closely related to the scalar product computation. The basis size reduction is processed row-wise, as Figure 7.4 exposes. The idea is to sum up the row values of all previous vectors (more precisely, a weighted sum involving the nearest integers) and to subtract this sum from the row value of the actual vector.
[Diagram: for each dimension row, the entries b_{1,row}, ..., b_{k−1,row} are summed up, weighted by the nearest integers, and the sum is subtracted from b_{k,row}.]
Figure 7.4.: Size reduction for basis vector bk
The weighted sum is calculated in exactly the same way as the scalar product (Fig. 7.1). Rather than two basis vectors, we take each row vector together with the respective row from the matrix R̃, which contains the nearest integers.
Once again we can combine computations to make the implementation more efficient. This time we combine a number of rows to save loadings from the global memory, i.e. the loadings of the nearest integers are reduced. Normally these are loaded d times. Assuming an n-staged method, only ⌈d/n⌉ loadings have to be performed.
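As an illustration of this weighted-sum idea, a minimal kernel sketch (our own code, not the thesis kernel d__sizeReduceBasis_8(); it reduces a single vector b_m without the row combining, assumes the basis B is stored row-wise, R holds the transposed rounded coefficients, and DIM_BLOCK is the thread block size define from Listing 7.10):

__global__ void d__sizeReduceBasisRow(REAL *B, REAL *R, int m){
    __shared__ REAL block_sum[DIM_BLOCK];
    REAL thread_sum = 0.0f;
    int row = blockIdx.x;                       // dimension row of this block

    // weighted partial sums over the previous vectors b_1, ..., b_{m-1}
    for (int j = threadIdx.x; j < m; j += blockDim.x)
        thread_sum += R[m * DIM_COL + j] * B[row * DIM_COL + j];
    block_sum[threadIdx.x] = thread_sum;

    // binary tree like reduction as in Listing 7.2
    for (int i = blockDim.x >> 1; i > 0; i >>= 1){
        __syncthreads();
        if (threadIdx.x < i)
            block_sum[threadIdx.x] += block_sum[threadIdx.x + i];
    }
    if (threadIdx.x == 0)
        B[row * DIM_COL + m] -= block_sum[0];   // subtract the weighted sum
}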
7.4. Phase III: Permutation
In this final phase we explain our proceedings concerning the basis permutation. Since we intend to implement the All-swap approach, we employ the usual All-swap condition as well as our ordered All-swap proposal.
7.4.1. Usual All-Swap Phase
The kernel (List. 7.9) is a very straightforward implementation of the swapping part of Algorithm 6.6 (lines 7 to 22), except for assigning the swapped indicator. The offset value phaseOffset (0 or 1) determines whether the phase is odd or even, respectively.

Function name           utilized shared memory   registers           required calls
d__swapBasis_as()       0                        3                   1
d__swapBasis_bl_ord()   0                        2 · blocksize + 4   1
d__swapBasis_ord()      #tpb + 2                 5                   k − 1
d__genBasisChecksum()   #tpb                     4                   1
d__verBasisChecksum()   #tpb                     4                   1
Table 7.7.: Implemented functions for basis permutation
__global__ void d__swapBasis_as(INT *A, REAL *B, int phaseOffset){
    INT dedValue, assocValue;   // exact integer entries
    int thread_rowBase = __umul24(blockIdx.x, DIM_COL);

    for (int i = (threadIdx.x << 1) + phaseOffset; i + 1 < DIM_COL;
         i += (blockDim.x << 1)){   // stride over disjoint pairs
        dedValue   = A[thread_rowBase + i];
        assocValue = A[thread_rowBase + i + 1];
        if (B[i] > 2 * B[i + 1]){
            A[thread_rowBase + i]     = assocValue;
            A[thread_rowBase + i + 1] = dedValue;
        }
    }
}
Listing 7.9: C for CUDA code for the usual All-swap phase
The kernel performs the swapping on the entire basis, while every row of the basis is handled independently of the others. Consequently the kernel is only called once, and combining rows has no notable impact on performance.
7.4.2. Ordered All-Swap Phase
Blocked Ordered All-Swap  Contrary to the usual All-swap phase, the blocked ordered All-swap phase loads 2^l squared 2-norms and sorts them instead of performing a simple comparison; this is done by a single thread. The elements are sorted with the help of insertion sort, but any other algorithm is appropriate, too. For the rest, this kernel equals the usual All-swap kernel.
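For completeness, such a single-thread insertion sort over the 2^l keys might look like this (our own sketch; the thesis kernel d__swapBasis_bl_ord() is not printed, and the permutation array perm is an assumed helper used to reorder the basis vectors afterwards):

// Single-thread insertion sort of n squared 2-norms; perm records the
// resulting permutation of the basis vectors within the block.
__device__ void d__insertionSortNorms(REAL *key, int *perm, int n){
    for (int i = 1; i < n; ++i){
        REAL k = key[i];
        int  p = perm[i];
        int  j = i - 1;
        while (j >= 0 && key[j] > k){   // shift larger norms to the right
            key[j + 1]  = key[j];
            perm[j + 1] = perm[j];
            --j;
        }
        key[j + 1]  = k;                // insert at the found position
        perm[j + 1] = p;
    }
}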
Fully Ordered All-Swap  In case the whole basis has to be sorted rather than blocks, we need to follow a different approach. Implementing sorting algorithms on a device based upon the CUDA architecture is not trivial. There are a few recommended algorithms that are proclaimed to be efficient, like radix sort, merge sort, and bitonic sort. However, in this work we focus on small tailor-made kernels. Furthermore, for lattice basis reduction there is no need for a sorting algorithm that is efficient for several hundred thousands of elements, but at most for a few thousands.
Here, too, we prefer to implement a parallel variant of insertion sort, a very simple algorithm that is easy to port to CUDA and virtually requires no additional memory. Insertion sort works in the following manner. In every iteration, insertion sort removes an element from the given list to be sorted and inserts it into the correct position in the already-sorted part of the list. Insertion sort trivially assumes the first element of the given list to be sorted. In the first iteration, the second element is compared with the first element and inserted before or behind it. Next, the third element is compared with the first two elements and inserted correctly. This process is repeated until the last element is reached and accordingly inserted. The algorithm can be parallelized by performing all comparisons arising per iteration at once. In our case we have to consider both the value, i.e. the element itself, and a respective key that participates in the comparison. The value is embodied by an element of a lattice basis vector and the key by the corresponding squared 2-norm. Figure 7.5 illustrates the sorting process.
[Diagram: the current element x_i with key ‖x_i‖² is compared in parallel against the sorted part x₁, x₂, ..., x_{i−1} and inserted at the foremost matching position; the remaining elements form the unsorted part.]
Figure 7.5.: Insertion sort implying Lovász condition as key
Each comparison is performed by a dedicated thread; consequently, the correct position has to be marked. This is difficult due to the circumstance that possibly more than one thread wants to mark the correct position because its comparison was positive. In fact, the correct position is the foremost of all possible positions. With this knowledge, the correct position can be determined by the thread indices: the thread which possesses the smallest index holds the desired position. To filter this thread index from all other possible thread indices, each thread which obtains a positive comparison result writes its index to an array of size #threads per block at its respective position. Threads which obtain a negative result write i at their respective positions. With the help of the binary tree method and size comparisons, the desired thread index is filtered out (Fig. 7.6).
[Diagram: the thread index table (8, 1, 2, 8, 4, 1, 2, 8) is reduced in binary tree fashion by size comparisons until the smallest index, 1, remains at the front.]
Figure 7.6.: Binary tree fashioned comparison assuming current position i = 8
and #threads per block = 8
The current squared 2-norm is written to the filtered position; squared 2-norms to the right of the filtered position are right-shifted by one. At the same time the basis vectors are swapped accordingly.
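In code, the position filtering could be realized as follows (our own sketch of the technique from Figure 7.6; the function name and parameters are illustrative):

// Returns the foremost candidate position: threads with a positive
// comparison contribute their index, all others the current position i;
// a binary tree minimum leaves the smallest value at index 0.
__device__ int d__filterInsertPos(int comparisonPositive, int i){
    __shared__ int block_pos[DIM_BLOCK];

    block_pos[threadIdx.x] = comparisonPositive ? (int)threadIdx.x : i;
    for (int s = blockDim.x >> 1; s > 0; s >>= 1){
        __syncthreads();
        if (threadIdx.x < s &&
            block_pos[threadIdx.x + s] < block_pos[threadIdx.x])
            block_pos[threadIdx.x] = block_pos[threadIdx.x + s];
    }
    __syncthreads();
    return block_pos[0];
}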
7.4.3. Swap Indication
The above kernels do not indicate whether a swap has occurred or not, but this indication is essential since it is the termination condition of the entire lattice basis reduction. Integrating this check into the kernel is difficult because the same thread of each row wants to report the swapping, i.e. write to a variable in global memory. If there is more than one swap, many more threads want to write to such a variable; hence this approach is anything but efficient. This problem can be solved either by applying a checksum over the row entries, an indirect approach, or by directly counting the swaps.
Checksum
In principle we need to involve just one row, but then of course there is a probability that a swap remains undetected. On the other hand the checksum calculation shall be very efficient, thus we merely apply the checksum to a certain amount of rows, say 10%. We propose a procedure akin to a signature, including generation and verification. We will exemplarily explain this procedure for a given row i.
Checksum Generation  First we take a random but constant vector consisting of k integer numbers. Then the following operations are performed to achieve the row checksum c_i over the ith row (Fig. 7.7): a pair-wise AND operation and an exclusive OR that maps the results to a single output,

    c_i = (b_{1,i} & r_{i,1}) ⊕ (b_{2,i} & r_{i,2}) ⊕ ... ⊕ (b_{k,i} & r_{i,k}).   (7.6)

Surely, one can involve other appropriate operations, and we suggest using a set of distinct constant vectors. An individual vector can be generated with the help of the index numbers, for instance r_j = block index ⊕ thread index, where r_j is the jth element.
[Diagram: the entries b_{1,i}, ..., b_{k,i} are AND-ed pair-wise with r_{i,1}, ..., r_{i,k} and the results are XOR-ed together into the checksum c_i.]
Figure 7.7.: Checksum generation on basis vector bi and integer vector ri
The procedure is also well suited for the binary tree fashioned sum-up implementation approach, as illustrated in Listing 7.10.
Checksum Verification  Clearly, the verification is executed after the swapping. For the verification we first apply the generation again to obtain the checksums c′_i. Now we verify whether

    (c₁ ⊕ c′₁) | (c₂ ⊕ c′₂) | ... | (c_k ⊕ c′_k) = 0,                        (7.7)

where "|" is the bit-wise OR operator. As soon as any checksum c′_i is not equal to its respective old checksum c_i, the verification result is not zero. If the equation holds, then no swap has occurred and the lattice basis reduction terminates. Otherwise, at least two vectors were interchanged and the lattice basis reduction continues.
__global__ void d__genBasisChecksum(INT *A, INT *B){
    __shared__ INT block_xorSum[DIM_BLOCK];
    INT thread_xorSum = 0, thread_xorSumOp;
    int thread_rowBase = __umul24(blockIdx.x, DIM_COL);
    INT thread_patternDigit = threadIdx.x ^ blockIdx.x;

    for (int i = threadIdx.x; i < DIM_COL; i += blockDim.x)
        thread_xorSum ^= A[thread_rowBase + i] & thread_patternDigit;
    block_xorSum[threadIdx.x] = thread_xorSum;

    for (int i = blockDim.x >> 1; i > 0; i >>= 1){
        __syncthreads();
        if (threadIdx.x < i){
            thread_xorSum   = block_xorSum[threadIdx.x];
            thread_xorSumOp = block_xorSum[threadIdx.x + i];
            block_xorSum[threadIdx.x] = thread_xorSum ^ thread_xorSumOp;
        }
    }
    if (threadIdx.x == 0)
        B[blockIdx.x] = block_xorSum[0];
}
Listing 7.10: C for CUDA code to generate the basis checksum
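The verification kernel d__verBasisChecksum() is listed in Table 7.7 but not printed here; a hedged sketch following Equation (7.7) and the same binary tree convention could look like this (our own code; the arrays oldC and newC holding the checksums before and after swapping are assumptions):

// OR-combines the XOR differences of old and new row checksums; a zero
// result means that no row, and hence no basis vector, has changed.
__global__ void d__verBasisChecksum_sketch(INT *oldC, INT *newC, INT *result){
    __shared__ INT block_or[DIM_BLOCK];
    INT thread_or = 0;

    for (int i = threadIdx.x; i < DIM_ROW; i += blockDim.x)
        thread_or |= oldC[i] ^ newC[i];     // nonzero iff row i changed
    block_or[threadIdx.x] = thread_or;

    for (int s = blockDim.x >> 1; s > 0; s >>= 1){  // binary tree OR
        __syncthreads();
        if (threadIdx.x < s)
            block_or[threadIdx.x] |= block_or[threadIdx.x + s];
    }
    if (threadIdx.x == 0)
        *result = block_or[0];
}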
Swap Counting
By a slight modification to the usual and the ordered All-swap kernel it becomes possible to directly count the performed swaps. For this purpose, both kernels are executed only by the single block that processes the first row of the lattice basis. When a thread within this block performs a swap, it increases a dedicated counting variable at the same time. After each thread has finished processing, the counting variables are added up in binary tree fashion. The lattice basis reduction terminates if the sum is zero, otherwise it continues. Clearly, this method, shown in Listing 7.11, is more accurate.
__global__ void d__swapBasis_as_si(INT *A, REAL *B, INT *C,
                                   int phaseOffset){
    __shared__ int block_swapCount[DIM_BLOCK];
    int thread_swapCount = 0;
    INT dedValue, assocValue;
    int thread_rowBase = 0;   // the single executing block processes the first row

    for (int i = (threadIdx.x << 1) + phaseOffset; i + 1 < DIM_COL;
         i += (blockDim.x << 1)){   // stride over disjoint pairs
        dedValue   = A[thread_rowBase + i];
        assocValue = A[thread_rowBase + i + 1];
        if (B[i] > 2 * B[i + 1]){
            A[thread_rowBase + i]     = assocValue;
            A[thread_rowBase + i + 1] = dedValue;
            thread_swapCount++;
        }
    }
    block_swapCount[threadIdx.x] = thread_swapCount;

    for (int i = blockDim.x >> 1; i > 0; i >>= 1){
        __syncthreads();
        if (threadIdx.x < i)
            block_swapCount[threadIdx.x] = block_swapCount[threadIdx.x]
                + block_swapCount[threadIdx.x + i];
    }
    if (threadIdx.x == 0)
        C[0] = block_swapCount[0];
}
Listing 7.11: C for CUDA code for the All-swap phase with swap counting
8. Practical Results and Discussion
In this chapter we present our practical results concerning each respective implementation we made. We will discuss the results in respect of different metrics, including occupancy but mainly the runtime behavior, for different settings involving variations of the lattice basis size and of the number of threads per block. All runtime results indicate what is possible on the CUDA architecture, whether the matrix size is meaningful or not.

Phase I Results  At the beginning we show the runtime results for our sequential CPU implementations of the QR decomposition methods (Fig. 8.1) to offer a brief glimpse of how these methods differ from each other, in particular before they are executed on a GPU later.
[Plot: runtime in seconds versus matrix size (d-by-d, 500 to 3000) for Mod. Gram-Schmidt, Fast Givens Rotations, Std. Givens Rotations, and Std. Householder Reflections on the CPU.]
Figure 8.1.: Runtime for sequential QR decomposition methods on CPU
However, the shown runtimes result from straight C implementations without any optimizations. Obviously, the Givens rotations perform much better, whereas the fast method is barely better than its standard counterpart.
In contrast, Figure 8.2 illustrates the runtime results of our implementations of the QR decomposition methods run on the GPU. For this first benchmark we naively assume that the more threads per block are involved, the higher the gained performance.
[Plot: runtime in seconds versus matrix size (d-by-d, 1000 to 8000) for Std. Householder Reflections and the Givens rotations variants OTPE, OBPR, OBPA, OBPA 2-staged, and OBPA 4-staged on the GPU.]
Figure 8.2.: Runtime for parallel QR decomposition methods on GPU (512
threads)
Each curve ends at the point where the kernel execution is no longer possible due to a failed execution configuration. The reason for this is probably the kernel structure that exceeds the maximum number of registers. For instance, the one-thread-per-entry Givens rotations method involves few registers but a large number of thread blocks. By contrast, the one-block-per-associated-rows 4-staged method involves far fewer thread blocks but correspondingly more registers per thread block.
As expected, the one-thread-per-entry method has the weakest performance because of its large overhead concerning thread creation, loads and storages to the global memory, and the required synchronizations. Nevertheless, it has half the runtime compared to the fast Givens rotations on the CPU. The Householder reflections kernel was only slightly optimized and therefore offers a good comparison between a straight sequential and a straight parallel implementation. The gained
speed-up is obvious (Fig. 8.3). For the other Givens rotations implementations we see that a simply built kernel can achieve a better performance. Comparing the two combined stage methods, the 4-staged method has a slightly weaker performance. We already considered this in Section 7.2.3, and we suggest that the serial effort per thread is greater than the gain in performance from performing fewer global memory storages.
[Plot: speed-up factor (0 to 16) versus matrix size (d-by-d, 500 to 3000) for Householder Reflections, Givens rotations OBPR, and Givens rotations OBPA 2-staged.]
Figure 8.3.: Gained speed up on GPU
The speed-up is stated in relation to the respective fastest implementation on the CPU. The speed-up of the Householder reflections increases linearly, whereas the speed-up of the Givens rotations reaches its maximum at 4, respectively 8 for the fastest Givens rotations method.
Next we show how these GPU implementations react to different numbers of threads per block for a chosen fixed matrix size. The results are depicted in the following Figure 8.4.
Measuring the runtime for the one-thread-per-entry method becomes feasible only when using at least 256 threads per block; otherwise too many blocks are created and the execution fails. The figure clearly shows that none of the methods has its optimal performance with 512 threads; moreover, every method has a different optimal number of threads per block. Maybe this is caused by switching effects of the scheduler. It also shows that the 4-staged method is the fastest method after all. These results are additionally confirmed by other
[Plot: runtime in seconds versus threads per block (32, 64, 128, 256, 512) for Std. Householder Reflections and the Givens rotations variants OTPE, OBPR, OBPA, OBPA 4-staged, and OBPA 2-staged.]
Figure 8.4.: Runtime for parallel QR decomposition methods on GPU for dierent
numbers of threads per block (matrix size 3000-by-3000)
matrix sizes, which are not shown here. Due to this we rerun the benchmark, setting the optimal number of threads per block for each QR decomposition method. The optimal numbers of threads are shown in Table 8.1, the new runtime results in Figure 8.5.
QR decomposition method          optimal block dimension
Householder Reflections          64
Givens rotations OTPE            256
Givens rotations OBPR            256
Givens rotations OBPA            64
Givens rotations OBPA 2-staged   128
Givens rotations OBPA 4-staged   64
Table 8.1.: Optimal block dimension for QR decomposition methods
Kerr, Campbell and Richards recently presented an implementation using the blocked Householder reflections method [KCR09]. Contrary to their opinion that Givens rotations do not optimally fit the CUDA architecture, we present a benchmark (Fig. 8.6) of our one-block-per-associated-rows 4-staged method under exactly the same assumptions that reveals a better runtime. Moreover, we are not bound to particular matrix sizes (multiples of 32) or block dimensions. The optimal block
[Plot: runtime in seconds versus matrix size (d-by-d, 1000 to 8000) for all parallel QR decomposition methods with their optimal block dimensions.]
Figure 8.5.: Runtime for parallel QR decomposition methods on GPU (optimal
block dimension)
dimension is 512. They emphasize two results: 4.43 seconds for a matrix size of 6656 × 3328 and 7.629 seconds for 8192 × 4096. Our implementation requires 3.87 and 7.215 seconds, respectively.
Completing phase I, the runtime results of the transformation kernel are shown in Figure 8.9. Since it has barely any impact on performance, we consolidate the results of this and the other helper kernels. The approximation kernel that approximates the exact integer basis belongs to these helper functions, too.
Phase II Results  We proceed to the size reduction, and therefore we begin with the transposition kernel. Of course this is another helper kernel, and like the approximation and transformation kernels it has no notable impact on performance. The size reduction is divided into two parts, and thus we show both runtimes separately, including the overall runtime of the size reduction, to see which of them is dominant for a certain matrix size. The results are illustrated in Figure 8.7. Referring to Sections 7.3.1 and 7.3.2, we pass on measuring each stage-combined method; thus the measured results relate only to the 4-staged method for the Gram-Schmidt coefficient reduction, respectively the 8-staged method for the basis reduction. The runtime measurements already consider the optimal number of threads per block.
[Plot: runtime in seconds (0 to 8) versus matrix size (d-by-d/2, up to 9000) for the OBPA 4-staged method using a data type in single precision.]
Figure 8.6.: Runtime for OBPA 4-staged method using data type in single
precision
All in all, the size reduction requires less runtime than the orthogonalization, but it is still a crucial factor of the entire lattice basis reduction. For reasons we were not able to determine, both size reduction kernels fail to execute with a matrix size larger than 6000 × 6000.
Phase III Results  The last phase mainly constitutes the invariant part of the lattice basis reduction and does not contribute directly to the basis reduction. Here we are particularly interested in the runtime results of our ordered All-swap proposal compared to the usual All-swap phase. The checksum calculation, if applied, is also a kind of helper function and hence depicted in Figure 8.9. The runtime results for each All-swap method are shown in Figure 8.8.
We cannot state without doubt whether the shown runtime results for the usual All-swap and the blocked ordered All-swap method with block size four are correct, because the measuring was repeated several times and the results were mostly within the measuring inaccuracy. Considering the blocked ordered method for block size eight and higher, the runtime is exactly the same as for the fully ordered method, which means a significant increase in runtime.
[Plot: runtime in seconds (0 to 12) versus matrix size (d-by-d, 1000 to 6000) for the Gram-Schmidt coefficient size reduction, the basis size reduction, and the entire size reduction.]
Figure 8.7.: Runtime for the basis size reduction (optimal block dimension)
Overall Performance and Reduction Properties We are now able to state the overall performance and the reduction properties of our several lattice basis reduction algorithms. To the best of our knowledge, these are the first results of an attempt to implement such an algorithm on a CUDA-based graphics card. For meaningful results we only consider lattice bases from the TU Darmstadt Lattice Challenge¹. The lattice bases are generated according to an approach by Ajtai [?]. Buchmann et al. show how the lattice bases are constructed in detail and prove the existence of short vectors in [?]. The challenge for a given lattice basis is won when one finds a short vector with a Euclidean norm less than a given value. We will use those lattice bases to prove the correctness of our implementations and to expose the reduction quality and the respective runtimes.

¹ Available at http://www.latticechallenge.org
First, we want to verify that our implementations output a valid, reduced lattice basis, with the aid of a small lattice basis with random entries and the challenge lattice of dimension 200. According to Section 2.2, an output lattice basis is valid if the absolute value of the lattice determinant remains unchanged compared to the original lattice basis.
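Such a check is easy to script on the host side. The following is a minimal sketch in Python/NumPy (our own illustration, not part of the CUDA implementation; the names B and B_red are ours), where both bases are given as square matrices whose rows span the lattice:

```python
import numpy as np

def is_valid_reduction(B, B_red, rel_tol=1e-9):
    """Validity check from Section 2.2: the reduction must preserve the
    lattice, i.e. |det(B)| = |det(B_red)| up to floating point error.
    For very large bases, an exact integer determinant is preferable."""
    d = abs(np.linalg.det(np.asarray(B, dtype=float)))
    d_red = abs(np.linalg.det(np.asarray(B_red, dtype=float)))
    return abs(d - d_red) <= rel_tol * max(d, d_red)
```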
Using each All-swap method (the usual, the blocked-ordered, and the fully-ordered) with the same δ′ value, the randomly chosen 10-by-10 lattice basis
Figure 8.8.: Runtime for All-swap methods, δ′ = 2 (optimal block dimension). [Plot: runtime in milliseconds (0–2.5 ×10⁴) versus matrix size d-by-d (1000–6000); curves for the usual All-swap, the blocked ordered All-swap with block-sizes 4, 8, and 16, and the fully ordered All-swap.]
32 30
29 36
45 31

47 7
23 21
B=
2 4
40 48

8 6

29
9
4
4
23
43
40
17
15
38
30
3
35
26
48
38
39
22
36
15
34
37
3
13
29
33
41
7
26
41
26
48
31
38
1
22
11
40
30
35
37
39
20
19
47
5
36
44
10
35
3
24
12
16
38
26
16
13
50
45
28
2
17
0
19
49
25
4
5
9
18
40
1
41
14
48
32

13
48

9

3
13

36
12

is reduced to
 −8
Bred =
3
−6
−6
9
11
−5
4
7
−6
−3
−16
5
−17
2
7
9
−10
8
−1 

 6
−1 −10
3
−20 24
5
−1 −18 −5 


11
−3 −22
4
−8
8
1
−7 −11
 7
 17 14 −3 −1 −4 13 −8 −5 13 11 

.
 2
5
−14
6
11
−4 −1
10
1
21 


0
−13 13
−2
7 
 7 −13 13 −3 12
 −1
7
20
−6
15 −10 19
13
11
7 


−12 −5
9
−12
2
7
−5 −17 −18
8
5
−6 −2
5
7
4
4
−17 −16 −24

Figure 8.9.: Runtime for helper kernels (optimal block dimension). [Plot: runtime in milliseconds (0–45) versus matrix size d-by-d (1000–8000); curves for the kernels d__transformGsCoeff(), d__approximateBasis(), d__trianTransposeGsCoeff(), d__genBasisChecksum(), and d__verBasisChecksum().]
It holds |det(B)| = |det(Bred)| = 3.3774 · 10¹⁴. The quality of the reduction is stated by the orthogonality defect: dft(B) = 115910.2438 and dft(Bred) = 3.6199. Hence, the output reduced lattice basis is valid and short, since the orthogonality defect is close to 1.
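For reference, the orthogonality defect is the product of the basis vector norms divided by |det(B)|, which is at least 1 and equals 1 exactly for an orthogonal basis. A minimal Python/NumPy sketch (the function name is ours), computed in log-space so that larger dimensions do not overflow:

```python
import numpy as np

def orthogonality_defect(B):
    """dft(B) = (product of the row norms) / |det(B)| >= 1."""
    B = np.asarray(B, dtype=float)
    log_prod = np.sum(np.log(np.linalg.norm(B, axis=1)))
    _, log_det = np.linalg.slogdet(B)  # log of |det(B)|
    return float(np.exp(log_prod - log_det))
```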
For the CLB-200 (challenge lattice basis of size 200) we obtain similar reduction properties for the All-swap algorithms using different δ′ values (Tab. 8.2).
Method                      δ′     Determinant     Overall size red.   ‖vs‖₂   Runtime
Blocked-ordered 2^l = 2 *   1.34   2.0589 · 10⁴⁴   280                 19.03   3.37 s
Blocked-ordered 2^l = 4     1.6    2.0589 · 10⁴⁴   178                 18.84   2.11 s
Blocked-ordered 2^l = 8     3.0    2.0589 · 10⁴⁴   75                  21.79   0.89 s

vs is the shortest vector within the lattice basis; ‖vs‖₂ < n = 30 wins CLB-200.
* equivalent to the usual All-swap

Table 8.2.: Properties for each All-swap method after processing CLB-200
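Checking the winning condition from the table is then a one-liner over the reduced basis; a small sketch under the same assumptions as before (hypothetical helper, not thesis code):

```python
import numpy as np

def wins_challenge(B_red, n):
    """A challenge instance is won if the reduced basis contains a vector
    of Euclidean norm strictly below the bound n (n = 30 for CLB-200)."""
    norms = np.linalg.norm(np.asarray(B_red, dtype=float), axis=1)
    return float(norms.min()) < n
```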
Generally, we observe that methods performing more swaps than the usual All-swap require much less runtime while delivering a similar reduction quality. The shortest vector we found with any of our methods has Euclidean length 9.22 (see Appendix A), obtained by applying the blocked-ordered All-swap with block-size 16 and δ′ = 2.55 in a computation time of 63.9 seconds. The CLB-500 challenge can be solved with a similar setup, more precisely block-size 16 and δ′ = 3.05, in a computation time of 104.63 seconds, yielding a vector of Euclidean norm 62.94 (n = 63). Although this is not an optimal value, since it is very close to n, it represents a significant reduction in runtime in favor of our lattice basis reduction.
Note, however, that our implementations are not optimized to find short vectors, since we mainly focus on LLL-reduced (All-swap-reduced) lattice bases in order to compare them with implementations on other platforms. Therefore, the runtimes depicted in Figure 8.10 consider the usual All-swap algorithm (blocked-ordered All-swap with 2^l = 2) with δ′ = 1.34, a value that corresponds to δ = 0.99, which is widely used for sequential algorithms.
Figure 8.10.: Runtime for challenge lattice bases with δ′ = 1.34 (δ′ = 1.6). [Plot: runtime in seconds (0–45) versus challenge lattice basis dimension (200–600); curves for block-size 2 and block-size 4.]
Compared to runtime results from NTL², which involve the Schnorr-Euchner algorithm in double floating point precision using Givens rotations (G_LLL_FP()), our implementation achieves a speed-up of about 12.5 (l = 1) and 18.6 (l = 2), respectively, on average (Fig. 8.11). In general, we observe a rising speed-up factor with higher challenge lattice basis dimensions.

Figure 8.12 depicts the cost efficiency of both sequential lattice basis reduction on a CPU and parallel reduction on a GPU.

² The measurements were done on the same system (see Section 7.1)
Figure 8.11.: Comparison with NTL: runtime. [Plot: runtime in seconds (0–700, left axis) and speed-up factor (0–28, right axis) versus challenge lattice basis dimension (200–600); curves for NTL G_LLL_FP(), block-size 2, block-size 4, speed-up (2), and speed-up (4).]
As a performance criterion we use lattice basis reductions per minute and then correlate it with the costs of the corresponding system. Although the system based on a dedicated graphics card comes at a higher price, the parallel reduction is much more worthwhile.
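As a worked example of the criterion (with hypothetical throughputs, since the exact per-dimension figures are only given graphically): a system completing 10 reductions per minute at 679 Euro scores 10/679 ≈ 0.0147 reductions per minute per Euro, whereas 4 reductions per minute on a 499-Euro system score only 4/499 ≈ 0.0080.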
Unlike its sequential counterpart, a parallel lattice basis reduction algorithm depends on the reduction parameter, here δ′. Therefore, in higher lattice dimensions the probability rises that the parallel algorithm does not terminate for a certain δ′, since this value is crucial for the swaps to be performed and hence for the runtime. There are δ′ values, strongly dependent on the given lattice, that can potentially lead to an endless loop, even though smaller values may still make the algorithm terminate with the desired results. Using higher values, the algorithm terminates with higher probability. For instance, with δ′ = 2 the usual All-swap algorithm can handle higher lattice dimensions, as shown in Figure 8.13.
To reach optimal results, no matter whether this means finding a short basis or a short vector, one has to fine-tune the δ′ value. Generally, a smaller value leads to better results for any All-swap algorithm, but for methods that perform more swaps, i.e., with larger block-size, the δ′ value has to be increased accordingly. However, the practical tests have shown that there is also a lower limit below which each method tends to output worse results. In this case one should proceed with a higher block-size and a correspondingly higher δ′.
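To make the role of the swap threshold concrete, the following schematic sketch (Python/NumPy, sequential code illustrating the parallel round structure; it uses the textbook Lovász condition with parameter delta as a stand-in for the δ′ threshold and omits the size reduction between rounds) shows how an iteration cap guards against the non-termination discussed above:

```python
import numpy as np

def all_swap_sketch(B, delta=0.99, max_rounds=1000):
    """Per round, test all disjoint neighbour pairs (alternating even/odd
    offset) against the Lovász condition at once and swap every violating
    pair; the pairs are independent, which is what makes the phase
    parallelizable. The round cap guards against an endless loop."""
    B = np.asarray(B, dtype=float).copy()
    n, quiet = B.shape[0], 0
    for r in range(max_rounds):
        R = np.linalg.qr(B.T, mode='r')    # R[i, i] = +/- ||b_i*||
        swapped = False
        for i in range(r % 2, n - 1, 2):   # disjoint neighbour pairs
            mu = R[i, i + 1] / R[i, i]
            if delta * R[i, i]**2 > R[i + 1, i + 1]**2 + mu**2 * R[i, i]**2:
                B[[i, i + 1]] = B[[i + 1, i]]
                swapped = True
        quiet = 0 if swapped else quiet + 1
        if quiet == 2:                     # both parities swap-free: done
            return B, True
    return B, False                        # no convergence: retune threshold
```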
Figure 8.12.: Comparison with NTL: cost efficiency. [Plot: reductions per minute per Euro (0–0.06) versus challenge lattice basis dimension (200–600); curves for NTL on Sys. Config 1 and for block-size 2 and block-size 4 on Sys. Config 2.]
Sys. Config 1: 4 GB DDR-3, Intel Core 2 Quad Q9550, onboard graphics card, 499 Euro
Sys. Config 2: 4 GB DDR-3, Intel Celeron S 430, nVidia GeForce GTX285, 679 Euro
Figure 8.13.: Runtime for challenge lattice bases with δ′ = 2. [Plot: runtime in seconds (0–250) versus challenge lattice basis dimension (200–1000).]
9. Conclusion and Future Work
Our proposed approach of implementing the lattice basis reduction on a parallel system in the shape of a graphics card holds advantages as well as disadvantages. We achieved satisfying results concerning the QR decomposition, our phase I. While previous works on that subject mainly suggested the Householder reflections to be the best choice, we have shown that Givens rotations are at least coequal. Moreover, our implementation offers a better runtime than the blocked Householder reflections implementation by Kerr et al. [KCR09], which constitutes the highest performance we are aware of.
For the size reduction, phase II, we followed our own approach and split this component into two parts for performance reasons. By pre-calculating the nearest-integer versions of the Gram-Schmidt coefficients we were able to size-reduce the basis in a few iterations. However, there is no similar implementation to compare against, and since the size reduction consumes considerable runtime, we believe there is room for further improvements.
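To illustrate the idea (a sketch of the scheme only, in Python/NumPy with our own naming, not the kernel code): round all Gram-Schmidt coefficients to nearest integers up front, apply the resulting integer row operations in one pass, and repeat the pass until every coefficient satisfies |mu| <= 1/2.

```python
import numpy as np

def size_reduce_pass(B):
    """One pass: pre-round all Gram-Schmidt coefficients mu[k, j] (j < k),
    then apply the integer row operations b_k -= round(mu[k, j]) * b_j.
    Repeating this pass a few times yields a size-reduced basis."""
    B = np.asarray(B, dtype=np.int64).copy()
    R = np.linalg.qr(B.T.astype(float), mode='r')
    mu = (R / np.diag(R)[:, None]).T          # mu[k, j] = R[j, k] / R[j, j]
    r = np.rint(np.tril(mu, -1)).astype(np.int64)
    for k in range(B.shape[0] - 1, 0, -1):
        for j in range(k - 1, -1, -1):
            B[k] -= r[k, j] * B[j]
    return B

# repeat until all |mu| <= 1/2, e.g.:
# for _ in range(4): B = size_reduce_pass(B)
```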
In phase III we implemented our ordered All-swap approach and showed by its runtime that it is definitely worthwhile on this architecture. It delivers better reduction results with decreased runtime with respect to challenge lattice bases. Our implementation can also be used to find short vectors, however at the cost of higher runtime. Still, our implementation requires some trials to find an optimal δ′ value. We deliberately refrain from excessively comparing our overall performance with other implementations, since this implementation is completely different and, from our point of view, not directly comparable. Moreover, runtime results of the NTL, for instance, are adequately documented and available in many related works. Instead, we suggest this work to be the initial attempt to implement the lattice basis reduction on massively parallel systems. Nevertheless, to provide a glimpse of the performance, our comparison with the NTL revealed a speed-up of about 18.5.
Disadvantageously, the CUDA architecture offers at most double precision floating point arithmetic. Accuracy is an important factor for the orthogonalization. Our implementations theoretically work with dimensions up to 6000, but it is questionable whether the basis reduction delivers useful outcomes there. With a higher-precision floating point arithmetic it would become possible to reduce either lattice bases of very high dimension or lattice bases consisting of large entries, which is, as of now, restricted by the current GPU generation. Thus, OpenCL could address this problem by introducing quadruple precision floating point arithmetic in a future version.
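As an aside, roughly quadruple precision can already be emulated in software with pairs of doubles via error-free transformations; a minimal sketch (the standard Knuth two-sum, independent of the thesis code):

```python
def two_sum(a, b):
    """Error-free transformation (Knuth): returns (s, e) with s = fl(a + b)
    and a + b = s + e exactly; the building block of double-double formats."""
    s = a + b
    t = s - a
    e = (a - (s - t)) + (b - t)
    return s, e

def dd_add(x, y):
    """Add two double-double numbers x = (hi, lo) and y = (hi, lo)."""
    s, e = two_sum(x[0], y[0])
    e += x[1] + y[1]
    return two_sum(s, e)  # renormalize to (hi, lo)
```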
A. Short Vectors
Short vector found in challenge lattice basis of size 200 (cf. Chapter 8):
[-2  1  1  0  0 -2 -1  0  1  1  0  1  0  1  0 -1  0  0 -1  1
  0 -2 -1  0  1  0  0  0 -1 -2  0 -1 -1  0  0 -1  1 -1 -1  0
  1  0 -1  0 -1 -2  0  0 -1 -2  0 -2 -2 -1  0 -2  0  1  1 -1
 -1  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
  0  0  0  0  0  0  0  0  0  0  1  0  1 -1  1 -1  0 -1 -1  0
  0  0  0  1  0  0  0  2  1  0  0  1  0  0  1  0  2  1  0 -1]
List of Figures

2.1. 2-dimensional lattice Z² with distinct bases
2.2. Volume of the basis mesh (parallelepiped)
2.3. Successive minima λ₁ = ‖b₂‖₂, λ₂ = ‖b₁‖₂
3.1. Illustrative Gram-Schmidt process in Z²
3.2. Rotation of a vector x by θ radians in Z²
3.3. Reflection of a vector x in Z²
5.1. Amdahl's law for different P
5.2. Gustafson's law for different P
5.3. Thread hierarchy of CUDA programming model
5.4. CUDA memory model
5.5. Heterogeneous CUDA program
5.6. Examples of global memory access patterns
6.1. Parallel pattern for the Givens rotations
6.2. 10-by-7 matrix with parallel pattern for the Givens rotations
6.3. Parallel pattern for the Householder reflections
6.4. Generalized parallel pattern for the size reduction
6.5. Generalized parallel pattern compared to the usually suggested pattern
7.1. Exemplary computation of a scalar product in CUDA
7.2. Saved storages with 2-staged, respectively 4-staged method
7.3. Saved storages with 4-staged method
7.4. Size reduction for basis vector b_k
7.5. Insertion sort implying Lovász condition as key
7.6. Binary tree fashioned comparison assuming current position i = 8 and #threads per block = 8
7.7. Checksum generation on basis vector b_i and integer vector r_i
8.1. Runtime for sequential QR decomposition methods on CPU
8.2. Runtime for parallel QR decomposition methods on GPU (512 threads)
8.3. Gained speed-up on GPU
8.4. Runtime for parallel QR decomposition methods on GPU for different numbers of threads per block (matrix size 3000-by-3000)
8.5. Runtime for parallel QR decomposition methods on GPU (optimal block dimension)
8.6. Runtime for OBPA 4-staged method using data type in single precision
8.7. Runtime for the basis size reduction (optimal block dimension)
8.8. Runtime for All-swap methods, δ′ = 2 (optimal block dimension)
8.9. Runtime for helper kernels (optimal block dimension)
8.10. Runtime for challenge lattice bases with δ′ = 1.34 (δ′ = 1.6)
8.11. Comparison with NTL: runtime
8.12. Comparison with NTL: cost efficiency
8.13. Runtime for challenge lattice bases with δ′ = 2
List of Tables

5.1. GTX 280 Resources
5.2. Features of device memory
5.3. Latency for memory actions on the device
7.1. Implemented helper functions for basis orthogonalization
7.2. Implemented QR decomposition methods on host
7.3. Implemented functions for Householder reflections
7.4. Implemented functions for Givens rotations I
7.5. Implemented functions for Givens rotations II
7.6. Implemented functions for size reduction
7.7. Implemented functions for basis permutation
8.1. Optimal block dimension for QR decomposition methods
8.2. Properties for each All-swap method after processing CLB-200
List of Algorithms

3.1. Classical Gram-Schmidt (CGS)
3.2. Modified Gram-Schmidt (MGS)
3.3. Computation of c and s
3.4. Standard Givens Rotations
3.5. New Computation of c and s, Update of D
3.6. Fast Givens Rotations
3.7. Computation of the Householder Vector
3.8. Householder Reflections
3.9. Computation of W and Y
3.10. Blocked Householder Reflections
4.1. Basis Orthogonalization
4.2. Basis Size Reduction
4.3. LLL-Algorithm
4.4. Deep Insertion
4.5. Floating Point LLL-Algorithm (SE-Algorithm)
6.1. Parallel Standard Givens Rotations
6.2. Parallel Householder Reflections
6.3. Parallel Basis Orthogonalization
6.4. Parallel Basis Size Reduction 1
6.5. Parallel Basis Size Reduction 2
6.6. All-Swap Lattice Basis Reduction
6.7. Blocked Ordered All-Swap Lattice Basis Reduction
6.8. Fully Ordered All-Swap Lattice Basis Reduction
Bibliography

[AD97] M. Ajtai and C. Dwork. A public-key cryptosystem with worst-case/average-case equivalence. In STOC '97: Proceedings of the Twenty-Ninth Annual ACM Symposium on Theory of Computing, pages 284–293, New York, NY, USA, 1997. ACM.

[ATI] ATI/AMD. ATI Stream SDK. http://developer.amd.com/gpu/ATIStreamSDK/Pages/default.aspx.

[BvL87] C. Bischof and C.F. van Loan. The WY representation for products of Householder matrices. SIAM J. Sci. Stat. Comput., 8(1):2–13, 1987.

[BW09] W. Backes and S. Wetzel. A Parallel LLL using POSIX Threads. Technical report, Dept. of Computer Science, Stevens Institute of Technology, 2009. DIMACS Technical Report 2008-12.

[CS97] D. Coppersmith and A. Shamir. Lattice Attacks on NTRU. Advances in Cryptology – EUROCRYPT '97, LNCS 1233, 1997.

[Cut09] B. Cutler. The Traditional Graphics Pipeline, 2009. http://www.cs.rpi.edu/~cutler/classes/advancedgraphics/S09/lectures/15_Graphics_Pipeline.pdf.

[Dag09] Ö. Dagdelen. Parallelisierung von Gitterbasisreduktionen. Master's thesis, Technische Universität Darmstadt, 2009.

[Gen73] M. Gentleman. Least Squares Computations by Givens Transformations Without Square Roots. J. Inst. Math. Appl., 12:329–336, 1973.

[GGH96] O. Goldreich, S. Goldwasser, and S. Halevi. Collision-Free Hashing from Lattice Problems, 1996.

[GHM04] J.E. Gentle, W. Härdle, and Y. Mori. Handbook of Computational Statistics. Springer-Verlag Berlin Heidelberg, 2004.

[Gro09] Khronos Group. The OpenCL Specification, 2009.

[GS91] J. Götze and U. Schwiegelshohn. A square root and division free Givens rotation for solving least squares problems on systolic arrays. SIAM J. Sci. Stat. Comput., 12(4):800–807, 1991.

[GvL96] G.H. Golub and C.F. van Loan. Matrix Computations. The Johns Hopkins University Press, third edition, 1996.

[Hec95] C. Heckler. Automatische Parallelisierung und parallele Gitterbasisreduktion. PhD thesis, Universität des Saarlandes, Saarbrücken, 1995.

[Hin04] M.J. Hinek. Lattice Attacks in Cryptography: A Partial Overview. Technical report, School of Computer Science, University of Waterloo, 2004.

[HPS98] J. Hoffstein, J. Pipher, and J.H. Silverman. NTRU: A Ring-Based Public Key Cryptosystem. In Lecture Notes in Computer Science, pages 267–288. Springer-Verlag, 1998.

[HT93] C. Heckler and L. Thiele. Parallel Complexity of Lattice Basis Reduction and a Floating-Point Parallel Algorithm. In PARLE, pages 744–747, 1993.

[JS94] A. Joux and J. Stern. Lattice Reduction: a Toolbox for the Cryptanalyst. Journal of Cryptology, 11:161–185, 1994.

[KCR09] A. Kerr, D. Campbell, and M. Richards. QR Decomposition on GPUs. Technical report, Georgia Institute of Technology, Georgia Tech Research Institute, 2009.

[Koy04] H. Koy. Primale/duale Segment-Reduktion von Gitterbasen, 2004. Universität Frankfurt.

[KS01] H. Koy and C.P. Schnorr. Segment LLL-Reduction of Lattice Bases, 2001.

[LLL82] A.K. Lenstra, H.W. Lenstra, and L. Lovász. Factoring polynomials with rational coefficients. Mathematische Annalen, 261(4):515–534, 1982.

[May08] A. May. Public Key Kryptanalyse, 2008. Lecture notes, Ruhr-Universität Bochum, summer term 2008.

[MMI+78] R.C. Merkle and M.E. Hellman. Hiding information and signatures in trapdoor knapsacks. IEEE Transactions on Information Theory, 24:525–530, 1978.

[nVi09a] nVidia. NVIDIA CUDA C Programming Best Practices Guide, 2009.

[nVi09b] nVidia. NVIDIA CUDA Programming Guide, 2009.

[Par71] B.N. Parlett. Analysis of Algorithms for Reflections in Bisectors. SIAM Review, 13(2):197–208, 1971.

[RSA78] R.L. Rivest, A. Shamir, and L. Adleman. A Method for Obtaining Digital Signatures and Public-Key Cryptosystems. Communications of the ACM, 21:120–126, 1978.

[Sch87] C.P. Schnorr. A hierarchy of polynomial time lattice basis reduction algorithms. Theor. Comput. Sci., 53(2-3):201–224, 1987.

[Sch02] C.P. Schnorr. Lattice Reduction by Random Sampling and Birthday Methods, 2002.

[Sch06] C.P. Schnorr. Gitter und Kryptographie, 2006. Lecture at Johann Wolfgang Goethe-Universität Frankfurt/Main, winter term 2005/2006.

[SE94] C.P. Schnorr and M. Euchner. Lattice basis reduction: improved practical algorithms and solving subset sum problems. Math. Program., 66(2):181–199, 1994.

[Sho] V. Shoup. NTL: A Library for doing Number Theory. http://www.shoup.net/ntl/.

[vEB81] P. van Emde Boas. Another NP-complete partition problem and the complexity of computing short vectors in a lattice. Technical Report 81-04, 1981.

[Vil92] G. Villard. Parallel lattice basis reduction. In ISSAC '92: Papers from the International Symposium on Symbolic and Algebraic Computation, pages 269–277, New York, NY, USA, 1992. ACM.

[Wet98] S. Wetzel. An Efficient Parallel Block-Reduction Algorithm. In ANTS-III: Proceedings of the Third International Symposium on Algorithmic Number Theory, pages 323–337, London, UK, 1998. Springer-Verlag.

[Wie94] K. Wiese. Parallelisierung von LLL-Algorithmen zur Gitterbasisreduktionen. Master's thesis, Universität des Saarlandes, 1994.