Parallel Solution of Sparse Linear Systems Defined Over GF(p)

D. Page

Department of Computer Science, University of Bristol,
Merchant Venturers Building, Woodland Road,
Bristol, BS8 1UB, United Kingdom.
[email protected]
Abstract. The second stage of processing in the Number Field Sieve (NFS) and Function Field Sieve (FFS) algorithms, when applied to discrete logarithm problems, requires the solution of a large, sparse linear system defined over a finite field. This operation, although well studied in theory, presents a practical bottleneck that can limit the size of problem either algorithm can tackle. In order to partially bridge this gap between theory and practice, we investigate and develop a fast, scalable implementation of the Lanczos algorithm over GF(p) that can solve such systems in parallel using workstation clusters and dedicated Beowulf clusters.
1 Introduction
The security of modern public-key cryptography is usually based on the presumed hardness of problems such as factoring integers or computing discrete logarithms. The Number Field Sieve [19] (NFS) and Function Field Sieve [1] (FFS) offer two examples of algorithms that can attack these problems. Such algorithms are generally specified in two phases. The first phase, sometimes called the sieving step, aims to collect many relations that represent small items of information about the problem one is trying to solve. This phase is easy to parallelise since one can generate the relations independently. It is therefore attractive for distributed, Internet based collaborative computation [26]. The second phase of processing, sometimes called the matrix step, aims to collect the relations and combine them into a single linear system which, when solved, allows one to efficiently compute answers to the original problem. Efficient implementation of the matrix step is challenging since the linear system is generally very large, and usually represents a practical bottleneck even though it is often overlooked in theoretical discussion. However, both the NFS and FFS allow one to balance the computational effort between the sieving and matrix steps in the sense that one can trade off work in one phase for work in the other. It is therefore common for systems to be parameterised such that the work required in the matrix step is closely matched to the available computational resources. Under this assumption, a more efficient implementation of the matrix step will allow us both to reduce the work in the sieving step and significantly accelerate the overall computation.
The solution of linear systems is a well studied topic since efficient methods are required in many different applications [9, 2]. Solving small linear systems is a trivial task thanks to methods such as Gaussian Elimination that mechanically reduce the problem. Ultimately, one arrives at a form where it is possible to substitute known variables into all equations so as to solve the entire system. However, constructing this form requires that most of the matrix is populated with non-zero entries at some point. In small systems this is not a problem but with large systems such as those produced by the NFS and FFS, holding the matrix in memory is infeasible due to the physical size of such an object. Therefore iterative methods that make successively more accurate approximations of the answer are usually preferred. The Lanczos [18] and Wiedemann [30] algorithms represent two such methods that can solve sparse linear systems efficiently in both time and space.
Current expositions of the matrix step are either abstract in their description of processor architecture [6, 31] and hence do not present results of actual implementation; are biased toward the characteristic two case used in integer factorisation systems like the NFS [17, 16, 22, 23]; or do not consider parallel computation [14]. Our aim is to partially fill this gap in the literature by providing concrete experience of implementing and using a parallel Lanczos implementation for systems over GF(p) using Linux based workstation clusters and dedicated Beowulf clusters. It is important to note that throughout the paper we use GF(p) to denote a finite field where p is a large prime integer of up to roughly 1000 bits in length.
We set this work in the context of real datasets generated by the sieving step of an FFS implementation [12]. The implementation was used to attack discrete logarithm problems (DLP) posed in finite fields of characteristic three. As a result, the sieving produces a large, sparse linear system defined over GF(p) which, when solved, is used to compute answers to the original problem. Since the solution of this sort of system is not well understood in terms of performance, we take a pessimistic approach and opt for long-term scalability over short-term optimisation in our design choices. This guarantees we can investigate and understand the properties of large systems before making incremental improvements that match the specific requirements of a given case.
The paper is organised as follows. Section 2 describes the Lanczos algorithm and the minor alterations required so it can solve linear systems generated by the FFS. We describe the details of our implementation in Section 3, focusing on issues of efficient parallelisation and field arithmetic; matrix representation; and pre-processing phases. We then present some experimental results in Section 4 that describe the performance of our system when used to solve three test cases. Finally, we offer some concluding remarks and highlight areas for further work in Section 5.
Algorithm 1: A description of the Lanczos algorithm suitable for solving linear systems produced by the sieving stage of typical FFS implementations.

Input: An n × m matrix M with null-space of dimension greater than zero, and a random m element vector r.
Output: The vector y such that M y = 0 and y is non-zero, or ⊥ if the algorithm fails to find a solution.

    w_0 ← M^T (M r)
    w_1 ← w_0
    v_2 ← M^T (M w_1)
    t_0 ← (v_2, v_2)
    t_1 ← (v_2, w_1)
    t_4 ← (w_1, w_0)
    w_2 ← v_2 − (t_0/t_1) w_1
    x ← (t_4/t_1) w_1
    repeat
        v_3 ← M^T (M w_2)
        t_0 ← (v_3, v_3)
        t_1 ← (v_3, w_2)
        t_2 ← (v_3, v_2)
        t_3 ← (w_1, v_2)
        t_4 ← (w_2, w_0)
        if t_1 = 0 or t_3 = 0 then
            return ⊥
        w_3 ← v_3 − (t_0/t_1) w_2 − (t_2/t_3) w_1
        x ← x + (t_4/t_1) w_2
        w_1 ← w_2
        w_2 ← w_3
        v_2 ← v_3
    until w_2 = 0
    y ← x − r
    if y = 0 then
        return ⊥
    else
        return y

2 The Lanczos Algorithm
We are given an n × m matrix M with m > n that represents a linear system where columns in the matrix represent variables and rows represent equations. Our aim is to produce a non-zero vector y such that

    M y = 0.

Note that we assume that the null-space of M has dimension greater than zero, i.e. that such a solution does exist. The Lanczos algorithm demands that this problem be stated in the form

    M x = w

where w is non-zero and M is both square and symmetrical. Dealing with the first constraint of the algorithm, that w is non-zero, is straightforward since we can simply take a random vector r, compute

    w = M r,

and solve

    M x = w

for x before calculating the required result as y = x − r. The second constraint, that M is square and symmetrical, is satisfied by multiplying both sides of the original equation by M^T. This ensures the composite matrix is of the required form and transforms our original problem into

    M^T (M x) = M^T w.

However, since M^T M is of size m × m and probably very dense, it is unrealistic to assume we can hold it in memory. We therefore view this as a calculated value, holding only sparse matrix data in memory.
When presented with the problem in the correct form, the Lanczos algorithm uses a set of recurrence equations to iteratively produce a solution. Following the description of LaMacchia and Odlyzko [17], let

    w_0 = w
    v_1 = M^T M w_0
    w_1 = v_1 − ((v_1, v_1) / (w_0, v_1)) w_0

and then for i ≥ 1

    v_{i+1} = M^T M w_i
    w_{i+1} = v_{i+1} − ((v_{i+1}, v_{i+1}) / (w_i, v_{i+1})) w_i − ((v_{i+1}, v_i) / (w_{i−1}, v_i)) w_{i−1}.

The algorithm terminates when it finds some w_j that is self-conjugate, that is when (w_j, M^T M w_j) = 0. Unless the algorithm fails to find a solution, it completes with j less than or equal to the number of variables in the system. That is, due to the dimension of the Krylov space being close to m we get a solution after O(m) iterations. Hence, if w_j = 0 the solution is recovered as

    x = Σ_{i=0}^{j−1} s_i w_i   where   s_i = (w_i, w) / (w_i, v_{i+1}).
There are two reasons why the algorithm could fail to find a solution even if one does exist. Firstly, the algorithm could encounter zero values of either t_1 or t_3 which it cannot invert. The probability of this happening is 1/p. Secondly, there is a chance that the algorithm could compute an x so that y = x − r = 0. That is, the algorithm calculates a trivial result. In the event that one of these cases does occur, one is forced to select a new random vector r and restart the algorithm from the beginning. Clearly this is undesirable but for large values of p, it does not present a practical problem since the chance of either event occurring is very small.
Considering the augmented description of the original Lanczos method shown in Algorithm 1, after initialisation we find three main computational phases that are repeated until a solution is found:

Phase 1: Firstly we compute a matrix-vector product which, due to the size of the matrix, usually represents the most costly operation in the algorithm.
Phase 2: Following the matrix-vector product, we compute a number of vector-vector inner products.
Phase 3: Finally, we update our working vectors using a number of vector-scalar products and vector-vector additions.

3 Implementation Details
Many of our design decisions will be affected by the characteristics of the source linear system which, from here on, we interchangeably describe as the matrix. It is therefore important to understand these characteristics so that we may use suitable data structures that efficiently capture and operate on the matrix in our physical implementation. Figure 1 shows a visualisation of an example problem where the darker regions of the image represent more dense areas of the matrix. One can clearly see features such as the vertical banding effect, due to there being more small elements than large in the factor base, and the mirror type product of the rational and algebraic factor bases that create two similar regions to the left and right of centre.

Fig. 1. Visualisation of a matrix representing a linear system as produced by the FFS. Note that darker regions of the image represent more dense areas of the matrix, i.e. those areas with more non-zero elements.
3.1 Arithmetic and Communication
Parallelisation. There are several ways one could distribute data among the processors in order to operate the Lanczos algorithm in parallel. For example, if we know about the processor topology we might select a blocked allocation of the vector and matrix data onto the processors so that communication between close processors working on related data is very quick. However, without knowledge of the underlying topology, such distributions are hard to implement efficiently. We therefore opted to follow the description of Yang and Brent [31] in making the starting assumption that every processor is going to compute part of the result vector. In order to do this, each processor must hold a portion of each vector and also a region of the matrix. Since the working data is effectively scattered among the processors, there must be communication to allow their calculation of combined or shared results.
We used the MPICH [13] implementation of MPI [25] to pass data between the processor nodes. We currently view the processor network as an abstract object, employing no topology-specific communication, although doing so would probably improve performance and hence warrants further investigation.
Figure 2 illustrates parallelisation of matrix-vector multiplication, the major operation within the Lanczos algorithm. Reading the diagram from left to right, we see that each of the four processors holds a sub-section of the vector and the corresponding sub-section of the matrix required to compute the partial multiplication result. These partial results are then summed into a single result vector which is redistributed ready for the next operation. Note that since we are required to perform a normal matrix-vector product followed by a transposed matrix-vector product, the operation is split into two phases, each of which completes in a similar manner.

Fig. 2. A crude attempt to demonstrate the operations within Algorithm 1 when computing the step v_3 ← M^T (M w_2) in parallel with P = 4 processors each holding a portion of the vector and matrix. Each of the two phases ends with a collective "communicate and sum" step.

Other than matrix-vector multiplication, the only other global calculation is that of the vector inner product. Since the vectors are distributed in sections among the processors, each processor calculates a local result; these local results are collectively summed to form the global result used subsequently in the vector update phase.
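To make this communication pattern concrete, the following sketch shows how the distributed inner product might be realised with MPI. It is illustrative only: the names are our own rather than those of our actual implementation, and field elements are assumed to fit a single 64-bit word with p small enough that P·p cannot overflow 64 bits, whereas the real system uses multi-precision values of up to roughly 1000 bits that must be marshalled into byte buffers before communication.

    #include <mpi.h>
    #include <cstdint>
    #include <vector>

    // Sketch: distributed inner product (v, w) mod p. Each processor holds
    // only a section of each vector and computes a local partial result
    // (Phase 2 of the algorithm); the partial results are then summed
    // collectively so that every processor holds the global value ready
    // for the vector update phase (Phase 3).
    uint64_t distributed_inner_product(const std::vector<uint64_t>& v_local,
                                       const std::vector<uint64_t>& w_local,
                                       uint64_t p) {
      uint64_t local = 0;
      for (size_t i = 0; i < v_local.size(); i++)
        local = (local + uint64_t((unsigned __int128)v_local[i] * w_local[i] % p)) % p;

      // Combine the P partial sums; MPI performs the reduction with an
      // efficient binary tree internally.
      uint64_t global = 0;
      MPI_Allreduce(&local, &global, 1, MPI_UINT64_T, MPI_SUM, MPI_COMM_WORLD);
      return global % p;
    }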
3.2 Matrix Representation

As a starting point, let ω denote the weight of an input matrix M. That is, ω is the number of non-zero entries in the matrix:

    ω = Σ_{i,j} m_{i,j}   where   m_{i,j} = 0 if M_{i,j} = 0, and m_{i,j} = 1 if M_{i,j} ≠ 0.

We assume the average proportion of non-zero entries in a given row or column is roughly 0.2%–2% and use β to denote the average number of non-zeros in a row, i.e. the average weight of a given row. Entries in the matrix represent coefficients of the variables that, although members of GF(p), are guaranteed to have small positive or negative magnitude of roughly ±50 due to the way that FFS sieving works. The large majority of these entries are expected to be ±1. We therefore hold the coefficients as single-precision, 32-bit integers rather than multi-precision members of GF(p), an approach that saves significant amounts of memory.
We hold the matrix in compressed row format where each row is represented by linear arrays of values and column indexes for the non-zero entries; see [10] for a thorough treatment of this sort of technique. We call the tuple of value and index arrays that hold non-zero entries for a given row a row list and an entry in the matrix, specified by a single row and column index, a cell. Due to the exceptionally sparse nature of our matrices, using a compressed row format offers a massive saving in terms of space since only the non-zero entries consume memory. Since we are interested in investigating the size of systems we can process using our implementation, we save additional space by ignoring the potential to hold two copies of the matrix for normal and transposed access. Instead we hold only one copy of the matrix and rely on the performance of row and column-wise access to this structure. Indeed, once all the cells are loaded into the memory resident row lists, performance of operations involving the matrix is significantly influenced by how quickly we can locate the value associated with a given cell.

Fig. 3. Using cell hints to represent the matrix as a low-resolution bit-set vastly improves performance by allowing very fast searching within the matrix and presenting more manageable memory access characteristics. For example, the matrix row

    Matrix Index:   0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
    Matrix Value:   1 1 0 0 0 0 0 0 0 0  0 -1  0  2  0  0

is held as

    Row List Index: 0 1 11 13
    Row List Value: 1 1 -1  2
    Row List Hints: 1 0  1  1

Note that in this case, four cells are represented by one bit so we say the hints have a resolution of four.
Speed Oriented Improvements. We can quickly locate the correct row for a given cell since the rows are stored in a sequential, ordered manner. However, locating the required column entry requires that we search the array of non-zero indexes within that row list. This approach presents a performance hurdle for two main reasons. Firstly, we will often search row lists when the required column is not present. Even if we maintain an ordering within the row lists, this searching operation consumes a vast amount of time when trying to locate a specific cell or cells. Secondly, since the matrix is large the searching operation highlighted above is additionally costly due to the nature of the memory sub-system in most processors. The matrix is far too large to fit into cache memory and since we are effectively streaming through our row lists, using each only once, there will be very little reference locality and thus negligible benefit from having a cache present at all. Indeed, the matrix may be too large to fit entirely into physical memory which presents a further burden to the virtual memory mechanism.

Fig. 4. A graph comparing column look-up speed with different sized cell hint structures (155 k, 312 k and 625 k of cell hints) against a naive search method in a matrix of 10000 × 10000 elements using a 1 GHz Pentium III processor. Execution time in microseconds is plotted against the number of non-zeros per column.
Fortunately, we can reduce the impact of these problems by constructing a second data structure to accelerate searching for cells in the matrix [10]. By building a bit-set that holds a low-resolution representation of the matrix cells, locating zero entries is vastly accelerated to a single bit-test as demonstrated by Figure 3. Locating non-zero cells is made far more efficient by only searching the row list when there is a high probability of finding the target column: when the bit-set shows a zero for a given element we know it is definitely not there; if it contains a one the element might be there and we need to search. Clearly the resolution of our bit-set and the sparsity patterns of our matrix affect this probability, with lower resolutions offering less certainty about their predictions. However, we can adjust the resolution in a flexible, per-matrix manner so as to ensure good performance while using very little memory. In fact, the bit-sets can be made small enough that they fit into, and are retained in, cache memory.
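The following sketch illustrates how such a look-up might work; the RowList layout, the names and the use of binary search are our own illustrative choices (the text above only requires that row lists be ordered), not a transcription of the actual implementation.

    #include <algorithm>
    #include <cstdint>
    #include <vector>

    // Illustrative sketch of a row list with cell hints.
    struct RowList {
      std::vector<uint32_t> index;  // ordered column indexes of non-zero cells
      std::vector<int32_t>  value;  // small single-precision coefficients
      std::vector<uint8_t>  hints;  // low-resolution bit-set over the columns
    };

    constexpr uint32_t RESOLUTION = 4;  // columns represented per hint bit

    // Return the coefficient at column j of this row, or 0 if the cell is zero.
    int32_t lookup(const RowList& row, uint32_t j) {
      // One test against the cache-resident hints filters out most zero
      // cells: a clear bit means the cell is definitely zero, a set bit
      // means it might be present and the row list must be searched.
      uint32_t bit = j / RESOLUTION;
      if ((row.hints[bit / 8] & (1u << (bit % 8))) == 0)
        return 0;

      auto it = std::lower_bound(row.index.begin(), row.index.end(), j);
      if (it != row.index.end() && *it == j)
        return row.value[it - row.index.begin()];
      return 0;
    }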
Figure 4 compares different sized cell hint structures versus a naive method of searching the row lists. For a typical matrix size of 10000 × 10000 having rows and columns of varied average weight, retrieving a row obviously costs almost nothing since the matrix is stored in row-wise format. However, retrieving a column is clearly significantly faster using the cell hints for as long as the structure does not become over-polluted, i.e. the number of columns in the matrix represented by each bit in the cell hint is sufficiently small. Furthermore, this is possible using structures that can easily fit into the level-two cache of most processors, a fact we can ensure by tuning the size used on a per-matrix basis.
Fig. 5. With P = 4 processors, each need only hold a sub-region of the matrix cells since they will use only those in computing the local portions of a collective matrix-vector product.

Space Oriented Improvements. As well as improving over naive methods of searching the matrix, we can exploit the parallelisation method described in Section 3.1 to reduce the amount of memory used. Recall that the main computational task is to compute a matrix-vector product of the form

    c_i = Σ_j A_{i,j} · b_j

where i and j range over the dimensions of the matrix A. Note that this operation can obviously be restructured depending on whether we are dealing with normal or transposed matrix access.

In parallelising this operation, we have tasked each processor with computing part of the result so that l_l ≤ i ≤ l_h for some limits l_l and l_h. Because of this, each processor will only ever use a sub-region of the matrix in order to compute a local result: it is wasteful for every processor to hold the entire matrix in memory. To save memory and hence improve our ability to process large matrices, each processor only holds the matrix sections required to compute local matrix-vector products. Note that this requires both normal and transposed access to a single matrix and hence the data held is only that required for these operations. Figure 5 demonstrates this for P = 4 processors where the regions loaded are shown by solid blocks hewn from the encompassing dashed matrix. Note that this saving in space can also result in a performance improvement since it implies less pollution of the cell hint information and shorter row lists to search through.
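A sketch of the resulting local computation appears below, reusing the RowList structure from the earlier sketch and the same simplifying assumption that field elements fit one 64-bit word; the real implementation instead accumulates multi-precision values. Each processor would run this over only the rows it holds and then combine the partial results collectively (for example with MPI_Allgatherv) so that every node again holds a full vector.

    // Sketch: the local part of c = M b, restricted to the rows this
    // processor holds. Only the non-zero cells of the local sub-region
    // of the matrix are ever visited.
    std::vector<uint64_t> local_matvec(const std::vector<RowList>& rows_local,
                                       const std::vector<uint64_t>& b,
                                       uint64_t p) {
      std::vector<uint64_t> c_local(rows_local.size(), 0);
      for (size_t i = 0; i < rows_local.size(); i++) {
        const RowList& row = rows_local[i];
        unsigned __int128 acc = 0;
        for (size_t k = 0; k < row.index.size(); k++) {
          // Lift the small signed coefficient into GF(p) before multiplying.
          int64_t v = row.value[k];
          uint64_t u = (v >= 0) ? uint64_t(v) : p - (uint64_t(-v) % p);
          acc += (unsigned __int128)u * b[row.index[k]] % p;
        }
        c_local[i] = uint64_t(acc % p);
      }
      return c_local;
    }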
Field Arithmetic. Arithmetic in GF(p) demands fast multi-precision integer operations as well as modular reduction for general values of p. Although in specific cases we could have investigated optimisations for special or small values of p, we consider only the use of general reduction algorithms in our results. The major operation in calculating both matrix-vector multiplication and vector-vector inner products is the modular multiply-accumulate. That is, we want to compute

    Σ_i a_i · b_i (mod p)

for a_i and b_i taken from a bounded list of values within the matrix and vector. Given that efficient methods for multi-precision addition and multiplication are a well studied subject [21], modular reduction holds the key to high performance of this operation. In this respect we have three main alternatives:

– Perform strict reductions after every arithmetic operation so that partial results are always strict members of GF(p).
– Delay the reduction that is performed after multiplication of a_i and b_i until after the addition, thus reducing slightly bigger intermediate values but executing fewer operations.
– Perform no reductions at all in the multiply-accumulate operation, delaying a single reduction of a large result until the very end.

Fig. 6. A graph comparing the performance of different modular multiply-accumulate operations over a varied number of summands using p of 1000 bits in size and a 1 GHz Pentium III processor. The variants compared are Barrett reduction (strict and delayed), Montgomery reduction (strict and delayed) and a single division; execution time is measured in microseconds.
With general values of p, techniques such as Montgomery [24] and Barrett [3] reduction are popular. These methods pre-compute some values in order to reduce expensive operations such as multiplication and division to shifts. By tweaking the pre-computation, both methods can be altered to efficiently cope with delayed reduction given some upper-bound for the values they can reduce. Although Barrett reduction operates on integers in their native form, Montgomery reduction requires the integers be transformed into Montgomery form. This is problematic since we want to avoid storing multi-precision GF(p) values in our matrix given that most entries will initially have small magnitude. For the cost of some space, this issue can be resolved by pre-computing enough Montgomery form integers that one can simply look them up using small, word sized indexes stored in the matrix.
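The first and last of the three reduction strategies are contrasted in the sketch below. As with the earlier sketches this is a single-word illustration under assumed bounds (here p < 2^63 so that a sum of two residues fits 64 bits, and the unreduced sum fits 128 bits); the actual implementation performs the same trick with multi-precision values.

    #include <cstddef>
    #include <cstdint>

    // Strict: reduce after every multiplication and every addition, so
    // partial results are always members of GF(p).
    uint64_t mac_strict(const uint32_t* a, const uint64_t* b,
                        size_t n, uint64_t p) {
      uint64_t acc = 0;
      for (size_t i = 0; i < n; i++) {
        uint64_t t = uint64_t((unsigned __int128)a[i] * b[i] % p);
        acc = (acc + t) % p;
      }
      return acc;
    }

    // Fully delayed: accumulate unreduced products into a wider integer
    // and perform a single reduction of the large result at the very end.
    uint64_t mac_delayed(const uint32_t* a, const uint64_t* b,
                         size_t n, uint64_t p) {
      unsigned __int128 acc = 0;
      for (size_t i = 0; i < n; i++)
        acc += (unsigned __int128)a[i] * b[i];  // no reduction in the loop
      return uint64_t(acc % p);
    }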
Figure 6 compares the performance of Montgomery reduction, Barrett reduction and standard division when used to compute multiply-accumulate operations over a varied number of summands. Using lists of summands whose length matches those typically encountered, a single delayed division is clearly the quickest option. Clearly the results are somewhat dependent on the choice of p but in general the presented trend holds, in that a single division is generally quicker except for very small values of p. This choice gives the additional benefit of allowing the matrix to hold only single-precision values and of an obvious optimisation of multiplication between single and multi-precision values. We investigated the use of NTL [29] to deal with the described arithmetic but eventually opted for a custom implementation to make integration with MPI easier. Our implementation was written in C++ and made use of small fragments of platform specific assembler to boost performance.

Fig. 7. The effects of performing SGE and LOB pre-processing on the matrix shown in Figure 1: (a) after SGE pre-processing; (b) after LOB pre-processing. Note that these images are not to scale and that the actual matrices are around a sixth of the size of the original. The SGE matrix appears darker than the original due to having more non-zero elements and them being more densely packed together. The LOB matrix with P = 4 processors clearly shows a new banding pattern representing the regions allocated to each processor.
3.3 Matrix Pre-processing

Structured Gaussian Elimination (SGE). The standard Gaussian Elimination method used to solve small linear systems employs a number of atomic operations that preserve the meaning of the solution while transforming it into a reduced form. Although we cannot use the same overall technique to solve large sparse systems, we can utilise similar atomic operations to reduce our problem into a smaller, easier to solve version coupled with some post-processing. Consider for example a variable in the system that is specified in only one equation. We can remove this equation and variable, reducing the number of columns in the matrix, and produce a solution for the remaining variables as normal. After obtaining our solution, we use simple back-substitution to recover the value of our removed variable and hence a solution to the entire system. However, since the complexity of the Lanczos algorithm is related to the number of columns of the matrix, our solution phase will run faster than normal due to the reduction in iterations.
The application of this sort of reduction operation is called Structured Gaussian Elimination (SGE) and is vital to allow the efficient solution of very large linear systems; see the work of Cavallar [8] for a thorough treatment of the topic. We start by marking some proportion of those columns in the matrix which have the highest weight as heavy and the rest as light. Typically this proportion is about 5–10%. To reduce our source matrix into a smaller, denser version we then iteratively apply four steps:

Step 1: Delete all columns that have a single non-zero entry and the rows in which those non-zero entries exist.
Step 2: Declare some proportion of light columns to be heavy, selecting those that are heaviest.
Step 3: Delete some proportion of the rows, selecting those that have the largest number of non-zero entries in light columns.
Step 4: For any row with a single non-zero entry equal to ±1 in the light columns, subtract the appropriate multiple of that row from all other rows that have non-zero entries in that column so as to make those entries zero.
The application of these operations is heuristically guided and hence somewhat vague. Our approach follows both LaMacchia and Odlyzko [17] and Pomerance and Smith [27], resulting in a matrix reduction factor of between four and six; see the work of Joux and Lercier [15] for an alternative approach to applying SGE. Essentially, the aim is to stop SGE at some point that optimally balances the computational cost of dealing with the resulting matrix with the communication cost of distributing it among many processors. Since our focus was the implementation of the Lanczos algorithm, our results in this area may be somewhat sub-optimal. This is certainly partly due to the vague nature of SGE but perhaps also to the specific and differing characteristics of our datasets compared to those in other work. Note that due to our matrix representation we were additionally required to check for and prevent overflow in the single-precision content. We achieved this by simply rolling back and avoiding any step which produced such a situation.
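As an illustration of the flavour of these operations, the sketch below implements Step 1 over a simple list-of-cells view of the matrix; the data structures and the fixed-point iteration are our own illustrative choices rather than a description of our actual SGE code.

    #include <algorithm>
    #include <cstdint>
    #include <vector>

    struct Cell { uint32_t row, col; };  // one non-zero entry of the matrix

    // Step 1 of SGE: repeatedly delete every column with a single non-zero
    // entry together with the row containing it. Deleting a row can create
    // new singleton columns, so we iterate to a fixed point; the removed
    // variables are recovered afterwards by back-substitution.
    void sge_step1(const std::vector<Cell>& cells,
                   std::vector<bool>& row_dead, std::vector<bool>& col_dead) {
      std::vector<uint32_t> col_weight(col_dead.size(), 0);
      bool changed = true;
      while (changed) {
        changed = false;
        // Recompute column weights over the surviving cells.
        std::fill(col_weight.begin(), col_weight.end(), 0u);
        for (const Cell& c : cells)
          if (!row_dead[c.row] && !col_dead[c.col]) col_weight[c.col]++;
        for (const Cell& c : cells) {
          if (row_dead[c.row] || col_dead[c.col]) continue;
          if (col_weight[c.col] == 1) {
            row_dead[c.row] = true;  // delete the lone equation ...
            col_dead[c.col] = true;  // ... and the variable it determines
            changed = true;
          }
        }
      }
    }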
Load Balancing (LOB). The columns in our matrix represent variables in the linear system and can be freely re-ordered before processing with the Lanczos algorithm, provided that we reverse the re-ordering afterwards. We can use this property to improve the performance of our parallel implementation by balancing the workload of each processor in the system.

Consider using parallel Lanczos with P processors to solve the system shown in Figure 7a which has already undergone SGE. If we naively divide the matrix M of size n × m into P regions by giving each processor about ⌈m/P⌉ columns, some processors will be required to perform significantly more work since their allocation contains far more non-zero entries than others: the number of modular multi-precision integer operations performed will be directly proportional to the number of non-zero entries in each processor's allocation. Note that from a row-wise perspective, all allocations will be roughly equivalent because all rows have an approximately equal number of non-zero elements.

Table 1. A table describing the composition of the three test matrices used for performance evaluation. These represent real linear systems as generated by the FFS used to attack real DLP instances in a given field.

                  Dimensions           Average Weight
    Name          Rows     Columns     Rows   Columns   Total Weight   DLP
    Before SGE
    Matrix A       80045    70001       15     18          1261200     GF(3^1186)
    Matrix B       96365    80001       15     19          1530722     GF(3^293)
    Matrix C      139376   140002       16     16          2243537     GF(3^362)
    After SGE
    Matrix A       12634    11483      198    218          2512996     GF(3^1186)
    Matrix B       13218    11995      217    239          2873741     GF(3^293)
    Matrix C       24746    23819      148    154          3684466     GF(3^362)
Since we can re-order the columns of the matrix as we wish, we perform a pre-processing phase called load balancing (LOB) which orders the columns so that with P processors, each receives about ⌈ω/P⌉ non-zero entries. In general, this ideal is only roughly achievable but, even in the worst cases, we significantly improve performance by minimising the time a given processor lies idle while waiting for the others. However, as demonstrated by Figure 7b, the result of LOB processing is a skewed allocation in the sense that each processor will usually have a different number of columns in the region allocated to it. Although this somewhat complicates the implementation, we can easily generate an allocation map from the LOB processing that is passed to the Lanczos algorithm so that each processor knows which region of the matrix it is responsible for operating on.
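One simple way to build such an allocation map is the greedy heuristic sketched below; the text above does not fix the exact procedure, so this heaviest-first variant and its names are our own illustration of the idea.

    #include <algorithm>
    #include <cstdint>
    #include <numeric>
    #include <vector>

    // Hypothetical sketch of LOB: assign columns to P regions so that each
    // region receives close to ceil(w/P) non-zero entries, where w is the
    // total matrix weight.
    std::vector<unsigned> lob_allocate(const std::vector<uint32_t>& col_weight,
                                       unsigned P) {
      // Visit columns heaviest-first so large columns are spread evenly.
      std::vector<uint32_t> order(col_weight.size());
      std::iota(order.begin(), order.end(), 0u);
      std::sort(order.begin(), order.end(),
                [&](uint32_t a, uint32_t b) {
                  return col_weight[a] > col_weight[b];
                });

      // Always give the next column to the currently lightest region; the
      // resulting map tells each processor which columns it is responsible
      // for once the columns have been permuted into contiguous regions.
      std::vector<uint64_t> load(P, 0);
      std::vector<unsigned> region_of_col(col_weight.size(), 0);
      for (uint32_t c : order) {
        unsigned r = std::min_element(load.begin(), load.end()) - load.begin();
        region_of_col[c] = r;
        load[r] += col_weight[c];
      }
      return region_of_col;
    }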
4 Experimental Results

4.1 Environment
To experiment with our implementation, we conducted a number of benchmarking tests by using a varying number of processors to solve the three test matrices described by Table 1. These matrices were generated by the sieving step of an FFS implementation [12] attacking DLP instances in a given field. We used two different types of clusters in these experiments, a cluster of standard workstations and a dedicated Beowulf cluster, hoping to gain some insight into how our implementation would perform and scale running on each technology.
Fig. 8. Performance of the workstation and Beowulf clusters while solving three different example matrices (Matrix A, B and C) using a varied number of processing nodes (1 to 32), both with only SGE pre-processing and with SGE and LOB pre-processing. Execution time is measured in seconds.
Because of the recent trend toward large installations of workstation class machines, for example in render farms or the Google search engine, we considered this an interesting comparison against the more traditional, purpose built cluster computer. Certainly from the point of view of understanding the cost of realising an attack using the FFS, potential use of a more readily accessible workstation cluster reduces the significant financial overhead of owning and operating a cluster computer.
Our Linux based Beowulf cluster had 160 processors, 2 Intel Pentium III processors running at 1 GHz housed in each of 80 nodes. We were able to use up to 32 of these processors at a time. Each node was equipped with 512 Mb of memory and a large local disk space which was further supplemented by a shared NFS filesystem. The nodes were linked with Myrinet [5], an Ethernet type technology that allows full-duplex, point-to-point communication at peak rates of around 1.1 Gb/s. We used the C++ portions of a Portland 3.2 compiler suite to build our system on this platform. The workstation cluster consists of a large number of Linux PCs that each contain AMD AthlonXP processors running at 2.4 GHz. These machines each have 512 Mb of memory and rely mainly on a shared networked filesystem for storage. Communication between the nodes is achieved by standard switched 100 Mb/s Ethernet connection: islands of 8 nodes are connected via hubs that then feed into a central switch. Unlike the Beowulf cluster, we were unable to achieve dedicated access to either the communication medium or the processor nodes and so any results will be somewhat influenced by background activity. We used the GCC 3.3 compiler suite to build our system on this platform.
4.2 Goals

Beyond functional testing, the goals of our experimentation were two-fold: to get a basic idea of peak computation speed and to investigate how the implementation scales with large numbers of processors. Both these goals are underpinned by the need to determine how large a linear system we could feasibly solve and hence how large a problem we could attack with the FFS. Results from our host platforms are shown in Figure 8. Note that all timings include the overhead of checkpointing, the act of periodically backing up the working state so that processor downtime never results in total loss of the accumulated work. This action is performed at hourly intervals. Direct comparison between the two platforms is not very meaningful since the processor and communication architectures and compiler systems are significantly different. However, several key trends are evident that go some way to answering our original goals.
4.3 Analysis

Both sets of results show that using the parallel solution offers significant performance improvements over the scalar case and that, given the solution time is only a matter of hours, we can easily solve much more complex systems. By more complex we mean larger in dimension or weight. It is additionally clear that our LOB pre-processing phase is effective in further improving solution speed by balancing the workload of each processor. In context, this shows that we can use the FFS to attack discrete logarithm problems over much larger fields and that attacking fields of a cryptographically interesting size should be possible in terms of the matrix step.
Following the description of Brent [7], we attempt to reason about the efficiency of our implementation in terms of the computation and communication speed of the host platforms. The computational cost will be dominated by matrix-vector product operations where the matrix has n × m elements and the vector m elements. Since a given row in the matrix will have around β non-zero entries, we expect this operation to take 2βm multiplications since we are required to compute both normal and transposed products. Given that the Lanczos algorithm requires O(m) iterations, the total computational cost using P processors is

    2αβm²/P

for some constant α that models the speed of a single multiplication. Communication cost is dominated by the broadcast of each processor's contribution to the matrix-vector product described above. Each processor calculates m/P elements of the result and is required to communicate these to each other processor so they all hold enough information to continue processing. MPI implements this operation using an efficient binary tree broadcast method. We therefore estimate the cost as the product of the number of bits communicated and a constant γ that models the communication speed of a specific medium:

    γm² log₂ p.

Hence, we use T_P to denote the total estimated execution time for a system with P processors:

    T_P = 2αβm²/P + γm² log₂ p.
Ideally, we want to balance the computation and communication terms so that neither dominates the final execution time: we want to make full use of both the computation and communication bandwidth and hence minimise T_P. Applying SGE allows us to tune these features since generally, longer runs of SGE mean a heavier input matrix, i.e. a larger value of β, which is smaller in size, i.e. a smaller value of m.
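Setting the two terms of T_P equal makes this trade-off explicit: neither term dominates when, roughly,

    2αβm²/P = γm² log₂ p,   i.e.   P = 2αβ / (γ log₂ p).

So a heavier matrix (larger β, from a longer run of SGE) raises the per-processor computation and hence the number of processors that can be used efficiently. Note also that α itself grows with the size of p, since multi-precision multiplication is superlinear in operand length, which is why larger values of p help redress the balance as discussed below.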
In practice, previous work on systems generated by the NFS has shown that an ideal balance is hard to achieve due to the high cost of communication versus computation. In our case the computation cost is much higher, around two orders of magnitude more so, since we are dealing with multi-precision arithmetic. Depending on the platform, an iteration of Lanczos takes roughly between 1 and 3 seconds to complete in all three test cases. Given this fact, and since the estimated computation time is far less, there seems to be a massive burden on the communication medium and not enough per-processor computation to balance it.
One way to redress this balance comes from inevitably larger values of p as the field in which the DLP instance is posed grows in size. For these small example cases, the values of p were an order of magnitude less than our imposed upper bound of 1000 bits. One can naturally expect the cost of computation to increase as p grows larger. Improving the application of SGE is the other way to impose balance irrespective of the value of p. Our results imply that for P > 4 we were not aggressive enough in applying SGE, which would have increased β and decreased m to compensate for this imbalance. This problem was exacerbated in the workstation cluster by the problem of switching between local hubs of machines. Furthermore, since we are communicating reasonably large chunks of information, i.e. we have an extra factor of log₂ p over the NFS case, MPI will probably not deal with our workload as efficiently [28].
5 Conclusions

We have presented a practical investigation into the parallel solution of large, sparse linear systems defined over GF(p) where p is a large prime. We set this work in the context of solving the types of linear system produced by the FFS algorithm. By doing so, we both proved the feasibility of such an approach, learnt several valuable lessons about the potential power of our composite FFS system and conjectured about how to further improve the performance of our Lanczos implementation.
Perhaps the most significant finding is that we can comfortably manage much larger matrices than previously considered using sequential methods [1], or indeed larger values of p. Even though our results were somewhat sub-optimal in terms of scalability due to the SGE processing phase, our current test cases take only a few hours to solve. Given this fact, there is clearly enough capacity to process much more complex systems in the future, either larger in size or with greater density. In terms of the FFS, this means we can solve discrete logarithms in larger fields and impact on the security of currently used cryptosystems.
Since this area was relatively unexplored previous to this work, there are some key areas where we can improve or extend our investigation:

– Better understanding and parameterisation of the SGE pre-processing stage will probably yield a better balance between the computational and communication costs involved. More specifically, it should improve the scalability of our implementation due to better utilisation of additional processors.
– In our implementation using MPI, we avoided the issue of processor topology and how one might improve performance by dictating this so as to improve communication speed. This was done to make the implementation more portable. However, the results show that communication performance characteristics are crucial to solution speed: it seems important to address this issue in further work.
– There are other methods of solving this sort of linear system besides the Lanczos algorithm. Specifically, it seems important to investigate and compare our results against those produced by the Wiedemann [30] algorithm running in a similar context.
– It would be interesting to study the possibility of massively parallel custom hardware with very small processor units, along the lines of work for the NFS in characteristic two by Geiselmann and Steinwandt [11], Bernstein [4] and Lenstra et al. [20].

6 Acknowledgements
The author would like to thank Nigel Smart, Frederik Vercauteren and Andrew Holt for useful discussions throughout this work, and anonymous reviewers for their comments. In addition, he would like to thank Stephen Wiggins and Jason Hogan-O'Neill from the Laboratory for Advanced Computation in the Mathematical Sciences (LACMS) at the University of Bristol for allowing and overseeing access to the Beowulf cluster used for experimentation.
References

1. L.M. Adleman and M.A. Huang. Function Field Sieve Method for Discrete Logarithms Over Finite Fields. In Information and Computation, 151, 5–16, 1999.
2. R. Barrett, M. Berry, T.F. Chan, J. Demmel, J. Donato, J.J. Dongarra, V. Eijkhout, R. Pozo, C. Romine and H.A. van der Vorst. Templates for the Solution of Linear Systems: Building Blocks for Iterative Methods. Society for Industrial and Applied Mathematics (SIAM), 1994.
3. P.D. Barrett. Implementing the Rivest, Shamir and Adleman Public Key Encryption Algorithm on a Standard Digital Signal Processor. In Advances in Cryptology (CRYPTO), Springer-Verlag LNCS 263, 311–323, 1987.
4. D.J. Bernstein. Circuits for Integer Factorization: a Proposal. Available from: http://cr.yp.to/papers/nfscircuit.pdf
5. N.J. Boden, D. Cohen, R.E. Felderman, A.E. Kulawik, C.L. Seitz, J.N. Seizovic and W-K. Su. Myrinet: A Gigabit-per-Second Local-Area Network. In IEEE Micro, 15(1), 29–36, 1995.
6. R.P. Brent. Some Parallel Algorithms for Integer Factorisation. In Euro-Par, Springer-Verlag LNCS 1685, 1–22, 1999.
7. R.P. Brent. Recent Progress and Prospects for Integer Factorisation Algorithms. In Computing and Combinatorics (COCOON), Springer-Verlag LNCS 1858, 3–22, 2000.
8. S.H. Cavallar. On the Number Field Sieve Integer Factorisation Algorithm. PhD Thesis, University of Leiden, 2002.
9. J.J. Dongarra, I.S. Duff, D.C. Sorensen and H.A. van der Vorst. Solving Linear Systems on Vector and Shared Memory Computers. Society for Industrial and Applied Mathematics (SIAM), 1991.
10. I.S. Duff, A.M. Erisman and J.K. Reid. Direct Methods for Sparse Matrices. Oxford University Press, 1986.
11. W. Geiselmann and R. Steinwandt. Hardware to Solve Sparse Systems of Linear Equations over GF(2). In Cryptographic Hardware and Embedded Systems (CHES), Springer-Verlag LNCS 2779, 51–61, 2003.
12. R. Granger, A.J. Holt, D. Page, N.P. Smart and F. Vercauteren. Function Field Sieve in Characteristic Three. To appear in Algorithmic Number Theory Symposium (ANTS-VI), 2004.
13. W. Gropp, E. Lusk, N. Doss and A. Skjellum. A High-Performance, Portable Implementation of the MPI Message Passing Interface Standard. In Parallel Computing, 22(6), 789–828, 1996.
14. A.J. Holt and J.H. Davenport. Resolving Large Prime(s) Variants for Discrete Logarithm Computation. In Cryptography and Coding, Springer-Verlag LNCS 2898, 207–222, 2003.
15. A. Joux and R. Lercier. Improvements to the General Number Field Sieve for Discrete Logarithms in Prime Fields. In Mathematics of Computation, 72(242), 953–967, 2003.
16. E. Kaltofen and A. Lobo. Distributed Matrix-Free Solution of Large Sparse Linear Systems over Finite Fields. In Algorithmica, 22(3/4), 331–348, 1999.
17. B.A. LaMacchia and A.M. Odlyzko. Solving Large Sparse Linear Systems Over Finite Fields. In Advances in Cryptology (CRYPTO), Springer-Verlag LNCS 537, 109–133, 1991.
18. C. Lanczos. Solution of Systems of Linear Equations by Minimized Iterations. In Journal of Research of the National Bureau of Standards, 49, 33–53, 1952.
19. A.K. Lenstra, H.W. Lenstra, M.S. Manasse and J.M. Pollard. The Number Field Sieve. In ACM Symposium on Theory of Computing, 564–572, 1990.
20. A.K. Lenstra, A. Shamir, J. Tomlinson and E. Tromer. Analysis of Bernstein's Factorization Circuit. In Advances in Cryptology (ASIACRYPT), Springer-Verlag LNCS 2501, 1–26, 2002.
21. A.J. Menezes, P.C. van Oorschot and S.A. Vanstone. Handbook of Applied Cryptography. CRC Press, 1997.
22. P.L. Montgomery. A Block Lanczos Algorithm for Finding Dependencies Over GF(2). In Advances in Cryptology (EUROCRYPT), Springer-Verlag LNCS 921, 106–120, 1995.
23. P.L. Montgomery. Distributed Linear Algebra. In Workshop on Elliptic Curve Cryptography (ECC), 2000.
24. P.L. Montgomery. Modular Multiplication Without Trial Division. Mathematics of Computation, 44, 519–521, 1985.
25. MPI: A Message-Passing Interface Standard. In Journal of Supercomputer Applications, 8(3/4), 159–416, 1994.
26. NFSNET: Large Scale Integer Factoring. Housed at: http://www.nfsnet.org
27. C. Pomerance and J.W. Smith. Reduction of Huge, Sparse Matrices over Finite Fields Via Created Catastrophes. In Experimental Mathematics, 1(2), 89–94, 1992.
28. R. Rabenseifner. Optimization of Collective Reduction Operations. In Computational Science (ICCS), Springer-Verlag LNCS 3036, 1–9, 2004.
29. V. Shoup. NTL: A Library for doing Number Theory. Available from: http://www.shoup.net/ntl/
30. D.H. Wiedemann. Solving Sparse Linear Equations over Finite Fields. In IEEE Transactions on Information Theory, 32(1), 54–62, 1986.
31. L.T. Yang and R.P. Brent. The Parallel Improved Lanczos Methods for Integer Factorization over Finite Fields for Public Key Cryptosystems. In International Conference on Parallel Processing Workshops (ICPPW), IEEE Press, 106–111, 2001.