Parallel Solution of Sparse Linear Systems Defined Over GF(p)

D. Page
Department of Computer Science, University of Bristol,
Merchant Venturers Building, Woodland Road,
Bristol, BS8 1UB, United Kingdom.
[email protected]

Abstract. The second stage of processing in the Number Field Sieve (NFS) and Function Field Sieve (FFS) algorithms, when applied to discrete logarithm problems, requires the solution of a large, sparse linear system defined over a finite field. This operation, although well studied in theory, presents a practical bottleneck that can limit the size of problem either algorithm can tackle. In order to partially bridge this gap between theory and practice, we investigate and develop a fast, scalable implementation of the Lanczos algorithm over GF(p) that can solve such systems in parallel using workstation clusters and dedicated Beowulf clusters.

1 Introduction

The security of modern public-key cryptography is usually based on the presumed hardness of problems such as factoring integers or computing discrete logarithms. The Number Field Sieve [19] (NFS) and Function Field Sieve [1] (FFS) offer two examples of algorithms that can attack these problems. Such algorithms are generally specified in two phases. The first phase, sometimes called the sieving step, aims to collect many relations that represent small items of information about the problem one is trying to solve. This phase is easy to parallelise since one can generate the relations independently, which makes it attractive for distributed, Internet based collaborative computation [26]. The second phase of processing, sometimes called the matrix step, aims to collect the relations and combine them into a single linear system which, when solved, allows one to efficiently compute answers to the original problem. Efficient implementation of the matrix step is challenging since the linear system is generally very large, and it usually represents a practical bottleneck even though it is often overlooked in theoretical discussion. However, both the NFS and FFS allow one to balance the computational effort between the sieving and matrix steps in the sense that one can trade off work in one phase for work in the other. It is therefore common for systems to be parameterised such that the work required in the matrix step is closely matched to the available computational resources. Under this assumption, a more efficient implementation of the matrix step will allow us both to reduce the work in the sieving step and to significantly accelerate the overall computation.

The solution of linear systems is a well studied topic since efficient methods are required in many different applications [9, 2]. Solving small linear systems is a trivial task thanks to methods such as Gaussian Elimination that mechanically reduce the problem. Ultimately, one arrives at a form where it is possible to substitute known variables into all equations so as to solve the entire system. However, constructing this form requires that most of the matrix is populated with non-zero entries at some point. In small systems this is not a problem, but with large systems such as those produced by the NFS and FFS, holding the matrix in memory is infeasible due to the physical size of such an object. Therefore iterative methods that make successively more accurate approximations of the answer are usually preferred. The Lanczos [18] and Wiedemann [30] algorithms represent two such methods that can solve sparse linear systems efficiently in both time and space.
Current expositions of the matrix step are either abstract in their description of processor architecture [6, 31] and hence do not present results of actual implementation; are biased toward the characteristic two case used in integer factorisation systems like the NFS [17, 16, 22, 23]; or do not consider parallel computation [14]. Our aim is to partially fill this gap in the literature by providing concrete experience of implementing and using a parallel Lanczos implementation for systems over GF(p) using Linux based workstation clusters and dedicated Beowulf clusters. It is important to note that throughout the paper we use GF(p) to denote a finite field where p is a large prime integer of up to roughly 1000 bits in length.

We set this work in the context of real datasets generated by the sieving step of an FFS implementation [12]. The implementation was used to attack discrete logarithm problems (DLP) posed in finite fields of characteristic three. As a result, the sieving produces a large, sparse linear system defined over GF(p) which, when solved, is used to compute answers to the original problem. Since the solution of this sort of system is not well understood in terms of performance, we take a pessimistic approach and opt for long-term scalability over short-term optimisation in our design choices. This guarantees we can investigate and understand the properties of large systems before making incremental improvements that match the specific requirements of a given case.

The paper is organised as follows. Section 2 describes the Lanczos algorithm and the minor alterations required so it can solve linear systems generated by the FFS. We describe the details of our implementation in Section 3, focusing on issues of efficient parallelisation and field arithmetic; matrix representation; and pre-processing phases. We then present some experimental results in Section 4 that describe the performance of our system when used to solve three test cases. Finally, we offer some concluding remarks and highlight areas for further work in Section 5.

Algorithm 1: A description of the Lanczos algorithm suitable for solving linear systems produced by the sieving stage of typical FFS implementations.

Input: An n x m matrix M with null-space of dimension greater than zero, and a random m element vector r.
Output: The vector y such that M y = 0 and y is non-zero, or ⊥ if the algorithm fails to find a solution.

    w_0 <- M^T (M r)
    w_1 <- w_0
    v_2 <- M^T (M w_1)
    t_0 <- (v_2, v_2)
    t_1 <- (v_2, w_1)
    t_4 <- (w_1, w_0)
    w_2 <- v_2 - (t_0 / t_1) w_1
    x   <- (t_4 / t_1) w_1
    repeat
        v_3 <- M^T (M w_2)
        t_0 <- (v_3, v_3)
        t_1 <- (v_3, w_2)
        t_2 <- (v_3, v_2)
        t_3 <- (w_1, v_2)
        t_4 <- (w_2, w_0)
        if t_1 = 0 or t_3 = 0 then
            return ⊥
        w_3 <- v_3 - (t_0 / t_1) w_2 - (t_2 / t_3) w_1
        x   <- x + (t_4 / t_1) w_2
        w_1 <- w_2
        v_2 <- v_3
        w_2 <- w_3
    until w_2 = 0
    y <- x - r
    if y = 0 then
        return ⊥
    else
        return y

2 The Lanczos Algorithm

We are given an n x m matrix M with m > n that represents a linear system where columns in the matrix represent variables and rows represent equations. Our aim is to produce a non-zero vector y such that

$$M y = 0.$$

Note that we assume that the null-space of M has dimension greater than zero, i.e. that such a solution does exist. The Lanczos algorithm demands that this problem be stated in the form

$$A y = w,$$

where w is non-zero and A is both square and symmetric. Dealing with the first constraint of the algorithm, that w is non-zero, is straightforward since we can simply take a random vector r, compute

$$w = M^T M r,$$

and solve

$$M^T M x = w$$

for x before calculating the required result as y = x - r.
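To see why the correction y = x - r works, note that, writing $A = M^T M$,

$$A y = A (x - r) = A x - A r = w - w = 0,$$

so y lies in the null-space of A and, with high probability, in that of M itself; the algorithm must still check that y is non-zero, since x = r would yield only the trivial solution.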
The second constraint, that the matrix is square and symmetric, is satisfied by multiplying both sides of the original equation by M^T. This ensures the composite matrix is of the required form and transforms our original problem into

$$M^T (M y) = M^T w.$$

However, since $M^T M$ is of size m x m and probably very dense, it is unrealistic to assume we can hold it in memory. We therefore view this as a calculated value, holding only the sparse matrix data in memory.

When presented with the problem in the correct form, the Lanczos algorithm uses a set of recurrence equations to iteratively produce a solution. Following the description of LaMacchia and Odlyzko [17], let

$$w_0 = w, \qquad v_1 = M^T M w_0, \qquad w_1 = v_1 - \frac{(v_1, v_1)}{(w_0, v_1)} w_0,$$

and then for $i \geq 1$,

$$v_{i+1} = M^T M w_i, \qquad w_{i+1} = v_{i+1} - \frac{(v_{i+1}, v_{i+1})}{(w_i, v_{i+1})} w_i - \frac{(v_{i+1}, v_i)}{(w_{i-1}, v_i)} w_{i-1}.$$

The algorithm terminates when it finds some $w_j$ that is self-conjugate, that is when $(w_j, M^T M w_j) = 0$. Unless the algorithm fails to find a solution, it completes with j less than or equal to the number of variables in the system. That is, due to the dimension of the Krylov space being close to m, we get a solution after O(m) iterations. Hence, if $w_j = 0$ the solution is recovered as

$$x = \sum_{i=0}^{j-1} s_i w_i \qquad \text{where} \qquad s_i = \frac{(w_i, w)}{(w_i, v_{i+1})}.$$

There are two reasons why the algorithm could fail to find a solution even if one does exist. Firstly, the algorithm could encounter zero values of either $t_1$ or $t_3$ which it cannot invert; the probability of this happening is 1/p. Secondly, there is a chance that the algorithm could compute an x so that y = x - r = 0, that is, a trivial result. In the event that one of these cases does occur, one is forced to select a new random vector r and restart the algorithm from the beginning. Clearly this is undesirable, but for large values of p it does not present a practical problem since the chance of either event occurring is very small.

Considering the augmented description of the original Lanczos method shown in Algorithm 1, after initialisation we find three main computational phases that are repeated until a solution is found:

Phase 1: Firstly, we compute a matrix-vector product which, due to the size of the matrix, usually represents the most costly operation in the algorithm.
Phase 2: Following the matrix-vector product, we compute a number of vector-vector inner products.
Phase 3: Finally, we update our working vectors using a number of vector-scalar products and vector-vector additions.

3 Implementation Details

Many of our design decisions will be affected by the characteristics of the source linear system which, from here on, we interchangeably describe as the matrix. It is therefore important to understand these characteristics so that we may use suitable data structures that efficiently capture and operate on the matrix in our physical implementation. Figure 1 shows a visualisation of an example problem where the darker regions of the image represent more dense areas of the matrix. One can clearly see features such as the vertical banding effect, due to there being more small elements than large in the factor base, and the mirror type product of the rational and algebraic factor bases that create two similar regions to the left and right of centre.

Fig. 1. Visualisation of a matrix representing a linear system as produced by the FFS. Note that darker regions of the image represent more dense areas of the matrix, i.e. those areas with more non-zero elements.
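To make the three phases above concrete, the following is a runnable miniature of one pass of the main loop in Algorithm 1. It is our own illustration, not the paper's code: arithmetic is single-word modulo a deliberately small prime, whereas the real implementation uses multi-precision GF(p) values, and the sparse M^T(M.) product is supplied by the caller.

```cpp
// Miniature of one Lanczos iteration (Algorithm 1), for illustration only.
#include <cstdint>
#include <functional>
#include <vector>

using Fp  = std::uint64_t;            // residue modulo P (single-word sketch)
using Vec = std::vector<Fp>;
constexpr Fp P = 1000003;             // small prime, an assumption of the sketch

Fp add(Fp a, Fp b) { return (a + b) % P; }
Fp sub(Fp a, Fp b) { return (a + P - b) % P; }
Fp mul(Fp a, Fp b) { return (a * b) % P; }
Fp pow_mod(Fp a, Fp e) {
  Fp r = 1;
  while (e) { if (e & 1) r = mul(r, a); a = mul(a, a); e >>= 1; }
  return r;
}
Fp inv(Fp a) { return pow_mod(a, P - 2); } // Fermat inverse, P prime

Fp inner(const Vec& a, const Vec& b) {     // phase 2: inner product (a, b)
  Fp s = 0;
  for (std::size_t i = 0; i < a.size(); i++) s = add(s, mul(a[i], b[i]));
  return s;
}
void axpy(Vec& y, Fp c, const Vec& x) {    // phase 3: y <- y + c x
  for (std::size_t i = 0; i < y.size(); i++) y[i] = add(y[i], mul(c, x[i]));
}

// One pass of the main loop; AtA applies w -> M^T (M w) using the sparse M.
// Returns false on breakdown (t1 or t3 zero), forcing a restart with new r.
bool lanczos_step(const std::function<Vec(const Vec&)>& AtA,
                  Vec& w1, Vec& w2, Vec& v2, Vec& x, const Vec& w0) {
  Vec v3 = AtA(w2);                        // phase 1: matrix-vector product
  Fp t0 = inner(v3, v3), t1 = inner(v3, w2),
     t2 = inner(v3, v2), t3 = inner(w1, v2), t4 = inner(w2, w0);
  if (t1 == 0 || t3 == 0) return false;
  Vec w3 = v3;
  axpy(w3, sub(0, mul(t0, inv(t1))), w2);  // w3 <- v3 - (t0/t1) w2
  axpy(w3, sub(0, mul(t2, inv(t3))), w1);  //         - (t2/t3) w1
  axpy(x, mul(t4, inv(t1)), w2);           // x  <- x + (t4/t1) w2
  w1 = w2; v2 = v3; w2 = w3;
  return true;
}
```

The breakdown test on t_1 and t_3 corresponds to the 1/p failure case discussed above, in which a new random vector r must be chosen and the algorithm restarted.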
3.1 Arithmetic And Communication Parallelisation

There are several ways one could distribute data among the processors in order to operate the Lanczos algorithm in parallel. For example, if we knew the processor topology we might select a blocked allocation of the vector and matrix data onto the processors so that communication between close processors working on related data is very quick. However, without knowledge of the underlying topology, such distributions are hard to implement efficiently. We therefore opted to follow the description of Yang and Brent [31] in making the starting assumption that every processor is going to compute part of the result vector. In order to do this, each processor must hold a portion of each vector and also a region of the matrix. Since the working data is effectively scattered among the processors, there must be communication to allow the calculation of combined or shared results. We used the MPICH [13] implementation of MPI [25] to pass data between the processor nodes. We currently view the processor network as an abstract object, employing no topology-specific communication, although doing so would probably improve performance and hence warrants further investigation.

Figure 2 illustrates the parallelisation of matrix-vector multiplication, the major operation within the Lanczos algorithm. Reading the diagram from left to right, we see that each of the four processors holds a sub-section of the vector and the corresponding sub-section of the matrix required to compute the partial multiplication result. These partial results are then summed into a single result vector which is redistributed ready for the next operation. Note that since we are required to perform a normal matrix-vector product followed by a transposed matrix-vector product, the operation is split into two phases, each of which completes in a similar manner.

Fig. 2. A crude attempt to demonstrate the operations within Algorithm 1 when computing the step v_3 <- M^T (M w_2) in parallel with P = 4 processors, each holding a portion of the vector and matrix.

Other than matrix-vector multiplication, the only other global calculation is that of the vector inner product. Since the vectors are distributed in sections among the processors, each processor calculates a local result; these local results are collectively summed to form the global result used subsequently in the vector update phase.

3.2 Matrix Representation

As a starting point, let $\omega$ denote the weight of an input matrix M, that is, the number of non-zero entries in the matrix:

$$\omega = \sum_{i,j} m_{i,j} \qquad \text{where} \qquad m_{i,j} = \begin{cases} 0 & \text{if } M_{i,j} = 0, \\ 1 & \text{if } M_{i,j} \neq 0. \end{cases}$$

We assume the average proportion of non-zero entries in a given row or column is roughly 0.2%-2% and use $\gamma$ to denote the average number of non-zeros in a row, i.e. the average weight of a given row. Entries in the matrix represent coefficients of the variables that, although members of GF(p), are guaranteed to have small positive or negative magnitude of roughly 50, due to the way that FFS sieving works. The large majority of these entries are expected to be 1. We therefore hold the coefficients as single-precision, 32-bit integers rather than multi-precision members of GF(p), an approach that saves significant amounts of memory. We hold the matrix in compressed row format where each row is represented by linear arrays of values and column indexes for the non-zero entries; see [10] for a thorough treatment of this sort of technique.
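As a concrete point of reference, here is a minimal sketch of the compressed-row layout just described; the type and field names are illustrative assumptions rather than the paper's actual data structures.

```cpp
// A minimal sketch of a compressed-row matrix over GF(p): per row, parallel
// arrays of column indexes and single-precision values. Illustrative only.
#include <cstdint>
#include <vector>

struct RowList {
  std::vector<std::uint32_t> index; // column index of each non-zero, sorted
  std::vector<std::int32_t>  value; // small signed coefficient (|v| ~ 50)
};

struct SparseMatrix {
  std::size_t n_rows = 0, n_cols = 0;
  std::vector<RowList> row;         // row[i] holds the non-zeros of row i

  // Assumes cells for a given row arrive in ascending column order.
  void add_cell(std::size_t i, std::uint32_t j, std::int32_t v) {
    row[i].index.push_back(j);
    row[i].value.push_back(v);
  }
};
```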
We call the tuple of value and index arrays that hold the non-zero entries for a given row a row list, and an entry in the matrix, specified by a single row and column index, a cell. Due to the exceptionally sparse nature of our matrices, using a compressed row format offers a massive saving in terms of space since only the non-zero entries consume memory. Since we are interested in investigating the size of systems we can process using our implementation, we save additional space by ignoring the potential to hold two copies of the matrix for normal and transposed access. Instead we hold only one copy of the matrix and rely on the performance of row and column-wise access to this structure. Indeed, once all the cells are loaded into the memory resident row lists, the performance of operations involving the matrix is significantly influenced by how quickly we can locate the value associated with a given cell.

Fig. 3. Using cell hints to represent the matrix as a low-resolution bit-set vastly improves performance by allowing very fast searching within the matrix and presenting more manageable memory access characteristics. Note that in this case, four cells are represented by one bit so we say the hints have a resolution of four.

Speed Oriented Improvements We can quickly locate the correct row for a given cell since the rows are stored in a sequential, ordered manner. However, locating the required column entry requires that we search the array of non-zero indexes within that row list. This approach presents a performance hurdle for two main reasons. Firstly, we will often search row lists when the required column is not present. Even if we maintain an ordering within the row lists, this searching operation consumes a vast amount of time when trying to locate a specific cell or cells. Secondly, since the matrix is large, the searching operation highlighted above is additionally costly due to the nature of the memory sub-system in most processors. The matrix is far too large to fit into cache memory and, since we are effectively streaming through our row lists, using each only once, there will be very little reference locality and thus negligible benefit from having a cache present at all. Indeed, the matrix may be too large to fit entirely into physical memory, which presents a further burden to the virtual memory mechanism.

Fig. 4. A graph comparing column look-up speed (execution time in microseconds against non-zeros per column) for different sized cell hint structures (155 k, 312 k and 625 k of cell hints) against a naive search method in a matrix of 10000 x 10000 elements using a 1 GHz Pentium III processor.

Fortunately, we can reduce the impact of these problems by constructing a second data structure to accelerate searching for cells in the matrix [10]. By building a bit-set that holds a low-resolution representation of the matrix cells, locating zero entries is vastly accelerated to a single bit-test, as demonstrated by Figure 3. Locating non-zero cells is made far more efficient by only searching the row list when there is a high probability of finding the target column: when the bit-set shows a zero for a given element we know it is definitely not there; if it contains a one, the element might be there and we need to search.
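A hedged sketch of how such a hint structure might sit alongside a row list follows; the resolution R = 4 mirrors Figure 3, and all names are our own illustrative choices rather than the paper's code.

```cpp
// Cell look-up using a low-resolution hint bit-set: one bit summarises R
// consecutive columns of a row, so a zero bit proves the cell is absent and
// a one bit triggers a binary search of the row list. Illustrative only.
#include <algorithm>
#include <cstdint>
#include <vector>

constexpr std::size_t R = 4; // columns summarised per hint bit (resolution)

struct HintedRow {
  std::vector<std::uint32_t> index;  // sorted column indexes of non-zeros
  std::vector<std::int32_t>  value;  // matching coefficients
  std::vector<std::uint8_t>  hint;   // low-resolution bit-set for this row

  void build_hints(std::size_t n_cols) {
    std::size_t bits = (n_cols + R - 1) / R;
    hint.assign((bits + 7) / 8, 0);
    for (std::uint32_t j : index)
      hint[(j / R) / 8] |= std::uint8_t(1u << ((j / R) % 8));
  }

  // Returns the coefficient at column j, or 0 if the cell is empty.
  std::int32_t lookup(std::uint32_t j) const {
    std::uint32_t b = j / R;
    if (!(hint[b / 8] & (1u << (b % 8)))) return 0; // definitely absent
    auto it = std::lower_bound(index.begin(), index.end(), j);
    if (it == index.end() || *it != j) return 0;    // hint false positive
    return value[it - index.begin()];
  }
};
```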
Clearly the resolution of our bit-set and the sparsity patterns of our matrix affect this probability, with lower resolutions offering less certainty about their predictions. However, we can adjust the resolution in a flexible, per-matrix manner so as to ensure good performance while using very little memory. In fact, the bit-sets can be made small enough that they fit into, and are retained in, cache memory. Figure 4 compares different sized cell hint structures versus a naive method of searching the row lists. For a typical matrix size of 10000 x 10000 having rows and columns of varied average weight, retrieving a row obviously costs almost nothing since the matrix is stored in row-wise format. However, retrieving a column is clearly significantly faster using the cell hints for as long as the structure does not become over-polluted, i.e. as long as the number of columns in the matrix represented by each bit in the cell hint is sufficiently small. Furthermore, this is possible using structures that can easily fit into the level-two cache of most processors, a fact we can ensure by tuning the size used on a per-matrix basis.

Fig. 5. With P = 4 processors, each processor need only hold a sub-region of the matrix cells since it will use only those cells in computing its local portion of a collective matrix-vector product.

Space Oriented Improvements As well as improving over naive methods of searching the matrix, we can exploit the parallelisation method described in Section 3.1 to reduce the amount of memory used. Recall that the main computational task is to compute a matrix-vector product of the form

$$c_i = \sum_j A_{i,j} \cdot b_j,$$

where i and j range over the dimensions of the matrix A. Note that this operation can obviously be restructured depending on whether we are dealing with normal or transposed matrix access. In parallelising this operation, we have tasked each processor with computing part of the result so that $l_l \leq i \leq l_h$ for some limits $l_l$ and $l_h$. Because of this, each processor will only ever use a sub-region of the matrix in order to compute a local result: it is wasteful for every processor to hold the entire matrix in memory. To save memory, and hence improve our ability to process large matrices, each processor only holds the matrix sections required to compute local matrix-vector products. Note that this requires both normal and transposed access to a single matrix and hence the data held is only that required for these operations. Figure 5 demonstrates this for P = 4 processors where the regions loaded are shown by solid blocks hewn from the encompassing dashed matrix. Note that this saving in space can also result in a performance improvement since it implies less pollution of the cell hint information and shorter row lists to search through.

Field Arithmetic Arithmetic in GF(p) demands fast multi-precision integer operations as well as modular reduction for general values of p. Although in specific cases we could have investigated optimisations for special or small values of p, we consider only the use of general reduction algorithms in our results. The major operation in calculating both matrix-vector multiplication and vector-vector inner products is the modular multiply-accumulate. That is, we want to compute

$$\sum_i a_i \cdot b_i \pmod{p}$$

for $a_i$ and $b_i$ taken from a bounded list of values within the matrix and vector.
Given that efficient methods for multi-precision addition and multiplication are a well studied subject [21], modular reduction holds the key to high performance of this operation. In this respect we have three main alternatives:

- Perform strict reductions after every arithmetic operation so that partial results are always strict members of GF(p).
- Delay the reduction that is performed after multiplication of $a_i$ and $b_i$ until after the addition, thus reducing slightly bigger intermediate values but executing fewer operations.
- Perform no reductions at all in the multiply-accumulate operation, delaying a single reduction of a large result until the very end.

With general values of p, techniques such as Montgomery [24] and Barrett [3] reduction are popular. These methods pre-compute some values to scale expensive operations such as multiplication and division into shifts. By tweaking the pre-computation, both methods can be altered to efficiently cope with delayed reduction given some upper bound for the values they can reduce. Although Barrett reduction operates on integers in their native form, Montgomery reduction requires the integers be transformed into Montgomery form. This is problematic since we want to avoid storing multi-precision GF(p) values in our matrix given that most entries will initially have small magnitude. For the cost of some space, this issue can be resolved by pre-computing enough Montgomery form integers that one can simply look them up using small, word sized indexes stored in the matrix.

Fig. 6. A graph comparing the performance of different modular multiply-accumulate operations (strict and delayed Barrett reduction, strict and delayed Montgomery reduction, and a single division) over a varied number of summands, using p of 1000 bits in size and a 1 GHz Pentium III processor.

Figure 6 compares the performance of Montgomery reduction, Barrett reduction and standard division when used to compute multiply-accumulate operations over a varied number of summands. Using lists of summands whose length matches those typically encountered, a single delayed division is clearly the quickest option. The results are somewhat dependent on the choice of p but, in general, the presented trend holds in that a single division is generally quicker except for very small values of p. This choice gives the additional benefit of allowing the matrix to hold only single-precision values, and of permitting the obvious optimisation of multiplication between single and multi-precision values. We investigated the use of NTL [29] to deal with the described arithmetic but eventually opted for a custom implementation to make integration with MPI easier. Our implementation was written in C++ and made use of small fragments of platform specific assembler to boost performance.

Fig. 7. The effects of performing SGE and LOB pre-processing on the matrix shown in Figure 1: (a) after SGE pre-processing; (b) after LOB pre-processing. Note that these images are not to scale and that the actual matrices are around a sixth of the size of the original. The SGE matrix appears darker than the original due to having more non-zero elements which are more densely packed together. The LOB matrix with P = 4 processors clearly shows a new banding pattern representing the regions allocated to each processor.
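The favoured strategy, a single delayed reduction, is straightforward to express. The following sketch is our own illustration using GMP rather than the paper's custom C++ library; it assumes the $a_i$ are the small signed single-precision matrix coefficients and the $b_i$ are multi-precision vector elements.

```cpp
// Delayed-reduction multiply-accumulate: acc = sum(a_i * b_i) mod p, with no
// intermediate reductions and a single mpz_mod at the end. Illustrative GMP
// sketch, not the paper's custom implementation.
#include <gmp.h>
#include <cstddef>

void mac_mod_p(mpz_t acc, const int* a, const mpz_t* b,
               std::size_t n, const mpz_t p) {
  mpz_set_ui(acc, 0);
  for (std::size_t i = 0; i < n; i++) {
    if (a[i] > 0)
      mpz_addmul_ui(acc, b[i], (unsigned long)a[i]);    // acc += a[i] * b[i]
    else if (a[i] < 0)
      mpz_submul_ui(acc, b[i], (unsigned long)(-a[i])); // acc -= |a[i]| * b[i]
  }
  mpz_mod(acc, acc, p); // single, delayed reduction of the large result
}
```

In our setting the number of summands is bounded by the row weight and the coefficients by roughly 50, so the accumulator grows only a word or so beyond the size of p before the final reduction.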
3.3 Matrix Pre-processing

Structured Gaussian Elimination (SGE) The standard Gaussian Elimination method used to solve small linear systems employs a number of atomic operations that preserve the meaning of the solution while transforming it into a reduced form. Although we cannot use the same overall technique to solve large sparse systems, we can utilise similar atomic operations to reduce our problem into a smaller, easier to solve version coupled with some post-processing. Consider, for example, a variable in the system that is specified in only one equation. We can remove this equation and variable, reducing the number of columns in the matrix, and produce a solution for the remaining variables as normal. After obtaining our solution, we use simple back-substitution to recover the value of our removed variable and hence a solution to the entire system. Moreover, since the complexity of the Lanczos algorithm is related to the number of columns of the matrix, our solution phase will run faster than normal due to the reduction in iterations.

The application of this sort of reduction operation is called Structured Gaussian Elimination (SGE) and is vital to allow the efficient solution of very large linear systems; see the work of Cavallar [8] for a thorough treatment of the topic. We start by marking some proportion of those columns in the matrix which have the highest weight as heavy and the rest as light. Typically this proportion is about 5-10%. To reduce our source matrix into a smaller, denser version we then iteratively apply four steps (a sketch of Step 1 appears at the end of this subsection):

Step 1: Delete all columns that have a single non-zero entry, and the rows in which those non-zero entries exist.
Step 2: Declare some proportion of the light columns to be heavy, selecting those that are heaviest.
Step 3: Delete some proportion of the rows, selecting those that have the largest number of non-zero entries in the light columns.
Step 4: For any row with a single non-zero entry equal to 1 in the light columns, subtract the appropriate multiple of that row from all other rows that have non-zero entries in that column so as to make those entries zero.

The application of these operations is heuristically guided and hence somewhat vague. Our approach follows both LaMacchia and Odlyzko [17] and Pomerance and Smith [27], resulting in a matrix reduction factor of between four and six; see the work of Joux and Lercier [15] for an alternative approach to applying SGE. Essentially, the aim is to stop SGE at some point that optimally balances the computational cost of dealing with the resulting matrix against the communication cost of distributing it among many processors. Since our focus was the implementation of the Lanczos algorithm, our results in this area may be somewhat sub-optimal. This is certainly partly due to the vague nature of SGE but perhaps also to the specific and differing characteristics of our datasets compared to those in other work. Note that due to our matrix representation we were additionally required to check for and prevent overflow in the single-precision content. We achieved this by simply rolling back and avoiding any step which produced such a situation.
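As noted above, a hedged sketch of Step 1 follows; the Cell type and the repeated full recount are illustrative simplifications rather than the paper's implementation, which would interleave this step with the other three.

```cpp
// SGE Step 1 sketch: repeatedly delete any column with exactly one live
// non-zero entry, together with the row containing that entry; the removed
// variables are recovered later by back-substitution. Illustrative only.
#include <cstddef>
#include <vector>

struct Cell { std::size_t row, col; int value; };

std::vector<Cell> sge_step1(std::vector<Cell> cells,
                            std::size_t n_rows, std::size_t n_cols) {
  std::vector<bool> dead_row(n_rows, false), dead_col(n_cols, false);
  bool changed = true;
  while (changed) {
    changed = false;
    // Count live entries per column, remembering the row of the last one.
    std::vector<std::size_t> count(n_cols, 0), last_row(n_cols, 0);
    for (const Cell& c : cells)
      if (!dead_row[c.row] && !dead_col[c.col]) {
        count[c.col]++;
        last_row[c.col] = c.row;
      }
    for (std::size_t j = 0; j < n_cols; j++)
      if (!dead_col[j] && count[j] == 1) {
        dead_col[j] = true;            // delete the singleton column...
        dead_row[last_row[j]] = true;  // ...and the row holding its entry
        changed = true;                // may create new singleton columns
      }
  }
  std::vector<Cell> out;
  for (const Cell& c : cells)
    if (!dead_row[c.row] && !dead_col[c.col]) out.push_back(c);
  return out;
}
```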
Load Balancing (LOB) The columns in our matrix represent variables in the linear system and can be freely re-ordered before processing with the Lanczos algorithm, provided that we reverse the re-ordering afterwards. We can use this property to improve the performance of our parallel implementation by balancing the workload of each processor in the system. Consider using parallel Lanczos with P processors to solve the system shown in Figure 7a, which has already undergone SGE. If we naively divide the matrix M of size n x m into P regions by giving each processor about ⌈m/P⌉ columns, some processors will be required to perform significantly more work since their allocation contains far more non-zero entries than others: the number of modular multi-precision integer operations performed will be directly proportional to the number of non-zero entries in each processor's allocation. Note that from a row-wise perspective, all allocations will be roughly equivalent because all rows have an approximately equal number of non-zero elements. Since we can re-order the columns of the matrix as we wish, we perform a pre-processing phase called load balancing (LOB) which orders the columns so that with P processors, each receives ⌈ω/P⌉ non-zero entries. In general, this ideal is only roughly achievable but, even in the worst cases, we significantly improve performance by minimising the time a given processor lies idle while waiting for the others. However, as demonstrated by Figure 7b, the result of LOB processing is a skewed allocation in the sense that each processor will usually have a different number of columns in the region allocated to it. Although this somewhat complicates the implementation, we can easily generate an allocation map from the LOB processing that is passed to the Lanczos algorithm so that each processor knows which region of the matrix it is responsible for operating on.

Table 1. A table describing the composition of the three test matrices used for performance evaluation. These represent real linear systems as generated by the FFS used to attack real DLP instances in a given field.

                Dimensions           Average Weight
  Name          Rows      Columns    Rows    Columns    Total Weight   DLP
  Before SGE:
  Matrix A       80045     70001      15      18          1261200      GF(3^1186)
  Matrix B       96365     80001      15      19          1530722      GF(3^293)
  Matrix C      139376    140002      16      16          2243537      GF(3^362)
  After SGE:
  Matrix A       12634     11483     198     218          2512996      GF(3^1186)
  Matrix B       13218     11995     217     239          2873741      GF(3^293)
  Matrix C       24746     23819     148     154          3684466      GF(3^362)

4 Experimental Results

4.1 Environment

To experiment with our implementation, we conducted a number of benchmarking tests by using a varying number of processors to solve the three test matrices described by Table 1. These matrices were generated by the sieving step of an FFS implementation [12] attacking DLP instances in a given field. We used two different types of cluster in these experiments, a cluster of standard workstations and a dedicated Beowulf cluster, hoping to gain some insight into how our implementation would perform and scale running on each technology.
Fig. 8. Performance of the workstation and Beowulf clusters while solving the three different example matrices using a varied number of processing nodes (1 to 32), with only SGE pre-processing and with both SGE and LOB pre-processing. Panels: Matrix A, B and C on the workstation cluster; Matrix A, B and C on the Beowulf cluster. Axes: number of processors against execution time in seconds.

Because of the recent trend toward large installations of workstation class machines, for example in render farms or the Google search engine, we considered this an interesting comparison against the more traditional, purpose built cluster computer. Certainly from the point of view of understanding the cost of realising an attack using the FFS, potential use of a more readily accessible workstation cluster reduces the significant financial overhead of owning and operating a cluster computer.

Our Linux based Beowulf cluster had 160 processors, 2 Intel Pentium III processors running at 1 GHz housed in each of 80 nodes. We were able to use up to 32 of these processors at a time. Each node was equipped with 512 Mb of memory and a large local disk space which was further supplemented by a shared NFS filesystem. The nodes were linked with Myrinet [5], an Ethernet type technology that allows full-duplex, point-to-point communication at peak rates of around 1.1 Gb/s. We used the C++ portions of a Portland 3.2 compiler suite to build our system on this platform.

The workstation cluster consists of a large number of Linux PCs that each contain AMD AthlonXP processors running at 2.4 GHz. These machines each have 512 Mb of memory and rely mainly on a shared networked filesystem for storage. Communication between the nodes is achieved by a standard switched 100 Mb/s Ethernet connection: islands of 8 nodes are connected via hubs that then feed into a central switch. Unlike the Beowulf cluster, we were unable to obtain dedicated access to either the communication medium or the processor nodes, and so any results will be somewhat influenced by background activity. We used the GCC 3.3 compiler suite to build our system on this platform.

4.2 Goals

Beyond functional testing, the goals of our experimentation were two-fold: to get a basic idea of peak computation speed, and to investigate how the implementation scales with large numbers of processors. Both these goals are underpinned by the need to determine how large a linear system we could feasibly solve and hence how large a problem we could attack with the FFS.
Results from our host platforms are shown in Figure 8. Note that all timings include the overhead of checkpointing, the act of periodically backing up the working state so that processor downtime never results in total loss of the accumulated work; this action is performed at hourly intervals. Direct comparison between the two platforms is not very meaningful since the processor and communication architectures and compiler systems are significantly different. However, several key trends are evident that go some way to answering our original goals.

4.3 Analysis

Both sets of results show that the parallel solution offers significant performance improvements over the scalar case and that, given the solution time is only a matter of hours, we can easily solve much more complex systems, meaning systems larger in dimension or weight. It is additionally clear that our LOB pre-processing phase is effective in further improving solution speed by balancing the workload of each processor. In context, this shows that we can use the FFS to attack discrete logarithm problems over much larger fields and that attacking fields of a cryptographically interesting size should be possible in terms of the matrix step.

Following the description of Brent [7], we attempt to reason about the efficiency of our implementation in terms of the computation and communication speed of the host platforms. The computational cost will be dominated by matrix-vector product operations where the matrix has n x m elements and the vector m elements. Since a given row in the matrix will have $\gamma$ non-zero entries on average, we expect this operation to take $2 \gamma m$ multiplications since we are required to compute both normal and transposed products. Given that the Lanczos algorithm requires O(m) iterations, the total computational cost using P processors is

$$2 \alpha \gamma m^2 / P$$

for some constant $\alpha$ that models the speed of a single multiplication. Communication cost is dominated by the broadcast of each processor's contribution to the matrix-vector product described above. Each processor calculates m/P elements of the result and is required to communicate these to every other processor so they all hold enough information to continue processing; MPI implements this operation using an efficient binary tree broadcast method. We therefore estimate the cost as the product of the number of bits communicated and a constant $\delta$ that models the properties of a specific communication medium, giving

$$\delta m^2 \log_2 p.$$

Hence, we use $T_P$ to denote the total estimated execution time for a system with P processors:

$$T_P = 2 \alpha \gamma m^2 / P + \delta m^2 \log_2 p.$$

Ideally, we want to balance the computation and communication terms so that neither dominates the final execution time: we want to make full use of both the computation and communication bandwidth and hence minimise $T_P$. Equating the two terms suggests that the largest useful number of processors is roughly $P \approx 2 \alpha \gamma / (\delta \log_2 p)$, beyond which communication dominates. Applying SGE allows us to tune these features since, generally, longer runs of SGE mean a heavier input matrix, i.e. a larger value of $\gamma$, which is smaller in size, i.e. has a smaller value of m. In practice, previous work on systems generated by the NFS has shown that an ideal balance is hard to achieve due to the high cost of communication versus computation. In our case the computation cost is much higher, by around two orders of magnitude, since we are dealing with multi-precision arithmetic. Depending on the platform, an iteration of Lanczos takes roughly between 1 and 3 seconds to complete in all three test cases.
Given this fact, and since the estimated computation time is far less, there seems to be a massive burden on the communication medium and not enough per-processor computation to balance it. One way to redress this balance comes from the inevitably larger values of p as the field in which the DLP instance is posed grows in size. For these small example cases, the values of p were an order of magnitude less than our imposed upper bound of 1000 bits; one can naturally expect the cost of computation to increase as p grows larger. Improving the application of SGE is the other way to impose balance irrespective of the value of p. Our results imply that for P > 4 we were not aggressive enough in applying SGE, which would have increased $\gamma$ and decreased m to compensate for this imbalance. This problem was exacerbated in the workstation cluster by the cost of switching between local hubs of machines. Furthermore, since we are communicating reasonably large chunks of information, i.e. we have an extra factor of $\log_2 p$ over the NFS case, MPI will probably not deal with our workload as efficiently [28].

5 Conclusions

We have presented a practical investigation into the parallel solution of large, sparse linear systems defined over GF(p) where p is a large prime. We set this work in the context of solving the types of linear system produced by the FFS algorithm. By doing so, we proved the feasibility of such an approach, learnt several valuable lessons about the potential power of our composite FFS system and conjectured about how to further improve the performance of our Lanczos implementation. Perhaps the most significant finding is that we can comfortably manage much larger matrices than previously considered using sequential methods [1], and indeed larger values of p. Even though our results were somewhat sub-optimal in terms of scalability due to the SGE processing phase, our current test cases take only a few hours to solve. Given this fact, there is clearly enough capacity to process much more complex systems in the future, either larger in size or with greater density. In terms of the FFS, this means we can solve discrete logarithms in larger fields and impact the security of currently used cryptosystems.

Since this area was relatively unexplored previous to this work, there are some key areas where we can improve or extend our investigation:

- Better understanding and parameterisation of the SGE pre-processing stage will probably yield a better balance between the computational and communication costs involved. More specifically, it should improve the scalability of our implementation due to better utilisation of additional processors.
- In our implementation using MPI, we avoided the issue of processor topology and how one might improve performance by dictating this so as to improve communication speed. This was done to make the implementation more portable. However, the results show that communication performance characteristics are crucial to solution speed: it seems important to address this issue in further work.
- There are other methods of solving this sort of linear system besides the Lanczos algorithm. Specifically, it seems important to investigate and compare our results against those produced by the Wiedemann [30] algorithm running in a similar context.
- It would be interesting to study the possibility of massively parallel custom hardware with very small processor units, along the lines of work for the NFS in characteristic two by Geiselmann and Steinwandt [11], Bernstein [4] and Lenstra et al. [20].

6 Acknowledgements

The author would like to thank Nigel Smart, Frederik Vercauteren and Andrew Holt for useful discussions throughout this work, and the anonymous reviewers for their comments. In addition, he would like to thank Stephen Wiggins and Jason Hogan-O'Neill from the Laboratory for Advanced Computation in the Mathematical Sciences (LACMS) at the University of Bristol for allowing and overseeing access to the Beowulf cluster used for experimentation.

References

1. L.M. Adleman and M.A. Huang. Function Field Sieve Method for Discrete Logarithms Over Finite Fields. In Information and Computation, 151, 5-16, 1999.
2. R. Barrett, M. Berry, T.F. Chan, J. Demmel, J. Donato, J.J. Dongarra, V. Eijkhout, R. Pozo, C. Romine and H.A. van der Vorst. Templates for the Solution of Linear Systems: Building Blocks for Iterative Methods. Society for Industrial and Applied Mathematics (SIAM), 1994.
3. P.D. Barrett. Implementing the Rivest, Shamir and Adleman Public Key Encryption Algorithm on a Standard Digital Signal Processor. In Advances in Cryptology (CRYPTO), Springer-Verlag LNCS 263, 311-323, 1987.
4. D.J. Bernstein. Circuits for Integer Factorization: a Proposal. Available from: http://cr.yp.to/papers/nfscircuit.pdf
5. N.J. Boden, D. Cohen, R.E. Felderman, A.E. Kulawik, C.L. Seitz, J.N. Seizovic and W-K. Su. Myrinet: A Gigabit-per-Second Local-Area Network. In IEEE Micro, 15(1), 29-36, 1995.
6. R.P. Brent. Some Parallel Algorithms for Integer Factorisation. In Euro-Par, Springer-Verlag LNCS 1685, 1-22, 1999.
7. R.P. Brent. Recent Progress and Prospects for Integer Factorisation Algorithms. In Computing and Combinatorics (COCOON), Springer-Verlag LNCS 1858, 3-22, 2000.
8. S.H. Cavallar. On the Number Field Sieve Integer Factorisation Algorithm. PhD Thesis, University of Leiden, 2002.
9. J.J. Dongarra, I.S. Duff, D.C. Sorensen and H.A. van der Vorst. Solving Linear Systems on Vector and Shared Memory Computers. Society for Industrial and Applied Mathematics (SIAM), 1991.
10. I.S. Duff, A.M. Erisman and J.K. Reid. Direct Methods for Sparse Matrices. Oxford University Press, 1986.
11. W. Geiselmann and R. Steinwandt. Hardware to Solve Sparse Systems of Linear Equations over GF(2). In Cryptographic Hardware and Embedded Systems (CHES), Springer-Verlag LNCS 2779, 51-61, 2003.
12. R. Granger, A.J. Holt, D. Page, N.P. Smart and F. Vercauteren. Function Field Sieve in Characteristic Three. To appear in Algorithmic Number Theory Symposium (ANTS-VI), 2004.
13. W. Gropp, E. Lusk, N. Doss and A. Skjellum. A High-Performance, Portable Implementation of the MPI Message Passing Interface Standard. In Parallel Computing, 22(6), 789-828, 1996.
14. A.J. Holt and J.H. Davenport. Resolving Large Prime(s) Variants for Discrete Logarithm Computation. In Cryptography and Coding, Springer-Verlag LNCS 2898, 207-222, 2003.
15. A. Joux and R. Lercier. Improvements to the General Number Field Sieve for Discrete Logarithms in Prime Fields. In Mathematics of Computation, 72(242), 953-967, 2003.
16. E. Kaltofen and A. Lobo. Distributed Matrix-Free Solution of Large Sparse Linear Systems over Finite Fields. In Algorithmica, 22(3/4), 331-348, 1999.
17. B.A. LaMacchia and A.M. Odlyzko. Solving Large Sparse Linear Systems Over Finite Fields.
In Advances in Cryptology (CRYPTO), Springer-Verlag LNCS 537, 109-133, 1991.
18. C. Lanczos. Solution of Systems of Linear Equations by Minimized Iterations. In Journal of Research of the National Bureau of Standards, 49, 33-53, 1952.
19. A.K. Lenstra, H.W. Lenstra, M.S. Manasse and J.M. Pollard. The Number Field Sieve. In ACM Symposium on Theory of Computing, 564-572, 1990.
20. A.K. Lenstra, A. Shamir, J. Tomlinson and E. Tromer. Analysis of Bernstein's Factorization Circuit. In Advances in Cryptology (ASIACRYPT), Springer-Verlag LNCS 2501, 1-26, 2002.
21. A.J. Menezes, P.C. van Oorschot and S.A. Vanstone. Handbook of Applied Cryptography. CRC Press, 1997.
22. P.L. Montgomery. A Block Lanczos Algorithm for Finding Dependencies Over GF(2). In Advances in Cryptology (EUROCRYPT), Springer-Verlag LNCS 921, 106-120, 1995.
23. P.L. Montgomery. Distributed Linear Algebra. In Workshop on Elliptic Curve Cryptography (ECC), 2000.
24. P.L. Montgomery. Modular Multiplication Without Trial Division. In Mathematics of Computation, 44, 519-521, 1985.
25. MPI: A Message-Passing Interface Standard. In Journal of Supercomputer Applications, 8(3/4), 159-416, 1994.
26. NFSNET: Large Scale Integer Factoring. Housed at: http://www.nfsnet.org
27. C. Pomerance and J.W. Smith. Reduction of Huge, Sparse Matrices over Finite Fields Via Created Catastrophes. In Experimental Mathematics, 1(2), 89-94, 1992.
28. R. Rabenseifner. Optimization of Collective Reduction Operations. In Computational Science (ICCS), Springer-Verlag LNCS 3036, 1-9, 2004.
29. V. Shoup. NTL: A Library for doing Number Theory. Available from: http://www.shoup.net/ntl/
30. D.H. Wiedemann. Solving Sparse Linear Equations over Finite Fields. In IEEE Transactions on Information Theory, 32(1), 54-62, 1986.
31. L.T. Yang and R.P. Brent. The Parallel Improved Lanczos Methods for Integer Factorization over Finite Fields for Public Key Cryptosystems. In International Conference on Parallel Processing Workshops (ICPPW), IEEE Press, 106-111, 2001.