Computing Scalar Multiplication with Many Cores

Chen-Mou Cheng
Department of Electrical Engineering
National Taiwan University
Taipei, Taiwan
[email protected]

August 24, 2009


Acknowledgement

This is joint work with
- Daniel J. Bernstein, University of Illinois at Chicago, USA
- Tanja Lange, Technische Universiteit Eindhoven, the Netherlands
- Bo-Yin Yang, Academia Sinica, Taiwan
- Students
  - Hsueh-Chung Chen and Tien-Ren Chen (GPU)
  - Ming-Shing Chen and Chun-Hung Hsiao (x86 CPU)
  - Zong-Cing Lin (Cell)


Outline

- Performance walls for single-threaded processors
- How (many-core) graphics cards come to the rescue
- Computing scalar multiplication for ECM using many cores
- Performance results


Practical Side of Computing

- Moore's law in the semiconductor industry
  - Transistor budget doubles every 18-24 months
- In the past, it was used to increase processor clock frequency:

  Year  Processor  MHz
  1979  Z80           2
  1984  80286        10
  1989  80486        40
  1994  Pentium     100
  1999  Athlon      750
  2004  Pentium 4  3800
  2009  Core i7    3200


Why Stalled?

- Power wall
  - Power consumption increases as clock frequency increases
  - Gelsinger's prediction on the heat density of high-end processors:
    - Nuclear reactor by 2005
    - Rocket nozzle by 2010
    - Sun's surface by 2015
- Other "walls" stopping people from chasing after THz processors
  - Memory wall
    - Increasing gap between processor and memory speeds
    - Resulting in ever larger caches (>50% of the die in modern Intel processors!)
  - ILP wall
    - Increasingly difficult to exploit instruction-level parallelism
    - Out-of-order execution, speculative execution, branch prediction, ...


But the Grass Always Looks Greener on the Other Side

[Figure 1-1 from Chapter 1 of the CUDA Programming Guide: GPU vs. CPU single-precision floating-point performance over time. GPU series: NV30 = GeForce FX 5800, NV35 = GeForce FX 5950 Ultra, NV40 = GeForce 6800 Ultra, G70 = GeForce 7800 GTX, G71 = GeForce 7900 GTX, G80 = GeForce 8800 GTX, G92 = GeForce 9800 GTX, GT200 = GeForce GTX 280.]


Why Are GPUs So Fast?

- Massively parallel paradigm
  - "Many-core" processors
  - A better way of exploiting the transistor budget afforded by Moore's law
- Example: NVIDIA's G200b
  - 240 "cores," >1.4 billion transistors
  - 470 mm^2, TDP: 204 watts (TSMC 55 nm)
  - 1062.72 GFLOPS (single-precision), 159 GB/s memory bandwidth
    - Compared with 106.56 GFLOPS, 25.6 GB/s of an Intel i7 at 3.33 GHz
- Very, very cheap per GFLOPS, thanks to WoW gamers! :D


NVIDIA's GPU Hardware

Hardware overview: the number of MPs on the graphics card decides the computation power.

[Diagram: each multiprocessor (MP) contains a dispatcher, eight SPs, two SFUs, a DPU, registers, a shared memory, and a cache; many blocks run per MP; constant memory, texture memory, and global memory are shared by all MPs.]
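As a quick orientation to this hardware hierarchy, the MP count and per-MP resources can be queried at run time. The following is a minimal sketch (mine, not the talk's) using the standard CUDA runtime call cudaGetDeviceProperties:

  #include <stdio.h>
  #include <cuda_runtime.h>

  int main(void)
  {
      cudaDeviceProp prop;
      cudaGetDeviceProperties(&prop, 0);   /* query device 0 */
      printf("%s: %d MPs, %zu B shared memory/block, %zu MB global memory\n",
             prop.name, prop.multiProcessorCount,
             prop.sharedMemPerBlock, (size_t)(prop.totalGlobalMem >> 20));
      return 0;
  }

A GTX 280 reports 30 multiprocessors and a GTX 260 reports 24, matching the examples on the next slide.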
Hardware Design

- NVIDIA advertises streaming processors (SPs) as "cores"
- However, an x86 "core" is closer to a streaming multiprocessor (MP)
  - SPs on the same MP share a single dispatcher (instruction decoder)
  - Hence they must execute the same instruction (SIMD)
  - An SP is more like an ALU (arithmetic-logic unit)
- MP: the basic building block of NVIDIA's GPUs
  - Scalable design: simply put more MPs into higher-end GPUs
  - Examples: GTX 280: 30 MPs; GTX 260: 24 MPs


How to Program: CUDA and OpenCL

- GPU: data-parallel coprocessor to the CPU
  - CPU-GPU communication and synchronization via device memory
- CUDA: Compute Unified Device Architecture
  - NVIDIA's proprietary technology
  - Runs on NVIDIA 8-series and later GPUs
  - Contains compiler, debugger/profiler, run-time environment
- OpenCL: Open Computing Language
  - Industry standardization and extension of GPGPU
  - First draft looks very similar to CUDA
  - Runs on GPUs as well as CPUs
    - NVIDIA released an SDK for GPUs
    - AMD/ATI released an SDK for CPUs; will release a GPU SDK soon


CUDA Programming Model

- Hierarchical thread model: grids ⊃ blocks ⊃ warps ⊃ threads
  - A warp of 32 threads is the minimum execution unit
  - Only threads in a block can cooperate and synchronize efficiently
    - Via shared memory
    - Barrier synchronization
  - Each thread block must execute on the same MP; inter-block communication is inefficient
- Explicit caching


Why So Many Threads?

- Why are there 32 threads in a warp?
  - The dispatcher runs at roughly 25% of the speed of the SPs
  - Hence each MP should be regarded as a 32-way SIMD processor
  - The number may change in future GPUs
- Why not just run one single warp on each MP?
  - To exploit thread-level parallelism
  - Need 192 threads to completely hide arithmetic latency
  - Need more to hide device memory latency


Device Memory Latency Hiding

Example:

  for (;;) {
    x = devmem[addr];
    addr += offset;
    acc += x;
    x *= x;
    acc += x;
  }

- Assume the access latency of device memory is 200 cycles
- How many warps do we need to completely hide it?
  - Each warp instruction occupies an SP for 4 cycles (32 threads / 8 SPs)
  - Each memory access is followed by 4 compute instructions
  - Hence we need ⌈200/16⌉ = 13 warps (416 threads)


Ideal Applications

- Need a lot of independent threads of computation
- Example: elliptic curve method of integer factorization (ECM)
  - Computing scalar multiplications on (lots of) elliptic curves
  - Each point addition and doubling involves several arithmetic operations modulo the number to be factored
- Similarly, can also benefit Pollard's rho method
- Internet servers that need to compute lots of ECC signatures: maybe
  - Not so ideal for computing only a few scalar multiplications, e.g., client-side ECC
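Putting the loop and the warp arithmetic above together, here is a hedged CUDA sketch (not the talk's code) of the latency-hiding example written as a kernel; the array names, the steps/offset parameters, and the 416-thread block size are illustrative:

  #include <cuda_runtime.h>

  __global__ void accumulate(const float *devmem, float *out,
                             int steps, int offset)
  {
      int tid  = blockIdx.x * blockDim.x + threadIdx.x;
      int addr = tid;                    /* each thread walks its own chain */
      float acc = 0.0f;
      for (int i = 0; i < steps; i++) {
          float x = devmem[addr];        /* ~200-cycle device-memory load */
          addr += offset;                /* ...followed by 4 compute steps */
          acc += x;
          x *= x;
          acc += x;
      }
      out[tid] = acc;
  }

  /* 13 warps = 416 threads per block, per the estimate above; the caller
     must size devmem to cover addresses up to tid + (steps-1)*offset.   */
  /* accumulate<<<numBlocks, 416>>>(devmem, out, steps, offset);         */

While one warp waits for its load, the scheduler issues the independent compute instructions of the other twelve, which is the whole point of oversubscribing each MP with threads.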
Elliptic Curve Method of Integer Factorization

- ECM: an important factoring algorithm by itself
- Also critical to NFS for 1024-bit and larger numbers (finding ≈30-bit prime factors of lots of 100-300-bit numbers)
  - 5% of sieve time in the 1999 RSA-512 factorization
  - Steadily increasing for RSA-576, -640, and -663
  - 33% of sieve time in RSA-768 (linear algebra not yet finished)
  - Estimated to account for 50% of sieve time for RSA-1024
- Faster ECM will speed up NFS
  - Can increase its workload share and speed up NFS even more


Pollard's p - 1 Method

- Assume n = pq, where p - 1 is B1-powersmooth but q - 1 is not
- Let s = lcm(2, 3, 4, ..., B1); then (p - 1) | s but (q - 1) ∤ s
- Pick a random a and compute a^s
  - For every a, a^s ≡ 1 (mod p) by Fermat's little theorem
  - Conversely, we must have a^s ≢ 1 (mod q) for some a
  - In this case, gcd(a^s - 1, n) = p


Stage 1 ECM

- Generalization of Pollard's p - 1 method
  - From Z*_p to elliptic curve groups over F_p
- If a non-torsion point P on an elliptic curve E = E(Q) has B1-powersmooth order mod p but not mod q, then E lets us factor n
  - Take a rational function φ on E s.t. φ(O) = 0
  - Compute gcd(φ([s]P), n), where s = lcm(2, 3, 4, ..., B1)
- Can spend a bit more computation and execute "stage 2" to increase the probability of finding prime factors


Overall Design of CUDA EECM

- Use Edwards curves
- Scalar in non-adjacent form (NAF) with a large signed window
- GPU: multi-precision modular arithmetic coprocessor to the CPU
  - In the current range, schoolbook multiplication is faster than Karatsuba due to efficient use of the fused multiply-and-add instruction
  - Montgomery reduction (no trial divisions)


Edwards Curves

- x^2 + y^2 = 1 + d x^2 y^2
- (x1, y1) + (x2, y2) = ((x1 y2 + x2 y1) / (1 + d x1 x2 y1 y2), (y1 y2 - x1 x2) / (1 - d x1 x2 y1 y2))
  - Neutral point O is (0, 1)
  - -(x, y) = (-x, y)
- An elliptic curve in Weierstrass form is birationally equivalent to an Edwards curve if and only if it has an order-4 point
- Using homogeneous coordinates (X : Y : Z) representing (X/Z, Y/Z)
  - Mixed addition: 9M + 6a (Hisil et al., ASIACRYPT '08)
  - Doubling: 4S + 3M + 6a (Bernstein and Lange, ASIACRYPT '07)


Modular Multiplication

- Bottleneck operation of many cryptosystems
  - Special moduli: ECC over prime fields
  - General moduli: ECM, RSA
- Question: how fast can we do mulmods?
  - More interestingly, how fast on a PC?


Montgomery Reduction

- Euclid's division
  - y = qn + r, 0 ≤ r < n
- Hensel's division
  - y + qn = p^k r, 0 ≤ r < 2n, p prime
- Montgomery reduction
  - x ↦ p^k x mod n: a ring isomorphism if gcd(p, n) = 1
  - Precomputation: compute p′, n′ such that p^k p′ - n n′ = 1
  - q ← (y mod p^k) n′
  - q′ ← (q mod p^k) n
  - r ← (y + q′) / p^k


GPU-based Modular Arithmetic Coprocessor

- Design goals
  - Fast
  - Flexible
  - Scalable
- Challenges
  - Easy-to-use yet powerful interface
  - Efficiency/processor utilization
  - Synchronization
  - Resource management
  - ...


Detailed Design

- SIMD, RISC-like interface
  - Not suitable for scalar applications
  - Exposes memory latency to the programmer
- Task decomposition and assignment
  - Single-thread-does-it-all to minimize synchronization overhead
- Memory management
  - Operand compaction
  - Explicit caching
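To ground the Montgomery reduction recipe above, here is a minimal single-word C sketch (mine, not the talk's GPU code) with p = 2 and k = 64, so that "mod p^k" and "divide by p^k" become a truncation and a shift. The helper name mont_n0 is hypothetical, and n is assumed odd and below 2^63 so the 128-bit intermediate cannot overflow:

  #include <stdint.h>

  typedef unsigned __int128 u128;

  /* n' = -n^{-1} mod 2^64, via Newton iteration: each step doubles
     the number of correct low bits of the inverse. */
  static uint64_t mont_n0(uint64_t n)
  {
      uint64_t x = n;                    /* correct to 3 bits for odd n */
      for (int i = 0; i < 5; i++)
          x *= 2 - n * x;                /* now x = n^{-1} mod 2^64 */
      return (uint64_t)0 - x;            /* negate: -n^{-1} mod 2^64 */
  }

  /* Given y < n * 2^64, return y / 2^64 mod n (Hensel-style division). */
  static uint64_t montgomery_reduce(u128 y, uint64_t n, uint64_t n0)
  {
      uint64_t q = (uint64_t)y * n0;     /* q = (y mod 2^64) * n' mod 2^64 */
      u128 t = (y + (u128)q * n) >> 64;  /* low 64 bits cancel exactly */
      return (uint64_t)(t >= n ? t - n : t);
  }

A mulmod of a and b held in Montgomery form is then montgomery_reduce((u128)a * b, n, n0); the result stays in Montgomery form, so the p^k factor only needs to be stripped once, at the end of a scalar multiplication.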
Interface Design

- SIMD: Single Instruction, Multiple (Massive?) Data
  - Work on several thousand integers simultaneously
  - Enough threads of computation to fully occupy large GPUs
- RISC: Reduced Instruction Set Computer
  - All computation happens between "registers"
  - Explicit load/store operations for data movement
  - Exposes memory latency to the programmer


Example Code Fragment

This shows how to code with our modular arithmetic tools.

[Table: three columns -- Operation (pseudocode), Actual operation (a sequence of load, mul, and muladd-style instructions), and Working set (the registers kept live at each step).]


Task Decomposition and Assignment, First Try

- The EUROCRYPT '09 way
  - k threads cooperate in computing a k-limb multiplication
  - Each thread must read operands, multiply, read the accumulator, add the product to it, and write back to shared memory
  - Lots of headaches in solving race conditions
  - Lots of time wasted in memory I/O and synchronization
  - Need to hand-optimize thread organization for different k's


Task Decomposition and Assignment, Second Try

- The new, improved way
  - One thread to do them all
  - Minimizes synchronization overhead
  - Maximizes the compute-to-memory ratio
  - Automatic generation and tuning of code for various modulus sizes
- However, this is like dumping garbage into your neighbor's backyard
  - Now using k times more memory per thread
  - The challenge goes to memory management


Memory Management

- Storage compaction
- Double buffering
- Explicit cache management
  - Use shared memory as a cache of device memory
  - Can store many pre-computed points
  - Allows use of a large window in NAF
  - Cons: more programming complications


Explicit Cache Management

[Diagram: 8 SPs per MP; registers and shared memory are fast, global memory is slow.]

Challenges: avoiding bank conflicts, avoiding uncoalesced global-memory access, dealing with register scope limitations, hiding memory latency via pre-fetching, keeping data in the working set, etc.


Performance Results

- Curves per second
- Million mulmods per second
- Price-performance ratio
- Performance per watt


Performance Comparison

[Figure 4 from "The Billion-Mulmod-Per-Second PC": log-scale plot of stage-1 ECM throughput (curves/second, 100 to 10000) against the number of bits in the modulus (100 to 400) for GTX 295, K10+, C2+, PS3, QS22, and the EUROCRYPT 2009 implementation.]


Performance Comparison

                          GTX 295   K10+    C2+     QS22   PS3    [1]
  #cores                  480       4       4       16     6      480
  clock (MHz)             1242      3000    2830    3200   3200   1242
  price (USD)             470       170     220     $$$    413    470
  TDP (watts)             295       95      95      200    <100   295
  GFLOPS                  1192      6+24    3+23    204    154    1192
  #threads                46080     48+16   48+16   160    6      -
  #bits in moduli         210       192     192     192    195    280
  #limbs                  15        3+7     3+7     8      15     28
  window size (bits)      u6        u6      u6      s5     s5     s4
  mulmods (10^6/sec)      481       202     114     334    102    42
  curves (1/sec)          4928      1881    1110    3120   1010   401
  curves (1/sec, scaled)  5895      1881    1881    3120   1042   853

[1] Bernstein et al. "ECM on graphics cards." EUROCRYPT 2009.
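As a concrete picture of the "one thread does it all" decomposition above (and of why schoolbook multiplication suits the GPU's multiply-and-add), here is a hedged CUDA sketch of the multiplication step only; the 7 x 28-bit limb layout and all names are illustrative, and carry propagation plus the Montgomery reduction are elided:

  #include <stdint.h>

  #define K 7   /* 7 limbs of 28 bits each cover ~192-bit operands */

  __global__ void mul_batch(const uint32_t *a, const uint32_t *b, uint64_t *r)
  {
      int i = blockIdx.x * blockDim.x + threadIdx.x;  /* one product/thread */
      uint32_t x[K], y[K];
      uint64_t z[2 * K] = {0};

      for (int j = 0; j < K; j++) {      /* fetch this thread's operands */
          x[j] = a[i * K + j];
          y[j] = b[i * K + j];
      }
      /* Schoolbook product: each inner step is one multiply-and-add.
         With 28-bit limbs each partial product is < 2^56, so seven of
         them fit in a 64-bit accumulator without intermediate carries. */
      for (int j = 0; j < K; j++)
          for (int l = 0; l < K; l++)
              z[j + l] += (uint64_t)x[j] * y[l];

      for (int j = 0; j < 2 * K; j++)    /* un-normalized result limbs; */
          r[i * 2 * K + j] = z[j];       /* reduction would follow here */
  }

No __syncthreads is needed anywhere: threads never share data, which is exactly the synchronization saving the "second try" slide claims.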
Software Thread Integration

- Modern CPUs tend to be underutilized
  - Wide arithmetic pipelines
  - Vector instruction engines
  - Superscalar, i.e., can issue multiple independent instructions per cycle
- How to squeeze out the last bits of performance?
  - Run many "threads" of execution simultaneously
  - Exploit as much of the available circuitry as possible
  - Similar to why we must have so many threads on a GPU
- Result: 22% speed-up on AMD CPUs
  - Compared with Kleinjung's code using 64-bit integer multiplications
  - Also works on the Cell (240%)
  - Doesn't work so well for Intel CPUs (<10%)


The Billion-Mulmods-Per-Second PC

Buy parts, say, from NewEgg.COM (8/24/2009 2PM):

  ITEM   USD   Description                    Notes
  CPU    170   AMD Phenom II 945 (3.0 GHz)    retail
  MB     170   ASUS M4N82 Deluxe              K10+, ECC-capable
  RAM    140   4x DDR2-800 ECC Kingston 2GB
  GPUs   940   2x PNY 1792MB NVIDIA GTX 295
  Case   360   Supermicro 4U with 865W PS     5x fans
  HDD    100   2x Seagate SATA II 320GB       RAID 1

Total: 1880 USD for 1.3 billion 192-bit mulmods per second


Concluding Remarks

- Many-core processors are becoming mainstream
  - Cheap, available off-the-shelf
  - Good for throughput-oriented computations
- Requires a different kind of thinking from sequential programming
- Watch out for memory management
  - "All programming is an exercise in caching"


Thanks for Listening!

Questions or comments?