Computing Scalar Multiplication with Many Cores
Chen-Mou Cheng
Department of Electrical Engineering
National Taiwan University
Taipei, Taiwan
[email protected]
August 24, 2009
Acknowledgement
This is a joint work with
- Daniel J. Bernstein, University of Illinois at Chicago, USA
- Tanja Lange, Technische Universiteit Eindhoven, the Netherlands
- Bo-Yin Yang, Academia Sinica, Taiwan
- Students
  - Hsueh-Chung Chen and Tien-Ren Chen (GPU)
  - Ming-Shing Chen and Chun-Hung Hsiao (x86 CPU)
  - Zong-Cing Lin (Cell)
Outline
Performance walls for single-threaded processors
How (many-core) graphics cards come to the rescue
Computing scalar multiplication for ECM using many cores
Performance results
Practical Side of Computing
Moore’s law in the semiconductor industry
- Transistor budget doubles every 18–24 months
- In the past, it has been used to increase processor clock frequency:

    Year  Processor   MHz
    1979  Z80            2
    1984  80286         10
    1989  80486         40
    1994  Pentium      100
    1999  Athlon       750
    2004  Pentium 4   3800
    2009  Core i7     3200
Why Stalled?
Power wall
- Power consumption increases as clock frequency increases
- Gelsinger’s prediction on heat density of high-end processors:
  - Nuclear reactor by 2005
  - Rocket nozzle by 2010
  - Sun’s surface by 2015
Other “walls” stopping people from chasing after THz processors
- Memory wall
  - Increasing gap between processor and memory speeds
  - Resulting in ever larger caches (>50% of die area in modern Intel processors!)
- ILP wall
  - Increasingly difficult to exploit instruction-level parallelism
  - Out-of-order execution, speculative execution, branch prediction, ...
But the Grass Always Looks Greener on the Other Side

[Figure 1-1 from the NVIDIA CUDA Programming Guide: the GPU has evolved
into a processor with tremendous computational horsepower and very high
memory bandwidth. Peak single-precision GFLOPS of NVIDIA GPUs vs. Intel
CPUs (3.0 GHz Core2 Duo, 3.2 GHz Harpertown), 2003–2008. GPU data points:
NV30 = GeForce FX 5800, NV35 = GeForce FX 5950 Ultra,
NV40 = GeForce 6800 Ultra, G70 = GeForce 7800 GTX,
G71 = GeForce 7900 GTX, G80 = GeForce 8800 GTX,
G92 = GeForce 9800 GTX, GT200 = GeForce GTX 280.]
Why Are GPUs So Fast?
Massively parallel paradigm
- “Many-core” processors
- A better way of exploiting the transistor budget afforded by Moore’s law
Example: NVIDIA’s G200b
- 240 “cores,” >1.4 billion transistors
- 470 mm², TDP: 204 watts (TSMC 55 nm)
- 1062.72 GFLOPS (single-precision), 159 GB/s memory bandwidth
  - Compared with 106.56 GFLOPS, 25.6 GB/s for an Intel Core i7 at 3.33 GHz
- Very, very cheap per GFLOPS, thanks to WoW gamers! :D
NVIDIA’s GPU Hardware
Hardware overview
- The number of SPs on the graphics card decides the computation power

[Figure: block diagram of a GPU. Each multiprocessor (MP) contains eight
SPs plus SFUs, a dispatcher, registers, shared memory, and a cache; many
blocks run per MP; all MPs access global, constant, and texture memory.]
Hardware Design
NVIDIA advertises streaming processors (SPs) as “cores”
However, an x86 “core” is closer to a streaming multiprocessor (MP)
- SPs on the same MP share a single dispatcher (instruction decoder)
- Hence they must execute the same instruction (SIMD)
- SPs are more like ALUs (arithmetic-logic units)
MP: the basic building block of NVIDIA’s GPUs
- Scalable design: simply put more MPs into higher-end GPUs
- Examples: GTX 280: 30 MPs (240 SPs); GTX 260: 24 MPs
How to Program: CUDA and OpenCL
GPU: a data-parallel coprocessor to the CPU
- CPU-GPU communication and synchronization via device memory
CUDA: Compute Unified Device Architecture
- NVIDIA’s proprietary technology
- Runs on NVIDIA 8-series and later GPUs
- Contains compiler, debugger/profiler, and run-time environment
OpenCL: Open Computing Language
- Industry standardization and extension of GPGPU
- First draft looks very similar to CUDA
- Runs on GPUs as well as CPUs
  - NVIDIA released an SDK for GPUs
  - AMD/ATI released an SDK for CPUs; will release a GPU SDK soon
CUDA Programming Model
Hierarchical thread model: Grids ⊃ Blocks ⊃ Warps ⊃ Threads
- Warp = 32 threads: the minimum execution unit
- Block = up to 64 warps: only threads in a block can cooperate and
  synchronize efficiently
  - Via shared memory, barrier synchronization, and explicit caching
  - Each thread block must execute on the same MP
- Grid = many blocks: interconnection between blocks is inefficient
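To make the hierarchy concrete, here is a minimal CUDA kernel (an
illustration only, not the EECM code): each thread handles one array
element, and the launch configuration picks the grid and block sizes.

    // Illustrative CUDA kernel: a grid of blocks of threads,
    // one array element per thread.
    __global__ void axpy(float a, const float *x, float *y, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
        if (i < n) y[i] = a * x[i] + y[i];
    }

    // Launch with 128 blocks of 256 threads (256 threads = 8 warps/block):
    //   axpy<<<128, 256>>>(2.0f, d_x, d_y, n);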
Why So Many Threads?
Why are there 32 threads in a warp?
- The dispatcher runs at roughly 25% of the speed of the SPs
- Hence each MP should be regarded as a 32-way SIMD processor
  (8 SPs × 4 cycles per warp instruction)
- The number may change in future GPUs
Why not just run one single warp on each MP?
- To exploit thread-level parallelism
- Need 192 threads to completely hide arithmetic latency
- Need more to hide device memory latency
Device Memory Latency Hiding Example
for (;;) {
    x = devmem[addr];            /* load from device memory */
    addr += offset;              /* compute: add             */
    acc += x; x *= x; acc += x;  /* compute: add, mul, add   */
}

Assume the access latency of device memory is 200 cycles.
How many warps do we need to completely hide it?
- Each instruction occupies the 8 SPs for 4 cycles per warp (32 threads)
- Each memory access is followed by 4 compute instructions,
  i.e., 16 cycles of work per warp
- Hence we need ⌈200/16⌉ = 13 warps (416 threads)
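The same back-of-the-envelope calculation in code (a sketch; the
constants are exactly the ones assumed above):

    #include <stdio.h>

    int main(void) {
        const unsigned latency   = 200; /* device memory latency, cycles   */
        const unsigned compute   = 4;   /* compute instructions per load   */
        const unsigned per_instr = 4;   /* cycles per warp instruction
                                           (32 threads / 8 SPs)            */
        unsigned covered = compute * per_instr;              /* 16 cycles  */
        unsigned warps = (latency + covered - 1) / covered;  /* ceil = 13  */
        printf("%u warps = %u threads\n", warps, warps * 32); /* 416       */
        return 0;
    }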
Ideal Applications
Need a lot of independent threads of computation
Example: elliptic curve method of integer factorization (ECM)
- Computing scalar multiplications on (lots of) elliptic curves
- Each point addition and doubling involves several arithmetic
  operations modulo the number to be factored
Similarly, can also benefit Pollard’s rho method
Internet servers that need to compute lots of ECC signatures: maybe
- Not so ideal for computing only a few scalar multiplications,
  e.g., client-side ECC
Elliptic Curve Method of Integer Factorization
ECM: an important factoring algorithm by itself
Also critical to 1024+-bit NFS (finding ≈30-bit prime factors in lots
of 100–300-bit numbers)
- 5% of sieve time in the 1999 RSA-512 factorization
- Steadily increasing for RSA-576, 640, and 663
- 33% of sieve time in RSA-768 (linear algebra not yet finished)
- Estimated to account for 50% of sieve time for RSA-1024
Faster ECM will speed up NFS
- Can increase ECM’s workload share and speed up NFS even more
Pollard’s p − 1 Method
Assume n = pq, where p − 1 is B1-powersmooth but q − 1 is not
Let s = lcm(2, 3, 4, ..., B1); then (p − 1) | s but (q − 1) ∤ s
Pick a random a and compute a^s
- For every a coprime to p, a^s ≡ 1 (mod p) by Fermat’s little
  theorem, since (p − 1) | s
- Conversely, we must have a^s ≢ 1 (mod q) for some a, since (q − 1) ∤ s
- In this case, gcd(a^s − 1, n) = p
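A toy illustration in C, using 64-bit arithmetic only (real ECM/NFS
code is multi-precision); the parameters p = 8191 and q = 8209 are
chosen so that p − 1 = 2·3²·5·7·13 is 13-powersmooth while
q − 1 = 2⁴·3³·19 is not:

    #include <stdint.h>
    #include <stdio.h>

    static uint64_t mulmod(uint64_t a, uint64_t b, uint64_t n) {
        return (uint64_t)((unsigned __int128)a * b % n);
    }

    static uint64_t powmod(uint64_t a, uint64_t e, uint64_t n) {
        uint64_t r = 1;
        for (a %= n; e; e >>= 1) {
            if (e & 1) r = mulmod(r, a, n);
            a = mulmod(a, a, n);
        }
        return r;
    }

    static uint64_t gcd64(uint64_t a, uint64_t b) {
        while (b) { uint64_t t = a % b; a = b; b = t; }
        return a;
    }

    int main(void) {
        uint64_t n = 8191ULL * 8209ULL;  /* 67239919 */
        uint64_t s = 360360;             /* lcm(2,...,13) = 2^3*3^2*5*7*11*13 */
        uint64_t g = gcd64(powmod(2, s, n) - 1, n);
        printf("gcd(2^s - 1, n) = %llu\n", (unsigned long long)g); /* 8191 */
        return 0;
    }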
Stage 1 ECM
Generalization of Pollard’s p − 1 method
- From Z*_p to elliptic curve groups over F_p
If a non-torsion point P on an elliptic curve E = E(Q) has
B1-powersmooth order mod p but not mod q, then E factors n
- Take a rational function φ on E s.t. φ(O) = 0
- Compute gcd(φ([s]P), n), where s = lcm(2, 3, 4, ..., B1)
Can spend a bit more computation and execute “stage 2” to increase
the probability of finding prime factors
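Stage 1 never needs s as one huge integer: s = lcm(2, ..., B1) is the
product over primes p ≤ B1 of the largest power p^e ≤ B1, so the point
can be multiplied by one prime power at a time. A sketch (scalar_mul is
a hypothetical stand-in for the curve arithmetic):

    #include <stdint.h>
    #include <stdio.h>

    int main(void) {
        const uint32_t B1 = 50;
        for (uint32_t p = 2; p <= B1; p++) {
            int prime = 1;
            for (uint32_t d = 2; d * d <= p; d++)
                if (p % d == 0) { prime = 0; break; }
            if (!prime) continue;
            uint64_t pe = p;
            while (pe * p <= B1) pe *= p;  /* largest power of p that is <= B1 */
            printf("multiply point by %llu\n", (unsigned long long)pe);
            /* P = scalar_mul(P, pe);  -- hypothetical curve routine */
        }
        return 0;
    }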
Overall Design of CUDA EECM
Use Edwards curves
Scalar in non-adjacent form (NAF) with a large signed window
(see the recoding sketch below)
GPU: multi-precision modular arithmetic coprocessor to the CPU
- In the current range, schoolbook multiplication is faster than
  Karatsuba due to efficient use of the fused multiply-and-add instruction
- Montgomery reduction (no trial divisions)
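A minimal sketch of the signed-window (width-w NAF) recoding mentioned
above: digits are odd and lie in (−2^(w−1), 2^(w−1)), so at most one of
any w consecutive digits is nonzero and only a few odd multiples of the
point need precomputing (w = 4 and k = 177 are just examples):

    #include <stdint.h>
    #include <stdio.h>

    int main(void) {
        uint64_t k = 177;                 /* example scalar */
        const int w = 4;                  /* window width */
        int digit[68], len = 0;
        while (k) {
            int d = 0;
            if (k & 1) {
                d = (int)(k & ((1u << w) - 1));      /* k mod 2^w */
                if (d >= 1 << (w - 1)) d -= 1 << w;  /* center into (-2^(w-1), 2^(w-1)) */
                k -= (uint64_t)(int64_t)d;           /* k is now divisible by 2^w */
            }
            digit[len++] = d;
            k >>= 1;
        }
        /* 177 -> 1 0 0 0 -5 0 0 0 1, i.e., 177 = 256 - 5*16 + 1 */
        for (int i = len - 1; i >= 0; i--) printf("%d ", digit[i]);
        printf("\n");
        return 0;
    }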
Edwards Curves
x^2 + y^2 = 1 + d x^2 y^2
- (x1, y1) + (x2, y2) =
  ((x1y2 + x2y1) / (1 + d x1x2y1y2), (y1y2 − x1x2) / (1 − d x1x2y1y2))
- Neutral point O is (0, 1)
- −(x, y) = (−x, y)
An elliptic curve in Weierstrass form is birationally equivalent to an
Edwards curve if and only if it has a point of order 4
Using homogeneous coordinates (X : Y : Z) with (x, y) = (X/Z, Y/Z)
- Mixed addition: 9M+6a (Hisil et al., ASIACRYPT ’08)
- Doubling: 4S+3M+6a (Bernstein and Lange, ASIACRYPT ’07)
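A minimal affine sketch of the addition law over a small prime field
(toy parameters P and D; real EECM works projectively modulo the
composite n precisely to avoid these field inversions):

    #include <stdint.h>
    #include <stdio.h>

    static const uint64_t P = 1000003;  /* toy prime */
    static const uint64_t D = 5;        /* arbitrary curve parameter d */

    static uint64_t addm(uint64_t a, uint64_t b) { return (a + b) % P; }
    static uint64_t subm(uint64_t a, uint64_t b) { return (a + P - b) % P; }
    static uint64_t mulm(uint64_t a, uint64_t b) { return a * b % P; }

    static uint64_t invm(uint64_t a) {  /* a^(P-2) mod P, Fermat */
        uint64_t r = 1, e = P - 2;
        for (; e; e >>= 1) {
            if (e & 1) r = mulm(r, a);
            a = mulm(a, a);
        }
        return r;
    }

    /* (x1,y1) + (x2,y2) on x^2 + y^2 = 1 + D x^2 y^2 */
    static void ed_add(uint64_t x1, uint64_t y1, uint64_t x2, uint64_t y2,
                       uint64_t *x3, uint64_t *y3) {
        uint64_t t = mulm(D, mulm(mulm(x1, x2), mulm(y1, y2)));
        *x3 = mulm(addm(mulm(x1, y2), mulm(x2, y1)), invm(addm(1, t)));
        *y3 = mulm(subm(mulm(y1, y2), mulm(x1, x2)), invm(subm(1, t)));
    }

    int main(void) {
        uint64_t x, y;
        ed_add(12345, 67890, 0, 1, &x, &y);  /* adding O = (0,1) changes nothing */
        printf("(%llu, %llu)\n", (unsigned long long)x, (unsigned long long)y);
        return 0;
    }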
Modular Multiplication
The bottleneck operation of many cryptosystems
- Special moduli: ECC over prime fields
- General moduli: ECM, RSA
Question: how fast can we do mulmods?
- More interestingly: how fast on a commodity PC?
Montgomery Reduction
Euclid’s division
- y = qn + r, 0 ≤ r < n
Hensel’s division
- y + qn = p^k r, 0 ≤ r < 2n, p prime
Montgomery reduction
- x ↦ p^k x mod n: a ring isomorphism if gcd(p, n) = 1
- Precomputation: compute p′, n′ such that p^k p′ − nn′ = 1
- q ← (y mod p^k) n′
- q′ ← (q mod p^k) n
- r ← (y + q′) / p^k
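A word-level sketch with p = 2 and k = 64 (so R = 2^64), for a
single-word odd modulus n < 2^63; the GPU code applies the same idea
limb by limb to multi-precision numbers:

    #include <stdint.h>
    #include <stdio.h>

    /* n' = -n^{-1} mod 2^64 via Newton iteration (valid for odd n) */
    static uint64_t mont_ninv(uint64_t n) {
        uint64_t x = n;              /* n*n = 1 mod 8, so x is correct to 3 bits */
        for (int i = 0; i < 5; i++)  /* precision doubles each step: 3->6->...->96 */
            x *= 2 - n * x;
        return ~x + 1;               /* negate: we want -n^{-1} */
    }

    /* REDC: given y < n^2, return y * 2^{-64} mod n */
    static uint64_t redc(unsigned __int128 y, uint64_t n, uint64_t np) {
        uint64_t q = (uint64_t)y * np;  /* q = (y mod R) n' mod R */
        unsigned __int128 t = (y + (unsigned __int128)q * n) >> 64; /* exact /R */
        return t >= n ? (uint64_t)(t - n) : (uint64_t)t;
    }

    int main(void) {
        uint64_t n = 67239919;   /* the toy modulus from the p-1 example */
        uint64_t np = mont_ninv(n);
        uint64_t a = 123456789 % n, b = 987654321 % n;
        uint64_t m = redc((unsigned __int128)a * b, n, np); /* a*b*R^{-1} mod n */
        /* check: m * R mod n equals a * b mod n */
        uint64_t lhs = (uint64_t)(((unsigned __int128)m << 64) % n);
        uint64_t rhs = (uint64_t)((unsigned __int128)a * b % n);
        printf("%llu %llu\n", (unsigned long long)lhs, (unsigned long long)rhs);
        return 0;
    }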
GPU-based Modular Arithmetic Coprocessor
Design goals
- Fast
- Flexible
- Scalable
Challenges
- Easy-to-use yet powerful interface
- Efficiency/processor utilization
- Synchronization
- Resource management
- ...
Detailed Design
SIMD, RISC-like interface
- Not suitable for scalar applications
- Exposes memory latency to the programmer
Task decomposition and assignment
- Single-thread-does-it-all to minimize synchronization overhead
Memory management
- Operand compaction
- Explicit caching
Interface Design
SIMD: Single Instruction, Multiple (Massive?) Data
- Work on several thousand integers simultaneously
- Enough threads of computation to fully occupy large GPUs
RISC: Reduced Instruction Set Computer
- All computation happens between “registers”
- Explicit load/store operations for data movement
- Exposes memory latency to the programmer
Example Code Fragment

This shows how to code with our modular arithmetic tool.

[Table garbled in extraction: a short sequence of expressions is
translated into explicit load and muladd instructions on registers,
with the working set of live operands tracked after each instruction.
Columns: Operation | Actual operations | Working set.]
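Since the fragment above did not survive extraction, here is a
self-contained, purely hypothetical sketch of such a sequence (all
names are invented): each instruction applies to one register slot
across the whole batch, and the programmer tracks the working set.

    #include <stdint.h>
    #include <stdio.h>

    #define BATCH 4  /* thousands on the real GPU; 4 keeps the demo small */
    #define NREG  8

    static uint32_t reg[NREG][BATCH];

    static void vm_load(int r, const uint32_t *src) {
        for (int i = 0; i < BATCH; i++) reg[r][i] = src[i];
    }
    static void vm_muladd(int d, int a, int b) {  /* d += a*b, elementwise */
        for (int i = 0; i < BATCH; i++) reg[d][i] += reg[a][i] * reg[b][i];
    }
    static void vm_store(uint32_t *dst, int r) {
        for (int i = 0; i < BATCH; i++) dst[i] = reg[r][i];
    }

    int main(void) {
        uint32_t x[BATCH] = {1, 2, 3, 4}, y[BATCH] = {5, 6, 7, 8}, out[BATCH];
        uint32_t zero[BATCH] = {0, 0, 0, 0};
        vm_load(0, x);       /* working set: r0     */
        vm_load(1, y);       /* working set: r0, r1 */
        vm_load(2, zero);
        vm_muladd(2, 0, 1);  /* r2 += r0 * r1       */
        vm_store(out, 2);
        for (int i = 0; i < BATCH; i++) printf("%u ", out[i]); /* 5 12 21 32 */
        printf("\n");
        return 0;
    }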
Task Decomposition and Assignment, First Try
The EUROCRYPT ’09 way
- k threads cooperate in computing a k-limb multiplication
- Each thread must read operands, multiply, read the accumulator, add
  the product to it, and write back to shared memory
- Lots of headaches in solving race conditions
- Lots of time wasted on memory I/O and synchronization
- Need to hand-optimize thread organization for different k’s
Task Decomposition and Assignment, Second Try
The new, improved way
- One thread to do them all
- Minimizes synchronization overhead
- Maximizes the compute-to-memory ratio
- Automatic generation and tuning of code for various modulus sizes
However, this is like dumping garbage into your neighbor’s backyard
- Now using k times more memory per thread
- The challenge moves to memory management
Memory Management
Storage compaction
Double buffering (see the sketch below)
Explicit cache management
- Use shared memory as a cache for device memory
- Can store many pre-computed points
- Allows use of a large window in NAF
- Cons: more programming complications
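A sketch of the double-buffering idea in plain C, assuming fetch
stands in for an asynchronous copy into shared memory: the next chunk
is fetched while the current one is consumed, so memory traffic and
compute overlap on real hardware.

    #include <stdio.h>
    #include <string.h>

    #define CHUNK 4

    static const int data[] = {1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12};

    static void fetch(int *buf, int chunk) {  /* stand-in for an async copy */
        memcpy(buf, data + chunk * CHUNK, CHUNK * sizeof(int));
    }

    int main(void) {
        int buf[2][CHUNK], sum = 0;
        int nchunks = (int)(sizeof(data) / sizeof(data[0])) / CHUNK;
        fetch(buf[0], 0);
        for (int i = 0; i < nchunks; i++) {
            if (i + 1 < nchunks) fetch(buf[(i + 1) & 1], i + 1); /* prefetch */
            for (int j = 0; j < CHUNK; j++) sum += buf[i & 1][j]; /* compute */
        }
        printf("%d\n", sum);  /* 78 */
        return 0;
    }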
!"#$$%&'%(!"!#$%&'&()!*#(+!,%(-&,./!#$%&'&()!0(,%#123,2'!)1%
Explicit Cache Management
424%56!#,,233/!'2#1&()!7&.8!52)&3.25!3,%92!1&4&.#:%(/!424%
1#.2(,6!8&'&()!$&#!952;<2.,8&()/!=>()!&(!.82!7%5+&()!32./?/!2.
AB!
AB!
AB!
AB!
AB!
AB!
AB!
AB!
Fast!
52)&3.25!
Slow!
38#52'!424%56!
)1%*#1!424%56!
C.-M. Cheng (NTU)
CUDA EECM
August 24
30 / 37
Performance Results
Curves per second
Million mulmods per second
Price-performance ratio
Performance per watt
Performance Comparison

[Fig. 4 of the paper: stage-1 ECM throughput (curves/second, log scale
from 100 to 10000) versus the number of bits in the modulus (100–400),
comparing GTX 295, K10+, C2+, PS3, QS22, and the EUROCRYPT 2009
implementation.]
Performance Comparison

                            GTX 295   K10+    C2+     QS22   PS3    [1]
    #cores                  480       4       4       16     6      480
    clock (MHz)             1242      3000    2830    3200   3200   1242
    price (USD)             470       170     220     $$$    413    470
    TDP (watts)             295       95      95      200    <100   295
    GFLOPS                  1192      6+24    3+23    204    154    1192
    #threads                46080     48+16   48+16   160    6
    #bits in moduli         210       192     192     192    195    280
    #limbs                  15        3+7     3+7     8      15     28
    window size (bits)      u6        u6      u6      s5     s5     s4
    mulmods (10^6/sec)      481       202     114     334    102    42
    curves (1/sec)          4928      1881    1110    3120   1010   401
    curves (1/sec, scaled)  5895      1881    1110    3120   1042   853

[1] Bernstein et al. “ECM on graphics cards.” EUROCRYPT 2009.
Software Thread Integration
Modern CPUs tend to be underutilized
- Wide arithmetic pipelines
- Vector instruction engines
- Superscalar, i.e., can issue multiple independent instructions per cycle
How to squeeze out the last bits of performance?
- Run many “threads” of execution simultaneously (sketched below)
- Exploit as much of the available circuitry as possible
- Similar to why we must have so many threads on a GPU
Result: 22% speed-up on AMD CPUs
- Compared with Kleinjung’s code using 64-bit integer multiplications
- Also works on Cell (240%)
- Doesn’t work so well for Intel CPUs (<10%)
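A sketch of the interleaving idea (constants illustrative): two
independent accumulator “threads” inside one loop give the superscalar
core independent instructions to overlap in its pipelines.

    #include <stdint.h>
    #include <stdio.h>

    int main(void) {
        uint64_t m = 1000003;
        uint64_t acc0 = 1, acc1 = 1;
        /* two independent "threads" of mulmods in one instruction stream */
        for (uint64_t i = 1; i <= 1000; i++) {
            acc0 = acc0 * (i + 1) % m;  /* thread 0 */
            acc1 = acc1 * (i + 3) % m;  /* thread 1: independent of acc0 */
        }
        printf("%llu %llu\n", (unsigned long long)acc0,
                              (unsigned long long)acc1);
        return 0;
    }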
The Billion-mulmods-per-second PC
Buy parts, say, from NewEgg.COM (8/24/2009, 2 PM):

    ITEM   USD   Description                       Notes
    CPU    170   AMD Phenom II 945 (3.0 GHz)       retail K10+
    MB     170   ASUS M4N82 Deluxe                 ECC-capable
    RAM    140   4x DDR2-800 ECC Kingston 2GB
    GPUs   940   2x PNY 1792MB NVIDIA GTX 295
    Case   360   Supermicro 4U with 865W PS        5x fans
    HDD    100   2x Seagate SATA II 320GB          RAID 1

Total: 1880 USD for 1.3 billion 192-bit mulmods per second
Concluding Remarks
Many-core processors are becoming mainstream
- Cheap, available off-the-shelf
- Good for throughput-oriented computations
Requires a different kind of thinking from sequential programming
Watch out for memory management
- “All programming is an exercise in caching”
Thanks for Listening!
Questions or comments?