Using GPUs to Accelerate the Bisection Algorithm for Finding
Eigenvalues of Symmetric Tridiagonal Matrices

LAPACK Working Note 197
Vasily Volkov
Computer Science Division
University of California at Berkeley

James W. Demmel
Computer Science Division and Department of Mathematics
University of California at Berkeley
Abstract
Graphical Processing Units (GPUs) potentially promise widespread and inexpensive high performance computation. However, architectural limitations (only some operations and memory access patterns can be performed quickly, partial support for IEEE floating point arithmetic) make it necessary to change existing algorithms to attain high performance and correctness. Here we show how to make the bisection algorithm for eigenvalues of symmetric tridiagonal matrices (sstebz from LAPACK) run both fast and correctly on an ATI Radeon X1900 GPU. Our fastest algorithm takes up to 156× less time than Intel's Math Kernel Library version of sstebz running on the CPU, but does so by doing many redundant floating point operations compared to the CPU version. We use an automatic tuning procedure analogous to ATLAS or PHiPAC to decide the optimal redundancy. Correctness despite partial IEEE floating point semantics required explicitly adding 0 in the inner loop. The problems and solutions discussed here are of interest on other GPU architectures.

    function Count(x)
        Count = 0
        d = 1
        for i = 1 to n
            d = a_i − x − b_{i−1}^2 / d
    (*)     if (d < 0) then Count = Count + 1
        endfor

Figure 1: The kernel of the bisection algorithm. Function Count(x) may be evaluated for an array of arguments concurrently.
Let T be an n × n symmetric tridiagonal matrix with diagonals a_1, …, a_n and off-diagonals b_1, …, b_{n−1}. For the convenience of presentation, let b_0 = b_n = 0. Then the algorithm Count(x) in Fig. 1 implements the LDL^T decomposition of T − xI without pivoting and counts the number of negative entries in the diagonal matrix D. According to Sylvester's inertia theorem it gives the number of eigenvalues of T that are less than the real number x.
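The recurrence of Fig. 1 can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the names `count`, `a` and `b` are chosen here, and the guard against an exact zero pivot is an assumption added so that the sketch runs in Python, where division by zero raises an exception.

```python
def count(a, b, x):
    """Number of eigenvalues below x for the symmetric tridiagonal matrix
    with diagonal a (length n) and off-diagonal b (length n - 1)."""
    cnt = 0
    d = 1.0
    for i in range(len(a)):
        bb = b[i - 1] ** 2 if i > 0 else 0.0   # b_0 = 0 by convention
        d = a[i] - x - bb / d                  # pivot of the LDL^T factorization of T - xI
        if d == 0.0:
            d = -1e-300                        # assumption: nudge an exact zero pivot (not in Fig. 1)
        if d < 0.0:
            cnt += 1                           # one negative pivot = one eigenvalue below x
    return cnt


# The (-1, 2, -1) matrix of order 4 has eigenvalues 2 - 2*cos(k*pi/5), k = 1..4.
a = [2.0, 2.0, 2.0, 2.0]
b = [-1.0, -1.0, -1.0]
below_1_5 = count(a, b, 1.5)
```

Evaluating `count` for many values of `x` is independent work per value, which is what makes the kernel attractive for a data parallel processor.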
Now, suppose we are given a set of 4-tuples (l_i, u_i, nl_i, nu_i) for i = 1, …, EL so that nl_i = Count(l_i) < nu_i = Count(u_i) and the union of intervals [l_i, u_i) contains all eigenvalues. Then nu_i − nl_i gives the number of eigenvalues in the interval [l_i, u_i). Computing nm_i = Count(m_i) for the midpoints m_i = (l_i + u_i)/2 produces new tuples (l_i, m_i, nl_i, nm_i) and (m_i, u_i, nm_i, nu_i) for intervals half as wide. Intervals that contain no eigenvalues (nl_i = nu_i) are discarded and the process is repeated until a sufficiently small enclosing interval for each eigenvalue is found. The interval to start the iterations is constructed using Gershgorin's theorem.
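The tuple-based iteration described above can be sketched as follows. It is a plain sequential illustration under stated assumptions: the function names, the tolerance `tol`, the small padding of the Gershgorin interval and the zero-pivot guard are all choices made for this sketch and are not taken from sstebz.

```python
def count(a, b, x):
    """Eigenvalues below x (recurrence of Fig. 1 plus a zero-pivot guard)."""
    cnt, d = 0, 1.0
    for i in range(len(a)):
        bb = b[i - 1] ** 2 if i > 0 else 0.0
        d = a[i] - x - bb / d
        if d == 0.0:
            d = -1e-300          # assumption: treat an exact zero pivot as tiny negative
        if d < 0.0:
            cnt += 1
    return cnt


def gershgorin_interval(a, b):
    """Interval that contains all eigenvalues, from Gershgorin's theorem."""
    n = len(a)
    lo, hi = float("inf"), float("-inf")
    for i in range(n):
        r = (abs(b[i - 1]) if i > 0 else 0.0) + (abs(b[i]) if i < n - 1 else 0.0)
        lo, hi = min(lo, a[i] - r), max(hi, a[i] + r)
    return lo, hi


def bisection(a, b, tol=1e-10):
    """All eigenvalues, each located to within roughly tol."""
    lo, hi = gershgorin_interval(a, b)
    pad = tol * max(1.0, abs(lo), abs(hi))     # widen so [lo, hi) surely covers the ends
    lo, hi = lo - pad, hi + pad
    work = [(lo, hi, count(a, b, lo), count(a, b, hi))]
    found = []
    while work:
        l, u, nl, nu = work.pop()
        if nl >= nu:                           # no eigenvalues inside: discard
            continue
        if u - l <= tol:                       # narrow enough: report the midpoint
            found.extend([(l + u) / 2] * (nu - nl))
            continue
        m = (l + u) / 2
        nm = count(a, b, m)
        work.append((l, m, nl, nm))
        work.append((m, u, nm, nu))
    return sorted(found)
```

In the parallel setting of the paper, the calls to `count` for all intervals of one sweep would be batched rather than issued one at a time as here.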
The total work done in Count(x) to find k eigenvalues is O(nk). This is usually much larger than the O(k) work done in the rest of the algorithm, which motivates the search for a more efficient implementation of Count(x).
A trivial way to speed up Count(x) given parallel resources is to evaluate the values nm_i = Count(m_i) for i = 1, …, EL concurrently. The utilization of parallel resources may be low unless EL is large enough. A well-known technique to increase the utilization and cut the running time of the bisection algorithm is to subdivide each interval [l_i, u_i) with multiple points m_ij = l_i + j(u_i − l_i)/(ML + 1) for j = 1, …, ML [Lo et al. 1987; Simon 1989; Katagiri et al. 2006]. In this case Count(x) is evaluated at EL×ML points concurrently, achieving better utilization of the parallel resources. ML = 1 corresponds to the bisection algorithm and ML ≥ 2 is called multisection. ML is chosen to balance the gain from achieving higher utilization and the loss from introducing arithmetic redundancy by using multiple points (ML = 1 minimizes the operation count).
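As a small illustration of the subdivision formula, the sketch below generates the ML interior points of each interval and flattens them into the EL×ML arguments of one parallel Count call. The function names are hypothetical.

```python
def multisection_points(l, u, ml):
    """Interior points l + j*(u - l)/(ml + 1), j = 1..ml, of one interval [l, u)."""
    return [l + j * (u - l) / (ml + 1) for j in range(1, ml + 1)]


def all_points(intervals, ml):
    """Arguments of one batched Count call: ML points for each of the EL intervals."""
    return [m for (l, u) in intervals for m in multisection_points(l, u, ml)]
```

With `ml = 1` this reduces to ordinary bisection; with `ml = 3` every sweep narrows each interval by a factor of four at the price of three evaluations instead of one.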
There are known alternative designs that were not considered in the present work. Newton's or the Zeroin algorithm may improve convergence of intervals that are found to contain only one eigenvalue. A vectorizable alternative to Count(x) such as considered by Lo et al. [1987] may allow achieving high utilization with lower arithmetic redundancy. Versions of Count(x) using "parallel-prefix" to parallelize evaluations for a single x were analyzed by Ren [1996] and Mathias [1995] and found to be numerically unstable, and will not be further considered here.
1 Motivation and Objectives
Modern graphics processors (GPUs) are data parallel architectures that can run general-purpose computations in single precision (so far) at high computational rates. They are capable of achieving 110 GFLOPS in matrix-matrix multiplication [Segal and Peercy 2006] and show 30-40x speedups compared to recent Intel Xeon processors in computationally intensive applications such as Black-Scholes option pricing [McCool et al. 2006] and gas dynamics solvers [Hagen et al. 2007]. It is tempting to exploit this computational power in solving other common numerical problems.
In this work we consider an implementation of another widely used linear algebra routine: the bisection algorithm for finding the eigenvalues of symmetric tridiagonal matrices. A numerically robust, vectorized implementation of this algorithm in single precision is available in LAPACK's sstebz routine [Anderson et al. 1999]. Our goal is to port the vectorized segments of the code to the GPU. In order to increase the utilization of the parallel resources, we use the Multi-section with Multiple Eigenvalues method used previously by Katagiri et al. [2006].
For the purpose of this study we restrict our attention to finding all eigenvalues of the matrix. The extension to finding a subset of the eigenvalues, as done in LAPACK's sstebz routine, is straightforward.
2 The Bisection Algorithm
2.1 Overview
A detailed description of the bisection algorithm can be found in Demmel [1997] or Parlett [1980]. A thorough analysis of its correctness in finite-precision machine arithmetic is presented in Demmel et al. [1995]. In the following we summarize the important features of the algorithm and present two novel techniques to ensure correctness in an unusual floating-point semantic.
shown by Demmel et al. [1995] that the bisection algorithm can be made correct in the absence of monotonicity by careful adjustment of the return values of Count(x), so that at any point in the algorithm the set of nonempty tuples (l_i, u_i, nl_i, nu_i) maintained by the algorithm satisfies 2 properties: (1) the intervals [l_i, u_i) are pairwise disjoint, and (2) for each interval Count(l_i) ≤ nl_i ≤ nu_i ≤ Count(u_i), i.e. the values of nl_i and nu_i may be "adjusted" from their nominal values of Count(l_i) and Count(u_i). It can be adapted for the multisection algorithm as follows.
Let m_1 ≤ m_2 ≤ … ≤ m_ML be the multisection points subdividing [l, u). For convenience, let m_0 = l, m_{ML+1} = u and nm_0 = nl, nm_{ML+1} = nu. Then the adjustment proceeds from j = 1 up to ML by setting
    nm_j = max(nm_{j−1}, min(Count(m_j), nu)).

    pivmin:   if (|d| < pivmin) then d = −pivmin
              if (d < 0) then Count = Count + 1
    SignBit:  Count = Count + SignBit(d)
    IEEE:     (matrix is preprocessed by setting a_i = a_i + 0)
              if (d < 0) then Count = Count + 1
    IEEE0:    d = d + 0
              if (d < 0) then Count = Count + 1

Figure 2: Possible modifications of the Count(x) routine to handle overflow. The lines are used instead of (*) in Fig. 1. The pivmin version is used in LAPACK, the SignBit version is used in ScaLAPACK [Blackford et al. 1997a], the IEEE version is used in our CPU code and the IEEE0 version is used in our GPU code.

2.2 Handling Divisions by Zero and Overflow
At a finite number of points, such as x = a_1, n ≥ 2, the algorithm in Fig. 1 encounters a division by zero. Similarly, when run in machine arithmetic it may also encounter overflow. The designs of the function Count(x) outlined below produce a consistent result even at these points.
The pivmin version of Count(x) in Fig. 2 avoids divisions by zero and overflow by initially scaling the matrix (not shown) so that its largest entry is neither too close to the underflow nor to the overflow threshold (in particular, so that no b_i^2 overflows), and by moving the pivot d away from zero by a small threshold pivmin. This threshold is trivially computed for every input matrix. This algorithm produces a correct result in nearly every machine arithmetic. It is considered in more detail in [Demmel et al. 1995; Kahan 1966].
The SignBit version in Fig. 2 may yield faster code if the machine arithmetic supports IEEE exception handling rules and they do not incur a high performance penalty [Demmel and Li 1994]. The SignBit(d) function returns 1 for negative values including −∞ and −0; it returns 0 otherwise. Note that SignBit(d) differentiates −0 and +0, unlike the floating point comparison d < 0 done according to the IEEE standard [IEEE 1987, Ch. 5.7]. This variant requires that b_i^2 neither underflows nor overflows. Otherwise, a NaN may be produced that makes the result incorrect (tiny b_i may be set to zero, splitting the matrix into independent subproblems).
If the machine arithmetic does not provide an inexpensive way to compute SignBit(d), the function can be replaced by comparison with zero (d < 0) if every possibility for d = −0 is eliminated [Demmel et al. 1995, Ch. 8]. If machine arithmetic never produces −0 in addition, it is sufficient to preprocess all a_i = −0 into 0 (IEEE version in Fig. 2). If this is not the case, e.g. if underflow in addition may result in −0, we may instead similarly process the pivot d at every iteration as shown in the IEEE0 variant in Fig. 2. Both IEEE and IEEE0 variants require that (−0) + 0 = 0 holds.
The IEEE0 version also produces a correct result when zero's sign is occasionally lost in computation.
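The difference between the SignBit test and the comparison d < 0 shows up only at −0, and can be checked directly in any IEEE-conforming arithmetic. The sketch below uses Python doubles; the helper names are hypothetical and NaN inputs are not considered.

```python
import math


def sign_bit(d):
    """1 for negative values including -inf and -0, otherwise 0."""
    return 1 if math.copysign(1.0, d) < 0.0 else 0


def negative_after_adding_zero(d):
    """IEEE0-style test: d + 0 maps -0 to +0 (round to nearest), then compare with zero."""
    d = d + 0.0
    return 1 if d < 0.0 else 0


neg_zero = -0.0
plain_compare = 1 if neg_zero < 0.0 else 0   # the comparison does not see the sign of zero
```

Here `sign_bit(neg_zero)` is 1 while `plain_compare` is 0, so a pivot equal to −0 is counted by one variant and not by the other. Adding 0 first removes the ambiguity for the comparison-based variants, provided (−0) + 0 = 0 holds as the text requires.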
Now, 4-tuples (m_i, m_{i+1}, nm_i, nm_{i+1}) for i = 0, …, ML represent the finer intervals. It is easy to see that either nm_i = nm_{i+1} and the interval is discarded, or Count(m_i) ≤ nm_i < nm_{i+1} ≤ Count(m_{i+1}). This condition is sufficient for the algorithm to be correct, see Chapter 7 in Demmel et al. [1995].
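The adjustment rule nm_j = max(nm_{j−1}, min(Count(m_j), nu)) and the construction of the finer tuples can be sketched as below. The function names are hypothetical, and `raw_counts` stands for the possibly non-monotonic values of Count(m_j) returned by the parallel kernel.

```python
def adjust_counts(nl, nu, raw_counts):
    """Clamp raw Count values into [nl, nu] and force them to be non-decreasing."""
    adjusted = []
    prev = nl                                # nm_0 = nl
    for c in raw_counts:
        prev = max(prev, min(c, nu))         # nm_j = max(nm_{j-1}, min(Count(m_j), nu))
        adjusted.append(prev)
    return adjusted


def refine(l, u, nl, nu, points, raw_counts):
    """Finer tuples (m_i, m_{i+1}, nm_i, nm_{i+1}); empty intervals are discarded."""
    ms = [l] + list(points) + [u]
    ns = [nl] + adjust_counts(nl, nu, raw_counts) + [nu]
    return [(ms[i], ms[i + 1], ns[i], ns[i + 1])
            for i in range(len(ms) - 1) if ns[i] != ns[i + 1]]
```

For example, raw counts `[3, 2, 6]` inside an interval with nl = 2 and nu = 5 are adjusted to `[3, 3, 5]`, so the dip at the second point and the overshoot at the third can no longer produce an interval with a negative number of eigenvalues.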
2.4 Heterogeneous Computing
Heterogeneous computing environments with different rounding properties in different functional units present a hazard for correct parallel execution of bisection [Blackford et al. 1997b; Demmel et al. 1995]. However, if only the function Count(x) is computed using the heterogeneous parallel resources, this threat is easily countered.
Indeed, interleaving the values of Count(x) that are evaluated on processors that use different floating point rounding rules, such as the CPU and the GPU, may result in non-monotonic behavior, even if values on each processor are monotonic. In that case the adjustment procedure described in Section 2.3 may be used to enforce monotonicity. This resolves the problem completely.
3 The GPU Architecture
The target platform for our GPU implementation is the Radeon X1900 XT that was released by ATI in January 2006. In this section we briefly review its architectural features that are important for understanding the GPU performance. Further details can be found in the vendor's technical publications [ATI 2005a; ATI 2005b; ATI 2005c; Riguer 2006; Segal and Peercy 2006; AMD 2006].
3.1 Processing Units
The GPU has an array of processors that are operated in SIMD lockstep mode. Multithreading is used to hide memory access latency. The basic data format is a four-component vector of 32-bit IEEE floating point numbers. There are three types of units that operate in parallel at a 625 MHz clock rate: ALUs, memory fetch units and flow control units.
There are 48 ALUs. Each ALU consists of a vector and a scalar unit that are programmed separately. The vector unit executes simple instructions on the first three components of the registers. These instructions include multiply and add (MAD), dot products, minimum or maximum of two values, taking fractional parts and conditional assignments. The scalar unit may execute the same instructions except the dot products on the fourth component of the registers. In addition it can compute exponent, logarithm, sine, cosine, 1/x (RCP) and 1/√x operations. Each arithmetic instruction may have an input and an output modifier. The input modifier can be −x, |x| and −|x|. In addition, one of the inputs can be a result of a simple operation over two other inputs (called "presubtract"). This operation can be x + y, x − y, 1 − x and 1 − 2x. The four components of the input registers can be arbitrarily shuffled and swizzled. The output modifier allows multiplying the result of
2.3 Monotonicity of Count(x)
In exact arithmetic, Count(x) grows monotonically with x. This is not necessarily true in finite precision arithmetic. It can be proven that each of the overflow-safe versions of Count(x) in Fig. 2 is monotonic if and only if the floating point arithmetic is monotonic [Kahan 1966; Demmel et al. 1995], i.e. the result of the operations x + y, x − y and x / y depends monotonically on the arguments. We note that the proof extends to the case where x / y is implemented by x × (1 / y), and multiplication and reciprocation are both monotonic. IEEE floating point semantics would guarantee monotonicity.
Non-monotonic Count(x) makes the obvious implementation of the algorithm incorrect. For example, some of the intervals may be found to contain negative numbers of eigenvalues. It was
grammed explicitly in the high-level languages such as DirectX's HLSL.
The sign of zero may be lost when copying data between registers. The reason is the lack of a separate MOV operation, so that another arithmetic operation must be used instead. If the assignment is done using MAD, i.e. as d = 0 × 0 + x or d = 1 × x + 0, the sign of zero is lost in the addition. A safer solution is d = MAX(x, x). However the choice might not be under control when programming in a high-level language.
To summarize, underflow in addition may result in −0 on this particular GPU so that the IEEE version of Count(x) may give incorrect values in corner cases. The IEEE0 and pivmin versions are expected to be correct. Arithmetic is likely to be monotonic, but we cannot say it is for sure.

Operation Type       How Implemented                                    Theoretical Peak Rate
1/A                  RCP                                                30 Gflop/s
A/B                  RCP in the scalar pipe, MAD in the vector pipe     30 Gflop/s
A×B                  MAD                                                120 Gflop/s
A+B                  MAD and presubtract                                240 Gflop/s
A×B+C                MAD                                                240 Gflop/s
(A+B)×C+C            MAD and presubtract                                360 Gflop/s
(1−2×A)×B+C          MAD and presubtract                                480 Gflop/s
((1−2×A)×B+C)/4      MAD, presubtract and output modifier               600 Gflop/s

Table 1: Theoretical peaks for various arithmetic operations on the Radeon X1900 XT (single precision only).
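The peak rates in Table 1 follow from simple arithmetic on the figures given in Section 3.1: 48 ALUs, a 625 MHz clock and one instruction per clock per unit. The sketch below reproduces them. The accounting of how many register components take part and how many floating point operations each instruction performs per component is an assumption of this sketch, chosen to be consistent with the table.

```python
ALUS = 48          # number of ALUs (Section 3.1)
CLOCK_GHZ = 0.625  # clock rate in GHz (Section 3.1)


def peak_gflops(components, flops_per_component):
    """Peak rate when every ALU issues one such instruction per clock."""
    return ALUS * CLOCK_GHZ * components * flops_per_component


# (components used, flops per component): this sketch's own accounting.
rows = {
    "1/A": (1, 1),                        # RCP runs in the scalar pipe only
    "A*B": (4, 1),                        # MAD on 3 vector + 1 scalar components
    "A+B": (4, 2),                        # presubtract add followed by the MAD add
    "A*B+C": (4, 2),
    "(A+B)*C+C": (4, 3),
    "(1-2*A)*B+C": (4, 4),
    "with output modifier": (4, 5),
}
rates = {name: peak_gflops(*spec) for name, spec in rows.items()}
```

The division A/B is limited by the single RCP per ALU per clock, which is why its peak in the table equals that of 1/A.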
4 Implementation
Our implementation is done in C++ following the FORTRAN codes of the sstebz routine in LAPACK version 3.1 that is available in the public domain. It was compiled using the Intel C++ compiler version 9.1 with all optimizations for speed turned on except the floating point arithmetic options. In order to ensure sufficient IEEE compliance we used the /fp:source and /fp:except- compiler options.
The function Count(x) is implemented both on the CPU and the GPU. In each iteration of the algorithm one of them is chosen using a performance model as elaborated in Section 4.6. The return values of Count(x) are adjusted to handle non-monotonicity following Section 2.3. There are two sources of non-monotonicity: potentially non-monotonic GPU arithmetic and mixing the results of Count(x) computed on the CPU and the GPU.
the operation by factors of 2, 4, 8, 1/2, 1/4 and 1/8. Also, the output value may be clamped to the range [0, 1]. Each unit can execute one instruction every clock cycle. The estimates of the theoretical peak performance based on this data are given in Table 1.
There are 16 memory fetch units, i.e. one per three ALUs. GPUBench [Buck et al. 2004], which is a popular benchmarking utility, shows that the cost of fetching a four-component vector is 12 ALU cycles if the data is in the cache, i.e. it takes one cycle for a unit to fetch one 32-bit floating point number. This amounts to 40 GB/s. The Fetch-4 extension allows quadrupling this rate to 160 GB/s. Memory fetches that miss the cache are more expensive.
Flow control units allow implementing if-else statements by executing both branches and masking off the non-participating processors. They also support loops that are repeated a constant number of times. The constant is limited to the range [0, 255] and can be changed only when a program is not running. A larger number of iterations may be achieved using nested loops.
4.1 The GPU Program
The GPU is programmed using DirectX 9.0. Three different versions of Count(x) are programmed in HLSL as shown in Fig. 3. The HLSL program solves for four different values of x at once to fit better to the GPU architecture.
The negative pivots in the GPU program are counted using floating point numbers. This limits the counter and hence the dimension of the input matrix to 2^24 ≈ 16,000,000. It is a sufficiently large number: it may take a few weeks to solve a problem of such a size on a Pentium 4 processor.
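The limit of 2^24 comes from the 24-bit significand of single precision: beyond that value adding 1 no longer changes the number. The sketch below demonstrates this by rounding Python doubles to single precision with the standard `struct` module; the helper names are hypothetical.

```python
import struct


def f32(x):
    """Round a Python float (double precision) to the nearest single-precision value."""
    return struct.unpack("f", struct.pack("f", x))[0]


def increment(counter):
    """One counter update carried out in single precision."""
    return f32(counter + 1.0)


c = f32(2.0 ** 24 - 2.0)
c = increment(c)        # 16777215 is still represented exactly
c = increment(c)        # 16777216 = 2**24
stuck = increment(c)    # 2**24 + 1 is not representable and rounds back to 2**24
```

After the counter reaches 2^24 further increments are silently lost, so a floating point counter is only safe for matrices of dimension up to that value.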
The loop is naturally unrolled four times as the matrix entries are fetched in four-component vectors. Unrolled four times more it yields a perceivable speedup. In order to handle arbitrary matrix dimensions, the matrix size is rounded up to the nearest multiple of four, introducing dummy entries at the tail. They are set to +∞ for the diagonal and to 0 for the off-diagonal entries. That could produce an error if run in IEEE-compliant arithmetic as it gives b_i^2/d = 0/0 = NaN for d = 0, but is correct on this GPU since it implements 0×∞ = 0. The dummy entries produce d = +∞ so that Count is not incremented. An analogous technique would work on an IEEE-compliant platform.
The matrix is laid out into two 2D arrays ("textures") and is transferred to the GPU memory once per sstebz call. The iteration through the entries is implemented in two nested loops, each iterating through a horizontal or a vertical dimension of the arrays. The Fetch-4 extension is not enabled for these matrix structures, since the program is computation-bound anyway (see Section 4.3).
3.2 Floating Point Arithmetic
The ATI CTM guide [AMD 2006] gives the most detailed specification of the GPU floating point arithmetic that we know. Still, it is not exhaustive and even disagrees with our tests done in DirectX 9.0.
Most arithmetic operations are "accurate to within one bit on each input; transcendental functions have larger tolerances", but rounding rules are not specified. Hence, it is unclear if the arithmetic is monotonic or not. We were able to determine that the RCP operation is monotonic by executing it for every possible input. This is possible since it takes only one 32-bit argument, i.e. there are fewer than 2^32 different inputs. Addition and multiplication are likely to be monotonic as their most straightforward implementations are. In our tests we observed rounding towards zero in the MAD operation, which is monotonic.
The special values, such as ±∞, ±0 and NaN, are supported. Though the ATI CTM guide claims that they are treated according to the IEEE 754 standard in many operations including MAD and RCP, we found that on the GPU 0×∞ = 0 and 0×NaN = 0, which deviates from the standard. According to our experience handling of the special IEEE values does not incur a performance penalty.
Denormalized numbers are accepted but are always flushed to zero with the sign preserved. That means that underflow in addition may produce −0, which was verified by our tests. This behavior also disagrees with the IEEE standard [IEEE 1987, Ch. 6.3]. The convention (−0) + 0 = +0 held in our tests.
It is unclear if the adder in the presubtract units implements the same or different floating point semantics. It cannot be pro-
4.2 Transferring Data and Running the Job
As the input and output data for each parallel run of Count(x) is produced and processed in the main memory, we need to transfer it to the GPU memory and back for each call to the routine.
        a = tex2D( matrix_a, pos );
        bb = tex2D( matrix_bb, pos );
        pos.x += increment;
    (1) d = a.R - x - bb.R / d;
    (4) Count += (d < 0) ? 1 : 0;
        d = a.G - x - bb.G / d;
        Count += (d < 0) ? 1 : 0;
        d = a.B - x - bb.B / d;
        Count += (d < 0) ? 1 : 0;
        d = a.A - x - bb.A / d;
        Count += (d < 0) ? 1 : 0;
(a) IEEE version in HLSL language.

    (2) d = (|d| < pivmin) ? -pivmin : d;
    (3) d = d + 0;
(b) Extra lines used in other versions. Line (2) is used in the pivmin version, line (3) in the IEEE0 version.

    (1) RCP r0.R, r0.R              r0: d
    (1) RCP r0.G, r0.G              r1: temp
    (1) RCP r0.B, r0.B              r2: x
    (1) RCP r0.A, r0.A              r3: a
    (1) ADD r1, r3.R, -r2           r4: bb
    (1) MAD r0, r4.R, -r0, r1       r5: Count
    (2) ADD r1, r0_abs, -c0         c0: pivmin
    (2) CMP r0, r1, r0, -c0         c1: (0,1,0,0)
    (3) ADD r0, r0, c1.R
    (4) CMP r1, r0, c1.R, c1.G
    (4) ADD r5, r5, r1
(c) A simple mapping to PS 3.0 language.

Figure 3: Count(x) expressed in HLSL and PS 3.0. The looping and initialization logic is not shown. Variables d and x are four-component vectors, each component (R, G, B and A) stands for a different instance of Count(x). Numbers to the left show how the HLSL and PS 3.0 codes are related.

Figure 4: Running times of different GPU stages and versions of Count(x) for n = 192. Upload is the transfer to the GPU memory, and download is the other way around. For height h, 256×h instances of Count(x) are run.

shows the theoretical estimates of the performance using the architectural details presented in Section 3. We assume that most of the memory fetches hit the cache and are completely overlapped with computation, hence free. The number of floating point operations per iteration is 3 in each of the versions of Count(x).
The observed rates are shown in the same Table. They match the predicted rates within 10%. The code achieved up to 82% of the theoretical peak of the input bandwidth of 40 GB/s. Also, the codes performed at nearly the same rates when memory fetches were removed from the inner loop and substituted with register assignments that preserve the data dependence pattern. This supports our assumption that the memory fetches do not incur extra latency.

                     Theory              Measured
Version   Clocks   Gflop/s   GB/s     Gflop/s   GB/s
IEEE      8        45        30       49        33
IEEE0     9        40        27       43        29
pivmin    10       36        24       38        25
DirectX 9.0 has limited functionality for transferring data from the GPU to the main memory. Arrays that accept the output of the programs run on the GPUs ("render targets") must be transferred entirely even if only a small portion of them is needed. To avoid a performance penalty due to these extra bandwidth requirements, we create a number of render targets and choose the smallest that fits the data. The width of every render target is 64 for the reasons explained later. Heights are from 1 to 512, which allows handling problems as large as n = 131,072. The spacing between heights varies from 1 for small heights to 64 for larger ones for economical use of the GPU memory. The GPU program is then executed for the first h lines of the selected render target (by drawing an appropriate triangle; all possible triangles are set in the GPU memory during the initialization stage), i.e. for 64×4×h instances of Count(x). h is chosen to be the smallest that satisfies 64×4×h ≥ EL×ML.
For storing an array of x (shifts) on the GPU we use a single texture also of width 64, created with the D3DUSAGE_DYNAMIC flag. This flag allows updating the texture partially when transferring data to the GPU memory.
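The choice of h and of the render target can be illustrated with a short sketch. The width of 64 pixels and the four instances per pixel come from the text; the concrete list of pre-created heights is hypothetical, since the text only states that heights range from 1 to 512 with spacing between 1 and 64.

```python
import bisect

WIDTH = 64        # render target width in pixels (from the text)
COMPONENTS = 4    # instances of Count(x) per pixel, one per vector component

# Hypothetical set of pre-created heights: spacing 1 up to 64, then spacing 64 up to 512.
HEIGHTS = list(range(1, 64)) + list(range(64, 513, 64))


def min_height(el, ml):
    """Smallest h with 64*4*h >= EL*ML."""
    per_line = WIDTH * COMPONENTS
    return -(-(el * ml) // per_line)      # ceiling division


def pick_render_target(el, ml):
    """Return (h, height of the smallest pre-created render target that fits h lines)."""
    h = min_height(el, ml)
    i = bisect.bisect_left(HEIGHTS, h)
    if i == len(HEIGHTS):
        raise ValueError("more than 512 lines would be needed")
    return h, HEIGHTS[i]
```

For instance, 1000 intervals with 20 points each need h = 79 lines, which in this hypothetical layout are drawn into the render target of height 128, so only that target has to be read back.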
Table 2: Predicted and observed computational rates and bandwidths for different GPU versions of Count(x).
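The predicted columns of Table 2 can be reproduced with a back-of-the-envelope model. The sketch below takes the clock counts per iteration as read from the table (8, 9 and 10), the 48 ALUs at 625 MHz from Section 3.1 and the 3 floating point operations per iteration stated in the text. The figure of 8 bytes fetched per iteration (one 4-byte diagonal entry and one 4-byte squared off-diagonal entry shared by the four instances) is an assumption of this sketch.

```python
ALUS = 48
CLOCK_GHZ = 0.625
FLOPS_PER_ITERATION = 3    # per instance of Count(x), from the text
INSTANCES = 4              # one instance per vector component
BYTES_PER_ITERATION = 8    # assumption: 4 bytes of a and 4 bytes of bb per iteration


def predicted(clocks):
    """Predicted (Gflop/s, GB/s) when one iteration takes `clocks` ALU cycles."""
    giga_iterations = ALUS * CLOCK_GHZ / clocks          # iterations per nanosecond, all ALUs
    gflops = giga_iterations * FLOPS_PER_ITERATION * INSTANCES
    gbytes = giga_iterations * BYTES_PER_ITERATION
    return gflops, gbytes


table = {name: predicted(c) for name, c in [("IEEE", 8), ("IEEE0", 9), ("pivmin", 10)]}
```

Rounded to integers, the model gives 45, 40 and 36 Gflop/s and 30, 27 and 24 GB/s, in agreement with the "Theory" columns; all three bandwidths stay below the 40 GB/s fetch peak, consistent with the claim that the program is computation-bound.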
"
Fig. 4 shows the running times of the three versions of Count(x) and of the data transfers. The DirectX query mechanism was used to wait until the GPU completes execution. The IEEE0 and pivmin versions are correspondingly 15% and 30% slower than the IEEE version, which is similar to the results reported by Demmel and Li [1994]. Upload time is negligible compared to the download time (the amount of data transferred is nearly the same). At n = 192 pictured in the Figure the total time spent in the data transfer is about the same as the time spent in computation. Computation time grows linearly with n but the transfer time does not change. So, at large n the time spent in transfer is not significant.
Note the stairs of period 64 that show that the computational rate is higher when h is a multiple of 64. The same applies to the width of the rendered rectangle, which motivated us to choose width 64 for the render targets. For example, computing 32×32, 64×32, 32×64 and 64×64 blocks of pixels takes the same time. Also, the slope in the graph for h = 1, …, 32 is twice as large as for h = 64k, k = 1, 2, …. This could mean that only half of the processors are utilized when h ≤ 32 and only a quarter are used when both the height and the width are less than or equal to 32. A possible model for this behavior could be execution of threads in 64x64 tiles, each tile split into four quadrants that are assigned to different processors. If the quadrant is empty, the processors it is assigned to are
4.3 Performance of Count(x) on the GPU
Fig. 3(c) shows a simple PS 3.0 code similar to that produced by the HLSL compiler. This code approximates the machine code that is ultimately produced but not available when programming in DirectX. Note that the two if-statements in the pivmin version in Fig. 2 do not require branches in the PS 3.0 code. Table 2
    for( i = 0; i < size; i++ )
    {
        d[i] = 1.0;
        Count[i] = 0.0;
    }
    for( j = 0; j < n; j++ )
        for( i = 0; i < size; i++ )
        {
            d[i] = a[j] - x[i] - bb[j] / d[i];
            Count[i] += d[i] < 0.0 ? 1.0 : 0.0;
        }

Figure 5: The vectorized CPU version of the Count(x) routine that executes size instances of Count(x).
Figure 6: Running time of the CPU kernel for n = 65536.

idle. Similar behavior was observed by Bolz et al. [2003] on NVIDIA's hardware.
In our experience a similar program run in OpenGL has demonstrated similar stairs with a period of 32. It might lead to the conclusion that this parameter (the "stair" width) is adjustable, though the handle is not available to the programmer.
4.4 CPU Implementation of Count(x)
Similarly to what is done in LAPACK, we implemented both vectorized and non-vectorized versions of Count(x) on the CPU. Both are the IEEE version, which is the fastest and is correct as the CPU arithmetic is IEEE compliant (when using the /fp:source compiler option). As in the GPU codes, floating point numbers are used to count the negative pivots. Fig. 5 shows the vectorized version of the routine. According to the compiler messages, every line in the loop body is compiled into SIMD instructions. The non-vectorized version has the inverse loop order (the compiler vectorizes only inner loops). The non-vectorized version is run when size = EL×ML < 8.
The running time of this CPU version is labeled "naïve" in Fig. 6. As one may see, its runtime does not increase monotonically, which means that increasing size by computing Count(x) at a few extra points may decrease the runtime! Thus, if size mod 8 ≥ 3 we add dummy points to increase size to the nearest multiple of 8. The new runtime is shown in the same Figure labelled "dummies". It is up to 3.2x faster according to the graph.
4.5 Testing Correctness of Count(x)
We constructed a 4×4 tridiagonal matrix, with eigenvalues sufficiently distant from zero, that yields a negative denormalized pivot in Count(x) at x = 0. If denormals are flushed to zero preserving sign, it produces d = −0. We checked if this produces a correct result in different implementations of Count(x). Note that as all eigenvalues are far from zero, roundoff error may not influence the value of Count(0), as the algorithm is backward stable and the symmetric eigenvalue problem is well-conditioned.
Among the three GPU algorithms only the IEEE version has failed, as was expected in Section 3.2. Both vectorized and non-vectorized variants of the CPU Count(x) produced correct results. On the other hand, the results of the CPU Count(x) were incorrect when the code was compiled using the /fp:fast option, which allows achieving higher computational rates by waiving strict IEEE 754 compliance.
As only the IEEE0 and pivmin versions of the GPU Count(x) are proven to be correct on this GPU and the IEEE0 version is clearly faster than the alternative, only the IEEE0 routine was used in GPU computations in the rest of the paper. The IEEE and pivmin versions may still be considered when computing on other GPU models with different floating point conventions.
5 Results
All results cited in this Section (and others) were obtained with a 2.8 GHz Pentium 4 520 (Prescott) and an ATI Radeon X1900 XT. The following implementations were used in the tests:
• CPU-alone: multisection running Count(x) on the CPU only;
• GPU-alone: multisection running Count(x) on the GPU only;
• CPU-GPU: multisection running Count(x) both on the GPU and CPU. This is our fastest code;
• CLAPACK: bisection routine sstebz in CLAPACK 3.0;
• MKL BZ: bisection routine sstebz in Intel MKL 9.0;
• MKL RF: ssterf routine in Intel MKL 9.0 that is the QR algorithm optimized for finding all eigenvalues only;
• MKL GR: sstegr routine in Intel MKL 9.0 that uses the dqds algorithm to find the eigenvalues only.
In routines that require specifying the absolute tolerance the value 2×sfmin was used, which is the finest acceptable for sstebz. sfmin is the smallest value such that 1/sfmin does not overflow.
The following matrices were used in tests (ε ≈ 2^−23 is the machine epsilon):
• uniform: a_i = 1 + (i−1)/n, b_i = 2/n;
4.6 Tuning
There are two choices to be made in every iteration of the bisection algorithm: what version of Count(x) to use (the GPU or the CPU one) and how large ML should be. The choice that results in the shortest running time of the bisection algorithm should be preferred. We consider all possible choices and choose the most efficient in terms of the following definition:
    efficiency = log(ML + 1) / Time,
where Time is the running time of one iteration under the choice of ML and version of Count(x). For example, splitting each interval into 2 parts (ML = 1) in time T has the same efficiency as splitting it into 8 parts (ML = 7) in time 3T. Finer subdivision (higher ML) done in the same time, and faster computation at the same ML, are considered more efficient. All decisions are made offline and tabulated for use at runtime by the CPU.
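The efficiency measure and the selection it drives can be sketched as follows. The function names and the table of timings are hypothetical; in the paper the timings come from benchmarking and the decisions are tabulated offline.

```python
import math


def efficiency(ml, time):
    """log(ML + 1) / Time: interval-narrowing progress per unit of running time."""
    return math.log(ml + 1) / time


def best_choice(timings):
    """timings maps (version, ML) to the time of one iteration; return the most efficient key."""
    return max(timings, key=lambda key: efficiency(key[1], timings[key]))


# Hypothetical timings of one iteration, in arbitrary units.
timings = {("CPU", 1): 1.0, ("GPU", 1): 2.0, ("GPU", 15): 2.5}
choice = best_choice(timings)
```

The logarithm makes the measure additive over iterations: halving an interval in time T and cutting it to one eighth in time 3T score the same, because three halvings achieve the latter.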
The value of Time is estimated using the results of thorough benchmarking. The nonlinear stair-like behaviour of the GPU code is captured in a table of benchmarked entries. A linear dependence on n is assumed (time = latency + n × bandwidth). Time
spent outside of Count(x) is estimated as $\alpha + \beta\,\mathit{EL} + \gamma\,\mathit{ML} + \delta\,\mathit{EL}\,\mathit{ML}$. The coefficients are fit using weighted linear least
squares to minimize the relative error; the weights are set equal to the measured time. The runtimes of the CPU version of Count(x) are fit similarly, taking into account the zigzag pattern and the introduction of the dummy entries.
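As an illustration of the fitting step, the following Python sketch fits the one-variable model time = latency + n × slope by weighted linear least squares. It assumes that each residual is divided by the measured time, which is one common way to target the relative error; the function name and the sample measurements are hypothetical, and the bilinear model for the time spent outside Count(x) is not covered.

```python
def fit_linear_relative(ns, times):
    """Fit time ~ latency + n * slope, minimizing squared relative errors.

    Each residual is divided by the measured time, i.e. the weight of a
    squared residual is 1 / time**2 (an assumption of this sketch).
    """
    w = [1.0 / (t * t) for t in times]
    s_w = sum(w)
    s_n = sum(wi * n for wi, n in zip(w, ns))
    s_nn = sum(wi * n * n for wi, n in zip(w, ns))
    s_t = sum(wi * t for wi, t in zip(w, times))
    s_nt = sum(wi * n * t for wi, n, t in zip(w, ns, times))
    det = s_w * s_nn - s_n * s_n  # determinant of the 2x2 normal equations
    latency = (s_t * s_nn - s_n * s_nt) / det
    slope = (s_w * s_nt - s_n * s_t) / det
    return latency, slope


# Hypothetical measurements: 50 us latency plus 0.2 us per matrix row.
ns = [100, 200, 400, 800, 1600]
times = [50e-6 + 0.2e-6 * n for n in ns]
latency, slope = fit_linear_relative(ns, times)
print(latency, slope)
```

Because the sample data are exactly linear, the fit recovers the generating parameters up to rounding; with noisy timings the weighting keeps short runs from being dominated by long ones.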
Figure 8: Computational rates (Gflop/s) achieved by different versions of sstebz and a mix of matrices.
Figure 7: Speedup of our CPU version of sstebz relative to the version in Intel MKL 9.0.
• geometric: $a_i = (3\varepsilon)^{(i-1)/(n-1)}$, $b_i = a_{i+1}/3$;
• (−1,2,−1): $a_i = 2$, $b_i = -1$;
• glued: (−1,2,−1) matrix with $b_k = 3\varepsilon$ when $k \equiv 0 \pmod{25}$; n is a multiple of 25;
• practical: a subset of matrices from the Harwell-Boeing, University of Florida and George Fann collections reduced to tridiagonal form¹.
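The synthetic test classes can be generated as follows. This is a sketch under the assumption that the formulas read as $a_i = 1 + (i-1)/n$, $b_i = 2/n$ for the uniform class, $a_i = (3\varepsilon)^{(i-1)/(n-1)}$, $b_i = a_{i+1}/3$ for the geometric class, and $b_k = 3\varepsilon$ at every 25th off-diagonal for the glued class; the function names are ours. Diagonals and off-diagonals are returned as Python lists a (length n) and b (length n − 1).

```python
EPS = 2.0 ** -23  # single-precision machine epsilon, as in the text


def uniform(n):
    a = [1.0 + (i - 1) / n for i in range(1, n + 1)]
    b = [2.0 / n] * (n - 1)
    return a, b


def geometric(n):
    a = [(3.0 * EPS) ** ((i - 1) / (n - 1)) for i in range(1, n + 1)]
    b = [a[i] / 3.0 for i in range(1, n)]  # b_i = a_{i+1} / 3 (1-based)
    return a, b


def one_two_one(n):
    return [2.0] * n, [-1.0] * (n - 1)


def glued(n, block=25):
    assert n % block == 0, "n must be a multiple of the block size"
    a, b = one_two_one(n)
    for k in range(block, n, block):  # 1-based k = 25, 50, ...
        b[k - 1] = 3.0 * EPS          # weak coupling between the blocks
    return a, b


a, b = glued(50)
print(len(a), len(b), b[24])
```

The practical matrices come from external collections and are not reproduced by this sketch.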
Uniform and geometric matrices approximate uniform and geometric distributions of eigenvalues, respectively. The glued matrix has eigenvalues strongly clustered around the eigenvalues of the (−1,2,−1) matrix with n = 25.
Off-diagonals of the test matrices were always large enough that the LAPACK sstebz routine does not break the problem into smaller ones, which is not currently implemented in our version. We also ensured that the other algorithms used (such as dqds) do not exhibit the unusually fast convergence that happens when the matrix is very close to diagonal. This explains the choice of entries such as 3ε above.
First we analyze the behavior of the CPU codes alone. Fig. 7 shows the speedup achieved by our CPU code relative to the MKL version. It ranges from 4.2 to 7.8 for n > 50. The CLAPACK version was from 5% faster to 25% slower than the MKL version.
Fig. 8 shows the computational rates in the CLAPACK, CPU-alone and CPU-GPU versions. Only the floating point operations in the Count(x) algorithm were taken into account in this data. CLAPACK performs at 76–182 Mflop/s and our CPU-alone version is at 270–800 Mflop/s. The CPU-GPU version shows up to 40 Gflop/s, which is up to 50 and 220 times higher than the peak rates of the CPU-alone and CLAPACK versions, respectively. However, the CPU-GPU version did up to 7.6× more flops than the bisection algorithm in CLAPACK, and the ML used was up to 85. For comparison, the largest optimal ML used in the CPU-only and GPU-only versions was 8 and 1024, respectively.
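The quoted ratios follow directly from the peak rates; a short Python check, using only the rates stated above and variable names of our own choosing:

```python
gpu_peak = 40e9          # CPU-GPU version, flop/s
cpu_alone_peak = 800e6   # CPU-alone version, flop/s
clapack_peak = 182e6     # CLAPACK, flop/s

print(gpu_peak / cpu_alone_peak)       # ratio to the CPU-alone peak
print(round(gpu_peak / clapack_peak))  # ratio to the CLAPACK peak
```

These are ratios of arithmetic rates, not of runtimes: since the CPU-GPU version performs redundant floating point operations, the runtime gain is smaller than the rate gain.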
To understand the importance of using multisection vs. bisection, we performed runs forcing ML = 1. Fig. 9 shows that the GPU-alone version is sped up by factors of 5 to 6.6 for n < 100 by using multisection. Speedups at large n are substantial only if the eigenvalues are clustered: the speedup was about 4.2× for the glued matrices. Speedup should also be substantial when finding only a small subset of all eigenvalues, which is currently not implemented. Speedup is less noticeable in the GPU-CPU version, as it runs on the CPU whenever the GPU multisection is too slow: the speedup was only up to 2.0×.
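The difference between bisection and multisection can be seen in a small serial Python sketch. It assumes the Count(x) recurrence of Figure 1, adds a guard against a zero pivot (our assumption, not taken from the text), and uses hypothetical names count and multisection. Setting ml = 1 gives plain bisection; a larger ml evaluates more points per step, which a GPU would do concurrently.

```python
import math


def count(a, b, x):
    """Number of eigenvalues of the tridiagonal matrix (a, b) below x."""
    c, d = 0, 1.0
    for i in range(len(a)):
        off = b[i - 1] if i > 0 else 0.0
        d = a[i] - x - off * off / d
        if d == 0.0:
            d = -1e-300  # guard against division by zero (assumption)
        if d < 0.0:
            c += 1
    return c


def multisection(a, b, k, lo, hi, ml, tol=1e-12):
    """Locate eigenvalue number k (0-based) by splitting [lo, hi] into ml + 1 parts."""
    steps = 0
    while hi - lo > tol:
        h = (hi - lo) / (ml + 1)
        new_lo, new_hi = lo, hi
        for j in range(1, ml + 1):  # these evaluations are mutually independent
            p = lo + j * h
            if count(a, b, p) <= k:
                new_lo = p
            else:
                new_hi = p
                break
        lo, hi = new_lo, new_hi
        steps += 1
    return 0.5 * (lo + hi), steps


n = 10
a, b = [2.0] * n, [-1.0] * (n - 1)
exact = 2.0 - 2.0 * math.cos(3 * math.pi / (n + 1))    # third smallest eigenvalue
lam1, steps1 = multisection(a, b, 2, -1.0, 5.0, ml=1)  # plain bisection
lam7, steps7 = multisection(a, b, 2, -1.0, 5.0, ml=7)
print(abs(lam1 - exact), steps1)
print(abs(lam7 - exact), steps7)
```

With ml = 7 each step shrinks the interval eightfold, so roughly a third as many steps are needed, at the price of up to seven Count(x) evaluations per step. That trade is only profitable when the extra evaluations are nearly free, as they are on a wide parallel device.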
Fig. 10 compares the runtimes of the CPU-only, GPU-only and CPU-GPU versions for the (−1,2,−1) matrix. The runtime of the CPU-GPU version is nearly the minimum of the runtimes of the other two versions. The crossover between the CPU-alone and GPU-alone versions is at n ≈ 130. For n = 1000 the GPU-alone version is ≈ 9.0× faster than the CPU-alone version.
Figure 9: Speedup gained in the GPU-alone version by using multisection vs. bisection.
Figure 10: The comparison of the runtimes for the case of the (−1,2,−1) matrix.
Fig. 11 shows the percentage of time spent in Count(x) on the CPU and on the GPU in the CPU-GPU version for the (−1,2,−1) matrix. At n = 121 the time spent in the GPU code jumps from 0% to 67% and the time spent in the CPU code falls from 90% to 21%. 90% and 99% of the time spent in the GPU code are reached at n ≈ 1400 and n ≈ 62000, respectively.
¹ Available at http://crd.lbl.gov/~osni/Codes/stetester/
Figure 12: Speedup of our code run on Radeon X1900 XT vs. the code in CUDA SDK 1.1 run on GeForce 8800 GTX.
Figure 11: The breakdown of the runtime of the CPU-GPU version for the (−1,2,−1) matrix. The time cited for Count(x) on the GPU includes transferring data between the GPU and the main memories.
The runtimes and accuracy achieved in tests with different matrices and algorithms are shown in Fig. 13–15 and Tables 3–4. The CPU-GPU version was up to 156× faster than the MKL BZ version and up to 45× faster than the CPU-only version. Alternative eigensolvers were also substantially outperformed: for practical matrices the CPU-GPU version was up to 68× faster than MKL RF and up to 137× faster than MKL GR.
$\lambda_i^{true}$ in Table 4 was found using the LAPACK dstebz routine in Intel MKL, which is the double precision implementation of the bisection algorithm. According to the table, the MKL RF and MKL GR solvers showed substantially lower accuracy than the implementations of the bisection algorithm. All implementations of the bisection algorithm have shown similar absolute accuracy. The relative error, defined as $\max_i \left( |\lambda_i^{computed} - \lambda_i^{true}| \,/\, |\varepsilon \lambda_i^{true}| \right)$, for geometric matrices was also small: 1.49, 1.97 and 1.33 for the CPU-only, CPU-GPU and MKL BZ implementations, respectively. This shows that the GPU-based solver has the high relative accuracy that is expected in some special cases, see [Barlow and Demmel 1990].
We found that Count(x) in the CPU-alone and GPU-alone versions was always monotonic in the tests, but the CPU-GPU version did produce non-monotonic values, requiring the correction discussed in Sections 2.3 and 2.4.
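To illustrate what monotonicity means here, the Python sketch below checks a sequence of counts taken at increasing x and applies a running-maximum repair. The repair is only one simple possibility that we assume for illustration; it is not claimed to be the correction of Sections 2.3 and 2.4, and all names and data are hypothetical.

```python
def is_monotonic(counts):
    """True if counts never decrease as the argument x increases."""
    return all(c0 <= c1 for c0, c1 in zip(counts, counts[1:]))


def enforce_monotonic(counts):
    """Replace each count by the running maximum (one simple repair, assumed here)."""
    fixed, best = [], 0
    for c in counts:
        best = max(best, c)
        fixed.append(best)
    return fixed


# Hypothetical counts at increasing x, mixing results of two implementations.
counts = [0, 2, 1, 3, 3, 5]
print(is_monotonic(counts))
print(enforce_monotonic(counts))
```

A non-monotonic sequence such as this one can arise when neighbouring points are evaluated by implementations with different rounding behaviour, which is the situation in the CPU-GPU version.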
We also tried running Count(x) concurrently on the GPU and the CPU but found that this does not yield substantial benefits, as the GPU usually outperforms the CPU by at least an order of magnitude.
Matrix       CPU-only  MKL BZ  MKL RF  MKL GR
practical          22     125      68     137
(−1,2,−1)          38     130      38      89
uniform            45     156      37      89
geometric          31     107      21      52
glued             2.9    12.8     780     670
Table 3: Maximum slowdowns of different implementations relative to the CPU-GPU version.

Matrix       CPU-only  CPU-GPU  MKL BZ  MKL RF  MKL GR
practical        1.31     1.28    1.31      38     148
(−1,2,−1)        1.01     1.01    1.00      24      28
uniform          1.00     1.00    1.00     630     260
geometric        1.23     1.25    1.24     112      79
glued            1.00     1.00    1.00      57     109
Table 4: Worst absolute errors observed in tests. Absolute error is defined as $\left( \max_i |\lambda_i^{computed} - \lambda_i^{true}| \right) / \left( \varepsilon \max_i |\lambda_i^{true}| \right)$.
NVIDIA's implementation runs entirely on the GPU. This required a significant programming effort that is described in [Lessig 2007], as the rest of the algorithm beyond Count(x) is not embarrassingly parallel. In our opinion, there is little motivation for this complicated and error-prone design, since Count(x) dominates the cost for sufficiently large problems; say, it takes 90% of the time for n > 100 as in Fig. 11. On the other hand, if the problem is small, it is faster to solve it entirely on the CPU, see Fig. 10. Another argument for putting the entire algorithm on the GPU is to avoid the communication overhead at each call to Count(x). But as was shown in Section 4.3, this overhead is not substantial when run on Radeon X1900 for sufficiently large matrices. It may be even less substantial with newer GPUs, as they have an order of magnitude higher bandwidth in CPU-to-GPU transfers.
As the GPU usually comes with a CPU (and in the future may come on the same die, as is discussed today by both Intel and AMD), we advocate departing from the trend of moving entire algorithms to the GPU and considering instead the CPU-GPU tandem as the target platform. Many existing parallel algorithms spend a small fraction of the work in codes that do not expose substantial parallelism. Offloading this work to the GPU may be both painful and unprofitable.
Comparison with Previous Work
NVIDIA CUDA 1.1 SDK contains another implementation of the bisection algorithm that is optimized for NVIDIA's GPUs [Lessig 2007]. This implementation suffers from many overflow problems in data structures that lead to failures or crashes when 512 ≤ n ≤ 1024, and to failures with geometric matrices with n > 1024 that yield severely imbalanced interval trees. Also, it has a non-practical stopping criterion and is correct only if GPU arithmetic is monotonic, which is not clear from the vendor's detailed programming guide [NVIDIA 2007].
However, we managed to successfully run this code with uniform matrices and a stopping criterion aligned with that in LAPACK, as done in our implementation. Running on GeForce 8800 GTX, which is a newer and faster GPU than was used in our work, it performed up to more than 2 times slower than our code run on Radeon X1900 XT and Pentium 4, see Fig. 12. For example, our code runs in 1.0 s for n = 40000 vs. 2.5 s for NVIDIA's code.
7
Conclusion
We have produced a numerically correct implementation of the bisection algorithm for the GPU that substantially outperforms the bisection and other algorithms run on the CPU. Automatic tuning was one of the key components of our high performance design. We took advantage of the partial compliance of the GPU arithmetic with the IEEE 754 standard to reduce the runtime by about 15%. Also, we showed that a higher degree of IEEE 754 compliance could win an additional 15%, assuming no performance penalty for greater compliance. Trivial improvements would raise the functionality of our implementation to the full functionality of LAPACK's sstebz, such as finding only a subset of eigenvalues and splitting the matrix into blocks for better performance when off-diagonal elements are small. Future work includes porting a tridiagonal eigenvector solver, such as the MRRR algorithm [Dhillon and Parlett 2003] or the inverse iteration algorithm. Using the GPU in the reduction to tridiagonal form promises speedup in the dense symmetric eigenproblems: these algorithms are rich in BLAS2 and BLAS3 operations such as matrix multiply, which is known to run faster on the GPU.
Acknowledgements
We want to thank ATI and NVIDIA for the donated GPUs, Takahiro Katagiri for providing his implementation of the Multi-section with Multiple Eigenvalues method, and Professor Sara McMains for the course on general-purpose computation on the GPUs and for being helpful with equipment.

References
AMD 2006. ATI CTM Guide, Version 1.01.
ANDERSON, E., BAI, Z., BISCHOF, C., DEMMEL, J., DONGARRA, J., DU CROZ, J., GREENBAUM, A., HAMMARLING, S., MCKENNEY, A., OSTROUCHOV, S., AND SORENSEN, D. 1999. LAPACK Users' Guide, SIAM.
ATI 2005a. Radeon X1000 Family Technology Overview, ATI Technology White Paper.
ATI 2005b. Radeon X1800 Memory Controller, ATI Technology White Paper.
ATI 2005c. Radeon X1800 Shader Architecture, ATI Technology White Paper.
BARLOW, J., AND DEMMEL, J. 1990. Computing accurate eigensystems of scaled diagonally dominant matrices, SIAM Journal on Numerical Analysis 27, 3, 762–791. (Also LAPACK Working Note #7.)
BLACKFORD, L. S., CHOI, J., CLEARY, A., D'AZEVEDO, E., DEMMEL, J., DHILLON, I., DONGARRA, J., HAMMARLING, S., HENRY, G., PETITET, A., STANLEY, K., WALKER, D., AND WHALEY, R. 1997a. ScaLAPACK Users' Guide, SIAM.
BLACKFORD, L. S., CLEARY, A., DEMMEL, J., DHILLON, I., DONGARRA, J., HAMMARLING, S., PETITET, A., REN, H., STANLEY, K., AND WHALEY, R. 1997b. Practical experience in the numerical dangers of heterogeneous computing, ACM Transactions on Mathematical Software 23, 2, 133–147. (See also LAPACK Working Note #112.)
BOLZ, J., FARMER, I., GRINSPUN, E., AND SCHRÖDER, P. 2003. Sparse matrix solvers on the GPU: conjugate gradients and multigrid, ACM Transactions on Graphics 22, 3, 917–924.
BUCK, I., FATAHALIAN, K., AND HANRAHAN, P. 2004. GPUBench: evaluating GPU performance for numerical and scientific applications, ACM Workshop on General Purpose Computing on Graphics Processors (GP2).
DEMMEL, J. W. 1997. Applied Numerical Linear Algebra, SIAM.
DEMMEL, J. W., DHILLON, I., AND REN, H. 1995. On the correctness of some bisection-like parallel algorithms in floating point arithmetic, Electronic Transactions on Numerical Analysis 3, 116–149. (Also LAPACK Working Note #70.)
DEMMEL, J. W., AND LI, X. 1994. Faster numerical algorithms via exception handling, IEEE Transactions on Computers 43, 8, 983–992.
DHILLON, I., AND PARLETT, B. N. 2003. Orthogonal eigenvectors and relative gaps, SIAM Journal on Matrix Analysis and Applications 25, 3, 858–899.
HAGEN, T. R., HENRIKSEN, M. O., HJELMERVIK, J. M., AND LIE, K.-A. 2007. How to solve systems of conservation laws numerically using the graphics processor as a high-performance computational engine, In Geometrical Modeling, Numerical Simulation and Optimization: Industrial Mathematics at SINTEF, Eds., Hasle, G., Lie, K.-A., and Quak, E., Springer Verlag, 211–264.
IEEE 1987. IEEE standard 754-1985 for binary floating-point arithmetic, SIGPLAN 22, 2, 9–25.
KAHAN, W. 1966. Accurate Eigenvalues of a Symmetric Tridiagonal Matrix, Technical Report CS41, Computer Science Department, Stanford University, July 22, 1966 (with revisions to June 1968).
KATAGIRI, T., VÖMEL, C., AND DEMMEL, J. W. 2006. Automatic performance tuning for the multi-section with multiple eigenvalues method for the symmetric eigenproblem, In PARA'06, Umeå, Sweden, June 2006.
LESSIG, C. 2007. Eigenvalue Computation with CUDA, NVIDIA CUDA SDK 1.1.
LO, S.-S., PHILIPPE, B., AND SAMEH, A. 1987. A multiprocessor algorithm for the symmetric tridiagonal eigenvalue problem, SIAM Journal on Scientific and Statistical Computing 8, 2, 155–165.
MATHIAS, R. 1995. The instability of parallel prefix matrix multiplication, SIAM Journal of Scientific Computing 16, 4, 956–973.
MCCOOL, M., WADLEIGH, K., HENDERSON, B., AND LIN, H.-Y. 2006. Performance Evaluation of GPUs Using the RapidMind Development Platform, October 2006.
NVIDIA 2007. NVIDIA CUDA Compute Unified Device Architecture Programming Guide, Version 1.1.
PARLETT, B. N. 1980. The Symmetric Eigenvalue Problem, Prentice-Hall.
REN, H. 1996. On the Error Analysis and Implementation of Some Eigenvalue Decomposition and Singular Value Decomposition Algorithms, PhD Thesis in Applied Mathematics, University of California at Berkeley. (See also LAPACK Working Note #115.)
RIGUER, G. 2006. The Radeon X1000 Series Programming Guide, Radeon SDK, March 2006.
SEGAL, M., AND PEERCY, M. 2006. A performance-oriented data parallel virtual machine for GPUs, ACM SIGGRAPH 2006 Sketches.
SIMON, H. D. 1989. Bisection is not optimal on vector processors, SIAM Journal on Scientific and Statistical Computing 10, 1, 205–209.
Figure 13: Practical matrices: runtimes of different implementations and slowdowns relative to the CPU-GPU version.
Figure 14: (−1,2,−1) matrices: runtimes of different implementations and their slowdowns relative to the CPU-GPU version. Uniform and geometric matrices yield similar curves.
Figure 15: Glued matrices: runtimes of different implementations and their slowdowns relative to the CPU-GPU version. The benefits of
using the GPU are small due to small inherent parallelism in the problem. Bisection algorithms run in linear time due to strongly clustered
eigenvalues.