GPU-Based Parallel Solver via Kantorovich Theorem

Feifei Wei, Jieqing Feng, Hongwei Lin
by
Lotem Fridman
Seminar in Computer Graphics - Spring 2017
Dr. Gershon Elber - Technion
1
Agenda
 Introduction
 Tensor Preliminaries
 Kantorovich Solver
 GPUs for general-purpose computing
 Implementation
 Conclusions
 CUDA and GPU evolution
2
Agenda
 Introduction – Background, Bézier Clipping, Normal Cone
 Tensor Preliminaries
 Kantorovich Solver
 GPUs for general-purpose computing
 Implementation
 Conclusions
 CUDA and GPU evolution
3
Introduction - Background
 Root finding of a nonlinear system in terms of B-spline or
Bernstein polynomials is a fundamental problem in geometric modeling
 Analytical solutions exist only for univariate polynomials of degree no
more than 4
 As the degree of the polynomials or the number of constraints
increases, building an efficient and robust solver becomes a difficult
problem
4
Introduction - Background
 Many approaches have been proposed to address this problem:
G.E. Collins, R. Loos, Real zeroes of polynomials, in: Computer Algebra: Symbolic and Algebraic Computation (2nd ed.), 1983, pp. 83–94.
D. Manocha, J. Demmel, Algorithms for intersecting parametric and algebraic curves I: simple intersections, ACM Trans. Graph. 13 (1994) 73–100.
K. Mehlhorn, M. Sagraloff, Isolating real roots of real polynomials, in: J. Johnson, H. Park, E. Kaltofen (Eds.), ISSAC, ACM, 2009, pp. 247–254.
R.E. Moore, F. Bierbaum, Methods and Applications of Interval Analysis, SIAM Studies in Applied and Numerical Mathematics 2, Society for Industrial & Applied Mathematics, 1979.
5
Introduction - Background
 However, the subdivision-based approach is more attractive for
geometric modeling applications due to its geometric significance:
(Remember this …?)
 The geometric approach fully exploits the inherent convex hull
property and the numerical stability of Bernstein polynomials
or B-spline basis functions.
6
Introduction - Background
 Rather than solving a nonlinear Bernstein polynomial system directly, if
all of the distinct roots can be isolated via polynomial subdivision or
domain clipping, we can apply Newton-Raphson (NR)
(Remember this …?)
 where the center of a reduced sub-domain containing an isolated root
is adopted as the initial guess
7
Introduction - Bézier clipping
 The Bézier clipping method
 An improved subdivision method, applied to ray-tracing rational
parametric surface patches
 Instead of bisecting the domain directly, the clipping approach clips
the domain more elaborately according to the convex hull of the control points,
exploiting the advantages of the Bernstein polynomials.
8
Introduction - Bézier clipping
 Fat Line – the region between two parallel lines
 di = d(xi, yi), the signed distance from control point Pi = (xi, yi) to the line L
9
Introduction - Bézier clipping
 Two cubic polynomial Bézier curves P(t)
and Q(u), and a Fat Line L which bounds Q(u)
 Identify the intervals of t for which
P(t) lies outside of L, and hence does not
intersect Q(u), thereby defining P as:
10
Introduction - Bézier clipping
 The function d(t) is a polynomial in Bernstein
form, and can be represented as a
‘non-parametric’ Bézier curve:
 The horizontal coordinate of any point
D(t) is equal to the parameter value t
11
Introduction - Bézier clipping
 The function d(t) is a polynomial in Bernstein
form, and can be represented as a
‘non-parametric’ Bézier curve:
 Values of t for which P(t) lies outside of L
correspond to values of t for which D(t)
lies above dmax or below dmin
12
Introduction - Bézier clipping
 Parameter ranges of t can be identified
for which P(t) is guaranteed to lie outside
of L, by identifying ranges of t for which
the convex hull of D(t) lies above dmax
or below dmin
13
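As a concrete illustration of the clipping step above, here is a minimal Python sketch (all function names are mine, not from the paper; it scans control-polygon chords, which is equivalent to intersecting the convex hull of D(t) with the strip [dmin, dmax]):

```python
import numpy as np

def signed_distances(ctrl_P, line_pt, line_dir):
    """Signed distance from each control point of P to the fat line's axis L."""
    n = np.array([-line_dir[1], line_dir[0]], dtype=float)
    n = n / np.linalg.norm(n)
    return (np.asarray(ctrl_P, dtype=float) - line_pt) @ n

def clip_interval(d, dmin, dmax):
    """Parameter interval [tlo, thi] where P(t) may lie inside the fat line,
    from the convex hull of the non-parametric curve D(t), whose control
    points are (i/n, d_i)."""
    d = np.asarray(d, dtype=float)
    n = len(d) - 1
    t = np.arange(n + 1) / n          # abscissae i/n of D's control points
    keep = (d >= dmin) & (d <= dmax)  # control points already inside the strip
    tlo, thi = np.inf, -np.inf
    if keep.any():
        tlo, thi = t[keep].min(), t[keep].max()
    # crossings of control-polygon chords (hull edges included) with the strip
    for i in range(n + 1):
        for j in range(i + 1, n + 1):
            for bound in (dmin, dmax):
                if (d[i] - bound) * (d[j] - bound) < 0:  # chord crosses bound
                    s = (bound - d[i]) / (d[j] - d[i])
                    tc = t[i] + s * (t[j] - t[i])
                    tlo, thi = min(tlo, tc), max(thi, tc)
    return (tlo, thi) if tlo <= thi else None  # None: P(t) misses the fat line
```

For a cubic with distances (2, 1, -1, -2) and a fat line of half-width 0.5, the clipped interval is [1/3, 2/3]; a curve entirely above the strip is rejected outright.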
Introduction – Normal Cone
 A criterion which determines that a sub-domain has at most one solution
 Enables deduction of a subdivision termination
criterion that can isolate all of the distinct roots
of a nonlinear Bernstein polynomial system
(Remember this …?)
14
Introduction – Normal Cone
 A criterion which determines that a sub-domain has at most one solution
 The Normal Cone Test:

(Remember this …?)
If all the normal cones of {Fi(x)}, i = 1, …, n, have no
intersection in a domain,
the domain will contain at most one root
15
Introduction – Normal Cone
 A criterion which determines that a sub-domain has at most one solution
 The Normal Cone Test:

(Remember this …?)
Otherwise, the domain should be subdivided
recursively until all single roots are isolated or
the domain size reaches a prescribed threshold.
Then the quadratically convergent
Newton-Raphson method can be employed
to approximate each single root
16
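A minimal sketch of this final Newton-Raphson step, once a sub-domain with a single root has been isolated (the circle/line system and the function names here are illustrative, not from the paper):

```python
import numpy as np

def newton_raphson(F, J, x0, tol=1e-12, max_iter=50):
    """Newton-Raphson iteration from the center x0 of a sub-domain that is
    known (e.g. by the normal cone test) to contain at most one root."""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        fx = F(x)
        if np.linalg.norm(fx) < tol:
            break
        # solve J(x) * dx = F(x) and step to x - dx
        x = x - np.linalg.solve(J(x), fx)
    return x

# toy system: a unit circle intersected with the line x = y
F = lambda x: np.array([x[0]**2 + x[1]**2 - 1.0, x[0] - x[1]])
J = lambda x: np.array([[2 * x[0], 2 * x[1]], [1.0, -1.0]])
root = newton_raphson(F, J, [0.8, 0.6])  # converges to (sqrt(2)/2, sqrt(2)/2)
```

Starting from the sub-domain center (0.8, 0.6), the iteration converges quadratically to the tangency-free intersection point.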
Introduction – Normal Cone
 A criterion which determines that a sub-domain has at most one solution
 The Normal Cone Test: The Problem
(Remember this …?)
17
Introduction – Normal Cone
 A criterion which determines that a sub-domain has at most one solution
 Another problem arises from the multiple root case
(Remember this …?)
since in a sub-domain containing a multiple root,
the Normal Cone test will always fail.
20
Introduction – Normal Cone
 A criterion which determines that a sub-domain has at most one solution
 Another problem arises from the multiple root case
(Remember this …?)
since in a sub-domain containing a multiple root,
the Normal Cone test will always fail.
 Kantorovich Theorem to the rescue…
(but before… some technicalities…)
23
Agenda
 Introduction – Background, Bézier Clipping, Normal Cone
 Tensor Preliminaries
 Kantorovich Solver
 GPUs for general-purpose computing
 Implementation
 Conclusions
 CUDA and GPU evolution
24
Tensor Preliminaries
 Consider a nonlinear system F(x) = 0,
where F = (F1(x), F2(x), · · · , Fn(x))T
 Its roots are the real points {x} in Rn such that Fi(x) = 0, for i = 1, · · · , n
25
Tensor Preliminaries
 A tensor is a higher dimensional analog of a matrix, where the number
of indices is the rank of the tensor
(Figure: surfaces rendered directly in
terms of their polynomial representations,
as opposed to a collection of
approximating triangles)
 The tensor representation can facilitate arithmetic operations related to
Bernstein polynomials on the SIMD architecture of GPUs
26
Tensor Preliminaries
 There are three operations associated with the rank-n
tensor of a multivariate constraint:
 Contraction
 Transformation
 Norm Estimation
27
Tensor Preliminaries
 There are three operations associated with the rank-n
tensor of a multivariate constraint:
 Contraction
 Tensor contraction corresponds to evaluation of a multivariate
constraint Fi(x) at a point x
28
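A small NumPy sketch of tensor contraction as evaluation (the names and the per-axis `tensordot` formulation are mine; the paper contracts the coefficient tensor against one Bernstein basis vector per variable):

```python
import numpy as np
from math import comb

def bernstein_basis(d, x):
    """Vector of the d+1 Bernstein basis values B_{i,d}(x) on [0, 1]."""
    return np.array([comb(d, i) * x**i * (1 - x)**(d - i) for i in range(d + 1)])

def contract(C, x):
    """Evaluate the multivariate Bernstein polynomial with coefficient
    tensor C at point x, contracting one tensor index per variable."""
    C = np.asarray(C, dtype=float)
    for xk in x:
        # contract the leading axis against that variable's basis vector
        C = np.tensordot(bernstein_basis(C.shape[0] - 1, xk), C, axes=1)
    return float(C)
```

For example, the bilinear tensor [[0, 0], [0, 1]] represents x·y, so contracting at (0.5, 0.25) yields 0.125.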
Tensor Preliminaries
 There are three operations associated with the rank-n
tensor of a multivariate constraint:
 Transformation
 Tensor transformation corresponds to a subdivision operation, which
transforms one tensor on a given domain to a new one on its sub-domain
29
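A sketch of tensor transformation along one parametric direction via de Casteljau subdivision (a standard construction; the paper's transformation tensors realize the same map as a precomputed linear operator, and the function name is mine):

```python
import numpy as np

def subdivide_axis(C, t, axis=0, part="left"):
    """Transform the coefficient tensor C on [0,1] to the sub-domain
    [0,t] (part="left") or [t,1] (part="right") along one direction,
    using the de Casteljau triangle along that axis."""
    C = np.moveaxis(np.asarray(C, dtype=float), axis, 0)
    levels = [C]
    while len(levels[-1]) > 1:
        P = levels[-1]
        levels.append((1 - t) * P[:-1] + t * P[1:])  # one de Casteljau step
    if part == "left":
        out = np.stack([lev[0] for lev in levels])    # first point per level
    else:
        out = np.stack([lev[-1] for lev in reversed(levels)])
    return np.moveaxis(out, 0, axis)
```

For the quadratic x² (Bernstein coefficients (0, 0, 1)), the left half at t = 0.5 has coefficients (0, 0, 0.25) and the right half (0.25, 0.5, 1); applying the function once per axis subdivides a multivariate tensor.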
Tensor Preliminaries
 There are three operations associated with the rank-n
tensor of a multivariate constraint:
 Norm Estimation
 Norm estimation gives a measurement of tensor magnitude, which is
useful in the Kantorovich theorem
30
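Norm estimation can be illustrated with the convex hull property: since the Bernstein basis functions are non-negative and sum to one, the largest coefficient magnitude bounds the polynomial over the whole domain (a minimal sketch, name mine; the paper's bounds are computed on the GPU):

```python
import numpy as np

def sup_norm_bound(C):
    """Upper bound on max |F(x)| over the domain: the Bernstein basis
    forms a partition of unity, so |F(x)| <= max_i |c_i|."""
    return float(np.max(np.abs(C)))
```

For example, a coefficient tensor [[0, -2], [3, 1]] yields the bound 3, usable directly in the Kantorovich-condition estimates.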
Agenda
 Introduction – Background, Bézier Clipping, Normal Cone
 Tensor Preliminaries
 Kantorovich Solver
 GPUs for general-purpose computing
 Implementation
 Conclusions
 CUDA and GPU evolution
31
Reminder – Normal Cone
 A criterion which determines that a sub-domain has at most one solution
 Another problem arises from the multiple root case
(Remember this …?)
since in a sub-domain containing a multiple root,
the Normal Cone test will always fail.
 Kantorovich Theorem to the rescue…
32
The Kantorovich Theorem
 Contributions:

By using the Kantorovich theorem, we can not only identify the
existence of a unique root, but also guarantee the convergence of the
Newton-Raphson iteration from a suitable initial guess

The multiple-root (tangential) case can be solved more efficiently.
33
The Kantorovich Theorem
 Main Idea:

If conditions are satisfied in the Kantorovich theorem, there will be two
concentric regions surrounding the initial guess:
1. The larger one is the region in which a unique zero exists
2. The smaller one contains the entire Newton-Raphson iteration sequence,
which converges to that unique zero

This is helpful for solving the multiple root case, since we can improve the
efficiency of root finding by terminating the subdivision earlier than the
normal cone based method
34
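A sketch of the classical Kantorovich condition and the two radii discussed above (this uses the textbook constants β, η, L; the paper's tensor-based norm bounds estimate these quantities, but its exact formulation may differ, and the function name is mine):

```python
import math

def kantorovich_radii(beta, eta, L):
    """Classical Kantorovich constants:
    beta >= ||F'(x0)^-1||, eta >= ||F'(x0)^-1 F(x0)||, L = Lipschitz
    constant of F'. If h = beta*eta*L <= 1/2, a unique root exists in
    the ball of radius r1 around x0, and the Newton-Raphson sequence
    from x0 stays within radius r0 and converges."""
    h = beta * eta * L
    if h > 0.5:
        return None  # condition fails: subdivide the domain further
    s = math.sqrt(1 - 2 * h)
    r0 = (1 - s) / (beta * L)  # smaller, convergence region
    r1 = (1 + s) / (beta * L)  # larger, uniqueness region
    return r0, r1
```

With β = 1, η = 0.1, L = 1 we get h = 0.1 and the concentric radii r0 ≈ 0.106 and r1 ≈ 1.894; with h > 1/2 the test reports failure and the sub-domain is subdivided.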
The Kantorovich Theorem
 Definition:
35
The Kantorovich Theorem
 Definition:
36
The Kantorovich Theorem
 Definition:
37
The Kantorovich Theorem
 Definition:
38
The Kantorovich Theorem
 Example:

Two planar algebraic curves
intersect at one point x*

N(x0, r1) is the region in which there is
a unique root

N(x0, r0) is the convergence region of
the subsequent Newton-Raphson
iterations
39
The Kantorovich Theorem
 Example:

If the long edge d of the sub-domain D
satisfies the bound from the theorem, then
the sub-domain D lies completely within the
neighborhood N(x0, r1)

Thus, there is a unique root in D
40
The Kantorovich Theorem
 Example:

Otherwise, if d violates that bound, then
D is not contained in the neighborhood N(x0, r1)

In these cases, we should subdivide
the sub-domain D further so that we
can delimit the unique root and its
convergence region
41
Agenda
 Introduction – Background, Bézier Clipping, Normal Cone
 Tensor Preliminaries
 Kantorovich Solver
 GPUs for general-purpose computing
 Implementation
 Conclusions
 CUDA and GPU evolution
42
GPUs
for general-purpose computing
 The problem with the Kantorovich theorem was that, in practice, it was
almost always more work to estimate its parameters than to run the
NR method and simply check for convergence
 The paper's tensor-based norm bounds, combined with the tremendous
computational horsepower and high memory bandwidth of modern GPUs,
gave better results than other methods
43
GPUs
for general-purpose computing
 Driven by high performance and high quality 3D graphics
applications, the programmable GPU has evolved into
a highly parallel, multithreaded, manycore processor with
tremendous computational horsepower and a very high
memory bandwidth
44
GPUs
for general-purpose computing
 With the rapid development of GPGPU (General Purpose
computing on Graphics Processing Units), graphics
hardware is becoming an attractive new parallel computing
platform.
 The proposed subdivision-based nonlinear system solver, based on the
Kantorovich theorem, is tailored to the SIMD architecture of contemporary
GPUs
45
GPUs
for general-purpose computing
 GPU vs CPU Performance
 A simple way to understand the difference between a GPU and a CPU is to
compare how they process tasks.
 A CPU consists of a few cores optimized for
sequential serial processing while a GPU has
a massively parallel architecture consisting of
thousands of smaller, more efficient cores
designed for handling multiple tasks simultaneously.
46
GPUs
for general-purpose computing
 GPU vs CPU – MythBusters Style…
https://www.youtube.com/watch?v=-P28LKWTzrI
47
Agenda
 Introduction – Background, Bézier Clipping, Normal Cone
 Tensor Preliminaries
 Kantorovich Solver
 GPUs for general-purpose computing
 Implementation
 Conclusions
 CUDA and GPU evolution
48
Implementation
 To exploit the parallelism of GPUs, each Fi of n variables is
subdivided uniformly into multiple sub-domains per direction, instead of
the bisection used in the serial Kantorovich subdivision
 Obviously, a denser subdivision results in a smaller
subdivision depth.
However, limited GPU resources will offset the benefits of
increased parallelism
49
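The uniform split above can be sketched as follows (assuming m uniform splits per axis, giving m^n sub-domains; names are mine):

```python
import itertools

def uniform_subdomains(m, n):
    """Split the unit n-cube into m^n equal sub-domains; each is a list of
    per-axis intervals [k/m, (k+1)/m]. Each sub-domain can then be handled
    by one GPU thread (or CPU worker) independently."""
    return [
        [(k / m, (k + 1) / m) for k in idx]
        for idx in itertools.product(range(m), repeat=n)
    ]
```

For m = 4 and n = 2 this yields 16 sub-domains, versus 2 per step under plain bisection, which is where the extra parallelism comes from.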
Implementation
 The most time-consuming step in our algorithm is the
subdivision of the constraints
 As an alternative approach, a multivariate Bernstein
polynomial is represented in tensor form
 The control coefficients tensor is associated with two operations, i.e.
contraction and transformation (as we have seen before)
50
Implementation
 The control coefficients of a tensor on a sub-domain can be obtained by
sequential tensor transformations of the entire domain along each
direction
 If we subdivide a Bernstein polynomial of degree d into m ones
uniformly, the k-th transformation tensor in Equation (2) can be
obtained via a closed-form formula
- Subdivision of Fi
51
Implementation
 The Multiple Root Case: The Tangent Root – The Problem
 If a domain contains a multiple root, the nonlinear system
will always fail the NC test.
Thus, the subdivision-based methods will keep subdividing the
domain until the size of the sub-domain is less than the threshold.
 In this case, the large number of subdivisions seriously affects the
efficiency of the algorithm, even if there is a good initial guess in the sub-domain.
52
Implementation
 The Multiple Root Case: The Tangent Root – The Solution
 The main idea is to consider as many initial guesses as possible, rather
than just the center, in the Kantorovich test, so that we can choose
the well-defined guesses.
 The local distribution of the root around these initial guesses can give us
more information to purge away useless regions according to the
Kantorovich theorem
53
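The multiple-guess idea can be sketched like this (grid sampling is my choice for illustration; the paper may distribute guesses differently, and the name is mine):

```python
import itertools

def candidate_guesses(box, per_axis=3):
    """Sample a grid of initial guesses inside a sub-domain, not just its
    center, so the Kantorovich test has a chance to succeed near a
    tangential (multiple) root and useless regions can be purged."""
    axes = [
        [lo + (i + 0.5) * (hi - lo) / per_axis for i in range(per_axis)]
        for lo, hi in box
    ]
    return list(itertools.product(*axes))
```

Each guess is then fed to the Kantorovich test; guesses whose uniqueness/convergence regions cover the sub-domain let us stop subdividing early.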
Implementation
 We have implemented the proposed algorithm on a PC
with an Intel Core 2 Quad 2.83GHz CPU and an NVIDIA
GTX280 GPU
 The parallel solver on the GPU is written in CUDA v2.5, a C-language
interface for general-purpose GPU computing
54
Implementation - Example
55
Agenda
 Introduction – Background, Bézier Clipping, Normal Cone
 Tensor Preliminaries
 Kantorovich Solver
 GPUs for general-purpose computing
 Implementation
 Conclusions
 CUDA and GPU evolution
56
Conclusions
 By exploiting the parallelism of the GPU, our algorithm can achieve over
a 100-times speedup for a large number of systems, compared with the
CPU solver
 The proposed solver can also deal with under-determined systems and
the multiple-root case
 This work can also be adapted to other SIMD-architecture processors,
such as multicore CPUs.
57
Conclusions
 Currently the proposed parallel solver is designed for the GPU, which has no
flexible memory management. Thus, its scalability is subject to the hardware
specification.
 Stream processors in contemporary GPUs are designed for single-precision
floating-point arithmetic; double-precision arithmetic has only recently been
added to address scientific and high-performance computing applications.
The performance of double-precision arithmetic is still much slower than
single precision. However, we believe that the rapid development of GPUs
can overcome this restriction.
58
Agenda
 Introduction – Background, Bézier Clipping, Normal Cone
 Tensor Preliminaries
 Kantorovich Solver
 GPUs for general-purpose computing
 Implementation
 Conclusions
 CUDA and GPU evolution
59
Cuda and GPU Evolution
 Since the paper was released back in 2011, a lot has changed in the
GPGPU realm:
GTX 280 vs. TESLA P100
60
Cuda and GPU Evolution
 Since the paper was released back in 2011, a lot has changed in the
GPGPU realm:
61
Thank You…
62