Real Time Computing With the Parareal Algorithm

Florida State University Libraries
Electronic Theses, Treatises and Dissertations
The Graduate School
2008
FLORIDA STATE UNIVERSITY
COLLEGE OF ARTS AND SCIENCES
REALTIME COMPUTING WITH THE PARAREAL ALGORITHM
By
CHRISTOPHER R. HARDEN
A Thesis submitted to the
School of Computational Science
in partial fulfillment of the
requirements for the degree of
Master of Science
Degree Awarded:
Spring Semester, 2008
The members of the Committee approve the Masters Thesis of Christopher R. Harden
defended on April 8, 2008.
Janet Peterson
Professor Directing Masters Thesis
Max Gunzburger
Committee Member
Robert Van Engelen
Committee Member
Approved:
Max Gunzburger, Director
Department of School of Computational Science
The Office of Graduate Studies has verified and approved the above named committee members.
This thesis is dedicated to all of the people who have helped and guided me throughout my
research including but not limited to Janet Peterson, Max Gunzburger, John Burkardt,
Robert Van Engelen, and the many other professors who have provided me with an
excellent level of instruction throughout all of my course work here at FSU. Also, I would
like to dedicate this work to my wife, Jennifer Alligood, and my son, Youth, for their
infinite depth of understanding of my situation as a graduate student and for their
unending support throughout this endeavor.
ACKNOWLEDGEMENTS
I would like to acknowledge my gratitude to my committee members, who have taken the
time to review my work and to pose the difficult questions which have kept me honest
and thus have allowed me to grow as an academic throughout this process. A special
acknowledgment is in order for my advisor who took a chance in bringing me under her
wing and teaching me how to become a researcher and to John Burkardt, who took a lot of
time out of his schedule to teach me many of the tools that were necessary for me to be able
to complete this work. Also, some thanks are owed to Clayton Webster for all of his TEX
support.
TABLE OF CONTENTS
List of Tables
List of Figures
Abstract
1. INTRODUCTION
2. The Parareal Algorithm
   2.1 The Basic Algorithm
   2.2 A Simple Example
   2.3 Comments on Some Mathematical Properties of the Parareal Algorithm
3. The Finite Element Method and The Parareal Algorithm
   3.1 A Finite Element Method
   3.2 The Finite Element Method and the Parareal Algorithm for Nonlinear PDEs
4. Combining the Parareal Algorithm and Reduced Order Modeling
   4.1 Reduced Order Modeling with Proper Orthogonal Decompositions
   4.2 Reduced Order Modeling and The Parareal Algorithm
   4.3 Implementation
5. Computational Experiments and Results
   5.1 FEM and The Parareal Algorithm Results
   5.2 ROM and The Parareal Algorithm Results
6. Performance Analysis and Scalability
   6.1 Introduction to performance analysis concepts and terminology
   6.2 Problem Parameters in our FEM and ROM Parareal Implementations
   6.3 Strong Scaling Trends of the Parareal Algorithm
7. Conclusions and Future Work
   7.1 Conclusions
   7.2 Future Work
REFERENCES
BIOGRAPHICAL SKETCH
LIST OF TABLES
5.1 Comparison of errors using standard finite element approach and the parareal/FEM approach
5.2 Speedup results for the parareal/FEM approach compared to the serial FEM approach
5.3 Comparison of errors for 4-parameter problem using standard ROM approach and the parareal/ROM algorithm. Errors are calculated by comparing to the full finite element solution.
6.1 Results of FEM Test Case 1
6.2 Results of FEM Test Case 2
6.3 Results of FEM Test Case 3
6.4 Results of FEM Test Case 4
6.5 Results of FEM Test Case 5
6.6 Results of FEM Test Case 6
6.7 Results of ROM Test Case 1
6.8 Results of ROM Test Case 2
LIST OF FIGURES
2.1 Illustration of the coarse and fine grids
2.2 B2, Exact Solution
2.3 B2, Parareal Solution on the Coarse Grid After 2 Iterations of the Correction Scheme
2.4 B1, Phase Portrait
2.5 B1, Coarse Grid Solution and Divergent Refined Solution
4.1 The H-cell domain of the building ventilation problem
5.1 Speedup of Parareal/FEM Implementation: Blue ∆t = 0.01 and Red ∆t = 0.005
5.2 Speedup of Parareal/ROM Implementation: Blue ∆t = 0.01 and Red ∆t = 0.005
5.3 The H-cell domain of the building ventilation problem, with boundary parameters illustrated
5.4 Speedup of Parareal/ROM Implementation of the Navier-Stokes Problem: Blue ∆t = 0.01 and Red ∆t = 0.005
6.1 Suite A, Speedup vs. Processors
6.2 Speedup of Parareal/FEM with h = 0.1, ∆t = 0.005, and T = 1
6.3 Speedup of Parareal/FEM with h = 0.1, ∆t = 0.001, and T = 1
6.4 Speedup of Parareal/FEM with h = 0.05, ∆t = 0.005, and T = 1
6.5 Speedup of Parareal/FEM with h = 0.05, ∆t = 0.001, and T = 1
6.6 Speedup of Parareal/FEM with h = 0.1, ∆t = 0.001, and T = 10
6.7 Speedup of Parareal/FEM with h = 0.05, ∆t = 0.001, and T = 10
6.8 Speedup of Parareal/ROM with h = 0.1, ∆t = 0.005, and T = 1
6.9 Speedup of Parareal/ROM with h = 0.1, ∆t = 0.001, and T = 10
ABSTRACT
This thesis presents and evaluates a particular algorithm for the real time computation of time dependent ordinary and partial differential equations, one which employs a parallelization strategy over the temporal domain. We also discuss the coupling of this method with another popular technique used for real time computations, model reduction, which will be shown to provide more gains than either method alone. In particular, we look at reduced order modeling based on proper orthogonal decompositions. We present applications to solving time dependent nonlinear partial differential equations, both with the parareal algorithm alone and with a coupled approach combining model reduction and the parareal algorithm. The performance of this method, both numerical and computational, is discussed in terms of the gains in speedup and efficiency, and in terms of the scalability of the parallelization of the temporal domain on larger and larger sets of compute nodes or processors.
CHAPTER 1
INTRODUCTION
Many of the equations and systems governing the various mechanisms of nature implicitly
involve time evolution and time dependencies in general. Many successful attempts have
been made to construct computational schemes for integration in time. Until recently, many
of these schemes have shared the common theme of being purely serial in implementation.
Traditional schemes seem to have taken the view that time, itself, is sequential and many
schemes have been designed in ways such that computations at a current time step rely on
values computed at previous time steps. In contrast to this traditionally sequential view of the temporal domain, researchers have dealt with the spatial domain in a wide variety of ways: serial, parallel, or a combination of both.
In the context of time dependent ordinary and partial differential equations many parallel
schemes have been proposed in which discretizations of the spatial domain are implemented
in parallel. One class of methods in which such implementations are prevalent is in domain
decomposition methods. These methods have been shown to be useful in both serial and
parallel implementations.
The scheme presented here shares much of the flavor of these domain decomposition
methods. There are two significant points of departure though. The first is the focus on
decomposing the temporal domain and not the spatial domain. The second lies in the fact
that this scheme has absolutely no value in a serial computation and is, therefore, a purely
parallel algorithm. Also, many readers will notice its similarity to the various flavors of iterative improvement methods for linear systems.
The Parareal algorithm was introduced by Lions, Maday, and Turinici in 2001 [13]
as a numerical method to solve time-evolution problems in parallel. The name of the
algorithm already indicates the intention of its design. The purpose is for parallel, real time
computations involving time evolution equations whose solutions may not be obtainable
in real time using only a single-processor machine. Similar to what takes place in spatial decompositions, they introduced a decomposition of the temporal domain and, in the spirit of domain decomposition, both coarse and fine grid resolutions in time. These
grids are then combined in a corrector scheme which allows for the coarse resolution to be
updated iteratively while preserving the accuracy and stability of the time discretization
scheme being used over the coarse and fine grid resolutions. The coarse grid approximation
and the application of the corrector scheme are purely serial in the implementation. The fine
grid approximations are serial only within each sub-domain of the coarse grid, thus allowing
for the parallel implementation of the fine grid scheme on each of these sub-domains. The
corrector scheme is then used to update the coarse grid approximation using the results of
the fine grid approximations on each sub-domain, which have been computed simultaneously
in parallel, and this procedure is iterated until convergence.
In principle, the Parareal algorithm is completely problem independent. Further, the
algorithm leaves much flexibility in the choice of a time discretization scheme. It is important
to note that the algorithm can never exceed the accuracy or the stability of the numerical
scheme being employed. Also, when we talk about the convergence of the algorithm, we do so in terms of its approach toward the solution that would have been obtained had the problem been solved directly using the fine grid over the entire temporal domain. Another
upshot of this algorithm, which makes it especially promising for real time computations,
and in stark contrast to the more traditional spatial decompositions, is that in a true parallel
implementation the algorithm requires a very minimal amount of communication between
any of the processors carrying out the fine grid approximations.
In this work, we explain the computational capabilities of this algorithm and then
analyze the practical results of the computations of some numerical experiments involving
a variety of complex systems. The original contribution being made in this work is the
combining of the parareal algorithm and model reduction to obtain more computational
gains than either method alone. We will look at the effective speedup and efficiency of the
parallel computations versus their serial implementations as a function of the number of
processors employed and with this provide an analysis of the scalability of these parallel
implementations.
We begin with an introduction to the standard parareal algorithm. In this introductory
section we illustrate the algorithm with a very simple ordinary differential equation (ODE) to
help make the idea clear. In the next chapter we discuss our implementation of the algorithm
using a Finite Element Method (FEM) for the solution of nonlinear partial differential
equations (PDEs). The next chapter will detail our coupling of the algorithm with Reduced
Order Modeling (ROM) to obtain further gains in speedup and efficiency. We then devote an entire chapter to the numerical experiments that were implemented, followed by a chapter on performance and scalability analysis. Finally, a summary of this thesis and a discussion of future work is given in the concluding chapter.
CHAPTER 2
The Parareal Algorithm
2.1 The Basic Algorithm
In general we are interested in solving time dependent differential equations of the general
form,
∂u/∂t + Au = f,   t ∈ [0, T], x ∈ Ω,
u(0, x) = u_0,   (2.1)
where, in general, A is an operator from a Hilbert Space S into another Hilbert space S ′ .
A typical example would be the standard heat equation in which case, A = −∇2 = −∆,
the Laplace operator. Although, for the sake of this introduction, we focus primarily on the case of A being a linear operator, the algorithm works well for nonlinear cases too, and these are the class of primary interest in our research. Nonlinearities can also enter this general form through f, depending on whether it is a function of the solution u and/or any of its derivatives. To illustrate the basic idea of the method, in this chapter our focus will be on an implementation involving ODEs in which the equation depends solely on time; in this case one can think of A as either a constant or a single function of time, for scalar equations, or as a coefficient matrix, for a system of equations.
We then introduce a decomposition of the time interval [0, T] into N equal subintervals [T^n, T^(n+1)] of size ∆T = T/N such that,

0 = T^0 < T^1 < · · · < T^n < T^(n+1) < · · · < T^N = T,   where T^n = n∆T.   (2.2)
It is not necessary for the time step to be uniform, but uniformity makes the introduction to the method a bit simpler. We then solve the discretized problem over this coarse grid with the large time step ∆T and thus obtain a cheap approximation with a coarse resolution at each T^n ∈ [0, T]. We denote this coarse grid solution by U^n ≈ u(T^n), and of course U^0 = u_0.
Next, we pose a similar, yet connected, problem on each subinterval [T^n, T^(n+1)], which is the fine grid resolution with the much smaller time step δt. In the case of uniform time steps, the coarse step is directly proportional to the fine step, i.e., ∆T = M δt, where M is the number of fine steps taken on each of the N subintervals [T^n, T^(n+1)]. On each of these subintervals we solve the problem,

∂u^n/∂t + Au^n = f,   t ∈ [T^n, T^(n+1)],
u^n(T^n) = U^n,   (2.3)

where u^n is the fine grid approximation of u for t ∈ [T^n, T^(n+1)]. The fine grid resolution on each subinterval can be computed independently and thus in parallel. If we let k denote the iterate and U_1^n = U^n, u_1^n = u^n, then we can iterate to improve the accuracy of the scheme by using the known values of U_k^n and u_k^n over each [T^n, T^(n+1)] in a corrector scheme to obtain better approximations U_(k+1)^n and u_(k+1)^n over these subintervals. Specifically, we

(i) introduce the defect: S_k^n = u_k^(n-1)(T^n) − U_k^n;
(ii) then on the coarse grid solve for the correction, δ_k^n, to the fine grid solution at T^n using the defect S_k^n;
(iii) then update U_(k+1)^n = u_k^(n-1)(T^n) + δ_k^n and solve (2.3) again in parallel.
This procedure is then iterated until the accuracy that would have been attained by solving
the full fine grid resolution problem over [0, T ] with time step δt is achieved.
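The corrector steps above can be sketched in a few lines of code. The following is a minimal sketch, not the thesis implementation: `coarse` and `fine` are assumed user-supplied single-subinterval propagators, and the update is written in the predictor-corrector form, which for the linear problems considered here is algebraically equivalent to the defect-correction steps (i)-(iii).

```python
def parareal(coarse, fine, u0, N, K):
    """Parareal iteration in predictor-corrector form (illustrative sketch).

    coarse(n, u): cheap propagator across subinterval [T^n, T^(n+1)]
    fine(n, u):   accurate propagator across the same subinterval
    """
    # Serial coarse sweep: initial approximations U^n ~ u(T^n)
    U = [u0]
    for n in range(N):
        U.append(coarse(n, U[n]))
    for _ in range(K):
        # Fine solves start from the current coarse values; these N solves
        # are independent and would run in parallel, one per processor.
        F = [fine(n, U[n]) for n in range(N)]
        G = [coarse(n, U[n]) for n in range(N)]
        # Serial correction sweep over the coarse grid
        V = [u0]
        for n in range(N):
            V.append(coarse(n, V[n]) + F[n] - G[n])
        U = V
    return U

# Demo on y' = -y with backward Euler: one coarse step vs. 100 fine steps
lam, T, N = 1.0, 1.0, 4
dT = T / N
G1 = lambda n, u: u / (1.0 + lam * dT)
def F1(n, u):
    for _ in range(100):
        u = u / (1.0 + lam * dT / 100)
    return u
U = parareal(G1, F1, 1.0, N, N)
```

After N iterations the parareal values reproduce the serial fine-grid solve exactly, which is the convergence behavior described above.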
The important thing to keep in mind is that the steps involving calculations on the
fine grid within each subinterval are independent and thus parallelizable. To illustrate the
method, let us look at an implementation of the algorithm employed on a simple ODE.
2.2 A Simple Example
Here is an example of the use of this algorithm on a very simple, first order linear ODE. To aid simplicity further, we employ an implicit (backward) Euler scheme for the discretization of the temporal domain. This example helps illustrate the mechanics of implementing the algorithm in practice.

Figure 2.1: Illustration of the coarse and fine grids (coarse nodes T^0, T^1, …, T^N with spacing ∆T; each subinterval [T^n, T^(n+1)] refined with fine step δt)

Consider the ODE,
dy/dt − a y(t) = sin(t),   t ∈ [0, T],
y(0) = y^0,   (2.4)

and then employ the backward Euler scheme over the coarse grid, 0 = T^0 < T^1 < · · · < T^N = T, with time step ∆T,

(Y^(n+1) − Y^n)/∆T − a Y^(n+1) = sin(t^(n+1)),
Y^0 = y^0,   (2.5)
where n = 0, 1, . . . , N − 1. Next, we pose a similar problem on each sub-domain [T^n, T^(n+1)] using the fine grid resolution with time step δt. Here we use the nodal values computed over the coarse grid at each T^n as the initial value for this new but similar problem on each [T^n, T^(n+1)] as follows,

dy^n/dt − a y^n(t_δt) = sin(t_δt),   t_δt ∈ [T^n, T^(n+1)],
y^n(t_δt = T^n) = Y^n,   (2.6)

where Y^n is the solution of (2.5) at T^n and t_δt denotes the values of t within each of the subintervals [T^n, T^(n+1)] with fine time step δt.
The next step is to use the information computed on each of these sub-domains to update the coarse grid solution at each T^n. In fact, the only information needed from the computations done on these subintervals is the value of each y^n(T^(n+1)), n = 0, 1, . . . , N − 1; that is, the last (right-most) value computed on each subinterval [T^n, T^(n+1)]. Each such value sits at the coarse node immediately to the right of its subinterval, and is used to update the nodal values over the coarse grid through the corrector scheme as follows:

(i) introduce the defects S_k^n = y_k^(n-1)(T^n) − Y_k^n;
(ii) then propagate the defect with a coarse resolution solve for the corrections δ_k^n as follows,

(δ_k^(n+1) − δ_k^n)/∆T − a δ_k^(n+1) = S_k^n/∆T,
δ_k^0 = 0.   (2.7)

Note: we define y_k^(-1)(T^0) = y^0. Also, we introduce k as the iteration number; hence, if this is the first application of this procedure, then k = 1.
The computations done with the fine grid resolution on each subinterval are all independent problems and thus parallelizable. The only information required globally by the various processors is the nodal values of the coarse grid approximation Y_k^n. Once each y^n(T^(n+1)) has been computed, they may be returned to the master processor and used to correct the original nodal values over the coarse grid approximation by solving (2.7) and then setting,

Y_(k+1)^n = y_k^(n-1)(T^n) + δ_k^n.   (2.8)
Finally, we iterate this procedure by next solving the problem,

dy_(k+1)^n/dt − a y_(k+1)^n(t_δt) = sin(t_δt),   t_δt ∈ [T^n, T^(n+1)],
y_(k+1)^n(t_δt = T^n) = Y_(k+1)^n,   (2.9)

and we continue iterating in this fashion until convergence is reached.
Note that the
convergence here is used in the sense of the coarse approximation approaching the accuracy
of the fine grid approximation. Once the error in the coarse grid approximation gets to be
within machine epsilon of the error in the fine grid approximation then further iterations
become superfluous. There is, therefore, an optimal number of iterations required to reach
the desired accuracy of the fine grid approximation.
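As a concrete illustration, the scheme (2.5)-(2.8) for this example can be transcribed almost literally. The sketch below uses invented parameter values (a, T, N, M) and runs the per-subinterval fine solves serially as a stand-in for the parallel implementation; it is not the thesis code.

```python
import numpy as np

# Parareal for dy/dt - a*y = sin(t), y(0) = y0, with backward Euler on both
# grids, following (2.5)-(2.8).
a, y0, T = -1.0, 1.0, 1.0
N, M = 5, 20                        # coarse subintervals; fine steps per subinterval
dT = T / N
dt = dT / M
Tn = np.linspace(0.0, T, N + 1)     # coarse nodes T^n

def fine_end(u, t_left):
    """Backward Euler with step dt over [t_left, t_left + dT]; returns the end value."""
    y, t = u, t_left
    for _ in range(M):
        t += dt
        y = (y + dt * np.sin(t)) / (1.0 - a * dt)
    return y

# Initial coarse sweep (2.5), backward Euler with step dT
Y = np.empty(N + 1)
Y[0] = y0
for n in range(N):
    Y[n + 1] = (Y[n] + dT * np.sin(Tn[n + 1])) / (1.0 - a * dT)

for k in range(N):                  # parareal iterations
    # Fine solves (2.6) on each subinterval, from the current coarse values
    # (independent, so done in parallel in a real implementation)
    F = np.array([fine_end(Y[n], Tn[n]) for n in range(N)])
    # Defects (i) and the coarse correction sweep (2.7)
    S = np.zeros(N + 1)
    S[1:] = F - Y[1:]
    delta = np.zeros(N + 1)
    for n in range(N):
        delta[n + 1] = (delta[n] + S[n]) / (1.0 - a * dT)
    # Update (2.8): Y^n_{k+1} = y^{n-1}(T^n) + delta^n
    Y[1:] = F + delta[1:]

# Reference: a serial fine-grid solve over all of [0, T]
y_ref = y0
for n in range(N):
    y_ref = fine_end(y_ref, Tn[n])
print(abs(Y[-1] - y_ref))           # after N iterations, agrees to machine precision
```

After N iterations the corrected coarse values agree with the serial fine-grid solve up to roundoff, matching the convergence notion described above.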
This algorithm is rich in computation over communication. Many of the methods used to
solve PDEs numerically, such as our FEM and ROM methods, are dominated in computation
and in storage by having to solve large linear systems of equations. In the linear case it is one
linear solve for each time step, and the nonlinear solve typically involves many linear solves.
In an implementation of the parareal algorithm the expensive, i.e. the many solves done on
the fine grid, portion of the algorithm is done in parallel while it requires one initial coarse
solve, to generate initial data for each of the fine solves, and only one coarse solve for the
correction term per iteration. Furthermore, we are saving on storage and communication by
only storing and then communicating the solution of the final step in the fine solve with step
δt. Thus, if we have J spatial unknowns at each of the N time steps in the coarse grid, we will be passing arrays of size J a total of N times; hence we pass JN doubles or floats (or some comparable data structure) per iteration of this algorithm. Since N is typically small compared to M (the number of steps in the fine grid), we pass only a small number of arrays per iteration while performing a large number of linear solves in parallel. The method is thus dominated by computation, with minimal communication.
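A back-of-envelope calculation makes this balance concrete. The sizes below are invented for illustration and are not taken from our experiments.

```python
# Communication vs. computation per parareal iteration, with placeholder sizes.
J = 10_000      # spatial unknowns per time level (hypothetical)
N = 16          # coarse subintervals, one per processor (hypothetical)
M = 1_000       # fine steps per subinterval (hypothetical)

doubles_per_iter = J * N          # one length-J end-state array per subinterval
linear_solves_per_iter = N * M    # fine-grid solves, done N-way in parallel
print(doubles_per_iter, linear_solves_per_iter)
```

Even for this modest problem, each iteration performs thousands of linear solves in parallel while communicating only N short arrays.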
2.3 Comments on Some Mathematical Properties of the Parareal Algorithm
The primary focus of this work is to highlight the computational aspects of this algorithm
both in terms of performance and implementation. The purpose here is simply to highlight
some of the basic mathematical properties of the algorithm which have already been given
a great deal of attention in the literature.
In principle, the convergence and stability of the method are essentially inherited from the numerical scheme being employed over the coarse and fine grids. There are some cases where this fails, though. For a thorough discussion of these results the reader is referred to [1], where one will find the most general and abstract analysis of the mathematical details of the convergence and stability criteria for the parareal algorithm. One can find similar results in [5]. For a nice general overview of the basic mathematical results for this algorithm, without all of the analysis and formal proofs, the reader should look at [17].
First, we provide a simple example problem which works well with this algorithm and
then we shall take a look at a case where the method breaks down and discuss why it fails in
that case. Both of these problems were taken from a test suite of dynamical systems found
in [12].
Consider the following dynamical system, which models a linear chemical reaction; denote the system by B2:

y1′ = −y1 + y2,        y1(0) = 2,
y2′ = y1 − 2y2 + y3,   y2(0) = 0,   (2.10)
y3′ = y2 − y3,         y3(0) = 1.
First, take a look at the exact solution in Figure (2.2).
Next, take a look at the parareal solution on the coarse grid after two iterations of the
correction scheme in Figure (2.3). A simple backward in time Euler method was used on
both the coarse and the fine grid. This is an example of a problem where the parareal
algorithm works very well.
Now, let us consider the following dynamical system, whose equations model the growth of two conflicting populations; denote the system by B1:

y1′ = 2(y1 − y1y2),   y1(0) = 1,   (2.11)
y2′ = −(y2 − y1y2),   y2(0) = 3.
First, take a look at a phase portrait of the system (2.11) in Figure (2.4).
Figure 2.2: B2, Exact Solution

Again, a simple backward in time Euler method is employed here, but the parareal algorithm breaks down for this system. In Figure (2.5) we see the initial coarse solve in red and, in navy blue, one iteration of the corrector scheme, which is diverging from the initial coarse solution. The question, then, is what was different about these two systems such that the parareal algorithm worked well for B2 but falls apart for B1? If you are thinking that it has to do with B1 being nonlinear, it is important to note that the nonlinearity has absolutely nothing to do with why the correction scheme diverged.
Figure 2.3: B2, Parareal Solution on the Coarse Grid After 2 Iterations of the Correction Scheme

B1 is a dynamical system whose eigenvalues are not just complex but very nearly purely imaginary. For systems whose eigenvalues are very near to being purely imaginary, or for which the magnitude of the imaginary part is much greater than the magnitude of the real part, the correction scheme in the parareal algorithm becomes dissipative and quickly begins to diverge from the original coarse grid solution.
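This diagnosis is easy to check numerically: B2's coefficient matrix has real, non-positive eigenvalues, while the Jacobian of B1 at its nontrivial equilibrium has purely imaginary eigenvalues. The snippet below is our own quick check; the choice of linearization point (y1, y2) = (1, 1) is ours and is not stated in the thesis.

```python
import numpy as np

# Eigenvalues of the linear system B2 (2.10): coefficient matrix of the RHS
A_B2 = np.array([[-1.0,  1.0,  0.0],
                 [ 1.0, -2.0,  1.0],
                 [ 0.0,  1.0, -1.0]])
print(np.linalg.eigvals(A_B2))      # real and non-positive: 0, -1, -3

# B1 (2.11) is nonlinear; linearize about its equilibrium (y1, y2) = (1, 1):
#   f1 = 2(y1 - y1*y2),  f2 = -(y2 - y1*y2)
J_B1 = np.array([[0.0, -2.0],       # df1/dy1 = 2(1 - y2) = 0, df1/dy2 = -2*y1 = -2
                 [1.0,  0.0]])      # df2/dy1 = y2 = 1,        df2/dy2 = -(1 - y1) = 0
print(np.linalg.eigvals(J_B1))      # purely imaginary: +/- i*sqrt(2)
```

B2's spectrum lies on the negative real axis, where the backward Euler corrector behaves well, while B1's linearization sits on the imaginary axis, the regime in which the corrector becomes dissipative.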
Figure 2.4: B1, Phase Portrait

If one looks closely at Figure (2.5), it will be seen that the corrector scheme initially does fine, and only after a few corrections does it start to veer away from the initial coarse grid solution. So far, what researchers have proposed in the literature is the most obvious remedy: monitor when the corrector begins to veer away from the previously iterated solution, particularly when one has some a priori knowledge of the eigenvalues associated with the problem, such as when working with hyperbolic equations or equations with very large amplitude oscillations. When the corrector begins to veer, they halt the iteration and restart it from the previous step at which it began to veer away from the desired trajectory. For more details on the performance of a restarted version of the parareal algorithm the reader is referred to [2], [4], and [17].
Figure 2.5: B1, Coarse Grid Solution and Divergent Refined Solution
CHAPTER 3
The Finite Element Method and The Parareal
Algorithm
In the previous chapter we illustrated the method using a very simple linear ODE. It is
our primary goal, though, to utilize the parareal algorithm for the real time computation of much larger and more complex multivariate nonlinear differential equations. Our method for discretizing PDEs in space is to utilize a Finite Element Method. In this chapter we provide an extremely brief overview of the FEM for time dependent partial differential equations.
Then we explain how we implement the parareal algorithm to parallelize the temporal domain
coupled with a FEM discretization of the spatial domain.
3.1 A Finite Element Method
The most significant difference between a Finite Element Method (FEM) and a more typical Finite Difference Method (FDM) is this: rather than directly discretizing the derivatives in the equation by replacing them with finite difference quotients, as in the FDM, we work with the weak or integral form of the equation and then seek an approximation to the solution,

u(x) ≈ Σ_{j=1}^{N} c_j φ_j(x),   (3.1)
where the φ_j's are a finite set of basis functions, typically piecewise polynomials, and N is the cardinality of this set of basis functions. The c_j's are then the unknown coefficients to be determined so that the expansion in terms of these basis functions holds for the problem of interest.
Although we are ultimately interested in time dependent PDEs, it is enough, for this work, to have a basic understanding of how we implement our FEM discretization in space.
As an illustration consider an elliptic problem which can be thought of physically as the
steady state form of the linear heat or diffusion equation in two spatial dimensions over a
unit rectangle. If we let Ω be our domain and ∂Ω be the boundary of our domain then, the
classic or strong problem takes the familiar form,
−∆u = f(x, y),   (x, y) ∈ Ω = (0, 1) × (0, 1),   (3.2)
u = g   on ∂Ω,   (3.3)
where g is our Dirichlet boundary condition. This example is referred to as an elliptic PDE with Dirichlet boundary conditions; however, FEMs adapt readily to a much larger variety of boundary conditions and domains, which is another benefit of using a FEM.
To formulate the corresponding integral or weak form of the problem, we begin by choosing an appropriate space of test functions. In general, this choice will be a type of Hilbert space: a complete inner product space, equipped with the norm (and metric) induced by its inner product. Such spaces give us, and the mathematical analyst in general, all of the tools necessary to prove the properties needed for the numerical convergence, stability, and consistency of these FEM schemes. Mathematically, this provides another desirable advantage over many FDMs.
The choice of the underlying Hilbert space is most directly tied to the degree of smoothness required of the solution. For second order PDEs it is very common to work with the Sobolev space H^1(Ω), and in fact this is our choice for the underlying function space in the implementations presented in this work. Informally, this is simply the space of all functions which are square integrable and whose first derivatives are also square integrable over the domain Ω. Put another way, it is the space of functions which are in the more familiar Hilbert space L^2(Ω) of square integrable functions and whose gradient, in magnitude, is also in this space, i.e.,

H^1(Ω) = {u ∈ L^2(Ω) : |∇u| ∈ L^2(Ω)}.   (3.4)
Clearly, H^1(Ω) is a subspace of L^2(Ω), and we define the inner product on this space as,

(u, v)_{H^1(Ω)} = (u, v)_{L^2(Ω)} + (∇u, ∇v)_{L^2(Ω)},   ∀u, v ∈ H^1(Ω),   (3.5)

where (u, v)_{L^2(Ω)} denotes the L^2(Ω) inner product given by,

(u, v)_{L^2(Ω)} = ∫_Ω uv,   ∀u, v ∈ L^2(Ω).   (3.6)

Using this definition of the inner product we have the naturally induced norm,

‖u‖_{H^1(Ω)} = (u, u)^{1/2}_{H^1(Ω)},   ∀u ∈ H^1(Ω).   (3.7)

Many of the error estimates used in FEMs involve the behavior of the solution in the H^1 semi-norm, which we denote by |·|_{H^1} and define as,

|u|_{H^1} = (∇u, ∇u)^{1/2}_{L^2(Ω)},   ∀u ∈ H^1(Ω).   (3.8)
This is referred to as a semi-norm because it fails to satisfy one property that defines a norm, namely that ‖u‖ = 0 if and only if u = 0; any nonzero constant function, for example, has |u|_{H^1} = 0.
Another commonly used space, encountered often in the literature and used extensively in the analysis of FEMs, arises when dealing with homogeneous Dirichlet boundary conditions, i.e., when the solution of the differential equation is specified to be zero on the boundary of the domain over which it is defined. This is the Sobolev space H^1_0(Ω) of all functions which are in H^1(Ω) and which also satisfy the zero boundary condition on Ω's boundary. This is a case where the boundary conditions can be safely imposed directly on the underlying space of test functions; in general this cannot be done, since it would compromise the closure of that space and thus violate the requirement that it be a vector space. Thus,

H^1_0(Ω) = {v ∈ H^1(Ω) : v = 0 on ∂Ω}.   (3.9)
Now that we have the notion of our solution space clearly defined, we can look at how to
develop the weak form of the problem. First, we take an arbitrary function v in H^1(Ω),
called a test function, and multiply our strong problem by it. Then we integrate this
equation over all of Ω, so we have
−∫_Ω ∆u v = ∫_Ω f v, ∀ v ∈ H^1(Ω).  (3.10)
Next, we use integration by parts, or rather Green's theorem, on the term involving the
Laplacian to obtain the full integral or weak form of the problem. By doing so we reduce
the order of the derivatives in the equation from two to one and thus relax the level of
smoothness required of our solution, allowing the method to capture a larger class of
solutions than, for example, the FDM. So the continuous weak problem is to seek u ∈ H^1(Ω)
such that
∫_Ω ∇u · ∇v − ∫_{∂Ω} (∂u/∂n) v = ∫_Ω f v, ∀ v ∈ H^1(Ω).  (3.11)
The immediate question arises as to how well this relates to our original classical problem.
It can be shown that if u satisfies the classical problem in differential form then it will indeed
satisfy the corresponding weak or integral form of the problem. There is a small caveat in
stating the converse though. It can also be shown that the weak solution does satisfy the
classical formulation in the case that u is smooth enough. The details of the proof essentially
boil down to being able to show that the forcing term on the right-hand side, f, is in fact an
L2 (Ω) function.
From the boundary integral term in (3.11) one can see why Neumann boundary conditions
are satisfied naturally by a FEM: ∂u/∂n is specified and thus becomes a contribution
to the right-hand side. For simplicity let us consider the case where g from (3.2) is zero,
i.e. homogeneous Dirichlet boundary conditions. This allows us to impose the boundary
conditions directly on our underlying space of test functions. If v = 0 over ∂Ω then the
boundary integral term, in (3.11), becomes zero. We will thus restrict ourselves to the
problem of finding u ∈ H_0^1(Ω) such that
∫_Ω ∇u · ∇v = ∫_Ω f v, ∀ v ∈ H_0^1(Ω).  (3.12)
Inhomogeneous Dirichlet boundary data can be handled in several ways. One of the easiest
ways to handle this case is to transform the problem into one which does have homogeneous
Dirichlet boundary data, solve this transformed problem and then transform its solution
back to the solution that satisfies the original inhomogeneous conditions. Another approach
is to deal with the boundary integral directly which will add a term to our bilinear form and
additional contributions to the right-hand side of the equation.
For the purposes of analysis the problem is usually cast into a more general and abstract
form. If we let V denote a Hilbert space, A(·, ·) a bilinear form on V × V, and F a linear
functional on V, then the general weak problem we consider is to seek u ∈ V satisfying
A(u, v) = F(v), ∀ v ∈ V.  (3.13)
Many weak formulations encountered in applications can be posed within this general
framework with an appropriate choice of the Hilbert space, the bilinear form, and the linear
functional.
In terms of our example concerning the elliptic problem with homogeneous Dirichlet
boundary conditions, we have as our Hilbert space V = H_0^1(Ω) and the bilinear form
A(u, v) = ∫_Ω ∇u · ∇v, ∀ u, v ∈ H_0^1(Ω)  (3.14)
and for the linear functional we have
F(v) = ∫_Ω f v, ∀ v ∈ H_0^1(Ω).  (3.15)
If F is a bounded linear functional on the given Hilbert space V and the bilinear form
A(·, ·) is bounded, which is equivalent to being continuous on the space V, and furthermore
satisfies a property referred to as coercivity (also called V-ellipticity in the literature),
then the Lax-Milgram theorem guarantees the existence and uniqueness of the solution to
(3.13). For our example problem these properties can be readily demonstrated; see [19].
Moreover, this theorem provides us with a bound of the solution in terms of the data given
in the original problem.
Theorem 3.1.1 (Lax-Milgram Theorem)
Let V be a Hilbert space with norm ‖·‖ and let A(·, ·) : V × V → ℝ be a bilinear form
on V which satisfies
|A(u, v)| ≤ M ‖u‖ ‖v‖, ∀ u, v ∈ V  (3.16)
and
A(u, u) ≥ m ‖u‖², ∀ u ∈ V,  (3.17)
where M and m are positive constants independent of u, v ∈ V. Let F : V → ℝ be a bounded
linear functional on V. Then there exists a unique u ∈ V satisfying (3.13). Moreover,
‖u‖ ≤ (1/m) ‖F‖.  (3.18)
Once we have done the appropriate analysis to convince ourselves that we are working with
a problem that bears a unique solution, and have some a-priori error bounds in hand, we
can begin to think about how to compute such a solution.
The first computational issue that arises immediately is that our general problem was posed
over an infinite dimensional Hilbert space but for computational purposes we will always
be restricted to having to work within finite dimensional spaces. So, it is then that our
first task in making this a viable computational method is in choosing an appropriate finite
dimensional subspace of our underlying infinite dimensional Hilbert space V. It is, in fact, the
choice of this finite dimensional subspace which will either make or break the computational
efficiency of our FEM.
Since most of the computational effort in running a FEM simulation is usually dominated
by solving linear systems of equations, it is extremely important to construct the method
so that the resulting linear systems are well structured, i.e. sparse or banded, for
example. The key to achieving nicely structured matrices is one's choice of the basis
functions used to construct the finite dimensional subspace of the problem's underlying
infinite dimensional Hilbert space V: one seeks basis functions with small, compact, local
support. A popular way to achieve this is to utilize continuous piecewise polynomials
constructed over the discretized computational domain Ω. Many applications involve the use
of piecewise linears or piecewise
quadratics. The choice of what type of polynomials to use relates to one’s desired degree of
regularity and accuracy obtained in the approximation (we will see this illustrated clearly in
what follows).
Once we have chosen an appropriate subspace S^h ⊂ V (which is the case for conforming
FEMs), we have a nice theorem that tells us about the existence and uniqueness of our
solution in this new finite dimensional space and gives us an estimate of the deviation of
our finite dimensional approximation from the infinite dimensional one.
Theorem 3.1.2 (Galerkin's or Cea's Lemma)
Let V be a Hilbert space with norm ‖·‖ and let A(·, ·) : V × V → ℝ be a bilinear form
on V satisfying (3.16) and (3.17), and let F(·) be a bounded linear functional on V. Let u
be the unique solution of
A(u, v) = F(v), ∀ v ∈ V
guaranteed by the Lax-Milgram theorem. Let {S^h}, 0 < h < 1, be a family of finite
dimensional subspaces of V. Then for every h there exists a unique u^h ∈ S^h such that
A(u^h, v^h) = F(v^h), ∀ v^h ∈ S^h  (3.19)
and moreover,
‖u − u^h‖ ≤ (M/m) inf_{χ^h ∈ S^h} ‖u − χ^h‖,  (3.20)
where M,m are the constants appearing in the Lax-Milgram theorem and '·' denotes the
norm on V.
The bound provided in this theorem may not seem immediately useful, but if we open a
text on approximation theory, we shall find that if we let I^h u be the S^h interpolant of
u we have
inf_{χ^h ∈ S^h} ‖u − χ^h‖ ≤ ‖u − I^h u‖ ≤ C h^r,  (3.21)
where h is the maximum step size in our discretization, C is a constant independent
of h, and r is a constant determined by the degree of the polynomial interpolant used as a
basis for S^h. For example, if we use piecewise linear polynomials we have
|u − I^h u|_{H^1} ≤ C₁ h ‖u‖_{H^2}, provided u ∈ H^2(Ω)  (3.22)
‖u − I^h u‖_{L^2} ≤ C₂ h² ‖u‖_{H^2}, provided u ∈ H^2(Ω)  (3.23)
and for the case of piecewise quadratic polynomials we have
|u − I^h u|_{H^1} ≤ C₃ h² ‖u‖_{H^3}, provided u ∈ H^3(Ω)  (3.24)
‖u − I^h u‖_{L^2} ≤ C₄ h³ ‖u‖_{H^3}, provided u ∈ H^3(Ω)  (3.25)
and thus we are now equipped with some extremely useful a-priori error estimates on
the results of approximating our solution by these choices of basis functions in the finite
dimensional space S h .
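As a quick numerical illustration of the L^2 estimate (3.23) (a small sketch, not from the thesis; the smooth test function u = sin(πx) and the Riemann-sum quadrature are assumptions made here for simplicity), one can interpolate with piecewise linears and watch the error drop by a factor of about four each time h is halved:

```python
import numpy as np

def interp_error_L2(u, n):
    """L2 error of the piecewise linear interpolant of u on [0,1] with n cells."""
    nodes = np.linspace(0.0, 1.0, n + 1)
    x = np.linspace(0.0, 1.0, 40 * n + 1)        # fine quadrature grid
    err = u(x) - np.interp(x, nodes, u(nodes))   # u minus its interpolant I_h u
    dx = x[1] - x[0]
    return np.sqrt(np.sum(err**2) * dx)          # simple Riemann-sum L2 norm

u = lambda x: np.sin(np.pi * x)                  # smooth, certainly in H^2
errors = [interp_error_L2(u, n) for n in (8, 16, 32, 64)]
rates = [np.log2(e1 / e2) for e1, e2 in zip(errors, errors[1:])]
print(rates)   # each observed rate is close to 2, matching the h^2 estimate
```

The observed rates approach 2, consistent with the h² bound for piecewise linears; repeating the experiment in the H^1 semi-norm would show first order, as in (3.22).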
Now that we have a reasonable notion of what kinds of subspaces may be used and how
a set of basis functions can be constructed, we may introduce the fully discrete form of
the weak problem, which is the actual problem to be solved computationally. First let us
recall that having a set of basis functions for the subspace S_0^h implies that every
function within that space can be represented as a linear combination of those basis
functions. In particular, if we let {φ_i}_{i=1}^N ⊂ S_0^h be a nodal basis for S^h, where
N = dim(S^h) is the dimension of S^h, then we have
u^h = Σ_{j=1}^N c_j φ_j, u^h ∈ S^h.  (3.26)
Taking our test functions v to be the nodal basis functions themselves, we have as the
discrete weak problem to seek u^h ∈ S^h such that
∫_Ω ∇u^h · ∇φ_i = ∫_Ω f φ_i, ∀ φ_i ∈ S^h.  (3.27)
If we make use of the series expansion of u^h in (3.26), we have for the fully discrete
weak problem
Σ_{j=1}^N c_j ∫_Ω ∇φ_j · ∇φ_i = ∫_Ω f φ_i, i = 1, . . . , N.  (3.28)
It is these equations that produce the linear system of equations to be solved. In general
the system is often written in the form
M ċ + K c = F,  (3.29)
where M is referred to as the mass matrix and K the stiffness matrix. In our (time
independent) example the mass matrix is
M_ij = 0,  (3.30)
the stiffness matrix is
K_ij = ∫_Ω ∇φ_j · ∇φ_i,  (3.31)
and the right-hand side vector is
F_i = ∫_Ω f φ_i.  (3.32)
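To make the assembly of (3.28) concrete, here is a minimal sketch (not the thesis code; a 1D model problem, midpoint quadrature, and the manufactured solution sin(πx) are all assumptions made for illustration) that builds the piecewise linear stiffness matrix and load vector for −u″ = f on [0,1] with homogeneous Dirichlet data and solves K c = F:

```python
import numpy as np

def assemble_1d(n, f):
    """Assemble K_ij = ∫ φ_j′ φ_i′ and F_i = ∫ f φ_i for piecewise linears."""
    h = 1.0 / n
    nodes = np.linspace(0.0, 1.0, n + 1)
    K = np.zeros((n + 1, n + 1))
    F = np.zeros(n + 1)
    for e in range(n):                               # element [x_e, x_{e+1}]
        K[e:e + 2, e:e + 2] += np.array([[1.0, -1.0],
                                         [-1.0, 1.0]]) / h
        xm = 0.5 * (nodes[e] + nodes[e + 1])         # midpoint quadrature for F
        F[e:e + 2] += 0.5 * h * f(xm)                # each hat function is 1/2 at xm
    # homogeneous Dirichlet data: drop the boundary rows and columns
    return K[1:-1, 1:-1], F[1:-1], nodes[1:-1]

# -u'' = pi^2 sin(pi x) has exact solution u = sin(pi x)
K, F, x = assemble_1d(64, lambda t: np.pi**2 * np.sin(np.pi * t))
c = np.linalg.solve(K, F)
print(np.max(np.abs(c - np.sin(np.pi * x))))   # small nodal error
```

Note the banded (here tridiagonal) structure of K produced by the compactly supported hat functions, exactly the structural benefit discussed above.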
In a time dependent FEM implementation, the computational cost lies in solving such
systems of equations at each time step. This is our motivation for turning to the parareal
algorithm: the hope is that it allows us to divide this computational work in parallel
over multiple processes so as to solve these problems more quickly, in real time, and
further to approach larger, more complex, and thus more realistic and practical sized
problems, obtaining our results in a much more reasonable time frame than previously
available.
In practice one commonly discretizes the spatial domain as described here and then uses
a finite difference approximation for the discretization of the temporal domain. For the
details of the mathematical analysis of a time dependent FEM the reader is referred to [19].
It is enough to understand what we mean when we talk about the FEM errors, which was the
major point of the linear analysis, that we use to show the convergence of the parallel
implementation of the parareal algorithm, i.e. that the parallel implementation can produce
the same results as a sequential version of the numerical scheme when applied to time
dependent nonlinear PDE's. Similar results can be shown for nonlinear problems as well,
but this requires quite a bit more background and results from functional analysis. So, we
present just the linear analysis here, simply to provide a flavor of the results and to
give the reader a basic understanding of how the error estimates are obtained.
3.2 The Finite Element Method and the Parareal Algorithm for Nonlinear PDE's
In the previous section we looked at a linear analysis to get a sense of what the error
calculations mean and what norms are to be used in measuring the spatial error. It is our
primary interest to use the parareal algorithm for large, complex nonlinear systems of time
dependent differential equations.
In this section we describe how we have implemented the parareal algorithm in conjunction with the FEM to solve time dependent nonlinear partial differential equations. First,
let us recall an overview of the basic algorithm. We introduce the notation, C, to denote
the coarse grid solve; F, to denote the fine grid solves; and δ to denote the correction solves.
This is fairly consistent with what is commonly found in the literature.
• Step (i) Decompose the time interval [0, T ] into N coarse time intervals [Tn , Tn+1 ],
n = 0, . . . , N − 1 of length ∆T and solve the discretized problem sequentially in time; denote
the solution at each Tn by, Cn , n = 0, . . . , N where C0 denotes the initial condition of the
problem;
• Step (ii) decompose each coarse interval [Tn , Tn+1 ] n = 0, . . . , N −1 into M subintervals
of length ∆t; in parallel solve the discretized problem over each subinterval using Ci ,
i = 0, . . . , N − 1 as the initial condition at Ti ; denote this solution at the points Ti ,
i = 1, . . . , N by, Fi ;
• Step (iii) define the errors or defects at the points Tn , n = 1, . . . , N , by
Sn = Fn − Cn ; solve the coarse grid problem for the correction to Fn where the right-hand
side of the equation is the jump Sn /∆T ; call these corrections δn , n = 1, . . . , N ;
• Step (iv) set Cn = Fn + δn ; return to step (ii) if satisfactory convergence has not been
achieved.
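The steps above can be sketched on a scalar model ODE (a simplification assumed purely for illustration; the thesis applies the same structure to FEM systems). Here both propagators are backward Euler, the fine one taking M substeps, and steps (iii)-(iv) are folded into the standard predictor-corrector update, which is algebraically equivalent to adding the coarse correction δn when the coarse solve is linear:

```python
import numpy as np

lam, T, N, M = -1.0, 1.0, 10, 20          # model problem u' = lam*u, u(0) = 1
dT = T / N                                # coarse step over [T_n, T_{n+1}]

def coarse(u):                            # C: one backward Euler step of size dT
    return u / (1.0 - lam * dT)

def fine(u):                              # F: M backward Euler steps of size dt
    dt = dT / M
    for _ in range(M):
        u = u / (1.0 - lam * dt)
    return u

U = np.empty(N + 1)
U[0] = 1.0
for n in range(N):                        # step (i): sequential coarse solve
    U[n + 1] = coarse(U[n])

for k in range(5):                        # iterate steps (ii)-(iv)
    Fv = [fine(U[n]) for n in range(N)]   # step (ii): independent, parallelizable
    Unew = np.empty(N + 1)
    Unew[0] = 1.0
    for n in range(N):                    # steps (iii)-(iv): sequential correction
        Unew[n + 1] = coarse(Unew[n]) + Fv[n] - coarse(U[n])
    U = Unew

useq = 1.0
for n in range(N):                        # serial fine solution for comparison
    useq = fine(useq)
print(abs(U[-1] - useq))                  # tiny: parareal matches the fine solve
```

Only the list comprehension computing `Fv` would be distributed across processors in a real implementation; the coarse sweeps remain serial, which is why keeping them cheap (and linear) matters.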
In our implementation of this algorithm we are careful to arrange it such that the
sequential solves are linear, and thus the full nonlinear solves are done only in parallel.
We achieve this in two ways. First, when we do our initial coarse solve for Cn we simply
lag the nonlinearity in the equation, making it a known term so that the equation reduces
to a linear solve. Second, when we solve for the correction term δn we solve a linearized
version of the differential equation.
To illustrate an implementation of the algorithm, consider an initial boundary value
problem for a nonlinear parabolic equation where we use a FEM for the spatial discretization
and a backward Euler approximation for the temporal domain. Specifically, consider the
problem,
u_t − ∆u + f(u) = g(x, t), (x, t) ∈ Ω × (0, T]  (3.33)
u(x, 0) = u_0(x), x ∈ Ω
u(x, t) = 0, (x, t) ∈ ∂Ω × (0, T].  (3.34)
• Step (i) For n = 1, . . . , N , solve the linear problem with φ ∈ H_0^1:
∫_Ω ((C_n^0 − C_{n−1}^0)/∆T) φ + ∫_Ω ∇C_n^0 · ∇φ = ∫_Ω g(x, T_n) φ − ∫_Ω f(C_{n−1}^0) φ,
where C_0^0 = u_0.
For k = 0, 1, . . . perform the following steps until satisfactory convergence is achieved.
• Step (ii) for each interval [T_n, T_{n+1}], n = 0, . . . , N − 1, either set
∆t = (T_{n+1} − T_n)/M with a fixed M, or fix ∆t instead and set M = (T_{n+1} − T_n)/∆t;
set F_{n,0}^k = C_n^k and for m = 1, . . . , M , solve in parallel the nonlinear problems
over each subdomain,
∫_Ω ((F_{n,m}^k − F_{n,m−1}^k)/∆t) φ + ∫_Ω ∇F_{n,m}^k · ∇φ + ∫_Ω f(F_{n,m}^k) φ = ∫_Ω g(x, t_m) φ;
denote this solution at each point T_n, n = 1, . . . , N , by F_n^k;
• Step (iii) for n = 0, . . . , N − 1, define S_n^k = F_n^k − C_n^k, set δ_0^k = 0, and
for n = 1, . . . , N solve the linear problem
∫_Ω ((δ_n^k − δ_{n−1}^k)/∆T) φ + ∫_Ω ∇δ_n^k · ∇φ + ∫_Ω f′(F_n^k) δ_n^k φ = ∫_Ω (S_{n−1}^k/∆T) φ;
• Step (iv) set
C_n^{k+1} = F_n^k + δ_n^k, n = 1, . . . , N.
Note the subtle distinction in our choice of how we implement the nonlinear solves in parallel.
In one case we can choose the number of steps M to be taken within each of the subintervals
[Tn , Tn+1 ] to be fixed, which then allows ∆t to vary with P . Another option is to instead fix
the underlying fine grid resolution by setting ∆t to be fixed. In this case, M will vary with
the number of processors P . This subtle choice in implementation has a significant effect on
how this algorithm scales as we increase P . We explore these implications in detail within
our chapter on performance analysis.
It is important to keep in mind that sequentially we are solving linear problems only
by lagging the nonlinear terms or solving a linearized version of the equation. It is only in
parallel, over the fine grid, where we solve the fully nonlinear equations using a nonlinear
solver such as Newton’s method or any of its variants.
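As a stand-in for those nonlinear fine solves (a scalar sketch with assumed notation; the thesis solves full FEM systems), one backward Euler step for u′ + f(u) = g amounts to a root-find that Newton's method handles in a few iterations:

```python
def newton_be_step(u_prev, dt, f, fprime, g, tol=1e-12, maxit=20):
    """One backward Euler step: solve R(u) = (u - u_prev)/dt + f(u) - g = 0."""
    u = u_prev                              # initial guess: lag the solution
    for _ in range(maxit):
        R = (u - u_prev) / dt + f(u) - g    # residual of the implicit step
        J = 1.0 / dt + fprime(u)            # its derivative with respect to u
        du = -R / J                         # Newton update
        u += du
        if abs(du) < tol:
            break
    return u

# example: u' + u^2 = 0, one step of size 0.1 from u = 1
u1 = newton_be_step(1.0, 0.1, lambda u: u * u, lambda u: 2 * u, 0.0)
print(u1)   # satisfies (u1 - 1)/0.1 + u1^2 = 0 up to the tolerance
```

For a PDE, `u` becomes the vector of nodal coefficients and `J` the Jacobian matrix (mass matrix over ∆t plus the linearized stiffness), so each Newton iteration is itself a linear solve.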
Performing all of the necessary sequential solves linearly is key to keeping the execution
time of the sequential portions of our algorithm at a minimum. In the parareal algorithm
the sequential computation time depends on the number of processors: since we typically
take the number of coarse intervals N equal to the number of processors P we are employing,
by letting ∆T = T/P, the initial sequential coarse solve and the one coarse solve done per
further iteration for the correction term δ_n^k become more costly, in terms of
computational effort, as we scale this algorithm to larger P.
CHAPTER 4
Combining the Parareal Algorithm and Reduced Order Modeling
The decrease in computational time achieved by the parareal algorithm alone is oftentimes
still not sufficient to perform real time calculations. Model reduction, or reduced order
modeling (ROM), techniques have been shown to be effective in reducing computational costs
and therefore compute time. In this chapter we couple ROM techniques with the parallel in
time integration method to achieve a further reduction in compute time, thus bringing us
closer to achieving true real time calculations. In what follows we first give a brief
overview of the ROM methods we used, which are the ones based on proper orthogonal
decomposition (POD), and then describe our implementation of this ROM technique along with
the parareal algorithm.
4.1 Reduced Order Modeling with Proper Orthogonal Decompositions
4.1.1 Main Idea
In model reduction the goal is to intelligently sample over pre-computed snapshots of the
state equations to extract important samples that illustrate the dynamics of the problem,
and then to use these to construct a reduced basis {ψ_i(x)}_{i=1}^d for the reduced state
space. If the number of vectors in this reduced basis, d, is much smaller than the
dimension n of the state space, then the effect is that one ends up solving a d × d dense
linear system instead of an n × n sparse linear system, where d ≪ n, over each time step,
thus reducing the computational costs significantly. Once the reduced basis is constructed
one seeks an approximation u_rom(x, t) to the state u of the form
u_rom(x, t) = Σ_{j=1}^d a_j(t) ψ_j(x) ∈ W ≡ span{ψ_1, . . . , ψ_d}.  (4.1)
Then one determines the coefficients aj , j = 1, . . . , d, by solving the equation in the set W ,
e.g. one could find a Galerkin solution of the equation in a standard way, using W for the
finite element space of approximations.
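The Galerkin determination of the coefficients a_j can be sketched for a generic linear system (the matrices below are hypothetical stand-ins, not from the thesis): projecting an n × n operator onto the reduced basis yields the small dense system actually solved at each step.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 200, 5
# a hypothetical sparse-structured (tridiagonal) full-order operator and load
A = (np.diag(2.0 * np.ones(n))
     + np.diag(-np.ones(n - 1), 1)
     + np.diag(-np.ones(n - 1), -1))
b = rng.standard_normal(n)
Psi, _ = np.linalg.qr(rng.standard_normal((n, d)))  # orthonormal reduced basis

Ar = Psi.T @ A @ Psi          # d x d reduced (dense) operator
br = Psi.T @ b                # reduced load
a = np.linalg.solve(Ar, br)   # cheap d x d solve, done once per time step
u_rom = Psi @ a               # reduced-order approximation in the full space
```

The Galerkin property shows up as residual orthogonality: `Psi.T @ (b - A @ u_rom)` vanishes, i.e. the residual is orthogonal to W.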
Here we use the approach where the reduced basis is generated using proper orthogonal
decomposition (POD) for a set of snapshots obtained by computing approximations of the
solution of the differential equation using a sampling of parameter values over the computed
time steps.
4.1.2 Generating the Snapshot Set
One of the key steps in achieving a good reduced model is in being able to generate good
snapshot sets of the state equations that capture the dynamics of the problem that one is
interested in investigating. A reduced model, in fact, is only as good as the snapshot set
that one samples from; for if the dynamics of interest are not in the snapshot set then they
most certainly will not be present in the ROM solution to the system. Unfortunately, at
present, the generation of snapshot sets is not a science but still much more of an art and
many ad hoc approaches are used in a large variety of situations.
Most snapshot sets will likely contain large amounts of redundant information with
respect to the dynamics of interest and so any a priori knowledge of the system in question
can and should be used to close in on these interests in an attempt to reduce, as much as
possible, the amount of redundant information present in the snapshot set. For example,
one may have knowledge of certain bounds on allowable parameter values, there may be
known constraints on the parameters or even the independent variables themselves, one
may have knowledge of correlations between parameters and variables, or one may know how
the parameters are distributed in terms of a probability distribution function, for
example. These are the types of a priori information that should be considered and used
when setting out to generate snapshot sets of one's state simulations.
Regardless of the details of how the snapshot sets are generated, almost every approach
will involve computing the full, high dimensional state solves (or adjoint state solves,
in the case of control or optimization). The hope is that the computational cost of a
single, or very few, full high fidelity system simulations, performed to provide adequate
snapshot sets, can be amortized by the ability to perform many more reduced order
computations, as in real time design, or in instances where emergencies arise and actual
lives depend on the capability of real time computation.
In practice the steps involving snapshot generation are considered an offline,
preprocessing step, the idea again being that once a good snapshot set is in hand, many
ROM simulations can be performed. Thus, when the gains achieved by ROM methods, such as
speedup or reduced computational complexity, are reported, the steps involving snapshot
generation are not included as part of the computational costs of the method being
discussed.
Many approaches have been taken to obtain good snapshot sets. In this work and within
the experiments reported we use the rather direct approach of computing full, high fidelity
state solutions to the differential equations using a sampling of the parameter values involved
in the system over a sampling of time steps.
4.1.3 Proper Orthogonal Decomposition
Now, we assume that we have a well-generated snapshot set that captures our dynamics of
interest within the system we are studying. The question now becomes that of determining a
reduced basis for our state space based on the snapshot set we have available to us. We now
describe the method of proper orthogonal decompositions (POD) for achieving this goal.
Let {a_k}_{k=1}^K denote the set of points in parameter space that are chosen for
generating the snapshots. Let ∆t denote the time-sampling interval, which is usually some
multiple of the actual time step used to discretize the state system, perhaps even that
same step itself. Although there is no need for this sampling step to be uniform, we
consider the uniform case in this introduction for simplicity. Let l∆t, l = 1, . . . , L,
denote the corresponding sampling times. Let S̃_{k,l} denote the solution (e.g., a vector
of nodal values) of the discretized (e.g., by a high-dimensional finite element method)
state system corresponding to the parameter point a_k sampled at time l∆t. The snapshot
set could consist of the N = KL vectors
S_i = S̃_{k,l}, k = 1, . . . , K, l = 1, . . . , L, i = (k − 1)L + l.  (4.2)
Given N snapshots S_j ∈ ℝ^M, let S denote the M × N snapshot matrix whose columns are the
snapshots, i.e.
S = (S_1, S_2, . . . , S_N).  (4.3)
Let S = UΣV^T denote the singular value decomposition (SVD) of the snapshot matrix S. The
POD basis vectors ψ_i ∈ ℝ^M, i = 1, . . . , N, are the first N left singular vectors of
the snapshot matrix S, i.e.
ψ_i = U_i, for i = 1, . . . , N.  (4.4)
The d-dimensional POD-reduced basis (d < N) consists of the first d left singular vectors
of the snapshot matrix S, i.e.
ψ_i = U_i, for i = 1, . . . , d.  (4.5)
By construction the POD basis is an orthonormal basis, i.e.
ψ_i^T ψ_j = 0 for i ≠ j and ψ_i^T ψ_i = 1.  (4.6)
It can be shown that the energy error of the d-dimensional POD subspace is given by the
sum of the squares of the singular values not used for the reduced POD basis, i.e.
ε_pod = Σ_{j=d+1}^N σ_j²,  (4.7)
where N is the number of snapshots in the snapshot set and d is the dimension of the POD
subspace. If one wishes the relative error to be less than a prescribed tolerance δ, i.e.
if one wants
ε_pod ≤ δ Σ_{j=1}^N |S_j|²,  (4.8)
then one should choose the smallest integer d such that
(Σ_{j=1}^d σ_j²) / (Σ_{j=1}^N σ_j²) ≥ γ = 1 − δ.  (4.9)
In this way we have an approach to guide us in choosing the number, d, of POD vectors
to be used in our reduced basis. In general, we can look at when the decay of the singular
values becomes rapid enough that the inclusion of more singular vectors from the snapshot
matrix becomes redundant in the sense that they are no longer helping us to capture more
information than what we already have included in the reduced basis. In the other case,
when we know a priori what sort of tolerances we need to meet we can again use the singular
values of the snapshot matrix to help guide us in how many POD vectors we need in our
reduced basis.
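Criterion (4.9) is straightforward to apply once the SVD is in hand. The sketch below (synthetic snapshot data and the tolerance are assumptions made here for illustration) builds a noisy rank-3 snapshot matrix and recovers d = 3:

```python
import numpy as np

rng = np.random.default_rng(1)
# synthetic snapshots: exact rank 3 plus a little noise
S = rng.standard_normal((500, 3)) @ rng.standard_normal((3, 40))
S += 1e-6 * rng.standard_normal(S.shape)

U, sigma, Vt = np.linalg.svd(S, full_matrices=False)   # sigma is descending
energy = np.cumsum(sigma**2) / np.sum(sigma**2)        # ratio in (4.9) for each d
delta = 1e-8
d = int(np.searchsorted(energy, 1.0 - delta) + 1)      # smallest d with ratio >= 1 - delta
Psi = U[:, :d]                                         # d-dimensional POD basis (4.5)
print(d)   # 3: three POD vectors capture the rank-3 dynamics
```

This mirrors the guidance above: the sharp drop in the singular values after σ₃ signals that further POD vectors add only redundant (here, noise-level) information.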
4.2 Reduced Order Modeling and the Parareal Algorithm
Although much work has been done independently on both the parareal algorithm and model
reduction, to date no one has looked into combining these two real time computing methods
to investigate their combined performance capabilities. This section presents the work
done on combining the parareal algorithm and reduced order modeling for the real time
computation of time dependent nonlinear partial differential equations.
The ROM techniques used have a purely spatial effect, in that they drastically reduce the
size of the linear system or systems solved at each time step of the temporal evolution
scheme; the system being solved arises from the spatial discretization at a fixed time
step. So ROM by itself already provides a more efficient way of computing time evolution
equations. Traditionally, though, the numerical temporal evolution schemes have all been
sequential in nature and have thus been viewed as offering only limited opportunity for
performance gains.
In the following chapters on numerical experiments and performance analysis, we shall
demonstrate that one can observe significant speedup in the high fidelity FEM simulations
with the parareal implementation alone, as desired. It is well documented in the
literature that significant speedup can be achieved using ROM instead of the high fidelity
FEM simulations; see [7], [8], [9], and [10]. We view the application of the parareal
algorithm in the ROM setting as a way to exploit performance gains that were inherent, but
overlooked, in the choice of how the numerical schemes are implemented in the temporal
domain. We will see that the performance gains, such as speedup and scalability, of our
ROM implementations follow the performance trends of the full FEMs very closely.
In combining the ideas of the parareal algorithm and ROM, we simply use ROM for both the
coarse and fine grid calculations in space. We will see, in the next chapter, that the
parareal algorithm behaves as expected when applied in the ROM setting.
In this work we consider the problem of computing POD-based ROM solutions to nonlinear
PDE's with multiple parameters on the boundary. In general, suppose we seek an
approximation u_pod(t_n, x) to the solution u(t, x) of a nonlinear partial differential
equation defined in a domain Ω with boundary Γ and evaluated at the time t_n. The general
form of the boundary conditions considered is given by
u(t, x) = β_k(t) g_k(x) on Γ_k, k = 1, . . . , K  (4.10)
and
u(t, x) = 0 on Γ − ∪_{k=1}^K Γ_k,  (4.11)
where Γ_k ∩ Γ_l = ∅ if k ≠ l, and ∪_{k=1}^K Γ_k may be a portion of the boundary Γ or the
entire boundary. The functions {g_k(x)}_{k=1}^K are assumed to be given, so that there are
K (time dependent) parameters {β_k}_{k=1}^K that serve to specify the problem.
Having time dependent parameters over the boundary complicates the ROM process; for
more details on how this is handled see [8] and [10].
We consider two examples in this work.
4.2.1 4-Parameter ROM Problem
We started with a reaction diffusion problem with multiple parameters. The problem is to
solve
u_t − ∆u + u² = 0, (x, t) ∈ Ω × (0, T]  (4.12)
u(x, 0) = 0, x ∈ Ω,  (4.13)
where we take Ω = [0, 1] × [0, 1] and T = 1, with the following boundary parameters:
y = 1: u = 4x(1 − x)β₁, where β₁ = 2t if t < 0.5 and β₁ = 2(1 − t) if t ≥ 0.5,  (4.14)
y = 0: u = 4x(1 − x)β₂, where β₂ = 4t(1 − t),  (4.15)
x = 0: u = 4y(1 − y)β₃, where β₃ = |sin(2πt)|,  (4.16)
x = 1: u = 4y(1 − y)β₄, where β₄ = |sin(4πt)|.  (4.17)
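The four time-dependent boundary parameters are simple enough to transcribe directly from (4.14)-(4.17) (plain Python, matching the formulas above):

```python
import math

def beta1(t):  # hat function peaking at t = 0.5, from (4.14)
    return 2 * t if t < 0.5 else 2 * (1 - t)

def beta2(t):  # parabola vanishing at t = 0 and t = 1, from (4.15)
    return 4 * t * (1 - t)

def beta3(t):  # from (4.16)
    return abs(math.sin(2 * math.pi * t))

def beta4(t):  # from (4.17)
    return abs(math.sin(4 * math.pi * t))

print(beta1(0.5), beta2(0.5), beta3(0.25), beta4(0.125))  # 1.0 1.0 1.0 1.0
```

Each parameter varies on a different time scale, which is what makes the four-dimensional parameter sampling for the snapshot set nontrivial.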
For the snapshot generation we sampled points in the four-dimensional parameter space,
then solved the full finite element model with h = 0.1 for the equation by impulsively
jumping between the sampled parameters; snapshots were generated from the solution at
various time intervals for each choice of parameters. A total of 300 snapshots were
generated and a POD technique was used to determine the basis vectors. Satisfying the
inhomogeneous Dirichlet boundary data requires extra work and was handled by generating
basis vectors which satisfy general inhomogeneous boundary data; again see [8] for more
details.
4.2.2 ROM - Navier-Stokes Equations with 6 Boundary Parameters
The next problem we looked at was one for which a ROM method had just been developed, for
a particular application that was much more complicated and thus more practical. Our
interest was to verify that we could observe the same performance trends as in both the
first, simpler ROM implementation and the FEM implementations.
In this problem a gas, assumed toxic, has been released in a building, and the question is
how to configure the ventilation system, with its inflow/outflow orifices, so as to clear
the building of the toxin as quickly as possible. We model the flow of the gas through the
building with the time dependent incompressible Navier-Stokes equations over an H-shaped
domain, emulating a particular portion of the building, with multiple boundary parameters
emulating the options for the inlet/outlet orifices.
Formally, the flow problem is stated as
∂u/∂t − ν∆u + u · ∇u + ∇p = 0 in Ω × (0, T]  (4.18)
∇ · u = 0 in Ω × (0, T]
u(x, 0) = u_0 in Ω
for the velocity u and the pressure p; here the Reynolds number 1/ν is chosen to be 100.
The physical domain for this problem is illustrated in Figure 4.1; along the boundary of
this flow domain, one should note the six sets of inlet/outlet orifices Γ_i,
i = 1, . . . , 6, and a main
outlet orifice. The remainder of the flow domain's boundary is a solid wall. We enforce a
zero stress outflow boundary condition at the main outlet orifice, as indicated in
Figure 4.1, and homogeneous zero velocity boundary conditions along the solid portions of
the wall.
Figure 4.1: The H-cell domain of the building ventilation problem
At the six sets of inlet/outlet orifices Γ_i, i = 1, . . . , 6, we impose the following
boundary conditions:
Γ_1 (inlets): x_1 = 0, 8 ≤ x_2 ≤ 9: u = .48β_1(x_2 − 8)(9 − x_2); a_i ≤ x_1 ≤ b_i, x_2 = 6: v = .48β_1(x_1 − a_i)(b_i − x_1)
Γ_2 (inlets): x_1 = 105, 8 ≤ x_2 ≤ 9: u = −.5β_2(x_2 − 8)(9 − x_2); c_i ≤ x_1 ≤ d_i, x_2 = 6: v = .5β_2(x_1 − c_i)(d_i − x_1)
Γ_3 (inlets): x_1 = 0, 2 ≤ x_2 ≤ 3: u = .44β_3(x_2 − 2)(3 − x_2); a_i ≤ x_1 ≤ b_i, x_2 = 5: v = −.44β_3(x_1 − a_i)(b_i − x_1)
Γ_4 (inlets): x_1 = 105, 2 ≤ x_2 ≤ 3: u = −.352β_4(x_2 − 2)(3 − x_2); c_i ≤ x_1 ≤ d_i, x_2 = 6: v = −.352β_4(x_1 − c_i)(d_i − x_1)
Γ_5 (outlets): a_i ≤ x_1 ≤ b_i, x_2 = 11: u = .612β_5(x_1 − a_i)(b_i − x_1)
Γ_6 (outlets): c_i ≤ x_1 ≤ d_i, x_2 = 11: u = .896β_6(x_1 − c_i)(d_i − x_1),
where (a_i, b_i) ∈ {(10, 11), (22, 23), (34, 35)} and (c_i, d_i) ∈ {(70, 71), (82, 83), (94, 95)},
i = 1, 2, 3.
Approximate solutions of the Navier-Stokes equations are obtained using the standard
Taylor-Hood finite element method for the spatial discretization: continuous piecewise
linear functions on triangles are used to approximate the pressure, and continuous
piecewise quadratic functions on the same triangles are used to approximate the components
of the velocity. The backward Euler approximation is used for the temporal discretization.
A uniform grid consisting of 8,520 triangles is used, resulting in 35,730 unknowns, and a
uniform time step is also used in the full high fidelity FEM approximation.
The full high-fidelity FEM approximation is then used to generate our snapshot set, which in turn is used to generate the POD bases. Again, having multiple parameters on the boundary complicates the ROM procedure; see [10] for the full details on how this is achieved. In [10] it is shown that the 35,730 unknowns in the full FEM solution can be reduced to 14 unknowns while still producing a very accurate approximation to the full finite element solution.
On FSU's IBM SP3 supercomputer this simulation ran in about fifty minutes with the ROM method. When we implemented the parareal algorithm we obtained speedup factors of up to six with the problem parameters we used. Thus, a potentially life-saving simulation that had taken close to an hour to compute could then be computed in under ten minutes, making it much more practical for real time use.
4.3 Implementation
Assuming that one has produced a snapshot set and then generated a reduced basis from the snapshots, {ψi (x)}, i = 1, . . . , d, of cardinality d, we apply this basis as a direct substitute for the finite dimensional finite element basis functions {φi (x)}, i = 1, . . . , d. We apply the same algorithm as in the FEM case but with the ROM bases instead; i.e., letting C denote the coarse grid solve, F the fine grid solves, and δ the correction solves:
• Step (i) Decompose the time interval [0, T ] into N coarse time intervals [Tn , Tn+1 ], n = 0, . . . , N − 1, of length ∆T and solve the differential equation sequentially in time; denote the solution at each Tn by Cn , n = 0, . . . , N , where C0 denotes the initial condition of the problem;
• Step (ii) decompose each coarse interval [Tn , Tn+1 ], n = 0, . . . , N − 1, into M subintervals of length ∆t; in parallel, solve the differential equation over each subinterval using Ci , i = 0, . . . , N − 1, as the initial condition at Ti ; denote this solution at the points Ti , i = 1, . . . , N , by Fi ;
• Step (iii) define the errors, or defects, at the points Tn , n = 0, . . . , N − 1, by Sn = Fn − Cn ; solve the coarse grid problem for the correction to Fn where the right-hand side of the equation is the jump Sn ; call these corrections δn , n = 0, . . . , N − 1;
• Step (iv) set Cn = Fn + δn ; return to Step (ii) if satisfactory convergence has not been achieved.
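The four steps can be sketched in code. The following is a minimal serial illustration for a scalar ODE y' = f(y), using the standard predictor-corrector form of parareal with forward Euler for both propagators, purely for brevity; the thesis uses a defect-correction variant with explicit δ solves, and all names here are ours, not the thesis code.

```python
def parareal(f, y0, T, N, M, K):
    """Minimal serial sketch of the parareal iteration for y' = f(y).

    N coarse intervals, M fine steps per interval, K correction sweeps.
    The fine solves in the inner loop are the part that parareal
    distributes across processors.
    """
    dT = T / N
    dt = dT / M

    def coarse(y):            # one coarse step over [Tn, Tn+1]
        return y + dT * f(y)

    def fine(y):              # M fine steps over [Tn, Tn+1]
        for _ in range(M):
            y = y + dt * f(y)
        return y

    # Step (i): sequential coarse sweep C_0, ..., C_N
    C = [y0]
    for n in range(N):
        C.append(coarse(C[n]))

    for _ in range(K):
        # Step (ii): fine solves from current coarse values (parallelizable)
        F = [y0] + [fine(C[n]) for n in range(N)]
        # Steps (iii)-(iv): sequential correction sweep
        Cnew = [y0]
        for n in range(N):
            Cnew.append(coarse(Cnew[n]) + F[n + 1] - coarse(C[n]))
        C = Cnew
    return C
```

For y' = −y on [0, 1], two correction sweeps already track the serial fine-grid solution closely, mirroring the two-iteration convergence reported in Chapter 5.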
Thus, on the 4-parameter ROM problem we have:
• Step (i) For n = 1, . . . , N , solve the linear problem with ψ ∈ H^1_0 :

∫_Ω [(C^0_n − C^0_{n−1})/∆T ] ψ + ∫_Ω ∇C^0_n · ∇ψ = −∫_Ω (C^0_{n−1})² ψ,

where C^0_0 = u_0 .
For k = 0, 1, . . ., perform the following steps until satisfactory convergence is achieved:
• Step (ii) for each interval [Tn , Tn+1 ], n = 0, . . . , N − 1, either set ∆t = (Tn+1 − Tn )/M with a fixed M , or instead fix ∆t and set M = (Tn+1 − Tn )/∆t; set F^k_{n,0} = C^k_n and, for m = 1, . . . , M , solve in parallel the nonlinear problems over each subinterval,

∫_Ω [(F^k_{n,m} − F^k_{n,m−1})/∆t] ψ + ∫_Ω ∇F^k_{n,m} · ∇ψ + ∫_Ω (F^k_{n,m})² ψ = 0;

denote this solution at each point Tn , n = 1, . . . , N , by F^k_n ;
• Step (iii) for n = 0, . . . , N − 1, define S^k_n = F^k_n − C^k_n ; set δ^k_0 = 0 and for n = 1, . . . , N solve the linear problem

∫_Ω [(δ^k_n − δ^k_{n−1})/∆T ] ψ + ∫_Ω ∇δ^k_n · ∇ψ + ∫_Ω 2 F^k_n δ^k_n ψ = ∫_Ω [S^k_{n−1}/∆T ] ψ;

• Step (iv) set C^{k+1}_n = F^k_n + δ^k_n , n = 1, . . . , N .
Again, only linear solves are done sequentially, so that the full nonlinear solves are done only in parallel, thus providing us with some performance gains.
CHAPTER 5
Computational Experiments and Results
In this chapter we report the results of three separate trial problems, primarily to show that the parareal algorithm is sequentially consistent and to give some basic evidence of the speedup capabilities of this algorithm. The speedup reported here is relative speedup, i.e., S_P = T_1 /T_P , where T_1 is the time to execute on a single processor and T_P is the execution time over P processors. In the next chapter we take a much closer look at the speedup, scalability, and the overall parallel performance of this algorithm.
All of the results reported in this chapter were computed on the Teragold supercomputer
at Florida State University, which is an IBM SP2 system.
5.1 FEM and the Parareal Algorithm Results
We first consider a nonlinear parabolic problem, the model problem discussed in Chapter 3, to illustrate the parareal algorithm applied in the FEM setting:

u_t − ∆u + f(u) = g(x, t),   (x, t) ∈ Ω × (0, T]
u(x, 0) = u_0 (x),           x ∈ Ω                    (5.1)
u(x, t) = 0,                 (x, t) ∈ ∂Ω × (0, T].
As a test we computed the case where f(u) = u² and g(x, t) was chosen so that the exact solution is u(x, y, t) = cos(10t) tan(x² + y² − 1) with Dirichlet boundary data. All results use continuous piecewise quadratic elements on a triangular grid for the spatial finite element approximation, the backward Euler approximation is used for the temporal discretization, and Newton's method with a serial direct banded linear solver is used to solve the nonlinear equations. We take Ω = [0, 1] × [0, 1] and T = 1.
The first goal is to demonstrate the convergence of the parareal algorithm by showing that we do indeed obtain the same convergence properties as in a sequential implementation of a FEM. In Table 5.1 we summarize the errors obtained by solving this problem using
Table 5.1: Comparison of errors using the standard finite element approach and the parareal/FEM approach

                                FEM           Parareal Algorithm
h     ∆T    ∆t                                Initial Coarse  Iteration #1  Iteration #2
                                L2-error      L2-error        L2-error      L2-error
1/10  0.1   0.01                9.873×10^-3   8.424×10^-2     1.749×10^-2   7.339×10^-3
1/20  0.1   0.0025              2.501×10^-3   8.424×10^-2     1.760×10^-2   3.116×10^-3
                                H1-error      H1-error        H1-error      H1-error
1/10  0.1   0.01                6.117×10^-2   0.4382          9.156×10^-2   5.152×10^-2
1/20  0.1   0.0025              1.588×10^-2   0.4372          8.457×10^-2   1.704×10^-2
a standard serial finite element approach with a timestep of length ∆t and the parareal algorithm using the same timestep for the fine grid calculation but a much larger timestep, ∆T , for the coarse grid calculations. As can be seen from the table, it takes only two iterations of the parareal algorithm to obtain accuracy equivalent to that of the serial finite element approach. The errors reported are the maximum L2 and H1 errors over the time interval on which the problem was computed.
In Table 5.2, we give a brief example of the speedup potential for this algorithm. Here T /∆T = the number of steps taken in the coarse grid, which typically corresponds to the number of processors, T /∆t = the number of time steps taken in the fine grid over the whole domain, ∆T /∆t = the number of time steps taken in the fine grid over each sub-domain of size ∆T , and 1/h = the number of elements used in the FEM spatial discretization. In this table the problem size is specified so that the number of time steps taken within each sub-interval is a fixed constant, but since ∆t = ∆T /M and ∆T = T /P , where P is the number of processors, this means that
the underlying fine grid problem changed size for each P. Thus, the near-linear scaling present in the table is only a weak scaling, which tends to be linear. The details of this phenomenon will be the subject of the next chapter, where we will take a much closer look at the scalability of this algorithm and the problem parameters that affect these trends.
Table 5.2: Speedup results for the parareal/FEM approach compared to the serial FEM approach.

h     T/∆t   T/∆T   ∆T/∆t   No. of processors   speedup
1/10  100    10     10      10                  3.62
1/10  400    20     20      20                  4.82
1/10  6400   80     80      80                  8.39
1/20  100    10     10      10                  3.66
1/20  400    20     20      20                  4.97
1/20  6400   80     80      80                  9.13
In Figure 5.1, one can note the nonlinear strong scaling trend of this algorithm. As can be seen, this algorithm does not scale linearly in the strong sense, which is not necessarily bad, whenever the underlying fine grid spacing ∆t is fixed. As the coarse grid is refined to the point that it begins approaching the resolution of the fine grid, we begin to lose speedup. The result is that there is always a sweet spot, an optimal number of processors to be used for a given problem. We investigate the scalability and how it depends on the problem at hand in the next chapter. Some basic speedup results are presented here simply to provide a flavor of what is to come.
Figure 5.1: Speedup of the parareal/FEM implementation (speedup vs. number of processors): blue, ∆t = 0.01; red, ∆t = 0.005.
5.2 ROM and the Parareal Algorithm Results
In this section we look at some of the results of our combination of the parareal with model
reduction algorithms.
5.2.1 4-Parameter Problem, Reaction Diffusion
To illustrate the behavior of the algorithm we discussed in the ROM setting, we consider the following nonlinear reaction diffusion example, which involves four time dependent parameters on the boundary. We state the problem once again, as in Chapter 4, to be clear.
u_t − ∆u + u² = 0,   (x, t) ∈ Ω × (0, T]    (5.2)
u(x, 0) = 0,         x ∈ Ω                  (5.3)

where we take Ω = [0, 1] × [0, 1] and T = 1, with the following boundary parameters:

y = 1: u = 4x(1 − x)β1 , where β1 = 2t if t < 0.5 and β1 = 2(1 − t) if t ≥ 0.5    (5.4)
y = 0: u = 4x(1 − x)β2 , where β2 = 4t(1 − t)                                     (5.5)
x = 0: u = 4y(1 − y)β3 , where β3 = | sin(2πt)|                                   (5.6)
x = 1: u = 4y(1 − y)β4 , where β4 = | sin(4πt)|                                   (5.7)
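The four time-dependent boundary parameters β1 through β4 in (5.4)-(5.7) can be evaluated directly; the following is a small transcription of ours for experimentation:

```python
import math

def beta1(t):
    # hat function: ramps up to 1 at t = 0.5, then back down to 0
    return 2.0 * t if t < 0.5 else 2.0 * (1.0 - t)

def beta2(t):
    # parabola in time, peaking at t = 0.5
    return 4.0 * t * (1.0 - t)

def beta3(t):
    # rectified sine, one arch on [0, 0.5]
    return abs(math.sin(2.0 * math.pi * t))

def beta4(t):
    # rectified sine at twice the frequency of beta3
    return abs(math.sin(4.0 * math.pi * t))
```

These are the functions sampled impulsively during snapshot generation, as described below.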
For the snapshot generation we sampled points in the four-dimensional parameter space,
then solved the full finite element model with h = 0.1 for the equation by impulsively jumping
between the sampled parameters; snapshots were generated from the solution at various time
intervals for the choice of parameters. A total of 300 snapshots were generated and a POD
technique was used to determine the basis vectors. Satisfying the inhomogeneous Dirichlet
boundary data requires extra work and was handled by generating additional basis vectors,
which satisfied general inhomogeneous boundary data; again see [8] for more details.
The calculations reported in Table 5.3 are for the spatial discretization of h = 0.1. The
errors are computed by comparing with the full FEM solution of the problem since an exact
solution is not known; once again the errors measure the maximum L2 error over all time.
In Table 5.3, column four gives the error between the ROM solution and the standard FEM
solution and the remaining columns give the errors at each stage of the parareal/ROM
algorithm. Note that similar to the FEM example, it takes only two iterations of the
parareal/ROM algorithm to match the error in the serial ROM solution.
Again, Figure 5.2 is included to show that indeed the speedup trend is maintained for
the ROM implementation.
Table 5.3: Comparison of errors for the 4-parameter problem using the standard ROM approach and the parareal/ROM algorithm. Errors are calculated by comparing to the full finite element solution.

# basis                 ROM           Parareal/ROM Algorithm
vectors  ∆T    ∆t                     Initial Coarse  Iteration #1  Iteration #2
                        L2-error      L2-error        L2-error      L2-error
4        0.1   0.01     2.442×10^-2   0.1064          1.899×10^-2   1.344×10^-2
8        0.1   0.01     2.338×10^-3   0.1054          1.405×10^-2   1.940×10^-3
16       0.1   0.01     1.126×10^-4   0.1053          1.403×10^-2   1.569×10^-3
5.2.2 ROM and the Navier-Stokes Equations
We also wanted to test the algorithm on a more complex and thus more realistic system. We state the problem once again, as in Chapter 4, to be clear. For this example, we solve the time dependent incompressible Navier-Stokes equations given by
∂u/∂t − ν∆u + u · ∇u + ∇p = 0   in Ω × (0, T]
∇ · u = 0                       in Ω × (0, T]                (5.8)
u(x, 0) = u_0                   in Ω
for the velocity u and the pressure p; here the Reynolds number 1/ν is chosen to be 100. The
physical domain for this problem is illustrated in Figure 5.3; along the boundary of this flow domain, note the six sets of inlet/outlet orifices Γi , i = 1, . . . , 6, and a main outlet orifice. The remainder of the flow domain's boundary is a solid wall. We enforce a zero-stress outflow boundary condition at the main outlet orifice, as indicated in Figure 5.3, and homogeneous zero velocity boundary conditions along the solid portions of the wall.
At the six sets of inlet/outlet orifices Γi , i = 1, . . . , 6, we impose the following boundary
conditions:
Figure 5.2: Speedup of the parareal/ROM implementation (speedup vs. number of processors): blue, ∆t = 0.01; red, ∆t = 0.005.
Γ1 (inlets):  x1 = 0, 8 ≤ x2 ≤ 9:     u = .48β1 (x2 − 8)(9 − x2 )
              ai ≤ x1 ≤ bi , x2 = 6:  v = .48β1 (x1 − ai )(bi − x1 )
Γ2 (inlets):  x1 = 105, 8 ≤ x2 ≤ 9:   u = −.5β2 (x2 − 8)(9 − x2 )
              ci ≤ x1 ≤ di , x2 = 6:  v = .5β2 (x1 − ci )(di − x1 )
Γ3 (inlets):  x1 = 0, 2 ≤ x2 ≤ 3:     u = .44β3 (x2 − 2)(3 − x2 )
              ai ≤ x1 ≤ bi , x2 = 5:  v = −.44β3 (x1 − ai )(bi − x1 )
Γ4 (inlets):  x1 = 105, 2 ≤ x2 ≤ 3:   u = −.352β4 (x2 − 2)(3 − x2 )
              ci ≤ x1 ≤ di , x2 = 6:  v = −.352β4 (x1 − ci )(di − x1 )
Γ5 (outlets): ai ≤ x1 ≤ bi , x2 = 11: u = .612β5 (x1 − ai )(bi − x1 )
Γ6 (outlets): ci ≤ x1 ≤ di , x2 = 11: u = .896β6 (x1 − ci )(di − x1 ),
(5.9)
Figure 5.3: The H-cell domain of the building ventilation problem, with boundary parameters
illustrated.
where (ai , bi ) ∈ {(10, 11), (22, 23), (34, 35)} and (ci , di ) ∈ {(70, 71), (82, 83), (94, 95)},
i = 1, 2, 3.
Approximate solutions of the Navier-Stokes equations are obtained using the standard Taylor-Hood finite element method for the spatial discretization, i.e., continuous piecewise linear functions on triangles are used to approximate the pressure and continuous piecewise quadratic functions on the same triangles are used to approximate the components of the velocity. The backward Euler approximation is used for the temporal discretization. A uniform grid consisting of 8,520 triangles is used, resulting in 35,730 unknowns for the full high-fidelity FEM approximation.
The full high-fidelity FEM approximation is then used to generate our snapshot set, which in turn is used to generate the POD basis. Again, having multiple parameters on the boundary complicates the ROM procedure; see [10] for the full details on how this is achieved. In [10] it is shown that the 35,730 unknowns in the full FEM solution can be reduced to 14 unknowns while still producing a very accurate approximation to the full finite element solution.
In Figure 5.4, we see that the combination of our ROM approach with the parareal algorithm further accelerates the generation of the desired solution. At this point one should also note the trend displayed within each of the examples presented, namely that by using a higher resolution in the fine grid of the parareal implementation we achieve a higher speedup factor. It is this property that makes the parareal a workhorse algorithm: the more work we give it, the more performance gains we get back out of it. The goal of the next chapter will be to quantify exactly how these performance gains are affected by problem parameters such as the underlying fine grid resolution.
Figure 5.4: Speedup of the parareal/ROM implementation for the Navier-Stokes problem (speedup vs. number of processors): blue, ∆t = 0.01; red, ∆t = 0.005.
CHAPTER 6
Performance Analysis and Scalability
In this chapter, we take a closer look at the parallel performance of the parareal algorithm.
The focus here will be on the interesting way that this algorithm scales with a larger number
of processors and how this scalability depends on specific aspects of the problem one is
interested in solving. Efficiency and cost effectiveness of the algorithm will also be addressed.
All of the results reported in this chapter were computed on the new Florida State
University shared High Performance Computing (HPC) facility. The FSU HPC consists of
four head nodes, 128 quad core compute nodes (512 cores), 78 TB of usable storage, and
non-blocking Infiniband and IP communication fabrics.
6.1 Introduction to Performance Analysis Concepts and Terminology

6.1.1 Quantities of Interest

Speedup
One of the primary quantities of interest in this chapter is the speedup capabilities we get
with the application of the parareal algorithm to both our FEM and ROM implementations.
In order to make the results clear, we first need to define precisely what we mean here by
speedup and exactly how this quantity is measured. In the previous section we gave some
hints towards the speedup capabilities of this algorithm and now we seek to make these
results more precise.
In practice there are two basic classes of speedup, absolute and relative. In the case
of absolute speedup the parallel algorithm is compared directly to the fastest known serial
implementation of the method. Thus, if we let TA be the wall clock time of this serial
implementation and T_P be the wall clock time of the parallel implementation using P processors, then the absolute speedup is

S^A_P = T_A / T_P .   (6.1)
Oftentimes it is not clear what the fastest serial implementation is, and furthermore these highly optimized versions may only be able to reach their peak on very specific architectures. In the case of the parareal algorithm we are dealing with a purely parallel algorithm, so it is difficult to compare directly to a serial FEM or ROM implementation, for example.
In these types of situations it is very common to study and report the relative speedup of an algorithm's performance. In this case we let T_1 be the wall clock time of the code running on a single processor while T_P is again the wall clock time of the implementation over P processors; thus the relative speedup with respect to one processor is defined as

S^{1R}_P = T_1 / T_P .   (6.2)
Similarly, if we let T_k be the wall clock time of the code running on k processors while T_P is again the wall clock time of the implementation over P processors, then a generalized relative speedup with respect to k processors is defined as

S^{kR}_P = T_k / T_P ,   (6.3)
where it is assumed that k < P . In this work all of the speedup results reported are relative speedup as defined by (6.2), which will typically be denoted simply as S, S_P , or just speedup. For the parareal algorithm this is reasonable because it is a purely parallel algorithm, and computing on a single processor is essentially just doing a single nonlinear solve over the full fine grid with the fine time step ∆t.
Speedup is considered to be linear whenever

S_P ≈ P,   (6.4)

where S_P denotes the speedup obtained with P processors, and it is called superlinear whenever

S_P > P.   (6.5)
One common cause of observing a superlinear speedup is the effect of cache aggregation. In parallel computations, not only does the number of processors change, but so does the total size of the accumulated caches from the different processors. With the larger accumulated cache size, more or even all of the data set can fit into the caches, which can dramatically reduce memory access time and produce additional speedup beyond what is provided purely by the computations being done in parallel. When neither of these cases applies, the speedup trend is said to be nonlinear.
Efficiency
The efficiency of an algorithm using P processors is

E_P = S_P / P.   (6.6)
The efficiency of an algorithm estimates how well-utilized the processors are in solving the problem, compared to how much effort is lost in communication and synchronization. Algorithms with linear speedup, and algorithms on a single processor, have E_P = 1. Many difficult-to-parallelize algorithms have an efficiency that approaches zero as P increases.
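For instance, applying (6.6) to the h = 1/10 rows of Table 5.2 shows the efficiency dropping as P grows; the helper below is our own, not thesis code:

```python
def efficiency(S_P, P):
    """Parallel efficiency E_P = S_P / P, eq. (6.6)."""
    return S_P / P

# (processors, speedup) pairs from the h = 1/10 rows of Table 5.2
table_5_2 = [(10, 3.62), (20, 4.82), (80, 8.39)]
effs = [efficiency(s, p) for p, s in table_5_2]
```

Here the efficiencies fall from about 0.36 at P = 10 to about 0.10 at P = 80, consistent with the nonlinear strong-scaling trend seen in Figure 5.1.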
Scalability
The speedup trend is a statement about the scalability of the algorithm, which is our primary
interest in this chapter. Scalability analysis is the process of analyzing the speedup trend
as the number of processors is varied. Many times though the scaling of the algorithm
depends on other problem specific parameters such as the problem size, for example. It
is then also of great importance to explore the problem parameters to see how they can
affect the scaling trend as the number of processes P is varied. We will look closely at what
problem parameters are playing a significant role in affecting the scalability of our parareal
implementations.
In general, speedup describes how the parallel algorithm’s performance changes with
an increasing number of processors P .
Scalability is concerned with the efficiency of
the algorithm with changing problem parameters, such as problem size, by choosing a P
dependent on the problem parameter so that the efficiency of the algorithm is bounded
below. If we take the problem size to be the parameter of interest in the scalability analysis,
then we may define scalability formally.
Definition 6.1.1 An algorithm is scalable if there is a minimal efficiency ε > 0 such that, given any problem size N , there is a number of processors P (N ), which tends to infinity as N tends to infinity, such that the efficiency E_{P(N)} ≥ ε > 0 as N is made arbitrarily large.
There are two classes of scalability: strong and weak. Weak scalability is a type of scaling study where the problem size is allowed to change as P is varied. Strong scalability, on the other hand, is a type of scaling study where all of the parameters used to specify the problem size are fixed as the number of processes P is increased to larger and larger sizes.
The hope with most parallel algorithms is to achieve at least linear speedup, or rather a linear scaling trend: if the parallel implementation provides any speedup whatsoever, one would like to observe more speedup as the number of processes is increased. There are many parallel algorithms that by design do not scale in a linear fashion, and this does not imply that such algorithms are bad or not useful. In fact, there are applications where one needs a computation that involves a set problem size, and with a nonlinear scaling trend there typically exists an optimal number of processors to be used, so that one need only invest in a cluster resource of size X instead of size Y , where X < Y , perhaps saving a large sum of money if this were to be a routine type of calculation, as in many real time computing needs.
We will see that with the parareal algorithm we can achieve a near to linear scaling but
only in the weak sense while the strong scaling trend is nonlinear. It is a goal of this chapter
to explain how this occurs and the role that the problem parameters play in the scalability
of the parareal algorithm with respect to our FEM and ROM implementations.
6.1.2 Metrics
In the practice of parallel computing there are some classic results that give us a means of
understanding the capabilities and limitations of the implementations of parallel algorithms.
Two such classic results are Amdahl’s law and Gustafson’s law.
Amdahl’s law is a model for the theoretical relationship between the expected speedup
of parallelized implementations of an algorithm relative to the serial algorithm under the
assumption that the problem size remains the same when the algorithm is parallelized.
More specifically, the law is concerned with the speedup achievable from an improvement to
a portion of that computation, typically the parallelized fraction, where that improvement
leads to an observable speedup in the overall execution of the algorithm. It is often used in
the practice of parallel computing to try and predict the theoretical maximum speedup of
an algorithm using multiple processors.
Amdahl’s law can be stated in a large variety of ways where each has something more or
less revealing to express about the meaning of the statement of the law. If we let f denote
the fraction of the computation that is purely sequential and cannot be parallelized into
concurrent tasks and let Ts denote the execution time of a sequential run of the algorithm,
then we can state Amdahl’s law in terms of the expected speedup.
Law 6.1.1 (Amdahl's Law v1) Amdahl's law states that the speedup given P processors is

S_P = T_s / [f · T_s + (1 − f) · T_s / P] = P / [1 + (P − 1)f ].   (6.7)
Something immediately evident from this formulation is that the maximum speedup is limited by f^{-1}, since

lim_{P→∞} S_P = 1/f.   (6.8)

If we let T_{P1} denote the execution time of the parallelized portion of the algorithm using a single processor and T_s denote the execution time of the sequential portion of the algorithm, then we can formulate Amdahl's law in terms of the total parallel execution time.
Law 6.1.2 (Amdahl's Law v2) Amdahl's law states that the wall clock time of the parallel execution, T (P ), of the algorithm given P processors is

T (P ) = T_s + T_{P1} / P.   (6.9)
Again, one can immediately see that in the limit the serial fraction of the algorithm is still
present as an upper bound on the total execution time. This is a part of the motivation for
many researchers to pay a great deal of attention to the optimization of the serial fractions of
their algorithms. In our implementations of the parareal algorithm we achieve sequentially
consistent results by doing only linearized state solves within the sequential portions of the
algorithm precisely to help in reducing the amount of time being spent in the sequential
fractions of the code.
Gustafson’s law addresses the shortcomings of Amdahl’s law, which cannot scale to match
the availability of computing power as the machine size increases. Amdahl’s law is based on
the assumption that there is a fixed problem size or fixed problem parameters. Gustafson’s
law tries to account for a changing problem size by instead fixing the parallel execution time.
Law 6.1.3 (Gustafson's Law) Gustafson's law defines the scaled speedup, keeping the parallel execution time constant by adjusting P as the problem size N changes:

S_{P,N} = P + (1 − P )α(N ),   (6.10)

where α(N ) is the non-parallelizable fraction of the normalized parallel time, i.e. T_{P,N} = 1, given the problem size N . Assuming that the serial fraction α(N ) diminishes with the problem size N , the speedup approaches P as N approaches infinity, as desired in a linear scaling.
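Both laws are easy to evaluate numerically; the following small sketch (function names are ours) makes the contrast concrete:

```python
def amdahl_speedup(P, f):
    """Amdahl's law, eq. (6.7): speedup on P processors when a
    fraction f of the computation is purely sequential."""
    return P / (1.0 + (P - 1) * f)

def gustafson_speedup(P, alpha):
    """Gustafson's scaled speedup, eq. (6.10): alpha is the
    non-parallelizable fraction of the normalized parallel time."""
    return P + (1.0 - P) * alpha
```

As P grows, amdahl_speedup(P, f) saturates at 1/f (e.g. at 10 for f = 0.1), while gustafson_speedup(P, alpha) stays close to P when alpha is small, illustrating the limit in (6.8) versus the scaled-speedup view.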
An immediate problem with applying these classic parallel computing principles to the parareal algorithm is that some of the assumptions implicit within each of these laws fail to hold true for this algorithm. In the case of Amdahl's law, it is assumed that the execution time of the sequential portion of the algorithm remains constant as the number of processors is varied. In the case of Gustafson's law, it is assumed that the sequential execution time decreases as the problem size grows. Both of these assumptions fail to hold for the parareal algorithm. In all of our implementations of the parareal algorithm the sequential portions of the algorithm, namely the initial coarse grid solve and the solve for the correction term upon each iteration, each grow with the number of processors, because we set the coarse grid time step to be ∆T = T /P , and so as we increase P we refine the coarse grid time step, thus making each of the sequential solves more computationally expensive.
A more practical problem exists for using these principles as a performance measure or
estimator. In each case these laws require a priori knowledge of the execution time of the
sequential portion of the algorithm. In most cases this is not known or difficult to gauge
especially when this value changes with any of the problem parameters. Fortunately, this
isn’t the end of the story.
In 1990, Alan H. Karp and Horace P. Flatt proposed a metric that allows one to empirically measure the amount of time consumed by the sequential portion of the execution of an implementation of a parallel algorithm. This is now well known as the Karp-Flatt metric, which is an a posteriori empirical measurement. Given a parallel computation exhibiting a relative speedup ψ using P > 1 processors, the experimentally determined sequential fraction e, defined to be the Karp-Flatt metric, is

e = (1/ψ − 1/P ) / (1 − 1/P ).   (6.11)
It is easy to see that this metric is still consistent with Amdahl's law. Consider the case where P = 1 in Law 6.1.2: we have T (1) = T_s + T_{P1}. If we define the serial fraction e = T_s / T (1), then the equation can be rewritten as

T (P ) = T (1)e + T (1)(1 − e)/P.   (6.12)

The relative speedup is defined as before, ψ = T (1)/T (P ); thus, dividing (6.12) by T (1), we get

1/ψ = e + (1 − e)/P,   (6.13)

and by simply solving for the serial fraction we obtain the Karp-Flatt metric. Keep in mind that this only shows that the metric is consistent with Amdahl's law and is not a formal derivation, since e is a metric and not merely a mathematically derived quantity.
We will see that this metric is useful in tracking the points at which the parareal algorithm is operating at or near its peak performance for a given set of parameters in a strong scalability setting, i.e., where the problem size over the temporal domain remains fixed. What we will demonstrate is that the parareal achieves its near-peak performance when the serial fraction of the algorithm, e, is at its minimum.
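The metric (6.11) is straightforward to compute from measured speedups; a small sketch of ours, applied to a speedup value from Table 5.2:

```python
def karp_flatt(psi, P):
    """Experimentally determined serial fraction e, eq. (6.11),
    from a measured relative speedup psi on P > 1 processors."""
    return (1.0 / psi - 1.0 / P) / (1.0 - 1.0 / P)

# First h = 1/10 row of Table 5.2: speedup 3.62 on 10 processors
e = karp_flatt(3.62, 10)   # roughly 0.196
```

Note that if ψ happens to follow Amdahl's law (6.7) exactly, the metric recovers the serial fraction f, which is the consistency argument made above.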
6.2 Problem Parameters in our FEM and ROM Parareal Implementations
In this section the goal is to explain in detail the problem parameters that define the problem
size of the simulation. It is these parameters and how they are treated that affect both the
strong and weak scalability of the parareal algorithm. Thus, in order to understand how this
algorithm scales, we first must shed light on precisely how the input parameters define the
problem size of a particular implementation of the algorithm.
6.2.1 Input Parameters and Problem Size

Parareal
The general input parameters for any implementation of the parareal algorithm, over a
temporal domain, are the number of processors being used P , the coarse grid time step ∆T ,
the fine grid time step ∆t, the number of steps taken within each subinterval M , the size of
the temporal domain T (assuming that we begin at T = 0), and the number of iterations k
needed to reach the desired level of accuracy, which we take to be the accuracy that would
be obtained with a solve over the entire temporal domain with the fine grid time step. Since
we observed convergence to the desired accuracy of the fine grid resolution in two iterations
of the correction scheme, with all of our implementations, we fix k = 2, in our performance
analysis, to eliminate this variable and thus simplify the scalability study. Of course, all of
these parameters are not independent of one another. Let us recall the following relations.
In all of the implementations designed for this scaling study we let

∆T = T / P,   (6.14)

∆t = ∆T / M,   (6.15)

or

M = ∆T / ∆t.   (6.16)
The last two equations, (6.15) and (6.16), at a glance don’t appear to be significantly
different but, in fact, the distinction is very important in terms of how the algorithm will
scale. It is a seemingly subtle distinction in how the algorithm is implemented but the
resulting behavior is drastically different.
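The two configurations, (6.15) with M fixed and (6.16) with ∆t fixed, can be made concrete; the helper names below are our own illustration, not thesis code:

```python
def weak_config(T, P, M):
    """Fix M fine steps per subinterval (eq. 6.15): the fine step
    shrinks as P grows, so the global fine-grid problem grows with
    P; this is the weak-scaling configuration."""
    dT = T / P       # eq. (6.14)
    dt = dT / M      # eq. (6.15)
    return dT, dt

def strong_config(T, P, dt):
    """Fix the fine step dt (eq. 6.16): the global fine grid T/dt
    stays the same and only its partition changes with P; this is
    the strong-scaling configuration."""
    dT = T / P       # eq. (6.14)
    M = dT / dt      # eq. (6.16)
    return dT, M
```

For example, with T = 1 and M = 10 fixed, doubling P halves both ∆T and ∆t, whereas with ∆t fixed, doubling P halves ∆T and M instead.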
Weak Scalability
In the case of (6.15) we fix M and essentially are telling the algorithm that, regardless of
the number of processors we are using, and thus the number of subintervals [T n , T n+1 ] we
generate, we want to perform M state solves over each of these subintervals with the fine
time step ∆t. When M is fixed in this way we observe a linear scaling trend, i.e. S ≈ P .
This is not very surprising once one looks closer at what is happening with this configuration.
It is important to realize that under this configuration the fine grid resolution is a function
of P , i.e. we have ∆t(P ) = (T/M)(1/P ). Thus, as we let P increase to larger sets of processors
the fine grid resolution, which is characterized by the fine time step ∆t, becomes increasingly
fine, so the problem size in the temporal domain grows such that the fine step ∆t is
always considerably smaller than the coarse step ∆T . This is a case of weak scalability, and
so we can conclude that the parareal algorithm can be made to scale linearly, but, as we will
demonstrate, only in the weak sense.
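To see the weak-scaling configuration concretely, here is a small sketch (values illustrative) showing that with M fixed the fine step ∆t = (T/M)(1/P) refines as P grows, while the ratio ∆T/∆t stays constant:

```python
T, M = 10.0, 100                 # fixed final time and fine steps per subinterval
rows = []
for P in (10, 100, 1000):        # growing processor counts
    dT = T / P                   # coarse step shrinks with P
    dt = dT / M                  # fine step refines too: dt(P) = (T/M)/P
    rows.append((P, dT, dt))
    print(f"P={P:5d}  dT={dT:.4f}  dt={dt:.6f}  dT/dt={round(dT / dt)}")
```

The problem size T/∆t therefore grows in lockstep with P, which is exactly the weak-scaling regime.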
To illustrate the weak scaling of the parareal algorithm under this type of problem
parameter configuration, consider the following test suite of time dependent scalar ODEs;
this test suite is taken from [12], and the solution of each ODE in the suite is designed
to exhibit a different type of behavior.
Problem A1 is the simplest case, exhibiting basic monotonic decay:

dy/dt + y(t) = 0,  y(0) = 1,  t ∈ [0, 10].    (6.17)

Problem A2 is a special case of the Riccati equation:

dy/dt = −y³/2,  y(0) = 1,  t ∈ [0, 10].    (6.18)

Problem A3 exhibits oscillatory behavior:

dy/dt = y cos(t),  y(0) = 1,  t ∈ [0, 10].    (6.19)

Problem A4 is a basic logistic growth model:

dy/dt = (y/4)(1 − y/20),  y(0) = 1,  t ∈ [0, 10].    (6.20)

The solution to problem A5 is a spiral curve:

dy/dt = (y − t)/(y + t),  y(0) = 4,  t ∈ [0, 10].    (6.21)
In Figure (6.1) we have plotted the speedup results for this test suite of ODEs. The
number of processors P is plotted along the vertical axis and the speedup S is plotted along
Figure 6.1: Suite A, Speedup vs. Processors
the horizontal axis. We do not currently have the capability to compute on a machine with
1,000 processors, so the parallel speedup was emulated by a series of sequential runs in which
the parallel section of the code was looped P times. Each of the loops was timed and then
divided by P , and the timings from the sequential portions were added to the total. There is
an analytical formula that does a fairly good job of capturing the weak scaling phenomenon.
The result was derived by G. Bal in 2003; see [2] for the details of the derivation of the
speedup and efficiency.
Theorem 6.2.1 (Speedup for Weak Scaling)

S = (T /∆t) / [ k(T /∆T ) + (k − 1)(∆T /∆t) ] = 1 / [ k(∆t/∆T ) + (k − 1)(∆T /T ) ],    (6.22)

where ∆T and ∆t are the coarse and fine grid time steps respectively, T is the maximum
time, and k is the number of iterations.
Theorem 6.2.2 (Efficiency for Weak Scaling)

E = 1 / [ (k − 1) + k T ∆t/(∆T )² ],    (6.23)

where, again, ∆T and ∆t are the coarse and fine grid time steps respectively, T is the
maximum time, and k is the number of iterations.
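These formulas are easy to evaluate directly. In the sketch below, the values T = 10, ∆T = 0.01, and ∆t = 1e-5 are inferred so as to reproduce the tabulated theoretical figure for A5 at P = 1,000 (the thesis does not list ∆t for that run explicitly):

```python
def parareal_speedup(T, dT, dt, k):
    """Theoretical weak-scaling speedup, equation (6.22) from Bal [2]:
    S = 1 / (k*dt/dT + (k-1)*dT/T)."""
    return 1.0 / (k * dt / dT + (k - 1) * dT / T)

def parareal_efficiency(T, dT, dt, k):
    """Theoretical weak-scaling efficiency, equation (6.23):
    E = 1 / ((k-1) + k*T*dt/dT**2)."""
    return 1.0 / ((k - 1) + k * T * dt / dT**2)

# A5 with P = 1000: T = 10, dT = T/P = 0.01, and dt = 1e-5 (inferred)
S = parareal_speedup(10.0, 0.01, 1e-5, k=2)    # about 333.33
E = parareal_efficiency(10.0, 0.01, 1e-5, k=2)  # about 0.33
print(S, E)
```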
As a demonstration of how close this result comes to capturing the actual speedup and
efficiency, consider the tabulated results of problem (6.21).
Problem A5, theoretical vs. actual performance:

  P       Theoretical Speed-Up  Actual Speed-Up  Theoretical Efficiency  Actual Efficiency  Iterations Required
  1,000   333.33                323.66           33%                     32%                2
  500     166.67                157.21           33%                     31%                2
  100     33.3                  27.27            33%                     27%                2
  50      16.67                 10               33%                     20%                2
Note that again we were able to reach the accuracy of the underlying fine grid resolution
in each case with only two iterations of the corrector scheme.
The weak scalability of the parareal algorithm is fairly well-understood and straightforward
to explain. One problem with a weak scaling type configuration is that there is no
fixed underlying accuracy that is being sought for any P . It is true that for a fixed P we have
a fixed underlying fine grid resolution, but this reference point changes as we vary P , and if
we simply let P get too large, then ∆t becomes so small that round-off error
becomes a serious concern.
Strong Scalability
On the other hand, if we consider the case of (6.16) where we fix ∆t, then M varies instead,
but also as a function of P : we have M = (T /∆t)(1/P ). With this type of configuration we now
have a fixed underlying fine grid resolution, and thus the problem size over the temporal
domain remains fixed as we vary P . The price we pay is that we lose our nearly linear
speedup and the scaling trend becomes nonlinear. This makes sense because as P becomes
larger, at some point ∆T begins to approach the fixed value of ∆t, and we are essentially
solving the full fine grid problem with a purely parallel algorithm; hence speedup is lost
and eventually slowdown will begin to occur. What we will demonstrate is that with such a
configuration the strong scaling trend is such that for a specified set of problem parameters,
in which the problem size remains fixed in time, there is always an optimal choice of
processors to achieve the maximum speedup for the given problem configuration.
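The strong-scaling configuration can be sketched in the same style as before (values illustrative): ∆t is held fixed, so M shrinks with P while the total fine problem size T/∆t stays constant.

```python
T, dt = 10.0, 0.001              # fixed final time and fixed fine step
rows = []
for P in (10, 100, 1000):
    dT = T / P                   # coarse step still shrinks with P
    M = round(dT / dt)           # fine steps per subinterval: M(P) = (T/dt)/P
    rows.append((P, M))
    print(f"P={P:5d}  dT={dT:.3f}  M={M:5d}  total fine steps = {P * M}")
# As P grows, dT approaches the fixed dt; the coarse sweep then does
# nearly as much work as a fine solve, and speedup saturates.
```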
FEM
The parareal algorithm applied to time dependent PDEs is sensitive to all of the general
problem parameters of the standard parareal algorithm in the temporal domain, plus
the additional parameters that serve to specify the problem sizes in each of the spatial
dimensions as well. In all of the implementations within this work we are solving PDEs
in two spatial dimensions over a unit square. In principle, we could have different sized
spatial discretizations in each of the dimensions, but to simplify the performance analysis
we implement a uniform discretization in each spatial dimension. In this work the problem
size in space is specified by the number of finite elements used in the spatial discretization,
which we denote by 1/h, and thus h is the actual step size used to discretize the domain
Ω = (0, 1) × (0, 1), i.e. the unit square.
So, in our parareal/FEM implementations the problem parameters that affect the scaling
trend in general are P , T , ∆T = T /P , ∆t, h, and M = ∆T /∆t. Since we are now interested in
investigating the strong scaling trend we keep ∆t fixed.
ROM
The parareal algorithm applied to time dependent PDEs, where ROM has been applied, is
very similar to the parareal/FEM case, with a slightly different involvement of h.
Recall that in the ROM setting we use POD to generate a reduced basis from a snapshot set
that was generated from a high fidelity FEM solution which was generated using a specific
h. What ultimately influences the scaling trend is the number of POD-basis vectors used in
the reduced basis in space. In this scaling study we fix the number of effective basis vectors
to be eight but in the problems with multiple parameters we generate extra basis vectors to
handle each of the inhomogeneous parameters. In this scalability study we look specifically
at the ROM problem with four parameters on the boundary, so if the effective basis is eight,
then for this problem the total number of POD-basis vectors used is twelve.
Thus, the only parameters that we probe in the ROM setting are the standard parareal
parameters that affect the temporal domain, i.e. P , T , ∆T = T /P , ∆t, h, and M = ∆T /∆t. We
will see though that the strong scaling trend in the ROM setting is very similar to what we
will observe in the FEM cases. If a value of h is specified in a ROM result, it specifically
refers to the value of h used by FEM to generate the snapshot set from which the POD-basis
is generated.
6.2.2 Test Suite for Strong Scalability Study
In each of the test suites described here we are simply stating, explicitly, the values that were
set within each of the separate strong scaling experiments. In each case we choose two values
for each of the input parameters and then run experiments with nearly every combination
of those parameters to observe the scalability trend as we vary P .
FEM
The FEM test suite consists of the following six cases.
FEM case 1: h = 0.1, ∆t = 0.005, and T = 1.0, Figure (6.2).
FEM case 2: h = 0.1, ∆t = 0.001, and T = 1.0, Figure (6.3).
FEM case 3: h = 0.05, ∆t = 0.005, and T = 1.0, Figure (6.4).
FEM case 4: h = 0.05, ∆t = 0.001, and T = 1.0, Figure (6.5).
FEM case 5: h = 0.1, ∆t = 0.001, and T = 10.0, Figure (6.6).
FEM case 6: h = 0.05, ∆t = 0.001, and T = 10.0, Figure (6.7).
ROM
The ROM test suite consists of the following two cases.
ROM case 1: h = 0.1, ∆t = 0.005, and T = 1.0, Figure (6.8).
ROM case 2: h = 0.1, ∆t = 0.001, and T = 10.0, Figure (6.9).
6.3 Strong Scaling Trends of the Parareal Algorithm
In each case all of the timings were done with the MPI standard wall clock timers from the
Open MPI distribution.
6.3.1 FEM Suite
Table 6.1: Results of FEM Test Case 1

  P     Speed-Up   Efficiency   Karp-Flatt
  4     1.51       38%          55%
  5     1.873      37%          42%
  10    3.51       35%          21%
  15    1.42       9.5%         68%
  20    0.807      4%           125%
  40    0.266      0.67%        383%
Table 6.2: Results of FEM Test Case 2

  P     Speed-Up   Efficiency   Karp-Flatt
  4     1.535      38%          54%
  5     2.03       41%          37%
  10    4.486      45%          14%
  20    2.58       13%          36%
  40    1.67       4.2%         59%
  50    0.876      1.8%         114%
Notice in comparing cases one and two that, although the maximum speedup occurs
at around ten processors in each case, case two, which has a more refined fine
grid resolution, is able to achieve more speedup for about the same price. This is typical of
the strong scaling trend. In general, as we make the problem sizes larger we obtain better
performance gains, although typically at the cost of using more processors. Also, note that the
Karp-Flatt metric does indeed attain a minimum value at the optimal choice of processors,
which in these cases is about ten. Also, note that with the larger problem size in time, case
two, a higher percentage of the implementation is benefiting from the parallelization.
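The Karp-Flatt metric quoted in these tables is the experimentally determined serial fraction. A small sketch (data taken from Table 6.2) shows how it is computed from a measured speedup:

```python
def karp_flatt(S, P):
    """Experimentally determined serial fraction (Karp-Flatt metric):
    e = (1/S - 1/P) / (1 - 1/P) for measured speedup S on P processors."""
    return (1.0 / S - 1.0 / P) / (1.0 - 1.0 / P)

# measured speedups from Table 6.2 (FEM test case 2)
for P, S in [(4, 1.535), (5, 2.03), (10, 4.486), (20, 2.58), (40, 1.67), (50, 0.876)]:
    print(f"P={P:3d}  serial fraction = {karp_flatt(S, P):.0%}")
# the metric bottoms out (about 14%) at P = 10, where the speedup peaks
```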
Table 6.3: Results of FEM Test Case 3

  P     Speed-Up   Efficiency   Karp-Flatt
  4     1.53       38%          54%
  5     1.88       37.6%        41%
  10    2.35       24%          36%
  40    3.54       8.9%         26%
  50    2.92       5.8%         33%
  80    1.92       2.4%         51%
  100   1.56       1.6%         64%
Table 6.4: Results of FEM Test Case 4

  P     Speed-Up   Efficiency   Karp-Flatt
  4     1.96       49%          35%
  5     2.55       51%          24%
  10    2.38       24%          36%
  20    2.89       14%          31%
  40    1.86       4.7%         53%
In cases three and four we have increased the problem size in space. In adjusting this
parameter we notice that we can obtain similar factors in speedup, but the cost is that it
takes roughly two to four times as many processors to achieve this result. Also, the Karp-Flatt
metric is still at a minimum near the sweet spot, i.e. the optimal number of processors
where the maximum speedup occurs.
In cases five and six we observe something very interesting. In each of these cases we
have T = 10 and ∆t = 0.001. The only difference is the problem size in space h. So far
it seems that the strong scaling of the algorithm is most sensitive to the length of the time
interval. Also, one can note that when the time interval is made larger, the problem
size in space has a much more drastic effect on the scaling trend. In these cases one can
observe that when h = 0.1, as in Table (6.5), the maximum speedup occurs at P = 40, with a factor
Table 6.5: Results of FEM Test Case 5

  P     Speed-Up   Efficiency   Karp-Flatt
  5     2.03       41%          37%
  10    4.06       41%          16%
  15    2.95       20%          29%
  20    3.95       20%          21%
  40    8.003      20%          10%
  50    6.93       14%          13%
  60    5.61       14%          16%
  80    5.81       7.3%         16%
  100   4.71       4.7%         20%
Table 6.6: Results of FEM Test Case 6

  P     Speed-Up   Efficiency   Karp-Flatt
  4     2.09       52%          30%
  5     2.6        52%          23%
  10    2.57       26%          32%
  20    4.74       24%          17%
  40    18.69      47%          2.9%
  50    22.55      45%          2.5%
  80    32.191     40%          1.9%
  100   36.925     37%          1.7%
  200   39.7016    20%          2%
of about 8. For case six, Table (6.6), where h = 0.05, the maximum speedup was computed at
P = 200, with a factor of about 40, which is a jump by a factor of five in both the speedup
and the optimal number of processors needed to achieve the maximum speedup. Also, we
again see that the Karp-Flatt metric is near its minimum value near the optimal number of
processors required to achieve the maximum amount of speedup. It may be that for case six
the true maximum speedup actually occurs at some number of processors between 100 and
200, since the value of the Karp-Flatt metric increases slightly from P = 100 to P = 200.
6.3.2 ROM Suite
An important thing to note with both of the ROM cases is that they do indeed exhibit a very
similar scaling trend to the FEM cases. One can observe that in case 1, Table (6.7), the maximum
Table 6.7: Results of ROM Test Case 1

  P     Speed-Up   Efficiency   Karp-Flatt
  4     1.96       49%          35%
  5     2.417      48%          27%
  10    4.29       43%          15%
  20    5.96       30%          12%
  40    4.465      11%          20%
  50    3.723      7.4%         25%
  100   1.98       2%           50%
Table 6.8: Results of ROM Test Case 2

  P     Speed-Up   Efficiency   Karp-Flatt
  4     1.003      25%          99.6%
  10    4.945      49%          11%
  20    9.91       50%          5.4%
  40    19.003     48%          2%
  80    32.373     40%          1.9%
  100   36.74      37%          1.7%
speedup occurs when P = 20, and in Table (6.8) we again see how drastically the lengthening of
the temporal domain affects the strong scaling trend of the implementation. Once again we
see that the Karp-Flatt metric is indeed attaining its minimum at the point where we are
achieving the maximum speedup of our implementation.
It is extremely nice to see that we have found a metric that aids us in tracking where a
given problem will provide optimal performance in the case of strong scaling. The problem is
that, in its current formulation, it is only useful as an a posteriori empirical measurement, so
it is not yet clear how we can make use of it in a predictive way. These results are at least
promising in the sense that this metric may lead us towards an analytical approach for
predicting, a priori, the number of processors required to achieve the maximum performance
for a given problem configuration. More work is required to determine how
fruitful this metric could be for this purpose.
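As a modest example of this a posteriori use, the empirically optimal processor count can be located by minimizing the Karp-Flatt serial fraction over the measured runs (data from Table 6.7):

```python
def karp_flatt(S, P):
    """Karp-Flatt experimentally determined serial fraction."""
    return (1.0 / S - 1.0 / P) / (1.0 - 1.0 / P)

# measured (P, speedup) pairs from Table 6.7 (ROM test case 1)
runs = [(4, 1.96), (5, 2.417), (10, 4.29), (20, 5.96),
        (40, 4.465), (50, 3.723), (100, 1.98)]
best_P, best_S = min(runs, key=lambda r: karp_flatt(r[1], r[0]))
print(best_P, best_S)  # the metric's minimum coincides with the speedup peak
```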
Figure 6.2: Speedup of Parareal/FEM with h = 0.1, ∆t = 0.005, and T = 1
Figure 6.3: Speedup of Parareal/FEM with h = 0.1, ∆t = 0.001, and T = 1
Figure 6.4: Speedup of Parareal/FEM with h = 0.05, ∆t = 0.005, and T = 1
Figure 6.5: Speedup of Parareal/FEM with h = 0.05, ∆t = 0.001, and T = 1
Figure 6.6: Speedup of Parareal/FEM with h = 0.1, ∆t = 0.001, and T = 10
Figure 6.7: Speedup of Parareal/FEM with h = 0.05, ∆t = 0.001, and T = 10
Figure 6.8: Speedup of Parareal/ROM with h = 0.1, ∆t = 0.005, and T = 1
Figure 6.9: Speedup of Parareal/ROM with h = 0.1, ∆t = 0.001, and T = 10
CHAPTER 7
Conclusions and Future Work
7.1 Conclusions
In conclusion, first note that we have clearly demonstrated the potential of the parareal
algorithm to provide significant performance gains when computing nonlinear partial
differential equations, and thus its potential for real time computing
applications. Further, we have shown that by combining the parareal algorithm with model
reduction we do indeed obtain greater performance gains than either method can provide
alone.
Also, we provide a fairly detailed performance analysis of the weak and strong scalability
of the parareal algorithm. The subtle implementation differences are clearly explained so that
one is aware of how to achieve either weak or strong scaling with this algorithm, depending on
which is preferred for a particular application. Demonstrations of each type of scalability
trend have been provided. While we have provided explanations of why each of the two
types of scalability trends behaves either linearly or nonlinearly, only the weak case has been
captured analytically so far. In the literature, the analytic formulas for the speedup
and efficiency of the parareal algorithm, which apply to the case of weak scalability only,
are reported as the speedup and efficiency formulas, and no distinction is made between
implementing a configuration that leads to weak scaling and one that leads to strong
scaling; see [17], for example. We have demonstrated that these analytical results are only
applicable to the case of weak scalability.
A test suite of problems for both the FEM and ROM implementations was employed to
explore the sensitivity of the strong scaling trends to the input parameters of these problems.
The results demonstrate that there is no significant difference in the strong scaling trends of
either implementation, FEM or ROM. We have identified a metric that allows one to track
the performance trend in the case of strong scalability, but only in the sense of an a posteriori
empirical measurement. We believe that this provides a potential avenue of investigation
that may provide the ability to capture the strong scaling trends of the parareal
algorithm analytically.
7.2 Future Work
Scalability Analysis
The holy grail, so to speak, of the strong scalability analysis is to capture
the phenomenon analytically, for the purpose of being able to predict a priori how many
processors will be needed to achieve the maximum speedup possible for a given problem
configuration and application. There is a lot of work left to be done on fully
understanding and quantifying, in particular, the strong scalability trends of the parareal
algorithm. One possibility is to explore what more can be done with the Karp-Flatt metric
to produce an analytic formula that captures these trends.

Also, there is more work to be done in trying to further understand the algorithm's
sensitivity to all of the problem parameters. Recall that for the strong scaling study we
fixed the number of iterations k for each problem, and in the ROM setting we fixed the
number of POD-basis vectors. This leaves some question as to how much these parameters
might affect the performance trends of the algorithm.
Coupling with other Parallel Algorithms
It would be interesting to see how much of a performance gain would result from
coupling this algorithm with other parallel algorithms. Parallelization in space, i.e.
domain decomposition, might have a significant effect on the performance of high fidelity
FEM simulations. Also, the use of parallel linear solvers could be significant in both the
FEM and ROM settings, since the computational complexity of each method tends to be
dominated by solving systems of linear equations. After parallelizing so many routines in
a FEM simulation, for example, the actual assembly procedure for constructing the linear
systems associated with the FEM discretization may quickly become a significant bottleneck
to gaining more performance. There has been some interest in researching techniques to
parallelize the assembly routines within a FEM implementation.
Adaptivity
In this work all of our discretizations were uniform in both the spatial and temporal domains.
It would be interesting to investigate the use of non-uniform discretizations, particularly in
the temporal domain, to see how this might affect the performance of the parareal algorithm.
If non-uniform time stepping in either the coarse grid, the fine grid, or perhaps both does not
have any significant effect on the performance of the algorithm, then this would open the door
to investigating various adaptive methods. Computational time could then be saved by using
adaptive methods to focus the computational effort of the algorithm on regimes that need
a finer resolution or further iterations of the corrector scheme.
Applications
The real point of this work is to explicate the computational aspects of the parareal algorithm
to see if the performance gains are suitable for applications requiring real time computations
of complex nonlinear partial differential equations. It would then be really fun to begin
exploring what types of real-time computing applications in science and engineering would
work well with the parareal algorithm. Some examples that come to mind are real time
control of PDEs, real time design simulations driven by PDEs, interactive physics simulators
for various types of career training, and diagnostic safety systems that must compute PDEs
very quickly to determine the appropriate configuration of a safety mechanism of some sort.

Of course, many people might be interested in simply being able to compute their PDE-
or ODE-driven simulations more quickly, but not necessarily in real time. Exploring the rich
span of application areas to determine which settings are well-suited to the parareal algorithm
is seen as essential and is of great interest.
REFERENCES
[1] G. Bal, On the convergence and the stability of the parareal algorithm to solve partial differential equations, in Proceedings of the 15th International Domain Decomposition Conference, R. Kornhuber, R.H.W. Hoppe, J. Périaux, O. Pironneau, O.B. Widlund, and J. Xu, eds., Springer LNCSE, 2003, pp. 426-432.

[2] G. Bal, Parallelization in time of (stochastic) ordinary differential equations, Math. Meth. Anal. Num. (submitted), 2003. Preprint: www.columbia.edu/~gb2030/PAPERS/ParTimeSODE.ps.

[3] G. Bal and Y. Maday, A parareal time discretization for nonlinear PDEs with applications to the pricing of an American put, Vol. 23 of Lect. Notes Comput. Sci. Eng., Springer, 2002, pp. 189-202.

[4] P.F. Fischer, F. Hecht, and Y. Maday, A parareal in time semi-implicit approximation of the Navier-Stokes equations, in Proceedings of the 15th International Conference on Domain Decomposition Methods, Berlin, Vol. 40, LNCSE series, Springer Verlag, 2004.

[5] M.J. Gander and S. Vandewalle, Analysis of the parareal time-parallel time-integration method, Technical Report TW 443, Katholieke Universiteit Leuven, Department of Computer Science, Belgium, 2005.

[6] W. Gropp, E. Lusk, and A. Skjellum, Using MPI: Portable Parallel Programming with the Message-Passing Interface, MIT Press, Cambridge, 1997.

[7] M. Gunzburger, J. Burkardt, Q. Du, and H.-C. Lee, Reduced-order modeling of complex systems, in Numerical Analysis 2003: Proceedings of the 20th Biennial Conference on Numerical Analysis, University of Dundee, Dundee, 2003, pp. 29-38.

[8] M. Gunzburger and J. Peterson, Reduced-order modeling of complex systems with multiple system parameters, in Large Scale Scientific Computing: 5th International Conference, LSSC 2005, Sozopol, Bulgaria, June 6-10, 2005, Springer, Berlin, 2006, pp. 15-27.

[9] M. Gunzburger, J. Burkardt, and H.-C. Lee, POD and CVT-based reduced-order modeling of Navier-Stokes flows, Comput. Methods Appl. Mech. Engrg. 196, 2006, pp. 337-355.

[10] M. Gunzburger, J. Peterson, and J. Shadid, Reduced-order modeling of time-dependent PDEs with multiple parameters in the boundary data, Comput. Methods Appl. Mech. Engrg. 196, 2007, pp. 1030-1047.

[11] C. Harden and J. Peterson, Combining the parareal algorithm and reduced order modeling for real-time calculations of time dependent PDEs, International Conference on Recent Advances in Scientific Computation, Beijing, China, June 2006.

[12] T.E. Hull, W.H. Enright, B.M. Fellen, and A.E. Sedgwick, Comparing numerical methods for ordinary differential equations, SIAM Journal on Numerical Analysis, Vol. 9, No. 4, Dec. 1972, pp. 603-637.

[13] J.-L. Lions, Y. Maday, and G. Turinici, A parareal in time discretization of PDEs, C.R. Acad. Sci. Paris, Série I, 332, 2001, pp. 661-668.

[14] G.E. Karniadakis and R.M. Kirby II, Parallel Scientific Computing in C++ and MPI: A Seamless Approach to Parallel Algorithms and Their Implementation, Cambridge University Press, 2003.

[15] Y. Maday and G. Turinici, The parareal in time iterative solver: a further direction to parallel implementation, in Proceedings of the 15th International Domain Decomposition Conference, R. Kornhuber, R.H.W. Hoppe, J. Périaux, O. Pironneau, O.B. Widlund, and J. Xu, eds., Springer LNCSE, 2003, pp. 441-448.

[16] Y. Maday and G. Turinici, A parareal in time procedure for the control of partial differential equations, C.R. Acad. Sci. Paris, Ser. I 335, 2002, pp. 387-392.

[17] G.A. Staff, The Parareal Algorithm: A Survey of Present Work, Technical Report, Norwegian University of Science and Technology, Department of Mathematical Sciences, Norway, 2003.

[18] G.A. Staff and E.M. Rønquist, Stability of the parareal algorithm, in Proceedings of the 15th International Domain Decomposition Conference, R. Kornhuber, R.H.W. Hoppe, J. Périaux, O. Pironneau, O.B. Widlund, and J. Xu, eds., Springer LNCSE, 2003, pp. 449-456.

[19] V. Thomée, Galerkin Finite Element Methods for Parabolic Problems, Springer Series in Computational Mathematics, 2006.
BIOGRAPHICAL SKETCH
Christopher R. Harden
Christopher R. Harden was born on the twenty-ninth of December, nineteen seventy-six,
in Jacksonville, Florida. In the spring of 2005, he completed a Bachelor's degree
in Philosophy and a Bachelor's degree in Pure Mathematics at the University of North
Florida. In the fall of 2005, he entered the Ph.D. program in Applied and Computational
Mathematics at Florida State University. At the end of his first semester in the program
he came under the advisement of Prof. Janet Peterson and was supported as a research
assistant by her for the following two semesters in the School of Computational Science.
When the new degree programs in Computational Science were approved, he happily and
excitedly switched into the new program, and upon the defense of this thesis he expects to
be the first to obtain his Master's degree under the new program in Computational Science
towards the end of the 2008 spring semester. He plans to continue on in pursuit of his
Ph.D. in the newly approved Ph.D. program in Computational Science and to continue under
the advisement of Prof. Janet Peterson.
Chris’ research interests include numerical ordinary and partial differential equations,
finite elements, reduced order modeling, general real time computing strategies, parallel
computing and algorithms, the analysis of the performance of parallel algorithms, and code
verification and validation.
Chris currently lives in Tallahassee, FL, with his wife, Jennifer; his son, Youth; their dog,
Eris; and their cat, Arthur.