Lecture 16

Multigrid
Some material from lectures of J. Demmel, UC Berkeley Dept of CS
Multigrid
• Multigrid is a “unite-and-conquer” algorithm for solving elliptic PDEs. Multigrid begins by obtaining (or improving) a solution on an NxN grid by using an (N/2)x(N/2) grid (and proceeding recursively). This strategy, in addition to reducing computational work, shifts the frequency structure of the errors present in the approximate solution.
• For a more detailed mathematical introduction to
the multigrid algorithm, see "A Multigrid Tutorial"
by W. Briggs.
Overview
• Consider a (2^m-1)-by-(2^m-1) grid of unknowns (a convenient number to work with). Adding the nodes on the boundary gives a (2^m+1)-by-(2^m+1) grid, on which the algorithm will operate. We let n=2^m+1.
• Let P(i) denote the problem of solving a discrete Poisson
equation on a (2^i+1)-by-(2^i+1) grid, with (2^i-1)^2
unknowns.
• The problem is specified by the grid size i, the coefficient
matrix T(i), and the right hand side b(i).
• Generate a sequence of related problems P(m), P(m-1),
P(m-2), ... , P(1) on coarser and coarser grids, where the
solution to P(i-1) is a good approximation to the solution
of P(i).
Grids (figure)
Method
• Let b(i) be the right-hand-side of the linear system P(i),
and x(i) an approximate solution to P(i). Thus, x(i) and b(i)
are (2^i-1)-by-(2^i-1) arrays of values at each grid point.
• Next define operators that transfer information between
grids.
– The restriction operator R(i) takes a pair (b(i), x(i)), consisting of a problem P(i) and its approximate solution x(i), and maps it to (b(i-1), x(i-1)), the companion problem on the next coarser grid together with a starting guess at its solution: ( b(i-1), x(i-1) ) = R(i)( b(i), x(i) ). We will see that restriction is implemented simply by computing a weighted average of each grid point value with its nearest neighbors.
Method
– The interpolation operator In(i-1) takes an approximate solution x(i-1) for P(i-1) and converts it to an approximation x(i) for the problem P(i) on the next finer grid: ( b(i), x(i) ) = In(i-1)( b(i-1), x(i-1) ). Its implementation also requires just a weighted average with nearest neighbors.
– The solution operator S(i) takes a problem P(i) and an approximate solution x(i), and computes an improved x(i): x_improved(i) = S(i)( b(i), x(i) ). We often use a variation of Jacobi's method as the solution operator (recall the symmetry of Jacobi, in contrast to, say, Gauss-Seidel).
Restriction (figure)
Interpolation or Prolongation (figure)
V Cycle
• Starting with a problem on a fine grid ( b(i), x(i) ), we want to damp the high frequency error with a smoothing operator and reduce the low frequency error by moving to a coarser grid.
• First take a cut at the high frequency error by smoothing (an approximate solve): x(i) = S(i)( b(i), x(i) ).
• Coarsen the grid: R(i)( b(i), x(i) ).
• Solve the coarser problem recursively: MGV( R(i)( b(i), x(i) ) ).
V Cycle
• Interpolate the coarse grid result back to the fine grid to obtain the correction: d(i) = In(i-1)( MGV( R(i)( b(i), x(i) ) ) ).
• Subtract the correction computed on the coarse grid from the fine grid solution: x(i) = x(i) - d(i).
• Improve the solution further by smoothing: x(i) = S(i)( b(i), x(i) ).
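
The steps above can be written as a short recursive routine. The following Python sketch is only a structural illustration of the V-cycle; the operator arguments (smooth, restrict_problem, interpolate, coarse_solve) are placeholders standing in for S(i), R(i), In(i-1), and the exact solve of P(1), not a fixed API from these notes.

    def mgv(b, x, smooth, restrict_problem, interpolate, coarse_solve, level):
        """One multigrid V-cycle, following the steps listed above.

        smooth(b, x, level)           -> improved x                  (S(i))
        restrict_problem(b, x, level) -> (b_coarse, zero_guess)      (R(i))
        interpolate(d_coarse, level)  -> correction on the fine grid (In(i-1))
        coarse_solve(b, x)            -> exact solution of P(1)
        """
        if level == 1:
            return coarse_solve(b, x)            # one unknown: solve exactly
        x = smooth(b, x, level)                  # damp high-frequency error
        bc, xc = restrict_problem(b, x, level)   # residual, restricted to the coarse grid
        d = interpolate(mgv(bc, xc, smooth, restrict_problem,
                            interpolate, coarse_solve, level - 1), level)
        x = x - d                                # subtract the coarse-grid correction
        return smooth(b, x, level)               # smooth again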
Grids (figure)
V Cycle
• The work at each level is proportional to the number of unknowns on that level, since the value at each grid point is just averaged with its nearest neighbors. So level i will cost (2^i-1)^2 = O(4^i) operations. If the finest grid is at level m, the total serial work will be given by the geometric sum

    sum_{i=1..m} O(4^i) = O(4^m),

which is O( # unknowns ). (The number of levels is m = O( log # unknowns ).)
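
A quick check of the geometric sum (the value m = 10 is arbitrary):

    # sum_{i=1..m} 4^i = (4^(m+1) - 4)/3 <= (4/3)*4^m, hence O(4^m) = O(# unknowns)
    m = 10
    assert sum(4**i for i in range(1, m + 1)) == (4**(m + 1) - 4) // 3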
Full Multigrid
• The full Multigrid algorithm uses the V-cycle as a
building block.
• FMG( b(m), x(m) ) begins by solving the simplest problem P(1) exactly.
• Given a solution x(i-1) of the coarse problem P(i-1), prolongate it to a starting guess x(i) for the next finer problem P(i): In(i-1)( b(i-1), x(i-1) ).
• Solve the finer problem using the multigrid V-cycle with this starting guess: MGV( In(i-1)( b(i-1), x(i-1) ) ).
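
A minimal Python sketch of this loop is below; the arguments (interpolate, coarse_solve, vcycle) are placeholders for In(i-1), the exact P(1) solve, and MGV, not a fixed API.

    def fmg(b_levels, interpolate, coarse_solve, vcycle):
        """Full Multigrid: solve P(1) exactly, then work up to the finest grid,
        using the prolongated coarse solution as the starting guess for one
        V-cycle on each finer problem.

        b_levels[i]       right-hand side b(i) for i = 1..m (index 0 unused)
        interpolate(x, i) In(i-1): coarse solution -> starting guess on grid i
        coarse_solve(b)   exact solve of the one-unknown problem P(1)
        vcycle(b, x, i)   one multigrid V-cycle MGV on grid i
        """
        m = len(b_levels) - 1
        x = coarse_solve(b_levels[1])         # solve the simplest problem exactly
        for i in range(2, m + 1):
            x = interpolate(x, i)             # prolongate to a starting guess for P(i)
            x = vcycle(b_levels[i], x, i)     # improve it with one V-cycle
        return x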
Full Multigrid
• To get a sense of the work, note that there is one V-cycle for each call to MGV in the inner loop of FMG, and on a serial machine the V-cycle starting at level i costs O(4^i) operations. Thus the total serial cost is again O( 4^m ) = O( # unknowns ).
• The serial complexity is optimal, in that a constant
amount of work per unknown is performed.
Analysis of a parallel version is considered later.
1D in detail
• Projection (restriction) and prolongation are simpler in 1D, and it is easier to get a sense of the algorithmic details.
• Consider P(i), the problem on a 1D grid
with 2^i+1 points, with the middle 2^i-1
being the unknowns.
1D version (figure)
MG 1D
• Let T(i) be the coefficient matrix of problem P(i); that is, a scaled (2^i-1)-by-(2^i-1) tridiagonal matrix with 2s on the diagonal and -1s on the off-diagonals.
• The scale factor 4^(-i) accounts for the grid spacing h=2^(-i), since T(i) approximates a (scaled) second derivative on a grid with spacing h.
MG 1D
T(i) = 4^(-i) *
    [  2 -1                ]
    [ -1  2 -1             ]
    [    -1  2 -1          ]
    [      ...  ...  ...   ]
    [           -1  2 -1   ]
    [              -1  2   ]
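
For concreteness, a small NumPy helper (a sketch, not part of the original notes) that builds T(i) exactly as defined above:

    import numpy as np

    def build_T(i):
        """The (2^i - 1)-by-(2^i - 1) coefficient matrix of P(i): 4^(-i) times
        the tridiagonal matrix with 2 on the diagonal and -1 off the diagonal."""
        n = 2**i - 1
        A = 2.0 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)
        return 4.0**(-i) * A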
MG 1D
Recall from Fourier analysis that any function (in this case, the error in our approximate solution) can be considered as a linear combination of basis vectors - here, sine functions of different frequencies. These sine functions are in fact the eigenvectors of the matrix T(i). The following figure shows some of these sine curves when i=5, so T(5) is 31-by-31. The top plot shows the eigenvalues (frequencies) of 4^5*T(5), from lowest to highest. The subsequent plots are the eigenvectors (sine curves), starting with the lowest 4 frequencies, and then a few more frequencies up to the highest (number 31).
MG 1D (figure: eigenvalues and eigenvectors of 4^5*T(5))
MG 1D
• Let v(j) be the j-th eigenvector of T(i), and V the
eigenvector matrix V=[v(1),v(2),...] whose j-th column is
v(j). V is an orthogonal matrix, so any vector x can be
written as a linear combination of the v(j).
• alpha(j) the j-th frequency component of x,
• On the upper half of the frequency domain, multigrid
acting on P(m), damps the error in the solution by the
solution operator S(i). On the next coarser grid, with half
as many points, multigrid damps the upper half of the
remaining frequency components in the error. On the next
coarser grid, the upper half of the remaining frequency
components are damped, and so on, until we solve the
exact (1 unknown) problem P(1).
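
For reference, a standard fact about this tridiagonal matrix (not stated explicitly in these notes): the eigenvalues of the unscaled matrix 4^i*T(i) are 2 - 2*cos(j*pi/2^i), and the j-th eigenvector has components sin(j*k*pi/2^i), k = 1, ..., 2^i-1. The short NumPy check below verifies this for the i=5 case plotted above.

    import numpy as np

    def sine_eigen_check(i=5):
        """Verify that sine vectors are eigenvectors of the unscaled
        tridiagonal matrix, with eigenvalues 2 - 2*cos(j*pi/2^i)."""
        n = 2**i - 1
        A = 2.0 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)
        j = np.arange(1, n + 1)
        lam = 2.0 - 2.0 * np.cos(j * np.pi / 2**i)     # frequencies, low to high
        k = np.arange(1, n + 1)
        V = np.sin(np.outer(k, j) * np.pi / 2**i)      # column j-1 is v(j)
        return np.allclose(A @ V, V * lam)             # A v(j) = lam(j) v(j)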
S(j)
• Here we consider a so-called weighted Jacobi smoothing operator S(i).
• Pure Jacobi would replace the j-th component of the approximate solution x(i) by the weighted average:
    x(i)(j) = .5*( x(i)(j-1) + x(i)(j+1) + 4^i * b(i)(j) )
            = x(i)(j) + .5*( x(i)(j-1) - 2*x(i)(j) + x(i)(j+1) + 4^i * b(i)(j) )
• Weighted Jacobi instead uses
    x(i)(j) = x(i)(j) + w*.5*( x(i)(j-1) - 2*x(i)(j) + x(i)(j+1) + 4^i * b(i)(j) )
S(j)
• Choose the weight w to damp the highest frequencies; w=2/3 is a good choice:
    x(i)(j) = (1/3)*( x(i)(j-1) + x(i)(j) + x(i)(j+1) + 4^i * b(i)(j) )
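
A minimal NumPy sketch of this smoother, following the update formula above (the function name and the handling of the zero boundary values are assumptions, not from these notes):

    import numpy as np

    def weighted_jacobi(x, b, i, w=2.0/3.0, sweeps=1):
        """Weighted-Jacobi sweeps for the 1D problem T(i) x = b.

        x and b hold the 2^i - 1 unknowns; boundary values are taken to be 0.
        Update: x(j) <- x(j) + w*.5*( x(j-1) - 2*x(j) + x(j+1) + 4^i * b(j) ).
        """
        for _ in range(sweeps):
            xp = np.concatenate(([0.0], x, [0.0]))  # pad with zero boundary values
            x = x + w * 0.5 * (xp[:-2] - 2.0 * xp[1:-1] + xp[2:] + 4.0**i * b)
        return x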
S(j)
• Take i=6, so there are 2^6-1=63 unknowns. In the
figures to follow, the true solution is a sine curve,
shown as a dotted line in the leftmost plot in each
row. The approximate solution is a solid line in the
same plot. The middle plot shows the error alone.
The rightmost plot shows the frequency
components of the error.
• Initially, the norm of the error decreases rapidly, from 1.78 to 1.11, but then decays more gradually, because there is little error left in the high frequencies to damp. Thus, it only makes sense to do 1 or perhaps 2 iterations of S(i) at a time.
Errors 1 (figure)
Errors 2 (figure)
Errors 3 (figure)
R(i)
• The restriction operator R(i) takes a problem P(i) with
right-hand-side b(i) and an approximate solution x(i), and
maps it to a simpler problem P(i-1) with right-hand-side
b(i-1) and approximate solution x(i-1).
• Let r(i) be the residual of the approximate solution x(i):
r(i) = T(i)*x(i) - b(i)
• If x(i) were the exact solution, r(i) would be zero.
• If we could solve the equation T(i) * d(i) = r(i) exactly for
the correction d(i), then x(i)-d(i) would be the solution we
seek, because
T(i) * ( x(i)-d(i) ) = T(i)*x(i) - T(i)*d(i)
= (r(i) + b(i)) - (r(i)) = b(i)
R(i)
Thus, to apply R(i) to ( b(i), x(i) ) and get P(i-1):
• compute the residual r(i),
• restrict it to the next coarser grid to get b(i-1) = r(i-1), and
• let the initial guess x(i-1) be all zeros.
R(i)
• Thus, P(i-1) will be the problem of solving T(i-1)*d(i-1) = r(i-1) for the correction d(i-1), starting with the initial guess x(i-1) of all zeros.
• The restriction is accomplished by averaging nearest neighbors of r(i) on the fine grid (with 2^i-1 unknowns) to get an approximation on the coarse grid (with 2^(i-1)-1 unknowns): the value at a coarse grid point is .5 times the value at the corresponding fine grid point, plus .25 times each of that fine grid point's neighbors.
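
A NumPy sketch of R(i) with these weights (the function names and the array layout, holding only the interior unknowns, are assumptions):

    import numpy as np

    def restrict_residual(r):
        """1D full weighting: each coarse value is .5 times the corresponding
        fine value plus .25 times each of its two fine-grid neighbors.
        Input has 2^i - 1 entries; output has 2^(i-1) - 1 entries."""
        return 0.25 * r[0:-2:2] + 0.5 * r[1:-1:2] + 0.25 * r[2::2]

    def apply_R(T, b, x):
        """Apply R(i): form the residual r = T x - b, restrict it to get the
        coarse right-hand side, and start from a zero coarse-grid guess."""
        r = T @ x - b
        return restrict_residual(r), np.zeros((len(r) - 1) // 2)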
In(i-1)
• The interpolation operator In(i-1) takes the
solution of the residual equation
T(i-1)*x(i-1) = r(i-1)
and uses it to correct the fine grid
approximation x(i).
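
A NumPy sketch of the corresponding 1D interpolation (standard linear interpolation by nearest-neighbor averaging; the function name is an assumption):

    import numpy as np

    def interpolate_correction(dc):
        """Prolongate a coarse-grid correction (2^(i-1) - 1 entries) to the fine
        grid (2^i - 1 entries): copy values at coincident points and average the
        two coarse neighbors at in-between points (boundary values are 0)."""
        nf = 2 * len(dc) + 1
        d = np.zeros(nf)
        d[1::2] = dc                              # fine points coinciding with coarse points
        dcp = np.concatenate(([0.0], dc, [0.0]))
        d[0::2] = 0.5 * (dcp[:-1] + dcp[1:])      # in-between fine points
        return d

The resulting correction d is then used to update the fine grid approximation: x(i) = x(i) - d.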
Errors again
• The next graphs summarize the convergence of
Multigrid in two ways. The graph on the right
shows the residual
residual(k) = norm(T(i)x(i)-b(i))
after k iterations of Full Multigrid (FMG). The
graph on the left shows the ratio of consecutive
residuals, showing how the error (residual) is
decreased by a factor bounded away from 1 at
each step (for less smooth solutions, the residual
will tend to decrease by a factor closer to .5 at
each step.)
Error again (figure)
Parallel Multigrid
• Look more carefully at a parallel version of
multigrid in 2D, including communication costs.
• Multigrid requires each grid point to be updated
depending on as many as 8 neighbors (those to the
N, E, S, W, NW, SW, SE and NE). This will
determine communication costs.
• Assume an n-by-n grid of data, where n=2^m+1, and that p=s^2 is a perfect square.
• Assume the data is laid out on an s-by-s grid of processors, with each processor owning an ((n-1)/s)-by-((n-1)/s) subgrid.
Parallel Multigrid
• This is illustrated in the following diagram by the
pink dotted line, which encircles the grid points
owned by the corresponding processor. The grid
points in the top processor row have been labeled
(and color coded) by the grid number i of the
problem P(i) in which they participate. There is
exactly one point labeled 2 per processor. The only
grid point in P(1) with a nonboundary value is
owned by the processor above the one with the
pink dotted line. In general, if p>=16, there will be .5*log_2(p)-1 grids P(i) in which fewer than all processors own participating grid points.
Parallel MG
• The processor indicated by the pink dotted line requires
certain grid point data from its neighbors to execute the
algorithm. For example, to update its own (blue) grid
points for P(5), the processor requires 8 blue grid point
values from its N, S, E, and W neighbors, as well as single
blue grid point values from its NW, SW, SE and NE
neighbors. Similarly, updating the (red) grid points for P(4)
requires 4 red grid point values from the N, S, E and W
neighbors, and one each from the NW, SW, SE and NE
neighbors. This pattern continues until each processor has
only one grid point. After this, only some processors
participate in the computation, requiring one value each
from 8 other processors.
Parallel MG
• From this, we can do a simple complexity analysis of the
communication costs. For simplicity, let p=2^(2*k), so
each processor owns a 2^(m-k)-by-2^(m-k) subgrid.
Consider a V-cycle starting at level m. Then in levels m
through k of the V-cycle, each processor will own at least
one grid point. After that, in levels k-1 through 1, fewer
than all processors will own an active grid point, and so
some processors will remain idle. Let f be the time per
flop, alpha the time for a message startup, and beta the
time per word to send a message.
Parallel MG
At level m >= i >= k, the time in the V-cycle will be

Time at level i = O( 4^(i-k) ) * f     ... number of flops, proportional to the number of grid points per processor
                + O( 1 ) * alpha       ... send a constant number of messages to neighbors
                + O( 2^(i-k) ) * beta  ... number of words sent
Parallel MG
Summing all these terms from i=k to m yields
Time at levels k to m =
O( 4^(m-k) )*f + O( m-k )*alpha
+ O( 2^(m-k) )*beta
= O( n^2/p )*f + O( log(n/p) )*alpha
+ O( n/sqrt(p) )*beta
Parallel MG
• On levels k >= i >= 1, this advantage disappears, because each processor needs to communicate about as many words as it performs floating point operations:

Time at level i = O( 1 ) * f      ... number of flops, proportional to the number of grid points per processor
                + O( 1 ) * alpha  ... send a constant number of messages to neighbors
                + O( 1 ) * beta   ... number of words sent
Parallel MG
Summing all these terms yields
Time at levels 1 to k-1 = O( k-1 )*f
+ O( k-1 )*alpha + O( k-1 )*beta
= O( log(p) )*f + O( log(p) )*alpha
+ O( log(p) )*beta
Parallel MG
The total time for a V-cycle starting at the finest
level is therefore
Time = O( n^2/p + log(p) )*f
+ O( log(n) )*alpha
+ O( n/sqrt(p) + log(p) )*beta
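
A small sketch that sums the per-level cost model above, with every O(.) constant taken to be 1 (an assumption made only for illustration):

    def vcycle_time(m, k, f, alpha, beta):
        """Model V-cycle time on p = 4^k processors for an n = 2^m + 1 grid:
        levels m..k cost 4^(i-k)*f + alpha + 2^(i-k)*beta each,
        levels k-1..1 cost f + alpha + beta each."""
        t = sum(4**(i - k) * f + alpha + 2**(i - k) * beta for i in range(k, m + 1))
        t += (k - 1) * (f + alpha + beta)
        return t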
Parallel MG
For Full Multigrid, assuming n^2 >= p, so that each
processor has at least 1 grid point:
Time = O( N/p + log(p)*log(N) )*f
+ O( (log(N))^2 )*alpha
+ O( sqrt(N/p) + log(p)*log(N) )*beta
where N=n^2 is the number of unknowns. When N >> p, the floating point term O( N/p + log(p)*log(N) ) gives nearly perfect speedup over the serial work O( N ); when p=N (the PRAM model), the running time reduces to O( log(N)^2 ).
Parallel MG
• In practice, the time spent on the coarsest
grids, when each processor has one or zero
active grid points, while negligible on a
serial machine, is significant on a parallel
machine, enough so to make the efficiency
noticeably lower.