Geosci. Model Dev., 6, 1353–1365, 2013
www.geosci-model-dev.net/6/1353/2013/
doi:10.5194/gmd-6-1353-2013
© Author(s) 2013. CC Attribution 3.0 License.
Parallel algorithms for planar and spherical Delaunay construction
with an application to centroidal Voronoi tessellations
D. W. Jacobsen1,2, M. Gunzburger1, T. Ringler2, J. Burkardt1, and J. Peterson1
1 Department of Scientific Computing, Florida State University, Tallahassee, FL 32306, USA
2 Theoretical Division, Los Alamos National Laboratory, Los Alamos, NM 87545, USA
Received: 5 February 2013 – Published in Geosci. Model Dev. Discuss.: 22 February 2013
Revised: 18 June 2013 – Accepted: 2 July 2013 – Published: 30 August 2013
Abstract. A new algorithm, featuring overlapping domain
decompositions, for the parallel construction of Delaunay
and Voronoi tessellations is developed. Overlapping allows
for the seamless stitching of the partial pieces of the global
Delaunay tessellations constructed by individual processors.
The algorithm is then modified, by the addition of stereographic projections, to handle the parallel construction of
spherical Delaunay and Voronoi tessellations. The algorithms are then embedded into algorithms for the parallel construction of planar and spherical centroidal Voronoi
tessellations that require multiple constructions of Delaunay tessellations. This combination of overlapping domain
decompositions with stereographic projections provides a
unique algorithm for the construction of spherical meshes
that can be used in climate simulations. Computational tests
are used to demonstrate the efficiency and scalability of the
algorithms for spherical Delaunay and centroidal Voronoi
tessellations. Compared to serial versions of the algorithm
and to STRIPACK-based approaches, the new parallel algorithm results in speedups for the construction of spherical
centroidal Voronoi tessellations and spherical Delaunay triangulations.
Correspondence to: D. W. Jacobsen ([email protected])
1 Introduction

Voronoi diagrams and their dual Delaunay tessellations have become, in many settings, natural choices for spatial gridding due to their ability to handle arbitrary boundaries and refinement well. Such grids are used in a wide range of applications and can, in principle, be created for almost any geometry in two and higher dimensions. The recent trend towards the exascale in most aspects of high-performance computing further demands fast algorithms for the generation of high-resolution spatial meshes that are also of high quality, e.g. featuring variable resolution with smooth transition regions; otherwise, the meshing part can dominate the rest of the discretization and solution processes. However, creating such meshes can be time consuming, especially for high-quality, high-resolution meshing. Attempts to speed up the generation of Delaunay tessellations via parallel divide-and-conquer algorithms were made in, e.g. Cignoni et al. (1998) and Chernikov and Chrisochoides (2008); however, these approaches are restricted to two-dimensional planar surfaces. With the current need for high-resolution, variable-resolution grid generators in mind, we first develop a new algorithm that makes use of a novel approach to domain decomposition for the fast, parallel generation of Delaunay and Voronoi grids in Euclidean domains.
Centroidal Voronoi tessellations provide one approach
for high-quality Delaunay and Voronoi grid generation (Du
et al., 1999, 2003b, 2006b; Du and Gunzburger, 2002;
Nguyen et al., 2008). The efficient construction of such grids
involves an iterative process (Du et al., 1999) that calls for the
determination of multiple Delaunay tessellations; we show
how our new parallel Delaunay algorithm is especially useful in this context.
Climate modelling is a specific field which has recently
begun adopting Voronoi tessellations as well as triangular
meshes for the spatial discretization of partial differential
equations (Pain et al., 2005; Weller et al., 2009; Ju et al.,
2008, 2011; Ringler et al., 2011). As this is a special interest of ours, we also develop a new parallel algorithm for
the generation of Voronoi and Delaunay tessellations for the
entire sphere or some subregion of interest. The algorithm
uses stereographic projections, similar to Lambrechts et al.
(2008), to transform these tasks into planar Delaunay constructions for which we apply the new parallel algorithm
we have developed for that purpose. Spherical centroidal
Voronoi tessellation (SCVT) based grids are especially desirable in climate modelling because they not only provide
for precise grid refinement, but also feature smooth transition regions (Ringler et al., 2011, 2013). We show how such
grids can be generated using the new parallel algorithm for
spherical Delaunay tessellations.
The paper is organized in the following fashion. In Sect. 2.1, we present our new parallel algorithm for the construction of planar Delaunay and Voronoi tessellations. In Sect. 2.2, we provide a brief
discussion of centroidal Voronoi tessellations (CVTs) followed by showing how the new parallel Delaunay tessellation algorithm can be incorporated into an efficient method
for the construction of CVTs. In Sect. 3.1, we provide a short
review of stereographic projections followed, in Sect. 3.2,
by a presentation of the new parallel algorithm for the construction of spherical Delaunay and Voronoi tessellations. In
Sect. 3.3, we consider the parallel construction of spherical
CVTs. In Sect. 4, the results of numerical experiments and
demonstrations of the new algorithms are given; for the sake
of brevity, we only consider the algorithms for grid generation on the sphere. Finally, in Sect. 5, concluding remarks are
provided.
2 Parallel Delaunay and Voronoi tessellation construction in R^k

In this section, we present a new method for the construction of Delaunay and Voronoi tessellations. The algorithms are first presented in R^2 and later extended to R^3.
2.1 Parallel algorithm for Delaunay tessellation construction
The construction of planar Delaunay triangulations in parallel has been of interest for several years; see, e.g. Amato and
Preparata (1993); Cignoni et al. (1998); Zhou et al. (2001);
Batista et al. (2010). Typically, such algorithms divide the
point set up into several smaller subsets, each of which can
then be triangulated independently from the others. The resulting triangulations need to be stitched together to form
a global triangulation. This stitching, or merge step, is typically computed serially because one may need to modify significant portions of the individual triangulations. The merge
step is the main difference between the different parallel algorithms. Here, we provide a new alternative merge step that,
because no modifications of the individual triangulations are
needed, can be performed in parallel.
We are given a set of points P = {x_j}_{j=1}^n in a given domain Ω ⊂ R^k. We begin by covering Ω by a set S = {S_k(c_k, r_k)}_{k=1}^N of N overlapping spheres S_k, each of which is defined by a centre point c_k and radius r_k. For each sphere S_k, a connectivity or neighbour list is defined which consists of the indices of all the spheres S_i ∈ S, i ≠ k, that overlap with S_k. From the given point set P, we then create N smaller point sets P_k, k = 1, ..., N, each of which consists of the points in P which are in the sphere S_k. Due to the overlap of the spheres S_k, a point in P may belong to multiple point sets P_k. This defines what we refer to as our sorting method. This sort method is the simplest to understand, but it creates load-balancing problems on variable-resolution meshes. For that reason, we use an alternative sort method in practice, which is described in Sect. 2.1.1.
The next step is to construct the N Delaunay tessellations
Tk , k = 1, . . . , N, of the N point sets Pk , k = 1, . . . , N . For
this purpose, one can use any Delaunay tessellation method
at one’s disposal; for example, in the plane, one can use the
Delaunay triangulator that is part of the Triangle software of
Shewchuk (1996). Note that such triangulation software packages aid only in the computation of the triangulation; for example, they cannot be used to determine which points should be triangulated. These Delaunay tessellations, of course, can be constructed completely in parallel
after assigning one point set Pk to each of N processors.
At this point, we are almost but not quite ready to merge
the N local Delaunay tessellations into a single global one.
Before doing so, we deal with the fact that although, by construction, the tessellation Tk of the point set Pk is a Delaunay
tessellation of Pk , there are very likely to be simplices in the
tessellation Tk that may not be globally Delaunay, i.e. that
may not be Delaunay with respect to points in P that are
not in Pk . This follows because the points in any Tk are unaware of the triangles and points outside of Sk so that there
may be points in P that are not in P_k that lie within the circumsphere of a simplex in T_k. In fact, only simplices whose circumspheres are completely contained inside of the ball S_k are guaranteed to be globally Delaunay; such simplices satisfy the criterion

||c_k − ĉ_i|| + r̂_i < r_k,   (1)

where ĉ_i and r̂_i are the centre and radius of the circumsphere of the i-th triangle in the local Delaunay tessellation T_k.
So, at this point, for each k = 1, ..., N, we construct the triangulation T̂_k by discarding all simplices in the local Delaunay tessellation T_k whose circumspheres are not completely contained in S_k, i.e. all simplices that do not satisfy Eq. (1). Figure 1 shows one of the local Delaunay triangulations T_k and the triangulation T̂_k after the deletion of triangles that are not guaranteed to satisfy the Delaunay property globally.
The final step is to merge the N modified tessellations T̂_k, k = 1, ..., N, that have been constructed completely in parallel into a single global tessellation. The key observation here is that each regional tessellation T̂_k is now exactly a portion of the global Delaunay tessellation because if two local tessellations T̂_i and T̂_k overlap, they must coincide wherever they overlap. This follows from the uniqueness property
of the Delaunay tessellations. Thus, the union of the local Delaunay tessellations is the global Delaunay tessellation; by using an overlapping domain decomposition, the stitching of the regional Delaunay tessellations into a global one is transparent. Of course, some bookkeeping chores have to be done, such as rewriting simplex information in terms of global point indices and counting overlapping simplices only once.

Fig. 1. A local Delaunay triangulation T_k (left) and the triangulation T̂_k (right) after the deletion of triangles that are not guaranteed to satisfy the Delaunay property globally.

In summary, the algorithm for the construction of a Delaunay tessellation in parallel consists of the following steps (a minimal code sketch follows the list):

– define overlapping balls S_k, k = 1, ..., N, that cover the given region Ω;

– sort the given point set P into the subsets P_k, k = 1, ..., N, each containing points in P that are in S_k;

– for k = 1, ..., N, construct in parallel the N Delaunay tessellations T_k of the point sets P_k;

– for k = 1, ..., N, construct in parallel the tessellation T̂_k by removing from T_k all simplices that are not guaranteed to satisfy the circumsphere property;

– construct the Delaunay tessellation of P as the union of the N modified Delaunay tessellations {T̂_k}_{k=1}^N.
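To make these steps concrete, here is a minimal serial sketch in Python (our own illustration, assuming NumPy and SciPy; the authors' implementation instead uses the Triangle library and MPI). Each pass through the loop corresponds to the work one processor would do independently:

```python
import numpy as np
from scipy.spatial import Delaunay

def circumcircle(a, b, c):
    """Circumcentre and circumradius of the planar triangle (a, b, c)."""
    d = 2.0 * (a[0]*(b[1]-c[1]) + b[0]*(c[1]-a[1]) + c[0]*(a[1]-b[1]))
    ux = ((a@a)*(b[1]-c[1]) + (b@b)*(c[1]-a[1]) + (c@c)*(a[1]-b[1])) / d
    uy = ((a@a)*(c[0]-b[0]) + (b@b)*(a[0]-c[0]) + (c@c)*(b[0]-a[0])) / d
    centre = np.array([ux, uy])
    return centre, np.linalg.norm(centre - a)

def parallel_delaunay(points, centres, radii):
    """Union of the locally Delaunay triangles that pass the Eq. (1) test."""
    kept = set()
    for c_k, r_k in zip(centres, radii):
        # distance sort: indices of points inside the ball S_k(c_k, r_k)
        members = np.flatnonzero(np.linalg.norm(points - c_k, axis=1) < r_k)
        if len(members) < 3:
            continue
        for tri in Delaunay(points[members]).simplices:
            g = members[tri]                          # global point indices
            cc, rr = circumcircle(*points[g])
            if np.linalg.norm(c_k - cc) + rr < r_k:   # Eq. (1) filter
                kept.add(tuple(sorted(g)))            # overlaps counted once
    return kept
```

Because each kept triangle is globally Delaunay, the set union itself implements the transparent merge; no stitching pass over the regional triangulations is required.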
Once a Delaunay tessellation is determined, it is an easy matter, at least for domains in R^2 and R^3, to construct the dual Voronoi tessellation; see, e.g. Okabe et al. (2000).

2.1.1 Methods for decomposing global point set

In Sect. 2.1, an initial method was presented for distributing points between regions. We refer to this method as a distance sort. The distance between a point and a region centre is computed, and if the distance is less than the region's radius the point is considered for triangulation. The key difficulty for this sort method is determining a radius which provides enough overlap without including significantly more points than are required. This is an even greater challenge when considering variable resolution meshes and regions that straddle both fine and coarse areas.

In order to properly sort the points, the overlap of the regions needs to be large enough to satisfy the following three requirements:

– all points contained in a region's Voronoi cell are not vertices of any non-Delaunay triangles;

– the resulting triangulation includes all Delaunay triangles that have owned points as vertices;

– the point set defines all triangles whose circumcircles fall in the region's Voronoi cell.

If all three of these are satisfied, the point set can be used in any of the algorithms described in this paper.

Whereas we have yet to describe a method for the choice of region radius, let us first begin by describing the two extremes. First, consider a radius that encompasses the entire domain. This obviously provides enough overlap to guarantee that the three conditions are satisfied. Second, consider a radius that is equivalent to the edges of the region's Voronoi cell, and no larger. In this case, the overlap is obviously insufficient to satisfy the three conditions. This can be seen if we assume at least one non-Delaunay triangle exists at the boundary of the Voronoi cell, in which case both criteria 1 and 2 are not met.

In practice, choosing the radii {r_k}_{k=1}^N to be equal to the maximum distance from the centre c_k of each ball S_k to the centres of all its adjacent balls allows enough overlap in quasi-uniform cases. However, this method still fails to provide enough overlap in certain cases. All of the failures occur when the global point set does not include enough points for the decomposition of choice. Before discussing the target number of points for a given decomposition, an alternative sorting method is introduced.

In an attempt to provide better load balancing in general for arbitrary density functions, we also developed an
alternative sorting method. We refer to this sorting method as
the Voronoi sort. In this case, the method for sorting points
makes use of the Voronoi property that all points contained
within a Voronoi cell are closer to that cell’s generator than
any other generator in the set. It is important to note that
the coarse Voronoi cells used for domain decomposition here
are not the same as Voronoi cells related to the Delaunay
triangulation we are attempting to construct. To begin, the
point set is divided into the coarse Voronoi cells defined by the region's centre and its neighbouring regions' centres. The union of the owned region's point set and its immediately adjacent regions' point sets defines the final set of points for triangulation, as sketched below.
As with the distance sort, the Voronoi sort meets all of the
criteria for use in the described algorithms. However, it can
also still fail in certain cases. Both sort methods can fail if the
target point set is too small for the target decomposition. In
this case, one or more regions do not include enough points
to correctly compute a triangulation through any method. In
practice, these algorithms fail if the decomposition includes more than two regions N and the point set contains fewer than 16N points, i.e. two bisections. For example, a 12-processor decomposition cannot be used with a target of 42 points, but it can be used with a target of 162 points.
In general, the decomposition should be created using the
target density function to provide optimal load balancing, and
to improve robustness of the algorithm. Results for these sort
methods are presented in Sect. 4.2.2.
2.2 Application to the construction of centroidal Voronoi and Delaunay tessellations
Given a Voronoi tessellation V = {V_j}_{j=1}^n of a bounded region Ω ⊂ R^k corresponding to the set of generators P = {x_j}_{j=1}^n, and given a nonnegative function ρ(x) defined over Ω, referred to as the point-density function, we can define the centre of mass or centroid x_j^* of each Voronoi region as

x_j^* = ( ∫_{V_j} x ρ(x) dx ) / ( ∫_{V_j} ρ(x) dx ),   j = 1, ..., n.   (2)

In general, x_j ≠ x_j^*, i.e. the generators of the Voronoi cells do not coincide with the centres of mass of those cells. The special case for which x_j = x_j^* for all j = 1, ..., n, i.e. for which all the generators coincide with the centres of mass of their Voronoi regions, is referred to as a centroidal Voronoi tessellation (CVT).
Given Ω, n, and ρ(x), a CVT of Ω must be constructed.
The simplest means for doing so is Lloyd’s method (Lloyd,
1982) in which, starting with an initial set of n distinct generators, one constructs the corresponding Voronoi tessellation,
then determines the centroid of each of the Voronoi cells,
and then moves each generator to the centroid of its Voronoi
cell. These steps are repeated until satisfactory convergence
is achieved, e.g. until the movement of generators falls below a prescribed tolerance. The convergence properties of
Lloyd’s method are rigorously studied in Du et al. (2006a).
The point-density function ρ plays a crucial role in how
the converged generators are distributed and the relative sizes
of the corresponding Voronoi cells. If we arbitrarily select
two Voronoi cells Vi and Vj from a CVT, their grid spacing
and density are related as
h_i / h_j ≈ ( ρ(x_j) / ρ(x_i) )^{1/(d+2)},   (3)
where h_i denotes a measure of the linear dimension, e.g. the diameter, of the cell V_i, and x_i denotes a point, e.g. the generator, in V_i. Thus, the point-density function can be used
to produce nonuniform grids in either a passive or adaptive
manner by prescribing a ρ or by connecting ρ to an error indicator, respectively. Although the relation Eq. (3) is at the
present time a conjecture, its validity has been demonstrated
through many numerical studies; see, e.g. Du et al. (1999);
Ringler et al. (2011). CVTs and their dual Delaunay tessellations have been successfully used for the generation of
high-quality nonuniform grids; see, e.g. Du and Gunzburger
(2002); Du et al. (2003b, 2006b); Ju et al. (2011); Nguyen
et al. (2008).
As defined above, every iteration of Lloyd’s method requires the construction of the Voronoi tessellation of the current set of generators followed by the determination of the
centroids of the Voronoi cells. However, the construction of
the Voronoi tessellation can be avoided until after the iteration has converged and instead, the iterative process can
be carried out with only Delaunay tessellation constructions.
Thus, the first task within every iteration of Lloyd’s method
is to construct the Delaunay tessellation of the current set of
generators. The parallel algorithm for the generation of Delaunay tessellations given in Sect. 2.1 is thus especially useful in reducing the costs of the multiple tessellations needed
in CVT construction.
After the Delaunay tessellation of the current generators is
computed, every Voronoi cell centre of mass must be computed by integration, so its generator can be replaced by the
centre of mass. Superficially, it seems that we cannot avoid
constructing the Voronoi tessellation to do this. However, it
is easy to see that one does not actually need the Voronoi
tessellation and instead one can determine the needed centroids from the Delaunay tessellation. For simplicity, we restrict this discussion to tessellations in R2 . Each triangle in
a Delaunay triangulation contributes to the integration over
three different Voronoi cells. The triangle is split into three
kites, each made up of two edge midpoints, the triangle circumcentre, and a vertex of the triangle. Each kite is part of
the Voronoi cell whose generator is located at the triangle
vertex associated with the kite. Integrating over each kite and
updating a portion of the integrals in the centroid formula
Eq. (2) allows one to only use the Delaunay triangulation to
determine the centroids of the Voronoi cells. Thus, because
both steps within each Lloyd iteration can be performed using only the Delaunay triangulation, determining the generators of a CVT does not require construction of any Voronoi
tessellations nor does it require any mesh connectivity information. If one is interested in the CVT, one need only construct a single Voronoi tessellation after the Lloyd iteration
has converged; if one is interested in the Delaunay tessellation corresponding to the CVT, then no Voronoi construction
is needed.
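As an illustration, the following sketch performs one Lloyd update in R^2 directly from the Delaunay triangulation via the kite decomposition just described. It reuses the circumcircle helper from the sketch in Sect. 2.1 and substitutes a one-point quadrature per kite half; this quadrature is our simplification, not the integration scheme actually used by MPI-SCVT:

```python
import numpy as np
from scipy.spatial import Delaunay

def lloyd_step(points, rho):
    """One Lloyd iteration via kites (sketch; the signed areas needed when a
    circumcentre falls outside an obtuse triangle are ignored for brevity)."""
    mass = np.zeros(len(points))
    moment = np.zeros_like(points)
    # the kite of vertex v uses the midpoints of the two edges touching v
    kites = (((0, 1), (0, 2)), ((0, 1), (1, 2)), ((1, 2), (0, 2)))
    for simplex in Delaunay(points).simplices:
        verts = points[simplex]
        u, _ = circumcircle(*verts)                  # triangle circumcentre
        mid = lambda e: 0.5 * (verts[e[0]] + verts[e[1]])
        for v, (e1, e2) in enumerate(kites):
            # split the kite (vertex, midpoint, circumcentre, midpoint) in two
            for a, b, c in ((verts[v], mid(e1), u), (verts[v], u, mid(e2))):
                area = 0.5 * abs((b[0]-a[0])*(c[1]-a[1])
                                 - (b[1]-a[1])*(c[0]-a[0]))
                cen = (a + b + c) / 3.0
                w = rho(cen) * area                  # one-point quadrature
                mass[simplex[v]] += w                # denominator of Eq. (2)
                moment[simplex[v]] += w * cen        # numerator of Eq. (2)
    return moment / mass[:, None]                    # generators -> centroids
```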
To make this algorithm parallel, one can compute the
Voronoi cell centroids using the local Delaunay tessellations
and not on the stitched-together global one. However, because the local Delaunay tessellations overlap, one has to ensure that each generator is only updated by one region, i.e. by
only one processor. This can be done using one of a variety
of domain decomposition methods. We use a coarse Voronoi
diagram corresponding to the region centres ck . Each processor only updates the generators that are inside of its coarse
Voronoi cell. Because Voronoi cells are non-overlapping,
each generator will only get updated by one processor. After all the generators are updated, each coarse region needs
to transfer its newly updated points only to its adjacent regions and not to all active processors. This limits each processor’s communications to roughly six sends and receives,
regardless of the total number of processors used.
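Schematically, this neighbour-only exchange might look as follows in mpi4py (a fragment of our own, not taken from MPI-SCVT; the message count stays near six per processor however many processors run):

```python
from mpi4py import MPI

def exchange_with_neighbours(comm, my_updates, neighbours):
    """Send updated generators to adjacent regions only, and receive theirs."""
    reqs = [comm.isend(my_updates, dest=r) for r in neighbours]
    received = {r: comm.recv(source=r) for r in neighbours}
    for req in reqs:
        req.wait()
    return received
```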
We use two metrics to check for the convergence of the Lloyd iteration, namely the ℓ_2 and ℓ_∞ norms of generator movement given by

ℓ_2 norm = [ (1/n) Σ_{j=1}^{n} |x_j^{old} − x_j^{new}|^2 ]^{1/2}
and

ℓ_∞ norm = max_{j=1,...,n} |x_j^{old} − x_j^{new}|,

respectively; these norms are compared with a given tolerance. Here, old and new refer to the previous and current Lloyd iterates. If either norm falls below the tolerance, the iterative process is deemed to have converged. The ℓ_∞ norm is more strict, but both norms follow similar convergence paths when plotted against the iteration number.

In summary, the algorithm for the construction of a CVT in parallel consists of the following steps; the steps in Roman font are the same as in the algorithm of Sect. 2.1, whereas the italicized steps, as well as the do loop, are additions for CVT construction (an end-to-end sketch follows the list):

– define overlapping balls that cover the given region;

– while not converged do

  – sort the current point set;

  – construct in parallel the local Delaunay tessellations;

  – remove from the Delaunay tessellations simplices which are not guaranteed to satisfy the circumsphere property;

  – in parallel, determine the centroids of the Voronoi cells by integrating over simplices;

  – move each generator to the corresponding cell centroid;

  – test the convergence criteria;

  – communicate new generator positions to neighbouring balls;

– end

– construct the Delaunay tessellation corresponding to the CVT as the union of the modified local Delaunay tessellations.
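Tying the earlier sketches together, a serial stand-in for this loop, with converged() implementing the two movement norms just defined, might read:

```python
import numpy as np

def converged(x_old, x_new, tol):
    move = np.linalg.norm(x_old - x_new, axis=1)   # per-generator movement
    l2 = np.sqrt(np.mean(move**2))                 # l2 norm of movement
    linf = move.max()                              # l-infinity norm
    return l2 < tol or linf < tol

def cvt(points, rho, tol=1e-6, max_iter=10000):
    """Lloyd's method using lloyd_step from the earlier sketch (serial)."""
    for _ in range(max_iter):
        new_points = lloyd_step(points, rho)
        if converged(points, new_points, tol):
            return new_points
        points = new_points
    return points
```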
A variety of point sets can be used to initiate Lloyd’s
method for CVT construction. The obvious one is Monte
Carlo points (Metropolis and Ulam, 1949). These can either
be sampled uniformly over  or sampled according to the
point-density function ρ(x). The latter approach usually reduces the number of iterations required for convergence. One
can instead use a bisection method to build fine grids from
a coarse grid (Heikes and Randall, 1995). To create a bisection grid, a coarse CVT is constructed using as few points
as possible. After this coarse grid is converged, in R2 , one
would add the midpoint of every Voronoi cell edge or Delaunay triangle edge to the set of points. This causes the overall
grid spacing to be reduced by roughly a factor of two in every
cell so that the refined point set is roughly four times larger.
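One bisection pass in R^2 can be sketched as follows (adding the midpoint of every Delaunay edge, which roughly quadruples the point count; a sketch only, under the same NumPy/SciPy assumptions as above):

```python
import numpy as np
from itertools import combinations
from scipy.spatial import Delaunay

def bisect(points):
    edges = {tuple(sorted(e))
             for s in Delaunay(points).simplices
             for e in combinations(s, 2)}          # unique Delaunay edges
    mids = np.array([(points[i] + points[j]) / 2.0 for i, j in edges])
    return np.vstack([points, mids])
```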
3 Parallel Delaunay and Voronoi tessellation construction on the sphere
Replacing the planar constructs of Delaunay triangulations
and Voronoi tessellations with their analogous components
on the sphere, i.e. on the surface of a ball in R^3, creates the spherical versions of Delaunay triangulations and Voronoi tessellations. The Euclidean distance metric is replaced by the geodesic distance, i.e. the shorter of the two lengths
along the great circle joining two points on the sphere. Triangles and Voronoi cells are replaced with spherical triangles
and spherical Voronoi cells whose boundaries are geodesic
arcs. To develop a parallel Delaunay and Voronoi tessellation construction algorithm on the sphere, we could try to
parallelize the STRIPACK algorithm of Renka (1997). Here,
however, we take a new approach that is based on a combination of stereographic projections (which we briefly discuss
in Sect. 3.1) and the algorithm of Sect. 2.1 applied to planar
domains. The serial version of the new Delaunay tessellation
construction algorithm is therefore novel as well.
3.1 Stereographic projections

Stereographic projections are special mappings between the surface of a sphere and a plane tangent to the sphere. Not only are stereographic projections conformal mappings, meaning that angles are preserved, but they also preserve circularity, meaning that circles on the sphere are mapped to circles on the plane. Stereographic projections also map the interior of these circles to the interior of the mapped circles (Bowers et al., 1998; Saalfeld, 1999). For our purposes, the importance of circularity preservation is that it implies that stereographic projections preserve the Delaunay circumcircle property. This follows because triangle circumcircles (along with their interiors) are preserved. Therefore, Delaunay triangulations on the sphere are mapped to Delaunay triangulations on the plane and conversely. As a result, stereographic projections can be used to construct a Delaunay triangulation of a portion of the sphere by first constructing a Delaunay triangulation in the plane, which is a simpler and well-studied task.

Without loss of generality, we assume that we are given the unit sphere in R^3 centred at the origin. We are also given a plane tangent to the sphere at the point t. The focus point f is defined as the reflection of t about the centre of the sphere, i.e. f is the antipode of t. Let p denote a point on the unit sphere. The stereographic projection of p onto the plane tangent at t is the point q on the plane defined by

q = s p + (1 − s) f,  where s = 2 / (f · (f − p)).   (4)

For our purposes, it is more convenient to define the projection relative to t rather than f. The simple substitution of f = −t into Eq. (4) results in

q = s p + (s − 1) t,  where s = 2 / (t · (p + t)).   (5)

The definitions Eqs. (4) and (5) can also be used to define stereographic projections in R^k for k > 3.

3.2 Parallel algorithm for spherical Delaunay triangulation construction

A spherical Delaunay triangulation (SDT) is the Delaunay triangulation of a point set P defined on the surface of the sphere. We adapt the algorithm developed in Sect. 2.1 for domains in R^k to develop a parallel algorithm for the construction of SDTs. The most important adaptation is to use stereographic projections so that the actual construction of Delaunay triangulations is done on planar domains.

We are now given a set of points P = {x_j}_{j=1}^n on a subset Ω of the sphere; Ω may be the whole sphere. We begin by covering Ω by a set U = {U_k(t_k, r_k)}_{k=1}^N of N overlapping "umbrellas" or spherical caps U_k, each of which is defined by a centre point t_k and geodesic radius r_k. See Fig. 2a. For each spherical cap U_k, a connectivity (or neighbours) list is defined that consists of the indices of all the spherical caps U_i ∈ U, i ≠ k, that overlap with U_k. From the given point set P, we then create N smaller point sets P_k, k = 1, ..., N, each of which consists of the points in P which are in the spherical cap U_k. Due to the overlap of the spherical caps U_k, a point in P may end up belonging to multiple point sets P_k.

Fig. 2. (a) An overlapping subdivision of the sphere into N = 12 spherical caps. The cap centres t_k are the projections onto the sphere of the centroids of the 12 pentagons of the inscribed regular icosahedron. Each coloured ring represents the edge of a cap of radius r_k. (b) A 10 242 generator Delaunay triangulation constructed by the parallel algorithm.

At this point in Sect. 2.1, we assigned each of the point subsets P_k to a processor and constructed the N Delaunay tessellations in parallel. Before doing so in the spherical case, we project each of the point sets P_k onto the plane tangent at the corresponding point t_k. This additional step allows each processor to construct a planar Delaunay triangulation instead of a spherical one. Specifically, for each k, we construct the planar point set P̃_k = S(P_k; t_k), where S(P_k; t_k) denotes the stereographic projection of the points in P_k onto the plane tangent to the sphere at the point t_k. We then assign each of the point sets P̃_k to a different processor and have each processor construct the planar Delaunay triangulation of its point set P̃_k. Because stereographic projections preserve circularity and therefore preserve the Delaunay circumcircle property, it is then just a simple matter of keeping track of triangle vertex and edge indices to define the spherical Delaunay triangulations of the point sets P_k.
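In code, the projection of Eq. (5) and its inverse (used to map triangulations back to the sphere) are compact; the inverse below is our own derivation from the same line-sphere intersection and assumes a unit sphere with q lying in the plane tangent at t:

```python
import numpy as np

def to_plane(p, t):
    """Stereographic projection of p onto the plane tangent at t, Eq. (5)."""
    s = 2.0 / np.dot(t, p + t)
    return s * p + (s - 1.0) * t

def to_sphere(q, t):
    """Inverse projection: map q (with t . q = 1) back to the unit sphere."""
    mu = 4.0 / (np.dot(q, q) + 3.0)
    return mu * q + (mu - 1.0) * t
```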
As in Sect. 2.1, we now proceed to remove spherical triangles that are not guaranteed to be Delaunay with respect
to the global point set P . Instead of Eq. (1), in the spherical
case, triangles must satisfy
cos⁻¹(t_k · ĉ_i) + r̂_i < r_k

for that guarantee to be in effect, where ĉ_i and r̂_i are the centre and radius, respectively, of the circumsphere of the i-th triangle in the local Delaunay tessellation of P_k. Once the possibly unwanted triangles are removed, we can, as in Sect. 2.1, transparently stitch together the modified Delaunay triangulations into a global one.
In summary, the algorithm for the construction of a spherical Delaunay tessellation in parallel consists of the following steps, where the italicized steps are the ones added to the algorithm of Sect. 2.1:

– define overlapping spherical caps U_k, k = 1, ..., N, that cover the given region Ω on the sphere;

– sort the given point set P into the subsets P_k, k = 1, ..., N, each containing the points in P that are in U_k;

– for k = 1, ..., N, construct in parallel the point set P̃_k by stereographically projecting the points in P_k onto the plane tangent to the sphere at the point t_k;

– for k = 1, ..., N, construct in parallel the planar Delaunay triangulation T̃_k of the point set P̃_k;

– for k = 1, ..., N, construct in parallel the spherical Delaunay triangulation T_k by mapping the planar Delaunay triangulation T̃_k onto the sphere;

– for k = 1, ..., N, construct in parallel the spherical triangulation T̂_k by removing from T_k all simplices that are not guaranteed to satisfy the circumsphere property;

– construct the Delaunay tessellation of P as the union of the N modified spherical Delaunay tessellations {T̂_k}_{k=1}^N.
For an illustration of a spherical Delaunay triangulation determined by this algorithm see Fig. 2b.
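The per-cap work in this list can be sketched as follows, reusing to_plane from Sect. 3.1 (the 2-D basis construction and the spherical circumcircle helper are our own illustrative choices, not MPI-SCVT internals):

```python
import numpy as np
from scipy.spatial import Delaunay

def spherical_circumcircle(a, b, c):
    """Geodesic circumcentre (unit vector) and radius of a spherical triangle."""
    n = np.cross(b - a, c - a)
    n /= np.linalg.norm(n)
    if np.dot(n, a + b + c) < 0.0:     # pick the pole on the triangle's side
        n = -n
    return n, np.arccos(np.clip(np.dot(n, a), -1.0, 1.0))

def cap_triangles(points, members, t_k, r_k):
    """points: (n, 3) unit vectors; members: global indices of P_k in the cap."""
    proj = np.array([to_plane(points[i], t_k) for i in members])
    # orthonormal basis (e1, e2) for the plane tangent at t_k
    e1 = np.cross(t_k, [0.0, 0.0, 1.0])
    if np.linalg.norm(e1) < 1e-12:     # t_k at a pole: use a different axis
        e1 = np.cross(t_k, [0.0, 1.0, 0.0])
    e1 /= np.linalg.norm(e1)
    e2 = np.cross(t_k, e1)
    uv = np.column_stack([(proj - t_k) @ e1, (proj - t_k) @ e2])
    kept = set()
    for tri in Delaunay(uv).simplices:          # planar Delaunay triangulation
        g = [int(members[i]) for i in tri]      # back to global indices
        c_hat, r_hat = spherical_circumcircle(*points[g])
        # spherical analogue of the Eq. (1) containment test
        if np.arccos(np.clip(np.dot(t_k, c_hat), -1.0, 1.0)) + r_hat < r_k:
            kept.add(tuple(sorted(g)))
    return kept
```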
Because of the singularity in Eqs. (4) or (5) for p = f , the
serial version of this algorithm, i.e. if N = 1, can run into
trouble whenever the antipode f of the tangency point t is
in the spherical domain Ω. Of course, this is always the case if Ω is the whole sphere. In such cases, the number of subdomains used cannot be less than two, even if one is running
the above algorithm in serial mode.
The extent of overlap, i.e. the radii of the spherical caps, is set as the maximum geodesic distance from a cap centre t_j to its neighbouring cap centres t_i, where the geodesic distance is given by cos⁻¹(t_j · t_i).
3.3 Application to the construction of spherical centroidal Voronoi tessellations (SCVTs)
The parallel CVT construction algorithm of Sect. 2.2 is easily adapted for the construction of centroidal Voronoi tessellations on the sphere (SCVTs). Obviously, one uses the spherical version of the Delaunay triangulation algorithm as given
in Sect. 3.2 instead of the version of Sect. 2.1. One also has to
deal with the fact that if one computes the centroid of a spherical Voronoi cell using Eq. (2), then that centroid does not
lie on the sphere; a different definition of a centroid has to be
used (Du et al., 2003a). Fortunately, it is shown in Du et al.
(2003a) that the correct centroid can be determined by using
Eq. (2), which yields a point inside the sphere, and then projecting that point onto the sphere (actually, this was shown
to be true for general surfaces) so that the correct spherical
centroid is simply the intersection of the radial line going
through the point determined by Eq. (2) and the sphere.
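A minimal sketch of this spherical centroid update, with the cell integral of Eq. (2) approximated by weighted quadrature points (the quadrature is our simplification):

```python
import numpy as np

def spherical_centroid(samples, rho):
    """samples: (m, 3) quadrature points covering one spherical Voronoi cell."""
    w = np.array([rho(x) for x in samples])
    c = (w[:, None] * samples).sum(axis=0) / w.sum()   # Eq. (2) centroid
    return c / np.linalg.norm(c)                       # radial projection
```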
4 Results
All results are created using a high performance computing
(HPC) cluster with 24 AMD Opteron “model 6176” cores
per node and 64 GB of RAM per node.
Results are presented for three different generator counts, labelled "coarse", "medium", and "fine". Table 1 relates these labels to actual generator counts, as well as to the approximate grid resolution on an Earth-sized sphere.

Table 1. Description of generator count labels, along with approximate grid resolution on an Earth-sized sphere.

Generator Label | Generator Count | Approx. Resolution
coarse | 40 962 | 120 km
medium | 163 842 | 60 km
fine | 2 621 442 | 15 km
We use STRIPACK (Renka, 1997), a serial Fortran 77 Association of Computing Machinery (ACM) Transactions on
Mathematical Software (TOMS) algorithm that constructs
Delaunay triangulations on a sphere, as a baseline for comparison with our approach. It is currently one of the few well-known spherical triangulation libraries available.
The algorithm described in this paper has been implemented as a software package referred to as MPI-SCVT,
and is freely available following Jacobsen and Womeldorff
(2013). It utilizes the Triangle software (Shewchuk, 1996) to
perform planar triangulations of specified point sets in the
tangent planes of each partition of the domain.
4.1 Delaunay triangulations of the full sphere

Table 2 compares MPI-SCVT with STRIPACK for the construction of spherical Delaunay triangulations. The results given compare the cost to compute a single triangulation of a medium resolution global grid. The results show, for the computation of a full spherical triangulation, MPI-SCVT is roughly a factor of 7 slower than STRIPACK in serial mode, and a factor of 12 faster when using 96 processors.

Table 2. Comparison of STRIPACK with serial and parallel versions of MPI-SCVT for computation of a full spherical triangulation. Computed using the same initial conditions for a medium resolution grid.

Algorithm | Processors | Regions | Time (ms) | Speedup (STRIPACK/MPI-SCVT)
STRIPACK | 1 | 1 | 563.87 | Baseline
MPI-SCVT | 1 | 2 | 3 724.08 | 0.15
MPI-SCVT | 2 | 2 | 2 206.06 | 0.26
MPI-SCVT | 96 | 96 | 45.75 | 12.33

4.2 Initial generator placement and sorting heuristics

We now examine the effects of initial generator placement and sorting heuristics with regards to SCVT generation. The results of this section are intended to guide decisions about which initial condition or sorting method we use later to generate SCVTs.

4.2.1 Initial generator placement for SCVT construction

Because most climate models are shifting towards global high-resolution simulations, we begin by exploring the creation of a fine resolution grid. We compare SCVT grids constructed starting with uniform Monte Carlo and bisection initial generator placement. The times for these grids to converge to a tolerance of 10^−6 in the ℓ_2 norm are presented. The 10^−6 threshold is the strictest convergence level that the Monte Carlo grid can attain in a reasonable amount of time, and is therefore chosen as the convergence threshold for this study. However, the bisection grid can converge well beyond this threshold in a similar amount of time. Table 3 shows timing results for the parallel algorithm for the two different options for initial generator placement. Although the time spent converging a mesh with Monte Carlo initial placements is highly dependent on the initial point set, it is clear from this table that bisection initial conditions provide a significant speedup in the overall cost to generate an SCVT grid. Based on the results presented in Table 3 and unless otherwise specified, only bisection initial generator placements are used in subsequent experiments. Bisection initial conditions are not specific to this algorithm, and can be used to accelerate any SCVT grid generation method.

Table 3. Timing results for MPI-SCVT with bisection and Monte Carlo initial generator placements and the speedup of bisection relative to Monte Carlo. Final mesh is fine resolution.

Timed Portion | Bisection (B) | Monte Carlo (MC) | Speedup (MC/B)
Total Time (ms) | 112 890 | 358 003 000 | 3171.25
Triangulation Time (ms) | 10 887 | 48 034 600 | 4412.11
Integration Time (ms) | 30 893 | 66 374 400 | 2148.52
Communication Time (ms) | 70 878 | 110 958 000 | 1565.47

4.2.2 Sorting of points for assignment to processors

As described in Sect. 2.1.1, the MPI-SCVT code utilizes two methods for determining a region's point set. Whereas the dot product (i.e. distance) sort method is faster, it provides poor load balancing in variable resolution meshes, thus causing the iteration cost to be greater.

Timings using both approaches are given in Table 4. Figure 3 shows the number of points that each processor has to triangulate on a per iteration basis. Timings presented are for a medium resolution grid, 42 regions, 42 processors, and are averages over 3000 iterations. Three sets of timings are given. Timings labeled Uniform and ×8 both utilize the distance sort, but differ in the decomposition used. Uniform uses a quasi-uniform 42 region decomposition, whereas ×8 uses a variable resolution decomposition with a sixteen to one ratio in maximum and minimum grid sizes. Timings labeled Voronoi use the same decomposition as ×8, but use the Voronoi sort method rather than the distance sort method.

Table 4. Timings based on sorting approach used, with costs of the different algorithm steps in ms. Uniform uses a coarse quasi-uniform SCVT to define region centres and their associated radii and sorts using maximum distance between centres and neighbouring centres. ×8 uses a coarse SCVT with an 8 to 1 ratio in grid sizes to effect the same type of sorting. Voronoi uses the same ×8 coarse SCVT grid to define region centres and sorts using the neighbouring Voronoi cell-based sort.

Sorting Approach | Grid | Triangulation | Integration | Communication | Iteration | Cost Per Iteration Speedup
Distance | Uniform | 14.53 | 37.62 | 1314.22 | 1396.53 | Baseline
Distance | ×8 | 35.84 | 92.91 | 865.32 | 995.04 | 1.40
Voronoi | ×8 | 16.71 | 86.24 | 177.55 | 305.85 | 4.56

Fig. 3. Number of points each processor has to triangulate. (a), (b), and (c) use the sorting approaches corresponding to the three rows of Table 4. All plots were created with the same medium resolution grid.
Timings presented in Table 4 are taken relative to processor number 0, and as can be seen in Fig. 3a, processor 0 has
a very small load, which causes the majority of its iteration time to be spent waiting for the processors with large loads
to catch up; this idling time is included in the Communication column of the table.
Table 4 and Fig. 3 show there is a significant advantage to
the Voronoi based sort method in that it not only speeds up
the overall cost per iteration, but also provides more balanced
loads across the processors.
Based on these results, the new Voronoi-based sorting approach is used for all subsequent results; this should provide appropriate load balancing when generating both quasi-uniform and variable resolution SCVT grids.

4.3 SCVT generation

We now provide both quasi-uniform and variable resolution SCVT generation results. The major contributor to the differences in computational performance arises as a result of load-balancing differences. Results in this section make use of the density function

ρ(x_i) = ((1 − γ)/2) [ tanh( (β − |x_c − x_i|) / α ) + 1 ] + γ,   (6)

which is visualized in Fig. 4, where x_i is constrained to lie on the surface of the unit sphere. This function results in relatively large values of ρ within a distance β of the point x_c, where β is measured in radians and x_c is also constrained to lie on the surface of the sphere. The function transitions to relatively small values of ρ across a radian distance of α. The distance between x_c and x_i is computed as |x_c − x_i| = cos⁻¹(x_c · x_i), with a range from 0 to π.

Fig. 4. Density function that creates a grid with resolutions that differ by a factor of 8 between the coarse and the fine regions. The maximum value of the density function is 1 whereas the minimum value is (1/8)^4.

Figure 5 shows an example grid created using this density function, with x_c set to be λ_c = 3π/2, φ_c = π/6, where λ denotes longitude and φ latitude, γ = (1/8)^4, β = π/6, and α = 0.20, with 10 242 generators. This set of parameters used in Eq. (6) is referred to as ×8. The quasi-uniform version is referred to as ×1.
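For reference, Eq. (6) with the ×8 parameters is a one-liner; note that with γ = (1/8)^4 and d = 2, Eq. (3) predicts the factor-8 spread in grid spacing:

```python
import numpy as np

def rho(x_i, x_c, alpha=0.20, beta=np.pi/6, gamma=(1.0/8.0)**4):
    """Density function of Eq. (6) for unit vectors x_i, x_c."""
    d = np.arccos(np.clip(np.dot(x_c, x_i), -1.0, 1.0))  # geodesic distance
    return 0.5*(1.0 - gamma)*(np.tanh((beta - d)/alpha) + 1.0) + gamma
```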
D. W. Jacobsen et (a)
al.:Coarse
Parallel
algorithms for Delaunay
construction
region
(b) Transition region
(a) Coarse region
(c) Fine region
(b) Transition region
Fig. 7. using
Different
of thefunction
same variable
resolution
grid created
using the
densityoffunction
(6). Figure 7(a)
Fig. 5. Different views of the same variable resolution grid created
theviews
density
Eq. (6).
(a) shows
the coarse
region
the grid,
Fig.shows
5. Different
views ofregion,
the same
resolution
created using the density function Eq. (6). (a) shows the coarse region of the grid,
(b)
the transition
andvariable
(c) shows
the fine grid
region.
shows the coarse region of the grid, 7(b) shows the transition region, and 7(c) shows the fine region.
(b) shows the transition region, and (c) shows the fine region.
In Sect.
4.1, we showed
that MPI-SCVT
performs
to STRIPACK
when computing a
Table 5. Comparisons of SCVT generators using STRIPACK, serial
MPI-SCVT,
and parallel
MPI-SCVT.
Costcomparably
per triangulation
and iteration
Table
2. Comparison
ofisSTRIPACK
with serial
andper
parallel
versions
of MPI-SCVT
for
computation
of a full for
spherical
triangulation.
Comare
presented.
Speedup
compared using
the cost
iteration.
Computed
using
the
same
initial
conditions
a
medium
resolution
385 single full triangulation. However, computing a full triangulation is only part of the grid.
story. For SCVT
puted
using
the
same
initial
conditions
for
a
medium
resolution
grid.
Variable resolution results are presented for MPI-SCVT only.
generation, a triangulation needs to be computed at every iteration of Lloyd’s algorithm as described
Fig. 7.
Sect.
When using
STRIPACK,
the full triangulation needs to be computed at every iteration,
Algorithm
Processors
RegionsinTriangulation
Time2.3.
(ms)
Speedup
(STRIPACK/MPI-SCVT)
Algorithm
Procs Regions
Iteration
Speedup Per Iteration
butTime
with (ms)
MPI-SCVT
only(ms)
each regional
triangulation needs to be computed at each iteration. This
Time
(STRIPACK/MPI-SCVT)
STRIPACK
1
1
563.87
Baseline
means
the
merge
step
can
be
skipped
resulting
in significantly cheaper triangulations.
MPI-SCVT
3 724.08
0.15Baseline
STRIPACK
11
1 2
563.871
3 009.74
MPI-SCVT
2
2
2
206.06
0.26
390
Figure
8
shows
the
performance
of
a
STRIPACK-based
SCVT construction as the number of
MPI-SCVT ×1
1
2
3 480.43
9 485.78
0.32
MPI-SCVT
96
96
45.75
12.33
generators
is increased
through bisection as mentioned
in Sect. 2.3. Values are averages over 2000
MPI-SCVT ×1
2 (c) Fine region
2
2 114.38
5 390.82
0.56
MPI-SCVT ×1
96
96
14.23The green 207.26
14.52
iterations.
dashed line represents the
portion of the code that computes the centroids of
Different views
of the same
resolution grid
using16.71
the density function (6). Figure 7(a)
MPI-SCVT
×8variable96
96 createdthe
Voronoi regions 305.855
whereas the red solid line9.84
represent the portion of the code that computes the
shows the coarse region of the grid, 7(b) shows the transition region, and 7(c) shows the fine region.
Delaunay triangulation.
1.0e+05
390
Average
Time Time
(ms) (ms)
Average
Average
Time Time
(ms) (ms)
Average
385
1.0e+06
In1.0e+05
Sect. 4.1, weTriangulation
showed that MPI-SCVT performs comparably to STRIPACK when 1.0e+06
computing a Triangluation
1.0e+05
Integration
Integration
Triangulation
Triangluation
single
full
triangulation.
However,
computing
a
full
triangulation
is
only
part
of
the
story.
For SCVT
Communication
1.0e+04
1.0e+05
Integration
Integration
generation,
a triangulation needs to be computed at every iteration of Lloyd’s algorithm1.0e+04
as described
Communication
1.0e+04
17
1.0e+04
1.0e+03
in Sect.
1.0e+032.3. When using STRIPACK, the full triangulation needs to be computed at every iteration,
1.0e+03
but 1.0e+03
with MPI-SCVT only each regional triangulation needs to be computed at each iteration.
This
1.0e+02
1.0e+02
means
the merge step can be skipped resulting in significantly cheaper triangulations.
1.0e+02
1.0e+02
1.0e+01
Figure 8 shows the performance of a STRIPACK-based SCVT construction as the
number of
1.0e+01
1.0e+00
1.0e+01 is increased through bisection as mentioned in Sect. 2.3. Values are averages over 2000
generators
1.0e+00
1.0e+01
1.0e-01
1.0e+00
1.0e-01
1.0e-02
iterations. The green dashed line represents the portion of the code that computes the centroids of
the Voronoi regions whereas the red solid line represent the portion of the code that computes the
1.0e+02
1.0e+00
1.0e+03
Delaunay
triangulation.
1.0e+02
1.0e+03
1.0e+05
1.0e+06
1.0e+07
1.0e+01
1.0e-02
1.0e+02
1.0e+03
1.0e+05
1.0e+06
1.0e+07
1.0e+04
1.0e+05
Number of Generators
1.0e+04
1.0e+06
1.0e+07
1.0e+01
1.0e+02
1.0e+03
1.0e+05
Number1.0e+04
of Generators
1.0e+06
1.0e+07
Number of Generators
03
19 Aug 2013
by: Douglas.
1.0e+04
Fig. 6. Timings for STRIPACK-based SCVT construction for various generator counts. Red solid lines represent the time spent in STRIPACK computing a triangulation, whereas green dashed lines represent the time spent integrating the Voronoi cells outside of STRIPACK in one iteration of Lloyd's algorithm.
Fig. 7. Timings for various portions of MPI-SCVT using 2 processors and 2 regions. As the problem size increases, the slopes for both triangulation (red solid) and integration (green dashed) remain roughly constant.
In Sect. 4.1, we showed that MPI-SCVT performs comparably to STRIPACK when computing a single full triangulation. However, computing a full triangulation is only part of the story. For SCVT generation, a triangulation needs to be computed at every iteration of Lloyd's algorithm as described in Sect. 2.3. When using STRIPACK, the full triangulation needs to be computed at every iteration, but with MPI-SCVT only each regional triangulation needs to be computed at each iteration. This means the merge step can be skipped, resulting in significantly cheaper triangulations.
Fig. 8. Same information as in Fig. 7 but for 96 processors and
96 regions.
Figure 6 shows the performance of a STRIPACK-based
SCVT construction as the number of generators is increased
through bisection as mentioned in Sect. 2.2. Values are averages over 2000 iterations. The green dashed line represents
the portion of the code that computes the centroids of the
Voronoi regions, whereas the red solid line represents the portion of the code that computes the Delaunay triangulation.
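Since Figs. 6 and 7 split each iteration into triangulation and integration, it is worth seeing what the integration step amounts to. The sketch below approximates the density-weighted centroid of a single Voronoi cell by fanning the cell into flat triangles, applying a one-point quadrature rule to each, and constraining the result back to the sphere; the density function, cell geometry, and quadrature rule here are simplified stand-ins for what MPI-SCVT actually uses.

#include <array>
#include <cmath>
#include <cstdio>
#include <vector>

using Vec3 = std::array<double, 3>;

static Vec3 sub(const Vec3& a, const Vec3& b) { return {a[0]-b[0], a[1]-b[1], a[2]-b[2]}; }
static Vec3 cross(const Vec3& a, const Vec3& b) {
    return {a[1]*b[2]-a[2]*b[1], a[2]*b[0]-a[0]*b[2], a[0]*b[1]-a[1]*b[0]};
}
static double norm(const Vec3& a) { return std::sqrt(a[0]*a[0]+a[1]*a[1]+a[2]*a[2]); }

// Illustrative density function: heavier weighting near the poles.
static double density(const Vec3& x) { return 1.0 + x[2] * x[2]; }

// One-point quadrature over the flat triangle (a, b, c): accumulate the
// approximate mass and mass-weighted midpoint into mass and first.
static void addTriangle(const Vec3& a, const Vec3& b, const Vec3& c,
                        double& mass, Vec3& first) {
    const Vec3 mid = {(a[0]+b[0]+c[0])/3.0, (a[1]+b[1]+c[1])/3.0, (a[2]+b[2]+c[2])/3.0};
    const double area = 0.5 * norm(cross(sub(b, a), sub(c, a)));
    const double w = density(mid) * area;
    mass += w;
    for (int i = 0; i < 3; ++i) first[i] += w * mid[i];
}

int main() {
    // Illustrative cell: generator at the pole, ring of six Voronoi vertices.
    const Vec3 g = {0, 0, 1};
    const double pi = std::acos(-1.0), colat = 0.3;
    std::vector<Vec3> verts;
    for (int i = 0; i < 6; ++i) {
        const double lon = 2.0 * pi * i / 6.0;
        verts.push_back({std::sin(colat)*std::cos(lon),
                         std::sin(colat)*std::sin(lon), std::cos(colat)});
    }

    double mass = 0.0;
    Vec3 first = {0, 0, 0};
    for (std::size_t i = 0; i < verts.size(); ++i)
        addTriangle(g, verts[i], verts[(i + 1) % verts.size()], mass, first);

    Vec3 c = {first[0]/mass, first[1]/mass, first[2]/mass};
    const double n = norm(c);
    for (int i = 0; i < 3; ++i) c[i] /= n;  // constrain the centroid to the sphere
    std::printf("constrained centroid: (%g, %g, %g)\n", c[0], c[1], c[2]);
    return 0;
}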
Table 5 compares STRIPACK with the triangulation routine in MPI-SCVT that is called on every iteration. The results presented for MPI-SCVT are averages over 2000 iterations.
As a comparison with Fig. 6, in Figs. 7 and 8 we present
timings made for MPI-SCVT for two and 96 regions and processors, respectively, as the problem size, i.e. the number of
generators, increases. A minimum of two processors is used
because the stereographic projection used in MPI-SCVT has
a singularity at the focus point.
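The singularity is easy to see from the projection formula. Writing t for the tangent (projection) point, one common form of the stereographic projection (cf. Saalfeld, 1999) maps a unit vector p to the tangent-plane offset u = 2(p − (p·t)t)/(1 + p·t), whose denominator vanishes as p approaches the focus point −t. The helper below is our own minimal version of this map, not MPI-SCVT's routine.

#include <array>
#include <cmath>
#include <cstdio>

using Vec3 = std::array<double, 3>;

static double dot(const Vec3& a, const Vec3& b) {
    return a[0]*b[0] + a[1]*b[1] + a[2]*b[2];
}

// Stereographic projection of the unit vector p from the antipode of t onto
// the plane tangent to the sphere at t, returned as an offset from t within
// that plane. The denominator tends to zero as p approaches the focus -t.
static Vec3 project(const Vec3& p, const Vec3& t) {
    const double pt = dot(p, t);
    const double s = 2.0 / (1.0 + pt);
    return {s * (p[0] - pt * t[0]), s * (p[1] - pt * t[1]), s * (p[2] - pt * t[2])};
}

int main() {
    const Vec3 t = {0, 0, 1};  // tangent point: the north pole
    const Vec3 pNear = {0, std::sin(0.1), std::cos(0.1)};   // close to t
    const Vec3 pFar  = {0, std::sin(0.1), -std::cos(0.1)};  // close to the focus -t
    const Vec3 a = project(pNear, t), b = project(pFar, t);
    std::printf("near t     -> (%g, %g, %g)\n", a[0], a[1], a[2]);
    std::printf("near focus -> (%g, %g, %g)\n", b[0], b[1], b[2]);  // blows up
    return 0;
}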
Whereas Fig. 7 shows performance similar to that of
STRIPACK (see Fig. 6), Fig. 8 shows roughly two orders
of magnitude faster performance relative to STRIPACK. As
mentioned previously, this is only the case when creating
SCVTs as the full triangulation is no longer required when
computing a SCVT in parallel.
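A back-of-the-envelope cost model suggests why skipping the full triangulation pays off. If a Delaunay pass over n points costs on the order of n log n, then each of the N regions retriangulates only about n/N points per iteration. The toy calculation below is our own idealization, not an analysis from the paper; the generator count is merely an example of a medium resolution grid.

#include <cmath>
#include <cstdio>
#include <initializer_list>

// Idealized per-iteration cost model: a full Delaunay pass costs ~ n log n,
// while a merge-free regional pass costs ~ (n/N) log(n/N) per processor.
int main() {
    const double n = 40962.0;  // example generator count
    for (int N : {2, 16, 96}) {
        const double m = n / N;  // generators per region
        const double full = n * std::log2(n);
        const double perProc = m * std::log2(m);
        std::printf("N=%3d: global ~ %.0f, regional per proc ~ %.0f (ratio %.1f)\n",
                    N, full, perProc, full / perProc);
    }
    return 0;
}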
4.4 General algorithm performance

This section is intended to showcase some general performance results of MPI-SCVT. Figure 9a–c show the timings for a coarse resolution grid, a medium resolution grid, and a fine resolution grid, respectively.

Fig. 9. Timing results for MPI-SCVT vs. number of processors for three different problem sizes. Red solid lines represent the cost of computing a triangulation, whereas green dashed lines represent the cost of integrating all Voronoi cells, and blue dotted lines represent the cost of communicating each region's updated point set to its neighbours.

To assess the overall performance of the MPI-SCVT algorithm, scalability results are presented in Fig. 10. Figure 10a shows that this algorithm can easily under-saturate processors; when this happens, communication ends up dominating the overall runtime for the algorithm, which is seen in Fig. 9a; as a result, scalability ends up being sub-linear. As
the number of generators increases (as seen in Fig. 10b, c),
the limit for being under-saturated is higher. Currently in the
algorithm, communications are done asynchronously using
non-blocking sends and receives. Also, overall communications are reduced by only communicating with a region’s
neighbours. This is possible because points can move by at most a region radius between any two subsequent iterations, and therefore can only move into a region that overlaps the current one. More efficiency gains could be realized by improving the communication and integration algorithms. The addition of shared memory parallelization to the integration routines might improve overall performance, as the integration routines are embarrassingly parallel. Modification of the data structures used could also enable the generation of ultra-high resolution meshes (100M generators). In principle, because all of the computation is local, this algorithm should scale linearly up to hundreds, if not thousands, of processors.
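The neighbour-only exchange described above can be sketched with non-blocking MPI calls. The ring connectivity and buffer contents below are illustrative stand-ins for MPI-SCVT's actual region adjacency, but the MPI_Irecv/MPI_Isend/MPI_Waitall pattern is the standard way to realize asynchronous neighbour communication.

#include <mpi.h>
#include <cstdio>
#include <vector>

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int rank = 0, size = 1;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    // Illustrative ring of regions: each region overlaps exactly two
    // neighbours, so updated generator positions go only to those ranks.
    const int left = (rank - 1 + size) % size;
    const int right = (rank + 1) % size;

    std::vector<double> send(3 * 4, double(rank));  // e.g. 4 updated (x,y,z) generators
    std::vector<double> fromLeft(send.size()), fromRight(send.size());
    const int n = static_cast<int>(send.size());

    MPI_Request reqs[4];
    MPI_Irecv(fromLeft.data(),  n, MPI_DOUBLE, left,  0, MPI_COMM_WORLD, &reqs[0]);
    MPI_Irecv(fromRight.data(), n, MPI_DOUBLE, right, 0, MPI_COMM_WORLD, &reqs[1]);
    MPI_Isend(send.data(), n, MPI_DOUBLE, left,  0, MPI_COMM_WORLD, &reqs[2]);
    MPI_Isend(send.data(), n, MPI_DOUBLE, right, 0, MPI_COMM_WORLD, &reqs[3]);

    // Local work (e.g. integrating interior cells) could proceed here,
    // hiding the communication behind computation, before the wait.
    MPI_Waitall(4, reqs, MPI_STATUSES_IGNORE);

    std::printf("rank %d received %zu + %zu coordinates\n",
                rank, fromLeft.size(), fromRight.size());
    MPI_Finalize();
    return 0;
}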
5 Summary
A novel technique for the parallel construction of Delaunay triangulations is presented, utilizing unique domain decomposition techniques combined with stereographic projections. This parallel algorithm can be applied to the generation of planar and spherical centroidal Voronoi tessellations. Results were presented for the generation of spherical centroidal Voronoi tessellations, with comparisons to
STRIPACK, a well-known algorithm for the computation of
spherical Delaunay triangulations. The algorithm presented
in the paper (MPI-SCVT) shows slower performance than
STRIPACK when computing a single triangulation in serial
and faster performance when using roughly 100 processors.
When paired with a SCVT generator, the algorithm shows additional speedup relative to a STRIPACK-based SCVT generator.
Fig. 10. Scalability results based on number of generators. Green is a linear reference, whereas red is the speedup computed using the parallel version of MPI-SCVT compared to a serial version.
Acknowledgements. We would like to thank Geoff Womeldorff
and Michael Duda for many useful discussions. The work of Doug
Jacobsen, Max Gunzburger, John Burkardt, and Janet Peterson was
supported by the US Department of Energy under grant numbers
DE-SC0002624 and DE-FG02-07ER64432.
Edited by: R. Redler
References
Amato, N. and Preparata, F.: An NC parallel 3D convex hull algorithm, in: Proceedings of the ninth annual symposium on Computational geometry – SCG ’93, 289–297, May 1993.
Batista, V., Millman, D., Pion, S., and Singler, J.: Parallel geometric algorithms for multi-core computers, Comp. Geom.-Theor.
Appl., 43, 663–677, 2010.
Bowers, P., Diets, W., and Keeling, S.: Fast algorithms for generating Delaunay interpolation elements for domain decomposition, available at: http://www.math.fsu.edu/~aluffi/archive/paper77.ps.gz (last access: May 2011), 1998.
Chernikov, A. and Chrisochoides, N.: Algorithm 872: parallel 2D
constrained Delaunay mesh generation, ACM T. Math. Software,
34, 6:1–6:20, 2008.
Cignoni, P., Montani, C., and Scopigno, R.: DeWall: a fast divide
and conquer Delaunay triangulation algorithm in E-d, Comput.
Aided Design, 30, 333–341, 1998.
Du, Q. and Gunzburger, M.: Grid generation and optimization based on centroidal Voronoi tessellations, Appl. Math. Comput., 133, 591–607, 2002.
Du, Q., Faber, V., and Gunzburger, M.: Centroidal Voronoi tessellations: applications and algorithms, SIAM Rev., 41, 637–676,
1999.
www.geosci-model-dev.net/6/1353/2013/
D. W. Jacobsen et al.: Parallel algorithms for Delaunay construction
Du, Q., Ju, L., and Gunzburger, M.: Constrained centroidal Voronoi
tessellations for surfaces, SIAM J. Sci. Comput., 24, 1488–1506,
2003a.
Du, Q., Ju, L., and Gunzburger, M.: Voronoi-based finite volume
methods, optimal Voronoi meshes, and PDEs on the sphere,
Comput. Method. Appl. M., 192, 3933–3957, 2003b.
Du, Q., Emelianenko, M., and Ju, L.: Convergence of the Lloyd
Algorithm for computing centroidal Voronoi tessellations, SIAM
J. Numer. Anal., 44, 102–119, 2006a.
Du, Q., Ju, L., and Gunzburger, M.: Adaptive finite element methods for elliptic PDEs based on conforming centroidal Voronoi Delaunay triangulations, SIAM J. Sci. Comput., 28, 2023–2053, 2006b.
Heikes, R. and Randall, D.: Numerical integration of the shallow-water equations on a twisted icosahedral grid. Part I: Basic design and results of tests, Mon. Weather Rev., 123, 1862–1880, 1995.
Jacobsen, D. and Womeldorff, G.: MPI-SCVT Source Code, available at: https://github.com/douglasjacobsen/MPI-SCVT, 2013.
Ju, L., Ringler, T., and Gunzburger, M.: A multi-resolution method
for climate system modeling: application of spherical centroidal
Voronoi tessellations, Ocean Dynam., 58, 475–498, 2008.
Ju, L., Ringler, T., and Gunzburger, M.: Voronoi tessellations and their application to climate and global modeling, in: Numerical Techniques for Global Atmospheric Models, Springer, 313–342, 2011.
Lambrechts, J., Comblen, R., Legat, V., Geuzaine, C., and Remacle, J.-F.: Multiscale mesh generation on the sphere, Ocean Dynam., 58, 461–473, doi:10.1007/s10236-008-0148-3, 2008.
Lloyd, S.: Least squares quantization in PCM, IEEE T. Inform. Theory, 28, 129–137, 1982.
Metropolis, N. and Ulam, S.: The Monte Carlo method, J. Am. Stat.
Assoc., 44, 335–341, 1949.
Nguyen, H., Burkardt, J., Gunzburger, M., Ju, L., and Saka, Y.: Constrained CVT meshes and a comparison of triangular mesh generators, Comp. Geom.-Theor. Appl., 42, 1–19, 2009.
Okabe, A., Boots, B., Sugihara, K., and Chiu, S.: Spatial Tessellations: Concepts and Applications of Voronoi Diagrams, John Wiley, Chichester, England, 2000.
Pain, C. C., Piggot, M. D., Goddard, A. J. H., Fang, F., Gorman, G.
J., Marshall, D. P., Eaton, M. D., Power, P. W., and de Oliveira,
C. R. E.: Three-dimensional unstructured mesh ocean modelling,
Ocean Model., 10, 5–33, 2005.
Renka, R.: Algorithm 772: STRIPACK: Delaunay triangulation and
Voronoi diagram on the surface of a sphere, ACM T. Math. Software, 23, 416–434, 1997.
Ringler, T., Petersen, M., Higdon, R. L., Jacobsen, D., Jones, P. W., and Maltrud, M.: A multi-resolution approach to global ocean modeling, Ocean Model., accepted, doi:10.1016/j.ocemod.2013.04.010, 2013.
Ringler, T. D., Jacobsen, D., Gunzburger, M., Ju, L., Duda, M., and Skamarock, W.: Exploring a multiresolution modeling approach within the shallow-water equations, Mon. Weather Rev., 139, 3348–3368, doi:10.1175/MWR-D-10-05049.1, 2011.
Saalfeld, A.: Delaunay triangulations and stereographic projections,
Cartogr. Geogr. Inform., 26, 289–296, 1999.
Shewchuk, J.: Triangle: engineering a 2D quality mesh generator
and Delaunay triangulator, Applied Computational Geometry:
Towards Geometric Engineering, 1148, 203–222, 1996.
Weller, H., Weller, H. G., and Fournier, A.: Voronoi, Delaunay, and block-structured mesh refinement for solution of the shallow-water equations on the sphere, Mon. Weather Rev., 137, 4208–4224, 2009.
Zhou, J., Deng, X., and Dymond, P.: A 2-D parallel convex hull algorithm with optimal communication phases, in: Proceedings of the 11th International Parallel Processing Symposium, April 1997, University of Geneva, Geneva, Switzerland, 596–602, 1997.