Geosci. Model Dev., 6, 1353–1365, 2013
www.geosci-model-dev.net/6/1353/2013/
doi:10.5194/gmd-6-1353-2013
© Author(s) 2013. CC Attribution 3.0 License.

Parallel algorithms for planar and spherical Delaunay construction with an application to centroidal Voronoi tessellations

D. W. Jacobsen1,2, M. Gunzburger1, T. Ringler2, J. Burkardt1, and J. Peterson1
1 Department of Scientific Computing, Florida State University, Tallahassee, FL 32306, USA
2 Theoretical Division, Los Alamos National Laboratory, Los Alamos, NM 87545, USA

Received: 5 February 2013 – Published in Geosci. Model Dev. Discuss.: 22 February 2013
Revised: 18 June 2013 – Accepted: 2 July 2013 – Published: 30 August 2013

Abstract. A new algorithm, featuring overlapping domain decompositions, for the parallel construction of Delaunay and Voronoi tessellations is developed. Overlapping allows for the seamless stitching of the partial pieces of the global Delaunay tessellation constructed by individual processors. The algorithm is then modified, by the addition of stereographic projections, to handle the parallel construction of spherical Delaunay and Voronoi tessellations. The algorithms are then embedded into algorithms for the parallel construction of planar and spherical centroidal Voronoi tessellations, which require multiple constructions of Delaunay tessellations. This combination of overlapping domain decompositions with stereographic projections provides a unique algorithm for the construction of spherical meshes that can be used in climate simulations. Computational tests are used to demonstrate the efficiency and scalability of the algorithms for spherical Delaunay and centroidal Voronoi tessellations. Compared to serial versions of the algorithm and to STRIPACK-based approaches, the new parallel algorithm results in speedups for the construction of spherical centroidal Voronoi tessellations and spherical Delaunay triangulations.

Correspondence to: D. W. Jacobsen ([email protected])

1 Introduction

Voronoi diagrams and their dual Delaunay tessellations have become, in many settings, natural choices for spatial gridding due to their ability to handle arbitrary boundaries and refinement well. Such grids are used in a wide range of applications and can, in principle, be created for almost any geometry in two and higher dimensions. The recent trend towards the exascale in most aspects of high-performance computing further demands fast algorithms for the generation of high-resolution spatial meshes that are also of high quality, e.g. featuring variable resolution with smooth transition regions; otherwise, the meshing part can dominate the rest of the discretization and solution processes. However, creating such meshes can be time consuming, especially for high-quality, high-resolution meshing. Attempts to speed up the generation of Delaunay tessellations via parallel divide-and-conquer algorithms were made in, e.g. Cignoni et al. (1998) and Chernikov and Chrisochoides (2008); however, these approaches are restricted to two-dimensional planar surfaces. With the current need for high-resolution, variable-resolution grid generators in mind, we first develop a new algorithm that makes use of a novel approach to domain decomposition for the fast, parallel generation of Delaunay and Voronoi grids in Euclidean domains.

Centroidal Voronoi tessellations provide one approach for high-quality Delaunay and Voronoi grid generation (Du et al., 1999, 2003b, 2006b; Du and Gunzburger, 2002; Nguyen et al., 2008).
The efficient construction of such grids involves an iterative process (Du et al., 1999) that calls for the determination of multiple Delaunay tessellations; we show how our new parallel Delaunay algorithm is especially useful in this context.

Published by Copernicus Publications on behalf of the European Geosciences Union.

Climate modelling is a specific field which has recently begun adopting Voronoi tessellations as well as triangular meshes for the spatial discretization of partial differential equations (Pain et al., 2005; Weller et al., 2009; Ju et al., 2008, 2011; Ringler et al., 2011). As this is a special interest of ours, we also develop a new parallel algorithm for the generation of Voronoi and Delaunay tessellations for the entire sphere or some subregion of interest. The algorithm uses stereographic projections, similar to Lambrechts et al. (2008), to transform these tasks into planar Delaunay constructions, for which we apply the new parallel algorithm we have developed for that purpose. Spherical centroidal Voronoi tessellation (SCVT) based grids are especially desirable in climate modelling because they not only provide for precise grid refinement, but also feature smooth transition regions (Ringler et al., 2011, 2013). We show how such grids can be generated using the new parallel algorithm for spherical Delaunay tessellations.

The paper is organized in the following fashion. In Sect. 2.1, we present our new parallel algorithm for the construction of planar Delaunay and Voronoi tessellations. In Sect. 2.2, we provide a brief discussion of centroidal Voronoi tessellations (CVTs), followed by showing how the new parallel Delaunay tessellation algorithm can be incorporated into an efficient method for the construction of CVTs. In Sect. 3.1, we provide a short review of stereographic projections followed, in Sect.
3.2, by a presentation of the new parallel algorithm for the construction of spherical Delaunay and Voronoi tessellations. In Sect. 3.3, we consider the parallel construction of spherical CVTs. In Sect. 4, the results of numerical experiments and demonstrations of the new algorithms are given; for the sake of brevity, we only consider the algorithms for grid generation on the sphere. Finally, in Sect. 5, concluding remarks are provided.

2 Parallel Delaunay and Voronoi tessellation construction in Rk

In this section, we present a new method for the construction of Delaunay and Voronoi tessellations. The algorithms are first presented in R2 and later extended to R3.

2.1 Parallel algorithm for Delaunay tessellation construction

The construction of planar Delaunay triangulations in parallel has been of interest for several years; see, e.g. Amato and Preparata (1993); Cignoni et al. (1998); Zhou et al. (2001); Batista et al. (2010). Typically, such algorithms divide the point set into several smaller subsets, each of which can then be triangulated independently from the others. The resulting triangulations need to be stitched together to form a global triangulation. This stitching, or merge step, is typically computed serially because one may need to modify significant portions of the individual triangulations. The merge step is the main difference between the different parallel algorithms. Here, we provide a new alternative merge step that, because no modifications of the individual triangulations are needed, can be performed in parallel.

We are given a set of points P = {xj}, j = 1, ..., n, in a given domain Ω ⊂ Rk. We begin by covering Ω by a set S = {Sk(ck, rk)}, k = 1, ..., N, of N overlapping spheres Sk, each of which is defined by a centre point ck and radius rk. For each sphere Sk, a connectivity or neighbour list is defined which consists of the indices of all the spheres Si ∈ S, i ≠ k, that overlap with Sk.
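Since the paper provides no code, the neighbour-list construction can be illustrated with a minimal NumPy sketch (the function name and data layout here are our own): two balls overlap exactly when the distance between their centres is less than the sum of their radii.

```python
import numpy as np

def neighbour_lists(centres, radii):
    """For each covering ball S_k, list the indices of the balls that
    overlap it: balls overlap when their centres are closer than the
    sum of their radii."""
    c = np.asarray(centres, dtype=float)
    r = np.asarray(radii, dtype=float)
    nbrs = []
    for k in range(len(c)):
        d = np.linalg.norm(c - c[k], axis=1)        # centre-to-centre distances
        mask = (d < r + r[k]) & (np.arange(len(c)) != k)
        nbrs.append(np.nonzero(mask)[0])
    return nbrs
```

For a quasi-uniform covering such as the icosahedral cap layout used later on the sphere, each region ends up with only a handful of neighbours, which is what keeps the communication pattern local.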
From the given point set P, we then create N smaller point sets Pk, k = 1, ..., N, each of which consists of the points in P which are in the sphere Sk. Due to the overlap of the spheres Sk, a point in P may belong to multiple point sets Pk. This defines what we refer to as our sorting method. The sorting method just presented is the simplest to understand, but it causes load-balancing issues on variable-resolution meshes. For that reason, in practice we use an alternate sorting method, which is described in Sect. 2.1.1.

The next step is to construct the N Delaunay tessellations Tk, k = 1, ..., N, of the N point sets Pk, k = 1, ..., N. For this purpose, one can use any Delaunay tessellation method at one's disposal; for example, in the plane, one can use the Delaunay triangulator that is part of the Triangle software of Shewchuk (1996). It should be noted that such triangulation software packages only aid in the computation of the triangulation; for example, they cannot be used to determine which points should be triangulated. These Delaunay tessellations, of course, can be constructed completely in parallel after assigning one point set Pk to each of N processors.

At this point, we are almost but not quite ready to merge the N local Delaunay tessellations into a single global one. Before doing so, we deal with the fact that although, by construction, the tessellation Tk of the point set Pk is a Delaunay tessellation of Pk, there are very likely to be simplices in the tessellation Tk that are not globally Delaunay, i.e. that are not Delaunay with respect to points in P that are not in Pk. This follows because the points in any Tk are unaware of the triangles and points outside of Sk, so that there may be points in P that are not in Pk that lie within the circumsphere of a simplex in Tk.
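The sorting step above can be sketched in a few lines (our own rendering, not code from the paper): each point is placed in the subset Pk of every ball Sk that contains it, so points in overlap zones are deliberately duplicated across subsets.

```python
import numpy as np

def distance_sort(points, centres, radii):
    """Distance sort: for each ball S_k, collect the indices of all
    points within distance r_k of its centre c_k. Because the balls
    overlap, a point may land in several subsets P_k."""
    pts = np.asarray(points, dtype=float)
    return [np.nonzero(np.linalg.norm(pts - np.asarray(c, dtype=float), axis=1) < r)[0]
            for c, r in zip(centres, radii)]
```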
In fact, only simplices whose circumspheres are completely contained inside the ball Sk are guaranteed to be globally Delaunay; such simplices satisfy the criterion

||ck − ĉi|| + r̂i < rk,  (1)

where ĉi and r̂i are the centre and radius of the circumsphere of the i-th triangle in the local Delaunay tessellation Tk. So, at this point, for each k = 1, ..., N, we construct the triangulation T̂k by discarding all simplices in the local Delaunay tessellation Tk whose circumspheres are not completely contained in Sk, i.e. all simplices that do not satisfy Eq. (1). Figure 1 shows one of the local Delaunay triangulations Tk and the triangulation T̂k after the deletion of triangles that are not guaranteed to satisfy the Delaunay property globally.

Fig. 1. A local Delaunay triangulation Tk (left) and the triangulation T̂k (right) after the deletion of triangles that are not guaranteed to satisfy the Delaunay property globally.

The final step is to merge the N modified tessellations T̂k, k = 1, ..., N, that have been constructed completely in parallel into a single global tessellation. The key observation here is that each regional tessellation T̂k is now exactly a portion of the global Delaunay tessellation because, if two local tessellations T̂i and T̂k overlap, they must coincide wherever they overlap. This follows from the uniqueness property of the Delaunay tessellation. Thus, the union of the local Delaunay tessellations is the global Delaunay tessellation; by using an overlapping domain decomposition, the stitching of the regional Delaunay tessellations into a global one is transparent. Of course, some bookkeeping chores have to be done, such as rewriting simplex information in terms of global point indices and counting overlapping simplices only once.

In summary, the algorithm for the construction of a Delaunay tessellation in parallel consists of the following steps:

– define overlapping balls Sk, k = 1, ..., N, that cover the given region Ω;
– sort the given point set P into the subsets Pk, k = 1, ..., N, each containing the points in P that are in Sk;
– for k = 1, ..., N, construct in parallel the N Delaunay tessellations Tk of the point sets Pk;
– for k = 1, ..., N, construct in parallel the tessellation T̂k by removing from Tk all simplices that are not guaranteed to satisfy the circumsphere property;
– construct the Delaunay tessellation of P as the union of the N modified Delaunay tessellations T̂k, k = 1, ..., N.

Once a Delaunay tessellation is determined, it is an easy matter, at least for domains in R2 and R3, to construct the dual Voronoi tessellation; see, e.g. Okabe et al. (2000).

2.1.1 Methods for decomposing the global point set

In Sect. 2.1, an initial method was presented for distributing points between regions. We refer to this method as a distance sort: the distance between a point and a region centre is computed, and if the distance is less than the region's radius, the point is considered for triangulation. The key difficulty for this sort method is determining a radius which provides enough overlap without including significantly more points than are required. This is an even greater challenge when considering variable-resolution meshes and regions that straddle both fine and coarse regions.

In order to properly sort the points, the overlap of the regions needs to be large enough to satisfy the following three requirements:

– no point contained in a region's Voronoi cell is a vertex of a non-Delaunay triangle;
– the resulting triangulation includes all Delaunay triangles of which the owned points are vertices;
– the point set defines all triangles whose circumcircles fall in the region's Voronoi cell.

A final note is that the radii rk, k = 1, ..., N, should be chosen large enough so that there are no gaps. If all three of these requirements are satisfied, the point set can be used in any of the algorithms described in this paper.

Whereas we have yet to describe a method for the choice of region radius, let us first begin by describing the two extremes. First, consider a radius that encompasses the entire domain. This obviously provides enough overlap to guarantee that the three conditions are satisfied. Second, consider a radius that extends exactly to the edges of the region's Voronoi cell, and no larger. In this case, the overlap is obviously insufficient to satisfy the three conditions. This can be seen if we assume at least one non-Delaunay triangle exists at the boundary of the Voronoi cell, in which case both criteria 1 and 2 are not met.

In practice, choosing each radius rk to be equal to the maximum distance from the centre ck of its ball Sk to the centres of all its adjacent balls allows enough overlap in quasi-uniform cases. However, this method still fails to provide enough overlap in certain cases. All of the failures occur when the global point set does not include enough points for the decomposition of choice. Before discussing the target number of points for a given decomposition, an alternative sorting method is introduced.

In an attempt to provide better load balancing in general for arbitrary density functions, we also developed an alternative sorting method, which we refer to as the Voronoi sort. In this case, the method for sorting points makes use of the Voronoi property that all points contained within a Voronoi cell are closer to that cell's generator than to any other generator in the set. It is important to note that the coarse Voronoi cells used for domain decomposition here are not the same as the Voronoi cells related to the Delaunay triangulation we are attempting to construct. To begin, the point set is divided into the Voronoi cells defined by the region's centre and its neighbours' region centres. The union of the owned region's point set and its immediately adjacent regions' point sets defines the final set of points for triangulation.

As with the distance sort, the Voronoi sort meets all of the criteria for use in the described algorithms. However, it can also still fail in certain cases. Both sort methods can fail if the target point set is too small for the target decomposition. In this case, one or more regions do not include enough points to correctly compute a triangulation through any method.
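The filtering step that makes the merge transparent can be sketched as follows; this is our own minimal Python rendering of criterion (1), using the standard closed-form planar circumcircle computation.

```python
import numpy as np

def circumcircle(a, b, c):
    """Centre and radius of the circumcircle of the triangle (a, b, c) in R^2."""
    ax, ay = a; bx, by = b; cx, cy = c
    d = 2.0 * (ax * (by - cy) + bx * (cy - ay) + cx * (ay - by))
    ux = ((ax**2 + ay**2) * (by - cy) + (bx**2 + by**2) * (cy - ay)
          + (cx**2 + cy**2) * (ay - by)) / d
    uy = ((ax**2 + ay**2) * (cx - bx) + (bx**2 + by**2) * (ax - cx)
          + (cx**2 + cy**2) * (bx - ax)) / d
    centre = np.array([ux, uy])
    return centre, float(np.linalg.norm(centre - np.asarray(a, dtype=float)))

def globally_delaunay(tri_pts, ball_centre, ball_radius):
    """Criterion (1): a local triangle is guaranteed globally Delaunay
    only if its whole circumcircle lies inside the region's ball."""
    c_hat, r_hat = circumcircle(*tri_pts)
    return np.linalg.norm(np.asarray(ball_centre, dtype=float) - c_hat) + r_hat < ball_radius
```

A processor simply drops every local triangle for which `globally_delaunay` is false; the surviving triangles can be unioned with those of neighbouring regions without modification.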
In practice, these algorithms fail if the decomposition includes more than two regions N and the point set contains fewer than 16N points, or two bisections. For example, a 12-processor decomposition cannot be used with a target of 42 points, but a 12-processor decomposition can be used with a target of 162 points. In general, the decomposition should be created using the target density function to provide optimal load balancing and to improve the robustness of the algorithm. Results for these sort methods are presented in Sect. 4.2.2.

2.2 Application to the construction of centroidal Voronoi and Delaunay tessellations

Given a Voronoi tessellation V = {Vj}, j = 1, ..., n, of a bounded region Ω ⊂ Rk corresponding to the set of generators P = {xj}, j = 1, ..., n, and given a nonnegative function ρ(x) defined over Ω, referred to as the point-density function, we can define the centre of mass or centroid x*j of each Voronoi region as

x*j = ( ∫_{Vj} x ρ(x) dx ) / ( ∫_{Vj} ρ(x) dx ),  j = 1, ..., n.  (2)

In general, xj ≠ x*j, i.e. the generators of the Voronoi cells do not coincide with the centres of mass of those cells. The special case for which xj = x*j for all j = 1, ..., n, i.e. for which all the generators coincide with the centres of mass of their Voronoi regions, is referred to as a centroidal Voronoi tessellation (CVT).

Given Ω, n, and ρ(x), a CVT of Ω must be constructed. The simplest means for doing so is Lloyd's method (Lloyd, 1982) in which, starting with an initial set of n distinct generators, one constructs the corresponding Voronoi tessellation, then determines the centroid of each of the Voronoi cells, and then moves each generator to the centroid of its Voronoi cell. These steps are repeated until satisfactory convergence is achieved, e.g. until the movement of generators falls below a prescribed tolerance. The convergence properties of Lloyd's method are rigorously studied in Du et al. (2006a).
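Lloyd's method is easiest to see in one dimension, where the Voronoi cells of sorted generators on [0, 1] are simply the intervals between midpoints of neighbouring generators. The sketch below is our own illustration, not the paper's implementation; it approximates the centroid integral of Eq. (2) by dense sampling of the density.

```python
import numpy as np

def lloyd_1d(generators, rho, max_iter=500, tol=1e-12):
    """Lloyd's method on [0, 1]: each generator is moved to the mass
    centroid, Eq. (2), of its Voronoi interval, until the largest
    generator movement falls below tol."""
    x = np.sort(np.asarray(generators, dtype=float))
    for _ in range(max_iter):
        # Voronoi cell boundaries: domain ends plus neighbour midpoints.
        bounds = np.concatenate(([0.0], 0.5 * (x[:-1] + x[1:]), [1.0]))
        new = np.empty_like(x)
        for j in range(x.size):
            s = np.linspace(bounds[j], bounds[j + 1], 129)  # cell samples
            w = rho(s)
            new[j] = (s * w).sum() / w.sum()                # discrete centroid
        movement = np.abs(new - x).max()                    # generator movement
        x = new
        if movement < tol:
            break
    return x
```

With a uniform density and three generators, the iteration converges to the equispaced CVT {1/6, 1/2, 5/6}, regardless of the (distinct) starting positions.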
The point-density function ρ plays a crucial role in how the converged generators are distributed and in the relative sizes of the corresponding Voronoi cells. If we arbitrarily select two Voronoi cells Vi and Vj from a CVT, their grid spacings and densities are related as

hi ≈ hj ( ρ(xj) / ρ(xi) )^(1/(d+2)),  (3)

where hi denotes a measure of the linear dimension, e.g. the diameter, of the cell Vi, xi denotes a point, e.g. the generator, in Vi, and d denotes the spatial dimension. Thus, the point-density function can be used to produce nonuniform grids in either a passive or adaptive manner by prescribing a ρ or by connecting ρ to an error indicator, respectively. Although the relation Eq. (3) is at the present time a conjecture, its validity has been demonstrated through many numerical studies; see, e.g. Du et al. (1999); Ringler et al. (2011). CVTs and their dual Delaunay tessellations have been successfully used for the generation of high-quality nonuniform grids; see, e.g. Du and Gunzburger (2002); Du et al. (2003b, 2006b); Ju et al. (2011); Nguyen et al. (2008).

As defined above, every iteration of Lloyd's method requires the construction of the Voronoi tessellation of the current set of generators followed by the determination of the centroids of the Voronoi cells. However, the construction of the Voronoi tessellation can be avoided until after the iteration has converged; instead, the iterative process can be carried out with only Delaunay tessellation constructions. Thus, the first task within every iteration of Lloyd's method is to construct the Delaunay tessellation of the current set of generators. The parallel algorithm for the generation of Delaunay tessellations given in Sect. 2.1 is thus especially useful in reducing the costs of the multiple tessellations needed in CVT construction. After the Delaunay tessellation of the current generators is computed, every Voronoi cell centre of mass must be computed by integration, so that its generator can be replaced by the centre of mass.
Superficially, it seems that we cannot avoid constructing the Voronoi tessellation to do this. However, it is easy to see that one does not actually need the Voronoi tessellation; instead, one can determine the needed centroids from the Delaunay tessellation. For simplicity, we restrict this discussion to tessellations in R2. Each triangle in a Delaunay triangulation contributes to the integration over three different Voronoi cells. The triangle is split into three kites, each made up of two edge midpoints, the triangle circumcentre, and a vertex of the triangle. Each kite is part of the Voronoi cell whose generator is located at the triangle vertex associated with the kite. Integrating over each kite and updating a portion of the integrals in the centroid formula Eq. (2) allows one to use only the Delaunay triangulation to determine the centroids of the Voronoi cells. Thus, because both steps within each Lloyd iteration can be performed using only the Delaunay triangulation, determining the generators of a CVT does not require the construction of any Voronoi tessellations, nor does it require any mesh connectivity information. If one is interested in the CVT, one need only construct a single Voronoi tessellation after the Lloyd iteration has converged; if one is interested in the Delaunay tessellation corresponding to the CVT, then no Voronoi construction is needed.

To make this algorithm parallel, one can compute the Voronoi cell centroids using the local Delaunay tessellations rather than the stitched-together global one. However, because the local Delaunay tessellations overlap, one has to ensure that each generator is updated by only one region, i.e. by only one processor. This can be done using one of a variety of domain decomposition methods. We use a coarse Voronoi diagram corresponding to the region centres ck.
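The kite decomposition described above can be sketched as follows. This is our own Python rendering under the simplifying assumption of a constant density ρ ≡ 1; the shoelace formula supplies each kite's area and centroid, which are accumulated per generator.

```python
import numpy as np

def circumcentre(a, b, c):
    """Circumcentre of the triangle (a, b, c) in R^2 (closed form)."""
    ax, ay = a; bx, by = b; cx, cy = c
    d = 2.0 * (ax * (by - cy) + bx * (cy - ay) + cx * (ay - by))
    ux = ((ax**2 + ay**2) * (by - cy) + (bx**2 + by**2) * (cy - ay)
          + (cx**2 + cy**2) * (ay - by)) / d
    uy = ((ax**2 + ay**2) * (cx - bx) + (bx**2 + by**2) * (ax - cx)
          + (cx**2 + cy**2) * (bx - ax)) / d
    return np.array([ux, uy])

def polygon_centroid(poly):
    """Shoelace area (unsigned) and centroid of a simple polygon."""
    x, y = poly[:, 0], poly[:, 1]
    xn, yn = np.roll(x, -1), np.roll(y, -1)
    cross = x * yn - xn * y
    a = 0.5 * cross.sum()                       # signed area
    cx = ((x + xn) * cross).sum() / (6.0 * a)
    cy = ((y + yn) * cross).sum() / (6.0 * a)
    return abs(a), np.array([cx, cy])

def voronoi_centroids_from_delaunay(points, triangles):
    """Accumulate each generator's Voronoi-cell area and first moment
    from the kites of the Delaunay triangles (density rho = 1).
    Each triangle contributes one kite -- vertex, two edge midpoints,
    circumcentre -- to the cell of each of its three vertices."""
    pts = np.asarray(points, dtype=float)
    area = np.zeros(len(pts))
    moment = np.zeros((len(pts), 2))
    for tri in triangles:
        cc = circumcentre(*pts[list(tri)])
        for v in range(3):
            i = tri[v]
            m1 = 0.5 * (pts[i] + pts[tri[(v + 1) % 3]])
            m2 = 0.5 * (pts[i] + pts[tri[(v + 2) % 3]])
            kite = np.array([pts[i], m1, cc, m2])
            a, cen = polygon_centroid(kite)
            area[i] += a
            moment[i] += a * cen
    return area, moment / area[:, None]          # per-cell area and centroid
```

For a single right triangle the three kites tile the triangle exactly, so the accumulated areas sum to the triangle's area.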
Each processor only updates the generators that are inside of its coarse Voronoi cell. Because Voronoi cells are non-overlapping, each generator will only get updated by one processor. After all the generators are updated, each coarse region needs to transfer its newly updated points only to its adjacent regions and not to all active processors. This limits each processor's communications to roughly six sends and receives, regardless of the total number of processors used.

We use two metrics to check for the convergence of the Lloyd iteration, namely the ℓ2 and ℓ∞ norms of generator movement given by

ℓ2 norm = ( (1/n) Σ_{j=1}^{n} |xj^old − xj^new|^2 )^(1/2)

and

ℓ∞ norm = max_{j=1,...,n} |xj^old − xj^new|,

respectively, which are compared with a given tolerance; here, "old" and "new" refer to the previous and current Lloyd iterates. If either norm falls below the tolerance, the iterative process is deemed to have converged. The ℓ∞ norm is more strict, but both norms follow similar convergence paths when plotted against the iteration number.

In summary, the algorithm for the construction of a CVT in parallel consists of the following steps; the steps in Roman font are the same as in the algorithm of Sect. 2.1, whereas the italicized steps, as well as the do loop, are additions for CVT construction:

– define overlapping balls that cover the given region;
– while not converged do
– sort the current point set;
– construct in parallel the local Delaunay tessellations;
– remove from the Delaunay tessellations simplices which are not guaranteed to satisfy the circumsphere property;
– in parallel, determine the centroids of the Voronoi cells by integrating over simplices;
– move each generator to the corresponding cell centroid;
– test the convergence criteria;
– communicate new generator positions to neighbouring balls;
– end
– construct the Delaunay tessellation corresponding to the CVT as the union of the modified local Delaunay tessellations.
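The two stopping metrics used in the convergence test are straightforward to compute from the previous and current generator positions; a minimal sketch (function names are our own):

```python
import numpy as np

def l2_movement(x_old, x_new):
    """l2 metric: square root of the mean squared generator displacement."""
    d = np.linalg.norm(np.asarray(x_old) - np.asarray(x_new), axis=1)
    return float(np.sqrt(np.mean(d**2)))

def linf_movement(x_old, x_new):
    """l_inf metric: the single largest generator displacement."""
    d = np.linalg.norm(np.asarray(x_old) - np.asarray(x_new), axis=1)
    return float(d.max())
```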
A variety of point sets can be used to initiate Lloyd's method for CVT construction. The obvious choice is Monte Carlo points (Metropolis and Ulam, 1949). These can either be sampled uniformly over Ω or sampled according to the point-density function ρ(x). The latter approach usually reduces the number of iterations required for convergence. One can instead use a bisection method to build fine grids from a coarse grid (Heikes and Randall, 1995). To create a bisection grid, a coarse CVT is constructed using as few points as possible. After this coarse grid has converged, in R2, one would add the midpoint of every Voronoi cell edge or Delaunay triangle edge to the set of points. This causes the overall grid spacing to be reduced by roughly a factor of two in every cell, so that the refined point set is roughly four times larger.

3 Parallel Delaunay and Voronoi tessellation construction on the sphere

Replacing the planar constructs of Delaunay triangulations and Voronoi tessellations with their analogous components on the sphere, i.e. on the surface of a ball in R3, creates the spherical versions of Delaunay triangulations and Voronoi tessellations. The Euclidean distance metric is replaced by the geodesic distance, i.e. the shorter of the two lengths along the great circle joining two points on the sphere. Triangles and Voronoi cells are replaced with spherical triangles and spherical Voronoi cells whose boundaries are geodesic arcs.

To develop a parallel Delaunay and Voronoi tessellation construction algorithm on the sphere, we could try to parallelize the STRIPACK algorithm of Renka (1997). Here, however, we take a new approach that is based on a combination of stereographic projections (which we briefly discuss in Sect. 3.1) and the algorithm of Sect. 2.1 applied to planar domains. The serial version of the new Delaunay tessellation construction algorithm is therefore novel as well.
Model Dev., 6, 1353–1365, 2013 265 P , we then create N smaller point sets Pk , k = 1,...,N , each of w which are in the spherical cap Uk . Due to the overlap of the spherica develop a parallel algorithm for the construction of SDTs. The most important adaptation is to use up belonging multiple points sets Pk . D. W. Jacobsen et al.: to Parallel algorithms for Delaunay construction stereographic projections so that the actual construction of Delaunay triangulations is done on planar 1358 3.1 domains. Stereographic projections 260 We are now given a set of points P = {xj }nj=1 on a subset Ω of the sphere; Ω may be the whole Stereographic projections are special mappings between the sphere. Wetobegin by covering surface of a sphere and a plane tangent the sphere. Not onlyΩ by a set U = {Uk (tk ,rk )}N k=1 of N overlapping “umbrellas” or are stereographic projections conformal mappings, meaning spherical caps Uk , each of which is defined by a centre point tk and geodesic radius rk . See the left that angles are preserved, but they also preserve circularity, in Fig. For each spherical meaning that circles on theplot sphere are 4. mapped to circles on cap Uk , a connectivity (or neighbours) list is defined that consists the plane. Stereographic projections also map the interior of caps U ∈ U , i 6= k, that overlap with U . From the given point set of the indices of all the spherical i k these circles to the interior of the mapped circles (Bowers 2651999). P , we N smaller point sets Pk , k = 1,...,N , each of which consists of the points in P et al., 1998; Saalfeld, Forthen our create purposes, the importance of circularity preservation is that it implies that stereowhich are in the spherical cap Uk . Due to the overlap of the spherical caps Uk , a point in P may end graphic projections preserve the Delaunay circumcircle propbelonging to multiple points erty. This follows because up triangle circumcircles (along withsets Pk . (a) their interiors) are preserved. 
Therefore, Delaunay triangulations on the sphere are mapped to Delaunay triangulations on the plane and conversely. As a result, stereographic pro- Fig. 4. Left: An overlapping subdivision of the sphere into N = 12 sphe jections can be used to construct a Delaunay triangulation of a portion of the sphere by first constructing a Delaunay tri- the projections onto the sphere of the centroids of the 12 pentagons of the i angulation in the plane, which is a simpler and well-studied coloured ring represents the edge of cap of radius rk . Right: A 10,242 gene task. Without loss of generality, we assume that we are given parallel algorithm the unit sphere in R3 centred at the origin. We are also given a plane tangent to the sphere at the point t. The focus point f At this point in Sect. 2.2, we assigned each of the point subsets Pk is defined as the reflection of t about the centre of the sphere, i.e. f is the antipode of t. Let p denote a point on the unit N Delaunay tessellations in parallel. Before doing so in the spheri sphere. The stereographic projection of p onto the plane tan270 point sets Pk (b) onto the plane tangent at the corresponding point tk . T gent at t is the point q on the plane defined by q = sp + (1 − s)f , Fig. 2. (a) An overlapping subdivision of the sphere into N = 12 processor to construct a planar Delaunay triangulation instead of 1 spherical caps. The t k are caps. the projections onto the Fig. 4. Left: An overlapping subdivision of the sphere into cap N =centres 12 spherical The cap centres tk are . (4) where s=2 ekthe sphere ofconstruct the centroids ofplanar the 12 point pentagons of inscribed regeach k, we the set P = S(P ;t ), f · (f − p) k k where S(P the projections onto the sphere of the centroids of the 12 pentagons of thering inscribed regular icosahedron. ular icosahedron. 
Each coloured represents the edge of cap ofEach For our purposes, it is more convenient to define the projecprojection points in Pkgenerator onto theDelaunay plane tangent to the at th rkof .r(b) A 10 242 generator Delaunay triangulation by the parcoloured ring represents the edge of cap ofradius radius Right: A 10,242 triangulation bysphere the k .the allel algorithm. e tion relative to t rather than f . The simple substitution of parallel algorithm of the point sets Pk to a different processor and have each processo f = −t into Eq. (4) results in 275 triangulation of its point set Pek . Because stereographic projections p 1 defined consists the indices all the spherical caps eachthat of the point of subsets Pk to of a processor and constructed q = sp + (s − 1)t, whereAt thiss point = 2 in Sect. .2.2, we (5)assigned t · (p + t) preserve circumcircle property, it point is then Ui ∈ Uthe , i 6=Delaunay k, that overlap with Uk . From the given set just a sim N Delaunay tessellations in parallel. Before doing so in the spherical case, we project each of the P , we then create N smaller point sets Pk , k = 1, . . . , N , each The definitions Eqs. (4) and (5) can also be used to define triangle vertex and edge indices to define the spherical Delaunay tria ofthe which consists of the points P which are in step the spherical 270 point sets P onto the plane tangent at corresponding point tk .inThis additional allows each k k stereographic projections in R for k > 3. capinUkSect. . Due to the we overlap ofproceed the spherical caps Uk ,spherical a point triangles As 2.2, now to remove processor to construct a planar Delaunay triangulation instead of spherical one. Specifically, for in P may end up belonging to multiple points sets Pk . 3.2 Parallel algorithm for spherical Delaunay Delaunay with respect to the global set the Pof. 
3.2 Parallel algorithm for spherical Delaunay triangulation construction

A spherical Delaunay triangulation (SDT) is the Delaunay triangulation of a point set P defined on the surface of the sphere. We adapt the algorithm developed in Sect. 2.1 for planar domains to develop a parallel algorithm for the construction of SDTs. The most important adaptation is to use stereographic projections so that the actual construction of Delaunay triangulations is done on planar domains.

We are now given a set of points P = {xj}, j = 1, . . . , n, on a subset of the sphere; the subset may be the whole sphere. We begin by covering the given region with a set U = {Uk(tk, rk)}, k = 1, . . . , N, of N overlapping "umbrellas" or spherical caps Uk, each of which is defined by a centre point tk and a geodesic radius rk; see Fig. 2a. For each spherical cap Uk, a connectivity (or neighbours) list is defined that consists of the indices of all the spherical caps Ui ∈ U, i ≠ k, that overlap with Uk. From the given point set P, we then create N smaller point sets Pk, k = 1, . . . , N, each of which consists of the points in P that lie in the spherical cap Uk. Due to the overlap of the spherical caps Uk, a point in P may end up belonging to multiple point sets Pk.
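The sort of P into the overlapping subsets Pk amounts to a geodesic distance test against each cap. A small illustrative sketch (our own code, not the MPI-SCVT implementation; a real implementation would exploit the cap connectivity lists rather than testing every cap):

```python
import math

# Hypothetical sketch of sorting a point set P (unit vectors) into
# overlapping cap subsets P_k: a point belongs to P_k whenever its
# geodesic distance to the cap centre t_k is within the cap radius r_k.

def geodesic_distance(a, b):
    """Geodesic distance between unit vectors a and b: arccos(a . b)."""
    d = sum(x * y for x, y in zip(a, b))
    return math.acos(max(-1.0, min(1.0, d)))  # clamp for roundoff

def sort_into_caps(points, cap_centres, cap_radii):
    """Return the subsets P_k; because the caps overlap, a point may
    land in several subsets."""
    subsets = [[] for _ in cap_centres]
    for p in points:
        for k, (t_k, r_k) in enumerate(zip(cap_centres, cap_radii)):
            if geodesic_distance(p, t_k) <= r_k:
                subsets[k].append(p)
    return subsets

# Two antipodal caps with radii > pi/2 overlap near the equator,
# so an equatorial point falls into both subsets.
centres = [[0.0, 0.0, 1.0], [0.0, 0.0, -1.0]]
radii = [2.0, 2.0]  # radians
subsets = sort_into_caps([[1.0, 0.0, 0.0]], centres, radii)
print(len(subsets[0]), len(subsets[1]))  # -> 1 1
```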
At this point in Sect. 2.2, we assigned each of the point subsets Pk to a processor and constructed N Delaunay tessellations in parallel. Before doing so in the spherical case, we project each of the point sets Pk onto the plane tangent at the corresponding point tk. This additional step allows each processor to construct a planar Delaunay triangulation instead of a spherical one. Specifically, for each k, we construct the planar point set P̃k = S(Pk; tk), where S(Pk; tk) denotes the stereographic projection of the points in Pk onto the plane tangent to the sphere at the point tk. We then assign each of the point sets P̃k to a different processor and have each processor construct the planar Delaunay triangulation of its point set P̃k. Because stereographic projections preserve circularity and therefore preserve the Delaunay circumcircle property, it is then just a simple matter of keeping track of triangle vertex and edge indices to define the spherical Delaunay triangulations of the point sets Pk.

As in Sect. 2.1, we now proceed to remove spherical triangles that are not guaranteed to be Delaunay with respect to the global point set P. Instead of Eq. (1), in the spherical case, triangles must satisfy

cos⁻¹(tk · ĉi) + r̂i < rk

for that guarantee to be in effect, where ĉi and r̂i are the centre and radius, respectively, of the circumcircle of the i-th triangle in the local Delaunay tessellation of Pk. Once the possibly unwanted triangles are removed, we can, as in Sect. 2.1, transparently stitch together the modified Delaunay triangulations into a global one.

In summary, the algorithm for the construction of a spherical Delaunay tessellation in parallel consists of the following steps, where the italicized steps are the ones added to the algorithm of Sect. 2.1:

– define overlapping spherical caps Uk, k = 1, . . . , N, that cover the given region on the sphere;
– sort the given point set P into the subsets Pk, k = 1, . . . , N, each containing the points in P that are in Uk;
– for k = 1, . . . , N, construct in parallel the point set P̃k by stereographically projecting the points in Pk onto the plane tangent to the sphere at the point tk;
– for k = 1, . . . , N, construct in parallel the planar Delaunay triangulation T̃k of the point set P̃k;
– for k = 1, . . . , N, construct in parallel the spherical Delaunay triangulation Tk by mapping the planar Delaunay triangulation T̃k onto the sphere;
– for k = 1, . . . , N, construct in parallel the spherical triangulation T̂k by removing from Tk all simplices that are not guaranteed to satisfy the circumsphere property;
– construct the Delaunay tessellation of P as the union of the N modified spherical Delaunay tessellations T̂k, k = 1, . . . , N.

For an illustration of a spherical Delaunay triangulation determined by this algorithm, see Fig. 2b.

Because of the singularity in Eqs. (4) and (5) at p = f, the serial version of this algorithm, i.e. the case N = 1, can run into trouble whenever the antipode f of the tangency point t is in the spherical domain; of course, this is always the case if the domain is the whole sphere. In such cases, the number of subdomains used cannot be less than two, even if one is running the above algorithm in serial mode. The extent of overlap, i.e. the radii of the spherical caps, is set as the maximum geodesic distance from a cap centre tj to its neighbouring cap centres ti, where the geodesic distance is given by cos⁻¹(tj · ti).

Table 1. Description of generator count labels, along with the approximate grid resolution on an earth-sized sphere.

Generator Label    Generator Count    Approx. Resolution
coarse             40 962             120 km
medium             163 842            60 km
fine               2 621 442          15 km

3.3 Application to the construction of spherical centroidal Voronoi tessellations (SCVTs)

The parallel CVT construction algorithm of Sect. 2.2 is easily adapted for the construction of centroidal Voronoi tessellations on the sphere (SCVTs). Obviously, one uses the spherical version of the Delaunay triangulation algorithm as given in Sect. 3.2 instead of the version of Sect. 2.1.
One also has to deal with the fact that if one computes the centroid of a spherical Voronoi cell using Eq. (2), then that centroid does not lie on the sphere; a different definition of the centroid has to be used (Du et al., 2003a). Fortunately, it is shown in Du et al. (2003a) that the correct centroid can be determined by using Eq. (2), which yields a point inside the sphere, and then projecting that point onto the sphere (this was in fact shown to be true for general surfaces), so that the correct spherical centroid is simply the intersection of the sphere with the radial line through the point determined by Eq. (2).

4 Results

All results are created using a high-performance computing (HPC) cluster with 24 AMD Opteron model 6176 cores per node and 64 GB of RAM per node. Results are presented for three different generator counts, labelled "coarse", "medium", and "fine". Table 1 relates these labels to the actual generator counts as well as the approximate resolution on an earth-sized sphere.

We use STRIPACK (Renka, 1997), a serial Fortran 77 Association for Computing Machinery (ACM) Transactions on Mathematical Software (TOMS) algorithm that constructs Delaunay triangulations on a sphere, as a baseline for comparison with our approach. It is currently one of the few well-known spherical triangulation libraries available. The algorithm described in this paper has been implemented as a software package referred to as MPI-SCVT, which is freely available (Jacobsen and Womeldorff, 2013). It utilizes the Triangle software (Shewchuk, 1996) to perform planar triangulations of the specified point sets in the tangent planes of each partition of the domain.

4.1 Approach
Table 2 compares MPI-SCVT with STRIPACK for the construction of spherical Delaunay triangulations. The results given compare the cost to compute a single triangulation of a medium resolution global grid. They show that, for the computation of a full spherical triangulation, MPI-SCVT is roughly a factor of 7 slower than STRIPACK in serial mode, and a factor of 12 faster when using 96 processors.
4.2 Initial generator placement and sorting heuristics

We now examine the effects of initial generator placement and sorting heuristics with regard to SCVT generation. The results of this section are intended to guide decisions about which initial condition or sorting method we use later to generate SCVTs.

4.2.1 Initial generator placement

Because most climate models are shifting towards global high-resolution simulations, we begin by exploring the creation of a fine resolution grid. We compare SCVT grids constructed starting with uniform Monte Carlo and with bisection initial generator placement. The times for these grids to converge to a tolerance of 10⁻⁶ in the ℓ2 norm are presented. The 10⁻⁶ threshold is the strictest convergence level that the Monte Carlo grid can attain in a reasonable amount of time, and is therefore chosen as the convergence threshold for this study. The bisection grid, however, can converge well beyond this threshold in a similar amount of time.
Table 3 shows timing results for the parallel algorithm for the two different options for initial generator placement. Although the time spent converging a mesh with Monte Carlo initial placements is highly dependent on the initial point set, it is clear from this table that bisection initial conditions provide a significant speedup in the overall cost to generate an SCVT grid. Based on the results presented in Table 3, and unless otherwise specified, only bisection initial generator placements are used in subsequent experiments. Bisection initial conditions are not specific to this algorithm, and can be used to accelerate any SCVT grid generation method.

4.2.2 Sorting of points for assignment to processors

As described in Sect. 2.1.1, the MPI-SCVT code utilizes two methods for determining a region's point set. Whereas the dot product sort method is faster, it provides poor load balancing in variable resolution meshes, thus causing the iteration cost to be greater. Timings using both approaches are given in Table 4. Figure 3 shows the number of points that each processor has to triangulate on a per iteration basis. Timings presented are for a medium resolution grid, 42 regions, and 42 processors, and are averages over 3000 iterations. Three sets of timings are given. Timings labeled Uniform and ×8 both utilize the distance sort, but differ in the decomposition used. Uniform uses a quasi-uniform 42 region decomposition, whereas ×8 uses a variable resolution decomposition with an eight to one ratio in maximum and minimum grid sizes. Timings labeled Voronoi use the same decomposition approach as ×8, but use the Voronoi sort method rather than the distance sort method.

Fig. 3. Number of points each processor has to triangulate. (a), (b), and (c) use the sorting approaches corresponding to the three rows of Table 4. All plots were created with the same medium resolution grid.

Table 2. Comparison of STRIPACK with serial and parallel versions of MPI-SCVT for computation of a full spherical triangulation. Computed using the same initial conditions for a medium resolution grid.

Algorithm    Processors    Regions    Time (ms)    Speedup (STRIPACK/MPI-SCVT)
STRIPACK     1             1          563.87       Baseline
MPI-SCVT     1             2          3724.08      0.15
MPI-SCVT     2             2          2206.06      0.26
MPI-SCVT     96            96         45.75        12.33

Table 3. Timing results for MPI-SCVT with bisection and Monte Carlo initial generator placements and the speedup of bisection relative to Monte Carlo. Final mesh is fine resolution.
Timed Portion               Bisection (B)    Monte Carlo (MC)    Speedup MC/B
Total Time (ms)             112 890          358 003 000         3171.25
Triangulation Time (ms)     10 887           48 034 600          4412.11
Integration Time (ms)       30 893           66 374 400          2148.52
Communication Time (ms)     70 878           110 958 000         1565.47

Table 4. Timings based on the sorting approach used. Uniform uses a coarse quasi-uniform SCVT to define region centres and their associated radii, and sorts using the maximum distance between centres and neighbouring centres. ×8 uses a coarse SCVT with an 8 to 1 ratio in grid sizes to effect the same type of sorting. Voronoi uses the same ×8 coarse SCVT grid to define region centres and sorts using the neighbouring Voronoi cell-based sort.

Sorting Approach    Triangulation (ms)    Integration (ms)    Communication (ms)    Iteration (ms)    Speedup
Uniform             14.53                 37.62               1314.22               1396.53           Baseline
×8                  35.84                 92.91               865.32                995.04            1.40
Voronoi             16.71                 86.24               177.55                305.85            4.56

Timings presented in Table 4 are taken relative to processor number 0, and as can be seen in Fig. 3a, processor 0 has a very small load, which causes the majority of its iteration time to be spent waiting for the processors with large loads to catch up; this idling time is included in the Communication column of the table. Table 4 and Fig.
3 show there is a significant advantage to the Voronoi-based sort method in that it not only speeds up the overall cost per iteration, but also provides more balanced loads across the processors. Based on these results, the Voronoi-based sorting approach is used for all subsequent results.

4.3 SCVT generation

We now provide both quasi-uniform and variable resolution SCVT generation results. The major contributor to the differences in computational performance arises as a result of load balancing differences. Results in this section make use of the density function

ρ(xi) = ((1 − γ)/2) [tanh((β − |xc − xi|)/α) + 1] + γ, (6)

which is visualized in Fig. 4, where xi is constrained to lie on the surface of the unit sphere. This function results in a relatively large value of ρ within a distance β of the point xc, where β is measured in radians and xc is also constrained to lie on the surface of the sphere. The function transitions to relatively small values of ρ across a radian distance of α. The distance between xc and xi is computed as |xc − xi| = cos⁻¹(xc · xi), with a range from 0 to π.

Fig. 4. Density function that creates a grid with resolutions that differ by a factor of 8 between the coarse and the fine regions. The maximum value of the density function is 1 whereas the minimum value is (1/8)⁴.

Figure 5 shows an example grid created using this density function, with xc set to be λc = 3π/2, φc = π/6, where λ denotes longitude and φ latitude, γ = (1/8)⁴, β = π/6, and α = 0.20, with 10 242 generators. This set of parameters used in Eq. (6) is referred to as ×8. The quasi-uniform version is referred to as ×1.
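A small sketch of evaluating Eq. (6) (our own code, not part of MPI-SCVT, using the ×8 parameter values quoted above):

```python
import math

# Sketch of the density function of Eq. (6):
# rho(x) = ((1 - gamma)/2) * (tanh((beta - |x_c - x|)/alpha) + 1) + gamma,
# where |x_c - x| = arccos(x_c . x) is the geodesic distance (0 to pi).
# Parameter values follow the "x8" set quoted in the text.

GAMMA = (1.0 / 8.0) ** 4
BETA = math.pi / 6.0
ALPHA = 0.20

def density(x, x_c, alpha=ALPHA, beta=BETA, gamma=GAMMA):
    d = math.acos(max(-1.0, min(1.0, sum(a * b for a, b in zip(x, x_c)))))
    return 0.5 * (1.0 - gamma) * (math.tanh((beta - d) / alpha) + 1.0) + gamma

x_c = [1.0, 0.0, 0.0]
print(density(x_c, x_c))                # close to 1 inside the fine region
print(density([-1.0, 0.0, 0.0], x_c))   # close to gamma = (1/8)^4 far away
```

The tanh profile gives the smooth transition region: ρ falls from near 1 to near γ over a geodesic distance of order α around the circle of radius β about xc.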
Theto function denotes longitude and φ latitude, γ = (1/8)4 , β = π/6, and www.geosci-model-dev.net/6/1353/2013/ Geosci. Model Dev., 6, 1353–1365, 2013 8 1362 D. W. Jacobsen et (a) al.:Coarse Parallel algorithms for Delaunay construction region (b) Transition region (a) Coarse region (c) Fine region (b) Transition region Fig. 7. using Different of thefunction same variable resolution grid created using the densityoffunction (6). Figure 7(a) Fig. 5. Different views of the same variable resolution grid created theviews density Eq. (6). (a) shows the coarse region the grid, Fig.shows 5. Different views ofregion, the same resolution created using the density function Eq. (6). (a) shows the coarse region of the grid, (b) the transition andvariable (c) shows the fine grid region. shows the coarse region of the grid, 7(b) shows the transition region, and 7(c) shows the fine region. (b) shows the transition region, and (c) shows the fine region. In Sect. 4.1, we showed that MPI-SCVT performs to STRIPACK when computing a Table 5. Comparisons of SCVT generators using STRIPACK, serial MPI-SCVT, and parallel MPI-SCVT. Costcomparably per triangulation and iteration Table 2. Comparison ofisSTRIPACK with serial andper parallel versions of MPI-SCVT for computation of a full for spherical triangulation. Comare presented. Speedup compared using the cost iteration. Computed using the same initial conditions a medium resolution 385 single full triangulation. However, computing a full triangulation is only part of the grid. story. For SCVT puted using the same initial conditions for a medium resolution grid. Variable resolution results are presented for MPI-SCVT only. generation, a triangulation needs to be computed at every iteration of Lloyd’s algorithm as described Fig. 7. Sect. When using STRIPACK, the full triangulation needs to be computed at every iteration, Algorithm Processors RegionsinTriangulation Time2.3. 
STRIPACK        1     1      563.87     3009.74    Baseline
MPI-SCVT ×1     1     2      3480.43    9485.78    0.32
MPI-SCVT ×1     2     2      2114.38    5390.82    0.56
MPI-SCVT ×1     96    96     14.23      207.26     14.52
MPI-SCVT ×8     96    96     16.71      305.85     9.84
Fig. 6. Timings for STRIPACK-based SCVT construction for various generator counts. Red solid lines represent the time spent in STRIPACK computing a triangulation, whereas green dashed lines represent the time spent integrating the Voronoi cells outside of STRIPACK in one iteration of Lloyd's algorithm.

Fig. 7. Timings for various portions of MPI-SCVT using 2 processors and 2 regions. As the problem size increases, the slope for both triangulation (red solid) and integration (green dashed) remains roughly constant.
In Sect. 4.1, we showed that MPI-SCVT performs comparably to STRIPACK when computing a single full triangulation. However, computing a full triangulation is only part of the story. For SCVT generation, a triangulation needs to be computed at every iteration of Lloyd's algorithm, as described in Sect. 2.2. When using STRIPACK, the full triangulation needs to be computed at every iteration, but with MPI-SCVT only each regional triangulation needs to be computed at each iteration. This means the merge step can be skipped, resulting in significantly cheaper triangulations.
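For illustration only, the per-iteration structure of Lloyd's algorithm described above can be caricatured in serial with point-sample quadrature. This is our own sketch, not the MPI-SCVT code: it replaces the exact Voronoi-cell integrals with averages of sample points, followed by the radial projection onto the sphere described in Sect. 3.3.

```python
import math, random

# Serial caricature of one Lloyd iteration on the sphere: assign
# quadrature points to their nearest generator (its Voronoi cell),
# average them (the Eq. 2 analogue), and project the average back
# onto the sphere radially to obtain the spherical centroid.

def normalize(v):
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def lloyd_step(generators, samples):
    sums = [[0.0, 0.0, 0.0] for _ in generators]
    counts = [0] * len(generators)
    for p in samples:
        # largest dot product = smallest geodesic distance for unit vectors
        k = max(range(len(generators)),
                key=lambda i: sum(a * b for a, b in zip(p, generators[i])))
        sums[k] = [s + x for s, x in zip(sums[k], p)]
        counts[k] += 1
    # radial projection of each cell average back onto the sphere
    return [normalize(s) if c else g
            for s, c, g in zip(sums, counts, generators)]

random.seed(0)
samples = [normalize([random.gauss(0, 1) for _ in range(3)])
           for _ in range(2000)]
gens = samples[:8]
for _ in range(20):
    gens = lloyd_step(gens, samples)
print(len(gens))  # -> 8
```

In MPI-SCVT the assignment and integration are done per region with a proper Delaunay/Voronoi construction, so only the regional triangulations are rebuilt each iteration and no global merge is needed.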
Figure 6 shows the performance of a STRIPACK-based SCVT construction as the number of generators is increased through bisection, as mentioned in Sect. 2.2. Values are averages over 2000 iterations. The green dashed line represents the portion of the code that computes the centroids of the Voronoi regions, whereas the red solid line represents the portion of the code that computes the Delaunay triangulation.

Table 5 compares STRIPACK with the triangulation routine in MPI-SCVT that is called on every iteration. The results presented relative to MPI-SCVT are averages over 2000 iterations.

As a comparison with Fig. 6, in Figs. 7 and 8 we present timings made for MPI-SCVT for two and 96 regions and processors, respectively, as the problem size, i.e. the number of generators, increases. A minimum of two processors is used because the stereographic projection used in MPI-SCVT has a singularity at the focus point. Whereas Fig. 7 shows performance similar to that of STRIPACK (see Fig. 6), Fig. 8 shows roughly two orders of magnitude faster performance relative to STRIPACK. As mentioned previously, this is only the case when creating SCVTs, as the full triangulation is no longer required when computing an SCVT in parallel.

Fig. 8. Same information as in Fig. 7 but for 96 processors and 96 regions.

Fig. 9. Timing results for MPI-SCVT vs. number of processors for three different problem sizes. Red solid lines represent the cost of computing a triangulation, green dashed lines represent the cost of integrating all Voronoi cells, and blue dotted lines represent the cost of communicating each region's updated point set to its neighbours.

4.4 General algorithm performance

This section is intended to showcase some general performance results of MPI-SCVT. Figure 9a–c show the timings for a coarse resolution grid, a medium resolution grid, and a fine resolution grid, respectively.

To assess the overall performance of the MPI-SCVT algorithm, scalability results are presented in Fig. 10. Figure 10a shows that this algorithm can easily under-saturate processors; when this happens, communication ends up dominating the overall runtime for the algorithm, which is seen in Fig. 9a; as a result, scalability ends up being sub-linear. As the number of generators increases (as seen in Fig. 10b and c), the limit for being under-saturated is higher.

Currently in the algorithm, communications are done asynchronously using non-blocking sends and receives. Also, overall communications are reduced by only communicating with a region's neighbours. This is possible because points can only move within a region radius on any two subsequent iterations, and because of this can only move into another region which is overlapping the current region. More efficiency gains could be realized by improving the communication and integration algorithms. The addition of shared memory parallelization to the integration routines might improve overall performance, as the integration routines are embarrassingly parallel. Modification of the data structures used could also enable the generation of ultra-high resolution meshes (100 M generators). In principle, because all of the computation is local, this algorithm should scale linearly very well up to hundreds if not thousands of processors.

Fig. 10. Scalability results based on number of generators. Green is a linear reference whereas red is the speedup computed using the parallel version of MPI-SCVT compared to a serial version.

5 Summary

A novel technique for the parallel construction of Delaunay triangulations is presented, utilizing unique domain decomposition techniques combined with stereographic projections. This parallel algorithm can be applied to the generation of planar and spherical centroidal Voronoi tessellations. Results were presented for the generation of spherical centroidal Voronoi tessellations, with comparisons to STRIPACK, a well-known algorithm for the computation of spherical Delaunay triangulations. The algorithm presented in the paper (MPI-SCVT) shows slower performance than STRIPACK when computing a single triangulation in serial, and faster performance when using roughly 100 processors. When paired with an SCVT generator, the algorithm shows additional speedup relative to a STRIPACK-based SCVT generator.

Acknowledgements. We would like to thank Geoff Womeldorff and Michael Duda for many useful discussions. The work of Doug Jacobsen, Max Gunzburger, John Burkardt, and Janet Peterson was supported by the US Department of Energy under grants number DE-SC0002624 and DE-FG02-07ER64432.

Edited by: R. Redler

References

Amato, N. and Preparata, F.: An NC parallel 3D convex hull algorithm, in: Proceedings of the ninth annual symposium on Computational geometry – SCG '93, 289–297, May 1993.
Batista, V., Millman, D., Pion, S., and Singler, J.: Parallel geometric algorithms for multi-core computers, Comp. Geom.-Theor. Appl., 43, 663–677, 2010.
Bowers, P., Diets, W., and Keeling, S.: Fast algorithms for generating Delaunay interpolation elements for domain decomposition, available at: http://www.math.fsu.edu/~aluffi/archive/paper77.ps.gz (last access: May 2011), 1998.
Chernikov, A. and Chrisochoides, N.: Algorithm 872: parallel 2D constrained Delaunay mesh generation, ACM T. Math. Software, 34, 6:1–6:20, 2008.
Cignoni, P., Montani, C., and Scopigno, R.: DeWall: a fast divide and conquer Delaunay triangulation algorithm in E-d, Comput. Aided Design, 30, 333–341, 1998.
Du, Q. and Gunzburger, M.: Grid generation and optimization based on centroidal Voronoi tessellations, Appl. Math. Comput., 133, 591–607, 2002.
Du, Q., Faber, V., and Gunzburger, M.: Centroidal Voronoi tessellations: applications and algorithms, SIAM Rev., 41, 637–676, 1999.
Du, Q., Ju, L., and Gunzburger, M.: Constrained centroidal Voronoi tessellations for surfaces, SIAM J. Sci. Comput., 24, 1488–1506, 2003a.
Du, Q., Ju, L., and Gunzburger, M.: Voronoi-based finite volume methods, optimal Voronoi meshes, and PDEs on the sphere, Comput. Method. Appl. M., 192, 3933–3957, 2003b.
Du, Q., Emelianenko, M., and Ju, L.: Convergence of the Lloyd algorithm for computing centroidal Voronoi tessellations, SIAM J. Numer. Anal., 44, 102–119, 2006a.
Du, Q., Ju, L., and Gunzburger, M.: Adaptive finite element methods for elliptic PDEs based on conforming centroidal Voronoi Delaunay triangulations, SIAM J. Sci. Comput., 28, 2023–2053, 2006b.
Heikes, R. and Randall, D.: Numerical integration of the shallow-water equations on a twisted icosahedral grid. Part I: Basic design and results of tests, Mon. Weather Rev., 123, 1862–1880, 1995.
Jacobsen, D. and Womeldorff, G.: MPI-SCVT Source Code, available at: https://github.com/douglasjacobsen/MPI-SCVT, 2013.
Ju, L., Ringler, T., and Gunzburger, M.: A multi-resolution method for climate system modeling: application of spherical centroidal Voronoi tessellations, Ocean Dynam., 58, 475–498, 2008.
Ju, L., Ringler, T., and Gunzburger, M.: Voronoi tessellations and their application to climate and global modeling, Ocean Dynam., 58, 313–342, 2011.
Lambrechts, J., Comblen, R., Legat, V., Geuzaine, C., and Remacle, J.-F.: Multiscale mesh generation on the sphere, Ocean Dynam., 58, 461–473, doi:10.1007/s10236-008-0148-3, 2008.
Lloyd, S.: Least squares quantization in PCM, IEEE T. Inform. Theory, 28, 129–137, 1982.
Metropolis, N. and Ulam, S.: The Monte Carlo method, J. Am. Stat. Assoc., 44, 335–341, 1949.
Nguyen, H., Burkardt, J., Gunzburger, M., Ju, L., and Saka, Y.: Constrained CVT meshes and a comparison of triangular mesh generators, Ocean Dynam., 42, 1–19, 2008.
Okabe, A., Boots, B., Sugihara, K., and Chiu, S.: Spatial Tessellations: Concepts and Applications of Voronoi Diagrams, John Wiley, Chichester, West Sussex, England, 2000.
Pain, C. C., Piggot, M. D., Goddard, A. J. H., Fang, F., Gorman, G. J., Marshall, D. P., Eaton, M. D., Power, P. W., and de Oliveira, C. R. E.: Three-dimensional unstructured mesh ocean modelling, Ocean Model., 10, 5–33, 2005.
Renka, R.: Algorithm 772: STRIPACK: Delaunay triangulation and Voronoi diagram on the surface of a sphere, ACM T. Math. Software, 23, 416–434, 1997.
Ringler, T., Petersen, M., Higdon, R. L., Jacobsen, D., Jones, P. W., and Maltrud, M.: A multi-resolution approach to global ocean modeling, Ocean Model., accepted, doi:10.1016/j.ocemod.2013.04.010, 2013.
Ringler, T. D., Jacobsen, D., Gunzburger, M., Ju, L., Duda, M., and Skamarock, W.: Exploring a multiresolution modeling approach within the shallow-water equations, Mon. Weather Rev., 139, 3348–3368, doi:10.1175/MWR-D-10-05049.1, 2011.
Saalfeld, A.: Delaunay triangulations and stereographic projections, Cartogr. Geogr. Inform., 26, 289–296, 1999.
Shewchuk, J.: Triangle: engineering a 2D quality mesh generator and Delaunay triangulator, Applied Computational Geometry: Towards Geometric Engineering, 1148, 203–222, 1996.
Weller, H., Weller, H., and Fournier, A.: Voronoi, Delaunay, and block-structured mesh refinement for solution of the shallow-water equations on the sphere, Mon. Weather Rev., 137, 4208–4224, 2009.
Zhou, J., Deng, X., and Dymond, P.: A 2-D parallel convex hull algorithm with optimal communication phases, in: Proceedings 11th International Parallel Processing Symposium, April 1997, University of Geneva, Geneva, Switzerland, 596–602, 2001.
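As a supplementary illustration of the stereographic projection technique (Saalfeld, 1999) that underlies the spherical algorithm summarized above, the following serial sketch projects points on a spherical cap onto the tangent plane at a region's center and hands them to a standard planar Delaunay routine. This is not the MPI-SCVT implementation; the use of SciPy's `scipy.spatial.Delaunay` for the planar step, and the helper names, are assumptions for illustration only. Because the stereographic projection maps circles on the sphere to circles in the plane, the planar Delaunay triangulation of the projected points coincides with the spherical Delaunay triangulation restricted to the cap.

```python
import numpy as np
from scipy.spatial import Delaunay  # planar Delaunay triangulator


def stereographic(points, center):
    """Project unit-sphere points onto the tangent plane at `center`,
    projecting from the antipode of `center` (hypothetical helper)."""
    center = center / np.linalg.norm(center)
    # Build an orthonormal basis (e1, e2, center) so that `center`
    # plays the role of the north pole.
    tmp = np.array([1.0, 0.0, 0.0])
    if abs(np.dot(tmp, center)) > 0.9:
        tmp = np.array([0.0, 1.0, 0.0])
    e1 = np.cross(center, tmp)
    e1 /= np.linalg.norm(e1)
    e2 = np.cross(center, e1)
    x, y, z = points @ e1, points @ e2, points @ center
    # Stereographic projection from the antipode (z = -1) of the
    # rotated frame onto the plane z = 0; circle-preserving.
    return np.column_stack((x / (1.0 + z), y / (1.0 + z)))


# Random points restricted to a spherical cap around the north pole.
rng = np.random.default_rng(0)
p = rng.normal(size=(200, 3))
p /= np.linalg.norm(p, axis=1, keepdims=True)
p = p[p[:, 2] > 0.2]

flat = stereographic(p, np.array([0.0, 0.0, 1.0]))
tri = Delaunay(flat)  # planar Delaunay of the projected cap
print(tri.simplices.shape)  # (n_triangles, 3)
```

In the parallel algorithm each processor would apply such a projection to its own (overlapping) region and triangulate locally; the overlap guarantees the local pieces stitch into the global spherical Delaunay triangulation.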