Force-Directed Graph Layout with Parallel Haskell

Rob Marshall
MEng, BSc Computer Science
2010/2011
The candidate confirms that the work submitted is their own and the appropriate credit has been given where reference has been made to the work of others.

I understand that failure to attribute material which is obtained from another source may be considered as plagiarism.
(Signature of student)
Summary
This project aims to evaluate techniques for parallel programming in the functional programming language Haskell, specifically their applicability to the problem of graph layout using force-directed algorithms. An implementation of a simple force-based graph drawing algorithm was created, parallelised in various ways, and evaluated to determine the performance gains achieved.
Anowledgements
I would like to thank my project supervisor, Dr David Duke, for essential advice and support
throughout the course of the project, and my assessor, Dr Haiko Müller, for his feedba in response
to my mid-project report and progress meeting. anks also to Peter Wortmann for tips on Haskell
performance analysis.
More generally, I'd like to thank my family and friends for their unending patience, particularly
during the writing stages of this report, where several late nights were had.
Contents

1 Introduction
  1.1 The Problem
  1.2 Aims and Requirements
  1.3 Schedule
  1.4 Methodology
  1.5 Evaluation
  1.6 Structure of the Report

2 Background Research
  2.1 Graph Drawing
    2.1.1 Trees
    2.1.2 Force-Directed Algorithms
  2.2 Parallelism
    2.2.1 Architectures
    2.2.2 Analysis
  2.3 Haskell
    2.3.1 Non-strictness and Lazy Evaluation
    2.3.2 Monads
    2.3.3 Parallel Evaluation
    2.3.4 Data Parallelism
  2.4 Existing Work on Graphs in Haskell
    2.4.1 Graph Representation
    2.4.2 Graph Layout
      2.4.2.1 graphviz
      2.4.2.2 graph-rewriting-layout
      2.4.2.3 Communicating Haskell Processes

3 Initial Implementation
  3.1 Graph Representation
  3.2 Force Calculation
  3.3 Main Iteration
  3.4 Visualisation
  3.5 Introducing Parallelism

4 Analysis and Improvement
  4.1 ThreadScope
  4.2 Switching from par and pseq to Strategies
  4.3 Pre-generating the Force List
  4.4 Unboxing
  4.5 List Chunks
  4.6 Forcing the Initial Conditions
  4.7 Arrays
  4.8 Iterating in the IO Monad

5 Alternative Implementation
  5.1 Changes Applied
    5.1.1 Type Definitions
    5.1.2 Force List
    5.1.3 Iteration
    5.1.4 Force Calculation
  5.2 Initial Testing

6 Evaluation
  6.1 Performance of Implementations
    6.1.1 Testing Environments
    6.1.2 Impact of Different Graphs
    6.1.3 Impact of Problem Size and Iteration Count
    6.1.4 Comparison Implementations
  6.2 Conclusion
  6.3 Potential Further Work

Bibliography

A Personal Reflection
B Materials Used
C Supplemental Source Code
  C.1 Iteration Method
  C.2 GTK+/Cairo Visualisation
  C.3 Iteration with Accelerate
D Supplemental Results
Chapter 1
Introduction
1.1 The Problem
A simple graph is an abstract representation of a set of objects, some pairs of which are related. The set of objects is referred to as the graph's vertices, and the relationships are known as edges. A graph drawing is a visual representation of a graph's structure, most often a two-dimensional diagram made up of circles and lines representing vertices and edges respectively. Figure 1.1 shows that a graph may be represented in different ways without changing its meaning.

As a data structure, the graph is fundamental to computer science. Many significant computational problems, such as scheduling and optimisation, can be defined in terms of graphs[7]. Furthermore, graph drawings are widely used as a means to help design and implement linked structures such as database diagrams and electronic circuits.
Methods for graph drawing, particularly force-directed algorithms, are reviewed in Section 2.1. Several factors are discussed which make a particular graph layout more aesthetically pleasing. Force-directed algorithms are considered a reasonably effective means of balancing some of these. Such algorithms generally require a large amount of computation, and hence can take a long time to run, but it appears that it may be possible to make performance gains by running algorithms across more than one processor.

Figure 1.1: Three diagrams, each representing the same underlying graph.
The popular functional programming language Haskell, discussed in Section 2.3, is the focus of various groups working on open research areas. As such, many experimental syntax extensions and libraries exist. In recent years, multiple techniques have been developed to support the execution of Haskell programs across multiple processor cores.
1.2 Aims and Requirements
The main aim of this project is to determine how applicable the parallel programming capabilities provided by Haskell are to force-directed graph layout algorithms. Specifically, to determine what level of performance increase can be achieved, and to reflect on the experience of applying these techniques.

Accordingly, the minimum requirements for the project were to produce an appropriate sequential implementation, and to adapt it to run in parallel on a multicore processor.

The creation of an implementation using an alternative parallel programming paradigm was considered an optional extension.
[Gantt chart spanning weeks 1–12 of term plus the Easter period, with rows for background reading, sequential implementation, multicore implementation, alternate parallel implementation, evaluation, and report writing, and markers for the mid-project and final report dates.]

Figure 1.2: Initial project schedule
1.3 Sedule
Figure 1.2 shows the project sedule as initially planned, spliing work over 11.5 weeks of term
time and 4 weeks of the Easter period. Some baground reading had taken place before the formal
start of the project. Work on the project largely kept to this sedule. Some minor improvements to
the multicore implementation were made in week 8. e configuration required to start work on
the “alternate” implementation took longer than expected, causing the end of this implementation,
the final evaluation, and the start of the final report writing to be delayed by around a week.
1.4 Methodology
The project followed an iterative design process, wherein the first task was to produce an initial implementation, using simple pure Haskell, effectively forming an executable specification of the algorithm. There then followed a cycle of analysis and refinement, aiming to identify potential bottlenecks and improve performance, in particular to increase the utilisation of processor cores.

Casual measurements of running times and calculations of speedup, discussed below, were made after each change in order to decide whether to keep or revert the change. This improvement process was then repeated until all apparent bottlenecks had been eliminated.
1.5 Evaluation
In terms of the performance of a computer program, the simplest metric to compare is the overall
running time of the program. As the number of processors in use increases, it is intended that the
running time decreases. Speedup is one way to interpret the degree that applying parallelism improves performance. If Tp is the running time across p processors, the speedup across p processors
is defined by Equation 1.1. For a "perfect" parallel algorithm, Sp = p, indicating linear speedup.

    Sp = T1 / Tp                          (1.1)

    Ep = Sp / p = T1 / (p · Tp)           (1.2)
This project will mostly refer to speedup, though efficiency is another commonly used metric, found by dividing speedup by the number of processors as in Equation 1.2. Efficiency values for different numbers of processors are easily compared, as the "perfect" value is always 1.
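As a hypothetical illustration (these figures are not project measurements): if a program takes 40 s on one core and 16 s on four cores, then

    S4 = 40 / 16 = 2.5,   E4 = 2.5 / 4 = 0.625

i.e. the parallel version is 2.5 times faster, but each core is only 62.5% utilised relative to the ideal.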
Quantitative evaluation of this project will compare speedup values for the final implementations, given different workloads. Various types of graph structure (with differing numbers of edges
per vertex) will be tested, and graph size and number of iterations will be varied to determine what
effect these factors have on running time and speedup.
The relative difficulty of development and maintenance of different approaches will also be
briefly discussed.
1.6 Structure of the Report
The remainder of this report is structured as follows:
• Chapter 2 reviews literature relating to graph drawing and Haskell, which forms the background to the project.

• Chapter 3 describes the initial implementation of a spring embedder in sequential Haskell, and a first attempt at introduction of multicore parallelism.

• Chapter 4 details the process of analysis and iterative improvement of the multicore solution.

• Chapter 5 describes the production of an alternate implementation using the Accelerate library.

• Chapter 6 evaluates the produced solutions and concludes the report.

• Appendix A provides personal reflection on the project experience.

• Appendix B provides a list of third-party materials used in the project.

• Appendices C and D provide supplemental code listings and results referenced in earlier chapters.
Chapter 2
Baground Resear
2.1 Graph Drawing
Graphs are most oen rendered by drawing a circle for ea vertex, and for ea edge, a straight
line, arc, or polyline connecting the neighbouring vertices. As the layout of the graph is semantically irrelevant, finding the best positions for the vertices is a question of aesthetics. e number
of possible layouts being effectively infinite, this is a continuous optimisation problem.
A graph drawing algorithm is defined in [8] as one whi takes a graph as input and outputs
primitive drawing instructions. However, for our purposes it is more useful to define the algorithms' output as a list of geometric positions corresponding to the vertices of the input graph, with
the understanding that this dataset is stored and/or displayed in some implementation-specific
manner. e following goals are suggested for a good algorithm in [8] and [4]:
• Show symmetry where it is present in the graph.
• Minimise edge crossings.
• Minimise bends in edges (where polylines or arcs are used).
• Make edges similar lengths.
• Spread out vertices evenly.
• Maximise angles between edges incident to a common vertex.
As is discussed in [8], these (often competing) aesthetics form an NP-hard optimisation problem, so approaches to graph drawing are generally heuristic in nature. A good algorithm makes a
trade-off between quality of the result and computation time.
2.1.1 Trees
A tree is a connected graph that contains no cycles. As such, drawing a tree is a simpler task than drawing a general graph. Whilst there are still many sensible ways of drawing a given tree, and optimising
all of the factors above remains a complex problem, some concerns may be easily satisfied. For
example, with no cycles, it is trivial to make every edge the same length, if other concerns are
ignored.
Figure 2.1: A simple tree laid out with layer-based and radial algorithms.
Another reason trees are relatively simple to lay out is that, directly following from their lack of cycles, all trees are planar, and so can be laid out in two dimensions without edge crossings. Partitioning a tree by cutting one edge always results in two trees, which can be laid out separately and put back together: a divide-and-conquer approach.

Simple algorithms to produce layer-based and radial drawings from rooted trees, as in Figure 2.1, are described in [26] and [2] respectively.
2.1.2 Force-Directed Algorithms
Various overaring strategies have been used for the layout of graphs, for example the vertex
addition method described in [25]. Most popular for general graphs are algorithms based on forcedirected simulation[8]. ese algorithms use iterative methods in whi vertices are moved based
on real-world physical formulae.
Eades, in [9], introduced the spring embedder algorithm. This imitates a physical system, whereby it is first imagined that each vertex is a particle with the same nonzero electrostatic charge, and thus all vertices repel each other according to Coulomb's inverse square law:

    F = ke · q1 · q2 / r²                 (2.1)

Here ke · q1 · q2 is the product of a constant and the charges on the vertices in question, and can be treated as a single constant.
Secondly, to cause connected vertices in the graph to be pulled together spatially, vertices are
imagined as metal rings, and edges as straight springs of common natural length linking them. In
physics, ideal springs behave according to Hooke's law:
    F = −kx                               (2.2)
Here k is the spring constant, and x is the extension of the spring compared to its natural length.
Applied to the graph, this means that neighbouring vertices that are too far away will move towards each other, and those too close together will move further away. Eades elects to use logarithmic springs instead, stating that "Experience shows that Hookes Law (linear) springs are too strong when the vertices are far apart"[9].
Figure 2.2 shows some of the forces in the spring embedder model. Each arrow in the diagram is a force acting on the shaded vertex, positioned next to the vertex or edge causing that force. The sizes and angles of the arrows indicate the approximate magnitude and direction of the forces.

Vertices are initially placed in random positions. In one iteration of the algorithm, the force on each vertex is calculated, and then each vertex is moved by a proportion of its force. It is notable that here a "force" is used to calculate a velocity, whereas in physical reality the application of forces causes acceleration: this is to avoid undesirable dynamic equilibria, such as orbits[13].
Figure 2.2: A pictorial representation of the spring embedder.
A naïve implementation of a spring embedder executing k iterations has a running time on the order of kn², where n is the number of vertices in the graph. Whilst Eades used a fixed number of iterations, experience indicates the number of iterations required to produce a reasonable result increases with n, therefore the running time is best thought of as O(n³).
Various other algorithms have been proposed, citing Eades as a basis. Fruchterman and Reingold, in [13], proposed modifications to Eades's model to improve performance, taking their initial inspiration from the strong nuclear force, which can be attractive or repulsive depending on the distance between particles. The model deviates from physics by deciding whether to apply the attractive force based on whether or not the pair of vertices in question are neighbours. All vertex pairs are affected by a repulsive force which is very strong if the distance between them is smaller than a specified "ideal distance".
The concept of cooling is also introduced, whereby the amount a vertex moves in proportion to the force acting on it is slightly decreased with each iteration, as an aid to convergence. It should also be noted that [13] rejects the logarithmic springs described above as "inefficient to compute".
algorithm SPRING(G:graph);
  place vertices of G in random locations;
  repeat M times
    calculate the force on each vertex;
    move the vertex (arbitrary constant) × (force on vertex)
  draw graph on CRT or plotter.

Figure 2.3: Pseudocode for the spring embedder, from [9].
Another alternative is to use only springs, as in [16]. Here, a spring is imagined between every pair of vertices in the graph. The natural length of each spring is set based on the "graph theoretic" distance between the two vertices — in an unweighted graph, the shortest path. Hooke's law is then applied, but instead of forces being translated directly to movements, a formula for the total energy in the graph is derived and then minimised. This is achieved by moving one vertex at a time, simultaneously solving two partial differential equations (for the x and y co-ordinates). This method is, understandably, computationally expensive, but the initial idea is intuitive and the algorithm adapts well to weighted graphs.
e G algorithm, described in [12], is a mu more complex model than any of those described above. e forces on ea vertex are a combination of repulsive and aractive forces based
on an ideal edge length as already seen; a gravitational force whi slightly aracts all vertices
to the centre of the graph; and a small random disturbance. In addition to a global temperature
for the graph, ea vertex has a local temperature value. e local temperature of ea vertex is
adjusted to inhibit rotation, where vertices move around ea other without anging the quality
of the layout; and oscillation, where a vertex repeatedly moves in the opposite direction to its
previous movement.
Only one vertex is moved in ea iteration of the G algorithm. Iterations are groups into
rounds, and in ea round all vertices are updated in a random order. is algorithm was found
to perform beer than the algorithms above on large graphs.
Some of these algorithms involve many repeated force calculations that can be performed
in any order in the course of an iteration. This suggests that it may be productive to introduce
parallelism in order to decrease running time.
2.2 Parallelism
Parallelism is the execution of a program across multiple processors. As the physical limits of
single processor performance are reaed, parallel programming is increasingly being turned to
as a means of continuing performance improvements.
2.2.1 Aritectures
In [11], Flynn describes a taxonomy of computer processing architectures. Two of the structures described are especially relevant to this project:

• MIMD (multiple-instruction stream–multiple-data stream) refers to a machine made up of multiple independent processors that are capable of each performing different tasks on different data at the same time. Standard multicore CPUs are an example of a shared memory MIMD system.

• SIMD (single-instruction stream–multiple-data stream) refers to a machine where processors all perform the same task simultaneously, on different data. Various CPU architectures support SIMD extensions, such as MMX and SSE for x86, but perhaps the best modern examples of SIMD architectures are graphics processing units (GPUs).
2.2.2 Analysis
As discussed in Section 1.5, a simple means to assess parallel performance is speedup. Amdahl, in [1], discusses potential limitations on parallel performance. Equation 2.3 is a paraphrasing of one of his main points, which has become popularly known as Amdahl's Law.

    maximum speedup = 1 / ((1 − P) + P/N)          (2.3)
Here, P is the fraction of the program that can be parallelised, and N is the number of processors in use. It follows that as the number of processors is increased, the maximum speedup tends towards 1 / (1 − P). For example, if 80% of a program can be made parallel, Amdahl's Law implies that the maximum speedup achievable is 1 / 0.2 = 5.
Amdahl's Law has been criticised for assuming a fixed problem size. It is argued in [14] that
the fraction of the program that can be parallelised is nearly always dependent upon the problem
size:
One does not take a fixed-size problem and run it on various numbers of processors
except when doing academic research; in practice, the problem size scales with the
number of processors. When given a more powerful processor, the problem generally
expands to make use of the increased facilities.
If the part of a program that is inherently sequential (setup, synchronisation, output, etc.) does not grow with problem size, or grows at a slower rate than the rest of the program, then increasing the problem size may allow greater speedup to be achieved than is predicted by Amdahl's Law.
2.3 Haskell
Haskell is a purely functional, non-strict programming language, described in [20]. The most popular compiler implementation, often treated as the de facto standard for new language features
not yet formally specified, is the Glasgow Haskell Compiler (GHC).
Purely functional languages are aracterised by referential transparency, that is, the return
value of a function depends only on its arguments, whi are immutable. As result of this, given
an expression whi depends on multiple sub-expressions, it does not maer in principle when,
or in what order the runtime evaluates the sub-expressions, regardless of their complexity.
Strictness is a related concept. In a strict language, in order to call a function, its arguments
must first be fully evaluated. Conversely, in a non-strict language, arguments do not need to be
evaluated first, or even at all. Purely functional languages enable non-strictness, as there is no
difference between evaluating a parameter value when the function is called and evaluating it
when it is used within the function.
double x = x + x
answer = double (3 * 7)

For example, to find the value of answer from the listing above, a strict implementation would first evaluate 3 * 7, then pass the result to double, and evaluate 21 + 21 to get the final result. A non-strict implementation would differ in that it would compose the expression (3 * 7) + (3 * 7) before performing any calculation, but would return the same result.
2.3.1 Non-strictness and Lazy Evaluation
Haskell's non-strictness means that, by default, expressions are not evaluated until absolutely necessary: for example, when a value needs to be printed to the screen.
x = take 3 ls
  where
    ls = [1..]

y = 9^9^9
Given the code above, the Haskell run-time system will not evaluate anything until the value of x is required, for example, if a user at the console has requested it be shown on the screen. It will then perform the evaluation necessary to determine the first three elements of ls, and display them. Note that ls is an infinite list of integers starting from 1, and therefore attempting to fully evaluate it would cause the program to hang. The expression assigned to y, 9^(9^9), is never evaluated.
When a complex expression is assigned to a variable, a thunk is created. Initially, this is simply
a record of the expression itself. When the expression's value is needed, it is evaluated, and the
thunk stores the result, which may be a simple value (e.g., an integer), or an expression (e.g. part
of a list, or a tuple containing thunks).
In the code example above, as the runtime evaluates take 3 ls, ls is evaluated as follows:
<thunk for [1..]>
1 : <thunk for [2..]>
1 : 2 : <thunk for [3..]>
1 : 2 : 3 : <thunk for [4..]>
It should be noted that this implies even expressions which will cause errors if evaluated can
be passed around without issue. For example, the tuple a = (head [], 2 + 2) would cause
an error if fully evaluated, as head [] is undefined (it represents the first item of an empty list).
However, if only the second value in the tuple (snd a) is forced, the following happens, and 4 is
returned:
<thunk for (head [], 2 + 2)>
(<thunk for head []>, <thunk for 2 + 2>)
(<thunk for head []>, 4)
Lazy evaluation entails non-strictness as described above, and sharing of thunks to avoid duplicated work. In the evaluation of answer in Section 2.3, the two occurrences of (3 * 7) would point at the same thunk, so that this sub-expression is only calculated once.
2.3.2 Monads
Whilst the definition of functional purity depends on lack of side effects, implementations need a way of explicitly working with computations that may have side effects. These include:

• input and output
• in-place state updates
• threading and inter-process communication
Haskell provides monads as a means of controlling side effects, including the ST monad (for mutable state) and the IO monad. In Haskell programs, the IO monad is used to declare a chain of operations which make up the main routine of the program.
2.3.3 Parallel Evaluation
Lazy evaluation is oen convenient, as it avoids performing unnecessary computation, and puts
off work until values are required. is is a trade-off against the extra work and memory space
required to manage thunks whi allows programs to be more easily modularised[15]. However,
in an environment where multiple processing cores are available, it may be beer to speculatively
evaluate some sub-expressions before they are required by the main computation, as otherwise
the available power is being wasted.
Haskell provides two functions, par and pseq, which can be used as annotations to enable shared-memory parallelism on multicore processors. As described in [21], "[t]he function par indicates to the Haskell run-time system that it may be beneficial to evaluate the first argument in parallel with the second argument". The function pseq forces its first argument to be evaluated before returning the second. The two functions are often used together in the form (a `par` (b `pseq` c)), where c is generally an expression that combines a and b. The following example is provided in [21].
parSumFibEuler a b
  = f `par` (e `pseq` (e + f))
  where
    f = fib a
    e = sumEuler b
A consequence of the par function call is the creation of a spark by the run-time system, which in turn may be picked up by an idle thread and speculatively evaluated. In the example above, fib and sumEuler are computationally intensive functions that return integers. The expression f, containing the call to fib, is sparked, and will be evaluated on another thread (if one is available). At the same time, e will be evaluated on the main thread. Note that the values of a and b are chosen such that these two evaluations take around the same amount of time. In a perfect scenario, the main thread then goes to sum e and f just as the worker thread is finished with f, and the summation operation is very fast in comparison. As a result, the running time of the program is roughly halved when a second thread is added.
The par and pseq functions provide a simple way to add parallelism to a program on an ad hoc basis. However, there are two main problems with the approach. Firstly, it is easy to create too many or too few sparks. When too few sparks are created, a processor may be left idle waiting for other processors to complete large tasks. When too many are created, there is a risk sparks will end up being evaluated by the main thread (hence the effort used in creating them was wasted), or never needed at all. In both of these cases it would be best if the spark was discarded early, or not created. Secondly, functions featuring parallelism become littered with calls to these functions, making them harder to read and maintain.
Evaluation strategies were created in an attempt to overcome these limitations. Their current design, as of version 3, is detailed in [17]. Strategies allow the method for evaluating an expression to be defined in isolation, and then applied to the expression itself. These strategies can be re-used and combined with each other. Furthermore, memory management is improved with the intention of wasting fewer sparks. The previous example might be refactored as follows.
parSumFibEuler a b
  = sum ([fib a, sumEuler b] `using` parList rseq)
Note that the expressions to be evaluated in parallel are only mentioned once, so e and f are no longer needed. In a similar manner to par and pseq, the using function is semantically equivalent to its left-hand argument, but additionally causes expressions to be sparked or evaluated
according to the strategy in its right-hand argument. In this case, parList causes sparks to be created to apply the specified strategy to each element of a list. The rseq strategy evaluates values to weak head normal form, which applied to an integer is a full evaluation.
In the above example, the use of a list container appears to be excessive for a calculation involving two numbers. In real-world parallel programs, large lists of actions to perform are commonplace, and evaluation strategies allow these and other structures to be defined clearly through plain Haskell code before any parallel annotations are added. The package contains various strategies applicable to lists and other container types, as well as "building block" functions for composing new custom strategies.
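As a hedged illustration of composing such building blocks (this example is not from the project code): a list of pairs can be evaluated with one spark per pair, with both components of each pair fully evaluated.

import Control.DeepSeq (NFData)
import Control.Parallel.Strategies

-- Illustrative only: one spark per pair, each pair fully evaluated.
pairsInParallel :: NFData a => [(a, a)] -> [(a, a)]
pairsInParallel xs = xs `using` parList (evalTuple2 rdeepseq rdeepseq)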
2.3.4 Data Parallelism
The section above has discussed Haskell approaches to parallelism based on ad hoc allocation of instructions and data to processors, sometimes referred to as task parallelism. A separate paradigm for parallel programming is data parallelism, where a fixed operation is performed on each element of an array. This model can have many advantages over ad hoc approaches, such as allowing the runtime to more easily split work over processors to maximise processor utilisation and sequential memory access. Depending on the processor architecture, the program code may only need to be sent to the processor once for many elements of the input array.
In [3], a functional programming language, NESL, is introduced which supports nested data parallelism. The operation being applied to each element of an array may be a simple scalar operation, or may itself be a data parallel operation, applying another operation over the elements of an array. The Data Parallel Haskell (DPH) project aims in part to bring the features of NESL to Haskell. Its syntax is described in [21], however, a stable version is not currently available for the latest major version of GHC, and internally, the data parallelism is implemented using the multicore parallel functionality described above.
Modern graphics processing units (GPUs) contain many more cores than CPUs, and are heavily specialised for array operations. An approach for exploiting the ability to run general-purpose code on GPUs from Haskell using the Accelerate package is discussed in [6]. Accelerate defines a domain-specific language which can be used to describe data parallel programs, largely implemented using Haskell type classes. These programs are then executed with one of the two currently available backends: a reference implementation in pure Haskell, and a GPU implementation which transforms programs into the CUDA language supported by recent NVIDIA GPUs. Accelerate is
more limited than DPH in terms of the flexibility afforded by its syntax, as it only supports flat data parallelism. This means that only scalar operations may be applied to the array elements.
2.4 Existing Work on Graphs in Haskell
2.4.1 Graph Representation
It is argued in [10] that clear and efficient programs are made possible by inductive types for
data structures. ese are types where complex structures may be defined in nested constructor
expressions. For example, basic lists in Haskell are defined in terms of the nil ([]) and cons (:)
constructors — for example, [1, 2, 3] = (1 : (2 : (3 : []))).
Rooted trees may be easily represented as inductive types, because the types themselves are effectively trees. A vertex in a tree may only be reached in one way from a given root, so vertices in a tree need only store a list of their children in order to encode the entirety of the structure. Were a cycle to be introduced, the structure would contain multiple copies of the same vertices. This can be avoided by using pointers, but this approach can quickly become hard to maintain and at risk of memory errors.
An inductively defined type for general graphs is introduced in [10]. Each step in the construction of a graph value adds a vertex, along with lists of incoming and outgoing edges. It was found
that this structure facilitated the efficient implementation of various graph-based algorithms.
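As a rough, hedged sketch of these ideas (simplified from the definitions in [10], and not the representation used in this project): a rose tree is naturally inductive, and an inductive graph adds one vertex at a time together with its adjacent edges.

-- A rose tree is a classic inductive type: a label and a list of subtrees.
data Tree a = Node a [Tree a]

-- A simplified sketch of an inductive graph in the spirit of [10]:
-- a graph is either empty, or one vertex's "context" (incoming edge
-- sources, the vertex ID, its label, outgoing edge targets) joined
-- onto the rest of the graph.
type Vertex = Int
data Context a = Context [Vertex] Vertex a [Vertex]
data InductiveGraph a = Empty | Context a :& InductiveGraph a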
2.4.2 Graph Layout
Whilst many paages operating on graph-like structures in Haskell exist, few examples of graph
layout algorithms are publicly available. HaageDB, the central Haskell soware repository consisting of around three thousand paages, features only two paages related to graph layout:
graphviz and graph-rewriting-layout.
2.4.2.1 graphviz
e graphviz paage, documented in [23], supports graph layout using various algorithms, but
it does this by calling out to the third-party Graphviz toolkit, whi is wrien in C. e Haskell
paage contains functions to convert from a Haskell graph structure to the DOT format used by
Graphviz, and functions to call Graphviz utilities, returning various image formats or drawing to
a GTK canvas.
As all of the actual graph layout work occurs in external processes, this paage is not relevant
to this project.
2.4.2.2 graph-rewriting-layout
e graph-rewriting-layout paage, described in [22], provides a set of “basic methods
for force-directed node displacement that can be combined into an incremental graph-drawing
procedure”. Together with the parent graph-rewriting paage, these methods can be used to
define graph rewriting systems using the WithGraph monad.
No main method is provided as part of the paage, and as the methods use a monad to store
and retrieve state, it was assumed it would be difficult to make use of these methods in a simple
parallel implementation.
2.4.2.3 Communicating Haskell Processes
An implementation of force-directed layout using the Communicating Haskell Processes (CHP) library is detailed in [5]. It uses one process per vertex, with each process communicating with the others to find their current locations, and calculating its own resultant force. This works, but involves a large amount of communication overhead and as such is more interesting as a demonstration of the CHP library than as an implementation suitable for use on large datasets.
Chapter 3
Initial Implementation
An initial Haskell spring embedder implementation was created. The intention was to create a simple proof-of-concept, using pure functional syntax as much as possible, and mostly relying on functionality built in to the language and the Prelude (Haskell's "standard library"). This chapter details the main features of the implementation, and some initial attempts at parallelisation.
3.1 Graph Representation
import qualified Data.IntMap as IM
import qualified Data.Set as S
type Graph = IM.IntMap (S.Set Int)
type Positions = IM.IntMap Vector
type Vector = (Float , Float)
The structure of the graph is represented in memory as an adjacency list using the IntMap type, an efficient mapping type from integer keys to values of any type — in this case, a Set of integers. Each vertex is assigned an integer ID, and this ID is mapped to the set of IDs representing the neighbours of the vertex. A range of graph generation functions were produced for testing.
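The graph generators themselves are not listed in the report; as a hedged sketch, a complete-graph generator of the kind referred to in later tests (kgraph) might look like this.

-- Sketch only: the project's actual kgraph is not shown in the report.
-- Builds the complete graph on n vertices, with IDs 0 .. n-1.
kgraph :: Int -> Graph
kgraph n = IM.fromList
  [ (v, S.fromList [u | u <- [0 .. n - 1], u /= v]) | v <- [0 .. n - 1] ]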
The positions of the vertices in the current layout are represented as another mapping from vertex IDs to Vector values. Here, Vector is simply a tuple of two floating-point numbers representing X and Y coordinates. Various basic linear algebra functions were implemented for this type, for example, scalar multiplication:
times :: Float -> Vector -> Vector
times i (x, y) = (i * x, i * y)
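The other vector helpers used below (diff, dist, and normal) are not listed in the report; plausible definitions consistent with their use would be:

-- Sketches only; the project's actual definitions are not shown.
diff :: Vector -> Vector -> Vector
diff (x1, y1) (x2, y2) = (x1 - x2, y1 - y2)

dist :: Vector -> Vector -> Float
dist a b = let (x, y) = diff a b in sqrt (x * x + y * y)

normal :: Vector -> Vector
normal (x, y) = let d = sqrt (x * x + y * y) in (x / d, y / d)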
3.2 Force Calculation
Functions were created to calculate the appropriate attraction and repulsion force vectors based on two input vectors. Hooke's law and Coulomb's law (Equations 2.2 and 2.1) were implemented without modification.
spring_k = 0.25 :: Float
spring_l = 15   :: Float
charge_k = 2000 :: Float

edge_force, repulse_force :: Vector -> Vector -> Vector
edge_force a b =
  (- spring_k * (spring_l - dist a b)) `times` normal (diff a b)
repulse_force a b =
  (charge_k / ((dist a b) ^ 2)) `times` normal (diff b a)
In ea case, the le-hand argument of times calculates the magnitude of the force being
applied, whi is used in scalar multiplication of the normalised difference between the two positions. at the two forces act in opposing directions is indicated by the difference in the order of
the arguments of diff.
Values for the constants spring k, spring l, and charge k were osen arbitrarily and
adjusted with the help of the visualisation shown in Section 3.4 to ensure the algorithm converged
towards a reasonable layout. ese values are important with respect to the quality of the resulting
drawing, however, this project is only concerned with the relative performance of the implementation, and this is unaffected by the values osen.
To calculate the total aractive and repulsive forces acting on ea vertex, two functions
(shown in Section C.1) were implemented whi fold over the neighbours of the vertex and all
20
other vertices respectively, starting with (0, 0) and adding ea force. e total force on ea vertex is limited to 20 units, a value osen experimentally to avoid divergence caused by vertices
whi are too nearby or distant.
3.3 Main Iteration
steps :: Graph -> Positions -> [Positions]
steps g p = p : (steps g $ step g p)
The step function, shown in Section C.1, returns the successor state (new vertex positions) given the adjacency list and the current positions. Recursion was used to create a list of iterations, in a similar manner to the Haskell Prelude's iterate function. The infinite list of iterations produced by steps could then be manipulated using Haskell's standard list functions to retrieve the vertex positions for the step (or range of steps) required. Due to lazy evaluation, no actual calculation is performed until required.
main :: IO ()
main = do
  let g  = trimesh 10
  let p  = initPos g
  let ps = steps g p
  putStrLn $ show $ centre $ head $ drop 10 ps
The main method is short, assigning pseudorandom positions to each vertex, obtaining a list of steps, and printing the centre of the graph after the tenth iteration to the screen. The centre of the graph is the average of the position of every vertex, so requesting this be printed to the screen forces all positions to be evaluated.
3.4 Visualisation
A simple visualisation was created using Haskell's support for the GTK+ windowing toolkit and
the Cairo graphics library, to verify that the code was producing suitable layouts and to enable
animated demonstrations.
The visualisation code, supplied in Section C.2, was disabled when taking running time measurements.
Figure 3.1: A 10-row triangular mesh after 50, 100, and 150 iterations.
3.5 Introducing Parallelism
f = f1 `par` (f2 `pseq` (f1 `plus` f2))
An initial aempt to parallelise the implementation was made using the par and pseq functions. As the aractive and repulsive forces were being calculated separately, a simple ange was
to spark the repulsive force calculation (f1) whilst calculating the aractive forces (f2) on the
main thread.
Number of cores      1     2     3     4
Running time (s)     40.3  40.8  40.5  40.5

Table 3.1: Results of first parallelisation, tested on 10 iterations of K1000
The program was briefly tested on 1 to 4 cores, using a complete graph of 1000 vertices. This was not successful. As the results in Table 3.1 show, the program took slightly longer to run on multiple cores than on a single core.
Separately, a ange was tested wherein the function used to add together the individual repulsive forces was modified in a similar manner to the above line of code. A corresponding ange
was made for the aractive forces.
Number of cores      1     2     3     4
Running time (s)     49.5  29.6  24.1  22.2
Speedup              1.00  1.67  2.05  2.23

Table 3.2: Results of second parallelisation, tested on 10 iterations of K1000
is ange was more successful, insofar as the figures in Table 3.2 show that the program was
faster on two cores than on one. However, some of this gain is due to the performance on a single
core being significantly worse than before. e speedup with two cores is 1.67, whi is poor in
comparison to the ideal of 2. ere are diminishing returns to adding further cores, with speedup
on four cores only being just over 2.
The poor results here may suggest that evaluation of sparks is not being performed fully before their results are needed for another calculation, and/or that the overhead of parallelism is outweighing its benefits. It may also be the case that a significant portion of the algorithm is inherently sequential, and therefore significant further gains in performance are not possible. The next chapter will explore these hypotheses and attempt to eliminate these and other possible causes of poor performance.
Chapter 4
Analysis and Improvement
is apter details the various aempts made to improve the sequential and parallel performance
of the initial implementation from Chapter 3. Unless otherwise stated, all running times are for 10
iterations of the algorithm on a K1000 graph, on the GHC 7 installation discussed in Section 6.1.1.
4.1 ThreadScope
ThreadScope is a profiling tool for parallel Haskell programs. Together with the event logging features in the Haskell run-time system, ThreadScope can be used to visualise CPU utilisation and thread-related events over time.

Figure 4.1 shows the visualisation produced by ThreadScope for the program discussed at the end of Chapter 3, running on four cores. The lower four timelines show that all the cores are being used; however, the upper graph shows that in any one slice of time the average utilisation is not significantly greater than 1 (the vertical ranges bordered by dashed lines each represent potential utilisation of one core).
Figure 4.1: readScope initial view
ThreadScope allows the view to be zoomed to show more detail, as seen in Figure 4.2. The section on the left-hand side shows that the timeline is made up of small blocks of full CPU utilisation surrounded by a large amount of garbage collection (the thinner, contrasted bars seen on the lower part of the diagram, orange in the original view). This implies a large amount of short-lived memory allocation is occurring.

Vertical bars in the ThreadScope view signify thread-related events, and are colour-coded. On the right-hand side of Figure 4.2, the bars visible on the lower three cores are pink. These represent sparks being stolen. This means the value of an expression was partially evaluated on one thread, when it was needed on another, and so evaluation was moved to the latter thread. This implies iterations are not being calculated fully before calculation of the next iteration begins.
4.2 Switing from par and pseq to Strategies
As discussed in Section 2.3.3, par and pseq are useful for simple experiments with parallelism, but evaluation strategies provide opportunities for cleaner code and better management of memory. For these reasons, a decision was made to switch to strategies in attempting to address the problems above.
In Haskell, many data structures that can be traversed, such as lists and maps, implement the Traversable class, and the parallel strategies library provides an appropriate strategy for these
Figure 4.2: Two enlarged parts of Figure 4.1
types. When parTraversable is applied to a structure, a spark is created for each element in the structure.
The annotations added in Section 3.5 were removed, and replaced with the parTraversable strategy, applied to the map of positions at each iteration. The argument of parTraversable is the strategy to be applied to each element, in this case rdeepseq, which fully evaluates the position vector.
step g p = IM.mapWithKey step1 p `using` (parTraversable rdeepseq)
In addition to being simpler to read, this annotation produces fewer sparks — one for each vertex (n), as opposed to one for each force (2n² for the complete graph Kn). It was thought that this reduction would lessen the memory and processor overheads associated with managing a large spark pool.
Number of cores      1     2     3     4
Running time (s)     45.3  30.4  34.3  29.8
Speedup              1.00  1.49  1.32  1.52

Table 4.1: Results of initial application of strategies
Against the results in Section 3.5, this change resulted in worse speedup, particularly on three and four cores. However, the change was kept as a basis for further improvement.
4.3 Pre-generating the Force List
Upon review of the relevant functions, it was noted that each new position calculation required a traversal of the list of all vertices (to calculate the repulsive forces) and a lookup in the graph structure followed by another traversal of the neighbours (to calculate the attractive forces). It was suggested that a performance gain could be made by doing some of this work once, at the start of the program's execution, instead of on every iteration.
[
  (Spring,  0, 1),
  (Spring,  0, 3),
  (Repulse, 0, 1),
  (Repulse, 0, 2),
  (Repulse, 0, 3)
]

Figure 4.3: A simple graph (vertices 0–3), and the corresponding force list for vertex 0.
type Forces = IM.IntMap [Force]
type Force = (ForceType , Int, Int)
data ForceType = Spring | Repulse
A function of type Graph -> Forces was implemented to create a map of lists of Force tuples where, for example, (Spring, 0, 1) represents an attractive force from vertex 1 acting on vertex 0. An example force list is shown in Figure 4.3. The step function was rewritten to map Force tuples to their resultant force vectors. This change also raised the possibility of changing the strategy to be more like the initial parallelisation in Section 3.5, by applying the parList rdeepseq strategy to the mapped force list.
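The generator itself is not listed in the report; a hedged sketch consistent with Figure 4.3 might be:

-- Sketch only: one Spring entry per neighbour and one Repulse entry per
-- other vertex, keyed by the vertex the forces act on.
forces :: Graph -> Forces
forces g = IM.mapWithKey forcesFor g
  where
    forcesFor v ns =
      [ (Spring,  v, u) | u <- S.toList ns ] ++
      [ (Repulse, v, u) | u <- IM.keys g, u /= v ]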
                           Number of cores      1     2     3     4
After force list change    Running time (s)     59.1  35.3  30.0  26.1
                           Speedup              1.00  1.67  1.97  2.26
With strategy change       Running time (s)     43.0  26.0  18.7  15.7
                           Speedup              1.00  1.65  2.30  2.74

Table 4.2: Results of pre-generating the force list
Comparing Tables 3.2 and 4.2, it can be seen that the force list change made the program slower overall, but returned the speedup values to around the same as the earlier code. In addition, making the strategy change resulted in the best single-core performance yet, and improved speedup over three and four cores, so both changes were kept.
4.4 Unboxing
GHC supports a number of optional language extensions which can be enabled through special comments, known as pragmas. One such extension is BangPatterns, which allows fields of a data type to be marked as strict. As a result, the values of these fields are evaluated whenever an instance of the type is created, without the lazy thunk behaviour described in Section 2.3.1. In addition, the UNPACK pragma can be used to selectively unbox strict fields, that is, internally store the field's value directly rather than indirectly via a pointer.
data Vector = Vec {-# UNPACK #-} !Float {-# UNPACK #-} !Float deriving Show
e Vector type was anged from a tuple to a data type with two strict, unboxed Floats.
Number of cores      1     2     3     4
Running time (s)     28.4  17.7  11.9  10.7
Speedup              1.00  1.60  2.38  2.67

Table 4.3: Results of vector component strictness/unboxing
Table 4.3 shows that this ange leads to a significant speed improvement overall, but does not
significantly affect the parallel speedup factors. e ange was accepted.
4.5 List Chunks
In addition to parList, the parallel strategies package supplies the parListChunk strategy, which splits a list into chunks and evaluates each chunk in parallel. That is, parListChunk 1 is equivalent to parList.
Various unk sizes were tested, and none resulted in beer running times than the previous
code. e fact that some came very close (compare Tables 4.3 and 4.4) despite taking longer on one
core would seem to indicate that there is a speed benefit to using larger sparks, but it is cancelled
out by the overhead of spliing the list in the first place. is ange was not accepted.
Running time (s):

Number of cores      1     2     3     4
parListChunk 25      33.3  18.6  13.6  11.0
parListChunk 50      33.0  18.5  13.5  11.0
parListChunk 200     32.2  19.1  14.0  11.2
parListChunk 500     33.4  18.8  14.8  11.4

Table 4.4: Testing chunk sizes
4.6 Forcing the Initial Conditions
Figure 4.4: CPU utilisation after the unboxing change
ThreadScope was used again, to visualise an event log generated by the current code (after the unboxing changes). The result is shown in Figure 4.4. It was noted that the first 1.4 seconds show lower CPU utilisation, and hypothesised that this was due to the initial conditions (the graph, initial positions, and force list) not being evaluated fully at the start of execution.
let g  = kgraph 1000 `using` rdeepseq
let p  = initPos g `using` evalTraversable rseq
let fl = forces g `using` evalTraversable (evalList (evalTuple3 rseq rseq rseq))
g `seq` p `seq` fl `seq` return ()
The start of the main method was changed to specify sequential evaluation strategies for the relevant variables, and seq was used to force the evaluation to start immediately.
Figure 4.5: CPU utilisation after initial evaluation change
Figure 4.5 shows that this succeeded in causing the initial conditions to be evaluated on a single core. However, time measurements showed the new version of the program to be slightly slower overall. As the change was in an area of code which runs only once instead of every iteration, it was thought increasing the number of iterations might cancel out this effect.
          Number of cores      1      2     3     4
Before    Running time (s)     147.2  80.2  64.8  51.8
          Speedup              1.00   1.83  2.27  2.84
After     Running time (s)     153.3  88.4  69.5  54.9
          Speedup              1.00   1.73  2.21  2.79

Table 4.5: Results from 50 iterations, before and after forcing the initial conditions
The running times and speedup values were worse with this change, so it was reverted. However, it is worth noting that the speedup values in Table 4.5 are the best so far, due to the increase in the amount of parallel work whilst the initial "setup cost" has not changed. At this stage, speedup values were only being used as general indicators to guide development, and a more important concern was minimising the amount of time spent waiting for results while testing different changes.
4.7 Arrays
Up to this point, the positions of vertices and the adjacency/force lists have been stored in IntMap
structures. e Haskell IntMap is a generally efficient implementation — the cost of iteration is
O(n), the amortised costs of insertion and indexing are O(1)[19] — but another structure may be
faster in practice.
The graph generation algorithms implemented in Chapter 3 all produce graphs with consecutive integer IDs starting at 0, so switching to arrays was considered to be a sensible option to explore. The nature of the force calculations is such that they require random access to position vectors. This is a use case which may benefit from contiguous storage of vectors in well-defined memory locations. Contiguous storage is important because it means there is a chance for more values to end up in cache memory, causing later lookups to be faster than normal RAM access.
The Haskell libraries included with GHC support several different types of array. Some are immutable and usable from pure functional code, and some are mutable and only accessible within the IO or ST monads. Some are non-strict, like most basic Haskell structures, and some are strict and store values unboxed. For this project, the simple immutable, boxed Array type was chosen because the relevant code is pure, and unboxed UArray implementations are only provided for basic types.
The Positions and Forces types were changed from integer maps to arrays, and the functions accessing them were adjusted accordingly.
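The new definitions are not listed in the report; presumably they looked something like the following (vertex IDs are consecutive integers starting at 0, as produced by the generators).

import Data.Array

-- Assumed sketch of the array-based types:
type Positions = Array Int Vector
type Forces    = Array Int [Force]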
Number of cores      1     2     3     4
Running time (s)     27.1  16.2  11.2  9.5
Speedup              1.00  1.67  2.42  2.84

Table 4.6: Results of using Array instead of IntMap
Comparing the results in Tables 4.3 and 4.6 reveals a slight improvement in all time measurements and speedup values. This change was retained.
4.8 Iterating in the IO Monad
p <- foldM (\cp _ -> do
       let np = step fl cp
       return np
     ) ip [1..10]
It was speculated that some of the performance problems with the program were due to the recursive list definition of steps (first seen in Section 3.3). This function was removed and replaced with the above snippet of "impure" Haskell, which produces the same result. Note that the foldM operation is similar to Haskell's list folding functions, which apply a combining function over the list to return a single value. The foldM operation differs in that it applies a monadic operation over the list, in this case, an iteration of the algorithm.
Number of cores      1     2     3     4
Running time (s)     26.5  16.0  11.2  8.5
Speedup              1.00  1.65  2.37  3.10

Table 4.7: Results of using monadic iteration
This resulted in mostly minimal changes to running times and speedup values — small improvements in some values, and a small regression (three-core speedup). This change was accepted due to the significant improvement in the four-core figures.
Chapter 5
Alternative Implementation
The planned project aims and schedule included the intent to produce an alternate parallel implementation, in order to compare the performance and coding experience of the multicore implementation described in the previous chapters with a different model. It was decided to use the Accelerate library (discussed in Section 2.3.4) for this purpose, as it represents a significantly different approach to parallelism — performing data parallel computation on a GPU. The package was also easily installable from the HackageDB package database, and access to hardware supporting execution of CUDA programs was available.

Whilst code from the multicore implementation was used as a basis, more significant changes were made than those in Chapter 4, and the resultant model of execution is very different. This chapter details the main changes made, and presents the results of brief testing. Further evaluation is detailed in Chapter 6.
5.1 Changes Applied
5.1.1 Type Definitions
The Vector type was reverted back to its original (Float, Float) tuple, as Accelerate only supports specific primitive types for arrays, to match the types supported natively by CUDA. Arrays of tuples are supported, and are represented in the backend as tuples of arrays[6].

For the same reason, ForceType needed to be changed from a custom enumeration to a primitive type. The obvious choice would be Bool, as there are only two possibilities. However, the CUDA backend of Accelerate does not currently support Char and Bool arrays, so ForceType was changed to an Int.
5.1.2 Force List
As Accelerate does not support nested data parallelism, it was necessary for the force listing structure to be a single array, rather than a mapping from vertices to individual lists.
type SparseForces = (A.Array Int Int, A.Array Int Force)
A new type SparseForces was defined, modelled after the SparseMatrix type in [6]. The
second element of the tuple is the concatenation of all the individual lists of forces, and the first is
the number of items in each of the original lists, i.e., the result of applying map length to them.
This segment information is used later to undo the flattening after values are calculated for
the forces.
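A small hypothetical example of the flattened layout: three vertices with two, zero and three associated forces respectively would be stored as a segment-length array alongside one concatenated force array.

-- Segment lengths, one entry per vertex.
exampleSegments :: [Int]
exampleSegments = [2, 0, 3]

-- Concatenated forces; the strings are placeholders for actual Force values.
exampleForces :: [String]
exampleForces = ["f0a", "f0b", "f2a", "f2b", "f2c"]
-- foldSeg over the segment lengths later combines each segment back into a
-- single total per vertex, undoing the flattening.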
5.1.3 Iteration
The main loop of the program was left unchanged from the state introduced in Section 4.8. The step
function, where the position update is applied, was subject to extensive changes.
First, the input arrays (the current positions, and the two arrays in the SparseForces tuple)
are passed through Accelerate's use function, which "embeds a Haskell array into an embedded
array computation, implying a host-to-device memory transfer"[6]. These arrays can then be used
in expressions with other Accelerate functions to build up an embedded program for later execution.
Then, the force list is mapped using the calc_force function to produce an array of force
vectors, which are then collected together with foldSeg, resulting in an array containing the
total force acting on each vertex. Finally, the forces are zipped with the current positions using
addition to form the new positions array.
ACC.Vector Int  (initial array in system memory)
   -- ACC.use -->       ACC.Acc (ACC.Vector Int)   (embedded program constructed)
   -- ACC.map (*2) -->  ACC.Acc (ACC.Vector Int)
   -- ACC.run -->       ACC.Vector Int             (program executed on GPU; result array in system memory)

Figure 5.1: Construction and execution of a simple Accelerate program.
The program formed from these expressions is then passed to a run method, which causes
the actual computation to be performed and effectively performs the reverse operation to use, returning arrays to the host memory and making them available to normal Haskell code. Figure 5.1
illustrates the process of using Accelerate. The code of the new step is available in Section C.3.
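A minimal sketch of the pipeline in Figure 5.1, written against the use/map/run interface described in [6]; the module layout reflects the Accelerate version used at the time, and the function name doubleAll is illustrative only.

import qualified Data.Array.Accelerate      as ACC
import qualified Data.Array.Accelerate.CUDA as CUDA  -- swap for .Interpreter to use the reference backend

-- Embed an array, double every element on the GPU, and copy the result back
-- to host memory.
doubleAll :: ACC.Vector Int -> ACC.Vector Int
doubleAll xs = CUDA.run (ACC.map (* 2) (ACC.use xs))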
5.1.4 Force Calculation
import qualified Data.Array.Accelerate as ACC

norm :: Vector -> Float
norm (x, y) = sqrt (x ^ 2 + y ^ 2)

normA :: ACC.Exp Vector -> ACC.Exp Float
normA a = sqrt (x ^ 2 + y ^ 2) where (x, y) = ACC.untuple a
New versions of the basic linear algebra functions were implemented for use in the array
programs. As seen in the example above, the types are wrapped in Exp, which marks an embedded
scalar computation. Simple arithmetic operations are transformed automatically, but the tuple
needs to be explicitly unpacked.
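As a further illustration (not code from the project), a scalar-valued helper such as a dot product needs only the untuple step shown above, since its result is a single embedded Float:

import qualified Data.Array.Accelerate as ACC

type Vector = (Float, Float)

-- Embedded dot product of two vectors; the pair is unpacked explicitly, while
-- (*) and (+) are overloaded to build the embedded expression.
dotA :: ACC.Exp Vector -> ACC.Exp Vector -> ACC.Exp Float
dotA a b = x1 * x2 + y1 * y2
  where
    (x1, y1) = ACC.untuple a
    (x2, y2) = ACC.untuple b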
5.2 Initial Testing
Some brief testing was performed using the GHC 6 environment described in Section 6.1.1. The
new code returned similar output to the multicore implementation, with the minor discrepancies
assumed to be due to differences in rounding behaviour (neither Haskell's nor CUDA's floating-point types implement IEEE 754 precisely[20, 18]). Correct behaviour was also observed using the
GTK/Cairo visualisation code.
The reference interpreter included with Accelerate is explicitly not designed with performance
in mind; its main aim is to form a clear specification for the semantics of the Accelerate
language. This was shown clearly in testing, as running a single iteration of the algorithm on a
K1000 graph took 166 seconds.
Switing to the CUDA baend cut this to 5.5 seconds. Increasing to 10 iterations, as used
in many of the tests in the previous apter, resulted in a running time of 9.5 seconds, roughly
comparable to the previous code running across four CPU cores. Running 100 iterations took 50.1
seconds, less than ten times as long as for one iteration. Whilst real calculations of speedup are
not possible due to the la of ability to run code on a limited number of GPU cores, these timings
would seem to indicate that speedup is increasing as the problem size is increased, and that the
sequential portion of the first 5.5 second measurement is relatively large.
Iterations            1     10    100    1000
Running time (s)    5.5    9.5   50.1   452.4

Table 5.1: Running times from initial testing on the GPU.
Chapter 6
Evaluation
6.1 Performance of Implementations
6.1.1 Testing Environments
Most testing of implementations was performed on a machine running the generic binary distribution of GHC 7.0.1 on Fedora 12 with a 2.67GHz Intel Xeon W3520 CPU. This CPU has four physical
cores, which each appear to the operating system as two logical processors due to their simultaneous multithreading support, known as "Hyper-Threading". These extra logical processors are not useful for
processing where full processor utilisation is expected.
Due to the current version of the Accelerate library's lack of support for the latest release of
GHC, and the above machine's lack of a CUDA-supporting GPU driver, tests of the Accelerate-based implementation were performed on a different machine and compiler. This used the generic
binary distribution of GHC 6.12.3 on Fedora 9 with a 2GHz Xeon E5405 CPU; outdated
system libraries precluded the installation of GHC 7. The GPU in this machine was an NVIDIA
GeForce GTX 280, with 240 CUDA cores[18] running at 1296MHz and 1GB of memory.
As a result of this difference, care must be taken when comparing results from the two environments. However, it is hoped that in the latter environment most repeated computation
will occur on the GPU, so that as the problem size is increased, the differences in processor and
Haskell compiler/runtime become less relevant.
When compiling with the intention of taking measurements of running time, the GTK/Cairo
visualisation code described in Section 3.4 was disabled, and the -O2 option was used, which
instructs GHC to "[a]pply every non-dangerous optimisation, even if it means significantly longer
compile times"[24]. Event logging and profiling were also disabled.
6.1.2 Impact of Different Graphs
Tests were carried out to investigate the impact of graph type upon the efficiency of the parallel
implementations. Five different graphs, listed below, were used (illustrative construction code for two of them follows the list). Problem sizes were chosen such
that the running times on one CPU core were roughly equal, so that similar running times on
more cores would represent similar efficiency levels.
Figure 6.1: Diagrams of the graphs tested.
• A complete graph, that is, a graph of n vertices where every vertex has degree n − 1.
• An iterated complete graph, one of the examples from [12]. This consists of n + 1 complete
  subgraphs of n vertices, where each vertex in the first subgraph is merged with a vertex in
  one of the others.
• A random connected graph with n vertices and 6n edges. The graph was constructed by
  first creating a random tree, then adding extra edges between random vertices.
• A triangle mesh, as seen in Figure 3.1.
• A cycle, that is, a connected graph where every vertex has degree 2.
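The following is an illustrative sketch of how two of these graphs could be constructed, assuming the IntMap-of-sets Graph representation used in Appendix C; the project's actual generator code is not reproduced in the report.

import qualified Data.IntMap as IM
import qualified Data.Set    as S

type Graph = IM.IntMap (S.Set Int)   -- adjacency sets, as in Appendix C

-- Complete graph on n vertices: every vertex is adjacent to all others.
complete :: Int -> Graph
complete n = IM.fromList
  [ (u, S.fromList [ v | v <- [0 .. n - 1], v /= u ]) | u <- [0 .. n - 1] ]

-- Cycle on n vertices (n >= 3): each vertex is adjacent to its two neighbours.
cycleGraph :: Int -> Graph
cycleGraph n = IM.fromList
  [ (u, S.fromList [(u - 1) `mod` n, (u + 1) `mod` n]) | u <- [0 .. n - 1] ]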
Graph                        Number of CPU cores               GPU
                              1      2      3      4
Complete graph              59.9   37.4   25.8   23.6         20.4
Iterated complete graph     58.7   37.0   25.2   23.3         15.7
Random connected graph      59.8   37.4   27.9   24.0         14.9
Triangular mesh             59.6   37.5   25.5   23.9         15.6
Cycle                       59.8   37.1   25.5   23.8         18.1

Table 6.1: Running times (s) for 10 iterations on various graphs
Table 6.1 clearly shows that for the multicore implementation, the type of graph did not make
a significant difference to speedup. If there were an effect, the expected result would be a consistent
increase or decrease as the table is descended, going from the most to the least complete graph.
Some small differences were observed in the GPU figures, but given that there is no consistent trend, and that these measurements were taken on a different machine with a slower CPU
and an older compiler, they were not considered significant. It was therefore concluded that changing
the input graph alone does not affect the benefit of running the algorithm in parallel.
6.1.3 Impact of Problem Size and Iteration Count
The implementations were tested with complete graphs of varying sizes and over a varying number
of iterations. Full results can be found in Table D.1. Note that speedup was calculated relative to
the multicore implementation running on a single core, even for the GPU implementation.
As the graphs in Figure 6.2 show, speedup for two, three, and four cores was relatively stable
over the range of values tested (approximately 1.8, 2.4, and 3.1 respectively). There were some slight drops,
most noticeably at n = 600 on the right-hand graph. The overall speedup trend as the number of
CPU cores is varied is shown in Figure 6.3 and Table 6.2, which use averages of all the speedup
values calculated for this section. Whilst these show far from linear speedup, no efficiency
is lost between three and four cores. It would be useful to see how the speedup/efficiency trend
continues if more than four cores were available.
Number of cores       1      2      3      4
Average speedup    1.00   1.78   2.31   3.13
Efficiency         1.00   0.89   0.77   0.78

Table 6.2: Average speedup and efficiency by number of cores
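For reference, the efficiency row follows from the usual definition of parallel efficiency as speedup per core:

    E(p) = S(p) / p,    e.g.  E(4) = 3.13 / 4 ≈ 0.78.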
[Figure 6.2: The effects on speedup of varying the number of iterations applied to a K600 graph
(left), and of applying 600 iterations to a K graph of varying size (right). Each panel plots speedup
against iterations (left) or graph size (right), with one series per CPU core count (1-4) and one for the GPU.]
Unlike the CPU core results, the right-hand graph in Figure 6.2 shows an ever-increasing
speedup value for the GPU implementation. Further measurements, found in Table D.2, were taken
to see whether this trend continued.
Results including these measurements are shown in Figure 6.4. The increase in speedup slows
down as the graph size is increased: the largest increases occur between 200 and
600 vertices, after which the speedup value seems to be converging to around 6.
6.1.4 Comparison Implementations
For comparison, two sequential spring embedder implementations were developed:
• A Haskell implementation using mutable arrays in the IO monad (IOArray) for vertex positions.
  This was based on the multicore version with immutable arrays, but with extensive
  modifications to the step function to bring it into the IO monad (a sketch of this style of
  update loop follows the list).
• A C implementation. This also used mutable arrays, which are not unusual in a procedural
  language. The aim of this project was not to compare Haskell with other languages, but it was
  thought that running-time figures from a C implementation might help put the Haskell results
  in perspective.
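The following is a minimal sketch of the style of update loop used by the first of these, assuming a stand-in force function and a separate output array; the project's actual rewrite of step is not reproduced in the report.

import Control.Monad (forM_)
import Data.Array.IO (IOArray, getBounds, readArray, writeArray)

type Vector = (Float, Float)

plus :: Vector -> Vector -> Vector
plus (x1, y1) (x2, y2) = (x1 + x2, y1 + y2)

-- One iteration: read each old position, compute its total force, and write
-- the moved position into the output array.
stepIO :: (IOArray Int Vector -> Int -> IO Vector)  -- total force on a vertex
       -> IOArray Int Vector                        -- current positions (read)
       -> IOArray Int Vector                        -- next positions (written)
       -> IO ()
stepIO force old new = do
  (lo, hi) <- getBounds old
  forM_ [lo .. hi] $ \u -> do
    p <- readArray old u
    f <- force old u
    writeArray new u (p `plus` f)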
[Figure 6.3: Speedup of 800 iterations on K graphs of varying size (one series each for 200, 400, 600 and 800 vertices), plotted against number of cores.]
[Figure 6.4: Speedup of the GPU implementation over 600 iterations, plotted against graph size (200 to 1400 vertices).]
In brief testing, the mutable-array Haskell implementation completed 10 iterations of the algorithm on a 1000-vertex complete graph in 24.2 seconds. Compared with the 26.5 seconds in Table 4.7
this is an improvement; however, the immutable-array version is faster when run across two
or more processor cores. With most modern machines having multiple cores available, the original
version is therefore likely to be faster in practice. It may be possible to create an implementation that uses mutable arrays as well as running in parallel, but this is a more complex
task and would likely require explicit thread management, discussed below as potential future
work.
It was expected that the C implementation would be much faster than all the Haskell implementations, and indeed it took just 1.2 seconds to complete the above task. This was seven times faster
than even the Haskell implementation running across four cores. Increasing the number of iterations to 1000 narrowed this gap (3 minutes versus 13.5 minutes on four cores, or 7.5 minutes on
the GPU), but the difference was still large.
Whilst some Haskell programs have been shown to be competitive with C on performance,
there seems to be a large gap to bridge here. However, it is worth considering that the C implementation was more challenging to create from scratch, involving explicit memory management
and less concise syntax than Haskell. Lower performance may be the price to pay for these
conveniences.
6.2 Conclusion
This project aimed to review the applicability of Haskell's parallel programming provisions to the
problem of force-directed graph layout. An implementation was created and modified to run in
parallel on both multicore CPUs and GPUs, meeting the minimum requirements and the extension
described in Section 1.2.
The implementation was adapted following the planned methodology. The spring embedder
was found to be somewhat amenable to parallelism using features available in Haskell, achieving
75-90% efficiency. Haskell's pure-functional credentials allowed an initial implementation to be
prototyped and completed quickly. Parallel evaluation strategies were found to be easy to apply
to existing code without a large amount of refactoring.
Whilst the code anges required to move to data parallelism with the Accelerate library were
more disruptive, the difficulty of some anges was minimised by the library's use of Haskell
language features. For example, the ability to define higher-order functions meant that the actions
occurring in parallel could be defined in separate, scalar functions. Also, Haskell's complex type
system enabled Accelerate code objects to implement the same interfaces as normal numeric types,
su that simple operators like + could be applied to delayed expressions to build up the data
parallel program instead of performing actual addition.
Overall, the Haskell language and the libraries available with GHC were found to provide a suitable environment for the implementation and parallelisation of force-based graph drawing, and
the results obtained showed a reasonable level of success and were promising in relation to
potential future improvements.
6.3 Potential Further Work
This project has not explored all of the techniques available for the creation of parallel programs
in Haskell. Two alternatives are readily apparent as potential further work:
• Haskell's support for explicit concurrency, where the programmer manages the creation of
  threads and the allocation of work, may be able to provide further performance gains. However,
  such an implementation would require the introduction of more monadic Haskell and would
  likely reduce the clarity and maintainability of the code.
• Data Parallel Haskell, which provides data parallelism in a similar manner to Accelerate,
  but with a cleaner syntax and support for nested data parallelism. At the time of writing,
  a new stable release of Data Parallel Haskell is planned for the first half of 2011 with GHC
  7.2. The DPH developers intend to implement a GPU backend using OpenCL or CUDA in
  the future.
Bibliography
[1] A, G. Validity of the single processor approa to aieving large scale computing
capabilities. In AFIPS Conference Proceedings (1967), vol. 30, ACM, p. 483–485.
[2] B, M. On the Automated Drawing of Graphs. In Proceedings of the 3rd Caribbean
Conference on Combinatorics and Computing (1994), p. 43–55.
[3] B, G. N: A Nested Data-Parallel Language. Sool of Computer Science, Carnegie
Mellon University, 1992.
[4] B, U.,  S, B. Angle and distance constraints on tree drawings. In Graph
Drawing (2007), Springer, p. 54–65.
[5] B, N. Force-Directed Graph Layout with Barriers and Shared Channels. <http:
//chplib.wordpress.com/?p=633>, 2009. [Online; accessed 5 Mar 2011].
[6] C, M., K, G., L, S., MD, T.,  G, V. Accelerating Haskell
array codes with multicore GPUs. In Proceedings of the Sixth Workshop on Declarative Aspects of Multicore Programming (2011), ACM, p. 3–14.
[7] C, T., L, C., R, R.,  S, C. Introduction to Algorithms, 2nd ed. e
MIT press, 2001, p. 525–526.
[8] D B, G., E, P., T, R.,  T, I. Algorithms for Drawing Graphs:
an Annotated Bibliography. Computational Geometry-eory and Application 4, 5 (1994),
235–282.
[9] E, P. A Heuristic for Graph Drawing. Congressus Numerantium 42 (1984), 149–160.
[10] E, M. Inductive graphs and functional graph algorithms. Journal of Functional Programming 11, 05 (2001), 467–492.
43
[11] F, M. Some computer organizations and their effectiveness. IEEE Transactions on Computers C-21, 9 (1972), 948–960.
[12] F, A., L, A.,  M, H. A Fast Adaptive Layout Algorithm for Undirected
Graphs (Extended Abstract and System Demonstration). In Graph Drawing (1995), Springer,
p. 388–403.
[13] F, T.,  R, E. Graph Drawing by Force-directed Placement. Soware:
Practice and Experience 21, 11 (1991), 1129–1164.
[14] G, J. Reevaluating Amdahl's law. Communications of the ACM 31, 5 (1988), 532–533.
[15] H, J. Why Functional Programming Maers. Computer Journal 32, 2 (1989), 98–107.
[16] K, T.,  K, S. An algorithm for drawing general undirected graphs. Information
Processing Leers 31, 12 (1989), 7–15.
[17] M, S., M, P., L, H., A, M.,  T, P. Seq no more: Beer Strategies
for Parallel Haskell. In Proceedings of the third ACM Haskell symposium on Haskell (2010),
ACM, p. 91–102.
[18] NVIDIA. CUDA Programming Guide. <http://developer.download.nvidia.com/
compute/cuda/3_0/toolkit/docs/NVIDIA_CUDA_ProgrammingGuide.pdf>,
2010. [Online; accessed 8 May 2011].
[19] O, C.,  G, A. Fast mergeable integer maps. In Workshop on ML (1998), p. 77–86.
[20] P J, S. Haskell 98 language and libraries: the Revised Report. 2003.
[21] P J, S.,  S, S. A Tutorial on Parallel and Concurrent Programming in
Haskell. Advanced Functional Programming (2009), 267–305.
[22] R, J. HaageDB: graph-rewriting-layout-0.4.4. <http://hackage.haskell.
org/package/graph-rewriting-layout-0.4.4>, 2011. [Online; accessed 6 Mar
2011].
[23] S, M.,  M, I. L. HaageDB: graphviz-2999.11.0.0. <http://
hackage.haskell.org/package/graphviz-2999.11.0.0>, 2010.
[Online; accessed 6 Mar 2011].
44
[24] T GHC T. e Glorious Glasgow Haskell Compilation System User's Guide, Version
7.0.1: 4.10. Optimisation (code improvement). <http://www.haskell.org/ghc/docs/
7.0.1/html/users_guide/options-optimise.html>, 2010. [Online; accessed 24
April 2011].
[25] W, H. Heuristic graph displayer for G-BASE. International Journal of Man-Maine
Studies 30 (1989), 287–302.
[26] W, C.,  S, A. Tidy drawings of trees. IEEE Transactions on Soware
Engineering, 5 (1979), 514–520.
Appendix A
Personal Reflection
At the beginning of my degree, I found the idea of a 60-credit project quite daunting, struggling
particularly to imagine myself writing any more than a few paragraphs about any one topic.
However, as I write this at the end of the project, I think I've achieved a reasonable level of
success, so it's worth reflecting on what worked and what didn't.
From the beginning, and perhaps far from uniquely, I knew that time management was a
potential problem. By discussing an initial vague schedule with my supervisor at an early stage
and keeping it visible, I was able to tell approximately how I was progressing at
any point. I found that as the project went on, I was slightly behind schedule, and whilst this did
make things a little more heated as the deadline approached, this was wholly expected.
In terms of the time of day, I found that the times at which I was productive varied widely,
and so I worked on this project at various times. In this respect, it was convenient that I didn't have
to keep sociable hours for most of the time. Working late also allowed me to execute long-running
tests on school machines when they were not otherwise in use.
During the course of baground reading for the project, I assembled a BibTeX bibliography
file with corresponding references, and where possible saved a digital copy of papers. is saved
a lot of time in the production of this and the mid-project report.
The methodology of iterative improvement used in the development of the solution was well
chosen. It helped me to structure this report in an organised manner, as the exposition essentially
followed the same sequence as my thoughts at development time. It also helped me keep my
attention on the project, as at no point was I working on massive sets of changes without knowing
whether the result would work; my experiences were more varied, and the time from starting coding to
being able to see results was short.
My initial sedule called for the main stage of report writing to start aer the completion
of development work. In reality this meant I didn't do any writing work outside of the times
indicated on the sedule, and failed to keep records during development beyond a few rough
notes. As well as (obviously) increasing the workload during the final weeks, this meant that I had
some trouble remembering details of my thought process when writing. Aer breaking the ba of
the writing work (in the third week of the Easter period), I realised mu of the report could have
been draed during earlier, albeit with some anges being required based on later developments.
I would therefore recommend that future students keep a diary throughout the course of their
project, and consider draing text for the formal report at a mu earlier stage.
I did, however, keep many old copies of program source code for reference, which proved to be
worthwhile. I had considered setting up a revision control system for my project, as I was familiar
with such systems from large projects and from personal use, but concluded this was too much
effort for the project's scale. In the end, I simply copied the code and named it after the date or a
key feature, which was enough to jog my memory when referring back during the writing stage.
I already had knowledge of makefiles, so I spent a small amount of time automating the build
processes for the code, my presentation files, and this report using GNU Make. I would strongly
recommend this to future students, particularly those undertaking projects with more complex
build requirements than mine, as I believe it saved me time, and made re-drafting this report
much less painful, particularly given the latex, bibtex, latex, latex build cycle.
Before this project, I was familiar with alpha- and beta-quality software, but was not fully
prepared for research-quality software. I underestimated the time that would be required to set up an environment with the compiler and libraries I needed, particularly with respect to the Accelerate implementation. I had to try several combinations of GHC/Accelerate/CUDA versions before finding
one that worked. After getting the environment set up, I had to deal with error messages such as
undefined, in this case meaning that arrays of type Char were not supported by the CUDA
backend, and unspecified launch failure, seeming to mean that the video card was out
of memory. I should have anticipated issues such as these, as this is very much an open research
area ([6] was published around the start of this project), and it's unreasonable to expect works in
progress to be as polished as stable releases.
I would certainly encourage all future students to be sure to choose a project in an area that
they can work on for a third of a year without getting bored. Overall, I feel the project has been
an interesting and personally enjoyable task to complete, and that as well as gaining experience of
working in this structured manner, I've gained a deeper understanding of the topics involved and
become better at questioning my own arguments. Whilst I can't say every stage of the process has
been stress-free, I have stayed interested in the project, and who knows, I might even understand
monads now.
Appendix B
Materials Used
ˆ e project made use of functionality from the standard libraries that are bundled with
GHC and the Haskell Platform, whi includes parallel. e accelerate, gtk2hs,
and cairo paages were installed from HaageDB along with their dependencies.
ˆ As required by Accelerate, the CUDA Toolkit 3.1 was installed from the NVIDIA website.
ˆ e GTK+/Cairo visualisation used in the project was loosely based on code from a radial
tree drawing implementation by Dr David Duke.
Appendix C
Supplemental Source Code
C.1 Iteration Method
The following listing shows the functions that apply one step of the iteration to the graph, which
should be read together with Sections 3.2 and 3.3.
-- Sum of the attractive forces on a vertex at position 'up' from its neighbours 'vs'.
edge_forces :: Positions -> Vector -> S.Set Int -> Vector
edge_forces p up vs = S.fold op2 (0, 0) vs
  where
    op2 v tf = ef `plus` tf
      where
        ef = edge_force up (p IM.! v)

-- Sum of the repulsive forces on a vertex at position 'up', folded over all vertex positions.
repulsions :: Positions -> Vector -> Vector
repulsions p up = IM.fold op2 (0, 0) p
  where
    op2 vp tf = rf `plus` tf
      where
        rf = repulse_force up vp

-- Apply one iteration: each vertex moves by its total force (limited via maxN 20).
step :: Graph -> Positions -> Positions
step g p = IM.mapWithKey step1 p
  where
    step1 u up = up `plus` f u up
    f u up = maxN 20 $ f1 `plus` f2
      where
        f1 = repulsions p up
        f2 = edge_forces p up (g IM.! u)
C.2 GTK+/Cairo Visualisation
The following listing shows the drawing code for the visualisation shown in Figure 3.1, and the
main method with added code to set up the drawing canvas.
import Control.Concurrent
import qualified Graphics.UI.Gtk as G
import qualified Graphics.Rendering.Cairo as C
edges g p = [ (p IM.! u, p IM.! v) | (u, vs) <- IM.assocs g, v <- S.toList vs ]
uncurry3 :: (a -> b -> c -> d) -> ((a, b, c) -> d)
uncurry3 f (a, b, c) = f a b c
renderGraph :: G.DrawingArea -> Graph -> [Positions] -> IO Bool
renderGraph canvas g p = do
threadDelay 1000000
win <- G.widgetGetDrawWindow canvas
(width , height) <- G.widgetGetSize canvas
mapM (uncurry3 $ renderStep win) (zip3 [0..] (repeat g) p)
return True
where
renderStep win i g p = do
threadDelay 5000
G.renderWithDrawable win $ do
C.setSourceRGBA 1 1 1 1
C.rectangle 0 0 400 400
C.fill
C.setSourceRGBA 0 0 0 1
mapM renderEdge (edges g p)
mapM renderNode (IM.elems p)
renderEdge ((x1, y1), (x2, y2)) = do
C.moveTo (realToFrac x1) (realToFrac y1)
C.lineTo (realToFrac x2) (realToFrac y2)
C.stroke
renderNode (x, y) = do
C.arc (realToFrac x) (realToFrac y) 3 0 (2 * pi)
C.fill
main :: IO()
main = do
let g = trimesh 35
let p = initPos g
let ps = steps g p
G.initGUI
window <- G.windowNew
canvas <- G.drawingAreaNew
G.widgetSetSizeRequest window 400 400
G.onKeyPress window $ const (do G.widgetDestroy window; return True)
G.onDestroy window G.mainQuit
G.onExpose canvas $ const (renderGraph canvas g (take 1 $ drop 1000 ps))
G.set window [G.containerChild G.:= canvas]
G.widgetShowAll window
renderGraph canvas g (take 1000 ps)
G.mainGUI
C.3 Iteration with Accelerate
The following listing shows the new step function, rewritten to use parallel array computations
with Accelerate. The process is described in Section 5.1.3.
step :: SparseForces -> Positions -> Positions
step (segs, fl) p = ACC.toIArray $ ACC.run $ np
  where
    fl'   = ACC.use $ ACC.fromIArray fl
    segs' = ACC.use $ ACC.fromIArray segs
    p'    = ACC.use $ ACC.fromIArray p

    np :: ACC.Acc (ACC.Vector Vector)
    np = ACC.zipWith op p' fsg
      where op tp tf = tp `plusA` (maxNA 20 tf)

    fsg :: ACC.Acc (ACC.Vector Vector)
    fsg = ACC.foldSeg plusA (ACC.constant (0, 0)) fs segs'

    fs :: ACC.Acc (ACC.Vector Vector)
    fs = ACC.map calc_force fl'

    calc_force :: ACC.Exp Force -> ACC.Exp Vector
    calc_force t = calc_force' $ ACC.untuple t

    calc_force' :: (ACC.Exp ForceType, ACC.Exp Int, ACC.Exp Int)
                -> ACC.Exp Vector
    calc_force' (ft, u, v) = (ft ACC.==* 1) ACC.?
                             (edge_force up vp, repulse_force up vp)
      where
        up = p' ACC.! u
        vp = p' ACC.! v
Appendix D
Supplemental Results
Graph size   Iterations    1 core   2 cores  3 cores  4 cores     GPU
   200          200         15.8      9.4      7.4      5.6      14.4
   200          400         31.9     18.9     13.7      9.6      27.4
   200          600         32.8     18.8     16.1     10.9      40.4
   200          800         32.6     19.1     14.4     11.0      54.2
   400          200         70.2     38.4     28.1     22.9      24.3
   400          400        142.2     77.2     59.3     45.6      47.0
   400          600        213.2    117.3     88.9     64.6      69.3
   400          800        286.1    157.7    123.3     86.7      91.7
   600          200        165.0     91.2     72.8     50.5      40.9
   600          400        337.6    183.0    140.0    103.4      79.0
   600          600        499.6    280.1    214.0    156.4     117.1
   600          800        692.3    375.3    278.3    218.1     155.6
   800          200        296.6    171.7    144.5     96.5      63.7
   800          400        599.0    344.9    257.3    192.1     122.8
   800          600        937.3    521.3    377.4    303.2     183.2
   800          800       1230.8    688.5    563.2    407.8     243.2

Table D.1: Full table of running times (s) from Section 6.1.3
Graph size       200    400    600    800    1000    1200    1400
CPU (1 core)    32.8  213.2  499.6  937.3  1458.8  2119.4  2929.8
GPU             40.4   69.3  117.1  183.2   272.2   377.8   504.8

Table D.2: Extra GPU measurements (running times in seconds, 600 iterations) from Section 6.1.3.