Otto-von-Guericke-Universität Magdeburg
Fakultät für Informatik

Bachelor Thesis

An Implementation and Investigation of Depth-First Work-Stealing

Sebastian Dörner
Weitlingstraße 9, 39104 Magdeburg, Germany
[email protected]

1st March 2011

Examiner:
Prof. Dr. Stefan Schirra
Otto-von-Guericke-Universität Magdeburg
Universitätsplatz 2
39106 Magdeburg
Tel: +49-391-67-18557
Email: [email protected]

Supervisor:
Dr. Burkhard D. Steinmacher-Burow
IBM Deutschland Research & Development GmbH
Schönaicher Straße 220
71032 Böblingen
Tel: +49-7031-16-2863
Email: [email protected]
Abstract
Multi-threaded programming in conventional programming languages requires the developer to distribute work to and manage threads manually. With an increasing number of processor
cores in mainstream hardware, taking advantage of these cores
demands more and more management and thus diminishes programmer productivity, which is known as the multi-core software
crisis.
To address this problem, new runtime libraries and programming
languages have been developed. Many of the latter – among them
MIT Cilk and IBM CES – employ the breadth-first work-stealing
approach, where a processor executes work from its own data
structure, but steals work from other processors once its own
structure is empty. The unit of work that is stolen is called a
task. In breadth-first work-stealing, usually large tasks from a
high level in the call hierarchy are stolen, which leads to different
cores working on widely separated parts of the code. In depth-first
work-stealing, smaller tasks from lower levels of the call hierarchy
are stolen and different cores tend to work on nearby parts of
the code. When multiple cores have a shared cache, this might
improve the cache utilization and thus speed up the execution.
This thesis extends the IBM CES compiler and runtime to also
support depth-first work-stealing. For this, we implemented a
system for analyzing arbitrary dependencies between tasks at run
time and scheduling them to run in parallel. As far as we know,
CES with this extension is the first parallel language to support
automatic dependency analysis of tasks with nested parallelism.
Furthermore, we found a way to implement additional
array support and thus enabled some important applications.
Contents

1 Introduction
   1.1 Motivation
   1.2 Existing CES Implementation
   1.3 An Improved Approach
   1.4 Thesis Objectives
   1.5 Related Work
   1.6 Outline
2 Previous Work on the CES Programming Language
   2.1 Language Concepts
      2.1.1 Architecture Overview
      2.1.2 A New Model for Function Calls
      2.1.3 The Original CES Syntax
   2.2 Previous Execution Systems
      2.2.1 Tasks in the Stack Execution System
      2.2.2 Data Structures and Their Implications
      2.2.3 Relationship to Cilk and the Deque Execution System
   2.3 Deque ES Concept
3 Design of the Deque Execution System
   3.1 Overview
   3.2 Dependency Analysis
      3.2.1 Types of Data Dependencies
      3.2.2 The Dependency Analysis Table
      3.2.3 The Dependency Analysis Algorithm
   3.3 Notification of Dependent Tasks
   3.4 Scheduling and Work-Stealing
   3.5 Synchronization
   3.6 Memory Management for Data Items
   3.7 Manual Encoding of Task Dependencies
   3.8 Additional Array Support by the Execution System
      3.8.1 Overview
      3.8.2 Syntax
      3.8.3 Use Case: Algorithms on Blocked Data
4 Implementation of the Deque Execution System
   4.1 Data Structures for Dependency Analysis and Task Notification
   4.2 Notification of Dependent Tasks
   4.3 Scheduling and Work Stealing
   4.4 Synchronization
   4.5 Memory Management for Data Items
   4.6 Speed Improvements
      4.6.1 Using Single Variables for the Dependency Analysis Table
      4.6.2 Avoiding O(n) Operations on Callback Lists
      4.6.3 Using Free Pools for Task and Data Frames
      4.6.4 Scheduling According to Hardware Threads
   4.7 Array Support
5 Performance Comparisons
   5.1 Goals
   5.2 Test Configuration
   5.3 CES Applications Used
      5.3.1 Recursive CES Applications
      5.3.2 Cholesky Decomposition
      5.3.3 Sweep 2D
   5.4 Results
      5.4.1 Scaling of Work-Stealing Modes and Shared Deques
      5.4.2 Overhead of the Execution System
      5.4.3 Sweep 2D Results
6 Conclusions
   6.1 Results
   6.2 Further Research Possibilities
      6.2.1 Advancing the CES Language and Implementation
      6.2.2 Evaluation of the Current CES State
Bibliography
Selbstständigkeitserklärung (Statement of Authorship)
Acknowledgments
I would like to thank Dr. Burkhard Steinmacher-Burow for offering a very interesting
subject and for supporting me throughout my internship at IBM and the writing of this
thesis. He was a tremendous source of good advice and never got tired of me seeking
it. I would also like to thank Prof. Dr. Stefan Schirra for supervising my bachelor
thesis at university, for giving me some valuable insights into academic writing and for
supporting me throughout my studies.
Furthermore, I would like to thank Uwe Fischer and all the people in his department
for a warm welcome at IBM. Thanks to Benjamin Ebrahimi, Benjamin Krill, Heiko
Schick, Peter Morjan, Bryan Rosenburg and Tom Musta for helping me with technical
issues.
Thanks to Prof. Dr. Dietmar Rösner for establishing contact with IBM.
I would like to thank Anett Hoppe, Anja Bachmann and Benjamin Espe for proofreading drafts of this thesis and for giving me some useful suggestions.
Finally, I would like to express my deep gratitude to my parents Heike Dörner and
Torsten Mehlhase, who enabled my studies and always encouraged and supported me.
1 Introduction
1.1 Motivation
Until about 2005, gains in processor performance were mainly achieved by advancing the
processor core architecture and increasing the clock frequency. Improved manufacturing
techniques enabled smaller transistors with higher switching speeds. As Robert Shiveley
from Intel explained, this development has a drawback, which is increasingly difficult
to handle: “Power consumption and heat generation rise exponentially with clock
frequency. Until recently, this was not a problem, since neither had risen to significant
levels. Now both have become limiting factors in processor and system designs.” [Shi10]
High power consumption is not only a technical limiting factor; it is also unwanted
in light of one of the latest political and industrial trends, “Green IT”. Governments
recognize data centers, for example, as major contributors to global greenhouse gas emissions
and fund research on improving energy efficiency [U.S10]. IT companies
label their products as “green”, promising lower operating costs [IBM10, VMw10]. For
these products to achieve their relatively low power consumption, it is essential to limit
the clock rate and instead look for other possibilities to achieve high performance.
Since transistors continue to shrink, the most common way to achieve high performance with lower clock rates is to use processors with multiple cores. On the hardware
side, this provides more and more raw computational power. But it also has serious
implications for the software side: Conventional computer programs were not designed
to exploit the capabilities of multiple CPUs or CPU cores, which means that they do
not run effortlessly faster on next-generation processors as they did in the past. With multi-core
processors already reaching the cell phone class of devices [Sav10], these problems will
affect large parts of the software industry.
To make use of additional cores, the program code must be distributed to multiple threads of execution. Most traditional programming languages offer libraries to
start threads and collect their results. However, the application programmer herself
must manually identify which parts can run in parallel, distribute the workloads
and add code to manage threads instead of concentrating on the application logic.
Therefore, university and corporate research seeks to automate the distribution of applications to several threads as far as possible, thereby preserving programmer
productivity.
For that purpose a multitude of systems have been developed, for instance Intel
Threading Building Blocks (TBB) [TBB10], the Berkeley Open Infrastructure for
Network Computing (BOINC) [BOI10], MIT Cilk [Cil10] and SMP Superscalar [SMP10].
Another of these systems, developed at IBM, is C with Execution System (CES),
an extension of the C programming language.
1.2 Existing CES Implementation
CES is based on an alternative implementation of the subroutine or function call,
developed by Dr. Burkhard D. Steinmacher-Burow [SB00b]. Such a subroutine, also
called task in the following, is characterized by its input, inout (input-output) and
output parameters. Unlike the conventional implementation of function calls, a call
to a CES subroutine is only executed after the parent task has completed. Therefore,
each thread executes only one task at a time, in contrast to the conventional call stack
with multiple subroutines in execution, each suspended at a call to a child. In CES,
once the parent task has completed, control is returned to an execution system, hence
the name CES (C with Execution System). This execution system (ES) is responsible
for dispatching new tasks ready to be executed.
Whether a task is ready to be executed mainly depends on the availability of its
input and inout parameters. Thus, the access to parameters determines the runtime
dependencies between tasks. Unless a program is very sequential in its structure, there
are usually multiple tasks ready to be executed. It is up to the execution system
to schedule these tasks. More importantly, they can be distributed among multiple
threads and run on different processors of a shared memory machine.
A well-known way to distribute the tasks without too much interference between
different threads is the work stealing approach [BL94], in which processors execute
work from their own data structures but steal tasks from other processors once they
run out of work. To exploit temporal locality, children of the current task are usually
executed directly after the parent finishes and on the same thread, which leads to a
depth-first execution order. For example, a thread operates on a stack of tasks, only
accessing the top for put and take operations.
Other threads may steal tasks from the top or bottom of the stack, a choice that is
expressed by the terms depth-first and breadth-first work-stealing respectively. Breadth-first work-stealing has the advantage that processors working on widely separated
tasks typically do not invalidate each other’s cache. Furthermore, because tasks at
the bottom usually represent larger tasks, typically fewer steals are necessary. For
multiple processor cores with a shared cache however, depth-first work-stealing may
reduce cache misses through eviction since the cores would work on nearby parts of
the code, which largely operates on the same parts of memory.
For CES, there are multiple execution systems to choose from: The Sequential
Execution System is single-threaded and uses no work-stealing at all. The Round-robin
Execution System uses breadth-first work-stealing but is single-threaded; one thread of
execution simulates multiple virtual processors, thus avoiding the need for concurrency
synchronization. The Stack Execution System works with multiple threads and uses
breadth-first work-stealing as well.
The current implementation of CES also includes the CES Compiler (CESC) developed by Sven Wagner as part of his Diploma Thesis in 2007 [Wag07]. While Wagner
only used a hard-coded version of the Sequential Execution System, Jens Remus designed a macro interface to allow multiple execution systems and then developed the
previously mentioned Round-robin and Stack Execution Systems, also as part of his
Diploma Thesis [Rem08].
However, the current implementation has several shortcomings (cf. [Rem08, pages
147 – 149]). First, the application programmer identifies tasks that can run in parallel
by marking calls to subroutines with the parallel keyword. This identification is
a manual process and the simple binary marking cannot exploit some opportunities
for parallelism. Similar to Cilk, this approach is mostly useful for divide-and-conquer
algorithms but not for arbitrary dependency structures between tasks. Second, the
runtime data structures and code are quite complicated: All tasks are managed in a
data structure that is called a stack, but is in fact also randomly accessed. Since the
structure includes both tasks that are ready to be executed and tasks with outstanding
dependencies, other threads stealing from the stack have to search through it in order
to find a task that may be stolen. To enable this behavior, the old system also needs
sophisticated synchronization mechanisms.
1.3 An Improved Approach
In an improved approach, a new execution system (ES) would address the above
shortcomings of the Stack ES. The main idea is to use a double-ended queue (deque)
instead of the current pseudo-stack. In contrast to the current solution, it would only
hold tasks that are ready to be executed and therefore make it possible to access it
with correct deque semantics (see Section 2.3).
The current pseudo-stack implementation has the advantage that the execution order
as given by the programmer is internally represented by the order of tasks on the
stack. Together with the programmer’s indication of which tasks can run in parallel,
this information is enough to implicitly enforce the data dependencies between tasks.
With only ready-to-be-executed tasks on the deque of the new implementation, this
information is lost. To guarantee a valid execution order nonetheless, we must analyze
the actual dependencies between tasks and schedule them accordingly. Furthermore,
this automatic dependency resolution will probably lead to a better exploitation of
parallelism compared to the previous solution, which conservatively approximated the
true dependencies.
1.4 Thesis Objectives
The main objective of this thesis is to implement the described improved approach in a
new and efficient execution system, which will use a deque as the main data structure.
With a deque holding the tasks, stealing from the top and bottom yields depth-first
and breadth-first work-stealing respectively. The implementation should enable us to
easily switch between both alternatives and to compare them.
We must design and implement a system to analyze the dependencies between tasks
and properly integrate it with the current interface for execution systems. If need
be, we will extend the existing interface but also keep compatibility with the original
execution systems.
Furthermore, the new execution system should run on Blue Gene® hardware in
addition to standard x86 machines. This enables us to use a very efficient existing
implementation of a concurrent deque for x86 and Blue Gene/Q by Manuel Metzmann [Met09]. Since the next Blue Gene generation also provides multiple hardware
threads, we can easily verify the scaling capabilities of the new execution system.
In a nutshell, building on the old CES system, the existing concept for the new Deque Execution
System and the concurrent deque implementation, we build a new execution system
with automated dependency analysis, easy switching of work-stealing modes and a
better exploitation of parallelism. Figure 1.1 illustrates this objective, also indicating
the sections in which each of the parts will be explained in detail.
[Figure 1.1: Foundations and objective of this thesis. The old CES system (Sequential ES, Round-robin ES, Stack ES; Sections 2.1 and 2.2), the existing concept for the Deque ES (Section 2.3) and the concurrent deque implementation (Section 2.3) feed into the new implementation of the Deque ES (Chapters 3 and 4).]
1.5 Related Work
The main ideas of the work-stealing principle have already been mentioned in the early
1980s [BS81]. Since then, it has been used to develop both parallel runtime libraries like
Intel Threading Building Blocks [TBB10] or Java’s Fork/Join Framework [Lea00] and
programming languages like Cilk [FLR98] or Cilk++ [Lei09]. These systems and also
the old CES execution systems use breadth-first work-stealing. As different threads in
breadth-first work-stealing tend to have disjoint working sets, this is a very good choice
for multiple processors with distinct caches. In recent years however, processors with
multiple cores have emerged as the dominant architecture, not only for mainstream
computers. In a typical multi-core chip, all of the chip’s hardware threads share a
cache. The different working sets of these threads in breadth-first work-stealing may
disturb each other’s cache usage.
This development directed research towards schedulers with an emphasis on constructive cache sharing. Concurrently scheduled tasks on hardware threads of one processor
should operate on similar data so that all threads can make use of the same data in
the cache. Parallel Depth First (PDF) [BGM99] is a scheduler performing constructive
cache sharing. Scheduled to run next is the task that would be executed next in the
serial execution. While depth-first work-stealing does not strictly provide this property,
it still prefers the execution of recently-spawned tasks and thus shows a similar behavior.
The performance benefits of PDF compared to breadth-first work-stealing as reported
in [CGK+07] might appear in depth-first work-stealing as well. Furthermore, the only
difference between depth-first and breadth-first work-stealing is the pop operation
used for the deque. Hence, in CES we can easily combine them and e. g. try stealing
depth-first on the same processor core and breadth-first across processor cores.
The idea of employing a deque with strict semantics to store tasks is not new to the
field. Blumofe and Leiserson [BL93] introduced a model to represent a ready-to-execute
task and its successors in a fixed linear group, called a thread (we use their definition of the word thread only in this paragraph). They later [BL94] present
a scheduler that stores these threads in double-ended queues, each of which is assigned
to a fixed processor. In contrast to our solution, these threads and hence the deques
can contain tasks, which are not yet ready-to-execute. When the execution hits such a
task, the thread blocks and the execution continues with a new thread popped from
the processor’s deque. Based on this work, several deque implementations have been
developed and used in task schedulers [ABP98, CL05].
In [ALS10], Agrawal et al. present the NABBIT library that enables the execution
of both static and dynamic task graphs (see Subsection 2.1.2) in the work-stealing
environment of Cilk++. Instead of relying on the sequential program code like the
Deque ES, they require the programmer to specify the nodes of the graphs and their
dependencies explicitly. NABBIT performs the search for ready-to-execute tasks
backwards, starting from the final task, which is the sink of the graph. Therefore, it
keeps references to predecessors and successors of a task, whereas the Deque ES only
keeps forward links. However, parts of their main execution driver code for dynamic
task graphs, in particular DecComputeNotify and ComputeAndNotify in [ALS10, Fig.
8], are similar to the Deque ES.
Perez, Badia and Labarta devised the SMP Superscalar (SMPSs) programming
model [PBL07, PBL08], which analyzes the dependencies of tasks at run time, just
like the Deque ES. They use conventional C functions as the unit of parallelism and
annotate them with C pragmas to distinguish input, inout and output parameters.
The array support of the Deque ES (see Section 3.8) was inspired by SMPSs [PBL08,
Section IV]. A notable advantage of the Deque ES over SMPSs concerns nested tasks.
In the Deque ES, tasks executed in parallel can spawn child tasks, which are also
executed in parallel. In SMPSs the children of a task spawned in parallel are executed
serially like any conventional C function, i. e. “SMPSs does not currently support
nesting” [PBL10, Section 6].
Song et al. developed the library Task-based Basic Linear Algebra Subroutines
(TBLAS) [SYD09]. It implements a widely-used interface [LHKK79, DDCHH88,
DDCHD90], which is the foundation of many linear algebra algorithms. The new
implementation of each subroutine generates a set of tasks, which are executed dynamically after having their dependencies analyzed. As the dependency analysis algorithm
and scheduling scheme are totally distributed and “the runtime system has no globally
shared data structures” [SYD09], TBLAS runs on both shared-memory and distributed-memory systems. In contrast to CES, it is specialized in the linear algebra domain
and hence not suitable for executing arbitrary programs in parallel.
1.6 Outline
The rest of this thesis is organized as follows: Chapter 2 presents the original state of
the CES language and runtime environment, which is the starting point for this work.
It also includes the existing rough concept for the new execution system, the so-called
Deque ES. The detailed design and implementation of this concept constitute the main
contribution of this work and are explained in Chapters 3 and 4. Chapter 5 presents
some brief performance comparisons between different scheduling strategies including
depth-first and breadth-first work-stealing. Finally, Chapter 6 concludes this thesis by
summarizing our results and giving an outlook to further research possibilities.
2 Previous Work on the CES Programming Language
2.1 Language Concepts
2.1.1 Architecture Overview
First, we explore how to create an executable application from a CES source file. The
CES language is an extension of the C language. For that reason and to simplify the
implementation, the CES compiler only translates CES programs to C code. The
intermediate C code includes a lot of macro calls to the execution system, a separate C
module that implements the parallel execution of the application. The C code for the
application and the execution system are compiled and linked as any C program to get
the executable application. The whole compilation process is illustrated in Figure 2.1.
[Figure 2.1: Compilation process for CES programs (based on [Rem08, p. 23]). The CES Compiler translates the CES source into intermediate C source, which the C Compiler compiles and links together with the execution system into the executable application.]
The execution system takes care of the parallel execution of the application. For that
purpose, the application code is split up using a new implementation of subroutines,
so-called tasks (see Subsection 2.1.2), which are scheduled by the execution system.
That is, the application code and the execution system communicate over a task-based
interface. The execution system assigns these tasks to a number of threads, which are
scheduled by its environment. So the execution system interacts with its environment,
usually the operating system and POSIX threads, using a thread-based interface. These
interfaces and the three main components of the application architecture are presented
in Figure 2.2.
[Figure 2.2: Architecture of CES applications (based on [SBWR08, Fig. 1]). The application code (application logic) communicates with the execution system (task scheduling & parallel execution) over the task interface; the execution system communicates with the hardware and software environment (thread scheduling, resource allocation, etc.) over the thread interface.]
2.1.2 A New Model for Function Calls
As stated earlier, CES is built upon a new model for function calls, developed by
Burkhard D. Steinmacher-Burow [SB00a, SB00b]. In this model, child tasks spawned
by a subroutine are only executed after the parent routine finishes. Therefore, there is
no such thing as a return value which is delivered to the parent. Instead, the parent
hands over input, inout and output parameters as references. Results of a routine
are written into inout or output parameters and may be consumed by other tasks.
Suppose for instance, we want to calculate k = a · b + c · d. Listing 2.1 uses the new
kind of function call to perform this calculation. The three types of parameters (input,
inout, output) are separated by semicolons. In this case, we do not have any inout
parameters.
    1  mult(a,b;;m);
    2  mult(c,d;;n);
    3  add(m,n;;k);
Listing 2.1: Calculating k = a · b + c · d using the new function call implementation
Lines 1 and 2 perform the multiplications and store the intermediate results in variables
m and n respectively. Afterwards, the add task calculates their sum to get the final
result k.
Recall that the parent routine (not included in the listing) runs prior to those three
children and never gets to see the intermediate or even final results. It only passes the
same references to multiple tasks and thus provides for their exchange of information
through variables m and n. It is the add task’s responsibility to process the intermediate
results. This is not by accident, but a general property of the new subroutine concept.
Since a calling routine never gets any results of child tasks, it cannot process them
on its own. Instead, it always needs to spawn other children to process intermediate
results.
Notice that the two mult tasks do not depend on each other. They may be executed
in any order or even in parallel on multiple processors, a choice which is to be made
by the execution system. The execution system can schedule tasks arbitrarily as long
as the results are equal to the sequential execution order as given by the application
source code. To this end, it must know the dependencies between subroutines, which it
might either determine on its own or use hints given by the programmer.
Tasks and their dependencies can be represented by a directed acyclic graph (DAG).
The nodes of the graph correspond to the task instances and an arc from node A to
node B means that the task B cannot start executing until task A has completed. Such
a task graph is always acyclic because among multiple tasks with circular dependencies
no task could ever run. Since a task graph can represent arbitrary acyclic dependencies,
it is the most general representation, and one goal for execution systems to strive for is
to execute arbitrary task DAGs efficiently.
Now suppose we want to define a discrete subroutine to calculate the value of k from
above. The definition of Listing 2.1 is encapsulated into a task as in Listing 2.2. The task
calculateK takes the four components as input and k as an output parameter. Listing
2.3 shows a call to calculateK succeeded by a call to processK, a subroutine that
uses the result k to do something useful. Obviously processK depends on calculateK.
However, calculateK does not deliver the desired item (k) itself, but merely delegates
this duty to add. Therefore, after calculateK has run, the task processK depends on
add; the dependency is handed over to another task. We call this principle delegation.
If we look at the DAG representation of our example, initially there are only two nodes
for calculateK and processK as illustrated in Figure 2.3 (a). When calculateK runs,
its node is replaced by a sub-DAG consisting of the three nodes for the child tasks.
The resulting graph is shown in Figure 2.3 (b).
    calculateK(a,b,c,d;;k) {
        mult(a,b;;m);
        mult(c,d;;n);
        add(m,n;;k);
    }
Listing 2.2: A discrete task to calculate k

    calculateK(a,b,c,d;;k);
    processK(k;;);
Listing 2.3: Using the previously defined task
2.1.3 The Original CES Syntax
Up to now, all code examples used a pseudocode closely related to the actual CES
syntax. We will look at the latter in detail now. The CES Compiler translates CES
source code to C code and since CES is an extension of C, all normal C code in CES
files is, roughly speaking, copied to corresponding C files. In order to easily find those
parts that the CES compiler must really care about, the CES syntax elements are
separated by dollar signs. To illustrate this matter of fact, we translate our previous
example to valid CES syntax as in Listing 2.4.
[Figure 2.3: The node of calculateK is unfolded into a sub-DAG. (a) Two nodes, calculateK and processK, with an arc from calculateK to processK. (b) The calculateK node is replaced by the two mult nodes feeding into add, which in turn feeds into processK.]
    1  $calculateK(int a,int b,int c,int d;;int k){
    2      $mult(a,b;;int m);$
    3      $mult(c,d;;int n);$
    4      $add(m,n;;k);$
    5  }$
Listing 2.4: The task calculateK in CES syntax
The whole definition of the task is enclosed in dollar signs, likewise the individual
calls to other CES tasks. As with normal C functions, the body of a subroutine is
enclosed in curly braces. It may contain normal C code and special CES features, like
the task calls in Listing 2.4.
The parameter definition list (Listing 2.4, line 1) is separated by semicolons into
three parts for input, inout and output parameters. In this example, the second part is
empty. Input parameters may only be read, input-output parameters may be read and
written, whereas output parameters may only be written. Multiple parameters of the
same kind are separated by commas. Since CES is a C-derivative, all CES variables
have a normal C type, which is specified in front of the parameter name. For ease
of implementation, the type name must consist of a single identifier. Hence, to use
compound types like unsigned char, declare an alias with typedef as illustrated in
Listing 2.5. The general syntax for CES definitions is given in Listing 2.6. Brackets
indicate optional parts.
typedef unsigned char uchar;
typedef int * int_ptr;
Listing 2.5: Using compound types with typedef
$<task name>([<definition of input parameters>];
[<definition of inout parameters>];
[<definition of output parameters>]){
...
}$
Listing 2.6: Definition of a CES task
$[parallel] <task name>([<list of input parameters>]; [<list of inout parameters>];
[<list of output parameters>]);$
Listing 2.7: Call to a CES task
In addition to the definition of calculateK, Listing 2.4 also includes multiple CES subroutine calls, the abstract syntax for which is given in Listing 2.7. The optional preceding
keyword parallel is a hint to the execution system that this call can run in
parallel, i. e. it does not depend on other tasks that have been called earlier in this
task.
The parameters in the call are in the same order as in the definition. Whether their
type must be given depends on the kind of variable: there are two kinds, normal C variables
and CES variables. CES variables are passed by reference and their lifetime usually
extends beyond the lifetime of a task. Their storage space is managed by the execution system.
The parameters of CES tasks, no matter if input, inout or output, always consist of CES
variables. When we call a CES subroutine with a CES variable as a parameter, we
do not need a type specifier, since the compiler already knows about that variable. If
we pass a C variable as a parameter, a CES variable is created and the C variable
is used to initialize it. For that purpose, the type of the variable must be specified.
Output parameters need no initialization, because they are never read in the called
task. Therefore we can create a new CES variable during the call without explicit
initialization. This happens when we specify as an output parameter the type and
name of a variable that does not exist yet. This newly created CES variable can be
used as an input or inout parameter to succeeding tasks, just like any other CES
variable. For ease of implementation, a CES variable that has been created as an
output parameter can currently not be used as an output parameter of a subsequent
task in the same parent. Instead, one can pass it as an inout parameter, no matter if it
is ever read or not.
When a task passes on its arguments to a child task, it must respect their parameter
types. Pure input parameters may not be passed on as inout or output parameters,
whereas the opposite direction is possible. In fact, inout and output parameters can be
passed on as any parameter type. The advantage of using output parameters is that
they do not need any initialization as mentioned above.
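To make these rules concrete, the following sketch shows a parent task passing its parameters on in different roles. The child tasks child1, child2 and child3 are hypothetical and only serve to illustrate the legal combinations.

    $parent(int a; int b; int c){
        /* The input parameter a may only be passed on as an input. */
        $child1(a,b;;int t);$
        /* The inout parameter b may be passed on in any role, here as inout. */
        $child2(t;b;);$
        /* The parent's output parameter c is delegated to a child as output. */
        $child3(t;;c);$
    }$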
If we want to access a CES variable within a task, we must simply enclose the
variable name in dollar signs to differentiate it from normal C variables. Remember
that these variables are passed by reference.
Finally, one note about the interaction of C and CES subroutines: You can call C
functions from within CES tasks. These C functions are executed synchronously as in
ordinary C, i. e. you get a return value, which can be processed immediately. What is
not possible, however, is to call CES tasks from within C functions. This implies that
the initial function of a CES program must be a CES task. This special task is called
program, its signature is given in Listing 2.8.
typedef char** argv_t;
$program(int argc, argv_t argv;;);$
Listing 2.8: Signature of the program task
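As a small, hedged sketch of a complete entry point, a program task could spawn the fibonacci task defined in Listing 2.9 below and hand its result to another child; print_uint64 is a hypothetical helper task introduced only for this illustration.

    typedef char** argv_t;          /* as in Listing 2.8 */

    $program(int argc, argv_t argv;;){
        uint32_t n = 10;
        $fibonacci(uint32_t n;;uint64_t fib);$
        /* The result is processed by another child task, not by program itself. */
        $print_uint64(fib;;);$
    }$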
To illustrate the features explained, we present the definition of a CES task that
calculates the Fibonacci numbers in Listing 2.9. For a given number n, the input
parameter, we calculate the nth Fibonacci number as a result. The recursive algorithm
is well known: line 9 represents the base case, lines 11 to 16 the general case. Now
we want to put emphasis on the CES syntax: Line 7 starts the task definition, with n
as an input and result as an output parameter. In lines 8 and 9, the CES variables
n and result are locally accessed. Therefore, they are encapsulated in dollar signs.
Lines 11 and 12 declare local C variables and calculate their values in terms of n, which
is again accessed as a CES variable. Lines 14 and 15 spawn the two child tasks to
calculate the previous Fibonacci values. Both calls are enclosed in dollar signs and
marked by the parallel keyword. The latter is possible, because the calls do not
depend on each other. In contrast, add_uint64 in line 16 processes the output of the
calls to fibonacci and is therefore not spawned in parallel. Now look at the use of
type specifiers in task calls. Since n1 and n2 are C variables, their types are given to
create and access corresponding CES variables. The calls to fibonacci each declare a
new CES variable fib* of type uint64_t. When these variables are accessed again
in line 16, their type is not needed as with any ordinary CES variable. The output
parameter of the parent fibonacci task, result, is a CES variable and passed by
reference. Thus, add_uint64 can directly write into it without specifying a type.
The syntax described in this subsection originates in the first implementation of
CES, the CES Compiler (CESC) and a hard-coded Sequential Execution System by
Sven Wagner [Wag07]. When Jens Remus generalized the ES interface and added the
Round-robin and Stack Execution Systems, he kept this syntax [Rem08]. The new
Deque ES presented in this thesis keeps all syntax elements shown here and adds some
more, which will be explained later. Before we introduce the new Deque Execution
System in Chapters 3 and 4, we give an overview of the previous execution systems in
the next section.
     1  /**
     2   * Recursive computation of the nth Fibonacci number.
     3   *
     4   * @param[in] n the Fibonacci number to compute.
     5   * @param[out] result the nth Fibonacci number.
     6   */
     7  $fibonacci(uint32_t n;;uint64_t result){
     8      if ($n$ <= 1)
     9          $result$ = $n$;
    10      else {
    11          uint32_t n1 = $n$ - 1;
    12          uint32_t n2 = $n$ - 2;
    13
    14          $parallel fibonacci(uint32_t n1;;uint64_t fib1);$
    15          $parallel fibonacci(uint32_t n2;;uint64_t fib2);$
    16          $add_uint64(fib1, fib2;;result);$
    17      }
    18  }$
Listing 2.9: Task to calculate the Fibonacci numbers recursively

2.2 Previous Execution Systems
As shown in Figure 1.1, before this work started there were three different Execution
Systems (ES), developed by Jens Remus [Rem08]: Sequential ES, Round-robin ES and
Stack ES. The Sequential Execution System is based on a hard-coded version by Sven
Wagner [Wag07] and can execute CES code sequentially on a single processor. The
Round-robin ES executes the work of multiple threads in a round-robin fashion, thereby
avoiding synchronization problems but already introducing the program structure for
multiple threads. Finally, the Stack ES executes tasks using several processors and
constitutes the basis for the Deque Execution System. Since the Sequential ES is not
interesting for parallel computations and the Round-robin Execution System served
merely as an intermediate step toward parallelization, we will only describe the Stack
ES here. Furthermore, we concentrate on the main ideas, those that are relevant for
the new Deque ES, which constitutes the main effort of this thesis.
2.2.1 Tasks in the Stack Execution System
The CES Stack ES is responsible for scheduling and dispatching tasks. For that purpose,
its major data structure is the TASK_FRAME, which contains all the information
concerning a task instance, most importantly pointers to the calling parameters and a
function pointer to the C-function implementing the task execution. This TASK_FRAME
is a general interface for various kinds of tasks and therefore the parameters are untyped
(void *) and the function pointer takes a generic argument. The execution system will
cast this generic task frame into a version with typed parameters (e. g. int *) used
during the task’s execution. From a CES source file, the CES compiler (CESC) creates
C header files containing the typed task frame definitions. Additionally, it generates a
C code file with a C function for each CES task.
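As a rough illustration, such a generic task frame could look as follows in C; the field names below are ours and are not the actual CES definitions.

    /* Sketch of a generic task frame; names are illustrative only. */
    typedef void (*task_function_t)(void *generic_task_frame);

    typedef struct TASK_FRAME {
        task_function_t function;  /* C function implementing the task        */
        unsigned num_in;           /* number of input parameters              */
        unsigned num_inout;        /* number of inout parameters              */
        unsigned num_out;          /* number of output parameters             */
        void **params;             /* untyped references to the CES variables */
    } TASK_FRAME;

    /* The execution system casts this generic frame to the task-specific,
     * typed frame generated by CESC before running the task. */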
For each CES construct in the CES task, the equivalent C function contains macro
calls to the execution system. For example, there are macros to initialize and finalize
the task, to call a new task, to access CES variables and to create storage for new
ones. A complete overview of the macro interface is given in [Rem08, Appendix A].
Listing 2.10 shows the usage of these macros as part of an example, the core of the
Fibonacci C routine output by the CES compiler for the CES task in Listing 2.9.
     1  /**
     2   * Recursive computation of the nth Fibonacci number.
     3   *
     4   * @param[in] n the Fibonacci number to compute.
     5   * @param[out] result the nth Fibonacci number.
     6   */
     7  void fibonacci(RUNTIME_TASK_FUNCTION_ARGUMENTS) {
     8      RUNTIME_TASK_INITIALIZE(fibonacci);
     9
    10      if (RUNTIME_TASK_PARAMIN(1) <= 1)
    11          RUNTIME_TASK_PARAMOUT(1) = RUNTIME_TASK_PARAMIN(1);
    12      else {
    13          uint32_t n1 = RUNTIME_TASK_PARAMIN(1) - 1;
    14          uint32_t n2 = RUNTIME_TASK_PARAMIN(1) - 2;
    15
    16          /* PUSH STORAGE FOR C VARIABLE 'n1' TO FRAME STACK */
    17          RUNTIME_CREATE_STORAGE_CVAR(n1, uint32_t, n1);
    18
    19          /* PUSH STORAGE FOR OUTPUT ARGUMENT 'fib1' TO FRAME STACK */
    20          RUNTIME_CREATE_STORAGE_OUTPUT(fib1, uint64_t);
    21
    22          /* PUSH TASK 'fibonacci' TO CURRENT STACK */
    23          RUNTIME_CREATE_TASK(fibonacci, 0, 1, 1, 0, 1);
    24          RUNTIME_NEWTASK_PARAMIN_REFERENCE(fibonacci, 1) = RUNTIME_STORAGE_REFERENCE(n1);
    25          RUNTIME_NEWTASK_PARAMOUT_REFERENCE(fibonacci, 1) = RUNTIME_STORAGE_REFERENCE(fib1);
    26
    27          /* PUSH STORAGE FOR C VARIABLE 'n2' TO FRAME STACK */
    28          RUNTIME_CREATE_STORAGE_CVAR(n2, uint32_t, n2);
    29
    30          /* PUSH STORAGE FOR OUTPUT ARGUMENT 'fib2' TO FRAME STACK */
    31          RUNTIME_CREATE_STORAGE_OUTPUT(fib2, uint64_t);
    32
    33          /* PUSH TASK 'fibonacci' TO CURRENT STACK */
    34          RUNTIME_CREATE_TASK(fibonacci, 0, 1, 1, 0, 1);
    35          RUNTIME_NEWTASK_PARAMIN_REFERENCE(fibonacci, 1) = RUNTIME_STORAGE_REFERENCE(n2);
    36          RUNTIME_NEWTASK_PARAMOUT_REFERENCE(fibonacci, 1) = RUNTIME_STORAGE_REFERENCE(fib2);
    37
    38          /* PUSH TASK 'add_uint64' TO CURRENT STACK */
    39          RUNTIME_CREATE_TASK(add_uint64, 0, 0, 2, 0, 1);
    40          RUNTIME_NEWTASK_PARAMIN_REFERENCE(add_uint64, 1) = RUNTIME_STORAGE_REFERENCE(fib1);
    41          RUNTIME_NEWTASK_PARAMIN_REFERENCE(add_uint64, 2) = RUNTIME_STORAGE_REFERENCE(fib2);
    42          RUNTIME_NEWTASK_PARAMOUT_REFERENCE(add_uint64, 1) = RUNTIME_TASK_PARAMOUT_REFERENCE(1);
    43      }
    44
    45      /* COPY CURRENT STACK TO FRAME STACK */
    46      RUNTIME_TASK_FINALIZE(fibonacci);
    47  }
Listing 2.10: CES compiler generated C code for the fibonacci task of Listing 2.9
Line 7 starts the C function fibonacci with RUNTIME_TASK_FUNCTION_ARGUMENTS
as its macro parameter. This macro usually expands to the major data structures of
the execution system, data structures that will be accessed by other macros throughout the function. We will explain shortly how they appear in the Stack ES. The
first macro within the function is RUNTIME_TASK_INITIALIZE, which will initialize the
task. It is obvious that this macro can serve very different purposes as well, depending on the execution system. Access to CES variables is translated to the macros
RUNTIME_TASK_PARAMIN and RUNTIME_STORAGE_REFERENCE, depending on where exactly the variable was defined. These macros return the correct variables or their
pointers respectively. Notice that the names of CES parameters do not occur in the
C code, only the CES compiler knows about them. In C, these parameters are just
identified by their type (in/inout/out) and offset. In line 17 the new CES variable
n1 is created and initialized using its C equivalent, line 20 generates a new output
variable with no initialization. Both of them serve as input to the fibonacci task that
is created next. The seemingly magic numbers in line 23 represent, among other things,
the parallel flag and the number of input, inout and output parameters. Afterwards,
the parameters for the new task are initialized by storing references available in the
current task. The comments in the CESC output refer to the Current Stack and Frame
Stack, two major data structures of the Stack Execution System, which we will explain
now.
2.2.2 Data Structures and Their Implications
The Stack Execution System manages all tasks which have been spawned but not yet
executed on the Frame Stack. This data structure is accessed by its owner-thread like
a stack, but other threads may freely search through it, so it is actually a pseudo-stack. Initially, the Frame Stack of the first thread is seeded with the program task.
Afterwards a thread continuously takes the top item from its stack and executes the
task as visible in Figure 2.4. Part (a) shows how spawned child tasks are collected
through repeated push operations to the Current Stack, which is empty at the start of
a task. As depicted in (b), before a task finishes, the Current Stack is moved to the
top of the Frame Stack. Since the memory layouts of the Current and Frame Stack are
opposite, the first-spawned child task is then on top of the Frame Stack and thus the
next task to be executed (Figure 2.4 (c)). This way the correct execution order for one
thread is guaranteed.
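The effect of this finalize step can be sketched as a simple pop-and-push loop; the names below are invented for the sketch (the real macros are listed in [Rem08, Appendix A]), and the actual implementation achieves the reversal through the opposite memory layout described above.

    /* Illustrative stack interface, assumed for this sketch only. */
    typedef struct frame_entry frame_entry_t;   /* a task or storage frame */
    typedef struct stack stack_t;
    int            stack_is_empty(stack_t *s);
    frame_entry_t *stack_pop(stack_t *s);
    void           stack_push(stack_t *s, frame_entry_t *e);

    /* Move the children collected on the Current Stack onto the Frame Stack
     * when a task finishes; popping and re-pushing reverses the spawn order,
     * so the first-spawned child ends up on top and is executed next. */
    static void move_current_to_frame(stack_t *current, stack_t *frame)
    {
        while (!stack_is_empty(current)) {
            frame_entry_t *e = stack_pop(current);   /* last-spawned first        */
            stack_push(frame, e);                    /* first-spawned ends on top */
        }
    }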
But what about the other threads and parallelization? Once a thread’s Frame Stack
is empty, it has no more work to execute from its own data structures. This is also the
case at the beginning of the execution, for all threads except the first one. In this state,
the thread steals a task from another thread’s stack. The principle of distributed data
structures for holding tasks and stealing from other threads once the own structure is
empty is known as work-stealing [BL94, ABB00]. Of course, a thread cannot steal an
arbitrary task since the task might have outstanding dependencies. Here, the parallel
keyword comes into play. Tasks marked parallel get a flag in the Frame Stack. These
tasks have by definition all of their dependencies satisfied: They do not depend on any
other child spawned before them within the same parent, and their parent had all of
its dependencies fulfilled since it has already executed. So a task marked parallel
can be stolen by another thread and execute immediately. The foreign thread starts
its search for parallel tasks at the bottom of the Frame Stack, a concept called
breadth-first work-stealing.
[Figure 2.4: Interplay of Frame Stack and Current Stack. (a) During a task's execution, spawned children are pushed onto the Current Stack; (b) before the task finishes, the Current Stack is moved onto the Frame Stack; (c) the first-spawned child is now on top of the Frame Stack as the next task to be executed.]
Since the stack's owning thread executes from the top,
the threads tend to work on distant parts of the code. Furthermore, tasks on the
bottom tend to be higher in the call hierarchy and thus contain child tasks themselves.
Since child tasks of stolen tasks are pushed to the Frame Stack of the stealing thread,
breadth-first stealing leads to fewer steals and thus less overhead. Of course, stealing
from other threads, preventing them from executing a stolen task and notifying them of
the finished execution needs considerable synchronization efforts. The search through
foreign stacks needs some time and disrupts usual stack semantics.
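Ignoring the synchronization just mentioned, the steal operation can be pictured roughly as follows; this is a hedged sketch with invented names, not the actual Stack ES code.

    /* Sketch of breadth-first stealing in the Stack ES: search the victim's
     * Frame Stack from the bottom upward for a task flagged parallel.
     * All names are illustrative; synchronization is omitted. */
    typedef struct frame {
        int is_task;    /* task frame (1) or storage frame (0) */
        int parallel;   /* spawned with the parallel keyword   */
        int stolen;     /* already claimed by another thread   */
    } frame_t;

    typedef struct frame_stack {
        frame_t **frames;   /* index 0 is the bottom of the stack */
        int size;
    } frame_stack_t;

    frame_t *steal_breadth_first(frame_stack_t *victim)
    {
        for (int i = 0; i < victim->size; i++) {        /* bottom to top  */
            frame_t *f = victim->frames[i];
            if (f->is_task && f->parallel && !f->stolen) {
                f->stolen = 1;                          /* claim the task */
                return f;
            }
        }
        return 0;                                       /* nothing stealable */
    }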
On the other hand, using the stack to keep all tasks, whether they are ready to
execute or have outstanding dependencies, has some advantages in terms of simplicity.
First, the stack preserves the originally specified order of succeeding tasks. Combined
with the parallel keyword to explicitly mark tasks that can run in parallel, this
provides for the correct execution order of tasks and ensures that all dependencies
are satisfied without any additional effort. The second huge advantage is memory
management for CES variables, which is performed by the execution system. In the
CES model for function calls, data is only passed down to child tasks, not returned to
parents. When the stack level falls below that of a certain task, all its children and
grandchildren have completed. Therefore, all CES variables created by a task can be
freed once the stack goes below that task’s level. For that reason, it is convenient to
also put CES variables on the Frame Stack. This is done through a so-called storage
frame, which has the same structure as a task frame but contains a data item (a CES
variable). Figure 2.5 provides an example of the stack development.
[Figure 2.5: Development of the Frame Stack and automatic removal of data items. Five snapshots (a) to (e) of one thread's Frame Stack: Task 1, lying above Task 2 and Task 3, executes, pushes Data Item 1, Data Item 2 and its children Task 1.1 and Task 1.2; once the children have finished, the data items are popped off and Task 2 becomes the next task.]
The next task
to be executed is printed in bold. Initially, Task 1 is scheduled to execute. A task
pushes newly created data items onto the Frame Stack before any child tasks, as in
Figure 2.5 (b). The two child tasks, Task 1.1 and Task 1.2, and their potential children
will probably use the data items to perform their work. Once all children have finished
(Figure 2.5 (d)), the execution system looks for more tasks on the Frame Stack below
their stack level. It will reach the data items and just pop them off the stack until it
finds a task to execute next, in this case Task 2.
2.2.3 Relationship to Cilk and the Deque Execution System
From a user’s point of view, the Stack Execution System is conceptually similar to
MIT Cilk. “Both systems serve the same divide-and-conquer-style applications on
shared-memory multiprocessor computers.” [SBWR08, p. 3] The parallel keyword is
comparable to Cilk’s spawn [FLR98], where the parent routine continues to execute
while the child may be scheduled to other processors. However, there are also significant
differences between the Stack ES and MIT Cilk. For instance, due to the new function
call, CES processes the results of child tasks through other child tasks, whereas Cilk
has an explicit sync statement.
Some of the concepts used in the Stack Execution System will be part of the Deque
ES as well. The CES implementation of a function call remains the same. The Deque
Execution System defines all macros presented here, although they partly serve a
different purpose and some additional macros will be needed. The task frame concept
is kept as the main identifier for a work packet. Multiple data structures to keep the
tasks, usually one per thread, are present in the Deque ES as well, but the nature
of the structure changes quite fundamentally. Finally, the memory management is
completely different.
2.3 Deque ES Concept
The main idea for this thesis is an existing concept for a new CES execution system, the
Deque ES. Its name derives from its major data structure, a double-ended queue. In
contrast to the pseudo-stack of the Stack Execution System, the deque is only accessed
with correct semantics; that is, the deque only permits put and take operations to its
top and bottom. For that reason, foreign threads stealing tasks cannot search through
the data structure anymore, they must get a valid task with a normal deque operation.
Hence, the deque only holds tasks that are ready to be executed. Now all take
operations yield a task that can be stolen. Since the owning thread keeps pushing new
tasks to the top, stealing from the bottom or top results in breadth-first or depth-first
work-stealing, respectively. In order to avoid copying task frames onto the deque once
the task has all of its dependencies fulfilled, the deque just stores pointers to task
frames.
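The choice between the two stealing modes thus reduces to which end of a victim's deque a thief takes from. The following sketch uses invented names rather than the interface of the concurrent deque from [Met09].

    /* Sketch: the owning thread always pushes and pops at the top of its
     * deque; a thief picks the end according to the work-stealing mode. */
    typedef struct TASK_FRAME TASK_FRAME;
    typedef struct deque deque_t;

    void        deque_push_top(deque_t *d, TASK_FRAME *t);  /* owner: new ready task         */
    TASK_FRAME *deque_take_top(deque_t *d);                  /* owner pop / depth-first steal */
    TASK_FRAME *deque_take_bottom(deque_t *d);               /* breadth-first steal           */

    TASK_FRAME *steal(deque_t *victim, int depth_first)
    {
        return depth_first ? deque_take_top(victim)
                           : deque_take_bottom(victim);
    }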
The implications of only allowing ready-to-execute tasks on the deque are quite
extensive. Of course, not all spawned tasks are immediately ready. Still, they must be
kept in memory. Once their dependencies are fulfilled, they must be pushed onto the
deque. But how do we know when those dependencies are fulfilled? In the Stack ES,
the stack implicitly satisfied the dependencies, but this was only possible by holding
all tasks within the stack. In the Deque Execution System, we must analyze these
dependencies and track their fulfillment explicitly. This could lead to some serious
overhead, where there is almost none in the Stack Execution System.
However, the Deque Execution System with dependency analysis can potentially
exploit more parallelism than the Stack ES. The latter relied on the coarse specification
through the parallel keyword. Similarly to Cilk, the parallel keyword only allowed a
binary decision: Either the child task can run at once or it must wait for all the previously
spawned child tasks. This model is appropriate for divide-and-conquer algorithms,
but not very good at providing parallel execution of multiple direct child tasks with
complex interdependencies. In contrast, a full dependency analysis enables the parallel
execution of arbitrary task DAGs. As an example, we present the complex task graph
and corresponding CES program for a Cholesky decomposition in Subsection 3.8.3.
The deque data structure is at the heart of the new Deque ES and is accessed by
multiple threads. Accordingly, the implementation should be thread-safe, but still as
fast as possible. Manuel Metzmann implemented several data structures that can be
accessed concurrently, among them a stack, queue and deque, the latter of which we
will use for the execution system [Met09]. These data structures are optimized for Blue
Gene/Q and therefore concurrent access to them is enormously fast on this platform.
Conveniently, there is also a (slower) x86 implementation, which can be used to easily
test the new execution system.
3 Design of the Deque Execution System
Based on the work presented in Chapter 2, we designed and implemented the Deque
Execution System for CES, which we will explain in detail in this and the following
chapter. Section 3.1 is a broad overview of the new ES design, whereas the rest of this
chapter provides a more in-depth description. Details on the implementation of this
design will be given in Chapter 4.
3.1 Overview
The major responsibility of the execution system is to keep track of the tasks in the
system. As indicated in Section 2.3, the deque as the main management structure only
holds pointers to tasks which are ready to be executed. Therefore, major design issues
include where in memory the tasks are located and how to keep track of those tasks
that have not all of their dependencies fulfilled yet.
Since there are usually multiple tasks ready to be executed and the execution schedule
among those tasks depends on non-deterministic factors like execution speed and work-stealing, the execution system cannot predict a fixed order in which the tasks will
run. And as a task frame’s storage space can be released as soon as the task has
completed, the order for freeing task frames also depends on these run-time factors.
For that reason, managing the actual task frames in a fixed structure like a stack is not
advisable. The alternative we chose was to put each task frame in a separate location
on the heap and to allocate and free its space explicitly at the appropriate times.
Once the tasks are ready to be executed, their pointers are on the deque of a certain
thread. All other tasks have unfulfilled dependencies. A so-called condition task fulfills
a dependency of a so-called dependent task. Once its condition tasks have finished,
the dependent task’s pointer must be pushed onto a deque. A straightforward way
to realize that is to keep the pointer to the dependent task in the task frame of the
condition task. Once the condition task has finished, the execution system will check
the readiness of all depending tasks and put them on the deque if necessary. That is, a
task which is not currently executed can be in one of two states: Either it is a “ready
task” on a deque or it has outstanding condition tasks holding its pointer. In the
dependency graph, the ready tasks are the sources (nodes with in-degree zero), and any other task is reachable
through a path starting at one of them and therefore accessible although no central
data structure knows about it. An example graph is given in Figure 3.1.
[Figure 3.1: A dependency graph with multiple tasks. A source node has its task pointer on the deque; a dependent node's task pointer is known only to its condition tasks.]
When a condition task informs its depending tasks of the delivered data item, it must
determine whether those tasks are ready to be executed. They are ready when there are no
other outstanding dependencies. We track the number of unsatisfied dependencies in
the task frame using a counter variable. When the task is spawned, this number is
initialized to the number of parameters that must be accessed by other tasks before
the task can run. Once a condition task delivers a needed parameter, the counter is
decreased. Should the number of unsatisfied dependencies fall to zero in doing so, the
dependent task’s pointer is put onto the deque of the current thread.
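As a minimal sketch of this mechanism (the structure and function names here are purely illustrative; Deque and pushTop stand in for the concurrent deque and its push operation, and the actual data structures follow in Chapter 4):

typedef struct Task {
    int unsatisfiedDependencies;   /* parameters still to be delivered by other tasks */
    /* ... function pointer, parameters, pointers to dependent tasks ...              */
} Task;

/* Called once per fulfilled dependency on behalf of a finished condition task. */
void dependencyFulfilled(Task *dependent, Deque *myDeque)
{
    dependent->unsatisfiedDependencies -= 1;   /* made atomic later, see Section 3.5      */
    if (dependent->unsatisfiedDependencies == 0)
        pushTop(myDeque, dependent);           /* ready: onto the current thread's deque  */
}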
3.2 Dependency Analysis
In its description of identifying ready-to-be-executed tasks, the overview of Section 3.1
relied on a dependency graph of tasks, which consists of pointers between task frames.
However, the application code is a sequential stream of text and does not provide this
graph in itself. Therefore, the graph must be determined using the source code, a
process we call dependency analysis and describe in this section.
3.2.1 Types of Data Dependencies
In order to analyze the data dependencies between tasks, we should know where
dependencies may occur. There are three types of them to be observed.
Read After Write (RAW) dependencies occur in the usual case of one task producing
a data item and another task consuming it. The producing task must obviously run
before the consuming task, that is, the data item must be read only after it has been
written. Using the terminology of Section 3.1, the producing task is the condition task
and the consuming task is the dependent task.
When a developer intends to save memory, he may reuse variables. For instance,
task A writes a value, which is in turn read by task B. Afterwards task C writes data
to be consumed by task D. Task B must read the value A has written before task C
overwrites it, a dependency called Write After Read (WAR).
Finally, Write After Write (WAW) dependencies occur, when a variable is written
twice without any intermediate read operation. It is essential for subsequent reads that
the last write operation is the one specified last in the program code. Therefore, the
order of two succeeding write tasks must be preserved.
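For illustration, the following sequence of CES calls (with hypothetical task names produce, consume and update) exhibits all three types for the variable x:

$produce(;; int x);$   /* writes x for the first time                                  */
$consume(x;;);$        /* RAW: reads x, must run after produce                         */
$update(;x;);$         /* WAR: overwrites x, must wait for consume; without consume,   */
                       /* the WAW/RAW dependency on produce would order it instead     */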
The only type of true dependencies though is RAW, because WAR and WAW
dependencies can be eliminated through register renaming [SS95]. This technique stores
copies of the specified variables to allow for their original values being overwritten
immediately. Succeeding reads access the copy instead of the original; a “renamed register” is looked up. An example of a system that uses register renaming is SMP
Superscalar, another programming model performing dependency analysis [PBL07].
As the current CES implementation does not use register renaming, we must take care
of all mentioned types of dependencies.
3.2.2 The Dependency Analysis Table
The dependency graph we want to build consists of nodes representing tasks and arcs for
the dependencies between them. Each arc is associated with a data item that constitutes
the data dependency between the two involved tasks. The graph is dynamically created
at run time. In the following, we use the terms input task, inout task and output task
for tasks which have the considered variable as an input, inout or output parameter,
respectively. Input and inout tasks are also referred to as readers, inout and output
tasks as writers. When a new task is called by the program, we must insert its node
into the dependency graph. We will look at the information required in order to do
that.
When a task reads a parameter, it depends on the last subroutine that wrote the
parameter, because that subroutine delivers the desired value (RAW dependency).
When a task writes a parameter, it must wait for all tasks that are interested in the
old value (WAR dependency). If no other task reads the old value, it must wait for the
previous writer to enforce the WAW dependency. All in all, as already stated by the
authors of SMP Superscalar, “only the last writer and the list of readers of the last
definition are required.” [PBL10]
The Dependency Analysis Table (DAT) is a data structure holding exactly those
pieces of information. Importantly, it is only used locally within a task and helps
analyze the dependencies of child tasks. For each CES variable in the current task, the
DAT provides access to the writing child task that was called last and all subsequent
readers. Indeed, the whole dependency analysis regards the calling of tasks, not their
actual execution. Hence, the “last writer” is the last called, not the last executed task
accessing the variable.
On a lower level, the DAT is built up as follows. It implements a map or dictionary
interface, delivering for each local CES variable a pointer to the frame of the task that
last wrote it. The task frame of the last writer constitutes the head of a linked list, all
remaining nodes of which are subsequent input tasks. If there is no last writer, e. g.
because the current task directly passes on one of its parameters to an input task, this
input task is the head of the list. We will now explain how the DAT is used to perform
the dependency analysis.
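As a point of reference for that explanation, a single DAT entry can be pictured roughly as follows (a conceptual sketch with assumed names; the actual implementation embeds this information in the task frames, see Chapter 4):

typedef struct DatNode {
    struct Task    *task;   /* a child task accessing the CES variable            */
    struct DatNode *next;   /* next subsequent input task                         */
} DatNode;

typedef struct DatEntry {
    void    *dataItem;      /* address of the CES variable: the map key           */
    int      available;     /* parameter of the current task, accessible at once  */
    DatNode *head;          /* the last writer, followed by its subsequent input  */
                            /* tasks; an input task if there is no last writer    */
} DatEntry;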
3.2.3 The Dependency Analysis Algorithm
We already outlined that a task delivering a data item holds pointers to dependent
tasks. When it finishes, the dependent tasks are notified about the delivered data item
and possibly put on a deque. As these pointers are part of the dependency graph, they
must be installed during the dependency analysis. Since such a pointer serves to find
the dependent task later on to possibly put it on the deque and enable its execution,
we call the process of installing the pointer callback registration.
The execution system performs the dependency analysis at the end of a task. In the
task execution before, all calls to new tasks have saved pointers to the new task frames
on the so-called Current Child List. This is similar to the old Stack Execution System’s
Current Stack, but the old version saved the actual data, whereas we just store pointers
as our task frames are on the heap. At the beginning of the analysis, the parent task’s
parameters are marked as available in the Dependency Analysis Table; they may be
read or written immediately depending on their type. After all, the current task is
executing at the moment and can therefore access its parameters as specified. The
execution system now loops over the Current Child List in the original calling order
and analyzes the dependencies.
When we reach a new child task, all its parameters are looked up in the Dependency
Analysis Table. The DAT delivers the linked list with the last writer and subsequent
readers. If the new task is an input task, it registers a callback with the last writer. If
however it is an inout or output task, it depends on all last readers (WAR dependency).
Conceptually, the new task registers callbacks with all of them. In fact, the process is
slightly more complicated for implementation reasons, which will be explained later.
Provided that there are no last readers, the new inout or output task depends on the
last writer as well (RAW/WAW) and registers a callback. In any case, each registered
callback increases the new task’s counter for its unsatisfied dependencies by one (see
Section 3.1). Parameters marked available in the DAT do not contribute a dependency:
no callback is registered and the number of unsatisfied dependencies stays the same.
Finally, the new task itself is recorded in the Dependency Analysis Table. Input
tasks are appended to the list, writers supersede the record of the previous writer and
subsequent input tasks.
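In terms of the DAT entry sketched in Subsection 3.2.2, the analysis of a single parameter of a newly called child task proceeds roughly as follows (registerCallback and datRecord are assumed helper names; the case without a last writer is omitted for brevity):

/* Analyze one parameter of a newly called child task; childIsReader tells     */
/* whether the child has it as an input parameter.                             */
void analyzeParameter(DatEntry *e, Task *child, int childIsReader)
{
    if (e->available) {
        /* parameter of the parent task: no dependency, no callback            */
    } else if (childIsReader) {
        registerCallback(e->head->task, child);   /* RAW: wait for the last writer     */
        child->unsatisfiedDependencies += 1;
    } else if (e->head->next != NULL) {           /* inout/output with last readers    */
        for (DatNode *r = e->head->next; r != NULL; r = r->next) {
            registerCallback(r->task, child);     /* WAR: wait for every last reader   */
            child->unsatisfiedDependencies += 1;
        }
    } else {
        registerCallback(e->head->task, child);   /* RAW/WAW: wait for the last writer */
        child->unsatisfiedDependencies += 1;
    }
    datRecord(e, child, childIsReader);  /* child becomes a new reader or the new writer */
}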
Figure 3.2 visualizes a common case. Writer 1 delivers a data item, which is consumed
by Input Tasks 1 through 3. Afterwards, Writer 2 has the same data item either as an
inout or as an output parameter. In any case, it may only run after all input tasks
have finished. Solid arrows represent the direct task dependencies for this data item,
whereas dashed arrows show the DAT pointer structure before Writer 2 is added to
the dependency graph.
In order to differentiate complete DAGs showing all dependencies between multiple
tasks as in Figure 3.1 from pictures illustrating the registered callbacks and dependencies
for just one parameter as in Figure 3.2, we represent tasks in the former context as
circles and in the latter context as rectangles.
Figure 3.2: Data dependencies and the DAT pointer structure of several writing and reading tasks for a single data item (Writer 1, Input Tasks 1 to 3 and Writer 2, with the DAT linked list before Writer 2 is added)
3.3 Notification of Dependent Tasks
When a task finishes, it must inform all dependent tasks about the fulfillment of their
dependencies, a process we call notification. These dependencies are either data items
it produces or a completed read operation on a data item which will be written by
the subsequent task. However, the condition task itself is not necessarily the one
that actually uses or produces the data item; this might be done by a child task. In
this case, the dependent task must also wait for the child task to finish. We already
illustrated this process in Figure 2.3. Here, we will describe how it is performed on a
simplified, conceptual level, neglecting implementation details until Section 4.2. After
the user-defined code of a task has executed and the dependencies have been analyzed,
the execution system goes through all parameters of the task again. It thereby notifies
dependent tasks and passes on dependencies from the parent to the child tasks that
actually access the respective parameter. In that process, we must distinguish input
parameters from inout and output parameters.
Figure 3.3 shows what happens for a pure input parameter and only considers the
dependencies of this specific data item, which is read by the input tasks 1 through
3. The previous corresponding writer task has already executed and they form the
subsequent input tasks that are now ready to run, unless they depend on another parameter.

Figure 3.3: Input Task 3 notifying its dependent task Writer and integrating its children into the task graph (Input Tasks 1 to 3 with their Child Tasks 1 to 3, and the subsequent Writer)

The user defined code of Input Task 3 and the following dependency
analysis have just finished and created three child tasks accessing the data item at hand.
As the original parameter was an input parameter, these tasks must be input tasks.
They already form a linked list in the Dependency Analysis Table (see Subsection 3.2.2),
but are not yet connected to any other tasks outside Input Task 3. Now the execution
system traverses this list and registers a callback from each of them to the following
Writer task. The Writer’s unsatisfied dependency counter is increased for each of the
input tasks, since they are now additional dependencies of Writer. Afterwards, this
number is decremented by one because the condition task Input Task 3 itself has
finished. This order is important as the task is put on the deque when the number of
unsatisfied dependencies drops to zero. If there are no child tasks, no new callback is
registered and the number of unsatisfied dependencies effectively decreases by one.
For inout and output parameters, the behavior is different. Both types are treated
equally here, because it is only important that they write to the parameter. The most
general case, with both input tasks and writers as children, is illustrated in Figure 3.4.
Writer 1 has just finished its execution and dependency analysis, and the execution
system resolves one of its inout or output parameters. The DAT of Writer 1 provides
access to the relevant children, the last writer and all subsequent input tasks, through
a linked list (dashed arrows). Since the input children depend on the value written
by Last Writer Child, they must run before Writer 2, so appropriate callbacks are
registered. Moreover, the last writing action of Writer 1 is actually performed by
Last Writer Child, so all input tasks following Writer 1 depend on Last Writer Child.
Conceptually, corresponding callbacks are registered; we will explain what actually happens
in Section 4.2. Again, for each new callback, the number of unsatisfied dependencies is
increased, and afterwards the counter for all tasks depending on Writer 1 is decreased
by one.

Figure 3.4: Writer 1 notifying dependent tasks and integrating its children into the task graph (Writer 1 with its Last Writer Child and Input Children 1 to 3, the outer Input Tasks 1 to 3 and Writer 2)

There are some special cases: When there are only input children for an inout parameter, the outer input tasks 1 through 3 do not get a new dependency. Instead,
their number of unsatisfied dependencies will effectively be decremented and they will
be put on the deque if it reaches zero. These actions are even the only ones taken if
there are no child tasks at all, because Writer 2 does not get new dependencies in this
case.
After handling all the parameters of a completed task, the execution system checks
the number of unsatisfied dependencies for all child tasks in the Current Child List.
If this counter is zero for a task, it gets pushed onto the deque. This procedure cannot be performed until all parameters have been handled, because tasks on the deque
might be scheduled to run, complete their work and try to notify their dependent tasks.
These tasks, however, are only registered in the parameter handling phase we described
in this section.
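In C-like terms, this final step can be sketched as follows (currentChildList, numChildren and the exact arguments of putOnDeque are assumed names for this sketch):

/* Only after all parameters have been handled are ready children released. */
for (int i = 0; i < numChildren; i++) {
    Task *child = currentChildList[i];
    if (child->unsatisfiedDependencies == 0)
        putOnDeque(myDeque, child);   /* no outstanding dependencies: ready to run */
}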
3.4 Scheduling and Work-Stealing
Work-Stealing is the basic technique used to distribute tasks to multiple threads of
execution in CES. Normally, we have one software thread per hardware thread. Each
thread has its own deque to store tasks on, so that different threads only rarely interfere
with each other. A thread performs a depth-first execution of its deque, i. e. it puts
newly created tasks on top and fetches tasks to execute from the top as well. Therefore,
when executing tasks from the local deque, successive tasks tend to operate on similar
data and benefit from caching mechanisms. Furthermore, depth-first execution prefers
the completion of one top-level task over already starting the execution of other top-level
tasks and thus reduces the number of tasks in the system.
When the local deque is empty, a thread steals work from a foreign deque, either
from the top (depth-first work-stealing) or bottom (breadth-first work-stealing). We
already explained the theoretical advantages of the two alternatives in Section 1.2. To
simplify comparisons of both types, we made switching between them very easy: The
default is breadth-first work-stealing, but when the compiler flag -DDF_WS is given, the
Deque ES performs depth-first work-stealing.
Moreover, if multiple threads share a cache, they might benefit from sharing a deque
as well. For example, in a multi-threaded processor core, the threads share the L1
cache. Therefore, in CES the number of threads per deque is adjustable through the
compiler option -DTHREADS_PER_DEQUE. Continuing the example, the hardware threads
of a multithreaded processor core may use a single deque.
Since there is usually a hierarchy of caches with different access times, it might be
beneficial to steal in a hierarchical way as well. In our example, we would first try
to steal from the deque of other hardware threads in the same core. If we find a
task there, we might find some data of the task in the shared L1 cache. Only if we
do not find a task on any deque used within the same core do we look at the deques
of threads outside our core. This behavior can be enabled with the compiler option
-DHIERARCHICAL_STEALING.
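For example, a build for a processor whose cores run two hardware threads each might combine these options as follows (shown here as the equivalent preprocessor view; the value 2 is only an example):

#define DF_WS                   /* steal from the top of foreign deques (depth-first)          */
#define THREADS_PER_DEQUE 2     /* the two hardware threads of a core share one deque          */
#define HIERARCHICAL_STEALING   /* search deques within the same core before stealing globally */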
3.5 Synchronization
In a CES execution, each hardware thread is running a POSIX thread. The POSIX
threads are lightweight threads sharing a common address space. For shared data,
we must prevent conflicts where multiple threads access the same storage locations.
Otherwise, the machine instructions of multiple C statements from different threads
might be intermingled and thus fail to execute as expected by the programmer.
The instruments we use to prevent conflicts are architecture-dependent atomic operations. These cannot be interrupted and thus may be used to access shared data, even if
other threads might do just that at the same time. A huge advantage of using atomic
operations instead of higher-level concepts like mutual exclusion through semaphores
or monitors is that they do not block and therefore do not slow down the program.
Still, to find the points where atomic instead of conventional operations are needed, we must identify accesses to shared data. In CES, the most obvious point where multiple threads interfere is work-stealing, i. e. taking tasks from a foreign deque.
This part is handled by the concurrent deque implementation. The dependency graph
of tasks is also modified by the execution system in multiple threads. We already
described how this graph is modified: Each task creates a sub-graph of its children
during the dependency analysis. At that point in time, just the parent task knows
about these children and no other task can interfere. Only at the very end of the
analysis is the sub-graph connected to the global graph, a process we described in
Section 3.3. As visible in Figures 3.3 and 3.4, new arcs always originate in child tasks
and thus the corresponding pointers are located in the sub-graph, which cannot be
accessed by other threads yet. The removed arcs are part of the currently executed
task. As all of its dependencies are already fulfilled and it has been taken from the
deque, no other thread can access it either. All in all, the dependency graph is not
vulnerable to concurrent access.
However, the dependent task of a new arc has its counter for unsatisfied dependencies
increased. As this counter accumulates the dependencies for all parameters and multiple
tasks might fulfill different dependencies at the same time, access to this variable must
be atomic. Furthermore, we often decrease the counter and put the task on a deque
if it reaches zero. When, for instance, the counter is initially two, and two tasks
simultaneously execute that part of the code, the decrease operations may overlap and
both succeeding read operations would yield zero. Hence, the task would be put on a
deque twice. Therefore, decreasing the counter and reading its value is performed by
an atomic FetchAndDecrement operation.
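In code, the decrement therefore looks roughly like this (field and function names as used later in Chapter 4; the sketch ignores the remaining finalization steps):

/* FetchAndDecrement returns the value before the decrement, so exactly one  */
/* notifying thread observes the transition from 1 to 0 and pushes the task. */
if (FetchAndDecrement(&dependentTask->unsatisfiedDependencies) == 1)
    putOnDeque(myDeque, dependentTask);   /* last dependency fulfilled: task is ready */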
Another point we should be aware of is that concurrent memory allocations from
the heap might conflict. The operating system would need to coordinate them, which
could affect the speed of memory allocations by multiple threads.
When the user defined code in a task uses shared data, exclusive access is guaranteed
through the execution system. After all, enforcing these data access dependencies is
the major concern for the Deque ES.
A final issue concerning synchronization is how we detect that all tasks are finished.
As there are multiple distributed deques rather than a single data structure holding all
tasks, we cannot easily determine if there are tasks in the system. Also, we would not
gain much if we could query all deques simultaneously. There might be no task on any
deque but other threads currently executing some tasks that will spawn children. To
solve this problem, we use a global counter variable to track the number of tasks in
the system. Details of how that might affect the performance and how to avoid race
conditions will follow in Section 4.4.
3.6 Memory Management for Data Items
In the Stack Execution System, CES variables are kept in data frames, which are
located on the stack just like task frames. We illustrated that process and also the
resulting benefits for deallocating the data frames again in Section 2.2. Since the Deque
ES replaced the stack with a deque holding only ready-to-be-executed tasks, we must
devise a new way to store the data frames.
As described in Section 3.1, the execution system cannot predict when a task is
ready to be executed. Similarly, the execution system cannot predict when a data item
will not be needed anymore. This is because the necessary lifetime of data
items is directly coupled to the tasks using them. Therefore, a data structure with
fixed access patterns is not reasonable and we store the data items on the heap. Unlike
in the Stack ES, in the Deque ES data items do not share the outer structure and size
of a task frame (see Figure 2.2.2) but just consist of their plain data type.
CES parameter variables are allocated in the parent subroutine, before their reference
is passed to child tasks. The execution system performs this allocation synchronously,
when the execution reaches the task call (in CES code) or variable declaration (in
intermediate C code). However, since the variable will be used by subsequent tasks,
we cannot deallocate it in the parent routine but must wait until all tasks accessing it
have finished.
For that purpose, we would need to analyze the access patterns to CES variables,
similar to the dependency analysis described earlier for parameters passed between
tasks. Thus, we can use the dependency analysis to also schedule the release of data
items at the appropriate times by encapsulating the release procedure into a special
task we refer to as a Free Task. It only takes one inout parameter and releases its
storage location. These Free Tasks are appended to the Current Child List at the end
of a task but before the dependency analysis, for all variables which have been allocated
in that task. Since all child tasks using these variables have already been called then,
the Free Tasks are scheduled to run as the last tasks accessing their variables. Beyond
the calling order, Free Tasks are handled like any other task by the ES. In particular,
the Free Task’s special actions are totally transparent to the scheduler. Tasks not
descending from the current subroutine are unaware of variables allocated therein.
Hence, they do not care whether a Free Task runs before or after them.
3.7 Manual Encoding of Task Dependencies
In many divide-and-conquer-style applications using arrays, the dependencies for
recursive calls are often quite simple and entirely clear to the programmer. The Deque
ES can correctly handle array dependencies if they are properly encoded by hand, a
technique similar to the Stack ES dependency handling. Admittedly, this partly undoes one of the benefits of the Deque ES, namely that the programmer does not need to think
about what can run in parallel. On the other hand, it permits certain uses of arrays
and thus enables some applications. In other cases, there are no data dependencies
between tasks, but the programmer still wants them to run in a fixed order, a situation
which can also be solved by manually encoding dependencies.
As an example for arrays, we will examine a core part of a merge sort implementation
in CES shown in Listing 3.1. A specialty of the algorithm is that the temporary
copy merge sort needs is created just once; on each level, the copy and the original are
swapped. The implementation uses arrays and pointers through the parameters src and
dst. In order to enable both mergesort child tasks to run in parallel, these parameters
cannot be directly passed to both of them as this would cause the Deque ES to put
them in serial order. Therefore, we create new pointers right_src and right_dst in
lines 15 and 16 for the second task, which also serves to directly include the correct
offset from the start of the array. As the inout parameters of both mergesort tasks are
now distinct, the tasks can run in parallel. The subsequent merge subroutine takes both
values written by the first (src, dst) and by the second mergesort task (right_src).
Hence, it runs after both mergesort children. Thus, the actual dependencies of the
tasks must be expressed as “superficial” dependencies of the parameters that are passed
to the subroutines. Recall that the parallel keyword is not used in the Deque ES
but is kept for compatibility with the other execution systems.
 1  /**
 2   * Recursive mergesort task
 3   * @param[in] n the number of elements.
 4   * @param[in] size the size of an element (result of sizeof()).
 5   * @param[in] compare the pointer to the element comparison function of type compare_t.
 6   * @param[in,out] src the copy of the array to sort.
 7   * @param[in,out] dst the array to sort. The sorted result will be stored here.
 8   */
 9  $mergesort(size_t n, size_t size, compare_t compare; ptr_t src, ptr_t dst;){
10      if ($n$ <= 1) {
11          /* array has zero or one element(s) and is sorted by default */
12      } else {
13          size_t nleft = $n$ / 2;
14          size_t nright = $n$ - nleft;
15          ptr_t right_src = $src$ + nleft * $size$;
16          ptr_t right_dst = $dst$ + nleft * $size$;
17
18          $parallel mergesort(size_t nleft, size, compare; dst, src;);$
19          $parallel mergesort(size_t nright, size, compare; ptr_t right_dst, ptr_t right_src;);$
20          $merge(size_t nleft, ptr_t right_src, size_t nright, size, compare; src, dst;);$
21      }
22  }$
Listing 3.1: Recursive part of a merge sort algorithm, based on [Rem08]
Beyond arrays, the manual encoding of dependencies can also be used to enforce
execution order for tasks that do not depend on each other. A very common example is
measuring the running time of an algorithm. The usual method is getting a time value
at the beginning and subtracting it from the time value at the end, so as to get the
time span in between. In CES, these timing procedures might be implemented in the
tasks start_timing and stop_timing. Listing 3.2 tries to use these tasks to measure
the running time of the subroutine algorithm. However, the data dependencies of the
timing tasks are completely distinct from those of the worker tasks. Therefore, the
timing tasks could run successively at the beginning or end of the program and thus
print an absurdly short running time.
$read(;;<type1> input_value);$
$start_timing(;;clock_t time);$
$algorithm(input_value;;<type2> result);$
$stop_timing(time;;);$
$print(result;;);$
Listing 3.2: Worker and timing tasks with independent data flows
In the revised program shown in Listing 3.3, the timing routines additionally take
data items of the worker tasks as dummies. In this case, the programmer wants all
five tasks to run sequentially. The fulfillment of this requirement is easily verified
by checking that for every two successive tasks there is a data item written by the first and read by the second task. Needless to say, in general one can use all dependency types mentioned in Subsection 3.2.1 to manually enforce a running order.
$read(;;<type1> input_value);$
$start_timing(;input_value;clock_t time);$
$algorithm(input_value;;<type2> result);$
$stop_timing(time;result;);$
$print(result;;);$
Listing 3.3: Worker and timing tasks with encoded artificial dependencies
3.8 Additional Array Support by the Execution System
3.8.1 Overview
While the approach for arrays presented in the previous section is very flexible, it also
requires the programmer to think about dependencies, a process that becomes hard for
complex task graphs. Furthermore, it demands more parameter passing than necessary
for the plain algorithm. Therefore, we would like to offer a more natural way to use
arrays in CES.
The arguably biggest problem with arrays is to identify in the dependency analysis
which parts of the array will be read or written. Algorithms operating on arrays
often pass around just one pointer, regardless of which parts of the array are actually
accessed. Furthermore, arrays are often accessed through pointer arithmetic, and
participating “iterator pointers” would need to be mapped to the original structure.
It is even worse with pointer usage in general, as pointers are a very versatile tool and
guessing their intended use is seemingly impossible. A pointer might be the root of
a tree and the whole tree needs to be locked, or it is just part of a structure with
references to other structures and not used at all in the current context. Hence, devising
a solution covering all use cases of arrays or even pointers is very tough.
We therefore offer a way to use arrays in CES, which is reasonable for some use
cases. One can declare a potentially multidimensional array, the elements of which are
treated like individual CES variables during the dependency analysis. Their storage
locations are individually allocated and individually freed when they are not needed
anymore. One can naturally access the single elements in the task declaring the array
and also pass them to child tasks individually. However, passing the whole array as a
parameter is not possible as the memory for the elements is not contiguous, the price
we pay for individual tracking of dependencies.
The array elements are not necessarily primitive types; they might be pointers to
manually allocated arrays. This way we can track the dependencies for larger blocks of
data. We will show a use case for this feature in Subsection 3.8.3. But prior to that we
introduce the actual syntax used for arrays in CES.
3.8.2 Syntax
In the CES syntax for the Stack ES, the only way to declare a CES variable is to call a
subroutine and use a C variable to initialize one of its parameters. As this involves
copying and doing so for arrays is expensive, the Deque ES introduces a new way to
directly declare a CES array without initializing it. The syntax is, apart from enclosing
dollar signs, equal to a normal C array declaration and is given in line 1 of Listing 3.4.
Brackets are to be taken literally in this context; the number of dimensions is not
limited to two; two is just an example. Since it was easy to implement, the same direct
declaration is possible for a single variable as shown in line 2. Hence, we do not need
an extra local C variable for initialization purposes anymore. However, for consistency
with the old declaration method, the type of the variable still has to be given when
passing it on to a child task; the same is true for arrays, as will be visible shortly.
1  $<type> <variable name>[<size of dimension 1>][<size of dimension 2>];$
2  $<type> <variable name>;$
Listing 3.4: Directly declaring CES arrays and single CES variables
Local access to CES arrays is also very elegant: one just needs to enclose a normal
array access in dollar signs as in Listing 3.5. The indices can be specified with constants,
variables or expressions with parentheses and the four basic arithmetic operations.
What follows in the next line is the handing over of an array element to a child task.
As already mentioned, the base type of the array must be given; furthermore we need
the index of the element to be handed over.
Since only individual elements are passed to subroutines, there is no change at all
for the called task. It is not even possible to detect from within the task whether it
was called with an array element or single variable as a parameter.
$array[i][j+k]$ = 42;
$print(int array[i][j+k];;);$
Listing 3.5: Accessing CES arrays and passing on elements to a child task
3.8.3 Use Case: Algorithms on Blocked Data
When operating on large input data, this data can often be partitioned into multiple
blocks. If this blocking happens at the root level and does not need to be recursively
repeated as in divide-and-conquer algorithms, we can easily employ the new array
support features of CES to track the dependencies. For recursive blocking this is still
possible, but each internal node of the task tree would need to split the block further.
This is because the incoming block is a single CES variable passed to a task as a
parameter and must be split up into multiple variables to allow for individual tracking
of dependencies.
Linear algebra is a major application field for blocked algorithms [DK99, JK02,
GJ07, BLKD07], some of which are not recursive. For example, Kurzak et al. present
non-recursive implementations of Cholesky factorization, QR factorization and LU
factorization in [KLDB09]. As a proof of concept, we adapted to CES the implementation of Cholesky decomposition that comes with the distribution of SMP Superscalar
2.3 [SMP10]. The program concentrates on building the block structure and spawning
worker tasks and uses a C implementation [CBL10] of the Basic Linear Algebra Subprograms [LHKK79, DDCHH88, DDCHD90] to perform the actual decomposition of the
blocks. However, a valid scheduling order is important so as to obtain correct results.
Therefore, the example serves well to test the proper tracking of array dependencies.
Listing 3.6 shows the core part of the algorithm. The original implementation for SMP
Superscalar can be found in [PBL08]. This paper is also the origin of the corresponding
task graph in Figure 3.5, which illustrates the complex dependencies even for a small
input size. Without the new CES array support one would have to encoding these
dependencies by hand, which is hardly possible. In the graph, numbers show the
sequential execution order, whereas colors indicate the different task types.
for (long j = 0; j < DIM; j++) {
    for (long k = 0; k < j; k++)
        for (long i = j+1; i < DIM; i++) {
            $ces_sgemm_tile(long BS, float_ptr A[i][k], float_ptr A[j][k]; float_ptr A[i][j];);$
        }
    for (long i = 0; i < j; i++) {
        $ces_ssyrk_tile(long BS, float_ptr A[j][i]; float_ptr A[j][j];);$
    }
    $ces_spotrf_tile(long BS; float_ptr A[j][j];);$
    for (long i = j+1; i < DIM; i++) {
        $ces_strsm_tile(long BS, float_ptr A[j][j]; float_ptr A[i][j];);$
    }
}
Listing 3.6: CES implementation of Cholesky decomposition (cf. [PBL08, Fig. 4])
Figure 3.5: Task graph for 6 by 6 block Cholesky decomposition, figure from [PBL08]
4 Implementation of the Deque Execution
System
This chapter explains some implementation details of the Deque Execution System and
refers to the actual code where appropriate. In Sections 4.1 to 4.5 we present various
aspects of the basic implementation, whereas Section 4.6 details some individual
improvements to increase the speed of the execution. The final Section 4.7 describes
how we implemented the additional array support.
4.1 Data Structures for Dependency Analysis and Task
Notification
As already mentioned in Subsection 3.2.2, the dependency analysis table (DAT) provides
a map interface. For simplicity, we initially implemented it using a linked list of key-value pairs, a structure that would soon be replaced (see Subsection 4.6.1). The lookup key is the address of a data item and the corresponding value is a pair containing
the task frame of the subroutine that last wrote to that data item and the index or
offset of the parameter within the task frame. As we will see shortly, this offset is
necessary to find the correct entry for the linked list that contains all subsequent input
tasks.
The dependency analysis builds up a graph of the current task’s children. The
nodes are task frames and the arcs are pointers between them. Listing 4.1 shows
the TASK_FRAME structure. It contains a pointer fnptr_task to the corresponding C
function (line 2), possibly the function name (line 12) and the number of input, inout
and output parameters (lines 3 to 5). These numbers are necessary as all parameters
are held in one fixed-sized array of pointers (parameter). Therefore, the maximum
number of parameters is still 25, as in the Stack Execution System.
Of special interest for the dependency analysis are the remaining fields of the
structure. The integer unsatisfiedDependencies determines whether a task is ready
to run (line 6). Its type is either uint32_t or uint64_t because it is accessed through
architecture-dependent atomic operations that operate on the native word length.
To represent the linked list for the dependency analysis and the arcs of the resulting
dependency graph, there are two arrays of pointers to other task frames and one array
of offsets. All of them have the same length as the parameter list because each slot
of the arrays directly corresponds to the parameter at the same offset, the parameter
representing the data item associated with the dependency. A major problem is that a
data item can be consumed by arbitrarily many input tasks and as the writer delivering
 1  typedef struct TASK_FRAME {
 2      void (*fnptr_task)(struct TASK_FRAME * /* my task frame */, struct TASK_FRAME ** /* current child list */, DEQUE * /* my deque */);
 3      unsigned char in;      /**< Number of input parameters  */
 4      unsigned char inout;   /**< Number of inout parameters  */
 5      unsigned char out;     /**< Number of output parameters */
 6      NATIVE_UINT unsatisfiedDependencies;
 7      void * parameter[ARG_SIZE];
 8      struct TASK_FRAME * toBeNotified[ARG_SIZE];
 9      unsigned char notificationListOffset[ARG_SIZE];
10      struct TASK_FRAME * nextWriteNotification[ARG_SIZE];
11  #ifdef CES_DEBUG
12      char * name; /**< The name of the task (function name) */
13  #endif
14  } TASK_FRAME;
Listing 4.1: The TASK_FRAME structure
that item would need to notify them all, it would also need to hold arbitrarily many
pointers. Since we do not have enough space for that in a task frame, the writer’s
task frame only holds the pointer to the first input task to be notified, the remaining
input tasks are part of a linked list just as during the dependency analysis. In fact,
the linked list built during the dependency analysis is never destroyed, but directly
becomes part of the task graph. The array toBeNotified contains, for the writer,
the first input task of the linked list and, for the input tasks, the next node in the
linked list. Moreover, the same data item might occur at different parameter offsets
for different input tasks, for instance, it is input parameter 2 for task 1, but input
parameter 3 for task 2. Since the linked list is tied to the data item, the offset of the
parameter in the next task frame of the linked list is saved in the current task frame’s
integer array notificationListOffset. When the input tasks have consumed their
data item, they in turn must notify the next writer. Naturally, this pointer would
be held in toBeNotified, but this slot is already used by the linked list. Therefore,
while that linked list is needed, the pointer to the next writer is saved in the previous
writer’s nextWriteNotification array. When the writer notifies the input tasks, it
sets their toBeNotified pointer to its nextWriteNotification. As the linked list is
thereby destroyed, we call this process unwinding of the linked list.

Figure 4.1: Data dependencies and the DAG pointer structure of several writing and reading tasks for a single data item (Writer 1, Input Tasks 1 to 3 and Writer 2; the original pointer structure including nextWriteNotification, and the new pointers after Writer 1 has finished)

Figure 4.1 is a
modified version of Figure 3.2, now not as part of the local DAT but as part of the
global dependency graph. Compared to Figure 3.2, the pointer structure of the DAT’s
linked list is kept, but we actually have an additional nextWriteNotification pointer
to complement the original pointer structure (dashed arrows). When Writer 1 has run,
the linked list is unwound and all dashed arrows disappear. Instead, there are new
pointers installed, directly from the input tasks to the next writer (dotted arrows).
This happens when Writer 1 notifies its dependent tasks and traverses their list anyway.
The notification process is detailed in the following section.
4.2 Notification of Dependent Tasks
The conceptual notification model we described in Section 3.3 was directed towards
the actual dependencies and which tasks would need to be notified, when a certain
task finishes. However, it could not take into account the actual pointer structure we
explained in Section 4.1. We will connect the conceptual model to the pointer structure
here, thereby detailing how the notification mechanism is actually implemented.
For input parameters, Figure 3.3 (p. 23) is quite accurate. Since Input Task 3
cannot have any writing children, all nodes of the DAT linked list are input tasks. This
linked list of child tasks is unwound before Input Task 3 notifies Writer. Notice, that
unwinding in the previous section referred to the global task graph and a writer task
notifying its dependent input tasks, whereas here, we unwind the local list of child
tasks. Still, both processes are almost equal, apart from the local child tasks not having
their dependency counter decreased, and therefore the processes are implemented in
the same function unwindLinkedList.
For inout and output parameters, Figure 4.2 shows how the handing over of dependencies to child tasks is implemented. As this deviates from the conceptual model, it
might be interesting to compare it to Figure 3.4 (p. 24). The original input tasks 1
through 3 depended on Writer 1 producing a data item that is actually delivered by
Last Writer Child. In order to make Last Writer Child inform them after it finishes,
we append the original list of input tasks to the list of child input tasks. Technically,
we only append Input Task 1 with the rest of the list following automatically. Hence,
adding one child to the input task list and appending two lists is the same process,
and performed by a function whose name is inspired by the conceptual procedure,
registerCallback. Furthermore, both the original and new input tasks must run
before Writer 2 (see Figure 3.4). Since they still have their pointer slots filled with the
linked list’s “next” pointers, the dependent Writer 2 must be kept by Last Writer Child
as explained in the previous section. Therefore, the parent’s nextWriteNotification
pointer is handed over to Last Writer Child, the task that will adapt the toBeNotified
array of the input children once it has run.
What were mere special cases in the conceptual model now result in a fundamentally
different behavior. If there is no writer among the child tasks, the parent itself delivers
the data item and must inform all dependent tasks about it. Dependent subroutines
are not only located in the list of original subsequent input tasks, but also in the (possibly empty) list of newly spawned input children.

Figure 4.2: Writer 1 integrating its children into the task graph (Writer 1 with Last Writer Child and Input Children 1 to 3, the outer Input Tasks 1 to 3 and Writer 2, connected through nextWriteNotification pointers)

Therefore, the execution system
unwinds both lists, partly decreasing the tasks’ unsatisfiedDependencies (for tasks
in the global graph), directing their toBeNotified pointers toward the next writer and
putting them on a deque unless they still depend on other parameters.
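The following sketch summarizes the unwinding of the reader list of one parameter once the delivering writer has finished. It is illustrative, not the actual unwindLinkedList; in particular, it ignores the distinction between newly spawned children and tasks already in the global graph, and the arguments of putOnDeque are assumed:

void unwindReaderList(TASK_FRAME *writer, unsigned char slot, DEQUE *myDeque)
{
    TASK_FRAME *nextWriter = writer->nextWriteNotification[slot];
    TASK_FRAME *reader     = writer->toBeNotified[slot];
    unsigned char offset   = writer->notificationListOffset[slot];

    while (reader != NULL) {
        TASK_FRAME   *next    = reader->toBeNotified[offset];
        unsigned char nextOff = reader->notificationListOffset[offset];

        reader->toBeNotified[offset] = nextWriter;   /* reader now notifies the next writer */

        /* the writer's delivery fulfills one dependency of the reader */
        if (FetchAndDecrement(&reader->unsatisfiedDependencies) == 1)
            putOnDeque(myDeque, reader);

        reader = next;
        offset = nextOff;
    }
}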
4.3 Scheduling and Work Stealing
Listing 4.2 shows the main loop of our basic scheduling algorithm. As long as there
are tasks on our own deque, we keep executing them (lines 5 and 6). When we run out
of tasks, work stealing from other deques begins. We start our circular search on the
next deque (line 9) and keep searching until we have found a task or reached our own
deque again (line 10). Depending on the compiler option -DDF_WS, we either take a
task from the top or bottom of the foreign deque (lines 11 to 15). Line 18 advances
the circular search. When we find a task, we stop searching and execute it (line 21).
Afterwards we try to find tasks on our local deque again, since the executed task has
hopefully spawned children, which are pushed to the local deque. The main execution
loop ends when there are no more tasks in the system; we detail the global variable
readyTasks in the next section.
As we explained, in the work-stealing phase the Deque ES performs a circular search
through all deques starting with the deque next to the local one. That is, although
shown to be efficient [BL94], we do not steal from a random deque. We tried that with
the rand() function from the C library. The implementation uses mutual exclusion to
ensure thread-safety and hence slowed down the execution tremendously. It might be
worth trying different libraries for random number generation, but as our focus is on dependency analysis and depth-first work-stealing, we stuck with the scheme explained
above.
 1  int myDequeId=...;
 2  int workPacketFound, response, stealFromId;
 3  while (readyTasks > 0) {
 4      /* fetch work from our own deque */
 5      while (SUCCESS == takeTopWD(&readyTaskDeques[myDequeId], &(currentTask.int64)))
 6          runTask(currentTask.fnptr.taskFrame, myTd, &readyTaskDeques[myDequeId]);
 7      /* no more work on our deque, steal from other deques */
 8      workPacketFound = 0;
 9      stealFromId = (myDequeId + 1) % CES_DEQUES;
10      while (!workPacketFound && stealFromId != myDequeId) {
11  #ifdef DF_WS
12          response = takeTopWD(&readyTaskDeques[stealFromId], &(currentTask.int64));
13  #else
14          response = takeBottomWD(&readyTaskDeques[stealFromId], &(currentTask.int64));
15  #endif
16          if (response == SUCCESS)
17              workPacketFound = 1;
18          stealFromId = (stealFromId + 1) % CES_DEQUES;
19      }
20      if (workPacketFound)
21          runTask(currentTask.fnptr.taskFrame, myTd, &readyTaskDeques[myDequeId]);
22  }
Listing 4.2: Basic scheduling algorithm
The hierarchical work-stealing option we described in Section 3.4 adds another search loop
between the local (lines 5 and 6) and global (lines 10 to 19) search. In the new loop,
the Deque ES tries to find a task on a deque assigned to the same core but a different
hardware thread. If successful, it executes a task from there, benefiting from a shared
L1 cache. Otherwise, we continue with the global search.
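A sketch of that additional loop, inserted between the local fetch and the global search of Listing 4.2 (DEQUES_PER_CORE and the contiguous mapping of deque ids to cores are assumptions of this sketch, not the actual code):

#ifdef HIERARCHICAL_STEALING
    int firstOnCore = (myDequeId / DEQUES_PER_CORE) * DEQUES_PER_CORE;
    for (int id = firstOnCore; id < firstOnCore + DEQUES_PER_CORE && !workPacketFound; id++) {
        if (id == myDequeId)
            continue;                 /* the local deque is already empty               */
#ifdef DF_WS
        response = takeTopWD(&readyTaskDeques[id], &(currentTask.int64));
#else
        response = takeBottomWD(&readyTaskDeques[id], &(currentTask.int64));
#endif
        if (response == SUCCESS)
            workPacketFound = 1;      /* execute it and profit from the shared L1 cache */
    }
#endif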
4.4 Synchronization
We explored which data structures are accessed by multiple threads in the design
chapter. The deque is no concern for us, since it handles concurrent access itself. The
task graph’s edges are, even when considering the real pointer structure presented in
Section 4.2, not vulnerable to concurrent access for the reasons explained in Section 3.5.
Memory (de)allocation is handled through thread-safe implementations of malloc
and free, a fact that has performance implications discussed in Subsection 4.6.3 but
removes the need for explicit synchronization. What we must take care of is concurrent
access to each task’s counter for the number of unsatisfied dependencies and to the
global counter for the number of tasks in the system.
For both issues, we use atomic operations. Depending on the use case, we employ
either AtomicIncrement and AtomicDecrement where we just need to change but not
read the value, or otherwise FetchAndDecrement. The latter atomically reads the
old value of a variable and afterwards decrements it. These operations are provided
through libraries for both x86 and Blue Gene.
In order to know when the worker threads executing tasks on the deque can exit
and allow the program to end, we must check if all tasks have been executed. When
there are some which have not been executed yet, at least one of them is ready to be
executed. Therefore, we use the global variable readyTasks to represent the number
of tasks that are either currently in execution or located on a deque, i. e. ready to be
executed. The counter is incremented in the function putOnDeque, where we put a task
on a deque. It is decremented again, when a task finishes. Naturally, it is important
to increase the counter for new and ready child tasks or tasks whose dependencies
have been fulfilled before the task performing that action finishes and decreases the
counter again. Otherwise, the counter could fall to zero early and some of our worker
threads would stop executing. Hence, AtomicDecrement(&readyTasks) is the very
last statement of any task, implemented as the last command in the expansion of the
RUNTIME_TASK_FINALIZE macro. With both processes, increasing and decreasing, in
one task which is executed by one thread, there is no risk of race conditions. As the
atomic operations are very fast on Blue Gene, concurrent access to the single global
variable does not constitute a speed bottleneck. Furthermore, considering the whole
dependency analysis and the execution of user defined code, changing the counter is
only a small part of the execution procedure. In contrast to the modification statements,
reading the value of readyTasks is performed non-atomically. This is possible because
we only want to know whether it is larger than zero, and once it reaches zero, it never rises again. This access method might give us some stale values when a cache has not
been invalidated yet, so the worker threads would run a little longer at the end. On
the other hand, the non-atomicity yields performance benefits for the huge number of
reads during the execution.
The second use case for atomic operations is a task’s number of unsatisfied dependencies. Non-atomic operations are sufficient for newly created child tasks that are not
part of the global dependency graph yet. But when we unwind a list of child tasks and
the next writer gets multiple new dependencies, we must use AtomicIncrement since
other tasks might do the same for another parameter simultaneously. The variable
is decremented when a dependency gets fulfilled. As any dependency might be the
last one preventing a move to the deque, we always use FetchAndDecrement and check
whether the old value was one, i. e. the task is now ready and must be pushed onto
a deque. Again, the order is important. Unwinding of child tasks happens before
we decrement the value for the parent task, which has now finished. Otherwise the
dependent task might be pushed onto a deque, although it depends on a newly created
child task.
4.5 Memory Management for Data Items
Recall the memory management design from Section 3.6. Storage space for data items
is allocated by the parent routine, but special Free Tasks are responsible for releasing
it again. Those Free Tasks are virtually inserted into the code by the execution system
as the last inout subroutines of a task, and afterwards their dependencies are analyzed
as with any other task. The scheduler is unaware of the special nature of the Free
Tasks and dispatches them as usual.
void freeTask(RUNTIME_TASK_FUNCTION_ARGUMENTS) {
    free(myTaskFrame->parameter[0]);
    AtomicDecrement(&readyTasks);
}
Listing 4.3: Implementation of the Free Task
In order to know for which data items we must insert a Free Task, all allocations of
new CES variables in the current task are recorded in an array called storageTracker.
This array is only needed locally. The macros RUNTIME_CREATE_STORAGE_CVAR and
RUNTIME_CREATE_STORAGE_OUTPUT create new entries. Just before the dependency
analysis, the execution system traverses the array and creates a Free Task for each of
the data items.
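That traversal can be pictured as follows (allocateTaskFrame, appendToCurrentChildList and numTrackedItems are assumed names introduced for this sketch; only the TASK_FRAME fields are genuine):

for (int i = 0; i < numTrackedItems; i++) {
    TASK_FRAME *ft = allocateTaskFrame();
    ft->fnptr_task = freeTask;               /* the hand-written Free Task (Listing 4.3) */
    ft->in = 0; ft->inout = 1; ft->out = 0;  /* exactly one inout parameter              */
    ft->parameter[0] = storageTracker[i];    /* the data item to be released             */
    appendToCurrentChildList(ft);            /* dependency-analyzed like any other child */
}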
The Free Task has some properties that allow us to drastically cut down the overheads
of the Deque Execution System. The plain C implementation, which is manually written
instead of generated by the compiler, is shown in Listing 4.3. It simply releases the
location of the first parameter. Notice the direct pointer access instead of using a
macro as in compiler-generated code, which enables us to get the reference we need for
free instead of the dereferenced pointer value. The only other activity performed is to
decrease the number of tasks in the system before the freeTask finishes.
If you compare that bare implementation with normal compiler-generated C files
like the one in Listing 2.10, you will find the following calls to the ES removed. The
task is not initialized, i. e. no dependency analysis table is created. There is no task
call or assignment of parameter references, simply because the task has no children.
Finally, and this saves the biggest overhead, there is no finalization besides decrementing the readyTasks counter. The normal procedure would include the following parts:
• calling of a Free Task for each data item in the storageTracker,
• dependency analysis for all child tasks (here, we do not have any child tasks),
• notification of dependent tasks (here, no task depends on us, since freeTask is
the last to access the data item) and
• handing over of dependencies to child tasks (here, we have neither dependencies
nor child tasks),
none of which are needed within a Free Task. With each data item corresponding
to one Free Task, half of all subroutines might easily be of this type. For that
reason, the lightweight Free Task we described here significantly contributes to the
performance of the Deque ES.
4.6 Speed Improvements
Until now, we described a basic version of the Deque ES as originally implemented.
This section highlights some important changes in an optimized implementation to
increase the execution speed.
4.6.1 Using Single Variables for the Dependency Analysis Table
The original implementation of the Dependency Analysis Table used a linked list for
simplicity. Searching for entries and inserting new ones thus needed time in O(n),
where n is the length of the list. As the DAT is very frequently used, we wanted to
speed up the access. Our first approach was using the hash table implementation of
GLib [GTK10], however, this slowed down the execution of common cases even further.
Instead of investigating the library or implementing our own hash map, we chose a
rather radical way that promised even better performance (not only big-oh-wise, but
also with minimal constant factors).
The optimized implementation stores the pointer to the head of the DAT list not in
a central DAT structure for the whole task, but in dedicated single variables. Each
such variable is connected to the CES variable it tracks through a name convention.
Specifically, the CES variable foo is tracked by the local C variable cesLastAccess_foo.
When this name is known, reading the value is directly possible, in O(1) and with
minimal overheads and no lookup at all. In the original implementation, the dependency
analysis was completely performed by the execution system, which used the memory
addresses of data items as an entry point to the DAT. When we rely on the variable names, however, the compiler must take responsibility for parts of the dependency analysis, as the ES does not know anything about variable names. We will detail this point shortly.
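As a minimal illustration of that convention (the surrounding code is simplified and not literal compiler output; the exact type of the entry point is abbreviated here):

double *foo = malloc(sizeof(double));    /* storage for the CES variable foo          */
TASK_FRAME *cesLastAccess_foo = NULL;    /* dedicated DAT entry point: head of foo's  */
                                         /* last-writer/reader list, still empty      */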
In the original implementation, the Deque Execution System performed the following
actions at the end of a task: A storageTracker array, which had recorded all newly
allocated CES variables, was traversed and a Free Task scheduled on the Current Child
List for each of them. Then, the ES looped over the Current Child List, performing the
dependency analysis and building the dependency graph by inserting the corresponding pointers.
Afterwards, the parent task’s parameters were traversed in order to connect the child
graph to the global graph by handing over dependencies and to notify dependent tasks
about the fulfillment of their dependency. Finally, ready child tasks were put on the
deque of the current thread.
In the optimized implementation, a result of using variable name identifiers is that a
dependency analysis based on the Current Child List, which only holds task frames
through pointer addresses, is not possible anymore. We could either record even more
information or perform the dependency analysis directly within the code execution.
The latter solution required substantial changes to the compiler to insert additional
macros in the intermediate C code. On the other hand, we would need to store less
information and save a few loops in the task finalization, which promised performance
benefits. Hence, we chose the solution involving the compiler changes.
In the original implementation, we could access the DAT in the whole function.
The single entry point cesLastAccess_*, however, is declared at the same time as the
corresponding CES variable. Since it is located on the stack for simplicity and speed,
we can only access it in the scope in which it was declared. Therefore, appending all
Free Tasks at the end of the subroutine is not always possible, as the DAT variable
could have been popped off the stack already. Moreover, the CES variable cannot be
accessed under its name after leaving the declaration scope, i. e. we can actually insert
the Free Task earlier. Hence, the compiler was extended to recognize and track C
scopes. It records those scopes and declarations of new CES variables on a stack. Just
before the scope is left, Free Tasks for all declarations within the scope are inserted
into the intermediate C code with the new macro RUNTIME_CREATE_FREETASK. When
the execution system hits these Free Tasks, the corresponding DAT entry point is still
accessible, so they can be dependency-analyzed.
The dependency analysis itself happens after the task creation through the new
macros RUNTIME_PARAMACCESS_<type>, where <type> is one of IN, INOUT and OUT. In
the original implementation, the ES looped over all parameters at the end of the parent
task and acted according to the type. In the optimized implementation, the compiler
inserts these macros for each parameter and the macro expansion performs the analysis
immediately (registering callbacks, recording the new DAT entry, and so on). The differences
for the parameter types have been explained in Subsection 3.2.3.
Dependencies for parameters of the parent task are also tracked in cesLastAccess_* variables, which are allocated at the beginning of a task. For that purpose, the compiler inserts RUNTIME_PARAM_INITIALIZE macros. As the root scope of a task stays open until its end, we can incorporate the last writers and readers of those parameters in the notification process of subsequent tasks. Since we need the variable names to read the individual DAT entry points and the execution system knows nothing about them, the compiler inserts RUNTIME_HANDLE_<type>_CALLBACK macros at the end of the task.
Besides decrementing the readyTasks counter, the only remaining responsibility of the RUNTIME_TASK_FINALIZE macro is to put ready child tasks onto the deque, as this step must wait until after the notification. The storageTracker and the task-wide Dependency Analysis Table (i. e. the map itself) are now obsolete and have been removed from the source code. All in all, these changes yielded a speedup of over 30 percent for an application like Fibonacci (see Subsection 5.3.1), which performs little computation within each task.
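Putting these pieces together, the macro placement in a generated task body roughly follows the shape sketched below. This is a simplified, non-compilable illustration: the actual macro argument lists, the task-creation code and the concrete <type> variants (IN, INOUT, OUT) depend on the compiler output and on the parameter directions.
int some_task(/* task frame with parameter a (in), ... */)
{
    RUNTIME_PARAM_INITIALIZE(a)       /* set up cesLastAccess_a for the parameter a    */
    {
        /* CES source: $int x;$ -> storage and cesLastAccess_x for x                   */

        /* child task call reading a and writing x: create the child task frame, then  */
        RUNTIME_PARAMACCESS_IN(a)     /* analyze the dependency on a immediately       */
        RUNTIME_PARAMACCESS_OUT(x)    /* record the child as the new last writer of x  */

        RUNTIME_CREATE_FREETASK(x)    /* inserted just before the scope of x closes    */
    }
    RUNTIME_HANDLE_IN_CALLBACK(a)     /* hand over / notify dependents of parameter a  */
    RUNTIME_TASK_FINALIZE             /* decrement readyTasks, push ready child tasks  */
}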
4.6.2 Avoiding O(n) Operations on Callback Lists
In the Dependency Analysis Table, all input tasks called after the last writer of a
variable are kept in a linked list. Inserting new items at the back of this linked list requires a number of steps linear in the list length. We could insert single items at the front, but
when we connect two lists as in Figure 4.2, we still need to access one of their end nodes.
In order to avoid traversing the whole list before we can insert, we save a pointer to the
end of the list. We use the current parameter’s slot in the nextWriteNotification
array for that purpose, as shown in Figure 4.3(a). For the very last writer of a CES
variable, this pointer indicates the end of the list of subsequent input tasks. When
the next writer is added to the dependency graph, nextWriteNotification fulfills its
original purpose, to keep a reference to the next writer of the parameter (Figure 4.3(b)).
As new input tasks are now appended to Writer 2 rather than Writer 1, we do not
need the shortcut to the end of the list anymore.
[Diagram, panels (a) and (b): the DAT entry, Writer and Input Task nodes of the reader list, and the nextWriteNotification pointer temporarily marking the list end]
Figure 4.3: Temporary usage of nextWriteNotification as a pointer to the end of
the reader list
However, when we want to append a node at the end of the list, we must know
the parameter offset for the last existing node, as this is where the list will continue.
These offsets are saved in the notificationListOffset array, but only for the next
node, not for the newly introduced shortcut to the list end. Therefore, we introduce
nextWriteOffset, a new field in the TASK_FRAME structure, as a place to store the
parameter offset at the last node of the list. Admittedly, it has nothing to do with the “next writer”, but the name reflects its association with nextWriteNotification, even if only in its temporary usage.
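The underlying technique is the standard constant-time append to a singly linked list via an extra pointer to its end. The following self-contained sketch illustrates the idea with a hypothetical node type; it is not the actual DAT or TASK_FRAME layout.
#include <stddef.h>

/* Hypothetical reader-list node; the real list links task frames and offsets. */
typedef struct ReaderNode {
    struct ReaderNode *next;
} ReaderNode;

typedef struct {
    ReaderNode *head;   /* first input task after the last writer                */
    ReaderNode *tail;   /* shortcut to the list end, the role temporarily played */
                        /* by nextWriteNotification (plus nextWriteOffset)       */
} ReaderList;

/* Append a new reader in O(1) instead of walking the whole list. */
static void reader_list_append(ReaderList *list, ReaderNode *node)
{
    node->next = NULL;
    if (list->tail != NULL)
        list->tail->next = node;
    else
        list->head = node;
    list->tail = node;
}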
We noticed the problem addressed in this subsection in a test program that calls a very large number of successive input child tasks within one parent task. The execution of this program was unexpectedly slow in the original implementation of the Deque ES. The changes described above solved the issue in the optimized implementation.
4.6.3 Using Free Pools for Task and Data Frames
The Deque Execution System puts both tasks and CES variables on the heap. For
each called task and each declared variable, we allocate and release memory at the
appropriate times as explained in Sections 3.1 and 3.6. Naturally, this leads to many calls to malloc and free from different threads. We discovered that memory allocations considerably slowed down application execution as the number of threads increased, with a serious impact on the scaling performance of the Deque ES. We suspect that this is the result of the operating system coordinating concurrent allocations.
As user programs might request space of arbitrary size, finding free blocks is not easy for the operating system. However, the memory needs of CES are quite uniform: we either allocate a task frame of fixed size or a variable. For variables, CES always has pass-by-reference semantics. Therefore, we can restrict ourselves to variables with a maximum size of 64 bits. When larger structures are needed, they can be manually allocated.
Since our memory allocations have only two distinct sizes (sizeof(TASK_FRAME)
and 64 bits for data items), we can use two free pools to speed them up. A free pool is a data structure holding pointers to allocated memory blocks that are currently not in use. When the execution system requests a new block, we take one from the free pool and thereby save a memory allocation. Only when the free pool is empty is the operating system asked for a new block. To refill the free pool, memory blocks that are no longer in use are not released immediately but stored in the free pool.
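As an illustration of the principle, the following is a minimal, non-concurrent free-pool sketch for blocks of one fixed size. The capacity and the names are arbitrary; the real free pools in the Deque ES are concurrent and shared between threads, as discussed below.
#include <stdlib.h>

/* Minimal sketch of a free pool for blocks of one fixed size. */
typedef struct {
    void  *blocks[1024];   /* pointers to currently unused memory blocks */
    int    count;
    size_t blockSize;      /* e.g. sizeof(TASK_FRAME) or 8 bytes of data */
} FreePool;

static void *pool_acquire(FreePool *pool)
{
    if (pool->count > 0)
        return pool->blocks[--pool->count];  /* reuse a recycled block      */
    return malloc(pool->blockSize);          /* pool empty: really allocate */
}

static void pool_release(FreePool *pool, void *block)
{
    if (pool->count < 1024)
        pool->blocks[pool->count++] = block; /* keep the block for reuse    */
    else
        free(block);                         /* pool full: really release   */
}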
Still, multiple threads might allocate memory simultaneously. Either each thread
has its own free pools, or we use a concurrent implementation. The former solution
obviously needs more memory, since fluctuations cannot be balanced between threads.
For that reason, and with the availability of our fast, concurrent deque implementation
in mind, we chose the latter solution. As a result, the Deque ES is much more scalable
than before. While the original implementation scaled to only about four threads, the
improved implementation scales to about 32 threads. Depending on the application,
the SMT capabilities limit a further increase beyond 16 or 32 threads (see Section 5.2).
4.6.4 Scheduling According to Hardware Threads
When multiple threads share a deque or when we use hierarchical work-stealing (see
Section 3.4), it is important to know the hardware thread a POSIX thread runs on.
Otherwise, hardware threads from different cores might share a deque or might be in the preferred group for hierarchical stealing; such threads cannot take advantage of a shared L1 cache.
Our first approach was to use the POSIX setaffinity functions, which bind a POSIX thread to a CPU ID. Unfortunately, the IDs used by POSIX did not reflect
the actual hardware architecture of Blue Gene/Q. As we therefore could not restrict a
POSIX thread to a certain hardware thread, we implemented a hand-crafted solution.
The Blue Gene/Q environment provides functions to get the physical ID of the
current core and hardware thread. In order to run only on threads with a certain
combination of these two numbers, we spawn as many POSIX threads as there are
hardware threads and ensure each hardware thread is actually running. Then we quit
the threads we don’t want to run and start the actual work on the remaining ones.
Which threads will run can be configured through a mix of compile-time (-DCES_THREADS, -DTHREADS_PER_DEQUE) and run-time options (the number of cores to run on). From that
we can infer the number of threads per core and control whether a thread terminates
or starts to work, as shown in Listing 4.4. The code is executed by each POSIX thread
before the main execution loop (see Section 4.3).
const int kernelCoreId = Kernel_ProcessorCoreID();
const int kernelThreadId = Kernel_ProcessorThreadID();
L2_Barrier(&enoughPThreadsRunning, 64); // ensure all hw threads are running
if (kernelCoreId >= numCores || kernelThreadId >= numThreadsPerCore)
    return EXIT_SUCCESS;
printf("(%2d, %d) running\n", kernelCoreId, kernelThreadId);
Listing 4.4: Code to quit POSIX threads we don’t want to run
4.7 Array Support
The array support of the Deque Execution System introduces new language features, so the CES compiler had to be adapted. We do not describe the straightforward changes to the compiler front-end. Instead, we focus on the interesting aspects of the code generation phase.
int n,m;
...
$int array[n][m];$
Listing 4.5: An example of a CES array declaration
The generated C code contains three new macro calls to the Deque ES. The macro
RUNTIME_CREATE_CES_ARRAY initializes a newly declared array by creating a C array
to store the pointers to the individual CES array elements. As all array elements
are dependency-tracked individually, we must allocate a separate block of memory
and initialize a DAT entry for each of them. Therefore, the second new macro
RUNTIME_CREATE_CES_ARRAY_PART is called for each individual array element. Declarations of single variables are translated to the RUNTIME_CREATE_CES_VARIABLE macro,
which allocates their storage space and initializes their DAT entry.
Initializing all array elements and creating a Free Task for each of them requires
multiple similar macro calls. As it is only known at run time how many elements must
be initialized, the CES compiler creates nested for loops, one loop for each dimension
of the array. As an example, CESC transforms the CES code in Listing 4.5 into the
intermediate C code in Listing 4.6. The additional braces are an easy way to prevent
clashes of variable names for multiple array declarations. As usual, the Free Tasks are
called directly before the array runs out of scope. The names of the loop variables currently limit arrays to 18 dimensions, which should be enough for most use cases; the limit could easily be increased.
With arithmetic expressions as indices or size specifiers of arrays, the CES compiler
and the runtime macros just copy the expression to other places, e. g. the upper
boundary of the for loop. For simplicity, evaluating the expression is left entirely to the C compiler. Given the advanced optimization capabilities of modern compilers, this
decision should not significantly affect the performance of the resulting program.
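As a hypothetical illustration (the identifier buf is ours, not taken from an existing CES program), a declaration such as
$int buf[2*n+1];$
would simply carry the expression over into the generated loop bound, analogously to Listing 4.6:
for (ces_i = 0; ces_i < 2*n+1; ++ces_i) { ... }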
int n,m;
...
RUNTIME_CREATE_CES_ARRAY(array, [n][m], int)
{
    int ces_i;
    for (ces_i = 0; ces_i < n; ++ces_i) {
        int ces_j;
        for (ces_j = 0; ces_j < m; ++ces_j) {
            RUNTIME_CREATE_CES_ARRAY_PART(array[ces_i][ces_j],int)
        }
    }
}
...
{
    int ces_i;
    for (ces_i = 0; ces_i < n; ++ces_i) {
        int ces_j;
        for (ces_j = 0; ces_j < m; ++ces_j) {
            RUNTIME_CREATE_FREETASK(array[ces_i][ces_j])
        }
    }
}
Listing 4.6: CESC output for the declaration of Listing 4.5
5 Performance Comparisons
5.1 Goals
The Deque Execution System was designed to be very flexible regarding the various
work-stealing modes. We can easily switch between depth-first, breadth-first and
hierarchical work-stealing, and also vary the number of threads sharing a deque (see
Section 3.4). In this chapter, we demonstrate differences between these modes using
several CES applications.
In-depth comparisons to the previous execution systems or other implementations
like SMPSs would also be interesting. However, the Deque ES makes heavy use of
the deque library by Manuel Metzmann [Met09], which is specifically optimized for
Blue Gene. For this and other reasons, comparisons on x86 machines would require
additional effort. The previous execution systems make use of x86 atomic primitives,
so running them on Blue Gene is not possible either without porting them to the
new platform. The same is true for other parallel programming environments with
dependency analysis like SMP Superscalar. These porting efforts are out of scope for
this thesis, and hence in-depth comparisons are left for future work.
In order to roughly examine the performance of our full dependency analysis with
nested parallelism, we provide a brief comparison on x86 hardware. Therein, we
contrast the Deque ES with the Stack ES and SMPSs.
5.2 Test Configuration
Most measurements of this chapter were taken on one compute node of a Blue Gene/Q
System [Fel11]. It is driven by a single BG/Q processor chip with 16 A2 cores, each
at 1.6 GHz and using four-way simultaneous multithreading (SMT). This provides 64
hardware threads. Each core has access to 32 KB of L1 cache, 16 KB for data and
16 KB for instructions. The shared L2 cache has a size of 32 MB. The node has access
to 4 GB of DDR3 RAM. The programming environment on Blue Gene/Q for CES is the Compute Node Kernel (CNK) Environment; the compiler is the GCC-based cross-compiler for Blue Gene/Q.
Notably, we should not expect to see a four-fold increase in performance when relying on the SMT capabilities of a single core. Therefore, when increasing the number of software threads, we first distribute them to multiple cores. That is, we only increase the number of threads per core when all cores are busy. Hence, a decrease in performance scaling is to be expected when increasing the number of threads from 16 to 32 or from 32 to 64.
For our brief comparison to other parallel environments, we used a Lenovo ThinkPad
T61p. It is equipped with an Intel Core 2 Duo T7700 processor at 2.4 GHz with 4 MB of L2 cache. The system provided 4 GB of memory. We used GCC 4.4.3 to compile for x86. All tests on this system ran on two threads.
For all measurements, we started each configuration 25 times and show the median
results here.
5.3 CES Applications Used
In this section, we will briefly describe the origin and intention of different CES
programs we used to test the Deque ES. These programs will reappear when we focus
on different aspects of the execution system in Section 5.4.
5.3.1 Recursive CES Applications
Jens Remus provided an in-depth performance analysis of the Stack Execution System
in [Rem08, Chapter 4]. Since the Cilk-like execution style of the Stack ES is mainly
intended for divide-and-conquer algorithms, he used several classic recursive computational problems for his tests. As these small applications already existed as CES
programs, we reused them in the Deque ES.
The CES programs had to be slightly adapted to work with the Deque Execution
System. We encoded dependencies by hand as described in Section 3.7 for recursive
algorithms working on arrays. Furthermore, we had to enforce the constraint of only
passing values of at most 64 bits (see Subsection 4.6.3). In the only case where
changes were necessary, we switched to allocating the necessary structures on the heap.
We used the following recursive programs to test the Deque ES. The original code is
available in [Rem08, Chapter 4]. Our minor changes are of a technical nature, so we
refrain from printing the code again.
• The recursive calculation of the Fibonacci numbers. If the input value is at most
1, we return 1 for the base case. Otherwise, we spawn two tasks in parallel to
calculate the previous two Fibonacci numbers. Afterwards, a third task adds
up their results and delivers the desired value. In this application, the tasks are
very fine-grained and even for small input values we get a lot of tasks. [Rem08,
Section 4.3.1]
• A simple merge sort implementation. The recursive mergesort task is shown
in Section 3.7. The algorithm uses only one temporary array as also explained
in Section 3.7. The recursive calls to mergesort are spawned in parallel. The
merging step however is serial and thus limits the available parallelism. [Rem08,
Section 4.5.1].
• A modified version of the above merge sort implementation with adjustable task
granularity. When the length of the input array of a certain call level in the
execution drops below the input parameter MIN_TASK_SIZE, we switch to serial
execution. The sorting algorithm stays the same, but all recursive calls are then
normal, synchronous function calls instead of task calls. Therefore, the parameter
determines the minimal task size.
• The calculation of the Mandelbrot set. The divide-and-conquer-based implementation provides excellent parallelization opportunities since all pixels of the resulting
bitmap can be computed independently. However, as multiple threads write to the
same memory area, they might disturb each other’s caching capabilities. [Rem08,
Section 4.9]
5.3.2 Cholesky Decomposition
The Cholesky decomposition is an important method in linear algebra. It is used, for example, for the numerical solution of linear systems of equations or in Monte Carlo simulation.
The CES implementation is an adaption of the Cholesky program that comes with
SMPSs 2.3 [SMP10]. We did not change anything about the algorithm and only
translated it to CES. The program splits up the matrix into tiles and then uses the
Basic Linear Algebra Subprograms [LHKK79, DDCHH88, DDCHD90] to perform the
actual decomposition.
As the Cholesky decomposition uses the array support of the Deque ES, we already
presented the code and an example task graph in Subsection 3.8.3. In the performance
evaluation, it is an example of a non-recursive algorithm operating on a large data set.
The performance results for Cholesky decomposition are given in MFlops based on a
calculation in the original SMPSs program.
5.3.3 Sweep 2D
Sweep 2D was inspired by its 3D counterpart, which “solves a three-dimensional neutron
transport problem from a scattering source.” [PFF+ 07] Our adaption of the model uses
a 2D grid of cells as shown in Figure 5.1. Each cell represents a task. Incoming edges
represent the data dependencies of a task, outgoing edges connect it to dependent
tasks. As a result of these dependencies, the first task to run is in the upper-left corner and the final task is in the lower-right corner. In between, the execution schedule depends on the timing of individual tasks, since usually multiple tasks are ready to execute.
Assuming equal execution time for each task and an infinite number of processors, the
execution would spread like a diagonal wave front.
We use Sweep 2D to visualize which tasks run on which processor core in different
work-stealing modes and with our new scheduling according to hardware threads (see
Subsection 4.6.4). For each task, we save the ID of the core it has run on. We then
visualize the grid and color each cell according to the saved ID. To achieve good caching performance, one would strive for larger areas of the same color rather than a wild mix.
Our implementation first passes one parameter down and then one parameter to the
right. Hence, it is more likely to find columns than rows of the same color.
[Figure: 2D grid of tasks with diagonal wave fronts wave_i, wave_i+1, wave_i+2]
Figure 5.1: Dependencies in Sweep 2D (inspired by [PFF+ 07, Fig.1])
5.4 Results
5.4.1 Scaling of Work-Stealing Modes and Shared Deques
This section compares five different combinations of work-stealing modes and shared or
non-shared deques with different numbers of threads on BG/Q. Recall, that each thread
operates on the top of its own deque. When multiple threads share a deque, they all
treat it as their own deque and thus all of them push to and pop from its top. In our
shared-deque configuration, all threads on a single core share one deque. That is, until
16 threads are running, there is no difference to the non-shared configuration, as each
core runs only one thread. With 32 threads running, each deque is used by two threads;
with 64 threads running, each deque is used by four threads. Work-stealing refers to popping from a deque other than the thread’s own. In breadth-first work-stealing, the bottom
of the foreign deque is popped, whereas in depth-first work-stealing the top of the
foreign deque is popped. There are four combinations of shared and non-shared deques
with these two work-stealing modes. The last configuration is hierarchical work-stealing
and always uses non-shared deques. In hierarchical work-stealing, a thread with an
empty deque first tries to steal from other threads on the same core, i. e. it tries to pop
the remaining three deques assigned to its core. Only if this fails, the thread looks for
work on the deques of threads outside its own core. Shared deques do not make sense
for hierarchical work-stealing since this would eliminate the first stealing hierarchy.
When several of these five configurations show no visible difference in the graph, we show only one curve and indicate the configurations in the key accordingly. Where we show the speed increase on the y-axis, the values are relative to running the program on the Deque ES with a single thread and breadth-first work-stealing. However, since there are no other threads to steal from in that case, it makes little difference which work-stealing mode serves as the baseline.
The first two experiments we present run the Fibonacci program with n = 31 as
input and the merge sort program sorting four million random numbers. The results
are shown in Figures 5.2 and 5.3 respectively.
[Plot: speed increase over the number of hardware threads for breadth-first or depth-first WS with non-shared deques, hierarchical WS with non-shared deques, and breadth-first or depth-first WS with shared deques]
Figure 5.2: Performance of the Fibonacci numbers calculation
The Fibonacci program scales quite well until running on eight threads, where we
get a 6.4-fold speed increase. The increase declines at 16 threads, perhaps due to the
very small task granularity and memory allocation overheads. The increase declines
again when the number of threads reaches 64, presumably due to the exhaustion of the
cores’ SMT capabilities.
The merge sort program shows a similar development, but generally scales worse
than the Fibonacci program. This is probably due to the serial part in the merge step.
Furthermore, all cores operate on the same data set, as they sort the same array. With
work-stealing, multiple cores will sometimes sort nearby parts of the array and thus
presumably disturb each other’s L1 caching. When the number of threads increases to 32 and 64, the SMT capabilities again limit further scaling.
[Plot: speed increase over the number of hardware threads for breadth-first or depth-first WS with non-shared deques, hierarchical WS with non-shared deques, and breadth-first or depth-first WS with shared deques]
Figure 5.3: Performance of the merge sort algorithm
In both programs, all work-stealing modes show almost identical results. This suggests that there is not much work-stealing. As both recursive applications start with rather large, high-level tasks, which are then distributed to different threads, the threads might be well load-balanced and hence might need only a few additional steals.
We also suspect that the dependency analysis, which contributes a good part of the
executed code, washes out some differences between different work-stealing modes.
To our knowledge, all previous analyses of different work-stealing schedulers were
conducted on Cilk-like systems which run almost no additional code besides the actual
application.
With shared deques, the performance of 64 running threads is slightly worse than
with non-shared deques. This is probably a result of very small task granularities: In
the Fibonacci program, all tasks are small; in merge sort, there are far more small tasks than large ones. With such small tasks, the deques are frequently accessed, and as shared deques imply fewer deques, we might run into contention earlier. Moreover, shared deques should lead to L1 caching benefits for the data, but atomic access to the deque itself always needs to go through the L2 cache as a result of the BG/Q architecture. As
the data is mostly small in these applications, data caching benefits are seemingly
outweighed by other factors.
The results of running the Mandelbrot program (input parameters: -2.25 0.75 -1.25
1.25 800 2000, cf. [Rem08, Section 4.9]) are shown in Figure 5.4. The program scales
excellently until all 16 cores are in use (15.3-fold speed increase with 16 threads). It also benefits considerably from multiple SMT threads; the curve flattens only slightly at 32 and 64 threads.
As each pixel of the Mandelbrot set can be calculated independently, no large
amounts of data are shared. Therefore, and probably for the reasons explained above,
we see almost no difference in the performance for different work-stealing modes and
with shared or non-shared deques.
[Plot: speed increase over the number of hardware threads for breadth-first or depth-first WS with non-shared deques, hierarchical WS with non-shared deques, breadth-first WS with shared deques, and depth-first WS with shared deques]
Figure 5.4: Performance of the Mandelbrot set calculation
The last program we use to compare the different work-stealing modes is the Cholesky
decomposition. It is non-recursive and in fact does not use nested parallelism at all.
The results are shown in Figure 5.5. The Cholesky decomposition scales fairly well and
sees an approximately 13-fold performance increase at 16 threads. Further increases beyond 16 threads might again be limited by the SMT capabilities. For non-shared deque configurations, 64 threads even performs worse than 32 threads.
Notably, the Cholesky decomposition shows some differences between the work-stealing and deque-sharing modes. Hierarchical work-stealing performs slightly better than breadth-first work-stealing, which in turn runs slightly faster than depth-first work-stealing. Shared deques perform up to 20 percent better than non-shared deques.
[Plot: MFlops over the number of hardware threads for breadth-first WS with non-shared deques, depth-first WS with non-shared deques, hierarchical WS with non-shared deques, breadth-first WS with shared deques, and depth-first WS with shared deques]
Figure 5.5: Performance of the Cholesky decomposition
Cholesky reveals far more differences than the previously tested applications. There
are several explanations for this behavior. Firstly, it might result from not having
nested parallelism. After the initial scheduling of tasks, a large part of the execution is
concerned with application code instead of dependency analysis. Hence, application
data structures dominate the cache usage and different modes show different behavior.
Presumably more important is the spawning of tasks. As visible in Figure 3.5, the
graph contains tasks delivering data to multiple successors. This is where we can benefit
from shared deques. All of the successors are pushed to the deque and multiple threads
from the same core run those tasks and hence access the L1-cached data. The behavior
of the Fibonacci and merge sort programs is different. They initially distribute large
tasks to multiple threads. Stealing probably mostly occurs at the end, when some of
the large tasks have finished. In this phase, data is passed up the tree, i. e. multiple
tasks deliver data to the same dependent tasks. But no task spawns multiple others anymore that could run on different threads of the same core.
5.4.2 Overhead of the Execution System
In the Stack ES and in similar programming environments like Cilk, the runtime dependencies of tasks are enforced implicitly. Therefore, their execution has almost no
additional overhead compared to the sequential execution of plain C code. In contrast,
the CES Deque Execution System analyzes the dependencies of tasks explicitly. The
relative overhead depends on the granularity of the tasks.
The following test on the x86 Lenovo ThinkPad laptop compares the Deque ES and
Stack ES using the merge sort implementation with configurable task sizes. When the
size of the array is below a threshold, the remaining sorting happens in the current
task, with no further task spawns. This threshold is shown on the x-axis of Figure 5.6 as the Minimal Task Size (note the logarithmic scale). The y-axis shows the sorting
performance in numbers per second. We always sorted an array of five million random
numbers.
[Plot: sorted numbers per second over the Minimal Task Size (logarithmic scale) for the Deque ES with breadth-first WS and the Stack ES with breadth-first WS]
Figure 5.6: Comparison of the Stack ES and the Deque ES performance
For very small task sizes, the Deque Execution System performs poorly as a major
part of the execution time is spent analyzing the dependencies of the multitude of tasks.
With an increasing Minimal Task Size, the results improve quickly. Above a Minimal
Task Size of 256 numbers (about 350,000 clock cycles per task), the Deque ES comes
very close to the Stack ES and later even outperforms it slightly. The suspected reason
for better results than the Stack ES at coarse granularities is that the Stack ES still
needs to search through the Frame Stack, while the Deque ES can steal the first item
it finds on a foreign deque.
The test shows that the performance of the Deque ES depends heavily on the task
granularity. However, we do not need very coarse tasks to achieve good performance compared to the Stack ES (note that with a Minimal Task Size of 1000, there are still
at least 5000 tasks). Nevertheless, more detailed comparisons with multiple threads
would be needed to fully compare the performance of the two execution systems.
In a second experiment, we want to ensure that our dependency analysis algorithm
is not unnecessarily slow. Therefore, we briefly compare the Deque ES to SMPSs 2.3,
again on the x86 system. As SMP Superscalar does not support nested parallelism yet,
we chose a program without nested parallelism for the comparison.
Since the Cholesky decomposition (see Subsection 3.8.3) is such a program and we
have both a CES and an SMPSs version available, we reused this application for our
test. The results are depicted in Figure 5.7. The input parameter shown on the x-axis
is the side length of the block matrix, a number that determines the number of tiles we
operate on and thus the number of tasks to execute. The performance measurement
was part of the SMPSs program and gives the number of floating point operations per
second on the y-axis.
[Plot: MFlops over the side length of the block matrix for SMPSs and the Deque ES with breadth-first WS]
Figure 5.7: Comparison of SMPSs 2.3 and the Deque ES performance
The results for both systems are very similar. The curves partly even overlap,
although the two implementations do not share any code. The performance increases
with an increasing block side length. While the gains are huge at very small side
lengths, they diminish later. This development might result from caching effects of
the decomposition code and from the decreasing influence of constant factors with an
increasing computational effort.
The main result of this experiment is that the implementation of the Deque ES is competitive with another programming environment that analyzes the dependencies of tasks.
5.4.3 Sweep 2D Results
Initially, we used Sweep 2D to test the scheduling according to hardware threads (see
Subsection 4.6.4). Larger areas of the same color indicate a core working on multiple tasks that exchange data and are thus desirable for effective cache usage. For all results, we ran Sweep 2D with 300 × 300 tasks on 64 threads across all 16 A2 cores. Each core has its own color, and multiple hardware threads on the same core share that color.
Figure 5.8 shows the result with breadth-first work-stealing and non-shared deques
before we introduced the scheduling according to hardware threads. Columns of the
same color are clearly visible and indicate the depth-first execution order of each single
thread. Beyond that, the colors are quite mixed, i. e. stealing occurs randomly across
processor cores.
Figure 5.8: Processor assignment to Sweep 2D grid cells, without scheduling according
to hardware threads
Figure 5.9 shows the result with breadth-first work-stealing and non-shared deques.
Here, as in the following result, the tasks are scheduled according to hardware threads. Clearly,
we have much larger areas of the same color and thus can make better use of the cache.
In the upper left and lower right corner, the mixed colors remain. This is a result of
only few tasks being available at the beginning and end of the execution.
Figure 5.9: Processor assignment to Sweep 2D grid cells, with breadth-first work-stealing and non-shared deques
The best results are achieved when the threads of a core use a shared deque as
depicted in Figure 5.10. In the middle of the execution, there are mainly large blocks
of the same color. This reflects the improved performance we get from using shared
deques in the Cholesky decomposition (Figure 5.5).
Figure 5.10: Processor assignment to Sweep 2D grid cells, with breadth-first work-stealing and shared deques
6 Conclusions
6.1 Results
This Bachelor thesis explained the design and implementation of the new Deque
Execution System for the CES programming language. This involved modifying the
CES compiler, creating the Deque ES and extending the CES syntax to enable the newly
introduced array support. Furthermore, we adapted and extended the macro interface
connecting the compiler and execution system. Apart from the new language features,
we kept the previous execution systems compatible by extending their implementation
of the macro interface accordingly.
The new Deque ES supports the classical breadth-first work-stealing present in the
Stack ES, but it also enables depth-first work-stealing and a hybrid approach. We
provide compiler flags to easily switch between these modes of operation. To further
increase the flexibility of the scheduling algorithm, we added the option for multiple
threads to share the deque holding their tasks. This enables e. g. multiple hardware
threads in a single processor core to operate on the same double-ended queue, while
other cores have their own deques to avoid resource contention.
In contrast to previous execution systems, the Deque ES’ major data structure only
holds tasks which are ready to be executed. This reduces the necessary effort for
work-stealing and provides cleaner semantics. The Deque Execution System determines
dependencies between tasks and thus exposes parallelism. The previous Stack ES
required the programmer to explicitly indicate parallelism in the application.
The new Deque ES analyzes the dependencies of spawned tasks at run time, during
the execution. It schedules each task dynamically when all of its dependencies are
fulfilled. Arbitrary non-circular dependencies can be handled, so any directed acyclic
graph of tasks can be run. As the data dependencies of the tasks are exactly analyzed,
the Deque ES can exploit more available parallelism than the earlier Stack ES, where
the programmer could only coarsely expose the parallelism. However, due to the
dependency analysis, the Deque ES has a higher overhead than the Stack ES.
The Deque Execution System also takes care of memory management for all data
items shared between multiple tasks. When the last task accessing a data item has
finished, a so-called Free Task is scheduled to release the allocated memory. By using
free pools for storing data items and tasks, bottlenecks of standard allocation libraries
are circumvented.
In addition to handling the dependencies of scalar variables, we provide support for
declaring CES arrays. The elements of these arrays can be accessed with familiar array
syntax and can be passed to child tasks individually. Their dependencies and storage
space are also tracked individually, facilitating fine-grained scheduling and memory
deallocation. In conjunction with the possibility to manually encode task dependencies,
this enables some important applications. For example, there are many linear algebra
algorithms operating on blocked data, one of which we presented in this thesis.
We briefly compared various work-stealing and deque sharing modes within the
Deque ES. The results for Cholesky decomposition suggest that sharing a deque among
multiple hardware threads on the same core can help the performance of certain
applications. Further investigations of this subject are desirable. We also outlined that
the performance of the Deque ES depends on task granularity and is comparable to
other parallel programming environments performing dependency analysis. Finally, we
illustrated the stealing behavior with shared and non-shared deques using the Sweep
2D application.
6.2 Further Research Possibilities
In this final section, we highlight some possible research directions for the future. We
start with promising modifications or extensions of the CES language and implementation. Afterwards, we present some ideas for further evaluations of CES.
6.2.1 Advancing the CES Language and Implementation
Firstly, there are some direct improvements to the current implementation. In Subsection 3.2.1 we mentioned that the Deque ES adheres not only to RAW, but also
to WAR and WAW dependencies. The latter are not true dependencies as they can
be eliminated through register renaming [SS95]. The Deque ES could benefit from
implementing this technique as it allows more parallelism than the current solution.
The concurrent deque library we use has a fixed deque size, which can only be
changed at compile time. The deque could possibly be adapted to grow and shrink
according to its capacity utilization [CL05]. If this is not possible while keeping the
high concurrent performance, another idea is to implement a multi-deque wrapper
structure. When one deque is full, we switch to a new deque, while the old one is stored
for later access. If a (possibly different) thread empties its deque, it could replace the
deque with the stored one, which provides plenty of tasks to execute.
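A rough sketch of how such a multi-deque wrapper could look is given below. All type and function names here are hypothetical placeholders, not the interface of the actual deque library.
/* Sketch of the multi-deque wrapper idea (hypothetical API). */
typedef struct Deque Deque;      /* fixed-capacity concurrent deque              */
typedef struct Task  Task;

Deque *deque_new(void);
int    deque_try_push_top(Deque *d, Task *t);   /* fails when the deque is full  */
void   stash_put(Deque *full_deque);            /* park a full deque             */
Deque *stash_take(void);                        /* NULL if nothing is parked     */

typedef struct { Deque *current; } DequeWrapper;

static void wrapper_push(DequeWrapper *w, Task *t)
{
    if (!deque_try_push_top(w->current, t)) {   /* current deque is full         */
        stash_put(w->current);                  /* keep it around for later      */
        w->current = deque_new();               /* switch to a fresh deque       */
        deque_try_push_top(w->current, t);
    }
}

static void wrapper_on_empty(DequeWrapper *w)
{
    Deque *stored = stash_take();               /* possibly filled by any thread */
    if (stored != NULL)
        w->current = stored;                    /* adopt the parked deque        */
}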
While the CES execution system uses free pools to deal with concurrent memory
allocations, user programs allocating heap memory regularly might still run into
scalability problems. “When ordinary, nonthreaded allocators are used, memory
allocation becomes a serious bottleneck in a multithreaded program because each
thread competes for a global lock for each allocation and deallocation of memory from
a single global heap.” [Rei07, p. 101] Therefore, we should offer the user a scalable
allocator that deals with this problem. Intel Threading Building Blocks (TBB) provides
two such allocator classes [Rei07, Chapter 6], which one could possibly offer through or adapt
for CES. One obstacle is that Intel TBB is a C++ library whereas CES builds on C.
As C++ is mostly an extension of C, this obstacle might be easy to overcome.
At the moment, the Deque ES keeps tasks and data items separate and establishes the
links through pointers. We could cut down on storage locations and memory allocations if we put small data items directly into the consuming task frames. Those variables
would then be passed by value. This change would disrupt the task communication
model, which relies on different tasks accessing the same variables by reference. Recall that a
parent passes the same reference to multiple tasks, whose communication only happens
through the shared variable. Therefore, when notifying subsequent tasks, the condition
task would need to copy the newly delivered variable values into the task frame of
the dependent task. As this task frame is accessed anyway to decrease the number
of unsatisfied dependencies, the additional overhead for the copy operation might be
quite small.
An alternative keeping the current task communication model is to put the data
item directly into the corresponding Free Task. As there is exactly one Free Task per
data item, the mapping would be well-defined. All other tasks would still access the
shared variable within the Free Task through references, but the data item would not
need a distinct storage location anymore.
Scheduling tasks and particularly analyzing their dependencies imposes considerable overhead on the execution system. This is especially obvious with very fine-grained tasks, as shown in Figure 5.6. When all threads are busy, there is no need to spawn additional tasks; we could just execute the code sequentially as in conventional C. The execution speed of applications with fine-grained tasks could be increased considerably if we switched between the sequential and the task-based execution mode depending on the number of ready tasks in the system [TCBV10].
The following ideas for improvements are more involved and we are not sure of their
feasibility. In Subsection 3.8.1, we explained some of the difficulties with arrays and
pointers concerning dependency handling. The bottom line is that it is hard to determine how pointers are used across tasks and what the intention of the programmer is. Since pointers are a very central concept in C, it is desirable to extend their support in CES dependency tracking. If there is no silver bullet for the problem, one could at least offer
different mechanisms for the most common use cases.
The execution of the Deque ES is strict, i. e. a task may only be executed when
all of its dependencies are fulfilled. In non-strict evaluation, this is not true for all
parameters, providing more opportunities for parallelization. For example, consider
an input parameter of a parent task. When the parameter – or its reference to be
accurate – is only passed down to a child task, the parent task could run before the
value of the parameter is available. Only the child task would need to wait for actual
data delivery. In [SB99], Burkhard Steinmacher-Burow proposed the del (delegate)
keyword to mark parameters that are only passed to child tasks and therefore do not
need to be available for the parent to run. Supporting this keyword in CES would
yield more opportunities for a parallel execution.
The definition of the CES language focuses more on quickly achieving results than on ease of use. By extending the compiler, one could increase the usability. For example, the types of C parameters passed to CES tasks must be given as detailed in Subsection 2.1.3. Instead, the CESC could determine this information on its own. In general, the syntax of the language should be evaluated from a user’s point of view.
6.2.2 Evaluation of the Current CES State
While the above-mentioned improvements and extensions were at the core of the Deque
ES and CES language development, we now present some ideas for investigating the
usefulness of CES.
To test the Deque ES and the CES language extensions, we used rather small
applications. These were mostly classical computer science problems; only Cholesky
decomposition explored the linear algebra domain. To thoroughly study the practicability of the language, the development of larger CES applications from different
domains is necessary.
The final research prospect we want to give here is an extended evaluation of the
performance of the Deque ES. Mainly due to differences in the targeted platforms,
this thesis could only briefly outline that the Deque ES is competitive. An extended
comparison to e. g. Cilk, SMP Superscalar, the Stack ES and particularly to the best-possible sequential execution would be interesting to properly evaluate the performance
of the latest developments in CES.
Bibliography
[ABB00] Umut A. Acar, Guy E. Blelloch, and Robert D. Blumofe. The data locality of work stealing. In Proceedings of the twelfth annual ACM symposium on Parallel algorithms and architectures, SPAA ’00, pages 1–12, New York, NY, USA, 2000. ACM.
[ABP98] Nimar S. Arora, Robert D. Blumofe, and C. Greg Plaxton. Thread scheduling for multiprogrammed multiprocessors. In Proceedings of the tenth annual ACM symposium on Parallel algorithms and architectures, SPAA ’98, pages 119–129, New York, NY, USA, 1998. ACM.
[ALS10] Kunal Agrawal, Charles E. Leiserson, and Jim Sukha. Executing task graphs using work-stealing. In Parallel Distributed Processing (IPDPS), 2010 IEEE International Symposium on, pages 1–12, 2010.
[BGM99] Guy E. Blelloch, Phillip B. Gibbons, and Yossi Matias. Provably efficient scheduling for languages with fine-grained parallelism. J. ACM, 46:281–321, March 1999.
[BL93] Robert D. Blumofe and Charles E. Leiserson. Space-efficient scheduling of multithreaded computations. In Proceedings of the twenty-fifth annual ACM symposium on Theory of computing, STOC ’93, pages 362–371, New York, NY, USA, 1993. ACM.
[BL94] Robert D. Blumofe and Charles E. Leiserson. Scheduling multithreaded computations by work stealing. In Proceedings of the 35th Annual Symposium on Foundations of Computer Science (FOCS), pages 356–368, 1994.
[BLKD07] Alfredo Buttari, Julien Langou, Jakub Kurzak, and Jack Dongarra. Parallel tiled QR factorization for multicore architectures. Technical report, University of Tennessee, Oak Ridge National Laboratory, 2007.
[BOI10] BOINC – open-source software for volunteer computing and grid computing. http://boinc.berkeley.edu/, Retrieved December 15th, 2010.
[BS81] F. Warren Burton and M. Ronan Sleep. Executing functional programs on a virtual tree of processors. In Proceedings of the 1981 conference on Functional programming languages and computer architecture, FPCA ’81, pages 187–194, New York, NY, USA, 1981. ACM.
[CBL10] Netlib repository at UTK and ORNL. http://www.netlib.org/clapack/cblas/, Retrieved January 17th, 2010.
[CGK+ 07] Shimin Chen, Phillip B. Gibbons, Michael Kozuch, Vasileios Liaskovitis, Anastassia Ailamaki, Guy E. Blelloch, Babak Falsafi, Limor Fix, Nikos Hardavellas, Todd C. Mowry, and Chris Wilkerson. Scheduling threads for constructive cache sharing on CMPs. In Proceedings of the nineteenth annual ACM symposium on Parallel algorithms and architectures, SPAA ’07, pages 105–115, New York, NY, USA, 2007. ACM.
[Cil10] The Cilk project. http://supertech.csail.mit.edu/cilk/, Retrieved December 15th, 2010.
[CL05] David Chase and Yossi Lev. Dynamic circular work-stealing deque. In Proceedings of the seventeenth annual ACM symposium on Parallelism in algorithms and architectures, SPAA ’05, pages 21–28, New York, NY, USA, 2005. ACM.
[DDCHD90] Jack J. Dongarra, Jeremy Du Croz, Sven Hammarling, and Iain S. Duff. A set of level 3 basic linear algebra subprograms. ACM Trans. Math. Softw., 16:1–17, March 1990.
[DDCHH88] Jack J. Dongarra, Jeremy Du Croz, Sven Hammarling, and Richard J. Hanson. An extended set of Fortran basic linear algebra subprograms. ACM Trans. Math. Softw., 14:1–17, March 1988.
[DK99] Krister Dackland and Bo Kågström. Blocked algorithms and software for reduction of a regular matrix pair to generalized Schur form. ACM Trans. Math. Softw., 25:425–454, December 1999.
[Fel11] Michael Feldman. Argonne orders 10 petaflop Blue Gene/Q super. HPCwire, February 8th 2011. Retrieved February 9th, 2011.
[FLR98] Matteo Frigo, Charles E. Leiserson, and Keith H. Randall. The implementation of the Cilk-5 multithreaded language. In Proceedings of the ACM SIGPLAN 1998 conference on Programming language design and implementation, PLDI ’98, pages 212–223, New York, NY, USA, 1998. ACM.
[GJ07] Robert Granat and Isak Jonsson. Recursive blocked algorithms for solving periodic triangular Sylvester-type matrix equations. In PARA’06 – State of the Art in Scientific and Parallel Computing, 2006. Lecture Notes in Computer Science. Springer, 2007.
[GTK10] The GTK+ project. http://www.gtk.org/, Retrieved January 24th, 2010.
[IBM10] IBM Corporation. ROI: Extending the benefits of energy efficiency. http://www-304.ibm.com/tools/cpeportal/fileserve/download0/164224/FV_Energy_Efficiency.pdf?contentid=164224, 2009. Retrieved December 14th, 2010.
[JK02] Isak Jonsson and Bo Kågström. Recursive blocked algorithms for solving triangular systems – part I: one-sided and coupled Sylvester-type matrix equations. ACM Trans. Math. Softw., 28:392–415, December 2002.
[KLDB09] Jakub Kurzak, Hatem Ltaief, Jack Dongarra, and Rosa M. Badia. Scheduling linear algebra operations on multicore processors – LAPACK working note 213, February 2009.
[Lea00] Doug Lea. A Java fork/join framework. In Proceedings of the ACM 2000 conference on Java Grande, JAVA ’00, pages 36–43, New York, NY, USA, 2000. ACM.
[Lei09] Charles E. Leiserson. The Cilk++ concurrency platform. In Proceedings of the 46th Annual Design Automation Conference, DAC ’09, pages 522–527, New York, NY, USA, 2009. ACM.
[LHKK79] C. L. Lawson, R. J. Hanson, D. R. Kincaid, and F. T. Krogh. Basic linear algebra subprograms for Fortran usage. ACM Trans. Math. Softw., 5:308–323, September 1979.
[Met09] Manuel Metzmann. Implementation, verification and performance measurement of concurrent data structures using new synchronization primitives. Diploma thesis, Technische Universität Kaiserslautern, March 2009.
[PBL07] Josep M. Perez, Rosa M. Badia, and Jesus Labarta. A flexible and portable programming model for SMP and multi-cores. Technical report, Barcelona Supercomputing Center, March 2007.
[PBL08] Josep M. Perez, Rosa M. Badia, and Jesus Labarta. A dependency-aware task-based programming environment for multi-core architectures. In Proceedings of the 2008 IEEE International Conference on Cluster Computing, pages 142–151, September 2008.
[PBL10] Josep M. Perez, Rosa M. Badia, and Jesus Labarta. Handling task dependencies under strided and aliased references. In Proceedings of the 24th ACM International Conference on Supercomputing, ICS ’10, pages 263–274, New York, NY, USA, 2010. ACM.
[PFF+ 07] F. Petrini, G. Fossum, J. Fernandez, A. L. Varbanescu, N. Kistler, and M. Perrone. Multicore surprises: Lessons learned from optimizing Sweep3D on the Cell Broadband Engine. In Parallel and Distributed Processing Symposium, 2007. IPDPS 2007. IEEE International, pages 1–10, 2007.
[Rei07] James Reinders. Intel Threading Building Blocks. O’Reilly & Associates, Inc., Sebastopol, CA, USA, first edition, 2007.
[Rem08] Jens Remus. Konzeption und Entwicklung einer Cop/Thief Work-Stealing Laufzeitumgebung zur parallelen Ausführung von Unterprogrammen. Diploma thesis, Fachhochschule Wedel, February 2008.
[Sav10] Vlad Savov. Exclusive: LG’s 4-inch Android phone with dual-core Tegra 2 and 1080p video coming in early 2011. Engadget, November 18th 2010. Retrieved December 14th, 2010.
[SB99] Burkhard D. Steinmacher-Burow. An alternative implementation of routines. http://www-zeus.desy.de/~funnel/TSIA/talks/ifl.pdf.gz, October 5th 1999.
[SB00a] Burkhard D. Steinmacher-Burow. Task frames. http://arxiv.org/abs/cs.PL/0004011, 2000.
[SB00b] Burkhard D. Steinmacher-Burow. TSIA: A dataflow model. http://arxiv.org/abs/cs.PL/0003010, 2000.
[SBWR08] Burkhard D. Steinmacher-Burow, Sven Wagner, and Jens Remus. A modular approach to parallel applications. October 2008.
[Shi10] Robert Shiveley. Performance scaling in the multi-core era. Intel Software Network, http://software.intel.com/en-us/articles/performance-scaling-in-the-multi-core-era/, 2008. Retrieved December 14th, 2010.
[SMP10] SMP Superscalar. http://www.bsc.es/plantillaG.php?cat_id=385, Retrieved December 15th, 2010.
[SS95] James E. Smith and Gurindar S. Sohi. The microarchitecture of superscalar processors. Proceedings of the IEEE, 83(12):1609–1624, December 1995.
[SYD09] Fengguang Song, Asim YarKhan, and Jack Dongarra. Dynamic task scheduling for linear algebra algorithms on distributed-memory multicore systems. In Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis, SC ’09, pages 19:1–19:11, New York, NY, USA, 2009. ACM.
[TBB10] Intel Threading Building Blocks 3.0 for open source. http://www.threadingbuildingblocks.org/, Retrieved December 15th, 2010.
[TCBV10] Alexandros Tzannes, George C. Caragea, Rajeev Barua, and Uzi Vishkin. Lazy binary-splitting: a run-time adaptive work-stealing scheduler. In Proceedings of the 15th ACM SIGPLAN symposium on Principles and practice of parallel programming, PPoPP ’10, pages 179–190, New York, NY, USA, 2010. ACM.
[U.S10] U.S. Department of Energy. Secretary Chu announces $47 million to improve efficiency in information technology and communications sectors. http://www1.eere.energy.gov/recovery/news_detail.html?news_id=15705, January 6th 2010. Retrieved December 14th, 2010.
[VMw10] VMware Inc. How VMware virtualization right-sizes IT infrastructure to reduce power consumption. http://www.vmware.com/files/pdf/WhitePaper_ReducePowerConsumption.pdf, 2010. Retrieved December 14th, 2010.
[Wag07] Sven Wagner. Konzeption und Entwicklung eines neuen Compiler “CESC” zur Implementierung von Prozeduren als atomare Tasks. Diploma thesis, Fachhochschule Gießen-Friedberg, August 2007.
Declaration of Independent Work
I hereby declare that I have written this thesis independently and have used no sources or aids other than those indicated.
Böblingen, 28th February 2011
Sebastian Dörner