Otto-von-Guericke-Universität Magdeburg
Fakultät für Informatik

Bachelor Thesis

An Implementation and Investigation of Depth-First Work-Stealing

Sebastian Dörner
Weitlingstraße 9, 39104 Magdeburg, Germany
[email protected]

1st March 2011

Examiner: Prof. Dr. Stefan Schirra, Otto-von-Guericke-Universität Magdeburg, Universitätsplatz 2, 39106 Magdeburg, Tel: +49-391-67-18557, Email: [email protected]

Supervisor: Dr. Burkhard D. Steinmacher-Burow, IBM Deutschland Research & Development GmbH, Schönaicher Straße 220, 71032 Böblingen, Tel: +49-7031-16-2863, Email: [email protected]

Abstract

Multi-threaded programming in conventional programming languages requires the developer to distribute work to and manage threads manually. With an increasing number of processor cores in mainstream hardware, taking advantage of these cores demands more and more management and thus diminishes programmer productivity, a problem known as the multi-core software crisis. To address this problem, new runtime libraries and programming languages have been developed. Many of the latter – among them MIT Cilk and IBM CES – employ the breadth-first work-stealing approach, where a processor executes work from its own data structure but steals work from other processors once its own structure is empty. The unit of work that is stolen is called a task. In breadth-first work-stealing, usually large tasks from a high level in the call hierarchy are stolen, which leads to different cores working on widely separated parts of the code. In depth-first work-stealing, smaller tasks from lower levels of the call hierarchy are stolen and different cores tend to work on nearby parts of the code. When multiple cores have a shared cache, this might improve the cache utilization and thus speed up the execution.

This thesis extends the IBM CES compiler and runtime to also support depth-first work-stealing. For this, we implemented a system for analyzing arbitrary dependencies between tasks at run time and scheduling them to run in parallel. As far as we know, CES with this extension is the first parallel language to support automatic dependency analysis of tasks with nested parallelism. Furthermore, we discovered a way to implement additional array support and thus enabled some important applications.

Contents

1 Introduction
  1.1 Motivation
  1.2 Existing CES Implementation
  1.3 An Improved Approach
  1.4 Thesis Objectives
  1.5 Related Work
  1.6 Outline

2 Previous Work on the CES Programming Language
  2.1 Language Concepts
    2.1.1 Architecture Overview
    2.1.2 A New Model for Function Calls
    2.1.3 The Original CES Syntax
  2.2 Previous Execution Systems
    2.2.1 Tasks in the Stack Execution System
    2.2.2 Data Structures and Their Implications
    2.2.3 Relationship to Cilk and the Deque Execution System
  2.3 Deque ES Concept

3 Design of the Deque Execution System
  3.1 Overview
  3.2 Dependency Analysis
    3.2.1 Types of Data Dependencies
    3.2.2 The Dependency Analysis Table
    3.2.3 The Dependency Analysis Algorithm
  3.3 Notification of Dependent Tasks
  3.4 Scheduling and Work-Stealing
  3.5 Synchronization
  3.6 Memory Management for Data Items
  3.7 Manual Encoding of Task Dependencies
  3.8 Additional Array Support by the Execution System
    3.8.1 Overview
    3.8.2 Syntax
    3.8.3 Use Case: Algorithms on Blocked Data

4 Implementation of the Deque Execution System
  4.1 Data Structures for Dependency Analysis and Task Notification
  4.2 Notification of Dependent Tasks
  4.3 Scheduling and Work Stealing
  4.4 Synchronization
  4.5 Memory Management for Data Items
  4.6 Speed Improvements
    4.6.1 Using Single Variables for the Dependency Analysis Table
    4.6.2 Avoiding O(n) Operations on Callback Lists
    4.6.3 Using Free Pools for Task and Data Frames
    4.6.4 Scheduling According to Hardware Threads
  4.7 Array Support

5 Performance Comparisons
  5.1 Goals
  5.2 Test Configuration
  5.3 CES Applications Used
    5.3.1 Recursive CES Applications
    5.3.2 Cholesky Decomposition
    5.3.3 Sweep 2D
  5.4 Results
    5.4.1 Scaling of Work-Stealing Modes and Shared Deques
    5.4.2 Overhead of the Execution System
    5.4.3 Sweep 2D Results

6 Conclusions
  6.1 Results
  6.2 Further Research Possibilities
    6.2.1 Advancing the CES Language and Implementation
    6.2.2 Evaluation of the Current CES State

Bibliography

Selbstständigkeitserklärung (declaration of authorship)

Acknowledgments

I would like to thank Dr. Burkhard Steinmacher-Burow for offering a very interesting subject and for supporting me throughout my internship at IBM and the writing of this thesis.
He was a tremendous source of good advice and never got tired of me seeking it. I would also like to thank Prof. Dr. Stefan Schirra for supervising my bachelor thesis at university, for giving me some valuable insights into academic writing and for supporting me throughout my studies. Furthermore, I would like to thank Uwe Fischer and all the people in his department for a warm welcome at IBM. Thanks to Benjamin Ebrahimi, Benjamin Krill, Heiko Schick, Peter Morjan, Bryan Rosenburg and Tom Musta for helping me with technical issues. Thanks to Prof. Dr. Dietmar Rösner for establishing contact with IBM. I would like to thank Anett Hoppe, Anja Bachmann and Benjamin Espe for proofreading drafts of this thesis and for giving me some useful suggestions. Finally, I would like to express my deep gratitude to my parents Heike Dörner and Torsten Mehlhase, who enabled my studies and always encouraged and supported me.

1 Introduction

1.1 Motivation

Until about 2005, gains in processor performance were mainly achieved by advancing the processor core architecture and increasing the clock frequency. Improved manufacturing techniques enabled smaller transistors with higher switching speeds. As Robert Shiveley from Intel explained, this development has a drawback that is increasingly difficult to handle: “Power consumption and heat generation rise exponentially with clock frequency. Until recently, this was not a problem, since neither had risen to significant levels. Now both have become limiting factors in processor and system designs.” [Shi10]

High power consumption is not only a technically limiting factor, but also unwanted in light of one of the latest political and industrial trends, “Green IT”. Governments recognize data centers, for example, as major contributors to global greenhouse gas emissions and spend money on research for improving energy efficiency [U.S10]. IT companies label their products as “green”, promising lower operating costs [IBM10, VMw10]. For these products to achieve their relatively low power consumption, it is essential to limit the clock rate and instead look for other possibilities to achieve high performance.

Since transistors continue to shrink, the most common way to achieve high performance with lower clock rates is to use processors with multiple cores. On the hardware side, this provides more and more raw computational power. But it also has serious implications for the software side: conventional computer programs were not designed to exploit the capabilities of multiple CPUs or CPU cores, which means that they no longer run effortlessly faster on next-generation processors as they did in the past. With multi-core processors already reaching the cell phone class of devices [Sav10], these problems will affect large parts of the software industry.

To make use of additional cores, the program code must be distributed to multiple threads of execution. Most traditional programming languages offer libraries to start threads and collect their results. However, the application programmer herself must manually identify which parts can run in parallel, distribute the work loads and add code to manage threads instead of concentrating on the application logic. Therefore, university and corporate research seeks ways to automate the distribution of applications to several threads as far as possible, thereby retaining programmer productivity.
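To make this burden concrete, the following is a minimal sketch of the manual approach described above, using plain C with POSIX threads. The work items and the process_item function are hypothetical placeholders and not related to CES; the point is how much of the code is thread management rather than application logic.

#include <pthread.h>
#include <stdio.h>

#define NUM_THREADS 4
#define NUM_ITEMS   1000

static double results[NUM_ITEMS];

/* Hypothetical piece of application work. */
static double process_item(int i) { return i * 0.5; }

struct range { int begin; int end; };   /* work assigned to one thread */

static void *worker(void *arg)
{
    struct range *r = (struct range *)arg;
    for (int i = r->begin; i < r->end; i++)
        results[i] = process_item(i);
    return NULL;
}

int main(void)
{
    pthread_t threads[NUM_THREADS];
    struct range ranges[NUM_THREADS];

    /* The programmer partitions the work and manages the threads by hand. */
    for (int t = 0; t < NUM_THREADS; t++) {
        ranges[t].begin = t * NUM_ITEMS / NUM_THREADS;
        ranges[t].end   = (t + 1) * NUM_ITEMS / NUM_THREADS;
        pthread_create(&threads[t], NULL, worker, &ranges[t]);
    }
    for (int t = 0; t < NUM_THREADS; t++)
        pthread_join(threads[t], NULL);

    printf("%f\n", results[0]);
    return 0;
}

Almost none of this code expresses the application itself; identifying the parallelism, partitioning the data and joining the threads are all left to the programmer.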
For that purpose a multitude of systems have been developed, for instance Intel Threading Building Blocks (TBB) [TBB10], the Berkeley Open Infrastructure for Network Computing (BOINC) [BOI10], MIT Cilk [Cil10] and SMP Superscalar [SMP10]. Another such system, developed at IBM, is C with Execution System (CES), an extension of the C programming language.

1.2 Existing CES Implementation

CES is based on an alternative implementation of the subroutine or function call, developed by Dr. Burkhard D. Steinmacher-Burow [SB00b]. Such a subroutine, also called a task in the following, is characterized by its input, inout (input-output) and output parameters. Unlike the conventional implementation of function calls, a call to a CES subroutine is only executed after the parent task has completed. Therefore, each thread executes only one task at a time, in contrast to the conventional call stack with multiple subroutines in execution, each suspended at a call to a child. In CES, once the parent task has completed, control is returned to an execution system, hence the name CES (C with Execution System). This execution system (ES) is responsible for dispatching new tasks that are ready to be executed. Whether a task is ready to be executed mainly depends on the availability of its input and inout parameters. Thus, the access to parameters determines the runtime dependencies between tasks.

Unless a program is very sequential in its structure, there are usually multiple tasks ready to be executed. It is up to the execution system to schedule these tasks. More importantly, they can be distributed among multiple threads and run on different processors of a shared-memory machine. A well-known way to distribute the tasks without too much interference between different threads is the work-stealing approach [BL94], in which processors execute work from their own data structures but steal tasks from other processors once they run out of work. To exploit temporal locality, children of the current task are usually executed directly after the parent finishes and on the same thread, which leads to a depth-first execution order. For example, a thread operates on a stack of tasks, only accessing the top for put and take operations. Other threads may steal tasks from the top or bottom of the stack, a choice that is expressed by the terms depth-first and breadth-first work-stealing, respectively. Breadth-first work-stealing has the advantage that processors working on widely separated tasks typically do not invalidate each other’s cache. Furthermore, because tasks at the bottom usually represent larger tasks, typically fewer steals are necessary. For multiple processor cores with a shared cache, however, depth-first work-stealing may reduce cache misses through eviction, since the cores would work on nearby parts of the code and thus largely operate on the same parts of memory. A sketch of such a work-stealing scheduler is given below.
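The following is a minimal C sketch of such a work-stealing scheduler loop. It is not the CES implementation; the deque operations pop_top and pop_bottom, the task type and the worker bookkeeping are hypothetical and serve only to illustrate the choice between stealing from the top (depth-first) and from the bottom (breadth-first).

#include <stddef.h>

/* Hypothetical per-worker data; not the actual CES runtime structures. */
typedef struct task task_t;
void run(task_t *t);                      /* execute a task, may push children  */

typedef struct deque deque_t;             /* one deque per worker thread        */
task_t *pop_top(deque_t *d);              /* owner end: most recently pushed    */
task_t *pop_bottom(deque_t *d);           /* opposite end: oldest task          */

extern deque_t *my_deque;
extern deque_t *deques[];                 /* deques of all workers              */
extern int      num_workers, my_id;
extern int      steal_depth_first;        /* switch between the two modes       */

void worker_loop(void)
{
    for (;;) {
        /* Prefer local work: take the most recently pushed task. */
        task_t *t = pop_top(my_deque);

        /* Out of local work: try to steal from another worker. */
        for (int v = 0; t == NULL && v < num_workers; v++) {
            if (v == my_id)
                continue;
            t = steal_depth_first ? pop_top(deques[v])     /* small, recent task */
                                  : pop_bottom(deques[v]); /* large, old task    */
        }

        if (t != NULL)
            run(t);      /* children spawned by t end up on my_deque */
        /* else: idle and retry (termination detection omitted)      */
    }
}

The only difference between the two modes is which end of the victim’s deque the thief pops from; this is exactly the property that makes switching between depth-first and breadth-first stealing cheap.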
For CES, there are multiple execution systems to choose from: the Sequential Execution System is single-threaded and uses no work-stealing at all. The Round-robin Execution System uses breadth-first work-stealing but is single-threaded; one thread of execution simulates multiple virtual processors, thus avoiding the need for concurrency synchronization. The Stack Execution System works with multiple threads and uses breadth-first work-stealing as well. The current implementation of CES also includes the CES Compiler (CESC), developed by Sven Wagner as part of his Diploma Thesis in 2007 [Wag07]. While Wagner only used a hard-coded version of the Sequential Execution System, Jens Remus designed a macro interface to allow multiple execution systems and then developed the previously mentioned Round-robin and Stack Execution Systems, also as part of his Diploma Thesis [Rem08].

However, the current implementation has several shortcomings (cf. [Rem08, pages 147–149]). First, the application programmer identifies tasks that can run in parallel by marking calls to subroutines with the parallel keyword. This identification is a manual process, and the simple binary marking cannot exploit some opportunities for parallelism. Similar to Cilk, this approach is mostly useful for divide-and-conquer algorithms but not for arbitrary dependency structures between tasks. Second, the runtime data structures and code are quite complicated: all tasks are managed in a data structure that is called a stack, but is in fact also randomly accessed. Since the structure includes both tasks that are ready to be executed and tasks with outstanding dependencies, other threads stealing from the stack have to search through it in order to find a task that may be stolen. To enable this behavior, the old system also needs sophisticated synchronization mechanisms.

1.3 An Improved Approach

In an improved approach, a new execution system (ES) would address the above shortcomings of the Stack ES. The main idea is to use a double-ended queue (deque) instead of the current pseudo-stack. In contrast to the current solution, it would only hold tasks that are ready to be executed and therefore make it possible to access it with correct deque semantics (see Section 2.3). The current pseudo-stack implementation has the advantage that the execution order as given by the programmer is internally represented by the order of tasks on the stack. Together with the programmer’s indication of which tasks can run in parallel, this information is enough to implicitly enforce the data dependencies between tasks. With only ready-to-be-executed tasks on the deque of the new implementation, this information is lost. To guarantee a valid execution order nonetheless, we must analyze the actual dependencies between tasks and schedule them accordingly. Furthermore, this automatic dependency resolution will probably lead to a better exploitation of parallelism compared to the previous solution, which conservatively approximated the true dependencies.

1.4 Thesis Objectives

The main objective of this thesis is to implement the described improved approach in a new and efficient execution system, which will use a deque as the main data structure. With a deque holding the tasks, stealing from the top and bottom yields depth-first and breadth-first work-stealing, respectively. The implementation should enable us to easily switch between both alternatives and to compare them. We must design and implement a system to analyze the dependencies between tasks and properly integrate it with the current interface for execution systems. If need be, we will extend the existing interface but also keep compatibility with the original execution systems.

Furthermore, the new execution system should run on Blue Gene hardware in addition to standard x86 machines. This enables us to use a very efficient existing implementation of a concurrent deque for x86 and Blue Gene/Q by Manuel Metzmann [Met09].
Since the next Blue Gene generation also provides multiple hardware threads, we can easily verify the scaling capabilities of the new execution system.

In a nutshell, building on the old CES system, the existing concept for the new Deque Execution System and the concurrent deque implementation, we construct a new execution system with automated dependency analysis, easy switching of work-stealing modes and a better exploitation of parallelism. Figure 1.1 illustrates this objective, also indicating the sections in which each of the parts will be explained in detail.

Figure 1.1: Foundations and objective of this thesis – the old CES system with its Sequential, Round-robin and Stack ES (Sections 2.1 and 2.2), the existing Deque ES concept and the concurrent deque implementation (Section 2.3) feed into the new Deque ES implementation (Chapters 3 and 4).

1.5 Related Work

The main ideas of the work-stealing principle have already been mentioned in the early 1980s [BS81]. Since then, it has been used to develop both parallel runtime libraries like Intel Threading Building Blocks [TBB10] or Java’s Fork/Join Framework [Lea00] and programming languages like Cilk [FLR98] or Cilk++ [Lei09]. These systems and also the old CES execution systems use breadth-first work-stealing. As different threads in breadth-first work-stealing tend to have disjoint working sets, this is a very good choice for multiple processors with distinct caches.

In recent years, however, processors with multiple cores have emerged as the dominant architecture, not only for mainstream computers. In a typical multi-core chip, all of the chip’s hardware threads share a cache. The different working sets of these threads in breadth-first work-stealing may disturb each other’s cache usage. This development directed research towards schedulers with an emphasis on constructive cache sharing: concurrently scheduled tasks on hardware threads of one processor should operate on similar data so that all threads can make use of the same data in the cache.

Parallel Depth First (PDF) [BGM99] is a scheduler performing constructive cache sharing. Scheduled to run next is the task that would be executed next in the serial execution. While depth-first work-stealing does not strictly provide this property, it still prefers the execution of recently spawned tasks and thus shows a similar behavior. The performance benefits of PDF compared to breadth-first work-stealing as reported in [CGK+07] might appear in depth-first work-stealing as well. Furthermore, the only difference between depth-first and breadth-first work-stealing is the pop operation used for the deque. Hence, in CES we can easily combine them and, for example, try stealing depth-first on the same processor core and breadth-first across processor cores.

The idea of employing a deque with strict semantics to store tasks is not new to the field. Blumofe and Leiserson [BL93] introduced a model to represent a ready-to-execute task and its successors in a fixed linear group, called a thread (we use their definition of the word thread only in this paragraph). They later [BL94] present a scheduler that stores these threads in double-ended queues, each of which is assigned to a fixed processor. In contrast to our solution, these threads and hence the deques can contain tasks which are not yet ready to execute. When the execution hits such a task, the thread blocks and the execution continues with a new thread popped from the processor’s deque. Based on this work, several deque implementations have been developed and used in task schedulers [ABP98, CL05].
In [ALS10], Agrawal et al. present the NABBIT library, which enables the execution of both static and dynamic task graphs (see Subsection 2.1.2) in the work-stealing environment of Cilk++. Instead of relying on the sequential program code like the Deque ES, they require the programmer to specify the nodes of the graphs and their dependencies explicitly. NABBIT performs the search for ready-to-execute tasks backwards, starting from the final task, which is the sink of the graph. Therefore, it keeps references to predecessors and successors of a task, whereas the Deque ES only keeps forward links. However, parts of their main execution driver code for dynamic task graphs, in particular DecComputeNotify and ComputeAndNotify in [ALS10, Fig. 8], are similar to the Deque ES.

Perez, Badia and Labarta devised the SMP Superscalar (SMPSs) programming model [PBL07, PBL08], which analyzes the dependencies of tasks at run time, just like the Deque ES. They use conventional C functions as the unit of parallelism and annotate them with C pragmas to distinguish input, inout and output parameters. The array support of the Deque ES (see Section 3.8) was inspired by SMPSs [PBL08, Section IV]. A notable advantage of the Deque ES over SMPSs concerns nested tasks. In the Deque ES, tasks executed in parallel can spawn child tasks, which are also executed in parallel. In SMPSs, the children of a task spawned in parallel are executed serially like any conventional C function, i.e. “SMPSs does not currently support nesting” [PBL10, Section 6].

Song et al. developed the library Task-based Basic Linear Algebra Subroutines (TBLAS) [SYD09]. It implements a widely-used interface [LHKK79, DDCHH88, DDCHD90], which is the foundation of many linear algebra algorithms. The new implementation of each subroutine generates a set of tasks, which are executed dynamically after having their dependencies analyzed. As the dependency analysis algorithm and scheduling scheme are totally distributed and “the runtime system has no globally shared data structures” [SYD09], TBLAS runs on both shared-memory and distributed-memory systems. In contrast to CES, it is specialized in the linear algebra domain and hence not suitable for executing arbitrary programs in parallel.

1.6 Outline

The rest of this thesis is organized as follows: Chapter 2 presents the original state of the CES language and runtime environment, which is the starting point for this work. It also includes the existing rough concept for the new execution system, the so-called Deque ES. The detailed design and implementation of this concept constitute the main contribution of this work and are explained in Chapters 3 and 4. Chapter 5 presents some brief performance comparisons between different scheduling strategies, including depth-first and breadth-first work-stealing. Finally, Chapter 6 concludes this thesis by summarizing our results and giving an outlook on further research possibilities.

2 Previous Work on the CES Programming Language

2.1 Language Concepts

2.1.1 Architecture Overview

First, we explore how to create an executable application from a CES source file. The CES language is an extension of the C language. For that reason and to simplify the implementation, the CES compiler only translates CES programs to C code. The intermediate C code includes numerous macro calls to the execution system, a separate C module that implements the parallel execution of the application.
The C code for the application and the execution system are compiled and linked like any C program to get the executable application. The whole compilation process is illustrated in Figure 2.1.

Figure 2.1: Compilation process for CES programs (based on [Rem08, p. 23]) – the CES Compiler translates the CES source into an intermediate C source, which the C compiler compiles and links together with the execution system into the executable application.

The execution system takes care of the parallel execution of the application. For that purpose, the application code is split up using a new implementation of subroutines, so-called tasks (see Subsection 2.1.2), which are scheduled by the execution system. That is, the application code and the execution system communicate over a task-based interface. The execution system assigns these tasks to a number of threads, which are scheduled by its environment. So the execution system interacts with its environment, usually the operating system and POSIX threads, using a thread-based interface. These interfaces and the three main components of the application architecture are presented in Figure 2.2.

Figure 2.2: Architecture of CES applications (based on [SBWR08, Fig. 1]) – the application code (application logic) talks to the execution system (task scheduling and parallel execution) over the task interface; the execution system talks to the hardware and software environment (thread scheduling, resource allocation, etc.) over the thread interface.

2.1.2 A New Model for Function Calls

As stated earlier, CES is built upon a new model for function calls, developed by Burkhard D. Steinmacher-Burow [SB00a, SB00b]. In this model, child tasks spawned by a subroutine are only executed after the parent routine finishes. Therefore, there is no such thing as a return value which is delivered to the parent. Instead, the parent hands over input, inout and output parameters as references. Results of a routine are written into inout or output parameters and may be consumed by other tasks.

Suppose, for instance, we want to calculate k = a · b + c · d. Listing 2.1 uses the new kind of function call to perform this calculation. The three types of parameters (input, inout, output) are separated by semicolons. In this case, we do not have any inout parameters.

1  mult(a,b;;m);
2  mult(c,d;;n);
3  add(m,n;;k);

Listing 2.1: Calculating k = a · b + c · d using the new function call implementation

Lines 1 and 2 perform the multiplications and store the intermediate results in variables m and n, respectively. Afterwards, the add task calculates their sum to get the final result k. Recall that the parent routine (not included in the listing) runs prior to those three children and never gets to see the intermediate or even final results. It only passes the same references to multiple tasks and thus provides for their exchange of information through variables m and n. It is the add task’s responsibility to process the intermediate results. This is not by accident, but a general property of the new subroutine concept. Since a calling routine never gets any results of child tasks, it cannot process them on its own. Instead, it always needs to spawn other children to process intermediate results.
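For contrast, the same computation in conventional C would use return values and combine the intermediate results inside the calling function itself – exactly what the CES call model rules out. The following is a plain C sketch, not CES code:

/* Conventional C: the caller receives the results of its callees
 * and combines them itself. */
int mult(int a, int b) { return a * b; }
int add(int m, int n)  { return m + n; }

int calculateK(int a, int b, int c, int d)
{
    int m = mult(a, b);   /* the caller is suspended during each call      */
    int n = mult(c, d);
    return add(m, n);     /* the caller processes the intermediate results */
}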
Notice that the two mult tasks do not depend on each other. They may be executed in any order or even in parallel on multiple processors, a choice which is to be made by the execution system. The execution system can schedule tasks arbitrarily as long as the results are equal to those of the sequential execution order as given by the application source code. To this end, it must know the dependencies between subroutines, which it might either determine on its own or derive from hints given by the programmer.

Tasks and their dependencies can be represented by a directed acyclic graph (DAG). The nodes of the graph correspond to the task instances, and an arc from node A to node B means that task B cannot start executing until task A has completed. Such a task graph is always acyclic because among multiple tasks with circular dependencies no task could ever run. Since a task graph can represent arbitrary acyclic dependencies, it is the most general representation, and one goal for execution systems to strive for is to execute arbitrary task DAGs efficiently.

Now suppose we want to define a discrete subroutine to calculate the value of k from above. The definition of Listing 2.1 is encapsulated into a task as in Listing 2.2. The task calculateK takes the four components as input and k as an output parameter.

calculateK(a,b,c,d;;k) {
  mult(a,b;;m);
  mult(c,d;;n);
  add(m,n;;k);
}

Listing 2.2: A discrete task to calculate k

Listing 2.3 shows a call to calculateK succeeded by a call to processK, a subroutine that uses the result k to do something useful.

calculateK(a,b,c,d;;k);
processK(k;;);

Listing 2.3: Using the previously defined task

Obviously, processK depends on calculateK. However, calculateK does not deliver the desired item (k) itself, but merely delegates this duty to add. Therefore, after calculateK has run, the task processK depends on add; the dependency is handed over to another task. We call this principle delegation. If we look at the DAG representation of our example, initially there are only two nodes for calculateK and processK, as illustrated in Figure 2.3 (a). When calculateK runs, its node is replaced by a sub-DAG consisting of the three nodes for the child tasks. The resulting graph is shown in Figure 2.3 (b).

Figure 2.3: The node of calculateK is unfolded into a sub-DAG – (a) the initial graph with the nodes calculateK and processK; (b) the graph after calculateK has run, with the two mult nodes and the add node feeding into processK.

2.1.3 The Original CES Syntax

Up to now, all code examples used a pseudocode closely related to the actual CES syntax. We will look at the latter in detail now. The CES Compiler translates CES source code to C code, and since CES is an extension of C, all normal C code in CES files is, roughly speaking, copied to the corresponding C files. In order to easily find those parts that the CES compiler must really care about, the CES syntax elements are delimited by dollar signs. To illustrate this, we translate our previous example to valid CES syntax in Listing 2.4.

1  $calculateK(int a,int b,int c,int d;;int k){
2    $mult(a,b;;int m);$
3    $mult(c,d;;int n);$
4    $add(m,n;;k);$
5  }$

Listing 2.4: The task calculateK in CES syntax

The whole definition of the task is enclosed in dollar signs, likewise the individual calls to other CES tasks. As with normal C functions, the body of a subroutine is enclosed in curly braces. It may contain normal C code and special CES features, like the task calls in Listing 2.4. The parameter definition list (Listing 2.4, line 1) is separated by semicolons into three parts for input, inout and output parameters. In this example, the second part is empty. Input parameters may only be read, input-output parameters may be read and written, whereas output parameters may only be written. Multiple parameters of the same kind are separated by commas.
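The examples so far use only input and output parameters. As a hypothetical illustration of the middle part of the list – not one of the thesis’ own examples – a task that adds a value to a running total could declare the total as an inout parameter:

$accumulate(int delta;int total;){
    /* total is an inout parameter: it is read and written in place */
    $total$ = $total$ + $delta$;
}$

/* a call from some parent task, with existing CES variables x and sum: */
$accumulate(x;sum;);$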
Since CES is a C derivative, all CES variables have a normal C type, which is specified in front of the parameter name. For ease of implementation, the type name must consist of a single identifier. Hence, to use compound types like unsigned char, declare an alias with typedef as illustrated in Listing 2.5. The general syntax for CES task definitions is given in Listing 2.6; brackets indicate optional parts.

typedef unsigned char uchar;
typedef int * int_ptr;

Listing 2.5: Using compound types with typedef

$<task name>([<definition of input parameters>];
             [<definition of inout parameters>];
             [<definition of output parameters>]){
  ...
}$

Listing 2.6: Definition of a CES task

Next to the definition of calculateK, Listing 2.4 also includes multiple CES subroutine calls, the abstract syntax for which is given in Listing 2.7.

$[parallel] <task name>([<list of input parameters>];
                        [<list of inout parameters>];
                        [<list of output parameters>]);$

Listing 2.7: Call to a CES task

The optional preceding keyword parallel is a hint for the execution system that this call can run in parallel, i.e. it does not depend on other tasks that have been called earlier in this task. The parameters in the call appear in the same order as in the definition. Whether their type must be given depends on the kind of variable: there are two kinds, normal C variables and CES variables. CES variables are passed by reference and their lifetime usually extends beyond the lifetime of a task. Their storage space is managed by the execution system. The input of CES tasks, no matter whether input, inout or output variable, consists of CES variables.

When we call a CES subroutine with a CES variable as a parameter, we do not need a type specifier, since the compiler already knows about that variable. If we pass a C variable as a parameter, a CES variable is created and the C variable is used to initialize it. For that purpose, the type of the variable must be specified. Output parameters need no initialization, because they are never read in the called task. Therefore we can create a new CES variable during the call without explicit initialization. This happens when we specify as an output parameter the type and name of a variable that does not exist yet. This newly created CES variable can be used as an input or inout parameter to succeeding tasks, just like any other CES variable. For ease of implementation, a CES variable that has been created as an output parameter can currently not be used as an output parameter of a subsequent task in the same parent. Instead, one can pass it as an inout parameter, no matter whether it is ever read or not.

When a task passes on its arguments to a child task, it must respect their parameter types. Pure input parameters may not be passed on as inout or output parameters, whereas the opposite direction is possible. In fact, inout and output parameters can be passed on as any parameter type. The advantage of using output parameters is that they do not need any initialization, as mentioned above. If we want to access a CES variable within a task, we simply enclose the variable name in dollar signs to differentiate it from normal C variables. Remember that these variables are passed by reference. The rules for type specifiers in calls are illustrated in the short sketch below.
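A minimal sketch of these call-site rules, using the hypothetical tasks produce and consume (which are not part of the thesis’ examples):

int c_var = 42;                         /* a normal C variable */

/* A C variable passed as input: its type must be given so that a CES variable
   can be created and initialized from it. out_item is a new CES output
   variable, created without initialization. */
$produce(int c_var;;double out_item);$

/* An existing CES variable needs no type specifier. */
$consume(out_item;;);$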
Finally, one note about the interaction of C and CES subroutines: C functions can be called from within CES tasks. These C functions are executed synchronously as in ordinary C, i.e. the caller gets a return value, which can be processed immediately. What is not possible, however, is to call CES tasks from within C functions. This implies that the initial function of a CES program must be a CES task. This special task is called program; its signature is given in Listing 2.8.

typedef char** argv_t;
$program(int argc, argv_t argv;;);$

Listing 2.8: Signature of the program task

To illustrate the features explained, we present the definition of a CES task that calculates the Fibonacci numbers in Listing 2.9. For a given number n, the input parameter, we calculate the nth Fibonacci number as a result.

1   /**
2    * Recursive computation of the nth Fibonacci number.
3    *
4    * @param[in] n the Fibonacci number to compute.
5    * @param[out] result the nth Fibonacci number.
6    */
7   $fibonacci(uint32_t n;;uint64_t result){
8     if ($n$ <= 1)
9       $result$ = $n$;
10    else {
11      uint32_t n1 = $n$ - 1;
12      uint32_t n2 = $n$ - 2;
13
14      $parallel fibonacci(uint32_t n1;;uint64_t fib1);$
15      $parallel fibonacci(uint32_t n2;;uint64_t fib2);$
16      $add_uint64(fib1, fib2;;result);$
17    }
18  }$

Listing 2.9: Task to calculate the Fibonacci numbers recursively

The recursive algorithm is well known: line 9 represents the base case, lines 11 to 16 the general case. Now we want to put emphasis on the CES syntax. Line 7 starts the task definition, with n as an input and result as an output parameter. In lines 8 and 9, the CES variables n and result are locally accessed; therefore, they are enclosed in dollar signs. Lines 11 and 12 declare local C variables and calculate their values in terms of n, which is again accessed as a CES variable. Lines 14 and 15 spawn the two child tasks to calculate the previous Fibonacci values. Both calls are enclosed in dollar signs and marked by the parallel keyword. The latter is possible because the calls do not depend on each other. In contrast, add_uint64 in line 16 processes the output of the calls to fibonacci and is therefore not spawned in parallel.

Now look at the use of type specifiers in task calls. Since n1 and n2 are C variables, their types are given to create and access corresponding CES variables. The calls to fibonacci each declare a new CES variable fib* of type uint64_t. When these variables are accessed again in line 16, their type is not needed, as with any ordinary CES variable. The output parameter of the parent fibonacci task, result, is a CES variable and passed by reference. Thus, add_uint64 can directly write into it without specifying a type.

The syntax described in this subsection originates in the first implementation of CES, the CES Compiler (CESC) and a hard-coded Sequential Execution System by Sven Wagner [Wag07]. When Jens Remus generalized the ES interface and added the Round-robin and Stack Execution Systems, he kept this syntax [Rem08]. The new Deque ES presented in this thesis keeps all syntax elements shown here and adds some more, which will be explained later. Before we introduce the new Deque Execution System in Chapters 3 and 4, we give an overview of the previous execution systems in the next section.

2.2 Previous Execution Systems

As shown in Figure 1.1, before this work started there were three different Execution Systems (ES), developed by Jens Remus [Rem08]: Sequential ES, Round-robin ES and Stack ES. The Sequential Execution System is based on a hard-coded version by Sven Wagner [Wag07] and can execute CES code sequentially on a single processor.
The Round-robin ES executes the work of multiple threads in a round-robin fashion, thereby avoiding synchronization problems but already introducing the program structure for multiple threads. Finally, the Stack ES executes tasks using several processors and constitutes the basis for the Deque Execution System. Since the Sequential ES is not interesting for parallel computations and the Round-robin Execution System served merely as an intermediate step toward parallelization, we will only describe the Stack ES here. Furthermore, we concentrate on the main ideas, those that are relevant for the new Deque ES, which constitutes the main effort of this thesis.

2.2.1 Tasks in the Stack Execution System

The CES Stack ES is responsible for scheduling and dispatching tasks. For that reason, there is the major data structure TASK_FRAME, which contains all the information concerning a task instance, most importantly pointers to the calling parameters and a function pointer to the C function implementing the task execution. This TASK_FRAME is a general interface for various kinds of tasks and therefore the parameters are untyped (void *) and the function pointer takes a generic argument. The execution system will cast this generic task frame into a version with typed parameters (e.g. int *) used during the task’s execution. From a CES source file, the CES compiler (CESC) creates C header files containing the typed task frame definitions. Additionally, it generates a C code file with a C function for each CES task. For each CES construct in the CES task, the equivalent C function contains macro calls to the execution system. For example, there are macros to initialize and finalize the task, to call a new task, to access CES variables and to create storage for new ones. A complete overview of the macro interface is given in [Rem08, Appendix A].

Listing 2.10 shows the usage of these macros as part of an example, the core of the Fibonacci C routine output by the CES compiler for the CES task in Listing 2.9.

1   /**
2    * Recursive computation of the nth Fibonacci number.
3    *
4    * @param[in] n the Fibonacci number to compute.
5    * @param[out] result the nth Fibonacci number.
6    */
7   void fibonacci(RUNTIME_TASK_FUNCTION_ARGUMENTS) {
8     RUNTIME_TASK_INITIALIZE(fibonacci);
9
10    if (RUNTIME_TASK_PARAMIN(1) <= 1)
11      RUNTIME_TASK_PARAMOUT(1) = RUNTIME_TASK_PARAMIN(1);
12    else {
13      uint32_t n1 = RUNTIME_TASK_PARAMIN(1) - 1;
14      uint32_t n2 = RUNTIME_TASK_PARAMIN(1) - 2;
15
16      /* PUSH STORAGE FOR C VARIABLE 'n1' TO FRAME STACK */
17      RUNTIME_CREATE_STORAGE_CVAR(n1, uint32_t, n1);
18
19      /* PUSH STORAGE FOR OUTPUT ARGUMENT 'fib1' TO FRAME STACK */
20      RUNTIME_CREATE_STORAGE_OUTPUT(fib1, uint64_t);
21
22      /* PUSH TASK 'fibonacci' TO CURRENT STACK */
23      RUNTIME_CREATE_TASK(fibonacci, 0, 1, 1, 0, 1);
24      RUNTIME_NEWTASK_PARAMIN_REFERENCE(fibonacci, 1) = RUNTIME_STORAGE_REFERENCE(n1);
25      RUNTIME_NEWTASK_PARAMOUT_REFERENCE(fibonacci, 1) = RUNTIME_STORAGE_REFERENCE(fib1);
26
27      /* PUSH STORAGE FOR C VARIABLE 'n2' TO FRAME STACK */
28      RUNTIME_CREATE_STORAGE_CVAR(n2, uint32_t, n2);
29
30      /* PUSH STORAGE FOR OUTPUT ARGUMENT 'fib2' TO FRAME STACK */
31      RUNTIME_CREATE_STORAGE_OUTPUT(fib2, uint64_t);
32
33      /* PUSH TASK 'fibonacci' TO CURRENT STACK */
34      RUNTIME_CREATE_TASK(fibonacci, 0, 1, 1, 0, 1);
35      RUNTIME_NEWTASK_PARAMIN_REFERENCE(fibonacci, 1) = RUNTIME_STORAGE_REFERENCE(n2);
36      RUNTIME_NEWTASK_PARAMOUT_REFERENCE(fibonacci, 1) = RUNTIME_STORAGE_REFERENCE(fib2);
37
38      /* PUSH TASK 'add_uint64' TO CURRENT STACK */
39      RUNTIME_CREATE_TASK(add_uint64, 0, 0, 2, 0, 1);
40      RUNTIME_NEWTASK_PARAMIN_REFERENCE(add_uint64, 1) = RUNTIME_STORAGE_REFERENCE(fib1);
41      RUNTIME_NEWTASK_PARAMIN_REFERENCE(add_uint64, 2) = RUNTIME_STORAGE_REFERENCE(fib2);
42      RUNTIME_NEWTASK_PARAMOUT_REFERENCE(add_uint64, 1) = RUNTIME_TASK_PARAMOUT_REFERENCE(1);
43    }
44
45    /* COPY CURRENT STACK TO FRAME STACK */
46    RUNTIME_TASK_FINALIZE(fibonacci);
47  }

Listing 2.10: CES compiler generated C code for the fibonacci task of Listing 2.9

Line 7 starts the C function fibonacci with RUNTIME_TASK_FUNCTION_ARGUMENTS as its macro parameter. This macro usually expands to the major data structures of the execution system, data structures that will be accessed by other macros throughout the function. We will explain shortly how they appear in the Stack ES. The first macro within the function is RUNTIME_TASK_INITIALIZE, which initializes the task; depending on the execution system, this macro can serve very different purposes as well. Access to CES variables is translated to the macros RUNTIME_TASK_PARAMIN and RUNTIME_STORAGE_REFERENCE, depending on where exactly the variable was defined. These macros return the correct variables or their pointers, respectively. Notice that the names of CES parameters do not occur in the C code; only the CES compiler knows about them. In C, these parameters are identified just by their kind (in/inout/out) and offset. In line 17 the new CES variable n1 is created and initialized using its C equivalent; line 20 generates a new output variable with no initialization. Both of them are passed to the fibonacci task that is created next. The seemingly magic numbers in line 23 represent, among other things, the parallel flag and the number of input, inout and output parameters. Afterwards, the parameters for the new task are initialized by storing references available in the current task. The comments in the CESC output refer to the Current Stack and Frame Stack, two major data structures of the Stack Execution System, which we will explain now.
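Before moving on, the generic task frame described at the beginning of this subsection can be made more tangible with a minimal C sketch. The field names and the ces_task_fn type below are hypothetical; the actual TASK_FRAME layout of the Stack ES differs in detail.

#include <stdint.h>

/* Hypothetical sketch of a generic task frame, not the actual CES definition. */
typedef void (*ces_task_fn)(void *generic_frame);

typedef struct TASK_FRAME_SKETCH {
    ces_task_fn task_function;   /* C function implementing the task        */
    uint32_t    parallel;        /* set if the call was marked 'parallel'   */
    uint32_t    num_in;          /* number of input parameters              */
    uint32_t    num_inout;       /* number of inout parameters              */
    uint32_t    num_out;         /* number of output parameters             */
    void      **params;          /* untyped pointers to the calling
                                    parameters; a typed view is obtained
                                    by casting                              */
} TASK_FRAME_SKETCH;

A typed frame generated from the CES source would replace the untyped params array with concretely typed pointer fields (e.g. uint32_t * for the fibonacci input), which is what the cast mentioned above provides.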
2.2.2 Data Structures and Their Implications

The Stack Execution System manages all tasks which have been spawned but not yet executed on the Frame Stack. This data structure is accessed by its owner thread like a stack, but other threads may freely search through it, so it is actually a pseudo-stack. Initially, the Frame Stack of the first thread is seeded with the program task. Afterwards, a thread continuously takes the top item from its stack and executes the task, as visible in Figure 2.4. Part (a) shows how spawned child tasks are collected through repeated push operations to the Current Stack, which is empty at the start of a task. As depicted in (b), before a task finishes, the Current Stack is moved to the top of the Frame Stack. Since the memory layouts of the Current and Frame Stack are opposite, the first-spawned child task is then on top of the Frame Stack and thus the next task to be executed (Figure 2.4 (c)). This way the correct execution order for one thread is guaranteed.

Figure 2.4: Interplay of Frame Stack and Current Stack – (a) during execution of a task, spawned children are pushed to the Current Stack; (b) before the task finishes, the Current Stack is moved to the top of the Frame Stack; (c) the first-spawned child is now the next task taken from the Frame Stack.

But what about the other threads and parallelization? Once a thread’s Frame Stack is empty, it has no more work to execute from its own data structures. This is also the case at the beginning of the execution for all threads except the first one. In this state, the thread steals a task from another thread’s stack. The principle of distributed data structures for holding tasks and stealing from other threads once the own structure is empty is known as work-stealing [BL94, ABB00]. Of course, a thread cannot steal an arbitrary task since the task might have outstanding dependencies. Here, the parallel keyword comes into play. Tasks marked parallel get a flag in the Frame Stack. These tasks have by definition all of their dependencies satisfied: they do not depend on any other child spawned before them within the same parent, and their parent had all of its dependencies fulfilled since it has already executed. So a task marked parallel can be stolen by another thread and execute immediately. The foreign thread starts its search for parallel tasks at the bottom of the Frame Stack, a concept called breadth-first work-stealing. Since the stack’s owning thread executes from the top, the threads tend to work on distant parts of the code. Furthermore, tasks at the bottom tend to be higher in the call hierarchy and thus contain child tasks themselves. Since child tasks of stolen tasks are pushed to the Frame Stack of the stealing thread, breadth-first stealing leads to fewer steals and thus less overhead.

Of course, stealing from other threads, preventing them from executing a stolen task and notifying them of the finished execution requires considerable synchronization effort. The search through foreign stacks takes some time and disrupts the usual stack semantics. On the other hand, using the stack to keep all tasks, whether they are ready to execute or have outstanding dependencies, has some advantages in terms of simplicity. First, the stack preserves the originally specified order of succeeding tasks. Combined with the parallel keyword to explicitly mark tasks that can run in parallel, this provides for the correct execution order of tasks and ensures that all dependencies are satisfied without any additional effort. The second huge advantage is memory management for CES variables, which is performed by the execution system.
In the CES model for function calls, data is only passed down to child tasks, not returned to parents. When the stack level falls below that of a certain task, all its children and grandchildren have completed. Therefore, all CES variables created by a task can be freed once the stack goes below that task’s level. For that reason, it is convenient to also put CES variables on the Frame Stack. This is done through a so-called storage frame, which has the same structure as a task frame but contains a data item (a CES variable). Figure 2.5 provides an example of the stack development. The next task to be executed is printed in bold. Initially, Task 1 is scheduled to execute. A task pushes newly created data items onto the Frame Stack before any child tasks, as in Figure 2.5 (b). The two child tasks, Task 1.1 and Task 1.2, and their potential children will probably use the data items to perform their work. Once all children have finished (Figure 2.5 (d)), the execution system looks for more tasks on the Frame Stack below their stack level. It will reach the data items and just pop them off the stack until it finds a task to execute next, in this case Task 2.

Figure 2.5: Development of the Frame Stack and automatic removal of data items – in the snapshots (a) to (e), Task 1 pushes Data Items 1 and 2 and the child tasks Task 1.1 and Task 1.2 above Task 2 and Task 3; after the children have finished, the data items are popped off until Task 2 becomes the next task.

2.2.3 Relationship to Cilk and the Deque Execution System

From a user’s point of view, the Stack Execution System is conceptually similar to MIT Cilk. “Both systems serve the same divide-and-conquer-style applications on shared-memory multiprocessor computers.” [SBWR08, p. 3] The parallel keyword is comparable to Cilk’s spawn [FLR98], where the parent routine continues to execute while the child may be scheduled to other processors. However, there are also significant differences between the Stack ES and MIT Cilk. For instance, due to the new function call, CES processes the results of child tasks through other child tasks, whereas Cilk has an explicit sync statement.

Some of the concepts used in the Stack Execution System will be part of the Deque ES as well. The CES implementation of a function call remains the same. The Deque Execution System defines all macros presented here, although they partly serve a different purpose and some additional macros will be needed. The task frame concept is kept as the main identifier for a work packet. Multiple data structures to keep the tasks, usually one per thread, are present in the Deque ES as well, but the nature of the structure changes quite fundamentally. Finally, the memory management is completely different.

2.3 Deque ES Concept

The main idea for this thesis is an existing concept for a new CES execution system, the Deque ES. Its name derives from the major data structure, a double-ended queue. In contrast to the pseudo-stack of the Stack Execution System, the deque is only accessed with correct semantics; that is, the deque only permits put and take operations at its top and bottom. For that reason, foreign threads stealing tasks cannot search through the data structure anymore; they must get a valid task with a normal deque operation. Hence, the deque only holds tasks that are ready to be executed, and all take operations yield a task that can be stolen. Since the owning thread keeps pushing new tasks to the top, stealing from the bottom or top results in breadth-first or depth-first work-stealing, respectively. In order to avoid copying task frames onto the deque once a task has all of its dependencies fulfilled, the deque just stores pointers to task frames.
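A minimal sketch of the deque interface this concept relies on is shown below. The function names are hypothetical and do not correspond to Manuel Metzmann’s actual implementation; the point is that the structure stores only pointers to ready task frames and is accessed exclusively at its two ends.

typedef struct TASK_FRAME TASK_FRAME;    /* defined by the execution system */

/* Hypothetical interface of a concurrent deque holding task frame pointers. */
typedef struct task_deque task_deque;

void        deque_push_top(task_deque *d, TASK_FRAME *ready_task); /* owner only  */
TASK_FRAME *deque_pop_top(task_deque *d);     /* owner; also depth-first stealing */
TASK_FRAME *deque_pop_bottom(task_deque *d);  /* breadth-first stealing           */
/* All three operations must be safe under concurrent access;
   the pop operations return NULL if the deque is empty. */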
The implications of only allowing ready-to-execute tasks on the deque are quite extensive. Of course, not all spawned tasks are immediately ready. Still, they must be kept in memory, and once their dependencies are fulfilled, they must be pushed onto the deque. But how do we know when those dependencies are fulfilled? In the Stack ES, the stack implicitly satisfied the dependencies, but this was only possible by holding all tasks within the stack. In the Deque Execution System, we must analyze these dependencies and track their fulfillment explicitly. This could lead to some serious overhead, where there is almost none in the Stack Execution System. However, the Deque Execution System with dependency analysis can potentially exploit more parallelism than the Stack ES. The latter relied on the coarse specification through the parallel keyword. Similarly to Cilk, the parallel keyword only allows a binary decision: either the child task can run at once or it must wait for all the previously spawned child tasks. This model is appropriate for divide-and-conquer algorithms, but not very good at providing parallel execution of multiple direct child tasks with complex interdependencies. In contrast, a full dependency analysis enables the parallel execution of arbitrary task DAGs. As an example, we present the complex task graph and corresponding CES program for a Cholesky decomposition in Subsection 3.8.3.

The deque data structure is at the heart of the new Deque ES and is accessed by multiple threads. Accordingly, the implementation should be thread-safe, but still as fast as possible. Manuel Metzmann implemented several data structures that can be accessed concurrently, among them a stack, a queue and a deque, the latter of which we will use for the execution system [Met09]. These data structures are optimized for Blue Gene/Q and therefore concurrent access to them is enormously fast on this platform. Conveniently, there is also a (slower) x86 implementation, which can be used to easily test the new execution system.

3 Design of the Deque Execution System

Based on the work presented in Chapter 2, we designed and implemented the Deque Execution System for CES, which we will explain in detail in this and the following chapter. Section 3.1 is a broad overview of the new ES design, whereas the rest of this chapter provides a more in-depth description. Details on the implementation of this design will be given in Chapter 4.

3.1 Overview

The major responsibility of the execution system is to keep track of the tasks in the system. As indicated in Section 2.3, the deque as the main management structure only holds pointers to tasks which are ready to be executed. Therefore, major design issues include where in memory the tasks are located and how to keep track of those tasks that do not have all of their dependencies fulfilled yet. Since there are usually multiple tasks ready to be executed and the execution schedule among those tasks depends on non-deterministic factors like execution speed and work-stealing, the execution system cannot predict a fixed order in which the tasks will run.
And as a task frame’s storage space can be released as soon as the task has completed, the order for freeing task frames also depends on these run-time factors. For that reason, managing the actual task frames in a fixed structure like a stack is not advisable. The alternative we chose was to put each task frame in a separate location on the heap and to allocate and free its space explicitly at the appropriate times.

Once the tasks are ready to be executed, their pointers are on the deque of a certain thread. All other tasks have unfulfilled dependencies. A so-called condition task fulfills a dependency of a so-called dependent task. Once its condition tasks have finished, the dependent task’s pointer must be pushed onto a deque. A straightforward way to realize that is to keep the pointer to the dependent task in the task frame of the condition task. Once the condition task has finished, the execution system will check the readiness of all depending tasks and put them on the deque if necessary. That is, a task which is not currently executed can be in one of two states: either it is a “ready task” on a deque or it has outstanding condition tasks holding its pointer. In the dependency graph, the ready tasks are the sources (the nodes with in-degree zero), and any other task is reachable through a path starting at one of them and is therefore accessible although no central data structure knows about it. An example graph is given in Figure 3.1.

Figure 3.1: A dependency graph with multiple tasks – source nodes have their task pointers on the deque; the pointers of dependent nodes are only known to their condition tasks.

When a condition task informs its depending tasks of the delivered data item, it must determine whether those tasks are ready to be executed. They are when there are no other outstanding dependencies. We track the number of unsatisfied dependencies in the task frame using a counter variable. When the task is spawned, this number is initialized to the number of parameters that must be accessed by other tasks before the task can run. Once a condition task delivers a needed parameter, the counter is decreased. Should the number of unsatisfied dependencies fall to zero in doing so, the dependent task’s pointer is put onto the deque of the current thread. A sketch of this mechanism is given below.
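The following is a minimal C sketch of this counting scheme under the design just described. The field and function names are hypothetical, not the actual Deque ES code, the dependent list is bounded for brevity, and the synchronization covered in Section 3.5 is omitted.

#define MAX_DEPENDENT_TASKS 16   /* arbitrary bound for this sketch */

typedef struct TASK_FRAME {
    int  unsatisfied_deps;       /* parameters still to be delivered        */
    int  num_dependents;         /* dependent tasks registered as callbacks */
    struct TASK_FRAME *dependents[MAX_DEPENDENT_TASKS];
    /* ... parameters, task function, etc. ... */
} TASK_FRAME;

/* Push onto the current thread's deque (the deque argument is omitted here). */
void deque_push_top(TASK_FRAME *ready_task);

/* Callback registration during dependency analysis:
 * the dependent task now waits for one more parameter. */
static void register_callback(TASK_FRAME *condition, TASK_FRAME *dependent)
{
    condition->dependents[condition->num_dependents++] = dependent;
    dependent->unsatisfied_deps++;
}

/* Called when a condition task has finished and thus delivered its data items. */
static void notify_dependents(TASK_FRAME *condition)
{
    for (int i = 0; i < condition->num_dependents; i++) {
        TASK_FRAME *dep = condition->dependents[i];
        if (--dep->unsatisfied_deps == 0)
            deque_push_top(dep);   /* now ready: becomes a source of the graph */
    }
}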
Task B must read the value A has written before task C overwrites it, a dependency called Write After Read (WAR). Finally, Write After Write (WAW) dependencies occur when a variable is written twice without any intermediate read operation. It is essential for subsequent reads that the last write operation is the one specified last in the program code. Therefore, the order of two succeeding write tasks must be preserved.

The only type of true dependencies, though, is RAW, because WAR and WAW dependencies can be eliminated through register renaming [SS95]. This technique stores copies of the specified variables to allow for their original values being overwritten immediately. Succeeding reads access the copy instead of the original; a "renamed register" is looked up. An example of a system that uses register renaming is SMP Superscalar, another programming model performing dependency analysis [PBL07]. As the current CES implementation does not use register renaming, we must take care of all mentioned types of dependencies.

3.2.2 The Dependency Analysis Table

The dependency graph we want to build consists of nodes representing tasks and arcs for the dependencies between them. Each arc is associated with a data item that constitutes the data dependency between the two involved tasks. The graph is dynamically created at run time. In the following, we use the terms input task, inout task and output task for tasks which have the considered variable as an input, inout or output parameter, respectively. Input and inout tasks are also referred to as readers, inout and output tasks as writers.

When a new task is called by the program, we must insert its node into the dependency graph. We will look at the information required in order to do that. When a task reads a parameter, it depends on the last subroutine that wrote the parameter, because that subroutine delivers the desired value (RAW dependency). When a task writes a parameter, it must wait for all tasks that are interested in the old value (WAR dependency). If no other task reads the old value, it must wait for the previous writer to enforce the WAW dependency. All in all, as already stated by the authors of SMP Superscalar, "only the last writer and the list of readers of the last definition are required." [PBL10]

The Dependency Analysis Table (DAT) is a data structure holding exactly those pieces of information. Importantly, it is only used locally within a task and helps analyze the dependencies of child tasks. For each CES variable in the current task, the DAT provides access to the writing child task that was called last and all subsequent readers. Indeed, the whole dependency analysis regards the calling of tasks, not their actual execution. Hence, the "last writer" is the last called, not the last executed task accessing the variable. On a lower level, the DAT is built up as follows. It implements a map or dictionary interface, delivering for each local CES variable a pointer to the frame of the task that last wrote it. The task frame of the last writer constitutes the head of a linked list, all remaining nodes of which are subsequent input tasks. If there is no last writer, e. g. because the current task directly passes on one of its parameters to an input task, this input task is the head of the list. We will now explain how the DAT is used to perform the dependency analysis.
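To make the shape of this structure concrete, the following C sketch shows one possible layout of a DAT record together with its lookup. The names (task_frame, dat_entry, dat_find) and the split into separate last-writer and reader fields are our own illustration and not taken from the actual CES sources, which keep the last writer and its readers in one linked list and whose real task frame is shown in Listing 4.1.

#include <stddef.h>

struct task_frame;  /* child task frame; the real TASK_FRAME is shown in Listing 4.1 */

/* One DAT record per local CES variable.  The actual DAT chains the last
 * writer and its subsequent readers into one linked list headed by the
 * writer; this sketch keeps two explicit fields for readability. */
struct dat_entry {
    void              *data_item;    /* address of the CES variable (the map key)        */
    struct task_frame *last_writer;  /* child task called last with it as inout/output   */
    struct task_frame *first_reader; /* first input task called since that write         */
    int                available;    /* set for parameters of the currently running task */
    struct dat_entry  *next;         /* naive map implementation: linked list of records */
};

/* Dictionary lookup by data item address, O(n) in this naive variant
 * (Subsection 4.6.1 explains why and how this was later replaced). */
static struct dat_entry *dat_find(struct dat_entry *dat, void *data_item)
{
    for (struct dat_entry *e = dat; e != NULL; e = e->next)
        if (e->data_item == data_item)
            return e;
    return NULL;
}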
3.2.3 The Dependency Analysis Algorithm

We already outlined that a task delivering a data item holds pointers to dependent tasks. When it finishes, the dependent tasks are notified about the delivered data item and possibly put on a deque. As these pointers are part of the dependency graph, they must be installed during the dependency analysis. Since such a pointer serves to find the dependent task later on, to possibly put it on the deque and enable its execution, we call the process of installing the pointer callback registration.

The execution system performs the dependency analysis at the end of a task. In the task execution before, all calls to new tasks have saved pointers to the new task frames on the so-called Current Child List. This is similar to the old Stack Execution System's Current Stack, but the old version saved the actual data, whereas we just store pointers, as our task frames are on the heap. At the beginning of the analysis, the parent task's parameters are marked as available in the Dependency Analysis Table; they may be read or written immediately, depending on their type. After all, the current task is executing at the moment and can therefore access its parameters as specified.

The execution system now loops over the Current Child List in the original calling order and analyzes the dependencies. When we reach a new child task, all its parameters are looked up in the Dependency Analysis Table. The DAT delivers the linked list with the last writer and subsequent readers. If the new task is an input task, it registers a callback with the last writer. If, however, it is an inout or output task, it depends on all last readers (WAR dependency). Conceptually, the new task registers callbacks with all of them. In fact, the process is slightly more complicated for implementation reasons, which will be explained later. Provided that there are no last readers, the new inout or output task depends on the last writer as well (RAW/WAW) and registers a callback. In any case, each registered callback increases the new task's counter for its unsatisfied dependencies by one (see Section 3.1). Parameters marked available in the DAT do not contribute a dependency: no callback is registered and the number of unsatisfied dependencies stays the same. Finally, the new task itself is recorded in the Dependency Analysis Table. Input tasks are appended to the list; writers supersede the record of the previous writer and its subsequent input tasks.

Figure 3.2 visualizes a common case. Writer 1 delivers a data item, which is consumed by Input Tasks 1 through 3. Afterwards, Writer 2 has the same data item either as an inout or as an output parameter. In any case, it may only run after all input tasks have finished. Solid arrows represent the direct task dependencies for this data item, whereas dashed arrows show the DAT pointer structure before Writer 2 is added to the dependency graph. In order to differentiate complete DAGs showing all dependencies between multiple tasks as in Figure 3.1 from pictures illustrating the registered callbacks and dependencies for just one parameter as in Figure 3.2, we represent tasks in the former context as circles and in the latter context as rectangles.
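Building on the sketch from Subsection 3.2.2, the per-parameter decision rules of the analysis can be summarized as follows. The helper names (analyze_parameter, register_callback) are illustrative, the counter field corresponds to the unsatisfied-dependency counter from Section 3.1, and the list handling is simplified (a single callback slot, readers prepended instead of appended); the actual mechanics are described in Chapter 4.

enum param_kind { PARAM_IN, PARAM_INOUT, PARAM_OUT };

/* Minimal task frame for this sketch: a dependency counter plus two links. */
struct task_frame {
    unsigned           unsatisfiedDependencies;
    struct task_frame *to_be_notified;  /* dependent task to wake up later            */
    struct task_frame *next_reader;     /* link in the list of subsequent input tasks */
};

/* "child must wait for cond": cond will notify child once it has finished, so
 * child gains one unsatisfied dependency.  Simplification: one slot only; the
 * real frame keeps one slot per parameter and chains readers into a list. */
static void register_callback(struct task_frame *cond, struct task_frame *child)
{
    cond->to_be_notified = child;
    child->unsatisfiedDependencies += 1;
}

/* Decision rules for one parameter of a newly called child task. */
static void analyze_parameter(struct dat_entry *e, struct task_frame *child,
                              enum param_kind kind)
{
    if (!e->available) {
        if (kind == PARAM_IN) {
            if (e->last_writer)                        /* RAW: wait for the last writer      */
                register_callback(e->last_writer, child);
        } else if (e->first_reader) {                  /* WAR: wait for all previous readers */
            for (struct task_frame *r = e->first_reader; r; r = r->next_reader)
                register_callback(r, child);
        } else if (e->last_writer) {                   /* RAW/WAW: wait for the last writer  */
            register_callback(e->last_writer, child);
        }
    }
    /* finally, record the child itself in the DAT */
    if (kind == PARAM_IN) {
        child->next_reader = e->first_reader;          /* simplified: prepend, not append    */
        e->first_reader = child;
    } else {
        e->last_writer  = child;                       /* a writer supersedes the previous   */
        e->first_reader = NULL;                        /* writer and all of its readers      */
    }
    e->available = 0;   /* later children now depend on this child, not on the parent */
}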
Figure 3.2: Data dependencies and the DAT pointer structure of several writing and reading tasks for a single data item

3.3 Notification of Dependent Tasks

When a task finishes, it must inform all dependent tasks about the fulfillment of their dependencies, a process we call notification. These dependencies are either data items it produces or a completed read operation on a data item which will be written by the subsequent task. However, the condition task itself is not necessarily the one that actually uses or produces the data item; this might be done by a child task. In this case, the dependent task must also wait for the child task to finish. We already illustrated this process in Figure 2.3. Here, we will describe how it is performed on a simplified, conceptual level, neglecting implementation details until Section 4.2.

After the user-defined code of a task has executed and the dependencies have been analyzed, the execution system goes through all parameters of the task again. It thereby notifies dependent tasks and passes on dependencies from the parent to the child tasks that actually access the respective parameter. In that process, we must distinguish input parameters from inout and output parameters.

Figure 3.3 shows what happens for a pure input parameter and only considers the dependencies of this specific data item, which is read by Input Tasks 1 through 3. The corresponding previous writer task has already executed, so Input Tasks 1 through 3 form the subsequent input tasks that are now ready to run, unless they depend on another parameter.

Figure 3.3: Input Task 3 notifying its dependent task Writer and integrating its children into the task graph

The user-defined code of Input Task 3 and the following dependency analysis have just finished and created three child tasks accessing the data item at hand. As the original parameter was an input parameter, these tasks must be input tasks. They already form a linked list in the Dependency Analysis Table (see Subsection 3.2.2), but are not yet connected to any other tasks outside Input Task 3. Now the execution system traverses this list and registers a callback from each of them to the following Writer task. The Writer's unsatisfied dependency counter is increased for each of the input tasks, since they are now additional dependencies of Writer. Afterwards, this number is decremented by one because the condition task Input Task 3 itself has finished. This order is important as the task is put on the deque when the number of unsatisfied dependencies drops to zero. If there are no child tasks, no new callback is registered and the number of unsatisfied dependencies effectively decreases by one.

For inout and output parameters, the behavior is different. Both types are treated equally here, because it is only important that they write to the parameter. The most general case, with both input tasks and writers as children, is illustrated in Figure 3.4. Writer 1 has just finished its execution and dependency analysis, and the execution system resolves one of its inout or output parameters. The DAT of Writer 1 provides access to the relevant children, the last writer and all subsequent input tasks, through a linked list (dashed arrows).
Since the input children depend on the value written by Last Writer Child, they must run before Writer 2, so appropriate callbacks are registered. Moreover, the last writing action of Writer 1 is actually performed by Last Writer Child, so all input tasks following Writer 1 depend on Last Writer Child. Conceptually, corresponding callbacks are registered; we will explain what actually happens in Section 4.2. Again, for each new callback, the number of unsatisfied dependencies is increased, and afterwards the counter for all tasks depending on Writer 1 is decreased by one.

Figure 3.4: Writer 1 notifying dependent tasks and integrating its children into the task graph

There are some special cases: when there are only input children for an inout parameter, the outer Input Tasks 1 through 3 do not get a new dependency. Instead, their number of unsatisfied dependencies will effectively be decremented and they will be put on the deque if it reaches zero. These are even the only actions taken if there are no child tasks at all, because Writer 2 does not get new dependencies in this case.

After handling all the parameters of a completed task, the execution system checks the number of unsatisfied dependencies for all child tasks in the Current Child List. If this counter is zero for a task, it gets pushed onto the deque. This procedure cannot be performed until all parameters have been handled, because tasks on the deque might be scheduled to run, complete their work and try to notify their dependent tasks. These tasks, however, are only registered in the parameter handling phase we described in this section.

3.4 Scheduling and Work-Stealing

Work-stealing is the basic technique used to distribute tasks to multiple threads of execution in CES. Normally, we have one software thread per hardware thread. Each thread has its own deque to store tasks on, so that different threads only rarely interfere with each other. A thread performs a depth-first execution of its deque, i. e. it puts newly created tasks on top and fetches tasks to execute from the top as well. Therefore, when executing tasks from the local deque, successive tasks tend to operate on similar data and benefit from caching mechanisms. Furthermore, depth-first execution prefers the completion of one top-level task over starting the execution of further top-level tasks and thus reduces the number of tasks in the system.

When the local deque is empty, a thread steals work from a foreign deque, either from the top (depth-first work-stealing) or from the bottom (breadth-first work-stealing). We already explained the theoretical advantages of the two alternatives in Section 1.2. To simplify comparisons of both types, we made switching between them very easy: the default is breadth-first work-stealing, but when the compiler flag -DDF_WS is given, the Deque ES performs depth-first work-stealing.

Moreover, if multiple threads share a cache, they might benefit from sharing a deque as well. For example, in a multi-threaded processor core, the threads share the L1 cache. Therefore, in CES the number of threads per deque is adjustable through the compiler option -DTHREADS_PER_DEQUE. Continuing the example, the hardware threads of a multi-threaded processor core may use a single deque.
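One way such a mapping could look, assuming hardware threads of one core are numbered consecutively, is sketched below; the formula is our assumption for illustration and is not prescribed by the CES sources.

#ifndef THREADS_PER_DEQUE
#define THREADS_PER_DEQUE 4   /* e.g. four hardware threads of one core share one deque */
#endif

/* Illustrative thread-to-deque mapping: consecutive hardware threads of the
 * same core end up on the same deque and thus share its tasks and its cache. */
static inline int dequeIdForThread(int hardwareThreadId)
{
    return hardwareThreadId / THREADS_PER_DEQUE;
}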
Since there is mostly a hierarchy of caches with different access times, it might be beneficial to steal in a hierarchical way as well. In our example, we would first try to steal from the deque of other hardware threads in the same core. If we find a task there, we might find some data of the task in the shared L1 cache. Only if we do not find a task on any deque used within the same core, we look at the deques of threads outside our core. This behavior can be enabled with the compiler option -DHIERARCHICAL_STEALING. 25 3 Design of the Deque Execution System 3.5 Synchronization In a CES execution, each hardware thread is running a POSIX thread. The POSIX threads are lightweight threads sharing a common address space. For shared data, we must prevent conflicts where multiple threads access the same storage locations. Otherwise, the machine instructions of multiple C statements from different threads might be intermingled and thus fail to execute as expected by the programmer. The instrument we use to prevent conflicts are architecture-dependent atomic operations. These cannot be interrupted and thus may be used to access shared data, even if other threads might do just that at the same time. A huge advantage of using atomic operations instead of higher-level concepts like mutual exclusion through semaphores or monitors is that they do not block and therefore do not slow down the program. Still, so as to find the points where atomic instead of conventional operations are needed, we must identify access to shared data. In CES, the most obvious section of multiple interfering threads is work-stealing, i. e. taking tasks from a foreign deque. This part is handled by the concurrent deque implementation. The dependency graph of tasks is also modified by the execution system in multiple threads. We already described, how this graph is modified: Each task creates a sub-graph of its children during the dependency analysis. At that point in time, just the parent task knows about these children and no other task can interfere. Only at the very end of the analysis is the sub-graph connected to the global graph, a process we described in Section 3.3. As visible in Figures 3.3 and 3.4, new arcs always originate in child tasks and thus the corresponding pointers are located in the sub-graph, which cannot be accessed by other threads yet. The removed arcs are part of the currently executed task. As all of its dependencies are already fulfilled and it has been taken from the deque, no other thread can access it either. All in all, the dependency graph is not vulnerable to concurrent access. However, the dependent task of a new arc has its counter for unsatisfied dependencies increased. As this counter accumulates the dependencies for all parameters and multiple tasks might fulfill different dependencies at the same time, access to this variable must be atomic. Furthermore, we often decrease the counter and put the task on a deque if it reaches zero. When, for instance, the counter is initially two, and two tasks simultaneously execute that part of the code, the decrease operations may overlap and both succeeding read operations would yield zero. Hence, the task would be put on a deque twice. Therefore, decreasing the counter and reading its value is performed by an atomic FetchAndDecrement operation. Another point we should be aware of is that concurrent memory allocations from the heap might conflict. 
The operating system would need to coordinate them, which could affect the speed of memory allocations by multiple threads.

When the user-defined code in a task uses shared data, exclusive access is guaranteed through the execution system. After all, enforcing these data access dependencies is the major concern for the Deque ES.

A final issue concerning synchronization is how we detect that all tasks are finished. As there are multiple distributed deques rather than a single data structure holding all tasks, we cannot easily determine if there are tasks in the system. Also, we would not gain much if we could query all deques simultaneously. There might be no task on any deque but other threads currently executing some tasks that will spawn children. To solve this problem, we use a global counter variable to track the number of tasks in the system. Details of how that might affect the performance and how to avoid race conditions will follow in Section 4.4.

3.6 Memory Management for Data Items

In the Stack Execution System, CES variables are kept in data frames, which are located on the stack just like task frames. We illustrated that process and also the resulting benefits for deallocating the data frames again in Section 2.2. Since the Deque ES replaced the stack with a deque holding only ready-to-be-executed tasks, we must devise a new way to store the data frames. As described in Section 3.1, the execution system cannot predict when a task is ready to be executed. Similarly, the execution system cannot predict when a data item will not be needed anymore. This is due to the fact that the necessary lifetime of data items is directly coupled to the tasks using them. Therefore, a data structure with fixed access patterns is not reasonable, and we store the data items on the heap. Unlike in the Stack ES, in the Deque ES data items do not share the outer structure and size of a task frame (see Figure 2.2.2) but just consist of their plain data type.

CES parameter variables are allocated in the parent subroutine, before their reference is passed to child tasks. The execution system performs this allocation synchronously when the execution reaches the task call (in CES code) or variable declaration (in intermediate C code). However, since the variable will be used by subsequent tasks, we cannot deallocate it in the parent routine but must wait until all tasks accessing it have finished. For that purpose, we would need to analyze the access patterns to CES variables, similar to the dependency analysis described earlier for parameters passed between tasks. Thus, we can use the dependency analysis to also schedule the release of data items at the appropriate times by encapsulating the release procedure into a special task we refer to as a Free Task. It only takes one inout parameter and releases its storage location. These Free Tasks are appended to the Current Child List at the end of a task but before the dependency analysis, for all variables which have been allocated in that task. Since all child tasks using these variables have already been called then, the Free Tasks are scheduled to run as the last tasks accessing their variables. Beyond the calling order, Free Tasks are handled like any other task by the ES. In particular, the Free Task's special actions are totally transparent to the scheduler. Tasks not descending from the current subroutine are unaware of variables allocated therein.
Hence, they do not care whether a Free Task runs before or after them.

3.7 Manual Encoding of Task Dependencies

In many divide-and-conquer-style applications using arrays, the dependencies for recursive calls are often quite simple and entirely clear to the programmer. The Deque ES can correctly handle array dependencies if they are properly encoded by hand, a technique similar to the Stack ES dependency handling. Admittedly, this partly undoes a benefit of the Deque ES, namely that the programmer does not need to think about what can run in parallel. On the other hand, it permits certain uses of arrays and thus enables some applications. In other cases, there are no data dependencies between tasks, but the programmer still wants them to run in a fixed order, a situation which can also be solved by manually encoding dependencies.

As an example for arrays, we will examine a core part of a merge sort implementation in CES shown in Listing 3.1. A particularity of the algorithm is that the temporary copy merge sort needs is created only once; on each level, the copy and the original are swapped. The implementation uses arrays and pointers through the parameters src and dst. In order to enable both mergesort child tasks to run in parallel, these parameters cannot be directly passed to both of them, as this would cause the Deque ES to put them in serial order. Therefore, we create new pointers right_src and right_dst in lines 15 and 16 for the second task, which also serves to directly include the correct offset from the start of the array. As the inout parameters of both mergesort tasks are now distinct, the tasks can run in parallel. The subsequent merge subroutine takes the values written both by the first (src, dst) and by the second mergesort task (right_src). Hence, it runs after both mergesort children. Thus, the actual dependencies of the tasks must be expressed as "superficial" dependencies of the parameters that are passed to the subroutines. Recall that the parallel keyword is not used in the Deque ES but is kept for compatibility with the other execution systems.

 1  /**
 2   * Recursive mergesort task
 3   * @param[in] n the number of elements.
 4   * @param[in] size the size of an element (result of sizeof()).
 5   * @param[in] compare the pointer to the element comparison function of type compare_t.
 6   * @param[in,out] src the copy of the array to sort.
 7   * @param[in,out] dst the array to sort. The sorted result will be stored here.
 8   */
 9  $mergesort(size_t n, size_t size, compare_t compare; ptr_t src, ptr_t dst;){
10      if ($n$ <= 1) {
11          /* array has zero or one element(s) and is sorted by default */
12      } else {
13          size_t nleft = $n$ / 2;
14          size_t nright = $n$ - nleft;
15          ptr_t right_src = $src$ + nleft * $size$;
16          ptr_t right_dst = $dst$ + nleft * $size$;
17          $parallel mergesort(size_t nleft, size, compare; dst, src;);$
18          $parallel mergesort(size_t nright, size, compare; ptr_t right_dst, ptr_t right_src;);$
19          $merge(size_t nleft, ptr_t right_src, size_t nright, size, compare; src, dst;);$
20      }
21  }$

Listing 3.1: Recursive part of a merge sort algorithm, based on [Rem08]

Beyond arrays, the manual encoding of dependencies can also be used to enforce execution order for tasks that do not depend on each other. A very common example is measuring the running time of an algorithm.
The usual method is getting a time value at the beginning and subtracting it from the time value at the end, so as to get the time span in between. In CES, these timing procedures might be implemented in the tasks start_timing and stop_timing. Listing 3.2 tries to use these tasks to measure the running time of the subroutine algorithm. However, the data dependencies of the timing tasks are completely distinct from those of the worker tasks. Therefore, the timing tasks could run successively at the beginning or end of the program and thus print an absurdly short running time.

$read(;;<type1> input_value);$
$start_timing(;;clock_t time);$
$algorithm(input_value;;<type2> result);$
$stop_timing(time;;);$
$print(result;;);$

Listing 3.2: Worker and timing tasks with independent data flows

In the revised program shown in Listing 3.3, the timing routines additionally take data items of the worker tasks as dummies. In this case, the programmer wants all five tasks to run sequentially. The fulfillment of this requirement is easily verified by discovering that for each two successive tasks there is a data item written by the first and read by the second task. Needless to say, in general one can use all dependencies mentioned in Subsection 3.2.1 to manually enforce running order.

$read(;;<type1> input_value);$
$start_timing(;input_value;clock_t time);$
$algorithm(input_value;;<type2> result);$
$stop_timing(time;result;);$
$print(result;;);$

Listing 3.3: Worker and timing tasks with encoded artificial dependencies

3.8 Additional Array Support by the Execution System

3.8.1 Overview

While the approach for arrays presented in the previous section is very flexible, it also requires the programmer to think about dependencies, a process that will be hard for complex task graphs. Furthermore, it demands more parameter passing than necessary for the plain algorithm. Therefore, we would like to offer a more natural way to use arrays in CES.

The arguably biggest problem with arrays is to identify in the dependency analysis which parts of the array will be read or written. Algorithms operating on arrays often pass around just one pointer, regardless of which parts of the array are actually accessed. Furthermore, arrays are often accessed through pointer arithmetic, and participating "iterator pointers" would need to be mapped to the original structure. It is even worse with pointer usage in general, as pointers are a very versatile tool and guessing their intended use is seemingly impossible. A pointer might be the root of a tree and the whole tree needs to be locked, or it is just part of a structure with references to other structures and not used at all in the current context. Hence, devising a solution covering all use cases of arrays or even pointers is very tough.

We therefore offer a way to use arrays in CES which is reasonable for some use cases. One can declare a potentially multidimensional array, the elements of which are treated like individual CES variables during the dependency analysis. Their storage locations are individually allocated and individually freed when they are not needed anymore. One can naturally access the single elements in the task declaring the array and also pass them to child tasks individually. However, passing the whole array as a parameter is not possible, as the memory for the elements is not contiguous; this is the price we pay for individual tracking of dependencies.
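As a rough illustration of this layout, the following sketch allocates the elements of a two-dimensional CES array individually and keeps only a C array of pointers to them. The function name and the (omitted) error handling are ours; the generated macros that actually perform this work are described in Section 4.7.

#include <stdlib.h>

/* Allocate a rows x cols CES array as individually tracked elements.
 * Each element lives in its own heap block, so the element memory is not
 * contiguous and the array as a whole cannot be passed as one parameter. */
static void **allocate_ces_array(size_t rows, size_t cols, size_t elem_size)
{
    void **elements = malloc(rows * cols * sizeof *elements);
    if (elements == NULL)
        return NULL;
    for (size_t i = 0; i < rows * cols; i++)
        elements[i] = malloc(elem_size);  /* one separately tracked block per element */
    return elements;                      /* per-element error handling omitted       */
}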
The array elements are not necessarily primitive types; they might be pointers to manually allocated arrays. This way we can track the dependencies for larger blocks of data. We will show a use case for this feature in Subsection 3.8.3. But prior to that we introduce the actual syntax used for arrays in CES.

3.8.2 Syntax

In the CES syntax for the Stack ES, the only way to declare a CES variable is to call a subroutine and use a C variable to initialize one of its parameters. As this involves copying, and doing so for arrays is expensive, the Deque ES introduces a new way to directly declare a CES array without initializing it. The syntax is, apart from the enclosing dollar signs, equal to a normal C array declaration and is given in line 1 of Listing 3.4. Brackets are to be taken literally in this context; the number of dimensions is not limited to two, the two dimensions shown are just an example. Since it was easy to implement, the same direct declaration is possible for a single variable, as shown in line 2. Hence, we do not need an extra local C variable for initialization purposes anymore. However, for consistency with the old declaration method, the type of the variable still has to be given when passing it on to a child task; the same is true for arrays, as will be visible shortly.

$<type> <variable name>[<size of dimension 1>][<size of dimension 2>];$
$<type> <variable name>;$

Listing 3.4: Directly declaring CES arrays and single CES variables

Local access to CES arrays is also very elegant: you just need to enclose a normal array access in dollar signs, as in Listing 3.5. The indices can be specified with constants, variables or expressions with parentheses and the four basic arithmetic operations. What follows in the next line is the handing over of an array element to a child task. As already mentioned, the base type of the array must be given; furthermore, we need the index of the element to be handed over. Since only individual elements are passed to subroutines, there is no change at all for the called task. It is not even possible to detect from within the task whether it was called with an array element or a single variable as a parameter.

$array[i][j+k]$ = 42;
$print(int array[i][j+k];;);$

Listing 3.5: Accessing CES arrays and passing on elements to a child task

3.8.3 Use Case: Algorithms on Blocked Data

When operating on large input data, this data can often be partitioned into multiple blocks. If this blocking happens at the root level and does not need to be recursively repeated as in divide-and-conquer algorithms, we can easily employ the new array support features of CES to track the dependencies. For recursive blocking this is still possible, but each internal node of the task tree would need to split the block further. This is because the incoming block is a single CES variable passed to a task as a parameter and must be split up into multiple variables to allow for individual tracking of dependencies.

Linear algebra is a major application field for blocked algorithms [DK99, JK02, GJ07, BLKD07], some of which are not recursive. For example, Kurzak et al. present non-recursive implementations of Cholesky factorization, QR factorization and LU factorization in [KLDB09]. As a proof of concept, we adapted to CES the implementation of Cholesky decomposition that comes with the distribution of SMP Superscalar 2.3 [SMP10].
The program concentrates on building the block structure and spawning worker tasks and uses a C implementation [CBL10] of the Basic Linear Algebra Subprograms [LHKK79, DDCHH88, DDCHD90] to perform the actual decomposition of the blocks. However, a valid scheduling order is important so as to obtain correct results. Therefore, the example serves well to test the proper tracking of array dependencies. Listing 3.6 shows the core part of the algorithm. The original implementation for SMP Superscalar can be found in [PBL08]. This paper is also the origin of the corresponding task graph in Figure 3.5, which illustrates the complex dependencies even for a small input size. Without the new CES array support, one would have to encode these dependencies by hand, which is hardly possible. In the graph, numbers show the sequential execution order, whereas colors indicate the different task types.

for (long j = 0; j < DIM; j++) {
    for (long k = 0; k < j; k++)
        for (long i = j+1; i < DIM; i++) {
            $ces_sgemm_tile(long BS, float_ptr A[i][k], float_ptr A[j][k]; float_ptr A[i][j];);$
        }
    for (long i = 0; i < j; i++) {
        $ces_ssyrk_tile(long BS, float_ptr A[j][i]; float_ptr A[j][j];);$
    }
    $ces_spotrf_tile(long BS; float_ptr A[j][j];);$
    for (long i = j+1; i < DIM; i++) {
        $ces_strsm_tile(long BS, float_ptr A[j][j]; float_ptr A[i][j];);$
    }
}

Listing 3.6: CES implementation of Cholesky decomposition (cf. [PBL08, Fig. 4])

Figure 3.5: Task graph for 6 by 6 block Cholesky decomposition, figure from [PBL08]

4 Implementation of the Deque Execution System

This chapter explains some implementation details of the Deque Execution System and refers to the actual code where appropriate. In Sections 4.1 to 4.5 we present various aspects of the basic implementation, whereas Section 4.6 details some individual improvements to increase the speed of the execution. The final Section 4.7 describes how we implemented the additional array support.

4.1 Data Structures for Dependency Analysis and Task Notification

As already mentioned in Subsection 3.2.2, the Dependency Analysis Table (DAT) provides a map interface. For simplicity, we initially implemented it using a linked list of key-value pairs, a structure that would be replaced soon (see Subsection 4.6.1). The lookup key is the address of a data item, and the corresponding value is a pair containing the task frame of the subroutine that last wrote to that data item and the index or offset of the parameter within the task frame. As we will see shortly, this offset is necessary to find the correct entry for the linked list that contains all subsequent input tasks.

The dependency analysis builds up a graph of the current task's children. The nodes are task frames and the arcs are pointers between them. Listing 4.1 shows the TASK_FRAME structure. It contains a pointer fnptr_task to the corresponding C function (line 2), possibly the function name (line 12) and the number of input, inout and output parameters (lines 3 to 5). These numbers are necessary as all parameters are held in one fixed-size array of pointers (parameter). Therefore, the maximum number of parameters is still 25, as in the Stack Execution System. Of special interest for the dependency analysis are the remaining fields of the structure. The integer unsatisfiedDependencies determines whether a task is ready to run (line 6). Its type is either uint32_t or uint64_t because it is accessed through architecture-dependent atomic operations that operate on the native word length.
 1  typedef struct TASK_FRAME {
 2      void (*fnptr_task)(struct TASK_FRAME * /* my task frame */, struct TASK_FRAME ** /* current child list */, DEQUE * /* my deque */);
 3      unsigned char in;       /**< Number of input parameters */
 4      unsigned char inout;    /**< Number of inout parameters */
 5      unsigned char out;      /**< Number of output parameters */
 6      NATIVE_UINT unsatisfiedDependencies;
 7      void * parameter[ARG_SIZE];
 8      struct TASK_FRAME * toBeNotified[ARG_SIZE];
 9      unsigned char notificationListOffset[ARG_SIZE];
10      struct TASK_FRAME * nextWriteNotification[ARG_SIZE];
11  #ifdef CES_DEBUG
12      char * name;            /**< The name of the task (function name) */
13  #endif
14  } TASK_FRAME;

Listing 4.1: The TASK_FRAME structure

To represent the linked list for the dependency analysis and the arcs of the resulting dependency graph, there are two arrays of pointers to other task frames and one array of offsets. All of them have the same length as the parameter list, because each slot of the arrays directly corresponds to the parameter at the same offset, the parameter representing the data item associated with the dependency. A major problem is that a data item can be consumed by arbitrarily many input tasks, and as the writer delivering that item would need to notify them all, it would also need to hold arbitrarily many pointers. Since we do not have enough space for that in a task frame, the writer's task frame only holds the pointer to the first input task to be notified; the remaining input tasks are part of a linked list, just as during the dependency analysis. In fact, the linked list built during the dependency analysis is never destroyed, but directly becomes part of the task graph. The array toBeNotified contains, for the writer, the first input task of the linked list and, for the input tasks, the next node in the linked list.

Moreover, the same data item might occur at different parameter offsets for different input tasks; for instance, it is input parameter 2 for task 1, but input parameter 3 for task 2. Since the linked list is tied to the data item, the offset of the parameter in the next task frame of the linked list is saved in the current task frame's integer array notificationListOffset.

When the input tasks have consumed their data item, they in turn must notify the next writer. Naturally, this pointer would be held in toBeNotified, but this slot is already used by the linked list. Therefore, while that linked list is needed, the pointer to the next writer is saved in the previous writer's nextWriteNotification array. When the writer notifies the input tasks, it sets their toBeNotified pointer to its nextWriteNotification. As the linked list is thereby destroyed, we call this process unwinding of the linked list.

Figure 4.1: Data dependencies and the DAG pointer structure of several writing and reading tasks for a single data item

Figure 4.1 is a modified version of Figure 3.2, now not as part of the local DAT but as part of the global dependency graph. Compared to Figure 3.2, the pointer structure of the DAT's linked list is kept, but we actually have an additional nextWriteNotification pointer to complement the original pointer structure (dashed arrows). When Writer 1 has run, the linked list is unwound and all dashed arrows disappear.
Instead, there are new pointers installed, directly from the input tasks to the next writer (dotted arrows). This happens when Writer 1 notifies its dependent tasks and traverses their list anyway. The notification process is detailed in the following section.

4.2 Notification of Dependent Tasks

The conceptual notification model we described in Section 3.3 was directed towards the actual dependencies and which tasks would need to be notified when a certain task finishes. However, it could not take into account the actual pointer structure we explained in Section 4.1. We will connect the conceptual model to the pointer structure here, thereby detailing how the notification mechanism is actually implemented.

For input parameters, Figure 3.3 (p. 23) is quite accurate. Since Input Task 3 cannot have any writing children, all nodes of the DAT linked list are input tasks. This linked list of child tasks is unwound before Input Task 3 notifies Writer. Notice that unwinding in the previous section referred to the global task graph and a writer task notifying its dependent input tasks, whereas here we unwind the local list of child tasks. Still, both processes are almost equal, apart from the local child tasks not having their dependency counter decreased, and therefore the processes are implemented in the same function unwindLinkedList.

For inout and output parameters, Figure 4.2 shows how the handing over of dependencies to child tasks is implemented. As this deviates from the conceptual model, it might be interesting to compare it to Figure 3.4 (p. 24). The original Input Tasks 1 through 3 depended on Writer 1 producing a data item that is actually delivered by Last Writer Child. In order to make Last Writer Child inform them after it finishes, we append the original list of input tasks to the list of child input tasks. Technically, we only append Input Task 1, with the rest of the list following automatically. Hence, adding one child to the input task list and appending two lists is the same process, and it is performed by a function whose name is inspired by the conceptual procedure, registerCallback. Furthermore, both the original and the new input tasks must run before Writer 2 (see Figure 3.4). Since they still have their pointer slots filled with the linked list's "next" pointers, the dependent Writer 2 must be kept by Last Writer Child, as explained in the previous section. Therefore, the parent's nextWriteNotification pointer is handed over to Last Writer Child, the task that will adapt the toBeNotified array of the input children once it has run.

Figure 4.2: Writer 1 integrating its children into the task graph

What were mere special cases in the conceptual model now result in fundamentally different behavior. If there is no writer among the child tasks, the parent itself delivers the data item and must inform all dependent tasks about it. Dependent subroutines are not only located in the list of original subsequent input tasks, but also in the (possibly empty) list of newly spawned input children. Therefore, the execution system unwinds both lists, partly decreasing the tasks' unsatisfiedDependencies (for tasks in the global graph), directing their toBeNotified pointers toward the next writer and putting them on a deque unless they still depend on other parameters.
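At the level of a single dependent task, the decisive step of the notification is the atomic decrement of its dependency counter. The sketch below uses the FetchAndDecrement primitive from Section 3.5, which (as Section 4.4 explains) returns the value before the decrement, and the putOnDeque function named in Section 4.4; the exact signature of putOnDeque is an assumption made for this sketch, and the surrounding bookkeeping (unwinding the reader list, handing over nextWriteNotification) is omitted.

/* Wake up one dependent task after a condition task has delivered its data
 * item.  Only the thread that observes the old counter value 1 pushes the
 * task, so it is put on a deque exactly once even under concurrent updates. */
static void notifyDependent(TASK_FRAME *dependent, DEQUE *myDeque)
{
    if (FetchAndDecrement(&dependent->unsatisfiedDependencies) == 1)
        putOnDeque(myDeque, dependent);   /* signature assumed for this sketch */
}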
4.3 Scheduling and Work-Stealing

Listing 4.2 shows the main loop of our basic scheduling algorithm. As long as there are tasks on our own deque, we keep executing them (lines 5 and 6). When we run out of tasks, work-stealing from other deques begins. We start our circular search on the next deque (line 9) and keep searching until we have found a task or reached our own deque again (line 10). Depending on the compiler option -DDF_WS, we either take a task from the top or the bottom of the foreign deque (lines 11 to 15). Line 18 advances the circular search. When we find a task, we stop searching and execute it (line 21). Afterwards, we try to find tasks on our local deque again, since the executed task has hopefully spawned children, which are pushed to the local deque. The main execution loop ends when there are no more tasks in the system; we detail the global variable readyTasks in the next section.

As we explained, in the work-stealing phase the Deque ES performs a circular search through all deques, starting with the deque next to the local one. That is, although shown to be efficient [BL94], we do not steal from a random deque. We tried that with the rand() function from the C library, but its implementation uses mutual exclusion to ensure thread safety and hence slowed down the execution tremendously. It might be worth trying different libraries for random number generation, but as our focus is on dependency analysis and depth-first work-stealing, we stuck with the scheme explained above.

 1  int myDequeId=...;
 2  int workPacketFound, response, stealFromId;
 3  while (readyTasks > 0) {
 4      /* fetch work from our own deque */
 5      while (SUCCESS == takeTopWD(&readyTaskDeques[myDequeId], &(currentTask.int64)))
 6          runTask(currentTask.fnptr.taskFrame, myTd, &readyTaskDeques[myDequeId]);
 7      /* no more work on our deque, steal from other deques */
 8      workPacketFound = 0;
 9      stealFromId = (myDequeId + 1) % CES_DEQUES;
10      while (!workPacketFound && stealFromId != myDequeId) {
11  #ifdef DF_WS
12          response = takeTopWD(&readyTaskDeques[stealFromId], &(currentTask.int64));
13  #else
14          response = takeBottomWD(&readyTaskDeques[stealFromId], &(currentTask.int64));
15  #endif
16          if (response == SUCCESS)
17              workPacketFound = 1;
18          stealFromId = (stealFromId + 1) % CES_DEQUES;
19      }
20      if (workPacketFound)
21          runTask(currentTask.fnptr.taskFrame, myTd, &readyTaskDeques[myDequeId]);
22  }

Listing 4.2: Basic scheduling algorithm

The hierarchical work-stealing option we described in Section 3.4 adds another search loop between the local (lines 5 and 6) and the global (lines 10 to 19) search. In the new loop, the Deque ES tries to find a task on a deque assigned to the same core but a different hardware thread. If successful, it executes a task from there, benefiting from a shared L1 cache. Otherwise, we continue with the global search.

4.4 Synchronization

We explored which data structures are accessed by multiple threads in the design chapter. The deque is no concern for us, since it handles concurrent access itself. The task graph's edges are, even when considering the real pointer structure presented in Section 4.2, not vulnerable to concurrent access, for the reasons explained in Section 3.5. Memory (de)allocation is handled through thread-safe implementations of malloc and free, a fact that has performance implications discussed in Subsection 4.6.3 but prevents the need for explicit synchronization.
What we must take care of is concurrent access to each task's counter for the number of unsatisfied dependencies and to the global counter for the number of tasks in the system. For both issues, we use atomic operations. Depending on the use case, we employ AtomicIncrement and AtomicDecrement where we just need to change but not read the value, or otherwise FetchAndDecrement. The latter atomically reads the old value of a variable and afterwards decrements it. These operations are provided through libraries for both x86 and Blue Gene.

In order to know when the worker threads executing tasks on the deque can exit and allow the program to end, we must check if all tasks have been executed. When there are some which have not been executed yet, at least one of them is ready to be executed. Therefore, we use the global variable readyTasks to represent the number of tasks that are either currently in execution or located on a deque, i. e. ready to be executed. The counter is incremented in the function putOnDeque, where we put a task on a deque. It is decremented again when a task finishes. Naturally, it is important to increase the counter for new and ready child tasks or tasks whose dependencies have been fulfilled before the task performing that action finishes and decreases the counter again. Otherwise, the counter could fall to zero early and some of our worker threads would stop executing. Hence, AtomicDecrement(&readyTasks) is the very last statement of any task, implemented as the last command in the expansion of the RUNTIME_TASK_FINALIZE macro. With both processes, increasing and decreasing, in one task which is executed by one thread, there is no risk of race conditions. As the atomic operations are very fast on Blue Gene, concurrent access to the single global variable does not constitute a speed bottleneck. Furthermore, considering the whole dependency analysis and the execution of user-defined code, changing the counter is only a small part of the execution procedure. In contrast to the modification statements, reading the value of readyTasks is performed non-atomically. This is possible because we only want to know if it is larger than zero, and once it reaches zero, it never rises again. This access method might give us some stale values when a cache has not been invalidated yet, so the worker threads would run a little longer at the end. On the other hand, the non-atomicity yields performance benefits for the huge number of reads during the execution.

The second use case for atomic operations is a task's number of unsatisfied dependencies. Non-atomic operations are sufficient for newly created child tasks that are not part of the global dependency graph yet. But when we unwind a list of child tasks and the next writer gets multiple new dependencies, we must use AtomicIncrement since other tasks might do the same for another parameter simultaneously. The variable is decremented when a dependency gets fulfilled. As any dependency might be the last one preventing a move to the deque, we always use FetchAndDecrement and check whether the old value was one, i. e. the task is now ready and must be pushed onto a deque. Again, the order is important. Unwinding of child tasks happens before we decrement the value for the parent task, which has now finished. Otherwise the dependent task might be pushed onto a deque, although it depends on a newly created child task.
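On x86, such primitives could, for example, be mapped to the GCC/Clang atomic builtins as sketched below. The thesis only states that the operations are provided by libraries for x86 and Blue Gene, so these definitions, including the choice of uint64_t for NATIVE_UINT, are assumptions made for illustration.

#include <stdint.h>

typedef uint64_t NATIVE_UINT;   /* the native word length, cf. Section 4.1 */

static inline void AtomicIncrement(volatile NATIVE_UINT *p)
{
    __sync_fetch_and_add(p, 1);
}

static inline void AtomicDecrement(volatile NATIVE_UINT *p)
{
    __sync_fetch_and_sub(p, 1);
}

/* Returns the value *before* the decrement, as used for the readiness check. */
static inline NATIVE_UINT FetchAndDecrement(volatile NATIVE_UINT *p)
{
    return __sync_fetch_and_sub(p, 1);
}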
4.5 Memory Management for Data Items

Recall the memory management design from Section 3.6. Storage space for data items is allocated by the parent routine, but special Free Tasks are responsible for releasing it again. Those Free Tasks are virtually inserted into the code by the execution system as the last inout subroutines of a task, and afterwards their dependencies are analyzed as for any other task. The scheduler is unaware of the special nature of the Free Tasks and dispatches them as usual.

void freeTask(RUNTIME_TASK_FUNCTION_ARGUMENTS) {
    free(myTaskFrame->parameter[0]);
    AtomicDecrement(&readyTasks);
}

Listing 4.3: Implementation of the Free Task

In order to know for which data items we must insert a Free Task, all allocations of new CES variables in the current task are recorded in an array called storageTracker. This array is only needed locally. The macros RUNTIME_CREATE_STORAGE_CVAR and RUNTIME_CREATE_STORAGE_OUTPUT create new entries. Just before the dependency analysis, the execution system traverses the array and creates a Free Task for each of the data items.

The Free Task has some properties that allow us to drastically cut down the overheads of the Deque Execution System. The plain C implementation, which is manually written instead of generated by the compiler, is shown in Listing 4.3. It simply releases the location of the first parameter. Notice the direct pointer access instead of using a macro as in compiler-generated code, which enables us to get the reference we need for free instead of the dereferenced pointer value. The only other activity performed is to decrease the number of tasks in the system before the freeTask finishes. If you compare that bare implementation with normal compiler-generated C files like the one in Listing 2.10, you will find the following calls to the ES removed. The task is not initialized, i. e. no dependency analysis table is created. There is no task call or assignment of parameter references, simply because the task has no children. Finally, and this removes the biggest overhead, there is no finalization besides decrementing the readyTasks counter. The normal procedure would include the parts

• calling of a Free Task for each data item in the storageTracker,
• dependency analysis for all child tasks (here, we do not have any child tasks),
• notification of dependent tasks (here, no task depends on us, since freeTask is the last to access the data item) and
• handing over of dependencies to child tasks (here, we have neither dependencies nor child tasks),

none of which are needed within a Free Task. With each data item corresponding to one Free Task, we might easily have half the subroutines of this type. For that reason, the lightweight Free Task we described here significantly contributes to the performance of the Deque ES.

4.6 Speed Improvements

Until now, we have described a basic version of the Deque ES as originally implemented. This section highlights some important changes in an optimized implementation to increase the execution speed.

4.6.1 Using Single Variables for the Dependency Analysis Table

The original implementation of the Dependency Analysis Table used a linked list for simplicity. Searching for entries and inserting new ones thus needed time in O(n), where n is the length of the list. As the DAT is very frequently used, we wanted to speed up the access.
Our first approach was using the hash table implementation of GLib [GTK10]; however, this slowed down the execution of common cases even further. Instead of investigating the library or implementing our own hash map, we chose a rather radical way that promised even better performance (not only big-oh-wise, but also with minimal constant factors). The optimized implementation stores the pointer to the head of the DAT list not in a central DAT structure for the whole task, but in dedicated single variables. Each such variable is connected to the CES variable it tracks through a name convention. Specifically, the CES variable foo is tracked by the local C variable cesLastAccess_foo. When this name is known, reading the value is directly possible, in O(1) and with minimal overheads and no lookup at all.

In the original implementation, the dependency analysis was completely performed by the execution system, which used the memory addresses of data items as an entry point to the DAT. When we rely on the variable names, however, the compiler must take responsibility for parts of the dependency analysis, as the ES does not know anything about variable names. We will detail this point shortly.

In the original implementation, the Deque Execution System performed the following actions at the end of a task: A storageTracker array, which had recorded all newly allocated CES variables, was traversed and a Free Task scheduled on the Current Child List for each of them. Then, the ES looped over the Current Child List, performing the dependency analysis and building the dependency graph by inserting the corresponding pointers. Afterwards, the parent task's parameters were traversed in order to connect the child graph to the global graph by handing over dependencies and to notify dependent tasks about the fulfillment of their dependency. Finally, ready child tasks were put on the deque of the current thread.

In the optimized implementation, a result of using variable name identifiers is that a dependency analysis based on the Current Child List, which only holds task frames through pointer addresses, is not possible anymore. We could either record even more information or perform the dependency analysis directly within the code execution. The latter solution required substantial changes to the compiler to insert additional macros in the intermediate C code. On the other hand, we would need to store less information and save a few loops in the task finalization, which promised performance benefits. Hence, we chose the solution involving the compiler changes.

In the original implementation, we could access the DAT in the whole function. The single entry point cesLastAccess_*, however, is declared at the same time as the corresponding CES variable. Since it is located on the stack for simplicity and speed, we can only access it in the scope in which it was declared. Therefore, appending all Free Tasks at the end of the subroutine is not always possible, as the DAT variable could have been popped off the stack already. Moreover, the CES variable cannot be accessed under its name after leaving the declaration scope, i. e. we can actually insert the Free Task earlier. Hence, the compiler was extended to recognize and track C scopes. It records those scopes and declarations of new CES variables on a stack. Just before the scope is left, Free Tasks for all declarations within the scope are inserted into the intermediate C code with the new macro RUNTIME_CREATE_FREETASK.
When the execution system hits these Free Tasks, the corresponding DAT entry point is still accessible, so they can be dependency-analyzed.

The dependency analysis itself happens after the task creation through the new macros RUNTIME_PARAMACCESS_<type>, where <type> is one of IN, INOUT and OUT. In the original implementation, the ES looped over all parameters at the end of the parent task and acted according to the type. In the optimized implementation, the compiler inserts these macros for each parameter and the macro expansion performs the analysis immediately (registering callbacks, recording the new DAT entry, etc.). The differences for the parameter types have been explained in Subsection 3.2.3.

Dependencies for parameters of the parent task are also tracked in cesLastAccess_* variables, which are allocated at the beginning of a task. For that purpose, the compiler inserts RUNTIME_PARAM_INITIALIZE macros. As the root scope of a task stays open until its end, we can incorporate the last writers and readers of those parameters in the notification process of subsequent tasks. Since we need the names of the variables to read the distinct DAT entry points and the execution system knows nothing about them, the compiler inserts RUNTIME_HANDLE_<type>_CALLBACK macros at the end of the task. Next to decrementing the readyTasks counter, the only remaining responsibility of the RUNTIME_TASK_FINALIZE macro is to put ready child tasks onto the deque, as this process must wait until after the notification. The storageTracker and the task-wide Dependency Analysis Table (just the map) are now obsolete and have been removed from the source code.

All in all, those changes yielded a speedup of over 30 percent for an application like Fibonacci (see Subsection 5.3.1), which performs little computation within each task.

4.6.2 Avoiding O(n) Operations on Callback Lists

In the Dependency Analysis Table, all input tasks called after the last writer of a variable are kept in a linked list. Inserting new items at the back of this linked list needs a number of steps linear in the list length. We could insert single items at the front, but when we connect two lists as in Figure 4.2, we still need to access one of their end nodes. In order to avoid traversing the whole list before we can insert, we save a pointer to the end of the list. We use the current parameter's slot in the nextWriteNotification array for that purpose, as shown in Figure 4.3(a). For the very last writer of a CES variable, this pointer indicates the end of the list of subsequent input tasks. When the next writer is added to the dependency graph, nextWriteNotification fulfills its original purpose of keeping a reference to the next writer of the parameter (Figure 4.3(b)). As new input tasks are now appended to Writer 2 rather than Writer 1, we do not need the shortcut to the end of the list anymore.

Figure 4.3: Temporary usage of nextWriteNotification as a pointer to the end of the reader list

However, when we want to append a node at the end of the list, we must know the parameter offset for the last existing node, as this is where the list will continue. These offsets are saved in the notificationListOffset array, but only for the next node, not for the newly introduced shortcut to the list end.
Therefore, we introduce nextWriteOffset, a new field in the TASK_FRAME structure, as a place to store the parameter offset at the last node of the list. Admittedly, it has nothing to do with the "next writer", but the name reflects the association with nextWriteNotification, even if only in its temporary usage.

We noticed the problem addressed in this subsection in a test program that calls very many successive input child tasks within one parent task. The execution of this program was unexpectedly slow in the original implementation of the Deque ES. The described changes in the optimized implementation solved the issue.

4.6.3 Using Free Pools for Task and Data Frames

The Deque Execution System puts both tasks and CES variables on the heap. For each called task and each declared variable, we allocate and release memory at the appropriate times as explained in Sections 3.1 and 3.6. Naturally, this leads to many calls to malloc and free from different threads. We discovered that memory allocations considerably slowed down the application execution with an increasing number of threads, with serious impact on the scaling performance of the Deque ES. We suspect that this is the result of the operating system coordinating concurrent allocations.

As user programs might request space of arbitrary size, finding free blocks is not easy for the operating system. However, the memory needs of CES are quite uniform: we either allocate a task frame of fixed size or a variable. For variables, CES always has pass-by-reference semantics. Therefore, we can restrict ourselves to variables with a maximum size of 64 bits. When larger structures are needed, they can be manually allocated.

Since our memory allocations have only two distinct sizes (sizeof(TASK_FRAME) and 64 bits for data items), we can use two free pools to speed them up. A free pool is a data structure holding pointers to allocated memory blocks that are currently not in use. When the execution system requests a new block, we take one from the free pool and thereby save a memory allocation. Only when the free pool is empty is the operating system asked for a new block. To fill up the free pool again, unused memory blocks are not released immediately but stored in the free pool.

Still, multiple threads might allocate memory simultaneously. Either each thread has its own free pools, or we use a concurrent implementation. The former solution obviously needs more memory, since fluctuations cannot be balanced between threads. For that reason, and with the availability of our fast, concurrent deque implementation in mind, we chose the latter solution. As a result, the Deque ES is much more scalable than before. While the original implementation scaled to only about four threads, the improved implementation scales to about 32 threads. Depending on the application, the SMT capabilities limit a further increase beyond 16 or 32 threads (see Section 5.2).

4.6.4 Scheduling According to Hardware Threads

When multiple threads share a deque or when we use hierarchical work-stealing (see Section 3.4), it is important to know the hardware thread a POSIX thread runs on. Otherwise, hardware threads from different cores might share a deque or might be in the preferred group for hierarchical stealing; such threads could not take advantage of a shared L1 cache. Our first approach was to use the POSIX setaffinity functions, which bind a POSIX thread to a CPU ID.
Unfortunately, the IDs used by POSIX did not reflect the actual hardware architecture of Blue Gene/Q. As we therefore could not restrict a POSIX thread to a certain hardware thread, we implemented a hand-crafted solution. The Blue Gene/Q environment provides functions to get the physical ID of the current core and hardware thread. In order to run only on threads with a certain combination of these two numbers, we spawn as many POSIX threads as there are hardware threads and ensure each hardware thread is actually running. Then we quit the threads we do not want to run and start the actual work on the remaining ones. Which threads will run can be configured through a mix of compile-time (-DCES_THREADS, -DTHREADS_PER_DEQUE) and run-time options (the number of cores to run on). From these we can infer the number of threads per core and control whether a thread terminates or starts to work, as shown in Listing 4.4. The code is executed by each POSIX thread before the main execution loop (see Section 4.3).

const int kernelCoreId = Kernel_ProcessorCoreID();
const int kernelThreadId = Kernel_ProcessorThreadID();
L2_Barrier(&enoughPThreadsRunning, 64); // ensure all hw threads are running
if (kernelCoreId >= numCores || kernelThreadId >= numThreadsPerCore)
    return EXIT_SUCCESS;
printf("(%2d, %d) running\n", kernelCoreId, kernelThreadId);

Listing 4.4: Code to quit POSIX threads we don't want to run

4.7 Array Support

The array support of the Deque Execution System introduces new language features, so the CES compiler had to be adapted. We do not describe the straightforward changes to the compiler front-end. Instead, we focus on the interesting aspects of the code generation phase.

int n, m;
...
$int array[n][m];$

Listing 4.5: An example of a CES array declaration

The generated C code contains three new macro calls to the Deque ES. The macro RUNTIME_CREATE_CES_ARRAY initializes a newly declared array by creating a C array to store the pointers to the individual CES array elements. As all array elements are dependency-tracked individually, we must allocate a separate block of memory and initialize a DAT entry for each of them. Therefore, the second new macro RUNTIME_CREATE_CES_ARRAY_PART is called for each individual array element. Declarations of single variables are translated to the RUNTIME_CREATE_CES_VARIABLE macro, which allocates their storage space and initializes their DAT entry.

Initializing all array elements and creating a Free Task for each of them requires multiple similar macro calls. As it is only known at run time how many elements must be initialized, the CES compiler creates nested for loops, one loop for each dimension of the array. As an example, CESC transforms the CES code in Listing 4.5 into the intermediate C code in Listing 4.6. The additional braces are an easy way to prevent clashes of variable names for multiple array declarations. As usual, the Free Tasks are called directly before the array runs out of scope. The names of the loop variables currently limit arrays to 18 dimensions, which should be enough for most use cases; we could easily increase the limit.

With arithmetic expressions as indices or size specifiers of arrays, the CES compiler and the runtime macros just copy the expression to the required places, e. g. the upper bound of the for loop. For simplicity, the evaluation is only performed by the C compiler.
Given the advanced optimization capabilities of modern compilers, this decision should not significantly affect the performance of the resulting program.

int n, m;
...
RUNTIME_CREATE_CES_ARRAY(array, [n][m], int)
{
    int ces_i;
    for (ces_i = 0; ces_i < n; ++ces_i) {
        int ces_j;
        for (ces_j = 0; ces_j < m; ++ces_j) {
            RUNTIME_CREATE_CES_ARRAY_PART(array[ces_i][ces_j], int)
        }
    }
}
...
{
    int ces_i;
    for (ces_i = 0; ces_i < n; ++ces_i) {
        int ces_j;
        for (ces_j = 0; ces_j < m; ++ces_j) {
            RUNTIME_CREATE_FREETASK(array[ces_i][ces_j])
        }
    }
}

Listing 4.6: CESC output for the declaration of Listing 4.5

5 Performance Comparisons

5.1 Goals

The Deque Execution System was designed to be very flexible regarding the various work-stealing modes. We can easily switch between depth-first, breadth-first and hierarchical work-stealing, and also vary the number of threads sharing a deque (see Section 3.4). In this chapter, we demonstrate differences between these modes using several CES applications.

In-depth comparisons to the previous execution systems or to other implementations like SMPSs would also be interesting. However, the Deque ES makes heavy use of the deque library by Manuel Metzmann [Met09], which is specifically optimized for Blue Gene. For this and other reasons, comparisons on x86 machines would require additional effort. The previous execution systems make use of x86 atomic primitives, so running them on Blue Gene is not possible either without porting them to the new platform. The same is true for other parallel programming environments with dependency analysis like SMP Superscalar. These porting efforts are out of scope for this thesis, and hence in-depth comparisons are left open for the future. In order to roughly examine the performance of our full dependency analysis with nested parallelism, we provide a brief comparison on x86 hardware. Therein, we contrast the Deque ES with the Stack ES and SMPSs.

5.2 Test Configuration

Most measurements of this chapter were taken on one compute node of a Blue Gene/Q system [Fel11]. It is driven by a single BG/Q processor chip with 16 A2 cores, each at 1.6 GHz and using four-way simultaneous multithreading (SMT). This provides 64 hardware threads. Each core has access to 32 KB of L1 cache, 16 KB for data and 16 KB for instructions. The shared L2 cache has a size of 32 MB. The node has access to 4 GB of DDR3 RAM. The programming environment on Blue Gene/Q for CES is the Compute Node Kernel (CNK) environment; the compiler is the GCC-based cross-compiler for Blue Gene/Q.

Notably, we should not expect to see a four-fold increase in performance when relying on the SMT capabilities of a single core. Therefore, when increasing the number of software threads, we first distribute them to multiple cores; we only increase the number of threads per core when all cores are busy (illustrated by the sketch at the end of this section). Consequently, a decrease in the performance scaling is to be expected when increasing the number of threads from 16 to 32 or from 32 to 64.

For our brief comparison to other parallel environments, we used a Lenovo ThinkPad T61p. It is equipped with an Intel Core 2 Duo T7700 processor at 2.4 GHz with 4 MB of L2 cache and 4 GB of memory. We used GCC 4.4.3 to compile for x86. All tests on this system run on two threads.

For all measurements, we started each configuration 25 times and show the median results here.
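To illustrate the thread placement policy described above, the following sketch shows one possible mapping of a logical software thread index to a core and an SMT thread on the 16-core node. The function and constant names are ours, introduced for illustration only, and do not appear in the CES sources.

#define NUM_CORES 16

/* Illustrative only: spread software threads over all 16 cores first, and only
   add further SMT threads per core once every core is busy. */
static void placeThread(int logicalId, int *core, int *smtThread)
{
    *core      = logicalId % NUM_CORES;  /* threads 0..15 land on distinct cores        */
    *smtThread = logicalId / NUM_CORES;  /* threads 16..31 use the second SMT slot, ... */
}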
5.3 CES Applications Used

In this section, we briefly describe the origin and intention of the different CES programs we used to test the Deque ES. These programs will reappear when we focus on different aspects of the execution system in Section 5.4.

5.3.1 Recursive CES Applications

Jens Remus provided an in-depth performance analysis of the Stack Execution System in [Rem08, Chapter 4]. Since the Cilk-like execution style of the Stack ES is mainly intended for divide-and-conquer algorithms, he used several classic recursive computational problems for his tests. As these small applications already existed as CES programs, we reused them in the Deque ES.

The CES programs had to be slightly adapted to work with the Deque Execution System. We encoded dependencies by hand as described in Section 3.7 for recursive algorithms working on arrays. Furthermore, we had to enforce the constraint of only passing values of at most 64 bits (see Subsection 4.6.3). In the only case where changes were necessary, we switched to allocating the necessary structures on the heap.

We used the following recursive programs to test the Deque ES. The original code is available in [Rem08, Chapter 4]. Our minor changes are of a technical nature, so we refrain from printing the code again.

• The recursive calculation of the Fibonacci numbers. If the input value is at most 1, we return 1 for the base case. Otherwise, we spawn two tasks in parallel to calculate the previous two Fibonacci numbers. Afterwards, a third task adds up their results and delivers the desired value. In this application, the tasks are very fine-grained and even for small input values we get a lot of tasks. [Rem08, Section 4.3.1]

• A simple merge sort implementation. The recursive mergesort task is shown in Section 3.7. The algorithm uses only one temporary array, as also explained in Section 3.7. The recursive calls to mergesort are spawned in parallel. The merging step, however, is serial and thus limits the available parallelism. [Rem08, Section 4.5.1]

• A modified version of the above merge sort implementation with adjustable task granularity. When the length of the input array at a certain call level drops below the input parameter MIN_TASK_SIZE, we switch to serial execution. The sorting algorithm stays the same, but all recursive calls are then normal, synchronous function calls instead of task calls. Therefore, the parameter determines the minimal task size.

• The calculation of the Mandelbrot set. The divide-and-conquer-based implementation provides excellent parallelization opportunities since all pixels of the resulting bitmap can be computed independently. However, as multiple threads write to the same memory area, they might disturb each other's caching capabilities. [Rem08, Section 4.9]

5.3.2 Cholesky Decomposition

The Cholesky decomposition is an important method in linear algebra. It is used, e. g., for the numerical solution of linear systems of equations or in Monte Carlo simulation. The CES implementation is an adaptation of the Cholesky program that comes with SMPSs 2.3 [SMP10]. We did not change anything about the algorithm and only translated it to CES. The program splits the matrix into tiles and then uses the Basic Linear Algebra Subprograms [LHKK79, DDCHH88, DDCHD90] to perform the actual decomposition. As the Cholesky decomposition uses the array support of the Deque ES, we already presented the code and an example task graph in Subsection 3.8.3.
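For orientation, the following pseudocode sketches the structure of a tiled Cholesky factorization of the kind the SMPSs example program is based on. Each kernel call becomes one task, and a tile written in one step is read by several tasks in later steps, which is why the task graph contains tasks delivering data to multiple successors. The kernel names follow the usual BLAS/LAPACK routines; this is an illustration, not the exact CES code of Subsection 3.8.3.

/* Schematic tiled Cholesky factorization of an N x N block matrix A
   (lower triangle); illustrative only. */
for (k = 0; k < N; ++k) {
    potrf(A[k][k]);                          /* factorize the diagonal tile       */
    for (i = k + 1; i < N; ++i)
        trsm(A[k][k], A[i][k]);              /* triangular solve against column k */
    for (i = k + 1; i < N; ++i) {
        syrk(A[i][k], A[i][i]);              /* update the diagonal tile          */
        for (j = k + 1; j < i; ++j)
            gemm(A[i][k], A[j][k], A[i][j]); /* update an off-diagonal tile       */
    }
}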
In the performance evaluation, it serves as an example of a non-recursive algorithm operating on a large data set. The performance results for the Cholesky decomposition are given in MFlops, based on a calculation in the original SMPSs program.

5.3.3 Sweep 2D

Sweep 2D was inspired by its 3D counterpart, which "solves a three-dimensional neutron transport problem from a scattering source." [PFF+ 07] Our adaptation of the model uses a 2D grid of cells as shown in Figure 5.1. Each cell represents a task. Incoming edges represent the data dependencies of a task, outgoing edges connect it to dependent tasks. As a result of these dependencies, the first task to run is in the upper-left corner, the final task is in the lower-right corner. In between, the execution schedule depends on the timing of individual tasks, since most of the time multiple tasks are ready to execute. Assuming equal execution time for each task and an infinite number of processors, the execution would spread like a diagonal wave front.

We use Sweep 2D to visualize which tasks run on which processor core in different work-stealing modes and with our new scheduling according to hardware threads (see Subsection 4.6.4). For each task, we save the ID of the core it has run on. We then visualize the grid and color each cell according to the saved ID. To achieve good caching performance, one would strive for larger areas of the same color rather than a wild mix. Our implementation first passes one parameter down and then one parameter to the right. Hence, it is more likely to find columns than rows of the same color.

Figure 5.1: Dependencies in Sweep 2D (inspired by [PFF+ 07, Fig. 1])

5.4 Results

5.4.1 Scaling of Work-Stealing Modes and Shared Deques

This section compares five different combinations of work-stealing modes and shared or non-shared deques with different numbers of threads on BG/Q. Recall that each thread operates on the top of its own deque. When multiple threads share a deque, they all treat it as their own deque and thus all of them push to and pop from its top. In our shared-deque configuration, all threads on a single core share one deque. That is, up to 16 running threads there is no difference to the non-shared configuration, as each core runs only one thread. With 32 threads running, each deque is used by two threads; with 64 threads running, each deque is used by four threads.

Work-stealing means popping from a deque other than a thread's own. In breadth-first work-stealing, the bottom of the foreign deque is popped, whereas in depth-first work-stealing the top of the foreign deque is popped. There are four combinations of shared and non-shared deques with these two work-stealing modes. The last configuration is hierarchical work-stealing, which always uses non-shared deques. In hierarchical work-stealing, a thread with an empty deque first tries to steal from other threads on the same core, i. e. it tries to pop the remaining three deques assigned to its core. Only if this fails does the thread look for work on the deques of threads outside its own core. Shared deques do not make sense for hierarchical work-stealing, since this would eliminate the first stealing hierarchy.

When several of these five configurations show no visible difference in a graph, we only show one curve and indicate the configurations in the key accordingly.
Where we show the speed increase on the y-axis, the number indicates the ratio to running the program on the Deque ES with a single thread and breadth-first work-stealing. However, since there are no other threads to steal from, it makes little difference which work-stealing mode serves as the baseline.

The first two experiments we present run the Fibonacci program with n = 31 as input and the merge sort program sorting four million random numbers. The results are shown in Figures 5.2 and 5.3, respectively.

Figure 5.2: Performance of the Fibonacci numbers calculation (speed increase vs. number of hardware threads)

The Fibonacci program scales quite well up to eight threads, where we get a 6.4-fold speed increase. The increase declines at 16 threads, perhaps due to the very small task granularity and memory allocation overheads. The increase declines again when the number of threads reaches 64, presumably due to the exhaustion of the cores' SMT capabilities.

The merge sort program shows a similar development, but generally scales worse than the Fibonacci program. This is probably due to the serial part in the merge step. Furthermore, all cores operate on the same data set, as they sort the same array. With work-stealing, multiple cores will sometimes sort nearby parts of the array and thus presumably disturb each other's L1 caching. When the number of threads increases to 32 and 64, the SMT capabilities again limit further scaling.

Figure 5.3: Performance of the merge sort algorithm (speed increase vs. number of hardware threads)

In both programs, all work-stealing modes show almost identical results. This suggests that there is not much work-stealing. As both recursive applications start with rather large, high-level tasks, which are then distributed to different threads, the threads might be well load-balanced and hence might only need a few more steals. We also suspect that the dependency analysis, which contributes a good part of the executed code, washes out some differences between the work-stealing modes. To our knowledge, all previous analyses of different work-stealing schedulers were conducted on Cilk-like systems, which run almost no additional code besides the actual application.

With shared deques, the performance with 64 running threads is slightly worse than with non-shared deques. This is probably a result of the very small task granularities: in the Fibonacci program, all tasks are small; in merge sort there are far more small tasks than large ones. With such small tasks, the deques are accessed frequently, and as shared deques imply fewer deques, we might run into contention earlier. Moreover, shared deques should lead to L1 caching benefits for the data, but atomic access to the deque itself always needs to go through the L2 cache as a result of the BG/Q architecture. As the data is mostly small in these applications, data caching benefits are seemingly outweighed by other factors.

The results of running the Mandelbrot program (input parameters: -2.25 0.75 -1.25 1.25 800 2000, cf. [Rem08, Section 4.9]) are shown in Figure 5.4.
The program scales excellently until all 16 cores are in use (15.3-fold speed increase with 16 threads). It also benefits considerably from multiple SMT threads; the curve flattens only slightly at 32 and 64 threads. As each pixel of the Mandelbrot set can be calculated independently, no large amounts of data are shared. Therefore, and probably for the reasons explained above, we see almost no difference in the performance for different work-stealing modes and with shared or non-shared deques.

Figure 5.4: Performance of the Mandelbrot set calculation (speed increase vs. number of hardware threads)

The last program we use to compare the different work-stealing modes is the Cholesky decomposition. It is non-recursive and in fact does not use nested parallelism at all. The results are shown in Figure 5.5. The Cholesky decomposition scales fairly well and sees an approximately 13-fold performance increase at 16 threads. Further increases beyond 16 threads might again be limited by the SMT capabilities. For non-shared deque configurations, 64 threads is even worse than 32 threads. Notably, the Cholesky decomposition shows some differences between the work-stealing and deque-sharing modes. Hierarchical work-stealing performs slightly better than breadth-first work-stealing, which in turn runs slightly faster than depth-first work-stealing. Shared deques perform up to 20 percent better than non-shared deques.

Figure 5.5: Performance of the Cholesky decomposition (MFlops vs. number of hardware threads)

Cholesky reveals far more differences than the previously tested applications. There are several explanations for this behavior. Firstly, it might result from not having nested parallelism. After the initial scheduling of tasks, a large part of the execution is concerned with application code instead of dependency analysis. Hence, application data structures dominate the cache usage and different modes show different behavior. Presumably more important is the spawning of tasks. As visible in Figure 3.5, the graph contains tasks delivering data to multiple successors. This is where we can benefit from shared deques: all of the successors are pushed to the deque, and multiple threads from the same core run those tasks and hence access the L1-cached data. The behavior of the Fibonacci and merge sort programs is different. They initially distribute large tasks to multiple threads. Stealing probably mostly occurs at the end, when some of the large tasks have finished. In this phase, data is passed up the tree, i. e. multiple tasks deliver data to the same dependent tasks. But no task spawns multiple others anymore that could run on different threads of the same core.

5.4.2 Overhead of the Execution System

The Stack ES and similar programming environments like Cilk enforce the runtime dependencies of tasks implicitly. Therefore, their execution has almost no additional overhead compared to the sequential execution of plain C code. In contrast, the CES Deque Execution System analyzes the dependencies of tasks explicitly.
The relative overhead depends on the granularity of the tasks. The following test on the x86 Lenovo ThinkPad laptop compares the Deque ES and the Stack ES using the merge sort implementation with configurable task sizes. When the size of the array is below a threshold, the remaining sorting happens in the current task, with no further task spawns. This threshold is shown on the x-axis of Figure 5.6 as the Minimal Task Size; note the logarithmic scale. The y-axis shows the sorting performance in numbers per second. We always sorted an array of five million random numbers.

Figure 5.6: Comparison of the Stack ES and the Deque ES performance (sorted numbers per second vs. Minimal Task Size)

For very small task sizes, the Deque Execution System performs poorly, as a major part of the execution time is spent analyzing the dependencies of the multitude of tasks. With an increasing Minimal Task Size, the results improve quickly. Above a Minimal Task Size of 256 numbers (about 350,000 clock cycles per task), the Deque ES comes very close to the Stack ES and later even outperforms it slightly. The suspected reason for better results than the Stack ES at coarse granularities is that the Stack ES still needs to search through the Frame Stack, while the Deque ES can steal the first item it finds on a foreign deque.

The test shows that the performance of the Deque ES depends heavily on the task granularity. However, we do not need very coarse tasks to achieve good performance compared to the Stack ES (note that with a Minimal Task Size of 1000, there are still at least 5000 tasks). Nevertheless, more detailed comparisons with multiple threads would be needed to fully compare the performance of the two execution systems.

In a second experiment, we want to ensure that our dependency analysis algorithm is not unnecessarily slow. Therefore, we briefly compare the Deque ES to SMPSs 2.3, again on the x86 system. As SMP Superscalar does not support nested parallelism yet, we chose a program without nested parallelism for the comparison. Since the Cholesky decomposition (see Subsection 3.8.3) is such a program and we have both a CES and an SMPSs version available, we reused this application for our test. The results are depicted in Figure 5.7. The input parameter shown on the x-axis is the side length of the block matrix, a number that determines the number of tiles we operate on and thus the number of tasks to execute. The performance measurement was part of the SMPSs program and gives the number of floating point operations per second on the y-axis.

Figure 5.7: Comparison of SMPSs 2.3 and the Deque ES performance (MFlops vs. side length of the block matrix)

The results for both systems are very similar. The curves partly even overlap, although the two implementations do not share any code. The performance increases with an increasing block side length. While the gains are huge at very small side lengths, they diminish later. This development might result from caching effects of the decomposition code and from the decreasing influence of constant factors with an increasing computational effort.
The main result from this experiment is that the implementation of the Deque ES is competitive with another programming environment that analyzes the dependencies of tasks.

5.4.3 Sweep 2D Results

Initially, we used Sweep 2D to test the scheduling according to hardware threads (see Subsection 4.6.4). Larger areas of the same color indicate a core working on multiple tasks that interchange data and are thus desirable for effective cache usage. In all results, we ran Sweep 2D with 300 × 300 tasks on 64 threads using all 16 A2 cores. Each core has its own color, and multiple hardware threads on the same core have the same color.

Figure 5.8 shows the result with breadth-first work-stealing and non-shared deques before we introduced the scheduling according to hardware threads. Columns of the same color are clearly visible and indicate the depth-first execution order of each single thread. Beyond that, the colors are quite mixed, i. e. stealing occurs randomly across processor cores.

Figure 5.8: Processor assignment to Sweep 2D grid cells, without scheduling according to hardware threads

Figure 5.9 shows the result with breadth-first work-stealing and non-shared deques. Here, as in the following sweep, the tasks are scheduled according to hardware threads. Clearly, we have much larger areas of the same color and thus can make better use of the cache. In the upper-left and lower-right corners, the mixed colors remain. This is a result of only a few tasks being available at the beginning and end of the execution.

Figure 5.9: Processor assignment to Sweep 2D grid cells, with breadth-first work-stealing and non-shared deques

The best results are achieved when the threads of a core use a shared deque, as depicted in Figure 5.10. In the middle of the execution, there are mainly large blocks of the same color. This reflects the improved performance we get from using shared deques in the Cholesky decomposition (Figure 5.5).

Figure 5.10: Processor assignment to Sweep 2D grid cells, with breadth-first work-stealing and shared deques

6 Conclusions

6.1 Results

This Bachelor thesis explained the design and implementation of the new Deque Execution System for the CES programming language. This involved modifying the CES compiler, creating the Deque ES and extending the CES syntax to enable the newly introduced array support. Furthermore, we adapted and extended the macro interface connecting the compiler and the execution system. Apart from the new language features, we kept the previous execution systems compatible by extending their implementation of the macro interface accordingly.

The new Deque ES supports the classical breadth-first work-stealing present in the Stack ES, but it also enables depth-first work-stealing and a hybrid approach. We provide compiler flags to easily switch between these modes of operation. To further increase the flexibility of the scheduling algorithm, we added the option for multiple threads to share the deque holding their tasks. This enables, e. g., multiple hardware threads in a single processor core to operate on the same double-ended queue, while other cores have their own deques to avoid resource contention. In contrast to previous execution systems, the Deque ES' major data structure only holds tasks that are ready to be executed. This reduces the necessary effort for work-stealing and provides cleaner semantics.

The Deque Execution System determines dependencies between tasks and thus exposes parallelism.
The previous Stack ES required the programmer to explicitly indicate parallelism in the application. The new Deque ES analyzes the dependencies of spawned tasks at run time, during the execution, and schedules each task dynamically once all of its dependencies are fulfilled. Arbitrary non-circular dependencies can be handled, so any directed acyclic graph of tasks can be run. As the data dependencies of the tasks are analyzed exactly, the Deque ES can exploit more of the available parallelism than the earlier Stack ES, where the programmer could only coarsely expose the parallelism. However, due to the dependency analysis, the Deque ES has a higher overhead than the Stack ES.

The Deque Execution System also takes care of memory management for all data items shared between multiple tasks. When the last task accessing a data item has finished, a so-called Free Task is scheduled to release the allocated memory. By using free pools for storing data items and tasks, bottlenecks of standard allocation libraries are circumvented.

In addition to handling the dependencies of scalar variables, we provide support for declaring CES arrays. The elements of these arrays can be accessed with familiar array syntax and can be passed to child tasks individually. Their dependencies and storage space are also tracked individually, facilitating fine-grained scheduling and memory deallocation. In conjunction with the possibility to encode task dependencies manually, this enables some important applications. For example, there are many linear algebra algorithms operating on blocked data, one of which we presented in this thesis.

We briefly compared various work-stealing and deque-sharing modes within the Deque ES. The results for the Cholesky decomposition suggest that sharing a deque among multiple hardware threads on the same core can help the performance of certain applications. Further investigations of this subject are desirable. We also outlined that the performance of the Deque ES depends on the task granularity and is comparable to other parallel programming environments performing dependency analysis. Finally, we illustrated the stealing behavior with shared and non-shared deques using the Sweep 2D application.

6.2 Further Research Possibilities

In this final section, we highlight some possible research directions for the future. We start with promising modifications or extensions of the CES language and implementation. Afterwards, we present some ideas for further evaluations of CES.

6.2.1 Advancing the CES Language and Implementation

Firstly, there are some direct improvements to the current implementation. In Subsection 3.2.1 we mentioned that the Deque ES adheres not only to RAW, but also to WAR and WAW dependencies. The latter are not true dependencies, as they can be eliminated through register renaming [SS95]. The Deque ES could benefit from implementing this technique, as it allows more parallelism than the current solution.

The concurrent deque library we use has a fixed deque size, which can only be changed at compile time. The deque could possibly be adapted to grow and shrink according to its capacity utilization [CL05]. If this is not possible while keeping the high concurrent performance, another idea is to implement a multi-deque wrapper structure. When one deque is full, we switch to a new deque, while the old one is stored for later access.
If a (possibly different) thread empties its deque, it could replace its deque with the stored one, which provides plenty of tasks to execute.

While the CES execution system uses free pools to deal with concurrent memory allocations, user programs that regularly allocate heap memory might still run into scalability problems. "When ordinary, nonthreaded allocators are used, memory allocation becomes a serious bottleneck in a multithreaded program because each thread competes for a global lock for each allocation and deallocation of memory from a single global heap." [Rei07, p. 101] Therefore, we should offer the user a scalable allocator that deals with this problem. Intel Threading Building Blocks (TBB) provides two such allocator classes [Rei07, Chapter 6] that one could possibly offer through or adapt for CES. One obstacle is that Intel TBB is a C++ library whereas CES builds on C. As C++ is mostly an extension of C, this obstacle might be easy to overcome.

At the moment, the Deque ES keeps tasks and data items separate and establishes the links through pointers. We could cut down on storage locations and memory allocations if we put small data items directly into the consuming task frames. Those variables would then be passed by value. This change would disrupt the task communication model, which relies on different tasks accessing the same variables by reference. Recall that a parent passes the same reference to multiple tasks, whose communication happens only through the shared variable. Therefore, when notifying subsequent tasks, the condition task would need to copy the newly delivered variable values into the task frame of the dependent task. As this task frame is accessed anyway to decrease the number of unsatisfied dependencies, the additional overhead for the copy operation might be quite small. An alternative that keeps the current task communication model is to put the data item directly into the corresponding Free Task. As there is exactly one Free Task per data item, the mapping would be well-defined. All other tasks would still access the shared variable within the Free Task through references, but the data item would not need a distinct storage location anymore.

Scheduling tasks and particularly analyzing their dependencies imposes considerable overhead on the execution system. This is especially obvious when we have very fine-grained tasks, as shown in Figure 5.6. When all threads are busy, there is no need to spawn additional tasks; we could just execute the code sequentially as in conventional C. The execution speed of applications with fine-grained tasks could be increased considerably if we switched between the sequential and the task-based execution mode depending on the number of ready tasks in the system [TCBV10].

The following ideas for improvements are more involved, and we are not sure of their feasibility. In Subsection 3.8.1, we explained some of the difficulties with arrays and pointers concerning dependency handling. In essence, it is hard to determine how pointers are used across tasks and what the intention of the programmer is. Since pointers are a very central concept in C, it is desirable to extend their support in CES dependency tracking. If there is no silver bullet for the problem, one could at least offer different mechanisms for the most common use cases.

The execution of the Deque ES is strict, i. e. a task may only be executed when all of its dependencies are fulfilled.
In non-strict evaluation, this is not true for all parameters, providing more opportunities for parallelization. As an example, consider an input parameter of a parent task. When the parameter – or, to be accurate, its reference – is only passed down to a child task, the parent task could run before the value of the parameter is available. Only the child task would need to wait for the actual data delivery. In [SB99], Burkhard Steinmacher-Burow proposed the del (delegate) keyword to mark parameters that are only passed to child tasks and therefore do not need to be available for the parent to run. Supporting this keyword in CES would yield more opportunities for parallel execution.

The definition of the CES language is focused more on quickly achieving results than on ease of use. By extending the compiler, one could increase the usability. For example, the type of C parameters passed to CES tasks must be given as detailed in Subsection 2.1.3. Instead, the CESC could determine this information on its own. In general, the syntax of the language should be evaluated from a user's point of view.

6.2.2 Evaluation of the Current CES State

While the above-mentioned improvements and extensions were at the core of the Deque ES and CES language development, we now present some ideas for investigating the usefulness of CES. To test the Deque ES and the CES language extensions, we used rather small applications. These were mostly classical computer science problems; only the Cholesky decomposition explored the linear algebra domain. To thoroughly study the practicability of the language, the development of larger CES applications from different domains is necessary.

The final research prospect we want to give here is an extended evaluation of the performance of the Deque ES. Mainly due to differences in the targeted platforms, this thesis could only briefly outline that the Deque ES is competitive. An extended comparison to, e. g., Cilk, SMP Superscalar, the Stack ES and particularly to the best-possible sequential execution would be interesting to properly evaluate the performance of the latest developments in CES.

Bibliography

[ABB00] Umut A. Acar, Guy E. Blelloch, and Robert D. Blumofe. The data locality of work stealing. In Proceedings of the twelfth annual ACM symposium on Parallel algorithms and architectures, SPAA '00, pages 1–12, New York, NY, USA, 2000. ACM.

[ABP98] Nimar S. Arora, Robert D. Blumofe, and C. Greg Plaxton. Thread scheduling for multiprogrammed multiprocessors. In Proceedings of the tenth annual ACM symposium on Parallel algorithms and architectures, SPAA '98, pages 119–129, New York, NY, USA, 1998. ACM.

[ALS10] Kunal Agrawal, Charles E. Leiserson, and Jim Sukha. Executing task graphs using work-stealing. In Parallel Distributed Processing (IPDPS), 2010 IEEE International Symposium on, pages 1–12, 2010.

[BGM99] Guy E. Blelloch, Phillip B. Gibbons, and Yossi Matias. Provably efficient scheduling for languages with fine-grained parallelism. J. ACM, 46:281–321, March 1999.

[BL93] Robert D. Blumofe and Charles E. Leiserson. Space-efficient scheduling of multithreaded computations. In Proceedings of the twenty-fifth annual ACM symposium on Theory of computing, STOC '93, pages 362–371, New York, NY, USA, 1993. ACM.

[BL94] Robert D. Blumofe and Charles E. Leiserson. Scheduling multithreaded computations by work stealing. In Proceedings of the 35th Annual Symposium on Foundations of Computer Science (FOCS), pages 356–368, 1994.
[BLKD07] Alfredo Buttari, Julien Langou, Jakub Kurzak, and Jack Dongarra. Parallel tiled QR factorization for multicore architectures. Technical report, University of Tennessee, Oak Ridge National Laboratory, 2007.

[BOI10] BOINC – open-source software for volunteer computing and grid computing. http://boinc.berkeley.edu/, Retrieved December 15th, 2010.

[BS81] F. Warren Burton and M. Ronan Sleep. Executing functional programs on a virtual tree of processors. In Proceedings of the 1981 conference on Functional programming languages and computer architecture, FPCA '81, pages 187–194, New York, NY, USA, 1981. ACM.

[CBL10] Netlib repository at UTK and ORNL. http://www.netlib.org/clapack/cblas/, Retrieved January 17th, 2010.

[CGK+ 07] Shimin Chen, Phillip B. Gibbons, Michael Kozuch, Vasileios Liaskovitis, Anastassia Ailamaki, Guy E. Blelloch, Babak Falsafi, Limor Fix, Nikos Hardavellas, Todd C. Mowry, and Chris Wilkerson. Scheduling threads for constructive cache sharing on CMPs. In Proceedings of the nineteenth annual ACM symposium on Parallel algorithms and architectures, SPAA '07, pages 105–115, New York, NY, USA, 2007. ACM.

[Cil10] The Cilk project. http://supertech.csail.mit.edu/cilk/, Retrieved December 15th, 2010.

[CL05] David Chase and Yossi Lev. Dynamic circular work-stealing deque. In Proceedings of the seventeenth annual ACM symposium on Parallelism in algorithms and architectures, SPAA '05, pages 21–28, New York, NY, USA, 2005. ACM.

[DDCHD90] Jack J. Dongarra, Jeremy Du Croz, Sven Hammarling, and Iain S. Duff. A set of level 3 basic linear algebra subprograms. ACM Trans. Math. Softw., 16:1–17, March 1990.

[DDCHH88] Jack J. Dongarra, Jeremy Du Croz, Sven Hammarling, and Richard J. Hanson. An extended set of Fortran basic linear algebra subprograms. ACM Trans. Math. Softw., 14:1–17, March 1988.

[DK99] Krister Dackland and Bo Kågström. Blocked algorithms and software for reduction of a regular matrix pair to generalized Schur form. ACM Trans. Math. Softw., 25:425–454, December 1999.

[Fel11] Michael Feldman. Argonne orders 10 petaflop Blue Gene/Q super. HPCwire, February 8th 2011. Retrieved February 9th, 2011.

[FLR98] Matteo Frigo, Charles E. Leiserson, and Keith H. Randall. The implementation of the Cilk-5 multithreaded language. In Proceedings of the ACM SIGPLAN 1998 conference on Programming language design and implementation, PLDI '98, pages 212–223, New York, NY, USA, 1998. ACM.

[GJ07] Robert Granat and Isak Jonsson. Recursive blocked algorithms for solving periodic triangular Sylvester-type matrix equations. In PARA'06: State of the Art in Scientific and Parallel Computing, Lecture Notes in Computer Science. Springer, 2007.

[GTK10] The GTK+ project. http://www.gtk.org/, Retrieved January 24th, 2010.

[IBM10] IBM Corporation. ROI: Extending the benefits of energy efficiency. http://www-304.ibm.com/tools/cpeportal/fileserve/download0/164224/FV_Energy_Efficiency.pdf?contentid=164224, 2009. Retrieved December 14th, 2010.

[JK02] Isak Jonsson and Bo Kågström. Recursive blocked algorithms for solving triangular systems – part I: one-sided and coupled Sylvester-type matrix equations. ACM Trans. Math. Softw., 28:392–415, December 2002.

[KLDB09] Jakub Kurzak, Hatem Ltaief, Jack Dongarra, and Rosa M. Badia. Scheduling linear algebra operations on multicore processors – LAPACK working note 213, February 2009.

[Lea00] Doug Lea. A Java fork/join framework.
In Proceedings of the ACM 2000 conference on Java Grande, JAVA '00, pages 36–43, New York, NY, USA, 2000. ACM.

[Lei09] Charles E. Leiserson. The Cilk++ concurrency platform. In Proceedings of the 46th Annual Design Automation Conference, DAC '09, pages 522–527, New York, NY, USA, 2009. ACM.

[LHKK79] C. L. Lawson, R. J. Hanson, D. R. Kincaid, and F. T. Krogh. Basic linear algebra subprograms for Fortran usage. ACM Trans. Math. Softw., 5:308–323, September 1979.

[Met09] Manuel Metzmann. Implementation, verification and performance measurement of concurrent data structures using new synchronization primitives. Diploma thesis, Technische Universität Kaiserslautern, March 2009.

[PBL07] Josep M. Perez, Rosa M. Badia, and Jesus Labarta. A flexible and portable programming model for SMP and multi-cores. Technical report, Barcelona Supercomputing Center, March 2007.

[PBL08] Josep M. Perez, Rosa M. Badia, and Jesus Labarta. A dependency-aware task-based programming environment for multi-core architectures. In Proceedings of the 2008 IEEE International Conference on Cluster Computing, pages 142–151, September 2008.

[PBL10] Josep M. Perez, Rosa M. Badia, and Jesus Labarta. Handling task dependencies under strided and aliased references. In Proceedings of the 24th ACM International Conference on Supercomputing, ICS '10, pages 263–274, New York, NY, USA, 2010. ACM.

[PFF+ 07] F. Petrini, G. Fossum, J. Fernandez, A.L. Varbanescu, N. Kistler, and M. Perrone. Multicore surprises: Lessons learned from optimizing Sweep3D on the Cell Broadband Engine. In Parallel and Distributed Processing Symposium, 2007. IPDPS 2007. IEEE International, pages 1–10, 2007.

[Rei07] James Reinders. Intel Threading Building Blocks. O'Reilly & Associates, Inc., Sebastopol, CA, USA, first edition, 2007.

[Rem08] Jens Remus. Konzeption und Entwicklung einer Cop/Thief Work-Stealing Laufzeitumgebung zur parallelen Ausführung von Unterprogrammen. Diploma thesis, Fachhochschule Wedel, February 2008.

[Sav10] Vlad Savov. Exclusive: LG's 4-inch Android phone with dual-core Tegra 2 and 1080p video coming in early 2011. Engadget, November 18th 2010. Retrieved December 14th, 2010.

[SB99] Burkhard D. Steinmacher-Burow. An alternative implementation of routines. http://www-zeus.desy.de/~funnel/TSIA/talks/ifl.pdf.gz, October 5th 1999.

[SB00a] Burkhard D. Steinmacher-Burow. Task frames. http://arxiv.org/abs/cs.PL/0004011, 2000.

[SB00b] Burkhard D. Steinmacher-Burow. TSIA: A dataflow model. http://arxiv.org/abs/cs.PL/0003010, 2000.

[SBWR08] Burkhard D. Steinmacher-Burow, Sven Wagner, and Jens Remus. A modular approach to parallel applications. October 2008.

[Shi10] Robert Shiveley. Performance scaling in the multi-core era. Intel Software Network, http://software.intel.com/en-us/articles/performance-scaling-in-the-multi-core-era/, 2008. Retrieved December 14th, 2010.

[SMP10] SMP Superscalar. http://www.bsc.es/plantillaG.php?cat_id=385, Retrieved December 15th, 2010.

[SS95] James E. Smith and Gurindar S. Sohi. The microarchitecture of superscalar processors. Proceedings of the IEEE, 83(12):1609–1624, December 1995.

[SYD09] Fengguang Song, Asim YarKhan, and Jack Dongarra. Dynamic task scheduling for linear algebra algorithms on distributed-memory multicore systems. In Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis, SC '09, pages 19:1–19:11, New York, NY, USA, 2009. ACM.

[TBB10] Intel Threading Building Blocks 3.0 for open source. http://www.threadingbuildingblocks.org/, Retrieved December 15th, 2010.
[TCBV10] Alexandros Tzannes, George C. Caragea, Rajeev Barua, and Uzi Vishkin. Lazy binary-splitting: a run-time adaptive work-stealing scheduler. In Proceedings of the 15th ACM SIGPLAN symposium on Principles and practice of parallel programming, PPoPP '10, pages 179–190, New York, NY, USA, 2010. ACM.

[U.S10] U.S. Department of Energy. Secretary Chu announces $47 million to improve efficiency in information technology and communications sectors. http://www1.eere.energy.gov/recovery/news_detail.html?news_id=15705, January 6th 2010. Retrieved December 14th, 2010.

[VMw10] VMware Inc. How VMware virtualization right-sizes IT infrastructure to reduce power consumption. http://www.vmware.com/files/pdf/WhitePaper_ReducePowerConsumption.pdf, 2010. Retrieved December 14th, 2010.

[Wag07] Sven Wagner. Konzeption und Entwicklung eines neuen Compiler "CESC" zur Implementierung von Prozeduren als atomare Tasks. Diploma thesis, Fachhochschule Gießen-Friedberg, August 2007.

Selbstständigkeitserklärung

I hereby declare that I have written this thesis independently and have used no sources or aids other than those indicated.

Böblingen, 28th February 2011

Sebastian Dörner