Programming with Relaxed Synchronization
Lakshminarayanan Renganarayana, Vijayalakshmi Srinivasan,
Ravi Nair, Daniel Prener
IBM T. J. Watson Research Center
Yorktown Heights, New York, USA
(lrengan, viji, nair, prener)@us.ibm.com
ABSTRACT
We propose a methodology for programming with relaxed
synchronization. This methodology can help to reduce, and
in some cases completely eliminate, synchronization overhead in parallel programs without sacrificing the quality of
results.
1. INTRODUCTION
Parallel computer systems are gaining in importance largely
because of the need to process the vast amount of data that
is being produced from all types of computing devices and
sensors. These systems are also being called upon to perform complex analyses of data, and to answer queries from
millions of users in real time. As systems get larger, however, they get more complex and hence more expensive and
more power-hungry. Both the acquisition cost and the running cost of computers can be contained by recognizing that
there is an accuracy implied by traditional computing that
is not needed in the processing of most new types of data.
This relaxation of accuracy can help in the wider exploitation of relatively more energy-efficient modes of computing, such as cluster computing. More importantly, relaxing the requirement of deterministic execution is in tune with the trend in parallel systems designed to compute results of an approximate nature.
One source of both hardware complexity and software
overhead in most modern parallel systems is the act of synchronization between parallel threads. Synchronization overhead is a major bottleneck in scaling parallel applications
to a large number of cores. This continues to be true in
spite of various synchronization-reduction techniques that
have been proposed. Previously studied synchronization-reduction techniques tacitly assume that all synchronizations specified in a source program are essential to guarantee the quality of the results produced by the program. In
this work we examine the validity of this assumption, and
propose an intellectual framework, Relax and Race (RaR),
which enables systematically relaxing the synchronizations
and demonstrates significant performance improvement on a variety of benchmarks.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.
RACES 2012
Copyright 20XX ACM X-XXXXX-XX-X/XX/XX ...$15.00.
Such synchronization may occur either in order to ensure
that fundamental data structures, e.g. linked lists, do not
break due to simultaneous manipulation of their structure
by different threads, or in order to ensure that threads reach
various points in their execution in a predictable manner, or
in order to ensure that all threads see consistent values when
shared variables are updated. While the first of these reasons
for synchronization cannot typically be relaxed because it
could lead to fatal behavior of the program, the second, and
even more often, the third reason can be relaxed with no
visible effect in many cases and acceptable effect in other
cases.
Traditional synchronization techniques such as data privatization [2][11], lock-free synchronization [10][7][1], and transactional systems [7][14] have helped considerably in reducing synchronization overheads. These techniques primarily use careful analysis to identify situations where synchronization is not needed, or use hardware or software structures that are able to detect and correct incorrect behavior due to relaxed synchronization. However, these approaches add complexity along other dimensions, while not fundamentally alleviating the barrier that synchronization poses to the scaling
of parallel applications. We are therefore led to ask what
the effect would be if synchronization were omitted in situations where such an omission would not lead to catastrophic
behavior of the system. Would the program produce results
that are acceptable? Furthermore, if the results are acceptable, would they be produced in considerably less time? Or
with considerably lower expenditure of energy?
This is what we set out to explore in this work. We first
consider the class of computations for which the quality of
results can be quantified. Several computations from important benchmark suites such as PARSEC, STAMP, and
NU-MineBench, belong to this class. In many of these computations the result produced by the original program with
programmer-specified synchronization is not necessarily unique;
it is often just one in a domain of results characterized
through some quality metric as acceptable. The convergence
criterion in iterative convergence algorithms, for example,
provides such a quality metric. We show several examples
where, by relaxing synchronization, we can satisfy the same
quality metric while executing in less time, often considerably less (up to 15x faster for Kmeans) than the original. We
also show through the Graph500 benchmark [6] an example of how a quality criterion can be developed for an algorithm based on knowledge of the application that uses the
algorithm. Based on our experience with relaxing synchronization on several real benchmarks that use different parallelism models, we propose a methodology for programming with relaxed synchronization.

Figure 1: Synchronization overhead in the OpenMP-based Kmeans benchmark from NU-MineBench. (Bar chart of normalized cluster time, split into synchronization and compute, versus the number of threads, 2 to 8.)

2. RELAXED SYNCHRONIZATION

Synchronization overhead is often a major performance-limiting factor in parallel applications. Figure 1 shows the parallel execution time on an IBM POWER7 machine of the Kmeans [12] benchmark with the goal of finding 8 clusters using the supplied large input set. It is clear from the figure that synchronization time dominates the total execution time and is as high as 90% for 8 threads. We now propose a methodology that can be used to reduce (and in some cases completely eliminate) synchronization overhead.

The goal of our methodology is to identify the key steps necessary to reap the performance and scalability potential of relaxed synchronization. The first step is to choose parallel code regions in the program that use synchronizations. For each parallel code region, the following four key steps are needed to succeed in exploiting relaxed synchronization:

1. Identify criteria that quantify the quality or acceptability of the solution

2. Choose relaxation points

3. Restructure code to enable switching execution between the original and relaxed versions based on the usage context

4. Select a profitable degree of relaxation via experimentation

The subsequent sections discuss each of these steps.

Table 1: Applications that are suitable for relaxing synchronization. These applications choose a good solution from many correct solutions, and use quality criteria such as convergence tests, fixpoint tests, or boolean acceptability tests. All of them are suited for parallelization too.

Domain / Applications | Examples
Scientific Computing | Finite difference methods, Successive over relaxation, N-body methods
Data mining, Text mining, Risk Analysis, and Machine learning | Kmeans, Hierarchical clustering, Multinomial regression, Feature reduction
Numerical optimization, Forecasting, Financial Modeling, and Mathematical programming | Non-convex optimization solvers, constrained optimization, multi-dimensional cubing
EDA | VLSI place and route
Graph processing and Social Network Analysis | Mesh refinement, Metis-style graph partitioning, Graph traversals
Program analyses in compilers and software engineering | Iterative data flow analysis

2.1 Identify quality or acceptability condition

Many applications compute and produce one solution out of many correct solutions of specified quality. The exact value of the solution is less important for these applications than the quality of the result. The acceptability of any solution meeting the specified quality provides programmers, compilers, and runtime systems the freedom to make choices in order to produce solutions that are often significantly faster to compute than others.

The first step is to identify or make explicit a condition that quantifies the quality or acceptability of a solution. For example, in the Kmeans computation one such metric is the change in the membership of the clusters between two successive iterations. The convergence criterion would be that this metric be smaller than a given value in order for the solution to be considered acceptable. Thus multiple solutions could satisfy this criterion. When an acceptability criterion is identified, it is important to ensure that the user of the results of the corresponding computation accepts any solution that meets the acceptability criterion. In our experience, several application domains have computations that naturally have such quality/acceptability criteria. Table 1 lists several application domains and sample applications that have this property.

2.2 Choose relaxation points

In general, the use of synchronization in parallel programs can be classified into three broad categories: (i) to preserve the integrity of data structures (e.g., malloc(), and insert/delete of nodes in a tree), (ii) for barrier-style collective synchronization, and (iii) to ensure that threads use the most recent value of a variable. The structural-integrity-preserving synchronizations, the first category, are typically essential and do not lend themselves to relaxation. We have indeed observed the abnormal termination of applications when synchronization was relaxed indiscriminately. The second category, collective synchronizations such as barriers, are good candidates for relaxation, and examples of such relaxation are shown in [8] for the NAS parallel benchmarks using OpenMP directives. The third category of synchronizations provides excellent candidates for relaxation, and this is the class we explore in this work.
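The quality criterion of Section 2.1 can be made concrete with a short sketch. The function name, the membership arrays, and the threshold below are our own illustrative assumptions in the spirit of the Kmeans convergence test, not code taken from the benchmark:

```c
#include <stddef.h>

/* Sketch of a Kmeans-style acceptability test: the solution is
 * considered acceptable when the fraction of points that changed
 * cluster membership between two successive iterations is below
 * a user-chosen threshold. */
int quality_not_acceptable(const int *prev_membership,
                           const int *curr_membership,
                           size_t n, double threshold)
{
    size_t changed = 0;
    for (size_t i = 0; i < n; i++)
        if (prev_membership[i] != curr_membership[i])
            changed++;
    /* Nonzero means "not acceptable": too many points moved. */
    return (double)changed / (double)n >= threshold;
}
```

Any clustering for which this predicate returns 0 is acceptable, so many distinct solutions can pass; this is exactly the freedom that relaxed synchronization exploits.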
2.3 Relax and Race: A scheme for systematic use of relaxed synchronization

Once we have identified acceptability criteria and relaxation points, we are ready to evaluate the profitability of relaxing the synchronization with respect to performance or scalability. An important goal during this task is that the relaxed version should always produce a result that satisfies the acceptability criterion. In this section we outline a simple but widely applicable scheme, called Relax and Race (RaR), that enables developers to restructure their code so that an acceptable solution is guaranteed to be produced. The expectation is that on most occasions the solution is obtained faster than that produced by the original, unrelaxed program.

Consider a function f(input) that is suited for relaxing synchronization. We assume that f() has a parallel computation with synchronization and that f() has its quality criterion defined by a function, say, qualityNotAcceptable(). Once a set of profitable synchronization points is determined, we create two versions, f_original(input) and f_relaxed(input), which respectively correspond to the original and relaxed versions of f().

    result = f_original(input);

    ⇓

    result = f_relaxed(input);
    if (qualityNotAcceptable(result)) {
        //compute result using the original version
        result = f_original(input);
    }

Figure 2: The RaR code restructuring scheme. It uses the original and relaxed versions, together with the quality criteria, in a structured way that guarantees the same acceptable quality of results for all executions of the relaxed version.

The RaR scheme is shown in Figure 2. At every program point where f() is called, the call can be replaced by the code structure in Figure 2. The RaR scheme first computes the result with the relaxed version. This simple scheme gives the relaxed version an opportunity to produce an acceptable result with a significant gain in performance. If the result does not meet the quality criterion, it is recomputed with the original version. In such cases, we pay a performance penalty because the original version has to be executed after finding out that the results of the relaxed version did not meet the acceptability condition. For some situations, it may be possible to use a slightly more complex code restructuring scheme which detects unacceptable behavior before completing the relaxed computation, thereby mitigating the performance penalty.

A key property of the RaR scheme is that the result computed is always guaranteed to be as acceptable as the original. This result could be computed by using only the relaxed version or by using both the relaxed and original versions, but since the same acceptability criterion is used by both versions, we are guaranteed an equally acceptable result. This ensures that, with the RaR scheme, there is no acceptability impact on a program from using relaxed synchronization.

2.4 Select profitable degree of relaxation

For any synchronization point selected for relaxation, one can relax either all the (dynamic) instances of that synchronization, some of the instances, or none of the instances. We refer to this spectrum of relaxation choices as the degree of relaxation of a synchronization. The most profitable degree of relaxation depends on many factors, including the platform of execution and the input data size. Hence the choice of degree of relaxation is empirical; we have found that it is useful to restructure the code so that this degree of relaxation can be controlled via a parameter. A realization of this for OpenMP-style parallel loops and critical sections is shown in Figure 3.

    #pragma omp parallel for private(i) \
        shared(N,numbers,sum) schedule(static)
    for ( i=0; i < N; i++) {
        #pragma omp critical
        {
            sum += numbers[i];
        }
    }

    ⇓

    #pragma omp parallel for private(i) \
        shared(N,numbers,sum) schedule(static)
    for ( i=0; i < N; i++) {
        if ( i % relaxFactor == 0 ) { // sync free
            sum += numbers[i];
        } else {
            #pragma omp critical
            {
                sum += numbers[i];
            }
        }
    }

Figure 3: Parameterized control of the degree of relaxation. The variable relaxFactor controls the degree of relaxation. For example, a value of 1 for relaxFactor would result in relaxing all the synchronizations, and a value of 2 would result in relaxing every other synchronization.

In a typical usage scenario, the application developer would relax the synchronizations in the input program and measure the performance and quality of results for representative input data sets during the program development phase. Profitable synchronization points and relaxation factors would then be chosen based on the observed performance. The selection of profitable synchronization points and of the degree of relaxation, and the generation of code using the RaR scheme, can be automated using standard iterative auto-tuner techniques, such as the one used by the popular ATLAS BLAS library generator [15, 5], or by iterative compilers [9]. While we currently do this manually, we plan to use the aforementioned techniques to automate this process.
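Putting the four steps together, the sketch below applies the RaR restructuring of Figure 2 to the summation kernel of Figure 3. The function names, the exact-sum acceptability test, and the long accumulator are our own illustrative assumptions; with OpenMP enabled the relaxed version may race on sum, and without OpenMP the pragmas are simply ignored and the code runs sequentially:

```c
#include <stddef.h>

#define N 1000
int numbers[N];

/* Original version: every update of sum is protected. */
long sum_original(void)
{
    long sum = 0;
    #pragma omp parallel for shared(numbers, sum) schedule(static)
    for (int i = 0; i < N; i++) {
        #pragma omp critical
        { sum += numbers[i]; }
    }
    return sum;
}

/* Relaxed version: only every relaxFactor-th update keeps its
 * critical section; the rest are unsynchronized and may race. */
long sum_relaxed(int relaxFactor)
{
    long sum = 0;
    #pragma omp parallel for shared(numbers, sum) schedule(static)
    for (int i = 0; i < N; i++) {
        if (i % relaxFactor == 0) {   /* sync free */
            sum += numbers[i];
        } else {
            #pragma omp critical
            { sum += numbers[i]; }
        }
    }
    return sum;
}

/* Illustrative acceptability test: for a sum the exact answer is
 * known, so any deviation is unacceptable. */
int qualityNotAcceptable(long result, long expected)
{
    return result != expected;
}

/* RaR: try the relaxed version first; if its result fails the
 * quality test, recompute with the guaranteed-acceptable original. */
long sum_rar(int relaxFactor, long expected)
{
    long result = sum_relaxed(relaxFactor);
    if (qualityNotAcceptable(result, expected))
        result = sum_original();
    return result;
}
```

For a plain summation one would of course just use an OpenMP reduction; the kernel is used here only to mirror the structure of Figures 2 and 3 with a criterion that is cheap to check.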
3. EXPERIMENTAL SET-UP

We evaluate the performance of RaR for a set of benchmarks from NU-MineBench [12], STAMP [3], and Graph500 [6] using two different parallelism models on two major multicore systems. The benchmarks cover a broad set of computation classes including data mining, graph analysis, image processing, scientific, non-numeric, and graph traversal algorithms. The shared-memory architectures/platforms include an Intel x86 32-way machine and an IBM POWER7 32-way machine. The two different classes of parallelism/synchronization include explicit (OpenMP or pthreads) and implicit (transactional memory (STM)) models.
4. RESULTS

For each of the benchmarks studied, we use the large input data set from the sample data sets supplied with the benchmark. For evaluation of acceptability, we use the quality criterion specified in the benchmark code.

We relax the transactional memory (TM) benchmarks from STAMP by replacing one or more of the TM_SHARED_READ and TM_SHARED_WRITE operations with normal unsynchronized reads and writes. We relax the OpenMP benchmarks from NU-MineBench and the pthreads benchmark from Graph500 by removing one or more of the atomic or critical sections and allowing unsynchronized reads and writes to shared variables.

All the synchronization primitives which were relaxed in these benchmarks use full relaxation, i.e., all dynamic instances of synchronized updates for that primitive are performed without synchronization. Full relaxation is realized by setting relaxFactor to 1 in the implementations.

Overall, for each benchmark and for each parallel region where the synchronization was relaxed, we constructed two versions (original and relaxed) and applied the RaR scheme as outlined in Section 2. The benchmarks (except Graph500) already had quality criteria for the results they computed, and we used the same criteria to evaluate the quality of the solution produced by the relaxed version. For further details on the benchmarks we refer the reader to the technical report [13].

Kernels in the Graph500 benchmark suite capture the essence of graph computations that are used in a wide variety of contexts. We have chosen a particular usage context, viz., advertisement mailer list selection, to demonstrate the performance gains achievable by relaxing synchronization. For this context, we chose an appropriate quality criterion as described in Section 4.7. An outline of the algorithm used for the Graph500 problem is presented in [4].

We use speedup, computed as the ratio of the original to relaxed execution times, as the metric to show the performance benefits of relaxing synchronization.

4.1 Kmeans (NU-MineBench)

Kmeans uses thread-based synchronization and finds the best of 4 through 13 clusters. Figure 4 shows speedups of up to 13x with no impact on the acceptability of the results.

Figure 4: Speedup of relaxed versus original versions of Kmeans. Each group of bars on the X axis shows the number of clusters considered in each run of the Kmeans computation; the bars within a group represent the number of OpenMP threads used in the parallel execution. (Clusters 4 to 13; 1 to 8 threads.)

Figure 5: Speedup for Kmeans from the STAMP benchmark suite. (Speedup versus number of threads: 1, 2, 4, 8, 16.)

Figure 6: Speedup for SSCA2 from the STAMP benchmark suite. (Speedup versus number of threads: 1, 2, 4, 8, 16.)
4.2 Kmeans (STAMP)
Having seen excellent speedups and quality of results for the threads-based Kmeans computation, we wondered whether the achieved speedup stems from the choice of parallelization/synchronization, or whether the Kmeans computation itself is resilient to relaxing synchronization.
The STAMP Kmeans code is based on the Kmeans from
NU-Minebench, and uses TM-based synchronization.
Figure 5 shows speedup up to 15x for the large input data
set without any impact on the acceptability of the results.
4.3 SSCA2
Figure 6 shows the speedup using the large input data set
and used 1 through 16 threads. For all the runs, the relaxed
version produced results as acceptable as the original.
Infrequent concurrent updates to the adjacency list make SSCA2 an ideal candidate for synchronization relaxation.
However, the concurrent update operation is relatively small,
so very little time is spent in the transactions. Thus, the
benefits of relaxing synchronization are significant only with
large thread counts (see Figure 6).
4.4 Labyrinth
Figure 7 shows the speedup for the Labyrinth benchmark
from the STAMP suite. We observe modest speedups increasing with the number of threads. The benchmark has a
verification phase to validate the paths, to ensure that the
acceptability criterion, namely no overlapping paths, is met.
For the given input data set, there are 512 paths to be
routed. The original version routes 512 or fewer paths on
each run. The relaxed version and the original version route
all 512 paths for 1, 2, and 4 threads. For 8, 16 and 32 threads,
the original version routes 511 paths and the relaxed version
routes 509, 505, and 491 paths, respectively. These results
illustrate that both the original and the relaxed versions
meet the acceptability criterion stipulated in the benchmark, but by allowing unsynchronized writes, it is possible that fewer paths will be routed in the relaxed version compared to the original. The benchmark does not specifically stipulate the number of paths routed as a quality measure, and hence these results show the benefit of relaxing synchronization without impacting the acceptability of the results, at least for this phase of the problem.

Figure 7: Speedup for Labyrinth from the STAMP benchmark suite. (Speedup versus number of threads: 1, 2, 4, 8, 16, 32.)

4.5 YADA

The opportunities for relaxation in this benchmark were few because most of the synchronizations protected memory allocation and deallocation. Consequently, RaR yields very little performance improvement, although it achieves the same quality as the original.

4.6 Fuzzy Kmeans

Our analysis of this algorithm shows that there is a high degree of contention in updating clusters, and that this contention increases with the number of threads. For 8 and 16 threads, as shown in Figure 8, the relaxed version does not converge to an acceptable solution. The RaR scheme detects this and recomputes the result using the original version. Overall, we observe a slowdown for the RaR scheme relative to the original version.

Figure 8: Fuzzy Kmeans from the NU-MineBench suite. The relaxed version does not produce an acceptable result, so the result is re-computed with the original version and the application experiences a slowdown. This graph shows the total execution time of the original and relaxed versions for 8 and 16 threads.

4.7 Graph500

Figure 9: Speedup for Breadth-First Search from the Graph500 benchmark. (Speedup versus number of threads: 1, 2, 4, 8, 16.)
We consider the Breadth-First Search (BFS) problem which
forms the core of several graph applications and involves the
determination of reachability of nodes in a graph to select a
group of related nodes. One such application is mailer list
selection, where BFS from a given node is used to select a
list of potential buyers to mail product advertisements. This
application seeks to select a predefined number of potential
buyers (or nodes) that are reachable from a given node. One
can imagine several quality criteria for different BFS usage
contexts. For this problem we could use the number of reachable nodes to be produced (the size of the desired mailing list) as the quality criterion. This highlights the importance of considering relaxed synchronization in the context of use of the program.
We use an award-winning parallel implementation [4] of
the BFS kernel used in the Graph500 benchmark, and show
that by using our methodology we can achieve up to 3x performance improvement. For our experiments, we use random scale-free graphs with different numbers of vertices.
For each run the application started with a random root
and set out to list all the vertices reachable from the root
using a breadth-first search. Each iteration advances the
wavefront of visited nodes by an additional unit distance
from the root. The parallel implementation has an atomic
update of the list of nodes reached by a wavefront. This
atomic update was replaced by a non-atomic update in the
relaxed version. This potentially caused the eventual list to
have fewer visited vertices compared to the original.
Our results are shown in Figure 9. The experiments were
done with a range of CPUs from 1 to 16, one thread per
CPU, for a scale factor of 24 (16 million vertices), and with
a quality criterion of reaching at least 8 million vertices.
In all cases our relaxed version reached 8 million or more
vertices with a speedup compared to the original of between
2.3x and 3.0x.
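The mailer-list quality criterion described above can be made concrete with a small sketch. The graph representation, the helper names, and the required-count parameter below are illustrative assumptions; the actual kernel of [4] is a distributed, level-synchronous implementation:

```c
#define MAXV 64

/* Count the vertices reachable from `root` (including the root)
 * with a simple breadth-first search over an adjacency matrix. */
int bfs_reachable(int adj[MAXV][MAXV], int nv, int root)
{
    int visited[MAXV] = {0};
    int queue[MAXV];
    int head = 0, tail = 0, reached = 0;

    visited[root] = 1;
    queue[tail++] = root;
    while (head < tail) {
        int u = queue[head++];
        reached++;
        for (int v = 0; v < nv; v++) {
            if (adj[u][v] && !visited[v]) {
                visited[v] = 1;
                queue[tail++] = v;
            }
        }
    }
    return reached;
}

/* Mailer-list acceptability: a relaxed BFS may drop some vertices
 * from the visited list, but the run is still acceptable as long
 * as at least `required` potential buyers were reached. */
int bfs_quality_not_acceptable(int reached, int required)
{
    return reached < required;
}
```

In the experiments above, required would be 8 million vertices for a scale-24 graph; any relaxed run reaching at least that many passes the test without falling back to the original version.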
5. CONCLUSION
Synchronization overhead is a major performance-limiting
factor in parallel applications. In this work, we have shown
that significant performance speedups can be obtained by
relaxing the widespread assumption that all programmer-
specified synchronizations are essential to produce good quality results. We proposed the RaR scheme, which allows
programmers to systematically relax synchronizations in the
program to achieve performance improvement and still produce results that are of the same acceptable quality as the
original. Our experiments with a variety of benchmarks
show that relaxing synchronization with the RaR scheme
can achieve significant speedups; for example, up to 15x for
the Kmeans benchmark, with no degradation in the quality
of the results.
The results and insights from this work suggest that it
will be fruitful to investigate automatic methods to select
the relaxation points and degree of relaxation, as well as
methods to generate the relaxed version based on the RaR
scheme. We plan to pursue this in the future. The positive results presented here on relaxed synchronization are
a strong indication of the potential of the general area of
approximate computing, where we sacrifice the determinism and preciseness of the general computing paradigm as
practiced today for improved latency, reduced energy consumption, and lower system cost, while providing results
acceptably close to what would otherwise be possible.
Acknowledgements
We would like to thank Fabrizio Petrini and Fabio Checconi
for sharing with us their award-winning Graph500 code. We
would also like to thank Colin Blundell for his contributions
at the inception of this work.
6. REFERENCES

[1] J. Alemany and E. W. Felten. Performance issues in non-blocking synchronization on shared-memory multiprocessors. In Proceedings of the eleventh annual ACM symposium on Principles of distributed computing, PODC '92, pages 125–134, New York, NY, USA, 1992.
[2] F. Allen, M. Burke, R. Cytron, J. Ferrante, and W. Hsieh. A framework for determining useful parallelism. In Proceedings of the 2nd international conference on Supercomputing, ICS '88, pages 207–215, New York, NY, USA, 1988.
[3] C. Cao Minh, J. Chung, C. Kozyrakis, and K. Olukotun. STAMP: Stanford transactional applications for multi-processing. In IISWC '08: Proceedings of The IEEE International Symposium on Workload Characterization, September 2008.
[4] F. Checconi, F. Petrini, J. Willcock, A. Lumsdaine, Y. Sabharwal, and A. Choudhury. Breaking the Speed and Scalability Barriers for Graph Exploration on Distributed-Memory Machines. In Proceedings of The 2012 International Conference for High-Performance Computing, Networking, Storage, and Analysis, SC12, 2012.
[5] J. Demmel, J. Dongarra, V. Eijkhout, E. Fuentes, A. Petitet, R. Vuduc, R. Whaley, and K. Yelick. Self-Adapting Linear Algebra Algorithms and Software. Proceedings of the IEEE, 93(2):293, 2005.
[6] The Graph 500 benchmark. http://www.graph500.org/.
[7] M. Herlihy and J. E. B. Moss. Transactional memory: architectural support for lock-free data structures. In Proceedings of the 20th annual international symposium on computer architecture, ISCA '93, pages 289–300, New York, NY, USA, 1993.
[8] H. Jin, M. Frumkin, and J. Yan. The OpenMP implementation of NAS parallel benchmarks and its performance. Technical report, NASA Ames Research Center, October 1999.
[9] P. M. W. Knijnenburg, T. Kisuki, and M. F. P. O'Boyle. Iterative compilation. In Embedded processor design challenges: systems, architectures, modeling, and simulation (SAMOS), pages 171–187. Springer-Verlag, New York, NY, USA, 2002.
[10] L. Lamport. Concurrent reading and writing. Commun. ACM, 20:806–811, November 1977.
[11] Z. Li. Array privatization for parallel execution of loops. In Proceedings of the 6th international conference on Supercomputing, ICS '92, pages 313–322, New York, NY, USA, 1992.
[12] R. Narayanan, B. Ozisikyilmaz, J. Zambreno, G. Memik, and A. Choudhary. MineBench: A benchmark suite for data mining workloads. In Workload Characterization, 2006 IEEE International Symposium on, pages 182–188, October 2006.
[13] L. Renganarayana, V. Srinivasan, R. Nair, D. Prener, and C. Blundell. Relaxing synchronization for performance and insight. Technical Report RC25256, IBM T. J. Watson Research Center, October 2011.
[14] N. Shavit and D. Touitou. Software transactional memory. In Proceedings of the fourteenth annual ACM symposium on Principles of distributed computing, PODC '95, pages 204–213, New York, NY, USA, 1995.
[15] R. C. Whaley and J. J. Dongarra. Automatically tuned linear algebra software. In Proceedings of the 1998 ACM/IEEE conference on Supercomputing (CDROM), pages 1–27. IEEE Computer Society, 1998.