A Scalable, Sound, Eventually-Complete Algorithm
for Deadlock Immunity
Horatiu Jula and George Candea
http://dimmunix.epfl.ch/
EPFL, Switzerland
Abstract. We introduce the concept of deadlock immunity—a program’s ability to
avoid all deadlocks that match patterns of deadlocks experienced in the past. We
present here an algorithm for enabling large software systems to automatically acquire
such immunity without any programmer assistance. We prove that the algorithm is
sound and complete under reasonable assumptions. We implemented the described
technique in Java and present here measurements that show it introduces only modest
performance overhead in real applications (e.g., < 6% in JBoss). Since deadlock immunity is as powerful as complete freedom from deadlocks in most practical cases, the
proposed technique takes a new step toward resolving the long-standing challenge of
ridding complex concurrent programs of their deadlocks.
1 Introduction
Writing concurrent software is a challenging task, because it requires careful reasoning about
complex interactions between concurrently-running threads. Programmers consider concurrency bugs to be some of the most insidious. An important category of such bugs results in
deadlocks—situations in which a set of threads cannot make forward progress because each
thread is waiting to acquire a lock held by another thread in that set. Avoiding the introduction of deadlock bugs during development is challenging, because large software systems
are developed by multiple teams totaling hundreds to thousands of programmers. Testing
is not a panacea, because exercising all possible execution paths and thread interleavings is
still infeasible for large programs. Debugging deadlocks is tedious, because they are hard to
reproduce and diagnose.
We expect deadlocks to become more frequent, as multi-core CPUs lead to higher degrees of concurrency and encourage new software systems to be increasingly parallel.
There have been proposals for making concurrent programming easier, such as transactional
memory [10], but issues concerning I/O and long-running operations still make it difficult to
provide atomicity transparently. Ironically, some of the most efficient transactional memory
implementations resort to locking and can run into deadlocks. We therefore believe that locks
will continue being a primary vehicle for synchronization in multi-threaded applications.
There are numerous approaches to eliminating deadlocks, both static [12, 7, 22] and dynamic [4, 23]. The static approaches typically aim to find deadlock bugs in the source code
and either let the programmer fix them, or automatically instrument the application with new
locks that introduce serialization in the deadlock-prone code. The challenge faced by static
approaches is that they either generate many false positives (i.e., wrongly identify deadlock
bugs) and burden programmers with sifting through the reports to pick out the true bugs, or
they do not scale to large applications due to resource consumption that is exponential in the
size of the analyzed program. In fact, false positives vs. scalability appears to be an essential
tradeoff in static techniques for finding deadlocks.
Dynamic approaches often face a different challenge: false negatives. Since they rely exclusively on runtime information from the present execution (e.g., a lock trace), deadlocks
may still occur, because they cannot be predicted. In fact, the pure version of the deadlock
avoidance problem is generally undecidable, because it can be reduced to the halting problem
(see Appendix A). One way to simplify the problem and circumvent undecidability is to save
deadlock information that persists across executions, and leverage this knowledge to avoid
solely the already-encountered deadlocks.
Our proposed approach detects deadlocks at runtime and saves the contexts in which
they occurred, in order to avoid these contexts in future runs, thus achieving “immunity”
against them. In this report, we refer to such contexts as deadlock patterns or templates. To
avoid previously-seen deadlocks we employ program steering [14] and automatically change
the scheduling of threads. A program with deadlock immunity will progressively eliminate
the manifestations of its deadlock bugs, by automatically changing the thread scheduling so
as to avoid at runtime a monotonically increasing set of deadlock templates—the contexts of
all deadlocks that occurred in past runs of the program. We expect this approach to result in
fewer false positives, because it relies on deadlock patterns that actually manifested, rather
than trying to infer potential deadlocks that may occur in the future. If a deadlock does not
have a pattern similar to one that was already encountered, our approach will not avoid that
deadlock, resulting in a false negative. In fact, the false negative rate of our approach is exactly
one per deadlock template, because all runs after the first occurrence will be free of that
deadlock pattern (see §3).
Fortunately, in practice, deadlock immunity is as useful as complete deadlock avoidance,
since the only difference is the one occurrence per deadlock pattern. Software users now have
the additional option of employing a tool based on our approach, instead of waiting for the
deadlock bugs to be fixed by software producers. Deadlock immunity can either provide an
interim solution, until the deadlocks are fixed, or even a permanent one. Programmers may
find it too expensive or risky to fix certain bugs, and deadlock immunity can help.
This technical report makes three main contributions:
1. An algorithm for developing deadlock immunity with no assistance from programmers or
users.
2. A proof that this algorithm is sound (i.e., avoids deadlocks while preserving liveness) and
eventually-complete (i.e., all deadlocks are avoided after a finite number of steps).
3. Experimental evidence that our algorithm can scale to large programs (e.g., < 6% overhead
in JBoss, a system with > 350,000 lines of Java code) and to large numbers of threads
(< 60% overhead for 256 threads in a synchronization-intensive microbenchmark).
The rest of this report is structured as follows: we first describe the deadlock immunity
algorithm (§2), then prove its soundness and eventual completeness (§3) and analyze its
complexity (§4); we describe an implementation of the algorithm for Java programs (§5) and
evaluate the effectiveness and performance of this implementation (§6). We survey related
work (§7), discuss future work (§8), and conclude (§9).
2 Algorithm for Deadlock Immunity
After an overview of the proposed algorithm together with the necessary definitions (§2.1),
we describe in detail the instrumentation needed to intercept lock/unlock requests (§2.2) and
the two main parts of our approach: the avoidance algorithm (§2.3) and detection algorithm
(§2.4). We have devised two versions of the immunity algorithm: a synchronized one, which
uses a global lock for internal shared data structures, and a lock-free version, which eliminates
this central point of synchronization. The synchronized version is simpler, and we focus on it
in this section; the lock-free version appears in Appendix B.
2.1 Overview and Definitions
The deadlock immunity algorithm applies to the following abstract model of a multi-threaded
program: there is a finite number of threads performing synchronization operations (i.e., lock
and unlock) on a finite number of shared mutex locks. When a thread t performs a lock(l) on
mutex l, it follows 3 steps: (1) t requests lock l; then (2) t waits until l becomes free, i.e., not
held by any other thread; and finally (3) t acquires l. A thread can request only one lock at a
time, and a lock can be held by only one thread at a time. When t performs an unlock(l) on l,
it releases l, which becomes available to be acquired by other threads. A code region protected
by a lock l (i.e., situated between a lock(l) and unlock(l)) is called a critical section. When
a thread performs lock(l’) within the critical section of a lock l, we say the thread is doing
a nested lock if l′ ≠ l and, when l = l′, we call it a reentrant lock. The immunity algorithm
supports both reentrant and nested locking. We identify the program position p at which a
thread requests or acquires a lock as the offset of the corresponding instruction within the
program source code or binary.
A set of threads is deadlocked iff every thread from that set is waiting (step 2 above) for
a lock held by another thread in that set. The immediate cause of a deadlock is a
given sequence of lock acquisitions that have led to the situation described above. A deadlock
avoidance mechanism generally tries to predict impending deadlocks and dynamically reorder
the lock acquisitions in order to avoid the predicted deadlocks.
The deadlock immunity algorithm consists of two parts, one that detects deadlocks and
another that avoids them by forcing threads to yield when they are approaching a previouslyseen deadlock. In order to detect deadlocks, we maintain a standard resource allocation graph
RAG=[V, E]. The vertices v ∈ V are threads and locks, and the edges e ∈ E are request,
hold, grant, or yield edges. When the value of an edge label or an endpoint is irrelevant, we
mark it with ∗ (as in v1 →y(∗) v2, or v1 →y(∗) ∗).
A request edge t →r l represents thread t requesting permission to wait for lock l (step 1 above). A grant edge t →g(p) l represents the fact that thread t was granted permission by the algorithm to wait for lock l at position p in the program (i.e., to enter step 2); we write the position label in parentheses after the edge kind. A hold edge t ←h(p) l indicates that t has acquired lock l at position p (step 3). A group of yield edges t →y(p1) t1, ..., t →y(pn) tn indicates that thread t was forced to yield because threads t1, ..., tn had acquired (or were granted) locks at positions p1, ..., pn; the use of yield edges will become more apparent
later on. In summary, a request edge is the manifestation in the RAG of a lock request and
a hold edge the manifestation of a lock acquisition. A grant edge reflects the algorithm’s
decision to allow a thread to do a blocking wait for a lock, while a yield edge captures the
immunity algorithm’s decision to pause a thread in order to avoid a potential deadlock.
A deadlock appears as a cycle in the RAG that involves exclusively request, grant, and
hold edges (i.e., no yield edges). When avoiding deadlocks using thread yields, the potential
for livelock exists, in which a set of threads makes no forward progress as they are either
blocked or they repeatedly yield, waiting for some other thread in the set to release a lock.
We call such livelocks avoidance-induced livelocks. We say that a cycle in the RAG is a yield
cycle iff all yield edges emerging from its nodes belong to yield cycles; a yield cycle is the
representation of avoidance-induced livelocks in the RAG. Another way to think of avoidance-induced livelocks is as a group (conjunction) of yield cycles that intersect in a vertex v of the
RAG as well as in all yield edges that emerge from v.
The yield cycle construct enables the immunity algorithm to detect and avoid all avoidance-induced livelocks in the exact same way it detects and avoids deadlocks. To illustrate the concept in an intuitive manner, consider Figure 1: for thread t1 to be livelocked, all of its yield edges must be part of cycles, as well as all of t4's yield edges, since t4 is in one of t1's yield cycles. If the RAG had solely the (t1, t2, ..., t1) and (t1, t3, l, t4, t6, ..., t1) cycles, then there would be no livelock, because t4 could "evade" livelock through t5, allowing t1 to "evade" through t3. If, as in Figure 1, cycle (t1, t3, l, t4, t5, ..., t1) is also present, then the threads have no way to evade an "eternal wait" and they are thus livelocked, as neither can make forward progress.
[Fig. 1. Livelocked threads and yield cycles (threads t1...t6, lock l; grant, ownership, and yield edges).]
We use instruction location information to capture and save templates of deadlocks and
induced livelocks. Remember, the program position p is an abstraction that denotes the
location of an instruction in the source code or binary. A template is the set of program
positions (edge labels) corresponding to the edges of a cycle in the RAG; remember that only
grant, hold, and yield edges carry labels. For example, the template of the cycle t1 →r l1 →h(p1) t2 →r ... ln →h(pn) t1 is {p1, ..., pn}. A template instance is an instantiation of a template in a
program execution, i.e., a set of (thread, position) tuples, representing distinct threads that
are currently holding or have been granted locks at positions corresponding to the template,
that cover all positions in the template; e.g., {(t1 , p1 ), ..., (tn , pn )} could be an instantiation
of {p1 , ..., pn } given the current state of the program.
Templates are analogous to “antibodies”—the algorithm saves them to persistent storage and avoids their re-instantiation in all future executions. Since both deadlocks and avoidance-induced livelocks have their templates saved to the same template history, we will refer to all of them simply as templates, to the unified deadlock and induced livelock history simply as the history, and to deadlocks and avoidance-induced livelocks simply as cycles when no distinction needs to be made.
An immunizing tool based on the proposed algorithm instruments programs so that all lock and unlock operations are intercepted and relayed to the immunity algorithm's avoidance module. This module shares an event queue and the history with the detection module, as illustrated in Figure 2.
[Fig. 2. Architecture of an immunizing tool: application threads relay lock(l)/unlock(l) requests to the avoidance module, which makes grant/yield decisions and enqueues events; the detection module periodically processes the notifications, searches for cycles, saves them to the history, and restarts the program if cycles are found.]
The avoidance module runs synchronously with the application, in that it is invoked on every lock request, acquisition, or release; avoidance decisions are made only for lock requests. This module is responsible for avoiding cycles, based on the templates stored in the history. The avoidance module notifies the detection component asynchronously about events (lock request/acquisition/release) and decisions (grant/yield) using an event queue. On every lock request, this module checks
whether any templates in the history would be instantiated by granting the requested lock.
If not, the thread is allowed to proceed with locking; otherwise, the thread is forced to yield.
The detection module runs asynchronously, in parallel with the program’s threads. It
periodically updates the RAG based on notifications received from the avoidance module,
detects cycles in the RAG, and saves these cycles’ templates to the history. It then restarts
(all or a subset of) the deadlocked application’s threads.
In summary, the proposed approach has three key features: (1) it captures execution-independent templates of previously-encountered deadlocks and avoids future instantiations
of these templates; (2) livelocks induced by avoidance are detected and avoided in exactly the
same way as deadlocks; (3) cycle detection is asynchronous and periodic, done in a separate
thread, so as to remove this expensive computation from the critical path; given that a
deadlocked application is not making any progress anyway, this asynchrony does not hurt
anything except recovery time (which can be tuned by selecting a suitable period).
2.2 Instrumentation
The code (or binary) of the original program needs to be instrumented such that all lock and
unlock operations are intercepted; we refer to these as the “native” lock and unlock operations.
The instrumentation replaces each call to a native lock/unlock with corresponding code that
relays events to the avoidance module and exercises control over the scheduling of the calling
thread, as shown in Figure 3.
The data structures used by the instrumentation are explained in the next section; we present the instrumentation here to help the reader understand the mechanics of the runtime system. When an immunized thread t wants a lock l at position p, it asks for permission from the avoidance module (line 5); if this module asks the thread to yield, then the thread will sleep until permitted to proceed (lines 6-9); if the avoidance module allows the thread to proceed, t uses the native locking mechanism to acquire l and notifies the avoidance module of this event (lines 10-11). An unlock operation is analogous, but simpler. Note that this form of instrumentation is just one design choice; several others exist, that can be implemented inside the runtime, the OS kernel, etc.

lock wrapper( l )
1   t := current thread ID
2   p := current program position
3   if owner[l] ≠ t then // t does not hold l
4       events := events + [ request(t, l, p) ] // notify cycle detector
5       while Request(t, l, p) = YIELD do // while unsafe to proceed
6           native lock( yieldLock[t] ) // t's condition variable
7           if yieldCause[t] ≠ null then // if t not woken up already
8               yieldLock[t].wait() // wait until woken up, then retry
9           native unlock( yieldLock[t] )
10  native lock( l )
11  Acquired( t, l, p )

unlock wrapper( l )
1   t := current thread ID
2   Release( t, l )
3   native unlock( l )

Fig. 3. Instrumentation to replace native lock/unlock operations.
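To make the wrapper concrete, below is a minimal Java sketch in the same spirit, assuming a hypothetical AvoidanceModule interface standing in for the Request/Acquired/Release operations of §2.3; it omits reentrancy handling (the owner[l] check above) and is only an illustration, not our actual implementation (which instruments monitorenter/monitorexit, see §5).

import java.util.concurrent.locks.Lock;

// Hypothetical facade mirroring the avoidance module's operations of Section 2.3.
interface AvoidanceModule {
    enum Decision { OK, YIELD }
    Decision request(long threadId, Object lock, String position);
    void acquired(long threadId, Object lock, String position);
    void release(long threadId, Object lock);
    // Pauses the caller on its yieldLock until some (t', p') in its yieldCause is deactivated.
    void awaitWakeup(long threadId) throws InterruptedException;
}

// Sketch of the lock/unlock wrappers of Fig. 3, applied to a java.util.concurrent lock;
// reentrancy handling (the owner[l] check) is omitted for brevity.
final class ImmunizedLock {
    private final Lock nativeLock;            // the "native" lock being wrapped
    private final AvoidanceModule avoidance;  // shared avoidance module
    private final String position;            // file:line of the lock statement (see Section 5)

    ImmunizedLock(Lock nativeLock, AvoidanceModule avoidance, String position) {
        this.nativeLock = nativeLock;
        this.avoidance = avoidance;
        this.position = position;
    }

    void lock() throws InterruptedException {
        long t = Thread.currentThread().getId();
        // While it is unsafe to proceed, sleep until woken up, then ask again.
        while (avoidance.request(t, nativeLock, position) == AvoidanceModule.Decision.YIELD) {
            avoidance.awaitWakeup(t);
        }
        nativeLock.lock();                                  // native acquisition
        avoidance.acquired(t, nativeLock, position);
    }

    void unlock() {
        avoidance.release(Thread.currentThread().getId(), nativeLock);
        nativeLock.unlock();
    }
}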
2.3 The Avoidance Module
The avoidance module is responsible for controlling the schedule of threads so as to avoid
previously-encountered deadlocks and avoidance-induced livelocks. The interface offered by
this module to the instrumented threads consists of three operations: Request, Acquired, and
Release, which process lock requests, acquisitions, and releases, respectively. As mentioned earlier, we present here the synchronized version of the algorithm, which uses a global lock to
ensure atomicity of Request and Release. The lock-free version (see Appendix B) eliminates
this global lock, but requires changes to Request and Release relative to what is shown here, as
well as changes to the instrumentation of lock operations; these changes would obscure the essence
of the algorithm.
The Request, Acquired and Release operations are described in Figure 5. The core avoidance occurs in Request, whose return value determines whether a thread is paused or allowed
to proceed. But, before discussing the algorithms in any more detail, we describe the data
structures they employ.
Since avoidance is performed synchronously with calling threads, we aim to minimize the
amount of work performed in the critical path; for this we choose fine-grained, efficient data
structures in the avoidance module. The avoidance module shares only two data structures
with the cycle detector: an event queue (updated by the avoidance module and read by
the detector) and the template history (updated by the detector and read by the avoidance
module). Cycle detection requires a consistent view of the RAG, and thus exclusive access
to it; it is also a complex operation, and would therefore hold the RAG locked for extended periods of
time. We therefore opted to have all RAG updates be performed in the detection module,
based on the events received through the queue. We have thus achieved strong isolation
between the two modules.
The avoidance module makes use of the following data structures:
– lockGrantees[p] : a multiset1 containing the threads that hold (or are granted) locks at
position p; it is initially empty. This is the data structure used in searching for template
instantiations.
– history : the set of templates of previously-encountered cycles. It is persistent, in that all
updates are saved on disk, to be available in subsequent executions. At program startup,
history is loaded in memory.
– yieldCause[t] : the cause of thread t’s yield; yieldCause[t] has the same structure as a
template instance, because it is in effect a subset of a real template instance. All elements
in the array are initially null.
– yielders[t, p] : the set of threads that are currently paused and have (t, p) in their yieldCause;
the yielders map is initially empty.
– owner[l] : the thread currently holding lock l; it is initially null.
– acqPos[l] : the program position where lock l was acquired by its current owner; it is
initially null.
– nLockings[l] : the number of times lock l was reentrantly locked; it is initially 0.
– yieldLock[t] : a lock (condition variable) used for pausing or waking up thread t.
– avoidanceLock : the global lock used to ensure atomicity of avoidance operations.
When a thread requests a lock, the avoidance module checks whether granting that lock would instantiate any of the templates currently in history. An instantiation of template T = {p1, ..., pn} is a set of (thread, position) tuples representing distinct threads t that hold (or are granted) locks at positions p from T (i.e., t ∈ lockGrantees[p]), with all positions being covered (i.e., ∀p ∈ T : lockGrantees[p] ≠ ∅). Thus, a template T would be instantiated by thread t being granted a lock at position p iff p ∈ T and T − {p} is already covered by threads different from t (i.e., instance(T − {p}, {t}) ≠ null).

¹ A multiset is a set whose elements can be present more than once, i.e., a vector where the order of elements doesn't matter. An element x is added to a multiset M using the ⊎ operator and removed from M using the \ operator. For a multiset M and an element x, 1_M(x) represents the number of times x was added to M (M := M ⊎ {x}) less the number of times it was removed (M := M \ {x}). x is deleted from M only when 1_M(x) reaches zero.
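For illustration, the multiset of footnote 1 can be realized with a plain counting map; this is a hypothetical sketch, not the exact structure used in our implementation.

import java.util.HashMap;
import java.util.Map;

// Counting multiset in the spirit of footnote 1: add corresponds to the multiset union,
// remove to multiset difference, and count(x) to 1_M(x); an element disappears only when
// its count reaches zero.
final class Multiset<T> {
    private final Map<T, Integer> counts = new HashMap<T, Integer>();

    void add(T x) {                       // M := M ⊎ {x}
        Integer c = counts.get(x);
        counts.put(x, c == null ? 1 : c + 1);
    }

    void remove(T x) {                    // M := M \ {x}
        Integer c = counts.get(x);
        if (c == null) return;            // x not present
        if (c <= 1) counts.remove(x);     // count reached zero: delete x
        else counts.put(x, c - 1);
    }

    int count(T x) {                      // 1_M(x)
        Integer c = counts.get(x);
        return c == null ? 0 : c;
    }
}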
The templateInstance(t, p) helper, shown in Figure 4, returns a template instantiation that would occur if thread t were granted a lock at position p (line 4). If no potential template instantiations are found, templateInstance(t, p) returns null (line 5). The instance(T, exclThreads) helper (Figure 4) returns an instantiation of template T that does not involve any thread from exclThreads (line 8), or null if such an instantiation does not exist (line 9).

templateInstance( t, p )
1   foreach T ∈ history where p ∈ T do
2       templInstance := instance( T − {p}, {t} )
3       if templInstance ≠ null then
4           return templInstance
5   return null

instance( T, exclThreads )
1   if T = ∅ then
2       return ∅
3   else
4       pos := choose p ∈ T
5       foreach t ∈ lockGrantees[pos] where t ∉ exclThreads do
6           match := instance( T \ {pos}, exclThreads ∪ {t} )
7           if match ≠ null then
8               return {(t, pos)} ∪ match
9   return null

Fig. 4. Two helpers to match templates during avoidance.

If templateInstance(t, p) returns an instantiation {(t1, p1), ..., (tn, pn)}, then it means that yieldCause[t] = {(t1, p1), ..., (tn, pn)}, i.e., thread t has to wait until t1 releases all the locks it acquired at p1, or ..., or tn releases all the locks it acquired at pn. Whenever a thread t releases all locks acquired at position p, it wakes up all yielding threads ti for which (t, p) ∈ yieldCause[ti] by performing a notify on the corresponding condition variable.
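A Java rendering of the same backtracking search might look as follows; it is a simplified sketch in which lockGrantees maps a position to the set of thread ids registered there (ignoring the multiset counts), and all names are illustrative.

import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Sketch of templateInstance/instance (Fig. 4): find distinct threads, one per template
// position, that together cover the template and do not include the requesting thread.
final class TemplateMatcher {
    // position -> ids of threads currently holding (or granted) a lock at that position
    final Map<String, Set<Long>> lockGrantees = new HashMap<String, Set<Long>>();
    // previously saved templates, each a set of positions
    final Set<Set<String>> history = new HashSet<Set<String>>();

    // Would granting thread t a lock at position p instantiate some template in history?
    Map<Long, String> templateInstance(long t, String p) {
        for (Set<String> template : history) {
            if (!template.contains(p)) continue;
            Set<String> rest = new HashSet<String>(template);
            rest.remove(p);
            Set<Long> excluded = new HashSet<Long>();
            excluded.add(t);
            Map<Long, String> match = instance(rest, excluded);
            if (match != null) return match;     // potential instantiation: t must yield
        }
        return null;                             // safe: no template would be instantiated
    }

    // Returns a thread -> position assignment covering T with threads not in `excluded`, or null.
    private Map<Long, String> instance(Set<String> T, Set<Long> excluded) {
        if (T.isEmpty()) return new HashMap<Long, String>();
        String pos = T.iterator().next();        // choose some position of T
        Set<Long> grantees = lockGrantees.get(pos);
        if (grantees == null) return null;
        for (Long th : grantees) {
            if (excluded.contains(th)) continue;
            Set<String> rest = new HashSet<String>(T);
            rest.remove(pos);
            Set<Long> excludedPlus = new HashSet<Long>(excluded);
            excludedPlus.add(th);
            Map<Long, String> match = instance(rest, excludedPlus);
            if (match != null) {
                match.put(th, pos);
                return match;
            }
        }
        return null;
    }
}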
As can be seen in the instrumentation (Figure 3), we influence the thread schedule via a
simple wait mechanism that relies on the condition variable yieldLock[t] to pause thread
t. An alternative choice would be to use plain yield-based busy waiting, but that would be more
CPU-intensive in practice.
When Request is invoked, the avoidance module checks whether the calling thread t already holds l; if not, it notifies the detection module about the lock request and checks whether it is safe for thread t to proceed (i.e., no template can be instantiated). If it is unsafe, thread t is made to wait until it is woken up by one of the threads that would instantiate a template if t were allowed to proceed, and then the safety check is repeated. Once it is safe for t to proceed, the module grants lock l to t and lets t execute the lock statement.
Figure 5 presents the algorithms corresponding to Request, Acquired, and Release—the three core operations performed by the avoidance algorithm. The commented pseudocode is self-explanatory.
Request( t, l, p )
    native lock( avoidanceLock )
    yieldCause[t] := templateInstance( t, p ) // is it safe to proceed?
    if yieldCause[t] = null then // if safe to proceed, grant the lock
        lockGrantees[p] := lockGrantees[p] ⊎ {t} // register thread t to position p
        native unlock( avoidanceLock )
        events := events + [ grant(t, l, p) ] // notify the cycle detector
        return OK // safe to proceed
    else // wait until at least one pair (t′, p′) from yieldCause[t] is "deactivated" (t′ releases all locks from p′)
        foreach (t′, p′) ∈ yieldCause[t] do
            yielders[t′, p′] := yielders[t′, p′] ∪ {t} // t′ will have to wake up t when it releases all locks from p′
        native unlock( avoidanceLock )
        events := events + [ yield(t, yieldCause[t]) ] // notify the cycle detector
        return YIELD // wait, because it is not safe to proceed

Acquired( t, l, p )
    if owner[l] ≠ t then // if t doesn't hold l
        owner[l] := t // acquire l
        acqPos[l] := p // save acquisition position
        events := events + [ acquired(t, l, p) ] // notify the cycle detector
    nLockings[l] := nLockings[l] + 1 // update, for reentrant locking

Release( t, l )
    nLockings[l] := nLockings[l] - 1 // update, for reentrant locking
    if nLockings[l] = 0 then // if l is now released
        p := acqPos[l] // t must be deregistered from acqPos[l]
        owner[l] := null // release l
        acqPos[l] := null // reset acquisition position
        events := events + [ release(t, l) ] // notify the cycle detector
        native lock( avoidanceLock )
        lockGrantees[p] := lockGrantees[p] \ {t} // deregister t from p
        if 1_lockGrantees[p](t) = 0 then // if t released all locks from p (i.e., pair (t, p) is "deactivated")
            foreach t′ ∈ yielders[t, p] do // wake up all threads waiting for t to release all locks acquired at p
                if yieldCause[t′] ≠ null then // if t′ is not woken up already
                    native lock( yieldLock[t′] )
                    yieldCause[t′] := null // mark t′ as woken up
                    yieldLock[t′].notify() // wake up t′
                    native unlock( yieldLock[t′] )
            yielders[t, p] := ∅ // reset yielders
        native unlock( avoidanceLock )

Fig. 5. The core algorithms for the avoidance module.
2.4 The Detection Module
The detection module is responsible for detecting cycles (deadlocks or avoidance-induced livelocks). Periodically, with a period τ (e.g., 1 second), it fetches and processes the notifications (RAG events and decisions) sent by the avoidance module, then it looks for cycles in the RAG. If cycles are found, it adds their templates to the history and optionally restarts the application. The detection module looks only for RAG cycles containing threads having pending lock requests, because as we prove in §3, only the request events can introduce new cycles
in the RAG. Note that theoretically the period τ can be arbitrarily large, because it doesn’t
really matter when the cycle is detected if the application is already deadlocked/livelocked.
But in practice, τ cannot be too large, because processing many events accumulated between
two iterations may cause significant hiccups. Moreover, a smaller τ means faster recovery from
deadlocks/livelocks, and thus higher availability of the system.
The data structures used in the detection module are the following ones:
– rag : the resource allocation graph
– events : the shared queue used to send RAG events and avoidance decisions, from the
avoidance module to the detection module; it is initially empty. A RAG event can be
request(thread, lock, position), yield(thread, yieldCause), grant(thread, lock, position),
acquired(thread, lock, position) or
release(thread, lock), corresponding to the types of operations that can be performed on
a RAG, namely add request/yield edge, convert request edge into a grant edge, convert
grant edge into a hold edge, and remove a hold edge, respectively.
– requestingThreads : the set of threads having pending (not yet satisfied) lock requests.
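One possible, purely illustrative Java representation of these notifications is sketched below; the queue itself can then be a ConcurrentLinkedQueue of such events, as in our implementation (§5).

import java.util.Map;

// Illustrative representation of the RAG events placed on the queue (request, yield, grant,
// acquired, release); field names follow the event descriptions above, not our real API.
final class RagEvent {
    enum Kind { REQUEST, YIELD, GRANT, ACQUIRED, RELEASE }

    final Kind kind;
    final long thread;                   // id of the thread the event refers to
    final Object lock;                   // lock involved (null for yield events)
    final String position;               // program position as file:line (null where unused)
    final Map<Long, String> yieldCause;  // (thread, position) pairs, only for yield events

    RagEvent(Kind kind, long thread, Object lock, String position, Map<Long, String> yieldCause) {
        this.kind = kind;
        this.thread = thread;
        this.lock = lock;
        this.position = position;
        this.yieldCause = yieldCause;
    }
}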
We give a high-level description of the helper routines waitCycles, waitChains, and waitingFor, used for cycle detection, and template(C), used to extract the template of a cycle C, in Algorithms 1-4. For the implementation of waitingFor(v1, v2), we used the colored DFS algorithm [15], in which we marked all explored nodes with REACHABLE/UNREACHABLE,
depending on whether v2 is reachable or not from v1. For retrieving cycles traversing v
(waitCycles(v)), one just explores the nodes marked as REACHABLE in waitingFor(v, v).
Having yield edges in the RAG, a thread node can have multiple edges emerging from it,
namely one request edge and zero, one or more yield edges, therefore a node can be involved in
more than one cycle. Hence, waitCycles(v) can return more than one cycle. However, this is
at most an issue concerning the performance of our cycle detection. We do not need to retrieve,
save, and avoid the templates of all the cycles containing a particular node: since an
avoidance-induced livelock is a conjunction of yield cycles, it is enough to avoid a single yield
cycle from it. Therefore, if a thread t is involved
in an avoidance-induced livelock, it is enough to retrieve, in waitCycles(t), just one yield
cycle from the livelock.
waitCycles(v)
return waitChains(v, v)
Algorithm 1: waitCycles()
The cycle detection is illustrated in Algorithm 5. The processEvents() routine, responsible
for fetching and processing the notifications coming from the avoidance module, is shown in
Algorithm 6.
waitChains(v1, v2)
    if !waitingFor(v1, v2) then
        return null
    choose e = v1 →r/g/h(∗) v ∈ RAG where waitingFor(v, v2)
    choose ey = v1 →y(∗) vy ∈ RAG where waitingFor(vy, v2) ∧ ∀ v1 →y(∗) vi ∈ RAG : waitingFor(vi, v2)
    cycles := ∅
    if e ≠ null ∧ e.dst = v2 then
        cycles := cycles ∪ {{e}}
    if e ≠ null ∧ e.dst ≠ v2 then
        cycles := cycles ∪ ⋃_{c ∈ waitChains(e.dst, v2)} ({e} ∪ c)
    if ey ≠ null ∧ ey.dst = v2 then
        cycles := cycles ∪ {{ey}}
    if ey ≠ null ∧ ey.dst ≠ v2 then
        cycles := cycles ∪ ⋃_{c ∈ waitChains(ey.dst, v2)} ({ey} ∪ c)
    return cycles
Algorithm 2: waitChains()
waitingFor(v1, v2)
    return v1 = v2 ∨
        ∃ v1 →r v ∈ RAG s.t. waitingFor(v, v2) ∨
        ∃ v1 →g(∗) v ∈ RAG s.t. waitingFor(v, v2) ∨
        ∃ v ←h(∗) v1 ∈ RAG s.t. waitingFor(v, v2) ∨
        ∀ v1 →y(∗) vi ∈ RAG : waitingFor(vi, v2)
Algorithm 3: waitingFor()
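For concreteness, a much-simplified Java sketch of the reachability check behind waitingFor is shown below; it performs a plain DFS with a visited set and omits both the REACHABLE/UNREACHABLE memoization and the special ∀ treatment of yield edges from Algorithm 3.

import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Simplified reachability in a RAG given as an adjacency list (node id -> successor ids):
// can `target` be reached from `start`? Each node is explored at most once per query.
final class Reachability {
    private final Map<Long, List<Long>> successors;

    Reachability(Map<Long, List<Long>> successors) { this.successors = successors; }

    boolean waitingFor(long start, long target) {
        if (start == target) return true;
        Set<Long> visited = new HashSet<Long>();
        Deque<Long> stack = new ArrayDeque<Long>();
        visited.add(start);
        stack.push(start);
        while (!stack.isEmpty()) {
            List<Long> succ = successors.get(stack.pop());
            if (succ == null) continue;
            for (Long next : succ) {
                if (next == target) return true;            // found a wait path to target
                if (visited.add(next)) stack.push(next);    // explore each node only once
            }
        }
        return false;
    }
}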
template(C)
    return {e.pos | e ∈ C ∧ (e = ∗ ←h(∗) ∗ ∨ e = ∗ →y(∗) ∗)}
Algorithm 4: template()
main loop
    while running do // run while running = true
        sleep τ milliseconds
        processEvents()
        cyclesFound := ∅
        foreach t ∈ requestingThreads do // find all cycles containing threads with pending requests
            cyclesFound := cyclesFound ∪ waitCycles(t)
        if cyclesFound ≠ ∅ then // if cycles are found
            foreach c ∈ cyclesFound do // add the templates of the detected cycles to the history
                history := history ∪ {template(c)}
            restart application
Algorithm 5: Cycle detection main loop
processEvents()
    while events ≠ ∅ do
        evt := events.head // return first event
        events := events.tail // remove first event
        switch evt do
            case request(t, l, p)
                rag := rag ∪ {t →r l} // add request edge
                requestingThreads := requestingThreads ∪ {t} // pending request
            case yield(t, yCause) // replace existing yield edges with the new ones
                rag := rag \ {t →y(∗) t′ | t →y(∗) t′ ∈ rag} ∪ {t →y(p) t′ | (t′, p) ∈ yCause}
            case grant(t, l, p)
                rag := rag \ {t →r l} ∪ {t →g(p) l} // convert request edge into a grant edge
            case acquired(t, l, p)
                rag := rag \ {t →g(p) l} ∪ {t ←h(p) l} // convert grant edge into a hold edge
                requestingThreads := requestingThreads \ {t} // request satisfied
            case release(t, l)
                rag := rag \ {t ←h(∗) l} // remove hold edge
Algorithm 6: processEvents()
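In a Java implementation, this loop can simply run in a background daemon thread, e.g., along the following illustrative lines (the two private helpers are placeholders for Algorithms 5 and 6):

// Illustrative skeleton of the detection thread: every tau milliseconds it drains the event
// queue, updates the RAG, and searches for cycles, keeping this work off the critical path.
final class DetectorThread extends Thread {
    private final long tauMillis;
    private volatile boolean running = true;

    DetectorThread(long tauMillis) {
        this.tauMillis = tauMillis;
        setDaemon(true);                     // do not keep the JVM alive just for detection
    }

    public void run() {
        while (running) {
            try {
                Thread.sleep(tauMillis);     // period tau (1000 ms in our implementation, Section 5)
            } catch (InterruptedException e) {
                return;
            }
            processEvents();                 // apply queued events to the RAG (Algorithm 6)
            detectCycles();                  // waitCycles over requesting threads (Algorithm 5)
        }
    }

    void shutdown() { running = false; }

    private void processEvents() { /* placeholder for Algorithm 6 */ }
    private void detectCycles()  { /* placeholder for Algorithm 5: save templates, restart */ }
}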
3 Completeness and Soundness
We prove the soundness of our deadlock immunity algorithm, namely safety, i.e., that it
indeed avoids deadlocks, and liveness, i.e., that threads will eventually be able to proceed
with locking every lock they request. We also prove the algorithm's completeness, i.e., that it
eventually detects and avoids all cycles (deadlocks and avoidance-induced livelocks).
We make the following assumptions, in order to prove the soundness and completeness of
our approach:
– All avoidance routines (Request, Acquired, Release) run safely w.r.t. the shared automaton
state, i.e., avoidance routines are thread-safe. In the synchronized version of the algorithm
explained in the report, the atomicity, and hence thread-safety, is ensured by the global
lock avoidanceLock. Atomicity is not ensured in the lock-free version of our algorithm,
but nevertheless, thread-safety (consistency) is ensured by additional checks in the Request
routine.
– The number of threads in an application and the application size are both finite.
– templateInstance and waitCycles have correct implementations, i.e., that comply with
their high-level definitions presented in the report.
– All possible deadlocks and avoidance-induced livelocks will eventually happen.
– All critical sections eventually terminate, in normal conditions, i.e., when not deadlocked/livelocked.
– The thread scheduler is fair.
– All lock/unlock statements performed within the application are captured.
– The lock positions that constitute the templates are execution-independent, i.e., the position of a lock statement in the present execution denotes the position of the same lock
statement in subsequent executions.
We prove first the completeness of cycle detection. The detection module looks only for
cycles containing threads with pending requests. In consequence, we prove first that indeed
only the request events can introduce new cycles in the RAG, by proving that the remaining
RAG events — acquired and release — cannot introduce new cycles in the RAG. Then, we
prove the actual completeness of the cycle detection, i.e., that the detection module detects
all cycles necessary for avoidance.
To prove safety (deadlock immunity), we prove first that the following invariant is maintained: ∀T ∈ historyold : instance(T, ∅) = null, i.e., no template from historyold can be instantiated. Then, we prove that the invariant ∀C ∈ waitCycles(t) : template(C) ∉ historyold
is maintained, i.e., no newly detected cycle has its template in historyold . Finally, we prove
safety — invariant historynew ∩ historyold = ∅ is maintained by Request, i.e., templates do
not repeat in different runs.
To prove completeness (immunity against all possible deadlocks and avoidance-induced
livelocks), we prove first that the number of possible templates is finite. Then, we prove the
actual completeness of our approach — every application instrumented with our deadlock
avoidance will be eventually free of deadlocks and avoidance-induced livelocks, after a finite
number of reboots. Note that the assumption that all possible deadlocks and avoidance-induced
livelocks will eventually happen holds only in theory. In practice, one cannot assume that an
application will experience all its potential (possible) deadlocks. Therefore, the completeness
of our approach holds only in theory. However, one important and unarguable contribution of
our approach remains: liveness-preserving immunity against all deadlocks similar to deadlocks
experienced in previous executions.
Finally, we prove the liveness of our approach — eventually, every thread will be granted
every lock it requests.
The complete proofs can be found in Appendix C.
4 Complexity Analysis
In this section, we present the complexity of the avoidance and detection modules.
Suppose history has N templates of average size NP and that, for each position p from a
template, |lockGrantees[p]| = NG , on average. Let NW be the average number of yields
(waits) performed by a thread before being granted a lock, and Cwait the cost of a wait()
system call. Let NY be the average size of yielders[t, p], and Cnotify the cost of a notify() call.
The complexity of the avoidance module is linear in NW · Cwait and in NY · Cnotify, polynomial
in NG, and exponential in NP.
However, we expect NP to be very small. Therefore, the fact that the complexity is
exponential in the size of the templates (NP ) shouldn’t be a problem. In practice, we expect
to have deadlocks involving at most three or four threads, because a deadlock is structurally quite complex:
for a group of threads to deadlock, they first have to simultaneously perform nested locks,
and then to perform the inner locks in such a way that a circular wait for locks happens.
Since the size of a template (NP ) is equal to the number of threads involved in the cycle, NP
will therefore be very small, too. Moreover, it is even less likely to have large livelocks caused
by our algorithm, because an avoidance-induced livelock is a conjunction of yield cycles (see
§2.4), and a yield cycle is similar to a deadlock.
Consider RAG = [V, E], |requestingThreads| = NR on average, and that we have on
average NE events in the event queue. The complexity of the detection module is linear in
NE , because every event is processed in constant-time, and linear in NR · (|V | + |E|), because
we use the optimal colored DFS algorithm for detecting RAG cycles, starting from each thread
from requestingT hreads.
The complete derivations of the two modules’ complexities can be found in Appendix D.
5 Java Implementation
We implemented the deadlock immunity algorithm for Java. For instrumenting Java applications with our algorithm, we used AspectJ [1]. We instrumented the calls to monitorenter(x)
(lock(x) statement in Java) and monitorexit(x) (unlock(x) statement in Java) with advices
that capture lock requests (before lock(x) advice), lock acquisitions (after lock(x) advice),
and lock releases (before unlock(x) advice). Deadlock avoidance is performed in before lock(x)
advice.
Lock positions are represented in terms of file:line strings.
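Roughly, the advice can be sketched as below; this assumes AspectJ's lock()/unlock() join points (which must be enabled explicitly, e.g., with -Xjoinpoints:synchronization), and the Avoidance calls are placeholders for our avoidance module rather than its real API.

// Rough AspectJ sketch of the instrumentation; Avoidance.* are placeholders for our module.
public aspect LockInterception {

    // file:line of the current synchronization statement, used as the lock position
    private static String position(org.aspectj.lang.JoinPoint.StaticPart sp) {
        return sp.getSourceLocation().getFileName() + ":" + sp.getSourceLocation().getLine();
    }

    before(Object l): lock() && args(l) && !within(LockInterception) {   // lock request
        Avoidance.request(Thread.currentThread(), l, position(thisJoinPointStaticPart));
    }

    after(Object l): lock() && args(l) && !within(LockInterception) {    // lock acquisition
        Avoidance.acquired(Thread.currentThread(), l, position(thisJoinPointStaticPart));
    }

    before(Object l): unlock() && args(l) && !within(LockInterception) { // lock release
        Avoidance.release(Thread.currentThread(), l);
    }
}

// Placeholder facade; the real module implements Request/Acquired/Release from Section 2.3.
class Avoidance {
    static void request(Thread t, Object l, String pos)  { /* avoidance decision */ }
    static void acquired(Thread t, Object l, String pos) { /* record acquisition */ }
    static void release(Thread t, Object l)              { /* record release */ }
}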
We chose τ = 1000 msec for the period of the cycle detection. To locate in the RAG the
lock node corresponding to a lock (monitor) object, we used a hash map with an initial capacity
of 2^20, i.e., a hash key with 20 bits of precision, which ensures a high dispersion of the
keys and thus an average constant-time look-up. For computing the hash code of a lock
object, we used the System.identityHashCode(Object) method, in order to have a uniform hash
code computation for all the lock objects. The event queue and the history are lock-free —
unprotected by any lock, but still thread-safe — to have faster communication between the
avoidance and detection modules. For their implementation, we use ConcurrentLinkedQueue
from the JDK.
We also implemented the lock-free version of the algorithm, which is minimally synchronized: only the wait-notify mechanism is synchronized. A full description of it can be
found in Appendix B. The tradeoff we had to make in the lock-free implementation consists
of additional checks that need to be performed in the Request routine, in order to preserve the
soundness of deadlock avoidance in the absence of atomicity. A performance comparison of the
two versions can be found in §6.
6 Evaluation
In this section, we present the evaluation we performed for our deadlock immunity algorithm.
First, we evaluated the effectiveness and efficiency of our approach on real applications like
the MySQL JDBC driver and JBoss. Then, we evaluated its efficiency on a microbenchmark that
we developed.
We tested the effectiveness of the Java implementation on the MySQL JDBC driver. We
successfully reproduced and avoided four different deadlocks from version 5.0.0-beta of the
MySQL JDBC driver [18].
In the remainder of the section, we will illustrate the performance of the implementation
on real systems, then we will show its performance in extreme conditions, on a lock-intensive
microbenchmark with high contention.
All experiments concerning performance were run on computers with 2 x 4-core Intel Xeon
E5310 1.6GHz CPUs, 4GB RAM, WD-1500 hard disk, 2 NetXtreme II GbE interfaces, interconnected by a dedicated GbE switch, running FC7 Linux with kernel 2.6.20, Java HotSpot
Server VM 1.6.0, and Java SE 1.6.0.
6.1 Real Applications
We measured the performance impact that our algorithm has on a real system. We chose
a J2EE application server, which is a piece of middleware that allows enterprise and Web
applications to be written in Java, with all complexities of transactions, persistence, group
communication, replication, etc. being handled transparently on the applications’ behalf.
Behind virtually every e-commerce Web site today lies an application server. We chose JBoss
[13], one of the most widely used J2EE servers; at over 350KLOC excluding comments,
13
it is likely one of the largest systems written in Java. To benchmark performance on the
deadlock-immunized JBoss, we deployed on top of it RUBiS [20], an online auction application
modeled after eBay.com. In our measurements, we used the servlet-based version of RUBiS.
RUBiS is frequently used to evaluate performance scalability of J2EE application servers.
We ran JBoss+RUBiS, the corresponding MySQL database tier, and the RUBiS clients on
separate nodes. We show throughput results in Figure 6, measured for various history sizes
(0 . . . 32) and templates of length 2; the templates were generated randomly as combinations
of program locations at which JBoss performs synchronization. We operated the auction site
just below its saturation point (i.e., 900 RUBiS client threads), in order to obtain an upper
bound of the overhead; below and above this level we found the impact of our algorithm to
be virtually unmeasurable. JBoss's web console reported that 280 threads were spawned by
JBoss when serving requests coming from RUBiS's client threads.
The cost of immunity against up to 32 templates of virtual deadlocks, of length 2 each, in
one of the largest Java systems, is a penalty of < 6% in request throughput on an e-commerce
workload, running on an 8-core machine. This suggests that our approach offers an efficient
deadlock avoidance solution even for the largest production systems.
[Fig. 6. JBoss throughput drop for browse-only DB transactions: throughput reduction (0-6%, baseline = 515.5 req/sec) vs. number of templates (0-32), with 900 RUBiS threads, 280 JBoss threads, and template length 2.]
6.2 Microbenchmark
The microbenchmark experiments consist of up to 256 threads that execute synchronized
blocks from 10 different locations, for 1 minute, with a 500 usec delay between successive
blocks, and a 500 usec delay inside the synchronized blocks. The blocks are synchronized on
lock objects chosen from a pool of 8 shared objects, in round-robin order. We used a pregenerated virtual history, consisting of 4 templates of size 2 each, with positions randomly
chosen from the same 10 lock positions. The delays are implemented as busy waits (busy
loops), for a better delay precision, and also to simulate computation between and inside the
synchronized blocks.
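A worker thread of this microbenchmark looks roughly as follows (an illustrative sketch; in the real benchmark the synchronized block appears at 10 distinct source locations rather than one):

// Illustrative sketch of one microbenchmark worker: it repeatedly synchronizes on shared
// locks chosen round-robin from a pool of 8 objects, with 500 usec busy-wait delays between
// and inside the synchronized blocks, for 1 minute.
final class Worker implements Runnable {
    private static final Object[] SHARED_LOCKS = new Object[8];
    static { for (int i = 0; i < SHARED_LOCKS.length; i++) SHARED_LOCKS[i] = new Object(); }

    private static final long RUN_NANOS   = 60L * 1000 * 1000 * 1000; // 1 minute
    private static final long DELAY_NANOS = 500L * 1000;              // 500 usec

    public void run() {
        long deadline = System.nanoTime() + RUN_NANOS;
        int next = 0;
        while (System.nanoTime() < deadline) {
            busyWait(DELAY_NANOS);                       // delay between synchronized blocks
            Object lock = SHARED_LOCKS[next];
            next = (next + 1) % SHARED_LOCKS.length;     // round-robin lock selection
            synchronized (lock) {
                busyWait(DELAY_NANOS);                   // delay inside the synchronized block
            }
        }
    }

    // Busy wait rather than sleep, for better delay precision and to simulate computation.
    private static void busyWait(long nanos) {
        long end = System.nanoTime() + nanos;
        while (System.nanoTime() < end) { /* spin */ }
    }
}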
On the above microbenchmark, we tested the performance, in terms of synchronizations
per second, of the different design approaches that we could use in our algorithm.
We compare the performance of the synchronized version of our algorithm with the lock-free
version. We also compare the efficiency of the thread yielding mechanism based on wait-notify
with the efficiency of the yielding mechanism based on T hread.yield() calls. More precisely,
we compared the overheads introduced by our algorithm for the above design choices. The
results are shown in Figure 7. We noticed that the lock-free version using a wait-notify yielding
mechanism outperforms the other three design choices.
[Fig. 7. Overhead comparison for various design options: overhead factor (1x-8x) vs. number of threads (2-256), for the synchronized and lock-free versions combined with wait-notify and Thread.yield() yielding; 8 shared locks, history size 4, template length 2, 500 usec delay inside + 500 usec outside the synchronized blocks.]
The conclusion we can draw from the microbenchmark experiments is that the overhead
introduced by our approach is acceptable, considering we are dealing with an extreme case,
a lock-intensive program with high contention, which forks many threads whose only activity is
synchronizing on shared locks, with small delays between locks.
If we were to make our approach faster, we would most likely sacrifice precision for performance. When the precision of the avoidance is low, we get many false positives
(false deadlocks), and therefore serialize the threads too aggressively. If the threads hold
the locks they acquire for a non-negligible amount of time, the overhead induced by such
aggressive serialization is higher than the overhead induced by a more precise (and
therefore less efficient) avoidance that serializes less aggressively. We observe in Figure 8
that the overhead corresponding to the lock-free version with the wait-notify yielding mechanism
is much larger when the synchronization delay (1000 usec) is inside the synchronized blocks
than when the delay is outside the synchronized blocks. This shows that the
serialization-induced overhead is dominant when the time spent inside the critical sections is
non-negligible.
However, for 256 threads, the overhead for synchronization delay outside the synchronized
block is bigger. We believe this is because the delays are implemented as busy waits, and for a
large number of threads they introduce high contention, since they are performed outside
the critical sections, in parallel by 256 threads, whereas in the case where the delays are
inside the synchronized blocks, there are at most 8 busy waits running in parallel (because
there are 8 shared locks on which the synchronized blocks are executed), and hence, there is
much less contention.
[Fig. 8. Lock-free with wait-notify: overhead for inside vs. outside sync delay. Reduction in throughput (0%-100%) vs. number of threads (2-128) for the high-contention and low-contention placements of the 1 ms delay (inside or outside the synchronized blocks); 8 shared locks, history size 4, template length 2.]
7 Related Work
Aside from better discipline, there are several approaches that help avoid deadlocks in programs: detect and prevent the introduction of deadlocks before a program runs (§7.1), detect
and avoid deadlocks while the program is running (§7.2), or recover from deadlocks once they
occur and the program is hung (§7.3). We share goals with all of these areas; our algorithm
straddles the area of deadlock avoidance and deadlock recovery: we allow deadlocks to occur
and then recover by avoiding their future occurrences.
7.1 Deadlock Prevention
Most deadlock prevention is done via static methods that aim either to prevent the introduction of deadlock potential in code, or to inspect program code and help developers find
and fix deadlock bugs. In contrast, our algorithm aims to enable deadlock-prone code to run
successfully and avert deadlock as late as possible; in some sense, our technique conspires
with the scheduler to make the program succeed despite its shortcomings.
Language-level approaches aim to simplify the writing of lock-based concurrent programs
and thus avoid synchronization problems through powerful type systems. For example, the
monitor construct [11] was employed early on in systems programming [16] and is now the
default means of synchronization in Java; unfortunately, monitors still offer opportunities for
deadlocks, so they do not provide a complete solution. More recently, [5] introduced a way to
augment Java with ownership types to ensure deadlock freedom at compile time; compared
to our approach, [5] has the benefit that it incurs no performance overhead, but requires
translating existing programs to use the type annotations and programmers must partition
locks into equivalence classes and then specify a partial order.
ESC [8] introduced extended static checking, which uses a theorem prover to find deadlocks; the approach makes heavy use of annotations to provide knowledge to the analysis and
to reduce false positives. RacerX [7] uses flow-sensitive inter-procedural analysis to detect
race conditions and deadlocks in C systems code; these analyses run fast (∼10 minutes for 1.8
MLOC) and were successfully applied to Linux and FreeBSD; on the downside, the analyses
generate many false negatives and rely on heuristic techniques for ranking the results and
helping programmers sort through false positives. [22] uses flow-sensitive, context-sensitive
analysis together with a lock-order graph to find deadlocks. This analysis was applied e.g., to
Java JDK 1.4 and found 7 deadlocks; unfortunately the tool reported 70 potential deadlocks
and manual verification was needed to eliminate 90% of the reports as false positives; even so,
unsound filtering heuristics were needed to reduce the over 100,000 reported deadlocks down
to the 70 which were then manually sorted. In contrast, our algorithm never detects a false
deadlock, is completely automatic, but it incurs a performance overhead; all its conservativeness manifests as an acceptable performance overhead.
Another approach to finding deadlocks is to use model-checkers, which systematically explore all possible states of a program; in the case of concurrent programs, this includes all
possible thread interleavings. Model-checkers generally achieve high coverage and are sound;
unfortunately they suffer from poor scalability: the explored state space increases roughly
combinatorially with the number of threads and the size of the program (a.k.a. “state space
explosion”). Java PathFinder [12], one of the most successful model-checkers, does not support native-code libraries for networking and I/O, and is restricted to applications up to 10
KLOC [12]; because of these library and size limitations, JPF has so far been mainly used
for models, not actual applications. An alternate approach is to transform a concurrent program P into a sequential one P ′ that simulates the execution of most behaviors of P . P ′
is then checked using a model-checker that does not need to be aware of thread interleavings. While this technique is complete (i.e., no false positives), it is unsound (i.e., it has false
negatives) [19].
Real-world applications are large (e.g., JBoss has > 350 KLOC, Apache Tomcat > 86
KLOC), and our approach has the advantage of being independent of program size. Moreover,
our algorithm places no restrictions on I/O activity or calls to native libraries. Unlike model
checking, our algorithm explores only those states that are guaranteed to be relevant, i.e., the
ones encountered during execution. On the other hand, model checking is static, so it incurs
no runtime overhead. The tradeoff is simple: if model checking is feasible, do it, otherwise
approaches like ours are the only option.
7.2 Deadlock Avoidance
Early approaches to avoiding deadlocks relied on static pre-allocation of resources, including
classics such as [6]. The main drawback of these approaches is that they require a priori the
exact resource needs of applications; [9] relaxes this requirement by only needing the maximum
number of resources. More recently, [2] proposed a more efficient algorithm in this category,
but still requires a priori knowledge of resource needs. Without a priori information, the
resulting allocations can lead to significant underutilization of resources, hence these schemes
have been virtually abandoned in systems that do not have real-time requirements.
[4] describes an approach that statically detects groups of nested locks that might cause
deadlocks, called “sources of deadlocks” (SODs), and serializes accesses to SODs by introducing a new lock before them. Assuming all SODs are found, this approach guarantees deadlock
freedom; however, static analysis results in many false positives and the resulting serialization
is aggressive. The deadlock templates used by our approach bear some resemblance to SODs.
A similar idea is that of ghost locks [23]—the authors augmented the JVM with a mechanism that serializes threads’ access to lock sets that could induce deadlock. They compute
the strongly-connected components (SCC) in a lock-order graph and introduce a ghost lock
for each SCC; threads must acquire the ghost lock prior to acquiring a lock in the SCC.
Unfortunately, this approach cannot guarantee deadlock freedom, introduces new potential
deadlocks (via the ghost locks), and requires changing the Java VM.
7.3 Deadlock Recovery
The main idea in deadlock recovery is that the runtime system does not make an effort to avoid
deadlocks in applications, but detects deadlocks and exposes deadlock information to applications, which must then deal with the situation. Such detection can be done asynchronously
(like in our approach) or synchronously on every lock request.
The asynchronous approach is popular when the cost of online detection is high. In distributed systems, for example, checking for safe states is difficult due to the need to aggregate
local states into global state and due to the requisite network communication; as a result,
there is a bias toward asynchronous detection followed by some form of recovery [21]. Given a
deadlock detection mechanism, the simplest way to recover is to abort one or more involved
threads/processes. Due to the powerful semantics of transactions, the database community
has wholeheartedly embraced this approach to deadlock recovery [3]—today virtually every
commercial database system includes an automatic deadlock resolver. We hope that every
runtime environment will soon provide similar facilities; we see our work as a step in that
direction.
Pulse [17] takes an interesting approach in this same vein: by adding a kernel facility
and a special daemon, Pulse can detect processes that appear to not be making progress,
speculatively execute them forward, and decide whether they are stuck in a deadlock. The
main advantage of Pulse is that it can detect deadlocks that are caused by more than just
lock/unlock mechanisms; on the downside, it does encounter false positives (i.e., false deadlocks).
8 Future Work
We want to address in further research the conservativeness of our approach. An example
of where our technique is too conservative is when one uses wrapper methods to perform
locking. All the lock positions will be the same, i.e., the location of the lock statement from
the wrapper. Therefore, the deadlock templates will have only copies of this position, and
thus all subsequent locks via the wrapper will be unfairly serialized. We can circumvent this
issue by storing more information in the template about the context in which the deadlock
happened. More precisely, we can store for each lock acquisition a suffix of the call path
that led to it, and save it in the template along with the lock position. The precision of the
templates will therefore increase. The higher the precision of the templates, the less often
avoidance is performed in vain, i.e., the fewer the false positives are.
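For instance, a position augmented with a call-path suffix could be computed along these illustrative lines, where the suffix length k is the quantity that the adjustment mechanism discussed below would learn:

// Sketch: extend a lock position (file:line) with a suffix of length k of the current call
// path, so that locks taken through a common wrapper method are distinguished by their callers.
final class CallPathPosition {
    static String positionWithSuffix(String fileLinePosition, int k) {
        StackTraceElement[] stack = Thread.currentThread().getStackTrace();
        StringBuilder sb = new StringBuilder(fileLinePosition);
        // stack[0] is getStackTrace() itself and stack[1] this method; start above them.
        for (int i = 2; i < stack.length && i < 2 + k; i++) {
            sb.append('|').append(stack[i].getClassName())
              .append('.').append(stack[i].getMethodName())
              .append(':').append(stack[i].getLineNumber());
        }
        return sb.toString();
    }
}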
On the other hand, if the templates are too precise, then the application may reboot too
often if many deadlocking paths have a common suffix. Using too large suffixes from the call
paths results in hitting the same deadlock over and over again. We also need a method for
dynamically adjusting (learning) the necessary sizes of these suffixes. This dynamic adjustment must be performed separately for each position in a template, because the deadlocking
call paths are generally highly decoupled. To increase precision even more, one may consider
also the sequence of the lock positions traversed along the call paths.
We will also explore the possibility of using static analysis, as a complementary approach.
More precisely, we explore the idea of using look-ahead static analysis starting from the current
state, to augment our technique, in order to reduce the number of false negatives (unpredicted
deadlocks). If the avoidance module grants a lock, the look-ahead module could additionally check
if it is safe to proceed, by symbolically executing the application up to a small bound (for
considerations of efficiency), starting from the current state of the application. If the avoidance
module says it is unsafe for a thread to proceed, we should rely on our avoidance module,
because, if the look-ahead module says it is safe to proceed till some bound, it doesn’t mean
it is safe to proceed beyond that bound. In other words, bounded static analysis is sound only
when finding safety violations — if a program is declared unsafe till a bound, then it is unsafe
indeed, but if it is declared safe up to the bound, it doesn’t mean it will be safe beyond the
bound.
9 Conclusion
In this technical report we described an approach for imbuing software systems with immunity
against deadlocks, requiring no assistance from programmers or users. We showed that, once
a program instrumented with our deadlock avoidance encounters a deadlock, it avoids in all
future executions all deadlocks having that same template. We demonstrated the effectiveness
of our approach against deadlocks in real software like MySQL JDBC. Our algorithm also
scales gracefully to large software systems: in JBoss, a > 350 KLOC application server, the worst-case overhead was a drop of < 6% in request throughput, while avoiding up to 32 deadlock
templates.
Our technique complements prior work by addressing the challenges of production systems:
code size scalability, correctness in the face of program I/O, and performance overhead.
While pure deadlock avoidance is undecidable, deadlock immunity is decidable, to the
extent imposed by the definition of similarity between deadlocks. Since our approach
considers two deadlocks similar iff they have the same template, and since
deadlock templates are execution-independent, deadlock immunity is indeed decidable.
Our approach demonstrates that it is not only decidable, but also feasible, to avoid deadlocks
having the same templates as deadlocks encountered in previous executions.
References
1. AspectJ. http://www.eclipse.org/aspectj, 2007.
2. F. Belik. An efficient deadlock avoidance technique. IEEE Transactions on Computers, 39(7),
1990.
3. P. A. Bernstein, V. Hadzilacos, and N. Goodman. Concurrency Control and Recovery in Database
Systems. Addison Wesley, 1987.
4. P. Boronat and V. Cholvi. A transformation to provide deadlock-free programs. In Intl. Conf.
on Computational Science, 2003.
5. C. Boyapati, R. Lee, and M. Rinard. Ownership types for safe programming: Preventing data
races and deadlocks. In 17th Conf. on Object-Oriented Programming, Systems, Languages, and
Applications, 2002.
6. E. G. Coffman, M. Elphick, and A. Shoshani. System deadlocks. ACM Computing Surveys, 3(2),
1971.
7. D. Engler and K. Ashcraft. RacerX: Effective, static detection of race conditions and deadlocks.
In 19th ACM Symp. on Operating Systems Principles, 2003.
8. C. Flanagan, K. R. M. Leino, M. Lillibridge, G. Nelson, J. B. Saxe, and R. Stata. Extended
static checking for Java. In Conf. on Programming Language Design and Implementation, 2002.
9. A. N. Habermann. Prevention of system deadlocks. Communications of the ACM, 12(7), 1969.
10. M. Herlihy and J. E. B. Moss. Transactional memory: architectural support for lock-free data
structures. In 20th Intl. Symposium on Computer Architecture, 1993.
11. C. A. R. Hoare. Monitors: An operating system structuring concept. Communications of the
ACM, 17(10), 1974.
12. Java PathFinder. http://javapathfinder.sourceforge.net/doc/What_can_be_checked_with_JPF.html, 2007.
13. JBoss. http://jboss.org.
14. M. Kim, M. Viswanathan, S. Kannan, I. Lee, and O. Sokolsky. Java-MaC: A run-time assurance
approach for java programs. In Formal Methods in System Design, 2004.
15. D. E. Knuth. The Art of Computer Programming, volume III: Sorting and Searching. Addison-Wesley, 1998.
16. B. W. Lampson and D. D. Redell. Experience with processes and monitors in Mesa. Communications of the ACM, 23(2), 1980.
17. T. Li, C. S. Ellis, A. R. Lebeck, and D. J. Sorin. Pulse: A dynamic deadlock detection mechanism
using speculative execution. In USENIX Annual Technical Conference, 2005.
18. MySQL bug database. http://bugs.mysql.com/.
19. S. Qadeer and D. Wu. KISS: Keep it simple and sequential. In Conf. on Programming Language
Design and Implementation, 2004.
20. RUBiS. http://rubis.objectweb.org, 2007.
21. M. Singhal. Deadlock detection in distributed systems. IEEE Computer, 22(11), 1989.
22. A. Williams, W. Thies, and M. D. Ernst. Static deadlock detection for Java libraries. In 19th
European Conference on Object-Oriented Programming, 2005.
23. F. Zeng and R. P. Martin. Ghost locks: Deadlock prevention for Java. In Mid-Atlantic Student
Workshop on Programming Languages and Systems, 2004.
A Undecidability of Deadlock Prediction
Consider two threads t1, t2 that perform the following sequences of operations:

thread t1:
  lock(x)
  P1                 // arbitrary computation
  end1 := true
  sleep while !end2
  lock(y)

thread t2:
  lock(y)
  P2                 // arbitrary computation
  end2 := true
  sleep while !end1
  lock(x)

P1, P2 are arbitrary computations, without any synchronization inside. We notice that
threads t1, t2 will deadlock iff P1 and P2 both terminate. Since termination is undecidable in the general case, it is impossible to decide whether t1 and t2 will deadlock.
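For concreteness, the construction above can be written in Java roughly as follows (a minimal sketch: the class name, the p1/p2 placeholders, and the busy-wait loops that stand in for the "sleep while" operations are ours):

  import java.util.concurrent.locks.ReentrantLock;

  // Minimal sketch of the reduction: deciding whether the two threads deadlock
  // amounts to deciding whether both arbitrary computations terminate.
  public class UndecidabilityExample {
      static final ReentrantLock x = new ReentrantLock();
      static final ReentrantLock y = new ReentrantLock();
      static volatile boolean end1 = false, end2 = false;

      static void p1() { /* arbitrary computation, no synchronization */ }
      static void p2() { /* arbitrary computation, no synchronization */ }

      public static void main(String[] args) {
          Thread t1 = new Thread(() -> {
              x.lock();
              p1();
              end1 = true;
              while (!end2) { /* spin until t2 finishes P2 */ }
              y.lock();          // deadlocks iff both P1 and P2 terminate
          });
          Thread t2 = new Thread(() -> {
              y.lock();
              p2();
              end2 = true;
              while (!end1) { /* spin until t1 finishes P1 */ }
              x.lock();          // deadlocks iff both P1 and P2 terminate
          });
          t1.start();
          t2.start();
      }
  }

Deciding whether this program deadlocks is exactly deciding whether both p1() and p2() terminate, i.e., the halting problem.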
B The Lock-Free Avoidance Algorithm
For the lock-free implementation, every lock(l) and unlock(l) are instrumented as shown in
Algorithms 7 and 8. Request(t, l, p), Release(t, p), Acquired(t, l, p), and Release(t, l) are shown
in Algorithms 9, 10, 11, and 12.
lock wrapper lockfree(l)
  t := current thread ID
  p := current program position
  if owner[l] ≠ t then                                  // if t doesn't hold l
    events := events + [request(t, l, p)]               // notify the cycle detector
    yCause := Request(t, l, p)                          // check if it is safe to proceed
    while yCause ≠ null do                              // if it is not safe
      foreach (t′, p′) ∈ yCause do                      // lock all (t′, p′) pairs from yCause for reading
        lockRead((t′, p′))
      if ∀(t′, p′) ∈ yCause : 1_{lockGrantees[p′]}(t′) > 0 then   // if yCause is still a valid template instance
        lock(yieldLock[t])
        foreach (t′, p′) ∈ yCause do                    // register t to all (t′, p′) pairs from yCause, then unlock the pairs
          yielders[t′, p′] := yielders[t′, p′] ∪ {t}    // register t to (t′, p′)
          unlockRead((t′, p′))                          // unlock (t′, p′)
        yieldCause[t] := yCause                         // store yield cause
        events := events + [yield(t, yieldCause[t])]    // notify the cycle detector
        yieldLock[t].wait()                             // wait till at least one pair (t′, p′) gets deactivated
        unlock(yieldLock[t])
      else
        foreach (t′, p′) ∈ yCause do                    // unlock all (t′, p′) pairs from yCause
          unlockRead((t′, p′))
      yCause := Request(t, l, p)                        // check again
  native lock(l)
  Acquired(t, l, p)
Algorithm 7: Instrumentation of lock(l)

unlock wrapper lockfree(l)
  t := current thread ID
  Release(t, l)
  native unlock(l)
Algorithm 8: Instrumentation of unlock(l)
Request(t, l, p)
  check:
  yieldCause := templateInstance(t, p)            // check if it is safe to proceed
  if yieldCause = null then                       // if safe
    lockGrantees[p] := lockGrantees[p] ⊎ {t}      // grant lock
    yieldCause := templateInstance(t, p)          // check again
    if yieldCause = null then                     // if really safe
      events := events + [grant(t, l, p)]         // notify the cycle detector
      return null                                 // OK
    else
      Release(t, p)                               // cancel the grant
      goto check                                  // retry
  else
    return yieldCause                             // YIELD
Algorithm 9: Request(t, l, p)
Release(t, p)
  if 1_{lockGrantees[p]}(t) = 1 then              // if t is about to release all locks from p
    lockWrite((t, p))                             // lock (t, p) in write mode
  lockGrantees[p] := lockGrantees[p] \ {t}        // remove the grant
  if 1_{lockGrantees[p]}(t) = 0 then              // if t released all locks from p
    // wake up all yielders
    foreach t′ ∈ yielders[t, p] do
      lock(yieldLock[t′])
      if yieldCause[t′] ≠ null then               // if t′ not woken up yet
        yieldCause[t′] := null                    // mark t′ as woken up
        yieldLock[t′].notify()                    // wake up t′
      unlock(yieldLock[t′])
    yielders[t, p] := ∅
    unlockWrite((t, p))
Algorithm 10: Release(t, p)

Acquired(t, l, p)
  if owner[l] ≠ t then
    owner[l] := t
    acqPos[l] := p
    events := events + [acquired(t, l, p)]        // notify the cycle detector
  nLockings[l] := nLockings[l] + 1                // for reentrant locking
Algorithm 11: Acquired(t, l, p)

Release(t, l)
  nLockings[l] := nLockings[l] - 1                // for reentrant locking
  if nLockings[l] = 0 then                        // if l is released
    p := acqPos[l]
    owner[l] := null
    acqPos[l] := null
    events := events + [release(t, l)]            // notify the cycle detector
    Release(t, p)
Algorithm 12: Release(t, l)
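To make the structure of these wrappers concrete, here is a minimal Java sketch of how an instrumented lock could delegate to an avoidance runtime (the InstrumentedLock class and the AvoidanceRuntime interface are illustrative names of ours, not the actual implementation's API; the bookkeeping done inside request/acquired/released corresponds to Algorithms 9-12):

  import java.util.concurrent.locks.Lock;

  // Illustrative-only wrapper mirroring the structure of Algorithms 7 and 8
  // (Request / native lock / Acquired, and Release / native unlock).
  final class InstrumentedLock {
      interface AvoidanceRuntime {
          Object request(Thread t, Lock l, int position);      // returns a yield cause or null
          void awaitDeactivation(Thread t, Object yieldCause); // block until some pair deactivates
          void acquired(Thread t, Lock l, int position);
          void released(Thread t, Lock l);
      }

      private final Lock nativeLock;
      private final AvoidanceRuntime runtime;

      InstrumentedLock(Lock nativeLock, AvoidanceRuntime runtime) {
          this.nativeLock = nativeLock;
          this.runtime = runtime;
      }

      void lock(int position) {
          Thread t = Thread.currentThread();
          Object yieldCause = runtime.request(t, nativeLock, position);
          while (yieldCause != null) {              // not safe yet: yield and retry
              runtime.awaitDeactivation(t, yieldCause);
              yieldCause = runtime.request(t, nativeLock, position);
          }
          nativeLock.lock();                        // the "native lock(l)" step
          runtime.acquired(t, nativeLock, position);
      }

      void unlock() {
          runtime.released(Thread.currentThread(), nativeLock);
          nativeLock.unlock();                      // the "native unlock(l)" step
      }
  }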
C Completeness and Soundness
C.1 Completeness of Cycle Detection
We prove that our detection module is complete, i.e., that it finds all cycles (deadlocks and
avoidance-induced livelocks) necessary for avoidance.
Lemma 1. An ’acquired’ event cannot introduce new cycles in the RAG.
Proof (by contradiction). Suppose an acquired event introduces a new cycle ... l′ ← t ← l ← t′ ... by converting
the edge t →g^p l into t ←h^p l. That means we had ... l′ ← t → l ← t′ ... before, therefore two
request edges emerging from t, namely t →r l and t →r l′, which is impossible, because a thread
can request only one lock at a time.
Lemma 2. A ’release’ event cannot introduce new cycles in the RAG.
Proof. Release(t, l) removes a hold edge t ←h l from the RAG. The removal of an edge cannot
introduce new cycles in the RAG.
Theorem 1 (Cycle detection completeness). The detection module detects all cycles
necessary for avoidance.
Proof. As we saw in Lemmas 1 and 2, the acquired and release events cannot cause new
cycles, therefore request(t, l, p) remains the only event able to introduce new cycles in the
RAG, by either adding the request/grant edge t →r^p l / t →g^p l, or by adding yield edges t →y t′. Hence,
it is sufficient to look for cycles just after request events. In other words, a cycle can be
triggered only by the addition of request, yield, or grant edges.
Note that an acquired(t, l, p) event cancels a request(t, l, p), therefore it is enough to look
for cycles containing threads with pending requests. Every τ msec, our cycle detector looks for
all cycles containing threads that have pending requests (i.e., threads from requestingThreads),
by calling waitCycles(t).
Therefore, assuming all lock/unlock statements are captured, the detection module detects
all cycles that may show up during the application's execution. Since the period of the cycle
detection is τ msec, every cycle is detected within (approximately) at most τ msec from the
moment it happens.
We note that waitCycles(t) does not actually retrieve all cycles containing t: it
retrieves a single yield cycle containing t, instead of all yield cycles containing t. In order to
avoid an avoidance-induced livelock, it is sufficient to avoid only one of its constituent yield
cycles.
Finally, given the above observations, we can conclude that the detection module detects
all cycles necessary for avoidance.
C.2 Safety, Completeness and Liveness of Avoidance
Now, we prove that our algorithm eventually avoids all cycles (deadlocks and avoidance-induced livelocks), in a liveness-preserving manner.
Let historynew denote the history accumulated in the current run of the application, and
historyold the history accumulated in all the previous runs of the application. Hence, the
entire history is history = historynew ∪ historyold .
Lemma 3. The following invariant is maintained: ∀T ∈ historyold : instance(T, ∅) = null,
i.e., no template from historyold can be instantiated.
Proof (by contradiction). From the definition of templateInstance(t, p) (Algorithm 4), the Request
routine (Algorithm ??), and the instrumentation of lock(l) (Algorithm 3), it follows that a thread
t is delayed when trying to acquire a lock at a position p iff ∃T ∈ history s.t. instance(T \
{p}, {t}) ≠ null.
Suppose ∃T ∈ historyold s.t. instance(T, ∅) ≠ null. According to the definition of instance(T,
exclThreads), we have that ∀p ∈ T : ∃t ∈ lockGrantees[p] s.t. instance(T, ∅) = {(t, p)} ∪
instance(T \ {p}, {t}). Therefore, it must be the case that, at some time in the past, thread
t was granted a lock at a position p, although there was a template T ∈ history such that
T \ {p} was already instantiated by threads ≠ t, i.e., instance(T \ {p}, {t}) was ≠ null. But
that contradicts the behavior of Algorithm 3.
Note that this lemma doesn't hold if we replace historyold with history, because all the
cycles encountered in the current execution of an application have their templates implicitly
added to historynew; therefore those templates are actually instantiated in the same run, and
hence the invariant doesn't hold for historynew.
Lemma 4. The following invariant is maintained: ∀C ∈ waitCycles(t) : template(C) ∉ historyold,
i.e., no newly detected cycle has its template in historyold.
Proof. Every newly detected cycle, i.e., every cycle from waitCycles(t), must have had its
template instantiated prior to its occurrence. That is because our avoidance algorithm has
to grant each lock l to the thread t currently requesting it at a position p (by adding t to
lockGrantees[p]) before thread t proceeds with locking l (lock(l)), and removes the grant
(by removing t from lockGrantees[p]) only when t releases l. Therefore, when a cycle occurs,
its template T is actually still instantiated, no matter when the cycle is detected, because
the application is deadlocked/livelocked in that cycle, anyway. But, according to Lemma 3,
template T cannot belong to historyold .
Moreover, in our avoidance framework, the applications are instructed (instrumented) to
load historyold into the avoidance module at the beginning of their execution, so they become
aware of all the templates from historyold from the very start of their execution. Therefore,
since there is no code executed before this initialization, no lock/unlock statements could
have been performed in a way that would make a cycle happen exactly when historyold is
fetched.
Therefore, no newly detected cycle can have its template in historyold .
Theorem 2 (Safety — non-repeating templates). Invariant historynew ∩ historyold = ∅
is maintained by Request, i.e., templates do not repeat in different runs.
Proof. The templates of the newly detected cycles are implicitly added to the runtime history
image, i.e., to historynew (historynew := historynew ∪ templates(waitCycles(t))). A template
from historynew is merged into historyold only after the application stops/restarts.
From the above observations and from Lemma 4, it follows that historynew and historyold are
always disjoint.
Because the templates are execution-independent, all templates from historyold are valid
templates, i.e., their constituent lock positions denote the same lock statements they denoted
when the templates were added to the history. This property is important for avoiding
template mutations, i.e., situations in which a deadlock that has a particular template in the
present execution has a different template in the next execution.
Even in the case of template mutation, the invariant would still hold. However, the application
would again experience deadlocks that used to have templates from historyold in previous executions, but that have different templates in the current execution, templates which no longer belong to historyold.
The deadlock immunity would therefore be weakened, so we cannot afford to have
execution-dependent templates.
Lemma 5. The number of possible templates is finite.
Proof. As we know, a template is a multiset of program positions where lock acquisitions
were performed by a set of different threads, and since the number of threads and number
of program positions are assumed finite, it immediately follows that the number of possible
templates is finite.
Theorem 3 (Completeness). Every application instrumented with our deadlock avoidance
will be eventually free of deadlocks and avoidance-induced livelocks, after a finite number of
reboots.
Proof. Let templates be the set of all possible templates, i.e., the templates of all cycles that may
occur in an arbitrary execution of the application. According to Lemma 4, no cycle with a
template situated in historyold could happen again, therefore the application would become
free of deadlocks and avoidance-induced livelocks when historyold = templates.
The templates from historynew are saved to the persistent history image, and in the next
run they are accounted for historyold. According to Theorem 2, historynew ∩ historyold = ∅,
and therefore, in the next run, historyold grows by historynew. Since, according to Lemma 5,
|templates| is finite, and assuming all possible deadlocks and avoidance-induced livelocks will
eventually happen, historyold will eventually (after a finite number of reboots) become equal to
templates, and the application will thus be deadlock-free.
Note that the assumption that all possible deadlocks and avoidance-induced livelocks will
eventually happen holds only in theory. In practice, one cannot assume that an application
will experience all its potential deadlocks. Therefore, the completeness of our approach holds
only in theory. However, one important and indisputable contribution of our approach remains:
liveness-preserving immunity against all deadlocks similar to deadlocks experienced in previous
executions.
Theorem 4 (Liveness). Eventually, every thread will be granted every lock it requests.
Proof. Assuming every critical section eventually terminates (in a normal execution), every
thread t will eventually free all the locks it holds at a position p, and then wake up all
threads from yielders[t, p], i.e., all threads t′ that have (t, p) in their yield cause (i.e., in
yieldCause[t′]). Therefore, each yield imposed by our avoidance will eventually terminate.
Next, we have to prove that the number of yields is finite. If the thread scheduling is
fair, enough (thread, position) pairs will be eventually deactivated (i.e., thread releases all
locks from position), such that no template from history can be instantiated anymore, and
the thread requesting the lock will be eventually granted the lock. A problem still remains if
multiple threads at a time — say a group M of threads — are in the position to be granted
the locks they require, and the lock request of every thread t from M is in conflict with lock
requests of other threads t′ from M , i.e., a template would be instantiated if both t and t′
would be granted the locks they require. Even in this situation, if the scheduler is fair, every
thread t from M will eventually get to obtain its grant before the other threads from M , with
which t’s request is in conflict.
Another problem concerning liveness would be the avoidance-induced livelocks. Avoidance-induced livelocks are avoided exactly as deadlocks are. Although the avoidance of these livelocks can lead to other livelocks (see Appendix E), this propagation cannot last indefinitely,
for the simple reason that, according to Lemma 5, the number of possible templates is finite.
Again, we need the assumption that the application will eventually experience all possible
deadlocks/livelocks. As this assumption holds only in theory, liveness likewise holds only in theory. However, liveness is progressively ensured by our algorithm, just as safety is:
the application can deadlock/livelock only once in a particular context (template).
Given the above observations, it follows that our algorithm preserves liveness.
D Complexity Analysis
The complexity of the updates performed on the data structures is O(1) — constant time
— in almost all the cases. Only operations like removal of an element from a set (e.g.,
lockGrantees[p] := lockGrantees[p] \ {t}; complexity is O(|lockGrantees[p]|) in this case)
are non-constant (linear, in the worst case). We don't take into account the complexity of
constant-time elementary operations.
The most expensive operation in Request(t, l, p) is the call to templateInstance(t, p).
We now compute the complexity of templateInstance. As a reminder, lockGrantees[p] is
the set of threads that acquired or were granted locks at program position p. The complexity of instance(T, {t}), where T = {p1, ..., pn}, is O(∏_{i=1}^{n} i · |lockGrantees[pi]|): one
has to scan the entire space lockGrantees[p1] × ... × lockGrantees[pn], and, in addition, at each
position pi ∈ T one has to check whether the current thread ti ∈ lockGrantees[pi] was already encountered. Consider that history has N templates of average size NP, and that for each position
p from a template, |lockGrantees[p]| = NG on average. The complexity of finding out
whether a position belongs to a template is O(NP). Therefore, the complexity of Request is
O(Request) = O(N · (NP + ∏_{i=1}^{NP−1} i · NG)) = O(N · (NP + (NP − 1)! · NG^{NP−1})). We notice that
the complexity grows polynomially with the number of threads (NG) and exponentially with
the size of the templates (NP). As we stated in §4, we expect NP to be very small, therefore
it is not a problem that the complexity is exponential in NP.
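The product above corresponds to the nested search sketched below (a rough Java illustration of our own, not the actual implementation; for simplicity the template is shown as a set of positions, whereas in the paper it is a multiset):

  import java.util.*;

  // Rough sketch of the recursive instance check whose cost is analyzed above:
  // try to cover every position of the template with a distinct thread that
  // currently has a grant at that position.
  final class TemplateInstanceSketch {
      // lockGrantees[p]: threads that acquired or were granted locks at position p
      static Map<Integer, Set<Thread>> lockGrantees = new HashMap<>();

      // Returns a (thread, position) assignment instantiating template T while excluding
      // the threads in 'excl', or null if T cannot be instantiated.
      static Map<Thread, Integer> instance(Set<Integer> T, Set<Thread> excl) {
          if (T.isEmpty()) return new HashMap<>();         // empty template: trivially instantiated
          int p = T.iterator().next();                      // pick one position of the template
          for (Thread t : lockGrantees.getOrDefault(p, Set.of())) {
              if (excl.contains(t)) continue;               // each thread may cover only one position
              Set<Integer> rest = new HashSet<>(T);
              rest.remove(p);
              Set<Thread> excl2 = new HashSet<>(excl);
              excl2.add(t);
              Map<Thread, Integer> sub = instance(rest, excl2);
              if (sub != null) {                            // found an instantiation of the rest
                  sub.put(t, p);
                  return sub;
              }
          }
          return null;                                      // no grantee at p can complete the instance
      }
  }

At recursion depth i the sketch scans one lockGrantees set (NG entries on average) and checks an exclusion set of roughly i threads, i.e., about i · NG work per level; multiplied across the remaining levels of a template, and repeated for all N templates, this matches the bound derived above.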
If we chose instead to store for each thread t the list of positions where t acquired or was granted
locks, say lockPositions[t], having average size NL, we would have to scan lockPositions[t] for
every thread t, and check whether the current position p belongs to the current template and whether
p was already met. Therefore, the complexity of templateInstance would be O(N · (NP + (NP −
1)! · (NL · NP)^{NT})), where NT is the number of threads t with |lockPositions[t]| > 0. NT can
be very large, therefore we can consider the complexity in this case to be exponential
in NT. The ratio between the two complexities is K ≈ (NL · NP)^{NT} / NG^{NP−1}, and we also notice that
NT ≫ NG. Therefore, the complexity would be much higher than in the previous case. Knowing that
these two are essentially the only memory-efficient ways of mapping threads to lock positions,
the approach we chose appears to be the best option.
The complexity our algorithm adds to a lock request is O(lockRequest) = O(NW · Cwait ·
O(Request)), where NW is the average number of yields (waits) performed by a thread before
being granted a lock, and Cwait is the cost of a wait() system call. The Acquired routine
comprises only constant-time operations, therefore it is discarded from the analysis.
Release(t, l) includes only two non-constant-time operations: lockGrantees[p] := lockGrantees[p] \
{t}, whose complexity is O(NG), and the awakening of all threads from yielders[t, p], whose
complexity is O(NY · Cnotify), where NY is the average size of yielders[t, p], and Cnotify is
the cost of a notify() call. Therefore, the complexity of Release is O(Release) = O(NG + NY ·
Cnotify).
For cycle detection, we use the optimal offline algorithm, based on colored DFS [15].
Therefore, the complexity of waitCycles, for RAG = (V, E), is O(waitCycles) = O(|V| + |E|).
There are also online approaches for cycle detection, but we chose offline detection
because the cycle detection is done periodically, so many RAG updates (events) may
accumulate between two consecutive detection rounds, and offline detection is faster for bulk
RAG updates. Moreover, the cycle detector looks only for cycles containing threads with
pending requests, and this selection cannot be done with online cycle detection,
because of the topological order of RAG nodes that must be maintained after each addition
of a request edge. Considering that |requestingThreads| = NR on average, and given that
cycle detection is performed starting from every thread in requestingThreads, the complexity
of cycle detection is O(cycleDetection) = O(NR · (|V| + |E|)). A rough sketch of such a detection pass appears below.
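The following Java fragment is an illustrative sketch of our own (not the actual detector) of a colored-DFS pass over the RAG started only from the requesting threads; for brevity it merely reports whether a cycle is reachable, whereas waitCycles additionally extracts the cycle itself:

  import java.util.*;

  // Offline cycle detection by colored DFS over the RAG, started only from threads
  // that have pending requests, as described above.
  final class RagCycleDetector {
      enum Color { WHITE, GRAY, BLACK }          // unvisited / on current DFS path / done

      // adjacency: RAG node -> successor nodes (request, grant, hold, and yield edges)
      static boolean hasCycleFrom(Object node, Map<Object, List<Object>> rag,
                                  Map<Object, Color> color) {
          color.put(node, Color.GRAY);
          for (Object next : rag.getOrDefault(node, List.of())) {
              Color c = color.getOrDefault(next, Color.WHITE);
              if (c == Color.GRAY) return true;                       // back edge: cycle found
              if (c == Color.WHITE && hasCycleFrom(next, rag, color)) return true;
          }
          color.put(node, Color.BLACK);
          return false;
      }

      // O(NR * (|V| + |E|)): one DFS per requesting thread
      static boolean detect(Collection<Object> requestingThreads,
                            Map<Object, List<Object>> rag) {
          for (Object t : requestingThreads) {
              Map<Object, Color> color = new HashMap<>();             // fresh coloring per start node
              if (hasCycleFrom(t, rag, color)) return true;
          }
          return false;
      }
  }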
Every event is processed in constant time. Considering that we have on average NE events in
the event queue between two consecutive cycle detection iterations, the complexity of event
processing is O(processEvents) = O(NE).
In conclusion, the complexity of the avoidance module is O(avoidance) = O(lockRequest) +
O(Release) = O(NW · Cwait · N · (NP + (NP − 1)! · NG^{NP−1}) + NG + NY · Cnotify), and the complexity of the detection module is O(detection) = O(processEvents) + O(cycleDetection) =
O(NE + NR · (|V| + |E|)).
E Avoidance Propagation
We illustrate in Figure 9 the propagation of the avoidance-induced livelocks. First, the
deadlock t1 →r y3 →h^p6 t2 →r x3 →h^p3 t1 occurs, having the template {p3, p6}. Pausing t2
at position p6, because t1 crossed p3 already (by locking x3 at p3), leads to the livelock
t2 →y^p3 t1 →r y2 →h^p5 t2, having the template {p3, p5}. If t1 yields at position p3, because t2
crossed p6 already (by locking y3 at p6), this leads to the livelock t1 →y^p6 t2 →r x2 →h^p2 t1, having
the template {p2, p6}. In the same way, avoiding instantiating template {p3, p5} in t1 causes the
livelock t1 →y^p5 t2 →r x2 →h^p2 t1, having template {p2, p5}, and avoiding instantiating template
{p3, p5} in t2 causes the livelock t2 →y^p3 t1 →r y1 →h^p4 t2, having template {p3, p4}. In the same
fashion, avoiding template {p2, p6} in t1 leads to a livelock having the template {p1, p6}, and
avoiding it in t2 leads to a livelock having again the template {p2, p5}. Avoiding template
{p1, p6} in t2 leads to a livelock having the template {p1, p5}, and avoiding template {p3, p4}
in t1 leads to a livelock having the template {p2, p4}. Finally, avoiding template {p1, p5} in t2,
or avoiding template {p2, p4} in t1, leads to a livelock having the template {p1, p4}. But, since
we are already at the root lock statements in both threads, the livelocks cannot propagate
any further. Therefore, the avoidance stabilizes at positions p1, p4.
Thread t1: lock(x1) at p1; lock(x2) at p2; lock(x3) at p3; lock(y3); lock(y2); lock(y1)
Thread t2: lock(y1) at p4; lock(y2) at p5; lock(y3) at p6; lock(x3); lock(x2); lock(x1)
Fig. 9. Avoidance propagation