
Transactional Memory
The Art of Multiprocessor Programming

Multicore Architectures
• “Learn how the multi-core processor architecture plays a central role in Intel's platform approach. …”
• “AMD is leading the industry to multicore technology for the x86 based computing market …”
• “Sun's multicore strategy centers around multi-threaded software. …”
Why do we care?
• Time no longer cures software bloat
• When you double your path length
– You can’t just wait 6 months
– Your software must somehow exploit
twice as much concurrency
The Problem
• Cannot exploit cheap threads
• Today’s Software
– Non-scalable methodologies
• Today’s Hardware
– Poor support for scalable
synchronization
Locking
Coarse-Grained Locking
Easily made correct …
But not scalable.
Fine-Grained Locking
Here comes trouble …
Why Locking Doesn’t Scale
• Not Robust
• Relies on conventions
• Hard to Use
– Conservative
– Deadlocks
– Lost wake-ups
• Not Composable
Locks are not Robust
If a thread holding
a lock is delayed …
No one else can
make progress
Locking Relies on Conventions
• Relation between
– Lock bit and object bits
– Exists only in programmer's mind

Actual comment from the Linux kernel (hat tip: Bradley Kuszmaul):

/*
* When a locked buffer is visible to the I/O layer
* BH_Launder is set. This means before unlocking
* we must clear BH_Launder, mb() on alpha and then
* clear BH_Lock, so no reader can see BH_Launder set
* on an unlocked buffer and then risk to deadlock.
*/
Sadistic Homework
[Figure: double-ended queue, with enq(x) and enq(y) at opposite ends]
No interference if ends “far enough” apart
Sadistic Homework
[Figure: double-ended queue, with enq(x) and enq(y) at opposite ends of a short queue]
Interference OK if ends “close enough” together
Sadistic Homework
[Figure: double-ended queue, with deq() at both ends]
Make sure suspended dequeuers awake as needed
You Try It …
• One lock?
– Too Conservative
• Locks at each end?
– Deadlock, too complicated, etc
• Waking blocked dequeuers?
– Harder than it looks
Actual Solution
• A clean solution would be a publishable result
• [Michael & Scott, PODC 96]
• High-performance fine-grained lock-based solutions are good for libraries …
• … not for general consumption by programmers
Locks do not compose
[Figure: hash table T1 with its lock; add(T1, item)]
Must lock T1 before adding the item.

[Figure: move an item from T1 to T2: delete(T1, item); add(T2, item), with both T1 and T2 locked]
Must lock T2 before deleting from T1, so the move appears atomic to other threads.

Exposing lock internals breaks abstraction.
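A rough illustration of the problem (hypothetical LockedTable class and move helper, not from the lecture): to make the move atomic, the caller has to know about both tables' internal locks and acquire them in some global order.

import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.locks.ReentrantLock;

class LockedTable<K, V> {
  final ReentrantLock lock = new ReentrantLock();   // internal lock, now exposed to callers
  private final Map<K, V> map = new HashMap<>();

  V delete(K key) { lock.lock(); try { return map.remove(key); } finally { lock.unlock(); } }
  void add(K key, V val) { lock.lock(); try { map.put(key, val); } finally { lock.unlock(); } }

  // Atomic move: hold BOTH internal locks for the whole move, acquired in a
  // fixed global order to avoid deadlock (tie-breaking on equal hashes is glossed over).
  static <K, V> void move(LockedTable<K, V> t1, LockedTable<K, V> t2, K key) {
    LockedTable<K, V> first = System.identityHashCode(t1) <= System.identityHashCode(t2) ? t1 : t2;
    LockedTable<K, V> second = (first == t1) ? t2 : t1;
    first.lock.lock();
    second.lock.lock();
    try {
      V val = t1.map.remove(key);     // no window in which the item is in neither table
      if (val != null) t2.map.put(key, val);
    } finally {
      second.lock.unlock();
      first.lock.unlock();
    }
  }
}

Every caller that wants a compound operation has to repeat this dance, and it only works because LockedTable leaks its lock; that is the abstraction break.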
Monitor Wait and Signal
[Figure: a consumer sleeps (zzz) on an empty buffer and wakes (Yes!) when an item shows up]
If buffer is empty, wait for item to show up
Wait and Signal do not Compose
[Figure: two empty buffers; the consumer sleeps (zzz…)]
Wait for either? Monitors give no way to wait on conditions of two different objects at once.
The Transactional Manifesto
• What we do now is inadequate to
meet the multicore challenge
• Research Agenda
– Replace locking with a transactional API
– Design languages to support this model
– Implement the run-time to be fast
enough
Transactions
• Atomic
– Commit: takes effect
– Abort: effects rolled back
• Usually retried
• Linearizable
– Appear to happen in one-at-a-time order
Sadistic Homework Revisited
public void LeftEnq(Item x) {
  Qnode q = new Qnode(x);
  q.left = this.left;
  this.left.right = q;
  this.left = q;
}
Write sequential code
Sadistic Homework Revisited
public void LeftEnq(Item x) {
  atomic {
    Qnode q = new Qnode(x);
    q.left = this.left;
    this.left.right = q;
    this.left = q;
  }
}
Enclose in atomic block
Warning
• Not always this simple
– Conditional waits
– Enhanced concurrency
– Overlapping locks
• But often it is
– Works for sadistic homework
Composition
public void Transfer(Queue q1, Queue q2) {
  atomic {
    Item x = q1.deq();
    q2.enq(x);
  }
}
Trivial or what?
Wake-ups: lost and found
public Item LeftDeq() {
  atomic {
    if (this.left == null)
      retry;
    …
  }
}
Roll back transaction and restart when something changes
OrElse Composition
atomic {
x = q1.deq();
} orElse {
x = q2.deq();
}
Run 1st method. If it retries …
Run 2nd method. If it retries …
Entire statement retries
Transactional Memory
[Herlihy & Moss 1993]
Hardware Overview
• Exploit Cache coherence protocols
• Already do almost what we need
– Invalidation
– Consistency checking
• Exploit Speculative execution
– Branch prediction = optimistic synch!
HW Transactional Memory
[Figure sequence: processors with transactional caches over a shared interconnect and memory]
• An active transaction reads a line into its cache, tagged T (transactional).
• A second active transaction may read the same line; both caches then hold it tagged T.
• One transaction commits; its T entries become ordinary cached data.
• An active transaction writes a line, tagging it D (dirty) in its cache.
• Rewind: a conflicting write aborts the other active transaction, whose tentative entries are discarded.
Transaction Commit
• At commit point
– If no cache conflicts, we win.
• Mark transactional entries
– Read-only: valid
– Modified: dirty (eventually written back)
• That’s all, folks!
– Except for a few details …
Not all Skittles and Beer
• Limits to
– Transactional cache size
– Scheduling quantum
• Transaction cannot commit if it is
– Too big
– Too slow
– Actual limits platform-dependent
Some Approaches
• Trap to software if hardware fails
– “Hybrid” approach
• Use locks if speculation fails
– Lock elision
• “Virtualize” transactional memory
– VTM, UTM, etc…
Hardware versus Software
• Do we need hardware at all?
– Like virtual memory, probably need HW
for performance
• Do we need software?
– Policy issues don’t make sense for
hardware
Transactional Locking
• Build an STM based on locks
• Atomic{} is as simple as coarse-grained locking, and it composes
• Under the covers (transparent to the user) we deploy fine-grained locks
Locking STM Design Choices
[Figure: application memory maps to an array of versioned write-locks (V#)]
• PS = Lock per Stripe (separate array of locks)
• PO = Lock per Object (lock embedded in the object)
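A minimal sketch (my own names, assuming the PS design) of how application memory can map to versioned write-locks: each object/field hashes to one slot of a fixed array of lock words, where a word packs a lock bit and a version number V#.

import java.util.concurrent.atomic.AtomicLongArray;

final class StripeLocks {
  // Each slot packs: low bit = lock bit, remaining bits = version number (V#).
  static final int STRIPES = 1 << 20;                        // power of two
  static final AtomicLongArray locks = new AtomicLongArray(STRIPES);

  // PS = lock per stripe: hash object identity and field offset to a slot.
  static int slotFor(Object obj, int fieldOffset) {
    int h = System.identityHashCode(obj) * 31 + fieldOffset;
    return (h ^ (h >>> 16)) & (STRIPES - 1);
  }

  static boolean isLocked(long lockWord) { return (lockWord & 1L) != 0; }
  static long versionOf(long lockWord)   { return lockWord >>> 1; }
}

The PO variant instead embeds the lock word in the object itself, trading extra space per object for freedom from false conflicts between objects that happen to hash to the same stripe.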
Lock-based STM
[Figure: memory words X and Y, each mapped to a versioned write-lock: a lock bit plus a version V#, which becomes V#+1 when released after a write]
1. To Read: load lock + location
2. Location in write-set? (Bloom filter)
3. Check unlocked, add to Read-Set
4. To Write: add value to write-set
5. Acquire Locks
6. Validate read/write v#'s unchanged
7. Release each lock with v#+1
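A hedged sketch of steps 5-7 (my own helper class, reusing the StripeLocks lock-word encoding above): acquire each write-set lock by CAS-ing its lock bit, validate the read-set, then release each lock with the version bumped to v#+1. The buffered write-back and the Bloom-filter write-set check (step 2) are omitted.

final class TLCommit {
  // Step 5: acquire a write-set lock by atomically setting its lock bit.
  static boolean tryLock(int slot) {
    long w = StripeLocks.locks.get(slot);
    return !StripeLocks.isLocked(w)
        && StripeLocks.locks.compareAndSet(slot, w, w | 1L);
  }

  // Step 6: validate a read-set entry that is not in the write-set:
  // still unlocked and its version number unchanged since it was read.
  static boolean validate(int slot, long versionSeenAtRead) {
    long w = StripeLocks.locks.get(slot);
    return !StripeLocks.isLocked(w) && StripeLocks.versionOf(w) == versionSeenAtRead;
  }

  // Step 7: after writing the values back, release the lock with version v#+1
  // (one store clears the lock bit and installs the new version).
  static void unlock(int slot, long oldVersion) {
    StripeLocks.locks.set(slot, (oldVersion + 1) << 1);
  }
}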
Transactional Consistency
Invariant: x = 2y. Application memory initially holds x = 4, y = 2.
Transaction A: write x (4 → 8), write y (2 → 4)
Transaction B: read x, read y, compute z = 1/(x-y) = 1/2
B's reads do not interleave with A's writes, so B sees a consistent state (here the old one).

Transactional Inconsistency
Invariant x = 2y
Transaction B: read x = 4
Transaction A: write x (4 → 8), write y (2 → 4)
Transaction B: read y = 4 {transaction is now a zombie}
Compute z = 1/(x-y): DIV by 0 ERROR
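A small plain-Java sketch of the zombie hazard (no STM involved, names are mine): if the reader's two loads straddle the writer's update, it sees x from the old state and y from the new one, and the division blows up. Timing makes the race likely here, not certain.

public class ZombieDemo {
  static volatile int x = 4, y = 2;            // invariant: x == 2 * y

  public static void main(String[] args) throws InterruptedException {
    Thread reader = new Thread(() -> {
      int rx = x;                              // likely reads the old x = 4
      pause(20);                               // the writer runs in this window
      int ry = y;                              // likely reads the new y = 4
      System.out.println("z = " + (1 / (rx - ry)));   // ArithmeticException: / by zero
    });
    reader.start();
    pause(10);
    x = 8;                                     // the writer itself preserves the invariant:
    y = 4;                                     // x = 2y holds again once both writes are done
    reader.join();
  }

  static void pause(long ms) { try { Thread.sleep(ms); } catch (InterruptedException e) { } }
}

This is exactly the inconsistency that read validation against the version clock, described next, is there to prevent.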
Solution: “Version Clock”
• Have one shared global version clock
• Incremented by (small subset of)
writing transactions
• Read by all transactions
• Used to validate that state worked on
is always consistent
Later: how we learned not to worry
about contention and love the clock
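In Java terms the shared clock can be a single counter; a minimal sketch (the class name is mine):

import java.util.concurrent.atomic.AtomicLong;

final class GlobalVersionClock {
  static final AtomicLong VClock = new AtomicLong(0);
  static long sample()  { return VClock.get(); }              // read by every transaction (RV)
  static long advance() { return VClock.incrementAndGet(); }  // fetch-and-increment by writers (WV)
}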
Version Clock: Read-Only Transactions
[Figure: memory words and their versioned write-locks; the shared VClock reads 100 and the transaction samples RV from it]
1. RV ← VClock
2. On Read: read lock, read mem, re-read lock: check unlocked, unchanged, and v# <= RV
3. Commit.
Reads form a snapshot of memory. No read set!
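A hedged sketch of the read-only rule (my own class, reusing StripeLocks and GlobalVersionClock from the sketches above): sample RV once, and for every read check unlocked-and-unchanged with version <= RV; the reads then already form a snapshot no newer than RV, so nothing needs to be recorded or revalidated at commit.

final class ReadOnlyTxn {
  static class AbortException extends RuntimeException { }

  final long rv = GlobalVersionClock.sample();       // step 1: RV <- VClock

  // Step 2: read the lock, read the value, re-read the lock; the slot must be
  // unlocked, unchanged across the read, and no newer than RV.
  long read(int slot, long[] memory) {
    long before = StripeLocks.locks.get(slot);
    long value = memory[slot];                       // stand-in for the real shared location
    long after = StripeLocks.locks.get(slot);
    if (StripeLocks.isLocked(before) || before != after
        || StripeLocks.versionOf(before) > rv) {
      throw new AbortException();                    // inconsistent: abort and retry the transaction
    }
    return value;
  }
  // Step 3: commit is a no-op; no read-set is kept.
}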
Version Clock
[Figure: a writing transaction updates X and Y; at commit VClock advances (here from 100 to 121) and the locks for X and Y are released with that new version]
1. RV ← VClock
2. On Read/Write: check unlocked and v# <= RV, then add to Read/Write-Set
3. Acquire Locks
4. WV = F&I(VClock)
5. Validate each v# <= RV
6. Release locks with v# ← WV
Reads + Inc + Writes = Linearizable
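A hedged sketch of the commit path for a writing transaction (my own types, reusing StripeLocks, TLCommit, and GlobalVersionClock from the sketches above; the buffered values and the write-back itself are elided):

import java.util.Set;

final class WriteTxnCommit {
  // readSet / writeSet hold the stripe slots touched by the transaction.
  static boolean commit(long rv, Set<Integer> readSet, Set<Integer> writeSet) {
    // 3. Acquire locks on the write-set (on failure the caller releases what it took and retries).
    for (int slot : writeSet) {
      if (!TLCommit.tryLock(slot)) return false;
    }
    // 4. WV = F&I(VClock): this transaction's writes get a fresh version number.
    long wv = GlobalVersionClock.advance();
    // 5. Validate the read-set: each slot unlocked (unless we hold its lock) and v# <= RV.
    for (int slot : readSet) {
      long w = StripeLocks.locks.get(slot);
      boolean lockedByUs = writeSet.contains(slot);
      if ((StripeLocks.isLocked(w) && !lockedByUs) || StripeLocks.versionOf(w) > rv) {
        return false;                                // abort: release held locks unchanged and retry
      }
    }
    // ... write the buffered values back to their memory locations here ...
    // 6. Release each write lock, installing version WV (this also clears the lock bit).
    for (int slot : writeSet) {
      StripeLocks.locks.set(slot, wv << 1);
    }
    return true;
  }
}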
Performance Benchmarks
• Mechanically Transformed Sequential
Red-Black Tree using TL2
• Compare to STMs and hand-crafted
fine-grained Red-Black
implementation
• On a 16-way Sun Fire™ running
Solaris™ 10
Uncontended Large Red-Black Tree
5% Delete, 5% Update, 90% Lookup
[Graph: throughput for Hand-crafted, TL/PO, TL2/PO, Ennals, TL/PS, TL2/PS, and the lock-free Fraser and Harris STMs]
Uncontended Small RB-Tree
5% Delete, 5% Update, 90% Lookup
[Graph: throughput for TL/PO and TL2/PO]
Contended Small RB-Tree
30% Delete, 30% Update, 40% Lookup
[Graph: throughput for TL/PO, TL2/PO, and Ennals]
What Next?
• On multicores like Niagara, the global clock is not a big issue.
• But there is a lot of TM research
work ahead
• Do you want to help…?