
ECE 259 / CPS 221 Advanced Computer Architecture II
Synchronization without Contention
John M. Mellor-Crummey and Michael L. Scott
Presenter : Tae Jun Ham
2012. 2. 16
Problem

- Busy-waiting synchronization incurs high memory/network contention
  - Creates hot spots, which degrade performance
  - Causes cache-line invalidation (on every write to the lock)
- Possible approach: add special-purpose hardware for synchronization
  - Add synchronization variables to the switching nodes of the interconnect
  - Implement lock-queuing mechanisms in the cache controller
- Suggestion in this paper: use a scalable synchronization algorithm (MCS) instead of special-purpose hardware
Review of Synchronization Algorithms

Test and Set
Requires: test-and-set (atomic operation)
LOCK
    while (test&set(x) == 1) ;   // spin until the old value was 0
UNLOCK
    x = 0;
Problems:
1. High contention (cache / memory)
2. Lack of fairness (random acquisition order)
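
The LOCK/UNLOCK pseudocode above can be sketched with C11 atomics. This is a minimal sketch, not the paper's code: the names tas_lock, tas_init, tas_try_acquire, tas_acquire and tas_release are hypothetical, and the memory orderings are one reasonable choice.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical type name; the slide simply calls the lock word x. */
typedef struct {
    atomic_flag held;   /* clear = free, set = held */
} tas_lock;

void tas_init(tas_lock *l) {
    atomic_flag_clear(&l->held);
}

/* One test&set attempt: returns true if the caller now owns the lock. */
bool tas_try_acquire(tas_lock *l) {
    /* test_and_set returns the previous value; false means it was free. */
    return !atomic_flag_test_and_set_explicit(&l->held, memory_order_acquire);
}

/* LOCK: every failed attempt is still a write to the shared lock word,
 * which is the source of the contention listed on the slide. */
void tas_acquire(tas_lock *l) {
    while (!tas_try_acquire(l)) {
        /* spin */
    }
}

/* UNLOCK: x = 0 */
void tas_release(tas_lock *l) {
    atomic_flag_clear_explicit(&l->held, memory_order_release);
}
```

Nothing in this sketch orders the waiters, so whichever processor's test&set happens to land first after a release wins the lock.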
Review of Synchronization Algorithms

Test and Set with Backoff
Similar to Test and Set, but waits between attempts
LOCK
    while (test&set(x) == 1) {
        delay(time);   // time grows after each failed attempt
    }
UNLOCK
    x = 0;
Delay time:
1. Linear: time = time + constant
2. Exponential: time = time * constant
Performance: reduced contention, but still not fair
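
The exponential variant can be sketched as follows. The constants BACKOFF_MIN, BACKOFF_MAX and BACKOFF_FACTOR, the cap on the delay, and the helper names are assumptions for illustration; the slide only states that the delay is multiplied by a constant.

```c
#include <stdatomic.h>

/* Assumed tuning constants; real values depend on the machine. */
enum { BACKOFF_MIN = 1, BACKOFF_MAX = 1024, BACKOFF_FACTOR = 2 };

/* Hypothetical delay helper: burns roughly `iterations` loop iterations. */
void spin_delay(unsigned iterations) {
    for (volatile unsigned i = 0; i < iterations; i++) {
        /* busy wait without touching shared memory */
    }
}

/* Exponential rule: time = time * constant, capped (the cap is an assumption). */
unsigned next_delay(unsigned current) {
    unsigned next = current * BACKOFF_FACTOR;
    return next > BACKOFF_MAX ? BACKOFF_MAX : next;
}

/* LOCK: back off after every failed test&set. */
void backoff_acquire(atomic_flag *x) {
    unsigned delay = BACKOFF_MIN;
    while (atomic_flag_test_and_set_explicit(x, memory_order_acquire)) {
        spin_delay(delay);
        delay = next_delay(delay);
    }
}

/* UNLOCK: x = 0 */
void backoff_release(atomic_flag *x) {
    atomic_flag_clear_explicit(x, memory_order_release);
}
```

Swapping next_delay for an additive rule gives the linear variant from the slide.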
Review of Synchronization Algorithms

Ticket Lock
Requires: fetch-and-increment (atomic operation)
LOCK
    myticket = fetch&increment(&(L->next_ticket));
    while (myticket != L->now_serving) {
        delay(time * (myticket - L->now_serving));   // proportional backoff
    }
UNLOCK
    L->now_serving = L->now_serving + 1;
Advantage: Fair (FIFO)
Disadvantage: Contention (memory/network), since all waiters poll the shared now_serving counter
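
A minimal C11 sketch of the ticket lock with proportional backoff follows. BASE_DELAY and the function names are assumptions; the field names follow the slide.

```c
#include <stdatomic.h>

#define BASE_DELAY 16u   /* assumed backoff unit, in loop iterations */

typedef struct {
    atomic_uint next_ticket;   /* next ticket to hand out */
    atomic_uint now_serving;   /* ticket currently allowed in */
} ticket_lock;

void ticket_init(ticket_lock *L) {
    atomic_init(&L->next_ticket, 0);
    atomic_init(&L->now_serving, 0);
}

/* LOCK: returns the ticket that was drawn (useful for inspection only). */
unsigned ticket_acquire(ticket_lock *L) {
    unsigned myticket =
        atomic_fetch_add_explicit(&L->next_ticket, 1, memory_order_relaxed);
    for (;;) {
        unsigned serving =
            atomic_load_explicit(&L->now_serving, memory_order_acquire);
        if (serving == myticket) {
            return myticket;
        }
        /* Wait in proportion to the number of holders ahead of us. */
        unsigned n = BASE_DELAY * (myticket - serving);
        for (volatile unsigned i = 0; i < n; i++) {
            /* busy wait */
        }
    }
}

/* UNLOCK: only the holder writes now_serving, so a plain increment is enough. */
void ticket_release(ticket_lock *L) {
    unsigned serving =
        atomic_load_explicit(&L->now_serving, memory_order_relaxed);
    atomic_store_explicit(&L->now_serving, serving + 1, memory_order_release);
}
```

Tickets are granted in draw order, which gives the FIFO property, but every waiter still reads the same now_serving word.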
Review of Synchronization Algorithms

Anderson Lock (Array based queue lock)
Requires: fetch-and-increment (atomic operation)
LOCK
    myplace = fetch&increment(&(L->next_location)) % numprocs;
    while (L->location[myplace] == must_wait) ;   // spin on own slot
    L->location[myplace] = must_wait;             // reset slot for reuse
UNLOCK
    L->location[(myplace + 1) % numprocs] = has_lock;
Advantage: Fair (FIFO), no cache contention (each processor spins on its own array slot)
Disadvantage: Requires coherent caches; O(P) space per lock
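
The array-based queue lock can be sketched in C11 as below. NUMPROCS, CACHE_LINE, the padding scheme and the function names are assumptions for illustration; slot 0 starts out holding the lock so that the first arrival proceeds.

```c
#include <stdatomic.h>

#define NUMPROCS 8     /* assumed; a power of two so counter wraparound is harmless */
#define CACHE_LINE 64  /* assumed cache-line size in bytes */

enum { MUST_WAIT = 0, HAS_LOCK = 1 };

/* One slot per processor, padded so each slot sits in its own cache line. */
typedef struct {
    atomic_int flag;
    char pad[CACHE_LINE - sizeof(atomic_int)];
} padded_slot;

typedef struct {
    padded_slot location[NUMPROCS];
    atomic_uint next_location;
} anderson_lock;

void anderson_init(anderson_lock *L) {
    atomic_init(&L->location[0].flag, HAS_LOCK);
    for (int i = 1; i < NUMPROCS; i++) {
        atomic_init(&L->location[i].flag, MUST_WAIT);
    }
    atomic_init(&L->next_location, 0);
}

/* LOCK: returns the slot index, which the caller passes back to release. */
unsigned anderson_acquire(anderson_lock *L) {
    unsigned myplace = atomic_fetch_add_explicit(&L->next_location, 1,
                                                 memory_order_relaxed) % NUMPROCS;
    while (atomic_load_explicit(&L->location[myplace].flag,
                                memory_order_acquire) == MUST_WAIT) {
        /* spin on our own slot only */
    }
    /* Reset the slot so it can be reused on a later trip around the array. */
    atomic_store_explicit(&L->location[myplace].flag, MUST_WAIT,
                          memory_order_relaxed);
    return myplace;
}

/* UNLOCK: hand the lock to the next slot in the ring. */
void anderson_release(anderson_lock *L, unsigned myplace) {
    atomic_store_explicit(&L->location[(myplace + 1) % NUMPROCS].flag,
                          HAS_LOCK, memory_order_release);
}
```

The padding is what keeps waiters from invalidating each other's cache lines, and it is also where the space cost comes from.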
MCS Lock

MCS Lock – Based on Linked List
Acquire
1. Fetch & Store on the tail pointer (get the predecessor and set tail to the arriving processor node)
2. Set the arriving processor node to locked
3. Set the predecessor's next pointer to the arriving processor node
4. Spin until Locked = false
[Figure: lock queue before and after processor 4 arrives. Before: node 1 (Locked: False, Run), nodes 2 and 3 (Locked: True, Spin). After: node 4 (Locked: True, Spin) is appended and tail points to it.]
MCS Lock
MCS Lock – Based on Linked List
Release
Check if the next processor node is set (i.e., a successor has completed its acquisition steps)
- If set, unlock the next processor node

[Figure: queue before and after release. Before: node 1 (Locked: False, Run), nodes 2-4 (Locked: True, Spin). After: node 1 (Finished), node 2 (Locked: False, Run), nodes 3 and 4 (Locked: True, Spin).]
MCS Lock
MCS Lock – Based on Linked List
Release
Check if the next processor node is set (i.e., a successor has completed its acquisition steps)
- If not set, check whether tail points to this node (compare & swap tail with null)
  - If it does, the queue is now empty and the release is done
  - If not, wait until the next processor node is set, then unlock it

[Figure: node 1 releases while node 2 is still enqueuing. Tail already points to node 2; node 1 waits for its next pointer to be set, then node 2 is unlocked and runs.]
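
The acquire and release steps from the MCS Lock slides can be sketched together in C11. This is a hedged sketch rather than the paper's listing: the type and function names (mcs_node, mcs_lock, mcs_acquire, mcs_release) and the memory orderings are assumptions, and each processor is assumed to supply its own node.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Per-processor queue node; each waiter spins only on its own `locked`. */
typedef struct mcs_node {
    _Atomic(struct mcs_node *) next;
    atomic_bool locked;
} mcs_node;

/* The lock itself is just the tail pointer; NULL means free. */
typedef _Atomic(mcs_node *) mcs_lock;

void mcs_acquire(mcs_lock *tail, mcs_node *me) {
    atomic_store_explicit(&me->next, (mcs_node *)NULL, memory_order_relaxed);
    /* 1. Fetch & store: get the predecessor and make ourselves the tail. */
    mcs_node *pred = atomic_exchange_explicit(tail, me, memory_order_acq_rel);
    if (pred != NULL) {
        /* 2. Mark ourselves locked before the predecessor can see us. */
        atomic_store_explicit(&me->locked, true, memory_order_relaxed);
        /* 3. Link in behind the predecessor. */
        atomic_store_explicit(&pred->next, me, memory_order_release);
        /* 4. Spin on our own node until the predecessor unlocks us. */
        while (atomic_load_explicit(&me->locked, memory_order_acquire)) {
            /* local spin */
        }
    }
}

void mcs_release(mcs_lock *tail, mcs_node *me) {
    mcs_node *succ = atomic_load_explicit(&me->next, memory_order_acquire);
    if (succ == NULL) {
        /* No visible successor: if tail is still us, the queue is empty. */
        mcs_node *expected = me;
        if (atomic_compare_exchange_strong_explicit(
                tail, &expected, (mcs_node *)NULL,
                memory_order_acq_rel, memory_order_acquire)) {
            return;
        }
        /* Someone swapped the tail but has not linked in yet; wait for it. */
        while ((succ = atomic_load_explicit(&me->next,
                                            memory_order_acquire)) == NULL) {
            /* spin until the successor sets our next pointer */
        }
    }
    /* Hand the lock to the successor. */
    atomic_store_explicit(&succ->locked, false, memory_order_release);
}
```

The only shared word that every processor touches is the tail pointer, and each touches it a bounded number of times per acquire/release; all waiting happens on the caller's own node.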
MCS Lock – Concurrent Read Version

Start_Read:
- If predecessor is nil or an active reader, reader_count++ (atomic) and proceed
- Else, spin until another Start_Read or an End_Write unblocks this reader
  => Then this reader unblocks its successor reader (if any)

End_Read:
- If successor is a writer, set next_writer = successor
- reader_count-- (atomic)
- If this is the last reader (reader_count == 0), check next_writer and unblock it

Start_Write:
- If predecessor is nil and there is no active reader (reader_count == 0), proceed
- Else, spin until the last End_Read or a preceding End_Write unblocks this writer

End_Write:
- If successor is a reader, reader_count++ (atomic); then unblock the successor
Review of Barriers

Centralized counter barrier
Each processor atomically updates a centralized counter, then keeps polling shared state until all processors have arrived
Advantage: Simplicity
Disadvantage: Hot spot, contention
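
One common form of the centralized barrier is the sense-reversing counter barrier, sketched below in C11. This is an assumed variant for illustration (the slide does not give code): it uses an atomic decrement plus a shared sense flag, and the names central_barrier, central_barrier_init and central_barrier_wait are hypothetical.

```c
#include <stdatomic.h>
#include <stdbool.h>

typedef struct {
    atomic_uint count;   /* processors still to arrive in this episode */
    atomic_bool sense;   /* flips once per barrier episode */
    unsigned nprocs;
} central_barrier;

void central_barrier_init(central_barrier *b, unsigned nprocs) {
    atomic_init(&b->count, nprocs);
    atomic_init(&b->sense, false);
    b->nprocs = nprocs;
}

/* Each processor keeps its own local_sense, initially false. */
void central_barrier_wait(central_barrier *b, bool *local_sense) {
    *local_sense = !*local_sense;
    if (atomic_fetch_sub_explicit(&b->count, 1, memory_order_acq_rel) == 1) {
        /* Last arrival: reset the counter, then release everyone. */
        atomic_store_explicit(&b->count, b->nprocs, memory_order_relaxed);
        atomic_store_explicit(&b->sense, *local_sense, memory_order_release);
    } else {
        /* Everyone else polls the same shared flag: the hot spot. */
        while (atomic_load_explicit(&b->sense, memory_order_acquire)
               != *local_sense) {
            /* spin */
        }
    }
}
```

All P processors update one counter and spin on one flag, which is why this scheme is simple but creates a hot spot.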
Review of Barriers

Combining Tree Barrier

Advantage: Simplicity, less contention, parallelized fetch&increment
Disadvantage: Still spins on non-local locations

Review of Barriers

Bidirectional Tournament Barrier

Winner is statically determined
Advantage : No need for fetch and op / Local Spin

Review of Barriers

Dissemination Barrier

Can be understood as a variation of the tournament barrier (partners are statically determined)
Suitable for message-passing (MPI) systems
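
The static schedule of the dissemination barrier can be written down directly: in round k, processor i signals processor (i + 2^k) mod P, and ceil(log2 P) rounds are needed. The helper names below are hypothetical; this sketch computes only the schedule, not the flag signalling itself.

```c
/* Number of rounds: smallest r with 2^r >= nprocs. */
unsigned dissemination_rounds(unsigned nprocs) {
    unsigned r = 0;
    while ((1u << r) < nprocs) {
        r++;
    }
    return r;
}

/* Processor that `i` signals in the given round (rounds start at 0). */
unsigned dissemination_partner(unsigned i, unsigned round, unsigned nprocs) {
    return (i + (1u << round)) % nprocs;
}
```

For example, with P = 5 processor 3 signals processor 4 in round 0, processor 0 in round 1, and processor 2 in round 2. Because partners are fixed in advance, each processor can spin on flags placed in its own local memory.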

MCS Barriers

MCS Barrier (Arrival)

Similar to Combining Tree Barrier
Local Spin / O(P) Space / 2P-2 network transactions / O(log P) critical path

MCS Barriers

MCS Barrier (Wakeup)
[Figure: wakeup tree over processors 0-5]
Similar to Combining Tree Barrier
Local Spin / O(P) Space / 2P-2 network transactions / O(log P) critical path
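
The tree shapes behind these figures can be sketched with index arithmetic. The sketch assumes the layout used in the Mellor-Crummey and Scott tree barrier, a 4-ary arrival tree and a binary wakeup tree over processors 0..P-1; the helper names are hypothetical, and only the signalling pattern is computed, not the barrier itself.

```c
/* Arrival tree (fan-in 4): processor i reports to (i - 1) / 4; the root has no parent. */
int arrival_parent(int i) {
    return i == 0 ? -1 : (i - 1) / 4;
}

/* Wakeup tree (binary): k is 0 or 1; returns -1 if that child does not exist. */
int wakeup_child(int i, int k, int nprocs) {
    int child = 2 * i + 1 + k;
    return child < nprocs ? child : -1;
}

/* Total signals per barrier episode: one per edge in each tree. */
int total_signals(int nprocs) {
    int signals = 0;
    for (int i = 1; i < nprocs; i++) {
        signals++;                       /* i notifies its arrival parent */
    }
    for (int i = 0; i < nprocs; i++) {
        for (int k = 0; k < 2; k++) {
            if (wakeup_child(i, k, nprocs) != -1) {
                signals++;               /* i wakes one child */
            }
        }
    }
    return signals;
}
```

Each tree has P-1 edges, so the count comes to 2P-2 signals per episode, and every processor spins only on flags in its own node.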
Spin Lock Evaluation

Butterfly Machine result

- Three of the locks scaled badly; four scaled well. MCS was best.
- Backoff was effective.
Spin Lock Evaluation

Butterfly Machine result

- Measures the time between consecutive lock acquisitions on separate processors, rather than an acquire/release pair from start to finish
Spin Lock Evaluation

Symmetry Machine Result

- MCS and Anderson scale well
- The ticket lock cannot be implemented on the Symmetry because it lacks a fetch-and-increment operation
- The Symmetry results seem to be more reliable
Spin Lock Evaluation

Network Latency

- MCS greatly reduces the increase in network latency
- Local spinning reduces contention
Barrier Evaluation




Butterfly Machine
- Dissemination was best
- Bidirectional tournament and MCS tree were acceptable
- Remote memory access degrades performance significantly
Barrier Evaluation





Symmetry Machine
- Counter method was best
- Dissemination was worst
- Bus-based architecture: cheap broadcast
- MCS arrival tree outperforms the counter for more than 16 processors
Local Memory Evaluation

- Having local memory is extremely important: it affects both performance and network contention
- A dancehall system is not really scalable
Summary

This paper proposed a scalable spin-lock synchronization algorithm without network contention

This paper proposed a scalable barrier algorithm

This paper showed that network contention due to busy-wait synchronization need not be a problem

This paper argued that hardware for the QOSB lock would not be cost-effective compared with the MCS lock

This paper suggests the use of distributed memory or coherent caches rather than dance-hall memory without coherent caches
Discussion

What would be the primary disadvantage of the MCS lock?

In what cases would the MCS lock perform worse than other locks?

What do you think about locks based on special-purpose hardware?

Is the space usage of a lock important?

Can we benefit from a dancehall-style memory architecture? (disaggregated memory?)

Is there a way to implement an energy-efficient spin lock?