
Lecture 27
Multiprocessor Scheduling
• Last lecture: VMM
• Two old problems: CPU virtualization and memory virtualization
• I/O virtualization
• Today
• Issues related to multi-core: scheduling and scalability
The cache coherence problem
• Since each core has its own private cache:
how do we keep the data consistent across caches?
• Each core should perceive memory as a monolithic array, shared by all the cores
The cache coherence problem
[Diagrams: a multi-core chip with four cores, each with one or more levels of private cache, all connected to main memory.]
• Initially, Core 1's cache, Core 2's cache, and main memory all hold x=15213
• Core 1 then writes x=21660
• Assuming write-back caches: Core 1's cache holds x=21660, but Core 2's cache and main memory still hold the stale value x=15213
• Assuming write-through caches: main memory is also updated to x=21660, but Core 2's cache still holds the stale value x=15213
Solutions for cache coherence
• Many solution algorithms and coherence protocols exist
• A simple solution:
invalidation protocol with bus snooping
Inter-core bus
[Diagram: the four cores, each with one or more levels of private cache, connected to each other and to main memory by an inter-core bus on the multi-core chip.]
Invalidation protocol with snooping
• Invalidation:
if a core writes to a data item, all copies of that data item in other caches are invalidated
• Snooping:
all cores continuously “snoop” (monitor) the bus connecting the cores
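The invalidation-with-snooping idea can be modeled in a few lines. A toy sketch, assuming write-through caches as in the slides; all class and variable names here are illustrative, not from the lecture:

```python
class Bus:
    """Broadcast medium: every cache snoops every message on it."""
    def __init__(self):
        self.caches = []

    def broadcast_invalidate(self, sender, addr):
        for cache in self.caches:
            if cache is not sender:
                cache.snoop_invalidate(addr)

class Cache:
    def __init__(self, bus, memory):
        self.lines = {}          # addr -> value (valid copies only)
        self.bus = bus
        self.memory = memory
        bus.caches.append(self)

    def read(self, addr):
        if addr not in self.lines:                 # miss: fetch from memory
            self.lines[addr] = self.memory[addr]
        return self.lines[addr]

    def write(self, addr, value):
        self.bus.broadcast_invalidate(self, addr)  # invalidate other copies
        self.lines[addr] = value
        self.memory[addr] = value                  # write-through, as in the slides

    def snoop_invalidate(self, addr):
        self.lines.pop(addr, None)                 # drop our now-stale copy

memory = {'x': 15213}
bus = Bus()
c1, c2 = Cache(bus, memory), Cache(bus, memory)

c1.read('x'); c2.read('x')   # both caches now hold x=15213
c1.write('x', 21660)         # c2's copy is invalidated via the bus
print(c2.read('x'))          # c2 misses and re-fetches the new value: 21660
```

Real hardware does this per cache line in the coherence controller, but the message pattern is the same: one write, one broadcast, stale copies dropped.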
The cache coherence problem, revisited with invalidation
[Diagrams: the same four-core example, now with bus snooping and write-through caches.]
• Initially, Core 1's cache, Core 2's cache, and main memory all hold x=15213
• Core 1 writes x=21660: it sends an invalidation request on the bus, Core 2's copy of x is INVALIDATED, and main memory is updated to x=21660 (write-through)
• When Core 2 next reads x, it misses in its cache and fetches the up-to-date value x=21660
Alternative to invalidation protocol: update protocol
[Diagrams: the same four-core example with write-through caches.]
• Core 1 writes x=21660 and broadcasts the updated value on the inter-core bus; main memory is updated to x=21660
• Core 2's cached copy is updated in place: it now holds x=21660 instead of the stale x=15213
Invalidation vs. update
• Multiple writes to the same location:
• invalidation: bus traffic only on the first write
• update: must broadcast every write (including the new value)
• Invalidation generally performs better: it generates less bus traffic
Programmers Still Need to Worry about Concurrency
• Mutex
• Condition variables
• Lock-free data structures
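Cache coherence keeps individual loads and stores consistent, but it does not make a read-modify-write sequence atomic, which is why these primitives are still needed. A minimal mutex sketch using Python's `threading.Lock` (the worker function and counts are illustrative):

```python
import threading

counter = 0
lock = threading.Lock()

def worker(n):
    global counter
    for _ in range(n):
        with lock:          # mutex: the read-modify-write is now atomic
            counter += 1

threads = [threading.Thread(target=worker, args=(10000,)) for _ in range(4)]
for t in threads: t.start()
for t in threads: t.join()
print(counter)  # always 40000 with the lock; can be less without it
```

Without the lock, two threads can both read the same old value of `counter`, each add one, and store back, losing an update; coherence alone does not prevent that interleaving.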
Single-Queue Multiprocessor Scheduling (SQMS)
• Reuse the basic framework for single-processor scheduling
• Put all jobs that need to be scheduled into a single queue
• Pick the best two jobs to run if there are two CPUs
• Advantage: simple
• Disadvantage: does not scale (every CPU contends for the one shared queue and its lock)
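The single-queue idea can be sketched in a few lines; the job names, queue contents, and CPU count below are illustrative assumptions:

```python
from collections import deque

# Toy single-queue multiprocessor scheduler: one shared run queue,
# each idle CPU takes the next ("best") job from the head.
run_queue = deque(['A', 'B', 'C', 'D', 'E'])
NUM_CPUS = 2

def dispatch():
    """Give each idle CPU the next job from the shared queue."""
    assignment = {}
    for cpu in range(NUM_CPUS):
        if run_queue:
            assignment[cpu] = run_queue.popleft()
    return assignment

print(dispatch())  # {0: 'A', 1: 'B'}
```

In a real kernel every `dispatch` would have to take a lock on the shared queue, and with many CPUs that lock becomes the bottleneck, which is exactly the scaling problem noted above.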
SQMS and Cache Affinity
Cache Affinity
• Thread migration is costly:
• the execution pipeline must be restarted
• cached data is lost (the new core's cache is cold)
• The OS scheduler tries to avoid migration as much as possible: it tends to keep a thread on the same core
Multi-Queue Multiprocessor Scheduling (MQMS)
• One queue per CPU: scalable
• Preserves cache affinity
Load Imbalance
• Per-CPU queues can become unevenly loaded; migrating jobs between queues is needed to rebalance
Work Stealing
• A (source) queue that is low on jobs will occasionally peek at another (target) queue
• If the target queue is (notably) fuller than the source queue, the source will “steal” one or more jobs from the target to help balance load
• Cannot look at other queues too often, or the checking overhead defeats the scalability benefit
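The stealing check described above can be sketched as follows; the queue contents and the `threshold` parameter are illustrative assumptions:

```python
import random
from collections import deque

# Toy work-stealing check between per-CPU run queues.
queues = [deque(), deque(['A', 'B', 'C', 'D'])]  # queue 0 is low on jobs

def maybe_steal(source_id, queues, threshold=2):
    """Peek at a random other queue; if it is notably fuller, steal one job."""
    target_id = random.choice([i for i in range(len(queues)) if i != source_id])
    source, target = queues[source_id], queues[target_id]
    if len(target) >= len(source) + threshold:
        source.append(target.pop())   # steal from the tail of the target
        return True
    return False

maybe_steal(0, queues)
print([list(q) for q in queues])  # [['D'], ['A', 'B', 'C']]
```

Probing a random queue only occasionally keeps the common (non-stealing) path lock-free on other CPUs' queues, which is the balance the slide describes.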
Linux Multiprocessor Schedulers
• Both approaches can be successful
• O(1) scheduler (multi-queue)
• Completely Fair Scheduler (CFS) (multi-queue)
• BF Scheduler (BFS): uses a single queue
An Analysis of Linux Scalability to
Many Cores
• This paper asks whether traditional kernel designs
can be used and implemented in a way that allows
applications to scale
Amdahl's Law
• N: the number of threads of execution
• B: the fraction of the algorithm that is strictly serial
• the theoretical speedup:
speedup(N) = 1 / (B + (1 − B) / N)
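Amdahl's law is easy to evaluate directly; a small sketch, where the 10% serial fraction and 48-core count are illustrative numbers (48 cores matches the machine used in the paper discussed next):

```python
def speedup(n_threads, serial_fraction):
    """Amdahl's law: theoretical speedup with N threads when a
    fraction B of the work is strictly serial."""
    B = serial_fraction
    return 1.0 / (B + (1.0 - B) / n_threads)

# Even with only 10% serial work, 48 cores give well under 10x:
print(round(speedup(48, 0.10), 2))   # 8.42
# With no serial work at all, speedup is ideal:
print(speedup(8, 0.0))               # 8.0
```

The takeaway for kernel scalability: even small serial sections (e.g. code under a global lock) cap the achievable speedup, no matter how many cores are added.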
Scalability Issues
• Global lock used for a shared data structure
• longer lock wait times
• Shared memory locations
• overhead caused by the cache coherence algorithms
• Tasks compete for a limited-size shared hardware cache
• increased cache miss rates
• Tasks compete for shared hardware resources (interconnects, DRAM interfaces)
• more time wasted waiting
• Too few available tasks
• less efficiency
How to avoid/fix
• These issues can often be avoided (or limited) using popular parallel-programming techniques:
• Lock-free algorithms
• Per-core data structures
• Fine-grained locking
• Cache alignment
• Sloppy counters
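A sketch of the sloppy-counter idea, which combines per-core data structures with reduced locking: each thread buffers increments locally and only grabs the shared lock once every `threshold` updates. The class and parameter names are illustrative, not from the paper:

```python
import threading

class SloppyCounter:
    def __init__(self, threshold=1024):
        self.global_count = 0
        self.global_lock = threading.Lock()
        self.threshold = threshold
        self.local = threading.local()     # per-thread buffered count

    def increment(self):
        count = getattr(self.local, 'count', 0) + 1
        if count >= self.threshold:
            with self.global_lock:         # rare: amortized over `threshold` updates
                self.global_count += count
            count = 0
        self.local.count = count

    def read(self):
        # Approximate: ignores increments still buffered in running threads.
        with self.global_lock:
            return self.global_count

c = SloppyCounter(threshold=100)

def worker():
    for _ in range(1000):
        c.increment()

threads = [threading.Thread(target=worker) for _ in range(4)]
for t in threads: t.start()
for t in threads: t.join()
print(c.read())   # 4000 here, since each worker's 1000 updates flush evenly
```

The trade-off is accuracy for scalability: reads may lag the true count by up to `threshold` per thread, but contention on the global lock drops by the same factor.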
Current bottlenecks
• https://www.usenix.org/conference/osdi10/analysis-linux-scalability-many-cores