Lecture 27: Multiprocessor Scheduling

• Last lecture: VMM
  • Two old problems: CPU virtualization and memory virtualization
  • I/O virtualization
• Today
  • Issues related to multi-core: scheduling and scalability

The cache coherence problem

• Since we have multiple private caches: how do we keep the data consistent across caches?
• Each core should perceive memory as a single monolithic array, shared by all the cores

[Diagram: a multi-core chip with four cores, each with one or more levels of private cache, connected to main memory. Initially every cache and main memory hold x=15213.]

• Assuming write-back caches: core 1 writes x=21660 into its own cache, but main memory and core 2's cache still hold the stale value x=15213
• Assuming write-through caches: core 1's write also updates main memory to x=21660, but core 2's cache still holds the stale value x=15213

Solutions for cache coherence

• Many solution algorithms, coherence protocols, etc. exist
• A simple solution: an invalidation protocol with bus snooping, using an inter-core bus that connects the cores' caches

Invalidation protocol with snooping

• Invalidation: if a core writes to a data item, all other copies of this data item in other caches are invalidated
• Snooping: all cores continuously "snoop" (monitor) the bus connecting the cores
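The invalidation protocol above can be sketched as a toy software model (the `Bus` and `Cache` classes and their methods are hypothetical names for illustration; real coherence protocols such as MSI/MESI run in hardware and track per-line states, which this sketch omits):

```python
# Toy model of an invalidation protocol with bus snooping:
# a write by one core invalidates every other core's copy of that line.
class Bus:
    def __init__(self):
        self.caches = []

    def attach(self, cache):
        self.caches.append(cache)

    def broadcast_invalidate(self, writer, addr):
        # Every other cache "snoops" the bus and drops its copy of addr.
        for c in self.caches:
            if c is not writer:
                c.lines.pop(addr, None)

class Cache:
    def __init__(self, bus, memory):
        self.lines = {}          # addr -> value (this core's private copy)
        self.bus = bus
        self.memory = memory
        bus.attach(self)

    def read(self, addr):
        if addr not in self.lines:        # miss: fetch from main memory
            self.lines[addr] = self.memory[addr]
        return self.lines[addr]

    def write(self, addr, value):         # write-through, for simplicity
        self.bus.broadcast_invalidate(self, addr)
        self.lines[addr] = value
        self.memory[addr] = value

memory = {"x": 15213}
bus = Bus()
c1, c2 = Cache(bus, memory), Cache(bus, memory)
c1.read("x"); c2.read("x")   # both caches now hold x=15213
c1.write("x", 21660)         # c2's stale copy is invalidated via the bus
print(c2.read("x"))          # c2 misses and re-fetches the fresh value
```

On core 2's next read, the miss forces a re-fetch, so it sees 21660 rather than the stale 15213 — exactly the write-through invalidation sequence shown in the diagrams.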
[Diagram: invalidation in action, assuming write-through caches. Both caches hold x=15213. Core 1 writes x=21660: its cache sends an invalidation request over the bus, core 2's copy is marked INVALIDATED, and main memory is updated to x=21660. When core 2 next reads x, it misses and fetches the fresh value x=21660.]

Alternative to the invalidation protocol: the update protocol

[Diagram: update protocol, assuming write-through caches. Core 1 writes x=21660 and broadcasts the updated value over the bus; core 2's cached copy and main memory are both updated in place to x=21660.]

Invalidation vs. update

• Multiple writes to the same location
  • invalidation: bus traffic only on the first write
  • update: must broadcast each write (which includes the new variable value)
• Invalidation generally performs better: it generates less bus traffic

Programmers still need to worry about concurrency

• Mutexes
• Condition variables
• Lock-free data structures

Single-queue multiprocessor scheduling (SQMS)

• Reuse the basic framework of single-processor scheduling
• Put all jobs that need to be scheduled into a single queue
• If there are two CPUs, pick the best two jobs to run
• Advantage: simple
• Disadvantage: does not scale

SQMS and cache affinity
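The single-queue idea can be sketched as a toy model (the `schedule` function is a hypothetical name for illustration; it ignores priorities and just pulls the front jobs from one shared run queue):

```python
from collections import deque

# Toy SQMS: all runnable jobs sit in ONE shared queue, and each CPU
# takes the next job from the front. In a real kernel this queue would
# need a lock, and contention on that lock is why SQMS does not scale.
run_queue = deque(["A", "B", "C", "D"])

def schedule(num_cpus):
    # Pick the first num_cpus jobs to run, one per CPU.
    return [run_queue.popleft() for _ in range(min(num_cpus, len(run_queue)))]

running = schedule(2)   # with two CPUs, jobs A and B run first
print(running)
```

Note also that nothing ties a job to the CPU it last ran on, which is the cache-affinity problem discussed next.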
Cache affinity

• Thread migration is costly
  • The execution pipeline must be restarted
  • Cached data is invalidated
• The OS scheduler tries to avoid migration as much as possible: it tends to keep a thread on the same core

Multi-queue multiprocessor scheduling (MQMS)

• Scalable
• Preserves cache affinity

Load imbalance

• Fixed by migration of jobs between queues

Work stealing

• A (source) queue that is low on jobs will occasionally peek at another (target) queue
• If the target queue is (notably) fuller than the source queue, the source will "steal" one or more jobs from the target to help balance load
• A queue cannot look around at other queues too often, or the checking itself becomes a source of overhead

Linux multiprocessor schedulers

• Both approaches can be successful
• O(1) scheduler
• Completely Fair Scheduler (CFS)
• BF Scheduler (BFS), which uses a single queue

An Analysis of Linux Scalability to Many Cores

• This paper asks whether traditional kernel designs can be used and implemented in a way that allows applications to scale

Amdahl's law

• N: the number of threads of execution
• B: the fraction of the algorithm that is strictly serial
• The theoretical speedup: S(N) = 1 / (B + (1 − B)/N)

Scalability issues

• A global lock used for a shared data structure
  • longer lock wait times
• Shared memory locations
  • overhead caused by the cache coherence algorithms
• Tasks compete for the limited-size shared hardware cache
  • increased cache miss rates
• Tasks compete for shared hardware resources (interconnects, DRAM interfaces)
  • more time wasted waiting
• Too few available tasks
  • less efficiency

How to avoid/fix these issues

• These issues can often be avoided (or limited) using popular parallel-programming techniques:
  • Lock-free algorithms
  • Per-core data structures
  • Fine-grained locking
  • Cache alignment
  • Sloppy counters

Current bottlenecks

• https://www.usenix.org/conference/osdi10/analysislinux-scalability-many-cores
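The sloppy-counter technique listed above can be sketched as a toy model (a simplified, hypothetical version: each core increments a private local counter under its own lock and only folds it into the shared global counter once the local count reaches a threshold, so cores rarely contend on the global lock):

```python
import threading

# Sketch of a sloppy counter: per-core local counts are periodically
# flushed into a global count, trading accuracy for less lock contention.
class SloppyCounter:
    def __init__(self, num_cores, threshold=5):
        self.global_count = 0
        self.global_lock = threading.Lock()
        self.local_counts = [0] * num_cores
        self.local_locks = [threading.Lock() for _ in range(num_cores)]
        self.threshold = threshold

    def increment(self, core):
        with self.local_locks[core]:           # contention only per core
            self.local_counts[core] += 1
            if self.local_counts[core] >= self.threshold:
                with self.global_lock:         # rare: one flush per threshold
                    self.global_count += self.local_counts[core]
                self.local_counts[core] = 0

    def get(self):
        # "Sloppy" value: ignores local counts not yet flushed.
        with self.global_lock:
            return self.global_count

c = SloppyCounter(num_cores=2, threshold=5)
for _ in range(12):
    c.increment(core=0)
print(c.get())   # two flushes of 5 have reached the global count
```

A larger threshold means less global-lock traffic but a staler (sloppier) value on reads — the accuracy/scalability trade-off the paper explores.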