
Low Latency Scheduling Algorithm for Shared
Memory Communications over Optical Networks
Muhammad Ridwan Madarbux, Anouk Van Laer,
Philip M. Watts
Electronic and Electrical Engineering Department
University College London
MOTIVATION
•  Scaling chip multiprocessors (CMPs) increases thermal issues
   –  Negative impact on performance
•  Photonic NoCs have been shown to have lower power consumption
[Figure: Chip multiprocessor (core, L1, shared L2 and network interface per tile) connected to an optical plane of switches]

MOTIVATION
•  Latency in switched photonic networks is dominated by scheduling
   –  Request, arbitration and grant
•  Scheduling generates a significant overhead in shared memory systems
   –  8B control messages
   –  16-256B data messages

[Figure: Request, arbitration, grant and data timing between source core, arbiter and destination core]
CIRCUIT SWITCHING FOR SHARED MEMORY
•  Reduce scheduling latency by circuit switching for large flows of data between two cores
•  Maximise the proportion of messages on circuits to minimise contention and improve latency
   –  Latency reduces for circuit periods >100 clock cycles
•  Backup network required
This work proposes a new scheduling algorithm which intelligently uses information from the cache hierarchy to set up optical circuits
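As a minimal sketch of how such an allocator could work (the message types, fields and pattern-length rule below are illustrative assumptions, not the poster's implementation): on the first request of a coherence transaction, the cache hierarchy already knows how many messages the pattern will contain, so the circuit can be reserved once for the whole exchange instead of arbitrating per message.

    #include <cstdint>
    #include <cstdio>

    // Hypothetical view of a coherence message at the path allocator.
    enum class MsgType { Request, Response, Unblock };

    struct CircuitDecision {
        bool     holdCircuit;       // keep the optical path open after this message
        uint32_t remainingMessages; // messages still expected on this circuit
    };

    // On the first Request of a pattern, reserve the circuit for the whole
    // pattern: 3 messages, or 5 when another L1 currently caches the address
    // (the two common MESI patterns shown later on this poster).
    CircuitDecision onMessage(MsgType msg, bool remoteL1CachesAddress) {
        if (msg == MsgType::Request) {
            uint32_t patternLength = remoteL1CachesAddress ? 5u : 3u;
            return {true, patternLength - 1};
        }
        // Later messages of the pattern ride the already-established circuit.
        return {false, 0};
    }

    int main() {
        CircuitDecision d = onMessage(MsgType::Request, /*remoteL1CachesAddress=*/true);
        std::printf("hold circuit: %d, messages still to come: %u\n",
                    d.holdCircuit, d.remainingMessages);
    }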
SIMULATION PARAMETERS
•  Benchmark: PARSEC benchmark suite
   –  Blackscholes – Calculates the price of options
   –  Canneal – Optimization of routing in chip design
   –  Dedup – Compression using data deduplication
   –  Ferret – Content similarity search server
   –  Fluidanimate – Simulates fluids for animation
   –  Freqmine – Data mining
   –  Streamcluster – Online clustering of input streams
   –  Swaptions – Calculates the price of swaptions (Monte Carlo)
   –  Vips – Image processing
   –  X264 – Video encoder
•  Operating system
•  Kernel
•  CPU: 32 cores @ 1.2 GHz
•  Caches:
   –  Private L1 = 16 kB
   –  Shared L2 = 1 MB in total (distributed among the tiles)
   –  Cacheline size = 64B
   –  Cache coherence protocol = MESI
•  Memory
•  Interconnection network
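For reference, the parameters above can be collected into a plain record; the struct below is purely illustrative and captures only the values listed in this table.

    #include <cstdint>

    // Illustrative record of the simulated system. Only the values listed
    // above are captured; OS, kernel, memory and network details are not
    // shown on the poster.
    struct SimConfig {
        uint32_t    cores          = 32;     // CPU cores
        double      clockGHz       = 1.2;    // core clock frequency
        uint32_t    l1SizeKB       = 16;     // private L1 per core
        uint32_t    l2TotalMB      = 1;      // shared L2, distributed among tiles
        uint32_t    cachelineBytes = 64;     // cacheline size
        const char* coherence      = "MESI"; // cache coherence protocol
    };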
COMMON COMMUNICATION PATTERNS IN THE
MESI COHERENCE PROTOCOL
[Figure: Message sequence diagrams for two common store-request patterns. 3-message pattern: the requesting L1 sends a Request to the L2 bank, receives a Response, and finishes with an Unblock. 5-message pattern: the requesting L1 sends a Request to the L2 bank, which forwards the Request to the L1* currently caching the address; the Response goes from L1* to the requesting L1, which then sends an Unblock to the L2 bank.]
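Read as hop sequences, the two diagrams give the allocator exactly the information it needs; the sketch below (an assumed helper, with message names taken from the figure) enumerates them. One message of the 5-message pattern is not legible in the figure and is left out.

    #include <string>
    #include <vector>

    // One hop of a coherence pattern, named as in the diagrams above.
    struct Hop {
        std::string from, to, msg;
    };

    // Hop sequences for the two common store-request patterns. The
    // 5-message case is reproduced only as far as the figure shows it.
    std::vector<Hop> storePattern(bool remoteL1CachesAddress) {
        if (!remoteL1CachesAddress) {
            // 3-message pattern: requesting L1 <-> home L2 bank.
            return {{"L1", "L2", "Request"},
                    {"L2", "L1", "Response"},
                    {"L1", "L2", "Unblock"}};
        }
        // 5-message pattern: the L2 bank forwards the request to the L1*
        // that currently caches the address (one message omitted here).
        return {{"L1", "L2", "Request"},
                {"L2", "L1*", "Request"},
                {"L1*", "L1", "Response"},
                {"L1", "L2", "Unblock"}};
    }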
CACHE COHERENCE PROTOCOL BASED COMMUNICATION PATTERNS

[Figure: Communication patterns derived from the cache coherence protocol]

LATENCY IMPROVEMENTS
[Figure: Timing diagrams at the path allocator and optical switch comparing arbitration per message with arbitration per pattern. With arbitration per message, each of the REQ, RES and UNB messages pays its own request-arbitration-grant cycle; with arbitration per pattern, a single request-arbitration-grant cycle sets up the circuit for the whole pattern, giving 67% overhead savings.]
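As a back-of-the-envelope check of the 67% figure (the unit-cost model is an assumption, not the poster's measurement): per-message arbitration pays one request-arbitration-grant cycle for each of the three messages in a pattern, while per-pattern arbitration pays it once.

    #include <cstdio>

    int main() {
        // Count each request-arbitration-grant exchange as one unit of
        // scheduling overhead; wire delays and message sizes are ignored.
        const int messagesPerPattern = 3;  // e.g. Request, Response, Unblock

        const int overheadPerMessage = messagesPerPattern; // arbitrate every message
        const int overheadPerPattern = 1;                  // arbitrate once per pattern

        double savings = 100.0 * (overheadPerMessage - overheadPerPattern)
                               / overheadPerMessage;
        std::printf("scheduling overhead savings: %.0f%%\n", savings); // prints 67%
    }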
LATENCY IMPROVEMENTS
[Figure: Time taken per pattern and head latency per message across the PARSEC benchmarks (BL – Blackscholes, CA – Canneal, DE – Dedup, FE – Ferret, FL – Fluidanimate, FR – Freqmine, ST – Streamcluster, SW – Swaptions, VI – Vips, X2 – X264); time taken per pattern improves by up to 70.6%]
EFFECT OF CONTENTION
[Figure: Effect of contention across the PARSEC benchmarks (abbreviations as above); annotated values: 4x and < 2%]
CONCLUSION
•  We have proposed an algorithm that intelligently uses information from the cache hierarchy to set up optical paths
•  The algorithm provides significant latency reductions of up to 70.6% for Swaptions
•  Results shown are for on-chip networks; larger networks with longer time of flight will benefit more from this algorithm
   –  Examples: multiple-socket servers or rack-scale networks
FUTURE WORK
•  Considering the implications of this algorithm for the energy consumption of the allocator and control circuits
•  Evaluating the performance of the algorithm under heavier traffic loads, given that the PARSEC benchmarks only lightly load the network
•  Measuring the performance improvements in full-system gem5 simulation
THANK YOU FOR YOUR ATTENTION