COST_2012_Florea_Research_Presentation

Energy saving in multicore architectures
• Anticipatory Techniques in Advanced Processor
Architectures (superscalar, SMT)
• An Automatic Design Space Exploration Framework
for Multicore Architecture Optimizations
Assoc. Prof. Adrian FLOREA, PhD
http://webspace.ulbsibiu.ro/adrian.florea/html/
Prof. Lucian VINTAN, PhD – Research chair
Lecturer Arpad GELLERT, PhD
Horia CALBOREAN, PhD
Advanced Computer Architecture & Processing
Systems Research Lab http://acaps.ulbsibiu.ro/index.php/en/
Computing hardware
14 Intel Compute nodes (2-processor HS21 blades with quad-core Intel Xeon)
2 Cell Compute nodes (2-processor QS22 blades with IBM PowerXCell 8i processors)
Anticipatory Techniques in Advanced Processor
Architectures (superscalar, SMT)
Issue Bottleneck (Data-flow)
Conventional processing models are limited in their processing speed
by the dynamic program’s critical path (Amdahl).
2 Solutions
• Dynamic Instruction Reuse (DIR): a non-speculative technique.
• Value Prediction (VP): a speculative technique.
Common issue
• Value locality
Challenges
• Selective Instruction Reuse (MUL & DIV)
• Selective Load Value Prediction (“Critical Loads”)
• Exploiting IR & VP in a Superscalar / Simultaneous Multithreaded (SMT) architecture to anticipate the results of long-latency instructions
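As a minimal sketch of selective instruction reuse, the Python model below keeps a small reuse buffer for MUL and DIV only, indexed by opcode and operand values. The class name `ReuseBuffer`, the FIFO replacement policy, the buffer capacity, and the instruction trace are all illustrative assumptions, not the scheme evaluated in this work.

```python
from collections import OrderedDict

# Assumption: only long-latency opcodes are candidates for reuse.
LONG_LATENCY_OPS = {"mul", "div"}


class ReuseBuffer:
    """Toy reuse buffer keyed by (opcode, operand values)."""

    def __init__(self, capacity=4):
        self.capacity = capacity
        self.entries = OrderedDict()
        self.hits = 0
        self.misses = 0

    def execute(self, op, a, b):
        if op not in LONG_LATENCY_OPS:
            return self._compute(op, a, b)  # short ops bypass the buffer
        key = (op, a, b)
        if key in self.entries:
            self.hits += 1
            return self.entries[key]  # non-speculative: same inputs, same result
        self.misses += 1
        result = self._compute(op, a, b)
        if len(self.entries) >= self.capacity:
            self.entries.popitem(last=False)  # evict the oldest entry (FIFO)
        self.entries[key] = result
        return result

    @staticmethod
    def _compute(op, a, b):
        if op == "add":
            return a + b
        if op == "mul":
            return a * b
        if op == "div":
            return a // b
        raise ValueError(f"unsupported opcode: {op}")


# Hypothetical instruction trace with repeated long-latency operations.
rb = ReuseBuffer()
trace = [("mul", 3, 7), ("add", 1, 2), ("mul", 3, 7), ("div", 42, 6), ("div", 42, 6)]
results = [rb.execute(*ins) for ins in trace]
```

Because a reused result comes from identical inputs, no recovery mechanism is needed here, which is the contrast with value prediction, where a wrong guess must be squashed.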
Exploiting Selective Value Prediction in Superscalar
and SMT Architectures



Traditional value prediction techniques are increasingly challenged on mobile, battery-operated devices because of their significant energy consumption.
This is essentially due to the on-chip memory required for computing the prediction and to the overall number of accesses to the predictor itself.
We introduce and analyze a selective value predictor that is triggered only during specific cache miss events.
Advantages:
• Reduces the overall number of accesses and the energy consumption of the on-chip memory and logic reserved for value speculation.
• Improves over traditional value predictors in terms of performance and energy consumption.
• Creates room for reducing the data-cache size while preserving performance, thus enabling a reduction of the system cost.
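The effect of triggering the predictor only on cache misses can be illustrated with a small Python model. This is a sketch under stated assumptions: a direct-mapped last-value table with 2-bit confidence counters stands in for the predictor, and the trace of `(pc, value, cache_miss)` tuples is hypothetical. It only counts predictor accesses; it does not model energy.

```python
class LastValuePredictor:
    """Direct-mapped last-value table with 2-bit saturating confidence counters."""

    def __init__(self, entries=16, threshold=2):
        self.entries = entries
        self.threshold = threshold
        self.table = {}  # index -> (last_value, confidence)
        self.accesses = 0
        self.correct = 0

    def access(self, pc, actual_value):
        """Predict for the load at `pc`, then train with the actual value."""
        self.accesses += 1
        idx = pc % self.entries
        last, conf = self.table.get(idx, (None, 0))
        # Only predict when the entry is confident enough.
        predicted = last if conf >= self.threshold else None
        if predicted is not None and predicted == actual_value:
            self.correct += 1
        # Train: strengthen on a repeated value, reset otherwise.
        conf = min(conf + 1, 3) if last == actual_value else 0
        self.table[idx] = (actual_value, conf)
        return predicted


def run(trace, selective):
    """Replay a load trace; in selective mode only cache misses touch the predictor."""
    lvp = LastValuePredictor()
    for pc, value, cache_miss in trace:
        if cache_miss or not selective:
            lvp.access(pc, value)
    return lvp


# Hypothetical trace: one load always misses and repeats its value,
# another always hits in the cache and never repeats its value.
trace = []
for i in range(8):
    trace.append((0x40, 5, True))
    trace.append((0x44, i, False))

always = run(trace, selective=False)
selective = run(trace, selective=True)
```

On this trace the selective variant makes half as many table accesses while producing the same number of correct predictions, because the cache-hitting load has no value locality to exploit.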
Tools, Metrics and some Results
The M-SIM Simulator
[Diagram: simulation flow. Blocks: Hardware Configuration, SPEC Benchmark, Cycle-Level Performance Simulator, Hardware Access Counts, Power Models, Performance Estimation, Power Estimation.]
Metrics:
\[ \mathrm{CPI}_{reduction} = \frac{\mathrm{CPI}_{base} - \mathrm{CPI}_{improved}}{\mathrm{CPI}_{base}} \cdot 100\ [\%] \]
\[ E = P_{Mean} \cdot \mathit{cycles} \]
\[ E_{reduction} = \frac{E_{base} - E_{improved}}{E_{base}} \cdot 100\ [\%] \]
[Figure: relative improvement (0% to 40%) versus number of LVPT entries (16, 32, 64, 128, 256, 512, 1024, 2048); series: INT - IPC, INT - ED, FP - IPC, FP - EDP.]
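The CPI-reduction and energy-reduction metrics used on this slide can be sketched as plain Python functions. The function names and all numeric inputs below are hypothetical and chosen only to show the arithmetic; they are not measured results.

```python
def cpi_reduction(cpi_base, cpi_improved):
    """Relative CPI reduction in percent."""
    return (cpi_base - cpi_improved) / cpi_base * 100.0


def energy(mean_power, cycles):
    """E = P_mean * cycles (multiply by the cycle time to obtain joules)."""
    return mean_power * cycles


def energy_reduction(e_base, e_improved):
    """Relative energy reduction in percent."""
    return (e_base - e_improved) / e_base * 100.0


# Hypothetical run: 500,000 instructions, same mean power in both configurations.
instructions = 500_000
cpi_base, cpi_improved = 2.0, 1.5
e_base = energy(20.0, cpi_base * instructions)
e_improved = energy(20.0, cpi_improved * instructions)

cpi_gain = cpi_reduction(cpi_base, cpi_improved)
e_gain = energy_reduction(e_base, e_improved)
```

With equal mean power, the energy reduction follows the cycle count directly; in practice the predictor also changes the mean power, so the two percentages generally differ.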

Design space exploration (DSE) of a Selective Load Value Prediction scheme suitable for energy-aware Simultaneous Multithreaded (SMT) architectures
[Figure panels: a) Superscalar, b) SMT]
Automatic Design Space Exploration Framework
for Multicore Architecture Optimizations

Multi-objective optimization of advanced computer architectures using experts’ domain knowledge



• HUGE design space (>19 parameters)
• M-SIM 2: 2.5 million billion configurations (on the order of 10^15)
• Manual design space exploration → impossible
Multi-objective optimization (processing performance, power consumption, integration area, thermal dissipation)
• The problem becomes even harder
Solution
• Heuristic algorithms (genetic algorithms, bio-inspired algorithms)
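Multi-objective heuristics such as genetic algorithms typically compare candidates by Pareto dominance rather than by a single score. A minimal Python sketch of that comparison follows, assuming two objectives to minimize (CPI and energy); the sample points are hypothetical and the quadratic filter is for illustration only.

```python
def dominates(a, b):
    """True if `a` is no worse than `b` in every objective and strictly
    better in at least one (all objectives are minimized)."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))


def pareto_front(points):
    """Keep only the points that no other point dominates."""
    return [p for p in points if not any(dominates(q, p) for q in points if q != p)]


# Hypothetical (CPI, energy) results for five configurations.
candidates = [
    (0.30, 4.0e10),
    (0.35, 2.0e10),
    (0.45, 1.0e10),
    (0.40, 3.0e10),  # dominated by (0.35, 2.0e10)
    (0.50, 1.5e10),  # dominated by (0.45, 1.0e10)
]
front = pareto_front(candidates)
```

The surviving points are the trade-offs a designer would actually choose between: none of them can improve one objective without worsening the other.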
Framework for Automatic Design Space Exploration
(FADSE) - http://code.google.com/p/fadse/
It must:
• Simulate many individuals (architectural configurations) → slow! (24 hours/generation on 96 cores; one generation = 100 individuals)
• Implement reliability mechanisms (bounded wait for clients, resending individuals, checkpointing, etc.)
Accelerating the process:
• Simulate fewer configurations (database integration with up to 67% reuse; only 2500 configurations evaluated)
• Parallelize (distributed evaluation)
• Add computer architecture domain knowledge (constraints, hierarchical parameters, fuzzy rules)
After 30 generations:
[Figure: CPI (0.25 to 0.5) versus Energy (7.00E+09 to 4.70E+10); series: Run without fuzzy, Run with fuzzy, Manual.]
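Two of the acceleration ideas listed for FADSE, reusing stored results and evaluating individuals in parallel, can be sketched together in Python. This is not FADSE code: `simulate` is a made-up stand-in for a simulator run, `CachedEvaluator` is a hypothetical name, an in-memory dictionary replaces the database, and a thread pool replaces the distributed clients.

```python
from concurrent.futures import ThreadPoolExecutor


def simulate(config):
    """Stand-in for a slow simulator run; returns a made-up (cpi, energy) pair."""
    cache_kb, width = config
    return (1.0 / width + 8.0 / cache_kb, float(cache_kb * width))


class CachedEvaluator:
    """Evaluates a generation, simulating only configurations not seen before."""

    def __init__(self, workers=4):
        self.db = {}  # config -> objectives (a real setup would use a database)
        self.workers = workers
        self.simulated = 0
        self.reused = 0

    def evaluate_generation(self, population):
        # Unique configurations that still need a simulation, in first-seen order.
        pending = list(dict.fromkeys(c for c in population if c not in self.db))
        self.simulated += len(pending)
        self.reused += len(population) - len(pending)
        # Run the missing simulations concurrently.
        with ThreadPoolExecutor(max_workers=self.workers) as pool:
            results = list(pool.map(simulate, pending))
        self.db.update(zip(pending, results))
        return [self.db[c] for c in population]


# Hypothetical generations of (cache size in KB, issue width) individuals.
ev = CachedEvaluator()
gen1 = ev.evaluate_generation([(32, 2), (64, 4), (32, 2)])
gen2 = ev.evaluate_generation([(64, 4), (128, 4), (32, 2)])
reuse_ratio = ev.reused / (ev.simulated + ev.reused)
```

Genetic algorithms tend to re-propose configurations that already performed well, which is why a result store can remove a large share of simulations as the population converges.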