Energy saving in multicore architectures
• Anticipatory Techniques in Advanced Processor Architectures (superscalar, SMT)
• An Automatic Design Space Exploration Framework for Multicore Architecture Optimizations

Assoc. Prof. Adrian FLOREA, PhD, http://webspace.ulbsibiu.ro/adrian.florea/html/
Prof. Lucian VINTAN, PhD – Research chair
Lecturer Arpad GELLERT, PhD
Horia CALBOREAN, PhD
Advanced Computer Architecture & Processing Systems Research Lab, http://acaps.ulbsibiu.ro/index.php/en/

Computing hardware
• 14 Intel compute nodes (2-processor HS21 blades with quad-core Intel Xeon)
• 2 Cell compute nodes (2-processor QS22 blades with IBM PowerXCell 8i processor)

Anticipatory Techniques in Advanced Processor Architectures (superscalar, SMT)
• Issue bottleneck (data-flow): conventional processing models are limited in their processing speed by the dynamic program’s critical path (Amdahl).
• 2 solutions:
  - Dynamic Instruction Reuse (DIR), a non-speculative technique
  - Value Prediction (VP), a speculative technique
• Common issue: value locality
• Challenges:
  - Selective Instruction Reuse (MUL & DIV)
  - Selective Load Value Prediction (“Critical Loads”)
  - Exploiting IR & VP in a Superscalar / Simultaneous Multithreaded (SMT) architecture to anticipate long-latency instructions
• Results

Exploiting Selective Value Prediction in Superscalar and SMT Architectures
Traditional value prediction techniques are increasingly challenged by mobile, battery-operated devices because of their significant energy consumption. This consumption stems mainly from the on-chip memory required to compute the prediction and from the overall number of accesses to the predictor itself.
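The access-count concern above can be illustrated with a minimal Python sketch of a last-value predictor. Everything here is an assumption made for illustration, not the predictor evaluated on these slides: the class name `LastValuePredictor`, the tagless direct-mapped table, the 256-entry size, and the small load trace are all hypothetical. The `selective` flag models consulting the table only on data-cache misses, so the two runs can be compared by how often the table is touched.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class LastValuePredictor:
    """Tagless, direct-mapped last-value table indexed by load PC (illustrative only)."""
    entries: int = 256        # hypothetical table size
    selective: bool = False   # True: consult the table only on data-cache misses
    table: Dict[int, int] = field(default_factory=dict)
    accesses: int = 0         # loads that touched the table (rough energy proxy)
    correct: int = 0          # predictions that matched the loaded value

    def observe_load(self, pc: int, value: int, cache_miss: bool) -> None:
        if self.selective and not cache_miss:
            return  # cache hit: the predictor is not accessed at all
        index = pc % self.entries
        self.accesses += 1
        if self.table.get(index) == value:
            self.correct += 1  # value locality: same value as the previous load
        self.table[index] = value  # remember the last value seen at this index


def run_trace(trace: List[Tuple[int, int, bool]], selective: bool) -> LastValuePredictor:
    """Feed a list of (load PC, loaded value, cache miss?) tuples to a fresh predictor."""
    predictor = LastValuePredictor(selective=selective)
    for pc, value, cache_miss in trace:
        predictor.observe_load(pc, value, cache_miss)
    return predictor


# Hypothetical trace: (load PC, loaded value, data-cache miss?)
TRACE = [(0x40, 7, True), (0x40, 7, False), (0x40, 7, True),
         (0x44, 1, False), (0x44, 2, False)]

for mode in (False, True):
    p = run_trace(TRACE, selective=mode)
    print(f"selective={mode}: accesses={p.accesses}, correct={p.correct}")
```

On this toy trace the miss-triggered variant touches the table far less often, at the price of fewer predictions. Whether that trade-off pays off on real workloads is exactly what a cycle-level simulation has to establish.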
We introduce and analyze a selective value predictor that is triggered only during specific cache miss events.
Advantages:
• Reduces the overall number of accesses and the energy consumption of the on-chip memory and logic reserved for value speculation.
• Improves over traditional value predictors in terms of performance and energy consumption.
• Creates room for a reduction of the data-cache size while preserving performance, thus enabling a reduction of the system cost.

Tools, Metrics and some Results
The M-SIM Simulator
[Diagram; components: Hardware Configuration, SPEC Benchmark, Cycle-Level Performance Simulator, Hardware Access Counts, Power Models, Performance Estimation, Power Estimation]
Metrics:
• $\mathrm{CPI}_{reduction} = \frac{\mathrm{CPI}_{base} - \mathrm{CPI}_{improved}}{\mathrm{CPI}_{base}} \cdot 100\ [\%]$
• $E = P_{Mean} \cdot \mathit{cycles}$
• $E_{reduction} = \frac{E_{base} - E_{improved}}{E_{base}} \cdot 100\ [\%]$
[Figure: relative gain (0% to 40%) versus number of LVPT entries (16 to 2048); series: INT - IPC, INT - EDP, FP - IPC, FP - EDP]

Design space exploration (DSE) of a Selective Load Value Prediction scheme suitable for energy-aware Simultaneous MultiThreaded (SMT) architectures
[Figure panels: a) Superscalar, b) SMT]

Automatic Design Space Exploration Framework for Multicore Architecture Optimizations
Multi-objective optimization of advanced computer architectures using experts’ domain knowledge
• HUGE design space (>19 parameters); M-SIM 2: 2.5 million billion configurations (on the order of $10^{15}$)
• Manual design space exploration is impossible
• With multi-objective optimization (processing performance, power consumption, integration area, thermal dissipation) the problem becomes even harder
• Solution: heuristic algorithms (genetic algorithms, bio-inspired algorithms)
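Multi-objective heuristics such as the genetic algorithms mentioned above typically rank configurations by Pareto dominance rather than by a single score. The following is a minimal sketch of that comparison for two objectives to be minimized (CPI and energy). The function names and the sample values are hypothetical and are not taken from the experiments on these slides.

```python
from typing import List, Sequence, Tuple


def dominates(a: Sequence[float], b: Sequence[float]) -> bool:
    """True if a is no worse than b in every objective and strictly better in one.

    All objectives are assumed to be minimized (e.g. CPI, energy).
    """
    no_worse = all(x <= y for x, y in zip(a, b))
    strictly_better = any(x < y for x, y in zip(a, b))
    return no_worse and strictly_better


def pareto_front(points: List[Tuple[float, ...]]) -> List[Tuple[float, ...]]:
    """Keep only the points that no other point dominates."""
    return [p for p in points if not any(dominates(q, p) for q in points)]


# Hypothetical (CPI, energy) results for five architectural configurations.
results = [
    (0.30, 4.0e10),
    (0.35, 2.0e10),
    (0.45, 1.0e10),
    (0.40, 3.0e10),  # worse than (0.35, 2.0e10) in both objectives
    (0.50, 1.5e10),  # worse than (0.45, 1.0e10) in both objectives
]

for cpi, energy in pareto_front(results):
    print(f"CPI={cpi:.2f}, energy={energy:.2e}")
```

The surviving points form the trade-off curve a designer would choose from. This quadratic filter is fine for a few hundred individuals per generation, but a real framework would likely use a faster non-dominated sorting routine.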
Framework for Automatic Design Space Exploration (FADSE) - http://code.google.com/p/fadse/
It must:
• Simulate many individuals (architectural configurations). Slow! (24 hours/generation on 96 cores, one generation = 100 individuals)
• Implement reliability mechanisms (bounded wait for client, resending individuals, checkpointing, etc.)
Accelerating the process:
• Simulate fewer configurations (database integration with up to 67% reuse; evaluate only 2500 configurations)
• Parallelize (distributed evaluation)
• Add computer architecture domain knowledge (constraints, hierarchical parameters, fuzzy rules)
[Figure: CPI (0.25 to 0.5) versus Energy (7.00E+09 to 4.70E+10) after 30 generations; series: Run without fuzzy, Run with fuzzy, Manual]
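The database reuse listed under "Accelerating the process" can be sketched as a lookup keyed by the configuration, so that an individual that reappears in a later generation is not simulated again. This is a simplified illustration and not FADSE's actual API: the class `ReusingEvaluator`, the in-memory dictionary standing in for the database, and the stand-in `fake_simulate` function are all assumptions.

```python
from typing import Callable, Dict, Tuple

Config = Dict[str, int]
Objectives = Tuple[float, float]  # (CPI, energy), hypothetical objective pair


class ReusingEvaluator:
    """Caches simulation results so repeated configurations are not re-simulated."""

    def __init__(self, simulate: Callable[[Config], Objectives]) -> None:
        self._simulate = simulate
        self._db: Dict[Tuple[Tuple[str, int], ...], Objectives] = {}  # stand-in for a database
        self.requests = 0      # evaluations requested by the search algorithm
        self.simulations = 0   # evaluations that actually ran the simulator

    def evaluate(self, config: Config) -> Objectives:
        key = tuple(sorted(config.items()))  # order-independent, hashable key
        self.requests += 1
        if key not in self._db:
            self._db[key] = self._simulate(config)
            self.simulations += 1
        return self._db[key]

    @property
    def reuse_ratio(self) -> float:
        """Fraction of requests answered from stored results."""
        if self.requests == 0:
            return 0.0
        return 1.0 - self.simulations / self.requests


def fake_simulate(config: Config) -> Objectives:
    """Stand-in for a slow cycle-level simulation; the formulas are arbitrary."""
    cpi = 1.0 / config["issue_width"] + 0.1
    energy = 1.0e9 * config["issue_width"] * config["lvpt_entries"] ** 0.5
    return cpi, energy


evaluator = ReusingEvaluator(fake_simulate)
generation = [
    {"issue_width": 4, "lvpt_entries": 256},
    {"issue_width": 8, "lvpt_entries": 64},
    {"lvpt_entries": 256, "issue_width": 4},  # same individual as the first one
]
for individual in generation:
    evaluator.evaluate(individual)
print(f"simulated {evaluator.simulations} of {evaluator.requests} requests")
```

Because offspring in later generations often duplicate earlier individuals, a cache of this kind is one plausible way for a large share of evaluations to be served without running the simulator.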