NePSim: A Network Processor Simulator with Power Evaluation Framework
IEEE Micro, Sept/Oct 2004
Source code at http://www.cs.ucr.edu/~yluo/nepsim
Yan Luo, Jun Yang, Laxmi N. Bhuyan, Li Zhao
Computer Science & Engineering, University of California at Riverside
11/13/2015, Laxmi N. Bhuyan, University of California, Riverside

NP Architecture Design - Research Goals
  NPs provide performance and programmability for packet processing without O/S overhead.
  Future design of NPs (with hardware accelerators) should be based on accurate estimation of execution performance gain instead of intuition.
  Power consumption of NPs is becoming a big concern. Techniques are needed to save power when traffic is low.
  There is a need for an open-source, execution-driven simulator that can be used to explore architectural modifications for improved performance and power consumption.

NP Simulation Tools
  Intel IXA SDK
    + accuracy, visualization
    - closed-source, low speed, inflexibility, no power model; can't incorporate new hardware designs
  SimpleScalar
    + open-source, popular, power model (Wattch)
    - uniprocessor architecture; disparity with real NP
  NePSim
    + open-source, real NP, power model, accuracy
    - currently targets IXP1200; 2400 under development
  About 60 software downloads and 300 hits at http://www.cs.ucr.edu/~yluo/nepsim/

Objectives of NePSim
  An open-source simulator for a real NP (Intel® IXP1200, later IXP2400/2800…)
  Cycle-level accuracy of performance simulation
  Flexibility for users to add new instructions and functional units
  Integrated power model to enable power dissipation simulation and optimization
  Extensibility for future NP architectures
  Faster simulation compared to SDK

NePSim Software Architecture
  ME core (ME ISA, 5-stage pipeline, GPR, xfer registers)
  SRAM and SDRAM (data memory and controller with command queues)
  FBI unit (ixbus and CSRs)
  Device (network interface with in/out buffer, packet streams)
  Dlite (a lightweight debugger)
  Stats (collection of statistics data)
  Traffic generator, program parser and loader

NePSim Overview
  [Figure: NePSim tool flow. Box labels: ME C program, ME C compiler, compiler-generated microcode, microcode program, microcode assembler (SDK), parser, internal format, NePSim, stats, results, NePSim source code, host C compiler]

NePSim Internals (I)
  Instruction
    Opcode: ALU, memory ref., CSR access etc.
    Operands: GPR, XFER, immed
    Shift: shift amount
    Optional token: ctx_swap, ind_ref, …
  Command (for memory, fbi accesses)
    Opcode: sram_read, sram_write, sdram_read, …
    Thread id: ME, thread
    Functional unit: sram, sdram, scratchpad, fbi
    Address: source or destination address
    Reg: source or destination XFER register
    Optional token: ctx_swap, ind_ref, …
  Event
    <cycle time, command>

NePSim Internals (II)
  ME pipeline stages
    P0: Inst. lookup
    P1: Inst. decode
    P2: Read operand
    P3: ALU, gen mem addr
    P4: Retire, gen mem command
  [Figure: ME pipeline, arbiters, SRAM controller, SDRAM controller and event queue; sleeping threads are woken up from the event queue. Legend: I = instruction, C = command, E = event]
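
The command and event records above, together with the event queue that wakes up sleeping threads, can be sketched in a few lines. This is a minimal illustration, not NePSim's actual code: the class names, the `run` helper and the fixed per-unit latencies in `LATENCY` are assumptions made for the example (the real controllers use command queues and arbitration).

```python
import heapq
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Command:
    """Memory/FBI command with the fields listed on the Internals (I) slide."""
    opcode: str          # e.g. "sram_read", "sdram_read"
    me: int              # thread id, part 1: microengine number
    thread: int          # thread id, part 2: thread within the ME
    unit: str            # functional unit: sram, sdram, scratchpad, fbi
    address: int         # source or destination address
    reg: int             # source or destination XFER register
    tokens: tuple = ()   # optional tokens such as "ctx_swap"


@dataclass(order=True)
class Event:
    """Event <cycle time, command>; ordered by cycle, then insertion order."""
    cycle: int
    seq: int
    command: Command = field(compare=False)


# Hypothetical fixed completion latencies in cycles (illustrative only).
LATENCY = {"sram": 20, "sdram": 40, "scratchpad": 12, "fbi": 15}


class EventQueue:
    def __init__(self):
        self._heap = []
        self._seq = 0

    def schedule(self, now, cmd):
        """Queue the completion event of cmd issued at cycle `now`."""
        ev = Event(now + LATENCY[cmd.unit], self._seq, cmd)
        self._seq += 1
        heapq.heappush(self._heap, ev)
        return ev

    def pop_due(self, now):
        """Return all events whose cycle time has been reached."""
        due = []
        while self._heap and self._heap[0].cycle <= now:
            due.append(heapq.heappop(self._heap))
        return due


def run(commands, max_cycle=200):
    """commands: list of (issue_cycle, Command).
    Returns (wake_cycle, me, thread) for each thread that slept on ctx_swap."""
    queue = EventQueue()
    pending = sorted(commands, key=lambda c: c[0])
    sleeping, wakeups, i = set(), [], 0
    for cycle in range(max_cycle):
        while i < len(pending) and pending[i][0] <= cycle:
            cmd = pending[i][1]
            queue.schedule(cycle, cmd)
            if "ctx_swap" in cmd.tokens:      # thread yields until completion
                sleeping.add((cmd.me, cmd.thread))
            i += 1
        for ev in queue.pop_due(cycle):       # completion wakes the thread
            key = (ev.command.me, ev.command.thread)
            if key in sleeping:
                sleeping.discard(key)
                wakeups.append((ev.cycle, *key))
    return wakeups
```

For example, issuing an `sram_read` with `ctx_swap` from ME 0, thread 1 at cycle 0 puts that thread to sleep and wakes it when the completion event fires, 20 cycles later under the assumed SRAM latency.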

NePSim parameters
  Some of the simulation parameters of NePSim:
    -d           enable debug messages
    -I           start simulation in Dlite debugger mode
    -proc:speed  processor speed in MHz
    -vdd         default chip power supply
    -me0         program for microengine 0
    -script      script file used for initialization
    -strmconf    stream config file of packet traces to network devices
    -max:cycle   maximum number of cycles to execute
    -indstrm     mark packet stream file as indefinitely repeated
    -power       flag to enable power calculations

Dlite Debugger
  Similar to Dlite in SimpleScalar
  Run simulation in debug mode
    Set/delete breakpoints
    Step into pipeline execution, check ALU condition code
    Examine threads' PC, status etc.
    Examine register contents
    Examine memory (sram, sdram) contents

IVERI - a verification tool
  Verifying NePSim against the IXP1200 is not a trivial task
    Multiple MEs, threads, memory units
    Huge number of events (pipeline and memory) generated during simulation
    Errors have to be pinpointed by scanning huge log traces
  IVERI tool
    Assertion checking based on Linear Temporal Logic (LTL) and Logic of Constraint (LOC)
    Log architectural events in both NePSim and the IXP1200 (with SDK): <cycle, PC, alu_out, address, event_type>
      event_type can be pipeline, sram_enq, sdram_deq etc.
    Use LOC assertions to specify performance requirements
      E.g. "the execution time of an instruction in the pipeline of NePSim is no more than D cycles away from the execution time of the same instruction in IXP1200":
      PC(pipeline(I)) == PC(pipeline_IXP(I)) ∧ |cycle(pipeline(I)) - cycle(pipeline_IXP(I))| <= D
    IVERI generates verification code based on the assertions
    The verification code scans the log traces and reports the number and location of constraint violations.
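
The example LOC assertion above amounts to pairing the I-th pipeline event of each log and checking the PC and cycle fields. The following is a hand-written sketch of such a check, assuming each log is a list of <cycle, PC, alu_out, address, event_type> tuples; it is not the code IVERI generates, and `LogEvent` and `check_pipeline_assertion` are hypothetical names.

```python
from typing import NamedTuple


class LogEvent(NamedTuple):
    """One logged architectural event: <cycle, PC, alu_out, address, event_type>."""
    cycle: int
    pc: int
    alu_out: int
    address: int
    event_type: str


def check_pipeline_assertion(nepsim_log, ixp_log, d):
    """Check, for the I-th pipeline event of each log, that
    PC(pipeline(I)) == PC(pipeline_IXP(I)) and
    |cycle(pipeline(I)) - cycle(pipeline_IXP(I))| <= d.
    Returns a list of (I, reason) tuples, one per violation."""
    sim = [e for e in nepsim_log if e.event_type == "pipeline"]
    ref = [e for e in ixp_log if e.event_type == "pipeline"]
    violations = []
    for i, (s, r) in enumerate(zip(sim, ref)):
        if s.pc != r.pc:
            violations.append((i, "pc_mismatch"))
        elif abs(s.cycle - r.cycle) > d:
            violations.append((i, "cycle_skew"))
    return violations


# Tiny hand-made traces (illustrative values, not measured data).
nepsim = [
    LogEvent(10, 0x00, 1, 0, "pipeline"),
    LogEvent(11, 0x00, 0, 0x100, "sram_enq"),   # ignored: not a pipeline event
    LogEvent(12, 0x01, 2, 0, "pipeline"),
    LogEvent(30, 0x02, 3, 0, "pipeline"),
]
ixp = [
    LogEvent(10, 0x00, 1, 0, "pipeline"),
    LogEvent(13, 0x01, 2, 0, "pipeline"),
    LogEvent(20, 0x02, 3, 0, "pipeline"),
]
violations = check_pipeline_assertion(nepsim, ixp, d=2)
print(len(violations), violations)
```

The returned list gives both the number of violations (its length) and their location (the instance index I), which mirrors what the slide says the verification code reports.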

Performance Validation of NePSim
  Use IVERI to verify NePSim functionality against the IXP1200 with the Logic of Constraint (LOC) language
  Throughput and average latency are within 1% and 6% error, respectively.
  [Figures: Throughput; Average processing time]

Power Model
  H/W component                                   | Model type         | Tool   | Configurations
  ------------------------------------------------+--------------------+--------+-------------------------------------------------------
  GPR per ME                                      | Array              | XCacti | 2 64-entry files, 1 read/write port per file
  XFER per ME                                     | Array              | XCacti | 4 32-entry files, 1 read/write port per file
  Control register per ME                         | Array              | XCacti | 1 32-entry file, 1 read/write port
  Control store, scratchpad                       | Cache w/o tag path | XCacti | 4KB, 4 bytes per block, direct mapped, 10-bit address
  ALU, shifter                                    | ALU and shifter    | Wattch | 32-bit
  Command FIFO, command queue in controller, etc. | Array              | Wattch | See paper
  Command bus arbiter, context arbiter            | Matrix, rr arbiter | Orion  | See paper

Benchmarks
  ipfwdr
    IPv4 forwarding (header validation, trie-based lookup)
    Medium SRAM access
  url
    Examines payload for URL patterns, used in content-aware routing
    Heavy SDRAM access
  nat
    Network address translation
    Medium SRAM access
  md4
    Message digest (computes a 128-bit message "signature")
    Heavy computation and SDRAM access

Performance implications
  More MEs do not necessarily bring performance gain
    More MEs cause more memory contention
  ME idle time is abundant (up to 42%)
  A faster ME core results in more ME idle time with the same memory
  Non-optimal rcv/xmit configuration for NAT (the transmitting ME is a bottleneck)
  [Figures: Throughput vs number of MEs at 232MHz; Throughput vs number of MEs at 464MHz; Idle time vs ME/memory speed ratio]

Where does the Power Go?
  Power dissipation by rcv and xmit MEs is similar across benchmarks
    Transmitting MEs consume ~5% more than receiving MEs
  The ALU consumes significant power, ~45% (Wattch model)
  The control store uses ~28% (accessed almost every cycle)
  GPRs burn ~13%, shifter ~7%, static ~7%
  [Figures: Across MEs; Inside an ME]

Power efficiency observations
  Power consumption increases faster than performance
  More MEs/threads bring more idle time due to memory contention
  Reduce power consumption of MEs while they wait for memory accesses
  [Figures: Power and performance for url, ipfwdr, md4 and nat; Idle time vs # of MEs]

Dynamic Voltage Scaling in NPs
  During the ME idle time, all threads are in the "wait" state and the pipeline has no activity.
  Applying DVS while MEs are not very active can reduce the total power consumption substantially.
  DVS control scheme
    Observe the ME idle time (%) periodically.
    When idle > threshold, scale down the voltage and frequency (VF for short) by one step unless the minimum allowable VF is hit.
    When idle < threshold, scale up the VF by one step unless it is already at the maximum allowable value.

DVS Considerations
  Transition step: whether to use continuous or discrete changes in VF
    Use discrete VF steps
  Transition status: whether the ME is allowed to continue working during VF regulation
    Pause the ME while regulating VF
  Transition time between two different VF states
    10 us [Burd][Shang][Sidiropoulos]
  Transition logic complexity, i.e. the overhead of the control circuit that monitors and determines a transition

DVS Power-performance
  [Figure: Power and performance reduction by DVS]
  Initial VF = 1.3V, 600MHz
  DVS period: every 15K, 20K or 30K cycles, make a DVS decision to reduce or increase VF.
  Up to 17% power savings with less than 6% performance loss
  On average 8% power savings with <1% performance degradation

Ongoing and future work
  Extend NePSim to IXP2400/2800
  Dynamically shut down/activate MEs
  Dynamically allocate tasks on MEs
  Model SRAM and SDRAM module power
  Integrate StrongARM/XScale simulation
  …

NP-Based Projects at UCR
  Intel IXA Program: NP Architecture Lab
    Architecture research
    CS 162 assignments based on IXP 2400
  NSF: Design and Analysis of a Web Switch (Layer 5/7 switch) Using NP
    TCP splicing, load balancing, etc.
  Intel and UC Micro: Architectures to Accelerate Data Center Servers
    TCP and SSL offload to dedicated servers and NPs, XML servers, etc.
  Los Alamos National Lab: Intelligent NP-Based NIC Design for Clusters
    O/S bypass protocols, user-level communication, etc.

References
  [Burd] T. Burd and R. Brodersen, Design issues for dynamic voltage scaling, International Symposium on Low Power Electronics and Design, pp. 9-14, 2000.
  [Shang] L. Shang, L.-S. Peh, and N. K. Jha, Dynamic voltage scaling with links for power optimization of interconnection networks, The 9th International Symposium on High-Performance Computer Architecture, pp. 91-102, 2003.
  [Sidiropoulos] S. Sidiropoulos, D. Liu, J. Kim, G. Wei, and M. Horowitz, Adaptive bandwidth DLLs and PLLs using regulated supply CMOS buffers, IEEE Symposium on VLSI Circuits, pp. 124-127, 2000.
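
Supplementary sketch: the DVS control scheme described earlier (periodic observation of ME idle time, a single threshold, one-step VF changes bounded by minimum and maximum levels) can be written as a small controller. This is an illustration under stated assumptions, not the NePSim implementation: the `VF_STEPS` table is hypothetical except for its top entry (1.3V, 600MHz, the initial VF on the slides), the 10% threshold is an arbitrary example value, and the ME pause during the 10 us transition is not modeled.

```python
from dataclasses import dataclass

# Hypothetical discrete VF table as (volts, MHz), lowest to highest.
# Only the top entry (1.3 V, 600 MHz) is taken from the slides.
VF_STEPS = [(1.0, 400), (1.1, 450), (1.15, 500), (1.2, 550), (1.3, 600)]


@dataclass
class DvsController:
    threshold: float = 0.10          # idle fraction threshold (example value)
    period: int = 15_000             # cycles per DVS decision (15K, 20K or 30K)
    level: int = len(VF_STEPS) - 1   # start at the maximum VF

    def decide(self, idle_cycles: int) -> tuple:
        """Make one DVS decision from the idle cycles seen in the last period."""
        idle = idle_cycles / self.period
        if idle > self.threshold and self.level > 0:
            self.level -= 1          # mostly idle: step VF down
        elif idle < self.threshold and self.level < len(VF_STEPS) - 1:
            self.level += 1          # busy: step VF back up
        return VF_STEPS[self.level]


ctrl = DvsController()
trace = [ctrl.decide(idle) for idle in (3000, 3000, 0, 0, 0)]
print(trace)
```

With the assumed table, two idle-heavy periods step the ME down twice and three busy periods bring it back to the maximum, where it stays.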