NePSim: A Network Processor Simulator
with Power Evaluation Framework
IEEE Micro, Sept/Oct 2004 – Source code at
http://www.cs.ucr.edu/~yluo/nepsim
Yan Luo, Jun Yang, Laxmi N. Bhuyan, Li Zhao
Computer Science & Engineering
University of California at Riverside
NP Architecture Design - Research Goals




NPs provide performance and programmability for packet
processing without O/S overhead
Future design of NPs (with hardware accelerators) should
be based on accurate estimation of execution performance
gain instead of intuition
Power consumption of NPs is becoming a big concern.
Techniques are needed to save power when traffic is low.
Need for an open-source, execution-driven simulator that
can be used to explore architectural modifications for
improved performance and power consumption
11/13/2015
Laxmi N. Bhuyan
University of California, Riverside
2
NP Simulation Tools

Intel IXA SDK



+ accuracy, visualization
- closed-source, slow, inflexible, no power model;
cannot incorporate new hardware designs
SimpleScalar
+ open-source, popular, power model (Wattch)
- uniprocessor architecture, which differs from a real NP
NePSim
+ open-source, models a real NP, power model, accuracy
- currently targets the IXP1200; IXP2400 support is under development
About 60 software downloads and 300 hits at
http://www.cs.ucr.edu/~yluo/nepsim/


Objectives of NePSim






An open-source simulator for a real NP (Intel®
IXP1200, later IXP2400/2800…)
Cycle-level accuracy of performance simulation
Flexibility for users to add new instructions and
functional units
Integrated power model to enable power
dissipation simulation and optimization
Extensibility for future NP architectures
Faster simulation than the SDK
NePSim Software Architecture







ME core (ME ISA, 5-stage pipeline, GPR, xfer registers)
SRAM and SDRAM (data memory and controller with command queues)
FBI unit (ixbus and CSRs)
Device (network interface with in/out buffer, packet streams)
Dlite (a lightweight debugger)
Stats (collection of statistics data)
Traffic generator, program parser and loader
NePSim Overview
[Figure: NePSim tool flow. Elements shown: ME C program, ME C compiler (SDK), compiler-generated microcode, microcode program, microcode assembler, parser, internal format, NePSim (built from the NePSim source code with the host C compiler), stats and results.]
NePSim Internals (I)

Instruction
  Opcode: ALU, memory ref., CSR access, etc.
  Operands: GPR, XFER, immed
  Shift: shift amount
  Optional token: ctx_swap, ind_ref, …
Command (for memory and FBI accesses)
  Opcode: sram_read, sram_write, sdram_read, …
  Thread id: ME, thread
  Functional unit: sram, sdram, scratchpad, fbi
  Address: source or destination address
  Reg: source or destination XFER register
  Optional token: ctx_swap, ind_ref, …
Event
  <cycle time, command>
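The three record types above map naturally onto small data structures. The following Python sketch is illustrative only: the field names follow the slide, but the class layout, the `EventQueue` helper and its method names are assumptions made for this example, not NePSim's actual C data structures.

```python
import heapq
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Instruction:
    opcode: str                      # ALU, memory reference, CSR access, ...
    operands: Tuple[str, ...] = ()   # GPR, XFER, or immediate operands
    shift: int = 0                   # shift amount
    tokens: Tuple[str, ...] = ()     # optional tokens such as ctx_swap, ind_ref


@dataclass
class Command:
    opcode: str                      # sram_read, sram_write, sdram_read, ...
    me: int                          # thread id, part 1: microengine number
    thread: int                      # thread id, part 2: thread within the ME
    unit: str                        # sram, sdram, scratchpad, or fbi
    address: int                     # source or destination address
    reg: str                         # source or destination XFER register
    tokens: Tuple[str, ...] = ()     # optional tokens such as ctx_swap, ind_ref


@dataclass(order=True)
class Event:
    cycle: int                               # cycle time at which the event fires
    command: Command = field(compare=False)  # payload, ignored when ordering


class EventQueue:
    """Min-heap of events keyed by cycle time (hypothetical helper)."""

    def __init__(self) -> None:
        self._heap: List[Event] = []

    def schedule(self, cycle: int, command: Command) -> None:
        heapq.heappush(self._heap, Event(cycle, command))

    def pop_due(self, now: int) -> List[Event]:
        """Remove and return every event whose cycle time is <= now."""
        due = []
        while self._heap and self._heap[0].cycle <= now:
            due.append(heapq.heappop(self._heap))
        return due
```

A simulator loop built on this sketch would call `pop_due(current_cycle)` once per cycle and wake the thread named in each returned command.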
NePSim Internals (II)
[Figure: ME pipeline and memory path. Instructions (I) pass through the five ME pipeline stages: P0 instruction lookup, P1 instruction decode, P2 read operand, P3 ALU / generate memory address, P4 retire / generate memory command. Commands (C) go through an arbiter to the SRAM controller and the SDRAM controller. Events (E) are placed in the event queue and wake up sleeping threads.]
Legend: I = instruction, C = command, E = event
NePSim parameters










Some of the simulation parameters of NePSim
-d            enable debug messages
-I            start simulation in Dlite debugger mode
-proc:speed   processor speed in MHz
-vdd          default chip power supply
-me0          program for microengine 0
-script       script file used for initialization
-strmconf     stream config file of packet traces to network devices
-max:cycle    maximum number of cycles to execute
-indstrm      mark packet stream file as indefinitely repeated
-power        flag to enable power calculations
Dlite Debugger


Similar to Dlite in SimpleScalar
Run simulation in debug mode
  Set/delete breakpoints
  Step into pipeline execution, check ALU condition codes
  Examine threads’ PC, status, etc.
  Examine register contents
  Examine memory (sram, sdram) contents
IVERI – a verification tool

Verifying NePSim against the IXP1200 is not a trivial task
  Multiple MEs, threads, memory units
  A huge number of pipeline and memory events are generated during simulation
  Errors have to be pinpointed by scanning huge log traces
IVERI tool



Assertion checking based on Linear Temporal Logic (LTL) and Logic of Constraints (LOC)
Log architectural events in both NePSim and the IXP1200 (with the SDK)
  <cycle, PC, alu_out, address, event_type>
  event_type can be pipeline, sram_enq, sdram_deq, etc.
Use LOC assertions to specify performance requirements
  E.g. “the execution time of an instruction in the pipeline of NePSim is no more than D cycles away from the execution time of the same instruction in the IXP1200”
  PC(pipeline(I)) == PC(pipeline_IXP(I)) ∧
  |cycle(pipeline(I)) - cycle(pipeline_IXP(I))| <= D
IVERI generates verification code based on the assertions
The verification code scans the log traces and reports the number and location of constraint violations.
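As an illustration of the kind of check such generated code performs, here is a minimal Python sketch of the example assertion. It is not IVERI's output: the `LogEvent` record, the function name and the violation labels are assumptions for this example, and the two traces are assumed to be already parsed into lists in which the I-th pipeline event of one trace corresponds to the I-th pipeline event of the other.

```python
from typing import List, NamedTuple, Tuple


class LogEvent(NamedTuple):
    cycle: int
    pc: int
    alu_out: int
    address: int
    event_type: str   # pipeline, sram_enq, sdram_deq, ...


def check_latency_constraint(
    nepsim_log: List[LogEvent],
    ixp_log: List[LogEvent],
    d: int,
    event_type: str = "pipeline",
) -> List[Tuple[int, str]]:
    """Return (index, reason) for every instance I that violates
    PC equality or the |cycle difference| <= d bound."""
    sim = [e for e in nepsim_log if e.event_type == event_type]
    ref = [e for e in ixp_log if e.event_type == event_type]
    violations = []
    # zip stops at the shorter trace; unmatched trailing events are not checked.
    for i, (s, r) in enumerate(zip(sim, ref)):
        if s.pc != r.pc:
            violations.append((i, "pc mismatch"))
        elif abs(s.cycle - r.cycle) > d:
            violations.append((i, "cycle drift"))
    return violations
```

The length of the returned list is the number of violations and the indices give their locations, matching the two quantities the slide says are reported.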
Performance Validation of NePSim


Use IVERI to verify NePSim functionality against the IXP1200 with the
Logic of Constraints (LOC) language
Throughput and average latency are within 1% and 6% error, respectively.
[Figure panels: Throughput; Average processing time]
Power Model
H/W component                                   | Model type         | Tool   | Configuration
GPR per ME                                      | Array              | XCacti | 2 64-entry files, 1 read/write port per file
XFER per ME                                     | Array              | XCacti | 4 32-entry files, 1 read/write port per file
Control register per ME                         | Array              | XCacti | 1 32-entry file, 1 read/write port
Control store, scratchpad                       | Cache w/o tag path | XCacti | 4KB, 4 bytes per block, direct mapped, 10-bit address
ALU, shifter                                    | ALU and shifter    | Wattch | 32-bit
Command FIFO, command queue in controller, etc. | Array              | Wattch | See paper
Command bus arbiter, context arbiter            | Matrix, rr arbiter | Orion  | See paper
Benchmarks




ipfwdr
  IPv4 forwarding (header validation, trie-based lookup)
  Medium SRAM access
url
  Examines payload for URL patterns; used in content-aware routing
  Heavy SDRAM access
nat
  Network address translation
  Medium SRAM access
md4
  Message digest (computes a 128-bit message “signature”)
  Heavy computation and SDRAM access
Performance implications





More MEs do not necessarily bring a performance gain
More MEs cause more memory contention
ME idle time is abundant (up to 42%)
A faster ME core results in more ME idle time with the same memory
Non-optimal rcv/xmit configuration for NAT (the transmitting ME is a bottleneck)
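The observation that a faster ME core idles more against the same memory can be illustrated with a toy latency-hiding model. This is a sketch under stated assumptions, not NePSim's model: each thread is assumed to alternate a fixed amount of compute with one memory access, threads overlap perfectly, and memory contention is ignored. The function name and all numbers in the usage note are hypothetical.

```python
def me_idle_fraction(threads: int, compute_cycles: float,
                     mem_latency_ns: float, me_mhz: float) -> float:
    """Idle fraction of one ME under a simple latency-hiding model.

    Each thread runs compute_cycles of work, then waits mem_latency_ns
    for memory. The memory latency is fixed in nanoseconds, so it costs
    more ME cycles when the ME clock is faster.
    """
    mem_cycles = mem_latency_ns * me_mhz / 1000.0   # latency in ME cycles
    utilization = min(1.0, threads * compute_cycles / (compute_cycles + mem_cycles))
    return 1.0 - utilization


if __name__ == "__main__":
    for mhz in (232, 464):
        idle = me_idle_fraction(threads=4, compute_cycles=20,
                                mem_latency_ns=400, me_mhz=mhz)
        print(f"{mhz} MHz: idle fraction {idle:.2f}")
```

With these placeholder inputs the idle fraction at 464 MHz comes out higher than at 232 MHz, because doubling the clock doubles the memory latency measured in ME cycles while the compute work per access stays the same.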
[Figure panels: Throughput vs. number of MEs at 232 MHz; Throughput vs. number of MEs at 464 MHz; Idle time vs. ME/memory speed ratio]
Where does the Power Go?





Power dissipation by rcv and xmit MEs is similar across benchmarks
Transmitting MEs consume ~5% more than receiving MEs
ALU consumes significant power, ~45% (Wattch model)
Control store uses ~28% (accessed almost every cycle)
GPRs burn ~13%, shifter ~7%, static ~7%
[Figure panels: Across MEs; Inside an ME]
Power efficiency observations

Power consumption increases faster than performance
More MEs/threads bring more idle time due to memory contention
Reduce power consumption of MEs while waiting for memory accesses
[Figure panels: power and performance for url, ipfwdr, md4, and nat; idle time vs. number of MEs]
Dynamic Voltage Scaling in NPs



During the ME idle time, all threads are in the "wait" state and the pipeline has no activity.
Applying DVS while MEs are not very active can reduce the total power consumption substantially.
DVS control scheme



  Observe the ME idle time (%) periodically.
  When idle > threshold, scale down the voltage and frequency (VF for short) by one step unless the minimum allowable VF has been reached.
  When idle < threshold, scale up the VF by one step unless it is already at the maximum allowable value.
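The control scheme above can be sketched as a small threshold controller. The class name, the method name and the VF table below are assumptions for illustration: only the top operating point (1.3 V, 600 MHz) appears in these slides, and the lower steps are placeholders.

```python
from typing import List, Tuple

# Hypothetical discrete VF table, lowest to highest: (volts, MHz).
# Only the last entry is taken from the slides; the rest are placeholders.
VF_STEPS: List[Tuple[float, int]] = [
    (1.0, 400), (1.1, 450), (1.2, 500), (1.25, 550), (1.3, 600),
]


class DvsController:
    """Threshold-based DVS policy evaluated once per observation period."""

    def __init__(self, steps: List[Tuple[float, int]], idle_threshold: float) -> None:
        self.steps = steps
        self.idle_threshold = idle_threshold
        self.level = len(steps) - 1          # start at the maximum VF

    def update(self, idle_fraction: float) -> Tuple[float, int]:
        """Apply one DVS decision and return the resulting (volts, MHz)."""
        if idle_fraction > self.idle_threshold:
            if self.level > 0:               # not yet at the minimum VF
                self.level -= 1              # scale down by one step
        elif idle_fraction < self.idle_threshold:
            if self.level < len(self.steps) - 1:   # not yet at the maximum VF
                self.level += 1              # scale up by one step
        return self.steps[self.level]
```

Moving only one step per period keeps the policy simple and matches the discrete-step scheme described in the slides; an idle fraction exactly equal to the threshold leaves the VF unchanged in this sketch.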
DVS Considerations




Transition step: whether to use continuous or discrete changes in VF
  Use discrete VF steps
Transition status: whether the ME is allowed to continue working during VF regulation
  Pause the ME while regulating VF
Transition time between two different VF states
  10 us [Burd][Shang][Sidiropoulos]
Transition logic complexity, i.e. the overhead of the control circuit that monitors and determines a transition
DVS Power-performance




Power and performance reduction by DVS
Initial VF = 1.3 V, 600 MHz
DVS period: a DVS decision to reduce or increase VF is made every 15K, 20K, or 30K cycles
Up to 17% power savings with less than 6% performance loss
On average 8% power savings with <1% performance degradation
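To put the DVS periods in context, the short calculation below converts them to wall-clock time and compares them with the 10 us transition time quoted earlier. It is only arithmetic on the slide values under two assumptions: the ME runs at the initial 600 MHz, and it is paused for the whole 10 us. The ratio applies only to a window in which a transition actually occurs; the slides do not say how often that happens.

```python
TRANSITION_US = 10                      # transition time between VF states
FREQ_MHZ = 600                          # initial ME frequency
PERIODS = (15_000, 20_000, 30_000)      # DVS decision periods in cycles


def cycles_in(microseconds: int, freq_mhz: int) -> int:
    """Microseconds times MHz gives clock cycles."""
    return microseconds * freq_mhz


def window_us(period_cycles: int, freq_mhz: int) -> float:
    """Length of one decision window in microseconds."""
    return period_cycles / freq_mhz


pause = cycles_in(TRANSITION_US, FREQ_MHZ)
for period in PERIODS:
    print(f"period={period} cycles, window={window_us(period, FREQ_MHZ):.1f} us, "
          f"pause/period={pause / period:.2f}")
```

At 600 MHz a 10 us pause corresponds to 6,000 cycles, so a longer DVS period amortizes each transition over more cycles, at the cost of reacting more slowly to changes in idle time.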
Ongoing and future work






Extend NePSim to IXP2400/2800
Dynamically shut down/activate MEs
Dynamically allocate tasks to MEs
Model SRAM and SDRAM module power
Integrate StrongARM/XScale simulation
…
NP-Based Projects at UCR




Intel IXA Program: NP Architecture Lab – Architecture
research - CS 162 Assignments Based on IXP 2400
NSF: Design and Analysis of a Web Switch (Layer 5/7
switch) Using NP – TCP Splicing – Load Balancing etc.
Intel and UC Micro: Architectures to Accelerate Data
Center Servers – TCP and SSL Offload to Dedicated
Servers and NPs, XML Servers, etc.
Los Alamos National Lab: Intelligent NP-Based NIC
Design for Clusters – O/S Bypass Protocols, User
Level Communication, etc.
References



[Burd] T. Burd and R. Brodersen, "Design issues for dynamic voltage scaling," International Symposium on Low Power Electronics and Design, pp. 9-14, 2000.
[Shang] L. Shang, L.-S. Peh, and N. K. Jha, "Dynamic voltage scaling with links for power optimization of interconnection networks," The 9th International Symposium on High-Performance Computer Architecture, pp. 91-102, 2003.
[Sidiropoulos] S. Sidiropoulos, D. Liu, J. Kim, G. Wei, and M. Horowitz, "Adaptive bandwidth DLLs and PLLs using regulated supply CMOS buffers," IEEE Symposium on VLSI Circuits, pp. 124-127, 2000.