Low Power Architecture - CSIE -NCKU

Low Power Architecture for High Speed
Packet Classification
Author:
Alan Kennedy, Xiaojun Wang
Zhen Liu, Bin Liu
Publisher:
ANCS 2008
Presenter: Chun-Yi Li
Date: 2009/05/06
Outline
 Introduction
 Adaptive Clocking Architecture
 Hardware Accelerator
 Hierarchical Intelligent Cuttings (HiCuts)
 Multidimensional Cutting (HyperCuts)
 Algorithm Changes
 Low Power Architecture
 Performance
2
Introduction
 The number of an Optical Carrier level is directly proportional to the
data rate of the bitstream carried by the digital signal.
 Optical Carrier levels describe a range of digital signals that can be
carried on a SONET fiber-optic network.

[Figure: Synchronous Optical Networking (SONET) transmission speeds]
Optical Carrier Level    Transmission Speed (Gbit/s)
OC-48                    2.48832
OC-192                   9.95328
OC-768                   39.81312
3
Introduction
 Implementing packet classification algorithms in software is not
feasible when trying to achieve high-speed packet classification.
 High-throughput algorithms such as RFC are unable to reach OC-768, or
even OC-192, line rates when run on devices such as general-purpose
processors, even for relatively small rulesets.
4
Introduction
 A large percentage of idle time means that a large amount of dynamic
power is wasted on unnecessary switching of logic and memory elements.
Percentage of time classifier spends idle when classifying packets from
the CENIC trace at different frequencies
5
Outline
 Introduction
 Adaptive Clocking Architecture
 Hardware Accelerator
 Hierarchical Intelligent Cuttings (HiCuts)
 Multidimensional Cutting (HyperCuts)
 Algorithm Changes
 Low Power Architecture
 Performance
6
Adaptive Clocking Architecture
 The adaptive clocking unit is designed to run a packet
classification hardware accelerator at up to N different
frequencies.
 For our packet classifier, 32 MHz was found to be fast enough to deal
with the worst-case bursts of packets at OC-768 line speed. This means
that Fmax = 32 MHz.
fi = Fmax / 2^(N-1-i),  i = 0, 1, ..., N-1

State        S0      S1     S2    S3   S4  S5  S6  S7  S8  S9
Speed (MHz)  0.0625  0.125  0.25  0.5  1   2   4   8   16  32
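The frequency formula above can be checked with a short sketch (Fmax and N are the values from this slide):

```python
# Sketch (not the paper's RTL): derive the N clock frequencies
# f_i = Fmax / 2^(N-1-i) used by the adaptive clocking unit.
F_MAX_MHZ = 32.0   # worst-case frequency for OC-768 bursts
N = 10             # number of states S0..S9

frequencies = [F_MAX_MHZ / 2 ** (N - 1 - i) for i in range(N)]
# S0 runs at 0.0625 MHz, S4 at 1 MHz, S9 at the full 32 MHz
```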
7
Adaptive Clocking Architecture
 This threshold is variable, with the M packet slots of the buffer
distributed among the N states, each state having a width Wi,
0 ≤ Wi ≤ M:

M = Σ_{i=0}^{N-1} Wi

[Figure: buffer of size M divided into consecutive segments of widths
W0, W1, ..., WN-1]
8
Adaptive Clocking Architecture
 The threshold for determining when a state is exited and the
next higher state entered is saved in a register in the adaptive
clocking unit and can be changed at any time.
Ti = Σ_{j=0}^{i} Wj ,  i = 0, 1, ..., N-2

[Figure: buffer with thresholds T0, T1, ... marked at the cumulative
widths W0, W0+W1, ...]
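The width/threshold bookkeeping described above can be sketched as follows; the width values are illustrative, not taken from the paper:

```python
# Sketch: given per-state buffer widths W_i, the exit threshold for
# state i is the cumulative occupancy T_i = W_0 + ... + W_i, i = 0..N-2.
widths = [4] * 10   # W_0..W_9 (illustrative); their sum is the buffer size M

M = sum(widths)
thresholds = [sum(widths[: i + 1]) for i in range(len(widths) - 1)]
# Exceeding T_i moves the classifier from state S_i to the next used state.
```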
9
Adaptive Clocking Architecture
 The output clock frequency to the packet classification hardware
accelerator starts at f0, the frequency of the lowest used state S0.
 If the threshold T0 of this state is exceeded, the next higher used
state S1 is entered and the clock frequency changes to f1.
[Figure: state chain S0 → S1 → ... → S9 with all ten states used]
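The transition rule above can be sketched as follows; the set of used states and the threshold values are illustrative assumptions, not values from the paper:

```python
# Sketch of the state transitions described above: the clock sits in the
# lowest used state and climbs to the next used state whenever the buffer
# occupancy exceeds the current state's threshold.
def next_state(current, occupancy, thresholds, used_states):
    """Return the state after one evaluation of the buffer occupancy."""
    if current < len(thresholds) and occupancy > thresholds[current]:
        higher = [s for s in used_states if s > current]
        if higher:
            return min(higher)      # skip over unused states
    return current

used = [4, 7, 8, 9]                              # only S4, S7, S8, S9 enabled
thresholds = [4, 8, 12, 16, 20, 24, 28, 32, 36]  # T_0..T_8 (illustrative)
state = min(used)                                # clocking starts at S4
state = next_state(state, occupancy=25, thresholds=thresholds, used_states=used)
# occupancy 25 exceeds T_4 = 20, so the next used state S7 is entered
```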
10
Adaptive Clocking Architecture
 Only states S4, S7, S8 and S9 are used.
 In this case the output clock frequency to the packet
classifier will start at f1 .
[Figure: state chain S0 → S1 → ... → S9 with only S4, S7, S8 and S9
enabled]
11
Outline
 Introduction
 Adaptive Clocking Architecture
 Hardware Accelerator
 Hierarchical Intelligent Cuttings (HiCuts)
 Multidimensional Cutting (HyperCuts)
 Algorithm Changes
 Low Power Architecture
 Performance
12
Hardware Accelerator
Hierarchical Intelligent Cuttings (HiCuts)
 The algorithm constructs the decision tree by recursively cutting the
hyperspace, one dimension at a time, into subregions.
 The algorithm keeps cutting the hyperspace until no subregion
contains more rules than a predetermined threshold called binth.
13
Hardware Accelerator
Hierarchical Intelligent Cuttings (HiCuts)
[Figure: HiCuts example, step 1 — the root cuts Field2 into 4
subregions (00, 01, 10, 11) holding rule subsets such as {R9, R10, R11}
and {R0, R1, R5, R6, R7, R10, R11}; subregions with more than
binth = 4 rules must be cut again]
14
Hardware Accelerator
Hierarchical Intelligent Cuttings (HiCuts)
[Figure: HiCuts example, step 2 — subregions still exceeding
binth = 4 rules are cut again, on Field4 and Field3 (4 cuts each)]
15
Hardware Accelerator
Hierarchical Intelligent Cuttings (HiCuts)
[Figure: HiCuts example, step 3 — one remaining subregion is cut on
Field5, after which every leaf holds at most binth = 4 rules]
16
Hardware Accelerator
Hierarchical Intelligent Cuttings (HiCuts)
17
Hardware Accelerator
Hierarchical Intelligent Cuttings (HiCuts)
 binth: limits the amount of linear searching at leaves.
 np: the number of cuts made at a node.
 spfac: a multiplier which limits the storage increase caused by
executing cuts at a node:
 ∑(rules at each child of i) + np ≤ spfac × (number of rules at i)
18
Hardware Accelerator
Multidimensional Cutting (HyperCuts)
 The main difference from HiCuts is that HyperCuts
recursively cuts the hyperspace into sub regions by
performing cuts on multiple dimensions at a time.
19
Hardware Accelerator
Multidimensional Cutting (HyperCuts)
Rule  Field1   Field2   Field3  Field4   Field5
R0    128-240  15-15    40-40   180-180  120-140
R1    90-100   0-80     0-200   190-200  130-132
R2    130-255  60-140   0-60    180-180  133-135
R3    90-92    200-200  40-40   180-180  136-138
R4    130-255  60-140   40-40   190-200  60-63
R5    140-150  60-140   0-255   0-255    140-255
R6    160-165  80-80    0-255   0-255    0-80
R7    48-50    0-80     40-40   0-255    0-10
R8    26-36    50-50    40-40   180-180  30-40
R9    40-40    40-70    40-40   0-255    0-60
[Figure: HyperCuts example — the root cuts Field1 and Field5
simultaneously (2 cuts each, 4 children), producing the leaves
{R7, R8, R9}, {R1, R3}, {R0, R4, R6} and {R0, R2, R5}; binth = 4]
20
Hardware Accelerator
Multidimensional Cutting (HyperCuts)
 spfac: A multiplier which limits the amount of
storage increase caused by executing cuts at a
node.
 max child nodes at i ≤ spfac*sqrt( number of rules at i)
21
Hardware Accelerator
Multidimensional Cutting (HyperCuts)
 Region Compaction
A node in the decision tree originally covers the region
{[Xmin, Xmax], [Ymin, Ymax]}, but all the rules associated with the
node are covered by the smaller subregion
{[X'min, X'max], [Y'min, Y'max]}.
Using region compaction, the area associated with the node shrinks to
the minimum space that covers all the rules associated with the node.
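Region compaction amounts to taking the per-dimension bounding box of a node's rules, which can be sketched as (illustrative code, not the paper's implementation):

```python
# Sketch of region compaction: shrink a node's region to the minimum
# bounding box covering the ranges of all its rules.
def compact(rules):
    """rules: list of per-dimension (lo, hi) tuples -> compacted region."""
    dims = len(rules[0])
    return tuple(
        (min(r[d][0] for r in rules), max(r[d][1] for r in rules))
        for d in range(dims)
    )

rules = [((10, 20), (40, 60)), ((15, 30), (50, 70))]
compact(rules)   # ((10, 30), (40, 70)): the minimum covering subregion
```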
22
Hardware Accelerator
Multidimensional Cutting (HyperCuts)
 Pushing Common Rule Subsets Upwards
An example in which all the child nodes of A share the same rule subset
{R0, R1}: only A stores the subset, instead of it being replicated in
every child.
[Figure: before and after pushing {R0, R1} up to A — children holding
{R0, R1, R2}, {R0, R1, R3}, {R0, R1} and {R0, R1, R4} become {R2},
{R3}, {} and {R4}]
23
Hardware Accelerator
Algorithm Changes
 Remove the region compaction and push
common rule subsets upwards heuristics
from the HyperCuts algorithm.
24
Hardware Accelerator
Algorithm Changes
 For HiCuts, the number of cuts at an internal node starts at 32 and
doubles each time the following condition is met:
 (∑(rules at each child of i) + np ≤ spfac × number of rules at i) &
(np < 129)
25
Hardware Accelerator
Algorithm Changes
 All combinations of cuts between the chosen dimensions are considered
if they obey the following condition, where spfac can be 1, 2, 3 or 4:
 (np ≤ 2^(4+spfac)) & (np ≥ 32)
26
Hardware Accelerator
Memory Structure
 The hardware accelerator uses 7704-bit wide memory words.
 In order to calculate which cut the packet should traverse to, the
internal node stores 8-bit mask and shift values for each dimension.
 The masks indicate how many cuts are made to each dimension, while
the shift values indicate each dimension's weight.
 The cut to be chosen is calculated by ANDing the mask values with the
corresponding 8 most significant bits of each of the packet's 5
dimensions. The resulting value for each dimension is shifted by its
shift value, and the results are added together, giving the cut to be
selected.
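The mask-and-shift calculation can be sketched as follows. This is illustrative, not the exact hardware: the mask/shift values and the shift direction are assumptions chosen so that the per-dimension results concatenate into a child index.

```python
# Sketch of the cut selection described above: AND each dimension's mask
# with the top 8 bits of the packet field, shift the result by that
# dimension's shift value, and add the per-dimension results together.
def select_cut(top_bits, masks, shifts):
    """top_bits: the 8 most significant bits of each packet dimension."""
    return sum((b & m) >> s for b, m, s in zip(top_bits, masks, shifts))

# Example: 2 cuts on dimension 0 and 2 cuts on dimension 1 (4 children).
masks = [0x80, 0x80, 0x00, 0x00, 0x00]   # use the MSB of dims 0 and 1
shifts = [6, 7, 0, 0, 0]                 # dim 0 weighted twice dim 1
cut = select_cut([0xFF, 0x00, 0x12, 0x34, 0x56], masks, shifts)  # -> 2
```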
27
Hardware Accelerator
Memory Structure
 Each saved rule uses 160 bits of memory.
 The Destination and Source Ports use 32 bits each with 16 bits used
for the min and max range values.
 The Source and Destination IP addresses use 35 bits each with 32 bits
used to store the address and 3 bits for the mask.
 The storage requirement for the mask has been reduced from 6 to 3 bits
by encoding the mask and storing 3 bits of the encoded mask value in
the 3 least significant bits of the IP address when the mask is 0-27.
 The protocol number uses 9 bits with 8 bits used to store the number
and 1 bit for the mask.
 Each 7704-bit memory word can hold up to 48 rules, and it is possible
to perform a parallel search of these rules in one clock cycle.
28
Outline
 Introduction
 Adaptive Clocking Architecture
 Hardware Accelerator
 Hierarchical Intelligent Cuttings (HiCuts)
 Multidimensional Cutting (HyperCuts)
 Algorithm Changes
 Low Power Architecture
 Performance
29
Low Power Architecture
31
Outline
 Introduction
 Adaptive Clocking Architecture
 Hardware Accelerator
 Hierarchical Intelligent Cuttings (HiCuts)
 Multidimensional Cutting (HyperCuts)
 Algorithm Changes
 Low Power Architecture
 Performance
32
Performance
33
Performance
Power figures for ASIC
implementation
Power figures for Cyclone 3
implementation
34
Performance
ASIC implementation
classifying network traces using
rulesets containing 20,000 rules.
Cyclone 3 implementation
classifying network traces using
rulesets containing 20,000 rules.
35
Conclusion
Simulation results show that ASIC and FPGA implementations of our low
power architecture can reduce power consumption by 17-88% by adjusting
the clock frequency.
36