Erdem

Clustered Linked List Forest For
IPv6 Lookup
Oğuzhan ERDEM1 and Aydın CARUS2
1Department
of Electrical and Electronics
Engineering
2Department of Computer Engineering
Trakya University, TURKEY
{ogerdem, aydinc}@trakya.edu.tr
HOTI'13- Oğuzhan Erdem
Agenda
Ø Background
Ø Prior Work
Ø Our Solutions
Ø Implementation Results
Ø Conclusions & Future Work
2
HOTI'13- Oğuzhan Erdem
Agenda
Ø Background
Ø Prior Work
Ø Our Solutions
Ø Implementation Results
Ø Conclusions & Future Work
3
4
HOTI'13- Oğuzhan Erdem
Background
Ø IP Lookup
Ø Core functions of Internet routers
Ø Match the destination IP address to the prefixes in the routing table
Ø Longest Prefix Matching (LPM)
Ø Multiple matched prefixes possible
Ø Only the longest matched prefix will be used
Ø Example
Ø IP address “114.110.88.212”
Ø Routing table:
Routing
Prefixes
Next-Hop
Forwarding
114.110.*.*.
Port 0
114.110.88.*.
Port 1
Ø Matches both, but the second match is used
5
HOTI'13- Oğuzhan Erdem
Tree based IP Lookup
Ø Binary Search Tree (BST)
Ø Each node stores prefix values
Ø Better memory performance
Ø Complex pre-processing
Ø Binary Trie (BT)
Ø Simple data structure
Ø Easy update process
Ø Large number of memory
accesses
Ø Prefix expansion
Ø Sorting requırement
Ø Hard incremental updates
Prefix Next
Hop
00*
01*
10*
101*
110*
0001*
0011*
0101*
1000*
1100*
P1
P2
P3
P4
P5
P6
P7
P8
P9
P10
0
0
1
P6
0
1
P1
1
0
1
P2
1
P7
0
0
1
P8
P9
Binary Trie
0
1
P3
1
0
P4
0
P5
P10
Binary Search Tree
HOTI'13- Oğuzhan Erdem
6
Performance Metrics
Ø Throughput
Ø Support for higher link data rates
Ø 100 Gbps network interfaces and links
Ø Memory usage
Ø Rapid growth of routing table size
Ø Estimated as ~500K entries in 2015
Ø Transition from IPv4 (32-bit) to IPv6 (128-bit)
Ø IPv4 address exhaustion
Ø State-of-the-art designs cannot support large routing tables due to
the available on-chip memory and the number of I/O pins of FPGAs
Ø Update Capability
Ø Types of updates – Insert, delete and modify
Ø How fast an update operation can be performed
HOTI'13- Oğuzhan Erdem
Agenda
Ø Background
Ø Prior Work
Ø Our Solutions
Ø Implementation Results
Ø Conclusions & Future Work
7
HOTI'13- Oğuzhan Erdem
8
Hardware based solutions
Ø Linear pattern search in TCAM
Ø One lookup per cycle
Ø Expensive, power-hungry, low density, large access time, need for a
priority encoder
Ø Hash based solutions
Ø Simple and fast
Ø Large number of different hash tables and functions for each prefix
lengths, hash collisions.
Ø Binary bit traversal in pipelined tries
Ø Good throughput performance and support quick prefix updates
Ø Inefficient memory usage
Ø Binary value search in pipelined trees
Ø Good memory performance
Ø Complex pre-processing and updates
9
HOTI'13- Oğuzhan Erdem
Hierarchical Search Structures
Ø Hierarchical trie for packet classification
Ø Hierarchically connected binary tries
Ø One big SA trie from SA prefixes
Ø Many small DA tries using DA prefixes
Ø Hardware implementation unfeasible
0
1
0
0
1
1
Ø  Backtracking problem
SA Trie
DA Tries
1
1
0
1
1
0
0
1
0
0
0
0
Ø Hierarchical structures for IP lookup
Ø Ony DA is used
Ø Split prefixes into two part
Ø Hierarchically connected data structures
Ø Substantial memory saving
Ø Bit strings shared by multiple prefixes are stored only once
Ø No backtracking : Fixed size strings naturally pairwise disjoint
HOTI'13- Oğuzhan Erdem
Agenda
Ø Background
Ø Prior Work
Ø Our Solutions
Ø Implementation Results
Ø Conclusions & Future Work
10
HOTI'13- Oğuzhan Erdem
11
Our Solutions
Ø Clustered Linked List Forest (CLLF) structure
Ø Hierarchical two stage data structure
Ø Reduces the memory requirement
Ø Simplifies the table updates
Ø Linear pipelined SRAM-based architecture on FPGAs
Ø Supports up to 712K IPv6 prefixes
Ø Sustained throughput of 432 Million lookups per second (138 Gbps
for the minimum packet size of 40 Bytes)
12
HOTI'13- Oğuzhan Erdem
Clustered Linked List Forest (CLLF)
Ø Hierarchical two stage data structure
Ø A prefix P can be expressed as the concatenation of two substrings
Px and Py such that P = Px Py .
Ø Stage 1 : Clustering stage (Px) Stage 2 : Linked List (LL) stage (Py)
C18
C214
C320
C426
9
C215
C321
C427
C110
C216
C322
C428
11
C217
C323
C429
C112
C218
C324
C430
13
C219
C325
C431
Cluster3
Cluster4
C1
C1
C1
Cluster1
Cluster2
Clustering Stage
N1,0,0
N1,1,0
N1,k,0
N2,0,0
N2,1,0
N2,t,0
N3,0,0
N3,1,0
N1,0,1
N1,1,1
N1,k,1
N2,0,1
N2,1,1
N2,t,1
N3,0,1
N3,1,1
N2,1,2
N2,t,2
N3,0,2
N3,r,0
N4,0,0
N4,1,0
N4,z,0
N4,1,1
N4,z,1
LL1
N1,0,2
N1,k,2
N4,1,2
LLk
N1,0,m
LL0
N2,t,m
Linked List (LL) Stage
N4,1,m
HOTI'13- Oğuzhan Erdem
13
Clustering
Ø Length-based
Ø 64 clusters in IPv6 the worst case
Ø Each cluster is indexed starting from 1 to n (Ci represents ith cluster)
Ø Reduce the number of clusters by merging
Ø  Parameters
Ø Block size (BS) : The number of distinct lengths involved in a cluster
Ø CiL : Each distinct length (L) block of prefixes in a Ci
Ø Ex: If a cluster Ci includes the prefixes of lengths from 8 to 11, then
BS(Ci)=4 and Ci9 represents a bunch of prefixes with length 9.
Ø Initial prefix stride (IPS) : The length of substring Px (|Px|) in P = Px Py
Ø Ex: If IPS (Ci)= 4, then a prefix P = 110010111* in Ci can be represented
using two substrings Px= 1100 and Py= 10111*
14
HOTI'13- Oğuzhan Erdem
Data Structure
Ø Stage 1: Clustering stage
Ø Each group has a Binary Search Tree (BST) using Px of prefixes.
Ø Stage 2: Linked List (LL) stage
Ø Each node of BST connects to a linked list data structure.
Ø Each node in a LL contains different Py substring with its length.
Ø BS(C1)=2, BS(C2)=2, BS(C3)=3, IPS(C1)=2, IPS(C2)=3, IPS(C3)=4.
P0
P1
P2
P3
P4
P5
P6
P7
P8
P9
P10
P11
P12
P13
P14
P15
P16
P17
P18
P19
P20
Prefix
10*
00*
000*
011*
101*
111*
1000*
0010*
0111*
10010*
11000*
01101*
11110*
10110*
01111*
101101*
011011*
1011110*
0110011*
01100000*
01101011*
Next Hop
1
2
2
8
6
5
8
6
7
10
9
8
10
3
1
5
1
3
7
9
4
C1={C12,C13}
C2={C24,C25}
C3={C36,C37,C38}
100 (4)
10 (2)
0110 (6)
1011 (6)
01 (1)
110 (6)
011 (3)
11 (3)
101 (5)
001 (1)
00 (0)
111 (7)
Clustering stage
Linked List Stage
- 0 P1
0 1 P2
1 1 P3
- 0 P0
1 1 P4
1 1 P5
0 1 P7
1 1 P8
0 1 P6
01 2 P11
10 2 P9
11 2 P14
10 2 P13
00 2 P10
10 2 P12
11 2 P16
01 2 P15
011 3 P18
110 3 P17
0000 4 P19
1011 4 P20
HOTI'13- Oğuzhan Erdem
15
IP Lookup
Ø Split IP address into two: IPx and IPy
Ø Based on the IPS value of corresponding cluster
Ø Search IPx in Stage 1 (BST) and IPy in Stage 2 (Linear search).
Ø Parallel searches in clusters
Ø If a match is found in BST, proceed to the connected LL.
Ø At each LL node, an IPy sub-address is compared with Py sub-prefix.
Ø Update search result if longer match than the previous match.
Ø Search terminates when all the nodes in a LL is traversed
Ø Outputs from all clusters are given to the priority encoder that
chooses the longest prefix.
HOTI'13- Oğuzhan Erdem
16
Optimization
Ø Node size optimizations
Ø BST
Ø The nodes of BST can be enhanced to store Py substrings
Ø Ptree: A limit for the number of Py substrings per tree node
Ø LL
Ø LL nodes can be enhanced to store more than one Py substrings
Ø PLL: Upper bound for the number of Py prefixes per LL node
Ø Clustering optimizations
Ø BS
Ø Defines the number and size of clusters
Ø Larges BS results in less number of pipelines in hardware implementation
Ø IPS
Ø Determines the number of nodes in BST and LL.
Ø Smaller IPS reduces the number of BST nodes but increase the number
of LL nodes and implicitly the delay.
17
HOTI'13- Oğuzhan Erdem
Architecture
Ø 
Overall
1 stage
Ø 
Register
Incoming Packet
Header extracter
Address Data
IPx
BST
BST
IPx
NHIIP
Address
NHIIP
Address
Px
Next Stage
Address
BRAM
BST
NHIprefix
Match
Module
(BST)
Address
Next Stage
Address
Clustering stage
Register
LL
LL
LL
Address Data
IPy
Cluster2
Clustern
Address
BRAM
Address
NHIprefix
Priority Encoder
Result
IPy
NHIIP
L(Py)
NHIIP
Cluster1
Py
LL stage
Match
Module
(LL)
Address
HOTI'13- Oğuzhan Erdem
Agenda
Ø Background
Ø Prior Work
Ø Our Solutions
Ø Implementation Results
Ø Conclusions & Future Work
18
19
HOTI'13- Oğuzhan Erdem
Results
No. of prefixes
Ø Experimental setup
Ø Ten IPv4 core routing tables from Project - RIS on 06/03/2010.
Ø Ten synthetic IPv6 routing tables generated using IPv4 core RTs
IPv4
rrc00
rrc03
rrc05
rrc06
rrc07
rrc10
rrc11
rrc12
rrc15
rrc16
IPv6
rrc00_6
rrc03_6
rrc05_6
rrc06_6
rrc07_6
rrc10_6
rrc11_6
rrc12_6
rrc15_6
rrc16_6
# prefixes
332118
321617
322997
321577
322557
319952
323668
320016
323987
328296
140000
rrc00_6
120000
rrc03_6
100000
rrc06_6
80000
rrc07_6
rrc05_6
rrc10_6
60000
rrc11_6
rrc12_6
40000
rrc15_6
20000
rrc16_6
0
16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64
Prefix Length
20
HOTI'13- Oğuzhan Erdem
Memory Requirement
Ø  For various BS and IPS (PLL=Ptree=1)
Ø  For various Ptree and PLL (BS=4, IPS=13)
19
20
Memory (Mbits)
Memory (Mbits)
25
15
10
5
0
4 5
6
7
8
BS
rrc00_6
9
10
rrc03_6
11
12 12
rrc05_6
13
14
15
16
17
18
18
17
16
15
1
2
IPS
rrc06_6
3
4
PLL
rrc07_6
rrc10_6
5
rrc11_6
6
7
rrc12_6
8
1
2
rrc15_6
3
4
Ptree
rrc16_6
BT
341
330
332
331
332
329
347
329
334
337
BTPC
72
69
70
69
70
69
72
70
70
71
DBPC
45
45
45
45
46
45
48
45
45
46
BST
35.5
34.4
34.6
34.4
34.5
34.3
34.6
34.3
34.7
35.2
CTF
27.9
27.3
27.0
29.2
27.2
27.1
27.1
26.9
28.0
27.2
CLLF
16.4
15.8
15.9
15.9
16.0
15.7
16.0
15.7
15.9
16.1
21
HOTI'13- Oğuzhan Erdem
Delay
Ø  The delay for various Ptree and PLL
Ø  The % of memory by BST and LL for various IPS
(BS=4, PLL=5, Ptree=1)
(BS=4, IPS=13)
200
150
100
50
0
1
2
3
4
PLL
rrc00_6
CLLF
45
5
rrc03_6
44
6
7
8
rrc05_6
45
1
2
3
4
Ptree
rrc06_6
44
% of memory usage
Delay (Cycle)
250
rrc07_6
45
69
45
Delay
27
14
12
13
14
100%
90%
80%
70%
60%
50%
40%
30%
20%
10%
0%
rrc10_6
45
rrc11_6
IPS
15
Ø  Depth (cycles) (BS=4, IPS=13, PLL=5, Ptree=1)
6
6
16
17
18
BST
rrc12_6
44
8
45
rrc15_6
46
Linked List
rrc16_6
45
22
HOTI'13- Oğuzhan Erdem
Performance Comparison
Ø FPGA implementation
Ø  Implemented in Verilog using Xilinx ISE 12.4 and target device Xilinx XC6VSX475T
Ø  216 MHz (432 million packets per second with dual-ported BRAMs)
Ø  Clock period is 4.63 ns, 138 Gbps for the minimum packet size of 40 bytes
Ø Memory efficiency (in Bits/prefix)
Ø Throughput efficiency (in Gbps/Bits)
Architecture
#
prefix
Mem. Eff.
Bits/prefix
Throughput
Gbps
Throughput eff.
Gbps/B
CLLF
324
51.8
138
2.66
CTF
324
88.1
135
1.53
BST
324
124
125
1.01
DBPC
324
146
120
0.82
2-3 tree
324
167
120
0.72
Flashtrie
310
171
64
0.37
TreeBitmap
310
405
6
0.02
HOTI'13- Oğuzhan Erdem
Agenda
•  Background
•  Prior Work
•  Our Solutions
•  Implementation Sesults
•  Conclusions & Future Work
23
HOTI'13- Oğuzhan Erdem
24
Conclusions and Future Work
Ø Clustered Linked List Forest
Ø Significant memory saving
Ø Performance improvements with optimizations
Ø Linear pipelined SRAM-based architecture on FPGAs
Ø Supports up to 712K IPv6 prefixes
Ø Sustained throughput of 138 Gbps
Ø Future work
Ø Extend the algorithm to virtual routers to improve their memory
efficiency
THANK YOU !