Flow Stats

Flow Stats Module
James Moscola
September 12, 2007
SPP V1 LC Egress with 1x10Gb/s Tx

[Figure: block diagram of the line card egress pipeline with a single 10 Gb/s transmit path. Blocks shown: MSF RBUF, Rx1, Rx2, Key Extract, Lookup (with TCAM), Hdr Format, SWITCH, Port Splitter, QM0-QM3, Flow Stats1, Flow Stats2, Stats (1 ME), 1x10G Tx1, 1x10G Tx2, MSF TBUF, RTM, Freelist, and XScale. Links between blocks are labeled NN, SCR, or SRAM; memories shown are SRAM1, SRAM2, and SRAM3. Other labels: NAT Miss Scratch Ring, NAT Pkt return, and Archive Records (Flow Stats2 to XScale).]

SPP V1 LC Egress with 10x1Gb/s Tx

[Figure: block diagram of the line card egress pipeline with ten 1 Gb/s transmit ports. Blocks shown: MSF RBUF, Rx1, Rx2, Key Extract, Lookup (with TCAM), Hdr Format, SWITCH, Port Splitter, QM0-QM3, Flow Stats1, Flow Stats2, Stats (1 ME), 5x1G Tx1 (P0-P4), 5x1G Tx2 (P5-P9), MSF TBUF, RTM, Freelist, and XScale. Links between blocks are labeled NN, SCR, or SRAM; memories shown are SRAM1, SRAM2, and SRAM3. Other labels: NAT Miss Scratch Ring, NAT Pkt return, and Archive Records (Flow Stats2 to XScale).]

Overview of Flow Stats

• Main functions
  » Uniquely identify flows based on 6-tuple
    - Hash header values to get an index into a table of records
  » Maintain packet and byte counts for each flow
    - Compare packet header with header values in record, and increment the counts if they match
    - Otherwise, follow hash chain until correct record is found
  » Send flow information to XScale for archiving every five minutes
• Secondary functions
  » Maintain hash table
    - Identify and remove flows that are no longer active
    - Invalid flows are removed so memory can be reused

Design Considerations

• Efficiently maintaining a hash table with chained collisions
  » Efficiently inserting and deleting records
• Efficiently reading hash table records
• Synchronization issues
  » Multiple threads modifying hash table and chains

Flow Record

• Total Record Size = 8 32-bit words
  » V is valid bit
    - Only needed at head of chain
    - '1' for valid record
    - '0' for invalid record
  » Start timestamp (16 bits) is set when record starts counting flow
    - Reset to zero when record is archived
  » End timestamp (16 bits) is set each time a packet is seen for the given flow
  » Packet and Byte counters are incremented for each packet on the given flow
    - Reset to zero when record is archived
  » Next Record Number is next record in hash chain
    - 0x1FFFF if record is tail
    - Address of next record = (next_record_num * record_size) + collision_table_base_addr

Record layout (one 32-bit word per row; fields and widths):
  LW0  Source Address (32b)
  LW1  Destination Address (32b)
  LW2  SrcPort (16b), DestPort (16b)
  LW3  Protocol (8b), Reserved (12b), Slice ID (12b)
  LW4  V (1b), Reserved (14b), Next Record Number (17b)
  LW5  Packet Counter (32b)
  LW6  Byte Counter (32b)
  LW7  Start Timestamp (16b), End Timestamp (16b)
The slide's legend marks which fields are members of the 6-tuple.

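To make the 8-word layout concrete, here is a minimal Python sketch that packs and unpacks one 32-byte record and computes the address of the next record in a chain. The slide gives only field widths, so the bit positions inside LW3 and LW4, the big-endian word order, and every function and key name are assumptions made for illustration.

```python
import struct

HASH_CHAIN_TAIL = 0x1FFFF  # Next Record Number value that marks the tail

def pack_record(rec):
    """Pack a flow record dict into 8 big-endian 32-bit words (32 bytes)."""
    lw2 = ((rec["src_port"] & 0xFFFF) << 16) | (rec["dst_port"] & 0xFFFF)
    # Assumed LW3 order: Protocol (8b) | Reserved (12b) | Slice ID (12b)
    lw3 = ((rec["protocol"] & 0xFF) << 24) | (rec["slice_id"] & 0xFFF)
    # Assumed LW4 order: V (1b) | Reserved (14b) | Next Record Number (17b)
    lw4 = ((rec["valid"] & 1) << 31) | (rec["next"] & 0x1FFFF)
    lw7 = ((rec["start_ts"] & 0xFFFF) << 16) | (rec["end_ts"] & 0xFFFF)
    return struct.pack(">8I", rec["src_addr"], rec["dst_addr"], lw2, lw3, lw4,
                       rec["packets"], rec["bytes"], lw7)

def unpack_record(raw):
    """Inverse of pack_record; reserved bits are discarded."""
    sa, da, lw2, lw3, lw4, pkts, byts, lw7 = struct.unpack(">8I", raw)
    return {
        "src_addr": sa, "dst_addr": da,
        "src_port": lw2 >> 16, "dst_port": lw2 & 0xFFFF,
        "protocol": lw3 >> 24, "slice_id": lw3 & 0xFFF,
        "valid": lw4 >> 31, "next": lw4 & 0x1FFFF,
        "packets": pkts, "bytes": byts,
        "start_ts": lw7 >> 16, "end_ts": lw7 & 0xFFFF,
    }

def next_record_address(next_record_num, record_size, collision_table_base_addr):
    """Address formula from the slide."""
    return next_record_num * record_size + collision_table_base_addr
```

Keeping the record at exactly eight words is what lets it be fetched in a single SRAM read, which is the constraint the timestamp slide relies on.
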
Timestamp Details

• Timestamp on XScale is 64 bits
• Storing 64-bit start and end timestamps would make each flow record too large for a single SRAM read
• Instead, store only the 16 bits of each timestamp required to represent a five-minute time interval
  » Clock frequency = 1.4 GHz
  » Timestamp increments every 16 clock cycles
  » Use bits 41:26 for 16-bit timestamps
    - Bit 26: (2^26 * 16 cycles) / 1.4 GHz = 0.767 seconds
    - Bit 41: (2^41 * 16 cycles) / 1.4 GHz = 25131.69 seconds (418 minutes)
  » Time interval that can be represented using these bits
    - 0.767 seconds through 418 minutes

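The bit selection above can be checked with a short Python sketch. It extracts bits 41:26 from a 64-bit tick count and converts 16-bit units back to seconds using the clock figures from the slide. The function names are illustrative, and the wrap-aware subtraction in `elapsed_units` is an assumption about how a consumer might compare two stored values, not something the slides describe.

```python
CLOCK_HZ = 1.4e9       # clock frequency from the slide
CYCLES_PER_TICK = 16   # the 64-bit timestamp increments every 16 cycles
TS_SHIFT = 26          # lowest timestamp bit kept in the 16-bit field

def ts16(ts64):
    """Return bits 41:26 of the 64-bit timestamp."""
    return (ts64 >> TS_SHIFT) & 0xFFFF

def units_to_seconds(units):
    """Convert a count of 16-bit timestamp units to seconds."""
    return units * (1 << TS_SHIFT) * CYCLES_PER_TICK / CLOCK_HZ

def elapsed_units(start, end):
    """Difference between two 16-bit timestamps, tolerating one wrap (assumed)."""
    return (end - start) & 0xFFFF

print(round(units_to_seconds(1), 3))        # weight of bit 26: 0.767
print(round(units_to_seconds(1 << 15), 2))  # weight of bit 41: 25131.69
```

With these figures the 16-bit field as a whole wraps after 2^16 units, roughly 50263 seconds, so a five-minute archiving interval fits comfortably.
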
Hash Table Memory

• Allocating 4 MBytes in SRAM Channel 3 for hash table
  » Supports ~130K records
  » Divided memory 75% for the main table and 25% for the collision table
  » Memory required = Main_table_size + Collision_table_size
    - 0.75 * (#records * #bytes/record) + 0.25 * (#records * #bytes/record)
    - ~98K records + ~32K records
    - ~3 MBytes + ~1 MByte
• Space for main table and collision table can be adjusted to tune performance
  » Larger main table means fewer collisions, but still need adequate space for collision table

[Figure: memory split between Main Table (~75%) and Collision Table (~25%).]

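The sizing arithmetic can be reproduced with a few lines of Python. This is only a sketch of the upper bound implied by a 4 MByte region and 32-byte records; the function name and the `main_fraction` parameter are illustrative. The module constants listed at the end of the deck use slightly smaller totals (130688 records overall, 32384 in the collision table).

```python
TABLE_BYTES = 4 * 1024 * 1024  # 4 MBytes in SRAM channel 3
RECORD_BYTES = 8 * 4           # 8 32-bit words per record

def split_table(total_bytes, record_bytes, main_fraction=0.75):
    """Return (total, main, collision) record counts for a given split."""
    total = total_bytes // record_bytes
    main = int(total * main_fraction)
    return total, main, total - main

total, main, collision = split_table(TABLE_BYTES, RECORD_BYTES)
print(total, main, collision)   # 131072 98304 32768
print(main * RECORD_BYTES)      # 3145728 bytes, i.e. 3 MBytes for the main table
```

Changing `main_fraction` is the tuning knob the slide refers to: a larger main table lowers the collision rate but leaves fewer slots for chained records.
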
Inserting Into Hash Table

• IXP has 3 different hash functions (48-bit, 64-bit, 128-bit)
  » Using the 64-bit hash function is sufficient and takes less time than the 128-bit hash function
    - Source Addr and Protocol are not included in the hash input
    - HASH(D.Addr, S.Port, D.Port);
• Result of hash is used to address the main hash table
  » Since we want ~100K records in the main table, the result of the hash is reduced to get as close to 100K entries as possible by adding a 16-bit and a 15-bit chunk from the hash result
    - hash_result(15:0) + hash_result(30:16) = record_number
  » Records in the main table represent the head of a chain
  » If slot at head of chain is empty (valid_bit=0), store record there
  » If slot at head of chain is occupied, compare 6-tuple
    - If the 6-tuple matches
      - If packet_count == 0 then (existing flows will have a packet_count of 0 when previous packets on the flow have just been archived)
        – Increment packet_counter for record
        – Add size of current packet to byte_counter
        – Set start and end time stamps
      - If packet_count > 0 then
        – Increment packet_counter for record
        – Add size of current packet to byte_counter
        – Set end time stamp
    - If the 6-tuple doesn't match then a collision has occurred and the record needs to be stored in the collision table

[Figure: Main Table and Collision Table.]

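The index folding and the counter update rules can be sketched in Python as follows. The hash itself is computed by the IXP hash unit and is not modeled here; `hash_result` is simply treated as an integer. The function names and the dict-based record are assumptions for illustration.

```python
NUM_HASH_TABLE_RECORDS = 98304  # main-table size from the module constants

def record_number(hash_result):
    """hash_result(15:0) + hash_result(30:16), as given on the slide."""
    low16 = hash_result & 0xFFFF
    next15 = (hash_result >> 16) & 0x7FFF
    return low16 + next15

def count_packet(record, packet_len, now_ts16):
    """Update counters and timestamps for a record whose 6-tuple matched."""
    if record["packets"] == 0:
        # First packet since the record was created or last archived.
        record["start_ts"] = now_ts16
    record["packets"] += 1
    record["bytes"] += packet_len
    record["end_ts"] = now_ts16

# The largest possible index is 0xFFFF + 0x7FFF = 98302, inside the main table.
print(record_number(0xFFFFFFFFFFFFFFFF))  # 98302
```

Adding two chunks avoids a division or modulo on the microengine. One consequence is that the sum of two uniformly distributed chunks is not itself uniform, so indices at the low and high ends of the table are selected less often than those in the middle.
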
Hash Collisions

• Hash collisions are chained in a linked list
  » Head of list is in the main table
  » Remainder of list is in collision table
• SRAM ring maintains list of free slots in collision table
  » Slots are numbered from 0 to #_Collision_Table_Slots
    - Same as next_record_number
    - To convert to memory address: (slot_num * record_size) + collision_table_base_addr
  » When a collision occurs, a pointer to an open slot in the collision table can be retrieved from the SRAM ring
  » When a record is removed from the collision table, a pointer is returned to the SRAM ring for the invalidated slot

[Figure: Main Table and Collision Table, with the SRAM ring free list.]

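A minimal Python model of the free-slot handling is sketched here. A `collections.deque` stands in for the SRAM ring, and slots are numbered 0 through `num_slots - 1`; both choices, along with the class and method names, are assumptions for illustration rather than the module's actual data structures.

```python
from collections import deque

TAIL = 0x1FFFF     # next_record_number value marking the end of a chain
RECORD_SIZE = 32   # bytes per record

def slot_address(slot_num, collision_table_base_addr):
    """(slot_num * record_size) + collision_table_base_addr, from the slide."""
    return slot_num * RECORD_SIZE + collision_table_base_addr

class CollisionTable:
    def __init__(self, num_slots):
        self.free = deque(range(num_slots))  # stand-in for the SRAM ring
        self.slots = [None] * num_slots

    def alloc(self, record):
        """Take a free slot, store the record there as a chain tail."""
        slot = self.free.popleft()           # IndexError if the table is full
        record["next"] = TAIL
        self.slots[slot] = record
        return slot

    def chain_after(self, prev, record):
        """Append record to the chain by linking it after prev."""
        slot = self.alloc(record)
        prev["next"] = slot
        return slot

    def release(self, slot):
        """Invalidate a slot and hand its number back to the free list."""
        self.slots[slot] = None
        self.free.append(slot)
```

Because the free list holds slot numbers rather than addresses, the same 17-bit value can be stored directly in a record's Next Record Number field.
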
Archiving Hash Table Records

• Send all valid records in hash table to XScale for archiving every 5 minutes
• For each record in the main table (i.e. start of chain) ...
  » For each record in hash chain ...
    - If record is valid ...
      - If packet count > 0 then
        – Send record to XScale via SRAM ring
        – Set packet count to 0
        – Set byte count to 0
        – Leave record in table
      - If packet count == 0 then
        – Flow has already been archived
        – No packet has arrived on flow in 5 minutes
        – Record is no longer valid
        – Delete record from hash table to free memory

Info sent to XScale for each flow every 5 minutes (one 32-bit word per row; fields and widths):
  LW0  Source Address (32b)
  LW1  Destination Address (32b)
  LW2  SrcPort (16b), DestPort (16b)
  LW3  Protocol (8b), Reserved (12b), Slice ID (12b)
  LW4  Packet Counter (32b)
  LW5  Byte Counter (32b)
  LW6  Start Timestamp_high (32b)
  LW7  Start Timestamp_low (32b)
  LW8  End Timestamp_high (32b)
  LW9  End Timestamp_low (32b)

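The archiving rule can be illustrated with a simplified Python model in which each hash chain is an ordinary list of record dicts and `send` is any callable that stands in for the SRAM ring to the XScale. These simplifications, and all names used, are assumptions; the pointer manipulation needed to unlink a record from the real table is not modeled here.

```python
def archive_pass(chains, send):
    """One five-minute archiving sweep over every chain."""
    for chain in chains:
        survivors = []
        for rec in chain:
            if rec["packets"] > 0:
                send(dict(rec))      # copy the record out before resetting it
                rec["packets"] = 0
                rec["bytes"] = 0
                rec["start_ts"] = 0  # timestamps are reset along with the counters
                rec["end_ts"] = 0
                survivors.append(rec)
            # packets == 0: already archived and idle for a full interval,
            # so the record is dropped from the chain.
        chain[:] = survivors

# Usage: the first sweep archives the flow, the second removes it.
chains = [[{"packets": 2, "bytes": 100, "start_ts": 1, "end_ts": 2}]]
archived = []
archive_pass(chains, archived.append)
archive_pass(chains, archived.append)
print(len(archived), chains)   # 1 [[]]
```

A flow therefore survives one idle interval before its memory is reclaimed, which avoids deleting and re-inserting records for flows that pause briefly.
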
Deleting Records from Hash Table

• While archiving records
  » If packet count is zero then remove record from hash table
    - Record has already been archived, and no packets have arrived in the last five minutes
• To remove a record
  » If ((record == head) && (record == tail))
    - Valid_bit = 0
  » Else if ((record == head) && (record != tail))
    - Replace record with record.next
    - Free the slot for the moved record
  » Else if (record != head)
    - Set previous record's next pointer to record.next
    - Free slot for the deleted record

[Figure: Main Table, Collision Table, and SRAM ring free list.]

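The three removal cases can be sketched in Python as below. The main table and collision table are modeled as lists of dicts, and a `deque` stands in for the SRAM ring free list; these structures and the function names are assumptions for illustration only.

```python
from collections import deque

TAIL = 0x1FFFF  # next value that marks the end of a chain

def delete_head(main, coll, free, head_idx):
    """Remove the record stored at the head of a chain (in the main table)."""
    head = main[head_idx]
    nxt = head["next"]
    if nxt == TAIL:
        # Head is also the tail: just mark the main-table slot invalid.
        head["valid"] = 0
    else:
        # Head has a successor: move record.next into the head slot,
        # then free the collision-table slot the moved record came from.
        main[head_idx] = coll[nxt]
        main[head_idx]["valid"] = 1
        coll[nxt] = None
        free.append(nxt)

def delete_after(prev, coll, free):
    """Remove the non-head record that follows prev in the chain."""
    slot = prev["next"]
    prev["next"] = coll[slot]["next"]  # bypass the deleted record
    coll[slot] = None
    free.append(slot)                  # return its slot to the free list
```

Moving the successor into the head slot keeps every chain anchored in the main table, so a lookup never has to start in the collision table.
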
Memory Synchronization Issues

• Multiple threads reading/writing same block of memory
• Only allow 1 ME to modify structure of hash table
  » Inserting and deleting nodes
• Use global registers to indicate that the structure of the hash table is being modified
  » Eight global lock registers (1 per thread) to indicate which chain in the hash table is being modified
  » When a thread wants to insert/delete a record from the hash table
    - Store pointer to the head of the hash chain in the thread's dedicated global lock register
    - If another thread is processing a packet that hashed to the same hash chain, it waits for the lock register to clear and restarts processing the packet
    - Otherwise, it continues processing the packet normally
    - Clear global lock register when done with insert/deletes
    - Value of 0xFFFFFFFF indicates that lock is clear

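The lock-register scheme can be modeled in a few lines of Python. This is a sketch only: the registers are a plain list, the class and method names are invented for illustration, and skipping a thread's own register when checking for conflicts is an assumption. On the microengine only one context runs at a time and switches happen at explicit swap points, which is why a plain register write can act as a lock there; this model does not attempt to reproduce that behavior.

```python
LOCK_CLEAR = 0xFFFFFFFF  # value meaning "no chain locked"
NUM_THREADS = 8          # one lock register per thread

class ChainLocks:
    def __init__(self):
        self.regs = [LOCK_CLEAR] * NUM_THREADS

    def lock(self, thread_id, chain_head_addr):
        """Announce that thread_id is modifying the chain at chain_head_addr."""
        self.regs[thread_id] = chain_head_addr

    def clear(self, thread_id):
        """Done inserting or deleting: release the chain."""
        self.regs[thread_id] = LOCK_CLEAR

    def must_restart(self, thread_id, chain_head_addr):
        """True if some other thread holds a lock on the same chain."""
        return any(reg == chain_head_addr
                   for tid, reg in enumerate(self.regs)
                   if tid != thread_id)
```

A thread would call `must_restart` after every context swap and, if it returns True, wait for the register to clear and reprocess its packet from the start.
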
Flow Stats Execution

• ME 1
  » Init - Configure hash function
  » 8 threads
    - Read packet header
    - Hash packet header
    - Send header and hash result to ME2 for processing
• ME 2 (thread numbers may need adjusting)
  » Init - Load SRAM ring with addresses for each slot in the collision table
  » Init - Set TIMESTAMP to 0
  » 7 threads (ctx 1-7)
    - Insert records into hash table
    - Increment counter for records
  » 1 thread (ctx 0)
    - Archive and delete hash table records

Diagram of Flow Stats Execution (ME1)

[Figure: flowchart of ME1 per-packet processing. Steps and latencies shown:]
  » Get buffer handle from QM (60 cycles)
  » Read packet header from DRAM (300 cycles)
  » Read buffer descriptor from SRAM (150 cycles)
  » Send buffer handle to TX (60 cycles)
  » Build hash key (~50 cycles)
  » Compute hash (100 cycles)
  » Send packet info to ME2 (60 cycles)
  » Span labels on the figure: 300 cycles, 300 cycles, ~570 cycles

Diagram of Flow Stats Execution (ME2)

• Incrementing counters
  » Flowchart regions: incrementing counters, iterating through hash chain, locking head of chain
  » Adds records to hash chain, but doesn't remove them
  » Best: ~360 cycles
  » Worst: ~520 + 160x cycles

[Figure: flowchart of the ME2 counter-update path. Steps and latencies shown:]
  » Get packet info from ME1 (60 cycles)
  » Read hash table record from SRAM (150 cycles)
  » valid? No: set register to lock chain, insert new record (150 cycles), clear lock register
  » valid? Yes: compare record to header (~10 cycles)
  » match? Yes, count == 0? Yes: set register to lock chain, write START/END time and new counts (150 cycles), clear lock register
  » match? Yes, count == 0? No: set register to lock chain, write END time and new counts (150 cycles), clear lock register
  » match? No, tail? Yes: set register to lock chain, get record slot from freelist (150 cycles), insert new record (150 cycles), clear lock register
  » match? No, tail? No: read next record in chain (150 cycles) and compare again (loop marked x)

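The best-case and worst-case figures can be reproduced from the per-step latencies in the flowchart. The decomposition in this Python sketch is an inference from those numbers rather than something the slide states, and it assumes that x counts the additional chain records read after the head; the constant and function names are illustrative.

```python
GET_INFO = 60     # get packet info from ME1
SRAM_ACCESS = 150 # any single SRAM read or write in the flowchart
COMPARE = 10      # compare record to header

def me2_best_case():
    """Head slot is empty: read it, then insert the new record."""
    return GET_INFO + SRAM_ACCESS + SRAM_ACCESS

def me2_worst_case(x):
    """No match anywhere: walk x extra records, then allocate and insert."""
    head = GET_INFO + SRAM_ACCESS + COMPARE
    walk = x * (SRAM_ACCESS + COMPARE)
    append = SRAM_ACCESS + SRAM_ACCESS  # freelist read + record write
    return head + walk + append

print(me2_best_case())     # 360
print(me2_worst_case(0))   # 520
print(me2_worst_case(3))   # 1000
```

Under this reading, each extra record in a chain costs about 160 cycles, which is the practical argument for giving the main table most of the memory.
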
Diagram of Flow Stats Execution (ME2)

• Archiving records
  » Flowchart regions: archiving records, waiting to archive, locking head of chain
  » Removes records from hash chain, but doesn't add them
  » Archiving of records occurs every five minutes

[Figure: flowchart of the ME2 archiving path. Steps shown:]
  » Read current time; 5 minutes? No: keep waiting
  » 5 minutes? Yes: read next record from main table
  » valid? No: move on to the next record in the main table
  » count == 0? No: send record to XScale, reset counters and timestamps
  » count == 0? Yes: delete the record
    - head of list and tail of list: set register to lock chain, set valid bit to zero, clear lock register
    - head of list but not tail: set register to lock chain, read record.next, replace record with record.next, return record.next slot to freelist, clear lock register
    - not head of list: set register to lock chain, write next_ptr to previous list item, return record slot to freelist, clear lock register
  » more records in chain? Yes: read next record in chain
  » done with all records? No: read next record from main table

Return from Swap

• When returning from each CTX switch, always check the global lock registers
  » If any of the global locks contains the address of the hash chain that the current thread is trying to modify, then the hash chain is locked and the current thread must restart processing of the current packet
  » If none of the global locks contains the address of the hash chain that the current thread is trying to modify, then the current thread can continue processing that packet as usual

[Figure: flowchart. Check global lock values; match current chain? Yes: restart processing packet. No: continue processing packet.]

SPP V1 LC Egress with 1x10Gb/s Tx

[Figure: data formats on the rings around the Flow Stats blocks. Blocks shown: QM0-QM3, Flow Stats1, Flow Stats2, 1x10G Tx1, XScale, Freelist, SRAM3; links labeled SCR, NN, and SRAM.]

Fields shown (widths as labeled on the figure):
  » Buffer handle word: V (1b, valid bit), Rsv (3b), Port (4b), Buffer Handle (24b)
  » Packet info: Source Address (32b), Destination Address (32b), SrcPort (16b), DestPort (16b), Reserved (8b), Packet Length (16b), Protocol (8b), Rsvd (3b), Hash Result (17b), Slice ID (12b)
  » Archive record (Flow Stats2 to XScale): Source Address (32b), Destination Address (32b), SrcPort (16b), DestPort (16b), Protocol (8b), Reserved (12b), Slice ID (12b), Packet Counter (32b), Byte Counter (32b), Start Timestamp_high (32b), Start Timestamp_low (32b), End Timestamp_high (32b), End Timestamp_low (32b)

Flow Statistics Module

• Scratch rings
  » QM_TO_FS_RING_1: 0x2400 - 0x27FF  // for receiving from QM
  » QM_TO_FS_RING_2: 0x2800 - 0x2BFF  // for receiving from QM
  » FS1_TO_FS2_RING: 0x2C00 - 0x2FFF  // for sending data from FS1 to FS2
  » FS_TO_TX_RING_1: 0x3000 - 0x33FF  // for sending data to TX1
  » FS_TO_TX_RING_2: 0x3400 - 0x37FF  // for sending data to TX2
• SRAM rings
  » FS2_FREELIST: 0x???? - 0x????  // stores list of open slots in collision table
  » FS2_TO_XSCALE: 0x???? - 0x????  // for sending record information to the XScale for archiving
• LC Egress SRAM Channel 3 info for Flow Stats
  » HASH_CHAIN_TAIL              0x1FFFF     // indicates the end of a hash chain
  » ARCHIVE_DELAY                0x0188      // 5 minutes
  » RECORD_SIZE                  8 * 4 = 32  // 8 32-bit words/record * 4 bytes/word
  » TOTAL_NUM_RECORDS            130688      // MAX with 4 MB table is ~130K records
  » NUM_HASH_TABLE_RECORDS       98304       // NUM_HASH_TABLE_RECORDS<=TOTAL_NUM_RECORDS (mod 32 = 0)
  » NUM_COLLISION_TABLE_RECORDS  TOTAL_NUM_RECORDS - NUM_HASH_TABLE_RECORDS = 32384
  » LCE_FS_HASH_TABLE_BASE       SRAM_CHANNEL_3_BASE_ADDR + 0x200000 = 0xC0200000
  » LCE_FS_HASH_TABLE_SIZE       0x400000
  » LCE_FS_COLLISION_TABLE_BASE  (HASH_TABLE_BASE + (RECORD_SIZE * NUM_HASH_TABLE_RECORDS)) = 0xC0500000

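The constants above can be cross-checked with a short Python sketch. `SRAM_CHANNEL_3_BASE_ADDR` is derived here from the stated result (0xC0200000 minus 0x200000), and reading ARCHIVE_DELAY as a count of 16-bit timestamp units (bits 41:26 of a counter that ticks every 16 cycles at 1.4 GHz) is an assumption, though one that the arithmetic supports. The helper name `archive_delay_seconds` is illustrative.

```python
SRAM_CHANNEL_3_BASE_ADDR = 0xC0000000  # implied by 0xC0200000 - 0x200000

HASH_CHAIN_TAIL = 0x1FFFF
ARCHIVE_DELAY = 0x0188
RECORD_SIZE = 8 * 4
TOTAL_NUM_RECORDS = 130688
NUM_HASH_TABLE_RECORDS = 98304
NUM_COLLISION_TABLE_RECORDS = TOTAL_NUM_RECORDS - NUM_HASH_TABLE_RECORDS

LCE_FS_HASH_TABLE_BASE = SRAM_CHANNEL_3_BASE_ADDR + 0x200000
LCE_FS_HASH_TABLE_SIZE = 0x400000
LCE_FS_COLLISION_TABLE_BASE = (LCE_FS_HASH_TABLE_BASE
                               + RECORD_SIZE * NUM_HASH_TABLE_RECORDS)

def archive_delay_seconds():
    """ARCHIVE_DELAY in seconds, assuming 16-bit timestamp units."""
    return ARCHIVE_DELAY * (1 << 26) * 16 / 1.4e9

print(NUM_COLLISION_TABLE_RECORDS)             # 32384
print(hex(LCE_FS_COLLISION_TABLE_BASE))        # 0xc0500000
print(TOTAL_NUM_RECORDS * RECORD_SIZE <= LCE_FS_HASH_TABLE_SIZE)  # True
print(round(archive_delay_seconds()))          # 301
```

The last line shows why 0x0188 (392 units of about 0.767 seconds each) corresponds to the five-minute archiving interval.
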
End