The HENP Internet2 WG

HENP Working Group
High TCP performance over wide area networks
Arlington, VA
May 8, 2002
Sylvain Ravot <[email protected]>
CalTech
HENP WG Goal #3
Share information and provide advice on the
configuration of routers, switches, PCs and
network interfaces, and network testing and
problem resolution, to achieve high
performance over local and wide area
networks in production.
Slide 2
Overview
• TCP
• TCP congestion avoidance algorithm
• TCP parameters tuning
• Gigabit Ethernet adapter performance
Slide 3
TCP Algorithms
• Connection opening: cwnd = 1 segment, entering Slow Start.
• Slow Start: exponential increase of cwnd until cwnd = SSTHRESH, then move to Congestion Avoidance.
• Congestion Avoidance: additive increase of cwnd.
• 3 duplicate ACKs received: SSTHRESH := cwnd/2, move to Fast Recovery (exponential increase beyond cwnd); when the expected ACK is received, cwnd := cwnd/2 and return to Congestion Avoidance.
• Retransmission timeout (from any state): SSTHRESH := cwnd/2, cwnd := 1 segment, return to Slow Start.
Slide 4
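The state machine above can be summarized by a minimal sketch of Reno-style window updates. This is only an illustration under simplifying assumptions (per-ACK and per-loss events, no SACK, no timers, no receiver window); the MSS value, initial ssthresh and event handlers are ours, not part of the slides.

/* Minimal sketch of Reno-style congestion control updates (simplified). */
#include <stdio.h>

#define MSS 1460  /* segment size in bytes (typical Ethernet MSS, assumed) */

struct tcp_cc {
    double cwnd;      /* congestion window, in bytes */
    double ssthresh;  /* slow-start threshold, in bytes */
};

/* Called for each new ACK that advances the window. */
static void on_ack(struct tcp_cc *cc)
{
    if (cc->cwnd < cc->ssthresh)
        cc->cwnd += MSS;                          /* slow start: exponential growth per RTT */
    else
        cc->cwnd += (double)MSS * MSS / cc->cwnd; /* congestion avoidance: ~1 MSS per RTT  */
}

/* Called when three duplicate ACKs are received (fast retransmit/recovery). */
static void on_triple_dupack(struct tcp_cc *cc)
{
    cc->ssthresh = cc->cwnd / 2;
    cc->cwnd = cc->ssthresh;                      /* multiplicative decrease, stay near CA */
}

/* Called when the retransmission timer expires. */
static void on_timeout(struct tcp_cc *cc)
{
    cc->ssthresh = cc->cwnd / 2;
    cc->cwnd = MSS;                               /* restart from one segment in slow start */
}

int main(void)
{
    struct tcp_cc cc = { MSS, 64 * 1024 };        /* cwnd = 1 segment, arbitrary initial ssthresh */
    for (int i = 0; i < 100; i++)
        on_ack(&cc);
    printf("cwnd after 100 ACKs: %.0f bytes\n", cc.cwnd);
    on_triple_dupack(&cc);
    printf("cwnd after 3 dup ACKs: %.0f bytes\n", cc.cwnd);
    on_timeout(&cc);
    printf("cwnd after a timeout: %.0f bytes\n", cc.cwnd);
    return 0;
}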
TCP Congestion Avoidance behavior (I)
• Assumptions
  • The time spent in slow start is neglected
  • The time to recover a loss is neglected
  • No buffering (max. congestion window size = bandwidth-delay product)
  • Constant RTT
[Figure: cwnd sawtooth oscillating between W/2 and W; the x-axis is time in RTTs.]
• The congestion window is opened at the constant rate of one segment per RTT, so each cycle lasts W/2 RTTs (with W expressed in segments).
• The data transferred per cycle is the area under the curve; since the window averages 3W/4 over a cycle, the average throughput is 3W/(4·RTT).
Slide 5
Example
• Assumptions
  • Bandwidth = 600 Mbps
  • RTT = 170 ms (CERN – CalTech)
  • BDP = 12.75 Mbytes
  • Cycle = 12.3 minutes
• Time to transfer 10 Gbytes?
[Figure: cwnd sawtooth between W/2 and W over one 12.3-minute cycle, shown for two different initial SSTHRESH values; the x-axis is time in RTTs.]
• 3.8 minutes to transfer 10 Gbytes if cwnd = 6.45 Mbytes at the beginning of the congestion avoidance state (throughput = 350 Mbps).
• 2.4 minutes to transfer 10 Gbytes if cwnd = 12.05 Mbytes at the beginning of the congestion avoidance state (throughput = 550 Mbps).
Slide 6
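To make these numbers easy to reproduce, here is a rough back-of-the-envelope sketch in C. It assumes an MSS of 1460 bytes and a cwnd that grows by exactly one MSS per RTT with no loss during the transfer; these assumptions are ours, so the output only approximates the figures on the slide (it reproduces the cycle length and the first transfer-time case).

/* Back-of-the-envelope check of the slide's numbers (approximate). */
#include <stdio.h>

int main(void)
{
    const double bw  = 600e6;   /* bottleneck bandwidth, bit/s */
    const double rtt = 0.170;   /* round-trip time, s          */
    const double mss = 1460.0;  /* segment size, bytes (assumed) */

    double bdp   = bw * rtt / 8.0;      /* bandwidth-delay product, bytes */
    double w     = bdp / mss;           /* maximum window, in segments    */
    double cycle = (w / 2.0) * rtt;     /* W/2 RTTs to grow from W/2 to W */
    printf("BDP   = %.2f Mbytes\n", bdp / 1e6);
    printf("cycle = %.1f minutes\n", cycle / 60.0);

    /* Time to send 10 Gbytes starting congestion avoidance at cwnd = 6.45 Mbytes. */
    double total = 10e9, sent = 0.0, cwnd = 6.45e6, t = 0.0;
    while (sent < total) {
        sent += cwnd;                   /* one cwnd of data per RTT        */
        if (cwnd < bdp) cwnd += mss;    /* additive increase, capped at W  */
        t += rtt;
    }
    printf("10 Gbytes from cwnd = 6.45 Mbytes: %.1f minutes (avg %.0f Mbps)\n",
           t / 60.0, total * 8.0 / t / 1e6);
    return 0;
}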
TCP Congestion Avoidance behavior (II)
• We now take the buffering space into account.
[Figure: cwnd sawtooth between W/2 and W, where W = BDP + buffering capacity; Area #1 is the part of the cycle where cwnd < BDP, Area #2 the part where cwnd > BDP.]
• Area #1
  • Cwnd < BDP => Throughput < Bandwidth
  • RTT constant
  • Throughput = Cwnd / RTT
• Area #2
  • Cwnd > BDP => Throughput = Bandwidth
  • RTT increases (proportionally to cwnd)
Slide 7
Tuning
• Keep the congestion window size in the yellow area:
  • Limit the maximum congestion window size to avoid loss
  • Smaller backoff
[Figure: two cwnd vs. time plots with the BDP and W levels marked; the yellow area lies between BDP and W.]
• Limit the maximum congestion avoidance window size
  • In the application
  • In the OS
  • By limiting the maximum congestion avoidance window size and setting a large initial ssthresh, we reached 125 Mbps throughput between CERN and Caltech and 143 Mbps between CERN and Chicago across the 155 Mbps transatlantic link.
• Smaller backoff
  • TCP multi-streams (see the sketch below)
  • After a loss: Cwnd := Cwnd × back_off, with 0.5 < back_off < 1
Slide 8
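One way to see why multiple parallel streams behave like a single stream with a smaller backoff: if a single loss hits one of N equal streams, only that stream halves its window, so the aggregate window shrinks by a factor of 1 - 1/(2N). The sketch below simply prints this factor; the model (one loss at a time, equal streams) is our simplification and is not something measured on the slides.

/* Sketch: effective aggregate backoff of N equal parallel TCP streams when
 * a single loss halves the window of only one stream (simplifying assumption). */
#include <stdio.h>

int main(void)
{
    for (int n = 1; n <= 16; n *= 2) {
        double effective_backoff = 1.0 - 1.0 / (2.0 * n);
        printf("%2d stream(s): aggregate cwnd after one loss = %.3f x aggregate cwnd before\n",
               n, effective_backoff);
    }
    return 0;
}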
Tuning TCP parameters
• Buffer space that the kernel allocates for each socket:
  • Kernel 2.2:
    echo 262144 > /proc/sys/net/core/rmem_max
    echo 262144 > /proc/sys/net/core/wmem_max
  • Kernel 2.4:
    echo "4096 87380 4194304" > /proc/sys/net/ipv4/tcp_rmem
    echo "4096 65536 4194304" > /proc/sys/net/ipv4/tcp_wmem
  • The three values are respectively the min, default, and max.
• Socket buffer settings:
  • setsockopt() with SO_RCVBUF and SO_SNDBUF (see the sketch below)
  • Has to be set after calling socket() but before bind()
  • Kernel 2.2: the default value is 32 KB
  • Kernel 2.4: the default values can be set in /proc/sys/net/ipv4 (see above)
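A minimal sketch of the per-socket setting, assuming a Linux host; the 4 Mbyte size is only an example and must stay below the kernel maximum configured above.

/* Sketch: enlarge the socket buffers from the application. */
#include <stdio.h>
#include <sys/socket.h>
#include <unistd.h>

int main(void)
{
    int buf = 4 * 1024 * 1024;               /* desired buffer size, bytes (illustrative) */
    int s = socket(AF_INET, SOCK_STREAM, 0);
    if (s < 0) { perror("socket"); return 1; }

    /* Set the buffers right after socket(), before bind()/connect(),
     * so that the TCP window scale option is negotiated accordingly. */
    if (setsockopt(s, SOL_SOCKET, SO_SNDBUF, &buf, sizeof(buf)) < 0)
        perror("setsockopt SO_SNDBUF");
    if (setsockopt(s, SOL_SOCKET, SO_RCVBUF, &buf, sizeof(buf)) < 0)
        perror("setsockopt SO_RCVBUF");

    int got; socklen_t len = sizeof(got);
    getsockopt(s, SOL_SOCKET, SO_SNDBUF, &got, &len);
    printf("SO_SNDBUF is now %d bytes\n", got); /* Linux reports twice the requested value */

    close(s);
    return 0;
}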
• Initial SSTHRESH
[Figure: connection opening (cwnd = 1 segment) → Slow Start (exponential increase of cwnd until cwnd = SSTHRESH) → Congestion Avoidance (additive increase of cwnd).]
  • Set the initial ssthresh to a value larger than the bandwidth-delay product.
  • There is no parameter to set this value in Linux 2.2 and 2.4 => modified Linux kernel.
Slide 9
Gigabit Ethernet NIC performance
• NICs tested
  • 3Com: 3C996-T
  • Syskonnect: SK-9843 SK-NET GE SX
  • Intel: PRO/1000 T and PRO/1000 XF
• 32-bit and 64-bit PCI motherboards
• Measurements
  • Back-to-back Linux PCs
  • Latest drivers available
  • TCP throughput
  • Two different tests: Iperf and gensink. Gensink is a tool written at CERN for benchmarking TCP network performance.
  • Performance measurement with Iperf:
    • We ran 10 consecutive TCP transfers of 20 seconds each. Using the time command, we measured the CPU utilization.
    • [root@pcgiga-2]# time iperf -c pcgiga-gbe -t 20
    • We report the min/avg/max throughput of the 10 transfers.
  • Performance measurement with gensink:
    • We ran transfers of 10 Gbytes. Gensink allows us to measure the throughput and the CPU utilization over the last 10 Mbytes transmitted.
Slide 10
Syskonnect - SX, PCI 32 bit 33 MHz
• Setup:
  • GbE adapter: SK-9843 SK-NET GE SX; driver included in the kernel
  • CPU: PIV (1500 MHz); PCI: 32 bit, 33 MHz
  • Motherboard: Intel D850GB
  • RedHat 7.2, kernel 2.4.17
• Iperf test:
            Throughput (Mbps)   CPU utilization (%)   CPU utilization per Mbps (% / Mbps)
  Min.      443                 44.5                  0.100
  Max.      449                 50                    0.111
  Average   428.9               46.4                  0.103
• Gensink test:
[Figure: TCP throughput (Mbit/s) and CPU utilization (sec/Mbyte) vs. data transferred (Mbyte).]
  Throughput min / avg / max = 256 / 448 / 451 Mbps
  CPU utilization average = 0.097 sec/Mbyte
Slide 11
Intel - SX, PCI 32 bit 33 MHz
• Setup:
  • GbE adapter: Intel PRO/1000 XF; driver e1000, version 4.1.7
  • CPU: PIV (1500 MHz); PCI: 32 bit, 33 MHz
  • Motherboard: Intel D850GB
  • RedHat 7.2, kernel 2.4.17
• Iperf test:
            Throughput (Mbps)   CPU utilization (%)   CPU utilization per Mbps (% / Mbps)
  Min.      601                 48.5                  0.081
  Max.      607                 53                    0.087
  Average   605.5               52                    0.086
• Gensink test:
[Figure: CPU utilization (sec/Mbyte) vs. data transferred (Mbyte).]
  Throughput min / avg / max = 380 / 609 / 631 Mbps
  CPU utilization average = 0.040 sec/Mbyte
Slide 12
3Com - Cu, PCI 64 bit 66 MHz
• Setup:
  • GbE adapter: 3C996-T; driver bcm5700, version 2.0.18
  • CPU: 2 x AMD Athlon MP; PCI: 64 bit, 66 MHz
  • Motherboard: dual AMD Athlon MP motherboard
  • RedHat 7.2, kernel 2.4.7
• Iperf test:
            Throughput (Mbps)   CPU utilization (%)   CPU utilization per Mbps (% / Mbps)
  Min.      835                 43.8                  0.052
  Max.      843                 51.5                  0.061
  Average   838                 46.9                  0.056
• Gensink test:
[Figure: TCP throughput (Mbit/s) vs. data transferred (Mbyte).]
  Throughput min / avg / max = 232 / 889 / 945 Mbps
  CPU utilization average = 0.0066 sec/Mbyte
Slide 13
Intel - Cu, PCI 64 bit 66 MHz
• Setup:
  • GbE adapter: Intel PRO/1000 T; driver e1000, version 4.1.7
  • CPU: 2 x AMD Athlon MP; PCI: 64 bit, 66 MHz
  • Motherboard: dual AMD Athlon MP motherboard
  • RedHat 7.2, kernel 2.4.7
• Iperf test:
            Throughput (Mbps)   CPU utilization (%)   CPU utilization per Mbps (% / Mbps)
  Min.      813                 41                    0.050
  Max.      873                 47.5                  0.054
  Average   846.1               44.5                  0.053
• Gensink test:
[Figure: TCP throughput (Mbit/s) and CPU utilization (sec/Mbyte) vs. data transferred (Mbyte).]
  Throughput min / avg / max = 429 / 905 / 943 Mbps
  CPU utilization average = 0.0065 sec/Mbyte
Slide 14
Intel - SX, PCI 64 bit 66 MHz
• Setup:
  • GbE adapter: Intel PRO/1000 XF; driver e1000, version 4.1.7
  • CPU: 2 x AMD Athlon MP; PCI: 64 bit, 66 MHz
  • Motherboard: dual AMD Athlon MP motherboard
  • RedHat 7.2, kernel 2.4.7
• Iperf test:
            Throughput (Mbps)   CPU utilization (%)   CPU utilization per Mbps (% / Mbps)
  Min.      828                 43.2                  0.052
  Max.      877                 49.1                  0.056
  Average   854                 45.8                  0.054
• Gensink test:
[Figure: TCP throughput (Mbit/s) and CPU utilization (sec/Mbyte) vs. data transferred (Mbyte).]
  Throughput min / avg / max = 222 / 799 / 940 Mbps
  CPU utilization average = 0.0062 sec/Mbyte
Slide 15
Syskonnect - SX, PCI 64 bit 66 MHz
• Setup:
  • GbE adapter: SK-9843 SK-NET GE SX; driver included in the kernel
  • CPU: 2 x AMD Athlon MP; PCI: 64 bit, 66 MHz
  • Motherboard: dual AMD Athlon MP motherboard
  • RedHat 7.2, kernel 2.4.7
• Iperf test:
            Throughput (Mbps)   CPU utilization (%)   CPU utilization per Mbps (% / Mbps)
  Min.      874                 67.5                  0.077
  Max.      909                 69                    0.076
  Average   894.9               67.9                  0.076
• Gensink test:
[Figure: TCP throughput (Mbit/s) and CPU utilization (sec/Mbyte) vs. data transferred (Mbyte).]
  Throughput min / avg / max = 146 / 936 / 947 Mbps
  CPU utilization average = 0.0083 sec/Mbyte
Slide 16
Summary
• 32-bit PCI bus
  • The Intel NICs achieved the highest throughput (600 Mbps) with the smallest CPU utilization. The Syskonnect NICs achieved only 450 Mbps with a higher CPU utilization.
• 32-bit vs. 64-bit PCI bus
  • A 64-bit PCI bus is needed to get high throughput:
    • We doubled the throughput by moving the Syskonnect NICs from the 32-bit to the 64-bit PCI bus.
    • We increased the throughput by 300 Mbps by moving the Intel NICs from the 32-bit to the 64-bit PCI bus.
• 64-bit PCI bus
  • The Syskonnect NICs achieved the highest throughput (930 Mbps), with the highest CPU utilization.
  • The performance of the Intel NICs is unstable.
  • The 3Com NICs are a good compromise between stability, performance, CPU utilization and cost. Unfortunately, we could not test the 3Com NIC with a fiber connector.
• Cu vs. fiber connector
  • We could not measure any significant differences.
• Strange behavior of the Intel NICs: the throughput they achieve is unstable.
Slide 17
Questions?
Slide 18