Slides - TNC 2004

On the joint use of new TCP
proposals and IP-QOS on
High bandwidth-RTT product
paths
Andrea Di Donato
Centre of Excellence on Networked Systems
University College London, United Kingdom
TERENA 2004 - Rhodes
Pillars….
 Case study
 A few users require reliable bulk transcontinental data transfer at multi-Gb/s speed
 Why TCP?
 Provides reliable data transfer of large data sets
 Allows the use of existing transport programs, e.g. FTP, GridFTP, etc.
TCP’s issues for
Transcontinental Gbps transfers
 TCP's congestion control algorithm, additive increase multiplicative decrease (AIMD),
 performs poorly over paths with a high bandwidth-round-trip-time product (BW*RTT)
 the stability of a TCP aggregate
 is even more precarious if the aggregate consists of only a few flows
This is exactly the case for the scientific community, where…
– Few users
– Reliable bulk data transfer
– Multi-Gb/s speed
Flow-level stability condition for TCP/RED [Low]:
ρ · (c³τ³)/(4N⁴) · [ cτ/(2N) + 1 + 1/(αcτ) ] < π(1−β)² / √[ 4β² + π²(1−β)² ]
[Low] Linear Stability of TCP/RED and a Scalable Control. Steven H. Low, Fernando Paganini, Jiantao Wang, John C. Doyle, May 6, 2003
So…..let’s try different TCPs
Standard, HS-TCP and Scalable
TCP’s AIMDs
 The Standard TCP AIMD algorithm is as follows:
– on the receipt of an acknowledgement,
• W’ = w + 1/w ; w -> congestion window
– CWND increases by at most one segment per RTT
– and in response to a congestion event,
• W’ = 0.5 * w
 The modified HighSpeedTCP AIMD algorithm is as follows:
– on the receipt of an acknowledgement,
• W’ = w + a(w)/w; higher ‘w’ gives higher a(w)
– CWND increases faster than it does for Standard TCP
– and in response to a congestion event,
• W’ = (1 -b(w))*w; higher ‘w’ gives lower b(w)
 The modified ScalableTCP AIMD algorithm is as follows
– For each acknowledgement received in a round trip time,
• Scalable TCP cwnd’ = cwnd + 0.01
– CWND increases by 1 every 100 Acks received
– and on the first detection of congestion in a given round trip time
• Scalable TCP cwnd’ = cwnd – 0.125 * cwnd
Are the new TCPs more adequate
for a trans-continental data transfer?
Test-bed layout
[Diagram: transatlantic path, Geneva – Chicago, through Juniper M10 routers. The TCP applications under test (Standard TCP, HS-TCP, Scalable TCP) compete with UDP CBR background traffic on the same path.]
Path:
– Bottleneck = 1 Gbps
– Min RTT = 118 ms
PCs:
– Supermicro 6022P-6 Dual Intel® Xeon
– Linux kernel: 2.4.20
– NIC: Sysconnect 98xx
– NIC driver: sk98lin 6.19
Standard Vs New TCP Stacks: the
inadequacy of standard TCP in fat long
pipes
[Plots: throughput and coefficient of variance (= StDev of throughput / mean throughput) versus CBR load, for each TCP stack.]
• The purpose is to show that a small number of Standard TCP flows cannot take advantage of the very large spare capacity when operating in a high-RTT environment.
• The figures show how poor the Standard (Vanilla) TCP throughput is, even when competing with very little CBR background traffic.
• For Standard (Vanilla) TCP:
– the average throughput shows little dependence on the available bandwidth, unlike the other stacks;
– this rather passive behaviour reveals that Standard TCP is weak on a non-protected path at these ranges of bandwidth-delay product.
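The coefficient-of-variance metric above is simply the standard deviation of the per-interval throughput samples divided by their mean; a minimal sketch, assuming a list of measured samples in Mbit/s:

from statistics import mean, pstdev

def throughput_metrics(samples):
    # Coefficient of variance = standard deviation / mean of the throughput samples.
    avg = mean(samples)
    return avg, (pstdev(samples) / avg if avg else float("inf"))

avg, cov = throughput_metrics([310.0, 290.0, 305.0, 295.0])   # hypothetical samples
print(f"mean = {avg:.1f} Mbit/s, CoV = {cov:.3f}")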
Are the new TCP stacks invasive?
Can these new stacks impact traditional traffic? User Impact Factor investigation: the METRIC
[Diagram: Case 1 = N Standard TCP flows. Case 2 = N flows, of which N-1 are Standard TCP flows and 1 is a new-TCP-stack flow (red). The green flow is the same Standard TCP flow measured in both cases.]
User Impact Factor (UIF) = [Avg. throughput of 1 Standard TCP flow in case 1 (green)] / [Avg. throughput of 1 Standard TCP flow in case 2 (green)]
New stacks: User Impact Factors
investigation(1)
UIF = [Avg. throughput of 1 Standard TCP flow when with N Standard TCP flows] / [Avg. throughput of 1 Standard TCP flow when with N-1 Standard TCP flows + 1 new-TCP flow]
[Plot: UIF versus number of background Vanilla TCP flows; the horizontal line at UIF = 1 marks no impact.]
UIF is greater than one for a certain range of values…
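A minimal sketch of the UIF computation, using hypothetical throughput numbers in place of the measured averages:

def user_impact_factor(thr_case1, thr_case2):
    # Case 1: avg throughput of one Standard TCP flow among N Standard flows.
    # Case 2: avg throughput of the same flow when one of the N flows is
    # replaced by a new-stack flow. UIF > 1 means the new stack penalises
    # the remaining Standard flows.
    return thr_case1 / thr_case2

print(user_impact_factor(thr_case1=95.0, thr_case2=70.0))   # hypothetical -> ~1.36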
The need for QoS….
The results obtained so far suggest we need a mechanism for segregating traffic that shares the same packet-switched path. This is in order to:
1. Protect traditional BE traffic from the possible aggression of the new TCP stacks
2. Guarantee a level of end-to-end service predictability for TCP flows sufficient to enforce a network-resource reservation system through some sort of middleware architecture (Grid)
Test-bed layout
AF: class for the TCP applications
BE: class for the legacy web traffic
[Diagram: transatlantic path, Geneva – Chicago, through Juniper M10 routers. The TCP applications (Standard TCP, HS-TCP, Scalable TCP) are marked into the AF class; 1 Gb/s of UDP CBR traffic is carried in the BE class. QoS configuration: AF gets priority and X Mbps guaranteed; BE gets (1000 - X) Mbps.]
Path:
– Bottleneck = 1 Gbps
– Min RTT = 118 ms
PCs:
– Supermicro 6022P-6 Dual Intel® Xeon
– Linux kernel: 2.4.20
– NIC: Sysconnect 98xx
– NIC driver: sk98lin 6.19
QoS tests metric used
 The QoS efficiency factor (QEF) measures how efficiently traffic uses the minimum guaranteed bandwidth reserved for its class.
– AF-QEF = AF_Throughput / AF_BW_reserved
– BE-QEF = BE_Throughput / BE_BW_reserved
 TCP convergence speed estimation
– (AF-QEF)s – (AF-QEF)
• s = steady-state phase of the test
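A minimal sketch of the two efficiency factors and of the convergence-speed estimate, using made-up numbers in place of the measured throughputs:

def qef(throughput_mbps, reserved_bw_mbps):
    # QoS Efficiency Factor: achieved throughput over the guaranteed minimum
    # bandwidth reserved for that class; close to 1 means the class fills
    # its reservation.
    return throughput_mbps / reserved_bw_mbps

af_qef_whole_test = qef(720.0, 900.0)    # hypothetical AF numbers
be_qef_whole_test = qef(98.0, 100.0)     # hypothetical BE numbers
af_qef_steady     = 0.95                 # hypothetical steady-state value

# Convergence-speed estimate: (AF-QEF in steady state) - (AF-QEF over the test).
print(af_qef_whole_test, be_qef_whole_test, af_qef_steady - af_qef_whole_test)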
BE/AF QoS Efficiency Factors (1)
Test settings:
• 100 MB socket buffer size
• Congestion moderation on
• txqueuelen and backlog = 50000
• TCP started with an offset of 30 sec
[Plots: AF-QEF and BE-QEF in steady state versus AF allocation (%).]
BE/AF QoS Efficiency Factors (2)
[Plot: (AF-QEF in steady state) – (AF-QEF over the whole test) versus AF allocation.]
This is where HS-TCP and Standard TCP differ.
Let's see some typical dynamics in time when AF is allocated 900 Mbps and BE 100 Mbps
Standard TCP with 900Mbps
guaranteed
[Figures: cwnd vs. time; AF (green) and BE (red) received throughput vs. time.]
HS-TCP with 900Mbps guaranteed
[Figures: cwnd vs. time; AF (green) and BE (red) received throughput vs. time.]
Scalable TCP with 900Mbps
guaranteed
[Figures: cwnd vs. time; AF (green) and BE (red) received throughput vs. time.]
Clamping (50MBytes Socket Buffer
Size)
Same performance regardless of the TCP stack used, since with the window clamped the transfer is no longer governed by TCP's congestion control.
[Figures: cwnd vs. time; AF (green) and BE (red) received throughput vs. time.]
Results analysis..
HS-TCP and Scalable TCP experienced (with different magnitudes) "unusual" slow-start periods:
"unusual" because no timeouts were experienced.
Zoom on the "unusual" slow-start periods experienced by all the TCP senders, with different magnitudes: we found silences before them…
[Figures: cwnd vs. time for each sender, zoomed on the silence preceding the slow start.]
QoS Results Interpretation (1):
Background - what are SACKS ?
Selective Acknowledgements (SACKs) are control
packets……
 Sent from the TCP receiver to the sender
 Informing the sender which data segments the receiver has received
 Allowing the sender to infer which data were lost and to re-transmit them
 Required to prevent TCP from going into slow start after a multiple drop – MANDATORY for high ranges of CWND!
BUT…
We know that the current SACK processing (*) is not yet optimal.
(*) although the Tom Kelly SACK patch improvement is used.
QoS results interpretation (2):
Hypothesis
Here we formulate a hypothesis from a segment-level perspective:
The burst of SACKs coming back might cause the sender host to:
 Hang (periods of silence) when AF is allocated a high fraction of the bandwidth, causing the host to reset everything and, as a consequence, making TCP enter slow start after each period of silence.
 Vanilla would not be affected by this, as its congestion-avoidance CWND gradient does not increase with higher CWND values.
 HS-TCP and Scalable would be affected by this, as their congestion-avoidance CWND gradients increase with higher CWND values.
 Scalable is the most affected, since it has the steepest congestion-avoidance gradient, which provokes a longer SACK burst than the one HS-TCP presents.
Let's try to validate this segment-level hypothesis through the use of a network solution: WRED.
This is inspired by the fact that WRED is already a solution for TCP stability, but from a flow-level perspective… remember?
Flow-level stability condition for TCP/RED [Low]:
ρ · (c³τ³)/(4N⁴) · [ cτ/(2N) + 1 + 1/(αcτ) ] < π(1−β)² / √[ 4β² + π²(1−β)² ]
What is WRED ?
WRED (Weighted Random Early Detection) is a router AQM (Active Queue Management) solution where…
 a dropping probability, as a function of the router queue occupancy, is assigned to each class queue.
[Plot: drop probability versus queue occupancy.]
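A minimal sketch of a RED-style drop profile of the kind shown in the plot; the thresholds and maximum drop probability below are arbitrary illustration values, not the profiles configured on the Juniper M10:

import random

def red_drop_probability(queue_fill, min_th=0.3, max_th=0.9, max_p=0.05):
    # queue_fill: queue occupancy as a fraction of the queue size (0..1).
    # Drop probability ramps linearly from 0 at min_th to max_p at max_th,
    # then jumps to 1 (forced drop).
    if queue_fill < min_th:
        return 0.0
    if queue_fill < max_th:
        return max_p * (queue_fill - min_th) / (max_th - min_th)
    return 1.0

def should_drop(queue_fill):
    return random.random() < red_drop_probability(queue_fill)

print(red_drop_probability(0.6))   # small probability in the mid-range ("gentle")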
WRED:
test rationale
• Dual-approach rationale – the same tool (WRED) viewed from two different perspectives:
 Flow-level perspective (this is what control theory says at flow level):
The oscillations are the effect of protocol instability. This being the case, a gentle (small ρ) Weighted Random Early Detection (WRED) drop profile should help satisfy the inequality governing the stability condition of the TCP flows [Low].
 Segment-level perspective (our rationale):
A smoother distribution of the loss pattern along the AF scheduler buffer would help reduce the LENGTH of the bursts of SACKs coming back, and would therefore help not to stall the sender.
This, if successful, would validate the hypothesis made before!
[Low] Linear Stability of TCP/RED and a Scalable Control. Steven H. Low, Fernando Paganini, Jiantao Wang, John C. Doyle, May 6, 2003
WRED “test-bed”
AF: class for the TCP applications
[Diagram: transatlantic path, Geneva – Chicago, through Juniper M10 routers. The TCP applications (Standard TCP, HS-TCP, Scalable TCP) are marked into the AF class; 1 Gbps of BE UDP CBR background traffic is carried in the BE class. QoS configuration: AF gets WRED + priority and 900 Mbps guaranteed; BE gets 100 Mbps.]
Results with WRED…
[Figures: cwnd vs. time and throughput vs. time with WRED enabled.]
As expected, only the "gentle" WRED profile works. The other profiles are found to be too aggressive, inducing too much loss.
Results with WRED: summary
[Figures: cwnd and throughput vs. time for each TCP stack under the gentle WRED profile.]
Conclusions(1/5): Summary/Results
On largely empty long fat pipes, TCP as it stands invariably fails to deliver high-performance throughput.
New TCP stacks (Scalable, HS-TCP, …) are therefore felt to be necessary
……… BUT
New TCP stacks can affect the performance of traditional BE traffic:
IP-QoS is therefore tried out as a solution for
– legacy (BE) traffic protection
– service predictability
With……
 a 1 Gbps IP-QoS-enabled static long fat path, and
 the current state of the TCP Linux kernel / CPU speed,
Standard TCP performs surprisingly better than HS-TCP and Scalable, with HS-TCP much closer to Standard than to Scalable TCP.
Conclusions(2/5): results-proven
theory
SACK burst processing is not an issue for Vanilla TCP, since its congestion-avoidance gradient
does not increase with increasing CWND.
SACK burst processing is an issue for HS-TCP and Scalable TCP,
since their congestion-avoidance gradients
increase with increasing CWND, with Scalable steeper than HS-TCP.
For high values of the bandwidth allocated to HS-TCP and Scalable TCP, the steeper the CWND congestion-avoidance gradient (Scalable), the higher/longer the SACK rate/burst, which makes the sender host:
1. stall;
2. experience a period of silence every time CWND "hits" the allocated QoS pipe;
3. reset all its variables, and thus also bring the TCP connection back to slow start.
Conclusions(3/5):
problem generalisation and
remedies
 For high bandwidth allocations there is an inherently higher risk, for very aggressive transport protocols like Scalable TCP, of stalling the sender, with the effect of drastically reducing stability and thus throughput performance.
 A proven remedy is to switch off code whose misbehaviour is worse than not having it in place at all, as with Congestion Moderation.
 For even higher bandwidth allocations, the network itself needs to smooth out the loss pattern, thereby reducing the SACK rate / burst length.
 The use of a gentle WRED profile has proven extremely effective.
Conclusions(4/5)
 In a QoS-enabled high bandwidth-delay product path, Standard TCP performs better than the other protocols because
 it is protected (large buffers) – Juniper…
 CWND grows so slowly (and UNDOs + Congestion Moderation also help greatly) that it never hits the CWND limit; therefore real drops are prevented
 HS-TCP under QoS shows more or less the same performance as Standard TCP BUT, in case of a random drop (even in a protected environment there is always a BER > 0…), which would be lethal to a high-performance bulk data transfer over Standard TCP…
 HS-TCP is preferred, as by design it would react notably better.
Conclusions (5/5)
 Possible solutions to the SACK processing issue:
 Software: optimising the TCP code
 Hardware: faster CPUs
 Network: Active Queue Management (WRED) – OUR "third way" SOLUTION!
The network/AQM solution found can relax the requirement for any further optimisation of the SACK code.
However, we believe that the main limitation in using aggressive TCP protocols lies in not having enough CPU speed.
 In an environment where the available bandwidth is greater than 1 Gbit/s, the problems are definitely compounded further for Scalable and, above a certain CWND value, also for HS-TCP. This is due to
 the larger range of CWND values needed in order to utilise those higher available bandwidths;
 the bigger CWND values, which mean longer bursts of dropped packets and therefore longer SACK bursts to be processed at the sender.
publications
 “Using QoS for High Throughput TCP Transport Over Fat
Long Pipes”
 Andrea Di Donato, Yee-Ting Li, Frank Saka and Peter Clarke. In proceedings of the Second International Workshop on Protocols for Fast Long-Distance Networks, February 16-17, 2004, Argonne National Laboratory, Argonne, Illinois, USA.
 “On the joint use of new TCP proposals and IP-QoS on high
Bandwidth-RTT product paths”
 Andrea Di Donato, Frank Saka, Javier Orellana and Peter Clarke. In proceedings of the Trans-European Research and Education Networking Association (TERENA) Networking Conference 2004, 7-10 June 2004, Rhodes, Greece.
THANK YOU……….
APPENDIX
CWND and UNDOs definitions
• Congestion moderation:
– triggered by a dubious event, typically an acknowledgement covering more than 3 packets. The action taken is that CWND is set to the number of packets thought to be in flight.
• UNDOs:
– ssthresh and cwnd are reduced because a loss is thought to have happened, but the sudden reception of THE ack makes both cwnd and ssthresh reset to their previous values.
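A rough sketch of the two mechanisms as state transitions (the Linux 2.4 implementation differs in detail; class and variable names here are illustrative):

class TcpSenderState:
    def __init__(self, cwnd, ssthresh):
        self.cwnd, self.ssthresh = cwnd, ssthresh
        self.prior_cwnd = self.prior_ssthresh = None

    def enter_recovery(self):
        # Remember the pre-reduction values so a later undo is possible.
        self.prior_cwnd, self.prior_ssthresh = self.cwnd, self.ssthresh
        self.ssthresh = max(self.cwnd // 2, 2)
        self.cwnd = self.ssthresh

    def undo(self):
        # The data thought lost turn out to be acknowledged after all:
        # cwnd and ssthresh are reset to their previous values.
        if self.prior_cwnd is not None:
            self.cwnd, self.ssthresh = self.prior_cwnd, self.prior_ssthresh

    def congestion_moderation(self, packets_in_flight):
        # On a dubious event (e.g. one ACK covering more than 3 packets),
        # clamp cwnd to the estimated number of packets in flight.
        self.cwnd = min(self.cwnd, packets_in_flight)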
Four things to keep in mind when
looking at the AF-QEF results:
1. Regardless of the TCP implementation, a large bandwidth allocation means higher CWND values. This means a higher probability of dropping more packets within the same CWND.
• This, along with the smaller AF bandwidth-delay products as well as the CWND threshold, explains why all the protocols tested perform the same under small bandwidth allocations.
2. For the new TCP stacks tested, the bigger the allocated bandwidth, the higher the loss rate induced by the scheduler during congestion, since the loss recovery (congestion-avoidance gradient) for bigger CWND is more aggressive.
3. Selective ACKnowledgements (SACKs) are feedback from the receiver telling the sender which packets the receiver has. This allows the sender to deduce which packets to resend.
4. Too many SACKs (as a result of major multiple drops in a CWND) to be processed at once (a burst of SACKs) is recognised to be an issue for the current kernels and CPUs in use.
• This, along with points 1 and 2, suggests that for protocols with a steep congestion-avoidance gradient like Scalable, and for high allocated bandwidth, SACK processing becomes an issue. This explains why Scalable's performance for higher bandwidth allocations was so poor in our tests.
• Thus, the difference in performance is due to an implementation issue whose magnitude is, however, a function of the dynamics of the TCP protocol in use, and not due to the protocol design per se.
Vanilla TCP (Again)
• The standard AIMD algorithm is as follows:
– on the receipt of an acknowledgement,
• W’ = w + 1/w ; w -> congestion window
– and in response to a congestion event,
• W’ = 0.5 * w
• The increase and decrease parameters of the AIMD algorithm are fixed at 1 and 0.5
• With 1500-byte packets and a 100 ms RTT, achieving a steady-state throughput of 10 Gbps would require at most one packet drop every 1 2/3 hours
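A back-of-the-envelope check of that figure, using the standard TCP response function rate ≈ (MSS/RTT) · 1.2/√p:

mss_bits = 1500 * 8        # segment size in bits
rtt = 0.1                  # seconds
target = 10e9              # target rate, bit/s

# Solve  target = (mss_bits / rtt) * 1.2 / sqrt(p)  for the loss probability p.
p = (1.2 * mss_bits / (rtt * target)) ** 2
pkts_per_sec = target / mss_bits
hours_per_drop = 1.0 / (p * pkts_per_sec) / 3600.0
print(p, hours_per_drop)   # ~2e-10 loss probability, roughly 1 2/3 hours per drop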
• TCP's congestion control algorithm, additive increase multiplicative decrease (AIMD), performs poorly over paths with a high bandwidth-round-trip-time product, and the aggregate's stability is even more precarious if the aggregate consists of only a few flows, as is the case for the scientific community, where a few users require reliable bulk transcontinental data transfer at Gb/s speed
HS-TCP
• If congestion window <= threshold, use Standard
AIMD algorithm else use HighSpeed AIMD algorithm
• The modified HighSpeed AIMD algorithm is as
follows:
– on the receipt of an acknowledgement,
• W’ = w + a(w)/w; higher ‘w’ gives higher a(w)
– and in response to a congestion event,
• W’ = (1 -b(w))*w; higher ‘w’ gives lower b(w)
• The increase and decrease parameters vary based
on the current value of the congestion window.
• For the same packet size and RTT, a steady state
throughput of 10 Gbps can be achieved with a packet
drop rate at most once every 12 seconds
Scalable
• Slow start phase of the original TCP algorithm is
unmodified. The congestion avoidance phase is modified
as follows:
– For each acknowledgement received in a round trip time,
• Scalable TCP cwnd’ = cwnd + 0.01
– and on the first detection of congestion in a given round trip
time
• Scalable TCP cwnd’ = cwnd – 0.125 * cwnd
• Like HighSpeed TCP this has a threshold window size and
the modified algorithm is used only when the size of the
congestion window is above the threshold window size.
Though values of 0.01 and 0.125 are suggested for the
increase and decrease parameters, they (as well as
threshold window size) can be configured using the proc
file system (by a superuser).
New stacks: User Impact Factors
investigation(2)
[Plot: UIF versus number of background Vanilla TCP flows, with 10 new-TCP-stack flows; the horizontal line at UIF = 1 marks no impact.]
The same as the previous plot, but with 10 new TCP stack flows:
the UIFs are even further above 1 than in the one-new-TCP-flow scenario, which corroborates the results even more, as this is a much more realistic scenario.
IP-QOS ( a la Diff. Serv.)
• Packets are DSCP-marked at the sender host
• Two classes are defined:
– BE
– AF
• Packets are DSCP-classified at the router input ports.
• Each class is put in a physically different IP-level router output queue. Each queue is assigned a guaranteed minimum bandwidth (WRR).
– Low priority for BE
– High priority for AF
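A minimal sketch of DSCP marking at the sender host via the standard socket API; the AF code point used here (10, i.e. AF11) is an illustration, not necessarily the one configured in the tests:

import socket

def open_marked_tcp_socket(dscp):
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    tos = dscp << 2                                        # DSCP sits in the top 6 bits of the TOS byte
    s.setsockopt(socket.IPPROTO_IP, socket.IP_TOS, tos)
    return s

af_sock = open_marked_tcp_socket(dscp=10)   # marked so the routers classify it into AF
be_sock = open_marked_tcp_socket(dscp=0)    # default marking -> BE class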
Juniper M10 BW scheduler
“priority”
• The queue weight ensures the queue is provided a given minimum amount of bandwidth, which is proportional to the weight. As long as this minimum has not been served, the queue is said to have a "positive credit". Once this minimum amount is reached, the queue has a "negative credit".
• The credits are decreased immediately when a packet is sent. They are increased frequently (by the weight assigned).
• For each packet, the WRR algorithm strictly follows this queue service order:
1. High priority, positive credit queues
2. Low priority, positive credit queues
3. High priority, negative credit queues
4. Low priority, negative credit queues
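A schematic sketch of this service order (not Juniper's actual implementation); queues are assumed to carry a priority, a running credit and a backlog:

def pick_next_queue(queues):
    # queues: list of dicts with 'name', 'priority' ('high'/'low'),
    # 'credit' (float) and 'backlog' (packets waiting).
    order = [("high", True), ("low", True), ("high", False), ("low", False)]
    for prio, positive in order:
        for q in queues:
            if (q["priority"] == prio
                    and (q["credit"] > 0) == positive
                    and q["backlog"] > 0):
                return q
    return None

queues = [
    {"name": "AF", "priority": "high", "credit": 5.0,  "backlog": 10},
    {"name": "BE", "priority": "low",  "credit": -2.0, "backlog": 10},
]
print(pick_next_queue(queues)["name"])   # -> "AF"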
Router BW scheduler
benchmarking
[Diagram: BE and AF streams through the Juniper M10 under test.]
 Cards used: 1x G/E, 1000 BASE-SX REV 01
• OS version used: JUNOS 5.3R2.4