On the Joint Use of New TCP Proposals and IP-QoS on High Bandwidth-RTT Product Paths
Andrea Di Donato
Centre of Excellence on Networked Systems, University College London, United Kingdom
TERENA 2004 - Rhodes

Pillars...

Case study
• A few users require a reliable bulk transcontinental data transfer at multi-Gb/s speed.

Why TCP?
• Provides reliable data transfer of large data sets.
• Allows the use of existing transport programs, e.g. FTP, GridFTP, etc.

TCP's issues for transcontinental Gbps transfers
• TCP's congestion control algorithm, additive increase multiplicative decrease (AIMD), performs poorly over paths with a high bandwidth-round-trip-time product (BW*RTT).
• The TCP aggregate's stability is even more precarious if the aggregate consists of only a few flows.
• This is exactly the case for the scientific community, where:
  – few users
  – require reliable bulk data transfer
  – at multi-Gb/s speed.
• TCP/RED linear-stability condition from [Low], with ρ = const. the slope of the RED drop profile:
  ρ (c^3 τ^3)/(4N^4) · [ cτ/(2N) + 1 + 1/(αcτ) ] < π(1-β)^2 / √[ 4β^2 + π^2(1-β)^2 ]
[Low] "Linear stability of TCP/RED and a scalable control", Steven H. Low, Fernando Paganini, Jiantao Wang, John C. Doyle, May 6, 2003.

So... let's try different TCPs.

Standard, HS-TCP and Scalable TCP AIMDs
• The Standard TCP AIMD algorithm is as follows:
  – on the receipt of an acknowledgement,
    • w' = w + 1/w  (w -> congestion window)
    • CWND increases by at most one per RTT;
  – and in response to a congestion event,
    • w' = 0.5 * w.
• The modified HighSpeed TCP AIMD algorithm is as follows:
  – on the receipt of an acknowledgement,
    • w' = w + a(w)/w  (higher w gives higher a(w))
    • CWND increases faster than it does for Standard TCP;
  – and in response to a congestion event,
    • w' = (1 - b(w)) * w  (higher w gives lower b(w)).
• The modified Scalable TCP AIMD algorithm is as follows:
  – for each acknowledgement received in a round trip time,
    • cwnd' = cwnd + 0.01
    • CWND increases by 1 for every 100 ACKs received;
  – and on the first detection of congestion in a given round trip time,
    • cwnd' = cwnd - 0.125 * cwnd.

Are the new TCPs more adequate for a transcontinental data transfer?

Test-bed layout
• TCP applications (Standard TCP, HS-TCP, Scalable TCP) sent from Geneva to Chicago across a Juniper M10, with competing UDP CBR background traffic.
• Path bottleneck = 1 Gbps; min RTT = 118 ms.
• PCs: Supermicro 6022P-6, dual Intel Xeon; Linux kernel 2.4.20; NIC: SysKonnect SK-98xx; NIC driver: sk98lin 6.19.

Standard vs. new TCP stacks: the inadequacy of Standard TCP in fat long pipes
(Plots: throughput, and throughput coefficient of variation = StDev(throughput)/mean(throughput), against CBR load.)
• The purpose is to show that a small number of Standard TCP flows cannot take advantage of the very large spare capacity when operating in a high-RTT environment.
• The figures show how poor the Standard (Vanilla) TCP throughput is, even when competing with very small CBR background traffic.
• For Standard (Vanilla) TCP, the average throughput shows little dependence on the available bandwidth, unlike the other stacks. This rather passive behaviour reveals that Standard TCP is weak on a non-protected path at these ranges of bandwidth-delay product.

Are the new TCP stacks invasive? Can these new stacks impact traditional traffic?

User Impact Factor investigation: the metric
• Two cases are compared:
  – case 1: N Standard TCP flows;
  – case 2: N flows, i.e. N-1 Standard TCP flows plus 1 new-TCP-stack flow (the new-stack flow shown in red, the observed Standard TCP flow in green).
• User Impact Factor (UIF) =
  Avg. throughput of 1 Standard TCP flow in case 1 (with N Standard TCP flows)
  / Avg. throughput of 1 Standard TCP flow in case 2 (with N-1 Standard TCP + 1 new TCP).

New stacks: User Impact Factor investigation (1)
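As a worked illustration of this metric (not part of the original slides), the minimal Python sketch below computes the UIF from per-flow average throughputs; the throughput figures in the example are hypothetical, purely for illustration.

```python
# Minimal sketch (illustrative only): User Impact Factor (UIF) as defined above.
# The throughput figures below are hypothetical, not measurements from the tests.

def user_impact_factor(avg_std_tput_case1_mbps: float,
                       avg_std_tput_case2_mbps: float) -> float:
    """UIF = avg throughput of one Standard TCP flow among N Standard flows (case 1)
             / avg throughput of the same flow among N-1 Standard + 1 new-stack flow (case 2).
    UIF > 1 means the new stack reduces the throughput seen by legacy Standard TCP flows."""
    return avg_std_tput_case1_mbps / avg_std_tput_case2_mbps

# Example: a Standard flow averages 90 Mb/s among Standard peers, but only 60 Mb/s
# once one peer is replaced by an aggressive new-stack flow.
print(user_impact_factor(90.0, 60.0))  # 1.5 -> the new stack is invasive
```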
(Plot: UIF vs. number of background Vanilla TCP flows, with a reference line at UIF = 1.)
• UIF is greater than one for a certain range of values...

The need for QoS
• The results obtained so far suggest we need a mechanism for segregating traffic that shares the same packet-switched path, in order to:
  1. protect traditional BE traffic from the possible aggression of the new TCP stacks;
  2. guarantee a level of end-to-end service predictability for TCP flows sufficient to enforce a network-resource reservation system through some sort of middleware architecture (GRID).

Test-bed layout (with QoS)
• AF: class for the TCP applications (Standard TCP, HS-TCP, Scalable TCP); BE: class for the legacy web traffic (UDP CBR background).
• Geneva to Chicago across a Juniper M10; QoS configuration on the 1 Gb/s path: AF gets priority and X Mbps, BE gets the remaining (1000 - X) Mbps.
• Path bottleneck = 1 Gbps; min RTT = 118 ms.
• PCs: Supermicro 6022P-6, dual Intel Xeon; Linux kernel 2.4.20; NIC: SysKonnect SK-98xx; NIC driver: sk98lin 6.19.

QoS test metrics used
• QoS efficiency factor (QEF): measures how efficiently traffic utilises the minimum bandwidth reserved for it.
  – AF-QEF = AF throughput / AF bandwidth reserved
  – BE-QEF = BE throughput / BE bandwidth reserved
• TCP convergence-speed estimation: (AF-QEF)s - (AF-QEF)ts, where s = steady-state phase of the test.

BE/AF QoS efficiency factors (1)
• 100 MB socket buffer size; congestion moderation on; txqueuelen and backlog = 50000; TCP started with an offset of 30 s.
(Plots: AF-QEF and BE-QEF in steady state vs. AF allocation (%).)

BE/AF QoS efficiency factors (2)
(Plot: (AF-QEF)s - (AF-QEF)ts vs. AF allocation.)
• This is where HS-TCP and Standard TCP differ.

Let's look at some typical dynamics in time when AF is allocated 900 Mbps and BE 100 Mbps.

Standard TCP with 900 Mbps guaranteed
(Plots: cwnd vs. time; AF (green) and BE (red) received throughput vs. time.)

HS-TCP with 900 Mbps guaranteed
(Plots: cwnd vs. time; AF (green) and BE (red) received throughput vs. time.)

Scalable TCP with 900 Mbps guaranteed
(Plots: cwnd vs. time; AF (green) and BE (red) received throughput vs. time.)

Clamping (50 MByte socket buffer size)
• Same performance regardless of the TCP stack used, since with the window clamped it is "NOT" TCP congestion control that governs the rate.
(Plots: cwnd vs. time; AF (green) and BE (red) received throughput vs. time.)

Results analysis
• HS-TCP and Scalable TCP experienced (with different magnitudes) "unusual" slow-start periods; "unusual" because no timeouts were experienced.
• Zooming in on the "unusual" slow-start period, experienced by all the TCP senders but with different magnitude: we found silences beforehand.
(Plots: cwnd vs. time, zoomed.)

QoS results interpretation (1): background - what are SACKs?
• Selective Acknowledgements (SACKs) are control packets:
  – sent from the TCP receiver to the sender,
  – informing the sender which data segments the receiver has,
  – allowing the sender to infer and re-transmit lost data.
• Required to prevent TCP from going into slow start after a multiple drop - MANDATORY for high ranges of the CWND!
• BUT... we know the current SACK processing (*) is not yet optimal.
  (*) Even though the Tom Kelly SACK patch improvement is used.
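To make the role of SACK blocks concrete, here is a minimal, purely conceptual Python sketch (not Linux kernel code and not from the slides) of how a sender can combine the cumulative ACK with the receiver's SACK blocks to work out which sequence ranges the receiver has not reported holding. Sequence numbers are simplified byte offsets and wrap-around is ignored.

```python
# Conceptual sketch (not kernel code): combining the cumulative ACK with SACK blocks.
# Returns the ranges not covered by either; whether they are lost or merely still
# in flight is for the sender's loss-recovery logic to decide.

def infer_holes(cum_ack: int, sack_blocks: list, snd_nxt: int):
    """Return [start, end) ranges between cum_ack and snd_nxt that the receiver
    has not reported holding, i.e. candidates for retransmission."""
    holes = []
    expected = cum_ack
    for start, end in sorted(sack_blocks):
        if start > expected:
            holes.append((expected, start))   # gap before this SACKed block
        expected = max(expected, end)
    if expected < snd_nxt:
        holes.append((expected, snd_nxt))     # tail not yet SACKed (may be in flight)
    return holes

# Example: cumulative ACK at 1000, receiver also holds [2000,3000) and [4000,5000),
# sender has sent up to 6000 -> unreported ranges are [1000,2000), [3000,4000), [5000,6000).
print(infer_holes(1000, [(2000, 3000), (4000, 5000)], 6000))
```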
QoS results interpretation (2): hypothesis
• We formulate a segment-level-perspective hypothesis: the burst of SACKs coming back might cause the sender host to hang (periods of silence) when AF is allocated a high fraction of the bandwidth, causing the host to reset everything and, as a consequence, making TCP enter slow start after each period of silence.
• Vanilla TCP would not be affected, because its congestion-avoidance CWND gradient does not increase with higher CWND values.
• HS-TCP and Scalable TCP would be affected, because their congestion-avoidance CWND gradients increase with higher CWND values.
• Scalable TCP is the most affected, since it has the steepest congestion-avoidance gradient, which provokes a longer SACK burst than HS-TCP's.
• Let's try to validate this segment-level-perspective hypothesis through a network solution: WRED. This is inspired by the fact that WRED is already a solution for TCP stability, but from a flow-level perspective - remember the stability condition (ρ = const. being the slope of the RED drop profile)?
  ρ (c^3 τ^3)/(4N^4) · [ cτ/(2N) + 1 + 1/(αcτ) ] < π(1-β)^2 / √[ 4β^2 + π^2(1-β)^2 ]

What is WRED?
• WRED (Weighted Random Early Detection) is a router AQM (Active Queue Management) mechanism in which a dropping probability, as a function of queue occupancy, is assigned to each class queue.
(Plot: drop probability vs. queue occupancy.)

WRED: test rationale
• Dual-approach rationale - the same tool (WRED) viewed from two perspectives:
  – Flow-level perspective (what control theory says): the oscillations are the effect of protocol instability. If so, a gentle (small ρ) WRED drop profile should help satisfy the inequality governing the stability condition for the TCP flows [Low].
  – Segment-level perspective (our rationale): a smoother distribution of the loss pattern along the AF scheduler buffer would help reduce the LENGTH of the bursts of SACKs coming back, and therefore would help avoid stalling the sender. If successful, this would validate the hypothesis made above. (A sketch of such a drop profile follows the WRED results below.)
[Low] "Linear stability of TCP/RED and a scalable control", Steven H. Low, Fernando Paganini, Jiantao Wang, John C. Doyle, May 6, 2003.

WRED test-bed
• AF: class for the TCP applications (Standard TCP, HS-TCP, Scalable TCP); BE: class for the UDP CBR background traffic.
• Transatlantic path, Geneva to Chicago across a Juniper M10; AF: WRED + priority and 900 Mbps; BE: 100 Mbps; 1 Gb/s bottleneck.

Results with WRED
(Plots: cwnd vs. time; throughput vs. time.)
• As expected, only the "gentle" WRED profile works. Other profiles are found to be too aggressive, inducing too much loss.

Results with WRED: summary
(Plots: cwnd and throughput without WRED vs. with gentle WRED.)
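As a purely illustrative sketch (not the actual Juniper drop-profile configuration used in the tests), the Python snippet below shows a RED/WRED-style piecewise-linear drop profile; the parameter names min_th, max_th and max_p follow the common RED formulation and the numbers are hypothetical. "Gentle" in the sense used above corresponds to a small slope ρ of this ramp.

```python
# Illustrative sketch only: a RED/WRED-style piecewise-linear drop profile.
# min_th, max_th, max_p follow the usual RED parameterisation; the numbers below
# are hypothetical and not the profile actually configured on the router.

def drop_probability(avg_queue: float, min_th: float, max_th: float, max_p: float) -> float:
    """Drop probability as a function of (average) queue occupancy."""
    if avg_queue < min_th:
        return 0.0                      # below the lower threshold: never drop
    if avg_queue >= max_th:
        return 1.0                      # queue (nearly) full: drop everything
    # linear ramp between the thresholds; its slope is the rho of the stability condition
    return max_p * (avg_queue - min_th) / (max_th - min_th)

# A "gentle" profile is one whose slope rho = max_p / (max_th - min_th) is small.
gentle_rho = 0.1 / (900.0 - 300.0)                   # hypothetical numbers
print(drop_probability(600.0, 300.0, 900.0, 0.1))    # 0.05 with these illustrative values
```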
Conclusions (1/5): summary/results
• With pretty empty fat long pipes, TCP as it is now would invariably fail in terms of high-performance throughput. New TCP stacks (Scalable, HS-TCP, ...) are therefore felt to be necessary...
• ...BUT new TCP stacks can affect the performance of traditional BE traffic. IP-QoS is therefore tried out as a solution for:
  – legacy (BE) traffic protection;
  – service predictability.
• With a 1 Gbps IP-QoS-enabled static fat long path and the current state of the TCP Linux kernel/CPU speed, Standard TCP performs surprisingly better than HS-TCP and Scalable, with HS-TCP much closer to Standard than to Scalable TCP.

Conclusions (2/5): results-proven theory
• SACK-burst processing is not an issue for Vanilla TCP, since its congestion-avoidance slope does not increase with increasing CWND.
• SACK-burst processing is an issue for HS-TCP and Scalable TCP, as their congestion-avoidance slopes increase with increasing CWND, with Scalable's gradient steeper than HS-TCP's.
• For high values of the bandwidth allocated to HS-TCP and Scalable TCP, the steeper the CWND congestion-avoidance gradient (Scalable), the higher/longer the SACK rate/burst, which makes the sender host:
  1. stall;
  2. experience a period of silence every time CWND "hits" the allocated QoS pipe;
  3. reset all its variables, and thus also drop the TCP connection into slow start.

Conclusions (3/5): problem generalisation and remedies
• For high bandwidth allocations there is an inherently higher risk, for a very aggressive transport protocol like Scalable TCP, of stalling the sender, drastically reducing stability and thus throughput performance.
• A proven remedy is to switch off code whose poor performance is worse than not having it in place at all, as with Congestion Moderation.
• For even higher bandwidth allocations, the network itself needs to smooth out the loss pattern, thereby reducing the SACK rate/burst: the use of a gentle WRED profile has proven extremely effective.

Conclusions (4/5)
• In a QoS-enabled high bandwidth-delay product path, Standard TCP performs better than the other protocols because it is protected (large buffers on the Juniper): CWND grows so slowly (and UNDOs plus Congestion Moderation also help greatly) that it never hits the CWND limit; therefore real drops are prevented.
• HS-TCP under QoS shows more or less the same performance as Standard TCP. BUT in the case of random drops (even in a protected environment there is always a BER > 0), which would be lethal to a high-performance bulk data transfer over Standard TCP, HS-TCP is preferred as it would, by design, react notably better.

Conclusions (5/5)
• Possible solutions to the SACK processing issue:
  – software: optimising the TCP code;
  – hardware: more CPU speed;
  – network: Active Queue Management (WRED) - OUR "third way" SOLUTION!
• The network/AQM solution found can relax the requirement for further optimisation of the SACK code. However, we believe that the main limitation in using aggressive TCP protocols lies in not having enough CPU speed.
• In an environment where the available bandwidth is greater than 1 Gbit/s, the problems are definitely further compounded for Scalable and, above a certain CWND value, also for HS-TCP. This is due to the larger range of CWND values needed to utilise those higher available bandwidths: bigger CWND values mean longer bursts of dropped packets, which induce longer SACK bursts to be processed at the sender. (A back-of-the-envelope window calculation follows below.)
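As a rough, illustrative calculation (not from the slides) of why higher bandwidths imply a much larger CWND range, the sketch below computes the bandwidth-delay product in 1500-byte packets for the 1 Gbps / 118 ms path used here and for a hypothetical 10 Gbps path.

```python
# Back-of-the-envelope sketch: congestion window (in packets) needed to fill a path,
# cwnd ~ bandwidth * RTT / packet size.  Figures are approximate illustrations only.

def window_packets(bandwidth_bps: float, rtt_s: float, packet_bytes: int = 1500) -> float:
    return bandwidth_bps * rtt_s / (packet_bytes * 8)

print(round(window_packets(1e9, 0.118)))    # ~9833 packets for the 1 Gbps, 118 ms path
print(round(window_packets(10e9, 0.118)))   # ~98333 packets at 10 Gbps: a tenfold larger
                                            # window, hence longer drop and SACK bursts
```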
Publications
• "Using QoS for High Throughput TCP Transport Over Fat Long Pipes", Andrea Di Donato, Yee-Ting Li, Frank Saka and Peter Clarke. In proceedings of the Second International Workshop on Protocols for Fast Long-Distance Networks, February 16-17, 2004, Argonne National Laboratory, Argonne, Illinois, USA.
• "On the joint use of new TCP proposals and IP-QoS on high Bandwidth-RTT product paths", Andrea Di Donato, Frank Saka, Javier Orellana and Peter Clarke. In proceedings of the Trans-European Research and Education Networking Association (TERENA) networking conference 2004, 7-10 June 2004, Rhodes, Greece.

THANK YOU...

APPENDIX

Congestion Moderation and UNDO definitions
• Congestion Moderation is triggered by a dubious event, typically the acknowledgement of more than 3 packets; the action taken is to set CWND to the number of packets thought to be in flight.
• UNDOs: ssthresh and cwnd are reduced because a loss is thought to have happened, but the sudden reception of THE expected ACK resets both cwnd and ssthresh to their previous values.

Four things to keep in mind when looking at the AF-QEF results:
1. Regardless of the TCP implementation, a large bandwidth allocation means higher CWND values, and hence a higher probability of dropping more packets within the same CWND.
   • This, along with the smaller value of the AF bandwidth-delay product and with the CWND threshold, explains why all the protocols tested perform the same under small bandwidth allocations.
2. For the new TCP stacks tested, the bigger the allocated bandwidth, the higher the loss rate induced by the scheduler during congestion, since the loss recovery (congestion-avoidance gradient) for bigger CWND is more aggressive.
3. Selective ACKnowledgements (SACKs) are feedback from the receiver telling the sender which packets the receiver has, so that the sender can deduce which packets to resend.
4. Too many SACKs (as a result of a major multiple drop within a CWND) to be processed at once (a burst of SACKs) is recognised to be an issue for the current kernels and CPUs in use.
   • This, along with points 1 and 2, leads us to say that for protocols with a steep congestion-avoidance gradient like Scalable, and for high allocated bandwidth, the SACK processing becomes an issue. This explains why Scalable performance for higher bandwidth allocations was so poor in our tests.
   • Thus, the difference in performance is due to an implementation issue whose magnitude is, though, a function of the dynamics of the TCP protocol in use, and not due to the protocol design per se.

Vanilla TCP (again)
• The standard AIMD algorithm is as follows:
  – on the receipt of an acknowledgement,
    • w' = w + 1/w  (w -> congestion window);
  – and in response to a congestion event,
    • w' = 0.5 * w.
• The increase and decrease parameters of the AIMD algorithm are fixed at 1 and 0.5.
• With 1500-byte packets and a 100 ms RTT, achieving a steady-state throughput of 10 Gbps would require a packet drop rate of at most one drop every 1 2/3 hours.
• TCP's AIMD congestion control performs poorly over paths with a high bandwidth-round-trip-time product, and the aggregate's stability is even more precarious if the aggregate consists of only a few flows, as is the case for the scientific community, where few users require reliable bulk transcontinental data transfer at Gb/s speed.

HS-TCP
• If the congestion window <= threshold, use the Standard AIMD algorithm; else use the HighSpeed AIMD algorithm.
• The modified HighSpeed AIMD algorithm is as follows:
  – on the receipt of an acknowledgement,
    • w' = w + a(w)/w  (higher w gives higher a(w));
  – and in response to a congestion event,
    • w' = (1 - b(w)) * w  (higher w gives lower b(w)).
• The increase and decrease parameters vary based on the current value of the congestion window.
• For the same packet size and RTT, a steady-state throughput of 10 Gbps can be achieved with a packet drop rate of at most one drop every 12 seconds.

Scalable TCP
• The slow-start phase of the original TCP algorithm is unmodified. The congestion-avoidance phase is modified as follows:
  – for each acknowledgement received in a round trip time,
    • cwnd' = cwnd + 0.01;
  – and on the first detection of congestion in a given round trip time,
    • cwnd' = cwnd - 0.125 * cwnd.
• Like HighSpeed TCP, it has a threshold window size, and the modified algorithm is used only when the congestion window is above that threshold.
• Though values of 0.01 and 0.125 are suggested for the increase and decrease parameters, they (as well as the threshold window size) can be configured via the proc file system (by a superuser).
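To make the three window-update rules above concrete, here is a minimal, illustrative Python sketch of the per-ACK and per-congestion-event updates. The hs_a()/hs_b() response functions are only placeholders (the real HS-TCP values are tabulated in its specification), and slow start, thresholds and flow control are omitted.

```python
# Illustrative sketch only: per-ACK / per-loss window updates for the three stacks
# summarised above.  Slow start, threshold windows and the real HS-TCP a(w)/b(w)
# tables are omitted; hs_a() and hs_b() below are placeholders, not the real values.

def standard_on_ack(w: float) -> float:
    return w + 1.0 / w            # at most +1 segment per RTT

def standard_on_loss(w: float) -> float:
    return 0.5 * w

def hs_a(w: float) -> float:      # placeholder: grows with w (real values are tabulated)
    return max(1.0, w / 100.0)

def hs_b(w: float) -> float:      # placeholder: shrinks towards ~0.1 as w grows
    return max(0.1, 0.5 - w / 1e5)

def hstcp_on_ack(w: float) -> float:
    return w + hs_a(w) / w

def hstcp_on_loss(w: float) -> float:
    return (1.0 - hs_b(w)) * w

def scalable_on_ack(w: float) -> float:
    return w + 0.01               # +1 segment for every 100 ACKs

def scalable_on_loss(w: float) -> float:
    return w - 0.125 * w

# Example: starting from w = 8000 segments, one loss event costs Standard TCP
# 4000 segments but Scalable TCP only 1000, and Scalable regrows far faster per RTT.
print(standard_on_loss(8000.0), scalable_on_loss(8000.0))
```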
New stacks: User Impact Factor investigation (2)
(Plot: UIF for 10 new-TCP flows vs. number of background Vanilla TCP flows, with a reference line at UIF = 1.)
• The same as the previous plot, but with 10 new-TCP-stack flows: the UIFs are even further above 1 than in the single new-TCP-flow scenario, which corroborates the results even more, as this is a much more realistic scenario.

IP-QoS (a la DiffServ)
• Packets are DSCP-marked at the sender host.
• Two classes are defined: BE and AF.
• Packets are DSCP-classified at the router input ports.
• Each class is put in a physically different IP-level router output queue, and each queue is assigned a minimum guaranteed bandwidth (WRR):
  – low priority for BE;
  – high priority for AF.

Juniper M10 bandwidth scheduler "priority"
• The queue weight ensures the queue is provided a minimum amount of bandwidth proportional to the weight. As long as this minimum has not been served, the queue is said to have "positive credit"; once this minimum amount is reached, the queue has "negative credit".
• Credits are decreased immediately when a packet is sent; they are replenished frequently (by the assigned weight).
• For each packet, the WRR algorithm strictly follows this queue service order (sketched after the benchmarking slide below):
  1. high-priority, positive-credit queues;
  2. low-priority, positive-credit queues;
  3. high-priority, negative-credit queues;
  4. low-priority, negative-credit queues.

Router bandwidth scheduler benchmarking
• BE and AF classes through the Juniper M10.
• Cards used: 1x G/E, 1000BASE-SX REV 01; router OS version used: JUNOS 5.3R2.4.
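As a purely conceptual model of the service order just described (not Juniper's actual implementation), the following Python sketch picks the next backlogged queue to serve from the four priority/credit categories; the Queue fields and example values are hypothetical.

```python
# Conceptual sketch only (not Juniper's implementation): choosing the next queue
# to serve according to the priority/credit order described above.
from __future__ import annotations
from dataclasses import dataclass

@dataclass
class Queue:
    name: str
    high_priority: bool
    credit: float          # positive while the queue is below its guaranteed share
    backlog: int           # packets waiting

def next_queue_to_serve(queues: list[Queue]) -> Queue | None:
    """Service order: high-prio positive credit, low-prio positive credit,
    high-prio negative credit, low-prio negative credit."""
    def rank(q: Queue) -> int:
        positive = q.credit > 0
        if q.high_priority and positive:
            return 0
        if not q.high_priority and positive:
            return 1
        return 2 if q.high_priority else 3
    backlogged = [q for q in queues if q.backlog > 0]
    return min(backlogged, key=rank) if backlogged else None

# Example: AF (high priority) has exhausted its credit while BE (low priority) has not,
# so BE is served next even though AF has the higher priority.
af = Queue("AF", high_priority=True, credit=-10.0, backlog=5)
be = Queue("BE", high_priority=False, credit=20.0, backlog=5)
print(next_queue_to_serve([af, be]).name)   # -> "BE"
```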