IETF 80 highlights: congestion control

There were several interesting talks on various aspects of congestion control at IETF 80, spread around various working groups and research groups; the majority of work that I would classify as actual research being done in the IETF and IRTF at the moment seems to concern congestion control in some way or other.  I've already written about Multipath TCP and Bufferbloat; here's a potpourri of other TCP problems and proposed solutions.  Most of these came out of the meeting of the Internet Congestion Control Research Group (ICCRG) - strictly part of the IRTF rather than the IETF - but the presentation on SPDY came from the IETF Transport Area open meeting.

Chirping for Congestion Control

This work, being undertaken by Mirja Kühlewind and Bob Briscoe at BT, is a really neat way for TCP to react to changes in available capacity quickly and accurately.  As things stand, if a significant amount of new capacity appears after a connection has left the slow-start (exponential) phase, classic TCP can take a long time to make use of this capacity via additive increase of the congestion window.  The state of the art prior to this work reacts more quickly but will overshoot and cause unnecessary congestion.

Their solution is to "chirp" groups of data packets.  This is an adaptation of an old radio concept; in the analogue domain, a chirp is a signal with increasing or decreasing frequency over time.  Here, chirping means to transmit a group of packets with steadily-decreasing inter-packet intervals, or in other words steadily-increasing data rates.  At some point, the sender will pass the capacity of the bottleneck; the part of the chirp sent too quickly for the network to cope with will have been spread out when it reaches the receiver, i.e. the inter-packet intervals will flatten out at a minimum value from which the available capacity can be easily computed.
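
The idea can be illustrated with a small sketch.  It is not the authors' implementation: the geometric spacing, the function names and the idealised bottleneck (an otherwise idle link with no cross traffic) are all assumptions made for illustration.

```python
def chirp_send_gaps(n_packets, first_gap, ratio):
    """Inter-packet gaps (seconds) for one chirp.

    Assumption: each gap is the previous one multiplied by `ratio` (< 1),
    so the instantaneous sending rate rises steadily through the chirp.
    """
    return [first_gap * ratio ** i for i in range(n_packets - 1)]


def receive_gaps(send_gaps, packet_bits, capacity_bps):
    """Gaps seen at the receiver behind an idealised, otherwise idle bottleneck.

    The bottleneck cannot emit packets closer together than the time it
    takes to serialise one packet, so shorter gaps are stretched to that.
    """
    service_time = packet_bits / capacity_bps
    return [max(gap, service_time) for gap in send_gaps]


def estimate_capacity(recv_gaps, packet_bits):
    """Capacity implied by the floor at which the received gaps flatten out.

    If the chirp never exceeded the bottleneck rate, this is only a lower
    bound (the highest rate that was actually probed).
    """
    return packet_bits / min(recv_gaps)
```

For example, a 32-packet chirp of 1500-byte packets starting at 10 ms spacing and shrinking by 20% per packet, sent through a 10 Mbit/s bottleneck, flattens out at 1.2 ms between packets, from which `estimate_capacity` recovers roughly 10 Mbit/s.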

This application of chirping to networking was originally devised by Ribeiro et al. - pathChirp [postscript] - as a means for testing a link; Kühlewind and Briscoe have implemented this as a congestion control mechanism for TCP, sending every user data packet as part of a chirp (on the order of 32 packets or half a second long) in order to continuously reevaluate the available bandwidth.  TCP no longer has to rely on loss to detect the optimal window, and nor does it have to fill the buffer next to the bottleneck - no more buffer-induced latency!  A member of the audience commented that this could work particularly well on wireless links.

There are a few open research questions, though: for instance it is not clear what will happen when everybody is chirping; it may be that chirps interact with each other at the bottleneck.

Updating TCP to support Variable-Rate Traffic

Gorry Fairhurst of the University of Aberdeen made the case for revisiting an assumption inherent in TCP: that flows have traditionally been considered to be either "bulk" (transmitting data as fast as possible) or "thin" (sitting idle for most of the time), not both.  More recently we have seen flows which are both or neither, such as audio/video streams where the transmission rate is governed by the application or persistent HTTP connections which sit idle but switch to bulk operation occasionally.

Standard TCP does very badly with such connections for two reasons:

  1. If the connection goes idle (e.g. in an interactive application), the congestion window drops to 1 packet - i.e. all information about the probed available capacity is discarded and the connection must go through slow start again when it has more data to transmit.  Performance after a connection has been idle is therefore very poor.
  2. When the transmission rate is application-governed, the link capacity is massively overestimated: the congestion window continues to increase linearly as ACKs come back without loss.  TCP-CWV (RFC 2861) proposed to solve this particular problem by exponentially decaying the congestion window whilst idle, but this can apparently make the interactive case even worse. (Linux turns on CWV by default, but many interactive applications turn it off; most other operating systems do not use CWV.)  CWV is also not entirely satisfactory in the case of a variable-rate stream, where the congestion window will lag behind the application's transmission rate by a significant amount.
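
The difference between the two idle behaviours in the list above can be sketched as follows.  This is a simplification rather than the text of any RFC: the decay is assumed to be one halving per retransmission timeout (RTO) of idleness, the restart window is taken to be 1 packet as described above, and the function names are hypothetical.

```python
def cwnd_after_idle_standard(cwnd, restart_window=1):
    """Standard behaviour as described above: the probed window is discarded."""
    return restart_window


def cwnd_after_idle_cwv(cwnd, idle_seconds, rto_seconds, restart_window=1):
    """CWV-style exponential decay (assumed: one halving per RTO spent idle).

    The window never drops below the restart window.
    """
    halvings = int(idle_seconds // rto_seconds)
    for _ in range(halvings):
        cwnd = max(cwnd // 2, restart_window)
    return cwnd
```

With a 64-packet window and a 1-second RTO, three seconds of idleness leaves 8 packets under the decay rule, whereas the standard behaviour leaves 1.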

The proposed fix is to preserve the existing congestion window unmodified (for up to six minutes) whenever the application transmits at less than two-thirds of this window - i.e. the congestion window found during bursts of transmission is tracked.  (The 6-minute timeout is an arbitrary compromise.)  A connection which is idle (or transmitting slowly) for less than six minutes can resume transmission immediately at any time using its previously-determined congestion window.
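
A minimal sketch of this rule, under stated assumptions: the names are hypothetical, usage is measured as bytes in flight against the window, and the fallback to a restart window once the six minutes have expired is a guess at what happens next rather than something stated in the talk.

```python
PRESERVE_SECONDS = 6 * 60   # the (arbitrary) six-minute limit from the proposal
USAGE_THRESHOLD = 2 / 3     # below this fraction of cwnd the window counts as under-used


def is_window_underused(bytes_in_flight, cwnd_bytes):
    """True when the application is sending at less than two-thirds of the window."""
    return bytes_in_flight < USAGE_THRESHOLD * cwnd_bytes


def cwnd_on_resume(cwnd_bytes, underused_for_seconds, restart_window_bytes):
    """Window to use when the sender resumes after a period of under-use."""
    if underused_for_seconds < PRESERVE_SECONDS:
        # Within the limit: keep the previously determined window unmodified.
        return cwnd_bytes
    # Assumption: after the limit the old estimate is no longer trusted.
    return restart_window_bytes
```

So a connection with a 100,000-byte window that has been quiet for two minutes would resume at 100,000 bytes, while one that has been quiet for seven minutes would start again from the restart window.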

As noted by Mark Handley, however, this behaviour does carry a risk: it is quite possible for the network conditions - in particular, competing traffic - to change significantly during a sub-six-minute idle period.  The connection could then transmit at a wildly inappropriate rate when it returns from idle, leading to massive congestion.

Datacentre problems and DCTCP

Murari Sridharan of Microsoft is a vociferous proponent of getting the IETF interested in the datacentre, which is approaching and surpassing the limits of several protocols (for example ARP, which is why I was there, but that's another story).  From his perspective the prevailing opinion in the IETF is that the datacentre is "not the internet" and tends to be the domain of proprietary equipment running proprietary protocols.  However, some of the problems experienced in the datacentre relate specifically to TCP/IP and may have to be solved by modifying the OS or hypervisor; if that happens, then these modifications will doubtless interact with the internet at some point, either deliberately or accidentally.  Furthermore, as echoed by BT and Google, the line between datacentre and access network is blurring as access technologies increase in bandwidth and datacentres become more distributed.

The specifics of the datacentre from Sridharan's perspective (which are of course biased towards the situation in Microsoft datacentres, but probably apply more generally) are:

  • High-speed access links - or in other words, the core isn't much faster than the edge
  • Multipath topologies to increase edge-to-edge capacity
  • Very low latency (hundreds of microseconds between racks), which means that delay-based algorithms don't work
  • Low statistical multiplexing
  • Commodity switches: with probably 10,000 of them to buy, operators choose cheaper models, which tend to have shallow buffers and hence poor burst tolerance; since TCP is very bursty, that is a problem straight away
  • Virtualisation: 8-32 VMs per server
  • Any service on any server, with live migration (without interrupting connectivity)
  • Multi-tenancy, so you can't trust or tune the VM
  • Significant occurrence of the incast problem where simultaneous flows collide at a bottleneck resulting in high loss and high latency

On top of this demanding scenario, datacentre operators want high burst tolerance and low latency and high throughput (it's hard to get all three) - and they also want performance isolation between flows.  TCP is the state of the art in performance isolation, but it works on the wrong granularity for (at least) Microsoft's datacentres.  Simply turning on ECN ubiquitously in the datacentre does improve the situation significantly, but they would like to go further, by specifically addressing the incast and performance isolation problems.

Microsoft has attempted to improve congestion control in two parallel ways: firstly, by using congestion-controlled tunnels between VMs (Seawall [pdf]) and secondly by tweaking TCP itself (DCTCP [pdf]).  Seawall's encapsulating behaviour is known to break hardware offload, amongst other things; Sridharan went into more detail on DCTCP.

DCTCP aims to make TCP less bursty, and can be thought of as a more-capable ECN: rather than just reacting to the presence of congestion, DCTCP reacts in proportion to its extent, with switches marking packets probabilistically based on instantaneous queue length (actually an estimate of the fraction of time the queue length exceeds a threshold, but this has been shown to be equivalent by Kelly et al.).  This higher-resolution data is needed to deal with the incast problem where short-timescale feedback is important.  Like ECN, DCTCP uses a single marker bit as feedback, but spreads more-detailed feedback over multiple packets: if one packet in 20 is marked, the sender's estimate of the congestion level settles at 5%, and in the published algorithm the congestion window is then cut by half of that estimate (2.5%) rather than being halved outright as with standard ECN.
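
The sender side of this can be sketched in a few lines.  The sketch follows the published DCTCP algorithm as far as I can reproduce it: a moving average of the marked fraction (with gain g = 1/16 as in the DCTCP paper) and a window cut of half that average.  The function names are my own, and with these rules a steady one-in-20 marking rate gives a cut of about 2.5% per window.

```python
def update_alpha(alpha, marked_acks, total_acks, g=1 / 16):
    """Update the running estimate of the fraction of marked packets.

    Called once per window of data; `g` is the averaging gain.
    """
    fraction = marked_acks / total_acks if total_acks else 0.0
    return (1 - g) * alpha + g * fraction


def reduce_cwnd(cwnd, alpha):
    """Cut the window in proportion to the estimated extent of congestion.

    alpha == 1 (every packet marked) reproduces the usual halving;
    a small alpha gives a correspondingly gentle reduction.
    """
    return cwnd * (1 - alpha / 2)
```

If every window sees one marked packet in 20, `alpha` converges on 0.05 and a 100-packet window is reduced to about 97.5 packets, whereas a fully marked window would be cut to 50.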

Interestingly, according to Sridharan, DCTCP can be implemented using existing silicon despite the new congestion marking requirements on switches.


SPDY

SPDY is a replacement for HTTP initiated by Google (and presented by Mike Belshe); the motivation is that HTTP is somewhat network-unfriendly in the way it uses TCP, and that it would be better to fix the application layer's abuse of the transport layer before taking the plunge and deciding to adapt or replace TCP.

As background, Belshe described the "average webpage", which consists of:

  • 44 resources (HTML, images, stylesheets, scripts...)
  • ...spread across 7 hosts (sometimes for good reasons, e.g. CDN, but sometimes as a hack to work around browsers' limits on the number of connections per server - set at 6 in most modern browsers)
  • 320 kB of data
  • 66% chance of being compressed (or 90% for the top sites, but under 50% for HTTPS)
  • 29 HTTP connections per page

HTTP fetches these 44 resources in an inefficient manner.  SPDY aims to use fewer connections, fewer bytes and fewer packets to fetch the same resources:

  • It opens a single connection to each server and does its own multiplexing within this connection, using prioritised streams so that the resources needed first are sent immediately, even if a bulk transfer from that server is already in progress.
  • It implements mandatory header compression - and this compression is stateful across multiple requests, so that for example sending the same large cookie repeatedly on successive requests is very cheap.
  • It also allows the server to push data to the client without needing to hold an unanswered request open.
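
The effect of stateful header compression can be demonstrated with Python's standard zlib module.  This is an illustration of the principle only, not SPDY's wire format: the class names are hypothetical, and real SPDY framing and any preset dictionary are left out.

```python
import zlib


class HeaderCompressor:
    """Compresses successive header blocks with one shared zlib stream."""

    def __init__(self):
        self._stream = zlib.compressobj()

    def compress(self, headers: bytes) -> bytes:
        # Z_SYNC_FLUSH emits everything so far but keeps the compression
        # history, so later blocks can refer back to earlier ones.
        return self._stream.compress(headers) + self._stream.flush(zlib.Z_SYNC_FLUSH)


class HeaderDecompressor:
    """Receiver side: must see every block, in order, to stay in sync."""

    def __init__(self):
        self._stream = zlib.decompressobj()

    def decompress(self, block: bytes) -> bytes:
        return self._stream.decompress(block)
```

Compressing the same request headers twice through one `HeaderCompressor` (say, a request carrying a large cookie) yields a second block that is much smaller than the first, because it consists almost entirely of back-references to the first.  The price is that both ends must keep per-connection state and process the blocks in order.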

SPDY has been implemented in Chrome and is enabled for SSL traffic where the server supports it; since Google's servers do, they have a fair amount of real data on the performance of this new protocol.  In summary (it is claimed) SPDY is a significant improvement over HTTP.

However, there is an open problem: SPDY's use of a single connection hurts it in a few ways, despite this being "more correct" behaviour than that of HTTP.  First and foremost, RFC-compliant hosts using an initial congestion window of 1 packet will take a while to reach an acceptable throughput; Google "solves" this by setting the initial congestion window on their servers to 10 packets in direct violation of the spec, for which they received a certain amount of flak in the IETF!  Furthermore, Belshe also pointed out that TCP unfairly penalises single-connection protocols: a single packet loss halves SPDY's throughput, whereas in the presence of multiple HTTP connections only one of those will be hurt and the throughput will be reduced by a lesser amount.
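
Both effects are easy to put rough numbers on.  The sketch below is a deliberately crude model with hypothetical function names: it assumes loss-free slow start with the window doubling every round trip, equal-rate connections, and a loss that halves exactly one connection's rate.

```python
def round_trips_to_send(total_packets, initial_window):
    """Round trips needed to deliver `total_packets` in loss-free slow start."""
    sent, window, rtts = 0, initial_window, 0
    while sent < total_packets:
        sent += window   # one window's worth goes out per round trip
        window *= 2      # slow start doubles the window each round trip
        rtts += 1
    return rtts


def throughput_after_one_loss(n_connections):
    """Fraction of aggregate throughput left immediately after one packet loss.

    Only the connection that saw the loss halves its rate.
    """
    return 1 - 0.5 / n_connections
```

In this model a 100-packet response takes 7 round trips from an initial window of 1 packet but 4 from an initial window of 10, and a single loss leaves a lone connection at 50% of its previous throughput while six parallel connections keep about 92% between them.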

SPDY is very much a work in progress and the developers are looking into ways to avoid this "single-connection tax".  Their roadmap also includes a few further enhancements; most interestingly they are experimenting with including request data in the SYN packet to avoid a round trip.

There are (as was pointed out by various audience members) preexisting alternatives to SPDY.  However, SCTP is less deployable on today's internet; Google considers that multiplexing of this kind is more likely to work if deployed at the application layer on top of TCP.  HTTP pipelining should solve some of the problems, but it interacts badly with some proxies so is generally turned off - and it does not implement prioritisation.  It turns out many hacks are necessary to reliably use pipelining; Firefox is close to having this working by running tests in the background and switching pipelining on as and when they pass.

Finally, a memorable quotation courtesy of Belshe: "the gentlemen's agreement in congestion control is over".  This was meant to defend SPDY's and Google's flagrant violation of the TCP specs and replacement of HTTP with a nonstandard protocol, but nevertheless it is food for thought.
