December 07, 2014

FlowBender : revisiting Equal Cost Multipath in Datacenters

Equal Cost Multipath (ECMP) is a widely used technique that allows routers and switches to spread the packets over several paths having the same cost. When a router/switch has several paths having the same cost towards a given destination, it can send packets over any of these paths. To maximise load-balancing, routers install all the available paths in their forwarding tables and balance the arriving packets over all of them. To ensure that all the packets that correspond to the same layer-4 flow follow the same path and thus have roughly the same delay, routers usually select the outgoing equal cost path by computing : $H(IP_{src}||IP_{dst}||Port_{src}||Port_{dst})~mod~n$ when n is the number of equal cost paths towards the packet’s destination and H a hash function. This technique works well in practice and is used in both datacenters and ISP networks.

A consequence of this utilisation of ECMP is that TCP connections with different source ports between two hosts will sometimes follow different paths. In large ISP networks, this may lead to very different round-trip-times for different flows between a pair of hosts. In datacenters, is has been shown that Multipath TCP can better exploit the available network resources by load balancing TCP traffic over all equal cost paths. The ndiffports path manager was designed with this use case in mind.

In a recent paper presented at Conext 2014, researchers from Google, Purdue University and Fabien Duchene propose another approach to allow TCP to efficiently utilise all paths inside a datacenter. Instead of using Multipath TCP to spread the packets from each connection over several paths (and risk increased delays due to reordering at the destination), they change the hash function used by the routers/switches. For this, they build upon the Smart hashing algorithm found in some broadcom switches. In an homogeneous datacenter that uses a single type of switches, they select the outgoing path as $H(TTL||IP_{src}||IP_{dst}||Port_{src}||Port_{dst})~mod~n$ where TTL is the Time-to-Live extracted from the packet. This is not the first load flow-based balancing strategy that uses the TTL. Another example is CFLB that even allows to control the path followed by the packets. In addition to using the TTL for load-balancing, the datacenter switches that they use support Explicit Congestion Notification and set the CE bit when their buffer growths. They then modify the TCP sources to react to congestion events. When a source receives a TCP acknowledgement that indicates congestion, it simply reacts by changing the TTL of all the packets sent over this connection. As illustrated in the figure below, this improves the flow termination time under higher loads.

In homogeneous datacenters, the FlowBender approach is probably a viable solution. However, Multipath TCP continues to have benefits in public datacenters where the endhosts cannot influence the operation of the routers and switches.

Posted by Olivier Bonaventure

Filed under: datacenter

Tags: datacenter

MPTCP

Recent articles and software

FlowBender : revisiting Equal Cost Multipath in Datacenters