Why is the Multipath TCP scheduler so important?
Multipath TCP can pool several links together. An important use case for Multipath TCP is smartphones and tablets equipped with both 3G and WiFi interfaces. On such devices, Multipath TCP would establish two subflows, one over the WiFi interface and one over the 3G interface. Once the two subflows have been established, one of the main decisions taken by Multipath TCP is the scheduling of the packets over the different subflows.
This scheduling decision is very important because it can impact performance and quality of experience. In the current implementation of Multipath TCP in the Linux kernel, the scheduler always prefers the subflow with the smallest round-trip-time to send data. A typical example of the operation of this scheduler is shown in the demo below from the http://www.multipath-tcp.org web site:
In this demo, the Multipath TCP client uses SSH over Multipath TCP to connect to a server that exports a screensaver over the SSH session. The client has three interfaces: WiFi, 3G and Ethernet. Multipath TCP continuously measures the round-trip-time every time it sends data over any of these subflows. The Ethernet subflow has the lowest round-trip-time. WiFi has a slightly higher round-trip-time and 3G has the worst round-trip-time. The SSH session is usually not limited by the network throughput and all subflows are available every time data needs to be transmitted. When Ethernet is available, it is preferred over the other interfaces. WiFi is preferred over 3G and 3G is only used when the two other interfaces are unavailable.
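The preference order described above can be illustrated with a minimal Python sketch. This is not the Linux kernel code: the `Subflow` class, its fields and the round-trip-time values are assumptions chosen for illustration only.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass
class Subflow:
    """Hypothetical, simplified view of one subflow."""
    name: str
    srtt_ms: float      # smoothed round-trip-time estimate (illustrative values)
    cwnd: int           # congestion window, in segments
    in_flight: int      # segments sent but not yet acknowledged
    up: bool = True     # False when the interface is unavailable

    def can_send(self) -> bool:
        # A subflow is usable if its interface is up and its window has room.
        return self.up and self.in_flight < self.cwnd


def pick_subflow(subflows: Iterable[Subflow]) -> Optional[Subflow]:
    """Return the usable subflow with the smallest RTT, or None if none is usable."""
    usable = [sf for sf in subflows if sf.can_send()]
    return min(usable, key=lambda sf: sf.srtt_ms, default=None)


ethernet = Subflow("Ethernet", srtt_ms=5, cwnd=10, in_flight=0)
wifi = Subflow("WiFi", srtt_ms=20, cwnd=10, in_flight=0)
cell = Subflow("3G", srtt_ms=150, cwnd=10, in_flight=0)
flows = [ethernet, wifi, cell]

first = pick_subflow(flows).name      # all interfaces available
ethernet.up = False
second = pick_subflow(flows).name     # Ethernet unplugged
wifi.up = False
third = pick_subflow(flows).name      # only 3G left
print(first, second, third)
```

With these made-up values the sketch selects Ethernet, then WiFi, then 3G as the interfaces are disabled one after the other, which is the behaviour described for the demo.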
Sending data over the subflow with the smallest round-trip-time is not sufficient to achieve good performance on memory-constrained devices that use a small receive window. This problem was first explored in [NSDI12], where reinjection and penalization were proposed to mitigate the head-of-line blocking that can occur when the receiver advertises a limited receive window. The typical scenario is a smartphone using 3G and WiFi where 3G is slower than WiFi. If the receiver is window-limited, then it might happen that a packet is sent on the 3G subflow and then the WiFi subflow becomes blocked due to the limited receive window. In this case, the algorithm proposed in [NSDI12] will reinject the unacknowledged data from the 3G subflow on the WiFi subflow and reduce the congestion window on the 3G subflow. This problem has been analyzed in more detail in [Conext13] by considering a large number of scenarios. This analysis has resulted in various improvements to the Linux Multipath TCP implementation.
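The reinjection and penalization idea can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: the `Subflow` class and the `on_window_blocked` function are hypothetical names, a single segment is reinjected, and the penalty is modelled as halving the congestion window.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Subflow:
    """Hypothetical, simplified view of one subflow."""
    name: str
    cwnd: int                                          # congestion window, in segments
    unacked: List[int] = field(default_factory=list)   # data sequence numbers in flight


def on_window_blocked(fast: Subflow, slow: Subflow) -> Optional[int]:
    """Called when `fast` could send but the receive window is full.

    Reinjects the oldest segment still unacknowledged on `slow` over `fast`
    and penalizes `slow` by reducing its congestion window (halving is an
    assumption of this sketch). Returns the reinjected data sequence number,
    or None when `slow` has nothing in flight.
    """
    if not slow.unacked:
        return None
    blocking_dsn = min(slow.unacked)        # segment holding back the window
    if blocking_dsn not in fast.unacked:
        fast.unacked.append(blocking_dsn)   # reinjection on the fast subflow
    slow.cwnd = max(1, slow.cwnd // 2)      # penalization of the slow subflow
    return blocking_dsn


wifi = Subflow("WiFi", cwnd=10, unacked=[101, 102, 103])
cell = Subflow("3G", cwnd=8, unacked=[100])
reinjected = on_window_blocked(wifi, cell)
print(reinjected, cell.cwnd, wifi.unacked)
```

Here segment 100, sent on 3G, is the one that prevents the receive window from advancing. It is sent again over WiFi and the 3G congestion window is reduced so that less data gets stuck behind the slow path.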
In recent years, several researchers have proposed other types of schedulers for Multipath TCP or other transport protocols. In theory, if a scheduler has perfect knowledge of the network characteristics (bandwidth, delay), it could optimally schedule the packets that are transmitted to prevent head-of-line blocking problems and minimize the buffer occupancy. In practice, and in a real implementation, this is more difficult because the delay varies and the bandwidth is unknown and varies as a function of the other TCP connections.
A few articles have tried to solve the scheduling problem by using a different approach than the one currently implemented in the Linux kernel.
The Delay-Aware Packet Scheduling For Multipath Transport proposed in [DAPS] is a recent example of such schedulers. [DAPS] considers two paths with different delays and generates a schedule, i.e. a list of sequence numbers to be transmitted over the different paths. Some limitations of the proposed scheduler are listed in [DAPS], notably: the DAPS scheduler assumes that there is a large difference in delays between the different paths and it assumes that the congestion windows are stable. In practice, these conditions are not always true and a scheduler should operate in all situations. [DAPS] implements the proposed scheduler in the ns-2 CMT simulator and evaluates its performance in small networks.
Another scheduler is proposed in [YAE2013]. This scheduler tries to estimate the available capacity on each subflow and measures the number of bytes transmitted over each subflow. This enables the scheduler to detect when the subflow is sending too much data and select the other subflow at that time. The proposed scheduler is implemented in the Linux kernel, but unfortunately the source code does not seem to have been released by the authors of [YAE2013]. The performance of the scheduler is evaluated by considering a simulation scenario with very long file transfers in a network with a very small amount of buffering. It is unclear whether this represents a real use case for Multipath TCP.
It can be expected that other researchers will propose new Multipath TCP schedulers. There is room for improvement in this part of the Multipath TCP code. However, to be convincing, the evaluation of a new scheduler should not be limited to small-scale simulations. It should consider a wide range of scenarios like [Conext13] and demonstrate that it can be efficiently implemented in the Linux kernel.
References
[NSDI12] Costin Raiciu, C. Paasch, S. Barre, A. Ford, M. Honda, O. Bonaventure, and M. Handley, How hard can it be? Designing and implementing a deployable Multipath TCP, USENIX NSDI, 2012.
[Conext13] Christoph Paasch, R. Khalili, and O. Bonaventure, On the benefits of applying experimental design to improve Multipath TCP, presented at CoNEXT '13: Proceedings of the ninth ACM conference on Emerging networking experiments and technologies, 2013.
[DAPS] Nicolas Kuhn, E. Lochin, A. Mifdaoui, G. Sarwar, O. Mehani, and R. Boreli, DAPS: Intelligent Delay-Aware Packet Scheduling For Multipath Transport, presented at the ICCC, 2014.
[YAE2013] Fan Yang, P. Amer, and N. Ekiz, A Scheduler for Multipath TCP, presented at the 22nd International Conference on Computer Communications and Networks (ICCCN), 2013, pp. 1-7.
Researchers contribute Multipath TCP code
Our Multipath TCP implementation in the Linux kernel continues to attract a lot of interest from both researchers and industry. Until now, with a few exceptions, most of the work on the implementation has been done by researchers at UCL or by close collaborators who work with us in the framework of scientific projects. During the last week, two research groups have contributed new patches to Multipath TCP.
The first patch, proposed last week by Enhuan Dong, adds an implementation of a Multipath-aware Vegas congestion control scheme. Most TCP congestion control schemes rely on packet losses to detect congestion, with one notable exception: TCP Vegas [1]. TCP Vegas measures the round-trip-time, uses increases in round-trip-times as an indication of congestion and adapts its congestion window accordingly. In 2012, several researchers proposed to adapt TCP Vegas for Multipath TCP [2]. This patch is a first step in implementing this extension of TCP Vegas in the Linux kernel. It has already generated some discussion on the mailing list.
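To make the delay-based idea concrete, here is a minimal Python sketch of the classic single-path Vegas window adjustment. It is neither the multipath extension of [2] nor the contributed patch. The function name and the `alpha` and `beta` thresholds are illustrative assumptions.

```python
def vegas_update(cwnd: float, base_rtt: float, rtt: float,
                 alpha: float = 2.0, beta: float = 4.0) -> float:
    """One congestion-avoidance step of single-path TCP Vegas (sketch).

    cwnd is expressed in segments, base_rtt is the smallest RTT observed
    and rtt is the latest RTT measurement (same unit as base_rtt).
    """
    expected = cwnd / base_rtt                # rate if no queue were building up
    actual = cwnd / rtt                       # rate actually achieved
    diff = (expected - actual) * base_rtt     # estimate of segments queued in the network
    if diff < alpha:
        return cwnd + 1                       # path looks underused: grow
    if diff > beta:
        return max(cwnd - 1, 1)               # RTT is inflating: back off before losses
    return cwnd                               # between the thresholds: hold


grow = vegas_update(cwnd=10, base_rtt=0.100, rtt=0.100)
hold = vegas_update(cwnd=10, base_rtt=0.100, rtt=0.140)
shrink = vegas_update(cwnd=20, base_rtt=0.100, rtt=0.200)
print(grow, hold, shrink)
```

The important point is that the window shrinks as soon as the measured round-trip-time grows well beyond the base round-trip-time, without waiting for a packet loss.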
The second patch is an extension to the Multipath TCP path manager. The path manager is a recent addition to the Linux Multipath TCP implementation. It acts as a control plane for Multipath TCP since it includes the logic that decides when and how subflows are created. The default path manager creates a full mesh of subflows, but this is not always the best solution. The path manager was designed to be flexible and extensible. The patch sent by Duncan Eastoe and Luca Boccassi supports the Binder system described in [3]. It also includes some support for using the IPv6 Routing header with Multipath TCP. Given that this header has been deprecated, it is unlikely that this will end up in the standard Multipath TCP implementation, but it could be useful for research experiments.
[1] Lawrence S. Brakmo, S. W. O’Malley, and L. L. Peterson, TCP Vegas: new techniques for congestion detection and avoidance, presented at SIGCOMM’94: Proceedings of the conference on Communications architectures, protocols and applications, New York, New York, USA, 1994, pp. 24-35.
[2] Yu Cao, M. Xu, and X. Fu, Delay-based congestion control for multipath TCP, presented at the 20th IEEE International Conference on Network Protocols (ICNP), 2012, pp. 1-10.
[3] Luca Boccassi, M. M. Fayed, and M. K. Marina, Binder: a system to aggregate multiple internet gateways in community networks, presented at LCDNet’13: Proceedings of the 2013 ACM MobiCom workshop on Lowest cost denominator networking for universal access, New York, New York, USA, 2013, p. 3.
Observing Siri: the three-way handshake
Apple’s Siri is the largest use of Multipath TCP as of this writing. This post looks at one Multipath TCP connection established by a single-homed iPad running iOS7 over a single WiFi interface. The trace below shows a simple Multipath TCP session between this iPad and the standard Siri server. Like all Multipath TCP connections, it starts with a three-way exchange:
12:43:31.311061 IP (tos 0x0, ttl 64, id 54778, offset 0, flags [DF], proto TCP (6), length 76)
192.168.2.2.62787 > siri.4.https: Flags [S], cksum 0x5e3a (correct), seq 2739181685, win 65535, options [mss 1460,nop,wscale 3,mp capable flags:H sndkey:96e576198c475350,nop,nop,TS val 1363555813 ecr 0,sackOK,eol], length 0
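The first segment is a SYN segment. It contains several TCP options, all visible in the trace above: maximum segment size (mss), window scale (wscale), timestamp (TS) and SACK permitted (sackOK).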
These TCP options are standard TCP options that are used on modern TCP stacks. It is a bit surprising to see a window scale option for an application like Siri where typically only a small amount of data will be exchanged.
The last option is the MP_CAPABLE option defined in RFC 6824 that proposes the utilisation of Multipath TCP. In the SYN segment, this option contains the random 64-bit key chosen by the sender.
12:43:31.342236 IP (tos 0x0, ttl 244, id 52382, offset 0, flags [DF], proto TCP (6), length 64)
siri.4.https > 192.168.2.2.62787: Flags [S.], cksum 0x034b (correct), seq 1880401460, ack 2739181686, win 8190, options [mss 1460,nop,wscale 4,nop,nop,sackOK,mp capable flags:H sndkey:d7b705e4d86c1a66], length 0
The second segment is the SYN+ACK segment returned by the server. It is interesting to note that this segment does not contain the RFC 1323 timestamp option and uses a different window scale than the one proposed by the client in the SYN segment. The absence of the timestamp option is probably to avoid using too many option bytes in the data segments.
12:43:31.345448 IP (tos 0x0, ttl 64, id 47496, offset 0, flags [DF], proto TCP (6), length 60)
192.168.2.2.62787 > siri.4.https: Flags [.], cksum 0x3719 (correct), seq 1, ack 1, win 8280, options [mp capable flags:H sndkey:96e576198c475350 rcvkey:d7b705e4d86c1a66], length 0
The third segment contains the MP_CAPABLE option that includes the keys chosen by the sender and the receiver. Since the client repeats the sender and receiver keys in the ACK segment, the server can remain stateless.
12:43:31.357386 IP (tos 0x0, ttl 64, id 53779, offset 0, flags [DF], proto TCP (6), length 204)
192.168.2.2.62787 > siri.4.https: Flags [P.], cksum 0xeb34 (correct), seq 1:145, ack 1, win 8280, options [mp dss flags:MA dack: 2248627404 dsn: 3845908739 sfsn: 1 dlen: 144,eol], length 144
12:43:31.357430 IP (tos 0x0, ttl 64, id 23748, offset 0, flags [DF], proto TCP (6), length 100)
192.168.2.2.62787 > siri.4.https: Flags [P.], cksum 0x0c31 (correct), seq 145:185, ack 1, win 8280, options [mp dss flags:MA dack: 2248627404 dsn: 3845908883 sfsn: 145 dlen: 40,eol], length 40
12:43:31.385032 IP (tos 0x0, ttl 244, id 30705, offset 0, flags [DF], proto TCP (6), length 48)
siri.4.https > 192.168.2.2.62787: Flags [.], cksum 0x4c82 (correct), seq 1, ack 145, win 2221, options [mp dss flags:A dack: 3845908883], length 0
12:43:31.389460 IP (tos 0x0, ttl 244, id 31058, offset 0, flags [DF], proto TCP (6), length 48)
siri.4.https > 192.168.2.2.62787: Flags [.], cksum 0x4c34 (correct), seq 1, ack 185, win 2219, options [mp dss flags:A dack: 3845908923], length 0
The data transfer can now start. Siri uses HTTPS and thus the Multipath TCP connection begins with a TLS handshake. The details of this handshake are not important for Multipath TCP. There are some interesting details to mention concerning this utilisation of Multipath TCP. First, iOS7 does not seem to use the DSS checksum. This checksum was designed to detect payload modifications by middleboxes. With TLS, it is unlikely that a middlebox will modify the contents of the segment. Second, the DSNs and Data ACKs are 32 bits wide, while RFC 6824 defines both 32-bit and 64-bit Data Sequence Numbers. Third, iOS7 seems to place one DSS option inside each segment.
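The DSS values in the trace above can be cross-checked with a short Python sketch. The `TRACE` string below only keeps the DSS part of the four data and acknowledgment segments shown earlier. The `client` and `server` labels, the regular expressions and the variable names are assumptions made for this illustration.

```python
import re

# Shortened copies of the DSS options from the trace above.
TRACE = """\
client mp dss flags:MA dack: 2248627404 dsn: 3845908739 sfsn: 1 dlen: 144
client mp dss flags:MA dack: 2248627404 dsn: 3845908883 sfsn: 145 dlen: 40
server mp dss flags:A dack: 3845908883
server mp dss flags:A dack: 3845908923
"""

MAPPING = re.compile(r"dsn: (\d+) sfsn: (\d+) dlen: (\d+)")
PURE_DACK = re.compile(r"flags:A dack: (\d+)")   # does not match "flags:MA"

# (data sequence number, subflow sequence number, data-level length)
mappings = [tuple(int(x) for x in m) for m in MAPPING.findall(TRACE)]
server_dacks = [int(d) for d in PURE_DACK.findall(TRACE)]

# With 32-bit DSNs, the Data ACK expected for a mapping is dsn + dlen modulo 2**32.
expected_dacks = [(dsn + dlen) % 2**32 for dsn, _sfsn, dlen in mappings]

# On a single subflow, the subflow sequence numbers advance by the same length.
next_sfsn = [sfsn + dlen for _dsn, sfsn, dlen in mappings]

print(expected_dacks == server_dacks, next_sfsn)
```

The two Data ACKs returned by the server are exactly the DSN of each mapping plus its data-level length, and the second mapping starts at subflow sequence number 145, right after the 144 bytes of the first one.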
When analyzing packet traces, it is often interesting to show graphically the evolution of the connection. For regular TCP, tcptrace provides various ways to visualise the evolution of a TCP connection. Benjamin Hesmans is developing a tool that will provide the same features but for Multipath TCP. This tool is still being developed, but it already provides some nice visualisations. Since Siri only sends a small amount of data, we can only plot the evolution of the Multipath TCP Data Sequence Number.
The figure below shows the flow of data from the client (i.e. the iPad) to the server. Each vertical bar corresponds to one or more segments and the red dots represent acknowledgments. The WiFi network used for the test worked well and there were no losses.
With Siri, most of the data is sent by the client, as shown by the server sequence number trace below. After the TLS handshake, very little data is sent by the server.
The Multipath TCP control stream
During IETF89 we will present a draft [CS] that proposes to define the semantics of a single bit inside the DSS option. This change might appear small at first glance, but it could have a huge impact on the evolution of Multipath TCP and its future.
The DSS option is defined in RFC 6824 to encode the mapping between the Data Sequence Number and the subflow sequence number. RFC 6824 supports one bytestream in each direction between the communicating hosts. This bytestream is used to carry the data supplied by the user applications.
The Control Stream draft [CS] proposes to support two bytestreams in each direction. The first is the regular bytestream that is used to transport regular data. The second is a bytestream that allows the communicating hosts to exchange control information that is relevant for the Multipath TCP connection. [CS] defines the S bit in the DSS option shown below to indicate whether the mapping corresponds to the regular bytestream or to the control stream.
                     1                   2                   3
 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+---------------+---------------+-------+----------------------+
|     Kind      |    Length     |Subtype|(reserved)|S|F|m|M|a|A|
+---------------+---------------+-------+----------------------+
|       Control ACK (4 or 8 octets, depending on flags)        |
+--------------------------------------------------------------+
| Control sequence number (4 or 8 octets, depending on flags)  |
+--------------------------------------------------------------+
|              Subflow Sequence Number (4 octets)              |
+-------------------------------+------------------------------+
|Control-Level Length (2 octets)|     Checksum (2 octets)      |
+-------------------------------+------------------------------+
The S bit of the 'reserved' field is set to 1 when sending on the
control stream.
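As an illustration of where this bit sits, the Python sketch below packs and unpacks the first four octets of a DSS option. The option kind (30), the DSS subtype (2) and the A, a, M, m and F flags come from RFC 6824. The position of the S bit is simply read from the diagram above, and the helper names are assumptions of this sketch.

```python
import struct

MPTCP_KIND = 30      # TCP option kind used by Multipath TCP
DSS_SUBTYPE = 0x2    # DSS subtype in RFC 6824

FLAG_A = 0x01        # Data ACK present
FLAG_a = 0x02        # Data ACK is 8 octets
FLAG_M = 0x04        # DSN, SSN, length (and checksum) present
FLAG_m = 0x08        # DSN is 8 octets
FLAG_F = 0x10        # DATA_FIN
FLAG_S = 0x20        # control stream bit, position taken from the diagram


def pack_dss_header(length: int, flags: int) -> bytes:
    """Build Kind, Length, Subtype and the 12 reserved/flag bits."""
    word = (DSS_SUBTYPE << 12) | (flags & 0x0FFF)
    return struct.pack("!BBH", MPTCP_KIND, length, word)


def unpack_dss_header(data: bytes):
    """Return (kind, length, subtype, flags) from the first four octets."""
    kind, length, word = struct.unpack("!BBH", data[:4])
    return kind, length, word >> 12, word & 0x0FFF


# A mapping with 4-octet ACK and sequence number and no checksum is 18 octets long.
header = pack_dss_header(18, FLAG_M | FLAG_A | FLAG_S)
kind, length, subtype, flags = unpack_dss_header(header)
on_control_stream = bool(flags & FLAG_S)
print(header.hex(), on_control_stream)
```

A receiver that does not know about the control stream would see the S bit as one of the reserved bits, which is why defining the semantics of a single bit is enough to signal the second bytestream.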
Why would someone want to support two bytestreams over a single Multipath TCP connection?
The main motivation is that we would like to exchange control information between communicating Multipath TCP hosts without being limited by the existing TCP options:
- TCP options are sent unreliably. When a host sends a segment that contains an ADD_ADDR option inside an acknowledgement, it cannot be certain that this option will be delivered to the other host. Some techniques to improve the reliability of the delivery of this option are discussed in [CellNet12].
- TCP options have a limited size. In the Multipath TCP handshake, we use several tricks to extract the required tokens and IDSN from a hash computation to minimize the length of the MP_CAPABLE option, but this hack is far from perfect.
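The hash computation mentioned in the second point can be sketched in Python. In RFC 6824, the token is the most significant 32 bits of the SHA-1 hash of a key and the initial data sequence number (IDSN) is the least significant 64 bits of the same hash. The function name is an assumption, and the two keys are the ones visible in the Siri trace earlier in this document.

```python
import hashlib


def token_and_idsn(key: bytes):
    """Derive the 32-bit token and 64-bit IDSN from a 64-bit MP_CAPABLE key."""
    digest = hashlib.sha1(key).digest()              # 20 octets
    token = int.from_bytes(digest[:4], "big")        # most significant 32 bits
    idsn = int.from_bytes(digest[-8:], "big")        # least significant 64 bits
    return token, idsn


client_key = bytes.fromhex("96e576198c475350")   # sndkey of the iPad in the trace
server_key = bytes.fromhex("d7b705e4d86c1a66")   # sndkey of the server in the trace

client_token, client_idsn = token_and_idsn(client_key)
server_token, server_idsn = token_and_idsn(server_key)
print(f"client token {client_token:#010x} idsn {client_idsn:#018x}")
print(f"server token {server_token:#010x} idsn {server_idsn:#018x}")
```

Only the 64-bit keys travel in the MP_CAPABLE option. Both hosts recompute the tokens and the IDSNs locally, which saves option space at the price of tying these values to a fixed hash function.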
With a bytestream that allows control information to be sent inside the payload of TCP segments, it is possible to define new techniques to synchronise the two communicating state machines. As a first example, it becomes possible to ensure a reliable delivery of the ADD_ADDR option. Consider a client having several IPv6 addresses. An RFC 6824 compliant implementation would probably send these addresses inside independent TCP segments as shown below:
The only way for the sender to recover from the loss of the segment advertising IP2 is to regularly send the list of addresses that it owns. This is inefficient.
With the control stream, advertising several addresses becomes much simpler.
The same applies to the RM_ADDR option. With the control stream, the list of the addresses owned by each host can be exchanged reliably.
This is not the only application of the proposed control stream. The control stream could prove to be very useful to enhance the security of Multipath TCP. RFC 6824 includes a basic method to “authenticate” the addition of subflows by exchanging 64-bit keys in clear during the initial three-way handshake. From a security viewpoint, exchanging 64-bit keys in clear is obviously not the best solution. A better solution would be to use longer keys and rely on a key exchange scheme that is secure even if a passive listener is able to capture the segments exchanged. By relying exclusively on TCP options, this is impossible. With the control stream, it becomes possible to use any secure key agreement mechanism such as Diffie-Hellman or any other scheme to agree on a shared secret. Once the shared secret has been negotiated, it can be used to authenticate the establishment of the additional subflows.
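The following Python sketch illustrates the principle with a toy Diffie-Hellman exchange followed by an HMAC computed with the negotiated secret. It is not the mechanism defined in [CS] or in RFC 6824: the group parameters are deliberately tiny and insecure, and the function names, the use of SHA-256 and the nonce sizes are assumptions of this example.

```python
import hashlib
import hmac
import secrets

# Toy group: 2**127 - 1 is prime, but far too small for real use.
P = 2**127 - 1
G = 3


def dh_keypair():
    """Return a (private, public) pair; only the public value is sent."""
    private = secrets.randbelow(P - 3) + 2          # in [2, P - 2]
    return private, pow(G, private, P)


def dh_shared_secret(private: int, peer_public: int) -> bytes:
    """Both hosts obtain the same secret without ever sending it."""
    shared = pow(peer_public, private, P)
    return hashlib.sha256(shared.to_bytes(16, "big")).digest()


def subflow_mac(secret: bytes, nonce_a: bytes, nonce_b: bytes) -> bytes:
    """Authenticate a new subflow with the negotiated secret."""
    return hmac.new(secret, nonce_a + nonce_b, hashlib.sha256).digest()


# Exchange over the (hypothetical) control stream: only public values travel.
a_private, a_public = dh_keypair()
b_private, b_public = dh_keypair()
secret_a = dh_shared_secret(a_private, b_public)
secret_b = dh_shared_secret(b_private, a_public)

# Later, when a subflow is added, each side proves knowledge of the secret.
nonce_a, nonce_b = secrets.token_bytes(4), secrets.token_bytes(4)
mac_from_a = subflow_mac(secret_a, nonce_a, nonce_b)
accepted = hmac.compare_digest(mac_from_a, subflow_mac(secret_b, nonce_a, nonce_b))
print(secret_a == secret_b, accepted)
```

A passive listener sees the two public values and the nonces but not the secret. Such an unauthenticated exchange does not protect against an active attacker who can modify the segments.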
Transporting control information inside the payload of segments may sound familiar to those who have followed the discussions that led to the design of Multipath TCP. During several months in 2010, the MPTCP working group discussed two solutions to transport data over different paths.
The first approach, which later became RFC 6824, only uses TCP options to encode all the control information. This solution was considered to be optimal to pass through various types of middleboxes. Recent experience with Multipath TCP implementations shows that Multipath TCP can indeed pass through most types of middleboxes.
The second approach, proposed by Michael Scharf in [MCTCP], was to encode all the control information inside the payload of the TCP segments. For this, MC-TCP relies on a TLV format to exchange both control and user data. Compared to the first approach, the advantage of MC-TCP was that it was possible to implement it as a library in user-space, but the MPTCP working group felt that this solution was too risky given the prevalence of middleboxes.
The control stream makes a minimal use of the TLV format to encode some control information. It remains to be seen whether there are interactions with some types of middleboxes that could lead to problems. DPIs are a likely source of problems for the control stream, but they already have a problem today with Multipath TCP if they do not process all the data for a given Multipath TCP connection. Adding the control stream does not create an additional problem and one can expect that with the deployment of Multipath TCP on all iOS7 devices, middlebox vendors will start to add support for Multipath TCP on the DPI boxes…
Bibliography
[CS] Christoph Paasch, O. Bonaventure, A generic control stream for Multipath TCP, February 2014, Internet draft, work in progress, https://datatracker.ietf.org/doc/draft-paasch-mptcp-control-stream/
[MCTCP] Michael Scharf, Multi-Connection TCP (MCTCP) Transport, Internet draft, July 2010, work in progress.
[CellNet12] Christoph Paasch, Gregory Detal, Fabien Duchene, Costin Raiciu, and Olivier Bonaventure. 2012. Exploring mobile/WiFi handover with multipath TCP. In Proceedings of the 2012 ACM SIGCOMM workshop on Cellular networks: operations, challenges, and future design (CellNet ‘12). ACM, New York, NY, USA, 31-36. http://doi.acm.org/10.1145/2342468.2342476