The Pros and Cons of Offloading in a Virtualized Environment

Introduction

How things change over the years - 15 years ago it was a big deal to have your TCP checksums computed by the network card. It was an equally big deal to accommodate this in the network stack of a generic OS, and OSes competed for bragging rights on who had managed to do it better.

Things have changed - you can no longer fit the itemized list of offloads on a high end network adapter on a single page. OSes, and Linux in particular, have changed to match the evolution of the hardware. As with everything hardware related, using all of these features on a particular network card is quite often hit and miss. Nearly all of them, however, work on virtual adapters and in virtualized environments. Some of them are actual game changers, resulting in 20+ times differences in throughput for a typical application scenario. Does this "game change", however, still apply to routing, QoS, firewalls, etc - the stuff which I like to play with? Let's find out.

First of all, let's see what offloads there are to play with. Out of the many supported by Linux, the ones which are of interest for a virtual router are:
  • Checksumming
  • TX/RX Segmentation
  • Hashing

Checksumming

TX and RX checksum offload of network traffic on Linux openly violates the TCP, UDP, etc. specs by design. The TCPv4 specification, for example, mandates that a value of zero is written to the header field where the checksum will be stored, then a checksum is computed over a "pseudoheader" consisting of the IP addresses and other main header fields, followed by a checksum over the payload. The computed checksum is then written to the header field. This makes the computation of the checksum fairly TCP, UDP, etc. specific. In order to make checksumming generic, instead of following the spec, Linux computes the "pseudoheader checksum", transforms it if needed and writes it to the header field. Thus, instead of computing a TCP, UDP, GRE, etc. checksum, a driver now needs to compute a checksum solely over the payload and then write it into the header field. The header field is non-zero - not what the spec says. It contains a magic number which, when checksummed, will adjust the final tally to the correct payload + header value.
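
To make the arithmetic above concrete, here is a minimal user-space sketch of the ones' complement sums involved. It is illustrative only - it is not the kernel's csum_partial()/csum_fold() machinery, and the exact transform the kernel applies before writing the seed into the header field is glossed over.

    #include <stdint.h>
    #include <stddef.h>
    #include <netinet/in.h>   /* IPPROTO_TCP, htons */

    /* Ones' complement sum of a buffer, treated as big-endian 16-bit words. */
    static uint32_t ocsum_add(uint32_t sum, const void *buf, size_t len)
    {
        const uint8_t *p = buf;
        while (len > 1) {
            sum += (uint32_t)p[0] << 8 | p[1];
            p += 2;
            len -= 2;
        }
        if (len)
            sum += (uint32_t)p[0] << 8;
        return sum;
    }

    /* Fold the 32-bit running sum down to 16 bits. */
    static uint16_t ocsum_fold(uint32_t sum)
    {
        while (sum >> 16)
            sum = (sum & 0xffff) + (sum >> 16);
        return (uint16_t)sum;
    }

    /* Pseudoheader sum for TCP over IPv4: source address, destination address,
     * protocol and TCP length (addresses passed as 4 network-order bytes).
     * The full TCP checksum is the complement of fold(pseudoheader + TCP header
     * + payload); the "magic number" the text describes is derived from this
     * pseudoheader part alone, so whoever finishes the job later only has to
     * sum the rest and complement the result. */
    static uint16_t tcp4_pseudo_sum(const void *saddr, const void *daddr,
                                    uint16_t tcp_len)
    {
        uint16_t proto = htons(IPPROTO_TCP), len = htons(tcp_len);
        uint32_t sum = 0;
        sum = ocsum_add(sum, saddr, 4);
        sum = ocsum_add(sum, daddr, 4);
        sum = ocsum_add(sum, &proto, 2);
        sum = ocsum_add(sum, &len, 2);
        return ocsum_fold(sum);
    }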

Matching these semantics to hardware (especially older hardware which implements the TCP, UDP, etc. spec literally) is an interesting exercise. It is, however, a godsend for virtualization. Once the initial "pseudoheader" computation has been done, the computation of the overall checksum can be delayed within a virtual system. Each and every entity can now pass the parcel without computing the checksum. Within the kernel this is done by marking SKBs as CHECKSUM_PARTIAL, and on virtual machine boundaries it is done by using vnet headers (wherever they are supported) to say "checksum needed". In fact, if the checksum is considered trusted, it will not be computed at all. In 4.9 (at least), Linux has a very entertaining assumption of trust - it will trust anything which is marked as "I have not done the payload yet, but it is ready for a payload-only computation". As a result, even if the checksum is broken, the frame will be accepted.
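
As a sketch of how that state crosses the VM boundary, the snippet below fills in a virtio-net header for a frame whose checksum has only been seeded, assuming a tap device opened with the IFF_VNET_HDR flag. Field and flag names come from <linux/virtio_net.h>; error handling is omitted.

    #include <stdint.h>
    #include <string.h>
    #include <sys/types.h>
    #include <sys/uio.h>
    #include <linux/virtio_net.h>

    /* Hand a frame to a tap device with only a partial (pseudoheader-seeded)
     * checksum.  The consumer - the peer VM, the host stack or real hardware -
     * finishes the checksum by summing from csum_start onwards and storing the
     * result at csum_start + csum_offset. */
    ssize_t send_partial_csum(int tap_fd, const void *frame, size_t len,
                              uint16_t l4_offset, uint16_t csum_field_offset)
    {
        struct virtio_net_hdr vnet;

        memset(&vnet, 0, sizeof(vnet));
        vnet.flags       = VIRTIO_NET_HDR_F_NEEDS_CSUM; /* "half-baked" checksum */
        vnet.gso_type    = VIRTIO_NET_HDR_GSO_NONE;     /* no segmentation here  */
        vnet.csum_start  = l4_offset;          /* where summing should begin     */
        vnet.csum_offset = csum_field_offset;  /* where the result is written    */

        struct iovec iov[2] = {
            { .iov_base = &vnet,         .iov_len = sizeof(vnet) },
            { .iov_base = (void *)frame, .iov_len = len },
        };
        return writev(tap_fd, iov, 2);
    }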

So, what is the effect on virtualization and virtual routers? It is fairly simple - if you cannot play the game, you are a second-class citizen. First of all, any frame produced by a virtual router which does not support vnet headers (or any other similar mechanism to convey a partial checksum) must have its checksums fully computed. This, out of necessity, is done in software on the virtual router and at CPU cost. Similarly, if the virtual router does not support the semantics to carry "half-baked" checksums across itself, the checksum will have to be computed on ingress. That, once again, is in software and costs CPU cycles. It is like starting the game with a handicap. You are guaranteed to be behind the "native" implementation and playing catch-up.

TX/RX Segmentation

Transmit and receive segmentation on Linux is closely tied to both checksumming and scatter-gather. You have to have both of these working before you are admitted as a member of the exclusive TSO/GSO club. The reasons for this are quite simple - if you cannot offload (which in reality on Linux means "delay") the checksum, the benefit of being able to feed the network layer a segmented huge frame is nil. The network layer now has to immediately walk whatever you have provided and checksum it, defeating the whole idea of carrying around a huge blob - be it a contiguous or a chunky (scatter-gather) one.
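
The dependency chain can be written down in a few lines. The sketch below is not the kernel's netdev_fix_features() and the flag names are made up for illustration, but it captures the same "no checksum offload, no scatter-gather, no TSO" logic.

    #include <stdint.h>

    #define F_HW_CSUM  (1u << 0)  /* device can finish a partial checksum          */
    #define F_SG       (1u << 1)  /* device accepts chained (scatter-gather) data  */
    #define F_TSO      (1u << 2)  /* device can segment a large TCP blob           */

    /* Drop offloads whose prerequisites are missing. */
    static uint32_t fix_features(uint32_t requested)
    {
        if (!(requested & F_HW_CSUM))
            requested &= ~F_SG;   /* no checksum offload -> walking the blob anyway */
        if (!(requested & F_SG))
            requested &= ~F_TSO;  /* no scatter-gather -> cannot hand over the blob */
        return requested;
    }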

In addition to that, in order for TSO/GSO to be of benefit, it must be supported by the other side - either a network driver which can perform segmentation or a virtual entity which can accept a non-segmented frame. The last bit is very important on a virtual system. The biggest cost is not moving bits around, the biggest cost is signaling that the bits have moved. Network throughput under virtualization is limited by the maximum event rate. As a result it is easily possible to get into 10Gbit+ territory by passing non-segmented TCP frames around. Compared to that, it is difficult to achieve rates above 1.5Gbit when performing TCP segmentation and reassembly.
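
Some back-of-the-envelope arithmetic shows why the event rate dominates. The toy program below (purely illustrative) compares how many per-frame notifications a virtual switch has to generate at a given throughput with MTU-sized frames versus 64K GSO blobs.

    #include <stdio.h>

    int main(void)
    {
        const double gbits[] = { 1.5, 10.0, 40.0 };            /* target throughputs */
        const double bytes_per_event[] = { 1500.0, 65536.0 };  /* MTU frame vs blob  */

        for (int i = 0; i < 3; i++)
            for (int j = 0; j < 2; j++)
                printf("%5.1f Gbit/s @ %6.0f bytes/event -> %9.0f events/s\n",
                       gbits[i], bytes_per_event[j],
                       gbits[i] * 1e9 / 8.0 / bytes_per_event[j]);
        return 0;
    }

At 10 Gbit/s this works out to roughly 830k events per second with 1500-byte frames, but only about 19k with 64K blobs - around a 40x difference in signaling load for the same amount of data.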

So, where does this put us as far as virtual routing is concerned?

It means that the "exclusive club" of "you will use my checksumming semantics" on Linux has just become even more exclusive. If a router, firewall or any other fully virtualized entity on Linux has the job of working with frames produced (and consumed) by two fully integrated entities (e.g. two QEMU/KVM VMs), the lack of TSO/GSO will limit it to a couple of Gigabits per core. Sure, one can plow additional resources into some funky memory-to-memory interfaces and other special means for the VMs to communicate. The benefit will be rather negligible compared to simply following the rules of the club: "Thou shalt CHECKSUM_PARTIAL and Thou shalt TSO/GSO/GRO". Following the club rules gives you forwarding rates in excess of 6Gbit per core to start off with. If you do not follow the rules, the fact that you have created a, let's say, 40Gbit IO pipe at the edge is irrelevant. Your application is now constrained to sub-2Gbit per core - a drop of 3x.

Hashing

Most modern L2 and L3 routing revolves around flow hashing and/or looking up flows in a table by hash instead of performing a full lookup to decide where to forward the packet. If the header fields participating in the computation are "hot" in the CPU cache, computing a flow hash is relatively cheap. If, however, the packet is not cache-hot, the computation comes with the relevant memory access and synchronization costs. In theory, it is tempting to avoid this cost on a forwarder and use a pre-computed hash. In practice, if the packet is passed between virtual entities there is little or no benefit. The header was formed just a few cycles ago on one VM. It is cache-hot. A few cycles more or less to compute the hash make little difference for the purposes of making a routing decision. There is probably one use case where the difference is significant - high performance routing where many Gbits of traffic cross a virtual forwarder. Then the flow hash can be quite handy, but not for routing. It can be used for load balancing flows across multiple queues on an interface in order to avoid serialization and/or events-per-second bottlenecks. This, in fact, is the same way one can use it on the OS itself.
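
As an illustration of the "use the hash for queue selection, not for routing" point, here is a toy 5-tuple hash used only to spread flows across the queues of a multi-queue interface. The hash function (FNV-1a) and the structure are stand-ins of my own, not the kernel's flow dissector or a NIC's Toeplitz hash.

    #include <stdint.h>

    struct flow_tuple {
        uint32_t saddr, daddr;   /* IPv4 addresses, network byte order */
        uint16_t sport, dport;   /* L4 ports, network byte order       */
        uint8_t  proto;          /* IP protocol number                 */
    };

    /* FNV-1a over a field - any reasonable mixing function would do here. */
    static uint32_t fnv1a(uint32_t h, const void *data, unsigned len)
    {
        const uint8_t *p = data;
        while (len--) {
            h ^= *p++;
            h *= 16777619u;
        }
        return h;
    }

    static uint32_t flow_hash(const struct flow_tuple *t)
    {
        uint32_t h = 2166136261u;
        h = fnv1a(h, &t->saddr, 4);
        h = fnv1a(h, &t->daddr, 4);
        h = fnv1a(h, &t->sport, 2);
        h = fnv1a(h, &t->dport, 2);
        h = fnv1a(h, &t->proto, 1);
        return h;
    }

    /* Packets of one flow always land on the same queue, so per-flow ordering
     * is preserved while different flows avoid a single serialization point
     * (and a single events-per-second bottleneck). */
    static unsigned pick_tx_queue(const struct flow_tuple *t, unsigned nqueues)
    {
        return flow_hash(t) % nqueues;
    }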

Conclusions

First and foremost, as usual around NFV and related topics, there is a very thick markitechtural fog being produced by everyone (even me, sometimes). My NFV can do 100GBit per second. No, it cannot - its actual performance between two VMs or two Linux containers is ONLY 800 Mbits. Yes, that is the EXACT difference for one of the loudest and most obnoxious players out there; I am not going to say which one, to protect the guilty. So, when looking at the numbers, several issues must be kept in mind:

  • Are the advertised numbers for hardware-to-VM or between VMs? It is possible to vandalize the hardware-to-VM path in a way which defeats most benefits of virtualization and achieves tens of Gigabits of performance. Some examples are integrating DPDK, SR-IOV, etc. None of these vandalisms apply to a VM-to-VM path without either copying (incurring the relevant costs) and/or breaking the VM security model.
  • What is the cost of delivering the numbers? Allocating 2 GBytes of buffers and 1.6 GBytes for DPDK memory management (again, I am not going to point fingers, to protect the guilty) in order to deliver a few tens of Gigabits may be acceptable if you have only one entity like this per host. It is not possible to have multiple. That in turn means that the entity must be shared by everyone in the system, so it has to have the relevant virtual tables, interfaces, etc. It also must support multi-queue on its interfaces - otherwise it will choke on an IO serialization point somewhere. While this is the preferred design at present, replacing one such entity with, let's say, 20 microservices is definitely a very appealing proposition.
  • Is QoS in play? Now, that is a very interesting topic. Traditionally, TSO/GSO are considered QoS killers as they produce gigantic frames that result in significant jitter on interactive traffic.
    • The "TSO is a destroyer of QoS" argument used to be correct at 10 Mbit or lower bandwidths. It is distinctly flawed at > 100Mbit and definitely flawed in a virtualized environment. The transmit times for a 64K blob are now in microsecond territory (see the worked numbers after this list). Microsecond jitter on an IP link is pretty much irrelevant for most applications - you can get bigger jitter from task switching on a generic OS.
    • In a virtualized environment the limiting factor is the number of events, so forcing TCP segmentation increases it and messes up interactive traffic more than allowing TCP to travel as blobs.
    • In a virtualized environment, there is a significant likelihood that the originator has already shipped the frame out in TSO form. Trying to segment it on a router for QoS purposes is nearly always trying to lock the gate after the horse has bolted. The sole exception is transmitting onto a very slow interface (<10Mbit). In that case, it is necessary to segment the traffic to get QoS right.
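
For reference, here is the serialization-time arithmetic behind the jitter claims in the list above - purely illustrative numbers for a single 64K blob at various link speeds.

    #include <stdio.h>

    int main(void)
    {
        const double mbit[] = { 10, 100, 1000, 10000, 40000 };
        const double bits = 65536.0 * 8.0;     /* one 64K TSO blob */

        for (int i = 0; i < 5; i++)
            printf("%7.0f Mbit/s -> %10.1f us per 64K blob\n",
                   mbit[i], bits / (mbit[i] * 1e6) * 1e6);
        return 0;
    }

At 10 Mbit/s a single blob ties up the link for roughly 52 ms, which really does wreck interactive traffic; at 1 Gbit/s it is about 520 us and at 10 Gbit/s about 52 us - well below the jitter a generic OS scheduler introduces on its own.
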
All in all, the question "To Offload or Not to Offload" in a virtualized environment nowadays has just one answer - To Offload.

-- AntonIvanov - 08 Oct 2017