To Virtualize or Not To Virtualize. Part 2 - Performance of Packet IO
In order to virtualize a network element, its virtual instances must be able to cope with the expected load. This is usually not an issue if we virtualize it by slicing it into per-tenant subdomains (vRFs, Containers, etc). The overall network element capacity is usually well known, and slicing it into per-tenant domains adds minimal additional overhead. The situation becomes more complex if we decide to run a network element in a VM instead. There, the off-the-shelf performance is significantly lower than what is considered acceptable by today's network element standards. The speeds when operating in per-packet mode (as needed by most network element software) while using legacy IO are at best around a couple of gigabits. Usually, they are significantly worse.
We can, of course, just write off the idea of running network element software in a VM using standard virtualization IO as unworkable. Quite a few people are doing exactly that and are building systems based on this assumption. I, however, have spent a large portion of my career successfully proving that things considered unworkable can be made to work and scale to fit the specification. So instead of giving up, I would actually like to give it a try and scope the exact limits of how far we can push normal virtual IO.
The Root Cause for Decrepit Packet IO Under Virtualization
The root cause is fairly trivial - all non-TCP traffic is processed in one-packet-at-a-time mode. This is equivalent to using a primitive PC network adapter from the 1990s. The network cards from that bygone age always threw an interrupt for each packet. You also had to copy each packet to/from the adapter buffers. The only difference in virtualization environments is that all of this is now done in software, at the hypervisor layer. This is the standard modus operandi of QEMU/kvm. Xen is not any different, and as I am not that familiar with VMware internals I will not express an opinion about it.
At the same time, the network driver emulation under virtualization always supports some form of multipacket IO in order to provide TCP segmentation offload. This is why TCP performance is significantly better than what would be expected from per-packet benchmarks. So the question is: how far, and by what means, can we push the packet IO to match and exceed TCP performance, and what are the barriers for individual approaches?
The key to performance improvement is to reduce the number of syscalls per packet executed by the hypervisor. When working on packet forwarding or other typical network element tasks, off-the-shelf hypervisors will read or write one packet per syscall. As a result they quickly hit a hard packet-per-second limit. That limit is somewhere around 0.1-0.2MPs (million packets per second, or about 1.2GBit with 1500-byte packets) per core used in IO. Usually it is just one core as well. That is clearly not an acceptable number for a virtualized network element and is the reason why people have looked at a variety of alternatives including DPDK, direct access to network adapters via PCI virtualization, etc.
Throwing hardware at this performance bottleneck does not help, and neither do Moore's law improvements in CPU performance. This barrier has stayed in place for nearly 10 years now and has not improved in the slightest. At the same time, Linux, for example, is perfectly capable of moving packets (either forwarding them or moving them to/from applications using high performance network APIs) at 0.5-1MPs and up to 6GBit per core. Other OSes are similarly capable of hitting >5GBit per core and multi-MPs throughput.
In order to make things a bit easier to explain and provide a reasonable set of examples, I will slightly narrow down the scope from here onwards. While the rest of the article is applicable to other OSes as well as bare metal hypervisors, trying to describe the approach in an OS/Hypervisor agnostic manner is fairly difficult. None of the APIs involved is a part of any known standard - they are OS specific APIs and extensions. As a result, naming, terminology and calling conventions differ wildly between the usual suspects. In order to bring this down to earth and express it in terms of the actual calls involved, I am going to stick strictly to Linux Host and Guest based examples, as well as hypervisors running on top of Linux as a Host OS.
So, how can Linux on bare metal achieve 5+ times more in terms of throughput? The answer is simple - by going multipacket and reducing the average number of syscalls per packet handled to < 1. It is relatively easy to go multi-packet on a Linux Host when working with sockets. The multi-packet receive call recvmmsg was added in Linux 2.6.33, back in 2010. Similarly, the multi-packet transmit call sendmmsg has been around since Linux 3.0. It is ridiculous that virtualization stacks use the ancient and slow recv() and recvfrom() instead.
When Can Multi-packet Be of Help?
Multi-packet helps most when packets can accumulate in the OS kernel socket queue on the Host side. This can happen in a variety of common situations:
- The guest being slow at processing the incoming packets
- Task switch - the hypervisor and/or the host OS being busy with something else
- Packets arriving at a rate exceeding the maximum rate of packet-at-a-time processing
- Multiple packets being produced at the same time as a result of segmenting or fragmenting a frame by the host OS
The following situations are significantly less common:
- The host being too slow at processing packets provided by the guest
It is also possible, albeit quite difficult, to deliberately queue up some packets on both receive and transmit and release several at a time from Host to Guest and in the opposite direction. This behavior is fairly standard in Gigabit (and above) network adapters in hardware - they can be programmed to delay the triggering of a packet arrival interrupt in order to accumulate more than one packet to be transferred via DMA.
Emulating this under virtualization is not worth it. The delays required are comparable or smaller than the resolution of the high resolution timers in most systems. In addition to this, the cost of setting up a timer is roughly equivalent to the cost of processing several packets one-at-a-time. The reason for this is that setting the timer is a syscall in its own right and doing so requires one or more system time reads. These trigger cache synchronization events in the system which are quite costly. Similarly, the cost of processing when the timer is being triggered is also comparable to several packets at a time - for similar reasons.
Introducing multi-packet receive to virtualization environments is easy if the packets are received on a socket transport. If the transport is using a device file descriptor or pipe, as in tun/tap, the multi-packet calls will fail. This rules out tap, tun and any of the acceleration work (such as vhost-net) done for these transports. On the positive side, this allows us to use other transports such as L2TPv3 and GRE, which allow a VM to communicate directly with a network element or another VM without the need for a virtual switch. It also allows us to use raw sockets and bind to Ethernet and Ethernet-like interfaces directly.
Multi-packet Receive in a Virtualization Environment
The most common scenarios observed in multi-packet accumulation - slow guest, task switch and high rate - are a natural fit for the receive side. The socket buffer in the Host kernel will accumulate packets while the Guest is not receiving them. So, in fact, as a first step we only need to replace the recv() or read() used by the hypervisor IO layer with recvmmsg.
An example of this "minimal intervention" is the current l2tpv3 driver in QEMU/kvm. This approach by itself improves IO performance by about 50%. It can provide a significantly better improvement if it is integrated deeper into the virtualization environment, as will be described in the QEMUNetIO article. While QEMU can never be integrated as closely as the network drivers in my UML work, it should still be able to reach comparable performance - in excess of 0.5MPs and line rates in excess of 3-4GBit per core for suitable transports.
Multi-packet Transmit in a Virtualization Environment
Making packet transmit go multipacket is an order of magnitude harder than going multipacket on receive. In Linux, reaching the maximum transmit speed in per-packet mode and starting to queue packets in the socket buffer usually does not result in -EBUSY being returned by the relevant kernel calls. You get a reliable -EBUSY only when working with protocols which have flow control, such as TCP. UDP and raw packets may simply be dropped instead. As a result the normal queue empty/queue full semantics cannot be relayed to the vNIC transmit routine and it cannot efficiently accumulate packets into a multi-packet call.
This problem severely limits the use cases where multi-packet transmit is viable. In fact, under realistic conditions only tcp segmentation, fragmentation and some task-switching conditions in the hypervisor will result in the accumulation of a multi-packet transmit sequence. As a result, enhancing transmit always remains a second order optimization.
So, Where Is The Border?
First of all, if you are looking at using off-the-shelf virtualization stacks for anything that works in per-packet mode and exceeds 1GBit today, you are out of luck. You need to either provide the guest with some level of physicality (dedicated resources, PCIoV, special IO transport+drivers bypassing the hypervisor IO loop, etc) or use a "slice" out of a larger network element. Whether that network element is built using NPUs and ASICs or DPDK does not really matter, provided that it can be sliced and diced into per-tenant domains. The already laughable 1GBit/0.1MPs number becomes much worse if timers and QoS are involved. In a worst case scenario it can drop by another order of magnitude.
Once all the work described in these articles is complete and the network drivers being submitted to QEMU and UML are finished, the situation should change slightly as we are able to move the goalposts:
- Network elements such as IDSes which predominantly consume packets can be scaled >1MPs and >5GBit when using off-the-shelf virtualization.
- Network elements which forward packets and/or have an equal share of transmit and receive can go to >0.5MPs and >3GBit.
- Network elements which just generate traffic - the improvements there are likely to be minimal.
To be clear - these numbers are based on actual Proof of Concept code and prototypes, some of which I have developed all the way to product prototypes in my previous jobs. They are not taken out of thin air. In fact, most of the work required is to update, bugfix and clean up my original work for Cisco from 2011-2014. Cisco was very kind to open source key parts of that work and submit them to the relevant projects at the time.
It is possible to go even beyond that, but that would require significant re-engineering of virtualization IO and network IO to bypass some or all of the hypervisor IO loop. That is out of scope here, and for a reason. If you bypass the IO loop for your primary IO function, you can starve the hypervisor of events to work with. That may work extremely well for particular special tasks (there are various efforts by network vendors to build virtual appliances like that). It is, however, limited to particular special applications integrated with the special IO layer. It is no longer general purpose virtualization, and it is not something which any off-the-shelf guest OS load can benefit from.
So, What Can We Do With 3-6GBit and 0.5-1MPs?
This is a perfectly respectable SMB figure. You can implement a per-customer network element service and run it successfully and at a profit. You can also implement some level of aggregated residential service. 3-6GBit is overkill for a single residential or SMB customer in most regions and is likely to remain overkill for a considerable amount of time.
At the same time, 3-6GBit is generally sub-par for what you would want to run in a public cloud, enterprise or mobile environment. There you need to look at elements which can be subdivided, regardless of whether they are built using DPDK (or similar frameworks) or out of ASICs and NPUs. That may still leave a role for a VM-per-network-element deployment, but only if the VM is limited to a subset of the traffic - the so-called Feature Path.
- 19 Jan 2017