Improving Network Performance in QEMU
QEMU is one of the projects with the strictest code guidelines I have come across. By comparison, it takes significantly less effort to cobble together a Linux kernel patch and have it accepted than to get something into QEMU. Unfortunately, the strictness of the code requirements does not necessarily translate into good design - at least not in the network subsystem.
There are multiple issues which can be summarized as follows:
- Processing one packet at a time
- Reusing old NetBSD queuing macros without fully understanding the original BSD design from the era when these macros were written. This leads to:
  - Malloc-ing in the middle of the network processing path
  - Moving packets around with memcpy()
- While network front-ends support APIs which may use multiple buffers, there is no way to hint to the front-end whether it should prefer fragmented or contiguous buffers. This is primarily an issue on the receive side and specifically with virtio; all other drivers use contiguous buffers when working in packet-at-a-time mode.
Processing One Packet At A Time
I have partially addressed this in another article, when looking at why packet IO in a Virtualized Network Element is so decrepit. While this is a common problem in various virtualization stacks, QEMU is particularly bad: all of its packet APIs are one packet at a time. So even if we kill recv()/send() or read()/write() in favor of recvmmsg()/sendmmsg(), our gain is significantly lower than we would expect (only around 30-40%), because packet delivery into the vNIC across the VM boundary is still one packet at a time.
I was initially surprised by this when I first added multipacket receive via the l2tpv3.c driver. So I dug a little deeper and found that the offending code is not in the network front-end, but in queue.c, and is reused by all network drivers. It is pointless to try to address this without addressing the root cause: queue.c.
Misuse of BSD Queuing Code
Old BSD queuing code is an interesting beast. If memory serves me right, it used "pass the parcel" buffer ownership (I really should dig around the NetBSD tree - something I have not done since I used it to replace Ultrix on MIPS DECstations). Based on what I recall of working with that code at the time (that is how old these macros are), it was designed mostly for a model where the de-queuer takes ownership and the en-queuer relinquishes it (aka "pass the parcel"). That fine design detail is actually a BIG deal, because it allows the pointer to be passed around without copying and without reference counting. That design is mostly abandoned nowadays - everyone counts references instead, in order to allow buffer cloning. The issue with QEMU is that it neither relinquishes ownership on enqueue nor uses reference counting. As a result, every time it needs to enqueue a packet it (waaaaait for it... drum roll...):
- mallocs in the middle of packet processing. Gulp.
- memcpy-s the packet into the newly malloced buffer. Double gulp.
- to add insult to injury, processes everything one packet at a time. For example, on a stall you should *BLOCK*-alloc and *BLOCK*-process all remaining packets; similarly, you should try to *BLOCK*-enqueue and act on the "number enqueued" value returned by the en-queuer.
- does not start off with a proper packet + metadata buffer in the IO front-ends; this is bolted on along the way. This design decision prohibits the use of reference counting, so it has to use "pass the parcel" - something it does not do either.
This desperately needs fixing, and this is where a HUGE performance reserve lies (in the gigabit range). Some of it will benefit all drivers, but the ones that use multi-packet receive and transmit will benefit the most. It is not an easy fix, by the way - you need to add the relevant additional APIs without breaking any of the existing code.
Zero Copy in Virtio
What people keep missing is that virtio network drivers are not 100% zero copy. They copy an initial portion of each packet, sized by the GOOD_COPY_LEN constant in the virtio_net.c code. At present, the front-end does not make any allowances for this when receiving. Drivers have to support it when sending, but not the other way around. There is some performance reserve here as well.
- 18 Jan 2017