Zero-Copying Whole Packets Is Not Always the Answer
Nearly all packet processing systems like DPDK, VPP, etc. built by people with an NPU background have been designed with two principles in mind. Considering the religious fervor with which packet processing people enforce them, we may as well call them "commandments":
- Anyone who copies a packet shall be burned at the stake. This is a mortal sin which shall not be forgiven.
- Packets are processed as a whole; buffers are allocated as needed and where needed to accommodate the packet and any warts on its front or back acquired as part of encapsulating it - VPNs, pseudowires, etc.
These two principles make for an interesting combination when they meet any high-performance application built for a general-purpose OS. They also do not make sense for most high-end network hardware that has shipped with generic compute systems during the last decade. Let's look into that.
To Copy Or Not To Copy
Mapping memory is not free of charge. The cost differs depending on the hardware, OS and virtualization stack in use, ranging from fairly high (UML) to moderate (Type 1 hypervisors). It is, however, never free of charge - that cost is in the hundreds of clock cycles. Even if the cost is shared across multiple packets, it is still fairly substantial. Compared to that, copying a sub-128-byte packet using SSE2 instructions costs tens of cycles. Even if the packet is larger than 128 bytes, it is still worth copying its header, because this makes the header HOT: it is now in the cache, which significantly lowers the cost of deciding what to do with it in the OS network stack. So, rather unsurprisingly, a lot of network drivers in generic OSes, including drivers for vNICs, copy small packets and/or copy packet headers. An excellent example of this is the virtio driver, as well as any driver in the Linux kernel which still carries some of the original copy-small-packets code by Donald Becker.
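The gap is easy to demonstrate even from a high-level language. The sketch below is a rough, illustrative microbenchmark (not a cycle-accurate measurement - the page and packet sizes are arbitrary) comparing an anonymous map/unmap round trip against copying a 128-byte buffer:

```python
import mmap
import timeit

PKT = bytearray(128)  # a small packet's worth of data

def map_page():
    # An anonymous mmap + munmap: two syscalls plus page-table work.
    m = mmap.mmap(-1, 4096)
    m.close()

def copy_small():
    # Copying a sub-128-byte packet: a handful of cache-line moves.
    return bytes(PKT)

n = 10_000
t_map = timeit.timeit(map_page, number=n)
t_copy = timeit.timeit(copy_small, number=n)
print(f"map/unmap: {t_map / n * 1e9:.0f} ns/op, "
      f"128-byte copy: {t_copy / n * 1e9:.0f} ns/op")
```

On typical hardware the copy is cheaper by an order of magnitude or more, even with the interpreter overhead inflating both numbers.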
The first commandment of network packet system design has been brutally violated, and for a good reason. In a generic OS it makes sense - the value is not in shuffling the packet from the left pocket to the right pocket; the value is in doing interesting things with it. That is likely to involve accessing significant portions of the packet anyway, so even if we copied the whole thing, the actual end-to-end cost, including processing by the application, would show only a marginal increase.
By copying only part of a large packet, we have also violated the second commandment: the packet is no longer in one piece. That actually has an even BIGGER effect on system design than the zero-copy commandment.
Once upon a time, at the dawn of the modern OS, there used to be four ways to talk to a file descriptor: read and write if it was a file or a connection-oriented socket, and recvfrom and sendto if it was connectionless. I am deliberately ignoring recv and send, as most OSes turn them into read/write or recvfrom/sendto internally. As a mentality, that maps nicely onto packet processing concepts: you needed a fixed buffer to read into and a fixed buffer to write from. Nice, simple and extremely low performing. It also made complex application design difficult - you could not delegate building headers to one part of your application, building the payload to another, and so on. They all had to agree on working with the same buffer. Initially, this was worked around using extensive copying and locking in multi-threaded applications.

It was fixed in generic OSes more than 15 years ago by introducing vector IO and using IO vectors (iovs) to describe segmented buffers:
- OS side: readv/writev and their socket counterparts sendmsg/recvmsg
- Network adapter side: support in network hardware to work with segmented buffers - also known as scatter-gather IO.
- RX/TX checksum offload for Scatter Gather IO.
These abstractions match modern application design. You can "split the work" - give different parts of your application their own segments to work on. They vastly simplify buffer management - you no longer need to allocate and deallocate gigantic buffers to work on data; you allocate buffers as needed for segments and combine them at the very end. They also provide implicit performance gains by saving copies inside the application. Nearly anything written with performance in mind in the last 10 years makes very heavy use of them. The applications have moved on, and in a very good way, with some OSes like Linux providing even further abstractions such as sendmmsg/recvmmsg. OS-internal buffer management, drivers, etc. have similarly moved on to match them.
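A minimal sketch of the vector-IO model, using Python's os.writev/os.readv over a pipe purely for illustration (the segment names are invented): three independently built segments are gathered on write and scattered back into separate buffers on read, with no joining copy in the application.

```python
import os

# Hypothetical segments, each built by a different part of the application.
header = b"HDR:"
payload = b"payload-bytes"
trailer = b":TRL"

r, w = os.pipe()
# writev gathers the iovec into one stream; the application never concatenates.
written = os.writev(w, [header, payload, trailer])
os.close(w)

# readv scatters the incoming bytes back into separate buffers.
b1 = bytearray(len(header))
b2 = bytearray(len(payload))
b3 = bytearray(len(trailer))
nread = os.readv(r, [b1, b2, b3])
os.close(r)
```

The same shape carries over to sockets, where sendmsg and recvmsg take the iovec list instead.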
In addition to everything else, on Linux, if the hardware allows it, Scatter-Gather IO also makes moving the data from userspace to the network hardware itself a zero-copy operation: the kernel forms a small skbuff with the header only, and the rest is shipped directly out of userspace via DMA by the network adapter.
The semantics of Scatter-Gather IO as done by a modern OS do not match traditional network packet processing or its reincarnations on generic compute hardware like DPDK. In fact, compared to a modern application and the OS support behind it, the way nearly all packet processing frameworks handle data looks distinctly Stone Age. That is not an issue when something like DPDK is used just for packet processing. It becomes an issue when you try to hand a packet over to a high-performance application written for a generic OS. A modern high-performance network application, regardless of whether it is running in a VM or on bare metal, will spit out a "sashimi" of buffer segments and will expect to receive a sashimi of buffer segments in return. That is business as usual for a generic OS - it is designed to take care of it and does so in its stride.
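The socket side of that "sashimi" looks like this - a hedged sketch using a local datagram socketpair (the header/body split is made up for illustration): sendmsg gathers the application's segments into one message, and recvmsg_into scatters it back.

```python
import socket

# A datagram socketpair preserves message boundaries for the demo.
a, b = socket.socketpair(socket.AF_UNIX, socket.SOCK_DGRAM)

header = b"HDR!"               # built by one part of the application
body = b"application payload"  # built by another part

# sendmsg gathers the segments into a single datagram - no userspace join.
a.sendmsg([header, body])

# recvmsg_into scatters the datagram into separate receive buffers.
h = bytearray(len(header))
p = bytearray(len(body))
nbytes, ancdata, flags, addr = b.recvmsg_into([h, p])
a.close()
b.close()
```

Neither side ever sees, or needs, the message as one contiguous buffer.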
So what does the interface to a packet processing framework do to match this behavior? It copies. It has no other choice - its primitive buffer management, designed specifically to shovel packets from the left pocket to the right pocket, cannot cope with the "sashimi" provided by the application. So in fact, integrating a high-performance packet framework like DPDK has left us worse off than the OS by itself, which moved the data directly via DMA as zero-copy.

If done correctly, the end result is performance not much different from that of a generic OS, especially if network polling has been enabled in it. In fact, it is often lower - and all of this at the cost of the loss in flexibility associated with using antisocial software in the system.
There is a key caveat - RX/TX checksum offload. Scatter-Gather IO without checksum offload has exactly zero benefit as far as network performance is concerned. The reason is that walking the scatter-gather list and checksumming every packet on the CPU costs more or less the same as copying it on most modern CPU architectures (it is definitely the case on Intel). If, however, SG IO correctly interacts with checksum offload, the CPU does not need to walk the entire buffer list to compute the checksum - that is left to the network card.
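To see why, here is a pure-Python sketch of the RFC 1071 Internet checksum computed in software over a scatter-gather list (illustrative only - a real stack does this in optimized C or, better, not at all): the CPU has to touch every byte of every segment, which is exactly the walk that hardware checksum offload eliminates.

```python
def inet_checksum(segments):
    """RFC 1071 ones'-complement checksum over a list of buffer segments."""
    total = 0
    odd = None  # leftover odd byte carried across a segment boundary
    for seg in segments:
        data = bytes(seg)
        if odd is not None and data:
            total += (odd << 8) | data[0]  # pair the carried byte
            data = data[1:]
            odd = None
        if len(data) % 2:
            odd = data[-1]
            data = data[:-1]
        for i in range(0, len(data), 2):  # the per-byte walk offload avoids
            total += (data[i] << 8) | data[i + 1]
    if odd is not None:
        total += odd << 8
    while total >> 16:  # fold carries back into 16 bits
        total = (total & 0xFFFF) + (total >> 16)
    return (~total) & 0xFFFF

# The result is the same whether the packet is whole or sliced into segments.
whole = bytes(range(64))
parts = [whole[:5], whole[5:20], whole[20:]]
```

Splitting at odd offsets still works because the leftover byte is carried into the next segment - exactly the bookkeeping a software SG checksum routine is forced to do.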
Effect on Virtualization
The biggest difference of all appears when this is applied to virtualization. If an application in a VM sends data via a virtual NIC which correctly handles Scatter-Gather IO and correctly offloads RX/TX checksums, via a hypervisor transport which in turn uses vector IO to the kernel, and ultimately to a physical NIC which does Scatter-Gather and RX/TX checksum offload, you actually get zero copy end-to-end. There is no buffer copying anywhere - just page, offset and virtual mapping address computation. Now, try to match that with a framework which understands only the concept of a packet and has to deal with anything and everything as a packet. Sorry, that will be a Check, Mate.
- 17 May 2017