Why Not Use libpcap or Linux Packet Ring
Some of the packet drivers discussed here, for example the UML PoC posted on the mailing list and a WIP I have for QEMU/kvm, use raw sockets. This allows them to bind directly to an Ethernet or Ethernet-like interface and to receive and transmit using the high performance calls recvmmsg() and sendmmsg(). I could have used libpcap instead of hacking something together on my own. So what is the advantage of a pure raw socket driver over the existing libpcap UML driver? Similarly, why evaluate the possible use of BPF on a raw socket instead of reusing the existing libpcap capabilities?
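As a rough sketch of what the receive path of such a driver can look like (this is not the actual UML or QEMU code; the names open_raw_socket, recv_batch, BATCH and FRAME_MAX are made up for illustration), binding a raw socket to an interface and draining it with recvmmsg() might be written as:

```c
#define _GNU_SOURCE
#include <arpa/inet.h>
#include <assert.h>
#include <linux/if_ether.h>
#include <linux/if_packet.h>
#include <net/if.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

#define BATCH 32        /* hypothetical: packets fetched per system call */
#define FRAME_MAX 2048  /* hypothetical: room for one Ethernet frame */

/* Open a raw socket bound to a single interface. Needs CAP_NET_RAW. */
int open_raw_socket(const char *ifname)
{
    int fd = socket(AF_PACKET, SOCK_RAW, htons(ETH_P_ALL));
    if (fd < 0)
        return -1;

    struct sockaddr_ll sll;
    memset(&sll, 0, sizeof(sll));
    sll.sll_family = AF_PACKET;
    sll.sll_protocol = htons(ETH_P_ALL);
    sll.sll_ifindex = (int)if_nametoindex(ifname);
    if (sll.sll_ifindex == 0 ||
        bind(fd, (struct sockaddr *)&sll, sizeof(sll)) < 0) {
        close(fd);
        return -1;
    }
    return fd;
}

/* Fetch up to BATCH packets with one recvmmsg() call, without blocking.
 * Returns the number of packets received (their sizes go into lens[]),
 * or -1 if nothing could be read. */
int recv_batch(int fd, unsigned char bufs[BATCH][FRAME_MAX],
               unsigned int lens[BATCH])
{
    struct mmsghdr msgs[BATCH];
    struct iovec iov[BATCH];

    assert(bufs != NULL && lens != NULL);
    memset(msgs, 0, sizeof(msgs));
    for (int i = 0; i < BATCH; i++) {
        iov[i].iov_base = bufs[i];
        iov[i].iov_len = FRAME_MAX;
        msgs[i].msg_hdr.msg_iov = &iov[i];
        msgs[i].msg_hdr.msg_iovlen = 1;
    }

    int n = recvmmsg(fd, msgs, BATCH, MSG_DONTWAIT, NULL);
    for (int i = 0; i < n; i++)
        lens[i] = msgs[i].msg_len;
    return n;
}
```

recv_batch() does not care what kind of datagram socket it is given, so it can be tried out on a socketpair() without root privileges; only open_raw_socket() needs CAP_NET_RAW.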
Libpcap allows an application to capture packets from an interface and to limit the scope of the capture using a simple filtering language. The filter is usually compiled to BPF and bound to the socket in-kernel, which decreases the cost of capture by letting the application see only the packets it needs. By now you are probably wondering why I am putting the emphasis on capture. The reason is simple: capture != packet IO for driver purposes. Capture is obtaining a packet in a form suitable for analysis, which means that the packet must be timestamped as precisely as possible at the moment it is received. Timestamps are essential - you cannot analyze most protocols without them.
Obtaining precise time is an extremely expensive operation in a multi-processor system. It remains so despite a lot of effort going into developing and using special locking methodology which is highly optimised for cases where a variable (the time) is predominantly read by multiple readers and seldom changed by a single writer (the timer interrupt/realtime clock handler).
In the original libpcap each packet read was accompanied by a gettimeofday() call to timestamp the packet. The end result is that libpcap, when using the read + timestamp semantics, was incapable of reaching high packet rates. It was rather slow. In fact, it could not capture at 1G rate.
When it became clear that the original libpcap design was not going to break through the Gigabit barrier, the Linux kernel developers devised an improved interface which allows:
- Capturing packets at a much higher rate
- Dropping packets if the rate is too high and informing the reader exactly how many have been dropped
- Using hardware offload for timestamping where available
This is the so-called packet ring interface. It is a very original idea. A long-standing feature of Unix systems is that you can use the mmap() call to map part or all of a file into virtual memory and access it directly as if it were in memory. This works quite well for normal files. It becomes somewhat esoteric for devices, where it is often unclear what mapping the device to memory really means. It sounds like an oxymoron and hence is normally unused for sockets.
The correct tense for "is normally unused" should be the past - "was". The solution to breaking the Gbit barrier in packet capture was to implement the mmap() call for raw sockets and provide an interface where the user can request a memory-mapped buffer for packets.
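To make the interface more concrete, here is a hedged sketch of requesting and mapping a receive ring using the default TPACKET_V1 layout. The helper names ring_geometry, map_rx_ring and ring_frame are invented for this example, and the size checks reflect my reading of the kernel rules (block size a multiple of the page size, frame size aligned to TPACKET_ALIGNMENT, frame count equal to frames per block times the number of blocks).

```c
#include <assert.h>
#include <linux/if_packet.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/socket.h>
#include <unistd.h>

/* Fill in a ring request. Returns 0 on success, -1 if the sizes do not
 * satisfy the constraints described above. */
int ring_geometry(unsigned int block_size, unsigned int frame_size,
                  unsigned int block_nr, struct tpacket_req *req)
{
    unsigned long page = (unsigned long)sysconf(_SC_PAGESIZE);

    assert(req != NULL);
    if (block_size == 0 || block_nr == 0 || block_size % page != 0)
        return -1;
    if (frame_size < TPACKET_HDRLEN || frame_size > block_size ||
        frame_size % TPACKET_ALIGNMENT != 0)
        return -1;

    req->tp_block_size = block_size;
    req->tp_block_nr = block_nr;
    req->tp_frame_size = frame_size;
    req->tp_frame_nr = (block_size / frame_size) * block_nr;
    return 0;
}

/* Ask the kernel for the ring on an AF_PACKET socket and map it.
 * Returns NULL on failure. */
void *map_rx_ring(int fd, const struct tpacket_req *req)
{
    if (setsockopt(fd, SOL_PACKET, PACKET_RX_RING, req, sizeof(*req)) < 0)
        return NULL;

    size_t len = (size_t)req->tp_block_size * req->tp_block_nr;
    void *ring = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    return ring == MAP_FAILED ? NULL : ring;
}

/* Locate frame number idx inside the mapped ring. Frames never straddle
 * a block boundary, so the block is found first, then the slot in it. */
struct tpacket_hdr *ring_frame(void *ring, const struct tpacket_req *req,
                               unsigned int idx)
{
    unsigned int per_block = req->tp_block_size / req->tp_frame_size;
    size_t off = (size_t)(idx / per_block) * req->tp_block_size +
                 (size_t)(idx % per_block) * req->tp_frame_size;
    return (struct tpacket_hdr *)((char *)ring + off);
}
```

A reader then waits in poll() on the socket, walks the frames in order, processes every frame whose tp_status has TP_STATUS_USER set, and hands it back by writing TP_STATUS_KERNEL. Each struct tpacket_hdr carries tp_sec and tp_usec fields, which is where the per-packet timestamp ends up.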
The IO event rate when using the memory mapped buffer is reduced. The raw socket file descriptor no longer presents data; it only signals via poll()/epoll() that data is available. The packets are written to and read out of a ring buffer which is mmap()ed into application memory. This should be faster than recvmmsg() or sendmmsg(). Or should it?
- The kernel does not receive packets directly into the buffer, nor does it send them directly out of it. The buffer is specialized memory: no skbufs are allocated out of it, so each packet is copied into it to be presented to the application, or copied from it to be sent. So the cost for the kernel is the same as the copy_to_user()/copy_from_user() used for recv(), recvmsg(), recvmmsg() and send(), sendmsg(), sendmmsg() respectively.
- The kernel still timestamps the packet. While there is support for hardware offload of the timestamping there is no means to turn it off altogether.
- The ring buffer structure is not the right fit for virtualization. This is a common issue with all harebrained shared memory schemes for transferring packets around. While they may work out for a specialized NOS using specialized packet access routines (I am doubtful, since they all use some buffer structures), they are the wrong idea if you are moving packets to/from a VM running a generic OS. When an OS supplies a packet it is wrapped in some kernel structure (an skbuf on Linux) and usually sits at an arbitrary location in the VM memory space. If the VM is to use a shared memory mechanism it has to copy the packet out of its buffer structures into the packet ring first. This costs exactly the same as the copy_from_user() performed by sendmmsg(). Similarly, in order to receive a packet the VM has to copy it out of the ring, which costs the same as the copy_to_user() performed by recvmmsg(). Both approaches can transfer an arbitrary number of packets at a time.
Once the cost of the timestamping is taken into account, libpcap and/or packet ring end up with a guaranteed penalty compared to high performance socket calls. There is no benefit in using them. Similarly, there is no benefit in trying to wire VMs running generic OSes together using shared memory. It can provide a performance advantage only if most of the safeties in the network buffer allocation etc. have been turned off. In addition to that, in some virtualization stacks the number of copies may end up bigger than when using sockets.
There may be other applications where there are benefits in devising a specialized mmap interface. That, however, is a completely different ball game compared to a bog standard packet ring and/or shared memory between two generic VMs running general purpose OSes.
- 07 Feb 2017