The Gnostic Snake Approach to Queue Management When Using MMSG Functions

First and foremost - there are wonderful queue macros from (choose your poison) Linux/NetBSD/FreeBSD/etc, why are we not using them. The short answer is because they suck for the application we are interested in - sendmmsg/recvmmsg.
  1. They are designed for multiple enqueue/multiple dequeue actors - an N:M scenario. In a recvmmsg/sendmmsg environment we usually have some variety of 1:N - single producer, multiple consumers or multiple producers, single consumer. The recvmmsg/sendmmsg is the "1", the processing pipeline is the "N". This is further simplified for Virtual NICs - it often becomes 1:1 - recvmmsg handling the receive and passing to hypervisor and vice versa for transmit. This makes a HELL OF A DIFFERENCE. A N:M unified queuing macro has no choice but to lock the entire queue metadata when performing an enqueue/dequeue. A queue optimized for 1:1 or low number of 1:N can limit the whole queue locking to the queue depth checks. The actual enqueue and dequeue can be governed by separate tail and head locks which have little or no contention.
  2. They are not designed for bulk enqueue, bulk dequeue. This is one of the biggest performance gremlins in Linux (and other open source OSes). You have to enqueue packets one at a time even if you already have a whole list of them received by hardware (ingress) or prepared by the upper layers. The offender is the actual definition of struct net_device_ops for transmit and the corresponding entry into the upper layers for receive. While there is a backdoor (you can set a skb->xmit_more to signify that the driver can try bulking), it is nowhere as effective as it should be. The reason for this is that the "reduce bufferbloat" logic in the linux kernel will try to zero out the bulking.
  3. The universal macros are usually not designed for direct transmit off the queue/direct receive into the queue as well a typical try/fail transmit sequence. You are expected to dequeue a packet and send it. This does not match the semantics of sendmmsg where you grab the N packets pending on the queue at this time, try to send all of them and leave M on the queue for the next round if the OS has failed to send them.
  4. They are designed for large queue sizes. Queues allowing depths of 1000+ packets are not uncommon. Managing such a queue via fixed size metadata is extremely wasteful. This is the reason why most generic queueing macros and libraries have gone for dynamic queue sizing and various forms of linked lists. Array Backed queues and other fixed size metadata implementations are generally frowned upon. This does not match recvmmsg/sendmmsg semantics - these functions use array based descriptors to pass buffer arguments. The sizing argument also does not apply - data transferred via recvmmsg/sendmmsg can never exceed the size of a socket buffer. If the relevant kernel variable has not been tuned the size is ~ 90K which suffices for ~ 64 packets at the standard Ethernet MTU.

This is exactly what the "Gnostic Snake" Array Backed queue tries to address.

Bulk enqueue optimized array backed queuing with separate head and tail locks..

There is an ancient symbol depicting a snake constantly eating its tail. It is also known as the Ouroboros - a symbol of eternal cyclicality and something in the process of constant recreating of itself. The behavior of array backed queue optimized for 1:N (low N) is very similar to this picture. We allow the tail to chase the head independently by separating the locks so that dequeue and enqueue can add/remove multiple packets without contending for the metadata lock on a per-packet basis.

The approach is not new - it is very similar to what NICs (and higher end network hardware) do internally.
  1. The queue is fixed depth and is formed from a sequence of buffers used cyclically - once the queue head or tail reaches the last buffer it cycles onto the first one.
  2. The queue has separate locking for its head and tail. The only operation which requires a full (tail + head) lock is reading the queue depth. Otherwise, the active dequeue holds a head lock, the active enqueue holds a tail lock.
  3. An enqueue grabs both locks, checks if the queue is full, releases the head lock (so dequeue can work), enqueues at the tail, locks the head, increments the depth and adjusts the tail pointer, then releases both locks. There is minimal lock contention because it touches the head lock only to check and adjust queue depth. It can also enqueue N packets at a time (up to the remaining usable queue depth). Even if we do not take into account the decrease in levels of lock contention, compared to a "off the shelf locking", the number of lock/unlock operations will break even when handling more than 3 packets at a time. Legacy locking resource use (cache thrashing, waits, etc) will continue growing with more packets at a time. This one will not - as a result it is a natural fit for sendmmsg/recvmmsg where one can easily get the full 60+ packets a time pending in the socket buffer.
  4. A dequeue grabs both locks, checks if the queue is empty, releases the tail lock (so enqueue can work), attempts to dequeue at the head, if successful decrements the queue depth and adjusts the head pointer, locks tail to decrement head. First of all, as designed this is a natural fit for try/fail. You can attempt dequeuing N packets. If you have successfully dequeued M, you can adjust the head pointer by M, and release. There is no need to re-enqueue back, etc. There is NO cost of failed or partial transmissions. Again - natural fit for sendmmsg which regularly sends less messages than it has been asked to.
  5. An additional optimization specific for the bulk enqueue/dequeue performed by send/recvmmsg is to reset the queue head/tail to zero if the queue is empty. This allows the next enqueue to use the full queue depth in a single operation and the corresponding dequeue to attempt to send all the packets at once without splitting the work into two requests to account for the wraparound. Continuing the analogy of the gnostic snake, this is the moment when the snake unrolls itself in order to strike and then rolls back into a circle to continue handling more "knowledge".
The separation of the locks allows to enqueue as much as we like between the tail and the queue depth. Until we have adjusted the queue depth + tail pointer and have made the enqueue official the dequeue at the head will chase its "last known position" and not race the enqueue. Similarly the dequeue can emit as much as it likes. As long as it has not adjusted the head and the depth the tail will chase its "last known position" and will not race it.

A simplified example (without full locking) of the technique is available in the Qemu driver for L2TPv3 which I contributed a few years ago.

Full implementations will follow for UML where they are really needed. Qemu is not really multithreaded within the userspace portion of a Virtual NIC so you can get away with a lot of things you will never get away with when writing an OS NIC driver.

Why not improve the OS

Easier said than done. What is a good idea for dedicated network hardware or systems emulating it like DPDK is not really an option for a generic OS. You cannot reserve most of your memory as a packet processing buffer so you can use static metadata structures. You have other, more interesting and useful things to do, than to move packets from your left pocket to your right pocket. You pay for the ability to do that via the cost associated with dynamic buffering designed for N:M actors. There is also the issue of size. Fixed size array backed metadata allocations are very difficult to afford in a generic OS environment.

What can be done and should be done, however, is to have a functional multiple packet enqueue/multiple packet dequeue API (presently it is not, courtesy of the bufferbloat crusaders). Not having it in this day and age allows various severely overinflated containers full of very hot air to continue floating around NFV (and other network related) conferences, while claiming that Linux (and other generic OS/Hypervisors) suck and you have to buy their special and very refined snake oil.

QEMU

Qemu deserves special mentioning here as it also to add insult to injury uses generic macros incorrectly. So that definitely needs fixing

-- AntonIvanov - 23 Apr 2017
Topic revision: r4 - 09 May 2017, AntonIvanov

This site is powered by FoswikiCopyright © by the contributing authors. All material on this collaboration platform is the property of the contributing authors.
Ideas, requests, problems regarding Foswiki? Send feedback