Why DPDK Is Not The Answer To All Of World's Ills
First and foremost, this article is purely about using DPDK in a new project. I am a developer and as such I can weigh the pros and cons of using DPDK for the needs of a project. I do not attempt to draw any conclusions whatsoever regarding product(s) built on DPDK by someone else. They have made their choice and they have delivered something. That something needs to be judged on its own merits and its own benchmarks, regardless of what it uses internally.
I love listening to Intel's marketing about DPDK. It reminds me of the days when they did NPUs (10+ years ago). We were in the process of designing a large scale networking system which needed to do fairly esoteric (by the standards of 2006) encryption - sRTP on a custom muxed payload. Whatever we asked them, the answer was "buy some of our NPUs". Signalling? You can do that on an NPU. Crypto? NPU. TCP? NPU. Steak with chips? If we had asked that one, we would probably have gotten the same answer - NPU.
The song remains the same. Only the lyrics have changed slightly. NPU is no longer the answer. The answer is DPDK. It is, however, the same answer regardless of what question you ask. Same as in 2006 - the answer is quite often inappropriate because it does not fit the design brief. Sure, DPDK is a phenomenal engineering achievement - it demonstrates how you can do on a generic CPU things which have long been the domain of ASICs and NPUs. It is, however, limited by the interaction of its design with the platform it runs on, and its limitations are well known.
DPDK's design is not new. It is well known and well established. The high level design can be summarized as "poll, pipeline and run to completion". If you had put that description in front of a network engineer 10 years ago, his reaction would have been: "You are talking about Cisco IOS, right?". In fact, you could have asked that 20 years ago. Or even more - the design is as old as network processing. It is a very good design - it has carried us through the first two decades of the Internet. It does, however, have some well known limitations. These can be worked around and have been worked around by some of its adopters. The question is: "Is it worth working around for a particular use case?" Sometimes it is, sometimes it is not, and it is definitely not the "answer to all of the world's ills" as claimed by Intel marketing.
Run to Completion Specifics for General Purpose Compute
Run to completion out of a poll loop is a classic real-time embedded technique. It does a brilliant job of maximizing the use of system resources. There are variations on the classic theme - multiple input queues, multiple output workers, etc. - which are well suited to modern multi-core architectures. In fact, DPDK is one such variation. They all, however, share a major and insurmountable design limitation - they do not work well in an environment where the processing of a particular packet takes an undefined amount of time. The solution to this problem is to take such packets out of the main processing pipeline and subject them to a different processing path.
The act of taking a packet out of the main processing path is what we used to refer to once upon a time as punting. It was a standard design pattern to punt packets to the slow path or to software. On an NPU or ASIC based network processing platform the intention of the punt is usually to trigger a reprogramming of the NPU to ensure that future packets continue to use the fast path. As long as the fast path itself was not slowed down and the slow path was not overwhelmed, the fact that packets continued to be punted down to slow path processing was no big deal. This was valid for hardware based platforms - NPUs and ASICs. It was also valid for platforms where run-to-completion was used to process packets in software. Those were mostly based on simplistic RISC CPUs. Even the SMP ones (e.g. the Cavium multicore MIPS) were not NUMA and had a minimal amount of cache which was easily kept in sync.
That assumption no longer holds for a run-to-completion system designed on a modern generic CPU platform using the x86 architecture. These have gigantic caches, the memory on larger systems is not uniform, and DPDK (and other similar systems) perform at their advertised rates only because they are carefully designed to avoid excessive cache synchronization and off-node memory access. They own their CPUs and memory banks. Punting a packet out of the carefully crafted pipeline to general purpose software residing on other CPUs in the system will trigger a cache coherency event. Each and every punt slows down the main processing loop by triggering unexpected cache coherency events. The cache coherency event is necessary because any non-DPDK software processing the slow path must run on a different core - one not allocated to DPDK's needs. In order to hand off the packet to that core, its view of the system memory containing the packet must be brought up to date with what is seen by the core(s) running the DPDK based software stack.
That is a design problem not present in an ASIC or NPU based system. There, punting to the "slow path" on the same platform does not result in collateral damage to the fast path itself (up to a point). This effect is specific to modern generic CPU implementations (especially on multi-socket systems). Similarly, you do not have this issue in a simple SMP environment running on simplistic RISC cores with uniform memory and little or no per-core cache, such as used in older network hardware.
How Bad Can It Possibly Be?
Well, how bad can it possibly be?
Cache coherency, especially on large NUMA systems like the ones typically used for virtualization, can be very ugly. In a worst case scenario you can have cores going offline and even the whole machine going down on unhandled NMIs. The bigger the system, the easier it is to trigger that. It is not difficult to write a test case to trigger cascaded core death on a multi-socket Xeon in less than 30 seconds. All it takes is mmapping a file and reading/writing to it from multiple threads residing on cores belonging to different sockets.
While you are not likely to see that if you just punt packets out of a processing queue, you will see performance issues. Each punt will cost, and it will decrease the performance of the main packet processing path. As a result, if you have a simple application with little or no punts and all processing in the run-to-completion comfort zone, it will rock with DPDK (or its analogues). If you design a system using an old router OS blueprint (fast path + punt) and DPDK with all components residing on the same machine, the performance will tank. It is an intrinsic property of general purpose computing, especially on large systems with multiple sockets.
System Design Workarounds
It may seem counterintuitive, but punting packets to a different system over the network may actually be lower cost than trying to make DPDK and another piece of software coexist on the same hardware. You still handle everything in a tight loop, there are no cache coherency penalties, and there are no interruptions. Similarly, DPDK (and other run-to-completion) systems make for extremely antisocial neighbours in virtualization environments. They do not like their precious cache coherency being disturbed and they do not like to share resources. One or more DPDK (or other run-to-completion) based systems assigned to different cores can happily share a platform. Sharing with general purpose computing such as a policy engine or a controller is definitely a bad idea.
Punting performance is not the only problem when building high performance network processing systems out of general purpose components. There is an even bigger one - how to marry the zero copy obsession found in dedicated packet processing frameworks (like DPDK) with real system needs. In fact, it is big enough to warrant an article of its own.
All in all, there is nothing virtual about any piece of software which involves DPDK or other frameworks which use run-to-completion coupled to a poll loop. In order to perform, they need very specific resource assignments which are different from the ones used in general purpose virtual computing. When used correctly, they can provide cost effective solutions for a wide variety of problems. They are definitely not the answer to all of the world's ills. They do not have anything like the flexibility of true general purpose computing. They require orders of magnitude more power and floor space than a true hardware solution. There is, however, a significant middle ground between even low cost ASIC and NPU hardware and general purpose software. They fit nicely there and should be used there. Trying to push them out of this comfort zone is frankly counterproductive.
Unfortunately, we see Intel marketing trying to do exactly that all the time; it is what it is - marketing. Same as 10 years ago - you ask a question, you always get one answer. Then it was NPU. Today it is DPDK, regardless of whether the answer is right or wrong. When it is right and when it is wrong - I will cover that in a different article.
- 16 Jan 2017