To Virtualize Or Not To Virtualize. Part 3. Flow Level Offloads or Can It Go Any Faster?
1Mps and 6GBPs are definitely not numbers to shout about. If you show them to someone who has written packet processing for a living, their first and immediate reaction would be a frown: "Are you kidding me?" Actually, I am not - the numbers are not so bad when put in perspective. They are achievable while coexisting with normal loads in a normal cloud (subject to the cloud providing the necessary vNIC transports). They are also more than sufficient to provide a per-tenant network element which the tenant can "bring into" his network and tear down as needed. The extra cost of the resources to run it is recouped by the simplicity of management - managing such a network element is no different from spinning up a VM, wiring it into a VLAN and tearing it down. It is something any cloud platform can do with ease. So, actually, there are no jokes here, provided that the headline rate is satisfactory for the customer's use case. That is not always the case - some customers need more. So, can we squeeze some more out of it or not?
So before we go into various esoteric options, I am going to limit the scope. There are plenty of ways to break through the MPs and GBs barrier in legacy virtualization which involve either going physical or breaking the underlying cloud or both. These will be off limits. If the answer is to pin dedicated CPUs to a VM and allow it direct access to network hardware, then in a real cloud installation it is usually the wrong answer. The right answer would be a multi-tenant network element and a control plane to match, which makes it look like individual slices which can be joined to or removed from customers' networks.
What Does It Take To Offload?
This is not one question. There are in fact three:
- What are the network design requirements?
- What can request an offload, what can provide an offload and what cannot?
- What should the offload APIs be and how can they be retrofitted into off-the-shelf virtualization?
Network Design Requirements
Offloading a network operation such as forwarding, NAT, switching, etc., in the general case means that there are at least two paths for network traffic. One, the slow one, traverses the virtualized network element. The other, the fast one, traverses some entity which can perform some operations on a packet, but at a significantly lower cost. The fast path is also controlled by the slow one, which decides which traffic is to be handled by the fast path and which by the slow path.
Ooops... That actually means not two entities, but at least four: ingress traffic routing, the slow path, the fast path and traffic egress (combining the two paths). Every offload will require the network element running under virtualization in the slow path to re-program the other three. That is possible with the technology we have today. For example, we can easily imagine that ingress is OpenFlow, and that the network element under virtualization performs packet inspection on incoming flows and, if they are accepted, sets them up on the dumb but fast path.
While theoretically feasible, this picture looks very wrong. In a real network, the ingress routing is shared. So is the egress. Similarly, in order for the dumb and fast path to cost in, it must be shared between tenant VMs. If we continue with the assumption that all traffic routing and fast path elements are dumb, we quickly run into the fact that offloading from one network element to another is a complete and utter oxymoron. There is simply no way to ensure that one VM does not affect others if they are allowed to talk directly to the network.
The answer here is that the slow path should be talking to a controller, not to a network element. If the network element it talks to is smart enough to participate in the conversation, you probably can get away with that network element on its own. No need to have a slow path and offload.
Talking to a controller significantly changes the rules of the game. There is very little point in trying to express rules in bytecode or at a low level. If we are talking to a controller we might as well talk properly - using a high-level abstraction and a proper interface modelled in NETCONF and YANG.
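To make "high-level abstraction" a bit more concrete, here is a minimal sketch of what an offload request could look like when expressed as intent rather than as bytecode. Everything in it is an assumption made for illustration: the class names, the field names and the JSON encoding are hypothetical and are not taken from any real YANG model or controller API.

```python
import json
from dataclasses import dataclass, asdict


@dataclass(frozen=True)
class FlowMatch:
    """Hypothetical description of the flow to be offloaded (a plain 5-tuple)."""
    src_ip: str
    dst_ip: str
    protocol: str
    src_port: int
    dst_port: int


@dataclass(frozen=True)
class OffloadRequest:
    """Hypothetical high-level request: what should happen, not how to do it."""
    match: FlowMatch
    action: str            # e.g. "forward", "drop", "nat" (illustrative values)
    idle_timeout_s: int    # ask the fast path to forget the flow after this long


def encode_request(req: OffloadRequest) -> str:
    """Serialize the request; asdict() also converts the nested FlowMatch."""
    return json.dumps(asdict(req), sort_keys=True)


example = OffloadRequest(
    match=FlowMatch("10.0.0.1", "192.0.2.10", "tcp", 40000, 443),
    action="forward",
    idle_timeout_s=30,
)
print(encode_request(example))
```

The point of the sketch is that the request says nothing about tables, ports or match bytecode on the fast path. How the intent is realised stays the controller's business.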
What can talk and what cannot? What will hiccup and have language barriers?
Nearly anything based on the off-the-shelf network stack of a generic OS cannot do any of the really interesting offloads like forwarding, bridging, NAT, firewalling, etc. Even offloads that are local in their nature (e.g. checksumming, hashing, etc.) have taken ages to implement and retrofit into generic networking stacks.
This, unfortunately, comes with the territory. The design of a generic OS presumes generic hardware where the packet is always somewhere in the system and can be accessed by software at all stages in the pipeline. This does not match the offload paradigm. An offloaded flow and its contents disappear as far as the OS is concerned. Most generic OS kernels will strongly dislike this idea and need serious surgery to accommodate it at a "native level". The alternative is to use one of the OS options to "delegate this to userspace". These are present in most modern general purpose OSes, and the userspace hooks have considerable freedom in what they can do to a packet or flow. Delegating to userspace is fairly easy and is actually the right approach when dealing with offload in a Network Element based on a generic OS.
All of this differs greatly from specialized NOS-es. While these started off as software, same as a general purpose OS, they were re-engineered for offloads decades ago. There, offloads come naturally, so it is simply a matter of providing the relevant abstractions. They, however, suffer from another, much bigger problem.
Offload to a remote element via a controller differs significantly from the immediate realtime offloads found in a typical slow + fast path router. It may happen. At some point in the future. There is absolutely no guarantee that it will be immediate. In fact, in nearly all cases the VM running the network element under virtualization will continue to receive traffic it has asked to be offloaded for a period of time. That will break quite a few NOS designs in various subtle, unpleasant and unpredictable ways. They expect the offload to be near immediate and the slow path not to receive any more hits once it has taken a decision to request an offload. It is, however, something which should be trivial to accommodate for a generic OS based system. A packet more, a packet less, who cares: receive it, process it, and the offload will happen at some point later on.
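The "packet more, packet less" attitude can be sketched in a few lines. This is a minimal illustration under my own assumptions, not a real forwarding engine: `SlowPath`, `request_offload`, the packet-count threshold and the flow key layout are all hypothetical. What it shows is that the slow path keeps processing every packet it receives, and asking for an offload changes nothing about that.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Set, Tuple

# Hypothetical flow key: (src_ip, dst_ip, src_port, dst_port, protocol)
FlowKey = Tuple[str, str, int, int, str]


@dataclass
class SlowPath:
    # Hypothetical callback which hands the request to the controller.
    # It is fire-and-forget: nothing here waits for the offload to take effect.
    request_offload: Callable[[FlowKey], None]
    threshold: int = 3                      # packets seen before asking for offload
    seen: Dict[FlowKey, int] = field(default_factory=dict)
    requested: Set[FlowKey] = field(default_factory=set)
    forwarded: int = 0

    def handle_packet(self, key: FlowKey) -> None:
        # Always process the packet, even if an offload was already requested:
        # the fast path may take over only some time later.
        self.forwarded += 1
        count = self.seen.get(key, 0) + 1
        self.seen[key] = count
        # Ask once per flow; later packets of the same flow are simply handled.
        if count >= self.threshold and key not in self.requested:
            self.requested.add(key)
            self.request_offload(key)
```

A design which instead treated "offload requested" as "this flow is no longer my problem" would drop or mishandle the stragglers, which is exactly the failure mode described above.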
The Offload APIs and Transports
We already came to a conclusion that there is very little to be gained from a virtualized network element talking to another one directly for offload purposes. Everyone needs to talk to a controller to ensure that the security domains are isolated and enforced correctly, so that one network element cannot use an offload request to modify a virtual network it does not belong to.
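The isolation check itself is simple once it sits in the controller. The following is a rough sketch under stated assumptions: the `Controller` class, the membership table and the identifiers are all hypothetical, and the identity of the requesting element is assumed to be supplied by the trusted side of the transport, not by the request itself.

```python
from typing import Dict, List, Set, Tuple


class Controller:
    """Hypothetical controller fragment: only the authorization step is shown."""

    def __init__(self, membership: Dict[str, Set[str]]) -> None:
        # element id -> virtual networks that element is attached to
        self.membership = membership
        self.installed: List[Tuple[str, str]] = []

    def handle_offload(self, element_id: str, network_id: str, flow: str) -> bool:
        # element_id must come from the trusted transport, never from the VM's
        # own payload, otherwise a tenant could simply claim another identity.
        if network_id not in self.membership.get(element_id, set()):
            return False  # refuse: the element does not belong to this network
        self.installed.append((network_id, flow))
        return True


ctl = Controller({"tenant-a-fw": {"net-a"}})
print(ctl.handle_offload("tenant-a-fw", "net-a", "flow-1"))  # accepted
print(ctl.handle_offload("tenant-a-fw", "net-b", "flow-2"))  # refused
```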
In terms of APIs, most modern controllers have a very long list to choose from. Most can be used. In fact, there is little benefit in trying to invent an API instead of reusing one of the existing off-the-shelf ones. The limitations are mostly just a matter of what can be fitted into the NOS to be run under virtualization and what libraries are available for it.
Under normal circumstances, a controller will accept events and API requests over a network interface. Accessing these interfaces directly from inside the VM is a very bad fit for an offload interface. It is a gigantic can of worms in terms of network addressing, security, isolation, etc. For example, in this model, the authentication and authorization sit with the client. That in itself is a violation of the security model in a virtualization environment. It is the job of the hypervisor and its management to provide the credentials for these operations. They should not sit in the VM.
Keeping the credentials outside the VM is trivial to implement in most hypervisors. Nearly all have suitable high performance virtual serial port links which connect to a network transport (e.g. a socket). It can be wired on the outside to an appropriate API transport. The API requests can be sent through the virtual serial port inside the VM. This results in a correct security model - the insides do not know anything about API credentials or addressing on the management network. The transport helper on the outside can be limited to adding authentication information and sending the request. It does not need to understand anything about offloads. The end result is a completely generic design which can be used for nearly anything - offloads, DoS mitigation, "raising the shields" in case of an attack, etc. There are no issues with addressing or routing either.
This, by the way, is something a lot of people who are trying to build virtual elements for the cloud get wrong. They do not try to perform isolation at the VM boundary. As a result, they violate the security domain boundaries when bringing offload into the system.
So far we have covered why it is slow, how we can make it go faster and if it can go even faster. It is probably about time to get to the Conclusions.
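A transport helper of this kind can be very small. Below is a minimal sketch, assuming newline-delimited requests arriving from the host end of the virtual serial port; the function name, the envelope layout and the `controller_send` callback are hypothetical, and a real helper would read from the hypervisor's socket and speak whatever API transport the controller expects.

```python
import io
from typing import Callable, Dict, Iterable


def relay_requests(
    vm_side: Iterable[str],
    controller_send: Callable[[Dict[str, object]], None],
    credentials: Dict[str, str],
) -> int:
    """Forward each request line from the VM, wrapped with host-held credentials.

    The payload is treated as opaque text: the helper does not parse it and
    knows nothing about offloads. Returns the number of requests relayed.
    """
    relayed = 0
    for line in vm_side:
        payload = line.strip()
        if not payload:
            continue  # ignore blank lines on the serial link
        # Credentials are attached here, outside the VM. The guest never sees them.
        controller_send({"auth": credentials, "payload": payload})
        relayed += 1
    return relayed


# Stand-in for the host end of the virtual serial port.
guest_output = io.StringIO('{"op": "offload", "flow": "f1"}\n\n{"op": "shields-up"}\n')
outbox = []
relay_requests(guest_output, outbox.append, {"token": "held-by-the-host"})
```

Because the helper never interprets the payload, the same few lines serve offloads, DoS mitigation requests or anything else the guest is allowed to ask for.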
- 26 Jan 2017