To Virtualize Or Not To Virtualize - Virtualizing the Very Small
It is often necessary to virtualize network elements which are very small in terms of resource footprint. Classic use cases are residential vCPE, IoT gateways, virtual network appliances for an isolated small security domain in a larger network, etc.
Most of these share a common characteristic: there is an existing software implementation of the network element which runs on a general-purpose OS such as Linux or BSD. These OSes are multi-tenanted by nature - you can isolate specific applications in their own namespaces (called containers) and provide them with renamed network devices which the software inside the container perceives as the real deal - a proper Ethernet. Even software that does not run on such OSes often has some form of multi-tenancy. It is very tempting to leverage this because it is extremely cost effective - in fact, significantly more cost effective than standard off-the-shelf virtualization.
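As a concrete (and much simplified) sketch of that mechanism, the following uses plain iproute2 network namespaces - the primitive underlying containers - to give an isolated tenant a renamed virtual device it perceives as an ordinary eth0. All names here are illustrative, and root privileges are assumed:

```shell
# Create an isolated namespace standing in for a tenant container.
ip netns add tenant1

# A veth pair: one end stays in the host, the other goes to the tenant.
ip link add veth-host type veth peer name veth-ten

# Move the tenant end into the namespace, renaming it to eth0 in the
# same step - inside, it looks like a proper Ethernet device.
ip link set veth-ten netns tenant1 name eth0
ip netns exec tenant1 ip link set eth0 up

# From the tenant's point of view this is just an ordinary NIC.
ip netns exec tenant1 ip link show eth0
```

The software inside the namespace has no way to tell that its eth0 is a renamed virtual device rather than real hardware - which is exactly the property being exploited.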
So why don't we run every tenant as a container? We can. But we should not, because we lose two very important properties of proper off-the-shelf virtualization: isolation and abstraction.
- Isolation: A container is not properly isolated from other instances - it shares the same kernel with them. This may be acceptable if all instances on the physical or virtual machine are just network elements. It is obviously unacceptable for most use cases in which a container shares the same physical or virtual machine with the control plane.
- Abstraction: It is a key requirement in virtual networking that the network element as seen by the customer should have information only on the overlaid virtual network. It should have no information whatsoever on the underlay, as the underlay is not in its security domain. The easiest way to achieve this is to abstract the overlay at the hypervisor vNIC layer: the configuration of the overlay stays outside the VM, while the inside of the VM sees a plain Ethernet. While it is possible to do this at the container level, there is a significant likelihood of undesirable information leaks.
There is also one more aspect which people do not like to mention, but it is essential in building a large system - damage limitation.
- Damage Limitation: A major fault in a tenant domain of a multi-tenanted system such as containers, router OS VRFs, etc. can bring down the whole OS. It is a fact of life that all software has bugs. Software (and hardware) fails. It is essential that when it does, it takes down an "acceptably" small number of customers who can be migrated elsewhere, instead of bringing down the whole platform.
Two Layer Virtualization - Heavy-Weight Outer, Container Inner
The idea is trivial and as old as virtualization itself, and it has all three of the properties mentioned above: isolation, abstraction and damage limitation.
So, how can we leverage more than one layer of virtualization to address use cases which are clearly unsuitable for a VM-per-tenant deployment?
- We use a heavy-weight hypervisor as an outer layer to provide isolation, resource abstraction and damage limitation.
- We subdivide each of the heavy VMs into a reasonable number of containers. The actual N:M ratio is a compromise - it is determined by our isolation and acceptable-damage targets.
- We map the virtual network onto network interfaces or subinterfaces at the hypervisor layer. The result is that, for example, three L2TPv3 pseudowires will appear as Ethernets inside the VM.
- We run a minimal control plane inside the VM - just power up, power down and resource assignment based on a supplied config; essentially a small subset of basic LXC functionality. I would have preferred to run none at all. While that is not impossible, it is quite difficult, so we take the calculated risk of powering our network element slices up and down from inside the VM.
- We can use LXC facilities to rename the vNICs as needed inside the VM. For example, an L2TPv3 pseudowire outside the VM appears as eth3 inside the VM, which is in turn renamed to eth0 (the upstream network interface) inside the container.
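The renaming step in the last bullet can be expressed directly in the container's LXC configuration. A sketch, assuming the pseudowire surfaces as eth3 inside the VM (key names follow modern LXC; older releases used the lxc.network.* prefix):

```
# Hand the VM-level vNIC eth3 to the container, renamed to eth0.
lxc.net.0.type = phys
lxc.net.0.link = eth3
lxc.net.0.name = eth0
lxc.net.0.flags = up
```

With `type = phys`, LXC moves the interface into the container's namespace at start-up and hands it back on shutdown - no bridging or address configuration happens at the VM level at all.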
The design is not limited to Linux - it will work with any multi-tenanted NOS inside the VM. It is a perfect fit for running micro-network elements and/or micro-services and attaching them to a network via a pseudowire.
It has the interesting property that it can be implemented without a switch. The relationship between the container running the NOS and the interface carrying network traffic is established by moving the interface into the container's namespace and renaming it. While the vNIC at the "fat" hypervisor level may use a switched transport, that is not essential. It can, for example, terminate a pseudowire instead, making the entire system completely switchless. This has significant benefits both in terms of scalability (virtual switches are limited in the number of interfaces they support) and manageability - there is one less shared element to manage at each activation.
And Here Be Dragons
There are a number of implementation issues with this design, caused by various performance bottlenecks and/or hard scalability limits in modern hypervisors.
First and foremost, everything is Ethernet - either real or virtual. There are no vNICs for point-to-point transports in any of the off-the-shelf hypervisors. This may not be an issue when building a vCPE or a network appliance. It is definitely an issue when trying to virtualize IoT, because the protocols involved are not necessarily Ethernet-oriented.
There is a hard limit on the number of vNICs in KVM, VMware and Xen. All of these emulate a rather primitive PCI bridge allowing for a small number of attached virtual peripherals. There is little or no support for bridge-behind-bridge and PCI routing, which are used by multi-NIC cards in the physical world. Even if it were supported, there are no drivers for the outer (vNIC) and inner (NIC emulation) portions which provide the functionality. While UML does not have the vNIC limit, it uses a very primitive shared IRQ controller whose performance decreases nearly linearly as vNICs are added. None of these are insurmountable obstacles. They do, however, require a significant amount of rework of existing hypervisors, as described in the QEMU/KVM articles on this site.
The Overall Result
It is possible to use virtualization by running network element software under a conventional hypervisor even for very small loads. While it does require quite a bit of both data plane and control plane work, the end result can be beneficial for a number of use cases, including some of very significant interest: isolating, securing and servicing the needs of an IoT deployment; microservices; and even a fully featured residential vCPE.
- 24 Jan 2017