To Virtualize Or Not To Virtualize. This Is The Question (applied to Network Elements).

Part One: What Does It Mean To Virtualize A Network Element?

This is a very interesting question. It is also a very meaningless question, because there is no accepted definition of what it means "To Virtualize" in the context of network elements. Every person working in this area has a slightly different opinion on the actual meaning of network element virtualization (note - I deliberately do not use the word function here; we will see why in a minute).

What is Virtualization (in a network element context)?

While there are thousands and thousands of shades to this, there are two technically literate uses of the word virtualization in a network element context.

  1. Being able to segregate a larger network element into multiple functional domains, each of which can be perceived as an individual network element from both a dataplane and a control plane perspective.
  2. Running a per-tenant instance of a network element in software under some form of virtualization technology and leveraging the virtual network abstractions.
Both serve the same use case, which is: "I have a network, I want to attach a network element to it". The underlying technology may differ, and the network technology may differ too, but the use case and goal remain the same - "I need a router, firewall, VOIP gateway, etc.; can I have an instance and have it attached to my network please". Most designs that do not serve this use case but carry the virtual moniker in their names actually have little or nothing virtual about them.

What does this have to do with Network Function Virtualization?

NFV as an idea describes the functional decomposition of network elements. The word Virtual in the name is a marketing misnomer. It is by no means mandatory - you can functionally decompose network element(s) in a way which does not leverage any virtual network or virtualization technology and does not provide multi-tenancy of the dataplane and control plane, and it will still be NFV as recognized by the industry. It is an acronym which could have been created by the Ministry of Truth in 1984, and it is about as truthful to its meaning as you would expect from something coming out of that building.

So, any further discussion in this article will be limited strictly to something which can be virtualized - presented as instances at both the control plane and dataplane level, with one or more instances attached to or removed from a network as needed. We will further limit the scope to designs and use cases where each of these instances looks like a standalone network element as far as its attached networks are concerned - i.e. it has been virtualized.

Running Network Element Software Under Off the Shelf Virtualization. The "Instance Per Tenant" Paradigm.

Lots of people have tried it and lots of people have recoiled in horror at the results. That is not surprising, as they have run into one or more of the following issues:

Resource Sharing Issues

Virtualization stacks are designed to share resources, with guests migrating onto spare resources in the system as needed. Everything is event driven. While one instance is waiting for an event, another may use the resource. Even if you "pin" a particular VM to a set of cores, in most virtualization stacks there will always be some residual CPU usage on those cores by the hypervisor and other parts of the system. This is especially true for QEMU/KVM, which runs on top of a generic OS such as Linux. This behavior does not play nicely with software which is not event driven. For example, most legacy network element software, such as router or firewall NOSes, uses poll + run to completion. Similarly, most modern x86-oriented packet processing frameworks like DPDK use the same technique. They really dislike it when the hypervisor takes a time slice from under their feet and uses it for something else. The telltale sign of this occurring is occasional packet loss and/or the software hard-crashing due to a realtime watchdog activation.
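
To make the clash concrete, here is a minimal sketch of the poll + run-to-completion pattern in C. rx_burst() and process_packet() are hypothetical stand-ins for a DPDK-style receive burst and the element's forwarding path, not a real driver API:

    #include <stddef.h>

    #define BURST_SIZE 32

    /* Hypothetical placeholders for a DPDK-style receive burst and the
     * forwarding path - not a real library interface. */
    unsigned rx_burst(void *queue, void *pkts[], unsigned max)
    {
        (void)queue; (void)pkts; (void)max;
        return 0;
    }

    void process_packet(void *pkt)
    {
        (void)pkt;
    }

    /* Poll + run to completion: the core never sleeps, never blocks and
     * never yields.  If the hypervisor deschedules it for even a few
     * hundred microseconds, the receive ring fills up and packets are
     * dropped, or a realtime watchdog notices the missed deadline and
     * the NOS hard-crashes. */
    void forwarding_loop(void *rx_queue)
    {
        void *pkts[BURST_SIZE];

        for (;;) {
            unsigned n = rx_burst(rx_queue, pkts, BURST_SIZE);
            for (unsigned i = 0; i < n; i++)
                process_packet(pkts[i]);
            /* No blocking call anywhere: an empty poll just spins again. */
        }
    }

There is no point in this loop where the guest volunteers to give the CPU back; any time the host takes is taken by force.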

Cost of Timers

The cost of getting exact system-wide time on an SMP x86 system is astronomical. As this is a very common operation, the high cost has resulted in the development of a special locking paradigm used almost exclusively in timekeeping - the Sequence Lock. The cost can be, and is, reduced by dedicated network processing systems which need time (for example for traffic shaping and policing) by using the CPU performance timers (such as the x86 TSC) present in most architectures.
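
On bare metal the cheap path is essentially a single instruction. The sketch below, which assumes an x86 machine and GCC or Clang, times it; typical bare-metal results are a few tens of cycles per read:

    #include <stdint.h>
    #include <stdio.h>
    #include <x86intrin.h>          /* __rdtsc() - GCC/Clang, x86 only */

    int main(void)
    {
        const int iterations = 1000000;
        volatile uint64_t sink = 0;

        uint64_t start = __rdtsc();
        for (int i = 0; i < iterations; i++)
            sink = __rdtsc();       /* the "cheap" high-resolution time source */
        uint64_t end = __rdtsc();

        (void)sink;
        printf("~%llu cycles per rdtsc\n",
               (unsigned long long)((end - start) / iterations));
        return 0;
    }

Run the same binary under a hypervisor that traps the instruction and the reported number jumps by orders of magnitude.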

So far, so good - if you are on bare metal. If you are running under virtualization, the relevant control registers on the CPU are set to values which prohibit the VM guest software from executing the performance timer instructions. This is done for a reason - these timers are not guaranteed to be synchronized between sockets in a larger system, so migrating a VM may make time appear to have gone backwards. Trying to execute these instructions usually results in a fault. The fault is handled and an appropriate result is supplied by the hypervisor instead. As a result, the VM continuously gets consistent high-resolution time even if it is being moved around the system. The side effect, unfortunately, is the astronomical cost of getting high-resolution time via the CPU registers. Handling a fault consumes several hundred (if not thousands of) clock cycles. It also involves at least one task switch for every attempt to get the time via the high-performance CPU registers. The topper is that the actual time read usually results in one or more cache synchronization events when the system tries to Seq Lock the time in order to read it.
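
For reference, this is roughly the shape of the Sequence Lock pattern mentioned above, written in portable C11 atomics. The real kernel seqcount code adds stronger barriers and writer serialisation; treat this purely as an illustration of why even the final time read touches a shared, frequently updated cache line:

    #include <stdatomic.h>
    #include <stdint.h>

    /* Simplified seqlock-protected time value.  The sequence counter is
     * odd while the writer is mid-update; readers retry until they see
     * the same even value before and after copying the data. */
    struct seq_time {
        atomic_uint seq;
        uint64_t    sec;
        uint64_t    nsec;
    };

    void time_read(struct seq_time *t, uint64_t *sec, uint64_t *nsec)
    {
        unsigned s1, s2;

        do {
            while ((s1 = atomic_load_explicit(&t->seq, memory_order_acquire)) & 1)
                ;                           /* writer in progress - spin     */
            *sec  = t->sec;
            *nsec = t->nsec;
            atomic_thread_fence(memory_order_acquire);
            s2 = atomic_load_explicit(&t->seq, memory_order_relaxed);
        } while (s1 != s2);                 /* data changed under us - retry */
    }

    void time_write(struct seq_time *t, uint64_t sec, uint64_t nsec)
    {
        atomic_fetch_add_explicit(&t->seq, 1, memory_order_relaxed);   /* odd  */
        atomic_thread_fence(memory_order_release);
        t->sec  = sec;
        t->nsec = nsec;
        atomic_fetch_add_explicit(&t->seq, 1, memory_order_release);   /* even */
    }

Readers scale well while the writer is quiet, but every reader still has to pull the writer's cache line - and under virtualization this happens at the end of an already expensive trap.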

This can be tuned out to some extent in some hypervisors - for example in VMware. The "tuning" is, in fact, an alternative setting for the registers which allows the guest to obtain the real value instead of triggering a fault. That is usually not an answer for a real production system, as it effectively prohibits key day-to-day activities in the virtualization environment - e.g. migration and load adjustments. The guest is now bound to actual hardware CPUs and can no longer be migrated. While it may still be called virtual, most of the advantages of using off-the-shelf virtualization to virtualize a network element are no longer there. You might as well just run it on bare metal - it is no different from an operational perspective.

The alternative is to supply high-performance paravirtual timers which map internal guest software timers directly onto hypervisor/host OS timers while incurring the lowest possible cost. An example of such an implementation is the UML timer implementation by the author of this article and Thomas Meyer. Similar interfaces also exist in VMware and are under development in QEMU. Their key characteristic, however, is that you cannot have your cake and eat it at the same time. By asking the hypervisor "what is the time", you agree to the laws and rules of the virtual environment. The hypervisor is event driven and may take its time processing various other events before it answers you. If you intend to use realtime run-to-completion off a poll loop, you should probably look at alternative means of virtualizing your network element - it will not get along well with paravirtual timers.
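
As an illustration of what "agreeing to the laws of the event-driven environment" means in practice, here is a small host-side sketch using Linux timerfd and epoll - the kind of primitive a paravirtual timer interface ends up mapping guest timers onto. This is not the UML implementation itself; note how late wakeups are simply coalesced into a single event:

    #include <stdint.h>
    #include <stdio.h>
    #include <sys/epoll.h>
    #include <sys/timerfd.h>
    #include <time.h>
    #include <unistd.h>

    int main(void)
    {
        /* A 10 ms periodic timer owned by the host kernel. */
        int tfd = timerfd_create(CLOCK_MONOTONIC, 0);
        struct itimerspec period = {
            .it_interval = { .tv_nsec = 10 * 1000 * 1000 },
            .it_value    = { .tv_nsec = 10 * 1000 * 1000 },
        };
        timerfd_settime(tfd, 0, &period, NULL);

        int ep = epoll_create1(0);
        struct epoll_event ev = { .events = EPOLLIN, .data.fd = tfd };
        epoll_ctl(ep, EPOLL_CTL_ADD, tfd, &ev);

        for (int ticks = 0; ticks < 100; ) {
            struct epoll_event out;
            if (epoll_wait(ep, &out, 1, -1) < 1)
                continue;               /* the host answers when it is ready */

            uint64_t expirations;
            if (read(tfd, &expirations, sizeof(expirations)) != (ssize_t)sizeof(expirations))
                continue;

            /* If we were scheduled late, several ticks arrive as one event -
             * exactly what a poll/run-to-completion design cannot tolerate. */
            ticks += (int)expirations;
            printf("tick x%llu\n", (unsigned long long)expirations);
        }

        close(ep);
        close(tfd);
        return 0;
    }

The guest gets cheap, well-behaved timers, but only at the granularity and punctuality the host scheduler is willing to provide.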

Cost and Performance of IO

Packet IO under virtualization can best be described as decrepit. The root cause is that the virtualization and OS kernel developers involved tend to concentrate on application performance and/or the network performance of TCP and the protocols running on top of it. This has resulted in a plethora of optimizations in various hypervisors and their matching network drivers which provide very good (nearly bare-metal) TCP throughput and do nothing to improve per-packet IO. As a result, per-packet performance is really bad. This is not surprising - it uses the packet-per-event / packet-per-syscall paradigm. A good analogy would be a 1990s network card which transfers one packet per request and raises one interrupt per packet. This approach has a hard performance limit below 1 Gbit/s and around 0.1 Mpps (million packets per second).
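
For reference, this is roughly what the packet-per-syscall paradigm looks like from userspace, using a Linux tap device ("tap0" is assumed to already exist and be attached to something). Every frame costs at least two kernel crossings before any forwarding work has even started:

    #include <fcntl.h>
    #include <linux/if.h>
    #include <linux/if_tun.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/ioctl.h>
    #include <unistd.h>

    int main(void)
    {
        int fd = open("/dev/net/tun", O_RDWR);
        if (fd < 0) { perror("open /dev/net/tun"); return 1; }

        struct ifreq ifr;
        memset(&ifr, 0, sizeof(ifr));
        ifr.ifr_flags = IFF_TAP | IFF_NO_PI;            /* raw Ethernet frames */
        strncpy(ifr.ifr_name, "tap0", IFNAMSIZ - 1);
        if (ioctl(fd, TUNSETIFF, &ifr) < 0) { perror("TUNSETIFF"); return 1; }

        unsigned char frame[2048];
        for (;;) {
            ssize_t n = read(fd, frame, sizeof(frame)); /* one syscall per packet in  */
            if (n <= 0)
                break;
            /* ...lookup / header rewrite would go here... */
            if (write(fd, frame, (size_t)n) != n)       /* one syscall per packet out */
                break;
        }
        return 0;
    }

The shape of the loop - one event and one kernel crossing per packet in each direction - is what caps the throughput, regardless of how fast each individual crossing is made.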

Software developers with a packet processing background who have evaluated the resulting poor network IO performance have all decided to try to use the hardware directly instead of fixing the root causes. Hardware vendors have addressed their needs by designing SR-IOV and various network adapters with virtualization support, such as Cisco ENET. The end result is once again the same: the guest becomes physical. It is not truly virtual any more. It is hardware bound, cannot share resources and cannot be migrated.

This is not the only possible decision. In fact, the development of network hardware - how it changed to achieve Gigabit and multi-Gigabit throughput - holds the answer. As this is a fairly large topic in its own right, I will look at it in the second part of this article.

Virtualizing a Network Element Using Its Ability To Multitenant. The "Slice Per Tenant" Paradigm.

Nearly all modern routing OSes have some ability to multi-tenant at the dataplane level. What is missing, however, is any form of multi-tenancy at the control plane level. The control plane of an off-the-shelf routing platform from most vendors remains distinctly single tenant. While it is theoretically possible to develop multi-instance CLIs and control plane environments, it is significantly easier to do this at a higher level in the OSS/BSS stack. It is much easier to multi-tenant and/or multi-domain the network API presentation than to multi-tenant the CLI. Additionally, any investment into multi-tenanting controllers and exposing APIs to the customer is applicable to both "Instance per Tenant" and "Slice per Tenant".
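
For a feel of what dataplane-level multi-tenancy means in the open-source world (the closest Linux analogue to a VRF or a per-tenant "slice"), the sketch below drops the calling process into a fresh network namespace. It requires CAP_NET_ADMIN and is only an illustration, not how any vendor NOS implements its slicing:

    #define _GNU_SOURCE
    #include <sched.h>              /* unshare(), CLONE_NEWNET */
    #include <stdio.h>
    #include <stdlib.h>

    int main(void)
    {
        /* Create and enter a new network namespace: a private set of
         * interfaces, routes, neighbour tables and firewall state. */
        if (unshare(CLONE_NEWNET) != 0) {
            perror("unshare(CLONE_NEWNET)");
            return 1;
        }

        /* Inside the new "slice" only an unconfigured loopback exists;
         * the host's interfaces and routes are invisible. */
        return system("ip addr show; ip route show");
    }

Each such slice behaves as a separate dataplane; what this model does not give you out of the box is a separately administered control plane per slice, which is exactly the gap described above.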

Where Does DPDK Fit Into This?

DPDK itself - nowhere. It is a framework for building systems, and by itself it does nothing of interest towards their virtualization. In fact, quite the opposite: it is rather unfriendly towards off-the-shelf virtualization stacks.

A system built on top of DPDK which supports dataplane multi-tenanting and has been built with a multi-tenanted control plane, or is controlled by a multi-tenanted OSS/BSS system via the "Slice per Tenant" paradigm, is a perfectly good fit for some Network Element Virtualization tasks. A system which is capable of instantiating per-tenant instances of network elements on underlying "flat" hardware to provide "router as a service" or "switch as a service" is similarly a very good fit. While it is possible to build such a system out of an off-the-shelf virtualization stack, it is probably not worth it because of DPDK's low-level hardware dependencies.

Drawing the Line

So, where to use "Instance Per Tenant" and where to use "Slice Per Tenant"? Today? If you are lucky enough to be working solely with TCP, you can draw the line at several Gigabits per core. If you are unlucky enough to be working with raw packets, Ethernet frames and/or UDP, the answer is nowhere. General-purpose virtualization IO simply does not cut it. You either need to roll your own (as some vendors have done) or look at building multi-tenanted network elements and virtualizing by using their multi-tenanting.

While this is the reality of today's virtualization landscape, it does not need to be so in the future. There are significant reserves in the legacy virtualization IO, and it can be improved to provide multi-Gigabit throughput per core. I will cover the why, what and how in Part Two of this article, which will also contain references to the relevant code and/or guides on how to use it.

-- AntonIvanov - 18 Jan 2017