Improving Performance in User Mode Linux

Oh boy... Where shall we start...

User Mode Linux has a lot of history. Probably too much history. It is the second PC-based virtualization system, dating from the same days as early VMware. It was written for the facilities and limitations of the Linux kernel and libraries of those early days. This has resulted in a long list of issues which include, but are not limited to:

  • Pre-NPTL threading - before proper POSIX threading was implemented in the Linux libc
  • Use of pre-POSIX timer calls and virtual (CPU time spent) time.
  • Pre-high-performance IO event framework - UML is pre-epoll
  • Pre-multi-message receive and transmit calls
  • Additional unnecessary interface queue stops/starts
  • Pre-fragmented buffers and their integration into offloads in the linux kernel
  • Pre-pread()/pwrite() family of calls for file IO
  • One-request-at-a-time IPC between the main kernel thread and the IO helper
  • One page at a time full-memory msync on exec

We do, however, have to clarify a few things before we start. A lot of people who work on virtualization (especially high-performance virtualization) will chuckle at the mention of UML. While the chuckling is sometimes justified, that is not always the case. The rumors that it is hideously slow are somewhat exaggerated and are not based on proper benchmarks, use cases and analysis of the results.

Is UML Slow?

The answer is yes. And No.

Let's boot a minimal Debian under QEMU/kvm on Linux on a decent (3.5GHz) AMD box. It boots practically instantaneously. Let's boot it under UML. You are looking at about 7 seconds - more than an order of magnitude slower. So, what's going on? Let's run the following magic incantation (it is a good filesystem exercise) on both VMs:
time find /usr -type f -exec cat {} > /dev/null \;

The difference is again orders of magnitude - about two, in fact. Now, let's do the same thing one more time:
time busybox find /usr -type f -exec cat {} > /dev/null \;

The results are... surprise... surprise... comparable. So, now, what the hell just happened here?

Congratulations, you have run into the one, only and biggest UML slowness compared to a single-CPU VM of any shape or form - the exec call. Everything else is actually tolerable. It may be slow by my standards (namely - possible to make up to 5-10 times faster), but it is not that slow. The slow part is exec.

When you execute find directly, you execute a real /usr/bin/find, which will execute a real /bin/cat for every file. That takes forever on UML (why is described later in this article). If you run it via busybox, you run an applet which will notice that you are trying to execute something that is supported by busybox internally as another applet, and it will call that instead. There will be no exec per file. So in all previous benchmarks you were not benchmarking UML - you were benchmarking the rather atrocious UML exec call.

If we benchmark UML on something that just sits there and runs, it is reasonable. Not great, but reasonable. If the workload runs in-kernel, it can actually beat most other virtualizations hands down on one CPU. So, with this information in hand (and a set of use cases to match) we can narrow down a few tasks where it can be quite usable - small appliances, infrastructure that is attached to or removed from a small virtual network, etc. So let's go back to the list of issues and see what we can do about them.

Performance Issues

Pre-NPTL threading

And Here Be Dragons.

There are a lot of places where it gets in the way, but frankly, any attempt to deal with it means dealing with the actual UML VM and memory management. This is not for the fainthearted - compared to it, dealing with vmexit and the like in a normal hypervisor is a child's game. While sorting this out will make life easier for all the other issues, I am leaving this one for last.

Use of pre-POSIX timer calls and virtual (CPU time spent) time

This one was a major obstacle to UML being used for any serious networking work. It has been fixed properly by me and Thomas Mayer in Linux 4.4.

Pre-high-performance IO event framework

UML uses poll. That is not a sin in itself - there is plenty of code out there which uses poll and performs very reasonably. Unfortunately, UML uses poll in a manner which makes performance fall off rapidly as the number of devices grows. The underlying reason is that every time IO is triggered it has to enumerate all devices to see which one triggered the event. Walking an O(n) list every time is, however, only half of the problem. After that it modifies the list, splicing and memcpy'ing bits to form a new list which does not contain the file descriptor that triggered the event, and then updates the poll fd list. This is a massive performance and scalability killer and one of the parts of UML which is screaming for a fix.

Unfortunately, the fix is not easy. I wrote one several years ago and it is part of my Cisco contributions. It has the issue that it does not fully prevent IRQ handler reentrancy. In Linux all IRQ handlers are by default presumed non-reentrant; if you reenter, bad things happen, with stack exhaustion being one of the least unpleasant items on the menu. I have looked at this a few times, but I have parked it for now until I am clear how to get epoll-level performance (without rebuilding the fd list on every event) while at the same time keeping the IRQ handlers reentrancy-safe.
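
For reference, below is a minimal userspace sketch of the epoll pattern such a fix would aim for. It is an illustration of the general technique, not the actual UML IRQ code; dispatch_irq() is a made-up stand-in for the event consumer. With epoll the kernel hands back only the descriptors that fired, so there is no per-event walk over all devices and no rebuilding of the descriptor list:

#include <stdio.h>
#include <sys/epoll.h>
#include <unistd.h>

/* Illustrative only: stands in for whatever handles a ready descriptor. */
static void dispatch_irq(int fd)
{
    (void)fd; /* ... handle the "interrupt" ... */
}

int main(void)
{
    int epfd = epoll_create1(0);
    if (epfd < 0) { perror("epoll_create1"); return 1; }

    /* Registration happens once, when a device is added - not on every
     * event. Here stdin is used as an example descriptor. */
    struct epoll_event ev = { .events = EPOLLIN, .data.fd = STDIN_FILENO };
    if (epoll_ctl(epfd, EPOLL_CTL_ADD, STDIN_FILENO, &ev) < 0) {
        perror("epoll_ctl");
        return 1;
    }

    for (;;) {
        struct epoll_event events[64];
        int n = epoll_wait(epfd, events, 64, -1);
        if (n < 0) { perror("epoll_wait"); return 1; }

        /* Only descriptors that actually fired come back: no O(n) walk
         * over all devices, no list splicing. */
        for (int i = 0; i < n; i++)
            dispatch_irq(events[i].data.fd);
    }
}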

Unnecessary Multiple Flow Control in the Virtual NIC subsystem

This is a perennial gem in a lot of old network drivers. The first thing they do is turn off the interface processing queue so that they are not re-entered. That would kind of make sense if their code were at risk of concurrent entry to start off with; it usually is not, being guarded by an additional full-driver lock which prevents any concurrent processing anyway.

That is not necessary and is easily fixed as a part of migrating to mmsg multi-packet tx/rx which needs internal driver queues anyway.
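
To make the anti-pattern concrete, here is a hedged sketch of what such a transmit routine tends to look like (the structure and function names are invented for illustration, not the actual UML driver symbols):

#include <linux/netdevice.h>
#include <linux/skbuff.h>
#include <linux/spinlock.h>

/* Hypothetical per-device state - a stand-in for the real private data. */
struct demo_net_priv {
    spinlock_t lock;
};

static netdev_tx_t demo_net_xmit(struct sk_buff *skb, struct net_device *dev)
{
    struct demo_net_priv *priv = netdev_priv(dev);

    netif_stop_queue(dev);   /* redundant: the lock below already serializes */
    spin_lock(&priv->lock);
    /* ... hand the skb to the host-side transport here ... */
    dev_kfree_skb(skb);
    spin_unlock(&priv->lock);
    netif_wake_queue(dev);   /* queue stop/start churn on every packet */

    return NETDEV_TX_OK;
}

Stopping and restarting the queue per packet adds work and flow-control noise for no benefit; with an internal TX queue (needed for multi-message batching anyway) the queue only has to be stopped when that queue is actually full.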

Pre-multi-message receive and transmit calls

UML is in the same boat as QEMU/kvm: it uses the ancient POSIX recv/send and read/write calls. Replacing these with recvmmsg() and sendmmsg() is more complicated than in QEMU, because one has to work directly with Linux kernel skbuffs on the inside of the guest while accessing them using these calls from the outside. I posted a rather dirty patch for this as part of my Cisco contributions. It needs revisiting and cleanup, but is otherwise viable. It is based on the same principles as described in my article on QEMU/kvm and its packet IO performance.

I have tested UML at beyond 3GBit/s forwarding+NAT with these patches, and I believe that there is further headroom, allowing it to be pushed beyond 3GBit/s simultaneously in and out for specific applications.
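
For reference, here is a minimal userspace sketch of the multi-message receive side of that approach (not the UML driver code itself - the descriptor is assumed to be the datagram-style transport socket of a virtual NIC backend, and the batch and buffer sizes are made up):

#define _GNU_SOURCE
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/uio.h>

#define BATCH 32
#define MTU   1536

/* Pull in up to BATCH packets with a single system call. Returns the
 * number of packets received; msgs[i].msg_len holds each length. */
int receive_batch(int fd, char bufs[BATCH][MTU])
{
    struct mmsghdr msgs[BATCH];
    struct iovec iovs[BATCH];

    memset(msgs, 0, sizeof(msgs));
    for (int i = 0; i < BATCH; i++) {
        iovs[i].iov_base = bufs[i];
        iovs[i].iov_len = MTU;
        msgs[i].msg_hdr.msg_iov = &iovs[i];
        msgs[i].msg_hdr.msg_iovlen = 1;
    }

    int n = recvmmsg(fd, msgs, BATCH, MSG_DONTWAIT, NULL);
    if (n < 0)
        perror("recvmmsg");
    return n;
}

sendmmsg() is the mirror image on the transmit side; the complication mentioned above is mapping the iovecs onto in-kernel skbuffs rather than flat userspace buffers.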

Pre-fragmented buffers

This one is relatively low priority. There is little or no benefit to offloads as the network transports do not support them at present.

Pre-pread()/pwrite() family of calls for file IO

UML was using the worst possible pattern for file access in a multi-threaded system (especially on SMP platforms). It is, unfortunately, the standard pattern nearly all developers use:

lseek(fd, offset, SEEK_SET);
read(fd, buffer, size); /* or write() */

So, what is wrong with it? Well, the file pointer is moved to a new position visible to all threads - that means cache synchronization and a mutex lock, and that costs. On top of that you have two syscalls instead of one. Using this pattern should be prohibited in anything that is multi-threaded. The correct call is pread() or pwrite() instead. These do not change the file offset and do not result in different threads getting in each other's way; in addition, the two syscalls are replaced by one.
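
As an illustration (assuming a file descriptor opened elsewhere), here is the same read expressed both ways; only the second form is safe for concurrent IO at independent offsets:

#include <sys/types.h>
#include <unistd.h>

/* lseek()+read(): two syscalls, and the shared file offset is mutated,
 * so concurrent threads stomp on each other. */
ssize_t read_at_seek(int fd, void *buf, size_t len, off_t off)
{
    if (lseek(fd, off, SEEK_SET) == (off_t)-1)
        return -1;
    return read(fd, buf, len);
}

/* pread(): one syscall, the file offset is untouched, and any number of
 * threads can read at independent offsets in parallel. */
ssize_t read_at(int fd, void *buf, size_t len, off_t off)
{
    return pread(fd, buf, len, off);
}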

This fix is in the Linux kernel tree; I do not remember exactly when I did it - it was somewhere in-between all of the other patches. What's important is that it's in.

It is also possible to improve this further by switching to preadv()/pwritev() and reading multiple sectors at a time. Even with this improvement UML retains a distinctly 1990s style of IO. It is like PIO versus DMA on a PC - moving to the latter would make a significant difference, but it requires a significant rewrite of the read/write routines as well as fixing the IPC to supply multiple read/write requests at a time.
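
A sketch of the preadv() variant (the sector size and batch limit are made up for illustration): one syscall fetches several sectors into separate buffers at a given offset, and the shared file offset is still left alone:

#define _DEFAULT_SOURCE   /* for preadv() on older glibc */
#include <sys/types.h>
#include <sys/uio.h>

#define SECTOR_SIZE 512
#define MAX_BATCH   16

/* Read 'count' consecutive sectors starting at 'offset' into separate
 * per-sector buffers with a single system call. */
ssize_t read_sectors(int fd, char (*sectors)[SECTOR_SIZE], int count, off_t offset)
{
    struct iovec iov[MAX_BATCH];

    if (count > MAX_BATCH)
        count = MAX_BATCH;
    for (int i = 0; i < count; i++) {
        iov[i].iov_base = sectors[i];
        iov[i].iov_len = SECTOR_SIZE;
    }
    return preadv(fd, iov, count, offset);
}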

One-request-at-a-time IPC between the main kernel thread and the IO helper

UML used blocking IO in the disk read/write thread, taking one request at a time to the IO thread and returning one request at a time back. Due to the use of pre-NPTL threading directly at low level, these threads communicate over an IO pipe. That is obviously inefficient and fairly easy to fix by batching requests together and using a poll on/off pattern. This patch is now in the Linux kernel.
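
Very roughly, the batching idea looks like the sketch below (the request structure and helpers are invented for illustration; the in-tree code is more involved): whole arrays of requests go through the pipe per transaction instead of one request per write()/read() pair. Keeping a batch under PIPE_BUF keeps each write atomic.

#include <stdint.h>
#include <unistd.h>

/* Hypothetical disk IO request - a stand-in for UML's real structure. */
struct io_req {
    uint64_t offset;
    uint32_t length;
    uint32_t is_write;   /* 0 = read, 1 = write */
    void    *buffer;
};

/* Kernel-thread side: submit a whole batch with one write() on the pipe
 * to the IO helper, instead of one write() per request. */
ssize_t submit_batch(int pipe_fd, const struct io_req *reqs, int count)
{
    return write(pipe_fd, reqs, count * sizeof(*reqs));
}

/* Helper side: drain whatever has accumulated with one read(). */
int collect_batch(int pipe_fd, struct io_req *reqs, int max)
{
    ssize_t n = read(pipe_fd, reqs, max * sizeof(*reqs));
    return n <= 0 ? (int)n : (int)(n / sizeof(*reqs));
}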

One page at a time full-memory msync on exec

This is the root cause of "UML is slow" - on exec it does a full-memory msync, one page at a time. To add insult to injury, if you are using a physical device as backing for the mmap, this is actually a write-out to the device too. That, rather unsurprisingly, is hideously slow.

This is the biggest bugbear in overall UML performance. It should be possible to improve it somewhat by bulking mmap requests - something which is supported by modern Linux kernels. I have a few patches playing with the concept, but so far I am not happy with what I see; this one needs to be looked at again and again.
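
To give a feel for the cost difference (a plain userspace illustration of the bulking concept, not the actual UML exec path), compare flushing a mapping one page at a time with issuing a single call over the whole range:

#include <sys/mman.h>
#include <unistd.h>

/* Page-at-a-time flush: one syscall (and potentially one device
 * write-out) per page. 'base' is assumed to be page-aligned, e.g. as
 * returned by mmap(). */
int flush_per_page(void *base, size_t len)
{
    long page = sysconf(_SC_PAGESIZE);

    for (size_t off = 0; off < len; off += page)
        if (msync((char *)base + off, page, MS_SYNC) < 0)
            return -1;
    return 0;
}

/* Bulk flush: the kernel sees the whole range at once and can batch the
 * work - one syscall regardless of the size of the mapping. */
int flush_bulk(void *base, size_t len)
{
    return msync(base, len, MS_SYNC);
}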

What Is NOT An Issue

UML allocates memory out of an mmapped area. This is often mentioned as an issue for its performance. It actually is not:
  • Using an mmapped shared area is a common approach for sharing a large or very large memory allocation between threads in a complex system. For example, most databases use it.
  • If the mmapped area is backed by tmpfs there is no penalty compared to malloc(). In fact, mmapping something that is not a proper filesystem area as a memory allocation method is not new. If memory serves me right, NetBSD used to do it and it worked quite well.

The only issue to consider with mmap is that while malloc is NUMA-aware, mmap off a tmpfs or a device is not - you have to make it so by requesting NUMA affinity on the backing device. If you do not, you are looking at getting an unhandled NMI back on a large NUMA system and cores dying an ugly death (with a possible total system crash).
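
One userspace way to request that affinity is mbind() on the mapped range. This is a hedged sketch of the idea - the tmpfs path, node number and use of the libnuma header are assumptions, not what UML itself does:

#include <fcntl.h>
#include <numaif.h>     /* mbind(); link with -lnuma */
#include <sys/mman.h>
#include <unistd.h>

/* Map a tmpfs-backed file and ask the kernel to keep its pages on a
 * single NUMA node. Path and node number are illustrative. */
void *map_on_node(const char *path, size_t len, int node)
{
    int fd = open(path, O_RDWR);
    if (fd < 0)
        return NULL;

    void *mem = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    close(fd);
    if (mem == MAP_FAILED)
        return NULL;

    unsigned long nodemask = 1UL << node;
    if (mbind(mem, len, MPOL_BIND, &nodemask, sizeof(nodemask) * 8, 0) < 0) {
        munmap(mem, len);
        return NULL;
    }
    return mem;
}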

-- AntonIvanov - 18 Jan 2017