Eating My Own (Socket) Dog Food - UML

The current version of the patchset is in kernels from 4.17 onwards.

These patches can be backported with no issues to 4.13 and should probably work without a lot of changes down to 4.4 or thereabouts.

RX performance as measured using iperf is indeed higher than QEMU/kvm using tap, virtio and vhost-net (all bells and whistles) on the same hardware. No, you are not hallucinating. KVM, with all offloads, bells and whistles, is beaten by UML in a network performance test. Note that the HUGE benchmark numbers apply to TCP only, and only when GRO and TSO are enabled and functional (raw and tap only for now). The non-TCP numbers are ~25% of what iperf reports (100% for gre and l2tpv3).

Tap and raw can now (as of version 4 of the patch) leverage TSO offloads and shine at >4.5Gbit.

The actual numbers on an AMD A8-6500 APU (4 core, 3500MHz system) as of the last patchsets (what is sitting in the linux-next queue, enqueued for ~4.16) are:
Forwarding is measured on a port to OpenWRT HEAD (kernel 4.4), NAT + IPTABLES, no QoS. All figures are in Gbit.

| Transport | TX | RX | Forwarding | Notes |
| UML legacy tap | 0.450 | 0.450 | < 0.300 | |
| UML gre | 0.710 | 1.45 | 0.540 | |
| UML l2tpv3 | 0.710 | 1.45 | 0.540 | |
| UML raw, Intel 82546EB NIC | 0.640 | 0.680 | 0.570 | limited by write poll on raw socket on 1G NIC, 10G NIC should do > 4G |
| UML raw, vEth interfaces | 6.5 | 6.5 | > 7.0 | with BPF jit enabled - net.core.bpf_jit_enable = 1 |
| UML vector tap | 8.6 | 6.5 | > 6 | with BPF jit enabled - net.core.bpf_jit_enable = 1 |

For comparison purposes:

| QEMU/kvm tap, Ubuntu LTS guest, Debian stretch host | 9.1 | 5.5 | < 4 | |

In all UML cases the vector depth is raised to 128 from the default 64. In fact, for speeds above 300Mbit in the absence of GRO it needs to be 128. There is little benefit in raising it above 192 unless you are using raw and a 10G NIC or vEth (especially with QoS). There you may need a queue depth of up to 300-400. Similarly, enabling GRO on rx drops the queue length required significantly - 32 is now more than sufficient for a lot of applications.
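
The rules of thumb above can be condensed into a small helper. This is only a sketch: the function name `suggest_depth` and the profile names `gro` and `fast-raw` are made up for illustration, and 400 is simply the upper end of the 300-400 range quoted above, not a measured optimum.

```shell
# Hypothetical helper: map a usage profile to a vector depth,
# following the rules of thumb from the paragraph above.
suggest_depth() {
    case "$1" in
        gro)      echo 32 ;;   # GRO on RX enabled: 32 is usually plenty
        fast-raw) echo 400 ;;  # raw on a 10G NIC or vEth, especially with QoS
        *)        echo 128 ;;  # anything above ~300Mbit without GRO
    esac
}

# Print an option string using the suggested depth (nothing is started here).
echo "vec0:transport=raw,ifname=eth3,depth=$(suggest_depth gro),gro=1"
```

Treat the output as a starting point and check the queue statistics on your own setup before settling on a value.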

The raw test using the Intel NIC runs to a different machine (2 core A4-3400, 2700 MHz). Iperf on bare metal measures 0.700 Gbit to it, the same as we get from the raw socket driver, so there should be no issues with it hitting 1G speed and beyond provided that the network and the underlying NIC allow it.

Why ">7" - I do not presently have the right hardware to test forwarding speeds in that range properly. Running the forwarder, RX sink and TX generator on a 4 core system yields results which are less than the maximum possible due to task switching and other overheads.


Note that, compared to the legacy UML transports, the option syntax has now been made identical to QEMU's:

linux     mem=512M  umid=Debian \
   ubd0=/exports/UML-debian/Debian.64 \
   root=/dev/ubda vec0:transport=l2tpv3,udp=1,src=,dst=,srcport=1707,dstport=1706,depth=128

L2TPv3 in raw mode requires raw socket access - the linux executable needs to have CAP_NET_RAW set or be run via sudo. It can be run as a normal user if using the udp transport.
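
The raw socket capability is named CAP_NET_RAW in capabilities(7), and one way to grant it without sudo is a file capability set with setcap. The sketch below assumes the UML binary is `./linux` in the current directory and that file capabilities behave as expected with your UML build, which I have not verified here. It prints the commands by default; set DRY_RUN=0 and run it as root to apply them.

```shell
# Print commands by default; DRY_RUN=0 actually executes them (needs root).
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Grant CAP_NET_RAW to a binary, then show its file capabilities.
# $1 = path to the UML executable (./linux is an assumption)
grant_net_raw() {
    run setcap cap_net_raw+ep "$1"
    run getcap "$1"
}

grant_net_raw ./linux
```

Remember that the capability is attached to the file, so it has to be set again after every rebuild of the executable.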


Similarly for gre - we reverse the src, dst and the keys if present:

vmlinux     mem=512M  umid=Debian \
   ubd0=/exports/UML-debian/Debian.64 \
   root=/dev/ubda vec0:transport=gre,src=,dst=,depth=128

GRE requires raw socket access - the linux executable needs to have CAP_NET_RAW set or be run via sudo.
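
To make the src/dst relationship concrete, here is a sketch that prints the option for both ends of a GRE tunnel, reading "reverse" as "the peer is configured with src and dst swapped". The helper name `gre_opts` is made up and the addresses are documentation addresses (RFC 5737), not values from a real setup. Keys are left out.

```shell
# Build the vec0 option for one end of a GRE tunnel.
# $1 = local (src) address, $2 = remote (dst) address
gre_opts() {
    printf 'vec0:transport=gre,src=%s,dst=%s,depth=128\n' "$1" "$2"
}

# Hypothetical example addresses, for illustration only.
A=192.0.2.1
B=192.0.2.2

echo "end A: $(gre_opts "$A" "$B")"
echo "end B: $(gre_opts "$B" "$A")"
```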


The raw transport binds directly to an Ethernet or Ethernet-like interface using raw sockets. The only parameter to specify is the interface to bind to:

vmlinux     mem=512M  umid=Debian \
   ubd0=/exports/UML-debian/Debian.64 \
   root=/dev/ubda vec0:transport=raw,ifname=eth3,depth=128,gro=1

It requires the interface to be prepared - it must be in promisc mode. Offloads are now fully supported, so if they are correctly matched to the UML instance there should be no throughput issues.
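
Preparing the interface could look like the sketch below, which uses the standard iproute2 `ip link` commands and the `eth3` name from the example above. It prints the commands by default; set DRY_RUN=0 and run it as root to apply them. Offload matching is not covered here.

```shell
# Print commands by default; DRY_RUN=0 actually executes them (needs root).
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Bring the interface up and put it in promiscuous mode.
# $1 = interface the raw transport will bind to
prepare_raw_if() {
    run ip link set dev "$1" up
    run ip link set dev "$1" promisc on
}

prepare_raw_if eth3
```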

Note: As there is no Ethernet software loopback on Linux, you cannot contact the host on which you are running if you are using a raw transport. It is presently intended for the use of one VM per interface. It is preferred (though not mandatory) that the interface is dedicated to the VM - an Ethernet pseudowire, for example a vlan.

Caveat: It is extremely difficult, not to say impossible to eliminate some of the unwelcome and unwanted interactions between the host v6 stack and the VMs on raw interfaces. If you want to play with IPv6, raw sockets are the wrong environment.


The tap transport at present is intended as a demonstration of how tap sucks asteroids through a microbore sideways. Natively it delivers ~500Mbit. This transport subjects it to what can be considered vile abuse - it writes via tap (as for some reason we cannot write via raw socket to a tap interface), but reads via raw socket. The end result is an RX rate of 3+ times the normal tap RX rate. It implements rudimentary BPF filtering in order "not to see its own packets", thus allowing it to get to just a couple of percent above the normal tap headline TX rate (because of the reduced RX cost).

linux     mem=512M  umid=Debian \
   ubd0=/exports/UML-debian/Debian.64 \
   root=/dev/ubda vec0:transport=tap,ifname=tap4,depth=128,gro=1

It requires the interface to be prepared - same as in KVM. The "magic works out of the box" functionality which tap has in legacy UML transports has not been integrated (the transport is intended mostly as a demo at this point, so there is no point wasting effort until BPF has gone in). Same as raw, GRE and L2TPv3 in raw mode, it requires CAP_NET_RAW or sudo to be executed.
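
As a rough sketch of the preparation step, a persistent tap device can be created with iproute2's `ip tuntap`. The device name `tap4` matches the example above; the owning user is whoever will run UML. Bridging or addressing the host side is left out, since that depends on your setup. The script prints the commands by default; set DRY_RUN=0 and run it as root to apply them.

```shell
# Print commands by default; DRY_RUN=0 actually executes them (needs root).
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Create a tap device owned by the given user and bring it up.
# $1 = tap device name, $2 = user that will run the UML instance
prepare_tap() {
    run ip tuntap add dev "$1" mode tap user "$2"
    run ip link set dev "$1" up
}

prepare_tap tap4 "$(id -un)"
```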

Caveat: This transport CANNOT be used in L2 scenarios because it is impossible to fully isolate the raw socket reader so it does not see the frames being sent on the tap interface. Nearly anything in L3 with the exception of some more obscure use cases which add multiple MAC addresses to the interface like vrrp should work fine.

Coming Soon - VXLAN, VXLAN GPE and Geneve

I am working on these as well as some rudimentary SFC support. Patches will follow once the core patch-set is accepted upstream.


The new vector drivers for UML have full ethtool support, allowing key parameters to be queried and set.
  1. Ring sizes can be queried, but not set: ethtool -g vec0
  2. RX/TX queue efficiency can be observed via ethtool -S vec0
  3. Coalesce can be queried via ethtool -c vec0 and set via ethtool -C vec0. The only coalesce parameter supported is TX coalesce (tx-usecs), which cannot be set to a value lower than one jiffy.
  4. ethtool -S provides checksum offload and scatter gather statistics.
  5. RX/TX control for offloads where applicable.
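
Since tx-usecs cannot go below one jiffy, it helps to know what a jiffy is in microseconds. The sketch below computes it from the kernel tick rate and prints the ethtool commands from the list above for running inside the guest. The helper name `min_tx_usecs` is made up, and HZ=100 is an assumption: check your own kernel configuration.

```shell
# One jiffy expressed in microseconds.
# $1 = kernel tick rate HZ (100 is an assumption, not taken from this page)
min_tx_usecs() {
    echo $(( 1000000 / ${1:-100} ))
}

# Print the commands to run inside the UML guest (nothing is executed here).
cat <<EOF
ethtool -g vec0
ethtool -S vec0
ethtool -c vec0
ethtool -C vec0 tx-usecs $(min_tx_usecs 100)
EOF
```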
Topic attachments

| Attachment | Size | Date | Who | Comment |
| 0001-Epoll-based-IRQ-controller.patch | 26 K | 04 Oct 2017 - 15:24 | AntonIvanov | Epoll Interrupt Controller - Prerequisite for high performance IO |
| 0002-High-Performance-Vector-Network-Driver.patch | 75 K | 11 Oct 2017 - 13:28 | AntonIvanov | High Performance Vector IO patch |
| 0003-TSO4-Workaround-support-for-UML-Vector-Drivers.patch | 4 K | 12 Oct 2017 - 20:19 | AntonIvanov | TSO Workaround for RAW sockets |

Topic revision: r1 - 24 Jun 2018, UnknownUser
