Eating My Own (Socket) Dog Food - UML
NOTE: ALL of this is work in progress. It has been submitted and is under review, and I am working to get it included.
Current version of the patchset:
These apply cleanly over 4.13 and should probably work without a lot of changes down to 4.11 or thereabouts.
RX performance as measured with iperf is indeed higher than QEMU/kvm using tap, virtio and vhost-net (all bells and whistles) on the same hardware. No, you are not hallucinating: KVM, with all offloads, bells and whistles, is beaten by UML in a network performance test. One caveat: the HUGE benchmark numbers apply to TCP, and only where GRO and TSO are enabled and functional (only raw and tap for now). The non-TCP numbers are ~25% of what iperf reports (100% for gre and l2tpv3).
Tap and Raw can now (as of version 4 of the patch) leverage TSO offloads and shine at >4.5Gbit.
The actual numbers on an AMD A8-6500 APU (4 core, 3500MHz system) with the latest patchsets (what is sitting in the linux-next queue for ~4.16) are:
|| TX - Gbits
|| RX - Gbits
|| Forwarding (Port to OpenWRT HEAD, kernel 4.4), NAT + IPTABLES, No QoS
| UML legacy tap
|| < 0.300
| UML gre
| UML l2tpv3
| UML raw, Intel 82546EB NIC
|| 0.570 - limited by write poll on raw socket on 1G NIC, 10G NIC should do > 4G
| UML raw, vEth interfaces
|| > 7.0 - with BPF jit enabled - net.core.bpf_jit_enable = 1
| UML vector tap
|| > 6 - with BPF jit enabled - net.core.bpf_jit_enable = 1
| For Comparison Purposes
| QEMU/kvm tap Ubuntu LTS guest, Debian stretch Host
In all UML cases the vector depth is raised to 128 from the default 64. In fact, for speeds above 300Mbit in the absence of GRO it needs to be 128. There is little benefit in raising it above 192 unless you are using raw with a 10G NIC or vEth (especially with QoS). There you may need a queue depth of up to 300-400. Conversely, enabling GRO on RX significantly reduces the queue length required - 32 is now more than sufficient for a lot of applications.
The raw test using the Intel NIC is to a different machine, a 2 core A4-3400 at 2700 MHz. Iperf on bare metal measures 0.700 to it, exactly the same as we get from the raw socket driver, so there should be no issues with it hitting 1G speed and beyond provided that the network and the underlying NIC allow it.
Why ">7"? I do not presently have the right hardware to test forwarding speeds in that range properly. Running the forwarder, RX sink and TX generator on a 4 core system yields results which are lower than the maximum possible due to task switching and other overheads.
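The raw vEth and vector tap figures in the table were taken with the BPF JIT enabled on the host. A minimal sketch of checking and enabling it before a benchmark run follows. The `run` wrapper and the `DRY_RUN` variable are helpers made up for this example, not part of the patchset; by default the script only prints what it would execute.

```shell
#!/bin/sh
# Sketch: check and enable the BPF JIT on the host before benchmarking.
# DRY_RUN=1 (the default here) only prints the commands.
# Run with DRY_RUN=0 as root to actually apply them.
DRY_RUN="${DRY_RUN:-1}"

# Hypothetical helper: print the command in dry-run mode, execute it otherwise.
run() {
    if [ "$DRY_RUN" = 1 ]; then
        echo "$*"
    else
        "$@"
    fi
}

enable_bpf_jit() {
    # Show the current setting, then switch the JIT on.
    run sysctl net.core.bpf_jit_enable
    run sysctl -w net.core.bpf_jit_enable=1
}

enable_bpf_jit
```

The setting does not survive a reboot unless it is also placed in the host sysctl configuration.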
Note that, compared to the legacy UML transports, the option syntax has now been made identical with QEMU:
linux mem=512M umid=Debian \
L2TPv3 in raw mode requires raw socket access - the linux executable needs to have CAP_NET_RAW set or be run via sudo. It can be run as a normal user if using the udp transport.
Similarly for gre - we reverse the src, dst and the keys if present:
vmlinux mem=512M umid=Debian \
GRE requires raw socket access - the linux executable needs to have CAP_NET_RAW set or be run via sudo.
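Both L2TPv3 in raw mode and GRE need raw socket access, which the kernel exposes as the CAP_NET_RAW capability. A minimal sketch of granting it to the UML binary as an alternative to sudo is shown here. The `./linux` path, the `run` wrapper and `DRY_RUN` are assumptions for illustration; `setcap` and `getcap` come from the libcap tools and need to be installed on the host.

```shell
#!/bin/sh
# Sketch: grant raw socket access to the UML executable instead of using sudo.
# UML_BIN is a placeholder path, adjust it to where your binary lives.
# DRY_RUN=1 (the default here) only prints the commands; use DRY_RUN=0 as root.
DRY_RUN="${DRY_RUN:-1}"
UML_BIN="${UML_BIN:-./linux}"

# Hypothetical helper: print the command in dry-run mode, execute it otherwise.
run() {
    if [ "$DRY_RUN" = 1 ]; then
        echo "$*"
    else
        "$@"
    fi
}

grant_net_raw() {
    # Add CAP_NET_RAW to the effective and permitted sets of the file.
    run setcap cap_net_raw+ep "$1"
    # Print the resulting file capabilities so the change can be verified.
    run getcap "$1"
}

grant_net_raw "$UML_BIN"
```

File capabilities are lost when the binary is rebuilt or replaced, so this has to be repeated after each new build.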
The raw transport binds directly to an Ethernet or Ethernet-like interface using raw sockets. The only parameter to specify is the interface to bind to:
vmlinux mem=512M umid=Debian \
It requires the interface to be prepared - it must be in promiscuous mode. Offloads are now fully supported, so if they are correctly matched to the UML instance there should be no throughput issues.
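Preparing the interface can be sketched as follows, assuming the iproute2 and ethtool utilities on the host. The interface name `eth1`, the `run` wrapper and `DRY_RUN` are placeholders made up for this example.

```shell
#!/bin/sh
# Sketch: prepare a host interface for the raw transport.
# IFACE is a placeholder; DRY_RUN=1 (the default here) only prints the commands.
# Run with DRY_RUN=0 as root to actually apply them.
DRY_RUN="${DRY_RUN:-1}"
IFACE="${IFACE:-eth1}"

# Hypothetical helper: print the command in dry-run mode, execute it otherwise.
run() {
    if [ "$DRY_RUN" = 1 ]; then
        echo "$*"
    else
        "$@"
    fi
}

prepare_raw_iface() {
    # Bring the interface up and put it into promiscuous mode.
    run ip link set dev "$1" up
    run ip link set dev "$1" promisc on
    # List the current offload settings so they can be compared with the guest.
    run ethtool -k "$1"
}

prepare_raw_iface "$IFACE"
```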
Note: As there is no ethernet software loopback on Linux, you cannot contact the host on which you are running if you are using a raw transport. It is presently intended for one VM per interface. It is preferred (though not mandatory) that the interface is dedicated to the VM - an ethernet pseudowire, e.g. a vlan.
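One way to get a dedicated interface per VM is a vlan sub-interface on the host. The following is a rough sketch using iproute2; the parent interface `eth1`, the vlan id `100`, the `run` wrapper and `DRY_RUN` are all assumptions for illustration.

```shell
#!/bin/sh
# Sketch: create a vlan sub-interface dedicated to a single UML instance.
# PARENT and VID are placeholders; DRY_RUN=1 (the default here) only prints.
# Run with DRY_RUN=0 as root to actually apply the commands.
DRY_RUN="${DRY_RUN:-1}"
PARENT="${PARENT:-eth1}"
VID="${VID:-100}"

# Hypothetical helper: print the command in dry-run mode, execute it otherwise.
run() {
    if [ "$DRY_RUN" = 1 ]; then
        echo "$*"
    else
        "$@"
    fi
}

make_vlan_iface() {
    parent="$1"
    vid="$2"
    name="$parent.$vid"
    # Create the 802.1Q sub-interface on top of the parent.
    run ip link add link "$parent" name "$name" type vlan id "$vid"
    # Bring it up and make it promiscuous, as the raw transport requires.
    run ip link set dev "$name" up
    run ip link set dev "$name" promisc on
}

make_vlan_iface "$PARENT" "$VID"
```

The resulting `eth1.100` is then the interface name to hand to the raw transport.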
It is extremely difficult, not to say impossible, to eliminate some of the unwelcome and unwanted interactions between the host v6 stack and the VMs on raw interfaces. If you want to play with IPv6, raw sockets are the wrong environment.
The tap transport at present is intended as a demonstration of how tap sucks asteroids through a microbore side-wise. Natively it delivers ~500Mbit. This transport subjects it to what can be considered vile abuse - it writes via tap (as for some reason we cannot write via raw socket to a tap interface), but reads via raw socket. The end result is an RX rate of 3+ times the normal tap RX rate. It implements rudimentary BPF filtering in order "not to see its own packets", thus allowing it to get to just a couple of percent above the normal tap headline TX rate (because of the reduced RX cost).
linux mem=512M umid=Debian \
It requires the interface to be prepared - same as in KVM. The "magic works out of the box" functionality which tap has in legacy UML transports has not been integrated (the transport is intended mostly as a demo at this point, so no point wasting effort until BPF has gone in). Same as RAW, GRE and L2TPv3 raw mode, it requires CAP_NET_RAW or sudo to be executed.
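Preparing a tap interface by hand, as one would for KVM, can be sketched like this with iproute2. The device name `tap0`, the owning user, the `run` wrapper and `DRY_RUN` are assumptions for illustration; addressing or bridging the interface is left out.

```shell
#!/bin/sh
# Sketch: create a persistent tap interface on the host for the UML instance.
# TAP_DEV and TAP_USER are placeholders; DRY_RUN=1 (the default here) only prints.
# Run with DRY_RUN=0 as root to actually apply the commands.
DRY_RUN="${DRY_RUN:-1}"
TAP_DEV="${TAP_DEV:-tap0}"
TAP_USER="${TAP_USER:-$(id -un)}"

# Hypothetical helper: print the command in dry-run mode, execute it otherwise.
run() {
    if [ "$DRY_RUN" = 1 ]; then
        echo "$*"
    else
        "$@"
    fi
}

prepare_tap() {
    # Create the tap device and make the given user its owner.
    run ip tuntap add dev "$1" mode tap user "$2"
    # Bring the host side of the tap up.
    run ip link set dev "$1" up
}

prepare_tap "$TAP_DEV" "$TAP_USER"
```

Owning the tap device only covers the write side; the raw socket used for reads still needs CAP_NET_RAW or sudo.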
Caveat: This transport CANNOT be used in L2 scenarios because it is impossible to fully isolate the raw socket reader so that it does not see the frames being sent on the tap interface. Nearly anything in L3 should work fine, with the exception of some more obscure use cases which add multiple MAC addresses to the interface, like vrrp.
Coming Soon - VXLAN, VXLAN GPE and Geneve
I am working on these as well as some rudimentary SFC support; patches will follow once the core patch-set is accepted upstream.
The new vector drivers for UML have full ethtool support, allowing key parameters to be queried and set.
- Ring sizes can be queried, but not set, via ethtool -g vec0
- RX/TX queue efficiency can be observed via ethtool -S vec0
- Coalesce can be queried via ethtool -c vec0 and set via ethtool -C vec0. The only coalesce parameter supported is TX coalesce (tx-usecs), which cannot be set to a value lower than one jiffy.
- ethtool -S provides checksum offload and scatter gather statistics.
- RX/TX control for offloads where applicable.
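The ethtool calls listed above can be strung together inside the guest roughly as follows. This is a sketch: the `run` wrapper, `DRY_RUN`, and the tick rate in `TICK_HZ` are assumptions made for this example (check your own kernel configuration for the real tick rate), and the clamp simply mirrors the "not lower than one jiffy" rule described above.

```shell
#!/bin/sh
# Sketch: inspect and tune a UML vector interface from inside the guest.
# DEV and TICK_HZ are assumptions for illustration; adjust to your setup.
# DRY_RUN=1 (the default here) only prints the ethtool commands.
DRY_RUN="${DRY_RUN:-1}"
DEV="${DEV:-vec0}"
TICK_HZ="${TICK_HZ:-100}"   # assumed guest tick rate, not taken from the driver

# Hypothetical helper: print the command in dry-run mode, execute it otherwise.
run() {
    if [ "$DRY_RUN" = 1 ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Length of one jiffy in microseconds for a given tick rate.
jiffy_usecs() {
    echo $((1000000 / $1))
}

# Set TX coalescing, never asking for less than one jiffy.
set_tx_coalesce() {
    dev="$1"
    want="$2"
    floor="$(jiffy_usecs "$TICK_HZ")"
    if [ "$want" -lt "$floor" ]; then
        want="$floor"
    fi
    run ethtool -C "$dev" tx-usecs "$want"
}

run ethtool -g "$DEV"          # ring sizes (query only)
run ethtool -S "$DEV"          # queue and offload statistics
run ethtool -c "$DEV"          # current coalesce settings
set_tx_coalesce "$DEV" 5000    # raised to one jiffy if 5000 us is too low
```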
- 05 May 2017