Eating My Own (Socket) Dog Food

Both QEMU and UML come with plenty of helper scripts and Web Lore to connect them to the network via tap. There is precious little (if any) information on how to use the alternative transports. As a result, people continue to use tap despite the fact that it sucks asteroids sidewise through a microbore.

I am going to try to address that in this article. After all, what is the point in writing high performance socket drivers if nobody uses them?

NOTE: some of this applies to pre-release drivers which were published in mid-July 2017. I am trying to get them included, but I cannot guarantee anything as far as when they will end up upstream. You can get the most recent versions from the qemu-devel and user-mode-linux-devel list archives.

Host Side

The key hurdle at which most people fail and give up is configuring the host side of the tunnel, so we will start with instructions for that.

GRETAP

GRETAP is relatively simple. It is set up via the /sbin/ip utility on Linux and is fairly well supported.

Example: ip link add gt1 type gretap local 192.168.128.1 remote 192.168.129.1

This sets up a gretap tunnel from 192.168.128.1 to 192.168.129.1. The IP addresses must be different. If GRE is being used to connect a VM to the host it is running on, the best approach is to set up a number of aliases (additional IP addresses) on the lo interface.
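For example, using the addresses from the tunnel above, both endpoints can be added as extra addresses on lo:
ip addr add 192.168.128.1/32 dev lo
ip addr add 192.168.129.1/32 dev lo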

The tunnel above is the most basic one - it uses GRE in its pristine form without any extra bells and whistles like sequencing, checksums or keys.

In most cases that is more than sufficient when connecting a VM to its host. If, however, the VM is being connected to a remote system not under our control, we may need to add these.

Example: ip link add gt0 type gretap local 192.168.128.1 remote 192.168.129.1 okey 0xdeadbeef ikey 0xbeefdead
This sets up a link from 192.168.128.1 to 192.168.129.1 with the host using output key 0xdeadbeef and expecting an incoming key of 0xbeefdead.

Adding seq and csum to the options turns on sequencing and checksumming respectively.
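For example, the keyed tunnel from above with both sequencing and checksumming enabled (standard iproute2 gretap options):
ip link add gt0 type gretap local 192.168.128.1 remote 192.168.129.1 okey 0xdeadbeef ikey 0xbeefdead seq csum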

These magic incantations can be added to /etc/network/interfaces on Debian/Ubuntu or to the corresponding RHEL/CentOS/Fedora interface definitions as "pre-up" statements - they are needed to set up the tunnel before the interface comes up.

NOTE: The tunnels so far are just link layers - they create Ethernet-like virtual devices. We still need to set the IP addresses on them, turn on DHCP, etc.

Here is an example of a complete GRE interface definition from my test rig:

auto gt0
iface gt0 inet static
        address 10.0.0.1
        netmask 255.255.255.0
        broadcast 10.0.0.255
        mtu 1500
        pre-up ip link add gt0 type gretap local 192.168.128.1 remote 192.168.129.1 || true
        down ip link del gt0 || true

I have used || true because this is a test rig (so that the commands always complete). Normally, this should not be necessary.
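Either way, the resulting tunnel parameters can be inspected in detail with:
ip -d link show gt0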

L2TPv3

Compared to GRE, L2TPv3 looks distinctly arcane. Most implementations have my favourite "bug" - more options than GNU ls. It is also not set up in one go - there can be multiple sessions in a single tunnel, and each of them forms a distinct interface on Linux. For our purposes we will use it in a very static fashion where we cannot use this feature, with only one session per tunnel.

First we have to set up the tunnel. The tunnel definition includes the source, the destination and the protocol. In the case of UDP the source and destination specifications also include source and destination ports.
ip l2tp add tunnel remote 127.0.0.1 local 127.0.0.1 encap udp tunnel_id 2 peer_tunnel_id 2 udp_sport 1706 udp_dport 1707

Once we have set up the tunnel we can set up one or more sessions. You can use only one session for a static L2TPv3 tunnel to QEMU/kvm (from version 2.1 onwards) or UML (the old Cisco patches or the new version which I will release in June):
ip l2tp add session name l2tp1 tunnel_id 2 session_id 0xffffffff peer_session_id 0xffffffff

This example sets up an interface called l2tp1 which we can now use on the Linux side. I have omitted the more complex options like cookies for the sake of simplicity. They are not really needed to connect a VM under our control to a host we control (localhost in this case). The equivalent /etc/network/interfaces stanza is:
auto l2tp1
iface l2tp1 inet static
   address 192.168.126.1
   netmask 255.255.255.0
   broadcast 192.168.126.255
   mtu 1500
   pre-up ip l2tp add tunnel remote 127.0.0.1 local 127.0.0.1 encap udp tunnel_id 2 peer_tunnel_id 2 udp_sport 1706 udp_dport 1707 && ip l2tp add session name l2tp1 tunnel_id 2 session_id 0xffffffff peer_session_id 0xffffffff
   down ip l2tp del session tunnel_id 2 session_id 0xffffffff && ip l2tp del tunnel tunnel_id 2
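Once the stanza is up, both the tunnel and the session can be verified with:
ip l2tp show tunnel
ip l2tp show session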

Wiring a VM to the PWE

We can now wire a VM to the end of the pseudowire (PWE).

QEMU

L2TPv3

NOTE: only L2TPv3 is shipping in QEMU/kvm 2.1. The rest were published recently and I am working on their inclusion into QEMU/kvm.

If you have a recent enough QEMU, it will support the L2TPv3 transport. In order to connect to the PWE set up in the previous section, we need to reverse the source and destination, as well as the source and destination ports if present.
qemu-system-x86_64 -hda /exports/kvm/kvm.img -m 4096 -enable-kvm \
   -net nic,vlan=0,model=virtio,macaddr=0a:98:fc:96:83:01 \
   -net l2tpv3,vlan=0,src=127.0.0.1,dst=127.0.0.1,srcport=1707,dstport=1706,rxsession=0xffffffff,txsession=0xffffffff,udp,counter
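On QEMU versions that deprecate vlan= and -net, the same connection can be expressed via -netdev/-device. A sketch, assuming the option names from the QEMU l2tpv3 documentation:
qemu-system-x86_64 -hda /exports/kvm/kvm.img -m 4096 -enable-kvm \
   -netdev l2tpv3,id=net0,src=127.0.0.1,dst=127.0.0.1,srcport=1707,dstport=1706,rxsession=0xffffffff,txsession=0xffffffff,udp=on,counter=on \
   -device virtio-net-pci,netdev=net0,mac=0a:98:fc:96:83:01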

GRE

Similarly for gre - we reverse the src, dst and the keys if present:
qemu-system-x86_64 -hda /exports/kvm/kvm.img -m 4096 -enable-kvm \
   -net nic,vlan=0,model=virtio,macaddr=0a:98:fc:96:83:01 \
   -net gre,vlan=0,src=192.168.129.1,dst=192.168.128.1

RAW

The raw transport binds directly to an Ethernet or Ethernet-like interface using raw sockets. The only parameter to specify is the interface to bind to:
qemu-system-x86_64 -hda /exports/kvm/kvm.img -m 4096 -enable-kvm \
   -net nic,vlan=0,model=virtio,macaddr=0a:98:fc:96:83:01 \
   -net raw,vlan=0,ifname=eth3

It requires the interface to be prepared - it must be in promisc mode and all offloads must be turned off, especially TSO and TX checksumming:

ethtool -K eth3 gso off 
ethtool -K eth3 tso off 
ethtool -K eth3 rx off 
ethtool -K eth3 tx off 
ethtool -K eth3 gro off
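Promiscuous mode is a one-liner (eth3 as above):
ip link set eth3 promisc on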

Note: As there is no Ethernet software loopback on Linux, you cannot contact the host on which you are running if you are using a raw transport. It is presently intended for use with one VM per interface. It is preferred (though not mandatory) that the interface is dedicated to the VM - an Ethernet pseudowire, e.g. a VLAN.

UML

NOTE: ALL of this is work in progress; it has been submitted and is under review, and I am working to get it included.

I am performing final testing on the UML patches on top of 4.11 and a backport to 4.4 so I can test them using OpenWRT. The results so far are:

RX performance as measured using iperf is indeed higher than QEMU/kvm using tap, virtio and vhost-net (all bells and whistles) on the same hardware. No, you are not having hallucinations. KVM, with all offloads, bells and whistles, is beaten by UML in a network performance test.

TX unfortunately is lagging behind QEMU/kvm, with the vector tap TX roughly comparable to the legacy UML tap TX. The actual numbers on an AMD A8-6500 APU (4-core, 3500 MHz system) are:
Transport                    | TX (Gbit) | RX (Gbit) | Forwarding*
QEMU/kvm tap                 | 0.955     | 1.38      | < 0.500
UML legacy tap               | 0.450     | 0.450     | < 0.300
UML gre                      | 0.710     | 1.45      | 0.540
UML l2tpv3                   | 0.710     | 1.45      | 0.540
UML raw (Intel 82546EB NIC)  | 0.570     | 0.642     | 0.570**
UML vector tap               | 0.458     | 1.45      | 0.470

* Forwarding: port to OpenWRT HEAD (kernel 4.4), NAT + iptables, no QoS.
** Limited by the write poll on the raw socket on a 1G NIC; a 10G NIC should do > 2 Gbit.
In all UML cases the vector depth is raised to 128 from the default 64. In fact, for speeds above 300Mbit it needs to be 128. There is little benefit in raising it above 192 unless you are using raw and a 10G NIC.

The raw figures are to a different machine, a 2-core A4-3400 at 2700 MHz. iperf on bare metal measures 0.570 Gbit to it - exactly the same as we get from the raw socket driver, so there should be no issue with it hitting 1G speed and beyond, provided the network and the underlying NIC allow it.

L2TPv3

Note that, compared to the legacy UML transports, the option syntax has now been made identical to QEMU's:
linux     mem=512M  umid=Debian \
   ubd0=/exports/UML-debian/Debian.64 \
   root=/dev/ubda vec0:transport=l2tpv3,udp=1,src=127.0.0.1,dst=127.0.0.1,srcport=1707,dstport=1706,depth=128

L2TPv3 in raw mode requires raw socket access - the linux executable needs to have CAP_NET_RAW set or be run via sudo. It can be run as a normal user when using the udp encapsulation.

GRE

Similarly for gre - we reverse the src, dst and the keys if present:
vmlinux     mem=512M  umid=Debian \
   ubd0=/exports/UML-debian/Debian.64 \
   root=/dev/ubda vec0:transport=gre,src=192.168.129.1,dst=192.168.128.1,depth=128

GRE requires raw socket access - the linux executable needs to have CAP_NET_RAW set or be run via sudo.

RAW

The raw transport binds directly to an Ethernet or Ethernet-like interface using raw sockets. The only parameter to specify is the interface to bind to:
vmlinux     mem=512M  umid=Debian \
   ubd0=/exports/UML-debian/Debian.64 \
   root=/dev/ubda vec0:transport=raw,ifname=eth3,depth=128

It requires the interface to be prepared - it must be in promisc mode and all offloads must be turned off, especially TSO and TX checksumming:

ethtool -K eth3 gso off 
ethtool -K eth3 tso off 
ethtool -K eth3 rx off 
ethtool -K eth3 tx off 
ethtool -K eth3 gro off

Note: As there is no Ethernet software loopback on Linux, you cannot contact the host on which you are running if you are using a raw transport. It is presently intended for use with one VM per interface. It is preferred (though not mandatory) that the interface is dedicated to the VM - an Ethernet pseudowire, e.g. a VLAN.
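A dedicated VLAN for the VM can be created with standard iproute2 commands (a sketch - eth3 and VLAN id 100 are example values):
ip link add link eth3 name eth3.100 type vlan id 100
ip link set eth3.100 up
The VLAN interface then needs the same promisc/offload preparation as above.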

TAP

The tap transport at present is intended as a demonstration of how tap sucks asteroids through a microbore sidewise. Natively it delivers ~500Mbit. This transport subjects it to what can be considered vile abuse - it writes via tap (as for some reason we cannot write via a raw socket to a tap interface), but reads via a raw socket. The end result is an RX rate 3+ times the normal tap RX rate. It implements rudimentary BPF filtering in order "not to see its own packets", which (because of the reduced RX cost) also allows TX to get a couple of percent above the normal tap headline TX rate.
linux     mem=512M  umid=Debian \
   ubd0=/exports/UML-debian/Debian.64 \
   root=/dev/ubda vec0:transport=tap,ifname=tap4,depth=128

It requires the interface to be prepared - same as for KVM. The "magic works out of the box" functionality which tap has in the legacy UML transports has not been integrated (the transport is intended mostly as a demo at this point, so there is no point wasting effort until BPF has gone in). Same as RAW, GRE and L2TPv3 in raw mode, it requires CAP_NET_RAW or sudo to be executed.
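If the tap interface does not already exist, it can be created on the host with standard iproute2 commands (tap4 matches the ifname above):
ip tuntap add dev tap4 mode tap
ip link set tap4 up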

Ethtool

The new vector drivers for UML have full ethtool support, allowing key parameters to be queried and set:
  1. Ring sizes can be queried (ethtool -g vec0), but not set.
  2. RX/TX queue efficiency can be observed via ethtool -S vec0.
  3. Coalescing can be queried via ethtool -c vec0 and set via ethtool -C vec0 (see the example below). The only coalesce parameter supported is TX coalescing (tx-usecs), which cannot be set to a value lower than one jiffy.
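For example, on a 100 Hz kernel one jiffy is 10000 microseconds, so the lowest settable value would be (the value is purely illustrative):
ethtool -C vec0 tx-usecs 10000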

-- AntonIvanov - 05 May 2017
