Single address spaces: design flaw or feature?

Posted on January 30, 2016

Unikernels operate in a single address space. Usually this is an address space provided by a hypervisor (or a microkernel) but there’s no reason you can’t run a single unikernel on a single CPU (“bare metal”) with no hypervisor involved. As unikernels become more well-known, I’ve seen people describe the single-address-space design choice in quite pejorative terms. Unfortunately, many explanations of unikernels do not adequately explain this decision to ditch the MMU. In the first part of this blog post, I will explain the performance advantages of living in a single address space and how even some high performance non-unikernel systems are designed to exploit those benefits. In the second, I will explore the security/correctness aspects of the unikernel address space model.

Most operating systems use MMU[1]-enforced isolation to both separate processes from each other and separate user code from kernel code. MMU-enforced isolation is a powerful tool – indeed, if you know how your MMU hardware will interpret its translation tables and know how your kernel writes to the translation tables, you can generate formal results about the isolation between processes or between kernel code and user processes. Outside of the kernel explicitly enabling processes to interact with each other (inter-process communication), the isolation properties that flow from a properly configured MMU are independent of the user-level code being run.
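To make the isolation argument concrete, here is a minimal sketch of what a single-level translation lookup does. It is a toy model rather than any real architecture's format: the 4 KiB page size, the `Entry` fields, and the names `translate` and `PageFault` are all assumptions made for illustration.

```python
# Toy model of single-level MMU translation. Page size, table layout and
# names are illustrative assumptions, not any real architecture's format.
from dataclasses import dataclass

PAGE_SIZE = 4096  # assumed 4 KiB pages


class PageFault(Exception):
    """Raised when an access is not permitted by the translation table."""


@dataclass(frozen=True)
class Entry:
    frame: int       # physical frame number the page maps to
    writable: bool   # whether stores are allowed
    user: bool       # whether unprivileged code may use the mapping


def translate(table, vaddr, *, write=False, user=True):
    """Translate a virtual address or raise PageFault, as an MMU would."""
    page, offset = divmod(vaddr, PAGE_SIZE)
    entry = table.get(page)
    if entry is None:
        raise PageFault(f"no mapping for page {page:#x}")
    if user and not entry.user:
        raise PageFault(f"page {page:#x} is kernel-only")
    if write and not entry.writable:
        raise PageFault(f"page {page:#x} is read-only")
    return entry.frame * PAGE_SIZE + offset


# Two processes with separate tables: the same virtual address lands in
# different physical frames, and neither table names the other's frame.
proc_a = {0x10: Entry(frame=0x200, writable=True, user=True),
          0xFF: Entry(frame=0x001, writable=True, user=False)}  # kernel page
proc_b = {0x10: Entry(frame=0x300, writable=False, user=True)}

print(hex(translate(proc_a, 0x10123)))  # 0x200123
print(hex(translate(proc_b, 0x10123)))  # 0x300123
```

Nothing in `translate` depends on what the calling code is trying to do, only on the table contents, which is why the isolation result holds for arbitrary user code.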

Properly set up MMU-enforced isolation provides isolation between malicious and unknown bits of code – but at a cost. Context switches take time:

The performance of context switches is crucial as every system call a userspace program does to interact with hardware or OS services causes at least two context switches – one userspace-to-kernel, one kernel-to-userspace.

Context switches hurt your caches…

While there are hardware mechanisms[2] to optimize context switches, they do not eliminate all the performance degradation. Context switches have indirect costs outside of the time used for the switch itself – all the previously-mentioned caches get utterly trashed. The code that is being switched to has to start off execution with a cold cache, deeply degrading performance. The degradation is mutual: when control returns to the code that was originally running, it can take tens of thousands of cycles[3] for performance to return to baseline levels. Executing a system call like write() evicts 2/3rds of the L1 cache and TLB, degrading user code performance by an amount that depends on how many instructions run between syscalls.

There are hardware features that can reduce cache pollution. TLB lockdown mechanisms, for example, allow an OS to make a certain number of entries in the TLB permanent (so they aren’t overwritten by invalidations or replacement). A kernel can lock down entries for its own address space to reduce translation table walks after a user-kernel switch – the TLB entries are already there! However, these architectural features cannot come close to making context switches painless in high-performance applications and can even have costs of their own. Locked down some TLB entries to make context switching hurt less? Unless your TLB has slots that are reserved for locked-down entries, those entries will always take up space in the TLB that could be better used for cached translations needed by the program’s workload.
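The effect of address-space tags and lockdown on TLB misses can be sketched with a toy model. Everything here is an assumption for illustration: a fully associative eight-entry TLB with LRU replacement, a workload that alternates between six user pages and two kernel pages, and the names `ToyTLB` and `run`. Real TLBs differ in size, associativity, and replacement policy.

```python
from collections import OrderedDict


class ToyTLB:
    """Fully associative, LRU-replaced TLB model (illustrative only)."""

    def __init__(self, capacity, tagged):
        self.capacity = capacity
        self.tagged = tagged          # True: entries carry an address-space tag
        self.entries = OrderedDict()  # (asid, page) keys kept in LRU order
        self.locked = set()           # keys exempt from flush and replacement
        self.asid = 0
        self.misses = 0

    def lock(self, asid, page):
        """Pin a translation so it is never flushed or replaced."""
        self.locked.add((asid, page))

    def switch_to(self, asid):
        """Context switch: an untagged TLB must drop its unlocked entries."""
        if not self.tagged:
            self.entries.clear()
        self.asid = asid

    def access(self, page):
        key = (self.asid, page)
        if key in self.locked:
            return                          # locked entries always hit
        if key in self.entries:
            self.entries.move_to_end(key)   # refresh LRU position
            return
        self.misses += 1
        # Locked entries occupy slots, shrinking the room left for others.
        room = self.capacity - len(self.locked)
        if room <= 0:
            return
        while len(self.entries) >= room:
            self.entries.popitem(last=False)  # evict least recently used
        self.entries[key] = None


def run(tlb, user_pages=6, kernel_pages=2, rounds=100):
    """Alternate between a user (ASID 1) and kernel (ASID 0) working set."""
    for _ in range(rounds):
        tlb.switch_to(1)
        for page in range(user_pages):
            tlb.access(page)
        tlb.switch_to(0)
        for page in range(kernel_pages):
            tlb.access(page)
    return tlb.misses


untagged = ToyTLB(8, tagged=False)
tagged = ToyTLB(8, tagged=True)
locked = ToyTLB(8, tagged=False)
locked.lock(0, 0)
locked.lock(0, 1)

print(run(untagged))  # 800: every switch flushes everything
print(run(tagged))    # 8: both working sets stay resident
print(run(locked))    # 600: kernel pages always hit, user pages still flushed
```

In this model the locked configuration also leaves only six of the eight slots for everything else, which is the capacity cost described above.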

Context switches are inherently[4] cruel to caches, and CPUs need happy caches in order to get acceptable performance. To reduce the performance impact of syscalls without modifying application software, exceptionless/asynchronous syscalls have been demonstrated. With the regular syscall interface, the userspace process requests a syscall by executing a special software interrupt instruction to cause a context switch to the kernel. The arguments for the syscall are put in the general-purpose registers. The exceptionless syscall model requires small modifications to the libc and the kernel: when a syscall is requested from the application program, the libc places the syscall’s arguments in a special page of memory (“syscall page”) and switches to another user-level thread. The kernel, at its leisure, can look at each process’s syscall page, execute the syscalls, write the results back, and set a completion flag in the syscall page. This leads to fewer context switches, which amortizes the direct time cost of each one. Both kernel and user code run longer uninterrupted, which means less time with the caches cold. This decoupling of control flow means that the system can even assign one core to the application software and one to servicing its syscalls! The first core will stay in user mode and the second will stay in kernel mode; instead of user code and kernel code interrupting each other, they communicate through syscall pages[5]. That the cited paper shows doubled[6] performance in common workloads such as serving web pages just by replacing the syscall mechanism with an asynchronous one is a damning indictment of the cost of synchronous syscalls.
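The syscall-page idea can be sketched as a small simulation. This is not the cited paper's implementation: the entry layout, the `HANDLERS` table standing in for kernel services, and the accounting of two mode switches per kernel visit are all assumptions chosen to show the amortization.

```python
from dataclasses import dataclass
from enum import Enum, auto


class Status(Enum):
    FREE = auto()
    SUBMITTED = auto()
    DONE = auto()


@dataclass
class Entry:
    status: Status = Status.FREE
    name: str = ""
    args: tuple = ()
    result: object = None


# Stand-ins for kernel services; a real kernel dispatches on a syscall number.
HANDLERS = {"add": lambda a, b: a + b, "strlen": len}


class SyscallPage:
    """Shared page through which user code and kernel exchange requests."""

    def __init__(self, slots=8):
        self.entries = [Entry() for _ in range(slots)]
        self.mode_switches = 0

    def submit(self, name, *args):
        """User side: fill a free slot; no trap into the kernel."""
        for index, entry in enumerate(self.entries):
            if entry.status is Status.FREE:
                entry.name, entry.args = name, args
                entry.status = Status.SUBMITTED
                return index
        raise RuntimeError("syscall page full; caller must wait")

    def kernel_drain(self):
        """Kernel side: run every pending request in a single visit."""
        self.mode_switches += 2  # assumed: one entry into the kernel, one exit
        for entry in self.entries:
            if entry.status is Status.SUBMITTED:
                entry.result = HANDLERS[entry.name](*entry.args)
                entry.status = Status.DONE

    def collect(self, index):
        """User side: read a completed result and free the slot."""
        entry = self.entries[index]
        if entry.status is not Status.DONE:
            raise RuntimeError("result not ready")
        result, entry.status = entry.result, Status.FREE
        return result


page = SyscallPage()
tickets = [page.submit("add", i, i) for i in range(4)]
tickets.append(page.submit("strlen", "hello"))
page.kernel_drain()
print([page.collect(t) for t in tickets])  # [0, 2, 4, 6, 5]
print(page.mode_switches)                  # 2, versus 10 for five synchronous calls
```

With a dedicated kernel core polling the page, even those two switches would disappear in this model, since neither side ever changes mode.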

…and getting rid of them is serious business

Work done on software to reduce the performance degradation caused by syscalls and context switching is far from being a systems research curiosity. If you’re working with gigabits per second of small packets, you’re going to be dealing with lots of packets per second. If the regular kernel networking APIs require a syscall per packet, this forces a context switch and a copy of data from kernel space to user space (or from user space to kernel space) for every packet. At enough packets per second, literally all your cores are doing is switching context, copying data, and running all the time with polluted caches and TLBs, leading to severe performance degradation. You will not be able to receive 10 Gb/s of small packets with the regular Linux APIs.

This microbenchmark of context switches gives latency numbers on the order of 1 microsecond per context switch (which does not account for the slowdown due to polluted caches). 1 microsecond per context switch means only one million context switches per second, which isn’t close to the 10 million packets per second needed to saturate a 10 Gb/s NIC. Because of this, your latency numbers will be ten times higher[7] than what the hardware itself is capable of. To avoid the overhead of kernel-user data copying and the ruinous effect of context switches, there are a bunch of different kernel APIs that just expose the NIC’s control and ring buffers to userspace. It’s your program’s responsibility to implement a TCP/IP stack in userspace with the send/receive ring buffers that are exposed to it. If the kernel is involved at all, all it has to do is verify the validity of addresses that userspace gave it and poke the NIC (if that can’t be done from userspace with appropriate configuration of mappings). Cloudflare (which is in the business of dealing with lots of gigabits of possibly-unwanted small packets) depends on userspace / kernel-bypass networking. High-performance networking on general-purpose CPUs[8] today, regardless of what’s being done, does not go anywhere without some flavor of kernel-bypass networking. To reduce latency even further, Intel has DDIO, which lets a network card directly place incoming packets in the Last Level Cache of a CPU without any access to main memory (and do the opposite upon transmit), a feature that only works to its full extent when the cache isn’t constantly being trashed by user/kernel context switching.
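The packet-rate arithmetic is easy to check. The sketch below assumes minimum-size Ethernet frames (64 bytes, plus 8 bytes of preamble and start delimiter and a 12-byte inter-frame gap on the wire) and takes the 1 microsecond context switch figure at face value. With those assumptions the line rate comes out somewhat above the round 10 million used above.

```python
LINK_BITS_PER_SECOND = 10e9

# A minimum Ethernet frame (64 bytes) plus preamble/start delimiter (8 bytes)
# and inter-frame gap (12 bytes) occupies 84 bytes on the wire.
WIRE_BYTES_PER_MIN_FRAME = 64 + 8 + 12

CONTEXT_SWITCH_SECONDS = 1e-6  # order-of-magnitude figure from the text


def packets_per_second(link_bps, wire_bytes):
    """Packets per second that fit on a link at the given per-packet size."""
    return link_bps / (wire_bytes * 8)


line_rate_pps = packets_per_second(LINK_BITS_PER_SECOND, WIRE_BYTES_PER_MIN_FRAME)
switches_per_second = 1 / CONTEXT_SWITCH_SECONDS
per_packet_budget_ns = 1e9 / line_rate_pps
shortfall = line_rate_pps / switches_per_second

print(f"line rate:         {line_rate_pps:,.0f} packets/s")    # 14,880,952
print(f"per-packet budget: {per_packet_budget_ns:.1f} ns")      # 67.2
print(f"context switches:  {switches_per_second:,.0f} per s")   # 1,000,000
print(f"shortfall:         {shortfall:.1f}x")                   # 14.9
```

At roughly 67 ns per packet, a single 1 microsecond context switch already spends the time budget of about fifteen packets, before counting the copy or the cold caches.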

Back to unikernels

What does this all have to do with unikernels? Well, an application that uses kernel-bypass networking (and that doesn’t depend on other kernel features) on Linux contains an ad hoc, informally-specified, bug-ridden, slow implementation of half of a unikernel[9], and uses the Linux kernel as a hypervisor. If you don’t need kernel services (or can use them with a library you link against) and just need isolation, an actual hypervisor might be a better tool. If you need kernel features that provide a coherent and unified view of some hardware (especially one that contains lots of state, like a GPU or a filesystem on a disk) to multiple agents at once, then the abstractions that a traditional kernel[10] provides look more appealing.

However, if all you need is some address space of your own, access to a network device that’s just for you (nobody can mess with your NIC ring buffers!), and a range of disk blocks for yourself, a unikernel’s libraries and language runtime can provide abstractions like a filesystem or threading or a TCP/IP stack over those. Rather than the kernel and the MMU providing isolation between your code and the kernel, it’s the language’s type system and memory safety that provide isolation between your code and the unikernel’s libraries that provide the services your code uses.

The guarantee that a properly-managed MMU provides is just isolation between your process’s address space and the kernel’s address space: if your kernel or application is written in C and has an exploitable memory safety issue, the MMU won’t magically prevent exploitation. The type safety and memory safety of whatever code you end up running is always an issue, and the best the MMU can do is limit the extent of the compromise. Indeed, unless the kernel (or hypervisor) is being exploited, a process under a traditional kernel running attacker-controlled code is as limited as a compromised unikernel in a hypervisor – except that kernels like Linux have a much worse security record than hypervisors like Xen. Furthermore, unikernels can be written in memory-safe and type-safe languages, and it is easier to reason about an OCaml program that links to OCaml libraries than about software that calls upon a huge mass of kernel C code that has absolutely no documentation in most cases (much less a formal description of what it does). Instead of a kernel that needs to provide isolation and services to whatever arbitrary machine code is thrown at it, the runtime of the unikernel’s language only has to run a single language’s code that has been typechecked – a much easier task.

Using a unikernel doesn’t just provide security advantages from using a single memory-safe and type-safe language – there are potential benefits in terms of debugging. Admittedly, production-ready introspection/instrumentation tooling isn’t currently available for unikernels, and this is something that likely needs to change before seriously considering widespread use of unikernels in production applications. However, it’s easier to create quality tooling for something written in a single language with a decent type system that lives in a single address space than to create bespoke instrumentation code that lives in the kernel (and corresponding userspace tools) for each kernel component, each with its own data formats and structures.

Unikernels are appealing not just because they let us use a decent language with a good type system and memory safety instead of piles of C – they let us redraw isolation and abstraction boundaries with more appropriate and specific tools provided by our language’s type systems. When the intended application doesn’t involve the kernel running userspace code that is unknown or potentially hostile or not yours, a unikernel model where all the code lives in the same address space, is compiled and typechecked as a single compilation unit, and is managed by a single language runtime offers quite a few advantages.

  1. The MMU is a piece of hardware that can be configured by software running at an appropriate privilege level. Once the MMU is configured with a set of translation tables, all code running at that privilege level or lower will have its access to memory restricted by the parameters of the MMU. When your code generates an access to memory, the MMU takes in that memory address, looks at its translation tables, and generates a result or an error. If the memory address is in a region that is not granted access by the translation tables, a CPU interrupt is generated. If the MMU searches its tables and finds an entry that includes the input memory address, it will use the information in that page table entry to generate a translated address. If there is no other MMU in the system, that address is used to directly access memory (and is termed a “physical address”). A CPU that supports virtualization will have two MMUs and thus two levels of translation. There is a Stage 1 MMU that is configured by the guest OS and converts addresses from userspace software (virtual addresses) to intermediate physical addresses (if there weren’t a hypervisor they’d be real physical addresses!). The Stage 2 MMU, configured by the hypervisor, takes in intermediate physical addresses and converts them into physical addresses, fully fit to be used to access memory.

  2. For example, the hardware can tag each TLB entry with an address space identifier number and have a register (that the OS changes upon context switch) for the current address space identifier and only use TLB entries that have a matching tag.

  3. “There is a significant drop in instructions per cycle (IPC) due to the system call, and it takes up to 14,000 cycles of execution before the IPC of this application returns to its previous level. As we will show, this performance degradation is mainly due to interference caused by the kernel on key processor structures”

  4. Caches like temporal and spatial locality. Synchronous system calls break those.

  5. The syscall pages tend to stay cached, meaning that the cache coherency mechanism (and not slow main memory accesses) is used to transmit them from userspace cores to kernel cores.

  6. “We show how FlexSC improves performance of Apache by up to 116%, MySQL by up to 40%, and BIND by up to 105% while requiring no modifications to the applications.”

  7. And right around what we’d expect to see, having seen the numbers for context switch timing.

  8. And not FPGAs or ASICs.

  9. Instead of having the kernel expose the NIC’s ring buffers to you, the hypervisor does it. Instead of a filesystem you do syscalls to fuss with, the hypervisor gives you a range of disk blocks and you can call functions in some unikernel library that implement a filesystem.

  10. It doesn’t have to be implemented at all like a traditional monolithic kernel – you could have a unikernel that provides a coherent/unified filesystem service to multiple other unikernels and use the hypervisor just like you’d use a microkernel to pass messages between VMs.