tl;dr gVisor is Google’s sandboxing technology for containers running less-than-fully-trusted code. It’s a Golang reimplementation of the Linux kernel that runs in userspace, intercepting container syscalls and limiting what touches the host kernel directly. I found an issue with gVisor’s shared memory implementation that allowed unprivileged processes in the sandbox to read and write the memory of other, more highly privileged processes in the sandbox, including those running as root.
[edit: Please note that this is not a sandbox escape; gaining root within gVisor does not give you access to the host machine. I encourage you to check out the gVisor README for more details!]
Here’s a video of me running a proof-of-concept that reads and overwrites some memory in an unrelated process (code at the end of the post):
gVisor uses reference counting for various parts of its internal memory management, and the bug allows us to lower the reference count of the “platform” memory backing a shmget shared memory region, even when we still have a mapping to it in our process’s virtual address space. The backing memory is then reclaimed and handed to another (potentially more privileged) process, where it can be used for something else. The bug is basically a use-after-free.
The problem stems from code in sys_shm.go, which implements gVisor’s shared memory syscalls.
segment.MarkDestroyed() unconditionally decreases the reference count of the shared memory object, which eventually decreases the reference count of the platform memory range itself.
So if the platform memory region in the s.fr range has a reference count of 1, and we call shmctl(shmid, IPC_RMID, NULL) until we trigger the destruction of the Shm object, that backing memory can be reclaimed by the kernel for use by other processes.
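To make the failure mode concrete, here’s a minimal toy model of the pattern in Go. It is not gVisor’s code: the types segment and backing, and the methods markDestroyedBuggy and markDestroyedGuarded, are hypothetical names I made up for illustration, and the idea that attaching a mapping takes one reference on the segment is an assumption of the model. The guarded variant only shows one way a removal could be made idempotent; it isn’t necessarily how the real fix works.

```go
package main

import "fmt"

// backing stands in for a range of platform memory with its own reference
// count. Hypothetical type, not a gVisor structure.
type backing struct {
	refs  int
	freed bool
}

func (b *backing) decRef() {
	b.refs--
	if b.refs == 0 {
		b.freed = true // the allocator may now hand this range to another process
	}
}

// segment stands in for a shared memory object. Hypothetical type.
type segment struct {
	refs      int
	destroyed bool
	mem       *backing
}

// newSegment models creation: one reference on the segment, one on its memory.
func newSegment() *segment {
	return &segment{refs: 1, mem: &backing{refs: 1}}
}

// incRef models an attach taking a reference (an assumption of this model).
func (s *segment) incRef() { s.refs++ }

func (s *segment) decRef() {
	s.refs--
	if s.refs == 0 {
		s.mem.decRef() // destroying the segment releases its backing memory
	}
}

// markDestroyedBuggy drops a reference on every call, so repeated removals
// eat references that belong to live mappings.
func (s *segment) markDestroyedBuggy() {
	s.destroyed = true
	s.decRef()
}

// markDestroyedGuarded drops the creation reference at most once.
func (s *segment) markDestroyedGuarded() {
	if s.destroyed {
		return
	}
	s.destroyed = true
	s.decRef()
}

func main() {
	buggy := newSegment()
	buggy.incRef()             // a mapping holds a second reference
	buggy.markDestroyedBuggy() // first removal drops the creation reference
	buggy.markDestroyedBuggy() // second removal drops the mapping's reference
	fmt.Println("buggy: backing freed while mapped:", buggy.mem.freed)

	guarded := newSegment()
	guarded.incRef()
	guarded.markDestroyedGuarded()
	guarded.markDestroyedGuarded() // no effect the second time
	fmt.Println("guarded: backing freed while mapped:", guarded.mem.freed)
}
```

In the buggy variant the mapping still exists after the second removal, but nothing is keeping its memory alive any more, which is exactly the use-after-free shape described above.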
The platform memory’s reference count is initialized to 1 when we initially create our shared memory object with shmget(), but it gets incremented to 2 when the memory is accessed and page-faulted in. So if you read or write to the mapped shared memory region before triggering the DecRef() bug, things become harder to exploit.
Here’s the proof-of-concept I put together for the video above, with comments explaining how it works:
gVisor is a well-written but incredibly ambitious project; reimplementing the Linux kernel is hard. I’ve been thinking about some ways that one could detect and mitigate this class of bug. Perhaps when a memory region’s reference count drops to 0, the kernel could check that it doesn’t exist in any process’s virtual address space. I’m not sure if that would be too expensive.