Privilege Escalation in gVisor, Google's Container Sandbox
tl;dr gVisor is Google’s sandboxing technology for containers running less-than-fully-trusted code. It’s a Golang reimplementation of the Linux kernel that runs in userspace, intercepting container syscalls and limiting what touches the host kernel directly. I found an issue with gVisor’s shared memory implementation that allowed unprivileged processes in the sandbox to read and write the memory of other, more highly privileged processes in the sandbox, including those running as root.
[edit: Please note that this is not a sandbox escape; gaining root within gVisor does not give you access to the host machine. I encourage you to check out the gVisor README for more details!]
Here’s a video of me running a proof-of-concept that reads and overwrites some memory in an unrelated process (code at the end of the post):
Vulnerability
Mapped-after-free
gVisor uses reference counting for various parts of its internal memory management, and the bug allows us to lower the reference count of the “platform” memory backing a shmget shared memory region, even when we still have a mapping to it in our process’s virtual address space. The backing memory is then reclaimed and handed to another (potentially more privileged) process, where it can be used for something else. The bug is basically a use-after-free.
The problem stems from the code in sys_shm.go that implements the shmctl syscall. segment.MarkDestroyed() unconditionally decreases the reference count of the shared memory object, which eventually decreases the reference count of the platform memory range itself, in shm.go.
So if the platform memory region in the s.fr range has a reference count of 1, and we call shmctl(shmid, IPC_RMID, NULL) until we trigger the destruction of the Shm object, that backing memory can be reclaimed by the kernel for use by other processes.
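To make the sequence concrete, here is a toy Go model of the reference counting described above. It is not gVisor's code: the shm and platformRange types, their fields, and the assumption that each attach takes one reference on the segment are all mine, chosen only to show how repeated IPC_RMID calls can free the backing memory while a mapping is still live.

```go
package main

import "fmt"

// platformRange is a hypothetical stand-in for the platform memory behind s.fr.
type platformRange struct {
	refs int
}

// shm is a hypothetical stand-in for gVisor's Shm object.
type shm struct {
	refs     int            // references held on the segment itself
	fr       *platformRange // backing platform memory
	attached int            // live mappings, tracked only for this demo
}

// newShm models shmget: one reference on the segment, one on its backing memory.
func newShm() *shm {
	return &shm{refs: 1, fr: &platformRange{refs: 1}}
}

// attach models shmat. Assumption: the mapping takes a reference on the segment.
func (s *shm) attach() {
	s.refs++
	s.attached++
}

// decRef drops a segment reference and, at zero, drops the reference the
// segment holds on its backing memory.
func (s *shm) decRef() {
	s.refs--
	if s.refs == 0 {
		s.fr.refs--
	}
}

// markDestroyed models the buggy IPC_RMID path: it drops a reference on
// every call, not just the first one.
func (s *shm) markDestroyed() {
	s.decRef()
}

// rmidUntilFreed keeps calling markDestroyed until the backing memory is
// released, and returns how many calls that took.
func rmidUntilFreed(s *shm) int {
	calls := 0
	for s.fr.refs > 0 {
		s.markDestroyed()
		calls++
	}
	return calls
}

func main() {
	s := newShm()
	s.attach() // we still hold a mapping for the whole run
	calls := rmidUntilFreed(s)
	fmt.Printf("backing memory freed after %d IPC_RMID calls with %d mapping(s) still live\n",
		calls, s.attached)
}
```

In this model the first IPC_RMID drops the creation reference and the second drops the one that belongs to the mapping, so the memory is released even though nothing was ever detached. A guard that only drops the reference the first time the segment is marked for destruction would stop the second call from having any effect.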
The platform memory’s reference count is initialized to 1 when we first create our shared memory object with shmget(), but it gets incremented to 2 when the memory is accessed and page-faulted in. So if you read from or write to the mapped shared memory region before triggering the DecRef() bug, things become harder to exploit.
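The effect of touching the region first can be shown with an even smaller sketch. Again, this is only an illustration under my own assumptions: frameRange, fault, and release are hypothetical names, and I assume the fault path takes its extra reference once, on the first access.

```go
package main

import "fmt"

// frameRange is a hypothetical model of a platform memory range's refcount.
type frameRange struct {
	refs    int
	faulted bool
}

// allocate models shmget: the range starts with a single reference.
func allocate() *frameRange {
	return &frameRange{refs: 1}
}

// fault models the first access to the mapped region. Assumption: the
// page-fault path takes one extra reference, and only once.
func (f *frameRange) fault() {
	if !f.faulted {
		f.faulted = true
		f.refs++
	}
}

// release models the single DecRef issued when the Shm object is destroyed.
// It reports whether the range is now free for reuse.
func (f *frameRange) release() bool {
	f.refs--
	return f.refs == 0
}

// freedAfterDestroy reports whether destroying the segment frees the range,
// depending on whether the region was touched beforehand.
func freedAfterDestroy(touched bool) bool {
	f := allocate()
	if touched {
		f.fault()
	}
	return f.release()
}

func main() {
	fmt.Println("untouched region freed:", freedAfterDestroy(false))
	fmt.Println("touched region freed:", freedAfterDestroy(true))
}
```

With the count still at 1, the destroy path's single decrement frees the range. After a fault the count is 2, so the same decrement leaves one reference behind and the memory is not handed out again.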
Here’s the proof-of-concept I put together for the video above, with comments explaining how it works:
Conclusion
gVisor is a well-written but incredibly ambitious project; reimplementing the Linux kernel is hard. I’ve been thinking about some ways that one could detect and mitigate this class of bug. Perhaps when a memory region is DecRef()ed to 0, the kernel could check that it doesn’t exist in any process’s virtual address space. I’m not sure if that would be too expensive.