Implementing a Linux Container in Go and Escaping it!!
Being a security enthusiast, I always look for ways to tweak or break software. Recently I’ve become interested in container escapes and have been reading about them for a while. To understand them properly, I studied the technology underlying containers and implemented a minimal container in Go. After implementing it, I also tried to escape from it to the host.
A container is a chroot on steroids. –someone on the hackernews
What is a container?
We have all heard about containers in our day-to-day work, described in different ways: as a kind of lightweight virtualisation, as something like a chroot jail, or simply as process isolation. According to Docker, a container is a standard unit of software that packages up code and all its dependencies so the application runs quickly and reliably from one computing environment to another.
The official Docker documentation mentions that containers use several features of the Linux kernel and combine them into a wrapper called a container format. Those features are:
- Namespaces
- Control groups
- Union file systems
Namespaces
Linux namespaces are the underlying technology behind most modern container implementations. A namespace determines what a process can see of the system around it: namespaces isolate global system resources (process IDs, hostnames, mount points, network interfaces, and so on) so that each group of processes gets its own view of them. To list the namespaces on your Linux system, you can use the lsns command, which gathers its information by reading directly from the /proc directory.
dopamine@x:~$ lsns -l
NS TYPE NPROCS PID USER COMMAND
4026531834 time 34 442 dopamine /lib/systemd/systemd --user
4026531835 cgroup 34 442 dopamine /lib/systemd/systemd --user
4026531837 user 34 442 dopamine /lib/systemd/systemd --user
4026531840 net 34 442 dopamine /lib/systemd/systemd --user
4026532244 ipc 34 442 dopamine /lib/systemd/systemd --user
4026532255 mnt 34 442 dopamine /lib/systemd/systemd --user
4026532256 uts 34 442 dopamine /lib/systemd/systemd --user
4026532257 pid 34 442 dopamine /lib/systemd/systemd --user
Control groups
A control group (cgroup) is a Linux kernel feature that limits, accounts for, and isolates the resource usage (CPU, memory, disk I/O, network, and so on) of a collection of processes. Docker and other containerization tools use cgroups to control how much of a given key resource (CPU, memory, network, and disk I/O) can be accessed or used by a process or set of processes. Cgroups are a key component of containers because there are often multiple processes running in a container that you need to control together. Cgroups are exposed by the kernel as a special file system you can mount.
Union file systems
Different filesystems have different rules about file attributes, sizes, names, and characters, and a union filesystem often has to translate between those rules. A union filesystem allows files and directories of separate filesystems, known as branches, to be transparently overlaid, forming a single coherent filesystem. Contents of directories that have the same path within the merged branches are seen together in a single merged directory within the new, virtual filesystem.
The backing filesystem is another pluggable feature of Docker. A union filesystem (UFS) mount provides a container’s filesystem. Any changes that you make to the filesystem inside a container are written as new layers owned by the container that created them.
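As an illustration of layering, here is a minimal sketch using the kernel's overlay filesystem, one common union filesystem on Linux. It is not used by the container built in this post. The helper names overlayOptions and mountOverlay and all paths are hypothetical, and the mount itself needs root and existing directories, so main only prints the option string.

```go
package main

import (
	"fmt"
	"strings"
	"syscall"
)

// overlayOptions builds the option string for an overlay mount: one or more
// read-only lower layers (the leftmost is the topmost layer), one writable
// upper layer, and an empty work directory on the same filesystem as upper.
func overlayOptions(lower []string, upper, work string) string {
	return fmt.Sprintf("lowerdir=%s,upperdir=%s,workdir=%s",
		strings.Join(lower, ":"), upper, work)
}

// mountOverlay mounts the merged view at target. It requires root privileges
// and all directories must already exist.
func mountOverlay(lower []string, upper, work, target string) error {
	return syscall.Mount("overlay", target, "overlay", 0, overlayOptions(lower, upper, work))
}

func main() {
	// Hypothetical layer directories, shown only to illustrate the format.
	opts := overlayOptions(
		[]string{"/tmp/layers/app", "/tmp/layers/base"},
		"/tmp/layers/upper",
		"/tmp/layers/work",
	)
	fmt.Println(opts)
}
```

Writes inside the merged directory land in the upper layer, which is the mechanism behind the "new layers owned by the container" described above.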
Building the Container
We will start by creating a child process that is separated from the host by namespaces. The child process executes the same program as its parent: by using /proc/self/exe, we can create a new process that runs the same binary as the running process.
This is a common technique for creating recursive or self-referential process invocations. It allows you to build logic where the same program behaves differently depending on the context in which it is invoked, opening up patterns like parent-child relationships or self-wrapping programs.
package main

import (
	"fmt"
	"os"
	"os/exec"
	"syscall"
)

func main() {
	// Expected invocation: container run <command> [args...]
	if len(os.Args) < 3 {
		fmt.Println("usage: container run <command> [args...]")
		os.Exit(1)
	}
	switch os.Args[1] {
	case "run":
		run()
	case "child":
		child()
	default:
		panic("what??")
	}
}

// run re-executes this binary with "child" as the first argument,
// placing the new process in fresh UTS and PID namespaces.
func run() {
	fmt.Printf("Running %v as PID %d\n", os.Args[2:], os.Getpid())
	cmd := exec.Command("/proc/self/exe", append([]string{"child"}, os.Args[2:]...)...)
	cmd.Stdin = os.Stdin
	cmd.Stdout = os.Stdout
	cmd.Stderr = os.Stderr
	cmd.SysProcAttr = &syscall.SysProcAttr{
		Cloneflags: syscall.CLONE_NEWUTS | syscall.CLONE_NEWPID,
	}
	if err := cmd.Run(); err != nil {
		fmt.Println("Error running the /proc/self/exe command:", err)
		os.Exit(1)
	}
}

// child runs inside the new namespaces: it sets the hostname
// and then runs the requested command.
func child() {
	fmt.Printf("Running %v as PID %d\n", os.Args[2:], os.Getpid())
	if err := syscall.Sethostname([]byte("Container")); err != nil {
		fmt.Println("Error setting hostname:", err)
		os.Exit(1)
	}
	cmd := exec.Command(os.Args[2], os.Args[3:]...)
	cmd.Stdin = os.Stdin
	cmd.Stdout = os.Stdout
	cmd.Stderr = os.Stderr
	if err := cmd.Run(); err != nil {
		fmt.Println("Error running the child command:", err)
		os.Exit(1)
	}
}
The run() function uses /proc/self/exe to execute the same binary with “child” as the first argument, followed by any additional command-line arguments passed in. This sets up a clear parent-child relationship where the parent process controls what the child should do. By referencing /proc/self/exe, you don’t have to know the absolute path to the program. This is useful if the program’s location isn’t static or if it might be installed or moved to different directories.
You can try running the program with a shell (/bin/sh) and examine the filesystem that the container sees.
dopamine@xGh0st:/mnt/d/container$ go build -o container main.go
dopamine@xGh0st:/mnt/d/container$ sudo ./container run /bin/sh
[sudo] password for dopamine:
Running [/bin/sh] as PID 478670
Running [/bin/sh] as PID 1
# hostname
Container
# ls
alpine-minirootfs-3.18.0-x86_64.tar.gz container main.go rootfs
#
Currently, this container is running the /bin/sh process, but it can still see the project directory on the host. We will now assign a Linux filesystem to our container. This filesystem will serve as its root, isolating it from the host filesystem.
We will use the Alpine Linux mini root filesystem, since Alpine is one of the most lightweight Linux distributions. We can download and extract it using the wget and tar commands below:
wget https://dl-cdn.alpinelinux.org/alpine/v3.18/releases/x86_64/alpine-minirootfs-3.18.0-x86_64.tar.gz
mkdir rootfs
tar -xzf alpine-minirootfs-3.18.0-x86_64.tar.gz -C rootfs
Then we can make the changes below in child() to switch the root of the container to this filesystem.
func child() {
	// Rest of the code goes here
	// Use an absolute path here: chroot does not expand "~" the way a shell does.
	if err := syscall.Chroot("/path/to/your-project-directory/rootfs"); err != nil {
		fmt.Println("Error changing root:", err)
		os.Exit(1)
	}
	// Change working directory after changing the root.
	if err := os.Chdir("/"); err != nil {
		fmt.Println("Error changing working directory:", err)
		os.Exit(1)
	}
	// Rest of the code goes here
}
You can check the filesystem of the container by running the program again and verifying that it now sees the new root filesystem.
dopamine@xGh0st:/mnt/d/container$ go build -o container main.go
dopamine@xGh0st:/mnt/d/container$ sudo ./container run /bin/sh
[sudo] password for dopamine:
Running [/bin/sh] as PID 555354
Running [/bin/sh] as PID 1
/ # ls
bin dev etc home lib media mnt opt proc root run sbin srv sys tmp usr var
/ # ps -a
PID USER TIME COMMAND
/ #
If we run the ps -a command, it prints no processes, because ps relies on a special ‘virtual’ filesystem called /proc to gather its information.
The /proc filesystem is a system-created space that stores and organizes information about the system’s state and the processes running on it. It’s different from a typical filesystem with regular files. When we use ps, it looks into /proc to fetch the data it needs to function.
However, when we created our new isolated container environment and set up a new root filesystem, we didn’t mount a /proc filesystem in it, so the /proc directory in the new root is empty. That’s why ps -a can’t find the necessary information and lists nothing.
To make ps and other similar tools work correctly inside our container, we need to ‘mount’ or set up a /proc filesystem inside the new root filesystem of our container. Let’s dive into how we can do this in the next step.
We will be making the below changes in our code.
func run() {
	// Rest of the code goes here
	cmd.SysProcAttr = &syscall.SysProcAttr{
		// CLONE_NEWNS gives the child its own mount namespace for the /proc mount.
		Cloneflags: syscall.CLONE_NEWUTS | syscall.CLONE_NEWPID | syscall.CLONE_NEWNS,
	}
	// Rest of the code goes here
}

func child() {
	// Rest of the code goes here (chroot and chdir must come before the mount)
	if err := syscall.Mount("proc", "proc", "proc", 0, ""); err != nil {
		fmt.Println("Error mounting proc:", err)
		os.Exit(1)
	}
	// Rest of the code goes here
}
This new piece of code mounts the /proc filesystem in the new root filesystem. The system call tells the kernel to mount a filesystem of type “proc” at the target path “proc”, which resolves to /proc inside the container, assuming the call is placed after the chroot and chdir. For a virtual filesystem like proc, the source argument (also “proc”) is only a label, since there is no backing device. If there’s any error during this process, we print an error message and exit the program.
dopamine@xGh0st:/mnt/d/container$ go build -o container main.go
dopamine@xGh0st:/mnt/d/container$ sudo ./container run /bin/sh
Running [/bin/sh] as PID 642819
Running [/bin/sh] as PID 1
/ # ps -a
PID USER TIME COMMAND
1 root 0:00 /proc/self/exe child /bin/sh
6 root 0:00 /bin/sh
7 root 0:00 ps -a
/ #
So far, we have created a container with its own root filesystem, and we can inspect the processes running inside it.
Now that we are inside the running container, we can try escaping it.
Escaping the container
Container escape refers to a security vulnerability or incident in which a process running within a container breaks out of its isolated environment, gaining access to the underlying host system or other containers. Containers are designed to be isolated from each other and from the host through various mechanisms like namespaces, cgroups, and capabilities, but security flaws, misconfigurations, or other weaknesses can lead to escape.
When container escape occurs, it poses significant security risks, as an escaped process might:
- Gain Host-Level Access: The process could interact with or control system-level resources, potentially leading to unauthorized data access or system compromise.
- Escalate Privileges: An escaped process could acquire root or other elevated privileges on the host, enabling it to make critical system changes.
- Affect Other Containers: The process could access or manipulate resources belonging to other containers, potentially compromising them as well.
In our container, we have a decent level of namespace isolation.
Below you can see the case with syscall.CLONE_NEWPID namespace isolation enabled, where we are unable to see the processes running on the host.
cmd.SysProcAttr = &syscall.SysProcAttr{
	Cloneflags: syscall.CLONE_NEWUTS | syscall.CLONE_NEWPID | syscall.CLONE_NEWNS,
}
Running [/bin/sh] as PID 96246
Running [/bin/sh] as PID 1
/ # ps -a
PID USER TIME COMMAND
1 root 0:00 /proc/self/exe child /bin/sh
6 root 0:00 /bin/sh
7 root 0:00 ps -a
/ #
If we remove the NEWPID namespace isolation, the /proc filesystem mounted in the container describes the host’s PID namespace, so we can easily observe all the processes running on the host.
cmd.SysProcAttr = &syscall.SysProcAttr{
	Cloneflags: syscall.CLONE_NEWUTS | syscall.CLONE_NEWNS,
}
dopamine@xGh0st:/mnt/d/container$ sudo ./container run /bin/sh
[sudo] password for dopamine:
Running [/bin/sh] as PID 96899
Running [/bin/sh] as PID 96909
/ # ps -a
PID USER TIME COMMAND
1 root 7:27 {systemd} /sbin/init
2 root 0:00 {init-systemd(Ub} /init
7 root 0:00 {init} plan9 --control-socket 6 --log-level 4 --server-fd 7 --pipe-fd 9 --log-truncate
39 root 0:34 /lib/systemd/systemd-journald
67 root 0:07 /lib/systemd/systemd-udevd
75 root 0:00 snapfuse /var/lib/snapd/snaps/bare_5.snap /snap/bare/5 -o ro,nodev,allow_other,suid
76 root 0:00 snapfuse /var/lib/snapd/snaps/core18_2812.snap /snap/core18/2812 -o ro,nodev,allow_other,suid
79 root 0:00 snapfuse /var/lib/snapd/snaps/core18_2823.snap /snap/core18/2823 -o ro,nodev,allow_other,suid
83 root 0:00 snapfuse /var/lib/snapd/snaps/core22_1122.snap /snap/core22/1122 -o ro,nodev,allow_other,suid
88 root 0:01 snapfuse /var/lib/snapd/snaps/core22_1380.snap /snap/core22/1380 -o ro,nodev,allow_other,suid