Allow/Disallow Syscalls via Seccomp

As I said in the previous post, the Linux kernel offers several security mechanisms: SELinux, AppArmor, Seccomp, Tomoyo, Smack, Capabilities, etc.

I’d like to talk about the Seccomp module in this post.

Seccomp stands for secure computing mode.


Hundreds of system calls are available in the Linux kernel. You may want to explicitly disable some of them for a binary executable file.

Seccomp lets you filter syscalls at a fine-grained level. You can define which syscalls are allowed or disallowed for a binary executable file before running it.

Let’s assume you have an application like this:

You will get an output like this when you run it:

root@adil:~# gcc uname.c -l seccomp && ./a.out
What's up?

Let’s re-run the binary file with strace:

root@adil:~# strace -c ./a.out
What's up?
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
  0.00    0.000000           0         1           read
  0.00    0.000000           0         2           write
  0.00    0.000000           0         2           close
  0.00    0.000000           0         3           fstat
  0.00    0.000000           0         7           mmap
  0.00    0.000000           0         4           mprotect
  0.00    0.000000           0         1           munmap
  0.00    0.000000           0         3           brk
  0.00    0.000000           0         6           pread64
  0.00    0.000000           0         1         1 access
  0.00    0.000000           0         1           execve
  0.00    0.000000           0         1           uname
  0.00    0.000000           0         2         1 arch_prctl
  0.00    0.000000           0         2           openat
------ ----------- ----------- --------- --------- ----------------
100.00    0.000000                    36         2 total

Even a simple application makes a lot of syscalls. Let’s assume you want to disable the uname syscall via seccomp (you need to install libseccomp-dev first):

All system calls are allowed except uname:

root@adil:~# gcc uname.c -l seccomp && ./a.out
What's up?
Bad system call (core dumped)

Let’s run the binary file with strace:

root@adil:~# strace ./a.out 2>&1 | tail -3
seccomp(SECCOMP_SET_MODE_FILTER, 0, {len=8, filter=0x55bdf22dbf30}) = 0
uname( <unfinished ...>) = ?
+++ killed by SIGSYS (core dumped) +++

The Linux kernel killed the process because the uname syscall is not allowed.

You can check the audit log (/var/log/audit/audit.log):

type=SECCOMP msg=audit(1613512469.711:351): auid=1000 uid=0 gid=0 ses=11 pid=12427 comm="a.out" exe="/root/a.out" sig=31 arch=c000003e syscall=63 compat=0 ip=0x7f048a06dccb code=0x0

It says: the process was killed with signal 31 (SIGSYS) when it made syscall number 63 (uname on x86-64).

We allowed all of the syscalls and denied one of them.

Seccomp also has a Strict mode. In strict mode, only the read, write, _exit, and sigreturn syscalls are allowed.

Let’s write some data to the disk:

Let’s run it:

root@ip-172-31-43-168:~# gcc file.c -lseccomp && ./a.out
root@ip-172-31-43-168:~# cat /tmp/test.txt

It is killed. However, both strings have already been written to the file.

Why is it killed?

Let’s have a look at the audit:

type=SECCOMP msg=audit(1613509181.394:262): auid=1000 uid=0 gid=0 ses=11 pid=11857 comm="a.out" exe="/root/a.out" sig=9 arch=c000003e syscall=3 compat=0 ip=0x7fcbe5fd04ab code=0x0

It says: the process was killed with signal 9 (SIGKILL) when it made syscall number 3 (close). The close syscall is fatal because it is not on the allow-list of strict mode.

Let’s modify the code:

We closed the file pointer. Then, we opened it again.

root@adil:~# gcc file.c -lseccomp && ./a.out
root@adil:~# cat /tmp/test.txt

It is killed. The latter string can’t be written to the disk.

Why is it killed?

Let’s have a look at the audit:

type=SECCOMP msg=audit(1613509624.996:263): auid=1000 uid=0 gid=0 ses=11 pid=11880 comm="a.out" exe="/root/a.out" sig=9 arch=c000003e syscall=5 compat=0 ip=0x7fd386719689 code=0x0

It says: the process was killed with signal 9 (SIGKILL) when it made syscall number 5 (fstat).

Let’s run the code with strace:

openat(AT_FDCWD, "/tmp/test.txt", O_WRONLY|O_CREAT|O_APPEND, 0666) = 3
lseek(3, 0, SEEK_END) = 8
fstat(3, {st_mode=S_IFREG|0644, st_size=8, ...}) = 0
write(3, "qwe\n", 4) = 4
close(3) = 0
openat(AT_FDCWD, "/tmp/test.txt", O_WRONLY|O_CREAT|O_APPEND, 0666) = 3
lseek(3, 0, SEEK_END) = 12
fstat(3, <unfinished ...>) = ?
+++ killed by SIGKILL +++

The first write completed successfully, and then the file was closed. The code opened the file again, which also issued the lseek syscall. After that, we enabled Seccomp in strict mode. Remember that only the read, write, _exit, and sigreturn syscalls are allowed in strict mode, so the fstat syscall was fatal.

It is confusing

The process was killed on close in the first version of file.c and on fstat in the second version, even though we enabled strict mode before the second fputs call in both versions.

Actually, this has nothing to do with the compiler. The C library calls fstat the first time a stream is written to, because it uses the result to size the stream's buffer. In the first example, we didn’t close the file pointer, so both fputs calls went into the same buffer and were flushed together in a single write syscall.

I executed the first version of file.c with strace:

root@adil:~# strace ./a.out 2>&1 | grep write -B2
fstat(3, {st_mode=S_IFREG|0644, st_size=28, ...}) = 0
write(3, "qwe\nxyz", 7) = 7

That’s why fstat was not fatal in the first version: it ran during the first fputs call, before strict mode was enabled.

Let’s modify the code:

We allowed 4 different syscalls: fstat, write, close and exit_group.

Those 4 syscalls must be allowed to run the code successfully.

Let’s run it:

root@adil:~# gcc file.c -lseccomp && ./a.out
root@adil:~# cat /tmp/test.txt

Some notes:

You can search in the audit log via ausearch.

E.g.: ausearch -sc 5

You can convert your strace output to a Docker profile via syscall2seccomp:

E.g.: strace -o file.strace ./a.out && python3 ./ file.strace > file.seccomp

You can run a binary executable file with only allowed system calls via firejail.

You can find the processes that are using Seccomp via this command:

grep 'Seccomp\|^Name' /proc/*/status -h