Allow/Disallow Syscalls via Seccomp

As I said in the previous post, there are a couple of different security modules in the Linux Kernel: SELinux, AppArmor, Seccomp, Tomoyo, Smack, Capabilities etc.

I’d like to talk about the Seccomp module in this post.

Seccomp stands for secure computing mode.


Hundreds of system calls are available in the Linux Kernel. You may want to explicitly disable some of them for a given executable.

Seccomp lets you filter syscalls at a fine-grained level. You can set which syscalls are allowed or disallowed for an executable before running it.

Let’s assume you have an application like this:

You will get an output like this when you run it:

Let’s re-run the binary file with strace:

Even a simple application makes a lot of syscalls. Let’s assume you want to disable the uname syscall via seccomp (you will need to install libseccomp-dev):

All system calls allowed except uname:

Let’s run the binary file with strace:

The Linux Kernel has killed the process. The uname syscall is not allowed.

You can check the audit log (/var/log/audit/audit.log):

It says: syscall number 63 (uname, on x86_64) was killed with signal 31 (SIGSYS).

We allowed all of the syscalls and denied one of them.

Deny everything, allow some of them

Seccomp also has a strict mode. In strict mode, only the read, write, _exit and sigreturn syscalls are allowed.
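Strict mode needs no library at all; it can be switched on with a single prctl call. The sketch below (strict_mode_child is a made-up name) forks a child, enters strict mode there, and optionally makes a syscall that is not on the list. One detail worth knowing: glibc's _exit() wrapper uses exit_group, which strict mode does not allow, so the child ends itself with the raw exit syscall instead.

```c
#include <linux/seccomp.h>
#include <signal.h>
#include <sys/prctl.h>
#include <sys/syscall.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: returns the wait status of a child that enters
   strict mode and, if call_forbidden is non-zero, calls getpid afterwards. */
int strict_mode_child(int call_forbidden)
{
    static const char msg[] = "still alive\n";

    pid_t pid = fork();
    if (pid < 0)
        return -1;
    if (pid == 0) {
        if (prctl(PR_SET_SECCOMP, SECCOMP_MODE_STRICT) != 0)
            syscall(SYS_exit, 2);              /* strict mode unavailable */
        ssize_t n = write(STDOUT_FILENO, msg, sizeof msg - 1); /* allowed */
        (void)n;
        if (call_forbidden)
            syscall(SYS_getpid);               /* not allowed: SIGKILL */
        syscall(SYS_exit, 0);                  /* raw exit, not exit_group */
    }
    int status = 0;
    if (waitpid(pid, &status, 0) < 0)
        return -1;
    return status;
}
```

With call_forbidden set to 0 the child should exit normally; with 1 the parent should see it terminated by SIGKILL.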

Let’s write some data to the disk:

Let’s run it:

It is killed. However, both strings have already been written to the file.

Why is it killed?

Let’s have a look at the audit:

It says: syscall number 3 (close) was killed with signal 9 (SIGKILL). The close syscall was killed because it is not whitelisted in strict mode.

Let’s modify the code:

We closed the file pointer. Then, we opened it again.

It is killed. The latter string can’t be written to the disk.

Why is it killed?

Let’s have a look at the audit:

It says: syscall number 5 (fstat) was killed with signal 9 (SIGKILL).

Let’s run the code with strace:

The first write operation completed successfully, then the file pointer was closed. The code opened the file again and executed the lseek syscall. After that, we enabled Seccomp in strict mode. Remember that only the read, write, _exit and sigreturn syscalls are allowed in strict mode, so the fstat syscall can’t be executed.

It is confusing

The close syscall was killed in the first version of file.c. The fstat syscall was killed in the second version of file.c. However, we enabled strict mode before the second fputs call in both versions.

Actually, it is the C library (stdio), not the compiler, that calls fstat: it does so the first time it writes to a newly opened stream, while setting up the stream's buffer. In the first example, we didn’t close the file pointer, so the buffer already existed. Both fputs calls went into that buffer and were flushed together in a single write syscall.

I executed the first version of file.c with strace:

That’s why the fstat syscall didn’t get killed in the first version: it was executed for the first fputs call, before we enabled strict mode.
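The merging of the two fputs calls comes from stdio buffering, and it can be observed without seccomp at all. In this small sketch (size_before_flush is a made-up name, and the caller supplies a scratch file path) two strings are written with fputs, and the file size on disk is checked before and after an explicit fflush.

```c
#include <stdio.h>
#include <sys/stat.h>

/* Writes two strings with fputs, then returns the on-disk size of the
   file before any flush. The size after fflush is stored in *after_flush.
   Returns -1 on error. */
long size_before_flush(const char *path, long *after_flush)
{
    struct stat st;
    FILE *fp = fopen(path, "w");
    if (fp == NULL)
        return -1;

    fputs("first string\n", fp);     /* 13 bytes, kept in the stdio buffer */
    fputs("second string\n", fp);    /* 14 bytes, same buffer */

    if (stat(path, &st) != 0) { fclose(fp); return -1; }
    long before = (long)st.st_size;  /* nothing has reached the file yet */

    fflush(fp);                      /* one write syscall for both strings */
    if (stat(path, &st) != 0) { fclose(fp); return -1; }
    *after_flush = (long)st.st_size;

    fclose(fp);
    return before;
}
```

For a regular file the stream is fully buffered, so the size before the flush should be 0 and the size after it should be 27 bytes, the two strings together.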

Let’s modify the code:

We allowed 4 different syscalls: fstat, write, close and exit_group.

Those 4 syscalls must be allowed to run the code successfully.
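For comparison, the same allow-list idea can be sketched without libseccomp, using the kernel's prctl interface and a small hand-written BPF program. This is a swapped-in technique rather than the libseccomp approach used in this post. The names ALLOW_SYSCALL, install_allow_list and run_with_allow_list are made up for this example, and the filter skips the architecture check a production filter should have, so treat it as an x86_64-only illustration.

```c
#include <stddef.h>
#include <signal.h>
#include <linux/filter.h>
#include <linux/seccomp.h>
#include <sys/prctl.h>
#include <sys/syscall.h>
#include <sys/wait.h>
#include <unistd.h>

/* Two BPF instructions: if the loaded syscall number equals nr, allow it;
   otherwise skip the ALLOW and fall through to the next check. */
#define ALLOW_SYSCALL(nr) \
    BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, (nr), 0, 1), \
    BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW)

/* Deny by default; allow only fstat, write, close and exit_group. */
int install_allow_list(void)
{
    struct sock_filter filter[] = {
        /* Load the syscall number (no architecture check in this sketch). */
        BPF_STMT(BPF_LD | BPF_W | BPF_ABS, offsetof(struct seccomp_data, nr)),
        ALLOW_SYSCALL(SYS_fstat),
        ALLOW_SYSCALL(SYS_write),
        ALLOW_SYSCALL(SYS_close),
        ALLOW_SYSCALL(SYS_exit_group),
        BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_KILL),
    };
    struct sock_fprog prog = {
        .len = (unsigned short)(sizeof filter / sizeof filter[0]),
        .filter = filter,
    };
    /* Unprivileged processes must set no_new_privs before loading a filter. */
    if (prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0) != 0)
        return -1;
    return prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, &prog);
}

/* Hypothetical helper: run allowed work in a filtered child, optionally
   followed by a syscall that is not on the list; returns the wait status. */
int run_with_allow_list(int call_forbidden)
{
    pid_t pid = fork();
    if (pid < 0)
        return -1;
    if (pid == 0) {
        if (install_allow_list() != 0)
            _exit(2);                       /* filter could not be loaded */
        ssize_t n = write(STDOUT_FILENO, "allowed\n", 8);
        (void)n;
        close(STDIN_FILENO);
        if (call_forbidden)
            syscall(SYS_getpid);            /* not on the list: SIGSYS */
        _exit(0);                           /* glibc uses exit_group here */
    }
    int status = 0;
    if (waitpid(pid, &status, 0) < 0)
        return -1;
    return status;
}
```

A child that sticks to the four listed syscalls should exit with status 0, while one that calls getpid should be reported as killed by SIGSYS.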

Let’s run it:

Some notes:

You can search in the audit log via ausearch.

E.g.: ausearch -sc 5

You can convert your strace output to a Docker seccomp profile via syscall2seccomp:

E.g.: strace -o file.strace ./a.out && python3 ./syscall2seccomp.py file.strace > file.seccomp

You can run a binary executable file with only allowed system calls via firejail.

You can find the processes that are using Seccomp via this command:

grep 'Seccomp\|^Name' /proc/*/status -h
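The same field can be read from inside a program. Below is a minimal sketch (seccomp_mode_of_self is a made-up name) that parses the Seccomp line of /proc/self/status. The kernel reports 0 when seccomp is disabled, 1 for strict mode and 2 for filter mode.

```c
#include <stdio.h>

/* Returns 0 (disabled), 1 (strict) or 2 (filter) for the calling process,
   or -1 if /proc/self/status could not be read or has no Seccomp line. */
int seccomp_mode_of_self(void)
{
    char line[256];
    int mode = -1;

    FILE *fp = fopen("/proc/self/status", "r");
    if (fp == NULL)
        return -1;
    while (fgets(line, sizeof line, fp) != NULL) {
        /* Matches "Seccomp:<whitespace><number>" but not "Seccomp_filters:". */
        if (sscanf(line, "Seccomp: %d", &mode) == 1)
            break;
    }
    fclose(fp);
    return mode;
}
```

Calling it before and after loading a filter is a quick way to confirm that the filter is really in place.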
