How to Monitor a Server with PSI (Pressure Stall Information) and cgroupv2

We’ve long relied on the load average to judge the health of our servers.

There are several drawbacks associated with using the load average:

  • The shortest window the load average covers is the last 1 minute (the other two are 5 and 15 minutes). We may need to see the load in a much shorter window (how about 10 seconds?).
  • You can’t tell whether your load average is high because of I/O waits. You have to look at other statistics of your server (iotop, sysstat, etc.).
  • You have to do some calculation to interpret the load average:
    Load average / Number of enabled CPU cores. It can be confusing for new users of Linux.
    (You can enable/disable a CPU core through this path: /sys/devices/system/cpu/cpu_number/online)
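
The division in the last bullet can be sketched in a few lines of Python. This is only an illustration: the function names are made up for this post, and it assumes the kernel exposes the list of online CPUs in /sys/devices/system/cpu/online in the usual range format (for example 0-3,5).

```python
import os

def count_cpus(cpu_list):
    """Count the CPUs in a kernel CPU list such as '0-3,5'."""
    count = 0
    for part in cpu_list.strip().split(","):
        if not part:
            continue
        if "-" in part:
            first, last = part.split("-")
            count += int(last) - int(first) + 1
        else:
            count += 1
    return count

def load_per_core(load, cores):
    """Normalize a load average by the number of online cores."""
    return load / cores

def current_load_per_core(online_path="/sys/devices/system/cpu/online"):
    """Read the online CPU list and return the 1-minute load per core."""
    with open(online_path) as f:
        cores = count_cpus(f.read())
    one_minute = os.getloadavg()[0]
    return load_per_core(one_minute, cores)
```

A result around 1.0 or higher from current_load_per_core() suggests that, on average, every online core had at least one task running or waiting over the last minute.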

So, you will need to look at CPU, memory, and I/O statistics separately when you want to dig into your system to understand an issue. The load average alone can’t tell you which of these resources is the bottleneck.

Facebook decided to track CPU, memory, and I/O contention from a different perspective inside the Linux kernel. They developed a kernel feature for it: PSI (Pressure Stall Information), available since Linux 4.20.


PSI exposes 3 different files: /proc/pressure/cpu, /proc/pressure/memory, and /proc/pressure/io

I’d like to share sample outputs from those files:

root@adil:~# cat /proc/pressure/cpu
some avg10=0.03 avg60=0.07 avg300=0.06 total=5376072182
root@adil:~# cat /proc/pressure/memory
some avg10=0.00 avg60=0.00 avg300=0.00 total=1249184
full avg10=0.00 avg60=0.00 avg300=0.00 total=317955
root@adil:~# cat /proc/pressure/io
some avg10=0.08 avg60=0.03 avg300=0.00 total=702350375
full avg10=0.00 avg60=0.00 avg300=0.00 total=539254260

avg10: The share of the last 10 seconds during which processes were stalled
avg60: The share of the last 60 seconds during which processes were stalled
avg300: The share of the last 300 seconds during which processes were stalled
total: How long processes have been stalled in total since the server booted

avg10, avg60, and avg300 are percentages. According to the CPU output above, some processes were stalled for 0.03% of the last 10 seconds.

The total stalled time is in microseconds.

If a process was starved of the CPU for 5 seconds in the last 10 seconds, then the Avg10 column will be 50, which means 50% of the last 10 seconds.
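
To use these numbers in a script, the lines can be split into fields. Here is a minimal Python sketch; parse_psi and read_psi are names invented for this post, not part of any library.

```python
def parse_psi(text):
    """Parse PSI output into {'some': {...}, 'full': {...}}."""
    result = {}
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue
        kind, pairs = fields[0], fields[1:]
        values = {}
        for pair in pairs:
            key, _, raw = pair.partition("=")
            # total is a microsecond counter; the avg* fields are percentages
            values[key] = int(raw) if key == "total" else float(raw)
        result[kind] = values
    return result

def read_psi(resource):
    """Read /proc/pressure/<resource>, where resource is cpu, memory, or io."""
    with open(f"/proc/pressure/{resource}") as f:
        return parse_psi(f.read())
```

For example, read_psi("io")["some"]["avg10"] would give the 10-second I/O pressure as a float, and dividing a total value by 1,000,000 converts it to seconds.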

What is the difference between some and full?

Image Source: https://facebookmicrosites.github.io/psi/docs/overview

If one process is using all the RAM, the other processes have to wait until enough memory is available before they can run.

Let’s assume that there are only 2 tasks on the server, Task A and Task B, and look at a 60-second window.
Task B is starved of memory for 30 seconds. During the last 9.996 seconds of that period, Task A is starved of memory as well. It means no task could make progress, so the server was unresponsive for 9.996 seconds.

At least one task (Task B) was stalled for 30 of the 60 seconds. Therefore, the avg60 some value would be 50%.
Task A worked fine for the first 20.004 seconds of Task B’s stall. Then Task A and Task B were both stalled for 9.996 seconds. Therefore, the avg60 full value would be 9.996 / 60 = 16.66%.

Briefly, full counts the time during which all the processes were stalled at the same time, and some counts the time during which at least one process was stalled.
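
The arithmetic of the Task A / Task B example can be checked with a short Python sketch. It is a simplification: it treats some and full as plain ratios over a 60-second window for exactly two tasks, whereas the kernel reports running averages, and the interval positions are assumed for illustration.

```python
def length(interval):
    """Duration of a (start, end) interval in seconds."""
    start, end = interval
    return max(0.0, end - start)

def overlap(a, b):
    """Duration during which both intervals are active."""
    return max(0.0, min(a[1], b[1]) - max(a[0], b[0]))

def some_full_percent(stall_a, stall_b, window):
    """Return (some %, full %) for two tasks over a window in seconds."""
    both = overlap(stall_a, stall_b)                     # all tasks stalled
    either = length(stall_a) + length(stall_b) - both    # at least one stalled
    return 100.0 * either / window, 100.0 * both / window

task_b = (0.0, 30.0)      # Task B stalled for 30 seconds
task_a = (20.004, 30.0)   # Task A stalled for the last 9.996 of them
some_pct, full_pct = some_full_percent(task_a, task_b, 60.0)
print(f"some={some_pct:.2f}% full={full_pct:.2f}%")
```

The printed values should match the 50% and 16.66% figures from the example.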

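There is no full line in the /proc/pressure/cpu file 🤔

That is because all of the processes can’t be starved of the CPU at the same time: while some processes are waiting for the CPU, another process is running on it. (Since Linux 5.13 the system-wide file does print a full line, but the kernel documentation describes it as undefined at the system level and it is reported as zeros.)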

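As pointed out in the Kernel.org documentation, you can also register a trigger on a pressure file so that you are notified when the stall time crosses a threshold within a time window.

echo 'some 100000 1000000' > /proc/pressure/cpu

This trigger fires when the total some stall time reaches 100ms (100000 microseconds) within a 1 second (1000000 microseconds) window. Keep in mind that a trigger only lives as long as the file descriptor it was written to stays open, so the echo above only shows the format. A real monitor keeps the file open and waits for notifications with poll().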
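
The kernel documentation describes this mechanism as a trigger: a program writes the threshold to the pressure file, keeps the file descriptor open, and waits with poll() for a POLLPRI event. Below is a minimal Python sketch of that flow. The function names are made up for this post, and it assumes a Linux kernel with PSI enabled and enough privileges to register a trigger (typically root).

```python
import os
import select

def format_trigger(kind, stall_us, window_us):
    """Build a trigger string such as 'some 100000 1000000'."""
    return f"{kind} {stall_us} {window_us}"

def wait_for_pressure(path, trigger, timeout_ms):
    """Register a trigger on a pressure file and wait for it to fire.

    Returns True if the threshold was crossed before the timeout.
    """
    fd = os.open(path, os.O_RDWR | os.O_NONBLOCK)
    try:
        # The kernel example writes the string with its terminating NUL byte.
        os.write(fd, trigger.encode() + b"\0")
        poller = select.poll()
        poller.register(fd, select.POLLPRI)
        events = poller.poll(timeout_ms)
        return any(event & select.POLLPRI for _, event in events)
    finally:
        # Closing the descriptor removes the trigger again.
        os.close(fd)
```

For example, wait_for_pressure("/proc/pressure/cpu", format_trigger("some", 100000, 1000000), 10000) would wait up to 10 seconds for 100ms of CPU stall within a 1 second window.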

PSI (Pressure Stall Information) and Control Group V2 (cgroupv2)

cgroupv2 has a lot of new features. One of them is the calculation of PSI metrics per control group.

Docker supports cgroupv2 as of Docker Engine 20.10.0. It is really nice: we are able to analyze the CPU, memory, and I/O stall time of each container.

Let’s test it:

My Docker version:

root@adil:~# docker --version
Docker version 20.10.5, build 55c4c88

Mount cgroupv2 in place of cgroupv1:

root@adil:~# mount -t cgroup2 none /sys/fs/cgroup

Start a simple Nginx container:

root@adil:~# docker run -dit --name=nginx nginx
9719184a04057d324408d990ce43e0b040c83f3b69593b36e49ff9d5455cf983

Read PSI statistics of the Nginx container through its container ID:

root@adil:~# cat /sys/fs/cgroup/docker/9719184a04057d324408d990ce43e0b040c83f3b69593b36e49ff9d5455cf983/cpu.pressure
some avg10=0.00 avg60=0.00 avg300=0.00 total=13428
root@adil:~# cat /sys/fs/cgroup/docker/9719184a04057d324408d990ce43e0b040c83f3b69593b36e49ff9d5455cf983/memory.pressure
some avg10=0.02 avg60=0.04 avg300=0.01 total=53078
full avg10=0.02 avg60=0.04 avg300=0.01 total=51294
root@adil:~# cat /sys/fs/cgroup/docker/9719184a04057d324408d990ce43e0b040c83f3b69593b36e49ff9d5455cf983/io.pressure
some avg10=0.08 avg60=0.20 avg300=0.06 total=222159
full avg10=0.07 avg60=0.19 avg300=0.06 total=217995
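
The same lookup can be scripted. The Python sketch below builds the pressure file paths for a container; the helper names are invented for this post, and the /sys/fs/cgroup/docker/<container ID> layout is the one seen on this machine. Other setups (for example, Docker with the systemd cgroup driver) place containers under a different path, so treat the default root as an assumption.

```python
from pathlib import Path

RESOURCES = ("cpu", "memory", "io")

def container_pressure_path(container_id, resource, root="/sys/fs/cgroup/docker"):
    """Path of the <resource>.pressure file for one container."""
    return Path(root) / container_id / f"{resource}.pressure"

def read_container_pressure(container_id, root="/sys/fs/cgroup/docker"):
    """Return the raw text of cpu, memory, and io pressure for a container."""
    return {
        resource: container_pressure_path(container_id, resource, root).read_text()
        for resource in RESOURCES
    }
```

The full container ID that these helpers expect can be obtained with docker inspect --format '{{.Id}}' nginx.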

Facebook has also developed a tool on top of this: oomd, a userspace out-of-memory killer that uses PSI and cgroupv2.

systemd-oomd uses PSI and cgroupv2 as well.