How to Monitor a Server via PSI (Pressure Stall Information) and cgroupv2

We often use the load average to gauge the health of our servers.

There are a few drawbacks to using the load average:

  • The load average is reported over the last 1, 5, and 15 minutes. We may need to see the load over a much shorter period of time (how about 10 seconds?).
  • You can’t tell from the load average alone whether a high value is due to I/O waits. You have to check the server’s other stats (iotop, sysstat, etc.).
  • To interpret the load average you have to do some calculations:
    Load average / Number of online CPU cores. This may not be obvious to new Linux users.
    (You can bring a CPU core online or offline by writing 1 or 0 to /sys/devices/system/cpu/cpuN/online, where N is the CPU number)
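
The calculation in the last bullet can be sketched in a few lines of Python. This is only an illustration: the function names are made up for this post, and it assumes the kernel's usual CPU list format (for example 0-3,5) in /sys/devices/system/cpu/online.

```python
import os


def parse_cpu_list(text):
    """Count the CPUs in a kernel CPU list such as '0-3,5'."""
    count = 0
    for part in text.strip().split(","):
        if not part:
            continue
        if "-" in part:
            first, last = part.split("-")
            count += int(last) - int(first) + 1
        else:
            count += 1
    return count


def load_per_core(load_average, online_cpus):
    """Normalize a load average by the number of online CPU cores."""
    if online_cpus <= 0:
        raise ValueError("online_cpus must be positive")
    return load_average / online_cpus


def current_load_per_core():
    """Read the 1-minute load average and divide it by the online CPUs."""
    with open("/sys/devices/system/cpu/online") as f:
        cpus = parse_cpu_list(f.read())
    one_minute, _, _ = os.getloadavg()
    return load_per_core(one_minute, cpus)
```

A result well above 1.0 from current_load_per_core() suggests that more tasks want to run than there are cores, but it still does not say which resource is the bottleneck.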

Therefore, you will need to look at CPU, memory, and I/O stats whenever you want to delve deeper into your system to understand the problem. The load average alone can’t tell you which of these resources is the bottleneck.

Facebook wanted to monitor the Linux kernel from three separate perspectives: CPU, memory, and I/O. They developed a kernel feature for this: PSI (Pressure Stall Information), which was merged in Linux 4.20.


PSI exposes 3 different files: /proc/pressure/cpu, /proc/pressure/memory, and /proc/pressure/io

I want to share sample outputs from these files:

root@adil:~# cat /proc/pressure/cpu
some avg10=0.03 avg60=0.07 avg300=0.06 total=5376072182
root@adil:~# cat /proc/pressure/memory
some avg10=0.00 avg60=0.00 avg300=0.00 total=1249184
full avg10=0.00 avg60=0.00 avg300=0.00 total=317955
root@adil:~# cat /proc/pressure/io
some avg10=0.08 avg60=0.03 avg300=0.00 total=702350375
full avg10=0.00 avg60=0.00 avg300=0.00 total=539254260

avg10: The percentage of the last 10 seconds during which processes were stalled
avg60: The percentage of the last 60 seconds during which processes were stalled
avg300: The percentage of the last 300 seconds during which processes were stalled
total: The total time processes have been stalled since the server booted

avg10, avg60, and avg300 are percentages. According to the CPU output above, some processes were stalled for 0.03% of the last 10 seconds.

The total stalled time value is in microseconds.
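
To use these numbers in a script, the file contents can be turned into a dictionary. The following is a minimal sketch; parse_psi and read_psi are names invented for this post, and it assumes the key=value layout shown in the sample outputs above.

```python
def parse_psi(text):
    """Parse PSI file contents into {"some": {...}, "full": {...}}."""
    result = {}
    for line in text.strip().splitlines():
        kind, *fields = line.split()
        values = {}
        for field in fields:
            key, _, raw = field.partition("=")
            # avg* fields are percentages; total is a counter in microseconds.
            values[key] = int(raw) if key == "total" else float(raw)
        result[kind] = values
    return result


def read_psi(resource):
    """Read /proc/pressure/<resource>; resource is 'cpu', 'memory', or 'io'."""
    with open(f"/proc/pressure/{resource}") as f:
        return parse_psi(f.read())
```

For example, read_psi("io")["some"]["avg10"] would return the 10-second I/O pressure as a float on a kernel with PSI enabled.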

If a process was starved of the CPU for 5 seconds in the last 10 seconds, avg10 will be 50, which means 50% of the last 10 seconds.
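
Because total is a plain microsecond counter, you can also compute the stall percentage over any window you like by reading it twice. The sketch below assumes the file layout shown earlier; the function names and the 2-second default interval are arbitrary choices for illustration, and the result is a simple difference rather than the running average the kernel reports in the avg fields.

```python
import time


def stall_percent(total_before_us, total_after_us, elapsed_seconds):
    """Share of the elapsed time spent stalled, as a percentage."""
    if elapsed_seconds <= 0:
        raise ValueError("elapsed_seconds must be positive")
    stalled_us = total_after_us - total_before_us
    return 100.0 * stalled_us / (elapsed_seconds * 1_000_000)


def read_total_us(path, kind="some"):
    """Return the total= counter from the 'some' or 'full' line of a PSI file."""
    with open(path) as f:
        for line in f:
            fields = line.split()
            if fields and fields[0] == kind:
                for field in fields[1:]:
                    if field.startswith("total="):
                        return int(field[len("total="):])
    raise ValueError(f"no {kind} total found in {path}")


def sample_pressure(path="/proc/pressure/cpu", kind="some", interval_seconds=2.0):
    """Measure the stall percentage over a custom interval by sampling total twice."""
    started = time.monotonic()
    before = read_total_us(path, kind)
    time.sleep(interval_seconds)
    after = read_total_us(path, kind)
    return stall_percent(before, after, time.monotonic() - started)
```

With the numbers from the example above, stall_percent(0, 5_000_000, 10) gives 50.0: 5 seconds of stall in a 10-second window.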

What is the difference between some and full?

Image Source: https://facebookmicrosites.github.io/psi/docs/overview

If a process is using all the RAM, then other processes will be waiting for memory.

Let’s say there are only 2 tasks on the server: Task A and Task B.
Over a 60-second window, Task B is starved of memory for 30 seconds. During those same 30 seconds, Task A is also starved of memory for 9.996 seconds. This means that no task on the server could make progress for 9.996 seconds.

At least one task (Task B) was stalled for 30 of the 60 seconds. Therefore, avg60’s some value would be 50%.
Task A ran fine for the first 20.004 seconds of Task B’s stall. Then Task A and Task B were both stalled for 9.996 seconds. Therefore, avg60’s full value would be 16.66% (9.996 / 60).

In short, full counts the time during which all the processes were stalled at once, and some counts the time during which at least one process was stalled.
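
The Task A / Task B arithmetic can be checked with a small simulation. This is a sketch under simplifying assumptions, not how the kernel does its accounting: some_and_full is a made-up helper, every task is assumed to be runnable (non-idle) for the whole window, and Task A's stall is assumed to fall at the end of Task B's 30 seconds.

```python
def some_and_full(stalls_by_task, window_seconds):
    """Return (some %, full %) given stall intervals in seconds per task."""
    events = []
    for intervals in stalls_by_task.values():
        for start, end in intervals:
            events.append((start, 1))   # a task starts stalling
            events.append((end, -1))    # a task resumes
    events.sort()

    task_count = len(stalls_by_task)
    some = full = 0.0
    stalled = 0
    previous = 0.0
    for moment, change in events:
        span = moment - previous
        if stalled >= 1:
            some += span            # at least one task was stalled
        if task_count and stalled == task_count:
            full += span            # every task was stalled at once
        stalled += change
        previous = moment
    return 100.0 * some / window_seconds, 100.0 * full / window_seconds


stalls = {
    "Task A": [(20.004, 30.0)],
    "Task B": [(0.0, 30.0)],
}
some_pct, full_pct = some_and_full(stalls, window_seconds=60)
```

Up to floating-point rounding, some_pct comes out as 50 and full_pct as 16.66, matching the values in the example.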

There is no full keyword in the /proc/pressure/cpu file 🤔

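That is because, at the system level, all of the processes can’t be starved of the CPU at the same time: while some tasks wait for the CPU, the CPU is executing another task. (Newer kernels may print a full line in this file as well; the kernel documentation describes it as undefined at the system level and reported as zero.)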

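As noted on Kernel.org, you can also register a trigger to be notified when the pressure exceeds a threshold within a time window. The trigger is written to the pressure file in the form: <some|full> <stall amount in us> <time window in us>

echo 'some 100000 1000000' > /proc/pressure/cpu

This trigger fires when the total some stall time for the CPU reaches 100ms within any 1-second window. Note that a trigger only lives as long as the file descriptor that registered it stays open, so a monitoring program has to write the trigger and then wait on the same descriptor with poll(). The echo above only shows the format, because the shell closes the file right away.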
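
A monitoring program keeps the file open and waits for the kernel to signal the trigger. Here is a minimal Python sketch of that flow, modeled on the C example in the kernel documentation. build_trigger and wait_for_pressure are names invented for this post. The window limits (500ms to 10s) follow the kernel documentation, and registering a trigger may require root privileges depending on the kernel version.

```python
import os
import select


def build_trigger(kind, stall_us, window_us):
    """Build the bytes to write to a pressure file to register a trigger."""
    if kind not in ("some", "full"):
        raise ValueError("kind must be 'some' or 'full'")
    if not 500_000 <= window_us <= 10_000_000:
        raise ValueError("window must be between 500ms and 10s")
    if not 0 < stall_us <= window_us:
        raise ValueError("stall threshold must be positive and fit in the window")
    # The kernel's C example writes the terminating NUL byte as well.
    return f"{kind} {stall_us} {window_us}".encode() + b"\0"


def wait_for_pressure(path, trigger, timeout_ms=None):
    """Register the trigger on path and wait until it fires.

    Returns True if the trigger fired, False if the timeout expired first.
    """
    fd = os.open(path, os.O_RDWR | os.O_NONBLOCK)
    try:
        os.write(fd, trigger)
        poller = select.poll()
        poller.register(fd, select.POLLPRI)
        for _, event in poller.poll(timeout_ms):
            if event & select.POLLERR:
                raise OSError("the pressure file is no longer available")
            if event & select.POLLPRI:
                return True
        return False
    finally:
        # Closing the descriptor also removes the trigger.
        os.close(fd)
```

For example, wait_for_pressure("/proc/pressure/cpu", build_trigger("some", 100000, 1000000), timeout_ms=60000) would wait up to a minute for the same trigger as the echo command above.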

PSI (Pressure Stall Information) and Control Group V2 (cgroupv2)

cgroupv2 has many new features. One of them is the calculation of PSI metrics per control group.

Docker supports cgroupv2 as of Docker Engine 20.10.0, which is really nice: we can analyze the stall time of CPU, memory, and I/O for each container.

Let’s test it:

My Docker version:

root@adil:~# docker --version
Docker version 20.10.5, build 55c4c88

I mounted cgroupv2 in place of cgroupv1:

root@adil:~# mount -t cgroup2 none /sys/fs/cgroup

I started a simple Nginx container:

root@adil:~# docker run -dit --name=nginx nginx
9719184a04057d324408d990ce43e0b040c83f3b69593b36e49ff9d5455cf983

Then I read the PSI statistics of the Nginx container through its container ID:

root@adil:~# cat /sys/fs/cgroup/docker/9719184a04057d324408d990ce43e0b040c83f3b69593b36e49ff9d5455cf983/cpu.pressure
some avg10=0.00 avg60=0.00 avg300=0.00 total=13428
root@adil:~# cat /sys/fs/cgroup/docker/9719184a04057d324408d990ce43e0b040c83f3b69593b36e49ff9d5455cf983/memory.pressure
some avg10=0.02 avg60=0.04 avg300=0.01 total=53078
full avg10=0.02 avg60=0.04 avg300=0.01 total=51294
root@adil:~# cat /sys/fs/cgroup/docker/9719184a04057d324408d990ce43e0b040c83f3b69593b36e49ff9d5455cf983/io.pressure
some avg10=0.08 avg60=0.20 avg300=0.06 total=222159
full avg10=0.07 avg60=0.19 avg300=0.06 total=217995
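
The same lookup can be scripted. The sketch below simply rebuilds the paths used in the commands above. The helper names are invented for this post, and the docker parent directory is an assumption that matches this setup; other cgroup drivers may place containers under a different path.

```python
from pathlib import Path


def container_pressure_files(container_id, cgroup_root="/sys/fs/cgroup", parent="docker"):
    """Map each resource to the pressure file of the given container."""
    base = Path(cgroup_root) / parent / container_id
    return {
        resource: base / f"{resource}.pressure"
        for resource in ("cpu", "memory", "io")
    }


def read_container_pressure(container_id, **kwargs):
    """Return the raw contents of the three pressure files of a container."""
    files = container_pressure_files(container_id, **kwargs)
    return {resource: path.read_text() for resource, path in files.items()}
```

Calling read_container_pressure with the full container ID printed by docker run returns the same text as the three cat commands above.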

Facebook has also developed oomd, a userspace out-of-memory killer that uses PSI and cgroupv2.

systemd-oomd uses PSI and cgroupv2 as well.