Control Groups (cgroups) are what make container resource isolation work. Docker, Kubernetes, and systemd all depend on them. If you maintain production infrastructure, you’ve hit these:
- A container gets OOM-killed while the node has free memory (page cache counts toward your container’s limit).
- You set cpu: 1 on a Pod, average CPU looks fine, but P99 latency keeps spiking (time-sliced throttling).
- A host upgrade switches to cgroup v2 and your monitoring scripts fail because the files moved.
This article covers what actually matters when debugging cgroup issues — what V1 did wrong, what V2 fixed, and what still catches people.
First: which version are you on?
stat -fc %T /sys/fs/cgroup
# cgroup2fs = v2
# tmpfs = v1
If it returns tmpfs, check which controllers are mounted:
mount | grep cgroup
On V1 you’ll see multiple mount points — cgroup on /sys/fs/cgroup/memory, cgroup on /sys/fs/cgroup/cpu, etc. On V2 there’s a single cgroup2 mount at /sys/fs/cgroup.
This is the first thing to check when debugging. I’ve burned hours chasing a CPU limit issue only to realize I was reading V2 files on a V1 system, where cpu.max doesn’t exist.
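Since monitoring scripts tend to break on exactly this difference, it can help to branch on the version up front. A minimal sketch wrapping the stat check above; cgroup_version and cgroup_version_from_fstype are hypothetical helper names, and a tmpfs result may also mean systemd’s hybrid layout rather than pure V1:

```sh
# Map a filesystem type (as printed by `stat -fc %T`) to a cgroup version.
# Hypothetical helper; "tmpfs" is treated as V1, which also covers hybrid setups.
cgroup_version_from_fstype() {
    case "$1" in
        cgroup2fs) echo v2 ;;
        tmpfs)     echo v1 ;;
        *)         echo unknown ;;
    esac
}

# Detect the version for a mount point (default: /sys/fs/cgroup).
cgroup_version() {
    root="${1:-/sys/fs/cgroup}"
    # If stat fails (missing path, non-GNU stat), fall through to "unknown".
    fstype=$(stat -fc %T "$root" 2>/dev/null) || fstype=""
    cgroup_version_from_fstype "$fstype"
}

# Usage:
#   case "$(cgroup_version)" in
#       v2) cat /sys/fs/cgroup/demo/cpu.max ;;
#       v1) cat /sys/fs/cgroup/cpu/demo/cpu.cfs_quota_us ;;
#   esac
```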
Why cgroups exist
Before cgroups, Linux had nice and ulimit. nice is a hint, not a hard limit — a CPU-hungry process could still starve other processes. ulimit works per-process, not per-group. If Apache forks 100 children, there’s no way to say “all of you together can use at most 2 CPUs and 4GB of RAM.”
Cgroups let you define a group of processes, attach controllers (CPU, memory, IO, PID), and enforce hard limits on the group as a whole.
V1 architecture: multiple hierarchies
V1 allowed separate trees for each controller:
/sys/fs/cgroup/cpu/demo/ ← cpu limits
/sys/fs/cgroup/memory/demo/ ← memory limits
/sys/fs/cgroup/blkio/demo/ ← IO limits
This looked flexible but caused real problems:
- Coordination: a process could be in cpu/groupA and memory/groupB. If you wanted to limit both together, you had to manually keep them in sync.
- Accounting mismatch: the kernel had to track the same process across multiple trees, adding overhead.
- Thread granularity: V1 let you move individual threads to different cgroups, which sounds useful but mostly created confusion (see below).
The thread vs process trap
Each V1 cgroup directory has two files:
- tasks: move a specific thread (TID)
- cgroup.procs: move the entire process (PID)
If you write a thread ID to tasks, only that thread moves. The rest stay outside. I’ve seen people do this accidentally — write tasks thinking it’s the same as cgroup.procs — and end up with a process where one thread is throttled and the others aren’t.
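/* t.c: one process, three busy threads (main plus two workers). */
#define _GNU_SOURCE
#include <pthread.h>
#include <stdio.h>
#include <unistd.h>
#include <sys/syscall.h>

/* Worker: print this thread's kernel TID, then spin forever. */
static void *spin(void *arg) {
    (void)arg;
    long tid = syscall(SYS_gettid);
    printf("thread tid=%ld\n", tid);
    fflush(stdout);
    while (1) { }
    return NULL;
}

int main(void) {
    printf("main pid=%d\n", getpid());
    fflush(stdout);
    pthread_t t1, t2;
    pthread_create(&t1, NULL, spin, NULL);
    pthread_create(&t2, NULL, spin, NULL);
    while (1) { }  /* the main thread spins too */
    return 0;
}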
Compile with gcc -O2 -pthread t.c -o t && ./t, then check ps -T -p <pid> to see the threads. The difference happens fast — write one TID to tasks and watch only that thread’s CPU usage change in htop.
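To see which threads a write to tasks left behind, you can compare the process’s thread list with the cgroup’s tasks file. A minimal sketch, assuming Linux /proc; list_tids and tids_outside are hypothetical helper names:

```sh
# List the thread IDs (TIDs) of process $1 by reading /proc/<pid>/task.
list_tids() {
    for d in /proc/"$1"/task/*; do
        if [ -d "$d" ]; then
            basename "$d"
        fi
    done
}

# Print the TIDs of process $1 that are NOT listed in the V1 tasks file $2.
tids_outside() {
    list_tids "$1" | while read -r tid; do
        grep -qx "$tid" "$2" || echo "$tid"
    done
}

# Usage (V1 path from the examples above):
#   tids_outside <pid> /sys/fs/cgroup/cpu/demo/tasks
```

If this prints anything after you meant to move the whole process, you wrote a TID to tasks instead of a PID to cgroup.procs.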
CPU throttling: why your P99 spikes
The most common cgroup surprise: “I set 1 CPU, average usage is 50%, why is my service slow?”
The answer is that CFS quota is a budget per time window, not a rate limiter. If cpu.cfs_period_us is 100ms and cpu.cfs_quota_us is 50ms (0.5 CPU), your cgroup gets 50ms of CPU time per 100ms window, summed across all its threads. If it burns through that 50ms in the first 20ms of the window (easy with several threads running in parallel), it’s throttled for the remaining 80ms, even though the “average” over several windows is only 50%.
echo 100000 > /sys/fs/cgroup/cpu/demo/cpu.cfs_period_us
echo 50000 > /sys/fs/cgroup/cpu/demo/cpu.cfs_quota_us
echo $$ > /sys/fs/cgroup/cpu/demo/cgroup.procs
What to check: on V1, cpu.stat has nr_periods, nr_throttled, and throttled_time (in nanoseconds). If nr_throttled is rising, your process is hitting the ceiling. On V2, cpu.stat contains nr_periods, nr_throttled, and throttled_usec (in microseconds).
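A raw nr_throttled count is hard to judge on its own; the share of periods that ended in throttling is usually more telling. A minimal sketch that reads a cpu.stat file; throttle_ratio is a hypothetical helper name, and it relies only on the nr_periods and nr_throttled keys:

```sh
# Print the percentage of enforcement periods in which the group was throttled.
# $1: path to a cpu.stat file.
throttle_ratio() {
    awk '
        $1 == "nr_periods"   { periods = $2 }
        $1 == "nr_throttled" { throttled = $2 }
        END {
            if (periods > 0) printf "%.1f%%\n", 100 * throttled / periods
            else             print "n/a"
        }
    ' "$1"
}

# Usage:
#   throttle_ratio /sys/fs/cgroup/cpu/demo/cpu.stat   # V1
#   throttle_ratio /sys/fs/cgroup/demo/cpu.stat       # V2
```

The counters are cumulative, so for a current picture sample twice and compare the deltas.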
A gotcha I hit with Kubernetes: the default CFS period is 100ms, so a cpu: 0.1 Pod gets 10ms per period. A single syscall-heavy operation that needs 15ms of CPU time consumes the entire quota and then stalls until the next period, even though the process looks mostly idle on average. If your service does periodic batch work (log rotation, GC, connection pool refresh), expect throttling.
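The arithmetic behind that example can be sketched as follows. This is a simplified model, assuming a single-threaded burst that starts at the beginning of a period; quota_us and min_wall_us are hypothetical helper names:

```sh
# Convert a CPU limit in millicores to CFS quota (microseconds per period).
# $1: millicores, $2: period in microseconds (default 100000).
quota_us() {
    millicores=$1
    period_us=${2:-100000}
    echo $(( millicores * period_us / 1000 ))
}

# Best-case wall time (us) to finish a burst needing $1 us of CPU,
# given quota $2 and period $3.
min_wall_us() {
    need=$1
    quota=$2
    period=$3
    # Number of periods whose quota is fully used before the final one.
    full=$(( (need - 1) / quota ))
    echo $(( full * period + (need - full * quota) ))
}

# Usage:
#   quota_us 100                    # cpu: 0.1 -> 10000 us per 100 ms period
#   min_wall_us 15000 10000 100000  # 15 ms of work -> 105000 us of wall time
```

So 15ms of CPU work takes at least 105ms of wall time under a 10ms quota, which is where the P99 spike comes from.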
For latency-sensitive workloads, consider pinning with cpuset instead of quota:
echo 0 > /sys/fs/cgroup/cpuset/demo/cpuset.mems
echo 2-3 > /sys/fs/cgroup/cpuset/demo/cpuset.cpus
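cpuset avoids quota throttling by restricting the group to specific cores; no CFS budget is involved. The cores are only truly dedicated if you also keep other workloads off them, and that is the tradeoff: reserved cores can’t be used by other processes, so node utilization drops.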
Memory limits and OOM
V1’s memory.limit_in_bytes is a hard wall. When the group hits it, the kernel first tries to reclaim pages charged to the group; if that can’t free enough, it OOM-kills a process with no warning and no gradual slowdown. The kernel picks a victim from the cgroup, which isn’t always the process that caused the overflow (it usually picks the largest memory consumer in the group, adjusted by oom_score_adj).
echo 200M > /sys/fs/cgroup/memory/demo/memory.limit_in_bytes
echo $$ > /sys/fs/cgroup/memory/demo/cgroup.procs
./memhog 500
Check memory.failcnt to see how many times the limit was hit. Recent OOM events show up in dmesg.
V2 improvement — memory.high: adds a soft threshold. When a cgroup exceeds memory.high, the kernel applies reclaim pressure (throttling allocations, swapping). This gives you time to react before the hard memory.max kills the process. In production, set memory.high at 80% of memory.max and alert on reclaim activity.
Common gotcha: page cache counts toward the cgroup memory limit. Your application might be using 200MB RSS, but the page cache for files it read could push it over the limit. This is a frequent cause of “mystery OOM” in containerized workloads. If your service reads large files at startup, the cache pages stay until pressure evicts them — and they count against your limit.
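To check how much of a cgroup’s charge is page cache rather than application memory, memory.stat breaks it down. A minimal sketch, assuming the V2 keys anon and file (V1 uses rss and cache for roughly the same split); mem_breakdown is a hypothetical helper name:

```sh
# Print anonymous memory vs page cache for a cgroup, in MiB.
# $1: path to a memory.stat file (V1 or V2).
mem_breakdown() {
    awk '
        $1 == "anon" || $1 == "rss"   { anon = $2 }   # application memory
        $1 == "file" || $1 == "cache" { file = $2 }   # page cache
        END {
            printf "anon_mib=%d file_mib=%d\n", anon / 1048576, file / 1048576
        }
    ' "$1"
}

# Usage:
#   mem_breakdown /sys/fs/cgroup/demo/memory.stat          # V2
#   mem_breakdown /sys/fs/cgroup/memory/demo/memory.stat   # V1
```

A large file_mib next to a modest anon_mib suggests the group is near its limit because of cached file data, not because the application itself grew.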
IO and PID controllers
blkio limits disk IO by device major:minor:
echo "8:0 10485760" > /sys/fs/cgroup/blkio/demo/blkio.throttle.write_bps_device
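This caps writes to /dev/sda (device 8:0) at 10MB/s. The gotcha: on V1 the throttle only reliably applies to direct and synchronous writes. Buffered writes land in the page cache first and are flushed later by kernel writeback, which V1 does not attribute to your cgroup, so a process that dirties pages quickly can blow past the limit. V2 can throttle buffered writeback when the memory and io controllers are both enabled and the filesystem supports it.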
pids limits the number of tasks (processes and threads) in the group. Hit pids.max and fork() returns EAGAIN (“Resource temporarily unavailable”). This isn’t as useful as it sounds, since most fork bombs are caught by memory limits first, but it can prevent PID exhaustion in container runtimes.
V2: unified hierarchy
V2 fixed V1’s design problems with three rules:
- One tree — all controllers in a single hierarchy.
- One placement — a process exists at exactly one point in the tree.
- Leaf-node rule — processes live only in leaf cgroups. Internal nodes configure controllers and delegate resources to children.
mount -t cgroup2 none /sys/fs/cgroup
echo "+cpu +memory" > /sys/fs/cgroup/cgroup.subtree_control
mkdir /sys/fs/cgroup/demo
echo "100000 100000" > /sys/fs/cgroup/demo/cpu.max
echo "300M" > /sys/fs/cgroup/demo/memory.max
echo $$ > /sys/fs/cgroup/demo/cgroup.procs
Leaf-node gotcha: try to put a process in a non-leaf cgroup and the kernel rejects it. This catches everyone at least once. The error is something like “Device or resource busy” — not exactly obvious.
Key V2 file changes:
- cpu.cfs_* → cpu.max (format: quota period, e.g., 50000 100000; the quota can be the literal max for no limit)
- memory.limit_in_bytes → memory.max
- memory.soft_limit_in_bytes → memory.high (different semantics: V2’s reclaim is more aggressive)
- blkio.throttle.* → io.max
- Controller enable/disable → cgroup.subtree_control
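When porting scripts, the CPU mapping is the one that most often needs actual code, because V1’s two files collapse into one V2 line and V1’s “unlimited” value (-1) becomes the word max. A minimal sketch; to_cpu_max is a hypothetical helper name:

```sh
# Build a V2 cpu.max line from V1 values.
# $1: cpu.cfs_quota_us (negative means unlimited), $2: cpu.cfs_period_us.
to_cpu_max() {
    quota=$1
    period=$2
    if [ "$quota" -lt 0 ]; then
        echo "max $period"
    else
        echo "$quota $period"
    fi
}

# Usage:
#   to_cpu_max 50000 100000 > /sys/fs/cgroup/demo/cpu.max
```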
PSI (Pressure Stall Information)
V2 added PSI files: memory.pressure, cpu.pressure, io.pressure. These show time lost to resource pressure as a percentage:
some avg10=2.34 avg60=1.87 avg300=0.98 total=492817349
full avg10=1.12 avg60=0.89 avg300=0.45 total=238475281
some means at least one task was stalled. full means all non-idle tasks were stalled at the same time. The avg10/avg60/avg300 fields are decaying averages over 10s, 60s, and 300s windows.
PSI is the first signal that catches problems before they become OOM or throttling events. If memory.pressure shows consistent some avg10 > 5, you’re reclaiming too much — raise your memory.high.
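A check like “some avg10 > 5” is easy to script. A minimal sketch that parses the format shown above; psi_value and psi_above are hypothetical helper names, and the threshold of 5 is just the rule of thumb from this section:

```sh
# Extract one field from a PSI file.
# $1: file, $2: "some" or "full", $3: field name (avg10, avg60, avg300, total).
psi_value() {
    awk -v kind="$2" -v field="$3" '
        $1 == kind {
            for (i = 2; i <= NF; i++) {
                split($i, kv, "=")
                if (kv[1] == field) print kv[2]
            }
        }
    ' "$1"
}

# Succeed (exit 0) if value $1 is greater than threshold $2.
psi_above() {
    awk -v v="$1" -v t="$2" 'BEGIN { exit !(v + 0 > t + 0) }'
}

# Usage:
#   v=$(psi_value /sys/fs/cgroup/demo/memory.pressure some avg10)
#   psi_above "$v" 5 && echo "memory pressure high: $v"
```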
How the ecosystem uses cgroups
You rarely touch cgroup files directly in production. The interaction happens through:
- systemd: manages cgroups via service units. systemd-run --scope -p MemoryMax=200M -p CPUQuota=50% bash creates a transient scope with limits. systemd’s cgroup management does not always play well with manual cgroup manipulation: I’ve seen systemd reset cgroup limits when a unit restarts, overriding changes made by a monitoring script.
- Docker / Kubernetes: Pod resource limits and requests translate to cgroup settings on the kubelet node. If you SSH into a node and look at /sys/fs/cgroup/kubepods/, you can trace a Pod’s PID to its cgroup. This is useful when a Pod reports OOMKilled and you want to verify the actual cgroup state; sometimes the kubelet’s view is stale.
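Tracing a PID to its cgroup starts from /proc/<pid>/cgroup, where each line has the form hierarchy-ID:controller-list:path (V2 uses a single line starting with 0::). A minimal sketch of the lookup; cgroup_path is a hypothetical helper name:

```sh
# Print the cgroup path for one controller from a /proc/<pid>/cgroup file.
# $1: file; $2: V1 controller name (e.g. memory), or empty for the V2 entry.
cgroup_path() {
    awk -F: -v ctrl="$2" '
        {
            n = split($2, names, ",")
            hit = (ctrl == "" && $2 == "")
            for (i = 1; i <= n; i++) if (names[i] == ctrl) hit = 1
            if (hit) {
                # The path itself may contain ":", so rejoin fields 3..NF.
                path = $3
                for (i = 4; i <= NF; i++) path = path ":" $i
                print path
                exit
            }
        }
    ' "$1"
}

# Usage:
#   cgroup_path /proc/<pid>/cgroup memory   # V1: files under /sys/fs/cgroup/memory<path>
#   cgroup_path /proc/<pid>/cgroup ""       # V2: files under /sys/fs/cgroup<path>
```

Run this on the host, not inside the container, so the path is not relative to the container’s cgroup namespace.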
Practical debugging
- Check memory.events and cpu.stat before anywhere else. These are per-cgroup counters that tell you what actually happened, not what you think should have happened. (memory.events is a V2 file; on V1, start with memory.failcnt.)
- PSI files catch pressure before limits are hit. Poll memory.pressure alongside your usual metrics.
- cgroup V1 gotcha: non-root cgroup namespaces can give confusing /proc/<pid>/cgroup output. If you’re inside a container, the paths are relative to the container’s cgroup namespace, not the host’s.
- Limit leaks: on V2, a child with a higher limit than its parent is still constrained by the parent, because limits are enforced down the single hierarchy. On V1, a process can sit in one group in the cpu hierarchy and an unrelated group in the memory hierarchy, so the constraints don’t compose the way you expect.
- Don’t set limits so tight that normal bursts trigger throttling. A good starting point: memory.high at 80% of memory.max, CPU quota 10-20% above expected peak. Monitor nr_throttled and memory.events and adjust from there. Cgroup limits contain damage but don’t fix application-level bugs.
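For the memory.events check, the counters worth alerting on are high, max, oom, and oom_kill. A minimal sketch for pulling one of them out of the file; event_count is a hypothetical helper name:

```sh
# Print the value of one counter from a V2 memory.events file (0 if absent).
# $1: path to memory.events, $2: key (low, high, max, oom, oom_kill).
event_count() {
    awk -v key="$2" '
        $1 == key { print $2; found = 1 }
        END       { if (!found) print 0 }
    ' "$1"
}

# Usage:
#   event_count /sys/fs/cgroup/demo/memory.events oom_kill
```

The values are cumulative since the cgroup was created, so alert on increases rather than on absolute numbers.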