At work, our Jenkins builds started to randomly fail on the slaves with a pretty obscure error:
OCI runtime create failed: container_linux.go:345: starting container process caused “process_linux.go:424: container init caused "join session keyring: create session key: disk quota exceeded"”: unknown
We aggressively use Docker images to build and deploy artifacts as well as to run various scripts. Almost every shell script runs inside a Docker container to make sure the configuration of the “machines” is the same for all runs. This means we have hundreds of containers on the slaves, although most of them are not really running.
One morning, everything started to break, and our deploy jobs were affected, so I sprung into action and took a look.
Disk quota?
At first glance, the error looked like a disk issue, so by instinct, I first checked whether the disk in one of the slaves was full.
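Something along these lines (the exact invocation from that morning is an assumption):

```shell
# Show usage for every mounted filesystem in human-readable units;
# a full disk would show Use% at or near 100%.
df -h
```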
Nothing unusual here: no disk was anywhere near full. Next, I checked the inode usage.
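A filesystem can run out of inodes even when `df -h` looks healthy, so the check would be (again, reconstructing the likely command):

```shell
# Same report, but counting inodes: IUse% near 100% means the
# filesystem has run out of inodes rather than bytes.
df -i
```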
Hmm. Nothing unusual there either. If the disk wasn’t full, there must be a limit configured somewhere, so I took a second look at the error message for more clues.
“join session keyring: create session key: disk quota exceeded”
Keyrings
It looked like some keyring-related resource had reached its limit, so I looked at the keyring limits and how many keys were actually in use.
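On Linux these numbers live under `/proc`; the check looked something like this (the specific files are my reconstruction from keyrings(7)):

```shell
# Per-user key quota usage. Each line reads:
#   <uid>: <usage> <nkeys>/<nikeys> <qnkeys>/<maxkeys> <qnbytes>/<maxbytes>
cat /proc/key-users

# System-wide limit on the number of keys a nonroot user may own
cat /proc/sys/kernel/keys/maxkeys
```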
The limits are large enough, and the keys are also pretty small relative to them. At this point, I got pretty stumped, until I read the keyrings(7) manual and found another interesting config:
/proc/sys/kernel/keys/maxbytes (since Linux 2.6.26)
This is the maximum number of bytes of data that a nonroot user can hold in the payloads of the keys owned by the user. The default value in this file is 20,000.
So I checked it.
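Reading it is a one-liner:

```shell
# Current per-user cap on total key payload size, in bytes
# (the kernel default is 20000)
cat /proc/sys/kernel/keys/maxbytes
```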
Aha. That limit is pretty low, considering we had about 243 keyrings. So I bumped it up to about 50 MB. That turned out to be the actual issue, and I found similar reports after some deep googling. Two things I learned:
- The default limit on the number of keyrings is high, but the total payload size they can use is extremely low
- A unique session key is created for every Linux container, which is why the issue only happened once there were already too many containers on the slaves.
TL;DR / Solution
We mitigated the issue by first pruning the containers using docker system prune -f or docker container prune, then we set the correct limit by editing the config.
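Concretely, something like the following, run as root (the 50 MB value is what worked for us, not a general recommendation):

```shell
# Raise the per-user key payload limit to 50 MB (50 * 1024 * 1024 bytes).
# Takes effect immediately, but does not survive a reboot:
sysctl -w kernel.keys.maxbytes=52428800

# Persist the setting across reboots:
echo 'kernel.keys.maxbytes = 52428800' > /etc/sysctl.d/99-keys.conf
sysctl --system
```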