The following example uses a Rudder group in CFEngine policy to create different thresholds for a monitored component.
Once upon a time...
I got an LXC host.
With LXC, the host, even on CentOS 7, can see the processes running inside its containers.
CFEngine by itself tries to be a good citizen and kills anything called cf-execd or cf-agent if it thinks there are runaway processes...
On my LXC host it would see 5 agent processes. So it killed them; obviously, something was not right.
You'd see things like this on an agent run:
Where are my container processes now?
They're in container heaven.
Hey, looksie, I killed those processes that were running in your containers!
What would your cat say?
I had to change something. But for about a year, I didn't know how to fix it.
Changing it globally in the Rudder policy would mean disabling this safety feature everywhere. That's not a good idea: on overloaded, slow-as-a-snail "cloud hosts" you really can end up with multiple agents running, so we need the autokilling.
Updating and running different kernels had no effect either. Rudder has meanwhile even learned to detect this issue, but only for OpenVZ, not for generic LXC containers.
Corn casing a corner case
Well, today I finally figured I just need to make an exception for LXC hosts!
I have a group called "LXC-Hosts". This is a static group, where I put the hostname of every LXC host I've got.
This group is visible to CFEngine and as such, we can use it in policy code.
The following is a diff showing how I worked around it:
So for the LXC-Hosts group there are now special rules, and I make sure the "NOT WINDOWS" case doesn't just match by mistake.
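The original diff isn't reproduced here, but the shape of the exception can be sketched in CFEngine policy. This is a hypothetical illustration, not the actual Rudder code: `lxc_hosts` stands in for whatever class name Rudder generates for the "LXC-Hosts" group, and the bundle, body, and threshold names (`agent_count`, the limits "5" and "50") are made up for the example.

```
bundle agent check_cfengine_processes
{
  processes:
    # Hypothetical: on LXC hosts, allow many cf-agent processes,
    # since the host also sees the agents running inside containers.
    lxc_hosts::
      "cf-agent"
        process_count => agent_count("0", "50", "cf_agent_high");

    # Everywhere else (and explicitly not on Windows), keep the
    # strict limit so real runaway agents still get caught.
    !lxc_hosts.!windows::
      "cf-agent"
        process_count => agent_count("0", "5", "cf_agent_high");
}

body process_count agent_count(min, max, class)
{
  match_range         => irange("$(min)", "$(max)");
  out_of_range_define => { "$(class)" };
}
```

The `cf_agent_high` class would then drive whatever kill/restart promise the policy already has; only the threshold differs per group.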
For the record, we could also detect LXC, run lxc-ls and count the number of running containers. But I'm pretty confident this change does just the same thing. And it's darn minimalistic.
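For completeness, a sketch of that alternative approach. The helper name `count_running` is made up for this example; it just counts the lines that `lxc-ls --running` would print (one container name per line), so the sample below feeds it fake output instead of requiring LXC tools.

```shell
#!/bin/sh
# Hypothetical alternative: derive the allowed agent count from the
# number of running containers instead of a static Rudder group.
# count_running reads "lxc-ls --running"-style output on stdin and
# prints how many non-empty lines (containers) it saw.
count_running() {
  grep -c .
}

# On a real LXC host you'd feed it the actual tool output:
#   lxc-ls --running | count_running
# For illustration, sample output from two running containers:
printf 'web1\ndb1\n' | count_running
```

The group-based exception avoids all of this per-run counting, which is why the minimal policy change won out.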
The other thing I had to adjust is /opt/rudder/bin/check-rudder-agent!!!