The following example uses a Rudder group in CFEngine policy to set different thresholds for a monitored component.


Once upon a time...

I got an LXC host.

On LXC, the host, even with CentOS7, is able to see processes running in containers.

CFEngine by itself tries to be a good citizen and kills everything called cf-execd or cf-agent if it sees there are runaway processes...

On my LXC host it'd see 5 agent processes and kill them because, obviously, something is not right.

You'd see things like this on an agent run:

repaired Common                    Process checking                             
Warning, more than 2 cf-execd processes were detected. They have been sent a graceful termination signal.



Where are my container processes now?

They're in container heaven.

Hey, looksie, I killed those processes that were running in your containers!



What would your cat say?



I had to change something. But for like a year, I didn't know how to fix it.

Changing it in the Rudder policy globally would mean disabling this safety feature everywhere. That's not gonna be good: on overloaded, slow-as-a-snail "cloud hosts", for example, you could have multiple agents running for real. So we need the autokilling.

Updating and running different kernels had no effect either. Rudder meanwhile even learned to detect this issue, but only with OpenVZ, not for generic LXC containers.


Corn casing a corner case

Well, today I finally figured I just need to make an exception for LXC hosts!

I have a group called "LXC-Hosts". This is a static group, where I put the hostnames of any LXC host I got.

This group is visible to CFEngine and as such, we can use it in policy code.


The following is a diff showing how I worked around it:

[root@rudder system]# git diff master~2:techniques/system/common/1.0/ common/1.0/ 
diff --git a/techniques/system/common/1.0/ b/techniques/system/common/1.0/
index 7308a9d..b09f3b1 100644
--- a/techniques/system/common/1.0/
+++ b/techniques/system/common/1.0/
@@ -340,11 +340,17 @@ bundle agent check_cf_processes_running
       # process_term defines how many maximum instances of this
       # binary should be running before attempting to SIGTERM them.
       # process_kill is the same for SIGKILL.
-    !windows::
+    !windows.!group_lxc_host::
       # On windows, cf-execd is a service, and there can be only one instance of it running (by design)
       "process_term[execd]" string => "2";
       "process_kill[execd]" string => "5";

+    group_lxc_host::
+      "process_term[execd]" string => "6";
+      "process_kill[execd]" string => "7";
+      "process_term[agent]" string => "8";
+      "process_kill[agent]" string => "10";
       "process_term[agent]" string => "5";
       "process_kill[agent]" string => "8";


So for the lxc-host group there are now special rules, and I make sure the "NOT WINDOWS" case doesn't just match by mistake.
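
Stripped of the diff markers, the intent reads roughly like the sketch below. This is not the actual Rudder technique file: the bundle name lxc_threshold_sketch and the reports promise are made up for illustration, while the class name group_lxc_host and the numbers come from the diff. In the diff as pasted, the two unchanged agent lines follow the new group_lxc_host:: block directly; the sketch assumes they are meant to stay with the default case.

```cfengine
# Hypothetical standalone bundle, only to show the class layout.
bundle agent lxc_threshold_sketch
{
  vars:
    # Default thresholds: any non-Windows node that is NOT in the LXC host group
    !windows.!group_lxc_host::
      "process_term[execd]" string => "2";
      "process_kill[execd]" string => "5";
      "process_term[agent]" string => "5";
      "process_kill[agent]" string => "8";

    # Relaxed thresholds: the host also sees the agents running in its containers
    group_lxc_host::
      "process_term[execd]" string => "6";
      "process_kill[execd]" string => "7";
      "process_term[agent]" string => "8";
      "process_kill[agent]" string => "10";

  reports:
    # Illustration only: print which thresholds ended up in effect
    !windows::
      "cf-execd thresholds: term=$(process_term[execd]) kill=$(process_kill[execd])";
      "cf-agent thresholds: term=$(process_term[agent]) kill=$(process_kill[agent])";
}
```

The two class expressions are mutually exclusive, so each variable is defined exactly once on any given node. To try it away from a real policy tree, something like `cf-agent -K -f ./sketch.cf -b lxc_threshold_sketch -D group_lxc_host` should print the relaxed values, and the defaults when the -D is dropped (sketch.cf being whatever file you saved it as).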

For the record, we could also detect lxc, do an lxc-ls and process the number of running containers. But I feel pretty confident this change is gonna do just the same thing. And it's darn minimalistic.


The other thing I had to adjust is /opt/rudder/bin/check-rudder-agent !!!