Bug #4498
Several issues with process management on Proxmox host (and container)
Status: closed
Description
I had my concerns regarding OpenVZ process management in CFEngine, but after learning that Rudder ships a special add-on patch for it, I decided to give it a try and deployed Rudder on several OpenVZ/Proxmox hosts and containers.
The first sign of trouble was two emails I received at 5:05 am today and yesterday from one of the containers:
WARNING: No disable file detected and no CFEngine process neither. Relaunching CFEngine processes... Done
I more or less ignored the first one, but having received the second at exactly the same time, I went to investigate. Initially I thought it came from cron, but it turned out to be more complicated than that. The cron check is executed every 5 minutes, so why only at 5:05? Looking at the outputs directory both on the container and the host, I also noticed size differences (the two directories are practically identical, so only one is shown here):
# ls -l *
...
-rw------- 1 root root 3578 Feb 21 04:46 cf_hostname__1392986764_Fri_Feb_21_04_46_04_2014_0x7fd054c69700
-rw------- 1 root root 3578 Feb 21 04:50 cf_hostname__1392987029_Fri_Feb_21_04_50_29_2014_0x7fd054c69700
-rw------- 1 root root 3578 Feb 21 04:55 cf_hostname__1392987354_Fri_Feb_21_04_55_54_2014_0x7fd054c69700
-rw------- 1 root root  946 Feb 21 05:01 cf_hostname__1392987679_Fri_Feb_21_05_01_19_2014_0x7fd054c69700
-rw------- 1 root root 3578 Feb 21 05:05 cf_hostname__1392987947_Fri_Feb_21_05_05_47_2014_0x7f5e781c4700
-rw------- 1 root root 3578 Feb 21 05:11 cf_hostname__1392988272_Fri_Feb_21_05_11_12_2014_0x7f5e781c4700
...
The file with a different size has the following content:
2014-02-21T05:01:21-0800 notice: R: @@Common@@log_info@@hasPolicyServer-root@@common-root@@109@@common@@StartRun@@2014-02-21 05:01:21-08:00##65618d90-0e20-47d2-bf1f-693305dd51bc@#Start execution
2014-02-21T05:01:21-0800 notice: R: @@Common@@result_success@@hasPolicyServer-root@@common-root@@109@@Update@@None@@2014-02-21 05:01:21-08:00##65618d90-0e20-47d2-bf1f-693305dd51bc@#Rudder policy, tools and ncf instance are already up to date. No action required.
2014-02-21T05:01:21-0800 notice: R: @@Common@@result_success@@hasPolicyServer-root@@common-root@@109@@Security parameters@@None@@2014-02-21 05:01:21-08:00##65618d90-0e20-47d2-bf1f-693305dd51bc@#The internal environment security is acceptable
2014-02-21T05:01:21-0800 notice: R: @@Common@@result_success@@hasPolicyServer-root@@common-root@@109@@Red Button@@None@@2014-02-21 05:01:21-08:00##65618d90-0e20-47d2-bf1f-693305dd51bc@#Red Button is not in effect, continuing as normal...
Normally this should be followed by a line
2014-02-21T05:05:49-0800 notice: R: @@Common@@result_success@@hasPolicyServer-root@@common-root@@109@@Process checking@@None@@2014-02-21 05:05:49-08:00##65618d90-0e20-47d2-bf1f-693305dd51bc@#There is an acceptable number of CFEngine processes running on the machine
but it cuts off right there, as if the process had been terminated. I suspected there might be a problem with getting the process list, but since vzps is supposed to be supported, what could go wrong? Well, here comes the first problem. The patch, as published at https://github.com/Normation/rudder-packages/blob/da3db60a6ffd01ce7c581c575321b866c3c1db61/rudder-agent/SOURCES/add-support-of-openVZ.patch, passes the following options to vzps:
+ "-o user,pid,ppid,pgid,pcpu,pmem,vsz,pri,rss,nlwp,stime,time,args", /* vz with vzps */
Unfortunately, this does not work on Proxmox, which for whatever reason ships quite an old vzps build based on procps v2, and that version does not support the nlwp option. It supports the thcount option instead, and the man page says:
nlwp NLWP number of lwps (threads) in the process. (alias thcount).
So, basically, replacing nlwp with thcount would work for both the old and the new versions. CFEngine has a similar issue, see https://cfengine.com/dev/issues/3395.
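For illustration, this is roughly what the amended column list would look like when fed to vzps directly - only nlwp is swapped for thcount, and in the patch itself only that string constant would change. This is an untested sketch, not a verified patch hunk:

# thcount should be accepted by both Proxmox's old procps-v2-based vzps and newer builds
vzps -o user,pid,ppid,pgid,pcpu,pmem,vsz,pri,rss,thcount,stime,time,args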
I do not know exactly what happens, but either the vzps exit status is misinterpreted or the agent falls back to using regular ps - in any case, the host apparently starts killing the container's processes (which are restarted later by those 5-minutes-apart cron checks - hence the email). Oh, and about the magic 5:05 - that is apparently the time when the CFEngine processes are forcibly restarted by the policy (see common/1.0/process_matching.cf). Only at 5:05 does it actually try to stop the processes - and how the kill is implemented is not clear - but evidently something goes wrong.
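To illustrate why falling back to plain ps is a problem on the hardware node, compare something like the following two counts (hypothetical commands, not taken from Rudder; I am assuming that vzps accepts the usual ps options and that -E 0 restricts output to CTID 0, i.e. the host itself):

# On an OpenVZ/Proxmox hardware node, plain ps also lists processes running inside the containers,
# while vzps -E 0 (assumption: CTID 0 = the host) shows only the host's own processes.
ps -efww | grep -c "[c]f-execd"          # host + all containers ([c] avoids matching the grep itself)
vzps -E 0 -efww | grep -c "[c]f-execd"   # host only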
To add insult to injury, I found that on my host the cf-execd process has not been running for several hours. It is 16:11 now, and the last non-empty file in the outputs directory is timestamped 11:15:
...
2014-02-21T11:15:41-0800 notice: R: @@Common@@result_repaired@@hasPolicyServer-root@@common-root@@109@@Process checking@@None@@2014-02-21 11:15:41-08:00##65618d90-0e20-47d2-bf1f-693305dd51bc@#Warning, more than 2 cf-execd processes were detected. They have been sent a graceful termination signal.
...
The next file is empty and timestamped 11:16:
# ls -l *_Fri_Feb_21_11_16_06_2014_0x7f5e781c4700
-rw------- 1 root root 0 Feb 21 11:16 cf_hostname__1393010166_Fri_Feb_21_11_16_06_2014_0x7f5e781c4700
And nothing beyond that until now. Again, something went wrong and the process apparently killed itself.
So why is it not being restarted by cron? That is another issue. The cron file (generated by the policy) contains the line:
0,5,10,15,20,25,30,35,40,45,50,55 * * * * root if [ -e /opt/rudder/bin/check-rudder-agent ]; then /opt/rudder/bin/check-rudder-agent; else if [ ! -e /opt/rudder/etc/disable-agent -a `ps -efww | grep -E "(cf-execd|cf-agent)" | grep -E "/var/rudder/cfengine-community/bin/(cf-execd|cf-agent)" | grep -v grep | wc -l` -eq 0 ]; then /var/rudder/cfengine-community/bin/cf-agent -f failsafe.cf >/dev/null 2>&1 && /var/rudder/cfengine-community/bin/cf-agent >/dev/null 2>&1; if [ $? != 0 ]; then if [ -f /opt/rudder/etc/rudder-restart-message.txt ]; then cat /opt/rudder/etc/rudder-restart-message.txt; else echo "Rudder agent was unable to restart on $(hostname)."; fi; fi; fi; fi
So it tries to use the check-rudder-agent script, but if that is not available it falls back to plain ps (which in this case would yield an incorrect result). The script is present, however - and it turns out that it uses plain ps as well:
# If no disable file AND no process of CFEngine from Rudder, then relaunch cf-agent with a failsafe first
# But this is applied only on servers or nodes already initialized (policy server set)
if [ ! -e ${CFE_DISABLE_FILE} -a `ps -efww | grep -E "(cf-execd|cf-agent)" | grep -E "${CFE_BIN_DIR}/(cf-execd|cf-agent)" | grep -v grep | wc -l` -eq 0 -a -f ${CFE_DIR}/policy_server.dat ]; then
  echo -n "WARNING: No disable file detected and no CFEngine process neither. Relaunching CFEngine processes..."
  ${CFE_BIN_DIR}/cf-agent -f failsafe.cf >/dev/null 2>&1
  ${CFE_BIN_DIR}/cf-agent >/dev/null 2>&1
  echo " Done"
fi
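For what it is worth, a check along the following lines would avoid counting container processes on the hardware node. This is only a sketch of the idea, not the shipped script; it assumes that /proc/bc exists only on the OpenVZ hardware node (not inside containers) and that vzps -E 0 restricts output to CTID 0, i.e. the host:

# Sketch only, not the actual check-rudder-agent code
CFE_BIN_DIR=/var/rudder/cfengine-community/bin
PS_CMD="ps"
# Assumption: /proc/bc is present on the hardware node but not inside containers
if [ -d /proc/bc ] && command -v vzps >/dev/null 2>&1; then
  PS_CMD="vzps -E 0"   # Assumption: -E 0 limits output to the host's own processes (CTID 0)
fi
NB_CFE_PROCS=`${PS_CMD} -efww | grep -E "${CFE_BIN_DIR}/(cf-execd|cf-agent)" | grep -v grep | wc -l`
if [ ${NB_CFE_PROCS} -eq 0 ]; then
  echo "No CFEngine process running on this node itself - a relaunch would be needed here"
fi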
As the containers still have cf-execd running (and plain ps on the host lists the containers' processes as well), the script never detects that it needs to relaunch the process on the mothership.
I am restarting the service manually for now (as it is the end of the working week :) - we shall see how it behaves over the weekend, and I will continue on Monday.