Project

General

Profile

Actions

Bug #4498

closed

Several issues with process management on Proxmox host (and container)

Added by Alex Tkachenko about 10 years ago. Updated over 6 years ago.

Status:
Rejected
Priority:
N/A
Assignee:
-
Category:
System techniques
Target version:
Severity:
Critical - prevents main use of Rudder | no workaround | data loss | security
UX impact:
User visibility:
Infrequent - complex configurations | third party integrations
Effort required:
Priority:
57
Name check:
Fix check:
Regression:

Description

I had my concerns regarding openVZ process management in cfengine, but after establishing the fact that Rudder uses special add-on patch I decided to give it a try and deployed rudder on several OpenVZ/Proxmox hosts and containers.

The first sign of trouble were two emails I've received at 5:05 am today and yesterday from one of the containers:

WARNING: No disable file detected and no CFEngine process neither. Relaunching CFEngine processes... Done

I sorta ignored the first one, but having received the second at exactly same time I went to check on the matters. Initially I thought it is from cron, but it turned to be more complicated than that. The cron check is executed every 5 minutes, so why only 5:05? Also by looking at the outputs directory both on the container and the host I've noticed size differences (they are practically the same, so only one is presented here):

# ls -l *
...
-rw------- 1 root root 3578 Feb 21 04:46 cf_hostname__1392986764_Fri_Feb_21_04_46_04_2014_0x7fd054c69700
-rw------- 1 root root 3578 Feb 21 04:50 cf_hostname__1392987029_Fri_Feb_21_04_50_29_2014_0x7fd054c69700
-rw------- 1 root root 3578 Feb 21 04:55 cf_hostname__1392987354_Fri_Feb_21_04_55_54_2014_0x7fd054c69700
-rw------- 1 root root  946 Feb 21 05:01 cf_hostname__1392987679_Fri_Feb_21_05_01_19_2014_0x7fd054c69700
-rw------- 1 root root 3578 Feb 21 05:05 cf_hostname__1392987947_Fri_Feb_21_05_05_47_2014_0x7f5e781c4700
-rw------- 1 root root 3578 Feb 21 05:11 cf_hostname__1392988272_Fri_Feb_21_05_11_12_2014_0x7f5e781c4700
...

The file with a different size has a content:

2014-02-21T05:01:21-0800   notice: R: @@Common@@log_info@@hasPolicyServer-root@@common-root@@109@@common@@StartRun@@2014-02-21 05:01:21-08:00##65618d90-0e20-47d2-bf1f-693305dd51bc@#Start execution
2014-02-21T05:01:21-0800   notice: R: @@Common@@result_success@@hasPolicyServer-root@@common-root@@109@@Update@@None@@2014-02-21 05:01:21-08:00##65618d90-0e20-47d2-bf1f-693305dd51bc@#Rudder policy, tools and ncf instance are already up to date. No action required.
2014-02-21T05:01:21-0800   notice: R: @@Common@@result_success@@hasPolicyServer-root@@common-root@@109@@Security parameters@@None@@2014-02-21 05:01:21-08:00##65618d90-0e20-47d2-bf1f-693305dd51bc@#The internal environment security is acceptable
2014-02-21T05:01:21-0800   notice: R: @@Common@@result_success@@hasPolicyServer-root@@common-root@@109@@Red Button@@None@@2014-02-21 05:01:21-08:00##65618d90-0e20-47d2-bf1f-693305dd51bc@#Red Button is not in effect, continuing as normal...

Normally this should be followed by a line

2014-02-21T05:05:49-0800   notice: R: @@Common@@result_success@@hasPolicyServer-root@@common-root@@109@@Process checking@@None@@2014-02-21 05:05:49-08:00##65618d90-0e20-47d2-bf1f-693305dd51bc@#There is an acceptable number of CFEngine processes running on the machine

but it cuts right there as if the process was terminated. I suspected that there might be some problems with getting the process list, but as vzps should be supported - what's the problem? OK, here comes the first problem. The patch, as presented here: https://github.com/Normation/rudder-packages/blob/da3db60a6ffd01ce7c581c575321b866c3c1db61/rudder-agent/SOURCES/add-support-of-openVZ.patch lists the following parameters for vzps:

+    "-o user,pid,ppid,pgid,pcpu,pmem,vsz,pri,rss,nlwp,stime,time,args", /* vz with vzps */

Unfortunately, this does not work on Proxmox, as they for whatever reason ship quite an old version, based on procps v2, which does not support nlwp option. They support thcount option instead, and the main page says that

nlwp       NLWP    number of lwps (threads) in the process. (alias thcount).

So, basically, replacing nlwp with thcount would work for both the old and the new case. CFengine has a similar issue, see https://cfengine.com/dev/issues/3395.

I do not know what happens exactly, but either the vzps exit status is misinterpreted or it reverts to using a regular ps - but apparently the host starts shooting off container's processes (which are restarted later because of those 5-minute-apart cron checks - hence the email). Oh, and actually a magic 5:05 - apparently this is the time when cfengine processes are forcibly restarted by the policy (see common/1.0/process_matching.cf). Only at 5:05 it actually tries to stop the processes - and how the kill is implemented is not clear - but evidently something goes wrong.

To add an insult to an injury, I have found that on my host the cf-execd process is not running for several hours. It is 16:11 now, and the last non-empty file in the outputs directory timestamped with 11:15:

...
2014-02-21T11:15:41-0800   notice: R: @@Common@@result_repaired@@hasPolicyServer-root@@common-root@@109@@Process checking@@None@@2014-02-21 11:15:41-08:00##65618d90-0e20-47d2-bf1f-693305dd51bc@#Warning, more than 2 cf-execd processes were detected. They have been sent a graceful termination signal.
...

The next file is empty and timestamped by 11:16:
# ls -l *_Fri_Feb_21_11_16_06_2014_0x7f5e781c4700 
-rw------- 1 root root 0 Feb 21 11:16 cf_hostname__1393010166_Fri_Feb_21_11_16_06_2014_0x7f5e781c4700

And nothing beyond that until now. Again, something went wrong and the process apparently killed itself.

So why it is not being restarted by cron? This is another issue. The cron file (generated by the policy) contains the line

0,5,10,15,20,25,30,35,40,45,50,55 * * * * root if [ -e /opt/rudder/bin/check-rudder-agent ]; then /opt/rudder/bin/check-rudder-agent; else if [ ! -e /opt/rudder/etc/disable-agent -a `ps -efww | grep -E "(cf-execd|cf-agent)" | grep -E "/var/rudder/cfengine-community/bin/(cf-execd|cf-agent)" | grep -v grep | wc -l` -eq 0 ]; then /var/rudder/cfengine-community/bin/cf-agent -f failsafe.cf >/dev/null 2>&1 && /var/rudder/cfengine-community/bin/cf-agent >/dev/null 2>&1; if [ $? != 0 ]; then if [ -f /opt/rudder/etc/rudder-restart-message.txt ]; then cat /opt/rudder/etc/rudder-restart-message.txt; else echo "Rudder agent was unable to restart on $(hostname)."; fi; fi; fi; fi

So it tries to use check-rudder-agent script, but if it is not available it uses the plain ps (which in this case would yield the incorrect result). But we do have the script - however, it turns out that it uses plain ps as well:

# If no disable file AND no process of CFEngine from Rudder, then relaunch cf-agent with a failsafe first
# But this is applied only on servers or nodes already initialized (policy server set)
if [ ! -e ${CFE_DISABLE_FILE} -a `ps -efww | grep -E "(cf-execd|cf-agent)" | grep -E "${CFE_BIN_DIR}/(cf-execd|cf-agent)" | grep -v grep | wc -l` -eq 0 -a -f ${CFE_DIR}/policy_server.dat ]; then
  echo -n "WARNING: No disable file detected and no CFEngine process neither. Relaunching CFEngine processes..." 
  ${CFE_BIN_DIR}/cf-agent -f failsafe.cf >/dev/null 2>&1
  ${CFE_BIN_DIR}/cf-agent >/dev/null 2>&1
  echo " Done" 
fi

As the containers still have the cf-execd running, the script never detects that it needs to relaunch the process on the mothership.

I am restarting the service manually for now (as it is the end of the working week :) - we shall see how it is working over the weekend and will continue on Monday.


Subtasks 3 (0 open3 closed)

Bug #4534: vzps implementation in cfengine doesn't work with Promox (branch 2.11)RejectedActions
Bug #4535: vzps implementation in cfengine doesn't work with Promox (branch 2.6)RejectedJonathan CLARKE2014-02-27Actions
Bug #5467: implement openvz support for rudder 2.11ReleasedJonathan CLARKE2014-09-02Actions

Related issues 6 (0 open6 closed)

Related to Rudder - Bug #4499: Rudder init script kill all agent on Open VZ (or similar system)ReleasedBenoît PECCATTE2014-02-23Actions
Related to Rudder - Bug #4509: vzps is never used on vzps systemRejectedJonathan CLARKE2014-02-25Actions
Related to Rudder - Bug #4517: vzps is never used on vzps system (branch 2.8)RejectedJonathan CLARKE2014-02-25Actions
Related to Rudder - Bug #7189: issues with process management on physical hosting LXC containersReleasedMatthieu CERDA2015-09-12Actions
Related to Rudder - Bug #7381: Process management issues on nodes hosting LXC containersReleasedAlexis MoussetActions
Related to Rudder - Bug #7423: If using proxmox, process management fails due to bad options used on vzpsReleasedBenoît PECCATTE2015-12-07Actions
Actions

Also available in: Atom PDF