Bug #7338
Closed
All reports are missing (totally orange) for a node due to multiple cf-execd processes
Description
All reports are missing (totally orange) for a node due to multiple cf-execd processes. The logs are there and visible in the web UI.
Workaround: log in to the node, stop rudder-agent, kill -9 the cf-execd process that is still running, then start rudder-agent.
Updated by Nicolas CHARLES almost 9 years ago
Dennis, what happens if you run bash -x /opt/rudder/bin/check-rudder-agent?
What is the exit code?
Updated by Dennis Cabooter almost 9 years ago
# ps wwwuax|grep cf-exec|grep -v grep
root 1679 0.0 0.3 107816 3984 ? Ss 09:34 0:00 /var/rudder/cfengine-community/bin/cf-execd
root 2046 0.0 0.3 107816 3984 ? Ss 09:34 0:00 /var/rudder/cfengine-community/bin/cf-execd
# bash -x /opt/rudder/bin/check-rudder-agent
+ . /etc/profile
++ '[' '' ']'
++ '[' -d /etc/profile.d ']'
++ for i in '/etc/profile.d/*.sh'
++ '[' -r /etc/profile.d/bash_completion.sh ']'
++ . /etc/profile.d/bash_completion.sh
+++ '[' -n '4.3.11(1)-release' -a -n '' -a -z '' ']'
++ for i in '/etc/profile.d/*.sh'
++ '[' -r /etc/profile.d/rudder-agent.sh ']'
++ . /etc/profile.d/rudder-agent.sh
+++ PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/var/rudder/cfengine-community/bin:/var/rudder/cfengine-community/bin
+++ export PATH
+++ type manpath
++++ manpath
+++ MANPATH=/usr/local/man:/usr/local/share/man:/usr/share/man:/opt/rudder/share/man:/opt/rudder/share/man
+++ export MANPATH
++ unset i
+ set -e
+ export PATH=/opt/rudder/bin/:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/var/rudder/cfengine-community/bin:/var/rudder/cfengine-community/bin
+ PATH=/opt/rudder/bin/:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/var/rudder/cfengine-community/bin:/var/rudder/cfengine-community/bin
+ BACKUP_DIR=/var/backups/rudder/
++ uname -s
+ OS_FAMILY=Linux
+ CFENGINE_DB_EXT=lmdb
+ '[' zLinux = zAIX ']'
+ CP_A='cp -a'
+ CFE_DIR=/var/rudder/cfengine-community
+ CFE_BIN_DIR=/var/rudder/cfengine-community/bin
+ CFE_DISABLE_FILE=/opt/rudder/etc/disable-agent
+ LAST_UPDATE_FILE=/var/rudder/cfengine-community/last_successful_inputs_update
+ UUID_FILE=/opt/rudder/etc/uuid.hive
++ whoami
+ '[' '!' root = root ']'
+ check_and_fix_rudder_uuid
+ LATEST_BACKUPED_UUID=
+ '[' '!' -e /opt/rudder/etc/uuid.hive ']'
++ wc -l
++ grep -E '^[a-z0-9]{8}-[a-z0-9]{4}-[a-z0-9]{4}-[a-z0-9]{4}-[a-z0-9]{12}|root'
++ cat /opt/rudder/etc/uuid.hive
+ CHECK_UUID=1
+ '[' 1 -ne 1 ']'
+ check_and_fix_cfengine_processes
++ ps -h -o utsns --pid 5742
+ ns=4026531838
+ '[' -e /proc/bc/0 ']'
+ '[' -n 4026531838 ']'
+ PS_COMMAND='eval ps --no-header -e -O utsns | grep -E '\''^[[:space:]]*[[:digit:]]*[[:space:]]+4026531838'\'''
++ cat
++ grep -E cf-execd
++ grep -v grep
++ eval ps --no-header -e -O utsns '|' grep -E ''\''^[[:space:]]*[[:digit:]]*[[:space:]]+4026531838'\'''
+++ grep -E '^[[:space:]]*[[:digit:]]*[[:space:]]+4026531838'
+++ ps --no-header -e -O utsns
+ CF_EXECD_RUNNING=' 1679 4026531838 S ? 00:00:00 /var/rudder/cfengine-community/bin/cf-execd
 2046 4026531838 S ? 00:00:00 /var/rudder/cfengine-community/bin/cf-execd'
++ wc -l
++ grep -v '^$'
++ echo ' 1679 4026531838 S ? 00:00:00 /var/rudder/cfengine-community/bin/cf-execd
 2046 4026531838 S ? 00:00:00 /var/rudder/cfengine-community/bin/cf-execd'
+ NB_CF_EXECD_RUNNING=2
+ '[' 2 -gt 1 ']'
+ echo_n 'WARNING: Too many instance of CFEngine cf-execd processes running. Killing them...'
+ '[' zLinux = zAIX ']'
+ echo -n WARNING: Too many instance of CFEngine cf-execd processes running. Killing them...
WARNING: Too many instance of CFEngine cf-execd processes running. Killing them...+ xargs kill -9
+ awk 'BEGIN { OFS=" "} {print $2 }'
+ echo ' 1679 4026531838 S ? 00:00:00 /var/rudder/cfengine-community/bin/cf-execd
 2046 4026531838 S ? 00:00:00 /var/rudder/cfengine-community/bin/cf-execd'
+ true
+ echo ' Done'
 Done
++ cat
++ grep -E '/var/rudder/cfengine-community/bin/(cf-execd|cf-agent)'
++ grep -v grep
++ eval ps --no-header -e -O utsns '|' grep -E ''\''^[[:space:]]*[[:digit:]]*[[:space:]]+4026531838'\'''
+++ grep -E '^[[:space:]]*[[:digit:]]*[[:space:]]+4026531838'
+++ ps --no-header -e -O utsns
+ CF_PROCESS_RUNNING=' 1679 4026531838 S ? 00:00:00 /var/rudder/cfengine-community/bin/cf-execd
 2046 4026531838 S ? 00:00:00 /var/rudder/cfengine-community/bin/cf-execd'
++ wc -l
++ grep -v '^$'
++ echo ' 1679 4026531838 S ? 00:00:00 /var/rudder/cfengine-community/bin/cf-execd
 2046 4026531838 S ? 00:00:00 /var/rudder/cfengine-community/bin/cf-execd'
+ NB_CF_PROCESS_RUNNING=2
+ '[' '!' -e /opt/rudder/etc/disable-agent -a 2 -eq 0 -a -f /var/rudder/cfengine-community/policy_server.dat ']'
+ '[' -f /var/rudder/cfengine-community/inputs/run_interval ']'
++ cat /var/rudder/cfengine-community/inputs/run_interval
+ RUN_INTERVAL=15
++ expr 15 '*' 2
+ CHECK_INTERVAL=30
+ '[' '!' -e /var/rudder/cfengine-community/last_successful_inputs_update -o -e /opt/rudder/etc/disable-agent ']'
++ find /var/rudder/cfengine-community/last_successful_inputs_update -mmin +30
+ test
+ '[' 2 -gt 8 ']'
+ check_and_fix_cf_lock
+ MAX_CF_LOCK_SIZE=10485760
+ '[' -e /var/rudder/cfengine-community/state/cf_lock.lmdb ']'
+ '[' zLinux = zAIX ']'
++ stat -c%s /var/rudder/cfengine-community/state/cf_lock.lmdb
+ CF_LOCK_SIZE=155648
+ '[' 155648 -ge 10485760 ']'
+ '[' zLinux '!=' zAIX ']'
+ check_and_fix_specific_rudder_agent_file /etc/init.d/rudder-agent init
+ FILE_TO_RESTORE=/etc/init.d/rudder-agent
+ FILE_TYPE=init
+ LATEST_BACKUPED_FILES=
+ '[' '!' -e /etc/init.d/rudder-agent ']'
+ check_and_fix_specific_rudder_agent_file /etc/default/rudder-agent default
+ FILE_TO_RESTORE=/etc/default/rudder-agent
+ FILE_TYPE=default
+ LATEST_BACKUPED_FILES=
+ '[' '!' -e /etc/default/rudder-agent ']'
+ check_and_fix_specific_rudder_agent_file /etc/cron.d/rudder-agent cron
+ FILE_TO_RESTORE=/etc/cron.d/rudder-agent
+ FILE_TYPE=cron
+ LATEST_BACKUPED_FILES=
+ '[' '!' -e /etc/cron.d/rudder-agent ']'
+ base=/var/rudder/cfengine-community/inputs
+ empty /var/rudder/cfengine-community/inputs/common/1.0/update.cf
+ '[' '!' -f /var/rudder/cfengine-community/inputs/common/1.0/update.cf ']'
++ awk '{print $1}'
++ du /var/rudder/cfengine-community/inputs/common/1.0/update.cf
+ '[' 20 = 0 ']'
+ empty /var/rudder/cfengine-community/inputs/failsafe.cf
+ '[' '!' -f /var/rudder/cfengine-community/inputs/failsafe.cf ']'
++ awk '{print $1}'
++ du /var/rudder/cfengine-community/inputs/failsafe.cf
+ '[' 8 = 0 ']'
+ empty /var/rudder/cfengine-community/inputs/promises.cf
+ '[' '!' -f /var/rudder/cfengine-community/inputs/promises.cf ']'
++ awk '{print $1}'
++ du /var/rudder/cfengine-community/inputs/promises.cf
+ '[' 36 = 0 ']'
# ps wwwuax|grep cf-exec|grep -v grep
root 1679 0.0 0.3 107816 3984 ? Ss 09:34 0:00 /var/rudder/cfengine-community/bin/cf-execd
root 2046 0.0 0.3 107816 3984 ? Ss 09:34 0:00 /var/rudder/cfengine-community/bin/cf-execd
# ps wwwuax|grep cf-exec|grep -v grep
root 1679 0.0 0.3 107816 3984 ? Ss 09:34 0:00 /var/rudder/cfengine-community/bin/cf-execd
root 2046 0.0 0.3 107816 3984 ? Ss 09:34 0:00 /var/rudder/cfengine-community/bin/cf-execd
# /etc/init.d/rudder-agent stop
rudder-agent[7161]: [INFO] Using /etc/default/rudder-agent for configuration
rudder-agent[7164]: [INFO] Using /var/rudder/cfengine-community for CFEngine workdir
rudder-agent[7165]: [INFO] Halting CFEngine Community cf-serverd...
rudder-agent[7376]: [OK] CFEngine Community cf-serverd stopped after 2 seconds
rudder-agent[7377]: [INFO] Halting CFEngine Community cf-execd...
rudder-agent[8140]: [OK] CFEngine Community cf-execd stopped after 6 seconds
# ps wwwuax|grep cf-exec|grep -v grep
root 1679 0.0 0.3 107816 3984 ? Ss 09:34 0:00 /var/rudder/cfengine-community/bin/cf-execd
# kill 1679
# ps wwwuax|grep cf-exec|grep -v grep
# /etc/init.d/rudder-agent start
rudder-agent[8902]: [INFO] Using /etc/default/rudder-agent for configuration
rudder-agent[8905]: [INFO] Using /var/rudder/cfengine-community for CFEngine workdir
rudder-agent[8906]: [INFO] Launching CFEngine Community cf-serverd...
rudder-agent[9081]: [OK] CFEngine Community cf-serverd started after 1 seconds
rudder-agent[9082]: [INFO] Launching CFEngine Community cf-execd...
rudder-agent[9258]: [OK] CFEngine Community cf-execd started after 1 seconds
# ps wwwuax|grep cf-exec|grep -v grep
root 9255 0.0 0.2 40224 2860 ? Ss 10:49 0:00 /var/rudder/cfengine-community/bin/cf-execd
Updated by Dennis Cabooter almost 9 years ago
It seems like this is only happening on Ubuntu machines, not on CentOS/RHEL ones.
Updated by Nicolas CHARLES almost 9 years ago
- Category set to Packaging
- Assignee changed from Nicolas CHARLES to Benoît PECCATTE
- Target version set to 2.11.17
OK, the problem is here:
echo -n WARNING: Too many instance of CFEngine cf-execd processes running. Killing them...
WARNING: Too many instance of CFEngine cf-execd processes running. Killing them...+ xargs kill -9
+ awk 'BEGIN { OFS=" "} {print $2 }'
+ echo ' 1679 4026531838 S ? 00:00:00 /var/rudder/cfengine-community/bin/cf-execd
 2046 4026531838 S ? 00:00:00 /var/rudder/cfengine-community/bin/cf-execd'
The script does detect that two cf-execd processes are running, but it does not extract the right field for the PID.
This is probably linked to #7189 and #7243
I could not reproduce it on CentOS or Debian 7, but on Ubuntu the value is invalid:
echo ${PS_COMMAND}
eval ps --no-header -e -O utsns | grep -E '^[[:space:]]*[[:digit:]]*[[:space:]]+4026531838'
But I do not have namespaces; I think we should use ps -ef.
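For context, the filter held in PS_COMMAND keeps only the processes whose second column matches the UTS namespace id of the checking shell. Below is a minimal sketch of that filter applied to static text instead of live ps output. The first two sample lines come from the trace in this ticket; the third line (PID 3001, namespace 4026532200) is hypothetical and only stands in for a process running in another namespace, such as a container.

```shell
# Namespace id reported for the checking shell in this ticket.
ns=4026531838

# Two lines copied from the trace, plus one HYPOTHETICAL line (PID 3001)
# standing in for a cf-execd that lives in a different UTS namespace.
sample=' 1679 4026531838 S ? 00:00:00 /var/rudder/cfengine-community/bin/cf-execd
 2046 4026531838 S ? 00:00:00 /var/rudder/cfengine-community/bin/cf-execd
 3001 4026532200 S ? 00:00:00 /var/rudder/cfengine-community/bin/cf-execd'

# Same pattern as in PS_COMMAND: optional spaces, the PID digits,
# spaces, then the namespace id in the second column.
same_ns=$(echo "$sample" | grep -E "^[[:space:]]*[[:digit:]]*[[:space:]]+${ns}")

# Count the cf-execd entries that survive the namespace filter.
count=$(echo "$same_ns" | grep -c cf-execd)
echo "cf-execd processes in our namespace: $count"
```

Only the two processes sharing the shell's namespace are counted, which matches the NB_CF_EXECD_RUNNING=2 seen in the trace.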
Updated by Nicolas CHARLES almost 9 years ago
- Related to Bug #7189: issues with process management on physical hosting LXC containers added
Updated by Benoît PECCATTE almost 9 years ago
Ubuntu supports namespaces, and in the previous output the command
ps -h -o utsns --pid $$
gives 4026531838 (the value in your grep), which is only possible if you have namespace support.
But I see a possible reason: ps -O utsns changes the output field order, so the kill doesn't work.
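The field-order problem can be illustrated on the sample lines from the trace. This is only a sketch on static text, not the change made in the pull request: the variable names are made up for the illustration, and it assumes the ps -O utsns column layout shown in this ticket (PID first, namespace id second).

```shell
# Sample lines copied from the trace (format of "ps --no-header -e -O utsns").
sample=' 1679 4026531838 S ? 00:00:00 /var/rudder/cfengine-community/bin/cf-execd
 2046 4026531838 S ? 00:00:00 /var/rudder/cfengine-community/bin/cf-execd'

# What the check script extracts: field 2. That is the PID column in
# "ps -ef" style output, but here it is the namespace id.
field2=$(echo "$sample" | awk '{print $2}' | xargs)

# With -O utsns the PID is in field 1.
field1=$(echo "$sample" | awk '{print $1}' | xargs)

echo "field 2 (passed to kill -9 by the script): $field2"
echo "field 1 (the actual PIDs):                 $field1"
```

With this layout, kill -9 receives the namespace id twice instead of PIDs 1679 and 2046, which is consistent with both cf-execd processes surviving the check script.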
Updated by Benoît PECCATTE almost 9 years ago
- Status changed from New to In progress
Updated by Benoît PECCATTE almost 9 years ago
- Status changed from In progress to Pending technical review
- Assignee changed from Benoît PECCATTE to Nicolas CHARLES
- Pull Request set to https://github.com/Normation/rudder-packages/pull/783
Updated by Benoît PECCATTE almost 9 years ago
- Status changed from Pending technical review to Pending release
- % Done changed from 0 to 100
Applied in changeset rudder-packages|05320e8ca678754450ac96c5f16ee47daaf668a8.
Updated by Jonathan CLARKE almost 9 years ago
Applied in changeset rudder-packages|6ab33d78a910aae58544e6a3623c7ba4d3b2ebc6.
Updated by Vincent MEMBRÉ over 8 years ago
- Status changed from Pending release to Released
Updated by Florian Heigl over 8 years ago
I found that one also needs to modify the following:
[root@rudder 1.0]# git show
commit 811a3ca2e8f1342b58fb19151e720c1ffda68da8
Author: root user (CLI) <root@localhost>
Date: Wed Jan 13 00:34:57 2016 +0100
adjust for lxc env
diff --git a/techniques/system/common/1.0/promises.st b/techniques/system/common/1.0/promises.st
index b59974c..5b6db6e 100644
--- a/techniques/system/common/1.0/promises.st
+++ b/techniques/system/common/1.0/promises.st
@@ -341,12 +341,12 @@ bundle agent check_cf_processes
# process_kill is the same for SIGKILL.
!windows::
# On windows, cf-execd is a service, and there can be only one instance of it running (by design)
- "process_term[execd]" string => "2";
- "process_kill[execd]" string => "5";
+ "process_term[execd]" string => "6";
+ "process_kill[execd]" string => "8";
any::
- "process_term[agent]" string => "5";
- "process_kill[agent]" string => "8";
+ "process_term[agent]" string => "8";
+ "process_kill[agent]" string => "16";
"binaries" slist => getindices("process_term");
This is not sufficient, since it also raises the limits on all containers; I just don't know a more appropriate fix.