Bug #20685
closedExcessive Agent restarts (Agent on Debian 11)
Description
Hi,
I believe this is a bug, however I cannot see anything amiss in the logs as to why, however willing to take any guidance to try and debug further.
We have our servers being monitored by a monitoring tool (zabbix), and since upgrading the Rudder agents (and Server) to v7.0.0 we have been having excessive volumes of agent restarts across many of our nodes. We have 88 nodes, and in just the last 10 hours we have had over 1000 alerts of rudder services restarting on 43 of the nodes. This wasnt the case with Rudder v6.1.12.
I have attached various details and info to try and assist:
restart counts.txt
- a list of the restart counts received in the last 10 hours, showing the volumes arent consistent, not that it affects every node.
restart events.txt
- detailled breakdown of the last 1000 process restart timings.
Then, taking the current worst offender (vps001.bhs), I have the following files for the period 12:00-15:00 UTC today:
syslog, daemon.log (standard log files)
rudder.log (we direct all rudder and cf events to this log)
promise and cfengine logs for the persod around a couple of the restarts.
Environment Details for the agent:
root@vps001:/var/log# lsb_release -a No LSB modules are available. Distributor ID: Debian Description: Debian GNU/Linux 11 (bullseye) Release: 11 Codename: bullseye root@vps001:/var/log# cat /etc/apt/sources.list.d/rudder.list deb https://repository.rudder.io/apt/7.0/ bullseye main root@vps001:/var/log# dpkg -l | grep -i rudder ii rudder-agent 7.0.0-debian11 amd64 Configuration management and audit tool - agent root@vps001:~# systemctl status rudder* rudder-agent.service - Rudder agent umbrella service Loaded: loaded (/lib/systemd/system/rudder-agent.service; enabled; vendor preset: enabled) Active: active (exited) since Tue 2022-02-01 15:32:33 UTC; 6min ago Docs: man:rudder(8) https://docs.rudder.io Process: 1492011 ExecStart=/bin/true (code=exited, status=0/SUCCESS) Main PID: 1492011 (code=exited, status=0/SUCCESS) CPU: 3ms Feb 01 15:32:33 vps001 systemd[1]: Starting Rudder agent umbrella service... Feb 01 15:32:33 vps001 systemd[1]: Finished Rudder agent umbrella service. rudder-cf-serverd.service - CFEngine file server Loaded: loaded (/lib/systemd/system/rudder-cf-serverd.service; enabled; vendor preset: enabled) Active: active (running) since Tue 2022-02-01 15:32:33 UTC; 6min ago Main PID: 1492017 (cf-serverd) Tasks: 1 (limit: 1128) Memory: 6.6M CPU: 573ms CGroup: /system.slice/rudder-cf-serverd.service └─1492017 /opt/rudder/bin/cf-serverd --graceful-detach=600 --no-fork --inform Feb 01 15:32:33 vps001 systemd[1]: Started CFEngine file server. Feb 01 15:32:36 vps001 cf-serverd[1492017]: notice: Server is starting... Feb 01 15:32:36 vps001 cf-serverd[1492017]: CFEngine(server) rudder Server is starting... rudder-cf-execd.service - CFEngine Execution Scheduler Loaded: loaded (/lib/systemd/system/rudder-cf-execd.service; enabled; vendor preset: enabled) Active: active (running) since Tue 2022-02-01 15:32:33 UTC; 6min ago Main PID: 1492015 (cf-execd) Tasks: 1 (limit: 1128) Memory: 104.2M CPU: 22.877s CGroup: /system.slice/rudder-cf-execd.service └─1492015 /opt/rudder/bin/cf-execd --no-fork Feb 01 15:37:31 vps001 cf-agent[1494209]: CFEngine(agent) rudder R: @@sudoParameters@@result_success@@32377fd7-02fd-43d0-aab7-28460> Feb 01 15:37:31 vps001 cf-agent[1494209]: CFEngine(agent) rudder Deleted file '/var/rudder/tmp/check_ssh_key_distribution//root.aut> Feb 01 15:37:31 vps001 cf-agent[1494209]: CFEngine(agent) rudder R: @@sshKeyDistribution@@result_na@@32377fd7-02fd-43d0-aab7-28460a> Feb 01 15:37:31 vps001 cf-agent[1494209]: CFEngine(agent) rudder R: @@sshKeyDistribution@@result_na@@32377fd7-02fd-43d0-aab7-28460a> Feb 01 15:37:31 vps001 cf-agent[1494209]: CFEngine(agent) rudder R: @@sshKeyDistribution@@result_na@@32377fd7-02fd-43d0-aab7-28460a> Feb 01 15:37:31 vps001 cf-agent[1494209]: CFEngine(agent) rudder R: @@sshKeyDistribution@@result_na@@32377fd7-02fd-43d0-aab7-28460a> Feb 01 15:37:31 vps001 cf-agent[1494209]: CFEngine(agent) rudder R: @@sshKeyDistribution@@result_na@@32377fd7-02fd-43d0-aab7-28460a> Feb 01 15:37:31 vps001 cf-agent[1494209]: CFEngine(agent) rudder R: @@sshKeyDistribution@@result_na@@32377fd7-02fd-43d0-aab7-28460a> Feb 01 15:37:31 vps001 cf-agent[1494209]: CFEngine(agent) rudder R: @@Common@@result_na@@hasPolicyServer-root@@common-hasPolicyServ> Feb 01 15:37:31 vps001 cf-agent[1494209]: CFEngine(agent) rudder R: @@Common@@control@@rudder@@run@@0@@end@@20220128-232623-d3d4d1b> ls -l /lib/systemd/system/rudder* -rw-r--r-- 1 root root 469 Nov 22 2017 /lib/systemd/system/rudder-agent.service -rw-r--r-- 1 root root 512 Nov 22 2017 /lib/systemd/system/rudder-cf-execd.service -rw-r--r-- 1 root root 522 Nov 22 2017 /lib/systemd/system/rudder-cf-serverd.service
Files
Updated by Nigel Mundy almost 3 years ago
Followup - it seems to be Erroring on the Attachment uploads. Please provide an email address and I will provide them that way.
Updated by Nicolas CHARLES almost 3 years ago
Hi, you can send it to nch[at]rudder.io, i'll attach it myself
Updated by Nicolas CHARLES almost 3 years ago
Or past it on gitter: https://gitter.im/normation/rudder
Updated by Nigel Mundy almost 3 years ago
Nicolas CHARLES wrote in #note-3:
Hi, you can send it to nch[at]rudder.io, i'll attach it myself
Sent by email in one zip.
Regards
/\/igel.
Updated by Alexis Mousset almost 3 years ago
Hi, thanks, could you also send us the content of /var/log/rudder/agent-check/check.log
?
Updated by Nigel Mundy almost 3 years ago
Having checked that log, the timestamp of that file indicates it hasnt been updated since we have upgraded from v6.2.12
root@vps001:~# ls -l /var/log/rudder/agent-check/ total 4.0K -rw-r--r-- 1 root root 2.9K Jan 28 18:16 check.log@
and checking the cronjob, that is the correct file:
cat /etc/cron.d/rudder-agent # Cron file for Rudder # # Will manually run cf-agent in case cf-execd is no longer running. cf-agent will fire up a new cf-execd. # # To temporarily avoid this behaviour, run 'rudder agent disable' 0,5,10,15,20,25,30,35,40,45,50,55 * * * * root /opt/rudder/bin/rudder agent check -q >> /var/log/rudder/agent-check/check.log 2>&1
Running it manually generates no error output either:
root@vps001:~# /opt/rudder/bin/rudder agent check -c FINISH: Rudder agent check ran properly, please look at messages above to see if there has been any error.
Verbatim contents of the agent check log:
notice: Trusting new key: MD5=d6cff0254d641112dbaa0c0b46715108 R: Initial policies have been successfully downloaded from the policy server rudder-cf-serverd: enabled rudder-cf-execd: enabled rudder-cf-serverd: enabled rudder-cf-execd: enabled Rudder agent 6.0.1-debian10 Node uuid: 90a73114-c1e0-469d-9c80-604b6fca4252 Start execution with config [0] info Rudder agent was run on a subset of policies - not all policies were checked ## Summary ##################################################################### 1 components verified in 3 directives => 1 components in Enforce mode -> 1 compliant Execution time: 7.20s ################################################################################ rudder-cf-serverd: enabled rudder-cf-execd: enabled rudder-cf-serverd: enabled rudder-cf-execd: enabled rudder-cf-serverd: enabled rudder-cf-execd: enabled rudder-cf-serverd: enabled rudder-cf-execd: enabled rudder-cf-serverd: enabled rudder-cf-execd: enabled rudder-cf-serverd: enabled rudder-cf-execd: enabled rudder-cf-serverd: enabled rudder-cf-execd: enabled rudder-cf-serverd: enabled rudder-cf-execd: enabled /var/rudder/cfengine-community/inputs/promises.cf:228:0: error: Duplicate definition of bundle abort_rudder with type agent /var/rudder/ncf/common/20_cfe_basics/abort.cf:69:0: error: Duplicate definition of bundle abort_rudder with type agent R: Initial policies have been successfully downloaded from the policy server /var/rudder/cfengine-community/inputs/promises.cf:228:0: error: Duplicate definition of bundle abort_rudder with type agent /var/rudder/ncf/common/20_cfe_basics/abort.cf:69:0: error: Duplicate definition of bundle abort_rudder with type agent error: Failed to establish TLS connection: underlying network error () error: No suitable server found error: Failed to establish TLS connection: underlying network error () error: No suitable server found R: Initial policies have been successfully downloaded from the policy server error: Rudder agent promises could not be updated. R: Initial policies have been successfully downloaded from the policy server rudder agent check is already running Terminated Terminated Rudder agent 6.2.12-debian11 Node uuid: 90a73114-c1e0-469d-9c80-604b6fca4252 M| State Technique Component Key Message E| repaired Common Compute inventory splay Scheduling rudder_run_inventory was repaired Start execution with config [20220121-145004-ceb3dfc] info Rudder agent was run on a subset of policies - not all policies were checked ## Summary ##################################################################### 2 components verified in 4 directives => 2 components in Enforce mode -> 1 compliant -> 1 repaired Execution time: 6.72s ################################################################################ Reports sent.
Updated by Nicolas CHARLES almost 3 years ago
I've been able to reproduce
the command rudder agent check kills rudder-cf-execd when there are more than one instance
and when the agent runs, there is more than one instance...
Updated by Nicolas CHARLES almost 3 years ago
- Target version set to 7.0.1
This is specific 7.0 - introduce in agent version 3.18
cf-execd now runs cf-agent from a child process instead of a thread on POSIX systems (ENT-6182)
Updated by Nicolas CHARLES almost 3 years ago
- Status changed from New to In progress
- Assignee set to Nicolas CHARLES
Updated by Nicolas CHARLES almost 3 years ago
- Status changed from In progress to Pending technical review
- Assignee changed from Nicolas CHARLES to Benoît PECCATTE
- Pull Request set to https://github.com/Normation/rudder-agent/pull/360
Updated by Nicolas CHARLES almost 3 years ago
- Status changed from Pending technical review to Pending release
Applied in changeset rudder-agent|385194f8b7d305db00608dca646070336ba20cd4.
Updated by Alexis Mousset almost 3 years ago
- Fix check changed from To do to Checked
Updated by Vincent MEMBRÉ almost 3 years ago
- Status changed from Pending release to Released
This bug has been fixed in Rudder 7.0.1 which was released today.