Bug #20685
Updated by Alexis Mousset almost 3 years ago
Hi,

I believe this is a bug, but I cannot see anything amiss in the logs that explains it. I am willing to take any guidance to debug further.

Our servers are monitored by Zabbix, and since upgrading the Rudder agents (and server) to v7.0.0 we have been seeing an excessive number of agent restarts across many of our nodes. We have 88 nodes, and in just the last 10 hours we have had over 1000 alerts of Rudder services restarting on 43 of them. This was not the case with Rudder v6.1.12.

I have attached various details to try and assist:

* restart counts.txt - a list of the restart counts received in the last 10 hours, showing that the volumes are not consistent between nodes and that not every node is affected.
* restart events.txt - a detailed breakdown of the last 1000 process restart timings.

Then, taking the current worst offender (vps001.bhs), I have the following files for the period 12:00-15:00 UTC today:

* syslog, daemon.log (standard log files)
* rudder.log (we direct all rudder and cf events to this log)
* promise and cfengine logs for the period around a couple of the restarts.

Environment details for the agent:

<pre>
root@vps001:/var/log# lsb_release -a
No LSB modules are available.
Distributor ID: Debian
Description:    Debian GNU/Linux 11 (bullseye)
Release:        11
Codename:       bullseye

root@vps001:/var/log# cat /etc/apt/sources.list.d/rudder.list
deb https://repository.rudder.io/apt/7.0/ bullseye main

root@vps001:/var/log# dpkg -l | grep -i rudder
ii rudder-agent 7.0.0-debian11 amd64 Configuration management and audit tool - agent

root@vps001:~# systemctl status rudder*
rudder-agent.service - Rudder agent umbrella service
     Loaded: loaded (/lib/systemd/system/rudder-agent.service; enabled; vendor preset: enabled)
     Active: active (exited) since Tue 2022-02-01 15:32:33 UTC; 6min ago
       Docs: man:rudder(8)
             https://docs.rudder.io
    Process: 1492011 ExecStart=/bin/true (code=exited, status=0/SUCCESS)
   Main PID: 1492011 (code=exited, status=0/SUCCESS)
        CPU: 3ms

Feb 01 15:32:33 vps001 systemd[1]: Starting Rudder agent umbrella service...
Feb 01 15:32:33 vps001 systemd[1]: Finished Rudder agent umbrella service.

rudder-cf-serverd.service - CFEngine file server
     Loaded: loaded (/lib/systemd/system/rudder-cf-serverd.service; enabled; vendor preset: enabled)
     Active: active (running) since Tue 2022-02-01 15:32:33 UTC; 6min ago
   Main PID: 1492017 (cf-serverd)
      Tasks: 1 (limit: 1128)
     Memory: 6.6M
        CPU: 573ms
     CGroup: /system.slice/rudder-cf-serverd.service
             └─1492017 /opt/rudder/bin/cf-serverd --graceful-detach=600 --no-fork --inform

Feb 01 15:32:33 vps001 systemd[1]: Started CFEngine file server.
Feb 01 15:32:36 vps001 cf-serverd[1492017]: notice: Server is starting...
Feb 01 15:32:36 vps001 cf-serverd[1492017]: CFEngine(server) rudder Server is starting...
rudder-cf-execd.service - CFEngine Execution Scheduler
     Loaded: loaded (/lib/systemd/system/rudder-cf-execd.service; enabled; vendor preset: enabled)
     Active: active (running) since Tue 2022-02-01 15:32:33 UTC; 6min ago
   Main PID: 1492015 (cf-execd)
      Tasks: 1 (limit: 1128)
     Memory: 104.2M
        CPU: 22.877s
     CGroup: /system.slice/rudder-cf-execd.service
             └─1492015 /opt/rudder/bin/cf-execd --no-fork

Feb 01 15:37:31 vps001 cf-agent[1494209]: CFEngine(agent) rudder R: @@sudoParameters@@result_success@@32377fd7-02fd-43d0-aab7-28460>
Feb 01 15:37:31 vps001 cf-agent[1494209]: CFEngine(agent) rudder Deleted file '/var/rudder/tmp/check_ssh_key_distribution//root.aut>
Feb 01 15:37:31 vps001 cf-agent[1494209]: CFEngine(agent) rudder R: @@sshKeyDistribution@@result_na@@32377fd7-02fd-43d0-aab7-28460a>
Feb 01 15:37:31 vps001 cf-agent[1494209]: CFEngine(agent) rudder R: @@sshKeyDistribution@@result_na@@32377fd7-02fd-43d0-aab7-28460a>
Feb 01 15:37:31 vps001 cf-agent[1494209]: CFEngine(agent) rudder R: @@sshKeyDistribution@@result_na@@32377fd7-02fd-43d0-aab7-28460a>
Feb 01 15:37:31 vps001 cf-agent[1494209]: CFEngine(agent) rudder R: @@sshKeyDistribution@@result_na@@32377fd7-02fd-43d0-aab7-28460a>
Feb 01 15:37:31 vps001 cf-agent[1494209]: CFEngine(agent) rudder R: @@sshKeyDistribution@@result_na@@32377fd7-02fd-43d0-aab7-28460a>
Feb 01 15:37:31 vps001 cf-agent[1494209]: CFEngine(agent) rudder R: @@sshKeyDistribution@@result_na@@32377fd7-02fd-43d0-aab7-28460a>
Feb 01 15:37:31 vps001 cf-agent[1494209]: CFEngine(agent) rudder R: @@Common@@result_na@@hasPolicyServer-root@@common-hasPolicyServ>
Feb 01 15:37:31 vps001 cf-agent[1494209]: CFEngine(agent) rudder R: @@Common@@control@@rudder@@run@@0@@end@@20220128-232623-d3d4d1b>

ls -l /lib/systemd/system/rudder*
-rw-r--r-- 1 root root 469 Nov 22 2017 /lib/systemd/system/rudder-agent.service
-rw-r--r-- 1 root root 512 Nov 22 2017 /lib/systemd/system/rudder-cf-execd.service
-rw-r--r-- 1 root root 522 Nov 22 2017 /lib/systemd/system/rudder-cf-serverd.service
</pre>
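
One rough way to cross-check the monitoring alerts against the logs themselves is to count how often the umbrella service is started on each host. The following is only a sketch and is not taken from the attached files: the function name `count_agent_starts` is hypothetical, and the regular expression assumes syslog lines shaped like the `systemd[1]: Starting Rudder agent umbrella service...` entry in the output above. Whether these start events correspond one-to-one with the alerts raised by the monitoring tool is also an assumption.

```python
import re
import sys
from collections import Counter

# Assumed line shape, based on the journal excerpt in this report:
#   Feb 01 15:32:33 vps001 systemd[1]: Starting Rudder agent umbrella service...
# \s+ after the month tolerates both "Feb 01" and the space-padded "Feb  1".
START_RE = re.compile(
    r"^\w{3}\s+\d{1,2} \d{2}:\d{2}:\d{2} (?P<host>\S+) systemd\[1\]: "
    r"Starting Rudder agent umbrella service"
)


def count_agent_starts(lines):
    """Return a Counter mapping hostname to the number of start events seen."""
    counts = Counter()
    for line in lines:
        match = START_RE.match(line.strip())
        if match:
            counts[match.group("host")] += 1
    return counts


if __name__ == "__main__" and len(sys.argv) > 1:
    # The log path is supplied by the caller, e.g. a copy of syslog.
    with open(sys.argv[1], errors="replace") as handle:
        for host, total in count_agent_starts(handle).most_common():
            print(f"{host}\t{total}")
```

Run against a copy of syslog from an affected node, this prints one line per hostname with its start count, which can then be compared with the per-node figures in restart counts.txt.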