Project

General

Profile

Actions

Bug #20685

closed

Excessive Agent restarts (Agent on Debian 11)

Added by Nigel Mundy 10 months ago. Updated 10 months ago.

Status:
Released
Priority:
N/A
Category:
Agent
Target version:
Severity:
Minor - inconvenience | misleading | easy workaround
UX impact:
User visibility:
Effort required:
Priority:
0
Regression:

Description

Hi,
I believe this is a bug, however I cannot see anything amiss in the logs as to why, however willing to take any guidance to try and debug further.

We have our servers being monitored by a monitoring tool (zabbix), and since upgrading the Rudder agents (and Server) to v7.0.0 we have been having excessive volumes of agent restarts across many of our nodes. We have 88 nodes, and in just the last 10 hours we have had over 1000 alerts of rudder services restarting on 43 of the nodes. This wasnt the case with Rudder v6.1.12.

I have attached various details and info to try and assist:

restart counts.txt
- a list of the restart counts received in the last 10 hours, showing the volumes arent consistent, not that it affects every node.

restart events.txt
- detailled breakdown of the last 1000 process restart timings.

Then, taking the current worst offender (vps001.bhs), I have the following files for the period 12:00-15:00 UTC today:

syslog, daemon.log (standard log files)
rudder.log (we direct all rudder and cf events to this log)
promise and cfengine logs for the persod around a couple of the restarts.

Environment Details for the agent:

root@vps001:/var/log# lsb_release -a
No LSB modules are available.
Distributor ID: Debian
Description:    Debian GNU/Linux 11 (bullseye)
Release:        11
Codename:       bullseye

root@vps001:/var/log# cat /etc/apt/sources.list.d/rudder.list
deb https://repository.rudder.io/apt/7.0/ bullseye main

root@vps001:/var/log# dpkg -l | grep -i rudder
ii  rudder-agent                   7.0.0-debian11                 amd64        Configuration management and audit tool - agent

root@vps001:~# systemctl status rudder*
rudder-agent.service - Rudder agent umbrella service
     Loaded: loaded (/lib/systemd/system/rudder-agent.service; enabled; vendor preset: enabled)
     Active: active (exited) since Tue 2022-02-01 15:32:33 UTC; 6min ago
       Docs: man:rudder(8)
             https://docs.rudder.io
    Process: 1492011 ExecStart=/bin/true (code=exited, status=0/SUCCESS)
   Main PID: 1492011 (code=exited, status=0/SUCCESS)
        CPU: 3ms

Feb 01 15:32:33 vps001 systemd[1]: Starting Rudder agent umbrella service...
Feb 01 15:32:33 vps001 systemd[1]: Finished Rudder agent umbrella service.

rudder-cf-serverd.service - CFEngine file server
     Loaded: loaded (/lib/systemd/system/rudder-cf-serverd.service; enabled; vendor preset: enabled)
     Active: active (running) since Tue 2022-02-01 15:32:33 UTC; 6min ago
   Main PID: 1492017 (cf-serverd)
      Tasks: 1 (limit: 1128)
     Memory: 6.6M
        CPU: 573ms
     CGroup: /system.slice/rudder-cf-serverd.service
             └─1492017 /opt/rudder/bin/cf-serverd --graceful-detach=600 --no-fork --inform

Feb 01 15:32:33 vps001 systemd[1]: Started CFEngine file server.
Feb 01 15:32:36 vps001 cf-serverd[1492017]:   notice: Server is starting...
Feb 01 15:32:36 vps001 cf-serverd[1492017]: CFEngine(server) rudder Server is starting...

rudder-cf-execd.service - CFEngine Execution Scheduler
     Loaded: loaded (/lib/systemd/system/rudder-cf-execd.service; enabled; vendor preset: enabled)
     Active: active (running) since Tue 2022-02-01 15:32:33 UTC; 6min ago
   Main PID: 1492015 (cf-execd)
      Tasks: 1 (limit: 1128)
     Memory: 104.2M
        CPU: 22.877s
     CGroup: /system.slice/rudder-cf-execd.service
             └─1492015 /opt/rudder/bin/cf-execd --no-fork

Feb 01 15:37:31 vps001 cf-agent[1494209]: CFEngine(agent) rudder R: @@sudoParameters@@result_success@@32377fd7-02fd-43d0-aab7-28460>
Feb 01 15:37:31 vps001 cf-agent[1494209]: CFEngine(agent) rudder Deleted file '/var/rudder/tmp/check_ssh_key_distribution//root.aut>
Feb 01 15:37:31 vps001 cf-agent[1494209]: CFEngine(agent) rudder R: @@sshKeyDistribution@@result_na@@32377fd7-02fd-43d0-aab7-28460a>
Feb 01 15:37:31 vps001 cf-agent[1494209]: CFEngine(agent) rudder R: @@sshKeyDistribution@@result_na@@32377fd7-02fd-43d0-aab7-28460a>
Feb 01 15:37:31 vps001 cf-agent[1494209]: CFEngine(agent) rudder R: @@sshKeyDistribution@@result_na@@32377fd7-02fd-43d0-aab7-28460a>
Feb 01 15:37:31 vps001 cf-agent[1494209]: CFEngine(agent) rudder R: @@sshKeyDistribution@@result_na@@32377fd7-02fd-43d0-aab7-28460a>
Feb 01 15:37:31 vps001 cf-agent[1494209]: CFEngine(agent) rudder R: @@sshKeyDistribution@@result_na@@32377fd7-02fd-43d0-aab7-28460a>
Feb 01 15:37:31 vps001 cf-agent[1494209]: CFEngine(agent) rudder R: @@sshKeyDistribution@@result_na@@32377fd7-02fd-43d0-aab7-28460a>
Feb 01 15:37:31 vps001 cf-agent[1494209]: CFEngine(agent) rudder R: @@Common@@result_na@@hasPolicyServer-root@@common-hasPolicyServ>
Feb 01 15:37:31 vps001 cf-agent[1494209]: CFEngine(agent) rudder R: @@Common@@control@@rudder@@run@@0@@end@@20220128-232623-d3d4d1b>

ls -l /lib/systemd/system/rudder*
-rw-r--r-- 1 root root 469 Nov 22  2017 /lib/systemd/system/rudder-agent.service
-rw-r--r-- 1 root root 512 Nov 22  2017 /lib/systemd/system/rudder-cf-execd.service
-rw-r--r-- 1 root root 522 Nov 22  2017 /lib/systemd/system/rudder-cf-serverd.service


Files

clipboard-202202021442-i6khe.png (43.7 KB) clipboard-202202021442-i6khe.png Nicolas CHARLES, 2022-02-02 14:42
Actions #1

Updated by Nigel Mundy 10 months ago

Adding attachment 11

Actions #2

Updated by Nigel Mundy 10 months ago

Followup - it seems to be Erroring on the Attachment uploads. Please provide an email address and I will provide them that way.

Actions #3

Updated by Nicolas CHARLES 10 months ago

Hi, you can send it to nch[at]rudder.io, i'll attach it myself

Actions #4

Updated by Nicolas CHARLES 10 months ago

Actions #5

Updated by Nigel Mundy 10 months ago

Nicolas CHARLES wrote in #note-3:

Hi, you can send it to nch[at]rudder.io, i'll attach it myself

Sent by email in one zip.

Regards

/\/igel.

Actions #6

Updated by Alexis Mousset 10 months ago

  • Description updated (diff)
Actions #7

Updated by Alexis Mousset 10 months ago

Hi, thanks, could you also send us the content of /var/log/rudder/agent-check/check.log?

Actions #8

Updated by Nigel Mundy 10 months ago

Having checked that log, the timestamp of that file indicates it hasnt been updated since we have upgraded from v6.2.12

root@vps001:~# ls -l /var/log/rudder/agent-check/
total 4.0K
-rw-r--r-- 1 root root 2.9K Jan 28 18:16 check.log@

and checking the cronjob, that is the correct file:

cat /etc/cron.d/rudder-agent
# Cron file for Rudder
#
# Will manually run cf-agent in case cf-execd is no longer running. cf-agent will fire up a new cf-execd.
#
# To temporarily avoid this behaviour, run 'rudder agent disable'

0,5,10,15,20,25,30,35,40,45,50,55 * * * * root /opt/rudder/bin/rudder agent check -q >> /var/log/rudder/agent-check/check.log 2>&1

Running it manually generates no error output either:

root@vps001:~# /opt/rudder/bin/rudder agent check -c
FINISH: Rudder agent check ran properly, please look at messages above to see if there has been any error.

Verbatim contents of the agent check log:

  notice: Trusting new key: MD5=d6cff0254d641112dbaa0c0b46715108
R: Initial policies have been successfully downloaded from the policy server
rudder-cf-serverd: enabled
rudder-cf-execd: enabled
rudder-cf-serverd: enabled
rudder-cf-execd: enabled
Rudder agent 6.0.1-debian10
Node uuid: 90a73114-c1e0-469d-9c80-604b6fca4252
Start execution with config [0]

info     Rudder agent was run on a subset of policies - not all policies were checked

## Summary #####################################################################
1 components verified in 3 directives
   => 1 components in Enforce mode
      -> 1 compliant
Execution time: 7.20s
################################################################################
rudder-cf-serverd: enabled
rudder-cf-execd: enabled
rudder-cf-serverd: enabled
rudder-cf-execd: enabled
rudder-cf-serverd: enabled
rudder-cf-execd: enabled
rudder-cf-serverd: enabled
rudder-cf-execd: enabled
rudder-cf-serverd: enabled
rudder-cf-execd: enabled
rudder-cf-serverd: enabled
rudder-cf-execd: enabled
rudder-cf-serverd: enabled
rudder-cf-execd: enabled
rudder-cf-serverd: enabled
rudder-cf-execd: enabled
/var/rudder/cfengine-community/inputs/promises.cf:228:0: error: Duplicate definition of bundle abort_rudder with type agent
/var/rudder/ncf/common/20_cfe_basics/abort.cf:69:0: error: Duplicate definition of bundle abort_rudder with type agent
R: Initial policies have been successfully downloaded from the policy server
/var/rudder/cfengine-community/inputs/promises.cf:228:0: error: Duplicate definition of bundle abort_rudder with type agent
/var/rudder/ncf/common/20_cfe_basics/abort.cf:69:0: error: Duplicate definition of bundle abort_rudder with type agent
   error: Failed to establish TLS connection: underlying network error ()
   error: No suitable server found
   error: Failed to establish TLS connection: underlying network error ()
   error: No suitable server found
R: Initial policies have been successfully downloaded from the policy server
error: Rudder agent promises could not be updated.
R: Initial policies have been successfully downloaded from the policy server
rudder agent check is already running
Terminated
Terminated
Rudder agent 6.2.12-debian11
Node uuid: 90a73114-c1e0-469d-9c80-604b6fca4252
M| State         Technique                 Component                 Key                Message
E| repaired      Common                    Compute inventory splay                      Scheduling rudder_run_inventory was repaired
Start execution with config [20220121-145004-ceb3dfc]

info     Rudder agent was run on a subset of policies - not all policies were checked

## Summary #####################################################################
2 components verified in 4 directives
   => 2 components in Enforce mode
      -> 1 compliant
      -> 1 repaired
Execution time: 6.72s
################################################################################
Reports sent.

Actions #9

Updated by Nicolas CHARLES 10 months ago

I've been able to reproduce

the command rudder agent check kills rudder-cf-execd when there are more than one instance
and when the agent runs, there is more than one instance...

Actions #10

Updated by Nicolas CHARLES 10 months ago

  • Target version set to 7.0.1

This is specific 7.0 - introduce in agent version 3.18

 cf-execd now runs cf-agent from a child process instead of a
      thread on POSIX systems (ENT-6182)

Actions #11

Updated by Nicolas CHARLES 10 months ago

  • Status changed from New to In progress
  • Assignee set to Nicolas CHARLES
Actions #12

Updated by Nicolas CHARLES 10 months ago

  • Status changed from In progress to Pending technical review
  • Assignee changed from Nicolas CHARLES to Benoît PECCATTE
  • Pull Request set to https://github.com/Normation/rudder-agent/pull/360
Actions #13

Updated by Nicolas CHARLES 10 months ago

  • Status changed from Pending technical review to Pending release
Actions #15

Updated by Vincent MEMBRÉ 10 months ago

  • Status changed from Pending release to Released

This bug has been fixed in Rudder 7.0.1 which was released today.

Actions

Also available in: Atom PDF