Project

General

Profile

Bug #12243

Agent components should not try to load failsafe.cf when policies are broken

Added by François ARMAND 9 months ago. Updated 8 months ago.

Status:
Released
Priority:
N/A
Category:
Server components
Target version:
Severity:
Critical - prevents main use of Rudder | no workaround | data loss | security
User visibility:
Operational - other Techniques | Technique editor | Rudder settings
Effort required:
Priority:
76
Tags:

Description

I upgraded from rudder 4.2 (rudder-webapp_4.2.5~rc1~git201803160242-stretch0_all.deb) towards rudder-webapp_4.3.0~rc2~git201803200032-stretch0_all.deb on debian 9.

After upgrade, webapp is working, I can log, but nodes connected to the server can't get their policies anymore:

root@relay:/home/vagrant# rudder agent update -i
rudder     info: Failed to connect to server: Connection refused
rudder     info: No server is responding on port: 5309
rudder     info: Unable to establish connection to 'server'
   error: No suitable server found
rudder     info: Automatically promoting context scope for 'rudder_promises_generated_tmp_file_error' to namespace visibility, due to persistence
rudder     info: Promise belongs to bundle 'update_action' in file '/var/rudder/cfengine-community/inputs/common/1.0/update.cf' near line 223
rudder     info: Failed to connect to server: Connection refused
rudder     info: No server is responding on port: 5309
rudder     info: Unable to establish connection to 'server'
   error: No suitable server found
rudder     info: Automatically promoting context scope for 'rudder_ncf_hash_update_error' to namespace visibility, due to persistence
rudder     info: Promise belongs to bundle 'update_action' in file '/var/rudder/cfengine-community/inputs/common/1.0/update.cf' near line 231
rudder     info: Failed to connect to server: Connection refused
rudder     info: No server is responding on port: 5309
rudder     info: Unable to establish connection to 'server'
   error: No suitable server found
rudder     info: Automatically promoting context scope for 'rudder_ncf_hash_update_error' to namespace visibility, due to persistence
rudder     info: Promise belongs to bundle 'update_action' in file '/var/rudder/cfengine-community/inputs/common/1.0/update.cf' near line 237
R: *********************************************************************************
* rudder-agent could not get an updated configuration from the policy server.   *
* This can be caused by:                                                        *
*   * a networking issue                                                        *
*   * an unavailable server                                                     *
*   * if the node's IP in not if the allowed networks of its policy server.     *
* Any existing configuration policy will continue to be applied without change. *
*********************************************************************************
ok: Rudder agent promises were updated.

On the server, cf-execd is started, and the date is coherent with the update time:

root@server:/home/vagrant/plop# ps aux | grep cf-
root      9011  0.0  0.5 105040  8972 ?        Ss   09:40   0:00 /var/rudder/cfengine-community/bin/cf-execd --no-fork
root     12321  0.0  0.5  38632  8912 ?        Ss   09:40   0:00 /var/rudder/cfengine-community/bin/cf-serverd --no-fork

But on system logs, I have:

Mar 20 09:40:14 server systemd[1]: Stopped Rudder agent umbrella service.
Mar 20 09:40:15 server systemd[1]: Started CFEngine Execution Scheduler.
Mar 20 09:40:15 server systemd[1]: Starting Rudder agent umbrella service...
Mar 20 09:40:15 server systemd[1]: Started CFEngine file server.
Mar 20 09:40:15 server systemd[1]: Started Rudder agent umbrella service.
Mar 20 09:40:15 server cf-serverd[9013]:    error: Can't stat file '/var/rudder/ncf/find: '/var/rudder/cfengine-community/state/ncf-exclude-cache-3.10.3/_var_rudder_ncf_common_20_cfe_basics': No such file or directory' for parsing.
Mar 20 09:40:15 server cf-serverd[9013]: CFEngine(server) rudder Can't stat file '/var/rudder/ncf/find: '/var/rudder/cfengine-community/state/ncf-exclude-cache-3.10.3/_var_rudder_ncf_common_20_cfe_basics': No such file or directory
Mar 20 09:40:15 server systemd[1]: rudder-cf-serverd.service: Main process exited, code=exited, status=1/FAILURE
Mar 20 09:40:15 server systemd[1]: rudder-cf-serverd.service: Unit entered failed state.
Mar 20 09:40:15 server systemd[1]: rudder-cf-serverd.service: Failed with result 'exit-code'.
...

Mar 20 09:40:26 server cf-serverd[12321]: /var/rudder/cfengine-community/inputs/promises.cf:362:0: error: Undefined bundle _create_current_expected_reports_file with type usebundle
Mar 20 09:40:26 server cf-serverd[12321]: /var/rudder/cfengine-community/inputs/promises.cf:741:0: error: Undefined bundle _clean_old_expected_reports_file with type usebundle
Mar 20 09:40:26 server cf-serverd[12321]: /var/rudder/cfengine-community/inputs/rudder-directives.cf:37:0: error: Undefined bundle current_technique_report_info with type usebundle
Mar 20 09:40:26 server cf-serverd[12321]:    error: Policy failed validation with command '"/var/rudder/cfengine-community/bin/cf-promises" -c "/var/rudder/cfengine-community/inputs/promises.cf"'
Mar 20 09:40:26 server cf-serverd[12321]:    error: CFEngine was not able to get confirmation of promises from cf-promises, so going to failsafe
Mar 20 09:40:26 server cf-serverd[12321]:    error: CFEngine failsafe.cf: /var/rudder/cfengine-community/inputs /var/rudder/cfengine-community/inputs/failsafe.cf
Mar 20 09:40:26 server cf-serverd[12321]: CFEngine(server)  Policy failed validation with command '"/var/rudder/cfengine-community/bin/cf-promises" -c "/var/rudder/cfengine-community/inputs/promises.cf"'
Mar 20 09:40:26 server cf-serverd[12321]: CFEngine(server)  CFEngine was not able to get confirmation of promises from cf-promises, so going to failsafe
Mar 20 09:40:26 server cf-serverd[12321]: CFEngine(server)  CFEngine failsafe.cf: /var/rudder/cfengine-community/inputs /var/rudder/cfengine-community/inputs/failsafe.cf
Mar 20 09:40:26 server cf-serverd[12321]:   notice: Server is starting...
Mar 20 09:40:26 server cf-serverd[12321]: CFEngine(server) rudder Server is starting...
...

Mar 20 09:40:40 server cf-serverd[12321]:   notice: Rereading policy file '/var/rudder/cfengine-community/inputs/failsafe.cf'
Mar 20 09:40:40 server cf-serverd[12321]: CFEngine(server) rudder Rereading policy file '/var/rudder/cfengine-community/inputs/failsafe.cf'
Mar 20 09:41:28 server cf-serverd[12321]:   notice: Rereading policy file '/var/rudder/cfengine-community/inputs/failsafe.cf'
Mar 20 09:41:28 server cf-serverd[12321]: CFEngine(server) rudder Rereading policy file '/var/rudder/cfengine-community/inputs/failsafe.cf'
Mar 20 09:41:55 server cf-serverd[12321]:   notice: Rereading policy file '/var/rudder/cfengine-community/inputs/failsafe.cf'
Mar 20 09:41:55 server cf-serverd[12321]: CFEngine(server) rudder Rereading policy file '/var/rudder/cfengine-community/inputs/failsafe.cf'
Mar 20 09:42:12 server cf-serverd[12321]:   notice: Rereading policy file '/var/rudder/cfengine-community/inputs/failsafe.cf'
Mar 20 09:42:12 server cf-serverd[12321]: CFEngine(server) rudder Rereading policy file '/var/rudder/cfengine-community/inputs/failsafe.cf'
Mar 20 09:43:29 server cf-serverd[12321]:   notice: Rereading policy file '/var/rudder/cfengine-community/inputs/failsafe.cf'
Mar 20 09:43:29 server cf-serverd[12321]: CFEngine(server) rudder Rereading policy file '/var/rudder/cfengine-community/inputs/failsafe.cf'
...

Mar 20 09:45:27 server cf-agent[19573]: CFEngine(agent) rudder R: @@Common@@control@@rudder@@run@@0@@start@@20180320-094124-53ed56e5@@2018-03-20 
  ....
Mar 20 09:45:29 server cf-serverd[12321]:   notice: Rereading policy file '/var/rudder/cfengine-community/inputs/failsafe.cf'
Mar 20 09:45:29 server cf-serverd[12321]: CFEngine(server) rudder Rereading policy file '/var/rudder/cfengine-community/inputs/failsafe.cf'
  ....
Mar 20 09:45:31 server cf-agent[19573]: CFEngine(agent) rudder R: @@Common@@control@@rudder@@run@@0@@end@@20180320-094124-53ed56e5@@2018-03-20 09:45:26+00:00##root@#End execution


Subtasks

Bug #12265: rudder agent check should trigger failsafe run when promises are brokenReleasedAlexis MOUSSET

Associated revisions

Revision 774e96dc (diff)
Added by Alexis MOUSSET 9 months ago

Fixes #12243: Upgrade 4.2 -> 4.3: cf-serverd starts reading failsafe.cf and never goes back (node policy update impossible)

History

#1 Updated by François ARMAND 9 months ago

It does not seems to correct itself (even when running by hand rudder agent run -u -i)

#2 Updated by François ARMAND 9 months ago

Just reloading cf-serverd does not work, as "failsafe.cf" is used:

root@server:/var/rudder/configuration-repository# /opt/rudder/etc/hooks.d/policy-generation-finished/50-reload-policy-file-server
# in journalctl:
Mar 20 10:28:21 server cf-serverd[12321]:   notice: Rereading policy file '/var/rudder/cfengine-community/inputs/failsafe.cf'
Mar 20 10:28:21 server cf-serverd[12321]: CFEngine(server) rudder Rereading policy file '/var/rudder/cfengine-community/inputs/failsafe.cf'

#3 Updated by François ARMAND 9 months ago

Forcing full policy regeneration does not heal the problem, cf-serverd is still reading failsafe.cf.

#4 Updated by François ARMAND 9 months ago

  • Subject changed from Upgrade 4.2 -> 4.3: cf-serverd error leading to node update impossible to Upgrade 4.2 -> 4.3: cf-serverd starts reading failsafe.cf and never goes back (node policy update impossible)
  • Severity set to Critical - prevents main use of Rudder | no workaround | data loss | security
  • User visibility set to Operational - other Techniques | Technique editor | Rudder settings
  • Priority changed from 0 to 76

#5 Updated by François ARMAND 9 months ago

Running "rudder server debug xxxx" makes "xxxx" able to get its policies, but as soon as the debug server is stoppend, we fall back in bad state with impossible update.

#6 Updated by Nicolas CHARLES 9 months ago

Bahaviour is not the same in 4.1.10
Breaking promises.cf results in

Mar 20 11:45:12 orchestrateur-4 [25127]: CFEngine(common)  There are syntax errors in policy files
Mar 20 11:45:12 orchestrateur-4 cf-serverd[32336]: CFEngine(server) rudder Policy failed validation with command '"/var/rudder/cfengine-community/bin/cf-promises" -c "/var/rudder/cfengine-community/inputs/promises.cf"'

and it didn't reload the promises nor acl (cf-serverd is still functionnal)

#7 Updated by Nicolas CHARLES 9 months ago

in 4.2, I have

Mar 20 11:52:05 orchestrateur-6 [10077]: CFEngine(common)  There are syntax errors in policy files
Mar 20 11:51:06 orchestrateur-6 cf-serverd[16990]: CFEngine(server) rudder Policy failed validation with command '"/var/rudder/cfengine-community/bin/cf-promises" -c "/var/rudder/cfengine-community/inputs/promises.cf"'

#8 Updated by Nicolas CHARLES 9 months ago

actually, in 4.3, it behaves the same
If I manually corrupt promises.cf and reload cf-serverd policies, I get

Mar 20 05:55:01 server cf-serverd[8667]: CFEngine(server) rudder Policy failed validation with command '"/var/rudder/cfengine-community/bin/cf-promises" -c "/var/rudder/cfengine-community/inputs/promises
Mar 20 05:55:01 server cf-serverd[8667]: /var/rudder/cfengine-community/inputs/promises.cf:24:3: error: syntax error
Mar 20 05:55:01 server cf-serverd[8667]: iii
Mar 20 05:55:01 server cf-serverd[8667]: ^
Mar 20 05:55:01 server cf-serverd[8667]: /var/rudder/cfengine-community/inputs/promises.cf:24:3: error: Expected 'bundle' or 'body' keyword, wrong input 'iii'
Mar 20 05:55:01 server cf-serverd[8667]: iii
Mar 20 05:55:01 server cf-serverd[8667]: ^
Mar 20 05:55:01 server cf-serverd[8667]: error: There are syntax errors in policy files
Mar 20 05:55:01 server cf-serverd[8667]: error: Policy failed validation with command '"/var/rudder/cfengine-community/bin/cf-promises" -c "/var/rudder/cfengine-community/inputs/promises.cf"'

so it does not switch to failsafe.cf in terminal
when it's done automatically, it breaks

#9 Updated by Alexis MOUSSET 9 months ago

There should be an error log saying:

Failsafe condition triggered. Interactive session detected, skipping failsafe.cf execution.

do you see it?

#10 Updated by François ARMAND 9 months ago

Restarting rudder-agent (service rudder-agent restart) repairs the problem:

Mar 20 11:05:40 server systemd[1]: Stopped Rudder agent umbrella service.
Mar 20 11:05:40 server systemd[1]: Stopping Rudder agent umbrella service...
Mar 20 11:05:40 server systemd[1]: Starting Rudder agent umbrella service...
Mar 20 11:05:40 server systemd[1]: Stopping CFEngine Execution Scheduler...
Mar 20 11:05:40 server cf-serverd[12321]:   notice: Cleaning up and exiting...
Mar 20 11:05:40 server cf-serverd[12321]: CFEngine(server) rudder Cleaning up and exiting...
Mar 20 11:05:40 server systemd[1]: Stopping CFEngine file server...
Mar 20 11:05:40 server systemd[1]: Stopped CFEngine Execution Scheduler.
Mar 20 11:05:40 server systemd[1]: Stopped CFEngine file server.
Mar 20 11:05:40 server systemd[1]: Started Rudder agent umbrella service.
Mar 20 11:05:40 server systemd[1]: Started CFEngine file server.
Mar 20 11:05:40 server systemd[1]: Started CFEngine Execution Scheduler.
Mar 20 11:05:41 server cf-serverd[12037]:   notice: Server is starting...
Mar 20 11:05:41 server cf-serverd[12037]: CFEngine(server) rudder Server is starting...

And then policy updates from nodes are possible again.

#11 Updated by Nicolas CHARLES 9 months ago

breaking intentionnaly generated promises in 4.1 does not trigger the issue

Mar 20 13:43:56 orchestrateur-4 cf-serverd[32336]: CFEngine(server) rudder Policy failed validation with command '"/var/rudder/cfengine-community/bin/cf-promises" -c "/var/rudder/cfengine-community/inputs/promises" -c "/var/rudder/cfengine-community/inputs/promises.cf"'

#12 Updated by Nicolas CHARLES 9 months ago

This is not related to systemd
I stopped cf-serverd, and started it manually, with broken promises
it went

root@server:/home/vagrant# /var/rudder/cfengine-community/bin/cf-serverd  -v --no-fork
 verbose: Could not open extension plugin 'cfengine-enterprise.so' from '/var/rudder/cfengine-community/lib/cfengine-enterprise.so': (not installed)
 verbose:  CFEngine Core 3.10.3
 verbose: ----------------------------------------------------------------
 verbose:  Initialization preamble 
 verbose: ----------------------------------------------------------------
 verbose: Default port for cfengine is 5308
 verbose: Work directory is /var/rudder/cfengine-community
 verbose: Making sure that internal directories are private...
 verbose: Checking integrity of the trusted workdir
 verbose: Checking integrity of the state database
 verbose: Checking integrity of the module directory
 verbose: Checking integrity of the PKI directory
 verbose: Loaded private key at '/var/rudder/cfengine-community/ppkeys/localhost.priv'
 verbose: Loaded public key '/var/rudder/cfengine-community/ppkeys/localhost.pub'
 verbose: Reference time set to 'Tue Mar 20 09:01:28 2018'
 verbose: CFEngine Core 3.10.3 - ready
 verbose: ----------------------------------------------------------------
 verbose:  Environment discovery 
 verbose: ----------------------------------------------------------------
 verbose: Host name is: server
 verbose: Operating System Type is linux
 verbose: Operating System Release is 4.4.21-69-default
 verbose: Architecture = x86_64
 verbose: CFEngine detected operating system description is linux
 verbose: The time is now Tue Mar 20 09:01:28 2018
 verbose: Additional hard class defined as: 64_bit
 verbose: Additional hard class defined as: linux_4_4_21_69_default
 verbose: Additional hard class defined as: linux_x86_64
 verbose: Additional hard class defined as: linux_x86_64_4_4_21_69_default
 verbose: GNU autoconf class from compile time: compiled_on_linux_gnu
 verbose: Address given by nameserver: 192.168.42.2
 verbose: No interface exception file /var/rudder/cfengine-community/inputs/ignore_interfaces.rx
 verbose: Interface 1: lo
 verbose: Interface 2: eth0
 verbose: IP address of host set to 10.0.2.15
 verbose: Interface 3: eth1
 verbose: Trying to locate my IPv6 address
 verbose: Found IPv6 address fe80::a00:27ff:fe9f:3206
 verbose: Found IPv6 address fe80::a00:27ff:fe88:bef0
 verbose: Reading netstat info from /proc/net/netstat
 verbose: Reading (null) info from /proc/net/route
 verbose: Reading (null) info from /proc/net/snmp6
 verbose: Reading (null) info from /proc/net/ipv6_route
 verbose: Reading (null) info from /proc/net/if_inet6
 verbose: Reading interfaces_data info from /proc/net/dev
 verbose: Looking for environment from cf-monitord...
 verbose: Unable to detect environment from cf-monitord
 verbose: This appears to be a SUSE system.
 verbose: Looking for SUSE enterprise info in 'SUSE Linux Enterprise Server 12 (x86_64)'
 verbose: Discovered SUSE version SuSE_12
 verbose: Found 1 processor
 verbose: Loading persistent classes
 verbose: Persistent class 'heartbeat_sent' for 3 more minutes
 verbose: Adding persistent class 'heartbeat_sent' to heap
 verbose: Persistent class 'inventory_sent' for 233 more minutes
 verbose: Adding persistent class 'inventory_sent' to heap
 verbose: Persistent class 'rudder_ncf_common_updated_ok' for 3 more minutes
 verbose: Adding persistent class 'rudder_ncf_common_updated_ok' to heap
 verbose: Persistent class 'rudder_ncf_local_updated_ok' for 3 more minutes
 verbose: Adding persistent class 'rudder_ncf_local_updated_ok' to heap
 verbose: Persistent class 'site_ok' for 9 more minutes
 verbose: Adding persistent class 'site_ok' to heap
 verbose: Loading JSON augments from '/var/rudder/cfengine-community/inputs/def.json' (input dir '/var/rudder/cfengine-community/inputs', input file '/var/rudder/cfengine-community/inputs/promises.cf'
 verbose: could not load JSON augments from '/var/rudder/cfengine-community/inputs/def.json'
 verbose: This agent is bootstrapped to: 127.0.0.1
 verbose: Input file '/var/rudder/cfengine-community/inputs/promises.cf' has changed since the last policy read attempt (file is newer than previous)
 verbose: Input file is changed since last validation, validating it
 verbose: Verifying the syntax of the inputs...
 verbose: Checking policy with command '"/var/rudder/cfengine-community/bin/cf-promises" -c "/var/rudder/cfengine-community/inputs/promises.cf"'
/var/rudder/cfengine-community/inputs/promises.cf:24:3: error: syntax error
iii
  ^
/var/rudder/cfengine-community/inputs/promises.cf:24:3: error: Expected 'bundle' or 'body' keyword, wrong input 'iii'
iii
  ^
   error: There are syntax errors in policy files
   error: Policy failed validation with command '"/var/rudder/cfengine-community/bin/cf-promises" -c "/var/rudder/cfengine-community/inputs/promises.cf"'
   error: Failsafe condition triggered. Interactive session detected, skipping failsafe.cf execution.
   error: Error reading CFEngine policy. Exiting...

reloading with /opt/rudder/etc/hooks.d/policy-generation-finished/50-reload-policy-file-server resulted in the following log

Mar 20 09:12:11 server cf-serverd[21809]: CFEngine(server) rudder Rereading policy file '/var/rudder/cfengine-community/inputs/failsafe.cf'
Mar 20 09:12:11 server cf-serverd[21809]: notice: Rereading policy file '/var/rudder/cfengine-community/inputs/failsafe.cf'

Please note that running again the cf-serverd tells me:

Mar 20 09:11:46 server cf-promises[22300]: CFEngine(common)  There are syntax errors in policy files
Mar 20 09:11:46 server cf-serverd[22298]: CFEngine(server)  Policy failed validation with command '"/var/rudder/cfengine-community/bin/cf-promises" -c "/var/rudder/cfengine-community/inputs/promises.cf"'
Mar 20 09:11:46 server cf-serverd[22298]: CFEngine(server)  Failsafe condition triggered. Interactive session detected, skipping failsafe.cf execution.
Mar 20 09:11:46 server cf-serverd[22298]: CFEngine(server)  Error reading CFEngine policy. Exiting...

#13 Updated by Nicolas CHARLES 9 months ago

Interestingly, the bad cf-serverd is listening on port 5308 rather than 5309

#14 Updated by Alexis MOUSSET 9 months ago

  • Assignee set to Alexis MOUSSET

#15 Updated by Alexis MOUSSET 9 months ago

What happens here:

  • This behavior always existed but was never visible because agent restart with broken policies were done only during installation/upgrade, hence very likely in a tty, and CFEngine does not switch to failsafe in this case
  • Since we are using systemd units, "service rudder agent restart" uses systemctl, CFEngine sees no tty and reloads failsafe.

This can be reproduced on any previous Rudder version by restarting the service (with broken policies) with nohup (for example).

#16 Updated by Alexis MOUSSET 9 months ago

We could:

  • Patch CFEngine to disable the switch to failsafe (but that won't probably be acceptable upstream, and requires improving "rudder agent check" to trigger a failsafe run when promises are broken)
  • ?

#17 Updated by Benoît PECCATTE 9 months ago

Doesn't ruder agent check already run failsafe when promises are broken ?

#18 Updated by Alexis MOUSSET 9 months ago

No, there are two cases:

  • failsafe is broken -> reset initial policy + failsafe + run
  • promises are broken -> clear update flag to force update at next run

#19 Updated by Alexis MOUSSET 9 months ago

  • Target version changed from 4.3.0~rc2 to 4.1.11

#20 Updated by Alexis MOUSSET 9 months ago

Target 4.1, this should fix strange bugs where node couldn't connect while the server was running.

#21 Updated by Alexis MOUSSET 9 months ago

  • Status changed from New to In progress

#22 Updated by Alexis MOUSSET 9 months ago

  • Status changed from In progress to Pending technical review
  • Assignee changed from Alexis MOUSSET to Benoît PECCATTE
  • Pull Request set to https://github.com/Normation/rudder-packages/pull/1530

#23 Updated by Alexis MOUSSET 9 months ago

  • Subject changed from Upgrade 4.2 -> 4.3: cf-serverd starts reading failsafe.cf and never goes back (node policy update impossible) to Agent components should not try to load failsafe.cf when policies are broken

#24 Updated by Alexis MOUSSET 9 months ago

  • Status changed from Pending technical review to Pending release

#25 Updated by Alexis MOUSSET 8 months ago

  • Status changed from Pending release to Released

This bug has been fixed in Rudder 4.1.11, 4.2.5 and 4.3.0~rc3 which were released today.

Also available in: Atom PDF