Bug #12243
closedAgent components should not try to load failsafe.cf when policies are broken
Description
I upgraded from rudder 4.2 (rudder-webapp_4.2.5~rc1~git201803160242-stretch0_all.deb) towards rudder-webapp_4.3.0~rc2~git201803200032-stretch0_all.deb on debian 9.
After upgrade, webapp is working, I can log, but nodes connected to the server can't get their policies anymore:
root@relay:/home/vagrant# rudder agent update -i rudder info: Failed to connect to server: Connection refused rudder info: No server is responding on port: 5309 rudder info: Unable to establish connection to 'server' error: No suitable server found rudder info: Automatically promoting context scope for 'rudder_promises_generated_tmp_file_error' to namespace visibility, due to persistence rudder info: Promise belongs to bundle 'update_action' in file '/var/rudder/cfengine-community/inputs/common/1.0/update.cf' near line 223 rudder info: Failed to connect to server: Connection refused rudder info: No server is responding on port: 5309 rudder info: Unable to establish connection to 'server' error: No suitable server found rudder info: Automatically promoting context scope for 'rudder_ncf_hash_update_error' to namespace visibility, due to persistence rudder info: Promise belongs to bundle 'update_action' in file '/var/rudder/cfengine-community/inputs/common/1.0/update.cf' near line 231 rudder info: Failed to connect to server: Connection refused rudder info: No server is responding on port: 5309 rudder info: Unable to establish connection to 'server' error: No suitable server found rudder info: Automatically promoting context scope for 'rudder_ncf_hash_update_error' to namespace visibility, due to persistence rudder info: Promise belongs to bundle 'update_action' in file '/var/rudder/cfengine-community/inputs/common/1.0/update.cf' near line 237 R: ********************************************************************************* * rudder-agent could not get an updated configuration from the policy server. * * This can be caused by: * * * a networking issue * * * an unavailable server * * * if the node's IP in not if the allowed networks of its policy server. * * Any existing configuration policy will continue to be applied without change. * ********************************************************************************* ok: Rudder agent promises were updated.
On the server, cf-execd is started, and the date is coherent with the update time:
root@server:/home/vagrant/plop# ps aux | grep cf- root 9011 0.0 0.5 105040 8972 ? Ss 09:40 0:00 /var/rudder/cfengine-community/bin/cf-execd --no-fork root 12321 0.0 0.5 38632 8912 ? Ss 09:40 0:00 /var/rudder/cfengine-community/bin/cf-serverd --no-fork
But on system logs, I have:
Mar 20 09:40:14 server systemd[1]: Stopped Rudder agent umbrella service. Mar 20 09:40:15 server systemd[1]: Started CFEngine Execution Scheduler. Mar 20 09:40:15 server systemd[1]: Starting Rudder agent umbrella service... Mar 20 09:40:15 server systemd[1]: Started CFEngine file server. Mar 20 09:40:15 server systemd[1]: Started Rudder agent umbrella service. Mar 20 09:40:15 server cf-serverd[9013]: error: Can't stat file '/var/rudder/ncf/find: '/var/rudder/cfengine-community/state/ncf-exclude-cache-3.10.3/_var_rudder_ncf_common_20_cfe_basics': No such file or directory' for parsing. Mar 20 09:40:15 server cf-serverd[9013]: CFEngine(server) rudder Can't stat file '/var/rudder/ncf/find: '/var/rudder/cfengine-community/state/ncf-exclude-cache-3.10.3/_var_rudder_ncf_common_20_cfe_basics': No such file or directory Mar 20 09:40:15 server systemd[1]: rudder-cf-serverd.service: Main process exited, code=exited, status=1/FAILURE Mar 20 09:40:15 server systemd[1]: rudder-cf-serverd.service: Unit entered failed state. Mar 20 09:40:15 server systemd[1]: rudder-cf-serverd.service: Failed with result 'exit-code'. ... Mar 20 09:40:26 server cf-serverd[12321]: /var/rudder/cfengine-community/inputs/promises.cf:362:0: error: Undefined bundle _create_current_expected_reports_file with type usebundle Mar 20 09:40:26 server cf-serverd[12321]: /var/rudder/cfengine-community/inputs/promises.cf:741:0: error: Undefined bundle _clean_old_expected_reports_file with type usebundle Mar 20 09:40:26 server cf-serverd[12321]: /var/rudder/cfengine-community/inputs/rudder-directives.cf:37:0: error: Undefined bundle current_technique_report_info with type usebundle Mar 20 09:40:26 server cf-serverd[12321]: error: Policy failed validation with command '"/var/rudder/cfengine-community/bin/cf-promises" -c "/var/rudder/cfengine-community/inputs/promises.cf"' Mar 20 09:40:26 server cf-serverd[12321]: error: CFEngine was not able to get confirmation of promises from cf-promises, so going to failsafe Mar 20 09:40:26 server cf-serverd[12321]: error: CFEngine failsafe.cf: /var/rudder/cfengine-community/inputs /var/rudder/cfengine-community/inputs/failsafe.cf Mar 20 09:40:26 server cf-serverd[12321]: CFEngine(server) Policy failed validation with command '"/var/rudder/cfengine-community/bin/cf-promises" -c "/var/rudder/cfengine-community/inputs/promises.cf"' Mar 20 09:40:26 server cf-serverd[12321]: CFEngine(server) CFEngine was not able to get confirmation of promises from cf-promises, so going to failsafe Mar 20 09:40:26 server cf-serverd[12321]: CFEngine(server) CFEngine failsafe.cf: /var/rudder/cfengine-community/inputs /var/rudder/cfengine-community/inputs/failsafe.cf Mar 20 09:40:26 server cf-serverd[12321]: notice: Server is starting... Mar 20 09:40:26 server cf-serverd[12321]: CFEngine(server) rudder Server is starting... ... Mar 20 09:40:40 server cf-serverd[12321]: notice: Rereading policy file '/var/rudder/cfengine-community/inputs/failsafe.cf' Mar 20 09:40:40 server cf-serverd[12321]: CFEngine(server) rudder Rereading policy file '/var/rudder/cfengine-community/inputs/failsafe.cf' Mar 20 09:41:28 server cf-serverd[12321]: notice: Rereading policy file '/var/rudder/cfengine-community/inputs/failsafe.cf' Mar 20 09:41:28 server cf-serverd[12321]: CFEngine(server) rudder Rereading policy file '/var/rudder/cfengine-community/inputs/failsafe.cf' Mar 20 09:41:55 server cf-serverd[12321]: notice: Rereading policy file '/var/rudder/cfengine-community/inputs/failsafe.cf' Mar 20 09:41:55 server cf-serverd[12321]: CFEngine(server) rudder Rereading policy file '/var/rudder/cfengine-community/inputs/failsafe.cf' Mar 20 09:42:12 server cf-serverd[12321]: notice: Rereading policy file '/var/rudder/cfengine-community/inputs/failsafe.cf' Mar 20 09:42:12 server cf-serverd[12321]: CFEngine(server) rudder Rereading policy file '/var/rudder/cfengine-community/inputs/failsafe.cf' Mar 20 09:43:29 server cf-serverd[12321]: notice: Rereading policy file '/var/rudder/cfengine-community/inputs/failsafe.cf' Mar 20 09:43:29 server cf-serverd[12321]: CFEngine(server) rudder Rereading policy file '/var/rudder/cfengine-community/inputs/failsafe.cf' ... Mar 20 09:45:27 server cf-agent[19573]: CFEngine(agent) rudder R: @@Common@@control@@rudder@@run@@0@@start@@20180320-094124-53ed56e5@@2018-03-20 .... Mar 20 09:45:29 server cf-serverd[12321]: notice: Rereading policy file '/var/rudder/cfengine-community/inputs/failsafe.cf' Mar 20 09:45:29 server cf-serverd[12321]: CFEngine(server) rudder Rereading policy file '/var/rudder/cfengine-community/inputs/failsafe.cf' .... Mar 20 09:45:31 server cf-agent[19573]: CFEngine(agent) rudder R: @@Common@@control@@rudder@@run@@0@@end@@20180320-094124-53ed56e5@@2018-03-20 09:45:26+00:00##root@#End execution
Updated by François ARMAND over 6 years ago
It does not seems to correct itself (even when running by hand rudder agent run -u -i)
Updated by François ARMAND over 6 years ago
Just reloading cf-serverd does not work, as "failsafe.cf" is used:
root@server:/var/rudder/configuration-repository# /opt/rudder/etc/hooks.d/policy-generation-finished/50-reload-policy-file-server # in journalctl: Mar 20 10:28:21 server cf-serverd[12321]: notice: Rereading policy file '/var/rudder/cfengine-community/inputs/failsafe.cf' Mar 20 10:28:21 server cf-serverd[12321]: CFEngine(server) rudder Rereading policy file '/var/rudder/cfengine-community/inputs/failsafe.cf'
Updated by François ARMAND over 6 years ago
Forcing full policy regeneration does not heal the problem, cf-serverd is still reading failsafe.cf.
Updated by François ARMAND over 6 years ago
- Subject changed from Upgrade 4.2 -> 4.3: cf-serverd error leading to node update impossible to Upgrade 4.2 -> 4.3: cf-serverd starts reading failsafe.cf and never goes back (node policy update impossible)
- Severity set to Critical - prevents main use of Rudder | no workaround | data loss | security
- User visibility set to Operational - other Techniques | Technique editor | Rudder settings
- Priority changed from 0 to 76
Updated by François ARMAND over 6 years ago
Running "rudder server debug xxxx" makes "xxxx" able to get its policies, but as soon as the debug server is stoppend, we fall back in bad state with impossible update.
Updated by Nicolas CHARLES over 6 years ago
Bahaviour is not the same in 4.1.10
Breaking promises.cf results in
Mar 20 11:45:12 orchestrateur-4 [25127]: CFEngine(common) There are syntax errors in policy files Mar 20 11:45:12 orchestrateur-4 cf-serverd[32336]: CFEngine(server) rudder Policy failed validation with command '"/var/rudder/cfengine-community/bin/cf-promises" -c "/var/rudder/cfengine-community/inputs/promises.cf"'
and it didn't reload the promises nor acl (cf-serverd is still functionnal)
Updated by Nicolas CHARLES over 6 years ago
in 4.2, I have
Mar 20 11:52:05 orchestrateur-6 [10077]: CFEngine(common) There are syntax errors in policy files Mar 20 11:51:06 orchestrateur-6 cf-serverd[16990]: CFEngine(server) rudder Policy failed validation with command '"/var/rudder/cfengine-community/bin/cf-promises" -c "/var/rudder/cfengine-community/inputs/promises.cf"'
Updated by Nicolas CHARLES over 6 years ago
actually, in 4.3, it behaves the same
If I manually corrupt promises.cf and reload cf-serverd policies, I get
Mar 20 05:55:01 server cf-serverd[8667]: CFEngine(server) rudder Policy failed validation with command '"/var/rudder/cfengine-community/bin/cf-promises" -c "/var/rudder/cfengine-community/inputs/promises Mar 20 05:55:01 server cf-serverd[8667]: /var/rudder/cfengine-community/inputs/promises.cf:24:3: error: syntax error Mar 20 05:55:01 server cf-serverd[8667]: iii Mar 20 05:55:01 server cf-serverd[8667]: ^ Mar 20 05:55:01 server cf-serverd[8667]: /var/rudder/cfengine-community/inputs/promises.cf:24:3: error: Expected 'bundle' or 'body' keyword, wrong input 'iii' Mar 20 05:55:01 server cf-serverd[8667]: iii Mar 20 05:55:01 server cf-serverd[8667]: ^ Mar 20 05:55:01 server cf-serverd[8667]: error: There are syntax errors in policy files Mar 20 05:55:01 server cf-serverd[8667]: error: Policy failed validation with command '"/var/rudder/cfengine-community/bin/cf-promises" -c "/var/rudder/cfengine-community/inputs/promises.cf"'
so it does not switch to failsafe.cf in terminal
when it's done automatically, it breaks
Updated by Alexis Mousset over 6 years ago
There should be an error log saying:
Failsafe condition triggered. Interactive session detected, skipping failsafe.cf execution.
do you see it?
Updated by François ARMAND over 6 years ago
Restarting rudder-agent (service rudder-agent restart) repairs the problem:
Mar 20 11:05:40 server systemd[1]: Stopped Rudder agent umbrella service. Mar 20 11:05:40 server systemd[1]: Stopping Rudder agent umbrella service... Mar 20 11:05:40 server systemd[1]: Starting Rudder agent umbrella service... Mar 20 11:05:40 server systemd[1]: Stopping CFEngine Execution Scheduler... Mar 20 11:05:40 server cf-serverd[12321]: notice: Cleaning up and exiting... Mar 20 11:05:40 server cf-serverd[12321]: CFEngine(server) rudder Cleaning up and exiting... Mar 20 11:05:40 server systemd[1]: Stopping CFEngine file server... Mar 20 11:05:40 server systemd[1]: Stopped CFEngine Execution Scheduler. Mar 20 11:05:40 server systemd[1]: Stopped CFEngine file server. Mar 20 11:05:40 server systemd[1]: Started Rudder agent umbrella service. Mar 20 11:05:40 server systemd[1]: Started CFEngine file server. Mar 20 11:05:40 server systemd[1]: Started CFEngine Execution Scheduler. Mar 20 11:05:41 server cf-serverd[12037]: notice: Server is starting... Mar 20 11:05:41 server cf-serverd[12037]: CFEngine(server) rudder Server is starting...
And then policy updates from nodes are possible again.
Updated by Nicolas CHARLES over 6 years ago
breaking intentionnaly generated promises in 4.1 does not trigger the issue
Mar 20 13:43:56 orchestrateur-4 cf-serverd[32336]: CFEngine(server) rudder Policy failed validation with command '"/var/rudder/cfengine-community/bin/cf-promises" -c "/var/rudder/cfengine-community/inputs/promises" -c "/var/rudder/cfengine-community/inputs/promises.cf"'
Updated by Nicolas CHARLES over 6 years ago
This is not related to systemd
I stopped cf-serverd, and started it manually, with broken promises
it went
root@server:/home/vagrant# /var/rudder/cfengine-community/bin/cf-serverd -v --no-fork verbose: Could not open extension plugin 'cfengine-enterprise.so' from '/var/rudder/cfengine-community/lib/cfengine-enterprise.so': (not installed) verbose: CFEngine Core 3.10.3 verbose: ---------------------------------------------------------------- verbose: Initialization preamble verbose: ---------------------------------------------------------------- verbose: Default port for cfengine is 5308 verbose: Work directory is /var/rudder/cfengine-community verbose: Making sure that internal directories are private... verbose: Checking integrity of the trusted workdir verbose: Checking integrity of the state database verbose: Checking integrity of the module directory verbose: Checking integrity of the PKI directory verbose: Loaded private key at '/var/rudder/cfengine-community/ppkeys/localhost.priv' verbose: Loaded public key '/var/rudder/cfengine-community/ppkeys/localhost.pub' verbose: Reference time set to 'Tue Mar 20 09:01:28 2018' verbose: CFEngine Core 3.10.3 - ready verbose: ---------------------------------------------------------------- verbose: Environment discovery verbose: ---------------------------------------------------------------- verbose: Host name is: server verbose: Operating System Type is linux verbose: Operating System Release is 4.4.21-69-default verbose: Architecture = x86_64 verbose: CFEngine detected operating system description is linux verbose: The time is now Tue Mar 20 09:01:28 2018 verbose: Additional hard class defined as: 64_bit verbose: Additional hard class defined as: linux_4_4_21_69_default verbose: Additional hard class defined as: linux_x86_64 verbose: Additional hard class defined as: linux_x86_64_4_4_21_69_default verbose: GNU autoconf class from compile time: compiled_on_linux_gnu verbose: Address given by nameserver: 192.168.42.2 verbose: No interface exception file /var/rudder/cfengine-community/inputs/ignore_interfaces.rx verbose: Interface 1: lo verbose: Interface 2: eth0 verbose: IP address of host set to 10.0.2.15 verbose: Interface 3: eth1 verbose: Trying to locate my IPv6 address verbose: Found IPv6 address fe80::a00:27ff:fe9f:3206 verbose: Found IPv6 address fe80::a00:27ff:fe88:bef0 verbose: Reading netstat info from /proc/net/netstat verbose: Reading (null) info from /proc/net/route verbose: Reading (null) info from /proc/net/snmp6 verbose: Reading (null) info from /proc/net/ipv6_route verbose: Reading (null) info from /proc/net/if_inet6 verbose: Reading interfaces_data info from /proc/net/dev verbose: Looking for environment from cf-monitord... verbose: Unable to detect environment from cf-monitord verbose: This appears to be a SUSE system. verbose: Looking for SUSE enterprise info in 'SUSE Linux Enterprise Server 12 (x86_64)' verbose: Discovered SUSE version SuSE_12 verbose: Found 1 processor verbose: Loading persistent classes verbose: Persistent class 'heartbeat_sent' for 3 more minutes verbose: Adding persistent class 'heartbeat_sent' to heap verbose: Persistent class 'inventory_sent' for 233 more minutes verbose: Adding persistent class 'inventory_sent' to heap verbose: Persistent class 'rudder_ncf_common_updated_ok' for 3 more minutes verbose: Adding persistent class 'rudder_ncf_common_updated_ok' to heap verbose: Persistent class 'rudder_ncf_local_updated_ok' for 3 more minutes verbose: Adding persistent class 'rudder_ncf_local_updated_ok' to heap verbose: Persistent class 'site_ok' for 9 more minutes verbose: Adding persistent class 'site_ok' to heap verbose: Loading JSON augments from '/var/rudder/cfengine-community/inputs/def.json' (input dir '/var/rudder/cfengine-community/inputs', input file '/var/rudder/cfengine-community/inputs/promises.cf' verbose: could not load JSON augments from '/var/rudder/cfengine-community/inputs/def.json' verbose: This agent is bootstrapped to: 127.0.0.1 verbose: Input file '/var/rudder/cfengine-community/inputs/promises.cf' has changed since the last policy read attempt (file is newer than previous) verbose: Input file is changed since last validation, validating it verbose: Verifying the syntax of the inputs... verbose: Checking policy with command '"/var/rudder/cfengine-community/bin/cf-promises" -c "/var/rudder/cfengine-community/inputs/promises.cf"' /var/rudder/cfengine-community/inputs/promises.cf:24:3: error: syntax error iii ^ /var/rudder/cfengine-community/inputs/promises.cf:24:3: error: Expected 'bundle' or 'body' keyword, wrong input 'iii' iii ^ error: There are syntax errors in policy files error: Policy failed validation with command '"/var/rudder/cfengine-community/bin/cf-promises" -c "/var/rudder/cfengine-community/inputs/promises.cf"' error: Failsafe condition triggered. Interactive session detected, skipping failsafe.cf execution. error: Error reading CFEngine policy. Exiting...
reloading with /opt/rudder/etc/hooks.d/policy-generation-finished/50-reload-policy-file-server resulted in the following log
Mar 20 09:12:11 server cf-serverd[21809]: CFEngine(server) rudder Rereading policy file '/var/rudder/cfengine-community/inputs/failsafe.cf' Mar 20 09:12:11 server cf-serverd[21809]: notice: Rereading policy file '/var/rudder/cfengine-community/inputs/failsafe.cf'
Please note that running again the cf-serverd tells me:
Mar 20 09:11:46 server cf-promises[22300]: CFEngine(common) There are syntax errors in policy files Mar 20 09:11:46 server cf-serverd[22298]: CFEngine(server) Policy failed validation with command '"/var/rudder/cfengine-community/bin/cf-promises" -c "/var/rudder/cfengine-community/inputs/promises.cf"' Mar 20 09:11:46 server cf-serverd[22298]: CFEngine(server) Failsafe condition triggered. Interactive session detected, skipping failsafe.cf execution. Mar 20 09:11:46 server cf-serverd[22298]: CFEngine(server) Error reading CFEngine policy. Exiting...
Updated by Nicolas CHARLES over 6 years ago
Interestingly, the bad cf-serverd is listening on port 5308 rather than 5309
Updated by Alexis Mousset over 6 years ago
What happens here:
- This behavior always existed but was never visible because agent restart with broken policies were done only during installation/upgrade, hence very likely in a tty, and CFEngine does not switch to failsafe in this case
- Since we are using systemd units, "service rudder agent restart" uses systemctl, CFEngine sees no tty and reloads failsafe.
This can be reproduced on any previous Rudder version by restarting the service (with broken policies) with nohup (for example).
Updated by Alexis Mousset over 6 years ago
We could:
- Patch CFEngine to disable the switch to failsafe (but that won't probably be acceptable upstream, and requires improving "rudder agent check" to trigger a failsafe run when promises are broken)
- ?
Updated by Benoît PECCATTE over 6 years ago
Doesn't ruder agent check already run failsafe when promises are broken ?
Updated by Alexis Mousset over 6 years ago
No, there are two cases:
- failsafe is broken -> reset initial policy + failsafe + run
- promises are broken -> clear update flag to force update at next run
Updated by Alexis Mousset over 6 years ago
- Target version changed from 4.3.0~rc2 to 4.1.11
Updated by Alexis Mousset over 6 years ago
Target 4.1, this should fix strange bugs where node couldn't connect while the server was running.
Updated by Alexis Mousset over 6 years ago
- Status changed from New to In progress
Updated by Alexis Mousset over 6 years ago
- Status changed from In progress to Pending technical review
- Assignee changed from Alexis Mousset to Benoît PECCATTE
- Pull Request set to https://github.com/Normation/rudder-packages/pull/1530
Updated by Alexis Mousset over 6 years ago
- Subject changed from Upgrade 4.2 -> 4.3: cf-serverd starts reading failsafe.cf and never goes back (node policy update impossible) to Agent components should not try to load failsafe.cf when policies are broken
Updated by Alexis Mousset over 6 years ago
- Status changed from Pending technical review to Pending release
Applied in changeset rudder-packages|774e96dc9d9d64e41f0ab847f4e9b3e3d3e1d83c.
Updated by Alexis Mousset over 6 years ago
- Status changed from Pending release to Released