Bug #8980
closed
Update 3.0->3.1 on SLES commits and rebuilds vanilla system techniques
Description
Hi Folks,
We experienced an issue during the update from 3.0 -> 3.1.
Constraints
We have to state some facts that caused this issue:
- We have customized the system techniques (changed TCP to UDP in 3.0, since it was not customizable back then), and committed that change to git.
diff --git a/techniques/system/common/1.0/promises.st b/techniques/system/common/1.0/promises.st
index 1397a83..ec94f63 100644
--- a/techniques/system/common/1.0/promises.st
+++ b/techniques/system/common/1.0/promises.st
@@ -510,7 +510,7 @@ bundle agent check_log_system
 !windows.rsyslogd.!policy_server::
 "/etc/rsyslog.d/rudder-agent.conf"
- edit_line => append_if_no_lines("#Rudder log system${const.n}if $syslogfacility-text == 'local6' and $programname startswith 'rudder' then @@${server_info.cfserved}:&SYSLOGPORT&${const.n}if $syslogfacility-text == 'local6' and $programname startswith 'rudder' then ~"),
+ edit_line => append_if_no_lines("#Rudder log system${const.n}if $syslogfacility-text == 'local6' and $programname startswith 'rudder' then @${server_info.cfserved}:&SYSLOGPORT&${const.n}if $syslogfacility-text == 'local6' and $programname startswith 'rudder' then ~"),
 create => "true",
 edit_defaults => empty_backup,
 classes => kept_if_else("rsyslog_kept", "rsyslog_repaired" , "rsyslog_failed");
- We deploy Rudder using zypper patterns. Each pattern is versioned and pins the exact version of every package it requires. Example of the Rudder pattern (yum repository's pattern):
<pattern xmlns="http://novell.com/package/metadata/suse/pattern"
         xmlns:rpm="http://linux.duke.edu/metadata/rpm"
         xmlns:custom="http://fake.custom.ns/metadata/rpm">
  <name><![CDATA[RUDDER_MASTER]]></name>
  <arch>x86_64</arch>
  <version epoch="0" ver="11.3" rel="3.01.11.01"/>
  <summary><![CDATA[Rudder Root Server 3.1.11 rel 1 SLES11SP3 incl direct dependencies]]></summary>
  <description><![CDATA[Pattern RUDDER_MASTER]]></description>
  <uservisible/>
  <custom:env><![CDATA[RUDDER_NO_TECHNIQUE_AUTOCOMMIT=1]]></custom:env>
  <rpm:requires>
    <rpm:entry name="ncf" flags="EQ" epoch="1398866025" ver="0.201606040106-1.SLES.11" />
    <rpm:entry name="ncf-api-virtualenv" flags="EQ" epoch="1398866025" ver="3.1.11.release-1.SLES.11" />
    <rpm:entry name="rudder-agent" flags="EQ" epoch="1398866025" ver="3.1.11.release-1.SLES.11" />
    <rpm:entry name="rudder-inventory-ldap" flags="EQ" epoch="1398866025" ver="3.1.11.release-1.SLES.11" />
    <rpm:entry name="rudder-reports" flags="EQ" epoch="1398866025" ver="3.1.11.release-1.SLES.11" />
    <rpm:entry name="rudder-server-root" flags="EQ" epoch="1398866025" ver="3.1.11.release-1.SLES.11" />
    <rpm:entry name="rudder-jetty" flags="EQ" epoch="1398866025" ver="3.1.11.release-1.SLES.11" />
    <rpm:entry name="rudder-techniques" flags="EQ" epoch="1398866025" ver="3.1.11.release-1.SLES.11" />
    <rpm:entry name="rudder-inventory-endpoint" flags="EQ" epoch="1398866025" ver="3.1.11.release-1.SLES.11" />
    <rpm:entry name="rudder-webapp" flags="EQ" epoch="1398866025" ver="3.1.11.release-1.SLES.11" />
    <!-- some other items are left out, but these are the most important ones -->
  </rpm:requires>
</pattern>
- We use the feature of setting RUDDER_NO_TECHNIQUE_AUTOCOMMIT=1 (as merged via #7222).
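For readers unfamiliar with the rsyslog forwarding syntax in the customization diff above: a target written as `@host:port` forwards over UDP, while `@@host:port` forwards over TCP, so the single-character difference is the whole customization. A minimal sketch of a check for this follows; the function name `rsyslog_forward_protocol` and the sample host are made up for illustration and are not part of Rudder.

```shell
# Hypothetical helper: classify an rsyslog forwarding rule by its target prefix.
# In rsyslog, "@@host:port" forwards over TCP and "@host:port" over UDP.
rsyslog_forward_protocol() {
  case "$1" in
    *" then @@"*) echo tcp ;;   # must be tested before the single-@ pattern
    *" then @"*)  echo udp ;;
    *)            echo none ;;  # e.g. the discard action "~"
  esac
}

# Sample rules with an illustrative host name (assumption, not from the report).
rsyslog_forward_protocol "if cond then @@policy-server.example:514"   # prints "tcp"
rsyslog_forward_protocol "if cond then @policy-server.example:514"    # prints "udp"
```

Running this against the generated /etc/rsyslog.d/rudder-agent.conf line after an upgrade would be one way to notice early that the UDP customization has been lost.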
The Problem
... is pretty complex, so I will try to lay it out in chronological order:
The update is triggered by updating the given RUDDER_MASTER pattern via zypper.
This resolves the necessary packages, and does the update:
[09:28:31] + export RUDDER_NO_TECHNIQUE_AUTOCOMMIT=1
[09:28:31] + zypper -v -n --force-resolution -t pattern --repo Pattern --repo ThirdParty RUDDER_MASTER=11.3-3.01.11.01
[...]
[09:28:31] The following packages are going to be upgraded:
[...]
[09:28:31] ncf
[09:28:31] 1398866025:0.201601211502-1.SLES.11 -> 1398866025:0.201606040106-1.SLES.11
[09:28:31] ncf-api-virtualenv
[09:28:31] 1398866025:3.0.13.release-1.SLES.11 -> 1398866025:3.1.11.release-1.SLES.11
[09:28:31] rudder-inventory-endpoint
[09:28:31] 1398866025:3.0.13.release-1.SLES.11 -> 1398866025:3.1.11.release-1.SLES.11
[09:28:31] rudder-inventory-ldap
[09:28:31] 1398866025:3.0.13.release-1.SLES.11 -> 1398866025:3.1.11.release-1.SLES.11
[09:28:31] rudder-jetty
[09:28:31] 1398866025:3.0.13.release-1.SLES.11 -> 1398866025:3.1.11.release-1.SLES.11
[09:28:31] rudder-reports
[09:28:31] 1398866025:3.0.13.release-1.SLES.11 -> 1398866025:3.1.11.release-1.SLES.11
[09:28:31] rudder-server-root
[09:28:31] 1398866025:3.0.13.release-1.SLES.11 -> 1398866025:3.1.11.release-1.SLES.11
[09:28:31] rudder-techniques
[09:28:31] 1398866025:3.0.13.release-1.SLES.11 -> 1398866025:3.1.11.release-1.SLES.11
[09:28:31] rudder-webapp
[09:28:31] 1398866025:3.0.13.release-1.SLES.11 -> 1398866025:3.1.11.release-1.SLES.11
[...]
[09:29:14] Installing: ncf-1398866025:0.201606040106-1.SLES.11 [.....done]
[09:29:34] Installing: rudder-inventory-ldap-1398866025:3.1.11.release-1.SLES.11 [.........done]
[09:29:35] Installing: rudder-jetty-1398866025:3.1.11.release-1.SLES.11 [......done]
[09:29:36] Installing: rudder-reports-1398866025:3.1.11.release-1.SLES.11 [...done]
[09:29:38] Installing: rudder-techniques-1398866025:3.1.11.release-1.SLES.11 [.........done]
[09:29:43] Installing: ncf-api-virtualenv-1398866025:3.1.11.release-1.SLES.11 [......done]
[09:29:47] Installing: rudder-inventory-endpoint-1398866025:3.1.11.release-1.SLES.11 [.................done]
[09:29:59] Installing: rudder-webapp-1398866025:3.1.11.release-1.SLES.11 [...................................done]
[09:30:00] Installing: rudder-server-root-1398866025:3.1.11.release-1.SLES.11 [.done]
[...]
The order of the installation is important, because it influences which files and techniques are used.
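To make the ordering concrete, here is a rough sketch of how one could check a saved zypper log for the installation order. It assumes the log has one "Installing:" entry per line in the format shown above; the function names `install_position` and `installed_before` are invented for this example.

```shell
# Hypothetical helper: print the 1-based position of package $1 among the
# "Installing:" lines of a zypper log read from stdin (nothing if absent).
install_position() {
  awk -v pkg="$1" '
    /Installing: / {
      n++
      name = $0
      sub(/.*Installing: /, "", name)   # drop timestamp and keyword
      sub(/-[0-9]+:.*/, "", name)       # drop "-<epoch>:<version> [...done]"
      if (name == pkg) { print n; exit }
    }'
}

# Hypothetical helper: succeed if package $2 was installed before package $3
# according to the log file $1.
installed_before() {
  first=$(install_position "$2" < "$1")
  second=$(install_position "$3" < "$1")
  [ -n "$first" ] && [ -n "$second" ] && [ "$first" -lt "$second" ]
}
```

With the log from this report, `installed_before zypper.log rudder-inventory-ldap rudder-webapp` would succeed (the file name is just an example), which is exactly the ordering that matters for the problem described here.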
The package rudder-inventory-ldap is updated.
The %postinstall script is called after the 3.0.13 -> 3.1.11 update has been done, and it includes a call to the rudder-upgrade script:
[...]
if [ -x /opt/rudder/bin/rudder-upgrade ]
then
  echo "INFO: Running the Rudder upgrade script"
  /opt/rudder/bin/rudder-upgrade
fi
[...]
However, there is a chain of problems with this:
- The called script /opt/rudder/bin/rudder-upgrade is provided by rudder-webapp.
- rudder-webapp is not updated yet (see timing above) => the old script is executed.
- The old script DOES NOT HAVE this no-autocommit feature.
- The rudder-upgrade script always calls the function upgrade_system_techniques, which invokes update_rudder_repository_from_system_directory.
- This commits the system techniques from /opt/rudder/share/techniques/system/, which is provided by the package rudder-techniques.
- The package rudder-techniques has also not been updated yet (see timing above), so basically what happens is this:
commit 955b4a32a15d0cb295ceac7d18959056cc18321e
Upgrade system Techniques from /opt/rudder/share/techniques/system/ - automatically done by rudder-upgrade script

diff --git a/techniques/system/common/1.0/promises.st b/techniques/system/common/1.0/promises.st
index 8211ae4..9fa628b 100644
--- a/techniques/system/common/1.0/promises.st
+++ b/techniques/system/common/1.0/promises.st
@@ -514,7 +514,7 @@ bundle agent check_log_system
 !windows.rsyslogd.!policy_server::
 "/etc/rsyslog.d/rudder-agent.conf"
- edit_line => append_if_no_lines("#Rudder log system${const.n}if $syslogfacility-text == 'local6' and $programname startswith 'rudder' then @${server_info.cfserved}:&SYSLOGPORT&${const.n}if $syslogfacility-text == 'local6' and $programname startswith 'rudder' then ~"),
+ edit_line => append_if_no_lines("#Rudder log system${const.n}if $syslogfacility-text == 'local6' and $programname startswith 'rudder' then @@${server_info.cfserved}:&SYSLOGPORT&${const.n}if $syslogfacility-text == 'local6' and $programname startswith 'rudder' then ~"),
 create => "true",
 edit_defaults => empty_backup,
 classes => rudder_common_classes("rsyslog");
This means:
- You commit back the old vanilla techniques, which lack any customizations.
- You trigger a technique reload.
- Since Jetty is still running, the default system policy is rebuilt for all nodes.
- If you mess up and fail to properly disable the relays, it is also distributed into the wild.
- This causes every node to revert to TCP logging, flooding the relays with TCP connections, which can then DDoS the policy server.
- This causes syslog to hang because no Rudder messages get through, which leads to I/O waits and high load, and puts stress on both the systems and the ops people.
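For illustration, the kind of guard that the old rudder-upgrade script lacked could look roughly like the sketch below. This is not the actual code merged via #7222; the function names `technique_autocommit_enabled` and `maybe_upgrade_system_techniques` are hypothetical, and the echo statements stand in for the real upgrade step.

```shell
# Hypothetical guard, not the real rudder-upgrade implementation.
# Succeeds only when automatic technique commits are allowed, i.e. when
# RUDDER_NO_TECHNIQUE_AUTOCOMMIT is unset or not equal to 1.
technique_autocommit_enabled() {
  [ "${RUDDER_NO_TECHNIQUE_AUTOCOMMIT:-0}" != "1" ]
}

# Hypothetical wrapper around the system technique upgrade step.
maybe_upgrade_system_techniques() {
  if technique_autocommit_enabled; then
    echo "commit"   # placeholder: the real script would call upgrade_system_techniques
  else
    echo "skip"     # customized techniques in git are left untouched
  fi
}
```

The point of the report is that such a check only helps if the script containing it is already installed when the first %postinstall runs, which was not the case here.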