Bug #14322
Closed: Directive parameter values are mixed between directives
Added by Tobias Ell almost 6 years ago. Updated over 5 years ago.
Description
Server: 5.0.6
All nodes: 5.0.6
I have several directives to ensure the content of files. I used to use the "File content" technique, but after mix-ups I switched some of them to the "File download" technique.
I still get mix-ups between directives sometimes.
At the moment one node has problems because the directive zzz-ltur does not work as configured.
Please check the screenshot directive-zzz-ltur for the setup and the screenshot compliance-report-zzz-ltur for the output (it names a totally different file as the download source).
I am not sure whether I have set everything up correctly. I tried using different priorities for directives based on the same technique, but it doesn't seem to help. I have also tried "regenerate policies" several times, but that does not help either.
Files
directive-zzz-ltur.png (25.9 KB) - directive setup - Tobias Ell, 2019-02-14 10:35
compliance-report-zzz-ltur.png (6.23 KB) - compliance report for directive - Tobias Ell, 2019-02-14 10:35
updated-directive-20190214005601.diff (1.21 KB) - Tobias Ell, 2019-02-19 15:09
updated-directive-20190214005631.diff (928 Bytes) - Tobias Ell, 2019-02-19 15:09
updated-directive-20190214005705.diff (1.53 KB) - Tobias Ell, 2019-02-19 15:09
Updated by Nicolas CHARLES almost 6 years ago
- Project changed from 41 to Rudder
Hello Tobias,
This indeed looks strange. Could you check that:
- policy generation was successful,
- the compliance details of the node state that the node has up-to-date policies,
- the node can update its policies (rudder agent update on the node).
Also, what is the agent run frequency on this node? Could it be that it runs on a different schedule than the others (every xx hours)?
Updated by Tobias Ell almost 6 years ago
Hello Nicolas,
1. The GUI shows policy generation as green - is there a log where I could find detailed information?
Because of other problems I have done lots of "regenerate all policies" - I never saw an error.
2. The policies of the node are reported as up to date.
3. The node can update its policies.
The run frequency of the nodes (as reported by "rudder agent info") is always correct.
Also, this phenomenon does not persist - it does not happen on all nodes and seems to go away on the affected nodes.
Updated by François ARMAND almost 6 years ago
Hello Tobias,
Can you tell us how the mix-up shows itself (what it means for you to have a mix-up)? Do you see it on the node (for example, both files downloaded)? Or only in the compliance message?
Do you have another directive configured on that node with the name displayed in the message?
Knowing that would help us understand whether the problem is in the agent logic (either in the policies we give to the node or in the agent's understanding of them) or in the compliance processing. The first category of problem is terrible, because it means that the node is not configured as you want (production broken). The second is very bad (because you don't get an accurate view of reality), but at least the node configuration is OK.
From what I believe for now, it's just the messages that get mixed up, not the real configuration on the node.
If the problem happens again, can you go to the "technical log" tab of the node, filter on "copy file" (or each file name), and send us screenshots?
And if you have a way to reproduce the problem, it would be massively helpful (until we are able to reproduce it, it's very hard to understand what the problem is, and even if we get an idea, we can't verify that we corrected it afterwards).
Updated by Tobias Ell almost 6 years ago
Hi François,
unfortunately I have the worst case: the files get the wrong content.
It started with 4.3.7, I guess. I have a lot of directives based on "File content", many of them applied to every node.
I noticed that I couldn't log in to some servers because the content of a service script would end up in a file in /etc/profile.d/, thus blocking shell execution. On some nodes logrotate wouldn't work because a logrotate file would contain the contents of a repository file.
This did not happen on all nodes, or at least not in the same way (concerning the same directives).
I thought that maybe the server was mixing up the content of the directives, so I did a lot of "regenerate all policies", but it didn't help.
I have since migrated to 5.0.6, but the problem has still occurred sometimes.
I will try to find a node with problems and send you the technical log - is there anything else I should collect?
Updated by François ARMAND almost 6 years ago
- Tracker changed from Question to Bug
- Category set to Web - Config management
- Target version set to 4.3.10
- Severity set to Critical - prevents main use of Rudder | no workaround | data loss | security
- User visibility set to Getting started - demo | first install | Technique editor and level 1 Techniques
- Priority set to 94
OK, so this is very, very bad. I'm requalifying the ticket accordingly.
What would help:
- on the root server, a tar.gz of /var/rudder/share/xxxx-node-id-yyyy for the node having the problem,
- on the node, a tar.gz of /var/rudder/cfengine-community/
Updated by François ARMAND almost 6 years ago
By the way, you may want to anonymize the data, or send the content privately to rudder@rudder.io (private internal mailing list).
Updated by Tobias Ell almost 6 years ago
Hi François,
I sent the files to the private mailing list.
Updated by François ARMAND almost 6 years ago
- Category deleted (Web - Config management)
- Target version deleted (4.3.10)
From the log, we see that something happened around midnight on 2019-02-14: the run at 00:04:06 UTC got an updated policy configuration, and the agent started to copy /etc/profile.d/zzz-ltur.sh from /bin/procs.
Then it stopped around 9:30 UTC: the last run with the problem was at 09:34:29+00, and the next run corrected the copy.
So we need to know whether the problem was in the configuration or in Rudder's logic. For that, we would like to have:
- the event logs for the directive f3fe0964_b855_45e2_8670_5b7254ed9ade on 2019-02-14 (to see whether Rudder saw a change at that time, and perhaps get a hint about what changed) - this is done in Rudder UI > Utils > Event logs,
- the node configuration at that time. This one is a bit tougher, because it requires running some SQL on the root server.
On the root server, run the following SQL command on the Rudder base:
psql -p 5432 -U rudder -h localhost -d rudder -c "select * from nodeconfigurations where nodeid='32377fd7-02fd-43d0-aab7-28460a91347b' and begindate >= '2019-02-13' and begindate <= '2019-02-14' order by begindate desc limit 1;" > /tmp/nodeconfig
You can send us the file by email for privacy reasons. Thanks!
(nodeid changed in the query)
Updated by Tobias Ell almost 6 years ago
- Category set to Web - Config management
Sorry - neither procedure produced results - it looks like I do not keep enough logs.
Updated by Nicolas CHARLES almost 6 years ago
Ah, event logs in the UI don't allow filtering by directive id (and that's a shame).
To get the data, you can either search by directive name in the UI, or run the following query in PostgreSQL:
select * from eventlog where xpath('/entry/directive/id', data)::text = '{<id>f3fe0964_b855_45e2_8670_5b7254ed9ade</id>}' and creationDate > '2019-02-13' and creationDate < '2019-02-15';
Sorry for the inconvenience.
Updated by Tobias Ell almost 6 years ago
Hi Nicolas, unfortunately this select does not deliver a result either.
Looks like we'll have to wait for the problem to occur again.
Are there any settings I should change so we'll have better output?
Updated by Nicolas CHARLES almost 6 years ago
Two last requests:
Can you send us (by mail as well):
- the logs of the Rudder webapp (/var/log/rudder/webapp) from 2019-02-13 and 2019-02-14,
- all eventlog entries between the 13th and 15th of February?
select * from eventlog where creationDate > '2019-02-13' and creationDate < '2019-02-15'
Thank you so much for your patience,
Nicolas
Updated by Tobias Ell almost 6 years ago
Nicolas, don't thank me - you provide excellent support for open source software.
I'll give you any help I can. The eventlog entries and webapp logs have been sent.
Thanks for your help!
Got to go now - if there's anything else you need I will provide it tomorrow.
Updated by Nicolas CHARLES almost 6 years ago
Thank you for the information Tobias,
From the eventlog, there are indeed no changes to the impacted directives, so it really does seem to be Rudder that wrongly configures something.
I made a mistake in the query for nodeconfigurations, using the wrong nodeid (I mixed it up with the rule id).
Does the query
psql -U rudder -h localhost -d rudder -c "select * from nodeconfigurations where nodeid='1f9492d3-804f-4f40-bff1-2217276b667a' and begindate >= '2019-02-13' and begindate <= '2019-02-14' order by begindate desc limit 1;" > /tmp/nodeconfig
return something? If so, can you send it to us? It will tell us what Rudder computed for this node.
Updated by Tobias Ell almost 6 years ago
sorry for the delay - the last query had an empty result set as well.
No new occurrences either.
Updated by François ARMAND almost 6 years ago
Hello, we would like to see which directives were modified before the faulty run. For that, we need you to run the following on the Rudder server, in the directory `/var/rudder/configuration-repository`:
git log --stat --since="2019-02-13 23:00:00" --until="2019-02-14 01:00:00"
That will result in logs like:
...
commit 8ce46015f57736ab1628f4d4ab5b4228a505c9ab
Author: root user (CLI) <root@localhost>
Date:   Thu Feb 14 05:52:25 2019 +0100

    Forced technique upgrade from version on Thu Feb 14 05:52:25 CET 2019

 techniques/systemSettings/remoteAccess/sshConfiguration/5.0/metadata.xml | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)
...
If you have a commit around 00:56:00 (or 23:56:00, I am not sure how git handles UTC/UTC+1) with several directives modified, also do a:
git show here-the-git-commit-sha1 > updated-directive.diff
and send us the result, please.
Thanks
Updated by Tobias Ell almost 6 years ago
- File updated-directive-20190214005601.diff updated-directive-20190214005601.diff added
- File updated-directive-20190214005631.diff updated-directive-20190214005631.diff added
- File updated-directive-20190214005705.diff updated-directive-20190214005705.diff added
There were 3 updates that night; I'll attach the output.
Updated by François ARMAND almost 6 years ago
So, we successfully reproduced it by putting more than 500 "copy file" directives on a node.
It appears that:
- the node configuration is OK,
- LDAP is OK,
- the directive in /var/rudder/config-repo/directives/xxxx.xml is OK.
So it seems that the only remaining possible culprit is StringTemplate, which somehow doesn't get the correct thing to write. We are narrowing it down.
Updated by François ARMAND almost 6 years ago
Thanks for the diff; it confirms that your configuration is OK, but that the part that substitutes the variables when writing the file didn't write the correct thing.
Updated by Nicolas CHARLES almost 6 years ago
- Related to Bug #13987: Massive performance penalty in policy generation due to invalid usage of StringTemplate added
Updated by Nicolas CHARLES almost 6 years ago
- Status changed from New to In progress
- Assignee changed from François ARMAND to Nicolas CHARLES
Updated by Nicolas CHARLES almost 6 years ago
- Status changed from In progress to Pending technical review
- Assignee changed from Nicolas CHARLES to François ARMAND
- Pull Request set to https://github.com/Normation/rudder/pull/2144
Updated by Nicolas CHARLES almost 6 years ago
When doing #13987, I introduced a thread-safety issue. Before, we reparsed each template every time, injected the variables into it, and wrote it to the filesystem.
That was highly inefficient, so I cached the parsing to reuse the template instance, and synchronized that part.
However, the usage of this template is NOT synchronized, so two threads could put variables into the same template instance, overriding each other. The more threads available, the worse the mixing is (with 4 threads and 500 directives based on the same technique, I got around 8 mix-ups; with two threads I got a mix-up once every 3 generations; with 1 thread, never).
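To make the race concrete, here is a minimal Scala sketch of the pattern, assuming the StringTemplate 3 API (org.antlr.stringtemplate.StringTemplate); the object and attribute names are illustrative, not the actual Rudder code, and the harness may also just crash instead of printing mix-ups, since the shared attribute map is not thread-safe:

import org.antlr.stringtemplate.StringTemplate

object SharedTemplate {
  // Parsed once and cached (the #13987 optimisation); the same mutable
  // instance is reused by every thread that fills in directive values.
  val cached = new StringTemplate("copy to $destination$")

  // Not atomic: another thread's setAttribute can land between this
  // thread's setAttribute and toString, so the wrong value is rendered.
  def render(destination: String): String = {
    cached.setAttribute("destination", destination)
    val out = cached.toString
    cached.removeAttribute("destination")
    out
  }
}

object RaceDemo {
  def main(args: Array[String]): Unit = {
    // 4 threads x 500 renders, mirroring the reproduction described above.
    val threads = (1 to 4).map { t =>
      new Thread(() => {
        for (i <- 1 to 500) {
          val expected = s"copy to /tmp/file-$t-$i"
          val got = SharedTemplate.render(s"/tmp/file-$t-$i")
          if (got != expected)
            println(s"mix-up: expected '$expected', got '$got'")
        }
      })
    }
    threads.foreach(_.start())
    threads.foreach(_.join())
  }
}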
Synchronizing the whole method fixes the issue, but induces a performance penalty ("Write node configurations": 25061 ms with synchronization vs 15273 ms without).
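Continuing the illustrative sketch above, the coarse fix would look like this: one lock around the whole render-and-write path, which is correct but also serializes the file write that never touches the shared template:

import java.nio.file.{Files, Paths}

object CoarseFix {
  // First version of the fix: one lock around the whole render-and-write.
  // Correct, but the file write (which never touches the shared template)
  // is serialized too - that is where the 25s-vs-15s difference comes from.
  def renderAndWrite(destination: String, path: String): Unit =
    SharedTemplate.synchronized {
      val content = SharedTemplate.render(destination)
      Files.write(Paths.get(path), content.getBytes("UTF-8"))
    }
}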
Updated by Nicolas CHARLES almost 6 years ago
Updated the PR with minimal locking; the impact on performance is no longer significant.
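In terms of the same illustrative sketch, minimal locking means holding the lock only while filling and rendering the shared template, and writing the file outside of it:

import java.nio.file.{Files, Paths}

object MinimalLockFix {
  // Only the non-thread-safe part (filling and rendering the shared
  // template) is inside the lock; the file write runs concurrently again,
  // so the cost of the lock becomes negligible.
  def renderAndWrite(destination: String, path: String): Unit = {
    val content = SharedTemplate.synchronized {
      SharedTemplate.render(destination)
    }
    Files.write(Paths.get(path), content.getBytes("UTF-8"))
  }
}

The trade-off here is purely lock granularity: the cached template instance is the only shared mutable state, so it is the only thing the lock needs to cover.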
Updated by Nicolas CHARLES almost 6 years ago
- Status changed from Pending technical review to Pending release
Applied in changeset rudder|27a2ceaae9116eec7114593267a775cb0cd225eb.
Updated by François ARMAND over 5 years ago
- Related to Bug #14386: UI "settings" for management of hooks works inconsistently added
Updated by François ARMAND over 5 years ago
- Related to deleted (Bug #14386: UI "settings" for management of hooks works inconsistently)
Updated by François ARMAND over 5 years ago
- Subject changed from directives are mixed to Directive parameter values are mixed between directives
Updated by François ARMAND over 5 years ago
- Status changed from Pending release to Released
This bug has been fixed in Rudder 4.3.10 and 5.0.7 which were released today.