Bug #14322

Directive parameter values are mixed between directives

Added by Tobias Ell 3 months ago. Updated 3 months ago.

Status:
Released
Priority:
N/A
Category:
Web - Config management
Target version:
Severity:
Critical - prevents main use of Rudder | no workaround | data loss | security
User visibility:
Getting started - demo | first install | Technique editor and level 1 Techniques
Effort required:
Priority:
94

Description

Server: 5.0.6
all nodes 5.0.6

I have several directives to ensure the content of files. I used to use the "file content" technique, but after mix-ups I switched some of them to the "file download" technique.
I still sometimes get mix-ups between directives.

At the moment one node has problems because the directive zzz-ltur does not work as configured.
Please check screenshot directive-zzz-ltur for the setup and the screenshot compliance-report-zzz-ltur for the output (it names a totally different file as download source).
I am not sure whether I have set everything up correctly. I tried using different priorities for directives based on the same technique, but it doesn't seem to help. I have also tried "regenerate policies" several times, but that does not help either.


Files

directive-zzz-ltur.png (25.9 KB) directive-zzz-ltur.png directive setup Tobias Ell, 2019-02-14 10:35
compliance-report-zzz-ltur.png (6.23 KB) compliance-report-zzz-ltur.png compliance report for directive Tobias Ell, 2019-02-14 10:35
updated-directive-20190214005601.diff (1.21 KB) updated-directive-20190214005601.diff Tobias Ell, 2019-02-19 15:09
updated-directive-20190214005631.diff (928 Bytes) updated-directive-20190214005631.diff Tobias Ell, 2019-02-19 15:09
updated-directive-20190214005705.diff (1.53 KB) updated-directive-20190214005705.diff Tobias Ell, 2019-02-19 15:09

Subtasks

Bug #14358: add a unit test to ensure that generation on 500 directives based on the same technique doesn't fail (Released, Nicolas CHARLES)

Related issues

Related to Rudder - Bug #13987: Massive performance penalty in policy generation due to invalid usage of StringTemplate (Released)

Associated revisions

Revision 27a2ceaa (diff)
Added by Nicolas CHARLES 3 months ago

Fixes #14322: directives are mixed

History

#1

Updated by Nicolas CHARLES 3 months ago

  • Project changed from ncf to Rudder

Hello Tobias,

This indeed looks strange. Could you check that:
  1. policy generation was successful
  2. the compliance details of the node state that the node has up-to-date policies
  3. the node can update its policies (rudder agent update on the node)

Also, what is the agent run frequency on this node? Could it be that it runs on a different schedule than the others (every xx hours)?

#2

Updated by Tobias Ell 3 months ago

Hello Nicolas,

1. the GUI shows policy generation green - is there some log where I could find detailed information?
Because of other problems I have done lots of "regenerate all policies" - I never saw an error.
2. The policies of the node are reported as up to date.
3. The node can update its policies.

Frequency of the nodes (as reported by "rudder agent info") is always correct.
Also, this phenomenon does not persist - it does not happen on all nodes and seems to go away on the concerned nodes.

#3

Updated by François ARMAND 3 months ago

Hello Tobias,

Can you tell us what the mix-up looks like for you? Do you see it on the node (for example, both files get downloaded), or only in the compliance message?

Do you have another directive configured on that node with the name displayed in the message?

Knowing this would help us understand whether the problem is in the agent logic (either in the policies we give to the node or in the agent's understanding of the policies), or in the compliance processing. The first category of problem is terrible, because it means that the node is not configured as you want (production broken). The second is very bad (because you don't get an accurate view of reality), but at least the node configuration is OK.
From what I believe for now, it's just the messages that get mixed up, and not the real configuration on the node.

If the problem happens again, can you go to the "technical log" tab of the node, filter on "copy file" (or on each file name), and send us screenshots?

And if you have a way to reproduce the problem, it would be massively helpful (as long as we are not able to reproduce it, it's very hard to understand what the problem is, and even if we get an idea, we can't check afterwards that we corrected it).

#4

Updated by Tobias Ell 3 months ago

Hi François,

Unfortunately I have the worst case: the files get the wrong content.
It started with 4.3.7, I guess. I have a lot of directives based on "File content", many of them applied to every node.
I noticed that I couldn't log in to some servers because the content of a service script would end up in a file in /etc/profile.d/, thus blocking shell execution. On some nodes logrotate would not work because a logrotate file would contain the contents of a repository file.
This did not happen on all nodes, or at least not in the same way (concerning the same directives).
I thought that maybe the server would mix up the content of the directives, so I did a lot of "regenerate all policies", but it didn't help.
I have since migrated to 5.0.6, but the problem still occurred sometimes.
I will try to find a node with problems and send you the technical log. Is there anything else I should collect?

#5

Updated by François ARMAND 3 months ago

  • Tracker changed from Question to Bug
  • Category set to Web - Config management
  • Target version set to 4.3.10
  • Severity set to Critical - prevents main use of Rudder | no workaround | data loss | security
  • User visibility set to Getting started - demo | first install | Technique editor and level 1 Techniques
  • Priority set to 94

OK, so this is very, very bad. I'm requalifying the ticket accordingly.

What would help:

- on the root server, a tar.gz of /var/rudder/share/xxxx-node-id-yyyy for the node having the problem,
- on the node, a tar.gz of /var/rudder/cfengine-community/

#6

Updated by François ARMAND 3 months ago

By the way, you may want to anonymize the data or send the content privately to (private internal mailing list).

#7

Updated by Tobias Ell 3 months ago

Hi François,

I sent the files to the private mailing list.

#8

Updated by François ARMAND 3 months ago

  • Category deleted (Web - Config management)
  • Target version deleted (4.3.10)

From the log, we see that something happens around midnight on 2019-02-14: the run at 00:04:06 UTC gets an updated policy configuration, and the agent starts to copy /etc/profile.d/zzz-ltur.sh from /bin/procs.

Then it stops around 9:30 UTC: the last run with the problem is at 09:34:29+00, and the next run corrects the copy.

So we need to know whether the problem was in the configuration or in Rudder's logic. For that, we would like to have:

- event logs for the directive f3fe0964_b855_45e2_8670_5b7254ed9ade on 2019-02-14 (to see if Rudder saw a change at that time, and perhaps get a hint about what changed) - this is done in Rudder UI > utils > event logs

- the node configuration at that time. This one is a bit tougher, because it requires running some SQL on the root server.

On the root server, run the SQL query against the Rudder database with the command: psql -p 5432 -U rudder -h localhost -d rudder -c "select * from nodeconfigurations where nodeid='32377fd7-02fd-43d0-aab7-28460a91347b' and begindate >= '2019-02-13' and begindate <= '2019-02-14' order by begindate desc limit 1;" > /tmp/nodeconfig

You can send us the file by email for privacy reasons. Thanks!

(nodeid changed in the query)

#9

Updated by Tobias Ell 3 months ago

  • Category set to Web - Config management

Sorry - neither procedure produces results - looks like I do not keep enough logs.

#10

Updated by Nicolas CHARLES 3 months ago

Ha, the event logs in the UI don't allow filtering by DirectiveId (and that's a shame).

To get the data, you can either search by Directive Name in the UI, or run the following query in PostgreSQL:

select * from eventlog where xpath('/entry/directive/id', data)::text = '{<id>f3fe0964_b855_45e2_8670_5b7254ed9ade</id>}' and creationDate > '2019-02-13' and creationDate < '2019-02-15';

sorry for the inconvenience

#11

Updated by Tobias Ell 3 months ago

Hi Nicolas, unfortunately this select does not deliver a result either.
Looks like we'll have to wait for the problem to occur again.
Are there any settings I should change so we'll have better output?

#12

Updated by Nicolas CHARLES 3 months ago

Ha, this is really surprising.
Two last requests:
Can you send us (by mail also)
  1. the logs of the Rudder webapp (/var/log/rudder/webapp) from 2019-02-13 and 2019-02-14
  2. all eventlog entries between the 13th and 15th of February?
    select * from eventlog where creationDate > '2019-02-13' and creationDate < '2019-02-15'
    

Thank you so much for your patience,
Nicolas

#13

Updated by Tobias Ell 3 months ago

Nicolas don't thank me - you give excellent support for open source software.
I'll give you any help I can. The eventlog entries and webapp logs have been sent.
Thanks for your help!
Got to go now - if there's anything else you need I will provide it tomorrow.

#14

Updated by Nicolas CHARLES 3 months ago

Thank you for the information, Tobias.
From the eventlog, there are indeed no changes in the impacted directives, so it really seems to be Rudder that wrongly configures something.

I made a mistake in the query for nodeconfigurations, using the wrong nodeid (I mixed it up with the rule id).

Does the query

psql -U rudder -h localhost -d rudder -c "select * from nodeconfigurations where nodeid='1f9492d3-804f-4f40-bff1-2217276b667a' and begindate >= '2019-02-13' and begindate <= '2019-02-14' order by begindate desc limit 1;" > /tmp/nodeconfig

return anything? If so, can you send it to us? It will tell us what Rudder computed for this node.

#15

Updated by François ARMAND 3 months ago

  • Assignee set to François ARMAND
#17

Updated by Tobias Ell 3 months ago

Sorry for the delay - the last query had an empty result set as well.
No new occurrences either.

#18

Updated by François ARMAND 3 months ago

Hello, we would like to see which directives were modified on the faulty run. For that, we need you to run the following on the Rudder server, in the directory `/var/rudder/configuration-repository`:

git log --stat --since="2019-02-13 23:00:00" --until="2019-02-14 01:00:00" 

That will result in logs like:

...
commit 8ce46015f57736ab1628f4d4ab5b4228a505c9ab
Author: root user (CLI) <root@localhost>
Date:   Thu Feb 14 05:52:25 2019 +0100

     Forced technique upgrade from version  on Thu Feb 14 05:52:25 CET 2019

 techniques/systemSettings/remoteAccess/sshConfiguration/5.0/metadata.xml | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)
...

If you have a commit around 00:56:00 (or 23:56:00, not sure about how git handles UTC/UTC+1) with several directives modified, also do a:

git show here-the-git-commit-sha1 > updated-directive.diff

and send us the result please.
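
As a small aid for the UTC/UTC+1 question above: the sample log shows dates with an explicit `+0100` offset, so a commit displayed at 00:56:00 +0100 corresponds to 23:56:00 UTC on the previous day. The sketch below only does that arithmetic; the helper name `to_utc` is made up for illustration, and the 00:56:00 date is a constructed example, not an actual commit from this ticket.

```python
from datetime import datetime, timezone


def to_utc(git_date):
    """Parse a date as printed in the sample git log output and convert it to UTC."""
    # Format of the "Date:" line above, e.g. "Thu Feb 14 05:52:25 2019 +0100".
    local = datetime.strptime(git_date, "%a %b %d %H:%M:%S %Y %z")
    return local.astimezone(timezone.utc)


# Date taken from the sample log output above.
print(to_utc("Thu Feb 14 05:52:25 2019 +0100").isoformat())
# Constructed example: a commit displayed at 00:56:00 with a +0100 offset.
print(to_utc("Thu Feb 14 00:56:00 2019 +0100").isoformat())
```

So both candidate times describe the same instant if the agent log is in UTC and the commit date carries a +0100 offset.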

Thanks

#20

Updated by François ARMAND 3 months ago

So, we successfully reproduced it by putting > 500 copy file directives on a node.

It appears that:

- node configuration is ok,
- LDAP is ok,
- directive in /var/rudder/config-repo/directives/xxxx.xml is OK.

So it seems that the only possible remaining culprit is StringTemplate, which somehow doesn't get the correct thing to write. We are narrowing it down.

#21

Updated by François ARMAND 3 months ago

Thanks for the diff. It confirms that your configuration is OK, but that the part that substituted the variable when writing the file did not write the correct thing.

#22

Updated by Nicolas CHARLES 3 months ago

  • Related to Bug #13987: Massive performance penalty in policy generation due to invalid usage of StringTemplate added
#23

Updated by Nicolas CHARLES 3 months ago

  • Target version set to 4.3.10
#24

Updated by Nicolas CHARLES 3 months ago

  • Status changed from New to In progress
  • Assignee changed from François ARMAND to Nicolas CHARLES
#25

Updated by Nicolas CHARLES 3 months ago

  • Status changed from In progress to Pending technical review
  • Assignee changed from Nicolas CHARLES to François ARMAND
  • Pull Request set to https://github.com/Normation/rudder/pull/2144
#26

Updated by Nicolas CHARLES 3 months ago

When doing #13987, I introduced a thread-unsafety: before, we reparsed each template every time, injected variables into it, and wrote it to the filesystem.
That was highly inefficient, so I cached the parsing to get the instance, and synchronized this part.

However, the usage of this template is NOT synchronized, so two threads could put variables in the same template, overriding each other. The more threads available, the worse the mix-up is (with 4 threads and 500 directives based on the same technique, I got around 8 mix-ups; with 2 threads I get a mix-up once every 3 generations; with 1 thread, never).

Synchronizing the whole method fixes the issue, and induces a small perf penalty (Write node configurations: 25061 ms with synchronized, Write node configurations: 15273 ms without synchronized).
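
To illustrate the mechanism, here is a minimal sketch in Python (Rudder itself is not written in Python, and this is not the code from the PR). `CachedTemplate` is a hypothetical stand-in for a cached, mutable template instance, not the StringTemplate API, and the `/srv/...` and `/etc/...` paths are invented. The first function replays, in a single thread, the order of operations that two unsynchronized threads can end up in; the second wraps the whole fill-and-render step in one lock, in the spirit of synchronizing the whole method.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


class CachedTemplate:
    """Hypothetical parsed template, cached and shared (not the StringTemplate API)."""

    def __init__(self, text):
        self.text = text
        self.attributes = {}  # mutable state shared by every caller

    def set_attribute(self, name, value):
        self.attributes[name] = value

    def render(self):
        return self.text.format(**self.attributes)


TEMPLATE = CachedTemplate("copy {source} to {destination}")
WRITE_LOCK = threading.Lock()


def replay_unsafe_interleaving():
    """Replay sequentially the order in which two unsynchronized threads may run."""
    # Thread A fills in its directive.
    TEMPLATE.set_attribute("source", "/srv/a")
    TEMPLATE.set_attribute("destination", "/etc/a")
    # Thread B overwrites the same instance before A has rendered.
    TEMPLATE.set_attribute("source", "/srv/b")
    TEMPLATE.set_attribute("destination", "/etc/b")
    # Thread A now renders and gets B's values.
    return TEMPLATE.render()


def write_synchronized(directive):
    """Fill and render under one lock, like synchronizing the whole method."""
    source, destination = directive
    with WRITE_LOCK:
        TEMPLATE.set_attribute("source", source)
        TEMPLATE.set_attribute("destination", destination)
        return TEMPLATE.render()  # stands in for writing the file


mixed_up = replay_unsafe_interleaving()
directives = [(f"/srv/file{i}", f"/etc/file{i}") for i in range(500)]
with ThreadPoolExecutor(max_workers=4) as pool:
    rendered = list(pool.map(write_synchronized, directives))
```

In the replay, the output meant for directive A names B's source, which is the same kind of symptom as a compliance report naming a totally different download source. With the lock, each of the 500 results matches its own directive, at the cost of serializing the whole step.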

#27

Updated by Nicolas CHARLES 3 months ago

Updated the PR with minimal locking. The impact on performance is not significant.
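
The ticket does not show what the minimal locking looks like, so the following is only one plausible shape, sketched in Python under assumptions: the lock protects just the shared, mutable template (fill and render), while the slower file write happens outside the critical section. `SharedTemplate`, `write_policy`, `generate` and the file names are all hypothetical.

```python
import tempfile
import threading
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


class SharedTemplate:
    """Hypothetical cached template with mutable attributes (not the StringTemplate API)."""

    def __init__(self, text):
        self.text = text
        self.attributes = {}

    def fill_and_render(self, **values):
        self.attributes.update(values)
        return self.text.format(**self.attributes)


TEMPLATE = SharedTemplate("download {source}\n")
TEMPLATE_LOCK = threading.Lock()


def write_policy(out_dir, index):
    # Critical section: only the shared, mutable template is protected.
    with TEMPLATE_LOCK:
        content = TEMPLATE.fill_and_render(source=f"/srv/file{index}")
    # Disk I/O runs outside the lock, so threads can still overlap here.
    path = Path(out_dir) / f"directive{index}.txt"
    path.write_text(content)
    return path


def generate(count=200, workers=4):
    """Write one file per directive from several threads and read them back."""
    with tempfile.TemporaryDirectory() as out_dir:
        with ThreadPoolExecutor(max_workers=workers) as pool:
            paths = list(pool.map(lambda i: write_policy(out_dir, i), range(count)))
        return [p.read_text() for p in paths]


contents = generate()
```

Keeping the critical section this short is a common way to retain most of the parallelism, which would be consistent with the remark that the performance impact is not significant.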

#28

Updated by Nicolas CHARLES 3 months ago

  • Status changed from Pending technical review to Pending release
#29

Updated by François ARMAND 3 months ago

  • Related to Bug #14386: UI "settings" for management of hooks works inconsistently added
#30

Updated by François ARMAND 3 months ago

  • Related to deleted (Bug #14386: UI "settings" for management of hooks works inconsistently)
#31

Updated by François ARMAND 3 months ago

  • Subject changed from directives are mixed to Directive parameter values are mixed between directives
#32

Updated by François ARMAND 3 months ago

  • Status changed from Pending release to Released

This bug has been fixed in Rudder 4.3.10 and 5.0.7 which were released today.
