Some reports are duplicated between agent and postgres leading to "unexpected" compliance
We have users reporting a lot of "unexpected", for example in copy-file or enforce file content post-hook components. After analysis, there seems to be a pattern where:
- on the node side, `/var/rudder/cfengine-community/outputs/the_run_output` contains the correct reports
- in database, the corresponding reports is duplicated for that run (with non consecutive insert id)
So something happen between the agent and postgres that duplicates the messages. All our investigation let us believe it's a duplicated syslog message, which could be correlated by the fact that the network is bad on these installations.
The long terme solution is to process messages with a run atomicity, and send them on the network with an "at least once" messaging semantic (it's easy to detect a duplicated run report server side). That need major architectural changes on Rudder.
A short term workaround is to had an option to let the compliance ignores unexpected when we have exactly the same reports. The compliance detail would still display both messages, but we would use the report compliance level.
We don't forsee a lot of cases where a fully duplicated message can be produced, so in most cases it seems preferable to use that option (and remove a lot of false positives bad compliance which hurts actionnability). So we believe that the option should be on by default, but still present for people who want to be strict on non compliance level.
That option would be removed once the long term solution is available.
Updated by François ARMAND 11 months ago
Work in progess here: https://github.com/fanf/rudder/commit/b8e07aed6a5dae9143951fef23d6ef9d9a533d5e
Updated by Vincent MEMBRÉ 10 months ago
- Status changed from Pending release to Released
- Priority changed from 108 to 107