Project

General

Profile

Actions

Bug #6021

closed

Pending state is not managed as user could expect in changes only mode

Added by Nicolas CHARLES over 6 years ago. Updated about 6 years ago.

Status:
Released
Priority:
1
Category:
Web - Compliance & node report
Target version:
Severity:
User visibility:
Effort required:
Priority:

Description

That lead to Rudder displaying what is the actual, current state of configuration and report received at the moment of the modification, and not a preview of what is expected given then new configuration. This is extremly surprising for the user.

The two main occurence of surprising behavior are:

- when we create a rule with directives and apply it to nodes, its state is immediatly "no reports" (because of course, no node even answer at that moment)
- Some Directives or Nodes may be missing from the reporting, in non compliance mode, during the "pending" period of time (see #5716 for screenshot).


Subtasks 1 (0 open1 closed)

Bug #6189: Compliance debug logger is on debug mode and logs a lot!ReleasedFrançois ARMAND2015-01-27Actions

Related issues

Related to Rudder - Bug #6040: If i update or create a rule, all nodes that are in the target of this rule get the "pending" stateReleasedFrançois ARMAND2014-12-29Actions
Has duplicate Rudder - Bug #5716: Some Directives or Nodes may be missing from the reporting, in non compliance mode, during the "pending" period of timeRejected2014-10-30Actions
Actions #1

Updated by Nicolas CHARLES over 6 years ago

in the reports by Node, the Rule doesn't show up (because we don't have reports from this configid most probably)

Actions #2

Updated by Nicolas CHARLES over 6 years ago

Oh, and not all nodes appears in the list of nodes where the rule apply
It showed centos-6-64, where it should show node1 and centos-6-64

Actions #3

Updated by Nicolas CHARLES over 6 years ago

Ok, i don't have all the node that appears, because one of the node is pending a change from another rule, so it is looking for the bad nodeconfigid ( CheckChanges(2014-12-17T14:55:23.000+01:00,NodeConfigIdInfo(NodeConfigId(1660048484),2014-12-17T14:36:14.223+01:00,Some(2014-12-17T15:00:39.226+01:00)))) rather than NodeConfigIdInfo(NodeConfigId(-190266289))

Actions #4

Updated by Nicolas CHARLES over 6 years ago

I confirm that in compliance mode, it works as expected

Actions #5

Updated by François ARMAND over 6 years ago

  • Target version changed from 3.0.0~beta2 to 3.0.0~rc1
Actions #6

Updated by Vincent MEMBRÉ over 6 years ago

  • Target version changed from 3.0.0~rc1 to 3.0.0
Actions #7

Updated by François ARMAND over 6 years ago

It's more likelly a similar occurence than in #6040.

The problem is that in order to have an accurate "pending" state for nodes and rules, we need to know what configuration changed compared to what we received in the last run of corresponding agent. It's actually "received" and not "ask", because (for example) we don't wan to remove a pending state of something whose config didn't changed in the last run (but do changed in the n-1 one) and still don't have any reports for).

In #6040, that was manageable since we have the whole state given by the received reports - that's the goal of compliace.

Here, we would need to rebuild the compliance for policy at "t-1", then compute what state are switchet to pending with the new config. This is an abomination in performance, making Rudder unusable.

So what we really need is a cache of "last" (nodeId, configId, run) -> compliance (at least until rule level). I'm not sur about how big is "last", but we need at least one by node, and for each node, at least the last run for current configId, and something like the last run for the previous config id.

This cache would be invalidated/computed again each time a new run reach Rudder.

Actions #8

Updated by François ARMAND over 6 years ago

  • Subject changed from When I create a rule, in change only mode, it is immediatly in "No report" state rather than Pending to Pending state is not managed as user could expect in changes only mode
  • Description updated (diff)
Actions #9

Updated by François ARMAND over 6 years ago

  • Status changed from New to Discussion
Actions #10

Updated by François ARMAND over 6 years ago

  • Description updated (diff)
  • Status changed from Discussion to In progress
  • Assignee set to François ARMAND
  • Priority changed from N/A to 1

So, we brainstormed a lot, and reach an elegant, rather simple, easely cachable solution.
To begin with, both Full compliance and change only trigger a "pending / awaiting data" state just after a change in configuration.
It's the same as now for full compliance, and for change only the rationnal is that the node MUST update its promises (since they changed), and it is expected to happen before a run-interval period.

Then, compliance for a given node is a function of ONLY.
The logic is (for one node, but works for n in the same way):
- for a given node (we have a node-bias),
- get the expected configuration right now (last expected configuration)
- errors may happen if the node does not exists or if it does not have config right now. For example, it was added just a second ago.
=> "no data for that node"
- get the last run for the node.
If nodeConfigId(last run) == nodeConfigId(expected config)
=> simple compare & merge
else {
- expected reports INTERSECTION received report > compute the compliance on
received reports (with an expiration date)
- expected reports - received report > pending reports (with an expiration date)
}

Moreover, all compliance status have an expiration date after which it becomes a "no data" report. It's already what is done for pending, but it is also needed for compliance (after that date, you should have gotten an other run), or change only ("end of times").

A WIP branch is available here: https://github.com/fanf/rudder/tree/bug_6021/pending_state_is_not_managed_as_user_could_expect_in_changes_only_mode

Actions #11

Updated by François ARMAND over 6 years ago

Version commit:9aee9a2 on my branch compiles, even if a lot of tests were disable for that.

The performance for a big number of node (2000) are NOT good. On my test machine, even with no reports in base (safe for the root server), the Postgres Request used to get expected reports takes > 20s.

Getting compliance for one node is quick (<50ms for the request)

Actions #12

Updated by Nicolas CHARLES about 6 years ago

You could try to explain analyze the query to see what is not going right

And i believe that generate_subscript may not be what you need, for instance if you have this table

rudder=> select * from expectedreportsnodes where nodejoinkey = 49;
 nodejoinkey |                nodeid                |                nodeconfigids                 
-------------+--------------------------------------+----------------------------------------------
          49 | 06da3556-5204-4bd7-b3b0-fa5e7bcfbbea | {-1884617228}
          49 | 769cd476-b96b-4546-b2fb-0196e333cbc7 | {-1959368040}
          49 | root                                 | {-1684204762,321158356,935965435,1829297815}

you obtain this result

 nodejoinkey |                nodeid                |                nodeconfigids                 | v 
-------------+--------------------------------------+----------------------------------------------+---
          49 | 06da3556-5204-4bd7-b3b0-fa5e7bcfbbea | {-1884617228}                                | 1
          49 | 769cd476-b96b-4546-b2fb-0196e333cbc7 | {-1959368040}                                | 1
          49 | root                                 | {-1684204762,321158356,935965435,1829297815} | 1
          49 | root                                 | {-1684204762,321158356,935965435,1829297815} | 2
          49 | root                                 | {-1684204762,321158356,935965435,1829297815} | 3
          49 | root                                 | {-1684204762,321158356,935965435,1829297815} | 4

So you get more lines, and i'm not sure you are using them in the query nor in the result

Actions #13

Updated by Nicolas CHARLES about 6 years ago

could it be that you need to use unnest rather that generate_subscripts, and so have the values outside of the array to query on it (see http://blog.heapanalytics.com/dont-iterate-over-a-postgres-array-with-a-loop/ )

also, using ANY to search the arrays (if you don't want to use unnest), and even better ANY (VALUES (nodeConfigIds) ) might help a lot (see https://www.datadoghq.com/2013/08/100x-faster-postgres-performance-by-changing-1-line/ )

Actions #14

Updated by François ARMAND about 6 years ago

I tried several variation of the query in commit 9e9e9d1 of my branch.

unnest does not work, throwing an exception for no obvious reason.
generate_subscript seems to be quite efficient after all.

The result is reasonnably quick. 4s to compute compliance with 2000 nodes (~80k expected reports for all nodes).

Now, the next clear optimisation is a cache.

Actions #15

Updated by François ARMAND about 6 years ago

I opened a stackoverflow question here: http://stackoverflow.com/questions/28025753/looking-in-array-with-a-big-set-of-input-values

There is other trail of optimization that can be tested, but the general idea for now is:

- we have an usable solution, and whatever optimisation we set on postgres, we will still have hundreds of millisec of pur computation & rendering
- so the next step is to set-up the cache, so that we just have the correct data, always
- we can still opimize the query after that

Actions #16

Updated by François ARMAND about 6 years ago

In the last commit (5dc5955 around now), I ported most of the old tests, which are passing.
I also added a "debug_compliance" logger that allows to see what the compliance computer is trying to merge with what. Enable it by adding in logback.xml:

    <logger name="debug_compliance" level="debug" />

Actions #17

Updated by François ARMAND about 6 years ago

  • Status changed from In progress to Pending technical review
  • Assignee changed from François ARMAND to Nicolas CHARLES
  • Pull Request set to https://github.com/Normation/rudder/pull/796

The last version on my branche is at last reviewable, so the PR: https://github.com/Normation/rudder/pull/796

I didn't rebased for now, perhaps the steps in commit will help understand the followed path.

I added a trace level for the "debug_compliance" logger. I highly encourage tester to use that logger to fully understand what is going on, even is just seing things matching what is expected on the interface is a good start :)

Nico, I let you have a look at that. Of course, we can spend some time reviewing it together.

Actions #18

Updated by François ARMAND about 6 years ago

The logger named change to "explain_compliance", so now use:

  <logger name="explain_compliance" level="trace" />  
Actions #19

Updated by François ARMAND about 6 years ago

Some more cleaning and a big correction on the rule grid displayed in validation pop-up (ex when directives change).

Actions #20

Updated by François ARMAND about 6 years ago

  • Status changed from Pending technical review to Pending release
  • % Done changed from 0 to 100
Actions #21

Updated by Vincent MEMBRÉ about 6 years ago

  • Status changed from Pending release to Released

This bug has been fixed in Rudder 3.0.0, which was released on 2015/02/16

Actions

Also available in: Atom PDF