Architecture #4427

open

cf-promises check on ALL generated promises leads to huge generation time

Added by François ARMAND about 10 years ago. Updated about 5 years ago.

Status:
New
Priority:
1
Category:
Web - Config management

Description

When node configurations are generated, a step right after promise generation runs cf-promises on all generated content, so that the generation process is interrupted if an error is detected.

Even if we accept that interrupting the whole generation process because a single node is in error is OK, this approach leads to HUGE (tens of MINUTES) generation times for big configurations (tens of thousands of expected reports, say 150 nodes with ~50 directives on each).

So, we must find a better way to:

- remove that time,
- still correctly inform the user about what is wrong (in case it is)

Ticket #4242 was the initiator of that performance problem.

Workaround: for those who don't want to spend all that time in cf-promises, you can set the property "rudder.community.checkpromises.command" to "/bin/true" in the file "/opt/rudder/rudder-web.properties".
Of course, with that configuration, cf-promises won't run on the generated files for each node. This is OK in most cases, but on rare occasions, especially if you are creating your own Technique, it may prevent a promise error from being caught early.


Related issues 5 (2 open, 3 closed)

Related to Rudder - Bug #4242: Promise generation takes too long when getting more and more nodes (Released, François ARMAND, 2014-03-11)
Related to Rudder - Bug #4475: Promise generation process should not lose time by forking to run "/bin/true" (Released, Nicolas CHARLES, 2014-02-17)
Related to Rudder - User story #5641: Make the agent policies update a state machine with integrity check (New)
Related to Rudder - Bug #14331: Trigger agent update and run after policy server has finished policy generation (Released, Alexis Mousset)
Related to Rudder - Architecture #7297: Speed up of promise validation (New)
Actions #1

Updated by François ARMAND about 10 years ago

  • Description updated (diff)

The first idea that came to mind is that if we don't want generation time to be correlated with the complexity of the configuration (number of nodes, etc.), we have to either:

- 1/ make that check asynchronous,
- 2/ split the check and move most of it to other parts of the generation process,
- 3/ not do the check on the server but on the client.

Some initial remarks on these solutions:

1/ make that check asynchronous,

That means that the verification of promises is no longer part of the generation process.
It probably also means that the process of updating the new configuration on the FS so that nodes can see it becomes asynchronous as well (otherwise a node could update its configuration before it has been checked, which is almost case 3/).
We need a way to give the user feedback if something is wrong, or to take action.

2/ split the check and move most of it to other parts of the generation process,

In that spirit, we check each part of the promises when/where it is generated. Typically, we could generate a mock promise when a directive is changed, and validate it. And the same when a rule is modified.

Problem: some errors may arise only on a particular node, because of the exact combination of directives or node parameters on that node.

So that solution seems to add A LOT of complexity for no gain compared to the other two => avoid that.

3/ not do the check on the server but on the client.

In that scenario, the promise check is not done on the server. Instead, when a node downloads its new configuration, it checks it with cf-promises before actually replacing its current configuration. If the check is OK, the configuration is updated. Otherwise, the old one is kept, and an error is reported to the user.

This leads to the only promise check that really counts, because promises can use node-runtime-specific values, which the server simply cannot know.

Of course, in that scenario, the server-side check time is exactly 0.

Problem: we need some way to send the information back to the user. The most natural way is to have a system directive in charge of the copy, and have it in error when the check fails.
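
To make option 3/ more concrete, here is a minimal Python sketch of what the node-side "check, then swap" step could look like. It is not how the Rudder agent is implemented: the function names, the staging/active directory layout and the "promises.cf" entry point passed to cf-promises with -f are all assumptions made for illustration.

```python
import subprocess
from pathlib import Path
from typing import Callable, Optional


def cf_promises_check(staged: Path) -> Optional[str]:
    """Hypothetical checker: run cf-promises on the staged entry point.

    Returns None when the promises validate, else the error output.
    The "promises.cf" file name is an assumption.
    """
    result = subprocess.run(
        ["cf-promises", "-f", str(staged / "promises.cf")],
        capture_output=True,
        text=True,
    )
    return None if result.returncode == 0 else (result.stderr or result.stdout)


def apply_if_valid(
    staged: Path,
    active: Path,
    check: Callable[[Path], Optional[str]] = cf_promises_check,
) -> Optional[str]:
    """Promote `staged` to `active` only if `check` accepts it.

    Returns None on success, or the error message that should be
    reported back to the server (the old promises are then kept).
    """
    error = check(staged)
    if error is not None:
        return error  # old configuration stays in place
    if active.exists():
        # Replace any previous backup with the configuration being retired.
        backup = active.with_name(active.name + ".old")
        if backup.exists():
            import shutil
            shutil.rmtree(backup)
        active.rename(backup)
    staged.rename(active)
    return None
```

The checker is injected so that the reporting path (the returned error message) can be wired to whatever mechanism ends up carrying the failure back to the user.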

Actions #2

Updated by Nicolas CHARLES about 10 years ago

There is also one modification that could be worth doing.
Currently, when we add a node to a rule/group, we consider that the rule changed, and regenerate the promises for all the nodes of that rule.

We could, for node addition only, generate only the promises for the new nodes (== increasing the serial of a rule only if there are nodes in the old target of rules that are not in the new target of rules)
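
A rough Python sketch of that rule, only to pin down the set logic. The function name and the (nodes to regenerate, new serial) return shape are made up for the example; the "a node was removed" branch simply keeps today's behaviour of regenerating every node of the rule.

```python
from typing import Set, Tuple


def plan_regeneration(
    old_nodes: Set[str], new_nodes: Set[str], serial: int
) -> Tuple[Set[str], int]:
    """Return (nodes whose promises must be regenerated, new rule serial)."""
    removed = old_nodes - new_nodes
    if removed:
        # Some node left the target: bump the serial and regenerate
        # every node still targeted, as is done today.
        return set(new_nodes), serial + 1
    # Only additions (or no change): keep the serial and generate
    # promises for the newcomers only.
    return new_nodes - old_nodes, serial
```

For example, going from nodes {a, b} to {a, b, c} at serial 3 yields ({c}, 3), while going from {a, b} to {a} yields ({a}, 4).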

It is clearly not exclusive with the solution stated above

What do you think of it?

Actions #3

Updated by Jonathan CLARKE about 10 years ago

Nicolas CHARLES wrote:

There is also one modification that could be worth doing.
Currently, when we add a node to a rule/group, we consider that the rule changed, and regenerate the promises for all the nodes of that rule.

We could, for node addition only, generate only the promises for the new nodes (== increasing the serial of a rule only if there are nodes in the old target of rules that are not in the new target of rules)

It is clearly not exclusive with the solution stated above

What do you think of it?

Any and all optimisations that are easy to describe, understand and provide potential gain are good for me!

Actions #4

Updated by Jonathan CLARKE about 10 years ago

Also, a quick and easy optimisation here would be to modify the code that runs the verification command (usually cf-promises) so that it checks whether the command is set to "/bin/true" and, in that case, avoids making any system calls (forking to run an external command, even just /bin/true, 1000 or 5000 times has got to have a cost).
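
The idea is small enough to sketch in a few lines of Python. This is an illustration only, not the actual Rudder code: `run_check` and `NOOP_COMMAND` are hypothetical names.

```python
import subprocess
from typing import Sequence

# Value the check command is set to when validation is disabled.
NOOP_COMMAND = "/bin/true"


def run_check(command: str, args: Sequence[str] = ()) -> int:
    """Run the verification command and return its exit code.

    When the command is the no-op, report success immediately
    instead of forking a process that would do nothing.
    """
    if command == NOOP_COMMAND:
        return 0
    return subprocess.run([command, *args], check=False).returncode
```

With thousands of nodes, this turns thousands of fork/exec calls into plain string comparisons.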

Actions #5

Updated by François ARMAND about 10 years ago

Jonathan CLARKE wrote:

Also, a quick and easy optimisation here would be to modify the code that runs the verification command (usually cf-promises) so that it checks whether the command is set to "/bin/true" and, in that case, avoids making any system calls (forking to run an external command, even just /bin/true, 1000 or 5000 times has got to have a cost).

This is addressed in #4475

Actions #6

Updated by Jonathan CLARKE about 10 years ago

  • Target version changed from 2.10.0~beta1 to 2.11.0~beta1
Actions #7

Updated by Vincent MEMBRÉ almost 10 years ago

  • Target version changed from 2.11.0~beta1 to 2.11.0~beta2
Actions #8

Updated by Matthieu CERDA almost 10 years ago

  • Target version changed from 2.11.0~beta2 to 2.11.0~rc1
Actions #9

Updated by Vincent MEMBRÉ over 9 years ago

  • Target version changed from 2.11.0~rc1 to 2.11.0~rc2
Actions #10

Updated by Vincent MEMBRÉ over 9 years ago

  • Target version changed from 2.11.0~rc2 to 2.11.0
Actions #11

Updated by Vincent MEMBRÉ over 9 years ago

  • Target version changed from 2.11.0 to 2.11.1
Actions #12

Updated by Nicolas PERRON over 9 years ago

  • Target version changed from 2.11.1 to 2.11.2
Actions #13

Updated by François ARMAND over 9 years ago

  • Status changed from New to 8
  • Target version changed from 2.11.2 to 140
Actions #14

Updated by François ARMAND over 9 years ago

  • Subject changed from cf-promise check on ALL generated promises leads to huge generation time to cf-promises check on ALL generated promises leads to huge generation time
Actions #15

Updated by Matthieu CERDA over 9 years ago

  • Target version changed from 140 to 3.0.0~beta1
Actions #16

Updated by François ARMAND over 9 years ago

  • Description updated (diff)
Actions #17

Updated by Jonathan CLARKE over 9 years ago

  • Target version changed from 3.0.0~beta1 to 3.0.0~beta2
Actions #18

Updated by François ARMAND over 9 years ago

  • Target version changed from 3.0.0~beta2 to 3.1.0~beta1

Reported in 3.1

Actions #19

Updated by Benoît PECCATTE almost 9 years ago

  • Status changed from 8 to New
Actions #20

Updated by Florian Heigl almost 9 years ago

Just wanted to put in a "this feels good" vote for 1)
I think those need to be turned into async:

Policy gen
Policy test
Policy application (if tested)

Not a simple task if one considers changes on top of changes and the need to respect validation.
You'll probably end up inventing filesystem snapshots ;)

Actions #21

Updated by Vincent MEMBRÉ almost 9 years ago

  • Target version changed from 3.1.0~beta1 to 3.1.0~rc1
Actions #22

Updated by François ARMAND almost 9 years ago

Well, actually, if you think of the whole process as a stream of events, each one leading to an immutable version of THE IT configuration (internal to Rudder: the one data structure that holds the state of all the config), it is quite easy to model: we are lucky to have very few events to take care of, and it is rather easy to know which changes are incompatible.

So imagine something like this:

Incoming events are: group/technique/directive/rule/parameters changes, new inventory from a node, system parameter (like authorized network or run interval) changes
If the event is not legal (bad parameters, applies to something that does not exist, etc.), give immediate feedback to the event producer and ignore it.
On legal event:
=> create a new version of the global configuration model. That version is the new reference one, new modifications are relative to it, and all following steps are asynchronous. Here, we also keep (at least) the last good configuration model. Note that several events can be applied in the meantime; we just don't have promises generated for them. New events do not start a new generation, but wait for the asynchronous steps to finish.

So we end up with something like:

T0 -> config0, status: fully generated, CURRENT, path to root directory of promises: ...
T1 -> config1, status: not generated
T2 -> config2, status: not generated
T3 -> config3, status: processing async steps
T4 -> config4, status: waiting

=> creating promises, writing them, testing them, and why not post-hooks, are all part of the processing.

Until a config reaches the end of processing, we don't move the CURRENT flag to the new root directory. Changing the flag must be atomic (change a symlink, change a snapshot number of the FS, change a branch in git...)
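
For the symlink variant, one common way to get an atomic switch on POSIX is to create the new link under a temporary name and rename it over the old one. A minimal Python sketch, where the function name and the ".tmp" suffix are arbitrary choices for the example:

```python
import os
from pathlib import Path


def switch_current(link: Path, new_root: Path) -> None:
    """Atomically repoint the CURRENT symlink `link` to `new_root`."""
    tmp = link.with_name(link.name + ".tmp")
    if tmp.is_symlink() or tmp.exists():
        tmp.unlink()  # leftover from an interrupted previous switch
    os.symlink(new_root, tmp)
    # rename(2) replaces the destination atomically on POSIX, so readers
    # see either the old target or the new one, never a missing link.
    os.replace(tmp, link)
```

A node fetching its promises through the link therefore always gets a fully generated tree, whichever side of the switch it lands on.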

So, what happens on error, because that's the interesting part?

When a generation fails, nothing is broken yet, because the old, OK promises are still the ones available.
So we mark the config "bad", display the reason in a status page, and start building the most recent config not built yet (either one not generated, or one waiting).
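
A toy Python model of that loop, to show how little state is involved. Everything here is hypothetical (class names, the `build` callback that returns an error message or None); it only mirrors the statuses used in the listings of this comment.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable, List, Optional


class Status(Enum):
    NOT_GENERATED = "not generated"
    WAITING = "waiting"
    PROCESSING = "processing async steps"
    GENERATED = "fully generated"
    BAD = "bad"


@dataclass
class ConfigVersion:
    name: str
    status: Status = Status.WAITING
    error: Optional[str] = None


def process_next(
    history: List[ConfigVersion],
    current: Optional[ConfigVersion],
    build: Callable[[ConfigVersion], Optional[str]],
) -> Optional[ConfigVersion]:
    """Build the most recent pending config and return the CURRENT one."""
    pending = [
        c for c in history
        if c.status in (Status.NOT_GENERATED, Status.WAITING)
    ]
    if not pending:
        return current
    candidate = pending[-1]
    for skipped in pending[:-1]:
        skipped.status = Status.NOT_GENERATED  # superseded by a newer one
    candidate.status = Status.PROCESSING
    error = build(candidate)  # generate, write, test, post-hooks...
    if error is not None:
        candidate.status, candidate.error = Status.BAD, error
        return current  # old promises stay CURRENT
    candidate.status = Status.GENERATED
    return candidate
```

Calling `process_next` again after a failure picks the next most recent config that was never built, which is the fallback described above.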

We could even have that flow at node granularity, but we need to make the information clearly available to the user, so that they can see at a glance which nodes are on which config version (of course, the important part is to easily know which nodes are not on the latest version and so may be missing important updates).

T0 -> config0, status: fully generated, CURRENT, path to root directory of promises: ...
T1 -> config1, status: not generated
T2 -> config2, status: not generated
T3 -> config3, status: bad: error message(s) explaining the problems
T4 -> config4, status: waiting
T5 -> config5, status: waiting
T5 -> config4, status: processing async steps

Effectively, it looks like some kind of fs snapshotting, but now that we have git, it's almost a given ;)

Actions #23

Updated by Vincent MEMBRÉ almost 9 years ago

  • Target version changed from 3.1.0~rc1 to 3.1.0
Actions #24

Updated by Vincent MEMBRÉ over 8 years ago

  • Target version changed from 3.1.0 to 3.1.1
Actions #25

Updated by Vincent MEMBRÉ over 8 years ago

  • Target version changed from 3.1.1 to 3.1.2
Actions #26

Updated by Jonathan CLARKE over 8 years ago

  • Target version changed from 3.1.2 to 3.2.0~beta1
Actions #27

Updated by Jonathan CLARKE over 8 years ago

  • Tracker changed from Bug to Architecture
Actions #28

Updated by Vincent MEMBRÉ over 8 years ago

  • Target version changed from 3.2.0~beta1 to 3.2.0~rc1
Actions #29

Updated by Benoît PECCATTE over 8 years ago

  • Target version changed from 3.2.0~rc1 to 3.2.0~rc2
Actions #30

Updated by Benoît PECCATTE about 8 years ago

  • Target version changed from 3.2.0~rc2 to 3.2.0
Actions #31

Updated by Vincent MEMBRÉ about 8 years ago

  • Target version changed from 3.2.0 to 3.2.1
Actions #32

Updated by Vincent MEMBRÉ about 8 years ago

  • Target version changed from 3.2.1 to 3.2.2
Actions #33

Updated by Alexis Mousset about 8 years ago

  • Target version changed from 3.2.2 to 4.0.0~rc2
Actions #34

Updated by François ARMAND over 7 years ago

  • Target version changed from 4.0.0~rc2 to 4.1.0~beta1
Actions #35

Updated by Vincent MEMBRÉ about 7 years ago

  • Target version changed from 4.1.0~beta1 to 4.1.0~beta2
Actions #36

Updated by Vincent MEMBRÉ about 7 years ago

  • Target version changed from 4.1.0~beta2 to 4.1.0~beta3
Actions #37

Updated by Vincent MEMBRÉ about 7 years ago

  • Target version changed from 4.1.0~beta3 to 4.1.0~rc1
Actions #38

Updated by François ARMAND about 7 years ago

  • Target version changed from 4.1.0~rc1 to 4.2.0~beta1
Actions #39

Updated by François ARMAND almost 7 years ago

We are working on a major speed improvement of cf-promises: https://github.com/cfengine/core/pull/2818

Actions #40

Updated by Alexis Mousset almost 7 years ago

  • Target version changed from 4.2.0~beta1 to 4.2.0~beta2
Actions #41

Updated by Vincent MEMBRÉ over 6 years ago

  • Target version changed from 4.2.0~beta2 to 4.2.0~beta3
Actions #42

Updated by Vincent MEMBRÉ over 6 years ago

  • Target version changed from 4.2.0~beta3 to 4.2.0~rc1
Actions #43

Updated by Vincent MEMBRÉ over 6 years ago

  • Target version changed from 4.2.0~rc1 to 4.2.0~rc2
Actions #44

Updated by Vincent MEMBRÉ over 6 years ago

  • Target version changed from 4.2.0~rc2 to 4.2.0
Actions #45

Updated by Vincent MEMBRÉ over 6 years ago

  • Target version changed from 4.2.0 to 4.2.1
Actions #46

Updated by Vincent MEMBRÉ over 6 years ago

  • Target version changed from 4.2.1 to 4.2.2
Actions #47

Updated by Vincent MEMBRÉ over 6 years ago

  • Target version changed from 4.2.2 to 4.2.3
Actions #48

Updated by Vincent MEMBRÉ over 6 years ago

  • Target version changed from 4.2.3 to 4.2.4
Actions #49

Updated by Vincent MEMBRÉ about 6 years ago

  • Target version changed from 4.2.4 to 4.2.5
Actions #50

Updated by Vincent MEMBRÉ almost 6 years ago

  • Target version changed from 4.2.5 to 4.2.6
Actions #51

Updated by Vincent MEMBRÉ almost 6 years ago

  • Target version changed from 4.2.6 to 4.2.7
Actions #52

Updated by Vincent MEMBRÉ over 5 years ago

  • Target version changed from 4.2.7 to 414
Actions #53

Updated by Vincent MEMBRÉ over 5 years ago

  • Target version changed from 414 to Ideas (not version specific)
Actions #54

Updated by Nicolas CHARLES about 5 years ago

  • Related to Bug #14331: Trigger agent update and run after policy server has finished policy generation added
Actions #55

Updated by Nicolas CHARLES about 5 years ago

Following up on this one:
  1. policy generation is much much faster
  2. policy validation still takes a lot of time (about 1s per node)
  3. We have an option in Settings to disable the policy validation (see #14331 )

The plan is to outsource policy validation to the nodes themselves (I cannot find the ticket on the bug tracker at the moment, though)

Actions #56

Updated by Nicolas CHARLES about 5 years ago
