Architecture #4427: cf-promises check on ALL generated promises leads to huge generation time - Rudder - Issue Tracker

Architecture #4427

open

cf-promises check on ALL generated promises leads to huge generation time

Added by François ARMAND over 11 years ago. Updated 11 months ago.

Status:

New

Priority:

N/A

Assignee:

Nicolas CHARLES

Category:

Web - Config management

Target version:

Ideas (not version specific)

Description

When node configurations are generated, a step right after promise generation runs cf-promises on all generated content, so that the generation process is interrupted if an error is detected.

Even assuming that interrupting the whole generation process because of one node in error is OK, this approach leads to HUGE (tens of MINUTES) generation times for big configurations (tens of thousands of expected reports, say 150 nodes with ~50 directives each).

So, we must find a better way to:

- remove that time,
- still correctly inform the user about what is wrong (in case it is)

Ticket #4242 was the initiator of that performance problem.

Workaround: for those who don't want to spend all that time in cf-promises, you can set the property "rudder.community.checkpromises.command" to "/bin/true" in the file "/opt/rudder/rudder-web.properties".
Of course, with that configuration, cf-promises won't run on the generated files for each node. This is OK in most cases, but on rare occasions, especially if you are creating your own Technique, a promise error may not be caught early.

Related issues 5 (2 open — 3 closed)

Updated by François ARMAND over 11 years ago

Description updated (diff)

The first idea that comes to mind is that if we don't want the generation time to be correlated with the complexity of the configuration (number of nodes, etc.), we have to either:

- 1/ make that check asynchronous,
- 2/ split the check and move most of it to other parts of the generation process,
- 3/ not do the check on the server but on the client.

Some initial remarks on these solutions:

1/ make that check asynchronous,

That means that promise verification is no longer part of the generation process.
Perhaps that means that writing the new configuration to the FS, so that nodes see it, is also asynchronous (otherwise a node could update its configuration before it has been checked, and we are then almost in case 3/).
We need a way to give the user feedback if something is wrong, or to take action.
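
To make option 1/ a bit more concrete, here is a minimal Python sketch (not Rudder code) of checks submitted to a background pool, with failures collected later for user feedback. The function names and the check_command argument list are assumptions made for illustration; the command is only expected to take a directory as its last argument and to exit non-zero on error.

```python
from concurrent.futures import ThreadPoolExecutor
import subprocess
import sys


def check_node(check_command, node_dir):
    """Run the check command on one node's directory; return (ok, error output)."""
    result = subprocess.run(
        [*check_command, node_dir], capture_output=True, text=True
    )
    return result.returncode == 0, result.stderr.strip()


def submit_checks(executor, check_command, node_dirs):
    """Queue one check per node and return immediately with the futures."""
    return {
        node: executor.submit(check_node, check_command, path)
        for node, path in node_dirs.items()
    }


def collect_failures(futures):
    """Wait for all checks to finish; return {node: error output} for failures."""
    failures = {}
    for node, future in futures.items():
        ok, output = future.result()
        if not ok:
            failures[node] = output
    return failures
```

The generation process would only call submit_checks and move on; something like a status page would call collect_failures later. Whether promises are published before or after that point is exactly the open question raised above.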

2/ split the check and move most of it to other parts of the generation process,

In that spirit, we check each part of the promises when/where it is generated. Typically, we could generate a mock promise when a directive is changed, and validate it. And the same when a rule is modified.

Problem: some errors may arise only on a particular node, because of the combination of directives or node parameters on that exact node.

So that solution seems to add A LOT of complexity for no gain compared to the other two => avoid it.

3/ don't do the check on the server but on the client.

In that scenario, the promise check is not done on the server. Instead, when a node downloads its new configuration, it checks it with cf-promises before actually replacing its current configuration. If the check is OK, the configuration is updated. Otherwise, the old one is kept, and an error is reported to the user.

This is the only promise check that really counts, because promises can use node-runtime-specific values, and the server simply cannot know them.

Of course, in that scenario, the server-side check time is exactly 0.

Problem: we need some way to send the information back to the user. The most natural way is to have a system directive in charge of the copy, and to have it in error when the check fails.
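
As an illustration of option 3/, the node-side logic could look roughly like the Python sketch below. This is not how the agent is implemented: apply_if_valid, the ".old" backup directory and the check_command argument list are all hypothetical, and the command is only assumed to take the staged directory as its last argument and to exit non-zero on error.

```python
import os
import shutil
import subprocess
import sys
import tempfile


def apply_if_valid(staged_dir, active_dir, check_command):
    """Validate the staged promises and promote them only if the check passes.

    Returns (applied, error_output). On failure, active_dir is left untouched.
    """
    result = subprocess.run(
        [*check_command, staged_dir], capture_output=True, text=True
    )
    if result.returncode != 0:
        # Keep the old configuration and hand the error back for reporting.
        return False, result.stderr.strip() or result.stdout.strip()
    previous = active_dir + ".old"  # hypothetical backup location
    shutil.rmtree(previous, ignore_errors=True)
    if os.path.exists(active_dir):
        os.rename(active_dir, previous)
    os.rename(staged_dir, active_dir)
    return True, ""
```

The returned error output is what the system directive in charge of the copy would report. Note that the two renames are not atomic as a pair, so a real implementation would need more care there.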

Updated by Nicolas CHARLES over 11 years ago

There is also one modification that could be worth doing.
Currently, when we add a node to a rule/group, we consider that the rule changed, and regenerate the promises for all the nodes of that rule.

We could, for node addition only, generate the promises only for the new nodes (== increase the serial of a rule only if there are nodes in the old target of the rule that are not in the new target)

It is clearly not exclusive with the solutions stated above

What do you think of it?
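
The node-addition rule above comes down to a set comparison. A small Python sketch, where plan_regeneration and its return shape are hypothetical, and where regenerating every old and new node on removal is an assumption rather than something stated in this ticket:

```python
def plan_regeneration(old_nodes, new_nodes):
    """Decide whether the rule serial must increase and which nodes to regenerate.

    Only covers a change in the set of targeted nodes; a change in the rule
    content itself would still regenerate everything.
    """
    old_nodes, new_nodes = set(old_nodes), set(new_nodes)
    removed = old_nodes - new_nodes
    added = new_nodes - old_nodes
    if removed:
        # Some node left the target: treat the rule as changed for everyone
        # (assumption: removed nodes need new promises too).
        return {"bump_serial": True, "regenerate": old_nodes | new_nodes}
    # Pure addition (or no change): only the new nodes need promises.
    return {"bump_serial": False, "regenerate": added}
```

For example, going from nodes {a, b} to {a, b, c} would only regenerate c and leave the serial alone, while going from {a, b} to {a, c} would bump the serial.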

Updated by Jonathan CLARKE over 11 years ago

Nicolas CHARLES wrote:

There is also one modification that could be worth doing.
Currently, when we add a node to a rule/group, we consider that the rule changed, and regenerate the promises for all the nodes of that rule.

We could, for node addition only, generate the promises only for the new nodes (== increase the serial of a rule only if there are nodes in the old target of the rule that are not in the new target)

It is clearly not exclusive with the solutions stated above

What do you think of it?

Any and all optimisations that are easy to describe, understand and provide potential gain are good for me!

Updated by Jonathan CLARKE over 11 years ago

Also, a quick and easy optimisation here would be to modify the code that runs the verification command (usually cf-promises) so that it checks whether the command is set to "/bin/true", and in that case avoids any system calls (forking to run an external command, even just /bin/true, 1000 or 5000 times has got to have a cost).
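
A minimal Python sketch of that short-circuit, for illustration only (the real code lives in the Rudder web application and is not shown in this ticket). run_check, NOOP_COMMANDS and the runner parameter are hypothetical names; the configured command is assumed to be a single path that takes the directory to check as its argument.

```python
import subprocess

# Commands known to do nothing: no need to fork a process for them.
NOOP_COMMANDS = {"/bin/true"}


def run_check(check_command, path, runner=subprocess.run):
    """Return True if the check passes, without forking for no-op commands."""
    if check_command in NOOP_COMMANDS:
        return True
    return runner([check_command, path]).returncode == 0
```

The runner parameter is only there so that the function can be exercised without spawning a real process.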

Updated by François ARMAND over 11 years ago

Jonathan CLARKE wrote:

Also, a quick and easy optimisation here would be to modify the code that runs the verification command (usually cf-promises) so that it checks whether the command is set to "/bin/true", and in that case avoids any system calls (forking to run an external command, even just /bin/true, 1000 or 5000 times has got to have a cost).

This is addressed in #4475

Updated by Jonathan CLARKE over 11 years ago

Target version changed from 2.10.0~beta1 to 2.11.0~beta1

Updated by Vincent MEMBRÉ about 11 years ago

Target version changed from 2.11.0~beta1 to 2.11.0~beta2

Updated by Matthieu CERDA about 11 years ago

Target version changed from 2.11.0~beta2 to 2.11.0~rc1

Updated by Vincent MEMBRÉ about 11 years ago

Target version changed from 2.11.0~rc1 to 2.11.0~rc2

#10

Updated by Vincent MEMBRÉ about 11 years ago

Target version changed from 2.11.0~rc2 to 2.11.0

#11

Updated by Vincent MEMBRÉ about 11 years ago

Target version changed from 2.11.0 to 2.11.1

#12

Updated by Nicolas PERRON about 11 years ago

Target version changed from 2.11.1 to 2.11.2

#13

Updated by François ARMAND about 11 years ago

Status changed from New to 8
Target version changed from 2.11.2 to 140

#14

Updated by François ARMAND almost 11 years ago

Subject changed from cf-promise check on ALL generated promises leads to huge generation time to cf-promises check on ALL generated promises leads to huge generation time

#15

Updated by Matthieu CERDA almost 11 years ago

Target version changed from 140 to 3.0.0~beta1

#16

Updated by François ARMAND over 10 years ago

Description updated (diff)

#17

Updated by Jonathan CLARKE over 10 years ago

Target version changed from 3.0.0~beta1 to 3.0.0~beta2

#18

Updated by François ARMAND over 10 years ago

Target version changed from 3.0.0~beta2 to 3.1.0~beta1

Reported in 3.1

#19

Updated by Benoît PECCATTE over 10 years ago

Status changed from 8 to New

#20

Updated by Florian Heigl about 10 years ago

Just wanted to put in a "this feels good" vote for 1)
I think those need to be turned into async:

Policy gen
Policy test
Policy application (if tested)

Not a simple task if one considers changes on top of changes and the need to respect validation.
You'll probably end up inventing filesystem snapshots ;)

#21

Updated by Vincent MEMBRÉ about 10 years ago

Target version changed from 3.1.0~beta1 to 3.1.0~rc1

#22

Updated by François ARMAND about 10 years ago

Well, actually if you think of the whole process as a stream of events, each one leading to an immutable version of THE IT configuration (internal to Rudder: the one data structure that describes the state of all the config), it is quite easy to model: we are lucky to have very few events to take care of, and it is rather easy to know which changes are incompatible.

So imagine something like this:

Incoming events are: group/technique/directive/rule/parameter changes, a new inventory from a node, system parameter changes (like authorized networks or run interval).
If the event is not legal (bad parameters, applies to something that does not exist, etc.), give immediate feedback to the event producer and ignore it.
On a legal event:
=> create a new version of the global configuration model. That version is the new reference one: new modifications are relative to it, and all following steps are asynchronous. Here, we also keep (at least) the last good configuration model. Note that here, we can have several events applied at the same time; we just don't have promises generated for them. New events do not start a new generation, but wait for the asynchronous steps to finish.

So we end up with something like:

T0 -> config0, status: fully generated, CURRENT, path to root directory of promises: ...
T1 -> config1, status: not generated
T2 -> config2, status: not generated
T3 -> config3, status: processing async steps
T4 -> config4, status: waiting

=> creating promises, writing them, testing them, and why not post-hooks, are all part of the processing.

Until a config reaches the end of processing, we don't move the CURRENT flag to the new root directory. Changing the flag must be atomic (change a symlink, change a snapshot number of the FS, change a branch in git...)
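
For the symlink variant of the atomic flag, a minimal Python sketch could be the following. switch_current and the ".tmp" suffix are hypothetical names; the approach relies on rename being atomic on POSIX filesystems, so readers of the link see either the old root or the new one, never a missing link.

```python
import os
import tempfile


def switch_current(current_link, new_root):
    """Atomically repoint the CURRENT symlink to a new promises root."""
    tmp_link = current_link + ".tmp"  # hypothetical temporary name
    if os.path.lexists(tmp_link):
        os.remove(tmp_link)  # leftover from an interrupted switch
    os.symlink(new_root, tmp_link)
    # rename() replaces the existing link in a single step on POSIX.
    os.replace(tmp_link, current_link)
```

Creating the new link under a temporary name and renaming it over the old one avoids the window where CURRENT does not exist, which a remove-then-create sequence would have.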

So, what happens on error, because that's the interesting part?

When a generation fails, nothing is broken yet, because the old, valid promises are still the ones available.
So we mark the config "bad", display the reason in a status page, and start building the most recent config not built yet (either one not generated, or one waiting).
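
The failure handling just described can be sketched in a few lines of Python. ConfigVersion, on_failure and the status strings are hypothetical, and the sketch reads "most recent" as the newest entry in a list ordered from oldest to newest, which is an assumption.

```python
from dataclasses import dataclass


@dataclass
class ConfigVersion:
    name: str
    status: str  # "generated", "not generated", "waiting", "processing", "bad"
    error: str = ""


def on_failure(versions, failed_name, reason):
    """Mark the failed version as bad and start the newest version not built yet.

    `versions` is ordered from oldest to newest. Returns the version that
    starts processing, or None if nothing is left to build.
    """
    for version in versions:
        if version.name == failed_name:
            version.status = "bad"
            version.error = reason  # shown on the status page
    candidates = [v for v in versions if v.status in ("not generated", "waiting")]
    if not candidates:
        return None
    latest = candidates[-1]
    latest.status = "processing"
    return latest
```

Versions already marked as generated are never touched, so the CURRENT promises stay available whatever happens to later versions.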

We could even have that flow at node granularity, but we need to make the information clearly available to the user, so that they can see at a glance which nodes are in which config version (of course, the important part is being able to easily know which nodes are not on the last version and so may miss important updates).

T0 -> config0, status: fully generated, CURRENT, path to root directory of promises: ...
T1 -> config1, status: not generated
T2 -> config2, status: not generated
T3 -> config3, status: bad: error message(s) explaining the problems
T4 -> config4, status: waiting
T5 -> config5, status: waiting
T5 -> config4, status: processing async steps

Effectively, it looks like some kind of fs snapshotting, but now that we have git, it's almost a given ;)
