User story #10551
Make policy generation be node by node
We want policy generation to proceed node by node, so that:
- a faulty node does not block policy generation for other nodes,
- when generation takes a very long time (>30 min), nodes can start receiving their new policies without waiting for the whole generation to finish,
- errors are reported on a node-by-node basis,
- we can have a meaningful progress bar for the generation ("7 nodes out of 25"...).
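To make the goals above concrete, here is a minimal sketch of a per-node loop that isolates failures and reports progress. It is not Rudder's implementation: `generate_node_policy`, the node names, and the simulated failure of `node3` are all hypothetical stand-ins.

```shell
#!/bin/sh
# Hypothetical stand-in for the real per-node generation step.
# Assumption: it returns non-zero when generation fails for that node.
generate_node_policy() {
  # Simulate a single faulty node for the demonstration.
  [ "$1" != "node3" ]
}

# Hypothetical node list.
set -- node1 node2 node3 node4
total=$#
done_count=0
failed=""

for node in "$@"; do
  if ! generate_node_policy "$node"; then
    # Record the failure and carry on: one faulty node must not block the others.
    failed="${failed:+$failed }$node"
  fi
  done_count=$((done_count + 1))
  # Progress in the "7 nodes out of 25" style.
  echo "$done_count nodes out of $total"
done

echo "failed nodes: ${failed:-none}"
```

Because each node is handled in its own iteration, a node that succeeds could have its policies published as soon as its iteration ends, rather than at the end of the whole run.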
This, of course, raises a number of questions, for example:
- how do we manage dependencies (typically between a node and its policy server, if the hostname changes)? What happens if only one of the two updates breaks?
- how do we make errors understandable and discoverable? Imagine if 7000 nodes are in error.
(and certainly a number of others).
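On the discoverability question, one possible approach (an assumption, not something decided in this story) is to group failing nodes by error message, so that 7000 nodes sharing the same cause show up as a single line. The `<node id>|<error message>` report format and the messages below are hypothetical.

```shell
#!/bin/sh
# Hypothetical per-node error report, one line per failed node:
#   <node id>|<error message>
errors='node1|missing parameter foo
node2|JS timeout
node3|missing parameter foo
node4|missing parameter foo'

# Keep only the message, then count nodes per distinct message,
# most frequent message first.
summary=$(printf '%s\n' "$errors" | cut -d'|' -f2 | sort | uniq -c | sort -rn)

printf '%s\n' "$summary"
```

The per-node detail stays available in the raw report; the summary only decides which error to look at first.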
Moreover, the parallelism of policy generation can be controlled more finely through the API, along with the JS timeout, dynamic group computation at the start of generation, and change computation:
# for max parallelism, either use '1x' to mean "1 time the number of CPU / 2" or '3' to mean '3 threads'
curl -k -H "X-API-Token: xxx" -X POST 'https://.../rudder/api/latest/settings/rudder_generation_max_parallelism' -d "value=1x"
# value is in seconds
curl -k -H "X-API-Token: xxx" -X POST 'https://.../rudder/api/latest/settings/rudder_generation_js_timeout' -d "value=10"
# use 'false' or 'true'
curl -k -H "X-API-Token: xxx" -X POST 'https://.../rudder/api/latest/settings/rudder_generation_compute_dyngroups' -d "value=false"
curl -k -H "X-API-Token: xxx" -X POST 'https://.../rudder/api/latest/settings/rudder_compute_changes' -d "value=false"
Updated by Nicolas CHARLES 8 months ago
One important point: once a node's policies are generated, they are available, and its expected reports are available too (and treated as expected).
So if we generate the policy of a new node while policy generation is slow (say 15 minutes), and its policy server has its policy generated last, then the new node won't have its policy served to it for 15 minutes, and it will be in "No Answer" for that whole time.