Architecture #6087
closed
Architecture to handle thousands of nodes
Description
Since 3.0, Rudder correctly manages hundreds of nodes. Once we reach a couple of thousand nodes, Rudder becomes really unpleasant to use.
The main pain points are:
- the UI is slow;
- generation of policies takes several minutes (with cf-promises check disabled)
The main reasons for that are (and they are linked):
- we have (almost) no cache
- we don't handle diff/gradual update
Typically, we should be able to write only the promise files that were actually modified, not all files for a given node. Ideally, we should only write parameter-like files per node, with all other files shared in a common place (and linked to with ln -s).
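As a rough illustration of the "only write what changed, symlink the rest" idea, here is a minimal Python sketch. It is not how Rudder writes promises; the helper names `write_if_changed` and `link_shared` and the directory layout are hypothetical, and it assumes a POSIX filesystem where symlinks are available.

```python
import os
from pathlib import Path


def write_if_changed(path: Path, content: str) -> bool:
    """Write content to path only when it differs from what is on disk.

    Returns True when the file was (re)written, False when it was skipped.
    """
    data = content.encode("utf-8")
    if path.is_file() and path.read_bytes() == data:
        return False  # identical content: no disk write needed
    path.parent.mkdir(parents=True, exist_ok=True)
    # Write to a temporary file, then rename, so readers never see a partial file.
    tmp = path.with_name(path.name + ".tmp")
    tmp.write_bytes(data)
    os.replace(tmp, path)
    return True


def link_shared(shared: Path, link: Path) -> bool:
    """Make link a symlink to shared (like `ln -s`), unless it already is one.

    Returns True when the link was created or replaced, False when left alone.
    """
    if link.is_symlink() and os.readlink(link) == str(shared):
        return False
    link.parent.mkdir(parents=True, exist_ok=True)
    if link.is_symlink() or link.exists():
        link.unlink()
    os.symlink(shared, link)
    return True
```

With such helpers, a generation run would only touch the per-node parameter files whose content changed, and node directories would point at a single shared copy of everything else.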
Example of a cache that could be set up: we know precisely when the list of directives changes.
And the major cache that we could set up is on nodes (inventories).
Note that in #5452, we introduced a cache for nodeInfo, but the underlying architecture was not adapted. Ideally, we want events to be generated when a node/inventory/config is modified, and to use these events to invalidate caches.
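To make the event-based invalidation idea concrete, here is a minimal Python sketch. All names (`NodeModified`, `EventBus`, `NodeInfoCache`) are hypothetical and chosen for illustration only; they do not correspond to Rudder classes, and the sketch ignores concurrency.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class NodeModified:
    """Hypothetical event: a node, its inventory, or its config changed."""
    node_id: str


class EventBus:
    """Very small synchronous publish/subscribe bus."""

    def __init__(self) -> None:
        self._subscribers: List[Callable[[NodeModified], None]] = []

    def subscribe(self, handler: Callable[[NodeModified], None]) -> None:
        self._subscribers.append(handler)

    def publish(self, event: NodeModified) -> None:
        for handler in self._subscribers:
            handler(event)


class NodeInfoCache:
    """Caches loader results per node and drops an entry when its node changes."""

    def __init__(self, loader: Callable[[str], str], bus: EventBus) -> None:
        self._loader = loader
        self._entries: Dict[str, str] = {}
        bus.subscribe(self._on_node_modified)

    def get(self, node_id: str) -> str:
        # Only hit the (slow) backend on a cache miss.
        if node_id not in self._entries:
            self._entries[node_id] = self._loader(node_id)
        return self._entries[node_id]

    def _on_node_modified(self, event: NodeModified) -> None:
        # Invalidate just the affected entry; it is reloaded on next access.
        self._entries.pop(event.node_id, None)
```

The point of the design is that the cache never has to guess when data is stale: whatever code modifies a node publishes the event, and every cache that subscribed drops only the affected entry.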
Updated by François ARMAND about 10 years ago
Here are some metrics to get the size of things:
For 2000 nodes, ~10 user directives (based on different techniques) per node, ~30 rules, on a machine without SSD, with 2 GB for inventory and 2 GB for Rudder (which seems to be the lower bound):
Promise generation takes > 7 minutes, ~5 of which are spent writing 150000 files.
Here are the memory sizes, in bytes, of the (naive) data structures used:
All rules: 56160
All node infos: 3406344
All inventories: 183238992 => ~100 kB per inventory
All directives: 1570880
All groups: 488368
All parameters: 1472
So, even with naive data structures, we could very reasonably brute-force cache everything and only have around 1 GB of RAM taken by these caches for 10 000 nodes.
In fact, the LDAP backend is of no use at all in that context, so we can reclaim RAM from it.
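The 10 000-node estimate can be checked with a few lines of Python. This sketch assumes that every measured structure scales linearly with the number of nodes; that is a pessimistic simplification, since rules, directives, and parameters probably do not grow with node count.

```python
# Sizes in bytes, as measured above on the 2000-node test setup.
MEASURED_BYTES = {
    "rules": 56160,
    "node_infos": 3406344,
    "inventories": 183238992,
    "directives": 1570880,
    "groups": 488368,
    "parameters": 1472,
}
MEASURED_NODES = 2000
TARGET_NODES = 10_000

total_measured = sum(MEASURED_BYTES.values())
per_inventory = MEASURED_BYTES["inventories"] / MEASURED_NODES
# Assumption: linear scaling of the whole total with the node count.
projected = total_measured * TARGET_NODES // MEASURED_NODES

print(f"total for {MEASURED_NODES} nodes: {total_measured / 1e6:.1f} MB")
print(f"per inventory: {per_inventory / 1e3:.1f} kB")
print(f"projected for {TARGET_NODES} nodes: {projected / 1e9:.2f} GB")
```

The measured total is about 189 MB, inventories come out at roughly 92 kB each, and the linear projection for 10 000 nodes is about 0.94 GB, which matches the "around 1 GB" figure.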
Updated by Nicolas CHARLES about 10 years ago
François ARMAND wrote:
Here are some metrics to get the size of things:
For 2000 nodes, ~10 user directives (based on different techniques) per node, ~30 rules, on a machine without SSD, with 2 GB for inventory and 2 GB for Rudder (which seems to be the lower bound):
Promise generation takes > 7 minutes, ~5 of which are spent writing 150000 files.
Here are the memory sizes, in bytes, of the (naive) data structures used:
All rules: 56160
All node infos: 3406344
All inventories: 183238992 => ~100 kB per inventory
Be careful, I think in your test you have the same software for all nodes, which is not true in a real-size environment.
All directives: 1570880
I'm really surprised by this size: 10 directives use 1.5 MB? How come?
All groups: 488368
All parameters: 1472
So, even with naive data structures, we could very reasonably brute-force cache everything and only have around 1 GB of RAM taken by these caches for 10 000 nodes.
In fact, the LDAP backend is of no use at all in that context, so we can reclaim RAM from it.
Updated by François ARMAND about 10 years ago
Nicolas CHARLES wrote:
François ARMAND wrote:
Here are some metrics to get the size of things:
For 2000 nodes, ~10 user directives (based on different techniques) per node, ~30 rules, on a machine without SSD, with 2 GB for inventory and 2 GB for Rudder (which seems to be the lower bound):
Promise generation takes > 7 minutes, ~5 of which are spent writing 150000 files.
Here are the memory sizes, in bytes, of the (naive) data structures used:
All rules: 56160
All node infos: 3406344
All inventories: 183238992 => ~100 kB per inventory
Be careful, I think in your test you have the same software for all nodes, which is not true in a real-size environment.
Yes, it's taken care of (see for example: http://www.rudder-project.org/redmine/issues/5965#note-10)
All directives: 1570880
I'm really surprised by this size: 10 directives use 1.5 MB? How come?
No, there are 10 directives per node, but not 10 directives in total. There are ~100 categories/techniques/directives.
And here, it's the Scala data structure (FullActiveTechniqueCategory), which is quite heavy.
Updated by Vincent MEMBRÉ over 9 years ago
- Target version changed from 3.1.0~beta1 to 3.1.0~rc1
Updated by Vincent MEMBRÉ over 9 years ago
- Target version changed from 3.1.0~rc1 to 3.1.0
Updated by Vincent MEMBRÉ over 9 years ago
- Target version changed from 3.1.0 to 3.1.1
Updated by Vincent MEMBRÉ over 9 years ago
- Target version changed from 3.1.1 to 3.1.2
Updated by Jonathan CLARKE over 9 years ago
- Target version changed from 3.1.2 to Ideas (not version specific)
Updated by François ARMAND about 8 years ago
- Status changed from New to Rejected
In 4.0:
- almost everything is cached,
- generation time is dominated by cf-promises checks and, without them, by file writing (about which we can't do anything for now).
So I'm closing this ticket, and we will open more specific ones when needed.