Bug #24713

closed

Dynamic groups are slow to compute in Rudder 8.1

Added by François ARMAND 18 days ago. Updated 4 days ago.

Status:
Released
Priority:
N/A
Category:
Performance and scalability
Target version:
Severity:
UX impact:
User visibility:
Effort required:
Priority:
0
Name check:
To do
Fix check:
Checked
Regression:
No

Description

Since we changed the query processor in Rudder 8.1, dynamic groups are slow to compute, especially on instances with thousands of nodes.


Related issues 2 (1 open, 1 closed)

Related to Rudder - Bug #24652: Rudder 8.1 slows down over time (New, Nicolas CHARLES)
Related to Rudder - Bug #24712: ExpiredCompliance events are piling up (Released, Nicolas CHARLES)
Actions #1

Updated by François ARMAND 18 days ago

  • Related to Bug #24652: Rudder 8.1 slows down over time added
Actions #2

Updated by François ARMAND 18 days ago

  • Related to Bug #24712: ExpiredCompliance events are piling up added
Actions #3

Updated by François ARMAND 18 days ago · Edited

So, the new query processor works in two steps:

  • first, analysis. We have three kinds of backends: one working on CoreNodeFact objects in cache, one working on subgroups, and one for the others, using the old LDAP query.
    • for CoreNodeFact matchers, it directly matches properties on Scala objects,
    • for SubGroup matchers, the analysis queries the group's nodeIds for each subgroup and then uses that set of nodeIds,
    • for LDAP matchers, the analysis does the old LDAP query composition of all LDAP lines to get the set of nodeIds matching them.

    Analysis also handles the correct composition of lines (and/or, inversion, with root or not).

  • second, we process the query, i.e. we run one time through all CoreNodeFacts and match each node against the (composed) matcher.
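The two-step design above can be sketched roughly as a matcher tree built during analysis, then applied in a single pass over node facts. This is a hypothetical simplification (NodeFact, Matcher and the variants below are illustrative names, not Rudder's actual API):

```scala
// Hypothetical simplified node fact: the real CoreNodeFact has many more fields.
final case class NodeFact(id: String, os: String, hostname: String)

// A matcher produced by the analysis step; And/Or/Not compose lines.
sealed trait Matcher { def matches(n: NodeFact): Boolean }
final case class OsIs(os: String) extends Matcher {
  // CoreNodeFact-style matcher: direct property check on the Scala object
  def matches(n: NodeFact): Boolean = n.os == os
}
final case class IdIn(ids: Set[String]) extends Matcher {
  // SubGroup/LDAP-style matcher: a precomputed set of matching node IDs
  def matches(n: NodeFact): Boolean = ids.contains(n.id)
}
final case class And(ms: List[Matcher]) extends Matcher {
  def matches(n: NodeFact): Boolean = ms.forall(_.matches(n))
}
final case class Or(ms: List[Matcher]) extends Matcher {
  def matches(n: NodeFact): Boolean = ms.exists(_.matches(n))
}
final case class Not(m: Matcher) extends Matcher {
  def matches(n: NodeFact): Boolean = !m.matches(n)
}

// Step 2: one pass over all node facts with the composed matcher.
def process(nodes: List[NodeFact], m: Matcher): List[String] =
  nodes.collect { case n if m.matches(n) => n.id }
```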

So, further analysis shows that:

  • the size of the node fact does not matter: having a lot of software or just a couple does not change timing (apart from queries on software),
  • even for a simplistic query on just the OS, LDAP analysis takes almost all of the analysis time,
  • analysis time is more than linearly (quadratically? more?) correlated with the number of nodes,
  • even for a simplistic query, the process part is more than linearly correlated with the number of nodes.

We also see that if we remove the LDAP matcher, which should not be used on a simplistic OS query (because it's purely a CoreNodeFact property), then we get:

  • the process time depends on the number of nodes only logarithmically,
  • there is a constant factor between the process time and the equivalent Scala collect function on the node list: x100 for using ZIO, and x10 for having logs (not sure why; it could need a dedicated analysis, but likely the strings are built even if the log is not used, i.e. we are missing some call-by-name somewhere). But even with that x1000 factor, we are still in the range of a few milliseconds (versus microseconds for the chunk traversal with collect), so we don't really care,
  • more importantly, the analysis time drops to the microsecond range, which is what is expected for a couple of compositions, even with 10 lines of criteria.
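The "strings are built even if the log is not used" hypothesis is the classic by-value versus call-by-name argument issue. A minimal, hypothetical illustration (not Rudder's actual logger):

```scala
// Minimal sketch: with a by-value parameter, a disabled logger still
// forces construction of the message string at the call site; with a
// by-name (=> String) parameter, the message is only built if the
// level check passes.
class SketchLogger(enabled: Boolean) {
  // eager: `msg` is evaluated before the call, even when disabled
  def debugEager(msg: String): Unit = if (enabled) println(msg)
  // call-by-name: `msg` is only evaluated if the guard passes
  def debugLazy(msg: => String): Unit = if (enabled) println(msg)
}
```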

So, the root cause is clearly linked to the LDAP matcher.

More analysis shows that:

  • 1/ we don't really split the matcher between the three kinds, so for a lot of cases (hostname, RAM, properties, etc.), we do BOTH an LDAP matcher and a CoreNodeFact matcher,
  • 2/ when there is no LDAP line in the matcher, we are still doing an LDAP query (instead of "no query, just skip that since we don't have any"),
  • 3/ an LDAP query with 0 criteria returns ALL nodes, which is long (which is why the analysis part was long and depended on the number of nodes), and which means that we then do a "set of all node IDs".contains(nodeId) on each node (which explains why the process part was so long, too).
So, a first simple couple of changes would be:

  • really deduplicate on matcher type,
  • if there is no LDAP matcher, just skip the LDAP query.
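A hypothetical sketch of that second change (illustrative names, not Rudder's API): split criteria by backend first, and only issue the LDAP query when at least one criterion actually needs it.

```scala
// Criteria split by backend kind; only LdapCriterion needs the directory.
sealed trait Criterion
final case class FactCriterion(attribute: String, value: String) extends Criterion
final case class LdapCriterion(filter: String) extends Criterion

// Returns Some(nodeIds) to intersect with, or None when no LDAP
// criterion is present, in which case no query is run at all.
def ldapNodeIds(
    criteria: List[Criterion],
    runLdapQuery: List[LdapCriterion] => Set[String]
): Option[Set[String]] = {
  val ldapCriteria = criteria.collect { case c: LdapCriterion => c }
  if (ldapCriteria.isEmpty) None // skip: nothing for the LDAP backend
  else Some(runLdapQuery(ldapCriteria))
}
```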

Another optimization could be: if group/LDAP returns a set of node IDs the size of all nodes, then just return an always-true matcher. But I'm not totally sure LDAP doesn't contain crap that could make that assumption false.
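A hypothetical sketch of that idea (given the caveat above about stray entries in LDAP, an equality check against the known node IDs is safer than just comparing sizes):

```scala
// If the matching set covers exactly every known node, skip the
// per-node set lookup entirely and return a constant always-true
// predicate; otherwise fall back to set membership.
def idSetMatcher(matching: Set[String], allNodeIds: Set[String]): String => Boolean =
  if (matching == allNodeIds) _ => true // always-true: no lookup per node
  else matching.contains
```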

Actions #4

Updated by François ARMAND 18 days ago

  • Status changed from New to In progress
  • Assignee set to François ARMAND
Actions #5

Updated by François ARMAND 18 days ago

  • Status changed from In progress to Pending technical review
  • Assignee changed from François ARMAND to Vincent MEMBRÉ
  • Pull Request set to https://github.com/Normation/rudder/pull/5601
Actions #6

Updated by François ARMAND 14 days ago

With the proposed correction in PR, I get:

2024-04-15 19:21:51+0000 DEBUG dynamic-group.timing - Computing dynamic groups without dependencies finished in 36028 ms
2024-04-15 19:24:37+0000 DEBUG dynamic-group.timing - Computing dynamic groups without dependencies finished in 49727 ms
2024-04-15 19:25:01+0000 DEBUG dynamic-group.timing - Computing dynamic groups without dependencies finished in 23767 ms
2024-04-15 19:34:52+0000 DEBUG dynamic-group.timing - Computing dynamic groups without dependencies finished in 23668 ms
2024-04-15 19:36:22+0000 DEBUG dynamic-group.timing - Computing dynamic groups without dependencies finished in 22414 ms

So a roughly 200x speed-up compared to the first logs on the load server data (which may not be representative).

Actions #7

Updated by Anonymous 11 days ago

  • Status changed from Pending technical review to Pending release
Actions #8

Updated by Alexis Mousset 8 days ago

  • Subject changed from Dynamic group are slow to compute in Rudder 8.1 to Dynamic groups are slow to compute in Rudder 8.1
Actions #9

Updated by François ARMAND 6 days ago

  • Fix check changed from To do to Checked
Actions #10

Updated by Vincent MEMBRÉ 4 days ago

  • Status changed from Pending release to Released

This bug has been fixed in Rudder 8.1.1 which was released today.
