Project

General

Profile

Actions

Bug #12937

open

In Rudder 6.2.0 inventory processing merge_uuid part get extremelly slow on debian

Added by Nicolas CHARLES almost 3 years ago. Updated 3 months ago.

Status:
Pending release
Priority:
N/A
Category:
Performance and scalability
Target version:
Severity:
Major - prevents use of part of Rudder | no simple workaround
User visibility:
Operational - other Techniques | Rudder settings | Plugins
Effort required:
Priority:
70
Tags:

Description

After a time, inventory processing on a rudder which manage debian system get extremelly slow - up to several minutes by inventory, versus the expected <1s duration. This cause chain problem with the accumulation of to-be-processed inventories in /var/rudder/inventories/incoming and /var/rudder/inventories/accepted-nodes-updates.

The main observable consequence is new nodes taking several hours to appear in "pending node" in UI.

Analysis:

This was an old ticket which got a brand new hype in 6.2. The underlying root cause was that the number of software was growing linearly with the number of debian-like inventory processed (because of bug on a check condition on the "source name" and "source version" attribute, only present on debian-like since 6.2).

After a time, the huge number of software (in the million range for a middle-sized infra of just 500 linux, vs ~50k in normal operationnal condition, so in the range of a 10k node installation, without the corresponding sized root server) was exacerbing the `merge_uuid` action for them.

The PR correcting that ticket also add a new API endpoint allowing to clean-up unused software entries (see doc API for system/maintenance/purgeSoftware). The purge is triggered every night, but you might want to trigger it after install rudder 6.2.1 to start diminishing the number of garbage software as soon as possible.

History:

With Rudder 4.1, some inventories are painfully slow to process
Checking the logs, in trace, it shows

[2018-07-11 16:50:35] DEBUG com.normation.inventory.services.provisioning.NodeInventoryDNFinderService - Server Id 'NodeId(YYYYYYY-YYYYYYY-YYYYYYY-YYYYYYY-YYYYYYY)' found in DIT 'AcceptedInventory' with id finder 'use_existing_id'
[2018-07-11 16:50:35] DEBUG com.normation.inventory.services.provisioning.MachineDNFinderService - Processing machine id finder use_existing_id
[2018-07-11 16:50:35] DEBUG com.normation.inventory.services.provisioning.MachineDNFinderService - Processing machine id finder check_mother_board_uuid_accepted
[2018-07-11 16:50:35] DEBUG com.normation.inventory.services.provisioning.MachineDNFinderService - Machine Id 'MachineUuid(XXXXX-XXXXX-XXXXX-XXXXX-XXXXX)' found with id finder 'check_mother_board_uuid_accepted'
[2018-07-11 16:50:35] TRACE com.normation.inventory.ldap.provisioning.DefaultReportSaver - Precommit 'pre_commit_inventory:merge_uuid': 30645 ms
[2018-07-11 16:50:35] TRACE com.normation.inventory.ldap.provisioning.DefaultReportSaver - Precommit 'pre_commit_inventory:check_machine_cn': 0 ms
[2018-07-11 16:50:35] TRACE com.normation.inventory.ldap.provisioning.DefaultReportSaver - Precommit 'pre_commit_inventory:set_last_inventory_date': 0 ms
[2018-07-11 16:50:35] TRACE com.normation.inventory.ldap.provisioning.DefaultReportSaver - Precommit 'pre_commit_inventory:add_ip_values': 0 ms
[2018-07-11 16:50:35] TRACE com.normation.inventory.ldap.provisioning.DefaultReportSaver - Precommit 'pre_commit_inventory:log_inventory': 0 ms
[2018-07-11 16:50:35] TRACE com.normation.inventory.ldap.provisioning.DefaultReportSaver - Pre commit report: 30647 ms

There is no reason for such check to be so slow.

WORKAROUND

It may helps to delete unreferenced software, and perhaps the batch that should do that is in error (you can grep unreferenced /var/log/rudder/webapp/* to see if there is errors or a lot of them)

You can do it manually:

So, first, MAKE A BACKUP OF YOUR LDAP: https://docs.rudder.io/reference/6.2/administration/procedures.html#_migration_backups_and_restores

Then, create de directory for work, and go in it. Then:

- create a directory for work, and go in it. Then:
- all softwares:

ldapsearch -LLL -o ldif-wrap=no -h localhost -p 389 -x -D "cn=Manager,cn=rudder-configuration" -w LDAP_PASS_FROM_rudder-passwords_file -b "ou=Software,ou=Inventories,cn=rudder-configuration" -s one 1.1 | sort | uniq | cut -d: -f2 > all-soft-sorted.dns

- softwares used in accepted nodes:
ldapsearch -LLL -o ldif-wrap=no -h localhost -p 389 -x -D "cn=Manager,cn=rudder-configuration" -w LDAP_PASS -b "ou=Nodes,ou=Accepted Inventories,ou=Inventories,cn=rudder-configuration" -s one 'nodeId=*' software | grep softwareId | sort | uniq | cut -d: -f2 > nodes-soft-sorted.dns

- software to delete (ie all - nodes):
grep -f nodes-soft-sorted.dns -v all-soft-sorted.dns > soft-to-delete.dns

- then delete unreferenced:
ldapdelete -h localhost -p 389 -x -D "cn=Manager,cn=rudder-configuration" -w LDAP_PASS -c -f soft-to-delete.dns


Subtasks 1 (0 open1 closed)

Bug #18830: Missing doc for purgeSoftware API endpointReleasedAlexis MOUSSETActions

Related issues

Related to Rudder - Architecture #17128: review index for LDAPReleasedFrançois ARMANDActions
Related to Rudder - Bug #18873: purge software batch sometime terminate in error without log messageReleasedVincent MEMBRÉActions
Actions #1

Updated by Nicolas CHARLES almost 3 years ago

  • Tags set to Sponsored
Actions #2

Updated by Benoît PECCATTE almost 3 years ago

  • Target version changed from 4.1.14 to 4.1.15
Actions #3

Updated by François ARMAND over 2 years ago

Is there something special with these inventories?

Is it easely reproducible (or something link to the system state, like load, etc) ?

Actions #4

Updated by Nicolas CHARLES over 2 years ago

There are a lot of virtual machines there - hence the slowing

Actions #5

Updated by Vincent MEMBRÉ over 2 years ago

  • Target version changed from 4.1.15 to 4.1.16
Actions #6

Updated by Vincent MEMBRÉ over 2 years ago

  • Target version changed from 4.1.16 to 4.1.17
Actions #7

Updated by Vincent MEMBRÉ over 2 years ago

  • Target version changed from 4.1.17 to 4.1.18
Actions #8

Updated by Vincent MEMBRÉ over 2 years ago

  • Target version changed from 4.1.18 to 4.1.19
Actions #9

Updated by Alexis MOUSSET over 2 years ago

  • Target version changed from 4.1.19 to 4.1.20
Actions #10

Updated by François ARMAND about 2 years ago

  • Target version changed from 4.1.20 to 4.1.21
Actions #11

Updated by Vincent MEMBRÉ about 2 years ago

  • Target version changed from 4.1.21 to 4.1.22
Actions #12

Updated by Benoît PECCATTE about 2 years ago

  • User visibility set to Operational - other Techniques | Rudder settings | Plugins
  • Priority changed from 0 to 57

Nicolas, do you have more information ?

Actions #13

Updated by Vincent MEMBRÉ almost 2 years ago

  • Target version changed from 4.1.22 to 4.1.23
  • Priority changed from 57 to 56
Actions #14

Updated by Vincent MEMBRÉ almost 2 years ago

  • Target version changed from 4.1.23 to 4.1.24
Actions #15

Updated by Vincent MEMBRÉ almost 2 years ago

  • Target version changed from 4.1.24 to 588
  • Priority changed from 56 to 54
Actions #16

Updated by Nicolas CHARLES almost 2 years ago

  • Target version changed from 588 to 5.0.13

still true

Actions #17

Updated by Vincent MEMBRÉ over 1 year ago

  • Target version changed from 5.0.13 to 5.0.14
Actions #18

Updated by Vincent MEMBRÉ over 1 year ago

  • Target version changed from 5.0.14 to 5.0.15
Actions #19

Updated by Vincent MEMBRÉ over 1 year ago

  • Target version changed from 5.0.15 to 5.0.16
Actions #20

Updated by Alexis MOUSSET about 1 year ago

  • Target version changed from 5.0.16 to 5.0.17
Actions #21

Updated by Nicolas CHARLES about 1 year ago

on 5.0, it still happens

15:16:43.247 [monix-io-37]     TRACE inventory-processing - Start post processing of inventory 'rmm-xxxxxxxxx-xxxxxxxx.ocs' for node 'yyyyyy-yyyyyyyy'
15:16:43.248 [inventory-wait-sig-32] DEBUG inventory-processing - Inventory 'rmm-xxxxxxxxx-xxxxxxxx.ocs' for node 'yyyyyy-yyyyyyyy' pre-processed in 1 second and 147 milliseconds ms
15:16:43.248 [monix-io-37]     TRACE com.normation.inventory.ldap.provisioning.DefaultReportSaver - Precommit 'pre_commit_inventory:check_os_type_is_known': 0 ms
15:16:43.250 [inventory-wait-sig-32] TRACE inventory-processing - Inventory 'rmm-xxxxxxxxx-xxxxxxxx.ocs' for Node 'yyyyyy-yyyyyyyy' added to processing queue.
15:16:53.788 [monix-io-37]     TRACE com.normation.inventory.ldap.provisioning.DefaultReportSaver - Precommit 'pre_commit_inventory:merge_uuid': 10540 ms
15:16:53.788 [monix-io-37]     TRACE com.normation.inventory.ldap.provisioning.DefaultReportSaver - Precommit 'pre_commit_inventory:check_machine_cn': 0 ms
15:16:53.788 [monix-io-37]     TRACE com.normation.inventory.ldap.provisioning.DefaultReportSaver - Precommit 'pre_commit_inventory:set_last_inventory_date': 0 ms
15:16:53.788 [monix-io-37]     TRACE com.normation.inventory.ldap.provisioning.DefaultReportSaver - Precommit 'pre_commit_inventory:add_ip_values': 0 ms
15:16:53.790 [monix-io-37]     TRACE com.normation.inventory.ldap.provisioning.DefaultReportSaver - Precommit 'pre_commit_inventory:log_inventory': 1 ms
15:16:53.791 [monix-io-37]     TRACE com.normation.inventory.ldap.provisioning.DefaultReportSaver - Pre commit report: 10543 ms
15:16:53.791 [monix-io-37]     TRACE com.normation.inventory.ldap.provisioning.DefaultReportSaver - Saving software: 0 ms
15:16:53.830 [monix-io-37]     TRACE com.normation.inventory.ldap.provisioning.DefaultReportSaver - Saving machine: 39 ms
15:16:53.932 [monix-io-37]     TRACE com.normation.inventory.ldap.provisioning.DefaultReportSaver - Saving node: 102 ms
15:16:53.933 [monix-io-37]     TRACE com.normation.inventory.ldap.provisioning.DefaultReportSaver - Saving vms: 1 ms
15:16:53.934 [monix-io-37]     TRACE com.normation.inventory.ldap.provisioning.DefaultReportSaver - Commit report: 144 ms
15:16:53.936 [monix-io-37]     TRACE com.normation.inventory.ldap.provisioning.DefaultReportSaver - Post commit report: 2 ms
15:16:53.937 [monix-io-37]     DEBUG inventory-processing - Report saved.
15:16:53.937 [monix-io-37]     INFO  inventory-processing - Report 'rmm-xxxxxxxxx-xxxxxxxx.ocs' for node 'rmm-p1196662pl.ads.local' [yyyyyy-yyyyyyyy] (signature:undefined) processed in 10 seconds and 690 milliseconds ms

Actions #22

Updated by Nicolas CHARLES about 1 year ago

the cost is in this part

    val mergedSoftwares = softwareIdFinder.tryWith(report.applications.toSet) match {
      case eb: EmptyBox =>
        val e = eb ?~! "Error when trying to find existing software UUIDs" 
        logger.error(e.messageChain)
        return e
      case Full(s) => s
    }

DEBUG com.normation.inventory.ldap.provisioning.UuidMergerPreCommit - merge software 10853 ms
Actions #23

Updated by Nicolas CHARLES about 1 year ago

adding indexes
index cn eq
index softwareVersion eq

makes it goes from 10s to 860ms

we used to haveindex, then removed them by replacing them by caching https://issues.rudder.io/issues/5965
now we use mdb, which doesn't have cache. Maybe adding indexes would be a good idea...

I've been playing with method NameAndVersionIdFinder.tryWith , to make it faster, and I'm facing a puzzle: batching the query makes ldap faster (559ms vs 10s), yet the whole method is slower: either foldLeft or flatMap gets much slower. Actually, that's because it starts returning data as soon as it gets some, and slapd continues in background to process the data

so here the only solution I have is to add more indexes

Actions #24

Updated by Nicolas CHARLES about 1 year ago

adding more RAM and CPUs to the system marginally improve perfs.
As data all fit in RAM, i don't get to read from disk, but slapd is still crunhcing at 80-100% CPUs while processing data

Actions #25

Updated by Nicolas CHARLES about 1 year ago

Actions #26

Updated by Vincent MEMBRÉ about 1 year ago

  • Target version changed from 5.0.17 to 5.0.18
Actions #27

Updated by Vincent MEMBRÉ 10 months ago

  • Target version changed from 5.0.18 to 5.0.19
Actions #28

Updated by Nicolas CHARLES 9 months ago

  • Status changed from New to Rejected

this was mostly fixed by #17128

Actions #29

Updated by François ARMAND 4 months ago

  • Description updated (diff)
Actions #30

Updated by François ARMAND 4 months ago

  • Description updated (diff)
Actions #31

Updated by François ARMAND 4 months ago

  • Description updated (diff)
Actions #32

Updated by François ARMAND 4 months ago

  • Description updated (diff)
Actions #33

Updated by François ARMAND 4 months ago

  • Status changed from Rejected to New
  • Target version changed from 5.0.19 to 6.2.1

It seems to be worse in 6.2, even with indexes in place. Users reports inventory acceptation time of up to 10 minutes (!!!!), and even after software has explained in workaround section above, still up to 30s (with indexes).

Idea: we don't really "merge" software ids, we are merelly normalizing them (we don't really use any extended information about them). So we could just normalize software dn, and try to add them if not present (and see how to let openldap handle it).

Actions #34

Updated by François ARMAND 4 months ago

  • Status changed from New to In progress
  • Assignee set to François ARMAND
Actions #35

Updated by François ARMAND 4 months ago

  • Status changed from In progress to Pending technical review
  • Assignee changed from François ARMAND to Vincent MEMBRÉ
  • Pull Request set to https://github.com/Normation/rudder/pull/3471
Actions #36

Updated by François ARMAND 4 months ago

  • Status changed from Pending technical review to In progress
  • Assignee changed from Vincent MEMBRÉ to François ARMAND
  • Pull Request deleted (https://github.com/Normation/rudder/pull/3471)
  • Severity changed from Minor - inconvenience | misleading | easy workaround to Major - prevents use of part of Rudder | no simple workaround
  • Priority changed from 54 to 70

OK, so after some research, there is something really fishy with openldap write operation for software in 6.2. The acceptation time of an inventory is ~100x bigger (after a few hundred inventories) than on 6.1: in 6.2, we are in the 30s-Xminutes range, while in 6.1 we were in the 300-600ms range.

So, the first identified cause is the index on `software` and more impacting on `modifyTimestamp`.
Removing them and rebuilding indexes led to ~10x improvement, but we are still in 3s-15s range.

Identifying the second cause is hard. This is what I tested and WIP conclusion:

- I took a Centos 8 VM with rudder 6.1 with very few nodes.
- I cloned it into to working VM: VM6.1 which is the rudder 6.1 reference point, and VM6.2 for which the first step was to upgrade to rudder 6.2
- I disabled webapp on each VM to be able to use code running from my en dev.

First experience:

- adding 1000 inventories in VM6.1 with rudder-webapp 6.1 HEAD code is fast;
- adding 1000 inventories in VM6.2 with rudder-webapp 6.2 HEAD code is slow;
- removing `modifyTimestamp` and `software` indexes in VM6.2 and adding inventories is much better (~10x)

=> these two indexes are at least a part of the drop in performance.

Second test: trying to change merge_uuid code

There was a lot of other tests done trying to make code faster for `merge_uuid` but they were misdirected (see bad PR here: https://github.com/Normation/rudder/pull/3471).

At least, we learned that:

- it was a bad idea to use UUID for softwares when they are identified uniquely by name and version;
- it was a bad idea to use the full DN in node when just the UUID would have been enought and more compact;
- it was a bad idea to not make software owned by a node, either by serializing them in json directly as node attributes, or in a sub-entry below node entry.

The problem is not with the 6.2 branch webapp code

Then, I tried to use 6.2 HEAD webapp code with VM6.1. Adding inventories was as fast as with webapp code from 6.1 branch.

=> the problem seems to be with something we changed with openldap (idea: 1/ binaries; 2/ schema; 3/ other config than index), with software (4/ we added some attribute to them), or with the data base (5/ mdb binaries, 6/ data by themselves)

openldap 1/ binary or 2/ schema or 3/ config

I tried to use VM6.1 openldap schema and binary in VM6.2. It didn't changed anything.

I didn't found anything in configuration that changed:

- beside the two indexes, slapd.conf are unchanged,
- there is no specific configuration in /var/rudder/ldap/openldap-data for mdb performance
- the exec command are the same in 6.1 and 6.2: /opt/rudder/libexec/slapd -n rudder-slapd -u rudder-slapd -f /opt/rudder/etc/openldap/slapd.conf -h ldap://0.0.0.0:389/
- I don't have other ideas of configuration that could have change

Data corruption (6/) ?

I thought that maybe the migration process from 6.1 to 6.2 lead to a data corruption but:

- replacing /var/rudder/ldap/openldap-data data files in VM6.2 with the ones from VM6.1 lead to bad performance;
- trying to backup data in VM6.2, delete data, restore data (following https://docs.rudder.io/reference/6.2/administration/procedures.html#_backup) to eliminate a data format corruption lead to bad performance.

=> it seems that the data format is not the cause by itself of the drop in perf.

Next step:

- test if code 6.1 in VM6.2 is slow. I forgot to test that. It would also allow to test if the added data in software are the cause.
- test mdb binaries from VM6.1 to VM6.2

Actions #37

Updated by François ARMAND 4 months ago

SO, I was able to make 6.2 quick.

For that, I:

- copied binaries from VM6.1 /opt/rudder/libexec, /opt/rudder/lib/liblmdb.so, /opt/rudder/lib/ldap
- copied data from VM6.1: /var/rudder/ldap/openldap-data
- restart everything.

Things are very quick at first (a lot in 100 ms range), and then seems to stabiliez around 1-2s (perhaps it's just due to inventories number of software. Since it's real data, they may have an impact).

BUT: when I switched just the lmdb binary back to 6.2, things get much worse, and switching again to 6.1 didn't heal them. I had to restart with data from 6.1 too. So it seems that lmdb binary or something in rudder code 6.2 make things go slower in a way that I still don't know how to reverse.

Actions #38

Updated by François ARMAND 4 months ago

I confirm that I can add as many inventories as I want in VM6.1 with code from HEAD 6.2 and the timing are:

- majority 50-150ms,
- 99% < 300ms
- max 800ms

Actions #39

Updated by François ARMAND 4 months ago

So, I copied inventory.schem and rudder.schem in VM6.1 and I lost one order of magnitude in perf:

- 5% <1s
- majority between 2-6s
- some up to 8-10s

It may be because a lot of software must be updated to add `sourceName` and `sourceVersion` attributes, but there may be more to that too.

Actions #40

Updated by François ARMAND 4 months ago

SO.

There is a problem in the merge of software and they are never merged, always new one are created. That cause explosion in the number of software and saving soft is slower than doing nothing in any case.

Actions #41

Updated by François ARMAND 4 months ago

  • Status changed from In progress to Pending technical review
  • Assignee changed from François ARMAND to Nicolas CHARLES
  • Pull Request set to https://github.com/Normation/rudder/pull/3472
Actions #42

Updated by François ARMAND 4 months ago

  • Status changed from Pending technical review to Pending release
Actions #43

Updated by François ARMAND 4 months ago

  • Subject changed from When receiving an inventory, the merge_uuid part can be painfully slow to In Rudder 6.2.0 inventory processing merge_uuid part get extremelly slow on debian
  • Description updated (diff)

Further analysis on index bring new clues. Now that OpenLDAP is not massively undersized in CPU for the number of entries, the difference between having indexes and not having them is not that big (~10% mean, from 620ms with indexes to 550ms without). Since we need these indexes to (massively) speed up some request, I think we should keep them for now.

Actions #45

Updated by Vincent MEMBRÉ 4 months ago

This bug has been fixed in Rudder 6.2.1 which was released today.

Actions #46

Updated by François ARMAND 3 months ago

  • Related to Bug #18873: purge software batch sometime terminate in error without log message added
Actions

Also available in: Atom PDF