Bug #17557
Closed
Error 500 when adding or deleting nodes through API
Description
Hello,
I got an odd problem when using the API to delete and deploy nodes.
Each action on a node seems to trigger a "regenerate policy" (which I assume is normal), but when I add or delete several nodes I sometimes, at random, get an error 500 from the API.
Apache just shows a 500 on the API call:
[28/May/2020:09:35:20 +0200] "DELETE /rudder/api/latest/nodes/f10b0a1b-70f4-4d6c-94f2-d1bccee9fbac HTTP/1.1" 500
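For reference, that deletion is a plain API call; a minimal sketch of it with curl (the token and hostname below are placeholders, not values from our setup):
# Sketch: the same deletion as a curl call (TOKEN and rudder.example.com are placeholders)
curl --silent --header "X-API-Token: TOKEN" --request DELETE \
  "https://rudder.example.com/rudder/api/latest/nodes/f10b0a1b-70f4-4d6c-94f2-d1bccee9fbac"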
And when digging into the Rudder logs I found this problem:
May 28 09:35:42 rudder rudder[scheduledJob]: [ERROR] Error when updating dynamic group 'all-nodes-with-cfengine-agent' <- Error when processing request for updating dynamic group 'All classic Nodes' (all-nodes-with-cfengine-agent) <- Inconsistancy: When trying to access Node information from cache was empty, but it should not, this is a developper issue, please open an issue on https://issues.rudder.io
As asked in the message, I am opening this bug :-)
This is quite problematic anyway, because I do not know how to work around it: even deleting the nodes manually, doing a factory-reset, re-adding them, etc. is sometimes not enough, and it is hard to automate :-/
I am not sure about what information could help you work on this problem, so do not hesitate to tell me if you need any more details!
Thanks for the help.
Regards,
Updated by Victor Héry over 4 years ago
I forgot to mention that we use Rudder server v6.0.6 and rudder-agent v6.0.6 :)
Updated by Nicolas CHARLES over 4 years ago
Hi Victor
Thank you for this ticket.
How much memory is allocated to rudder-jetty (param XMX in /etc/default/rudder-jetty)?
Updated by Florian Ganée over 4 years ago
Nicolas CHARLES wrote in #note-2:
Hi Victor
Thank you for this ticket.
How much memory is allocated to rudder-jetty (param XMX in /etc/default/rudder-jetty)?
Hi Nicolas,
We've followed the Rudder documentation, including the performance optimization section.
Here is the Jetty configuration:
JAVA_XMX=16384
JAVA_MAXPERMSIZE=256
JAVA_GC="-XX:+UseG1GC -XX:+UnlockExperimentalVMOptions -XX:MaxGCPauseMillis=500 -XX:+UseStringDeduplication"
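For completeness, here is a quick way to double-check the values actually set in the file Nicolas mentioned (a plain grep, nothing Rudder-specific):
grep -E '^JAVA_(XMX|MAXPERMSIZE|GC)' /etc/default/rudder-jetty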
Updated by Nicolas CHARLES over 4 years ago
- Target version set to 6.0.7
Thank you for your answer.
This configuration looks sane. Do you have a high load on the system? How many CPUs do you have?
Could you also send the (sanitized) webapp logs from the 5 minutes before the error?
It may be that the LDAP server is overwhelmed, but I've never seen that with fewer than 5000 nodes.
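A sketch for extracting that window, assuming the default Rudder webapp log location (adjust the path if yours differs):
# Assumes the default webapp log directory; the pattern matches 09:30-09:35 on the day of the error
grep -h '09:3[0-5]:' /var/log/rudder/webapp/*.log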
Updated by Florian Ganée over 4 years ago
Nicolas CHARLES wrote in #note-4:
Thank you for your answer.
This configuration looks sane. Do you have a high load on the system? How many CPUs do you have?
Could you also send the (sanitized) webapp logs from the 5 minutes before the error?
It may be that the LDAP server is overwhelmed, but I've never seen that with fewer than 5000 nodes.
The server has 12 threads (6 hyper-threaded cores). When we hit this error, the load average was about 5 to 10.
In our use case, we issue a lot of simultaneous API calls from Ansible. That causes the high load, mostly from the Rudder LDAP process, which sounds related to what you're describing. A serialized workaround is sketched below.
It may be quite hard at the moment to get all previous logs, sorry.
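Since the simultaneous calls seem to be the trigger, a minimal workaround sketch would be to serialize the deletions (node_ids.txt, TOKEN and the hostname are placeholders):
# Workaround sketch: delete nodes one at a time instead of in parallel
while read -r id; do
  curl --silent --header "X-API-Token: TOKEN" --request DELETE \
    "https://rudder.example.com/rudder/api/latest/nodes/${id}"
  sleep 2  # give each policy regeneration a moment to settle
done < node_ids.txt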
Updated by Nicolas CHARLES over 4 years ago
- Category set to Performance and scalability
OK, so maybe LDAP can't keep up.
Can you run the following script (taken from rudder-upgrade in 6.1)?
service rudder-jetty stop

SLAPD_CONF="/opt/rudder/etc/openldap/slapd.conf"
NEED_REINDEX=false

for i in "directiveId" "softwareVersion" "cn"; do
  line="index\s*${i}\s*eq"
  INDEX_COUNT=$(grep -c "^${line}" ${SLAPD_CONF} 2>/dev/null || true)
  if [ ${INDEX_COUNT} -eq 0 ]; then
    echo "Adding LDAP index on attribute: ${i}"
    sed -i "/index\s*objectClass\s*eq/a index\t${i}\teq" ${SLAPD_CONF}
    NEED_REINDEX=true
  fi
done

if ${NEED_REINDEX}; then
  # stop slapd, slapindex -q, start slapd.
  systemctl stop rudder-slapd
  echo -n "Reindexing LDAP directory - this may take a few minutes..."
  su - rudder-slapd -s /bin/sh -c "/opt/rudder/sbin/slapindex"
  echo " Done"
  systemctl start rudder-slapd
fi

service rudder-jetty start
It will add the LDAP indexes that exist in 6.1, which significantly improves performance, notably for node inventory management (around 5 to 20 times faster).
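Once it has run, you can check that the indexes are in place with a grep on the same slapd.conf path the script uses:
grep -E '^index' /opt/rudder/etc/openldap/slapd.conf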
Updated by Vincent MEMBRÉ over 4 years ago
- Target version changed from 6.0.7 to 6.0.8
Updated by François ARMAND over 4 years ago
- User visibility set to Operational - other Techniques | Rudder settings | Plugins
- Priority changed from 0 to 32
Updated by Vincent MEMBRÉ over 4 years ago
- Target version changed from 6.0.8 to 6.0.9
Updated by Vincent MEMBRÉ about 4 years ago
- Target version changed from 6.0.9 to 6.0.10
- Priority changed from 32 to 31
Updated by Vincent MEMBRÉ about 4 years ago
- Target version changed from 6.0.10 to 798
- Priority changed from 31 to 30
Updated by Benoît PECCATTE over 3 years ago
- Target version changed from 798 to 6.1.14
- Priority changed from 30 to 27
Updated by Vincent MEMBRÉ over 3 years ago
- Target version changed from 6.1.14 to 6.1.15
Updated by Vincent MEMBRÉ over 3 years ago
- Target version changed from 6.1.15 to 6.1.16
Updated by Vincent MEMBRÉ about 3 years ago
- Target version changed from 6.1.16 to 6.1.17
Updated by Vincent MEMBRÉ about 3 years ago
- Target version changed from 6.1.17 to 6.1.18
Updated by Vincent MEMBRÉ almost 3 years ago
- Target version changed from 6.1.18 to 6.1.19
Updated by François ARMAND almost 3 years ago
- Status changed from New to Resolved
We believe that this kind of bug (random error 500 from the API) is rooted in performance problems with LDAP.
We have greatly improved performance in that area, so I'm closing this one. Please reopen it if you are still affected on a recent Rudder version.