Bug #17557

closed

Error 500 when adding or deleting nodes through API

Added by Victor Héry about 4 years ago. Updated over 2 years ago.

Status: Resolved
Priority: N/A
Assignee: -
Category: Performance and scalability
Target version:
Severity: Minor - inconvenience | misleading | easy workaround
UX impact:
User visibility: Operational - other Techniques | Rudder settings | Plugins
Effort required:
Priority: 27
Name check: To do
Fix check: To do
Regression:

Description

Hello,

I have an odd problem when using the API to delete and deploy nodes.

Each action on nodes seems to launch a "regenerate policy" (which I assume is normal), but when I add or delete multiple nodes, I randomly get an error 500 from the API.

Apache just shows a 500 on the API:

[28/May/2020:09:35:20 +0200] "DELETE /rudder/api/latest/nodes/f10b0a1b-70f4-4d6c-94f2-d1bccee9fbac HTTP/1.1" 500 

When digging in the Rudder logs, I found this error:

May 28 09:35:42 rudder rudder[scheduledJob]: [ERROR] Error when updating dynamic group 'all-nodes-with-cfengine-agent' <- Error when processing request for updating dynamic group 'All classic Nodes' (all-nodes-with-cfengine-agent) <- Inconsistancy: When trying to access Node information from cache was empty, but it should not, this is a developper issue, please open an issue on https://issues.rudder.io

As the message asks, I am opening this bug :-)

This is a real problem for us because I do not know how to solve it: even deleting the nodes manually, doing a factory reset, re-adding the nodes, etc. is sometimes not sufficient, and it is difficult to automate :-/
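
For scripted deletions that hit the intermittent 500, one possible stopgap is to retry the call. This is only a minimal sketch, assuming curl is available: `retry_on_500`, `delete_node`, `RUDDER_URL` and `RUDDER_TOKEN` are hypothetical names, and the `X-API-Token` header is assumed to be how the API token is passed. It does not address the underlying cause.

```shell
#!/bin/sh
# Hypothetical helper: run a command that prints an HTTP status code and
# retry it while that code is 500, up to MAX_ATTEMPTS tries.
# Usage: retry_on_500 MAX_ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]
# Prints the last status code; returns 1 if it is still 500.
retry_on_500() {
  max_attempts="$1"; delay="$2"; shift 2
  attempt=1
  while :; do
    status="$("$@")"
    if [ "${status}" != "500" ]; then
      echo "${status}"
      return 0
    fi
    if [ "${attempt}" -ge "${max_attempts}" ]; then
      echo "${status}"
      return 1
    fi
    attempt=$((attempt + 1))
    sleep "${delay}"
  done
}

# Hypothetical wrapper around the DELETE call seen in the Apache log.
# RUDDER_URL and RUDDER_TOKEN are assumed environment variables.
delete_node() {
  curl -s -o /dev/null -w '%{http_code}' -X DELETE \
    -H "X-API-Token: ${RUDDER_TOKEN}" \
    "${RUDDER_URL}/rudder/api/latest/nodes/$1"
}
```

It could be called as `retry_on_500 5 10 delete_node f10b0a1b-70f4-4d6c-94f2-d1bccee9fbac`, which tries up to five times with ten seconds between attempts.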

I am not sure about what information could help you work on this problem, so do not hesitate to tell me if you need any more details!

Thanks for the help.

Regards,

Actions #1

Updated by Victor Héry about 4 years ago

I forgot to mention that we use Rudder server v6.0.6 and rudder-agent v6.0.6 :)

Actions #2

Updated by Nicolas CHARLES about 4 years ago

Hi Victor

Thank you for this ticket.
How much memory is allocated to rudder-jetty (the JAVA_XMX parameter in /etc/default/rudder-jetty)?

Actions #3

Updated by Florian Ganée about 4 years ago

Hi Nicolas,

We followed the performance optimization section of the Rudder documentation.
Here is the Jetty configuration:
JAVA_XMX=16384
JAVA_MAXPERMSIZE=256
JAVA_GC="-XX:+UseG1GC -XX:+UnlockExperimentalVMOptions -XX:MaxGCPauseMillis=500 -XX:+UseStringDeduplication"
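
To read these settings back on a server, something like the following could be used. It is a rough sketch: `get_default` is a hypothetical helper, and it only handles simple `KEY=value` lines such as the three shown above.

```shell
# Hypothetical helper: print the value assigned to KEY in a shell-style
# defaults file (the last assignment wins), without surrounding double quotes.
# Usage: get_default KEY [FILE]
get_default() {
  key="$1"
  file="${2:-/etc/default/rudder-jetty}"
  sed -n "s/^${key}=//p" "${file}" | tail -n 1 | sed 's/^"\(.*\)"$/\1/'
}
```

For example, `get_default JAVA_XMX` prints the heap setting from /etc/default/rudder-jetty.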

Actions #4

Updated by Nicolas CHARLES about 4 years ago

  • Target version set to 6.0.7

Thank you for your answer.
This configuration looks sane. Do you have a high load on the system? How many CPUs do you have?
Could you also send the (sanitized) webapp logs from the 5 minutes before the error?

It may be that LDAP is overwhelmed, but I have never seen that with fewer than 5000 nodes.
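
One way to extract such a five-minute window from syslog-style output (the format of the error line quoted in the description) is sketched below. `log_window` is a hypothetical helper that reads from standard input; where the webapp log lives on a given server is not covered here.

```shell
# Hypothetical helper: print syslog-style lines ("May 28 09:35:42 host ...")
# whose date matches MONTH DAY and whose time lies in [START, END].
# Times are compared as HH:MM:SS strings, so the window must not cross midnight.
# Usage: log_window MONTH DAY START END < logfile
log_window() {
  awk -v mon="$1" -v day="$2" -v start="$3" -v end="$4" \
    '$1 == mon && $2 == day && $3 >= start && $3 <= end'
}
```

For the error in the description, the call would be `log_window May 28 09:30:42 09:35:42`, fed with the relevant log file.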

Actions #5

Updated by Florian Ganée about 4 years ago

The server has 12 threads (6 hyper-threaded cores). When we had this error, the load average was about 5 to 10.
In our use case, we make a lot of simultaneous API calls with Ansible. This causes the high load, mostly from the LDAP process in Rudder, which sounds related to what you describe.
It may be quite hard to get all the previous logs at the moment, sorry.
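
To put a load figure in relation to the number of CPU threads, a small helper such as the one below can help. It is a sketch for Linux only: `load_per_cpu` is a hypothetical name, and it reads /proc/loadavg and calls nproc unless both values are passed explicitly.

```shell
# Hypothetical helper: print the 1-minute load average divided by the
# number of logical CPUs, with two decimals.
# Usage: load_per_cpu [LOAD CPUS]
load_per_cpu() {
  load="${1:-$(cut -d ' ' -f 1 /proc/loadavg)}"
  cpus="${2:-$(nproc)}"
  awk -v l="${load}" -v c="${cpus}" 'BEGIN { printf "%.2f\n", l / c }'
}
```

With the figures above, `load_per_cpu 5 12` and `load_per_cpu 10 12` give 0.42 and 0.83. A ratio below 1 suggests the CPUs are not saturated overall, although a single busy process can still be the bottleneck.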

Actions #6

Updated by Nicolas CHARLES about 4 years ago

  • Category set to Performance and scalability

OK, so maybe LDAP cannot keep up.

Can you run the following script (taken from rudder-upgrade in 6.1)?

  # Stop the Rudder webapp while the LDAP indexes are modified.
  service rudder-jetty stop

  SLAPD_CONF="/opt/rudder/etc/openldap/slapd.conf"
  NEED_REINDEX=false
  # Add an equality index for each attribute that does not have one yet.
  for i in "directiveId" "softwareVersion" "cn"; do
    line="index\s*${i}\s*eq"
    INDEX_COUNT=$(grep -c "^${line}" "${SLAPD_CONF}" 2>/dev/null || true)
    if [ "${INDEX_COUNT}" -eq 0 ]; then
      echo "Adding LDAP index on attribute: ${i}"
      sed -i "/index\s*objectClass\s*eq/a index\t${i}\teq" "${SLAPD_CONF}"
      NEED_REINDEX=true
    fi
  done

  if ${NEED_REINDEX}; then
    # Stop slapd, run slapindex, start slapd.
    systemctl stop rudder-slapd
    echo -n "Reindexing LDAP directory - this may take a few minutes..."
    su - rudder-slapd -s /bin/sh -c "/opt/rudder/sbin/slapindex"
    echo " Done"
    systemctl start rudder-slapd
  fi

  # Restart the webapp once LDAP is back up.
  service rudder-jetty start

It will add the LDAP indexes that exist in 6.1, which significantly improve performance, notably for node inventory management (around 5 to 20 times faster).
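
To check afterwards that the three indexes are declared, a sketch along these lines could be used. `check_ldap_indexes` is a hypothetical helper that applies the same pattern as the script above; it only inspects the configuration file and does not verify that the reindexing itself completed.

```shell
# Hypothetical check: report which of the three indexes are declared in a
# slapd.conf. Returns non-zero if at least one is missing.
# Usage: check_ldap_indexes [SLAPD_CONF]
check_ldap_indexes() {
  conf="${1:-/opt/rudder/etc/openldap/slapd.conf}"
  missing=0
  for attr in directiveId softwareVersion cn; do
    if grep -q "^index[[:space:]]*${attr}[[:space:]]*eq" "${conf}" 2>/dev/null; then
      echo "present: ${attr}"
    else
      echo "missing: ${attr}"
      missing=1
    fi
  done
  return "${missing}"
}
```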

Actions #7

Updated by Vincent MEMBRÉ almost 4 years ago

  • Target version changed from 6.0.7 to 6.0.8

Actions #8

Updated by François ARMAND almost 4 years ago

  • User visibility set to Operational - other Techniques | Rudder settings | Plugins
  • Priority changed from 0 to 32

Actions #9

Updated by Vincent MEMBRÉ almost 4 years ago

  • Target version changed from 6.0.8 to 6.0.9

Actions #10

Updated by Vincent MEMBRÉ over 3 years ago

  • Target version changed from 6.0.9 to 6.0.10
  • Priority changed from 32 to 31

Actions #11

Updated by Vincent MEMBRÉ over 3 years ago

  • Target version changed from 6.0.10 to 798
  • Priority changed from 31 to 30

Actions #12

Updated by Benoît PECCATTE almost 3 years ago

  • Target version changed from 798 to 6.1.14
  • Priority changed from 30 to 27

Actions #13

Updated by Vincent MEMBRÉ almost 3 years ago

  • Target version changed from 6.1.14 to 6.1.15

Actions #14

Updated by Vincent MEMBRÉ almost 3 years ago

  • Target version changed from 6.1.15 to 6.1.16

Actions #15

Updated by Vincent MEMBRÉ almost 3 years ago

  • Target version changed from 6.1.16 to 6.1.17

Actions #16

Updated by Vincent MEMBRÉ over 2 years ago

  • Target version changed from 6.1.17 to 6.1.18

Actions #17

Updated by Vincent MEMBRÉ over 2 years ago

  • Target version changed from 6.1.18 to 6.1.19

Actions #18

Updated by François ARMAND over 2 years ago

  • Status changed from New to Resolved

We believe that this kind of bug (random error 500 with the API) is rooted in performance problems with LDAP.
We have greatly improved performance in that area, so I am closing this one. Please reopen it if you are still affected in a recent Rudder version.
