Bug #17557: Error 500 when adding or deleting nodes through API - Rudder - Issue Tracker

Bug #17557

closed

Error 500 when adding or deleting nodes through API

Added by Victor Héry about 5 years ago. Updated over 3 years ago.

Status: Resolved
Priority: N/A
Assignee:
Category: Performance and scalability
Target version: 6.1.19
Pull Request:
Severity: Minor - inconvenience | misleading | easy workaround
UX impact:
User visibility: Operational - other Techniques | Rudder settings | Plugins
Effort required:
Priority:
Name check: To do
Fix check: To do
Regression:

Description

Hello,

I ran into an odd problem when using the API to delete and deploy nodes.

Each action on nodes seems to trigger a "regenerate policy" (which I assume is normal), but when I add or delete multiple nodes, I randomly get an error 500 from the API.

Apache just shows a 500 on the API call:

[28/May/2020:09:35:20 +0200] "DELETE /rudder/api/latest/nodes/f10b0a1b-70f4-4d6c-94f2-d1bccee9fbac HTTP/1.1" 500

And when digging in the rudder logs I found this problem:

May 28 09:35:42 rudder rudder[scheduledJob]: [ERROR] Error when updating dynamic group 'all-nodes-with-cfengine-agent' <- Error when processing request for updating dynamic group 'All classic Nodes' (all-nodes-with-cfengine-agent) <- Inconsistancy: When trying to access Node information from cache was empty, but it should not, this is a developper issue, please open an issue on https://issues.rudder.io

As the message asks, I am opening this bug :-)

This is a real problem for us because I do not know how to fix it: even deleting the nodes manually, doing a factory reset, re-adding the nodes, etc. is sometimes not sufficient, and it is difficult to automate :-/

I am not sure about what information could help you work on this problem, so do not hesitate to tell me if you need any more details!

Thanks for the help.

Regards,

Updated by Victor Héry about 5 years ago

I forgot to mention that we use Rudder server v6.0.6 and rudder-agent v6.0.6 :)

Updated by Nicolas CHARLES about 5 years ago

Hi Victor

Thank you for this ticket.
How much memory is allocated to rudder-jetty (the JAVA_XMX parameter in /etc/default/rudder-jetty)?
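
A quick way to read that setting, as a minimal sketch: `show_jetty_xmx` is a hypothetical helper name, and the `JAVA_XMX=` line format is assumed from the configuration quoted later in this thread.

```shell
#!/bin/sh
# Hypothetical helper: print the JAVA_XMX line from a rudder-jetty defaults file.
# The optional argument only exists so that another file can be inspected.
show_jetty_xmx() {
  conf_file=${1:-/etc/default/rudder-jetty}
  grep -E '^[[:space:]]*JAVA_XMX=' "${conf_file}"
}
```

Calling `show_jetty_xmx` with no argument reads /etc/default/rudder-jetty.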

Updated by Florian Ganée about 5 years ago

Nicolas CHARLES wrote in #note-2:

Hi Victor

Thank you for this ticket.
How much memory is allocated to rudder-jetty (the JAVA_XMX parameter in /etc/default/rudder-jetty)?

Hi Nicolas,

We followed the Rudder documentation on performance optimization.
Here is the Jetty configuration:

  JAVA_XMX=16384
  JAVA_MAXPERMSIZE=256
  JAVA_GC="-XX:+UseG1GC -XX:+UnlockExperimentalVMOptions -XX:MaxGCPauseMillis=500 -XX:+UseStringDeduplication"

Updated by Nicolas CHARLES about 5 years ago

Target version set to 6.0.7

Thank you for your answer.
This configuration looks sane. Do you have a high load on the system? How many CPUs do you have?
Could you also send the (sanitized) webapp logs covering the 5 minutes before the error?

It may be that LDAP is overwhelmed, but I've never seen that with fewer than 5000 nodes.

Updated by Florian Ganée about 5 years ago

Nicolas CHARLES wrote in #note-4:

Thank you for your answer.
This configuration looks sane. Do you have a high load on the system? How many CPUs do you have?
Could you also send the (sanitized) webapp logs covering the 5 minutes before the error?

It may be that LDAP is overwhelmed, but I've never seen that with fewer than 5000 nodes.

The server has 12 threads (6 hyper-threaded cores). When we had this error, the load was about 5 to 10.
In our use case, we make a lot of simultaneous API calls with Ansible. This causes the high load, mostly from Rudder's LDAP process, which sounds related to what you describe.
It may be quite hard to retrieve all the earlier logs at the moment, sorry.
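
Because the 500 errors are intermittent and show up under many simultaneous API calls, one possible client-side mitigation is to retry a failed call after a short pause. The sketch below is only an illustration and does not reduce the LDAP load itself. `retry_on_5xx` and `delete_node` are hypothetical helpers, `RUDDER_URL` and `RUDDER_TOKEN` are placeholders, and the `X-API-Token` header is assumed to be the authentication scheme in use.

```shell
#!/bin/sh
# Hypothetical helper: run a command that prints an HTTP status code and
# retry it while the status is 5xx, up to MAX_ATTEMPTS times.
# Usage: retry_on_5xx MAX_ATTEMPTS DELAY_SECONDS command [args...]
retry_on_5xx() {
  max_attempts=$1
  delay=$2
  shift 2
  attempt=1
  while :; do
    status=$("$@")
    case "${status}" in
      5??) ;;                            # server error: fall through and retry
      *) echo "${status}"; return 0 ;;   # any other status: stop retrying
    esac
    if [ "${attempt}" -ge "${max_attempts}" ]; then
      echo "${status}"                   # give up and report the last status
      return 1
    fi
    attempt=$((attempt + 1))
    sleep "${delay}"
  done
}

# Hypothetical wrapper around the DELETE request seen in the Apache log.
# RUDDER_URL and RUDDER_TOKEN are placeholders set by the caller (assumption).
delete_node() {
  curl --silent --output /dev/null --write-out '%{http_code}' \
    --request DELETE --header "X-API-Token: ${RUDDER_TOKEN}" \
    "${RUDDER_URL}/rudder/api/latest/nodes/$1"
}
```

For example, `retry_on_5xx 5 10 delete_node f10b0a1b-70f4-4d6c-94f2-d1bccee9fbac` would try the deletion up to 5 times, waiting 10 seconds between attempts.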

Updated by Nicolas CHARLES about 5 years ago

Category set to Performance and scalability

OK, so maybe LDAP cannot keep up.

Can you run the following script (taken from rudder-upgrade in 6.1)?

  # Stop the Rudder webapp before touching the LDAP configuration.
  service rudder-jetty stop

  SLAPD_CONF="/opt/rudder/etc/openldap/slapd.conf"
  NEED_REINDEX=false
  # Add an equality index for each attribute that does not have one yet.
  for i in "directiveId" "softwareVersion" "cn"; do
    line="index\s*${i}\s*eq"
    INDEX_COUNT=$(grep -c "^${line}" "${SLAPD_CONF}" 2>/dev/null || true)
    if [ "${INDEX_COUNT}" -eq 0 ]; then
      echo "Adding LDAP index on attribute: ${i}"
      # Insert the new index line right after the existing objectClass index.
      sed -i "/index\s*objectClass\s*eq/a index\t${i}\teq" "${SLAPD_CONF}"
      NEED_REINDEX=true
    fi
  done

  if [ "${NEED_REINDEX}" = true ]; then
    # Stop slapd, run slapindex, then start slapd again.
    systemctl stop rudder-slapd
    echo -n "Reindexing LDAP directory - this may take a few minutes..."
    su - rudder-slapd -s /bin/sh -c "/opt/rudder/sbin/slapindex"
    echo " Done"
    systemctl start rudder-slapd
  fi

  service rudder-jetty start

It will add the LDAP indexes that exist in 6.1, which significantly improve performance, notably for node inventory management (around 5 to 20 times faster).
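
To see which of these indexes are still missing without changing anything, a read-only check along the following lines could be used. This is a sketch: `missing_ldap_indexes` is a hypothetical helper, and it applies the same "index <attribute> eq" match as the script above, written with POSIX character classes.

```shell
#!/bin/sh
# Hypothetical helper: print each attribute that has no "index <attr> eq" line
# in the given slapd.conf. Nothing is modified.
# If the file cannot be read, all three attributes are reported as missing.
missing_ldap_indexes() {
  slapd_conf=${1:-/opt/rudder/etc/openldap/slapd.conf}
  for attr in directiveId softwareVersion cn; do
    if ! grep -q "^index[[:space:]]*${attr}[[:space:]]*eq" "${slapd_conf}" 2>/dev/null; then
      echo "${attr}"
    fi
  done
}
```

An empty output means all three indexes are already declared in the configuration file; it does not tell whether the directory has been reindexed since they were added.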

Updated by Vincent MEMBRÉ about 5 years ago

Target version changed from 6.0.7 to 6.0.8

Updated by François ARMAND about 5 years ago

User visibility set to Operational - other Techniques | Rudder settings | Plugins
Priority changed from 0 to 32

Updated by Vincent MEMBRÉ about 5 years ago

Target version changed from 6.0.8 to 6.0.9

#10

Updated by Vincent MEMBRÉ almost 5 years ago

Target version changed from 6.0.9 to 6.0.10
Priority changed from 32 to 31

#11

Updated by Vincent MEMBRÉ almost 5 years ago

Target version changed from 6.0.10 to 798
Priority changed from 31 to 30

#12

Updated by Benoît PECCATTE about 4 years ago

Target version changed from 798 to 6.1.14
Priority changed from 30 to 27

#13

Updated by Vincent MEMBRÉ about 4 years ago

Target version changed from 6.1.14 to 6.1.15

#14

Updated by Vincent MEMBRÉ about 4 years ago

Target version changed from 6.1.15 to 6.1.16

#15

Updated by Vincent MEMBRÉ almost 4 years ago

Target version changed from 6.1.16 to 6.1.17

#16

Updated by Vincent MEMBRÉ almost 4 years ago

Target version changed from 6.1.17 to 6.1.18

#17

Updated by Vincent MEMBRÉ over 3 years ago

Target version changed from 6.1.18 to 6.1.19

#18

Updated by François ARMAND over 3 years ago

Status changed from New to Resolved

We believe that bugs of this kind (random error 500 with the API) are all rooted in performance problems with LDAP.
We have greatly improved performance in that area, so I'm closing this one. Please reopen it if you are still affected in a recent Rudder version.
