Bug #10646

"SocketException(message='Socket closed'" error at the end of generation with 500 nodes

Added by Nicolas CHARLES over 1 year ago. Updated 4 months ago.

Status:
Released
Priority:
N/A
Category:
Web - Config management
Target version:
Severity:
Critical - prevents main use of Rudder | no workaround | data loss | security
User visibility:
Infrequent - complex configurations | third party integrations
Effort required:
Priority:
54

Description

On a fairly large install (2000 nodes, 500 nodes accepted at once), I got the following error at the end of generation:

⇨ Can not save (modify) 'cn=Nodes Configuration,ou=Rudder,cn=rudder-configuration': An error occurred while attempting to send the LDAP message to server localhost:389: SocketException(message='Socket closed', trace='socketWrite(SocketOutputStream.java:118) / write(SocketOutputStream.java:155) / write(BufferedOutputStream.java:122) / write(ByteStringBuffer.java:1664) / writeTo(ASN1Buffer.java:851) / sendMessage(LDAPConnectionInternals.java:525) / sendMessage(LDAPConnection.java:4249) / processAsync(ModifyRequest.java:745) / process(ModifyRequest.java:650) / modify(LDAPConnection.java:2745) / $anonfun$modifyAction$3(LDAPConnection.scala:541) / apply(Exception.scala:224) / $anonfun$modifyAction$1(LDAPConnection.scala:541) / applyMod(LDAPConnection.scala:489) / $anonfun$applyModify$2(LDAPConnection.scala:552) / save(LDAPConnection.scala:609) / $anonfun$save$8(NodeConfigurationCacheRepository.scala:451) / flatMap(Box.scala:808) / $anonfun$save$4(NodeConfigurationCacheRepository.scala:447) / flatMap(Box.scala:808) / $anonfun$save$3(NodeConfigurationCacheRepository.scala:446) / $anonfun$flatMap$1(LDAPConnectionProvider.scala:117) / withCon(LDAPConnectionProvider.scala:154) / withCon$(LDAPConnectionProvider.scala:143) / withCon(LDAPConnectionProvider.scala:385) / flatMap(LDAPConnectionProvider.scala:116) / flatMap$(LDAPConnectionProvider.scala:115) / flatMap(LDAPConnectionProvider.scala:385) / save(NodeConfigurationCacheRepository.scala:445) / cacheNodeConfiguration(NodeConfigurationServiceImpl.scala:72) / $anonfun$writeNodeConfigurations$4(DeploymentService.scala:1010) / flatMap(Box.scala:808) / writeNodeConfigurations(DeploymentService.scala:1004) / writeNodeConfigurations$(DeploymentService.scala:992) / writeNodeConfigurations(DeploymentService.scala:480) / $anonfun$deploy$71(DeploymentService.scala:232) / flatMap(Box.scala:808) / $anonfun$deploy$65(DeploymentService.scala:221) / flatMap(Box.scala:808) / $anonfun$deploy$61(DeploymentService.scala:215) / 
flatMap(Box.scala:808) / $anonfun$deploy$58(DeploymentService.scala:214) / flatMap(Box.scala:808) / $anonfun$deploy$54(DeploymentService.scala:209) / flatMap(Box.scala:808) / $anonfun$deploy$50(DeploymentService.scala:204) / flatMap(Box.scala:808) / $anonfun$deploy$47(DeploymentService.scala:197) / flatMap(Box.scala:808) / $anonfun$deploy$43(DeploymentService.scala:188) / flatMap(Box.scala:808) / $anonfun$deploy$38(DeploymentService.scala:174) / flatMap(Box.scala:808) / $anonfun$deploy$36(DeploymentService.scala:173) / flatMap(Box.scala:808) / $anonfun$deploy$34(DeploymentService.scala:172) / flatMap(Box.scala:808) / $anonfun$deploy$33(DeploymentService.scala:171) / flatMap(Box.scala:808) / $anonfun$deploy$31(DeploymentService.scala:170) / flatMap(Box.scala:808) / $anonfun$deploy$28(DeploymentService.scala:167) / flatMap(Box.scala:808) / $anonfun$deploy$24(DeploymentService.scala:164) / flatMap(Box.scala:808) / $anonfun$deploy$20(DeploymentService.scala:161) / flatMap(Box.scala:808) / $anonfun$deploy$16(DeploymentService.scala:158) / flatMap(Box.scala:808) / $anonfun$deploy$12(DeploymentService.scala:155) / flatMap(Box.scala:808) / $anonfun$deploy$8(DeploymentService.scala:152) / flatMap(Box.scala:808) / $anonfun$deploy$5(DeploymentService.scala:147) / flatMap(Box.scala:808) / $anonfun$deploy$2(DeploymentService.scala:142) / flatMap(Box.scala:808) / deploy(DeploymentService.scala:141) / deploy$(DeploymentService.scala:127) / deploy(DeploymentService.scala:480) / applyOrElse(AsyncDeploymentAgent.scala:350) / apply(AbstractPartialFunction.scala:34) / execTranslate(LiftActor.scala:504) / execTranslate$(LiftActor.scala:496) / execTranslate(AsyncDeploymentAgent.scala:342) / proc2(LiftActor.scala:305) / $anonfun$processMailbox$1(LiftActor.scala:227) / apply(JFunction0$mcV$sp.java:12) / around(LiftActor.scala:241) / around$(LiftActor.scala:240) / around(AsyncDeploymentAgent.scala:342) / processMailbox(LiftActor.scala:227) / $anonfun$$bang$3(LiftActor.scala:190) / 
run(LiftActor.scala:76) / runWorker(ThreadPoolExecutor.java:1142) / run(ThreadPoolExecutor.java:617) / run(Thread.java:745)', revision=24201)

No error on the LDAP side in the logs.


Related issues

Related to Rudder - Bug #10645: Proxy error when accepting 500 inventories at the same time (New)
Related to Rudder - Bug #13256: LDAP IO error on generation with a lot of nodes (Released)

Associated revisions

Revision f2b2c8a4 (diff)
Added by François ARMAND 5 months ago

Fixes #10646: Error at the end of generation after accepting 500 nodes

History

#1 Updated by Nicolas CHARLES over 1 year ago

  • Related to Bug #10645: Proxy error when accepting 500 inventories at the same time added

#2 Updated by Jonathan CLARKE over 1 year ago

  • Subject changed from error at the end of generation, after accepting 500 nodes to Error at the end of generation after accepting 500 nodes
  • Severity set to Critical - prevents main use of Rudder | no workaround | data loss | security
  • User visibility changed from Infrequent - complex configurations | third party integrations to Operational - other Techniques | Technique editor | Rudder settings
  • Priority changed from 0 to 60

#3 Updated by Jonathan CLARKE over 1 year ago

  • User visibility changed from Operational - other Techniques | Technique editor | Rudder settings to Infrequent - complex configurations | third party integrations
  • Priority changed from 60 to 49

Assuming this only happens when 500 nodes are accepted simultaneously, this is an infrequent situation. If it also happens for fewer nodes, say <=100, we'll need to increase the visibility field again.

#4 Updated by Nicolas CHARLES over 1 year ago

  • Severity deleted (Critical - prevents main use of Rudder | no workaround | data loss | security)
  • Priority changed from 49 to 0

The slapd logs say:

Apr 24 17:01:39 dhcp240 rudder-slapd[10927]: => access_allowed: read access to "nodeId=154ec2c4-a5a7-4c5c-b49a-75d2f56d1c68,ou=Nodes,ou=Accepted Inventories,ou=Inventories,cn=rudder-configuration" "osName" requested
Apr 24 17:01:39 dhcp240 rudder-slapd[10927]: <= root access granted
Apr 24 17:01:39 dhcp240 rudder-slapd[10927]: => access_allowed: read access granted by manage(=mwrscxd)
Apr 24 17:01:39 dhcp240 rudder-slapd[10927]: => access_allowed: result not in cache (osFullName)
Apr 24 17:01:39 dhcp240 rudder-slapd[10927]: => access_allowed: read access to "nodeId=154ec2c4-a5a7-4c5c-b49a-75d2f56d1c68,ou=Nodes,ou=Accepted Inventories,ou=Inventories,cn=rudder-configuration" "osFullName" requested
Apr 24 17:01:39 dhcp240 rudder-slapd[10927]: <= root access granted
Apr 24 17:02:09 dhcp240 rudder-slapd[10927]: dnMatch 1024

but the Rudder webapp log says:

[2017-04-24 17:01:43] DEBUG application - [Store Agent Run Times] Scheduled task finished in 595 ms (started at 2017/04/24 17:01:42, finished at 2017/04/24 17:01:43)
[2017-04-24 17:01:45] DEBUG com.normation.rudder.services.policies.PromiseGenerationServiceImpl - Policy generation completed in 242543 ms
[2017-04-24 17:01:45] ERROR com.normation.rudder.batch.AsyncDeploymentAgent$DeployerAgent - Error when updating policy, reason Cannot write configuration node <- Can not save (modify) 'cn=Nodes Configuration,ou=Rudder,cn=rudder-configuration': An error occurred while attempting to send the LDAP message to server localhost:389:  SocketException(message='Socket closed', trace='socketWrite(SocketOutputStream.java:118) / write(SocketOutputStream.java:155)

#5 Updated by Nicolas CHARLES over 1 year ago

  • Severity set to Critical - prevents main use of Rudder | no workaround | data loss | security
  • Priority changed from 0 to 49

#6 Updated by François ARMAND over 1 year ago

It is a problem occurring in the underlying LDAP library (UnboundID).

So, for some reason, the connection cannot be established because the socket is closed, and the request never reaches the LDAP server, which is why the OpenLDAP logs are empty at that point in time.

It happens when we try to acquire a connection in NodeConfigurationCacheRepository.scala, line 445:

  def save(caches: Set[NodeConfigurationHash]): Box[Set[NodeId]] = {
    val updatedIds = caches.map(_.id)
    for {
      ldap          <- ldapCon  // <========= HERE
      existingEntry <- ldap.get(rudderDit.NODE_CONFIGS.dn)

This code is completely standard; we use it again and again everywhere. If it were reproducible, I would say that something went wrong with the LDAP connection pool before or during the deployment, and that for some reason the fault happens around the same time (based on the number of open connections?). Or perhaps something during the deployment, just before that action, corrupts the connection pool - perhaps bad data.

I fear that I won't be able to make further assumptions.

So, some questions:

  • does restarting Rudder solve the problem?
  • does cleaning the cache? (but I really believe the problem is not linked to NodeHash, but to something before it)

The next step is to set UnboundID logging to debug or trace level and see what the logs tell us.

#7 Updated by Vincent MEMBRÉ over 1 year ago

  • Target version changed from 4.1.2 to 4.1.3

#8 Updated by Vincent MEMBRÉ over 1 year ago

  • Target version changed from 4.1.3 to 4.1.4

#9 Updated by Vincent MEMBRÉ over 1 year ago

  • Target version changed from 4.1.4 to 4.1.5

#10 Updated by Alexis MOUSSET over 1 year ago

  • Target version changed from 4.1.5 to 4.1.6
  • Priority changed from 49 to 48

#11 Updated by Benoît PECCATTE over 1 year ago

  • Priority changed from 48 to 64

#12 Updated by Benoît PECCATTE over 1 year ago

  • Assignee set to François ARMAND

Fixing #10645 will probably solve this one

#13 Updated by Vincent MEMBRÉ over 1 year ago

  • Target version changed from 4.1.6 to 4.1.7
  • Priority changed from 64 to 62

#14 Updated by Vincent MEMBRÉ over 1 year ago

  • Target version changed from 4.1.7 to 4.1.8

#15 Updated by Vincent MEMBRÉ about 1 year ago

  • Target version changed from 4.1.8 to 4.1.9
  • Priority changed from 62 to 60

#16 Updated by Benoît PECCATTE about 1 year ago

  • Priority changed from 60 to 59

#17 Updated by Vincent MEMBRÉ about 1 year ago

  • Target version changed from 4.1.9 to 4.1.10
  • Priority changed from 59 to 58

#18 Updated by Vincent MEMBRÉ 10 months ago

  • Target version changed from 4.1.10 to 4.1.11
  • Priority changed from 58 to 56

#19 Updated by Vincent MEMBRÉ 8 months ago

  • Target version changed from 4.1.11 to 4.1.12
  • Priority changed from 56 to 54

#20 Updated by Vincent MEMBRÉ 7 months ago

  • Target version changed from 4.1.12 to 4.1.13

#21 Updated by Vincent MEMBRÉ 5 months ago

  • Target version changed from 4.1.13 to 4.1.14

#22 Updated by François ARMAND 5 months ago

One of our users experienced the same problem on Rudder 4.1.7.

The server is quite loaded and has > 10k nodes.

#23 Updated by François ARMAND 5 months ago

I'm able to reproduce by faking a huge Nodes Configuration entry, by changing this method in NodeConfigurationCacheRepository:

  private[this] def toLdap(nodeConfigs: Set[NodeConfigurationHash]): LDAPEntry = {
    val cc = if(nodeConfigs.nonEmpty) {
      val c = nodeConfigs.head
      for(i <- 0 to 10000) yield c.copy(
        id = NodeId(c.id.value + i.toString),
        policyHash = c.policyHash ++ (for(j <- 1 to 100) yield PolicyHash(Cf3PolicyDraftId(RuleId("test--dir-" + j), DirectiveId("test-rule-" + j)), j))
      )
    } else {
      Set[NodeConfigurationHash]()
    }
    val caches = (cc ++ nodeConfigs).map { x => write(x) }
    val entry = rudderDit.NODE_CONFIGS.model
    entry +=! (A_NODE_CONFIG, caches.toSeq:_*)
    entry
  }

#24 Updated by François ARMAND 5 months ago

I was able to minimize it with this program:

// Imports assumed for a standalone run (UnboundID SDK and Rudder's LDAP wrapper):
import com.unboundid.ldap.sdk.{Attribute, DN}
import com.normation.ldap.sdk.{LDAPEntry, RWPooledSimpleAuthConnectionProvider}

object NodesConfigurationErrorMain {

  val rwLdap =
    new RWPooledSimpleAuthConnectionProvider(
      host     = "localhost",
      port     = 1389,
      authDn   = "cn=Manager,cn=rudder-configuration",
      authPw   = "secret",
      poolSize = 2
    )

  def str(size: Int) = new String(Array.fill[Char](size)('x'))

  def main(args: Array[String]): Unit = {
    val dn = new DN("cn=Nodes Configuration,ou=Rudder,cn=rudder-configuration")
    val entry = LDAPEntry(dn)

    entry += new Attribute("objectClass", "top", "nodeConfigurations")
    entry += new Attribute("cn", "Nodes Configuration")

    // here, with 454, we get:
    // Failure(Can not save (add) 'cn=Nodes Configuration,ou=Rudder,cn=rudder-configuration': An error occurred while attempting to send the LDAP message to server localhost:1389:  SocketException(message='Socket closed', trace='socketWrite(SocketOutputStream.java:118) / write(SocketOutputStream.java:155) / write(BufferedOutputStream.java:122) / write(ByteStringBuffer.java:1664) ....
    val values = for(i <- 0 until 167) yield i + "-" + str(100000) ++ str(453)
    entry += new Attribute("nodeConfig", values:_*)
    println(for {
      ldap <- rwLdap
      _    <- ldap.delete(dn)
      _    <- ldap.save(entry)
    } yield {
      "done"
    })
  }
}

#25 Updated by François ARMAND 5 months ago

To be clearer: in the previous code snippets, it is really the total size of the entry that is the problem. I can decrease the size of each value and add more of them, or increase it and use even fewer.

The most likely cause is that somewhere we hit an int overflow, probably in the ASN.1 encoding or in the actual write of the data to the socket.

#27 Updated by François ARMAND 5 months ago

  • Status changed from New to In progress

#28 Updated by François ARMAND 5 months ago

  • Status changed from In progress to Pending technical review
  • Assignee changed from François ARMAND to Nicolas CHARLES
  • Pull Request set to https://github.com/Normation/rudder-packages/pull/1626

#29 Updated by François ARMAND 5 months ago

Actually, the problem is just that OpenLDAP limits the size of requests, to avoid DoS from people sending huge, useless requests.

The correction is to increase "sockbuf_max_incoming_auth" in slapd.conf.

# Specify the maximum incoming LDAP PDU size for authenticated sessions. The default is 4194303.
# This is too small for us; we have big entries (like Nodes Configuration)
sockbuf_max_incoming_auth 1000000000
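
As a rough order-of-magnitude check (a sketch only; BER framing overhead and attribute names are ignored), the reproduction entry from note #24 is already far beyond that default limit:

```python
# Approximate payload size of the reproduction entry from note #24:
# 167 values, each built as str(i) + "-" + 100000 'x' chars + 453 'x' chars.
total = sum(len(str(i)) + 1 + 100000 + 453 for i in range(167))

DEFAULT_LIMIT = 4194303  # OpenLDAP's default sockbuf_max_incoming_auth

print(total)                  # 16776209, about 16.8 MB
print(total > DEFAULT_LIMIT)  # True: roughly 4x over the default limit
```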

#30 Updated by Nicolas CHARLES 5 months ago

  • Status changed from Pending technical review to In progress

I'm taking over this issue!

#31 Updated by Nicolas CHARLES 5 months ago

  • Status changed from In progress to Pending technical review
  • Assignee changed from Nicolas CHARLES to François ARMAND
  • Pull Request changed from https://github.com/Normation/rudder-packages/pull/1626 to https://github.com/Normation/rudder-packages/pull/1627

#32 Updated by Normation Quality Assistant 5 months ago

  • Assignee changed from François ARMAND to Nicolas CHARLES

#33 Updated by François ARMAND 5 months ago

  • Status changed from Pending technical review to Pending release

#34 Updated by François ARMAND 5 months ago

So, to sum up the problem:

- OpenLDAP has (like all LDAP servers) a protection against denial-of-service attacks that limits the size of the requests it processes. This prevents an attacker from sending big, useless requests with hard-to-decode ASN.1 and making the server burn CPU on them.
- That protection applies per "PDU" (protocol data unit), so it is not a direct mapping to request size.
- But it explains why we sometimes get the LDAP "Socket closed" error: when too much data comes in, OpenLDAP just closes the socket to protect itself.
- You get the following message with OpenLDAP log level "conns trace" (yes, that is why we never saw it: the required log level is extremely deep)

Jul 23 10:03:19 server rudder-slapd[31057]: connection_get(10): got connid=1002
Jul 23 10:03:19 server rudder-slapd[31057]: connection_read(10): checking for input on id=1002
Jul 23 10:03:19 server rudder-slapd[31057]: ber_get_next on fd 10 failed errno=34 (Numerical result out of range)
Jul 23 10:03:19 server rudder-slapd[31057]: connection_read(10): input error=-2 id=1002, closing.
Jul 23 10:03:19 server rudder-slapd[31057]: connection_closing: readying conn=1002 sd=10 for close
Jul 23 10:03:19 server rudder-slapd[31057]: connection_close: conn=1002 sd=10
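
The `ber_get_next` failure above happens before the request body is even read: a BER PDU announces its total length in its header, so the server can reject an oversized request up front. A minimal sketch of that length check (not OpenLDAP's actual code; the example header bytes are illustrative):

```python
def ber_pdu_length(header: bytes) -> int:
    """Parse the length octets of a BER TLV header (definite form),
    as a server must do before reading the PDU body."""
    first = header[1]          # header[0] is the tag octet
    if first < 0x80:           # short form: length fits in one octet
        return first
    n = first & 0x7F           # long form: the next n octets hold the length
    length = 0
    for b in header[2:2 + n]:
        length = (length << 8) | b
    return length

MAX_INCOMING_AUTH = 4194303  # OpenLDAP's default sockbuf_max_incoming_auth

# A SEQUENCE header announcing a 16 MiB body (0x01000000 = 16777216 bytes):
hdr = bytes([0x30, 0x84, 0x01, 0x00, 0x00, 0x00])
size = ber_pdu_length(hdr)
print(size, size > MAX_INCOMING_AUTH)  # 16777216 True -> connection closed
```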

#35 Updated by François ARMAND 5 months ago

  • Subject changed from Error at the end of generation after accepting 500 nodes to "SocketException(message='Socket closed'" error at the end of generation with 500 nodes

#36 Updated by Benoît PECCATTE 4 months ago

  • Status changed from Pending release to Released

This bug has been fixed in Rudder 4.1.14 and 4.3.4 which were released today.

#37 Updated by François ARMAND 4 months ago

  • Related to Bug #13256: LDAP IO error on generation with a lot of nodes added
