Bug #10646

"SocketException(message='Socket closed'" error at the end of generation with 500 nodes

Added by Nicolas CHARLES over 1 year ago. Updated 4 months ago.

Status:
Released
Priority:
N/A
Category:
Web - Config management
Target version:
Severity:
Critical - prevents main use of Rudder | no workaround | data loss | security
User visibility:
Infrequent - complex configurations | third party integrations
Effort required:
Priority:
54

Description

On a fairly large install (2000 nodes, 500 nodes accepted at once), I got the following error at the end of generation:

⇨ Can not save (modify) 'cn=Nodes Configuration,ou=Rudder,cn=rudder-configuration': An error occurred while attempting to send the LDAP message to server localhost:389: SocketException(message='Socket closed', trace='socketWrite(SocketOutputStream.java:118) / write(SocketOutputStream.java:155) / write(BufferedOutputStream.java:122) / write(ByteStringBuffer.java:1664) / writeTo(ASN1Buffer.java:851) / sendMessage(LDAPConnectionInternals.java:525) / sendMessage(LDAPConnection.java:4249) / processAsync(ModifyRequest.java:745) / process(ModifyRequest.java:650) / modify(LDAPConnection.java:2745) / $anonfun$modifyAction$3(LDAPConnection.scala:541) / apply(Exception.scala:224) / $anonfun$modifyAction$1(LDAPConnection.scala:541) / applyMod(LDAPConnection.scala:489) / $anonfun$applyModify$2(LDAPConnection.scala:552) / save(LDAPConnection.scala:609) / $anonfun$save$8(NodeConfigurationCacheRepository.scala:451) / flatMap(Box.scala:808) / $anonfun$save$4(NodeConfigurationCacheRepository.scala:447) / flatMap(Box.scala:808) / $anonfun$save$3(NodeConfigurationCacheRepository.scala:446) / $anonfun$flatMap$1(LDAPConnectionProvider.scala:117) / withCon(LDAPConnectionProvider.scala:154) / withCon$(LDAPConnectionProvider.scala:143) / withCon(LDAPConnectionProvider.scala:385) / flatMap(LDAPConnectionProvider.scala:116) / flatMap$(LDAPConnectionProvider.scala:115) / flatMap(LDAPConnectionProvider.scala:385) / save(NodeConfigurationCacheRepository.scala:445) / cacheNodeConfiguration(NodeConfigurationServiceImpl.scala:72) / $anonfun$writeNodeConfigurations$4(DeploymentService.scala:1010) / flatMap(Box.scala:808) / writeNodeConfigurations(DeploymentService.scala:1004) / writeNodeConfigurations$(DeploymentService.scala:992) / writeNodeConfigurations(DeploymentService.scala:480) / $anonfun$deploy$71(DeploymentService.scala:232) / flatMap(Box.scala:808) / $anonfun$deploy$65(DeploymentService.scala:221) / flatMap(Box.scala:808) / $anonfun$deploy$61(DeploymentService.scala:215) / 
flatMap(Box.scala:808) / $anonfun$deploy$58(DeploymentService.scala:214) / flatMap(Box.scala:808) / $anonfun$deploy$54(DeploymentService.scala:209) / flatMap(Box.scala:808) / $anonfun$deploy$50(DeploymentService.scala:204) / flatMap(Box.scala:808) / $anonfun$deploy$47(DeploymentService.scala:197) / flatMap(Box.scala:808) / $anonfun$deploy$43(DeploymentService.scala:188) / flatMap(Box.scala:808) / $anonfun$deploy$38(DeploymentService.scala:174) / flatMap(Box.scala:808) / $anonfun$deploy$36(DeploymentService.scala:173) / flatMap(Box.scala:808) / $anonfun$deploy$34(DeploymentService.scala:172) / flatMap(Box.scala:808) / $anonfun$deploy$33(DeploymentService.scala:171) / flatMap(Box.scala:808) / $anonfun$deploy$31(DeploymentService.scala:170) / flatMap(Box.scala:808) / $anonfun$deploy$28(DeploymentService.scala:167) / flatMap(Box.scala:808) / $anonfun$deploy$24(DeploymentService.scala:164) / flatMap(Box.scala:808) / $anonfun$deploy$20(DeploymentService.scala:161) / flatMap(Box.scala:808) / $anonfun$deploy$16(DeploymentService.scala:158) / flatMap(Box.scala:808) / $anonfun$deploy$12(DeploymentService.scala:155) / flatMap(Box.scala:808) / $anonfun$deploy$8(DeploymentService.scala:152) / flatMap(Box.scala:808) / $anonfun$deploy$5(DeploymentService.scala:147) / flatMap(Box.scala:808) / $anonfun$deploy$2(DeploymentService.scala:142) / flatMap(Box.scala:808) / deploy(DeploymentService.scala:141) / deploy$(DeploymentService.scala:127) / deploy(DeploymentService.scala:480) / applyOrElse(AsyncDeploymentAgent.scala:350) / apply(AbstractPartialFunction.scala:34) / execTranslate(LiftActor.scala:504) / execTranslate$(LiftActor.scala:496) / execTranslate(AsyncDeploymentAgent.scala:342) / proc2(LiftActor.scala:305) / $anonfun$processMailbox$1(LiftActor.scala:227) / apply(JFunction0$mcV$sp.java:12) / around(LiftActor.scala:241) / around$(LiftActor.scala:240) / around(AsyncDeploymentAgent.scala:342) / processMailbox(LiftActor.scala:227) / $anonfun$$bang$3(LiftActor.scala:190) / 
run(LiftActor.scala:76) / runWorker(ThreadPoolExecutor.java:1142) / run(ThreadPoolExecutor.java:617) / run(Thread.java:745)', revision=24201)

No error on the LDAP side in the logs.


Related issues

Related to Rudder - Bug #10645: Proxy error when accepting 500 inventories at the same time (New)
Related to Rudder - Bug #13256: LDAP IO error on generation with a lot of nodes (Released)

Associated revisions

Revision f2b2c8a4 (diff)
Added by François ARMAND 5 months ago

Fixes #10646: Error at the end of generation after accepting 500 nodes

History

#1 Updated by Nicolas CHARLES over 1 year ago

  • Related to Bug #10645: Proxy error when accepting 500 inventories at the same time added

#2 Updated by Jonathan CLARKE over 1 year ago

  • Subject changed from error at the end of generation, after accepting 500 nodes to Error at the end of generation after accepting 500 nodes
  • Severity set to Critical - prevents main use of Rudder | no workaround | data loss | security
  • User visibility changed from Infrequent - complex configurations | third party integrations to Operational - other Techniques | Technique editor | Rudder settings
  • Priority changed from 0 to 60

#3 Updated by Jonathan CLARKE over 1 year ago

  • User visibility changed from Operational - other Techniques | Technique editor | Rudder settings to Infrequent - complex configurations | third party integrations
  • Priority changed from 60 to 49

Assuming this only happens when 500 nodes are accepted simultaneously, this is an infrequent situation. If it also happens for fewer nodes, say <=100, we'll need to increase the visibility field again.

#4 Updated by Nicolas CHARLES over 1 year ago

  • Severity deleted (Critical - prevents main use of Rudder | no workaround | data loss | security)
  • Priority changed from 49 to 0

The slapd logs say:

Apr 24 17:01:39 dhcp240 rudder-slapd[10927]: => access_allowed: read access to "nodeId=154ec2c4-a5a7-4c5c-b49a-75d2f56d1c68,ou=Nodes,ou=Accepted Inventories,ou=Inventories,cn=rudder-configuration" "osName" requested
Apr 24 17:01:39 dhcp240 rudder-slapd[10927]: <= root access granted
Apr 24 17:01:39 dhcp240 rudder-slapd[10927]: => access_allowed: read access granted by manage(=mwrscxd)
Apr 24 17:01:39 dhcp240 rudder-slapd[10927]: => access_allowed: result not in cache (osFullName)
Apr 24 17:01:39 dhcp240 rudder-slapd[10927]: => access_allowed: read access to "nodeId=154ec2c4-a5a7-4c5c-b49a-75d2f56d1c68,ou=Nodes,ou=Accepted Inventories,ou=Inventories,cn=rudder-configuration" "osFullName" requested
Apr 24 17:01:39 dhcp240 rudder-slapd[10927]: <= root access granted
Apr 24 17:02:09 dhcp240 rudder-slapd[10927]: dnMatch 1024

but the Rudder webapp log says:

[2017-04-24 17:01:43] DEBUG application - [Store Agent Run Times] Scheduled task finished in 595 ms (started at 2017/04/24 17:01:42, finished at 2017/04/24 17:01:43)
[2017-04-24 17:01:45] DEBUG com.normation.rudder.services.policies.PromiseGenerationServiceImpl - Policy generation completed in 242543 ms
[2017-04-24 17:01:45] ERROR com.normation.rudder.batch.AsyncDeploymentAgent$DeployerAgent - Error when updating policy, reason Cannot write configuration node <- Can not save (modify) 'cn=Nodes Configuration,ou=Rudder,cn=rudder-configuration': An error occurred while attempting to send the LDAP message to server localhost:389:  SocketException(message='Socket closed', trace='socketWrite(SocketOutputStream.java:118) / write(SocketOutputStream.java:155)

#5 Updated by Nicolas CHARLES over 1 year ago

  • Severity set to Critical - prevents main use of Rudder | no workaround | data loss | security
  • Priority changed from 0 to 49

#6 Updated by François ARMAND over 1 year ago

It is a problem occurring in the underlying LDAP library (UnboundID).

So, for some reason, the connection cannot be established because the socket is closed, and the request never reaches the LDAP server, which is why the OpenLDAP logs are empty at that point in time.

It happens when we try to acquire a connection in NodeConfigurationCacheRepository.scala, line 445:

  def save(caches: Set[NodeConfigurationHash]): Box[Set[NodeId]] = {
    val updatedIds = caches.map(_.id)
    for {
      ldap          <- ldapCon  // <========= HERE
      existingEntry <- ldap.get(rudderDit.NODE_CONFIGS.dn)

This code is completely standard; we use it again and again everywhere. If it were reproducible, I would say that something went wrong with the LDAP connection pool before or during the deployment, and that for some reason the fault happens around the same time (based on the number of open connections?). Or perhaps something during the deployment, just before that action, corrupts the connection pool - perhaps bad data.

I fear that I won't be able to make further assumptions.

So, some questions:

  • does restarting Rudder solve the problem?
  • does cleaning the cache? (but I really believe the problem is not linked to NodeHash, but to something before it)

The next step is to set UnboundID logging to debug or trace level and see what the logs tell us.

#7 Updated by Vincent MEMBRÉ over 1 year ago

  • Target version changed from 4.1.2 to 4.1.3

#8 Updated by Vincent MEMBRÉ over 1 year ago

  • Target version changed from 4.1.3 to 4.1.4

#9 Updated by Vincent MEMBRÉ over 1 year ago

  • Target version changed from 4.1.4 to 4.1.5

#10 Updated by Alexis MOUSSET over 1 year ago

  • Target version changed from 4.1.5 to 4.1.6
  • Priority changed from 49 to 48

#11 Updated by Benoît PECCATTE over 1 year ago

  • Priority changed from 48 to 64

#12 Updated by Benoît PECCATTE over 1 year ago

  • Assignee set to François ARMAND

Fixing #10645 will probably solve this one

#13 Updated by Vincent MEMBRÉ over 1 year ago

  • Target version changed from 4.1.6 to 4.1.7
  • Priority changed from 64 to 62

#14 Updated by Vincent MEMBRÉ over 1 year ago

  • Target version changed from 4.1.7 to 4.1.8

#15 Updated by Vincent MEMBRÉ about 1 year ago

  • Target version changed from 4.1.8 to 4.1.9
  • Priority changed from 62 to 60

#16 Updated by Benoît PECCATTE about 1 year ago

  • Priority changed from 60 to 59

#17 Updated by Vincent MEMBRÉ about 1 year ago

  • Target version changed from 4.1.9 to 4.1.10
  • Priority changed from 59 to 58

#18 Updated by Vincent MEMBRÉ 10 months ago

  • Target version changed from 4.1.10 to 4.1.11
  • Priority changed from 58 to 56

#19 Updated by Vincent MEMBRÉ 8 months ago

  • Target version changed from 4.1.11 to 4.1.12
  • Priority changed from 56 to 54

#20 Updated by Vincent MEMBRÉ 7 months ago

  • Target version changed from 4.1.12 to 4.1.13

#21 Updated by Vincent MEMBRÉ 5 months ago

  • Target version changed from 4.1.13 to 4.1.14

#22 Updated by François ARMAND 5 months ago

One of our users experienced the same problem on Rudder 4.1.7.

The server is quite loaded and has > 10k nodes.

#23 Updated by François ARMAND 5 months ago

I'm able to reproduce by faking a huge Nodes Configuration entry, by changing this method in NodeConfigurationCacheRepository:

  private[this] def toLdap(nodeConfigs: Set[NodeConfigurationHash]): LDAPEntry = {
    val cc = if(nodeConfigs.nonEmpty) {
      val c = nodeConfigs.head
      for(i <- 0 to 10000) yield c.copy(
        id = NodeId(c.id.value + i.toString),
        policyHash = c.policyHash ++ (for(j <- 1 to 100) yield PolicyHash(Cf3PolicyDraftId(RuleId("test--dir-" + j), DirectiveId("test-rule-" + j)), j))
      )
    } else {
      Set[NodeConfigurationHash]()
    }
    val caches = (cc ++ nodeConfigs).map { x => write(x) }
    val entry = rudderDit.NODE_CONFIGS.model
    entry +=! (A_NODE_CONFIG, caches.toSeq:_*)
    entry
  }

#24 Updated by François ARMAND 5 months ago

I was able to minimize it with this program:

// Imports assumed for a standalone run (UnboundID SDK and Rudder's LDAP wrapper):
import com.unboundid.ldap.sdk.{Attribute, DN}
import com.normation.ldap.sdk.{LDAPEntry, RWPooledSimpleAuthConnectionProvider}

object NodesConfigurationErrorMain {

  val rwLdap =
    new RWPooledSimpleAuthConnectionProvider(
      host     = "localhost",
      port     = 1389,
      authDn   = "cn=Manager,cn=rudder-configuration",
      authPw   = "secret",
      poolSize = 2
    )

  def str(size: Int) = new String(Array.fill[Char](size)('x'))

  def main(args: Array[String]): Unit = {
    val dn = new DN("cn=Nodes Configuration,ou=Rudder,cn=rudder-configuration")
    val entry = LDAPEntry(dn)

    entry += new Attribute("objectClass", "top", "nodeConfigurations")
    entry += new Attribute("cn", "Nodes Configuration")

    // here, with 454, we get:
    // Failure(Can not save (add) 'cn=Nodes Configuration,ou=Rudder,cn=rudder-configuration': An error occurred while attempting to send the LDAP message to server localhost:1389:  SocketException(message='Socket closed', trace='socketWrite(SocketOutputStream.java:118) / write(SocketOutputStream.java:155) / write(BufferedOutputStream.java:122) / write(ByteStringBuffer.java:1664) ....
    val values = for(i <- 0 until 167) yield i + "-" + str(100000) ++ str(453)
    entry += new Attribute("nodeConfig", values:_*)
    println(for {
      ldap <- rwLdap
      _    <- ldap.delete(dn)
      _    <- ldap.save(entry)
    } yield {
      "done"
    })
  }
}

#25 Updated by François ARMAND 5 months ago

To be clearer: in the previous code snippets, it is really the total size of the entry that is the problem. I can decrease the size of each value and add more of them, or increase it and use even fewer.

The most likely cause is that somewhere we hit an int overflow, probably in the ASN.1 encoding or in the actual write of the data to the socket.

#27 Updated by François ARMAND 5 months ago

  • Status changed from New to In progress

#28 Updated by François ARMAND 5 months ago

  • Status changed from In progress to Pending technical review
  • Assignee changed from François ARMAND to Nicolas CHARLES
  • Pull Request set to https://github.com/Normation/rudder-packages/pull/1626

#29 Updated by François ARMAND 5 months ago

Actually, the problem is just that OpenLDAP limits the size of requests, to avoid DoS from people sending huge, useless requests.

The correction is to increase "sockbuf_max_incoming_auth" in slapd.conf.

# Specify the maximum incoming LDAP PDU size for authenticated sessions. The default is 4194303.
# This is too small for us; we have big entries (like Nodes Configuration)
sockbuf_max_incoming_auth 1000000000
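
As a rough order-of-magnitude check (a sketch only; BER framing overhead and attribute names are ignored), the reproduction entry from note #24 is already far beyond that default limit:

```python
# Approximate payload size of the reproduction entry from note #24:
# 167 values, each built as str(i) + "-" + 100000 'x' chars + 453 'x' chars.
total = sum(len(str(i)) + 1 + 100000 + 453 for i in range(167))

DEFAULT_LIMIT = 4194303  # OpenLDAP's default sockbuf_max_incoming_auth

print(total)                  # 16776209, about 16.8 MB
print(total > DEFAULT_LIMIT)  # True: roughly 4x over the default limit
```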

#30 Updated by Nicolas CHARLES 5 months ago

  • Status changed from Pending technical review to In progress

I'm taking over this issue!

#31 Updated by Nicolas CHARLES 5 months ago

  • Status changed from In progress to Pending technical review
  • Assignee changed from Nicolas CHARLES to François ARMAND
  • Pull Request changed from https://github.com/Normation/rudder-packages/pull/1626 to https://github.com/Normation/rudder-packages/pull/1627

#32 Updated by Normation Quality Assistant 5 months ago

  • Assignee changed from François ARMAND to Nicolas CHARLES

#33 Updated by François ARMAND 5 months ago

  • Status changed from Pending technical review to Pending release

#34 Updated by François ARMAND 5 months ago

So, to sum up the problem:

- OpenLDAP has (like all LDAP servers) a protection against denial-of-service attacks that limits the size of the requests it processes. This prevents an attacker from sending big, useless requests with hard-to-decode ASN.1 and making the server burn CPU on them.
- That protection applies per "PDU" (protocol data unit), so it is not a direct mapping to request size.
- But it explains why we sometimes get the LDAP "Socket closed" error: when too much data comes in, OpenLDAP just closes the socket to protect itself.
- You get the following message with OpenLDAP log level "conns trace" (yes, that is why we never saw it: the required log level is extremely deep)

Jul 23 10:03:19 server rudder-slapd[31057]: connection_get(10): got connid=1002
Jul 23 10:03:19 server rudder-slapd[31057]: connection_read(10): checking for input on id=1002
Jul 23 10:03:19 server rudder-slapd[31057]: ber_get_next on fd 10 failed errno=34 (Numerical result out of range)
Jul 23 10:03:19 server rudder-slapd[31057]: connection_read(10): input error=-2 id=1002, closing.
Jul 23 10:03:19 server rudder-slapd[31057]: connection_closing: readying conn=1002 sd=10 for close
Jul 23 10:03:19 server rudder-slapd[31057]: connection_close: conn=1002 sd=10
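
The `ber_get_next` failure above happens before the request body is even read: a BER PDU announces its total length in its header, so the server can reject an oversized request up front. A minimal sketch of that length check (not OpenLDAP's actual code; the example header bytes are illustrative):

```python
def ber_pdu_length(header: bytes) -> int:
    """Parse the length octets of a BER TLV header (definite form),
    as a server must do before reading the PDU body."""
    first = header[1]          # header[0] is the tag octet
    if first < 0x80:           # short form: length fits in one octet
        return first
    n = first & 0x7F           # long form: the next n octets hold the length
    length = 0
    for b in header[2:2 + n]:
        length = (length << 8) | b
    return length

MAX_INCOMING_AUTH = 4194303  # OpenLDAP's default sockbuf_max_incoming_auth

# A SEQUENCE header announcing a 16 MiB body (0x01000000 = 16777216 bytes):
hdr = bytes([0x30, 0x84, 0x01, 0x00, 0x00, 0x00])
size = ber_pdu_length(hdr)
print(size, size > MAX_INCOMING_AUTH)  # 16777216 True -> connection closed
```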

#35 Updated by François ARMAND 5 months ago

  • Subject changed from Error at the end of generation after accepting 500 nodes to "SocketException(message='Socket closed'" error at the end of generation with 500 nodes

#36 Updated by Benoît PECCATTE 4 months ago

  • Status changed from Pending release to Released

This bug has been fixed in Rudder 4.1.14 and 4.3.4 which were released today.

#37 Updated by François ARMAND 4 months ago

  • Related to Bug #13256: LDAP IO error on generation with a lot of nodes added
