Bug #10646
closed"SocketException(message='Socket closed'" error at the end of generation with 500 nodes
Description
On a fairly large install (2000 nodes, 500 nodes accepted at once), I got the following error at the end of generation:
⇨ Can not save (modify) 'cn=Nodes Configuration,ou=Rudder,cn=rudder-configuration': An error occurred while attempting to send the LDAP message to server localhost:389: SocketException(message='Socket closed', trace='socketWrite(SocketOutputStream.java:118) / write(SocketOutputStream.java:155) / write(BufferedOutputStream.java:122) / write(ByteStringBuffer.java:1664) / writeTo(ASN1Buffer.java:851) / sendMessage(LDAPConnectionInternals.java:525) / sendMessage(LDAPConnection.java:4249) / processAsync(ModifyRequest.java:745) / process(ModifyRequest.java:650) / modify(LDAPConnection.java:2745) / $anonfun$modifyAction$3(LDAPConnection.scala:541) / apply(Exception.scala:224) / $anonfun$modifyAction$1(LDAPConnection.scala:541) / applyMod(LDAPConnection.scala:489) / $anonfun$applyModify$2(LDAPConnection.scala:552) / save(LDAPConnection.scala:609) / $anonfun$save$8(NodeConfigurationCacheRepository.scala:451) / flatMap(Box.scala:808) / $anonfun$save$4(NodeConfigurationCacheRepository.scala:447) / flatMap(Box.scala:808) / $anonfun$save$3(NodeConfigurationCacheRepository.scala:446) / $anonfun$flatMap$1(LDAPConnectionProvider.scala:117) / withCon(LDAPConnectionProvider.scala:154) / withCon$(LDAPConnectionProvider.scala:143) / withCon(LDAPConnectionProvider.scala:385) / flatMap(LDAPConnectionProvider.scala:116) / flatMap$(LDAPConnectionProvider.scala:115) / flatMap(LDAPConnectionProvider.scala:385) / save(NodeConfigurationCacheRepository.scala:445) / cacheNodeConfiguration(NodeConfigurationServiceImpl.scala:72) / $anonfun$writeNodeConfigurations$4(DeploymentService.scala:1010) / flatMap(Box.scala:808) / writeNodeConfigurations(DeploymentService.scala:1004) / writeNodeConfigurations$(DeploymentService.scala:992) / writeNodeConfigurations(DeploymentService.scala:480) / $anonfun$deploy$71(DeploymentService.scala:232) / flatMap(Box.scala:808) / $anonfun$deploy$65(DeploymentService.scala:221) / flatMap(Box.scala:808) / $anonfun$deploy$61(DeploymentService.scala:215) / flatMap(Box.scala:808) / $anonfun$deploy$58(DeploymentService.scala:214) / flatMap(Box.scala:808) / $anonfun$deploy$54(DeploymentService.scala:209) / flatMap(Box.scala:808) / $anonfun$deploy$50(DeploymentService.scala:204) / flatMap(Box.scala:808) / $anonfun$deploy$47(DeploymentService.scala:197) / flatMap(Box.scala:808) / $anonfun$deploy$43(DeploymentService.scala:188) / flatMap(Box.scala:808) / $anonfun$deploy$38(DeploymentService.scala:174) / flatMap(Box.scala:808) / $anonfun$deploy$36(DeploymentService.scala:173) / flatMap(Box.scala:808) / $anonfun$deploy$34(DeploymentService.scala:172) / flatMap(Box.scala:808) / $anonfun$deploy$33(DeploymentService.scala:171) / flatMap(Box.scala:808) / $anonfun$deploy$31(DeploymentService.scala:170) / flatMap(Box.scala:808) / $anonfun$deploy$28(DeploymentService.scala:167) / flatMap(Box.scala:808) / $anonfun$deploy$24(DeploymentService.scala:164) / flatMap(Box.scala:808) / $anonfun$deploy$20(DeploymentService.scala:161) / flatMap(Box.scala:808) / $anonfun$deploy$16(DeploymentService.scala:158) / flatMap(Box.scala:808) / $anonfun$deploy$12(DeploymentService.scala:155) / flatMap(Box.scala:808) / $anonfun$deploy$8(DeploymentService.scala:152) / flatMap(Box.scala:808) / $anonfun$deploy$5(DeploymentService.scala:147) / flatMap(Box.scala:808) / $anonfun$deploy$2(DeploymentService.scala:142) / flatMap(Box.scala:808) / deploy(DeploymentService.scala:141) / deploy$(DeploymentService.scala:127) / deploy(DeploymentService.scala:480) / applyOrElse(AsyncDeploymentAgent.scala:350) / 
apply(AbstractPartialFunction.scala:34) / execTranslate(LiftActor.scala:504) / execTranslate$(LiftActor.scala:496) / execTranslate(AsyncDeploymentAgent.scala:342) / proc2(LiftActor.scala:305) / $anonfun$processMailbox$1(LiftActor.scala:227) / apply(JFunction0$mcV$sp.java:12) / around(LiftActor.scala:241) / around$(LiftActor.scala:240) / around(AsyncDeploymentAgent.scala:342) / processMailbox(LiftActor.scala:227) / $anonfun$$bang$3(LiftActor.scala:190) / run(LiftActor.scala:76) / runWorker(ThreadPoolExecutor.java:1142) / run(ThreadPoolExecutor.java:617) / run(Thread.java:745)', revision=24201)
There is no error on the LDAP side in the logs.
Updated by Nicolas CHARLES over 7 years ago
- Related to Bug #10645: Proxy error when accepting 500 inventories at the same time added
Updated by Jonathan CLARKE over 7 years ago
- Subject changed from error at the end of generation, after accepting 500 nodes to Error at the end of generation after accepting 500 nodes
- Severity set to Critical - prevents main use of Rudder | no workaround | data loss | security
- User visibility changed from Infrequent - complex configurations | third party integrations to Operational - other Techniques | Technique editor | Rudder settings
- Priority changed from 0 to 60
Updated by Jonathan CLARKE over 7 years ago
- User visibility changed from Operational - other Techniques | Technique editor | Rudder settings to Infrequent - complex configurations | third party integrations
- Priority changed from 60 to 49
Assuming this only happens when 500 nodes are accepted simultaneously, this is an infrequent situation. If it also happens for fewer nodes, say <=100, we'll need to increase the visibility field again.
Updated by Nicolas CHARLES over 7 years ago
- Severity deleted (Critical - prevents main use of Rudder | no workaround | data loss | security)
- Priority changed from 49 to 0
The slapd logs say:
Apr 24 17:01:39 dhcp240 rudder-slapd[10927]: => access_allowed: read access to "nodeId=154ec2c4-a5a7-4c5c-b49a-75d2f56d1c68,ou=Nodes,ou=Accepted Inventories,ou=Inventories,cn=rudder-configuration" "osName" requested
Apr 24 17:01:39 dhcp240 rudder-slapd[10927]: <= root access granted
Apr 24 17:01:39 dhcp240 rudder-slapd[10927]: => access_allowed: read access granted by manage(=mwrscxd)
Apr 24 17:01:39 dhcp240 rudder-slapd[10927]: => access_allowed: result not in cache (osFullName)
Apr 24 17:01:39 dhcp240 rudder-slapd[10927]: => access_allowed: read access to "nodeId=154ec2c4-a5a7-4c5c-b49a-75d2f56d1c68,ou=Nodes,ou=Accepted Inventories,ou=Inventories,cn=rudder-configuration" "osFullName" requested
Apr 24 17:01:39 dhcp240 rudder-slapd[10927]: <= root access granted
Apr 24 17:02:09 dhcp240 rudder-slapd[10927]: dnMatch 1024
but the Rudder webapp log says:
[2017-04-24 17:01:43] DEBUG application - [Store Agent Run Times] Scheduled task finished in 595 ms (started at 2017/04/24 17:01:42, finished at 2017/04/24 17:01:43)
[2017-04-24 17:01:45] DEBUG com.normation.rudder.services.policies.PromiseGenerationServiceImpl - Policy generation completed in 242543 ms
[2017-04-24 17:01:45] ERROR com.normation.rudder.batch.AsyncDeploymentAgent$DeployerAgent - Error when updating policy, reason Cannot write configuration node <- Can not save (modify) 'cn=Nodes Configuration,ou=Rudder,cn=rudder-configuration': An error occurred while attempting to send the LDAP message to server localhost:389: SocketException(message='Socket closed', trace='socketWrite(SocketOutputStream.java:118) / write(SocketOutputStream.java:155)
Updated by Nicolas CHARLES over 7 years ago
- Severity set to Critical - prevents main use of Rudder | no workaround | data loss | security
- Priority changed from 0 to 49
Updated by François ARMAND over 7 years ago
It is a problem occurring in the underlying LDAP library (UnboundID).
So the connection, for some reason, cannot be used because the socket is closed, and so the request never reaches the LDAP server, which is why the OpenLDAP logs are empty at that point in time.
It happens when we try to acquire a connection in NodeConfigurationCacheRepository.scala, line 445:
def save(caches: Set[NodeConfigurationHash]): Box[Set[NodeId]] = {
  val updatedIds = caches.map(_.id)
  for {
    ldap          <- ldapCon // <========= HERE
    existingEntry <- ldap.get(rudderDit.NODE_CONFIGS.dn)
This code is completely standard; we use it again and again everywhere. If it were reproducible, I would say that something went wrong with the LDAP connection pool before or during the deployment, and that for some reason the fault happens around the same time (based on the number of open connections?). Or perhaps something during the deployment, just before that action, corrupts the connection pool, perhaps with bad data.
I fear that I won't be able to make more assumptions than that.
So, some questions:
- does restarting Rudder solve the problem?
- does cleaning the cache? (but I really believe that the problem is not linked to NodeHash, but to something happening before)
The next step is to set the UnboundID logging to debug or trace level and see what the logs tell us.
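A minimal sketch of what enabling that could look like, assuming the stock com.unboundid.util.Debug facility of the UnboundID SDK (which routes its output through java.util.logging); the helper object below is hypothetical, not Rudder code:

import java.util.EnumSet
import java.util.logging.{ConsoleHandler, Level}
import com.unboundid.util.{Debug, DebugType}

// Hypothetical helper: turn on UnboundID SDK debugging before triggering a generation,
// so the LDAP messages and exceptions around the "Socket closed" error get logged.
object EnableUnboundidDebug {
  def apply(): Unit = {
    // enable debugging for LDAP operations, connection handling and exceptions
    Debug.setEnabled(true, EnumSet.of(DebugType.LDAP, DebugType.CONNECT, DebugType.EXCEPTION))
    // the SDK logs through java.util.logging: lower the level and attach a console handler
    val logger = Debug.getLogger()
    logger.setLevel(Level.FINEST)
    val handler = new ConsoleHandler()
    handler.setLevel(Level.FINEST)
    logger.addHandler(handler)
  }
}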
Updated by Vincent MEMBRÉ over 7 years ago
- Target version changed from 4.1.2 to 4.1.3
Updated by Vincent MEMBRÉ over 7 years ago
- Target version changed from 4.1.3 to 4.1.4
Updated by Vincent MEMBRÉ over 7 years ago
- Target version changed from 4.1.4 to 4.1.5
Updated by Alexis Mousset over 7 years ago
- Target version changed from 4.1.5 to 4.1.6
- Priority changed from 49 to 48
Updated by Benoît PECCATTE over 7 years ago
- Assignee set to François ARMAND
Fixing #10645 will probably solve this one
Updated by Vincent MEMBRÉ over 7 years ago
- Target version changed from 4.1.6 to 4.1.7
- Priority changed from 64 to 62
Updated by Vincent MEMBRÉ over 7 years ago
- Target version changed from 4.1.7 to 4.1.8
Updated by Vincent MEMBRÉ about 7 years ago
- Target version changed from 4.1.8 to 4.1.9
- Priority changed from 62 to 60
Updated by Vincent MEMBRÉ about 7 years ago
- Target version changed from 4.1.9 to 4.1.10
- Priority changed from 59 to 58
Updated by Vincent MEMBRÉ almost 7 years ago
- Target version changed from 4.1.10 to 4.1.11
- Priority changed from 58 to 56
Updated by Vincent MEMBRÉ over 6 years ago
- Target version changed from 4.1.11 to 4.1.12
- Priority changed from 56 to 54
Updated by Vincent MEMBRÉ over 6 years ago
- Target version changed from 4.1.12 to 4.1.13
Updated by Vincent MEMBRÉ over 6 years ago
- Target version changed from 4.1.13 to 4.1.14
Updated by François ARMAND over 6 years ago
One of our users experienced the same problem on Rudder 4.1.7.
The server is quite loaded and has > 10k nodes.
Updated by François ARMAND over 6 years ago
I'm able to reproduce it by faking a huge "Nodes Configuration" entry, changing this method in NodeConfigurationCacheRepository:
private[this] def toLdap(nodeConfigs: Set[NodeConfigurationHash]): LDAPEntry = {
  val cc = if(nodeConfigs.nonEmpty) {
    val c = nodeConfigs.head
    for(i <- 0 to 10000) yield c.copy(
        id         = NodeId(c.id.value + i.toString)
      , policyHash = c.policyHash ++ (for(j <- 1 to 100) yield PolicyHash(
            Cf3PolicyDraftId(RuleId("test--dir-" + j), DirectiveId("test-rule-" + j))
          , j))
    )
  } else {
    Set[NodeConfigurationHash]()
  }
  val caches = (cc ++ nodeConfigs).map{ x => write(x) }
  val entry = rudderDit.NODE_CONFIGS.model
  entry +=! (A_NODE_CONFIG, caches.toSeq:_*)
  entry
}
Updated by François ARMAND over 6 years ago
I was able to minimize it with this program:
object NodesConfigurationErrorMain {

  val rwLdap = new RWPooledSimpleAuthConnectionProvider(
      host     = "localhost"
    , port     = 1389
    , authDn   = "cn=Manager,cn=rudder-configuration"
    , authPw   = "secret"
    , poolSize = 2
  )

  def str(size: Int) = new String(Array.fill[Char](size)('x'))

  def main(args: Array[String]): Unit = {
    val dn = new DN("cn=Nodes Configuration,ou=Rudder,cn=rudder-configuration")
    val entry = LDAPEntry(dn)
    entry += new Attribute("objectClass", "top", "nodeConfigurations")
    entry += new Attribute("cn", "Nodes Configuration")

    // here, with 454, we get:
    // Failure(Can not save (add) 'cn=Nodes Configuration,ou=Rudder,cn=rudder-configuration': An error occurred while attempting to send the LDAP message to server localhost:1389: SocketException(message='Socket closed', trace='socketWrite(SocketOutputStream.java:118) / write(SocketOutputStream.java:155) / write(BufferedOutputStream.java:122) / write(ByteStringBuffer.java:1664) ....
    val values = for(i <- 0 until 167) yield i + "-" + str(100000) ++ str(453)
    entry += new Attribute("nodeConfig", values:_*)

    println(for {
      ldap <- rwLdap
      _    <- ldap.delete(dn)
      _    <- ldap.save(entry)
    } yield {
      "done"
    })
  }
}
Updated by François ARMAND over 6 years ago
To be clearer: in the previous code snippets, it is really the total size of the entry that is the problem. I can decrease the size of each value and add more of them, or increase it and add fewer.
It is most likely that somewhere we hit an int overflow, probably in the ASN.1 encoding or in the actual write of the data to the socket.
Updated by François ARMAND over 6 years ago
Bug opened in ldap sdk: https://github.com/pingidentity/ldapsdk/issues/47
Updated by François ARMAND over 6 years ago
- Status changed from New to In progress
Updated by François ARMAND over 6 years ago
- Status changed from In progress to Pending technical review
- Assignee changed from François ARMAND to Nicolas CHARLES
- Pull Request set to https://github.com/Normation/rudder-packages/pull/1626
Updated by François ARMAND over 6 years ago
Actually, the problem is just that OpenLDAP limits the size of the requests it accepts, to avoid denial of service by people sending huge, useless requests.
The correction is to increase "sockbuf_max_incoming_auth" in slapd.conf:
# Specify the maximum incoming LDAP PDU size for authenticated sessions. The default is 4194303.
# This is too little for us, we have big entries (like Nodes Configuration)
sockbuf_max_incoming_auth 1000000000
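As an order-of-magnitude sketch (assuming roughly one byte per character and ignoring BER and attribute overhead), the entry built by the minimized program above is far larger than that default:

// Rough size of the add request built by NodesConfigurationErrorMain above.
val valueSize         = 3 + 100000 + 453   // "<i>-" prefix (1-4 chars) + str(100000) + str(453)
val approxRequestSize = 167L * valueSize   // about 16.8 MB for the 167 values
val defaultMaxPdu     = 4194303L           // OpenLDAP default for sockbuf_max_incoming_auth
println(approxRequestSize > defaultMaxPdu) // true: well above the default PDU limit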
Updated by Nicolas CHARLES over 6 years ago
- Status changed from Pending technical review to In progress
I'm taking over this issue!
Updated by Nicolas CHARLES over 6 years ago
- Status changed from In progress to Pending technical review
- Assignee changed from Nicolas CHARLES to François ARMAND
- Pull Request changed from https://github.com/Normation/rudder-packages/pull/1626 to https://github.com/Normation/rudder-packages/pull/1627
Updated by Rudder Quality Assistant over 6 years ago
- Assignee changed from François ARMAND to Nicolas CHARLES
Updated by François ARMAND over 6 years ago
- Status changed from Pending technical review to Pending release
Applied in changeset rudder-packages|f2b2c8a46357d593a2770ceadd6be3cb394c0806.
Updated by François ARMAND over 6 years ago
So, to sum up the problem:
- OpenLDAP has (like all LDAP servers) a protection against denial-of-service attacks that limits the size of the requests it processes. This is to avoid having an attacker send big, useless requests with hard-to-decode ASN.1 and make the server burn CPU on them.
- that protection is counted per "PDU" (protocol data unit), so it is not a direct mapping to request size
- but it explains why we sometimes get the LDAP error "socket closed": when too much data comes in, OpenLDAP just closes the socket to protect itself.
- you get the following message with OpenLDAP log level "conns trace" (yeah, that's why we never saw it: the required log level is extremely verbose); a slapd.conf sketch for enabling that level follows the log below
Jul 23 10:03:19 server rudder-slapd[31057]: connection_get(10): got connid=1002
Jul 23 10:03:19 server rudder-slapd[31057]: connection_read(10): checking for input on id=1002
Jul 23 10:03:19 server rudder-slapd[31057]: ber_get_next on fd 10 failed errno=34 (Numerical result out of range)
Jul 23 10:03:19 server rudder-slapd[31057]: connection_read(10): input error=-2 id=1002, closing.
Jul 23 10:03:19 server rudder-slapd[31057]: connection_closing: readying conn=1002 sd=10 for close
Jul 23 10:03:19 server rudder-slapd[31057]: connection_close: conn=1002 sd=10
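For reference, a sketch of how that log level can be enabled, assuming the standard slapd.conf loglevel directive (slapd needs a restart to pick it up):

# extremely verbose: enable connection handling and function-call tracing, for debugging only
loglevel conns trace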
Updated by François ARMAND over 6 years ago
- Subject changed from Error at the end of generation after accepting 500 nodes to "SocketException(message='Socket closed'" error at the end of generation with 500 nodes
Updated by Benoît PECCATTE over 6 years ago
- Status changed from Pending release to Released
This bug has been fixed in Rudder 4.1.14 and 4.3.4 which were released today.
- 4.1.14: Announce Changelog
- 4.3.4: Announce Changelog
- Download: https://www.rudder-project.org/site/get-rudder/downloads/
Updated by François ARMAND over 6 years ago
- Related to Bug #13256: LDAP IO error on generation with a lot of nodes added