Bug #10456
closed
Several core features don't work anymore with more than 1000 nodes
Added by Janos Mattyasovszky over 7 years ago.
Updated over 7 years ago.
Category:
Web - Compliance & node report
Severity:
Critical - prevents main use of Rudder | no workaround | data loss | security
User visibility:
Operational - other Techniques | Technique editor | Rudder settings
Description
I somehow managed to produce this by adding 2287 nodes to a 4.1-rc1, and this happened after the jetty-restart:
[2017-03-17 15:09:26] INFO migration - Migration of inline variables in Directives to 'rudder' namespace already done, skipping
[2017-03-17 15:09:26] INFO bootchecks - Check for force reload of Techniques library
[2017-03-17 15:09:26] INFO bootchecks - Flag file '/opt/rudder/etc/force_technique_reload' does not exist, do not Technique library will not be reloaded
[2017-03-17 15:09:26] INFO bootchecks - Check that system groups 'hasPolicyServer-*' are dynamics and correct them
[2017-03-17 15:09:26] INFO bootchecks - Resume policy update if it was running before shutdown
[2017-03-17 15:09:26] INFO bootchecks - Flag file '/opt/rudder/etc/policy-update-running' found, Start a new policy update now
[2017-03-17 15:09:26] INFO bootstrap.liftweb.Boot - classpath
[2017-03-17 15:09:26] INFO com.normation.rudder.services.policies.PromiseGenerationServiceImpl - Start policy generation, checking updated rules
[2017-03-17 15:09:26] INFO application - Application Rudder started
[2017-03-17 15:09:26] DEBUG com.normation.rudder.services.policies.PromiseGenerationServiceImpl - Pre-policy-generation scripts hooks ran in 20 ms
[2017-03-17 15:09:26] DEBUG com.normation.rudder.services.policies.PromiseGenerationServiceImpl - Pre-policy-generation modules hooks in 2 ms, start getting all generation related data.
2017-03-17 15:09:26.455:INFO:oejs.AbstractConnector:Started SelectChannelConnector@127.0.0.1:8080
[2017-03-17 15:09:26] DEBUG com.normation.rudder.services.policies.PromiseGenerationServiceImpl - All relevant information fetched in 313 ms, start names historization.
[2017-03-17 15:09:30] DEBUG com.normation.rudder.services.policies.PromiseGenerationServiceImpl - Node contexts built in 4187 ms, start to build new node configurations.
[2017-03-17 15:09:31] ERROR historization - Error with update nodes historization process <- Could not update the nodes historization information in base. <- null
[2017-03-17 15:09:31] ERROR historization - Root cause was: null
[2017-03-17 15:09:31] DEBUG com.normation.rudder.services.policies.PromiseGenerationServiceImpl - Policy generation completed in 4808 ms
[2017-03-17 15:09:31] ERROR com.normation.rudder.batch.AsyncDeploymentAgent$DeployerAgent - Error when updating policy, reason Could not update the nodes historization information in base. <- null
[2017-03-17 15:09:31] ERROR com.normation.rudder.batch.AsyncDeploymentAgent - Policy update error for process '10' at 2017-03-17 15:09:31: Could not update the nodes historization information in base.
Exception in thread "pool-6-thread-15" java.lang.StackOverflowError
at doobie.util.atom$Atom$$anon$2.$anonfun$set$2(atom.scala:47)
at doobie.util.atom$Atom$$anon$2.$anonfun$set$2$adapted(atom.scala:47)
at doobie.util.composite$LowerPriorityComposite$$anon$6.$anonfun$set$4(composite.scala:109)
at doobie.util.composite$LowerPriorityComposite$$anon$6.$anonfun$set$4$adapted(composite.scala:109)
at doobie.util.composite$Composite$$anon$2.$anonfun$set$2(composite.scala:56)
at doobie.util.composite$Composite$$anon$2.$anonfun$set$2$adapted(composite.scala:56)
at doobie.util.composite$Composite$$anon$2.$anonfun$set$2(composite.scala:56)
at doobie.util.composite$Composite$$anon$2.$anonfun$set$2$adapted(composite.scala:56)
at doobie.util.composite$Composite$$anon$2.$anonfun$set$2(composite.scala:56)
at doobie.util.composite$Composite$$anon$2.$anonfun$set$2$adapted(composite.scala:56)
at doobie.util.composite$Composite$$anon$2.$anonfun$set$2(composite.scala:56)
at doobie.util.composite$Composite$$anon$2.$anonfun$set$2$adapted(composite.scala:56)
at doobie.util.composite$Composite$$anon$2.$anonfun$set$2(composite.scala:56)
[[[cut for sanity, but this continues down quite a lot]]]
- Category set to Web - Compliance & node report
- Target version set to 4.1.0
- Severity set to Critical - prevents main use of Rudder | no workaround | data loss | security
This happend with currently 2287 in the system, I will try to go back to ~1000 nodes and test again.
This is most likelly the same bug as #10380, but it is more impacting.
A first, simple workarounnd may be to just increase the available stack space. @Janos Matya, would you mind trying to add in `/etc/default/rudder-jetty` the following arg:
JAVA_OPTIONS="-Xss4M"
But the correct solution is to check the number of element in the "in" clause, and if more than, say 200, create a temporary table + join.
- Related to Bug #10380: Dashboard do not load due to stackoverflow error in 4.1 added
- Related to Bug #10453: Error during policy generation added
- Subject changed from Cannot historize names to Stackoverflow during generation for 2200 nodes
And as Nicolas reported, the "historization: null" is most likelly something that is not correctly initialized, and we will be correcting it in #10453
that did work.
[2017-03-20 13:06:43] DEBUG com.normation.rudder.services.policies.PromiseGenerationServiceImpl - Node configuration written on filesystem in 439848 ms
[2017-03-20 13:06:44] DEBUG com.normation.rudder.services.policies.PromiseGenerationServiceImpl - Node configuration cached in LDAP in 296 ms
[2017-03-20 13:06:44] DEBUG com.normation.rudder.services.policies.PromiseGenerationServiceImpl - Node configuration written in 440261 ms, start to update expected reports.
[2017-03-20 13:06:44] DEBUG com.normation.rudder.services.policies.PromiseGenerationServiceImpl - Reports updated in 866 ms
[2017-03-20 13:06:44] DEBUG com.normation.rudder.services.policies.PromiseGenerationServiceImpl - Post-policy-generation hooks ran in 74 ms
[2017-03-20 13:06:44] DEBUG com.normation.rudder.services.policies.PromiseGenerationServiceImpl - Timing summary:
[2017-03-20 13:06:44] DEBUG com.normation.rudder.services.policies.PromiseGenerationServiceImpl - Run pre-gen scripts hooks : 29 ms
[2017-03-20 13:06:44] DEBUG com.normation.rudder.services.policies.PromiseGenerationServiceImpl - Run pre-gen modules hooks : 2 ms
[2017-03-20 13:06:44] DEBUG com.normation.rudder.services.policies.PromiseGenerationServiceImpl - Fetch all information : 358 ms
[2017-03-20 13:06:44] DEBUG com.normation.rudder.services.policies.PromiseGenerationServiceImpl - Historize names : 901 ms
[2017-03-20 13:06:44] DEBUG com.normation.rudder.services.policies.PromiseGenerationServiceImpl - Build current rule values : 15 ms
[2017-03-20 13:06:44] DEBUG com.normation.rudder.services.policies.PromiseGenerationServiceImpl - Build target configuration: 766 ms
[2017-03-20 13:06:44] DEBUG com.normation.rudder.services.policies.PromiseGenerationServiceImpl - Update rule vals : 65 ms
[2017-03-20 13:06:44] DEBUG com.normation.rudder.services.policies.PromiseGenerationServiceImpl - Increment rule serials : 59 ms
[2017-03-20 13:06:44] DEBUG com.normation.rudder.services.policies.PromiseGenerationServiceImpl - Write node configurations : 440261 ms
[2017-03-20 13:06:44] DEBUG com.normation.rudder.services.policies.PromiseGenerationServiceImpl - Save expected reports : 866 ms
[2017-03-20 13:06:44] DEBUG com.normation.rudder.services.policies.PromiseGenerationServiceImpl - Run post generation hooks : 74 ms
[2017-03-20 13:06:44] DEBUG com.normation.rudder.services.policies.PromiseGenerationServiceImpl - Number of nodes updated : 1087
[2017-03-20 13:06:44] DEBUG com.normation.rudder.services.policies.PromiseGenerationServiceImpl - Policy generation completed in 445502 ms
[2017-03-20 13:06:44] INFO com.normation.rudder.batch.AsyncDeploymentAgent - Successful policy update '21' [started 2017-03-20 12:59:19 - ended 2017-03-20 13:06:44]
PS: These times were with writing to a ramdisk:
# df -hTP /var/rudder/share
Filesystem Type Size Used Avail Use% Mounted on
none tmpfs 2.0G 876M 1.2G 43% /var/rudder/share
OK, nice, we have a workaround.
Numbers are interesting, to. We are spending just far too much time writting files :/
- Related to deleted (Bug #10453: Error during policy generation)
- Has duplicate Bug #10453: Error during policy generation added
- Related to deleted (Bug #10380: Dashboard do not load due to stackoverflow error in 4.1)
- Has duplicate Bug #10380: Dashboard do not load due to stackoverflow error in 4.1 added
- Subject changed from Stackoverflow during generation for 2200 nodes to Several core features don't work anymore with more than 1000 nodes
- User visibility set to Operational - other Techniques | Technique editor | Rudder settings
- Status changed from New to In progress
- Assignee set to François ARMAND
- Status changed from In progress to Pending technical review
- Assignee changed from François ARMAND to Nicolas CHARLES
- Pull Request set to https://github.com/Normation/rudder/pull/1605
- Assignee changed from Nicolas CHARLES to Alexis Mousset
- Pull Request changed from https://github.com/Normation/rudder/pull/1605 to https://github.com/Normation/rudder/pull/1606
- Status changed from Pending technical review to Pending release
- Status changed from Pending release to Released
This bug has been fixed in Rudder 4.1.0 which was released today.
- Related to Bug #11037: Missing agent reports after Rudder server restart added
Also available in: Atom
PDF