Project

General

Profile

Actions

Bug #4242

closed

Promise generation takes too long when getting more and more nodes

Added by Olivier Mauras about 11 years ago. Updated over 10 years ago.

Status:
Released
Priority:
3
Category:
Performance and scalability
Target version:
Severity:
UX impact:
User visibility:
Effort required:
Priority:
Name check:
Fix check:
Regression:

Description

Hello,

On my 130 managed nodes Rudder server, rules generation takes up to 6mn30s.

Here is a quick run example:

Success: Rules applied at 2013-12-11 15:14 (took 5 min 40s)

To be honest it starts to make the new server provisioning workflow slow because of this.
A new server has to wait around 10mn before being able to apply rudder rules which are

From what i understand, cf-promises is run on all nodes id directories which is ~1.2s here:

/usr/bin/time /opt/rudder/bin/cf-promises -f /var/rudder/share/a17dc652-4f84-4fa2-8c97-26e15e6f42b3/rules/cfengine-community/promises.cf
1.15user 0.04system 0:01.22elapsed 98%CPU (0avgtext+0avgdata 25248maxresident)k
0inputs+16outputs (0major+8468minor)pagefaults 0swaps

While not being the only culprit here, it takes a good 2minutes to run over all 130 nodes. Wouldn't that be faster to only run it on one fully generated rule and then simply copy promise files over their id directories?

Should i try to increase LDAP cache - 1GB by default - or increase JVM heap size - 1GB also?

Let me know if you need more informations.

Olivier


Subtasks 1 (0 open1 closed)

Bug #4583: Some error case are not handled in #4242ReleasedFrançois ARMAND2014-03-11Actions

Related issues 5 (1 open4 closed)

Related to Rudder - Architecture #4427: cf-promises check on ALL generated promises leads to huge generation time NewNicolas CHARLESActions
Related to Rudder - Bug #4475: Promise generation process should not lose time by forking to run "/bin/true"ReleasedNicolas CHARLES2014-02-17Actions
Related to Rudder - Bug #4584: Building node configuration in the promise generation process takes a lot of timeRejected2014-03-11Actions
Related to Rudder - Bug #4542: Missing property 'rudder.debug.nodeconfiguration.path' in configuration fileReleasedNicolas CHARLES2014-02-28Actions
Related to Rudder - Architecture #4782: Split rudder variable parsing and variable bindingReleasedNicolas CHARLES2014-04-29Actions
Actions #1

Updated by Olivier Mauras about 11 years ago

By setting

rudder.community.checkpromises.command=/bin/true
it now gives me

Success: Rules applied at 2013-12-11 16:27 (took 2 min 46s)
Actions #2

Updated by François ARMAND about 11 years ago

Hello,

I suspect one part of the generation to take a lot of time, related to putting things in LDAP. Could you please run in debug and copy here for the generation timing logs? It should be something like:

[2013-12-11 16:55:27] DEBUG com.normation.rudder.services.servers.NodeConfigurationServiceImpl - Written in ldap the node configuration caches in 134 millisec
....
....
....
[2013-12-11 16:55:28] DEBUG com.normation.rudder.services.policies.DeploymentServiceImpl - Reports updated in 572ms
[2013-12-11 16:55:28] DEBUG com.normation.rudder.services.policies.DeploymentServiceImpl - Timing summary:
[2013-12-11 16:55:28] DEBUG com.normation.rudder.services.policies.DeploymentServiceImpl - Fetch all information     :        361 ms
[2013-12-11 16:55:28] DEBUG com.normation.rudder.services.policies.DeploymentServiceImpl - Historize names           :       1333 ms
[2013-12-11 16:55:28] DEBUG com.normation.rudder.services.policies.DeploymentServiceImpl - Build current rule vals   :        185 ms
[2013-12-11 16:55:28] DEBUG com.normation.rudder.services.policies.DeploymentServiceImpl - Expand rule parameters    :        237 ms
[2013-12-11 16:55:28] DEBUG com.normation.rudder.services.policies.DeploymentServiceImpl - Build target configuration:        400 ms
[2013-12-11 16:55:28] DEBUG com.normation.rudder.services.policies.DeploymentServiceImpl - Update rule vals          :        587 ms
[2013-12-11 16:55:28] DEBUG com.normation.rudder.services.policies.DeploymentServiceImpl - Increment rule serials    :        274 ms
[2013-12-11 16:55:28] DEBUG com.normation.rudder.services.policies.DeploymentServiceImpl - Write node configurations :       1576 ms
[2013-12-11 16:55:28] DEBUG com.normation.rudder.services.policies.DeploymentServiceImpl - Save expected reports     :        572 ms
[2013-12-11 16:55:28] DEBUG com.normation.rudder.services.policies.DeploymentServiceImpl - Deployment completed in 5667 millisec

And if I guess correctly, the "written in LDAP" (really good english, thought) and the "write node config" should take a lot of time.

Thanks,

Actions #3

Updated by Olivier Mauras about 11 years ago

timing summary:

[2013-12-11 17:06:37] DEBUG com.normation.rudder.services.policies.DeploymentServiceImpl - Timing summary:
[2013-12-11 17:06:37] DEBUG com.normation.rudder.services.policies.DeploymentServiceImpl - Fetch all information     :        218 ms
[2013-12-11 17:06:37] DEBUG com.normation.rudder.services.policies.DeploymentServiceImpl - Historize names           :        143 ms
[2013-12-11 17:06:37] DEBUG com.normation.rudder.services.policies.DeploymentServiceImpl - Build current rule vals   :         37 ms
[2013-12-11 17:06:37] DEBUG com.normation.rudder.services.policies.DeploymentServiceImpl - Expand rule parameters    :        282 ms
[2013-12-11 17:06:37] DEBUG com.normation.rudder.services.policies.DeploymentServiceImpl - Build target configuration:      38300 ms
[2013-12-11 17:06:37] DEBUG com.normation.rudder.services.policies.DeploymentServiceImpl - Update rule vals          :        263 ms
[2013-12-11 17:06:37] DEBUG com.normation.rudder.services.policies.DeploymentServiceImpl - Increment rule serials    :        160 ms
[2013-12-11 17:06:37] DEBUG com.normation.rudder.services.policies.DeploymentServiceImpl - Write node configurations :     103092 ms
[2013-12-11 17:06:37] DEBUG com.normation.rudder.services.policies.DeploymentServiceImpl - Save expected reports     :      20477 ms
[2013-12-11 17:06:37] DEBUG com.normation.rudder.services.policies.DeploymentServiceImpl - Deployment completed in 163102 millisec

Actions #4

Updated by Olivier Mauras about 11 years ago

With ldap directory on tmpfs:

[2013-12-11 17:28:53] DEBUG com.normation.rudder.services.policies.DeploymentServiceImpl - Timing summary:
[2013-12-11 17:28:53] DEBUG com.normation.rudder.services.policies.DeploymentServiceImpl - Fetch all information     :        205 ms
[2013-12-11 17:28:53] DEBUG com.normation.rudder.services.policies.DeploymentServiceImpl - Historize names           :        112 ms
[2013-12-11 17:28:53] DEBUG com.normation.rudder.services.policies.DeploymentServiceImpl - Build current rule vals   :         36 ms
[2013-12-11 17:28:53] DEBUG com.normation.rudder.services.policies.DeploymentServiceImpl - Expand rule parameters    :        259 ms
[2013-12-11 17:28:53] DEBUG com.normation.rudder.services.policies.DeploymentServiceImpl - Build target configuration:      39686 ms
[2013-12-11 17:28:53] DEBUG com.normation.rudder.services.policies.DeploymentServiceImpl - Update rule vals          :        258 ms
[2013-12-11 17:28:53] DEBUG com.normation.rudder.services.policies.DeploymentServiceImpl - Increment rule serials    :        114 ms
[2013-12-11 17:28:53] DEBUG com.normation.rudder.services.policies.DeploymentServiceImpl - Write node configurations :     152479 ms
[2013-12-11 17:28:53] DEBUG com.normation.rudder.services.policies.DeploymentServiceImpl - Save expected reports     :       9591 ms
[2013-12-11 17:28:53] DEBUG com.normation.rudder.services.policies.DeploymentServiceImpl - Deployment completed in 202853 millisec

Not better at all :(

Actions #5

Updated by Matthieu CERDA almost 11 years ago

  • Category set to Web - Config management
  • Status changed from New to 8
  • Assignee set to François ARMAND
  • Target version set to 2.6.11

Qualified as an important bug, assigning to FAR.

Actions #6

Updated by François ARMAND almost 11 years ago

Some more thought on that one.

There is two parts on that problem, one related to storage of generated node coniguration, the other linked to the validation of generated promises with cf-promises.

Say in that one, we are going to manage the storage of configuraiton, and see #4427 for the cf-promise part.

So, the problem: we store all the node configuration in our LDAP base. We need them in a "snapshot" way, so that 1/ we are sure that the configuration is not modifed along the generation process, 2/we can debug by seeing what the Rudder-vision of the node configuration was, and especially 3/ we are able to decide if a node configuration was modified compared to the last generation round, so that we only process actually updated node configuration further down the promise generation algorithm.

A node configuration is something big, and composed of a lot of domain objects :

- node variables (hostname and the like),
- Rudder parameters,
- system variables,
- and of course, directives applied to that node with (possibly) specific parameters.

For a node with 50 directives, that leads to 101 LDAP objects (one big). So, for 100 nodes, that leads to 10 000 LDAP objects.
In the case that we don't have any luck, we have to rewrite ALL these object, and LDAP is not really good for that.

So, the solution is to change the node configuration data model. What we really need is the 3/ above (be able to decide if a node configuration is updated). For that, we don't need all the actual node configuration, some kind of hash can be sufficient.
But in that case, we loose 2/, the ability to debug. The good answer to that is to have a runtime configurable switch that will dump node coniguration in a human readable format, only when we want to debug - no need to impact performance with something not used in the general case.
And for 1/, the promise generation algorithm has dramatically changed since Rudder 2.1, and is mostly based on immutable, in-memory object now. So we "just" have to finish the transformation, and since the whole process is mono-threaded, use these in-memory objects to forbid any race-condition-like problem.

Actions #7

Updated by François ARMAND almost 11 years ago

Work in progress: https://github.com/fanf/rudder/tree/proto/remove_node_configuration

Everything compile and tests pass, need some cleaning and a more clean way to enable the debuging.

First performance tests are REALLY good (from tens of seconds to less than one second for that part). Need some more use case scenario to check.

Actions #8

Updated by François ARMAND almost 11 years ago

  • Assignee changed from François ARMAND to Nicolas CHARLES

A first version of pull requests are available here:

- https://github.com/Normation/cf-clerk/pull/32
- https://github.com/Normation/rudder/pull/435

They are quite heavy, so Nico, let's talk about them!

Actions #9

Updated by François ARMAND almost 11 years ago

  • Target version changed from 2.6.11 to 2.10.0~beta1

Seing the impact on code, it can't be a simple bug correction, but should go in the next version.

We are currently looking for workarounds limiting the timing for earlier version of Rudder.

Actions #10

Updated by François ARMAND almost 11 years ago

For the record, with first testing data, with 120 nodes and around 50 directives on my testing machine we have:

- before: around 70 second JUST FOR SAVING IN LDAP
- now: less than 500 ms

It was around 50 ms for ~20 nodes, so it seems to gain something like 3-5 ms by node almost lineary with nodes, and with little impact for the number of directive.

Actions #11

Updated by François ARMAND almost 11 years ago

  • Pull Request set to https://github.com/Normation/rudder/pull/435
Actions #12

Updated by Nicolas CHARLES almost 11 years ago

  • Assignee changed from Nicolas CHARLES to François ARMAND

Francois, I've made a remark about the file name, that could be a constant;
otherwise that's good for me !
Thank you

Actions #13

Updated by François ARMAND almost 11 years ago

  • Status changed from 8 to Pending release
  • % Done changed from 0 to 100
Actions #15

Updated by Nicolas CHARLES almost 11 years ago

  • Status changed from Pending release to Pending technical review

Francois, I've added a PR that add error message when something can go bad for two step of deployment
https://github.com/Normation/rudder/pull/456

Actions #16

Updated by Nicolas CHARLES almost 11 years ago

  • Status changed from Pending technical review to Pending release

remove the second PR to #4583

Actions #17

Updated by Vincent MEMBRÉ almost 11 years ago

  • Tracker changed from Bug to User story
  • Subject changed from Rules generation is too long to Speed up promises generation by optimization on LDAP requests
  • Category changed from Web - Config management to Performance and scalability
Actions #18

Updated by Vincent MEMBRÉ almost 11 years ago

  • Tracker changed from User story to Bug
  • Pull Request set to https://github.com/Normation/rudder/pull/435
Actions #19

Updated by Vincent MEMBRÉ almost 11 years ago

  • Subject changed from Speed up promises generation by optimization on LDAP requests to Bad usage of ldap requests during promise generation make it very long
Actions #20

Updated by Vincent MEMBRÉ almost 11 years ago

  • Subject changed from Bad usage of ldap requests during promise generation make it very long to Promise generation takes too long when getting more and more nodes
Actions #21

Updated by Vincent MEMBRÉ over 10 years ago

  • Status changed from Pending release to Released

This bug has been fixed in Rudder 2.10.0~beta1, which was released today.
Check out:

The release announcement: http://www.rudder-project.org/pipermail/rudder-announce/2014-March/000084.html
The full ChangeLog: http://www.rudder-project.org/foswiki/bin/view/System/Documentation:ChangeLog210
Download information: https://www.rudder-project.org/site/get-rudder/downloads/
Actions

Also available in: Atom PDF