Bug #15775 (closed)

Bug #15205: Use blocking threadpool for unsafeRun()

Performance problems with ZIO

Added by François ARMAND over 4 years ago. Updated over 4 years ago.

Status: Released
Priority: N/A
Category: Performance and scalability
Target version:
Severity:
UX impact:
User visibility:
Effort required:
Priority: 0
Name check: To do
Fix check: To do
Regression:

Description

We found an order-of-magnitude performance decrease with the new ZIO backend.

The problem seems to come (at least) from:

- slow LDAP connection reads. It seems to be linked to ZIO, but we are not sure why (most likely the cost of interpreting ZIO programs in that context). This leads to dynamic group updates taking more than 10x longer in 5.1 than in 5.0;
- thousands of short-lived threads (most of them linked to cats.*). That is extremely expensive (see the sketch after the references below).

Some references:
- Blocking as default `effect` import?: https://github.com/zio/zio/issues/1275
- Zio run blocking backwardscompatible code: https://stackoverflow.com/questions/55380840/zio-run-blocking-backwardscompatible-code
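
For reference, a minimal sketch of the direction suggested by the parent ticket: run blocking LDAP reads on ZIO's dedicated blocking pool. It assumes a ZIO 1.x-style API; `blockingSearch` is a placeholder, not Rudder's actual LDAP code:

    import zio._
    import zio.blocking._

    object BlockingLdapSketch {
      // Placeholder for a blocking LDAP read (e.g. a synchronous search call).
      def blockingSearch(baseDn: String): List[String] = {
        Thread.sleep(50) // stands in for the network round trip
        List(s"entry under $baseDn")
      }

      // effectBlocking shifts the call onto ZIO's cached blocking thread pool,
      // so slow LDAP reads neither monopolise the core executor nor require
      // spawning short-lived threads elsewhere.
      def search(baseDn: String): ZIO[Blocking, Throwable, List[String]] =
        effectBlocking(blockingSearch(baseDn))
    }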

Actions #1

Updated by François ARMAND over 4 years ago

  • Status changed from New to In progress
Actions #3

Updated by François ARMAND over 4 years ago

  • Description updated (diff)
Actions #4

Updated by François ARMAND over 4 years ago

  • Status changed from In progress to Pending technical review
  • Assignee changed from François ARMAND to Nicolas CHARLES
  • Pull Request set to https://github.com/Normation/rudder/pull/2486
Actions #5

Updated by François ARMAND over 4 years ago

  • Subject changed from Performance problems with ZIO LDAP connections to Performance problems with ZIO

Some more information: for a policy generation, here is the timing difference between 5.0 and 5.1:

Step                          5.0       5.1       Ratio
Number of nodes updated       4549      4549
Run pre-gen scripts hooks     0         358       --
Run pre-gen modules hooks     0         4         --
Fetch all information         1437      6156      x4.28
Historize names               3282      5720      x1.74
Build current rule values     839       1515      x1.81
Build target configuration    140241    956087    x6.82
- node context                16455     79897     x4.86
- other things                123786    876190    x7.08
Write node configurations     401538    2307613   x5.75
Save expected reports         13471     13323     x0.99
Run post generation hooks     129       267       x2.07

The first two can be considered outliers. The problem is not specific to I/O: one of the most I/O-intensive tasks, fetching data, is "only" x4, while one of the least I/O-intensive tasks, "other things" in build target configuration, is x7!

So we need to understand whether the problem lies in parallelism or, worse, in CPU-bound tasks.

=> add timing information in 5.0 and 5.1 to be able to analyse where the problem comes from.
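
A minimal sketch of what such instrumentation could look like on the ZIO side, using the `timed` combinator (ZIO 1.x-style API; the helper name and log format are hypothetical):

    import zio._
    import zio.clock.Clock

    // Hypothetical helper: run one generation step, print its duration in ms,
    // and return its result unchanged.
    def timedStep[R <: Clock, E, A](label: String)(step: ZIO[R, E, A]): ZIO[R, E, A] =
      step.timed.flatMap { case (duration, result) =>
        ZIO.effectTotal(println(s"$label: ${duration.toMillis} ms")).map(_ => result)
      }

    // e.g. timedStep("Fetch all information")(fetchAllInformation)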

Actions #6

Updated by Nicolas CHARLES over 4 years ago

Protocol used for new tests:
  • logs in Trace for policy_generation and debug_timing
  • ran the following commands on both 5.0 and 5.1 logs:
    
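    # drop log lines that only add noise to the timing analysis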
    sed -i "/is a Dsc node/d"  *
    sed -i "/discarding less priorize/d"  *
    sed -i "/Not updating changes by rule/d"  *
    sed -i "/their promises are going to be written/d"  *
    sed -i "/does not define any components/d"  *
    sed -i "/Not updating non-modified node/d"  *
    
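    # collect all timing entries, then split them out per metric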
    grep "debug_timing" * >> debug_timing.log
    
    grep "getNodeStatusReports" debug_timing.log >> getNodeStatusReports.log
    sed -i "/getNodeStatusReports/d"  debug_timing.log
    
    grep "computing policy status" debug_timing.log >> computePolicyStatus.log
    sed -i "/computing policy status/d"  debug_timing.log
    
    grep "Compute complianceForRun" debug_timing.log >> complianceForRun.log
    sed -i "/Compute complianceForRun/d"  debug_timing.log
    
    grep "compute compliance reports" debug_timing.log >> complianceReports.log
    sed -i "/compute compliance reports/d"  debug_timing.log
    
    grep "global cost" debug_timing.log >> globalCost.log
    sed -i "/global cost/d"  debug_timing.log
    
    grep "Cache for nodes info expire" debug_timing.log >> cacheExpire.log
    sed -i "/Cache for nodes info expire/d"  debug_timing.log
    
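    # extract policy generation timings and drop per-directive detail lines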
    grep "policy.generation" * >> policy_generation.log
    
    sed -i "/to merge multi-directives/d" policy_generation.log
    sed -i "/directive for unique techniques/d" policy_generation.log
    sed -i "/indep multi-directives/d" policy_generation.log
    sed -i "/Resolved policies for/d" policy_generation.log
    
    # Policy status comparison
    sed -i 's/.*: //' computePolicyStatus.log
    sed -i 's/ms//' computePolicyStatus.log
    
    sed -i 's/.*: //' getNodeStatusReports.log
    sed -i 's/ms//' getNodeStatusReports.log
    
    sed -i 's/.*: //' complianceReports.log
    sed -i 's/ms//' complianceReports.log
    
    sed -i 's/.*: //' cacheExpire.log
    sed -i 's/ms//' cacheExpire.log
    
    

(Unfortunately, global cost and Compute complianceForRun don't exist in 5.1.)
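
For reference, a minimal sketch (hypothetical helper, in Scala) of how the per-metric statistics below could be computed, assuming each extracted file contains one millisecond value per line:

    import scala.io.Source

    // Summarise one of the extracted files (one millisecond value per line).
    def stats(path: String): String = {
      val values = Source.fromFile(path).getLines().map(_.trim.toLong).toList
      val sum    = values.sum
      val count  = values.size
      val mode   = values.groupBy(identity).maxBy(_._2.size)._1
      s"sum=$sum entries=$count avg=${sum.toDouble / count} " +
        s"mode=$mode min=${values.min} max=${values.max}"
    }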

Result for getNodeStatusReports

Version   Sum       Number of entries   Average    Mode   Min   Max
5.0       152087    13773               11.0424    2      0     532
5.1       17822     9146                1.94861    1      0     461

Result for computePolicyStatus

Version   Sum       Number of entries   Average    Mode   Min   Max
5.0       5397      13773               0.391854   0      0     341
5.1       3597      9146                0.393287   0      0     541

Result for complianceReports

Version   Sum       Number of entries   Average    Mode   Min   Max
5.0       317721    126                 2521.6     6      3     165463
5.1       48765     50                  975.3      12     0     24429

Result for cacheExpire

Version   Sum       Number of entries   Average    Mode   Min   Max
5.0       350287    3929                89.1542    88     76    171
5.1       58129     562                 103.432    95     5     175

Note: results for 5.0 cover a full night while 5.1 covers only a couple of hours, which may skew the results a bit, but using only half a day of data for 5.0 shows even worse stats for 5.0.

Actions #7

Updated by François ARMAND over 4 years ago

After many more tests, the performance problem seems to lie in `toBox` and/or `toIO` (from Box).
I'm not sure I understand why, but avoiding that performance killer in hot spots leads to performance as good as 5.0 (or even better in some cases).

The hottest spot is evalJS, which is run once per directive parameter. Porting it to IOResult leads to excellent results.
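
A minimal sketch of the shape of that change (the types are simplified stand-ins for Lift's `Box` and Rudder's `IOResult`, and the `evalJs*` bodies are placeholders, not the real implementation):

    import zio._

    object EvalJsSketch {
      // Simplified stand-ins for net.liftweb.common.Box and Rudder's IOResult.
      sealed trait Box[+A]
      final case class Full[+A](value: A)   extends Box[A]
      final case class Failure(msg: String) extends Box[Nothing]

      type IOResult[A] = IO[String, A]

      // Lifting a Box into ZIO: cheap in isolation, but done once per
      // directive parameter, it sits right in the hottest part of generation.
      def toIO[A](box: Box[A]): IOResult[A] = box match {
        case Full(a)      => ZIO.succeed(a)
        case Failure(msg) => ZIO.fail(msg)
      }

      // Box-based evalJS (placeholder body), lifted for each parameter:
      def evalJsBox(expr: String): Box[String]         = Full(expr)
      def evalJsLifted(expr: String): IOResult[String] = toIO(evalJsBox(expr))

      // IOResult-native evalJS: no intermediate Box, no per-parameter lifting.
      def evalJsIO(expr: String): IOResult[String]     = ZIO.effectTotal(expr)
    }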

For example, in 5.0 I have:

===> Total out: 374 ms
===> Total in : 368 ms
===> Total exp: 34 ms

Run nodeconfigProg          : 129 ms

In 6.0 with evalJS in Box:

===> Total out: 3954 ms
===> Total in : 3904 ms
===> Total exp: 3234 ms
Run nodeconfigProg          : 4793 ms

And with evalJS in IOResult:

===> Total out: 270 ms
===> Total in : 261 ms
===> Total exp: 21 ms

config : 187 ms
Actions #8

Updated by François ARMAND over 4 years ago

  • Status changed from Pending technical review to In progress
  • Assignee changed from Nicolas CHARLES to François ARMAND
Actions #9

Updated by François ARMAND over 4 years ago

  • Status changed from In progress to Pending technical review
  • Assignee changed from François ARMAND to Nicolas CHARLES
  • Pull Request changed from https://github.com/Normation/rudder/pull/2486 to https://github.com/Normation/rudder/pull/2529
Actions #10

Updated by François ARMAND over 4 years ago

  • Status changed from Pending technical review to Pending release
Actions #11

Updated by Vincent MEMBRÉ over 4 years ago

  • Status changed from Pending release to Released

This bug has been fixed in Rudder 6.0.0~beta1 which was released today.
