Bug #15775
Bug #15205 (closed): Use blocking threadpool for unsafeRun()
Performance problems with ZIO
Description
We found an order-of-magnitude performance decrease with the new ZIO backend.
The problem seems to come (at least) from:
- slow LDAP connection reads. This seems to be linked to ZIO, but we are not sure why (most likely the cost of interpreting ZIO programs in that context). It leads to dynamic group updates taking more than 10x longer in 5.1 than in 5.0 (a sketch of moving such blocking calls onto ZIO's blocking pool is given after the references below);
- thousands of short-lived threads (most of them linked to cats.*), which is extremely expensive.
Some references:
- Blocking as default `effect` import?: https://github.com/zio/zio/issues/1275
- Zio run blocking backwardscompatible code: https://stackoverflow.com/questions/55380840/zio-run-blocking-backwardscompatible-code
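As an illustration of the approach discussed in the references (a minimal sketch assuming the ZIO 1.x API; `searchLdap` and `searchLdapIO` are hypothetical stand-ins, not the real Rudder connection code), a blocking LDAP call can be shifted onto ZIO's dedicated blocking thread pool instead of the bounded CPU pool:

```scala
import zio._
import zio.blocking._

object LdapBlockingSketch {
  // Hypothetical blocking LDAP read, standing in for the real connection code.
  def searchLdap(dn: String): List[String] = ???

  // effectBlocking runs the call on ZIO's unbounded blocking pool, so a slow
  // LDAP read does not starve the fibers running on the CPU-bound pool.
  def searchLdapIO(dn: String): RIO[Blocking, List[String]] =
    effectBlocking(searchLdap(dn))
}
```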
Updated by François ARMAND about 5 years ago
- Status changed from New to In progress
Updated by François ARMAND about 5 years ago
Work in progress here: https://github.com/fanf/rudder/commit/bed63a9b5f46f78ac69935dcc5612f13060eb2bd
Updated by François ARMAND about 5 years ago
- Status changed from In progress to Pending technical review
- Assignee changed from François ARMAND to Nicolas CHARLES
- Pull Request set to https://github.com/Normation/rudder/pull/2486
Updated by François ARMAND about 5 years ago
- Subject changed from Performance problems with ZIO LDAP connections to Performance problems with ZIO
Some more information: for one generation, here is the timing difference between 5.0 and 5.1:
Step | 5.0 | 5.1 | Ratio
---|---|---|---
Number of nodes updated | 4549 | 4549 | |
Run pre-gen scripts hooks | 0 | 358 | x -- |
Run pre-gen modules hooks | 0 | 4 | x -- |
Fetch all information | 1437 | 6156 | x 4.28 |
Historize names | 3282 | 5720 | x 1.74 |
Build current rule values | 839 | 1515 | x 1.81 |
Build target configuration | 140241 | 956087 | x 6.82 |
- node context | 16455 | 79897 | x 4.86 |
- other things | 123786 | 876190 | x 7.08 |
Write node configurations | 401538 | 2307613 | x 5.75 |
Save expected reports | 13471 | 13323 | x 0.99 |
Run post generation hooks | 129 | 267 | x 2.07 |

The first two can be considered as outliers. The problem is not specific to I/O: one of the most I/O-intensive tasks, fetching data, is "only" x4, but one of the least I/O-intensive tasks, "other things" in "Build target configuration", is x7! So we need to understand whether the problem is with parallelism or, worse, with CPU-bound tasks.
=> add timing information in 5.0 and 5.1 to be able to analyse where the problem comes from.
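For illustration, here is a minimal sketch (assuming ZIO 1.x; the step name and logging are illustrative, not the actual Rudder instrumentation) of how a generation step can be timed with ZIO's `timed` combinator to compare 5.0 and 5.1 step by step:

```scala
import zio._
import zio.clock.Clock

object GenerationTimingSketch {
  // Hypothetical generation step, standing in for e.g. "Build target configuration".
  def buildTargetConfiguration: Task[Unit] = ???

  // `timed` pairs the result with the elapsed duration, which can then be logged.
  val timedStep: ZIO[Clock, Throwable, Unit] =
    buildTargetConfiguration.timed.flatMap { case (elapsed, _) =>
      UIO(println(s"Build target configuration: ${elapsed.toMillis} ms"))
    }
}
```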
Updated by Nicolas CHARLES about 5 years ago
- set logs to Trace for policy_generation and debug_timing
- ran the following commands on both the 5.0 and 5.1 logs:
```sh
sed -i "/is a Dsc node/d" *
sed -i "/discarding less priorize/d" *
sed -i "/Not updating changes by rule/d" *
sed -i "/their promises are going to be written/d" *
sed -i "/does not define any components/d" *
sed -i "/Not updating non-modified node/d" *
grep "debug_timing" * >> debug_timing.log
grep "getNodeStatusReports" debug_timing.log >> getNodeStatusReports.log
sed -i "/getNodeStatusReports/d" debug_timing.log
grep "computing policy status" debug_timing.log >> computePolicyStatus.log
sed -i "/computing policy status/d" debug_timing.log
grep "Compute complianceForRun" debug_timing.log >> complianceForRun.log
sed -i "/Compute complianceForRun/d" debug_timing.log
grep "compute compliance reports" debug_timing.log >> complianceReports.log
sed -i "/compute compliance reports/d" debug_timing.log
grep "global cost" debug_timing.log >> globalCost.log
sed -i "/global cost/d" debug_timing.log
grep "Cache for nodes info expire" debug_timing.log >> cacheExpire.log
sed -i "/Cache for nodes info expire/d" debug_timing.log
grep "policy.generation" * >> policy_generation.log
sed -i "/to merge multi-directives/d" policy_generation.log
sed -i "/directive for unique techniques/d" policy_generation.log
sed -i "/indep multi-directives/d" policy_generation.log
sed -i "/Resolved policies for/d" policy_generation.log

# Policy status comparison
sed -i 's/.*: //' computePolicyStatus.log
sed -i 's/ms//' computePolicyStatus.log
sed -i 's/.*: //' getNodeStatusReports.log
sed -i 's/ms//' getNodeStatusReports.log
sed -i 's/.*: //' complianceReports.log
sed -i 's/ms//' complianceReports.log
sed -i 's/.*: //' cacheExpire.log
sed -i 's/ms//' cacheExpire.log
```
(unfortunately global cost and Compute complianceForRun don't exist in 5.1)
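For reference, a small Scala sketch (an assumption about how such summaries can be computed, not necessarily the exact method used here) that turns one of the extracted files above (one millisecond value per line after the sed cleanup) into the kind of summary rows shown below:

```scala
import scala.io.Source
import scala.util.Try

object LogStatsSketch {
  // Reads e.g. getNodeStatusReports.log and prints basic summary statistics.
  def summarize(file: String): Unit = {
    val values = Source.fromFile(file).getLines().toList
      .flatMap(l => Try(l.trim.toLong).toOption) // keep only numeric lines
    val sum = values.sum
    println(s"$file: entries=${values.size} sum=$sum " +
            s"avg=${sum.toDouble / values.size} min=${values.min} max=${values.max}")
  }
}
```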
Results for getNodeStatusReports (times in ms):

Version | Sum | Number of entries | Average | Mode | Min | Max |
---|---|---|---|---|---|---|
5.0 | 152087 | 13773 | 11.0424 | 2 | 0 | 532 |
5.1 | 17822 | 9146 | 1.94861 | 1 | 0 | 461 |
Version | Sum | Number of entries | Average | Mode | Min | Max |
---|---|---|---|---|---|---|
5.0 | 5397 | 13773 | 0.391854 | 0 | 0 | 341 |
5.1 | 3597 | 9146 | 0.393287 | 0 | 0 | 541 |
Version | Sum | Number of entries | Average | Mode | Min | Max |
---|---|---|---|---|---|---|
5.0 | 317721 | 126 | 2521.6 | 6 | 3 | 165463 |
5.1 | 48765 | 50 | 975.3 | 12 | 0 | 24429 |
Version | Sum | Number of entries | Average | Mode | Min | Max |
---|---|---|---|---|---|---|
5.0 | 350287 | 3929 | 89.1542 | 88 | 76 | 171 |
5.1 | 58129 | 562 | 103.432 | 95 | 5 | 175 |
Note: results for 5.0 are over a full night, while 5.1 is over a couple of hours, so this might skew the results a bit; however, using only half a day for 5.0 shows even worse stats for 5.0.
Updated by François ARMAND about 5 years ago
After many more tests, the performance problem seems to lie in `toBox` and/or `toIO` (from Box).
I'm not sure I understand why, but avoiding that perf killer in hot spots leads to performance as good as in 5.0 (or even better in some cases).
The hottest spot is evalJS, which is called once per directive parameter. Porting it to IOResult leads to excellent results.
For example, in 5.0 I have:

```
===> Total out: 374 ms
===> Total in : 368 ms
===> Total exp: 34 ms
Run nodeconfigProg : 129 ms
```
In 6.0 with evalJS in Box:

```
===> Total out: 3954 ms
===> Total in : 3904 ms
===> Total exp: 3234 ms
Run nodeconfigProg : 4793 ms
```
And with evalJS in IOResult:

```
===> Total out: 270 ms
===> Total in : 261 ms
===> Total exp: 21 ms
config : 187 ms
```
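To illustrate the change described above, here is a minimal sketch (not the actual Rudder code; `evalJsBox`, `evalAllViaBox`, and the simplified `IOResult` alias are illustrative, assuming ZIO 1.x and Lift's Box) of why keeping evalJS in the effect type removes a per-parameter conversion from the hot path:

```scala
import net.liftweb.common.{Box, Full}
import zio._

object EvalJsSketch {
  // Illustrative stand-ins: IOResult is sketched as a plain ZIO alias,
  // and evalJsBox represents the old Box-returning evalJS.
  type IOResult[A] = IO[Throwable, A]
  def evalJsBox(param: String): Box[String] = Full(param)

  // Before: one Box -> effect conversion per directive parameter in the hot path.
  def evalAllViaBox(params: List[String]): IOResult[List[String]] =
    ZIO.foreach(params) { p =>
      evalJsBox(p).toOption match {
        case Some(v) => IO.succeed(v)
        case None    => IO.fail(new Exception(s"evalJS failed for $p"))
      }
    }

  // After: evalJS itself returns the effect type, so the per-parameter
  // conversion disappears and interpretation overhead stays constant.
  def evalJsIO(param: String): IOResult[String] = IO.effect(param)
  def evalAllViaIO(params: List[String]): IOResult[List[String]] =
    ZIO.foreach(params)(evalJsIO)
}
```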
Updated by François ARMAND about 5 years ago
- Status changed from Pending technical review to In progress
- Assignee changed from Nicolas CHARLES to François ARMAND
Updated by François ARMAND about 5 years ago
- Status changed from In progress to Pending technical review
- Assignee changed from François ARMAND to Nicolas CHARLES
- Pull Request changed from https://github.com/Normation/rudder/pull/2486 to https://github.com/Normation/rudder/pull/2529
Updated by François ARMAND about 5 years ago
- Status changed from Pending technical review to Pending release
Applied in changeset rudder|fdbc8e6c78d96a4c94498f51ae2a22887ffbd40a.
Updated by Vincent MEMBRÉ about 5 years ago
- Status changed from Pending release to Released
This bug has been fixed in Rudder 6.0.0~beta1 which was released today.