Bug #15062

Allow only catching up with recent runs in agent report processing batch

Added by Nicolas CHARLES 2 months ago. Updated about 2 months ago.

Status: Released
Priority: N/A
Category: Performance and scalability
Target version:
Severity:
User visibility:
Effort required:
Priority: 0

Description

When the web interface starts, we catch up on all reports sent while it was stopped, with a granularity of one day.
On a heavily loaded system this is simply infeasible: there isn't enough RAM to fetch everything, and PostgreSQL chokes on the query:

2019-06-11 20:21:29 UTCLOG:  duration: 997284.994 ms  execute <unnamed>: select distinct
          T.nodeid, T.executiontimestamp, coalesce(C.keyvalue, '') as nodeconfigid, coalesce(C.iscomplete, false) as complete, T.insertionid
        from
          (select nodeid, executiontimestamp, min(id) as insertionid from ruddersysevents where id > $1 and id <= $2 group by nodeid, executiontimestamp) as T
        left join
          (select
            true as iscomplete, nodeid, executiontimestamp, keyvalue
          from
            ruddersysevents where id > $3 and id <= $4 and
            eventtype = 'control' and
            component = 'end'
          ) as C
        on T.nodeid = C.nodeid and T.executiontimestamp = C.executiontimestamp
2019-06-11 20:21:29 UTCDETAIL:  parameters: $1 = '1428065585', $2 = '1437233953', $3 = '1428065585', $4 = '1437233953'

We should:
  1. be able to turn this feature off (or limit it: catch up on only the last xx minutes, to avoid gray compliance at start)
  2. be able to catch up on everything (when using the advanced reporting plugin), but process it in batches of yy minutes
  3. improve indexes
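
As a rough illustration of point 2, the ID range from the logged query could be cut into bounded sub-ranges and processed one at a time. This is only a sketch: the ticket proposes batches of yy minutes, whereas this splits by id because the logged query filters on id, and the batch size of 1000000 is a hypothetical value not taken from the ticket.

```sql
-- Sketch only: split the (id > $1 and id <= $2) range from the logged query
-- into sub-ranges. The batch size (1000000) is an assumed value.
SELECT g.lo                              AS batch_start,  -- exclusive lower bound
       LEAST(g.lo + 1000000, 1437233953) AS batch_end     -- inclusive upper bound
FROM generate_series(1428065585, 1437233952, 1000000) AS g(lo);
```

Each returned pair would then replace $1/$2 (and $3/$4) in the catch-up query, so that a single execution never has to scan the whole backlog.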

The query

select
            true as iscomplete, nodeid, executiontimestamp, keyvalue
          from
            ruddersysevents where id >  1428065585 and id <= 1437233953 and
            eventtype = 'control' and
            component = 'end' ;

takes 36 seconds (see https://explain.depesz.com/s/T07l ) when there is no load on the database.

The index on component is used, but that is all it contributes. We should use this partial index instead:
CREATE INDEX endRun_control_idx ON RudderSysEvents (id) WHERE eventType = 'control' and component = 'end';

With this index, the query takes 2.5 s on an extremely heavily loaded system (load of 4 on a 2-CPU system).
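
To check on another system whether the planner actually picks the partial index, the same query can be run under EXPLAIN. This is a sketch that reuses the ID bounds from the log above; those bounds will differ elsewhere. Note that ANALYZE really executes the query, so it costs as much as a normal run.

```sql
-- Sketch: inspect the plan for the "end of run" sub-query.
-- The ID bounds are the ones from the logged query; adjust them for your data.
EXPLAIN (ANALYZE, BUFFERS)
select true as iscomplete, nodeid, executiontimestamp, keyvalue
from ruddersysevents
where id > 1428065585 and id <= 1437233953
  and eventtype = 'control'
  and component = 'end';
```

If the partial index is used, the plan should show an index scan or bitmap index scan on endrun_control_idx (PostgreSQL folds unquoted identifiers to lower case).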


Subtasks

Bug #15063: Change index on ruddersysevents to remove inefficient component index and replace it by a composite index (Released, François ARMAND)
Bug #15064: Add an entry in rudder-upgrade to run index migration script during upgrade (Released, François ARMAND)
Bug #15142: Missing migration script at upgrade from 4.1 to 5.0 on sles12 (Released, Vincent MEMBRÉ)
Bug #15076: typo in query from parent ticket (Released, Nicolas CHARLES)

Associated revisions

Revision fba640a7 (diff)
Added by Nicolas CHARLES 2 months ago

Fixes #15062: store run agent batch may never catch up on very loaded system because of inefficient index and the way it handles reports

History

#1

Updated by Nicolas CHARLES 2 months ago

With the new index, the batch did finish:

[2019-06-11 21:05:12] DEBUG report - [Store Agent Run Times #1] checking agent runs from SQL ID 1428065585 [2019-06-11T18:31:06.000Z - 2019-06-11T21:05:12.467Z]

...

[2019-06-11 21:20:17] DEBUG report - [Store Agent Run Times #1] (905407 ms) Added or updated 32373 agent runs, up to SQL ID 1440557937 (last run time was 2019-06-11T21:00:48.000Z)

#2

Updated by Nicolas CHARLES 2 months ago

Method used to create the index:

SET maintenance_work_mem TO '2GB';

drop index component_idx;

CREATE INDEX endRun_control_idx ON RudderSysEvents (id) WHERE eventType = 'control' and component = 'end';
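
On a system that must keep accepting reports during the change, a possible variant is to build the index concurrently and drop the old one only afterwards. This is a sketch, not the method used in this ticket or in the migration script from the subtasks, and it assumes the old index is still named component_idx.

```sql
-- Sketch of an alternative to the steps above; not the ticket's actual method.
SET maintenance_work_mem TO '2GB';

-- Build the partial index without blocking writes to the table.
-- CONCURRENTLY cannot be used inside a transaction block and takes longer to build.
CREATE INDEX CONCURRENTLY endRun_control_idx
  ON RudderSysEvents (id)
  WHERE eventType = 'control' AND component = 'end';

-- Drop the old index only once the new one exists and is valid.
DROP INDEX CONCURRENTLY component_idx;
```

If a concurrent build fails, PostgreSQL leaves an invalid index behind, which has to be dropped before retrying.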

#3

Updated by Nicolas CHARLES 2 months ago

  • Status changed from New to In progress
  • Assignee set to Nicolas CHARLES

#4

Updated by Nicolas CHARLES 2 months ago

  • Status changed from In progress to Pending technical review
  • Assignee changed from Nicolas CHARLES to François ARMAND
  • Pull Request set to https://github.com/Normation/rudder/pull/2256

#5

Updated by Rudder Quality Assistant 2 months ago

  • Assignee changed from François ARMAND to Nicolas CHARLES

#6

Updated by Nicolas CHARLES 2 months ago

  • Status changed from Pending technical review to Pending release

#10

Updated by Alexis MOUSSET about 2 months ago

  • Subject changed from "store run agent batch may never catch up on very loaded system because of inefficient index and the way it handles reports" to "Allow only catching up with recent runs in agent report processing batch"

#12

Updated by Vincent MEMBRÉ about 2 months ago

  • Status changed from Pending release to Released

This bug has been fixed in Rudder 5.0.12 which was released today.
