Bug #15062 (closed)
Allow only catching up with recent runs in agent report processing batch
Description
At start of the web interface, we catch up with all reports sent while the web interface was stopped, with a granularity of one day.
On a very loaded system, this is simply infeasible: there isn't enough RAM to fetch it all, and PostgreSQL chokes on the query:
2019-06-11 20:21:29 UTCLOG: duration: 997284.994 ms execute <unnamed>: select distinct T.nodeid, T.executiontimestamp, coalesce(C.keyvalue, '') as nodeconfigid, coalesce(C.iscomplete, false) as complete, T.insertionid from (select nodeid, executiontimestamp, min(id) as insertionid from ruddersysevents where id > $1 and id <= $2 group by nodeid, executiontimestamp) as T left join (select true as iscomplete, nodeid, executiontimestamp, keyvalue from ruddersysevents where id > $3 and id <= $4 and eventtype = 'control' and component = 'end' ) as C on T.nodeid = C.nodeid and T.executiontimestamp = C.executiontimestamp
2019-06-11 20:21:29 UTCDETAIL: parameters: $1 = '1428065585', $2 = '1437233953', $3 = '1428065585', $4 = '1437233953'
We should:
- be able to turn this feature off (or limit it: catch up only the last xx minutes, to avoid gray compliance at start)
- be able to catch up on everything (when using the advanced reporting plugin), but process it in batches of yy minutes
- improve indexes
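The first two points can be sketched as follows. This is only an illustration in Python, not the Rudder implementation: the function name `catch_up_windows` and its parameters `batch` (the "yy minutes") and `max_catch_up` (the "xx minutes") are hypothetical names chosen for this sketch.

```python
from datetime import datetime, timedelta
from typing import Iterator, Optional, Tuple


def catch_up_windows(
    last_processed: datetime,
    now: datetime,
    batch: timedelta,
    max_catch_up: Optional[timedelta] = None,
) -> Iterator[Tuple[datetime, datetime]]:
    """Yield consecutive (start, end) windows covering the reports to catch up.

    All names here are hypothetical; this is not Rudder's code.
    - batch: maximum length of one window (the "yy minutes").
    - max_catch_up: if set, ignore anything older than now - max_catch_up
      (the "xx minutes"); a zero duration yields no window at all.
    """
    if batch <= timedelta(0):
        raise ValueError("batch must be a positive duration")

    start = last_processed
    if max_catch_up is not None:
        # Never look further back than the configured limit.
        start = max(start, now - max_catch_up)

    while start < now:
        end = min(start + batch, now)
        yield start, end
        start = end
```

With `max_catch_up=None` the whole backlog is processed, one bounded window at a time, so no single query has to cover a full day. Passing a zero `max_catch_up` yields no window, which corresponds to turning the catch-up off.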
The query
select true as iscomplete, nodeid, executiontimestamp, keyvalue from ruddersysevents where id > 1428065585 and id <= 1437233953 and eventtype = 'control' and component = 'end' ;
takes 36 seconds (see https://explain.depesz.com/s/T07l ) when there is no load on the database.
The index on component is used, but that's all.
Since the index on component is only used for this query, we should use this index instead:
CREATE INDEX endRun_control_idx ON RudderSysEvents (id) WHERE eventType = 'control' and component = 'end';
With this index, the query takes 2.5 s on a very heavily loaded system (load of 4 on a 2-CPU system).
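To illustrate what such a partial index does, here is a small self-contained sketch. It uses Python with an in-memory SQLite database instead of PostgreSQL (an assumption made only so the example runs anywhere), a reduced table layout and made-up rows. The function name `demo_partial_index` is hypothetical, and the planner output and timings will differ from PostgreSQL.

```python
import sqlite3


def demo_partial_index():
    """Build a toy ruddersysevents table and query it through a partial index.

    Assumption: SQLite stands in for PostgreSQL, and the table holds only
    the columns used by the query in this ticket, filled with made-up rows.
    """
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE ruddersysevents ("
        " id INTEGER, nodeid TEXT, executiontimestamp TEXT,"
        " eventtype TEXT, component TEXT, keyvalue TEXT)"
    )

    rows = []
    for i in range(1, 1001):
        is_end = i % 50 == 0  # one 'end of run' control report every 50 rows
        rows.append((
            i,
            f"node{i % 10}",
            "2019-06-11T20:00:00Z",
            "control" if is_end else "result_success",
            "end" if is_end else "common",
            "cfg" if is_end else "",
        ))
    conn.executemany("INSERT INTO ruddersysevents VALUES (?, ?, ?, ?, ?, ?)", rows)

    # Only rows matching the WHERE clause are stored in the index,
    # so it stays small even when the table is huge.
    conn.execute(
        "CREATE INDEX endRun_control_idx ON ruddersysevents (id)"
        " WHERE eventtype = 'control' AND component = 'end'"
    )

    query = (
        "SELECT nodeid, executiontimestamp, keyvalue FROM ruddersysevents"
        " WHERE id > ? AND id <= ?"
        " AND eventtype = 'control' AND component = 'end'"
    )
    plan = conn.execute("EXPLAIN QUERY PLAN " + query, (100, 600)).fetchall()
    result = conn.execute(query, (100, 600)).fetchall()
    conn.close()
    return plan, result


if __name__ == "__main__":
    plan, result = demo_partial_index()
    for step in plan:
        print(step)  # inspect whether the planner picked endRun_control_idx
    print(len(result), "end-of-run rows in the ID range")
```

On the real database, the equivalent check is to run the query through PostgreSQL's EXPLAIN and confirm that the plan mentions endRun_control_idx.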
Updated by Nicolas CHARLES over 5 years ago
With the new index, the batch did finish:
[2019-06-11 21:05:12] DEBUG report - [Store Agent Run Times #1] checking agent runs from SQL ID 1428065585 [2019-06-11T18:31:06.000Z - 2019-06-11T21:05:12.467Z]
...
[2019-06-11 21:20:17] DEBUG report - [Store Agent Run Times #1] (905407 ms) Added or updated 32373 agent runs, up to SQL ID 1440557937 (last run time was 2019-06-11T21:00:48.000Z)
Updated by Nicolas CHARLES over 5 years ago
Method used to create the index:
SET maintenance_work_mem TO '2GB';
drop index component_idx;
CREATE INDEX endRun_control_idx ON RudderSysEvents (id) WHERE eventType = 'control' and component = 'end';
Updated by Nicolas CHARLES over 5 years ago
- Status changed from New to In progress
- Assignee set to Nicolas CHARLES
Updated by Nicolas CHARLES over 5 years ago
- Status changed from In progress to Pending technical review
- Assignee changed from Nicolas CHARLES to François ARMAND
- Pull Request set to https://github.com/Normation/rudder/pull/2256
Updated by Rudder Quality Assistant over 5 years ago
- Assignee changed from François ARMAND to Nicolas CHARLES
Updated by Nicolas CHARLES over 5 years ago
- Status changed from Pending technical review to Pending release
Applied in changeset rudder|fba640a7b082f31fbb6689f321308c317fbbd474.
Updated by François ARMAND over 5 years ago
- Fix check changed from To do to Checked
Updated by Alexis Mousset over 5 years ago
- Subject changed from "store run agent batch may never catch up on very loaded system because of inefficient index and the way it handles reports" to "Allow only catching up with recent runs in agent report processing batch"
Updated by Alexis Mousset over 5 years ago
- Name check changed from To do to Reviewed
Updated by Vincent MEMBRÉ over 5 years ago
- Status changed from Pending release to Released
This bug has been fixed in Rudder 5.0.12 which was released today.
Updated by Nicolas CHARLES about 5 years ago
- Related to Bug #14959: Batch Store Run Agent can be limited only in days for catching up old report added