Bug #10881
Closed
Almost all nodes go into "red state" after upgrading from 4.0.2 to 4.1.3
Description
Hi,
After upgrading from Rudder 4.0.2 to 4.1.3, almost 100% of my nodes are in a "red state" in the dashboard and in the node list.
In the node details I have a "red status" with the following message:
"This node is sending reports from an unknown configuration policy (with configuration ID '20170303-131304-4da3122a' that is unknown to Rudder, run started at 2017-03-14 06:07:26)"
When the next scheduled run completes and the nodes finish their compliance check, everything comes back to a "green state" with 100% compliance.
Here is the output of some database queries:
select * from ruddersysevents where nodeid = '019eb66d-XXXX-XXXX-XXXX-e7db93b174cd' and keyvalue = 'EndRun' order by executiontimestamp desc limit 10;
67081753 | 2017-06-08 18:02:14+02 | 019eb66d-XXXX-XXXX-XXXX-e7db93b174cd | common-root | hasPolicyServer-root | 70 | common | EndRun | 2017-06-08 18:01:55+02 | log_info | common | End execution with config [20170608-150958-1d4bcac]
67070378 | 2017-06-08 17:32:16+02 | 019eb66d-XXXX-XXXX-XXXX-e7db93b174cd | common-root | hasPolicyServer-root | 70 | common | EndRun | 2017-06-08 17:31:55+02 | log_info | common | End execution with config [20170608-150958-1d4bcac]
66967912 | 2017-06-08 12:02:49+02 | 019eb66d-XXXX-XXXX-XXXX-e7db93b174cd | common-root | hasPolicyServer-root | 70 | common | EndRun | 2017-06-08 12:02:25+02 | log_info | common | End execution with config [20170531-140238-63911ff5]
66844788 | 2017-06-08 06:02:22+02 | 019eb66d-XXXX-XXXX-XXXX-e7db93b174cd | common-root | hasPolicyServer-root | 70 | common | EndRun | 2017-06-08 06:01:58+02 | log_info | common | End execution with config [20170531-140238-63911ff5]
66721535 | 2017-06-08 00:01:58+02 | 019eb66d-XXXX-XXXX-XXXX-e7db93b174cd | common-root | hasPolicyServer-root | 70 | common | EndRun | 2017-06-08 00:01:31+02 | log_info | common | End execution with config [20170531-140238-63911ff5]
66606606 | 2017-06-07 18:02:27+02 | 019eb66d-XXXX-XXXX-XXXX-e7db93b174cd | common-root | hasPolicyServer-root | 70 | common | EndRun | 2017-06-07 18:02:04+02 | log_info | common | End execution with config [20170531-140238-63911ff5]
66484779 | 2017-06-07 12:01:57+02 | 019eb66d-XXXX-XXXX-XXXX-e7db93b174cd | common-root | hasPolicyServer-root | 70 | common | EndRun | 2017-06-07 12:01:37+02 | log_info | common | End execution with config [20170531-140238-63911ff5]
66368189 | 2017-06-07 06:02:33+02 | 019eb66d-XXXX-XXXX-XXXX-e7db93b174cd | common-root | hasPolicyServer-root | 70 | common | EndRun | 2017-06-07 06:02:10+02 | log_info | common | End execution with config [20170531-140238-63911ff5]
66245178 | 2017-06-07 00:02:10+02 | 019eb66d-XXXX-XXXX-XXXX-e7db93b174cd | common-root | hasPolicyServer-root | 70 | common | EndRun | 2017-06-07 00:01:43+02 | log_info | common | End execution with config [20170531-140238-63911ff5]
66130878 | 2017-06-06 18:02:40+02 | 019eb66d-XXXX-XXXX-XXXX-e7db93b174cd | common-root | hasPolicyServer-root | 70 | common | EndRun | 2017-06-06 18:02:16+02 | log_info | common | End execution with config [20170531-140238-63911ff5]
select nodeId,nodeconfigId,begindate,enddate from nodeconfigurations where nodeid='019eb66d-XXXX-XXXX-XXXX-e7db93b174cd' order by begindate desc limit 10;
019eb66d-XXXX-XXXX-XXXX-e7db93b174cd | 20170608-150958-1d4bcac | 2017-06-08 15:09:58.124+02 |
019eb66d-XXXX-XXXX-XXXX-e7db93b174cd | 20170608-150417-3f0f1fee | 2017-06-08 15:04:17.401+02 | 2017-06-08 15:09:58.124+02
019eb66d-XXXX-XXXX-XXXX-e7db93b174cd | 20170531-140238-63911ff5 | 2017-05-31 14:02:38.567+02 | 2017-06-08 15:04:17.401+02
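To dig further, a query along these lines might show whether any stored report for this node ever referenced the unknown configuration ID from the error message (just a sketch, reusing the table and column names visible in the output above; the msg column is assumed to be the one holding the "End execution with config [...]" text):
-- look for any report of this node mentioning the unknown configuration ID
select executiontimestamp, keyvalue, msg
from ruddersysevents
where nodeid = '019eb66d-XXXX-XXXX-XXXX-e7db93b174cd'
  and msg like '%20170303-131304-4da3122a%'
order by executiontimestamp desc
limit 10;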
Updated by I C over 7 years ago
Another query for troubleshooting:
select nodeId,nodeconfigId,begindate,enddate from archivednodeconfigurations where nodeconfigid = '20170303-131304-4da3122a';
1d0197ac-977e-4a2c-b2a9-c59799613e8f | 20170303-131304-4da3122a | 2017-03-03 13:13:04.704+01 | 2017-03-14 09:59:14.863+01
The nodeId returned by the query is not the expected one (019eb66d-d6eb-4ef9-9cbf-e7db93b174cd).
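As an additional check, a query like the following (only a sketch, assuming both tables share the nodeId/nodeConfigId/beginDate/endDate columns used above) would list every node that ever carried this configuration ID, in the live table as well as in the archive:
-- find which node(s) the configuration ID belongs to, live or archived
select nodeid, nodeconfigid, begindate, enddate, 'live' as source
from nodeconfigurations
where nodeconfigid = '20170303-131304-4da3122a'
union all
select nodeid, nodeconfigid, begindate, enddate, 'archived' as source
from archivednodeconfigurations
where nodeconfigid = '20170303-131304-4da3122a'
order by begindate;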
Updated by François ARMAND over 7 years ago
OK, so it seems that:
- something startled the compliance algorithm, which chose to look at a very old run,
- in parallel, the corresponding configuration was moved to the archive table.
Several points to investigate (see the query sketch below):
- why was such an old run chosen?
- why does the configuration ID not belong to the corresponding node?
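To correlate the two, something along these lines could be run against the reports database (a sketch only, built on the tables and columns from the queries above): compare the time window of the runs still stored for the node with the archive window of the configuration named in the error.
-- time window of the agent runs still present for the node
select min(executiontimestamp) as oldest_run, max(executiontimestamp) as latest_run
from ruddersysevents
where nodeid = '019eb66d-XXXX-XXXX-XXXX-e7db93b174cd'
  and keyvalue = 'EndRun';

-- archive window of the configuration mentioned in the error message
select nodeid, begindate, enddate
from archivednodeconfigurations
where nodeconfigid = '20170303-131304-4da3122a';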
Updated by François ARMAND over 7 years ago
- Target version set to 4.1.4
- Severity set to Minor - inconvenience | misleading | easy workaround
- User visibility set to Operational - other Techniques | Technique editor | Rudder settings
I'm setting the version to 4.1 (because it happens on migration to 4.1), but perhaps the same logic is already present in 4.0.
The severity is minor, because simply regenerating the policies brought the compliance back to a correct state.
Updated by Vincent MEMBRÉ over 7 years ago
- Target version changed from 4.1.4 to 4.1.5
Updated by Alexis Mousset over 7 years ago
- Target version changed from 4.1.5 to 4.1.6
Updated by François ARMAND over 7 years ago
- Related to Bug #11037: Missing agent reports after Rudder server restart added
Updated by Nicolas CHARLES over 7 years ago
- Related to Bug #10643: If node run interval is longer than 5 minutes, there may be "no report" at start of Rudder added
Updated by Vincent MEMBRÉ over 7 years ago
- Target version changed from 4.1.6 to 4.1.7
Updated by François ARMAND over 7 years ago
- Status changed from New to Rejected
So, now that we know what the problem was in #11037, we are (almost) sure that it is the same problem here.
I'm closing this as a "duplicate", but if you see it happen again, please reopen.
Updated by François ARMAND over 7 years ago
- Related to deleted (Bug #11037: Missing agent reports after Rudder server restart)
Updated by François ARMAND over 7 years ago
- Is duplicate of Bug #11037: Missing agent reports after Rudder server restart added