Bug #6118
Closed
cf-agent execution is killed if cf-execd is restarted (for example on a 5a.m daily restart)
Description
We have a promise that restarts services at 5 a.m. every day. Unfortunately, it does not behave as expected: stopping the executor daemon (cf-execd) apparently kills the agent as well, causing it to skip a run (this is especially visible with the advanced reporting plugin).
In 2.10, it merely truncates the logs in the outputs folder, but in 2.11 and 3.0 it skips the runs completely.
Updated by Vincent MEMBRÉ almost 10 years ago
- Target version changed from 2.11.6 to 2.11.7
Updated by Vincent MEMBRÉ almost 10 years ago
- Target version changed from 2.11.7 to 2.11.8
Updated by Vincent MEMBRÉ almost 10 years ago
- Target version changed from 2.11.8 to 2.11.9
Updated by François ARMAND almost 10 years ago
- Status changed from New to Discussion
- Assignee set to Nicolas CHARLES
Just to be sure I understand the bug here: the problem is not the 5 a.m. stop itself; it's that if we restart cf-execd, then cf-agent is killed (and so, if a run was ongoing, it is interrupted). Is that right?
Updated by Nicolas CHARLES almost 10 years ago
- Assignee changed from Nicolas CHARLES to François ARMAND
From the user's point of view, the problem is that every day at 5 a.m. there is a dent in reporting, with no answers anywhere.
The cause is that the agent kills cf-execd every day at 5 a.m. (it's part of the promises), and as a result it kills the agent that was launched by cf-execd, interrupting the run.
Killing cf-execd probably should not kill cf-agent, but I'm not sure about this one.
Updated by François ARMAND almost 10 years ago
- Subject changed from cf-agent execution is killed every day at 5am to cf-agent execution is killed if cf-execd is restarted (for example on a 5a.m daily restart)
I think they should be independent, but I don't quite see the implications here.
Well, in any case, I only see two possibilities:
- either cf-execd and cf-agent are independent, and killing the parent does not kill the child;
- or they are not (and we don't want them to be), and we need a graceful restart for cf-execd. That may be extremely tricky, because telling a stalled cf-agent apart from one that is just taking a long time to finish its tasks may not have a definitive answer...
I think that cf-execd manages a lot of things (sending emails, gathering execution stats, etc.), so perhaps we won't be able to split them even if we wanted to.
For now, I don't see any other option. Perhaps we should just not kill the process every day, and consider a kill for what it is: an interruption.
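As an illustration of the first possibility (a child that does not die with its parent), here is a minimal Python sketch of one generic POSIX approach: starting the child in its own session so that signals sent to the parent's process group do not reach it. This is not how cf-execd launches cf-agent; the helper names `spawn_detached` and `same_process_group` are hypothetical, and whether the kill observed in this ticket goes through the process group at all is an assumption.

```python
import os
import subprocess


def spawn_detached(cmd):
    """Start cmd in its own session (POSIX only).

    With start_new_session=True the child calls setsid() before exec,
    so it becomes the leader of a new session and process group.
    A signal sent to the parent's process group will then not be
    delivered to this child.
    """
    return subprocess.Popen(cmd, start_new_session=True)


def same_process_group(pid):
    """Return True if pid is in the same process group as the caller."""
    return os.getpgid(pid) == os.getpgid(0)
```

A caller could check `same_process_group(child.pid)` right after `spawn_detached(...)` to confirm the child has left the parent's group. Note that this only covers group-wide signals: a parent that explicitly signals its child's PID would still stop it.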
Updated by Nicolas CHARLES almost 10 years ago
The issue here is not the case of a stalled agent; it's really that we are preventively restarting cf-execd every day, and that kills the agent.
Restarting the executor daemon used to be part of the default CFEngine masterfiles, but it no longer is (at least since 08/2013).
So maybe we should remove this preventive restart.
Updated by François ARMAND almost 10 years ago
Nicolas CHARLES wrote:
The issue here is not the case of a stalled agent; it's really that we are preventively restarting cf-execd every day, and that kills the agent.
I understand that. What makes you think I was talking about a stalled agent for the restart? With a graceful restart, you basically have 3 cases to consider:
- "please stop": [ok, I'm stopped];
- "please stop": [nothing happens: stalled or just finishing?];
- ok, after some time / some retries, it's stopped;
- even after some time / some retries, it is not stopped => stalled or not? When to kill?
Restarting the executor daemon used to be part of the default CFEngine masterfiles, but it no longer is (at least since 08/2013).
So maybe we should remove this preventive restart.
That could be the workaround for that.
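The graceful-restart cases listed in this comment can be sketched in Python as a "ask, wait, retry, then force" loop. This is only an illustration under stated assumptions: the function name `graceful_stop`, the timeout, and the retry count are hypothetical, and nothing here reflects what cf-execd or the Rudder promises actually do.

```python
import subprocess


def graceful_stop(proc, wait_seconds=5.0, retries=3):
    """Ask proc (a subprocess.Popen) to stop, escalating if it does not.

    Returns "stopped" if the process exited after a polite request,
    or "killed" if it had to be forcibly terminated.
    """
    for _ in range(retries):
        proc.terminate()  # SIGTERM on POSIX: the "please stop" request
        try:
            proc.wait(timeout=wait_seconds)
            return "stopped"
        except subprocess.TimeoutExpired:
            # Still running: stalled, or just finishing? From the outside
            # we cannot tell, so we simply ask again.
            continue
    # Retries exhausted: the arbitrary "when to kill" decision is made here.
    proc.kill()  # SIGKILL on POSIX, cannot be ignored
    proc.wait()
    return "killed"
```

The sketch makes the difficulty concrete: `wait_seconds * retries` is an arbitrary deadline, and a legitimately long agent run that exceeds it is killed just like a stalled one.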
Updated by François ARMAND almost 10 years ago
- Assignee changed from François ARMAND to Nicolas CHARLES
- Reproduced set to No
Updated by Vincent MEMBRÉ almost 10 years ago
- Target version changed from 2.11.9 to 2.11.10
Updated by Vincent MEMBRÉ over 9 years ago
- Target version changed from 2.11.10 to 2.11.11
Updated by Vincent MEMBRÉ over 9 years ago
- Target version changed from 2.11.11 to 2.11.12
Updated by Vincent MEMBRÉ over 9 years ago
- Target version changed from 2.11.12 to 2.11.13
Updated by Vincent MEMBRÉ over 9 years ago
- Target version changed from 2.11.13 to 2.11.14
Updated by Vincent MEMBRÉ about 9 years ago
- Target version changed from 2.11.14 to 2.11.15
Updated by Nicolas CHARLES about 9 years ago
- Related to Bug #7274: The daily cf-execd and cf-serverd restart should use SRC on AIX added
Updated by Vincent MEMBRÉ about 9 years ago
- Target version changed from 2.11.15 to 2.11.16
Updated by Vincent MEMBRÉ about 9 years ago
- Target version changed from 2.11.16 to 2.11.17
Updated by Vincent MEMBRÉ about 9 years ago
- Target version changed from 2.11.17 to 2.11.18
Updated by Vincent MEMBRÉ almost 9 years ago
- Target version changed from 2.11.18 to 2.11.19
Updated by Vincent MEMBRÉ almost 9 years ago
- Target version changed from 2.11.19 to 2.11.20
Updated by Vincent MEMBRÉ over 8 years ago
- Target version changed from 2.11.20 to 2.11.21
Updated by Vincent MEMBRÉ over 8 years ago
- Target version changed from 2.11.21 to 2.11.22
Updated by Vincent MEMBRÉ over 8 years ago
- Target version changed from 2.11.22 to 2.11.23
Updated by Vincent MEMBRÉ over 8 years ago
- Target version changed from 2.11.23 to 2.11.24
Updated by Vincent MEMBRÉ over 8 years ago
- Target version changed from 2.11.24 to 308
Updated by Vincent MEMBRÉ over 8 years ago
- Target version changed from 308 to 3.1.14
Updated by Vincent MEMBRÉ about 8 years ago
- Target version changed from 3.1.14 to 3.1.15
Updated by Vincent MEMBRÉ about 8 years ago
- Target version changed from 3.1.15 to 3.1.16
Updated by Vincent MEMBRÉ about 8 years ago
- Target version changed from 3.1.16 to 3.1.17
Updated by Vincent MEMBRÉ about 8 years ago
- Target version changed from 3.1.17 to 3.1.18
Updated by Vincent MEMBRÉ almost 8 years ago
- Target version changed from 3.1.18 to 3.1.19
Updated by François ARMAND over 7 years ago
- Severity set to Critical - prevents main use of Rudder | no workaround | data loss | security
- User visibility set to Operational - other Techniques | Technique editor | Rudder settings
- Priority set to 0
I'm setting it to critical because there is no workaround and it breaks the main purpose of Rudder. On the other hand, I'm setting it to operational because the problem is not supposed to happen in tests.
Updated by Vincent MEMBRÉ over 7 years ago
- Target version changed from 3.1.19 to 3.1.20
Updated by Jonathan CLARKE over 7 years ago
- Status changed from Discussion to New
Updated by Vincent MEMBRÉ over 7 years ago
- Target version changed from 3.1.20 to 3.1.21
Updated by Benoît PECCATTE over 7 years ago
- Priority changed from 0 to 50
I think we should just remove this kill.
It is now 5 years later, and cf-execd should no longer have problems running for a long time.
Updated by Vincent MEMBRÉ over 7 years ago
- Target version changed from 3.1.21 to 3.1.22
Updated by Vincent MEMBRÉ over 7 years ago
- Target version changed from 3.1.22 to 3.1.23
Updated by Vincent MEMBRÉ over 7 years ago
- Target version changed from 3.1.23 to 3.1.24
Updated by Benoît PECCATTE about 7 years ago
- Target version changed from 3.1.24 to 4.3.0~beta1
Let's just remove the restart in 4.3; it should not be needed anymore.
Updated by Nicolas CHARLES about 7 years ago
- Status changed from New to In progress
- Assignee set to Nicolas CHARLES
Updated by Nicolas CHARLES about 7 years ago
- Status changed from In progress to Rejected