Bug #8351

closed

After the promises generation, cf-serverd config may not be reloaded, preventing new nodes from connecting

Added by Nicolas CHARLES over 8 years ago. Updated over 2 years ago.

Status: Released
Priority: N/A
Category: Server components
Target version:
Severity: Major - prevents use of part of Rudder | no simple workaround
UX impact:
User visibility: Operational - other Techniques | Technique editor | Rudder settings
Effort required: Medium
Priority: 0
Name check:
Fix check: Checked
Regression:
Description

When Rudder regenerates promises, it executes rudder-reload-cf-serverd as a post-hook to force a reload of cf-serverd.
In some cases, this reload may not do anything.

Coredumb reported that with Rudder 3.2 on CentOS 7, it fails over time (it works for 3 weeks to 1 month, then fails to update the cf-serverd configuration).
Running rudder server debug for a new node allows it to access its data, and restarting cf-serverd also lets it fetch its data.

This happens on 3.2, but most likely also on previous versions.


Subtasks 3 (0 open, 3 closed)

Architecture #18889: Add systemd socket activation to cf-serverd - Released - Alexis Mousset
Architecture #18893: Implement graceful restart on cf-serverd - Released - Alexis Mousset
Bug #18948: After the promises generation, cf-serverd config may not be reloaded, preventing new nodes from connecting - missing space - Released - Nicolas CHARLES
Actions #1

Updated by Nicolas CHARLES over 8 years ago

  • Assignee set to Alexis Mousset

Alexis,

You're the most familiar with this code - could you look at it?

Actions #2

Updated by Alexis Mousset over 8 years ago

  • Status changed from New to Discussion

The configuration reload:
  • May take up to 1 minute on lightly loaded servers
  • Does not occur until all threads are idle, which may never happen on loaded servers

On the other hand, a stop/start would break current connections, resulting in failed updates and file copies (and even worse with old agents that do not handle network errors correctly).

Actions #3

Updated by Vincent MEMBRÉ over 8 years ago

  • Target version changed from 2.11.21 to 2.11.22
Actions #4

Updated by Jonathan CLARKE over 8 years ago

An ideal solution here would be to have a proxy in front of one or two cf-serverd processes that can phase out an existing cf-serverd when we know it's running an old config, and direct all new connections to the new cf-serverd.

haproxy?
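
For illustration, a minimal HAProxy sketch of that idea (the ports, backend name, and cf-serverd instances on 127.0.0.1:5311/5312 are illustrative assumptions, not an existing Rudder configuration):

    frontend cfserverd_in
        bind *:5309
        mode tcp
        default_backend cfserverd_current

    backend cfserverd_current
        mode tcp
        # Point this backend at the instance running the new configuration and
        # let connections to the old instance (for example 127.0.0.1:5312)
        # drain on their own before stopping it.
        server current 127.0.0.1:5311 check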

Actions #5

Updated by Jonathan CLARKE over 8 years ago

  • Assignee changed from Alexis Mousset to Jonathan CLARKE
Actions #6

Updated by Janos Mattyasovszky over 8 years ago

haproxy is a little bit overkill, don't you think?
You could use iptables rules and have them simply redirect to a newer version after doing the update.
That would also keep existing connections alive, since only new incoming connections would be affected.
Just make sure cf-serverd listens on localhost (bindtointerface), so that no additional ports are open.
By making this change, you would also have to move the port information (port => "&COMMUNITYPORT&";) outside of cf-serverd.st, and the external script would need to manage the ports of the running cf-serverd instances, since you could already have n instances running, all still serving open connections. How do you know a cf-serverd no longer has any clients and can be killed? How do you handle checking the number of processes that are allowed to run? So many questions when you go from 1 to n... :)
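
A rough sketch of the redirect rule this would involve (the instance port is an assumption; REDIRECT rules live in the nat table):

    # Send new connections arriving on the public cf-serverd port to the
    # instance running the freshly generated configuration.
    iptables -t nat -I PREROUTING -p tcp --dport 5309 -j REDIRECT --to-port 5311
    # Established connections keep their existing NAT mapping (conntrack),
    # so only new incoming connections are affected by a rule change.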

Actions #7

Updated by Vincent MEMBRÉ over 8 years ago

  • Target version changed from 2.11.22 to 2.11.23
Actions #8

Updated by Vincent MEMBRÉ over 8 years ago

  • Target version changed from 2.11.23 to 2.11.24
Actions #9

Updated by Vincent MEMBRÉ over 8 years ago

  • Target version changed from 2.11.24 to 308
Actions #10

Updated by Vincent MEMBRÉ over 8 years ago

  • Target version changed from 308 to 3.1.14
Actions #11

Updated by Vincent MEMBRÉ about 8 years ago

  • Target version changed from 3.1.14 to 3.1.15
Actions #12

Updated by Vincent MEMBRÉ about 8 years ago

  • Target version changed from 3.1.15 to 3.1.16
Actions #13

Updated by Vincent MEMBRÉ about 8 years ago

  • Target version changed from 3.1.16 to 3.1.17
Actions #14

Updated by Vincent MEMBRÉ about 8 years ago

  • Target version changed from 3.1.17 to 3.1.18
Actions #15

Updated by Vincent MEMBRÉ almost 8 years ago

  • Target version changed from 3.1.18 to 3.1.19
Actions #16

Updated by Benoît PECCATTE over 7 years ago

  • Severity set to Major - prevents use of part of Rudder | no simple workaround
  • User visibility set to Operational - other Techniques | Technique editor | Rudder settings
  • Priority set to 52
Actions #17

Updated by Vincent MEMBRÉ over 7 years ago

  • Target version changed from 3.1.19 to 3.1.20
Actions #18

Updated by Jonathan CLARKE over 7 years ago

  • Status changed from Discussion to New
  • Assignee deleted (Jonathan CLARKE)
Actions #19

Updated by Benoît PECCATTE over 7 years ago

  • Target version changed from 3.1.20 to 4.2.0~beta1
  • Effort required set to Medium
  • Priority changed from 52 to 51

The solution using iptables seems to be the best one:

- Run 2 instances of cf-serverd on 2 new ports, 5311 and 5312
- Have an iptables rule redirecting to one of them: iptables -t nat -I PREROUTING -p tcp --dport 5309 -j REDIRECT --to-port 5311
- Call reload on both instances and switch the destination port at promise generation (see the sketch below)

We must also:
- check that iptables is present and netfilter is loaded at install time
- check that check-rudder-agent et al. support multiple cf-serverd processes (don't forget CFEngine Enterprise)

So I am setting the effort to medium.
And since it is touchy, targeting this for the next version.
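
As a rough sketch of the switching part only, a hypothetical post-generation hook could look like this (the state file, rule position, and ports are illustrative assumptions):

    #!/bin/sh
    # Hypothetical hook run after promise generation: flip new connections to
    # the cf-serverd instance that has just reloaded the new configuration.
    STATE=/var/rudder/run/cf-serverd.port
    CURRENT=$(cat "$STATE" 2>/dev/null || echo 5311)
    if [ "$CURRENT" = "5311" ]; then NEXT=5312; else NEXT=5311; fi

    # Replace the single REDIRECT rule (assumed to be rule 1 in nat/PREROUTING).
    iptables -t nat -R PREROUTING 1 -p tcp --dport 5309 -j REDIRECT --to-port "$NEXT" \
      && echo "$NEXT" > "$STATE"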

Actions #20

Updated by Alexis Mousset over 7 years ago

  • Target version changed from 4.2.0~beta1 to 4.2.0~beta2
  • Priority changed from 51 to 50
Actions #21

Updated by Vincent MEMBRÉ over 7 years ago

  • Target version changed from 4.2.0~beta2 to 4.2.0~beta3
Actions #22

Updated by Vincent MEMBRÉ over 7 years ago

  • Target version changed from 4.2.0~beta3 to 4.2.0~rc1
Actions #23

Updated by Vincent MEMBRÉ over 7 years ago

  • Target version changed from 4.2.0~rc1 to 4.2.0~rc2
Actions #24

Updated by Vincent MEMBRÉ over 7 years ago

  • Target version changed from 4.2.0~rc2 to 4.2.0
Actions #25

Updated by Vincent MEMBRÉ about 7 years ago

  • Target version changed from 4.2.0 to 4.2.1
Actions #26

Updated by Vincent MEMBRÉ about 7 years ago

  • Target version changed from 4.2.1 to 4.2.2
Actions #27

Updated by Vincent MEMBRÉ about 7 years ago

  • Target version changed from 4.2.2 to 4.2.3
  • Priority changed from 50 to 56
Actions #28

Updated by Vincent MEMBRÉ about 7 years ago

  • Target version changed from 4.2.3 to 4.2.4
Actions #29

Updated by Vincent MEMBRÉ almost 7 years ago

  • Target version changed from 4.2.4 to 4.2.5
Actions #30

Updated by Vincent MEMBRÉ over 6 years ago

  • Target version changed from 4.2.5 to 4.2.6
Actions #31

Updated by Vincent MEMBRÉ over 6 years ago

  • Target version changed from 4.2.6 to 4.2.7
Actions #32

Updated by Vincent MEMBRÉ over 6 years ago

  • Target version changed from 4.2.7 to 414
Actions #33

Updated by Vincent MEMBRÉ over 6 years ago

  • Target version changed from 414 to 4.3.4
Actions #34

Updated by Benoît PECCATTE over 6 years ago

  • Target version changed from 4.3.4 to 4.3.5
Actions #35

Updated by Vincent MEMBRÉ about 6 years ago

  • Target version changed from 4.3.5 to 4.3.6
Actions #36

Updated by Vincent MEMBRÉ about 6 years ago

  • Target version changed from 4.3.6 to 4.3.7
Actions #37

Updated by Vincent MEMBRÉ about 6 years ago

  • Target version changed from 4.3.7 to 4.3.8
  • Priority changed from 56 to 0
Actions #38

Updated by Vincent MEMBRÉ almost 6 years ago

  • Target version changed from 4.3.8 to 4.3.9
Actions #39

Updated by Alexis Mousset almost 6 years ago

  • Target version changed from 4.3.9 to 4.3.10
Actions #40

Updated by François ARMAND almost 6 years ago

  • Target version changed from 4.3.10 to 4.3.11
Actions #41

Updated by Vincent MEMBRÉ over 5 years ago

  • Target version changed from 4.3.11 to 4.3.12
Actions #42

Updated by Vincent MEMBRÉ over 5 years ago

  • Target version changed from 4.3.12 to 4.3.13
Actions #43

Updated by Vincent MEMBRÉ over 5 years ago

  • Target version changed from 4.3.13 to 4.3.14
Actions #44

Updated by Vincent MEMBRÉ over 5 years ago

  • Target version changed from 4.3.14 to 587
Actions #45

Updated by Vincent MEMBRÉ over 5 years ago

  • Target version changed from 587 to 4.3.14
Actions #46

Updated by Alexis Mousset over 5 years ago

  • Target version changed from 4.3.14 to 5.0.13
Actions #47

Updated by Vincent MEMBRÉ over 5 years ago

  • Target version changed from 5.0.13 to 5.0.14
Actions #48

Updated by Vincent MEMBRÉ about 5 years ago

  • Target version changed from 5.0.14 to 5.0.15
Actions #49

Updated by Vincent MEMBRÉ about 5 years ago

  • Target version changed from 5.0.15 to 5.0.16
Actions #50

Updated by Alexis Mousset almost 5 years ago

  • Target version changed from 5.0.16 to 5.0.17
Actions #51

Updated by Vincent MEMBRÉ over 4 years ago

  • Target version changed from 5.0.17 to 5.0.18
Actions #52

Updated by Vincent MEMBRÉ over 4 years ago

  • Target version changed from 5.0.18 to 5.0.19
Actions #53

Updated by Vincent MEMBRÉ over 4 years ago

  • Target version changed from 5.0.19 to 5.0.20
Actions #54

Updated by Alexis Mousset about 4 years ago

  • Target version changed from 5.0.20 to 6.2.0~beta1
Actions #55

Updated by Alexis Mousset about 4 years ago

Options are:

  • restart instead of reload, plus a two-step policy update in the agent
  • implement graceful restart properly in cf-serverd
  • switch between two instances using iptables
  • prevent new connections using iptables until the reload is done (with a timeout), as sketched below
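
A sketch of the last option (the port and the sleep-based timeout are illustrative, this is not an actual Rudder mechanism):

    # Drop new TCP connection attempts (SYN) to cf-serverd; established
    # connections are unaffected because only connection setup is matched.
    iptables -I INPUT -p tcp --dport 5309 --syn -j DROP
    # Give cf-serverd up to a minute to notice the new policies and reload.
    sleep 60
    # Accept new connections again.
    iptables -D INPUT -p tcp --dport 5309 --syn -j DROP
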
Actions #56

Updated by Janos Matya about 4 years ago

Is keeping cf-serverd an architectural decision? Does it bring benefits in the long term?
Would there be any long-term benefit in switching to some kind of HTTPS-based file serving?

Actions #57

Updated by François ARMAND about 4 years ago

We were discussing that idea not two weeks ago. It has the side bonus of allowing consistent handling of the CFEngine and DSC agents.

And it would allow us to clearly define our own protocol and its elements (for example: something different for remote run, with a clearly defined, limited set of commands, again identical for CFEngine/DSC).

Actions #58

Updated by Vincent MEMBRÉ about 4 years ago

  • Target version changed from 6.2.0~beta1 to 6.2.0~rc1
Actions #59

Updated by François ARMAND about 4 years ago

  • Target version deleted (6.2.0~rc1)
Actions #60

Updated by Benoît PECCATTE about 4 years ago

There is another solution: use systemd to do a graceful reload.
We can use systemd's socket activation to pass sockets to cf-serverd; this is easy to implement: https://github.com/puma/puma/blob/master/docs/systemd.md and https://insanity.industries/post/socket-activation-all-the-things/

From there, it should be possible to let systemd handle port opening and incoming connections, and let it pass those connections to the right cf-serverd instance.
There is a blog post by Lennart Poettering saying systemd can do this kind of graceful restart, but systemd waits for the process to fully stop before starting the new one, which fails to provide the feature.

However, there are workarounds to make it work: this post in French, https://vincent.bernat.ch/fr/blog/2018-systemd-golang-socket-activation, explains them; they all revolve around making systemd ignore the fact that the old process has not finished stopping.
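
As a minimal sketch of the socket-activation side (unit names and paths are assumptions, and it presumes cf-serverd can accept a listening socket handed over by systemd, which is what the upstream work mentioned below adds):

    # rudder-cf-serverd.socket
    [Unit]
    Description=Socket for Rudder cf-serverd

    [Socket]
    ListenStream=5309

    [Install]
    WantedBy=sockets.target

    # rudder-cf-serverd.service
    [Unit]
    Description=Rudder cf-serverd
    Requires=rudder-cf-serverd.socket

    [Service]
    # systemd keeps the listening socket open across restarts, so new
    # connections queue in the kernel instead of being refused while the
    # process is being replaced.
    ExecStart=/opt/rudder/bin/cf-serverd --no-fork

    [Install]
    WantedBy=multi-user.target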

Actions #61

Updated by Alexis Mousset almost 4 years ago

  • Status changed from New to In progress
  • Assignee set to Alexis Mousset
Actions #62

Updated by Alexis Mousset almost 4 years ago

To sum things up, what happens is:

  1. We add a new node to Rudder, we generate policies for it
  2. cf-serverd ACLs are updated to allow it to connect
  3. cf-serverd policies are updated (on root or relay), with these new ACLs
  4. Then the cf-serverd process is supposed to detect the configuration change within a minute and reload it. The problem is that cf-serverd cannot reload its config while connections are open (due to technical limitations that would probably be hard to overcome), so it does not even try to check whether a reload is needed when at least one node is connected. On moderately loaded relays it may eventually work, but since new connections are not prevented, the configuration reload can easily be skipped indefinitely.
    • A service reload (with SIGHUP) has exactly the same limit, and will be ignored if there are open connections.
    • A service restart fixes the problem, but will break existing connections, potentially leading to policy update and file copy errors on the connected nodes.

So we actually need two things:

  • A way to properly reload config.
    • It could use systemd socket activation, as we have systemd on all relays and root servers. In this case we would spawn a new cf-serverd with the new config when a configuration reload is required, and let the old process handle existing connections. We would need to ask it to terminate when all connections are closed (a feature that does not exist for now) to avoid processes piling up.
  • A way to detect a reload is needed, from outside of cf-serverd.
    • It is already done on root server using a policy generation hook.
    • On relays, we might need to rely on policy update repairs to trigger a reload through systemd (see the sketch below).
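
As a rough illustration of the triggering side (the hook body and unit name are assumptions, not the actual rudder-reload-cf-serverd script):

    #!/bin/sh
    # Hypothetical hook run when a reload is detected as necessary: ask systemd
    # to reload, or gracefully restart, cf-serverd so new ACLs are applied.
    exec systemctl reload-or-restart rudder-cf-serverd.service
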
Actions #63

Updated by Alexis Mousset almost 4 years ago

Benoît PECCATTE is working on implementing the systemd socket activation.

Upstream PR https://github.com/cfengine/core/pull/4499.

Actions #64

Updated by Vincent MEMBRÉ almost 4 years ago

  • Status changed from In progress to Pending release

The fix was done in subtickets #18889 and #18893; I'm keeping this one to track the changes, and since it has the longest history!

Actions #65

Updated by Vincent MEMBRÉ almost 4 years ago

  • Target version set to 6.1.10
Actions #66

Updated by Vincent MEMBRÉ almost 4 years ago

  • Status changed from Pending release to New
Actions #67

Updated by Benoît PECCATTE almost 4 years ago

  • Status changed from New to In progress
  • Assignee changed from Alexis Mousset to Benoît PECCATTE
Actions #68

Updated by Benoît PECCATTE almost 4 years ago

  • Status changed from In progress to Pending technical review
  • Assignee changed from Benoît PECCATTE to Alexis Mousset
  • Pull Request set to https://github.com/Normation/rudder-packages/pull/2424
Actions #69

Updated by Benoît PECCATTE almost 4 years ago

  • Status changed from Pending technical review to Pending release
Actions #70

Updated by Nicolas CHARLES almost 4 years ago

  • Fix check set to Checked
Actions #71

Updated by Vincent MEMBRÉ almost 4 years ago

  • Status changed from Pending release to Released

This bug has been fixed in Rudder 6.1.10 and 6.2.3, which were released today.
