Bug #8351
closed
After the promises generation, cf-serverd config may not be reloaded, preventing new nodes from connecting
Added by Nicolas CHARLES about 9 years ago.
Updated about 3 years ago.
Category:
Server components
Severity:
Major - prevents use of part of Rudder | no simple workaround
User visibility:
Operational - other Techniques | Technique editor | Rudder settings
Description
When Rudder regenerates promises, it executes rudder-reload-cf-serverd as a post-hook to force a reload of cf-serverd.
In some cases, this may not do anything.
Coredumb reported with Rudder 3.2, on CentOS 7, that it fails over time (it works for 3 weeks to 1 month, then fails to update the cf-serverd configuration).
Running rudder server debug for a new node allows it to access its data, and restarting cf-serverd also lets it fetch its data.
This happens on 3.2, but most likely on previous versions as well.
- Assignee set to Alexis Mousset
Alexis,
You're the most familiar with this code - could you take a look?
- Status changed from New to Discussion
The configuration reload:
- May take up to 1 minute on lightly loaded servers
- Does not occur until all threads are idle, which may never happen on loaded servers
On the other hand, a stop/start would break current connections, resulting in failed updates and file copies (and even worse with old agents that do not handle network errors correctly).
- Target version changed from 2.11.21 to 2.11.22
An ideal solution here would be to have a proxy in front of one or two cf-serverd processes that can phase out an existing cf-serverd when we know it's running an old config, and direct all new connections to the new cf-serverd.
haproxy?
- Assignee changed from Alexis Mousset to Jonathan CLARKE
haproxy is a little bit overkill, don't you think?
You could use iptables rules and have them simply redirect to a newer instance after doing the update.
That would also keep existing connections alive, since only new incoming connections would be affected.
Just make sure cf-serverd listens on localhost (bindtointerface), so as not to open additional ports.
With this change, you would also have to move the port information (port => "&COMMUNITYPORT&";) out of cf-serverd.st, and the external script would need to manage the ports of the running cf-serverd instances, since you could have n instances running, each still serving open connections. How do you know a cf-serverd no longer has any clients and can be killed? How do you check the number of processes that are allowed to run? So many questions when you go from 1 to n... :)
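The redirect idea above could be sketched roughly as follows. This is a hypothetical illustration, not existing Rudder tooling: the RUDDER_CFSERVED chain name and the instance port are assumptions, and the script is a dry run by default (it prints the iptables commands instead of executing them).

```shell
#!/bin/sh
# Sketch of the iptables redirect idea above. The RUDDER_CFSERVED chain
# name and the instance port are hypothetical. Dry-run by default: set
# IPTABLES=iptables (as root) to actually apply the rules.
IPTABLES="${IPTABLES:-echo iptables}"

redirect_to() {
    instance_port="$1"
    # REDIRECT is only valid in the nat table (PREROUTING/OUTPUT chains),
    # so we keep our rule in a dedicated chain hooked into nat PREROUTING.
    $IPTABLES -t nat -N RUDDER_CFSERVED 2>/dev/null
    $IPTABLES -t nat -F RUDDER_CFSERVED
    $IPTABLES -t nat -A RUDDER_CFSERVED -p tcp --dport 5309 \
        -j REDIRECT --to-ports "$instance_port"
}

redirect_to 5311
```

Because REDIRECT only rewrites new connections (established ones keep their conntrack NAT mapping), switching the rule would indeed leave existing transfers untouched, as suggested above.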
- Target version changed from 2.11.22 to 2.11.23
- Target version changed from 2.11.23 to 2.11.24
- Target version changed from 2.11.24 to 308
- Target version changed from 308 to 3.1.14
- Target version changed from 3.1.14 to 3.1.15
- Target version changed from 3.1.15 to 3.1.16
- Target version changed from 3.1.16 to 3.1.17
- Target version changed from 3.1.17 to 3.1.18
- Target version changed from 3.1.18 to 3.1.19
- Severity set to Major - prevents use of part of Rudder | no simple workaround
- User visibility set to Operational - other Techniques | Technique editor | Rudder settings
- Priority set to 52
- Target version changed from 3.1.19 to 3.1.20
- Status changed from Discussion to New
- Assignee deleted (Jonathan CLARKE)
- Target version changed from 3.1.20 to 4.2.0~beta1
- Effort required set to Medium
- Priority changed from 52 to 51
Using iptables seems to be the best solution:
- Run 2 instances of cf-serverd on 2 new ports, 5311 and 5312
- Have an iptables rule redirect to one of them: iptables -t nat -I PREROUTING -p tcp --dport 5309 -j REDIRECT --to-ports 5311
- Call reload on both instances and switch the destination port on promise generation
We must also:
- check that iptables is present and netfilter loaded at install time
- check that check-rudder-agent et al. support more than 2 cf-serverd processes (don't forget CFEngine Enterprise)
So setting the effort to medium.
And since it is touchy, targeting the next version.
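A generation-time switch hook along these lines could look roughly like the following sketch. Everything here (state file path, chain name, the reload placeholder) is an assumption for illustration, and the script is a dry run by default.

```shell
#!/bin/sh
# Hypothetical sketch of the promise-generation hook described above:
# alternate the 5309 redirect between two cf-serverd instances listening
# on 5311 and 5312. State file and chain name are illustrative.
STATE="${STATE:-/tmp/rudder-cfserved-port}"
IPTABLES="${IPTABLES:-echo iptables}"   # dry-run; set IPTABLES=iptables to apply

current="$(cat "$STATE" 2>/dev/null || echo 5311)"
if [ "$current" = "5311" ]; then next=5312; else next=5311; fi

# Reload both instances so whichever becomes active serves the new ACLs
# (the actual reload command is a placeholder here).
echo "reloading cf-serverd instances on 5311 and 5312 (placeholder)"

# Point new connections at the freshly reloaded instance. Established
# connections keep their existing NAT mapping, so they are not broken.
$IPTABLES -t nat -R RUDDER_CFSERVED 1 -p tcp --dport 5309 \
    -j REDIRECT --to-ports "$next"

echo "$next" > "$STATE"
```

The state file is what lets consecutive generations alternate between the two instances, which is the "switch the destination port on promise generation" step above.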
- Target version changed from 4.2.0~beta1 to 4.2.0~beta2
- Priority changed from 51 to 50
- Target version changed from 4.2.0~beta2 to 4.2.0~beta3
- Target version changed from 4.2.0~beta3 to 4.2.0~rc1
- Target version changed from 4.2.0~rc1 to 4.2.0~rc2
- Target version changed from 4.2.0~rc2 to 4.2.0
- Target version changed from 4.2.0 to 4.2.1
- Target version changed from 4.2.1 to 4.2.2
- Target version changed from 4.2.2 to 4.2.3
- Priority changed from 50 to 56
- Target version changed from 4.2.3 to 4.2.4
- Target version changed from 4.2.4 to 4.2.5
- Target version changed from 4.2.5 to 4.2.6
- Target version changed from 4.2.6 to 4.2.7
- Target version changed from 4.2.7 to 414
- Target version changed from 414 to 4.3.4
- Target version changed from 4.3.4 to 4.3.5
- Target version changed from 4.3.5 to 4.3.6
- Target version changed from 4.3.6 to 4.3.7
- Target version changed from 4.3.7 to 4.3.8
- Priority changed from 56 to 0
- Target version changed from 4.3.8 to 4.3.9
- Target version changed from 4.3.9 to 4.3.10
- Target version changed from 4.3.10 to 4.3.11
- Target version changed from 4.3.11 to 4.3.12
- Target version changed from 4.3.12 to 4.3.13
- Target version changed from 4.3.13 to 4.3.14
- Target version changed from 4.3.14 to 587
- Target version changed from 587 to 4.3.14
- Target version changed from 4.3.14 to 5.0.13
- Target version changed from 5.0.13 to 5.0.14
- Target version changed from 5.0.14 to 5.0.15
- Target version changed from 5.0.15 to 5.0.16
- Target version changed from 5.0.16 to 5.0.17
- Target version changed from 5.0.17 to 5.0.18
- Target version changed from 5.0.18 to 5.0.19
- Target version changed from 5.0.19 to 5.0.20
- Target version changed from 5.0.20 to 6.2.0~beta1
Options are:
- restart instead of reload, plus a two-step policy update in the agent
- implement graceful restart properly in cf-serverd
- switch between two instances using iptables
- prevent new connections using iptables until the reload is done (with a timeout)
Is keeping cf-serverd an architectural decision? Does it benefit us in the long term?
Are there any long-term benefits to switching to some kind of HTTPS-based file serving?
We were discussing that idea not two weeks ago. It has the side bonus of allowing consistent handling of the CFEngine and DSC agents.
And it would let us clearly define our own protocol and its elements (for example: something different for remote run, with clearly defined, limited commands, again identical for CFEngine/DSC).
- Target version changed from 6.2.0~beta1 to 6.2.0~rc1
- Target version deleted (6.2.0~rc1)
- Status changed from New to In progress
- Assignee set to Alexis Mousset
To sum things up, what happens is:
- We add a new node to Rudder and generate policies for it. cf-serverd ACLs are updated to allow it to connect, and cf-serverd policies are updated (on the root server or relay) with these new ACLs.
- Then the cf-serverd process is supposed to detect the configuration change within a minute and reload it. The problem is that cf-serverd cannot reload its config while there are open connections (due to technical limitations that would probably be hard to overcome), so it does not even try to check whether a reload is needed when at least one node is connected. On moderately loaded relays it may eventually work, but as new connections are not prevented, the configuration reload can easily be skipped indefinitely.
- A service reload (with SIGHUP) has exactly the same limitation, and will be ignored if there are open connections.
- A service restart fixes the problem, but will break existing connections, potentially leading to policy update and file copy errors on the connected nodes.
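This "reload never happens" failure mode can at least be detected from outside the daemon with a simple freshness check. A minimal sketch, assuming the reload hook touches a marker file on each successful reload (an assumption for illustration, not existing Rudder behavior):

```shell
#!/bin/sh
# Minimal staleness check: has the policy file changed since the last
# recorded reload? Paths are illustrative; the marker file is assumed
# to be touched by the reload hook.

needs_reload() {
    policy="$1"; marker="$2"
    # A policy newer than the marker (or no marker at all) means a
    # reload is due.
    [ ! -e "$marker" ] || [ "$policy" -nt "$marker" ]
}

# Demo with temporary files:
demo=$(mktemp -d)
touch "$demo/marker"
sleep 1
touch "$demo/promises.cf"

if needs_reload "$demo/promises.cf" "$demo/marker"; then
    echo "reload needed"        # prints "reload needed"
else
    echo "config up to date"
fi
```

A check like this could run from cron or a policy hook and escalate (e.g. force a restart) when the config has been stale for too long.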
So we actually need two things:
- A way to properly reload the config.
  - It could use systemd socket activation, as we have systemd on all relays and root servers. In this case we would spawn a new cf-serverd with the new config when a reload is required, and let the old process handle existing connections. We would need to ask it to terminate once all its connections are closed (a feature that does not exist for now) to avoid piling up processes.
- A way to detect that a reload is needed, from outside cf-serverd.
  - This is already done on the root server, using a policy generation hook.
  - On relays, we might need to rely on policy update repairs to trigger a reload via systemd.
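The socket-activation idea could look roughly like the following unit files. This is only a sketch: unit names, paths and options are hypothetical, and cf-serverd would also need to accept an inherited listening socket (e.g. via sd_listen_fds(3)), which it does not support today.

```ini
# rudder-cf-serverd.socket (hypothetical)
[Unit]
Description=cf-serverd listening socket

[Socket]
ListenStream=5309

[Install]
WantedBy=sockets.target

# rudder-cf-serverd.service (hypothetical)
[Unit]
Description=cf-serverd (socket-activated)

[Service]
ExecStart=/opt/rudder/bin/cf-serverd --no-fork
```

Because systemd keeps the socket open across service restarts, new nodes connecting during a restart would be queued rather than refused; the old process draining its existing connections would still need the self-termination behavior described above.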
- Status changed from In progress to Pending release
The fix was done in sub-tickets #18889 and #18893; I'm keeping this one to track the changes, since it has the longest history!
- Target version set to 6.1.10
- Status changed from Pending release to New
- Status changed from New to In progress
- Assignee changed from Alexis Mousset to Benoît PECCATTE
- Status changed from In progress to Pending technical review
- Assignee changed from Benoît PECCATTE to Alexis Mousset
- Pull Request set to https://github.com/Normation/rudder-packages/pull/2424
- Status changed from Pending technical review to Pending release
- Status changed from Pending release to Released
This bug has been fixed in Rudder 6.1.10 and 6.2.3 which were released today.