Bug #4717
Closed: Document how to solve the huge IO wait problem leading to random "NoAnswer" states
Description
There was a huge problem on Ubuntu nodes: at random, one or more nodes were in the "NoAnswer" state. Removing all tcdb files and manually running cf-agent -KI solved the problem temporarily. We also had huge IO wait on our storage, which affected the storage itself and all our VMs. Eventually I found out that CFEngine (more precisely, its Tokyo Cabinet database backend) was the cause of all our IO problems, everywhere in our network.
Some machines that were doing nothing at all (yet) still had huge IO waits. iotop showed that cf-agent was the only process writing to the file system. After Kegeruneku pointed me to http://blog.normation.com/en/2013/09/09/speed-up-your-cfengine-by-using-a-ram-disk/ and I implemented that through Rudder, all the problems seem to be gone.
Please advise everyone to add the following to /etc/fstab, especially on Ubuntu 12.04 LTS (Precise). You should add this to the Rudder documentation in bold. However, it only applies to CFEngine versions that use Tokyo Cabinet.
# Tmpfs for the CFEngine state backend storage directory
tmpfs /var/rudder/cfengine-community/state tmpfs size=128M,nr_inodes=2k,mode=0755,noexec,nosuid,noatime,nodiratime 0 0
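Adding the line by hand works. On many nodes it may be handier to script it so that rerunning does not duplicate the entry. The following is a minimal POSIX sh sketch, assuming the Rudder state path shown above. The helper name add_cfengine_state_tmpfs is hypothetical (it is not part of Rudder or CFEngine), and it only edits the fstab file it is given; it does not mount anything.

```shell
#!/bin/sh
# Sketch only: idempotently append the CFEngine state tmpfs entry to an fstab file.
# add_cfengine_state_tmpfs is a hypothetical helper, not a Rudder/CFEngine command.
# The fstab path is a parameter (default /etc/fstab) so it can be tried on a copy first.

STATE_DIR=/var/rudder/cfengine-community/state
ENTRY="tmpfs $STATE_DIR tmpfs size=128M,nr_inodes=2k,mode=0755,noexec,nosuid,noatime,nodiratime 0 0"

add_cfengine_state_tmpfs() {
    fstab="${1:-/etc/fstab}"
    # Skip if a non-comment line already uses the state directory as its mount point (field 2)
    if [ -f "$fstab" ] && awk -v dir="$STATE_DIR" \
        '$1 !~ /^#/ && $2 == dir { found = 1 } END { exit !found }' "$fstab"; then
        echo "entry for $STATE_DIR already present in $fstab"
        return 0
    fi
    # Append the comment line and the mount entry
    printf '%s\n%s\n' "# Tmpfs for the CFEngine state backend storage directory" "$ENTRY" >> "$fstab"
    echo "added tmpfs entry for $STATE_DIR to $fstab"
}
```

For example, run add_cfengine_state_tmpfs /tmp/fstab.test against a copy first, then add_cfengine_state_tmpfs /etc/fstab as root, followed by mount /var/rudder/cfengine-community/state. Keep in mind that tmpfs is volatile: the databases in this directory start empty after every reboot, and mounting over a non-empty directory hides the files already there until it is unmounted.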