KVM Watchdog setup
I had a Debian VM that, after migrating to KVM, would occasionally lock up.
I would just find a black screen when I looked at the console, and it would not react to anything but a hard reset.
In this case, there are a few things that need to be done:
- turn on netconsole
- turn off console blanking
- increase monitoring to find issues prior to the crash
- automatically restart
This article comes from the last step: if the VM is already in a hard hang, I'll have to reset it anyway, so I can automate that without any negative consequence.
I had to adjust my OpenNebula template so it passes down the right information. This is read by libvirt and then handed to qemu, which generates your "fake" PCI device. Sadly the Xen watchdog isn't supported in libvirt. The feature has existed longer than KVM has been around, but no one put in the effort to make it accessible. Well, what can you do... nothing :-) So let's move on with what we can do!
I started by making a testing template and creating / deleting my VM until it worked.
If your VM is already running, you will have to stop it. The reason is that the watchdog ends up as a virtual PCI device from KVM, which apparently cannot be hotplugged (while a real one could be hotplugged into the VM. Go figure.)
For debugging I kept looking at the VM's deployment files which contain the rendered XML.
Update your template to have a RAW kvm section:
Here's what I have in the 'advanced' view, which is friendlier to copy and paste:
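A RAW section along these lines should work. Treat it as a sketch under assumptions: the i6300esb model and the reset action are standard libvirt watchdog values, but they are picked here for illustration, so swap them if you want a different model or action.

```
RAW = [
  TYPE = "kvm",
  DATA = "<devices><watchdog model='i6300esb' action='reset'/></devices>" ]
```

In the rendered deployment XML this should show up as a watchdog element inside the devices section.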
Once deployed, if you use ps and look at the qemu process, you should find a watchdog argument now.
Try it like this:
And make sure you're on the right VM host :-)
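One way to do that check on the VM host is sketched below. It assumes libvirt passes the watchdog either as "-watchdog MODEL" or as "-device i6300esb,..." plus "-watchdog-action ACTION"; the helper name find_watchdog_args is made up for this example.

```shell
# Reduce qemu command lines (read from stdin) to the watchdog-related
# arguments, one per line. Other -device arguments are filtered out.
find_watchdog_args() {
    grep -E -o -- '-(watchdog-action|watchdog|device) [^ ]*' \
        | grep -i -E 'watchdog|i6300esb'
}

# The [q]emu pattern keeps grep from matching its own process entry.
ps -eo args | grep '[q]emu' | find_watchdog_args \
    || echo "no watchdog argument found"
```

If this prints nothing but the fallback message, the VM was most likely started from a template without the RAW section, or you are looking at the wrong host.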
The resulting device can be seen inside the guest in a few ways:
(Note you need a real lspci, not the busybox one)
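A small sketch of such checks from inside the guest follows. It assumes the device node is /dev/watchdog, which only appears once the driver is loaded, and that a full lspci names the device with the word "Watchdog" in its description; the function name is invented for this example.

```shell
# Report what the guest sees of the watchdog. Each check is guarded so a
# missing tool or device produces a message instead of an error.
show_watchdog_device() {
    if command -v lspci >/dev/null 2>&1; then
        lspci 2>/dev/null | grep -i 'watchdog' \
            || echo "lspci: no watchdog device listed"
    else
        echo "lspci: not installed"
    fi
    if [ -e /dev/watchdog ]; then
        ls -l /dev/watchdog
    else
        echo "/dev/watchdog: missing (driver not loaded yet?)"
    fi
}

show_watchdog_device
```

The busybox lspci only prints numeric IDs, so the grep on the description finds nothing there even when the device is present.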
Debian / Ubuntu
The daemon is called "watchdog" and comes with no watchdog devices configured. So it's somewhat safe to install.
This scriptlet should add the watchdog driver to /etc/modules and load it
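A scriptlet of that kind could look like the sketch below. The module name i6300esb is an assumption that matches the emulated Intel 6300ESB watchdog; the helper name is hypothetical. The helper only edits the file it is given, so it can be tried on a scratch file first.

```shell
# Append MODULE to FILE unless a line with exactly that name is already
# there, so running it twice does not create duplicates.
ensure_module_listed() {
    file=$1
    module=$2
    grep -qxF -- "$module" "$file" 2>/dev/null \
        || printf '%s\n' "$module" >> "$file"
}
```

As root on the guest this would be used as `ensure_module_listed /etc/modules i6300esb && modprobe i6300esb`.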
The next thing is to configure the watchdog daemon and restart it.
I've gone with a very minimal config:
The only thing I enabled was the device itself.
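A config matching that description comes down to a single active line in /etc/watchdog.conf, with everything else in the shipped file left commented out. The device path /dev/watchdog is the usual node and is assumed here.

```
# /etc/watchdog.conf
watchdog-device = /dev/watchdog
```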
To my understanding, that means that as long as the daemon is still running and talking to this device, there won't be a reset of the system.
If you kill -9 the watchdog daemon, it won't "log off" from the device, and within approx. 30 seconds the system will be reset.
I've seen there are more features (custom tests, automatic repairs) in this daemon, but I don't want to make use of most of them. The only extension I can think of is reacting if no process can be spawned or a file can't be written in /root. There are also file age tests, but those have inherent race conditions with NTP, should it need to step the clock by thousands of seconds because the system came up with a confused clock. Others of the self-tests can fail if you need to stop syslog or cron. That is fragile and not recommendable outside of fully embedded systems.
This is why I stuck with the most basic functionality: Reset the system if it hangs.
A very safe or 'defensive' implementation of "fork a process that creates and deletes a file in /root" would still be interesting.
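A rough sketch of what such a check could look like is below. It is not something I run in production. The function name and the temp-file prefix are made up, and hooking it into the daemon through the test-binary and test-timeout settings in watchdog.conf is an assumption about how one would wire it up.

```shell
# Hypothetical check: succeed only if a file can be created, written and
# removed in the given directory.
check_dir_writable() {
    dir=$1
    # mktemp runs as a child process, so this also proves that forking works.
    tmp=$(mktemp "$dir/.wd-check.XXXXXX" 2>/dev/null) || return 1
    if ! printf 'ok\n' > "$tmp"; then
        rm -f "$tmp"
        return 1
    fi
    # The function's result is whether the cleanup succeeded.
    rm -f "$tmp"
}
```

A script for the daemon would end with `check_dir_writable /root` so that its exit status becomes the test result. The check has no timeout of its own, so a hung disk would have to be caught by the daemon's timeout for test programs.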
RedHat / Oracle / CentOS
Not yet tried, but it should be completely identical.
A link to their docs is below.
Alpine apparently has no "watchdog" binary.
I'm still trying to find out more.
No idea as of now.
A quick round of googling turned up the "ichwd" driver, which seems to be newer.
Link is below.
I have not really understood the "nowayout" driver feature. My system properly resets if I kill -9 the daemon, so I don't know what the option is supposed to do.
If you need it, you can configure it using /etc/modprobe.d.
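As far as the kernel watchdog documentation describes it, nowayout only matters when the daemon stops cleanly. Normally a daemon that exits cleanly disarms the timer. With nowayout set, the timer keeps running once it has been started, so even a clean stop ends in a reset. A kill -9 never disarms it, which would explain why the reset happens either way. A modprobe.d entry could look like this, assuming the i6300esb module is the one in use:

```
# /etc/modprobe.d/i6300esb.conf (the file name is arbitrary)
options i6300esb nowayout=1
```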
Windows guests aren't supported: https://bugzilla.redhat.com/show_bug.cgi?id=610063 (because the RH devs don't even consider compiling Windows code on Windows...)