KVM Watchdog setup

 

I had a Debian VM that, after migrating to KVM, would occasionally lock up.

I would just find a black screen if I looked at the console, and it would not react to anything but a hard reset.

In a case like this, there are a few things that need to be done:

  • turn on netconsole
  • turn off console blanking
  • increase monitoring to find issues prior to the crash
  • automatically restart
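
For the first two items, here is a minimal sketch of the kernel command-line parameters involved. The ports, IP addresses and interface name are placeholders I made up, not values from my setup; the parameter format is the one described in the kernel's netconsole documentation.

```sh
#!/bin/sh
# Build a netconsole= kernel parameter of the form
#   netconsole=SRCPORT@SRCIP/DEV,DSTPORT@DSTIP/DSTMAC
# The target MAC is left empty here, so the kernel falls back to broadcast.
netconsole_param() {
  # $1 source port, $2 source IP, $3 interface, $4 target port, $5 target IP
  printf 'netconsole=%s@%s/%s,%s@%s/\n' "$1" "$2" "$3" "$4" "$5"
}

# Hypothetical example values; consoleblank=0 turns off console blanking.
echo "$(netconsole_param 6665 192.0.2.10 eth0 6666 192.0.2.1) consoleblank=0"
```

On Debian the resulting string would go onto the guest's kernel command line, for example via GRUB_CMDLINE_LINUX in /etc/default/grub followed by update-grub.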

 

This article covers the last step: if the VM is already in a hard hang, I'll have to reset it anyway, so I can automate that without any negative consequences.

I had to adjust my OpenNebula template so it passes down the right information. libvirt reads this and hands it to qemu, which then generates your "fake" PCI device. Sadly the Xen watchdog isn't supported in libvirt: the feature has existed for longer than KVM has been around, but no one put in the effort to make it accessible. Well, what can you do... nothing :-) So let's move on with what we can do!

 

OpenNebula Config

I started by making a testing template and creating / deleting my VM until it worked.

If your VM is already running, you will have to stop it. The reason is that the watchdog ends up as a virtual PCI device in KVM, which apparently cannot be hotplugged (while a real one could be hotplugged into the VM. Go figure.)

For debugging I kept looking at the VM's deployment files which contain the rendered XML.

Update your template to have a RAW kvm section.

Here's what I have in the 'advanced' view, which is more friendly to copy-paste:

MEMORY="1024"
RAW=[TYPE="kvm",DATA="<devices> <watchdog model='i6300esb' action='reset'>   <address type='pci' domain='0x0000' bus='0x00' slot='0x06' function='0x0'/> </watchdog> </devices>"]
GRAPHICS=[LISTEN="0.0.0.0",TYPE="VNC"]
DISK=[IMAGE="zzzz",CACHE="unsafe",IMAGE_UNAME="administrator"]
OS=[ARCH="x86_64"]
NIC=[NETWORK_UNAME="administrator",NETWORK="zzzz"]
CPU="0.25"
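
To confirm that the RAW section made it into the rendered XML, a check along these lines can help. It is only a sketch: the path assumes a default OpenNebula frontend install where the deployment files live under /var/lib/one/vms/VMID/, and the VM id 42 is a placeholder.

```sh
#!/bin/sh
# Succeeds if the given rendered deployment file contains a watchdog element.
has_watchdog() {
  grep -q '<watchdog ' "$1"
}

# Assumed default location of the deployment files; VMID is a placeholder.
VMID=42
for f in /var/lib/one/vms/"$VMID"/deployment.*; do
  [ -e "$f" ] || continue   # the glob did not match any file
  if has_watchdog "$f"; then
    echo "$f: watchdog present"
  else
    echo "$f: no watchdog"
  fi
done
```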

 

Once the VM is deployed, the command line of its qemu process should contain a watchdog argument, which you can check with ps.

Try it like this:

ps -ef | grep one-VMNUMBER | grep watchdog

And make sure you're on the right VM host :-)
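
A slightly stricter variant of the same check, as a sketch: it avoids one-12 also matching one-123 on a busy host. The function name and the VM id are made up for illustration.

```sh
#!/bin/sh
# Reads a process listing on stdin and prints only the lines that belong to
# the given OpenNebula VM id and mention a watchdog.
# The ([^0-9]|$) part keeps one-12 from also matching one-123.
filter_watchdog() {
  grep -E "one-$1([^0-9]|\$)" | grep -i watchdog
}

# On the VM host (42 is a placeholder id):
#   ps -ef | filter_watchdog 42
```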

 

OS Setup

Inside the guest, the resulting device can be seen in a few ways:

# lspci | grep Watch
00:06.0 System peripheral: Intel Corporation 6300ESB Watchdog Timer
# dmesg | grep 6300ESB
[    1.186404] i6300ESB timer: Intel 6300ESB WatchDog Timer Driver v0.05
[    1.186604] i6300ESB timer: initialized (0xffffc900004d6000). heartbeat=30 sec (nowayout=0)

(Note: you need a real lspci, not the busybox one.)
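
If no real lspci is at hand, the PCI sysfs tree can be searched directly. This is a sketch under the assumption that the emulated device reports vendor 0x8086 and device 0x25ab; verify those IDs against your own guest before relying on it.

```sh
#!/bin/sh
# Looks for the emulated 6300ESB in a PCI sysfs tree without needing lspci.
# Assumption: the device reports vendor 0x8086 and device 0x25ab.
find_i6300esb() {
  root=${1:-/sys/bus/pci/devices}
  for dev in "$root"/*; do
    [ -r "$dev/vendor" ] && [ -r "$dev/device" ] || continue
    if [ "$(cat "$dev/vendor")" = "0x8086" ] && [ "$(cat "$dev/device")" = "0x25ab" ]; then
      echo "${dev##*/}"   # print the PCI address, i.e. the directory name
      return 0
    fi
  done
  return 1
}

find_i6300esb || echo "no 6300ESB watchdog found" >&2
```

On a match it prints the PCI address of the device, which should correspond to the slot set in the template.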

 

Debian / Ubuntu

 

The daemon is called "watchdog" and comes with no watchdog devices configured, so it's reasonably safe to install.

# apt-get install watchdog

 

This scriptlet should add the watchdog driver to /etc/modules and load it

#!/bin/sh
if ! grep -q i6300esb /etc/modules ; then
  echo i6300esb >> /etc/modules
fi
if ! lsmod | grep -q i6300esb ; then
  modprobe i6300esb
fi

 

Next thing is to configure the watchdog and restart it.

I've gone with a very minimal config:

# grep -h -v -e ^$ -e ^# /etc/watchdog.conf 
watchdog-device	= /dev/watchdog
realtime		= yes
priority		= 1
# service watchdog restart

The only thing I enabled was the device itself.

To my understanding, that means as long as the daemon is still running and keeps pinging the device, there won't be a reset of the system.

If you kill -9 the watchdog daemon, it won't "log off" from the device, and within approx. 30 seconds the system will be reset.

I've seen there are more features (custom tests, automatic repairs) in this daemon, but I don't want to make use of most of them. The only extension I can think of is reacting if no process can be spawned or a file can't be written in /root. There are also file age tests, but those have inherent race conditions with NTP should it need to step the clock by 1000s of seconds because the system came up with a confused clock. Others of the self-tests can fail if you need to stop syslog or cron. Fragile, and not something I would recommend outside of fully embedded systems.

This is why I stuck with the most basic functionality: Reset the system if it hangs.

 

A very safe or 'defensive' implementation of "fork a process that creates and deletes a file in /root" would still be interesting.
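
A rough sketch of what such a check could look like; it is not something I run in production. It assumes the daemon's test-binary option in /etc/watchdog.conf, which runs an external program and treats a non-zero exit code as a failure, so the daemon itself takes care of the forking. The temp file name pattern is made up.

```sh
#!/bin/sh
# Sketch of a check for the watchdog daemon's test-binary option.
# Returns 0 if a file can be created, written and removed in the directory.
check_writable() {
  dir=$1
  # mktemp creates the file itself, so a failure here already means
  # "cannot create a file". The name pattern is my own invention.
  tmp=$(mktemp "$dir/.wd-selftest.XXXXXX" 2>/dev/null) || return 1
  if ! echo probe > "$tmp" 2>/dev/null; then
    rm -f "$tmp"
    return 1
  fi
  rm -f "$tmp" || return 1
  return 0
}

# As an actual test binary the script would end with:
#   check_writable /root
#   exit $?
```

It would be hooked in with a test-binary line pointing at the script, ideally together with a test-timeout. Keep in mind that a full disk would also make this check fail and cause a reset, so "defensive" is relative here.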

 

RedHat / Oracle / CentOS

Not yet tried, but it should be completely identical.

A link to their docs is below.

 

Alpine

Alpine apparently has no "watchdog" binary.

I'm still trying to find out more.

 

FreeBSD

No idea as of now.

A quick round of googling turned up the "ichwd" driver, which seems to be newer.

Link is below.

 

 

Notes

nowayout

I have not really understood the "nowayout" driver feature. My system properly resets if I kill -9 the daemon, so I don't know what the option is supposed to add.

If you need it, you can configure it using /etc/modprobe.d.
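
For reference, a sketch of how that could look. As far as I can tell from the kernel's watchdog documentation, nowayout=1 means the timer cannot be stopped again once the device has been opened, not even by a daemon that shuts down cleanly; with nowayout=0, as in the dmesg output above, a clean stop disarms the timer and only an unclean death such as kill -9 leads to a reset. The file name i6300esb.conf is my own choice.

```sh
#!/bin/sh
# Writes a modprobe options line for the i6300esb module to the given file.
# $1 = target file, $2 = nowayout value (0 or 1)
write_nowayout_conf() {
  printf 'options i6300esb nowayout=%s\n' "$2" > "$1"
}

# On the guest, as root (the file name is an assumption):
#   write_nowayout_conf /etc/modprobe.d/i6300esb.conf 1
# The setting takes effect the next time the module is loaded.
```

Running modinfo -p i6300esb lists the parameters the module actually accepts, which is worth checking first.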

 

 

Links

http://www.sat.dundee.ac.uk/psc/watchdog/watchdog-testing.html (Debian/Ubuntu)

https://github.com/miniwark/miniwark-howtos/wiki/Hardware-Watchdog-Timer-setup-on-Ubuntu-12.04 (Debian/Ubuntu)

Windows guests aren't supported: https://bugzilla.redhat.com/show_bug.cgi?id=610063 (because the RH devs don't even consider compiling Windows code on Windows...)

https://access.redhat.com/documentation/en-US/Red_Hat_Enterprise_Linux/6/html/Virtualization_Administration_Guide/section-libvirt-dom-xml-watchdog.html (RedHat)

https://koitsu.wordpress.com/2010/07/13/freebsd-and-hardwaresoftware-watchdogs/ (FreeBSD)