A list of helpful checks to run prior to a system reboot:
System bootup settings:
- Will it boot to the same runlevel as it is currently running in? (check /etc/inittab)
- BSD people - what is the current securelevel?
- Are there any special settings in /etc/sysctl.conf?
- Do you know of any missing settings? (e.g. is IP forwarding enabled at runtime but not set in sysctl.conf?)
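As a minimal sketch of that last check (assuming a Linux box and simple "key = value" lines in sysctl.conf; the helper name is made up):

```shell
# Hypothetical helper (name made up): read the value a sysctl.conf-style file
# will set for a key, so it can be compared against the running value.
# Assumes simple "key = value" lines, as in a stock /etc/sysctl.conf.
sysctl_conf_value() {
    key=$1; conf=$2
    awk -F= -v k="$key" '
        $0 !~ /^[ \t]*#/ {
            gsub(/[ \t]/, "", $1)
            if ($1 == k) { gsub(/[ \t]/, "", $2); print $2 }
        }' "$conf"
}

# Live use on Linux:
# running=$(sysctl -n net.ipv4.ip_forward)
# configured=$(sysctl_conf_value net.ipv4.ip_forward /etc/sysctl.conf)
# [ "$running" = "$configured" ] || echo "ip_forward differs: running=$running conf=$configured"
```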
Amount of Kernel happiness:
- Will it boot the same kernel it is currently running? (check e.g. /boot/grub.conf or /boot/loader.conf)
- Does the KERNEL still exist in /boot?
- Is there a backup kernel?
- What is the *age* of the kernel and backup kernel?
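A small sketch of these kernel checks (assumes a Linux layout where kernels live in /boot as vmlinuz-<release>; the helper name is made up):

```shell
# Sketch of the kernel sanity checks above (assumes a Linux layout where
# kernels live in /boot as vmlinuz-<release>; the helper name is made up).
check_kernel() {
    img="/boot/vmlinuz-$1"
    if [ -e "$img" ]; then
        # print the image with its details so you can judge its age
        echo "OK: $img"
        ls -l "$img"
    else
        echo "MISSING: $img"
        return 1
    fi
}

# Live use:
# check_kernel "$(uname -r)"      # the running kernel
# ls -lt /boot/vmlinuz-*          # all kernels, newest first - is there a backup?
```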
Check dmesg, looking for any messages and/or errors that you don't expect.
If you find one, the next step is to identify its date of occurrence via the system logfiles. You can then skip back to the logs from when the system booted.
Identify whether the error occurred
- right after boot but not later on (though uncommon for errors that make it into dmesg)
Linux users: Linux error messages do not reliably end up in one place. What's printed as an alert on the local console may not be in the logs, and what's reported to your application (e.g. filesystem full) will not be visible anywhere but in the application log.
keep this in your history, kiddo!
Are any processes clogging up hundreds of minutes of CPU time? Are they NOT the big monster databases the server is intended to run? Almost every tool can go berserk at times - have a look for these long-running broken processes.
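One way to hunt for such processes is to filter `ps` output by accumulated CPU time; this hypothetical helper (name made up) parses the [[dd-]hh:]mm:ss cputime format:

```shell
# Hypothetical filter (name made up): read `ps -eo cputime,pid,user,comm`
# output and print processes with more than N minutes of accumulated CPU time.
# ps prints cputime as [[dd-]hh:]mm:ss.
cpu_hogs() {
    awk -v min="$1" '
        NR == 1 { next }                        # skip the ps header line
        {
            t = $1; days = 0
            if (t ~ /-/) { split(t, d, "-"); days = d[1]; t = d[2] }
            n = split(t, p, ":")
            if (n == 3) m = days * 1440 + p[1] * 60 + p[2]
            else        m = days * 1440 + p[1]
            if (m > min) print
        }'
}

# Live use: ps -eo cputime,pid,user,comm | cpu_hogs 600
```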
Do you see any backup helper scripts that have been running for longer than the backup interval? They'll surely go away now. But make sure there has been a successful backup after the hung one! 99% of the time it's just something on the backup server or a network error, but there is a correlation between hung backups and deeply corrupted filesystems.
You need to find out if you have any
- corrupt local filesystems
- SAN multipathing issues
- Missing NFS mounts
This is checked in two passes, so that you won't try something that will hang anyway.
only using readonly commands
compare /etc/fstab and /proc/mounts
Identify filesystems that are temporarily mounted, but not recorded in the fstab.
Identify mountpoints that are part of a cluster resource group
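The fstab-versus-mounts comparison can be done with read-only commands only; this hypothetical helper (name made up) prints mountpoints recorded in an fstab-style file but absent from a mounts-style file:

```shell
# Hypothetical helper: print mountpoints listed in an fstab-style file that do
# not appear in a /proc/mounts-style file. Read-only; nothing gets mounted.
# (swap/none entries will show up too; filter them out as needed)
missing_mounts() {
    fstab_file=$1; mounts_file=$2
    awk '$1 !~ /^#/ && NF >= 2 { print $2 }' "$fstab_file" |
    while read -r mp; do
        grep -q "[[:space:]]$mp[[:space:]]" "$mounts_file" || echo "$mp"
    done
}

# Live use on Linux:
# missing_mounts /etc/fstab /proc/mounts
```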
if this shows no issues
actually testing availability
- mount -a
- automount -v (pick your OS version of this)
- ls /some/nfs/mount /nfs/auto/mount
No real need to test for blocklayer problems here.
They might exist (e.g. xm block-detach -f is not visible from within a VM) but probing for them would have only adverse effects.
Filesystem usage levels:
Different "levels" apply for data, application and OS filesystems
/ should have a usage under 60-70% on properly maintained systems.
AIX users: Never, ever reboot at 100%. Fix first.
Application and system core dumps can end up here on boot; this can cause various issues with less resilient parts of the system. The same goes for /var and /tmp.
So /tmp should be around 2-10% during standard usage and should be under 80% before a reboot.
A /var filesystem is happiest at up to around 80% usage. The same goes for all optional filesystems beyond /var,
such as /usr or /opt: Unless a very moronic coder is involved, these will not see substantial growth except during application installs, and you can plan for a safe ride up to 94-95%.
Some apps will not split their install between /opt/<vendor> and /var/opt/<vendor>. These may well be writing temporary, state, or log data to their install directory, but you could quite safely assume that "what is in will stay in", meaning there may be changes to files on a reboot, but these will probably not result in much growth. Still, definitely track applications like that and identify their log files, as this is the main cause for concern: log file spamming.
Generally there is no need to worry about these when running a cleaner OS such as a commercial Unix or *BSD; on Linux you might well add some safety margin.
If you're using Nagios with check_mk you can make use of the "trend" analysis option to track and alert based on filesystem growth rate (i.e. MBytes per month)!
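The usage levels above can be eyeballed with a small threshold check over `df -P` output (POSIX df format assumed; the helper name is made up, and the thresholds are the rules of thumb from this list, not hard limits):

```shell
# Hypothetical helper (name made up): reads `df -P` output on stdin and prints
# mountpoints above the given usage percentage.
fs_over() {
    awk -v lim="$1" 'NR > 1 { sub(/%/, "", $5); if ($5 + 0 > lim) print $6, $5 "%" }'
}

# Live use:
# df -P | fs_over 80      # anything over 80% deserves a look before rebooting
```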
User data filesystems:
0-95%: not too important in the grand scheme of things - what was full before the reboot can be full after the reboot
Watch out for filesystems that are at 99%; they might become full at the final sync/flush/close.
This is especially true with extent-based filesystems that weren't tuned for the data stored on them - e.g. when you put a 100GB file into a 100GB filesystem that has mixed extent sizes, there might not be a large enough single extent to fulfill the write request, and your file will not be closed correctly.
Filesystems that are already at 100%: expect the corresponding application to coredump on reboot, but the OS itself should be fine with that.
Database filesystems at 95-100% might or might not be a problem, depending on the database settings: with autoextension it is bad; otherwise the filesystem is simply always full like this. You need to ask your DBA or look at historic data. (Well, you'd just know this, right?)
On the other hand, a redo log ("binlog") filesystem should not be over *60%*ish usage as this would indicate your backup system is not picking up the logs.
Is your package database OK?
yum/rpm, apt/dpkg, FreeBSD pkgdb or SD-UX databases are built on various backends (some are even plain text files), but they all share corruption issues.
If you're going to reboot a node that is in a completely unknown state, it is very helpful to first find out whether it has a broken installation system / database (all of them), unconfigured software (dpkg & HP-UX only) or broken packages (very different commands are used for verifying this).
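A tiny hypothetical dispatcher (name made up) mapping a package manager to its standard read-only verification command; only a few managers are shown, and the FreeBSD entry assumes the newer pkg(8) tool rather than the old pkgdb:

```shell
# Hypothetical dispatcher (name made up): print the standard read-only
# verification command for a given package manager.
pkgdb_check_cmd() {
    case "$1" in
        rpm)  echo "rpm -Va" ;;          # verify files against the rpm database
        dpkg) echo "dpkg --audit" ;;     # report half-installed/unconfigured packages
        pkg)  echo "pkg check -d" ;;     # FreeBSD pkg(8): check for missing dependencies
        *)    echo "unknown manager: $1" >&2; return 1 ;;
    esac
}

# Live use: run the printed command by hand and review its output, e.g.
# pkgdb_check_cmd rpm
```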
SW Raid status:
Software raids should be fully synced and, more importantly, not in a broken / degraded state before you reboot. From experience, software raid is more prone to catastrophic failures and also more often improperly configured, and people who spend $$$ on a hw raid controller appear to be more likely to monitor its status.
So when dealing with a server on software raid, expect it to be misconfigured, degraded, and unmonitored.
Similarly it is important to remember that many hardware raid controllers will require a keypress to acknowledge disk failures.
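For Linux md raid, degraded arrays can be spotted in /proc/mdstat, where a healthy two-disk mirror shows "[UU]" and a degraded one "[U_]". A sketch (the helper name is made up):

```shell
# Sketch for Linux md raid (assumes the /proc/mdstat status format, where a
# healthy two-disk mirror shows "[UU]" and a degraded one "[U_]").
md_degraded() {
    # reads mdstat-formatted text on stdin; prints lines with a missing member
    grep -E '\[[U_]*_[U_]*\]'
}

# Live use:
# md_degraded < /proc/mdstat && echo "DEGRADED - do not reboot yet"
```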
It is extremely helpful to verify your system is bootable.
KVM or serial access
Check on which port the actual kernel console is accessible.
Test and verify that you can access it and that you can log in there as root.
Check the PAM config and make sure there hasn't been some smart attempt to use directory-based logins that breaks root login in single user mode. It is not enough to verify the order in NSS (/etc/nsswitch.conf), because PAM mechanisms can break anyway - e.g. Quest Authentication Service might simply dump core instead of letting you in.
You can test this by stopping the directory-based login service and verifying that a root login on a different terminal (== console) still succeeds.
Interface card settings:
IP aliases (outside of cluster)
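Hypothetical sketch (Linux with iproute2 assumed; the helper name is made up): dump every currently configured address so it can be compared against the persisted network config - aliases added by hand with `ip addr add` outside cluster control will be gone after the reboot unless something restores them:

```shell
# Hypothetical parser (name made up): reads `ip -o addr show` output on stdin
# and prints "interface address" pairs for comparison against the boot-time
# network configuration.
list_addrs() {
    awk '{ print $2, $4 }'
}

# Live use:
# ip -o addr show | list_addrs
```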