A few notes and ideas to make Linux a stable OS
- Use either data=ordered or data=journal
- Use block_validity (from 2.6.35)
- Use journal checksumming (if available, and note this is not a cure-all)
- Use errors=panic
- Use discard if running on thin-provisioned or SSD storage, and use barriers.
- Note the asymmetric syntax for barriers: they're turned on using barrier=1 but off using nobarrier.
- Set a reasonable flush interval. Ext4 can take up to a minute to bother writing out your data. Some way to look good in benchmarks!
- noauto_da_alloc mount option seems to override "commit=xxx" for overall crash-safer processing.
An external journal device is an option, but it is too risky in case of fsck.
You can wire these settings into the filesystems themselves! You may even be able to make them the default for new filesystems using mke2fs.conf.
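The options listed above could be combined roughly like this. It is only a sketch, assuming ext4: the function names are made up, commit=5 is just an example flush interval, and the tune2fs lines are printed rather than executed so you can review them first.

```shell
# Build an ext4 mount-option string from the notes above.
# Pass "ssd" as the first argument to append discard
# (thin-provisioned or SSD storage).
# commit=5 is only an illustrative flush interval, not a value from the notes.
ext4_safe_opts() {
  opts="data=ordered,block_validity,journal_checksum,errors=panic,barrier=1,commit=5"
  if [ "${1:-}" = "ssd" ]; then
    opts="$opts,discard"
  fi
  printf '%s\n' "$opts"
}

# Print (do not run) tune2fs commands that store defaults in the superblock.
# The device argument is a placeholder supplied by the caller.
print_tune2fs_cmds() {
  dev=$1
  printf 'tune2fs -e panic %s\n' "$dev"
  printf 'tune2fs -o journal_data_ordered %s\n' "$dev"
  printf 'tune2fs -o block_validity %s\n' "$dev"
}
```

For example, `mount -o "$(ext4_safe_opts ssd)" /dev/sdXN /mnt`, where /dev/sdXN stands for your own device.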
Alternatives to Ext
- Do not use Ext at all.
- This is not plain FUD, e.g. Ceph devs ended up recommending XFS instead of Ext4 due to finding too many bugs. Not for the long run, but for the time being.
- XFS is available as a feature w/support on RHEL
- JFS would be a good alternative but has issues due to bad OS integration. It gives false errors at mount time if it needs an fsck, and that renders it kind of useless. It also doesn't support online shrinking or a shared external log volume for multiple filesystems like on AIX.
- Most big shops end up using VxFS on Linux in the long run. I've seen this in multiple places that have a large amount of valuable data to handle. There is a free version of VxFS + VxVM, but it only works for 4 volumes and filesystems max. If you like a 32TB root filesystem, this is for you. Seriously, if your data is worth enough to justify the license cost, this is the way to go. It has data integrity down for sure, and all the "future" and "new" features of the standard Linux filesystems were in VxFS 10 years ago. Guess what: it has some more these days.
Use cgroups to shield core OS processes
Likewise, adjust OOM prio of system processes
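One way to adjust the OOM priority is to write to /proc/PID/oom_score_adj, where -1000 exempts the process entirely. A minimal sketch, assuming a kernel that has oom_score_adj (older ones use oom_adj instead); the function name and the optional second argument are made up for illustration.

```shell
# Exempt a process from the OOM killer by writing -1000 to its oom_score_adj.
# Needs root on a real system. The optional second argument (proc root) only
# exists so the function can be tried against a scratch directory; it
# defaults to /proc.
protect_from_oom() {
  pid=$1
  root=${2:-/proc}
  target="$root/$pid/oom_score_adj"
  if [ ! -w "$target" ]; then
    echo "cannot write $target" >&2
    return 1
  fi
  printf '%s\n' -1000 > "$target"
}
```

Something like `protect_from_oom "$(pidof -s sshd)"` would then shield sshd. Which processes count as "core" is your call.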
Use special sysctls
- panic on unexpected NMI
- panic = 120 to freeze the box for a moment after a panic occurred
- disable memory overcommit (echo 2 > /proc/sys/vm/overcommit_memory)
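The three sysctls above might look like this in sysctl.conf syntax. This is a sketch: mapping "panic on unexpected NMI" to kernel.unknown_nmi_panic is my assumption, and the function name is made up. It only prints the settings, it does not apply them.

```shell
# Print the sysctl settings from the list above in sysctl.conf syntax.
# kernel.panic = 120 makes the kernel wait 120 seconds after a panic
# before rebooting.
stability_sysctls() {
  cat <<'EOF'
kernel.unknown_nmi_panic = 1
kernel.panic = 120
vm.overcommit_memory = 2
EOF
}
```

To make them persistent you could run `stability_sysctls >> /etc/sysctl.conf` and then `sysctl -p`.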
- Linux comes with a software watchdog
- IPMI has a watchdog
- HP iLO and IBM RSA also have watchdogs
Test them carefully, consider using them.
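As part of that testing, a quick check that a watchdog device node exists at all can be sketched as below. The function name is made up. Note that it deliberately only tests for the node: opening /dev/watchdog arms the timer.

```shell
# Report whether a watchdog device node is present.
# Only checks for a character device; it never opens the node, because
# opening /dev/watchdog starts the countdown.
watchdog_present() {
  dev=${1:-/dev/watchdog}
  [ -c "$dev" ]
}
```

For the software watchdog, `modprobe softdog && watchdog_present && echo "watchdog ready"` would be a first smoke test.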
My server rebooted!
Yes, they will also zap your server if it runs into overload issues. They exist to cycle the OS if it has a problem. That is the whole point!
Enable CRC Checking
- Normally Linux simply ignores CRC errors on PCI. That's for performance reasons, and we all know how fast Linux is, right?
- Check with your HW vendor if they already track PCI CRC from their management cards.
- The md software RAID knows about write-intent logs (bitmaps).
- Create them at RAID creation time for big volumes.
- For small ones, it is OK to add them later (mdadm --grow --bitmap=internal /dev/mdXX)
- Ensure path_checker is correct (TUR) for active/passive arrays
- Make sure retry queueing is set so you never get the idiotic "rejecting IO to dead device" message for a path that is back online
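For the two multipath points, a multipath.conf fragment could look roughly like the one printed below. It is a sketch only: putting both settings into the defaults section is my assumption (many arrays want them in a per-device section), and the function name is made up.

```shell
# Print a multipath.conf defaults section matching the two notes above.
# no_path_retry queue keeps I/O queued while all paths are down, so
# processes will block until a path returns.
multipath_defaults() {
  cat <<'EOF'
defaults {
    path_checker    tur
    no_path_retry   queue
}
EOF
}
```

Compare the output against what `multipath -ll` and your array vendor's documentation say before merging it into /etc/multipath.conf.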
- Make sure ssh starts early on
- NTP should use step-tickers to make initial jumps and should not kill itself on clock issues
- nsswitch.conf might need a look or two, and resolv.conf needs fast failover options
- if you want LDAP on RHEL, use sssd, not the legacy stuff
- Disable avahi / kudzu crap
- Let /etc/rc.local log a message at runlevel completion
- Consider hacking reboot and shutdown commands. These can hang if the system has problems. (LOL)
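For the resolv.conf point in the list above, "fast failover" usually comes down to the options line. A small sketch; the function name and the default values of 1 second and 2 attempts are only examples.

```shell
# Print a resolv.conf options line for faster failover between nameservers.
# $1 = per-query timeout in seconds, $2 = number of attempts.
# rotate spreads queries across the listed nameservers.
resolv_failover_options() {
  printf 'options timeout:%d attempts:%d rotate\n' "${1:-1}" "${2:-2}"
}
```

Append the output to /etc/resolv.conf, and make sure whatever manages that file on your distribution does not overwrite it again.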
Disable console blanking!
Set up a serial console (for ipmi, etc)
Configure netconsole kernel logging to a syslog server. It must be on the same subnet.
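The netconsole module takes its whole configuration in a single parameter string. The helper below is a sketch that assembles it. The function name is made up, source port 6665 and target port 514 (syslog over UDP) are assumptions, and the addresses in the usage note are documentation placeholders.

```shell
# Build the netconsole= module parameter.
# Arguments: local IP, local interface, syslog server IP, MAC address of
# the syslog server.
# Format: netconsole=src-port@src-ip/dev,tgt-port@tgt-ip/tgt-mac
netconsole_param() {
  printf 'netconsole=6665@%s/%s,514@%s/%s\n' "$1" "$2" "$3" "$4"
}
```

Usage would be along the lines of `modprobe netconsole "$(netconsole_param 192.0.2.10 eth0 192.0.2.20 00:11:22:33:44:55)"`.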
Arp cache size:
By default the cache holds only 1024 entries; for large flat LANs, raise it to a reasonable value.
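The limit lives in the net.ipv4.neigh.default.gc_thresh* sysctls, with gc_thresh3 as the hard maximum. A sketch that prints raised values; the function name, the 8192 default and the 1/8 and 1/2 ratios for the two lower thresholds are my assumptions, so pick numbers that fit your LAN.

```shell
# Print neighbour-table (ARP cache) limits in sysctl.conf syntax.
# $1 is the hard limit (gc_thresh3); 8192 is only an example.
# gc_thresh1 and gc_thresh2 are derived as 1/8 and 1/2 of it.
arp_cache_sysctls() {
  max=${1:-8192}
  printf 'net.ipv4.neigh.default.gc_thresh1 = %d\n' $((max / 8))
  printf 'net.ipv4.neigh.default.gc_thresh2 = %d\n' $((max / 2))
  printf 'net.ipv4.neigh.default.gc_thresh3 = %d\n' "$max"
}
```

For instance, `arp_cache_sysctls 16384 >> /etc/sysctl.conf` followed by `sysctl -p`.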
Don't let it handle ARP/IP on the wrong interface
Linux considers IP addresses the property of the OS, not of a single interface. It's supposed to be a helpful feature, keeping connectivity if it can tell "oh, that interface also sees the subnet".
If you're not incompetent and can tie your shoes even in the dark, you might want to disable this feature. Otherwise you can't really rely on any tests of your network config you do.
E.g. you have 4 NICs connected to a switch that will be rolled over to 2 LACP bundles, one for management, one for consumer traffic. Now, will it help you during testing if you get a "successful" reply for interface 1+2 from interface 4? No, it WON'T.
Sysctl: net.ipv4.conf.ethX.arp_filter = 1
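Since ethX stands for each interface, a tiny generator saves typing. A sketch with a made-up function name; it prints sysctl.conf lines for whatever interface names you pass in.

```shell
# Print an arp_filter = 1 line for every interface name given as argument.
arp_filter_sysctls() {
  for ifname in "$@"; do
    printf 'net.ipv4.conf.%s.arp_filter = 1\n' "$ifname"
  done
}
```

For the 4-NIC example that would be `arp_filter_sysctls eth0 eth1 eth2 eth3 >> /etc/sysctl.conf`. Passing `all` instead of individual names should also work, since the kernel enables the filter on an interface when either the all or the per-interface value is set.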