Pre- and post-reboot actions/checklist for remote Linux servers

These instructions apply primarily to Owl systems. Many other modern Linux distributions use GRUB instead of LILO by default, but we intentionally keep using LILO in Owl.

Before the reboot

  • Take note of the running services and OpenVZ containers (if applicable); optionally, make sure that all of them, and no others, are configured to start upon bootup
  • Sanity-check and save the NTP-synchronized system time to the RTC/NVRAM with “/sbin/clock -uw” 1)
  • Make sure that an fsck won't be forced for ext2/3/4 filesystems on bootup because they have reached or exceeded the “maximum mount count” or “maximum time since last check”: check each filesystem with tune2fs -l /dev/… and, if needed, disable the regular checks with tune2fs -c0 -i0 /dev/… (we normally do this right after creating filesystems, but it does not hurt to check it pre-reboot)
  • Check the status of any software RAID devices with cat /proc/mdstat (if any of them are degraded, then at least take this into consideration)
  • If switching to a different kernel and/or userland install, then prepare a “safety net”:
    • Configure the new boot target for the very next reboot only, with automatic fallback to the current/tested boot target (you'll use lilo -R … once you're done editing /etc/lilo.conf and after you've run lilo)
    • In the new boot target, specify the panic=10 kernel parameter (to ensure that the system is automatically rebooted into the previous boot target should the kernel panic, e.g. upon failing to mount the root filesystem)
    • Consider configuring netconsole (another kernel parameter) and running nc -vul UDP-PORT-NUMBER under screen on a nearby system to capture the kernel messages (but then you will likely need a second reboot to disable netconsole)
    • When making changes to /etc/lilo.conf, be careful about your use of append vs. addappend (if in doubt about these, review the lilo.conf(5) man page first)
    • Place /sbin/shutdown -r 5 & into /etc/rc.d/rc.local (this is needed in case the new system boots up mostly fine, but its networking setup fails for whatever reason)
  • Right before the reboot, consider shutting some of the services down while you still have control and can see the shutdown messages (e.g., run service vz stop to shut down OpenVZ containers) - this may reduce the risk of the system getting stuck on shutdown, as well as give us more info
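
Two of the checks above (periodic fsck settings and software RAID status) lend themselves to a small shell sketch. It is only an illustration under stated assumptions: the function names periodic_checks_enabled and degraded_md_arrays are hypothetical, the first expects the output of tune2fs -l on stdin and relies on its "Maximum mount count" and "Check interval" fields, and the second expects the contents of /proc/mdstat and treats an underscore in the [UU]-style status field as a missing array member.

```shell
#!/bin/sh
# Hypothetical helpers, not part of Owl; a sketch of two pre-reboot checks.

# Reads "tune2fs -l" output on stdin.
# Succeeds (exit 0) if mount-count or time-based checks are still enabled,
# i.e. if a periodic fsck could be forced on some future bootup.
periodic_checks_enabled() {
    awk -F': *' '
        /^Maximum mount count:/ { max = $2 + 0 }       # -1 or 0 means disabled
        /^Check interval:/      { interval = $2 + 0 }  # 0 means disabled
        END { exit !(max > 0 || interval > 0) }
    '
}

# Reads /proc/mdstat contents on stdin.
# Prints the name of each array whose status field contains "_",
# which marks a missing member (for example [U_]).
degraded_md_arrays() {
    awk '
        /^md[^ ]* :/        { name = $1 }
        /\[[U_]*_[U_]*\]/   { if (name != "") print name }
    '
}
```

For example, tune2fs -l /dev/… | periodic_checks_enabled && echo "periodic checks still enabled" flags a filesystem that may still need tune2fs -c0 -i0, and degraded_md_arrays < /proc/mdstat prints nothing when no array is degraded.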

After the reboot

  • If the delayed-reboot trick from rc.local was used, then run shutdown -c 2), then comment out the command in rc.local
  • Sanity-check the system: what kernel booted up, any potentially unexpected messages from it, software RAID status, system time, started services and containers, whatever else may be relevant
  • If netconsole was activated and you don't intend to keep using it (such as by running nc on the other end), disable it and prepare for another reboot
  • Make the new configuration the default (e.g., swap the previous/new boot targets in /etc/lilo.conf and run lilo)
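
The rc.local cleanup from the first item above can be sketched in shell as follows. This is only an illustration: the function name disable_delayed_reboot is hypothetical, and it assumes the line was added exactly as /sbin/shutdown -r 5 & (optionally surrounded by whitespace); a line written differently is left untouched and needs to be edited by hand.

```shell
#!/bin/sh
# Hypothetical helper, not part of Owl: comment out the delayed-reboot
# line in the given rc.local file.
# The result is written back with cat (not mv) so that the original file
# keeps its mode and ownership.
disable_delayed_reboot() {
    rc_file=$1
    tmp_file=$rc_file.tmp
    sed 's|^\([[:space:]]*/sbin/shutdown -r 5 &\)[[:space:]]*$|# \1|' \
        "$rc_file" > "$tmp_file" &&
        cat "$tmp_file" > "$rc_file" &&
        rm -f "$tmp_file"
}
```

Typical use after logging in would be shutdown -c to cancel the pending reboot, followed by disable_delayed_reboot /etc/rc.d/rc.local.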

1) assuming that the RTC is kept in UTC and the system is configured accordingly, which it should be
2) note that if a lot of services are being started upon bootup, then it might take a while until rc.local is executed, so that command might need to be run a bit later - or you might even be able to edit rc.local before it executes

internal/reboot.txt · Last modified: 2010/07/23 15:29 by solar
 
Except where otherwise noted, content on this wiki is licensed under the following license: CC Attribution-Noncommercial-Share Alike 3.0 Unported