How to reboot a Linux server stuck on the Big Kernel Lock

We encountered a situation where a remote server's Linux kernel Oops'ed 1) with the Big Kernel Lock acquired 2), and we used the following program to reboot that server:

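#include <stdio.h>
#include <sys/io.h>

int main(void)
{
        /* Gain access to all I/O ports; this requires root. */
        if (iopl(3) != 0) {
                perror("iopl");
                return 1;
        }
        /* 0xfe to port 0x64: ask the keyboard controller to pulse the reset line. */
        outb(0xfe, 0x64);
        return 0;
}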

Alternatively:

echo -en '\xfe' | dd of=/dev/port seek=100 bs=1 count=1

This program/command asks the PC AT keyboard controller on the motherboard to pulse the reset line, as described, for example, here. Port 0x64 is the controller's command port, which is why the dd command seeks to offset 100 (0x64 in decimal). This is, of course, specific to PC-compatible x86 systems. The machine we actually used this trick on was fairly modern: it had a Supermicro X7DVL-E motherboard, which uses the Intel 5000V (Blackford VS) chipset and can run up to two quad-core Xeon CPUs. There is no longer a separate 8042 chip, yet the functionality remains.
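For illustration, the dd command above can also be written in C as a seek followed by a one-byte write. This is only a minimal sketch: the function name write_port_byte and its parameters are made up for this example, and the path is a parameter so that the function can be tried safely on an ordinary file before pointing it at /dev/port.

```c
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Write one byte at the given offset of a port device file.
 * With path "/dev/port", port 0x64 and value 0xfe this mirrors the dd
 * command (seek=100 bs=1 count=1) and would reset the machine.
 * Returns 0 on success, -1 on error (errno is set by the failing call).
 * The name and signature are illustrative, not an existing API.
 */
int write_port_byte(const char *path, off_t port, unsigned char value)
{
        int fd = open(path, O_WRONLY);
        if (fd < 0)
                return -1;

        int ok = 1;

        /* Position the file offset at the port number, like dd's seek=. */
        if (lseek(fd, port, SEEK_SET) != port)
                ok = 0;

        /* Write exactly one byte, like dd's bs=1 count=1. */
        if (ok && write(fd, &value, 1) != 1)
                ok = 0;

        if (close(fd) != 0)
                ok = 0;

        return ok ? 0 : -1;
}
```

Calling write_port_byte("/dev/port", 0x64, 0xfe) as root would correspond to the command above. Unlike the iopl()/outb() program, this goes through the /dev/port driver, just as dd does.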

The “official” ways to reboot would not work because the reboot(2) syscall starts by trying to acquire the Big Kernel Lock, so it would get stuck. Yet, despite the Big Kernel Lock being held, it was possible to SSH into the server, stop most processes, and remount the filesystems read-only. All of this was with an OpenVZ kernel from their “rhel5” branch, and it shows that recent kernel versions make very little use of the Big Kernel Lock (they use fine-grained locking, or data structures that do not require locking, instead).

Also relevant is the fact that this specific Linux system had kernel.panic_on_oops set to 0. This is the default with mainstream Linux kernels, but Red Hat's kernels (and thus the official OpenVZ kernels based on them) change the default to 1. If kernel.panic_on_oops had been 1 and kernel.panic had been 0 (the default), the server would have gotten stuck and we would not have been able to recover it remotely on our own. On the other hand, if kernel.panic_on_oops had been 1 and kernel.panic non-zero, the server would have rebooted on its own (after the specified number of seconds).
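The three cases described above can be summarized in a small C sketch. The enum, the function names, and the helper for reading a value from under /proc/sys are all hypothetical and only restate the behavior described in the text; they are not part of any kernel or libc interface.

```c
#include <stdio.h>

/* Possible outcomes after a kernel Oops, as described in the text. */
enum oops_outcome {
        OOPS_KEEPS_RUNNING,  /* panic_on_oops == 0: no panic, remote recovery may be possible */
        OOPS_PANIC_HANGS,    /* panic_on_oops != 0, panic == 0: stuck until someone resets it */
        OOPS_PANIC_REBOOTS   /* panic_on_oops != 0, panic != 0: reboots on its own */
};

/* Map the two sysctl values to the outcome the text describes. */
enum oops_outcome classify_oops_settings(int panic_on_oops, int panic_timeout)
{
        if (panic_on_oops == 0)
                return OOPS_KEEPS_RUNNING;
        if (panic_timeout == 0)
                return OOPS_PANIC_HANGS;
        return OOPS_PANIC_REBOOTS;
}

/*
 * Read a single integer from a file such as /proc/sys/kernel/panic.
 * Returns 0 on success and stores the number in *value, -1 on error.
 */
int read_sysctl_int(const char *path, int *value)
{
        FILE *f = fopen(path, "r");
        if (f == NULL)
                return -1;

        int ok = (fscanf(f, "%d", value) == 1);

        fclose(f);
        return ok ? 0 : -1;
}
```

On a live system, the two inputs would come from /proc/sys/kernel/panic_on_oops and /proc/sys/kernel/panic, which are the files behind the kernel.panic_on_oops and kernel.panic sysctls. For a remote server, OOPS_PANIC_HANGS is the combination to avoid.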

1) “Oops” is the Linux term for an unexpected fault or trap occurring while in kernel mode
2) presumably, this happened when a NIC driver bug was triggered by incoming network traffic during the driver's shutdown sequence, although it can happen for a variety of reasons involving kernel bugs and/or hardware failures
internal/kernel-big-lock-reboot.txt · Last modified: 2010/09/23 12:23 by solar
 
Except where otherwise noted, content on this wiki is licensed under the following license: CC Attribution-Noncommercial-Share Alike 3.0 Unported