Red Hat Linux developer Richard WM Jones has shared an eyebrow-raising tale of Linux bug hunting. Jones noticed that Linux 6.4 has a bug that causes it to hang on boot about 1 in 1,000 times. He set out to pinpoint the bug and prove he had caught it red-handed. His headline effort, which involved booting Linux 292,612 times (and another 1,000 times to confirm the bug), apparently "only took 21 hours." It also seems that the bug is less common on Intel hardware than on AMD-based machines.
Jones first caught wind of this elusive but reproducible boot bug when some server software tests with nbdkit (a toolkit for building Network Block Device servers, which expose block devices over a network) seemed to be "randomly hanging" when used with libguestfs (a library and toolset for accessing and modifying virtual machine disk images). Although the looped testing phase took a mere 21 hours (despite an astronomical 293,612 boots being initiated), Jones says that getting to this point "took many days." He recounts that a painful bisection between Linux v6.0 and v6.4-rc6 helped him narrow down the boot hang culprit, which he identifies as a regression in the printk time feature. Reverting this code commit "fixes the problem," asserts Jones.
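To see why bisecting a 1-in-1,000 hang is so laborious, it helps to work out how many clean boots are needed before a kernel can reasonably be called "good." The short Python sketch below is a back-of-the-envelope illustration, not Jones's own method. It assumes every boot is independent and hangs with the same fixed probability, which may not hold on real hardware, and the function names are made up for this example.

```python
import math


def prob_no_hang(p_hang, boots):
    """Chance that `boots` independent boots all succeed on a buggy kernel.

    Assumes each boot hangs independently with probability p_hang.
    """
    return (1.0 - p_hang) ** boots


def boots_needed(p_hang, max_miss_chance):
    """Smallest number of clean boots after which the chance of having
    missed the bug purely by luck drops to max_miss_chance or below."""
    return math.ceil(math.log(max_miss_chance) / math.log(1.0 - p_hang))


# With a 1-in-1,000 hang rate, how likely is it that a buggy kernel
# survives 10,000 boots without hanging once?
print(prob_no_hang(0.001, 10_000))

# How many clean boots before we are 99% sure the bug is absent?
print(boots_needed(0.001, 0.01))
```

Under these assumptions, a buggy kernel would slip through 10,000 boots only a few times in 100,000, while a 99% confidence level still requires several thousand clean boots at every bisection step.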
A clue to the cause was that the bug always appeared at the same early stage of the boot process when booting under the latest QEMU. According to Jones, the easiest way to replicate the hang is to run a guestfish command in a loop, with many instances in parallel, and parse the output to detect when a boot has hung. He usually ran the guestfish loop 10,000 times, as a workable threshold for gathering useful log data.
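The loop described above can be approximated with a small Python harness. This is a hedged sketch rather than Jones's actual script: it flags a hang with a simple timeout instead of parsing the boot output, and the helper names, worker count and timeout value are all illustrative choices. The guestfish invocation in the closing comment is likewise an assumption about what a minimal boot test might look like.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor


def boot_once(cmd, timeout_s):
    """Run cmd once. Return True if it exited in time, False if it hung."""
    try:
        subprocess.run(
            cmd,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout_s,
        )
        return True
    except subprocess.TimeoutExpired:
        # subprocess.run kills the direct child when the timeout expires;
        # any grandchildren it spawned may need separate cleanup.
        return False


def count_hangs(cmd, runs, workers, timeout_s):
    """Run cmd `runs` times, `workers` at a time, and count the timeouts."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = list(pool.map(lambda _: boot_once(cmd, timeout_s), range(runs)))
    return results.count(False)


# Hypothetical usage (assumes guestfish is installed; the command is a guess
# at a minimal boot test, not a quote from Jones's write-up):
#   hangs = count_hangs(["guestfish", "-a", "/dev/null", "run"],
#                       runs=10_000, workers=8, timeout_s=60)
#   print(f"{hangs} hangs in 10,000 boots")
```

Threads are sufficient here because each worker spends its time waiting on an external process. The timeout needs to sit comfortably above a normal boot time, otherwise slow boots on a loaded machine would be miscounted as hangs.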
Perhaps of some interest to hardware fans, Jones remarks that this weird boot hang occurs less often on Intel systems than on AMD systems. Whatever the case, hopefully the exposure and pinpointing of this bug means that it will be squashed, never to return.