The Bug Fighting Chronicles
Background
Earlier this year I wrote about the Linux kernel work I've been doing implementing a partitioned PMU on ARM that allows guest VMs to directly access PMU hardware for much greater efficiency. In the time since, I've been working to productionize this code at Google. That meant backporting to our internal production kernel. Most of the code was developed following upstream. Specifically I was using the KVM maintainer repo queue branch to gain some useful features that were not included upstream yet. I had to make this work with Google's production kernel that was based on v6.6.
Unfortunately there was a lot of stuff that changed between v6.6 and v6.16 in the PMU implementation details so that backport took me several weeks to get something functioning at a production level that passed all our internal testing. Despite this, in the following weeks after submitting the feature, there were a few bugs reported here and there.
The Bug
The bug report came in October 29th. The problem was that the `perf` command on the host was not correctly counting events on the guest and this occurred only when my feature was enabled. The numbers being reported were either garbage or zero depending on whether `perf` was also running in the guest.
Stakes
The kernel I had submitted my change to that we were planning to enable the partitioned PMU on was headed for general availability to customers in six weeks. The message from management and my tech lead were clear I was to drop everything to work on this full time and I did. I worked extra hours for two solid weeks and lost sleep over this bug. Normally I am a heavy sleeper, but during that two week period the slightest disturbance jolted me awake.
If this bug wasn't solved, my feature couldn't be enabled in production for the new platform and all the juicy impact I could brag about in my yearly review would be down the drain.
Debugging Process
The first thing I did was reproduce the bug multiple times under different conditions. The bug occurred for multiple possible splits of counters between the host and the guest. It occurred for the older generation of hardware too.
First Theory
Since it was a problem visible from the host, that at least reduced the possible locations of a bug from thousands of lines down to hundreds. The first thing I figured I should look at is the changes on the host PMU driver. Since host and guest `perf` seemed to interfere with each other, it seemed like a reasonable assumption that the host driver was not respecting the partition boundary. I had several problems backporting my changes due to differences between the two driver versions.
I put a warning where I knew the host driver shouldn't be accessing a counter that was allocated to the guest and I did see the host driver accessing a guest counter. That looks like the source of the problem.
I spent a few days introducing some extra guards until I was certain the host driver could not possibly be accessing the guest counters, but there was still a bug, so clearly I had more work to do.
Second Theory
I was lost for a few days where else the bug could be if it wasn't in the host driver, so I came up with an idea to methodically remove every branch where my feature did something different until the bug went away, and when it did I would know where to look.
It took a few days but I finally saw the bug go away when I removed a branch in the computation of the what value to set for the `MDCR_EL2` register which must be set to something different when entering the guest VM. The problem had to be there.
So I double checked how that register was being programmed, but it all looked correct. I printed the value of the register before and after entering the guest, and I stared at it long and hard, referencing the ARM manual for what each bit meant.
Suddenly, I saw it! There was a bit called `HPME` that wasn't set after entering the guest. The ARM PMU does something weird, where setting the partitioning pivot `HPMN` changes the global enable bit for the host level counters to `HPME` when it's normally `PMCR_EL0.N`.
That meant entering the guest was disabling the host-level counters, so of course they couldn't count guest events.
Solution
I made the change to set that bit when entering the guest and suddenly everything worked fine. It was simultaneously infuriating and gratifying. One bit in one register took me two weeks to find. At the code level, I effectively had to see something that wasn't there, that the macro naming that bit was not referenced to set the bit in the register.
Psychological Impact
I lost sleep over this bug. It was probably the fourth hardest bug I've ever solved, but I was very glad I didn't need to escalate and get more of the team to look at it.