**coltonlewis.name: Partitioned PMU Part 5 [Org] All L1 (Kernel Hacker Mode) ---

Partitioned PMU Part 5

Guest VMs now have direct hardware access to their designated PMU counters. The next step is to make sure the guest can operate them reliably.

Context Swap

The most important thing to plan for is that the guest will be regularly swapped out of the CPU. For anyone not familiar with what that means, it is similar to how processes get swapped out. Almost all modern operating systems use a technique called preemptive multitasking to be able to run many processes on the same physical hardware.

The way it works is that each process, or VCPU in our case, is given a turn by the system scheduler. When it is that process's turn, it is loaded onto the CPU with reduced permissions so that the CPU directly executes the machine code associated with that process. When the time is up, an interrupt occurs which triggers some kernel code. The kernel code then saves the relevant registers to capture the execution state of the paused process so it may be resumed later. When the process is resumed after an arbitrary period of time, the kernel reloads those registers with the values it saved earlier and the process continues executing as if nothing had happened.

Since I am introducing new hardware state belonging to the guest, I have to write matching load and put functions for when the VCPU is swapped in and out. The load function takes register values from memory and writes them to the hardware. The put function reads register values from the hardware and stores them in memory.
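To make the load/put pairing concrete, here is a minimal userspace sketch. It is not the actual KVM code: `struct fake_pmu_hw`, `struct guest_pmu_ctx`, `guest_pmu_load`, `guest_pmu_put`, and `NR_GUEST_COUNTERS` are hypothetical names, and a plain struct stands in for the physical registers that real kernel code would reach through system register accessors.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define NR_GUEST_COUNTERS 4 /* hypothetical size of the guest partition */

/* Stand-in for the physical PMU registers (assumption for this sketch). */
struct fake_pmu_hw {
    uint64_t evcntr[NR_GUEST_COUNTERS]; /* event counters */
    uint64_t ccntr;                     /* cycle counter */
};

/* In-memory copy of guest PMU state, kept while the VCPU is off the CPU. */
struct guest_pmu_ctx {
    bool partitioned;                   /* is the PMU partitioned for this guest? */
    uint64_t evcntr[NR_GUEST_COUNTERS];
    uint64_t ccntr;
};

/* Load: memory -> hardware, run when the VCPU is swapped in. */
void guest_pmu_load(const struct guest_pmu_ctx *ctx, struct fake_pmu_hw *hw)
{
    assert(ctx && hw);
    if (!ctx->partitioned)
        return; /* nothing belongs to the guest, so there is nothing to do */

    for (int i = 0; i < NR_GUEST_COUNTERS; i++)
        hw->evcntr[i] = ctx->evcntr[i];
    hw->ccntr = ctx->ccntr;
}

/* Put: hardware -> memory, run when the VCPU is swapped out. */
void guest_pmu_put(struct guest_pmu_ctx *ctx, const struct fake_pmu_hw *hw)
{
    assert(ctx && hw);
    if (!ctx->partitioned)
        return;

    for (int i = 0; i < NR_GUEST_COUNTERS; i++)
        ctx->evcntr[i] = hw->evcntr[i];
    ctx->ccntr = hw->ccntr;
}
```

The two functions are deliberately mirror images of each other: whatever put saves, load must restore, or the guest sees its counters change while it was not running.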

We check at the top that the PMU is actually partitioned and the appropriate traps configured, otherwise it would be wasteful and foolish to run this function. After that, the meat of the function is very mechanical. However, there is a slight complication with the bit mask registers. Each mask is exposed through two register names, SET and CLR, that refer to the same underlying state but behave differently on write: writing a 1 to a SET register sets that bit, and writing a 1 to a CLR register clears it. So we need a small amount of bit hacking to write 1s to the SET registers for the bits we want set and 1s to the CLR registers for the bits we want cleared.
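The SET/CLR bit hacking can be sketched as follows. This is a userspace model under stated assumptions, not the kernel code: `struct fake_mask_reg`, `write_set`, `write_clr`, and `restore_mask` are hypothetical names, and `guest_mask` is assumed to hold the bits for the counters that belong to the guest, so host-owned bits are left alone.

```c
#include <assert.h>
#include <stdint.h>

/* One piece of mask state, reachable through a SET name and a CLR name. */
struct fake_mask_reg {
    uint64_t state;
};

/* Writing 1s to the SET name turns those bits on; 0s have no effect. */
static void write_set(struct fake_mask_reg *reg, uint64_t val)
{
    reg->state |= val;
}

/* Writing 1s to the CLR name turns those bits off; 0s have no effect. */
static void write_clr(struct fake_mask_reg *reg, uint64_t val)
{
    reg->state &= ~val;
}

/*
 * Make the guest-owned bits of the register match the saved value.
 * Bits outside guest_mask are never written as 1 to either name,
 * so they keep whatever value they already had.
 */
void restore_mask(struct fake_mask_reg *reg, uint64_t saved, uint64_t guest_mask)
{
    assert(reg);
    write_set(reg, saved & guest_mask);  /* bits that should end up 1 */
    write_clr(reg, ~saved & guest_mask); /* bits that should end up 0 */
}
```

For example, with `guest_mask = 0x0F`, a saved value of `0x0A`, and a register currently holding `0xF5`, the SET write turns on bits 1 and 3, the CLR write turns off bits 0 and 2, and the upper four bits are untouched, leaving `0xFA`.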

Enforcing Event Filter

Normally it would not make sense to load the event type registers because they remain trapped, but there is an important caveat here. Because the hardware was previously occupied by another guest or by the host, those event type registers could contain events that are not allowed in the current guest. If the current guest were to directly enable the counter, it would be counting events KVM does not want it to. It's also possible that, while the guest was unloaded, the host decided to ban more events.

As a consequence, the proper thing to do on load is to double check every event type register value in memory against the KVM event filter. If an event is no longer allowed, KVM modifies the event type bits to something that is allowed before writing them to hardware.
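A minimal sketch of that check, with several assumptions: the filter is modeled as a bitmap of allowed event numbers, the event number is taken from the low 16 bits of the event type value, and a disallowed event is neutralized by setting the EL1 and EL0 exclusion bits (bits 31 and 30) so the counter does not count in the guest. That fallback is one illustrative choice, not necessarily what the real code does, and every name here (`struct event_filter`, `sanitize_evtype`, `load_evtypes`) is hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define EVTYPE_EVENT_MASK  0xFFFFu              /* event number field (assumed 16 bits) */
#define EVTYPE_EXCLUDE_EL1 (UINT64_C(1) << 31)  /* do not count at EL1 */
#define EVTYPE_EXCLUDE_EL0 (UINT64_C(1) << 30)  /* do not count at EL0 */
#define NR_EVENTS          0x10000

/* One bit per event number; a set bit means the event is allowed. */
struct event_filter {
    uint64_t allowed[NR_EVENTS / 64];
};

static bool event_allowed(const struct event_filter *f, uint16_t event)
{
    return (f->allowed[event / 64] >> (event % 64)) & 1;
}

/* Return a value that is safe to write to hardware for this guest. */
uint64_t sanitize_evtype(const struct event_filter *f, uint64_t evtype)
{
    uint16_t event = evtype & EVTYPE_EVENT_MASK;

    if (event_allowed(f, event))
        return evtype;

    /* Sketch-only fallback: keep the value but exclude both guest levels. */
    return evtype | EVTYPE_EXCLUDE_EL1 | EVTYPE_EXCLUDE_EL0;
}

/* On load, filter every saved event type before it reaches the hardware. */
void load_evtypes(const struct event_filter *f, const uint64_t *saved,
                  uint64_t *hw, size_t n)
{
    assert(f && saved && hw);
    for (size_t i = 0; i < n; i++)
        hw[i] = sanitize_evtype(f, saved[i]);
}
```

Doing the check at load time, rather than only when the guest writes the register, is what covers both cases above: stale values left by a previous occupant and filter changes made while the guest was off the CPU.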

Handling Interrupts

Interrupts are the most delicate part of this operation. They have a very tight time budget because the CPU cannot do any other useful work while an interrupt is being handled. They also have to be handled by the host interrupt handler first, with the interrupt state threaded through and injected into the guest the next time it loads.

The first part of that is to modify the host interrupt handler to take note of whether any of the interrupts belong to a guest counter and, if they do, call an additional KVM function to save which guest counters those were.

The second part of that is to take the interrupt state KVM saved from the host interrupt handler and inject it into the guest when the guest loads. Injecting means making the guest think there is an interrupt even though the hardware interrupt has already been handled.
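The two parts can be sketched together in a small userspace model. All names (`struct vcpu_pmu_irq`, `host_pmu_irq`, `guest_load_inject`) are hypothetical, the overflow status is modeled as a plain bit mask with one bit per counter, and a boolean stands in for the virtual interrupt line that real code would drive through the virtual interrupt controller.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

struct vcpu_pmu_irq {
    uint64_t guest_counter_mask; /* bits for counters owned by the guest */
    uint64_t pending_overflow;   /* guest overflows noted by the host handler */
    bool irq_line;               /* stand-in for the virtual interrupt level */
};

/*
 * Part one: called from the host interrupt handler with the overflow
 * status it read from hardware. Guest-owned bits are recorded for later;
 * the return value is what remains for the host to handle itself.
 */
uint64_t host_pmu_irq(struct vcpu_pmu_irq *v, uint64_t overflow_status)
{
    assert(v);
    uint64_t guest_bits = overflow_status & v->guest_counter_mask;

    if (guest_bits)
        v->pending_overflow |= guest_bits;

    return overflow_status & ~v->guest_counter_mask;
}

/*
 * Part two: called when the guest loads. If any guest overflow was
 * recorded, assert the virtual line so the guest sees an interrupt even
 * though the hardware interrupt was already handled by the host.
 */
void guest_load_inject(struct vcpu_pmu_irq *v)
{
    assert(v);
    v->irq_line = (v->pending_overflow != 0);
}
```

Accumulating with `|=` matters in this sketch: several overflows can arrive before the guest next loads, and none of them should be lost.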

ARM has a separate piece of hardware called the GIC (Generic Interrupt Controller) that, like the PMU, is significantly virtualized and holds a lot of state in software. So we can inject a virtual PMU interrupt by manipulating the state of the virtual GIC. I'm very glad there were existing functions to do this already because it is confusing code to look at.