Partitioned PMU Part 3
Right now my work is going through its third re-roll upstream. I won't quote specifics from it, but it may be enlightening if you want to see the actual C code going into this. Kernel work goes through many revisions. Before my code ever made it to the list, every one of the 22 patches went through dozens of minor revisions and a couple of complete rewrites as I learned more about the code I was working on and adjusted my original design to account for the realities.
Consider this post a high-level summary of each block of changes.
Checking CPU capabilities
The first few patches are small. They add a couple lines in the right places to check CPU capabilities that KVM doesn't currently check for. Because checking for this kind of thing is a fairly rote task, there are some abstractions in place to handle it.
It starts with a file, arch/arm64/tools/sysreg, which is a plain text description of all the relevant system registers and the meaning of their bit fields. During compilation, an awk script parses this file and generates a C header defining masking and shifting macros to access every field of every register.
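To give a feel for what masking and shifting macros look like in use, here is a minimal standalone sketch. It is not the generated header: the EX_ names and both helper functions are hypothetical, made up for illustration. The only architectural fact it relies on is that MDCR_EL2.HPMN occupies bits [4:0].

```c
#include <stdint.h>

/*
 * Hypothetical stand-ins for the kind of macros a generated header
 * provides. The EX_ prefix marks them as illustrative, not real
 * kernel names.
 */
#define EX_MDCR_EL2_HPMN_SHIFT 0
#define EX_MDCR_EL2_HPMN_WIDTH 5
#define EX_MDCR_EL2_HPMN_MASK \
	(((UINT64_C(1) << EX_MDCR_EL2_HPMN_WIDTH) - 1) << EX_MDCR_EL2_HPMN_SHIFT)

/* Extract a field: keep only the field's bits, then shift them down. */
static inline uint64_t ex_field_get(uint64_t reg, uint64_t mask,
				    unsigned int shift)
{
	return (reg & mask) >> shift;
}

/* Replace a field: clear its bits, then OR in the new shifted value. */
static inline uint64_t ex_field_set(uint64_t reg, uint64_t mask,
				    unsigned int shift, uint64_t val)
{
	return (reg & ~mask) | ((val << shift) & mask);
}
```

With one mask and one shift per field, every register access reduces to the same two operations, which is why the macros can be generated mechanically from a text description.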
Using all these macros, the file arch/arm64/kernel/cpufeature.c builds a bunch of structures that define each CPU feature and give a human-readable name to whichever register accesses you need to do to tell whether you have the feature or not. The header at the top of that file has a warning that it is hard to understand due to the heavy macro use, but it works pretty well.
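The general shape of a table-driven feature check can be sketched in a few lines. This is a simplified illustration, not the structures from cpufeature.c: struct ex_cpu_feature, ex_has_feature(), and ex_feat_hpmn0 are hypothetical names, and the field position used for HPMN0 (bits [63:60] of the ID register) is stated here as an assumption for the example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, simplified description of one CPU feature. */
struct ex_cpu_feature {
	const char *name;          /* human-readable name */
	unsigned int field_shift;  /* lowest bit of the ID register field */
	unsigned int field_width;  /* width of the field in bits */
	uint64_t min_value;        /* smallest field value meaning "present" */
};

/* A feature is present when its unsigned ID field is >= min_value. */
static bool ex_has_feature(const struct ex_cpu_feature *f, uint64_t id_reg)
{
	assert(f->field_width > 0 && f->field_width < 64);

	uint64_t mask = (UINT64_C(1) << f->field_width) - 1;

	return ((id_reg >> f->field_shift) & mask) >= f->min_value;
}

/* Illustrative entry; the field position is an assumption. */
static const struct ex_cpu_feature ex_feat_hpmn0 = {
	.name = "HPMN0 (illustrative)",
	.field_shift = 60,
	.field_width = 4,
	.min_value = 1,
};
```

Describing each feature as data means adding a new check is mostly a matter of adding one more table entry.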
The details are more complicated of course, but not by too much. The features I've added over the course of my series were mainly FEAT_ICNTR to check for a dedicated instruction counter, FEAT_FGT2 to check for advanced FGT support, and FEAT_HPMN0 to check whether the CPU supports setting the MDCR_EL2.HPMN field to 0. The current iteration of the patch series has only FEAT_HPMN0. FEAT_FGT2 was added by another developer for a different feature, and I was asked to save FEAT_ICNTR support for another patch.
For now, I was asked to make any access to the instruction counter from the guest an undefined access.
Code Reorganization
The next two patches are code reorganization around the interface between KVM and the PMU.
The first moves some declarations around to fix a tricky circular include problem I encountered earlier in the project. I tried to upstream that fix separately and was snapped at for not doing a comprehensive enough fix. The maintainer, Marc Zyngier, showed me what kind of header reorganization he would have liked. This patch is also somewhat debloated from its original form because the header reorganization was long overdue and was upstreamed separately after all.
The second is moving a bunch of definitions from pmu-emul.c to pmu.c because those functions are not specific to the emulated PMU implementation. They deal directly with the PMU hardware and I will be reusing them for much of my implementation.
PMU Driver Changes
The next three patches are the first meaty parts of the series.
The first defines a kernel command line parameter to request PMU partitioning. Specifying this at boot time with no way to change it isn't an ideal solution. A better one would make the feature completely switchable while the system is running, but the entire perf subsystem contains many assumptions, difficult to identify and change, that the number of PMU counters is static after the PMU hardware is initialized.
The key change here is that the driver tracks which of the 33 possible counters defined by the architecture are implemented on the running system and therefore available to use. I introduce a function armv8pmu_partition() that changes the bit mask tracking those counters when the PMU is initialized. Poof, those counters will now not be selected when the host driver needs to allocate one to measure a perf event on the host.
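The effect of hiding counters from the host can be sketched with plain bit operations. This is not armv8pmu_partition() itself: ex_host_counter_mask() is a hypothetical helper, and it assumes the guest takes the lowest-numbered general-purpose counters, which matches how MDCR_EL2.HPMN splits the counter range architecturally. It only deals with the 31 general-purpose counters and leaves any higher bits in the mask alone.

```c
#include <assert.h>
#include <stdint.h>

/* Number of general-purpose event counters the architecture allows. */
#define EX_MAX_EVENT_COUNTERS 31

/*
 * Hypothetical helper: given the mask of implemented counters and the
 * number of low-numbered counters reserved for the guest, return the
 * mask the host driver may still allocate from.
 */
static uint64_t ex_host_counter_mask(uint64_t implemented,
				     unsigned int nr_guest)
{
	assert(nr_guest <= EX_MAX_EVENT_COUNTERS);

	/* Bits [nr_guest-1:0] belong to the guest; zero when nr_guest is 0. */
	uint64_t guest = (UINT64_C(1) << nr_guest) - 1;

	return implemented & ~guest;
}
```

Because allocation only ever picks bits that are set in the mask, clearing the guest's bits once at initialization is enough to keep the host from choosing those counters later.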
The second is a formality generalizing some bit masks.
The third is the cleanup work to make sure the host driver doesn't touch any of the guest counters accidentally. Some register writes affect all counters on the host. The PMOVS* registers, for example, are a single bit mask covering all implemented counters on the system, even those I previously hid from the host driver. That means I had to introduce some additional masking on those writes to prevent them from writing to the guest counters' bits when resetting the bits for the host counters.
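The masking idea can be shown with a small simulation. Nothing here touches real hardware: ex_pmovs is an ordinary variable standing in for the overflow status register, ex_write_pmovsclr() imitates write-one-to-clear behavior, and ex_host_clear_overflow() is a hypothetical name for the host-side reset path.

```c
#include <stdint.h>

/* Simulated overflow status register: one bit per counter. */
static uint64_t ex_pmovs;

/* Imitates write-one-to-clear: each 1 bit written clears that flag. */
static void ex_write_pmovsclr(uint64_t val)
{
	ex_pmovs &= ~val;
}

/*
 * Hypothetical host-side reset: clear only the overflow flags that
 * belong to host counters and return which ones were pending.
 */
static uint64_t ex_host_clear_overflow(uint64_t host_mask)
{
	uint64_t pending = ex_pmovs & host_mask;

	/* Guest bits are masked out, so their flags are left untouched. */
	ex_write_pmovsclr(pending);

	return pending;
}
```

Without the mask, writing back the whole register value would clear overflow flags on counters the guest owns, and the guest would lose those overflow events.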