Abstract
I'll be porting/implementing the PMCTools kit from FreeBSD to Minix. PMCTools uses on-board hardware performance counters to track down and identify performance sinks. Those measurements can then be used to improve the speed/efficiency of Minix.
Design
This project is broken up into three parts:
The driver, which will mediate between applications and the performance counters. As much of this as possible runs in userspace, but the actual instructions must be executed by the kernel in Ring-0.
The library/API, a simple and clean interface to the driver that lets users add performance monitoring functionality to their programs.
Analysis software, which takes data generated by the driver and library and turns it into useful statistics.
The first part will be a lot of assembly (and even some binary/hex for the instructions not in the assembler). The rest should be mostly C.
Finer design points (I'll add to this as I get to the latter parts of the project)
hwpmc
hwpmc is a kernel module: a set of hooks, functions, structures, and data that gets loaded into the FreeBSD kernel. When the kernel catches a pmc call, it hands control to the hwpmc code. My pmc server is going to be used as a surrogate kernel: it will catch (PMC) messages and run the appropriate hwpmc code. Initially I'll cut out most of the functionality and just get a working model that implements system-wide counting; after I finish an end-to-end prototype, I'll expand functionality from there.
pmclib
This is a set of library procedures that make appropriate system calls to the FreeBSD kernel. I'll have to modify these to make system calls to my surrogate kernel (my pmc server), and I'll have more news on that later.
Test Plan and Evaluation
Test-driven development is the plan, so I'll try to put tested and proved features here.
Schedule and Deliverables
Throughout the Summer:
(pre-Coding and through midterm and final exam periods and afterwards as well)
Deliverables: Create a blog for the purpose of day-to-day work tracking, so my mentor(s) and the community can keep tabs on where I am and what I'm doing.
Update the Wiki with information I find useful that may be of use to other users/developers.
Pre-Coding Period:
Become very familiar with the Intel and AMD processor manuals (http://www.intel.com/products/processor/manuals/). These are the references on which hwpmc is based. hwpmc is the hardware driver that actually touches the processor's performance monitor counters, so this is VERY architecture specific. I will only be dealing with a subset of all of the architectures covered by hwpmc currently, as it already has the *86 architectures covered (as well as ARM and a bunch of others).
Contact the PMCTools developers about possibly collaborating on parts of this project. One of the bigger ToDo's for PMCTools is finding better ways of presenting and analysing the data, which is what I propose to do in the latter part of the summer. It would be awesome to contribute to two Open Source projects with one summer project.
Thoroughly read through the PMCTools source code. After final exams I will switch my primary workstation over to FreeBSD/Minix to facilitate development. Here is something I can do that touches on the three main packages (hwpmc, libpmc, and pmcstat) without actual coding for my project:
Week 0: Install and use PMCTools under FreeBSD. Write some simple unit tests that confirm that the hwpmc driver can successfully communicate with my processor. Also write some unit tests that demonstrate functions in libpmc. Finally use pmcstat to measure the performance of specific processes, multiple processes of the same program, and the system as a whole.
Deliverables: Demonstrated use of PMCTools (hwpmc, libpmc, pmcstat) in FreeBSD via simple unit tests.
Coding begins:
Week 1: Start Porting hwpmc for i686 (http://www.freebsd.org/cgi/cvsweb.cgi/src/sys/dev/hwpmc/).
Method:
Implement a pmc server that is the device driver for the PMCs. Hopefully this will be very nearly a port of the functionality of hwpmc (organizing the use of PMCs).
Kernel must execute WRMSR (Write MSR) and RDMSR (Read MSR) at Ring-0 to set and read PMCs and Events.
pmc server must set and read PMCs, outputting data in a format compatible with FreeBSD's PMCTools.
For per-process monitoring, add code to the scheduler that reads the PMC values (so we can compare before and after a process has run).
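As a rough illustration of the WRMSR/RDMSR step above: both instructions move a 64-bit value through the EDX:EAX register pair, with the MSR address in ECX. The sketch below is not the hwpmc or Minix kernel code; the names msr_split, msr_join, cpu_rdmsr and cpu_wrmsr are hypothetical, and the inline assembly assumes a GCC-compatible compiler on x86. The two privileged wrappers fault unless they run at Ring-0, so only the split/join helpers can be exercised from userland.

```c
#include <stdint.h>

/* Split a 64-bit MSR value into the EDX:EAX halves that WRMSR expects. */
static inline void msr_split(uint64_t value, uint32_t *lo, uint32_t *hi)
{
    *lo = (uint32_t)(value & 0xFFFFFFFFu);
    *hi = (uint32_t)(value >> 32);
}

/* Join the EDX:EAX halves returned by RDMSR into one 64-bit value. */
static inline uint64_t msr_join(uint32_t lo, uint32_t hi)
{
    return ((uint64_t)hi << 32) | lo;
}

#if defined(__GNUC__) && (defined(__i386__) || defined(__x86_64__))
/* Privileged: faults unless executed at ring 0. */
static inline uint64_t cpu_rdmsr(uint32_t msr)
{
    uint32_t lo, hi;
    __asm__ volatile("rdmsr" : "=a"(lo), "=d"(hi) : "c"(msr));
    return msr_join(lo, hi);
}

/* Privileged: faults unless executed at ring 0. */
static inline void cpu_wrmsr(uint32_t msr, uint64_t value)
{
    uint32_t lo, hi;
    msr_split(value, &lo, &hi);
    __asm__ volatile("wrmsr" : : "c"(msr), "a"(lo), "d"(hi));
}
#endif
```

Keeping the split/join logic separate from the privileged instructions means that part can be unit tested on any machine, which fits the test-driven plan.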
Deliverable:
Demonstrate working PMCs: Activate, Set events, and monitor each PMC. Unit tests that exercise this.
pmc server that can signal the kernel to: Set events, and set PMCs. Unit tests that exercise this, then read the values back to verify them.
pmc server that can read PMC values and output with PMCTools' format. Unit tests that exercise this; comparison output from FreeBSD-Current.
Accomplished:
New server in boot image (for fixed endpoint). This server will handle the hwpmc functionality. (It's just called pmc.)
New message types and PMC_* messages for the server to handle. m10 has two 64-bit unsigned ints (because MSRs and counters are 40-64 bits long)
PMC Server can signal the kernel with: SYS_WRMSR, SYS_RDMSR, SYS_SETPCE. Kernel functions to carry out these.
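To make the message flow above more concrete, here is a minimal sketch of a request carrying two 64-bit fields (MSR address and value) and a dispatcher that handles read and write requests. Everything in it is an assumption for illustration: struct pmc_msg, the PMC_RDMSR_REQ/PMC_WRMSR_REQ codes and pmc_dispatch are hypothetical names, not the real Minix message layout or the actual PMC_* message numbers. The registers are simulated in a small table instead of being passed to the kernel, so the logic can be exercised without counter hardware.

```c
#include <stdint.h>
#include <stddef.h>

/* Hypothetical request codes; the real PMC_* numbers are not shown here. */
enum { PMC_RDMSR_REQ = 1, PMC_WRMSR_REQ = 2 };

/* Hypothetical payload mirroring the "two 64-bit unsigned ints" layout. */
struct pmc_msg {
    int      type;   /* request code */
    uint64_t addr;   /* MSR address */
    uint64_t value;  /* value to write, or value read back */
};

/* Simulated register file standing in for real MSRs. */
#define FAKE_MSR_COUNT 8
struct fake_msr { uint64_t addr; uint64_t value; int used; };
static struct fake_msr fake_msrs[FAKE_MSR_COUNT];

/* Find the slot for addr; optionally allocate a free slot if none exists. */
static struct fake_msr *fake_lookup(uint64_t addr, int create)
{
    struct fake_msr *free_slot = NULL;
    for (size_t i = 0; i < FAKE_MSR_COUNT; i++) {
        if (fake_msrs[i].used && fake_msrs[i].addr == addr)
            return &fake_msrs[i];
        if (!fake_msrs[i].used && free_slot == NULL)
            free_slot = &fake_msrs[i];
    }
    if (create && free_slot != NULL) {
        free_slot->used = 1;
        free_slot->addr = addr;
        free_slot->value = 0;
        return free_slot;
    }
    return NULL;
}

/* Handle one request: 0 on success, -1 on unknown request or full table. */
int pmc_dispatch(struct pmc_msg *m)
{
    struct fake_msr *r;
    switch (m->type) {
    case PMC_WRMSR_REQ:
        r = fake_lookup(m->addr, 1);
        if (r == NULL)
            return -1;
        r->value = m->value;
        return 0;
    case PMC_RDMSR_REQ:
        r = fake_lookup(m->addr, 0);
        m->value = (r != NULL) ? r->value : 0;  /* unwritten registers read as 0 */
        return 0;
    default:
        return -1;
    }
}
```

In the real server, the two cases would forward to the SYS_WRMSR and SYS_RDMSR kernel calls instead of touching a table.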
ToDo:
DONE: level0(func, void *args), run functions (with arguments) at ring0 (Already mostly done with this, just finishing up)
Unit tests for above functionality. Create new organized directory for these to live in.
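The level0(func, void *args) idea can be sketched as a function pointer plus one opaque argument, with the caller packing everything the routine needs into a struct. This is only a userland stand-in under that assumption: level0_sketch, struct wrmsr_args and do_wrmsr are hypothetical names, and the actual switch to ring 0 is left out.

```c
#include <stdint.h>

/* Signature of a routine that the kernel would run at ring 0. */
typedef void (*level0_fn)(void *args);

/* Stand-in for level0(): a real kernel would switch to ring 0 before the
 * call and back afterwards; this sketch just calls the function directly. */
static void level0_sketch(level0_fn func, void *args)
{
    func(args);
}

/* Arguments for one MSR write, packed so they fit behind a single void *. */
struct wrmsr_args {
    uint32_t msr;    /* MSR address */
    uint64_t value;  /* value to write */
    int      done;   /* set by the callee so the caller can check it ran */
};

static void do_wrmsr(void *p)
{
    struct wrmsr_args *a = p;
    /* The privileged WRMSR instruction would go here. */
    a->done = 1;
}
```

A unit test can then call level0_sketch(do_wrmsr, &args) and check the done flag, which exercises the argument passing without needing ring 0.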
Week 2:
Quickly finish the ToDo above.
Start directly porting hwpmc code. The kernel calls implemented (wrmsr, rdmsr) match PMCTools kernel calls (FreeBSD calls of the same name and function).
Build the rest of the framework on the pmc server so the hwpmc port matches the FreeBSD hwpmc as closely as possible.
Because the test bed machine is not working yet, implement a small PMC demo in the server to allow testing of functionality even without access to counters.
Unit tests to demonstrate the above
Week 3-4: System-wide Counting
Working development machine (Pentium 4, not the Core2 Duo in the proposal; that's still being worked on)
Systemwide counting capabilities. P4 has 18 (0-17) performance counters that can be used concurrently.
Started doing preliminary performance measurements. See 'currently counting' below.
PMC server handles WRMSR/RDMSR/SETPCE calls; all working w/ unit tests (which by extension test and verify the new level0() with arguments).
New pmc driver statically assigns PMCs/events; has the advantage of not changing events every context switch
Started porting libpmc; modifying to use statically assigned PMCs
Currently Counting: Branching (total branches versus mispredict branches). I'd really love to hear what people want to have measured, so please let me know if you have ideas.
Testing Scenarios: Idle (just sitting there) and during my regular workflow.
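For the branch measurements, the useful number is the change in each counter over an interval, not its raw value. Since the counters are narrower than 64 bits, a difference has to tolerate wraparound. The helpers below are a sketch with hypothetical names (pmc_delta, mispredicts_per_thousand); the counter width is passed in because it depends on the processor, and only a single wrap between two reads is handled.

```c
#include <stdint.h>

/* Difference between two reads of a counter that is `width` bits wide,
 * allowing for a single wraparound between the reads. */
uint64_t pmc_delta(uint64_t before, uint64_t after, unsigned width)
{
    uint64_t mask = (width >= 64) ? UINT64_MAX : ((UINT64_C(1) << width) - 1);
    return (after - before) & mask;
}

/* Mispredictions per 1000 branches using integer math; returns 0 when no
 * branches were counted. Assumes mispredicts * 1000 fits in 64 bits, which
 * holds for counters of 40-48 bits. */
uint64_t mispredicts_per_thousand(uint64_t branches, uint64_t mispredicts)
{
    if (branches == 0)
        return 0;
    return (mispredicts * 1000) / branches;
}
```

Comparing this rate between the idle and regular-workflow scenarios gives a first rough picture of how workload affects branch prediction.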
Known Issues: (I'm not mentioning all of the C99 etc issues with the FreeBSD code here)
Week 5: Started Porting libpmc
Working simple pmc driver; counting/reading/assignment
Utilizes some of the hwpmc macros/structures for building event assignments
Event construction is still largely manual
Started implementing libpmc API functions via simple calls to the pmc driver (the driver has a subset of hwpmc functionality, so some API calls are null functions)
Week 6: userland PMC allocation via libpmc API
Userland programs can now call the proc_allocate_pmc functions for the 'p4' and 'iaf' processors (Pentium 4 and Intel Core/Core2 Fixed Function counters).
Explained in more detail at (Documentation): http://ajray.wordpress.com/2009/07/07/proc_allocate_pmc-functionality/
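For the 'iaf' fixed-function counters, allocation comes down to filling in a small per-counter field in one shared control register. The sketch below is not the libpmc code: iaf_set_field is a hypothetical helper, and the register address and bit meanings (4 bits per counter in IA32_FIXED_CTR_CTRL, with OS, USR and PMI flags) are my reading of the Intel manual and should be checked against it.

```c
#include <stdint.h>

#define IA32_FIXED_CTR_CTRL 0x38Du  /* control MSR shared by the fixed counters */
#define IA32_FIXED_CTR0     0x309u  /* first fixed counter; the others follow */

#define IAF_OS   0x1u  /* count while at ring 0 */
#define IAF_USR  0x2u  /* count while at ring > 0 */
#define IAF_PMI  0x8u  /* raise an interrupt on overflow */

/* Return `ctrl` with the 4-bit field for fixed counter `n` replaced by `flags`. */
uint64_t iaf_set_field(uint64_t ctrl, unsigned n, unsigned flags)
{
    unsigned shift = 4u * n;
    ctrl &= ~(UINT64_C(0xF) << shift);           /* clear the old field */
    ctrl |= (uint64_t)(flags & 0xFu) << shift;   /* install the new one */
    return ctrl;
}
```

For example, iaf_set_field(0, 1, IAF_OS | IAF_USR) yields 0x30, the value that would then be written to the control register with WRMSR. Because all fixed counters share one control register, the helper does a read-modify-write on the current value, not a fresh build of the whole register.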
Week 7: more libpmc allocation functions (it goes without saying, but there's more detail at http://ajray.wordpress.com)
Expanded to half of the Core(2) arch (the fixed function counters) and the AMD arch. Syntax is the same as the p4_allocate_pmc() function.
These aren't completely debugged; I don't have physical access to these machines, so I'm going back and forth with my mentor to try to get them working on his machine.
All of the ported libpmc functions are now extremely verbose; on any kind of fatal error they should explain why they are failing (with a message that's both human readable and should help locate the point in the libpmc.c code where the failure occurred).
Added a small amount of verbosity to the system calls (WRMSR, SETPCE) which are printf statements in the system task. You should get feedback (on tty0) from executing either one of those calls (and wrmsr should tell you what it wrote and where).
Added some small commandline programs that can be used to manually control PMCs (and I'm using them for debugging). They are:
A setpce program to enable userland RDPMC (and should trigger the system task to print a message about SETPCE on tty0)
Some setevent programs that take command-line input and set the event registers/control registers for PMCs, also enabling them at the same time so (in theory) their respective counters should start counting. There is one for each processor arch currently supported in libpmc (p4, iaf, amd)
test_libpmc programs that use the libpmc *_allocate_pmc() functions to generate useful input for the setevent programs. There is one for each arch, and each has a specific test event that has been manually checked against the event configuration I get by hand from the respective manual.
rdpmc commandline program that prints the contents of the selected PMC to stdout. Also acts as an implicit test for userland RDPMC (and by extension, SETPCE)
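The relationship between SETPCE and userland RDPMC in the programs above can be sketched as follows. SETPCE amounts to setting the PCE bit (bit 8) in CR4, and RDPMC returns the selected counter in EDX:EAX with the counter index in ECX. The names cr4_enable_pce and read_pmc are hypothetical, the inline assembly assumes a GCC-compatible compiler on x86, and read_pmc will fault if called from userland before PCE has been set.

```c
#include <stdint.h>

#define CR4_PCE (1u << 8)  /* CR4.PCE: allow RDPMC outside ring 0 */

/* Return a CR4 value with the PCE bit set; the kernel would load it into CR4. */
uint32_t cr4_enable_pce(uint32_t cr4)
{
    return cr4 | CR4_PCE;
}

#if defined(__GNUC__) && (defined(__i386__) || defined(__x86_64__))
/* Read counter `index` with RDPMC. Faults in userland unless CR4.PCE is set,
 * which is why a successful read also confirms that SETPCE worked. */
static inline uint64_t read_pmc(uint32_t index)
{
    uint32_t lo, hi;
    __asm__ volatile("rdpmc" : "=a"(lo), "=d"(hi) : "c"(index));
    return ((uint64_t)hi << 32) | lo;
}
#endif
```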
Last arch I'm going to port (libpmc-wise) for now is the other half of Core(2), called 'iap'. These are the general-purpose counters on Core(2) procs (there are only two of them), and have one event register each, similar to the AMD procs. After that I can start working on porting userland functions that do something useful with the output of libpmc functions (pmcstat, etc.).
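Because each general-purpose counter has a single event register, building an event for 'iap' (and, in its low 32 bits, for the AMD counters) is mostly bit packing. The sketch below is not iap_allocate_pmc: evtsel_encode and the EVTSEL_* names are hypothetical, and the field positions (event in bits 0-7, unit mask in bits 8-15, USR/OS/EN flags at bits 16, 17 and 22) are my reading of the processor manuals and should be verified there.

```c
#include <stdint.h>

#define EVTSEL_USR  (UINT32_C(1) << 16)  /* count at ring > 0 */
#define EVTSEL_OS   (UINT32_C(1) << 17)  /* count at ring 0 */
#define EVTSEL_EN   (UINT32_C(1) << 22)  /* enable the counter */

/* Pack an event code, unit mask and flag bits into an event-select value. */
uint32_t evtsel_encode(uint8_t event, uint8_t umask, uint32_t flags)
{
    return (uint32_t)event | ((uint32_t)umask << 8) | flags;
}
```

As an example, assuming 0xC4 with unit mask 0x00 is the retired-branches event on Core(2), evtsel_encode(0xC4, 0x00, EVTSEL_USR | EVTSEL_OS | EVTSEL_EN) gives 0x4300C4. A value built this way is the kind of input the setevent programs take.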
Week 8
IAP Arch is in libpmc. Some test scripts remain but the function (iap_allocate_pmc) is there.
I still haven't gotten positive confirmation on counting on AMD/Core2 platforms, so I'll probably end up working on those some more (hopefully just minor fixes).
From here I will start building the statistical/frontend of PMCTools to do actual analysis with the counters.
Week 9
pmcstat porting: mostly done. I thought I could get this done within a week; it's been a week and I'm nearly there. Some quick notes on it:
System-wide counting mode only for now (process-virtual counting can be added, but the pmcstat code for that is too bsd-specific to be useful)
Very much not-reentrant and not-multiprocess-safe for overlapping counters. In no way can more than one of these be run simultaneously on the same counters (though you can have separate ones on separate counters). What identifies a counter depends on the architecture: for sane architectures (K8, Core(2) IAF/IAP) it is the counter number; for Pentium 4's it is the row/index and event register multisets.
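One possible way to guard against two runs landing on the same counter would be a small ownership bitmap, where a counter must be claimed before use and a second claim is refused. This is not something the current port does; pmc_claim and pmc_release are hypothetical names. The sketch assumes the counter is identified by a plain number (the K8 and Core(2) case) and that claims are handled one at a time, so no locking is shown. The Pentium 4 case would need a richer key than a counter number.

```c
#include <stdint.h>

static uint32_t pmc_in_use;  /* bit n set means counter n is claimed */

/* Claim counter n: 0 on success, -1 if out of range or already claimed. */
int pmc_claim(unsigned n)
{
    if (n >= 32 || (pmc_in_use & (UINT32_C(1) << n)))
        return -1;
    pmc_in_use |= UINT32_C(1) << n;
    return 0;
}

/* Release counter n so that a later claim can succeed. */
void pmc_release(unsigned n)
{
    if (n < 32)
        pmc_in_use &= ~(UINT32_C(1) << n);
}
```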
Week 10
pmcstat built. More notes on it:
Logging is disabled for now (the actual hardware measurements are of more interest than the formatted output); results are plainly printed instead.
Callchain capture/mapping (in the kernel or otherwise) is disabled for now while I look into it. That is, however, something I can focus on if it's desired.
Now to add architecture support to it (probably P4, K8, IAF, IAP, in that order)
Week 11
Updates
I'll be updating my branch constantly as I work on the code, and I'll be keeping track of my progress in my blog. I'll also be haunting the IRC channel and the mailing list all summer, so feel free to contact me there as well (ajray on irc.freenode.net).
Weekly Status
Daily Status
License Info
I will be keeping clear distinctions between the PMCTools and my personal code, and I'll be releasing my code under the BSD license.