(This was Thursday, but I got a bit distracted and failed to post here earlier.)
Did some more work on the profiler, primarily fixing some bugs and cleaning up code enough to commit it (here/here/here).
If anyone wants to test it, you should recompile the game then launch it and press F11 to enable the new profiler's HTTP output (it's disabled by default for performance and security), then open the file source/tools/profiler2/profiler2.html in a web browser. (I've only tested Firefox 7; browser compatibility isn't a priority). Then it should work, with luck.
One extra feature is that the code can add text annotations to each profile region, which are displayed as tooltips:

Holding the mouse over the "convert texture [...]" region pops up the box saying which texture it's converting. The "compress" in this image is happening in a secondary thread, so it overlaps many frames. (In practice the compression is done before running the game and cached, so this isn't normally a real slowdown). I think this makes it particularly useful for improving the game's startup/load performance, since we can see what files are taking a long time when they're first loaded.
Day not-counted
I subsequently played around with a few more things (not committed yet), mostly for fun rather than to be useful for the game, so they don't count as proper work. In particular, the profiler can display the GPU like a separate thread (using the GL_ARB_timer_query extension, supported on most NVIDIA and ATI/AMD cards with the proprietary drivers):

Here the CPU is the top group of coloured boxes, the GPU is the bottom group. This gives a nice view of the fancy things that graphics drivers do. At the start of the timeline (on the left), the CPU is rendering frame 4824 while the GPU is rendering frame 4822. That means the CPU is submitting a load of rendering commands, which get queued up for 10-20msec before the GPU (which is busy rendering an earlier frame) gets around to processing those commands and doing the actual rendering. On frame 4827 I did something to delay the CPU a bit. The GPU carried on rendering the queued frames, but then had to pause until frame 4827 was ready to render. After that, the CPU and GPU times wobble around a bit (but the GPU maintains a fairly smooth framerate) until the GPU has synchronised back to 2 frames behind the CPU.
One annoyance is that Intel drivers don't support GL_ARB_timer_query, and Intel GPUs are where optimisation matters most (since they're so slow). It turns out that the most recent ones have an extension called GL_INTEL_performance_queries, which sounded interesting, but there's no specification and it seems nobody has ever written anything about it. So I fiddled with it a bit and wrote some documentation of the extension. Not sure it's hugely useful, but at least it can give a reasonable indication of the time the GPU spends on each part of the rendering. (It only gives relative times, not absolute, so unfortunately it messes up the timeline view a bit, but it's better than nothing.)














