Saturday, July 30, 2016

dirty tricks for moar fps!

This weekend I landed a patchset in mesa to add support for resource shadowing and batch re-ordering in freedreno.  What this is, will take a bit of explaining, but the tl;dr: is a nice fps boost in many games/apps.

But first, a bit of background about tiling gpu's:  the basic idea of a tiler is to render N draw calls a tile at a time, with a tile's worth of the "framebuffer state" (ie. each of the MRT color bufs + depth/stencil) resident in an internal tile buffer.  The idea is that most of your memory traffic is to/from your color and z/s buffers.  So rather than rendering each of your draw calls in it's entirety, you split the screen up into tiles and repeat each of the N draws for each tile to fast internal/on-chip memory.  This avoids going back to main memory for each of the color and z/s buffer accesses, and enables a tiler to do more with less memory bandwidth.  But it means there is never a single point in the sequence of draws.. ie. draw #1 for tile #2 could happen after draw #2 for tile #1.  (Also, that is why GL_TIMESTAMP queries are bonkers for tilers.)

For purpose of discussion (and also how things are named in the code, if you look), I will define a tile-pass, ie. rendering of N draws for each tile in succession (or even if multiple tiles are rendered in parallel) as a "batch".

Unfortunately, many games/apps are not written with tilers in mind.  There are a handful of common anti-patterns which force a driver for a tiling gpu to flush the current batch.  Examples are unnecessary FBO switches, and texture or UBO uploads mid-batch.

For example, with a 1920x1080 r8g8b8a8 render target, with z24s8 depth/stencil buffer, an unnecessary batch flush costs you 16MB of write memory bandwidth, plus another 16MB of read when we later need to pull the data back into the tile buffer.  That number can easily get much bigger with games using float16 or float32 (rather than 8 bits per component) intermediate buffers, and/or multiple render targets.  Ie. two MRT's with float16 internal-format plus z24s8 z/s would be 40MB write + 40MB read per extra flush.

So, take the example of a UBO update, at a point where you are not otherwise needing to flush the batch (ie. swapbuffers or FBO switch).  A straightforward gl driver for a tiler would need to flush the current batch, so each of the draws before the UBO update would see the old state, and each of the draws after the UBO update would see the new state.

Enter resource shadowing and batch reordering.  Two reasonably big (ie. touches a lot of the code) changes in the driver which combine to avoid these extra batch flushes, as much as possible.

Resource shadowing is allocating a new backing GEM buffer object (BO) for the resource (texture/UBO/VBO/etc), and if necessary copying parts of the BO contents to the new buffer (back-blit).

So for the example of the UBO update, rather than taking the 16MB+16MB (or more) hit of a tile flush, why not just create two versions of the UBO.  It might involve copying a few KB's of UBO (ie. whatever was not overwritten by the game), but that is a lot less than 32MB?

But of course, it is not that simple.  Was the buffer or texture level mapped with GL_MAP_INVALIDATE_BUFFER_BIT or GL_MAP_INVALIDATE_RANGE_BIT?  (Or GL API that implies the equivalent, although fortunately as a gallium driver we don't have to care so much about all the various different GL paths that amount to the same thing for the hw.)  For a texture with mipmap levels, we unfortunately don't know at the time where we need to create the new shadow BO whether the next GL calls will glGenerateMipmap() or upload the remaining mipmap levels.  So there is a bit of complexity in handling all the cases properly.  There may be a few more cases we could handle without falling back to flushing the current batch, but for now we handle all the common cases.

The batch re-ordering component of this allows any potential back-blits from the shadow'd BO to the new BO (when resource shadowing kicks in), to be split out into a separate batch.  The resource/dependency tracking between batches and resources (ie. if various batches need to read from a given resource, we need to know that so they can be executed before something writes to the resource) lets us know which order to flush various in-flight batches to achieve correct results.  Note that this is partly because we use util_blitter, which turns any internally generated resource-shadowing back-blits into normal draw calls (since we don't have a dedicated blit pipe).. but this approach also handles the unnecessary FBO switch case for free.

Unfortunately, the batch re-ordering required a bit of an overhaul about how cmdstream buffers are handled, which required changes in all layers of the stack (mesa + libdrm + kernel).  The kernel changes are in drm-next for 4.8 and libdrm parts are in the latest libdrm release.  And while things will continue to work with a new userspace and old kernel, all these new optimizations will be disabled.

(And, while there is a growing number of snapdragon/adreno SBC's and phones/tablets getting upstream attention, if you are stuck on a downstream 3.10 kernel, look here.)

And for now, even with a new enough kernel, for the time being reorder support is not enabled by default.  There are a couple more piglit tests remaining to investigate, but I'll probably flip it to be enabled by default (if you have a new enough kernel) before the next mesa release branch.  Until then, use FD_MESA_DEBUG=reorder (and once the default is switched, that would be FD_MESA_DEBUG=noreorder to disable).

I'll cover the implementation and tricks to keep the CPU overhead of all this extra bookkeeping small later (probably at XDC2016), since this post is already getting rather long.  But the juicy bits: ~30% gain in supertuxkart (new render engine) and ~20% gain in manhattan are the big winners.  In general at least a few percent gain in most things I looked at, generally in the 5-10% range.