Wednesday, May 4, 2016

Freedreno (not so) periodic update

Since I seem to be not so good at finding time for blog updates recently, this update probably covers a greater timespan than it should, and some of this is already old news ;-)

Already quite some time ago, but in case you didn't already notice: with the mesa 11.1 release, freedreno now supports up to (desktop) gl3.1 on both a3xx and a4xx (in addition to gles3).  Which is high enough to show up on the front page at glxinfo.  (Which, btw, is a useful tool to see exactly which gl/gles extensions are supported by which version of mesa on various different hw.)

A couple months back, I spent a bit of time starting to look at performance.  On master now (so will be in 11.3), we have timestamp and time-elapsed query support for a4xx, and I may expose a few more performance counters (mostly for the benefit of gallium HUD).  I still need to add support for a3xx, but already this is useful to help profile.  In addition, I've cobbled together a simple fdperf cmdline tool:



I also got around to (finally) implementing hw binning support for a4xx, which for *some* games can have a pretty big perf boost:
  • glmark2 'refract' bench (an extreme example): 31fps -> 124fps
  • xonotic (med): 44.4fps -> 50.3fps
  • supertuxkart (new render engine): 15fps -> 19fps
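
For a rough sense of scale, the before/after frame rates above can be turned into speedup ratios. This is only a minimal Python sketch; the `speedup` helper is made up for illustration and is not part of any freedreno tooling:

```python
def speedup(before_fps, after_fps):
    """Return (ratio, percent_gain) for a before/after frame rate pair."""
    ratio = after_fps / before_fps
    return ratio, (ratio - 1.0) * 100.0


# Frame rates as quoted in the list above (before, after hw binning).
results = {
    "glmark2 refract": (31.0, 124.0),
    "xonotic (med)": (44.4, 50.3),
    "supertuxkart": (15.0, 19.0),
}

for name, (before, after) in results.items():
    ratio, pct = speedup(before, after)
    print(f"{name}: {ratio:.2f}x ({pct:+.1f}%)")
```

That works out to roughly 4x for the refract bench, about 13% for xonotic and about 27% for supertuxkart, which is why hw binning matters a lot for *some* workloads and much less for others.
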
More recently I've started to run the dEQP gles3 tests against freedreno.  Initially the results were not too good, but since then I've fixed a couple thousand test cases.. fortunately it was just a few bugs and a couple of missing workarounds for hw bugs/limitations (depending on how you look at it) which accounted for the bulk of the fails.  Now we are at 98.9% pass (or 99.5% if you don't count the 'skips' against the pass ratio).  These fixes have also helped piglit, where we are now up to 98.3% pass.  These figures are a4xx, but most of the fixes apply to a3xx as well.
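
The two pass figures differ only in whether skipped tests are counted in the denominator. A small Python sketch of that arithmetic follows; the counts are invented purely for illustration and are not the actual dEQP totals:

```python
def pass_ratio(passed, failed, skipped, count_skips=True):
    """Percentage of passing tests, with skips optionally counted against the total."""
    total = passed + failed + (skipped if count_skips else 0)
    return 100.0 * passed / total


# Made-up counts, only to show how the two ratios diverge.
passed, failed, skipped = 970, 10, 20

print(f"counting skips: {pass_ratio(passed, failed, skipped):.1f}%")
print(f"ignoring skips: {pass_ratio(passed, failed, skipped, count_skips=False):.1f}%")
```
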

I've also made some improvements in ir3 (shader compiler for a3xx and later) so the code it generates is starting to be pretty decent.  The immediate->const lowering that I just pushed helps reduce register pressure in a lot of cases.  We still need support for spilling, but at least now shadertoy (which is some sort of cruel joke against shader compiler writers) isn't a complete horror show:



In other cool news, in case you had not already seen: Rob Herring and John Stultz from linaro have been doing some cool work, with Rob getting android running on an upstream kernel plus mesa running on db410c and qemu (with freedreno and virtgl), and John taking all that, and getting it all running on a nexus7 tablet.  (And more recently, getting wifi working as well.)  I had the opportunity to see this in person when I was at Linaro Connect in March.  It might not seem impressive if you are unfamiliar with the extent to which android device kernels diverge from upstream, but to see an upstream kernel running on an actual device with only ~50 patches is quite a feat:



The UI was actually reasonably fast, despite not yet using overlays to bypass the GPU for composition.  But as the ongoing work on explicit fencing in drm/kms and on EGL_ANDROID_native_fence_sync in mesa lands, we should be able to get hw composition working.



Saturday, August 15, 2015

freedreno - mesa 11.0 progress update, OpenGLES3 and more

So the big news for the upcoming mesa 11.0 release is gl4.x support for radeon and nouveau.  Which has been in the works for a long time, and is a pretty tremendous milestone (and the reason that the next mesa release is 11.0 rather than 10.7).  But on the freedreno side of things, we haven't been sitting still either.  In fact, with the transform-feedback support I landed a couple weeks ago (for a3xx+a4xx), plus MRT+z32s8 support for a4xx (Ilia landed the a3xx parts of those a while back), we now support OpenGLES 3.0[1] on both adreno 3xx and 4xx!!

In addition, with the TBO support that landed a few days ago, plus a handful of other fixes in the last few days, we have the new antarctica gl3.1 render engine for supertuxkart working!


Note that you need to use MESA_GL_VERSION_OVERRIDE=3.1 and MESA_GLSL_VERSION_OVERRIDE=140, since while we support everything that stk needs, we don't yet support everything needed to advertise gl3.1.  (But hey, according to qualcomm, adreno 3xx doesn't even support higher than gles3.0.. I guess we'll have to show them ;-))
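
If you would rather not type the overrides each time, a tiny launcher can set them. This is a minimal Python sketch: the two environment variables are the ones named above, while the helper names and the `supertuxkart` binary name are assumptions that may differ on your install:

```python
import os
import subprocess


def stk_env(base=None):
    """Return a copy of the environment with the two mesa version overrides set."""
    env = dict(os.environ if base is None else base)
    env["MESA_GL_VERSION_OVERRIDE"] = "3.1"
    env["MESA_GLSL_VERSION_OVERRIDE"] = "140"
    return env


def launch(cmd=("supertuxkart",)):
    """Run the game with the overrides applied (binary name is an assumption)."""
    return subprocess.run(list(cmd), env=stk_env(), check=False)
```

Calling `launch()` starts the game with both overrides in place, without touching the environment of the calling shell.
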

The nice thing about seeing this work is that it utilizes pretty much all of the recent freedreno features (transform feedback, MRT, UBO's, TBO's, etc).

Of course, the new render engine is considerably more heavyweight compared to older versions of stk.  But I think there is some low hanging fruit on the stk engine side of things to reclaim some of those lost fps.

update: oh, and the first time around, I completely forgot to mention that qualcomm has recently published *some* gpu docs, for a3xx, for the dragonboard 410c. Not quite as extensive as what broadcom has published for vc4, but it gives us all the a3xx registers, which is quite a lot more than any other SoC vendor has done to date :-)


[1] minus MSAA.. There is a bigger task, which is on the TODO list, to teach mesa st about some extensions to support MSAA resolve on tile->mem.. such as EXT_multisampled_render_to_texture, plus perhaps driconf option to enable it for apps that are not aware, which would make MSAA much more useful on a tiling gpu.  Until then, mesa doesn't check MSAA for gles3, and if it did we could advertise PIPE_CAP_FAKE_SW_MSAA.  Plus, who really cares about MSAA on a 5" 4k screen?


Saturday, July 4, 2015

happy (gpu) independence day

So, I realized it has been a while since posting about freedreno progress, so in honor of US independence day I figured it was as good an excuse as any for an update about independence from gpu blob driver for snapdragon/adreno..

Back at the end of March 2015, I gave a freedreno update presentation at ELC, listing the following major tasks left for gles3 support:
  • Uniform Buffer Objects (UBO)
  • Transform Feedback (TF)
  • Multi-Render-Target (MRT)
  • advanced flow control in shader compiler
 and additionally for gl3:
  • Multisample anti-aliasing (MSAA)
  • NV_conditional_render
  • 32b depth (z32 and z32_s8) (which I forgot to mention in the presentation)
EDIT: Ilia pointed out that 32b depth is needed for gles3 too, and gl3 additionally needs clipdist/etc (which we'll have to emulate, but hopefully can do in a generic nir pass) and rgtc (which will need sw decompression hopefully in mesa core so other drivers for gles class hw can reuse).  Original list was based on what mesa's compute_version() code was checking quite some time back.
 
Since then, we've gained support for UBO's (a3xx by Ilia Mirkin, and a4xx), MRT (for a3xx and core, again thanks to Ilia.. still needs to be wired up for a4xx), 32b depth (a3xx and core, again thanks to Ilia), and I've finished up shader compiler for loops/flow-control for ir3 (a3xx/a4xx).  The shader compiler work was a somewhat larger task than I expected (and I did expect it to be a lot of work), but it also involved moving over to NIR, in addition to re-writing the scheduler and register allocation passes, as well as a lot of re-org to ir3 in order to support multiple basic blocks.  The move to NIR was not strictly required, but it brings a lot of benefits in the form of shared support for conversion to SSA, scalarizing, CSE, DCE, constant folding, and algebraic optimizations.  And I figured it was less work in the long run to move to NIR first and drop the TGSI frontend, before doing all the refactoring needed to support loops and non-lowerable flow-control.  Incidentally, the compiler work should make the shader-compiler part of TF easier (since we need to generate a conditional write to TF buffer iff not overwriting past the end of the TF buffer).
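
The conditional TF write mentioned above boils down to a bounds check before each store. A minimal Python model of that logic follows; this is not ir3 code, the `tf_write` name is hypothetical, and whether the real check is done per vertex or per component is not covered here:

```python
def tf_write(buf, offset, values):
    """Store `values` at `offset` only if the whole write stays inside `buf`."""
    end = offset + len(values)
    if end <= len(buf):
        buf[offset:end] = values
        return True
    return False  # the write is dropped and the buffer is left untouched


# A buffer with room for two 4-component vertices.
buf = [0.0] * 8
for i, vertex in enumerate([[1.0] * 4, [2.0] * 4, [3.0] * 4]):
    tf_write(buf, i * 4, vertex)  # the third write would overflow, so it is skipped
```

In a shader this check has to be expressed as real flow control (or a predicated store), which is why it depends on the compiler work described above.
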

In the meantime, freedreno and drm/msm have also gained support for the a306 gpu found in the new dragonboard 410c.  This board is a nice new low cost ($75) snapdragon community board based on the 64bit snapdragon 410.  And thanks to a lot of work by linaro and qualcomm, the upstream kernel situation for this board is looking pretty good.  It is shipping initially with a 4.0 based kernel (with patches on top for stuff that hadn't yet been merged for 4.0, including a lot of stuff backported from 4.1 and 4.2), including gpu/display/audio/video-codec/etc.  I believe that the 4.1 kernel was the first version where a vanilla kernel could boot on db410c with basic stuff (like serial console) working.  The kernel support for the gpu and display (other than the adv7533 hdmi bridge chip) landed in 4.2.  There is still more work to get *everything* (including audio, vidc, etc) merged upstream, but work continues in that direction, making this quite an exciting board.

Also, we have a GSoC student, Varad, working on freedreno support for android.  It is still in early stages, with some debugging still to do, but he has made a lot of progress and things are starting to work.

And since no blog post is complete without some nice screenshots...  the other day someone pointed me at a post in the dolphin forums about how dolphin was running on a420 (same device as in the ifc6540).  We all had a good laugh about the rendering issues with the blob driver.  But, since dolphin was the first gl3 game that worked with freedreno, I was curious how freedreno would do.. so I fired up the ifc6540 and replayed some dolphin fifo logs that would let me render approximately the same scenes:





Yoshi looks to be rendering pretty well.. digimon has a bit of corruption, but nowhere near as bad as the blob driver.  I suspect the issue with digimon is an instruction scheduling issue in the shader compiler (well, no rest for the gpu driver writers), but nice to see that it is already in pretty good shape.

Now we just need steam store or some unigine demos for arm linux :-P



Sunday, February 22, 2015

a4xx shaping up

So, I finally figured out the bug that was causing some incorrect rendering in xonotic (which, it turns out, is the same bug plaguing a lot of other games/webgl/etc).  The fix is pushed to upstream mesa master (and I guess I should probably push it to the 10.5 stable branch too).  Now that xonotic renders correctly, I think I can finally call freedreno a4xx support usable:



Also, for fun, a little comparison between the ifc6540 board (snapdragon 805, aka apq8084), and my laptop (i5-4310U).  Both have 1920x1080 resolution, both running gnome-shell and firefox (with identical settings).  The laptop is running fedora f21 while the ifc6540 is running rawhide, but it is quite close to an apples-to-apples comparison:



Obviously not a rigorous benchmark, so please don't read too much into the results.  The intel is still faster overall (as it should be at its size/price/power budget), but amazing that the gap is becoming so small between something that can be squeezed into a cell phone and dedicated laptop class chips.

Saturday, December 20, 2014

a4xx in the holiday spirit

Just in time for the upcoming break, we have figured out how to do alpha-test, and now supertuxkart is rendering properly:



If you are wondering about the new stk beta, I have a build from a few weeks back which seems to render properly as well.. a few rough edges, but I think that is just from using a random git commit-id for stk.  But we don't have enough gl3 features yet (on a3xx or a4xx) to be using the new rendering paths.

And gnome-shell works nicely too.  Still some rendering issues with xonotic.  And a little ways behind a3xx in piglit results, but not quite as much as I would have expected at this early stage.

Still missing are some optimizations that are important for certain use-cases (hw-binning support for games, GMEM bypass for UI/mipmap-generation/etc).  But the a420 in apq8084 (ifc6540 board) is surprisingly fast all the same.

Saturday, November 15, 2014

freedreno a4xx

A couple weeks ago, qualcomm (quic) surprised some by sending kernel patches to enable the new adreno 4xx family of GPUs found in their latest SoCs.  Such as the apq8084 powering my ifc6540 board with the a420 GPU.  Note that qualcomm had already sent patches to enable display support for apq8084, merged in 3.17.  And I'm looking forward to more good things from their upstream efforts in the future.

So in the last weeks, in between various other kernel work (atomic-helper conversion and few other misc things for 3.19) and RHEL stuff, I've managed to bang out initial gallium support for a4xx.  There are still plenty of missing things, or stuff hard-coded, etc.  But yesterday I managed to get textures working, and fix RGBA/BGRA confusion, so now enough works for 'gears and maybe about half of glmark2:



I've intentionally pushed it (just now) after the mesa 10.4 branch point, since it isn't quite ready to be enabled by default in distro mesa builds.  When it gets to the point of at least being able to run a desktop environment (gnome-shell / compiz / etc), I may backport to 10.4.  But there is still a lot of work to do.  The good news is that so far it seems quite fast (and that is without hw binning or XA yet even!)

Monday, October 13, 2014

Silly r/e tool nonsense hacks

In the process of reverse engineering work for freedreno, I've cobbled together some interesting tools.  The earliest and most useful of which is cffdump.  (Named after some command-stream dumping debug code in the old kgsl android kernel driver, which originally inspired it.)  The cffdump tool knows how to parse out the "toplevel" command-stream stored as an .rd (re-dump) file, finding packets that load state memory, write registers, IB (indirect branch), etc.  The .rd file contains snapshots of gpu buffers, in order to chase gpu pointers at decode time.  It links in librnn from the nouveau envytools project for the decoding of individual registers, and a few other things.  It also calls out to the freedreno disassembler code to show inline disassembly of shaders, decodes vertex and constant (uniform) buffers, etc.  And even generates pretty color output (thanks to librnn):


A few months back, I added some basic lua scripting support to cffdump, mostly to assist in r/e work for adreno a4xx.  When invoked with the --script argument, cffdump loads the specified lua script, and calls the 'draw' function it defines on each CP_DRAW_INDX opcode.  The choice of lua was mostly because it seemed fairly easy to integrate with .c code.

Since then, I've had the thought in the back of my mind that adding script bindings to integrate rnn register decode to lua would be useful for much more.  Such as writing a command-stream validator to check for inconsistent programming.  There are a number of places where inconsistencies between various register settings and such will result in gpu lockup.  The general adreno design philosophy appears to be to not ever dedicate transistors to making the driver writer's life easier... which for a SoC gpu is certainly the right choice, but it doesn't make things any easier for me.  Over time, I've discovered many of these rules, but they are mostly all in my head at the moment.  And from time to time, when adding new features to the gallium driver, I inadvertently break one or more of the rules and end up wasting time studying cmdstream dumps from the freedreno gallium driver to figure out what I did wrong.
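
As a rough illustration of the validator idea, here is a Python sketch of a rule table checked against the register state at each draw. It is a language-neutral model rather than cffdump's lua bindings. `RB_DEPTH_CONTROL.Z_ENABLE` is a name taken from this post, but `EXAMPLE_DEPTH_BASE` and the rule itself are invented for the example and are not a documented adreno constraint:

```python
# Register state is modelled as a plain dict keyed by "REGISTER.FIELD".
# In a real tool it would come from the decoded command-stream.


def rule_depth_needs_buffer(regs):
    """Hypothetical rule: depth test enabled without a depth buffer address."""
    if regs.get("RB_DEPTH_CONTROL.Z_ENABLE") and not regs.get("EXAMPLE_DEPTH_BASE"):
        return "Z_ENABLE set but no depth buffer programmed"
    return None


RULES = [rule_depth_needs_buffer]


def validate_draw(regs, rules=RULES):
    """Run every rule against the state at a draw and return the complaints."""
    return [msg for msg in (rule(regs) for rule in rules) if msg is not None]
```

Keeping each rule as a small standalone function means that rules which today live only in the driver writer's head can be written down one at a time as they are rediscovered.
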

So, on the way to XDC2014 I started hacking up support for register decoding from lua scripts.  It turns out that time in airports and airplanes, where I can't exactly break out an ifc6410 and hdmi monitor to do some driver work, is a good time to catch up on these sorts of projects.  Now I can do nifty things like:

-- load rnn database file for a320:
r = rnn.init("a320")

function start_cmdstream(name)
  io.write("START: " .. name .. "\n")
end

function draw(primtype, nindx)
  -- simple full register access:
  io.write("GRAS_CL_VPORT_XOFFSET: " .. r.GRAS_CL_VPORT_XOFFSET .. "\n")
  -- access boolean bitfield Z_ENABLE in RB_DEPTH_CONTROL register:
  io.write("RB_DEPTH_CONTROL.Z_ENABLE: " .. tostring(r.RB_DEPTH_CONTROL.Z_ENABLE) .. "\n")
  -- access ROP_CODE bitfield inside CONTROL register inside RB_MRT[] array:
  io.write("RB_MRT[0].CONTROL.ROP_CODE: " .. r.RB_MRT[0].CONTROL.ROP_CODE .. "\n")
end

function end_cmdstream()
  io.write("END\n")
end

function finish()
  io.write("FINISH\n")
end


which will generate output like:

[robclark@thunkpad:~/src/freedreno (master)]$ ./cffdump --script test.lua piglit.rd
Reading piglit.rd...
START: piglit.rd

GRAS_CL_VPORT_XOFFSET: 79.5
RB_DEPTH_CONTROL.Z_ENABLE: true
RB_MRT[0].CONTROL.ROP_CODE: 12

Currently it should handle all of the rnndb constructs that are used for adreno.  Ie. simple registers, arrays of simple registers, arrays of groups of registers, etc.  No support for "stripes" yet since those are not used for freedreno.

At the moment, all the script bindings are in freedreno.git/util/script.c but if there is some interest in this from nouveau or anyone else using librnn then it would be a good idea to try to refactor some of this into more generic code in librnn.  It would still need a bit of glue from the tool linking librnn to get at the actual register values.

Still needed are a few more script hooks (such as CP_LOAD_STATE) to do everything I need for a validator script.  Hopefully I find some time to work on that before the next conference ;-)

PS. I hope this post is at least a bit coherent.. I am still a bit jetlagged..