Wednesday, February 5, 2014

freedreno: new compiler

Complementing the hw binning support which landed earlier this year, and is now enabled by default, I've recently pushed the initial round of new-compiler work to mesa.  Initially I was going to keep it on a branch until I had a chance to sort out a better register allocation (RA) algorithm, but the improved instruction scheduling fixed so many bugs that I decided it should be merged in its current form.

Or explained another way, ever since fedora updated to supertuxkart 0.8.1, about half the tracks had rendering problems and/or triggered gpu hangs.  The new compiler fixed all those problems (and more).  And I like supertuxkart :-)

Background:

The original a3xx compiler was more of a simple TGSI translator.  It translated each TGSI opcode into a simple sequence of one or more native instructions.  There was a fixed (per-shader) mapping between TGSI INPUT, OUTPUT, and TEMP vec4 register files to the native (flat) scalar register file.  A not-insignificant part of the code was relatively generic (in concept but not implementation) lowering of TGSI opcodes that relate more closely to old ARB shader instructions (SCS - Sine Cosine, LIT - Light Coefficients, etc.) than to the instruction set of any modern GPU.

The simple TGSI translator approach works fine with simple shader ISAs.  It worked ok for a2xx, other than slightly suboptimal register usage.  But the problem is that a3xx (and a4xx) is not such a simple instruction set architecture.  In particular, the instruction scheduling requires that the compiler be aware of the shader instruction pipeline(s).

This was obvious pretty early on in the reverse engineering stage.  But in the early days of the gallium a3xx support, there were too many other things to do... spending the needed time on the compiler then was not really an option.  Instead the "use lots of nop's and hope for the best" strategy was employed.

And while it worked as a stop-gap solution, it turns out that there are a lot of edge cases where "hope for the best" does not really work out that well in practice.  After debugging a number of rendering bugs and piglit failures which all traced back to instruction scheduling problems, it was becoming clear that it was time for a more permanent solution.

In with the new:

The first thing I wanted to do, before adding a lot more complexity, was to rip out a bunch of code.  With that in mind I implemented a generic TGSI lowering pass, to replace about a dozen opcodes with sequences of equivalent simpler instructions.  This probably should be made configurable and moved to util, since I think most of the lowerings would be useful to other gallium drivers.
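To give a feel for what such a lowering looks like, here is a toy sketch in Python (not the actual mesa code).  Instructions are modelled as hypothetical (opcode, dst, src) tuples, and only SCS is handled, using its TGSI definition (dst.x = cos(src.x), dst.y = sin(src.x), dst.z = 0, dst.w = 1).  Writemasks and swizzles are ignored, and the "TEMP[tmp]" scratch register name is made up for the example:

```python
def lower(instrs):
    """Replace compound opcodes with sequences of simpler ones; pass the rest through."""
    out = []
    for opc, dst, src in instrs:
        if opc == "SCS":
            if src == dst:
                # COS would clobber src.x before SIN reads it, so copy it
                # to a (hypothetical) scratch register first.
                out.append(("MOV", "TEMP[tmp].x", src + ".x"))
                src = "TEMP[tmp]"
            out += [
                ("COS", dst + ".x", src + ".x"),
                ("SIN", dst + ".y", src + ".x"),
                ("MOV", dst + ".z", "0.0"),
                ("MOV", dst + ".w", "1.0"),
            ]
        else:
            out.append((opc, dst, src))
    return out


lowered = lower([("SCS", "TEMP[0]", "IN[0]"), ("MOV", "OUT[0]", "TEMP[0]")])
for ins in lowered:
    print(ins)
```

The point is only that the rest of the compiler never has to see SCS at all; it just sees COS, SIN and MOV.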

Once the handling of the now unneeded TGSI opcodes was removed, I copied fd3_compiler to fd3_compiler_old.  Originally the plan was to remove this before pushing upstream.  I just wanted a way to compare the results from the original compiler to the new compiler to help during testing and debugging.  But currently shaders with relative addressing need to fall back to the old compiler, so it stays for now.

The next step was to turn ir3 (the a3xx IR), which originates from the fdre-a3xx shader assembler, into something more useful.  The approach I settled on (mostly to ease the transition) was to add a few extra "meta-instructions" to hold some additional information which would be needed in later passes, including Φ (Phi) instructions where a result depends on flow control.  Plus a few extra instruction and register flags, the important one being IR3_REG_SSA, used for src register nodes to indicate that the register node points to the instruction it depends on.  Now what used to be the compiler (well, roughly 2/3rds of it) is the front-end.  Instead of producing a linear sequence of instructions fed directly to the assembler/codegen, the front-end now generates a graph of instructions, which is modified by subsequent passes until we have something suitable for codegen.

For each output, we keep the pointer to the instruction which generates that value (at the scalar level), which in turn has the pointer to the instructions generating its srcs/inputs, and so on.  As before, the front end is generating sequences of scalar instructions for each (written) component in a TGSI vector instruction.  Although now instructions whose result is not used simply have nobody pointing to them, so they naturally vanish.
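A minimal sketch of that idea, in Python rather than the real ir3 structures (`Instr` and `reachable` are hypothetical names for illustration): each instruction holds references to the instructions producing its srcs, and anything that cannot be reached by walking back from an output is simply never visited.

```python
from dataclasses import dataclass, field


@dataclass(eq=False)
class Instr:
    """Hypothetical stand-in for an instruction node in the graph."""
    opc: str
    srcs: list = field(default_factory=list)  # producers of each src (the SSA edges)


def reachable(outputs):
    """Collect every instruction that some shader output depends on."""
    seen, found = set(), []
    stack = list(outputs)
    while stack:
        instr = stack.pop()
        if id(instr) in seen:
            continue
        seen.add(id(instr))
        found.append(instr)
        stack.extend(instr.srcs)
    return found


# x and y feed an add that ends up in an output; the mul result is never used.
x = Instr("input.x")
y = Instr("input.y")
add = Instr("add.f", [x, y])
unused = Instr("mul.f", [x, y])

live = reachable([add])
print(sorted(i.opc for i in live))
```

Nothing has to go looking for `unused` in order to delete it; no later pass ever gets handed a pointer to it.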

At the same time, mostly to preserve my sanity while debugging, but partially also to make nifty pictures, I implemented an "ir3 dumper" which would dump out the graph in .dot syntax:


The first pass eliminates some redundant moves (some of which come from the front end, some from TGSI itself).  Probably the front end could be a bit more clever about not inserting unneeded moves, but since TGSI has separate INPUT/OUTPUT/TEMP register files, there will always be some extra moves which need eliminating.
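Roughly, this pass amounts to rewiring src pointers so that they skip over plain moves.  The sketch below is a simplification under my own assumptions, not the real pass: `Instr`, `skip_movs` and `eliminate_moves` are made-up names, and a "redundant" move is taken to be any `mov` with exactly one instruction src.  A real pass has to keep moves that actually do some work.

```python
from dataclasses import dataclass, field


@dataclass(eq=False)
class Instr:
    """Hypothetical stand-in for an instruction node in the graph."""
    opc: str
    srcs: list = field(default_factory=list)


def skip_movs(instr):
    """Follow a chain of plain moves back to the instruction that made the value."""
    while instr.opc == "mov" and len(instr.srcs) == 1:
        instr = instr.srcs[0]
    return instr


def eliminate_moves(outputs):
    """Rewrite outputs and src edges so that they bypass redundant moves."""
    outputs = [skip_movs(o) for o in outputs]
    seen, stack = set(), list(outputs)
    while stack:
        instr = stack.pop()
        if id(instr) in seen:
            continue
        seen.add(id(instr))
        instr.srcs = [skip_movs(s) for s in instr.srcs]
        stack.extend(instr.srcs)
    return outputs


# INPUT -> TEMP -> TEMP copies in front of an add, and a TEMP -> OUTPUT copy after it.
inp = Instr("input.x")
tmp1 = Instr("mov", [inp])
tmp2 = Instr("mov", [tmp1])
add = Instr("add.f", [tmp2, tmp2])
outs = eliminate_moves([Instr("mov", [add])])
print(outs[0].opc, [s.opc for s in outs[0].srcs])
```

Once nothing points at the moves any more, they drop out of the graph the same way any other unused instruction does.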

After that, I calculate a "depth" for each instruction, where the depth is the number of instruction cycles/slots required to compute that value:

    dd(instr, n): depth(instr->src[n]) + delay(instr->src[n], instr)
    depth(instr): 1 + max(dd(instr, 0), ..., dd(instr, N))

where delay(p,c) gives the required number of instruction slots between an instruction which produces a value and an instruction which consumes a value.
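As a worked example, here is the same recurrence in Python.  The `Instr` class is a hypothetical stand-in for the real structures, and the delay values are invented for illustration, not real a3xx latencies.  An instruction with no srcs gets a depth of 1, which is my reading of the formula with an empty max:

```python
from dataclasses import dataclass, field


@dataclass(eq=False)
class Instr:
    """Hypothetical stand-in for an instruction node in the graph."""
    opc: str
    srcs: list = field(default_factory=list)


# Made-up numbers: slots needed between a producer and its consumer.
RESULT_DELAY = {"input": 0, "mov": 1, "mul.f": 3, "add.f": 3}


def delay(producer, consumer):
    """Required slots between producer and consumer (assumed to depend only on the producer)."""
    return RESULT_DELAY[producer.opc]


def depth(instr, memo=None):
    """depth(instr) = 1 + max over srcs of (depth(src) + delay(src, instr))."""
    memo = {} if memo is None else memo
    if id(instr) not in memo:
        dd = [depth(src, memo) + delay(src, instr) for src in instr.srcs]
        memo[id(instr)] = 1 + max(dd, default=0)
    return memo[id(instr)]


a = Instr("input")
b = Instr("input")
mul = Instr("mul.f", [a, b])    # 1 + max(1 + 0, 1 + 0) = 2
add = Instr("add.f", [mul, a])  # 1 + max(2 + 3, 1 + 0) = 6
print(depth(mul), depth(add))   # prints "2 6"
```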

The depth is used for scheduling.  The short version of how it works is to recursively schedule output instructions with the greatest depth until no more instructions can be scheduled (more delay slots needed).  For instructions with multiple inputs/srcs, the unscheduled src instruction with the greatest depth is scheduled first.  Once we hit a point where there are some delay slots to fill, we switch to the next deepest output, and so on until the needed delay slots are filled.  If there are no instructions that can be scheduled, then we insert nop's.
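The following is a simplified, slot-by-slot rendition of those priorities in Python, not the scheduler that was merged.  Everything in it (`Instr`, `schedule`, the delay table) is a hypothetical stand-in, the delays are invented, and it ignores all a3xx-specific constraints.  For each slot it tries the deepest output first, descends into the deepest unscheduled src, and falls back to the next output, or a nop, if nothing is ready:

```python
from dataclasses import dataclass, field


@dataclass(eq=False)
class Instr:
    opc: str
    srcs: list = field(default_factory=list)


RESULT_DELAY = {"mov": 1, "mul.f": 3, "add.f": 3}  # made-up numbers


def delay(producer, consumer):
    return RESULT_DELAY[producer.opc]


def depth(instr):
    return 1 + max((depth(s) + delay(s, instr) for s in instr.srcs), default=0)


def schedule(outputs):
    pos = {}   # id(instr) -> slot index in seq
    seq = []   # linear sequence; the string "nop" fills unavoidable gaps

    def ready(instr):
        # Every src must be scheduled, with enough slots in between.
        return all(id(s) in pos and len(seq) - pos[id(s)] - 1 >= delay(s, instr)
                   for s in instr.srcs)

    def find(instr):
        # Something schedulable at or below instr, deepest unscheduled src first.
        if id(instr) in pos:
            return None
        pending = [s for s in instr.srcs if id(s) not in pos]
        for src in sorted(pending, key=depth, reverse=True):
            found = find(src)
            if found is not None:
                return found
        return instr if ready(instr) else None

    order = sorted(outputs, key=depth, reverse=True)
    while any(id(o) not in pos for o in order):
        instr = next((i for i in map(find, order) if i is not None), None)
        if instr is None:
            seq.append("nop")
        else:
            pos[id(instr)] = len(seq)
            seq.append(instr)
    return seq


m0, m1, m2 = Instr("mov"), Instr("mov"), Instr("mov")
mul_a = Instr("mul.f", [m0, m1])
out0 = Instr("add.f", [mul_a, m0])   # depth 7
out1 = Instr("mul.f", [m2, m2])      # depth 3
seq = schedule([out1, out0])
print([i if i == "nop" else i.opc for i in seq])
# ['mov', 'mov', 'mov', 'mul.f', 'mul.f', 'nop', 'nop', 'add.f']
```

In this toy case the shallower output (`out1` and its mov) gets slotted into the gaps left by the deeper one, and only two nops remain before the final add.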

Once the graph is scheduled, we have a linear sequence of instructions, at which point we do RA.  I won't say too much about that now, since it is already a long post and I'll probably change the algorithm.  It is worth noting that some register assignment algorithms can coalesce unneeded moves.  Moves factor into the scheduling decisions for the a3xx ISA, though, so I'm not really sure that this is too useful to me.

The end result: thanks to a combination of removing scalar instructions that calculate unused TGSI vec4 register components, removing unnecessary moves, and scheduling other instructions rather than filling with nop's everywhere, it is not uncommon for non-trivial shaders to see the compiler use ~33% of the number of instructions, and half the number of registers.

Testing/Debugging:

Validating compilers is hard.  Piglit has a number of tests to exercise relatively specific features.  But with games, it isn't always the case that an incorrect shader produces (visually) incorrect results.  And visually incorrect results are not always straightforward to trace back to the problem.  I.e. games typically have many shaders and many draw calls, so tracking down the problematic draw and its shaders is not always easy.

So I wrote a very simplistic emulator for testing the output of the compiler.  I captured the TGSI dumps of all the shaders from various apps (ST_DEBUG=tgsi).  The test app would assemble the TGSI, feed it into both the old and new compiler, then run the same sets of randomized inputs through the resulting shaders and compare outputs.
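The comparison loop itself is simple.  A sketch of the idea in Python, where plain callables stand in for "compile with compiler X and run in the emulator" (the function name, the input range and the tolerance are all assumptions for illustration, not what the actual test app does):

```python
import math
import random


def compare_shaders(old_shader, new_shader, num_inputs, runs=100, seed=0, tol=1e-5):
    """Run both shaders on the same random inputs; return the first mismatch, or None."""
    rng = random.Random(seed)  # fixed seed so that a failure can be reproduced
    for _ in range(runs):
        inputs = [rng.uniform(-10.0, 10.0) for _ in range(num_inputs)]
        old_out = old_shader(inputs)
        new_out = new_shader(inputs)
        same = len(old_out) == len(new_out) and all(
            math.isclose(a, b, rel_tol=tol, abs_tol=tol)
            for a, b in zip(old_out, new_out)
        )
        if not same:
            return inputs, old_out, new_out
    return None


# Stand-ins for "the same shader built by two compilers".
def old(v):
    return [v[0] * v[1] + v[2]]


def new(v):
    return [v[2] + v[1] * v[0]]      # reordered, same result


def broken(v):
    return [v[0] * v[1] - v[2]]      # a miscompile


print(compare_shaders(old, new, 3))
print(compare_shaders(old, broken, 3) is not None)
```

Returning the offending inputs along with both outputs matters more than the pass/fail result, since that is what you need to replay a single failing case by hand.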

There are a few cases where differing output is expected, since the new compiler has slightly more well defined undefined behaviour for shaders that use uninitialized values... to avoid invalid pointers in the graph produced by the front-end, uninitialized values get a 'mov Rdst, immed{0.0}' instruction.  So there are some cases where the resulting shader needs to be manually validated.  But in general this let me test (and debug) the new compiler with 100's of shaders in a relatively short amount of time.

Performance:

So the obvious question, what does this all mean in terms of performance?  Well, start with the easy results, es2gears[1]:
  • original compiler: ~435fps
  • new compiler: ~539fps

With supertuxkart, the result is a bit easier to show in pictures.  Part of the problem is that the tracks that are heavy enough on the GPU to not be purely CPU limited didn't actually work before with the original compiler.  That plus, as far as I know, there is no simple benchmark mode which spits out a number at the end, as with xonotic.  So I used the trace points + timechart approach, mentioned in a previous post.

    supertuxkart -f --track fortmagma --profile-laps=1

I manually took one second long captures, in as close to the same spot as possible (just after light turns green):

    ./perf timechart record -a -g -o stk-apq8074-opt+bin-1.data sleep 1

In this case I was running on an apq8074/a330 device, fwiw.  Our starting point is:


Then once hw binning is in place, we are starting to look more CPU limited than anything:


And with the addition of the new compiler, the GPU is idle more of the time, but since the GPU is no longer the bottleneck (on the less demanding tracks) there isn't too much change in framerate:


Still, it could help power if the GPU can shut off sooner, and other levels which push the GPU harder benefit.

With binning plus the improved compiler, there should not be any more huge performance gaps compared to the blob compiler.  Without linux blob drivers, there is no way to make a real apples to apples comparison, but the remaining things that could be improved should be a few percent here and there.  Which is a good thing.  There are still plenty of missing features and undiscovered bugs, I'm sure.  But I'm hopeful that we can at least have things in good shape for a3xx before the first a4xx devices ship ;-)


-----
[1] Windowed apps would benefit somewhat from XA support in DDX, avoiding stall for GPU to complete before sw blit (memcpy) to front buffer.. but the small default window size for 'gears means that hw binning does not have much impact.  The remaining figures are for fullscreen 1280x720.

14 comments:

  1. Rob, it is possible to benchmark SuperTuxKart easily. You could easily run phoronix-test-suite benchmark supertuxkart to have it all automated... or dig through: http://openbenchmarking.org/innhold/9ac6cb71a0ce2e482bda16ad7c731bf925b0cf91 to see the commands/arguments needed to pass to SuperTuxKart (ignore the irrelevant warsow lines I just noticed in there, which are due to a copy-paste mess). Otherwise email me if you have questions.

    -- Michael

    Replies
    1. oh, heh, I completely missed that the --profile modes printed out the FPS. I'm getting 24.2/23.0/20.9 (opt+bin, bin, orig).

      Although those seem a bit low compared to what I see with gallium HUD (which itself causes a 1-2fps penalty). Not sure if those figures include the loading time.. or maybe thermal throttling is kicking in?

      btw, are the graphics portions of the phoronix-test-suite ported to armv7? That could be interesting to get going, although not really sure that I want to compile a whole bunch of games myself..

    2. hmm, ok, the ifc6410 (with a *slightly* overclocked cpu/gpu) and performance governor gives some more sensible numbers. The apq8074 *should* be faster, but I guess thermal is kicking in and limiting it.

      binning+compiler: 26.7fps
      binning: 25.6fps
      orig: 14.7fps

  2. Hi Rob,

    With that new kernel did you do anything different? It seems I get netdev errors which randomly boot me off gbe and a whole heap of battery spam!

    Could you link or paste a quick guide to how you're creating the .img files?

    Cheers

    Mike

    Replies
    1. quick answer on boot img creation (which I guess I need to add to wiki).. here is what I use:

      abootimg --create ifc6410-boot.img -k ../msm/arm/arch/arm/boot/zImage -f ./bootimg.cfg -r initramfs-3.4.0-g0f0fed7-00108-gf736ca1-dirty.img

      (and 'abootimg -x' to extract an existing boot img to get ramdisk, zImage, cfg, etc)

      about battery spam I either change loglevel to 0 to filter it out, or comment it out locally in my kernel. (Other than being annoying, it isn't hurting anything.)

      Not sure if the netdev thing you mention is related to this: https://github.com/freedreno/freedreno/wiki/Ifc6410#wiki-making-networkmanager-behave ?

      At any rate, I have noticed the serial driver seems a bit iffy.. a heavy flood of log msgs to uart seems like it can trigger hard lockup.

    2. Thanks for the boot img stuff I will take a look.

      Also huge thanks for what you are doing and most importantly the principle behind it on freedreno!

      I already implemented the NetworkManager Faq stuff when rsyslog was killing cpu cycles. The fix for rsyslog (if anyone else has it) was simply to delete the offending journal in:

      /var/log/journal

      Then reboot.

      As for netdev here is the actual issue:

      https://plus.google.com/108473591761180051109/posts/MB3EXo1PGuE

      or tinypic if you don't have g+

      http://tinypic.com/r/fw0u3q/8

      The issue seems pretty generic and I haven't had chance to google it yet (got a newborn :0). When I find the answer I'll add it here. But it only started to appear when I started using your -OC.img.

      I have a heatsink and fan on my board powered by the DC psu in a mini itx so I was hoping to get some free performance!

      Cheers

      Mike

    3. hmm, interesting. I remember seeing a similar looking backtrace when I unplug ethernet cable (but otherwise seems to be harmless). Wonder if the overclock img is causing some issues there. At any rate, normal variances between chips could mean some do better w/ overclocking vs others. Also, it is the cold time of year around these parts, which might also explain why I've had good luck so far with overclocking.

      But if the overclock kernel causes issues, it is probably safest not to use it. With latest gallium driver, even with default clocks, from glmark2 results I am getting we are still the fastest arm board out there. (Well, to be fair, most of the results I can find are from various odroid and similar type boards with mali-400mp, and that is already pretty old/slow.)

    4. So once I got rid of all the battery spam NetworkManager can't start as it is unable to write to the ifcfg-rh file.

      Since I spent ages messing about with mesa and missing libs I reinstalled. Instead of doing yum update I left it at defaults which means the network card is working. So the update breaks something...

      However every time I reboot the device now the MAC address is different... which I find very odd.

      I have had it running for the last 24-48 hours, totally stable no crashes.

    5. I disabled NetworkManager.service and enabled network.service and added an ifcfg-enp1s0

      config in

      /etc/sysconfig/network-scripts

      this also fixed the mac address:

      DEVICE=enp1s0
      TYPE=Ethernet
      ONBOOT=yes
      NM_CONTROLLED=no
      HWADDR=enp1s0_mac_address (find it in ifconfig)
      PEERDNS=yes
      PEERROUTES=yes
      IPADDR=the_ip_you_want
      GATEWAY=your_router
      DNS1=your_router

      nano (cause vi sucks)
      /etc/resolv.conf

      and add

      nameserver what_you_put_dns1_as

  3. fwiw, for MAC addr (for ethernet), I just hack the kernel locally. Since it doesn't have a MAC in eeprom, it always generates a random MAC at boot:

    ----------------
    diff --git a/drivers/net/ethernet/atheros/atl1c/atl1c_hw.c b/drivers/net/ethernet/atheros/atl1c/atl1c_hw.c
    index bd1667c..79cf163 100644
    --- a/drivers/net/ethernet/atheros/atl1c/atl1c_hw.c
    +++ b/drivers/net/ethernet/atheros/atl1c/atl1c_hw.c
    @@ -220,8 +220,27 @@ int atl1c_read_mac_addr(struct atl1c_hw *hw)
             int err = 0;

             err = atl1c_get_permanent_address(hw);
    -        if (err)
    -                random_ether_addr(hw->perm_mac_addr);
    +        if (err) {
    +                hw->perm_mac_addr[0] = 0x5e;
    +                hw->perm_mac_addr[1] = 0x86;
    +                hw->perm_mac_addr[2] = 0x53;
    +                hw->perm_mac_addr[3] = 0xc5;
    +                hw->perm_mac_addr[4] = 0xa9;
    +                hw->perm_mac_addr[5] = 0x29;
    +        }

             memcpy(hw->mac_addr, hw->perm_mac_addr, sizeof(hw->perm_mac_addr));
             return err;
    ----------------

  4. Since the subject of kernels came up I can just state that we have a mainline kernel booting on the Qualcomm APQ chips these days, however I only have the APQ8660-based board, here is my HOWTO (which now actually works):
    http://www.df.lth.se/~triad/krad/dragonboard/

    We need a lot more stuff in there before it's operational for graphics I suspect, like regulators which are in turn behind a remote call interface on another chip. But all clocks are in place for most chips.

    Replies
    1. hey, cool.. fwiw, here is where I got to with my last attempt (a couple months ago) for mainline kernel on ifc6410:

      http://cgit.freedesktop.org/~robclark/linux/log/?h=msm-8064-drm

      fwiw, buried under some work-in-progress msm/drm stuff is earlyprintk debug uart support for apq8064. I'd started creating a dt file, and did some quick/dirty porting of drivers from 3.4 android kernel (plus some dt adaptation). At least gpio's work, and I seem to get the clocks/regulators/etc that I need. The interface clk for hdmi block seems ok, but something isn't right (no ddc and no hpd debounce, although if I read hpd pin directly as gpio I can detect plug/unplug..), so maybe functional clk not ok? I need to find time to rebase this on latest stuff and have another go.

    2. So I don't know exactly what is needed for your work, but pin control incl. GPIO should be in place upstream for some SoCs but the device tree changes are not in yet. Björn Andersson from SONY and some Qualcomm people are preparing patches for the rpm thing that controls regulators and root clocks, RTC etc, so a lot of infrastructure work is happening right now. Here is an interesting tree:
      https://github.com/andersson/kernel/commits/v3.14-rc2/qcom-smd

    3. The infrastructure requirements display/gpu are a bit tricky/unclear.. because of parent clks, parent supplies, etc, there is more than one clk/regulator/etc driver that needs to be in place for things to work. But having rpm stuff in place should be a big help.

      The stuff in my branch is earlier version of the stuff from Björn and co. There is one root clk controlled by rpm (at least on apq8064), although in theory this clk should be on from the bootloader.
