It's been a while since I've posted an update about the progress of freedreno.. so no major/big headlines, just lots of small stuff.
Mesa 10
I finally polished up the support for emulating (via index buffer) GL_QUADS and other desktop GL primitives which aren't supported in hardware by adreno. This is needed for gnome-shell and compiz (and probably other compositing window managers using opengl). The u_primconvert utility could be handy in case any of the other upcoming drivers for SoC GPUs need to emulate any GL primitives which are not in GLES. This, plus some other fixes needed for latest gnome-shell in fedora 20, were merged prior to the mesa 10.0 branch point, meaning that once Mesa 10 trickles into distributions, you should be able to use distro packaged freedreno rather than needing to rebuild mesa from git.
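The index-buffer emulation can be illustrated with a minimal sketch. This is not the u_primconvert code: the function name `quads_to_triangle_indices` is hypothetical, and it ignores details a real driver has to care about (for example which vertex is the provoking vertex for flat shading). It only shows the idea of leaving the vertex data alone and generating indices that draw each quad as two triangles.

```python
def quads_to_triangle_indices(num_vertices):
    """Return a triangle-list index buffer that draws each quad as two triangles.

    Vertices are assumed to be laid out quad after quad: (v0, v1, v2, v3), ...
    """
    indices = []
    # Trailing vertices that do not form a complete quad are ignored.
    complete = num_vertices - (num_vertices % 4)
    for base in range(0, complete, 4):
        v0, v1, v2, v3 = base, base + 1, base + 2, base + 3
        indices += [v0, v1, v2]  # first triangle of the quad
        indices += [v0, v2, v3]  # second triangle, sharing the v0-v2 diagonal
    return indices


print(quads_to_triangle_indices(8))
```

For a draw with 8 vertices (two quads) this produces 12 indices, which the hardware can then draw as a plain indexed triangle list.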
Piglit
Since the last blog post, I've added support for relative addressing (needed by chromium gl rendering, and a bunch of piglit tests), and fixed a whole bunch of little bugs or missing bits. And I've started publishing piglit results. Don't read too much into the absolute numbers: the all_es2 tests from Tom Gall's gles2-all branch still have a number of bogus tests (ie. shaders with precision specifier issues, etc), so not all the failures are freedreno bugs. But there has been an increase in passes (and no more crashers) over the last few months.
I do really badly need a better collection of GLES2 tests ;-)
Boards
The IFC6410 is finally shipping out in larger numbers, as more folks in #freedreno are starting to receive their boards. This board has been my primary freedreno dev platform for a while now. If you are looking for a nice small SBC type ARM board with open source graphics, this is a pretty sweet little board. Pico-itx, APQ8064 (1.5GHz quad core krait + adreno 320), 2GiB DDR3, SATA and gigabit-ethernet (hooked up via pci-e, not usb :-)). Only downside is upstream kernel support for APQ8064 is pretty non-existent[1], there is only a downstream msm-3.4 based kernel (see ifc6410-drm branch).
And more recently I received a bStem board. This board is more targeted at robotics (bunch of sensors, FPGA, and various add on boards for motor/RC control, etc). But it has APQ8060A (1.7GHz dual core krait + adreno 320), and the typical hdmi and usb connectors. I've pushed initial kernel msm drm/kms support to the bstem-drm branch. I'm using the same Fedora 20 filesystem that I use with the ifc.
Notes: [1] APQ8x74 (aka snapdragon 800) seems to be getting into better shape in upstream kernel, so hopefully we start seeing APQ8074 versions of some of these boards at some point.
Adreno 4xx
Last week qualcomm announced their first adreno 420 device. We knew this was coming, since support has been starting to show up in qualcomm's downstream android kernel driver (kgsl) in the last few months. It unfortunately doesn't contain nearly as many useful hints as kgsl did for 2xx and 3xx, but it does give us a few register names. And fwiw, more recent versions of the android blob userspace GLES drivers appear to have support for 4xx.
The recent announcements don't give too many details, but previously leaked specs indicate a DX11 feature-set, and this seems to be backed up by a handful of register names we can see from the downstream kgsl driver (ie. hull/tessellator/domain/geometry shaders, etc).
From what I can tell so far, 4xx appears to use the same shader ISA as 3xx (phew!), but pretty much all registers change or at least move, and there are a lot more features in hw. So hopefully it shouldn't take as long to figure out compared to 3xx (which had both a new shader ISA plus register reshuffling).. at least for getting the basics running.
Since the recent blob drivers have 4xx support, it should be possible to make a reasonable amount of progress on 4xx r/e before we can get our hands on actual devices. Of course, there is still much to do on 3xx, so for the time being 4xx is not a priority.
Mailing List, etc
Since more folks are starting to play with freedreno (on IFC6410 and other devices), the whole email-questions-directly-to-rob thing is starting to look like it might not scale too well in the long run. And, asking questions on IRC doesn't work out too well if you don't have a bip or screen setup to keep your connection alive until someone has a chance to wake up and answer. So now we have a mailing list: http://lists.freedesktop.org/mailman/listinfo/freedreno
That plus steadily improving docs and info on the wiki should hopefully help.
Now that the msm drm/kms kernel driver is merged upstream, I've spent the last few weeks on a bit of a debugging / fixing spree. (Yes, an odd way to start a post about performance/profiling.) I added proper support for mipmaps/cubemaps/etc (multi-slice resources), killed a few gpu lockup bugs, installed a bunch of games and went looking for and fixing rendering issues. I've put together a status table on the freedreno wiki.
In the process, I noticed that some games, such as supertuxkart, which had low fps, also had unusually low gpu utilization (30-50%). Now, a new graphics driver stack will always have lots of room for optimization (which is certainly true of freedreno). The key is to know which optimization to work on first. It does no good to make the shader compiler generate 2x faster shaders (which I think is currently possible) if that is just going to take you from 30-50% utilization to 15-25% utilization at roughly the same fps. So before we get to the fun optimizations, we need to take care of any of the cpu side bottlenecks in the driver.
Now the linux perf tool is pretty nice just for identifying purely cpu bottlenecks. In fact it showed me pretty quickly that the upstream IOMMU framework struggles with gpu type workloads. Mapping/unmapping individual pages is not really the way to do it. On the downstream msm-3.4 based android kernel, we have iommu_map_range() and iommu_unmap_range()[1]... using these instead is worth 2-3 fps in xonotic, and probably more in supertuxkart, but we'll come back to that.
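To put toy numbers on the utilization argument above: the sketch below assumes a deliberately crude model in which the frame time is set by whichever of the cpu side or gpu side takes longer. The function names and the millisecond values are made up for illustration; they are not measurements from supertuxkart.

```python
def frame_time_ms(cpu_ms, gpu_ms):
    # Crude assumption: the slower of the two sides determines the frame time.
    return max(cpu_ms, gpu_ms)


def gpu_utilization(cpu_ms, gpu_ms):
    # Fraction of each frame during which the gpu is actually busy.
    return gpu_ms / frame_time_ms(cpu_ms, gpu_ms)


def fps(cpu_ms, gpu_ms):
    return 1000.0 / frame_time_ms(cpu_ms, gpu_ms)


# Hypothetical cpu-bound frame: 40 ms of cpu side work, 16 ms of gpu work.
before = (40.0, 16.0)
# Same frame with shaders that run 2x faster: gpu work drops to 8 ms.
after = (40.0, 8.0)

for label, (cpu_ms, gpu_ms) in (("before", before), ("after", after)):
    print(label, gpu_utilization(cpu_ms, gpu_ms), fps(cpu_ms, gpu_ms))
```

In this model the faster shaders halve the utilization while the fps does not move at all, which is why the cpu side bottleneck has to go first.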
But perf tool does not really help much with gpu or cpu/gpu interactions, at least not by itself. So, first I added some trace points in the kernel drm/kms driver.. in particular, I put tracepoints:
tracing the fence # when work is submitted to the gpu, and when we get the completion interrupt.
tracing the fence # when cpu waits on a fence and when it finishes waiting
and when pageflip is requested and when it completes (after rendering completes and after vsync)
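A sketch of how fence trace events like these could be turned into a gpu-busy timeline. The event tuples and the names `gpu_busy_intervals` and `utilization` are hypothetical (this is not the format of the msm trace events, nor how perf timechart does it), and it assumes a single ring that retires fences in submission order.

```python
def gpu_busy_intervals(events):
    """Turn (timestamp, kind, fence) events into (start, end) busy intervals.

    kind is "submit" (work handed to the gpu) or "complete" (completion irq).
    """
    submit_time = {}
    intervals = []
    last_complete = float("-inf")
    for ts, kind, fence in sorted(events):
        if kind == "submit":
            submit_time[fence] = ts
        elif kind == "complete" and fence in submit_time:
            # A job cannot start before the previous job on the ring finished.
            start = max(submit_time.pop(fence), last_complete)
            intervals.append((start, ts))
            last_complete = ts
    return intervals


def utilization(intervals, capture_start, capture_end):
    # Busy time divided by the length of the capture window.
    busy = sum(end - start for start, end in intervals)
    return busy / (capture_end - capture_start)


# Made-up trace: two back-to-back jobs, an idle gap, then a third job.
events = [
    (0.0, "submit", 1), (1.0, "submit", 2),
    (4.0, "complete", 1), (6.0, "complete", 2),
    (10.0, "submit", 3), (12.0, "complete", 3),
]
intervals = gpu_busy_intervals(events)
print(intervals, utilization(intervals, 0.0, 16.0))
```

The gap between the second completion and the third submit is exactly the kind of idle time that shows up as a hole in the gpu bar of a timechart.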
And then I hacked up the perf timechart tool to display gpu information in the timechart, for a nice timeline overview. Currently I have it looking for the msm trace events, but I think that it would be useful to have a small set of generic trace events which all the drm drivers can use, so that tools won't have to be looking for driver specific traces. I think what I have is a reasonable start, but it probably needs a bit of work to handle gpu's that have multiple rings, etc. With that, I fired up supertuxkart again (in demo mode so it will drive itself), and then ran perf timechart record for a couple of seconds to capture a short trace:
You can see above, there is a new bar at the top, below the cpu bars, for the gpu, showing when the gpu is active. And a green overlay bar on the gpu showing where pageflip has been requested (typically right after rendering is submitted), and when pageflip completes (next vblank after rendering completes). And below, in the per-process bars, a yellow overlay marker when the process is pending on a fence (waiting for some gpu rendering to complete). And immediately we can see that the bottleneck is a fence that supertuxkart is stalling on before it is able to submit rendering for the next frame.
After a little bit of poking, I realized that I should implement support for PIPE_TRANSFER_DISCARD_WHOLE_RESOURCE in the freedreno gallium driver. If this usage bit is set, it is a hint to the gallium driver that the previous buffer contents do not need to be preserved after the upload. So in cases where the backing gem buffer object (bo) is still busy (referenced by previous rendering which is not yet complete), it is better to just delete the bo and create a new one, rather than stalling the cpu. The drm driver holds a ref for bo's that are associated with gpu rendering which has not yet completed, so the pages for the old bo don't go away until the gpu is finished with them.
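The discard-whole-resource idea can be sketched roughly as follows. This is a toy model, not the gallium transfer API: the `Bo` and `Resource` classes, the `busy` flag and `map_for_write()` are all hypothetical stand-ins.

```python
class Bo:
    """Stand-in for a gem buffer object."""

    def __init__(self, size):
        self.size = size
        self.busy = False  # True while unfinished gpu rendering references it


class Resource:
    def __init__(self, size):
        self.bo = Bo(size)
        self.stalls = 0  # how many times the cpu had to wait for the gpu

    def _wait_idle(self):
        # Placeholder for blocking on the fence of the last rendering
        # that used this bo.
        self.stalls += 1
        self.bo.busy = False

    def map_for_write(self, discard_whole_resource=False):
        if self.bo.busy:
            if discard_whole_resource:
                # Old contents are not needed: swap in a fresh bo instead of
                # stalling. In the real stack the kernel's reference keeps the
                # old bo's pages alive until the gpu is finished with them.
                self.bo = Bo(self.bo.size)
            else:
                self._wait_idle()
        return self.bo


res = Resource(4096)
old = res.bo
old.busy = True  # pretend the previous frame still references it
new = res.map_for_write(discard_whole_resource=True)
print(new is old, res.stalls)
```

Without the discard hint the same call has to wait for the gpu, which is the stall visible as the yellow marker in the timechart.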
With this change, things have improved, but there is still a bottleneck:
(note that the timescale differs between these three timecharts, since the capture duration differed)
Oddly we see a lot of activity on kworker (workqueue worker thread in the kernel). This is mainly retire_worker, in particular releasing the reference that the driver holds to bo's for rendering which is now completed. After a bit more digging, it turns out that supertuxkart is creating on the order of 150-200 transient buffers per frame. Unref'ing these, unmapping from IOMMU and cpu, and deleting backing pages for that many buffers takes some time. Even with some optimization in the kernel, there is still going to be a lot of overhead in the associated vma setup/teardown (since many of these buffers are used for vertex/attribute upload, and will need to be mmap'd), zeroing out pages before the next allocation, etc.
So borrowing an idea from i915, I implemented a bo cache in userspace, in libdrm_freedreno. On new allocations, we round up to the next bucket size, and if there is an unused buffer in the bucket cache which is not still busy, we take that buffer instead of allocating a new one. (If I add a BO_FOR_RENDERING flag, like i915, I could take a still-busy gem bo for cases where I know cpu access will not be needed... by the time the gpu starts writing to the buffer, it will be no longer busy.) With this, things look much better:
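A minimal sketch of a bucketed bo cache along these lines. It is not the libdrm_freedreno code: the `Bo` and `BoCache` classes, the `busy` flag and the bucket sizes are assumptions chosen for illustration, and `kernel_allocs` merely counts how often a real implementation would have had to go to the kernel.

```python
import bisect


class Bo:
    """Stand-in for a gem buffer object."""

    def __init__(self, size):
        self.size = size
        self.busy = False  # True while the gpu still references the buffer


class BoCache:
    def __init__(self, bucket_sizes):
        self.sizes = sorted(bucket_sizes)
        self.buckets = {size: [] for size in self.sizes}
        self.kernel_allocs = 0

    def _bucket_size(self, size):
        # Round up to the next bucket size; None if larger than every bucket.
        i = bisect.bisect_left(self.sizes, size)
        return self.sizes[i] if i < len(self.sizes) else None

    def alloc(self, size):
        bucket = self._bucket_size(size)
        if bucket is not None:
            for i, bo in enumerate(self.buckets[bucket]):
                if not bo.busy:
                    # Reuse an idle cached buffer instead of allocating.
                    return self.buckets[bucket].pop(i)
        self.kernel_allocs += 1
        return Bo(bucket if bucket is not None else size)

    def release(self, bo):
        # Keep bucket-sized buffers around; oversized ones are simply dropped.
        if bo.size in self.buckets:
            self.buckets[bo.size].append(bo)


cache = BoCache([4096, 8192, 16384])
a = cache.alloc(5000)   # rounded up to the 8192 bucket
cache.release(a)
b = cache.alloc(6000)   # served from the cache, no new allocation
print(b is a, cache.kernel_allocs)
```

Rounding up wastes some memory per buffer, but it is what lets a 6000 byte request reuse a buffer that was originally allocated for 5000 bytes.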
As you can see, the gpu is nearly continuously occupied. And a nice benefit is a drop in cpu utilization. To do this properly, I need to add a MADVISE style ioctl in msm drm/kms driver, so userspace can advise the kernel that it is keeping a bo around in a cache, and that the kernel is free to free the backing pages under memory pressure, tear down the cpu mapping, etc. This will prevent the wrath of the OOM killer :-)
So now with the bottlenecks in the driver worked out, future work to make the gpu render faster (ie, hw binning pass, shader compiler optimizations, etc) will actually bring a meaningful benefit.
Notes: [1] just fwiw, the ideal IOMMU API would give me a way to make multiple map/unmap updates without tlb/etc flush. This should be even better than the map/unmap_range variants. I know when I'm submitting rendering jobs which reference the buffers to the GPU, so I have good points for a batch IOMMU update flush.
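As a toy illustration of why batching would help: the model below assumes, purely for the sake of the example, that every individual map or unmap call costs one tlb flush, while queued updates cost a single flush when the batch is committed. `ToyIommu` and its methods are hypothetical and do not correspond to any kernel IOMMU interface.

```python
class ToyIommu:
    def __init__(self):
        self.mappings = {}  # iova -> physical address
        self.pending = []   # queued (op, iova, paddr) updates
        self.flushes = 0

    def map_page(self, iova, paddr):
        # Per-page path: assumed to flush after every single update.
        self.mappings[iova] = paddr
        self.flushes += 1

    def queue_map(self, iova, paddr):
        self.pending.append(("map", iova, paddr))

    def queue_unmap(self, iova):
        self.pending.append(("unmap", iova, None))

    def commit(self):
        # Batched path: apply all queued updates, then flush once.
        if not self.pending:
            return
        for op, iova, paddr in self.pending:
            if op == "map":
                self.mappings[iova] = paddr
            else:
                self.mappings.pop(iova, None)
        self.pending.clear()
        self.flushes += 1


PAGE = 4096
per_page = ToyIommu()
batched = ToyIommu()
for n in range(16):
    per_page.map_page(n * PAGE, 0x100000 + n * PAGE)
    batched.queue_map(n * PAGE, 0x100000 + n * PAGE)
batched.commit()  # e.g. right before submitting rendering to the gpu
print(per_page.flushes, batched.flushes)
```

Both objects end up with identical mappings; only the number of flushes differs, and the submit point gives a natural place to call `commit()`.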
I've spent the morning cleaning up and adding some useful information to the freedreno wiki (such as a3xx shader isa, how tiling works on adreno, and how to use the various tools). So if you want to learn how adreno works and/or start to contribute yourself, now you have no excuse ;-)
About a month ago, I received a new ARM dev board, an IFC6410! Which despite the boring sounding name is quite an impressive bit of kit. About $150, quad-core krait, 2G DDR, SATA, gigabit ethernet.. and adreno a320. It is basically the same SoC that is in the nexus4 (or the new nexus7). But in a more convenient form factor for development. And it is with this board that I've been developing a new msm drm/kms driver.
For a while now, freedreno has been limping along with the msm fbdev and kgsl drivers from their android kernel tree, while I focused on the userspace gallium driver and ddx (xf86-video-freedreno). But that was always a short-term solution.. with the qcom android drivers, I can't really handle synchronization between processes, which gets really crazy w/ x11 and a compositing window manager where you have sharing in both directions (as texture and/or render target), I can't handle page flipping (let alone page flipping synchronized with the GPU), and there are general robustness issues.
Unfortunately, the msm android fbdev driver code is a real mess (at least the mdp4 parts). Even by android / vendor kernel standards, which are pretty low to begin with. And I don't have any docs on the display controller. In the end, I ended up instrumenting the code to trace all the register reads/writes, etc, wrote a small parser tool using envytools/librnn, and started writing an rnndb register database for the display controller registers. It was a lot easier to get a general picture of how the hardware works that way! Plus I can generate register level headers from rnndb in the same way I do for the gallium driver.
So, earlier in the month, I sent the first round of RFC patches, with just basic KMS support. A couple weeks ago I sent the 2nd round, which added a3xx gpu support and got basic kmscube working. Since then I've fixed a few things, added HW cursor support and more gpu debugfs bits to help when things go wrong. And added kms support in xf86-video-freedreno.
And so the 3rd (and hopefully final) RFC round of patches will go out soon. But now, time for some eye candy: gnome-shell running on freedreno + msm drm/kms:
and, now that we have drm/kms support, we can use wayland/weston drm compositor:
so, as gnome-shell as-a wayland compositor work progresses, freedreno should be in good shape for the next generation of linux desktop :-)
-----
NOTE: If you look on the msm-drm branches in libdrm and xf86-video-freedreno trees, you'll notice that I've structured things to work on either current android drivers (with a couple small patches), or on msm drm/kms driver. This is mainly because it is unlikely that I'll be able to support every random lcd panel on every snapdragon phone/tablet that someone might want to try out freedreno on. Time permitting, I'll eventually add support for the LCD panels on devices I have (HP touchpad, nexus4), and support for some of the older generation adreno gpus.. although patches certainly welcome ;-)
As promised in my previous post, now there is an f19 installer for nexus4: https://github.com/freedreno/nexus4-fedora So if you have an n4 and a bit of free space, you can play around with accelerated open-source gpu goodness :-)
gallium/freedreno + xf86-video-freedreno using XA gallium state tracker on fedora F18. Gnome-shell, compiz, xonotic, ioquake all work. Just need to clean up the patches for XA and freedreno a bit more before they are ready for upstream. And hopefully in the next couple days I'll have some time to put together a sort of make-shift installer for anyone else who wants to try.
So, I've been working on the freedreno gallium support for adreno a3xx for the past few weeks or so, and now it is starting to take shape:
The full video here. The code is on the a3xx branch on github.. still has a ways to go before I'm ready to go upstream with it, but now we're getting into the fun bits :-) Special thanks to Benjamin Tissoires for getting the touchscreen going for me on the nexus4.
A while back, I got my shiny new nexus4 in order to start playing with the new a320 adreno gpu. It turns out, there are some quite significant changes in the new generation, in particular a new shader ISA and all the registers around shader setup. Other registers are mostly shuffled around, with a few new registers for new features (like MRT) and a few tweaks here and there. There is quite a lot more flexibility in the shader ISA (thank-you opencl), and some nice changes like moving vertex fetch out of the shader and a lot more flexibility in the positioning of outputs/varyings. Although as a result there was more that needed to be figured out to set up the shader related registers compared to a2xx.
At this point, I'm working with simple "fdre-a3xx" tests. Basically libfdre (freedreno reverse engineering) gives a very simplified gl-ish API, with shaders written in assembly, rather than a full on GLSL compiler. I've got a pretty good shader assembler/disassembler, and cmdstream parsing utilizing envytools/librnn for parsing the register values. There is still some more work to do at the fdre level (depth, stencil, textures) before work on the gallium driver can start, but things are progressing well.
After fixing a handful of bugs, and adding support (still in a slightly hacky way) for some desktop GL (1.4) features (namely, GL_QUADS), I've got gnome-shell working with freedreno:
Full video here.
I've also made a sort of fedora F18 installer for the touchpad, for anyone out there who wants to give it a spin:
Last weekend was fosdem, where I showed a few more things working on freedreno (compiz, q3a). The recording of the talk is provided by phoronix. Although unfortunately the mic was closer to the audience than me (and I'm not really a loud speaker), and the camera remained pointed at the projector while I was showing the demo, so you miss the exciting bits. So I re-ran the demo and captured it with my advanced cinematic skillz, and without the backlight hiccups which plagued the original demo (but had nothing to do with freedreno... have I mentioned before that drivers/video/msm is made of fail?).
Full video here. I still need to fix support for mip-map textures (which you may have noticed in the q3a part), and there are a couple rendering bugs with the unity plugin in compiz, but at least the other compiz plugins I've tried so far seem fine. Until now I have not put much (or really, any) focus on performance and there is definitely some low hanging fruit on the performance front. Unfortunately some of the issues, like proper synchronization between xorg and gl/dri2 client, and flipping rather than doing fullscreen presentation blits, will I think require a proper drm kernel driver. Although I suspect there is room for some compiler optimization and a few other things.
Just a quick update, and a sneak peek at what is in store for FOSDEM. The freedreno gallium driver has made a lot of progress since the last post (sorry, been too busy hacking to blog about it). But now finally xbmc is working with freedreno:
Full video here. Special thanks to Ricardo Salveti for building a working version of xbmc for 12.10. You can find it in his PPA.