Saturday, April 14, 2012

Fighting back against binary blobs!

So I'm a big fan of opensrc graphics.. and one thing that has frustrated me for a long time is the lack of open graphics on ARM platforms. I'm a big fan of open source in general, and that is why I love TI (and Linaro). TI has been very focused on publishing public TRMs and getting support for the OMAP platform into the upstream kernel tree. I can build Linus's kernel tree and get something pretty well functional on my pandaboard. The display and omapdrm support in the upstream kernel is progressing pretty well, which is great. The rpmsg framework is merged in mainline for 3.4, which is the first step in getting multimedia (video decode/encode) support in the upstream kernel.

But one area where our hands are tied is graphics acceleration. I'd love nothing more than to be working on an opensrc and upstream driver for the SGX GPU used on OMAP platforms. But due to what I know, and have access to, about the inner workings of the IMGtech GPUs, that would not be possible without IMG's approval. I hope someday they warm up to the open source community, but for now I am forced to look elsewhere to contribute.

But wait.. what about the GPL pvr kernel driver? Well, the fact is that userspace and kernel are not independent. I love not only the linux kernel but the whole gnu/linux system, of which a userspace developed in a collaborative open fashion is an integral part. And this is especially true in the realm of graphics drivers.. nowhere else are there such complex interactions between userspace and kernel. I am not strictly against having a closed userspace GL stack, provided there is an open userspace alternative that is at least able to exercise the same kernel APIs. If there is an open userspace, that gives anyone who wants to the freedom to start hacking, contributing, and making things better. That is the great thing about open source! With only a closed userspace, there is no freedom to fix the kernel parts. And the interactions between the userspace and kernel parts of a graphics driver are too complex to properly review a kernel driver for acceptance into the upstream kernel tree without some open userspace that can exercise the APIs provided by the kernel part of the driver. Simply slapping some GPL headers on a kernel module that is riddled with OS abstraction layers and NIH re-invention of infrastructure provided by the upstream kernel isn't going to cut it here. And without an open userspace, there is no room for the open source community to refactor and fix anything.

But I'm not one to sit around and complain about a problem indefinitely without eventually trying to do something about it. One thing that gave me a glimmer of hope is the lima project, the first real (non-vaporware) opensrc graphics effort on ARM. With that as a piece of needed inspiration, what could I do to help the cause? Well, with ARM as a member company of Linaro, and coming into contact with ARM folks working on mali, as well as engineers from other Linaro member companies who use mali, it seemed like direct contribution to the lima project might be a bit of a gray area. I don't think I really know any internal s3cr3ts of how mali works (and certainly not more than the lima folks have already figured out). But I don't want to get Linaro in trouble with its member companies, and it seemed like a potential conflict of interest. So what could I do? Pick another ARM platform that I know nothing about, and go to town!

This really leaves two big players. Of the two, I had a friend who could loan me a dragonboard to hack on, so that pretty much clinched the deal. (Although I have hopes that someday someone will figure out how to get something based on the nouveau driver running on tegra.)


Methodology

The approach I took is quite similar to, and strongly inspired by, the approach that Luc Verhaegen took with the lima driver project. It basically amounts to using an LD_PRELOAD shim to intercept system calls, digging through the kernel code to understand the existing userspace<->kernel API, and figuring out how to observe and log the interesting bits.

I've started with 2d acceleration support, mainly because that seemed like a good "warm-up" exercise, and also because there is currently no publicly available acceleration for x11 for the snapdragon platform (binary blob or otherwise). Most of the time so far has gone into figuring out the kernel APIs, and writing some utility code to log and post-process the results of running some simple test apps using the closed src binaries available for android, obtained from a cyanogenmod filesystem (because qualcomm does not provide any userspace support for gnu-linux (non-android) userspace, at least not to the general public). I used some linker tricks to link the test code against the android binary blob libs, and android libc, etc, within a ubuntu 11.10 filesystem. (Fwiw, I use 11.10 because it predates the switchover to armhf and is based on the 3.0 kernel, which was what I had available from codeaurora git trees.) The good news is, from what I've been able to figure out from the GPL kernel driver, a lot of the infrastructure like pixel and cmdstream buffer allocation, and cmdstream submission, appears to be similar for 2d and 3d, so I think a lot of the work done so far for 2d accel will be useful when it comes to working on the 3d part.

The libwrap code I wrote logs information about the blits (cmdstream, and various parameters like gpu addresses, surface dimensions, blit coords) to a simple .rd log file (which amounts to a sequence of type/length/value fields). These .rd files get processed with a utility I wrote called "redump", to generate reports showing side-by-side comparisons of the cmdstream, with similarities and parts of dwords that appeared to match surface and blit parameters highlighted. It isn't a perfect disassembly of the command stream, but it certainly helps to spot patterns.

Once I had a reasonable collection of tests for the types of blit operations which are important for an x11 EXA driver, I started varying parameters to figure out the limits, ie. what is the largest blit x, y, width, height, max surface width, height, stride, etc, to establish how many bits are used to encode different fields in the command stream. In some cases, I noticed there were multiple encoding options, so parameters could be packed in fewer dwords if fewer bits were needed to encode the parameters. (For the current EXA driver I'm pretty much using the worst case encoding options so far, to keep things simple.)

With these tests, and the corresponding redump reports, I started work on implementing the EXA accel fxns for the xf86-video-freedreno driver. The work on the EXA driver really only started about 1.5 weekends ago (and most of the time at the beginning was just getting a skeletal driver set up, which is based on a stripped down and simplified xf86-video-msm).


Current Status

So far, I've got the basic solid/copy/composite operations implemented. There are some limitations still in the composite code, such as operations with masks are rejected. (There is an awkward limitation in libC2D2, in that there is no way to specify mask and src coordinates independently.. I'm not sure yet if this is a limitation of the hw, but we will be a bit on our own to figure this out via experimentation with the cmdstream. One option to deal with it is ptr arithmetic on the mask surface gpu addr.) And there are still some lesser used color formats that I haven't tackled.

The next big thing, however, will be to deal properly with submission of multiple blits at a time, and not having to block until submitted blits are complete. Without this, performance is (as you would expect) quite bad. But that is easy enough to fix later. There is some awkwardness with the current kernel interface (see NOTES in freedreno tree about how context switch restore works). But that can be fixed by enhancing the kernel part to take separate ptrs in a single ioctl. And of course deciphering the context restore packet would be needed to properly support context switching if you have multiple processes using 2d (but this isn't too important for having a single xserver running, so I think we can come back to it later).

A quick note on the kernel: the existing driver from qualcomm is what I'd call a semi-DRM driver. It is using GEM buffers, so it gives us what we'd need eventually for DRI2 and 3d. But not mode setting (which is handled via fbdev driver, also opened by xserver), and not a batchbuffer sort of interface for cmd submission.. cmd submission is handled via separate kgsl-2d/3d devices which are not aware of GEM buffer handles, so mapping buffers to the GPU cannot be handled as part of the cmd submission. So far I'm leaving the kernel driver mostly as-is (sans maybe some minor backwards compatible enhancements), because it is essential to be able to run test code based on the existing binary blob libraries back to back with work-in-progress xorg/mesa drivers. One approach to cleaning up the kernel part might be to provide an emulation layer to emulate the old interfaces, although for now there are enough other things to do that I haven't given this much thought yet. Of course, volunteers are always welcome ;-)

The git trees can be found at: https://gitorious.org/freedreno/
And an IRC channel on freenode at #freedreno
So far there are no mailing lists (I'm not really sure where they could be hosted) or web page other than the wiki pages at gitorious.


Disclaimer

This is a project that I've been working on in my own free time, not using the resources or time of my employer or Linaro. It is something I've been working on of my own accord because quite simply I want to see advancement of the state of open source graphics in linux. I hope that Linaro will be supportive of this effort, and of open source graphics on all ARM platforms. And I know that a lot of the individual people that make up Linaro are quite passionate about open source. But I realize that dealing with the business concerns of all the various member and potential member companies is a difficult balancing act. And as always, the opinions expressed here in my blog are my own and not necessarily those of my employer or of Linaro.

Saturday, January 14, 2012

ubuntu-tv 1080p on omap4 panda




you can find a higher resolution video at: http://www.youtube.com/watch?v=HJuyNrVOS1I
needs a bit of cleanup, but patches are here: https://github.com/robclark/qtmobility-1.1.0

update: and as usual Ricardo has a nicer looking video at his blog

Thursday, January 5, 2012

xbmc update

of course I've been too busy to write anything, but Ricardo has updated his blog w/ a note about xbmc progress:

http://rsalveti.wordpress.com/2012/01/06/hw-video-decode-and-xbmc-ubuntu-linaro/

Sunday, November 27, 2011

Catching up..

just catching up on some news since last posting:

First, TI/OMAP PPA for ubuntu 11.10 now contains support for hw video codecs via DCE and gst-ducati. (decoders: h264, mpeg4, mpeg2, vc1; encoders: h264, mpeg4). Yah!

But lately I've been mostly working on omapdrm, a DRM/KMS display driver for omap, and the corresponding X11 driver (xf86-video-omap). The kernel driver is now queued up in the staging tree for 3.3. But not forgetting multimedia, I've also been working (as a linaro assignee) on extending the dri2 protocol for more efficient video rendering (see linux-video.pdf) and UMM/dmabuf for sharing buffers between multiple devices (camera+drm, or multiple drm devices for a prime type setup).

And lastly, been doing some hacking trying to get xbmc working nicely with the hw video codecs for hw accel hd playback.. but more on that shortly, when I have something working.


Thursday, June 23, 2011

Building DCE firmware

Now, thanks to public release of codec libraries, and after some slacking on my part, all the bits and pieces needed to build your very own ducati (cortex-m3) firmware are available. I've put together a wiki page with instructions of where to find all the pieces, and how to build, for 2.6.38 kernel:

http://www.omappedia.org/wiki/DistributedCodecEngine

This is using syslink-2.0, tiler-2.0, and GA codecs/FC/etc.. but now at least, if you want to use a kernel with a different version of syslink, you can rebuild the firmware with appropriate corresponding bios-syslink on the coprocessor side.

Sunday, April 17, 2011

better late than never

haven't had time to post for a while, so just getting caught up on a few things (in reverse chronological order)

ffmpeg vp8 decoder

Mans Rullgard has improved the original neon vp8 patches, and pushed them into the main tree:

omap drm/kms display driver

A while back I started experimenting with the DRM display driver framework, and now have a basic driver which implements the KMS part of DRM. It uses a plugin API for the SGX/PVR driver to register and handle its own set of ioctls related to 2d/3d acceleration. Still TBD are overlay support, and a cleaner way to handle buffer allocation (GEM?).

Now with the userspace pvr xorg driver, basic XRandR is working (change resolution, setup multi-monitor virtual display, etc). Being able to change resolution without cryptic sysfs cmds is nice for a change.

universal buffer allocation/management BoF

There was a BoF at ELC last week on the topic of common buffer allocation/management APIs to support zero copy buffer passing between various IP blocks (display, GPU, codecs, ISP, etc). Currently each SoC vendor has some custom API (CMEM, PMEM, NVMEM, TILER.. etc). Google is introducing ION. Most of the rest of the linux world (ie. desktop) uses GEM and/or TTM, which admittedly are somewhat GPU-centric.

In the desktop world, 3d/codec accelerators and display are all on the graphics card. But in the embedded/SoC world, you might have several vendors who use a common 3d block (for example), but each with their own unique display controller. And different video encode/decode accelerators. And different ISPs.. and so on.

For me, right now GEM is interesting as a way to expose allocation of TILER buffers on OMAP4 for video encode/decode and display. DRI already provides a path in userspace to pass GEM buffers and use DRM to handle the authentication duties (although GEM/DRM are perhaps not strictly required.. but they are something that exists in the upstream kernel tree today). But short of mapping buffers into the userspace process, there is currently no good way to pass these buffers to a v4l2 camera, or the IVAHD video encoder/decoder. Possibly the interface to video encoder/decoder IP can be thru the DRM display driver (another plugin, perhaps). Although that still leaves camera.

And there was also a bit of discussion on the related topic of how to expose display to userspace.. fbdev is ancient legacy, v4l2 MCF is the new kid on the block, but DRM/KMS is what is used in the desktop world. It seems like MCF should be more flexible for building different sorts of graphs, and to handle oddball features like writeback-pipe on OMAP4. Although DRM/KMS is already handling hotplug, EDID parsing, and provides sufficient flexibility for building display graphs (fb -> crtc -> encoder -> connector). At this point I prefer sticking with DRM/KMS for mode setting so that normal uses can be exposed to userspace in normal ways.

At this point, it isn't clear what the conclusion will be. A more modularized DRM with buffer management more easily split out (or at least shared with other devices)? ION or GEM or some merger of the two? The BoF was just a short 1hr session to better define the problem. The next step will be follow up sessions during the Linaro Developer Summit in Budapest.