Blogging the Monkey

Monday, August 23, 2010

ffpv8 + neon = 720p24

btw, been a long time since I had a chance to update the blog.. so I just thought I'd drop a quick note about something I've been playing with for the last few weekends.. the new ffvp8 decoder!

I've started writing the neon dsp functions for the VP8 decoder, as an excuse to learn a bit more about sw video codecs, neon, and VP8. At this point, not all of the dsp functions are implemented, but all the important ones for all the VP8 clips that I can find are implemented (loop filter, bicubic MC functions, and some misc other functions). Most of the major other ones, such as the bilinear MC functions, don't seem to be used in the clips that I can find, but should not be too hard to add when I find clips to test with.

The result is some 15-20% faster than libvpx, mostly thanks to ffvp8 being more cache friendly than libvpx decoder, and not doing silly things like memcpy of reference frames, rather than my hard-core neon optimizing skills.. and this is even without ffvp8 being a multi-threaded decoder, which is something that would benefit an SMP cortex-a9 platform like OMAP4 if done properly. And all this should be possible to get a bit faster by spending some time tweaking the instruction order to avoid stalls and some other tricks like that. (And hopefully I'll learn a few tricks in the process as the patches are reviewed.)

The result so far is here:

http://github.com/robclark/ffmpeg/commits/vp8-4

Current status is that it is all working, and producing bit exact output compared the plain 'C' versions of the DSP functions for all the test clips I have. I'll update again when I add more or when the patches are in upstream ffmpeg.

I also have some work-in-progress patches for gst-ffmpeg to avoid a memcpy for codecs that don't support edge emulation, although these depend on rowstride and some of the other related features that we've added to GStreamer for omap4.

Monday, March 8, 2010

git format-patch for specific commit-ids

for revision ranges, appending ^! (which appropriate escaping) to a commit-id causes it to represent that range beginning/ending with that commit-id inclusively instead of exclusively. So for specifying a revision range consisting of exactly one specific revision (let's call it r1), you can specify r1^! r1

And if you want to generate a patch from all your commits while ignoring all other commits:

for c in `git log --author=Clark | grep ^commit | awk '{print $2}'`; do

git format-patch "$c^\!" "$c";

done

Wednesday, February 17, 2010

texture streaming in action

Hands-on with TAT's dual-screen phone concept and augmented reality app

Thursday, October 8, 2009

texture streaming

IMG has an interesting texture streaming EGL extension for their GFX graphics core, which is supported on TI omap3 devices. It provides a way to stream YUV (or RGB) video to a 3d surface without incurring the expense of a texture upload or an extra memory copy and without the need for colorspace conversion. This makes for interesting possibility, such as decoding video to a 3d surface.

The texture streaming extension utilizes a kernel API to tell the GFX hw when a new frame of video is ready to render. The catalog group in TI had created a 'bc_cat' linux kernel module to allow userspace to allocate buffers to use in texture streaming.

I've taken this driver, and made a few tweaks, and created a GStreamer video sink plugin (see gst-plugin-bc which includes the 'bc_cat' kernel module) to allow for using texture streaming in a GStreamer pipeline. This allows our accelerated codecs to decode video directly to the buffer used by the GFX core. The hw/DSP and GFX core do all the work:

gst-launch filesrc location=/mnt/mmc/iron_man-tlr2_h640w.mp g ! avidemux name=d d.video_00 ! queue ! omx_mpeg4dec ! queue ! bcsink d.audio_00 ! queue ! omx_mp3dec ! alsasink

Here is another example, this time using camera to capture to 3d surface:

gst-launch v4l2src ! "video/x-raw-yuv,width=640,height=480" ! bcsink sync=false

(sorry about the low quality video and the glare which makes it a bit hard to see on the screen)

Touchbook!

I got my touchbook last night! Haven't had too much time to play with it yet, but a very cool little device.

Wednesday, August 12, 2009

NEON is fashionable

... as long as we aren't talking about your wardrobe..

so, my college Daniel brought to my attention a gstreamer use-case that was in need of some performance optimization. When decoding DVD content, the audio samples from the AC-3 decoder are in float-32bit format. These need to be converted to 16bit integer format to play through alsasink. But the overhead of audioconvert on the ARM was quite high. Which seemed like a good enough excuse to learn NEON. So last weekend I broke out oprofile and had a look at where the cycles went, and what needed to be optimized. The pipeline I used for testing was:

gst-launch filesrc use-mmap=true num-buffers=12000 location=/my-clip.vob ! dvddemux name=d d.current_audio ! a52dec ! audioconvert ! audio/x-raw-int,width=16,depth=16! fakesink

Initially, this pipeline took roughly 20.7sec. By comparision, the same pipeline without audioconvert:

gst-launch filesrc use-mmap=true num-buffers=12000 location=/my-clip.vob ! dvddemux name=d d.current_audio ! a52dec ! fakesink

took roughly 11.1s. A couple functions immediately stood out: gst_audio_quantize_quantize_signed_tpdf_none() and audio_convert_unpack_float_le(). The former was particularly bad just due to the random number generation for dithering, which contained a divide by 0xffffffff. As Barbie (and processors everywhere) say, division is hard. Many embedded processors will emulate division (read: expensive), and even processors that have divide instructions, the cycle count is high. Just changing this to 32bit right shift (ie. divide by (1LLU<<32)) made a big improvement. Results might not be exactly the same, but it is close enough that you couldn't hear the difference. In the end, I left gst_fast_random_int32_range() untouched, but replacement vectorized code kept the >>32 approach.

The next step was to come up with a way to plug in accelerated versions of various audioconvert functions. I changed them all to use __attribute__((weak)) which is a neat trick of the GNU compiler (and ARM ltd. compiler) which lets you provide default versions of some symbol which can be overridden at link time. (Unfortunately this doesn't seem to be supported by all compilers that gstreamer supports, so I need to come up with a more portable approach.) This let me add an armv7 specific file to re-implement these two functions. The algorithms themselves stay basically the same, but are processing four elements at a time, except for a few instructions which use 64bit math and those are processing two at a time. In the end, the pipeline with audioconvert dropped to ~12sec, roughly 10x improvement for audioconvert (through a combination of processing 4x samples at a time, faster psuedo-random number generation, and the fact that NEON provides a saturating addition instruction). That is with gcc NEON intrinsics. I guess a few cycles could be saved by writing it all in assembly, which was my original plan, but at this point these two functions only show up a couple pages down in oprofile output.. so time would be better spent looking into liba52 (audio decoder). The video part of a playback pipeline is no issue, the DSP on OMAP3 can decode this without breaking a sweat!

Here is the patch. In the end, probably the long-term solution for audioconvert would be to use orc, for cross-platform vector acceleration. Although currently orc doesn't support floating point, and doesn't have a free NEON back-end. I may clean up my patch (to address the portability issues, and make the build system figure out whether or not to build my new armv7.c file depending on the target architecture). In the end, it depends on how much of a short-term solution that would be..

update: I thought I'd add a few links that I found useful

some Cortex-A8 / NEON info from TI wiki
NEON and VFP Programming section in ARM compiler tools assembler guide, gives a good reference on all the NEON/VFP instructions
ARM NEON Intrinsics section from GCC manual (but not really any more info compared to what you can find by just looking at arm_neon.h).. but once you know the instruction you want (from previous link) this is useful to find the matching intrinsic name.
Using NEON Support appendix from ARM compiler reference guide.. the GCC intrinsics pretty much match ARM's, and the ARM doc has more useful info

Monday, August 10, 2009

Sending GIT patches via email from behind oppressive proxies

I was wanting to setup a wait to email patches via my gmail account.. I found this post from my friend Nishanth: Nishanth' tech rambles: Setting up Email forwarding System for GIT, which was quite helpful.

On my macosx laptop, sendmail worked out of the box.. but it finds the SMTP server from DNS. So at work it was using an internal SMTP server. And at home, it wasn't finding any SMTP server. And for some reason using the internal SMTP server seemed to only work for sending patches to internal addresses. I'm sure the built-in SMTP client could be configured to work as I needed, but I really had no desire to figure out how.

Following the instructions on Nishanth's post, I setup msmtp. (go macports!) And it works great. Somehow it automagically works properly both behind the proxy at work, and from home, depending on the systemwide proxy settings. (The power of Steve Jobs compels it!)

To send patches, I use this alias:


alias gsend='git send-email --from "<addr>" --envelope-sender "<addr>" --smtp-server /opt/local/bin/msmtp'

(replace <addr> with your email address)