Wednesday, August 12, 2009

NEON is fashionable

... as long as we aren't talking about your wardrobe..

so, my college Daniel brought to my attention a gstreamer use-case that was in need of some performance optimization. When decoding DVD content, the audio samples from the AC-3 decoder are in float-32bit format. These need to be converted to 16bit integer format to play through alsasink. But the overhead of audioconvert on the ARM was quite high. Which seemed like a good enough excuse to learn NEON. So last weekend I broke out oprofile and had a look at where the cycles went, and what needed to be optimized. The pipeline I used for testing was:

gst-launch filesrc use-mmap=true num-buffers=12000 location=/my-clip.vob ! dvddemux name=d d.current_audio ! a52dec ! audioconvert ! audio/x-raw-int,width=16,depth=16! fakesink

Initially, this pipeline took roughly 20.7sec. By comparision, the same pipeline without audioconvert:

gst-launch filesrc use-mmap=true num-buffers=12000 location=/my-clip.vob ! dvddemux name=d d.current_audio ! a52dec ! fakesink

took roughly 11.1s. A couple functions immediately stood out: gst_audio_quantize_quantize_signed_tpdf_none() and audio_convert_unpack_float_le(). The former was particularly bad just due to the random number generation for dithering, which contained a divide by 0xffffffff. As Barbie (and processors everywhere) say, division is hard. Many embedded processors will emulate division (read: expensive), and even processors that have divide instructions, the cycle count is high. Just changing this to 32bit right shift (ie. divide by (1LLU<<32)) made a big improvement. Results might not be exactly the same, but it is close enough that you couldn't hear the difference. In the end, I left gst_fast_random_int32_range() untouched, but replacement vectorized code kept the >>32 approach.

The next step was to come up with a way to plug in accelerated versions of various audioconvert functions. I changed them all to use __attribute__((weak)) which is a neat trick of the GNU compiler (and ARM ltd. compiler) which lets you provide default versions of some symbol which can be overridden at link time. (Unfortunately this doesn't seem to be supported by all compilers that gstreamer supports, so I need to come up with a more portable approach.) This let me add an armv7 specific file to re-implement these two functions. The algorithms themselves stay basically the same, but are processing four elements at a time, except for a few instructions which use 64bit math and those are processing two at a time. In the end, the pipeline with audioconvert dropped to ~12sec, roughly 10x improvement for audioconvert (through a combination of processing 4x samples at a time, faster psuedo-random number generation, and the fact that NEON provides a saturating addition instruction). That is with gcc NEON intrinsics. I guess a few cycles could be saved by writing it all in assembly, which was my original plan, but at this point these two functions only show up a couple pages down in oprofile output.. so time would be better spent looking into liba52 (audio decoder). The video part of a playback pipeline is no issue, the DSP on OMAP3 can decode this without breaking a sweat!

Here is the patch. In the end, probably the long-term solution for audioconvert would be to use orc, for cross-platform vector acceleration. Although currently orc doesn't support floating point, and doesn't have a free NEON back-end. I may clean up my patch (to address the portability issues, and make the build system figure out whether or not to build my new armv7.c file depending on the target architecture). In the end, it depends on how much of a short-term solution that would be..

update: I thought I'd add a few links that I found useful
  • some Cortex-A8 / NEON info from TI wiki
  • NEON and VFP Programming section in ARM compiler tools assembler guide, gives a good reference on all the NEON/VFP instructions
  • ARM NEON Intrinsics section from GCC manual (but not really any more info compared to what you can find by just looking at arm_neon.h).. but once you know the instruction you want (from previous link) this is useful to find the matching intrinsic name.
  • Using NEON Support appendix from ARM compiler reference guide.. the GCC intrinsics pretty much match ARM's, and the ARM doc has more useful info

1 comment: