I've started writing the NEON DSP functions for the VP8 decoder, as an excuse to learn a bit more about software video codecs, NEON, and VP8. At this point not all of the DSP functions are implemented, but the important ones for all the VP8 clips that I can find are done (loop filter, bicubic MC functions, and some other miscellaneous functions). Most of the other major ones, such as the bilinear MC functions, don't seem to be used in the clips that I can find, but they should not be too hard to add once I find clips to test with.
The result is some 15-20% faster than libvpx. That is mostly thanks to ffvp8 being more cache friendly than the libvpx decoder and not doing silly things like memcpy of reference frames, rather than to my hard-core NEON optimizing skills. And this is even without ffvp8 being a multi-threaded decoder, which is something that would benefit an SMP Cortex-A9 platform like OMAP4 if done properly. It should be possible to make all of this a bit faster still by spending some time tweaking the instruction order to avoid stalls, and with some other tricks like that. (And hopefully I'll learn a few tricks in the process as the patches are reviewed.)
The result so far is here:
Current status is that it is all working, and producing bit-exact output compared to the plain C versions of the DSP functions for all the test clips I have. I'll update again when I add more or when the patches are in upstream ffmpeg.
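To make "bit exact" concrete, here is a minimal sketch of the kind of check involved: compare the visible area of a plane decoded with the C functions against the same plane decoded with the optimized ones, byte for byte. The helper name `first_mismatch` and its signature are made up for illustration; this is not the actual test setup used for these patches.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper (not part of ffmpeg): compares the visible
 * width x height area of two planes that share the same stride.
 * Bytes between `width` and `stride` are padding and are ignored.
 * Returns -1 if the planes are bit exact, otherwise y * width + x
 * for the first differing sample in raster order. */
static long first_mismatch(const uint8_t *ref, const uint8_t *opt,
                           int width, int height, ptrdiff_t stride)
{
    for (int y = 0; y < height; y++) {
        const uint8_t *ref_row = ref + y * stride;
        const uint8_t *opt_row = opt + y * stride;
        for (int x = 0; x < width; x++) {
            if (ref_row[x] != opt_row[x])
                return (long)y * width + x;
        }
    }
    return -1;
}
```

Running this over every plane of every decoded frame, and treating anything other than -1 as a failure, is one simple way to catch an optimized function that is off by even a single bit.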
I also have some work-in-progress patches for gst-ffmpeg to avoid a memcpy for codecs that don't support edge emulation, although these depend on rowstride and some of the other related features that we've added to GStreamer for OMAP4.
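As a rough picture of why rowstride comes into it: a decoder without edge emulation needs a border of extra samples around each plane, so the visible picture sits inside a larger buffer. If downstream elements can be told the stride and where the visible picture starts, the padded buffer can be passed along as-is instead of being copied into a tightly packed one. The sketch below only shows the arithmetic for one 8-bit plane; the `PaddedPlane` type, the function name, and the border width used in the example are assumptions for illustration, not the actual gst-ffmpeg patches.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical layout description for one padded 8-bit plane;
 * this is not a gst-ffmpeg or ffmpeg type. */
typedef struct {
    size_t stride;  /* bytes from the start of one row to the next */
    size_t rows;    /* rows allocated, including top and bottom border */
    size_t offset;  /* byte offset of the first visible sample */
} PaddedPlane;

/* Assumes one byte per sample and the same border width (`edge`)
 * on all four sides of the visible width x height area. */
static PaddedPlane padded_plane_layout(size_t width, size_t height,
                                       size_t edge)
{
    PaddedPlane p;
    p.stride = width + 2 * edge;         /* left and right border */
    p.rows   = height + 2 * edge;        /* top and bottom border */
    p.offset = edge * p.stride + edge;   /* skip top rows, then left border */
    return p;
}
```

For example, a 640x480 plane with an assumed 16-sample border gives a stride of 672 bytes, 512 rows, and a visible picture starting 10768 bytes into the buffer.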