[vlc-devel] [PATCH] arm_neon: Add an optimized routine for deinterleaving chroma

Mon Oct 7 13:27:19 CEST 2013

On Wed, 2 Oct 2013, Martin Storsjö wrote:

> This supports conversion from NV12/21/16/24 to I420/YV12/I422/I444.
>
> This avoids hitting swscale for the NV12->I420 conversion, for hw
> decoders that return NV12/21 in combination with the android vout
> in YUV mode.
> ---
> Made the neon routine independent of the particular subsampling
> mode and added support for NV16->I422 and NV24->I444 (although
> these two are untested), renamed the neon function to an even
> better name (IMO).
>
> I unrolled the loop once to avoid stalls due to latency in the
> VLD2 instruction, as suggested by Rémi. This didn't give any
> noticeable speedup on an A9 - I can test on an A8 on Friday at
> the earliest. This unrolling makes the routine overread 16 bytes
> at the end of the interleaved UV plane, unless the pitch is
> aligned to 32 bytes (AFAIK it's currently only aligned to 16 bytes)
> or separate handling for the tail of each row is added.
> ---
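
To put numbers on the overread mentioned above, here is a rough sketch of
the arithmetic. The helper names are made up for illustration and are not
part of the patch. It assumes the loop always consumes whole `step`-byte
blocks of interleaved UV data per iteration (16 without unrolling, 32 with
it) and has no tail handling:

```c
#include <stddef.h>

/* Hypothetical helper: round n up to the next multiple of step. */
static size_t round_up(size_t n, size_t step)
{
    return (n + step - 1) / step * step;
}

/*
 * Hypothetical helper: number of bytes read past the end of one row's
 * allocation (pitch) when a loop without tail handling consumes the
 * row_bytes of visible UV data in whole blocks of step bytes.
 */
static size_t row_overread(size_t row_bytes, size_t pitch, size_t step)
{
    size_t touched = round_up(row_bytes, step);
    return touched > pitch ? touched - pitch : 0;
}
```

With a pitch of 48 bytes (aligned to 16 but not to 32), a full row gives
row_overread(48, 48, 32) == 16, while the non-unrolled step of 16 gives 0.
This only matters in practice for the last row, where the extra bytes lie
beyond the end of the plane.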

The unrolling didn't seem to give any measurable speedup in this
particular case on an A8 either.

So what's the verdict here: keep it simple (which also avoids the
overread, and avoids requiring the interleaved UV plane to be aligned to
32 bytes), or keep the unrolling?
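
For reference, this is roughly what the operation amounts to in plain C.
It is only a scalar sketch with made-up function and parameter names, not
the NEON routine from the patch. It never reads past the visible pairs of
a row, which is the behaviour the simple (non-unrolled, 16-byte aligned)
variant would keep:

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical scalar reference: split one row of interleaved UV bytes
 * (U0 V0 U1 V1 ...) into separate U and V rows. Reads exactly
 * 2 * pairs bytes from src_uv, so there is no overread.
 */
static void deinterleave_row(uint8_t *dst_u, uint8_t *dst_v,
                             const uint8_t *src_uv, size_t pairs)
{
    for (size_t i = 0; i < pairs; i++) {
        dst_u[i] = src_uv[2 * i];
        dst_v[i] = src_uv[2 * i + 1];
    }
}

/*
 * Hypothetical plane-level wrapper. The subsampling mode only changes
 * pairs_per_row and rows, so the same loop serves NV12, NV16 and NV24.
 */
static void deinterleave_plane(uint8_t *dst_u, size_t u_pitch,
                               uint8_t *dst_v, size_t v_pitch,
                               const uint8_t *src_uv, size_t src_pitch,
                               size_t pairs_per_row, size_t rows)
{
    for (size_t y = 0; y < rows; y++)
        deinterleave_row(dst_u + y * u_pitch, dst_v + y * v_pitch,
                         src_uv + y * src_pitch, pairs_per_row);
}
```

For NV21 (V before U) the caller can simply swap the two destination
pointers.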

// Martin