[x264-devel] [PATCH 12/24] arm: Implement x264_plane_copy_neon

Mon Aug 24 23:35:11 CEST 2015

On 2015-08-24 23:37:35 +0300, Martin Storsjö wrote:
> On Fri, 21 Aug 2015, Janne Grunau wrote:
> 
> >On 2015-08-13 23:59:33 +0300, Martin Storsjö wrote:
> >>checkasm timing       Cortex-A7      A8     A9
> >>plane_copy_c                  13253  10923  9016
> >>plane_copy_neon               7339   5191   8939
> >>---
> >> common/arm/mc-a.S |   32 ++++++++++++++++++++++++++++++++
> >> common/arm/mc-c.c |    3 +++
> >> 2 files changed, 35 insertions(+)
> >>
> >>diff --git a/common/arm/mc-a.S b/common/arm/mc-a.S
> >>index 695a6ca..4225c71 100644
> >>--- a/common/arm/mc-a.S
> >>+++ b/common/arm/mc-a.S
> >>@@ -6,6 +6,7 @@
> >>  * Authors: David Conrad <lessen42 at gmail.com>
> >>  *          Mans Rullgard <mans at mansr.com>
> >>  *          Stefan Groenroos <stefan.gronroos at gmail.com>
> >>+ *          Janne Grunau <janne-x264 at jannau.net>
> >>  *
> >>  * This program is free software; you can redistribute it and/or modify
> >>  * it under the terms of the GNU General Public License as published by
> >>@@ -1461,6 +1462,37 @@ function x264_load_deinterleave_chroma_fenc_neon
> >>     bx              lr
> >> endfunc
> >>
> >>+function x264_plane_copy_neon
> >>+    push            {r4-r5}
> >>+    ldrd            r4,  r5, [sp, #8]
> >
> >you could use r4 and lr for the common pop {..., pc} pattern, not that
> >it'll make a difference here
> 
> I can't do ldrd r4, lr though, so I'd have to do two separate loads here.
> 
> On A9, this gives a slight slowdown (8913 -> 8949, vs the C version
> at 9004), but for A8 it surprisingly gives a pretty large speedup
> (5098 -> 4867 vs 10995 for C).

the cortex-a8 result surprises me a little. Can you check the stack 
alignment on the cortex-a8 and/or /proc/cpu/alignment? Not sure if about 
460 cycles (about 230 cycles faster on average, 50% chance of misaligned 
stack) are enough for the kernel's alignment trap handler/fixup but 
that's the best explanation I have for the large speedup.
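
A minimal sketch of such a check, assuming a Linux system. The helper 
names is_aligned_8 and dump_alignment_counters are made up for 
illustration and are not part of x264. The address of a C local is only 
a rough proxy for the sp value the asm function sees on entry.

```c
#include <stdint.h>
#include <stdio.h>

/* Returns 1 if p is 8-byte aligned, 0 otherwise. */
static int is_aligned_8(const void *p)
{
    return ((uintptr_t)p & 7) == 0;
}

/* Copies the contents of /proc/cpu/alignment to out.
 * Returns 0 on success, -1 if the file cannot be opened
 * (for example on a kernel that does not provide it). */
static int dump_alignment_counters(FILE *out)
{
    char line[256];
    FILE *f = fopen("/proc/cpu/alignment", "r");
    if (!f)
        return -1;
    while (fgets(line, sizeof(line), f))
        fputs(line, out);
    fclose(f);
    return 0;
}
```

Calling dump_alignment_counters(stdout) before and after a checkasm run 
and comparing the counters should show whether the kernel fixed up any 
accesses in between.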

> I would have thought that ldrd would generally be preferred for
> reading two arguments from the stack, but here avoiding it really
> helps. Is it perhaps due to pipelining on the A8, when the following
> instruction uses r4?

the only advantage of ldrd is smaller code size; the disadvantage is its 
alignment requirement

Janne