[x264-devel] [PATCH 12/24] arm: Implement x264_plane_copy_neon
Janne Grunau
janne-x264 at jannau.net
Mon Aug 24 23:35:11 CEST 2015
On 2015-08-24 23:37:35 +0300, Martin Storsjö wrote:
> On Fri, 21 Aug 2015, Janne Grunau wrote:
>
> >On 2015-08-13 23:59:33 +0300, Martin Storsjö wrote:
> >>checkasm timing Cortex-A7 A8 A9
> >>plane_copy_c 13253 10923 9016
> >>plane_copy_neon 7339 5191 8939
> >>---
> >> common/arm/mc-a.S | 32 ++++++++++++++++++++++++++++++++
> >> common/arm/mc-c.c | 3 +++
> >> 2 files changed, 35 insertions(+)
> >>
> >>diff --git a/common/arm/mc-a.S b/common/arm/mc-a.S
> >>index 695a6ca..4225c71 100644
> >>--- a/common/arm/mc-a.S
> >>+++ b/common/arm/mc-a.S
> >>@@ -6,6 +6,7 @@
> >> * Authors: David Conrad <lessen42 at gmail.com>
> >> * Mans Rullgard <mans at mansr.com>
> >> * Stefan Groenroos <stefan.gronroos at gmail.com>
> >>+ * Janne Grunau <janne-x264 at jannau.net>
> >> *
> >> * This program is free software; you can redistribute it and/or modify
> >> * it under the terms of the GNU General Public License as published by
> >>@@ -1461,6 +1462,37 @@ function x264_load_deinterleave_chroma_fenc_neon
> >> bx lr
> >> endfunc
> >>
> >>+function x264_plane_copy_neon
> >>+ push {r4-r5}
> >>+ ldrd r4, r5, [sp, #8]
> >
> >you could use r4 and lr for the common pop {..., pc} pattern, not that
> >it'll make a difference here
>
> I can't do ldrd r4, lr though, so I'd have to do two separate loads here.
>
> On A9, this gives a slight slowdown (8913 -> 8949, vs the C version
> at 9004), but for A8 it surprisingly gives a pretty large speedup
> (5098 -> 4867 vs 10995 for C).
The Cortex-A8 result surprises me a little. Can you check the stack
alignment on the Cortex-A8 and/or /proc/cpu/alignment? I'm not sure whether
46 cycles (23 cycles faster, 50% chance of a misaligned stack) are enough
for the kernel's alignment trap handler/fixup, but that's the best
explanation I have for the large speedup.
> I would have thought that ldrd would generally be preferred for
> reading two arguments from the stack, but here it really helps. Is
> it perhaps due to pipelining on the A8, when the following
> instruction uses r4?
The only advantage of ldrd is smaller code size; the disadvantage is its
alignment requirement.
Janne
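
For readers following the thread, here is a minimal C sketch of the argument layout under discussion. It is not the x264 source: the function names, the 8-bit pixel type and the six-argument signature are assumptions for illustration. It assumes the usual 32-bit ARM calling convention, where the first four arguments travel in r0-r3 and the remaining two (w and h here) are read from the stack, which is why the quoted prologue loads them from [sp, #8] after pushing two registers.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for the C reference version; the name and the
 * 8-bit pixel type are assumptions made for this sketch. */
void plane_copy_sketch(uint8_t *dst, ptrdiff_t dst_stride,
                       const uint8_t *src, ptrdiff_t src_stride,
                       int w, int h)
{
    assert(w >= 0 && h >= 0);
    /* On 32-bit ARM, dst, dst_stride, src and src_stride fit in r0-r3;
     * w and h are the fifth and sixth arguments and live on the stack. */
    while (h-- > 0) {
        memcpy(dst, src, (size_t)w);   /* copy one row of w bytes */
        dst += dst_stride;
        src += src_stride;
    }
}

/* Byte offset from sp of stacked argument `index` (0 = first stacked
 * argument) after `pushed_regs` 32-bit registers have been pushed. */
int stack_arg_offset(int pushed_regs, int index)
{
    return 4 * pushed_regs + 4 * index;
}

/* True if `addr` is 8-byte aligned. If sp is only guaranteed to be
 * 4-byte aligned, sp + 8 passes this check for half of the possible
 * sp values, which matches the "50% chance" mentioned above. */
int is_doubleword_aligned(uint32_t addr)
{
    return (addr & 7u) == 0;
}
```

With push {r4-r5}, stack_arg_offset(2, 0) gives 8 and stack_arg_offset(2, 1) gives 12, so a single ldrd from [sp, #8] covers both w and h. Switching to push {r4, lr} leaves the offsets unchanged but needs two separate ldr instructions, since ldrd cannot target the r4/lr pair.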