[x264-devel] [PATCH 12/24] arm: Implement x264_plane_copy_neon

Martin Storsjö martin at martin.st
Tue Aug 25 09:40:20 CEST 2015


On Mon, 24 Aug 2015, Janne Grunau wrote:

> On 2015-08-24 23:37:35 +0300, Martin Storsjö wrote:
>> On Fri, 21 Aug 2015, Janne Grunau wrote:
>>
>>> On 2015-08-13 23:59:33 +0300, Martin Storsjö wrote:
>>>> checkasm timing       Cortex-A7      A8     A9
>>>> plane_copy_c              13253   10923   9016
>>>> plane_copy_neon            7339    5191   8939
>>>> ---
>>>> common/arm/mc-a.S |   32 ++++++++++++++++++++++++++++++++
>>>> common/arm/mc-c.c |    3 +++
>>>> 2 files changed, 35 insertions(+)
>>>>
>>>> diff --git a/common/arm/mc-a.S b/common/arm/mc-a.S
>>>> index 695a6ca..4225c71 100644
>>>> --- a/common/arm/mc-a.S
>>>> +++ b/common/arm/mc-a.S
>>>> @@ -6,6 +6,7 @@
>>>>  * Authors: David Conrad <lessen42 at gmail.com>
>>>>  *          Mans Rullgard <mans at mansr.com>
>>>>  *          Stefan Groenroos <stefan.gronroos at gmail.com>
>>>> + *          Janne Grunau <janne-x264 at jannau.net>
>>>>  *
>>>>  * This program is free software; you can redistribute it and/or modify
>>>>  * it under the terms of the GNU General Public License as published by
>>>> @@ -1461,6 +1462,37 @@ function x264_load_deinterleave_chroma_fenc_neon
>>>>     bx              lr
>>>> endfunc
>>>>
>>>> +function x264_plane_copy_neon
>>>> +    push            {r4-r5}
>>>> +    ldrd            r4,  r5, [sp, #8]
>>>
>>> you could use r4 and lr for the common pop {..., pc} pattern, not that
>>> it'll make a difference here
>>
>> I can't do ldrd r4, lr though, so I'd have to do two separate loads here.
>>
>> On A9, this gives a slight slowdown (8913 -> 8949, vs the C version
>> at 9004), but for A8 it surprisingly gives a pretty large speedup
>> (5098 -> 4867 vs 10995 for C).
>
> the cortex-a8 result surprises me a little. Can you check the stack
> alignment on the cortex-a8 and/or /proc/cpu/alignment? Not sure if 46
> cycles (23 cycles faster, 50% chance of misaligned stack) are enough for
> the kernel's alignment trap handler/fixup but that's the best explanation
> I have for the large speedup.

The stack pointer is only 4-byte-aligned, not 8-byte-aligned, at this 
point, so that might be the cause. (But the counters in 
/proc/cpu/alignment don't increase when running the ldrd, and setting it 
to sigbus doesn't make it change either. And according to the ARM ARM, 
ldrd only requires word alignment.)
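
To make the alignment condition concrete, here is a minimal C sketch of the check being discussed. The helper name is_aligned and the example addresses in the note below are assumptions for illustration only; they are not part of x264 or of this patch.

```c
#include <stdint.h>

/* Hypothetical helper, not part of x264: returns 1 if addr is a multiple
 * of align, 0 otherwise. align is assumed to be a power of two. */
int is_aligned( uintptr_t addr, uintptr_t align )
{
    return ( addr & ( align - 1 ) ) == 0;
}
```

With a made-up stack pointer of 0x7efff004, is_aligned( sp, 4 ) is 1 and is_aligned( sp, 8 ) is 0. Adding the #8 offset used by the ldrd does not change the result modulo 8, so the doubleword load address is word-aligned but not doubleword-aligned in that case.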

// Martin