[x264-devel] [PATCH 12/24] arm: Implement x264_plane_copy_neon

Martin Storsjö martin at martin.st
Mon Aug 24 22:37:35 CEST 2015


On Fri, 21 Aug 2015, Janne Grunau wrote:

> On 2015-08-13 23:59:33 +0300, Martin Storsjö wrote:
>> checkasm timing     Cortex-A7     A8     A9
>> plane_copy_c            13253  10923   9016
>> plane_copy_neon          7339   5191   8939
>> ---
>>  common/arm/mc-a.S |   32 ++++++++++++++++++++++++++++++++
>>  common/arm/mc-c.c |    3 +++
>>  2 files changed, 35 insertions(+)
>>
>> diff --git a/common/arm/mc-a.S b/common/arm/mc-a.S
>> index 695a6ca..4225c71 100644
>> --- a/common/arm/mc-a.S
>> +++ b/common/arm/mc-a.S
>> @@ -6,6 +6,7 @@
>>   * Authors: David Conrad <lessen42 at gmail.com>
>>   *          Mans Rullgard <mans at mansr.com>
>>   *          Stefan Groenroos <stefan.gronroos at gmail.com>
>> + *          Janne Grunau <janne-x264 at jannau.net>
>>   *
>>   * This program is free software; you can redistribute it and/or modify
>>   * it under the terms of the GNU General Public License as published by
>> @@ -1461,6 +1462,37 @@ function x264_load_deinterleave_chroma_fenc_neon
>>      bx              lr
>>  endfunc
>>
>> +function x264_plane_copy_neon
>> +    push            {r4-r5}
>> +    ldrd            r4,  r5, [sp, #8]
>
> you could use r4 and lr for the common pop {..., pc} pattern, not that
> it'll make a difference here

I can't do ldrd r4, lr though (in ARM mode, ldrd needs an even-numbered
register followed by the next one), so I'd have to do two separate loads
here.
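
A minimal sketch of that two-load variant, assuming the same stack layout
as in the patch (two registers pushed, so the stack arguments start at
[sp, #8]). The copy loop is elided and assumed to be unchanged apart from
using lr where the patch uses r5:

```
function x264_plane_copy_neon
    push            {r4, lr}
    ldr             r4,  [sp, #8]       @ first stack argument (r4 in the patch)
    ldr             lr,  [sp, #12]      @ second stack argument (r5 in the patch)
    @ ... copy loop as in the patch, with lr in place of r5 ...
    pop             {r4, pc}            @ restore r4 and return in one instruction
endfunc
```

Popping the saved lr straight into pc replaces the separate pop and
bx lr at the end of the function.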

On A9, this gives a slight slowdown (8913 -> 8949, vs the C version at
9004), but on A8 it surprisingly gives a pretty large speedup (5098 ->
4867, vs 10995 for C).

I would have thought that ldrd would generally be preferred for reading
two arguments from the stack, but here splitting it into two separate ldr
instructions really helps. Is it perhaps due to pipelining on the A8,
when the following instruction uses r4?

// Martin

