[x264-devel] [PATCH 2/2] arm: Implement x264_mbtree_propagate_{cost, list}_neon

Martin Storsjö martin at martin.st
Thu Sep 3 09:53:17 CEST 2015


On Thu, 3 Sep 2015, Janne Grunau wrote:

> On 2015-09-03 09:30:44 +0300, Martin Storsjö wrote:
>> The cost function could be simplified to avoid having to clobber
>> q4/q5, but this requires reordering instructions, which increases
>> the total runtime.
>>
>> checkasm timing       Cortex-A7      A8      A9
>> mbtree_propagate_cost_c      63702   155835  62829
>> mbtree_propagate_cost_neon   17199   10454   11106
>
> any idea why the cortex-a8 c version is that bad? Different
> compiler/system?

These are all run with the exact same static binary, and the beaglebone 
I've tested it on is completely idle, so there shouldn't really be any noise. 
I can reproduce the numbers as well. No idea what is causing it though...

>> mbtree_propagate_list_c      104203  108949  84532
>> mbtree_propagate_list_neon   82035   78348   60410
>>
>> ---
>> Applied Janne's suggestions on mbtree_propagate_cost_neon, and squashed
>> his patch for mbtree_propagate_list_neon.
>> ---
>>  common/arm/mc-a.S |  119 +++++++++++++++++++++++++++++++++++++++++++++++++++++
>>  common/arm/mc-c.c |    9 ++++
>>  2 files changed, 128 insertions(+)
>>
>> diff --git a/common/arm/mc-a.S b/common/arm/mc-a.S
>> index 5e0c117..b06b957 100644
>> --- a/common/arm/mc-a.S
>> +++ b/common/arm/mc-a.S
>> @@ -28,6 +28,11 @@
>>
>>  #include "asm.S"
>>
>> +.section .rodata
>> +.align 4
>> +pw_0to15:
>> +.short 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15
>> +
>>  .text
>>
>>  // note: prefetch stuff assumes 64-byte cacheline, true for the Cortex-A8
>> @@ -1760,3 +1765,117 @@ function integral_init8v_neon
>>  2:
>>      bx              lr
>>  endfunc
>> +
>> +function x264_mbtree_propagate_cost_neon
>> +    push            {r4-r5,lr}
>> +    ldrd            r4, r5, [sp, #12]
>> +    ldr             lr, [sp, #20]
>> +    vld1.32         {d6[], d7[]},  [r5]
>
> push            {r11}
> ldrd            r11, r12, [sp, #4]
> vld1.32         {d6[], d7[]},  [r12]
> ldr             r12, [sp, #12]
>
> and adapt the rest. patch ok, no need to change this, it won't make a
> large difference (I'm not even sure if it'll be faster). just to
> satisfy my OCD.

Hmm, neat. I'll keep that in mind if the patch needs to be remade for some 
other reason.

// Martin
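
For readers following the stack offsets in the two prologues above, here is a
minimal sketch of the arithmetic in C. It assumes the usual 32-bit ARM calling
convention (the first four integer/pointer arguments in r0-r3, the rest on the
stack, 4 bytes each) and that every pushed register takes 4 bytes. The helper
name stack_arg_offset is hypothetical and is not part of x264.

```c
/* Hypothetical helper, for illustration only (not part of x264).
 *
 * Returns the byte offset from sp to the n-th stack-passed argument
 * (0-based) once the prologue has pushed `pushed_regs` 32-bit registers.
 * Assumes each pushed register and each stack argument occupies 4 bytes. */
static int stack_arg_offset(int pushed_regs, int n)
{
    return 4 * pushed_regs + 4 * n;
}
```

With push {r4-r5,lr} (three registers), the stack arguments sit at
stack_arg_offset(3, 0) = 12 and stack_arg_offset(3, 2) = 20, matching the
[sp, #12] and [sp, #20] loads in the patch. With a single push {r11}, the same
arguments sit at offsets 4 and 12.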

