[x264-devel] [PATCH 2/2] arm: Implement x264_mbtree_propagate_{cost, list}_neon

Thu Sep 3 09:35:35 CEST 2015

On 2015-09-03 09:30:44 +0300, Martin Storsjö wrote:
> The cost function could be simplified to avoid having to clobber
> q4/q5, but this requires reordering instructions, which increases
> the total runtime.
> 
> checkasm timing       Cortex-A7      A8      A9
> mbtree_propagate_cost_c      63702   155835  62829
> mbtree_propagate_cost_neon   17199   10454   11106

Any idea why the Cortex-A8 C version is that bad? Different
compiler/system?

> mbtree_propagate_list_c      104203  108949  84532
> mbtree_propagate_list_neon   82035   78348   60410
> 
> ---
> Applied Janne's suggestions on mbtree_propagate_cost_neon, and squashed
> his patch for mbtree_propagate_list_neon.
> ---
>  common/arm/mc-a.S |  119 +++++++++++++++++++++++++++++++++++++++++++++++++++++
>  common/arm/mc-c.c |    9 ++++
>  2 files changed, 128 insertions(+)
> 
> diff --git a/common/arm/mc-a.S b/common/arm/mc-a.S
> index 5e0c117..b06b957 100644
> --- a/common/arm/mc-a.S
> +++ b/common/arm/mc-a.S
> @@ -28,6 +28,11 @@
>  
>  #include "asm.S"
>  
> +.section .rodata
> +.align 4
> +pw_0to15:
> +.short 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15
> +
>  .text
>  
>  // note: prefetch stuff assumes 64-byte cacheline, true for the Cortex-A8
> @@ -1760,3 +1765,117 @@ function integral_init8v_neon
>  2:
>      bx              lr
>  endfunc
> +
> +function x264_mbtree_propagate_cost_neon
> +    push            {r4-r5,lr}
> +    ldrd            r4, r5, [sp, #12]
> +    ldr             lr, [sp, #20]
> +    vld1.32         {d6[], d7[]},  [r5]

push            {r11}
ldrd            r11, r12, [sp, #4]
vld1.32         {d6[], d7[]},  [r12]
ldr             r12, [sp, #12]

and adapt the rest. Patch ok, no need to change this; it won't make a
large difference (I'm not even sure it'll be faster). Just to
satisfy my OCD.
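
For reference, the stack offsets in both prologues follow from simple
arithmetic. This is a rough sketch only, assuming the 32-bit ARM procedure
call standard (first four word-sized arguments in r0-r3, the rest on the
stack in 4-byte slots) and 4 bytes per pushed core register. The helper
names stack_arg_offset and self_check are hypothetical and not part of x264.

```c
#include <assert.h>

/* Hypothetical helper, not part of x264: byte offset from sp of the n-th
 * stack-passed argument (n = 0 for the fifth function argument) after the
 * prologue has pushed `pushed_regs` 32-bit core registers.
 * Assumes every argument is word-sized and the first four are in r0-r3. */
static int stack_arg_offset(int pushed_regs, int n)
{
    return 4 * pushed_regs + 4 * n;
}

/* Checks the offsets against the two prologues discussed in this mail. */
static void self_check(void)
{
    /* push {r4-r5,lr}: three registers pushed */
    assert(stack_arg_offset(3, 0) == 12);  /* ldrd r4, r5, [sp, #12] */
    assert(stack_arg_offset(3, 2) == 20);  /* ldr  lr, [sp, #20]     */

    /* push {r11}: one register pushed */
    assert(stack_arg_offset(1, 0) == 4);   /* first stack argument   */
    assert(stack_arg_offset(1, 2) == 12);  /* ldr  r12, [sp, #12]    */
}
```

Pushing fewer registers moves every stack argument closer to sp by 4 bytes
per register saved, which is why the offsets change along with the push list.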

Janne