[x264-devel] [PATCH 2/2] arm: Implement x264_mbtree_propagate_{cost, list}_neon
Janne Grunau
janne-x264 at jannau.net
Thu Sep 3 09:35:35 CEST 2015
On 2015-09-03 09:30:44 +0300, Martin Storsjö wrote:
> The cost function could be simplified to avoid having to clobber
> q4/q5, but this requires reordering instructions, which increases
> the total runtime.
>
> checkasm timing                 Cortex-A7      A8      A9
> mbtree_propagate_cost_c             63702  155835   62829
> mbtree_propagate_cost_neon          17199   10454   11106
Any idea why the C version is so slow on the Cortex-A8? Different
compiler or system?
> mbtree_propagate_list_c            104203  108949   84532
> mbtree_propagate_list_neon          82035   78348   60410
>
> ---
> Applied Janne's suggestions on mbtree_propagate_cost_neon, and squashed
> his patch for mbtree_propagate_list_neon.
> ---
> common/arm/mc-a.S | 119 +++++++++++++++++++++++++++++++++++++++++++++++++++++
> common/arm/mc-c.c | 9 ++++
> 2 files changed, 128 insertions(+)
>
> diff --git a/common/arm/mc-a.S b/common/arm/mc-a.S
> index 5e0c117..b06b957 100644
> --- a/common/arm/mc-a.S
> +++ b/common/arm/mc-a.S
> @@ -28,6 +28,11 @@
>
> #include "asm.S"
>
> +.section .rodata
> +.align 4
> +pw_0to15:
> +.short 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15
> +
> .text
>
> // note: prefetch stuff assumes 64-byte cacheline, true for the Cortex-A8
> @@ -1760,3 +1765,117 @@ function integral_init8v_neon
> 2:
> bx lr
> endfunc
> +
> +function x264_mbtree_propagate_cost_neon
> + push {r4-r5,lr}
> + ldrd r4, r5, [sp, #12]
> + ldr lr, [sp, #20]
> + vld1.32 {d6[], d7[]}, [r5]
push {r11}
ldrd r11, r12, [sp, #4]
vld1.32 {d6[], d7[]}, [r12]
ldr r12, [sp, #12]
and adapt the rest. Patch OK; no need to change this, as it won't make
a large difference (I'm not even sure it would be faster). It's just to
satisfy my OCD.
Janne
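
For readers checking the stack offsets in the two prologues above, here is
a minimal sketch of the arithmetic. It assumes the 32-bit ARM procedure
call standard (arguments 0-3 in r0-r3, later arguments in consecutive
4-byte stack slots) and that every stack argument is 4 bytes wide. The
helper name stack_arg_offset is hypothetical and not part of x264.

```c
/* Hypothetical helper, not part of x264.
 * Assumes 32-bit ARM AAPCS: arguments 0-3 are passed in r0-r3 and the
 * remaining arguments occupy consecutive 4-byte slots on the stack.
 * Returns the byte offset from sp of argument `arg_index` (0-based,
 * must be >= 4) after `pushed_regs` 32-bit registers have been pushed. */
static int stack_arg_offset(int arg_index, int pushed_regs)
{
    int pushed_bytes = 4 * pushed_regs;      /* space taken by the push */
    int slot_bytes   = 4 * (arg_index - 4);  /* position among stack args */
    return pushed_bytes + slot_bytes;
}
```

With three registers pushed (push {r4-r5,lr}), arguments 4 and 6 come out
at offsets 12 and 20, matching the patch. With one register pushed
(push {r11}), the same arguments sit at offsets 4 and 12.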