[x264-devel] [PATCH 08/24] arm: Add neon versions of vsad, asd8 and ssd_nv12_core
Martin Storsjö
martin at martin.st
Mon Aug 24 12:45:53 CEST 2015
On Tue, 18 Aug 2015, Janne Grunau wrote:
> On 2015-08-13 23:59:29 +0300, Martin Storsjö wrote:
>> These are straight translations of the aarch64 versions.
>>
>> checkasm timing      Cortex-A7      A8      A9
>> vsad_c                   16234   10984    9850
>> vsad_neon                 2132    1020     789
>>
>> asd8_c                    5859    3561    3543
>> asd8_neon                 1407    1279    1250
>>
>> ssd_nv12_c              608967  593057  427131
>> ssd_nv12_neon            73017   34251   41577
>> ---
>> common/arm/pixel-a.S | 129 ++++++++++++++++++++++++++++++++++++++++++++++++++
>> common/arm/pixel.h | 6 +++
>> common/pixel.c | 3 ++
>> 3 files changed, 138 insertions(+)
>>
>> diff --git a/common/arm/pixel-a.S b/common/arm/pixel-a.S
>> index 36858bc..f60fdb5 100644
>> --- a/common/arm/pixel-a.S
>> +++ b/common/arm/pixel-a.S
>> @@ -4,6 +4,7 @@
>> * Copyright (C) 2009-2015 x264 project
>> *
>> * Authors: David Conrad <lessen42 at gmail.com>
>> + * Janne Grunau <janne-x264 at jannau.net>
>> *
>> * This program is free software; you can redistribute it and/or modify
>> * it under the terms of the GNU General Public License as published by
>> @@ -388,6 +389,59 @@ SAD_X_FUNC 4, 8, 16
>> SAD_X_FUNC 4, 16, 8
>> SAD_X_FUNC 4, 16, 16
>>
>> +function x264_pixel_vsad_neon
>> + subs r2, r2, #2
>> + vld1.8 {q0}, [r0], r1
>> + vld1.8 {q1}, [r0], r1
>> + vabdl.u8 q2, d0, d2
>> + vabdl.u8 q3, d1, d3
>> + ble 2f
>> +1:
>> + subs r2, r2, #2
>> + vld1.8 {q0}, [r0], r1
>> + vabal.u8 q2, d2, d0
>> + vabal.u8 q3, d3, d1
>> + vld1.8 {q1}, [r0], r1
>> + blt 2f
>> + vabal.u8 q2, d0, d2
>> + vabal.u8 q3, d1, d3
>> + bgt 1b
>> +2:
>> + vadd.u16 q0, q2, q3
>> + HORIZ_ADD d0, d0, d1
>> + vmov.32 r0, d0[0]
>> + bx lr
>> +endfunc
>> +
>> +function x264_pixel_asd8_neon
>> + ldr r12, [sp, #0]
>> + sub r12, r12, #2
>> + vld1.8 {d0}, [r0], r1
>> + vld1.8 {d1}, [r2], r3
>> + vld1.8 {d2}, [r0], r1
>> + vld1.8 {d3}, [r2], r3
>> + vsubl.u8 q8, d0, d1
>> +1:
>> + subs r12, r12, #2
>> + vld1.8 {d4}, [r0], r1
>> + vld1.8 {d5}, [r2], r3
>> + vsubl.u8 q9, d2, d3
>> + vsubl.u8 q10, d4, d5
>> + vadd.s16 q8, q9
>> + vld1.8 {d2}, [r0], r1
>> + vld1.8 {d3}, [r2], r3
>> + vadd.s16 q8, q10
>> + bgt 1b
>> + vsubl.u8 q9, d2, d3
>> + vadd.s16 q8, q9
>> + vpaddl.s16 q8, q8
>> + vpadd.s32 d16, d16, d17
>> + vpadd.s32 d16, d16, d17
>> + vabs.s32 d16, d16
>> + vmov.32 r0, d16[0]
>> + bx lr
>> +endfunc
>> +
>>
>> .macro SSD_START_4
>> vld1.32 {d16[]}, [r0,:32], r1
>> @@ -489,6 +543,81 @@ SSD_FUNC 8, 16
>> SSD_FUNC 16, 8
>> SSD_FUNC 16, 16
>>
>> +function x264_pixel_ssd_nv12_core_neon
>> + vpush {q4-q5}
>
> why? q12/q13 seems to be free and could be used instead
Indeed, I seem to have forgotten to recheck the need for this after
finishing porting it. Done locally.
>> + push {r4-r5}
>> + ldrd r4, r5, [sp, #40]
>> + add r12, r4, #8
>> + and r12, r12, #~15
>
> bic r12, r12, #15
>
> would be clearer; bic (immediate) doesn't exist in aarch64
Thanks, fixed locally.
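
For readers following the and/bic point: the two forms compute the same value, since bic is a bitwise AND with the complement of its operand. A minimal C sketch of the equivalence, assuming r4 holds the width argument at that point (the helper names below are hypothetical and not part of x264):

```c
#include <stdint.h>

/* Hypothetical model of "and Rd, Rn, #imm": plain bitwise AND. */
uint32_t and_imm(uint32_t x, uint32_t imm)
{
    return x & imm;
}

/* Hypothetical model of "bic Rd, Rn, #imm": AND with the complement of imm. */
uint32_t bic_imm(uint32_t x, uint32_t imm)
{
    return x & ~imm;
}

/* "add r12, r4, #8" followed by "and r12, r12, #~15", as in the posted patch. */
uint32_t rounded_and(uint32_t width)
{
    return and_imm(width + 8, ~UINT32_C(15));
}

/* The same step with the suggested "bic r12, r12, #15". */
uint32_t rounded_bic(uint32_t width)
{
    return bic_imm(width + 8, 15);
}
```

Both variants clear the low four bits of width + 8, which rounds width to the nearest multiple of 16; the bic spelling just states the mask directly.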
// Martin
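
As a reading aid for the quoted assembly, here is a scalar sketch of what the vsad and asd8 routines compute, derived from the NEON code in the patch itself. The function names, the 8-bit pixel type and the ptrdiff_t stride are assumptions made for illustration; this is not the x264 C implementation.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Sum of absolute differences between each 16-pixel row and the row
 * below it, over height rows (so height - 1 row pairs). */
int vsad_ref(const uint8_t *src, ptrdiff_t stride, int height)
{
    int score = 0;
    for (int y = 1; y < height; y++, src += stride)
        for (int x = 0; x < 16; x++)
            score += abs(src[x] - src[x + stride]);
    return score;
}

/* Absolute value of the summed (signed) differences between two
 * 8-pixel-wide blocks, over height rows. */
int asd8_ref(const uint8_t *pix1, ptrdiff_t stride1,
             const uint8_t *pix2, ptrdiff_t stride2, int height)
{
    int sum = 0;
    for (int y = 0; y < height; y++, pix1 += stride1, pix2 += stride2)
        for (int x = 0; x < 8; x++)
            sum += pix1[x] - pix2[x];
    return abs(sum);
}
```

The difference between the two is where the absolute value is taken: vsad takes it per pixel (vabdl/vabal in the NEON code), while asd8 accumulates signed differences and takes a single vabs at the end.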