[x264-devel] [PATCH 1/1] arm: optimize neon luma intra deblock
Janne Grunau
janne-x264 at jannau.net
Wed Sep 2 00:07:59 CEST 2015
Hi Martin,
I forgot to reschedule the beginning of the macro in the last iteration.
The floating point compare is a neat trick to test a 64-bit register
against 0 and control the program flow based on it. I tend to forget that
this is possible.
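
For readers unfamiliar with the trick, here is a rough C model of what the end of the macro computes. It is only a sketch, not code from x264: the helper names narrow_mask, is_zero_int and is_zero_f64 are made up for illustration, and it assumes the 128-bit mask in q12 holds bytes that are either 0x00 or 0xFF, as produced by vclt.u8 and vand.

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical model of `vshrn.u16 d28, q12, #4`: each 16-bit lane of the
 * 128-bit byte mask is shifted right by 4 and narrowed to its low 8 bits,
 * so every input byte contributes one nibble to the 64-bit result. */
static uint64_t narrow_mask(const uint8_t mask[16])
{
    uint64_t out = 0;
    for (int i = 0; i < 8; i++) {
        uint16_t lane = (uint16_t)(mask[2 * i] | (mask[2 * i + 1] << 8));
        uint64_t narrowed = (uint8_t)(lane >> 4);
        out |= narrowed << (8 * i);
    }
    return out;
}

/* Old sequence (vrev64.32, vorr, vmov.32, cmp): OR both 32-bit halves
 * together and compare the result with 0 in a core register. */
static bool is_zero_int(uint64_t bits)
{
    uint32_t lo = (uint32_t)bits;
    uint32_t hi = (uint32_t)(bits >> 32);
    return (lo | hi) == 0;
}

/* New sequence (vcmp.f64 d28, #0): reinterpret the same 64 bits as a
 * double and compare it with 0.0. */
static bool is_zero_f64(uint64_t bits)
{
    double d;
    memcpy(&d, &bits, sizeof d);
    return d == 0.0;
}
```

The two tests differ for the bit pattern 0x8000000000000000, which is -0.0 and compares equal to 0.0. Under the assumption above that pattern should not occur, because every byte of the narrowed mask is 0x00, 0x0F, 0xF0 or 0xFF. An all-ones mask is a NaN, which compares unordered, so it is correctly treated as not equal to zero.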
We might have to add vmrs handling to gaspp. Older assemblers might
understand fmstat instead.
Janne
---8<---
Almost another 10% faster.
                              a7    a15
deblock_luma_intra[0]_neon:   3059  1128
deblock_luma_intra[1]_neon:   2038   720
---
common/arm/deblock-a.S | 18 +++++++-----------
1 file changed, 7 insertions(+), 11 deletions(-)
diff --git a/common/arm/deblock-a.S b/common/arm/deblock-a.S
index 8d8d807..74c1f0e 100644
--- a/common/arm/deblock-a.S
+++ b/common/arm/deblock-a.S
@@ -197,26 +197,22 @@ endfunc
.macro h264_loop_filter_luma_intra
vdup.8 q14, r2 @ alpha
- vdup.8 q15, r3 @ beta
vabd.u8 q4, q8, q0 @ abs(p0 - q0)
vabd.u8 q5, q9, q8 @ abs(p1 - p0)
vabd.u8 q6, q1, q0 @ abs(q1 - q0)
- vclt.u8 q7, q4, q14 @ < alpha
- vclt.u8 q5, q5, q15 @ < beta
- vclt.u8 q6, q6, q15 @ < beta
-
+ vdup.8 q15, r3 @ beta
vmov.u8 q13, #2
+ vclt.u8 q7, q4, q14 @ < alpha
vshr.u8 q14, q14, #2 @ alpha >> 2
+ vclt.u8 q5, q5, q15 @ < beta
vadd.u8 q14, q14, q13 @ (alpha >> 2) + 2
- vclt.u8 q13, q4, q14 @ < (alpha >> 2) + 2 if_2
-
vand q7, q7, q5
+ vclt.u8 q6, q6, q15 @ < beta
+ vclt.u8 q13, q4, q14 @ < (alpha >> 2) + 2 if_2
vand q12, q7, q6 @ if_1
vshrn.u16 d28, q12, #4
- vrev64.32 d29, d28
- vorr d28, d28, d29
- vmov.32 r2, d28[0]
- cmp r2, #0
+ vcmp.f64 d28, #0
+ vmrs APSR_nzcv, FPSCR
beq 9f
sub sp, sp, #32
--
2.5.0