[x264-devel] [Git][videolan/x264][master] 14 commits: mc: Add initial 10 bit neon support
Anton Mitrofanov (@BugMaster)
gitlab at videolan.org
Sun Oct 1 15:29:40 UTC 2023
Anton Mitrofanov pushed to branch master at VideoLAN / x264
Commits:
249924ea by Hubert Mazur at 2023-10-01T15:13:40+00:00
mc: Add initial 10 bit neon support
Add an if/else clause in the files to control which code path is used.
Move the generic functions out of the 8-bit depth scope into a common one
shared by both modes.
Signed-off-by: Hubert Mazur <hum at semihalf.com>
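For reference, the guard this commit introduces mirrors how x264 selects the pixel
type itself at compile time. A minimal sketch of the idea (illustrative only, not
the actual mc-c.c hunk):

    #if BIT_DEPTH == 8
    typedef uint8_t  pixel;   /* 8-bit build: the existing neon kernels */
    #else
    typedef uint16_t pixel;   /* 10-bit build: the kernels added in this series */
    #endif

Depth-specific kernels live inside such a guard, while the generic helpers moved
by this commit sit outside it so both builds share them.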
- - - - -
ba45eba3 by Hubert Mazur at 2023-10-01T15:13:40+00:00
aarch64/mc-c: Unify pixel/uint8_t usage
Previously, some functions from the motion compensation family used uint8_t,
while others used the pixel type. Unify this and change every uint8_t
usage to pixel.
This commit is a prerequisite to 10 bit depth support.
Signed-off-by: Hubert Mazur <hum at semihalf.com>
- - - - -
13a24888 by Hubert Mazur at 2023-10-01T15:13:40+00:00
mc: Add arm64 neon implementation for pixel_avg
Provide a neon-optimized implementation of the pixel_avg functions from
the motion compensation family for 10 bit depth.
Checkasm benchmarks are shown below.
avg_4x2_c: 703
avg_4x2_neon: 222
avg_4x4_c: 1405
avg_4x4_neon: 516
avg_4x8_c: 2759
avg_4x8_neon: 898
avg_4x16_c: 5808
avg_4x16_neon: 1776
avg_8x4_c: 2767
avg_8x4_neon: 412
avg_8x8_c: 5559
avg_8x8_neon: 841
avg_8x16_c: 11176
avg_8x16_neon: 1668
avg_16x8_c: 10493
avg_16x8_neon: 1504
avg_16x16_c: 21116
avg_16x16_neon: 2985
Signed-off-by: Hubert Mazur <hum at semihalf.com>
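For context, pixel_avg blends two predictions with a weight, using the
(a*w + b*(64-w) + 32) >> 6 form and a clamp to the 10-bit range. A scalar sketch
of the operation (illustrative only, not x264's C reference verbatim):

    #include <stdint.h>

    typedef uint16_t pixel;            /* 10-bit samples in 16-bit storage */
    #define PIXEL_MAX 0x3ff            /* the clamp the asm builds with mvni #0xfc, lsl #8 */

    static inline pixel clamp_pixel( int v )
    {
        return v < 0 ? 0 : v > PIXEL_MAX ? PIXEL_MAX : (pixel)v;
    }

    static void pixel_avg_weight_sketch( pixel *dst,  intptr_t dst_stride,
                                         pixel *src1, intptr_t src1_stride,
                                         pixel *src2, intptr_t src2_stride,
                                         int width, int height, int weight )
    {
        for( int y = 0; y < height; y++ )
        {
            for( int x = 0; x < width; x++ )
                dst[x] = clamp_pixel( (src1[x]*weight + src2[x]*(64-weight) + 32) >> 6 );
            dst  += dst_stride;
            src1 += src1_stride;
            src2 += src2_stride;
        }
    }

With weight == 32 this reduces to (a + b + 1) >> 1, which is why the new kernels
branch to a plain urhadd path in that case.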
- - - - -
bb3d83dd by Hubert Mazur at 2023-10-01T15:13:40+00:00
mc: Add arm64 neon implementation for pixel_avg2
Provide a neon-optimized implementation of the pixel_avg2 functions from
the motion compensation family for 10 bit depth.
Signed-off-by: Hubert Mazur <hum at semihalf.com>
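pixel_avg2 is the unweighted variant used when blending two half-pel references;
per pixel it is just the rounding average that urhadd computes, so no clamp is
needed. A scalar sketch (illustrative):

    #include <stdint.h>

    typedef uint16_t pixel;

    static void pixel_avg2_sketch( pixel *dst,  intptr_t dst_stride,
                                   pixel *src1, intptr_t src_stride,
                                   pixel *src2, int width, int height )
    {
        for( int y = 0; y < height; y++ )
        {
            for( int x = 0; x < width; x++ )
                dst[x] = ( src1[x] + src2[x] + 1 ) >> 1;
            dst  += dst_stride;
            src1 += src_stride;   /* both sources share one stride, as in the asm */
            src2 += src_stride;
        }
    }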
- - - - -
f0b0489f by Hubert Mazur at 2023-10-01T15:13:40+00:00
mc: Add arm64 neon implementation for mc_copy
Provide a neon-optimized implementation of the mc_copy functions from
the motion compensation family for 10 bit depth.
Signed-off-by: Hubert Mazur <hum at semihalf.com>
- - - - -
25d5baf4 by Hubert Mazur at 2023-10-01T15:13:40+00:00
mc: Add arm64 neon implementation for mc_weight
Provide a neon-optimized implementation of the mc_weight functions from
the motion compensation family for 10 bit depth.
Benchmark results are shown below.
weight_w4_c: 4734
weight_w4_neon: 4165
weight_w8_c: 8930
weight_w8_neon: 1620
weight_w16_c: 16939
weight_w16_neon: 2729
weight_w20_c: 20721
weight_w20_neon: 3470
Signed-off-by: Hubert Mazur <hum at semihalf.com>
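mc_weight applies explicit weighted prediction: dst = clip(((src*scale +
(1 << (denom-1))) >> denom) + offset), with the offset scaled by 4 for 10 bit
depth (the lsl #2 on the offset in the new kernels). A scalar sketch
(illustrative only):

    #include <stdint.h>

    typedef uint16_t pixel;
    #define PIXEL_MAX 0x3ff

    static inline pixel clamp_pixel( int v )
    {
        return v < 0 ? 0 : v > PIXEL_MAX ? PIXEL_MAX : (pixel)v;
    }

    static void mc_weight_sketch( pixel *dst, intptr_t dst_stride,
                                  pixel *src, intptr_t src_stride,
                                  int scale, int denom, int offset,
                                  int width, int height )
    {
        int off10 = offset << 2;                     /* 8-bit offset -> 10-bit range */
        int round = denom ? 1 << (denom - 1) : 0;
        for( int y = 0; y < height; y++ )
        {
            for( int x = 0; x < width; x++ )
                dst[x] = clamp_pixel( ((src[x]*scale + round) >> denom) + off10 );
            dst += dst_stride;
            src += src_stride;
        }
    }

The nodenom variants in the patch cover denom == 0, and the offsetadd/offsetsub
variants cover the case where only an offset is applied, reducing to a saturating
add or subtract.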
- - - - -
08761208 by Hubert Mazur at 2023-10-01T15:13:40+00:00
mc: Move mc_luma and get_ref wrappers
Previously the mc_luma and get_ref wrappers were only defined for 8 bit depth.
As all required 10 bit depth helper functions now exist, move the wrappers out
of the if scope and define them regardless of bit depth.
Signed-off-by: Hubert Mazur <hum at semihalf.com>
- - - - -
7ff0f978 by Hubert Mazur at 2023-10-01T15:13:40+00:00
mc: Add arm64 neon implementation for mc_chroma
Provide a neon-optimized implementation of the mc_chroma functions from
the motion compensation family for 10 bit depth.
Benchmark results are shown below.
mc_chroma_2x2_c: 700
mc_chroma_2x2_neon: 478
mc_chroma_2x4_c: 1300
mc_chroma_2x4_neon: 765
mc_chroma_4x2_c: 1229
mc_chroma_4x2_neon: 483
mc_chroma_4x4_c: 2383
mc_chroma_4x4_neon: 773
mc_chroma_4x8_c: 4662
mc_chroma_4x8_neon: 1319
mc_chroma_8x4_c: 4450
mc_chroma_8x4_neon: 940
mc_chroma_8x8_c: 8797
mc_chroma_8x8_neon: 1638
Signed-off-by: Hubert Mazur <hum at semihalf.com>
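mc_chroma performs the H.264 eighth-pel bilinear chroma interpolation; the
weights match the cA..cD comments in the new mc_chroma_neon code (cA = (8-dx)*(8-dy),
cB = dx*(8-dy), cC = (8-dx)*dy, cD = dx*dy) and the result is rounded with + 32 >> 6.
A scalar sketch (illustrative; the real function also deinterleaves U and V,
which is omitted here):

    #include <stdint.h>

    typedef uint16_t pixel;

    static void mc_chroma_sketch( pixel *dst, intptr_t dst_stride,
                                  const pixel *src, intptr_t src_stride,
                                  int dx, int dy, int width, int height )
    {
        int cA = (8-dx)*(8-dy);
        int cB =    dx *(8-dy);
        int cC = (8-dx)*   dy;
        int cD =    dx *   dy;
        for( int y = 0; y < height; y++ )
        {
            for( int x = 0; x < width; x++ )
                dst[x] = ( cA*src[x]            + cB*src[x+1]
                         + cC*src[x+src_stride] + cD*src[x+src_stride+1] + 32 ) >> 6;
            dst += dst_stride;
            src += src_stride;
        }
    }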
- - - - -
25ef8832 by Hubert Mazur at 2023-10-01T15:13:40+00:00
mc: Add arm64 neon implementation for mc_integral
Provide a neon-optimized implementation of the mc_integral functions from
the motion compensation family for 10 bit depth.
Benchmark results are shown below.
integral_init4h_c: 2651
integral_init4h_neon: 550
integral_init4v_c: 4247
integral_init4v_neon: 612
integral_init8h_c: 2544
integral_init8h_neon: 1027
integral_init8v_c: 1996
integral_init8v_neon: 245
Signed-off-by: Hubert Mazur <hum at semihalf.com>
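The integral_init helpers build the running sums used by the exhaustive motion
search: the horizontal passes add a 4- or 8-wide window sum of the source row
onto the row above, and the vertical passes then difference rows to form box
sums. A scalar sketch of the 4-wide horizontal step (illustrative; sum points at
the current output row with the previous row stride elements before it, and
16-bit wraparound is harmless because only differences are ever taken):

    #include <stdint.h>

    typedef uint16_t pixel;

    static void integral_init4h_sketch( uint16_t *sum, const pixel *pix,
                                        int width, intptr_t stride )
    {
        for( int x = 0; x < width; x++ )
            sum[x] = (uint16_t)( sum[x - stride]
                               + pix[x] + pix[x+1] + pix[x+2] + pix[x+3] );
    }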
- - - - -
0a810f4f by Hubert Mazur at 2023-10-01T15:13:40+00:00
mc: Add arm64 neon implementation for mc_lowres
Provide a neon-optimized implementation of the mc_lowres function from
the motion compensation family for 10 bit depth.
Benchmark results are shown below.
lowres_init_c: 149446
lowres_init_neon: 13172
Signed-off-by: Hubert Mazur <hum at semihalf.com>
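frame_init_lowres_core builds the four phase-shifted half-resolution planes used
by the lookahead; every output pixel is a nested rounding average of a 2x2
neighbourhood, which maps directly onto the pairwise urhadd structure of the
kernel. A scalar sketch (illustrative):

    #include <stdint.h>

    typedef uint16_t pixel;

    #define RAVG(a,b)     ( ((a) + (b) + 1) >> 1 )
    #define FILT(a,b,c,d) RAVG( RAVG(a,b), RAVG(c,d) )

    static void lowres_sketch( const pixel *src0, pixel *dst0, pixel *dsth,
                               pixel *dstv, pixel *dstc,
                               intptr_t src_stride, intptr_t dst_stride,
                               int width, int height )
    {
        for( int y = 0; y < height; y++ )
        {
            const pixel *src1 = src0 + src_stride;
            const pixel *src2 = src1 + src_stride;
            for( int x = 0; x < width; x++ )
            {
                dst0[x] = FILT( src0[2*x  ], src1[2*x  ], src0[2*x+1], src1[2*x+1] );
                dsth[x] = FILT( src0[2*x+1], src1[2*x+1], src0[2*x+2], src1[2*x+2] );
                dstv[x] = FILT( src1[2*x  ], src2[2*x  ], src1[2*x+1], src2[2*x+1] );
                dstc[x] = FILT( src1[2*x+1], src2[2*x+1], src1[2*x+2], src2[2*x+2] );
            }
            src0 += src_stride*2;
            dst0 += dst_stride;
            dsth += dst_stride;
            dstv += dst_stride;
            dstc += dst_stride;
        }
    }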
- - - - -
68d71206 by Hubert Mazur at 2023-10-01T15:13:40+00:00
mc: Add arm64 neon implementation for mc_load func
Provide a neon-optimized implementation of the mc_load_deinterleave function
from the motion compensation family for 10 bit depth.
Benchmark results are shown below.
load_deinterleave_chroma_fdec_c: 2936
load_deinterleave_chroma_fdec_neon: 422
Signed-off-by: Hubert Mazur <hum at semihalf.com>
- - - - -
df179744 by Hubert Mazur at 2023-10-01T15:13:40+00:00
mc: Add arm64 neon implementation for store func
Provide a neon-optimized implementation of the mc_store_interleave function
from the motion compensation family for 10 bit depth.
Benchmark results are shown below.
load_deinterleave_chroma_fenc_c: 2910
load_deinterleave_chroma_fenc_neon: 430
Signed-off-by: Hubert Mazur <hum at semihalf.com>
- - - - -
e47bede8 by Hubert Mazur at 2023-10-01T15:13:40+00:00
mc: Add arm64 neon implementation for copy funcs
Provide neon-optimized implementations of the plane_copy functions
from the motion compensation family for 10 bit depth.
Benchmark results are shown below.
plane_copy_c: 2955
plane_copy_neon: 2910
plane_copy_deinterleave_c: 24056
plane_copy_deinterleave_neon: 3625
plane_copy_deinterleave_rgb_c: 19928
plane_copy_deinterleave_rgb_neon: 3941
plane_copy_interleave_c: 24399
plane_copy_interleave_neon: 4723
plane_copy_swap_c: 32269
plane_copy_swap_neon: 3211
Signed-off-by: Hubert Mazur <hum at semihalf.com>
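These kernels are memory-layout transforms: deinterleave splits an interleaved
UVUV plane into separate U and V planes, interleave is the inverse, swap
exchanges the two components of each pair, and the rgb variant handles packed
3- or 4-component input. A scalar sketch of the deinterleave case (illustrative):

    #include <stdint.h>

    typedef uint16_t pixel;

    static void plane_copy_deinterleave_sketch( pixel *dstu, intptr_t dstu_stride,
                                                pixel *dstv, intptr_t dstv_stride,
                                                const pixel *src, intptr_t src_stride,
                                                int width, int height )
    {
        for( int y = 0; y < height; y++ )
        {
            for( int x = 0; x < width; x++ )
            {
                dstu[x] = src[2*x];       /* even samples -> first plane  */
                dstv[x] = src[2*x+1];     /* odd samples  -> second plane */
            }
            dstu += dstu_stride;
            dstv += dstv_stride;
            src  += src_stride;
        }
    }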
- - - - -
cc5c343f by Hubert Mazur at 2023-10-01T15:13:40+00:00
mc: Add arm64 neon implementation for hpel filter
Provide a neon-optimized implementation of the hpel_filter function
from the motion compensation family for 10 bit depth.
Benchmark results are shown below.
hpel_filter_c: 111495
hpel_filter_neon: 37849
Signed-off-by: Hubert Mazur <hum at semihalf.com>
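hpel_filter generates the three half-pel planes with the 6-tap (1,-5,20,20,-5,1)
filter; the 5 and 20 coefficients are the movi v30.8h, #5 / movi v31.8h, #20
constants in the new code, and results are rounded and clamped to the 10-bit
range. A scalar sketch of the horizontal plane (illustrative; the vertical plane
applies the same taps across rows, and the centre plane filters the vertical
intermediates horizontally again for a total rounded downshift of 10 bits):

    #include <stdint.h>

    typedef uint16_t pixel;
    #define PIXEL_MAX 0x3ff

    static inline pixel clamp_pixel( int v )
    {
        return v < 0 ? 0 : v > PIXEL_MAX ? PIXEL_MAX : (pixel)v;
    }

    /* Assumes src has the usual padded borders so src[x-2]..src[x+3] are valid. */
    static void hpel_h_sketch( pixel *dsth, const pixel *src, int width )
    {
        for( int x = 0; x < width; x++ )
        {
            int v = src[x-2] - 5*src[x-1] + 20*src[x]
                  + 20*src[x+1] - 5*src[x+2] + src[x+3];
            dsth[x] = clamp_pixel( (v + 16) >> 5 );
        }
    }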
- - - - -
2 changed files:
- common/aarch64/mc-a.S
- common/aarch64/mc-c.c
Changes:
=====================================
common/aarch64/mc-a.S
=====================================
@@ -85,6 +85,220 @@ endfunc
prefetch_fenc 420
prefetch_fenc 422
+function mbtree_propagate_cost_neon, export=1
+ ld1r {v5.4s}, [x5]
+8:
+ subs w6, w6, #8
+ ld1 {v1.8h}, [x1], #16
+ ld1 {v2.8h}, [x2], #16
+ ld1 {v3.8h}, [x3], #16
+ ld1 {v4.8h}, [x4], #16
+ bic v3.8h, #0xc0, lsl #8
+ umin v3.8h, v2.8h, v3.8h
+ umull v20.4s, v2.4h, v4.4h // propagate_intra
+ umull2 v21.4s, v2.8h, v4.8h // propagate_intra
+ usubl v22.4s, v2.4h, v3.4h // propagate_num
+ usubl2 v23.4s, v2.8h, v3.8h // propagate_num
+ uxtl v26.4s, v2.4h // propagate_denom
+ uxtl2 v27.4s, v2.8h // propagate_denom
+ uxtl v24.4s, v1.4h
+ uxtl2 v25.4s, v1.8h
+ ucvtf v20.4s, v20.4s
+ ucvtf v21.4s, v21.4s
+ ucvtf v26.4s, v26.4s
+ ucvtf v27.4s, v27.4s
+ ucvtf v22.4s, v22.4s
+ ucvtf v23.4s, v23.4s
+ frecpe v28.4s, v26.4s
+ frecpe v29.4s, v27.4s
+ ucvtf v24.4s, v24.4s
+ ucvtf v25.4s, v25.4s
+ frecps v30.4s, v28.4s, v26.4s
+ frecps v31.4s, v29.4s, v27.4s
+ fmla v24.4s, v20.4s, v5.4s // propagate_amount
+ fmla v25.4s, v21.4s, v5.4s // propagate_amount
+ fmul v28.4s, v28.4s, v30.4s
+ fmul v29.4s, v29.4s, v31.4s
+ fmul v16.4s, v24.4s, v22.4s
+ fmul v17.4s, v25.4s, v23.4s
+ fmul v18.4s, v16.4s, v28.4s
+ fmul v19.4s, v17.4s, v29.4s
+ fcvtns v20.4s, v18.4s
+ fcvtns v21.4s, v19.4s
+ sqxtn v0.4h, v20.4s
+ sqxtn2 v0.8h, v21.4s
+ st1 {v0.8h}, [x0], #16
+ b.gt 8b
+ ret
+endfunc
+
+const pw_0to15, align=5
+ .short 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15
+endconst
+
+function mbtree_propagate_list_internal_neon, export=1
+ movrel x11, pw_0to15
+ dup v31.8h, w4 // bipred_weight
+ movi v30.8h, #0xc0, lsl #8
+ ld1 {v29.8h}, [x11] //h->mb.i_mb_x,h->mb.i_mb_y
+ movi v28.4s, #4
+ movi v27.8h, #31
+ movi v26.8h, #32
+ dup v24.8h, w5 // mb_y
+ zip1 v29.8h, v29.8h, v24.8h
+8:
+ subs w6, w6, #8
+ ld1 {v1.8h}, [x1], #16 // propagate_amount
+ ld1 {v2.8h}, [x2], #16 // lowres_cost
+ and v2.16b, v2.16b, v30.16b
+ cmeq v25.8h, v2.8h, v30.8h
+ umull v16.4s, v1.4h, v31.4h
+ umull2 v17.4s, v1.8h, v31.8h
+ rshrn v16.4h, v16.4s, #6
+ rshrn2 v16.8h, v17.4s, #6
+ bsl v25.16b, v16.16b, v1.16b // if( lists_used == 3 )
+ // propagate_amount = (propagate_amount * bipred_weight + 32) >> 6
+ ld1 {v4.8h,v5.8h}, [x0], #32
+ sshr v6.8h, v4.8h, #5
+ sshr v7.8h, v5.8h, #5
+ add v6.8h, v6.8h, v29.8h
+ add v29.8h, v29.8h, v28.8h
+ add v7.8h, v7.8h, v29.8h
+ add v29.8h, v29.8h, v28.8h
+ st1 {v6.8h,v7.8h}, [x3], #32
+ and v4.16b, v4.16b, v27.16b
+ and v5.16b, v5.16b, v27.16b
+ uzp1 v6.8h, v4.8h, v5.8h // x & 31
+ uzp2 v7.8h, v4.8h, v5.8h // y & 31
+ sub v4.8h, v26.8h, v6.8h // 32 - (x & 31)
+ sub v5.8h, v26.8h, v7.8h // 32 - (y & 31)
+ mul v19.8h, v6.8h, v7.8h // idx3weight = y*x;
+ mul v18.8h, v4.8h, v7.8h // idx2weight = y*(32-x);
+ mul v17.8h, v6.8h, v5.8h // idx1weight = (32-y)*x;
+ mul v16.8h, v4.8h, v5.8h // idx0weight = (32-y)*(32-x) ;
+ umull v6.4s, v19.4h, v25.4h
+ umull2 v7.4s, v19.8h, v25.8h
+ umull v4.4s, v18.4h, v25.4h
+ umull2 v5.4s, v18.8h, v25.8h
+ umull v2.4s, v17.4h, v25.4h
+ umull2 v3.4s, v17.8h, v25.8h
+ umull v0.4s, v16.4h, v25.4h
+ umull2 v1.4s, v16.8h, v25.8h
+ rshrn v19.4h, v6.4s, #10
+ rshrn2 v19.8h, v7.4s, #10
+ rshrn v18.4h, v4.4s, #10
+ rshrn2 v18.8h, v5.4s, #10
+ rshrn v17.4h, v2.4s, #10
+ rshrn2 v17.8h, v3.4s, #10
+ rshrn v16.4h, v0.4s, #10
+ rshrn2 v16.8h, v1.4s, #10
+ zip1 v0.8h, v16.8h, v17.8h
+ zip2 v1.8h, v16.8h, v17.8h
+ zip1 v2.8h, v18.8h, v19.8h
+ zip2 v3.8h, v18.8h, v19.8h
+ st1 {v0.8h,v1.8h}, [x3], #32
+ st1 {v2.8h,v3.8h}, [x3], #32
+ b.ge 8b
+ ret
+endfunc
+
+function memcpy_aligned_neon, export=1
+ tst x2, #16
+ b.eq 32f
+ sub x2, x2, #16
+ ldr q0, [x1], #16
+ str q0, [x0], #16
+32:
+ tst x2, #32
+ b.eq 640f
+ sub x2, x2, #32
+ ldp q0, q1, [x1], #32
+ stp q0, q1, [x0], #32
+640:
+ cbz x2, 1f
+64:
+ subs x2, x2, #64
+ ldp q0, q1, [x1, #32]
+ ldp q2, q3, [x1], #64
+ stp q0, q1, [x0, #32]
+ stp q2, q3, [x0], #64
+ b.gt 64b
+1:
+ ret
+endfunc
+
+function memzero_aligned_neon, export=1
+ movi v0.16b, #0
+ movi v1.16b, #0
+1:
+ subs x1, x1, #128
+ stp q0, q1, [x0, #96]
+ stp q0, q1, [x0, #64]
+ stp q0, q1, [x0, #32]
+ stp q0, q1, [x0], 128
+ b.gt 1b
+ ret
+endfunc
+
+// void mbtree_fix8_pack( int16_t *dst, float *src, int count )
+function mbtree_fix8_pack_neon, export=1
+ subs w3, w2, #8
+ b.lt 2f
+1:
+ subs w3, w3, #8
+ ld1 {v0.4s,v1.4s}, [x1], #32
+ fcvtzs v0.4s, v0.4s, #8
+ fcvtzs v1.4s, v1.4s, #8
+ sqxtn v2.4h, v0.4s
+ sqxtn2 v2.8h, v1.4s
+ rev16 v3.16b, v2.16b
+ st1 {v3.8h}, [x0], #16
+ b.ge 1b
+2:
+ adds w3, w3, #8
+ b.eq 4f
+3:
+ subs w3, w3, #1
+ ldr s0, [x1], #4
+ fcvtzs w4, s0, #8
+ rev16 w5, w4
+ strh w5, [x0], #2
+ b.gt 3b
+4:
+ ret
+endfunc
+
+// void mbtree_fix8_unpack( float *dst, int16_t *src, int count )
+function mbtree_fix8_unpack_neon, export=1
+ subs w3, w2, #8
+ b.lt 2f
+1:
+ subs w3, w3, #8
+ ld1 {v0.8h}, [x1], #16
+ rev16 v1.16b, v0.16b
+ sxtl v2.4s, v1.4h
+ sxtl2 v3.4s, v1.8h
+ scvtf v4.4s, v2.4s, #8
+ scvtf v5.4s, v3.4s, #8
+ st1 {v4.4s,v5.4s}, [x0], #32
+ b.ge 1b
+2:
+ adds w3, w3, #8
+ b.eq 4f
+3:
+ subs w3, w3, #1
+ ldrh w4, [x1], #2
+ rev16 w5, w4
+ sxth w6, w5
+ scvtf s0, w6, #8
+ str s0, [x0], #4
+ b.gt 3b
+4:
+ ret
+endfunc
+
+#if BIT_DEPTH == 8
+
// void pixel_avg( uint8_t *dst, intptr_t dst_stride,
// uint8_t *src1, intptr_t src1_stride,
// uint8_t *src2, intptr_t src2_stride, int weight );
@@ -1542,214 +1756,2047 @@ function integral_init8v_neon, export=1
ret
endfunc
-function mbtree_propagate_cost_neon, export=1
- ld1r {v5.4s}, [x5]
-8:
- subs w6, w6, #8
- ld1 {v1.8h}, [x1], #16
- ld1 {v2.8h}, [x2], #16
- ld1 {v3.8h}, [x3], #16
- ld1 {v4.8h}, [x4], #16
- bic v3.8h, #0xc0, lsl #8
- umin v3.8h, v2.8h, v3.8h
- umull v20.4s, v2.4h, v4.4h // propagate_intra
- umull2 v21.4s, v2.8h, v4.8h // propagate_intra
- usubl v22.4s, v2.4h, v3.4h // propagate_num
- usubl2 v23.4s, v2.8h, v3.8h // propagate_num
- uxtl v26.4s, v2.4h // propagate_denom
- uxtl2 v27.4s, v2.8h // propagate_denom
- uxtl v24.4s, v1.4h
- uxtl2 v25.4s, v1.8h
- ucvtf v20.4s, v20.4s
- ucvtf v21.4s, v21.4s
- ucvtf v26.4s, v26.4s
- ucvtf v27.4s, v27.4s
- ucvtf v22.4s, v22.4s
- ucvtf v23.4s, v23.4s
- frecpe v28.4s, v26.4s
- frecpe v29.4s, v27.4s
- ucvtf v24.4s, v24.4s
- ucvtf v25.4s, v25.4s
- frecps v30.4s, v28.4s, v26.4s
- frecps v31.4s, v29.4s, v27.4s
- fmla v24.4s, v20.4s, v5.4s // propagate_amount
- fmla v25.4s, v21.4s, v5.4s // propagate_amount
- fmul v28.4s, v28.4s, v30.4s
- fmul v29.4s, v29.4s, v31.4s
- fmul v16.4s, v24.4s, v22.4s
- fmul v17.4s, v25.4s, v23.4s
- fmul v18.4s, v16.4s, v28.4s
- fmul v19.4s, v17.4s, v29.4s
- fcvtns v20.4s, v18.4s
- fcvtns v21.4s, v19.4s
- sqxtn v0.4h, v20.4s
- sqxtn2 v0.8h, v21.4s
- st1 {v0.8h}, [x0], #16
- b.gt 8b
+#else // BIT_DEPTH == 8
+
+// void pixel_avg( pixel *dst, intptr_t dst_stride,
+// pixel *src1, intptr_t src1_stride,
+// pixel *src2, intptr_t src2_stride, int weight );
+.macro AVGH w h
+function pixel_avg_\w\()x\h\()_neon, export=1
+ mov w10, #64
+ cmp w6, #32
+ mov w9, #\h
+ b.eq pixel_avg_w\w\()_neon
+ subs w7, w10, w6
+ b.lt pixel_avg_weight_w\w\()_add_sub_neon // weight > 64
+ cmp w6, #0
+ b.ge pixel_avg_weight_w\w\()_add_add_neon
+ b pixel_avg_weight_w\w\()_sub_add_neon // weight < 0
+endfunc
+.endm
+
+AVGH 4, 2
+AVGH 4, 4
+AVGH 4, 8
+AVGH 4, 16
+AVGH 8, 4
+AVGH 8, 8
+AVGH 8, 16
+AVGH 16, 8
+AVGH 16, 16
+
+// 0 < weight < 64
+.macro load_weights_add_add
+ mov w6, w6
+.endm
+.macro weight_add_add dst, s1, s2, h=
+.ifc \h, 2
+ umull2 \dst, \s1, v30.8h
+ umlal2 \dst, \s2, v31.8h
+.else
+ umull \dst, \s1, v30.4h
+ umlal \dst, \s2, v31.4h
+.endif
+.endm
+
+// weight > 64
+.macro load_weights_add_sub
+ neg w7, w7
+.endm
+.macro weight_add_sub dst, s1, s2, h=
+.ifc \h, 2
+ umull2 \dst, \s1, v30.8h
+ umlsl2 \dst, \s2, v31.8h
+.else
+ umull \dst, \s1, v30.4h
+ umlsl \dst, \s2, v31.4h
+.endif
+.endm
+
+// weight < 0
+.macro load_weights_sub_add
+ neg w6, w6
+.endm
+.macro weight_sub_add dst, s1, s2, h=
+.ifc \h, 2
+ umull2 \dst, \s2, v31.8h
+ umlsl2 \dst, \s1, v30.8h
+.else
+ umull \dst, \s2, v31.4h
+ umlsl \dst, \s1, v30.4h
+.endif
+.endm
+
+.macro AVG_WEIGHT ext
+function pixel_avg_weight_w4_\ext\()_neon
+ load_weights_\ext
+ dup v30.8h, w6
+ dup v31.8h, w7
+ lsl x3, x3, #1
+ lsl x5, x5, #1
+ lsl x1, x1, #1
+1: // height loop
+ subs w9, w9, #2
+ ld1 {v0.d}[0], [x2], x3
+ ld1 {v1.d}[0], [x4], x5
+ weight_\ext v4.4s, v0.4h, v1.4h
+ ld1 {v2.d}[0], [x2], x3
+ ld1 {v3.d}[0], [x4], x5
+
+ mvni v28.8h, #0xfc, lsl #8
+
+ sqrshrun v4.4h, v4.4s, #6
+ weight_\ext v5.4s, v2.4h, v3.4h
+ smin v4.4h, v4.4h, v28.4h
+ sqrshrun v5.4h, v5.4s, #6
+
+ st1 {v4.d}[0], [x0], x1
+
+ smin v5.4h, v5.4h, v28.4h
+
+ st1 {v5.d}[0], [x0], x1
+
+ b.gt 1b
+ ret
+endfunc
+
+function pixel_avg_weight_w8_\ext\()_neon
+ load_weights_\ext
+ dup v30.8h, w6
+ dup v31.8h, w7
+ lsl x1, x1, #1
+ lsl x3, x3, #1
+ lsl x5, x5, #1
+1: // height loop
+ subs w9, w9, #4
+ ld1 {v0.8h}, [x2], x3
+ ld1 {v1.8h}, [x4], x5
+ weight_\ext v16.4s, v0.4h, v1.4h
+ weight_\ext v17.4s, v0.8h, v1.8h, 2
+ ld1 {v2.8h}, [x2], x3
+ ld1 {v3.8h}, [x4], x5
+ weight_\ext v18.4s, v2.4h, v3.4h
+ weight_\ext v19.4s, v2.8h, v3.8h, 2
+ ld1 {v4.8h}, [x2], x3
+ ld1 {v5.8h}, [x4], x5
+ weight_\ext v20.4s, v4.4h, v5.4h
+ weight_\ext v21.4s, v4.8h, v5.8h, 2
+ ld1 {v6.8h}, [x2], x3
+ ld1 {v7.8h}, [x4], x5
+ weight_\ext v22.4s, v6.4h, v7.4h
+ weight_\ext v23.4s, v6.8h, v7.8h, 2
+
+ mvni v28.8h, #0xfc, lsl #8
+
+ sqrshrun v0.4h, v16.4s, #6
+ sqrshrun v2.4h, v18.4s, #6
+ sqrshrun v4.4h, v20.4s, #6
+ sqrshrun2 v0.8h, v17.4s, #6
+ sqrshrun v6.4h, v22.4s, #6
+ sqrshrun2 v2.8h, v19.4s, #6
+ sqrshrun2 v4.8h, v21.4s, #6
+ smin v0.8h, v0.8h, v28.8h
+ smin v2.8h, v2.8h, v28.8h
+ sqrshrun2 v6.8h, v23.4s, #6
+ smin v4.8h, v4.8h, v28.8h
+ smin v6.8h, v6.8h, v28.8h
+
+ st1 {v0.8h}, [x0], x1
+ st1 {v2.8h}, [x0], x1
+ st1 {v4.8h}, [x0], x1
+ st1 {v6.8h}, [x0], x1
+ b.gt 1b
+ ret
+endfunc
+
+function pixel_avg_weight_w16_\ext\()_neon
+ load_weights_\ext
+ dup v30.8h, w6
+ dup v31.8h, w7
+ lsl x1, x1, #1
+ lsl x3, x3, #1
+ lsl x5, x5, #1
+1: // height loop
+ subs w9, w9, #2
+
+ ld1 {v0.8h, v1.8h}, [x2], x3
+ ld1 {v2.8h, v3.8h}, [x4], x5
+ ld1 {v4.8h, v5.8h}, [x2], x3
+ ld1 {v6.8h, v7.8h}, [x4], x5
+
+ weight_\ext v16.4s, v0.4h, v2.4h
+ weight_\ext v17.4s, v0.8h, v2.8h, 2
+ weight_\ext v18.4s, v1.4h, v3.4h
+ weight_\ext v19.4s, v1.8h, v3.8h, 2
+ weight_\ext v20.4s, v4.4h, v6.4h
+ weight_\ext v21.4s, v4.8h, v6.8h, 2
+ weight_\ext v22.4s, v5.4h, v7.4h
+ weight_\ext v23.4s, v5.8h, v7.8h, 2
+
+ mvni v28.8h, #0xfc, lsl #8
+
+ sqrshrun v0.4h, v16.4s, #6
+ sqrshrun v1.4h, v18.4s, #6
+ sqrshrun v2.4h, v20.4s, #6
+ sqrshrun2 v0.8h, v17.4s, #6
+ sqrshrun2 v1.8h, v19.4s, #6
+ sqrshrun2 v2.8h, v21.4s, #6
+ smin v0.8h, v0.8h, v28.8h
+ smin v1.8h, v1.8h, v28.8h
+ sqrshrun v3.4h, v22.4s, #6
+ smin v2.8h, v2.8h, v28.8h
+ sqrshrun2 v3.8h, v23.4s, #6
+ smin v3.8h, v3.8h, v28.8h
+
+ st1 {v0.8h, v1.8h}, [x0], x1
+ st1 {v2.8h, v3.8h}, [x0], x1
+ b.gt 1b
+ ret
+endfunc
+.endm
+
+AVG_WEIGHT add_add
+AVG_WEIGHT add_sub
+AVG_WEIGHT sub_add
+
+function pixel_avg_w4_neon
+ lsl x1, x1, #1
+ lsl x3, x3, #1
+ lsl x5, x5, #1
+
+1: subs w9, w9, #2
+ ld1 {v0.d}[0], [x2], x3
+ ld1 {v2.d}[0], [x4], x5
+ ld1 {v0.d}[1], [x2], x3
+ ld1 {v2.d}[1], [x4], x5
+ urhadd v0.8h, v0.8h, v2.8h
+ st1 {v0.d}[0], [x0], x1
+ st1 {v0.d}[1], [x0], x1
+ b.gt 1b
+ ret
+endfunc
+
+function pixel_avg_w8_neon
+ lsl x1, x1, #1
+ lsl x3, x3, #1
+ lsl x5, x5, #1
+1: subs w9, w9, #4
+ ld1 {v0.8h}, [x2], x3
+ ld1 {v1.8h}, [x4], x5
+ ld1 {v2.8h}, [x2], x3
+ urhadd v0.8h, v0.8h, v1.8h
+ ld1 {v3.8h}, [x4], x5
+ st1 {v0.8h}, [x0], x1
+ ld1 {v4.8h}, [x2], x3
+ urhadd v1.8h, v2.8h, v3.8h
+ ld1 {v5.8h}, [x4], x5
+ st1 {v1.8h}, [x0], x1
+ ld1 {v6.8h}, [x2], x3
+ ld1 {v7.8h}, [x4], x5
+ urhadd v2.8h, v4.8h, v5.8h
+ urhadd v3.8h, v6.8h, v7.8h
+ st1 {v2.8h}, [x0], x1
+ st1 {v3.8h}, [x0], x1
+ b.gt 1b
+ ret
+endfunc
+
+function pixel_avg_w16_neon
+ lsl x1, x1, #1
+ lsl x3, x3, #1
+ lsl x5, x5, #1
+
+1: subs w9, w9, #4
+
+ ld1 {v0.8h, v1.8h}, [x2], x3
+ ld1 {v2.8h, v3.8h}, [x4], x5
+ ld1 {v4.8h, v5.8h}, [x2], x3
+ urhadd v0.8h, v0.8h, v2.8h
+ urhadd v1.8h, v1.8h, v3.8h
+ ld1 {v6.8h, v7.8h}, [x4], x5
+ ld1 {v20.8h, v21.8h}, [x2], x3
+ st1 {v0.8h, v1.8h}, [x0], x1
+ urhadd v4.8h, v4.8h, v6.8h
+ urhadd v5.8h, v5.8h, v7.8h
+ ld1 {v22.8h, v23.8h}, [x4], x5
+ ld1 {v24.8h, v25.8h}, [x2], x3
+ st1 {v4.8h, v5.8h}, [x0], x1
+ ld1 {v26.8h, v27.8h}, [x4], x5
+ urhadd v20.8h, v20.8h, v22.8h
+ urhadd v21.8h, v21.8h, v23.8h
+ urhadd v24.8h, v24.8h, v26.8h
+ urhadd v25.8h, v25.8h, v27.8h
+ st1 {v20.8h, v21.8h}, [x0], x1
+ st1 {v24.8h, v25.8h}, [x0], x1
+
+ b.gt 1b
+ ret
+endfunc
+
+function pixel_avg2_w4_neon, export=1
+ lsl x1, x1, #1
+ lsl x3, x3, #1
+1:
+ subs w5, w5, #2
+ ld1 {v0.4h}, [x2], x3
+ ld1 {v2.4h}, [x4], x3
+ ld1 {v1.4h}, [x2], x3
+ ld1 {v3.4h}, [x4], x3
+ urhadd v0.4h, v0.4h, v2.4h
+ urhadd v1.4h, v1.4h, v3.4h
+
+ st1 {v0.4h}, [x0], x1
+ st1 {v1.4h}, [x0], x1
+ b.gt 1b
+ ret
+endfunc
+
+function pixel_avg2_w8_neon, export=1
+ lsl x1, x1, #1
+ lsl x3, x3, #1
+1:
+ subs w5, w5, #2
+ ld1 {v0.8h}, [x2], x3
+ ld1 {v2.8h}, [x4], x3
+ ld1 {v1.8h}, [x2], x3
+ ld1 {v3.8h}, [x4], x3
+ urhadd v0.8h, v0.8h, v2.8h
+ urhadd v1.8h, v1.8h, v3.8h
+
+ st1 {v0.8h}, [x0], x1
+ st1 {v1.8h}, [x0], x1
+ b.gt 1b
+ ret
+endfunc
+
+function pixel_avg2_w16_neon, export=1
+ lsl x1, x1, #1
+ lsl x3, x3, #1
+1:
+ subs w5, w5, #2
+ ld1 {v0.8h, v1.8h}, [x2], x3
+ ld1 {v2.8h, v3.8h}, [x4], x3
+ ld1 {v4.8h, v5.8h}, [x2], x3
+ ld1 {v6.8h, v7.8h}, [x4], x3
+ urhadd v0.8h, v0.8h, v2.8h
+ urhadd v1.8h, v1.8h, v3.8h
+ urhadd v4.8h, v4.8h, v6.8h
+ urhadd v5.8h, v5.8h, v7.8h
+
+ st1 {v0.8h, v1.8h}, [x0], x1
+ st1 {v4.8h, v5.8h}, [x0], x1
+ b.gt 1b
+ ret
+endfunc
+
+function pixel_avg2_w20_neon, export=1
+ lsl x1, x1, #1
+ lsl x3, x3, #1
+ sub x1, x1, #32
+1:
+ subs w5, w5, #2
+
+ ld1 {v0.8h, v1.8h, v2.8h}, [x2], x3
+ ld1 {v3.8h, v4.8h, v5.8h}, [x4], x3
+ ld1 {v20.8h, v21.8h, v22.8h}, [x2], x3
+ ld1 {v23.8h, v24.8h, v25.8h}, [x4], x3
+
+ urhadd v0.8h, v0.8h, v3.8h
+ urhadd v1.8h, v1.8h, v4.8h
+ urhadd v2.4h, v2.4h, v5.4h
+ urhadd v20.8h, v20.8h, v23.8h
+ urhadd v21.8h, v21.8h, v24.8h
+ urhadd v22.4h, v22.4h, v25.4h
+
+ st1 {v0.8h, v1.8h}, [x0], #32
+ st1 {v2.4h}, [x0], x1
+ st1 {v20.8h, v21.8h}, [x0], #32
+ st1 {v22.4h}, [x0], x1
+ b.gt 1b
+ ret
+endfunc
+
+// void mc_copy( pixel *dst, intptr_t dst_stride, pixel *src, intptr_t src_stride, int height )
+function mc_copy_w4_neon, export=1
+ lsl x1, x1, #1
+ lsl x3, x3, #1
+1:
+ subs w4, w4, #4
+ ld1 {v0.d}[0], [x2], x3
+ ld1 {v1.d}[0], [x2], x3
+ ld1 {v2.d}[0], [x2], x3
+ ld1 {v3.d}[0], [x2], x3
+ st1 {v0.d}[0], [x0], x1
+ st1 {v1.d}[0], [x0], x1
+ st1 {v2.d}[0], [x0], x1
+ st1 {v3.d}[0], [x0], x1
+ b.gt 1b
+ ret
+endfunc
+
+function mc_copy_w8_neon, export=1
+ lsl x1, x1, #1
+ lsl x3, x3, #1
+1: subs w4, w4, #4
+ ld1 {v0.8h}, [x2], x3
+ ld1 {v1.8h}, [x2], x3
+ ld1 {v2.8h}, [x2], x3
+ ld1 {v3.8h}, [x2], x3
+ st1 {v0.8h}, [x0], x1
+ st1 {v1.8h}, [x0], x1
+ st1 {v2.8h}, [x0], x1
+ st1 {v3.8h}, [x0], x1
+ b.gt 1b
+ ret
+endfunc
+
+function mc_copy_w16_neon, export=1
+ lsl x1, x1, #1
+ lsl x3, x3, #1
+1: subs w4, w4, #4
+ ld1 {v0.8h, v1.8h}, [x2], x3
+ ld1 {v2.8h, v3.8h}, [x2], x3
+ ld1 {v4.8h, v5.8h}, [x2], x3
+ ld1 {v6.8h, v7.8h}, [x2], x3
+ st1 {v0.8h, v1.8h}, [x0], x1
+ st1 {v2.8h, v3.8h}, [x0], x1
+ st1 {v4.8h, v5.8h}, [x0], x1
+ st1 {v6.8h, v7.8h}, [x0], x1
+ b.gt 1b
+ ret
+endfunc
+
+.macro weight_prologue type
+ mov w9, w5 // height
+.ifc \type, full
+ ldr w12, [x4, #32] // denom
+.endif
+ ldp w4, w5, [x4, #32+4] // scale, offset
+ dup v0.8h, w4
+ lsl w5, w5, #2
+ dup v1.4s, w5
+.ifc \type, full
+ neg w12, w12
+ dup v2.4s, w12
+.endif
+.endm
+
+// void mc_weight( pixel *src, intptr_t src_stride, pixel *dst,
+// intptr_t dst_stride, const x264_weight_t *weight, int h )
+function mc_weight_w20_neon, export=1
+ weight_prologue full
+ lsl x3, x3, #1
+ lsl x1, x1, #1
+ sub x1, x1, #32
+1:
+ subs w9, w9, #2
+ ld1 {v16.8h, v17.8h, v18.8h}, [x2], x3
+ ld1 {v19.8h, v20.8h, v21.8h}, [x2], x3
+
+ umull v22.4s, v16.4h, v0.4h
+ umull2 v23.4s, v16.8h, v0.8h
+ umull v24.4s, v17.4h, v0.4h
+ umull2 v25.4s, v17.8h, v0.8h
+ umull v26.4s, v18.4h, v0.4h
+ umull v27.4s, v21.4h, v0.4h
+
+ srshl v22.4s, v22.4s, v2.4s
+ srshl v23.4s, v23.4s, v2.4s
+ srshl v24.4s, v24.4s, v2.4s
+ srshl v25.4s, v25.4s, v2.4s
+ srshl v26.4s, v26.4s, v2.4s
+ srshl v27.4s, v27.4s, v2.4s
+ add v22.4s, v22.4s, v1.4s
+ add v23.4s, v23.4s, v1.4s
+ add v24.4s, v24.4s, v1.4s
+ add v25.4s, v25.4s, v1.4s
+ add v26.4s, v26.4s, v1.4s
+ add v27.4s, v27.4s, v1.4s
+
+ sqxtun v22.4h, v22.4s
+ sqxtun2 v22.8h, v23.4s
+ sqxtun v23.4h, v24.4s
+ sqxtun2 v23.8h, v25.4s
+ sqxtun v24.4h, v26.4s
+ sqxtun2 v24.8h, v27.4s
+
+ umull v16.4s, v19.4h, v0.4h
+ umull2 v17.4s, v19.8h, v0.8h
+ umull v18.4s, v20.4h, v0.4h
+ umull2 v19.4s, v20.8h, v0.8h
+
+ srshl v16.4s, v16.4s, v2.4s
+ srshl v17.4s, v17.4s, v2.4s
+ srshl v18.4s, v18.4s, v2.4s
+ srshl v19.4s, v19.4s, v2.4s
+ add v16.4s, v16.4s, v1.4s
+ add v17.4s, v17.4s, v1.4s
+ add v18.4s, v18.4s, v1.4s
+ add v19.4s, v19.4s, v1.4s
+
+ sqxtun v16.4h, v16.4s
+ sqxtun2 v16.8h, v17.4s
+ sqxtun v17.4h, v18.4s
+ sqxtun2 v17.8h, v19.4s
+
+ mvni v31.8h, #0xfc, lsl #8
+
+ umin v22.8h, v22.8h, v31.8h
+ umin v23.8h, v23.8h, v31.8h
+ umin v24.8h, v24.8h, v31.8h
+ umin v16.8h, v16.8h, v31.8h
+ umin v17.8h, v17.8h, v31.8h
+
+ st1 {v22.8h, v23.8h}, [x0], #32
+ st1 {v24.d}[0], [x0], x1
+ st1 {v16.8h, v17.8h}, [x0], #32
+ st1 {v24.d}[1], [x0], x1
+
+ b.gt 1b
+ ret
+endfunc
+
+function mc_weight_w16_neon, export=1
+ weight_prologue full
+ lsl x1, x1, #1
+ lsl x3, x3, #1
+1:
+ subs w9, w9, #2
+ ld1 {v4.8h, v5.8h}, [x2], x3
+ ld1 {v6.8h, v7.8h}, [x2], x3
+
+ umull v22.4s, v4.4h, v0.4h
+ umull2 v23.4s, v4.8h, v0.8h
+ umull v24.4s, v5.4h, v0.4h
+ umull2 v25.4s, v5.8h, v0.8h
+
+ srshl v22.4s, v22.4s, v2.4s
+ srshl v23.4s, v23.4s, v2.4s
+ srshl v24.4s, v24.4s, v2.4s
+ srshl v25.4s, v25.4s, v2.4s
+
+ add v22.4s, v22.4s, v1.4s
+ add v23.4s, v23.4s, v1.4s
+ add v24.4s, v24.4s, v1.4s
+ add v25.4s, v25.4s, v1.4s
+
+ sqxtun v22.4h, v22.4s
+ sqxtun2 v22.8h, v23.4s
+ sqxtun v23.4h, v24.4s
+ sqxtun2 v23.8h, v25.4s
+
+ umull v26.4s, v6.4h, v0.4h
+ umull2 v27.4s, v6.8h, v0.8h
+ umull v28.4s, v7.4h, v0.4h
+ umull2 v29.4s, v7.8h, v0.8h
+
+ srshl v26.4s, v26.4s, v2.4s
+ srshl v27.4s, v27.4s, v2.4s
+ srshl v28.4s, v28.4s, v2.4s
+ srshl v29.4s, v29.4s, v2.4s
+
+ add v26.4s, v26.4s, v1.4s
+ add v27.4s, v27.4s, v1.4s
+ add v28.4s, v28.4s, v1.4s
+ add v29.4s, v29.4s, v1.4s
+
+ sqxtun v26.4h, v26.4s
+ sqxtun2 v26.8h, v27.4s
+ sqxtun v27.4h, v28.4s
+ sqxtun2 v27.8h, v29.4s
+
+ mvni v31.8h, 0xfc, lsl #8
+
+ umin v22.8h, v22.8h, v31.8h
+ umin v23.8h, v23.8h, v31.8h
+ umin v26.8h, v26.8h, v31.8h
+ umin v27.8h, v27.8h, v31.8h
+
+ st1 {v22.8h, v23.8h}, [x0], x1
+ st1 {v26.8h, v27.8h}, [x0], x1
+
+ b.gt 1b
+ ret
+endfunc
+
+function mc_weight_w8_neon, export=1
+ weight_prologue full
+ lsl x3, x3, #1
+ lsl x1, x1, #1
+1:
+ subs w9, w9, #2
+ ld1 {v16.8h}, [x2], x3
+ ld1 {v17.8h}, [x2], x3
+
+ umull v4.4s, v16.4h, v0.4h
+ umull2 v5.4s, v16.8h, v0.8h
+ umull v6.4s, v17.4h, v0.4h
+ umull2 v7.4s, v17.8h, v0.8h
+
+ srshl v4.4s, v4.4s, v2.4s
+ srshl v5.4s, v5.4s, v2.4s
+ srshl v6.4s, v6.4s, v2.4s
+ srshl v7.4s, v7.4s, v2.4s
+
+ add v4.4s, v4.4s, v1.4s
+ add v5.4s, v5.4s, v1.4s
+ add v6.4s, v6.4s, v1.4s
+ add v7.4s, v7.4s, v1.4s
+
+ sqxtun v16.4h, v4.4s
+ sqxtun2 v16.8h, v5.4s
+ sqxtun v17.4h, v6.4s
+ sqxtun2 v17.8h, v7.4s
+
+ mvni v28.8h, #0xfc, lsl #8
+
+ umin v16.8h, v16.8h, v28.8h
+ umin v17.8h, v17.8h, v28.8h
+
+ st1 {v16.8h}, [x0], x1
+ st1 {v17.8h}, [x0], x1
+ b.gt 1b
+ ret
+endfunc
+
+function mc_weight_w4_neon, export=1
+ weight_prologue full
+ lsl x3, x3, #1
+ lsl x1, x1, #1
+1:
+ subs w9, w9, #2
+ ld1 {v16.d}[0], [x2], x3
+ ld1 {v16.d}[1], [x2], x3
+ umull v4.4s, v16.4h, v0.4h
+ umull2 v5.4s, v16.8h, v0.8h
+ srshl v4.4s, v4.4s, v2.4s
+ srshl v5.4s, v5.4s, v2.4s
+ add v4.4s, v4.4s, v1.4s
+ add v5.4s, v5.4s, v1.4s
+
+ sqxtun v16.4h, v4.4s
+ sqxtun2 v16.8h, v5.4s
+
+ mvni v28.8h, #0xfc, lsl #8
+
+ umin v16.8h, v16.8h, v28.8h
+
+ st1 {v16.d}[0], [x0], x1
+ st1 {v16.d}[1], [x0], x1
+ b.gt 1b
+ ret
+endfunc
+
+function mc_weight_w20_nodenom_neon, export=1
+ weight_prologue nodenom
+ lsl x3, x3, #1
+ lsl x1, x1, #1
+ sub x1, x1, #32
+1:
+ subs w9, w9, #2
+ ld1 {v16.8h, v17.8h, v18.8h}, [x2], x3
+ mov v20.16b, v1.16b
+ mov v21.16b, v1.16b
+ mov v22.16b, v1.16b
+ mov v23.16b, v1.16b
+ mov v24.16b, v1.16b
+ mov v25.16b, v1.16b
+ ld1 {v2.8h, v3.8h, v4.8h}, [x2], x3
+ mov v26.16b, v1.16b
+ mov v27.16b, v1.16b
+ mov v28.16b, v1.16b
+ mov v29.16b, v1.16b
+
+ umlal v20.4s, v16.4h, v0.4h
+ umlal2 v21.4s, v16.8h, v0.8h
+ umlal v22.4s, v17.4h, v0.4h
+ umlal2 v23.4s, v17.8h, v0.8h
+ umlal v24.4s, v18.4h, v0.4h
+ umlal v25.4s, v4.4h, v0.4h
+ umlal v26.4s, v2.4h, v0.4h
+ umlal2 v27.4s, v2.8h, v0.8h
+ umlal v28.4s, v3.4h, v0.4h
+ umlal2 v29.4s, v3.8h, v0.8h
+
+ sqxtun v2.4h, v20.4s
+ sqxtun2 v2.8h, v21.4s
+ sqxtun v3.4h, v22.4s
+ sqxtun2 v3.8h, v23.4s
+ sqxtun v4.4h, v24.4s
+ sqxtun2 v4.8h, v25.4s
+ sqxtun v5.4h, v26.4s
+ sqxtun2 v5.8h, v27.4s
+ sqxtun v6.4h, v28.4s
+ sqxtun2 v6.8h, v29.4s
+
+ mvni v31.8h, 0xfc, lsl #8
+
+ umin v2.8h, v2.8h, v31.8h
+ umin v3.8h, v3.8h, v31.8h
+ umin v4.8h, v4.8h, v31.8h
+ umin v5.8h, v5.8h, v31.8h
+ umin v6.8h, v6.8h, v31.8h
+
+ st1 {v2.8h, v3.8h}, [x0], #32
+ st1 {v4.d}[0], [x0], x1
+ st1 {v5.8h, v6.8h}, [x0], #32
+ st1 {v4.d}[1], [x0], x1
+
+ b.gt 1b
+ ret
+endfunc
+
+function mc_weight_w16_nodenom_neon, export=1
+ weight_prologue nodenom
+ lsl x1, x1, #1
+ lsl x3, x3, #1
+1:
+ subs w9, w9, #2
+ ld1 {v2.8h, v3.8h}, [x2], x3
+ mov v27.16b, v1.16b
+ mov v28.16b, v1.16b
+ mov v29.16b, v1.16b
+ mov v30.16b, v1.16b
+ ld1 {v4.8h, v5.8h}, [x2], x3
+ mov v20.16b, v1.16b
+ mov v21.16b, v1.16b
+ mov v22.16b, v1.16b
+ mov v23.16b, v1.16b
+
+ umlal v27.4s, v2.4h, v0.4h
+ umlal2 v28.4s, v2.8h, v0.8h
+ umlal v29.4s, v3.4h, v0.4h
+ umlal2 v30.4s, v3.8h, v0.8h
+
+ umlal v20.4s, v4.4h, v0.4h
+ umlal2 v21.4s, v4.8h, v0.8h
+ umlal v22.4s, v5.4h, v0.4h
+ umlal2 v23.4s, v5.8h, v0.8h
+
+ sqxtun v2.4h, v27.4s
+ sqxtun2 v2.8h, v28.4s
+ sqxtun v3.4h, v29.4s
+ sqxtun2 v3.8h, v30.4s
+
+ sqxtun v4.4h, v20.4s
+ sqxtun2 v4.8h, v21.4s
+ sqxtun v5.4h, v22.4s
+ sqxtun2 v5.8h, v23.4s
+
+ mvni v31.8h, 0xfc, lsl #8
+
+ umin v2.8h, v2.8h, v31.8h
+ umin v3.8h, v3.8h, v31.8h
+ umin v4.8h, v4.8h, v31.8h
+ umin v5.8h, v5.8h, v31.8h
+
+ st1 {v2.8h, v3.8h}, [x0], x1
+ st1 {v4.8h, v5.8h}, [x0], x1
+ b.gt 1b
+ ret
+endfunc
+
+function mc_weight_w8_nodenom_neon, export=1
+ weight_prologue nodenom
+ lsl x1, x1, #1
+ lsl x3, x3, #1
+1:
+ subs w9, w9, #2
+ ld1 {v16.8h}, [x2], x3
+ mov v27.16b, v1.16b
+ ld1 {v17.8h}, [x2], x3
+ mov v28.16b, v1.16b
+ mov v29.16b, v1.16b
+ mov v30.16b, v1.16b
+
+ umlal v27.4s, v16.4h, v0.4h
+ umlal2 v28.4s, v16.8h, v0.8h
+ umlal v29.4s, v17.4h, v0.4h
+ umlal2 v30.4s, v17.8h, v0.8h
+
+ sqxtun v4.4h, v27.4s
+ sqxtun2 v4.8h, v28.4s
+ sqxtun v5.4h, v29.4s
+ sqxtun2 v5.8h, v30.4s
+
+ mvni v31.8h, 0xfc, lsl #8
+
+ umin v4.8h, v4.8h, v31.8h
+ umin v5.8h, v5.8h, v31.8h
+
+ st1 {v4.8h}, [x0], x1
+ st1 {v5.8h}, [x0], x1
+ b.gt 1b
+ ret
+endfunc
+
+function mc_weight_w4_nodenom_neon, export=1
+ weight_prologue nodenom
+ lsl x1, x1, #1
+ lsl x3, x3, #1
+1:
+ subs w9, w9, #2
+ ld1 {v16.d}[0], [x2], x3
+ ld1 {v16.d}[1], [x2], x3
+ mov v27.16b, v1.16b
+ mov v28.16b, v1.16b
+ umlal v27.4s, v16.4h, v0.4h
+ umlal2 v28.4s, v16.8h, v0.8h
+
+ sqxtun v4.4h, v27.4s
+ sqxtun2 v4.8h, v28.4s
+
+ mvni v31.8h, 0xfc, lsl #8
+
+ umin v4.8h, v4.8h, v31.8h
+
+ st1 {v4.d}[0], [x0], x1
+ st1 {v4.d}[1], [x0], x1
+ b.gt 1b
+ ret
+endfunc
+
+.macro weight_simple_prologue
+ ldr w6, [x4] // offset
+ lsl w6, w6, #2
+ dup v1.8h, w6
+.endm
+
+.macro weight_simple name op
+function mc_weight_w20_\name\()_neon, export=1
+ weight_simple_prologue
+ lsl x1, x1, #1
+ lsl x3, x3, #1
+ sub x1, x1, #32
+1:
+ subs w5, w5, #2
+ ld1 {v2.8h, v3.8h, v4.8h}, [x2], x3
+ ld1 {v5.8h, v6.8h, v7.8h}, [x2], x3
+
+ zip1 v4.2d, v4.2d, v7.2d
+
+ \op v2.8h, v2.8h, v1.8h
+ \op v3.8h, v3.8h, v1.8h
+ \op v4.8h, v4.8h, v1.8h
+ \op v5.8h, v5.8h, v1.8h
+ \op v6.8h, v6.8h, v1.8h
+
+ mvni v31.8h, #0xfc, lsl #8
+
+ umin v2.8h, v2.8h, v28.8h
+ umin v3.8h, v3.8h, v28.8h
+ umin v4.8h, v4.8h, v28.8h
+ umin v5.8h, v5.8h, v28.8h
+ umin v6.8h, v6.8h, v28.8h
+
+ st1 {v2.8h, v3.8h}, [x0], #32
+ st1 {v4.d}[0], [x0], x1
+ st1 {v5.8h, v6.8h}, [x0], #32
+ st1 {v4.d}[1], [x0], x1
+
+ b.gt 1b
+ ret
+endfunc
+
+function mc_weight_w16_\name\()_neon, export=1
+ weight_simple_prologue
+ lsl x1, x1, #1
+ lsl x3, x3, #1
+1:
+ subs w5, w5, #2
+ ld1 {v16.8h, v17.8h}, [x2], x3
+ ld1 {v18.8h, v19.8h}, [x2], x3
+
+ \op v16.8h, v16.8h, v1.8h
+ \op v17.8h, v17.8h, v1.8h
+ \op v18.8h, v18.8h, v1.8h
+ \op v19.8h, v19.8h, v1.8h
+
+ mvni v28.8h, #0xfc, lsl #8
+
+ umin v16.8h, v16.8h, v28.8h
+ umin v17.8h, v17.8h, v28.8h
+ umin v18.8h, v18.8h, v28.8h
+ umin v19.8h, v19.8h, v28.8h
+
+ st1 {v16.8h, v17.8h}, [x0], x1
+ st1 {v18.8h, v19.8h}, [x0], x1
+ b.gt 1b
+ ret
+endfunc
+
+function mc_weight_w8_\name\()_neon, export=1
+ weight_simple_prologue
+ lsl x1, x1, #1
+ lsl x3, x3, #1
+1:
+ subs w5, w5, #2
+ ld1 {v16.8h}, [x2], x3
+ ld1 {v17.8h}, [x2], x3
+ \op v16.8h, v16.8h, v1.8h
+ \op v17.8h, v17.8h, v1.8h
+
+ mvni v28.8h, 0xfc, lsl #8
+
+ umin v16.8h, v16.8h, v28.8h
+ umin v17.8h, v17.8h, v28.8h
+
+ st1 {v16.8h}, [x0], x1
+ st1 {v17.8h}, [x0], x1
+ b.gt 1b
+ ret
+endfunc
+
+function mc_weight_w4_\name\()_neon, export=1
+ weight_simple_prologue
+ lsl x1, x1, #1
+ lsl x3, x3, #1
+1:
+ subs w5, w5, #2
+ ld1 {v16.d}[0], [x2], x3
+ ld1 {v16.d}[1], [x2], x3
+ \op v16.8h, v16.8h, v1.8h
+ mvni v28.8h, 0xfc, lsl #8
+
+ umin v16.8h, v16.8h, v28.8h
+
+ st1 {v16.d}[0], [x0], x1
+ st1 {v16.d}[1], [x0], x1
+ b.gt 1b
+ ret
+endfunc
+.endm
+
+weight_simple offsetadd, uqadd
+weight_simple offsetsub, uqsub
+
+// void mc_chroma( pixel *dst_u, pixel *dst_v,
+// intptr_t i_dst_stride,
+// pixel *src, intptr_t i_src_stride,
+// int dx, int dy, int i_width, int i_height );
+function mc_chroma_neon, export=1
+ ldr w15, [sp] // height
+ sbfx x12, x6, #3, #29 // asr(3) and sign extend
+ sbfx x11, x5, #3, #29 // asr(3) and sign extend
+ cmp w7, #4
+ lsl x4, x4, #1
+ mul x12, x12, x4
+ add x3, x3, x11, lsl #2
+
+ and w5, w5, #7
+ and w6, w6, #7
+
+ add x3, x3, x12
+
+ b.gt mc_chroma_w8_neon
+ b.eq mc_chroma_w4_neon
+endfunc
+
+.macro CHROMA_MC_START r00, r01, r10, r11
+ mul w12, w5, w6 // cD = d8x *d8y
+ lsl w13, w5, #3
+ add w9, w12, #64
+ lsl w14, w6, #3
+ tst w12, w12
+ sub w9, w9, w13
+ sub w10, w13, w12 // cB = d8x *(8-d8y);
+ sub w11, w14, w12 // cC = (8-d8x)*d8y
+ sub w9, w9, w14 // cA = (8-d8x)*(8-d8y);
+.endm
+
+.macro CHROMA_MC width, vsize
+function mc_chroma_w\width\()_neon
+ lsl x2, x2, #1
+// since the element size varies, there's a different index for the 2nd store
+.if \width == 4
+ .set idx2, 1
+.else
+ .set idx2, 2
+.endif
+ CHROMA_MC_START
+ b.eq 2f
+
+ ld2 {v28.8h, v29.8h}, [x3], x4
+ dup v0.8h, w9 // cA
+ dup v1.8h, w10 // cB
+
+ ext v6.16b, v28.16b, v28.16b, #2
+ ext v7.16b, v29.16b, v29.16b, #2
+
+ ld2 {v30.8h, v31.8h}, [x3], x4
+ dup v2.8h, w11 // cC
+ dup v3.8h, w12 // cD
+
+ ext v22.16b, v30.16b, v30.16b, #2
+ ext v23.16b, v31.16b, v31.16b, #2
+
+ trn1 v0.2d, v0.2d, v1.2d
+ trn1 v2.2d, v2.2d, v3.2d
+
+ trn1 v4.2d, v28.2d, v6.2d
+ trn1 v5.2d, v29.2d, v7.2d
+ trn1 v20.2d, v30.2d, v22.2d
+ trn1 v21.2d, v31.2d, v23.2d
+1: // height loop, interpolate xy
+ subs w15, w15, #2
+
+ mul v16.8h, v4.8h, v0.8h
+ mul v17.8h, v5.8h, v0.8h
+ mla v16.8h, v20.8h, v2.8h
+ mla v17.8h, v21.8h, v2.8h
+
+ ld2 {v28.8h, v29.8h}, [x3], x4
+ transpose v24.2d, v25.2d, v16.2d, v17.2d
+
+ ext v6.16b, v28.16b, v28.16b, #2
+ ext v7.16b, v29.16b, v29.16b, #2
+ trn1 v4.2d, v28.2d, v6.2d
+ trn1 v5.2d, v29.2d, v7.2d
+
+ add v16.8h, v24.8h, v25.8h
+ urshr v16.8h, v16.8h, #6
+
+ mul v18.8h, v20.8h, v0.8h
+ mul v19.8h, v21.8h, v0.8h
+ mla v18.8h, v4.8h, v2.8h
+ mla v19.8h, v5.8h, v2.8h
+
+ ld2 {v30.8h, v31.8h}, [x3], x4
+
+ transpose v26.2d, v27.2d, v18.2d, v19.2d
+ add v18.8h, v26.8h, v27.8h
+ urshr v18.8h, v18.8h, #6
+
+ ext v22.16b, v30.16b, v30.16b, #2
+ ext v23.16b, v31.16b, v31.16b, #2
+ trn1 v20.2d, v30.2d, v22.2d
+ trn1 v21.2d, v31.2d, v23.2d
+
+ st1 {v16.\vsize}[0], [x0], x2
+ st1 {v16.\vsize}[idx2], [x1], x2
+ st1 {v18.\vsize}[0], [x0], x2
+ st1 {v18.\vsize}[idx2], [x1], x2
+ b.gt 1b
+
+ ret
+2: // dx or dy are 0
+ tst w11, w11
+ add w10, w10, w11
+ dup v0.8h, w9
+ dup v1.8h, w10
+
+ b.eq 4f
+
+ ld1 {v4.8h}, [x3], x4
+ ld1 {v6.8h}, [x3], x4
+3: // vertical interpolation loop
+ subs w15, w15, #2
+
+ mul v16.8h, v4.8h, v0.8h
+ mla v16.8h, v6.8h, v1.8h
+ ld1 {v4.8h}, [x3], x4
+ mul v17.8h, v6.8h, v0.8h
+ mla v17.8h, v4.8h, v1.8h
+ ld1 {v6.8h}, [x3], x4
+
+ urshr v16.8h, v16.8h, #6
+ urshr v17.8h, v17.8h, #6
+
+ uzp1 v18.8h, v16.8h, v17.8h // d16=uuuu|uuuu, d17=vvvv|vvvv
+ uzp2 v19.8h, v16.8h, v17.8h // d16=uuuu|uuuu, d17=vvvv|vvvv
+
+ st1 {v18.\vsize}[0], [x0], x2
+ st1 {v18.\vsize}[idx2], [x0], x2
+ st1 {v19.\vsize}[0], [x1], x2
+ st1 {v19.\vsize}[idx2], [x1], x2
+ b.gt 3b
+
+ ret
+
+4: // dy is 0
+ ld1 {v4.8h, v5.8h}, [x3], x4
+ ld1 {v6.8h, v7.8h}, [x3], x4
+
+ ext v5.16b, v4.16b, v5.16b, #4
+ ext v7.16b, v6.16b, v7.16b, #4
+5: // horizontal interpolation loop
+ subs w15, w15, #2
+
+ mul v16.8h, v4.8h, v0.8h
+ mla v16.8h, v5.8h, v1.8h
+ mul v17.8h, v6.8h, v0.8h
+ mla v17.8h, v7.8h, v1.8h
+
+ ld1 {v4.8h, v5.8h}, [x3], x4
+ ld1 {v6.8h, v7.8h}, [x3], x4
+
+ urshr v16.8h, v16.8h, #6
+ urshr v17.8h, v17.8h, #6
+
+ ext v5.16b, v4.16b, v5.16b, #4
+ ext v7.16b, v6.16b, v7.16b, #4
+ uzp1 v18.8h, v16.8h, v17.8h // d16=uuuu|uuuu, d17=vvvv|vvvv
+ uzp2 v19.8h, v16.8h, v17.8h // d16=uuuu|uuuu, d17=vvvv|vvvv
+
+ st1 {v18.\vsize}[0], [x0], x2
+ st1 {v18.\vsize}[idx2], [x0], x2
+ st1 {v19.\vsize}[0], [x1], x2
+ st1 {v19.\vsize}[idx2], [x1], x2
+ b.gt 5b
+
+ ret
+endfunc
+.endm
+
+ CHROMA_MC 2, s
+ CHROMA_MC 4, d
+
+function mc_chroma_w8_neon
+ lsl x2, x2, #1
+ CHROMA_MC_START
+
+ b.eq 2f
+ sub x4, x4, #32
+ ld2 {v4.8h, v5.8h}, [x3], #32
+ ld2 {v6.8h, v7.8h}, [x3], x4
+
+ ld2 {v20.8h, v21.8h}, [x3], #32
+ ld2 {v22.8h, v23.8h}, [x3], x4
+
+ dup v0.8h, w9 // cA
+ dup v1.8h, w10 // cB
+
+ ext v24.16b, v4.16b, v6.16b, #2
+ ext v26.16b, v6.16b, v4.16b, #2
+ ext v28.16b, v20.16b, v22.16b, #2
+ ext v30.16b, v22.16b, v20.16b, #2
+
+ ext v25.16b, v5.16b, v7.16b, #2
+ ext v27.16b, v7.16b, v5.16b, #2
+ ext v29.16b, v21.16b, v23.16b, #2
+ ext v31.16b, v23.16b, v21.16b, #2
+
+ dup v2.8h, w11 // cC
+ dup v3.8h, w12 // cD
+
+1: // height loop, interpolate xy
+ subs w15, w15, #2
+
+ mul v16.8h, v4.8h, v0.8h
+ mul v17.8h, v5.8h, v0.8h
+ mla v16.8h, v24.8h, v1.8h
+ mla v17.8h, v25.8h, v1.8h
+ mla v16.8h, v20.8h, v2.8h
+ mla v17.8h, v21.8h, v2.8h
+ mla v16.8h, v28.8h, v3.8h
+ mla v17.8h, v29.8h, v3.8h
+
+ urshr v16.8h, v16.8h, #6
+ urshr v17.8h, v17.8h, #6
+
+ st1 {v16.8h}, [x0], x2
+ st1 {v17.8h}, [x1], x2
+
+ ld2 {v4.8h, v5.8h}, [x3], #32
+ ld2 {v6.8h, v7.8h}, [x3], x4
+
+ mul v16.8h, v20.8h, v0.8h
+ mul v17.8h, v21.8h, v0.8h
+ ext v24.16b, v4.16b, v6.16b, #2
+ ext v26.16b, v6.16b, v4.16b, #2
+ mla v16.8h, v28.8h, v1.8h
+ mla v17.8h, v29.8h, v1.8h
+ ext v25.16b, v5.16b, v7.16b, #2
+ ext v27.16b, v7.16b, v5.16b, #2
+ mla v16.8h, v4.8h, v2.8h
+ mla v17.8h, v5.8h, v2.8h
+ mla v16.8h, v24.8h, v3.8h
+ mla v17.8h, v25.8h, v3.8h
+
+ urshr v16.8h, v16.8h, #6
+ urshr v17.8h, v17.8h, #6
+
+ ld2 {v20.8h, v21.8h}, [x3], #32
+ ld2 {v22.8h, v23.8h}, [x3], x4
+ ext v28.16b, v20.16b, v22.16b, #2
+ ext v30.16b, v22.16b, v20.16b, #2
+ ext v29.16b, v21.16b, v23.16b, #2
+ ext v31.16b, v23.16b, v21.16b, #2
+
+ st1 {v16.8h}, [x0], x2
+ st1 {v17.8h}, [x1], x2
+ b.gt 1b
+
+ ret
+2: // dx or dy are 0
+ tst w11, w11
+ add w10, w10, w11
+ dup v0.8h, w9
+ dup v1.8h, w10
+
+ b.eq 4f
+
+ ld2 {v4.8h, v5.8h}, [x3], x4
+ ld2 {v6.8h, v7.8h}, [x3], x4
+3: // vertical interpolation loop
+ subs w15, w15, #2
+
+ mul v16.8h, v4.8h, v0.8h
+ mul v17.8h, v5.8h, v0.8h
+ mla v16.8h, v6.8h, v1.8h
+ mla v17.8h, v7.8h, v1.8h
+ urshr v16.8h, v16.8h, #6
+ urshr v17.8h, v17.8h, #6
+
+ st1 {v16.8h}, [x0], x2
+ st1 {v17.8h}, [x1], x2
+
+ ld2 {v4.8h, v5.8h}, [x3], x4
+
+ mul v16.8h, v6.8h, v0.8h
+ mul v17.8h, v7.8h, v0.8h
+ ld2 {v6.8h, v7.8h}, [x3], x4
+ mla v16.8h, v4.8h, v1.8h
+ mla v17.8h, v5.8h, v1.8h
+ urshr v16.8h, v16.8h, #6
+ urshr v17.8h, v17.8h, #6
+
+ st1 {v16.8h}, [x0], x2
+ st1 {v17.8h}, [x1], x2
+ b.gt 3b
+
+ ret
+4: // dy is 0
+ sub x4, x4, #32
+
+ ld2 {v4.8h, v5.8h}, [x3], #32
+ ld2 {v6.8h, v7.8h}, [x3], x4
+ ext v24.16b, v4.16b, v6.16b, #2
+ ext v26.16b, v6.16b, v4.16b, #2
+ ld2 {v20.8h, v21.8h}, [x3], #32
+ ld2 {v22.8h, v23.8h}, [x3], x4
+ ext v28.16b, v20.16b, v22.16b, #2
+ ext v30.16b, v22.16b, v20.16b, #2
+
+ ext v25.16b, v5.16b, v7.16b, #2
+ ext v27.16b, v7.16b, v5.16b, #2
+ ext v29.16b, v21.16b, v23.16b, #2
+ ext v31.16b, v23.16b, v21.16b, #2
+
+5: // horizontal interpolation loop
+ subs w15, w15, #2
+
+ mul v16.8h, v4.8h, v0.8h
+ mul v17.8h, v5.8h, v0.8h
+ mla v16.8h, v24.8h, v1.8h
+ mla v17.8h, v25.8h, v1.8h
+
+ urshr v16.8h, v16.8h, #6
+ urshr v17.8h, v17.8h, #6
+
+ st1 {v16.8h}, [x0], x2
+ st1 {v17.8h}, [x1], x2
+
+ mul v16.8h, v20.8h, v0.8h
+ mul v17.8h, v21.8h, v0.8h
+ ld2 {v4.8h, v5.8h}, [x3], #32
+ ld2 {v6.8h, v7.8h}, [x3], x4
+ mla v16.8h, v28.8h, v1.8h
+ mla v17.8h, v29.8h, v1.8h
+ ld2 {v20.8h,v21.8h}, [x3], #32
+ ld2 {v22.8h,v23.8h}, [x3], x4
+
+ urshr v16.8h, v16.8h, #6
+ urshr v17.8h, v17.8h, #6
+
+ ext v24.16b, v4.16b, v6.16b, #2
+ ext v26.16b, v6.16b, v4.16b, #2
+ ext v28.16b, v20.16b, v22.16b, #2
+ ext v30.16b, v22.16b, v20.16b, #2
+ ext v29.16b, v21.16b, v23.16b, #2
+ ext v31.16b, v23.16b, v21.16b, #2
+ ext v25.16b, v5.16b, v7.16b, #2
+ ext v27.16b, v7.16b, v5.16b, #2
+
+ st1 {v16.8h}, [x0], x2
+ st1 {v17.8h}, [x1], x2
+ b.gt 5b
+
+ ret
+endfunc
+
+.macro integral4h p1, p2
+ ext v1.16b, \p1\().16b, \p2\().16b, #2
+ ext v2.16b, \p1\().16b, \p2\().16b, #4
+ ext v3.16b, \p1\().16b, \p2\().16b, #6
+ add v0.8h, \p1\().8h, v1.8h
+ add v4.8h, v2.8h, v3.8h
+ add v0.8h, v0.8h, v4.8h
+ add v0.8h, v0.8h, v5.8h
+.endm
+
+function integral_init4h_neon, export=1
+ sub x3, x0, x2, lsl #1
+ lsl x2, x2, #1
+ ld1 {v6.8h,v7.8h}, [x1], #32
+1:
+ subs x2, x2, #32
+ ld1 {v5.8h}, [x3], #16
+ integral4h v6, v7
+ ld1 {v6.8h}, [x1], #16
+ ld1 {v5.8h}, [x3], #16
+ st1 {v0.8h}, [x0], #16
+ integral4h v7, v6
+ ld1 {v7.8h}, [x1], #16
+ st1 {v0.8h}, [x0], #16
+ b.gt 1b
+ ret
+endfunc
+
+.macro integral8h p1, p2, s
+ ext v1.16b, \p1\().16b, \p2\().16b, #2
+ ext v2.16b, \p1\().16b, \p2\().16b, #4
+ ext v3.16b, \p1\().16b, \p2\().16b, #6
+ ext v4.16b, \p1\().16b, \p2\().16b, #8
+ ext v5.16b, \p1\().16b, \p2\().16b, #10
+ ext v6.16b, \p1\().16b, \p2\().16b, #12
+ ext v7.16b, \p1\().16b, \p2\().16b, #14
+ add v0.8h, \p1\().8h, v1.8h
+ add v2.8h, v2.8h, v3.8h
+ add v4.8h, v4.8h, v5.8h
+ add v6.8h, v6.8h, v7.8h
+ add v0.8h, v0.8h, v2.8h
+ add v4.8h, v4.8h, v6.8h
+ add v0.8h, v0.8h, v4.8h
+ add v0.8h, v0.8h, \s\().8h
+.endm
+
+function integral_init8h_neon, export=1
+ sub x3, x0, x2, lsl #1
+ lsl x2, x2, #1
+
+ ld1 {v16.8h, v17.8h}, [x1], #32
+1:
+ subs x2, x2, #32
+ ld1 {v18.8h}, [x3], #16
+ integral8h v16, v17, v18
+ ld1 {v16.8h}, [x1], #16
+ ld1 {v18.8h}, [x3], #16
+ st1 {v0.8h}, [x0], #16
+ integral8h v17, v16, v18
+ ld1 {v17.8h}, [x1], #16
+ st1 {v0.8h}, [x0], #16
+ b.gt 1b
+ ret
+endfunc
+
+function integral_init4v_neon, export=1
+ mov x3, x0
+ add x4, x0, x2, lsl #3
+ add x8, x0, x2, lsl #4
+ lsl x2, x2, #1
+ sub x2, x2, #16
+ ld1 {v20.8h, v21.8h, v22.8h}, [x3], #48
+ ld1 {v16.8h, v17.8h, v18.8h}, [x8], #48
+1:
+ subs x2, x2, #32
+ ld1 {v24.8h, v25.8h}, [x4], #32
+ ext v0.16b, v20.16b, v21.16b, #8
+ ext v1.16b, v21.16b, v22.16b, #8
+ ext v2.16b, v16.16b, v17.16b, #8
+ ext v3.16b, v17.16b, v18.16b, #8
+ sub v24.8h, v24.8h, v20.8h
+ sub v25.8h, v25.8h, v21.8h
+ add v0.8h, v0.8h, v20.8h
+ add v1.8h, v1.8h, v21.8h
+ add v2.8h, v2.8h, v16.8h
+ add v3.8h, v3.8h, v17.8h
+ st1 {v24.8h}, [x1], #16
+ st1 {v25.8h}, [x1], #16
+ mov v20.16b, v22.16b
+ mov v16.16b, v18.16b
+ sub v0.8h, v2.8h, v0.8h
+ sub v1.8h, v3.8h, v1.8h
+ ld1 {v21.8h, v22.8h}, [x3], #32
+ ld1 {v17.8h, v18.8h}, [x8], #32
+ st1 {v0.8h}, [x0], #16
+ st1 {v1.8h}, [x0], #16
+ b.gt 1b
+2:
+ ret
+endfunc
+
+function integral_init8v_neon, export=1
+ add x2, x0, x1, lsl #4
+ sub x1, x1, #8
+ ands x3, x1, #16 - 1
+ b.eq 1f
+ subs x1, x1, #8
+ ld1 {v0.8h}, [x0]
+ ld1 {v2.8h}, [x2], #16
+ sub v4.8h, v2.8h, v0.8h
+ st1 {v4.8h}, [x0], #16
+ b.le 2f
+1:
+ subs x1, x1, #16
+ ld1 {v0.8h,v1.8h}, [x0]
+ ld1 {v2.8h,v3.8h}, [x2], #32
+ sub v4.8h, v2.8h, v0.8h
+ sub v5.8h, v3.8h, v1.8h
+ st1 {v4.8h}, [x0], #16
+ st1 {v5.8h}, [x0], #16
+ b.gt 1b
+2:
ret
endfunc
-const pw_0to15, align=5
- .short 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15
-endconst
+// frame_init_lowres_core( pixel *src0, pixel *dst0, pixel *dsth,
+// pixel *dstv, pixel *dstc, intptr_t src_stride,
+// intptr_t dst_stride, int width, int height )
+function frame_init_lowres_core_neon, export=1
+ ldr w8, [sp]
+ lsl x5, x5, #1
+ sub x10, x6, w7, uxtw // dst_stride - width
+ lsl x10, x10, #1
+ and x10, x10, #~31
+
+ stp d8, d9, [sp, #-0x40]!
+ stp d10, d11, [sp, #0x10]
+ stp d12, d13, [sp, #0x20]
+ stp d14, d15, [sp, #0x30]
+
+1:
+ mov w9, w7 // width
+ mov x11, x0 // src0
+ add x12, x0, x5 // src1 = src0 + src_stride
+ add x13, x0, x5, lsl #1 // src2 = src1 + src_stride
+
+ ld2 {v0.8h, v1.8h}, [x11], #32
+ ld2 {v2.8h, v3.8h}, [x11], #32
+ ld2 {v4.8h, v5.8h}, [x12], #32
+ ld2 {v6.8h, v7.8h}, [x12], #32
+ ld2 {v28.8h, v29.8h}, [x13], #32
+ ld2 {v30.8h, v31.8h}, [x13], #32
+
+ urhadd v20.8h, v0.8h, v4.8h
+ urhadd v21.8h, v2.8h, v6.8h
+ urhadd v22.8h, v4.8h, v28.8h
+ urhadd v23.8h, v6.8h, v30.8h
+2:
+ subs w9, w9, #16
+
+ urhadd v24.8h, v1.8h, v5.8h
+ urhadd v25.8h, v3.8h, v7.8h
+ urhadd v26.8h, v5.8h, v29.8h
+ urhadd v27.8h, v7.8h, v31.8h
+
+ ld2 {v0.8h, v1.8h}, [x11], #32
+ ld2 {v2.8h, v3.8h}, [x11], #32
+ ld2 {v4.8h, v5.8h}, [x12], #32
+ ld2 {v6.8h, v7.8h}, [x12], #32
+ ld2 {v28.8h, v29.8h}, [x13], #32
+ ld2 {v30.8h, v31.8h}, [x13], #32
+
+ urhadd v16.8h, v0.8h, v4.8h
+ urhadd v17.8h, v2.8h, v6.8h
+ urhadd v18.8h, v4.8h, v28.8h
+ urhadd v19.8h, v6.8h, v30.8h
+
+ ext v8.16b, v20.16b, v21.16b, #2
+ ext v9.16b, v21.16b, v16.16b, #2
+ ext v10.16b, v22.16b, v23.16b, #2
+ ext v11.16b, v23.16b, v18.16b, #2
+
+ urhadd v12.8h, v20.8h, v24.8h
+ urhadd v8.8h, v24.8h, v8.8h
+
+ urhadd v24.8h, v21.8h, v25.8h
+ urhadd v22.8h, v22.8h, v26.8h
+ urhadd v10.8h, v26.8h, v10.8h
+ urhadd v26.8h, v23.8h, v27.8h
+ urhadd v9.8h, v25.8h, v9.8h
+ urhadd v11.8h, v27.8h, v11.8h
+
+ st1 {v12.8h}, [x1], #16
+ st1 {v24.8h}, [x1], #16
+ st1 {v22.8h}, [x3], #16
+ st1 {v26.8h}, [x3], #16
+ st1 {v8.8h, v9.8h}, [x2], #32
+ st1 {v10.8h, v11.8h}, [x4], #32
+
+ b.le 3f
+
+ subs w9, w9, #16
+
+ urhadd v24.8h, v1.8h, v5.8h
+ urhadd v25.8h, v3.8h, v7.8h
+ urhadd v26.8h, v5.8h, v29.8h
+ urhadd v27.8h, v7.8h, v31.8h
+
+ ld2 {v0.8h, v1.8h}, [x11], #32
+ ld2 {v2.8h, v3.8h}, [x11], #32
+ ld2 {v4.8h, v5.8h}, [x12], #32
+ ld2 {v6.8h, v7.8h}, [x12], #32
+ ld2 {v28.8h, v29.8h}, [x13], #32
+ ld2 {v30.8h, v31.8h}, [x13], #32
+
+ urhadd v20.8h, v0.8h, v4.8h
+ urhadd v21.8h, v2.8h, v6.8h
+ urhadd v22.8h, v4.8h, v28.8h
+ urhadd v23.8h, v6.8h, v30.8h
+
+ ext v8.16b, v16.16b, v17.16b, #2
+ ext v9.16b, v17.16b, v20.16b, #2
+ ext v10.16b, v18.16b, v19.16b, #2
+ ext v11.16b, v19.16b, v22.16b, #2
+
+ urhadd v12.8h, v16.8h, v24.8h
+ urhadd v13.8h, v17.8h, v25.8h
+
+ urhadd v14.8h, v18.8h, v26.8h
+ urhadd v15.8h, v19.8h, v27.8h
+
+ urhadd v16.8h, v24.8h, v8.8h
+ urhadd v17.8h, v25.8h, v9.8h
+
+ urhadd v18.8h, v26.8h, v10.8h
+ urhadd v19.8h, v27.8h, v11.8h
+
+ st1 {v12.8h, v13.8h}, [x1], #32
+ st1 {v14.8h, v15.8h}, [x3], #32
+ st1 {v16.8h, v17.8h}, [x2], #32
+ st1 {v18.8h, v19.8h}, [x4], #32
+ b.gt 2b
+3:
+ subs w8, w8, #1
+ add x0, x0, x5, lsl #1
+ add x1, x1, x10
+ add x2, x2, x10
+ add x3, x3, x10
+ add x4, x4, x10
+ b.gt 1b
+
+ ldp d8, d9, [sp]
+ ldp d10, d11, [sp, #0x10]
+ ldp d12, d13, [sp, #0x20]
+ ldp d14, d15, [sp, #0x30]
+
+ add sp, sp, #0x40
-function mbtree_propagate_list_internal_neon, export=1
- movrel x11, pw_0to15
- dup v31.8h, w4 // bipred_weight
- movi v30.8h, #0xc0, lsl #8
- ld1 {v29.8h}, [x11] //h->mb.i_mb_x,h->mb.i_mb_y
- movi v28.4s, #4
- movi v27.8h, #31
- movi v26.8h, #32
- dup v24.8h, w5 // mb_y
- zip1 v29.8h, v29.8h, v24.8h
-8:
- subs w6, w6, #8
- ld1 {v1.8h}, [x1], #16 // propagate_amount
- ld1 {v2.8h}, [x2], #16 // lowres_cost
- and v2.16b, v2.16b, v30.16b
- cmeq v25.8h, v2.8h, v30.8h
- umull v16.4s, v1.4h, v31.4h
- umull2 v17.4s, v1.8h, v31.8h
- rshrn v16.4h, v16.4s, #6
- rshrn2 v16.8h, v17.4s, #6
- bsl v25.16b, v16.16b, v1.16b // if( lists_used == 3 )
- // propagate_amount = (propagate_amount * bipred_weight + 32) >> 6
- ld1 {v4.8h,v5.8h}, [x0], #32
- sshr v6.8h, v4.8h, #5
- sshr v7.8h, v5.8h, #5
- add v6.8h, v6.8h, v29.8h
- add v29.8h, v29.8h, v28.8h
- add v7.8h, v7.8h, v29.8h
- add v29.8h, v29.8h, v28.8h
- st1 {v6.8h,v7.8h}, [x3], #32
- and v4.16b, v4.16b, v27.16b
- and v5.16b, v5.16b, v27.16b
- uzp1 v6.8h, v4.8h, v5.8h // x & 31
- uzp2 v7.8h, v4.8h, v5.8h // y & 31
- sub v4.8h, v26.8h, v6.8h // 32 - (x & 31)
- sub v5.8h, v26.8h, v7.8h // 32 - (y & 31)
- mul v19.8h, v6.8h, v7.8h // idx3weight = y*x;
- mul v18.8h, v4.8h, v7.8h // idx2weight = y*(32-x);
- mul v17.8h, v6.8h, v5.8h // idx1weight = (32-y)*x;
- mul v16.8h, v4.8h, v5.8h // idx0weight = (32-y)*(32-x) ;
- umull v6.4s, v19.4h, v25.4h
- umull2 v7.4s, v19.8h, v25.8h
- umull v4.4s, v18.4h, v25.4h
- umull2 v5.4s, v18.8h, v25.8h
- umull v2.4s, v17.4h, v25.4h
- umull2 v3.4s, v17.8h, v25.8h
- umull v0.4s, v16.4h, v25.4h
- umull2 v1.4s, v16.8h, v25.8h
- rshrn v19.4h, v6.4s, #10
- rshrn2 v19.8h, v7.4s, #10
- rshrn v18.4h, v4.4s, #10
- rshrn2 v18.8h, v5.4s, #10
- rshrn v17.4h, v2.4s, #10
- rshrn2 v17.8h, v3.4s, #10
- rshrn v16.4h, v0.4s, #10
- rshrn2 v16.8h, v1.4s, #10
- zip1 v0.8h, v16.8h, v17.8h
- zip2 v1.8h, v16.8h, v17.8h
- zip1 v2.8h, v18.8h, v19.8h
- zip2 v3.8h, v18.8h, v19.8h
- st1 {v0.8h,v1.8h}, [x3], #32
- st1 {v2.8h,v3.8h}, [x3], #32
- b.ge 8b
ret
endfunc
-function memcpy_aligned_neon, export=1
- tst x2, #16
+function load_deinterleave_chroma_fenc_neon, export=1
+ mov x4, #FENC_STRIDE/2
+ lsl x4, x4, #1
+ lsl x2, x2, #1
+ b load_deinterleave_chroma
+endfunc
+
+function load_deinterleave_chroma_fdec_neon, export=1
+ mov x4, #FDEC_STRIDE/2
+ lsl x4, x4, #1
+ lsl x2, x2, #1
+load_deinterleave_chroma:
+ ld2 {v0.8h, v1.8h}, [x1], x2
+ ld2 {v2.8h, v3.8h}, [x1], x2
+ subs w3, w3, #2
+ st1 {v0.8h}, [x0], x4
+ st1 {v1.8h}, [x0], x4
+ st1 {v2.8h}, [x0], x4
+ st1 {v3.8h}, [x0], x4
+ b.gt load_deinterleave_chroma
+
+ ret
+endfunc
+
+function store_interleave_chroma_neon, export=1
+ mov x5, #FDEC_STRIDE
+ lsl x5, x5, #1
+ lsl x1, x1, #1
+1:
+ ld1 {v0.8h}, [x2], x5
+ ld1 {v1.8h}, [x3], x5
+ ld1 {v2.8h}, [x2], x5
+ ld1 {v3.8h}, [x3], x5
+ subs w4, w4, #2
+ zip1 v4.8h, v0.8h, v1.8h
+ zip1 v6.8h, v2.8h, v3.8h
+ zip2 v5.8h, v0.8h, v1.8h
+ zip2 v7.8h, v2.8h, v3.8h
+
+ st1 {v4.8h, v5.8h}, [x0], x1
+ st1 {v6.8h, v7.8h}, [x0], x1
+ b.gt 1b
+
+ ret
+endfunc
+
+function plane_copy_core_neon, export=1
+ add w8, w4, #31 // 32-bit write clears the upper 32-bit the register
+ and w4, w8, #~31
+ // safe use of the full reg since negative width makes no sense
+ sub x1, x1, x4
+ sub x3, x3, x4
+ lsl x1, x1, #1
+ lsl x3, x3, #1
+1:
+ mov w8, w4
+16:
+ tst w8, #16
b.eq 32f
- sub x2, x2, #16
- ldr q0, [x1], #16
- str q0, [x0], #16
+ subs w8, w8, #16
+ ldp q0, q1, [x2], #32
+ stp q0, q1, [x0], #32
+ b.eq 0f
32:
- tst x2, #32
- b.eq 640f
- sub x2, x2, #32
- ldp q0, q1, [x1], #32
- stp q0, q1, [x0], #32
-640:
- cbz x2, 1f
-64:
- subs x2, x2, #64
- ldp q0, q1, [x1, #32]
- ldp q2, q3, [x1], #64
- stp q0, q1, [x0, #32]
- stp q2, q3, [x0], #64
- b.gt 64b
+ subs w8, w8, #32
+ ldp q0, q1, [x2], #32
+ ldp q2, q3, [x2], #32
+ stp q0, q1, [x0], #32
+ stp q2, q3, [x0], #32
+ b.gt 32b
+0:
+ subs w5, w5, #1
+ add x2, x2, x3
+ add x0, x0, x1
+ b.gt 1b
+
+ ret
+endfunc
+
+function plane_copy_swap_core_neon, export=1
+ lsl w4, w4, #1
+ add w8, w4, #31 // 32-bit write clears the upper 32-bit the register
+ and w4, w8, #~31
+ sub x1, x1, x4
+ sub x3, x3, x4
+ lsl x1, x1, #1
+ lsl x3, x3, #1
1:
+ mov w8, w4
+ tbz w4, #4, 32f
+ subs w8, w8, #16
+ ld1 {v0.8h, v1.8h}, [x2], #32
+ rev32 v0.8h, v0.8h
+ rev32 v1.8h, v1.8h
+ st1 {v0.8h, v1.8h}, [x0], #32
+ b.eq 0f
+32:
+ subs w8, w8, #32
+ ld1 {v0.8h ,v1.8h, v2.8h, v3.8h}, [x2], #64
+ rev32 v20.8h, v0.8h
+ rev32 v21.8h, v1.8h
+ rev32 v22.8h, v2.8h
+ rev32 v23.8h, v3.8h
+ st1 {v20.8h, v21.8h, v22.8h, v23.8h}, [x0], #64
+ b.gt 32b
+0:
+ subs w5, w5, #1
+ add x2, x2, x3
+ add x0, x0, x1
+ b.gt 1b
+
ret
endfunc
-function memzero_aligned_neon, export=1
- movi v0.16b, #0
- movi v1.16b, #0
+function plane_copy_deinterleave_neon, export=1
+ add w9, w6, #15
+ and w9, w9, #~15
+ sub x1, x1, x9
+ sub x3, x3, x9
+ sub x5, x5, x9, lsl #1
+ lsl x1, x1, #1
+ lsl x3, x3, #1
+ lsl x5, x5, #1
1:
- subs x1, x1, #128
- stp q0, q1, [x0, #96]
- stp q0, q1, [x0, #64]
- stp q0, q1, [x0, #32]
- stp q0, q1, [x0], 128
+ ld2 {v0.8h, v1.8h}, [x4], #32
+ ld2 {v2.8h, v3.8h}, [x4], #32
+ subs w9, w9, #16
+ st1 {v0.8h}, [x0], #16
+ st1 {v2.8h}, [x0], #16
+ st1 {v1.8h}, [x2], #16
+ st1 {v3.8h}, [x2], #16
b.gt 1b
+
+ add x4, x4, x5
+ subs w7, w7, #1
+ add x0, x0, x1
+ add x2, x2, x3
+ mov w9, w6
+ b.gt 1b
+
ret
endfunc
-// void mbtree_fix8_pack( int16_t *dst, float *src, int count )
-function mbtree_fix8_pack_neon, export=1
- subs w3, w2, #8
- b.lt 2f
+function plane_copy_interleave_core_neon, export=1
+ add w9, w6, #15
+ and w9, w9, #0xfffffff0
+ sub x1, x1, x9, lsl #1
+ sub x3, x3, x9
+ sub x5, x5, x9
+ lsl x1, x1, #1
+ lsl x3, x3, #1
+ lsl x5, x5, #1
1:
- subs w3, w3, #8
- ld1 {v0.4s,v1.4s}, [x1], #32
- fcvtzs v0.4s, v0.4s, #8
- fcvtzs v1.4s, v1.4s, #8
- sqxtn v2.4h, v0.4s
- sqxtn2 v2.8h, v1.4s
- rev16 v3.16b, v2.16b
- st1 {v3.8h}, [x0], #16
- b.ge 1b
-2:
- adds w3, w3, #8
- b.eq 4f
-3:
- subs w3, w3, #1
- ldr s0, [x1], #4
- fcvtzs w4, s0, #8
- rev16 w5, w4
- strh w5, [x0], #2
- b.gt 3b
-4:
+ ld1 {v0.8h}, [x2], #16
+ ld1 {v1.8h}, [x4], #16
+ ld1 {v2.8h}, [x2], #16
+ ld1 {v3.8h}, [x4], #16
+ subs w9, w9, #16
+ st2 {v0.8h, v1.8h}, [x0], #32
+ st2 {v2.8h, v3.8h}, [x0], #32
+ b.gt 1b
+
+ subs w7, w7, #1
+ add x0, x0, x1
+ add x2, x2, x3
+ add x4, x4, x5
+ mov w9, w6
+ b.gt 1b
+
ret
endfunc
-// void mbtree_fix8_unpack( float *dst, int16_t *src, int count )
-function mbtree_fix8_unpack_neon, export=1
- subs w3, w2, #8
- b.lt 2f
+.macro deinterleave_rgb
+ subs x11, x11, #8
+ st1 {v0.8h}, [x0], #16
+ st1 {v1.8h}, [x2], #16
+ st1 {v2.8h}, [x4], #16
+ b.gt 1b
+
+ subs w10, w10, #1
+ add x0, x0, x1
+ add x2, x2, x3
+ add x4, x4, x5
+ add x6, x6, x7
+ mov x11, x9
+ b.gt 1b
+.endm
+
+function plane_copy_deinterleave_rgb_neon, export=1
+#if SYS_MACOSX
+ ldr w8, [sp]
+ ldp w9, w10, [sp, #4]
+#else
+ ldr x8, [sp]
+ ldp x9, x10, [sp, #8]
+#endif
+ cmp w8, #3
+ uxtw x9, w9
+ add x11, x9, #7
+ and x11, x11, #~7
+ sub x1, x1, x11
+ sub x3, x3, x11
+ sub x5, x5, x11
+ lsl x1, x1, #1
+ lsl x3, x3, #1
+ lsl x5, x5, #1
+ b.ne 4f
+ sub x7, x7, x11, lsl #1
+ sub x7, x7, x11
+ lsl x7, x7, #1
1:
- subs w3, w3, #8
- ld1 {v0.8h}, [x1], #16
- rev16 v1.16b, v0.16b
- sxtl v2.4s, v1.4h
- sxtl2 v3.4s, v1.8h
- scvtf v4.4s, v2.4s, #8
- scvtf v5.4s, v3.4s, #8
- st1 {v4.4s,v5.4s}, [x0], #32
- b.ge 1b
-2:
- adds w3, w3, #8
- b.eq 4f
-3:
- subs w3, w3, #1
- ldrh w4, [x1], #2
- rev16 w5, w4
- sxth w6, w5
- scvtf s0, w6, #8
- str s0, [x0], #4
- b.gt 3b
+ ld3 {v0.8h, v1.8h, v2.8h}, [x6], #48
+ deinterleave_rgb
+
+ ret
4:
+ sub x7, x7, x11, lsl #2
+ lsl x7, x7, #1
+1:
+ ld4 {v0.8h, v1.8h, v2.8h, v3.8h}, [x6], #64
+ deinterleave_rgb
+
+ ret
+endfunc
+
+// void hpel_filter( pixel *dsth, pixel *dstv, pixel *dstc, pixel *src,
+// intptr_t stride, int width, int height, int16_t *buf )
+function hpel_filter_neon, export=1
+ lsl x5, x5, #1
+ ubfm x9, x3, #3, #7
+ add w15, w5, w9
+ sub x13, x3, x9 // align src
+ sub x10, x0, x9
+ sub x11, x1, x9
+ sub x12, x2, x9
+ movi v30.8h, #5
+ movi v31.8h, #20
+
+ lsl x4, x4, #1
+ stp d8, d9, [sp, #-0x40]!
+ stp d10, d11, [sp, #0x10]
+ stp d12, d13, [sp, #0x20]
+ stp d14, d15, [sp, #0x30]
+
+ str q0, [sp, #-0x50]!
+
+1: // line start
+ mov x3, x13
+ mov x2, x12
+ mov x1, x11
+ mov x0, x10
+ add x7, x3, #32 // src pointer next 16b for horiz filter
+ mov x5, x15 // restore width
+ sub x3, x3, x4, lsl #1 // src - 2*stride
+ ld1 {v28.8h, v29.8h}, [x7], #32 // src[16:31]
+ add x9, x3, x5 // holds src - 2*stride + width
+
+ ld1 {v8.8h, v9.8h}, [x3], x4 // src-2*stride[0:15]
+ ld1 {v10.8h, v11.8h}, [x3], x4 // src-1*stride[0:15]
+ ld1 {v12.8h, v13.8h}, [x3], x4 // src-0*stride[0:15]
+ ld1 {v14.8h, v15.8h}, [x3], x4 // src+1*stride[0:15]
+ ld1 {v16.8h, v17.8h}, [x3], x4 // src+2*stride[0:15]
+ ld1 {v18.8h, v19.8h}, [x3], x4 // src+3*stride[0:15]
+
+ ext v22.16b, v7.16b, v12.16b, #12
+ ext v23.16b, v12.16b, v13.16b, #12
+ uaddl v1.4s, v8.4h, v18.4h
+ uaddl2 v20.4s, v8.8h, v18.8h
+ ext v24.16b, v12.16b, v13.16b, #6
+ ext v25.16b, v13.16b, v28.16b, #6
+ umlsl v1.4s, v10.4h, v30.4h
+ umlsl2 v20.4s, v10.8h, v30.8h
+ ext v26.16b, v7.16b, v12.16b, #14
+ ext v27.16b, v12.16b, v13.16b, #14
+ umlal v1.4s, v12.4h, v31.4h
+ umlal2 v20.4s, v12.8h, v31.8h
+ ext v3.16b, v12.16b, v13.16b, #2
+ ext v4.16b, v13.16b, v28.16b, #2
+ umlal v1.4s, v14.4h, v31.4h
+ umlal2 v20.4s, v14.8h, v31.8h
+ ext v21.16b, v12.16b, v13.16b, #4
+ ext v5.16b, v13.16b, v28.16b, #4
+ umlsl v1.4s, v16.4h, v30.4h
+ umlsl2 v20.4s, v16.8h, v30.8h
+
+2: // next 16 pixel of line
+ subs x5, x5, #32
+ sub x3, x9, x5 // src - 2*stride += 16
+
+ uaddl v8.4s, v22.4h, v24.4h
+ uaddl2 v22.4s, v22.8h, v24.8h
+ uaddl v10.4s, v23.4h, v25.4h
+ uaddl2 v23.4s, v23.8h, v25.8h
+
+ umlsl v8.4s, v26.4h, v30.4h
+ umlsl2 v22.4s, v26.8h, v30.8h
+ umlsl v10.4s, v27.4h, v30.4h
+ umlsl2 v23.4s, v27.8h, v30.8h
+
+ umlal v8.4s, v12.4h, v31.4h
+ umlal2 v22.4s, v12.8h, v31.8h
+ umlal v10.4s, v13.4h, v31.4h
+ umlal2 v23.4s, v13.8h, v31.8h
+
+ umlal v8.4s, v3.4h, v31.4h
+ umlal2 v22.4s, v3.8h, v31.8h
+ umlal v10.4s, v4.4h, v31.4h
+ umlal2 v23.4s, v4.8h, v31.8h
+
+ umlsl v8.4s, v21.4h, v30.4h
+ umlsl2 v22.4s, v21.8h, v30.8h
+ umlsl v10.4s, v5.4h, v30.4h
+ umlsl2 v23.4s, v5.8h, v30.8h
+
+ uaddl v5.4s, v9.4h, v19.4h
+ uaddl2 v2.4s, v9.8h, v19.8h
+
+ sqrshrun v8.4h, v8.4s, #5
+ sqrshrun2 v8.8h, v22.4s, #5
+ sqrshrun v10.4h, v10.4s, #5
+ sqrshrun2 v10.8h, v23.4s, #5
+
+ mov v6.16b, v12.16b
+ mov v7.16b, v13.16b
+
+ mvni v23.8h, #0xfc, lsl #8
+
+ umin v8.8h, v8.8h, v23.8h
+ umin v10.8h, v10.8h, v23.8h
+
+ st1 {v8.8h}, [x0], #16
+ st1 {v10.8h}, [x0], #16
+
+ umlsl v5.4s, v11.4h, v30.4h
+ umlsl2 v2.4s, v11.8h, v30.8h
+
+ ld1 {v8.8h, v9.8h}, [x3], x4
+ umlal v5.4s, v13.4h, v31.4h
+ umlal2 v2.4s, v13.8h, v31.8h
+ ld1 {v10.8h, v11.8h}, [x3], x4
+ umlal v5.4s, v15.4h, v31.4h
+ umlal2 v2.4s, v15.8h, v31.8h
+ ld1 {v12.8h, v13.8h}, [x3], x4
+ umlsl v5.4s, v17.4h, v30.4h
+ umlsl2 v2.4s, v17.8h, v30.8h
+ ld1 {v14.8h, v15.8h}, [x3], x4
+
+ sqrshrun v4.4h, v5.4s, #5
+ sqrshrun2 v4.8h, v2.4s, #5
+ sqrshrun v18.4h, v1.4s, #5
+ sqrshrun2 v18.8h, v20.4s, #5
+
+ mvni v17.8h, #0xfc, lsl #8
+
+ smin v4.8h, v4.8h, v17.8h
+ smin v18.8h, v18.8h, v17.8h
+
+ st1 {v18.8h}, [x1], #16
+ st1 {v4.8h}, [x1], #16
+
+ ld1 {v16.8h, v17.8h}, [x3], x4 // src+2*stride[0:15]
+ ld1 {v18.8h, v19.8h}, [x3], x4 // src+3*stride[0:15]
+
+ str q9, [sp, #0x10]
+ str q15, [sp, #0x20]
+ str q17, [sp, #0x30]
+ str q19, [sp, #0x40]
+
+ ldr q28, [sp]
+
+ ext v22.16b, v28.16b, v1.16b, #8
+ ext v9.16b, v1.16b, v20.16b, #8
+ ext v26.16b, v1.16b, v20.16b, #12
+ ext v17.16b, v20.16b, v5.16b, #12
+ ext v23.16b, v28.16b, v1.16b, #12
+ ext v19.16b, v1.16b, v20.16b, #12
+
+ uaddl v3.4s, v8.4h, v18.4h
+ uaddl2 v15.4s, v8.8h, v18.8h
+ umlsl v3.4s, v10.4h, v30.4h
+ umlsl2 v15.4s, v10.8h, v30.8h
+ umlal v3.4s, v12.4h, v31.4h
+ umlal2 v15.4s, v12.8h, v31.8h
+ umlal v3.4s, v14.4h, v31.4h
+ umlal2 v15.4s, v14.8h, v31.8h
+ umlsl v3.4s, v16.4h, v30.4h
+ umlsl2 v15.4s, v16.8h, v30.8h
+
+ add v4.4s, v22.4s, v26.4s
+ add v26.4s, v9.4s, v17.4s
+
+ ext v25.16b, v1.16b, v20.16b, #8
+ ext v22.16b, v20.16b, v5.16b, #8
+ ext v24.16b, v1.16b, v20.16b, #4
+ ext v9.16b, v20.16b, v5.16b, #4
+
+ add v31.4s, v23.4s, v25.4s
+ add v19.4s, v19.4s, v22.4s
+ add v6.4s, v24.4s, v1.4s
+ add v17.4s, v9.4s, v20.4s
+ sub v4.4s, v4.4s, v31.4s // a-b
+ sub v26.4s, v26.4s, v19.4s // a-b
+ sub v31.4s, v31.4s, v6.4s // b-c
+ sub v19.4s, v19.4s, v17.4s // b-c
+
+ ext v22.16b, v20.16b, v5.16b, #8
+ ext v9.16b, v5.16b, v2.16b, #8
+ ext v24.16b, v5.16b, v2.16b, #12
+ ext v28.16b, v2.16b, v3.16b, #12
+ ext v23.16b, v20.16b, v5.16b, #12
+ ext v30.16b, v5.16b, v2.16b, #12
+ ext v25.16b, v5.16b, v2.16b, #8
+ ext v29.16b, v2.16b, v3.16b, #8
+
+ add v22.4s, v22.4s, v24.4s
+ add v9.4s, v9.4s, v28.4s
+ add v23.4s, v23.4s, v25.4s
+ add v29.4s, v29.4s, v30.4s
+
+ ext v24.16b, v5.16b, v2.16b, #4
+ ext v28.16b, v2.16b, v3.16b, #4
+
+ add v24.4s, v24.4s, v5.4s
+ add v28.4s, v28.4s, v2.4s
+
+ sub v22.4s, v22.4s, v23.4s
+ sub v9.4s, v9.4s, v29.4s
+ sub v23.4s, v23.4s, v24.4s
+ sub v29.4s, v29.4s, v28.4s
+
+ sshr v4.4s, v4.4s, #2
+ sshr v0.4s, v26.4s, #2
+ sshr v22.4s, v22.4s, #2
+ sshr v9.4s, v9.4s, #2
+
+ sub v4.4s, v4.4s, v31.4s
+ sub v0.4s, v0.4s, v19.4s
+ sub v22.4s, v22.4s, v23.4s
+ sub v9.4s, v9.4s, v29.4s
+
+ sshr v4.4s, v4.4s, #2
+ sshr v0.4s, v0.4s, #2
+ sshr v22.4s, v22.4s, #2
+ sshr v9.4s, v9.4s, #2
+
+ add v4.4s, v4.4s, v6.4s
+ add v0.4s, v0.4s, v17.4s
+ add v22.4s, v22.4s, v24.4s
+ add v9.4s, v9.4s, v28.4s
+
+ str q2, [sp]
+
+ sqrshrun v4.4h, v4.4s, #6
+ sqrshrun2 v4.8h, v0.4s, #6
+ sqrshrun v22.4h, v22.4s, #6
+ sqrshrun2 v22.8h, v9.4s, #6
+
+ mov v0.16b, v5.16b
+
+ ld1 {v28.8h, v29.8h}, [x7], #32 // src[16:31]
+
+ ldr q9, [sp, #0x10]
+ ldr q17, [sp, #0x30]
+ ldr q19, [sp, #0x40]
+
+ ext v26.16b, v7.16b, v12.16b, #14
+ ext v27.16b, v12.16b, v13.16b, #14
+
+ mvni v25.8h, #0xfc, lsl #8
+
+ smin v22.8h, v22.8h, v25.8h
+ smin v4.8h, v4.8h, v25.8h
+
+ st1 {v4.8h}, [x2], #16
+ st1 {v22.8h}, [x2], #16
+
+ mov v1.16b, v3.16b
+ mov v20.16b, v15.16b
+
+ ldr q15, [sp, #0x20]
+
+ ext v22.16b, v7.16b, v12.16b, #12
+ ext v23.16b, v12.16b, v13.16b, #12
+ ext v3.16b, v12.16b, v13.16b, #2
+ ext v4.16b, v13.16b, v28.16b, #2
+ ext v21.16b, v12.16b, v13.16b, #4
+ ext v5.16b, v13.16b, v28.16b, #4
+ ext v24.16b, v12.16b, v13.16b, #6
+ ext v25.16b, v13.16b, v28.16b, #6
+
+ movi v30.8h, #5
+ movi v31.8h, #20
+
+ b.gt 2b
+
+ subs w6, w6, #1
+ add x10, x10, x4
+ add x11, x11, x4
+ add x12, x12, x4
+ add x13, x13, x4
+ b.gt 1b
+
+ add sp, sp, #0x50
+
+ ldp d8, d9, [sp]
+ ldp d10, d11, [sp, #0x10]
+ ldp d12, d13, [sp, #0x20]
+ ldp d14, d15, [sp, #0x30]
+ add sp, sp, #0x40
+
ret
endfunc
+
+#endif
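For context, the constants loaded with movi above (#5, #20) are the taps of the H.264 six-tap (1,-5,20,20,-5,1) half-pel filter, the sqrshrun #5 steps are the rounded (t+16)>>5 normalization of the horizontal/vertical planes, and the mvni #0xfc, lsl #8 masks build the 10-bit clamp value 0x03ff. A minimal scalar sketch of the 1-D pass, with helper names chosen here for illustration only:

/* 10-bit clamp applied to the h/v half-pel outputs. */
#define PIXEL_MAX_10 ((1 << 10) - 1)

static inline int clip10( int v )
{
    return v < 0 ? 0 : v > PIXEL_MAX_10 ? PIXEL_MAX_10 : v;
}

/* Six-tap filter around position p; stride is 1 for the horizontal pass
 * and the line stride for the vertical pass. */
static inline int tap6( const uint16_t *p, intptr_t stride )
{
    return p[-2*stride] - 5*p[-1*stride] + 20*p[0]
         + 20*p[1*stride] - 5*p[2*stride] + p[3*stride];
}

static inline uint16_t hpel_1d( const uint16_t *p, intptr_t stride )
{
    /* (t + 16) >> 5 mirrors the sqrshrun #5 rounding in the NEON code. */
    return (uint16_t)clip10( (tap6( p, stride ) + 16) >> 5 );
}

The centre (dstc) plane applies the filter in both directions on a wider intermediate before a final rounding shift; the a-b/b-c decomposition in the loop above is an optimized form of that combination.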
=====================================
common/aarch64/mc-c.c
=====================================
@@ -28,11 +28,11 @@
#include "mc.h"
#define x264_prefetch_ref_aarch64 x264_template(prefetch_ref_aarch64)
-void x264_prefetch_ref_aarch64( uint8_t *, intptr_t, int );
+void x264_prefetch_ref_aarch64( pixel *, intptr_t, int );
#define x264_prefetch_fenc_420_aarch64 x264_template(prefetch_fenc_420_aarch64)
-void x264_prefetch_fenc_420_aarch64( uint8_t *, intptr_t, uint8_t *, intptr_t, int );
+void x264_prefetch_fenc_420_aarch64( pixel *, intptr_t, pixel *, intptr_t, int );
#define x264_prefetch_fenc_422_aarch64 x264_template(prefetch_fenc_422_aarch64)
-void x264_prefetch_fenc_422_aarch64( uint8_t *, intptr_t, uint8_t *, intptr_t, int );
+void x264_prefetch_fenc_422_aarch64( pixel *, intptr_t, pixel *, intptr_t, int );
#define x264_memcpy_aligned_neon x264_template(memcpy_aligned_neon)
void *x264_memcpy_aligned_neon( void *dst, const void *src, size_t n );
@@ -40,32 +40,32 @@ void *x264_memcpy_aligned_neon( void *dst, const void *src, size_t n );
void x264_memzero_aligned_neon( void *dst, size_t n );
#define x264_pixel_avg_16x16_neon x264_template(pixel_avg_16x16_neon)
-void x264_pixel_avg_16x16_neon( uint8_t *, intptr_t, uint8_t *, intptr_t, uint8_t *, intptr_t, int );
+void x264_pixel_avg_16x16_neon( pixel *, intptr_t, pixel *, intptr_t, pixel *, intptr_t, int );
#define x264_pixel_avg_16x8_neon x264_template(pixel_avg_16x8_neon)
-void x264_pixel_avg_16x8_neon ( uint8_t *, intptr_t, uint8_t *, intptr_t, uint8_t *, intptr_t, int );
+void x264_pixel_avg_16x8_neon ( pixel *, intptr_t, pixel *, intptr_t, pixel *, intptr_t, int );
#define x264_pixel_avg_8x16_neon x264_template(pixel_avg_8x16_neon)
-void x264_pixel_avg_8x16_neon ( uint8_t *, intptr_t, uint8_t *, intptr_t, uint8_t *, intptr_t, int );
+void x264_pixel_avg_8x16_neon ( pixel *, intptr_t, pixel *, intptr_t, pixel *, intptr_t, int );
#define x264_pixel_avg_8x8_neon x264_template(pixel_avg_8x8_neon)
-void x264_pixel_avg_8x8_neon ( uint8_t *, intptr_t, uint8_t *, intptr_t, uint8_t *, intptr_t, int );
+void x264_pixel_avg_8x8_neon ( pixel *, intptr_t, pixel *, intptr_t, pixel *, intptr_t, int );
#define x264_pixel_avg_8x4_neon x264_template(pixel_avg_8x4_neon)
-void x264_pixel_avg_8x4_neon ( uint8_t *, intptr_t, uint8_t *, intptr_t, uint8_t *, intptr_t, int );
+void x264_pixel_avg_8x4_neon ( pixel *, intptr_t, pixel *, intptr_t, pixel *, intptr_t, int );
#define x264_pixel_avg_4x16_neon x264_template(pixel_avg_4x16_neon)
-void x264_pixel_avg_4x16_neon ( uint8_t *, intptr_t, uint8_t *, intptr_t, uint8_t *, intptr_t, int );
+void x264_pixel_avg_4x16_neon ( pixel *, intptr_t, pixel *, intptr_t, pixel *, intptr_t, int );
#define x264_pixel_avg_4x8_neon x264_template(pixel_avg_4x8_neon)
-void x264_pixel_avg_4x8_neon ( uint8_t *, intptr_t, uint8_t *, intptr_t, uint8_t *, intptr_t, int );
+void x264_pixel_avg_4x8_neon ( pixel *, intptr_t, pixel *, intptr_t, pixel *, intptr_t, int );
#define x264_pixel_avg_4x4_neon x264_template(pixel_avg_4x4_neon)
-void x264_pixel_avg_4x4_neon ( uint8_t *, intptr_t, uint8_t *, intptr_t, uint8_t *, intptr_t, int );
+void x264_pixel_avg_4x4_neon ( pixel *, intptr_t, pixel *, intptr_t, pixel *, intptr_t, int );
#define x264_pixel_avg_4x2_neon x264_template(pixel_avg_4x2_neon)
-void x264_pixel_avg_4x2_neon ( uint8_t *, intptr_t, uint8_t *, intptr_t, uint8_t *, intptr_t, int );
+void x264_pixel_avg_4x2_neon ( pixel *, intptr_t, pixel *, intptr_t, pixel *, intptr_t, int );
#define x264_pixel_avg2_w4_neon x264_template(pixel_avg2_w4_neon)
-void x264_pixel_avg2_w4_neon ( uint8_t *, intptr_t, uint8_t *, intptr_t, uint8_t *, int );
+void x264_pixel_avg2_w4_neon ( pixel *, intptr_t, pixel *, intptr_t, pixel *, int );
#define x264_pixel_avg2_w8_neon x264_template(pixel_avg2_w8_neon)
-void x264_pixel_avg2_w8_neon ( uint8_t *, intptr_t, uint8_t *, intptr_t, uint8_t *, int );
+void x264_pixel_avg2_w8_neon ( pixel *, intptr_t, pixel *, intptr_t, pixel *, int );
#define x264_pixel_avg2_w16_neon x264_template(pixel_avg2_w16_neon)
-void x264_pixel_avg2_w16_neon( uint8_t *, intptr_t, uint8_t *, intptr_t, uint8_t *, int );
+void x264_pixel_avg2_w16_neon( pixel *, intptr_t, pixel *, intptr_t, pixel *, int );
#define x264_pixel_avg2_w20_neon x264_template(pixel_avg2_w20_neon)
-void x264_pixel_avg2_w20_neon( uint8_t *, intptr_t, uint8_t *, intptr_t, uint8_t *, int );
+void x264_pixel_avg2_w20_neon( pixel *, intptr_t, pixel *, intptr_t, pixel *, int );
#define x264_plane_copy_core_neon x264_template(plane_copy_core_neon)
void x264_plane_copy_core_neon( pixel *dst, intptr_t i_dst,
@@ -111,12 +111,12 @@ void x264_load_deinterleave_chroma_fenc_neon( pixel *dst, pixel *src, intptr_t i
#define x264_mc_weight_w8_offsetadd_neon x264_template(mc_weight_w8_offsetadd_neon)
#define x264_mc_weight_w8_offsetsub_neon x264_template(mc_weight_w8_offsetsub_neon)
#define MC_WEIGHT(func)\
-void x264_mc_weight_w20##func##_neon( uint8_t *, intptr_t, uint8_t *, intptr_t, const x264_weight_t *, int );\
-void x264_mc_weight_w16##func##_neon( uint8_t *, intptr_t, uint8_t *, intptr_t, const x264_weight_t *, int );\
-void x264_mc_weight_w8##func##_neon ( uint8_t *, intptr_t, uint8_t *, intptr_t, const x264_weight_t *, int );\
-void x264_mc_weight_w4##func##_neon ( uint8_t *, intptr_t, uint8_t *, intptr_t, const x264_weight_t *, int );\
+void x264_mc_weight_w20##func##_neon( pixel *, intptr_t, pixel *, intptr_t, const x264_weight_t *, int );\
+void x264_mc_weight_w16##func##_neon( pixel *, intptr_t, pixel *, intptr_t, const x264_weight_t *, int );\
+void x264_mc_weight_w8##func##_neon ( pixel *, intptr_t, pixel *, intptr_t, const x264_weight_t *, int );\
+void x264_mc_weight_w4##func##_neon ( pixel *, intptr_t, pixel *, intptr_t, const x264_weight_t *, int );\
\
-static void (* mc##func##_wtab_neon[6])( uint8_t *, intptr_t, uint8_t *, intptr_t, const x264_weight_t *, int ) =\
+static void (* mc##func##_wtab_neon[6])( pixel *, intptr_t, pixel *, intptr_t, const x264_weight_t *, int ) =\
{\
x264_mc_weight_w4##func##_neon,\
x264_mc_weight_w4##func##_neon,\
@@ -126,32 +126,30 @@ static void (* mc##func##_wtab_neon[6])( uint8_t *, intptr_t, uint8_t *, intptr_
x264_mc_weight_w20##func##_neon,\
};
-#if !HIGH_BIT_DEPTH
MC_WEIGHT()
MC_WEIGHT(_nodenom)
MC_WEIGHT(_offsetadd)
MC_WEIGHT(_offsetsub)
-#endif
#define x264_mc_copy_w4_neon x264_template(mc_copy_w4_neon)
-void x264_mc_copy_w4_neon ( uint8_t *, intptr_t, uint8_t *, intptr_t, int );
+void x264_mc_copy_w4_neon ( pixel *, intptr_t, pixel *, intptr_t, int );
#define x264_mc_copy_w8_neon x264_template(mc_copy_w8_neon)
-void x264_mc_copy_w8_neon ( uint8_t *, intptr_t, uint8_t *, intptr_t, int );
+void x264_mc_copy_w8_neon ( pixel *, intptr_t, pixel *, intptr_t, int );
#define x264_mc_copy_w16_neon x264_template(mc_copy_w16_neon)
-void x264_mc_copy_w16_neon( uint8_t *, intptr_t, uint8_t *, intptr_t, int );
+void x264_mc_copy_w16_neon( pixel *, intptr_t, pixel *, intptr_t, int );
#define x264_mc_chroma_neon x264_template(mc_chroma_neon)
-void x264_mc_chroma_neon( uint8_t *, uint8_t *, intptr_t, uint8_t *, intptr_t, int, int, int, int );
+void x264_mc_chroma_neon( pixel *, pixel *, intptr_t, pixel *, intptr_t, int, int, int, int );
#define x264_integral_init4h_neon x264_template(integral_init4h_neon)
-void x264_integral_init4h_neon( uint16_t *, uint8_t *, intptr_t );
+void x264_integral_init4h_neon( uint16_t *, pixel *, intptr_t );
#define x264_integral_init4v_neon x264_template(integral_init4v_neon)
void x264_integral_init4v_neon( uint16_t *, uint16_t *, intptr_t );
#define x264_integral_init8h_neon x264_template(integral_init8h_neon)
-void x264_integral_init8h_neon( uint16_t *, uint8_t *, intptr_t );
+void x264_integral_init8h_neon( uint16_t *, pixel *, intptr_t );
#define x264_integral_init8v_neon x264_template(integral_init8v_neon)
void x264_integral_init8v_neon( uint16_t *, intptr_t );
#define x264_frame_init_lowres_core_neon x264_template(frame_init_lowres_core_neon)
-void x264_frame_init_lowres_core_neon( uint8_t *, uint8_t *, uint8_t *, uint8_t *, uint8_t *, intptr_t, intptr_t, int, int );
+void x264_frame_init_lowres_core_neon( pixel *, pixel *, pixel *, pixel *, pixel *, intptr_t, intptr_t, int, int );
#define x264_mbtree_propagate_cost_neon x264_template(mbtree_propagate_cost_neon)
void x264_mbtree_propagate_cost_neon( int16_t *, uint16_t *, uint16_t *, uint16_t *, uint16_t *, float *, int );
@@ -161,7 +159,25 @@ void x264_mbtree_fix8_pack_neon( uint16_t *dst, float *src, int count );
#define x264_mbtree_fix8_unpack_neon x264_template(mbtree_fix8_unpack_neon)
void x264_mbtree_fix8_unpack_neon( float *dst, uint16_t *src, int count );
-#if !HIGH_BIT_DEPTH
+static void (* const pixel_avg_wtab_neon[6])( pixel *, intptr_t, pixel *, intptr_t, pixel *, int ) =
+{
+ NULL,
+ x264_pixel_avg2_w4_neon,
+ x264_pixel_avg2_w8_neon,
+ x264_pixel_avg2_w16_neon, // no slower than w12, so no point in a separate function
+ x264_pixel_avg2_w16_neon,
+ x264_pixel_avg2_w20_neon,
+};
+
+static void (* const mc_copy_wtab_neon[5])( pixel *, intptr_t, pixel *, intptr_t, int ) =
+{
+ NULL,
+ x264_mc_copy_w4_neon,
+ x264_mc_copy_w8_neon,
+ NULL,
+ x264_mc_copy_w16_neon,
+};
+
static void weight_cache_neon( x264_t *h, x264_weight_t *w )
{
if( w->i_scale == 1<<w->i_denom )
@@ -183,39 +199,20 @@ static void weight_cache_neon( x264_t *h, x264_weight_t *w )
w->weightfn = mc_wtab_neon;
}
-static void (* const pixel_avg_wtab_neon[6])( uint8_t *, intptr_t, uint8_t *, intptr_t, uint8_t *, int ) =
-{
- NULL,
- x264_pixel_avg2_w4_neon,
- x264_pixel_avg2_w8_neon,
- x264_pixel_avg2_w16_neon, // no slower than w12, so no point in a separate function
- x264_pixel_avg2_w16_neon,
- x264_pixel_avg2_w20_neon,
-};
-
-static void (* const mc_copy_wtab_neon[5])( uint8_t *, intptr_t, uint8_t *, intptr_t, int ) =
-{
- NULL,
- x264_mc_copy_w4_neon,
- x264_mc_copy_w8_neon,
- NULL,
- x264_mc_copy_w16_neon,
-};
-
-static void mc_luma_neon( uint8_t *dst, intptr_t i_dst_stride,
- uint8_t *src[4], intptr_t i_src_stride,
+static void mc_luma_neon( pixel *dst, intptr_t i_dst_stride,
+ pixel *src[4], intptr_t i_src_stride,
int mvx, int mvy,
int i_width, int i_height, const x264_weight_t *weight )
{
int qpel_idx = ((mvy&3)<<2) + (mvx&3);
intptr_t offset = (mvy>>2)*i_src_stride + (mvx>>2);
- uint8_t *src1 = src[x264_hpel_ref0[qpel_idx]] + offset;
+ pixel *src1 = src[x264_hpel_ref0[qpel_idx]] + offset;
if( (mvy&3) == 3 ) // explicit if() to force conditional add
src1 += i_src_stride;
if( qpel_idx & 5 ) /* qpel interpolation needed */
{
- uint8_t *src2 = src[x264_hpel_ref1[qpel_idx]] + offset + ((mvx&3) == 3);
+ pixel *src2 = src[x264_hpel_ref1[qpel_idx]] + offset + ((mvx&3) == 3);
pixel_avg_wtab_neon[i_width>>2](
dst, i_dst_stride, src1, i_src_stride,
src2, i_height );
@@ -228,20 +225,20 @@ static void mc_luma_neon( uint8_t *dst, intptr_t i_dst_stride,
mc_copy_wtab_neon[i_width>>2]( dst, i_dst_stride, src1, i_src_stride, i_height );
}
-static uint8_t *get_ref_neon( uint8_t *dst, intptr_t *i_dst_stride,
- uint8_t *src[4], intptr_t i_src_stride,
+static pixel *get_ref_neon( pixel *dst, intptr_t *i_dst_stride,
+ pixel *src[4], intptr_t i_src_stride,
int mvx, int mvy,
int i_width, int i_height, const x264_weight_t *weight )
{
int qpel_idx = ((mvy&3)<<2) + (mvx&3);
intptr_t offset = (mvy>>2)*i_src_stride + (mvx>>2);
- uint8_t *src1 = src[x264_hpel_ref0[qpel_idx]] + offset;
+ pixel *src1 = src[x264_hpel_ref0[qpel_idx]] + offset;
if( (mvy&3) == 3 ) // explicit if() to force conditional add
src1 += i_src_stride;
if( qpel_idx & 5 ) /* qpel interpolation needed */
{
- uint8_t *src2 = src[x264_hpel_ref1[qpel_idx]] + offset + ((mvx&3) == 3);
+ pixel *src2 = src[x264_hpel_ref1[qpel_idx]] + offset + ((mvx&3) == 3);
pixel_avg_wtab_neon[i_width>>2](
dst, *i_dst_stride, src1, i_src_stride,
src2, i_height );
@@ -262,19 +259,18 @@ static uint8_t *get_ref_neon( uint8_t *dst, intptr_t *i_dst_stride,
}
#define x264_hpel_filter_neon x264_template(hpel_filter_neon)
-void x264_hpel_filter_neon( uint8_t *dsth, uint8_t *dstv, uint8_t *dstc,
- uint8_t *src, intptr_t stride, int width,
+void x264_hpel_filter_neon( pixel *dsth, pixel *dstv, pixel *dstc,
+ pixel *src, intptr_t stride, int width,
int height, int16_t *buf );
PLANE_COPY(16, neon)
PLANE_COPY_SWAP(16, neon)
PLANE_INTERLEAVE(neon)
PROPAGATE_LIST(neon)
-#endif // !HIGH_BIT_DEPTH
void x264_mc_init_aarch64( uint32_t cpu, x264_mc_functions_t *pf )
{
-#if !HIGH_BIT_DEPTH
+
if( cpu&X264_CPU_ARMV8 )
{
pf->prefetch_fenc_420 = x264_prefetch_fenc_420_aarch64;
@@ -285,20 +281,13 @@ void x264_mc_init_aarch64( uint32_t cpu, x264_mc_functions_t *pf )
if( !(cpu&X264_CPU_NEON) )
return;
- pf->copy_16x16_unaligned = x264_mc_copy_w16_neon;
- pf->copy[PIXEL_16x16] = x264_mc_copy_w16_neon;
- pf->copy[PIXEL_8x8] = x264_mc_copy_w8_neon;
- pf->copy[PIXEL_4x4] = x264_mc_copy_w4_neon;
-
- pf->plane_copy = plane_copy_neon;
- pf->plane_copy_swap = plane_copy_swap_neon;
- pf->plane_copy_deinterleave = x264_plane_copy_deinterleave_neon;
- pf->plane_copy_deinterleave_rgb = x264_plane_copy_deinterleave_rgb_neon;
- pf->plane_copy_interleave = plane_copy_interleave_neon;
+ pf->mbtree_propagate_cost = x264_mbtree_propagate_cost_neon;
+ pf->mbtree_propagate_list = mbtree_propagate_list_neon;
+ pf->mbtree_fix8_pack = x264_mbtree_fix8_pack_neon;
+ pf->mbtree_fix8_unpack = x264_mbtree_fix8_unpack_neon;
- pf->load_deinterleave_chroma_fdec = x264_load_deinterleave_chroma_fdec_neon;
- pf->load_deinterleave_chroma_fenc = x264_load_deinterleave_chroma_fenc_neon;
- pf->store_interleave_chroma = x264_store_interleave_chroma_neon;
+ pf->memcpy_aligned = x264_memcpy_aligned_neon;
+ pf->memzero_aligned = x264_memzero_aligned_neon;
pf->avg[PIXEL_16x16] = x264_pixel_avg_16x16_neon;
pf->avg[PIXEL_16x8] = x264_pixel_avg_16x8_neon;
@@ -310,6 +299,11 @@ void x264_mc_init_aarch64( uint32_t cpu, x264_mc_functions_t *pf )
pf->avg[PIXEL_4x4] = x264_pixel_avg_4x4_neon;
pf->avg[PIXEL_4x2] = x264_pixel_avg_4x2_neon;
+ pf->copy_16x16_unaligned = x264_mc_copy_w16_neon;
+ pf->copy[PIXEL_16x16] = x264_mc_copy_w16_neon;
+ pf->copy[PIXEL_8x8] = x264_mc_copy_w8_neon;
+ pf->copy[PIXEL_4x4] = x264_mc_copy_w4_neon;
+
pf->weight = mc_wtab_neon;
pf->offsetadd = mc_offsetadd_wtab_neon;
pf->offsetsub = mc_offsetsub_wtab_neon;
@@ -318,20 +312,30 @@ void x264_mc_init_aarch64( uint32_t cpu, x264_mc_functions_t *pf )
pf->mc_chroma = x264_mc_chroma_neon;
pf->mc_luma = mc_luma_neon;
pf->get_ref = get_ref_neon;
- pf->hpel_filter = x264_hpel_filter_neon;
- pf->frame_init_lowres_core = x264_frame_init_lowres_core_neon;
pf->integral_init4h = x264_integral_init4h_neon;
pf->integral_init8h = x264_integral_init8h_neon;
pf->integral_init4v = x264_integral_init4v_neon;
pf->integral_init8v = x264_integral_init8v_neon;
- pf->mbtree_propagate_cost = x264_mbtree_propagate_cost_neon;
- pf->mbtree_propagate_list = mbtree_propagate_list_neon;
- pf->mbtree_fix8_pack = x264_mbtree_fix8_pack_neon;
- pf->mbtree_fix8_unpack = x264_mbtree_fix8_unpack_neon;
+ pf->frame_init_lowres_core = x264_frame_init_lowres_core_neon;
+
+ pf->load_deinterleave_chroma_fdec = x264_load_deinterleave_chroma_fdec_neon;
+ pf->load_deinterleave_chroma_fenc = x264_load_deinterleave_chroma_fenc_neon;
+
+ pf->store_interleave_chroma = x264_store_interleave_chroma_neon;
+
+ pf->plane_copy = plane_copy_neon;
+ pf->plane_copy_swap = plane_copy_swap_neon;
+ pf->plane_copy_deinterleave = x264_plane_copy_deinterleave_neon;
+ pf->plane_copy_deinterleave_rgb = x264_plane_copy_deinterleave_rgb_neon;
+ pf->plane_copy_interleave = plane_copy_interleave_neon;
+
+ pf->hpel_filter = x264_hpel_filter_neon;
+
+#if !HIGH_BIT_DEPTH
+
+
- pf->memcpy_aligned = x264_memcpy_aligned_neon;
- pf->memzero_aligned = x264_memzero_aligned_neon;
#endif // !HIGH_BIT_DEPTH
}
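The table dispatch in mc_luma_neon/get_ref_neon above selects a kernel by i_width>>2, which is why x264_pixel_avg2_w16_neon appears twice (per the in-table comment, the 12-wide case is no slower through the 16-wide kernel) and why mc_copy_wtab_neon can leave NULL entries where the copy path is never taken. A trivial illustration of that index mapping:

#include <stdio.h>

int main( void )
{
    /* Block widths handled by the motion-compensation wtabs. */
    static const int widths[] = { 4, 8, 12, 16, 20 };
    for( unsigned i = 0; i < sizeof(widths)/sizeof(widths[0]); i++ )
        printf( "width %2d -> wtab index %d\n", widths[i], widths[i] >> 2 );
    /* Indices 1..5: index 3 (width 12) reuses the w16 averaging kernel,
     * and the copy table only needs real entries at indices 1, 2 and 4. */
    return 0;
}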
View it on GitLab: https://code.videolan.org/videolan/x264/-/compare/834c5c92db67bf34467915305544ad8c4fe97657...cc5c343f432ba7c6ce1e11aa49cbb718e7e4710e