[x264-devel] [Git][videolan/x264][master] 14 commits: mc: Add initial support for 10 bit neon

Anton Mitrofanov (@BugMaster) gitlab at videolan.org
Sun Oct 1 15:29:40 UTC 2023



Anton Mitrofanov pushed to branch master at VideoLAN / x264


Commits:
249924ea by Hubert Mazur at 2023-10-01T15:13:40+00:00
mc: Add initial support for 10 bit neon

Add if/else clauses to the files to control which code path is used.
Move the generic functions out of the 8-bit depth scope into a common
one shared by both bit depths.

Signed-off-by: Hubert Mazur <hum at semihalf.com>

- - - - -
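
The guard this commit introduces is visible further down in the diff
(#if BIT_DEPTH == 8 / #else in mc-a.S). As a rough, self-contained C
sketch of the same idea (not the actual x264 headers, which spell these
definitions differently): the bit depth picks the pixel type and clamp
value at build time, while helpers outside the guard stay shared.

    #include <stdint.h>

    #ifndef BIT_DEPTH
    #define BIT_DEPTH 10
    #endif

    #if BIT_DEPTH == 8
    typedef uint8_t  pixel;
    #define PIXEL_MAX 0xff
    #else
    typedef uint16_t pixel;
    #define PIXEL_MAX ((1 << BIT_DEPTH) - 1)
    #endif

    /* shared by both code paths */
    static inline pixel clip_pixel( int x )
    {
        return x < 0 ? 0 : x > PIXEL_MAX ? PIXEL_MAX : (pixel)x;
    }
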
ba45eba3 by Hubert Mazur at 2023-10-01T15:13:40+00:00
aarch64/mc-c: Unify pixel/uint8_t usage

Previously, some functions from the motion compensation family used
uint8_t, while others used the pixel definition. Unify this and change
every uint8_t usage to pixel.
This commit is a prerequisite for 10 bit depth support.

Signed-off-by: Hubert Mazur <hum at semihalf.com>

- - - - -
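
To illustrate why the pixel unification matters (a generic sketch, not a
function from mc-c.c): once a prototype is written in terms of pixel, the
same code serves both bit depths and only the build-time typedef changes.

    #include <stdint.h>
    #include <string.h>

    typedef uint16_t pixel;   /* uint8_t in an 8-bit build */

    /* width*height block copy expressed in pixels, so the byte count
       follows sizeof(pixel) automatically */
    static void copy_block( pixel *dst, intptr_t dst_stride,
                            const pixel *src, intptr_t src_stride,
                            int width, int height )
    {
        for( int y = 0; y < height; y++ )
        {
            memcpy( dst, src, width * sizeof(pixel) );
            dst += dst_stride;
            src += src_stride;
        }
    }
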
13a24888 by Hubert Mazur at 2023-10-01T15:13:40+00:00
mc: Add arm64 neon implementation for pixel_avg

Provide a neon optimized implementation of the pixel_avg functions from
the motion compensation family for 10 bit depth.
Checkasm benchmarks are shown below.

avg_4x2_c: 703
avg_4x2_neon: 222
avg_4x4_c: 1405
avg_4x4_neon: 516
avg_4x8_c: 2759
avg_4x8_neon: 898
avg_4x16_c: 5808
avg_4x16_neon: 1776
avg_8x4_c: 2767
avg_8x4_neon: 412
avg_8x8_c: 5559
avg_8x8_neon: 841
avg_8x16_c: 11176
avg_8x16_neon: 1668
avg_16x8_c: 10493
avg_16x8_neon: 1504
avg_16x16_c: 21116
avg_16x16_neon: 2985

Signed-off-by: Hubert Mazur <hum at semihalf.com>

- - - - -
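
For readers who do not want to trace the assembly, this is a rough C
sketch of the operation benchmarked above: src1 weighted by w and src2
by 64-w, rounded and clamped to the 10-bit range. The real reference
lives in x264's common/mc.c and may differ in detail; w == 32
degenerates to the plain rounded average (urhadd in the NEON code).

    #include <stdint.h>

    typedef uint16_t pixel;              /* 10-bit build */
    #define PIXEL_MAX 0x3ff

    static inline pixel clip_pixel( int x )
    {
        return x < 0 ? 0 : x > PIXEL_MAX ? PIXEL_MAX : (pixel)x;
    }

    static void pixel_avg_ref( pixel *dst,  intptr_t dst_stride,
                               const pixel *src1, intptr_t src1_stride,
                               const pixel *src2, intptr_t src2_stride,
                               int width, int height, int w )
    {
        for( int y = 0; y < height; y++ )
        {
            for( int x = 0; x < width; x++ )
                dst[x] = clip_pixel( (src1[x]*w + src2[x]*(64-w) + 32) >> 6 );
            dst  += dst_stride;
            src1 += src1_stride;
            src2 += src2_stride;
        }
    }
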
bb3d83dd by Hubert Mazur at 2023-10-01T15:13:40+00:00
mc: Add arm64 neon implementation for pixel_avg2

Provide a neon optimized implementation of the pixel_avg2 functions from
the motion compensation family for 10 bit depth.

Signed-off-by: Hubert Mazur <hum at semihalf.com>

- - - - -
f0b0489f by Hubert Mazur at 2023-10-01T15:13:40+00:00
mc: Add arm64 neon implementation for mc_copy

Provide a neon optimized implementation of the mc_copy functions from
the motion compensation family for 10 bit depth.

Signed-off-by: Hubert Mazur <hum at semihalf.com>

- - - - -
25d5baf4 by Hubert Mazur at 2023-10-01T15:13:40+00:00
mc: Add arm64 neon implementation for mc_weight

Provide a neon optimized implementation of the mc_weight functions from
the motion compensation family for 10 bit depth.

Benchmark results are shown below.

weight_w4_c: 4734
weight_w4_neon: 4165
weight_w8_c: 8930
weight_w8_neon: 1620
weight_w16_c: 16939
weight_w16_neon: 2729
weight_w20_c: 20721
weight_w20_neon: 3470

Signed-off-by: Hubert Mazur <hum at semihalf.com>

- - - - -
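
A rough C sketch of the explicit weighted prediction these routines
vectorize, assuming the usual scale/denom/offset weight parameters (the
x264 reference in common/mc.c is the authoritative version). In the
10-bit assembly the offset is additionally shifted left by 2, since the
weights are specified at 8-bit precision.

    #include <stdint.h>

    typedef uint16_t pixel;
    #define PIXEL_MAX 0x3ff

    static inline pixel clip_pixel( int x )
    {
        return x < 0 ? 0 : x > PIXEL_MAX ? PIXEL_MAX : (pixel)x;
    }

    static void mc_weight_ref( pixel *dst, intptr_t dst_stride,
                               const pixel *src, intptr_t src_stride,
                               int scale, int denom, int offset,
                               int width, int height )
    {
        int round = denom ? 1 << (denom - 1) : 0;
        for( int y = 0; y < height; y++ )
        {
            for( int x = 0; x < width; x++ )
                dst[x] = clip_pixel( ((src[x] * scale + round) >> denom) + offset );
            dst += dst_stride;
            src += src_stride;
        }
    }
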
08761208 by Hubert Mazur at 2023-10-01T15:13:40+00:00
mc: Move mc_luma and get_ref wrappers

The mc_luma and get_ref wrappers were previously only defined for 8 bit
depth. As all the required 10 bit depth helper functions now exist, move
them out of the if scope and make them always defined regardless of the
bit depth.

Signed-off-by: Hubert Mazur <hum at semihalf.com>

- - - - -
7ff0f978 by Hubert Mazur at 2023-10-01T15:13:40+00:00
mc: Add arm64 neon implementation for mc_chroma

Provide a neon optimized implementation of the mc_chroma functions from
the motion compensation family for 10 bit depth.
Benchmark results are shown below.

mc_chroma_2x2_c: 700
mc_chroma_2x2_neon: 478
mc_chroma_2x4_c: 1300
mc_chroma_2x4_neon: 765
mc_chroma_4x2_c: 1229
mc_chroma_4x2_neon: 483
mc_chroma_4x4_c: 2383
mc_chroma_4x4_neon: 773
mc_chroma_4x8_c: 4662
mc_chroma_4x8_neon: 1319
mc_chroma_8x4_c: 4450
mc_chroma_8x4_neon: 940
mc_chroma_8x8_c: 8797
mc_chroma_8x8_neon: 1638

Signed-off-by: Hubert Mazur <hum at semihalf.com>

- - - - -
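
A rough C sketch of the bilinear chroma interpolation being vectorized,
with the four weights matching the cA..cD comments in the assembly
(strides are in pixels here; src points at interleaved NV12-style U/V,
and dx/dy are the fractional motion vector components):

    #include <stdint.h>

    typedef uint16_t pixel;

    static void mc_chroma_ref( pixel *dstu, pixel *dstv, intptr_t dst_stride,
                               const pixel *src, intptr_t src_stride,
                               int dx, int dy, int width, int height )
    {
        int d8x = dx & 7, d8y = dy & 7;
        int cA = (8-d8x)*(8-d8y);
        int cB =    d8x *(8-d8y);
        int cC = (8-d8x)*   d8y;
        int cD =    d8x *   d8y;

        src += (dy >> 3) * src_stride + (dx >> 3) * 2;
        for( int y = 0; y < height; y++ )
        {
            const pixel *s0 = src, *s1 = src + src_stride;
            for( int x = 0; x < width; x++ )
            {
                dstu[x] = ( cA*s0[2*x]   + cB*s0[2*x+2] +
                            cC*s1[2*x]   + cD*s1[2*x+2] + 32 ) >> 6;
                dstv[x] = ( cA*s0[2*x+1] + cB*s0[2*x+3] +
                            cC*s1[2*x+1] + cD*s1[2*x+3] + 32 ) >> 6;
            }
            src  += src_stride;
            dstu += dst_stride;
            dstv += dst_stride;
        }
    }
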
25ef8832 by Hubert Mazur at 2023-10-01T15:13:40+00:00
mc: Add arm64 neon implementation for mc_integral

Provide a neon optimized implementation of the mc_integral functions
from the motion compensation family for 10 bit depth.
Benchmark results are shown below.

integral_init4h_c: 2651
integral_init4h_neon: 550
integral_init4v_c: 4247
integral_init4v_neon: 612
integral_init8h_c: 2544
integral_init8h_neon: 1027
integral_init8v_c: 1996
integral_init8v_neon: 245

Signed-off-by: Hubert Mazur <hum at semihalf.com>

- - - - -
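
What integral_init4h computes, written directly (the x264 C reference
uses an equivalent sliding-window form): each output is the sum of four
horizontally adjacent pixels plus the entry one row above, building the
partial sums of the integral image. A rough sketch:

    #include <stdint.h>

    typedef uint16_t pixel;

    /* sum points one row into the integral buffer, so sum[x - stride]
       is the previously completed row */
    static void integral_init4h_ref( uint16_t *sum, const pixel *pix,
                                     intptr_t stride )
    {
        for( int x = 0; x < stride - 4; x++ )
            sum[x] = pix[x] + pix[x+1] + pix[x+2] + pix[x+3] + sum[x - stride];
    }
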
0a810f4f by Hubert Mazur at 2023-10-01T15:13:40+00:00
mc: Add arm64 neon implementation for mc_lowres

Provide a neon optimized implementation of the mc_lowres function from
the motion compensation family for 10 bit depth.
Benchmark results are shown below.

lowres_init_c: 149446
lowres_init_neon: 13172

Signed-off-by: Hubert Mazur <hum at semihalf.com>

- - - - -
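
A rough C sketch of one output row of frame_init_lowres_core: the four
half-pel-offset downscaled planes come from three source rows via nested
rounding averages, which is what the urhadd chains in the assembly do.

    #include <stdint.h>

    typedef uint16_t pixel;

    #define RHADD(a,b) (((a)+(b)+1) >> 1)

    static void lowres_row_ref( const pixel *src0, const pixel *src1,
                                const pixel *src2, pixel *dst0, pixel *dsth,
                                pixel *dstv, pixel *dstc, int width )
    {
        for( int x = 0; x < width; x++ )
        {
            dst0[x] = RHADD( RHADD(src0[2*x],   src1[2*x]),
                             RHADD(src0[2*x+1], src1[2*x+1]) );
            dsth[x] = RHADD( RHADD(src0[2*x+1], src1[2*x+1]),
                             RHADD(src0[2*x+2], src1[2*x+2]) );
            dstv[x] = RHADD( RHADD(src1[2*x],   src2[2*x]),
                             RHADD(src1[2*x+1], src2[2*x+1]) );
            dstc[x] = RHADD( RHADD(src1[2*x+1], src2[2*x+1]),
                             RHADD(src1[2*x+2], src2[2*x+2]) );
        }
    }
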
68d71206 by Hubert Mazur at 2023-10-01T15:13:40+00:00
mc: Add arm64 neon implementation for mc_load func

Provide a neon optimized implementation of the mc_load_deinterleave
function from the motion compensation family for 10 bit depth.
Benchmark results are shown below.

load_deinterleave_chroma_fdec_c: 2936
load_deinterleave_chroma_fdec_neon: 422

Signed-off-by: Hubert Mazur <hum at semihalf.com>

- - - - -
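
The data movement here is a plain chroma de-interleave; the actual x264
prototype writes into the fdec buffer with its fixed stride, but a rough
sketch of the operation (the ld2/st1 pattern in NEON performs the same
split in registers) looks like this:

    #include <stdint.h>

    typedef uint16_t pixel;

    static void load_deinterleave_chroma_ref( pixel *dstu, pixel *dstv,
                                              const pixel *src, int width )
    {
        for( int x = 0; x < width; x++ )
        {
            dstu[x] = src[2*x];
            dstv[x] = src[2*x+1];
        }
    }
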
df179744 by Hubert Mazur at 2023-10-01T15:13:40+00:00
mc: Add arm64 neon implementation for store func

Provide a neon optimized implementation of the mc_store_interleave
function from the motion compensation family for 10 bit depth.
Benchmark results are shown below.

load_deinterleave_chroma_fenc_c: 2910
load_deinterleave_chroma_fenc_neon: 430

Signed-off-by: Hubert Mazur <hum at semihalf.com>

- - - - -
e47bede8 by Hubert Mazur at 2023-10-01T15:13:40+00:00
mc: Add arm64 neon implementation for copy funcs

Provide a neon optimized implementation of the plane copy functions from
the motion compensation family for 10 bit depth.
Benchmark results are shown below.

plane_copy_c:  2955
plane_copy_neon: 2910
plane_copy_deinterleave_c: 24056
plane_copy_deinterleave_neon: 3625
plane_copy_deinterleave_rgb_c: 19928
plane_copy_deinterleave_rgb_neon: 3941
plane_copy_interleave_c: 24399
plane_copy_interleave_neon: 4723
plane_copy_swap_c: 32269
plane_copy_swap_neon: 3211

Signed-off-by: Hubert Mazur <hum at semihalf.com>

- - - - -
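
As one representative of the family above, a rough C sketch of
plane_copy_swap: copy an interleaved chroma plane while swapping the
U/V order of each pair (a lane-swap shuffle in the NEON version).
Width is counted in U/V pairs and strides are in pixels.

    #include <stdint.h>

    typedef uint16_t pixel;

    static void plane_copy_swap_ref( pixel *dst, intptr_t dst_stride,
                                     const pixel *src, intptr_t src_stride,
                                     int width, int height )
    {
        for( int y = 0; y < height; y++ )
        {
            for( int x = 0; x < width; x++ )
            {
                dst[2*x]   = src[2*x+1];
                dst[2*x+1] = src[2*x];
            }
            dst += dst_stride;
            src += src_stride;
        }
    }
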
cc5c343f by Hubert Mazur at 2023-10-01T15:13:40+00:00
mc: Add arm64 neon implementation for hpel filter

Provide a neon optimized implementation of the hpel filter function from
the motion compensation family for 10 bit depth.
Benchmark results are shown below.

hpel_filter_c: 111495
hpel_filter_neon: 37849

Signed-off-by: Hubert Mazur <hum at semihalf.com>

- - - - -
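
A rough C sketch of the horizontal half of the half-pel filter: the
standard H.264 6-tap kernel (1, -5, 20, 20, -5, 1), rounded and clipped.
The vertical pass applies the same taps down a column, and the centre
plane filters the intermediate values a second time with a wider shift
at the end; src is assumed to have the usual padded borders.

    #include <stdint.h>

    typedef uint16_t pixel;
    #define PIXEL_MAX 0x3ff

    static inline pixel clip_pixel( int x )
    {
        return x < 0 ? 0 : x > PIXEL_MAX ? PIXEL_MAX : (pixel)x;
    }

    static void hpel_filter_h_ref( pixel *dsth, const pixel *src, int width )
    {
        for( int x = 0; x < width; x++ )
        {
            int v = src[x-2] - 5*src[x-1] + 20*src[x] + 20*src[x+1]
                  - 5*src[x+2] + src[x+3];
            dsth[x] = clip_pixel( (v + 16) >> 5 );
        }
    }
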


2 changed files:

- common/aarch64/mc-a.S
- common/aarch64/mc-c.c


Changes:

=====================================
common/aarch64/mc-a.S
=====================================
@@ -85,6 +85,220 @@ endfunc
 prefetch_fenc 420
 prefetch_fenc 422
 
+function mbtree_propagate_cost_neon, export=1
+    ld1r        {v5.4s},  [x5]
+8:
+    subs        w6,  w6,  #8
+    ld1         {v1.8h},  [x1], #16
+    ld1         {v2.8h},  [x2], #16
+    ld1         {v3.8h},  [x3], #16
+    ld1         {v4.8h},  [x4], #16
+    bic         v3.8h,  #0xc0, lsl #8
+    umin        v3.8h,  v2.8h,  v3.8h
+    umull       v20.4s, v2.4h,  v4.4h   // propagate_intra
+    umull2      v21.4s, v2.8h,  v4.8h   // propagate_intra
+    usubl       v22.4s, v2.4h,  v3.4h   // propagate_num
+    usubl2      v23.4s, v2.8h,  v3.8h   // propagate_num
+    uxtl        v26.4s, v2.4h           // propagate_denom
+    uxtl2       v27.4s, v2.8h           // propagate_denom
+    uxtl        v24.4s, v1.4h
+    uxtl2       v25.4s, v1.8h
+    ucvtf       v20.4s, v20.4s
+    ucvtf       v21.4s, v21.4s
+    ucvtf       v26.4s, v26.4s
+    ucvtf       v27.4s, v27.4s
+    ucvtf       v22.4s, v22.4s
+    ucvtf       v23.4s, v23.4s
+    frecpe      v28.4s, v26.4s
+    frecpe      v29.4s, v27.4s
+    ucvtf       v24.4s, v24.4s
+    ucvtf       v25.4s, v25.4s
+    frecps      v30.4s, v28.4s, v26.4s
+    frecps      v31.4s, v29.4s, v27.4s
+    fmla        v24.4s, v20.4s, v5.4s   // propagate_amount
+    fmla        v25.4s, v21.4s, v5.4s   // propagate_amount
+    fmul        v28.4s, v28.4s, v30.4s
+    fmul        v29.4s, v29.4s, v31.4s
+    fmul        v16.4s, v24.4s, v22.4s
+    fmul        v17.4s, v25.4s, v23.4s
+    fmul        v18.4s, v16.4s, v28.4s
+    fmul        v19.4s, v17.4s, v29.4s
+    fcvtns      v20.4s, v18.4s
+    fcvtns      v21.4s, v19.4s
+    sqxtn       v0.4h,  v20.4s
+    sqxtn2      v0.8h,  v21.4s
+    st1         {v0.8h},  [x0], #16
+    b.gt        8b
+    ret
+endfunc
+
+const pw_0to15, align=5
+    .short 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15
+endconst
+
+function mbtree_propagate_list_internal_neon, export=1
+    movrel      x11,  pw_0to15
+    dup         v31.8h,  w4             // bipred_weight
+    movi        v30.8h,  #0xc0, lsl #8
+    ld1         {v29.8h},  [x11] //h->mb.i_mb_x,h->mb.i_mb_y
+    movi        v28.4s,  #4
+    movi        v27.8h,  #31
+    movi        v26.8h,  #32
+    dup         v24.8h,  w5             // mb_y
+    zip1        v29.8h,  v29.8h, v24.8h
+8:
+    subs        w6,  w6,  #8
+    ld1         {v1.8h},  [x1], #16     // propagate_amount
+    ld1         {v2.8h},  [x2], #16     // lowres_cost
+    and         v2.16b, v2.16b, v30.16b
+    cmeq        v25.8h, v2.8h,  v30.8h
+    umull       v16.4s, v1.4h,  v31.4h
+    umull2      v17.4s, v1.8h,  v31.8h
+    rshrn       v16.4h, v16.4s, #6
+    rshrn2      v16.8h, v17.4s, #6
+    bsl         v25.16b, v16.16b, v1.16b // if( lists_used == 3 )
+    //          propagate_amount = (propagate_amount * bipred_weight + 32) >> 6
+    ld1         {v4.8h,v5.8h},  [x0],  #32
+    sshr        v6.8h,  v4.8h,  #5
+    sshr        v7.8h,  v5.8h,  #5
+    add         v6.8h,  v6.8h,  v29.8h
+    add         v29.8h, v29.8h, v28.8h
+    add         v7.8h,  v7.8h,  v29.8h
+    add         v29.8h, v29.8h, v28.8h
+    st1         {v6.8h,v7.8h},  [x3],  #32
+    and         v4.16b, v4.16b, v27.16b
+    and         v5.16b, v5.16b, v27.16b
+    uzp1        v6.8h,  v4.8h,  v5.8h   // x & 31
+    uzp2        v7.8h,  v4.8h,  v5.8h   // y & 31
+    sub         v4.8h,  v26.8h, v6.8h   // 32 - (x & 31)
+    sub         v5.8h,  v26.8h, v7.8h   // 32 - (y & 31)
+    mul         v19.8h, v6.8h,  v7.8h   // idx3weight = y*x;
+    mul         v18.8h, v4.8h,  v7.8h   // idx2weight = y*(32-x);
+    mul         v17.8h, v6.8h,  v5.8h   // idx1weight = (32-y)*x;
+    mul         v16.8h, v4.8h,  v5.8h   // idx0weight = (32-y)*(32-x) ;
+    umull       v6.4s,  v19.4h, v25.4h
+    umull2      v7.4s,  v19.8h, v25.8h
+    umull       v4.4s,  v18.4h, v25.4h
+    umull2      v5.4s,  v18.8h, v25.8h
+    umull       v2.4s,  v17.4h, v25.4h
+    umull2      v3.4s,  v17.8h, v25.8h
+    umull       v0.4s,  v16.4h, v25.4h
+    umull2      v1.4s,  v16.8h, v25.8h
+    rshrn       v19.4h, v6.4s,  #10
+    rshrn2      v19.8h, v7.4s,  #10
+    rshrn       v18.4h, v4.4s,  #10
+    rshrn2      v18.8h, v5.4s,  #10
+    rshrn       v17.4h, v2.4s,  #10
+    rshrn2      v17.8h, v3.4s,  #10
+    rshrn       v16.4h, v0.4s,  #10
+    rshrn2      v16.8h, v1.4s,  #10
+    zip1        v0.8h,  v16.8h, v17.8h
+    zip2        v1.8h,  v16.8h, v17.8h
+    zip1        v2.8h,  v18.8h, v19.8h
+    zip2        v3.8h,  v18.8h, v19.8h
+    st1         {v0.8h,v1.8h},  [x3], #32
+    st1         {v2.8h,v3.8h},  [x3], #32
+    b.ge        8b
+    ret
+endfunc
+
+function memcpy_aligned_neon, export=1
+    tst         x2,  #16
+    b.eq        32f
+    sub         x2,  x2,  #16
+    ldr         q0,  [x1], #16
+    str         q0,  [x0], #16
+32:
+    tst         x2,  #32
+    b.eq        640f
+    sub         x2,  x2,  #32
+    ldp         q0,  q1,  [x1], #32
+    stp         q0,  q1,  [x0], #32
+640:
+    cbz         x2,  1f
+64:
+    subs        x2,  x2,  #64
+    ldp         q0,  q1,  [x1, #32]
+    ldp         q2,  q3,  [x1], #64
+    stp         q0,  q1,  [x0, #32]
+    stp         q2,  q3,  [x0], #64
+    b.gt        64b
+1:
+    ret
+endfunc
+
+function memzero_aligned_neon, export=1
+    movi        v0.16b,  #0
+    movi        v1.16b,  #0
+1:
+    subs        x1,  x1,  #128
+    stp         q0,  q1,  [x0, #96]
+    stp         q0,  q1,  [x0, #64]
+    stp         q0,  q1,  [x0, #32]
+    stp         q0,  q1,  [x0], 128
+    b.gt        1b
+    ret
+endfunc
+
+// void mbtree_fix8_pack( int16_t *dst, float *src, int count )
+function mbtree_fix8_pack_neon, export=1
+    subs        w3,  w2,  #8
+    b.lt        2f
+1:
+    subs        w3,  w3,  #8
+    ld1         {v0.4s,v1.4s}, [x1], #32
+    fcvtzs      v0.4s,  v0.4s,  #8
+    fcvtzs      v1.4s,  v1.4s,  #8
+    sqxtn       v2.4h,  v0.4s
+    sqxtn2      v2.8h,  v1.4s
+    rev16       v3.16b, v2.16b
+    st1         {v3.8h},  [x0], #16
+    b.ge        1b
+2:
+    adds        w3,  w3,  #8
+    b.eq        4f
+3:
+    subs        w3,  w3,  #1
+    ldr         s0, [x1], #4
+    fcvtzs      w4,  s0,  #8
+    rev16       w5,  w4
+    strh        w5, [x0], #2
+    b.gt        3b
+4:
+    ret
+endfunc
+
+// void mbtree_fix8_unpack( float *dst, int16_t *src, int count )
+function mbtree_fix8_unpack_neon, export=1
+    subs        w3,  w2,  #8
+    b.lt        2f
+1:
+    subs        w3,  w3,  #8
+    ld1         {v0.8h}, [x1], #16
+    rev16       v1.16b, v0.16b
+    sxtl        v2.4s,  v1.4h
+    sxtl2       v3.4s,  v1.8h
+    scvtf       v4.4s,  v2.4s,  #8
+    scvtf       v5.4s,  v3.4s,  #8
+    st1         {v4.4s,v5.4s}, [x0], #32
+    b.ge        1b
+2:
+    adds        w3,  w3,  #8
+    b.eq        4f
+3:
+    subs        w3,  w3,  #1
+    ldrh        w4, [x1], #2
+    rev16       w5,  w4
+    sxth        w6,  w5
+    scvtf       s0,  w6,  #8
+    str         s0, [x0], #4
+    b.gt        3b
+4:
+    ret
+endfunc
+
+#if BIT_DEPTH == 8
+
 // void pixel_avg( uint8_t *dst,  intptr_t dst_stride,
 //                 uint8_t *src1, intptr_t src1_stride,
 //                 uint8_t *src2, intptr_t src2_stride, int weight );
@@ -1542,214 +1756,2047 @@ function integral_init8v_neon, export=1
     ret
 endfunc
 
-function mbtree_propagate_cost_neon, export=1
-    ld1r        {v5.4s},  [x5]
-8:
-    subs        w6,  w6,  #8
-    ld1         {v1.8h},  [x1], #16
-    ld1         {v2.8h},  [x2], #16
-    ld1         {v3.8h},  [x3], #16
-    ld1         {v4.8h},  [x4], #16
-    bic         v3.8h,  #0xc0, lsl #8
-    umin        v3.8h,  v2.8h,  v3.8h
-    umull       v20.4s, v2.4h,  v4.4h   // propagate_intra
-    umull2      v21.4s, v2.8h,  v4.8h   // propagate_intra
-    usubl       v22.4s, v2.4h,  v3.4h   // propagate_num
-    usubl2      v23.4s, v2.8h,  v3.8h   // propagate_num
-    uxtl        v26.4s, v2.4h           // propagate_denom
-    uxtl2       v27.4s, v2.8h           // propagate_denom
-    uxtl        v24.4s, v1.4h
-    uxtl2       v25.4s, v1.8h
-    ucvtf       v20.4s, v20.4s
-    ucvtf       v21.4s, v21.4s
-    ucvtf       v26.4s, v26.4s
-    ucvtf       v27.4s, v27.4s
-    ucvtf       v22.4s, v22.4s
-    ucvtf       v23.4s, v23.4s
-    frecpe      v28.4s, v26.4s
-    frecpe      v29.4s, v27.4s
-    ucvtf       v24.4s, v24.4s
-    ucvtf       v25.4s, v25.4s
-    frecps      v30.4s, v28.4s, v26.4s
-    frecps      v31.4s, v29.4s, v27.4s
-    fmla        v24.4s, v20.4s, v5.4s   // propagate_amount
-    fmla        v25.4s, v21.4s, v5.4s   // propagate_amount
-    fmul        v28.4s, v28.4s, v30.4s
-    fmul        v29.4s, v29.4s, v31.4s
-    fmul        v16.4s, v24.4s, v22.4s
-    fmul        v17.4s, v25.4s, v23.4s
-    fmul        v18.4s, v16.4s, v28.4s
-    fmul        v19.4s, v17.4s, v29.4s
-    fcvtns      v20.4s, v18.4s
-    fcvtns      v21.4s, v19.4s
-    sqxtn       v0.4h,  v20.4s
-    sqxtn2      v0.8h,  v21.4s
-    st1         {v0.8h},  [x0], #16
-    b.gt        8b
+#else // BIT_DEPTH == 8
+
+// void pixel_avg( pixel *dst,  intptr_t dst_stride,
+//                 pixel *src1, intptr_t src1_stride,
+//                 pixel *src2, intptr_t src2_stride, int weight );
+.macro AVGH w h
+function pixel_avg_\w\()x\h\()_neon, export=1
+    mov         w10, #64
+    cmp         w6, #32
+    mov         w9, #\h
+    b.eq        pixel_avg_w\w\()_neon
+    subs        w7, w10, w6
+    b.lt        pixel_avg_weight_w\w\()_add_sub_neon     // weight > 64
+    cmp         w6, #0
+    b.ge        pixel_avg_weight_w\w\()_add_add_neon
+    b           pixel_avg_weight_w\w\()_sub_add_neon     // weight < 0
+endfunc
+.endm
+
+AVGH  4, 2
+AVGH  4, 4
+AVGH  4, 8
+AVGH  4, 16
+AVGH  8, 4
+AVGH  8, 8
+AVGH  8, 16
+AVGH 16, 8
+AVGH 16, 16
+
+// 0 < weight < 64
+.macro load_weights_add_add
+    mov         w6, w6
+.endm
+.macro weight_add_add dst, s1, s2, h=
+.ifc \h, 2
+    umull2      \dst, \s1, v30.8h
+    umlal2      \dst, \s2, v31.8h
+.else
+    umull       \dst, \s1, v30.4h
+    umlal       \dst, \s2, v31.4h
+.endif
+.endm
+
+// weight > 64
+.macro load_weights_add_sub
+    neg         w7, w7
+.endm
+.macro weight_add_sub dst, s1, s2, h=
+.ifc \h, 2
+    umull2      \dst, \s1, v30.8h
+    umlsl2      \dst, \s2, v31.8h
+.else
+    umull       \dst, \s1, v30.4h
+    umlsl       \dst, \s2, v31.4h
+.endif
+.endm
+
+// weight < 0
+.macro load_weights_sub_add
+    neg         w6, w6
+.endm
+.macro weight_sub_add dst, s1, s2, h=
+.ifc \h, 2
+    umull2      \dst, \s2, v31.8h
+    umlsl2      \dst, \s1, v30.8h
+.else
+    umull       \dst, \s2, v31.4h
+    umlsl       \dst, \s1, v30.4h
+.endif
+.endm
+
+.macro AVG_WEIGHT ext
+function pixel_avg_weight_w4_\ext\()_neon
+    load_weights_\ext
+    dup         v30.8h, w6
+    dup         v31.8h, w7
+    lsl         x3, x3, #1
+    lsl         x5, x5, #1
+    lsl         x1, x1, #1
+1:  // height loop
+    subs        w9, w9, #2
+    ld1         {v0.d}[0], [x2], x3
+    ld1         {v1.d}[0], [x4], x5
+    weight_\ext v4.4s, v0.4h, v1.4h
+    ld1         {v2.d}[0], [x2], x3
+    ld1         {v3.d}[0], [x4], x5
+
+    mvni        v28.8h, #0xfc, lsl #8
+
+    sqrshrun    v4.4h, v4.4s, #6
+    weight_\ext v5.4s, v2.4h, v3.4h
+    smin        v4.4h, v4.4h, v28.4h
+    sqrshrun    v5.4h, v5.4s, #6
+
+    st1         {v4.d}[0], [x0], x1
+
+    smin        v5.4h, v5.4h, v28.4h
+
+    st1         {v5.d}[0], [x0], x1
+
+    b.gt        1b
+    ret
+endfunc
+
+function pixel_avg_weight_w8_\ext\()_neon
+    load_weights_\ext
+    dup         v30.8h, w6
+    dup         v31.8h, w7
+    lsl         x1, x1, #1
+    lsl         x3, x3, #1
+    lsl         x5, x5, #1
+1:  // height loop
+    subs        w9, w9, #4
+    ld1         {v0.8h}, [x2], x3
+    ld1         {v1.8h}, [x4], x5
+    weight_\ext v16.4s, v0.4h, v1.4h
+    weight_\ext v17.4s, v0.8h, v1.8h, 2
+    ld1         {v2.8h}, [x2], x3
+    ld1         {v3.8h}, [x4], x5
+    weight_\ext v18.4s, v2.4h, v3.4h
+    weight_\ext v19.4s, v2.8h, v3.8h, 2
+    ld1         {v4.8h}, [x2], x3
+    ld1         {v5.8h}, [x4], x5
+    weight_\ext v20.4s, v4.4h, v5.4h
+    weight_\ext v21.4s, v4.8h, v5.8h, 2
+    ld1         {v6.8h}, [x2], x3
+    ld1         {v7.8h}, [x4], x5
+    weight_\ext v22.4s, v6.4h, v7.4h
+    weight_\ext v23.4s, v6.8h, v7.8h, 2
+
+    mvni        v28.8h, #0xfc, lsl #8
+
+    sqrshrun    v0.4h, v16.4s, #6
+    sqrshrun    v2.4h, v18.4s, #6
+    sqrshrun    v4.4h, v20.4s, #6
+    sqrshrun2   v0.8h, v17.4s, #6
+    sqrshrun    v6.4h, v22.4s, #6
+    sqrshrun2   v2.8h, v19.4s, #6
+    sqrshrun2   v4.8h, v21.4s, #6
+    smin        v0.8h, v0.8h, v28.8h
+    smin        v2.8h, v2.8h, v28.8h
+    sqrshrun2   v6.8h, v23.4s, #6
+    smin        v4.8h, v4.8h, v28.8h
+    smin        v6.8h, v6.8h, v28.8h
+
+    st1        {v0.8h}, [x0], x1
+    st1        {v2.8h}, [x0], x1
+    st1        {v4.8h}, [x0], x1
+    st1        {v6.8h}, [x0], x1
+    b.gt        1b
+    ret
+endfunc
+
+function pixel_avg_weight_w16_\ext\()_neon
+    load_weights_\ext
+    dup         v30.8h, w6
+    dup         v31.8h, w7
+    lsl         x1, x1, #1
+    lsl         x3, x3, #1
+    lsl         x5, x5, #1
+1:  // height loop
+    subs        w9, w9, #2
+
+    ld1         {v0.8h, v1.8h}, [x2], x3
+    ld1         {v2.8h, v3.8h}, [x4], x5
+    ld1         {v4.8h, v5.8h}, [x2], x3
+    ld1         {v6.8h, v7.8h}, [x4], x5
+
+    weight_\ext v16.4s, v0.4h, v2.4h
+    weight_\ext v17.4s, v0.8h, v2.8h, 2
+    weight_\ext v18.4s, v1.4h, v3.4h
+    weight_\ext v19.4s, v1.8h, v3.8h, 2
+    weight_\ext v20.4s, v4.4h, v6.4h
+    weight_\ext v21.4s, v4.8h, v6.8h, 2
+    weight_\ext v22.4s, v5.4h, v7.4h
+    weight_\ext v23.4s, v5.8h, v7.8h, 2
+
+    mvni        v28.8h, #0xfc, lsl #8
+
+    sqrshrun    v0.4h, v16.4s, #6
+    sqrshrun    v1.4h, v18.4s, #6
+    sqrshrun    v2.4h, v20.4s, #6
+    sqrshrun2   v0.8h, v17.4s, #6
+    sqrshrun2   v1.8h, v19.4s, #6
+    sqrshrun2   v2.8h, v21.4s, #6
+    smin        v0.8h, v0.8h, v28.8h
+    smin        v1.8h, v1.8h, v28.8h
+    sqrshrun    v3.4h, v22.4s, #6
+    smin        v2.8h, v2.8h, v28.8h
+    sqrshrun2   v3.8h, v23.4s, #6
+    smin        v3.8h, v3.8h, v28.8h
+
+    st1        {v0.8h, v1.8h}, [x0], x1
+    st1        {v2.8h, v3.8h}, [x0], x1
+    b.gt        1b
+    ret
+endfunc
+.endm
+
+AVG_WEIGHT add_add
+AVG_WEIGHT add_sub
+AVG_WEIGHT sub_add
+
+function pixel_avg_w4_neon
+    lsl         x1, x1, #1
+    lsl         x3, x3, #1
+    lsl         x5, x5, #1
+
+1:  subs        w9, w9, #2
+    ld1         {v0.d}[0], [x2], x3
+    ld1         {v2.d}[0], [x4], x5
+    ld1         {v0.d}[1], [x2], x3
+    ld1         {v2.d}[1], [x4], x5
+    urhadd      v0.8h, v0.8h, v2.8h
+    st1         {v0.d}[0], [x0], x1
+    st1         {v0.d}[1], [x0], x1
+    b.gt        1b
+    ret
+endfunc
+
+function pixel_avg_w8_neon
+    lsl         x1, x1, #1
+    lsl         x3, x3, #1
+    lsl         x5, x5, #1
+1:  subs        w9, w9, #4
+    ld1         {v0.8h}, [x2], x3
+    ld1         {v1.8h}, [x4], x5
+    ld1         {v2.8h}, [x2], x3
+    urhadd      v0.8h, v0.8h, v1.8h
+    ld1         {v3.8h}, [x4], x5
+    st1         {v0.8h}, [x0], x1
+    ld1         {v4.8h}, [x2], x3
+    urhadd      v1.8h, v2.8h, v3.8h
+    ld1         {v5.8h}, [x4], x5
+    st1         {v1.8h}, [x0], x1
+    ld1         {v6.8h}, [x2], x3
+    ld1         {v7.8h}, [x4], x5
+    urhadd      v2.8h, v4.8h, v5.8h
+    urhadd      v3.8h, v6.8h, v7.8h
+    st1         {v2.8h}, [x0], x1
+    st1         {v3.8h}, [x0], x1
+    b.gt        1b
+    ret
+endfunc
+
+function pixel_avg_w16_neon
+    lsl        x1, x1, #1
+    lsl        x3, x3, #1
+    lsl        x5, x5, #1
+
+1:  subs       w9, w9, #4
+
+    ld1        {v0.8h, v1.8h}, [x2], x3
+    ld1        {v2.8h, v3.8h}, [x4], x5
+    ld1        {v4.8h, v5.8h}, [x2], x3
+    urhadd     v0.8h, v0.8h, v2.8h
+    urhadd     v1.8h, v1.8h, v3.8h
+    ld1        {v6.8h, v7.8h}, [x4], x5
+    ld1        {v20.8h, v21.8h}, [x2], x3
+    st1        {v0.8h, v1.8h}, [x0], x1
+    urhadd     v4.8h, v4.8h, v6.8h
+    urhadd     v5.8h, v5.8h, v7.8h
+    ld1        {v22.8h, v23.8h}, [x4], x5
+    ld1        {v24.8h, v25.8h}, [x2], x3
+    st1        {v4.8h, v5.8h}, [x0], x1
+    ld1        {v26.8h, v27.8h}, [x4], x5
+    urhadd     v20.8h, v20.8h, v22.8h
+    urhadd     v21.8h, v21.8h, v23.8h
+    urhadd     v24.8h, v24.8h, v26.8h
+    urhadd     v25.8h, v25.8h, v27.8h
+    st1        {v20.8h, v21.8h}, [x0], x1
+    st1        {v24.8h, v25.8h}, [x0], x1
+
+    b.gt        1b
+    ret
+endfunc
+
+function pixel_avg2_w4_neon, export=1
+    lsl         x1, x1, #1
+    lsl         x3, x3, #1
+1:
+    subs        w5, w5, #2
+    ld1         {v0.4h}, [x2], x3
+    ld1         {v2.4h}, [x4], x3
+    ld1         {v1.4h}, [x2], x3
+    ld1         {v3.4h}, [x4], x3
+    urhadd      v0.4h, v0.4h, v2.4h
+    urhadd      v1.4h, v1.4h, v3.4h
+
+    st1         {v0.4h}, [x0], x1
+    st1         {v1.4h}, [x0], x1
+    b.gt        1b
+    ret
+endfunc
+
+function pixel_avg2_w8_neon, export=1
+    lsl         x1, x1, #1
+    lsl         x3, x3, #1
+1:
+    subs        w5, w5, #2
+    ld1         {v0.8h}, [x2], x3
+    ld1         {v2.8h}, [x4], x3
+    ld1         {v1.8h}, [x2], x3
+    ld1         {v3.8h}, [x4], x3
+    urhadd      v0.8h, v0.8h, v2.8h
+    urhadd      v1.8h, v1.8h, v3.8h
+
+    st1         {v0.8h}, [x0], x1
+    st1         {v1.8h}, [x0], x1
+    b.gt        1b
+    ret
+endfunc
+
+function pixel_avg2_w16_neon, export=1
+    lsl         x1, x1, #1
+    lsl         x3, x3, #1
+1:
+    subs        w5, w5, #2
+    ld1         {v0.8h, v1.8h}, [x2], x3
+    ld1         {v2.8h, v3.8h}, [x4], x3
+    ld1         {v4.8h, v5.8h}, [x2], x3
+    ld1         {v6.8h, v7.8h}, [x4], x3
+    urhadd      v0.8h, v0.8h, v2.8h
+    urhadd      v1.8h, v1.8h, v3.8h
+    urhadd      v4.8h, v4.8h, v6.8h
+    urhadd      v5.8h, v5.8h, v7.8h
+
+    st1         {v0.8h, v1.8h}, [x0], x1
+    st1         {v4.8h, v5.8h}, [x0], x1
+    b.gt        1b
+    ret
+endfunc
+
+function pixel_avg2_w20_neon, export=1
+    lsl         x1, x1, #1
+    lsl         x3, x3, #1
+    sub         x1, x1, #32
+1:
+    subs        w5, w5, #2
+
+    ld1         {v0.8h, v1.8h, v2.8h}, [x2], x3
+    ld1         {v3.8h, v4.8h, v5.8h}, [x4], x3
+    ld1         {v20.8h, v21.8h, v22.8h}, [x2], x3
+    ld1         {v23.8h, v24.8h, v25.8h}, [x4], x3
+
+    urhadd      v0.8h, v0.8h, v3.8h
+    urhadd      v1.8h, v1.8h, v4.8h
+    urhadd      v2.4h, v2.4h, v5.4h
+    urhadd      v20.8h, v20.8h, v23.8h
+    urhadd      v21.8h, v21.8h, v24.8h
+    urhadd      v22.4h, v22.4h, v25.4h
+
+    st1         {v0.8h, v1.8h}, [x0], #32
+    st1         {v2.4h}, [x0], x1
+    st1         {v20.8h, v21.8h}, [x0], #32
+    st1         {v22.4h}, [x0], x1
+    b.gt        1b
+    ret
+endfunc
+
+// void mc_copy( pixel *dst, intptr_t dst_stride, pixel *src, intptr_t src_stride, int height )
+function mc_copy_w4_neon, export=1
+    lsl         x1, x1, #1
+    lsl         x3, x3, #1
+1:
+    subs        w4, w4, #4
+    ld1         {v0.d}[0], [x2], x3
+    ld1         {v1.d}[0], [x2], x3
+    ld1         {v2.d}[0], [x2], x3
+    ld1         {v3.d}[0], [x2], x3
+    st1         {v0.d}[0], [x0], x1
+    st1         {v1.d}[0], [x0], x1
+    st1         {v2.d}[0], [x0], x1
+    st1         {v3.d}[0], [x0], x1
+    b.gt        1b
+    ret
+endfunc
+
+function mc_copy_w8_neon, export=1
+    lsl         x1, x1, #1
+    lsl         x3, x3, #1
+1:  subs        w4, w4, #4
+    ld1         {v0.8h}, [x2], x3
+    ld1         {v1.8h}, [x2], x3
+    ld1         {v2.8h}, [x2], x3
+    ld1         {v3.8h}, [x2], x3
+    st1         {v0.8h}, [x0], x1
+    st1         {v1.8h}, [x0], x1
+    st1         {v2.8h}, [x0], x1
+    st1         {v3.8h}, [x0], x1
+    b.gt        1b
+    ret
+endfunc
+
+function mc_copy_w16_neon, export=1
+    lsl         x1, x1, #1
+    lsl         x3, x3, #1
+1:  subs        w4, w4, #4
+    ld1         {v0.8h, v1.8h}, [x2], x3
+    ld1         {v2.8h, v3.8h}, [x2], x3
+    ld1         {v4.8h, v5.8h}, [x2], x3
+    ld1         {v6.8h, v7.8h}, [x2], x3
+    st1         {v0.8h, v1.8h}, [x0], x1
+    st1         {v2.8h, v3.8h}, [x0], x1
+    st1         {v4.8h, v5.8h}, [x0], x1
+    st1         {v6.8h, v7.8h}, [x0], x1
+    b.gt        1b
+    ret
+endfunc
+
+.macro weight_prologue type
+    mov         w9, w5                  // height
+.ifc \type, full
+    ldr         w12, [x4, #32]          // denom
+.endif
+    ldp         w4, w5, [x4, #32+4]     // scale, offset
+    dup         v0.8h, w4
+    lsl         w5, w5, #2
+    dup         v1.4s, w5
+.ifc \type, full
+    neg         w12, w12
+    dup         v2.4s, w12
+.endif
+.endm
+
+// void mc_weight( pixel *src, intptr_t src_stride, pixel *dst,
+//                 intptr_t dst_stride, const x264_weight_t *weight, int h )
+function mc_weight_w20_neon, export=1
+    weight_prologue full
+    lsl         x3, x3, #1
+    lsl         x1, x1, #1
+    sub         x1, x1, #32
+1:
+    subs        w9, w9, #2
+    ld1         {v16.8h, v17.8h, v18.8h}, [x2], x3
+    ld1         {v19.8h, v20.8h, v21.8h}, [x2], x3
+
+    umull       v22.4s, v16.4h, v0.4h
+    umull2      v23.4s, v16.8h, v0.8h
+    umull       v24.4s, v17.4h, v0.4h
+    umull2      v25.4s, v17.8h, v0.8h
+    umull       v26.4s, v18.4h, v0.4h
+    umull       v27.4s, v21.4h, v0.4h
+
+    srshl       v22.4s, v22.4s, v2.4s
+    srshl       v23.4s, v23.4s, v2.4s
+    srshl       v24.4s, v24.4s, v2.4s
+    srshl       v25.4s, v25.4s, v2.4s
+    srshl       v26.4s, v26.4s, v2.4s
+    srshl       v27.4s, v27.4s, v2.4s
+    add         v22.4s, v22.4s, v1.4s
+    add         v23.4s, v23.4s, v1.4s
+    add         v24.4s, v24.4s, v1.4s
+    add         v25.4s, v25.4s, v1.4s
+    add         v26.4s, v26.4s, v1.4s
+    add         v27.4s, v27.4s, v1.4s
+
+    sqxtun       v22.4h, v22.4s
+    sqxtun2      v22.8h, v23.4s
+    sqxtun       v23.4h, v24.4s
+    sqxtun2      v23.8h, v25.4s
+    sqxtun       v24.4h, v26.4s
+    sqxtun2      v24.8h, v27.4s
+
+    umull       v16.4s, v19.4h, v0.4h
+    umull2      v17.4s, v19.8h, v0.8h
+    umull       v18.4s, v20.4h, v0.4h
+    umull2      v19.4s, v20.8h, v0.8h
+
+    srshl       v16.4s, v16.4s, v2.4s
+    srshl       v17.4s, v17.4s, v2.4s
+    srshl       v18.4s, v18.4s, v2.4s
+    srshl       v19.4s, v19.4s, v2.4s
+    add         v16.4s, v16.4s, v1.4s
+    add         v17.4s, v17.4s, v1.4s
+    add         v18.4s, v18.4s, v1.4s
+    add         v19.4s, v19.4s, v1.4s
+
+    sqxtun       v16.4h, v16.4s
+    sqxtun2      v16.8h, v17.4s
+    sqxtun       v17.4h, v18.4s
+    sqxtun2      v17.8h, v19.4s
+
+    mvni        v31.8h, #0xfc, lsl #8
+
+    umin        v22.8h, v22.8h, v31.8h
+    umin        v23.8h, v23.8h, v31.8h
+    umin        v24.8h, v24.8h, v31.8h
+    umin        v16.8h, v16.8h, v31.8h
+    umin        v17.8h, v17.8h, v31.8h
+
+    st1         {v22.8h, v23.8h}, [x0], #32
+    st1         {v24.d}[0], [x0], x1
+    st1         {v16.8h, v17.8h}, [x0], #32
+    st1         {v24.d}[1], [x0], x1
+
+    b.gt        1b
+    ret
+endfunc
+
+function mc_weight_w16_neon, export=1
+    weight_prologue full
+    lsl         x1, x1, #1
+    lsl         x3, x3, #1
+1:
+    subs        w9, w9, #2
+    ld1         {v4.8h, v5.8h}, [x2], x3
+    ld1         {v6.8h, v7.8h}, [x2], x3
+
+    umull       v22.4s, v4.4h, v0.4h
+    umull2      v23.4s, v4.8h, v0.8h
+    umull       v24.4s, v5.4h, v0.4h
+    umull2      v25.4s, v5.8h, v0.8h
+
+    srshl       v22.4s, v22.4s, v2.4s
+    srshl       v23.4s, v23.4s, v2.4s
+    srshl       v24.4s, v24.4s, v2.4s
+    srshl       v25.4s, v25.4s, v2.4s
+
+    add         v22.4s, v22.4s, v1.4s
+    add         v23.4s, v23.4s, v1.4s
+    add         v24.4s, v24.4s, v1.4s
+    add         v25.4s, v25.4s, v1.4s
+
+    sqxtun       v22.4h, v22.4s
+    sqxtun2      v22.8h, v23.4s
+    sqxtun       v23.4h, v24.4s
+    sqxtun2      v23.8h, v25.4s
+
+    umull       v26.4s, v6.4h, v0.4h
+    umull2      v27.4s, v6.8h, v0.8h
+    umull       v28.4s, v7.4h, v0.4h
+    umull2      v29.4s, v7.8h, v0.8h
+
+    srshl       v26.4s, v26.4s, v2.4s
+    srshl       v27.4s, v27.4s, v2.4s
+    srshl       v28.4s, v28.4s, v2.4s
+    srshl       v29.4s, v29.4s, v2.4s
+
+    add         v26.4s, v26.4s, v1.4s
+    add         v27.4s, v27.4s, v1.4s
+    add         v28.4s, v28.4s, v1.4s
+    add         v29.4s, v29.4s, v1.4s
+
+    sqxtun       v26.4h, v26.4s
+    sqxtun2      v26.8h, v27.4s
+    sqxtun       v27.4h, v28.4s
+    sqxtun2      v27.8h, v29.4s
+
+    mvni        v31.8h, 0xfc, lsl #8
+
+    umin        v22.8h, v22.8h, v31.8h
+    umin        v23.8h, v23.8h, v31.8h
+    umin        v26.8h, v26.8h, v31.8h
+    umin        v27.8h, v27.8h, v31.8h
+
+    st1         {v22.8h, v23.8h}, [x0], x1
+    st1         {v26.8h, v27.8h}, [x0], x1
+
+    b.gt        1b
+    ret
+endfunc
+
+function mc_weight_w8_neon, export=1
+    weight_prologue full
+    lsl         x3, x3, #1
+    lsl         x1, x1, #1
+1:
+    subs        w9, w9, #2
+    ld1         {v16.8h}, [x2], x3
+    ld1         {v17.8h}, [x2], x3
+
+    umull       v4.4s, v16.4h, v0.4h
+    umull2      v5.4s, v16.8h, v0.8h
+    umull       v6.4s, v17.4h, v0.4h
+    umull2      v7.4s, v17.8h, v0.8h
+
+    srshl       v4.4s, v4.4s, v2.4s
+    srshl       v5.4s, v5.4s, v2.4s
+    srshl       v6.4s, v6.4s, v2.4s
+    srshl       v7.4s, v7.4s, v2.4s
+
+    add         v4.4s, v4.4s, v1.4s
+    add         v5.4s, v5.4s, v1.4s
+    add         v6.4s, v6.4s, v1.4s
+    add         v7.4s, v7.4s, v1.4s
+
+    sqxtun       v16.4h, v4.4s
+    sqxtun2      v16.8h, v5.4s
+    sqxtun       v17.4h, v6.4s
+    sqxtun2      v17.8h, v7.4s
+
+    mvni        v28.8h, #0xfc, lsl #8
+
+    umin        v16.8h, v16.8h, v28.8h
+    umin        v17.8h, v17.8h, v28.8h
+
+    st1         {v16.8h}, [x0], x1
+    st1         {v17.8h}, [x0], x1
+    b.gt        1b
+    ret
+endfunc
+
+function mc_weight_w4_neon, export=1
+    weight_prologue full
+    lsl         x3, x3, #1
+    lsl         x1, x1, #1
+1:
+    subs        w9, w9, #2
+    ld1         {v16.d}[0], [x2], x3
+    ld1         {v16.d}[1], [x2], x3
+    umull       v4.4s, v16.4h, v0.4h
+    umull2      v5.4s, v16.8h, v0.8h
+    srshl       v4.4s, v4.4s, v2.4s
+    srshl       v5.4s, v5.4s, v2.4s
+    add         v4.4s, v4.4s, v1.4s
+    add         v5.4s, v5.4s, v1.4s
+
+    sqxtun       v16.4h, v4.4s
+    sqxtun2      v16.8h, v5.4s
+
+    mvni        v28.8h, #0xfc, lsl #8
+
+    umin        v16.8h, v16.8h, v28.8h
+
+    st1         {v16.d}[0], [x0], x1
+    st1         {v16.d}[1], [x0], x1
+    b.gt        1b
+    ret
+endfunc
+
+function mc_weight_w20_nodenom_neon, export=1
+    weight_prologue nodenom
+    lsl         x3, x3, #1
+    lsl         x1, x1, #1
+    sub         x1, x1, #32
+1:
+    subs        w9, w9, #2
+    ld1         {v16.8h, v17.8h, v18.8h}, [x2], x3
+    mov         v20.16b, v1.16b
+    mov         v21.16b, v1.16b
+    mov         v22.16b, v1.16b
+    mov         v23.16b, v1.16b
+    mov         v24.16b, v1.16b
+    mov         v25.16b, v1.16b
+    ld1         {v2.8h, v3.8h, v4.8h}, [x2], x3
+    mov         v26.16b, v1.16b
+    mov         v27.16b, v1.16b
+    mov         v28.16b, v1.16b
+    mov         v29.16b, v1.16b
+
+    umlal       v20.4s, v16.4h, v0.4h
+    umlal2      v21.4s, v16.8h, v0.8h
+    umlal       v22.4s, v17.4h, v0.4h
+    umlal2      v23.4s, v17.8h, v0.8h
+    umlal       v24.4s, v18.4h, v0.4h
+    umlal       v25.4s, v4.4h, v0.4h
+    umlal       v26.4s, v2.4h, v0.4h
+    umlal2      v27.4s, v2.8h, v0.8h
+    umlal       v28.4s, v3.4h, v0.4h
+    umlal2      v29.4s, v3.8h, v0.8h
+
+    sqxtun       v2.4h, v20.4s
+    sqxtun2      v2.8h, v21.4s
+    sqxtun       v3.4h, v22.4s
+    sqxtun2      v3.8h, v23.4s
+    sqxtun       v4.4h, v24.4s
+    sqxtun2      v4.8h, v25.4s
+    sqxtun       v5.4h, v26.4s
+    sqxtun2      v5.8h, v27.4s
+    sqxtun       v6.4h, v28.4s
+    sqxtun2      v6.8h, v29.4s
+
+    mvni        v31.8h, 0xfc, lsl #8
+
+    umin        v2.8h, v2.8h, v31.8h
+    umin        v3.8h, v3.8h, v31.8h
+    umin        v4.8h, v4.8h, v31.8h
+    umin        v5.8h, v5.8h, v31.8h
+    umin        v6.8h, v6.8h, v31.8h
+
+    st1         {v2.8h, v3.8h}, [x0], #32
+    st1         {v4.d}[0], [x0], x1
+    st1         {v5.8h, v6.8h}, [x0], #32
+    st1         {v4.d}[1], [x0], x1
+
+    b.gt        1b
+    ret
+endfunc
+
+function mc_weight_w16_nodenom_neon, export=1
+    weight_prologue nodenom
+    lsl         x1, x1, #1
+    lsl         x3, x3, #1
+1:
+    subs        w9, w9, #2
+    ld1         {v2.8h, v3.8h}, [x2], x3
+    mov         v27.16b, v1.16b
+    mov         v28.16b, v1.16b
+    mov         v29.16b, v1.16b
+    mov         v30.16b, v1.16b
+    ld1         {v4.8h, v5.8h}, [x2], x3
+    mov         v20.16b, v1.16b
+    mov         v21.16b, v1.16b
+    mov         v22.16b, v1.16b
+    mov         v23.16b, v1.16b
+
+    umlal       v27.4s, v2.4h, v0.4h
+    umlal2      v28.4s, v2.8h, v0.8h
+    umlal       v29.4s, v3.4h, v0.4h
+    umlal2      v30.4s, v3.8h, v0.8h
+
+    umlal       v20.4s, v4.4h, v0.4h
+    umlal2      v21.4s, v4.8h, v0.8h
+    umlal       v22.4s, v5.4h, v0.4h
+    umlal2      v23.4s, v5.8h, v0.8h
+
+    sqxtun       v2.4h, v27.4s
+    sqxtun2      v2.8h, v28.4s
+    sqxtun       v3.4h, v29.4s
+    sqxtun2      v3.8h, v30.4s
+
+    sqxtun       v4.4h, v20.4s
+    sqxtun2      v4.8h, v21.4s
+    sqxtun       v5.4h, v22.4s
+    sqxtun2      v5.8h, v23.4s
+
+    mvni        v31.8h, 0xfc, lsl #8
+
+    umin        v2.8h, v2.8h, v31.8h
+    umin        v3.8h, v3.8h, v31.8h
+    umin        v4.8h, v4.8h, v31.8h
+    umin        v5.8h, v5.8h, v31.8h
+
+    st1         {v2.8h, v3.8h}, [x0], x1
+    st1         {v4.8h, v5.8h}, [x0], x1
+    b.gt        1b
+    ret
+endfunc
+
+function mc_weight_w8_nodenom_neon, export=1
+    weight_prologue nodenom
+    lsl         x1, x1, #1
+    lsl         x3, x3, #1
+1:
+    subs        w9, w9, #2
+    ld1         {v16.8h}, [x2], x3
+    mov         v27.16b, v1.16b
+    ld1         {v17.8h}, [x2], x3
+    mov         v28.16b, v1.16b
+    mov         v29.16b, v1.16b
+    mov         v30.16b, v1.16b
+
+    umlal       v27.4s, v16.4h, v0.4h
+    umlal2      v28.4s, v16.8h, v0.8h
+    umlal       v29.4s, v17.4h, v0.4h
+    umlal2      v30.4s, v17.8h, v0.8h
+
+    sqxtun       v4.4h, v27.4s
+    sqxtun2      v4.8h, v28.4s
+    sqxtun       v5.4h, v29.4s
+    sqxtun2      v5.8h, v30.4s
+
+    mvni        v31.8h, 0xfc, lsl #8
+
+    umin        v4.8h, v4.8h, v31.8h
+    umin        v5.8h, v5.8h, v31.8h
+
+    st1         {v4.8h}, [x0], x1
+    st1         {v5.8h}, [x0], x1
+    b.gt        1b
+    ret
+endfunc
+
+function mc_weight_w4_nodenom_neon, export=1
+    weight_prologue nodenom
+    lsl         x1, x1, #1
+    lsl         x3, x3, #1
+1:
+    subs        w9, w9, #2
+    ld1         {v16.d}[0], [x2], x3
+    ld1         {v16.d}[1], [x2], x3
+    mov         v27.16b, v1.16b
+    mov         v28.16b, v1.16b
+    umlal       v27.4s, v16.4h, v0.4h
+    umlal2      v28.4s, v16.8h, v0.8h
+
+    sqxtun       v4.4h, v27.4s
+    sqxtun2      v4.8h, v28.4s
+
+    mvni        v31.8h, 0xfc, lsl #8
+
+    umin        v4.8h, v4.8h, v31.8h
+
+    st1         {v4.d}[0], [x0], x1
+    st1         {v4.d}[1], [x0], x1
+    b.gt        1b
+    ret
+endfunc
+
+.macro weight_simple_prologue
+    ldr         w6, [x4]               // offset
+    lsl         w6, w6, #2
+    dup         v1.8h, w6
+.endm
+
+.macro weight_simple name op
+function mc_weight_w20_\name\()_neon, export=1
+    weight_simple_prologue
+    lsl         x1, x1, #1
+    lsl         x3, x3, #1
+    sub         x1, x1, #32
+1:
+    subs        w5, w5, #2
+    ld1         {v2.8h, v3.8h, v4.8h}, [x2], x3
+    ld1         {v5.8h, v6.8h, v7.8h}, [x2], x3
+
+    zip1        v4.2d, v4.2d, v7.2d
+
+    \op         v2.8h, v2.8h, v1.8h
+    \op         v3.8h, v3.8h, v1.8h
+    \op         v4.8h, v4.8h, v1.8h
+    \op         v5.8h, v5.8h, v1.8h
+    \op         v6.8h, v6.8h, v1.8h
+
+    mvni        v31.8h, #0xfc, lsl #8
+
+    umin        v2.8h, v2.8h, v28.8h
+    umin        v3.8h, v3.8h, v28.8h
+    umin        v4.8h, v4.8h, v28.8h
+    umin        v5.8h, v5.8h, v28.8h
+    umin        v6.8h, v6.8h, v28.8h
+
+    st1         {v2.8h, v3.8h}, [x0], #32
+    st1         {v4.d}[0], [x0], x1
+    st1         {v5.8h, v6.8h}, [x0], #32
+    st1         {v4.d}[1], [x0], x1
+
+    b.gt        1b
+    ret
+endfunc
+
+function mc_weight_w16_\name\()_neon, export=1
+    weight_simple_prologue
+    lsl         x1, x1, #1
+    lsl         x3, x3, #1
+1:
+    subs        w5, w5, #2
+    ld1         {v16.8h, v17.8h}, [x2], x3
+    ld1         {v18.8h, v19.8h}, [x2], x3
+
+    \op         v16.8h, v16.8h, v1.8h
+    \op         v17.8h, v17.8h, v1.8h
+    \op         v18.8h, v18.8h, v1.8h
+    \op         v19.8h, v19.8h, v1.8h
+
+    mvni        v28.8h, #0xfc, lsl #8
+
+    umin        v16.8h, v16.8h, v28.8h
+    umin        v17.8h, v17.8h, v28.8h
+    umin        v18.8h, v18.8h, v28.8h
+    umin        v19.8h, v19.8h, v28.8h
+
+    st1         {v16.8h, v17.8h}, [x0], x1
+    st1         {v18.8h, v19.8h}, [x0], x1
+    b.gt        1b
+    ret
+endfunc
+
+function mc_weight_w8_\name\()_neon, export=1
+    weight_simple_prologue
+    lsl         x1, x1, #1
+    lsl         x3, x3, #1
+1:
+    subs        w5, w5, #2
+    ld1         {v16.8h}, [x2], x3
+    ld1         {v17.8h}, [x2], x3
+    \op         v16.8h, v16.8h, v1.8h
+    \op         v17.8h, v17.8h, v1.8h
+
+    mvni        v28.8h, 0xfc, lsl #8
+
+    umin        v16.8h, v16.8h, v28.8h
+    umin        v17.8h, v17.8h, v28.8h
+
+    st1         {v16.8h}, [x0], x1
+    st1         {v17.8h}, [x0], x1
+    b.gt        1b
+    ret
+endfunc
+
+function mc_weight_w4_\name\()_neon, export=1
+    weight_simple_prologue
+    lsl         x1, x1, #1
+    lsl         x3, x3, #1
+1:
+    subs        w5, w5, #2
+    ld1         {v16.d}[0], [x2], x3
+    ld1         {v16.d}[1], [x2], x3
+    \op         v16.8h, v16.8h, v1.8h
+    mvni        v28.8h, 0xfc, lsl #8
+
+    umin        v16.8h, v16.8h, v28.8h
+
+    st1         {v16.d}[0], [x0], x1
+    st1         {v16.d}[1], [x0], x1
+    b.gt        1b
+    ret
+endfunc
+.endm
+
+weight_simple offsetadd, uqadd
+weight_simple offsetsub, uqsub
+
+// void mc_chroma( pixel *dst_u, pixel *dst_v,
+//                 intptr_t i_dst_stride,
+//                 pixel *src, intptr_t i_src_stride,
+//                 int dx, int dy, int i_width, int i_height );
+function mc_chroma_neon, export=1
+    ldr         w15, [sp]               // height
+    sbfx        x12, x6, #3, #29        // asr(3) and sign extend
+    sbfx        x11, x5, #3, #29        // asr(3) and sign extend
+    cmp         w7, #4
+    lsl         x4, x4, #1
+    mul         x12, x12, x4
+    add         x3, x3, x11, lsl #2
+
+    and         w5, w5, #7
+    and         w6, w6, #7
+
+    add         x3, x3, x12
+
+    b.gt        mc_chroma_w8_neon
+    b.eq        mc_chroma_w4_neon
+endfunc
+
+.macro CHROMA_MC_START r00, r01, r10, r11
+    mul         w12, w5, w6             // cD = d8x    *d8y
+    lsl         w13, w5, #3
+    add         w9, w12, #64
+    lsl         w14, w6, #3
+    tst         w12, w12
+    sub         w9, w9, w13
+    sub         w10, w13, w12           // cB = d8x    *(8-d8y);
+    sub         w11, w14, w12           // cC = (8-d8x)*d8y
+    sub         w9, w9, w14             // cA = (8-d8x)*(8-d8y);
+.endm
+
+.macro CHROMA_MC width, vsize
+function mc_chroma_w\width\()_neon
+    lsl         x2, x2, #1
+// since the element size varies, there's a different index for the 2nd store
+.if \width == 4
+    .set idx2, 1
+.else
+    .set idx2, 2
+.endif
+    CHROMA_MC_START
+    b.eq        2f
+
+    ld2         {v28.8h, v29.8h}, [x3], x4
+    dup         v0.8h, w9               // cA
+    dup         v1.8h, w10              // cB
+
+    ext         v6.16b, v28.16b, v28.16b, #2
+    ext         v7.16b, v29.16b, v29.16b, #2
+
+    ld2         {v30.8h, v31.8h}, [x3], x4
+    dup         v2.8h, w11              // cC
+    dup         v3.8h, w12              // cD
+
+    ext         v22.16b, v30.16b, v30.16b, #2
+    ext         v23.16b, v31.16b, v31.16b, #2
+
+    trn1        v0.2d, v0.2d, v1.2d
+    trn1        v2.2d, v2.2d, v3.2d
+
+    trn1        v4.2d, v28.2d, v6.2d
+    trn1        v5.2d, v29.2d, v7.2d
+    trn1        v20.2d, v30.2d, v22.2d
+    trn1        v21.2d, v31.2d, v23.2d
+1:  // height loop, interpolate xy
+    subs        w15, w15, #2
+
+    mul         v16.8h, v4.8h, v0.8h
+    mul         v17.8h, v5.8h, v0.8h
+    mla         v16.8h, v20.8h, v2.8h
+    mla         v17.8h, v21.8h, v2.8h
+
+    ld2         {v28.8h, v29.8h}, [x3], x4
+    transpose   v24.2d, v25.2d, v16.2d, v17.2d
+
+    ext         v6.16b, v28.16b, v28.16b, #2
+    ext         v7.16b, v29.16b, v29.16b, #2
+    trn1        v4.2d, v28.2d, v6.2d
+    trn1        v5.2d, v29.2d, v7.2d
+
+    add         v16.8h, v24.8h, v25.8h
+    urshr       v16.8h, v16.8h, #6
+
+    mul         v18.8h, v20.8h, v0.8h
+    mul         v19.8h, v21.8h, v0.8h
+    mla         v18.8h, v4.8h, v2.8h
+    mla         v19.8h, v5.8h, v2.8h
+
+    ld2         {v30.8h, v31.8h}, [x3], x4
+
+    transpose   v26.2d, v27.2d, v18.2d, v19.2d
+    add         v18.8h, v26.8h, v27.8h
+    urshr       v18.8h, v18.8h, #6
+
+    ext         v22.16b, v30.16b, v30.16b, #2
+    ext         v23.16b, v31.16b, v31.16b, #2
+    trn1        v20.2d, v30.2d, v22.2d
+    trn1        v21.2d, v31.2d, v23.2d
+
+    st1         {v16.\vsize}[0], [x0], x2
+    st1         {v16.\vsize}[idx2], [x1], x2
+    st1         {v18.\vsize}[0], [x0], x2
+    st1         {v18.\vsize}[idx2], [x1], x2
+    b.gt        1b
+
+    ret
+2:  // dx or dy are 0
+    tst         w11, w11
+    add         w10, w10, w11
+    dup         v0.8h, w9
+    dup         v1.8h, w10
+
+    b.eq        4f
+
+    ld1         {v4.8h}, [x3], x4
+    ld1         {v6.8h}, [x3], x4
+3:  // vertical interpolation loop
+    subs        w15, w15, #2
+
+    mul         v16.8h, v4.8h, v0.8h
+    mla         v16.8h, v6.8h, v1.8h
+    ld1         {v4.8h}, [x3], x4
+    mul         v17.8h, v6.8h, v0.8h
+    mla         v17.8h, v4.8h, v1.8h
+    ld1         {v6.8h}, [x3], x4
+
+    urshr       v16.8h, v16.8h, #6
+    urshr       v17.8h, v17.8h, #6
+
+    uzp1        v18.8h, v16.8h, v17.8h  // d16=uuuu|uuuu, d17=vvvv|vvvv
+    uzp2        v19.8h, v16.8h, v17.8h  // d16=uuuu|uuuu, d17=vvvv|vvvv
+
+    st1         {v18.\vsize}[0], [x0], x2
+    st1         {v18.\vsize}[idx2], [x0], x2
+    st1         {v19.\vsize}[0], [x1], x2
+    st1         {v19.\vsize}[idx2], [x1], x2
+    b.gt        3b
+
+    ret
+
+4:  // dy is 0
+    ld1         {v4.8h, v5.8h}, [x3], x4
+    ld1         {v6.8h, v7.8h}, [x3], x4
+
+    ext         v5.16b, v4.16b, v5.16b, #4
+    ext         v7.16b, v6.16b, v7.16b, #4
+5:  // horizontal interpolation loop
+    subs        w15, w15, #2
+
+    mul         v16.8h, v4.8h, v0.8h
+    mla         v16.8h, v5.8h, v1.8h
+    mul         v17.8h, v6.8h, v0.8h
+    mla         v17.8h, v7.8h, v1.8h
+
+    ld1         {v4.8h, v5.8h}, [x3], x4
+    ld1         {v6.8h, v7.8h}, [x3], x4
+
+    urshr       v16.8h, v16.8h, #6
+    urshr       v17.8h, v17.8h, #6
+
+    ext         v5.16b, v4.16b, v5.16b, #4
+    ext         v7.16b, v6.16b, v7.16b, #4
+    uzp1        v18.8h, v16.8h, v17.8h  // d16=uuuu|uuuu, d17=vvvv|vvvv
+    uzp2        v19.8h, v16.8h, v17.8h  // d16=uuuu|uuuu, d17=vvvv|vvvv
+
+    st1         {v18.\vsize}[0], [x0], x2
+    st1         {v18.\vsize}[idx2], [x0], x2
+    st1         {v19.\vsize}[0], [x1], x2
+    st1         {v19.\vsize}[idx2], [x1], x2
+    b.gt        5b
+
+    ret
+endfunc
+.endm
+
+    CHROMA_MC 2, s
+    CHROMA_MC 4, d
+
+function mc_chroma_w8_neon
+    lsl         x2, x2, #1
+    CHROMA_MC_START
+
+    b.eq        2f
+    sub         x4, x4, #32
+    ld2         {v4.8h, v5.8h}, [x3], #32
+    ld2         {v6.8h, v7.8h}, [x3], x4
+
+    ld2         {v20.8h, v21.8h}, [x3], #32
+    ld2         {v22.8h, v23.8h}, [x3], x4
+
+    dup         v0.8h, w9               // cA
+    dup         v1.8h, w10              // cB
+
+    ext         v24.16b, v4.16b, v6.16b, #2
+    ext         v26.16b, v6.16b, v4.16b, #2
+    ext         v28.16b, v20.16b, v22.16b, #2
+    ext         v30.16b, v22.16b, v20.16b, #2
+
+    ext         v25.16b, v5.16b, v7.16b, #2
+    ext         v27.16b, v7.16b, v5.16b, #2
+    ext         v29.16b, v21.16b, v23.16b, #2
+    ext         v31.16b, v23.16b, v21.16b, #2
+
+    dup         v2.8h, w11              // cC
+    dup         v3.8h, w12              // cD
+
+1:  // height loop, interpolate xy
+    subs        w15, w15, #2
+
+    mul         v16.8h, v4.8h, v0.8h
+    mul         v17.8h, v5.8h, v0.8h
+    mla         v16.8h, v24.8h, v1.8h
+    mla         v17.8h, v25.8h, v1.8h
+    mla         v16.8h, v20.8h, v2.8h
+    mla         v17.8h, v21.8h, v2.8h
+    mla         v16.8h, v28.8h, v3.8h
+    mla         v17.8h, v29.8h, v3.8h
+
+    urshr       v16.8h, v16.8h, #6
+    urshr       v17.8h, v17.8h, #6
+
+    st1         {v16.8h}, [x0], x2
+    st1         {v17.8h}, [x1], x2
+
+    ld2         {v4.8h, v5.8h}, [x3], #32
+    ld2         {v6.8h, v7.8h}, [x3], x4
+
+    mul         v16.8h, v20.8h, v0.8h
+    mul         v17.8h, v21.8h, v0.8h
+    ext         v24.16b, v4.16b, v6.16b, #2
+    ext         v26.16b, v6.16b, v4.16b, #2
+    mla         v16.8h, v28.8h, v1.8h
+    mla         v17.8h, v29.8h, v1.8h
+    ext         v25.16b, v5.16b, v7.16b, #2
+    ext         v27.16b, v7.16b, v5.16b, #2
+    mla         v16.8h, v4.8h, v2.8h
+    mla         v17.8h, v5.8h, v2.8h
+    mla         v16.8h, v24.8h, v3.8h
+    mla         v17.8h, v25.8h, v3.8h
+
+    urshr       v16.8h, v16.8h, #6
+    urshr       v17.8h, v17.8h, #6
+
+    ld2         {v20.8h, v21.8h}, [x3], #32
+    ld2         {v22.8h, v23.8h}, [x3], x4
+    ext         v28.16b, v20.16b, v22.16b, #2
+    ext         v30.16b, v22.16b, v20.16b, #2
+    ext         v29.16b, v21.16b, v23.16b, #2
+    ext         v31.16b, v23.16b, v21.16b, #2
+
+    st1         {v16.8h}, [x0], x2
+    st1         {v17.8h}, [x1], x2
+    b.gt        1b
+
+    ret
+2:  // dx or dy are 0
+    tst         w11, w11
+    add         w10, w10, w11
+    dup         v0.8h, w9
+    dup         v1.8h, w10
+
+    b.eq        4f
+
+    ld2         {v4.8h, v5.8h}, [x3], x4
+    ld2         {v6.8h, v7.8h}, [x3], x4
+3:  // vertical interpolation loop
+    subs        w15, w15, #2
+
+    mul         v16.8h, v4.8h, v0.8h
+    mul         v17.8h, v5.8h, v0.8h
+    mla         v16.8h, v6.8h, v1.8h
+    mla         v17.8h, v7.8h, v1.8h
+    urshr       v16.8h, v16.8h, #6
+    urshr       v17.8h, v17.8h, #6
+
+    st1         {v16.8h}, [x0], x2
+    st1         {v17.8h}, [x1], x2
+
+    ld2         {v4.8h, v5.8h}, [x3], x4
+
+    mul         v16.8h, v6.8h, v0.8h
+    mul         v17.8h, v7.8h, v0.8h
+    ld2         {v6.8h, v7.8h}, [x3], x4
+    mla         v16.8h, v4.8h, v1.8h
+    mla         v17.8h, v5.8h, v1.8h
+    urshr       v16.8h, v16.8h, #6
+    urshr       v17.8h, v17.8h, #6
+
+    st1         {v16.8h}, [x0], x2
+    st1         {v17.8h}, [x1], x2
+    b.gt        3b
+
+    ret
+4:  // dy is 0
+    sub         x4, x4, #32
+
+    ld2         {v4.8h, v5.8h}, [x3], #32
+    ld2         {v6.8h, v7.8h}, [x3], x4
+    ext         v24.16b, v4.16b, v6.16b, #2
+    ext         v26.16b, v6.16b, v4.16b, #2
+    ld2         {v20.8h, v21.8h}, [x3], #32
+    ld2         {v22.8h, v23.8h}, [x3], x4
+    ext         v28.16b, v20.16b, v22.16b, #2
+    ext         v30.16b, v22.16b, v20.16b, #2
+
+    ext         v25.16b, v5.16b, v7.16b, #2
+    ext         v27.16b, v7.16b, v5.16b, #2
+    ext         v29.16b, v21.16b, v23.16b, #2
+    ext         v31.16b, v23.16b, v21.16b, #2
+
+5:  // horizontal interpolation loop
+    subs        w15, w15, #2
+
+    mul         v16.8h, v4.8h, v0.8h
+    mul         v17.8h, v5.8h, v0.8h
+    mla         v16.8h, v24.8h, v1.8h
+    mla         v17.8h, v25.8h, v1.8h
+
+    urshr       v16.8h, v16.8h, #6
+    urshr       v17.8h, v17.8h, #6
+
+    st1         {v16.8h}, [x0], x2
+    st1         {v17.8h}, [x1], x2
+
+    mul         v16.8h, v20.8h, v0.8h
+    mul         v17.8h, v21.8h, v0.8h
+    ld2         {v4.8h, v5.8h}, [x3], #32
+    ld2         {v6.8h, v7.8h}, [x3], x4
+    mla         v16.8h, v28.8h, v1.8h
+    mla         v17.8h, v29.8h, v1.8h
+    ld2         {v20.8h,v21.8h}, [x3], #32
+    ld2         {v22.8h,v23.8h}, [x3], x4
+
+    urshr       v16.8h, v16.8h, #6
+    urshr       v17.8h, v17.8h, #6
+
+    ext         v24.16b, v4.16b, v6.16b, #2
+    ext         v26.16b, v6.16b, v4.16b, #2
+    ext         v28.16b, v20.16b, v22.16b, #2
+    ext         v30.16b, v22.16b, v20.16b, #2
+    ext         v29.16b, v21.16b, v23.16b, #2
+    ext         v31.16b, v23.16b, v21.16b, #2
+    ext         v25.16b, v5.16b, v7.16b, #2
+    ext         v27.16b, v7.16b, v5.16b, #2
+
+    st1         {v16.8h}, [x0], x2
+    st1         {v17.8h}, [x1], x2
+    b.gt        5b
+
+    ret
+endfunc
+
+.macro integral4h p1, p2
+    ext         v1.16b, \p1\().16b, \p2\().16b, #2
+    ext         v2.16b, \p1\().16b, \p2\().16b, #4
+    ext         v3.16b, \p1\().16b, \p2\().16b, #6
+    add         v0.8h, \p1\().8h, v1.8h
+    add         v4.8h, v2.8h, v3.8h
+    add         v0.8h, v0.8h, v4.8h
+    add         v0.8h, v0.8h, v5.8h
+.endm
+
+function integral_init4h_neon, export=1
+    sub         x3, x0, x2, lsl #1
+    lsl         x2, x2, #1
+    ld1         {v6.8h,v7.8h}, [x1], #32
+1:
+    subs        x2, x2, #32
+    ld1         {v5.8h}, [x3], #16
+    integral4h  v6, v7
+    ld1         {v6.8h}, [x1], #16
+    ld1         {v5.8h}, [x3], #16
+    st1         {v0.8h}, [x0], #16
+    integral4h  v7, v6
+    ld1         {v7.8h}, [x1], #16
+    st1         {v0.8h}, [x0], #16
+    b.gt        1b
+    ret
+endfunc
+
+.macro integral8h p1, p2, s
+    ext         v1.16b, \p1\().16b, \p2\().16b, #2
+    ext         v2.16b, \p1\().16b, \p2\().16b, #4
+    ext         v3.16b, \p1\().16b, \p2\().16b, #6
+    ext         v4.16b, \p1\().16b, \p2\().16b, #8
+    ext         v5.16b, \p1\().16b, \p2\().16b, #10
+    ext         v6.16b, \p1\().16b, \p2\().16b, #12
+    ext         v7.16b, \p1\().16b, \p2\().16b, #14
+    add         v0.8h, \p1\().8h, v1.8h
+    add         v2.8h, v2.8h, v3.8h
+    add         v4.8h, v4.8h, v5.8h
+    add         v6.8h, v6.8h, v7.8h
+    add         v0.8h, v0.8h, v2.8h
+    add         v4.8h, v4.8h, v6.8h
+    add         v0.8h, v0.8h, v4.8h
+    add         v0.8h, v0.8h, \s\().8h
+.endm
+
+function integral_init8h_neon, export=1
+    sub         x3, x0, x2, lsl #1
+    lsl         x2, x2, #1
+
+    ld1         {v16.8h, v17.8h}, [x1], #32
+1:
+    subs        x2, x2, #32
+    ld1         {v18.8h}, [x3], #16
+    integral8h  v16, v17, v18
+    ld1         {v16.8h}, [x1], #16
+    ld1         {v18.8h}, [x3], #16
+    st1         {v0.8h}, [x0], #16
+    integral8h  v17, v16, v18
+    ld1         {v17.8h}, [x1], #16
+    st1         {v0.8h},  [x0], #16
+    b.gt        1b
+    ret
+endfunc
+
+function integral_init4v_neon, export=1
+    mov         x3, x0
+    add         x4, x0, x2, lsl #3
+    add         x8, x0, x2, lsl #4
+    lsl         x2, x2, #1
+    sub         x2, x2, #16
+    ld1         {v20.8h, v21.8h, v22.8h}, [x3], #48
+    ld1         {v16.8h, v17.8h, v18.8h}, [x8], #48
+1:
+    subs        x2, x2, #32
+    ld1         {v24.8h, v25.8h}, [x4], #32
+    ext         v0.16b, v20.16b, v21.16b, #8
+    ext         v1.16b, v21.16b, v22.16b, #8
+    ext         v2.16b, v16.16b, v17.16b, #8
+    ext         v3.16b, v17.16b, v18.16b, #8
+    sub         v24.8h, v24.8h, v20.8h
+    sub         v25.8h, v25.8h, v21.8h
+    add         v0.8h, v0.8h, v20.8h
+    add         v1.8h, v1.8h, v21.8h
+    add         v2.8h, v2.8h, v16.8h
+    add         v3.8h, v3.8h, v17.8h
+    st1         {v24.8h}, [x1], #16
+    st1         {v25.8h}, [x1], #16
+    mov         v20.16b, v22.16b
+    mov         v16.16b, v18.16b
+    sub         v0.8h, v2.8h, v0.8h
+    sub         v1.8h, v3.8h, v1.8h
+    ld1         {v21.8h, v22.8h}, [x3], #32
+    ld1         {v17.8h, v18.8h}, [x8], #32
+    st1         {v0.8h}, [x0], #16
+    st1         {v1.8h}, [x0], #16
+    b.gt        1b
+2:
+    ret
+endfunc
+
+function integral_init8v_neon, export=1
+    add         x2, x0, x1, lsl #4
+    sub         x1, x1, #8
+    ands        x3, x1, #16 - 1
+    b.eq        1f
+    subs        x1, x1, #8
+    ld1         {v0.8h}, [x0]
+    ld1         {v2.8h}, [x2], #16
+    sub         v4.8h, v2.8h, v0.8h
+    st1         {v4.8h}, [x0], #16
+    b.le        2f
+1:
+    subs        x1, x1, #16
+    ld1         {v0.8h,v1.8h}, [x0]
+    ld1         {v2.8h,v3.8h}, [x2], #32
+    sub         v4.8h, v2.8h, v0.8h
+    sub         v5.8h, v3.8h, v1.8h
+    st1         {v4.8h}, [x0], #16
+    st1         {v5.8h}, [x0], #16
+    b.gt        1b
+2:
     ret
 endfunc
 
-const pw_0to15, align=5
-    .short 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15
-endconst
+// frame_init_lowres_core( pixel *src0, pixel *dst0, pixel *dsth,
+//                         pixel *dstv, pixel *dstc, intptr_t src_stride,
+//                         intptr_t dst_stride, int width, int height )
+function frame_init_lowres_core_neon, export=1
+    ldr         w8, [sp]
+    lsl         x5, x5, #1
+    sub         x10, x6, w7, uxtw       // dst_stride - width
+    lsl         x10, x10, #1
+    and         x10, x10, #~31
+
+    stp         d8, d9, [sp, #-0x40]!
+    stp         d10, d11, [sp, #0x10]
+    stp         d12, d13, [sp, #0x20]
+    stp         d14, d15, [sp, #0x30]
+
+1:
+    mov         w9, w7                  // width
+    mov         x11, x0                 // src0
+    add         x12, x0, x5             // src1 = src0 + src_stride
+    add         x13, x0, x5, lsl #1     // src2 = src1 + src_stride
+
+    ld2         {v0.8h, v1.8h}, [x11], #32
+    ld2         {v2.8h, v3.8h}, [x11], #32
+    ld2         {v4.8h, v5.8h}, [x12], #32
+    ld2         {v6.8h, v7.8h}, [x12], #32
+    ld2         {v28.8h, v29.8h}, [x13], #32
+    ld2         {v30.8h, v31.8h}, [x13], #32
+
+    urhadd      v20.8h, v0.8h, v4.8h
+    urhadd      v21.8h, v2.8h, v6.8h
+    urhadd      v22.8h, v4.8h, v28.8h
+    urhadd      v23.8h, v6.8h, v30.8h
+2:
+    subs        w9, w9, #16
+
+    urhadd      v24.8h, v1.8h, v5.8h
+    urhadd      v25.8h, v3.8h, v7.8h
+    urhadd      v26.8h, v5.8h, v29.8h
+    urhadd      v27.8h, v7.8h, v31.8h
+
+    ld2         {v0.8h, v1.8h}, [x11], #32
+    ld2         {v2.8h, v3.8h}, [x11], #32
+    ld2         {v4.8h, v5.8h}, [x12], #32
+    ld2         {v6.8h, v7.8h}, [x12], #32
+    ld2         {v28.8h, v29.8h}, [x13], #32
+    ld2         {v30.8h, v31.8h}, [x13], #32
+
+    urhadd      v16.8h, v0.8h, v4.8h
+    urhadd      v17.8h, v2.8h, v6.8h
+    urhadd      v18.8h, v4.8h, v28.8h
+    urhadd      v19.8h, v6.8h, v30.8h
+
+    ext         v8.16b, v20.16b, v21.16b, #2
+    ext         v9.16b, v21.16b, v16.16b, #2
+    ext         v10.16b, v22.16b, v23.16b, #2
+    ext         v11.16b, v23.16b, v18.16b, #2
+
+    urhadd      v12.8h, v20.8h, v24.8h
+    urhadd      v8.8h, v24.8h, v8.8h
+
+    urhadd      v24.8h, v21.8h, v25.8h
+    urhadd      v22.8h, v22.8h, v26.8h
+    urhadd      v10.8h, v26.8h, v10.8h
+    urhadd      v26.8h, v23.8h, v27.8h
+    urhadd      v9.8h, v25.8h, v9.8h
+    urhadd      v11.8h, v27.8h, v11.8h
+
+    st1         {v12.8h}, [x1], #16
+    st1         {v24.8h}, [x1], #16
+    st1         {v22.8h}, [x3], #16
+    st1         {v26.8h}, [x3], #16
+    st1         {v8.8h, v9.8h}, [x2], #32
+    st1         {v10.8h, v11.8h}, [x4], #32
+
+    b.le        3f
+
+    subs        w9,  w9,  #16
+
+    urhadd      v24.8h, v1.8h, v5.8h
+    urhadd      v25.8h, v3.8h, v7.8h
+    urhadd      v26.8h, v5.8h, v29.8h
+    urhadd      v27.8h, v7.8h, v31.8h
+
+    ld2         {v0.8h, v1.8h}, [x11], #32
+    ld2         {v2.8h, v3.8h}, [x11], #32
+    ld2         {v4.8h, v5.8h}, [x12], #32
+    ld2         {v6.8h, v7.8h}, [x12], #32
+    ld2         {v28.8h, v29.8h}, [x13], #32
+    ld2         {v30.8h, v31.8h}, [x13], #32
+
+    urhadd      v20.8h, v0.8h, v4.8h
+    urhadd      v21.8h, v2.8h, v6.8h
+    urhadd      v22.8h, v4.8h, v28.8h
+    urhadd      v23.8h, v6.8h, v30.8h
+
+    ext         v8.16b, v16.16b, v17.16b, #2
+    ext         v9.16b, v17.16b, v20.16b, #2
+    ext         v10.16b, v18.16b, v19.16b, #2
+    ext         v11.16b, v19.16b, v22.16b, #2
+
+    urhadd      v12.8h, v16.8h, v24.8h
+    urhadd      v13.8h, v17.8h, v25.8h
+
+    urhadd      v14.8h, v18.8h, v26.8h
+    urhadd      v15.8h, v19.8h, v27.8h
+
+    urhadd      v16.8h, v24.8h, v8.8h
+    urhadd      v17.8h, v25.8h, v9.8h
+
+    urhadd      v18.8h, v26.8h, v10.8h
+    urhadd      v19.8h, v27.8h, v11.8h
+
+    st1         {v12.8h, v13.8h}, [x1], #32
+    st1         {v14.8h, v15.8h}, [x3], #32
+    st1         {v16.8h, v17.8h}, [x2], #32
+    st1         {v18.8h, v19.8h}, [x4], #32
+    b.gt        2b
+3:
+    subs        w8, w8, #1
+    add         x0, x0, x5, lsl #1
+    add         x1, x1, x10
+    add         x2, x2, x10
+    add         x3, x3, x10
+    add         x4, x4, x10
+    b.gt        1b
+
+    ldp         d8, d9, [sp]
+    ldp         d10, d11, [sp, #0x10]
+    ldp         d12, d13, [sp, #0x20]
+    ldp         d14, d15, [sp, #0x30]
+
+    add         sp, sp, #0x40
 
-function mbtree_propagate_list_internal_neon, export=1
-    movrel      x11,  pw_0to15
-    dup         v31.8h,  w4             // bipred_weight
-    movi        v30.8h,  #0xc0, lsl #8
-    ld1         {v29.8h},  [x11] //h->mb.i_mb_x,h->mb.i_mb_y
-    movi        v28.4s,  #4
-    movi        v27.8h,  #31
-    movi        v26.8h,  #32
-    dup         v24.8h,  w5             // mb_y
-    zip1        v29.8h,  v29.8h, v24.8h
-8:
-    subs        w6,  w6,  #8
-    ld1         {v1.8h},  [x1], #16     // propagate_amount
-    ld1         {v2.8h},  [x2], #16     // lowres_cost
-    and         v2.16b, v2.16b, v30.16b
-    cmeq        v25.8h, v2.8h,  v30.8h
-    umull       v16.4s, v1.4h,  v31.4h
-    umull2      v17.4s, v1.8h,  v31.8h
-    rshrn       v16.4h, v16.4s, #6
-    rshrn2      v16.8h, v17.4s, #6
-    bsl         v25.16b, v16.16b, v1.16b // if( lists_used == 3 )
-    //          propagate_amount = (propagate_amount * bipred_weight + 32) >> 6
-    ld1         {v4.8h,v5.8h},  [x0],  #32
-    sshr        v6.8h,  v4.8h,  #5
-    sshr        v7.8h,  v5.8h,  #5
-    add         v6.8h,  v6.8h,  v29.8h
-    add         v29.8h, v29.8h, v28.8h
-    add         v7.8h,  v7.8h,  v29.8h
-    add         v29.8h, v29.8h, v28.8h
-    st1         {v6.8h,v7.8h},  [x3],  #32
-    and         v4.16b, v4.16b, v27.16b
-    and         v5.16b, v5.16b, v27.16b
-    uzp1        v6.8h,  v4.8h,  v5.8h   // x & 31
-    uzp2        v7.8h,  v4.8h,  v5.8h   // y & 31
-    sub         v4.8h,  v26.8h, v6.8h   // 32 - (x & 31)
-    sub         v5.8h,  v26.8h, v7.8h   // 32 - (y & 31)
-    mul         v19.8h, v6.8h,  v7.8h   // idx3weight = y*x;
-    mul         v18.8h, v4.8h,  v7.8h   // idx2weight = y*(32-x);
-    mul         v17.8h, v6.8h,  v5.8h   // idx1weight = (32-y)*x;
-    mul         v16.8h, v4.8h,  v5.8h   // idx0weight = (32-y)*(32-x) ;
-    umull       v6.4s,  v19.4h, v25.4h
-    umull2      v7.4s,  v19.8h, v25.8h
-    umull       v4.4s,  v18.4h, v25.4h
-    umull2      v5.4s,  v18.8h, v25.8h
-    umull       v2.4s,  v17.4h, v25.4h
-    umull2      v3.4s,  v17.8h, v25.8h
-    umull       v0.4s,  v16.4h, v25.4h
-    umull2      v1.4s,  v16.8h, v25.8h
-    rshrn       v19.4h, v6.4s,  #10
-    rshrn2      v19.8h, v7.4s,  #10
-    rshrn       v18.4h, v4.4s,  #10
-    rshrn2      v18.8h, v5.4s,  #10
-    rshrn       v17.4h, v2.4s,  #10
-    rshrn2      v17.8h, v3.4s,  #10
-    rshrn       v16.4h, v0.4s,  #10
-    rshrn2      v16.8h, v1.4s,  #10
-    zip1        v0.8h,  v16.8h, v17.8h
-    zip2        v1.8h,  v16.8h, v17.8h
-    zip1        v2.8h,  v18.8h, v19.8h
-    zip2        v3.8h,  v18.8h, v19.8h
-    st1         {v0.8h,v1.8h},  [x3], #32
-    st1         {v2.8h,v3.8h},  [x3], #32
-    b.ge        8b
     ret
 endfunc
 
-function memcpy_aligned_neon, export=1
-    tst         x2,  #16
+function load_deinterleave_chroma_fenc_neon, export=1
+    mov         x4, #FENC_STRIDE/2
+    lsl         x4, x4, #1
+    lsl         x2, x2, #1
+    b           load_deinterleave_chroma
+endfunc
+
+function load_deinterleave_chroma_fdec_neon, export=1
+    mov         x4, #FDEC_STRIDE/2
+    lsl         x4, x4, #1
+    lsl         x2, x2, #1
+load_deinterleave_chroma:
+    ld2         {v0.8h, v1.8h}, [x1], x2
+    ld2         {v2.8h, v3.8h}, [x1], x2
+    subs        w3, w3, #2
+    st1         {v0.8h}, [x0], x4
+    st1         {v1.8h}, [x0], x4
+    st1         {v2.8h}, [x0], x4
+    st1         {v3.8h}, [x0], x4
+    b.gt        load_deinterleave_chroma
+
+    ret
+endfunc
+
+function store_interleave_chroma_neon, export=1
+    mov         x5, #FDEC_STRIDE
+    lsl         x5, x5, #1
+    lsl         x1, x1, #1
+1:
+    ld1         {v0.8h}, [x2], x5
+    ld1         {v1.8h}, [x3], x5
+    ld1         {v2.8h}, [x2], x5
+    ld1         {v3.8h}, [x3], x5
+    subs        w4, w4, #2
+    zip1        v4.8h, v0.8h, v1.8h
+    zip1        v6.8h, v2.8h, v3.8h
+    zip2        v5.8h, v0.8h, v1.8h
+    zip2        v7.8h, v2.8h, v3.8h
+
+    st1         {v4.8h, v5.8h}, [x0], x1
+    st1         {v6.8h, v7.8h}, [x0], x1
+    b.gt        1b
+
+    ret
+endfunc
+
+function plane_copy_core_neon, export=1
+    add         w8, w4, #31 // 32-bit write clears the upper 32 bits of the register
+    and         w4, w8, #~31
+    // safe use of the full reg since negative width makes no sense
+    sub         x1, x1, x4
+    sub         x3, x3, x4
+    lsl         x1, x1, #1
+    lsl         x3, x3, #1
+1:
+    mov         w8, w4
+16:
+    tst         w8, #16
     b.eq        32f
-    sub         x2,  x2,  #16
-    ldr         q0,  [x1], #16
-    str         q0,  [x0], #16
+    subs        w8, w8, #16
+    ldp         q0, q1, [x2], #32
+    stp         q0, q1, [x0], #32
+    b.eq        0f
 32:
-    tst         x2,  #32
-    b.eq        640f
-    sub         x2,  x2,  #32
-    ldp         q0,  q1,  [x1], #32
-    stp         q0,  q1,  [x0], #32
-640:
-    cbz         x2,  1f
-64:
-    subs        x2,  x2,  #64
-    ldp         q0,  q1,  [x1, #32]
-    ldp         q2,  q3,  [x1], #64
-    stp         q0,  q1,  [x0, #32]
-    stp         q2,  q3,  [x0], #64
-    b.gt        64b
+    subs        w8, w8, #32
+    ldp         q0, q1, [x2], #32
+    ldp         q2, q3, [x2], #32
+    stp         q0, q1, [x0], #32
+    stp         q2, q3, [x0], #32
+    b.gt        32b
+0:
+    subs        w5, w5, #1
+    add         x2, x2, x3
+    add         x0, x0, x1
+    b.gt        1b
+
+    ret
+endfunc
+
+function plane_copy_swap_core_neon, export=1
+    lsl         w4, w4, #1
+    add         w8, w4, #31 // 32-bit write clears the upper 32 bits of the register
+    and         w4, w8, #~31
+    sub         x1, x1, x4
+    sub         x3, x3, x4
+    lsl         x1, x1, #1
+    lsl         x3, x3, #1
 1:
+    mov         w8, w4
+    tbz         w4, #4, 32f
+    subs        w8, w8, #16
+    ld1         {v0.8h, v1.8h}, [x2], #32
+    rev32       v0.8h, v0.8h
+    rev32       v1.8h, v1.8h
+    st1         {v0.8h, v1.8h}, [x0], #32
+    b.eq        0f
+32:
+    subs        w8, w8, #32
+    ld1         {v0.8h, v1.8h, v2.8h, v3.8h}, [x2], #64
+    rev32       v20.8h, v0.8h
+    rev32       v21.8h, v1.8h
+    rev32       v22.8h, v2.8h
+    rev32       v23.8h, v3.8h
+    st1         {v20.8h, v21.8h, v22.8h, v23.8h}, [x0], #64
+    b.gt        32b
+0:
+    subs        w5, w5, #1
+    add         x2, x2, x3
+    add         x0, x0, x1
+    b.gt        1b
+
     ret
 endfunc
 
-function memzero_aligned_neon, export=1
-    movi        v0.16b,  #0
-    movi        v1.16b,  #0
+function plane_copy_deinterleave_neon, export=1
+    add         w9, w6, #15
+    and         w9, w9, #~15
+    sub         x1, x1, x9
+    sub         x3, x3, x9
+    sub         x5, x5, x9, lsl #1
+    lsl         x1, x1, #1
+    lsl         x3, x3, #1
+    lsl         x5, x5, #1
 1:
-    subs        x1,  x1,  #128
-    stp         q0,  q1,  [x0, #96]
-    stp         q0,  q1,  [x0, #64]
-    stp         q0,  q1,  [x0, #32]
-    stp         q0,  q1,  [x0], 128
+    ld2         {v0.8h, v1.8h}, [x4], #32
+    ld2         {v2.8h, v3.8h}, [x4], #32
+    subs        w9, w9, #16
+    st1         {v0.8h}, [x0], #16
+    st1         {v2.8h}, [x0], #16
+    st1         {v1.8h}, [x2], #16
+    st1         {v3.8h}, [x2], #16
     b.gt        1b
+
+    add         x4, x4, x5
+    subs        w7, w7, #1
+    add         x0, x0, x1
+    add         x2, x2, x3
+    mov         w9, w6
+    b.gt        1b
+
     ret
 endfunc
 
-// void mbtree_fix8_pack( int16_t *dst, float *src, int count )
-function mbtree_fix8_pack_neon, export=1
-    subs        w3,  w2,  #8
-    b.lt        2f
+function plane_copy_interleave_core_neon, export=1
+    add         w9, w6, #15
+    and         w9, w9, #0xfffffff0
+    sub         x1, x1, x9,  lsl #1
+    sub         x3, x3, x9
+    sub         x5, x5, x9
+    lsl         x1, x1, #1
+    lsl         x3, x3, #1
+    lsl         x5, x5, #1
 1:
-    subs        w3,  w3,  #8
-    ld1         {v0.4s,v1.4s}, [x1], #32
-    fcvtzs      v0.4s,  v0.4s,  #8
-    fcvtzs      v1.4s,  v1.4s,  #8
-    sqxtn       v2.4h,  v0.4s
-    sqxtn2      v2.8h,  v1.4s
-    rev16       v3.16b, v2.16b
-    st1         {v3.8h},  [x0], #16
-    b.ge        1b
-2:
-    adds        w3,  w3,  #8
-    b.eq        4f
-3:
-    subs        w3,  w3,  #1
-    ldr         s0, [x1], #4
-    fcvtzs      w4,  s0,  #8
-    rev16       w5,  w4
-    strh        w5, [x0], #2
-    b.gt        3b
-4:
+    ld1         {v0.8h}, [x2], #16
+    ld1         {v1.8h}, [x4], #16
+    ld1         {v2.8h}, [x2], #16
+    ld1         {v3.8h}, [x4], #16
+    subs        w9, w9, #16
+    st2         {v0.8h, v1.8h}, [x0], #32
+    st2         {v2.8h, v3.8h}, [x0], #32
+    b.gt        1b
+
+    subs        w7, w7, #1
+    add         x0, x0, x1
+    add         x2, x2, x3
+    add         x4, x4, x5
+    mov         w9, w6
+    b.gt        1b
+
     ret
 endfunc
 
-// void mbtree_fix8_unpack( float *dst, int16_t *src, int count )
-function mbtree_fix8_unpack_neon, export=1
-    subs        w3,  w2,  #8
-    b.lt        2f
+.macro deinterleave_rgb
+    subs            x11, x11, #8
+    st1             {v0.8h}, [x0], #16
+    st1             {v1.8h}, [x2], #16
+    st1             {v2.8h}, [x4], #16
+    b.gt            1b
+
+    subs            w10, w10, #1
+    add             x0, x0, x1
+    add             x2, x2, x3
+    add             x4, x4, x5
+    add             x6, x6, x7
+    mov             x11, x9
+    b.gt            1b
+.endm
+
+function plane_copy_deinterleave_rgb_neon, export=1
+#if SYS_MACOSX
+    ldr             w8, [sp]
+    ldp             w9, w10, [sp, #4]
+#else
+    ldr             x8, [sp]
+    ldp             x9, x10, [sp, #8]
+#endif
+    cmp             w8, #3
+    uxtw            x9, w9
+    add             x11, x9, #7
+    and             x11, x11, #~7
+    sub             x1, x1, x11
+    sub             x3, x3, x11
+    sub             x5, x5, x11
+    lsl             x1, x1, #1
+    lsl             x3, x3, #1
+    lsl             x5, x5, #1
+    b.ne            4f
+    sub             x7, x7, x11, lsl #1
+    sub             x7, x7, x11
+    lsl             x7, x7, #1
 1:
-    subs        w3,  w3,  #8
-    ld1         {v0.8h}, [x1], #16
-    rev16       v1.16b, v0.16b
-    sxtl        v2.4s,  v1.4h
-    sxtl2       v3.4s,  v1.8h
-    scvtf       v4.4s,  v2.4s,  #8
-    scvtf       v5.4s,  v3.4s,  #8
-    st1         {v4.4s,v5.4s}, [x0], #32
-    b.ge        1b
-2:
-    adds        w3,  w3,  #8
-    b.eq        4f
-3:
-    subs        w3,  w3,  #1
-    ldrh        w4, [x1], #2
-    rev16       w5,  w4
-    sxth        w6,  w5
-    scvtf       s0,  w6,  #8
-    str         s0, [x0], #4
-    b.gt        3b
+    ld3             {v0.8h, v1.8h, v2.8h}, [x6], #48
+    deinterleave_rgb
+
+    ret
 4:
+    sub             x7, x7, x11, lsl #2
+    lsl             x7, x7, #1
+1:
+    ld4             {v0.8h, v1.8h, v2.8h, v3.8h}, [x6], #64
+    deinterleave_rgb
+
+    ret
+endfunc
+
+// void hpel_filter( pixel *dsth, pixel *dstv, pixel *dstc, pixel *src,
+//                   intptr_t stride, int width, int height, int16_t *buf )
+function hpel_filter_neon, export=1
+    lsl         x5, x5, #1
+    ubfm        x9, x3, #3, #7
+    add         w15, w5, w9
+    sub         x13, x3, x9                 // align src
+    sub         x10, x0, x9
+    sub         x11, x1, x9
+    sub         x12, x2, x9
+    movi        v30.8h, #5
+    movi        v31.8h, #20
+
+    lsl         x4, x4, #1
+    stp         d8, d9, [sp, #-0x40]!
+    stp         d10, d11, [sp, #0x10]
+    stp         d12, d13, [sp, #0x20]
+    stp         d14, d15, [sp, #0x30]
+
+    str         q0, [sp, #-0x50]!
+
+1:  // line start
+    mov         x3, x13
+    mov         x2, x12
+    mov         x1, x11
+    mov         x0, x10
+    add         x7, x3, #32                 // src pointer to the next 16 pixels for horiz filter
+    mov         x5, x15                     // restore width
+    sub         x3, x3, x4, lsl #1          // src - 2*stride
+    ld1         {v28.8h, v29.8h}, [x7], #32 // src[16:31]
+    add         x9, x3, x5                  // holds src - 2*stride + width
+
+    ld1         {v8.8h, v9.8h}, [x3], x4    // src-2*stride[0:15]
+    ld1         {v10.8h, v11.8h}, [x3], x4  // src-1*stride[0:15]
+    ld1         {v12.8h, v13.8h}, [x3], x4  // src-0*stride[0:15]
+    ld1         {v14.8h, v15.8h}, [x3], x4  // src+1*stride[0:15]
+    ld1         {v16.8h, v17.8h}, [x3], x4  // src+2*stride[0:15]
+    ld1         {v18.8h, v19.8h}, [x3], x4  // src+3*stride[0:15]
+
+    ext         v22.16b, v7.16b, v12.16b, #12
+    ext         v23.16b, v12.16b, v13.16b, #12
+    uaddl       v1.4s, v8.4h, v18.4h
+    uaddl2      v20.4s, v8.8h, v18.8h
+    ext         v24.16b, v12.16b, v13.16b, #6
+    ext         v25.16b, v13.16b, v28.16b, #6
+    umlsl       v1.4s, v10.4h, v30.4h
+    umlsl2      v20.4s, v10.8h, v30.8h
+    ext         v26.16b, v7.16b, v12.16b, #14
+    ext         v27.16b, v12.16b, v13.16b, #14
+    umlal       v1.4s, v12.4h, v31.4h
+    umlal2      v20.4s, v12.8h, v31.8h
+    ext         v3.16b, v12.16b, v13.16b, #2
+    ext         v4.16b, v13.16b, v28.16b, #2
+    umlal       v1.4s, v14.4h, v31.4h
+    umlal2      v20.4s, v14.8h, v31.8h
+    ext         v21.16b, v12.16b, v13.16b, #4
+    ext         v5.16b, v13.16b, v28.16b, #4
+    umlsl       v1.4s, v16.4h, v30.4h
+    umlsl2      v20.4s, v16.8h, v30.8h
+
+2:  // next 16 pixels of line
+    subs        x5, x5, #32
+    sub         x3, x9, x5                  // src - 2*stride += 16
+
+    uaddl       v8.4s, v22.4h, v24.4h
+    uaddl2      v22.4s, v22.8h, v24.8h
+    uaddl       v10.4s, v23.4h, v25.4h
+    uaddl2      v23.4s, v23.8h, v25.8h
+
+    umlsl       v8.4s, v26.4h, v30.4h
+    umlsl2      v22.4s, v26.8h, v30.8h
+    umlsl       v10.4s, v27.4h, v30.4h
+    umlsl2      v23.4s, v27.8h, v30.8h
+
+    umlal       v8.4s, v12.4h, v31.4h
+    umlal2      v22.4s, v12.8h, v31.8h
+    umlal       v10.4s, v13.4h, v31.4h
+    umlal2      v23.4s, v13.8h, v31.8h
+
+    umlal       v8.4s, v3.4h, v31.4h
+    umlal2      v22.4s, v3.8h, v31.8h
+    umlal       v10.4s, v4.4h, v31.4h
+    umlal2      v23.4s, v4.8h, v31.8h
+
+    umlsl       v8.4s, v21.4h, v30.4h
+    umlsl2      v22.4s, v21.8h, v30.8h
+    umlsl       v10.4s, v5.4h, v30.4h
+    umlsl2      v23.4s, v5.8h, v30.8h
+
+    uaddl       v5.4s, v9.4h, v19.4h
+    uaddl2      v2.4s, v9.8h, v19.8h
+
+    sqrshrun    v8.4h, v8.4s, #5
+    sqrshrun2   v8.8h, v22.4s, #5
+    sqrshrun    v10.4h, v10.4s, #5
+    sqrshrun2   v10.8h, v23.4s, #5
+
+    mov         v6.16b, v12.16b
+    mov         v7.16b, v13.16b
+
+    mvni        v23.8h, #0xfc, lsl #8
+
+    umin        v8.8h, v8.8h, v23.8h
+    umin        v10.8h, v10.8h, v23.8h
+
+    st1         {v8.8h}, [x0], #16
+    st1         {v10.8h}, [x0], #16
+
+    umlsl       v5.4s, v11.4h, v30.4h
+    umlsl2      v2.4s, v11.8h, v30.8h
+
+    ld1         {v8.8h, v9.8h}, [x3], x4
+    umlal       v5.4s, v13.4h, v31.4h
+    umlal2      v2.4s, v13.8h, v31.8h
+    ld1         {v10.8h, v11.8h}, [x3], x4
+    umlal       v5.4s, v15.4h, v31.4h
+    umlal2      v2.4s, v15.8h, v31.8h
+    ld1         {v12.8h, v13.8h}, [x3], x4
+    umlsl       v5.4s, v17.4h, v30.4h
+    umlsl2      v2.4s, v17.8h, v30.8h
+    ld1         {v14.8h, v15.8h}, [x3], x4
+
+    sqrshrun    v4.4h, v5.4s, #5
+    sqrshrun2   v4.8h, v2.4s, #5
+    sqrshrun    v18.4h, v1.4s, #5
+    sqrshrun2   v18.8h, v20.4s, #5
+
+    mvni        v17.8h, #0xfc, lsl #8
+
+    smin        v4.8h, v4.8h, v17.8h
+    smin        v18.8h, v18.8h, v17.8h
+
+    st1         {v18.8h}, [x1], #16
+    st1         {v4.8h}, [x1], #16
+
+    ld1         {v16.8h, v17.8h}, [x3], x4          // src+2*stride[0:15]
+    ld1         {v18.8h, v19.8h}, [x3], x4          // src+3*stride[0:15]
+
+    str         q9, [sp, #0x10]
+    str         q15, [sp, #0x20]
+    str         q17, [sp, #0x30]
+    str         q19, [sp, #0x40]
+
+    ldr         q28, [sp]
+
+    ext         v22.16b, v28.16b, v1.16b, #8
+    ext         v9.16b, v1.16b, v20.16b, #8
+    ext         v26.16b, v1.16b, v20.16b, #12
+    ext         v17.16b, v20.16b, v5.16b, #12
+    ext         v23.16b, v28.16b, v1.16b, #12
+    ext         v19.16b, v1.16b, v20.16b, #12
+
+    uaddl       v3.4s, v8.4h, v18.4h
+    uaddl2      v15.4s, v8.8h, v18.8h
+    umlsl       v3.4s, v10.4h, v30.4h
+    umlsl2      v15.4s, v10.8h, v30.8h
+    umlal       v3.4s, v12.4h, v31.4h
+    umlal2      v15.4s, v12.8h, v31.8h
+    umlal       v3.4s, v14.4h, v31.4h
+    umlal2      v15.4s, v14.8h, v31.8h
+    umlsl       v3.4s, v16.4h, v30.4h
+    umlsl2      v15.4s, v16.8h, v30.8h
+
+    add         v4.4s, v22.4s, v26.4s
+    add         v26.4s, v9.4s, v17.4s
+
+    ext         v25.16b, v1.16b, v20.16b, #8
+    ext         v22.16b, v20.16b, v5.16b, #8
+    ext         v24.16b, v1.16b, v20.16b, #4
+    ext         v9.16b, v20.16b, v5.16b, #4
+
+    add         v31.4s, v23.4s, v25.4s
+    add         v19.4s, v19.4s, v22.4s
+    add         v6.4s, v24.4s, v1.4s
+    add         v17.4s, v9.4s, v20.4s
+    sub         v4.4s, v4.4s, v31.4s                // a-b
+    sub         v26.4s, v26.4s, v19.4s              // a-b
+    sub         v31.4s, v31.4s, v6.4s               // b-c
+    sub         v19.4s, v19.4s, v17.4s              // b-c
+
+    ext         v22.16b, v20.16b, v5.16b, #8
+    ext         v9.16b, v5.16b, v2.16b, #8
+    ext         v24.16b, v5.16b, v2.16b, #12
+    ext         v28.16b, v2.16b, v3.16b, #12
+    ext         v23.16b, v20.16b, v5.16b, #12
+    ext         v30.16b, v5.16b, v2.16b, #12
+    ext         v25.16b, v5.16b, v2.16b, #8
+    ext         v29.16b, v2.16b, v3.16b, #8
+
+    add         v22.4s, v22.4s, v24.4s
+    add         v9.4s, v9.4s, v28.4s
+    add         v23.4s, v23.4s, v25.4s
+    add         v29.4s, v29.4s, v30.4s
+
+    ext         v24.16b, v5.16b, v2.16b, #4
+    ext         v28.16b, v2.16b, v3.16b, #4
+
+    add         v24.4s, v24.4s, v5.4s
+    add         v28.4s, v28.4s, v2.4s
+
+    sub         v22.4s, v22.4s, v23.4s
+    sub         v9.4s, v9.4s, v29.4s
+    sub         v23.4s, v23.4s, v24.4s
+    sub         v29.4s, v29.4s, v28.4s
+
+    sshr        v4.4s, v4.4s, #2
+    sshr        v0.4s, v26.4s, #2
+    sshr        v22.4s, v22.4s, #2
+    sshr        v9.4s, v9.4s, #2
+
+    sub         v4.4s, v4.4s, v31.4s
+    sub         v0.4s, v0.4s, v19.4s
+    sub         v22.4s, v22.4s, v23.4s
+    sub         v9.4s, v9.4s, v29.4s
+
+    sshr        v4.4s, v4.4s, #2
+    sshr        v0.4s, v0.4s, #2
+    sshr        v22.4s, v22.4s, #2
+    sshr        v9.4s, v9.4s, #2
+
+    add         v4.4s, v4.4s, v6.4s
+    add         v0.4s, v0.4s, v17.4s
+    add         v22.4s, v22.4s, v24.4s
+    add         v9.4s, v9.4s, v28.4s
+
+    str         q2, [sp]
+
+    sqrshrun    v4.4h, v4.4s, #6
+    sqrshrun2   v4.8h, v0.4s, #6
+    sqrshrun    v22.4h, v22.4s, #6
+    sqrshrun2   v22.8h, v9.4s, #6
+
+    mov         v0.16b, v5.16b
+
+    ld1        {v28.8h, v29.8h}, [x7], #32          // src[16:31]
+
+    ldr         q9, [sp, #0x10]
+    ldr         q17, [sp, #0x30]
+    ldr         q19, [sp, #0x40]
+
+    ext         v26.16b, v7.16b, v12.16b, #14
+    ext         v27.16b, v12.16b, v13.16b, #14
+
+    mvni        v25.8h, #0xfc, lsl #8
+
+    smin        v22.8h, v22.8h, v25.8h
+    smin        v4.8h, v4.8h, v25.8h
+
+    st1        {v4.8h}, [x2], #16
+    st1        {v22.8h}, [x2], #16
+
+    mov         v1.16b, v3.16b
+    mov         v20.16b, v15.16b
+
+    ldr         q15, [sp, #0x20]
+
+    ext         v22.16b, v7.16b, v12.16b, #12
+    ext         v23.16b, v12.16b, v13.16b, #12
+    ext         v3.16b, v12.16b, v13.16b, #2
+    ext         v4.16b, v13.16b, v28.16b, #2
+    ext         v21.16b, v12.16b, v13.16b, #4
+    ext         v5.16b, v13.16b, v28.16b, #4
+    ext         v24.16b, v12.16b, v13.16b, #6
+    ext         v25.16b, v13.16b, v28.16b, #6
+
+    movi        v30.8h, #5
+    movi        v31.8h, #20
+
+    b.gt        2b
+
+    subs        w6, w6, #1
+    add         x10, x10, x4
+    add         x11, x11, x4
+    add         x12, x12, x4
+    add         x13, x13, x4
+    b.gt        1b
+
+    add         sp, sp, #0x50
+
+    ldp         d8, d9, [sp]
+    ldp         d10, d11, [sp, #0x10]
+    ldp         d12, d13, [sp, #0x20]
+    ldp         d14, d15, [sp, #0x30]
+    add         sp, sp, #0x40
+
     ret
 endfunc
+
+#endif
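
For readers comparing against the C reference: the new frame_init_lowres_core_neon above
vectorizes the usual half-pel downscale with urhadd (unsigned rounding halving add, i.e.
(a+b+1)>>1 per 16-bit lane). Below is a minimal C sketch of the per-pixel behaviour it
mirrors; this is modelled on x264's generic C downscaler rather than copied from the
patch, the pixel typedef and the lowres_ref/RAVG names are only for illustration, and it
assumes padded frames so reading one column past 2*width is safe (as the asm also assumes).

#include <stdint.h>
typedef uint16_t pixel;                    /* illustration only; x264 supplies this typedef */
#define RAVG(a,b) (((a) + (b) + 1) >> 1)   /* same rounding as urhadd */

static void lowres_ref( pixel *src0, pixel *dst0, pixel *dsth,
                        pixel *dstv, pixel *dstc,
                        intptr_t src_stride, intptr_t dst_stride,
                        int width, int height )
{
    for( int y = 0; y < height; y++ )
    {
        pixel *src1 = src0 + src_stride;
        pixel *src2 = src1 + src_stride;
        for( int x = 0; x < width; x++ )
        {
            /* each lowres plane is a 2x2 rounding average at a different half-pel offset */
            dst0[x] = RAVG( RAVG(src0[2*x  ], src1[2*x  ]), RAVG(src0[2*x+1], src1[2*x+1]) );
            dsth[x] = RAVG( RAVG(src0[2*x+1], src1[2*x+1]), RAVG(src0[2*x+2], src1[2*x+2]) );
            dstv[x] = RAVG( RAVG(src1[2*x  ], src2[2*x  ]), RAVG(src1[2*x+1], src2[2*x+1]) );
            dstc[x] = RAVG( RAVG(src1[2*x+1], src2[2*x+1]), RAVG(src1[2*x+2], src2[2*x+2]) );
        }
        src0 += 2*src_stride;
        dst0 += dst_stride; dsth += dst_stride;
        dstv += dst_stride; dstc += dst_stride;
    }
}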


=====================================
common/aarch64/mc-c.c
=====================================
@@ -28,11 +28,11 @@
 #include "mc.h"
 
 #define x264_prefetch_ref_aarch64 x264_template(prefetch_ref_aarch64)
-void x264_prefetch_ref_aarch64( uint8_t *, intptr_t, int );
+void x264_prefetch_ref_aarch64( pixel *, intptr_t, int );
 #define x264_prefetch_fenc_420_aarch64 x264_template(prefetch_fenc_420_aarch64)
-void x264_prefetch_fenc_420_aarch64( uint8_t *, intptr_t, uint8_t *, intptr_t, int );
+void x264_prefetch_fenc_420_aarch64( pixel *, intptr_t, pixel *, intptr_t, int );
 #define x264_prefetch_fenc_422_aarch64 x264_template(prefetch_fenc_422_aarch64)
-void x264_prefetch_fenc_422_aarch64( uint8_t *, intptr_t, uint8_t *, intptr_t, int );
+void x264_prefetch_fenc_422_aarch64( pixel *, intptr_t, pixel *, intptr_t, int );
 
 #define x264_memcpy_aligned_neon x264_template(memcpy_aligned_neon)
 void *x264_memcpy_aligned_neon( void *dst, const void *src, size_t n );
@@ -40,32 +40,32 @@ void *x264_memcpy_aligned_neon( void *dst, const void *src, size_t n );
 void x264_memzero_aligned_neon( void *dst, size_t n );
 
 #define x264_pixel_avg_16x16_neon x264_template(pixel_avg_16x16_neon)
-void x264_pixel_avg_16x16_neon( uint8_t *, intptr_t, uint8_t *, intptr_t, uint8_t *, intptr_t, int );
+void x264_pixel_avg_16x16_neon( pixel *, intptr_t, pixel *, intptr_t, pixel *, intptr_t, int );
 #define x264_pixel_avg_16x8_neon x264_template(pixel_avg_16x8_neon)
-void x264_pixel_avg_16x8_neon ( uint8_t *, intptr_t, uint8_t *, intptr_t, uint8_t *, intptr_t, int );
+void x264_pixel_avg_16x8_neon ( pixel *, intptr_t, pixel *, intptr_t, pixel *, intptr_t, int );
 #define x264_pixel_avg_8x16_neon x264_template(pixel_avg_8x16_neon)
-void x264_pixel_avg_8x16_neon ( uint8_t *, intptr_t, uint8_t *, intptr_t, uint8_t *, intptr_t, int );
+void x264_pixel_avg_8x16_neon ( pixel *, intptr_t, pixel *, intptr_t, pixel *, intptr_t, int );
 #define x264_pixel_avg_8x8_neon x264_template(pixel_avg_8x8_neon)
-void x264_pixel_avg_8x8_neon  ( uint8_t *, intptr_t, uint8_t *, intptr_t, uint8_t *, intptr_t, int );
+void x264_pixel_avg_8x8_neon  ( pixel *, intptr_t, pixel *, intptr_t, pixel *, intptr_t, int );
 #define x264_pixel_avg_8x4_neon x264_template(pixel_avg_8x4_neon)
-void x264_pixel_avg_8x4_neon  ( uint8_t *, intptr_t, uint8_t *, intptr_t, uint8_t *, intptr_t, int );
+void x264_pixel_avg_8x4_neon  ( pixel *, intptr_t, pixel *, intptr_t, pixel *, intptr_t, int );
 #define x264_pixel_avg_4x16_neon x264_template(pixel_avg_4x16_neon)
-void x264_pixel_avg_4x16_neon ( uint8_t *, intptr_t, uint8_t *, intptr_t, uint8_t *, intptr_t, int );
+void x264_pixel_avg_4x16_neon ( pixel *, intptr_t, pixel *, intptr_t, pixel *, intptr_t, int );
 #define x264_pixel_avg_4x8_neon x264_template(pixel_avg_4x8_neon)
-void x264_pixel_avg_4x8_neon  ( uint8_t *, intptr_t, uint8_t *, intptr_t, uint8_t *, intptr_t, int );
+void x264_pixel_avg_4x8_neon  ( pixel *, intptr_t, pixel *, intptr_t, pixel *, intptr_t, int );
 #define x264_pixel_avg_4x4_neon x264_template(pixel_avg_4x4_neon)
-void x264_pixel_avg_4x4_neon  ( uint8_t *, intptr_t, uint8_t *, intptr_t, uint8_t *, intptr_t, int );
+void x264_pixel_avg_4x4_neon  ( pixel *, intptr_t, pixel *, intptr_t, pixel *, intptr_t, int );
 #define x264_pixel_avg_4x2_neon x264_template(pixel_avg_4x2_neon)
-void x264_pixel_avg_4x2_neon  ( uint8_t *, intptr_t, uint8_t *, intptr_t, uint8_t *, intptr_t, int );
+void x264_pixel_avg_4x2_neon  ( pixel *, intptr_t, pixel *, intptr_t, pixel *, intptr_t, int );
 
 #define x264_pixel_avg2_w4_neon x264_template(pixel_avg2_w4_neon)
-void x264_pixel_avg2_w4_neon ( uint8_t *, intptr_t, uint8_t *, intptr_t, uint8_t *, int );
+void x264_pixel_avg2_w4_neon ( pixel *, intptr_t, pixel *, intptr_t, pixel *, int );
 #define x264_pixel_avg2_w8_neon x264_template(pixel_avg2_w8_neon)
-void x264_pixel_avg2_w8_neon ( uint8_t *, intptr_t, uint8_t *, intptr_t, uint8_t *, int );
+void x264_pixel_avg2_w8_neon ( pixel *, intptr_t, pixel *, intptr_t, pixel *, int );
 #define x264_pixel_avg2_w16_neon x264_template(pixel_avg2_w16_neon)
-void x264_pixel_avg2_w16_neon( uint8_t *, intptr_t, uint8_t *, intptr_t, uint8_t *, int );
+void x264_pixel_avg2_w16_neon( pixel *, intptr_t, pixel *, intptr_t, pixel *, int );
 #define x264_pixel_avg2_w20_neon x264_template(pixel_avg2_w20_neon)
-void x264_pixel_avg2_w20_neon( uint8_t *, intptr_t, uint8_t *, intptr_t, uint8_t *, int );
+void x264_pixel_avg2_w20_neon( pixel *, intptr_t, pixel *, intptr_t, pixel *, int );
 
 #define x264_plane_copy_core_neon x264_template(plane_copy_core_neon)
 void x264_plane_copy_core_neon( pixel *dst, intptr_t i_dst,
@@ -111,12 +111,12 @@ void x264_load_deinterleave_chroma_fenc_neon( pixel *dst, pixel *src, intptr_t i
 #define x264_mc_weight_w8_offsetadd_neon x264_template(mc_weight_w8_offsetadd_neon)
 #define x264_mc_weight_w8_offsetsub_neon x264_template(mc_weight_w8_offsetsub_neon)
 #define MC_WEIGHT(func)\
-void x264_mc_weight_w20##func##_neon( uint8_t *, intptr_t, uint8_t *, intptr_t, const x264_weight_t *, int );\
-void x264_mc_weight_w16##func##_neon( uint8_t *, intptr_t, uint8_t *, intptr_t, const x264_weight_t *, int );\
-void x264_mc_weight_w8##func##_neon ( uint8_t *, intptr_t, uint8_t *, intptr_t, const x264_weight_t *, int );\
-void x264_mc_weight_w4##func##_neon ( uint8_t *, intptr_t, uint8_t *, intptr_t, const x264_weight_t *, int );\
+void x264_mc_weight_w20##func##_neon( pixel *, intptr_t, pixel *, intptr_t, const x264_weight_t *, int );\
+void x264_mc_weight_w16##func##_neon( pixel *, intptr_t, pixel *, intptr_t, const x264_weight_t *, int );\
+void x264_mc_weight_w8##func##_neon ( pixel *, intptr_t, pixel *, intptr_t, const x264_weight_t *, int );\
+void x264_mc_weight_w4##func##_neon ( pixel *, intptr_t, pixel *, intptr_t, const x264_weight_t *, int );\
 \
-static void (* mc##func##_wtab_neon[6])( uint8_t *, intptr_t, uint8_t *, intptr_t, const x264_weight_t *, int ) =\
+static void (* mc##func##_wtab_neon[6])( pixel *, intptr_t, pixel *, intptr_t, const x264_weight_t *, int ) =\
 {\
     x264_mc_weight_w4##func##_neon,\
     x264_mc_weight_w4##func##_neon,\
@@ -126,32 +126,30 @@ static void (* mc##func##_wtab_neon[6])( uint8_t *, intptr_t, uint8_t *, intptr_
     x264_mc_weight_w20##func##_neon,\
 };
 
-#if !HIGH_BIT_DEPTH
 MC_WEIGHT()
 MC_WEIGHT(_nodenom)
 MC_WEIGHT(_offsetadd)
 MC_WEIGHT(_offsetsub)
-#endif
 
 #define x264_mc_copy_w4_neon x264_template(mc_copy_w4_neon)
-void x264_mc_copy_w4_neon ( uint8_t *, intptr_t, uint8_t *, intptr_t, int );
+void x264_mc_copy_w4_neon ( pixel *, intptr_t, pixel *, intptr_t, int );
 #define x264_mc_copy_w8_neon x264_template(mc_copy_w8_neon)
-void x264_mc_copy_w8_neon ( uint8_t *, intptr_t, uint8_t *, intptr_t, int );
+void x264_mc_copy_w8_neon ( pixel *, intptr_t, pixel *, intptr_t, int );
 #define x264_mc_copy_w16_neon x264_template(mc_copy_w16_neon)
-void x264_mc_copy_w16_neon( uint8_t *, intptr_t, uint8_t *, intptr_t, int );
+void x264_mc_copy_w16_neon( pixel *, intptr_t, pixel *, intptr_t, int );
 
 #define x264_mc_chroma_neon x264_template(mc_chroma_neon)
-void x264_mc_chroma_neon( uint8_t *, uint8_t *, intptr_t, uint8_t *, intptr_t, int, int, int, int );
+void x264_mc_chroma_neon( pixel *, pixel *, intptr_t, pixel *, intptr_t, int, int, int, int );
 #define x264_integral_init4h_neon x264_template(integral_init4h_neon)
-void x264_integral_init4h_neon( uint16_t *, uint8_t *, intptr_t );
+void x264_integral_init4h_neon( uint16_t *, pixel *, intptr_t );
 #define x264_integral_init4v_neon x264_template(integral_init4v_neon)
 void x264_integral_init4v_neon( uint16_t *, uint16_t *, intptr_t );
 #define x264_integral_init8h_neon x264_template(integral_init8h_neon)
-void x264_integral_init8h_neon( uint16_t *, uint8_t *, intptr_t );
+void x264_integral_init8h_neon( uint16_t *, pixel *, intptr_t );
 #define x264_integral_init8v_neon x264_template(integral_init8v_neon)
 void x264_integral_init8v_neon( uint16_t *, intptr_t );
 #define x264_frame_init_lowres_core_neon x264_template(frame_init_lowres_core_neon)
-void x264_frame_init_lowres_core_neon( uint8_t *, uint8_t *, uint8_t *, uint8_t *, uint8_t *, intptr_t, intptr_t, int, int );
+void x264_frame_init_lowres_core_neon( pixel *, pixel *, pixel *, pixel *, pixel *, intptr_t, intptr_t, int, int );
 
 #define x264_mbtree_propagate_cost_neon x264_template(mbtree_propagate_cost_neon)
 void x264_mbtree_propagate_cost_neon( int16_t *, uint16_t *, uint16_t *, uint16_t *, uint16_t *, float *, int );
@@ -161,7 +159,25 @@ void x264_mbtree_fix8_pack_neon( uint16_t *dst, float *src, int count );
 #define x264_mbtree_fix8_unpack_neon x264_template(mbtree_fix8_unpack_neon)
 void x264_mbtree_fix8_unpack_neon( float *dst, uint16_t *src, int count );
 
-#if !HIGH_BIT_DEPTH
+static void (* const pixel_avg_wtab_neon[6])( pixel *, intptr_t, pixel *, intptr_t, pixel *, int ) =
+{
+    NULL,
+    x264_pixel_avg2_w4_neon,
+    x264_pixel_avg2_w8_neon,
+    x264_pixel_avg2_w16_neon,   // no slower than w12, so no point in a separate function
+    x264_pixel_avg2_w16_neon,
+    x264_pixel_avg2_w20_neon,
+};
+
+static void (* const mc_copy_wtab_neon[5])( pixel *, intptr_t, pixel *, intptr_t, int ) =
+{
+    NULL,
+    x264_mc_copy_w4_neon,
+    x264_mc_copy_w8_neon,
+    NULL,
+    x264_mc_copy_w16_neon,
+};
+
 static void weight_cache_neon( x264_t *h, x264_weight_t *w )
 {
     if( w->i_scale == 1<<w->i_denom )
@@ -183,39 +199,20 @@ static void weight_cache_neon( x264_t *h, x264_weight_t *w )
         w->weightfn = mc_wtab_neon;
 }
 
-static void (* const pixel_avg_wtab_neon[6])( uint8_t *, intptr_t, uint8_t *, intptr_t, uint8_t *, int ) =
-{
-    NULL,
-    x264_pixel_avg2_w4_neon,
-    x264_pixel_avg2_w8_neon,
-    x264_pixel_avg2_w16_neon,   // no slower than w12, so no point in a separate function
-    x264_pixel_avg2_w16_neon,
-    x264_pixel_avg2_w20_neon,
-};
-
-static void (* const mc_copy_wtab_neon[5])( uint8_t *, intptr_t, uint8_t *, intptr_t, int ) =
-{
-    NULL,
-    x264_mc_copy_w4_neon,
-    x264_mc_copy_w8_neon,
-    NULL,
-    x264_mc_copy_w16_neon,
-};
-
-static void mc_luma_neon( uint8_t *dst,    intptr_t i_dst_stride,
-                          uint8_t *src[4], intptr_t i_src_stride,
+static void mc_luma_neon( pixel *dst,    intptr_t i_dst_stride,
+                          pixel *src[4], intptr_t i_src_stride,
                           int mvx, int mvy,
                           int i_width, int i_height, const x264_weight_t *weight )
 {
     int qpel_idx = ((mvy&3)<<2) + (mvx&3);
     intptr_t offset = (mvy>>2)*i_src_stride + (mvx>>2);
-    uint8_t *src1 = src[x264_hpel_ref0[qpel_idx]] + offset;
+    pixel *src1 = src[x264_hpel_ref0[qpel_idx]] + offset;
     if( (mvy&3) == 3 )             // explicit if() to force conditional add
         src1 += i_src_stride;
 
     if( qpel_idx & 5 ) /* qpel interpolation needed */
     {
-        uint8_t *src2 = src[x264_hpel_ref1[qpel_idx]] + offset + ((mvx&3) == 3);
+        pixel *src2 = src[x264_hpel_ref1[qpel_idx]] + offset + ((mvx&3) == 3);
         pixel_avg_wtab_neon[i_width>>2](
                 dst, i_dst_stride, src1, i_src_stride,
                 src2, i_height );
@@ -228,20 +225,20 @@ static void mc_luma_neon( uint8_t *dst,    intptr_t i_dst_stride,
         mc_copy_wtab_neon[i_width>>2]( dst, i_dst_stride, src1, i_src_stride, i_height );
 }
 
-static uint8_t *get_ref_neon( uint8_t *dst,   intptr_t *i_dst_stride,
-                              uint8_t *src[4], intptr_t i_src_stride,
+static pixel *get_ref_neon( pixel *dst,   intptr_t *i_dst_stride,
+                              pixel *src[4], intptr_t i_src_stride,
                               int mvx, int mvy,
                               int i_width, int i_height, const x264_weight_t *weight )
 {
     int qpel_idx = ((mvy&3)<<2) + (mvx&3);
     intptr_t offset = (mvy>>2)*i_src_stride + (mvx>>2);
-    uint8_t *src1 = src[x264_hpel_ref0[qpel_idx]] + offset;
+    pixel *src1 = src[x264_hpel_ref0[qpel_idx]] + offset;
     if( (mvy&3) == 3 )             // explicit if() to force conditional add
         src1 += i_src_stride;
 
     if( qpel_idx & 5 ) /* qpel interpolation needed */
     {
-        uint8_t *src2 = src[x264_hpel_ref1[qpel_idx]] + offset + ((mvx&3) == 3);
+        pixel *src2 = src[x264_hpel_ref1[qpel_idx]] + offset + ((mvx&3) == 3);
         pixel_avg_wtab_neon[i_width>>2](
                 dst, *i_dst_stride, src1, i_src_stride,
                 src2, i_height );
@@ -262,19 +259,18 @@ static uint8_t *get_ref_neon( uint8_t *dst,   intptr_t *i_dst_stride,
 }
 
 #define x264_hpel_filter_neon x264_template(hpel_filter_neon)
-void x264_hpel_filter_neon( uint8_t *dsth, uint8_t *dstv, uint8_t *dstc,
-                            uint8_t *src, intptr_t stride, int width,
+void x264_hpel_filter_neon( pixel *dsth, pixel *dstv, pixel *dstc,
+                            pixel *src, intptr_t stride, int width,
                             int height, int16_t *buf );
 
 PLANE_COPY(16, neon)
 PLANE_COPY_SWAP(16, neon)
 PLANE_INTERLEAVE(neon)
 PROPAGATE_LIST(neon)
-#endif // !HIGH_BIT_DEPTH
 
 void x264_mc_init_aarch64( uint32_t cpu, x264_mc_functions_t *pf )
 {
-#if !HIGH_BIT_DEPTH
+
     if( cpu&X264_CPU_ARMV8 )
     {
         pf->prefetch_fenc_420 = x264_prefetch_fenc_420_aarch64;
@@ -285,20 +281,13 @@ void x264_mc_init_aarch64( uint32_t cpu, x264_mc_functions_t *pf )
     if( !(cpu&X264_CPU_NEON) )
         return;
 
-    pf->copy_16x16_unaligned = x264_mc_copy_w16_neon;
-    pf->copy[PIXEL_16x16]    = x264_mc_copy_w16_neon;
-    pf->copy[PIXEL_8x8]      = x264_mc_copy_w8_neon;
-    pf->copy[PIXEL_4x4]      = x264_mc_copy_w4_neon;
-
-    pf->plane_copy                  = plane_copy_neon;
-    pf->plane_copy_swap             = plane_copy_swap_neon;
-    pf->plane_copy_deinterleave     = x264_plane_copy_deinterleave_neon;
-    pf->plane_copy_deinterleave_rgb = x264_plane_copy_deinterleave_rgb_neon;
-    pf->plane_copy_interleave       = plane_copy_interleave_neon;
+    pf->mbtree_propagate_cost = x264_mbtree_propagate_cost_neon;
+    pf->mbtree_propagate_list = mbtree_propagate_list_neon;
+    pf->mbtree_fix8_pack      = x264_mbtree_fix8_pack_neon;
+    pf->mbtree_fix8_unpack    = x264_mbtree_fix8_unpack_neon;
 
-    pf->load_deinterleave_chroma_fdec = x264_load_deinterleave_chroma_fdec_neon;
-    pf->load_deinterleave_chroma_fenc = x264_load_deinterleave_chroma_fenc_neon;
-    pf->store_interleave_chroma       = x264_store_interleave_chroma_neon;
+    pf->memcpy_aligned  = x264_memcpy_aligned_neon;
+    pf->memzero_aligned = x264_memzero_aligned_neon;
 
     pf->avg[PIXEL_16x16] = x264_pixel_avg_16x16_neon;
     pf->avg[PIXEL_16x8]  = x264_pixel_avg_16x8_neon;
@@ -310,6 +299,11 @@ void x264_mc_init_aarch64( uint32_t cpu, x264_mc_functions_t *pf )
     pf->avg[PIXEL_4x4]   = x264_pixel_avg_4x4_neon;
     pf->avg[PIXEL_4x2]   = x264_pixel_avg_4x2_neon;
 
+    pf->copy_16x16_unaligned = x264_mc_copy_w16_neon;
+    pf->copy[PIXEL_16x16]    = x264_mc_copy_w16_neon;
+    pf->copy[PIXEL_8x8]      = x264_mc_copy_w8_neon;
+    pf->copy[PIXEL_4x4]      = x264_mc_copy_w4_neon;
+
     pf->weight       = mc_wtab_neon;
     pf->offsetadd    = mc_offsetadd_wtab_neon;
     pf->offsetsub    = mc_offsetsub_wtab_neon;
@@ -318,20 +312,30 @@ void x264_mc_init_aarch64( uint32_t cpu, x264_mc_functions_t *pf )
     pf->mc_chroma = x264_mc_chroma_neon;
     pf->mc_luma = mc_luma_neon;
     pf->get_ref = get_ref_neon;
-    pf->hpel_filter = x264_hpel_filter_neon;
-    pf->frame_init_lowres_core = x264_frame_init_lowres_core_neon;
 
     pf->integral_init4h = x264_integral_init4h_neon;
     pf->integral_init8h = x264_integral_init8h_neon;
     pf->integral_init4v = x264_integral_init4v_neon;
     pf->integral_init8v = x264_integral_init8v_neon;
 
-    pf->mbtree_propagate_cost = x264_mbtree_propagate_cost_neon;
-    pf->mbtree_propagate_list = mbtree_propagate_list_neon;
-    pf->mbtree_fix8_pack      = x264_mbtree_fix8_pack_neon;
-    pf->mbtree_fix8_unpack    = x264_mbtree_fix8_unpack_neon;
+    pf->frame_init_lowres_core = x264_frame_init_lowres_core_neon;
+
+    pf->load_deinterleave_chroma_fdec = x264_load_deinterleave_chroma_fdec_neon;
+    pf->load_deinterleave_chroma_fenc = x264_load_deinterleave_chroma_fenc_neon;
+
+    pf->store_interleave_chroma       = x264_store_interleave_chroma_neon;
+
+    pf->plane_copy                  = plane_copy_neon;
+    pf->plane_copy_swap             = plane_copy_swap_neon;
+    pf->plane_copy_deinterleave     = x264_plane_copy_deinterleave_neon;
+    pf->plane_copy_deinterleave_rgb = x264_plane_copy_deinterleave_rgb_neon;
+    pf->plane_copy_interleave       = plane_copy_interleave_neon;
+
+    pf->hpel_filter = x264_hpel_filter_neon;
+
+#if !HIGH_BIT_DEPTH
+
+
 
-    pf->memcpy_aligned  = x264_memcpy_aligned_neon;
-    pf->memzero_aligned = x264_memzero_aligned_neon;
 #endif // !HIGH_BIT_DEPTH
 }
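
A note on the prototype changes above: switching uint8_t to pixel is what lets the same
declarations serve both template builds, since x264 widens the sample type with bit depth,
roughly as in the sketch below (the exact definition lives in the x264 headers):

#include <stdint.h>
#if HIGH_BIT_DEPTH
typedef uint16_t pixel;   /* 10-bit samples in 16-bit lanes; the new asm uses .8h vectors */
#else
typedef uint8_t  pixel;   /* 8-bit samples; the existing asm path uses .8b/.16b vectors */
#endif

This is also why the 10-bit assembly shifts every stride and width left by one
(bytes = 2 * pixels) and clamps hpel results with mvni ..., #0xfc, lsl #8, which
materializes 0x03ff = 1023, the largest legal 10-bit sample value.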



View it on GitLab: https://code.videolan.org/videolan/x264/-/compare/834c5c92db67bf34467915305544ad8c4fe97657...cc5c343f432ba7c6ce1e11aa49cbb718e7e4710e
