[x265] [arm64] port scale1D_128to64 and scale2D_64to32

Sat Jul 31 05:14:29 UTC 2021

I have no ideas for significantly improving performance; the macro helps keep the code readable.
A few small comments:
Moving the SUB so that it follows an LD1 would hide some of the memory latency, as would interleaving each ST1 with the next LD1, etc.
But in these cases the code becomes harder to read, so I do not suggest these changes.

Regards,
Min Chen

At 2021-07-31 12:14:29, "Pop, Sebastian" <spop at amazon.com> wrote:

Hi,

Please let me know if you have ideas on how to make this code faster.

I tried to remove the stall by issuing the loads earlier, but there is still no change in performance:

// void scale2D_64to32(pixel* dst, const pixel* src, intptr_t stride)
// x0 = dst, x1 = src, x2 = stride

// Reduce two 64-pixel source rows (\v0-\v3 and \v4-\v7) to one 32-pixel output row.
// Defined before the function because the assembler needs the macro before its first use.
.macro scale2D_1 v0, v1, v2, v3, v4, v5, v6, v7
    // Pairwise-add horizontally adjacent bytes into 16-bit lanes.
    uaddlp          \v0\().8h, \v0\().16b
    uaddlp          \v1\().8h, \v1\().16b
    uaddlp          \v2\().8h, \v2\().16b
    uaddlp          \v3\().8h, \v3\().16b
    uaddlp          \v4\().8h, \v4\().16b
    uaddlp          \v5\().8h, \v5\().16b
    uaddlp          \v6\().8h, \v6\().16b
    uaddlp          \v7\().8h, \v7\().16b
    // Add the upper and lower rows.
    add             \v0\().8h, \v0\().8h, \v4\().8h
    add             \v1\().8h, \v1\().8h, \v5\().8h
    add             \v2\().8h, \v2\().8h, \v6\().8h
    add             \v3\().8h, \v3\().8h, \v7\().8h
    // Rounding shift right by 2, narrowing back to bytes.
    uqrshrn         \v0\().8b, \v0\().8h, #2
    uqrshrn2        \v0\().16b, \v1\().8h, #2
    uqrshrn         \v1\().8b, \v2\().8h, #2
    uqrshrn2        \v1\().16b, \v3\().8h, #2
    st1             {\v0\().16b-\v1\().16b}, [x0], #32
.endm

function x265_scale2D_64to32_neon
    mov             w12, #15
    // Preload the first two source rows.
    ld1             {v0.16b-v3.16b}, [x1], x2
    ld1             {v4.16b-v7.16b}, [x1], x2
.loop_scale2D:
    sub             w12, w12, #1
    // Load the next row pair before processing the current one.
    ld1             {v20.16b-v23.16b}, [x1], x2
    ld1             {v24.16b-v27.16b}, [x1], x2
    scale2D_1 v0, v1, v2, v3, v4, v5, v6, v7
    ld1             {v0.16b-v3.16b}, [x1], x2
    ld1             {v4.16b-v7.16b}, [x1], x2
    scale2D_1 v20, v21, v22, v23, v24, v25, v26, v27
    cbnz            w12, .loop_scale2D
    // Tail: load the last row pair and process the two remaining pairs.
    ld1             {v20.16b-v23.16b}, [x1], x2
    ld1             {v24.16b-v27.16b}, [x1], x2
    scale2D_1 v0, v1, v2, v3, v4, v5, v6, v7
    scale2D_1 v20, v21, v22, v23, v24, v25, v26, v27
    ret
endfunc
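
For reference, here is a scalar C sketch of what the scale2D_1 macro appears to compute for each output pixel: the rounded average of a 2x2 block, written contiguously with 32 pixels per output row. This is only a sketch for checking the NEON output against. The function name scale2D_64to32_ref and the 8-bit pixel typedef are assumptions made for illustration and are not part of the patch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uint8_t pixel; /* assumption: 8-bit pixels, matching the .16b loads */

/* Hypothetical scalar reference: 64x64 source -> 32x32 destination.
 * stride is the source row pitch in pixels; the destination is packed
 * (32 pixels per row), matching the "st1 ..., [x0], #32" post-increment. */
static void scale2D_64to32_ref(pixel *dst, const pixel *src, intptr_t stride)
{
    assert(dst != NULL && src != NULL);

    for (int y = 0; y < 32; y++) {
        const pixel *row0 = src + (2 * y) * stride; /* upper source row */
        const pixel *row1 = row0 + stride;          /* lower source row */

        for (int x = 0; x < 32; x++) {
            /* uaddlp: horizontal pair sums; add: sum of the two rows */
            unsigned sum = row0[2 * x] + row0[2 * x + 1]
                         + row1[2 * x] + row1[2 * x + 1];
            /* uqrshrn #2: add 2, shift right by 2, narrow to 8 bits.
             * The sum is at most 4 * 255, so saturation never triggers. */
            dst[y * 32 + x] = (pixel)((sum + 2) >> 2);
        }
    }
}
```

Comparing the assembly output with this routine on random 64x64 blocks would be one way to confirm that a rescheduled version still produces identical results.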

The only change I made was to further optimize for code size by re-rolling the loop that was unrolled 2x.

No change in performance, and 2x smaller code.

Sebastian
