[x265] [arm64] port scale1D_128to64 and scale2D_64to32
chen
chenm003 at 163.com
Sat Jul 31 05:14:29 UTC 2021
I have no idea to significant improve performance, the macro helpful code readable.
some little comment:
move SUB follow by LD1 will hidden memory operator latency, also mixed ST1 with next LD1, etc.
But in these case the code readable became bad, so I do not suggest these adjust.
Regards,
Min Chen
At 2021-07-31 12:14:29, "Pop, Sebastian" <spop at amazon.com> wrote:
Hi,
Please let me know if you have ideas on how to make this code faster.
I tried to remove the stall by fetching more memory earlier, still no change in performance:
// void scale2D_64to32(pixel* dst, const pixel* src, intptr_t stride)
function x265_scale2D_64to32_neon
mov w12, #15
ld1 {v0.16b-v3.16b}, [x1], x2
ld1 {v4.16b-v7.16b}, [x1], x2
.loop_scale2D:
sub w12, w12, #1
ld1 {v20.16b-v23.16b}, [x1], x2
ld1 {v24.16b-v27.16b}, [x1], x2
scale2D_1 v0, v1, v2, v3, v4, v5, v6, v7
ld1 {v0.16b-v3.16b}, [x1], x2
ld1 {v4.16b-v7.16b}, [x1], x2
scale2D_1 v20, v21, v22, v23, v24, v25, v26, v27
cbnz w12, .loop_scale2D
ld1 {v20.16b-v23.16b}, [x1], x2
ld1 {v24.16b-v27.16b}, [x1], x2
scale2D_1 v0, v1, v2, v3, v4, v5, v6, v7
scale2D_1 v20, v21, v22, v23, v24, v25, v26, v27
ret
endfunc
.macro scale2D_1 v0, v1, v2, v3, v4, v5, v6, v7
uaddlp \v0\().8h, \v0\().16b
uaddlp \v1\().8h, \v1\().16b
uaddlp \v2\().8h, \v2\().16b
uaddlp \v3\().8h, \v3\().16b
uaddlp \v4\().8h, \v4\().16b
uaddlp \v5\().8h, \v5\().16b
uaddlp \v6\().8h, \v6\().16b
uaddlp \v7\().8h, \v7\().16b
add \v0\().8h, \v0\().8h, \v4\().8h
add \v1\().8h, \v1\().8h, \v5\().8h
add \v2\().8h, \v2\().8h, \v6\().8h
add \v3\().8h, \v3\().8h, \v7\().8h
uqrshrn \v0\().8b, \v0\().8h, #2
uqrshrn2 \v0\().16b, \v1\().8h, #2
uqrshrn \v1\().8b, \v2\().8h, #2
uqrshrn2 \v1\().16b, \v3\().8h, #2
st1 {\v0\().16b-\v1\().16b}, [x0], #32
.endm
The only change that I did is to further optimize for code size by re-rolling the loop that was unrolled 2x.
No change in performance, and 2x smaller code.
Sebastian
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://mailman.videolan.org/pipermail/x265-devel/attachments/20210731/121fe5da/attachment-0001.html>
More information about the x265-devel
mailing list