[x265] [PATCH 1/7] AArch64: Add Neon implementation of 4x4 DST
chen
chenm003 at 163.com
Fri Nov 29 00:45:41 UTC 2024
At 2024-11-26 21:24:17, "Micro Daryl Robles" <microdaryl.robles at arm.com> wrote:
>Also optimize transpose_4x4_s16 implementation.
>
>Relative performance compared to scalar C:
>
> Neoverse N1: 1.63x
> Neoverse V1: 1.85x
> Neoverse V2: 2.00x
>---
> source/common/aarch64/dct-prim.cpp | 88 +++++++++++++++++++++++++-----
> 1 file changed, 74 insertions(+), 14 deletions(-)
>
>+template<int shift>
>+static inline void fastForwardDst4_neon(const int16_t *src, int16_t *dst)
>+{
>+ int16x4_t s0 = vld1_s16(src + 0);
>+ int16x4_t s1 = vld1_s16(src + 4);
>+ int16x4_t s2 = vld1_s16(src + 8);
>+ int16x4_t s3 = vld1_s16(src + 12);
Could we merge these memory loads into a pair load or a single 4-register load instruction?
>+ vst1_s16(dst + 0, d0);
>+ vst1_s16(dst + 4, d1);
>+ vst1_s16(dst + 8, d2);
>+ vst1_s16(dst + 12, d3);
The same applies to these store instructions.
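
To illustrate the load and store comments, here is a minimal sketch, not taken from the patch. The helper name loadStore4x4_merged is hypothetical, the DST arithmetic is omitted, and the block is assumed to be contiguous. It uses two 128-bit loads and two 128-bit stores in place of four 64-bit ones; whether this is faster than what the compiler already generates for adjacent vld1_s16 calls would need to be measured. A scalar fallback is included only so that the sketch also builds on non-Arm hosts.

```cpp
#include <cstdint>
#if defined(__ARM_NEON)
#include <arm_neon.h>
#endif

// Hypothetical helper (not part of the patch): reads a contiguous 4x4
// int16_t block and writes it back out, with the transform step left out.
static inline void loadStore4x4_merged(const int16_t *src, int16_t *dst)
{
#if defined(__ARM_NEON)
    // Two 128-bit loads cover all 16 coefficients.
    int16x8_t s01 = vld1q_s16(src + 0);
    int16x8_t s23 = vld1q_s16(src + 8);

    // Split back into the four 64-bit rows the transform works on.
    int16x4_t s0 = vget_low_s16(s01);
    int16x4_t s1 = vget_high_s16(s01);
    int16x4_t s2 = vget_low_s16(s23);
    int16x4_t s3 = vget_high_s16(s23);

    // ... the DST arithmetic on s0..s3 would go here ...

    // Recombine the rows and write them with two 128-bit stores.
    vst1q_s16(dst + 0, vcombine_s16(s0, s1));
    vst1q_s16(dst + 8, vcombine_s16(s2, s3));
#else
    // Scalar fallback so the sketch builds without NEON.
    for (int i = 0; i < 16; i++)
    {
        dst[i] = src[i];
    }
#endif
}
```

With the arithmetic omitted, the helper simply copies the 16 coefficients, which makes it easy to check that the merged accesses touch the same memory as the four separate ones.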
>+void dst4_neon(const int16_t *src, int16_t *dst, intptr_t srcStride)
In an optimized version we do not need this wrapper function, and especially not the memcpy, which slows it down.
>+{
>+ const int shift_pass1 = 1 + X265_DEPTH - 8;
>+ const int shift_pass2 = 8;
>+
>+ ALIGN_VAR_32(int16_t, coef[4 * 4]);
>+ ALIGN_VAR_32(int16_t, block[4 * 4]);
>+
>+ for (int i = 0; i < 4; i++)
>+ {
>+ memcpy(&block[i * 4], &src[i * srcStride], 4 * sizeof(int16_t));
>+ }
>+
>+ fastForwardDst4_neon<shift_pass1>(block, coef);
>+ fastForwardDst4_neon<shift_pass2>(coef, dst);
>+}
>
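
One way to avoid the memcpy, sketched under assumptions: the first pass could load each row directly from src using srcStride, so no intermediate block[] copy is needed. The helper name loadRows4x4_strided is hypothetical and not part of the patch, and the transform itself is again omitted, so the sketch only shows the strided row loads. The scalar fallback exists only so that it builds on non-Arm hosts.

```cpp
#include <cstdint>
#if defined(__ARM_NEON)
#include <arm_neon.h>
#endif

// Hypothetical helper (not part of the patch): reads four rows of a 4x4
// block straight from a strided source and writes them contiguously.
static inline void loadRows4x4_strided(const int16_t *src, intptr_t srcStride,
                                       int16_t *out)
{
#if defined(__ARM_NEON)
    // Each row is loaded from its strided position; no staging buffer.
    int16x4_t s0 = vld1_s16(src + 0 * srcStride);
    int16x4_t s1 = vld1_s16(src + 1 * srcStride);
    int16x4_t s2 = vld1_s16(src + 2 * srcStride);
    int16x4_t s3 = vld1_s16(src + 3 * srcStride);

    // ... the first-pass DST arithmetic on s0..s3 would go here ...

    vst1q_s16(out + 0, vcombine_s16(s0, s1));
    vst1q_s16(out + 8, vcombine_s16(s2, s3));
#else
    // Scalar fallback so the sketch builds without NEON.
    for (int r = 0; r < 4; r++)
    {
        for (int c = 0; c < 4; c++)
        {
            out[r * 4 + c] = src[r * srcStride + c];
        }
    }
#endif
}
```

In a full implementation the rows would stay in registers between the two passes rather than being stored after the first one; the store here only makes the result observable.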