[x265] [PATCH 1/7] AArch64: Add Neon implementation of 4x4 DST

Fri Nov 29 00:45:41 UTC 2024



At 2024-11-26 21:24:17, "Micro Daryl Robles" <microdaryl.robles at arm.com> wrote:
>Also optimize transpose_4x4_s16 implementation.
>
>Relative performance compared to scalar C:
>
>  Neoverse N1: 1.63x
>  Neoverse V1: 1.85x
>  Neoverse V2: 2.00x
>---
> source/common/aarch64/dct-prim.cpp | 88 +++++++++++++++++++++++++-----
> 1 file changed, 74 insertions(+), 14 deletions(-)
>

>+template<int shift>
>+static inline void fastForwardDst4_neon(const int16_t *src, int16_t *dst)
>+{
>+    int16x4_t s0 = vld1_s16(src + 0);
>+    int16x4_t s1 = vld1_s16(src + 4);
>+    int16x4_t s2 = vld1_s16(src + 8);

>+    int16x4_t s3 = vld1_s16(src + 12);
Could we merge these memory loads into a pair load or a 4-element load instruction?


>+    vst1_s16(dst + 0, d0);
>+    vst1_s16(dst + 4, d1);
>+    vst1_s16(dst + 8, d2);

>+    vst1_s16(dst + 12, d3);
The same applies to the store instructions.
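
To illustrate the load and store comments, here is a minimal sketch, not code from the patch. The helper name copy4x4_merged is hypothetical, and it assumes the 4x4 block is contiguous in memory, as it is for the src and dst arguments of fastForwardDst4_neon. It uses two 128-bit accesses in each direction instead of four 64-bit ones. The non-Neon branch is only there so the sketch also builds on other targets.

```cpp
#include <cstdint>
#include <cstring>
#if defined(__ARM_NEON)
#include <arm_neon.h>
#endif

// Hypothetical helper (not part of the patch): move a contiguous 4x4 block
// of int16_t with two 128-bit loads and two 128-bit stores.
static inline void copy4x4_merged(const int16_t *src, int16_t *dst)
{
#if defined(__ARM_NEON)
    // Each 128-bit load covers two rows of four int16_t.
    int16x8_t s01 = vld1q_s16(src + 0);   // rows 0 and 1
    int16x8_t s23 = vld1q_s16(src + 8);   // rows 2 and 3

    // Split into the 64-bit row vectors the transform would operate on.
    int16x4_t s0 = vget_low_s16(s01);
    int16x4_t s1 = vget_high_s16(s01);
    int16x4_t s2 = vget_low_s16(s23);
    int16x4_t s3 = vget_high_s16(s23);

    // (The transform itself would go here; this sketch passes rows through.)

    // Recombine pairs of rows and write them with two 128-bit stores.
    vst1q_s16(dst + 0, vcombine_s16(s0, s1));
    vst1q_s16(dst + 8, vcombine_s16(s2, s3));
#else
    // Scalar fallback with the same observable behaviour.
    std::memcpy(dst, src, 16 * sizeof(int16_t));
#endif
}
```

A compiler may already fuse adjacent 64-bit loads or stores into a pair access, so whether this helps should be checked in the generated assembly and with a benchmark.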


>+void dst4_neon(const int16_t *src, int16_t *dst, intptr_t srcStride)
In the optimized version we do not need this wrapper function, and especially not the memcpy, which slows it down.


>+{
>+    const int shift_pass1 = 1 + X265_DEPTH - 8;
>+    const int shift_pass2 = 8;
>+
>+    ALIGN_VAR_32(int16_t, coef[4 * 4]);
>+    ALIGN_VAR_32(int16_t, block[4 * 4]);
>+
>+    for (int i = 0; i < 4; i++)
>+    {
>+        memcpy(&block[i * 4], &src[i * srcStride], 4 * sizeof(int16_t));
>+    }
>+
>+    fastForwardDst4_neon<shift_pass1>(block, coef);
>+    fastForwardDst4_neon<shift_pass2>(coef, dst);
>+}
>
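
One possible way to drop the memcpy, sketched under assumptions: the first pass could read its four rows straight from the strided source. The helper name load4x4_strided is hypothetical and this is not code from the patch. It writes the rows out only so the result can be checked; in a real kernel the four vectors would feed the first pass directly and never be stored.

```cpp
#include <cstdint>
#if defined(__ARM_NEON)
#include <arm_neon.h>
#endif

// Hypothetical helper (not part of the patch): read four rows of four
// int16_t from a source with row stride srcStride (in elements), without
// first copying them into a temporary block.
static inline void load4x4_strided(const int16_t *src, intptr_t srcStride,
                                   int16_t *rows)
{
#if defined(__ARM_NEON)
    // One 64-bit load per row, addressed with the stride.
    int16x4_t s0 = vld1_s16(src + 0 * srcStride);
    int16x4_t s1 = vld1_s16(src + 1 * srcStride);
    int16x4_t s2 = vld1_s16(src + 2 * srcStride);
    int16x4_t s3 = vld1_s16(src + 3 * srcStride);

    // Stored here only to make the sketch observable; a real first pass
    // would consume s0..s3 directly.
    vst1_s16(rows + 0, s0);
    vst1_s16(rows + 4, s1);
    vst1_s16(rows + 8, s2);
    vst1_s16(rows + 12, s3);
#else
    // Scalar fallback with the same observable behaviour.
    for (int r = 0; r < 4; r++)
        for (int c = 0; c < 4; c++)
            rows[r * 4 + c] = src[r * srcStride + c];
#endif
}
```

With this shape, only the first pass needs the stride; the second pass reads the contiguous coef buffer as before.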