<div data-ntes="ntes_mail_body_root" style="line-height:1.7;color:#000000;font-size:14px;font-family:Arial"><pre>At 2024-11-26 21:24:17, "Micro Daryl Robles" <microdaryl.robles@arm.com> wrote:

>Also optimize transpose_4x4_s16 implementation.

>

>Relative performance compared to scalar C:

>

>  Neoverse N1: 1.63x

>  Neoverse V1: 1.85x

>  Neoverse V2: 2.00x

>---

> source/common/aarch64/dct-prim.cpp | 88 +++++++++++++++++++++++++-----

> 1 file changed, 74 insertions(+), 14 deletions(-)

>


>+template<int shift>

>+static inline void fastForwardDst4_neon(const int16_t *src, int16_t *dst)

>+{

>+    int16x4_t s0 = vld1_s16(src + 0);

>+    int16x4_t s1 = vld1_s16(src + 4);

>+    int16x4_t s2 = vld1_s16(src + 8);

<div>>+    int16x4_t s3 = vld1_s16(src + 12);</div><div>Could these four loads be merged into pair loads or a single 4-register load instruction? The source rows are contiguous here.</div><div><br></div>

>+    vst1_s16(dst + 0, d0);

>+    vst1_s16(dst + 4, d1);

>+    vst1_s16(dst + 8, d2);

<div>>+    vst1_s16(dst + 12, d3);</div><div>The same applies to these store instructions.</div><div><br></div>
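A minimal sketch of what I mean for the loads and stores, not a drop-in replacement. The helper name passThrough4x4_merged is hypothetical and the transform itself is left out, so the rows are written back unchanged. It assumes src and dst each point to 16 contiguous int16_t values, as in fastForwardDst4_neon. The scalar branch is only there so the sketch also builds on non-Arm hosts.

```cpp
#include <cstdint>
#include <cstring>
#if defined(__ARM_NEON)
#include <arm_neon.h>
#endif

// Hypothetical helper: reads a contiguous 4x4 int16_t block with two
// 128-bit loads and writes it back with two 128-bit stores.
static inline void passThrough4x4_merged(const int16_t *src, int16_t *dst)
{
#if defined(__ARM_NEON)
    // Two 128-bit loads replace the four 64-bit vld1_s16 loads.
    int16x8_t s01 = vld1q_s16(src + 0);
    int16x8_t s23 = vld1q_s16(src + 8);

    // Split back into the four row vectors the transform works on.
    int16x4_t s0 = vget_low_s16(s01);
    int16x4_t s1 = vget_high_s16(s01);
    int16x4_t s2 = vget_low_s16(s23);
    int16x4_t s3 = vget_high_s16(s23);

    // (The DST-4 arithmetic would go here; rows pass through unchanged.)

    // Two 128-bit stores replace the four 64-bit vst1_s16 stores.
    vst1q_s16(dst + 0, vcombine_s16(s0, s1));
    vst1q_s16(dst + 8, vcombine_s16(s2, s3));
#else
    // Scalar fallback for hosts without NEON.
    memcpy(dst, src, 16 * sizeof(int16_t));
#endif
}
```

Since the two 128-bit accesses are adjacent, the compiler may be able to pair them into a single LDP/STP of Q registers, but that is worth confirming in the disassembly and with the benchmark numbers.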

<div>>+void dst4_neon(const int16_t *src, int16_t *dst, intptr_t srcStride)</div><div>In the optimized version we should not need this wrapper function, and especially not the memcpy, which slows it down.</div><div><br></div>>+{

>+    const int shift_pass1 = 1 + X265_DEPTH - 8;

>+    const int shift_pass2 = 8;

>+

>+    ALIGN_VAR_32(int16_t, coef[4 * 4]);

>+    ALIGN_VAR_32(int16_t, block[4 * 4]);

>+

>+    for (int i = 0; i < 4; i++)

>+    {

>+        memcpy(&block[i * 4], &src[i * srcStride], 4 * sizeof(int16_t));

>+    }

>+

>+    fastForwardDst4_neon<shift_pass1>(block, coef);

>+    fastForwardDst4_neon<shift_pass2>(coef, dst);

>+}
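To illustrate the memcpy comment: the first pass could read the four rows straight from the strided source, so the block[] staging buffer and the copy loop go away. This is only a sketch under that assumption. The helper name loadRows4x4_strided is hypothetical, and it stores the rows to a contiguous buffer just so the result can be checked; in the real kernel the four vectors would feed the first-pass arithmetic directly. Compilers often turn an 8-byte memcpy into a single load anyway, so the gain needs to be measured.

```cpp
#include <cstdint>
#include <cstring>
#if defined(__ARM_NEON)
#include <arm_neon.h>
#endif

// Hypothetical helper: loads four rows of a strided 4x4 int16_t block
// directly from src, without copying them to a temporary first.
static inline void loadRows4x4_strided(const int16_t *src, intptr_t srcStride,
                                       int16_t *dst)
{
#if defined(__ARM_NEON)
    // One 64-bit load per row, addressed with the source stride.
    int16x4_t s0 = vld1_s16(src + 0 * srcStride);
    int16x4_t s1 = vld1_s16(src + 1 * srcStride);
    int16x4_t s2 = vld1_s16(src + 2 * srcStride);
    int16x4_t s3 = vld1_s16(src + 3 * srcStride);

    // (First-pass arithmetic on s0..s3 would go here.)

    // Write the rows out contiguously so the sketch has a visible result.
    vst1q_s16(dst + 0, vcombine_s16(s0, s1));
    vst1q_s16(dst + 8, vcombine_s16(s2, s3));
#else
    // Scalar fallback for hosts without NEON.
    for (int i = 0; i < 4; i++)
        for (int j = 0; j < 4; j++)
            dst[i * 4 + j] = src[i * srcStride + j];
#endif
}
```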

>

</pre></div>