<div data-ntes="ntes_mail_body_root" style="line-height:1.7;color:#000000;font-size:14px;font-family:Arial"><div id="spnEditorContent"><p style="margin: 0;"><br></p></div><pre>At 2024-12-04 23:38:12, "Micro Daryl Robles" <microdaryl.robles@arm.com> wrote:

>+template<int shift>

>+static inline void inverseDst4_neon(const int16_t *src, int16_t *dst, intptr_t dstStride)

>+{

>+    int16x4_t s0 = vld1_s16(src + 0);

<div>>+    int16x4_t s1 = vld1_s16(src + 4);</div><div><u>s0 and s1 could be loaded together with a single 128-bit load instruction.</u></div><div><br></div>>+    int16x4_t s2 = vld1_s16(src + 8);

>+    int16x4_t s3 = vld1_s16(src + 12);

>+

>+    int32x4_t c0 = vaddl_s16(s0, s2);

>+    int32x4_t c1 = vaddl_s16(s2, s3);

>+    int32x4_t c2 = vsubl_s16(s0, s3);

<div>>+    int32x4_t c3 = vmull_n_s16(s1, 74);</div><div><u>With the optimization above, s1 could then be consumed directly by the smull2 instruction.</u></div><div><br></div>
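A rough sketch of the two suggestions above, for illustration only. It assumes AArch64 and that src points to 16 contiguous int16_t values, as in the quoted code. mul74_row1 is a hypothetical helper name, not part of the patch, and the scalar branch is there only so the snippet also builds on non-Arm hosts.

```cpp
#include <cstdint>

#if defined(__aarch64__)
#include <arm_neon.h>
#endif

// Hypothetical helper (not part of the patch): computes c3 = s1 * 74,
// where s1 is src[4..7] of a 4x4 block stored contiguously.
static inline void mul74_row1(const int16_t *src, int32_t *c3_out)
{
#if defined(__aarch64__)
    // One 128-bit load covers src[0..7]: s0 in the low half, s1 in the high half.
    int16x8_t s01 = vld1q_s16(src);

    // s0 remains available for the vaddl/vsubl steps of the original code.
    int16x4_t s0 = vget_low_s16(s01);
    (void)s0;

    // Widening multiply of the high half by a scalar; this intrinsic is
    // expected to map to SMULL2, so no separate extract of s1 is needed.
    int32x4_t c3 = vmull_high_n_s16(s01, 74);
    vst1q_s32(c3_out, c3);
#else
    // Scalar fallback (assumption: only for building on non-AArch64 hosts).
    for (int i = 0; i < 4; i++)
        c3_out[i] = static_cast<int32_t>(src[4 + i]) * 74;
#endif
}
```

Whether this is actually faster than two 64-bit loads plus smull would need to be measured on the target cores.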

</pre></div>