[x265] [PATCH v3 0/7] AArch64: Add Neon impl of transform functions

chen chenm003 at 163.com
Mon Dec 16 03:38:45 UTC 2024


Hi Micro,

Thanks for the explanation.

1) vcombine_s16(s0, vdup_n_s16(0)): it looks like the compiler does not provide a vreinterpret cast between 64-bit and 128-bit vectors, so we can keep your version; we may optimize it in the future with hand-written assembly.
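For reference, a minimal sketch of that widening idiom (the helper name is mine, not from the patch):

    #include <arm_neon.h>

    // Sketch only: widen a 64-bit vector to 128 bits so the 128-bit zip
    // intrinsics can be used. The 64-bit vld1_s16 load already leaves the
    // upper half of the underlying register zeroed, so the vcombine with
    // zero normally costs no extra instruction.
    static inline int16x8_t load_s16x4_as_q(const int16_t *p)
    {
        int16x4_t lo = vld1_s16(p);               // 64-bit load; upper half implicitly zero
        return vcombine_s16(lo, vdup_n_s16(0));   // expose the full 128-bit register to the compiler
    }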

2)
>>>+    int16x4_t s0 = vld1_s16(src + 0);
>>>+    int16x4_t s1 = vld1_s16(src + 4);
>>s0 and s1 may be loaded by a 128-bit instruction
>
>These loads are properly converted by the compiler using LDP d0, d1 and only take 1 instruction.

In this case, s1 is used by vmull_n_s16 only, so we could combine s0 & s1 into one register; s2/s3 can't be combined.
Since it is a small improvement, we can keep your version and optimize it in the future with hand-written assembly.
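For reference, a rough sketch of the combined load I had in mind (the coefficient name is a placeholder, not the actual DST constant):

    #include <arm_neon.h>

    // Sketch only: since s1 feeds vmull_n_s16 alone, s0 and s1 could share one
    // 128-bit load, with s1 taken as the high half, saving one 64-bit load.
    static inline int32x4_t sketch_combined_load(const int16_t *src, int16_t coeff)
    {
        int16x8_t s01 = vld1q_s16(src);                 // loads src[0..7] in a single instruction
        int16x4_t s0  = vget_low_s16(s01);              // src[0..3], used as before
        (void)s0;                                       // s0 would feed the rest of the butterfly
        return vmull_n_s16(vget_high_s16(s01), coeff);  // s1 (src[4..7]) is used only here
    }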


*)
I have no more comments now; [PATCH v3] looks good to me.

Regards,

Chen

At 2024-12-14 07:16:24, "Micro Daryl Robles" <microdaryl.robles at arm.com> wrote:
>Hi Chen,
>
>Thank you for your comments. See my replies below.
>
>
>>>-static void transpose_4x4x16(int16x4_t &x0, int16x4_t &x1, int16x4_t &x2, int16x4_t &x3)
>>>+static inline void transpose_4x4_s16(int16x4_t &s0, int16x4_t &s1, int16x4_t &s2, int16x4_t &s3)
>>> {
>>>-    int32x2_t s0, s1, s2, s3;
>>>+    int16x8_t s0q = vcombine_s16(s0, vdup_n_s16(0));
>>>+    int16x8_t s1q = vcombine_s16(s1, vdup_n_s16(0));
>>>+    int16x8_t s2q = vcombine_s16(s2, vdup_n_s16(0));
>>
>>>+    int16x8_t s3q = vcombine_s16(s3, vdup_n_s16(0));
>>
>>
>>Why clear the high 64 bits? They will be overwritten by ZIP1 below.
>>
>>>
>>>+    int16x8x2_t s0123 = vzipq_s16(s02, s13);
>>
>
>The vcombine_s16(s0, vdup_n_s16(0)) is a NOP operation, since the previous 64-bit vld1
>instruction already has the implicit effect of clearing the top half of the 128-bit vector.
>So the vcombine intrinsic is just there to allow the compiler to use the whole 128-bit vector
>of s0, so that we can use the vzip1q intrinsic to zip the lower 64-bit data.
>
>This approach is faster than using vcombine(s0, s1), which needs an extra MOV instruction to
>move s1 into the upper 64 bits.
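A small sketch of the difference being described (illustrative only, not the patch code):

    #include <arm_neon.h>

    // Sketch only: when s0 comes from a 64-bit vld1 load, its upper half is
    // already zero, so the compiler can elide this combine entirely.
    static inline int16x8_t widen_with_zero(int16x4_t s0)
    {
        return vcombine_s16(s0, vdup_n_s16(0));
    }

    // Packing two separate 64-bit vectors generally needs a MOV/INS to place
    // s1 into the upper half - the extra instruction mentioned above.
    static inline int16x8_t widen_with_mov(int16x4_t s0, int16x4_t s1)
    {
        return vcombine_s16(s0, s1);
    }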
>
>>
>>>+void dst4_neon(const int16_t *src, int16_t *dst, intptr_t srcStride)
>>>+{
>>>+    const int shift_pass1 = 1 + X265_DEPTH - 8;
>>>+    const int shift_pass2 = 8;
>>>+
>>>+    ALIGN_VAR_32(int16_t, coef[4 * 4]);
>>>+    ALIGN_VAR_32(int16_t, block[4 * 4]);
>>>+
>>>+    for (int i = 0; i < 4; i++)
>>>+    {
>>>+        memcpy(&block[i * 4], &src[i * srcStride], 4 * sizeof(int16_t));
>>>+    }
>>
>>We do not need this loop to copy data from the input buffer
>
>For the Forward DCTs, this memcpy loop is optimized away by the compiler, so I decided
>to keep it for consistency with the other forward DCT functions that
>I have not modified in this patch set.
>
>
>>>+template<int shift>
>>>+static inline void inverseDst4_neon(const int16_t *src, int16_t *dst, intptr_t dstStride)
>>>+{
>>>+    int16x4_t s0 = vld1_s16(src + 0);
>>>+    int16x4_t s1 = vld1_s16(src + 4);
>>s0 and s1 may be loaded by a 128-bit instruction
>
>These loads are properly converted by the compiler using LDP d0, d1 and only take 1 instruction.
>
>
>>>+static inline void transpose_4x8_s16(int16x4_t s0, int16x4_t s1, int16x4_t s2, int16x4_t s3,
>>>+                                     int16x4_t s4, int16x4_t s5, int16x4_t s6, int16x4_t s7,
>>>+                                     int16x8_t &d0, int16x8_t &d1, int16x8_t &d2, int16x8_t &d3)
>>>+{
>>>+    int16x8_t s0q = vcombine_s16(s0, vdup_n_s16(0));
>>>+    int16x8_t s1q = vcombine_s16(s1, vdup_n_s16(0));
>>>+    int16x8_t s2q = vcombine_s16(s2, vdup_n_s16(0));
>>>+    int16x8_t s3q = vcombine_s16(s3, vdup_n_s16(0));
>>>+    int16x8_t s4q = vcombine_s16(s4, vdup_n_s16(0));
>>>+    int16x8_t s5q = vcombine_s16(s5, vdup_n_s16(0));
>>>+    int16x8_t s6q = vcombine_s16(s6, vdup_n_s16(0));
>>
>>>+    int16x8_t s7q = vcombine_s16(s7, vdup_n_s16(0));
>>Same as previous: clearing the high 64 bits is unnecessary
>
>Same explanation as my first comment.
>These vcombine_s16(sx, vdup_n_s16(0)) instructions are effectively NOPs, since the previous
>64-bit vld1 instructions have the implicit effect of clearing the top half of the 128-bit vector.
>The vcombine is just there to allow the compiler to use the vzip1q intrinsic to zip the lower 64-bit data.
>
>>
>>>+template<int shift>
>>>+static inline void partialButterflyInverse8_neon(const int16_t *src, int16_t *dst, intptr_t dstStride)
>>>+    if (vget_lane_u64(vreinterpret_u64_s16(vget_low_s16(s3)), 0) != 0)
>>
>>Detecting zeros is a good idea; however, 4 instructions are not enough to hide the pipeline flush cost. I suggest combining each pair of the if sections below (O_lo & O_hi) into one.
>>
>>
>>>+    {
>>>+        O_lo[0] = vmlal_lane_s16(O_lo[0], vget_low_s16(s3), c_odd, 1); //  75
>>>+        O_lo[1] = vmlsl_lane_s16(O_lo[1], vget_low_s16(s3), c_odd, 3); // -18
>>>+        O_lo[2] = vmlsl_lane_s16(O_lo[2], vget_low_s16(s3), c_odd, 0); // -89
>>>+        O_lo[3] = vmlsl_lane_s16(O_lo[3], vget_low_s16(s3), c_odd, 2); // -50
>>>+    }
>>>+    if (vget_lane_u64(vreinterpret_u64_s16(vget_high_s16(s3)), 0) != 0)
>>>+    {
>>>+        O_hi[0] = vmlal_lane_s16(O_hi[0], vget_high_s16(s3), c_odd, 1); //  75
>>>+        O_hi[1] = vmlsl_lane_s16(O_hi[1], vget_high_s16(s3), c_odd, 3); // -18
>>>+        O_hi[2] = vmlsl_lane_s16(O_hi[2], vget_high_s16(s3), c_odd, 0); // -89
>>>+        O_hi[3] = vmlsl_lane_s16(O_hi[3], vget_high_s16(s3), c_odd, 2); // -50
>>>+    }
>
>Thank you for the suggestion; I have considered it and am sending my patch v3.
>
>I have modified this by using (vaddlvq_u32(vreinterpretq_u32_s16(s3)) != 0).
>
>This unsigned add-long accumulates the 32-bit lanes into a single unsigned 64-bit value, avoiding any risk
>of a non-zero sum overflowing to zero. This should be better than using vorr(vget_low, vget_high).
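A rough sketch of that combined check (illustrative only; the s3 name follows the quoted diff):

    #include <arm_neon.h>

    // Sketch only: one zero test over the whole 128-bit register instead of
    // two 64-bit tests. vaddlvq_u32 sums the four 32-bit lanes into a 64-bit
    // result, so a non-zero s3 cannot sum to zero through overflow.
    static inline bool any_nonzero_s16x8(int16x8_t s3)
    {
        return vaddlvq_u32(vreinterpretq_u32_s16(s3)) != 0;
    }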
>
>
>Many thanks,
>Micro
>
>Micro Daryl Robles (7):
>  AArch64: Add Neon implementation of 4x4 DST
>  AArch64: Add Neon implementation of 4x4 IDST
>  AArch64: Add Neon implementation of 4x4 DCT
>  AArch64: Add Neon implementation of 4x4 IDCT
>  AArch64: Add Neon implementation of 8x8 IDCT
>  AArch64: Improve the Neon implementation of 16x16 IDCT
>  AArch64: Improve the Neon implementation of 32x32 IDCT
>
> source/common/aarch64/dct-prim.cpp | 1440 +++++++++++++++++++++-------
> 1 file changed, 1102 insertions(+), 338 deletions(-)
>
>-- 
>2.34.1
>
>_______________________________________________
>x265-devel mailing list
>x265-devel at videolan.org
>https://mailman.videolan.org/listinfo/x265-devel