[x265] [PATCH v3 0/7] AArch64: Add Neon impl of transform functions

Micro Daryl Robles microdaryl.robles at arm.com
Fri Dec 13 23:16:24 UTC 2024


Hi Chen,

Thank you for your comments. See my replies below.


>>-static void transpose_4x4x16(int16x4_t &x0, int16x4_t &x1, int16x4_t &x2, int16x4_t &x3)
>>+static inline void transpose_4x4_s16(int16x4_t &s0, int16x4_t &s1, int16x4_t &s2, int16x4_t &s3)
>> {
>>-    int32x2_t s0, s1, s2, s3;
>>+    int16x8_t s0q = vcombine_s16(s0, vdup_n_s16(0));
>>+    int16x8_t s1q = vcombine_s16(s1, vdup_n_s16(0));
>>+    int16x8_t s2q = vcombine_s16(s2, vdup_n_s16(0));
>
>>+    int16x8_t s3q = vcombine_s16(s3, vdup_n_s16(0));
>
>
>Why clear the high 64 bits? They will be overwritten by the ZIP1 below
>
>>
>>+    int16x8x2_t s0123 = vzipq_s16(s02, s13);
>

The vcombine_s16(s0, vdup_n_s16(0)) is effectively a no-op: the preceding 64-bit vld1
instruction already clears the top half of the 128-bit vector. The vcombine intrinsic only
lets the compiler treat s0 as a full 128-bit vector, so that we can use the vzip1q intrinsic
to zip the lower 64 bits of data.

This approach is faster than vcombine(s0, s1), which needs an extra MOV instruction to
move s1 into the upper 64 bits.
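
For reference, a minimal sketch of the pattern under discussion, assuming the s02/s13
intermediates from the quoted hunk (the actual v3 code may differ in detail):

#include <arm_neon.h>

static inline void transpose_4x4_s16(int16x4_t &s0, int16x4_t &s1,
                                     int16x4_t &s2, int16x4_t &s3)
{
    // Free zero-extension: the preceding 64-bit LD1 already zeroed the upper
    // half of each 128-bit register; vcombine only widens the type so that
    // the q-form ZIP intrinsics can be used.
    int16x8_t s0q = vcombine_s16(s0, vdup_n_s16(0));
    int16x8_t s1q = vcombine_s16(s1, vdup_n_s16(0));
    int16x8_t s2q = vcombine_s16(s2, vdup_n_s16(0));
    int16x8_t s3q = vcombine_s16(s3, vdup_n_s16(0));

    // Interleave rows 0/2 and 1/3, then interleave those results to
    // complete the 4x4 transpose.
    int16x8_t s02 = vzip1q_s16(s0q, s2q);
    int16x8_t s13 = vzip1q_s16(s1q, s3q);
    int16x8x2_t s0123 = vzipq_s16(s02, s13);

    s0 = vget_low_s16(s0123.val[0]);
    s1 = vget_high_s16(s0123.val[0]);
    s2 = vget_low_s16(s0123.val[1]);
    s3 = vget_high_s16(s0123.val[1]);
}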

>
>>+void dst4_neon(const int16_t *src, int16_t *dst, intptr_t srcStride)
>>+{
>>+    const int shift_pass1 = 1 + X265_DEPTH - 8;
>>+    const int shift_pass2 = 8;
>>+
>>+    ALIGN_VAR_32(int16_t, coef[4 * 4]);
>>+    ALIGN_VAR_32(int16_t, block[4 * 4]);
>>+
>>+    for (int i = 0; i < 4; i++)
>>+    {
>>+        memcpy(&block[i * 4], &src[i * srcStride], 4 * sizeof(int16_t));
>>+    }
>
>We do not need this loop to copy data from the input buffer

For the forward DCTs, this memcpy loop is optimized away by the compiler, so I decided
to keep it for consistency with the other forward DCT functions, which I have not modified
in this patch set.


>>+template<int shift>
>>+static inline void inverseDst4_neon(const int16_t *src, int16_t *dst, intptr_t dstStride)
>>+{
>>+    int16x4_t s0 = vld1_s16(src + 0);
>>+    int16x4_t s1 = vld1_s16(src + 4);
>s0 and s1 may be loaded with a single 128-bit instruction

The compiler merges these two loads into a single LDP d0, d1 instruction, so they take only one instruction.


>>+static inline void transpose_4x8_s16(int16x4_t s0, int16x4_t s1, int16x4_t s2, int16x4_t s3,
>>+                                     int16x4_t s4, int16x4_t s5, int16x4_t s6, int16x4_t s7,
>>+                                     int16x8_t &d0, int16x8_t &d1, int16x8_t &d2, int16x8_t &d3)
>>+{
>>+    int16x8_t s0q = vcombine_s16(s0, vdup_n_s16(0));
>>+    int16x8_t s1q = vcombine_s16(s1, vdup_n_s16(0));
>>+    int16x8_t s2q = vcombine_s16(s2, vdup_n_s16(0));
>>+    int16x8_t s3q = vcombine_s16(s3, vdup_n_s16(0));
>>+    int16x8_t s4q = vcombine_s16(s4, vdup_n_s16(0));
>>+    int16x8_t s5q = vcombine_s16(s5, vdup_n_s16(0));
>>+    int16x8_t s6q = vcombine_s16(s6, vdup_n_s16(0));
>
>>+    int16x8_t s7q = vcombine_s16(s7, vdup_n_s16(0));
>Same as previous: clearing the high 64 bits is unnecessary

Same explanation as my first comment: these vcombine_s16(sx, vdup_n_s16(0)) instructions are
effectively no-ops, since the preceding 64-bit vld1 instructions already clear the top half of
each 128-bit vector. The vcombine only lets the compiler use the vzip1q intrinsic to zip the
lower 64 bits of data.

>
>>+template<int shift>
>>+static inline void partialButterflyInverse8_neon(const int16_t *src, int16_t *dst, intptr_t dstStride)
>>+    if (vget_lane_u64(vreinterpret_u64_s16(vget_low_s16(s3)), 0) != 0)
>
>Detecting zeros is a good idea; however, 4 instructions are not enough to hide the pipeline flush cost. I suggest combining each pair of if-sections below (O_lo & O_hi) into one
>
>
>>+    {
>>+        O_lo[0] = vmlal_lane_s16(O_lo[0], vget_low_s16(s3), c_odd, 1); //  75
>>+        O_lo[1] = vmlsl_lane_s16(O_lo[1], vget_low_s16(s3), c_odd, 3); // -18
>>+        O_lo[2] = vmlsl_lane_s16(O_lo[2], vget_low_s16(s3), c_odd, 0); // -89
>>+        O_lo[3] = vmlsl_lane_s16(O_lo[3], vget_low_s16(s3), c_odd, 2); // -50
>>+    }
>>+    if (vget_lane_u64(vreinterpret_u64_s16(vget_high_s16(s3)), 0) != 0)
>>+    {
>>+        O_hi[0] = vmlal_lane_s16(O_hi[0], vget_high_s16(s3), c_odd, 1); //  75
>>+        O_hi[1] = vmlsl_lane_s16(O_hi[1], vget_high_s16(s3), c_odd, 3); // -18
>>+        O_hi[2] = vmlsl_lane_s16(O_hi[2], vget_high_s16(s3), c_odd, 0); // -89
>>+        O_hi[3] = vmlsl_lane_s16(O_hi[3], vget_high_s16(s3), c_odd, 2); // -50
>>+    }

Thank you for the suggestion; I have adopted it in this v3 of the patch set.

I have changed the check to use (vaddlvq_u32(vreinterpretq_u32_s16(s3)) != 0).

This unsigned add-long across the vector sums the 32-bit lanes into a single unsigned 64-bit
result, so there is no risk of a non-zero input summing to zero. This should be better than
using vorr(vget_low, vget_high).
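
For reference, the combined check now has roughly this shape (a sketch assembled from the
quoted hunk, with O_lo, O_hi, s3, and c_odd as defined there; the actual v3 code may differ):

    // vaddlvq_u32 widens each 32-bit lane to 64 bits before summing, so a
    // non-zero s3 can never produce a zero sum.
    if (vaddlvq_u32(vreinterpretq_u32_s16(s3)) != 0)
    {
        O_lo[0] = vmlal_lane_s16(O_lo[0], vget_low_s16(s3), c_odd, 1); //  75
        O_lo[1] = vmlsl_lane_s16(O_lo[1], vget_low_s16(s3), c_odd, 3); // -18
        O_lo[2] = vmlsl_lane_s16(O_lo[2], vget_low_s16(s3), c_odd, 0); // -89
        O_lo[3] = vmlsl_lane_s16(O_lo[3], vget_low_s16(s3), c_odd, 2); // -50

        O_hi[0] = vmlal_lane_s16(O_hi[0], vget_high_s16(s3), c_odd, 1); //  75
        O_hi[1] = vmlsl_lane_s16(O_hi[1], vget_high_s16(s3), c_odd, 3); // -18
        O_hi[2] = vmlsl_lane_s16(O_hi[2], vget_high_s16(s3), c_odd, 0); // -89
        O_hi[3] = vmlsl_lane_s16(O_hi[3], vget_high_s16(s3), c_odd, 2); // -50
    }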


Many thanks,
Micro

Micro Daryl Robles (7):
  AArch64: Add Neon implementation of 4x4 DST
  AArch64: Add Neon implementation of 4x4 IDST
  AArch64: Add Neon implementation of 4x4 DCT
  AArch64: Add Neon implementation of 4x4 IDCT
  AArch64: Add Neon implementation of 8x8 IDCT
  AArch64: Improve the Neon implementation of 16x16 IDCT
  AArch64: Improve the Neon implementation of 32x32 IDCT

 source/common/aarch64/dct-prim.cpp | 1440 +++++++++++++++++++++-------
 1 file changed, 1102 insertions(+), 338 deletions(-)

-- 
2.34.1


