[x265] [PATCH v3 0/7] AArch64: Add Neon impl of transform functions
Micro Daryl Robles
microdaryl.robles at arm.com
Fri Dec 13 23:16:24 UTC 2024
Hi Chen,
Thank you for your comments. See my replies below.
>>-static void transpose_4x4x16(int16x4_t &x0, int16x4_t &x1, int16x4_t &x2, int16x4_t &x3)
>>+static inline void transpose_4x4_s16(int16x4_t &s0, int16x4_t &s1, int16x4_t &s2, int16x4_t &s3)
>> {
>>- int32x2_t s0, s1, s2, s3;
>>+ int16x8_t s0q = vcombine_s16(s0, vdup_n_s16(0));
>>+ int16x8_t s1q = vcombine_s16(s1, vdup_n_s16(0));
>>+ int16x8_t s2q = vcombine_s16(s2, vdup_n_s16(0));
>
>>+ int16x8_t s3q = vcombine_s16(s3, vdup_n_s16(0));
>
>
>Why clear the high 64 bits? They will be overwritten by the ZIP1 below.
>
>>
>>+ int16x8x2_t s0123 = vzipq_s16(s02, s13);
>
The vcombine_s16(s0, vdup_n_s16(0)) is effectively a no-op, since the preceding 64-bit vld1
instruction already has the implicit effect of clearing the top half of the 128-bit vector.
The vcombine intrinsic just lets the compiler treat s0 as a full 128-bit vector so that we can
use the vzip1q intrinsic to zip the lower 64 bits of data.
This approach is faster than using vcombine(s0, s1), which needs an extra MOV instruction to
move s1 into the upper 64 bits.
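For illustration, the full transpose with this scheme looks roughly like the sketch below
(the vzip1q_s16 steps producing s02/s13 are filled in here and may differ in detail from the
actual patch):

    static inline void transpose_4x4_s16(int16x4_t &s0, int16x4_t &s1, int16x4_t &s2, int16x4_t &s3)
    {
        // Widen to 128-bit; the upper halves are already zero after the 64-bit loads,
        // so these vcombine calls emit no extra instructions.
        int16x8_t s0q = vcombine_s16(s0, vdup_n_s16(0));
        int16x8_t s1q = vcombine_s16(s1, vdup_n_s16(0));
        int16x8_t s2q = vcombine_s16(s2, vdup_n_s16(0));
        int16x8_t s3q = vcombine_s16(s3, vdup_n_s16(0));

        // First interleave: pair rows 0/2 and 1/3.
        // s02 = { s0[0], s2[0], s0[1], s2[1], s0[2], s2[2], s0[3], s2[3] }
        int16x8_t s02 = vzip1q_s16(s0q, s2q);
        int16x8_t s13 = vzip1q_s16(s1q, s3q);

        // Second interleave completes the transpose:
        // val[0] holds columns 0 and 1, val[1] holds columns 2 and 3.
        int16x8x2_t s0123 = vzipq_s16(s02, s13);

        s0 = vget_low_s16(s0123.val[0]);
        s1 = vget_high_s16(s0123.val[0]);
        s2 = vget_low_s16(s0123.val[1]);
        s3 = vget_high_s16(s0123.val[1]);
    }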
>
>>+void dst4_neon(const int16_t *src, int16_t *dst, intptr_t srcStride)
>>+{
>>+ const int shift_pass1 = 1 + X265_DEPTH - 8;
>>+ const int shift_pass2 = 8;
>>+
>>+ ALIGN_VAR_32(int16_t, coef[4 * 4]);
>>+ ALIGN_VAR_32(int16_t, block[4 * 4]);
>>+
>>+ for (int i = 0; i < 4; i++)
>>+ {
>>+ memcpy(&block[i * 4], &src[i * srcStride], 4 * sizeof(int16_t));
>>+ }
>
>We do not need this loop to copy data from the input buffer.
For the forward DCTs, this memcpy loop is optimized away by the compiler, so I decided to
keep it for consistency with the other forward DCT functions which I have not modified in
this patch set.
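Each iteration copies only 8 bytes, so after inlining the copy boils down to plain 64-bit row
loads; conceptually it behaves like the hypothetical sketch below (row0..row3 are illustrative
names, not code from the patch):

    // What the fixed-size memcpy loop reduces to: four 64-bit row loads,
    // one per line of the 4x4 block.
    int16x4_t row0 = vld1_s16(src + 0 * srcStride);
    int16x4_t row1 = vld1_s16(src + 1 * srcStride);
    int16x4_t row2 = vld1_s16(src + 2 * srcStride);
    int16x4_t row3 = vld1_s16(src + 3 * srcStride);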
>>+template<int shift>
>>+static inline void inverseDst4_neon(const int16_t *src, int16_t *dst, intptr_t dstStride)
>>+{
>>+ int16x4_t s0 = vld1_s16(src + 0);
>>+ int16x4_t s1 = vld1_s16(src + 4);
>s0 and s1 may be loaded by a 128-bit instruction
These loads are merged by the compiler into LDP d0, d1, so they take only one instruction.
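For comparison, the explicit 128-bit form would look like the sketch below, but it brings no
benefit here since LDP already fetches both halves in a single instruction:

    // Hypothetical alternative: one 128-bit load, then split into halves.
    int16x8_t s01 = vld1q_s16(src);
    int16x4_t s0  = vget_low_s16(s01);
    int16x4_t s1  = vget_high_s16(s01);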
>>+static inline void transpose_4x8_s16(int16x4_t s0, int16x4_t s1, int16x4_t s2, int16x4_t s3,
>>+ int16x4_t s4, int16x4_t s5, int16x4_t s6, int16x4_t s7,
>>+ int16x8_t &d0, int16x8_t &d1, int16x8_t &d2, int16x8_t &d3)
>>+{
>>+ int16x8_t s0q = vcombine_s16(s0, vdup_n_s16(0));
>>+ int16x8_t s1q = vcombine_s16(s1, vdup_n_s16(0));
>>+ int16x8_t s2q = vcombine_s16(s2, vdup_n_s16(0));
>>+ int16x8_t s3q = vcombine_s16(s3, vdup_n_s16(0));
>>+ int16x8_t s4q = vcombine_s16(s4, vdup_n_s16(0));
>>+ int16x8_t s5q = vcombine_s16(s5, vdup_n_s16(0));
>>+ int16x8_t s6q = vcombine_s16(s6, vdup_n_s16(0));
>
>>+ int16x8_t s7q = vcombine_s16(s7, vdup_n_s16(0));
>Same as previous: clearing the high 64 bits is unnecessary.
Same explanation as in my first comment above:
these vcombine_s16(sx, vdup_n_s16(0)) instructions are effectively no-ops, since the preceding
64-bit vld1 instructions already have the implicit effect of clearing the top half of the
128-bit vectors. The vcombine just lets the compiler use the vzip1q intrinsic to zip the lower
64 bits of data.
>
>>+template<int shift>
>>+static inline void partialButterflyInverse8_neon(const int16_t *src, int16_t *dst, intptr_t dstStride)
>>+ if (vget_lane_u64(vreinterpret_u64_s16(vget_low_s16(s3)), 0) != 0)
>
>Detecting zeros is a good idea; however, 4 instructions are not enough to hide the pipeline flush cost. I suggest combining each pair of the if-sections below (O_lo & O_hi) into one.
>
>
>>+ {
>>+ O_lo[0] = vmlal_lane_s16(O_lo[0], vget_low_s16(s3), c_odd, 1); // 75
>>+ O_lo[1] = vmlsl_lane_s16(O_lo[1], vget_low_s16(s3), c_odd, 3); // -18
>>+ O_lo[2] = vmlsl_lane_s16(O_lo[2], vget_low_s16(s3), c_odd, 0); // -89
>>+ O_lo[3] = vmlsl_lane_s16(O_lo[3], vget_low_s16(s3), c_odd, 2); // -50
>>+ }
>>+ if (vget_lane_u64(vreinterpret_u64_s16(vget_high_s16(s3)), 0) != 0)
>>+ {
>>+ O_hi[0] = vmlal_lane_s16(O_hi[0], vget_high_s16(s3), c_odd, 1); // 75
>>+ O_hi[1] = vmlsl_lane_s16(O_hi[1], vget_high_s16(s3), c_odd, 3); // -18
>>+ O_hi[2] = vmlsl_lane_s16(O_hi[2], vget_high_s16(s3), c_odd, 0); // -89
>>+ O_hi[3] = vmlsl_lane_s16(O_hi[3], vget_high_s16(s3), c_odd, 2); // -50
>>+ }
Thank you for the suggestion; I have taken it on board and am sending my patch v3.
I have modified this to use (vaddlvq_u32(vreinterpretq_u32_s16(s3)) != 0).
This unsigned add-long sums the four 32-bit lanes into a single unsigned 64-bit value, so there
is no risk of a non-zero input overflowing to a zero sum. This should be better than using
vorr(vget_low, vget_high).
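Roughly, the combined check in v3 looks like the sketch below (reconstructed from the v2
excerpt above, so the surrounding names follow that code):

    // Skip all eight odd-part multiply-accumulates when the whole s3 row is zero.
    // vaddlvq_u32 (UADDLV) sums the four unsigned 32-bit lanes into a 64-bit scalar,
    // so a non-zero vector can never fold to a zero sum.
    if (vaddlvq_u32(vreinterpretq_u32_s16(s3)) != 0)
    {
        O_lo[0] = vmlal_lane_s16(O_lo[0], vget_low_s16(s3), c_odd, 1);  //  75
        O_lo[1] = vmlsl_lane_s16(O_lo[1], vget_low_s16(s3), c_odd, 3);  // -18
        O_lo[2] = vmlsl_lane_s16(O_lo[2], vget_low_s16(s3), c_odd, 0);  // -89
        O_lo[3] = vmlsl_lane_s16(O_lo[3], vget_low_s16(s3), c_odd, 2);  // -50
        O_hi[0] = vmlal_lane_s16(O_hi[0], vget_high_s16(s3), c_odd, 1); //  75
        O_hi[1] = vmlsl_lane_s16(O_hi[1], vget_high_s16(s3), c_odd, 3); // -18
        O_hi[2] = vmlsl_lane_s16(O_hi[2], vget_high_s16(s3), c_odd, 0); // -89
        O_hi[3] = vmlsl_lane_s16(O_hi[3], vget_high_s16(s3), c_odd, 2); // -50
    }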
Many thanks,
Micro
Micro Daryl Robles (7):
AArch64: Add Neon implementation of 4x4 DST
AArch64: Add Neon implementation of 4x4 IDST
AArch64: Add Neon implementation of 4x4 DCT
AArch64: Add Neon implementation of 4x4 IDCT
AArch64: Add Neon implementation of 8x8 IDCT
AArch64: Improve the Neon implementation of 16x16 IDCT
AArch64: Improve the Neon implementation of 32x32 IDCT
source/common/aarch64/dct-prim.cpp | 1440 +++++++++++++++++++++-------
1 file changed, 1102 insertions(+), 338 deletions(-)
--
2.34.1