[x265] [PATCH v2 0/7] AArch64: Add Neon impl of transform functions

Wed Dec 4 15:37:27 UTC 2024

Hi Chen,

Thank you for your comments. Please see the replies below.

>May we merge these memory load with pair or 4-element load instruction?
>Storage instruction is same

The compiler should reliably generate LDP/STP instructions for the existing intrinsics code.
Using the 4-register load intrinsics (vld1_s16_x4 etc.) has a couple of additional disadvantages
compared to the existing code:

1) Older compilers (especially GCC) emit a lot of unnecessary MOV instructions around the 
multi-register load/store instructions.

2) Using plain loads/stores allows the compiler to more easily elide the stores and loads to the
temporary buffer between calls to e.g. fastForwardDst4_neon, whereas with the 4-register load
instructions the stores and loads to the temporary buffer remain in the generated code.
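
As a rough illustration of the two load styles compared above, here is a minimal sketch. It is
not code from the patch: pass_plain, pass_x4 and the two_pass_* wrappers are hypothetical names,
and the add/subtract butterfly only stands in for a real transform pass. The scalar fallback is
there so the sketch also builds on non-AArch64 hosts.

```cpp
#include <cstdint>

#if defined(__aarch64__)
#include <arm_neon.h>

// Hypothetical 4x4 pass using four plain 64-bit loads and four plain stores.
static inline void pass_plain(const int16_t *in, int16_t *out)
{
    int16x4_t r0 = vld1_s16(in + 0);
    int16x4_t r1 = vld1_s16(in + 4);
    int16x4_t r2 = vld1_s16(in + 8);
    int16x4_t r3 = vld1_s16(in + 12);
    vst1_s16(out + 0,  vadd_s16(r0, r3));
    vst1_s16(out + 4,  vadd_s16(r1, r2));
    vst1_s16(out + 8,  vsub_s16(r1, r2));
    vst1_s16(out + 12, vsub_s16(r0, r3));
}

// Same arithmetic, using the 4-register load/store intrinsics instead.
static inline void pass_x4(const int16_t *in, int16_t *out)
{
    int16x4x4_t r = vld1_s16_x4(in);
    int16x4x4_t o;
    o.val[0] = vadd_s16(r.val[0], r.val[3]);
    o.val[1] = vadd_s16(r.val[1], r.val[2]);
    o.val[2] = vsub_s16(r.val[1], r.val[2]);
    o.val[3] = vsub_s16(r.val[0], r.val[3]);
    vst1_s16_x4(out, o);
}
#else
// Scalar fallback (assumption: only needed to build the sketch off-target).
static inline void pass_plain(const int16_t *in, int16_t *out)
{
    for (int j = 0; j < 4; j++)
    {
        out[0 + j]  = (int16_t)(in[0 + j] + in[12 + j]);
        out[4 + j]  = (int16_t)(in[4 + j] + in[8 + j]);
        out[8 + j]  = (int16_t)(in[4 + j] - in[8 + j]);
        out[12 + j] = (int16_t)(in[0 + j] - in[12 + j]);
    }
}

static inline void pass_x4(const int16_t *in, int16_t *out)
{
    pass_plain(in, out);
}
#endif

// Two passes through a temporary buffer, mirroring the two-stage transform structure.
void two_pass_plain(const int16_t *src, int16_t *dst)
{
    int16_t tmp[16];
    pass_plain(src, tmp);
    pass_plain(tmp, dst);
}

void two_pass_x4(const int16_t *src, int16_t *dst)
{
    int16_t tmp[16];
    pass_x4(src, tmp);
    pass_x4(tmp, dst);
}
```

Both wrappers compute the same result. Whether the stores and loads to tmp are actually elided
depends on the compiler and its version, so the generated assembly for the two variants is the
thing to compare.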

>In optimize version, we need not this wrapper functions, especially memcpy, it made slower performance

For the forward transforms, the memcpy part is effectively removed by the compiler, so removing
the memcpy from the intrinsics code generates the same assembly as the current code.

However, for the inverse transforms, there seems to be some benefit in removing the memcpy part,
so I removed it only for the inverse transforms in this v2 patch set.
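
For reference, the wrapper pattern under discussion looks roughly like the sketch below. This is
an illustration on the input side only, not the code in dct-prim.cpp: transpose4x4,
transpose4x4_strided and the two wrapper_* functions are hypothetical names, and a plain
transpose stands in for the real transform kernel.

```cpp
#include <cstdint>
#include <cstring>

// Hypothetical kernel: transposes a contiguous 4x4 block (stand-in for a real transform).
static void transpose4x4(const int16_t *block, int16_t *dst)
{
    for (int i = 0; i < 4; i++)
        for (int j = 0; j < 4; j++)
            dst[j * 4 + i] = block[i * 4 + j];
}

// Same kernel, but reading each row directly from the strided source.
static void transpose4x4_strided(const int16_t *src, intptr_t srcStride, int16_t *dst)
{
    for (int i = 0; i < 4; i++)
        for (int j = 0; j < 4; j++)
            dst[j * 4 + i] = src[i * srcStride + j];
}

// Wrapper that first copies the strided rows into a contiguous local block.
void wrapper_with_memcpy(const int16_t *src, int16_t *dst, intptr_t srcStride)
{
    alignas(32) int16_t block[4 * 4];
    for (int i = 0; i < 4; i++)
        std::memcpy(&block[i * 4], &src[i * srcStride], 4 * sizeof(int16_t));
    transpose4x4(block, dst);
}

// Wrapper without the copy: the kernel handles the stride itself.
void wrapper_without_memcpy(const int16_t *src, int16_t *dst, intptr_t srcStride)
{
    transpose4x4_strided(src, srcStride, dst);
}
```

The two wrappers return identical results. Whether the copy survives in the generated code
depends on the compiler and on inlining, so it is worth checking the assembly for each case.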

Many thanks,
Micro

Micro Daryl Robles (7):
  AArch64: Add Neon implementation of 4x4 DST
  AArch64: Add Neon implementation of 4x4 IDST
  AArch64: Add Neon implementation of 4x4 DCT
  AArch64: Add Neon implementation of 4x4 IDCT
  AArch64: Add Neon implementation of 8x8 IDCT
  AArch64: Improve the Neon implementation of 16x16 IDCT
  AArch64: Improve the Neon implementation of 32x32 IDCT

 source/common/aarch64/dct-prim.cpp | 1442 +++++++++++++++++++++-------
 1 file changed, 1104 insertions(+), 338 deletions(-)

-- 
2.34.1