[x265] [PATCH v2 0/7] AArch64: Add Neon impl of transform functions
Micro Daryl Robles
microdaryl.robles at arm.com
Wed Dec 4 15:37:27 UTC 2024
Hi Chen,
Thank you for your comments. Please see the replies below.
>May we merge these memory load with pair or 4-element load instruction?
>Storage instruction is same
The compiler should reliably generate LDP/STP instructions for the existing intrinsics code.
Using the 4-register load intrinsics (vld1_s16_x4 etc.) has a couple of additional disadvantages
compared to the existing code:
1) Older compilers (especially GCC) emit a lot of unnecessary MOV instructions around the
multi-register load/store instructions.
2) Using plain loads/stores allows the compiler to more easily elide the stores and loads to the
temporary buffer between calls to e.g. fastForwardDst4_neon, whereas with the 4-register load
intrinsics the stores and loads to the temporary buffer remain in the generated code.
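
To make the two load styles concrete, here is a minimal sketch. It is not code from
dct-prim.cpp: the function names (sum_rows_plain, sum_rows_x4, sum_rows_scalar) and the
row-sum operation are hypothetical and only give the loaded registers something to do.
The Neon variants are guarded so the file still builds on non-AArch64 hosts.

```cpp
#include <cstdint>

#if defined(__aarch64__)
#include <arm_neon.h>

// Variant A: four plain 64-bit loads of adjacent rows. Per the discussion
// above, the compiler is expected to be free to pair these into LDP.
static inline void sum_rows_plain(const int16_t *src, int16_t *dst)
{
    int16x4_t r0 = vld1_s16(src + 0);
    int16x4_t r1 = vld1_s16(src + 4);
    int16x4_t r2 = vld1_s16(src + 8);
    int16x4_t r3 = vld1_s16(src + 12);
    vst1_s16(dst, vadd_s16(vadd_s16(r0, r1), vadd_s16(r2, r3)));
}

// Variant B: one 4-register load of the same 16 contiguous elements.
static inline void sum_rows_x4(const int16_t *src, int16_t *dst)
{
    int16x4x4_t r = vld1_s16_x4(src);
    vst1_s16(dst, vadd_s16(vadd_s16(r.val[0], r.val[1]),
                           vadd_s16(r.val[2], r.val[3])));
}
#endif

// Scalar reference: column-wise sum of a contiguous 4x4 block of int16_t.
static inline void sum_rows_scalar(const int16_t *src, int16_t *dst)
{
    for (int c = 0; c < 4; c++)
        dst[c] = (int16_t)(src[c] + src[4 + c] + src[8 + c] + src[12 + c]);
}
```

Both Neon variants read the same memory and give the same result; any difference is in the
instructions the compiler generates, which is best checked by inspecting the disassembly
for the compiler versions of interest.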
>In optimize version, we need not this wrapper functions, especially memcpy, it made slower performance
For the forward transforms, the compiler effectively removes the memcpy, so removing it from
the intrinsics code generates the same assembly as the current code.
For the inverse transforms, however, removing the memcpy does seem to give some benefit,
so in this v2 patch set I removed it only for the inverse transforms.
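
As a rough illustration of the two shapes being compared for the inverse transforms, here
is a minimal sketch. It is not the code from the patch: the function names are hypothetical,
and the "transform" is a stand-in that simply doubles each coefficient so the example stays
short. Only the data movement (temporary block plus memcpy versus writing straight to the
strided destination) is the point.

```cpp
#include <cstdint>
#include <cstring>

// Hypothetical stand-in for the final inverse-transform pass. It writes a
// contiguous 4x4 result; the real kernels are not reproduced here.
static void pass_to_block(const int16_t *src, int16_t *block)
{
    for (int i = 0; i < 16; i++)
        block[i] = (int16_t)(src[i] * 2);
}

// Variant A: wrapper style. Compute into a contiguous temporary block, then
// memcpy each 4-element row to the strided destination.
static void inverse_with_memcpy(const int16_t *src, int16_t *dst, intptr_t dstStride)
{
    int16_t block[16];
    pass_to_block(src, block);
    for (int i = 0; i < 4; i++)
        memcpy(&dst[i * dstStride], &block[i * 4], 4 * sizeof(int16_t));
}

// Variant B: the final pass stores each row directly at its strided
// destination, so no temporary block or memcpy is needed.
static void inverse_direct(const int16_t *src, int16_t *dst, intptr_t dstStride)
{
    for (int r = 0; r < 4; r++)
        for (int c = 0; c < 4; c++)
            dst[r * dstStride + c] = (int16_t)(src[r * 4 + c] * 2);
}
```

The two variants produce identical output; whether the compiler manages to fold Variant A
into Variant B on its own has to be checked in the generated assembly.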
Many thanks,
Micro
Micro Daryl Robles (7):
  AArch64: Add Neon implementation of 4x4 DST
  AArch64: Add Neon implementation of 4x4 IDST
  AArch64: Add Neon implementation of 4x4 DCT
  AArch64: Add Neon implementation of 4x4 IDCT
  AArch64: Add Neon implementation of 8x8 IDCT
  AArch64: Improve the Neon implementation of 16x16 IDCT
  AArch64: Improve the Neon implementation of 32x32 IDCT

 source/common/aarch64/dct-prim.cpp | 1442 +++++++++++++++++++++-------
 1 file changed, 1104 insertions(+), 338 deletions(-)
--
2.34.1