[x265] [PATCH v2 0/9] AArch64: Optimise DCT primitives

Tue Aug 27 15:10:18 UTC 2024

Hi Chen,

Thank you for reviewing the patches. Please see replies below - we have pushed this updated series to address your comments.

>1. These function will share for 8/10/12 bpp, if move into X265_NS, it will make duplicated copy

If I understand correctly, you are referring to the duplication that will occur when building the "Multi-library interface"? I think this duplication exists already in the current multi bitdepth build, its just that the symbols are defined with internal linkage. By moving things into the X265_NS we don't add any duplicated copies, although we do at least make the functions reference-able from other code. Also if I understand correctly, this should not be an issue for the C implementations - as when building the Multi-library interface you should set EXPORT_C_API=OFF for all but one of the builds.

>2. add new section "namespace X265_NS" before these functions are better than move, it affects code history record.

Updated patch to do this.

>partialButterfly8_neon
For size 8, butterfly E & O is necessary, but EE/EO is not a good idea, Odd spends 8 operators per line, Even spends 4 operators plus 1 temporary store and 2 prepare operators, totally 7 operators with dependency link, looks no performance benefits, especally SVE SVDOT may get more performance with Odd method.

Regarding EE/EO not being a good idea in 8x8: register pressure is also a consideration here, as well as how the end of the first pass interacts with the start of the second (in terms of data positioning once they have been inlined - since everything stays in registers rather than being stored to/loaded from memory.) The current implementation performs better than the suggested implementations as computing EE and EO reduces register pressure (an issue 16-bit dot product cannot mitigate) and also allows the compiler to optimize more effectively across both passes after they have been inlined.

>Code style mismatch in different code section, one line is better.
>+        int32x4_t t01 = vpaddq_s32(vmull_s16(c1, O[j + 0]),
>+                                   vmull_s16(c1, O[j + 1]));
>*** vs
>+        t01 = vpaddq_s32(vmull_s16(c3, O[j + 0]), vmull_s16(c3, O[j + 1]));

The reason for the mismatch here is because the second line goes over the character limit of 80. This formatting is produced automatic formatting software (uncrustify).

>dct8_neon
>Better inline two of partialButterfly8_neon, it reduce some operators, such as int32x4_t c0 = vld1q_s32(t8_even[0]);
>16x16 and 32x2 are similar

The compiler currently inlines these - we have updated patches to be more explicit by using the `inline` keyword for the butterfly helper functions.

>const table may share in between Neon and Sve code

Updated with this done.

Many thanks,
Hari

Hari Limaye (5):
  Move C DCT implementations into X265_NS
  AArch64: Move Neon DCT implementations into X265_NS
  AArch64: Optimise partialButterfly8_neon
  AArch64: Optimise partialButterfly16_neon
  AArch64: Optimise partialButterfly32_neon

Jonathan Wright (4):
  AArch64: Move Neon-SVE bridge helpers into dedicated header
  AArch64: Add SVE implementation of 8x8 DCT
  AArch64: Add SVE implementation of 16x16 DCT
  AArch64: Add SVE implementation of 32x32 DCT

 source/common/CMakeLists.txt             |   2 +-
 source/common/aarch64/asm-primitives.cpp |   1 +
 source/common/aarch64/dct-prim-sve.cpp   | 491 ++++++++++++++++++++
 source/common/aarch64/dct-prim.cpp       | 557 +++++++++++++++--------
 source/common/aarch64/dct-prim.h         |  40 ++
 source/common/aarch64/neon-sve-bridge.h  |  67 +++
 source/common/aarch64/sao-prim.h         |  32 +-
 source/common/dct.cpp                    |  22 +-
 8 files changed, 980 insertions(+), 232 deletions(-)
 create mode 100644 source/common/aarch64/dct-prim-sve.cpp
 create mode 100644 source/common/aarch64/neon-sve-bridge.h

-- 
2.42.1