[x265] [PATCH 0/7] AArch64 saoCuStats Optimisations

Wed May 22 19:09:26 UTC 2024

Hi Chen,

Thank you for reviewing the patches.

>In signOf_neon
>>+ // signOf(a - b) = -(a > b) | (b > a)
>comments is not clear, suggest
>-(a > b ? -1 : 0) | ( a < b)

I have posted updated versions of patches 3, 4 and 6 to make these comments clearer with respect to the possible outputs of Neon comparison instructions.
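
To illustrate the ambiguity, here is a minimal scalar sketch (not the patch code). It assumes the usual Neon convention that a compare sets a lane to all ones (-1 when read as signed) when true and to 0 when false, unlike a C comparison, which yields 1. The names lane_gt and sign_of_diff are hypothetical and exist only for this example.

```c
#include <stdint.h>

/* Emulates one lane of a Neon compare-greater-than:
 * all bits set (-1 as a signed value) when true, 0 when false. */
static int8_t lane_gt(int8_t a, int8_t b)
{
    return (int8_t)(a > b ? -1 : 0);
}

/* signOf(a - b) built from two compares, a negate and an OR. */
static int8_t sign_of_diff(int8_t a, int8_t b)
{
    int8_t pos = (int8_t)-lane_gt(a, b); /*  1 when a > b, else 0 */
    int8_t neg = lane_gt(b, a);          /* -1 when b > a, else 0 */
    return (int8_t)(pos | neg);          /*  1, -1, or 0 when equal */
}
```

With C comparison semantics (true is 1) the same expression would give the opposite sign for a > b, which is why the comment needs to spell out the compare outputs.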

>In saoCuStatsBO_neon
>It is memory bandwidth optimize only, interval memory access strong depends on CPU pipeline design and
>compiler, it is not generic, not sure how about on other kind of CPUs.

Yes, it is primarily a memory bandwidth optimisation. We have tested with recent GCC and Clang on a range of Neoverse CPUs and found it to be faster than the C implementation.

>In saoCuStatsE*_neon
>No comments, it looks vmulq_s16+vmlaq_s16 reduce 1 instruction than vandq_s16+vandq_s16+vaddq_s16 or tbl/tbx,
>it mostly faster on modern CPUs

Yes, we found this instruction sequence to be faster than the alternatives for the Neon implementation.
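
As a rough scalar illustration of the instruction-count comparison only (this is not the patch code, and the exact operands in the patches may differ): assuming two values are each gated by a selector, an AND-based form needs two ANDs plus an add, while a multiply-based form folds the add into a multiply-accumulate. The function names and the mask/flag encodings below are assumptions made for this example.

```c
#include <stdint.h>

/* AND-based form: masks are all ones (-1) or 0.
 * Corresponds to an and + and + add sequence (three operations). */
static int16_t select_sum_and(int16_t x, int16_t mask_x,
                              int16_t y, int16_t mask_y)
{
    return (int16_t)((x & mask_x) + (y & mask_y));
}

/* Multiply-based form: flags are 1 or 0.
 * Corresponds to a multiply followed by a multiply-accumulate
 * (two operations). */
static int16_t select_sum_mla(int16_t x, int16_t flag_x,
                              int16_t y, int16_t flag_y)
{
    int16_t acc = (int16_t)(x * flag_x); /* multiply */
    return (int16_t)(acc + y * flag_y);  /* multiply-accumulate */
}
```

Both forms return the same sum for matching selectors; which one is faster in practice depends on the core, as noted above.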

Many thanks,

Hari