[x265] [PATCH 0/7] AArch64 saoCuStats Optimisations
Hari Limaye
hari.limaye at arm.com
Wed May 22 19:09:26 UTC 2024
Hi Chen,
Thank you for reviewing the patches.
>In signOf_neon
>>+ // signOf(a - b) = -(a > b) | (b > a)
>The comment is not clear; I suggest:
>-(a > b ? -1 : 0) | ( a < b)
I have posted updated versions of patches 3, 4 and 6 that make these comments clearer with respect to the possible outputs of Neon comparison instructions.
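For illustration only, here is a minimal scalar sketch of the identity the comment describes. It is not the code from the patches. It assumes 8-bit lanes, and the helper names mask_gt and sign_of_diff are hypothetical. The one fact it relies on is that a Neon integer comparison sets every bit of a lane (-1 when read as signed) if the condition holds and clears the lane otherwise.

```c
#include <stdint.h>

/* Emulates one lane of a Neon "compare greater than":
 * all ones (-1 as a signed value) when a > b, otherwise 0.
 * Hypothetical helper, for illustration only. */
int8_t mask_gt(uint8_t a, uint8_t b)
{
    return (a > b) ? -1 : 0;
}

/* signOf(a - b) built from two comparison masks:
 *   a > b : -(-1) | 0  =  1
 *   a < b : -(0)  | -1 = -1
 *   a == b: -(0)  | 0  =  0
 */
int8_t sign_of_diff(uint8_t a, uint8_t b)
{
    int8_t gt = mask_gt(a, b); /* -1 if a > b, else 0 */
    int8_t lt = mask_gt(b, a); /* -1 if b > a, else 0 */
    return (int8_t)(-gt | lt);
}
```

Negating the first mask turns -1 into +1, and OR-ing in the second mask yields -1 when b is larger, so no subtraction or widening of the unsigned inputs is needed.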
>In saoCuStatsBO_neon
>It is a memory bandwidth optimisation only. Interval memory access depends strongly on CPU pipeline design and the compiler, so it is not generic; I am not sure how it performs on other kinds of CPUs.
Yes, it is primarily a memory bandwidth optimisation. We have tested it with recent GCC and Clang on a range of Neoverse CPUs and found it to be faster than the C implementation.
>In saoCuStatsE*_neon
>No comments. It looks like vmulq_s16+vmlaq_s16 uses one instruction fewer than vandq_s16+vandq_s16+vaddq_s16 or tbl/tbx, and it is mostly faster on modern CPUs.
Yes, we found this instruction sequence to be faster than the alternatives for the Neon implementation.
Many thanks,
Hari