[x265] [PATCH 0/7] AArch64 saoCuStats Optimisations

Thu May 23 02:11:22 UTC 2024

Hi Hari,

The new patches look good to me now. Thank you for the updates.

Regards,

Chen

At 2024-05-23 03:09:26, "Hari Limaye" <hari.limaye at arm.com> wrote:
>Hi Chen,
>
>Thank you for reviewing the patches.
>
>>In signOf_neon
>>>+ // signOf(a - b) = -(a > b) | (b > a)
>>The comment is not clear; I suggest:
>>-(a > b ? -1 : 0) | ( a < b)
>
>I have posted updated versions of patches 3, 4, 6 to make these comments more clear with respect to the possible outputs of Neon comparison instructions.
>
>>In saoCuStatsBO_neon
>>It is a memory bandwidth optimisation only. Interval memory access depends strongly on CPU pipeline design and compiler, so it is not generic; I am not sure how it behaves on other kinds of CPUs.
>
>Yes, it is primarily a memory bandwidth optimisation. We have tested with recent GCC and Clang on a range of Neoverse CPUs and find it to be faster than the C implementation.
>
>>In saoCuStatsE*_neon
>>No comments. It looks like vmulq_s16+vmlaq_s16 uses one instruction fewer than vandq_s16+vandq_s16+vaddq_s16 or tbl/tbx, so it is mostly faster on modern CPUs.
>
>Yes, we found that this instruction sequence was faster than the alternatives, for the Neon implementation.
>
>Many thanks,
>
>Hari
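
For anyone reading the archive, the signOf identity discussed above can be modelled in scalar C. This is only a sketch, not the patch code: lane_gt and sign_of_model are hypothetical names, and lane_gt assumes the Neon convention that a true comparison lane is all ones (-1) and a false lane is 0.

```c
#include <stdint.h>

/* Scalar stand-in for one lane of a Neon signed compare:
 * assumed to yield all ones (-1) when true and 0 when false. */
static int16_t lane_gt(int16_t a, int16_t b)
{
    return (a > b) ? (int16_t)-1 : (int16_t)0;
}

/* signOf(a - b) = -(a > b) | (b > a), with each comparison
 * producing -1 or 0 as above:
 *   a >  b:  -(-1) |  0  =  1
 *   a <  b:  -( 0) | -1  = -1
 *   a == b:     0  |  0  =  0 */
static int16_t sign_of_model(int16_t a, int16_t b)
{
    int16_t pos = (int16_t)-lane_gt(a, b); /* 1 when a > b, else 0 */
    int16_t neg = lane_gt(b, a);           /* -1 when b > a, else 0 */
    return (int16_t)(pos | neg);
}
```

Under that assumption, sign_of_model(5, 3) gives 1, sign_of_model(3, 5) gives -1, and sign_of_model(4, 4) gives 0.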