[x265] [PATCH 0/8] AArch64 SAD/SADxN Optimisations

Wed May 29 15:45:47 UTC 2024

Hi Hari,

Thank you for the information.

My copy of the A77 document looks older and does not list uOps, so we can keep your LDR+ADD sequence in the patch. Thanks.

Regards,
Chen

At 2024-05-29 19:24:16, "Hari Limaye" <hari.limaye at arm.com> wrote:
>Hi Chen,
>
>Thank you for clarifying.
>
>According to the Arm CPU Software Optimisation Guides, LD1R requires an extra micro-op for the broadcast compared to a regular load (LDR). Benchmarking shows that using LD1R in the SAD functions of width 4 is ~20% slower than using the LDR+ADD sequence.
>
>Many thanks,
>
>Hari