[x265] [PATCH 0/8] AArch64 SAD/SADxN Optimisations

Tue May 28 14:26:32 UTC 2024

Hi Hari,

Thank you for explaining in more detail.

It is my fault: I did not point out in my last comment that we could replace LD1 with LD1R.

The instruction `ld1 {v0.s}[0], [x0], x1` is not good here because the partial register write creates a false dependency on the old value of v0, but `ld1r {v0.2s}, [x0], x1` may avoid the issue, since it writes every lane of the destination.
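
To make the comparison concrete, here is a rough sketch of the three load variants under discussion. The registers (x0 as the pointer, x1 as the stride, v0 as the destination) follow the instructions quoted in this thread. The exact LDR+ADD sequence is my assumption of how the patch advances the pointer, not a copy of it.

```
// Variant 1: lane insert. Only lane 0 is written and the other lanes of
// v0 are preserved, so the load has to merge with the previous value of v0.
ld1     {v0.s}[0], [x0], x1

// Variant 2: scalar load plus a separate pointer update. LDR writes s0 and
// zeroes the rest of v0, so the old value of v0 is not needed.
ldr     s0, [x0]
add     x0, x0, x1

// Variant 3: load and replicate. The 32-bit element is copied into both
// lanes of v0.2s and the upper 64 bits are zeroed, again with no merge,
// while keeping the post-indexed register stride in one instruction.
ld1r    {v0.2s}, [x0], x1
```

Variants 2 and 3 both overwrite the whole register. Whether LD1R is actually faster than LDR+ADD on a given core would need to be measured.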

Could you please take a look at the performance with LD1R?

Regards,

Chen

At 2024-05-28 18:03:43, "Hari Limaye" <hari.limaye at arm.com> wrote:
>Hi Chen,
>
>Thank you for reviewing the patches.
>
>>In this case, replacing LD1 with LDR+ADD does not give any benefit
>
>Here, the existing instruction `ld1  {v0.s}[0], [x0], x1` is a read-modify-write operation and so creates a false dependency on the previous value of the register. Replacing this initial load with an LDR instruction removes this issue, as it is a completely destructive operation.
>
>The speed-test results for the block sizes with width 4, compared to the existing Neon code on a Neoverse V1 machine, are:
>
>sad[4x4]     | 2.94x
>sad[4x8]     | 3.47x
>sad[4x16]    | 2.49x
>sad_x3[4x4]  | 1.94x
>sad_x3[4x8]  | 1.59x
>sad_x3[4x16] | 1.46x
>sad_x4[4x4]  | 1.59x
>sad_x4[4x8]  | 1.45x
>sad_x4[4x16] | 1.27x
>
>Many thanks,
>
>Hari