[x265] [PATCH 0/8] AArch64 SAD/SADxN Optimisations

Tue May 28 10:03:43 UTC 2024

Hi Chen,

Thank you for reviewing the patches.

>In this case, replace LD1 by LDR+ADD is not get benefit

Here, the existing instruction `ld1  {v0.s}[0], [x0], x1` is a read-modify-write operation: it writes only lane 0 and preserves the other lanes, so it creates a false dependency on the previous value of the register. Replacing this initial load with an LDR instruction removes the dependency, because LDR overwrites the entire register.
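
To illustrate the difference, here is a minimal portable C model of the two load forms. It is only a sketch of the register semantics, not code from the patch: the `vreg` type and the `model_*` helper names are hypothetical, and the exact `ldr s0, [x0]` / `add x0, x0, x1` spelling in the comments is an assumed form of the LDR+ADD replacement.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical model of a 128-bit Neon register as four 32-bit lanes. */
typedef struct { uint32_t s[4]; } vreg;

/* Models `ld1 {v0.s}[0], [x0], x1`: only lane 0 is written, so the
 * result has to be built from the OLD register value (read-modify-write).
 * The post-increment of x0 by x1 is not modelled here. */
static vreg model_ld1_lane0(vreg old, const uint8_t *ptr)
{
    vreg r = old;               /* lanes 1..3 are carried over from `old` */
    memcpy(&r.s[0], ptr, 4);    /* lane 0 receives the 4 loaded bytes */
    return r;
}

/* Models `ldr s0, [x0]` (assumed form; the pointer update would be a
 * separate `add x0, x0, x1`): lane 0 is loaded and the upper lanes are
 * zeroed, so the result does not depend on the old register value. */
static vreg model_ldr_s(const uint8_t *ptr)
{
    vreg r = { { 0, 0, 0, 0 } };
    memcpy(&r.s[0], ptr, 4);
    return r;
}

/* Small self-check: both forms load the same lane 0, but only the
 * lane-insert form lets stale upper lanes leak into the result. */
static void demo_false_dependency(void)
{
    const uint8_t row[4] = { 1, 2, 3, 4 };
    vreg stale = { { 0xdeadbeefu, 0x11111111u, 0x22222222u, 0x33333333u } };

    vreg a = model_ld1_lane0(stale, row);
    vreg b = model_ldr_s(row);

    assert(a.s[0] == b.s[0]);           /* same loaded data in lane 0 */
    assert(a.s[1] == stale.s[1]);       /* ld1 keeps the old upper lanes */
    assert(b.s[1] == 0 && b.s[2] == 0 && b.s[3] == 0); /* ldr clears them */
}
```

In the model, `model_ld1_lane0` needs the old register value as an input, while `model_ldr_s` does not. That extra input is the dependency the hardware has to track for the lane-insert form.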

The speed-test results for the width-4 block sizes, relative to the existing Neon code on a Neoverse V1 machine, are:

sad[4x4]     | 2.94x
sad[4x8]     | 3.47x
sad[4x16]    | 2.49x
sad_x3[4x4]  | 1.94x
sad_x3[4x8]  | 1.59x
sad_x3[4x16] | 1.46x
sad_x4[4x4]  | 1.59x
sad_x4[4x8]  | 1.45x
sad_x4[4x16] | 1.27x

Many thanks,

Hari