[x265] [PATCH 0/8] AArch64 SAD/SADxN Optimisations
Hari Limaye
hari.limaye at arm.com
Tue May 28 10:03:43 UTC 2024
Hi Chen,
Thank you for reviewing the patches.
>In this case, replace LD1 by LDR+ADD is not get benefit
Here, the existing instruction `ld1 {v0.s}[0], [x0], x1` writes only lane 0 and preserves the remaining lanes of v0, so it is a read-modify-write operation on the register and creates a false dependency on the register's previous value. Replacing this initial load with an LDR instruction removes the dependency, as LDR overwrites the whole register (the unused upper bits are zeroed).
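To illustrate the difference in register semantics, here is a minimal C sketch. It is only a model for this discussion, not code from the patch series: the `vreg_model` type and both helper names are hypothetical, and it says nothing about timing, only about which inputs each load's result depends on.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical model of a 128-bit Neon register as four 32-bit lanes. */
typedef struct { uint32_t s[4]; } vreg_model;

/* Models `ld1 {v0.s}[0], [x0]`: only lane 0 is written. Lanes 1..3 are
 * carried over from the old value, so the old register is an input. */
static vreg_model model_ld1_lane0(vreg_model old, const uint8_t *ptr)
{
    memcpy(&old.s[0], ptr, sizeof(uint32_t));
    return old;
}

/* Models `ldr s0, [x0]`: lane 0 is loaded and lanes 1..3 are zeroed.
 * The old register value is not an input at all. */
static vreg_model model_ldr_s(const uint8_t *ptr)
{
    vreg_model r = { {0, 0, 0, 0} };
    memcpy(&r.s[0], ptr, sizeof(uint32_t));
    return r;
}
```

In the model, `model_ld1_lane0` has to take the old register as an argument, which is exactly the dependency the hardware has to track. `model_ldr_s` needs only the pointer, so the load can start without waiting for whatever last wrote v0.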
The speed-test results for the width-4 block sizes, measured against the existing Neon code on a Neoverse V1 machine, are:
sad[4x4] | 2.94x
sad[4x8] | 3.47x
sad[4x16] | 2.49x
sad_x3[4x4] | 1.94x
sad_x3[4x8] | 1.59x
sad_x3[4x16] | 1.46x
sad_x4[4x4] | 1.59x
sad_x4[4x8] | 1.45x
sad_x4[4x16] | 1.27x
Many thanks,
Hari