[x265] [PATCH] aarch64/pixel-util.S: Improve satd_4x4_neon
chen
chenm003 at 163.com
Tue Dec 17 10:02:58 UTC 2024
Hi George,
Looks good, thanks for the patch.
Regards,
Chen
At 2024-12-17 01:02:11, "George Steed" <george.steed at arm.com> wrote:
>The lane-indexed LD1 load instructions imply a dependency on the
>previous value of the vector register to maintain the values in lanes
>not loaded. On larger micro-architectures this introduces an unnecessary
>dependency chain which limits the ability of the core to execute
>out-of-order.
>
>To avoid this dependency being introduced, simply use the scalar LDR
>instructions to load the lowest lane of the vector, this has the effect
>of zeroing the top portion of the vector rather than trying to maintain
>the previous value of the upper lanes.
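>
>The difference in register semantics can be sketched with a small,
>portable C model (illustration only: vec64, ld1_lane0 and ldr_s are
>hypothetical names modelling the two instructions' merge vs. zeroing
>behaviour, not real intrinsics):

```c
#include <stdint.h>

/* Toy model of a 64-bit NEON D-register holding two 32-bit lanes. */
typedef struct { uint32_t lane[2]; } vec64;

/* ld1 {v.s}[0], [p]: inserts into lane 0 but must READ the old
 * register value to preserve lane 1 -- a read-modify-write, hence a
 * dependency on the destination register's previous contents. */
static vec64 ld1_lane0(vec64 prev, const uint32_t *p)
{
    prev.lane[0] = *p;   /* lane 1 carried over from prev */
    return prev;
}

/* ldr s0, [p]: the scalar load writes lane 0 and ZEROES the upper
 * lanes, so it has no dependency on the register's previous value. */
static vec64 ldr_s(const uint32_t *p)
{
    vec64 v = { { *p, 0u } };
    return v;
}
```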
>
>On a Neoverse V2 machine this results in a 62% reduction in times
>reported for the SATD 4x4 benchmarks, and a 65% reduction for the SATD
>8x4 benchmarks.
>---
> source/common/aarch64/pixel-util.S | 13 +++++++++----
> 1 file changed, 9 insertions(+), 4 deletions(-)
>
>diff --git a/source/common/aarch64/pixel-util.S b/source/common/aarch64/pixel-util.S
>index 5d8cc8c8e..d8b3f4365 100644
>--- a/source/common/aarch64/pixel-util.S
>+++ b/source/common/aarch64/pixel-util.S
>@@ -609,13 +609,18 @@ endfunc
>
> //******* satd *******
> .macro satd_4x4_neon
>- ld1 {v0.s}[0], [x0], x1
>+ ldr s0, [x0]
>+ ldr s1, [x2]
>+ add x0, x0, x1
>+ add x2, x2, x3
> ld1 {v0.s}[1], [x0], x1
>- ld1 {v1.s}[0], [x2], x3
> ld1 {v1.s}[1], [x2], x3
>- ld1 {v2.s}[0], [x0], x1
>+
>+ ldr s2, [x0]
>+ ldr s3, [x2]
>+ add x0, x0, x1
>+ add x2, x2, x3
> ld1 {v2.s}[1], [x0], x1
>- ld1 {v3.s}[0], [x2], x3
> ld1 {v3.s}[1], [x2], x3
>
> usubl v4.8h, v0.8b, v1.8b
>--
>2.34.1
>