[x265] [PATCH] aarch64/pixel-util.S: Improve satd_4x4_neon

Mon Dec 16 17:02:11 UTC 2024

The lane-indexed LD1 load instructions imply a dependency on the
previous value of the vector register, because the lanes that are not
loaded must keep their existing values. On larger micro-architectures
this introduces an unnecessary dependency chain that limits the ability
of the core to execute out-of-order.

To avoid introducing this dependency, use the scalar LDR instruction to
load the lowest lane of the vector. LDR zeroes the upper portion of the
vector rather than preserving the previous values of the upper lanes.

On a Neoverse V2 machine this reduces the times reported for the SATD
4x4 benchmarks by 62%, and those for the SATD 8x4 benchmarks by 65%.
---
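Note (not part of the commit message): the difference between the two
load forms can be sketched with a small portable C model. This is only
an illustration under stated assumptions, not code from x265. The names
vreg_model, model_ld1_lane, model_ldr_s, load_two_rows_old and
load_two_rows_new are hypothetical and exist only in this note. The
model treats a 128-bit vector register as four 32-bit lanes and shows
that the lane-indexed load takes the old register value as an input,
while the scalar load does not.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical model of a 128-bit vector register as four 32-bit lanes. */
typedef struct { uint32_t s[4]; } vreg_model;

/* Models `ld1 {vN.s}[lane], [ptr]`: one lane is overwritten and the other
 * lanes are carried over, so the result depends on the old register value. */
static vreg_model model_ld1_lane(vreg_model old, int lane, const uint8_t *ptr)
{
    vreg_model r = old;
    memcpy(&r.s[lane], ptr, sizeof(uint32_t));
    return r;
}

/* Models `ldr sN, [ptr]`: lane 0 is loaded and the upper lanes are zeroed,
 * so there is no input register and no dependency on earlier contents. */
static vreg_model model_ldr_s(const uint8_t *ptr)
{
    vreg_model r = { {0, 0, 0, 0} };
    memcpy(&r.s[0], ptr, sizeof(uint32_t));
    return r;
}

/* Sequence before this patch: both rows use lane-indexed loads, so the
 * first load has to wait for whatever last wrote the register. */
static vreg_model load_two_rows_old(vreg_model stale, const uint8_t *pix,
                                    intptr_t stride)
{
    vreg_model v = model_ld1_lane(stale, 0, pix);
    return model_ld1_lane(v, 1, pix + stride);
}

/* Sequence after this patch: the first row uses the scalar load, which
 * starts a fresh dependency chain; only the second row merges a lane. */
static vreg_model load_two_rows_new(const uint8_t *pix, intptr_t stride)
{
    vreg_model v = model_ldr_s(pix);
    return model_ld1_lane(v, 1, pix + stride);
}
```

In this model both sequences leave the same data in lanes 0 and 1, which
are the only lanes read by the following `usubl v4.8h, v0.8b, v1.8b`.
They differ only in lanes 2 and 3, which keep stale data in the old
sequence and are zero in the new one.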
 source/common/aarch64/pixel-util.S | 13 +++++++++----
 1 file changed, 9 insertions(+), 4 deletions(-)

diff --git a/source/common/aarch64/pixel-util.S b/source/common/aarch64/pixel-util.S
index 5d8cc8c8e..d8b3f4365 100644
--- a/source/common/aarch64/pixel-util.S
+++ b/source/common/aarch64/pixel-util.S
@@ -609,13 +609,18 @@ endfunc
 
 //******* satd *******
 .macro satd_4x4_neon
-    ld1             {v0.s}[0], [x0], x1
+    ldr             s0, [x0]
+    ldr             s1, [x2]
+    add             x0, x0, x1
+    add             x2, x2, x3
     ld1             {v0.s}[1], [x0], x1
-    ld1             {v1.s}[0], [x2], x3
     ld1             {v1.s}[1], [x2], x3
-    ld1             {v2.s}[0], [x0], x1
+
+    ldr             s2, [x0]
+    ldr             s3, [x2]
+    add             x0, x0, x1
+    add             x2, x2, x3
     ld1             {v2.s}[1], [x0], x1
-    ld1             {v3.s}[0], [x2], x3
     ld1             {v3.s}[1], [x2], x3
 
     usubl           v4.8h, v0.8b, v1.8b
-- 
2.34.1
