[x265] [PATCH] aarch64/pixel-util.S: Improve satd_4x4_neon
George Steed
george.steed at arm.com
Mon Dec 16 17:02:11 UTC 2024
The lane-indexed LD1 load instructions imply a dependency on the
previous value of the vector register to maintain the values in lanes
not loaded. On larger micro-architectures this introduces an unnecessary
dependency chain which limits the ability of the core to execute
out-of-order.
To avoid introducing this dependency, use the scalar LDR instructions
to load the lowest lane of the vector. A scalar LDR zeroes the top
portion of the vector rather than trying to maintain the previous value
of the upper lanes.
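The difference between the two load forms can be sketched with a small
portable C model. This is only an illustration and is not part of the
patch: the vreg_model type and the model_* functions are hypothetical
names that mimic the register semantics described above, they are not
NEON intrinsics. The point is that the lane load takes the old register
value as an input, while the scalar load does not.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical model of a 128-bit vector register as four 32-bit lanes. */
typedef struct { uint32_t s[4]; } vreg_model;

/* Models "ld1 {vN.s}[lane], [ptr]": only one lane is written, so the
 * result depends on the previous register contents (the 'old' input). */
static inline vreg_model model_ld1_lane(vreg_model old, int lane,
                                        const uint8_t *ptr)
{
    vreg_model r = old;              /* untouched lanes are carried over */
    memcpy(&r.s[lane], ptr, 4);
    return r;
}

/* Models "ldr sN, [ptr]": lane 0 is loaded and the remaining lanes are
 * zeroed, so there is no input from the previous register contents. */
static inline vreg_model model_ldr_s(const uint8_t *ptr)
{
    vreg_model r = { { 0, 0, 0, 0 } };
    memcpy(&r.s[0], ptr, 4);
    return r;
}

/* Self-check: load two 4-pixel rows the way the new macro does
 * (scalar load for lane 0, lane load for lane 1). Returns 0 on success. */
static inline int model_self_check(void)
{
    const uint8_t row0[4] = { 1, 2, 3, 4 };
    const uint8_t row1[4] = { 5, 6, 7, 8 };
    uint32_t w0, w1;
    memcpy(&w0, row0, 4);
    memcpy(&w1, row1, 4);

    vreg_model v = model_ldr_s(row0);    /* starts a fresh chain      */
    v = model_ld1_lane(v, 1, row1);      /* depends only on the LDR   */

    assert(v.s[0] == w0 && v.s[1] == w1);
    assert(v.s[2] == 0 && v.s[3] == 0);  /* upper half is zero        */
    return 0;
}
```

In the macro, the zeroed upper lanes are harmless because the following
usubl reads only the low 64 bits of each register (v0.8b, v1.8b).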
On a Neoverse V2 machine this results in a 62% reduction in times
reported for the SATD 4x4 benchmarks, and a 65% reduction for the SATD
8x4 benchmarks.
---
source/common/aarch64/pixel-util.S | 13 +++++++++----
1 file changed, 9 insertions(+), 4 deletions(-)
diff --git a/source/common/aarch64/pixel-util.S b/source/common/aarch64/pixel-util.S
index 5d8cc8c8e..d8b3f4365 100644
--- a/source/common/aarch64/pixel-util.S
+++ b/source/common/aarch64/pixel-util.S
@@ -609,13 +609,18 @@ endfunc
//******* satd *******
.macro satd_4x4_neon
- ld1 {v0.s}[0], [x0], x1
+ ldr s0, [x0]
+ ldr s1, [x2]
+ add x0, x0, x1
+ add x2, x2, x3
ld1 {v0.s}[1], [x0], x1
- ld1 {v1.s}[0], [x2], x3
ld1 {v1.s}[1], [x2], x3
- ld1 {v2.s}[0], [x0], x1
+
+ ldr s2, [x0]
+ ldr s3, [x2]
+ add x0, x0, x1
+ add x2, x2, x3
ld1 {v2.s}[1], [x0], x1
- ld1 {v3.s}[0], [x2], x3
ld1 {v3.s}[1], [x2], x3
usubl v4.8h, v0.8b, v1.8b
--
2.34.1
-------------- next part --------------
A non-text attachment was scrubbed...
Name: 0001-aarch64-pixel-util.S-Improve-satd_4x4_neon.patch
Type: text/x-diff
Size: 1929 bytes
Desc: not available
URL: <http://mailman.videolan.org/pipermail/x265-devel/attachments/20241216/94b03761/attachment.patch>