[x265] [PATCH] aarch64/pixel-util.S: Improve satd_4x4_neon
chen
chenm003 at 163.com
Tue Dec 17 10:02:58 UTC 2024
Hi George,
Looks good, thanks for the patch.
Regards,
Chen
At 2024-12-17 01:02:11, "George Steed" <george.steed at arm.com> wrote:
>The lane-indexed LD1 load instructions imply a dependency on the
>previous value of the vector register to maintain the values in lanes
>not loaded. On larger micro-architectures this introduces an unnecessary
>dependency chain which limits the ability of the core to execute
>out-of-order.
>
>To avoid this dependency being introduced, simply use the scalar LDR
>instructions to load the lowest lane of the vector, this has the effect
>of zeroing the top portion of the vector rather than trying to maintain
>the previous value of the upper lanes.
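>
>The difference in register semantics can be sketched with a small,
>portable C model (illustration only: vec64, ld1_lane0 and ldr_s are
>hypothetical names modelling the two instructions' merge vs. zeroing
>behaviour, not real intrinsics):

```c
#include <stdint.h>

/* Toy model of a 64-bit NEON D-register holding two 32-bit lanes. */
typedef struct { uint32_t lane[2]; } vec64;

/* ld1 {v.s}[0], [p]: inserts into lane 0 but must READ the old
 * register value to preserve lane 1 -- a read-modify-write, hence a
 * dependency on the destination register's previous contents. */
static vec64 ld1_lane0(vec64 prev, const uint32_t *p)
{
    prev.lane[0] = *p;   /* lane 1 carried over from prev */
    return prev;
}

/* ldr s0, [p]: the scalar load writes lane 0 and ZEROES the upper
 * lanes, so it has no dependency on the register's previous value. */
static vec64 ldr_s(const uint32_t *p)
{
    vec64 v = { { *p, 0u } };
    return v;
}
```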
>
>On a Neoverse V2 machine this results in a 62% reduction in times
>reported for the SATD 4x4 benchmarks, and a 65% reduction for the SATD
>8x4 benchmarks.
>---
> source/common/aarch64/pixel-util.S | 13 +++++++++----
> 1 file changed, 9 insertions(+), 4 deletions(-)
>
>diff --git a/source/common/aarch64/pixel-util.S b/source/common/aarch64/pixel-util.S
>index 5d8cc8c8e..d8b3f4365 100644
>--- a/source/common/aarch64/pixel-util.S
>+++ b/source/common/aarch64/pixel-util.S
>@@ -609,13 +609,18 @@ endfunc
>
> //******* satd *******
> .macro satd_4x4_neon
>- ld1 {v0.s}[0], [x0], x1
>+ ldr s0, [x0]
>+ ldr s1, [x2]
>+ add x0, x0, x1
>+ add x2, x2, x3
> ld1 {v0.s}[1], [x0], x1
>- ld1 {v1.s}[0], [x2], x3
> ld1 {v1.s}[1], [x2], x3
>- ld1 {v2.s}[0], [x0], x1
>+
>+ ldr s2, [x0]
>+ ldr s3, [x2]
>+ add x0, x0, x1
>+ add x2, x2, x3
> ld1 {v2.s}[1], [x0], x1
>- ld1 {v3.s}[0], [x2], x3
> ld1 {v3.s}[1], [x2], x3
>
> usubl v4.8h, v0.8b, v1.8b
>--
>2.34.1
>