[x265] [PATCH] AArch64: Optimize pixel_avg_pp_4xh

Fri Jun 20 10:53:41 UTC 2025

Hi Chen,

Thanks for the feedback.

I tried using LDR with offsets and unrolling by 2. The performance of the two approaches is almost the same
(at most 0.03x deviation in either direction, depending on block size).
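
For reference, the two loop structures being compared can be modelled in scalar C. This is only an illustrative sketch, not the NEON assembly that was benchmarked. The function names avg_4xh_single and avg_4xh_unroll2 are hypothetical, and the argument order is an assumption based on the register usage in the patch (x0/x1 for the destination and its stride, x2/x3 and x4/x5 for the two sources and their strides). rounding_avg mirrors the per-byte behaviour of URHADD, i.e. (a + b + 1) >> 1.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Per-byte rounding average, the operation URHADD performs on each lane. */
static uint8_t rounding_avg(uint8_t a, uint8_t b)
{
    return (uint8_t)((a + b + 1) >> 1);
}

/* Hypothetical model of the patch: one row per iteration, then advance
 * all three pointers (the LDR/LDR/ADD/ADD/URHADD/STR/ADD sequence). */
static void avg_4xh_single(uint8_t *dst, ptrdiff_t dstride,
                           const uint8_t *src0, ptrdiff_t sstride0,
                           const uint8_t *src1, ptrdiff_t sstride1, int h)
{
    for (int y = 0; y < h; y++) {
        for (int x = 0; x < 4; x++)
            dst[x] = rounding_avg(src0[x], src1[x]);
        dst  += dstride;
        src0 += sstride0;
        src1 += sstride1;
    }
}

/* Hypothetical model of the unroll-by-2 variant: the second row is
 * addressed as base + stride, so each pointer advances once per pair. */
static void avg_4xh_unroll2(uint8_t *dst, ptrdiff_t dstride,
                            const uint8_t *src0, ptrdiff_t sstride0,
                            const uint8_t *src1, ptrdiff_t sstride1, int h)
{
    assert(h % 2 == 0); /* assumption: the block height is even */
    for (int y = 0; y < h; y += 2) {
        for (int x = 0; x < 4; x++) {
            dst[x] = rounding_avg(src0[x], src1[x]);
            dst[dstride + x] = rounding_avg(src0[sstride0 + x],
                                            src1[sstride1 + x]);
        }
        dst  += 2 * dstride;
        src0 += 2 * sstride0;
        src1 += 2 * sstride1;
    }
}
```

Both functions write identical output. The unrolled form only changes how often the pointers are advanced: once per pair of rows instead of once per row.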

Regards,
Li

From: x265-devel <x265-devel-bounces at videolan.org> on behalf of chen <chenm003 at 163.com>
Date: Friday, 20 June 2025 at 7:09
To: Development for x265 <x265-devel at videolan.org>
Cc: nd <nd at arm.com>
Subject: Re: [x265] [PATCH] AArch64: Optimize pixel_avg_pp_4xh

The code looks good to me

btw: LDR supports register-offset addressing, so how about unrolling by 2 to reduce the number of ADD instructions?

At 2025-06-19 22:58:53, "Li Zhang" <li.zhang2 at arm.com> wrote:

>Use LDR and STR instead of LD1 to lane in the pixel_avg_pp_4xh assembly
>implementation. The new approach is a wholly destructive operation and
>removes a false dependency on the existing register contents.
>
>The change provides up to 2.5x speed up.
>---
> source/common/aarch64/mc-a.S | 9 ++++++---
> 1 file changed, 6 insertions(+), 3 deletions(-)
>
>diff --git a/source/common/aarch64/mc-a.S b/source/common/aarch64/mc-a.S
>index 130bf1a4a..ff18713fa 100644
>--- a/source/common/aarch64/mc-a.S
>+++ b/source/common/aarch64/mc-a.S
>@@ -38,10 +38,13 @@
> .macro pixel_avg_pp_4xN_neon h
> function PFX(pixel_avg_pp_4x\h\()_neon)
> .rept \h
>-    ld1             {v0.s}[0], [x2], x3
>-    ld1             {v1.s}[0], [x4], x5
>+    ldr             s0, [x2]
>+    ldr             s1, [x4]
>+    add             x2, x2, x3
>+    add             x4, x4, x5
>     urhadd          v2.8b, v0.8b, v1.8b
>-    st1             {v2.s}[0], [x0], x1
>+    str             s2, [x0]
>+    add             x0, x0, x1
> .endr
>     ret
> endfunc
>--
>2.39.5 (Apple Git-154)
>