[x265] [PATCH] AArch64: Optimize pixel_avg_pp_4xh
Li Zhang
Li.Zhang2 at arm.com
Fri Jun 20 10:53:41 UTC 2025
Hi Chen,
Thanks for the feedback.
I tried using LDR with register offsets and unrolling by 2. Performance is almost the same for the two approaches
(within 0.03x in either direction, depending on block size).
Regards,
Li
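
For illustration, the two addressing strategies compared in this thread can be modelled in scalar C. This is only a sketch: the function names are hypothetical, the argument order (dst, dst stride, src0, src0 stride, src1, src1 stride) is inferred from how the patch uses x0..x5, and a C compiler will not necessarily emit the instruction sequences discussed here. The unrolled variant assumes an even height.

```c
#include <stddef.h>
#include <stdint.h>

/* Rounding average of one 4-pixel row: (a + b + 1) >> 1 per byte,
 * which is the per-lane operation URHADD performs. */
static void avg_row4(uint8_t *dst, const uint8_t *a, const uint8_t *b)
{
    for (int x = 0; x < 4; x++)
        dst[x] = (uint8_t)((a[x] + b[x] + 1) >> 1);
}

/* Variant A: one row per step, each pointer advanced after use.
 * Mirrors the ldr/ldr/add/add/urhadd/str/add pattern in the patch. */
static void avg_4xh_rowwise(uint8_t *dst, ptrdiff_t dstride,
                            const uint8_t *src0, ptrdiff_t sstride0,
                            const uint8_t *src1, ptrdiff_t sstride1, int h)
{
    for (int y = 0; y < h; y++) {
        avg_row4(dst, src0, src1);
        src0 += sstride0;
        src1 += sstride1;
        dst  += dstride;
    }
}

/* Variant B: two rows per step. The second row is addressed as
 * base + stride (the analogue of a register-offset LDR/STR), so each
 * pointer is advanced once per pair of rows. Assumes h is even. */
static void avg_4xh_unroll2(uint8_t *dst, ptrdiff_t dstride,
                            const uint8_t *src0, ptrdiff_t sstride0,
                            const uint8_t *src1, ptrdiff_t sstride1, int h)
{
    for (int y = 0; y < h; y += 2) {
        avg_row4(dst, src0, src1);
        avg_row4(dst + dstride, src0 + sstride0, src1 + sstride1);
        src0 += 2 * sstride0;
        src1 += 2 * sstride1;
        dst  += 2 * dstride;
    }
}
```

Both variants produce identical output. The unrolled one trades three pointer updates per row for three per pair of rows, which is the saving the suggestion was aiming at.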
From: x265-devel <x265-devel-bounces at videolan.org> on behalf of chen <chenm003 at 163.com>
Date: Friday, 20 June 2025 at 7:09
To: Development for x265 <x265-devel at videolan.org>
Cc: nd <nd at arm.com>
Subject: Re: [x265] [PATCH] AArch64: Optimize pixel_avg_pp_4xh
The code looks good to me
btw: LDR supports register-offset addressing, so how about unrolling by 2 to reduce the number of ADD instructions?
At 2025-06-19 22:58:53, "Li Zhang" <li.zhang2 at arm.com> wrote:
>Use LDR and STR instead of LD1 to lane in the pixel_avg_pp_4xh assembly
>implementation. The new approach is a wholly destructive operation and
>removes a false dependency on the existing register contents.
>
>The change provides up to 2.5x speed up.
>---
> source/common/aarch64/mc-a.S | 9 ++++++---
> 1 file changed, 6 insertions(+), 3 deletions(-)
>
>diff --git a/source/common/aarch64/mc-a.S b/source/common/aarch64/mc-a.S
>index 130bf1a4a..ff18713fa 100644
>--- a/source/common/aarch64/mc-a.S
>+++ b/source/common/aarch64/mc-a.S
>@@ -38,10 +38,13 @@
> .macro pixel_avg_pp_4xN_neon h
> function PFX(pixel_avg_pp_4x\h\()_neon)
> .rept \h
>- ld1 {v0.s}[0], [x2], x3
>- ld1 {v1.s}[0], [x4], x5
>+ ldr s0, [x2]
>+ ldr s1, [x4]
>+ add x2, x2, x3
>+ add x4, x4, x5
> urhadd v2.8b, v0.8b, v1.8b
>- st1 {v2.s}[0], [x0], x1
>+ str s2, [x0]
>+ add x0, x0, x1
> .endr
> ret
> endfunc
>--
>2.39.5 (Apple Git-154)
>
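
The commit message above attributes the speed-up to removing a false dependency. A toy C model of the two load forms may make the difference concrete. It is a sketch under stated assumptions: `vreg_t` and both function names are hypothetical, a 128-bit vector register is modelled as four 32-bit lanes, and only the data flow is shown, not any microarchitectural timing.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical model of a 128-bit vector register as four 32-bit lanes. */
typedef struct { uint32_t s[4]; } vreg_t;

/* Models "ld1 {v.s}[0], [ptr]": only lane 0 is written and lanes 1..3
 * are carried over, so the result takes the old register value as an
 * input even when the caller never looks at the upper lanes. */
static vreg_t model_ld1_lane0(vreg_t old, const uint8_t *ptr)
{
    vreg_t r = old;
    memcpy(&r.s[0], ptr, 4);
    return r;
}

/* Models "ldr s, [ptr]": lane 0 is loaded and the upper lanes are
 * zeroed, so the result does not depend on the previous contents. */
static vreg_t model_ldr_s(const uint8_t *ptr)
{
    vreg_t r = { {0, 0, 0, 0} };
    memcpy(&r.s[0], ptr, 4);
    return r;
}
```

In the model, the lane-insert form needs the previous register value as an argument, while the scalar load does not. That extra input is the dependency the patch avoids, since the kernel only uses the low 32 bits of each result.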