[x265] [PATCH] AArch64: Optimize pixel_avg_pp_4xh

Fri Jun 20 05:08:35 UTC 2025

The code looks good to me.

By the way, LDR supports register-offset addressing ([Xn, Xm]), so how about unrolling by 2 to reduce the number of ADD instructions?
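
A rough, untested and unbenchmarked sketch of the unroll-by-2 idea follows. It assumes every height \h used with this macro is even. The register allocation (v0-v5) is illustrative and not taken from the patch. The second row of each pair is addressed as base plus stride, so each pointer is advanced once per two rows, which cuts the ADD count from three per row to three per two rows:

```
.macro pixel_avg_pp_4xN_neon h
function PFX(pixel_avg_pp_4x\h\()_neon)
.rept \h / 2                            // assumption: \h is even
    ldr             s0, [x2]            // src0, row n
    ldr             s1, [x4]            // src1, row n
    ldr             s2, [x2, x3]        // src0, row n+1 (base + stride)
    ldr             s3, [x4, x5]        // src1, row n+1 (base + stride)
    add             x2, x2, x3, lsl #1  // advance src0 by two rows
    add             x4, x4, x5, lsl #1  // advance src1 by two rows
    urhadd          v4.8b, v0.8b, v1.8b // rounding average, row n
    urhadd          v5.8b, v2.8b, v3.8b // rounding average, row n+1
    str             s4, [x0]            // dst, row n
    str             s5, [x0, x1]        // dst, row n+1 (base + stride)
    add             x0, x0, x1, lsl #1  // advance dst by two rows
.endr
    ret
endfunc
.endm
```

Whether this is actually faster than the one-row version would need to be measured on the target cores.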

At 2025-06-19 22:58:53, "Li Zhang" <li.zhang2 at arm.com> wrote:
>Use LDR and STR instead of LD1 to lane in the pixel_avg_pp_4xh assembly
>implementation. The new approach is a wholly destructive operation and
>removes a false dependency on the existing register contents.
>
>The change provides up to 2.5x speed up.
>---
> source/common/aarch64/mc-a.S | 9 ++++++---
> 1 file changed, 6 insertions(+), 3 deletions(-)
>
>diff --git a/source/common/aarch64/mc-a.S b/source/common/aarch64/mc-a.S
>index 130bf1a4a..ff18713fa 100644
>--- a/source/common/aarch64/mc-a.S
>+++ b/source/common/aarch64/mc-a.S
>@@ -38,10 +38,13 @@
> .macro pixel_avg_pp_4xN_neon h
> function PFX(pixel_avg_pp_4x\h\()_neon)
> .rept \h
>-    ld1             {v0.s}[0], [x2], x3
>-    ld1             {v1.s}[0], [x4], x5
>+    ldr             s0, [x2]
>+    ldr             s1, [x4]
>+    add             x2, x2, x3
>+    add             x4, x4, x5
>     urhadd          v2.8b, v0.8b, v1.8b
>-    st1             {v2.s}[0], [x0], x1
>+    str             s2, [x0]
>+    add             x0, x0, x1
> .endr
>     ret
> endfunc
>-- 
>2.39.5 (Apple Git-154)
>