<div data-ntes="ntes_mail_body_root" style="line-height:1.7;color:#000000;font-size:14px;font-family:Arial"><div id="spnEditorContent"><p style="margin: 0;">Hi,</p><p style="margin: 0;"><br></p><p style="margin: 0;">Thanks for improving the instructions. It looks good to me.</p><p style="margin: 0;"><br></p><p style="margin: 0;">Regards,<br>Chen</p></div><pre>At 2025-05-07 14:49:51, "Gerda Zsejke More" <gerdazsejke.more@arm.com> wrote:
>Optimize pixel_avg_pp_12x16_neon by using more suitable load and
>store instructions. Using LD1 for the 32-bit lane is a constructive
>operation - needing to merge the new value for lane 0 with the
>existing top half of the vector. Using LDR turns this into a wholly
>destructive operation since LDR zeros the rest of the vector -
>removing the false dependency.
>---
> source/common/aarch64/mc-a.S | 6 +++---
> 1 file changed, 3 insertions(+), 3 deletions(-)
>
>diff --git a/source/common/aarch64/mc-a.S b/source/common/aarch64/mc-a.S
>index 8c2878b3e..130bf1a4a 100644
>--- a/source/common/aarch64/mc-a.S
>+++ b/source/common/aarch64/mc-a.S
>@@ -73,13 +73,13 @@ function PFX(pixel_avg_pp_12x16_neon)
>     sub             x3, x3, #4
>     sub             x5, x5, #4
> .rept 16
>-    ld1             {v0.s}[0], [x2], #4
>+    ldr             s0, [x2], #4
>     ld1             {v1.8b}, [x2], x3
>-    ld1             {v2.s}[0], [x4], #4
>+    ldr             s2, [x4], #4
>     ld1             {v3.8b}, [x4], x5
>     urhadd          v4.8b, v0.8b, v2.8b
>     urhadd          v5.8b, v1.8b, v3.8b
>-    st1             {v4.s}[0], [x0], #4
>+    str             s4, [x0], #4
>     st1             {v5.8b}, [x0], x1
> .endr
>     ret
>-- 
>2.39.5 (Apple Git-154)
>
</pre></div>