<div data-ntes="ntes_mail_body_root" style="line-height:1.7;color:#000000;font-size:14px;font-family:Arial"><div id="spnEditorContent"><p style="margin: 0;">Hi George,</p><p style="margin: 0;"><br></p><p style="margin: 0;">Looks good, thanks for the patch.</p><p style="margin: 0;"><br></p><p style="margin: 0;">Regards,</p><p style="margin: 0;">Chen</p></div><div style="position:relative;zoom:1"></div><div id="divNeteaseMailCard"></div><p style="margin: 0;"><br></p><pre><br>At 2024-12-17 01:02:11, "George Steed" <george.steed@arm.com> wrote:

>The lane-indexed LD1 load instructions imply a dependency on the

>previous value of the vector register to maintain the values in lanes

>not loaded. On larger micro-architectures this introduces an unnecessary

>dependency chain which limits the ability of the core to execute

>out-of-order.

>

>To avoid this dependency being introduced, simply use the scalar LDR

>instructions to load the lowest lane of the vector, this has the effect

>of zeroing the top portion of the vector rather than trying to maintain

>the previous value of the upper lanes.

>

>On a Neoverse V2 machine this results in a 62% reduction in times

>reported for the SATD 4x4 benchmarks, and a 65% reduction for the SATD

>8x4 benchmarks.

>---

> source/common/aarch64/pixel-util.S | 13 +++++++++----

> 1 file changed, 9 insertions(+), 4 deletions(-)

>

>diff --git a/source/common/aarch64/pixel-util.S b/source/common/aarch64/pixel-util.S

>index 5d8cc8c8e..d8b3f4365 100644

>--- a/source/common/aarch64/pixel-util.S

>+++ b/source/common/aarch64/pixel-util.S

>@@ -609,13 +609,18 @@ endfunc

> 

> //******* satd *******

> .macro satd_4x4_neon

>-    ld1             {v0.s}[0], [x0], x1

>+    ldr             s0, [x0]

>+    ldr             s1, [x2]

>+    add             x0, x0, x1

>+    add             x2, x2, x3

>     ld1             {v0.s}[1], [x0], x1

>-    ld1             {v1.s}[0], [x2], x3

>     ld1             {v1.s}[1], [x2], x3

>-    ld1             {v2.s}[0], [x0], x1

>+

>+    ldr             s2, [x0]

>+    ldr             s3, [x2]

>+    add             x0, x0, x1

>+    add             x2, x2, x3

>     ld1             {v2.s}[1], [x0], x1

>-    ld1             {v3.s}[0], [x2], x3

>     ld1             {v3.s}[1], [x2], x3

> 

>     usubl           v4.8h, v0.8b, v1.8b

>-- 

>2.34.1

>

</pre></div>