<div style="line-height:1.7;color:#000000;font-size:14px;font-family:Arial"><div id="spnEditorContent"><p style="margin: 0;">Hi Hari,</p><p style="margin: 0;"><br></p><p style="margin: 0;">Thank you for explaining in more detail.</p><p style="margin: 0;">It is my fault: I did not point out in my last comment that we could replace LD1 with LD1R.</p><p style="margin: 0;">The instruction <span style="font-family: arial; white-space: pre-wrap;">`ld1 {v0.s}[0], [x0], x1` is not good here because the partial register write creates a false dependency on the previous register value, but </span><span style="font-family: arial; white-space: pre-wrap;">`ld1<b>r</b> {v0.2s}, [x0], x1` may avoid the issue, since it overwrites the whole register.</span></p><p style="margin: 0;"><span style="font-family: arial; white-space: pre-wrap;">Could you please take a look at the performance with LD1R?</span></p><p style="margin: 0;"><br></p><p style="margin: 0;">Regards,</p><p style="margin: 0;">Chen</p></div><pre>At 2024-05-28 18:03:43, "Hari Limaye" <hari.limaye@arm.com> wrote:
>Hi Chen,
>
>Thank you for reviewing the patches.
>
>>In this case, replacing LD1 with LDR+ADD brings no benefit
>
>Here, the existing instruction `ld1 {v0.s}[0], [x0], x1` is a read-modify-write operation and so creates a false dependency on the previous value of the register. Replacing this initial load with an LDR instruction removes this issue, as it is a completely destructive operation.
>
>The speed-test results for the block sizes with width 4, compared to the existing Neon code on a Neoverse V1 machine, are as follows:
>
>sad[4x4] | 2.94x
>sad[4x8] | 3.47x
>sad[4x16] | 2.49x
>sad_x3[4x4] | 1.94x
>sad_x3[4x8] | 1.59x
>sad_x3[4x16] | 1.46x
>sad_x4[4x4] | 1.59x
>sad_x4[4x8] | 1.45x
>sad_x4[4x16] | 1.27x
>
>Many thanks,
>
>Hari
</pre></div>