<div style="line-height:1.7;color:#000000;font-size:14px;font-family:Arial"><div><div id="spnEditorContent"><p style="margin: 0;">Hi <span style="font-family: arial; white-space: pre-wrap;">Hari,</span></p><p style="margin: 0;"><br></p><p style="margin: 0;">Thanks for the new patches; they look good to me.</p><p style="margin: 0;"><br></p><p style="margin: 0;">My only comment:</p><p style="margin: 0;">Does the code below affect instruction issue bandwidth?</p><p style="margin: 0;">I can't find more detailed documentation on the back-to-back register forwarding path or the pipeline.</p><p style="margin: 0;">According to the doc, both uabd and udot have a throughput of 4, and instructions are decoded in order but issued out of order. I am not sure whether interleaving uabd and udot affects multi-issue; everything else is fine.</p><pre style="width: 1298.64px; word-break: break-word !important;">+    uabd            v2.16b, v16.16b, v20.16b

+    udot            v0.4s, v2.16b, v2.16b

+    uabd            v3.16b, v17.16b, v21.16b

+    udot            v1.4s, v3.16b, v3.16b

+    uabd            v4.16b, v18.16b, v22.16b

+    udot            v0.4s, v4.16b, v4.16b

+    uabd            v5.16b, v19.16b, v23.16b

+    udot            v1.4s, v5.16b, v5.16b

</pre><div><br></div></div><div style="position:relative;zoom:1"></div><div id="divNeteaseMailCard"></div><div style="margin: 0;">Regards,</div><div style="margin: 0;">Chen</div><pre><br>At 2024-07-20 01:14:17, "Hari Limaye" &lt;hari.limaye@arm.com&gt; wrote:

>Hi Chen,

>

>Apologies for the delay in getting back to you.

>

>Thank you for the comments on the patches.

>

>>in the SSE_PP_8xN, how about two-lines format (.8b -> .16b), it just reduce one of UABD, I guess it is not performance change

>

>You are correct that this is not beneficial for performance - the additional merge negates the benefit of removing the single UABD instruction.

>

>>How about shared code in different size and reduce unroll?

>

>For the block sizes that are fully unrolled at present, e.g. SSE_PP_16xN, reducing the unroll factor and sharing the code results in a performance regression. 

>

>We have however updated SSE_PP_32xN to share the same code with a small wrapper, as this gives the same performance whilst decreasing the code size.

>

>Many thanks,

>

>Hari

>

>-- 

>2.42.1

</pre><br></div></div>