[x265] [PATCH 0/3] AArch64 sse_pp Optimisations

Sat Jul 20 08:09:58 UTC 2024

Hi Hari,

Thanks for the new patches, they look good to me.

My only comment:

Does the code below affect instruction issue bandwidth?

I can't find a more detailed document describing the back-to-back register forwarding path and the pipeline.

According to the doc, both uabd and udot have a throughput of 4, and instructions are decoded in order and issued out of order. I am not sure whether interleaving uabd and udot like this, with each udot consuming the result of the uabd immediately before it, affects multi-issue or not. Everything else is fine.

+    uabd            v2.16b, v16.16b, v20.16b
+    udot            v0.4s, v2.16b, v2.16b
+    uabd            v3.16b, v17.16b, v21.16b
+    udot            v1.4s, v3.16b, v3.16b
+    uabd            v4.16b, v18.16b, v22.16b
+    udot            v0.4s, v4.16b, v4.16b
+    uabd            v5.16b, v19.16b, v23.16b
+    udot            v1.4s, v5.16b, v5.16b
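
For reference, here is a rough scalar sketch in C of what the sequence above computes. It only models the arithmetic (UABD gives the per-byte absolute difference, UDOT accumulates the sum of its squares into four 32-bit lanes) and says nothing about issue or forwarding behaviour. The function names are hypothetical, and the final reduction of v0 and v1 to a single sum is an assumption, since that part of the patch is not quoted here.

```c
#include <stddef.h>
#include <stdint.h>

/* Scalar model of UABD on a single byte lane: |a - b|. */
static uint8_t uabd_lane(uint8_t a, uint8_t b)
{
    return (uint8_t)(a > b ? a - b : b - a);
}

/* Scalar model of one "uabd vT, vA, vB" + "udot vAcc.4s, vT, vT" pair.
 * Each 32-bit accumulator lane gains the sum of squares of the four
 * absolute differences in its group of four bytes. */
static void uabd_udot_16(uint32_t acc[4], const uint8_t *a, const uint8_t *b)
{
    for (int lane = 0; lane < 4; lane++) {
        for (int k = 0; k < 4; k++) {
            uint32_t d = uabd_lane(a[4 * lane + k], b[4 * lane + k]);
            acc[lane] += d * d;
        }
    }
}

/* Plain reference: sum of squared differences over n bytes. */
static uint32_t sse_ref(const uint8_t *a, const uint8_t *b, size_t n)
{
    uint32_t sum = 0;
    for (size_t i = 0; i < n; i++) {
        int d = (int)a[i] - (int)b[i];
        sum += (uint32_t)(d * d);
    }
    return sum;
}

/* Models the eight quoted instructions: four 16-byte vectors per input,
 * alternating between two accumulators (v0 and v1 in the patch).
 * The final horizontal add is assumed, not taken from the patch. */
static uint32_t sse_model_64(const uint8_t *a, const uint8_t *b)
{
    uint32_t v0[4] = {0, 0, 0, 0};
    uint32_t v1[4] = {0, 0, 0, 0};

    uabd_udot_16(v0, a,      b);       /* v16 / v20 */
    uabd_udot_16(v1, a + 16, b + 16);  /* v17 / v21 */
    uabd_udot_16(v0, a + 32, b + 32);  /* v18 / v22 */
    uabd_udot_16(v1, a + 48, b + 48);  /* v19 / v23 */

    uint32_t sum = 0;
    for (int lane = 0; lane < 4; lane++)
        sum += v0[lane] + v1[lane];
    return sum;
}
```

For any 64-byte inputs, sse_model_64(a, b) should match sse_ref(a, b, 64). The two accumulators are independent of each other, but each udot still depends on the uabd directly before it, which is the dependency I am asking about.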

Regards,
Chen

At 2024-07-20 01:14:17, "Hari Limaye" <hari.limaye at arm.com> wrote:
>Hi Chen,
>
>Apologies for the delay in getting back to you.
>
>Thank you for the comments on the patches.
>
>>In SSE_PP_8xN, how about a two-line format (.8b -> .16b)? It only removes one UABD, so I guess there is no performance change.
>
>You are correct that this is not beneficial for performance - the additional merge negates the benefit of removing the single UABD instruction.
>
>>How about shared code in different size and reduce unroll?
>
>For the block sizes that are fully unrolled at present, e.g. SSE_PP_16xN, reducing the unroll factor and sharing the code results in a performance regression. 
>
>We have however updated SSE_PP_32xN to share the same code with a small wrapper, as this gives the same performance whilst decreasing the code size.
>
>Many thanks,
>
>Hari
