[x265] [PATCH 0/3] AArch64 sse_pp Optimisations
chen
chenm003 at 163.com
Sat Jul 20 08:09:58 UTC 2024
Hi Hari,
Thanks for the new patches, they look good to me.
My only comment:
Does the code below affect instruction issue bandwidth?
I can't find more detailed documentation on the back-to-back register forwarding path or a description of the pipeline.
According to the doc, both uabd and udot have a throughput of 4, and instructions are decoded in order and issued out of order. I'm not sure whether interleaving uabd and udot instructions affects multi-issue or not. Everything else is fine.
+ uabd v2.16b, v16.16b, v20.16b
+ udot v0.4s, v2.16b, v2.16b
+ uabd v3.16b, v17.16b, v21.16b
+ udot v1.4s, v3.16b, v3.16b
+ uabd v4.16b, v18.16b, v22.16b
+ udot v0.4s, v4.16b, v4.16b
+ uabd v5.16b, v19.16b, v23.16b
+ udot v1.4s, v5.16b, v5.16b
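For reference, here is a minimal scalar C sketch of what the quoted sequence computes. It assumes v16-v19 and v20-v23 each hold 64 consecutive bytes from the two input blocks, which the snippet itself does not show. The helper names (uabd_16b, udot_4s, sse_row64_model, sse_ref) are hypothetical and not taken from the patch. It only models the data flow, not issue bandwidth or latency.

```c
#include <assert.h>
#include <stdint.h>

/* Scalar model of UABD on sixteen byte lanes: out[i] = |a[i] - b[i]|. */
static void uabd_16b(uint8_t out[16], const uint8_t a[16], const uint8_t b[16])
{
    for (int i = 0; i < 16; i++)
        out[i] = (uint8_t)(a[i] > b[i] ? a[i] - b[i] : b[i] - a[i]);
}

/* Scalar model of UDOT (vector form): each 32-bit lane accumulates
 * the dot product of its four corresponding byte pairs. */
static void udot_4s(uint32_t acc[4], const uint8_t n[16], const uint8_t m[16])
{
    for (int lane = 0; lane < 4; lane++)
        for (int j = 0; j < 4; j++)
            acc[lane] += (uint32_t)n[4 * lane + j] * (uint32_t)m[4 * lane + j];
}

/* Model of the eight quoted instructions. ASSUMPTION: p1 and p2 point at
 * the 64 bytes that v16-v19 and v20-v23 would hold. acc0/acc1 stand in
 * for the two accumulators v0 and v1, used alternately. */
static uint32_t sse_row64_model(const uint8_t *p1, const uint8_t *p2)
{
    uint32_t acc0[4] = {0}, acc1[4] = {0};
    uint8_t diff[16];

    for (int blk = 0; blk < 4; blk++) {
        uabd_16b(diff, p1 + 16 * blk, p2 + 16 * blk);   /* uabd vD, vA, vB */
        udot_4s((blk & 1) ? acc1 : acc0, diff, diff);   /* udot vAcc, vD, vD */
    }

    /* Final horizontal reduction of both accumulators. */
    uint32_t sum = 0;
    for (int i = 0; i < 4; i++)
        sum += acc0[i] + acc1[i];
    return sum;
}

/* Straightforward reference: sum of squared differences over n bytes. */
static uint32_t sse_ref(const uint8_t *p1, const uint8_t *p2, int n)
{
    uint32_t sum = 0;
    for (int i = 0; i < n; i++) {
        int d = (int)p1[i] - (int)p2[i];
        sum += (uint32_t)(d * d);
    }
    return sum;
}
```

In this model each udot consumes the result of the uabd immediately before it, while successive udot instructions alternate between the two accumulators, so only every second udot depends on the same accumulator.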
Regards,
Chen
At 2024-07-20 01:14:17, "Hari Limaye" <hari.limaye at arm.com> wrote:
>Hi Chen,
>
>Apologies for the delay in getting back to you.
>
>Thank you for the comments on the patches.
>
>>In SSE_PP_8xN, how about a two-line format (.8b -> .16b)? It only removes one UABD, so I guess there is no performance change.
>
>You are correct that this is not beneficial for performance - the additional merge negates the benefit of removing the single UABD instruction.
>
>>How about sharing code between the different sizes and reducing the unroll?
>
>For the block sizes that are fully unrolled at present, e.g. SSE_PP_16xN, reducing the unroll factor and sharing the code results in a performance regression.
>
>We have however updated SSE_PP_32xN to share the same code with a small wrapper, as this gives the same performance whilst decreasing the code size.
>
>Many thanks,
>
>Hari
>
>--
>2.42.1