[x265] [PATCH 0/3] AArch64 sse_pp Optimisations
chen
chenm003 at 163.com
Wed Jun 26 04:11:12 UTC 2024
Hi Hari,
Thanks for the patches.
They look good to me. I don't have many comments, just a few points to discuss.
In SSE_PP_8xN, how about a two-line format (.8b -> .16b)? It only saves one UABD, so I guess it would not change performance.
How about sharing code between the different sizes and reducing the unrolling?
I am thinking about performance inside x265 rather than in the testbench, because the x86/ARM L0/L1 caches are not big enough.
For example, SSE_PP_16xN has two versions, 16x16 and 16x32.
The code is similar, unrolled 8 and 16 times.
ARM can handle up to 3 memory operations per cycle, so unrolling 2 times looks like enough to fill the pipeline (up to 8 uops wide).
If we reduce the unroll factor, we will use a loop. Here, 16x16 and 16x32 differ only in the loop count, so they could share most of the code behind a small wrapper.
The risk is the branch predictor unit; I am not sure how it would affect performance.
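A rough C sketch of the shared-code idea, for illustration only: this is scalar C, not the Neon assembly from the patch, and the function names and signature are hypothetical. One kernel handles two rows per loop iteration (unroll by 2), and the two block sizes are thin wrappers that differ only in the row count.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical shared kernel: sum of squared errors over a 16-wide block.
 * Processes two rows per iteration, i.e. an unroll factor of 2.
 * Assumes height is even. */
static inline uint32_t sse_pp_16xh_sketch(const uint8_t *pix1, ptrdiff_t stride1,
                                          const uint8_t *pix2, ptrdiff_t stride2,
                                          int height)
{
    uint32_t sum = 0;

    for (int y = 0; y < height; y += 2) {
        for (int x = 0; x < 16; x++) {
            int d0 = pix1[x] - pix2[x];                     /* row y     */
            int d1 = pix1[stride1 + x] - pix2[stride2 + x]; /* row y + 1 */
            sum += (uint32_t)(d0 * d0) + (uint32_t)(d1 * d1);
        }
        pix1 += 2 * stride1;
        pix2 += 2 * stride2;
    }
    return sum;
}

/* Small wrappers: the two sizes differ only in the loop count. */
uint32_t sse_pp_16x16_sketch(const uint8_t *pix1, ptrdiff_t stride1,
                             const uint8_t *pix2, ptrdiff_t stride2)
{
    return sse_pp_16xh_sketch(pix1, stride1, pix2, stride2, 16);
}

uint32_t sse_pp_16x32_sketch(const uint8_t *pix1, ptrdiff_t stride1,
                             const uint8_t *pix2, ptrdiff_t stride2)
{
    return sse_pp_16xh_sketch(pix1, stride1, pix2, stride2, 32);
}
```

In assembly the wrapper would only set the loop counter and branch to the common body, so the open question is the cost of the loop branch versus the smaller code size.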
Regards,
Chen
At 2024-06-25 20:49:00, "Hari Limaye" <hari.limaye at arm.com> wrote:
>Hi,
>
>This series is based on the previously submitted patch-sets (AArch64 saoCuStats Optimisations, AArch64 SAD/SADxN Optimisations), and depends on CMake refactoring performed in those patch-sets.
>
>Geometric mean of performance speedup on a Neoverse V1 machine (higher is better):
>
>Existing Neon -> Optimised Neon: 1.60x
>Optimised Neon -> Armv8.4 Neon DotProd: 1.73x
>
>Many thanks,
>
>Hari
>
>Hari Limaye (3):
> AArch64: Optimise Neon assembly implementations of sse_pp
> AArch64: Remove SVE and SVE2 sse_pp primitives
> AArch64: Add Armv8.4 Neon DotProd implementations of sse_pp
>
> source/common/CMakeLists.txt | 4 +-
> source/common/aarch64/asm-primitives.cpp | 24 +--
> source/common/aarch64/fun-decls.h | 1 +
> source/common/aarch64/ssd-a-common.S | 4 +-
> source/common/aarch64/ssd-a-sve.S | 78 -------
> source/common/aarch64/ssd-a-sve2.S | 261 -----------------------
> source/common/aarch64/ssd-a.S | 259 ++++++++--------------
> source/common/aarch64/ssd-neon-dotprod.S | 165 ++++++++++++++
> 8 files changed, 272 insertions(+), 524 deletions(-)
> delete mode 100644 source/common/aarch64/ssd-a-sve.S
> create mode 100644 source/common/aarch64/ssd-neon-dotprod.S
>
>--
>2.42.1
>
>_______________________________________________
>x265-devel mailing list
>x265-devel at videolan.org
>https://mailman.videolan.org/listinfo/x265-devel