<div style="line-height:1.7;color:#000000;font-size:14px;font-family:Arial"><div id="spnEditorContent"><p style="margin: 0;">Hi <span style="font-family: arial; white-space: pre-wrap;">Hari,</span></p><p style="margin: 0;"><br></p><p style="margin: 0;">Thanks for the patches.</p><p style="margin: 0;">They look good to me. I don't have many comments, just a few points for discussion.</p><p style="margin: 0;"></p><p style="margin: 0;"></p><ul><li>In SSE_PP_8xN, how about processing two rows at a time (.8b -> .16b)? It only saves one UABD, so I guess it will not change performance much.</li><li>How about sharing code between the different sizes and reducing the unroll factor?</li><ul><li>I am thinking about performance in x265 itself rather than in the testbench, because the x86/ARM L0/L1 caches are not big enough</li><ul><li>For example, SSE_PP_16xN has two versions, 16x16 and 16x32</li><li>The code is similar, unrolled 8 and 16 times respectively</li><li>ARM can handle up to 3 memory operations per cycle, so unrolling 2 times looks like enough to fill the pipeline (up to 8 uops wide)</li><li>If we reduce the unroll factor, we will need a loop. Here, 16x16 and 16x32 differ only in the loop count, so we could share most of the code behind a small wrapper</li><li>The risk is the branch predictor unit; I am not sure how it would affect performance</li></ul></ul></ul><p></p><p></p><p style="margin: 0;"><br></p></div><div style="position:relative;zoom:1"></div><div id="divNeteaseMailCard"></div><div style="margin: 0;">Regards,</div><div style="margin: 0;">Chen</div><pre><br>At 2024-06-25 20:49:00, "Hari Limaye" &lt;hari.limaye@arm.com&gt; wrote:

>Hi,
>
>This series is based on the previously submitted patch-sets (AArch64 saoCuStats Optimisations, AArch64 SAD/SADxN Optimisations), and depends on CMake refactoring performed in those patch-sets.
>
>Geometric mean of performance speedup on a Neoverse V1 machine (higher is better):
>
>Existing Neon  -> Optimised Neon:       1.60x
>Optimised Neon -> Armv8.4 Neon DotProd: 1.73x
>
>Many thanks,
>
>Hari
>
>Hari Limaye (3):
>  AArch64: Optimise Neon assembly implementations of sse_pp
>  AArch64: Remove SVE and SVE2 sse_pp primitives
>  AArch64: Add Armv8.4 Neon DotProd implementations of sse_pp
>
> source/common/CMakeLists.txt             |   4 +-
> source/common/aarch64/asm-primitives.cpp |  24 +--
> source/common/aarch64/fun-decls.h        |   1 +
> source/common/aarch64/ssd-a-common.S     |   4 +-
> source/common/aarch64/ssd-a-sve.S        |  78 -------
> source/common/aarch64/ssd-a-sve2.S       | 261 -----------------------
> source/common/aarch64/ssd-a.S            | 259 ++++++++--------------
> source/common/aarch64/ssd-neon-dotprod.S | 165 ++++++++++++++
> 8 files changed, 272 insertions(+), 524 deletions(-)
> delete mode 100644 source/common/aarch64/ssd-a-sve.S
> create mode 100644 source/common/aarch64/ssd-neon-dotprod.S
>
>-- 
>2.42.1
>
>_______________________________________________
>x265-devel mailing list
>x265-devel@videolan.org
>https://mailman.videolan.org/listinfo/x265-devel
</pre></div>