<div style="line-height:1.7;color:#000000;font-size:14px;font-family:Arial"><div id="spnEditorContent"><p style="margin: 0;">Hi <span style="font-family: arial; white-space: pre-wrap;">Hari,</span></p><p style="margin: 0;"><br></p><p style="margin: 0;">These 8 patches look good. My only comment is on the code below:</p><p style="margin: 0;"><br></p><p style="margin: 0;">=================================</p><p style="margin: 0;"><span style="font-family: arial; white-space: pre-wrap;"> .macro SAD_START_4 f</span></p></div><pre style="width: 1298.64px; word-break: break-word !important;">-    ld1             {v0.s}[0], [x0], x1
+    ldr             s0, [x0]
+    ldr             s1, [x2]
+    add             x0, x0, x1
+    add             x2, x2, x3
     ld1             {v0.s}[1], [x0], x1
-    ld1             {v1.s}[0], [x2], x3
     ld1             {v1.s}[1], [x2], x3
     \f              v16.8h, v0.8b, v1.8b
 .endm</pre><pre><div id="spnEditorContent" style="font-family: Arial; white-space: normal;"><p style="margin: 0px;">According to the document:</p><p style="margin: 0px;">LDR: latency 5/-, throughput 2</p><p style="margin: 0px;">ADD: latency 2, throughput 2</p><p style="margin: 0px;">LD1: latency 7, throughput 2 (latency may be optimized to 5)</p><p style="margin: 0px;"><br></p><p style="margin: 0px;">In this case, replacing LD1 with LDR+ADD gives no benefit.</p><p style="margin: 0px;">Btw, the same comment applies to SAD_X_START_4.</p><p style="margin: 0px;"><br></p><p style="margin: 0px;">=================================</p></div><div><pre style="width: 1298.64px; word-break: break-word !important;"><br></pre></div><div>At 2024-05-24 01:12:04, "Hari Limaye" <hari.limaye@arm.com> wrote:</div>>Hi,
>
>This patch-series optimises the Neon implementations of SAD/SADxN primitives, adds new Armv8.4 Neon DotProd implementations, and performs some refactoring to AArch64 code.
>
>This series is based on the previously submitted refactoring patch-series (AArch64 saoCuStats Optimisations).
>
>Geometric mean of performance uplift when compiled with LLVM 17 on a Neoverse V1 machine (higher is better):
>
>Existing Neon  -> Optimised Neon:       1.45x
>Optimised Neon -> Armv8.4 Neon DotProd: 1.03x
>
>Many thanks,
>
>Hari
>
>Hari Limaye (8):
>  AArch64: Optimise Neon assembly implementations of SAD
>  AArch64: Optimise Neon assembly implementations of SADxN
>  AArch64: Remove SVE2 SAD/SADxN primitives
>  AArch64: Clean up CMake feature detection
>  AArch64: Add Armv8.4 Neon DotProd feature detection
>  AArch64: Refactor setup of optimised assembly primitives
>  AArch64: Add Armv8.4 Neon DotProd implementations of SAD
>  AArch64: Add Armv8.4 Neon DotProd implementations of SADxN
>
> build/README.txt                         |   8 +
> source/CMakeLists.txt                    |  89 ++-
> source/cmake/FindNEON_DOTPROD.cmake      |  21 +
> source/common/CMakeLists.txt             |   6 +-
> source/common/aarch64/asm-primitives.cpp | 832 ++---------------------
> source/common/aarch64/fun-decls.h        |  21 +
> source/common/aarch64/sad-a-common.S     | 514 --------------
> source/common/aarch64/sad-a-sve2.S       | 511 --------------
> source/common/aarch64/sad-a.S            | 506 +++++++++++++-
> source/common/aarch64/sad-neon-dotprod.S | 302 ++++++++
> source/common/cpu.cpp                    |  19 +-
> source/test/testbench.cpp                |   3 +-
> source/x265.h                            |  11 +-
> 13 files changed, 958 insertions(+), 1885 deletions(-)
> create mode 100644 source/cmake/FindNEON_DOTPROD.cmake
> delete mode 100644 source/common/aarch64/sad-a-common.S
> delete mode 100644 source/common/aarch64/sad-a-sve2.S
> create mode 100644 source/common/aarch64/sad-neon-dotprod.S
>
>-- 
>2.42.1
>
>_______________________________________________
>x265-devel mailing list
>x265-devel@videolan.org
>https://mailman.videolan.org/listinfo/x265-devel
</pre></div>