<div style="line-height:1.7;color:#000000;font-size:14px;font-family:Arial"><div id="spnEditorContent"><p style="margin: 0;">Hi <span style="font-family: arial; white-space: pre-wrap;">Hari,</span></p><p style="margin: 0;"><span style="font-family: arial; white-space: pre-wrap;"><br></span></p><p style="margin: 0;"><span style="font-family: arial; white-space: pre-wrap;">Thanks for the new ARM patches.</span></p><p style="margin: 0;"></p><ul><li>In <span style="font-family: arial; white-space: pre-wrap;">signOf_neon</span></li><ul><li>>+    // signOf(a - b) = -(a > b) | (b > a)<br>This comment is unclear. I suggest:<br>-(a > b ? <b style="font-family: arial; white-space: pre-wrap;">-1</b><span style="font-family: arial; white-space: pre-wrap;"> : 0) | (a < b)</span></li></ul><li><span style="font-family: arial; white-space: pre-wrap;">In </span><font face="arial"><span style="white-space: pre-wrap;">saoCuStatsBO_neon</span></font></li><ul><li><font face="arial"><span style="white-space: pre-wrap;">This only optimizes memory bandwidth. The interval memory access pattern depends strongly on the CPU pipeline design and the compiler, so it is not generic; I am not sure how it behaves on other kinds of CPUs.<br></span></font></li></ul><li><span style="font-family: arial; white-space: pre-wrap;">In </span><font face="arial"><span style="white-space: pre-wrap;">saoCuStatsE*_neon</span></font></li><ul><li><font face="arial"><span style="white-space: pre-wrap;">No comments. It looks like vmulq_s16 + vmlaq_s16 uses one instruction fewer than vandq_s16 + vandq_s16 + vaddq_s16 or tbl/tbx, and it is mostly faster on modern CPUs.</span></font></li></ul><li><span style="font-family: arial; white-space: pre-wrap;">In </span><font face="arial"><span style="white-space: pre-wrap;">saoCuStats*_sve, </span></font><font face="arial"><span style="white-space: pre-wrap;">saoCuStats*_sve2</span></font></li><ul><li><font face="arial"><span style="white-space: pre-wrap;">No comments, since the algorithm is similar to the Neon version.<br></span></font></li></ul></ul><div><font face="arial"><span style="white-space: pre-wrap;">Regards,</span></font></div><div><font face="arial"><span style="white-space: pre-wrap;">Chen</span></font></div><p></p><p style="margin: 0;"><span style="font-family: arial; white-space: pre-wrap;">At 2024-05-21 00:14:35, "Hari Limaye" <hari.limaye@arm.com> wrote:</span></p></div><pre>>Hi,
>
>This patch-series adds AArch64 Neon, SVE, and SVE2 implementations of
>the saoCuStats function primitives for low and high bitdepth.
>
>This series is based on the previously submitted refactoring patch
>series.
>
>Performance numbers:
>
>C -> Neon on Neoverse V1:
>    Low bitdepth:
>        saoCuStatsBO | 1.09x
>        saoCuStatsE0 | 2.67x
>        saoCuStatsE1 | 2.82x
>        saoCuStatsE2 | 2.93x
>        saoCuStatsE3 | 3.26x
>
>    High bitdepth:
>        saoCuStatsBO | 1.09x
>        saoCuStatsE0 | 2.39x
>        saoCuStatsE1 | 2.67x
>        saoCuStatsE2 | 2.47x
>        saoCuStatsE3 | 2.86x
>
>Neon -> SVE on Neoverse V1:
>    Low bitdepth:
>        saoCuStatsE0 | 1.12x
>        saoCuStatsE1 | 1.15x
>        saoCuStatsE2 | 1.21x
>        saoCuStatsE3 | 1.14x
>
>    High bitdepth:
>        saoCuStatsE0 | 1.19x
>        saoCuStatsE1 | 1.28x
>        saoCuStatsE2 | 1.19x
>        saoCuStatsE3 | 1.12x
>
>SVE -> SVE2 on Neoverse V2:
>    Low bitdepth:
>        saoCuStatsE0 | 1.08x
>        saoCuStatsE1 | 1.06x
>        saoCuStatsE2 | 1.06x
>        saoCuStatsE3 | 1.09x
>
>    High bitdepth:
>        saoCuStatsE0 | 1.03x
>        saoCuStatsE1 | 1.10x
>        saoCuStatsE2 | 1.08x
>        saoCuStatsE3 | 1.09x
>
>Many thanks,
>
>Hari
>
>Hari Limaye (7):
>  Test: Relax constraints of check_saoCuStatsE*
>  Move duplicated signOf function to common header
>  AArch64: Add Neon saoCuStats primitives for low bitdepth
>  AArch64: Add Neon saoCuStats primitives for high bitdepth
>  AArch64: Add check for arm_neon_sve_bridge.h
>  AArch64: Add SVE saoCuStats primitives
>  AArch64: Add SVE2 saoCuStats primitives
>
> source/CMakeLists.txt                     |  35 +-
> source/common/CMakeLists.txt              |  19 +-
> source/common/aarch64/asm-primitives.cpp  |  14 +
> source/common/aarch64/loopfilter-prim.cpp |  19 +-
> source/common/aarch64/sao-prim-sve.cpp    | 271 +++++++++++++++
> source/common/aarch64/sao-prim-sve2.cpp   | 317 ++++++++++++++++++
> source/common/aarch64/sao-prim.cpp        | 380 ++++++++++++++++++++++
> source/common/aarch64/sao-prim.h          | 100 ++++++
> source/common/common.h                    |   6 +
> source/common/loopfilter.cpp              |  16 +-
> source/encoder/sao.cpp                    |  74 ++---
> source/test/pixelharness.cpp              |  11 +-
> 12 files changed, 1187 insertions(+), 75 deletions(-)
> create mode 100644 source/common/aarch64/sao-prim-sve.cpp
> create mode 100644 source/common/aarch64/sao-prim-sve2.cpp
> create mode 100644 source/common/aarch64/sao-prim.cpp
> create mode 100644 source/common/aarch64/sao-prim.h
>
>-- 
>2.42.1
>
>_______________________________________________
>x265-devel mailing list
>x265-devel@videolan.org
>https://mailman.videolan.org/listinfo/x265-devel
</pre></div>