<div style="line-height:1.7;color:#000000;font-size:14px;font-family:Arial"><div id="spnEditorContent"><p style="margin: 0;">Hi <span style="font-family: arial; white-space: pre-wrap;">Hari,</span></p><p style="margin: 0;"><span style="font-family: arial; white-space: pre-wrap;"><br></span></p><p style="margin: 0;"><span style="font-family: arial; white-space: pre-wrap;">Thanks for the new ARM patches.</span></p><p style="margin: 0;"></p><ul><li>In <span style="font-family: arial; white-space: pre-wrap;">signOf_neon</span></li><ul><li>>+ // signOf(a - b) = -(a > b) | (b > a)<br>This comment is not clear; I suggest:<br>-(a > b ? <b style="font-family: arial; white-space: pre-wrap;">-1</b><span style="font-family: arial; white-space: pre-wrap;"> : 0) | (a &lt; b ? -1 : 0)</span></li></ul><li><span style="font-family: arial; white-space: pre-wrap;">In </span><font face="arial"><span style="white-space: pre-wrap;">saoCuStatsBO_neon</span></font></li><ul><li><font face="arial"><span style="white-space: pre-wrap;">This only optimizes memory bandwidth. The benefit of interleaved memory access depends strongly on the CPU pipeline design and the compiler, so it is not generic; I am not sure how it behaves on other kinds of CPUs.<br></span></font></li></ul><li><span style="font-family: arial; white-space: pre-wrap;">In </span><font face="arial"><span style="white-space: pre-wrap;">saoCuStatsE*_neon</span></font></li><ul><li><font face="arial"><span style="white-space: pre-wrap;">No comments. It looks like vmulq_s16 + vmlaq_s16 saves one instruction compared with vandq_s16 + vandq_s16 + vaddq_s16 or tbl/tbx, so it should be faster on most modern CPUs.</span></font></li></ul><li><span style="font-family: arial; white-space: pre-wrap;">In </span><font face="arial"><span style="white-space: pre-wrap;">saoCuStats*_sve, </span></font><font face="arial"><span style="white-space: pre-wrap;">saoCuStats*_sve2</span></font></li><ul><li><font face="arial"><span style="white-space: pre-wrap;">No comments, since the algorithm is similar to the Neon one.<br></span></font></li></ul></ul><div><font face="arial"><span style="white-space: pre-wrap;">Regards,</span></font></div><div><font face="arial"><span style="white-space: pre-wrap;">Chen</span></font></div><p></p><p style="margin: 0;"><span style="font-family: arial; white-space: pre-wrap;">At 2024-05-21 00:14:35, "Hari Limaye" &lt;hari.limaye@arm.com&gt; wrote:</span></p></div><pre>>Hi,
>
>This patch-series adds AArch64 Neon, SVE, and SVE2 implementations of
>the saoCuStats function primitives for low and high bitdepth.
>
>This series is based on the previously submitted refactoring patch
>series.
>
>Performance numbers:
>
>C -> Neon on Neoverse V1:
> Low bitdepth:
> saoCuStatsBO | 1.09x
> saoCuStatsE0 | 2.67x
> saoCuStatsE1 | 2.82x
> saoCuStatsE2 | 2.93x
> saoCuStatsE3 | 3.26x
>
> High bitdepth:
> saoCuStatsBO | 1.09x
> saoCuStatsE0 | 2.39x
> saoCuStatsE1 | 2.67x
> saoCuStatsE2 | 2.47x
> saoCuStatsE3 | 2.86x
>
>Neon -> SVE on Neoverse V1:
> Low bitdepth:
> saoCuStatsE0 | 1.12x
> saoCuStatsE1 | 1.15x
> saoCuStatsE2 | 1.21x
> saoCuStatsE3 | 1.14x
>
> High bitdepth:
> saoCuStatsE0 | 1.19x
> saoCuStatsE1 | 1.28x
> saoCuStatsE2 | 1.19x
> saoCuStatsE3 | 1.12x
>
>SVE -> SVE2 on Neoverse V2:
> Low bitdepth:
> saoCuStatsE0 | 1.08x
> saoCuStatsE1 | 1.06x
> saoCuStatsE2 | 1.06x
> saoCuStatsE3 | 1.09x
>
> High bitdepth:
> saoCuStatsE0 | 1.03x
> saoCuStatsE1 | 1.10x
> saoCuStatsE2 | 1.08x
> saoCuStatsE3 | 1.09x
>
>Many thanks,
>
>Hari
>
>Hari Limaye (7):
> Test: Relax constraints of check_saoCuStatsE*
> Move duplicated signOf function to common header
> AArch64: Add Neon saoCuStats primitives for low bitdepth
> AArch64: Add Neon saoCuStats primitives for high bitdepth
> AArch64: Add check for arm_neon_sve_bridge.h
> AArch64: Add SVE saoCuStats primitives
> AArch64: Add SVE2 saoCuStats primitives
>
> source/CMakeLists.txt | 35 +-
> source/common/CMakeLists.txt | 19 +-
> source/common/aarch64/asm-primitives.cpp | 14 +
> source/common/aarch64/loopfilter-prim.cpp | 19 +-
> source/common/aarch64/sao-prim-sve.cpp | 271 +++++++++++++++
> source/common/aarch64/sao-prim-sve2.cpp | 317 ++++++++++++++++++
> source/common/aarch64/sao-prim.cpp | 380 ++++++++++++++++++++++
> source/common/aarch64/sao-prim.h | 100 ++++++
> source/common/common.h | 6 +
> source/common/loopfilter.cpp | 16 +-
> source/encoder/sao.cpp | 74 ++---
> source/test/pixelharness.cpp | 11 +-
> 12 files changed, 1187 insertions(+), 75 deletions(-)
> create mode 100644 source/common/aarch64/sao-prim-sve.cpp
> create mode 100644 source/common/aarch64/sao-prim-sve2.cpp
> create mode 100644 source/common/aarch64/sao-prim.cpp
> create mode 100644 source/common/aarch64/sao-prim.h
>
>--
>2.42.1
>
>_______________________________________________
>x265-devel mailing list
>x265-devel@videolan.org
>https://mailman.videolan.org/listinfo/x265-devel
</pre></div>
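<p style="margin: 0;">For reference, the signOf identity discussed for signOf_neon can be checked with a scalar model. This is only a sketch: the helper names cmp_gt_mask, sign_of_model, sign_of_ref and verify_sign_of_model are hypothetical and are not taken from the patch, and it assumes that a Neon compare sets every bit of a lane (-1 as a signed value) when the comparison is true and clears it otherwise.</p>

```cpp
#include <cassert>
#include <cstdint>

// Scalar model of one lane of a Neon "compare greater than" (assumption:
// all bits set, i.e. -1 as int8_t, when true; 0 when false).
static inline int8_t cmp_gt_mask(uint8_t a, uint8_t b)
{
    return a > b ? int8_t(-1) : int8_t(0);
}

// signOf(a - b) = -(a > b ? -1 : 0) | (a < b ? -1 : 0)
static inline int8_t sign_of_model(uint8_t a, uint8_t b)
{
    int8_t gt = cmp_gt_mask(a, b); // -1 if a > b, else 0
    int8_t lt = cmp_gt_mask(b, a); // -1 if a < b, else 0
    return int8_t(-gt | lt);       // +1, -1 or 0
}

// Plain C++ reference: here the comparisons evaluate to 0 or 1, not a mask.
static inline int8_t sign_of_ref(uint8_t a, uint8_t b)
{
    return int8_t((a > b) - (a < b));
}

// Exhaustively compare the mask-based model with the reference
// for every pair of 8-bit inputs.
static inline void verify_sign_of_model()
{
    for (int a = 0; a < 256; a++)
        for (int b = 0; b < 256; b++)
            assert(sign_of_model(uint8_t(a), uint8_t(b)) ==
                   sign_of_ref(uint8_t(a), uint8_t(b)));
}
```

<p style="margin: 0;">Spelling out the -1 in the comment makes it visible that the negation turns the "greater than" mask into +1, while the "less than" mask already is -1, so the bitwise OR of the two gives the sign.</p>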