<div dir="ltr">I respectfully disagree:<br><br>1 - I observed the same unrolling with both gcc and clang (the two major compilers on macOS).<br>2 - We already use unrolling pragmas freely for speedups in this file.<br>3 - There are many other commits with a roughly 20% speedup on a kernel in x265 and a similar end-to-end impact, e.g. 





<p class="gmail-p1" style="margin:0px;font-variant-numeric:normal;font-variant-east-asian:normal;font-variant-alternates:normal;font-size-adjust:none;font-kerning:auto;font-feature-settings:normal;font-stretch:normal;font-size:18px;line-height:normal;font-family:Menlo;color:rgb(0,0,0)"><span class="gmail-s1" style="font-variant-ligatures:no-common-ligatures">a8a83ba984b87c852fa5043595491c92c4d810e6<br></span></p><br><br></div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Sun, Nov 24, 2024 at 9:51 AM chen <<a href="mailto:chenm003@163.com">chenm003@163.com</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div style="line-height:1.7;color:rgb(0,0,0);font-size:14px;font-family:Arial"><div id="m_7804268896393182716spnEditorContent"><p style="margin:0px">In this case, I would prefer to keep the current code unchanged.</p><p style="margin:0px">The performance depends strongly on the compiler and the benefit is small; we may optimize this in asm in the future.</p></div><p>At 2024-11-24 23:54:48, "Ganesh Ajjanagadde" <<a href="mailto:gajjanag2@gmail.com" target="_blank">gajjanag2@gmail.com</a>> wrote:</p><blockquote id="m_7804268896393182716isReplyContent" style="padding-left:1ex;margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204)"><div dir="ltr"><div>Right, this only affects the dct32 case. 
The others are unaffected; any change for them is within noise.</div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Sat, Nov 23, 2024 at 9:57 PM chen <<a href="mailto:chenm003@163.com" target="_blank">chenm003@163.com</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div style="line-height:1.7;color:rgb(0,0,0);font-size:14px;font-family:Arial"><div id="m_7804268896393182716m_-3902502304530905332spnEditorContent"><p style="margin:0px">It looks like this helps only dct32x32 and idct4x4; are the other sizes similar or worse?</p><p style="margin:0px"><br></p></div><pre>At 2024-11-24 12:34:48, <a href="mailto:gajjanag2@gmail.com" target="_blank">gajjanag2@gmail.com</a> wrote:
>From: Ganesh Ajjanagadde <<a href="mailto:gajjanag@alum.mit.edu" target="_blank">gajjanag@alum.mit.edu</a>>
>
>Apple silicon has four 128-bit NEON execution units and benefits from unrolling.
>
>From ./TestBench on an M4 Mac Mini,
>
>before:
>dct8x8     |  2.32x |   205.12 |   476.62
>dct16x16   |  2.02x |   801.20 |  1619.62
>dct32x32   |  3.47x |  7566.39 | 26275.65
>idct4x4    |  0.90x |   175.80 |   157.90
>idct16x16  |  2.05x |   863.30 |  1771.80
>idct32x32  |  1.79x |  6344.33 | 11351.99
>
>after:
>dct8x8     |  2.33x |   204.72 |   476.53
>dct16x16   |  2.04x |   802.16 |  1637.39
>dct32x32   |  4.96x |  5181.02 | 25700.34
>idct4x4    |  1.08x |   162.09 |   174.40
>idct16x16  |  1.95x |   910.01 |  1771.61
>idct32x32  |  1.75x |  6350.72 | 11143.71
>
>~2% end-to-end encoding speedup
>---
> source/common/aarch64/dct-prim.cpp | 2 ++
> 1 file changed, 2 insertions(+)
>
>diff --git a/source/common/aarch64/dct-prim.cpp b/source/common/aarch64/dct-prim.cpp
>index 8b523ceb0..e6ee7005b 100644
>--- a/source/common/aarch64/dct-prim.cpp
>+++ b/source/common/aarch64/dct-prim.cpp
>@@ -435,6 +435,7 @@ static inline void partialButterfly32_neon(const int16_t *src, int16_t *dst)
>         for (int i = 0; i < line; i += 4)
>         {
>             int32x4_t t[4];
>+X265_PRAGMA_UNROLL(4)
>             for (int j = 0; j < 4; ++j) {
>                 t[j] = vmull_s16(c0, vget_low_s16(O[i + j][0]));
>                 t[j] = vmlal_s16(t[j], c1, vget_high_s16(O[i + j][0]));
>@@ -461,6 +462,7 @@ static inline void partialButterfly32_neon(const int16_t *src, int16_t *dst)
>         for (int i = 0; i < line; i += 4)
>         {
>             int32x4_t t[4];
>+X265_PRAGMA_UNROLL(4)
>             for (int j = 0; j < 4; ++j) {
>                 t[j] = vmulq_s32(c0, EO[i + j][0]);
>                 t[j] = vmlaq_s32(t[j], c1, EO[i + j][1]);
>-- 
>2.39.5 (Apple Git-154)
>
>_______________________________________________
>x265-devel mailing list
><a href="mailto:x265-devel@videolan.org" target="_blank">x265-devel@videolan.org</a>
><a href="https://mailman.videolan.org/listinfo/x265-devel" target="_blank">https://mailman.videolan.org/listinfo/x265-devel</a>
</pre></div>
</blockquote></div></div>
</blockquote></div>
</blockquote></div>
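<p>For readers following points 1 and 2, here is a minimal sketch of how a per-compiler unroll macro of this kind can be written and applied to a loop with a fixed trip count of 4. The names DEMO_PRAGMA, DEMO_PRAGMA_UNROLL and dot_blocks_of_4 are hypothetical and are not x265's definitions. The loop uses scalar arithmetic instead of NEON intrinsics so that it builds on any target. Whether the compiler honors the hint, and whether it helps, still has to be measured per compiler and CPU.</p>

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper macros for illustration only (not x265's own).
// _Pragma lets a pragma be produced by a macro expansion.
#define DEMO_PRAGMA(text) _Pragma(#text)

#if defined(__clang__)
// Clang accepts "#pragma unroll(N)" before a loop.
#define DEMO_PRAGMA_UNROLL(n) DEMO_PRAGMA(unroll(n))
#elif defined(__GNUC__) && (__GNUC__ >= 8)
// GCC 8 and later accept "#pragma GCC unroll N" before a loop.
#define DEMO_PRAGMA_UNROLL(n) DEMO_PRAGMA(GCC unroll n)
#else
// Other compilers: expand to nothing and leave the decision to the optimizer.
#define DEMO_PRAGMA_UNROLL(n)
#endif

// Dot product processed in blocks of 4, mirroring the shape of the patched
// loops: an outer loop stepping by 4 and an inner loop of exactly 4 iterations.
static int32_t dot_blocks_of_4(const int16_t *a, const int16_t *b, int len)
{
    assert(len % 4 == 0); // the sketch assumes whole blocks of 4

    int32_t acc = 0;
    for (int i = 0; i < len; i += 4)
    {
        DEMO_PRAGMA_UNROLL(4)
        for (int j = 0; j < 4; ++j)
        {
            acc += static_cast<int32_t>(a[i + j]) * static_cast<int32_t>(b[i + j]);
        }
    }
    return acc;
}
```

<p>The pragma only changes code generation, not the result, so the function returns the same value with or without it. Comparing the generated assembly with and without the macro is one way to confirm that a given compiler actually unrolled the inner loop.</p>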