<div data-ntes="ntes_mail_body_root" style="line-height:1.7;color:#000000;font-size:14px;font-family:Arial"><div id="spnEditorContent"><p style="margin: 0;">Hi Hari & Jonathan,</p><p style="margin: 0;"><br></p><p style="margin: 0;">Thank for the patches, I have some comments,</p><p style="margin: 0;"><br></p><p style="margin: 0;">[PATCH 1/9] Move C DCT implementations into X265_NS</p><p style="margin: 0;">1. These function will share for 8/10/12 bpp, if move into <span style="font-family: arial; white-space: pre-wrap;">X265_NS, it will make duplicated copy</span></p><p style="margin: 0;">2. add new section "<span style="font-family: arial; white-space: pre-wrap;">namespace X265_NS" before these functions are better than move, it affects code history record.</span></p><p style="margin: 0;"><span style="font-family: arial; white-space: pre-wrap;"><br></span></p><p style="margin: 0;"><font face="arial"><span style="white-space: pre-wrap;">[PATCH 3/9] AArch64: Optimise partialButterfly8_neon</span></font></p><p style="margin: 0;"><font face="arial"><span style="white-space: pre-wrap;">[PATCH 4/9] AArch64: Optimise partialButterfly16_neon</span></font></p><p style="margin: 0;"><span style="white-space: pre-wrap; font-family: arial;">[PATCH 5/9] AArch64: Optimise partialButterfly32_neon</span></p><p style="margin: 0;"><font face="arial"><span style="white-space: pre-wrap;">[PATCH 7/9] AArch64: Add SVE implementation of 8x8 DCT</span></font></p><p style="margin: 0;"><font face="arial"><span style="white-space: pre-wrap;">[PATCH 8/9] AArch64: Add SVE implementation of 16x16 DCT</span></font></p><p style="margin: 0;"><font face="arial"><span style="white-space: pre-wrap;">[PATCH 9/9] AArch64: Add SVE implementation of 32x32 DCT</span></font></p><p style="margin: 0;"></p><ul><li>partialButterfly8_neon</li><ul><li>For size 8, butterfly E & O is necessary, but EE/EO is not a good idea, Odd spends 8 operators per line, Even spends 4 operators plus 1 temporary store and 2 prepare operators, totally 7 operators with dependency link, looks no performance benefits, especally SVE SVDOT may get more performance with Odd method.</li><li>Code style mismatch in different code section, one line is better.<br>+        int32x4_t t01 = vpaddq_s32(vmull_s16(c1, O[j + 0]),<br>+                                   vmull_s16(c1, O[j + 1]));<br>*** vs<br>+        t01 = vpaddq_s32(vmull_s16(c3, O[j + 0]), vmull_s16(c3, O[j + 1]));<br><br></li></ul><li>dct8_neon</li><ul><li>Better inline two of partialButterfly8_neon, it reduce some operators, such as <i>int32x4_t c0 = vld1q_s32(t8_even[0])</i>;</li></ul><li>16x16 and 32x2 are similar</li><li>const table may share in between Neon and Sve code</li></ul><p></p><p style="margin: 0;"><br></p></div><div>Regards,</div><div>Chen</div><pre><br>At 2024-08-22 23:17:50, "Hari Limaye" <hari.limaye@arm.com> wrote:

>Move C implementations of DCT functions into the X265_NS namespace, and

>remove the static modifier from their declarations, so that they can be

>referenced from external code when linking to libx265.

>---

> source/common/dct.cpp | 340 +++++++++++++++++++++---------------------

> 1 file changed, 170 insertions(+), 170 deletions(-)

>

>diff --git a/source/common/dct.cpp b/source/common/dct.cpp

>index b102b6e31..d318b2c64 100644

>--- a/source/common/dct.cpp

>+++ b/source/common/dct.cpp

>@@ -439,176 +439,6 @@ static void partialButterfly4(const int16_t* src, int16_t* dst, int shift, int l

>     }

> }

> 

>-static void dst4_c(const int16_t* src, int16_t* dst, intptr_t srcStride)

>-{

>-    const int shift_1st = 1 + X265_DEPTH - 8;

>-    const int shift_2nd = 8;

>-

>-    ALIGN_VAR_32(int16_t, coef[4 * 4]);

>-    ALIGN_VAR_32(int16_t, block[4 * 4]);

>-

>-    for (int i = 0; i < 4; i++)

>-    {

>-        memcpy(&block[i * 4], &src[i * srcStride], 4 * sizeof(int16_t));

>-    }

>-

>-    fastForwardDst(block, coef, shift_1st);

>-    fastForwardDst(coef, dst, shift_2nd);

>-}


</pre></div>