<div style="line-height:1.7;color:#000000;font-size:14px;font-family:Arial"><div id="spnEditorContent"><p style="margin: 0;">Hi <span style="font-family: arial; white-space: pre-wrap;">Hari,</span></p><p style="margin: 0;"><span style="font-family: arial; white-space: pre-wrap;"><br></span></p><p style="margin: 0;"><span style="font-family: arial; white-space: pre-wrap;">The new patches look good to me now. Thank you for the updates.</span></p><p style="margin: 0;"><span style="font-family: arial; white-space: pre-wrap;"><br></span></p><p style="margin: 0;"><span style="font-family: arial; white-space: pre-wrap;">Regards,</span></p><p style="margin: 0;"><span style="font-family: arial; white-space: pre-wrap;">Chen</span></p></div><pre>At 2024-05-23 03:09:26, "Hari Limaye" <hari.limaye@arm.com> wrote:

>Hi Chen,

>

>Thank you for reviewing the patches.

>

>>In signOf_neon

>>>+ // signOf(a - b) = -(a > b) | (b > a)

>>The comment is not clear. I suggest:

>>-(a > b ? -1 : 0) | ( a < b)

>

>I have posted updated versions of patches 3, 4, and 6 to make these comments clearer with respect to the possible outputs of Neon comparison instructions.
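
For reference, the identity in that comment can be checked with a scalar model of a single lane. This is only a sketch, not the code from the patches: the helper names lane_cgt and sign_of_diff are hypothetical, and it assumes the usual Neon behaviour where a signed compare such as vcgtq_s16 sets a lane to all ones (-1 when read as signed) for true and 0 for false.

```cpp
// Scalar model of one 16-bit Neon lane (hypothetical helper names).
// Assumption: a Neon "greater than" compare yields all ones (-1) for
// true and 0 for false, unlike the C/C++ operator, which yields 1 or 0.
static short lane_cgt(short a, short b)
{
    return (a > b) ? (short)-1 : (short)0;
}

// signOf(a - b) = -(a > b) | (b > a), using the mask-style compare above.
static short sign_of_diff(short a, short b)
{
    short gt = lane_cgt(a, b); // -1 if a > b, else 0
    short lt = lane_cgt(b, a); // -1 if b > a, else 0
    // Negating gt turns -1 into +1; lt already holds -1 when b > a.
    // At most one of the two is non-zero, so OR simply selects it.
    return (short)(-gt | lt);
}
```

With these definitions, sign_of_diff gives 1 when a is larger, -1 when b is larger, and 0 when they are equal, which is why the comment only holds once the compare result is understood as -1 rather than 1.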

>

>>In saoCuStatsBO_neon

>>This is only a memory bandwidth optimization. Interval memory access depends strongly on the CPU pipeline design and the compiler, so it is not generic, and I am not sure how it behaves on other kinds of CPUs.

>

>Yes, it is primarily a memory bandwidth optimisation. We have tested it with recent GCC and Clang on a range of Neoverse CPUs and found it to be faster than the C implementation.

>

>>In saoCuStatsE*_neon

>>No comments. It looks like vmulq_s16+vmlaq_s16 uses one instruction fewer than vandq_s16+vandq_s16+vaddq_s16 or tbl/tbx, so it is mostly faster on modern CPUs.

>

>Yes, we found that this instruction sequence was faster than the alternatives for the Neon implementation.

>

>Many thanks,

>

>Hari

</pre></div>