<div style="line-height:1.7;color:#000000;font-size:14px;font-family:Arial"><div style="margin: 0;">Hi,</div><div style="margin: 0;"><br></div><div style="margin: 0;"><span style="font-family: Calibri, sans-serif; font-size: 14.6667px;">@@ -508,19 +508,17 @@ function x265_copy_cnt_4_neon</span></div><div style="margin: 0;"><span style="font-family: Calibri, sans-serif; font-size: 14.6667px;">......</span></div><div style="margin: 0;"><p class="MsoNormal"><span style="font-size: 14.6667px;">+ uaddlv s4, v4.4h</span></p><p class="MsoNormal"><span style="font-size: 14.6667px;">Unsigned?</span></p><p class="MsoNormal"><span style="font-size: 14.6667px;"><br></span></p><p class="MsoNormal"><span style="font-size: 14.6667px;">+ umov w12, v4.h[0]</span></p><p class="MsoNormal"><span style="font-size: 14.6667px;">+ sxth w12, w12</span></p><p class="MsoNormal"><span style="font-size: 14.6667px;">+ add x0, x12, #16</span></p><div><br></div><p class="MsoNormal"><span style="font-size: 11pt;">The SXTH is unnecessary: the count of zeros must be in the range [0,16], so W12 is in the range [-16,0].</span></p><p class="MsoNormal"><span style="font-size: 11pt;">Please also keep in mind that W0 is the low half of X0, and that the result in register S4 is an int32.</span></p><p class="MsoNormal"><span style="font-size: 11pt;"><br></span></p><p class="MsoNormal"><span style="font-size: 11pt;">The rest of the patch looks good.</span></p><p class="MsoNormal"><span style="font-size: 11pt;"><br></span></p><p class="MsoNormal"><span style="font-size: 11pt;">Regards,</span></p><p class="MsoNormal"><span style="font-size: 11pt;">Min Chen</span></p></div><p>At 2021-07-25 13:31:06, "Pop, Sebastian" &lt;spop@amazon.com&gt; wrote:</p><blockquote id="isReplyContent" style="PADDING-LEFT: 1ex; MARGIN: 0px 0px 0px 0.8ex; BORDER-LEFT: #ccc 1px solid">
<style><!--
@font-face
{font-family:"Cambria Math";
panose-1:2 4 5 3 5 4 6 3 2 4;}
@font-face
{font-family:Calibri;
panose-1:2 15 5 2 2 2 4 3 2 4;}
p.MsoNormal, li.MsoNormal, div.MsoNormal
{margin:0in;
font-size:12.0pt;
font-family:"Calibri",sans-serif;}
span.EmailStyle21
{mso-style-type:personal-reply;
font-family:"Calibri",sans-serif;
color:windowtext;}
.MsoChpDefault
{mso-style-type:export-only;
font-size:10.0pt;}
@page WordSection1
{size:8.5in 11.0in;
margin:1.0in 1.0in 1.0in 1.0in;}
div.WordSection1
{page:WordSection1;}
--></style>
<div class="WordSection1">
<p class="MsoNormal"><span style="font-size:11.0pt">Hi,<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt"><o:p> </o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt">> You didn't see an improvement because you still use USHR. After CMEQ we get 0 or -1,<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt">> depending on the result, so we can sum these -1 values to derive the total number of non-zero<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt">> coeffs; this reduces 3 instructions to 2.<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt"><o:p> </o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt">You are right. With this change I see a lot of improvement:<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt"><o:p> </o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt">@@ -508,19 +508,17 @@ function x265_copy_cnt_4_neon<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt">.rept 2<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt"> ld1 {v0.8b}, [x1], x2<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt"> ld1 {v1.8b}, [x1], x2<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt">- clz v2.4h, v0.4h<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt">- clz v3.4h, v1.4h<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt">- ushr v2.4h, v2.4h, #4<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt">- ushr v3.4h, v3.4h, #4<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt">- add v2.4h, v2.4h, v3.4h<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt">- add v4.4h, v4.4h, v2.4h<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt"> st1 {v0.8b}, [x0], #8<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt"> st1 {v1.8b}, [x0], #8<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt">+ cmeq v0.4h, v0.4h, #0<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt">+ cmeq v1.4h, v1.4h, #0<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt">+ add v4.4h, v4.4h, v0.4h<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt">+ add v4.4h, v4.4h, v1.4h<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt">.endr<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt"> uaddlv s4, v4.4h<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt">- fmov w12, s4<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt">- mov w11, #16<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt">- sub w0, w11, w12<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt">+ umov w12, v4.h[0]<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt">+ sxth w12, w12<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt">+ add x0, x12, #16<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt"> ret<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt">endfunc<o:p></o:p></span></p>
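<p class="MsoNormal"><span style="font-size:11.0pt">For readers following the arithmetic, here is a minimal scalar sketch in Python of the two counting schemes in the diff above. It is a model under stated assumptions, not the x265 code: the function names are hypothetical, each row of the 4x4 block is treated as four 16-bit lanes, and the final horizontal reduction is simplified to a plain signed sum.<o:p></o:p></span></p>
<pre>
```python
MASK16 = 0xFFFF

def to_signed16(x):
    """Interpret the low 16 bits of x as a signed halfword."""
    x &= MASK16
    return x - 0x10000 if x & 0x8000 else x

def clz16(x):
    """CLZ on one 16-bit lane; a zero lane yields 16."""
    return 16 - (x & MASK16).bit_length()

def copy_cnt_4x4_clz(block):
    """Removed sequence: CLZ, USHR #4, ADD, then 16 - sum."""
    acc = [0, 0, 0, 0]
    for row in block:
        for lane, c in enumerate(row):
            acc[lane] += clz16(c) >> 4      # 1 only when the lane is exactly 0
    return 16 - sum(acc)

def copy_cnt_4x4_cmeq(block):
    """Added sequence: CMEQ #0, ADD, then sum + 16."""
    acc = [0, 0, 0, 0]
    for row in block:
        for lane, c in enumerate(row):
            mask = MASK16 if (c & MASK16) == 0 else 0   # CMEQ #0: all ones (-1) for a zero lane
            acc[lane] = (acc[lane] + mask) & MASK16     # 16-bit lane ADD wraps around
    # Assumption: the horizontal reduction is modeled as a signed sum.
    return sum(to_signed16(a) for a in acc) + 16
```
</pre>
<p class="MsoNormal"><span style="font-size:11.0pt">With the hypothetical block [[0, 5, -3, 0], [0, 0, 0, 0], [1, 0, 0, 7], [0, 0, 0, -1]], both functions return 5, the number of non-zero coefficients.<o:p></o:p></span></p>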
<p class="MsoNormal"><span style="font-size:11.0pt"><o:p> </o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt"><o:p> </o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt">Before:<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt"> copy_cnt[4x4] 13.93x 7.50 104.56<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt"> copy_cnt[8x8] 31.20x 12.70 396.33<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt"> copy_cnt[16x16] 43.22x 36.00 1556.03<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt"> copy_cnt[32x32] 47.39x 129.34 6129.63<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt"><o:p> </o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt">After:<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt"> copy_cnt[4x4] 14.76x 7.12 105.12<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt"> copy_cnt[8x8] 37.56x 10.60 398.25<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt"> copy_cnt[16x16] 52.57x 29.74 1563.60<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt"> copy_cnt[32x32] 62.22x 98.37 6120.29<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt"><o:p> </o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt"><o:p> </o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt">> + xtn v0.8b, v0.8h<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt">> + xtn2 v0.16b, v1.8h<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt">> is equivalent to<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt">> tbl v0, {v0,v1}, v2<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt"><o:p> </o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt">You are right. With this change I see a lot of improvement:<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt"><o:p> </o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt">Before:<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt">copy_sp[16x16] 85.13x 18.78 1599.19<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt">copy_sp[32x32] 96.31x 65.07 6266.88<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt">copy_sp[64x64] 98.81x 252.38 24937.40<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt">[i422] copy_sp[16x32] 91.93x 34.32 3154.89<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt">[i422] copy_sp[32x64] 99.54x 128.29 12769.10<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt"><o:p> </o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt">After:<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt">copy_sp[16x16] 96.23x 16.42 1579.74<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt">copy_sp[32x32] 104.33x 57.84 6034.24<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt">copy_sp[64x64] 110.79x 221.66 24558.72<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt">[i422] copy_sp[16x32] 97.74x 31.89 3116.46<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt">[i422] copy_sp[32x64] 111.37x 112.39 12517.52<o:p></o:p></span></p>
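<p class="MsoNormal"><span style="font-size:11.0pt">The equivalence quoted above between XTN/XTN2 and a single TBL can be checked with a small byte-level model in Python. This is only a sketch: the function names are hypothetical, and the contents of the index register v2 are not given in this thread, so the even byte offsets 0, 2, ..., 30 used here are an assumption.<o:p></o:p></span></p>
<pre>
```python
def xtn_pair(v0, v1):
    """XTN + XTN2: keep the low byte of every 16-bit lane of v0, then of v1."""
    return bytes(h & 0xFF for h in v0 + v1)

def tbl2(v0, v1, idx):
    """TBL with a two-register table: out[i] = table[idx[i]], or 0 when out of range."""
    # Lanes are laid out little-endian, so the low byte of lane k sits at byte 2*k.
    table = b"".join((h & 0xFFFF).to_bytes(2, "little") for h in v0 + v1)
    return bytes(table[i] if 0 <= i < len(table) else 0 for i in idx)

# Assumed contents of the index vector v2: every even byte offset.
EVEN_BYTES = list(range(0, 32, 2))
```
</pre>
<p class="MsoNormal"><span style="font-size:11.0pt">For any two lists of eight 16-bit lanes, xtn_pair(v0, v1) and tbl2(v0, v1, EVEN_BYTES) produce the same 16 bytes in this model.<o:p></o:p></span></p>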
<p class="MsoNormal"><span style="font-size:11.0pt"><o:p> </o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt">Please see the amended patch.<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt"><o:p> </o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt">Thanks,<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt">Sebastian<o:p></o:p></span></p>
</div>
</blockquote></div>
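<p>As a worked illustration of the signedness question raised in the review, here is a scalar sketch in Python of the final reduction. It rests on assumptions: the review only asks "Unsigned?", so the SADDLV variant is one possible reading of that comment rather than code from the patch, and all function names are hypothetical. Each lane of v4 is taken to hold the negated zero count accumulated by CMEQ and ADD.</p>
<pre>
```python
MASK16 = 0xFFFF
MASK32 = 0xFFFFFFFF

def to_signed16(x):
    """Interpret the low 16 bits of x as a signed halfword."""
    x &= MASK16
    return x - 0x10000 if x & 0x8000 else x

def uaddlv_4h(lanes):
    """UADDLV s, v.4h: zero-extend each lane, sum into 32 bits."""
    return sum(l & MASK16 for l in lanes) & MASK32

def saddlv_4h(lanes):
    """SADDLV s, v.4h: sign-extend each lane, sum into 32 bits."""
    return sum(to_signed16(l) for l in lanes) & MASK32

def nonzero_uaddlv_sxth(lanes):
    """Sequence from the patch: UADDLV, UMOV h[0], SXTH, ADD #16."""
    w12 = uaddlv_4h(lanes) & MASK16          # umov w12, v4.h[0]
    return (to_signed16(w12) + 16) & MASK32  # sxth, then add

def nonzero_uaddlv_no_sxth(lanes):
    """Same as above with the SXTH dropped (UADDLV kept)."""
    w12 = uaddlv_4h(lanes) & MASK16
    return (w12 + 16) & MASK32

def nonzero_saddlv(lanes):
    """Assumed alternative: SADDLV, FMOV w12, s4, ADD w0, w12, #16."""
    return (saddlv_4h(lanes) + 16) & MASK32
```
</pre>
<p>In this model, lanes of [0xFFFC] * 4 (all 16 coefficients zero) give 0 for both nonzero_uaddlv_sxth and nonzero_saddlv, while nonzero_uaddlv_no_sxth gives 65536. Dropping the SXTH therefore appears to be safe only once the reduction itself is signed and the full 32-bit result in S4 is used.</p>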