<div style="line-height:1.7;color:#000000;font-size:14px;font-family:Arial"><div style="margin: 0;">Hi <span style="font-family: Calibri, sans-serif; font-size: 14.6667px;">Sebastian</span>,</div><div style="margin: 0;"><br></div><div style="margin: 0;">Thanks for your patch.</div><div style="margin: 0;">I have some comments.</div><div style="margin: 0;"><br></div>
<p style="margin: 0;">+function x265_filterPixelToShort_4x4_neon</p><p style="margin: 0;">+ add x3, x3, x3</p><p style="margin: 0;">+ movi v2.8h, #0xe0, lsl #8</p><div style="margin: 0;">Does your compiler not handle the constant 0xe000 automatically? Writing it directly would be more readable.</div><div style="margin: 0;"><br></div>
<div style="margin: 0;"><div style="margin: 0;">+ ld1 {v0.s}[0], [x0], x1</div><div style="margin: 0;">+ ld1 {v0.s}[1], [x0], x1</div><div style="margin: 0;">+ ld1 {v1.s}[2], [x0], x1</div><div style="margin: 0;">Why v1.s here and not v0.s?</div><div style="margin: 0;"><br></div><div style="margin: 0;">+ ld1 {v1.s}[3], [x0], x1</div><div><br></div></div>
<p style="margin: 0;">+.macro filterPixelToShort_32xN h</p><p style="margin: 0;">+function x265_filterPixelToShort_32x\h\()_neon</p><p style="margin: 0;">+ add x3, x3, x3</p><p style="margin: 0;">+ movi v6.8h, #0xe0, lsl #8</p><p style="margin: 0;">+.rept \h</p><p style="margin: 0;">+ ld1 {v0.16b-v1.16b}, [x0], x1</p><div>ldp may provide more bandwidth here.</div><div><br></div>
<div><div>+.macro filterPixelToShort_64xN h</div><div>+function x265_filterPixelToShort_64x\h\()_neon</div><div>+ add x3, x3, x3</div><div>+ sub x3, x3, #0x40</div><div>+ movi v4.8h, #0xe0, lsl #8</div><div>+.rept \h</div><div>I don't think unrolling the loop N times is a good idea: the code section becomes too large, which will most likely cause instruction cache evictions and misses.</div><div><br></div>
<div>+ ld1 {v0.16b-v3.16b}, [x0], x1</div><div>+ ushll v16.8h, v0.8b, #6</div><div>+ ushll2 v17.8h, v0.16b, #6</div><div>+ ushll v18.8h, v1.8b, #6</div><div>+ ushll2 v19.8h, v1.16b, #6</div><div>+ ushll v20.8h, v2.8b, #6</div><div>+ ushll2 v21.8h, v2.16b, #6</div><div>+ ushll v22.8h, v3.8b, #6</div><div>+ ushll2 v23.8h, v3.16b, #6</div><div>+ add v16.8h, v16.8h, v4.8h</div><div>+ add v17.8h, v17.8h, v4.8h</div><div>+ add v18.8h, v18.8h, v4.8h</div><div>+ add v19.8h, v19.8h, v4.8h</div><div>+ add v20.8h, v20.8h, v4.8h</div><div>+ add v21.8h, v21.8h, v4.8h</div><div>+ add v22.8h, v22.8h, v4.8h</div><div>+ add v23.8h, v23.8h, v4.8h</div><div>+ st1 {v16.16b-v19.16b}, [x2], #0x40</div><div>ldp may reduce pipeline stalls and provide more bandwidth.</div><div><br></div>
<div>+ st1 {v20.16b-v23.16b}, [x2], x3</div><div>+.endr</div><div>+ ret</div><div>+endfunc</div><div>+.endm</div></div><div><br></div>
<p style="margin: 0;"><br></p><p style="margin: 0;"><br></p><div style="position:relative;zoom:1"></div><div id="divNeteaseMailCard"></div><p style="margin: 0;"><br></p><p> 2021-06-24 07:52:22, "Pop, Sebastian" &lt;spop@amazon.com&gt; </p><blockquote id="isReplyContent" style="PADDING-LEFT: 1ex; MARGIN: 0px 0px 0px 0.8ex; BORDER-LEFT: #ccc 1px solid">
<div class="WordSection1">
<p class="MsoNormal"><span style="font-size:11.0pt">Hi,</span></p>
<p class="MsoNormal"><span style="font-size:11.0pt">&nbsp;</span></p>
<p class="MsoNormal"><span style="font-size:11.0pt">The attached patch ports filterPixelToShort to arm64.</span></p>
<p class="MsoNormal"><span style="font-size:11.0pt">Tested on graviton2 arm64-linux.</span></p>
<p class="MsoNormal"><span style="font-size:11.0pt">&nbsp;</span></p>
<p class="MsoNormal"><span style="font-size:11.0pt">convert_p2s[ 4x4] 1.21x 4.98 6.03</span></p>
<p class="MsoNormal"><span style="font-size:11.0pt">convert_p2s[ 8x8] 2.20x 6.20 13.65</span></p>
<p class="MsoNormal"><span style="font-size:11.0pt">convert_p2s[16x16] 1.54x 25.24 38.94</span></p>
<p class="MsoNormal"><span style="font-size:11.0pt">convert_p2s[32x32] 1.49x 101.99 151.63</span></p>
<p class="MsoNormal"><span style="font-size:11.0pt">convert_p2s[64x64] 1.48x 420.31 622.36</span></p>
<p class="MsoNormal"><span style="font-size:11.0pt">convert_p2s[ 8x4] 2.18x 3.05 6.64</span></p>
<p class="MsoNormal"><span style="font-size:11.0pt">convert_p2s[ 4x8] 1.91x 6.01 11.49</span></p>
<p class="MsoNormal"><span style="font-size:11.0pt">convert_p2s[ 16x8] 1.47x 12.19 17.92</span></p>
<p class="MsoNormal"><span style="font-size:11.0pt">convert_p2s[ 8x16] 1.95x 13.30 25.94</span></p>
<p class="MsoNormal"><span style="font-size:11.0pt">convert_p2s[32x16] 1.49x 50.63 75.58</span></p>
<p class="MsoNormal"><span style="font-size:11.0pt">convert_p2s[16x32] 1.56x 49.92 77.66</span></p>
<p class="MsoNormal"><span style="font-size:11.0pt">convert_p2s[64x32] 1.49x 209.43 312.13</span></p>
<p class="MsoNormal"><span style="font-size:11.0pt">convert_p2s[32x64] 1.48x 205.16 304.53</span></p>
<p class="MsoNormal"><span style="font-size:11.0pt">convert_p2s[16x12] 1.65x 17.62 29.08</span></p>
<p class="MsoNormal"><span style="font-size:11.0pt">convert_p2s[12x16] 6.22x 24.07 149.61</span></p>
<p class="MsoNormal"><span style="font-size:11.0pt">convert_p2s[ 16x4] 1.60x 5.37 8.59</span></p>
<p class="MsoNormal"><span style="font-size:11.0pt">convert_p2s[ 4x16] 1.75x 13.58 23.73</span></p>
<p class="MsoNormal"><span style="font-size:11.0pt">convert_p2s[32x24] 1.48x 76.47 113.22</span></p>
<p class="MsoNormal"><span style="font-size:11.0pt">convert_p2s[24x32] 2.69x 78.12 210.52</span></p>
<p class="MsoNormal"><span style="font-size:11.0pt">convert_p2s[ 32x8] 1.48x 25.00 37.06</span></p>
<p class="MsoNormal"><span style="font-size:11.0pt">convert_p2s[ 8x32] 1.63x 29.10 47.46</span></p>
<p class="MsoNormal"><span style="font-size:11.0pt">convert_p2s[64x48] 1.48x 314.74 466.77</span></p>
<p class="MsoNormal"><span style="font-size:11.0pt">convert_p2s[64x16] 1.48x 104.13 154.48</span></p>
<p class="MsoNormal"><span style="font-size:11.0pt">convert_p2s[16x64] 1.58x 98.66 155.67</span></p>
<p class="MsoNormal"><span style="font-size:11.0pt">&nbsp;</span></p>
<p class="MsoNormal"><span style="font-size:11.0pt">Ok to commit?</span></p>
<p class="MsoNormal"><span style="font-size:11.0pt">&nbsp;</span></p>
<p class="MsoNormal"><span style="font-size:11.0pt">Thanks,</span></p>
<p class="MsoNormal"><span style="font-size:11.0pt">Sebastian</span></p>
</div>
</blockquote></div>
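<div>A note on the 0xe000 constant from the review above: the sketch below is a scalar C model of what each 16-bit lane computes after ushll #6 followed by add with 0xe000. It assumes the C reference subtracts an offset of 8192 after the shift; that offset and the helper names (p2s_ref, p2s_lane, p2s_forms_agree) are illustrative assumptions, not code taken from the patch. It only shows that adding 0xe000 modulo 2^16 is the same as subtracting 8192, which is why the constant can be written either way.</div>

```c
/* Scalar model of the 8-bit pixel-to-short conversion.
 * Names and the 8192 offset are assumptions for illustration only.
 * Assumes short is 16 bits wide, as on arm64 Linux. */
enum { P2S_SHIFT = 6, P2S_OFFSET = 8192 };

/* Reference form: widen, shift left by 6, subtract the offset. */
static short p2s_ref(unsigned char p)
{
    return (short)((p << P2S_SHIFT) - P2S_OFFSET);
}

/* Lane form: shift left by 6, then add 0xe000 with 16-bit wraparound,
 * mirroring ushll #6 followed by add with a vector of 0xe000. */
static unsigned short p2s_lane(unsigned char p)
{
    return (unsigned short)((p << P2S_SHIFT) + 0xe000);
}

/* Returns 1 if both forms give the same 16-bit pattern for every pixel value. */
static int p2s_forms_agree(void)
{
    int p;
    for (p = 0; p < 256; p++) {
        if ((unsigned short)p2s_ref((unsigned char)p) != p2s_lane((unsigned char)p))
            return 0;
    }
    return 1;
}
```

<div>The comparison is done on unsigned 16-bit values so that the wraparound is well defined in C; the NEON add does not distinguish between signed and unsigned lanes either.</div>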