<div style="line-height:1.7;color:#000000;font-size:14px;font-family:Arial"><div style="margin: 0;">You are welcome.</div><div style="margin: 0;">On your CPU, ldp is still slower, so we can keep the original version and improve it again in the future.</div><div style="margin: 0;">This version looks good to me. Thank you for your contribution.</div><div style="margin: 0;"><br></div><p>At 2021-06-24 10:01:40, "Pop, Sebastian" &lt;spop@amazon.com&gt; wrote:</p><blockquote id="isReplyContent" style="PADDING-LEFT: 1ex; MARGIN: 0px 0px 0px 0.8ex; BORDER-LEFT: #ccc 1px solid">
<style><!--
@font-face
{font-family:"Cambria Math";
panose-1:2 4 5 3 5 4 6 3 2 4;}
@font-face
{font-family:Calibri;
panose-1:2 15 5 2 2 2 4 3 2 4;}
p.MsoNormal, li.MsoNormal, div.MsoNormal
{margin:0in;
font-size:12.0pt;
font-family:"Calibri",sans-serif;}
span.EmailStyle21
{mso-style-type:personal-reply;
font-family:"Calibri",sans-serif;
color:windowtext;}
.MsoChpDefault
{mso-style-type:export-only;
font-size:10.0pt;}
@page WordSection1
{size:8.5in 11.0in;
margin:1.0in 1.0in 1.0in 1.0in;}
div.WordSection1
{page:WordSection1;}
--></style>
<div class="WordSection1">
<p class="MsoNormal"><span style="font-size:11.0pt">Thanks again Chen for your careful review and recommendations.<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt"><o:p> </o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt">I added the following change to the attached patch because it gives better performance:<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt"><o:p> </o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt">--- a/source/common/aarch64/ipfilter8.S<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt">+++ b/source/common/aarch64/ipfilter8.S<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt">@@ -35,14 +35,14 @@ function x265_filterPixelToShort_4x4_neon<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt"> movi v2.8h, #0xe0, lsl #8<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt"> ld1 {v0.s}[0], [x0], x1<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt"> ld1 {v0.s}[1], [x0], x1<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt">- ld1 {v1.s}[2], [x0], x1<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt">- ld1 {v1.s}[3], [x0], x1<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt"> ushll v3.8h, v0.8b, #6<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt">- ushll2 v4.8h, v1.16b, #6<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt"> add v3.8h, v3.8h, v2.8h<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt">- add v4.8h, v4.8h, v2.8h<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt"> st1 {v3.d}[0], [x2], x3<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt"> st1 {v3.d}[1], [x2], x3<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt">+ ld1 {v1.s}[0], [x0], x1<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt">+ ld1 {v1.s}[1], [x0], x1<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt">+ ushll v4.8h, v1.8b, #6<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt">+ add v4.8h, v4.8h, v2.8h<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt"> st1 {v4.d}[0], [x2], x3<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt"> st1 {v4.d}[1], [x2], x3<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt"> ret<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt"><o:p> </o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt">Before:<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt">convert_p2s[ 4x4] 1.20x 4.99 6.01<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt"><o:p> </o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt">After:<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt">convert_p2s[ 4x4] 1.38x 4.20 5.78<o:p></o:p></span></p>
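<p class="MsoNormal"><span style="font-size:11.0pt">For reference, here is a minimal scalar sketch in Python of what the 4x4 kernel above computes, read directly off the instructions: each 8-bit pixel is widened and shifted left by 6 (ushll), then 0xE000 is added in 16-bit lanes (movi plus add), which equals subtracting 8192. The function names and the flat-list layout with strides counted in elements are illustrative assumptions, not x265 code.</span></p>
<pre>
```python
# Scalar model of the 4x4 pixel-to-short conversion done by the NEON code.
# Assumption: 8-bit input pixels and 16-bit lanes, as the .8b/.8h operands suggest.

OFFSET_LANE = 0xE0 * 256  # movi v2.8h, #0xe0, lsl #8 puts 0xE000 in every 16-bit lane


def to_signed16(value):
    """Interpret the low 16 bits of value as a two's-complement int16."""
    value %= 0x10000
    return value - 0x10000 if value >= 0x8000 else value


def pixel_to_short(px):
    widened = px * 64  # ushll ..., #6: zero-extend to 16 bits, shift left by 6
    return to_signed16(widened + OFFSET_LANE)  # add ..., v2.8h: wrapping 16-bit add


def convert_p2s_4x4(src, src_stride, dst_stride):
    """src is a flat list of bytes; returns a flat list of int16 values.

    Strides are counted in elements here (hypothetical layout for the sketch).
    """
    dst = [0] * (4 * dst_stride)
    for row in range(4):
        for col in range(4):
            dst[row * dst_stride + col] = pixel_to_short(src[row * src_stride + col])
    return dst
```
</pre>
<p class="MsoNormal"><span style="font-size:11.0pt">With this model, a pixel value of 128 maps to 0, and the full 8-bit range maps to -8192 through 8128.</span></p>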
<p class="MsoNormal"><span style="font-size:11.0pt"><o:p> </o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt">I tried the ldp with post-increment as you recommended.<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt">Performance is slightly lower with the change (the NEON timings below rise by roughly 1.5% to 2.7%):<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt"><o:p> </o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt">function x265_filterPixelToShort_64x\h\()_neon<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt"> add x3, x3, x3<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt"> sub x3, x3, #0x40<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt">+ sub x1, x1, #0x20<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt"> movi v4.8h, #0xe0, lsl #8<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt"> mov x9, #\r<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt">.loop_filterP2S_64x\h:<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt"> subs x9, x9, #1<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt">.rept 2<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt">- ld1 {v0.16b-v3.16b}, [x0], x1<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt">+ ldp q0, q1, [x0], #0x20<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt">+ ldp q2, q3, [x0]<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt">+ add x0, x0, x1<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt"> ushll v16.8h, v0.8b, #6<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt"> ushll2 v17.8h, v0.16b, #6<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt"> ushll v18.8h, v1.8b, #6<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt"><o:p> </o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt">Before:<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt">convert_p2s[64x16] 1.46x 105.52 154.47<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt">convert_p2s[64x32] 1.47x 212.06 312.14<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt">convert_p2s[64x48] 1.47x 318.75 467.61<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt">convert_p2s[64x64] 1.46x 425.61 622.36<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt"><o:p> </o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt">After:<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt">convert_p2s[64x16] 1.42x 108.41 154.37<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt">convert_p2s[64x32] 1.45x 215.18 312.12<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt">convert_p2s[64x48] 1.44x 325.01 468.76<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt">convert_p2s[64x64] 1.44x 432.46 622.36<o:p></o:p></span></p>
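<p class="MsoNormal"><span style="font-size:11.0pt">The pointer arithmetic in the ldp variant can be checked with a small Python sketch. It models only how x0 advances over one 64-byte row: the single ld1 post-increments by the stride, while the ldp pair post-increments by 0x20 and then adds the stride reduced by 0x20. The helper names and the example base address are hypothetical.</span></p>
<pre>
```python
# Model of how the source pointer x0 advances over one 64-byte row.
ROW_BYTES = 64


def advance_ld1(x0, stride):
    # ld1 {v0.16b-v3.16b}, [x0], x1: load 64 bytes, then x0 += x1
    loaded = [(x0, ROW_BYTES)]
    return loaded, x0 + stride


def advance_ldp(x0, stride):
    x1 = stride - 0x20         # sub x1, x1, #0x20 (done once, before the loop)
    loaded = [(x0, 0x20)]      # first ldp with post-index #0x20: load 32 bytes
    x0 += 0x20                 # post-increment from the first ldp
    loaded.append((x0, 0x20))  # second ldp, no writeback: load the next 32 bytes
    x0 += x1                   # add x0, x0, x1
    return loaded, x0


def covered(loaded):
    """Return the set of byte addresses touched by a list of (base, size) loads."""
    return {base + i for base, size in loaded for i in range(size)}


if __name__ == "__main__":
    for stride in (64, 80, 128):
        a, next_a = advance_ld1(0x1000, stride)
        b, next_b = advance_ldp(0x1000, stride)
        print(stride, next_a == next_b, covered(a) == covered(b))
```
</pre>
<p class="MsoNormal"><span style="font-size:11.0pt">Both variants read the same 64 bytes and leave x0 at the start of the next row, so any timing difference comes from the instruction mix rather than from the addressing.</span></p>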
</div>
</blockquote></div>