<html xmlns:o="urn:schemas-microsoft-com:office:office" xmlns:w="urn:schemas-microsoft-com:office:word" xmlns:m="http://schemas.microsoft.com/office/2004/12/omml" xmlns="http://www.w3.org/TR/REC-html40">
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
<meta name="Generator" content="Microsoft Word 15 (filtered medium)">
<style><!--
/* Font Definitions */
@font-face
{font-family:"Cambria Math";
panose-1:2 4 5 3 5 4 6 3 2 4;}
@font-face
{font-family:Calibri;
panose-1:2 15 5 2 2 2 4 3 2 4;}
/* Style Definitions */
p.MsoNormal, li.MsoNormal, div.MsoNormal
{margin:0in;
font-size:12.0pt;
font-family:"Calibri",sans-serif;}
span.EmailStyle21
{mso-style-type:personal-reply;
font-family:"Calibri",sans-serif;
color:windowtext;}
.MsoChpDefault
{mso-style-type:export-only;
font-size:10.0pt;}
@page WordSection1
{size:8.5in 11.0in;
margin:1.0in 1.0in 1.0in 1.0in;}
div.WordSection1
{page:WordSection1;}
--></style>
</head>
<body lang="EN-US" link="#0563C1" vlink="#954F72" style="word-wrap:break-word">
<div class="WordSection1">
<p class="MsoNormal"><span style="font-size:11.0pt">Thanks again Chen for your careful review and recommendations.<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt"><o:p> </o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt">I added the following change to the attached patch as we get better performance:<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt"><o:p> </o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt">--- a/source/common/aarch64/ipfilter8.S<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt">+++ b/source/common/aarch64/ipfilter8.S<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt">@@ -35,14 +35,14 @@ function x265_filterPixelToShort_4x4_neon<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt"> movi v2.8h, #0xe0, lsl #8<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt"> ld1 {v0.s}[0], [x0], x1<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt"> ld1 {v0.s}[1], [x0], x1<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt">- ld1 {v1.s}[2], [x0], x1<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt">- ld1 {v1.s}[3], [x0], x1<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt"> ushll v3.8h, v0.8b, #6<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt">- ushll2 v4.8h, v1.16b, #6<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt"> add v3.8h, v3.8h, v2.8h<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt">- add v4.8h, v4.8h, v2.8h<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt"> st1 {v3.d}[0], [x2], x3<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt"> st1 {v3.d}[1], [x2], x3<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt">+ ld1 {v1.s}[0], [x0], x1<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt">+ ld1 {v1.s}[1], [x0], x1<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt">+ ushll v4.8h, v1.8b, #6<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt">+ add v4.8h, v4.8h, v2.8h<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt"> st1 {v4.d}[0], [x2], x3<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt"> st1 {v4.d}[1], [x2], x3<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt"> ret<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt"><o:p> </o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt">Before:<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt">convert_p2s[ 4x4] 1.20x 4.99 6.01<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt"><o:p> </o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt">After:<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt">convert_p2s[ 4x4] 1.38x 4.20 5.78<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt"><o:p> </o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt">I tried the ldp with post-increment as you recommended.<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt">Performance is slightly lower with the change:<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt"><o:p> </o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt">function x265_filterPixelToShort_64x\h\()_neon<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt"> add x3, x3, x3<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt"> sub x3, x3, #0x40<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt">+ sub x1, x1, #0x20<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt"> movi v4.8h, #0xe0, lsl #8<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt"> mov x9, #\r<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt">.loop_filterP2S_64x\h:<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt"> subs x9, x9, #1<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt">.rept 2<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt">- ld1 {v0.16b-v3.16b}, [x0], x1<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt">+ ldp q0, q1, [x0], #0x20<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt">+ ldp q0, q1, [x0]<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt">+ add x0, x0, x1<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt"> ushll v16.8h, v0.8b, #6<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt"> ushll2 v17.8h, v0.16b, #6<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt"> ushll v18.8h, v1.8b, #6<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt"><o:p> </o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt">Before:<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt">convert_p2s[64x16] 1.46x 105.52 154.47<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt">convert_p2s[64x32] 1.47x 212.06 312.14<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt">convert_p2s[64x48] 1.47x 318.75 467.61<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt">convert_p2s[64x64] 1.46x 425.61 622.36<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt"><o:p> </o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt">After:<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt">convert_p2s[64x16] 1.42x 108.41 154.37<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt">convert_p2s[64x32] 1.45x 215.18 312.12<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt">convert_p2s[64x48] 1.44x 325.01 468.76<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt">convert_p2s[64x64] 1.44x 432.46 622.36<o:p></o:p></span></p>
<div>
<div>
<blockquote style="border:none;border-left:solid #CCCCCC 1.0pt;padding:0in 0in 0in 6.0pt;margin-left:4.8pt;margin-right:0in">
<p class="MsoNormal"><span style="font-size:11.0pt"><o:p> </o:p></span></p>
</blockquote>
</div>
</div>
</div>
</body>
</html>