<div style="line-height:1.7;color:#000000;font-size:14px;font-family:Arial"><p style="margin: 0;"><span style="font-family: Calibri, sans-serif; font-size: 14.6667px;">Thanks for your response; comments inline.</span></p><p>At 2021-06-24 08:57:20, "Pop, Sebastian" &lt;spop@amazon.com&gt; wrote:</p><blockquote id="isReplyContent" style="PADDING-LEFT: 1ex; MARGIN: 0px 0px 0px 0.8ex; BORDER-LEFT: #ccc 1px solid">
<style><!--
@font-face
{font-family:"Cambria Math";
panose-1:2 4 5 3 5 4 6 3 2 4;}
@font-face
{font-family:Calibri;
panose-1:2 15 5 2 2 2 4 3 2 4;}
p.MsoNormal, li.MsoNormal, div.MsoNormal
{margin:0in;
font-size:12.0pt;
font-family:"Calibri",sans-serif;}
span.EmailStyle21
{mso-style-type:personal-reply;
font-family:"Calibri",sans-serif;
color:windowtext;}
.MsoChpDefault
{mso-style-type:export-only;
font-size:10.0pt;}
@page WordSection1
{size:8.5in 11.0in;
margin:1.0in 1.0in 1.0in 1.0in;}
div.WordSection1
{page:WordSection1;}
--></style>
<div class="WordSection1">
<p class="MsoNormal"><span style="font-size:11.0pt">Hi Chen,<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt"><o:p> </o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt">Thanks for your review!<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt"><o:p> </o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt">> +function x265_filterPixelToShort_4x4_neon<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt">> + add x3, x3, x3<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt">> + movi v2.8h, #0xe0, lsl #8<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt">> Does your compiler not handle the constant 0xe000 automatically? It would be more readable.<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt"><o:p> </o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt">GNU assembler errors with that immediate:<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt">ipfilter8.S:35: Error: immediate value out of range -128 to 255 at operand 2 -- `movi v2.8h,#0xe000'<o:p></o:p></span></p><p class="MsoNormal"><span style="font-size:11.0pt"><br></span></p><p class="MsoNormal">This looks like an old binutils issue, so we can keep your original version.</p><p class="MsoNormal"><span style="font-size:11.0pt"><br></span></p>
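<p class="MsoNormal"><span style="font-size:11.0pt">For reference, a minimal Python sketch (not part of the patch) of why the shifted form works: the 16-bit form of movi takes an 8-bit immediate with an optional lsl #8, so 0xe000 has to be written as #0xe0, lsl #8. The helper names movi_h_operands and as_int16 are hypothetical and exist only for this illustration.</span></p>
<pre>
```python
# Hypothetical helper (not from the patch): split a 16-bit constant into the
# 8-bit immediate and shift accepted by "movi Vd.8h, #imm8, lsl #shift".
def movi_h_operands(value):
    if value not in range(0x10000):
        raise ValueError("not a 16-bit value")
    low, high = value % 256, value // 256
    if high == 0:
        return (low, 0)       # movi vN.8h, #low
    if low == 0:
        return (high, 8)      # movi vN.8h, #high, lsl #8
    return None               # not encodable as a single 16-bit movi


def as_int16(value):
    # Reinterpret a 16-bit pattern as a signed lane value.
    return value - 0x10000 if value // 0x8000 else value


imm8, shift = movi_h_operands(0xE000)
print(hex(imm8), shift, as_int16(0xE000))
```
</pre>
<p class="MsoNormal"><span style="font-size:11.0pt">Read as a signed 16-bit lane, the constant is -8192, which is what the later add instructions apply to each widened pixel.</span></p>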
<p class="MsoNormal"><span style="font-size:11.0pt"><o:p> </o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt">><o:p> </o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt">> + ld1 {v0.s}[0], [x0], x1<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt">> + ld1 {v0.s}[1], [x0], x1<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt">> + ld1 {v1.s}[2], [x0], x1<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt">> Why not v0.s?<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt"><o:p> </o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt">It is slightly faster to use an independent register for the upper part:<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt"><o:p> </o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt">when using {v0.s}[2] and {v0.s}[3]<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt">convert_p2s[ 4x4] 1.13x 5.35 6.03<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt"><o:p> </o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt">performance is lower than when using {v1.s}[2] and {v1.s}[3]<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt">convert_p2s[ 4x4] 1.21x 4.99 6.03<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt"><o:p><br></o:p></span></p><p class="MsoNormal"><span style="font-size:11.0pt"><o:p>Yes, an independent register may be faster here, but we could load into the lower part only and widen with ushll later; writing directly to the high part of a register may create a false register dependency. </o:p></span></p><p class="MsoNormal"><span style="font-size:11.0pt"><o:p><br></o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt">><o:p> </o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt">> + ld1 {v1.s}[3], [x0], x1<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt">><o:p> </o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt">> +.macro filterPixelToShort_32xN h<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt">> +function x265_filterPixelToShort_32x\h\()_neon<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt">> + add x3, x3, x3<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt">> + movi v6.8h, #0xe0, lsl #8<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt">> +.rept \h<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt">> + ld1 {v0.16b-v1.16b}, [x0], x1<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt">> ldp may provide more bandwidth<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt"><o:p> </o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt">ld1 could be replaced with ldp + add, like this:<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt"><o:p> </o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt"> ldp q0, q1, [x0]<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt"> add x0, x0, x1<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt"><o:p> </o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt">convert_p2s[ 32x8] 1.39x 26.62 37.07<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt">convert_p2s[32x16] 1.42x 53.19 75.58<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt">convert_p2s[32x24] 1.41x 80.23 113.11<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt">convert_p2s[32x32] 1.42x 107.08 151.63<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt">convert_p2s[32x64] 1.41x 215.11 303.37<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt"><o:p> </o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt">Performance with ldp + add (above) is lower than with ld1 (below):<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt"><o:p> </o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt">convert_p2s[ 32x8] 1.48x 25.00 37.06<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt">convert_p2s[32x16] 1.49x 50.64 75.56<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt">convert_p2s[32x24] 1.48x 76.46 113.31<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt">convert_p2s[32x32] 1.49x 101.97 151.63<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt">convert_p2s[32x64] 1.48x 205.15 303.31<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt"><o:p><br></o:p></span></p><p class="MsoNormal"><span style="font-size: 14.6667px;">An ldp immediately followed by an add may cause a pipeline stall or a similar issue; if there is no better choice, we can keep the original version.</span></p><p class="MsoNormal"><br></p>
<p class="MsoNormal"><span style="font-size:11.0pt">><o:p> </o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt">> +.macro filterPixelToShort_64xN h<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt">> +function x265_filterPixelToShort_64x\h\()_neon<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt">> + add x3, x3, x3<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt">> + sub x3, x3, #0x40<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt">> + movi v4.8h, #0xe0, lsl #8<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt">> +.rept \h<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt">> I guess unrolling N times is not a good idea: the code section becomes too large, which will most likely cause cache flushes and misses.<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt"><o:p> </o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt">Performance is slightly lower with a loop, i.e., with this change:<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt"><o:p> </o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt">--- a/source/common/aarch64/ipfilter8.S<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt">+++ b/source/common/aarch64/ipfilter8.S<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt">@@ -173,12 +173,15 @@ filterPixelToShort_32xN 24<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt">filterPixelToShort_32xN 32<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt">filterPixelToShort_32xN 64<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt"><o:p> </o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt">-.macro filterPixelToShort_64xN h<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt">+.macro filterPixelToShort_64xN h r<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt">function x265_filterPixelToShort_64x\h\()_neon<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt"> add x3, x3, x3<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt"> sub x3, x3, #0x40<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt"> movi v4.8h, #0xe0, lsl #8<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt">-.rept \h<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt">+ mov x9, #\r<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt">+.loop_filterP2S_64x\h:<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt">+ subs x9, x9, #1<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt">+.rept 2<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt"> ld1 {v0.16b-v3.16b}, [x0], x1<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt"> ushll v16.8h, v0.8b, #6<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt"> ushll2 v17.8h, v0.16b, #6<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt">@@ -199,14 +202,15 @@ function x265_filterPixelToShort_64x\h\()_neon<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt"> st1 {v16.16b-v19.16b}, [x2], #0x40<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt"> st1 {v20.16b-v23.16b}, [x2], x3<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt">.endr<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt">+ bgt .loop_filterP2S_64x\h<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt"> ret<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt">endfunc<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt">.endm<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt"><o:p> </o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt">-filterPixelToShort_64xN 16<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt">-filterPixelToShort_64xN 32<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt">-filterPixelToShort_64xN 48<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt">-filterPixelToShort_64xN 64<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt">+filterPixelToShort_64xN 16 8<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt">+filterPixelToShort_64xN 32 16<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt">+filterPixelToShort_64xN 48 24<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt">+filterPixelToShort_64xN 64 32<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt"><o:p> </o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt">.macro qpel_filter_0_32b<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt"> movi v24.8h, #64<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt"><o:p> </o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt">With the above change adding a loop I get<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt">convert_p2s[64x16] 1.46x 105.52 154.34<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt">convert_p2s[64x32] 1.47x 212.07 311.71<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt">convert_p2s[64x48] 1.47x 318.75 468.04<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt">convert_p2s[64x64] 1.46x 425.61 622.25<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt"><o:p> </o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt">whereas with the fully unrolled version performance is slightly higher:<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt">convert_p2s[64x16] 1.48x 104.14 154.36<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt">convert_p2s[64x32] 1.49x 209.43 312.13<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt">convert_p2s[64x48] 1.48x 315.33 466.37<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt">convert_p2s[64x64] 1.49x 420.45 624.63<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt"><o:p> </o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt">I do not have a preference for this one, so I will follow your recommendations.<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt">Please let me know if I need to amend the patch to add the loop.<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt"><o:p> </o:p></span></p><p class="MsoNormal"><span style="font-size:11.0pt"><o:p>Our testbench is a small piece of code, so it does not show many cache misses; in a real system, large functions are a big problem.</o:p></span></p><p class="MsoNormal"><span style="font-size:11.0pt"><o:p><br></o:p></span></p>
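<p class="MsoNormal"><span style="font-size:11.0pt">To put rough numbers on the code-size concern, here is a small Python estimate (not part of the patch). It assumes the per-row body is exactly the 19 instructions quoted in this thread, counts 4 bytes per AArch64 instruction, and ignores function alignment padding; the function names are made up for the illustration.</span></p>
<pre>
```python
# Instruction counts are taken from the 64xN code quoted in this thread;
# AArch64 instructions are 4 bytes each. Alignment padding is ignored.
INSN_BYTES = 4
ROW_INSNS = 1 + 8 + 8 + 2      # ld1, ushll/ushll2, add, st1 per row
SETUP_INSNS = 3                # add, sub, movi
HEIGHTS = (16, 32, 48, 64)


def unrolled_bytes(height):
    # Fully unrolled: the row body is expanded once per row, then ret.
    return (SETUP_INSNS + height * ROW_INSNS + 1) * INSN_BYTES


def looped_bytes(rows_per_iteration=2):
    # Looped: mov counter, then subs + unrolled rows + bgt, then ret.
    body = 1 + rows_per_iteration * ROW_INSNS + 1
    return (SETUP_INSNS + 1 + body + 1) * INSN_BYTES


for h in HEIGHTS:
    print(h, unrolled_bytes(h), looped_bytes())
print(sum(unrolled_bytes(h) for h in HEIGHTS), len(HEIGHTS) * looped_bytes())
```
</pre>
<p class="MsoNormal"><span style="font-size:11.0pt">Under these assumptions the four unrolled 64xN functions take about 12 KB of code in total, against well under 1 KB for the looped variants. Whether that matters in practice depends on the instruction cache of the target core and on what else the encoder is running.</span></p>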
<p class="MsoNormal"><span style="font-size:11.0pt">><o:p> </o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt">> + ld1 {v0.16b-v3.16b}, [x0], x1<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt">> + ushll v16.8h, v0.8b, #6<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt">> + ushll2 v17.8h, v0.16b, #6<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt">> + ushll v18.8h, v1.8b, #6<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt">> + ushll2 v19.8h, v1.16b, #6<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt">> + ushll v20.8h, v2.8b, #6<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt">> + ushll2 v21.8h, v2.16b, #6<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt">> + ushll v22.8h, v3.8b, #6<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt">> + ushll2 v23.8h, v3.16b, #6<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt">> + add v16.8h, v16.8h, v4.8h<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt">> + add v17.8h, v17.8h, v4.8h<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt">> + add v18.8h, v18.8h, v4.8h<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt">> + add v19.8h, v19.8h, v4.8h<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt">> + add v20.8h, v20.8h, v4.8h<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt">> + add v21.8h, v21.8h, v4.8h<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt">> + add v22.8h, v22.8h, v4.8h<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt">> + add v23.8h, v23.8h, v4.8h<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt">> + st1 {v16.16b-v19.16b}, [x2], #0x40<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt">> ldp may reduce pipeline stalls and provide more bandwidth<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt"><o:p> </o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt">ldp is not beneficial as it requires 2 ldp + 2 add instead of 1 ld1:<o:p></o:p></span></p><p class="MsoNormal"><span style="font-size:11.0pt"><br></span></p><p class="MsoNormal"><span style="font-size:11.0pt">ldp can be followed by a constant post-increment,</span></p><p class="MsoNormal"><span style="font-size:11.0pt">such as</span></p><p class="MsoNormal"><span style="font-size:11.0pt">ldp q16, q17, [x2], #0x20</span></p>
<p class="MsoNormal"><span style="font-size:11.0pt"><o:p> </o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt">--- a/source/common/aarch64/ipfilter8.S<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt">+++ b/source/common/aarch64/ipfilter8.S<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt">@@ -178,8 +178,13 @@ function x265_filterPixelToShort_64x\h\()_neon<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt"> add x3, x3, x3<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt"> sub x3, x3, #0x40<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt"> movi v4.8h, #0xe0, lsl #8<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt">+ sub x1, x1, #32<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt">.rept \h<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt">- ld1 {v0.16b-v3.16b}, [x0], x1<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt">+ ldp q0, q1, [x0]<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt">+ add x0, x0, #32<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt">+ ldp q2, q3, [x0]<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt">+ add x0, x0, x1<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt">+<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt"> ushll v16.8h, v0.8b, #6<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt"> ushll2 v17.8h, v0.16b, #6<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt"> ushll v18.8h, v1.8b, #6<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt"><o:p> </o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt">This adds overhead to instruction decoding.<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt">The CPU back-end will issue the same loads for ldp and ld1.<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt"><o:p> </o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt">Here is performance with the above change:<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt">convert_p2s[64x16] 1.43x 108.20 154.47<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt">convert_p2s[64x32] 1.44x 216.43 312.10<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt">convert_p2s[64x48] 1.43x 325.81 466.80<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt">convert_p2s[64x64] 1.44x 433.60 624.30<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt"><o:p> </o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt">With ld1 performance is higher:<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt">convert_p2s[64x16] 1.48x 104.14 154.44<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt">convert_p2s[64x32] 1.49x 209.44 312.10<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt">convert_p2s[64x48] 1.48x 314.76 466.79<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt">convert_p2s[64x64] 1.48x 420.30 622.38<o:p></o:p></span></p>
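<p class="MsoNormal"><span style="font-size:11.0pt">The thread does not label the testbench columns. Assuming they are speedup, optimized time and C reference time, the following Python sketch (an illustration only, using the 64xN numbers above) recomputes the speedups and the relative cost of the ldp + add variant. The names LDP, LD1 and speedup are hypothetical.</span></p>
<pre>
```python
# Numbers copied from the 64xN results in this thread.
# Assumed column meaning: (optimized time, C reference time).
LDP = {"64x16": (108.20, 154.47), "64x32": (216.43, 312.10),
       "64x48": (325.81, 466.80), "64x64": (433.60, 624.30)}
LD1 = {"64x16": (104.14, 154.44), "64x32": (209.44, 312.10),
       "64x48": (314.76, 466.79), "64x64": (420.30, 622.38)}


def speedup(asm_time, c_time):
    # Speedup of the assembly routine over the C reference.
    return round(c_time / asm_time, 2)


for size in LDP:
    ldp_asm, ld1_asm = LDP[size][0], LD1[size][0]
    extra = (ldp_asm / ld1_asm - 1) * 100
    print(size, speedup(*LDP[size]), speedup(*LD1[size]),
          f"ldp + add is {extra:.1f}% slower")
```
</pre>
<p class="MsoNormal"><span style="font-size:11.0pt">The recomputed ratios match the reported speedup column, which supports that reading of the columns.</span></p>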
<p class="MsoNormal"><span style="font-size:11.0pt"><o:p> </o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt">Thanks,<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt">Sebastian<o:p></o:p></span></p>
</div>
</blockquote></div>