<html>

<head>

<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">

<style type="text/css" style="display:none"><!-- p { margin-top: 0px; margin-bottom: 0px; }--></style>

</head>

<body dir="ltr" style="font-size:12pt;color:#000000;background-color:#FFFFFF;font-family:Calibri,Arial,Helvetica,sans-serif;">

<p>Hi Min Chen,<br>

</p>

<p><br>

</p>

<p>Thank you for your review comments; they helped improve the performance of scanPosLast on arm64:<br>

</p>

<p><br>

</p>

<div>           scanPosLast  5.46x    782.47          4275.92</div>

<div><br>

I believe I have addressed all the changes you requested, with the exception of the two below:</div>

<div><br>

</div>

<div>

<div>

<div>> +    // get sign</div>

<div>> +    cmeq            v5.16b, v3.16b, #0  //      equal to zero</div>

<div>> +    mvn             v5.16b, v5.16b      // v5 = non-zero</div>

<div>> [MC] Why not replace cmeq+mvn by cmgt?</div>

<div><br>

</div>

<div>[SP] We cannot replace the sequence with cmgt.</div>

<div>cmgt #0 is "Compare signed Greater than zero".</div>

<div>cmgt #0 would only select positive values.</div>

<div>We need all non-zero values, i.e., negative and positive values.</div>
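<div>As a rough scalar illustration of the difference (plain C, not part of the patch; the function names are made up for this example and each one models a single signed 8-bit lane):</div>
<pre>
```c
/* Scalar model of one signed 8-bit lane; function names are illustrative only. */

/* cmeq #0 followed by mvn: all-ones when the lane is non-zero. */
static unsigned char lane_cmeq_mvn(signed char x)
{
    unsigned char eq = (x == 0) ? 0xFF : 0x00;  /* cmeq ..., #0 */
    return (unsigned char)~eq;                  /* mvn */
}

/* cmgt #0: all-ones only when the lane is strictly positive. */
static unsigned char lane_cmgt_zero(signed char x)
{
    return (x > 0) ? 0xFF : 0x00;
}
```
</pre>
<div>For a negative lane the first function returns 0xFF and the second returns 0x00, which is the case the cmgt replacement would miss.</div>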

<div><br>

</div>

<div>> +    // val - w13 = pmovmskb(v3)</div>

<div>> +    and             v3.16b, v3.16b, v28.16b</div>

<div>> +    mov             d4, v3.d[1]</div>

<div>> +    addv            b13, v3.8b</div>

<div>> +    addv            b14, v4.8b</div>

<div>> [MC] ADDV support .16b?</div>

<div><br>

</div>

<div>[SP] I cannot use the .16b variant of ADDV.</div>

<div>The data in v3.16b is ANDed with a mask in v28.16b:</div>

<div>    and             v3.16b, v3.16b, v28.16b</div>

<div>The mask in v28 is:</div>

<div>.byte 0x1, 0x2, 0x4, 0x8, 0x10, 0x20, 0x40, 0x80, 0x1, 0x2, 0x4, 0x8, 0x10, 0x20, 0x40, 0x80</div>

<div>The mask gives each byte lane a distinct bit weight within its 8-byte half, so summing the lanes of one half yields an 8-bit mask with one bit per lane.</div>
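<div>A scalar sketch of this emulation, in plain C rather than the actual assembly. The names are made up for illustration, and the last step that combines the two sums into one 16-bit value is my assumption here, since the quoted snippet stops after the two ADDVs:</div>
<pre>
```c
/* Scalar model of the pmovmskb emulation; names are illustrative only. */

/* Same weights as the mask in v28: bit i for lane i within each 8-byte half. */
static const unsigned char bit_weight[16] = {
    0x01, 0x02, 0x04, 0x08, 0x10, 0x20, 0x40, 0x80,
    0x01, 0x02, 0x04, 0x08, 0x10, 0x20, 0x40, 0x80
};

/* lanes[i] is expected to be 0x00 or 0xFF, as produced by a NEON compare. */
static unsigned movemask16(const unsigned char lanes[16])
{
    unsigned char lo = 0, hi = 0;
    int i;
    for (i = 0; i < 8; i++) {
        /* and + addv b13, v3.8b: sum of the low eight lanes */
        lo = (unsigned char)(lo + (lanes[i] & bit_weight[i]));
        /* and + addv b14, v4.8b: sum of the high eight lanes */
        hi = (unsigned char)(hi + (lanes[i + 8] & bit_weight[i + 8]));
    }
    /* Assumed combine step: low half in bits 0-7, high half in bits 8-15. */
    return lo | ((unsigned)hi << 8);
}
```
</pre>
<div>Within each half no two lanes share a bit weight, so the sums never carry and each one is simply the OR of the selected bits.</div>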

<div><br>

</div>

<div>To use the .16b variant of ADDV, I would need to encode the byte positions</div>

<div>in 16 bits instead of 8 bits, i.e., the mask would be:</div>

<div>.hword 0x1, 0x2, 0x4, 0x8, 0x10, 0x20, 0x40, 0x80, 0x100, 0x200, 0x400, 0x800, 0x1000, 0x2000, 0x4000, 0x8000</div>

<div>However, that would require the data to be in 16-bit vector elements, and a 128-bit NEON register holds only eight of them (.8h), which is half of the 16 lanes needed.</div>
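<div>To show what goes wrong with the existing 8-bit weights, here is a scalar sketch of a single ADDV over all 16 byte lanes (plain C, hypothetical function name, not code from the patch):</div>
<pre>
```c
/* Scalar model of one ADDV across 16 byte lanes after the AND with the
   repeated 8-bit weights; the function name is illustrative only. */
static unsigned char addv16_sum(const unsigned char lanes[16])
{
    static const unsigned char w[8] = {
        0x01, 0x02, 0x04, 0x08, 0x10, 0x20, 0x40, 0x80
    };
    unsigned char sum = 0;
    int i;
    for (i = 0; i < 16; i++) {
        /* The result is a single byte, so the sum wraps modulo 256. */
        sum = (unsigned char)(sum + (lanes[i] & w[i % 8]));
    }
    return sum;
}
```
</pre>
<div>Lane 0 and lane 8 carry the same weight, so the result cannot tell them apart, and when both are set the carry corrupts the neighbouring bit.</div>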

<div><br>

</div>

<div>Another solution I was considering is to decrease the vectorization factor of the loop from 16 to 8.<br>

</div>

<div>That would simplify the pmovmskb code; however, the scalar code would be less efficient, as it would only deal with half as many bytes at a time.<br>

</div>

<div>Do you think I should try a lower vectorization factor of 8?<br>

</div>

<div><br>

</div>

<div>Thanks,<br>

</div>

Sebastian<br>

</div>

</div>


</body>

</html>