<div style="line-height:1.7;color:#000000;font-size:14px;font-family:arial"><DIV>+INIT_XMM sse2<BR>+cglobal count_nonzero, 2,3,4<BR>+ pxor m0, m0<BR>+ pxor m1, m1<BR>+ mov r2d, r1d<BR>+ shr r1d, 3<BR>+<BR>+.loop<BR></DIV>
<BLOCKQUOTE id="isReplyContent" style="PADDING-LEFT: 1ex; MARGIN: 0px 0px 0px 0.8ex; BORDER-LEFT: #ccc 1px solid">
<DIV dir="ltr">
<DIV class="gmail_extra">
<DIV class="gmail_quote">
<BLOCKQUOTE class="gmail_quote" style="PADDING-LEFT: 1ex; MARGIN: 0px 0px 0px 0.8ex; BORDER-LEFT: #ccc 1px solid">+ mova m2, [r0]<BR>+ mova m3, [r0 + 16]<BR>
<DIV class="HOEnZb">
<DIV class="h5">+ add r0, 32<BR></DIV>
<DIV class="h5">+ packssdw m2, m3<BR>We only need a count here, so this pack is not needed.</DIV>
<DIV class="h5"> </DIV>
<DIV class="h5">+ pcmpeqw m2, m0<BR>+ psrlw m2, 15<BR>pcmpeqw generates a mask that is 0xFFFF in every matching lane, so there is no need to shift right.</DIV>
<DIV class="h5"> </DIV>
<DIV class="h5">+ packsswb m2, m2<BR>+ psadbw m2, m0</DIV>
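<DIV class="h5">psadbw is slow. Why do you need the exact count inside the inner loop?</DIV>
<DIV class="h5">After all, abs(-1) = abs(1), so the -1 masks can be accumulated as they are and the sign fixed up once after the loop.</DIV>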
<DIV class="h5"><BR>+ paddd m1, m2<BR>+ dec r1d<BR>+ jnz .loop<BR>+<BR>+ movd r1d, m1<BR>+ sub r2d, r1d<BR>+ mov eax, r2d<BR>+<BR>+ RET<BR></DIV></DIV></BLOCKQUOTE></DIV></DIV></DIV></BLOCKQUOTE></div>
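<DIV>The review comments above can be sketched in C with SSE2 intrinsics instead of the patch's assembly. This is only an illustration of the idea, not the final routine: the name count_nonzero_sketch is hypothetical, the coefficients are assumed to be 32-bit (the patch loads 32 bytes and handles 8 values per iteration), and the count is assumed to be a multiple of 8. Each compare leaves -1 in every zero lane, the loop just adds those masks, and the single horizontal sum happens after the loop.</DIV>

```c
#include <stdint.h>

#if defined(__SSE2__) || defined(_M_X64)
#include <emmintrin.h>

/* Hypothetical sketch: counts nonzero values among n 32-bit coefficients.
 * Assumes n is a multiple of 8, as the loop in the patch does. */
int count_nonzero_sketch(const int32_t *coeff, int n)
{
    const __m128i zero = _mm_setzero_si128();
    __m128i acc = _mm_setzero_si128();   /* per lane: -(zeros seen so far) */

    for (int i = 0; i < n; i += 8) {
        /* Unaligned loads keep the sketch safe for any buffer;
         * the patch itself uses aligned loads (mova). */
        __m128i a = _mm_loadu_si128((const __m128i *)(coeff + i));
        __m128i b = _mm_loadu_si128((const __m128i *)(coeff + i + 4));

        /* The compare yields all-ones (-1) in each zero lane: no pack,
         * no shift, and no psadbw inside the loop. */
        acc = _mm_add_epi32(acc, _mm_cmpeq_epi32(a, zero));
        acc = _mm_add_epi32(acc, _mm_cmpeq_epi32(b, zero));
    }

    /* One horizontal sum of the four lanes, outside the loop. */
    acc = _mm_add_epi32(acc, _mm_srli_si128(acc, 8));
    acc = _mm_add_epi32(acc, _mm_srli_si128(acc, 4));

    /* Lane 0 now holds -(number of zeros), so adding n gives the nonzeros. */
    return n + _mm_cvtsi128_si32(acc);
}

#else

/* Scalar fallback so the sketch still builds on targets without SSE2. */
int count_nonzero_sketch(const int32_t *coeff, int n)
{
    int count = 0;
    for (int i = 0; i < n; i++)
        count += (coeff[i] != 0);
    return count;
}

#endif
```

<DIV>In assembly terms this corresponds to replacing the pack/shift/psadbw sequence with compare-and-add inside .loop and reducing m1 once before the final sub. Whether it is actually faster would need to be measured on the target CPUs.</DIV>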