<div style="line-height:1.7;color:#000000;font-size:14px;font-family:arial"><DIV>>+ psadbw m5, m3<BR>>+ psadbw m6, m4<BR>>+ pshufd m6, m6, 84<BR>This pshufd only clears the high 96 bits to zero, so why not use pand? In any case we can avoid the step entirely, see below.</DIV>
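<DIV>A rough sketch of the pand variant, assuming a 16-byte constant (here called mask_lo64, a hypothetical name that is not in the patch) with the low 64 bits set and the high 64 bits clear. Since psadbw already leaves bits 16..63 of each half zero, masking off the high half should give the same result as the pshufd:<BR>; hypothetical constant, e.g. in SECTION_RODATA<BR>mask_lo64: dq 0xFFFFFFFFFFFFFFFF, 0<BR><BR> psadbw m6, m4 ; two partial sums, one per 64-bit half<BR> pand m6, [mask_lo64] ; keep the low half, zero the high half</DIV>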
<DIV> </DIV>
<DIV>>+ paddd m5, m6<BR>>+ paddd m0, m5<BR>We can sum as 32xN and drop the high 64 bits in the last step.</DIV>
<DIV> </DIV>
<DIV>>+%macro SAD_X3_W24 0<BR>>+cglobal pixel_sad_x3_24x32, 5, 6, 8<BR>>+ pxor m0, m0<BR>>+ pxor m1, m1<BR>>+ pxor m2, m2<BR>>+ mov r6, 32<BR>>+<BR>>+.loop<BR>>+ SAD_X3_24x4<BR>>+ SAD_X3_24x4<BR>>+ SAD_X3_24x4<BR>>+ SAD_X3_24x4<BR>>+<BR>>+ sub r6, 16<BR>>+ cmp r6, 0<BR>>+jnz .loop<BR>Same loop problem as in my previous mail. Also, the SUB instruction already sets the flags, so I don't think you need "cmp r6, 0".</DIV>
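<DIV>A minimal sketch of the loop tail without the compare, assuming r6 stays the row counter and starts at a multiple of 16. This only covers the redundant cmp, not the loop problem from the previous mail:<BR> sub r6, 16 ; sets ZF when the counter reaches zero<BR> jnz .loop ; so jnz can test the result of sub directly</DIV>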
<DIV> </DIV></div>