<div style="line-height:1.7;color:#000000;font-size:14px;font-family:arial"><DIV>>+xor r4, r4<BR>>+add r4d, %2<BR>Why not use a single mov r4d, %2 instead?</DIV>
<DIV> </DIV>
<DIV>>+punpcklbw m4, m2, m3,<BR>>+punpckhbw m5, m2, m3,<BR>m2 is not needed after this, so why allocate a new register m5? The result could be written to m2 instead.</DIV>
<DIV> </DIV>
<DIV>+movu m2, [r0 + r1]<BR>+movu m3, [r0 + 2 * r1]<BR>Please reduce the number of load operations here.</DIV>
<DIV> </DIV></div>