<div style="line-height:1.7;color:#000000;font-size:14px;font-family:arial"><DIV>>+pand m0, [pw_00ff]<BR>>+pand m2, [pw_00ff]<BR>>+pand m4, [pw_00ff]<BR>>+pand m6, [pw_00ff]<BR>>+<BR>>+packuswb m0, m1<BR>>+packuswb m2, m3<BR>>+packuswb m4, m5<BR>>+packuswb m6, m7<BR>1. If you don't keep [pw_00ff] in a register, you can merge pand+packuswb into a single pshufb. Most of the time, though, keeping the constant in a register is faster.</DIV>
<DIV>2. packuswb m0, m0 is better, since it depends on only one register.</DIV>
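<DIV>To illustrate points 1 and 2, here is a rough scalar C model (not the patch itself) of pand, packuswb and pshufb. It assumes the single-register form packuswb m0, m0 from point 2. The type xmm_t, the helper names and the even-byte shuffle mask are hypothetical and are not taken from the patch.</DIV>

```c
#include <stdint.h>
#include <string.h>

/* One 128-bit SIMD register viewed as 16 bytes (words are little-endian). */
typedef struct { uint8_t b[16]; } xmm_t;

/* pand dst, src: bitwise AND of the two registers. */
static xmm_t pand(xmm_t dst, xmm_t src) {
    for (int i = 0; i < 16; i++) dst.b[i] &= src.b[i];
    return dst;
}

/* Read word i as a signed 16-bit value, the way packuswb interprets it. */
static int word_at(xmm_t r, int i) {
    int u = r.b[2 * i] | (r.b[2 * i + 1] << 8);
    return u >= 0x8000 ? u - 0x10000 : u;
}

/* Saturate a signed word to the unsigned byte range 0..255. */
static uint8_t sat_u8(int w) {
    return (uint8_t)(w < 0 ? 0 : (w > 255 ? 255 : w));
}

/* packuswb dst, src: words of dst go to the low 8 bytes, words of src to the high 8. */
static xmm_t packuswb(xmm_t dst, xmm_t src) {
    xmm_t r;
    for (int i = 0; i < 8; i++) {
        r.b[i]     = sat_u8(word_at(dst, i));
        r.b[i + 8] = sat_u8(word_at(src, i));
    }
    return r;
}

/* pshufb dst, mask: byte i becomes dst.b[mask & 15], or 0 if mask bit 7 is set. */
static xmm_t pshufb(xmm_t dst, xmm_t mask) {
    xmm_t r;
    for (int i = 0; i < 16; i++)
        r.b[i] = (mask.b[i] & 0x80) ? 0 : dst.b[mask.b[i] & 0x0f];
    return r;
}

/* Two-instruction path: pand with pw_00ff, then packuswb m0, m0. */
static xmm_t low_bytes_pand_pack(xmm_t v) {
    xmm_t pw_00ff;
    for (int i = 0; i < 16; i++) pw_00ff.b[i] = (i & 1) ? 0x00 : 0xff;
    v = pand(v, pw_00ff);
    return packuswb(v, v);
}

/* One-instruction path: pshufb with a hypothetical mask 0,2,...,14,0,2,...,14. */
static xmm_t low_bytes_pshufb(xmm_t v) {
    xmm_t even;
    for (int i = 0; i < 16; i++) even.b[i] = (uint8_t)((i & 7) * 2);
    return pshufb(v, even);
}

/* Returns 1 when both registers hold the same 16 bytes. */
static int xmm_equal(xmm_t a, xmm_t b) {
    return memcmp(a.b, b.b, sizeof a.b) == 0;
}
```

<DIV>In this model both paths return the even-indexed bytes of the input in each 8-byte half, so the pshufb form replaces the [pw_00ff] load with a shuffle-mask load. Note that pshufb needs SSSE3 while pand and packuswb are SSE2, and which variant is actually faster still has to be measured.</DIV>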
<DIV> </DIV></div>