<div style="line-height:1.7;color:#000000;font-size:14px;font-family:arial"><DIV>> ;-----------------------------------------------------------------------------<BR>>+; void pixel_sub_ps_c_2x4(int16_t *dest, intptr_t destride, pixel *src0, pixel *src1, intptr_t srcstride0, intptr_t srcstride1);<BR>>+;-----------------------------------------------------------------------------<BR>>+INIT_XMM sse4<BR>>+cglobal pixel_sub_ps_2x4, 6, 7, 2, dest, deststride, src0, src1, srcstride0, srcstride1<BR>>+<BR>>+add r1, r1<BR>>+<BR>>+movd m0, [r2]<BR>>+movd m1, [r2 + r4]<BR>>+movd m2, [r2 + 2 * r4]<BR>I'm not too worried about small-block performance, but the code below is shorter and faster:</DIV>
<DIV>movd m0, [r2]</DIV>
<DIV>movhps m0, [r2 + r4]</DIV>
<DIV> </DIV>
<DIV>>+<BR>>+movd m3, [r3]<BR>>+movd m4, [r3 + r5]<BR>>+movd m5, [r3 + 2 * r5]<BR>>+<BR>>+lea r2, [r2 + 2 * r4]<BR>>+lea r3, [r3 + 2 * r5]<BR>>+<BR>>+movd m6, [r2 + r4]<BR>>+movd m7, [r3 + r5]<BR>>+<BR>>+pmovzxbw m0, m0<BR>>+pmovzxbw m1, m1<BR>>+pmovzxbw m2, m2<BR>>+pmovzxbw m3, m3<BR>>+pmovzxbw m4, m4<BR>>+pmovzxbw m5, m5<BR>>+pmovzxbw m6, m6<BR>>+pmovzxbw m7, m7<BR>>+<BR>>+psubw m0, m3<BR>>+psubw m1, m4<BR>>+psubw m2, m5<BR>>+psubw m6, m7<BR>With the packed loads suggested above, only half as many pmovzxbw and psubw instructions are needed here.</DIV>
<DIV><BR>>+movd [r0], m0<BR>>+movd [r0 + r1], m1<BR>>+movd [r0 + 2* r1], m2<BR>>+<BR>>+lea r0, [r0 + 2 * r1]<BR>>+<BR>>+movd [r0 + r1], m6<BR>>+<BR>>+RET<BR></DIV></div>
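<DIV>For checking any variant of this routine, a plain C reference of the 2x4 subtraction may be handy. The following is only a minimal sketch: it assumes 8-bit pixels (as the pmovzxbw in the patch implies) and strides counted in elements rather than bytes, and the name pixel_sub_ps_2x4_ref is hypothetical, not taken from the patch.</DIV>

```c
#include <stdint.h>

/* Assumption: 8-bit build, so a pixel is one unsigned byte. */
typedef uint8_t pixel;

/* Hypothetical scalar reference: dest = src0 - src1 for a block
 * 2 pixels wide and 4 rows high. Strides are in elements. */
static void pixel_sub_ps_2x4_ref(int16_t *dest, intptr_t deststride,
                                 const pixel *src0, const pixel *src1,
                                 intptr_t srcstride0, intptr_t srcstride1)
{
    for (int y = 0; y < 4; y++) {
        for (int x = 0; x < 2; x++) {
            /* Widen to int before subtracting so the result can be negative. */
            dest[x] = (int16_t)((int)src0[x] - (int)src1[x]);
        }
        dest += deststride;
        src0 += srcstride0;
        src1 += srcstride1;
    }
}
```

<DIV>Comparing the assembly output against this reference on a few blocks, including cases where src1 is larger than src0, would confirm that a reworked load sequence still yields the same residuals.</DIV>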