<div style="line-height:1.7;color:#000000;font-size:14px;font-family:arial"><DIV>Applied with a small modification:<BR>pshufb m0, m0, m7</DIV>
<DIV></DIV>
<DIV id="divNeteaseMailCard"></DIV>
<DIV> </DIV>
<DIV>In Intel's instruction documentation, pshufb takes only two operands; the three-operand style is for AVX and costs an extra encoding byte,</DIV>
<DIV>so I suggest using the two-operand style when you do not actually need the hidden register move.<BR></DIV>
<DIV>At 2013-12-05 15:59:24,murugan@multicorewareinc.com wrote:<BR>># HG changeset patch<BR>># User Murugan Vairavel <murugan@multicorewareinc.com><BR>># Date 1386230342 -19800<BR>># Thu Dec 05 13:29:02 2013 +0530<BR>># Node ID dbfde5222782eec2ba414d473fd4ba2494c6f333<BR>># Parent e4a7885f377e37841c3ecd8e2419454fa1ba03db<BR>>asm: 10bpp code for scale2D_64to32 routine<BR>><BR>>diff -r e4a7885f377e -r dbfde5222782 source/common/x86/asm-primitives.cpp<BR>>--- a/source/common/x86/asm-primitives.cpp Wed Dec 04 13:45:29 2013 -0600<BR>>+++ b/source/common/x86/asm-primitives.cpp Thu Dec 05 13:29:02 2013 +0530<BR>>@@ -567,6 +567,7 @@<BR>> if (cpuMask & X265_CPU_SSSE3)<BR>> {<BR>> p.scale1D_128to64 = x265_scale1D_128to64_ssse3;<BR>>+ p.scale2D_64to32 = x265_scale2D_64to32_ssse3;<BR>> }<BR>> if (cpuMask & X265_CPU_SSE4)<BR>> {<BR>>diff -r e4a7885f377e -r dbfde5222782 source/common/x86/pixel-util8.asm<BR>>--- a/source/common/x86/pixel-util8.asm Wed Dec 04 13:45:29 2013 -0600<BR>>+++ b/source/common/x86/pixel-util8.asm Thu Dec 05 13:29:02 2013 +0530<BR>>@@ -47,6 +47,8 @@<BR>> deinterleave_word_shuf: db 0, 1, 4, 5, 8, 9, 12, 13, 2, 3, 6, 7, 10, 11, 15, 15<BR>> hmul_16p: times 16 db 1<BR>> times 8 db 1, -1<BR>>+hmulw_16p: times 8 dw 1<BR>>+ times 4 dw 1, -1<BR>> <BR>> SECTION .text<BR>> <BR>>@@ -1797,9 +1799,173 @@<BR>> ;-----------------------------------------------------------------<BR>> INIT_XMM ssse3<BR>> cglobal scale2D_64to32, 3, 4, 8, dest, src, stride<BR>>-<BR>>+ mov r3d, 32<BR>>+%if HIGH_BIT_DEPTH<BR>>+ mova m7, [deinterleave_word_shuf]<BR>>+ add r2, r2<BR>>+.loop<BR>>+ movu m0, [r1] ;i<BR>>+ movu m1, [r1 + 2] ;j<BR>>+ movu m2, [r1 + r2] ;k<BR>>+ movu m3, [r1 + r2 + 2] ;l<BR>>+ movu m4, m0<BR>>+ movu m5, m2<BR>>+ pxor m4, m1 ;i^j<BR>>+ pxor m5, m3 ;k^l<BR>>+ por m4, m5 ;ij|kl<BR>>+ pavgw m0, m1 ;s<BR>>+ pavgw m2, m3 ;t<BR>>+ movu m5, m0<BR>>+ pavgw m0, m2 ;(s+t+1)/2<BR>>+ pxor m5, m2 ;s^t<BR>>+ pand m4, m5 ;(ij|kl)&st<BR>>+ pand m4, [hmulw_16p]<BR>>+ psubw m0, m4 
;Result<BR>>+ movu m1, [r1 + 16] ;i<BR>>+ movu m2, [r1 + 16 + 2] ;j<BR>>+ movu m3, [r1 + r2 + 16] ;k<BR>>+ movu m4, [r1 + r2 + 16 + 2] ;l<BR>>+ movu m5, m1<BR>>+ movu m6, m3<BR>>+ pxor m5, m2 ;i^j<BR>>+ pxor m6, m4 ;k^l<BR>>+ por m5, m6 ;ij|kl<BR>>+ pavgw m1, m2 ;s<BR>>+ pavgw m3, m4 ;t<BR>>+ movu m6, m1<BR>>+ pavgw m1, m3 ;(s+t+1)/2<BR>>+ pxor m6, m3 ;s^t<BR>>+ pand m5, m6 ;(ij|kl)&st<BR>>+ pand m5, [hmulw_16p]<BR>>+ psubw m1, m5 ;Result<BR>>+ pshufb m0, m0, m7<BR>>+ pshufb m1, m1, m7<BR>>+<BR>>+ punpcklqdq m0, m1<BR>>+ movu [r0], m0<BR>>+<BR>>+ movu m0, [r1 + 32] ;i<BR>>+ movu m1, [r1 + 32 + 2] ;j<BR>>+ movu m2, [r1 + r2 + 32] ;k<BR>>+ movu m3, [r1 + r2 + 32 + 2] ;l<BR>>+ movu m4, m0<BR>>+ movu m5, m2<BR>>+ pxor m4, m1 ;i^j<BR>>+ pxor m5, m3 ;k^l<BR>>+ por m4, m5 ;ij|kl<BR>>+ pavgw m0, m1 ;s<BR>>+ pavgw m2, m3 ;t<BR>>+ movu m5, m0<BR>>+ pavgw m0, m2 ;(s+t+1)/2<BR>>+ pxor m5, m2 ;s^t<BR>>+ pand m4, m5 ;(ij|kl)&st<BR>>+ pand m4, [hmulw_16p]<BR>>+ psubw m0, m4 ;Result<BR>>+ movu m1, [r1 + 48] ;i<BR>>+ movu m2, [r1 + 48 + 2] ;j<BR>>+ movu m3, [r1 + r2 + 48] ;k<BR>>+ movu m4, [r1 + r2 + 48 + 2] ;l<BR>>+ movu m5, m1<BR>>+ movu m6, m3<BR>>+ pxor m5, m2 ;i^j<BR>>+ pxor m6, m4 ;k^l<BR>>+ por m5, m6 ;ij|kl<BR>>+ pavgw m1, m2 ;s<BR>>+ pavgw m3, m4 ;t<BR>>+ movu m6, m1<BR>>+ pavgw m1, m3 ;(s+t+1)/2<BR>>+ pxor m6, m3 ;s^t<BR>>+ pand m5, m6 ;(ij|kl)&st<BR>>+ pand m5, [hmulw_16p]<BR>>+ psubw m1, m5 ;Result<BR>>+ pshufb m0, m0, m7<BR>>+ pshufb m1, m1, m7<BR>>+<BR>>+ punpcklqdq m0, m1<BR>>+ movu [r0 + 16], m0<BR>>+<BR>>+ movu m0, [r1 + 64] ;i<BR>>+ movu m1, [r1 + 64 + 2] ;j<BR>>+ movu m2, [r1 + r2 + 64] ;k<BR>>+ movu m3, [r1 + r2 + 64 + 2] ;l<BR>>+ movu m4, m0<BR>>+ movu m5, m2<BR>>+ pxor m4, m1 ;i^j<BR>>+ pxor m5, m3 ;k^l<BR>>+ por m4, m5 ;ij|kl<BR>>+ pavgw m0, m1 ;s<BR>>+ pavgw m2, m3 ;t<BR>>+ movu m5, m0<BR>>+ pavgw m0, m2 ;(s+t+1)/2<BR>>+ pxor m5, m2 ;s^t<BR>>+ pand m4, m5 ;(ij|kl)&st<BR>>+ pand m4, [hmulw_16p]<BR>>+ psubw m0, m4 ;Result<BR>>+ movu m1, [r1 + 80] ;i<BR>>+ movu 
m2, [r1 + 80 + 2] ;j<BR>>+ movu m3, [r1 + r2 + 80] ;k<BR>>+ movu m4, [r1 + r2 + 80 + 2] ;l<BR>>+ movu m5, m1<BR>>+ movu m6, m3<BR>>+ pxor m5, m2 ;i^j<BR>>+ pxor m6, m4 ;k^l<BR>>+ por m5, m6 ;ij|kl<BR>>+ pavgw m1, m2 ;s<BR>>+ pavgw m3, m4 ;t<BR>>+ movu m6, m1<BR>>+ pavgw m1, m3 ;(s+t+1)/2<BR>>+ pxor m6, m3 ;s^t<BR>>+ pand m5, m6 ;(ij|kl)&st<BR>>+ pand m5, [hmulw_16p]<BR>>+ psubw m1, m5 ;Result<BR>>+ pshufb m0, m0, m7<BR>>+ pshufb m1, m1, m7<BR>>+<BR>>+ punpcklqdq m0, m1<BR>>+ movu [r0 + 32], m0<BR>>+<BR>>+ movu m0, [r1 + 96] ;i<BR>>+ movu m1, [r1 + 96 + 2] ;j<BR>>+ movu m2, [r1 + r2 + 96] ;k<BR>>+ movu m3, [r1 + r2 + 96 + 2] ;l<BR>>+ movu m4, m0<BR>>+ movu m5, m2<BR>>+ pxor m4, m1 ;i^j<BR>>+ pxor m5, m3 ;k^l<BR>>+ por m4, m5 ;ij|kl<BR>>+ pavgw m0, m1 ;s<BR>>+ pavgw m2, m3 ;t<BR>>+ movu m5, m0<BR>>+ pavgw m0, m2 ;(s+t+1)/2<BR>>+ pxor m5, m2 ;s^t<BR>>+ pand m4, m5 ;(ij|kl)&st<BR>>+ pand m4, [hmulw_16p]<BR>>+ psubw m0, m4 ;Result<BR>>+ movu m1, [r1 + 112] ;i<BR>>+ movu m2, [r1 + 112 + 2] ;j<BR>>+ movu m3, [r1 + r2 + 112] ;k<BR>>+ movu m4, [r1 + r2 + 112 + 2] ;l<BR>>+ movu m5, m1<BR>>+ movu m6, m3<BR>>+ pxor m5, m2 ;i^j<BR>>+ pxor m6, m4 ;k^l<BR>>+ por m5, m6 ;ij|kl<BR>>+ pavgw m1, m2 ;s<BR>>+ pavgw m3, m4 ;t<BR>>+ movu m6, m1<BR>>+ pavgw m1, m3 ;(s+t+1)/2<BR>>+ pxor m6, m3 ;s^t<BR>>+ pand m5, m6 ;(ij|kl)&st<BR>>+ pand m5, [hmulw_16p]<BR>>+ psubw m1, m5 ;Result<BR>>+ pshufb m0, m0, m7<BR>>+ pshufb m1, m1, m7<BR>>+<BR>>+ punpcklqdq m0, m1<BR>>+ movu [r0 + 48], m0<BR>>+ lea r0, [r0 + 64]<BR>>+%else<BR>> mova m7, [deinterleave_shuf]<BR>>- mov r3d, 32<BR>> .loop<BR>> <BR>> movu m0, [r1] ;i<BR>>@@ -1895,9 +2061,9 @@<BR>> movu [r0 + 16], m0<BR>> <BR>> lea r0, [r0 + 32]<BR>>+%endif<BR>> lea r1, [r1 + 2 * r2]<BR>> dec r3d<BR>>-<BR>> jnz .loop<BR>> <BR>> RET<BR>>_______________________________________________<BR>>x265-devel mailing list<BR>>x265-devel@videolan.org<BR>>https://mailman.videolan.org/listinfo/x265-devel<BR></DIV></div>