<div style="line-height:1.7;color:#000000;font-size:14px;font-family:arial"><div> </div><pre><br>At 2014-09-02 22:13:08, praveen@multicorewareinc.com wrote:
># HG changeset patch
># User Praveen Tiwari
># Date 1409652070 -19800
># Node ID 2667a0e3afdc2b95ff73c962b3e25366162d8e8d
># Parent 16de8fd2837c853c974f83f9aba9e8ef09c2fe2b
>added copy_shr primitive
>
>diff -r 16de8fd2837c -r 2667a0e3afdc source/common/x86/blockcopy8.asm
>--- a/source/common/x86/blockcopy8.asm Tue Sep 02 14:38:41 2014 +0530
>+++ b/source/common/x86/blockcopy8.asm Tue Sep 02 15:31:10 2014 +0530
>@@ -4400,3 +4400,79 @@
> psadbw xm0, xm4
> movd eax, xm0
> RET
>+
>+;-----------------------------------------------------------------------------
>+; void copy_shr(short *dst, short *src, intptr_t stride, int shift, int size)
>+;-----------------------------------------------------------------------------
>+
>+INIT_XMM sse4
>+cglobal copy_shr, 4, 7, 4, dst, src, stride
>+%define rnd m2
>+%define shift m1
>+
>+ ; make shift
>+ mov r5d, r3m
>+ movd shift, r5d
>+
>+ ; make round
>+ dec r5
>+ xor r6, r6
>+ bts r6, r5
</pre><pre>Operating on r5d and r6d (32-bit) is better here and below. 'xor' is the exception: the instruction decoder detects the zeroing idiom either way.</pre><pre>>+
>+ movd rnd, r6d
>+ pshufd rnd, rnd, 0
>+
>+ ; register alloc
>+ ; r0 - dst
>+ ; r1 - src
>+ ; r2 - stride * 2 (short*)
>+ ; r3 - lx
>+ ; r4 - size
>+ ; r5 - ly
>+ ; r6 - diff
>+ add r2d, r2d
>+
>+ mov r4d, r4m
>+ mov r5, r4 ; size
>+ mov r6, r2 ; stride
>+ sub r6, r4
>+ add r6, r6
>+
>+ shr r5, 1
>+.loop_row:
>+
>+ mov r3, r4
>+ shr r3, 2
>+.loop_col:
>+ ; row 0
>+ movh m3, [r1]
>+ pmovsxwd m0, m3
</pre><pre>pmovsxwd doesn't need an aligned address, so the movh load above can be merged into it, e.g. pmovsxwd m0, [r1].</pre><pre> </pre><pre>>+ paddd m0, rnd
>+ psrad m0, shift
>+ packssdw m0, m0
>+ movh [r0], m0
>+
>+ ; row 1
>+ movh m3, [r1 + r4 * 2]
>+ pmovsxwd m0, m3
>+ paddd m0, rnd
>+ psrad m0, shift
>+ packssdw m0, m0
>+ movh [r0 + r2], m0
</pre><pre>We can share one packssdw between row 0 and row 1, then store with movh + movhps, to reduce port 5 utilization on Haswell.</pre><pre>>+
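</pre><pre>To check that sharing one packssdw keeps the two rows apart, here is a scalar C model of the idea. It is only a sketch: sat16, pack_rows and store_split are hypothetical names that mimic packssdw, movh and movhps, and using m3 for row 1 is my assumption, not something the patch does.

```c
/* Scalar stand-in for the signed saturation done by packssdw. */
static short sat16(int v)
{
    if (v > 32767) return 32767;
    if (v < -32768) return -32768;
    return (short)v;
}

/* Model of a hypothetical "packssdw m0, m3":
 * words 0..3 come from a (row 0), words 4..7 come from b (row 1). */
static void pack_rows(short packed[8], const int a[4], const int b[4])
{
    for (int i = 0; i < 4; i++) {
        packed[i]     = sat16(a[i]);
        packed[i + 4] = sat16(b[i]);
    }
}

/* Model of "movh [r0], m0" followed by "movhps [r0 + r2], m0". */
static void store_split(short *row0, short *row1, const short packed[8])
{
    for (int i = 0; i < 4; i++) {
        row0[i] = packed[i];      /* low 64 bits go to row 0  */
        row1[i] = packed[i + 4];  /* high 64 bits go to row 1 */
    }
}
```

The low four words come from row 0 and the high four from row 1, so a single pack can feed both stores.</pre><pre>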
>+ ; move col pointer
>+ add r1, 8
>+ add r0, 8
>+
>+ dec r3
>+ jg .loop_col
>+
>+ ; update pointer
>+ lea r1, [r1 + r4 * 2]
>+ add r0, r6
>+
>+ ; end of loop_row
>+ dec r5
>+ jg .loop_row
>+
>+ RET
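</pre><pre>For testing, a scalar C model of the routine as I read the assembly above may help. It is a sketch under assumptions, not the project's reference code: copy_shr_ref is a hypothetical name, src is assumed packed (row pitch = size) while dst uses stride, shift is assumed to be at least 1, "long" stands in for intptr_t to keep the sketch header-free, and right-shifting a negative int is assumed to be arithmetic, like psrad.

```c
/* Hypothetical scalar model: dst = (src + round) >> shift, row by row. */
void copy_shr_ref(short *dst, const short *src, long stride, int shift, int size)
{
    /* Same constant the assembly builds with dec + bts: 1 << (shift - 1). */
    int round = 1 << (shift - 1);

    for (int y = 0; y < size; y++) {
        for (int x = 0; x < size; x++)
            dst[x] = (short)((src[x] + round) >> shift);

        src += size;    /* source rows are contiguous */
        dst += stride;  /* destination rows are stride elements apart */
    }
}
```

With shift >= 1 the result always fits in 16 bits, so the model needs no explicit saturation.</pre><pre>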
</pre></div>