[x265] [PATCH] copy_cnt_16, AVX2 asm code as per new interface, performance improved from 14.22x to 23.57x on HASWELL-I5
chen
chenm003 at 163.com
Wed Sep 10 23:20:09 CEST 2014
At 2014-09-10 16:05:40, praveen at multicorewareinc.com wrote:
># HG changeset patch
># User Praveen Tiwari
># Date 1410336330 -19800
># Node ID c5b3e04e4eba2fcc4298c225d11ab25e0da82558
># Parent d29cb300975a491287abdfb6abd2a9d3141e99f0
>copy_cnt_16, AVX2 asm code as per new interface, performance improved from 14.22x to 23.57x on HASWELL-I5
For the commit message, it is better to report cycles: our testbench shows cycle counts, while the speed-up factor depends on the CPU type and the compiler.
> INIT_YMM avx2
>-cglobal copy_cnt_16, 3,5,5
>+cglobal copy_cnt_16, 3,5,7
> add r2d, r2d
>- lea r4, [r2 * 3]
>- mov r3d, 16/4
>- ; NOTE: xorpd is faster than pxor
>- xorpd m4, m4
>- xorpd m3, m3
>-
>-.loop
>- ; row 0
>+ lea r3, [r2 * 3]
>+ mov r4d, 256/128
>+
>+ xorpd m5, m5
m5 is needed only for the psadbw after the loop, so why spend a register on it before the loop?
>+ xorpd m6, m6
>+
>+.loop:
>+ ; row 0 - 1
> movu m0, [r1]
>+ movu [r0], m0
> movu m1, [r1 + r2]
>+ movu [r0 + 32], m1
>+
>+ vpacksswb m0, m1
It is better to drop the 'v' prefix unless the instruction is new in AVX2; x86inc.asm has macros that handle the renaming.
>+ pminub m0, [pb_1]
In my demo I used this style because I read the memory operand only 2 times.
Here you read it 4x2=8 times, so loading the constant into a register before the loop is better.
>+
>+ ; row 2 - 3
>+ movu m1, [r1 + r2 * 2]
>+ movu [r0 + 64], m1
>+ movu m2, [r1 + r3]
>+ movu [r0 + 96], m2
>+
>+ vpacksswb m1, m2
>+ pminub m1, [pb_1]
>+ paddb m0, m1
>+
>+ ; row 4 - 5
>+ lea r1, [r1 + r2 * 4]
>+ movu m2, [r1]
>+ movu [r0 + 128], m2
>+ movu m3, [r1 + r2]
>+ movu [r0 + 160], m3
>+
>+ vpacksswb m2, m3
>+ pminub m2, [pb_1]
>+
>+ ; row 6 - 7
>+ movu m3, [r1 + r2 * 2]
>+ movu [r0 + 192], m3
>+ movu m4, [r1 + r3]
>+ movu [r0 + 224], m4
An offset of 128 or more no longer fits a signed 8-bit displacement, so it is encoded as 4 bytes.
>+
>+ vpacksswb m3, m4
>+ pminub m3, [pb_1]
>+ paddb m2, m3
>+
>+ paddb m0, m2
>+ paddb m6, m0
>+
>+ add r0, 256
Same here: a constant of 128 or more is encoded as 4 bytes.
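The size rule behind this comment can be sketched in C. For the 32- and 64-bit operands used here, x86 encodes a memory displacement or an ALU immediate in one byte when the value fits a sign-extended 8-bit field (-128 to 127), and otherwise falls back to four bytes. The helper name `const_field_bytes` is hypothetical, and the sketch ignores the zero-displacement case, which usually needs no displacement byte at all.

```c
#include <stdint.h>

/* Hypothetical helper (not part of x265 or x86inc): bytes needed to encode a
   nonzero displacement or ALU immediate for 32/64-bit operands, assuming the
   instruction has both a one-byte and a four-byte form. */
static int const_field_bytes(int32_t value)
{
    /* disp8 and imm8 are sign-extended, so the one-byte range is -128..127 */
    return (value >= -128 && value <= 127) ? 1 : 4;
}
```

By this rule the stores at [r0 + 32], [r0 + 64] and [r0 + 96] use one-byte displacements, while [r0 + 128] through [r0 + 224] and the immediate in add r0, 256 each take four bytes.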
>+ lea r1, [r1 + 4 * r2]
>+ dec r4d
> jnz .loop
>
> ; get count
>+ vextracti128 xm1, m6, 1
>+ paddb xm6, xm1
>+ psadbw xm6, xm5
>+ movhlps xm1, xm6
>+ paddd xm6, xm1
>+ movd eax, xm6
> RET
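For reference, the counting scheme used in the patch above can be modelled in scalar C. This is a minimal sketch, not the x265 C primitive: the name `copy_cnt_16_model`, its signature (destination, source, stride in int16_t elements) and the helper `sat_s8` are assumptions inferred from the register usage in the asm. The byte interleaving that packsswb performs across 128-bit lanes is ignored, since it does not change the total.

```c
#include <stdint.h>

/* Saturate a signed 16-bit value to signed 8 bits, as packsswb does per element. */
static int8_t sat_s8(int16_t v)
{
    if (v > 127) return 127;
    if (v < -128) return -128;
    return (int8_t)v;
}

/* Hypothetical scalar model of copy_cnt_16 (name and signature are assumptions):
   copies a 16x16 block of coefficients and returns the number of nonzero ones. */
static uint32_t copy_cnt_16_model(int16_t *dst, const int16_t *src, intptr_t stride)
{
    uint8_t lanes[32] = {0};   /* models the 32 byte lanes of the accumulator m6 */
    uint32_t count = 0;

    for (int row = 0; row < 16; row++)
    {
        for (int col = 0; col < 16; col++)
        {
            int16_t v = src[row * stride + col];
            dst[row * 16 + col] = v;               /* the copy: dst is a packed 16x16 block */

            uint8_t b = (uint8_t)sat_s8(v);        /* packsswb: a nonzero word stays nonzero */
            uint8_t flag = b < 1 ? b : 1;          /* pminub with pb_1: yields 0 or 1 */
            lanes[(row % 2) * 16 + col] += flag;   /* paddb: each lane reaches at most 8 */
        }
    }

    for (int i = 0; i < 32; i++)                   /* psadbw against zero plus the final adds */
        count += lanes[i];
    return count;
}
```

The saturation step matters: a value such as 256 has a zero low byte, so plain truncation would miss it, whereas saturating to 127 keeps it nonzero. Each byte lane sums eight 0/1 flags over the 16 rows, so the byte accumulators cannot overflow before the final horizontal sum.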