[x265] Fwd: [PATCH] weight_pp avx2 asm code, improved from 8608.65 cycles to 5138.09 cycles over sse version of asm code

Praveen Tiwari praveen at multicorewareinc.com
Fri Oct 17 07:03:42 CEST 2014


---------- Forwarded message ----------
From: chen <chenm003 at 163.com>
Date: Fri, Oct 17, 2014 at 3:11 AM
Subject: Re: [x265] [PATCH] weight_pp avx2 asm code, improved from 8608.65
cycles to 5138.09 cycles over sse version of asm code
To: Development for x265 <x265-devel at videolan.org>





At 2014-10-16 17:20:13, praveen at multicorewareinc.com wrote:
># HG changeset patch
># User Praveen Tiwari
># Date 1413451199 -19800
># Node ID 858be8d7d7176ab6c6d01cf92d00c8478fe99b34
># Parent  79702581ec824a2a375aebe228d69c3930aeea96
>weight_pp avx2 asm code, improved from 8608.65 cycles to 5138.09 cycles over sse version of asm code
>
>diff -r 79702581ec82 -r 858be8d7d717 source/common/x86/pixel-util8.asm
>--- a/source/common/x86/pixel-util8.asm	Wed Oct 15 17:49:35 2014 -0500
>+++ b/source/common/x86/pixel-util8.asm	Thu Oct 16 14:49:59 2014 +0530
>@@ -1375,6 +1375,60 @@
>
>     RET
>
>+INIT_YMM avx2
>+cglobal weight_pp, 6, 7, 6
>+
>+    mov          r6d, r6m
>+    shl          r6d, 6           ; m0 = [w0<<6]
>+    movd         xm0, r6d
>+
>+    movd         xm1, r7m         ; m1 = [round]
>+    punpcklwd    xm0, xm1
>+    pshufd       xm0, xm0, 0
>+    vinserti128  m0, m0, xm0, 1   ; assuming both (w0<<6) and round are using maximum of 16 bits each, m0 = [w0<<6 round]

>>vpbroadcastd is better

Yeah, exactly. I tried to replace (pshufd xm0, xm0, 0) + (vinserti128
m0, m0, xm0, 1) with vpbroadcastd m0, xm0 (as per the documented syntax,
__m256i _mm256_broadcastd_epi32(__m128i a)), but it throws a build error:
invalid combination of opcode and operands.

Also, we only use weight_pp in four places, and all of them have the same
stride in r2 & r3, so we can simplify the interface and free up more
registers here; you can combine W0 and Round in a general register to
improve performance.
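
To make that concrete, here is a small hedged sketch (my own illustration,
not code from the patch; the function name is made up) of combining w0<<6
and round into one 32-bit value and broadcasting it, at the intrinsics level:

    #include <immintrin.h>
    #include <stdint.h>

    /* Illustrative only: build the 32-bit constant (round << 16) | (w0 << 6)
     * once in scalar code, then broadcast it to every dword lane in one step
     * (vpbroadcastd / _mm256_set1_epi32) instead of movd + pshufd + vinserti128.
     * pmaddwd of (pixel, 1) word pairs against this constant then yields
     * pixel*(w0<<6) + round in each dword. */
    static inline __m256i make_weight_round(int w0, int round)
    {
        uint32_t packed = (uint16_t)(w0 << 6) | ((uint32_t)round << 16);
        return _mm256_set1_epi32((int)packed);
    }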



>+
>+    movd         xm1, r8m
>+    vpbroadcastd m2, r9m
>+    mova         m5, [pw_1]
>+    sub          r2d, r4d
>+    sub          r3d, r4d
>+
>+.loopH:
>+    mov         r6d, r4d
>+    shr         r6d, 4

Why do the shr every time? r4d doesn't change inside .loopH, so the shifted
count could be computed once before the loop.

>+.loopW:
>+    movu        xm4, [r0]
>+    pmovzxbw    m4, xm4

pmovzxbw doesn't need an aligned address, so it can take [r0] as its memory
operand directly instead of going through a separate movu.

>+    punpcklwd   m3, m4, m5
>+    pmaddwd     m3, m0
>+    psrad       m3, xm1
>+    paddd       m3, m2
>+
>+    punpckhwd   m4, m5
>+    pmaddwd     m4, m0
>+    psrad       m4, xm1
>+    paddd       m4, m2
>+
>+    packssdw    m3, m4
>+    vextracti128 xm4, m3, 1
>+    packuswb    m3, m4

How about vpermq+packuswb(xm3)?
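
For what it's worth, one possible reading of the vpermq+packuswb idea at the
intrinsics level (my interpretation, not necessarily the exact sequence
intended here; the helper name is made up):

    #include <immintrin.h>

    /* Pack 16 in-order words to bytes within each 128-bit lane first, then use
     * a single lane-crossing vpermq to gather the 16 result bytes into the low
     * half, instead of vextracti128 followed by an xmm packuswb. */
    static inline __m128i pack_words_to_bytes(__m256i w)
    {
        __m256i b = _mm256_packus_epi16(w, w);      /* bytes duplicated per lane */
        b = _mm256_permute4x64_epi64(b, 0x08);      /* qwords {0,2} -> low 128 bits */
        return _mm256_castsi256_si128(b);
    }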

>+    movu        [r1], xm3
>+
>+    add         r0, 16
>+    add         r1, 16
>+
>+    dec         r6d
>+    jnz         .loopW
>+
>+    lea         r0, [r0 + r2]
>+    lea         r1, [r1 + r3]
>+
>+    dec         r5d
>+    jnz         .loopH
>+
>+    RET
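
For reference, a hedged C-intrinsics sketch of one row of the same computation
(pixel*(w0<<6) + round, arithmetic shift, add offset, pack with saturation).
Names and the single-row framing are mine, not x265's; the asm above
additionally walks src/dst by their strides, and width is assumed to be a
multiple of 16 here:

    #include <immintrin.h>
    #include <stdint.h>

    static void weight_pp_row_sketch(const uint8_t *src, uint8_t *dst, int width,
                                     int w0, int round, int shift, int offset)
    {
        const __m256i coeff = _mm256_set1_epi32(
            (int)(((uint32_t)round << 16) | (uint16_t)(w0 << 6)));
        const __m256i ones  = _mm256_set1_epi16(1);
        const __m256i off   = _mm256_set1_epi32(offset);
        const __m128i cnt   = _mm_cvtsi32_si128(shift);

        for (int x = 0; x < width; x += 16)
        {
            /* pmovzxbw: zero-extend 16 source bytes to 16 words */
            __m256i pix = _mm256_cvtepu8_epi16(
                _mm_loadu_si128((const __m128i *)(src + x)));

            /* interleave (pixel, 1) word pairs and madd with (w0<<6, round):
             * each dword becomes pixel*(w0<<6) + round */
            __m256i lo = _mm256_madd_epi16(_mm256_unpacklo_epi16(pix, ones), coeff);
            __m256i hi = _mm256_madd_epi16(_mm256_unpackhi_epi16(pix, ones), coeff);

            /* psrad + paddd: arithmetic shift right, then add offset */
            lo = _mm256_add_epi32(_mm256_sra_epi32(lo, cnt), off);
            hi = _mm256_add_epi32(_mm256_sra_epi32(hi, cnt), off);

            /* packssdw + packuswb: saturate back down to 16 unsigned bytes */
            __m256i w = _mm256_packs_epi32(lo, hi);
            __m128i b = _mm_packus_epi16(_mm256_castsi256_si128(w),
                                         _mm256_extracti128_si256(w, 1));
            _mm_storeu_si128((__m128i *)(dst + x), b);
        }
    }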

