<div style="line-height:1.7;color:#000000;font-size:14px;font-family:arial"><div> </div><div><br></div>At 2014-11-18 19:34:53,"Divya Manivannan" <divya@multicorewareinc.com> wrote:<br> <blockquote id="isReplyContent" style="margin: 0px 0px 0px 0.8ex; padding-left: 1ex; border-left-color: rgb(204, 204, 204); border-left-width: 1px; border-left-style: solid;"><div dir="ltr"><br><div class="gmail_quote">---------- Forwarded message ----------<br>From: <b class="gmail_sendername">chen</b> <span dir="ltr"><<a href="mailto:chenm003@163.com" target="_blank">chenm003@163.com</a>></span><br>Date: Mon, Nov 17, 2014 at 9:54 PM<br>Subject: Re: [x265] [PATCH] asm: luma_vpp[16x16] in avx2: improve 2141c->1488c<br>To: Development for x265 <<a href="mailto:x265-devel@videolan.org" target="_blank">x265-devel@videolan.org</a>><br><br><br><div style="color: rgb(0, 0, 0); line-height: 1.7; font-family: arial; font-size: 14px;"><div> </div><pre><br>At 2014-11-17 19:14:55,"Divya Manivannan" <<a href="mailto:divya@multicorewareinc.com" target="_blank">divya@multicorewareinc.com</a>> wrote:
># HG changeset patch
># User Divya Manivannan <<a href="mailto:divya@multicorewareinc.com" target="_blank">divya@multicorewareinc.com</a>>
># Date 1416222868 -19800
>#      Mon Nov 17 16:44:28 2014 +0530
># Node ID 95571420e32c02f51d41fc45d286a2619a5052e5
># Parent  5fec38f8c75606ddf4d5b7c9244dfd373892af5d
>asm: luma_vpp[16x16] in avx2: improve 2141c->1488c
>
<span> INIT_YMM avx2
>-cglobal interp_8tap_vert_pp_8x8, 4,7,7
>+cglobal interp_8tap_vert_pp_8x8, 4, 7, 7
>     mov             r4d, r4m
>     shl             r4d, 7
>     lea             r5, [r1 * 3]
>@@ -3875,6 +3875,57 @@
>     RET
> %endmacro

>+INIT_YMM avx2
>+cglobal interp_8tap_vert_pp_16x16, 4, 7, 7, 0-gprsize
</span></pre><pre>You only use a dword of local stack.</pre><pre>[Divya] I couldn't understand why I need to use a dword of local stack.</pre><pre>[MC] Because of "dec dword [rsp]": you really only use a dword.
</pre><pre> </pre><span><pre>>+    mov             r4d, r4m
>+    shl             r4d, 7
>+    lea             r5, [r1 * 3]
>+    sub             r0, r5
>+
>+%ifdef PIC
>+    lea             r5, [tab_LumaCoeffVer_32]
>+    lea             r5, [r5 + r4]
</pre></span><pre>Same as before.</pre><span><pre>>+%else
>+    lea             r5, [tab_LumaCoeffVer_32 + r4]
>+%endif
>+
>+    mov             dword [rsp], 2
>+
>+.loopH:
>+    mov             r4d, 2
>+.loopW:
>+    PROCESS_LUMA_AVX2_W8_8R
</pre></span><pre>AVX2 can process one row at a time; could you try that mode?<span style="color: rgb(80, 0, 80); line-height: 1.7; font-family: arial;"> </span></pre><div><div><pre>>+    lea             r6, [r2]
>+    pmulhrsw        m5, [pw_512]                    ; m5 = word: row 0, row 1
>+    pmulhrsw        m2, [pw_512]                    ; m2 = word: row 2, row 3
>+    pmulhrsw        m1, [pw_512]                    ; m1 = word: row 4, row 5
>+    pmulhrsw        m4, [pw_512]                    ; m4 = word: row 6, row 7
>+    packuswb        m5, m2
>+    packuswb        m1, m4
>+    vextracti128    xm2, m5, 1
>+    vextracti128    xm4, m1, 1
>+    movq            [r6], xm5
>+    movq            [r6 + r3], xm2
>+    lea             r6, [r6 + r3 * 2]
>+    movhps          [r6], xm5
>+    movhps          [r6 + r3], xm2
>+    lea             r6, [r6 + r3 * 2]
>+    movq            [r6], xm1
>+    movq            [r6 + r3], xm4
>+    lea             r6, [r6 + r3 * 2]
>+    movhps          [r6], xm1
>+    movhps          [r6 + r3], xm4
>+
>+    add             r0, 8
>+    add             r2, 8
>+    dec             r4d
>+    jnz             .loopW
>+    lea             r2, [r2 + r3 * 8 - 16]
>+    lea             r0, [r0 + r1 * 8 - 16]
>+    dec             dword [rsp]
</pre></div></div><pre>The code is correct, but I suggest using a macro to improve x64 performance.</pre><pre>[Divya] I couldn't understand how this macro would improve x64 performance.</pre><pre>[MC] Most of the time, memory is slower. In x64 mode we have 16 registers, so you can replace the memory operand with a register.</pre><pre> </pre><pre><span>>+    jnz             .loopH
>+    RET
>+
> ;-------------------------------------------------------------------------------------------------------------
> ; void interp_8tap_vert_pp_12x16(pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx)
> ;---------------------------------------------------------------------------------------</span>----------------------
>_______________________________________________
>x265-devel mailing list
><a href="mailto:x265-devel@videolan.org" target="_blank">x265-devel@videolan.org</a>
><a href="https://mailman.videolan.org/listinfo/x265-devel" target="_blank">https://mailman.videolan.org/listinfo/x265-devel</a>
</pre></div><br>
<br></div><br></div>
</blockquote></div>
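<div>The rounding and packing step in the quoted patch (pmulhrsw against pw_512, followed by packuswb) can be read as a per-lane scalar operation. The sketch below is a plain C model of that arithmetic, not code from x265: the function names are hypothetical, and it assumes that right-shifting a negative signed integer is an arithmetic shift (true on common compilers, but implementation-defined in C). It illustrates that multiplying by 512 with pmulhrsw is equivalent to (x + 32) >> 6, after which the word is saturated to an unsigned byte.</div>

```c
#include <assert.h>
#include <stdint.h>

/* Scalar model of one 16-bit lane of PMULHRSW:
 * result = (((a * b) >> 14) + 1) >> 1.
 * Assumes >> on negative values is an arithmetic shift. */
static int16_t pmulhrsw_lane(int16_t a, int16_t b)
{
    int32_t t = ((int32_t)a * (int32_t)b) >> 14;
    return (int16_t)((t + 1) >> 1);
}

/* Scalar model of one lane of PACKUSWB:
 * signed 16-bit word -> unsigned 8-bit byte with saturation. */
static uint8_t packuswb_lane(int16_t w)
{
    if (w < 0)
        return 0;
    if (w > 255)
        return 255;
    return (uint8_t)w;
}

/* Hypothetical helper: turn one 16-bit filter sum into an 8-bit pixel,
 * mirroring "pmulhrsw m, [pw_512]" followed by "packuswb". */
static uint8_t round_and_pack(int16_t sum)
{
    return packuswb_lane(pmulhrsw_lane(sum, 512));
}

/* Checks over every 16-bit input that the pw_512 multiply
 * equals a rounded shift right by 6. */
static void check_pw_512_equivalence(void)
{
    for (int32_t x = -32768; x <= 32767; x++)
        assert(pmulhrsw_lane((int16_t)x, 512) == (int16_t)((x + 32) >> 6));
}
```

<div>For example, round_and_pack(31) gives 0, round_and_pack(32) gives 1, and any negative sum clamps to 0, which is why the patch needs no separate clipping instruction before the stores.</div>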