<div dir="ltr">Hello Chen,<div><br></div><div>I tried processing one full row at a time in AVX2 for 32 bit, but it performs worse than processing half a row at a time.</div><div>I measured 1891 cycles for full-row processing versus 1488 cycles for half-row processing.</div><div>I have attached the algorithm I tried.</div><div><br></div><div>Regards,</div><div>Divya</div></div><div class="gmail_extra"><br><div class="gmail_quote">On Tue, Nov 18, 2014 at 11:29 PM, chen <span dir="ltr"><<a href="mailto:chenm003@163.com" target="_blank">chenm003@163.com</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div style="line-height:1.7;color:#000000;font-size:14px;font-family:arial"><div><div class="h5"><div> </div><div><br></div>At 2014-11-18 19:34:53,"Divya Manivannan" <<a href="mailto:divya@multicorewareinc.com" target="_blank">divya@multicorewareinc.com</a>> wrote:<br> </div></div><blockquote style="margin:0px 0px 0px 0.8ex;padding-left:1ex;border-left-color:rgb(204,204,204);border-left-width:1px;border-left-style:solid"><div dir="ltr"><br><div class="gmail_quote"><div><div class="h5">---------- Forwarded message ----------<br>From: <b class="gmail_sendername">chen</b> <span dir="ltr"><<a href="mailto:chenm003@163.com" target="_blank">chenm003@163.com</a>></span><br>Date: Mon, Nov 17, 2014 at 9:54 PM<br>Subject: Re: [x265] [PATCH] asm: luma_vpp[16x16] in avx2: improve 2141c->1488c<br>To: Development for x265 <<a href="mailto:x265-devel@videolan.org" target="_blank">x265-devel@videolan.org</a>><br><br><br></div></div><div style="color:rgb(0,0,0);line-height:1.7;font-family:arial;font-size:14px"><div><div class="h5"><div> </div><pre><br>At 2014-11-17 19:14:55,"Divya Manivannan" <<a href="mailto:divya@multicorewareinc.com" target="_blank">divya@multicorewareinc.com</a>> wrote:
># HG changeset patch
># User Divya Manivannan <<a href="mailto:divya@multicorewareinc.com" target="_blank">divya@multicorewareinc.com</a>>
># Date 1416222868 -19800
># Mon Nov 17 16:44:28 2014 +0530
># Node ID 95571420e32c02f51d41fc45d286a2619a5052e5
># Parent 5fec38f8c75606ddf4d5b7c9244dfd373892af5d
>asm: luma_vpp[16x16] in avx2: improve 2141c->1488c
>
<span>> INIT_YMM avx2
>-cglobal interp_8tap_vert_pp_8x8, 4,7,7
>+cglobal interp_8tap_vert_pp_8x8, 4, 7, 7
> mov r4d, r4m
> shl r4d, 7
> lea r5, [r1 * 3]
>@@ -3875,6 +3875,57 @@
> RET
> %endmacro
>
>+INIT_YMM avx2
>+cglobal interp_8tap_vert_pp_16x16, 4, 7, 7, 0-gprsize
</span></pre><pre>You only use a dword of local stack.</pre><pre>[Divya] I couldn't understand why I need to use a dword of local stack.</pre></div></div><pre>[MC] Because of "dec dword [rsp]": the counter you actually use is a dword.
</pre><div><div class="h5"><pre> </pre><span><pre>>+ mov r4d, r4m
>+ shl r4d, 7
>+ lea r5, [r1 * 3]
>+ sub r0, r5
>+
>+%ifdef PIC
>+ lea r5, [tab_LumaCoeffVer_32]
>+ lea r5, [r5 + r4]
</pre></span><pre>Same comment as before.</pre><span><pre>>+%else
>+ lea r5, [tab_LumaCoeffVer_32 + r4]
>+%endif
>+
>+ mov dword [rsp], 2
>+
>+.loopH:
>+ mov r4d, 2
>+.loopW:
>+ PROCESS_LUMA_AVX2_W8_8R
</pre></span><pre>AVX2 can process one full row at a time; could you try that mode?<span style="color:rgb(80,0,80);line-height:1.7;font-family:arial"> </span></pre><div><div><pre>>+ lea r6, [r2]
>+ pmulhrsw m5, [pw_512] ; m5 = word: row 0, row 1
>+ pmulhrsw m2, [pw_512] ; m2 = word: row 2, row 3
>+ pmulhrsw m1, [pw_512] ; m1 = word: row 4, row 5
>+ pmulhrsw m4, [pw_512] ; m4 = word: row 6, row 7
>+ packuswb m5, m2
>+ packuswb m1, m4
>+ vextracti128 xm2, m5, 1
>+ vextracti128 xm4, m1, 1
>+ movq [r6], xm5
>+ movq [r6 + r3], xm2
>+ lea r6, [r6 + r3 * 2]
>+ movhps [r6], xm5
>+ movhps [r6 + r3], xm2
>+ lea r6, [r6 + r3 * 2]
>+ movq [r6], xm1
>+ movq [r6 + r3], xm4
>+ lea r6, [r6 + r3 * 2]
>+ movhps [r6], xm1
>+ movhps [r6 + r3], xm4
>+
>+ add r0, 8
>+ add r2, 8
>+ dec r4d
>+ jnz .loopW
>+ lea r2, [r2 + r3 * 8 - 16]
>+ lea r0, [r0 + r1 * 8 - 16]
>+ dec dword [rsp]
</pre></div></div><pre>The code is correct, but I suggest using a macro to improve x64 performance.</pre><pre>[Divya] I couldn't understand how this macro would improve x64 performance.</pre></div></div><pre>[MC] Memory access is usually slower than register access. In x64 mode we have 16 registers, so you can replace the memory operand with a register.</pre><span class=""><pre> </pre><pre><span>>+ jnz .loopH
>+ RET
>+
> ;-------------------------------------------------------------------------------------------------------------
> ; void interp_8tap_vert_pp_12x16(pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx)
> ;---------------------------------------------------------------------------------------</span>----------------------
>_______________________________________________
>x265-devel mailing list
><a href="mailto:x265-devel@videolan.org" target="_blank">x265-devel@videolan.org</a>
><a href="https://mailman.videolan.org/listinfo/x265-devel" target="_blank">https://mailman.videolan.org/listinfo/x265-devel</a>
</pre></span></div></div><br></div>
</blockquote></div><br></blockquote></div><br></div>
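<div>For reference, Chen's suggestion to replace the memory operand with a register on x64 could be sketched roughly as follows. This is only an outline under stated assumptions, not the committed code: the name LOOP_H_CNT is hypothetical, the choice of r7 as the spare register is an assumption, the elided parts stand for the unchanged lines of the patch, and it has not been assembled or benchmarked.</div>
<pre>
; Sketch only. LOOP_H_CNT is a hypothetical name, not taken from the patch.
%if ARCH_X86_64
    %define LOOP_H_CNT  r7d             ; x86-64: assume one spare GPR holds the counter
%else
    %define LOOP_H_CNT  dword [rsp]     ; x86-32: keep the counter in a dword stack slot
%endif

INIT_YMM avx2
%if ARCH_X86_64
cglobal interp_8tap_vert_pp_16x16, 4, 8, 7              ; 8 GPRs, no local stack
%else
cglobal interp_8tap_vert_pp_16x16, 4, 7, 7, 0-gprsize   ; as in the patch
%endif
    ; (coefficient-table setup from the patch, unchanged)
    mov         LOOP_H_CNT, 2

.loopH:
    mov         r4d, 2
.loopW:
    ; (PROCESS_LUMA_AVX2_W8_8R and the 8-row store from the patch, unchanged)
    dec         r4d
    jnz         .loopW
    ; (src/dst pointer adjustment from the patch, unchanged)
    dec         LOOP_H_CNT              ; register decrement on x64, memory decrement on x86-32
    jnz         .loopH
    RET
</pre>
<div>The point of the %define is that the loop body is written once, while the counter lives in a register wherever one is free and falls back to the stack slot only on 32-bit builds.</div>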