[x265] Fwd: [PATCH] asm: luma_vpp[16x16] in avx2: improve 2141c->1488c
Divya Manivannan
divya at multicorewareinc.com
Wed Nov 19 13:31:29 CET 2014
Hello Chen,
I tried processing one full row at a time in AVX2 for the 32-bit build, but it
is slower than processing half a row at a time: I measured 1891 cycles for
full-row processing versus 1488 cycles for half-row processing.
I have attached the algorithm that I tried.
Regards,
Divya
On Tue, Nov 18, 2014 at 11:29 PM, chen <chenm003 at 163.com> wrote:
>
>
> At 2014-11-18 19:34:53,"Divya Manivannan" <divya at multicorewareinc.com>
> wrote:
>
>
> ---------- Forwarded message ----------
> From: chen <chenm003 at 163.com>
> Date: Mon, Nov 17, 2014 at 9:54 PM
> Subject: Re: [x265] [PATCH] asm: luma_vpp[16x16] in avx2: improve
> 2141c->1488c
> To: Development for x265 <x265-devel at videolan.org>
>
>
>
>
>
> At 2014-11-17 19:14:55,"Divya Manivannan" <divya at multicorewareinc.com> wrote:
> ># HG changeset patch
> ># User Divya Manivannan <divya at multicorewareinc.com>
> ># Date 1416222868 -19800
> ># Mon Nov 17 16:44:28 2014 +0530
> ># Node ID 95571420e32c02f51d41fc45d286a2619a5052e5
> ># Parent 5fec38f8c75606ddf4d5b7c9244dfd373892af5d
> >asm: luma_vpp[16x16] in avx2: improve 2141c->1488c
> > INIT_YMM avx2
> >-cglobal interp_8tap_vert_pp_8x8, 4,7,7
> >+cglobal interp_8tap_vert_pp_8x8, 4, 7, 7
> > mov r4d, r4m
> > shl r4d, 7
> > lea r5, [r1 * 3]
> >@@ -3875,6 +3875,57 @@
> > RET
> > %endmacro
> >
> >+INIT_YMM avx2
> >+cglobal interp_8tap_vert_pp_16x16, 4, 7, 7, 0-gprsize
>
> You only use a dword of local stack.
>
> [Divya] I couldn't understand why I need to use a dword of local stack.
>
> [MC] Because of "dec dword [rsp]": you really only use a dword.
>
>
>
> >+ mov r4d, r4m
> >+ shl r4d, 7
> >+ lea r5, [r1 * 3]
> >+ sub r0, r5
> >+
> >+%ifdef PIC
> >+ lea r5, [tab_LumaCoeffVer_32]
> >+ lea r5, [r5 + r4]
>
> Same as before.
>
> >+%else
> >+ lea r5, [tab_LumaCoeffVer_32 + r4]
> >+%endif
> >+
> >+ mov dword [rsp], 2
> >+
> >+.loopH:
> >+ mov r4d, 2
> >+.loopW:
> >+ PROCESS_LUMA_AVX2_W8_8R
>
> AVX2 can process one full row at a time; could you try that mode?
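
For reference, a minimal scalar sketch in C of the two traversal orders under discussion: the patch's half-row scheme (a 2 x 2 grid of 8x8 tiles) and the full-row scheme. The function names are hypothetical and are not x265 primitives. The coefficient table is the standard HEVC luma filter, and the (sum + 32) >> 6 rounding is assumed from the pmulhrsw-by-512 step in the quoted patch. Both orders produce identical output, so the cycle counts differ only in how the SIMD work is scheduled, which this scalar model does not capture.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* HEVC 8-tap luma interpolation coefficients, indexed by coeffIdx (0..3). */
static const int8_t luma_filter[4][8] = {
    {  0, 0,   0, 64,  0,   0, 0,  0 },
    { -1, 4, -10, 58, 17,  -5, 1,  0 },
    { -1, 4, -11, 40, 40, -11, 4, -1 },
    {  0, 1,  -5, 17, 58, -10, 4, -1 }
};

/* Vertically filter a w x h tile (hypothetical helper). src points at the
 * first output row; the taps read rows -3 .. +4 around each output row. */
static void vert_pp_tile(const uint8_t *src, ptrdiff_t srcStride,
                         uint8_t *dst, ptrdiff_t dstStride,
                         int coeffIdx, int w, int h)
{
    assert(coeffIdx >= 0 && coeffIdx < 4);
    const int8_t *c = luma_filter[coeffIdx];
    src -= 3 * srcStride;              /* like "sub r0, r5" with r5 = r1 * 3 */
    for (int y = 0; y < h; y++) {
        for (int x = 0; x < w; x++) {
            int sum = 0;
            for (int i = 0; i < 8; i++)
                sum += c[i] * src[x + i * srcStride];
            int v = sum + 32;          /* rounding offset for a 6-bit shift */
            /* negative values clamp to 0, so only non-negative v is shifted */
            int p = v < 0 ? 0 : v >> 6;
            dst[x] = (uint8_t)(p > 255 ? 255 : p);
        }
        src += srcStride;
        dst += dstStride;
    }
}

/* Half-row scheme: two 8-wide tiles per band, two bands of 8 rows. */
static void vert_pp_16x16_tiled(const uint8_t *src, ptrdiff_t srcStride,
                                uint8_t *dst, ptrdiff_t dstStride, int coeffIdx)
{
    for (int band = 0; band < 2; band++) {        /* .loopH */
        for (int tile = 0; tile < 2; tile++) {    /* .loopW */
            vert_pp_tile(src, srcStride, dst, dstStride, coeffIdx, 8, 8);
            src += 8;                             /* add r0, 8 */
            dst += 8;                             /* add r2, 8 */
        }
        src += 8 * srcStride - 16;                /* lea r0, [r0 + r1 * 8 - 16] */
        dst += 8 * dstStride - 16;                /* lea r2, [r2 + r3 * 8 - 16] */
    }
}

/* Full-row scheme: all 16 columns of each row are handled together. */
static void vert_pp_16x16_rows(const uint8_t *src, ptrdiff_t srcStride,
                               uint8_t *dst, ptrdiff_t dstStride, int coeffIdx)
{
    vert_pp_tile(src, srcStride, dst, dstStride, coeffIdx, 16, 16);
}
```

Either function can serve as a reference when checking the output of an assembly variant, provided the source buffer has 3 valid rows above and 4 below the 16x16 block.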
>
> >+ lea r6, [r2]
> >+ pmulhrsw m5, [pw_512] ; m5 = word: row 0, row 1
> >+ pmulhrsw m2, [pw_512] ; m2 = word: row 2, row 3
> >+ pmulhrsw m1, [pw_512] ; m1 = word: row 4, row 5
> >+ pmulhrsw m4, [pw_512] ; m4 = word: row 6, row 7
> >+ packuswb m5, m2
> >+ packuswb m1, m4
> >+ vextracti128 xm2, m5, 1
> >+ vextracti128 xm4, m1, 1
> >+ movq [r6], xm5
> >+ movq [r6 + r3], xm2
> >+ lea r6, [r6 + r3 * 2]
> >+ movhps [r6], xm5
> >+ movhps [r6 + r3], xm2
> >+ lea r6, [r6 + r3 * 2]
> >+ movq [r6], xm1
> >+ movq [r6 + r3], xm4
> >+ lea r6, [r6 + r3 * 2]
> >+ movhps [r6], xm1
> >+ movhps [r6 + r3], xm4
> >+
> >+ add r0, 8
> >+ add r2, 8
> >+ dec r4d
> >+ jnz .loopW
> >+ lea r2, [r2 + r3 * 8 - 16]
> >+ lea r0, [r0 + r1 * 8 - 16]
> >+ dec dword [rsp]
>
> The code is right, but I suggest using a macro to improve x64 performance.
>
> [Divya] I couldn't understand how a macro would improve x64 performance.
>
> [MC] Most of the time, memory is slower. In x64 mode we have 16 registers, so you can replace the memory operand with a register.
>
>
>
> >+ jnz .loopH
> >+ RET
> >+
> > ;-------------------------------------------------------------------------------------------------------------
> > ; void interp_8tap_vert_pp_12x16(pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx)
> > ;-------------------------------------------------------------------------------------------------------------
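
A note on the "pmulhrsw mN, [pw_512]" lines in the patch above: multiplying by 512 with PMULHRSW is a rounding right shift by 6 bits, and PACKUSWB then clamps each word to the 0..255 pixel range. The following is a minimal scalar sketch in C of one 16-bit lane of each instruction. The helper names are hypothetical, and the model assumes that >> on a negative signed value is an arithmetic shift, which holds on the usual x86 compilers but is implementation-defined in C.

```c
#include <assert.h>
#include <stdint.h>

/* One 16-bit lane of PMULHRSW: (((a * b) >> 14) + 1) >> 1.
 * Assumes >> on a negative int is an arithmetic shift. */
static int16_t pmulhrsw_lane(int16_t a, int16_t b)
{
    int32_t t = ((int32_t)a * b) >> 14;
    return (int16_t)((t + 1) >> 1);
}

/* The conventional rounding shift by 6 bits: (x + 32) >> 6. */
static int16_t round_shift6(int16_t x)
{
    return (int16_t)((x + 32) >> 6);
}

/* One lane of PACKUSWB: saturate a signed word to an unsigned byte. */
static uint8_t packuswb_lane(int16_t x)
{
    assert(x >= INT16_MIN && x <= INT16_MAX);
    return (uint8_t)(x < 0 ? 0 : (x > 255 ? 255 : x));
}
```

For example, pmulhrsw_lane(6400, 512) and round_shift6(6400) both give 100, and the two helpers agree for every 16-bit input.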
> >_______________________________________________
> >x265-devel mailing list
> >x265-devel at videolan.org
> >https://mailman.videolan.org/listinfo/x265-devel
>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: vert_16x16_32bit_15x.rtf
Type: application/rtf
Size: 4810 bytes
Desc: not available
URL: <http://mailman.videolan.org/pipermail/x265-devel/attachments/20141119/6f46571c/attachment-0001.rtf>