<div style="line-height:1.7;color:#000000;font-size:14px;font-family:arial"><div> </div><pre><br>At 2014-11-17 19:14:55, "Divya Manivannan" &lt;divya@multicorewareinc.com&gt; wrote:
># HG changeset patch
># User Divya Manivannan &lt;divya@multicorewareinc.com&gt;
># Date 1416222868 -19800
># Mon Nov 17 16:44:28 2014 +0530
># Node ID 95571420e32c02f51d41fc45d286a2619a5052e5
># Parent 5fec38f8c75606ddf4d5b7c9244dfd373892af5d
>asm: luma_vpp[16x16] in avx2: improve 2141c->1488c
>
> INIT_YMM avx2
>-cglobal interp_8tap_vert_pp_8x8, 4,7,7
>+cglobal interp_8tap_vert_pp_8x8, 4, 7, 7
> mov r4d, r4m
> shl r4d, 7
> lea r5, [r1 * 3]
>@@ -3875,6 +3875,57 @@
> RET
> %endmacro
>
>+INIT_YMM avx2
>+cglobal interp_8tap_vert_pp_16x16, 4, 7, 7, 0-gprsize
</pre><pre>You only use a dword of the local stack here.</pre><pre>>+ mov r4d, r4m
>+ shl r4d, 7
>+ lea r5, [r1 * 3]
>+ sub r0, r5
>+
>+%ifdef PIC
>+ lea r5, [tab_LumaCoeffVer_32]
>+ lea r5, [r5 + r4]
</pre><pre>Same as before.</pre><pre>>+%else
>+ lea r5, [tab_LumaCoeffVer_32 + r4]
>+%endif
>+
>+ mov dword [rsp], 2
>+
>+.loopH:
>+ mov r4d, 2
>+.loopW:
>+ PROCESS_LUMA_AVX2_W8_8R
</pre><pre>AVX2 can process one full row at a time; could you try that mode?</pre><pre> </pre><pre>>+ lea r6, [r2]
>+ pmulhrsw m5, [pw_512] ; m5 = word: row 0, row 1
>+ pmulhrsw m2, [pw_512] ; m2 = word: row 2, row 3
>+ pmulhrsw m1, [pw_512] ; m1 = word: row 4, row 5
>+ pmulhrsw m4, [pw_512] ; m4 = word: row 6, row 7
>+ packuswb m5, m2
>+ packuswb m1, m4
>+ vextracti128 xm2, m5, 1
>+ vextracti128 xm4, m1, 1
>+ movq [r6], xm5
>+ movq [r6 + r3], xm2
>+ lea r6, [r6 + r3 * 2]
>+ movhps [r6], xm5
>+ movhps [r6 + r3], xm2
>+ lea r6, [r6 + r3 * 2]
>+ movq [r6], xm1
>+ movq [r6 + r3], xm4
>+ lea r6, [r6 + r3 * 2]
>+ movhps [r6], xm1
>+ movhps [r6 + r3], xm4
>+
>+ add r0, 8
>+ add r2, 8
>+ dec r4d
>+ jnz .loopW
>+ lea r2, [r2 + r3 * 8 - 16]
>+ lea r0, [r0 + r1 * 8 - 16]
>+ dec dword [rsp]
</pre><pre>The code is correct, but I suggest using a macro to improve x64 performance.</pre><pre>>+ jnz .loopH
>+ RET
>+
> ;-------------------------------------------------------------------------------------------------------------
> ; void interp_8tap_vert_pp_12x16(pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx)
> ;-------------------------------------------------------------------------------------------------------------
</pre></div>