<div dir="ltr">Ah, the row width is just 8 * 16 = 128 bits, so handling two rows at a time would also need vextracti128 when storing, and that goes to port 5, a bottleneck port. pavgw is much cheaper than that. You may try combining for the 16xN sizes.</div><div class="gmail_extra"><br clear="all"><div><div class="gmail_signature"><div dir="ltr">Regards,<div>Praveen</div></div></div></div>
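For anyone following the thread, here is a minimal scalar sketch in C of what the 8xN kernel in the quoted patch computes. It is not x265 code: the names `avg_round` and `pixel_avg_8xN_ref` are hypothetical, and it assumes 10bpp pixels stored in 16-bit words with strides given in pixels.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical scalar model, not x265 code: 10bpp pixels live in 16-bit words. */
typedef uint16_t pixel;

/* Rounded average of two unsigned 16-bit values, which is what pavgw
 * computes per lane: (a + b + 1) >> 1, without intermediate overflow. */
static pixel avg_round(pixel a, pixel b)
{
    return (pixel)(((uint32_t)a + (uint32_t)b + 1u) >> 1);
}

/* Averages an 8-pixel-wide block of `height` rows.
 * Strides are in pixels here; the asm doubles them (add r1d, r1d)
 * to turn them into byte offsets. One row is 8 * 16 = 128 bits,
 * so it fills exactly one xmm register. */
static void pixel_avg_8xN_ref(pixel *dst, ptrdiff_t dstride,
                              const pixel *src0, ptrdiff_t sstride0,
                              const pixel *src1, ptrdiff_t sstride1,
                              int height)
{
    for (int y = 0; y < height; y++) {
        for (int x = 0; x < 8; x++)
            dst[x] = avg_round(src0[x], src1[x]);
        dst  += dstride;
        src0 += sstride0;
        src1 += sstride1;
    }
}
```

Because each row already fills a 128-bit register, packing two rows into one ymm register saves one pavgw per pair but adds an insert on load and an extract on store, which is the trade-off discussed above.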
<br><div class="gmail_quote">On Fri, Jun 26, 2015 at 3:40 PM, Rajesh Paulraj <span dir="ltr"><<a href="mailto:rajesh@multicorewareinc.com" target="_blank">rajesh@multicorewareinc.com</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div dir="ltr">I tried using vinserti128, but it performed worse than this version, so I kept this one.</div><div class="HOEnZb"><div class="h5"><div class="gmail_extra"><br><div class="gmail_quote">On Fri, Jun 26, 2015 at 3:37 PM, Praveen Tiwari <span dir="ltr"><<a href="mailto:praveen@multicorewareinc.com" target="_blank">praveen@multicorewareinc.com</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div dir="ltr"><div><div><br clear="all"><div><div><div dir="ltr"><br></div></div></div>
<br><div class="gmail_quote">---------- Forwarded message ----------<br>From: <b class="gmail_sendername"></b> <span dir="ltr"><<a href="mailto:rajesh@multicorewareinc.com" target="_blank">rajesh@multicorewareinc.com</a>></span><br>Date: Fri, Jun 26, 2015 at 3:14 PM<br>Subject: [x265] [PATCH] asm: pixelavg_pp[8xN] avx2 code for 10bpp<br>To: <a href="mailto:x265-devel@videolan.org" target="_blank">x265-devel@videolan.org</a><br><br><br># HG changeset patch<br>
# User Rajesh Paulraj<<a href="mailto:rajesh@multicorewareinc.com" target="_blank">rajesh@multicorewareinc.com</a>><br>
# Date 1435311076 -19800<br>
# Fri Jun 26 15:01:16 2015 +0530<br>
# Node ID 956401f1a679f1e71181b704d64e4acdb6f1a93f<br>
# Parent d64227e54233d1646c55bcb4b0b831e5340009ed<br>
asm: pixelavg_pp[8xN] avx2 code for 10bpp<br>
<br>
avx2:<br>
avg_pp[ 8x4] 4.39x 145.09 636.75<br>
avg_pp[ 8x8] 5.33x 215.27 1146.55<br>
avg_pp[ 8x16] 6.50x 336.88 2190.68<br>
avg_pp[ 8x32] 7.71x 579.86 4470.84<br>
<br>
sse2:<br>
avg_pp[ 8x4] 2.31x 287.63 663.94<br>
avg_pp[ 8x8] 3.26x 370.21 1205.26<br>
avg_pp[ 8x16] 3.99x 581.63 2323.25<br>
avg_pp[ 8x32] 4.78x 995.79 4755.58<br>
<br>
diff -r d64227e54233 -r 956401f1a679 source/common/x86/asm-primitives.cpp<br>
--- a/source/common/x86/asm-primitives.cpp Thu Jun 25 16:25:51 2015 +0530<br>
+++ b/source/common/x86/asm-primitives.cpp Fri Jun 26 15:01:16 2015 +0530<br>
@@ -1362,6 +1362,10 @@<br>
p.cu[BLOCK_32x32].intra_pred[33] = PFX(intra_pred_ang32_33_avx2);<br>
p.cu[BLOCK_32x32].intra_pred[34] = PFX(intra_pred_ang32_2_avx2);<br>
<br>
+ p.pu[LUMA_8x4].pixelavg_pp = PFX(pixel_avg_8x4_avx2);<br>
+ p.pu[LUMA_8x8].pixelavg_pp = PFX(pixel_avg_8x8_avx2);<br>
+ p.pu[LUMA_8x16].pixelavg_pp = PFX(pixel_avg_8x16_avx2);<br>
+ p.pu[LUMA_8x32].pixelavg_pp = PFX(pixel_avg_8x32_avx2);<br>
p.pu[LUMA_12x16].pixelavg_pp = PFX(pixel_avg_12x16_avx2);<br>
p.pu[LUMA_16x4].pixelavg_pp = PFX(pixel_avg_16x4_avx2);<br>
p.pu[LUMA_16x8].pixelavg_pp = PFX(pixel_avg_16x8_avx2);<br>
diff -r d64227e54233 -r 956401f1a679 source/common/x86/mc-a.asm<br>
--- a/source/common/x86/mc-a.asm Thu Jun 25 16:25:51 2015 +0530<br>
+++ b/source/common/x86/mc-a.asm Fri Jun 26 15:01:16 2015 +0530<br>
@@ -4490,6 +4490,88 @@<br>
RET<br>
%endif<br>
<br>
+%macro pixel_avg_W8 0<br>
+ movu xm0, [r2]<br>
+ movu xm1, [r4]<br>
+ pavgw xm0, xm1<br>
+ movu [r0], xm0<br>
+ movu xm2, [r2 + r3]<br>
+ movu xm3, [r4 + r5]<br>
+ pavgw xm2, xm3<br>
+ movu [r0 + r1], xm2<br>
+</div></div></div><div class="gmail_quote">>> Your macro does not use AVX2 capabilities. Did you check the performance with two rows combined? It would halve the number of pavgw and movu instructions. You can use vinserti128 to combine two rows at a time.</div><div class="gmail_quote"><div><div><br>
+ movu xm0, [r2 + r3 * 2]<br>
+ movu xm1, [r4 + r5 * 2]<br>
+ pavgw xm0, xm1<br>
+ movu [r0 + r1 * 2], xm0<br>
+ movu xm2, [r2 + r6]<br>
+ movu xm3, [r4 + r7]<br>
+ pavgw xm2, xm3<br>
+ movu [r0 + r8], xm2<br>
+<br>
+ lea r0, [r0 + 4 * r1]<br>
+ lea r2, [r2 + 4 * r3]<br>
+ lea r4, [r4 + 4 * r5]<br>
+%endmacro<br>
+<br>
+;-------------------------------------------------------------------------------------------------------------------------------<br>
+;void pixelavg_pp(pixel* dst, intptr_t dstride, const pixel* src0, intptr_t sstride0, const pixel* src1, intptr_t sstride1, int)<br>
+;-------------------------------------------------------------------------------------------------------------------------------<br>
+%if ARCH_X86_64<br>
+INIT_YMM avx2<br>
+cglobal pixel_avg_8x4, 6,10,4<br>
+ add r1d, r1d<br>
+ add r3d, r3d<br>
+ add r5d, r5d<br>
+ lea r6, [r3 * 3]<br>
+ lea r7, [r5 * 3]<br>
+ lea r8, [r1 * 3]<br>
+ pixel_avg_W8<br>
+ RET<br>
+<br>
+cglobal pixel_avg_8x8, 6,10,4<br>
+ add r1d, r1d<br>
+ add r3d, r3d<br>
+ add r5d, r5d<br>
+ lea r6, [r3 * 3]<br>
+ lea r7, [r5 * 3]<br>
+ lea r8, [r1 * 3]<br>
+ mov r9d, 2<br>
+.loop<br>
+ pixel_avg_W8<br>
+ dec r9d<br>
+ jnz .loop<br>
+ RET<br>
+<br>
+cglobal pixel_avg_8x16, 6,10,4<br>
+ add r1d, r1d<br>
+ add r3d, r3d<br>
+ add r5d, r5d<br>
+ lea r6, [r3 * 3]<br>
+ lea r7, [r5 * 3]<br>
+ lea r8, [r1 * 3]<br>
+ mov r9d, 4<br>
+.loop<br>
+ pixel_avg_W8<br>
+ dec r9d<br>
+ jnz .loop<br>
+ RET<br>
+<br>
+cglobal pixel_avg_8x32, 6,10,4<br>
+ add r1d, r1d<br>
+ add r3d, r3d<br>
+ add r5d, r5d<br>
+ lea r6, [r3 * 3]<br>
+ lea r7, [r5 * 3]<br>
+ lea r8, [r1 * 3]<br>
+ mov r9d, 8<br>
+.loop<br>
+ pixel_avg_W8<br>
+ dec r9d<br>
+ jnz .loop<br>
+ RET<br>
+%endif<br>
+<br>
%macro pixel_avg_H4 0<br>
movu m0, [r2]<br>
movu m1, [r4]<br></div></div>
_______________________________________________<br>
x265-devel mailing list<br>
<a href="mailto:x265-devel@videolan.org" target="_blank">x265-devel@videolan.org</a><br>
<a href="https://mailman.videolan.org/listinfo/x265-devel" rel="noreferrer" target="_blank">https://mailman.videolan.org/listinfo/x265-devel</a><br>
</div><br></div>
<br></blockquote></div><br></div>
</div></div><br>
<br></blockquote></div><br></div>