<div dir="ltr"><br><div class="gmail_quote">---------- Forwarded message ----------<br>From: <b class="gmail_sendername">chen</b> <span dir="ltr"><<a href="mailto:chenm003@163.com" target="_blank">chenm003@163.com</a>></span><br>Date: Mon, Oct 20, 2014 at 9:07 PM<br>Subject: Re: [x265] [PATCH] weighted prediction pixel, avx2 asm code as per new interface<br>To: Development for x265 <<a href="mailto:x265-devel@videolan.org" target="_blank">x265-devel@videolan.org</a>><br><br><br><div style="line-height:1.7;color:rgb(0,0,0);font-size:14px;font-family:arial"><div> </div><pre><br>At 2014-10-20 19:37:06,<a href="mailto:praveen@multicorewareinc.com" target="_blank">praveen@multicorewareinc.com</a> wrote:


># HG changeset patch
># User Praveen Tiwari
># Date 1413805016 -19800
># Node ID 2293c5759f2e0af36141125a97b7a479e023619b
># Parent  3366be6ef59eec3d3ca69ed52942708b5d1b3bc6
>weighted prediction pixel, avx2 asm code as per new interface
>
>diff -r 3366be6ef59e -r 2293c5759f2e source/common/x86/pixel-util.h
>--- a/source/common/x86/pixel-util.h       Mon Oct 20 13:53:09 2014 +0530
>+++ b/source/common/x86/pixel-util.h       Mon Oct 20 17:06:56 2014 +0530
>@@ -58,6 +58,7 @@
> int x265_count_nonzero_ssse3(const int16_t *quantCoeff, int numCoeff);
> 
> void x265_weight_pp_sse4(pixel *src, pixel *dst, intptr_t stride, int width, int height, int w0, int round, int shift, int offset);
>+void x265_weight_pp_avx2(pixel *src, pixel *dst, intptr_t stride, int width, int height, int w0, int round, int shift, int offset);
> void x265_weight_sp_sse4(int16_t *src, pixel *dst, intptr_t srcStride, intptr_t dstStride, int width, int height, int w0, int round, int shift, int offset);
> 
> void x265_pixel_ssim_4x4x2_core_mmx2(const uint8_t * pix1, intptr_t stride1,
>diff -r 3366be6ef59e -r 2293c5759f2e source/common/x86/pixel-util8.asm
>--- a/source/common/x86/pixel-util8.asm    Mon Oct 20 13:53:09 2014 +0530
>+++ b/source/common/x86/pixel-util8.asm    Mon Oct 20 17:06:56 2014 +0530
>@@ -1363,6 +1363,56 @@
>     jnz         .loopH
>     RET
> 
>+INIT_YMM avx2
>+cglobal weight_pp, 6, 7, 6
>+
>+    shl          r5d, 6            ; m0 = [w0<<6]
>+    mov          r6d, r6m
>+    shl          r6d, 16
>+    or           r6d, r5d          ; assuming both (w0<<6) and round are using maximum of 16 bits each.
>+    movd         xm0, r6d
>+    pshufd       xm0, xm0, 0       ; m0 = [w0<<6, round]
>+    vinserti128  m0, m0, xm0, 1    ; document says (pshufd + vinserti128) can be replaced with vpbroadcastd m0, xm0, but having build problem, need to investigate
</pre><pre>vpbroadcastd m0, xm0</pre><pre>[Praveen]: Please check the comment; <span style="line-height:23.7999992370605px;font-family:arial">vpbroadcastd causes a build problem here. It would be better if you could try it on your side.</span></pre><span><pre>>+    movd         xm1, r7m
>+    vpbroadcastd m2, r8m
>+    mova         m5, [pw_1]
>+    sub          r2d, r3d
>+    shr          r3d, 4
>+
>+.loopH:
>+    mov          r5d, r3d
>+
>+.loopW:
>+    pmovzxbw    m4, [r0]
>+    punpcklwd   m3, m4, m5
>+    pmaddwd     m3, m0
>+    psrad       m3, xm1
>+    paddd       m3, m2
>+
>+    punpckhwd   m4, m5
>+    pmaddwd     m4, m0
>+    psrad       m4, xm1
>+    paddd       m4, m2
>+
>+    packssdw    m3, m4
>+    vpermq      m4, m3, 11101110b  ;[1, 2, 1, 2]
</pre></span><pre>You just want the high 128 bits; how about vextracti128?</pre><pre>[Praveen]: Min, I had done the same, but the current form was your suggestion on my previous patch, so I updated it. Check the following:</pre><pre><span class="im" style="font-family:arial;line-height:23.7999992370605px;white-space:normal"><pre style="white-space:pre-wrap">>+    packssdw    m3, m4
>+    vextracti128 xm4, m3, 1
>+    packuswb    m3, m4
</pre></span><pre style="white-space:pre-wrap;line-height:23.7999992370605px">How about vpermq+packuswb(xm3)? <span style="line-height:23.7999992370605px;font-family:arial"> </span></pre></pre><pre>>+    packuswb    m3, m4


</pre><pre>Do you need 256-bit operations here?</pre><pre><pre style="line-height:23.7999992370605px;white-space:pre-wrap">Not required. Both are safe, but I think I should replace it with the 128-bit form, which is more sensible.</pre></pre><span><pre>>+    movu        [r1], xm3
>+
>+    add         r0, 16
>+    add         r1, 16
>+
>+    dec         r5d
>+    jnz         .loopW
>+
>+    lea         r0, [r0 + r2]
>+    lea         r1, [r1 + r2]
>+
>+    dec         r4d
>+    jnz         .loopH
>+    RET
>+    RET
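
For reference, the per-pixel arithmetic of the loop above can be sketched in scalar C. This is only a rough sketch, not the x265 C primitive: it assumes 8-bit pixels (which is what the shift by 6 suggests) and an arithmetic right shift for negative values, and the names pixel8, clip_u8 and weight_pp_ref are hypothetical and do not appear in the patch.

```c
/* Scalar sketch of the loop above; assumes 8-bit pixels (hence the factor 64). */
typedef unsigned char pixel8; /* hypothetical name for an 8-bit pixel */

/* Clamp to the 8-bit range, mirroring the saturation done by packuswb. */
static int clip_u8(int v)
{
    if (v > 255) return 255;
    return v > 0 ? v : 0;
}

/* Hypothetical reference routine; it is not part of the patch. */
static void weight_pp_ref(const pixel8 *src, pixel8 *dst, long stride,
                          int width, int height,
                          int w0, int round, int shift, int offset)
{
    int x, y;
    for (y = 0; y < height; y++)
    {
        for (x = 0; x < width; x++)
        {
            /* pmaddwd on the (pixel, 1) pairs: pixel * (w0 * 64) + 1 * round */
            int v = src[x] * (w0 * 64) + round;
            /* psrad by shift, then paddd with offset, then saturate */
            dst[x] = (pixel8)clip_u8((v >> shift) + offset);
        }
        /* one stride is shared by src and dst, as in the asm interface */
        src += stride;
        dst += stride;
    }
}
```

A scalar routine of this kind could serve as a comparison point when checking the AVX2 output on a few rows, provided the assumptions above hold for the build in question.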


</pre></span></div><br>
<br></div><br></div>