<div dir="ltr"><br><div class="gmail_extra"><br><br><div class="gmail_quote">On Tue, Oct 8, 2013 at 2:33 AM,  <span dir="ltr">&lt;<a href="mailto:praveen@multicorewareinc.com" target="_blank">praveen@multicorewareinc.com</a>&gt;</span> wrote:<br>

<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"># HG changeset patch<br>

# User Praveen Tiwari<br>

# Date 1381217602 -19800<br>

# Node ID 4f728eeab74a089c86068663baf522c40a136981<br>

# Parent  2b2fc4a46c7dcf8720b1b9872c0f3b86c048ffcd<br>

filterHorizontal_p_p_4, 48x48 asm code<br></blockquote><div><br></div><div>For luma, the only width-48 block used in the encoder is 48x64.</div><div><br></div><div>At width 64 there are only 64x16, 64x32, 64x48, and 64x64 (heights of 1/4, 1/2, 3/4, and 4/4 of the width).</div>

<div><br></div><div>The same applies to width 32 (heights 8, 16, 24, 32) and width 16 (heights 4, 8, 12, 16). Width 24 only has height 32, and width 12 only has height 16.</div><div><br></div><div>Width 8 only has 8x4 and 8x8.</div><div><br></div>

<div>So to minimize your work, you should write 8-tap luma macros that interpolate:</div><div><br></div><div>* 64x16</div><div>* 32x8<br></div><div>* 16x4<br></div><div>* 8x4<br></div><div><br></div><div>The 48x64, 24x32, and 12x16 blocks are rarely used (they only occur with AMP, asymmetric motion partitions) and could be built from 16x4 or 4x4.</div>
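<div><br></div><div>As a rough sketch of the sizing argument above (an illustration only, not code from x265), the following Python picks, for each luma size named in this message, the widest base block that tiles it exactly and counts the calls needed. LUMA_BLOCKS, BASE_BLOCKS, pick_base and PLAN are hypothetical names, and the size list is only what this message mentions, not a claim about every partition the encoder supports.</div>
<pre>
```python
# Luma block sizes named in the review, as (width, height).
# Assumption: this list is taken from the message, not from the x265 source.
LUMA_BLOCKS = [
    (64, 16), (64, 32), (64, 48), (64, 64),
    (32, 8), (32, 16), (32, 24), (32, 32),
    (16, 4), (16, 8), (16, 12), (16, 16),
    (8, 4), (8, 8),
    (48, 64), (24, 32), (12, 16),  # AMP-only widths
]

# Candidate building blocks, widest first; 4x4 is the fallback for AMP widths.
BASE_BLOCKS = [(64, 16), (32, 8), (16, 4), (8, 4), (4, 4)]


def pick_base(width, height):
    """Return (base, calls) for the widest base block that tiles the target."""
    for base_w, base_h in BASE_BLOCKS:
        if width % base_w == 0 and height % base_h == 0:
            calls = (width // base_w) * (height // base_h)
            return (base_w, base_h), calls
    raise ValueError("no base block tiles %dx%d" % (width, height))


# Map every target block to its base block and the number of calls.
PLAN = {block: pick_base(*block) for block in LUMA_BLOCKS}

for (w, h), ((bw, bh), calls) in PLAN.items():
    print("%dx%d: %d call(s) of %dx%d" % (w, h, calls, bw, bh))
```
</pre>
<div>Under these assumptions, 48x64 is covered by 48 calls of 16x4 and 12x16 by 12 calls of 4x4. 24x32 is not a multiple of 16 wide, so this sketch falls back to 8x4 for it; a real implementation could instead pair a 16-wide and an 8-wide macro per row.</div>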

<div><br></div><div>These 4-tap filters are only used for 4:2:0 chroma, which has different block-size requirements. You need to figure out exactly which chroma blocks are needed before writing 4-tap block intrinsics.</div>
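<div><br></div><div>One possible starting point for that exercise, sketched in Python: in 4:2:0 the chroma planes have half the luma width and half the luma height, so a candidate chroma list can be derived by halving each luma size. This is only an assumption-laden sketch. LUMA_BLOCKS repeats the sizes named in this message (not an exhaustive list), chroma_420 and CHROMA_BLOCKS are hypothetical names, and the result still has to be checked against what the encoder actually calls.</div>
<pre>
```python
# Luma block sizes named in the review, as (width, height).
# Assumption: taken from the message only; the encoder may use others.
LUMA_BLOCKS = [
    (64, 16), (64, 32), (64, 48), (64, 64),
    (32, 8), (32, 16), (32, 24), (32, 32),
    (16, 4), (16, 8), (16, 12), (16, 16),
    (8, 4), (8, 8),
    (48, 64), (24, 32), (12, 16),
]


def chroma_420(width, height):
    """4:2:0 subsampling halves both dimensions of the luma block."""
    return width // 2, height // 2


# De-duplicate and list the widest candidate chroma blocks first.
CHROMA_BLOCKS = sorted({chroma_420(w, h) for w, h in LUMA_BLOCKS}, reverse=True)

for w, h in CHROMA_BLOCKS:
    print("%dx%d" % (w, h))
```
</pre>
<div>With this input the candidates range from 32x32 down to 4x2, and they include widths 24, 12, and 6 from the AMP luma sizes, which is why the 4-tap block list cannot simply reuse the luma one.</div>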

<div><br></div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">

diff -r 2b2fc4a46c7d -r 4f728eeab74a source/common/x86/ipfilter8.asm<br>

--- a/source/common/x86/ipfilter8.asm   Tue Oct 08 12:53:44 2013 +0530<br>

+++ b/source/common/x86/ipfilter8.asm   Tue Oct 08 13:03:22 2013 +0530<br>

@@ -530,3 +530,101 @@<br>

     FILTER_H4_w32   x0, x1, x2, x3<br>

     movu        [dstq + 16],    x1<br>

     RET<br>

+<br>

+    SECTION_RODATA 32<br>

+tab_Tm:     db 0, 1, 2, 3, 1, 2, 3, 4, 2, 3, 4, 5, 3, 4, 5, 6<br>

+            db 4, 5, 6, 7, 5, 6, 7, 8, 6, 7, 8, 9, 7, 8, 9, 10<br>

+<br>

+tab_c_512:  times 8 dw 512<br>

+<br>

+SECTION .text<br>

+<br>

+%macro FILTER_H4_w48 4<br>

+    movu        %1, [srcq - 1]<br>

+    pshufb      %2, %1, Tm0<br>

+    pmaddubsw   %2, coef2<br>

+    pshufb      %1, %1, Tm1<br>

+    pmaddubsw   %1, coef2<br>

+    phaddw      %2, %1<br>

+    movu        %1, [srcq - 1 + 8]<br>

+    pshufb      %4, %1, Tm0<br>

+    pmaddubsw   %4, coef2<br>

+    pshufb      %1, %1, Tm1<br>

+    pmaddubsw   %1, coef2<br>

+    phaddw      %4, %1<br>

+    pmulhrsw    %2, %3<br>

+    pmulhrsw    %4, %3<br>

+    packuswb    %2, %4<br>

+    movu        [dstq],      %2<br>

+    movu        %1, [srcq - 1 + 16]<br>

+    pshufb      %2, %1, Tm0<br>

+    pmaddubsw   %2, coef2<br>

+    pshufb      %1, %1, Tm1<br>

+    pmaddubsw   %1, coef2<br>

+    phaddw      %2, %1<br>

+    movu        %1, [srcq - 1 + 24]<br>

+    pshufb      %4, %1, Tm0<br>

+    pmaddubsw   %4, coef2<br>

+    pshufb      %1, %1, Tm1<br>

+    pmaddubsw   %1, coef2<br>

+    phaddw      %4, %1<br>

+    pmulhrsw    %2, %3<br>

+    pmulhrsw    %4, %3<br>

+    packuswb    %2, %4<br>

+    movu        [dstq + 16],      x1<br>

+    movu        %1, [srcq - 1 + 32]<br>

+    pshufb      %2, %1, Tm0<br>

+    pmaddubsw   %2, coef2<br>

+    pshufb      %1, %1, Tm1<br>

+    pmaddubsw   %1, coef2<br>

+    phaddw      %2, %1<br>

+    movu        %1, [srcq - 1 + 40]<br>

+    pshufb      %4, %1, Tm0<br>

+    pmaddubsw   %4, coef2<br>

+    pshufb      %1, %1, Tm1<br>

+    pmaddubsw   %1, coef2<br>

+    phaddw      %4, %1<br>

+    pmulhrsw    %2, %3<br>

+    pmulhrsw    %4, %3<br>

+    packuswb    %2, %4<br>

+%endmacro<br>

+<br>

+%macro FILTER_H4_w48_CALL 0<br>

+    FILTER_H4_w48   x0, x1, x2, x3<br>

+<br>

+    movu        [dstq + 32],      x1<br>

+<br>

+    add         srcq,        srcstrideq<br>

+    add         dstq,        dststrideq<br>

+%endmacro<br>

+<br>

+;-----------------------------------------------------------------------------<br>

+; void filterHorizontal_p_p_4(pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int width, int height, short const *coeff)<br>

+;-----------------------------------------------------------------------------<br>

+INIT_XMM sse4<br>

+cglobal filterHorizontal_p_p_4, 4, 5, 6, src, srcstride, dst, dststride<br>

+%define coef2       m6<br>

+%define Tm0         m5<br>

+%define Tm1         m4<br>

+%define x3          m3<br>

+%define x2          m2<br>

+%define x1          m1<br>

+%define x0          m0<br>

+<br>

+    mov         r4,         r6m<br>

+    movu        coef2,      [r4]<br>

+    packsswb    coef2,      coef2<br>

+    pshufd      coef2,      coef2,      0<br>

+<br>

+    mova        x2,         [tab_c_512]<br>

+<br>

+    mova        Tm0,        [tab_Tm]<br>

+    mova        Tm1,        [tab_Tm + 16]<br>

+<br>

+ %rep 47<br>

+ FILTER_H4_w48_CALL<br>

+ %endrep<br>

+<br>

+    FILTER_H4_w48   x0, x1, x2, x3<br>

+    movu        [dstq + 32],    x1<br>

+    RET<br>


</blockquote></div><br><br clear="all"><div><br></div>-- <br>Steve Borho

</div></div>