<div style="line-height:1.7;color:#000000;font-size:14px;font-family:arial"><P>please see below with tag [MC]</P>

<P> </P>

<P>At 2013-09-20 21:38:26,"Praveen Tiwari" <praveen@multicorewareinc.com> wrote:<BR></P>

<BLOCKQUOTE id="isReplyContent" style="PADDING-LEFT: 1ex; MARGIN: 0px 0px 0px 0.8ex; BORDER-LEFT: #ccc 1px solid">

<DIV dir="ltr"><BR><BR>

<DIV class="gmail_quote">---------- Forwarded message ----------<BR>From: <B class="gmail_sendername">Jason Garrett-Glaser</B> <SPAN dir="ltr"><<A href="mailto:jason@x264.com">jason@x264.com</A>></SPAN><BR>Date: Fri, Sep 20, 2013 at 12:57 AM<BR>Subject: Re: [x265] [PATCH Only Review, don't merge] Assembly routine for filterHorizontal_p_p() for 4 tap filter<BR>To: <A href="mailto:x265-devel@videolan.org">x265-devel@videolan.org</A><BR><BR><BR></DIV>

<DIV class="im">> +%if ARCH_X86_64 == 0<BR>> +<BR>> +SECTION_RODATA 32<BR>> +tab_leftmask:   dq 0x0000000000000000<BR>> +                dq 0x00000000000000FF<BR>> +                dq 0x000000000000FFFF<BR>> +                dq 0x0000000000FFFFFF<BR>> +                dq 0x00000000FFFFFFFF<BR>> +                dq 0x000000FFFFFFFFFF<BR>> +                dq 0x0000FFFFFFFFFFFF<BR>> +                dq 0x00FFFFFFFFFFFFFF<BR><BR></DIV>

<DIV class="gmail_quote">This can probably be replaced by an unaligned load from a single table<BR>of db 0,0,0,0,0,0,0,0,-1,-1,-1,-1,-1,-1,-1,-1<BR>[MC] Excuse me, I think it is </DIV>

<DIV class="gmail_quote">db -1, -1, -1, -1, -1, -1, -1, 0, 0, 0, 0, 0, 0, 0, 0, 0</DIV>

<DIV class="gmail_quote"><BR>> +INIT_XMM sse4<BR><BR>These generally go above the cglobal for each function, not at the<BR>start of the file.<BR></DIV>

<DIV class="im"><BR>> +%macro FILTER_H4 2<BR>> +    movu        %1, [src + col]<BR>> +    pshufb      %2, %1, Tm4<BR>> +    pmaddubsw   %2, coef2<BR>> +    pshufb      %1, %1, Tm5<BR>> +    pmaddubsw   %1, coef2<BR>> +    phaddw      %2, %1<BR>> +    paddw       %2, sumOffset<BR>> +    psraw       %2, headRoom<BR>> +    packuswb    %2, %1<BR>> +%endmacro<BR><BR></DIV>

<DIV class="gmail_quote">phaddw is pretty slow (~3 uops) and should be avoided if you can, but<BR>if there's no good solution, then using it is okay. Personally I'd try<BR>to restructure the shuffles so that you don't have to add adjacent<BR>values, but if you can't, that's okay.<BR><BR>paddw/psraw can probably be replaced by pmulhrsw with a magic<BR>constant: e.g. (x + 32 >> 6) is the same as pmulhrsw(x, 512). This<BR>tends to be faster in my experience.<BR></DIV>

<DIV class="im"><BR>> +;-----------------------------------------------------------------------------<BR>> +; void filterHorizontal_p_p_4(pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int width, int height, short const *coeff)<BR>> +;-----------------------------------------------------------------------------<BR>> +cglobal filterHorizontal_p_p_4, 0, 7, 8<BR>> +%define headRoom    14-8<BR>> +%define offset      (1 << (headRoom - 1))<BR>> +%define offset2     (offset | (offset << 16))<BR>> +%define src         r0<BR>> +%define dst         r1<BR>> +%define row         r2<BR>> +%define col         r3<BR>> +%define width       r4<BR>> +%define widthleft   r5<BR>> +%define coef2       m7<BR>> +%define sumOffset   m6<BR>> +%define Tm5       &nbsp

 ; m5<BR>> +%define Tm4         m4<BR>> +%define x2          m3<BR>> +%define x1          m2<BR>> +%define x0          m1<BR>> +%define leftmask    m0<BR>> +%define tmp         r0<BR>> +%define tmp1        r1<BR></DIV>

<DIV class="gmail_quote">x86inc named parameters are probably useful here.<BR></DIV>

<DIV class="im"><BR>> +<BR>> +    mov         tmp,        r6m<BR>> +    movu        coef2,      [tmp]<BR>> +    packuswb    coef2,      coef2<BR>> +    pshufd      coef2,      coef2,      0<BR><BR></DIV>

<DIV class="gmail_quote">>>Is there some reason the input coefficients are stored as 16-bit<BR>>>instead of 8-bit?  This step seems like it's going to be repeated a<BR>>>lot.<BR></DIV>

<DIV class="im"><BR></DIV>

<DIV class="im"><FONT color="#000000">To implement this change , we need to modify HM code.</FONT></DIV>

<DIV class="im">[MC] we can define the table in asm file, but we have to modify HM. of course, it is easy things<BR></DIV>

<DIV class="im"><BR>> +<BR>> +    mov         tmp,        offset2<BR>> +    movd        sumOffset,  tmp<BR>> +    pshufd      sumOffset,  sumOffset,  0<BR><BR></DIV>

<DIV class="gmail_quote">You can movd directly from memory; going through a register is much<BR>slower, especially on AMD machines.</DIV>

<DIV class="gmail_quote">[MC] are you means, we put constant into memory and load it once?<BR></DIV>

<DIV class="im"><BR>> +    mov         width,      r4m<BR>> +    mov         widthleft,  width<BR>> +    and         width,      ~7<BR>> +    and         widthleft,  7<BR><BR></DIV>

<DIV class="gmail_quote">mov width, r4m<BR>mov widthleft, 7<BR>and widthleft, width<BR>and width, ~7<BR><BR>(probably a lower latency instruction sequence)<BR></DIV>

<DIV class="im"><BR>> +    movq        leftmask,   [tab_leftmask + widthleft * 8]<BR>> +    mova        Tm4,        [tab_Tm]<BR>> +    mova        Tm5,        [tab_Tm + 16]<BR>> +<BR>> +    mov         src,        r0m<BR>> +    mov         dst,        r2m<BR>> +    mov         row,        r5m<BR><BR></DIV>

<DIV class="gmail_quote">A lot of these can probably be auto-loaded during the function<BR>prologue, unless there's a reason not to.</DIV>

<DIV class="gmail_quote">[MC] auto-loaded is depends on params order, this function have many params, I need tmp register before loop, so I can't use auto-load at prologue<BR></DIV>

<DIV class="im"><BR>> +_loop_row:<BR>> +    test        row,        row<BR>> +    jz          _end_row<BR><BR></DIV>

<DIV class="gmail_quote">>>This condition should be at the end.</DIV>

<DIV class="gmail_quote">We need to discuss more on this. if we will put condition at end, will it check condition (In case it fails beginning of loop) before entering the the loop?</DIV>

<DIV class="gmail_quote">[MC] for testbench only, I afraid it may call with zero<BR><BR>> +    xor         col,        col<BR>> +_loop_col:<BR>> +    cmp         col,        width<BR>> +    jge         _end_col<BR>Same as above.<BR>[MC] same, testbench will pass col less than 8, -_-<BR></DIV>

<DIV class="gmail_quote">

<DIV class="im"><BR>> +    FILTER_H4   x0, x1<BR>> +    movq        [dst + col], x1<BR><BR></DIV>nit: personally I'd do "movh" here.<BR>

<DIV class="im"><BR>> +    add         col,        8<BR>> +    jmp         _loop_col<BR>> +<BR>> +_end_col:<BR>> +    test        widthleft,  widthleft<BR>> +    jz          _next_row<BR>> +<BR>> +    movq        x2, [dst + col]<BR>> +    FILTER_H4   x0, x1<BR>> +    pblendvb    x2, x2, x1, leftmask<BR><BR></DIV>Doesn't pblendvb require some weird thing (in non-AVX) where the xmm0<BR>register is hardcoded?<BR>

<DIV class="im">[MC] no way, x264 macro have a bug here, you can remove reduce x2 and check the output, the xmm0 seems Intel limit</DIV>

<DIV class="im"><BR>> +    movq        [dst + col], x2<BR>> +<BR>> +_next_row:<BR>> +    add         src,        r1m<BR>> +    add         dst,        r3m<BR>> +    dec         row<BR>> +    jmp         _loop_row<BR><BR></DIV>As before, loop conditions should generally be at the end, not the<BR>start; it avoids a wasteful jump.<BR><BR>Jason<BR>_______________________________________________<BR>x265-devel mailing list<BR><A href="mailto:x265-devel@videolan.org">x265-devel@videolan.org</A><BR><A href="https://mailman.videolan.org/listinfo/x265-devel" target="_blank">https://mailman.videolan.org/listinfo/x265-devel</A><BR></DIV><BR></DIV></BLOCKQUOTE></div>