<div style="line-height:1.7;color:#000000;font-size:14px;font-family:arial"><P>please see below with tag [MC]</P>
<P> </P>
<P>At 2013-09-20 21:38:26,"Praveen Tiwari" <praveen@multicorewareinc.com> wrote:<BR></P>
<BLOCKQUOTE id="isReplyContent" style="PADDING-LEFT: 1ex; MARGIN: 0px 0px 0px 0.8ex; BORDER-LEFT: #ccc 1px solid">
<DIV dir="ltr"><BR><BR>
<DIV class="gmail_quote">---------- Forwarded message ----------<BR>From: <B class="gmail_sendername">Jason Garrett-Glaser</B> <SPAN dir="ltr"><<A href="mailto:jason@x264.com">jason@x264.com</A>></SPAN><BR>Date: Fri, Sep 20, 2013 at 12:57 AM<BR>Subject: Re: [x265] [PATCH Only Review, don't merge] Assembly routine for filterHorizontal_p_p() for 4 tap filter<BR>To: <A href="mailto:x265-devel@videolan.org">x265-devel@videolan.org</A><BR><BR><BR></DIV>
<DIV class="im">> +%if ARCH_X86_64 == 0<BR>> +<BR>> +SECTION_RODATA 32<BR>> +tab_leftmask: dq 0x0000000000000000<BR>> + dq 0x00000000000000FF<BR>> + dq 0x000000000000FFFF<BR>> + dq 0x0000000000FFFFFF<BR>> + dq 0x00000000FFFFFFFF<BR>> + dq 0x000000FFFFFFFFFF<BR>> + dq 0x0000FFFFFFFFFFFF<BR>> + dq 0x00FFFFFFFFFFFFFF<BR><BR></DIV>
<DIV class="gmail_quote">This can probably be replaced by an unaligned load from a single table<BR>of db 0,0,0,0,0,0,0,0,-1,-1,-1,-1,-1,-1,-1,-1<BR>[MC] Excuse me, I think it is </DIV>
<DIV class="gmail_quote">db -1, -1, -1, -1, -1, -1, -1, 0, 0, 0, 0, 0, 0, 0, 0, 0</DIV>
<DIV class="gmail_quote"><BR>> +INIT_XMM sse4<BR><BR>These generally go above the cglobal for each function, not at the<BR>start of the file.<BR></DIV>
<DIV class="im"><BR>> +%macro FILTER_H4 2<BR>> + movu %1, [src + col]<BR>> + pshufb %2, %1, Tm4<BR>> + pmaddubsw %2, coef2<BR>> + pshufb %1, %1, Tm5<BR>> + pmaddubsw %1, coef2<BR>> + phaddw %2, %1<BR>> + paddw %2, sumOffset<BR>> + psraw %2, headRoom<BR>> + packuswb %2, %1<BR>> +%endmacro<BR><BR></DIV>
<DIV class="gmail_quote">phaddw is pretty slow (~3 uops) and should be avoided if you can, but<BR>if there's no good solution, then using it is okay. Personally I'd try<BR>to restructure the shuffles so that you don't have to add adjacent<BR>values, but if you can't, that's okay.<BR><BR>paddw/psraw can probably be replaced by pmulhrsw with a magic<BR>constant: e.g. (x + 32 >> 6) is the same as pmulhrsw(x, 512). This<BR>tends to be faster in my experience.<BR></DIV>
<DIV class="im"><BR>> +;-----------------------------------------------------------------------------<BR>> +; void filterHorizontal_p_p_4(pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int width, int height, short const *coeff)<BR>> +;-----------------------------------------------------------------------------<BR>> +cglobal filterHorizontal_p_p_4, 0, 7, 8<BR>> +%define headRoom 14-8<BR>> +%define offset (1 << (headRoom - 1))<BR>> +%define offset2 (offset | (offset << 16))<BR>> +%define src r0<BR>> +%define dst r1<BR>> +%define row r2<BR>> +%define col r3<BR>> +%define width r4<BR>> +%define widthleft r5<BR>> +%define coef2 m7<BR>> +%define sumOffset m6<BR>> +%define Tm5  
; m5<BR>> +%define Tm4 m4<BR>> +%define x2 m3<BR>> +%define x1 m2<BR>> +%define x0 m1<BR>> +%define leftmask m0<BR>> +%define tmp r0<BR>> +%define tmp1 r1<BR></DIV>
<DIV class="gmail_quote">x86inc named parameters are probably useful here.<BR></DIV>
<DIV class="im"><BR>> +<BR>> + mov tmp, r6m<BR>> + movu coef2, [tmp]<BR>> + packuswb coef2, coef2<BR>> + pshufd coef2, coef2, 0<BR><BR></DIV>
<DIV class="gmail_quote">>>Is there some reason the input coefficients are stored as 16-bit<BR>>>instead of 8-bit? This step seems like it's going to be repeated a<BR>>>lot.<BR></DIV>
<DIV class="im"><BR></DIV>
<DIV class="im"><FONT color="#000000">To implement this change , we need to modify HM code.</FONT></DIV>
<DIV class="im">[MC] we can define the table in asm file, but we have to modify HM. of course, it is easy things<BR></DIV>
<DIV class="im"><BR>> +<BR>> + mov tmp, offset2<BR>> + movd sumOffset, tmp<BR>> + pshufd sumOffset, sumOffset, 0<BR><BR></DIV>
<DIV class="gmail_quote">You can movd directly from memory; going through a register is much<BR>slower, especially on AMD machines.</DIV>
<DIV class="gmail_quote">[MC] are you means, we put constant into memory and load it once?<BR></DIV>
<DIV class="im"><BR>> + mov width, r4m<BR>> + mov widthleft, width<BR>> + and width, ~7<BR>> + and widthleft, 7<BR><BR></DIV>
<DIV class="gmail_quote">mov width, r4m<BR>mov widthleft, 7<BR>and widthleft, width<BR>and width, ~7<BR><BR>(probably a lower latency instruction sequence)<BR></DIV>
<DIV class="im"><BR>> + movq leftmask, [tab_leftmask + widthleft * 8]<BR>> + mova Tm4, [tab_Tm]<BR>> + mova Tm5, [tab_Tm + 16]<BR>> +<BR>> + mov src, r0m<BR>> + mov dst, r2m<BR>> + mov row, r5m<BR><BR></DIV>
<DIV class="gmail_quote">A lot of these can probably be auto-loaded during the function<BR>prologue, unless there's a reason not to.</DIV>
<DIV class="gmail_quote">[MC] auto-loaded is depends on params order, this function have many params, I need tmp register before loop, so I can't use auto-load at prologue<BR></DIV>
<DIV class="im"><BR>> +_loop_row:<BR>> + test row, row<BR>> + jz _end_row<BR><BR></DIV>
<DIV class="gmail_quote">>>This condition should be at the end.</DIV>
<DIV class="gmail_quote">We need to discuss more on this. if we will put condition at end, will it check condition (In case it fails beginning of loop) before entering the the loop?</DIV>
<DIV class="gmail_quote">[MC] for testbench only, I afraid it may call with zero<BR><BR>> + xor col, col<BR>> +_loop_col:<BR>> + cmp col, width<BR>> + jge _end_col<BR>Same as above.<BR>[MC] same, testbench will pass col less than 8, -_-<BR></DIV>
<DIV class="gmail_quote">
<DIV class="im"><BR>> + FILTER_H4 x0, x1<BR>> + movq [dst + col], x1<BR><BR></DIV>nit: personally I'd do "movh" here.<BR>
<DIV class="im"><BR>> + add col, 8<BR>> + jmp _loop_col<BR>> +<BR>> +_end_col:<BR>> + test widthleft, widthleft<BR>> + jz _next_row<BR>> +<BR>> + movq x2, [dst + col]<BR>> + FILTER_H4 x0, x1<BR>> + pblendvb x2, x2, x1, leftmask<BR><BR></DIV>Doesn't pblendvb require some weird thing (in non-AVX) where the xmm0<BR>register is hardcoded?<BR>
<DIV class="im">[MC] no way, x264 macro have a bug here, you can remove reduce x2 and check the output, the xmm0 seems Intel limit</DIV>
<DIV class="im"><BR>> + movq [dst + col], x2<BR>> +<BR>> +_next_row:<BR>> + add src, r1m<BR>> + add dst, r3m<BR>> + dec row<BR>> + jmp _loop_row<BR><BR></DIV>As before, loop conditions should generally be at the end, not the<BR>start; it avoids a wasteful jump.<BR><BR>Jason<BR>_______________________________________________<BR>x265-devel mailing list<BR><A href="mailto:x265-devel@videolan.org">x265-devel@videolan.org</A><BR><A href="https://mailman.videolan.org/listinfo/x265-devel" target="_blank">https://mailman.videolan.org/listinfo/x265-devel</A><BR></DIV><BR></DIV></BLOCKQUOTE></div>