<div dir="ltr"><br><br><div class="gmail_quote">---------- Forwarded message ----------<br>From: <b class="gmail_sendername">Jason Garrett-Glaser</b> <span dir="ltr"><<a href="mailto:jason@x264.com">jason@x264.com</a>></span><br>
Date: Fri, Sep 20, 2013 at 12:57 AM<br>Subject: Re: [x265] [PATCH Only Review, don't merge] Assembly routine for filterHorizontal_p_p() for 4 tap filter<br>To: <a href="mailto:x265-devel@videolan.org">x265-devel@videolan.org</a><br>
<br><br><div class="im">> +%if ARCH_X86_64 == 0<br>
> +<br>
> +SECTION_RODATA 32<br>
> +tab_leftmask: dq 0x0000000000000000<br>
> + dq 0x00000000000000FF<br>
> + dq 0x000000000000FFFF<br>
> + dq 0x0000000000FFFFFF<br>
> + dq 0x00000000FFFFFFFF<br>
> + dq 0x000000FFFFFFFFFF<br>
> + dq 0x0000FFFFFFFFFFFF<br>
> + dq 0x00FFFFFFFFFFFFFF<br>
<br>
</div>This can probably be replaced by an unaligned load from a single table<br>
of db 0,0,0,0,0,0,0,0,-1,-1,-1,-1,-1,-1,-1,-1<br>
<br>
> +INIT_XMM sse4<br>
<br>
These generally go above the cglobal for each function, not at the<br>
start of the file.<br>
<div class="im"><br>
> +%macro FILTER_H4 2<br>
> + movu %1, [src + col]<br>
> + pshufb %2, %1, Tm4<br>
> + pmaddubsw %2, coef2<br>
> + pshufb %1, %1, Tm5<br>
> + pmaddubsw %1, coef2<br>
> + phaddw %2, %1<br>
> + paddw %2, sumOffset<br>
> + psraw %2, headRoom<br>
> + packuswb %2, %1<br>
> +%endmacro<br>
<br>
</div>phaddw is pretty slow (~3 uops) and should be avoided if you can, but<br>
if there's no good solution, then using it is okay. Personally I'd try<br>
to restructure the shuffles so that you don't have to add adjacent<br>
values, but if you can't, that's okay.<br>
<br>
paddw/psraw can probably be replaced by pmulhrsw with a magic<br>
constant: e.g. (x + 32 >> 6) is the same as pmulhrsw(x, 512). This<br>
tends to be faster in my experience.<br>
<div class="im"><br>
> +;-----------------------------------------------------------------------------<br>
> +; void filterHorizontal_p_p_4(pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int width, int height, short const *coeff)<br>
> +;-----------------------------------------------------------------------------<br>
> +cglobal filterHorizontal_p_p_4, 0, 7, 8<br>
> +%define headRoom 14-8<br>
> +%define offset (1 << (headRoom - 1))<br>
> +%define offset2 (offset | (offset << 16))<br>
> +%define src r0<br>
> +%define dst r1<br>
> +%define row r2<br>
> +%define col r3<br>
> +%define width r4<br>
> +%define widthleft r5<br>
> +%define coef2 m7<br>
> +%define sumOffset m6<br>
> +%define Tm5 m5<br>
> +%define Tm4 m4<br>
> +%define x2 m3<br>
> +%define x1 m2<br>
> +%define x0 m1<br>
> +%define leftmask m0<br>
> +%define tmp r0<br>
> +%define tmp1 r1<br>
</div>x86inc named parameters are probably useful here.<br>
<div class="im"><br>
> +<br>
> + mov tmp, r6m<br>
> + movu coef2, [tmp]<br>
> + packuswb coef2, coef2<br>
> + pshufd coef2, coef2, 0<br>
<br>
</div>>>Is there some reason the input coefficients are stored as 16-bit<br>>>instead of 8-bit? This step seems like it's going to be repeated a<br>>>lot.<br>
<div class="im"><br></div><div class="im"><font color="#000000">To implement this change , we need to modify HM code.</font></div><div class="im"><br></div><div class="im"><br>
> +<br>
> + mov tmp, offset2<br>
> + movd sumOffset, tmp<br>
> + pshufd sumOffset, sumOffset, 0<br>
<br>
</div>You can movd directly from memory; going through a register is much<br>
slower, especially on AMD machines.<br>
<div class="im"><br>
> + mov width, r4m<br>
> + mov widthleft, width<br>
> + and width, ~7<br>
> + and widthleft, 7<br>
<br>
</div>mov width, r4m<br>
mov widthleft, 7<br>
and widthleft, width<br>
and width, ~7<br>
<br>
(probably a lower latency instruction sequence)<br>
<div class="im"><br>
> + movq leftmask, [tab_leftmask + widthleft * 8]<br>
> + mova Tm4, [tab_Tm]<br>
> + mova Tm5, [tab_Tm + 16]<br>
> +<br>
> + mov src, r0m<br>
> + mov dst, r2m<br>
> + mov row, r5m<br>
<br>
</div>A lot of these can probably be auto-loaded during the function<br>
prologue, unless there's a reason not to.<br>
<div class="im"><br>
> +_loop_row:<br>
> + test row, row<br>
> + jz _end_row<br>
<br>
</div>>>This condition should be at the end.</div><div class="gmail_quote">We need to discuss more on this. if we will put condition at end, will it check condition (In case it fails beginning of loop) before entering the the loop?</div>
<div class="gmail_quote"><br></div><div class="gmail_quote">
<div class="im"><br>
> + xor col, col<br>
> +_loop_col:<br>
> + cmp col, width<br>
> + jge _end_col<br>
<br>
</div>Same as above.<br>
<div class="im"><br>
> + FILTER_H4 x0, x1<br>
> + movq [dst + col], x1<br>
<br>
</div>nit: personally I'd do "movh" here.<br>
<div class="im"><br>
> + add col, 8<br>
> + jmp _loop_col<br>
> +<br>
> +_end_col:<br>
> + test widthleft, widthleft<br>
> + jz _next_row<br>
> +<br>
> + movq x2, [dst + col]<br>
> + FILTER_H4 x0, x1<br>
> + pblendvb x2, x2, x1, leftmask<br>
<br>
</div>Doesn't pblendvb require some weird thing (in non-AVX) where the xmm0<br>
register is hardcoded?<br>
<div class="im"><br>
> + movq [dst + col], x2<br>
> +<br>
> +_next_row:<br>
> + add src, r1m<br>
> + add dst, r3m<br>
> + dec row<br>
> + jmp _loop_row<br>
<br>
</div>As before, loop conditions should generally be at the end, not the<br>
start; it avoids a wasteful jump.<br>
<br>
Jason<br>
_______________________________________________<br>
x265-devel mailing list<br>
<a href="mailto:x265-devel@videolan.org">x265-devel@videolan.org</a><br>
<a href="https://mailman.videolan.org/listinfo/x265-devel" target="_blank">https://mailman.videolan.org/listinfo/x265-devel</a><br>
</div><br></div>