[x265] Fwd: [PATCH Only Review, don't merge] Assembly routine for filterHorizontal_p_p() for 4 tap filter
chen
chenm003 at 163.com
Fri Sep 20 16:53:43 CEST 2013
Please see my comments below, tagged [MC].
At 2013-09-20 21:38:26, "Praveen Tiwari" <praveen at multicorewareinc.com> wrote:
---------- Forwarded message ----------
From: Jason Garrett-Glaser <jason at x264.com>
Date: Fri, Sep 20, 2013 at 12:57 AM
Subject: Re: [x265] [PATCH Only Review, don't merge] Assembly routine for filterHorizontal_p_p() for 4 tap filter
To: x265-devel at videolan.org
> +%if ARCH_X86_64 == 0
> +
> +SECTION_RODATA 32
> +tab_leftmask: dq 0x0000000000000000
> + dq 0x00000000000000FF
> + dq 0x000000000000FFFF
> + dq 0x0000000000FFFFFF
> + dq 0x00000000FFFFFFFF
> + dq 0x000000FFFFFFFFFF
> + dq 0x0000FFFFFFFFFFFF
> + dq 0x00FFFFFFFFFFFFFF
This can probably be replaced by an unaligned load from a single table
of db 0,0,0,0,0,0,0,0,-1,-1,-1,-1,-1,-1,-1,-1
[MC] Excuse me, I think it is
db -1, -1, -1, -1, -1, -1, -1, 0, 0, 0, 0, 0, 0, 0, 0, 0
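[MC] A rough sketch of the single-table form with that layout (untested;
tmp is only used as a scratch register for the example):

tab_leftmask: db -1, -1, -1, -1, -1, -1, -1, 0, 0, 0, 0, 0, 0, 0, 0, 0
...
    mov     tmp, 7
    sub     tmp, widthleft
    movq    leftmask, [tab_leftmask + tmp]  ; low widthleft bytes become 0xFF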
> +INIT_XMM sse4
These generally go above the cglobal for each function, not at the
start of the file.
> +%macro FILTER_H4 2
> + movu %1, [src + col]
> + pshufb %2, %1, Tm4
> + pmaddubsw %2, coef2
> + pshufb %1, %1, Tm5
> + pmaddubsw %1, coef2
> + phaddw %2, %1
> + paddw %2, sumOffset
> + psraw %2, headRoom
> + packuswb %2, %1
> +%endmacro
phaddw is pretty slow (~3 uops) and should be avoided if you can, but
if there's no good solution, then using it is okay. Personally I'd try
to restructure the shuffles so that you don't have to add adjacent
values, but if you can't, that's okay.
paddw/psraw can probably be replaced by pmulhrsw with a magic
constant: e.g. ((x + 32) >> 6) is the same as pmulhrsw(x, 512). This
tends to be faster in my experience.
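Roughly, if sumOffset were instead loaded with 512 broadcast into each
word (untested):

    phaddw     %2, %1
    pmulhrsw   %2, sumOffset   ; == ((x + 32) >> 6) for headRoom == 6
    packuswb   %2, %1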
> +;-----------------------------------------------------------------------------
> +; void filterHorizontal_p_p_4(pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int width, int height, short const *coeff)
> +;-----------------------------------------------------------------------------
> +cglobal filterHorizontal_p_p_4, 0, 7, 8
> +%define headRoom 14-8
> +%define offset (1 << (headRoom - 1))
> +%define offset2 (offset | (offset << 16))
> +%define src r0
> +%define dst r1
> +%define row r2
> +%define col r3
> +%define width r4
> +%define widthleft r5
> +%define coef2 m7
> +%define sumOffset m6
> +%define Tm5 m5
> +%define Tm4 m4
> +%define x2 m3
> +%define x1 m2
> +%define x0 m1
> +%define leftmask m0
> +%define tmp r0
> +%define tmp1 r1
x86inc named parameters are probably useful here.
> +
> + mov tmp, r6m
> + movu coef2, [tmp]
> + packuswb coef2, coef2
> + pshufd coef2, coef2, 0
>>Is there some reason the input coefficients are stored as 16-bit
>>instead of 8-bit? This step seems like it's going to be repeated a
>>lot.
To implement this change, we need to modify the HM code.
[MC] we can define the table in the asm file, but we would have to modify HM. Of course, that is an easy change.
> +
> + mov tmp, offset2
> + movd sumOffset, tmp
> + pshufd sumOffset, sumOffset, 0
You can movd directly from memory; going through a register is much
slower, especially on AMD machines.
[MC] do you mean we should put the constant into memory and load it once?
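[MC] i.e. something like this, if that is the idea (untested; tab_offset
would be a new constant in the RODATA section):

tab_offset: dd (32 | (32 << 16))   ; == offset2 for headRoom == 6
...
    movd    sumOffset, [tab_offset]
    pshufd  sumOffset, sumOffset, 0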
> + mov width, r4m
> + mov widthleft, width
> + and width, ~7
> + and widthleft, 7
mov width, r4m
mov widthleft, 7
and widthleft, width
and width, ~7
(probably a lower latency instruction sequence)
> + movq leftmask, [tab_leftmask + widthleft * 8]
> + mova Tm4, [tab_Tm]
> + mova Tm5, [tab_Tm + 16]
> +
> + mov src, r0m
> + mov dst, r2m
> + mov row, r5m
A lot of these can probably be auto-loaded during the function
prologue, unless there's a reason not to.
[MC] auto-loading depends on the parameter order; this function has many parameters and I need a temporary register before the loop, so I can't auto-load them in the prologue.
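[MC] for reference, the named-parameter declaration would look roughly
like this (untested), but I would still need a temporary register before
the loop starts:

cglobal filterHorizontal_p_p_4, 4, 7, 8, src, srcStride, dst, dstStride, width, height, coeff
    ; src/srcStride/dst/dstStride are loaded into r0-r3 by the prologue;
    ; width, height and coeff stay reachable as widthm, heightm, coeffm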
> +_loop_row:
> + test row, row
> + jz _end_row
>>This condition should be at the end.
We need to discuss this more. If we put the condition at the end, will it still be checked before entering the loop (in case it fails on the very first iteration)?
[MC] this is for the testbench only; I'm afraid it may be called with a count of zero.
> + xor col, col
> +_loop_col:
> + cmp col, width
> + jge _end_col
Same as above.
[MC] same here, the testbench will pass a width of less than 8, -_-
> + FILTER_H4 x0, x1
> + movq [dst + col], x1
nit: personally I'd do "movh" here.
> + add col, 8
> + jmp _loop_col
> +
> +_end_col:
> + test widthleft, widthleft
> + jz _next_row
> +
> + movq x2, [dst + col]
> + FILTER_H4 x0, x1
> + pblendvb x2, x2, x1, leftmask
Doesn't pblendvb require some weird thing (in non-AVX) where the xmm0
register is hardcoded?
[MC] no, the x264 macro has a bug here; you can remove x2 and check the output. The hardcoded xmm0 seems to be an Intel limitation.
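[MC] for reference, the non-AVX encoding takes xmm0 as an implicit mask
operand; note that leftmask happens to be %defined as m0 above. The raw
form is roughly:

    pblendvb x2, x1   ; per byte: x2 = (high bit of m0 byte) ? x1 : x2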
> + movq [dst + col], x2
> +
> +_next_row:
> + add src, r1m
> + add dst, r3m
> + dec row
> + jmp _loop_row
As before, loop conditions should generally be at the end, not the
start; it avoids a wasteful jump.
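e.g. a single guard before the loop plus a dec/jnz at the bottom keeps
the zero-count case safe, roughly (untested):

    test    row, row
    jz      _end_row
_loop_row:
    ...                 ; column loop body
    add     src, r1m
    add     dst, r3m
    dec     row
    jnz     _loop_row
_end_row: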
Jason