[x265] [PATCH] avx2: 'integral4v' asm code -> 7.48x faster than 'C' version

Mon May 8 13:36:24 CEST 2017

Hello Praveen Tiwari,

Just out of curiosity: when comparing your code's performance with the
plain C version, did you give the compiler a chance to vectorize the
code itself?
Such a trivial loop should not be difficult for the compiler to handle,
I think...
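
To make the question concrete, here is a minimal sketch of the loop as I
read it from the asm in the quoted patch. The function name
integral4v_sketch and the exact loop shape are assumptions reconstructed
from the patch, not copied from the x265 C source:

```c
#include <stdint.h>

/* Hypothetical C equivalent of the AVX2 loop in the quoted patch:
 * each of the first `stride` entries is replaced by the difference
 * between the entry four rows below it and itself.
 * (r2 = stride << 4 bytes in the asm, i.e. 4 * stride uint32_t elements.)
 * The caller must provide at least 5 * stride elements. */
void integral4v_sketch(uint32_t *sum4, intptr_t stride)
{
    for (intptr_t x = 0; x < stride; x++)
        sum4[x] = sum4[x + 4 * stride] - sum4[x];
}
```

Built with something like "gcc -O3 -mavx2", a loop of this shape is a
plausible candidate for auto-vectorization, but whether the compiler
actually vectorizes it, and how close it gets to the hand-written
version, depends on the compiler and flags and would need to be measured.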

Cheers,

Guillaume

On Mon, May 8, 2017 at 6:31 AM,  <praveen at multicorewareinc.com> wrote:
> # HG changeset patch
> # User Praveen Tiwari <praveen at multicorewareinc.com>
> # Date 1493905428 -19800
> #      Thu May 04 19:13:48 2017 +0530
> # Node ID 41611825c2f4661536500e1306db7d8c4bf7fd07
> # Parent  48502979a4b21f6982dcdacbf7796bf5d9fb395c
> avx2: 'integral4v' asm code -> 7.48x faster than 'C' version
>
>    integral_init4v  7.48x    202.53          1515.14
>
> diff -r 48502979a4b2 -r 41611825c2f4 source/common/x86/seaintegral.asm
> --- a/source/common/x86/seaintegral.asm Wed May 03 11:26:26 2017 +0530
> +++ b/source/common/x86/seaintegral.asm Thu May 04 19:13:48 2017 +0530
> @@ -32,8 +32,19 @@
>  ;void integral_init4v_c(uint32_t *sum4, intptr_t stride)
>  ;-----------------------------------------------------------------------------
>  INIT_YMM avx2
> -cglobal integral4v, 2, 2, 0
> -
> +cglobal integral4v, 2, 3, 2
> +    mov r2, r1
> +    shl r2, 4
> +
> +.loop
> +    movu    m0, [r0]
> +    movu    m1, [r0 + r2]
> +    psubd   m1, m0
> +    movu    [r0], m1
> +    add     r0, 32
> +    sub     r1, 8
> +    cmp     r1, 0
> +    jnz     .loop
>      RET
>
>  ;-----------------------------------------------------------------------------

-- 
Wearing a Rolex is like driving an Audi: It says you've got some
money, but nothing to say.
John Lefèvre