[x265] [PATCH] sse_pp16: more than +1x performance improvement for all versions

Steve Borho steve at borho.org
Tue Aug 6 19:56:33 CEST 2013


On Tue, Aug 6, 2013 at 7:38 AM, <praveen at multicorewareinc.com> wrote:

> # HG changeset patch
> # User praveentiwari
> # Date 1375792722 -19800
> # Node ID 397c2cbccec538701a4f3898e1dba7c9276112f0
> # Parent  b2242ff16d1fa4d4cdc44b5a94a94fd331bb9191
> sse_pp16: more than +1x performance improvement for all versions.
>

_mm_cvtepu8_epi16() maps to pmovzxbw, which is an SSE4.1 instruction.  When
you replace vector-class code with raw intrinsics, you need to check for
intrinsics like this one and wrap the code in compilation guards for the
minimum SIMD level it requires.  Unguarded, they break the GCC builds.
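
A minimal sketch of what such a guard could look like, not the actual fix
applied to x265.  It assumes GCC/Clang's predefined __SSE4_1__ macro (MSVC
does not define it, and the macro the x265 build really uses may differ).
The helper names widen_lo_u8, widen_hi_u8, sse_pp16_sketch and sse_ref are
hypothetical, and pixel is assumed to be 8-bit.  Without SSE4.1 the
zero-extension falls back to an SSE2 unpack against zero:

```cpp
#include <cstdint>
#include <emmintrin.h>           // SSE2 intrinsics (baseline on x86-64)
#if defined(__SSE4_1__)          // assumption: GCC/Clang predefined macro
#include <smmintrin.h>           // SSE4.1: _mm_cvtepu8_epi16 (pmovzxbw)
#endif

typedef uint8_t pixel;           // assumption: 8-bit pixels

// Zero-extend the low 8 bytes of v into eight 16-bit lanes.
static inline __m128i widen_lo_u8(__m128i v)
{
#if defined(__SSE4_1__)
    return _mm_cvtepu8_epi16(v);                       // needs SSE4.1
#else
    return _mm_unpacklo_epi8(v, _mm_setzero_si128());  // SSE2 equivalent
#endif
}

// Zero-extend the high 8 bytes of v; a plain SSE2 unpack is enough here.
static inline __m128i widen_hi_u8(__m128i v)
{
    return _mm_unpackhi_epi8(v, _mm_setzero_si128());
}

// Sum of squared differences over a block 16 pixels wide and ly rows high.
template<int ly>
int sse_pp16_sketch(const pixel* org, intptr_t strideOrg,
                    const pixel* cur, intptr_t strideCur)
{
    __m128i sum = _mm_setzero_si128();

    for (int rows = ly; rows != 0; rows--)
    {
        __m128i m1 = _mm_loadu_si128((const __m128i*)org);
        __m128i n1 = _mm_loadu_si128((const __m128i*)cur);

        // madd squares each 16-bit difference and adds adjacent pairs
        // into 32-bit lanes.
        __m128i diff = _mm_sub_epi16(widen_lo_u8(m1), widen_lo_u8(n1));
        sum = _mm_add_epi32(sum, _mm_madd_epi16(diff, diff));

        diff = _mm_sub_epi16(widen_hi_u8(m1), widen_hi_u8(n1));
        sum = _mm_add_epi32(sum, _mm_madd_epi16(diff, diff));

        org += strideOrg;
        cur += strideCur;
    }

    // Horizontal add of the four 32-bit lanes.
    sum = _mm_add_epi32(sum, _mm_srli_si128(sum, 8));
    sum = _mm_add_epi32(sum, _mm_srli_si128(sum, 4));
    return _mm_cvtsi128_si32(sum);
}

// Scalar reference, useful for checking either code path.
static int sse_ref(const pixel* org, intptr_t strideOrg,
                   const pixel* cur, intptr_t strideCur, int rows)
{
    int sum = 0;
    for (int y = 0; y < rows; y++, org += strideOrg, cur += strideCur)
        for (int x = 0; x < 16; x++)
        {
            int d = (int)org[x] - (int)cur[x];
            sum += d * d;
        }
    return sum;
}
```

Built without -msse4.1, GCC takes the SSE2 branch, so the same source
compiles at either SIMD level and should give identical results.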


>
> diff -r b2242ff16d1f -r 397c2cbccec5 source/common/vec/sse.inc
> --- a/source/common/vec/sse.inc Tue Aug 06 17:48:04 2013 +0530
> +++ b/source/common/vec/sse.inc Tue Aug 06 18:08:42 2013 +0530
> @@ -113,25 +113,34 @@
>  int sse_pp16(pixel* Org, intptr_t strideOrg, pixel* Cur, intptr_t strideCur)
>  {
>      int rows = ly;
> -    Vec16uc m1, n1;
> +    __m128i sum = _mm_set1_epi32(0);
>
> -    Vec8us diff_low(0), diff_high(0);
> -    Vec4i sum_low(0), sum_high(0);
>      for (; rows != 0; rows--)
>      {
> -        m1.load(Org);
> -        n1.load(Cur);
> -        diff_low = extend_low(m1) - extend_low(n1);
> -        diff_high = extend_high(m1) - extend_high(n1);
> -        diff_low = diff_low * diff_low;
> -        diff_high = diff_high * diff_high;
> -        sum_low += (extend_low(diff_low) + extend_low(diff_high));
> -        sum_high += (extend_high(diff_low) + extend_high(diff_high));
> +        __m128i m1 = _mm_loadu_si128((__m128i const*)(Org));
> +        __m128i n1 = _mm_loadu_si128((__m128i const*)(Cur));
> +
> +        __m128i m1lo = _mm_cvtepu8_epi16(m1);
> +        __m128i m1hi = _mm_srli_si128(m1, 8);
> +        m1hi = _mm_cvtepu8_epi16(m1hi);
> +
> +        __m128i n1lo = _mm_cvtepu8_epi16(n1);
> +        __m128i n1hi = _mm_srli_si128(n1, 8);
> +        n1hi = _mm_cvtepu8_epi16(n1hi);
> +
> +        __m128i diff = _mm_sub_epi16(m1lo, n1lo);
> +        sum = _mm_add_epi32(sum, _mm_madd_epi16(diff, diff));
> +
> +        diff = _mm_sub_epi16(m1hi, n1hi);
> +        sum = _mm_add_epi32(sum, _mm_madd_epi16(diff, diff));
> +
>          Org += strideOrg;
>          Cur += strideCur;
>      }
>
> -    return horizontal_add(sum_low) + horizontal_add(sum_high);
> +    sum = _mm_add_epi32(sum, _mm_srli_si128(sum, 8));
> +    sum = _mm_add_epi32(sum, _mm_srli_si128(sum, 4));
> +    return _mm_cvtsi128_si32(sum);
>  }
>
>  template<int ly>



-- 
Steve Borho