[x265] [PATCH] intrinsic: Added dct16 sse3 intrinsic, 55288.92 -> 45139.28

Wed Feb 11 20:05:40 CET 2015

On 02/11/2015 10:47 AM, Steve Borho wrote:
> On 02/11, dave wrote:
>> 55288.92 is the c code and 45139.28 the intrinsic, that is if I am reading
>> the testbench output correctly.
>>
>> dct16x16        1.22x      45139.28      55288.92
>>
>> My system is old so these numbers are probably large compared to testing on
>> a newer system.
> With this patch applied I get this on a Haswell laptop:
>
> $ ./test/TestBench --test transforms --cpu AVX2 | grep dct16x16
> dct16x16        6.75x    9536.53     64405.30
> $ ./test/TestBench --test transforms --cpu SSSE3 | grep dct16x16
> dct16x16        4.47x    8275.59     36993.45
> $ ./test/TestBench --test transforms --cpu SSE3 | grep dct16x16
> dct16x16        3.23x    11384.22    36786.55
>
> It is extraordinarily interesting how much slower the C reference
> becomes when I enable the AVX2 primitives. And how the AVX2 16x16 dct is
> slower than the SSSE3 intrinsic version when compared directly using
> rdtsc.
>
> On this CPU the SSE3 version shows ok, but this CPU will never run the
> SSE3 version.
>
> The basic question is whether there are CPUs (that people generally use)
> that this will help enough to warrant keeping two intrinsic versions of
> the functions around in perpetuity.
>
I wrote the SSE2 idct8 asm a while ago because it was mentioned on this
mailing list that there are a decent number of multicore AMD (Athlon?)
systems out there that only support up to SSE3.  This patch was just a
follow-up, but since 1.2x on my system isn't much of an improvement, I
was planning to look into improving it (and dct8) further, most likely
in assembler, but only if it's really needed.  Otherwise, I will work on
something else...
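
For anyone else unsure how to read the testbench lines quoted above, here
is a minimal sketch of the arithmetic. It assumes the columns are speedup,
optimized cycles, C cycles, in that order, which is what the quoted numbers
suggest. The function name is made up for illustration and is not part of
x265.

```c
#include <assert.h>

/* Hypothetical helper, not part of x265: speedup is taken to be the
 * C reference cycle count divided by the optimized primitive's count. */
double speedup(double optimized_cycles, double c_cycles)
{
    assert(optimized_cycles > 0.0);  /* guard against division by zero */
    return c_cycles / optimized_cycles;
}
```

With my numbers, speedup(45139.28, 55288.92) comes out at roughly 1.22,
which matches the first column of the dct16x16 line.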

Also, since SSE3 consists mostly of floating-point enhancements, which
aren't used in dct/idct, the SSE3 intrinsic and assembler transforms are
effectively SSE2.
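
To illustrate that point, here is a small sketch (not the code from the
patch) of the kind of 16-bit multiply-accumulate step an integer transform
is built from. It needs nothing beyond the SSE2 header emmintrin.h. The
function names and the scalar fallback are my own, added only so the
example also builds on a non-x86 machine.

```c
#include <assert.h>
#include <stdint.h>

#if defined(__SSE2__) || defined(_M_X64) || defined(_M_AMD64)
#include <emmintrin.h>   /* SSE2 only; no SSE3 header (pmmintrin.h) needed */
#define HAVE_SSE2 1
#endif

/* Scalar reference: out[i] = a[2i]*b[2i] + a[2i+1]*b[2i+1], in 32 bits. */
void madd_pairs_ref(const int16_t a[8], const int16_t b[8], int32_t out[4])
{
    for (int i = 0; i < 4; i++)
        out[i] = (int32_t)a[2 * i] * b[2 * i]
               + (int32_t)a[2 * i + 1] * b[2 * i + 1];
}

/* Same operation with one SSE2 pmaddwd when SSE2 is available. */
void madd_pairs(const int16_t a[8], const int16_t b[8], int32_t out[4])
{
    assert(a && b && out);
#ifdef HAVE_SSE2
    __m128i va = _mm_loadu_si128((const __m128i *)a);
    __m128i vb = _mm_loadu_si128((const __m128i *)b);
    /* Multiply 16-bit lanes and add adjacent products into 32-bit lanes. */
    _mm_storeu_si128((__m128i *)out, _mm_madd_epi16(va, vb));
#else
    madd_pairs_ref(a, b, out);
#endif
}
```

Loads, stores and _mm_madd_epi16 all come from SSE2, so a transform built
from steps like this should run on any CPU with SSE2, whatever the
primitive is labelled.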