[x265] [PATCH] intrinsic: Added dct16 sse3 intrinsic, 55288.92 -> 45139.28
dave
dtyx265 at gmail.com
Wed Feb 11 20:05:40 CET 2015
On 02/11/2015 10:47 AM, Steve Borho wrote:
> On 02/11, dave wrote:
>> 55288.92 is the c code and 45139.28 the intrinsic, that is if I am reading
>> the testbench output correctly.
>>
>> dct16x16 1.22x 45139.28 55288.92
>>
>> My system is old so these numbers are probably large compared to testing on
>> a newer system.
> With this patch applied I get this on a Haswell laptop:
>
> $ ./test/TestBench --test transforms --cpu AVX2 | grep dct16x16
> dct16x16 6.75x 9536.53 64405.30
> $ ./test/TestBench --test transforms --cpu SSSE3 | grep dct16x16
> dct16x16 4.47x 8275.59 36993.45
> $ ./test/TestBench --test transforms --cpu SSE3 | grep dct16x16
> dct16x16 3.23x 11384.22 36786.55
>
> It is extraordinarily interesting how much slower the C reference
> becomes when I enable the AVX2 primitives. And how the AVX2 16x16 dct is
> slower than the SSSE3 intrinsic version when compared directly using
> rdtsc.
>
> On this CPU the SSE3 version shows ok, but this CPU will never run the
> SSE3 version.
>
> The basic question is whether there are CPUs (that people generally use)
> that this will help enough to warrant keeping two intrinsic versions of
> the functions around in perpetuity.
>
I did the SSE2 idct8 asm a while ago because it was mentioned on this
mailing list that there are a decent number of multicore AMD (Athlon?)
systems out there that only support up to SSE3. This patch was just a
follow-up, but since 1.2x on my system isn't that great an improvement,
I was planning to look into improving it (and dct8) further, most likely
in assembler, but only if it's really needed. Otherwise, I will work on
something else...
Also, since SSE3 consists mostly of floating-point enhancements that
aren't used in dct/idct, the SSE3 intrinsic and assembler transforms
are effectively SSE2.
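
To illustrate the SSE2 point, here is a minimal sketch of the kind of
multiply-accumulate, round and shift step an integer DCT pass is built
from. It is not code from the patch: madd_round_shift is a hypothetical
helper name, and the coefficients in the note below are made up for
illustration. It includes only emmintrin.h, so every intrinsic in it is
SSE2.

```c
#include <emmintrin.h>  /* SSE2 only; no pmmintrin.h (SSE3) needed */
#include <stdint.h>

/* Hypothetical helper, not taken from the patch.
 * For each adjacent pair of 16-bit samples, compute
 *   (src[2i]*coef[2i] + src[2i+1]*coef[2i+1] + round) >> shift
 * and store the four 32-bit results in dst. Requires shift >= 1. */
static void madd_round_shift(const int16_t src[8], const int16_t coef[8],
                             int shift, int32_t dst[4])
{
    __m128i s   = _mm_loadu_si128((const __m128i *)src);
    __m128i c   = _mm_loadu_si128((const __m128i *)coef);
    __m128i sum = _mm_madd_epi16(s, c);               /* pmaddwd */
    __m128i rnd = _mm_set1_epi32(1 << (shift - 1));   /* rounding offset */
    __m128i cnt = _mm_cvtsi32_si128(shift);           /* variable shift count */

    sum = _mm_sra_epi32(_mm_add_epi32(sum, rnd), cnt); /* arithmetic shift */
    _mm_storeu_si128((__m128i *)dst, sum);
}
```

With src = {1,2,3,4,5,6,7,8}, coef = {64,64,64,-64,64,64,64,-64} and
shift = 6, dst comes out as {3,-1,11,-1}. Nothing here touches the
horizontal-add or other floating-point instructions that SSE3 added.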