[x265] [PATCH] intrinsic: Added dct16 sse3 intrinsic, 55288.92 -> 45139.28

Steve Borho steve at borho.org
Wed Feb 11 19:47:02 CET 2015


On 02/11, dave wrote:
> 55288.92 is the c code and 45139.28 the intrinsic, that is if I am reading
> the testbench output correctly.
> 
> dct16x16        1.22x      45139.28      55288.92
> 
> My system is old so these numbers are probably large compared to testing on
> a newer system.

With this patch applied I get this on a Haswell laptop:

$ ./test/TestBench --test transforms --cpu AVX2 | grep dct16x16
dct16x16        6.75x    9536.53     64405.30
$ ./test/TestBench --test transforms --cpu SSSE3 | grep dct16x16
dct16x16        4.47x    8275.59     36993.45
$ ./test/TestBench --test transforms --cpu SSE3 | grep dct16x16
dct16x16        3.23x    11384.22    36786.55

It is interesting how much slower the C reference becomes when I enable
the AVX2 primitives (64405.30 vs. roughly 36990 in the other two runs),
and that the AVX2 16x16 dct is slower than the SSSE3 intrinsic version
(9536.53 vs. 8275.59) when the two are compared directly using rdtsc.
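
For anyone wanting to double-check how the columns are read, here is a
rough sketch that recomputes the speedup from the three runs above. It
assumes the column order is name, speedup, optimized count, C reference
count (which matches dave's reading of his own numbers). The CPU label
in front of each row is something I added from the --cpu flag; it is not
part of the TestBench output.

```python
# Rows copied from the TestBench runs above. The leading CPU label is
# added here for bookkeeping and is NOT printed by TestBench.
# Assumed column order: name, speedup, optimized count, C reference count.
ROWS = """\
AVX2   dct16x16  6.75x   9536.53  64405.30
SSSE3  dct16x16  4.47x   8275.59  36993.45
SSE3   dct16x16  3.23x  11384.22  36786.55
"""


def parse(rows):
    """Return {cpu: (printed_speedup, optimized, c_reference)}."""
    out = {}
    for line in rows.splitlines():
        cpu, _name, speedup, opt, ref = line.split()
        out[cpu] = (float(speedup.rstrip("x")), float(opt), float(ref))
    return out


results = parse(ROWS)

for cpu, (speedup, opt, ref) in results.items():
    # If the assumed column order is right, ref / opt should reproduce
    # the printed speedup to two decimals.
    print(f"{cpu:6s} ref/opt = {ref / opt:.2f} (printed {speedup:.2f})")

# The speedup column is relative to each run's own C reference, so the
# raw optimized counts are the fairer cross-run comparison.
fastest = min(results, key=lambda cpu: results[cpu][1])
print("lowest optimized count:", fastest)
```

The point of comparing the raw optimized counts rather than the speedup
column is that the C reference itself changes between runs, so the
largest multiplier does not necessarily mean the fastest primitive.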

The SSE3 version looks fine on this CPU, but this CPU will never
actually run it, since it also supports SSSE3 and AVX2.

The basic question is whether there are CPUs in common use that this
helps enough to warrant keeping two intrinsic versions of the functions
around in perpetuity.

-- 
Steve Borho

