[x265] [PATCH] intrinsic: Added dct16 sse3 intrinsic, 55288.92 -> 45139.28
Steve Borho
steve at borho.org
Wed Feb 11 19:47:02 CET 2015
On 02/11, dave wrote:
> 55288.92 is the c code and 45139.28 the intrinsic, that is if I am reading
> the testbench output correctly.
>
> dct16x16 1.22x 45139.28 55288.92
>
> My system is old so these numbers are probably large compared to testing on
> a newer system.
With this patch applied I get this on a Haswell laptop:
$ ./test/TestBench --test transforms --cpu AVX2 | grep dct16x16
dct16x16 6.75x 9536.53 64405.30
$ ./test/TestBench --test transforms --cpu SSSE3 | grep dct16x16
dct16x16 4.47x 8275.59 36993.45
$ ./test/TestBench --test transforms --cpu SSE3 | grep dct16x16
dct16x16 3.23x 11384.22 36786.55
It is extraordinarily interesting how much slower the C reference
becomes when I enable the AVX2 primitives (64405.30 vs. about 36990),
and that the AVX2 16x16 dct (9536.53) is slower than the SSSE3 intrinsic
version (8275.59) when compared directly using rdtsc.
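
For anyone wanting to re-check the arithmetic, here is a minimal sketch that
recomputes the speedup column from the rows pasted above. It assumes the
column order is name, speedup, optimized cycles, C reference cycles, which
is how dave read the output; the helper names are made up for illustration
and are not part of the testbench.

```python
# Rows copied verbatim from the TestBench output in this message.
# Assumption: columns are name, speedup, optimized cycles, C reference cycles.
RUNS = {
    "AVX2": "dct16x16 6.75x 9536.53 64405.30",
    "SSSE3": "dct16x16 4.47x 8275.59 36993.45",
    "SSE3": "dct16x16 3.23x 11384.22 36786.55",
}


def parse_row(row):
    """Split one result row into (name, reported speedup, optimized, C ref)."""
    name, speedup, optimized, c_ref = row.split()
    return name, float(speedup.rstrip("x")), float(optimized), float(c_ref)


def recomputed_speedup(row):
    """Speedup as C reference cycles divided by optimized cycles."""
    _, _, optimized, c_ref = parse_row(row)
    return c_ref / optimized


if __name__ == "__main__":
    for cpu, row in RUNS.items():
        _, reported, optimized, c_ref = parse_row(row)
        print(f"{cpu:5s} reported {reported:.2f}x "
              f"recomputed {recomputed_speedup(row):.2f}x "
              f"optimized {optimized:.2f} C {c_ref:.2f}")
```

The ratio in the AVX2 run is inflated by the slower C reference in that
run, so the optimized column is the more direct thing to compare between
runs.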
On this CPU the SSE3 version looks OK, but this CPU will never run the
SSE3 version, since it supports SSSE3 and AVX2.
The basic question is whether there are CPUs in general use that this
helps enough to warrant keeping two intrinsic versions of the functions
around in perpetuity.
--
Steve Borho