[x264-devel] Re: Speed-up method
Alex Izvorski
aizvorski at gmail.com
Sat Mar 3 10:11:09 CET 2007
On Fri, 2007-03-02 at 19:00 -0700, Loren Merritt wrote:
> On Sat, 3 Mar 2007, telomere 1 wrote:
>
> > QX6700
> > 27.9GFlops
> >
> > Geforce 8800GTX
> > (MADD(2flops) + MUL(1flop)) * 1,350 MHz * 128 SPs = 518.4G Flops
>
> QX6700
> 2.67 GHz * 4 cpus * (1 or 2 sse op/cycle) * (8 int16 or 16 int8 in a sse reg)
> = 85 to 341 GIPS
>
> Though x264 at hq settings spends about 60% of its time in simd code, so
> the real estimated improvement is a factor of 2.5 for offloading all of
> that to the gpu.
That's a conservative estimate: the PSADBW instruction does 16 integer
subtracts, 16 absolute value calculations and 14 adds (7 for each 8-byte
half) - in one clock cycle ;) so the theoretical peak is 46 ops/clock, or
46 * 2.67 GHz * 4 cores = 491 GIPS. In practice, sad16x16 (one of the
most common functions) requires at least 3*16*16-1 = 767 integer
operations and runs in ~42 clocks on Core2, giving an actual average of
18 integer ops/clock (variable width: they start as 8-bit and end up as
16-bit). So at least in that part of the code x264 hits
18 * 2.67 * 4 = ~192 GIPS on a QX6700.
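
To make the counting above concrete, here is a rough scalar sketch in C.
It is only an illustration of where the 767 comes from, not x264's actual
code: the names sad16x16_ref, sad_min_ops and gips are made up for this
example, and the real routine uses PSADBW rather than a scalar loop.

```c
#include <stdint.h>
#include <stdlib.h>

/* Scalar reference SAD over a 16x16 block (hypothetical helper,
 * not the x264 implementation). Each pixel costs one subtract,
 * one absolute value and one add into the running sum. */
static unsigned sad16x16_ref(const uint8_t *a, int stride_a,
                             const uint8_t *b, int stride_b)
{
    unsigned sum = 0;
    for (int y = 0; y < 16; y++) {
        for (int x = 0; x < 16; x++)
            sum += (unsigned)abs(a[x] - b[x]);
        a += stride_a;
        b += stride_b;
    }
    return sum;
}

/* Minimum operation count for a w x h SAD:
 * w*h subtracts + w*h abs + (w*h - 1) adds to reduce to one value. */
static int sad_min_ops(int w, int h)
{
    return 3 * w * h - 1;
}

/* Aggregate throughput in GIPS: ops per clock * clock in GHz * cores. */
static double gips(double ops_per_clock, double ghz, int cores)
{
    return ops_per_clock * ghz * cores;
}
```

With these, sad_min_ops(16, 16) gives 767, 767 / 42 rounds down to 18
ops/clock, and gips(18, 2.67, 4) and gips(46, 2.67, 4) reproduce the
~192 and ~491 figures.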
Nevertheless, I am very interested in how one would program motion
estimation on an nvidia GPU. I have seen papers (e.g.
http://numod.ins.uni-bonn.de/research/papers/public/StGa04motion.pdf and
more on gpgpu.org), but no actual code. Anyone?
Does the Geforce do anything integer-based? Would the data be processed
as vertices or textures? Is the Geforce Gflops calculation just for the
vertex data?
From telomere's post, it seems like the Geforce has considerable
multiplication resources (one MADD and one MUL in the same cycle). I
can't think offhand how to really use that to good effect in motion
estimation: perhaps in Hadamard, if it turns out multiply by +1/-1 and
add is faster than the combination of add/subtract/permute?
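
To spell out the two alternatives for a 4-point Hadamard, here is a small
sketch in C. It is an illustration only: the function names and the H4
table are invented for this example, integers stand in for whatever type
the GPU would actually use, and it says nothing about which form is
faster on real hardware.

```c
#include <stdint.h>

/* 4-point Hadamard using add/subtract butterflies:
 * 8 adds/subtracts in two stages. */
static void hadamard4_butterfly(const int in[4], int out[4])
{
    int s01 = in[0] + in[1], d01 = in[0] - in[1];
    int s23 = in[2] + in[3], d23 = in[2] - in[3];
    out[0] = s01 + s23;
    out[1] = d01 + d23;
    out[2] = s01 - s23;
    out[3] = d01 - d23;
}

/* The same transform written as a matrix of +1/-1 coefficients,
 * rows ordered to match the butterfly outputs above. */
static const int H4[4][4] = {
    { 1,  1,  1,  1 },
    { 1, -1,  1, -1 },
    { 1,  1, -1, -1 },
    { 1, -1, -1,  1 },
};

/* 4-point Hadamard as 16 multiply-accumulates by +1/-1,
 * the form that would map onto MADD units. */
static void hadamard4_madd(const int in[4], int out[4])
{
    for (int i = 0; i < 4; i++) {
        int acc = 0;
        for (int j = 0; j < 4; j++)
            acc += H4[i][j] * in[j];
        out[i] = acc;
    }
}
```

Both give the same result; the butterfly form needs 8 operations per
4 points and the multiply-add form needs 16, so the latter only pays off
if multiply-adds are at least twice as cheap or the permutes dominate.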
I am noticing a certain trend here: every time it seems like something
could be done more efficiently on specialized hardware, general-purpose
hardware catches up. The last one was the Cell processor, not so
attractive now that one can put two X5355's in one box ;) Yes, the Cell
is still faster in theory than two X5355's. In practice though...
--Alex