[x264-devel] Re: Speed-up method

Alex Izvorski aizvorski at gmail.com
Sat Mar 3 10:11:09 CET 2007


On Fri, 2007-03-02 at 19:00 -0700, Loren Merritt wrote:
> On Sat, 3 Mar 2007, telomere 1 wrote:
> 
> > QX6700
> > 27.9GFlops
> >
> > Geforce 8800GTX
> > (MADD(2flops) + MUL(1flop)) * 1,350 MHz * 128 SPs = 518.4G Flops
> 
> QX6700
> 2.67 GHz * 4 cpus * (1 or 2 sse op/cycle) * (8 int16 or 16 int8 in a sse reg)
> = 85 to 341 GIPS
> 
> Though x264 at hq settings spends about 60% of its time in simd code, so 
> the real estimated improvement is a factor of 2.5 for offloading all of 
> that to the gpu.

That's a conservative estimate: the PSADBW instruction does 16 integer
subtracts, 16 absolute value calculations and 14 adds - in one clock
cycle ;) so the theoretical peak is 46 ops/clock per core, or
46 * 2.67 GHz * 4 cores = 491 GIPS.  In practice, sad16x16 (one of the
most common functions) requires at least 3*16*16-1 = 767 integer
operations and runs in ~42 clocks on Core2, giving an actual average of
767/42 = ~18 integer ops/clock (variable width: they start as 8-bit and
end up as 16-bit).  So at least in that part of the code x264 hits
18 * 2.67 GHz * 4 cores = ~192 GIPS on a QX6700.
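
For anyone who wants to check the arithmetic, here is a rough Python
sketch of the figures in this thread.  The clock speed, core count,
cycle count and 60% SIMD share are the numbers quoted above; the
function names are made up for illustration, and the scalar SAD is a
plain reference used only to count operations, not the x264 routine.

```python
# Back-of-the-envelope check of the throughput figures in this thread.
# Inputs are the numbers quoted above; function names are illustrative only.

CLOCK_GHZ = 2.67   # QX6700 clock
CORES = 4          # QX6700 core count


def gips(ops_per_clock, clock_ghz=CLOCK_GHZ, cores=CORES):
    """Integer ops per second, in billions, summed over all cores."""
    return ops_per_clock * clock_ghz * cores


def sad_with_op_count(a, b):
    """Scalar reference SAD over two equal-sized blocks (lists of rows).

    Returns (sad, ops), counting one op per subtract, one per absolute
    value, and one per add needed to sum the differences.
    """
    diffs = [abs(x - y) for ra, rb in zip(a, b) for x, y in zip(ra, rb)]
    n = len(diffs)
    ops = n + n + (n - 1)   # n subtracts, n abs, n-1 adds
    return sum(diffs), ops


def amdahl_speedup(fraction_offloaded):
    """Upper bound on speedup if the offloaded part took zero time."""
    return 1.0 / (1.0 - fraction_offloaded)


# PSADBW as described above: 16 subtracts + 16 abs + 14 adds per clock.
psadbw_ops = 16 + 16 + 14
peak_gips = gips(psadbw_ops)                  # about 491

# sad16x16 on two dummy blocks, just to get the operation count.
cur = [[(x + y) % 256 for x in range(16)] for y in range(16)]
ref = [[(x * y) % 256 for x in range(16)] for y in range(16)]
sad, ops = sad_with_op_count(cur, ref)        # ops is 3*16*16 - 1
ops_per_clock = ops / 42                      # ~42 clocks on Core2
measured_gips = gips(int(ops_per_clock))      # about 192

# 60% of run time in SIMD code, all of it offloaded at zero cost.
max_speedup = amdahl_speedup(0.60)
```

The last line is the factor-of-2.5 bound from the quoted message: even
an infinitely fast GPU only removes the 60% that is SIMD code.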

Nevertheless, I am very interested in how one would program motion
estimation on an nvidia GPU.  I have seen papers (e.g.
http://numod.ins.uni-bonn.de/research/papers/public/StGa04motion.pdf and
more on gpgpu.org), but no actual code.  Anyone?

Does the Geforce do anything integer-based?  Would the data be processed
as vertices or as textures?  Is the Geforce Gflops calculation just for
the vertex data?

From telomere's post, it seems like the Geforce has considerable
multiplication resources (one MADD and one MUL in the same cycle).  I
can't think offhand how to really use that to good effect in motion
estimation: perhaps in Hadamard, if it turns out multiply by +1/-1 and
add is faster than the combination of add/subtract/permute?
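
To make the Hadamard question concrete, here is a small Python sketch
of a 4-point Hadamard transform done both ways: as a product with a
+1/-1 matrix (the multiply-and-add form) and as add/subtract
butterflies.  This is only an illustration under my own assumptions:
the names are hypothetical, the row ordering of the matrix is one of
several possible, and the SATD here is unnormalized, so it is not
meant to match x264's code.

```python
# 4-point Hadamard transform two ways. A sketch for comparison only;
# neither version is taken from x264.

H4 = [
    [1,  1,  1,  1],
    [1,  1, -1, -1],
    [1, -1, -1,  1],
    [1, -1,  1, -1],
]


def hadamard4_matrix(v):
    """Multiply by H4 directly: 16 multiplies and 12 adds (or 16 MADDs)."""
    return [sum(h * x for h, x in zip(row, v)) for row in H4]


def hadamard4_butterfly(v):
    """Same transform with two butterfly stages: 8 adds/subtracts total."""
    s01, d01 = v[0] + v[1], v[0] - v[1]
    s23, d23 = v[2] + v[3], v[2] - v[3]
    return [s01 + s23, s01 - s23, d01 - d23, d01 + d23]


def satd4x4(diff, transform):
    """Unnormalized sum of absolute transformed differences, 4x4 block.

    Applies the 1-D transform to every row, then to every column of
    the result, and sums the absolute values of the coefficients.
    """
    rows = [transform(r) for r in diff]
    cols = [transform([rows[i][j] for i in range(4)]) for j in range(4)]
    return sum(abs(x) for c in cols for x in c)


# A dummy 4x4 block of pixel differences.
diff = [
    [ 3, -1,  4,  1],
    [-5,  9,  2, -6],
    [ 5,  3, -5,  8],
    [ 9, -7,  9,  3],
]
same = satd4x4(diff, hadamard4_matrix) == satd4x4(diff, hadamard4_butterfly)
```

Counting operations, the matrix form costs 16 multiply-adds per 4-point
transform against 8 adds/subtracts for the butterflies, so the MADD
route would presumably only win if a multiply-add is much cheaper than
an add/subtract plus whatever permutes the butterflies need.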

I am noticing a certain trend here: every time it seems like something
could be done more efficiently on specialized hardware, general-purpose
hardware catches up.  The last one was the Cell processor, not so
attractive now that one can put two X5355s in one box ;)  Yes, the Cell
is still faster in theory than two X5355s.  In practice though...

--Alex

