[x264-devel] Re: Speed-up method

Radek Czyz radoslaw at syskin.cjb.net
Sat Mar 3 10:23:29 CET 2007


Perhaps worth mentioning: I actually have an 8800 GTS and, as you might 
guess, I'm interested in trying ME on it as well :)

Radek

Alex Izvorski wrote:
> On Fri, 2007-03-02 at 19:00 -0700, Loren Merritt wrote:
>> On Sat, 3 Mar 2007, telomere 1 wrote:
>>
>>> QX6700
>>> 27.9GFlops
>>>
>>> Geforce 8800GTX
>>> (MADD(2flops) + MUL(1flop)) * 1,350 MHz * 128 SPs = 518.4G Flops
>> QX6700
>> 2.67 GHz * 4 cpus * (1 or 2 sse op/cycle) * (8 int16 or 16 int8 in a sse reg)
>> = 85 to 341 GIPS
>>
>> Though x264 at hq settings spends about 60% of its time in simd code, so 
>> the real estimated improvement is a factor of 2.5 for offloading all of 
>> that to the gpu.
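
For reference, the arithmetic behind the figures quoted above can be reproduced in a few lines. This is only a sketch: the function names are made up for illustration, and the 2.5x figure is the Amdahl's-law upper bound, i.e. it assumes the offloaded 60% takes no time at all on the GPU and ignores transfer overhead.

```python
def peak_gips(ghz, cores, simd_ops_per_cycle, lanes):
    """Peak integer throughput in giga-ops/s for a SIMD CPU (illustrative helper)."""
    return ghz * cores * simd_ops_per_cycle * lanes


def amdahl_speedup(accelerated_fraction, factor):
    """Overall speedup when `accelerated_fraction` of the run time gets `factor` times faster."""
    return 1.0 / ((1.0 - accelerated_fraction) + accelerated_fraction / factor)


# QX6700: 2.67 GHz, 4 cores, 1 or 2 SSE ops per cycle, 8 int16 or 16 int8 lanes.
cpu_low = peak_gips(2.67, 4, 1, 8)
cpu_high = peak_gips(2.67, 4, 2, 16)

# 8800 GTX: (MADD = 2 flops + MUL = 1 flop) * 1.35 GHz * 128 SPs.
gpu_gflops = (2 + 1) * 1.35 * 128

# 60% of the run time is SIMD code; an infinitely fast GPU gives the upper bound.
upper_bound = amdahl_speedup(0.60, float("inf"))
# A hypothetical, finite 10x gain on the SIMD part, for comparison only.
with_10x = amdahl_speedup(0.60, 10.0)

print(f"CPU peak: {cpu_low:.1f} to {cpu_high:.1f} GIPS")
print(f"GPU peak: {gpu_gflops:.1f} GFlops")
print(f"Speedup bound: {upper_bound:.2f}x, with a 10x GPU: {with_10x:.2f}x")
```

The 10x case is an arbitrary example value, not a measurement; it is there to show how far below the bound a finite GPU speedup lands.
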
> 
> That's a conservative estimate: the PSADBW instruction does 16 integer
> subtracts, 16 absolute value calculations and 14 adds - in one clock
> cycle ;) so the theoretical peak is 46 ops/clock or 491 GIPS.  In
> practice, sad16x16 (one of the most common functions) requires at least
> 3*16*16-1 = 767 integer operations and runs in ~42 clocks on Core2,
> giving an actual 18 integer ops/clock average (variable width, they
> start as 8-bit and end up as 16-bit).  So at least in that part of the
> code x264 hits ~192 GIPS on a QX6700.
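
The operation counts in the paragraph above can be checked with a scalar reference. This is a plain-Python sketch, not x264's code: `sad` and `sad_op_count` are illustrative names, and the 42-clock figure and the rounding to 18 ops/clock are taken from the post as given.

```python
def sad(block_a, block_b):
    """Sum of absolute differences between two equally sized 2-D pixel blocks."""
    return sum(
        abs(a - b)
        for row_a, row_b in zip(block_a, block_b)
        for a, b in zip(row_a, row_b)
    )


def sad_op_count(width, height):
    """Scalar op count: one subtract and one abs per pixel, plus N-1 adds."""
    pixels = width * height
    return pixels + pixels + (pixels - 1)


ops_16x16 = sad_op_count(16, 16)          # 3*16*16 - 1
ops_per_clock = ops_16x16 / 42            # ~42 clocks on Core2, per the post
achieved_gips = 18 * 2.67 * 4             # 18 ops/clock on 4 cores at 2.67 GHz
psadbw_peak_gips = (16 + 16 + 14) * 2.67 * 4  # 46 ops/clock theoretical peak

print(ops_16x16, round(ops_per_clock, 2))
print(round(achieved_gips, 1), round(psadbw_peak_gips, 1))
```
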
> 
> Nevertheless, I am very interested in how one would program motion
> estimation on an nvidia GPU.  I have seen papers (e.g.
> http://numod.ins.uni-bonn.de/research/papers/public/StGa04motion.pdf and
> more on gpgpu.org), but no actual code.  Anyone?
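
Not GPU code, but a CPU reference for exhaustive-search motion estimation may help frame the question. The sketch below is an assumption about what would be ported, not anything from x264 or the linked paper; all names are hypothetical. The point is that every candidate vector's cost is computed independently of the others, which is the part that looks data-parallel.

```python
def block_sad(cur, ref, cx, cy, rx, ry, size):
    """SAD between the size x size block of `cur` at (cx, cy) and of `ref` at (rx, ry)."""
    return sum(
        abs(cur[cy + y][cx + x] - ref[ry + y][rx + x])
        for y in range(size)
        for x in range(size)
    )


def full_search(cur, ref, bx, by, size, search_range):
    """Try every vector in a square window; return (cost, dx, dy) of the best one."""
    height, width = len(ref), len(ref[0])
    best = None
    for dy in range(-search_range, search_range + 1):
        for dx in range(-search_range, search_range + 1):
            rx, ry = bx + dx, by + dy
            # Skip candidates whose block would fall outside the reference frame.
            if rx < 0 or ry < 0 or rx + size > width or ry + size > height:
                continue
            cost = block_sad(cur, ref, bx, by, rx, ry, size)
            if best is None or cost < best[0]:
                best = (cost, dx, dy)
    return best


# Synthetic test frames: `cur` is `ref` shifted by (2, 1), clamped at the edges.
ref = [[(7 * x + 13 * y + x * y) % 256 for x in range(16)] for y in range(16)]
cur = [[ref[min(y + 1, 15)][min(x + 2, 15)] for x in range(16)] for y in range(16)]

best = full_search(cur, ref, bx=4, by=4, size=4, search_range=3)
print(best)
```

On the synthetic frames the search should recover the (2, 1) shift with zero cost. A real encoder would not search exhaustively like this; it is only the simplest baseline to compare a GPU version against.
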
> 
> Does the Geforce do anything integer-based?  Would the data be processed
> as vertexes or textures?  Is the Geforce Gflops calculation just for the
> vertex data?
> 
> From telomere's post, it seems like the Geforce has considerable
> multiplication resources (one MADD and one MUL in the same cycle).  I
> can't think offhand how to really use that to good effect in motion
> estimation: perhaps in Hadamard, if it turns out multiply by +1/-1 and
> add is faster than the combination of add/subtract/permute?
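
To make the Hadamard question concrete, here are the two formulations side by side in plain Python. This is a sketch with made-up function names and says nothing about which is faster on real hardware: the multiply-add form does n*n multiply-adds with no data shuffling, while the butterfly form does n*log2(n) adds/subtracts but needs the operands paired up at each stage.

```python
def sylvester_hadamard(n):
    """Build the n x n Hadamard matrix (n a power of two) with +1/-1 entries."""
    h = [[1]]
    while len(h) < n:
        h = [row + row for row in h] + [row + [-v for v in row] for row in h]
    return h


def hadamard_madd(x):
    """Transform as a matrix-vector product: multiply by +1/-1 and accumulate."""
    return [sum(c * v for c, v in zip(row, x)) for row in sylvester_hadamard(len(x))]


def hadamard_butterfly(x):
    """Same transform via add/subtract butterflies (fast Walsh-Hadamard)."""
    out = list(x)
    half = 1
    while half < len(out):
        for start in range(0, len(out), 2 * half):
            for j in range(start, start + half):
                a, b = out[j], out[j + half]
                out[j], out[j + half] = a + b, a - b
        half *= 2
    return out


sample = [1, 2, 3, 4]
print(hadamard_madd(sample))
print(hadamard_butterfly(sample))
```

For n = 4 that is 16 multiply-adds against 8 adds/subtracts, so the multiply-add route only wins if the multipliers would otherwise sit idle or the permutes are expensive.
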
> 
> I am noticing a certain trend here: every time it seems like something
> could be done more efficiently on specialized hardware, general-purpose
> hardware catches up.  The last one was the Cell processor, not so
> attractive now that one can put two X5355's in one box ;)  Yes, the Cell
> is still faster in theory than two X5355's.  In practice though...
> 
> --Alex
> 
> 
