[x264-devel] Re: Speed-up method
Radek Czyz
radoslaw at syskin.cjb.net
Sat Mar 3 10:23:29 CET 2007
Perhaps worth mentioning: I actually have an 8800 GTS and, as you might
guess, I'm interested in trying ME on it as well :)
Radek
Alex Izvorski wrote:
> On Fri, 2007-03-02 at 19:00 -0700, Loren Merritt wrote:
>> On Sat, 3 Mar 2007, telomere 1 wrote:
>>
>>> QX6700
>>> 27.9GFlops
>>>
>>> Geforce 8800GTX
>>> (MADD(2flops) + MUL(1flop)) * 1,350 MHz * 128 SPs = 518.4G Flops
>> QX6700
>> 2.67 GHz * 4 cpus * (1 or 2 sse op/cycle) * (8 int16 or 16 int8 in a sse reg)
>> = 85 to 341 GIPS
>>
>> Though x264 at hq settings spends about 60% of its time in simd code, so
>> the real estimated improvement is a factor of 2.5 for offloading all of
>> that to the gpu.
>
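The figures quoted above are all products of clock rate, number of execution units, and operations per cycle, and the 2.5x bound is Amdahl's law with 60% of the run time offloaded. A minimal sketch in C that reproduces the arithmetic follows; the function names `peak_gops` and `amdahl_speedup` are made up for this illustration and are not part of x264.

```c
/* Peak throughput in giga-operations per second:
 * clock (GHz) * number of execution units * operations per unit per cycle.
 * Hypothetical helper, for illustration only. */
double peak_gops(double clock_ghz, int units, int ops_per_cycle)
{
    return clock_ghz * units * ops_per_cycle;
}

/* Amdahl's law: overall speed-up when a fraction `offloaded` of the run
 * time becomes `factor` times faster and the rest is left unchanged.
 * Hypothetical helper, for illustration only. */
double amdahl_speedup(double offloaded, double factor)
{
    return 1.0 / ((1.0 - offloaded) + offloaded / factor);
}
```

With these, `peak_gops(1.35, 128, 3)` gives 518.4 for the 8800GTX, while `peak_gops(2.67, 4, 8)` and `peak_gops(2.67, 4, 32)` give about 85.4 and 341.8 for the QX6700. `amdahl_speedup(0.6, factor)` approaches 2.5 as `factor` grows, so 2.5 is the ceiling even for an arbitrarily fast GPU.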
> That's a conservative estimate: the PSADBW instruction does 16 integer
> subtracts, 16 absolute value calculations and 14 adds - in one clock
> cycle ;) so the theoretical peak is 46 ops/clock or 491 GIPS. In
> practice, sad16x16 (one of the most common functions) requires at least
> 3*16*16-1 = 767 integer operations and runs in ~42 clocks on Core2,
> giving an actual 18 integer ops/clock average (variable width, they
> start as 8-bit and end up as 16-bit). So at least in that part of the
> code x264 hits ~192 GIPS on a QX6700.
>
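For reference, the operation count in the paragraph above can be read off a plain scalar version of the 16x16 SAD. This is only a sketch under the assumption of 8-bit pixels with a row stride; `sad16x16_ref` is a hypothetical name and is not x264's actual (SIMD) routine.

```c
#include <stdint.h>
#include <stdlib.h>

/* Scalar reference for a 16x16 sum of absolute differences.
 * Hypothetical helper for illustration, not x264's own implementation.
 * Per pixel: one subtract, one absolute value, one add into the sum.
 * Over 256 pixels that is 256 + 256 + 255 = 767 operations, since the
 * first add into an empty accumulator is not counted. */
int sad16x16_ref(const uint8_t *cur, int cur_stride,
                 const uint8_t *ref, int ref_stride)
{
    int sum = 0;
    for (int y = 0; y < 16; y++) {
        for (int x = 0; x < 16; x++)
            sum += abs(cur[x] - ref[x]);
        cur += cur_stride;   /* advance both blocks by one row */
        ref += ref_stride;
    }
    return sum;
}
```

At ~42 clocks per call, 767 operations work out to about 18 ops/clock, and 18 * 2.67 GHz * 4 cores is the ~192 GIPS quoted above.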
> Nevertheless, I am very interested in how one would program motion
> estimation on an nvidia GPU. I have seen papers (e.g.
> http://numod.ins.uni-bonn.de/research/papers/public/StGa04motion.pdf and
> more on gpgpu.org), but no actual code. Anyone?
>
> Does the Geforce do anything integer-based? Would the data be processed
> as vertexes or textures? Is the Geforce Gflops calculation just for the
> vertex data?
>
> From telomere's post, it seems like the Geforce has considerable
> multiplication resources (one MADD and one MUL in the same cycle). I
> can't think offhand how to really use that to good effect in motion
> estimation: perhaps in Hadamard, if it turns out multiply by +1/-1 and
> add is faster than the combination of add/subtract/permute?
>
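To make the Hadamard question above concrete, here is a rough sketch of a 4-point Hadamard transform written both ways: as add/subtract butterflies, and as a multiply-by-+1/-1-and-accumulate against the matrix rows. Both function names are hypothetical, the row ordering (Sylvester) is my assumption, and the code says nothing about which form is actually faster on a GPU.

```c
/* 4-point Hadamard via two butterfly stages: 8 adds/subtracts in total.
 * Hypothetical helper, for illustration only. */
void hadamard4_butterfly(const int in[4], int out[4])
{
    int t0 = in[0] + in[2];
    int t1 = in[1] + in[3];
    int t2 = in[0] - in[2];
    int t3 = in[1] - in[3];
    out[0] = t0 + t1;
    out[1] = t0 - t1;
    out[2] = t2 + t3;
    out[3] = t2 - t3;
}

/* The same transform as a matrix product with +1/-1 entries:
 * 16 multiply-adds, the kind of work a MADD unit does natively.
 * Hypothetical helper, for illustration only. */
void hadamard4_matmul(const int in[4], int out[4])
{
    static const int h[4][4] = {
        { 1,  1,  1,  1 },
        { 1, -1,  1, -1 },
        { 1,  1, -1, -1 },
        { 1, -1, -1,  1 },
    };
    for (int r = 0; r < 4; r++) {
        int acc = 0;
        for (int c = 0; c < 4; c++)
            acc += h[r][c] * in[c];   /* one multiply-add per entry */
        out[r] = acc;
    }
}
```

The butterfly form needs half as many arithmetic operations but has to shuffle intermediate values between stages; the matrix form is pure multiply-add with no data movement between lanes, which is the trade-off being asked about.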
> I am noticing a certain trend here: every time it seems like something
> could be done more efficiently on specialized hardware, general-purpose
> hardware catches up. The last one was the Cell processor, not so
> attractive now that one can put two X5355's in one box ;) Yes, the Cell
> is still faster in theory than two X5355's. In practice though...
>
> --Alex
>
>
--
This is the x264-devel mailing-list
To unsubscribe, go to: http://developers.videolan.org/lists.html