[x264-devel] Re: [PATCH] Altivec optimizations for quant4x4, quant4x4dc, quant8x8, sub8x8_dct8, sub16x16_dct8, pixel_sa8d_8x8

Tue Sep 5 00:52:53 CEST 2006

Hi,

Loren Merritt wrote:
> On Wed, 30 Aug 2006, Guillaume POIRIER wrote:
> 
>>> I'm not fluent in altivec, but the manual says:
>>> vec_sum4s can't take two vec_s16_t, one of the inputs has to be 32bit.
>>> vec_sums only horizontally sums one of the inputs.
>>
>> Yes, you are right. I noticed that yesterday when I went over the 
>> accumulation code once again.
>> The way horizontal accumulation is done in Altivec is quite clever, in 
>> fact, as it makes it easy to accumulate vectors in a loop.
>>
>> I wonder how it's done in SSE4. Intel hasn't released any doc about 
>> it yet. I wonder why: is it not ready, or do they have something to 
>> hide? or... ?
> 
> In SSE4, it's
> phaddd {a,b,c,d}, {e,f,g,h} => {a+b,c+d,e+f,g+h}
> which is annoying if you're generating only one sum. Maybe there are 
> cases where you'd want just a pairwise sum, but I haven't found one. It 
> almost works for row hadamard, but I think transpose+column is still 
> fewer ops.

Hey, good to know that. :)
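
To make the lane semantics discussed above concrete, here is a rough scalar model in plain C. It is only a sketch: the names model_vec_sum4s, model_vec_sums, model_phaddd and sum_s16 are made up for illustration and are not real intrinsics, and the models cover only the signed 16-bit/32-bit cases mentioned in the quoted text.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Clamp a 64-bit intermediate to the signed 32-bit range (saturation). */
int32_t sat_s32(int64_t v)
{
    if (v > INT32_MAX) return INT32_MAX;
    if (v < INT32_MIN) return INT32_MIN;
    return (int32_t)v;
}

/* Model of vec_sum4s with signed 16-bit input: each 32-bit lane of b
 * receives the sum of the two 16-bit elements of a that overlay it. */
void model_vec_sum4s(const int16_t a[8], const int32_t b[4], int32_t out[4])
{
    for (int i = 0; i < 4; i++)
        out[i] = sat_s32((int64_t)a[2 * i] + a[2 * i + 1] + b[i]);
}

/* Model of vec_sums: the four lanes of a plus the last lane of b end up
 * in the last lane of the result; the other lanes are zeroed. */
void model_vec_sums(const int32_t a[4], const int32_t b[4], int32_t out[4])
{
    int64_t s = (int64_t)a[0] + a[1] + a[2] + a[3] + b[3];
    out[0] = out[1] = out[2] = 0;
    out[3] = sat_s32(s);
}

/* Model of the quoted pairwise add (wrapping, no saturation):
 * {a,b,c,d}, {e,f,g,h} => {a+b, c+d, e+f, g+h}. */
void model_phaddd(const int32_t x[4], const int32_t y[4], int32_t out[4])
{
    int32_t r0 = (int32_t)((uint32_t)x[0] + (uint32_t)x[1]);
    int32_t r1 = (int32_t)((uint32_t)x[2] + (uint32_t)x[3]);
    int32_t r2 = (int32_t)((uint32_t)y[0] + (uint32_t)y[1]);
    int32_t r3 = (int32_t)((uint32_t)y[2] + (uint32_t)y[3]);
    out[0] = r0; out[1] = r1; out[2] = r2; out[3] = r3;
}

/* Accumulate n signed 16-bit values (n must be a multiple of 8) the way
 * the loop idiom works: partial sums per lane, one final reduction. */
int32_t sum_s16(const int16_t *src, size_t n)
{
    int32_t acc[4] = { 0, 0, 0, 0 };
    const int32_t zero[4] = { 0, 0, 0, 0 };
    int32_t total[4];

    assert(n % 8 == 0);
    for (size_t i = 0; i < n; i += 8)
        model_vec_sum4s(src + i, acc, acc); /* accumulator is the 32-bit input */
    model_vec_sums(acc, zero, total);       /* single horizontal reduction */
    return total[3];
}
```

With the pairwise model, reducing a single vector to one sum takes two passes (the vector added with itself, then the result with itself again), which is the annoyance described in the quote.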

Please find attached the n+1 version of my Altivec patchset.
On today's menu:
I've benchmarked the different implementations of hadamard8x8 and
quant4x4 that I had on hand (as featured in rev.7 of my patchset, which
does not seem to have reached the ML...) to pick the fastest of each.
Nothing too exciting, as it's just a matter of squeezing 1 to 4 cycles
out of 200 or so... but while I was at it, I figured it wouldn't hurt to
measure the different implementations before arbitrarily discarding any
of them.
It's interesting to note that on my G5, in the case of quant8x8, which 
uses the macro defined for quant4x4, the implementations which use 
shifts (158 cycles) or unroll the outer loop (160 cycles) are slower 
than the implementation which uses plain, simple multiplies (157 
cycles)... Well...

As a free bonus, the attached patchset adds PMC support (i.e. hardware
performance counters) for G5 and G3/G4, taken from FFmpeg's code. It's
a bit ugly, though, as there's no START/STOP macro yet to make
benchmarking easier.
This probably deserves to be put in a separate patch if it ever gets
merged.
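
For illustration only, a START/STOP pair could look roughly like the sketch below. It is not what the patchset or FFmpeg does: the names bench_timer_t, BENCH_START, BENCH_STOP, bench_average and bench_report are hypothetical, and the standard C clock() is used as a portable stand-in for the hardware performance counters.

```c
#include <stdio.h>
#include <time.h>

/* Hypothetical per-function timer; clock() stands in for a PMC read. */
typedef struct {
    const char   *name;        /* label printed in the report */
    clock_t       started;     /* tick value captured by BENCH_START */
    double        total_ticks; /* ticks accumulated over all calls */
    unsigned long calls;       /* number of START/STOP pairs seen */
} bench_timer_t;

/* Record the starting tick. */
#define BENCH_START(t) \
    do { (t).started = clock(); } while (0)

/* Add the elapsed ticks to the running total and count the call. */
#define BENCH_STOP(t) \
    do { \
        (t).total_ticks += (double)(clock() - (t).started); \
        (t).calls++; \
    } while (0)

/* Average ticks per call, or 0 if the timer was never stopped. */
double bench_average(const bench_timer_t *t)
{
    return t->calls ? t->total_ticks / (double)t->calls : 0.0;
}

/* Print a one-line summary for this timer. */
void bench_report(const bench_timer_t *t)
{
    printf("%s: %lu calls, %.1f ticks/call\n",
           t->name, t->calls, bench_average(t));
}
```

The idea is to declare one bench_timer_t per routine, wrap the call in BENCH_START/BENCH_STOP, and call bench_report at exit. clock() is far too coarse for a 157-cycle routine, so the two macros would have to read the real counters instead.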

Last but not least, I've cleaned up my patchset so that it works better
with GCC 3.3 (not yet complete though, quant.c needs some rework).

Feedback welcome.

Guillaume
-------------- next part --------------
A non-text attachment was scrubbed...
Name: Altivec_quant-dct_routines_8+PMC.diff
Type: text/x-patch
Size: 30732 bytes
Desc: not available
Url : http://mailman.videolan.org/pipermail/x264-devel/attachments/20060905/4d4f8719/attachment.bin