[x264-devel] Re: [PATCH] Altivec optimizations for quant4x4, quant4x4dc, quant8x8, sub8x8_dct8, sub16x16_dct8, pixel_sa8d_8x8

Tue Sep 5 18:47:29 CEST 2006

Makes me wonder if it was a good idea to sell the Dual G5 and get the
Mac Pro! Keep up the great PPC/Altivec work!

On 9/4/06, Guillaume POIRIER <gpoirier at mplayerhq.hu> wrote:
> Hi,
>
> Loren Merritt wrote:
> > On Wed, 30 Aug 2006, Guillaume POIRIER wrote:
> >
> >>> I'm not fluent in altivec, but the manual says:
> >>> vec_sum4s can't take two vec_s16_t; one of the inputs has to be 32-bit.
> >>> vec_sums only horizontally sums one of the inputs.
> >>
> >> Yes, you are right. I noticed that yesterday when I went over the
> >> accumulation code once again.
> >> The way horizontal accumulation is done in Altivec is quite clever in
> >> fact, as it makes it easy to accumulate vectors in a loop.
> >>
> >> I wonder how it's done in SSE4. Intel hasn't released any doc about
> >> it yet. I wonder why: is it not ready, or do they have something to
> >> hide? or... ?
> >
> > In SSE4, it's
> > phaddd {a,b,c,d}, {e,f,g,h} => {a+b,c+d,e+f,g+h}
> > which is annoying if you're generating only one sum. Maybe there are
> > cases where you'd want just a pairwise sum, but I haven't found one. It
> > almost works for row hadamard, but I think transpose+column is still
> > fewer ops.
>
> Hey, good to know that. :)
>
> Please find attached the (n+1)th version of my Altivec patchset.
> On today's menu:
> I've benchmarked the different implementations of hadamard8x8 and
> quant4x4 that I had on hand (as featured in rev.7 of my patchset, which
> does not seem to have reached the ML...) to pick the fastest of each.
> Nothing too exciting, as it's just a matter of squeezing 1-3 or 4 cycles
> out of 200 or so... but while I was at it, I figured it wouldn't hurt to
> measure the different implementations before arbitrarily discarding the
> others.
> It's interesting to note that on my G5, in the case of quant8x8, which
> uses the macro defined for quant4x4, the implementation which uses
> shifts (158 cycles) or unrolls the outer loop (160 cycles) is slower
> than the implementation which uses plain, simple multiplies (157
> cycles)... Well...
>
> As a free bonus, the attached patchset adds PMC support (i.e. hardware
> performance counters) for G5 and G3/G4, taken from FFmpeg's code, but
> it's a bit ugly as there's no START/STOP macro yet to make benchmarking
> easier still.
> This probably deserves to be put in a separate patch if it ever gets
> merged.
>
> Last but not least, I've cleaned up my patchset so that it works better
> with GCC 3.3 (not yet complete though, quant.c needs some rework).
>
> Feedback welcome.
>
> Guillaume
>
>
>
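
For anyone following the horizontal-sum discussion quoted above, here is a
rough scalar model in plain C of the three operations mentioned: phaddd,
vec_sum4s (with 16-bit inputs) and vec_sums. It is only a sketch of the
documented semantics as I understand them, not real SIMD code and not code
from the patch. All model_* names are made up for illustration. Note that
phaddd ended up shipping as part of SSSE3.

```c
#include <assert.h>
#include <stdint.h>

/* Clamp a 64-bit intermediate to the signed 32-bit range. */
static int32_t sat32(int64_t v)
{
    if (v > INT32_MAX) return INT32_MAX;
    if (v < INT32_MIN) return INT32_MIN;
    return (int32_t)v;
}

/* Model of "phaddd dst, src": {a,b,c,d},{e,f,g,h} => {a+b,c+d,e+f,g+h}.
 * The real instruction wraps on overflow, hence the unsigned adds. */
static void model_phaddd(int32_t dst[4], const int32_t src[4])
{
    int32_t r[4];
    r[0] = (int32_t)((uint32_t)dst[0] + (uint32_t)dst[1]);
    r[1] = (int32_t)((uint32_t)dst[2] + (uint32_t)dst[3]);
    r[2] = (int32_t)((uint32_t)src[0] + (uint32_t)src[1]);
    r[3] = (int32_t)((uint32_t)src[2] + (uint32_t)src[3]);
    for (int i = 0; i < 4; i++)
        dst[i] = r[i];
}

/* Model of vec_sum4s with a 16-bit first operand: each 32-bit lane gets
 * the sum of two adjacent 16-bit elements of a plus the matching 32-bit
 * element of b (saturating). b can therefore act as an accumulator. */
static void model_vec_sum4s(const int16_t a[8], const int32_t b[4],
                            int32_t out[4])
{
    for (int i = 0; i < 4; i++)
        out[i] = sat32((int64_t)a[2 * i] + a[2 * i + 1] + b[i]);
}

/* Model of vec_sums: only a is summed horizontally; the total plus the
 * last element of b lands in the last lane, the other lanes are zero. */
static void model_vec_sums(const int32_t a[4], const int32_t b[4],
                           int32_t out[4])
{
    int64_t s = (int64_t)a[0] + a[1] + a[2] + a[3] + b[3];
    out[0] = out[1] = out[2] = 0;
    out[3] = sat32(s);
}

/* Accumulate n rows of eight 16-bit values in a loop, then reduce once. */
static int32_t model_sum_rows(const int16_t rows[][8], int n)
{
    const int32_t zero[4] = { 0, 0, 0, 0 };
    int32_t acc[4] = { 0, 0, 0, 0 };
    int32_t total[4];

    for (int i = 0; i < n; i++) {
        int32_t next[4];
        model_vec_sum4s(rows[i], acc, next);   /* acc += pair sums */
        for (int j = 0; j < 4; j++)
            acc[j] = next[j];
    }
    model_vec_sums(acc, zero, total);          /* final horizontal sum */
    return total[3];
}

/* Small self-check of the models above. */
static void model_self_check(void)
{
    const int16_t rows[2][8] = {
        { 1, 2, 3, 4, 5, 6, 7, 8 },
        { 1, 2, 3, 4, 5, 6, 7, 8 },
    };
    int32_t d[4] = { 1, 2, 3, 4 };
    const int32_t s[4] = { 5, 6, 7, 8 };

    model_phaddd(d, s);
    assert(d[0] == 3 && d[1] == 7 && d[2] == 11 && d[3] == 15);
    assert(model_sum_rows(rows, 2) == 72);
}
```

The loop in model_sum_rows is the point made in the quoted text: the second
operand of vec_sum4s doubles as a running accumulator, so only one vec_sums
is needed at the very end, whereas the phaddd model has to be applied
repeatedly to collapse a single vector to one sum.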

-- 
This is the x264-devel mailing-list
To unsubscribe, go to: http://developers.videolan.org/lists.html