[x264-devel] Re: [PATCH] Altivec optimizations for quant4x4, quant4x4dc, quant8x8, sub8x8_dct8, sub16x16_dct8, pixel_sa8d_8x8

Wed Aug 30 21:32:24 CEST 2006

On Wed, 30 Aug 2006, Guillaume POIRIER wrote:

>> I'm not fluent in Altivec, but the manual says:
>> vec_sum4s can't take two vec_s16_t; one of the inputs has to be 32-bit.
>> vec_sums only horizontally sums one of the inputs.
>
> Yes, you are right. I noticed that yesterday when I went over the
> accumulation code once again.
> The way horizontal accumulation is done in Altivec is actually quite clever,
> as it makes it easy to accumulate vectors in a loop.
>
> I wonder how it's done in SSE4. Intel hasn't released any documentation
> about it yet. I wonder why: is it not ready, or do they have something to
> hide? Or...?

In SSE4, it's
phaddd {a,b,c,d}, {e,f,g,h} => {a+b,c+d,e+f,g+h}
which is annoying if you're generating only one sum. Maybe there are cases 
where you'd want just a pairwise sum, but I haven't found one. It almost 
works for row hadamard, but I think transpose+column is still fewer ops.
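
To make the lane layout concrete, here is a minimal scalar sketch in plain C
of what phaddd computes and of what it costs to reduce one vector to a single
sum. The function names (phaddd_model, hsum_via_phaddd, hsum4_via_phaddd) are
made up for illustration and are not x264 code; lane arithmetic is modelled as
wrapping 32-bit adds.

```c
#include <assert.h>
#include <stdint.h>

/* Scalar model of phaddd on two vectors of four 32-bit lanes:
 *   {a,b,c,d}, {e,f,g,h} => {a+b, c+d, e+f, g+h}
 * A temporary is used so that 'out' may alias either input. */
static void phaddd_model(const int32_t x[4], const int32_t y[4], int32_t out[4])
{
    int32_t tmp[4];
    tmp[0] = (int32_t)((uint32_t)x[0] + (uint32_t)x[1]);
    tmp[1] = (int32_t)((uint32_t)x[2] + (uint32_t)x[3]);
    tmp[2] = (int32_t)((uint32_t)y[0] + (uint32_t)y[1]);
    tmp[3] = (int32_t)((uint32_t)y[2] + (uint32_t)y[3]);
    for (int i = 0; i < 4; i++)
        out[i] = tmp[i];
}

/* Reducing ONE vector to a single sum takes two pairwise adds,
 * and three of the four result lanes are redundant. */
static int32_t hsum_via_phaddd(const int32_t v[4])
{
    int32_t t[4];
    phaddd_model(v, v, t);   /* {a+b, c+d, a+b, c+d} */
    phaddd_model(t, t, t);   /* every lane holds a+b+c+d */
    return t[0];
}

/* Reducing FOUR vectors at once takes three pairwise adds and
 * fills every lane with a useful sum: out[i] = sum of lanes of v_i. */
static void hsum4_via_phaddd(const int32_t v0[4], const int32_t v1[4],
                             const int32_t v2[4], const int32_t v3[4],
                             int32_t out[4])
{
    int32_t lo[4], hi[4];
    phaddd_model(v0, v1, lo);
    phaddd_model(v2, v3, hi);
    phaddd_model(lo, hi, out);
}

/* Small self-check of the lane layout described above. */
static void phaddd_self_check(void)
{
    const int32_t x[4] = {1, 2, 3, 4}, y[4] = {10, 20, 30, 40};
    int32_t out[4];
    phaddd_model(x, y, out);
    assert(out[0] == 3 && out[1] == 7 && out[2] == 30 && out[3] == 70);
    assert(hsum_via_phaddd(x) == 10);
    hsum4_via_phaddd(x, y, x, y, out);
    assert(out[0] == 10 && out[1] == 100 && out[2] == 10 && out[3] == 100);
}
```

In this model a single sum costs two pairwise adds with most of the work
wasted, whereas four independent sums cost three adds in total, which is the
case where the pairwise form pays off.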

--Loren Merritt