[x264-devel] Re: [PATCH] Altivec optimizations for quant4x4, quant4x4dc, quant8x8, sub8x8_dct8, sub16x16_dct8

Mon Aug 28 11:54:40 CEST 2006

Hi,

Loren Merritt wrote:
> On Sun, 27 Aug 2006, Guillaume POIRIER wrote:
> 
>> The attached patch adds sub16x16_dct8 to the bunch of optimized codes.
>>
>> I realize that I have not advertised the speed-up:
>> All functions get a 3.3x to 3.5x speed-up, except quant4x4, which gets a
>> far bigger speed-up, but I lost the sheet on which I had it written
>> down (I think it was a 7x speed-up, but I'm not sure).
>>
>> Overall, on an encode with lots of options, RD, and stuff, I get a
>> 2.5% speed-up, but encodes with more straightforward options should
>> get an even bigger overall speed-up.
>>
>> Next step:  x264_pixel_sa8d_8x8 and friends (20% through for now).
> 
> 
> I don't see common/ppc/quant.[ch] anymore, forgot to svn add?

Darn, you're right! I'll fix this in the next patch.
The content of these files hasn't changed since the last patch I sent,
though.

I've got a simple question regarding the C and SSE implementations of
pixel_sa8d_wxh.

In the C version, there's this hunk:

#define SRC(x)     diff[i][x]
#define DST(x,rhs) diff[i][x] = (rhs)
            for( i = 0; i < 8; i++ )
                SA8D_1D
#undef SRC
#undef DST

#define SRC(x)     diff[x][i]
#define DST(x,rhs) i_satd += abs(rhs)
            for( i = 0; i < 8; i++ )
                SA8D_1D
#undef SRC
#undef DST

Note that the first loop passes a different line of diff[][] to
SA8D_1D at each iteration, and inside SA8D_1D, each SRC element is
just one element of that line.
This doesn't seem too SIMD-friendly to me: it's easy to use a whole
line or a whole column of the block as input (as a vector), whereas you
can't directly address a particular element of that vector.
It looks like before calling my AltiVec version of SA8D_1D (which is
just the C version unrolled by 8), I'd need to transpose the block,
run the first pass of the 8x8 Hadamard transform on the differences,
then transpose again and run the 2nd pass with SRC(x) as diff[x][i] and
DST(x,rhs) as i_satd += abs(rhs).

But when I look at the SSE version, it doesn't look like any transpose
is done before the first 8x8 Hadamard pass.

It looks like it does:
load diff[][] with the sum/diff of *pixel_1 and *pixel_2
then run
sum of 8x8 hadamard transform diff
transpose 8x8
sum of 8x8 hadamard transform diff
then accumulation of the absolute value of each element.
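
As a rough illustration of the sequence above (this is not the actual
SSE code), here is a minimal scalar C sketch. The vec8 type and all
function names are made up for this example; vec8 stands in for one
SIMD register holding a whole line of diff[][]. Because the butterflies
add and subtract whole vectors, each pass transforms all 8 columns at
once, which would explain why no transpose is needed before the first
pass. Any final rounding or normalisation is left out.

```c
#include <stdlib.h>

/* Hypothetical stand-in for one SIMD register holding a line of diff[][]. */
typedef struct { int lane[8]; } vec8;

static vec8 vadd(vec8 a, vec8 b)
{
    vec8 r;
    for (int k = 0; k < 8; k++) r.lane[k] = a.lane[k] + b.lane[k];
    return r;
}

static vec8 vsub(vec8 a, vec8 b)
{
    vec8 r;
    for (int k = 0; k < 8; k++) r.lane[k] = a.lane[k] - b.lane[k];
    return r;
}

/* 8-point Hadamard butterflies across the 8 line vectors.
 * Each lane is independent, so this transforms all 8 columns at once. */
static void hadamard8_vectors(vec8 v[8])
{
    for (int step = 1; step < 8; step <<= 1)
        for (int i = 0; i < 8; i += 2 * step)
            for (int j = i; j < i + step; j++) {
                vec8 a = v[j], b = v[j + step];
                v[j]        = vadd(a, b);
                v[j + step] = vsub(a, b);
            }
}

/* Swap lines and columns in place. */
static void transpose8x8(vec8 v[8])
{
    for (int i = 0; i < 8; i++)
        for (int j = i + 1; j < 8; j++) {
            int t = v[i].lane[j];
            v[i].lane[j] = v[j].lane[i];
            v[j].lane[i] = t;
        }
}

/* Unnormalised sum of absolute 8x8 Hadamard-transformed differences. */
int sa8d_sketch(int diff[8][8])
{
    vec8 v[8];
    int sum = 0;

    for (int i = 0; i < 8; i++)
        for (int j = 0; j < 8; j++)
            v[i].lane[j] = diff[i][j];

    hadamard8_vectors(v);   /* pass 1: transforms the columns */
    transpose8x8(v);        /* old lines become columns */
    hadamard8_vectors(v);   /* pass 2: transforms the original lines */

    for (int i = 0; i < 8; i++)
        for (int j = 0; j < 8; j++)
            sum += abs(v[i].lane[j]);
    return sum;
}
```

The only transpose sits between the two passes, and the result comes
out transposed, which does not matter once everything is summed.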

Am I missing something?
Maybe the trick is that for the very purpose of computing the result
of pixel_sa8d_8x8, doing:
hadamard8x8 (on the lines of diff[][])
transpose 8x8
hadamard8x8 (on the columns of diff[][])

is equivalent to
hadamard8x8 (on the columns of diff[][])
transpose 8x8
hadamard8x8 (on the lines of diff[][])

as all we do afterwards is compute the absolute value of each element
and accumulate them?
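
One way to check this guess numerically is the small C sketch below.
The function names are hypothetical and the 1D transform is a plain
8-point Hadamard in natural order, which may order its outputs
differently from SA8D_1D. The argument should not depend on that: the
pass over lines and the pass over columns are both linear and act on
different indices, so they can be applied in either order and give the
same coefficients, and a transpose only moves coefficients around
without changing their absolute values.

```c
#include <stdlib.h>

/* In-place 8-point Hadamard on p[0], p[stride], ..., p[7*stride]. */
static void hadamard8(int *p, int stride)
{
    for (int step = 1; step < 8; step <<= 1)
        for (int i = 0; i < 8; i += 2 * step)
            for (int j = i; j < i + step; j++) {
                int a = p[j * stride];
                int b = p[(j + step) * stride];
                p[j * stride]          = a + b;
                p[(j + step) * stride] = a - b;
            }
}

/* Sum of absolute values over a flattened 8x8 block. */
static int sum_abs64(const int *m)
{
    int sum = 0;
    for (int k = 0; k < 64; k++)
        sum += abs(m[k]);
    return sum;
}

/* diff is a flattened 8x8 block: element (line, col) is diff[8*line + col]. */
int lines_then_columns(const int *diff)
{
    int m[64];
    for (int k = 0; k < 64; k++) m[k] = diff[k];
    for (int i = 0; i < 8; i++) hadamard8(m + 8 * i, 1); /* each line   */
    for (int i = 0; i < 8; i++) hadamard8(m + i, 8);     /* each column */
    return sum_abs64(m);
}

int columns_then_lines(const int *diff)
{
    int m[64];
    for (int k = 0; k < 64; k++) m[k] = diff[k];
    for (int i = 0; i < 8; i++) hadamard8(m + i, 8);     /* each column */
    for (int i = 0; i < 8; i++) hadamard8(m + 8 * i, 1); /* each line   */
    return sum_abs64(m);
}
```

Feeding both functions the same block should return the same sum,
whichever pass runs first.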

I hope my explanations aren't too confusing... I was never too
good in math class when it came to matrices...

Best regards,

Guillaume

-- 
This is the x264-devel mailing-list
To unsubscribe, go to: http://developers.videolan.org/lists.html