[x264-devel] Re: [PATCH] Altivec optimizations for quant4x4, quant4x4dc, quant8x8, sub8x8_dct8, sub16x16_dct8

Loren Merritt lorenm at u.washington.edu
Mon Aug 28 18:25:59 CEST 2006


On Mon, 28 Aug 2006, Guillaume Poirier wrote:

> I've got a simple question regarding the C and SSE implementations of
> pixel_sa8d_wxh
>
> In the C version, there's this hunk:
>
> #define SRC(x)     diff[i][x]
> #define DST(x,rhs) diff[i][x] = (rhs)
>            for( i = 0; i < 8; i++ )
>                SA8D_1D
> #undef SRC
> #undef DST
>
> #define SRC(x)     diff[x][i]
> #define DST(x,rhs) i_satd += abs(rhs)
>            for( i = 0; i < 8; i++ )
>                SA8D_1D
> #undef SRC
> #undef DST
>
> Note that the first loop that calls SA8D_1D gives it a different line
> at each iteration, and inside SA8D_1D, each SRC element is just an
> element of that line.
> This doesn't seem very SIMD-friendly to me: it's easy to use a whole
> line of the block (one element per column) as a vector input, but you
> can't directly address a particular element of that vector.
> It looks like before calling my AltiVec version of SA8D_1D (which is
> just the C version unrolled by 8), I'd need to transpose the block,
> then compute the sum of 8x8 Hadamard transformed differences, then
> transpose again for the 2nd step, with SRC(x) as diff[x][i] and
> DST(x,rhs) as i_satd += abs(rhs).
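
For reference, the row-then-column structure of the quoted C code can be written out as a self-contained function. This is a sketch, not x264 code: hadamard8() is a hypothetical stand-in for SA8D_1D (an unnormalized 8-point Hadamard transform done with three butterfly stages), and the input is assumed to be an 8x8 block of already-computed differences stored row-major in a flat int array.

```c
#include <stdlib.h>

/* Hypothetical stand-in for SA8D_1D: unnormalized 8-point Hadamard
 * transform of v[0], v[stride], ..., v[7*stride], done in place with
 * three radix-2 butterfly stages. The output is in natural (Sylvester)
 * order, which does not matter when only absolute values are summed. */
static void hadamard8(int *v, int stride)
{
    for (int half = 1; half < 8; half *= 2)
        for (int i = 0; i < 8; i += 2 * half)
            for (int j = i; j < i + half; j++) {
                int a = v[j * stride];
                int b = v[(j + half) * stride];
                v[j * stride] = a + b;
                v[(j + half) * stride] = a - b;
            }
}

/* diff holds an 8x8 block of differences, row-major: diff[8*y + x].
 * Pass 1 transforms every row (like SRC(x) = diff[i][x]); pass 2
 * transforms every column (like SRC(x) = diff[x][i]) and accumulates
 * the absolute values of the resulting coefficients. */
int sa8d_rows_then_cols(const int diff[64])
{
    int d[64];
    int sum = 0;

    for (int k = 0; k < 64; k++)
        d[k] = diff[k];

    for (int i = 0; i < 8; i++)
        hadamard8(&d[8 * i], 1);   /* row i: consecutive elements */

    for (int i = 0; i < 8; i++) {
        hadamard8(&d[i], 8);       /* column i: stride of one row */
        for (int x = 0; x < 8; x++)
            sum += abs(d[i + 8 * x]);
    }
    return sum;
}
```

A single nonzero difference spreads over all 64 coefficients (diff[0] = 1 gives 64), while a constant block collapses into the DC term (every element 3 gives 64*3 = 192).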
>
> But when I look at the SSE version, it doesn't look like any transpose
> is done before computing the sum of 8x8 Hadamard transformed differences.
>
> It looks like it does:
> load diff[][] with the sum/diff of *pixel_1 and *pixel_2
> then run
> sum of 8x8 hadamard transform diff
> transpose 8x8
> sum of 8x8 hadamard transform diff
> then accumulation of the absolute value of each element.
>
> Am I missing something?
> Maybe the trick is that for the very purpose of computing the result
> of pixel_sa8d_8x8, doing:
> hadamard8x8 (on the lines of diff[][])
> transpose 8x8
> hadamard8x8 (on the columns of diff[][])
>
> is equivalent to
> hadamard8x8 (on the columns of diff[][])
> transpose 8x8
> hadamard8x8 (on the lines of diff[][])
>
> as all we do afterwards is compute the absolute value of each element
> and accumulate them?
>
> I hope my explanations aren't too confusing... I've never been too
> good in math class when it was about matrices...
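
The order described for the SSE code can be emulated in scalar C by treating each row of the block as one 8-lane register. A butterfly between two whole rows transforms all eight columns at once, so vectorized code only ever needs the column (vertical) transform plus a transpose, never access to individual lanes. In this sketch vec8, butterfly(), hadamard_vertical(), transpose8x8() and sa8d_simd_order() are hypothetical names, and plain loops stand in for the actual SSE or AltiVec instructions.

```c
#include <stdlib.h>

typedef struct { int lane[8]; } vec8;   /* stand-in for one SIMD register */

/* Lane-wise butterfly between two registers: a <- a+b, b <- a-b. */
static void butterfly(vec8 *a, vec8 *b)
{
    for (int l = 0; l < 8; l++) {
        int x = a->lane[l], y = b->lane[l];
        a->lane[l] = x + y;
        b->lane[l] = x - y;
    }
}

/* Vertical (column) 8-point Hadamard: three butterfly stages across the
 * 8 row registers transform all 8 columns simultaneously. */
static void hadamard_vertical(vec8 r[8])
{
    for (int half = 1; half < 8; half *= 2)
        for (int i = 0; i < 8; i += 2 * half)
            for (int j = i; j < i + half; j++)
                butterfly(&r[j], &r[j + half]);
}

/* 8x8 transpose of the register block; SIMD code typically does this
 * with interleave (unpack/merge) instructions. */
static void transpose8x8(vec8 r[8])
{
    for (int y = 0; y < 8; y++)
        for (int x = y + 1; x < 8; x++) {
            int t = r[y].lane[x];
            r[y].lane[x] = r[x].lane[y];
            r[x].lane[y] = t;
        }
}

/* SSE-style order: column Hadamard, transpose, column Hadamard, sum.
 * diff is an 8x8 block of differences, row-major: diff[8*y + x]. */
int sa8d_simd_order(const int diff[64])
{
    vec8 r[8];
    int sum = 0;

    for (int y = 0; y < 8; y++)
        for (int x = 0; x < 8; x++)
            r[y].lane[x] = diff[8 * y + x];

    hadamard_vertical(r);
    transpose8x8(r);
    hadamard_vertical(r);

    for (int y = 0; y < 8; y++)
        for (int x = 0; x < 8; x++)
            sum += abs(r[y].lane[x]);
    return sum;
}
```

With diff[1] = 2 (rest zero) it returns 128, and a constant block of -2 also returns 128, the same totals a row-then-column version computes for those inputs.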

The C version does
   row hadamard
   column hadamard
   sum

which is the same as
   column hadamard
   row hadamard
   sum

which is the same as
   column hadamard
   transpose
   column hadamard
   transpose
   sum

which is the same as
   column hadamard
   transpose
   column hadamard
   sum

which is the SSE version.
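
The chain holds because the Hadamard transform is linear and its matrix H (8x8, entries +1/-1) is symmetric. On a block D, a column transform is H*D and a row transform is D*H, so row-then-column is H*(D*H) and column-then-row is (H*D)*H, equal by associativity. Column, transpose, column gives H*(H*D)^T = H*D^T*H = (H*D*H)^T, and that last transpose only permutes the 64 coefficients, so it cannot change the sum of their absolute values. The sketch below checks all four orderings numerically with explicit matrix products; make_hadamard(), sa8d_orderings() and the other names are hypothetical, and the matrix multiplications are for illustration only (real code uses butterflies).

```c
#include <stdlib.h>
#include <string.h>

typedef int mat8[8][8];

/* Sylvester-ordered Hadamard matrix: H[i][j] = (-1)^(parity of i & j).
 * It is symmetric, so H equals its own transpose. */
static void make_hadamard(mat8 h)
{
    for (int i = 0; i < 8; i++)
        for (int j = 0; j < 8; j++) {
            int bits = i & j, parity = 0;
            while (bits) {
                parity ^= bits & 1;
                bits >>= 1;
            }
            h[i][j] = parity ? -1 : 1;
        }
}

/* out = a * b; a temporary makes it safe for out to alias a or b. */
static void matmul(mat8 out, mat8 a, mat8 b)
{
    mat8 t;
    for (int i = 0; i < 8; i++)
        for (int j = 0; j < 8; j++) {
            t[i][j] = 0;
            for (int k = 0; k < 8; k++)
                t[i][j] += a[i][k] * b[k][j];
        }
    memcpy(out, t, sizeof t);
}

static void transpose(mat8 m)
{
    for (int i = 0; i < 8; i++)
        for (int j = i + 1; j < 8; j++) {
            int t = m[i][j];
            m[i][j] = m[j][i];
            m[j][i] = t;
        }
}

static int sum_abs(mat8 m)
{
    int s = 0;
    for (int i = 0; i < 8; i++)
        for (int j = 0; j < 8; j++)
            s += abs(m[i][j]);
    return s;
}

/* Fills out[] with the sum of absolute coefficients for each ordering
 * in the reply: [0] row then column, [1] column then row,
 * [2] column, transpose, column, transpose, [3] column, transpose,
 * column (the SSE order). Column transform = H*D, row transform = D*H. */
void sa8d_orderings(mat8 d, int out[4])
{
    mat8 h, m;
    make_hadamard(h);

    matmul(m, d, h);    /* rows    */
    matmul(m, h, m);    /* columns */
    out[0] = sum_abs(m);

    matmul(m, h, d);    /* columns */
    matmul(m, m, h);    /* rows    */
    out[1] = sum_abs(m);

    matmul(m, h, d);
    transpose(m);
    matmul(m, h, m);
    transpose(m);
    out[2] = sum_abs(m);

    matmul(m, h, d);
    transpose(m);
    matmul(m, h, m);
    out[3] = sum_abs(m);
}
```

For any integer block the four entries of out[] come out identical; for a single unit difference anywhere in the block, each of them is 64.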

--Loren Merritt

-- 
This is the x264-devel mailing-list
To unsubscribe, go to: http://developers.videolan.org/lists.html


