[x264-devel] Questions about add8x8_idct8

Fri Sep 15 17:47:51 CEST 2006

Hello folks,

I'm working on converting add8x8_idct8 to Altivec. So far, so good: I've 
got an Altivec version of the IDCT8_1D macro.

However, there is something that troubles me: the part where rounding is 
done "for the >>6 at the end".

Here is what the C version does:

static void add8x8_idct8( uint8_t *dst, int16_t dct[8][8] )
{
     int i;

     dct[0][0] += 32; // rounding for the >>6 at the end

And now here's what the SSE2 version does:

x264_add8x8_idct8_sse2:

[..]

xmm9, xmm8
SSE2_TRANSPOSE8x8 xmm9, xmm1, xmm7, xmm3, xmm4, xmm0, xmm2, xmm6, xmm5
paddw xmm9, [pw_32 GLOBAL] ; rounding for the >>6 at the end
IDCT8_1D     xmm9, xmm0, xmm6, xmm3, xmm5, xmm4, xmm7, xmm1, xmm8, xmm2

but when I look at the definition of pw_32, it reads:
pw_32: times 8 dw 32
Which means that pw_32 is 32, 32, 32, 32, 32, 32, 32, 32 (as words)

The equivalent MMX code just does a simple:
     add  word [eax], 32
(which is just what the C code does)

I'm a little confused about why pw_32 isn't just 32 followed by the 
relevant number of zeros, and I'm also wondering which of my vector 
registers should be summed with pw_32 (probably the one that loaded 
dct[0][0]), if it matters at all.

It looks like a minor thing, but it confuses me.
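
One way to check whether the two placements of the rounding term agree is 
a small C experiment. This is only a sketch under assumptions: the 
butterfly below is written from memory of the H.264 8-point inverse 
transform and is not a copy of IDCT8_1D, and the names idct8_1d, idct8x8 
and rounding_placements_agree are made up for illustration. The point it 
relies on is that input 0 of each 1-D pass reaches every output with 
weight 1, so a single +32 on dct[0][0] before the first pass should end 
up as +32 on input 0 of all eight second-pass transforms. That would be 
the same as adding eight 32s to the vector holding those inputs (xmm9 in 
the SSE2 code, if I read it correctly).

```c
#include <assert.h>
#include <string.h>

/* One 8-point pass. Assumption: this mirrors the H.264 butterfly.
 * Property used here: in[0] only enters through a0 and a2, so adding
 * k to in[0] adds exactly k to every output. */
static void idct8_1d(const int in[8], int out[8])
{
    const int a0 = in[0] + in[4];
    const int a2 = in[0] - in[4];
    const int a4 = (in[2] >> 1) - in[6];
    const int a6 = (in[6] >> 1) + in[2];
    const int b0 = a0 + a6, b2 = a2 + a4, b4 = a2 - a4, b6 = a0 - a6;
    const int a1 = -in[3] + in[5] - in[7] - (in[7] >> 1);
    const int a3 =  in[1] + in[7] - in[3] - (in[3] >> 1);
    const int a5 = -in[1] + in[7] + in[5] + (in[5] >> 1);
    const int a7 =  in[3] + in[5] + in[1] + (in[1] >> 1);
    const int b1 = (a7 >> 2) + a1;
    const int b3 = a3 + (a5 >> 2);
    const int b5 = (a3 >> 2) - a5;
    const int b7 = a7 - (a1 >> 2);
    out[0] = b0 + b7; out[1] = b2 + b5; out[2] = b4 + b3; out[3] = b6 + b1;
    out[4] = b6 - b1; out[5] = b4 - b3; out[6] = b2 - b5; out[7] = b0 - b7;
}

/* early != 0: add 32 to dct[0][0] before the first pass (C/MMX style).
 * early == 0: add 32 to input 0 of every second-pass transform,
 *             which is what adding a vector of eight 32s does. */
static void idct8x8(int src[8][8], int res[8][8], int early)
{
    int t[8][8], in[8], out[8];
    memcpy(t, src, sizeof(t));
    if (early)
        t[0][0] += 32;
    for (int i = 0; i < 8; i++) {            /* first pass: columns */
        for (int x = 0; x < 8; x++) in[x] = t[x][i];
        idct8_1d(in, out);
        for (int x = 0; x < 8; x++) t[x][i] = out[x];
    }
    if (!early)
        for (int i = 0; i < 8; i++) t[i][0] += 32;
    for (int i = 0; i < 8; i++) {            /* second pass: rows, then >>6 */
        idct8_1d(t[i], out);
        for (int x = 0; x < 8; x++) res[i][x] = out[x] >> 6;
    }
}

/* Returns 1 when both placements give identical results for
 * pseudo-random coefficients derived from seed. */
int rounding_placements_agree(unsigned seed)
{
    int dct[8][8], r_early[8][8], r_late[8][8];
    for (int i = 0; i < 8; i++)
        for (int j = 0; j < 8; j++) {
            seed = seed * 1103515245u + 12345u;
            dct[i][j] = (int)((seed >> 16) % 1024u) - 512;
        }
    idct8x8(dct, r_early, 1);
    idct8x8(dct, r_late, 0);
    return memcmp(r_early, r_late, sizeof(r_early)) == 0;
}

/* Usage: aborts if the two placements ever disagree. */
void check_rounding_equivalence(void)
{
    for (unsigned s = 1; s <= 100; s++)
        assert(rounding_placements_agree(s));
}
```

If this holds for the real IDCT8_1D too, then the register to add the 
eight 32s to would be the one carrying input 0 of the second pass, and a 
vector with 32 followed by zeros would only be right before the first 
pass.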

Another question: I was also wondering whether it is possible to somehow 
get the memory pointed to by uint8_t *dst to be aligned.
It looks like it's not possible from what I've been able to figure out 
in encoder/macroblock.c and in common/common.h, but maybe I was just 
blind...

It looks like I should fiddle with this code in common.h:

             DECLARE_ALIGNED( uint8_t, fenc_buf[24*FENC_STRIDE], 16 );
             DECLARE_ALIGNED( uint8_t, fdec_buf[27*FDEC_STRIDE], 16 );

             /* pointer over mb of the frame to be compressed */
             uint8_t *p_fenc[3];

             /* pointer over mb of the frame to be reconstructed  */
             uint8_t *p_fdec[3];

             /* pointer over mb of the references */
             uint8_t *p_fref[2][16][4+2]; /* last: lN, lH, lV, lHV, cU, cV */
             uint16_t *p_integral[2][16];

but I fear it would break a lot of things if I were to do it...

Non-aligned accesses (especially stores) are a major pain (both slower to 
execute and harder to code), so it would really help if *dst could be 
16-byte aligned.
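
To make the alignment concern concrete, here is a minimal C sketch of how 
a 16-byte aligned buffer can still hand out block pointers that are only 
8-byte aligned. Everything in it is hypothetical: DEMO_STRIDE is an 
assumed value rather than the real FDEC_STRIDE, demo_buf merely stands in 
for fdec_buf, and is_aligned16 and demo_block are illustrative helpers, 
not x264 functions.

```c
#include <assert.h>
#include <stdalign.h>
#include <stddef.h>
#include <stdint.h>

#define DEMO_STRIDE 32  /* assumed stride, not necessarily FDEC_STRIDE */

/* 16-byte aligned scratch buffer standing in for fdec_buf. */
static alignas(16) uint8_t demo_buf[16 * DEMO_STRIDE];

/* Returns 1 if p sits on a 16-byte boundary. */
int is_aligned16(const void *p)
{
    return ((uintptr_t)p & 15u) == 0;
}

/* Address of the 8x8 block whose top-left pixel is at column x, row y. */
uint8_t *demo_block(int x, int y)
{
    return demo_buf + (size_t)y * DEMO_STRIDE + x;
}

/* Usage: blocks at x = 0 or 16 are aligned, the block at x = 8 is not. */
void demo_alignment_check(void)
{
    assert(is_aligned16(demo_block(0, 0)));
    assert(!is_aligned16(demo_block(8, 0)));
    assert(is_aligned16(demo_block(16, 8)));
}
```

With a layout like this, aligning the buffer itself is not enough: every 
second 8x8 block starts 8 bytes into a 16-byte line, so the stores would 
still need an unaligned path unless the block layout changes too.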

Guillaume

-- 
This is the x264-devel mailing-list
To unsubscribe, go to: http://developers.videolan.org/lists.html