[x264-devel] Re: [PATCH] Altivec optimizations for quant4x4, quant4x4dc, quant8x8, sub8x8_dct8, sub16x16_dct8, pixel_sa8d_8x8, pixel_sa8d_16x16, idct8

Sun Oct 1 23:32:22 CEST 2006

Hi,

Loren Merritt wrote:
> On Sun, 24 Sep 2006, Guillaume POIRIER wrote:
>> On 9/18/06, Loren Merritt <lorenm at u.washington.edu> wrote:
>>
>>> pixel_sa8d_8x8_core_altivec could use a VEC_DIFF with one of the 
>>> pointers
>>> 8byte aligned.
>>
>> So far I've been able to use VEC_DIFF_H_8BYTE_ALIGNED with the
>> following pattern:
>>
>> +    VEC_DIFF_H_8BYTE_ALIGNED( pix1, i_pix1, pix2, i_pix2, 8, diff0v );
>> +    VEC_DIFF_H( pix1, i_pix1, pix2, i_pix2, 8, diff1v );
>> +    VEC_DIFF_H_8BYTE_ALIGNED( pix1, i_pix1, pix2, i_pix2, 8, diff2v );
>> +    VEC_DIFF_H( pix1, i_pix1, pix2, i_pix2, 8, diff3v );
>> +
>> +    VEC_DIFF_H_8BYTE_ALIGNED( pix1, i_pix1, pix2, i_pix2, 8, diff4v );
>> +    VEC_DIFF_H( pix1, i_pix1, pix2, i_pix2, 8, diff5v );
>> +    VEC_DIFF_H_8BYTE_ALIGNED( pix1, i_pix1, pix2, i_pix2, 8, diff6v );
>> +    VEC_DIFF_H( pix1, i_pix1, pix2, i_pix2, 8, diff7v );
>>
>> I have not looked too closely at this problem, but as far as I've seen,
>> every other call to VEC_DIFF* is done with a
>> different alignment of pix1 and pix2;
>> i.e. each call of VEC_DIFF_H_8BYTE_ALIGNED is done with both pix1 and
>> pix2 8-byte or 16-byte aligned, whereas in the above code the calls
>> to VEC_DIFF are done with a different alignment of pix1 and pix2 (i.e.
>> one is 8-byte aligned and the other is 16-byte aligned).
> 
> Weird. That would indicate that stride is only a multiple of 8. Which 
> does happen for pix2 during slicetype and chroma_me, but only for sad 
> and satd not sa8d.

OK, I was just being careless in my tests: I had only tested with 
checkasm, and not with a real-world encoder. That was foolish of me.

After running some tests, it looks like only 3 useful VEC_DIFF_xx 
patterns are encountered in real life:

- Both arrays are always at least 16-byte aligned, and i_pix1 and 
i_pix2 are multiples of 16. That means that no special trick is needed 
to load a full line... and that I should maybe create a 
VEC_DIFF_16BYTES_ALIGNED macro.

- Both arrays are 8-byte aligned and i_pix1 and i_pix2 are multiples of 
16. That means that all loads need some permutation, and that 
VEC_DIFF_H_8BYTE_ALIGNED() has to be used everywhere.

Note that the 2 cases above are by far the most common.

The third case is when i_pix2 is only a multiple of 8, so the alignment 
of the memory accesses changes from one row to the next (this case is 
tested in checkasm, but I didn't see it in the real world with the 
options I've tested). In that case, the interleaved 
VEC_DIFF_H_8BYTE_ALIGNED/VEC_DIFF_H takes care of it. This could 
probably be improved somehow, but since that case isn't common in my 
experience, I don't see the point of optimizing it... but I could be 
wrong (as I tested only a small subset of encoding options).
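
To make the three cases concrete, here is a minimal scalar sketch that 
classifies a (pix1, i_pix1, pix2, i_pix2) combination. All names in it 
are hypothetical and none of it is x264 code. It assumes that "8-byte 
aligned" in the second case means both pointers sit at offset 8 within 
a 16-byte line, and that the third case means i_pix2 is a multiple of 8 
but not of 16; both readings are my interpretation, not something 
checked against the encoder.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical names, for illustration only; not part of x264. */
typedef enum {
    DIFF_BOTH_16B,    /* case 1: plain aligned loads are enough      */
    DIFF_BOTH_8B,     /* case 2: every load needs a permutation      */
    DIFF_ALTERNATING, /* case 3: pix2 alignment flips on every row   */
    DIFF_OTHER        /* anything the three cases above do not cover */
} diff_case_t;

static diff_case_t classify_diff_case( const uint8_t *pix1, int i_pix1,
                                       const uint8_t *pix2, int i_pix2 )
{
    assert( i_pix1 > 0 && i_pix2 > 0 );

    /* Offset of each pointer within a 16-byte vector line. */
    unsigned off1 = (unsigned)( (uintptr_t)pix1 % 16 );
    unsigned off2 = (unsigned)( (uintptr_t)pix2 % 16 );

    /* All three cases need 8-byte aligned pointers and a stride1
     * that is a multiple of 16. */
    if( off1 % 8 != 0 || off2 % 8 != 0 || i_pix1 % 16 != 0 )
        return DIFF_OTHER;

    if( i_pix2 % 16 == 0 )
    {
        /* Offsets stay the same on every row. */
        if( off1 == 0 && off2 == 0 ) return DIFF_BOTH_16B;
        if( off1 == 8 && off2 == 8 ) return DIFF_BOTH_8B;
        return DIFF_OTHER;
    }

    /* Assumption: stride2 is a multiple of 8 but not of 16, so the
     * offset of pix2 alternates between 0 and 8 from row to row. */
    if( i_pix2 % 8 == 0 )
        return DIFF_ALTERNATING;

    return DIFF_OTHER;
}
```

Calling something like this from a debug build on every sa8d call would 
be one way to count how often each case really occurs with a given set 
of encoding options.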

>> I'll see what I can do, but I imagine it's possible to make do without
>> using VEC_DIFF (which doesn't care about alignment at all).
> 
> sad, satd, and sa8d can all be optimized for:
> pix1 is aligned to whatever the block size is.
> pix2 is unaligned.

Unaligned as in _any_ alignment, or as in "sometimes 8- or 16-byte 
aligned"?
Also, aren't alignment patterns different for sad, satd, and sa8d?

> stride1 is a multiple of 16.

Yep, I also noted that i_pix1 was always a multiple of 16.

> stride2 is a multiple of 8, and I could easily make it 16.

In the case of sa8d, I just haven't seen a case where it wasn't a 
multiple of 16 (except in checkasm)...
Am I blind? Or which options should I use to trigger stride2 being a 
multiple of 8?

> Additionally, in the current usage of sa8d, pix2 is also aligned to the 
> blocksize. But don't count on that remaining so.

Ah crap! What kind of alignment should I assume in the future? No 
alignment, or 4, 8, ... bytes aligned?
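
For reference while that is open, here is a scalar sketch of what one 
VEC_DIFF_H-style step computes, as I read the call pattern quoted 
above: the 16-bit differences of 8 pixels, after which both pointers 
advance by their strides. The function name and signature are 
hypothetical. Since it reads byte by byte, it makes no alignment 
assumption at all, so it could serve as a baseline to compare any 
alignment-specific version against.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical scalar stand-in for one VEC_DIFF_H-style step;
 * not the actual macro. */
static void diff_row_8( const uint8_t **pix1, int i_pix1,
                        const uint8_t **pix2, int i_pix2,
                        int16_t diff[8] )
{
    assert( pix1 && *pix1 && pix2 && *pix2 && diff );

    /* Byte-wise reads work for any alignment of either pointer. */
    for( int x = 0; x < 8; x++ )
        diff[x] = (int16_t)( (*pix1)[x] - (*pix2)[x] );

    /* Move both pointers to the next row, as the repeated calls
     * with the same pix1/pix2 arguments imply. */
    *pix1 += i_pix1;
    *pix2 += i_pix2;
}
```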

>> Now I have a question regarding a bug I've found in the Altivec quant 
>> code.
>> I've noticed on some encodes I've done with that patch, I'm getting
>> some isolated green or blue blocks that sometimes create green drags
>> on first pass, and on the final encode, I'm just getting blocks that
>> "pop in and pop out" (as in: the motion compensation doesn't turn them
>> into green drags).
>>
>> It _appears_ that the more I activate high quality options (RD,
>> trellis), the less artifacts I'm getting. I imagine that it means that
>> the different codepath taken with high quality options may not trigger
>> the bug as often, or maybe compensate for them.
>>
>> What's funny is that the bug is not reproducible, as in: if I take a
>> sample and encode it once, I'll get some green/blue blocks, say at frames
>> 5 and 7... and if I re-encode, with the same source and the same
>> options, I won't get the blocks at the same frames or at the same
>> locations in the frame.
> 
> The only causes of nondeterminism in single-threaded programs are 
> uninitialized memory and deliberate randomness (e.g. time()). So try 
> valgrind.

I identified the problem (without valgrind, as, sadly, it doesn't exist 
on OSX). It was due to some advanced (PPC-970) compiler options that I 
was adding to configure.mak. I'm pissed, because I wasted a whole lot 
of precious time nailing this problem down...

Guillaume

-- 
This is the x264-devel mailing-list
To unsubscribe, go to: http://developers.videolan.org/lists.html
