[x264-devel] Re: [PATCH] Altivec optimizations for quant4x4, quant4x4dc, quant8x8, sub8x8_dct8, sub16x16_dct8, pixel_sa8d_8x8, pixel_sa8d_16x16, *idct8*
Guillaume POIRIER
gpoirier at mplayerhq.hu
Sun Oct 1 23:32:22 CEST 2006
Hi,
Loren Merritt wrote:
> On Sun, 24 Sep 2006, Guillaume POIRIER wrote:
>> On 9/18/06, Loren Merritt <lorenm at u.washington.edu> wrote:
>>
>>> pixel_sa8d_8x8_core_altivec could use a VEC_DIFF with one of the
>>> pointers
>>> 8byte aligned.
>>
>> So far I've been able to use VEC_DIFF_H_8BYTE_ALIGNED with the
>> following pattern:
>>
>> + VEC_DIFF_H_8BYTE_ALIGNED( pix1, i_pix1, pix2, i_pix2, 8, diff0v );
>> + VEC_DIFF_H( pix1, i_pix1, pix2, i_pix2, 8, diff1v );
>> + VEC_DIFF_H_8BYTE_ALIGNED( pix1, i_pix1, pix2, i_pix2, 8, diff2v );
>> + VEC_DIFF_H( pix1, i_pix1, pix2, i_pix2, 8, diff3v );
>> +
>> + VEC_DIFF_H_8BYTE_ALIGNED( pix1, i_pix1, pix2, i_pix2, 8, diff4v );
>> + VEC_DIFF_H( pix1, i_pix1, pix2, i_pix2, 8, diff5v );
>> + VEC_DIFF_H_8BYTE_ALIGNED( pix1, i_pix1, pix2, i_pix2, 8, diff6v );
>> + VEC_DIFF_H( pix1, i_pix1, pix2, i_pix2, 8, diff7v );
>>
>> I have not looked too much at this problem, but as far as I've seen,
>> it looks like every other call to VEC_DIFF* is done with a
>> different alignment of pix1 and pix2;
>> i.e. each call of VEC_DIFF_H_8BYTE_ALIGNED is done with both pix1 and
>> pix2 8-byte or 16-byte aligned, whereas in the pattern above the calls
>> to VEC_DIFF are done with a different alignment of pix1 and pix2 (i.e.
>> one is 8-byte aligned and the other is 16-byte aligned).
>
> Weird. That would indicate that stride is only a multiple of 8. Which
> does happen for pix2 during slicetype and chroma_me, but only for sad
> and satd not sa8d.
Ok, I was just being careless in my tests. I had only tested with
checkasm, and not with a real-world encoder. That was foolish
of me.
After running some tests, it looks like only 3 useful VEC_DIFF_xx
patterns are encountered in real life:
- Both arrays are always at least 16-byte aligned, and i_pix1 and
i_pix2 are multiples of 16. That means that no special trick is needed
to load a full line... and that I should maybe create a
VEC_DIFF_16BYTES_ALIGNED macro.
- Both arrays are 8-byte aligned and i_pix1 and i_pix2 are multiples of
16. That means that all loads need some permutation, and that
VEC_DIFF_H_8BYTE_ALIGNED() has to be used everywhere.
Note that the 2 cases above are by far the most common.
The third case is when i_pix2 is only a multiple of 8, so the alignment
of the memory accesses varies from one call to the next (this is a case
tested in checkasm, but one that I didn't see in the real world with the
options I've tested). In that case, the interleaved
VEC_DIFF_H_8BYTE_ALIGNED/VEC_DIFF_H pattern takes care of it. This could
probably be improved somehow, but since that case isn't common in my
experience, I don't see the point of optimizing it... but I could be
wrong (as I only tested a small subset of encoding options).
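
To make the three cases above concrete, here is a minimal sketch in
plain C (no Altivec) that classifies a pair of blocks from their base
addresses and strides. The enum, the function names and the DIFF_OTHER
fallback are hypothetical and are not part of x264; addresses are passed
as integers so that the logic can be checked without real buffers.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical labels for the patterns described above (not x264 names). */
enum diff_pattern {
    DIFF_ALL_16_ALIGNED, /* case 1: every row of both blocks is 16-byte aligned */
    DIFF_ALL_8_ALIGNED,  /* case 2: strides are multiples of 16, rows only 8-byte aligned */
    DIFF_ALTERNATING,    /* case 3: a stride is only a multiple of 8 */
    DIFF_OTHER           /* anything else: fall back to the fully unaligned path */
};

/* Offset of a given row inside its 16-byte line: base + row * stride, mod 16. */
unsigned row_offset16(uintptr_t base, int stride, int row)
{
    assert(stride >= 0 && row >= 0); /* this sketch only handles forward strides */
    return (unsigned)((base + (uintptr_t)row * (uintptr_t)stride) & 15u);
}

/* Decide which pattern applies to a (pix1, i_pix1, pix2, i_pix2) pair. */
enum diff_pattern classify(uintptr_t pix1, int i_pix1,
                           uintptr_t pix2, int i_pix2)
{
    int strides_16 = (i_pix1 % 16 == 0) && (i_pix2 % 16 == 0);
    int strides_8  = (i_pix1 % 8 == 0) && (i_pix2 % 8 == 0);
    int bases_16   = (pix1 % 16 == 0) && (pix2 % 16 == 0);
    int bases_8    = (pix1 % 8 == 0) && (pix2 % 8 == 0);

    if (strides_16 && bases_16)
        return DIFF_ALL_16_ALIGNED;
    if (strides_16 && bases_8)
        return DIFF_ALL_8_ALIGNED;
    if (strides_8 && bases_8)
        return DIFF_ALTERNATING;
    return DIFF_OTHER;
}
```

For example, with a 16-byte-aligned pix2 and i_pix2 = 24, row_offset16()
gives 0, 8, 0, 8, ... for successive rows, which is the alternation that
the interleaved VEC_DIFF_H_8BYTE_ALIGNED/VEC_DIFF_H pattern copes with.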
>> I'll see what I can do, but I imagine it's possible to make do without
>> using VEC_DIFF (which doesn't care about alignment at all).
>
> sad, satd, and sa8d can all be optimized for:
> pix1 is aligned to whatever the block size is.
> pix2 is unaligned.
Unaligned as in _any_ alignment, or as in "sometimes 8- or 16-byte
aligned"?
Also, aren't alignment patterns different for sad, satd, and sa8d?
> stride1 is a multiple of 16.
Yep, I also noticed that i_pix1 was always a multiple of 16.
> stride2 is a multiple of 8, and I could easily make it 16.
In the case of sa8d, I just haven't seen a case where it wasn't a
multiple of 16 (except in checkasm)...
Am I blind? Or which options should I use to trigger
stride2 being only a multiple of 8?
> Additionally, in the current usage of sa8d, pix2 is also aligned to the
> blocksize. But don't count on that remaining so.
Ah crap! What kind of alignment should I assume in the future? No
alignment at all, or 4-byte, 8-byte, ... alignment?
>> Now I have a question regarding a bug I've found in the Altivec quant
>> code.
>> I've noticed on some encodes I've done with that patch, I'm getting
>> some isolated green or blue blocks that sometimes create green drags
>> on first pass, and on the final encode, I'm just getting blocks that
>> "pop in and pop out" (as in: the motion compensation doesn't turn them
>> into green drags).
>>
>> It _appears_ that the more high quality options (RD, trellis) I
>> activate, the fewer artifacts I'm getting. I imagine that it means that
>> the different codepath taken with high quality options may not trigger
>> the bug as often, or maybe compensates for it.
>>
>> What's funny is that the bug is not reproducible, as in: if I take a
>> sample and encode it once, I'll get some green/blue blocks, say at
>> frames 5 and 7... and if I re-encode, with the same source and the same
>> options, I won't get the blocks at the same frames or at the same
>> locations in the frame.
>
> The only causes of nondeterminism in single-threaded programs are
> uninitialized memory and deliberate randomness (e.g. time()). So try
> valgrind.
I identified the problem (without valgrind, as, sadly, it doesn't exist
on OSX). It was due to some advanced (PPC-970) compiler options that I
was adding to configure.mak. I'm pissed, because I wasted a whole lot
of precious time nailing this problem down...
Guillaume
--
This is the x264-devel mailing-list
To unsubscribe, go to: http://developers.videolan.org/lists.html