[x264-devel] [PATCH] faster mc_chroma_altivec

Wed Feb 4 18:06:50 CET 2009

Hello,

2009/2/4  <maaanuuu at gmx.net>:

> the attached patch now doesn't fail on dimensions that are not mod 16. I'll
> check whether the mod 16 chroma stride patch allows the use of VEC_LOAD.
>
>>> mc_chroma_w4_altivec now needs dst to be aligned to a 4 byte boundary, is
>>> that OK?
>>
>> Fine by me. The question now is off course: it that ensured by proper
>> aligned allocations and strides?
>
> I used grep to search for all calls to mc_chroma. All strides for dst are
> either 8, 16 or FDEC_STRIDE, so that should be fine. The alignments seem to
> be OK too.

Cool.

>>> Finally, I put width == 2 into its own function because at the moment the
>>> code that is used for it is actually slower than plain C.
>>
>> I'm not surprised. There's too little work to do to use AltiVec here.
>> Did you try to do some pseudo-SIMD using general purpose registers?
>
> Do you mean parallel loop unrolling? I tried that and it wasn't faster on a
> PPC970.

Nope, more something like what's done in x264/common/predict.c or
what's described here too http://guru.multimedia.cx/simd-without-simd/
The key idea is to use a variable of a bigger size (say a short to
represent two char) to compute 2 char values at a time.

Note that I don't know if it's possible to write an efficient
pseudo-SIMD version of width==2 code.

>>> The patch passes checkasm and leads to a 2-3% performance gain overall
>>> using
>>> the default settings. Please note that I have NOT done extensive
>>> regression
>>> tests.
>>
>> Please do so. Run a encode of several hundred frames with and without
>> this patch, and make sure that the MD5 matches.
>
> I ran encodings using different input dimensions and different parameters,
> and the output files were always identical.

Ok, good.

>>> Comments and suggestions are welcome :)
>>
>> Well, if you ask...
>>
>> +        src0v_16A = vec_u8_to_u16( src0v_8A );
>> +        src0v_16B = vec_u8_to_u16( src0v_8B );
>> +        dstv_16A = vec_mladd( src0v_16A, coeff0v, k32v );
>> +        dstv_16B = vec_mladd( src0v_16B, coeff0v, k32v );
>>
>> Could you put this in a macro to factorize some code?
>
> Done

Thanks.

>> So far, this new code looks alright, though I have to admit I'd prefer
>> smaller, self-contained patches to simplify the reviewing process...
>
> How should I split them up? I tried to make the patch cleaner a little bit.

This new patch is much more readable, no need to bother splitting it up.

I'll have a closer look at it tonight.

Cheers,

Guillaume
-- 
Only a very small fraction of our DNA does anything; the rest is all
comments and ifdefs.