[x264-devel] [PATCH] faster mc_chroma_altivec
Guillaume POIRIER
gpoirier at mplayerhq.hu
Wed Feb 4 18:06:50 CET 2009
Hello,
2009/2/4 <maaanuuu at gmx.net>:
> the attached patch now doesn't fail on dimensions that are not mod 16. I'll
> check whether the mod 16 chroma stride patch allows the use of VEC_LOAD.
>
>>> mc_chroma_w4_altivec now needs dst to be aligned to a 4 byte boundary, is
>>> that OK?
>>
>> Fine by me. The question now is off course: it that ensured by proper
>> aligned allocations and strides?
>
> I used grep to search for all calls to mc_chroma. All strides for dst are
> either 8, 16 or FDEC_STRIDE, so that should be fine. The alignments seem to
> be OK too.
Cool.
>>> Finally, I put width == 2 into its own function because at the moment the
>>> code that is used for it is actually slower than plain C.
>>
>> I'm not surprised. There's too little work to do to use AltiVec here.
>> Did you try to do some pseudo-SIMD using general purpose registers?
>
> Do you mean parallel loop unrolling? I tried that and it wasn't faster on a
> PPC970.
Nope, more something like what's done in x264/common/predict.c or
what's described here too http://guru.multimedia.cx/simd-without-simd/
The key idea is to use a variable of a bigger size (say a short to
represent two char) to compute 2 char values at a time.
Note that I don't know if it's possible to write an efficient
pseudo-SIMD version of width==2 code.
>>> The patch passes checkasm and leads to a 2-3% performance gain overall
>>> using
>>> the default settings. Please note that I have NOT done extensive
>>> regression
>>> tests.
>>
>> Please do so. Run a encode of several hundred frames with and without
>> this patch, and make sure that the MD5 matches.
>
> I ran encodings using different input dimensions and different parameters,
> and the output files were always identical.
Ok, good.
>>> Comments and suggestions are welcome :)
>>
>> Well, if you ask...
>>
>> + src0v_16A = vec_u8_to_u16( src0v_8A );
>> + src0v_16B = vec_u8_to_u16( src0v_8B );
>> + dstv_16A = vec_mladd( src0v_16A, coeff0v, k32v );
>> + dstv_16B = vec_mladd( src0v_16B, coeff0v, k32v );
>>
>> Could you put this in a macro to factorize some code?
>
> Done
Thanks.
>> So far, this new code looks alright, though I have to admit I'd prefer
>> smaller, self-contained patches to simplify the reviewing process...
>
> How should I split them up? I tried to make the patch cleaner a little bit.
This new patch is much more readable, no need to bother splitting it up.
I'll have a closer look at it tonight.
Cheers,
Guillaume
--
Only a very small fraction of our DNA does anything; the rest is all
comments and ifdefs.
More information about the x264-devel
mailing list