[x264-devel] [PATCH] faster mc_chroma_altivec

Mon Feb 2 23:34:51 CET 2009

Hello,

2009/2/2  <maaanuuu at gmx.net>:
> Hello,
>
> the attached patch improves mc_chroma_altivec:
>
> Now VEC_LOAD is used instead of VEC_LOAD_G, vec_mladd is used more efficiently,
> and the loop is unrolled 2x.

I see now that vec_mladd indeed wasn't put to good use. Good catch.
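For reference when reviewing the vectorized loop, here is a scalar model of what chroma MC computes. This is a minimal sketch, assuming the usual H.264 eighth-pel bilinear chroma weights with +32 rounding before the >> 6 (consistent with the k32v addend in the patch), and assuming the 2x unroll is over rows. The function name and signature are hypothetical, not x264's. Unrolling over rows lets the source row that supplies the bottom taps of output row y also supply the top taps of row y+1, which is one plausible reason the unroll pays off.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical scalar model of chroma MC (not x264's actual function).
 * Bilinear interpolation with eighth-pel offsets dx, dy in 0..7:
 *   dst = (cA*TL + cB*TR + cC*BL + cD*BR + 32) >> 6
 * Reads (width + 1) x (height + 1) source pixels. */
static void mc_chroma_ref(uint8_t *dst, int dst_stride,
                          const uint8_t *src, int src_stride,
                          int dx, int dy, int width, int height)
{
    assert(dx >= 0 && dx < 8 && dy >= 0 && dy < 8);
    assert(height % 2 == 0); /* the 2x row unroll below assumes even height */

    const int cA = (8 - dx) * (8 - dy);
    const int cB = dx * (8 - dy);
    const int cC = (8 - dx) * dy;
    const int cD = dx * dy; /* cA + cB + cC + cD == 64 */

    for (int y = 0; y < height; y += 2) {
        const uint8_t *r0 = src;
        const uint8_t *r1 = src + src_stride; /* bottom taps of row y, top taps of row y+1 */
        const uint8_t *r2 = src + 2 * src_stride;
        for (int x = 0; x < width; x++) {
            dst[x] = (uint8_t)((cA * r0[x] + cB * r0[x + 1] +
                                cC * r1[x] + cD * r1[x + 1] + 32) >> 6);
            dst[x + dst_stride] = (uint8_t)((cA * r1[x] + cB * r1[x + 1] +
                                             cC * r2[x] + cD * r2[x + 1] + 32) >> 6);
        }
        src += 2 * src_stride;
        dst += 2 * dst_stride;
    }
}
```

With dx = dy = 0 the model copies the source unchanged; with dx = 4, dy = 0 it reduces to the rounded horizontal average (a + b + 1) >> 1.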

> mc_chroma_w4_altivec now needs dst to be aligned to a 4-byte boundary; is
> that OK?

Fine by me. The question now is, of course: is that ensured by properly
aligned allocations and strides?
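One way to pin down what "properly aligned" has to mean here: if the first destination row is 4-byte aligned and the stride is a multiple of 4, then every later row start dst + y*stride is 4-byte aligned as well. The helpers below are a hypothetical debug check (not part of x264) that a caller could assert on before dispatching to the w4 function; the assumption that the w4 code writes each row with a single 32-bit store is mine, not stated in the patch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Nonzero if p is aligned to 'align' bytes; align must be a power of two. */
static int is_aligned(const void *p, uintptr_t align)
{
    return ((uintptr_t)p & (align - 1)) == 0;
}

/* Hypothetical precondition for a w4 chroma MC that (by assumption) writes
 * each 4-pixel row with one aligned 32-bit store: the first row must be
 * 4-byte aligned, and the stride must be a multiple of 4 so that every
 * following row start stays aligned too. */
static int w4_dst_ok(const uint8_t *dst, ptrdiff_t stride)
{
    return is_aligned(dst, 4) && stride % 4 == 0;
}
```

Both conditions are needed: an aligned dst with an odd stride breaks alignment from the second row on, and an aligned stride cannot fix a misaligned starting column.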

> Finally, I put width == 2 into its own function because at the moment the
> code that is used for it is actually slower than plain C.

I'm not surprised. There's too little work to do to use AltiVec here.
Did you try some pseudo-SIMD using general-purpose registers?
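As a rough illustration of the pseudo-SIMD idea (a hypothetical sketch, not code from the patch or from x264): for width 2, both output pixels of a row can be computed in one 32-bit general-purpose register by packing them into two 16-bit lanes. Because the four weights sum to 64, each lane holds at most 64*255 + 32 = 16352 before the shift, so no carry ever crosses into the neighbouring lane.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical width-2 chroma MC using SWAR in a 32-bit register.
 * Low 16-bit lane = output pixel 0, high lane = output pixel 1. */
static void mc_chroma_w2_swar(uint8_t *dst, int dst_stride,
                              const uint8_t *src, int src_stride,
                              int dx, int dy, int height)
{
    assert(dx >= 0 && dx < 8 && dy >= 0 && dy < 8);

    const uint32_t cA = (uint32_t)((8 - dx) * (8 - dy));
    const uint32_t cB = (uint32_t)(dx * (8 - dy));
    const uint32_t cC = (uint32_t)((8 - dx) * dy);
    const uint32_t cD = (uint32_t)(dx * dy);

    for (int y = 0; y < height; y++) {
        const uint8_t *r0 = src, *r1 = src + src_stride;
        /* Pack the taps for pixel 0 (low lane) and pixel 1 (high lane). */
        uint32_t tl = r0[0] | (uint32_t)r0[1] << 16;
        uint32_t tr = r0[1] | (uint32_t)r0[2] << 16;
        uint32_t bl = r1[0] | (uint32_t)r1[1] << 16;
        uint32_t br = r1[1] | (uint32_t)r1[2] << 16;
        /* Four multiplies produce two pixels; +32 rounding in each lane. */
        uint32_t acc = cA * tl + cB * tr + cC * bl + cD * br + 0x00200020u;
        /* The shift moves the high lane's low 6 bits into bits 10..15 of
         * the low lane; the mask keeps only the two 8-bit results. */
        acc = (acc >> 6) & 0x00FF00FFu;
        dst[0] = (uint8_t)acc;
        dst[1] = (uint8_t)(acc >> 16);
        src += src_stride;
        dst += dst_stride;
    }
}
```

This halves the multiply count compared with the scalar C formula at the cost of some packing; whether it actually beats plain C on a given PowerPC core would have to be measured.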

> The patch passes checkasm and leads to a 2-3% performance gain overall using
> the default settings. Please note that I have NOT done extensive regression
> tests.

Please do so. Run an encode of several hundred frames with and without
this patch, and make sure that the MD5 sums of the outputs match.

> Comments and suggestions are welcome :)

Well, if you ask...

+        src0v_16A = vec_u8_to_u16( src0v_8A );
+        src0v_16B = vec_u8_to_u16( src0v_8B );
+        dstv_16A = vec_mladd( src0v_16A, coeff0v, k32v );
+        dstv_16B = vec_mladd( src0v_16B, coeff0v, k32v );

Could you put this in a macro to factor out the duplicated code?
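A sketch of the kind of macro that could absorb the A/B duplication. To keep it compilable on any machine, vec_u8_to_u16 and vec_mladd are replaced here by hypothetical scalar stand-ins on 8-lane arrays; in the real code the macro body would call the AltiVec operations directly. Passing the addends as parameters lets one macro serve both the first tap (addend k32v) and, assuming the later taps accumulate into dstv_16A/dstv_16B, the remaining taps as well.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical scalar stand-ins for 8-lane vectors (illustration only). */
typedef struct { uint8_t  v[8]; } u8x8;
typedef struct { uint16_t v[8]; } u16x8;

/* Models vec_u8_to_u16: zero-extend each byte to 16 bits. */
static u16x8 u8_to_u16(u8x8 a)
{
    u16x8 r;
    for (int i = 0; i < 8; i++) r.v[i] = a.v[i];
    return r;
}

/* Models vec_mladd on unsigned halfwords: a*b + c, modulo 2^16 per lane. */
static u16x8 mladd(u16x8 a, u16x8 b, u16x8 c)
{
    u16x8 r;
    for (int i = 0; i < 8; i++)
        r.v[i] = (uint16_t)((uint32_t)a.v[i] * b.v[i] + c.v[i]);
    return r;
}

/* Widen the A and B halves and multiply-add both with the same coefficient. */
#define WIDEN_MLADD_AB(srcA, srcB, coeff, addA, addB, dstA, dstB) \
    do {                                                          \
        (dstA) = mladd(u8_to_u16(srcA), (coeff), (addA));         \
        (dstB) = mladd(u8_to_u16(srcB), (coeff), (addB));         \
    } while (0)
```

The first tap would then read WIDEN_MLADD_AB(src0v_8A, src0v_8B, coeff0v, k32v, k32v, dstv_16A, dstv_16B), and each later tap would pass dstv_16A/dstv_16B as both addends and destinations.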

So far, this new code looks alright, though I have to admit I'd prefer
smaller, self-contained patches to simplify the reviewing process...

Cheers,

Guillaume
-- 
Only a very small fraction of our DNA does anything; the rest is all
comments and ifdefs.

Calvin Trillin  - "Health food makes me sick."