[x264-devel] 8x8 and 16x16 Altivec implementation of variance

Guillaume POIRIER gpoirier at mplayerhq.hu
Thu Jan 22 23:03:52 CET 2009


Hello,

Loren Merritt wrote:
> On Thu, 22 Jan 2009, Guillaume POIRIER wrote:
> 
>> + vec_u16_t mule = vec_mule(pix_v, pix_v);
>> + vec_u16_t mulo = vec_mulo(pix_v, pix_v);
>> + vec_u32_t mule_h = vec_u16_to_u32_h(mule);
>> + vec_u32_t mule_l = vec_u16_to_u32_l(mule);
>> + vec_u32_t mulo_h = vec_u16_to_u32_h(mulo);
>> + vec_u32_t mulo_l = vec_u16_to_u32_l(mulo);
>> + vec_u32_t mule_sqr = vec_add(mule_h, mule_l);
>> + vec_u32_t mulo_sqr = vec_add(mulo_h, mulo_l);
>> + vec_u32_t mul_sqr = vec_add(mule_sqr, mulo_sqr);
>> + sqr_v = vec_add(sqr_v, mul_sqr);
> 
> replace all that with:
> sqr_v = vec_msum(pix_v, pix_v, sqr_v);


Damn, of course! This leads to a huge speed-up:
var_8x8_c: 71
var_8x8_altivec: 17 (4.2x)
var_16x16_c: 255
var_16x16_altivec: 28 (9.1x)
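The speed-up comes from a single vec_msum replacing the whole ten-statement mule/mulo sequence. The portable scalar C sketch below is not AltiVec code and is not taken from the patch. It emulates the per-lane semantics of vec_msum on unsigned chars (vmsumubm: each 32-bit lane accumulates four byte products, modulo 2^32) alongside the quoted mule/mulo sequence, so the two can be compared. It assumes big-endian element numbering and that x264's vec_u16_to_u32_h/_l zero-extend elements 0..3 and 4..7 of a 16-bit vector. All helper names (emu_msum_u8, emu_old_sequence, sqr_16xh, var_16x16_sketch) are hypothetical. The variance helper uses the textbook identity N*Var = sum(x^2) - sum(x)^2/N and does not necessarily match x264's exact return convention.

```c
#include <assert.h>
#include <stdint.h>

/* Emulates vec_msum(vector unsigned char a, vector unsigned char b,
 * vector unsigned int acc) (vmsumubm): 32-bit lane i accumulates
 * a[4i+j] * b[4i+j] for j = 0..3, wrapping modulo 2^32. */
static void emu_msum_u8(const uint8_t a[16], const uint8_t b[16], uint32_t acc[4])
{
    for (int i = 0; i < 4; i++)
        for (int j = 0; j < 4; j++)
            acc[i] += (uint32_t)a[4 * i + j] * b[4 * i + j];
}

/* Emulates the quoted mule/mulo sequence (assumption: vec_u16_to_u32_h/_l
 * zero-extend elements 0..3 and 4..7 of a 16-bit vector, big-endian order). */
static void emu_old_sequence(const uint8_t pix[16], uint32_t acc[4])
{
    uint16_t mule[8], mulo[8];
    for (int k = 0; k < 8; k++) {
        mule[k] = (uint16_t)(pix[2 * k] * pix[2 * k]);         /* vec_mule: even bytes */
        mulo[k] = (uint16_t)(pix[2 * k + 1] * pix[2 * k + 1]); /* vec_mulo: odd bytes */
    }
    for (int i = 0; i < 4; i++)
        acc[i] += ((uint32_t)mule[i] + mule[4 + i])   /* mule_sqr = mule_h + mule_l */
                + ((uint32_t)mulo[i] + mulo[4 + i]);  /* mulo_sqr = mulo_h + mulo_l */
}

/* Sum of squares of a 16-pixel-wide block, one 16-byte row per step,
 * reduced horizontally across the four 32-bit lanes at the end. */
static uint32_t sqr_16xh(const uint8_t *pix, int stride, int h, int use_msum)
{
    uint32_t acc[4] = {0, 0, 0, 0};
    for (int y = 0; y < h; y++, pix += stride) {
        if (use_msum)
            emu_msum_u8(pix, pix, acc); /* sqr_v = vec_msum(pix_v, pix_v, sqr_v); */
        else
            emu_old_sequence(pix, acc);
    }
    return acc[0] + acc[1] + acc[2] + acc[3];
}

/* Hypothetical helper: 256 * variance of a 16x16 block, computed as
 * sum(x^2) - floor(sum(x)^2 / 256), using the msum path for the squares. */
static uint32_t var_16x16_sketch(const uint8_t *pix, int stride)
{
    uint32_t sum = 0;
    for (int y = 0; y < 16; y++)
        for (int x = 0; x < 16; x++)
            sum += pix[y * stride + x];
    uint64_t sqr = sqr_16xh(pix, stride, 16, 1);
    return (uint32_t)(sqr - (((uint64_t)sum * sum) >> 8));
}
```

For any 16x16 block, sqr_16xh(blk, 16, 16, 1) and sqr_16xh(blk, 16, 16, 0) return the same total, even though the lanes hold different partial sums: with msum, lane 0 holds the squares of bytes 0..3, while with the old sequence it holds bytes 0, 1, 8 and 9. Only the horizontal sum matters for the variance, so the single-instruction version is a drop-in replacement.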

Let's see how I can improve things further with some unrolling now...

Thanks a lot Loren. I _clearly_ need to clean my glasses up.

Guillaume


More information about the x264-devel mailing list