[x264-devel] 8x8 and 16x16 Altivec implementation of variance
Guillaume POIRIER
gpoirier at mplayerhq.hu
Thu Jan 22 23:03:52 CET 2009
Hello,
Loren Merritt wrote:
> On Thu, 22 Jan 2009, Guillaume POIRIER wrote:
>
>> + vec_u16_t mule = vec_mule(pix_v, pix_v);
>> + vec_u16_t mulo = vec_mulo(pix_v, pix_v);
>> + vec_u32_t mule_h = vec_u16_to_u32_h(mule);
>> + vec_u32_t mule_l = vec_u16_to_u32_l(mule);
>> + vec_u32_t mulo_h = vec_u16_to_u32_h(mulo);
>> + vec_u32_t mulo_l = vec_u16_to_u32_l(mulo);
>> + vec_u32_t mule_sqr = vec_add(mule_h, mule_l);
>> + vec_u32_t mulo_sqr = vec_add(mulo_h, mulo_l);
>> + vec_u32_t mul_sqr = vec_add(mule_sqr, mulo_sqr);
>> + sqr_v = vec_add(sqr_v, mul_sqr);
>
> replace all that with:
> sqr_v = vec_msum(pix_v, pix_v, sqr_v);
Damn, of course! This leads to a huge speed-up:
var_8x8_c: 71
var_8x8_altivec: 17 (4.2x)
var_16x16_c: 255
var_16x16_altivec: 28 (9.1x)
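
For readers following the thread, here is a minimal portable sketch of why the single vec_msum call can stand in for the whole multiply/widen/add sequence. It is a scalar model, not the x264 code: the function names (msum_u8_model, sum_sqr_msum, sum_sqr_ref) are hypothetical, and it assumes the unsigned-byte form of vec_msum, where each of the four 32-bit lanes accumulates the products of its own four byte pairs. The per-lane partial sums differ from those of the vec_mule/vec_mulo version, but only the total across all lanes matters for the sum of squares.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Scalar model of vec_msum on unsigned bytes (assumed semantics):
 * lane i adds a[4i..4i+3] * b[4i..4i+3] into acc[i]. */
static void msum_u8_model(const uint8_t a[16], const uint8_t b[16],
                          uint32_t acc[4])
{
    for (int lane = 0; lane < 4; lane++)
        for (int j = 0; j < 4; j++)
            acc[lane] += (uint32_t)a[4 * lane + j] * b[4 * lane + j];
}

/* Sum of squared pixels, 16 bytes per step, mirroring
 * sqr_v = vec_msum(pix_v, pix_v, sqr_v) followed by a final
 * horizontal add of the four lanes. */
static uint32_t sum_sqr_msum(const uint8_t *pix, size_t n)
{
    uint32_t acc[4] = {0, 0, 0, 0};
    assert(n % 16 == 0); /* this sketch handles whole 16-byte rows only */
    for (size_t i = 0; i < n; i += 16)
        msum_u8_model(pix + i, pix + i, acc);
    return acc[0] + acc[1] + acc[2] + acc[3];
}

/* Plain scalar reference to compare against. */
static uint32_t sum_sqr_ref(const uint8_t *pix, size_t n)
{
    uint32_t s = 0;
    for (size_t i = 0; i < n; i++)
        s += (uint32_t)pix[i] * pix[i];
    return s;
}
```

Calling sum_sqr_msum and sum_sqr_ref on the same buffer should give the same total, which is the property the replacement relies on: one fused multiply-sum per 16 pixels instead of two multiplies, four widenings and four adds.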
Let's see how I can improve things further with some unrolling now...
Thanks a lot Loren. I _clearly_ need to clean my glasses up.
Guillaume