[x264-devel] commit: add AltiVec implementation of x264_pixel_var_16x16 and x264_pixel_var_8x8 (Guillaume Poirier)

Sun Jan 25 00:33:11 CET 2009

On Sat, 24 Jan 2009, Guillaume POIRIER wrote:
> On Sat, Jan 24, 2009 at 2:45 AM, Loren Merritt <lorenm at u.washington.edu> wrote:
>> On Fri, 23 Jan 2009, git version control (Guillaume Poirier) wrote:
>>
>>> + sum_v = vec_add( sum_v, vec_sld( sum_v, sum_v, 8 ) );
>>> + sum_v = vec_add( sum_v, vec_sld( sum_v, sum_v, 4 ) );
>>
>> vec_sums?
>
> Only for the 8x8 case then? vec_sums performs a _signed_ saturated sum.
> Theoretically, it may overflow and not produce the same computation in
> the 16x16 case, right?

The 16x16 sum of squares fits in 24 bits (256 * 255^2 = 16646400 < 2^24),
far below the signed 32-bit limit, so saturation is irrelevant.
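
The bound can be checked with a few lines of scalar C. This is only a sketch
of the arithmetic, assuming 8-bit pixels; the helper names
max_sum_of_squares_16x16 and fits_in_bits are made up for illustration and
are not part of x264.

```c
#include <stdint.h>

/* Hypothetical helper: worst-case sum of squares over a 16x16 block,
 * assuming every 8-bit pixel is at its maximum value of 255. */
static uint32_t max_sum_of_squares_16x16(void)
{
    uint32_t sum = 0;
    for (int i = 0; i < 16 * 16; i++)
        sum += 255u * 255u;
    return sum;
}

/* Hypothetical helper: nonzero if v is representable in `bits` unsigned bits. */
static int fits_in_bits(uint32_t v, unsigned bits)
{
    return bits >= 32 || v < (UINT32_C(1) << bits);
}
```

Calling fits_in_bits( max_sum_of_squares_16x16(), 24 ) returns nonzero, so a
signed saturating 32-bit sum never gets near its clamp for this block size.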

>>> + pix0_v = vec_perm(pix0_v, pix0_v, perm0);
>>> + pix1_v = vec_perm(pix1_v, pix1_v, perm1);
>>> + vec_u8_t pix_v = vec_mergeh(pix0_v, pix1_v);
>>
>> This can be a single vec_perm. The map then can't be generated by
>> vec_lvsl, but there's only 4 possibilities (2 if you make stride mod16),
>> so LUT it.
>
> I implemented that, but I'd need your help here. How come the mere
> computation of the array index:
>
> vec_u8_t perm_tab[] = {
>     CV(0x00, 0x01, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07,
>        0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17),
>     CV(0x08, 0x09, 0x0A, 0x0B, 0x0C, 0x0D, 0x0E, 0x0F,
>        0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17),
>     CV(0x00, 0x01, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07,
>        0x18, 0x19, 0x1A, 0x1B, 0x1C, 0x1D, 0x1E, 0x1F),
>     CV(0x08, 0x09, 0x0A, 0x0B, 0x0C, 0x0D, 0x0E, 0x0F,
>        0x18, 0x19, 0x1A, 0x1B, 0x1C, 0x1D, 0x1E, 0x1F)
> };
> vec_u8_t perm = perm_tab[!((unsigned long)pix & 0xF)<<1 & !(i_stride & 0x8)];
>
> takes 17 cycles!! That doubles the overall time spent on the routine!
>
> I have no experience in using LUT: what's the right way to compute the
> index fast?

perm_tab needs to be static const; otherwise it gets written to the stack
at every function call.

gcc might also be failing to optimize the ! (it's not a simple arithmetic 
op). Fix that by
perm_tab[(((uintptr_t)pix & 8) >> 3) + ((i_stride & 8) >> 2)]

And then promote chroma planes to 16 byte alignment, and change it to
perm_tab[((uintptr_t)pix & 8) >> 3]
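
For readers without an AltiVec machine at hand, the lookup can be sketched in
portable scalar C. Everything here is an assumption made for illustration:
perm_emul is a byte-wise stand-in for vec_perm, the table holds plain bytes
instead of vec_u8_t, and perm_index simply wraps the two-bit index expression
given above. It says nothing about how the final x264 code is written.

```c
#include <stdint.h>

/* Scalar stand-in for vec_perm (hypothetical helper): each selector byte
 * picks one of the 32 bytes of the concatenation a:b; only the low 5 bits
 * of the selector are used. */
static void perm_emul(uint8_t out[16], const uint8_t a[16],
                      const uint8_t b[16], const uint8_t sel[16])
{
    for (int i = 0; i < 16; i++) {
        uint8_t s = sel[i] & 0x1F;
        out[i] = s < 16 ? a[s] : b[s - 16];
    }
}

/* static const: the table lives in read-only data and is not rebuilt on the
 * stack at every call. Rows are 0x00-0x07 or 0x08-0x0F from the first
 * vector, followed by 0x10-0x17 or 0x18-0x1F from the second. */
static const uint8_t perm_tab[4][16] = {
    { 0x00, 0x01, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07,
      0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17 },
    { 0x08, 0x09, 0x0A, 0x0B, 0x0C, 0x0D, 0x0E, 0x0F,
      0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17 },
    { 0x00, 0x01, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07,
      0x18, 0x19, 0x1A, 0x1B, 0x1C, 0x1D, 0x1E, 0x1F },
    { 0x08, 0x09, 0x0A, 0x0B, 0x0C, 0x0D, 0x0E, 0x0F,
      0x18, 0x19, 0x1A, 0x1B, 0x1C, 0x1D, 0x1E, 0x1F },
};

/* Index from plain shifts and masks, no logical negation:
 * bit 0 comes from the pointer, bit 1 from the stride. */
static unsigned perm_index(const uint8_t *pix, int i_stride)
{
    return (unsigned)((((uintptr_t)pix & 8) >> 3) + ((i_stride & 8) >> 2));
}
```

A call such as perm_emul( out, row0, row1, perm_tab[perm_index( pix, i_stride )] )
then does the work of the two vec_perm calls plus vec_mergeh in one step. With
16-byte-aligned strides, bit 1 of the index is always zero and only the first
two rows of the table are ever used.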

--Loren Merritt