[x264-devel] commit: add AltiVec implementation of x264_pixel_var_16x16 and x264_pixel_var_8x8 (Guillaume Poirier)

Sat Jan 24 23:29:26 CET 2009

Hello,

On Sat, Jan 24, 2009 at 2:45 AM, Loren Merritt <lorenm at u.washington.edu> wrote:
> On Fri, 23 Jan 2009, git version control wrote:
>
>> x264 | branch: master | Guillaume Poirier <gpoirier at mplayerhq.hu> | Fri Jan 23 13:53:06 2009 -0800| [71ac0a34bc0460bf67da68f300e4150bc50d9aae] | committer: Guillaume Poirier
>
>> + sum_v = vec_add( sum_v, vec_sld( sum_v, sum_v, 8 ) );
>> + sum_v = vec_add( sum_v, vec_sld( sum_v, sum_v, 4 ) );
>
> vec_sums?

Only for the 8x8 case then? vec_sums performs a _signed_ saturated sum.
In theory it could saturate and give a different result in the 16x16
case, right?
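
A quick bound on that worry, as a sketch only. It assumes 8-bit pixels
and that every lane of sum_v holds a non-negative partial sum; the
helper names are made up for illustration and are not in the patch:

```c
#include <stdint.h>

/* Hypothetical helpers, not part of x264: worst-case totals for an
 * n x n block of 8-bit pixels (every pixel equal to 255). */
static int64_t max_pixel_sum(int n)  { return (int64_t)n * n * 255; }
static int64_t max_square_sum(int n) { return (int64_t)n * n * 255 * 255; }

/* Nonzero if neither total can exceed the signed 32-bit range, i.e. a
 * signed saturating sum of non-negative partial sums would never clamp. */
static int fits_in_s32(int n)
{
    return max_pixel_sum(n) <= INT32_MAX && max_square_sum(n) <= INT32_MAX;
}
```

Under those assumptions the 16x16 worst case is 65280 for the sum and
16646400 for the sum of squares, both far below 2^31 - 1, so saturation
would only matter if a lane could hold something other than a
non-negative partial sum.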

I tried vec_sums before, but it wasn't faster: the result ends up in
the last element (index 3) of the vector, so I had to vec_splat it
before storing it in the scalar 'sum' variable with vec_ste.

I just had a new idea to retrieve the result without having to
vec_splat it: replace the scalar "sum" with an array. It's now faster
by one cycle (17 cycles => 16 cycles). The result is in the attached
patch.
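
To make the two reductions concrete, here is a plain-C model of the
lane arithmetic. It is not AltiVec code: it assumes four 32-bit lanes
in big-endian element order, and the type v4 and all function names are
invented for this sketch. It models the rotate-and-add sequence quoted
above and the result placement of vec_sums (total in element 3, the
other elements zero):

```c
#include <stdint.h>

typedef struct { uint32_t e[4]; } v4;   /* hypothetical 4 x 32-bit vector model */

/* Models vec_sld(a, a, 4 * words): rotate left by whole 32-bit elements. */
static v4 rotate_words(v4 a, int words)
{
    v4 r;
    for (int i = 0; i < 4; i++)
        r.e[i] = a.e[(i + words) & 3];
    return r;
}

/* Models vec_add on 32-bit lanes: modular (wrapping) addition. */
static v4 add_words(v4 a, v4 b)
{
    v4 r;
    for (int i = 0; i < 4; i++)
        r.e[i] = a.e[i] + b.e[i];
    return r;
}

/* Rotate-and-add reduction: afterwards every lane holds the total. */
static v4 hsum_rotate(v4 s)
{
    s = add_words(s, rotate_words(s, 2));   /* like vec_sld(s, s, 8) */
    s = add_words(s, rotate_words(s, 1));   /* like vec_sld(s, s, 4) */
    return s;
}

/* Models vec_sums(a, zero): signed saturating total, in element 3 only. */
static v4 hsum_sums(v4 a)
{
    int64_t t = 0;
    for (int i = 0; i < 4; i++)
        t += (int32_t)a.e[i];
    if (t > INT32_MAX) t = INT32_MAX;
    if (t < INT32_MIN) t = INT32_MIN;
    v4 r = { { 0, 0, 0, (uint32_t)(int32_t)t } };
    return r;
}
```

In this model any lane can be stored after hsum_rotate, while after
hsum_sums only element 3 is useful. That is presumably why storing into
a four-element array, where the slot matching element 3 receives the
value, avoids the splat.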

>> + pix0_v = vec_perm(pix0_v, pix0_v, perm0);
>> + pix1_v = vec_perm(pix1_v, pix1_v, perm1);
>> + vec_u8_t pix_v = vec_mergeh(pix0_v, pix1_v);
>
> This can be a single vec_perm. The map then can't be generated by
> vec_lvsl, but there's only 4 possibilities (2 if you make stride mod16),
> so LUT it.

I implemented that, but I'd need your help here. How come the mere
computation of the array index:
!((unsigned long)pix & 0xF)<<1 & !(i_stride & 0x8)

takes 17 cycles!! That doubles the overall time spent on the routine!

I have no experience with LUTs: what's the right way to compute the
index fast?
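
For reference, a scalar sketch of the index arithmetic. It assumes the
intent is a 2-bit index built from two flags (pix is 16-byte aligned,
bit 3 of the stride is clear); the function names are hypothetical, and
it says nothing about where the 17 cycles go. In C, `!` binds tighter
than `<<`, which binds tighter than `&`, so the expression above parses
as a value in {0, 2} ANDed with a value in {0, 1}, which is always 0.
Combining the two flags with `|` reaches all four table entries:

```c
#include <stdint.h>

/* The posted expression with the parentheses C implies made explicit.
 * uintptr_t stands in for the posted (unsigned long) cast.
 * (0 or 2) & (0 or 1) is always 0. */
static int lut_index_as_posted(const uint8_t *pix, int i_stride)
{
    return ((!((uintptr_t)pix & 0xF)) << 1) & (!(i_stride & 0x8));
}

/* Variant that ORs the two flag bits together, yielding 0..3. */
static int lut_index_or(const uint8_t *pix, int i_stride)
{
    int pix_aligned = !((uintptr_t)pix & 0xF);  /* 1 if pix is 16-byte aligned */
    int stride_flag = !(i_stride & 0x8);        /* 1 if bit 3 of the stride is clear */
    return (pix_aligned << 1) | stride_flag;
}
```

With the posted form every call selects entry 0 of the table, so the
table lookup and the if/else version may not be choosing the same
permutation vector.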

The crazy thing is that if I replace the table with an if/else to
select the right permutation vector, the new code still takes only 16
cycles overall to execute (i.e. I don't get a measurable speed-up). I
guess it has something to do with how checkasm measures: it executes
the same code several times and takes the best run, so the branch
predictor is already trained and all branches are correctly
predicted.

The attached patch has my implementation of what you suggested.
Could you please have a look at it and tell me what's wrong with the
way I implemented the LUT?

Regards,

Guillaume
-- 
Only a very small fraction of our DNA does anything; the rest is all
comments and ifdefs.

George Burns  - "You can't help getting older, but you don't have to get old."
-------------- next part --------------
A non-text attachment was scrubbed...
Name: variance_optimization.0.diff
Type: application/octet-stream
Size: 2192 bytes
Desc: not available
Url : http://mailman.videolan.org/pipermail/x264-devel/attachments/20090124/3799e64c/attachment.obj