[x264-devel] commit: Add AltiVec implementation of predict_8x8c_p. 2.6x faster than scalar C. (Guillaume Poirier)

Guillaume POIRIER gpoirier at mplayerhq.hu
Mon Jan 19 00:17:19 CET 2009


Hello,

Whoops, my last message got sent before I could add anything useful to it.


2009/1/18 Loren Merritt <lorenm at u.washington.edu>:
> On Sun, 18 Jan 2009, git version control wrote:
>
>> x264 | branch: master | Guillaume Poirier <gpoirier at mplayerhq.hu> | Sun Jan 18 22:44:14 2009 +0100| [09e76c903d3419619ed326a4dd114369a55bdd6e] | committer: Guillaume Poirier
>>
>> +    vec_s16_t induc_v  = (vec_s16_t) CV(0, 1, 2, 3, 4, 5, 6, 7);
>> +    vec_s32_t mule_b_v = vec_mule(induc_v, b_v);
>> +    vec_s32_t mulo_b_v = vec_mulo(induc_v, b_v);
>> +    vec_s16_t mul_b_induc0_v = vec_pack(vec_mergeh(mule_b_v, mulo_b_v), vec_mergel(mule_b_v, mulo_b_v));
>> +    vec_s16_t add_i0_b_0v = vec_adds(i00_v, mul_b_induc0_v);
>
> Is there no plain 16bit multiply? vec_mladd?

Yep, that would work. Except that in the 8x8 case it's 1 cycle slower
on PPC970 and 2 cycles slower on PPC7450. I don't know how to explain
this, since all integer multiplies have a documented latency of 5 cycles
on PPC970.

It may be a scheduling issue where the compiler screws things up with
the new code; I'll check the assembly.

The 16x16 code is 1 cycle faster with the new code, though.

I attached the new code with vec_mladd.
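
For anyone following along without a PPC box, here is a scalar C model of
why the two forms can be swapped. It is only a sketch, not the attached
mul.diff: the helper names (mul_widen_pack, mul_mladd, check_agree) are
made up for illustration, and it assumes the usual AltiVec semantics, i.e.
vec_pack truncates each 32-bit lane modulo 2^16 and vec_mladd computes
(a * b + c) modulo 2^16 per 16-bit lane.

```c
#include <assert.h>
#include <stdint.h>

#define LANES 8

/* Model of the committed sequence: vec_mule/vec_mulo form full 32-bit
 * products of the even/odd lanes, vec_mergeh/vec_mergel put them back in
 * lane order, and vec_pack keeps the low 16 bits of each product. */
static void mul_widen_pack(const int16_t a[LANES], const int16_t b[LANES],
                           int16_t out[LANES])
{
    int32_t even[LANES / 2], odd[LANES / 2];   /* vec_mule, vec_mulo */
    for (int i = 0; i < LANES / 2; i++) {
        even[i] = (int32_t)a[2 * i]     * b[2 * i];
        odd[i]  = (int32_t)a[2 * i + 1] * b[2 * i + 1];
    }
    for (int i = 0; i < LANES / 2; i++) {
        /* merge interleaves even/odd; pack truncates modulo 2^16 */
        out[2 * i]     = (int16_t)(uint16_t)even[i];
        out[2 * i + 1] = (int16_t)(uint16_t)odd[i];
    }
}

/* Model of vec_mladd(a, b, c): per-lane (a * b + c) modulo 2^16. */
static void mul_mladd(const int16_t a[LANES], const int16_t b[LANES],
                      const int16_t c[LANES], int16_t out[LANES])
{
    for (int i = 0; i < LANES; i++) {
        uint32_t t = (uint32_t)((int32_t)a[i] * b[i])
                   + (uint32_t)(int32_t)c[i];
        out[i] = (int16_t)(uint16_t)t;
    }
}

/* Aborts if the two models disagree for b splatted across all lanes,
 * multiplied by the induction vector {0, 1, ..., 7}. */
static void check_agree(int16_t b_scalar)
{
    const int16_t induc[LANES] = {0, 1, 2, 3, 4, 5, 6, 7};
    const int16_t zero[LANES]  = {0};
    int16_t b[LANES], r_pack[LANES], r_mladd[LANES];

    for (int i = 0; i < LANES; i++)
        b[i] = b_scalar;
    mul_widen_pack(induc, b, r_pack);
    mul_mladd(induc, b, zero, r_mladd);
    for (int i = 0; i < LANES; i++)
        assert(r_pack[i] == r_mladd[i]);
}
```

Calling check_agree(3) or check_agree(-1234) passes, since both paths keep
the same low 16 bits of each product. Note that the quoted code adds i00_v
with vec_adds, which saturates; passing i00_v as the third operand of
vec_mladd instead would wrap on overflow, so folding the add in is only
equivalent when the sums stay within 16 bits.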

Guillaume
-- 
Only a very small fraction of our DNA does anything; the rest is all
comments and ifdefs.

Natalie Wood  - "The only time a woman really succeeds in changing a
man is when he is a baby."
-------------- next part --------------
A non-text attachment was scrubbed...
Name: mul.diff
Type: application/octet-stream
Size: 1637 bytes
Desc: not available
Url : http://mailman.videolan.org/pipermail/x264-devel/attachments/20090119/15f8a6c4/attachment.obj 

