>> The 8x8 doesn't get such a big speed-up because the data is 8-byte
>> aligned, not 16-byte aligned, so it's necessary to permute it before
>> using it.
> I do not know much about AltiVec at all, but it seems the permute may be more
> expensive than a shift. Have you tried just shifting things into place?

vec_perm and vec_s(r|l)* have the same throughput and latencies.
That's what's cool about it ;-)
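
For anyone following along who hasn't used the realignment idiom, here is a
rough scalar sketch in plain C of what the permute does for data that is not
16-byte aligned. It is only a model: model_lvsl, model_perm and
model_load_unaligned are hypothetical names made up for this illustration, not
the real intrinsics. Actual AltiVec code would use vec_ld, vec_lvsl and
vec_perm on vector registers instead of byte arrays.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Scalar model of vec_lvsl (hypothetical helper): builds the control
 * bytes {sh, sh+1, ..., sh+15}, where sh is the address modulo 16. */
static void model_lvsl(unsigned sh, uint8_t ctl[16])
{
    for (int i = 0; i < 16; i++)
        ctl[i] = (uint8_t)(sh + i);
}

/* Scalar model of vec_perm (hypothetical helper): the low 5 bits of each
 * control byte pick one byte out of the 32-byte concatenation a || b. */
static void model_perm(const uint8_t a[16], const uint8_t b[16],
                       const uint8_t ctl[16], uint8_t out[16])
{
    for (int i = 0; i < 16; i++) {
        unsigned sel = ctl[i] & 0x1F;
        out[i] = (sel < 16) ? a[sel] : b[sel - 16];
    }
}

/* Fetch the 16 bytes starting at byte offset `off` in `buf`, touching
 * only the two 16-byte blocks that straddle it. `buf` is assumed to
 * start on a 16-byte boundary and to hold both blocks. */
static void model_load_unaligned(const uint8_t *buf, size_t len,
                                 size_t off, uint8_t out[16])
{
    size_t blk = off & ~(size_t)15;      /* round down to a block start */
    uint8_t ctl[16];

    assert(blk + 32 <= len);             /* both blocks must be readable */
    model_lvsl((unsigned)(off & 15), ctl);
    model_perm(buf + blk, buf + blk + 16, ctl, out);
}
```

With 8-byte-aligned data the offset within a block is always 0 or 8, so only
two different control vectors ever come up.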

