[x264-devel] [Patch] zigzag SSE2 Version 2

Tue Aug 12 23:16:37 CEST 2008

> Thank you for the hint, I checked them out and had a look at it, here are some 
> results on my hardware

thanks for benching. however, depending on when exactly you checked out your
copy, some of the routines may not have been the current ones (an unoptimized
mmx 8x8_frame had crept in, and the sse2 version was probably still %if'd out
in favor of a slower one that was only there to prove that xmm-mmx transfers
aren't slower than xmm-xmm movs).

can you check out again and test those?

(there is a new sse2 one based on the ssse3 one i did today. but it would
be nice if you could also bench the old one, which was probably inactive in
your former checkout: change %if 0 to 1 in line 278 of dct-a.asm, and comment
out or remove lines 578 and 579). i'd expect the new one to be equal to or
better than the old one on k8 while still slower than mmx, but i have no k10
experience, so a bench would be appreciated.

note that these routines avoid pinsrw entirely. it is pretty slow on almost all
cpus except the 45nm core2 (penryn), and even there it is not latency 1,
because the source can only be a gpr or mem, not xmm.
according to my docs, it is high latency and low throughput on k8/k10 too.
i am not sure why you still get lower numbers for the 4x4. maybe the rdtsc
gets reordered into that latency.
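
to rule out the reordering guess, one option is to fence the counter reads. the
following is only a rough sketch in c, not the harness used for the numbers
above: fenced_rdtsc, dummy_kernel and bench_min_ticks are made-up names, and
dummy_kernel is a placeholder for the routine under test. it assumes gcc or
clang on x86 with sse2. how strongly lfence orders things differs between
vendors and cpu generations, so cpuid is the heavier but safer alternative.

```c
#include <assert.h>
#include <stdint.h>

#if defined(__x86_64__) || defined(__SSE2__)
#include <x86intrin.h>              /* __rdtsc, _mm_lfence (gcc/clang) */
/* read the tsc with a fence on each side, so the read is less likely
 * to be executed inside the latency of the code being measured */
static inline uint64_t fenced_rdtsc(void)
{
    _mm_lfence();
    uint64_t t = __rdtsc();
    _mm_lfence();
    return t;
}
#else
#include <time.h>
/* non-x86 stand-in so the sketch still builds: nanoseconds, not ticks */
static inline uint64_t fenced_rdtsc(void)
{
    struct timespec ts;
    timespec_get(&ts, TIME_UTC);
    return (uint64_t)ts.tv_sec * 1000000000u + (uint64_t)ts.tv_nsec;
}
#endif

/* hypothetical placeholder for the routine under test:
 * copies 16 coefficients in reversed order */
static void dummy_kernel(volatile int16_t *dst, const int16_t *src)
{
    for (int i = 0; i < 16; i++)
        dst[i] = src[15 - i];
}

/* time the kernel `runs` times and keep the smallest difference;
 * the minimum is less sensitive to interrupts than the mean */
static uint64_t bench_min_ticks(int runs)
{
    assert(runs > 0);
    int16_t src[16];
    volatile int16_t dst[16];
    for (int i = 0; i < 16; i++)
        src[i] = (int16_t)i;

    uint64_t best = UINT64_MAX;
    for (int r = 0; r < runs; r++) {
        uint64_t t0 = fenced_rdtsc();
        dummy_kernel(dst, src);
        uint64_t t1 = fenced_rdtsc();
        if (t1 - t0 < best)
            best = t1 - t0;
    }
    return best;
}
```

calling bench_min_ticks(1000) with and without the fences should show whether
the 4x4 numbers move. the result includes the overhead of the fences and of
rdtsc itself, so only differences between routines are meaningful.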

regards,
holger