[x264-devel] [PATCH] zigzag SSE2

Fri May 2 15:33:45 CEST 2008

hi,

> two patches against git HEAD are attached:

hmm. it seems my patches haven't made it there yet.

i only did the scan_4x4_frame and 8x8_frame so far, but i think something 
similar can be done to the subs. 4x4 was somewhere below 10 cycles
(dark_shikari measured 8 on core2, i had 6.something on amd64, pengvado
saw 10 on opteron). the 8x8 was 40 cycles on my amd64; on core2 it varied
with alignment because of cache-line splits, but i think the range was 38-57.
(4x4 was pure mmx, 8x8 used pshufw)
so i later did an sse2 version that was 60 cycles on athlon but got rid of
unaligned loads and could get rid of unaligned stores with some extra
i intend to do that once i get one myself, which should be before
the end of the month.
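
for readers without the patches at hand, here is a minimal scalar sketch of
what a 4x4 frame zigzag scan computes. it is not the code from the patches
and not necessarily how x264 lays out its coefficients: it assumes a
row-major 4x4 block and the standard h.264 frame (progressive) zigzag order,
and the names zigzag4x4_frame and scan_4x4_frame_ref are made up for
illustration.

```c
#include <stdint.h>

/* Standard H.264 4x4 frame zigzag order, expressed as row-major indices
 * (index = row * 4 + col). Assumption: the input block is row-major. */
static const uint8_t zigzag4x4_frame[16] = {
     0,  1,  4,  8,
     5,  2,  3,  6,
     9, 12, 13, 10,
     7, 11, 14, 15
};

/* Hypothetical reference routine: copy the 16 coefficients of dct[]
 * into level[] in zigzag order. A SIMD version does the same permutation
 * with unpack/shuffle instructions instead of 16 scalar loads and stores. */
static void scan_4x4_frame_ref(int16_t level[16], const int16_t dct[16])
{
    for (int i = 0; i < 16; i++)
        level[i] = dct[zigzag4x4_frame[i]];
}
```

the whole function is a fixed permutation of 16 values, which is why the
fast versions are built from unpacks and shuffles, and why the scalar loop
is hard to beat by much.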

> The 64-bit result was obtained after reverting the patch to reduce the code 
> size (%define movdqa movaps ;%define movdqu movups) of the SSE2 code. Athlons 
> seem to keep some state information about the contents of their sse 
> registers.

they do have some fp state. but that shouldn't matter as long as you only
unpack and shuffle.

ah. might as well take the opportunity to introduce myself to the list:
i am one of the four google summer of code students accepted to work on
x264 this summer. i intend to look for optimization possibilities
in x264 wherever i find them. the zigzags were sort of a qualification
task for me.

that being said, i appreciate what you've done. these functions are a lot
harder to optimize than they look at first sight. my first attempts were
only marginally faster than c too.

holger