[x264-devel] [PATCH] zigzag SSE2

Axel Zeuner axel.zeuner at gmx.de
Fri May 2 11:30:58 CEST 2008


Hello,
two patches against git HEAD are attached:
- x264-zigzag-sse2.diff contains SSE2 implementations of the zigzag functions. 
- x264-timeasm.diff contains timeasm, a timing program for checking the effect 
of the changes. The program is a hack: it does no checks and was tested 
only on Linux x86/x86-64 with gcc.
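
For readers unfamiliar with the functions being optimized, here is a minimal C
sketch of a 4x4 frame zigzag scan. It is not the code from the attached diff:
it assumes a row-major 16-coefficient block and the standard H.264 frame scan
order, and the names zigzag_frame_4x4 and zigzag_scan_4x4_frame_ref are
hypothetical. The layout used inside x264 may differ.

```c
#include <assert.h>
#include <stdint.h>

/* Standard H.264 4x4 frame zigzag order, as indices into a row-major block
 * (index = row * 4 + column). Assumed layout, for illustration only. */
static const uint8_t zigzag_frame_4x4[16] = {
    0, 1, 4, 8, 5, 2, 3, 6, 9, 12, 13, 10, 7, 11, 14, 15
};

/* Copy the 16 coefficients of dct into level in zigzag order.
 * The two buffers must not alias. */
static void zigzag_scan_4x4_frame_ref(int16_t level[16], const int16_t dct[16])
{
    assert(level != dct);
    for (int i = 0; i < 16; i++)
        level[i] = dct[zigzag_frame_4x4[i]];
}
```

An SSE2 version performs the same permutation with a handful of register
shuffles instead of 16 scalar loads and stores, which is where any saving in
clocks would come from.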

I would like to see results on other processors, in both 32-bit and 64-bit 
mode, before we start discussing whether to include these functions in git. 

Two results as printed by timeasm follow:

model name	: AMD Turion(tm) 64 X2 Mobile Technology TL-50
Architecture: x86-32
---------------------------------------------
zigzag frame
offset determination: 121 clocks
c - sub_4x4: 43 clocks
ref - sub_4x4: 43 clocks
new - sub_4x4: 37 clocks
offset determination: 121 clocks
c - scan_4x4: 22 clocks
ref - scan_4x4: 22 clocks
new - scan_4x4: 20 clocks
offset determination: 121 clocks
c - scan_8x8: 76 clocks
ref - scan_8x8: 76 clocks
new - scan_8x8: 62 clocks
---------------------------------------------
zigzag field
offset determination: 121 clocks
c - sub_4x4: 41 clocks
ref - sub_4x4: 42 clocks
new - sub_4x4: 37 clocks
offset determination: 121 clocks
c - scan_4x4: 19 clocks
ref - scan_4x4: 18 clocks
new - scan_4x4: 16 clocks
offset determination: 121 clocks
c - scan_8x8: 70 clocks
ref - scan_8x8: 71 clocks
new - scan_8x8: 45 clocks

Architecture: x86-64
model name      : AMD Athlon(tm) 64 X2 Dual Core Processor 5000+
x264: using random seed 191362655
---------------------------------------------
zigzag frame
offset determination: 120 clocks
c - sub_4x4: 35 clocks
ref - sub_4x4: 35 clocks
new - sub_4x4: 33 clocks
offset determination: 120 clocks
c - scan_4x4: 20 clocks
ref - scan_4x4: 20 clocks
new - scan_4x4: 17 clocks
offset determination: 120 clocks
c - scan_8x8: 74 clocks
ref - scan_8x8: 74 clocks
new - scan_8x8: 60 clocks
---------------------------------------------
zigzag field
offset determination: 120 clocks
c - sub_4x4: 35 clocks
ref - sub_4x4: 35 clocks
new - sub_4x4: 34 clocks
offset determination: 120 clocks
c - scan_4x4: 13 clocks
ref - scan_4x4: 16 clocks
new - scan_4x4: 12 clocks
offset determination: 120 clocks
c - scan_8x8: 69 clocks
ref - scan_8x8: 69 clocks
new - scan_8x8: 43 clocks
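
For anyone who wants to reproduce numbers of this kind without the patch, the
following is a rough sketch of a cycle-counting harness. It is an assumption
about how such a tool can work, not a description of timeasm itself: it treats
the "offset determination" figure as the cost of an empty measurement and
subtracts it from the gross reading. The names read_ticks, min_ticks and
net_ticks are hypothetical, and the non-x86 fallback to clock() is only there
so the sketch compiles elsewhere.

```c
#include <assert.h>
#include <stdint.h>
#include <time.h>

#if defined(__i386__) || defined(__x86_64__)
#include <x86intrin.h>
/* Read the CPU time stamp counter (gcc/clang on x86). */
static inline uint64_t read_ticks(void) { return __rdtsc(); }
#else
/* Coarse fallback so the sketch builds on other targets. */
static inline uint64_t read_ticks(void) { return (uint64_t)clock(); }
#endif

typedef void (*bench_fn)(void *ctx);

/* Empty callee used to estimate the measurement overhead. */
static void empty_fn(void *ctx) { (void)ctx; }

/* Smallest tick count observed over `runs` calls of fn(ctx). */
static uint64_t min_ticks(bench_fn fn, void *ctx, int runs)
{
    assert(runs > 0);
    uint64_t best = UINT64_MAX;
    for (int i = 0; i < runs; i++) {
        uint64_t t0 = read_ticks();
        fn(ctx);
        uint64_t t1 = read_ticks();
        if (t1 - t0 < best)
            best = t1 - t0;
    }
    return best;
}

/* Gross minimum for fn minus the empty-call overhead, clamped at zero. */
static uint64_t net_ticks(bench_fn fn, void *ctx, int runs)
{
    uint64_t offset = min_ticks(empty_fn, NULL, runs);
    uint64_t gross  = min_ticks(fn, ctx, runs);
    return gross > offset ? gross - offset : 0;
}
```

Taking the minimum over many runs filters out interrupts and cache misses, at
the price of reporting a best case rather than an average.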

The 64-bit result was obtained after reverting the code-size reduction patch 
(%define movdqa movaps; %define movdqu movups) for the SSE2 code. Athlons 
seem to keep some state information about the contents of their SSE 
registers, so replacing the integer moves with float moves appears to cost 
clocks there.

Regards,
Axel
-------------- next part --------------
A non-text attachment was scrubbed...
Name: x264-timeasm.diff
Type: text/x-diff
Size: 8004 bytes
Desc: not available
Url : http://mailman.videolan.org/pipermail/x264-devel/attachments/20080502/4882bef5/attachment.diff 
-------------- next part --------------
A non-text attachment was scrubbed...
Name: x264-zigzag-sse2.diff
Type: text/x-diff
Size: 14542 bytes
Desc: not available
Url : http://mailman.videolan.org/pipermail/x264-devel/attachments/20080502/4882bef5/attachment-0001.diff 

