[x264-devel] [PATCH] zigzag SSE2
Axel Zeuner
axel.zeuner at gmx.de
Fri May 2 11:30:58 CEST 2008
Hello,
two patches against git HEAD are attached:
- x264-zigzag-sse2.diff contains SSE2 implementations of the zigzag functions.
- x264-timeasm.diff contains timeasm, a timing program for checking the effect
of the changes. The program is a hack: it performs no checks and has been
tested only on Linux x86/x86-64 using gcc.
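For context, the scan functions reorder the coefficients of a transformed block into zigzag order. A minimal plain-C sketch of the 4x4 frame case follows. It assumes a row-major block and the standard H.264 frame zigzag order. The names zigzag4x4_frame and zigzag_scan_4x4_frame_sketch are hypothetical and are not taken from the x264 source or from the attached patch, whose memory layout may differ.

```c
#include <assert.h>
#include <stdint.h>

/* Standard H.264 zigzag order for a 4x4 frame block, given as
 * row-major source indices (index = 4*row + column). */
static const uint8_t zigzag4x4_frame[16] = {
    0, 1, 4, 8, 5, 2, 3, 6, 9, 12, 13, 10, 7, 11, 14, 15
};

/* Hypothetical helper: copy 16 coefficients from a row-major block
 * into scan order. Source and destination must not alias. */
static void zigzag_scan_4x4_frame_sketch(int16_t level[16],
                                         const int16_t block[16])
{
    assert(level != block);   /* in-place scanning is not supported here */
    for (int i = 0; i < 16; i++)
        level[i] = block[zigzag4x4_frame[i]];
}
```

A SIMD version would aim to produce the same 16 outputs with a handful of shuffle instructions instead of 16 scalar loads and stores.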
I would like to see results on other processors in 32-bit and 64-bit mode
before any discussion about including these functions in git starts.
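For anyone wanting to take comparable measurements on another processor, the general shape of a cycle-timing harness can be sketched as follows. This is not the timeasm code. It assumes that the "offset determination" lines in the timeasm output refer to the overhead of the measurement itself, which is subtracted from each raw timing, and every name in it (read_counter, min_clocks, net_clocks, measure) is hypothetical. It uses __rdtsc() from x86intrin.h on x86 with gcc or clang, and falls back to clock() elsewhere only so that the sketch still builds.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#if defined(__i386__) || defined(__x86_64__)
#include <x86intrin.h>              /* __rdtsc() with gcc/clang on x86 */
static uint64_t read_counter(void) { return __rdtsc(); }
#else
#include <time.h>                   /* coarse fallback, not cycle-accurate */
static uint64_t read_counter(void) { return (uint64_t)clock(); }
#endif

typedef void (*bench_fn)(void *ctx);

/* Does nothing; timing it yields the overhead of the harness. */
static void empty_fn(void *ctx) { (void)ctx; }

/* Smallest counter delta seen over `runs` calls of fn. Taking the
 * minimum discards runs disturbed by interrupts or cache misses. */
static uint64_t min_clocks(bench_fn fn, void *ctx, int runs)
{
    assert(runs > 0);
    uint64_t best = UINT64_MAX;
    for (int i = 0; i < runs; i++) {
        uint64_t t0 = read_counter();
        fn(ctx);
        uint64_t t1 = read_counter();
        if (t1 - t0 < best)
            best = t1 - t0;
    }
    return best;
}

/* Subtract the measurement overhead, clamping at zero. */
static uint64_t net_clocks(uint64_t raw, uint64_t offset)
{
    return raw > offset ? raw - offset : 0;
}

/* Overhead-corrected timing of fn. */
static uint64_t measure(bench_fn fn, void *ctx, int runs)
{
    uint64_t offset = min_clocks(empty_fn, NULL, runs);
    return net_clocks(min_clocks(fn, ctx, runs), offset);
}
```

Results from such a harness are only comparable between runs on the same machine, since the time stamp counter does not necessarily tick at the core clock rate on every processor.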
Two results as printed by timeasm follow:
model name : AMD Turion(tm) 64 X2 Mobile Technology TL-50
Architecture: x86-32
---------------------------------------------
zigzag frame
offset determination: 121 clocks
c - sub_4x4: 43 clocks
ref - sub_4x4: 43 clocks
new - sub_4x4: 37 clocks
offset determination: 121 clocks
c - scan_4x4: 22 clocks
ref - scan_4x4: 22 clocks
new - scan_4x4: 20 clocks
offset determination: 121 clocks
c - scan_8x8: 76 clocks
ref - scan_8x8: 76 clocks
new - scan_8x8: 62 clocks
---------------------------------------------
zigzag field
offset determination: 121 clocks
c - sub_4x4: 41 clocks
ref - sub_4x4: 42 clocks
new - sub_4x4: 37 clocks
offset determination: 121 clocks
c - scan_4x4: 19 clocks
ref - scan_4x4: 18 clocks
new - scan_4x4: 16 clocks
offset determination: 121 clocks
c - scan_8x8: 70 clocks
ref - scan_8x8: 71 clocks
new - scan_8x8: 45 clocks
Architecture: x86-64
model name : AMD Athlon(tm) 64 X2 Dual Core Processor 5000+
x264: using random seed 191362655
---------------------------------------------
zigzag frame
offset determination: 120 clocks
c - sub_4x4: 35 clocks
ref - sub_4x4: 35 clocks
new - sub_4x4: 33 clocks
offset determination: 120 clocks
c - scan_4x4: 20 clocks
ref - scan_4x4: 20 clocks
new - scan_4x4: 17 clocks
offset determination: 120 clocks
c - scan_8x8: 74 clocks
ref - scan_8x8: 74 clocks
new - scan_8x8: 60 clocks
---------------------------------------------
zigzag field
offset determination: 120 clocks
c - sub_4x4: 35 clocks
ref - sub_4x4: 35 clocks
new - sub_4x4: 34 clocks
offset determination: 120 clocks
c - scan_4x4: 13 clocks
ref - scan_4x4: 16 clocks
new - scan_4x4: 12 clocks
offset determination: 120 clocks
c - scan_8x8: 69 clocks
ref - scan_8x8: 69 clocks
new - scan_8x8: 43 clocks
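To put the x86-64 numbers above in relative terms, the saving of each "new" function over its "ref" counterpart can be computed as below. The clock counts are copied from the output above. The helper names saving_percent and print_savings are hypothetical and are not part of timeasm.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Percentage of clocks saved by new_clocks relative to ref_clocks. */
static double saving_percent(int ref_clocks, int new_clocks)
{
    assert(ref_clocks > 0);
    return 100.0 * (ref_clocks - new_clocks) / ref_clocks;
}

struct result {
    const char *name;
    int ref_clocks;
    int new_clocks;
};

/* "ref" and "new" rows of the x86-64 run, copied from the output above. */
static const struct result x86_64_results[] = {
    { "frame sub_4x4",  35, 33 },
    { "frame scan_4x4", 20, 17 },
    { "frame scan_8x8", 74, 60 },
    { "field sub_4x4",  35, 34 },
    { "field scan_4x4", 16, 12 },
    { "field scan_8x8", 69, 43 },
};

/* Print one line per function with the relative saving. */
static void print_savings(void)
{
    size_t n = sizeof x86_64_results / sizeof x86_64_results[0];
    for (size_t i = 0; i < n; i++)
        printf("%-15s %5.1f%%\n", x86_64_results[i].name,
               saving_percent(x86_64_results[i].ref_clocks,
                              x86_64_results[i].new_clocks));
}
```

For example, the field scan_8x8 row works out to (69 - 43) / 69, a saving of roughly 38%, while field sub_4x4 saves a single clock out of 35.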
The 64-bit result was obtained after reverting the patch that reduces the code
size of the SSE2 code (%define movdqa movaps; %define movdqu movups). Athlons
seem to keep some state information about the contents of their SSE
registers.
Regards,
Axel
-------------- next part --------------
A non-text attachment was scrubbed...
Name: x264-timeasm.diff
Type: text/x-diff
Size: 8004 bytes
Desc: not available
Url : http://mailman.videolan.org/pipermail/x264-devel/attachments/20080502/4882bef5/attachment.diff
-------------- next part --------------
A non-text attachment was scrubbed...
Name: x264-zigzag-sse2.diff
Type: text/x-diff
Size: 14542 bytes
Desc: not available
Url : http://mailman.videolan.org/pipermail/x264-devel/attachments/20080502/4882bef5/attachment-0001.diff