[x264-devel] [Patch] zigzag SSE2 Version 2
Axel Zeuner
Axel.Zeuner at gmx.de
Wed Aug 13 07:59:41 CEST 2008
On Tuesday 12 August 2008 23:16:37 Holger Lubitz wrote:
> thanks for benching. however, depending on when exactly you checked out
> your copy, some of the routines may not have been the current ones
> (unoptimized mmx 8x8_frame crept in, also sse2 version was probably still
> %if'd out for a slower one that was just there to prove that xmm-mmx
> transfers aren't slower than xmm-xmm movs).
>
> can you check out again and test those?
k8 results:
x264-holger$ ./checkasm --bench=zig 2>/dev/null
nop: 80
zigzag_scan_4x4_field_c: 133
zigzag_scan_4x4_field_mmx: 107
zigzag_scan_4x4_frame_c: 218
zigzag_scan_4x4_frame_mmx: 115
zigzag_scan_8x8_frame_c: 756
zigzag_scan_8x8_frame_mmx: 452
zigzag_scan_8x8_frame_sse2: 699
k10 results:
x264-holger$ ./checkasm --bench=zig 2>/dev/null
nop: 672
zigzag_scan_4x4_field_c: 118
zigzag_scan_4x4_field_mmx: 74
zigzag_scan_4x4_frame_c: 232
zigzag_scan_4x4_frame_mmx: 119
zigzag_scan_8x8_frame_c: 768
zigzag_scan_8x8_frame_mmx: 359
zigzag_scan_8x8_frame_sse2: 465
> (there is a new sse2 one based on the ssse3 one i did today. but it would
> be nice if you could also bench the old one which was probably inactive in
> your former checkout. change %if 0 to 1 in line 278 of dct-a.asm, and
> comment or remove lines 578 and 579). i'd imagine the new one to be equal
> or better than the old one on k8 while still slower than mmx, but i do not
> have k10 experience, so a bench would be appreciated.
with the changes mentioned, hopefully I did it right:
k8 results:
x264-holger$ ./checkasm --bench=zig 2>/dev/null
zigzag_scan_4x4_field_c: 133
zigzag_scan_4x4_field_mmx: 108
zigzag_scan_4x4_frame_c: 218
zigzag_scan_4x4_frame_mmx: 115
zigzag_scan_8x8_frame_c: 757
zigzag_scan_8x8_frame_mmx: 451
zigzag_scan_8x8_frame_sse2: 612
k10 results:
x264-holger$ ./checkasm --bench=zig 2>/dev/null
nop: 671
zigzag_scan_4x4_field_c: 121
zigzag_scan_4x4_field_mmx: 74
zigzag_scan_4x4_frame_c: 228
zigzag_scan_4x4_frame_mmx: 117
zigzag_scan_8x8_frame_c: 768
zigzag_scan_8x8_frame_mmx: 364
zigzag_scan_8x8_frame_sse2: 365
> note these totally avoid pinsrw which is pretty slow on almost all cpu
I like to see it.
> except the 45nm core2 (penryn), and even there it is not latency 1,
> because the source can be only gpr or mem, not xmm.
> according to my docs, it's high latency and low throughput on k8/k10 too.
pinsrw xmm, mem, imm is double decode on k8 with a throughput of 1, i.e. no
worse than most other sse operations. On k10 it is single decode with a
throughput of 1 according to amd, executable in the FADD or FMUL unit, with a
latency of 4 like most other read-modify operations on amd processors. The
latency probably does not play an important role in the scan8x8 functions
because we have more than 4 dependency chains.
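To illustrate the dependency-chain argument, here is a minimal sketch of a lower-bound model. It is an assumption for illustration only, not a description of how k8 or k10 actually schedule: the function name and parameters are hypothetical, and decode limits, execution-unit conflicts and memory effects are ignored.

```c
#include <stdint.h>

/* Rough lower bound on clocks for `ops` identical operations spread evenly
 * over `chains` independent dependency chains (chains must be nonzero).
 * Simplified model: ignores decode, unit conflicts and memory effects. */
uint64_t estimate_clocks(uint64_t ops, uint64_t clocks_per_op,
                         uint64_t chains, uint64_t latency)
{
    uint64_t ops_per_chain = (ops + chains - 1) / chains; /* round up */
    uint64_t latency_bound = ops_per_chain * latency;     /* longest chain */
    uint64_t issue_bound   = ops * clocks_per_op;         /* issue rate */
    return latency_bound > issue_bound ? latency_bound : issue_bound;
}
```

With 16 operations at 1 clock each and a latency of 4, a single chain gives a bound of 64 clocks, while 4 or more chains give 16, i.e. under this model the issue rate rather than the latency sets the limit.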
> i am not sure why you still get lower numbers for the 4x4. maybe the rdtsc
I believe that for short routines we measure the decode time (scan4x4: 15
instructions, 3 per clock: 5 clocks, measured about 8) more than the time to
retirement of the last instruction.
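The decode estimate above is just a rounded-up division. A small sketch, with a hypothetical helper name, assuming the decoder sustains a fixed number of instructions per clock:

```c
#include <stdint.h>

/* Clocks needed just to decode `instructions` at `per_clock` instructions
 * per clock, rounded up. per_clock must be nonzero. */
uint64_t decode_bound_clocks(uint64_t instructions, uint64_t per_clock)
{
    return (instructions + per_clock - 1) / per_clock;
}
```

For the scan4x4 case, decode_bound_clocks(15, 3) yields 5, which is below the roughly 8 clocks measured.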
In the output below, produced with x264 HEAD + my patches, ref denotes an
existing mmx or C function and new denotes the sse2 functions. Throughput is
measured with rdtsc, latency with cpuid; rdtsc; cpuid. Offset is the number of
clocks required to call an empty function. The raw results for the function
calls are corrected by the offset, and the minimum value is printed.
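The correction step can be sketched as follows. This is not the code from the attached timeasm.c, which is not reproduced here; the function name is hypothetical, and the raw readings are assumed to have been collected already (on x86 typically as differences of two rdtsc reads).

```c
#include <stddef.h>
#include <stdint.h>

/* Subtract the empty-call offset from each raw reading and return the
 * smallest corrected value. Readings below the offset are clamped to 0.
 * Returns UINT64_MAX if n is 0. */
uint64_t min_corrected_clocks(const uint64_t *raw, size_t n, uint64_t offset)
{
    uint64_t best = UINT64_MAX;
    for (size_t i = 0; i < n; i++) {
        uint64_t corrected = raw[i] > offset ? raw[i] - offset : 0;
        if (corrected < best)
            best = corrected;
    }
    return best;
}
```

As a made-up example, raw readings of 95, 77 and 84 with an offset of 68 give 9. Taking the minimum rather than the mean discards runs disturbed by interrupts or cache misses.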
x264-dev$ ./timeasm
Architecture: x86-64
model name : AMD Phenom(tm) XXXX Quad-Core Processor
x264: using random seed 712620458
---------------------------------------------
zigzag frame
offset - throughput/latency 68 177
c - sub_4x4 28 34
ref - sub_4x4 28 33
new - sub_4x4 16 32
offset - throughput/latency 68 177
c - scan_4x4 15 21
ref - scan_4x4 15 21
new - scan_4x4 9 22
offset - throughput/latency 68 177
c - scan_8x8 76 77
ref - scan_8x8 76 78
new - scan_8x8 27 43
---------------------------------------------
zigzag field
offset - throughput/latency 68 177
c - sub_4x4 28 34
ref - sub_4x4 28 33
new - sub_4x4 16 33
offset - throughput/latency 68 177
c - scan_4x4 10 11
ref - scan_4x4 7 13
new - scan_4x4 6 16
offset - throughput/latency 68 177
c - scan_8x8 68 72
ref - scan_8x8 68 72
new - scan_8x8 21 32
Sources and Makefile are attached. Set the first line of the Makefile,
X264DIR=../x264-holger, to point to your x264 directory.
Regards,
Axel
-------------- next part --------------
A non-text attachment was scrubbed...
Name: Makefile
Type: text/x-makefile
Size: 384 bytes
Desc: not available
Url : http://mailman.videolan.org/pipermail/x264-devel/attachments/20080813/a98a118f/attachment.bin
-------------- next part --------------
A non-text attachment was scrubbed...
Name: timeasm.c
Type: text/x-csrc
Size: 8264 bytes
Desc: not available
Url : http://mailman.videolan.org/pipermail/x264-devel/attachments/20080813/a98a118f/attachment.c