[x264-devel] [Patch] zigzag SSE2 Version 2

Wed Aug 13 07:59:41 CEST 2008

On Tuesday 12 August 2008 23:16:37 Holger Lubitz wrote:
> thanks for benching. however, depending on when exactly you checked out
> your copy, some of the routines may not have been the current ones
> (unoptimized mmx 8x8_frame crept in, also sse2 version was probably still
> %if'd out for a slower one that was just there to prove that xmm-mmx
> transfers aren't slower than xmm-xmm movs).
>
> can you check out again and test those?

k8 results:
x264-holger$ ./checkasm --bench=zig 2>/dev/null
nop: 80
zigzag_scan_4x4_field_c: 133
zigzag_scan_4x4_field_mmx: 107
zigzag_scan_4x4_frame_c: 218
zigzag_scan_4x4_frame_mmx: 115
zigzag_scan_8x8_frame_c: 756
zigzag_scan_8x8_frame_mmx: 452
zigzag_scan_8x8_frame_sse2: 699

k10 results:
x264-holger$ ./checkasm --bench=zig 2>/dev/null
nop: 672
zigzag_scan_4x4_field_c: 118
zigzag_scan_4x4_field_mmx: 74
zigzag_scan_4x4_frame_c: 232
zigzag_scan_4x4_frame_mmx: 119
zigzag_scan_8x8_frame_c: 768
zigzag_scan_8x8_frame_mmx: 359
zigzag_scan_8x8_frame_sse2: 465

> (there is a new sse2 one based on the ssse3 one i did today. but it would
> be nice if you could also bench the old one which was probably inactive in
> your former checkout. change %if 0 to 1 in line 278 of dct-a.asm, and
> comment or remove lines 578 and 579). i'd imagine the new one to be equal
> or better than the old one on k8 while still slower than mmx, but i do not
> have k10 experience, so a bench would be appreciated.

With the changes mentioned (hopefully I applied them correctly):

k8 results:
x264-holger$ ./checkasm --bench=zig 2>/dev/null
zigzag_scan_4x4_field_c: 133
zigzag_scan_4x4_field_mmx: 108
zigzag_scan_4x4_frame_c: 218
zigzag_scan_4x4_frame_mmx: 115
zigzag_scan_8x8_frame_c: 757
zigzag_scan_8x8_frame_mmx: 451
zigzag_scan_8x8_frame_sse2: 612

k10 results: 
x264-holger$ ./checkasm --bench=zig 2>/dev/null
nop: 671
zigzag_scan_4x4_field_c: 121
zigzag_scan_4x4_field_mmx: 74
zigzag_scan_4x4_frame_c: 228
zigzag_scan_4x4_frame_mmx: 117
zigzag_scan_8x8_frame_c: 768
zigzag_scan_8x8_frame_mmx: 364
zigzag_scan_8x8_frame_sse2: 365
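
For reference, a minimal C sketch of what a 4x4 frame zigzag scan computes. 
It assumes a row-major 4x4 block and the standard H.264 frame scan order. 
The function and table names are made up for illustration, and x264's own C 
routine may lay out the coefficients differently.

```c
#include <stdint.h>

/* H.264 4x4 frame (progressive) zigzag order, expressed as raster
 * indices (row * 4 + col) into a row-major block. */
static const uint8_t zigzag4x4_frame[16] = {
    0, 1, 4, 8, 5, 2, 3, 6, 9, 12, 13, 10, 7, 11, 14, 15
};

/* Hypothetical name: reorder a row-major 4x4 block into scan order. */
void zigzag_scan_4x4_frame_sketch(int16_t level[16], const int16_t block[16])
{
    for (int i = 0; i < 16; i++)
        level[i] = block[zigzag4x4_frame[i]];
}
```

The SIMD versions benchmarked here produce the same permutation with 
shuffles and unpacks instead of 16 scalar loads and stores.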

> note these totally avoid pinsrw which is pretty slow on almost all cpu
I like to see it.
> except the 45nm core2 (penryn), and even there it is not latency 1,
> because the source can be only gpr or mem, not xmm.
> according to my docs, it's high latency and low throughput on k8/k10 too.
pinsrw xmm, mem, imm is double decode on k8 with a throughput of 1, i.e. no 
worse than most other sse operations. On k10 it is single decode with a 
throughput of 1 according to amd, can execute in either the FADD or the FMUL 
unit, and has a latency of 4, like most other read-modify operations on amd 
processors. The latency probably does not play an important role in the 
scan8x8 functions because we have more than 4 dependency chains.
> i am not sure why you still get lower numbers for the 4x4. maybe the rdtsc
I believe that for short routines we mostly measure the decode time (scan4x4: 
15 instructions at 3 per clock gives 5 clocks; about 8 measured) rather than 
the time to retirement of the last instruction.
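
The estimate above is just a ceiling division. As a small sketch (the 
function name is made up, and a fixed decode rate per clock is a 
simplifying assumption):

```c
/* Lower bound on the clocks needed just to decode a straight-line routine,
 * assuming a fixed number of instructions is decoded per clock. */
unsigned decode_bound_clocks(unsigned n_instr, unsigned per_clock)
{
    return (n_instr + per_clock - 1) / per_clock;  /* ceiling division */
}
```

decode_bound_clocks(15, 3) gives the 5 clocks quoted for scan4x4.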

In the output below, produced with x264 HEAD plus my patches, ref denotes an 
existing mmx or C function and new denotes the sse2 functions. Throughput is 
measured with rdtsc, latency with cpuid; rdtsc; cpuid. Offset is the number of 
clocks required to call an empty function. The raw results for the function 
calls are corrected by the offset, and the minimum value is printed.
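
A rough sketch of the throughput side of this scheme follows. It is not the 
attached timeasm.c; all names are made up for illustration. It reads the 
counter with __rdtsc() on x86 and falls back to clock_gettime (nanoseconds, 
not clocks) elsewhere so that it still builds. The latency variant would 
additionally bracket rdtsc with cpuid and is left out here.

```c
#define _POSIX_C_SOURCE 199309L
#include <stdint.h>
#include <time.h>

#if defined(__x86_64__) || defined(__i386__)
#include <x86intrin.h>
static uint64_t read_ticks(void) { return __rdtsc(); }
#else
/* Assumption: non-x86 fallback, reports nanoseconds instead of clocks. */
static uint64_t read_ticks(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (uint64_t)ts.tv_sec * 1000000000u + (uint64_t)ts.tv_nsec;
}
#endif

typedef void (*bench_fn)(void);

/* Empty function used to measure the call offset. */
static void empty_fn(void) {}

/* Minimum tick count over `runs` timed calls of fn. */
static uint64_t min_ticks(bench_fn fn, int runs)
{
    bench_fn volatile call = fn;  /* force a real indirect call each time */
    uint64_t best = UINT64_MAX;
    for (int i = 0; i < runs; i++) {
        uint64_t t0 = read_ticks();
        call();
        uint64_t t1 = read_ticks();
        if (t1 - t0 < best)
            best = t1 - t0;
    }
    return best;
}

/* Minimum cost of fn with the empty-call offset subtracted, clamped at 0. */
static uint64_t corrected_ticks(bench_fn fn, int runs)
{
    uint64_t offset = min_ticks(empty_fn, runs);
    uint64_t raw = min_ticks(fn, runs);
    return raw > offset ? raw - offset : 0;
}
```

Taking the minimum rather than the mean discards runs disturbed by 
interrupts or cache misses, which matters for routines this short.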

x264-dev$ ./timeasm
Architecture: x86-64
model name      : AMD Phenom(tm) XXXX Quad-Core Processor
x264: using random seed 712620458
---------------------------------------------
zigzag frame
   offset - throughput/latency      68     177
                   c - sub_4x4      28      34
                 ref - sub_4x4      28      33
                 new - sub_4x4      16      32
   offset - throughput/latency      68     177
                  c - scan_4x4      15      21
                ref - scan_4x4      15      21
                new - scan_4x4       9      22
   offset - throughput/latency      68     177
                  c - scan_8x8      76      77
                ref - scan_8x8      76      78
                new - scan_8x8      27      43
---------------------------------------------
zigzag field
   offset - throughput/latency      68     177
                   c - sub_4x4      28      34
                 ref - sub_4x4      28      33
                 new - sub_4x4      16      33
   offset - throughput/latency      68     177
                  c - scan_4x4      10      11
                ref - scan_4x4       7      13
                new - scan_4x4       6      16
   offset - throughput/latency      68     177
                  c - scan_8x8      68      72
                ref - scan_8x8      68      72
                new - scan_8x8      21      32

Sources and Makefile are attached. Set X264DIR=../x264-holger in the first 
line of the Makefile to point to your x264 directory.

Regards,
Axel
-------------- next part --------------
A non-text attachment was scrubbed...
Name: Makefile
Type: text/x-makefile
Size: 384 bytes
Desc: not available
Url : http://mailman.videolan.org/pipermail/x264-devel/attachments/20080813/a98a118f/attachment.bin 
-------------- next part --------------
A non-text attachment was scrubbed...
Name: timeasm.c
Type: text/x-csrc
Size: 8264 bytes
Desc: not available
Url : http://mailman.videolan.org/pipermail/x264-devel/attachments/20080813/a98a118f/attachment.c