[x264-devel] [Christian Heine <sennindemokrit at gmx.net>] [patch] optimized quant_XxX functions

Sun Sep 4 10:04:07 CEST 2005

 The deleted attachment is at:
    <http://www.videolan.org/~admin/20050904-videolan/x264rev291-quant_c_mmxext_sse2-v2.diff>

----- Forwarded message from Christian Heine <sennindemokrit at gmx.net> -----

From: Christian Heine <sennindemokrit at gmx.net>
Date: Sat, 03 Sep 2005 06:40:51 +0200
To: x264-devel at videolan.org
Subject: [patch] optimized quant_XxX functions

Hi,

attached is a patch based on x264 rev. 291 that includes a faster C
implementation as well as MMXEXT-optimized versions of quant_8x8,
quant_4x4, quant_4x4_dc and quant_2x2_dc for the x86 architecture.

--- Capabilities

The code was inspired by Alex Izvorski's versions of quant_8x8 and
quant_4x4, but addresses two important issues:

1) Precision: Alex Izvorski's version required the quant_mf tables to
stay within the range of an int16_t. This version works as long as
(dct[x][y]*quant_mf[x][y]) stays within the limits of an int32_t, at the
cost of only a small speed penalty. This should be sufficient precision,
since the C implementation (old and new) also breaks beyond those limits.

2) Run-time selection of optimizations: Alex Izvorski's patch always
compiled in the MMXEXT version.

This patch, however, detects whether the quant_mf tables can be used with
Alex Izvorski's version of quant, and uses that version if so. A
debug-level log message reports which version is in use.
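
The range check described above could look roughly like the following. This is only a sketch, not the code from the patch: the helper name quant_mf_fits_int16 and the flat table layout are assumptions made for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper (not from the patch): returns 1 if every entry of a
 * quant_mf table, passed here as a flat array, fits into an int16_t, so that
 * a 16-bit MMXEXT path could be used; returns 0 otherwise. */
static int quant_mf_fits_int16( const int *mf, int count )
{
    assert( count > 0 );
    for( int i = 0; i < count; i++ )
        if( mf[i] < INT16_MIN || mf[i] > INT16_MAX )
            return 0;
    return 1;
}
```

A caller would run such a check once when the tables are set up and pick the 16-bit or the 32-bit code path accordingly.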

--- Speed

Speedup strongly depends on the parameters used.

test system:
   AthlonXP 3000+ WinXP/MinGW

configure parameters:
   --enable-avis-input --extra-cflags=-march=athlon-xp

run time parameters:
   x264 --bframes 2 --ref 8 --8x8dct --analyse all --qp 20

The following table shows the overall speed improvement compared to x264
rev 291. The comparison is based on the resulting fps, which were:
3.84 (x264rev291), 3.87 (c-opt), 3.99 (mmx-CH-opt), 4.01 (mmx-AI-opt),
4.03 (mmx-AI-orig).

version     improvement     comments
c-opt       0.78%           new C implementation
mmx-CH-opt  3.90%           MMXEXT optimized version (Christian Heine)
mmx-AI-opt  4.43%           MMXEXT optimized version (Alex Izvorski)

mmx-AI-orig 4.95%           Alex Izvorski's original version
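
For reference, the percentages in the table follow from the fps figures as (fps / base fps - 1) * 100. A minimal sketch of that arithmetic, with a function name chosen here purely for illustration:

```c
#include <assert.h>

/* Illustrative helper: relative speed improvement in percent of a measured
 * fps value over a baseline fps value. */
static double improvement_percent( double fps, double base_fps )
{
    assert( base_fps > 0.0 );
    return ( fps / base_fps - 1.0 ) * 100.0;
}
```

With the baseline of 3.84 fps, a result of 3.87 fps gives roughly 0.78%, matching the c-opt row.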

Alex Izvorski's MMXEXT-optimized version is slower in this patch than in
his original patch. This is because quant_XxX_core_mmxext is no longer
called directly, but through a function pointer.

--- Internals

To achieve run-time adaptive optimization there are two options:

1) Function pointers for quant_XxX.
2) Function pointers for quant_XxX_core.

What's the difference? The first option allows optimizing the whole
function, while the second leaves some C code for the compiler and only
uses optimized code for the purely repetitive arithmetic.

First of all, it is better to let GCC calculate things like i_qbits and
the displacements into quant_mf. I tried to write optimized code for this
and failed. I don't know how GCC does it better, and looking at the code
it generated didn't help: I simply didn't understand it, so I left it
that way.

The second and main reason for choosing the second option is that, by
leaving some C code in quant_XxX and leaving quant_XxX in macroblock.c,
GCC is able to inline and unroll some of that C code. There are functions
that call quant_4x4 about 16 times, always with the same i_qscale, which
lets GCC inline/unroll the calculation of i_qbits, displacements and
constants. This may not sound like much, but it actually gives an overall
speed improvement of about 0.5% in my tests.
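
To illustrate the second option, here is a simplified sketch (not the patch itself). The signatures, the rounding offset f and the name quant_4x4_core are assumptions. The wrapper stays in C so the compiler can inline it and hoist the i_qbits/i_mf setup out of a loop of calls, and only the arithmetic core goes through a function pointer.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed core signature: pure arithmetic over the 16 coefficients. */
typedef void (*quant_core_fn)( int16_t dct[4][4], int mf[4][4], int i_qbits, int f );

/* Plain C core; an MMXEXT routine with the same signature could replace it. */
static void quant_4x4_core_c( int16_t dct[4][4], int mf[4][4], int i_qbits, int f )
{
    for( int y = 0; y < 4; y++ )
        for( int x = 0; x < 4; x++ )
        {
            const int32_t c = dct[y][x];   /* the product must fit in 32 bits */
            if( c > 0 )
                dct[y][x] = (int16_t)( ( f + c * mf[y][x] ) >> i_qbits );
            else
                dct[y][x] = (int16_t)( -( ( f - c * mf[y][x] ) >> i_qbits ) );
        }
}

/* Selected once at start-up; defaults to the C core in this sketch. */
static quant_core_fn quant_4x4_core = quant_4x4_core_c;

/* C wrapper: the setup below is what the compiler can inline and reuse
 * when the same i_qscale is passed many times in a row. */
static inline void quant_4x4( int16_t dct[4][4], int quant_mf[6][4][4],
                              int i_qscale, int b_intra )
{
    assert( i_qscale >= 0 );
    const int i_qbits = 15 + i_qscale / 6;
    const int i_mf    = i_qscale % 6;
    const int f       = ( 1 << i_qbits ) / ( b_intra ? 3 : 6 );
    quant_4x4_core( dct, quant_mf[i_mf], i_qbits, f );
}
```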

The patch contains the code for both options, although the second option
is the one in use. If you want to try the first option, just replace calls like

quant_XxX( h, dct, ... )
quant_XxX_dc( h, dct, ... )

with

h->quantf.quant_XxX( dct, ... )
h->quantf.quant_XxX_dc( dct, ... )

in macroblock.c at every occurrence. If you have the patience, you can
check each occurrence separately for a speed improvement.
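
For comparison, the first option (a pointer to the whole function) might be set up along these lines. This is a sketch under assumptions: the struct layout, the CPU_MMXEXT flag and init_quant are hypothetical stand-ins, and the "mmxext" routine here is ordinary C standing in for assembly.

```c
#include <assert.h>
#include <stdint.h>

#define CPU_MMXEXT 0x1   /* hypothetical capability flag for this sketch */

/* Hypothetical table of whole-function pointers, in the spirit of h->quantf. */
typedef struct
{
    void (*quant_4x4)( int16_t dct[4][4], int quant_mf[6][4][4],
                       int i_qscale, int b_intra );
} quant_function_t;

/* Reference C version: setup and arithmetic live in one function. */
static void quant_4x4_c( int16_t dct[4][4], int quant_mf[6][4][4],
                         int i_qscale, int b_intra )
{
    assert( i_qscale >= 0 );
    const int i_qbits = 15 + i_qscale / 6;
    const int i_mf    = i_qscale % 6;
    const int f       = ( 1 << i_qbits ) / ( b_intra ? 3 : 6 );
    for( int y = 0; y < 4; y++ )
        for( int x = 0; x < 4; x++ )
        {
            const int32_t c = dct[y][x];
            if( c > 0 )
                dct[y][x] = (int16_t)( ( f + c * quant_mf[i_mf][y][x] ) >> i_qbits );
            else
                dct[y][x] = (int16_t)( -( ( f - c * quant_mf[i_mf][y][x] ) >> i_qbits ) );
        }
}

/* Stand-in for an assembly routine; it simply forwards to the C version. */
static void quant_4x4_mmxext_standin( int16_t dct[4][4], int quant_mf[6][4][4],
                                      int i_qscale, int b_intra )
{
    quant_4x4_c( dct, quant_mf, i_qscale, b_intra );
}

/* Fill the table once, based on the detected CPU capabilities. */
static void init_quant( int cpu, quant_function_t *pf )
{
    pf->quant_4x4 = quant_4x4_c;
    if( cpu & CPU_MMXEXT )
        pf->quant_4x4 = quant_4x4_mmxext_standin;
}
```

With this layout every call goes through the pointer, so the compiler cannot inline the i_qbits setup into the caller, which is the trade-off discussed above.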

--- Todo

BIG FAT WARNING: The patch also contains an SSE2 optimized version,
which I have not been able to test. Uncomment the appropriate code in
quant.c:x264_init_quant() if it doesn't produce correct results, or send
a patch to correct it. If it works, please send some benchmarks.

A patch for MMXEXT/SSE2 optimized quant functions for Athlon64 can be
done with minimal effort and would be welcome.

Regards,
Christian Heine

----- End forwarded message -----

-- 
System administration <admin at via.ecp.fr>
VIA, Ecole Centrale Paris, France
