[x264-devel] [Christian Heine <sennindemokrit at gmx.net>] [patch] optimized quant_XxX functions
System administration
admin at via.ecp.fr
Sun Sep 4 10:04:07 CEST 2005
The deleted attachment is at:
<http://www.videolan.org/~admin/20050904-videolan/x264rev291-quant_c_mmxext_sse2-v2.diff>
----- Forwarded message from Christian Heine <sennindemokrit at gmx.net> -----
From: Christian Heine <sennindemokrit at gmx.net>
Date: Sat, 03 Sep 2005 06:40:51 +0200
To: x264-devel at videolan.org
Subject: [patch] optimized quant_XxX functions
Hi,
attached is a patch based on x264 rev. 291 that includes a faster C
implementation as well as MMXEXT optimized versions of quant_8x8,
quant_4x4, quant_4x4_dc, quant_2x2_dc on the x86 architecture.
--- Capabilities
The code was inspired by Alex Izvorski's version of quant_8x8 and
quant_4x4, but deals with two important issues:
1) Precision: Alex Izvorski's version limited the quant_mf tables to
the range of an int16_t. This version works as long as
(dct[x][y]*quant_mf[x][y]) stays within the limits of an int32_t, with
only a small speed penalty. This should be sufficient precision, since
the C implementation (old and new) also breaks beyond those limits.
2) Run-time selection of optimizations: Alex Izvorski's version was
always compiled with the MMXEXT version.
This patch, however, detects whether the quant_mf tables can be used
with Alex Izvorski's version of quant, and uses it in that case. A
debug-level log message reports which version is used.
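The run-time check described above could look roughly like the following sketch. It is not taken from the patch: the names quant_core_fn, quant_core_16, quant_core_32 and select_quant_core are hypothetical, both cores are plain C stand-ins for the MMXEXT routines, and the rounding offset f is simply passed in by the caller.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical core signature: quantize n coefficients in place. */
typedef void (*quant_core_fn)(int16_t *dct, const int32_t *mf, size_t n,
                              int qbits, int32_t f);

/* General path: works as long as dct[i] * mf[i] fits in an int32_t. */
static void quant_core_32(int16_t *dct, const int32_t *mf, size_t n,
                          int qbits, int32_t f)
{
    for (size_t i = 0; i < n; i++) {
        int32_t c = dct[i];
        if (c > 0)
            dct[i] = (int16_t)((f + c * mf[i]) >> qbits);
        else
            dct[i] = (int16_t)(-((f - c * mf[i]) >> qbits));
    }
}

/* Stand-in for the faster path that only handles 16-bit table entries. */
static void quant_core_16(int16_t *dct, const int32_t *mf, size_t n,
                          int qbits, int32_t f)
{
    for (size_t i = 0; i < n; i++) {
        int32_t c = dct[i];
        int16_t m = (int16_t)mf[i]; /* lossless only after the range check */
        if (c > 0)
            dct[i] = (int16_t)((f + c * m) >> qbits);
        else
            dct[i] = (int16_t)(-((f - c * m) >> qbits));
    }
}

/* Returns 1 if every table entry fits in an int16_t, 0 otherwise. */
static int mf_fits_int16(const int32_t *mf, size_t n)
{
    for (size_t i = 0; i < n; i++)
        if (mf[i] < INT16_MIN || mf[i] > INT16_MAX)
            return 0;
    return 1;
}

/* Pick the core once, at init time, based on the table contents. */
static quant_core_fn select_quant_core(const int32_t *mf, size_t n)
{
    assert(mf != NULL);
    return mf_fits_int16(mf, n) ? quant_core_16 : quant_core_32;
}
```

The selection runs once per table rather than per block, so the cost of the range check is negligible next to the quantization itself.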
--- Speed
Speedup strongly depends on the parameters used.
test system:
AthlonXP 3000+ WinXP/MinGW
configure parameters:
--enable-avis-input --extra-cflags=-march=athlon-xp
run time parameters:
x264 --bframes 2 --ref 8 --8x8dct --analyse all --qp 20
The following table shows the overall speed improvement compared to x264
rev 291. The comparison is based on the resulting frame rates, which
were: 3.84 fps (x264rev291), 3.87 (c-opt), 3.99 (mmx-CH-opt), 4.01
(mmx-AI-opt), 4.03 (mmx-AI-orig).
version      improvement  comments
c-opt        0.78%        new C implementation
mmx-CH-opt   3.90%        MMXEXT optimized version (Christian Heine)
mmx-AI-opt   4.43%        MMXEXT optimized version (Alex Izvorski)
mmx-AI-orig  4.95%        Alex Izvorski's original version
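For reference, the percentages in the table follow from the frame rates as (fps / baseline - 1) * 100. A minimal sketch of that arithmetic, with improvement_percent being a hypothetical helper name:

```c
#include <assert.h>

/* Relative speed improvement of `fps` over `base_fps`, in percent. */
static double improvement_percent(double fps, double base_fps)
{
    assert(base_fps > 0.0);
    return 100.0 * (fps / base_fps - 1.0);
}
```

With the rev 291 baseline of 3.84 fps, improvement_percent(3.99, 3.84) gives roughly 3.9, matching the mmx-CH-opt row.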
The MMXEXT optimized version of Alex Izvorski is slower in this patch
than in the original patch from him. This is because
quant_XxX_core_mmxext is no longer called directly, but through a
function pointer.
--- Internals
To achieve run time adaptive optimizations there are two options.
1) Function pointers for quant_XxX.
2) Function pointers for quant_XxX_core.
What's the difference? The first option allows optimizations for the
whole function, while the second leaves some C code for the compiler,
and only uses optimizations for the pure repetitive arithmetic stuff.
First of all, it is better to have GCC calculate things like i_qbits and
displacements of quant_mf. I tried to write optimized code for this and
failed. I don't know how GCC does it better, and looking at the code
generated by GCC didn't help: I simply didn't understand it, so I
just left it that way.
The second and main reason for the second option is that by leaving
some C code in quant_XxX and leaving quant_XxX in macroblock.c, GCC is
able to inline and unroll some of that C code. There are functions that
call quant_4x4 about 16 times, always using the same i_qscale, allowing
GCC to inline/unroll calculation of i_qbits, displacements and
constants. This may not sound like much, but actually makes an overall
speed improvement of about 0.5% on my tests.
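A rough sketch of the second option follows. It is not the patch's code: quant_ctx, quant_4x4_core_c and quant_16_blocks are hypothetical names, and the formulas for qbits and the rounding offset f are assumptions chosen for illustration. The point is that the wrapper stays in C where the compiler can see it, while only the arithmetic loop goes through the function pointer.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical core signature: only the repetitive arithmetic. */
typedef void (*quant_4x4_core_fn)(int16_t dct[4][4], int32_t mf[4][4],
                                  int qbits, int32_t f);

typedef struct {
    quant_4x4_core_fn quant_4x4_core; /* set once at init (C, MMXEXT, ...) */
} quant_ctx;

/* Plain C core, standing in for an optimized version. */
static void quant_4x4_core_c(int16_t dct[4][4], int32_t mf[4][4],
                             int qbits, int32_t f)
{
    for (int y = 0; y < 4; y++)
        for (int x = 0; x < 4; x++) {
            int32_t c = dct[y][x];
            if (c > 0)
                dct[y][x] = (int16_t)((f + c * mf[y][x]) >> qbits);
            else
                dct[y][x] = (int16_t)(-((f - c * mf[y][x]) >> qbits));
        }
}

/* Wrapper kept in C: qbits, the table offset and f are derived here
 * (assumed formulas), so the compiler can inline this into callers. */
static inline void quant_4x4(const quant_ctx *q, int16_t dct[4][4],
                             int32_t quant_mf[6][4][4], int qscale, int intra)
{
    assert(qscale >= 0);
    const int qbits = 15 + qscale / 6;
    const int32_t f = (1 << qbits) / (intra ? 3 : 6);
    q->quant_4x4_core(dct, quant_mf[qscale % 6], qbits, f);
}

/* A caller that quantizes 16 blocks with the same qscale: after inlining,
 * the qbits/f/table computations are loop-invariant and can be hoisted. */
static void quant_16_blocks(const quant_ctx *q, int16_t dct[16][4][4],
                            int32_t quant_mf[6][4][4], int qscale, int intra)
{
    for (int i = 0; i < 16; i++)
        quant_4x4(q, dct[i], quant_mf, qscale, intra);
}
```

With the first option, the whole of quant_4x4 would sit behind the pointer instead, so each of the 16 calls would have to recompute these values.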
The patch contains the code for both options, although the second option
is used. If you want to try the first option, just replace calls like
quant_XxX( h, dct, ... )
quant_XxX_dc( h, dct, ... )
with
h->quantf.quant_XxX( dct, ... )
h->quantf.quant_XxX_dc( dct, ... )
in macroblock.c in all the occurrences. If you have the nerve, you can
check each occurrence separately for speed improvement.
--- Todo
BIG FAT WARNING: The patch also contains an SSE2 optimized version,
which I am not able to test. Comment out the appropriate code in
quant.c:x264_init_quant() if it doesn't produce correct results, or send
a patch to correct it. If it works, please send some benchmarks.
A patch for MMXEXT/SSE2 optimized quant functions for Athlon64 can be
done with minimum effort and is welcome.
Regards,
Christian Heine
----- End forwarded message -----
--
System administration <admin at via.ecp.fr>
VIA, Ecole Centrale Paris, France
--
This is the x264-devel mailing-list
To unsubscribe, go to: http://developers.videolan.org/lists.html