[x264-devel] x264 Development Newsletter: Vol 35

Jason Garrett-Glaser jason at x264.com
Wed Feb 27 00:17:48 CET 2013


This is the thirty-fifth x264 development newsletter, a regular email
containing updates on fixes and improvements in the most recent x264
push, along with notes on what's coming next.  Previous issues can be
found in the mailing list archives.

Fixes:

x86: Use simple nop codes for <= SSE functions on 32-bit: fixes
SIGILLs on the "CentaurHauls family 6 model 9 stepping 8" family of
CPUs.

Fix a possible non-determinism problem with mbtree + open-gop + sync-lookahead.

x86-64: Fix trellis asm with interlacing (only caused slight compression loss).

win64: Fix uses of the red zone in asm (could in theory cause crashes).

Improvements:

Improve x264_encoder_reconfig documentation.

Sync x86inc improvements from libav.

Enable DEP/ASLR on Windows.

Add QNX support to configure.

CABAC optimizations: faster PIC and win64 encode_decision.

x86: Use SSE in data-copying functions where SSE2 isn't necessary.

Improve threaded lookahead auto selection: first-pass encoding is
dramatically faster (by 50% or more in some cases on a 4-core Sandy
Bridge CPU, possibly more with more cores).

Optimize and clean up predictor checking in motion estimation, plus
add more x86 asm: ~5-20% faster predictor checking.

x86: ~15% faster SSD in high bit depth.
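
For readers unfamiliar with the metric, here is a minimal scalar sketch
of SSD (sum of squared differences) over 16-bit samples, which is how
high bit depth pixels are assumed to be stored here.  The function name
and signature are hypothetical and are not x264's actual code; the asm
mentioned above computes the same quantity with SIMD.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical reference routine (not x264's implementation).
 * Strides are given in samples, not bytes. */
uint64_t ssd_hbd(const uint16_t *a, ptrdiff_t stride_a,
                 const uint16_t *b, ptrdiff_t stride_b,
                 int width, int height)
{
    uint64_t sum = 0;
    for (int y = 0; y < height; y++) {
        for (int x = 0; x < width; x++) {
            /* Widen before subtracting so the difference can go negative. */
            int64_t d = (int64_t)a[x] - (int64_t)b[x];
            sum += (uint64_t)(d * d);
        }
        a += stride_a;
        b += stride_b;
    }
    return sum;
}
```

The 64-bit accumulator is a deliberately conservative choice for the
sketch: squared differences of 16-bit samples can exceed 32 bits once
summed over a large block.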

x86: port >MMX SATD functions to high bit depth: 20-50% faster SATD
functions in high bit depth.
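
As a rough illustration of what a SATD function computes, the sketch
below takes the difference of two 4x4 blocks of 16-bit samples, applies
a 4x4 Hadamard transform, and sums the absolute values of the result.
The name, the signature and the final halving are assumptions made for
this example; it is a plain C reference, not the ported asm.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical 4x4 SATD reference for 16-bit samples.
 * Strides are given in samples, not bytes. */
int satd_4x4_hbd(const uint16_t *a, ptrdiff_t stride_a,
                 const uint16_t *b, ptrdiff_t stride_b)
{
    int t[4][4];

    /* Difference plus horizontal Hadamard butterflies, one row at a time. */
    for (int y = 0; y < 4; y++) {
        int d0 = a[y * stride_a + 0] - b[y * stride_b + 0];
        int d1 = a[y * stride_a + 1] - b[y * stride_b + 1];
        int d2 = a[y * stride_a + 2] - b[y * stride_b + 2];
        int d3 = a[y * stride_a + 3] - b[y * stride_b + 3];
        int s01 = d0 + d1, m01 = d0 - d1;
        int s23 = d2 + d3, m23 = d2 - d3;
        t[y][0] = s01 + s23;
        t[y][1] = s01 - s23;
        t[y][2] = m01 + m23;
        t[y][3] = m01 - m23;
    }

    /* Vertical butterflies, accumulating absolute values. */
    int sum = 0;
    for (int x = 0; x < 4; x++) {
        int s01 = t[0][x] + t[1][x], m01 = t[0][x] - t[1][x];
        int s23 = t[2][x] + t[3][x], m23 = t[2][x] - t[3][x];
        sum += abs(s01 + s23) + abs(s01 - s23)
             + abs(m01 + m23) + abs(m01 - m23);
    }

    /* Halving is a normalisation assumed for this sketch. */
    return sum >> 1;
}
```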

x86-64: add a combined SATD/SA8D function for transform size decision;
~30% faster transform size decision.

x86: faster AVX SATD-related functions; ~2-9% faster (depending on
function) on Sandy Bridge.

x86: improved CPU flag handling and optimizations for various CPUs.
Detect the Bobcat and account for slow instructions on certain CPUs
(e.g. the Bobcat's slow palignr) at a finer granularity, so that the
correct functions are picked.  Also add slightly-Atom-optimized
versions of SATD-related functions.  This commit moves a number of CPU
flags around; see the commit message for more details.

Fix various store forwarding stalls for improved performance.

Eliminate some performance-critical branches where possible in motion
estimation and analysis -- a few percent fewer branch mispredictions
overall.

Add AVXSynth support to the AVISynth input module.

Improve quantization in 4x4dct blocks: quantize 4 blocks at a time for
improved performance.  Add a new assembly function and implement it in
x86 and ARM (NEON).  Also refactor a lot of related code to take
advantage of the new function; 1-2% faster encoding overall.
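
To make the idea concrete, here is a hedged C sketch of quantizing four
4x4 blocks in a single call.  Everything in it is an assumption for
illustration: the names, the 16-bit fixed-point multiplier and bias
tables, and the choice to return one bit per block indicating whether
any coefficient survived.  It is not the function added in this push.

```c
#include <stdint.h>

/* Quantize one 4x4 block in place (hypothetical helper).
 * mf[] is a per-coefficient multiplier scaled by 2^16 and bias[] a
 * rounding offset; results are assumed to fit in 16 bits.
 * Returns 1 if any quantized coefficient is nonzero, else 0. */
static int quant_block_4x4(int16_t coef[16],
                           const uint16_t mf[16],
                           const uint16_t bias[16])
{
    int nz = 0;
    for (int i = 0; i < 16; i++) {
        int c = coef[i];
        uint64_t mag = (uint64_t)(c < 0 ? -c : c);
        int q = (int)(((mag + bias[i]) * mf[i]) >> 16);
        coef[i] = (int16_t)(c < 0 ? -q : q);
        nz |= q;
    }
    return nz != 0;
}

/* Quantize four 4x4 blocks with one call (hypothetical name).
 * Bit b of the return value is set if block b has a nonzero
 * coefficient after quantization. */
int quant_4x4_x4(int16_t coef[4][16],
                 const uint16_t mf[16],
                 const uint16_t bias[16])
{
    int mask = 0;
    for (int b = 0; b < 4; b++)
        mask |= quant_block_4x4(coef[b], mf, bias) << b;
    return mask;
}
```

Processing four blocks per call amortizes call overhead and lets a SIMD
version keep the multiplier and bias tables in registers across blocks;
the scalar loop above only shows the data flow.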

ARM: finally update the mc_chroma NEON asm to support NV12: up to
10-15% faster encoding overall.

Upcoming:

AVX2 optimizations are in the pipeline, largely waiting on acquisition
of a Haswell for testing.

Jason Garrett-Glaser

The x264 Team

