[x264-devel] x264 Development Newsletter: Vol 27

Sat Feb 4 21:10:22 CET 2012

This is the twenty-seventh x264 development newsletter. This is a
regular email containing updates on fixes and improvements in the most
recent x264 push, along with updates on what's coming next.  Previous
versions can be found in the mailing list archives.

Fixes:

Fix an invalid memory access with intra-refresh + VBV that could cause
crashes in some cases.

Fix a crash in --demuxer y4m with unsupported input colorspace.

Blu-ray compliance fix: Blu-ray doesn't allow referencing across
I-frames, that is, it treats I-frames like IDR -- so just don't force
--keyint-min 1.  I doubt this caused any issues with real playback
hardware, but the spec mandates it for some insane reason.  Thanks to
Pegasys for the bug report.

Fix input colorspace handling with packed 4:2:2.

Force ARM function alignment to 4 bytes, to deal with linkers that
don't align mixed Thumb/ARM code correctly by default.

Fix the win32 implementation of pthread_cond_signal -- not actually
used by x264 (which only uses broadcast), so simply for completeness.

Improvements:

Add a Windows resource file, to show x264 version information,
copyright, etc. in Windows Explorer.

Bump dates to 2012, now that Checkuary has ended.

Swap the bit depth conversion dithering algorithm to assume TV range
instead of PC range.  Since most sources are TV, this is more correct
for the common case, and the error in the case of the source being PC
range is smaller than in the reverse case (PC range algorithm on TV
range data).
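
To illustrate the difference between the two range assumptions, here
is a minimal sketch of a 10-bit to 8-bit conversion under each one.
This is not x264's code: the function names are made up for this
example, the 10-bit input depth is an assumption, and dithering is
left out entirely so that only the range mapping is visible.

```c
#include <stdint.h>

/* TV (limited) range: 8-bit codes 16..235 correspond to 10-bit codes
 * 64..940, an exact factor of 4, so a rounded shift is enough.
 * Hypothetical helper, not an x264 function. */
uint8_t down_tv(uint16_t v10)
{
    unsigned r = (v10 + 2u) >> 2;        /* round to nearest */
    return r > 255 ? 255 : (uint8_t)r;   /* clamp the top codes */
}

/* PC (full) range: 10-bit codes 0..1023 map onto 8-bit codes 0..255,
 * so the scale factor is 255/1023 rather than 1/4.
 * Hypothetical helper, not an x264 function. */
uint8_t down_pc(uint16_t v10)
{
    return (uint8_t)((v10 * 255u + 511u) / 1023u);
}
```

With TV-range input, down_tv() maps the nominal black and white
levels 64 and 940 exactly onto 16 and 235, while down_pc() maps 940
to 234.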

Follow lav's example and switch from %ifdef to %if in asm, to allow
combining conditionals.

x86inc: add high halfword register support and add a TAIL_CALL macro
for abstracting away another common asm idiom.

More x86 asm optimizations: AVX 32-bit hpel_filter_h; faster on Sandy
Bridge.  More XOP optimizations; frame_init_lowres and 8x8 zigzag
functions.  Enable SSSE3 weight on newer CPUs where it's faster.
General asm cleanups and optimizations.

Add CPU detection support for 5 new instruction sets: TBM, AVX2, FMA3,
BMI1, and BMI2.  I have some test patches with some minor performance
optimizations using most of these (except TBM), but I'm not going to
commit them as it would bump the yasm version requirement to 1.2,
which isn't in most distros yet, and is rather needless at the moment
given that none of the chips that support these are available at
retail yet.
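
For reference, a rough sketch of how these five feature bits can be
decoded from CPUID output.  This is not x264's detection code: the
function and flag names are invented for the example, and the
register values are passed in as plain integers so the logic can be
read (and tested) without executing CPUID.  Real detection of AVX2
and FMA3 would also have to confirm that the OS saves the YMM state,
which this sketch skips.

```c
#include <stdint.h>

/* Hypothetical flag names for this example only. */
enum {
    FEAT_FMA3 = 1 << 0,
    FEAT_BMI1 = 1 << 1,
    FEAT_AVX2 = 1 << 2,
    FEAT_BMI2 = 1 << 3,
    FEAT_TBM  = 1 << 4
};

/* leaf1_ecx: ECX from CPUID leaf 1
 * leaf7_ebx: EBX from CPUID leaf 7, subleaf 0
 * ext1_ecx:  ECX from CPUID leaf 0x80000001 (AMD extended features) */
unsigned decode_features(uint32_t leaf1_ecx, uint32_t leaf7_ebx,
                         uint32_t ext1_ecx)
{
    unsigned f = 0;
    if (leaf1_ecx & (1u << 12)) f |= FEAT_FMA3;  /* FMA  */
    if (leaf7_ebx & (1u << 3))  f |= FEAT_BMI1;  /* BMI1 */
    if (leaf7_ebx & (1u << 5))  f |= FEAT_AVX2;  /* AVX2 */
    if (leaf7_ebx & (1u << 8))  f |= FEAT_BMI2;  /* BMI2 */
    if (ext1_ecx  & (1u << 21)) f |= FEAT_TBM;   /* TBM  */
    return f;
}
```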

A digression on new CPU instructions:

TBM, BMI1, and FMA3 will be supported in AMD's upcoming
Trinity/Piledriver CPU.  FMA3 is redundant in this case, as AMD
already supports FMA4, which is a strict superset of FMA3, so this is
just for compatibility with Intel (see
http://en.wikipedia.org/wiki/FMA_instruction_set#History).

The bit manipulation instructions look vaguely interesting, but the
most powerful ones are left for BMI2 (pdep, pext...) and neither Intel
nor AMD has explained in much detail what exactly these are all for,
or why they justify all this new silicon (besides "being kind of cool
and interesting").  TBM doubly so -- it's hardly been mentioned on the
Internet *anywhere* (except in people writing assemblers...) and AMD
hasn't given the slightest clue as to what it's really intended for --
despite the fact that it's already in physical silicon!  In fact, yasm
doesn't even support TBM yet -- I submitted a patch to the yasm
mailing list to add support, though I haven't gotten a response yet.
I have access to a Trinity engineering sample, and I'd love to do some
testing with these, but it would be nice to know what I'm supposed to
*do* with these instructions...
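
For readers who haven't looked at pext and pdep, here is a plain C
model of what the 32-bit forms compute.  The function names are made
up for this example, and the loops only describe the semantics; the
hardware instructions do this in a single operation.

```c
#include <stdint.h>

/* Model of pext: gather the bits of src selected by mask and pack
 * them into the low bits of the result. */
uint32_t pext32(uint32_t src, uint32_t mask)
{
    uint32_t out = 0;
    unsigned k = 0;                     /* next output bit position */
    for (unsigned i = 0; i < 32; i++) {
        if (mask & (1u << i)) {
            if (src & (1u << i))
                out |= 1u << k;
            k++;
        }
    }
    return out;
}

/* Model of pdep: the inverse scatter.  Take the low bits of src and
 * deposit them at the bit positions selected by mask. */
uint32_t pdep32(uint32_t src, uint32_t mask)
{
    uint32_t out = 0;
    unsigned k = 0;                     /* next input bit position */
    for (unsigned i = 0; i < 32; i++) {
        if (mask & (1u << i)) {
            if (src & (1u << k))
                out |= 1u << i;
            k++;
        }
    }
    return out;
}
```

For example, pext32(0x12345678, 0x00FF00FF) yields 0x3478, and
pdep32(0x3478, 0x00FF00FF) scatters it back out to 0x00340078.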

Haswell will support all of these except TBM, but Haswell isn't until
2013.  AVX2 looks a bit odd: instead of extending most SSE
instructions intuitively to the full 256-bit width, Intel has this
"128-bit lane" design, in which the instruction operates on each
128-bit half of the register separately.  This results in pretty
unintuitive behavior for some instructions, especially shuffles.  I
hope AVX2 will be relatively
useful, but x86 SIMD is getting unfathomably crufty at this point --
and now the cruftiness is spreading back to GPR instructions too as
AMD and Intel realize that they now have a massive, massive VEX/XOP
instruction space to fill with whatever their hearts desire.
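
The lane behavior is easiest to see with a byte shuffle.  Below is a
scalar C model of the 256-bit form of pshufb, written only to show
the semantics; the function names are invented for this example.
Each control byte can only select from the 16 source bytes in its own
lane, so a control vector of all zeros does not broadcast byte 0
across the whole register.

```c
#include <stdint.h>

/* Scalar model of 256-bit pshufb: two independent 16-byte shuffles. */
void vpshufb256_model(uint8_t dst[32], const uint8_t src[32],
                      const uint8_t ctrl[32])
{
    uint8_t tmp[32];
    for (int i = 0; i < 32; i++) {
        int lane = i & 16;              /* 0 for low lane, 16 for high */
        /* High bit of the control byte zeroes the output; otherwise
         * the low 4 bits index within the current lane only. */
        tmp[i] = (ctrl[i] & 0x80) ? 0 : src[lane + (ctrl[i] & 0x0F)];
    }
    for (int i = 0; i < 32; i++)
        dst[i] = tmp[i];
}

/* Usage example: src holds 0..31 and every control byte is 0.
 * Returns output byte i of the shuffle (i must be in 0..31). */
uint8_t lane_demo(int i)
{
    uint8_t src[32], ctrl[32], dst[32];
    for (int j = 0; j < 32; j++) {
        src[j]  = (uint8_t)j;
        ctrl[j] = 0;
    }
    vpshufb256_model(dst, src, ctrl);
    return dst[i];
}
```

Here lane_demo() returns 0 for every byte of the low lane but 16 for
every byte of the high lane, since the high lane's "byte 0" is source
byte 16.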

Maybe they should pitch it all out and just adopt NEON.

Upcoming:

Google Code-In is done, but a bunch of NEON assembly still needs review.

x262 is under development: a best-in-class MPEG-2 encoder built using
the x264 framework.  It works well enough to be vaguely usable now,
but is still highly experimental and needs more work -- developers
welcome!

Jason Garrett-Glaser

The x264 Team