[x265-commits] [x265] asm: properly disable x265_stack_align() when ENABLE_ASSE...

Mon Mar 9 18:48:41 CET 2015

details:   http://hg.videolan.org/x265/rev/cbc41dfdb5c4
branches:  stable
changeset: 9666:cbc41dfdb5c4
user:      Steve Borho <steve at borho.org>
date:      Fri Mar 06 22:27:54 2015 -0600
description:
asm: properly disable x265_stack_align() when ENABLE_ASSEMBLY is OFF
Subject: [x265] param: disallow encodes without room for P frame in lookahead

details:   http://hg.videolan.org/x265/rev/2e93cd58ff61
branches:  stable
changeset: 9667:2e93cd58ff61
user:      Steve Borho <steve at borho.org>
date:      Mon Mar 09 11:57:36 2015 -0500
description:
param: disallow encodes without room for P frame in lookahead
Subject: [x265] Merge with stable

details:   http://hg.videolan.org/x265/rev/62c02919ced8
branches:  
changeset: 9668:62c02919ced8
user:      Steve Borho <steve at borho.org>
date:      Mon Mar 09 12:47:47 2015 -0500
description:
Merge with stable

diffstat:

 doc/reST/cli.rst                     |   295 +-
 doc/reST/presets.rst                 |     9 +-
 doc/reST/threading.rst               |   100 +-
 readme.rst                           |    14 +
 source/CMakeLists.txt                |    66 +-
 source/cmake/FindNuma.cmake          |    43 +
 source/common/CMakeLists.txt         |     2 +-
 source/common/bitstream.cpp          |     4 +-
 source/common/common.cpp             |     4 +
 source/common/common.h               |    15 +-
 source/common/constants.cpp          |     2 +-
 source/common/constants.h            |     2 +-
 source/common/cudata.cpp             |   292 +-
 source/common/cudata.h               |    20 +-
 source/common/dct.cpp                |    13 +-
 source/common/deblock.cpp            |     4 +-
 source/common/framedata.h            |     2 +
 source/common/intrapred.cpp          |    28 +
 source/common/ipfilter.cpp           |    45 +-
 source/common/lowres.cpp             |    23 +-
 source/common/lowres.h               |     3 +-
 source/common/mv.h                   |     9 +-
 source/common/param.cpp              |    86 +-
 source/common/picyuv.cpp             |     8 +-
 source/common/pixel.cpp              |     2 +-
 source/common/predict.cpp            |   344 +-
 source/common/predict.h              |    60 +-
 source/common/primitives.cpp         |     3 +-
 source/common/primitives.h           |    19 +-
 source/common/quant.cpp              |   102 +-
 source/common/quant.h                |     4 +-
 source/common/scalinglist.cpp        |     2 +-
 source/common/shortyuv.cpp           |     6 +-
 source/common/slice.cpp              |    12 +-
 source/common/slice.h                |    19 +-
 source/common/threading.cpp          |     7 +
 source/common/threading.h            |    63 +-
 source/common/threadpool.cpp         |   699 ++--
 source/common/threadpool.h           |   166 +-
 source/common/wavefront.cpp          |    12 +-
 source/common/wavefront.h            |    11 +-
 source/common/x86/asm-primitives.cpp |   407 ++-
 source/common/x86/blockcopy8.asm     |  1548 ++++++++-
 source/common/x86/blockcopy8.h       |    42 +
 source/common/x86/const-a.asm        |    19 +-
 source/common/x86/dct8.asm           |   362 ++
 source/common/x86/dct8.h             |     1 +
 source/common/x86/intrapred.h        |    24 +-
 source/common/x86/intrapred16.asm    |   530 +++-
 source/common/x86/intrapred8.asm     |  1282 ++++++++-
 source/common/x86/ipfilter16.asm     |  2697 +++++++++++++++++-
 source/common/x86/ipfilter8.asm      |  5224 +++++++++++++++++++++++++++++++++-
 source/common/x86/ipfilter8.h        |    27 +-
 source/common/x86/mc-a.asm           |    70 +-
 source/common/x86/pixel-a.asm        |   574 +++
 source/common/x86/pixel-util.h       |    12 +-
 source/common/x86/pixel-util8.asm    |   343 ++-
 source/common/x86/pixel.h            |    26 +
 source/common/x86/pixeladd8.asm      |   161 +
 source/encoder/analysis.cpp          |   901 ++---
 source/encoder/analysis.h            |    44 +-
 source/encoder/api.cpp               |     3 +-
 source/encoder/dpb.cpp               |    44 +-
 source/encoder/dpb.h                 |     4 +-
 source/encoder/encoder.cpp           |   453 ++-
 source/encoder/encoder.h             |     7 +-
 source/encoder/entropy.cpp           |   182 +-
 source/encoder/entropy.h             |     6 +-
 source/encoder/frameencoder.cpp      |   259 +-
 source/encoder/frameencoder.h        |    29 +-
 source/encoder/framefilter.cpp       |     7 +-
 source/encoder/level.cpp             |    25 +-
 source/encoder/motion.cpp            |    67 +-
 source/encoder/motion.h              |     1 +
 source/encoder/nal.cpp               |     2 +-
 source/encoder/ratecontrol.cpp       |   213 +-
 source/encoder/ratecontrol.h         |   216 +-
 source/encoder/sao.cpp               |     2 +
 source/encoder/search.cpp            |   861 +++--
 source/encoder/search.h              |   167 +-
 source/encoder/slicetype.cpp         |  1671 ++++++----
 source/encoder/slicetype.h           |   250 +-
 source/encoder/weightPrediction.cpp  |    20 +-
 source/input/y4m.cpp                 |    58 +-
 source/output/y4m.cpp                |     8 -
 source/output/yuv.cpp                |     4 -
 source/profile/cpuEvents.h           |     3 +-
 source/test/CMakeLists.txt           |     3 +
 source/test/ipfilterharness.cpp      |    73 +-
 source/test/ipfilterharness.h        |     4 +-
 source/test/mbdstharness.cpp         |    64 +-
 source/test/pixelharness.cpp         |    21 +-
 source/test/testbench.cpp            |     6 +
 source/test/testharness.h            |     2 +-
 source/x265.cpp                      |     4 +-
 source/x265.h                        |   561 ++-
 source/x265cli.h                     |    27 +-
 97 files changed, 17788 insertions(+), 4453 deletions(-)
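
As a reading aid for the new --pools string documented in the cli.rst hunk below, here is a minimal Python sketch of how such a string could map to a per-node thread count. It is not the encoder's implementation: the function name `threads_per_node`, the `cores_per_node` list, and the choice to cap a numeric entry at the node's core count are all assumptions made for illustration, and the splitting of nodes with more than 64 logical cores into multiple pools is not modelled.

```python
def threads_per_node(pools, cores_per_node):
    """Hypothetical helper: map a --pools string to threads per NUMA node.

    cores_per_node[i] is the number of logical cores detected on node i.
    Nodes not mentioned in the string get no threads, matching the
    documented example "10" (one pool on node 0 only).
    """
    # NULL or "" is the documented default: use every core on every node.
    if pools is None or pools == "":
        return list(cores_per_node)
    # "none" disables worker pools entirely.
    if pools == "none":
        return [0] * len(cores_per_node)

    result = [0] * len(cores_per_node)
    for node, token in enumerate(pools.split(",")):
        if node >= len(cores_per_node):
            break  # ignore entries for nodes that do not exist
        token = token.strip()
        if token == "*":
            # all cores on this node and on all remaining nodes
            result[node:] = cores_per_node[node:]
            break
        if token == "+":
            result[node] = cores_per_node[node]
        elif token == "-":
            result[node] = 0
        else:
            # Assumption: "up to N" is read as capped by the cores present.
            result[node] = min(int(token), cores_per_node[node])
    return result


def pool_count(pools, cores_per_node):
    """Number of nodes that end up with at least one thread."""
    return sum(1 for n in threads_per_node(pools, cores_per_node) if n > 0)
```

On a hypothetical 4-node machine with 16 logical cores per node, `threads_per_node("-,*", [16, 16, 16, 16])` yields threads on nodes 1, 2 and 3 only, which matches the "three pools" example in the documentation.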

diffs (truncated from 30188 to 300 lines):

diff -r b5c42efdd600 -r 62c02919ced8 doc/reST/cli.rst

--- a/doc/reST/cli.rst	Mon Feb 23 13:25:07 2015 -0600
+++ b/doc/reST/cli.rst	Mon Mar 09 12:47:47 2015 -0500
@@ -171,19 +171,54 @@ Performance Options
 	Over-allocation of frame threads will not improve performance, it
 	will generally just increase memory use.
 
-.. option:: --threads <integer>
+	**Values:** any value between 8 and 16. Default is 0, auto-detect
 
-	Number of threads to allocate for the worker thread pool  This pool
-	is used for WPP and for distributed analysis and motion search:
-	:option:`--wpp` :option:`--pmode` and :option:`--pme` respectively.
+.. option:: --pools <string>, --numa-pools <string>
 
-	If :option:`--threads` 1 is specified, then no thread pool is
-	created. When no thread pool is created, all the thread pool
-	features are implicitly disabled. If all the pool features are
-	disabled by the user, then the pool is implicitly disabled.
+	Comma-separated list of threads per NUMA node. If "none", then no worker
+	pools are created and only frame parallelism is possible. If NULL or ""
+	(default) x265 will use all available threads on each NUMA node::
 
-	Default 0, one thread is allocated per detected hardware thread
-	(logical CPU cores)
+	'+'  is a special value indicating all cores detected on the node
+	'*'  is a special value indicating all cores detected on the node and all remaining nodes
+	'-'  is a special value indicating no cores on the node, same as '0'
+
+	example strings for a 4-node system::
+
+	""        - default, unspecified, all numa nodes are used for thread pools
+	"*"       - same as default
+	"none"    - no thread pools are created, only frame parallelism possible
+	"-"       - same as "none"
+	"10"      - allocate one pool, using up to 10 cores on node 0
+	"-,+"     - allocate one pool, using all cores on node 1
+	"+,-,+"   - allocate two pools, using all cores on nodes 0 and 2
+	"+,-,+,-" - allocate two pools, using all cores on nodes 0 and 2
+	"-,*"     - allocate three pools, using all cores on nodes 1, 2 and 3
+	"8,8,8,8" - allocate four pools with up to 8 threads in each pool
+
+	The total number of threads will be determined by the number of threads
+	assigned to all nodes. The worker threads will each be given affinity for
+	their node, they will not be allowed to migrate between nodes, but they
+	will be allowed to move between CPU cores within their node.
+
+	If the three pool features: :option:`--wpp` :option:`--pmode` and
+	:option:`--pme` are all disabled, then :option:`--pools` is ignored
+	and no thread pools are created.
+
+	If "none" is specified, then all three of the thread pool features are
+	implicitly disabled.
+
+	Multiple thread pools will be allocated for any NUMA node with more than
+	64 logical CPU cores. But any given thread pool will always use at most
+	one NUMA node.
+
+	Frame encoders are distributed between the available thread pools,
+	and the encoder will never generate more thread pools than
+	:option:`--frame-threads`.  The pools are used for WPP and for
+	distributed analysis and motion search.
+
+	Default "", one thread is allocated per detected hardware thread
+	(logical CPU cores) and one thread pool per NUMA node.
 
 .. option:: --wpp, --no-wpp
 
@@ -409,7 +444,17 @@ Profile, Level, Tier
 	If :option:`--level-idc` has been specified, the option adds the
 	intention to support the High tier of that level. If your specified
 	level does not support a High tier, a warning is issued and this
-	modifier flag is ignored.
+	modifier flag is ignored. If :option:`--level-idc` has been specified,
+	but not --high-tier, then the encoder will attempt to encode at the 
+	specified level, main tier first, turning on high tier only if 
+	necessary and available at that level.
+
+.. option:: --ref <1..16>
+
+	Max number of L0 references to be allowed. This number has a linear
+	multiplier effect on the amount of work performed in motion search,
+	but will generally have a beneficial effect on compression and
+	distortion. Default 3
 
 .. note::
 	:option:`--profile`, :option:`--level-idc`, and
@@ -465,6 +510,23 @@ the prediction quad-tree.
 	and less frame parallelism as well. Because of this the faster
 	presets use a CU size of 32. Default: 64
 
+.. option:: --min-cu-size <64|32|16|8>
+
+	Minimum CU size (width and height). By using 16 or 32 the encoder
+	will not analyze the cost of CUs below that minimum threshold,
+	saving considerable amounts of compute with a predictable increase
+	in bitrate. This setting has a large effect on performance on the
+	faster presets.
+
+	Default: 8 (minimum 8x8 CU for HEVC, best compression efficiency)
+
+.. note::
+
+	All encoders within a single process must use the same settings for
+	the CU size range. :option:`--ctu` and :option:`--min-cu-size` must
+	be consistent for all of them since the encoder configures several
+	key global data structures based on this range.
+
 .. option:: --rect, --no-rect
 
 	Enable analysis of rectangular motion partitions Nx2N and 2NxN
@@ -494,14 +556,6 @@ the prediction quad-tree.
 	Measure full CU size (2Nx2N) merge candidates first; if no residual
 	is found the analysis is short circuited. Default disabled
 
-.. option:: --fast-cbf, --no-fast-cbf
-
-	Short circuit analysis if a prediction is found that does not set
-	the coded block flag (aka: no residual was encoded).  It prevents
-	the encoder from perhaps finding other predictions that also have no
-	residual but require less signaling bits or have less distortion.
-	Only applicable for RD levels 5 and 6. Default disabled
-
 .. option:: --fast-intra, --no-fast-intra
 
 	Perform an initial scan of every fifth intra angular mode, then
@@ -526,14 +580,6 @@ the prediction quad-tree.
 	Only effective at RD levels 3 and above, which perform RDO mode
 	decisions.
 
-.. option:: --tskip, --no-tskip
-
-	Enable evaluation of transform skip (bypass DCT but still use
-	quantization) coding for 4x4 TU coded blocks.
-
-	Only effective at RD levels 3 and above, which perform RDO mode
-	decisions. Default disabled
-
 .. option:: --tskip-fast, --no-tskip-fast
 
 	Only evaluate transform skip for NxN intra predictions (4x4 blocks).
@@ -567,6 +613,30 @@ not match.
 Options which affect the transform unit quad-tree, sometimes referred to
 as the residual quad-tree (RQT).
 
+.. option:: --rdoq-level <0|1|2>, --no-rdoq-level
+
+	Specify the amount of rate-distortion analysis to use within
+	quantization::
+
+	At level 0 rate-distortion cost is not considered in quant
+	
+	At level 1 rate-distortion cost is used to find optimal rounding
+	values for each level (and allows psy-rdoq to be effective). It
+	trades off the signaling cost of the coefficient vs its post-inverse
+	quant distortion from the pre-quant coefficient. When
+	:option:`--psy-rdoq` is enabled, this formula is biased in favor of
+	more energy in the residual (larger coefficient absolute levels)
+	
+	At level 2 rate-distortion cost is used to make decimate decisions
+	on each 4x4 coding group, including the cost of signaling the group
+	within the group bitmap. If the total distortion of not signaling
+	the entire coding group is less than the rate cost, the block is
+	decimated. Next, it applies rate-distortion cost analysis to the
+	last non-zero coefficient, which can result in many (or all) of the
+	coding groups being decimated. Psy-rdoq is less effective at
+	preserving energy when RDOQ is at level 2, since it only has
+	influence over the level distortion costs.
+
 .. option:: --tu-intra-depth <1..4>
 
 	The transform unit (residual) quad-tree begins with the same depth
@@ -593,9 +663,76 @@ as the residual quad-tree (RQT).
 	partitions, in which case a TU split is implied and thus the
 	residual quad-tree begins one layer below the CU quad-tree.
 
+.. option:: --nr-intra <integer>, --nr-inter <integer>
+
+	Noise reduction - an adaptive deadzone applied after DCT
+	(subtracting from DCT coefficients), before quantization.  It does
+	no pixel-level filtering, doesn't cross DCT block boundaries, has no
+	overlap. The higher the strength value parameter, the more
+	aggressively it will reduce noise.
+
+	Enabling noise reduction will make outputs diverge between different
+	numbers of frame threads. Outputs will be deterministic but the
+	outputs of -F2 will no longer match the outputs of -F3, etc.
+
+	**Values:** any value in range of 0 to 2000. Default 0 (disabled).
+
+.. option:: --tskip, --no-tskip
+
+	Enable evaluation of transform skip (bypass DCT but still use
+	quantization) coding for 4x4 TU coded blocks.
+
+	Only effective at RD levels 3 and above, which perform RDO mode
+	decisions. Default disabled
+
+.. option:: --rdpenalty <0..2>
+
+	When set to 1, transform units of size 32x32 are given a 4x bit cost
+	penalty compared to smaller transform units, in intra coded CUs in P
+	or B slices.
+
+	When set to 2, transform units of size 32x32 are not even attempted,
+	unless otherwise required by the maximum recursion depth.  For this
+	option to be effective with 32x32 intra CUs,
+	:option:`--tu-intra-depth` must be at least 2.  For it to be
+	effective with 64x64 intra CUs, :option:`--tu-intra-depth` must be
+	at least 3.
+
+	Note that in HEVC an intra transform unit (a block of the residual
+	quad-tree) is also a prediction unit, meaning that the intra
+	prediction signal is generated for each TU block, the residual
+	subtracted and then coded. The coding unit simply provides the
+	prediction modes that will be used when predicting all of the
+	transform units within the CU. This means that when you prevent
+	32x32 intra transform units, you are preventing 32x32 intra
+	predictions.
+
+	Default 0, disabled.
+
+	**Values:** 0:disabled 1:4x cost penalty 2:force splits
+
+.. option:: --max-tu-size <32|16|8|4>
+
+	Maximum TU size (width and height). The residual can be more
+	efficiently compressed by the DCT transform when the max TU size
+	is larger, but at the expense of more computation. Transform unit
+	quad-tree begins at the same depth as the coded tree unit, but if the
+	maximum TU size is smaller than the CU size then transform QT begins 
+	at the depth of the max-tu-size. Default: 32.
+
 Temporal / motion search options
 ================================
 
+.. option:: --max-merge <1..5>
+
+	Maximum number of neighbor (spatial and temporal) candidate blocks
+	that the encoder may consider for merging motion predictions. If a
+	merge candidate results in no residual, it is immediately selected
+	as a "skip".  Otherwise the merge candidates are tested as part of
+	motion estimation when searching for the least cost inter option.
+	The max candidate number is encoded in the SPS and determines the
+	bit cost of signaling merge CUs. Default 2
+
 .. option:: --me <integer|string>
 
 	Motion search method. Generally, the higher the number the harder
@@ -658,16 +795,6 @@ Temporal / motion search options
 
 	**Range of values:** an integer from 0 to 32768
 
-.. option:: --max-merge <1..5>
-
-	Maximum number of neighbor (spatial and temporal) candidate blocks
-	that the encoder may consider for merging motion predictions. If a
-	merge candidate results in no residual, it is immediately selected
-	as a "skip".  Otherwise the merge candidates are tested as part of
-	motion estimation when searching for the least cost inter option.
-	The max candidate number is encoded in the SPS and determines the
-	bit cost of signaling merge CUs. Default 2
-
 .. option:: --temporal-mvp, --no-temporal-mvp
 
 	Enable temporal motion vector predictors in P and B slices.
@@ -704,32 +831,6 @@ Spatial/intra options
 	propagation of reference errors that may have resulted from lossy
 	signals. Default disabled
 
-.. option:: --rdpenalty <0..2>
-
-	When set to 1, transform units of size 32x32 are given a 4x bit cost
-	penalty compared to smaller transform units, in intra coded CUs in P
-	or B slices.
-
-	When set to 2, transform units of size 32x32 are not even attempted,
-	unless otherwise required by the maximum recursion depth.  For this
-	option to be effective with 32x32 intra CUs,
-	:option:`--tu-intra-depth` must be at least 2.  For it to be
-	effective with 64x64 intra CUs, :option:`--tu-intra-depth` must be
-	at least 3.
-
-	Note that in HEVC an intra transform unit (a block of the residual
-	quad-tree) is also a prediction unit, meaning that the intra
-	prediction signal is generated for each TU block, the residual
-	subtracted and then coded. The coding unit simply provides the
-	prediction modes that will be used when predicting all of the
-	transform units within the CU. This means that when you prevent
-	32x32 intra transform units, you are preventing 32x32 intra
-	predictions.
-
-	Default 0, disabled.
-
-	**Values:** 0:disabled 1:4x cost penalty 2:force splits
-
 Psycho-visual options
 =====================
 
@@ -752,8 +853,8 @@ of blurred prediction modes, like DC and