[x265-commits] [x265] search: hoist bufScale allocation out of its limited scope

Tue Feb 24 00:57:16 CET 2015

details:   http://hg.videolan.org/x265/rev/b5c42efdd600
branches:  stable
changeset: 9390:b5c42efdd600
user:      Steve Borho <steve at borho.org>
date:      Mon Feb 23 13:25:07 2015 -0600
description:
search: hoist bufScale allocation out of its limited scope

bufScale was going out of scope while the fenc pointer still pointed to it and
the buffer was still used for intra cost measurements - definitely bad juju.
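
A minimal sketch of the kind of fix described above, not the actual x265 code.
Only the names bufScale and fenc come from the commit message; measureCost,
sumBlock, demoCost and the 64-entry buffer size are hypothetical stand-ins.
The point is only that the buffer must be declared in a scope that outlives
every use of the pointer that refers to it.

```cpp
#include <cassert>
#include <cstdint>

typedef uint8_t pixel;

// Stand-in for an intra cost measurement: sums the pixels it is given.
static int sumBlock(const pixel* src, int count)
{
    int sum = 0;
    for (int i = 0; i < count; i++)
        sum += src[i];
    return sum;
}

// Hypothetical function; only bufScale and fenc are names from the commit.
int measureCost(const pixel* orig, int count, bool scale)
{
    assert(count <= 64);

    // Hoisted to function scope so it outlives every use of fenc below.
    // Declared inside the if-block, it would die at the closing brace and
    // leave fenc dangling.
    pixel bufScale[64];
    const pixel* fenc = orig;

    if (scale)
    {
        for (int i = 0; i < count; i++)
            bufScale[i] = orig[i] >> 1; // any derived copy of the source
        fenc = bufScale;
    }

    return sumBlock(fenc, count);
}

// Small driver: returns the cost of a fixed 4-pixel block.
int demoCost(bool scale)
{
    const pixel block[4] = { 2, 4, 6, 8 };
    return measureCost(block, 4, scale);
}
```

Calling demoCost(true) exercises the path where fenc points at bufScale, which
is the path that read a dead buffer before the hoist.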
Subject: [x265] Merge with stable

details:   http://hg.videolan.org/x265/rev/e379d79cd412
branches:  
changeset: 9391:e379d79cd412
user:      Steve Borho <steve at borho.org>
date:      Mon Feb 23 13:30:13 2015 -0600
description:
Merge with stable
Subject: [x265] search: typo

details:   http://hg.videolan.org/x265/rev/46f7a923389b
branches:  
changeset: 9392:46f7a923389b
user:      Steve Borho <steve at borho.org>
date:      Mon Feb 23 13:40:38 2015 -0600
description:
search: typo
Subject: [x265] search: use a single aligned malloc for intra analysis buffers

details:   http://hg.videolan.org/x265/rev/06a98dfd89c9
branches:  
changeset: 9393:06a98dfd89c9
user:      Steve Borho <steve at borho.org>
date:      Mon Feb 23 14:02:04 2015 -0600
description:
search: use a single aligned malloc for intra analysis buffers

This avoids very large stack allocations of up to 72KBytes. Also improves a few
comments.
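
As a rough illustration of the idea (one aligned heap block carved into several
buffers instead of several large stack arrays), here is a sketch under stated
assumptions. The struct, the function names, the buffer count of three and the
32-byte alignment are all hypothetical, and it uses plain std::malloc with
manual alignment rather than x265's own allocation helpers.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstdlib>

// Hypothetical container for three analysis buffers sharing one allocation.
struct IntraBuffers
{
    void*    raw;    // pointer returned by malloc, kept for free()
    uint8_t* buf[3]; // aligned sub-buffers carved out of raw
};

bool allocBuffers(IntraBuffers& b, size_t bytesEach, size_t alignment)
{
    // Round each sub-buffer up to a multiple of the alignment so that every
    // sub-buffer starts on an aligned address.
    size_t stride = (bytesEach + alignment - 1) / alignment * alignment;

    // Over-allocate by (alignment - 1) bytes so the base can be rounded up.
    b.raw = std::malloc(3 * stride + alignment - 1);
    if (!b.raw)
        return false;

    uintptr_t base = (reinterpret_cast<uintptr_t>(b.raw) + alignment - 1)
                     / alignment * alignment;
    for (int i = 0; i < 3; i++)
        b.buf[i] = reinterpret_cast<uint8_t*>(base + i * stride);
    return true;
}

void freeBuffers(IntraBuffers& b)
{
    std::free(b.raw);
    b.raw = nullptr;
}

// Small driver: allocates, checks alignment and spacing, then frees.
bool demoAlignedBuffers()
{
    IntraBuffers b;
    if (!allocBuffers(b, 1000, 32))
        return false;

    bool ok = true;
    for (int i = 0; i < 3; i++)
    {
        ok = ok && (reinterpret_cast<uintptr_t>(b.buf[i]) % 32 == 0);
        b.buf[i][0] = 1;   // touch both ends of each sub-buffer
        b.buf[i][999] = 2;
    }
    ok = ok && (b.buf[1] - b.buf[0] >= 1000) && (b.buf[2] - b.buf[1] >= 1000);
    freeBuffers(b);
    return ok;
}
```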
Subject: [x265] search: use aligned mallocs for transform skip temp buffers

details:   http://hg.videolan.org/x265/rev/8ee700424b5a
branches:  
changeset: 9394:8ee700424b5a
user:      Steve Borho <steve at borho.org>
date:      Mon Feb 23 14:16:27 2015 -0600
description:
search: use aligned mallocs for transform skip temp buffers

This is to avoid frequent 4k stack allocations (and less reliance on an aligned
stack for best performance)
Subject: [x265] NUMA based thread pools

details:   http://hg.videolan.org/x265/rev/62b8fe990df5
branches:  
changeset: 9395:62b8fe990df5
user:      Steve Borho <steve at borho.org>
date:      Thu Feb 19 09:45:59 2015 -0600
description:
NUMA based thread pools

On systems with multiple NUMA nodes (typically multi-socket workstations or
servers) x265 will now create one thread pool per NUMA node and distribute
frame encoders to the pools evenly. Each frame encoder will use only one
thread pool and its worker threads. This prevents threads from different NUMA
nodes working together on the same frame.

On UNIX, you must link with libnuma in order for x265 to be NUMA aware.
On systems with a single socket, or UNIX systems without libnuma, this change
should be unnoticeable, except that --threads N is now --pools N.

Since JobProviders are assigned to pools statically, each pool knows up front
which JobProviders it is servicing and can be more intelligent about which
provider needs the most help (I over P, P over B, B over b). The enqueue/dequeue
functions are no longer necessary.
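
The priority order named above (I over P, P over B, B over b) can be sketched
as a simple scan over a statically known provider list. This is an
illustration only: SliceKind, Provider, pickProvider and the wantsHelp flag
are hypothetical names, not the thread pool's real interface.

```cpp
#include <cassert>

// Larger value means higher priority: I over P, P over B, B over b.
enum SliceKind { KIND_b = 0, KIND_B = 1, KIND_P = 2, KIND_I = 3 };

// Hypothetical view of a job provider as seen by its pool.
struct Provider
{
    SliceKind kind;
    bool      wantsHelp;
};

// Returns the index of the highest-priority provider that wants help,
// or -1 if none does. Ties go to the earliest provider in the list.
int pickProvider(const Provider* providers, int count)
{
    int best = -1;
    for (int i = 0; i < count; i++)
    {
        if (!providers[i].wantsHelp)
            continue;
        if (best < 0 || providers[i].kind > providers[best].kind)
            best = i;
    }
    return best;
}

// Small driver: the I provider is not asking for help, so P should win.
int demoPick()
{
    const Provider providers[4] = {
        { KIND_B, true }, { KIND_I, false }, { KIND_P, true }, { KIND_b, true }
    };
    return pickProvider(providers, 4);
}
```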

FrameEncoders now allocate the ThreadLocalData array for their thread pool
after setting their thread affinity to their target NUMA node, so that the
memory should be associated with their socket. More work needs to be done to
mirror recon pictures between nodes so motion estimation does not have to use
remote external memory.

This commit introduces knowledge in the thread pool about BondedTaskGroups but
they are only implemented in the next commit. This commit will not compile by
itself.
Subject: [x265] pool: introduce bonded task groups

details:   http://hg.videolan.org/x265/rev/b000ea3ab26d
branches:  
changeset: 9396:b000ea3ab26d
user:      Steve Borho <steve at borho.org>
date:      Thu Feb 19 09:39:48 2015 -0600
description:
pool: introduce bonded task groups

Since the thread pools no longer support JobProviders being dynamically added
and removed we need a new mechanism to enlist the help of worker threads for
short-term work.

A bonded task group is a simple data structure, stack allocated, which tracks
the workers which were enlisted for help and when they are all finished. It
will try to enlist threads which most recently worked on the same frame, if
available, then it will try any idle threads in the same pool.
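
The enlistment order described above can be sketched as two passes over the
pool's workers. This models only the selection policy, without any real
threads or synchronization. Worker, enlistWorkers and the lastFrame field are
hypothetical names chosen for illustration, not the BondedTaskGroup API.

```cpp
#include <cassert>

// Hypothetical snapshot of a pool worker.
struct Worker
{
    bool idle;      // true if the worker can be enlisted right now
    int  lastFrame; // id of the frame it most recently worked on
};

// Fills chosen[] with up to maxWanted worker indices and returns how many
// were enlisted. Idle workers that last touched the same frame come first,
// then any other idle worker in the pool.
int enlistWorkers(const Worker* workers, int count, int frame,
                  int maxWanted, int* chosen)
{
    int n = 0;

    // Pass 1: idle workers that most recently worked on this frame.
    for (int i = 0; i < count && n < maxWanted; i++)
        if (workers[i].idle && workers[i].lastFrame == frame)
            chosen[n++] = i;

    // Pass 2: any remaining idle workers.
    for (int i = 0; i < count && n < maxWanted; i++)
        if (workers[i].idle && workers[i].lastFrame != frame)
            chosen[n++] = i;

    return n;
}

// Small driver: frame 5 wants two helpers from a four-worker pool.
// Encodes the two chosen indices as first * 10 + second.
int demoEnlist()
{
    const Worker workers[4] = {
        { true, 3 }, { false, 5 }, { true, 5 }, { true, 7 }
    };
    int chosen[2] = { -1, -1 };
    int n = enlistWorkers(workers, 4, 5, 2, chosen);
    return n == 2 ? chosen[0] * 10 + chosen[1] : -1;
}
```

In the driver, worker 2 is picked first because it is idle and last worked on
frame 5; worker 0 is then taken as the first remaining idle worker.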

This commit switches --pmode and --pme to use bonded task groups.
Subject: [x265] slicetype: refactor lookahead to use bonded task groups

details:   http://hg.videolan.org/x265/rev/0c23bfd6b0d4
branches:  
changeset: 9397:0c23bfd6b0d4
user:      Steve Borho <steve at borho.org>
date:      Thu Feb 19 09:42:06 2015 -0600
description:
slicetype: refactor lookahead to use bonded task groups

The lowres downscale, adaptive quant calculations, and lowres intra analysis
are all moved into a per-frame task which is performed by worker threads when
a worker pool is available to the encoder.  Similarly slicetypeDecide() is also
performed by worker threads if available.

Individual frame cost estimations no longer use wave-front scheduling. Instead
the frames are divided into slices which are distributed to workers via a
bonded task group. This change reduces the accuracy of P frame decisions, and
will likely require further tuning. It improves overall work efficiency but it
lowers compression efficiency.
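
One way a frame's rows could be divided into slices for distribution to
workers is an even split with the remainder spread over the first slices. The
commit message does not say how x265 chooses slice boundaries, so the
partition rule and the names sliceRows and demoSliceRange are assumptions made
for illustration.

```cpp
#include <cassert>

// Computes the half-open row range [firstRow, endRow) for slice `index`
// when totalRows rows are split into numSlices slices as evenly as possible.
void sliceRows(int totalRows, int numSlices, int index,
               int& firstRow, int& endRow)
{
    int base  = totalRows / numSlices; // rows every slice gets
    int extra = totalRows % numSlices; // first `extra` slices get one more

    firstRow = index * base + (index < extra ? index : extra);
    endRow   = firstRow + base + (index < extra ? 1 : 0);
}

// Small driver: 10 rows in 4 slices, range encoded as first * 100 + end.
int demoSliceRange(int index)
{
    int first = 0, end = 0;
    sliceRows(10, 4, index, first, end);
    return first * 100 + end;
}
```

Each such range could then be handed to one worker as an independent task,
which is what removes the row-to-row dependency of wave-front scheduling.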

When --b-adapt 2 is used on large core systems, the initial batch of frame cost
estimates are performed by the entire thread pool at once, again using a bonded
task group.

diffstat:

 doc/reST/cli.rst                     |   258 +++--
 doc/reST/threading.rst               |   100 +-
 readme.rst                           |    14 +
 source/CMakeLists.txt                |    20 +-
 source/cmake/FindNuma.cmake          |    43 +
 source/common/bitstream.cpp          |     2 +-
 source/common/common.cpp             |     4 +
 source/common/common.h               |     7 +-
 source/common/constants.cpp          |     2 +-
 source/common/constants.h            |     2 +-
 source/common/cudata.cpp             |    54 +-
 source/common/cudata.h               |     8 +-
 source/common/deblock.cpp            |     2 +-
 source/common/framedata.h            |     2 +
 source/common/ipfilter.cpp           |    45 +-
 source/common/lowres.cpp             |    26 +-
 source/common/lowres.h               |     3 +-
 source/common/param.cpp              |    59 +-
 source/common/picyuv.cpp             |     8 +-
 source/common/pixel.cpp              |     2 +-
 source/common/predict.cpp            |     6 +-
 source/common/primitives.cpp         |     1 +
 source/common/primitives.h           |    10 +-
 source/common/quant.cpp              |    78 +-
 source/common/scalinglist.cpp        |     2 +-
 source/common/shortyuv.cpp           |     6 +-
 source/common/slice.cpp              |    12 +-
 source/common/slice.h                |    19 +-
 source/common/threading.h            |    53 +-
 source/common/threadpool.cpp         |   699 +++++++-------
 source/common/threadpool.h           |   166 ++-
 source/common/wavefront.cpp          |    12 +-
 source/common/wavefront.h            |    11 +-
 source/common/x86/asm-primitives.cpp |     2 +
 source/common/x86/blockcopy8.asm     |   693 ++++++++-----
 source/common/x86/dct8.asm           |   362 +++++++
 source/common/x86/dct8.h             |     1 +
 source/common/x86/intrapred8.asm     |     2 +-
 source/encoder/analysis.cpp          |   745 +++++++--------
 source/encoder/analysis.h            |    41 +-
 source/encoder/api.cpp               |     2 +-
 source/encoder/dpb.cpp               |    44 +-
 source/encoder/dpb.h                 |     4 +-
 source/encoder/encoder.cpp           |   396 +++++--
 source/encoder/encoder.h             |     6 +-
 source/encoder/entropy.cpp           |   166 +-
 source/encoder/entropy.h             |     6 +-
 source/encoder/frameencoder.cpp      |   221 +++-
 source/encoder/frameencoder.h        |    13 +-
 source/encoder/framefilter.cpp       |     7 +-
 source/encoder/level.cpp             |    25 +-
 source/encoder/nal.cpp               |     2 +-
 source/encoder/ratecontrol.cpp       |   157 +---
 source/encoder/ratecontrol.h         |     2 -
 source/encoder/search.cpp            |   554 ++++++-----
 source/encoder/search.h              |   140 ++-
 source/encoder/slicetype.cpp         |  1586 +++++++++++++++++++--------------
 source/encoder/slicetype.h           |   249 +++--
 source/encoder/weightPrediction.cpp  |     2 +-
 source/input/y4m.cpp                 |    58 +-
 source/output/y4m.cpp                |     8 -
 source/output/yuv.cpp                |     4 -
 source/profile/cpuEvents.h           |     3 +-
 source/test/ipfilterharness.cpp      |    73 +-
 source/test/ipfilterharness.h        |     4 +-
 source/test/pixelharness.cpp         |     6 +-
 source/x265.h                        |   549 ++++++-----
 source/x265cli.h                     |    17 +-
 68 files changed, 4689 insertions(+), 3197 deletions(-)

diffs (truncated from 12294 to 300 lines):

diff -r 359daecfbb47 -r 0c23bfd6b0d4 doc/reST/cli.rst

--- a/doc/reST/cli.rst	Mon Feb 16 10:33:58 2015 +0530
+++ b/doc/reST/cli.rst	Thu Feb 19 09:42:06 2015 -0600
@@ -171,19 +171,54 @@ Performance Options
 	Over-allocation of frame threads will not improve performance, it
 	will generally just increase memory use.
 
-.. option:: --threads <integer>
+	**Values:** any value between 8 and 16. Default is 0, auto-detect
+
+.. option:: --pools <string>, --numa-pools <string>
+
+	Comma separated list of threads per NUMA node. If "none", then no worker
+	pools are created and only frame parallelism is possible. If NULL or ""
+	(default) x265 will use all available threads on each NUMA node::
+
+	'+'  is a special value indicating all cores detected on the node
+	'*'  is a special value indicating all cores detected on the node and all remaining nodes
+	'-'  is a special value indicating no cores on the node, same as '0'
+
+	example strings for a 4-node system::
+		""        - default, unspecified, all numa nodes are used for thread pools
+		"*"       - same as default
+		"none"    - no thread pools are created, only frame parallelism possible
+		"-"       - same as "none"
+		"10"      - allocate one pool, using up to 10 cores on node 0
+		"-,+"     - allocate one pool, using all cores on node 1
+		"+,-,+"   - allocate two pools, using all cores on nodes 0 and 2
+		"+,-,+,-" - allocate two pools, using all cores on nodes 0 and 2
+		"-,*"     - allocate three pools, using all cores on nodes 1, 2 and 3
+		"8,8,8,8" - allocate four pools with up to 8 threads in each pool
+
+	The total number of threads will be determined by the number of threads
+	assigned to all nodes. The worker threads will each be given affinity for
+	their node, they will not be allowed to migrate between nodes, but they
+	will be allowed to move between CPU cores within their node.
+
+	If the three pool features: :option:`--wpp` :option:`--pmode` and
+	:option:`--pme` are all disabled, then :option:`--pools` is ignored
+	and no thread pools are created.
+
+	If "none" is specified, then all three of the thread pool features are
+	implicitly disabled.
+
+	Multiple thread pools will be allocated for any NUMA node with more than
+	64 logical CPU cores. But any given thread pool will always use at most
+	one NUMA node.
+
+	Frame encoders are distributed between the available thread pools, and
+	the encoder will never generate more thread pools than frameNumThreads
 
 	Number of threads to allocate for the worker thread pool  This pool
 	is used for WPP and for distributed analysis and motion search:
-	:option:`--wpp` :option:`--pmode` and :option:`--pme` respectively.
 
-	If :option:`--threads` 1 is specified, then no thread pool is
-	created. When no thread pool is created, all the thread pool
-	features are implicitly disabled. If all the pool features are
-	disabled by the user, then the pool is implicitly disabled.
-
-	Default 0, one thread is allocated per detected hardware thread
-	(logical CPU cores)
+	Default "", one thread is allocated per detected hardware thread
+	(logical CPU cores) and one thread pool per NUMA node.
 
 .. option:: --wpp, --no-wpp
 
@@ -409,7 +444,17 @@ Profile, Level, Tier
 	If :option:`--level-idc` has been specified, the option adds the
 	intention to support the High tier of that level. If your specified
 	level does not support a High tier, a warning is issued and this
-	modifier flag is ignored.
+	modifier flag is ignored. If :option:`--level-idc` has been specified,
+	but not --high-tier, then the encoder will attempt to encode at the 
+	specified level, main tier first, turning on high tier only if 
+	necessary and available at that level.
+
+.. option:: --ref <1..16>
+
+	Max number of L0 references to be allowed. This number has a linear
+	multiplier effect on the amount of work performed in motion search,
+	but will generally have a beneficial affect on compression and
+	distortion. Default 3
 
 .. note::
 	:option:`--profile`, :option:`--level-idc`, and
@@ -465,6 +510,23 @@ the prediction quad-tree.
 	and less frame parallelism as well. Because of this the faster
 	presets use a CU size of 32. Default: 64
 
+.. option:: --min-cu-size <64|32|16|8>
+
+	Minimum CU size (width and height). By using 16 or 32 the encoder
+	will not analyze the cost of CUs below that minimum threshold,
+	saving considerable amounts of compute with a predictable increase
+	in bitrate. This setting has a large effect on performance on the
+	faster presets.
+
+	Default: 8 (minimum 8x8 CU for HEVC, best compression efficiency)
+
+.. note::
+
+	All encoders within a single process must use the same settings for
+	the CU size range. :option:`--ctu` and :option:`--min-cu-size` must
+	be consistent for all of them since the encoder configures several
+	key global data structures based on this range.
+
 .. option:: --rect, --no-rect
 
 	Enable analysis of rectangular motion partitions Nx2N and 2NxN
@@ -494,14 +556,6 @@ the prediction quad-tree.
 	Measure full CU size (2Nx2N) merge candidates first; if no residual
 	is found the analysis is short circuited. Default disabled
 
-.. option:: --fast-cbf, --no-fast-cbf
-
-	Short circuit analysis if a prediction is found that does not set
-	the coded block flag (aka: no residual was encoded).  It prevents
-	the encoder from perhaps finding other predictions that also have no
-	residual but require less signaling bits or have less distortion.
-	Only applicable for RD levels 5 and 6. Default disabled
-
 .. option:: --fast-intra, --no-fast-intra
 
 	Perform an initial scan of every fifth intra angular mode, then
@@ -526,14 +580,6 @@ the prediction quad-tree.
 	Only effective at RD levels 3 and above, which perform RDO mode
 	decisions.
 
-.. option:: --tskip, --no-tskip
-
-	Enable evaluation of transform skip (bypass DCT but still use
-	quantization) coding for 4x4 TU coded blocks.
-
-	Only effective at RD levels 3 and above, which perform RDO mode
-	decisions. Default disabled
-
 .. option:: --tskip-fast, --no-tskip-fast
 
 	Only evaluate transform skip for NxN intra predictions (4x4 blocks).
@@ -593,9 +639,76 @@ as the residual quad-tree (RQT).
 	partitions, in which case a TU split is implied and thus the
 	residual quad-tree begins one layer below the CU quad-tree.
 
+.. option:: --nr-intra <integer>, --nr-inter <integer>
+
+	Noise reduction - an adaptive deadzone applied after DCT
+	(subtracting from DCT coefficients), before quantization.  It does
+	no pixel-level filtering, doesn't cross DCT block boundaries, has no
+	overlap, The higher the strength value parameter, the more
+	aggressively it will reduce noise.
+
+	Enabling noise reduction will make outputs diverge between different
+	numbers of frame threads. Outputs will be deterministic but the
+	outputs of -F2 will no longer match the outputs of -F3, etc.
+
+	**Values:** any value in range of 0 to 2000. Default 0 (disabled).
+
+.. option:: --tskip, --no-tskip
+
+	Enable evaluation of transform skip (bypass DCT but still use
+	quantization) coding for 4x4 TU coded blocks.
+
+	Only effective at RD levels 3 and above, which perform RDO mode
+	decisions. Default disabled
+
+.. option:: --rdpenalty <0..2>
+
+	When set to 1, transform units of size 32x32 are given a 4x bit cost
+	penalty compared to smaller transform units, in intra coded CUs in P
+	or B slices.
+
+	When set to 2, transform units of size 32x32 are not even attempted,
+	unless otherwise required by the maximum recursion depth.  For this
+	option to be effective with 32x32 intra CUs,
+	:option:`--tu-intra-depth` must be at least 2.  For it to be
+	effective with 64x64 intra CUs, :option:`--tu-intra-depth` must be
+	at least 3.
+
+	Note that in HEVC an intra transform unit (a block of the residual
+	quad-tree) is also a prediction unit, meaning that the intra
+	prediction signal is generated for each TU block, the residual
+	subtracted and then coded. The coding unit simply provides the
+	prediction modes that will be used when predicting all of the
+	transform units within the CU. This means that when you prevent
+	32x32 intra transform units, you are preventing 32x32 intra
+	predictions.
+
+	Default 0, disabled.
+
+	**Values:** 0:disabled 1:4x cost penalty 2:force splits
+
+.. option:: --max-tu-size <32|16|8|4>
+
+	Maximum TU size (width and height). The residual can be more
+	efficiently compressed by the DCT transform when the max TU size
+	is larger, but at the expense of more computation. Transform unit
+	quad-tree begins at the same depth of the coded tree unit, but if the
+	maximum TU size is smaller than the CU size then transform QT begins 
+	at the depth of the max-tu-size. Default: 32.
+
 Temporal / motion search options
 ================================
 
+.. option:: --max-merge <1..5>
+
+	Maximum number of neighbor (spatial and temporal) candidate blocks
+	that the encoder may consider for merging motion predictions. If a
+	merge candidate results in no residual, it is immediately selected
+	as a "skip".  Otherwise the merge candidates are tested as part of
+	motion estimation when searching for the least cost inter option.
+	The max candidate number is encoded in the SPS and determines the
+	bit cost of signaling merge CUs. Default 2
+
 .. option:: --me <integer|string>
 
 	Motion search method. Generally, the higher the number the harder
@@ -658,16 +771,6 @@ Temporal / motion search options
 
 	**Range of values:** an integer from 0 to 32768
 
-.. option:: --max-merge <1..5>
-
-	Maximum number of neighbor (spatial and temporal) candidate blocks
-	that the encoder may consider for merging motion predictions. If a
-	merge candidate results in no residual, it is immediately selected
-	as a "skip".  Otherwise the merge candidates are tested as part of
-	motion estimation when searching for the least cost inter option.
-	The max candidate number is encoded in the SPS and determines the
-	bit cost of signaling merge CUs. Default 2
-
 .. option:: --temporal-mvp, --no-temporal-mvp
 
 	Enable temporal motion vector predictors in P and B slices.
@@ -704,32 +807,6 @@ Spatial/intra options
 	propagation of reference errors that may have resulted from lossy
 	signals. Default disabled
 
-.. option:: --rdpenalty <0..2>
-
-	When set to 1, transform units of size 32x32 are given a 4x bit cost
-	penalty compared to smaller transform units, in intra coded CUs in P
-	or B slices.
-
-	When set to 2, transform units of size 32x32 are not even attempted,
-	unless otherwise required by the maximum recursion depth.  For this
-	option to be effective with 32x32 intra CUs,
-	:option:`--tu-intra-depth` must be at least 2.  For it to be
-	effective with 64x64 intra CUs, :option:`--tu-intra-depth` must be
-	at least 3.
-
-	Note that in HEVC an intra transform unit (a block of the residual
-	quad-tree) is also a prediction unit, meaning that the intra
-	prediction signal is generated for each TU block, the residual
-	subtracted and then coded. The coding unit simply provides the
-	prediction modes that will be used when predicting all of the
-	transform units within the CU. This means that when you prevent
-	32x32 intra transform units, you are preventing 32x32 intra
-	predictions.
-
-	Default 0, disabled.
-
-	**Values:** 0:disabled 1:4x cost penalty 2:force splits
-
 Psycho-visual options
 =====================
 
@@ -874,13 +951,6 @@ Slice decision options
 
 	Use B-frames as references, when possible. Default enabled
 
-.. option:: --ref <1..16>
-
-	Max number of L0 references to be allowed. This number has a linear
-	multiplier effect on the amount of work performed in motion search,
-	but will generally have a beneficial affect on compression and
-	distortion. Default 3
-
 Quality, rate control and rate distortion options
 =================================================
 
@@ -990,20 +1060,6 @@ Quality, rate control and rate distortio
 	less bits. This tends to improve detail in the backgrounds of video
 	with less detail in areas of high motion. Default enabled
 
-.. option:: --nr-intra <integer>, --nr-inter <integer>
-
-	Noise reduction - an adaptive deadzone applied after DCT
-	(subtracting from DCT coefficients), before quantization.  It does
-	no pixel-level filtering, doesn't cross DCT block boundaries, has no
-	overlap, The higher the strength value parameter, the more
-	aggressively it will reduce noise.
-
-	Enabling noise reduction will make outputs diverge between different
-	numbers of frame threads. Outputs will be deterministic but the
-	outputs of -F2 will no longer match the outputs of -F3, etc.
-
-	**Values:** any value in range of 0 to 2000. Default 0 (disabled).
-
 .. option:: --pass <integer>