[x265-commits] [x265] asm: properly disable x265_stack_align() when ENABLE_ASSE...
Steve Borho
steve at borho.org
Mon Mar 9 18:48:41 CET 2015
details: http://hg.videolan.org/x265/rev/cbc41dfdb5c4
branches: stable
changeset: 9666:cbc41dfdb5c4
user: Steve Borho <steve at borho.org>
date: Fri Mar 06 22:27:54 2015 -0600
description:
asm: properly disable x265_stack_align() when ENABLE_ASSEMBLY is OFF
Subject: [x265] param: disallow encodes without room for P frame in lookahead
details: http://hg.videolan.org/x265/rev/2e93cd58ff61
branches: stable
changeset: 9667:2e93cd58ff61
user: Steve Borho <steve at borho.org>
date: Mon Mar 09 11:57:36 2015 -0500
description:
param: disallow encodes without room for P frame in lookahead
Subject: [x265] Merge with stable
details: http://hg.videolan.org/x265/rev/62c02919ced8
branches:
changeset: 9668:62c02919ced8
user: Steve Borho <steve at borho.org>
date: Mon Mar 09 12:47:47 2015 -0500
description:
Merge with stable
diffstat:
doc/reST/cli.rst | 295 +-
doc/reST/presets.rst | 9 +-
doc/reST/threading.rst | 100 +-
readme.rst | 14 +
source/CMakeLists.txt | 66 +-
source/cmake/FindNuma.cmake | 43 +
source/common/CMakeLists.txt | 2 +-
source/common/bitstream.cpp | 4 +-
source/common/common.cpp | 4 +
source/common/common.h | 15 +-
source/common/constants.cpp | 2 +-
source/common/constants.h | 2 +-
source/common/cudata.cpp | 292 +-
source/common/cudata.h | 20 +-
source/common/dct.cpp | 13 +-
source/common/deblock.cpp | 4 +-
source/common/framedata.h | 2 +
source/common/intrapred.cpp | 28 +
source/common/ipfilter.cpp | 45 +-
source/common/lowres.cpp | 23 +-
source/common/lowres.h | 3 +-
source/common/mv.h | 9 +-
source/common/param.cpp | 86 +-
source/common/picyuv.cpp | 8 +-
source/common/pixel.cpp | 2 +-
source/common/predict.cpp | 344 +-
source/common/predict.h | 60 +-
source/common/primitives.cpp | 3 +-
source/common/primitives.h | 19 +-
source/common/quant.cpp | 102 +-
source/common/quant.h | 4 +-
source/common/scalinglist.cpp | 2 +-
source/common/shortyuv.cpp | 6 +-
source/common/slice.cpp | 12 +-
source/common/slice.h | 19 +-
source/common/threading.cpp | 7 +
source/common/threading.h | 63 +-
source/common/threadpool.cpp | 699 ++--
source/common/threadpool.h | 166 +-
source/common/wavefront.cpp | 12 +-
source/common/wavefront.h | 11 +-
source/common/x86/asm-primitives.cpp | 407 ++-
source/common/x86/blockcopy8.asm | 1548 ++++++++-
source/common/x86/blockcopy8.h | 42 +
source/common/x86/const-a.asm | 19 +-
source/common/x86/dct8.asm | 362 ++
source/common/x86/dct8.h | 1 +
source/common/x86/intrapred.h | 24 +-
source/common/x86/intrapred16.asm | 530 +++-
source/common/x86/intrapred8.asm | 1282 ++++++++-
source/common/x86/ipfilter16.asm | 2697 +++++++++++++++++-
source/common/x86/ipfilter8.asm | 5224 +++++++++++++++++++++++++++++++++-
source/common/x86/ipfilter8.h | 27 +-
source/common/x86/mc-a.asm | 70 +-
source/common/x86/pixel-a.asm | 574 +++
source/common/x86/pixel-util.h | 12 +-
source/common/x86/pixel-util8.asm | 343 ++-
source/common/x86/pixel.h | 26 +
source/common/x86/pixeladd8.asm | 161 +
source/encoder/analysis.cpp | 901 ++---
source/encoder/analysis.h | 44 +-
source/encoder/api.cpp | 3 +-
source/encoder/dpb.cpp | 44 +-
source/encoder/dpb.h | 4 +-
source/encoder/encoder.cpp | 453 ++-
source/encoder/encoder.h | 7 +-
source/encoder/entropy.cpp | 182 +-
source/encoder/entropy.h | 6 +-
source/encoder/frameencoder.cpp | 259 +-
source/encoder/frameencoder.h | 29 +-
source/encoder/framefilter.cpp | 7 +-
source/encoder/level.cpp | 25 +-
source/encoder/motion.cpp | 67 +-
source/encoder/motion.h | 1 +
source/encoder/nal.cpp | 2 +-
source/encoder/ratecontrol.cpp | 213 +-
source/encoder/ratecontrol.h | 216 +-
source/encoder/sao.cpp | 2 +
source/encoder/search.cpp | 861 +++--
source/encoder/search.h | 167 +-
source/encoder/slicetype.cpp | 1671 ++++++----
source/encoder/slicetype.h | 250 +-
source/encoder/weightPrediction.cpp | 20 +-
source/input/y4m.cpp | 58 +-
source/output/y4m.cpp | 8 -
source/output/yuv.cpp | 4 -
source/profile/cpuEvents.h | 3 +-
source/test/CMakeLists.txt | 3 +
source/test/ipfilterharness.cpp | 73 +-
source/test/ipfilterharness.h | 4 +-
source/test/mbdstharness.cpp | 64 +-
source/test/pixelharness.cpp | 21 +-
source/test/testbench.cpp | 6 +
source/test/testharness.h | 2 +-
source/x265.cpp | 4 +-
source/x265.h | 561 ++-
source/x265cli.h | 27 +-
97 files changed, 17788 insertions(+), 4453 deletions(-)
diffs (truncated from 30188 to 300 lines):
diff -r b5c42efdd600 -r 62c02919ced8 doc/reST/cli.rst
--- a/doc/reST/cli.rst Mon Feb 23 13:25:07 2015 -0600
+++ b/doc/reST/cli.rst Mon Mar 09 12:47:47 2015 -0500
@@ -171,19 +171,54 @@ Performance Options
Over-allocation of frame threads will not improve performance, it
will generally just increase memory use.
-.. option:: --threads <integer>
+ **Values:** any value between 0 and 16. Default is 0, auto-detect
- Number of threads to allocate for the worker thread pool This pool
- is used for WPP and for distributed analysis and motion search:
- :option:`--wpp` :option:`--pmode` and :option:`--pme` respectively.
+.. option:: --pools <string>, --numa-pools <string>
- If :option:`--threads` 1 is specified, then no thread pool is
- created. When no thread pool is created, all the thread pool
- features are implicitly disabled. If all the pool features are
- disabled by the user, then the pool is implicitly disabled.
+ Comma separated list of threads per NUMA node. If "none", then no worker
+ pools are created and only frame parallelism is possible. If NULL or ""
+ (default) x265 will use all available threads on each NUMA node::
- Default 0, one thread is allocated per detected hardware thread
- (logical CPU cores)
+ '+' is a special value indicating all cores detected on the node
+ '*' is a special value indicating all cores detected on the node and all remaining nodes
+ '-' is a special value indicating no cores on the node, same as '0'
+
+ example strings for a 4-node system::
+
+ "" - default, unspecified, all numa nodes are used for thread pools
+ "*" - same as default
+ "none" - no thread pools are created, only frame parallelism possible
+ "-" - same as "none"
+ "10" - allocate one pool, using up to 10 cores on node 0
+ "-,+" - allocate one pool, using all cores on node 1
+ "+,-,+" - allocate two pools, using all cores on nodes 0 and 2
+ "+,-,+,-" - allocate two pools, using all cores on nodes 0 and 2
+ "-,*" - allocate three pools, using all cores on nodes 1, 2 and 3
+ "8,8,8,8" - allocate four pools with up to 8 threads in each pool
+
+ The total number of threads will be determined by the number of threads
+ assigned to all nodes. The worker threads will each be given affinity for
+ their node, they will not be allowed to migrate between nodes, but they
+ will be allowed to move between CPU cores within their node.
+
+ If the three pool features: :option:`--wpp` :option:`--pmode` and
+ :option:`--pme` are all disabled, then :option:`--pools` is ignored
+ and no thread pools are created.
+
+ If "none" is specified, then all three of the thread pool features are
+ implicitly disabled.
+
+ Multiple thread pools will be allocated for any NUMA node with more than
+ 64 logical CPU cores. But any given thread pool will always use at most
+ one NUMA node.
+
+ Frame encoders are distributed between the available thread pools,
+ and the encoder will never generate more thread pools than
+ :option:`--frame-threads`. The pools are used for WPP and for
+ distributed analysis and motion search.
+
+ Default "", one thread is allocated per detected hardware thread
+ (logical CPU cores) and one thread pool per NUMA node.
.. option:: --wpp, --no-wpp
@@ -409,7 +444,17 @@ Profile, Level, Tier
If :option:`--level-idc` has been specified, the option adds the
intention to support the High tier of that level. If your specified
level does not support a High tier, a warning is issued and this
- modifier flag is ignored.
+ modifier flag is ignored. If :option:`--level-idc` has been specified,
+ but not --high-tier, then the encoder will attempt to encode at the
+ specified level, main tier first, turning on high tier only if
+ necessary and available at that level.
+
+.. option:: --ref <1..16>
+
+ Max number of L0 references to be allowed. This number has a linear
+ multiplier effect on the amount of work performed in motion search,
+ but will generally have a beneficial effect on compression and
+ distortion. Default 3
.. note::
:option:`--profile`, :option:`--level-idc`, and
@@ -465,6 +510,23 @@ the prediction quad-tree.
and less frame parallelism as well. Because of this the faster
presets use a CU size of 32. Default: 64
+.. option:: --min-cu-size <64|32|16|8>
+
+ Minimum CU size (width and height). By using 16 or 32 the encoder
+ will not analyze the cost of CUs below that minimum threshold,
+ saving considerable amounts of compute with a predictable increase
+ in bitrate. This setting has a large effect on performance on the
+ faster presets.
+
+ Default: 8 (minimum 8x8 CU for HEVC, best compression efficiency)
+
+.. note::
+
+ All encoders within a single process must use the same settings for
+ the CU size range. :option:`--ctu` and :option:`--min-cu-size` must
+ be consistent for all of them since the encoder configures several
+ key global data structures based on this range.
+
.. option:: --rect, --no-rect
Enable analysis of rectangular motion partitions Nx2N and 2NxN
@@ -494,14 +556,6 @@ the prediction quad-tree.
Measure full CU size (2Nx2N) merge candidates first; if no residual
is found the analysis is short circuited. Default disabled
-.. option:: --fast-cbf, --no-fast-cbf
-
- Short circuit analysis if a prediction is found that does not set
- the coded block flag (aka: no residual was encoded). It prevents
- the encoder from perhaps finding other predictions that also have no
- residual but require less signaling bits or have less distortion.
- Only applicable for RD levels 5 and 6. Default disabled
-
.. option:: --fast-intra, --no-fast-intra
Perform an initial scan of every fifth intra angular mode, then
@@ -526,14 +580,6 @@ the prediction quad-tree.
Only effective at RD levels 3 and above, which perform RDO mode
decisions.
-.. option:: --tskip, --no-tskip
-
- Enable evaluation of transform skip (bypass DCT but still use
- quantization) coding for 4x4 TU coded blocks.
-
- Only effective at RD levels 3 and above, which perform RDO mode
- decisions. Default disabled
-
.. option:: --tskip-fast, --no-tskip-fast
Only evaluate transform skip for NxN intra predictions (4x4 blocks).
@@ -567,6 +613,30 @@ not match.
Options which affect the transform unit quad-tree, sometimes referred to
as the residual quad-tree (RQT).
+.. option:: --rdoq-level <0|1|2>, --no-rdoq-level
+
+ Specify the amount of rate-distortion analysis to use within
+ quantization::
+
+ At level 0 rate-distortion cost is not considered in quant
+
+ At level 1 rate-distortion cost is used to find optimal rounding
+ values for each level (and allows psy-rdoq to be effective). It
+ trades-off the signaling cost of the coefficient vs its post-inverse
+ quant distortion from the pre-quant coefficient. When
+ :option:`--psy-rdoq` is enabled, this formula is biased in favor of
+ more energy in the residual (larger coefficient absolute levels)
+
+ At level 2 rate-distortion cost is used to make decimate decisions
+ on each 4x4 coding group, including the cost of signaling the group
+ within the group bitmap. If the total distortion of not signaling
+ the entire coding group is less than the rate cost, the block is
+ decimated. Next, it applies rate-distortion cost analysis to the
+ last non-zero coefficient, which can result in many (or all) of the
+ coding groups being decimated. Psy-rdoq is less effective at
+ preserving energy when RDOQ is at level 2, since it only has
+ influence over the level distortion costs.
+
.. option:: --tu-intra-depth <1..4>
The transform unit (residual) quad-tree begins with the same depth
@@ -593,9 +663,76 @@ as the residual quad-tree (RQT).
partitions, in which case a TU split is implied and thus the
residual quad-tree begins one layer below the CU quad-tree.
+.. option:: --nr-intra <integer>, --nr-inter <integer>
+
+ Noise reduction - an adaptive deadzone applied after DCT
+ (subtracting from DCT coefficients), before quantization. It does
+ no pixel-level filtering, doesn't cross DCT block boundaries, has no
+ overlap. The higher the strength value parameter, the more
+ aggressively it will reduce noise.
+
+ Enabling noise reduction will make outputs diverge between different
+ numbers of frame threads. Outputs will be deterministic but the
+ outputs of -F2 will no longer match the outputs of -F3, etc.
+
+ **Values:** any value in range of 0 to 2000. Default 0 (disabled).
+
+.. option:: --tskip, --no-tskip
+
+ Enable evaluation of transform skip (bypass DCT but still use
+ quantization) coding for 4x4 TU coded blocks.
+
+ Only effective at RD levels 3 and above, which perform RDO mode
+ decisions. Default disabled
+
+.. option:: --rdpenalty <0..2>
+
+ When set to 1, transform units of size 32x32 are given a 4x bit cost
+ penalty compared to smaller transform units, in intra coded CUs in P
+ or B slices.
+
+ When set to 2, transform units of size 32x32 are not even attempted,
+ unless otherwise required by the maximum recursion depth. For this
+ option to be effective with 32x32 intra CUs,
+ :option:`--tu-intra-depth` must be at least 2. For it to be
+ effective with 64x64 intra CUs, :option:`--tu-intra-depth` must be
+ at least 3.
+
+ Note that in HEVC an intra transform unit (a block of the residual
+ quad-tree) is also a prediction unit, meaning that the intra
+ prediction signal is generated for each TU block, the residual
+ subtracted and then coded. The coding unit simply provides the
+ prediction modes that will be used when predicting all of the
+ transform units within the CU. This means that when you prevent
+ 32x32 intra transform units, you are preventing 32x32 intra
+ predictions.
+
+ Default 0, disabled.
+
+ **Values:** 0:disabled 1:4x cost penalty 2:force splits
+
+.. option:: --max-tu-size <32|16|8|4>
+
+ Maximum TU size (width and height). The residual can be more
+ efficiently compressed by the DCT transform when the max TU size
+ is larger, but at the expense of more computation. Transform unit
+ quad-tree begins at the same depth of the coded tree unit, but if the
+ maximum TU size is smaller than the CU size then transform QT begins
+ at the depth of the max-tu-size. Default: 32.
+
Temporal / motion search options
================================
+.. option:: --max-merge <1..5>
+
+ Maximum number of neighbor (spatial and temporal) candidate blocks
+ that the encoder may consider for merging motion predictions. If a
+ merge candidate results in no residual, it is immediately selected
+ as a "skip". Otherwise the merge candidates are tested as part of
+ motion estimation when searching for the least cost inter option.
+ The max candidate number is encoded in the SPS and determines the
+ bit cost of signaling merge CUs. Default 2
+
.. option:: --me <integer|string>
Motion search method. Generally, the higher the number the harder
@@ -658,16 +795,6 @@ Temporal / motion search options
**Range of values:** an integer from 0 to 32768
-.. option:: --max-merge <1..5>
-
- Maximum number of neighbor (spatial and temporal) candidate blocks
- that the encoder may consider for merging motion predictions. If a
- merge candidate results in no residual, it is immediately selected
- as a "skip". Otherwise the merge candidates are tested as part of
- motion estimation when searching for the least cost inter option.
- The max candidate number is encoded in the SPS and determines the
- bit cost of signaling merge CUs. Default 2
-
.. option:: --temporal-mvp, --no-temporal-mvp
Enable temporal motion vector predictors in P and B slices.
@@ -704,32 +831,6 @@ Spatial/intra options
propagation of reference errors that may have resulted from lossy
signals. Default disabled
-.. option:: --rdpenalty <0..2>
-
- When set to 1, transform units of size 32x32 are given a 4x bit cost
- penalty compared to smaller transform units, in intra coded CUs in P
- or B slices.
-
- When set to 2, transform units of size 32x32 are not even attempted,
- unless otherwise required by the maximum recursion depth. For this
- option to be effective with 32x32 intra CUs,
- :option:`--tu-intra-depth` must be at least 2. For it to be
- effective with 64x64 intra CUs, :option:`--tu-intra-depth` must be
- at least 3.
-
- Note that in HEVC an intra transform unit (a block of the residual
- quad-tree) is also a prediction unit, meaning that the intra
- prediction signal is generated for each TU block, the residual
- subtracted and then coded. The coding unit simply provides the
- prediction modes that will be used when predicting all of the
- transform units within the CU. This means that when you prevent
- 32x32 intra transform units, you are preventing 32x32 intra
- predictions.
-
- Default 0, disabled.
-
- **Values:** 0:disabled 1:4x cost penalty 2:force splits
-
Psycho-visual options
=====================
@@ -752,8 +853,8 @@ of blurred prediction modes, like DC and
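
Editor's note on the --pools hunk above: the special values and the 4-node example strings can be read as a small per-node mapping. The sketch below is only an illustration of the documented rules, not x265's actual parser. The function name threads_per_node and the cores_per_node argument are hypothetical, and it assumes that nodes not listed in the string get no pool (as the "10" and "-,+" examples suggest). It does not model the splitting of nodes with more than 64 logical cores into multiple pools.

```python
def threads_per_node(pools, cores_per_node):
    """Return the worker thread count for each NUMA node (0 means no pool).

    pools          -- the --pools string as documented, or None
    cores_per_node -- hypothetical input: detected logical cores per node
    """
    # NULL or "" is the documented default: use every core on every node.
    if pools is None or pools == "":
        return list(cores_per_node)
    # "none" disables all worker pools.
    if pools == "none":
        return [0] * len(cores_per_node)

    # Assumption: nodes that are not mentioned in the string get no threads.
    counts = [0] * len(cores_per_node)
    for node, token in enumerate(pools.split(",")):
        if node >= len(cores_per_node):
            break  # ignore entries for nodes that do not exist
        token = token.strip()
        if token == "*":
            # all cores on this node and on all remaining nodes
            counts[node:] = cores_per_node[node:]
            break
        if token == "+":
            counts[node] = cores_per_node[node]
        elif token == "-":
            counts[node] = 0
        else:
            # "up to N cores": cap the request at what the node has
            counts[node] = min(int(token), cores_per_node[node])
    return counts


if __name__ == "__main__":
    nodes = [16, 16, 16, 16]  # assumed 4-node system, 16 logical cores each
    for s in ["", "*", "none", "-", "10", "-,+", "+,-,+", "-,*", "8,8,8,8"]:
        print(repr(s), threads_per_node(s, nodes))
```

Each non-zero entry in the result corresponds to one pool in the documented examples, so "+,-,+" yields two pools (nodes 0 and 2) and "-,*" yields three.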