[x265-commits] [x265] slicetype: fix the BRef cost estimates in vbv lookahead.

Wed Feb 18 23:55:26 CET 2015

details:   http://hg.videolan.org/x265/rev/359daecfbb47
branches:  stable
changeset: 9366:359daecfbb47
user:      Aarthi Thirumalai
date:      Mon Feb 16 10:33:58 2015 +0530
description:
slicetype: fix the BRef cost estimates in vbv lookahead.
Subject: [x265] Merge with stable

details:   http://hg.videolan.org/x265/rev/9a6849146225
branches:  
changeset: 9367:9a6849146225
user:      Deepthi Nandakumar <deepthi at multicorewareinc.com>
date:      Wed Feb 18 14:43:48 2015 +0530
description:
Merge with stable
Subject: [x265] rename variable g_maxFullDepth to g_unitSizeDepth, NUM_CU_PARTITIONS to NUM_4x4_PARTITIONS

details:   http://hg.videolan.org/x265/rev/15ab013c56dd
branches:  
changeset: 9368:15ab013c56dd
user:      Santhoshini Sekar <santhoshini at multicorewareinc.com>
date:      Mon Feb 16 14:28:19 2015 +0530
description:
rename variable g_maxFullDepth to g_unitSizeDepth, NUM_CU_PARTITIONS to NUM_4x4_PARTITIONS
for better clarity
Subject: [x265] asm-see: intra_pred_ang4_2, fix xmm register count

details:   http://hg.videolan.org/x265/rev/c5e50d780f06
branches:  
changeset: 9369:c5e50d780f06
user:      Praveen Tiwari <praveen at multicorewareinc.com>
date:      Wed Feb 18 15:35:58 2015 +0530
description:
asm-see: intra_pred_ang4_2, fix xmm register count

diffstat:

 doc/reST/cli.rst                 |  190 +++++----
 doc/reST/threading.rst           |   11 +-
 readme.rst                       |   14 +
 source/CMakeLists.txt            |   11 +-
 source/common/bitstream.cpp      |    2 +-
 source/common/common.cpp         |    4 +
 source/common/common.h           |    7 +-
 source/common/constants.cpp      |    2 +-
 source/common/constants.h        |    2 +-
 source/common/cudata.cpp         |   32 +-
 source/common/cudata.h           |    4 +-
 source/common/ipfilter.cpp       |   45 +-
 source/common/param.cpp          |   29 +-
 source/common/picyuv.cpp         |    6 +-
 source/common/pixel.cpp          |    2 +-
 source/common/primitives.cpp     |    1 +
 source/common/primitives.h       |   10 +-
 source/common/quant.cpp          |   78 ++++-
 source/common/scalinglist.cpp    |    2 +-
 source/common/shortyuv.cpp       |    6 +-
 source/common/slice.cpp          |   12 +-
 source/common/slice.h            |   11 +-
 source/common/threading.h        |   19 +-
 source/common/x86/blockcopy8.asm |  693 +++++++++++++++++++++++---------------
 source/common/x86/intrapred8.asm |    2 +-
 source/encoder/analysis.cpp      |  279 ++++++++------
 source/encoder/analysis.h        |    3 +-
 source/encoder/api.cpp           |    2 +-
 source/encoder/dpb.cpp           |   44 +-
 source/encoder/dpb.h             |    4 +-
 source/encoder/encoder.cpp       |  218 ++++++++++--
 source/encoder/encoder.h         |    2 +-
 source/encoder/entropy.cpp       |  166 ++++----
 source/encoder/entropy.h         |    6 +-
 source/encoder/frameencoder.cpp  |   19 +-
 source/encoder/frameencoder.h    |   11 +-
 source/encoder/framefilter.cpp   |    5 +
 source/encoder/level.cpp         |   25 +-
 source/encoder/nal.cpp           |    2 +-
 source/encoder/search.cpp        |  304 +++++++++-------
 source/encoder/search.h          |   93 +++++-
 source/encoder/slicetype.cpp     |   44 +-
 source/encoder/slicetype.h       |   17 +-
 source/input/y4m.cpp             |   58 +--
 source/output/y4m.cpp            |    8 -
 source/output/yuv.cpp            |    4 -
 source/test/ipfilterharness.cpp  |   73 ++++-
 source/test/ipfilterharness.h    |    4 +-
 source/x265.h                    |  478 +++++++++++++-------------
 source/x265cli.h                 |   12 +-
 50 files changed, 1901 insertions(+), 1175 deletions(-)

diffs (truncated from 5631 to 300 lines):

diff -r 3ed2a4215e08 -r c5e50d780f06 doc/reST/cli.rst

--- a/doc/reST/cli.rst	Mon Feb 16 18:26:29 2015 +0530
+++ b/doc/reST/cli.rst	Wed Feb 18 15:35:58 2015 +0530
@@ -171,6 +171,8 @@ Performance Options
 	Over-allocation of frame threads will not improve performance, it
 	will generally just increase memory use.
 
+	**Values:** any value between 8 and 16. Default is 0, auto-detect
+
 .. option:: --threads <integer>
 
 	Number of threads to allocate for the worker thread pool  This pool
@@ -409,7 +411,17 @@ Profile, Level, Tier
 	If :option:`--level-idc` has been specified, the option adds the
 	intention to support the High tier of that level. If your specified
 	level does not support a High tier, a warning is issued and this
-	modifier flag is ignored.
+	modifier flag is ignored. If :option:`--level-idc` has been specified,
+	but not --high-tier, then the encoder will attempt to encode at the 
+	specified level, main tier first, turning on high tier only if 
+	necessary and available at that level.
+
+.. option:: --ref <1..16>
+
+	Max number of L0 references to be allowed. This number has a linear
+	multiplier effect on the amount of work performed in motion search,
+	but will generally have a beneficial effect on compression and
+	distortion. Default 3
 
 .. note::
 	:option:`--profile`, :option:`--level-idc`, and
@@ -494,14 +506,6 @@ the prediction quad-tree.
 	Measure full CU size (2Nx2N) merge candidates first; if no residual
 	is found the analysis is short circuited. Default disabled
 
-.. option:: --fast-cbf, --no-fast-cbf
-
-	Short circuit analysis if a prediction is found that does not set
-	the coded block flag (aka: no residual was encoded).  It prevents
-	the encoder from perhaps finding other predictions that also have no
-	residual but require less signaling bits or have less distortion.
-	Only applicable for RD levels 5 and 6. Default disabled
-
 .. option:: --fast-intra, --no-fast-intra
 
 	Perform an initial scan of every fifth intra angular mode, then
@@ -526,14 +530,6 @@ the prediction quad-tree.
 	Only effective at RD levels 3 and above, which perform RDO mode
 	decisions.
 
-.. option:: --tskip, --no-tskip
-
-	Enable evaluation of transform skip (bypass DCT but still use
-	quantization) coding for 4x4 TU coded blocks.
-
-	Only effective at RD levels 3 and above, which perform RDO mode
-	decisions. Default disabled
-
 .. option:: --tskip-fast, --no-tskip-fast
 
 	Only evaluate transform skip for NxN intra predictions (4x4 blocks).
@@ -593,9 +589,76 @@ as the residual quad-tree (RQT).
 	partitions, in which case a TU split is implied and thus the
 	residual quad-tree begins one layer below the CU quad-tree.
 
+.. option:: --nr-intra <integer>, --nr-inter <integer>
+
+	Noise reduction - an adaptive deadzone applied after DCT
+	(subtracting from DCT coefficients), before quantization.  It does
+	no pixel-level filtering, doesn't cross DCT block boundaries, has no
+	overlap. The higher the strength value parameter, the more
+	aggressively it will reduce noise.
+
+	Enabling noise reduction will make outputs diverge between different
+	numbers of frame threads. Outputs will be deterministic but the
+	outputs of -F2 will no longer match the outputs of -F3, etc.
+
+	**Values:** any value in range of 0 to 2000. Default 0 (disabled).
+
+.. option:: --tskip, --no-tskip
+
+	Enable evaluation of transform skip (bypass DCT but still use
+	quantization) coding for 4x4 TU coded blocks.
+
+	Only effective at RD levels 3 and above, which perform RDO mode
+	decisions. Default disabled
+
+.. option:: --rdpenalty <0..2>
+
+	When set to 1, transform units of size 32x32 are given a 4x bit cost
+	penalty compared to smaller transform units, in intra coded CUs in P
+	or B slices.
+
+	When set to 2, transform units of size 32x32 are not even attempted,
+	unless otherwise required by the maximum recursion depth.  For this
+	option to be effective with 32x32 intra CUs,
+	:option:`--tu-intra-depth` must be at least 2.  For it to be
+	effective with 64x64 intra CUs, :option:`--tu-intra-depth` must be
+	at least 3.
+
+	Note that in HEVC an intra transform unit (a block of the residual
+	quad-tree) is also a prediction unit, meaning that the intra
+	prediction signal is generated for each TU block, the residual
+	subtracted and then coded. The coding unit simply provides the
+	prediction modes that will be used when predicting all of the
+	transform units within the CU. This means that when you prevent
+	32x32 intra transform units, you are preventing 32x32 intra
+	predictions.
+
+	Default 0, disabled.
+
+	**Values:** 0:disabled 1:4x cost penalty 2:force splits
+
+.. option:: --max-tu-size <32|16|8|4>
+
+	Maximum TU size (width and height). The residual can be more
+	efficiently compressed by the DCT transform when the max TU size
+	is larger, but at the expense of more computation. Transform unit
+	quad-tree begins at the same depth of the coded tree unit, but if the
+	maximum TU size is smaller than the CU size then transform QT begins 
+	at the depth of the max-tu-size. Default: 32.
+
 Temporal / motion search options
 ================================
 
+.. option:: --max-merge <1..5>
+
+	Maximum number of neighbor (spatial and temporal) candidate blocks
+	that the encoder may consider for merging motion predictions. If a
+	merge candidate results in no residual, it is immediately selected
+	as a "skip".  Otherwise the merge candidates are tested as part of
+	motion estimation when searching for the least cost inter option.
+	The max candidate number is encoded in the SPS and determines the
+	bit cost of signaling merge CUs. Default 2
+
 .. option:: --me <integer|string>
 
 	Motion search method. Generally, the higher the number the harder
@@ -658,16 +721,6 @@ Temporal / motion search options
 
 	**Range of values:** an integer from 0 to 32768
 
-.. option:: --max-merge <1..5>
-
-	Maximum number of neighbor (spatial and temporal) candidate blocks
-	that the encoder may consider for merging motion predictions. If a
-	merge candidate results in no residual, it is immediately selected
-	as a "skip".  Otherwise the merge candidates are tested as part of
-	motion estimation when searching for the least cost inter option.
-	The max candidate number is encoded in the SPS and determines the
-	bit cost of signaling merge CUs. Default 2
-
 .. option:: --temporal-mvp, --no-temporal-mvp
 
 	Enable temporal motion vector predictors in P and B slices.
@@ -704,32 +757,6 @@ Spatial/intra options
 	propagation of reference errors that may have resulted from lossy
 	signals. Default disabled
 
-.. option:: --rdpenalty <0..2>
-
-	When set to 1, transform units of size 32x32 are given a 4x bit cost
-	penalty compared to smaller transform units, in intra coded CUs in P
-	or B slices.
-
-	When set to 2, transform units of size 32x32 are not even attempted,
-	unless otherwise required by the maximum recursion depth.  For this
-	option to be effective with 32x32 intra CUs,
-	:option:`--tu-intra-depth` must be at least 2.  For it to be
-	effective with 64x64 intra CUs, :option:`--tu-intra-depth` must be
-	at least 3.
-
-	Note that in HEVC an intra transform unit (a block of the residual
-	quad-tree) is also a prediction unit, meaning that the intra
-	prediction signal is generated for each TU block, the residual
-	subtracted and then coded. The coding unit simply provides the
-	prediction modes that will be used when predicting all of the
-	transform units within the CU. This means that when you prevent
-	32x32 intra transform units, you are preventing 32x32 intra
-	predictions.
-
-	Default 0, disabled.
-
-	**Values:** 0:disabled 1:4x cost penalty 2:force splits
-
 Psycho-visual options
 =====================
 
@@ -874,13 +901,6 @@ Slice decision options
 
 	Use B-frames as references, when possible. Default enabled
 
-.. option:: --ref <1..16>
-
-	Max number of L0 references to be allowed. This number has a linear
-	multiplier effect on the amount of work performed in motion search,
-	but will generally have a beneficial affect on compression and
-	distortion. Default 3
-
 Quality, rate control and rate distortion options
 =================================================
 
@@ -990,20 +1010,6 @@ Quality, rate control and rate distortio
 	less bits. This tends to improve detail in the backgrounds of video
 	with less detail in areas of high motion. Default enabled
 
-.. option:: --nr-intra <integer>, --nr-inter <integer>
-
-	Noise reduction - an adaptive deadzone applied after DCT
-	(subtracting from DCT coefficients), before quantization.  It does
-	no pixel-level filtering, doesn't cross DCT block boundaries, has no
-	overlap, The higher the strength value parameter, the more
-	aggressively it will reduce noise.
-
-	Enabling noise reduction will make outputs diverge between different
-	numbers of frame threads. Outputs will be deterministic but the
-	outputs of -F2 will no longer match the outputs of -F3, etc.
-
-	**Values:** any value in range of 0 to 2000. Default 0 (disabled).
-
 .. option:: --pass <integer>
 
 	Enable multi-pass rate control mode. Input is encoded multiple times,
@@ -1342,13 +1348,13 @@ Bitstream options
 	to keep the stream headers for you and you want keyframes to be
 	random access points. Default disabled
 
-.. option:: --info, --no-info
+.. option:: --aud, --no-aud
 
-	Emit an informational SEI with the stream headers which describes
-	the encoder version, build info, and encode parameters. This is very
-	helpful for debugging purposes but encoding version numbers and
-	build info could make your bitstreams diverge and interfere with
-	regression testing. Default enabled
+	Emit an access unit delimiter NAL at the start of each slice access
+	unit. If :option:`--repeat-headers` is not enabled (indicating the
+	user will be writing headers manually at the start of the stream)
+	the very first AUD will be skipped since it cannot be placed at the
+	start of the access unit, where it belongs. Default disabled
 
 .. option:: --hrd, --no-hrd
 
@@ -1357,13 +1363,13 @@ Bitstream options
 	Picture Timing SEI messages providing timing information to the
 	decoder. Default disabled
 
-.. option:: --aud, --no-aud
+.. option:: --info, --no-info
 
-	Emit an access unit delimiter NAL at the start of each slice access
-	unit. If :option:`--repeat-headers` is not enabled (indicating the
-	user will be writing headers manually at the start of the stream)
-	the very first AUD will be skipped since it cannot be placed at the
-	start of the access unit, where it belongs. Default disabled
+	Emit an informational SEI with the stream headers which describes
+	the encoder version, build info, and encode parameters. This is very
+	helpful for debugging purposes but encoding version numbers and
+	build info could make your bitstreams diverge and interfere with
+	regression testing. Default enabled
 
 .. option:: --hash <integer>
 
@@ -1375,6 +1381,18 @@ Bitstream options
 	2. CRC
 	3. Checksum
 
+.. option:: --temporal-layers, --no-temporal-layers
+
+	Enable a temporal sub layer. All referenced I/P/B frames are in the
+	base layer and all unreferenced B frames are placed in a temporal
+	sublayer. A decoder may choose to drop the sublayer and only decode
+	and display the base layer slices.
+	
+	If used with a fixed GOP (:option:`--b-adapt` 0) and :option:`--bframes`
+	3 then the two layers evenly split the frame rate, with a cadence of
+	PbBbP. You probably also want :option:`--no-scenecut` and a keyframe
+	interval that is a multiple of 4.
+
 Debugging options
 =================
 
diff -r 3ed2a4215e08 -r c5e50d780f06 doc/reST/threading.rst
--- a/doc/reST/threading.rst	Mon Feb 16 18:26:29 2015 +0530
+++ b/doc/reST/threading.rst	Wed Feb 18 15:35:58 2015 +0530
@@ -125,9 +125,14 @@ The second extenuating circumstance is t
 for motion reference must be processed by the loop filters and the loop
 filters cannot run until a full row has been encoded, and it must run a
 full row behind the encode process so that the pixels below the row
-being filtered are available. When you add up all the row lags each
-frame ends up being 3 CTU rows behind its reference frames (the
-equivalent of 12 macroblock rows for x264)
+being filtered are available. On top of this, HEVC has two loop filters:
+deblocking and SAO, which must be run in series with a row lag between
+them. When you add up all the row lags each frame ends up being 3 CTU
+rows behind its reference frames (the equivalent of 12 macroblock rows
+for x264). And keep in mind the wave-front progression pattern; by the
+time the reference frame finishes the third row of CTUs, nearly half of
+the CTUs in the frame may be compressed (depending on the display aspect
+ratio).
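
Editor's note on the threading.rst hunk above: the claim about how far a reference frame has progressed once its third CTU row finishes can be checked with a rough model. The sketch below is not taken from x265. It assumes an idealized wave-front in which every CTU takes one time unit, there are always enough worker threads, and each row trails the row above it by two CTUs. The function name and parameters are hypothetical.

```python
def ctus_done_when_row_finishes(width_ctus, height_ctus, finished_row=2, row_lag=2):
    """Count CTUs completed in a frame at the moment `finished_row` completes.

    Idealized wave-front model (an assumption, not x265's scheduler):
    one time unit per CTU, unlimited threads, and row r may start only
    after row r-1 has completed `row_lag` CTUs.
    """
    # Row r starts at time row_lag * r, so it finishes its last CTU at
    # time row_lag * r + width_ctus.
    t = row_lag * finished_row + width_ctus
    done = 0
    for row in range(height_ctus):
        # CTUs completed in this row by time t, clamped to [0, width].
        done += min(width_ctus, max(0, t - row_lag * row))
    return done


def fraction_done(width_ctus, height_ctus, finished_row=2, row_lag=2):
    """Fraction of the frame's CTUs completed at that same moment."""
    total = width_ctus * height_ctus
    return ctus_done_when_row_finishes(
        width_ctus, height_ctus, finished_row, row_lag) / total


if __name__ == "__main__":
    # 1920x1080 with 64x64 CTUs: 30 CTUs wide, 17 CTU rows (last row partial).
    print(fraction_done(30, 17))
```

Under these assumptions the result is an upper bound on progress. A real encode has fewer threads and uneven CTU cost, so the actual fraction will be lower and, as the text notes, depends on the frame's aspect ratio.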