[x265-commits] [x265] slicetype: pre-calculate cost estimates for B slices, sim...
Steve Borho
steve at borho.org
Thu May 1 20:33:53 CEST 2014
details: http://hg.videolan.org/x265/rev/1ed0bd2dbfd1
branches:
changeset: 6796:1ed0bd2dbfd1
user: Steve Borho <steve at borho.org>
date: Thu May 01 00:26:32 2014 -0500
description:
slicetype: pre-calculate cost estimates for B slices, simplify callback
rate control was always calling back for B slice estimates but since we were
only pre-calculating them if VBV was enabled we were forced to make the
estimates at that time (withing the context of the API thread).
With this patch, we estimate B costs unless CQP is in use, and this allows us
to simplify getEstimatedPictureCost()
Subject: [x265] Merge with stable
details: http://hg.videolan.org/x265/rev/22eda589d8ca
branches:
changeset: 6797:22eda589d8ca
user: Steve Borho <steve at borho.org>
date: Thu May 01 00:26:43 2014 -0500
description:
Merge with stable
diffstat:
doc/reST/cli.rst | 2 +-
doc/reST/index.rst | 1 +
doc/reST/threading.rst | 201 +++++++++++++++++++++++++
source/Lib/TLibCommon/TComWeightPrediction.cpp | 26 +-
source/Lib/TLibCommon/TComYuv.h | 7 +
source/common/shortyuv.h | 7 +
source/encoder/encoder.cpp | 10 +-
source/encoder/slicetype.cpp | 84 ++-------
8 files changed, 257 insertions(+), 81 deletions(-)
diffs (truncated from 492 to 300 lines):
diff -r a25fb61a7326 -r 22eda589d8ca doc/reST/cli.rst
--- a/doc/reST/cli.rst Tue Apr 29 14:04:58 2014 -0500
+++ b/doc/reST/cli.rst Thu May 01 00:26:43 2014 -0500
@@ -242,7 +242,7 @@ Quad-Tree analysis
.. option:: --wpp, --no-wpp
Enable Wavefront Parallel Processing. The encoder may begin encoding
- a row as soon as the row above it is at least two LCUs ahead in the
+ a row as soon as the row above it is at least two CTUs ahead in the
encode process. This gives a 3-5x gain in parallelism for about 1%
overhead in compression efficiency. Default: Enabled
diff -r a25fb61a7326 -r 22eda589d8ca doc/reST/index.rst
--- a/doc/reST/index.rst Tue Apr 29 14:04:58 2014 -0500
+++ b/doc/reST/index.rst Thu May 01 00:26:43 2014 -0500
@@ -5,3 +5,4 @@ x265 Documentation
introduction
cli
+ threading
diff -r a25fb61a7326 -r 22eda589d8ca doc/reST/threading.rst
--- /dev/null Thu Jan 01 00:00:00 1970 +0000
+++ b/doc/reST/threading.rst Thu May 01 00:26:43 2014 -0500
@@ -0,0 +1,201 @@
+*********
+Threading
+*********
+
+Thread Pool
+===========
+
+x265 creates a pool of worker threads and shares this thread pool
+with all encoders within the same process (it is process global, aka a
+singleton). The number of threads within the thread pool is determined
+by the encoder which first allocates the pool, which by definition is
+the first encoder created within each process.
+
+:option:`--threads` specifies the number of threads the encoder will
+try to allocate for its thread pool. If the thread pool was already
+allocated this parameter is ignored. By default x265 allocated one
+thread per (hyperthreaded) CPU core in your system.
+
+Work distribution is job based. Idle worker threads ask their parent
+pool object for jobs to perform. When no jobs are available, idle
+worker threads block and consume no CPU cycles.
+
+Objects which desire to distribute work to worker threads are known as
+job providers (and they derive from the JobProvider class). When job
+providers have work they enqueue themselves into the pool's provider
+list (and dequeue themselves when they no longer have work). The thread
+pool has a method to **poke** awake a blocked idle thread, and job
+providers are recommended to call this method when they make new jobs
+available.
+
+Worker jobs are not allowed to block except when abosultely necessary
+for data locking. If a job becomes blocked, the worker thread is
+expected to drop that job and go back to the pool and find more work.
+
+.. note::
+
+ x265_cleanup() frees the process-global thread pool, allowing
+ it to be reallocated if necessary, but only if no encoders are
+ allocated at the time it is called.
+
+Wavefront Parallel Processing
+=============================
+
+New with HEVC, Wavefront Parallel Processing allows each row of CTUs to
+be encoded in parallel, so long as each row stays at least two CTUs
+behind the row above it, to ensure the intra references and other data
+of the blocks above and above-right are available. WPP has almost no
+effect on the analysis and compression of each CTU and so it has a very
+small impact on compression efficiency relative to slices or tiles. The
+compression loss from WPP has been found to be less than 1% in most of
+our tests.
+
+WPP has three effects which can impact efficiency. The first is the row
+starts must be signaled in the slice header, the second is each row must
+be padded to an even byte in length, and the third is the state of the
+entropy coder is transferred from the second CTU of each row to the
+first CTU of the row below it. In some conditions this transfer of
+state actually improves compression since the above-right state may have
+better locality than the end of the previous row.
+
+Parabola Research have published an excellent HEVC
+`animation <http://www.parabolaresearch.com/blog/2013-12-01-hevc-wavefront-animation.html>`_
+which visualizes WPP very well. It even correctly visualizes some of
+WPPs key drawbacks, such as:
+
+1. the low thread utilization at the start and end of each frame
+2. a difficult block may stall the wave-front and it takes a while for
+ the wave-front to recover.
+3. 64x64 CTUs are big! there are much fewer rows than with H.264 and
+ similar codecs
+
+Because of these stall issues you rarely get the full parallelisation
+benefit one would expect from row threading. 30% to 50% of the
+theoretical perfect threading is typical.
+
+In x265 WPP is enabled by default since it not only improves performance
+at encode but it also makes it possible for the decoder to be threaded.
+
+If WPP is disabled by :option:`--no-wpp` the frame will be encoded in
+scan order and the entropy overheads will be avoided. If frame
+threading is not disabled, the encoder will change the default frame
+thread count to be higher than if WPP was enabled. The exact formulas
+are described in the next section.
+
+
+Frame Threading
+===============
+
+Frame threading is the act of encoding multiple frames at the same time.
+It is a challenge because each frame will generally use one or more of
+the previously encoded frames as motion references and those frames may
+still be in the process of being encoded themselves.
+
+Previous encoders such as x264 worked around this problem by limiting
+the motion search region within these reference frames to just one
+macroblock row below the coincident row being encoded. Thus a frame
+could be encoded at the same time as its reference frames so long as it
+stayed one row behind the encode progress of its references (glossing
+over a few details).
+
+x265 has the same frame threading mechanism, but we generally have much
+less frame parallelism to exploit than x264 because of the size of our
+CTU rows. For instance, with 1080p video x264 has 68 16x16 macroblock
+rows available each frame while x265 only has 17 64x64 CTU rows.
+
+The second extenuating circumstance is the loop filters. The pixels used
+for motion reference must be processed by the loop filters and the loop
+filters cannot run until a full row has been encoded, and it must run a
+full row behind the encode process so that the pixels below the row
+being filtered are available. When you add up all the row lags each
+frame ends up being 3 CTU rows behind its reference frames (the
+equivalent of 12 macroblock rows for x264)
+
+The third extenuating circumstance is that when a frame being encoded
+becomes blocked by a reference frame row being available, that frame's
+wave-front becomes completely stalled and when the row becomes available
+again it can take quite some time for the wave to be restarted, if it
+ever does. This makes WPP many times less effective when frame
+parallelism is in use.
+
+:option:`--merange` can have a negative impact on frame parallelism. If
+the range is too large, more rows of CTU lag must be added to ensure
+those pixels are available in the reference frames. Similarly
+:option:`--sao-lcu-opt` 0 will cause SAO to be performed over the
+entire picture at once (rather than being CTU based), which prevents any
+motion reference pixels from being available until the entire frame has
+been encoded, which prevents any real frame parallelism at all.
+
+.. note::
+
+ Even though the merange is used to determine the amount of reference
+ pixels that must be available in the reference frames, the actual
+ motion search is not necessarily centered around the coincident
+ block. The motion search is actually centered around the motion
+ predictor, but the available pixel area (mvmin, mvmax) is determined
+ by merange and the interpolation filter half-heights.
+
+When frame threading is disabled, the entirety of all reference frames
+are always fully available (by definition) and thus the available pixel
+area is not restricted at all, and this can sometimes improve
+compression efficiency. Because of this, the output of encodes with
+frame parallelism disabled will not match the output of encodes with
+frame parallelism enabled; but when enabled the number of frame threads
+should have no effect on the output bitstream except when using ABR or
+VBV rate control.
+
+By default frame parallelism and WPP are enabled together. The number of
+frame threads used is auto-detected from the (hyperthreaded) CPU core
+count, but may be manually specified via :option:`--frame-threads`
+
+ +-------+--------+
+ | Cores | Frames |
+ +=======+========+
+ | > 32 | 6 |
+ +-------+--------+
+ | >= 16 | 5 |
+ +-------+--------+
+ | >= 8 | 3 |
+ +-------+--------+
+ | >= 4 | 2 |
+ +-------+--------+
+
+If WPP is disabled, then the frame thread count defaults to **min(cpuCount, ctuRows / 2)**
+
+Over-allocating frame threads can be very counter-productive. They
+each allocate a large amount of memory and because of the limited number
+of CTU rows and the reference lag, you generally get limited benefit
+from adding frame encoders beyond the auto-detected count, and often
+the extra frame encoders reduce performance.
+
+Given these considerations, you can understand why the faster presets
+lower the max CTU size to 32x32 (making twice as many CTU rows available
+for WPP and for finer grained frame parallelism) and reduce
+:option:`--merange`
+
+Each frame encoder runs in its own thread (allocated separately from the
+worker pool). This frame thread has some pre-processing responsibilities
+and some post-processing responsibilities for each frame, but it spends
+the bulk of its time managing the wave-front processing by making CTU
+rows available to the worker threads when their dependencies are
+resolved. The frame encoder threads spend nearly all of their time
+blocked in one of 4 possible locations:
+
+1. blocked, waiting for a frame to process
+2. blocked on a reference frame, waiting for a CTU row of reconstructed
+ and loop-filtered reference pixels to become available
+3. blocked waiting for wave-front completion
+4. blocked waiting for the main thread to consume an encoded frame
+
+Lookahead
+=========
+
+The lookahead module of x265 (the lowres pre-encode which determines
+scene cuts and slice types) uses the thread pool to distribute the
+lowres cost analysis to worker threads. It follows the same wave-front
+pattern as the main encoder except it works in reverse-scan order.
+
+The function slicetypeDecide() itself may also be performed by a worker
+thread if your system has enough CPU cores to make this a beneficial
+trade-off, else it runs within the context of the thread which calls the
+x265_encoder_encode().
diff -r a25fb61a7326 -r 22eda589d8ca source/Lib/TLibCommon/TComWeightPrediction.cpp
--- a/source/Lib/TLibCommon/TComWeightPrediction.cpp Tue Apr 29 14:04:58 2014 -0500
+++ b/source/Lib/TLibCommon/TComWeightPrediction.cpp Thu May 01 00:26:43 2014 -0500
@@ -99,12 +99,12 @@ void TComWeightPrediction::addWeightBi(T
if (bLuma)
{
// Luma : --------------------------------------------
- int w0 = wp0[0].w;
- int offset = wp0[0].o + wp1[0].o;
+ int w0 = wp0[0].w;
+ int offset = wp0[0].o + wp1[0].o;
int shiftNum = IF_INTERNAL_PREC - X265_DEPTH;
- int shift = wp0[0].shift + shiftNum + 1;
- int round = shift ? (1 << (shift - 1)) * bRound : 0;
- int w1 = wp1[0].w;
+ int shift = wp0[0].shift + shiftNum + 1;
+ int round = shift ? (1 << (shift - 1)) * bRound : 0;
+ int w1 = wp1[0].w;
uint32_t src0Stride = srcYuv0->getStride();
uint32_t src1Stride = srcYuv1->getStride();
@@ -145,8 +145,8 @@ void TComWeightPrediction::addWeightBi(T
uint32_t src1Stride = srcYuv1->getCStride();
uint32_t dststride = outDstYuv->getCStride();
- width >>= 1;
- height >>= 1;
+ width >>= srcYuv0->getHorzChromaShift();
+ height >>= srcYuv0->getVertChromaShift();
for (y = height - 1; y >= 0; y--)
{
@@ -268,8 +268,8 @@ void TComWeightPrediction::addWeightBi(S
src1Stride = srcYuv1->m_cwidth;
dststride = outDstYuv->getCStride();
- width >>= 1;
- height >>= 1;
+ width >>= srcYuv0->getHorzChromaShift();
+ height >>= srcYuv0->getVertChromaShift();
for (y = height - 1; y >= 0; y--)
{
@@ -379,8 +379,8 @@ void TComWeightPrediction::addWeightUni(
src0Stride = srcYuv0->getCStride();
dststride = outDstYuv->getCStride();
- width >>= 1;
- height >>= 1;
+ width >>= srcYuv0->getHorzChromaShift();
+ height >>= srcYuv0->getVertChromaShift();
for (y = height - 1; y >= 0; y--)
{
@@ -469,8 +469,8 @@ void TComWeightPrediction::addWeightUni(
srcStride = srcYuv0->m_cwidth;
dstStride = outDstYuv->getCStride();
- width >>= 1;
- height >>= 1;
+ width >>= srcYuv0->getHorzChromaShift();
+ height >>= srcYuv0->getVertChromaShift();
primitives.weight_sp(srcU0, dstU, srcStride, dstStride, width, height, w0, round, shift, offset);
diff -r a25fb61a7326 -r 22eda589d8ca source/Lib/TLibCommon/TComYuv.h
--- a/source/Lib/TLibCommon/TComYuv.h Tue Apr 29 14:04:58 2014 -0500
+++ b/source/Lib/TLibCommon/TComYuv.h Thu May 01 00:26:43 2014 -0500
@@ -203,6 +203,13 @@ public:
uint32_t getCHeight() { return m_cheight; }
uint32_t getCWidth() { return m_cwidth; }
+
+ // -------------------------------------------------------------------------------------------------------------------
+ // member functions to support multiple color space formats
More information about the x265-commits
mailing list