[x264-devel] commit: Update some of the information in doc/ (Jason Garrett-Glaser )

Sun Oct 10 23:47:36 CEST 2010

x264 | branch: master | Jason Garrett-Glaser <darkshikari at gmail.com> | Sat Oct  2 23:56:52 2010 -0700| [e16c5f26d006acf2ff540de33e8dd59fa83ab95d] | committer: Jason Garrett-Glaser 

Update some of the information in doc/

> http://git.videolan.org/gitweb.cgi/x264.git/?a=commit;h=e16c5f26d006acf2ff540de33e8dd59fa83ab95d
---

 doc/ratecontrol.txt     |   10 +++++-----
 doc/regression_test.txt |    7 +++----
 doc/threads.txt         |   26 +++++++++++++++++++++-----
 3 files changed, 29 insertions(+), 14 deletions(-)
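
Note on the slicing numbers added to doc/threads.txt in the diff below: the listed shares can be turned into absolute bitrate costs with simple arithmetic. The following is only an illustrative sketch, not part of the commit. It assumes the components combine additively, which the text implies but does not state, and the names `SHARES_PERCENT`, `TOTAL_COST_PERCENT`, and `absolute_costs` are hypothetical. The listed shares sum to 99%, presumably because of rounding.

```python
# Shares of the total slicing cost, copied from the list in doc/threads.txt.
SHARES_PERCENT = {
    "intra prediction": 34,
    "redundant headers and byte rounding": 25,
    "mv prediction": 16,
    "reset cabac contexts": 16,
    "deblocking between slices": 6,
    "cabac neighbors": 2,
}

# Total cost reported in the text: +30% bitrate at constant psnr
# (720p, 45 slices, one per macroblock row).
TOTAL_COST_PERCENT = 30


def absolute_costs(shares, total):
    """Convert each share of the total cost into percentage points of bitrate.

    Assumes the components add up linearly (an assumption, see the note above).
    """
    return {name: total * share / 100 for name, share in shares.items()}


costs = absolute_costs(SHARES_PERCENT, TOTAL_COST_PERCENT)
for name, points in costs.items():
    print(f"{name}: +{points:.1f} percentage points of bitrate")
```

Under that additive assumption, intra prediction alone would account for roughly 10 of the 30 percentage points.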

diff --git a/doc/ratecontrol.txt b/doc/ratecontrol.txt
index 2d05603..e93ced2 100644
--- a/doc/ratecontrol.txt
+++ b/doc/ratecontrol.txt
@@ -1,6 +1,11 @@
 A qualitative overview of x264's ratecontrol methods
 By Loren Merritt
 
+Historical note:
+This document is outdated, but a significant part of it is still accurate.  Here are some important ways ratecontrol has changed since the authoring of this document:
+- By default, MB-tree is used instead of qcomp for weighting frame quality based on complexity.  MB-tree is effectively a generalization of qcomp to the macroblock level.  MB-tree also replaces the constant offsets for B-frame quantizers.  The legacy algorithm is still available for low-latency applications.
+- Adaptive quantization is now used to distribute quality within each frame; frames no longer use a constant quantizer, even if MB-tree is off.
+- VBV runs per-row rather than per-frame to improve accuracy.
 
 x264's ratecontrol is based on libavcodec's, and is mostly empirical. But I can retroactively propose the following theoretical points which underlie most of the algorithms:
 
@@ -37,8 +42,3 @@ The goal is the same as in 2pass, but here we don't have the benefit of a previo
 
     constant quantizer:
 QPs are simply based on frame type.
-
-
-    all modes:
-H.264 allows each macroblock to have a different QP. x264 does not do so. Ratecontrol returns one QP which is used for the whole frame.
-
diff --git a/doc/regression_test.txt b/doc/regression_test.txt
index 84422ec..5563238 100644
--- a/doc/regression_test.txt
+++ b/doc/regression_test.txt
@@ -7,19 +7,18 @@ inherently caused by compression.
 svn co svn://svn.videolan.org/x264/trunk x264
 cd x264
 ./configure
-perl -pi -e 's|//(#define DEBUG_DUMP_FRAME)|$1|' encoder/encoder.c # define DEBUG_DUMP_FRAME
 make
 cd ..
 
 # Install and compile JM reference decoder :
-wget http://iphome.hhi.de/suehring/tml/download/jm10.2.zip
-unzip jm10.2.zip
+wget http://iphome.hhi.de/suehring/tml/download/jm17.2.zip
+unzip jm17.2.zip
 cd JM
 sh unixprep.sh
 cd ldecod
 make
 cd ../..
 
-./x264/x264 input.yuv -o output.h264 # this produces fdec.yuv
+./x264/x264 input.yuv --dump-yuv fdec.yuv -o output.h264
 ./JM/bin/ldecod.exe -i output.h264 -o ref.yuv
 diff ref.yuv fdec.yuv
diff --git a/doc/threads.txt b/doc/threads.txt
index 3777b51..49cb5fb 100644
--- a/doc/threads.txt
+++ b/doc/threads.txt
@@ -1,3 +1,6 @@
+Historical notes:
+Slice-based threading was the original threading model of x264.  It was replaced with frame-based threading in r607.  This document was originally written at that time.  Slice-based threading was brought back (as an optional mode) in r1364 for low-latency encoding.  Furthermore, frame-based threading was modified significantly in r1246, with the addition of threaded lookahead.
+
 Old threading method: slice-based
 application calls x264
 x264 runs B-adapt and ratecontrol (serial)
@@ -9,24 +12,37 @@ In x264cli, there is one additional thread to decode the input.
 
 New threading method: frame-based
 application calls x264
-x264 runs B-adapt and ratecontrol (serial to the application, but parallel to the other x264 threads)
+x264 requests a frame from lookahead, which runs B-adapt and ratecontrol in parallel with the current thread, separated by a buffer of size sync-lookahead
 spawn a thread for this frame
-thread runs encode in 1 slice, deblock, hpel filter
+thread runs encode, deblock, hpel filter
 meanwhile x264 waits for the oldest thread to finish
 return to application, but the rest of the threads continue running in the background
-No additional threads are needed to decode the input, unless decoding+B-adapt is slower than slice+deblock+hpel, in which case an additional input thread would allow decoding in parallel to B-adapt.
-
+No additional threads are needed to decode the input, unless decoding is slower than slice+deblock+hpel, in which case an additional input thread would allow decoding in parallel.
 
 Penalties for slice-based threading:
 Each slice adds some bitrate (or equivalently reduces quality), for a variety of reasons: the slice header costs some bits, cabac contexts are reset, mvs and intra samples can't be predicted across the slice boundary.
-In CBR mode, we have to allocate bits between slices before encoding them, which may lead to uneven quality.
+In CBR mode, multiple slices encode simultaneously, thus increasing the maximum misprediction possible with VBV.
 Some parts of the encoder are serial, so it doesn't scale well with lots of cpus.
 
+Some numbers on penalties for slicing:
+Tested at 720p with 45 slices (one per mb row) to maximize the total cost for easy measurement. Averaged over 4 movies at crf20 and crf30. Total cost: +30% bitrate at constant psnr.
+I enabled the various components of slicing one at a time, and measured the portion of that cost they contribute:
+    * 34% intra prediction
+    * 25% redundant slice headers, nal headers, and rounding to whole bytes
+    * 16% mv prediction
+    * 16% reset cabac contexts
+    * 6% deblocking between slices (you don't strictly have to turn this off just for standard compliance, but you do if you want to use slices for decoder multithreading)
+    * 2% cabac neighbors (cbp, skip, etc)
+The proportional cost of redundant headers should certainly depend on bitrate (since the header size is constant and everything else depends on bitrate). Deblocking should too (due to varying deblock strength).
+But none of the proportions should depend strongly on the number of slices: some are triggered per slice while some are triggered per macroblock-that's-on-the-edge-of-a-slice, but as long as there's no more than 1 slice per row, the relative frequency of those two conditions is determined solely by the image width.
+
+
 Penalties for frame-based threading:
 To allow encoding of multiple frames in parallel, we have to ensure that any given macroblock uses motion vectors only from pieces of the reference frames that have been encoded already. This is usually not noticeable, but can matter for very fast upward motion.
 We have to commit to one frame type before starting on the frame. Thus scenecut detection must run during the lowres pre-motion-estimation along with B-adapt, which makes it faster but less accurate than re-encoding the whole frame.
 Ratecontrol gets delayed feedback, since it has to plan frame N before frame N-1 finishes.
 
+NOTE: these benchmarks are from the original implementation of frame-based threads.  They are likely not entirely accurate today, nor do the commandlines match up with modern x264.  However, they still give a good idea of the relative performance of frame and slice-based threads.
 
 Benchmarks:
 cpu: 4x woodcrest 3GHz