[x265-commits] [x265] rc: add helper function to decide the VBV predictor type ...

Tue Apr 14 06:44:55 CEST 2015

details:   http://hg.videolan.org/x265/rev/2884fb779578
branches:  
changeset: 10160:2884fb779578
user:      Aarthi Thirumalai
date:      Thu Apr 09 15:24:17 2015 +0530
description:
rc: add helper function to decide the VBV predictor type for each frame
Subject: [x265] cmake: specify default Windows target O/S as Win7, to enable NUMA APIs

details:   http://hg.videolan.org/x265/rev/34fa761d51bb
branches:  
changeset: 10161:34fa761d51bb
user:      Steve Borho <steve at borho.org>
date:      Sun Apr 12 08:15:10 2015 -0500
description:
cmake: specify default Windows target O/S as Win7, to enable NUMA APIs

after this change, our builds will target Windows 7 and later operating systems
by default, unless the WINXP_SUPPORT option is enabled. The NUMA APIs that we
need to give processor affinity to pool threads were introduced in Windows 7.
XP and Vista get lumped together into one *legacy* build option.
Subject: [x265] cmake: use explicit cpp file lists for input/ and output/

details:   http://hg.videolan.org/x265/rev/0b52b0251807
branches:  
changeset: 10162:0b52b0251807
user:      Steve Borho <steve at borho.org>
date:      Sun Apr 12 08:39:31 2015 -0500
description:
cmake: use explicit cpp file lists for input/ and output/

using globs for source files is not recommended since cmake will not detect
that the glob output has changed until the next time cmake is run (the cmake
generated Makefile will not know that it needs to re-run cmake).
Subject: [x265] input: use poke() method of ThreadSafeInteger appropriately

details:   http://hg.videolan.org/x265/rev/1ca06e792b1e
branches:  
changeset: 10163:1ca06e792b1e
user:      Steve Borho <steve at borho.org>
date:      Sun Apr 12 10:02:24 2015 -0500
description:
input: use poke() method of ThreadSafeInteger appropriately

and rely on constructor of TSI to initialize value to 0
Subject: [x265] cli: add an output preview feature, activated by --recon-y4m-exec

details:   http://hg.videolan.org/x265/rev/3749af0b4277
branches:  
changeset: 10164:3749af0b4277
user:      Peixuan Zhang <zhangpeixuancn at gmail.com>
date:      Thu Apr 09 16:31:18 2015 +0800
description:
cli: add an output preview feature, activated by --recon-y4m-exec

if you have an application which can play a Y4MPEG stream received on stdin,
the x265 CLI can feed it reconstructed pictures in display order.  The pictures
will have no timing info, obviously, so the picture timing will be determined
primarily by encoding elapsed time and latencies, but it can be useful to
preview the pictures being output by the encoder to validate input settings and
rate control parameters.
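
For readers unfamiliar with the Y4MPEG stream that a player on stdin would
receive, here is a minimal sketch of its layout for 8-bit 4:2:0 video. It is
not the code from reconplay.cpp, which this truncated mail does not show in
full. The helper names writeY4mHeader and writeY4mFrame are hypothetical, and
the sketch writes to any std::ostream rather than to a child process.

```cpp
#include <cstdint>
#include <ostream>
#include <sstream>
#include <vector>

// Hypothetical helper (not x265 code): emit the one-line stream header.
// Tags: W/H = frame size, F = frame rate as num:den, Ip = progressive,
// A1:1 = square pixels, C420 = 4:2:0 chroma subsampling.
void writeY4mHeader(std::ostream& out, int width, int height,
                    int fpsNum, int fpsDen)
{
    out << "YUV4MPEG2 W" << width << " H" << height
        << " F" << fpsNum << ':' << fpsDen << " Ip A1:1 C420\n";
}

// Hypothetical helper: each picture is the marker line "FRAME\n" followed
// by the raw Y, U and V planes. No timestamps are carried per frame.
void writeY4mFrame(std::ostream& out,
                   const std::vector<uint8_t>& y,
                   const std::vector<uint8_t>& u,
                   const std::vector<uint8_t>& v)
{
    out << "FRAME\n";
    out.write(reinterpret_cast<const char*>(y.data()),
              static_cast<std::streamsize>(y.size()));
    out.write(reinterpret_cast<const char*>(u.data()),
              static_cast<std::streamsize>(u.size()));
    out.write(reinterpret_cast<const char*>(v.data()),
              static_cast<std::streamsize>(v.size()));
}
```

Because the frames carry no timing of their own, a player reading this stream
can only pace itself by the header frame rate or by the arrival of data, which
is why the preview speed follows the encoder.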
Subject: [x265] api: add SMPTE ST 2086 mastering display color metadata

details:   http://hg.videolan.org/x265/rev/155f66bc7d0d
branches:  
changeset: 10165:155f66bc7d0d
user:      Steve Borho <steve at borho.org>
date:      Mon Apr 13 11:46:52 2015 -0700
description:
api: add SMPTE ST 2086 mastering display color metadata

This is setting a precedent for adding support for an SEI which is passed
directly from user arguments into the bitstream with no validations and minimal
overhead to the public API.
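
The cli.rst hunk in this changeset documents the --master-display string as
"Y(%hu,%hu)U(%hu,%hu)V(%hu,%hu)WP(%hu,%hu)L(%u,%u)". As an illustration only,
such a string could be unpacked with sscanf roughly as below. The struct and
function names are hypothetical, and the field-count check is an addition of
this sketch; the commit itself states that the encoder performs no validation.

```cpp
#include <cstdint>
#include <cstdio>

// Hypothetical container for the parsed values (not an x265 type).
struct MasteringDisplay
{
    uint16_t primaries[3][2]; // X,Y pairs in the order given in the string
    uint16_t whitePoint[2];   // white point X,Y
    uint32_t maxLuma;         // first value of L(...)
    uint32_t minLuma;         // second value of L(...)
};

// Returns true only when all ten numeric fields were matched.
bool parseMasterDisplay(const char* s, MasteringDisplay& md)
{
    unsigned short p[8];
    unsigned int l[2];
    int n = std::sscanf(s, "Y(%hu,%hu)U(%hu,%hu)V(%hu,%hu)WP(%hu,%hu)L(%u,%u)",
                        &p[0], &p[1], &p[2], &p[3], &p[4], &p[5],
                        &p[6], &p[7], &l[0], &l[1]);
    if (n != 10)
        return false;

    for (int i = 0; i < 3; i++)
    {
        md.primaries[i][0] = p[2 * i];
        md.primaries[i][1] = p[2 * i + 1];
    }
    md.whitePoint[0] = p[6];
    md.whitePoint[1] = p[7];
    md.maxLuma = l[0];
    md.minLuma = l[1];
    return true;
}
```

With the example string from the documentation,
Y(10,12)U(5,13)V(5,13)WP(100,100)L(1000,100), the sketch yields a white point
of (100,100) and luminance values of 1000 and 100.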
Subject: [x265] doc: clarify that --pools strings might need shell escaping (closes #121)

details:   http://hg.videolan.org/x265/rev/cd7d2f5c4d97
branches:  
changeset: 10166:cd7d2f5c4d97
user:      Steve Borho <steve at borho.org>
date:      Mon Apr 13 11:48:25 2015 -0700
description:
doc: clarify that --pools strings might need shell escaping (closes #121)
Subject: [x265] asm: improve avx2 code sub_ps[32x32] 1402 -> 1360

details:   http://hg.videolan.org/x265/rev/fbc8be70593e
branches:  
changeset: 10167:fbc8be70593e
user:      Sumalatha Polureddy
date:      Wed Apr 08 15:10:08 2015 +0530
description:
asm: improve avx2 code sub_ps[32x32] 1402 -> 1360
Subject: [x265] simplify rdoQuant() logic on ctxSet

details:   http://hg.videolan.org/x265/rev/37a9ac232655
branches:  
changeset: 10168:37a9ac232655
user:      Min Chen <chenm003 at 163.com>
date:      Mon Apr 13 18:39:38 2015 +0800
description:
simplify rdoQuant() logic on ctxSet
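
The quant.cpp hunk replaces a branchless increment of ctxSet with an explicit
test on c1. The sketch below isolates just that step to show that both forms
add one when c1 is zero. The function names are hypothetical, and the sketch
does not model where in the loop the base value (0 or 2) is computed. The old
form also assumes that right-shifting a negative int is an arithmetic shift,
which is true of the usual x86 compilers but was implementation-defined before
C++20.

```cpp
#include <cstdint>

// Old form, as in the removed lines: (c1 - 1) is negative only when
// c1 == 0, so the shift yields -1 and the subtraction adds one.
uint32_t ctxSetOldStyle(bool startsAtTwo, int c1)
{
    uint32_t ctxSet = startsAtTwo ? 2 : 0;
    ctxSet -= ((int32_t)(c1 - 1) >> 31);
    return ctxSet;
}

// New form, as in the added lines: the same increment written as a test.
uint32_t ctxSetNewStyle(bool startsAtTwo, uint32_t c1)
{
    uint32_t ctxSet = startsAtTwo ? 2 : 0;
    if (c1 == 0)
        ctxSet++;
    return ctxSet;
}
```

For any non-negative c1 the two functions return the same value, so the
rewrite trades a bit trick for a readable comparison.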
Subject: [x265] asm: intra pred all_angs_pred_4x4 sse2

details:   http://hg.videolan.org/x265/rev/abfbfdf724a0
branches:  
changeset: 10169:abfbfdf724a0
user:      David T Yuen <dtyx265 at gmail.com>
date:      Mon Apr 13 14:13:19 2015 -0700
description:
asm: intra pred all_angs_pred_4x4 sse2

This replaces C code and is backported from sse4.
The processing of modes 10 and 26 was merged and moved to after mode 2.

The new constants are declared with shortened names

64-bit

./test/TestBench --testbench intrapred | grep intra_allangs4x4
intra_allangs4x4	9.89x 	 6434.99  	 63671.87

32-bit

./test/TestBench --testbench intrapred | grep intra_allangs4x4
intra_allangs4x4	13.38x 	 6497.50  	 86943.55

diffstat:

 doc/reST/cli.rst                         |   36 +-
 doc/reST/threading.rst                   |   19 +-
 source/CMakeLists.txt                    |   15 +-
 source/common/param.cpp                  |    1 +
 source/common/quant.cpp                  |   15 +-
 source/common/x86/asm-primitives.cpp     |    2 +
 source/common/x86/const-a.asm            |    6 +
 source/common/x86/intrapred.h            |    1 +
 source/common/x86/intrapred8_allangs.asm |  782 +++++++++++++++++++++++++++++++
 source/common/x86/pixel-util8.asm        |  151 ++---
 source/encoder/encoder.cpp               |   15 +
 source/encoder/ratecontrol.cpp           |   12 +-
 source/encoder/ratecontrol.h             |    1 +
 source/encoder/sei.h                     |   42 +
 source/input/y4m.cpp                     |    7 +-
 source/input/yuv.cpp                     |    6 +-
 source/output/reconplay.cpp              |  193 +++++++
 source/output/reconplay.h                |   74 ++
 source/x265.cpp                          |   23 +-
 source/x265.h                            |    9 +
 source/x265cli.h                         |    5 +
 21 files changed, 1295 insertions(+), 120 deletions(-)

diffs (truncated from 1854 to 300 lines):

diff -r 4cccf22b00ee -r abfbfdf724a0 doc/reST/cli.rst

--- a/doc/reST/cli.rst	Fri Apr 10 18:15:38 2015 -0500
+++ b/doc/reST/cli.rst	Mon Apr 13 14:13:19 2015 -0700
@@ -219,7 +219,8 @@ Performance Options
 
 	On Windows, the native APIs offer sufficient functionality to
 	discover the NUMA topology and enforce the thread affinity that
-	libx265 needs, but on POSIX systems it relies on libnuma for this
+	libx265 needs (so long as you have not chosen to target XP or
+	Vista), but on POSIX systems it relies on libnuma for this
 	functionality. If your target POSIX system is single socket, then
 	building without libnuma is a perfectly reasonable option, as it
 	will have no effect on the runtime behavior. On a multiple-socket
@@ -229,6 +230,9 @@ Performance Options
 	Default "", one thread is allocated per detected hardware thread
 	(logical CPU cores) and one thread pool per NUMA node.
 
+	Note that the string value will need to be escaped or quoted to
+	protect against shell expansion on many platforms
+
 .. option:: --wpp, --no-wpp
 
 	Enable Wavefront Parallel Processing. The encoder may begin encoding
@@ -1478,6 +1482,20 @@ VUI fields must be manually specified.
 	specification for a description of these values. Default undefined
 	(not signaled)
 
+.. option:: --master-display <string>
+
+	SMPTE ST 2086 mastering display color volume SEI info, specified as
+	a string which is parsed when the stream header SEI are emitted. The
+	string format is "Y(%hu,%hu)U(%hu,%hu)V(%hu,%hu)WP(%hu,%hu)L(%u,%u)"
+	where %hu are unsigned 16bit integers and %u are unsigned 32bit
+	integers. The SEI includes X,Y display primaries for YUV channels,
+	white point X,Y and max,min luminance values.
+
+	Example: Y(10,12)U(5,13)V(5,13)WP(100,100)L(1000,100)
+
+	Note that this string value will need to be escaped or quoted to
+	protect against shell expansion on many platforms
+
 Bitstream options
 =================
 
@@ -1561,4 +1579,20 @@ Debugging options
 
 	**CLI ONLY**
 
+.. option:: --recon-y4m-exec <string>
+
+	If you have an application which can play a Y4MPEG stream received
+	on stdin, the x265 CLI can feed it reconstructed pictures in display
+	order.  The pictures will have no timing info, obviously, so the
+	picture timing will be determined primarily by encoding elapsed time
+	and latencies, but it can be useful to preview the pictures being
+	output by the encoder to validate input settings and rate control
+	parameters.
+
+	Example command for ffplay (assuming it is in your PATH):
+
+	--recon-y4m-exec "ffplay -i pipe:0 -autoexit"
+
+	**CLI ONLY**
+
 .. vim: noet
diff -r 4cccf22b00ee -r abfbfdf724a0 doc/reST/threading.rst
--- a/doc/reST/threading.rst	Fri Apr 10 18:15:38 2015 -0500
+++ b/doc/reST/threading.rst	Mon Apr 13 14:13:19 2015 -0700
@@ -34,15 +34,16 @@ expected to drop that job so the worker 
 and find more work.
 
 On Windows, the native APIs offer sufficient functionality to discover
-the NUMA topology and enforce the thread affinity that libx265 needs,
-but on POSIX systems it relies on libnuma for this functionality. If
-your target POSIX system is single socket, then building without libnuma
-is a perfectly reasonable option, as it will have no effect on the
-runtime behavior. On a multiple-socket system, a POSIX build of libx265
-without libnuma will be less work efficient, but will still function
-correctly. You lose the work isolation effect that keeps each frame
-encoder from only using the threads of a single socket and so you incur
-a heavier context switching cost.
+the NUMA topology and enforce the thread affinity that libx265 needs (so
+long as you have not chosen to target XP or Vista), but on POSIX systems
+it relies on libnuma for this functionality. If your target POSIX system
+is single socket, then building without libnuma is a perfectly
+reasonable option, as it will have no effect on the runtime behavior. On
+a multiple-socket system, a POSIX build of libx265 without libnuma will
+be less work efficient, but will still function correctly. You lose the
+work isolation effect that keeps each frame encoder from only using the
+threads of a single socket and so you incur a heavier context switching
+cost.
 
 Wavefront Parallel Processing
 =============================
diff -r 4cccf22b00ee -r abfbfdf724a0 source/CMakeLists.txt
--- a/source/CMakeLists.txt	Fri Apr 10 18:15:38 2015 -0500
+++ b/source/CMakeLists.txt	Mon Apr 13 14:13:19 2015 -0700
@@ -30,7 +30,7 @@ option(STATIC_LINK_CRT "Statically link 
 mark_as_advanced(FPROFILE_USE FPROFILE_GENERATE NATIVE_BUILD)
 
 # X265_BUILD must be incremented each time the public API is changed
-set(X265_BUILD 55)
+set(X265_BUILD 56)
 configure_file("${PROJECT_SOURCE_DIR}/x265.def.in"
                "${PROJECT_BINARY_DIR}/x265.def")
 configure_file("${PROJECT_SOURCE_DIR}/x265_config.h.in"
@@ -301,12 +301,15 @@ if (WIN32)
         list(APPEND PLATFORM_LIBS ${VLD_LIBRARIES})
         link_directories(${VLD_LIBRARY_DIRS})
     endif()
-    option(WINXP_SUPPORT "Make binaries compatible with Windows XP" OFF)
+    option(WINXP_SUPPORT "Make binaries compatible with Windows XP and Vista" OFF)
     if(WINXP_SUPPORT)
         # force use of workarounds for CONDITION_VARIABLE and atomic
         # intrinsics introduced after XP
         add_definitions(-D_WIN32_WINNT=_WIN32_WINNT_WINXP)
-    endif()
+    else(WINXP_SUPPORT)
+        # default to targeting Windows 7 for the NUMA APIs
+        add_definitions(-D_WIN32_WINNT=_WIN32_WINNT_WIN7)
+    endif(WINXP_SUPPORT)
 endif()
 
 include(version) # determine X265_VERSION and X265_LATEST_TAG
@@ -463,8 +466,10 @@ endif()
 # Main CLI application
 option(ENABLE_CLI "Build standalone CLI application" ON)
 if(ENABLE_CLI)
-    file(GLOB InputFiles input/*.cpp input/*.h)
-    file(GLOB OutputFiles output/*.cpp output/*.h)
+    file(GLOB InputFiles input/input.cpp input/yuv.cpp input/y4m.cpp input/*.h)
+    file(GLOB OutputFiles output/output.cpp output/reconplay.cpp output/*.h
+                          output/yuv.cpp output/y4m.cpp # recon
+                          output/raw.cpp)               # muxers
     file(GLOB FilterFiles filters/*.cpp filters/*.h)
     source_group(input FILES ${InputFiles})
     source_group(output FILES ${OutputFiles})
diff -r 4cccf22b00ee -r abfbfdf724a0 source/common/param.cpp
--- a/source/common/param.cpp	Fri Apr 10 18:15:38 2015 -0500
+++ b/source/common/param.cpp	Mon Apr 13 14:13:19 2015 -0700
@@ -851,6 +851,7 @@ int x265_param_parse(x265_param* p, cons
     OPT("lambda-file") p->rc.lambdaFileName = strdup(value);
     OPT("analysis-file") p->analysisFileName = strdup(value);
     OPT("qg-size") p->rc.qgSize = atoi(value);
+    OPT("master-display") p->masteringDisplayColorVolume = strdup(value);
     else
         return X265_PARAM_BAD_NAME;
 #undef OPT
diff -r 4cccf22b00ee -r abfbfdf724a0 source/common/quant.cpp
--- a/source/common/quant.cpp	Fri Apr 10 18:15:38 2015 -0500
+++ b/source/common/quant.cpp	Mon Apr 13 14:13:19 2015 -0700
@@ -558,7 +558,6 @@ uint32_t Quant::rdoQuant(const CUData& c
     int64_t costCoeffGroupSig[MLS_GRP_NUM]; /* lambda * bits of group coding cost */
     uint64_t sigCoeffGroupFlag64 = 0;
 
-    uint32_t ctxSet      = 0;
     int cgLastScanPos    = -1;
     int lastScanPos      = -1;
     const uint32_t cgSize = (1 << MLS_CG_SIZE); /* 4x4 num coef = 16 */
@@ -582,10 +581,12 @@ uint32_t Quant::rdoQuant(const CUData& c
 
     uint32_t scanPos;
     coeffGroupRDStats cgRdStats;
+    uint32_t c1 = 1;
 
     /* iterate over coding groups in reverse scan order */
     for (int cgScanPos = cgNum - 1; cgScanPos >= 0; cgScanPos--)
     {
+        uint32_t ctxSet = (cgScanPos && bIsLuma) ? 2 : 0;
         const uint32_t cgBlkPos = codeParams.scanCG[cgScanPos];
         const uint32_t cgPosY   = cgBlkPos >> codeParams.log2TrSizeCG;
         const uint32_t cgPosX   = cgBlkPos - (cgPosY << codeParams.log2TrSizeCG);
@@ -594,7 +595,10 @@ uint32_t Quant::rdoQuant(const CUData& c
 
         const int patternSigCtx = calcPatternSigCtx(sigCoeffGroupFlag64, cgPosX, cgPosY, cgBlkPos, cgStride);
 
-        int    c1            = 1;
+        if (c1 == 0)
+            ctxSet++;
+        c1 = 1;
+
         int    c2            = 0;
         uint32_t goRiceParam = 0;
         uint32_t c1Idx       = 0;
@@ -815,13 +819,6 @@ uint32_t Quant::rdoQuant(const CUData& c
             }
         } /* end for (scanPosinCG) */
 
-        /* context set update */
-        {
-            ctxSet = (cgScanPos == 1 || !bIsLuma) ? 0 : 2;
-            X265_CHECK(c1 >= 0, "c1 is negative\n");
-            ctxSet -= ((int32_t)(c1 - 1) >> 31);
-        }
-
         costCoeffGroupSig[cgScanPos] = 0;
 
         if (cgLastScanPos < 0)
diff -r 4cccf22b00ee -r abfbfdf724a0 source/common/x86/asm-primitives.cpp
--- a/source/common/x86/asm-primitives.cpp	Fri Apr 10 18:15:38 2015 -0500
+++ b/source/common/x86/asm-primitives.cpp	Mon Apr 13 14:13:19 2015 -0700
@@ -1259,6 +1259,8 @@ void setupAssemblyPrimitives(EncoderPrim
         p.cu[BLOCK_4x4].intra_pred[32] = x265_intra_pred_ang4_4_sse2;
         p.cu[BLOCK_4x4].intra_pred[33] = x265_intra_pred_ang4_3_sse2;
 
+        p.cu[BLOCK_4x4].intra_pred_allangs = x265_all_angs_pred_4x4_sse2;
+
         p.cu[BLOCK_4x4].calcresidual = x265_getResidual4_sse2;
         p.cu[BLOCK_8x8].calcresidual = x265_getResidual8_sse2;
 
diff -r 4cccf22b00ee -r abfbfdf724a0 source/common/x86/const-a.asm
--- a/source/common/x86/const-a.asm	Fri Apr 10 18:15:38 2015 -0500
+++ b/source/common/x86/const-a.asm	Mon Apr 13 14:13:19 2015 -0700
@@ -53,6 +53,10 @@ const pb_unpackwq2,         times  1 db 
 const pb_shuf8x8c,          times  1 db   0,   0,   0,   0,   2,   2,   2,   2,   4,   4,   4,   4,   6,   6,   6,   6
 const pb_movemask,          times 16 db 0x00
                             times 16 db 0xFF
+const pb_0000000000000F0F,  times  2 db 0xff, 0x00
+                            times 12 db 0x00
+const pb_000000000000000F,           db 0xff
+                            times 15 db 0x00
 
 ;; 16-bit constants
 
@@ -94,6 +98,8 @@ const multiL,               times  1 dw 
 const multiH2,              times  1 dw  17,  18,  19,  20,  21,  22,  23,  24,  25,  26,  27,  28,  29,  30,  31,  32
 const pw_planar16_mul,      times  1 dw  15,  14,  13,  12,  11,  10,   9,   8,   7,   6,   5,   4,   3,   2,   1,   0
 const pw_planar32_mul,      times  1 dw  31,  30,  29,  28,  27,  26,  25,  24,  23,  22,  21,  20,  19,  18,  17,  16
+const pw_FFFFFFFFFFFFFFF0,           dw 0x00
+                            times 7  dw 0xff
 
 
 ;; 32-bit constants
diff -r 4cccf22b00ee -r abfbfdf724a0 source/common/x86/intrapred.h
--- a/source/common/x86/intrapred.h	Fri Apr 10 18:15:38 2015 -0500
+++ b/source/common/x86/intrapred.h	Mon Apr 13 14:13:19 2015 -0700
@@ -277,6 +277,7 @@ void x265_intra_pred_ang32_24_avx2(pixel
 void x265_intra_pred_ang32_23_avx2(pixel* dst, intptr_t dstStride, const pixel* srcPix, int dirMode, int bFilter);
 void x265_intra_pred_ang32_22_avx2(pixel* dst, intptr_t dstStride, const pixel* srcPix, int dirMode, int bFilter);
 void x265_intra_pred_ang32_21_avx2(pixel* dst, intptr_t dstStride, const pixel* srcPix, int dirMode, int bFilter);
+void x265_all_angs_pred_4x4_sse2(pixel *dest, pixel *refPix, pixel *filtPix, int bLuma);
 void x265_all_angs_pred_4x4_sse4(pixel *dest, pixel *refPix, pixel *filtPix, int bLuma);
 void x265_all_angs_pred_8x8_sse4(pixel *dest, pixel *refPix, pixel *filtPix, int bLuma);
 void x265_all_angs_pred_16x16_sse4(pixel *dest, pixel *refPix, pixel *filtPix, int bLuma);
diff -r 4cccf22b00ee -r abfbfdf724a0 source/common/x86/intrapred8_allangs.asm
--- a/source/common/x86/intrapred8_allangs.asm	Fri Apr 10 18:15:38 2015 -0500
+++ b/source/common/x86/intrapred8_allangs.asm	Mon Apr 13 14:13:19 2015 -0700
@@ -34,9 +34,14 @@ cextern pw_1024
 
 ; common constant with intrapred8.asm
 cextern ang_table
+cextern pw_ang_table
 cextern tab_S1
 cextern tab_S2
 cextern tab_Si
+cextern pw_16
+cextern pb_000000000000000F
+cextern pb_0000000000000F0F
+cextern pw_FFFFFFFFFFFFFFF0
 
 
 ;-----------------------------------------------------------------------------
@@ -23006,3 +23011,780 @@ cglobal all_angs_pred_32x32, 3,7,8, 0-4
     palignr    m4,              m2,       m1,    14
     movu       [r0 + 2111 * 16],   m4
     RET
+
+;-----------------------------------------------------------------------------
+; void all_angs_pred_4x4(pixel *dest, pixel *refPix, pixel *filtPix, int bLuma)
+;-----------------------------------------------------------------------------
+INIT_XMM sse2
+cglobal all_angs_pred_4x4, 4, 4, 8
+
+; mode 2
+
+    movh        m6,             [r1 + 9]
+    mova        m2,             m6
+    psrldq      m2,             1
+    movd        [r0],           m2              ;byte[A, B, C, D]
+    psrldq      m2,             1
+    movd        [r0 + 4],       m2              ;byte[B, C, D, E]
+    psrldq      m2,             1
+    movd        [r0 + 8],       m2              ;byte[C, D, E, F]
+    psrldq      m2,             1
+    movd        [r0 + 12],      m2              ;byte[D, E, F, G]
+
+; mode 10/26
+
+    pxor        m7,             m7
+    pshufd      m5,             m6,        0
+    mova        [r0 + 128],     m5              ;mode 10 byte[9, A, B, C, 9, A, B, C, 9, A, B, C, 9, A, B, C]
+
+    movd        m4,             [r1 + 1]
+    pshufd      m4,             m4,        0
+    mova        [r0 + 384],     m4              ;mode 26 byte[1, 2, 3, 4, 1, 2, 3, 4, 1, 2, 3, 4, 1, 2, 3, 4]
+
+    movd        m1,             [r1]
+    punpcklbw   m1,             m7
+    pshuflw     m1,             m1,     0x00
+    punpcklqdq  m1,             m1              ;m1 = byte[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]