[x265-commits] [x265] quant: avoid runtime check of transform shift size

Fri Aug 8 07:43:34 CEST 2014

details:   http://hg.videolan.org/x265/rev/8e68a1db7c04
branches:  
changeset: 7736:8e68a1db7c04
user:      Steve Borho <steve at borho.org>
date:      Thu Aug 07 21:12:20 2014 -0500
description:
quant: avoid runtime check of transform shift size
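
The reasoning behind replacing the runtime sign test with a preprocessor
check can be sketched as below. This is an illustration, not the encoder's
code: it assumes MAX_TR_DYNAMIC_RANGE is 15 and that log2TrSize ranges over
2..5 (4x4 to 32x32); neither value is stated in this mail. The helper names
are hypothetical.

```cpp
#include <cassert>

// Assumption: MAX_TR_DYNAMIC_RANGE is 15 (not stated in this mail).
constexpr int kMaxTrDynamicRange = 15;

// Same expression as the transformShift line in the quant.cpp hunk.
constexpr int transformShift(int bitDepth, int log2TrSize)
{
    return kMaxTrDynamicRange - bitDepth - log2TrSize;
}

// Smallest shift over the assumed transform sizes (largest log2TrSize is 5).
constexpr int minTransformShift(int bitDepth)
{
    return transformShift(bitDepth, 5);
}

// For bit depths up to 10 the shift never goes negative, so the sign test
// can be resolved at compile time; deeper builds still need the branch.
static_assert(minTransformShift(8) == 2, "8-bit: shift is always positive");
static_assert(minTransformShift(10) == 0, "10-bit: shift is never negative");
static_assert(minTransformShift(12) < 0, "12-bit: shift can be negative");
```

Under these assumptions only builds with X265_DEPTH above 10 can reach the
right-shift path, which matches the #if X265_DEPTH <= 10 guard in the diff.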
Subject: [x265] frameencoder: avoid redundant calls to resetEntropy()

details:   http://hg.videolan.org/x265/rev/83880abea807
branches:  
changeset: 7737:83880abea807
user:      Steve Borho <steve at borho.org>
date:      Thu Aug 07 22:24:47 2014 -0500
description:
frameencoder: avoid redundant calls to resetEntropy()

All of the entropy coders need to be reset to the same state at the start of
the frame's analysis.  There is no point in re-calculating this initial state
repeatedly for each row.
Subject: [x265] frameencoder: nit

details:   http://hg.videolan.org/x265/rev/b89417dfa782
branches:  
changeset: 7738:b89417dfa782
user:      Steve Borho <steve at borho.org>
date:      Thu Aug 07 22:24:54 2014 -0500
description:
frameencoder: nit
Subject: [x265] entropy: disable signaling of CABAC init state

details:   http://hg.videolan.org/x265/rev/3fdb78507aea
branches:  
changeset: 7739:3fdb78507aea
user:      Steve Borho <steve at borho.org>
date:      Thu Aug 07 19:58:27 2014 -0500
description:
entropy: disable signaling of CABAC init state

This flag, which was already disabled when frame parallelism is in use (which
is nearly always), was of limited utility. It did not improve compression
efficiency by any measurable amount, and it was expensive to compute.  But the
quality which made it expendable was that it was the only user of the bBinsCoded
flag in the ContextModel, which forced us to copy twice as much data every time
we copied a context.

With this feature removed, the context model can be reduced to a single uint8_t
state variable.
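
A minimal sketch of the size argument, for illustration only. The old struct
layout is taken from the ContextTables.h hunk below; the context count of 160
is an assumption borrowed from the "32 * 5 bytes" padding commit, and the
names OldContextModel, NewContextModel and kNumContexts are hypothetical.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Layout of the struct removed in the ContextTables.h hunk.
struct OldContextModel
{
    uint8_t state;
    uint8_t bBinsCoded;
};

// After the change a context is just its state byte.
typedef uint8_t NewContextModel;

// Assumption: 160 contexts (32 * 5), used here only to make the sizes concrete.
const int kNumContexts = 160;

// Holds on mainstream compilers, which add no padding to two adjacent bytes.
static_assert(sizeof(OldContextModel) == 2 * sizeof(NewContextModel),
              "dropping bBinsCoded halves the per-context size");

// Copying a whole context set becomes a plain byte copy of half the old size.
inline void copyContexts(NewContextModel* dst, const NewContextModel* src)
{
    std::memcpy(dst, src, kNumContexts * sizeof(NewContextModel));
}
```

With these numbers a full copy moves 160 bytes instead of 320.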
Subject: [x265] entropy: remove bBinsCoded from ContextModel (no more users)

details:   http://hg.videolan.org/x265/rev/4297617da24c
branches:  
changeset: 7740:4297617da24c
user:      Steve Borho <steve at borho.org>
date:      Thu Aug 07 19:59:07 2014 -0500
description:
entropy: remove bBinsCoded from ContextModel (no more users)
Subject: [x265] entropy: remove ContextModel structure, use uint8_t directly

details:   http://hg.videolan.org/x265/rev/04567c40dae5
branches:  
changeset: 7741:04567c40dae5
user:      Steve Borho <steve at borho.org>
date:      Thu Aug 07 20:20:00 2014 -0500
description:
entropy: remove ContextModel structure, use uint8_t directly
Subject: [x265] entropy: pad size of context array to 32 * 5 bytes

details:   http://hg.videolan.org/x265/rev/f6e38749049c
branches:  
changeset: 7742:f6e38749049c
user:      Steve Borho <steve at borho.org>
date:      Thu Aug 07 21:24:15 2014 -0500
description:
entropy: pad size of context array to 32 * 5 bytes
Subject: [x265] entropy: remove implicit memset from constructor

details:   http://hg.videolan.org/x265/rev/49b593197330
branches:  
changeset: 7743:49b593197330
user:      Steve Borho <steve at borho.org>
date:      Thu Aug 07 22:42:47 2014 -0500
description:
entropy: remove implicit memset from constructor

Before we do further refactors, we want Entropy instances allocated on the
stack to not perform any needless initialization work.
Subject: [x265] main10: create a hybrid all-angs primitive for 16bpp compiles

details:   http://hg.videolan.org/x265/rev/33702c567e50
branches:  
changeset: 7744:33702c567e50
user:      Steve Borho <steve at borho.org>
date:      Thu Aug 07 23:44:13 2014 -0500
description:
main10: create a hybrid all-angs primitive for 16bpp compiles

The all-angs primitive is highly optimized assembly code that avoids a lot of
redundant work.  The all-angs C ref is horribly slow, doing redundant work to
mimic the output of the all-angs assembly code. Since we have no high bit depth
assembly for these functions, we'll use a shim C function that works much like
the C ref but at least uses optimized primitives.

primitive             speedup   optimized     C ref
intra_allangs4x4      3.64x     6619.54       24097.30
intra_allangs8x8      5.66x     13722.49      77694.97
intra_allangs32x32    4.57x     246943.81     1129159.50

before:
encoded 1253 frames in 104.37s (12.01 fps), 366.08 kb/s, SSIM Mean Y: 0.9889624 (19.571 dB)

after:
encoded 1253 frames in 95.62s (13.10 fps), 366.08 kb/s, SSIM Mean Y: 0.9889624 (19.571 dB)
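
The output layout the shim has to reproduce can be sketched as follows. The
offset expression and the "mode < 18" transpose rule are taken from the
intra_allangs template in the asm-primitives.cpp hunk below; the helper names
are hypothetical and exist only for this illustration.

```cpp
#include <cassert>
#include <cstddef>

// Angular modes covered by the shim's loop in the diff (2..34 inclusive).
constexpr int kFirstAngMode = 2;
constexpr int kLastAngMode = 34;

// Offset of one mode's block in the packed output, mirroring
// dest + ((mode - 2) << (log2Size * 2)) from the diff.
constexpr size_t allAngsOffset(int mode, int log2Size)
{
    return static_cast<size_t>(mode - kFirstAngMode) << (log2Size * 2);
}

// Total number of pixels written for one block size: 33 modes of size*size.
constexpr size_t allAngsBufferSize(int log2Size)
{
    return static_cast<size_t>(kLastAngMode - kFirstAngMode + 1) << (log2Size * 2);
}

// In the diff, modes below 18 are predicted into a scratch buffer and then
// transposed into place; the remaining modes are written directly.
constexpr bool needsTranspose(int mode)
{
    return mode < 18;
}
```

For a 4x4 block this gives 33 * 16 = 528 output pixels, with mode 3 starting
at offset 16.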
Subject: [x265] denoise: fix numCoeff (bug from 42b1d7c17510)

details:   http://hg.videolan.org/x265/rev/ef2602935c59
branches:  
changeset: 7745:ef2602935c59
user:      Satoshi Nakagawa <nakagawa424 at oki.com>
date:      Fri Aug 08 12:57:25 2014 +0900
description:
denoise: fix numCoeff (bug from 42b1d7c17510)
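
The difference between the two expressions in the quant.cpp hunk can be shown
with a small sketch. Both expressions are copied from the diff (the old
"trSize << 1" and the new "1 << log2TrSize * 2"); the function names are
hypothetical.

```cpp
#include <cassert>

// Old argument passed to denoiseDct: trSize << 1, i.e. twice the block width.
constexpr int buggyNumCoeff(int log2TrSize)
{
    return (1 << log2TrSize) << 1;
}

// New argument: 1 << log2TrSize * 2. Multiplication binds tighter than the
// shift, so this is 1 << (2 * log2TrSize), the full size*size coefficient count.
constexpr int fixedNumCoeff(int log2TrSize)
{
    return 1 << log2TrSize * 2;
}

static_assert(buggyNumCoeff(3) == 16, "8x8: old expression gives only 16");
static_assert(fixedNumCoeff(3) == 64, "8x8: block has 64 coefficients");
```

So for an 8x8 block the old expression covered 16 of the 64 coefficients, and
for 32x32 it covered 64 of 1024.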
Subject: [x265] asm: cvt16to32_shr[*] for TSkip

details:   http://hg.videolan.org/x265/rev/8cd2e8c9a3ba
branches:  
changeset: 7746:8cd2e8c9a3ba
user:      Min Chen <chenm003 at 163.com>
date:      Thu Aug 07 18:18:11 2014 -0500
description:
asm: cvt16to32_shr[*] for TSkip
Subject: [x265] asm: cvt32to16_shl[*] for TSkip

details:   http://hg.videolan.org/x265/rev/091a63164c41
branches:  
changeset: 7747:091a63164c41
user:      Min Chen <chenm003 at 163.com>
date:      Thu Aug 07 18:18:11 2014 -0500
description:
asm: cvt32to16_shl[*] for TSkip

diffstat:

 source/Lib/TLibCommon/ContextTables.h |    6 -
 source/common/pixel.cpp               |   35 +++
 source/common/primitives.h            |    4 +
 source/common/quant.cpp               |   36 +-
 source/common/slice.h                 |    3 -
 source/common/x86/asm-primitives.cpp  |   45 +++
 source/common/x86/blockcopy8.asm      |  386 ++++++++++++++++++++++++++++++++++
 source/common/x86/blockcopy8.h        |    8 +
 source/encoder/encoder.cpp            |    3 -
 source/encoder/entropy.cpp            |  347 +++++++++--------------------
 source/encoder/entropy.h              |    9 +-
 source/encoder/frameencoder.cpp       |   20 +-
 source/test/pixelharness.cpp          |   88 +++++++-
 source/test/pixelharness.h            |    2 +
 source/test/testbench.cpp             |    6 +
 15 files changed, 711 insertions(+), 287 deletions(-)

diffs (truncated from 1704 to 300 lines):

diff -r 8e45fc7c5521 -r 091a63164c41 source/Lib/TLibCommon/ContextTables.h

--- a/source/Lib/TLibCommon/ContextTables.h	Thu Aug 07 19:49:42 2014 +0530
+++ b/source/Lib/TLibCommon/ContextTables.h	Thu Aug 07 18:18:11 2014 -0500
@@ -127,12 +127,6 @@
 namespace x265 {
 // private namespace
 
-struct ContextModel
-{
-    uint8_t state;
-    uint8_t bBinsCoded;
-};
-
 extern const uint32_t g_entropyBits[128];
 extern const uint8_t g_nextState[128][2];
 
diff -r 8e45fc7c5521 -r 091a63164c41 source/common/pixel.cpp
--- a/source/common/pixel.cpp	Thu Aug 07 19:49:42 2014 +0530
+++ b/source/common/pixel.cpp	Thu Aug 07 18:18:11 2014 -0500
@@ -442,6 +442,18 @@ void convert16to32_shl(int32_t *dst, int
     }
 }
 
+template<int size>
+void convert16to32_shr(int32_t *dst, int16_t *src, intptr_t stride, int shift, int offset)
+{
+    for (int i = 0; i < size; i++)
+    {
+        for (int j = 0; j < size; j++)
+        {
+            dst[i * size + j] = ((int)src[i * stride + j] + offset) >> shift;
+        }
+    }
+}
+
 void convert32to16_shr(int16_t *dst, int32_t *src, intptr_t stride, int shift, int size)
 {
     int round = 1 << (shift - 1);
@@ -458,6 +470,21 @@ void convert32to16_shr(int16_t *dst, int
     }
 }
 
+template<int size>
+void convert32to16_shl(int16_t *dst, int32_t *src, intptr_t stride, int shift)
+{
+    for (int i = 0; i < size; i++)
+    {
+        for (int j = 0; j < size; j++)
+        {
+            dst[j] = ((int16_t)src[j] << shift);
+        }
+
+        src += size;
+        dst += stride;
+    }
+}
+
 template<int blockSize>
 void getResidual(pixel *fenc, pixel *pred, int16_t *residual, intptr_t stride)
 {
@@ -1176,7 +1203,15 @@ void Setup_C_PixelPrimitives(EncoderPrim
     p.blockfill_s[BLOCK_64x64] = blockfil_s_c<64>;
 
     p.cvt16to32_shl = convert16to32_shl;
+    p.cvt16to32_shr[BLOCK_4x4] = convert16to32_shr<4>;
+    p.cvt16to32_shr[BLOCK_8x8] = convert16to32_shr<8>;
+    p.cvt16to32_shr[BLOCK_16x16] = convert16to32_shr<16>;
+    p.cvt16to32_shr[BLOCK_32x32] = convert16to32_shr<32>;
     p.cvt32to16_shr = convert32to16_shr;
+    p.cvt32to16_shl[BLOCK_4x4] = convert32to16_shl<4>;
+    p.cvt32to16_shl[BLOCK_8x8] = convert32to16_shl<8>;
+    p.cvt32to16_shl[BLOCK_16x16] = convert32to16_shl<16>;
+    p.cvt32to16_shl[BLOCK_32x32] = convert32to16_shl<32>;
 
     p.sa8d[BLOCK_4x4]   = satd_4x4;
     p.sa8d[BLOCK_8x8]   = sa8d_8x8;
diff -r 8e45fc7c5521 -r 091a63164c41 source/common/primitives.h
--- a/source/common/primitives.h	Thu Aug 07 19:49:42 2014 +0530
+++ b/source/common/primitives.h	Thu Aug 07 18:18:11 2014 -0500
@@ -149,7 +149,9 @@ typedef void (*intra_pred_t)(pixel* dst,
 typedef void (*intra_allangs_t)(pixel *dst, pixel *above0, pixel *left0, pixel *above1, pixel *left1, int bLuma);
 
 typedef void (*cvt16to32_shl_t)(int32_t *dst, int16_t *src, intptr_t, int, int);
+typedef void (*cvt16to32_shr_t)(int32_t *dst, int16_t *src, intptr_t, int, int);
 typedef void (*cvt32to16_shr_t)(int16_t *dst, int32_t *src, intptr_t, int, int);
+typedef void (*cvt32to16_shl_t)(int16_t *dst, int32_t *src, intptr_t, int);
 typedef uint32_t (*cvt16to32_cnt_t)(coeff_t* coeff, int16_t* residual, intptr_t stride);
 
 typedef void (*dct_t)(int16_t *src, int32_t *dst, intptr_t stride);
@@ -218,7 +220,9 @@ struct EncoderPrimitives
     blockcpy_pp_t   blockcpy_pp;                     // block copy pixel from pixel
     blockcpy_ps_t   blockcpy_ps;                     // block copy pixel from short
     cvt16to32_shl_t cvt16to32_shl;
+    cvt16to32_shr_t cvt16to32_shr[NUM_SQUARE_BLOCKS - 1];
     cvt32to16_shr_t cvt32to16_shr;
+    cvt32to16_shl_t cvt32to16_shl[NUM_SQUARE_BLOCKS - 1];
     cvt16to32_cnt_t cvt16to32_cnt[NUM_SQUARE_BLOCKS - 1];
 
     copy_pp_t       luma_copy_pp[NUM_LUMA_PARTITIONS];
diff -r 8e45fc7c5521 -r 091a63164c41 source/common/quant.cpp
--- a/source/common/quant.cpp	Thu Aug 07 19:49:42 2014 +0530
+++ b/source/common/quant.cpp	Thu Aug 07 18:18:11 2014 -0500
@@ -342,24 +342,24 @@ uint32_t Quant::transformNxN(TComDataCU*
     bool isLuma  = ttype == TEXT_LUMA;
     bool usePsy  = m_psyRdoqScale && isLuma && !useTransformSkip;
     bool isIntra = cu->getPredictionMode(absPartIdx) == MODE_INTRA;
+    int transformShift = MAX_TR_DYNAMIC_RANGE - X265_DEPTH - log2TrSize; // Represents scaling through forward transform
     int trSize = 1 << log2TrSize;
 
     X265_CHECK((cu->m_slice->m_sps->quadtreeTULog2MaxSize >= log2TrSize), "transform size too large\n");
     if (useTransformSkip)
     {
-        int shift = MAX_TR_DYNAMIC_RANGE - X265_DEPTH - log2TrSize;
-
-        if (shift >= 0)
-            primitives.cvt16to32_shl(m_resiDctCoeff, residual, stride, shift, trSize);
+#if X265_DEPTH <= 10
+        primitives.cvt16to32_shl(m_resiDctCoeff, residual, stride, transformShift, trSize);
+#else
+        if (transformShift >= 0)
+            primitives.cvt16to32_shl(m_resiDctCoeff, residual, stride, transformShift, trSize);
         else
         {
-            /* X265_DEPTH > 13 */
-            shift = -shift;
+            int shift = -transformShift;
             int offset = (1 << (shift - 1));
-            for (int j = 0; j < trSize; j++)
-                for (int k = 0; k < trSize; k++)
-                    m_resiDctCoeff[j * trSize + k] = (residual[j * stride + k] + offset) >> shift;
+            primitives.cvt16to32_shr[log2TrSize - 2](m_resiDctCoeff, residual, stride, shift, offset);
         }
+#endif
     }
     else
     {
@@ -382,7 +382,8 @@ uint32_t Quant::transformNxN(TComDataCU*
         {
             /* denoise is not applied to intra residual, so DST can be ignored */
             int cat = sizeIdx + 4 * !isLuma;
-            denoiseDct(m_resiDctCoeff, m_nr->residualSum[cat], m_nr->offsetDenoise[cat], trSize << 1);
+            int numCoeff = 1 << log2TrSize * 2;
+            denoiseDct(m_resiDctCoeff, m_nr->residualSum[cat], m_nr->offsetDenoise[cat], numCoeff);
             m_nr->count[cat]++;
         }
     }
@@ -398,7 +399,6 @@ uint32_t Quant::transformNxN(TComDataCU*
         int per = m_qpParam[ttype].per;
         int32_t *quantCoeff = m_scalingList->m_quantCoef[log2TrSize - 2][scalingListType][rem];
 
-        int transformShift = MAX_TR_DYNAMIC_RANGE - X265_DEPTH - log2TrSize; // Represents scaling through forward transform
         int qbits = QUANT_SHIFT + per + transformShift;
         int add = (cu->m_slice->m_sliceType == I_SLICE ? 171 : 85) << (qbits - 9);
         int numCoeff = 1 << log2TrSize * 2;
@@ -451,16 +451,14 @@ void Quant::invtransformNxN(bool transQu
         int trSize = 1 << log2TrSize;
         shift = transformShift;
 
-        if (shift > 0)
+#if X265_DEPTH <= 10
+        primitives.cvt32to16_shr(residual, m_resiDctCoeff, stride, shift, trSize);
+#else
+        if (shift >= 0)
             primitives.cvt32to16_shr(residual, m_resiDctCoeff, stride, shift, trSize);
         else
-        {
-            // The case when X265_DEPTH >= 13
-            shift = -shift;
-            for (int j = 0; j < trSize; j++)
-                for (int k = 0; k < trSize; k++)
-                    residual[j * stride + k] = (int16_t)m_resiDctCoeff[j * trSize + k] << shift;
-        }
+            primitives.cvt32to16_shl[log2TrSize - 2](residual, m_resiDctCoeff, stride, -shift);
+#endif
     }
     else
     {
diff -r 8e45fc7c5521 -r 091a63164c41 source/common/slice.h
--- a/source/common/slice.h	Thu Aug 07 19:49:42 2014 +0530
+++ b/source/common/slice.h	Thu Aug 07 18:18:11 2014 -0500
@@ -234,9 +234,6 @@ struct PPS
     bool     bEntropyCodingSyncEnabled; // use param
     bool     bSignHideEnabled;          // use param
 
-    bool     bCabacInitPresent;
-    uint32_t encCABACTableIdx;          // Used to transmit table selection across slices
-
     bool     bDeblockingFilterControlPresent;
     bool     bPicDisableDeblockingFilter;
     int      deblockingFilterBetaOffsetDiv2;
diff -r 8e45fc7c5521 -r 091a63164c41 source/common/x86/asm-primitives.cpp
--- a/source/common/x86/asm-primitives.cpp	Thu Aug 07 19:49:42 2014 +0530
+++ b/source/common/x86/asm-primitives.cpp	Thu Aug 07 18:18:11 2014 -0500
@@ -1230,6 +1230,35 @@ extern "C" {
 
 namespace x265 {
 // private x265 namespace
+
+#if HIGH_BIT_DEPTH
+extern unsigned char IntraFilterType[][35];
+
+/* Very similar to CRef in intrapred.cpp, except it uses optimized primitives */
+template<int log2Size>
+void intra_allangs(pixel *dest, pixel *above0, pixel *left0, pixel *above1, pixel *left1, int bLuma)
+{
+    const int size = 1 << log2Size;
+    const int sizeIdx = log2Size - 2;
+    ALIGN_VAR_32(pixel, buffer[32 * 32]);
+
+    for (int mode = 2; mode <= 34; mode++)
+    {
+        pixel *left = (IntraFilterType[sizeIdx][mode] ? left1 : left0);
+        pixel *above = (IntraFilterType[sizeIdx][mode] ? above1 : above0);
+        pixel *out = dest + ((mode - 2) << (log2Size * 2));
+
+        if (mode < 18)
+        {
+            primitives.intra_pred[sizeIdx][mode](buffer, size, left, above, mode, bLuma);
+            primitives.transpose[sizeIdx](out, buffer, size);
+        }
+        else
+            primitives.intra_pred[sizeIdx][mode](out, size, left, above, mode, bLuma);
+    }
+}
+#endif
+
 void Setup_Assembly_Primitives(EncoderPrimitives &p, int cpuMask)
 {
 #if HIGH_BIT_DEPTH
@@ -1434,6 +1463,14 @@ void Setup_Assembly_Primitives(EncoderPr
         p.chroma[X265_CSP_I422].copy_pp[i] = (copy_pp_t)p.chroma[X265_CSP_I422].copy_ss[i];
     }
 
+    if (p.intra_pred[0][0] && p.transpose[0])
+    {
+        p.intra_pred_allangs[BLOCK_4x4] = intra_allangs<2>;
+        p.intra_pred_allangs[BLOCK_8x8] = intra_allangs<3>;
+        p.intra_pred_allangs[BLOCK_16x16] = intra_allangs<4>;
+        p.intra_pred_allangs[BLOCK_32x32] = intra_allangs<5>;
+    }
+
 #else // if HIGH_BIT_DEPTH
     if (cpuMask & X265_CPU_SSE2)
     {
@@ -1494,6 +1531,10 @@ void Setup_Assembly_Primitives(EncoderPr
         SA8D_INTER_FROM_BLOCK(sse2);
 
         p.cvt32to16_shr = x265_cvt32to16_shr_sse2;
+        p.cvt32to16_shl[BLOCK_4x4] = x265_cvt32to16_shl_4_sse2;
+        p.cvt32to16_shl[BLOCK_8x8] = x265_cvt32to16_shl_8_sse2;
+        p.cvt32to16_shl[BLOCK_16x16] = x265_cvt32to16_shl_16_sse2;
+        p.cvt32to16_shl[BLOCK_32x32] = x265_cvt32to16_shl_32_sse2;
         p.calcrecon[BLOCK_4x4] = x265_calcRecons4_sse2;
         p.calcrecon[BLOCK_8x8] = x265_calcRecons8_sse2;
         p.calcresidual[BLOCK_4x4] = x265_getResidual4_sse2;
@@ -1553,6 +1594,10 @@ void Setup_Assembly_Primitives(EncoderPr
         CHROMA_ADDAVG(_sse4);
         CHROMA_ADDAVG_422(_sse4);
         p.cvt16to32_shl = x265_cvt16to32_shl_sse4;
+        p.cvt16to32_shr[BLOCK_4x4] = x265_cvt16to32_shr_4_sse4;
+        p.cvt16to32_shr[BLOCK_8x8] = x265_cvt16to32_shr_8_sse4;
+        p.cvt16to32_shr[BLOCK_16x16] = x265_cvt16to32_shr_16_sse4;
+        p.cvt16to32_shr[BLOCK_32x32] = x265_cvt16to32_shr_32_sse4;
 
         // TODO: check POPCNT flag!
         p.cvt16to32_cnt[BLOCK_4x4] = x265_cvt16to32_cnt_4_sse4;
diff -r 8e45fc7c5521 -r 091a63164c41 source/common/x86/blockcopy8.asm
--- a/source/common/x86/blockcopy8.asm	Thu Aug 07 19:49:42 2014 +0530
+++ b/source/common/x86/blockcopy8.asm	Thu Aug 07 18:18:11 2014 -0500
@@ -3394,6 +3394,392 @@ cglobal cvt16to32_shl, 5, 7, 2, dst, src
 
 
 ;--------------------------------------------------------------------------------------
+; void cvt16to32_shr(int32_t *dst, int16_t *src, intptr_t stride, int shift, int offset);
+;--------------------------------------------------------------------------------------
+INIT_XMM sse4
+cglobal cvt16to32_shr_4, 3,3,3
+    add             r2d, r2d
+    movd            m0, r3m
+    movd            m1, r4m
+    pshufd          m1, m1, 0
+
+    ; register alloc
+    ; r0 - dst
+    ; r1 - src
+    ; r2 - stride
+    ; m0 - shift
+    ; m1 - dword [offset]
+
+    ; Row 0
+    pmovsxwd        m2, [r1]
+    paddd           m2, m1
+    psrad           m2, m0
+    movu            [r0 + 0 * mmsize], m2
+
+    ; Row 1
+    pmovsxwd        m2, [r1 + r2]
+    paddd           m2, m1
+    psrad           m2, m0
+    movu            [r0 + 1 * mmsize], m2