[x265-commits] [x265] quant: avoid runtime check of transform shift size
Steve Borho
steve at borho.org
Fri Aug 8 07:43:34 CEST 2014
details: http://hg.videolan.org/x265/rev/8e68a1db7c04
branches:
changeset: 7736:8e68a1db7c04
user: Steve Borho <steve at borho.org>
date: Thu Aug 07 21:12:20 2014 -0500
description:
quant: avoid runtime check of transform shift size
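The arithmetic behind this change can be sketched as follows. The sketch assumes MAX_TR_DYNAMIC_RANGE is 15 and that log2TrSize ranges from 2 (4x4) to 5 (32x32); neither value is stated in this digest, and the helper names are hypothetical. Under those assumptions the shift cannot be negative for bit depths of 10 or less, which is why the sign test is only needed in deeper builds.

```cpp
#include <cassert>

// Assumption: the encoder's MAX_TR_DYNAMIC_RANGE is 15.
constexpr int kMaxTrDynamicRange = 15;

// Hypothetical helper mirroring the transformShift expression in quant.cpp.
constexpr int transformShiftFor(int bitDepth, int log2TrSize)
{
    return kMaxTrDynamicRange - bitDepth - log2TrSize;
}

// The smallest shift occurs at the largest transform, assumed here to be 32x32.
constexpr int minTransformShift(int bitDepth)
{
    return transformShiftFor(bitDepth, 5);
}

// For 8-bit and 10-bit builds the shift is never negative, so no runtime test is needed.
static_assert(minTransformShift(8) == 2, "8-bit builds always shift left");
static_assert(minTransformShift(10) == 0, "10-bit builds never need a right shift");
// Deeper builds can go negative and keep the runtime branch.
static_assert(minTransformShift(12) < 0, "12-bit builds may need a right shift");
```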
Subject: [x265] frameencoder: avoid redundant calls to resetEntropy()
details: http://hg.videolan.org/x265/rev/83880abea807
branches:
changeset: 7737:83880abea807
user: Steve Borho <steve at borho.org>
date: Thu Aug 07 22:24:47 2014 -0500
description:
frameencoder: avoid redundant calls to resetEntropy()
All of the entropy coders need to be reset to the same state at the start of
the frame's analysis. There is no point in re-calculating this initial state
repeatedly for each row.
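The idea can be illustrated with a minimal sketch: derive the initial state once, then give each row a plain copy. All names here (EntropyState, computeInitialState, initRows) are hypothetical stand-ins and not the encoder's actual API, and the 160-byte state size is an assumption.

```cpp
#include <array>
#include <cassert>
#include <stdint.h>
#include <vector>

// Hypothetical stand-in for an entropy coder's context state.
struct EntropyState
{
    std::array<uint8_t, 160> contexts;
};

// Counts how often the expensive initialization runs.
static int g_initCalls = 0;

// Stand-in for the costly derivation of the frame's initial state.
EntropyState computeInitialState(int qp)
{
    ++g_initCalls;
    EntropyState s;
    s.contexts.fill(static_cast<uint8_t>(qp & 0x7f));
    return s;
}

// Derive the state once, then copy it into every row's coder.
std::vector<EntropyState> initRows(int numRows, int qp)
{
    const EntropyState initial = computeInitialState(qp);
    return std::vector<EntropyState>(numRows, initial);
}
```

However many rows the frame has, the costly function runs once; the per-row cost is a copy.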
Subject: [x265] frameencoder: nit
details: http://hg.videolan.org/x265/rev/b89417dfa782
branches:
changeset: 7738:b89417dfa782
user: Steve Borho <steve at borho.org>
date: Thu Aug 07 22:24:54 2014 -0500
description:
frameencoder: nit
Subject: [x265] entropy: disable signaling of CABAC init state
details: http://hg.videolan.org/x265/rev/3fdb78507aea
branches:
changeset: 7739:3fdb78507aea
user: Steve Borho <steve at borho.org>
date: Thu Aug 07 19:58:27 2014 -0500
description:
entropy: disable signaling of CABAC init state
This flag, which was already disabled when frame parallelism is in use (which
is nearly always), was of limited utility. It did not improve compression
efficiency by any measurable amount, and it was expensive to compute. But the
quality that made it expendable was that it was the only user of the bBinsCoded
flag in the ContextModel, forcing us to copy twice as much data every time we
copy a context.
With this feature removed, the context model can be reduced to a single uint8_t
state variable.
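A rough sketch of the saving, using the old two-byte ContextModel layout from ContextTables.h. The context count of 160 and the names kNumContexts, kOldCopyBytes, kNewCopyBytes and copyContexts are assumptions made for illustration only.

```cpp
#include <cassert>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

// Old layout, as removed from ContextTables.h in this series.
struct ContextModel
{
    uint8_t state;
    uint8_t bBinsCoded;
};

// Hypothetical context count, chosen only for illustration.
const size_t kNumContexts = 160;

// Bytes moved per full context copy, before and after the change.
const size_t kOldCopyBytes = kNumContexts * sizeof(ContextModel);
const size_t kNewCopyBytes = kNumContexts * sizeof(uint8_t);

// With one byte per context, a context set is a flat byte array
// and a copy is a single memcpy of half the former size.
void copyContexts(uint8_t *dst, const uint8_t *src)
{
    memcpy(dst, src, kNewCopyBytes);
}
```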
Subject: [x265] entropy: remove bBinsCoded from ContextModel (no more users)
details: http://hg.videolan.org/x265/rev/4297617da24c
branches:
changeset: 7740:4297617da24c
user: Steve Borho <steve at borho.org>
date: Thu Aug 07 19:59:07 2014 -0500
description:
entropy: remove bBinsCoded from ContextModel (no more users)
Subject: [x265] entropy: remove ContextModel structure, use uint8_t directly
details: http://hg.videolan.org/x265/rev/04567c40dae5
branches:
changeset: 7741:04567c40dae5
user: Steve Borho <steve at borho.org>
date: Thu Aug 07 20:20:00 2014 -0500
description:
entropy: remove ContextModel structure, use uint8_t directly
Subject: [x265] entropy: pad size of context array to 32 * 5 bytes
details: http://hg.videolan.org/x265/rev/f6e38749049c
branches:
changeset: 7742:f6e38749049c
user: Steve Borho <steve at borho.org>
date: Thu Aug 07 21:24:15 2014 -0500
description:
entropy: pad size of context array to 32 * 5 bytes
Subject: [x265] entropy: remove implicit memset from constructor
details: http://hg.videolan.org/x265/rev/49b593197330
branches:
changeset: 7743:49b593197330
user: Steve Borho <steve at borho.org>
date: Thu Aug 07 22:42:47 2014 -0500
description:
entropy: remove implicit memset from constructor
Before we do further refactors, we want Entropy instances allocated on the
stack to not perform any needless initialization work.
Subject: [x265] main10: create a hybrid all-angs primitive for 16bpp compiles
details: http://hg.videolan.org/x265/rev/33702c567e50
branches:
changeset: 7744:33702c567e50
user: Steve Borho <steve at borho.org>
date: Thu Aug 07 23:44:13 2014 -0500
description:
main10: create a hybrid all-angs primitive for 16bpp compiles
The all-angs primitive is highly optimized assembly code that avoids a lot of
redundant work. The all-angs C ref is horribly slow, doing redundant work to
mimic the output of the all-angs assembly code. Since we have no high bit depth
assembly for these functions, we'll use a shim C function that works very
similarly to the C ref but at least uses optimized primitives.
intra_allangs4x4 3.64x 6619.54 24097.30
intra_allangs8x8 5.66x 13722.49 77694.97
intra_allangs32x32 4.57x 246943.81 1129159.50
before:
encoded 1253 frames in 104.37s (12.01 fps), 366.08 kb/s, SSIM Mean Y: 0.9889624 (19.571 dB)
after:
encoded 1253 frames in 95.62s (13.10 fps), 366.08 kb/s, SSIM Mean Y: 0.9889624 (19.571 dB)
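The output layout used by the shim can be sketched as below. The offset expression and the mode < 18 transpose rule follow the intra_allangs template in this changeset's diff; the helper names are hypothetical.

```cpp
#include <cassert>
#include <stddef.h>

// Offset (in pixels) of one mode's block inside the all-angs output buffer.
// Modes 2..34 are stored back to back, each block holding size * size pixels.
constexpr size_t allAngsOffset(int mode, int log2Size)
{
    return static_cast<size_t>(mode - 2) << (log2Size * 2);
}

// In the shim, modes below 18 are predicted into a scratch buffer
// and then transposed into place; the rest are written directly.
constexpr bool needsTranspose(int mode)
{
    return mode < 18;
}

// Total pixels needed for all 33 angular modes of one block size.
constexpr size_t allAngsBufferSize(int log2Size)
{
    return static_cast<size_t>(33) << (log2Size * 2);
}
```

For a 4x4 block (log2Size = 2), mode 3 starts 16 pixels into the buffer; for 32x32, mode 34 starts at pixel 32768.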
Subject: [x265] denoise: fix numCoeff (bug from 42b1d7c17510)
details: http://hg.videolan.org/x265/rev/ef2602935c59
branches:
changeset: 7745:ef2602935c59
user: Satoshi Nakagawa <nakagawa424 at oki.com>
date: Fri Aug 08 12:57:25 2014 +0900
description:
denoise: fix numCoeff (bug from 42b1d7c17510)
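A short worked comparison of the two expressions involved, taken from the quant.cpp hunk in this digest. The function names are hypothetical. Note that in `1 << log2TrSize * 2` the multiplication binds tighter than the shift, so the expression equals `1 << (log2TrSize * 2)`.

```cpp
#include <cassert>

// Count passed before the fix: trSize << 1, i.e. twice the block width.
constexpr int oldCoeffCount(int log2TrSize)
{
    return (1 << log2TrSize) << 1;
}

// Count passed after the fix: trSize * trSize, one entry per coefficient.
constexpr int numCoeffFor(int log2TrSize)
{
    return 1 << (log2TrSize * 2);
}

// An 8x8 block has 64 coefficients, but the old expression yielded only 16.
static_assert(oldCoeffCount(3) == 16, "old expression for 8x8");
static_assert(numCoeffFor(3) == 64, "fixed expression for 8x8");
```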
Subject: [x265] asm: cvt16to32_shr[*] for TSkip
details: http://hg.videolan.org/x265/rev/8cd2e8c9a3ba
branches:
changeset: 7746:8cd2e8c9a3ba
user: Min Chen <chenm003 at 163.com>
date: Thu Aug 07 18:18:11 2014 -0500
description:
asm: cvt16to32_shr[*] for TSkip
Subject: [x265] asm: cvt32to16_shl[*] for TSkip
details: http://hg.videolan.org/x265/rev/091a63164c41
branches:
changeset: 7747:091a63164c41
user: Min Chen <chenm003 at 163.com>
date: Thu Aug 07 18:18:11 2014 -0500
description:
asm: cvt32to16_shl[*] for TSkip
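As a usage sketch, the two conversions can be exercised in isolation. The templates below restate the C reference behavior from the pixel.cpp hunk in this digest, written with explicit indexing and const source pointers for clarity; they are a standalone approximation, not the encoder's own code.

```cpp
#include <cassert>
#include <stdint.h>

// 16-bit residual -> 32-bit coefficients, right shift with a rounding offset.
// Source rows are `stride` elements apart; the destination is packed size x size.
template<int size>
void convert16to32_shr(int32_t *dst, const int16_t *src, intptr_t stride, int shift, int offset)
{
    for (int i = 0; i < size; i++)
        for (int j = 0; j < size; j++)
            dst[i * size + j] = ((int)src[i * stride + j] + offset) >> shift;
}

// 32-bit coefficients -> 16-bit residual, left shift.
// The source is packed size x size; destination rows are `stride` elements apart.
template<int size>
void convert32to16_shl(int16_t *dst, const int32_t *src, intptr_t stride, int shift)
{
    for (int i = 0; i < size; i++)
        for (int j = 0; j < size; j++)
            dst[i * stride + j] = (int16_t)((int16_t)src[i * size + j] << shift);
}
```

With shift = 1 the rounding offset is 1 << (shift - 1) = 1, so a residual of 5 becomes (5 + 1) >> 1 = 3 and converts back to 3 << 1 = 6.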
diffstat:
source/Lib/TLibCommon/ContextTables.h | 6 -
source/common/pixel.cpp | 35 +++
source/common/primitives.h | 4 +
source/common/quant.cpp | 36 +-
source/common/slice.h | 3 -
source/common/x86/asm-primitives.cpp | 45 +++
source/common/x86/blockcopy8.asm | 386 ++++++++++++++++++++++++++++++++++
source/common/x86/blockcopy8.h | 8 +
source/encoder/encoder.cpp | 3 -
source/encoder/entropy.cpp | 347 +++++++++--------------------
source/encoder/entropy.h | 9 +-
source/encoder/frameencoder.cpp | 20 +-
source/test/pixelharness.cpp | 88 +++++++-
source/test/pixelharness.h | 2 +
source/test/testbench.cpp | 6 +
15 files changed, 711 insertions(+), 287 deletions(-)
diffs (truncated from 1704 to 300 lines):
diff -r 8e45fc7c5521 -r 091a63164c41 source/Lib/TLibCommon/ContextTables.h
--- a/source/Lib/TLibCommon/ContextTables.h Thu Aug 07 19:49:42 2014 +0530
+++ b/source/Lib/TLibCommon/ContextTables.h Thu Aug 07 18:18:11 2014 -0500
@@ -127,12 +127,6 @@
namespace x265 {
// private namespace
-struct ContextModel
-{
- uint8_t state;
- uint8_t bBinsCoded;
-};
-
extern const uint32_t g_entropyBits[128];
extern const uint8_t g_nextState[128][2];
diff -r 8e45fc7c5521 -r 091a63164c41 source/common/pixel.cpp
--- a/source/common/pixel.cpp Thu Aug 07 19:49:42 2014 +0530
+++ b/source/common/pixel.cpp Thu Aug 07 18:18:11 2014 -0500
@@ -442,6 +442,18 @@ void convert16to32_shl(int32_t *dst, int
}
}
+template<int size>
+void convert16to32_shr(int32_t *dst, int16_t *src, intptr_t stride, int shift, int offset)
+{
+ for (int i = 0; i < size; i++)
+ {
+ for (int j = 0; j < size; j++)
+ {
+ dst[i * size + j] = ((int)src[i * stride + j] + offset) >> shift;
+ }
+ }
+}
+
void convert32to16_shr(int16_t *dst, int32_t *src, intptr_t stride, int shift, int size)
{
int round = 1 << (shift - 1);
@@ -458,6 +470,21 @@ void convert32to16_shr(int16_t *dst, int
}
}
+template<int size>
+void convert32to16_shl(int16_t *dst, int32_t *src, intptr_t stride, int shift)
+{
+ for (int i = 0; i < size; i++)
+ {
+ for (int j = 0; j < size; j++)
+ {
+ dst[j] = ((int16_t)src[j] << shift);
+ }
+
+ src += size;
+ dst += stride;
+ }
+}
+
template<int blockSize>
void getResidual(pixel *fenc, pixel *pred, int16_t *residual, intptr_t stride)
{
@@ -1176,7 +1203,15 @@ void Setup_C_PixelPrimitives(EncoderPrim
p.blockfill_s[BLOCK_64x64] = blockfil_s_c<64>;
p.cvt16to32_shl = convert16to32_shl;
+ p.cvt16to32_shr[BLOCK_4x4] = convert16to32_shr<4>;
+ p.cvt16to32_shr[BLOCK_8x8] = convert16to32_shr<8>;
+ p.cvt16to32_shr[BLOCK_16x16] = convert16to32_shr<16>;
+ p.cvt16to32_shr[BLOCK_32x32] = convert16to32_shr<32>;
p.cvt32to16_shr = convert32to16_shr;
+ p.cvt32to16_shl[BLOCK_4x4] = convert32to16_shl<4>;
+ p.cvt32to16_shl[BLOCK_8x8] = convert32to16_shl<8>;
+ p.cvt32to16_shl[BLOCK_16x16] = convert32to16_shl<16>;
+ p.cvt32to16_shl[BLOCK_32x32] = convert32to16_shl<32>;
p.sa8d[BLOCK_4x4] = satd_4x4;
p.sa8d[BLOCK_8x8] = sa8d_8x8;
diff -r 8e45fc7c5521 -r 091a63164c41 source/common/primitives.h
--- a/source/common/primitives.h Thu Aug 07 19:49:42 2014 +0530
+++ b/source/common/primitives.h Thu Aug 07 18:18:11 2014 -0500
@@ -149,7 +149,9 @@ typedef void (*intra_pred_t)(pixel* dst,
typedef void (*intra_allangs_t)(pixel *dst, pixel *above0, pixel *left0, pixel *above1, pixel *left1, int bLuma);
typedef void (*cvt16to32_shl_t)(int32_t *dst, int16_t *src, intptr_t, int, int);
+typedef void (*cvt16to32_shr_t)(int32_t *dst, int16_t *src, intptr_t, int, int);
typedef void (*cvt32to16_shr_t)(int16_t *dst, int32_t *src, intptr_t, int, int);
+typedef void (*cvt32to16_shl_t)(int16_t *dst, int32_t *src, intptr_t, int);
typedef uint32_t (*cvt16to32_cnt_t)(coeff_t* coeff, int16_t* residual, intptr_t stride);
typedef void (*dct_t)(int16_t *src, int32_t *dst, intptr_t stride);
@@ -218,7 +220,9 @@ struct EncoderPrimitives
blockcpy_pp_t blockcpy_pp; // block copy pixel from pixel
blockcpy_ps_t blockcpy_ps; // block copy pixel from short
cvt16to32_shl_t cvt16to32_shl;
+ cvt16to32_shr_t cvt16to32_shr[NUM_SQUARE_BLOCKS - 1];
cvt32to16_shr_t cvt32to16_shr;
+ cvt32to16_shl_t cvt32to16_shl[NUM_SQUARE_BLOCKS - 1];
cvt16to32_cnt_t cvt16to32_cnt[NUM_SQUARE_BLOCKS - 1];
copy_pp_t luma_copy_pp[NUM_LUMA_PARTITIONS];
diff -r 8e45fc7c5521 -r 091a63164c41 source/common/quant.cpp
--- a/source/common/quant.cpp Thu Aug 07 19:49:42 2014 +0530
+++ b/source/common/quant.cpp Thu Aug 07 18:18:11 2014 -0500
@@ -342,24 +342,24 @@ uint32_t Quant::transformNxN(TComDataCU*
bool isLuma = ttype == TEXT_LUMA;
bool usePsy = m_psyRdoqScale && isLuma && !useTransformSkip;
bool isIntra = cu->getPredictionMode(absPartIdx) == MODE_INTRA;
+ int transformShift = MAX_TR_DYNAMIC_RANGE - X265_DEPTH - log2TrSize; // Represents scaling through forward transform
int trSize = 1 << log2TrSize;
X265_CHECK((cu->m_slice->m_sps->quadtreeTULog2MaxSize >= log2TrSize), "transform size too large\n");
if (useTransformSkip)
{
- int shift = MAX_TR_DYNAMIC_RANGE - X265_DEPTH - log2TrSize;
-
- if (shift >= 0)
- primitives.cvt16to32_shl(m_resiDctCoeff, residual, stride, shift, trSize);
+#if X265_DEPTH <= 10
+ primitives.cvt16to32_shl(m_resiDctCoeff, residual, stride, transformShift, trSize);
+#else
+ if (transformShift >= 0)
+ primitives.cvt16to32_shl(m_resiDctCoeff, residual, stride, transformShift, trSize);
else
{
- /* X265_DEPTH > 13 */
- shift = -shift;
+ int shift = -transformShift;
int offset = (1 << (shift - 1));
- for (int j = 0; j < trSize; j++)
- for (int k = 0; k < trSize; k++)
- m_resiDctCoeff[j * trSize + k] = (residual[j * stride + k] + offset) >> shift;
+ primitives.cvt16to32_shr[log2TrSize - 2](m_resiDctCoeff, residual, stride, shift, offset);
}
+#endif
}
else
{
@@ -382,7 +382,8 @@ uint32_t Quant::transformNxN(TComDataCU*
{
/* denoise is not applied to intra residual, so DST can be ignored */
int cat = sizeIdx + 4 * !isLuma;
- denoiseDct(m_resiDctCoeff, m_nr->residualSum[cat], m_nr->offsetDenoise[cat], trSize << 1);
+ int numCoeff = 1 << log2TrSize * 2;
+ denoiseDct(m_resiDctCoeff, m_nr->residualSum[cat], m_nr->offsetDenoise[cat], numCoeff);
m_nr->count[cat]++;
}
}
@@ -398,7 +399,6 @@ uint32_t Quant::transformNxN(TComDataCU*
int per = m_qpParam[ttype].per;
int32_t *quantCoeff = m_scalingList->m_quantCoef[log2TrSize - 2][scalingListType][rem];
- int transformShift = MAX_TR_DYNAMIC_RANGE - X265_DEPTH - log2TrSize; // Represents scaling through forward transform
int qbits = QUANT_SHIFT + per + transformShift;
int add = (cu->m_slice->m_sliceType == I_SLICE ? 171 : 85) << (qbits - 9);
int numCoeff = 1 << log2TrSize * 2;
@@ -451,16 +451,14 @@ void Quant::invtransformNxN(bool transQu
int trSize = 1 << log2TrSize;
shift = transformShift;
- if (shift > 0)
+#if X265_DEPTH <= 10
+ primitives.cvt32to16_shr(residual, m_resiDctCoeff, stride, shift, trSize);
+#else
+ if (shift >= 0)
primitives.cvt32to16_shr(residual, m_resiDctCoeff, stride, shift, trSize);
else
- {
- // The case when X265_DEPTH >= 13
- shift = -shift;
- for (int j = 0; j < trSize; j++)
- for (int k = 0; k < trSize; k++)
- residual[j * stride + k] = (int16_t)m_resiDctCoeff[j * trSize + k] << shift;
- }
+ primitives.cvt32to16_shl[log2TrSize - 2](residual, m_resiDctCoeff, stride, -shift);
+#endif
}
else
{
diff -r 8e45fc7c5521 -r 091a63164c41 source/common/slice.h
--- a/source/common/slice.h Thu Aug 07 19:49:42 2014 +0530
+++ b/source/common/slice.h Thu Aug 07 18:18:11 2014 -0500
@@ -234,9 +234,6 @@ struct PPS
bool bEntropyCodingSyncEnabled; // use param
bool bSignHideEnabled; // use param
- bool bCabacInitPresent;
- uint32_t encCABACTableIdx; // Used to transmit table selection across slices
-
bool bDeblockingFilterControlPresent;
bool bPicDisableDeblockingFilter;
int deblockingFilterBetaOffsetDiv2;
diff -r 8e45fc7c5521 -r 091a63164c41 source/common/x86/asm-primitives.cpp
--- a/source/common/x86/asm-primitives.cpp Thu Aug 07 19:49:42 2014 +0530
+++ b/source/common/x86/asm-primitives.cpp Thu Aug 07 18:18:11 2014 -0500
@@ -1230,6 +1230,35 @@ extern "C" {
namespace x265 {
// private x265 namespace
+
+#if HIGH_BIT_DEPTH
+extern unsigned char IntraFilterType[][35];
+
+/* Very similar to CRef in intrapred.cpp, except it uses optimized primitives */
+template<int log2Size>
+void intra_allangs(pixel *dest, pixel *above0, pixel *left0, pixel *above1, pixel *left1, int bLuma)
+{
+ const int size = 1 << log2Size;
+ const int sizeIdx = log2Size - 2;
+ ALIGN_VAR_32(pixel, buffer[32 * 32]);
+
+ for (int mode = 2; mode <= 34; mode++)
+ {
+ pixel *left = (IntraFilterType[sizeIdx][mode] ? left1 : left0);
+ pixel *above = (IntraFilterType[sizeIdx][mode] ? above1 : above0);
+ pixel *out = dest + ((mode - 2) << (log2Size * 2));
+
+ if (mode < 18)
+ {
+ primitives.intra_pred[sizeIdx][mode](buffer, size, left, above, mode, bLuma);
+ primitives.transpose[sizeIdx](out, buffer, size);
+ }
+ else
+ primitives.intra_pred[sizeIdx][mode](out, size, left, above, mode, bLuma);
+ }
+}
+#endif
+
void Setup_Assembly_Primitives(EncoderPrimitives &p, int cpuMask)
{
#if HIGH_BIT_DEPTH
@@ -1434,6 +1463,14 @@ void Setup_Assembly_Primitives(EncoderPr
p.chroma[X265_CSP_I422].copy_pp[i] = (copy_pp_t)p.chroma[X265_CSP_I422].copy_ss[i];
}
+ if (p.intra_pred[0][0] && p.transpose[0])
+ {
+ p.intra_pred_allangs[BLOCK_4x4] = intra_allangs<2>;
+ p.intra_pred_allangs[BLOCK_8x8] = intra_allangs<3>;
+ p.intra_pred_allangs[BLOCK_16x16] = intra_allangs<4>;
+ p.intra_pred_allangs[BLOCK_32x32] = intra_allangs<5>;
+ }
+
#else // if HIGH_BIT_DEPTH
if (cpuMask & X265_CPU_SSE2)
{
@@ -1494,6 +1531,10 @@ void Setup_Assembly_Primitives(EncoderPr
SA8D_INTER_FROM_BLOCK(sse2);
p.cvt32to16_shr = x265_cvt32to16_shr_sse2;
+ p.cvt32to16_shl[BLOCK_4x4] = x265_cvt32to16_shl_4_sse2;
+ p.cvt32to16_shl[BLOCK_8x8] = x265_cvt32to16_shl_8_sse2;
+ p.cvt32to16_shl[BLOCK_16x16] = x265_cvt32to16_shl_16_sse2;
+ p.cvt32to16_shl[BLOCK_32x32] = x265_cvt32to16_shl_32_sse2;
p.calcrecon[BLOCK_4x4] = x265_calcRecons4_sse2;
p.calcrecon[BLOCK_8x8] = x265_calcRecons8_sse2;
p.calcresidual[BLOCK_4x4] = x265_getResidual4_sse2;
@@ -1553,6 +1594,10 @@ void Setup_Assembly_Primitives(EncoderPr
CHROMA_ADDAVG(_sse4);
CHROMA_ADDAVG_422(_sse4);
p.cvt16to32_shl = x265_cvt16to32_shl_sse4;
+ p.cvt16to32_shr[BLOCK_4x4] = x265_cvt16to32_shr_4_sse4;
+ p.cvt16to32_shr[BLOCK_8x8] = x265_cvt16to32_shr_8_sse4;
+ p.cvt16to32_shr[BLOCK_16x16] = x265_cvt16to32_shr_16_sse4;
+ p.cvt16to32_shr[BLOCK_32x32] = x265_cvt16to32_shr_32_sse4;
// TODO: check POPCNT flag!
p.cvt16to32_cnt[BLOCK_4x4] = x265_cvt16to32_cnt_4_sse4;
diff -r 8e45fc7c5521 -r 091a63164c41 source/common/x86/blockcopy8.asm
--- a/source/common/x86/blockcopy8.asm Thu Aug 07 19:49:42 2014 +0530
+++ b/source/common/x86/blockcopy8.asm Thu Aug 07 18:18:11 2014 -0500
@@ -3394,6 +3394,392 @@ cglobal cvt16to32_shl, 5, 7, 2, dst, src
;--------------------------------------------------------------------------------------
+; void cvt16to32_shr(int32_t *dst, int16_t *src, intptr_t stride, int shift, int offset);
+;--------------------------------------------------------------------------------------
+INIT_XMM sse4
+cglobal cvt16to32_shr_4, 3,3,3
+ add r2d, r2d
+ movd m0, r3m
+ movd m1, r4m
+ pshufd m1, m1, 0
+
+ ; register alloc
+ ; r0 - dst
+ ; r1 - src
+ ; r2 - stride
+ ; m0 - shift
+ ; m1 - dword [offset]
+
+ ; Row 0
+ pmovsxwd m2, [r1]
+ paddd m2, m1
+ psrad m2, m0
+ movu [r0 + 0 * mmsize], m2
+
+ ; Row 1
+ pmovsxwd m2, [r1 + r2]
+ paddd m2, m1
+ psrad m2, m0
+ movu [r0 + 1 * mmsize], m2