[x265-commits] [x265] AQ: Re-enable fine grained adaptive quantization

Deepthi Nandakumar deepthi at multicorewareinc.com
Tue Apr 21 19:59:42 CEST 2015


details:   http://hg.videolan.org/x265/rev/3ebf02051ca0
branches:  
changeset: 10225:3ebf02051ca0
user:      Deepthi Nandakumar <deepthi at multicorewareinc.com>
date:      Sat Apr 18 11:16:08 2015 +0530
description:
AQ: Re-enable fine grained adaptive quantization
Subject: [x265] entropy: after encodeCU, the CU structures need to be reset with the right QP

details:   http://hg.videolan.org/x265/rev/1bce9910c734
branches:  
changeset: 10226:1bce9910c734
user:      Deepthi Nandakumar <deepthi at multicorewareinc.com>
date:      Sat Apr 18 10:56:21 2015 +0530
description:
entropy: after encodeCU, the CU structures need to be reset with the right QP

This is used by the deblocking filter to index into the beta and tc tables.
Without deblocking, no one cares about the QP structures of the CU after encode,
so it wouldn't cause any problems.
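
To illustrate why a stale QP matters to deblocking, here is a minimal sketch of
how the beta and tc table indices are derived from the QPs on either side of an
edge. The clipping ranges and the averaging follow the HEVC specification; the
struct and function names are hypothetical and are not taken from x265.

```cpp
#include <algorithm>

// Hypothetical container for the two table indices (not an x265 type).
struct DeblockIndices
{
    int betaIdx; // index into the 52-entry beta table
    int tcIdx;   // index into the 54-entry tc table
};

inline int clip3(int lo, int hi, int v)
{
    return std::min(hi, std::max(lo, v));
}

// qpP, qpQ:        QPs stored for the CUs on each side of the edge
// bs:              boundary strength (1 or 2)
// betaOffsetDiv2,
// tcOffsetDiv2:    slice-level deblocking offsets
inline DeblockIndices deblockTableIndices(int qpP, int qpQ, int bs,
                                          int betaOffsetDiv2, int tcOffsetDiv2)
{
    // Average of the two neighbouring QPs, rounded up.
    const int qp = (qpP + qpQ + 1) >> 1;

    DeblockIndices r;
    r.betaIdx = clip3(0, 51, qp + betaOffsetDiv2 * 2);
    r.tcIdx   = clip3(0, 53, qp + 2 * (bs - 1) + tcOffsetDiv2 * 2);
    return r;
}
```

If the QP left in the CU structure after encodeCU differs from the QP actually
coded, both indices shift, so the filter strength no longer matches what a
decoder would derive.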
Subject: [x265] search: add RDcost measurement of DeltaQP to lower rdLevels

details:   http://hg.videolan.org/x265/rev/5521aa355df8
branches:  
changeset: 10227:5521aa355df8
user:      Deepthi Nandakumar <deepthi at multicorewareinc.com>
date:      Wed Apr 08 11:52:50 2015 +0530
description:
search: add RDcost measurement of DeltaQP to lower rdLevels

This is a follow-on patch to 75d6c2588e34, which changed outputs at rdLevels > 4.
This patch may change outputs at lower rdLevels as well.
Subject: [x265] encoder: ignore param->rc.qgSize when delta-qp coding is disabled

details:   http://hg.videolan.org/x265/rev/b2b8fcabc1b1
branches:  
changeset: 10228:b2b8fcabc1b1
user:      Steve Borho <steve at borho.org>
date:      Thu Apr 09 11:20:57 2015 -0500
description:
encoder: ignore param->rc.qgSize when delta-qp coding is disabled

This was causing maxCuDQPDepth to be negative, triggering check failures in cases like:
BasketballDrive_1920x1080_50.y4m --preset superfast --psy-rd 1 --ctu 16 --no-wpp
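
A minimal sketch of how the depth can go negative, assuming it is derived as
log2(CTU size) minus log2(quant group size). That derivation is an assumption,
since this message does not state it, and the helper names and the quant group
size of 64 used below are hypothetical. The clamped variant is only one
possible guard; the commit itself ignores qgSize when delta-QP coding is
disabled.

```cpp
#include <cstdint>

// Hypothetical helper: integer log2 for power-of-two block sizes.
inline int log2Size(uint32_t size)
{
    int n = 0;
    while (size > 1)
    {
        size >>= 1;
        ++n;
    }
    return n;
}

// Assumed derivation: how many levels below the CTU root the quant group sits.
inline int dqpDepth(uint32_t ctuSize, uint32_t qgSize)
{
    return log2Size(ctuSize) - log2Size(qgSize);
}

// Illustrative guard: never let the quant group exceed the CTU size.
inline int dqpDepthClamped(uint32_t ctuSize, uint32_t qgSize)
{
    if (qgSize > ctuSize)
        qgSize = ctuSize;
    return dqpDepth(ctuSize, qgSize);
}
```

With --ctu 16 and a quant group size of 64, dqpDepth(16, 64) evaluates to -2,
while the clamped variant yields 0.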
Subject: [x265] encoder: give param->rc.qgSize a sane default when dqp is not used

details:   http://hg.videolan.org/x265/rev/3b0b020780b6
branches:  
changeset: 10229:3b0b020780b6
user:      Steve Borho <steve at borho.org>
date:      Thu Apr 09 11:31:48 2015 -0500
description:
encoder: give param->rc.qgSize a sane default when dqp is not used

just in case the param is shown or consulted somewhere
Subject: [x265] param: show quant-group size in logs, move AQ config into its own line

details:   http://hg.videolan.org/x265/rev/1d9dd9f39beb
branches:  
changeset: 10230:1d9dd9f39beb
user:      Steve Borho <steve at borho.org>
date:      Thu Apr 09 11:30:33 2015 -0500
description:
param: show quant-group size in logs, move AQ config into its own line
Subject: [x265] tests: add coverage for --qg-size

details:   http://hg.videolan.org/x265/rev/cf7d2e34c476
branches:  
changeset: 10231:cf7d2e34c476
user:      Steve Borho <steve at borho.org>
date:      Thu Apr 09 00:48:54 2015 -0400
description:
tests: add coverage for --qg-size
Subject: [x265] sao: modify saoCuOrgE3_2Rows C code and add sse4 code

details:   http://hg.videolan.org/x265/rev/0e26f9f1428e
branches:  
changeset: 10232:0e26f9f1428e
user:      Divya Manivannan <divya at multicorewareinc.com>
date:      Mon Apr 20 18:54:53 2015 +0530
description:
sao: modify saoCuOrgE3_2Rows C code and add sse4 code

SAO_EO_3_2Rows  9.52x    1042.79         9930.47
Subject: [x265] rdoQuant: improve encoder ~3.5% by modify all zero-coeff group scan and cost compute logic

details:   http://hg.videolan.org/x265/rev/8b0c155a87db
branches:  
changeset: 10233:8b0c155a87db
user:      Min Chen <chenm003 at 163.com>
date:      Mon Apr 20 12:41:16 2015 +0800
description:
rdoQuant: improve encoder ~3.5% by modify all zero-coeff group scan and cost compute logic
Subject: [x265] rdoQuant: fast zero cost compute path

details:   http://hg.videolan.org/x265/rev/c410967ce0a3
branches:  
changeset: 10234:c410967ce0a3
user:      Min Chen <chenm003 at 163.com>
date:      Fri Apr 17 21:31:19 2015 +0800
description:
rdoQuant: fast zero cost compute path
Subject: [x265] rdoQuant: improve coeff group block clean code

details:   http://hg.videolan.org/x265/rev/fa4d90a57ec3
branches:  
changeset: 10235:fa4d90a57ec3
user:      Min Chen <chenm003 at 163.com>
date:      Fri Apr 17 21:31:23 2015 +0800
description:
rdoQuant: improve coeff group block clean code
Subject: [x265] rdoQuant: fast zero-coeff path

details:   http://hg.videolan.org/x265/rev/14c57b28849b
branches:  
changeset: 10236:14c57b28849b
user:      Min Chen <chenm003 at 163.com>
date:      Fri Apr 17 21:31:26 2015 +0800
description:
rdoQuant: fast zero-coeff path
Subject: [x265] rdoQuant: move cgRdStats.sigCost0 outside from loop

details:   http://hg.videolan.org/x265/rev/33781b035903
branches:  
changeset: 10237:33781b035903
user:      Min Chen <chenm003 at 163.com>
date:      Sun Apr 19 16:44:04 2015 +0800
description:
rdoQuant: move cgRdStats.sigCost0 outside from loop
Subject: [x265] asm: ssse3 version of findPosFirstLast, 365c -> 75c

details:   http://hg.videolan.org/x265/rev/72dd8ac33b30
branches:  
changeset: 10238:72dd8ac33b30
user:      Min Chen <chenm003 at 163.com>
date:      Mon Apr 20 19:58:29 2015 +0800
description:
asm: ssse3 version of findPosFirstLast, 365c -> 75c
Subject: [x265] doc: if you break cpu auto-detection, you get to keep both halves

details:   http://hg.videolan.org/x265/rev/6dca493f7f09
branches:  
changeset: 10239:6dca493f7f09
user:      Steve Borho <steve at borho.org>
date:      Mon Apr 20 14:51:02 2015 -0500
description:
doc: if you break cpu auto-detection, you get to keep both halves
Subject: [x265] asm: generic x64 version of findPosLast

details:   http://hg.videolan.org/x265/rev/279b262f4f90
branches:  
changeset: 10240:279b262f4f90
user:      Min Chen <chenm003 at 163.com>
date:      Tue Apr 21 14:35:30 2015 +0800
description:
asm: generic x64 version of findPosLast
Subject: [x265] asm: avx code for chroma satd functions for all partitions of 422

details:   http://hg.videolan.org/x265/rev/21875b26ed04
branches:  
changeset: 10241:21875b26ed04
user:      Sumalatha Polureddy
date:      Tue Apr 21 14:46:26 2015 +0530
description:
asm: avx code for chroma satd functions for all partitions of 422
Subject: [x265] asm: new optimized algorithm for satd, improved ~25% over previous algorithm

details:   http://hg.videolan.org/x265/rev/d1c00f7a2387
branches:  
changeset: 10242:d1c00f7a2387
user:      Dnyaneshwar G <dnyaneshwar at multicorewareinc.com>
date:      Tue Apr 21 14:25:14 2015 +0530
description:
asm: new optimized algorithm for satd, improved ~25% over previous algorithm
Subject: [x265] slicetype: select best mvp using neighbor mvs satd cost for Lowres ME

details:   http://hg.videolan.org/x265/rev/77267c5390df
branches:  
changeset: 10243:77267c5390df
user:      Gopu Govindaswamy <gopu at multicorewareinc.com>
date:      Tue Apr 21 15:05:18 2015 +0530
description:
slicetype: select best mvp using neighbor mvs satd cost for Lowres ME

This patch modifies the lowres mvp selection logic to be more HEVC-like. Rather
than using the H.264 median, it analyzes the satd cost of each neighbor MV
(estimating merge analysis), then uses the least-cost MV as the MVP (estimating AMVP).
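
The selection described above can be sketched as follows. This is an
illustration under stated assumptions, not the code from slicetype.cpp: the MV
struct, the function name, and the cost callback (standing in for a lowres
SATD measurement) are all hypothetical.

```cpp
#include <climits>
#include <functional>
#include <vector>

// Hypothetical motion vector type (not the x265 MV class).
struct MV
{
    int x;
    int y;
};

// Evaluate every neighbor MV with costFn and return the cheapest one.
// costFn is a stand-in for "SATD of the lowres block predicted with this MV".
// If there are no candidates, the zero MV is returned with cost INT_MAX.
inline MV selectLeastCostMvp(const std::vector<MV>& candidates,
                             const std::function<int(const MV&)>& costFn,
                             int* bestCostOut = nullptr)
{
    MV best = { 0, 0 };
    int bestCost = INT_MAX;

    for (const MV& mv : candidates)
    {
        const int cost = costFn(mv);
        if (cost < bestCost)
        {
            bestCost = cost;
            best = mv;
        }
    }

    if (bestCostOut)
        *bestCostOut = bestCost;
    return best;
}
```

Unlike a component-wise median, the result is always one of the actual
neighbor MVs, and its cost is already known when the motion search starts.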

diffstat:

 doc/reST/cli.rst                     |    7 +
 source/common/constants.cpp          |    2 +-
 source/common/cudata.cpp             |    6 +-
 source/common/cudata.h               |    2 +-
 source/common/dct.cpp                |   32 +++
 source/common/loopfilter.cpp         |   16 +-
 source/common/param.cpp              |   16 +-
 source/common/primitives.h           |    4 +-
 source/common/quant.cpp              |  210 ++++++++++++++++----
 source/common/x86/asm-primitives.cpp |   30 ++-
 source/common/x86/loopfilter.asm     |  129 ++++++++++++
 source/common/x86/loopfilter.h       |    1 +
 source/common/x86/pixel-a.asm        |  354 ++++++++++++++++------------------
 source/common/x86/pixel-util.h       |    2 +
 source/common/x86/pixel-util8.asm    |   98 +++++++++-
 source/encoder/analysis.cpp          |  202 +++++++++++++------
 source/encoder/analysis.h            |    9 +-
 source/encoder/encoder.cpp           |    5 +-
 source/encoder/entropy.cpp           |    9 +-
 source/encoder/entropy.h             |    2 +-
 source/encoder/frameencoder.cpp      |    2 -
 source/encoder/sao.cpp               |   23 +-
 source/encoder/search.cpp            |   58 ++++-
 source/encoder/search.h              |    4 +-
 source/encoder/slicetype.cpp         |   46 ++--
 source/test/pixelharness.cpp         |   69 ++++++
 source/test/pixelharness.h           |    1 +
 source/test/regression-tests.txt     |   20 +-
 source/test/smoke-tests.txt          |    6 +-
 29 files changed, 968 insertions(+), 397 deletions(-)

diffs (truncated from 2443 to 300 lines):

diff -r 5c3443546ccc -r 77267c5390df doc/reST/cli.rst
--- a/doc/reST/cli.rst	Sat Apr 18 10:02:19 2015 -0700
+++ b/doc/reST/cli.rst	Tue Apr 21 15:05:18 2015 +0530
@@ -159,6 +159,13 @@ Performance Options
 	handled implicitly.
 
 	One may also directly supply the CPU capability bitmap as an integer.
+	
+	Note that by specifying this option you are overriding x265's CPU
+	detection and it is possible to do this wrong. You can cause encoder
+	crashes by specifying SIMD architectures which are not supported on
+	your CPU.
+
+	Default: auto-detected SIMD architectures
 
 .. option:: --frame-threads, -F <integer>
 
diff -r 5c3443546ccc -r 77267c5390df source/common/constants.cpp
--- a/source/common/constants.cpp	Sat Apr 18 10:02:19 2015 -0700
+++ b/source/common/constants.cpp	Tue Apr 21 15:05:18 2015 +0530
@@ -324,7 +324,7 @@ const uint16_t g_scan8x8[NUM_SCAN_TYPE][
       4,  12, 20, 28,  5, 13, 21, 29,  6, 14, 22, 30,  7, 15, 23, 31, 36, 44, 52, 60, 37, 45, 53, 61, 38, 46, 54, 62, 39, 47, 55, 63 }
 };
 
-const uint16_t g_scan4x4[NUM_SCAN_TYPE][4 * 4] =
+ALIGN_VAR_16(const uint16_t, g_scan4x4[NUM_SCAN_TYPE][4 * 4]) =
 {
     { 0,  4,  1,  8,  5,  2, 12,  9,  6,  3, 13, 10,  7, 14, 11, 15 },
     { 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15 },
diff -r 5c3443546ccc -r 77267c5390df source/common/cudata.cpp
--- a/source/common/cudata.cpp	Sat Apr 18 10:02:19 2015 -0700
+++ b/source/common/cudata.cpp	Tue Apr 21 15:05:18 2015 +0530
@@ -298,7 +298,7 @@ void CUData::initCTU(const Frame& frame,
 }
 
 // initialize Sub partition
-void CUData::initSubCU(const CUData& ctu, const CUGeom& cuGeom)
+void CUData::initSubCU(const CUData& ctu, const CUGeom& cuGeom, int qp)
 {
     m_absIdxInCTU   = cuGeom.absPartIdx;
     m_encData       = ctu.m_encData;
@@ -312,8 +312,8 @@ void CUData::initSubCU(const CUData& ctu
     m_cuAboveRight  = ctu.m_cuAboveRight;
     X265_CHECK(m_numPartitions == cuGeom.numPartitions, "initSubCU() size mismatch\n");
 
-    /* sequential memsets */
-    m_partSet((uint8_t*)m_qp, (uint8_t)ctu.m_qp[0]);
+    m_partSet((uint8_t*)m_qp, (uint8_t)qp);
+
     m_partSet(m_log2CUSize,   (uint8_t)cuGeom.log2CUSize);
     m_partSet(m_lumaIntraDir, (uint8_t)DC_IDX);
     m_partSet(m_tqBypass,     (uint8_t)m_encData->m_param->bLossless);
diff -r 5c3443546ccc -r 77267c5390df source/common/cudata.h
--- a/source/common/cudata.h	Sat Apr 18 10:02:19 2015 -0700
+++ b/source/common/cudata.h	Tue Apr 21 15:05:18 2015 +0530
@@ -182,7 +182,7 @@ public:
     static void calcCTUGeoms(uint32_t ctuWidth, uint32_t ctuHeight, uint32_t maxCUSize, uint32_t minCUSize, CUGeom cuDataArray[CUGeom::MAX_GEOMS]);
 
     void     initCTU(const Frame& frame, uint32_t cuAddr, int qp);
-    void     initSubCU(const CUData& ctu, const CUGeom& cuGeom);
+    void     initSubCU(const CUData& ctu, const CUGeom& cuGeom, int qp);
     void     initLosslessCU(const CUData& cu, const CUGeom& cuGeom);
 
     void     copyPartFrom(const CUData& cu, const CUGeom& childGeom, uint32_t subPartIdx);
diff -r 5c3443546ccc -r 77267c5390df source/common/dct.cpp
--- a/source/common/dct.cpp	Sat Apr 18 10:02:19 2015 -0700
+++ b/source/common/dct.cpp	Tue Apr 21 15:05:18 2015 +0530
@@ -785,6 +785,37 @@ int findPosLast_c(const uint16_t *scan, 
     return scanPosLast - 1;
 }
 
+uint32_t findPosFirstLast_c(const int16_t *dstCoeff, const intptr_t trSize, const uint16_t scanTbl[16])
+{
+    int n;
+
+    for (n = SCAN_SET_SIZE - 1; n >= 0; --n)
+    {
+        const uint32_t idx = scanTbl[n];
+        const uint32_t idxY = idx / MLS_CG_SIZE;
+        const uint32_t idxX = idx % MLS_CG_SIZE;
+        if (dstCoeff[idxY * trSize + idxX])
+            break;
+    }
+
+    X265_CHECK(n >= 0, "non-zero coeff scan failuare!\n");
+
+    uint32_t lastNZPosInCG = (uint32_t)n;
+
+    for (n = 0;; n++)
+    {
+        const uint32_t idx = scanTbl[n];
+        const uint32_t idxY = idx / MLS_CG_SIZE;
+        const uint32_t idxX = idx % MLS_CG_SIZE;
+        if (dstCoeff[idxY * trSize + idxX])
+            break;
+    }
+
+    uint32_t firstNZPosInCG = (uint32_t)n;
+
+    return ((lastNZPosInCG << 16) | firstNZPosInCG);
+}
+
 }  // closing - anonymous file-static namespace
 
 namespace x265 {
@@ -818,5 +849,6 @@ void setupDCTPrimitives_c(EncoderPrimiti
     p.cu[BLOCK_32x32].copy_cnt = copy_count<32>;
 
     p.findPosLast = findPosLast_c;
+    p.findPosFirstLast = findPosFirstLast_c;
 }
 }
diff -r 5c3443546ccc -r 77267c5390df source/common/loopfilter.cpp
--- a/source/common/loopfilter.cpp	Sat Apr 18 10:02:19 2015 -0700
+++ b/source/common/loopfilter.cpp	Tue Apr 21 15:05:18 2015 +0530
@@ -122,25 +122,21 @@ void processSaoCUE3(pixel *rec, int8_t *
     }
 }
 
-void processSaoCUE3_2Rows(pixel *rec, int8_t *upBuff1, int8_t *offsetEo, intptr_t stride, int startX, int endX, int8_t* signDown)
+void processSaoCUE3_2Rows(pixel *rec, int8_t *upBuff1, int8_t *offsetEo, intptr_t stride, int startX, int endX, int8_t* upBuff)
 {
-    int8_t signDown1;
+    int8_t signDown;
     int8_t edgeType;
 
     for (int y = 0; y < 2; y++)
     {
-        edgeType = signDown[y] + upBuff1[startX] + 2;
-        upBuff1[startX - 1] = -signDown[y];
-        rec[startX] = x265_clip(rec[startX] + offsetEo[edgeType]);
-
         for (int x = startX + 1; x < endX; x++)
         {
-            signDown1 = signOf(rec[x] - rec[x + stride]);
-            edgeType = signDown1 + upBuff1[x] + 2;
-            upBuff1[x - 1] = -signDown1;
+            signDown = signOf(rec[x] - rec[x + stride]);
+            edgeType = signDown + upBuff1[x] + 2;
+            upBuff1[x - 1] = -signDown;
             rec[x] = x265_clip(rec[x] + offsetEo[edgeType]);
         }
-        upBuff1[endX - 1] = signOf(rec[endX - 1 + stride + 1] - rec[endX]);
+        upBuff1[endX - 1] = upBuff[y];
         rec += stride + 1;
     }
 }
diff -r 5c3443546ccc -r 77267c5390df source/common/param.cpp
--- a/source/common/param.cpp	Sat Apr 18 10:02:19 2015 -0700
+++ b/source/common/param.cpp	Tue Apr 21 15:05:18 2015 +0530
@@ -1273,22 +1273,20 @@ void x265_print_params(x265_param* param
     x265_log(param, X265_LOG_INFO, "b-pyramid / weightp / weightb / refs: %d / %d / %d / %d\n",
              param->bBPyramid, param->bEnableWeightedPred, param->bEnableWeightedBiPred, param->maxNumReferences);
 
+    if (param->rc.aqMode)
+        x265_log(param, X265_LOG_INFO, "AQ: mode / str / qg-size / cu-tree  : %d / %0.1f / %d / %d\n", param->rc.aqMode,
+                 param->rc.aqStrength, param->rc.qgSize, param->rc.cuTree);
+
     if (param->bLossless)
         x265_log(param, X265_LOG_INFO, "Rate Control                        : Lossless\n");
     else switch (param->rc.rateControlMode)
     {
     case X265_RC_ABR:
-        x265_log(param, X265_LOG_INFO, "Rate Control / AQ-Strength / CUTree : ABR-%d kbps / %0.1f / %d\n", param->rc.bitrate,
-                 param->rc.aqStrength, param->rc.cuTree);
-        break;
+        x265_log(param, X265_LOG_INFO, "Rate Control                        : ABR-%d kbps\n", param->rc.bitrate); break;
     case X265_RC_CQP:
-        x265_log(param, X265_LOG_INFO, "Rate Control / AQ-Strength / CUTree : CQP-%d / %0.1f / %d\n", param->rc.qp, param->rc.aqStrength,
-                 param->rc.cuTree);
-        break;
+        x265_log(param, X265_LOG_INFO, "Rate Control                        : CQP-%d\n", param->rc.qp);  break;
     case X265_RC_CRF:
-        x265_log(param, X265_LOG_INFO, "Rate Control / AQ-Strength / CUTree : CRF-%0.1f / %0.1f / %d\n", param->rc.rfConstant,
-                 param->rc.aqStrength, param->rc.cuTree);
-        break;
+        x265_log(param, X265_LOG_INFO, "Rate Control                        : CRF-%0.1f\n", param->rc.rfConstant);  break;
     }
 
     if (param->rc.vbvBufferSize)
diff -r 5c3443546ccc -r 77267c5390df source/common/primitives.h
--- a/source/common/primitives.h	Sat Apr 18 10:02:19 2015 -0700
+++ b/source/common/primitives.h	Tue Apr 21 15:05:18 2015 +0530
@@ -172,7 +172,7 @@ typedef void (*saoCuOrgE0_t)(pixel* rec,
 typedef void (*saoCuOrgE1_t)(pixel* rec, int8_t* upBuff1, int8_t* offsetEo, intptr_t stride, int width);
 typedef void (*saoCuOrgE2_t)(pixel* rec, int8_t* pBufft, int8_t* pBuff1, int8_t* offsetEo, int lcuWidth, intptr_t stride);
 typedef void (*saoCuOrgE3_t)(pixel* rec, int8_t* upBuff1, int8_t* m_offsetEo, intptr_t stride, int startX, int endX);
-typedef void (*saoCuOrgE3_2Rows_t)(pixel* rec, int8_t* upBuff1, int8_t* m_offsetEo, intptr_t stride, int startX, int endX, int8_t* signDown);
+typedef void (*saoCuOrgE3_2Rows_t)(pixel* rec, int8_t* upBuff1, int8_t* m_offsetEo, intptr_t stride, int startX, int endX, int8_t* upBuff);
 typedef void (*saoCuOrgB0_t)(pixel* rec, const int8_t* offsetBo, int ctuWidth, int ctuHeight, intptr_t stride);
 typedef void (*sign_t)(int8_t *dst, const pixel *src1, const pixel *src2, const int endX);
 typedef void (*planecopy_cp_t) (const uint8_t* src, intptr_t srcStride, pixel* dst, intptr_t dstStride, int width, int height, int shift);
@@ -181,6 +181,7 @@ typedef void (*planecopy_sp_t) (const ui
 typedef void (*cutree_propagate_cost) (int* dst, const uint16_t* propagateIn, const int32_t* intraCosts, const uint16_t* interCosts, const int32_t* invQscales, const double* fpsFactor, int len);
 
 typedef int (*findPosLast_t)(const uint16_t *scan, const coeff_t *coeff, uint16_t *coeffSign, uint16_t *coeffFlag, uint8_t *coeffNum, int numSig);
+typedef uint32_t (*findPosFirstLast_t)(const int16_t *dstCoeff, const intptr_t trSize, const uint16_t scanTbl[16]);
 
 /* Function pointers to optimized encoder primitives. Each pointer can reference
  * either an assembly routine, a SIMD intrinsic primitive, or a C function */
@@ -293,6 +294,7 @@ struct EncoderPrimitives
 
 
     findPosLast_t         findPosLast;
+    findPosFirstLast_t    findPosFirstLast;
 
     /* There is one set of chroma primitives per color space. An encoder will
      * have just a single color space and thus it will only ever use one entry
diff -r 5c3443546ccc -r 77267c5390df source/common/quant.cpp
--- a/source/common/quant.cpp	Sat Apr 18 10:02:19 2015 -0700
+++ b/source/common/quant.cpp	Tue Apr 21 15:05:18 2015 +0530
@@ -530,6 +530,7 @@ uint32_t Quant::rdoQuant(const CUData& c
     X265_CHECK((int)numSig == primitives.cu[log2TrSize - 2].count_nonzero(dstCoeff), "numSig differ\n");
     if (!numSig)
         return 0;
+
     uint32_t trSize = 1 << log2TrSize;
     int64_t lambda2 = m_qpParam[ttype].lambda2;
     int64_t psyScale = (m_psyRdoqScale * m_qpParam[ttype].lambda);
@@ -545,7 +546,7 @@ uint32_t Quant::rdoQuant(const CUData& c
 #define UNQUANT(lvl)    (((lvl) * (unquantScale[blkPos] << per) + unquantRound) >> unquantShift)
 #define SIGCOST(bits)   ((lambda2 * (bits)) >> 8)
 #define RDCOST(d, bits) ((((int64_t)d * d) << scaleBits) + SIGCOST(bits))
-#define PSYVALUE(rec)   ((psyScale * (rec)) >> (16 - scaleBits))
+#define PSYVALUE(rec)   ((psyScale * (rec)) >> (2 * transformShift + 1))
 
     int64_t costCoeff[32 * 32];   /* d*d + lambda * bits */
     int64_t costUncoded[32 * 32]; /* d*d + lambda * 0    */
@@ -558,8 +559,6 @@ uint32_t Quant::rdoQuant(const CUData& c
     int64_t costCoeffGroupSig[MLS_GRP_NUM]; /* lambda * bits of group coding cost */
     uint64_t sigCoeffGroupFlag64 = 0;
 
-    int cgLastScanPos    = -1;
-    int lastScanPos      = -1;
     const uint32_t cgSize = (1 << MLS_CG_SIZE); /* 4x4 num coef = 16 */
     bool bIsLuma = ttype == TEXT_LUMA;
 
@@ -576,29 +575,148 @@ uint32_t Quant::rdoQuant(const CUData& c
     const uint32_t cgNum = 1 << (codeParams.log2TrSizeCG * 2);
     const uint32_t cgStride = (trSize >> MLS_CG_LOG2_SIZE);
 
+    uint8_t coeffNum[MLS_GRP_NUM];      // value range[0, 16]
+    uint16_t coeffSign[MLS_GRP_NUM];    // bit mask map for non-zero coeff sign
+    uint16_t coeffFlag[MLS_GRP_NUM];    // bit mask map for non-zero coeff
+
+#if CHECKED_BUILD || _DEBUG
+    // clean output buffer, the asm version of findPosLast Never output anything after latest non-zero coeff group
+    memset(coeffNum, 0, sizeof(coeffNum));
+    memset(coeffSign, 0, sizeof(coeffNum));
+    memset(coeffFlag, 0, sizeof(coeffNum));
+#endif
+    const int lastScanPos = primitives.findPosLast(codeParams.scan, dstCoeff, coeffSign, coeffFlag, coeffNum, numSig);
+    const int cgLastScanPos = (lastScanPos >> LOG2_SCAN_SET_SIZE);
+
+
     /* TODO: update bit estimates if dirty */
     EstBitsSbac& estBitsSbac = m_entropyCoder->m_estBitsSbac;
 
     uint32_t scanPos;
-    coeffGroupRDStats cgRdStats;
     uint32_t c1 = 1;
 
+    // process trail all zero Coeff Group
+
+    /* coefficients after lastNZ have no distortion signal cost */
+    const int zeroCG = cgNum - 1 - cgLastScanPos;
+    memset(&costCoeff[(cgLastScanPos + 1) << MLS_CG_SIZE], 0, zeroCG * MLS_CG_BLK_SIZE * sizeof(int64_t));
+    memset(&costSig[(cgLastScanPos + 1) << MLS_CG_SIZE], 0, zeroCG * MLS_CG_BLK_SIZE * sizeof(int64_t));
+
+    /* sum zero coeff (uncodec) cost */
+
+    // TODO: does we need these cost?
+    if (usePsyMask)
+    {
+        for (int cgScanPos = cgLastScanPos + 1; cgScanPos < (int)cgNum ; cgScanPos++)
+        {
+            X265_CHECK(coeffNum[cgScanPos] == 0, "count of coeff failure\n");
+
+            uint32_t scanPosBase = (cgScanPos << MLS_CG_SIZE);
+            uint32_t blkPos      = codeParams.scan[scanPosBase];
+
+            // TODO: we can't SIMD optimize because PSYVALUE need 64-bits multiplication, convert to Double can work faster by FMA
+            for (int y = 0; y < MLS_CG_SIZE; y++)
+            {
+                for (int x = 0; x < MLS_CG_SIZE; x++)
+                {
+                    int signCoef         = m_resiDctCoeff[blkPos + x];            /* pre-quantization DCT coeff */
+                    int predictedCoef    = m_fencDctCoeff[blkPos + x] - signCoef; /* predicted DCT = source DCT - residual DCT*/
+
+                    costUncoded[blkPos + x] = ((int64_t)signCoef * signCoef) << scaleBits;
+
+                    /* when no residual coefficient is coded, predicted coef == recon coef */
+                    costUncoded[blkPos + x] -= PSYVALUE(predictedCoef);
+
+                    totalUncodedCost += costUncoded[blkPos + x];
+                    totalRdCost += costUncoded[blkPos + x];
+                }
+                blkPos += trSize;
+            }
+        }

