[x265-commits] [x265] api: add --allow-non-conformance param, default to False

Tue Apr 7 04:08:00 CEST 2015

details:   http://hg.videolan.org/x265/rev/775436f7364d
branches:  
changeset: 10071:775436f7364d
user:      Steve Borho <steve at borho.org>
date:      Sun Apr 05 12:56:40 2015 -0500
description:
api: add --allow-non-conformance param, default to False

The encoder will now abort any encode that would result in a non-conformant
stream, unless --allow-non-conformance is specified.
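
A minimal sketch of the decision described above, for readers who want it spelled out. Only the field name bAllowNonConformance and its default of false come from this push; ParamSketch, OpenResult and checkConformanceSketch are hypothetical names, and the real check in the encoder is not visible in the truncated diff.

```cpp
#include <cassert>

// Hypothetical stand-in for the one x265_param field this changeset adds.
struct ParamSketch
{
    bool bAllowNonConformance = false; // default: abort non-conformant encodes
};

enum class OpenResult
{
    Conformant,      // a valid profile/level was determined
    NonConformant,   // stream signaled as profile NONE / level NONE
    Aborted          // encode refused
};

// levelDetermined: whether the encoder found a level that fits the settings.
// How that is computed is outside this sketch.
inline OpenResult checkConformanceSketch(const ParamSketch& p, bool levelDetermined)
{
    if (levelDetermined)
        return OpenResult::Conformant;
    return p.bAllowNonConformance ? OpenResult::NonConformant : OpenResult::Aborted;
}
```

With default parameters, an encode for which no level fits ends in Aborted; setting the flag turns the same case into NonConformant.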
Subject: [x265] asm: luma_hps[12x16] avx2 - improved 3779c->2482c

details:   http://hg.videolan.org/x265/rev/0e097d6d57cf
branches:  
changeset: 10072:0e097d6d57cf
user:      Aasaipriya Chandran <aasaipriya at multicorewareinc.com>
date:      Mon Apr 06 09:17:08 2015 +0530
description:
asm: luma_hps[12x16] avx2 - improved 3779c->2482c
Subject: [x265] asm: luma_hps[24x32] avx2 - improved 11545c->6843c

details:   http://hg.videolan.org/x265/rev/02b4942ce999
branches:  
changeset: 10073:02b4942ce999
user:      Aasaipriya Chandran <aasaipriya at multicorewareinc.com>
date:      Mon Apr 06 09:19:14 2015 +0530
description:
asm: luma_hps[24x32] avx2 - improved 11545c->6843c
Subject: [x265] asm: chroma_hps[24x32] avx2 - improved 4458c->3583c

details:   http://hg.videolan.org/x265/rev/60c6a48a292c
branches:  
changeset: 10074:60c6a48a292c
user:      Aasaipriya Chandran <aasaipriya at multicorewareinc.com>
date:      Mon Apr 06 09:21:20 2015 +0530
description:
asm: chroma_hps[24x32] avx2 - improved 4458c->3583c
Subject: [x265] asm: luma_hvpp[16x16] - 11.39x 5226c

details:   http://hg.videolan.org/x265/rev/3849ba2347de
branches:  
changeset: 10075:3849ba2347de
user:      Aasaipriya Chandran <aasaipriya at multicorewareinc.com>
date:      Mon Apr 06 09:38:55 2015 +0530
description:
asm: luma_hvpp[16x16] - 11.39x 5226c
Subject: [x265] asm: improve the old avx2 code for sad[32x24]

details:   http://hg.videolan.org/x265/rev/809339fb90b5
branches:  
changeset: 10076:809339fb90b5
user:      Sumalatha Polureddy
date:      Mon Apr 06 11:47:55 2015 +0530
description:
asm: improve the old avx2 code for sad[32x24]

old:
sad[32x24]  14.26x   490.58          6995.66
new:
sad[32x24]  16.33x   428.35          6993.57
Subject: [x265] asm: intra_pred_ang4_8 improved by ~24% over SSE4

details:   http://hg.videolan.org/x265/rev/d317f9252f40
branches:  
changeset: 10077:d317f9252f40
user:      Praveen Tiwari <praveen at multicorewareinc.com>
date:      Mon Apr 06 10:58:30 2015 +0530
description:
asm: intra_pred_ang4_8 improved by ~24% over SSE4

AVX2:
intra_ang_4x4[ 8]       9.58x    110.01          1053.65

SSE4:
intra_ang_4x4[ 8]       7.26x    146.78          1065.62
Subject: [x265] asm: intra_pred_ang4_7 improved by ~42% over SSE4

details:   http://hg.videolan.org/x265/rev/aaa31e85a137
branches:  
changeset: 10078:aaa31e85a137
user:      Praveen Tiwari <praveen at multicorewareinc.com>
date:      Mon Apr 06 11:31:02 2015 +0530
description:
asm: intra_pred_ang4_7 improved by ~42% over SSE4

AVX2:
intra_ang_4x4[ 7]       10.24x   98.65           1009.92

SSE4:
intra_ang_4x4[ 7]       6.25x    169.98          1061.89
Subject: [x265] asm: intra_pred_ang4_6 improved by ~36% over SSE4

details:   http://hg.videolan.org/x265/rev/24571357bee9
branches:  
changeset: 10079:24571357bee9
user:      Praveen Tiwari <praveen at multicorewareinc.com>
date:      Mon Apr 06 12:05:52 2015 +0530
description:
asm: intra_pred_ang4_6 improved by ~36% over SSE4

AVX2:
intra_ang_4x4[ 6]       10.08x   101.69          1024.92

SSE4:
intra_ang_4x4[ 6]       6.60x    160.00          1055.62
Subject: [x265] asm: intra_pred_ang4_5 improved by ~41% over SSE4

details:   http://hg.videolan.org/x265/rev/c570567a2760
branches:  
changeset: 10080:c570567a2760
user:      Praveen Tiwari <praveen at multicorewareinc.com>
date:      Mon Apr 06 12:18:54 2015 +0530
description:
asm: intra_pred_ang4_5 improved by ~41% over SSE4

AVX2:
intra_ang_4x4[ 5]       9.56x    103.43          989.01

SSE4:
intra_ang_4x4[ 5]       5.99x    176.06          1055.48
Subject: [x265] asm: intra_pred_ang4_4 improved by ~44% over SSE4

details:   http://hg.videolan.org/x265/rev/cd6ea2f38499
branches:  
changeset: 10081:cd6ea2f38499
user:      Praveen Tiwari <praveen at multicorewareinc.com>
date:      Mon Apr 06 12:31:17 2015 +0530
description:
asm: intra_pred_ang4_4 improved by ~44% over SSE4

AVX2:
intra_ang_4x4[ 4]       10.62x   94.02           998.80

SSE4:
intra_ang_4x4[ 4]       5.89x    169.02          994.88
Subject: [x265] asm: intra_pred_ang4_3 improved by ~41% over SSE4

details:   http://hg.videolan.org/x265/rev/141e2904e2ac
branches:  
changeset: 10082:141e2904e2ac
user:      Praveen Tiwari <praveen at multicorewareinc.com>
date:      Mon Apr 06 12:47:11 2015 +0530
description:
asm: intra_pred_ang4_3 improved by ~41% over SSE4

AVX2:
intra_ang_4x4[ 3]       10.17x   97.09           987.20

SSE4:
intra_ang_4x4[ 3]       6.42x    167.16          1072.98
Subject: [x265] sao: modify C and SSE4 code for saoCuOrgE0 to process 2 rows

details:   http://hg.videolan.org/x265/rev/b84fe6497aa5
branches:  
changeset: 10083:b84fe6497aa5
user:      Divya Manivannan <divya at multicorewareinc.com>
date:      Mon Apr 06 13:56:47 2015 +0530
description:
sao: modify C and SSE4 code for saoCuOrgE0 to process 2 rows
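
The two-row C reference from this changeset (see the loopfilter.cpp hunk further down) can be tried standalone roughly as follows. This is a sketch, not the library code: pixel is assumed to be uint8_t (8-bit build), clipPixel stands in for x265_clip, and the function is renamed processSaoCUE0Sketch.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

typedef uint8_t pixel; // assumption: 8-bit build

// Stand-in for x265_clip: clamp to the 8-bit pixel range.
static inline pixel clipPixel(int v)
{
    return (pixel)std::min(255, std::max(0, v));
}

// Applies the horizontal edge offset (SAO EO class 0) to two consecutive rows.
// signLeft[y] carries the sign against the left neighbour of row y's first pixel.
// Note that rec[width] is read on each row, so it must be valid memory.
void processSaoCUE0Sketch(pixel* rec, const int8_t* offsetEo, int width,
                          const int8_t* signLeft, intptr_t stride)
{
    assert(width > 0);
    for (int y = 0; y < 2; y++)
    {
        int signLeft0 = signLeft[y];
        for (int x = 0; x < width; x++)
        {
            int diff = rec[x] - rec[x + 1];
            int signRight = (diff < 0) ? -1 : (diff > 0) ? 1 : 0;
            int edgeType = signRight + signLeft0 + 2; // index 0..4 into offsetEo
            signLeft0 = -signRight;                   // becomes the next pixel's left sign
            rec[x] = clipPixel(rec[x] + offsetEo[edgeType]);
        }
        rec += stride;
    }
}
```

Passing signLeft as a pointer gives each row its own starting sign, which is what lets one call cover two rows.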
Subject: [x265] asm: saoCuOrgE0 avx2 code: 756c->629c

details:   http://hg.videolan.org/x265/rev/64b7d2b4aac7
branches:  
changeset: 10084:64b7d2b4aac7
user:      Divya Manivannan <divya at multicorewareinc.com>
date:      Mon Apr 06 14:38:43 2015 +0530
description:
asm: saoCuOrgE0 avx2 code: 756c->629c
Subject: [x265] asm: improve the old avx2 code for sad[64x64]

details:   http://hg.videolan.org/x265/rev/7e5b68eba341
branches:  
changeset: 10085:7e5b68eba341
user:      Sumalatha Polureddy
date:      Mon Apr 06 14:53:36 2015 +0530
description:
asm: improve the old avx2 code for sad[64x64]

old:
sad[64x64]  21.47x   1702.40         36545.14
new:
sad[64x64]  22.89x   1595.16         36506.87
Subject: [x265] asm: improve old avx2 code for sad[64x48]

details:   http://hg.videolan.org/x265/rev/ca0d3bb3de69
branches:  
changeset: 10086:ca0d3bb3de69
user:      Sumalatha Polureddy
date:      Mon Apr 06 15:51:50 2015 +0530
description:
asm: improve old avx2 code for sad[64x48]

old:
sad[64x48]  16.79x   1504.65         25267.23

new:
sad[64x48]  20.18x   1260.99         25451.33
Subject: [x265] asm: ssse3 8bpp code for convert_p2s[12xN],[24xN],[48x64]

details:   http://hg.videolan.org/x265/rev/6d1c2339d9b9
branches:  
changeset: 10087:6d1c2339d9b9
user:      Rajesh Paulraj <rajesh at multicorewareinc.com>
date:      Mon Apr 06 15:02:05 2015 +0530
description:
asm: ssse3 8bpp code for convert_p2s[12xN],[24xN],[48x64]

     convert_p2s[12x16](9.82x), convert_p2s[24x32](13.61x),
     convert_p2s[48x64](11.12x)
Subject: [x265] asm: sse4 8bpp code for chroma_p2s[6xN] for i420, i422

details:   http://hg.videolan.org/x265/rev/57956d20dc48
branches:  
changeset: 10088:57956d20dc48
user:      Rajesh Paulraj <rajesh at multicorewareinc.com>
date:      Mon Apr 06 15:06:06 2015 +0530
description:
asm: sse4 8bpp code for chroma_p2s[6xN] for i420, i422

          chroma_p2s[6x8][i420](2.75x), chroma_p2s[6x16][i422](2.96x)
Subject: [x265] asm: ssse3 8bpp code for chroma_p2s[8x6](4.74x) for i420

details:   http://hg.videolan.org/x265/rev/64d96f1ac0bd
branches:  
changeset: 10089:64d96f1ac0bd
user:      Rajesh Paulraj <rajesh at multicorewareinc.com>
date:      Mon Apr 06 15:08:57 2015 +0530
description:
asm: ssse3 8bpp code for chroma_p2s[8x6](4.74x) for i420
Subject: [x265] asm: ssse3 8bpp code for chroma_p2s i422, reuse luma code

details:   http://hg.videolan.org/x265/rev/7db85dc198a4
branches:  
changeset: 10090:7db85dc198a4
user:      Rajesh Paulraj <rajesh at multicorewareinc.com>
date:      Mon Apr 06 16:22:54 2015 +0530
description:
asm: ssse3 8bpp code for chroma_p2s i422, reuse luma code

     chroma_p2s[4x32](3.78x), chroma_p2s[8x12](5.25x), chroma_p2s[8x64](6.65x),
     chroma_p2s[12x32](9.57x), chroma_p2s[16x24](12.96x),
     chroma_p2s[16x24](12.56x), chroma_p2s[24x64](13.66x),
     chroma_p2s[32x48](9.83x)
Subject: [x265] improve rdoQuant by reduce type convert and condition check

details:   http://hg.videolan.org/x265/rev/7d9aa340f950
branches:  
changeset: 10091:7d9aa340f950
user:      Min Chen <chenm003 at 163.com>
date:      Mon Apr 06 20:18:01 2015 +0800
description:
improve rdoQuant by reduce type convert and condition check
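
The branch-reduction tricks in this changeset's quant.cpp hunks can be checked in isolation. The sketch below copies fastMin/fastMax and the usePsyMask idea from the diff; psyAppliesSketch and minAbsLevelSketch are hypothetical wrappers added here for testing. It assumes x - y does not overflow and that right-shifting a negative int is an arithmetic shift, which holds on mainstream compilers.

```cpp
#include <cassert>
#include <climits>
#include <cstdint>

// (x - y) >> 31 is all-ones when x < y and zero otherwise.
inline int fastMin(int x, int y)
{
    return y + ((x - y) & ((x - y) >> (sizeof(int) * CHAR_BIT - 1))); // min(x, y)
}

inline int fastMax(int x, int y)
{
    return x - ((x - y) & ((x - y) >> (sizeof(int) * CHAR_BIT - 1))); // max(x, y)
}

// Replaces "usePsy && blkPos" with a single AND against a precomputed mask.
inline bool psyAppliesSketch(bool usePsy, uint32_t blkPos)
{
    const uint32_t usePsyMask = usePsy ? ~0u : 0u;
    return (usePsyMask & blkPos) != 0;
}

// X265_MAX(maxAbsLevel - 1, 1) for maxAbsLevel >= 1, written without a
// signed/unsigned conversion: only maxAbsLevel == 1 needs special handling.
inline uint32_t minAbsLevelSketch(uint32_t maxAbsLevel)
{
    assert(maxAbsLevel > 0);
    uint32_t minAbsLevel = maxAbsLevel - 1;
    if (maxAbsLevel == 1)
        minAbsLevel = 1;
    return minAbsLevel;
}
```

The mask form gives the same answer as the logical AND for every blkPos, because the mask is either all-ones or zero.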
Subject: [x265] fix count of shift overflow bug in Quant::getSigCoeffGroupCtxInc

details:   http://hg.videolan.org/x265/rev/cfd3c423c0bc
branches:  
changeset: 10092:cfd3c423c0bc
user:      Min Chen <chenm003 at 163.com>
date:      Mon Apr 06 20:17:52 2015 +0800
description:
fix count of shift overflow bug in Quant::getSigCoeffGroupCtxInc
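
For context on the shift-count problem: with an 8x8 grid of coefficient groups, the last group (cgPosX == cgPosY == 7) produces a shift count of 64, and shifting a 64-bit value by 64 is undefined in C++. The sketch below applies the same guard that the removed calcPatternSigCtx used (shift >= 64 ? 0 : ...). It is an illustration under that assumption, not necessarily the code of changeset cfd3c423c0bc; shiftRightGuarded and sigCoeffGroupCtxIncSketch are hypothetical names.

```cpp
#include <cassert>
#include <cstdint>

// A shift count of 64 or more on a uint64_t is undefined; return 0 instead.
inline uint64_t shiftRightGuarded(uint64_t v, uint32_t count)
{
    return count >= 64 ? 0 : v >> count;
}

// Returns 1 if the coefficient group to the right of or below (cgPosX, cgPosY)
// is flagged in cgGroupMask, else 0. Bit (y * trSizeCG + x) holds group (x, y).
inline uint32_t sigCoeffGroupCtxIncSketch(uint64_t cgGroupMask, uint32_t cgPosX,
                                          uint32_t cgPosY, uint32_t log2TrSizeCG)
{
    assert(log2TrSizeCG <= 3);
    const uint32_t trSizeCG = 1u << log2TrSizeCG;
    const uint32_t shift = 1 + (cgPosY << log2TrSizeCG) + cgPosX;
    const uint32_t sigPos = (uint32_t)shiftRightGuarded(cgGroupMask, shift);
    // The sign-extended compare yields all-ones only when a neighbour exists.
    const uint32_t sigRight = ((int32_t)(cgPosX - (trSizeCG - 1)) >> 31) & sigPos;
    const uint32_t sigLower = ((int32_t)(cgPosY - (trSizeCG - 1)) >> 31) & (sigPos >> (trSizeCG - 1));
    return (sigRight | sigLower) & 1;
}
```

At the bottom-right group both neighbour masks are zero anyway, so the guard does not change the result there; it only keeps the shift itself well defined.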
Subject: [x265] improve rdoQuant by more parameters on getSigCoeffGroupCtxInc and calcPatternSigCtx

details:   http://hg.videolan.org/x265/rev/bac58ebf8d86
branches:  
changeset: 10093:bac58ebf8d86
user:      Min Chen <chenm003 at 163.com>
date:      Mon Apr 06 20:17:57 2015 +0800
description:
improve rdoQuant by more parameters on getSigCoeffGroupCtxInc and calcPatternSigCtx
Subject: [x265] cli: rewrite pts_queue to use new/delete, not to confuse the leak tool

details:   http://hg.videolan.org/x265/rev/e35d7fe9e974
branches:  
changeset: 10094:e35d7fe9e974
user:      Xinyue Lu <i at 7086.in>
date:      Mon Apr 06 15:39:24 2015 -0700
description:
cli: rewrite pts_queue to use new/delete, not to confuse the leak tool
Subject: [x265] level: allow unbounded level 8.5 to be used for lossless encodes

details:   http://hg.videolan.org/x265/rev/0ce13ce29304
branches:  
changeset: 10095:0ce13ce29304
user:      Steve Borho <steve at borho.org>
date:      Mon Apr 06 21:02:36 2015 -0500
description:
level: allow unbounded level 8.5 to be used for lossless encodes

Lossless has no rate control, obviously, so it does not generally fit in any of
the given levels, but I think it is better to signal a valid profile (main,
main10, main10 4:4:4, etc) together with level 8.5 than to signal profile and
level as NONE. If anyone knows a better solution for this, please enlighten me.

This workaround prevents the need for --allow-non-conformance with --lossless
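
As a rough illustration of what signaling level 8.5 means: HEVC codes general_level_idc as 30 times the level number, so 8.5 becomes 255. The sketch below assumes, for simplicity, that a lossless encode is always signaled at 8.5; the actual level.cpp logic is not visible in the truncated diff, and levelIdcFromTenths and signalledLevelIdcSketch are hypothetical helpers.

```cpp
#include <cassert>

// general_level_idc = 30 * level number; levels are passed in tenths
// (5.1 -> 51, 8.5 -> 85) to stay in integer arithmetic.
constexpr int levelIdcFromTenths(int levelTenths)
{
    return levelTenths * 3;
}

// Assumption: lossless encodes ignore the fitted level and signal 8.5.
inline int signalledLevelIdcSketch(bool lossless, int fittedLevelIdc)
{
    assert(fittedLevelIdc >= 0);
    if (lossless)
        return levelIdcFromTenths(85); // unbounded level 8.5
    return fittedLevelIdc;
}
```

Under this assumption a lossless encode never ends up at level NONE, which is why it no longer needs --allow-non-conformance.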

diffstat:

 doc/reST/cli.rst                     |   15 +-
 source/CMakeLists.txt                |    2 +-
 source/common/loopfilter.cpp         |   21 +-
 source/common/param.cpp              |    1 +
 source/common/primitives.h           |    2 +-
 source/common/quant.cpp              |   60 +-
 source/common/quant.h                |   33 +-
 source/common/slice.h                |    1 +
 source/common/x86/asm-primitives.cpp |   25 +
 source/common/x86/intrapred.h        |    6 +
 source/common/x86/intrapred8.asm     |   75 +++
 source/common/x86/ipfilter8.asm      |  700 +++++++++++++++++++++++++++++++++++
 source/common/x86/ipfilter8.h        |   27 +-
 source/common/x86/loopfilter.asm     |  116 +++++-
 source/common/x86/loopfilter.h       |    3 +-
 source/common/x86/sad-a.asm          |   99 ++--
 source/encoder/api.cpp               |    7 +
 source/encoder/entropy.cpp           |    4 +-
 source/encoder/level.cpp             |   15 +-
 source/encoder/sao.cpp               |   25 +-
 source/test/pixelharness.cpp         |   10 +-
 source/x265.cpp                      |   33 +-
 source/x265.h                        |    4 +
 source/x265cli.h                     |    3 +
 24 files changed, 1145 insertions(+), 142 deletions(-)

diffs (truncated from 1960 to 300 lines):

diff -r ebe5e57c4b45 -r 0ce13ce29304 doc/reST/cli.rst
--- a/doc/reST/cli.rst	Sat Apr 04 15:11:39 2015 -0500
+++ b/doc/reST/cli.rst	Mon Apr 06 21:02:36 2015 -0500
@@ -464,11 +464,22 @@ Profile, Level, Tier
 	HEVC specification.  If x265 detects that the total reference count
 	is greater than 8, it will issue a warning that the resulting stream
 	is non-compliant and it signals the stream as profile NONE and level
-	NONE but still allows the encode to continue.  Compliant HEVC
+	NONE and will abort the encode unless
+	:option:`--allow-non-conformance` it specified.  Compliant HEVC
 	decoders may refuse to decode such streams.
 	
 	Default 3
 
+.. option:: --allow-non-conformance, --no-allow-non-conformance
+
+	Allow libx265 to generate a bitstream with profile and level NONE.
+	By default it will abort any encode which does not meet strict level
+	compliance. The two most likely causes for non-conformance are
+	:option:`--ctu` being too small, :option:`--ref` being too high,
+	or the bitrate or resolution being out of specification.
+
+	Default: disabled
+
 .. note::
 	:option:`--profile`, :option:`--level-idc`, and
 	:option:`--high-tier` are only intended for use when you are
@@ -476,7 +487,7 @@ Profile, Level, Tier
 	limitations and must constrain the bitstream within those limits.
 	Specifying a profile or level may lower the encode quality
 	parameters to meet those requirements but it will never raise
-	them.
+	them. It may enable VBV constraints on a CRF encode.
 
 Mode decision / Analysis
 ========================
diff -r ebe5e57c4b45 -r 0ce13ce29304 source/CMakeLists.txt
--- a/source/CMakeLists.txt	Sat Apr 04 15:11:39 2015 -0500
+++ b/source/CMakeLists.txt	Mon Apr 06 21:02:36 2015 -0500
@@ -30,7 +30,7 @@ option(STATIC_LINK_CRT "Statically link 
 mark_as_advanced(FPROFILE_USE FPROFILE_GENERATE NATIVE_BUILD)
 
 # X265_BUILD must be incremented each time the public API is changed
-set(X265_BUILD 52)
+set(X265_BUILD 53)
 configure_file("${PROJECT_SOURCE_DIR}/x265.def.in"
                "${PROJECT_BINARY_DIR}/x265.def")
 configure_file("${PROJECT_SOURCE_DIR}/x265_config.h.in"
diff -r ebe5e57c4b45 -r 0ce13ce29304 source/common/loopfilter.cpp
--- a/source/common/loopfilter.cpp	Sat Apr 04 15:11:39 2015 -0500
+++ b/source/common/loopfilter.cpp	Mon Apr 06 21:02:36 2015 -0500
@@ -42,18 +42,23 @@ void calSign(int8_t *dst, const pixel *s
         dst[x] = signOf(src1[x] - src2[x]);
 }
 
-void processSaoCUE0(pixel * rec, int8_t * offsetEo, int width, int8_t signLeft)
+void processSaoCUE0(pixel * rec, int8_t * offsetEo, int width, int8_t* signLeft, intptr_t stride)
 {
-    int x;
-    int8_t signRight;
+    int x, y;
+    int8_t signRight, signLeft0;
     int8_t edgeType;
 
-    for (x = 0; x < width; x++)
+    for (y = 0; y < 2; y++)
     {
-        signRight = ((rec[x] - rec[x + 1]) < 0) ? -1 : ((rec[x] - rec[x + 1]) > 0) ? 1 : 0;
-        edgeType = signRight + signLeft + 2;
-        signLeft  = -signRight;
-        rec[x] = x265_clip(rec[x] + offsetEo[edgeType]);
+        signLeft0 = signLeft[y];
+        for (x = 0; x < width; x++)
+        {
+            signRight = ((rec[x] - rec[x + 1]) < 0) ? -1 : ((rec[x] - rec[x + 1]) > 0) ? 1 : 0;
+            edgeType = signRight + signLeft0 + 2;
+            signLeft0 = -signRight;
+            rec[x] = x265_clip(rec[x] + offsetEo[edgeType]);
+        }
+        rec += stride;
     }
 }
 
diff -r ebe5e57c4b45 -r 0ce13ce29304 source/common/param.cpp
--- a/source/common/param.cpp	Sat Apr 04 15:11:39 2015 -0500
+++ b/source/common/param.cpp	Mon Apr 06 21:02:36 2015 -0500
@@ -565,6 +565,7 @@ int x265_param_parse(x265_param* p, cons
             p->levelIdc = atoi(value);
     }
     OPT("high-tier") p->bHighTier = atobool(value);
+    OPT("allow-non-conformance") p->bAllowNonConformance = atobool(value);
     OPT2("log-level", "log")
     {
         p->logLevel = atoi(value);
diff -r ebe5e57c4b45 -r 0ce13ce29304 source/common/primitives.h
--- a/source/common/primitives.h	Sat Apr 04 15:11:39 2015 -0500
+++ b/source/common/primitives.h	Mon Apr 06 21:02:36 2015 -0500
@@ -168,7 +168,7 @@ typedef void (*pixel_add_ps_t)(pixel* a,
 typedef void (*pixelavg_pp_t)(pixel* dst, intptr_t dstride, const pixel* src0, intptr_t sstride0, const pixel* src1, intptr_t sstride1, int weight);
 typedef void (*addAvg_t)(const int16_t* src0, const int16_t* src1, pixel* dst, intptr_t src0Stride, intptr_t src1Stride, intptr_t dstStride);
 
-typedef void (*saoCuOrgE0_t)(pixel* rec, int8_t* offsetEo, int width, int8_t signLeft);
+typedef void (*saoCuOrgE0_t)(pixel* rec, int8_t* offsetEo, int width, int8_t* signLeft, intptr_t stride);
 typedef void (*saoCuOrgE1_t)(pixel* rec, int8_t* upBuff1, int8_t* offsetEo, intptr_t stride, int width);
 typedef void (*saoCuOrgE2_t)(pixel* rec, int8_t* pBufft, int8_t* pBuff1, int8_t* offsetEo, int lcuWidth, intptr_t stride);
 typedef void (*saoCuOrgE3_t)(pixel* rec, int8_t* upBuff1, int8_t* m_offsetEo, intptr_t stride, int startX, int endX);
diff -r ebe5e57c4b45 -r 0ce13ce29304 source/common/quant.cpp
--- a/source/common/quant.cpp	Sat Apr 04 15:11:39 2015 -0500
+++ b/source/common/quant.cpp	Mon Apr 06 21:02:36 2015 -0500
@@ -50,6 +50,11 @@ inline int fastMin(int x, int y)
     return y + ((x - y) & ((x - y) >> (sizeof(int) * CHAR_BIT - 1))); // min(x, y)
 }
 
+inline int fastMax(int x, int y)
+{
+    return x - ((x - y) & ((x - y) >> (sizeof(int) * CHAR_BIT - 1))); // max(x, y)
+}
+
 inline int getICRate(uint32_t absLevel, int32_t diffLevel, const int* greaterOneBits, const int* levelAbsBits, const uint32_t absGoRice, const uint32_t maxVlc, uint32_t c1c2Idx)
 {
     X265_CHECK(c1c2Idx <= 3, "c1c2Idx check failure\n");
@@ -515,6 +520,7 @@ uint32_t Quant::rdoQuant(const CUData& c
 {
     int transformShift = MAX_TR_DYNAMIC_RANGE - X265_DEPTH - log2TrSize; /* Represents scaling through forward transform */
     int scalingListType = (cu.isIntra(absPartIdx) ? 0 : 3) + ttype;
+    const uint32_t usePsyMask = usePsy ? -1 : 0;
 
     X265_CHECK(scalingListType < 6, "scaling list type out of range\n");
 
@@ -595,14 +601,14 @@ uint32_t Quant::rdoQuant(const CUData& c
         const uint64_t cgBlkPosMask = ((uint64_t)1 << cgBlkPos);
         memset(&cgRdStats, 0, sizeof(coeffGroupRDStats));
 
-        const int patternSigCtx = calcPatternSigCtx(sigCoeffGroupFlag64, cgPosX, cgPosY, codeParams.log2TrSizeCG);
+        const int patternSigCtx = calcPatternSigCtx(sigCoeffGroupFlag64, cgPosX, cgPosY, cgBlkPos, (trSize >> MLS_CG_LOG2_SIZE));
 
         /* iterate over coefficients in each group in reverse scan order */
         for (int scanPosinCG = cgSize - 1; scanPosinCG >= 0; scanPosinCG--)
         {
             scanPos              = (cgScanPos << MLS_CG_SIZE) + scanPosinCG;
             uint32_t blkPos      = codeParams.scan[scanPos];
-            uint16_t maxAbsLevel = (int16_t)abs(dstCoeff[blkPos]);             /* abs(quantized coeff) */
+            uint32_t maxAbsLevel = abs(dstCoeff[blkPos]);             /* abs(quantized coeff) */
             int signCoef         = m_resiDctCoeff[blkPos];            /* pre-quantization DCT coeff */
             int predictedCoef    = m_fencDctCoeff[blkPos] - signCoef; /* predicted DCT = source DCT - residual DCT*/
 
@@ -611,8 +617,8 @@ uint32_t Quant::rdoQuant(const CUData& c
              * FIX15 nature of the CABAC cost tables minus the forward transform scale */
 
             /* cost of not coding this coefficient (all distortion, no signal bits) */
-            costUncoded[scanPos] = (int64_t)(signCoef * signCoef) << scaleBits;
-            if (usePsy && blkPos)
+            costUncoded[scanPos] = ((int64_t)signCoef * signCoef) << scaleBits;
+            if (usePsyMask & blkPos)
                 /* when no residual coefficient is coded, predicted coef == recon coef */
                 costUncoded[scanPos] -= PSYVALUE(predictedCoef);
 
@@ -652,7 +658,7 @@ uint32_t Quant::rdoQuant(const CUData& c
                 const int* greaterOneBits = estBitsSbac.greaterOneBits[oneCtx];
                 const int* levelAbsBits = estBitsSbac.levelAbsBits[absCtx];
 
-                uint16_t level = 0;
+                uint32_t level = 0;
                 uint32_t sigCoefBits = 0;
                 costCoeff[scanPos] = MAX_INT64;
 
@@ -672,8 +678,11 @@ uint32_t Quant::rdoQuant(const CUData& c
                 }
                 if (maxAbsLevel)
                 {
-                    uint16_t minAbsLevel = X265_MAX(maxAbsLevel - 1, 1);
-                    for (uint16_t lvl = maxAbsLevel; lvl >= minAbsLevel; lvl--)
+                    // NOTE: X265_MAX(maxAbsLevel - 1, 1) ==> (X>=2 -> X-1), (X<2 -> 1)  | (0 < X < 2 ==> X=1)
+                    uint32_t minAbsLevel = (maxAbsLevel - 1);
+                    if (maxAbsLevel == 1)
+                        minAbsLevel = 1;
+                    for (uint32_t lvl = maxAbsLevel; lvl >= minAbsLevel; lvl--)
                     {
                         uint32_t levelBits = getICRateCost(lvl, lvl - baseLevel, greaterOneBits, levelAbsBits, goRiceParam, c1c2Idx) + IEP_RATE;
 
@@ -682,7 +691,7 @@ uint32_t Quant::rdoQuant(const CUData& c
                         int64_t curCost = RDCOST(d, sigCoefBits + levelBits);
 
                         /* Psy RDOQ: bias in favor of higher AC coefficients in the reconstructed frame */
-                        if (usePsy && blkPos)
+                        if (usePsyMask & blkPos)
                         {
                             int reconCoef = abs(unquantAbsLevel + SIGN(predictedCoef, signCoef));
                             curCost -= PSYVALUE(reconCoef);
@@ -697,7 +706,7 @@ uint32_t Quant::rdoQuant(const CUData& c
                     }
                 }
 
-                dstCoeff[blkPos] = level;
+                dstCoeff[blkPos] = (int16_t)level;
                 totalRdCost += costCoeff[scanPos];
 
                 /* record costs for sign-hiding performed at the end */
@@ -815,7 +824,7 @@ uint32_t Quant::rdoQuant(const CUData& c
              * of the significant coefficient group flag and evaluate whether the RD cost of the
              * coded group is more than the RD cost of the uncoded group */
 
-            uint32_t sigCtx = getSigCoeffGroupCtxInc(sigCoeffGroupFlag64, cgPosX, cgPosY, codeParams.log2TrSizeCG);
+            uint32_t sigCtx = getSigCoeffGroupCtxInc(sigCoeffGroupFlag64, cgPosX, cgPosY, cgBlkPos, (trSize >> MLS_CG_LOG2_SIZE));
 
             int64_t costZeroCG = totalRdCost + SIGCOST(estBitsSbac.significantCoeffGroupBits[sigCtx][0]);
             costZeroCG += cgRdStats.uncodedDist;       /* add distortion for resetting non-zero levels to zero levels */
@@ -848,7 +857,7 @@ uint32_t Quant::rdoQuant(const CUData& c
         else
         {
             /* there were no coded coefficients in this coefficient group */
-            uint32_t ctxSig = getSigCoeffGroupCtxInc(sigCoeffGroupFlag64, cgPosX, cgPosY, codeParams.log2TrSizeCG);
+            uint32_t ctxSig = getSigCoeffGroupCtxInc(sigCoeffGroupFlag64, cgPosX, cgPosY, cgBlkPos, (trSize >> MLS_CG_LOG2_SIZE));
             costCoeffGroupSig[cgScanPos] = SIGCOST(estBitsSbac.significantCoeffGroupBits[ctxSig][0]);
             totalRdCost += costCoeffGroupSig[cgScanPos];  /* add cost of 0 bit in significant CG bitmap */
             totalRdCost -= cgRdStats.sigCost;             /* remove cost of significant coefficient bitmap */
@@ -909,7 +918,7 @@ uint32_t Quant::rdoQuant(const CUData& c
              * cost of signaling it as not-significant */
             uint32_t blkPos = codeParams.scan[scanPos];
             if (dstCoeff[blkPos])
-            {                
+            {
                 // Calculates the cost of signaling the last significant coefficient in the block 
                 uint32_t pos[2] = { (blkPos & (trSize - 1)), (blkPos >> log2TrSize) };
                 if (codeParams.scanType == SCAN_VER)
@@ -1092,22 +1101,6 @@ uint32_t Quant::rdoQuant(const CUData& c
     return numSig;
 }
 
-/* Pattern decision for context derivation process of significant_coeff_flag */
-uint32_t Quant::calcPatternSigCtx(uint64_t sigCoeffGroupFlag64, uint32_t cgPosX, uint32_t cgPosY, uint32_t log2TrSizeCG)
-{
-    if (!log2TrSizeCG)
-        return 0;
-
-    const uint32_t trSizeCG = 1 << log2TrSizeCG;
-    X265_CHECK(trSizeCG <= 8, "transform CG is too large\n");
-    const uint32_t shift = (cgPosY << log2TrSizeCG) + cgPosX + 1;
-    const uint32_t sigPos = (uint32_t)(shift >= 64 ? 0 : sigCoeffGroupFlag64 >> shift);
-    const uint32_t sigRight = ((int32_t)(cgPosX - (trSizeCG - 1)) >> 31) & (sigPos & 1);
-    const uint32_t sigLower = ((int32_t)(cgPosY - (trSizeCG - 1)) >> 31) & (sigPos >> (trSizeCG - 2)) & 2;
-
-    return sigRight + sigLower;
-}
-
 /* Context derivation process of coeff_abs_significant_flag */
 uint32_t Quant::getSigCtxInc(uint32_t patternSigCtx, uint32_t log2TrSize, uint32_t trSize, uint32_t blkPos, bool bIsLuma,
                              uint32_t firstSignificanceMapContext)
@@ -1175,14 +1168,3 @@ uint32_t Quant::getSigCtxInc(uint32_t pa
     return (bIsLuma && (posX | posY) >= 4) ? 3 + offset : offset;
 }
 
-/* Context derivation process of coeff_abs_significant_flag */
-uint32_t Quant::getSigCoeffGroupCtxInc(uint64_t cgGroupMask, uint32_t cgPosX, uint32_t cgPosY, uint32_t log2TrSizeCG)
-{
-    const uint32_t trSizeCG = 1 << log2TrSizeCG;
-
-    const uint32_t sigPos = (uint32_t)(cgGroupMask >> (1 + (cgPosY << log2TrSizeCG) + cgPosX));
-    const uint32_t sigRight = ((int32_t)(cgPosX - (trSizeCG - 1)) >> 31) & sigPos;
-    const uint32_t sigLower = ((int32_t)(cgPosY - (trSizeCG - 1)) >> 31) & (sigPos >> (trSizeCG - 1));
-
-    return (sigRight | sigLower) & 1;
-}
diff -r ebe5e57c4b45 -r 0ce13ce29304 source/common/quant.h
--- a/source/common/quant.h	Sat Apr 04 15:11:39 2015 -0500
+++ b/source/common/quant.h	Mon Apr 06 21:02:36 2015 -0500
@@ -111,10 +111,39 @@ public:
     void invtransformNxN(int16_t* residual, uint32_t resiStride, const coeff_t* coeff,
                          uint32_t log2TrSize, TextType ttype, bool bIntra, bool useTransformSkip, uint32_t numSig);
 
+    /* Pattern decision for context derivation process of significant_coeff_flag */
+    static uint32_t calcPatternSigCtx(uint64_t sigCoeffGroupFlag64, uint32_t cgPosX, uint32_t cgPosY, uint32_t cgBlkPos, uint32_t trSizeCG)
+    {
+        if (trSizeCG == 1)
+            return 0;
+
+        X265_CHECK(trSizeCG <= 8, "transform CG is too large\n");
+        X265_CHECK(cgBlkPos < 64, "cgBlkPos is too large\n");
+        // NOTE: cgBlkPos+1 may more than 63, it is invalid for shift,
+        //       but in this case, both cgPosX and cgPosY equal to (trSizeCG - 1),
+        //       the sigRight and sigLower will clear value to zero, the final result will be correct
+        const uint32_t sigPos = (uint32_t)(sigCoeffGroupFlag64 >> (cgBlkPos + 1)); // just need lowest 7-bits valid
+
+        // TODO: instruction BT is faster, but _bittest64 still generate instruction 'BT m, r' in VS2012
+        const uint32_t sigRight = ((int32_t)(cgPosX - (trSizeCG - 1)) >> 31) & (sigPos & 1);
+        const uint32_t sigLower = ((int32_t)(cgPosY - (trSizeCG - 1)) >> 31) & (sigPos >> (trSizeCG - 2)) & 2;
+        return sigRight + sigLower;
+    }
+
+    /* Context derivation process of coeff_abs_significant_flag */
+    static uint32_t getSigCoeffGroupCtxInc(uint64_t cgGroupMask, uint32_t cgPosX, uint32_t cgPosY, uint32_t cgBlkPos, uint32_t trSizeCG)
+    {
+        X265_CHECK(cgBlkPos < 64, "cgBlkPos is too large\n");
+        // NOTE: unsafe shift operator, see NOTE in calcPatternSigCtx
+        const uint32_t sigPos = (uint32_t)(cgGroupMask >> (cgBlkPos + 1)); // just need lowest 8-bits valid
+        const uint32_t sigRight = ((int32_t)(cgPosX - (trSizeCG - 1)) >> 31) & sigPos;
+        const uint32_t sigLower = ((int32_t)(cgPosY - (trSizeCG - 1)) >> 31) & (sigPos >> (trSizeCG - 1));
+
+        return (sigRight | sigLower) & 1;
+    }