[x265-commits] [x265] no-rdo early exit: giving weightage to the cost of all CU...

Wed Nov 13 01:42:33 CET 2013

details:   http://hg.videolan.org/x265/rev/dc5c51ff542f
branches:  
changeset: 5037:dc5c51ff542f
user:      Sumalatha Polureddy
date:      Tue Nov 12 10:45:56 2013 +0530
description:
no-rdo early exit: giving weightage to the cost of all CU's and neighbour CU's for early exit

Early exit is done when the CU cost at depth "n" is less than the sum of 60% of the avgcost of all CU's
and 40% of the avgcost of neighbour CU's at the same depth.
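
The threshold test described above can be sketched roughly as follows. This is only an
illustration of the 60/40 weighting, not the code from the changeset: the function name
shouldExitEarly and its parameters are hypothetical, and it assumes the two average costs
have already been computed for the current depth.

```cpp
#include <cstdint>

// Hypothetical helper (not part of x265): returns true when the cost of the
// CU at the current depth is below the weighted threshold
//     0.6 * avgCostAll + 0.4 * avgCostNeighbours
// Both averages are assumed to be for CUs at the same depth as cuCost.
inline bool shouldExitEarly(uint64_t cuCost,
                            uint64_t avgCostAll,
                            uint64_t avgCostNeighbours)
{
    // Compare in integer arithmetic, scaling both sides by 100 so that no
    // precision is lost to division:
    //     cuCost < (60 * all + 40 * neighbours) / 100
    // <=> 100 * cuCost < 60 * all + 40 * neighbours
    // Assumes costs are small enough that the products fit in 64 bits.
    return 100 * cuCost < 60 * avgCostAll + 40 * avgCostNeighbours;
}
```

With both averages equal to 100, a CU cost of 99 would trigger the early exit and a cost
of 100 would not, since the comparison is strict.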
Subject: [x265] asm: pixel_avg_32x(64,32,24,8)

details:   http://hg.videolan.org/x265/rev/5b0e1731f776
branches:  
changeset: 5038:5b0e1731f776
user:      Dnyaneshwar Gorade <dnyaneshwar at multicorewareinc.com>
date:      Tue Nov 12 10:25:21 2013 +0530
description:
asm: pixel_avg_32x(64,32,24,8)
Subject: [x265] asm: pixel_avg_64x(64,48,16)

details:   http://hg.videolan.org/x265/rev/9c92947860e0
branches:  
changeset: 5039:9c92947860e0
user:      Dnyaneshwar Gorade <dnyaneshwar at multicorewareinc.com>
date:      Tue Nov 12 11:03:42 2013 +0530
description:
asm: pixel_avg_64x(64,48,16)
Subject: [x265] asm: asm: pixel_avg_24x32

details:   http://hg.videolan.org/x265/rev/56642525d09e
branches:  
changeset: 5040:56642525d09e
user:      Dnyaneshwar Gorade <dnyaneshwar at multicorewareinc.com>
date:      Tue Nov 12 11:44:58 2013 +0530
description:
asm: asm: pixel_avg_24x32
Subject: [x265] asm: pixel_avg_48x64, pixel_avg_8x32

details:   http://hg.videolan.org/x265/rev/4a4fd61e98e6
branches:  
changeset: 5041:4a4fd61e98e6
user:      Dnyaneshwar Gorade <dnyaneshwar at multicorewareinc.com>
date:      Tue Nov 12 11:56:18 2013 +0530
description:
asm: pixel_avg_48x64, pixel_avg_8x32
Subject: [x265] cleanup: hardcoded m_qtTempTComYuv[qtLayer].m_width to MAX_CU_SIZE

details:   http://hg.videolan.org/x265/rev/12053d6bf759
branches:  
changeset: 5042:12053d6bf759
user:      Min Chen <chenm003 at 163.com>
date:      Tue Nov 12 16:14:09 2013 +0800
description:
cleanup: hardcoded m_qtTempTComYuv[qtLayer].m_width to MAX_CU_SIZE
Subject: [x265] Backout: Causing non-determinism in rd 0 and 1. Needs to be further investigated.

details:   http://hg.videolan.org/x265/rev/ab0968b4b65d
branches:  
changeset: 5043:ab0968b4b65d
user:      Deepthi Nandakumar <deepthi at multicorewareinc.com>
date:      Tue Nov 12 16:36:54 2013 +0530
description:
Backout: Causing non-determinism in rd 0 and 1. Needs to be further investigated.
Subject: [x265] TEncSearch: use luma block copy (luma part size) if bChromaSame

details:   http://hg.videolan.org/x265/rev/ea4f939478ed
branches:  
changeset: 5044:ea4f939478ed
user:      Steve Borho <steve at borho.org>
date:      Mon Nov 11 22:29:22 2013 -0600
description:
TEncSearch: use luma block copy (luma part size) if bChromaSame
Subject: [x265] compress: fix shadow warning from GCC

details:   http://hg.videolan.org/x265/rev/58bdb05da194
branches:  
changeset: 5045:58bdb05da194
user:      Steve Borho <steve at borho.org>
date:      Mon Nov 11 22:30:32 2013 -0600
description:
compress: fix shadow warning from GCC
Subject: [x265] asm: assembly code for pixel_satd_32x24 and rearranged the functions

details:   http://hg.videolan.org/x265/rev/085d5c625c53
branches:  
changeset: 5046:085d5c625c53
user:      Yuvaraj Venkatesh <yuvaraj at multicorewareinc.com>
date:      Tue Nov 12 12:19:28 2013 +0530
description:
asm: assembly code for pixel_satd_32x24 and rearranged the functions
Subject: [x265] asm: assembly code for pixel_satd_16x12

details:   http://hg.videolan.org/x265/rev/2baf62a8e47d
branches:  
changeset: 5047:2baf62a8e47d
user:      Yuvaraj Venkatesh <yuvaraj at multicorewareinc.com>
date:      Tue Nov 12 12:51:37 2013 +0530
description:
asm: assembly code for pixel_satd_16x12
Subject: [x265] asm: assembly code for pixel_satd_16x4

details:   http://hg.videolan.org/x265/rev/7818f5b7cc25
branches:  
changeset: 5048:7818f5b7cc25
user:      Yuvaraj Venkatesh <yuvaraj at multicorewareinc.com>
date:      Tue Nov 12 13:16:19 2013 +0530
description:
asm: assembly code for pixel_satd_16x4
Subject: [x265] asm: assembly code for satd_16x32, satd_16x64, satd_8x32

details:   http://hg.videolan.org/x265/rev/d636952ed093
branches:  
changeset: 5049:d636952ed093
user:      Yuvaraj Venkatesh <yuvaraj at multicorewareinc.com>
date:      Tue Nov 12 16:34:37 2013 +0530
description:
asm: assembly code for satd_16x32, satd_16x64, satd_8x32
Subject: [x265] asm: assembly code for pixel_satd_12x16

details:   http://hg.videolan.org/x265/rev/c56ce77dc081
branches:  
changeset: 5050:c56ce77dc081
user:      Yuvaraj Venkatesh <yuvaraj at multicorewareinc.com>
date:      Tue Nov 12 19:26:01 2013 +0530
description:
asm: assembly code for pixel_satd_12x16
Subject: [x265] TComYuv::copyToPicLuma, blockcopy_pp asm code integration

details:   http://hg.videolan.org/x265/rev/04c28af13c4d
branches:  
changeset: 5051:04c28af13c4d
user:      Praveen Tiwari
date:      Tue Nov 12 14:14:04 2013 +0530
description:
TComYuv::copyToPicLuma, blockcopy_pp asm code integration
Subject: [x265] TComYuv::copyFromPicLuma, blockcopy_pp luma asm code integration

details:   http://hg.videolan.org/x265/rev/c56ea57ce3ab
branches:  
changeset: 5052:c56ea57ce3ab
user:      Praveen Tiwari
date:      Tue Nov 12 17:07:14 2013 +0530
description:
TComYuv::copyFromPicLuma, blockcopy_pp luma asm code integration
Subject: [x265] TComYuv.cpp, use new blockcopy_pp luma primitives where feasible

details:   http://hg.videolan.org/x265/rev/8708689dcca2
branches:  
changeset: 5053:8708689dcca2
user:      Praveen Tiwari
date:      Tue Nov 12 17:41:21 2013 +0530
description:
TComYuv.cpp, use new blockcopy_pp luma primitives where feasible
Subject: [x265] TComYuv.cpp, use new luma_copy_ps asm primitives where feasible

details:   http://hg.videolan.org/x265/rev/31528c277c64
branches:  
changeset: 5054:31528c277c64
user:      Praveen Tiwari
date:      Tue Nov 12 17:58:13 2013 +0530
description:
TComYuv.cpp, use new luma_copy_ps asm primitives where feasible
Subject: [x265] asm: assembly code for x265_pixel_avg_12x16

details:   http://hg.videolan.org/x265/rev/d0f80f375c3b
branches:  
changeset: 5055:d0f80f375c3b
user:      Min Chen <chenm003 at 163.com>
date:      Tue Nov 12 19:27:06 2013 +0800
description:
asm: assembly code for x265_pixel_avg_12x16
Subject: [x265] Adding function pointer array and initializations for chroma vsp filter functions.

details:   http://hg.videolan.org/x265/rev/e676cbd86238
branches:  
changeset: 5056:e676cbd86238
user:      Nabajit Deka
date:      Tue Nov 12 16:07:05 2013 +0530
description:
Adding function pointer array and initializations for chroma vsp filter functions.
Subject: [x265] Adding test bench code for chroma vsp filter functions.

details:   http://hg.videolan.org/x265/rev/ed8a6cd4d8ec
branches:  
changeset: 5057:ed8a6cd4d8ec
user:      Nabajit Deka
date:      Tue Nov 12 16:16:14 2013 +0530
description:
Adding test bench code for chroma vsp filter functions.
Subject: [x265] asm: routines for chroma vsp filter functions for all block sizes.

details:   http://hg.videolan.org/x265/rev/4844849073b7
branches:  
changeset: 5058:4844849073b7
user:      Nabajit Deka
date:      Tue Nov 12 16:21:30 2013 +0530
description:
asm: routines for chroma vsp filter functions for all block sizes.
Subject: [x265] Adding asm function declarations for chroma vsp filter functions.

details:   http://hg.videolan.org/x265/rev/8fe8d8f9f7cb
branches:  
changeset: 5059:8fe8d8f9f7cb
user:      Nabajit Deka
date:      Tue Nov 12 16:23:13 2013 +0530
description:
Adding asm function declarations for chroma vsp filter functions.
Subject: [x265] Adding function pointer initializations for asm chroma vsp functions.

details:   http://hg.videolan.org/x265/rev/028b911ae623
branches:  
changeset: 5060:028b911ae623
user:      Nabajit Deka
date:      Tue Nov 12 16:25:54 2013 +0530
description:
Adding function pointer initializations for asm chroma vsp functions.
Subject: [x265] Adding function pointer array and initializations for chroma hps filter functions.

details:   http://hg.videolan.org/x265/rev/8a8b967500e5
branches:  
changeset: 5061:8a8b967500e5
user:      Nabajit Deka
date:      Tue Nov 12 17:30:35 2013 +0530
description:
Adding function pointer array and initializations for chroma hps filter functions.
Subject: [x265] Adding test bench code for chroma hps filter functions.

details:   http://hg.videolan.org/x265/rev/e6d26209c45f
branches:  
changeset: 5062:e6d26209c45f
user:      Nabajit Deka
date:      Tue Nov 12 17:34:19 2013 +0530
description:
Adding test bench code for chroma hps filter functions.
Subject: [x265] asm: routines for chroma hps filter functions for 2xN, 4xN, 6x8 and 12x16 block sizes.

details:   http://hg.videolan.org/x265/rev/533bca3ec7e9
branches:  
changeset: 5063:533bca3ec7e9
user:      Nabajit Deka
date:      Tue Nov 12 20:24:34 2013 +0530
description:
asm: routines for chroma hps filter functions for 2xN, 4xN, 6x8 and 12x16 block sizes.
Subject: [x265] Adding function pointer array and C primitive initializations for chroma vps filter functions.

details:   http://hg.videolan.org/x265/rev/1ddacfd89112
branches:  
changeset: 5064:1ddacfd89112
user:      Nabajit Deka
date:      Tue Nov 12 20:51:20 2013 +0530
description:
Adding function pointer array and C primitive initializations for chroma vps filter functions.
Subject: [x265] Adding test bench code for chroma vps filter functions.

details:   http://hg.videolan.org/x265/rev/2185b81ae35b
branches:  
changeset: 5065:2185b81ae35b
user:      Nabajit Deka
date:      Tue Nov 12 20:52:13 2013 +0530
description:
Adding test bench code for chroma vps filter functions.
Subject: [x265] Adding initialisation for ssd/sum values for lowress frame

details:   http://hg.videolan.org/x265/rev/a19ba09c1fd7
branches:  
changeset: 5066:a19ba09c1fd7
user:      Shazeb Nawaz Khan <shazeb at multicorewareinc.com>
date:      Tue Nov 12 17:06:03 2013 +0530
description:
Adding initialisation for ssd/sum values for lowress frame
Subject: [x265] Bug fix : In ipfilter for 10 bit yuv support

details:   http://hg.videolan.org/x265/rev/90c2763ee027
branches:  
changeset: 5067:90c2763ee027
user:      sagarkotecha
date:      Tue Nov 12 16:55:09 2013 +0530
description:
Bug fix : In ipfilter for 10 bit yuv support

diffstat:

 source/Lib/TLibCommon/TComYuv.cpp     |   19 +-
 source/Lib/TLibEncoder/TEncSearch.cpp |  111 +++--
 source/common/TShortYUV.cpp           |    4 -
 source/common/ipfilter.cpp            |   11 +-
 source/common/primitives.h            |    3 +
 source/common/x86/asm-primitives.cpp  |   74 ++-
 source/common/x86/ipfilter8.asm       |  627 ++++++++++++++++++++++++++++++++++
 source/common/x86/ipfilter8.h         |   33 +
 source/common/x86/mc-a.asm            |  103 +++++-
 source/common/x86/pixel-a.asm         |  300 +++++++++++++--
 source/common/x86/pixel.h             |   11 +
 source/encoder/compress.cpp           |    2 +-
 source/encoder/ratecontrol.cpp        |    5 +
 source/test/ipfilterharness.cpp       |  108 +++++-
 source/test/ipfilterharness.h         |    2 +
 15 files changed, 1268 insertions(+), 145 deletions(-)

diffs (truncated from 2113 to 300 lines):

diff -r 1ca01c82609f -r 90c2763ee027 source/Lib/TLibCommon/TComYuv.cpp
--- a/source/Lib/TLibCommon/TComYuv.cpp	Mon Nov 11 15:46:00 2013 +0530
+++ b/source/Lib/TLibCommon/TComYuv.cpp	Tue Nov 12 16:55:09 2013 +0530
@@ -111,13 +111,15 @@ void TComYuv::copyToPicLuma(TComPicYuv* 
     width  = m_width >> partDepth;
     height = m_height >> partDepth;
 
+    int part = partitionFromSizes(width, height);
+
     Pel* src = getLumaAddr(partIdx, width);
     Pel* dst = destPicYuv->getLumaAddr(cuAddr, absZOrderIdx);
 
     uint32_t srcstride = getStride();
     uint32_t dststride = destPicYuv->getStride();
 
-    primitives.blockcpy_pp(width, height, dst, dststride, src, srcstride);
+    primitives.luma_copy_pp[part](dst, dststride, src, srcstride);
 }
 
 void TComYuv::copyToPicChroma(TComPicYuv* destPicYuv, uint32_t cuAddr, uint32_t absZOrderIdx, uint32_t partDepth, uint32_t partIdx)
@@ -153,7 +155,8 @@ void TComYuv::copyFromPicLuma(TComPicYuv
     uint32_t dststride = getStride();
     uint32_t srcstride = srcPicYuv->getStride();
 
-    primitives.blockcpy_pp(m_width, m_height, dst, dststride, src, srcstride);
+    int part = partitionFromSizes(m_width, m_height);
+    primitives.luma_copy_pp[part](dst, dststride, src, srcstride);
 }
 
 void TComYuv::copyFromPicChroma(TComPicYuv* srcPicYuv, uint32_t cuAddr, uint32_t absZOrderIdx)
@@ -184,7 +187,8 @@ void TComYuv::copyToPartLuma(TComYuv* ds
     uint32_t srcstride = getStride();
     uint32_t dststride = dstPicYuv->getStride();
 
-    primitives.blockcpy_pp(m_width, m_height, dst, dststride, src, srcstride);
+    int part = partitionFromSizes(m_width, m_height);
+    primitives.luma_copy_pp[part](dst, dststride, src, srcstride);
 }
 
 void TComYuv::copyToPartChroma(TComYuv* dstPicYuv, uint32_t uiDstPartIdx)
@@ -218,7 +222,8 @@ void TComYuv::copyPartToLuma(TComYuv* ds
     uint32_t height = dstPicYuv->getHeight();
     uint32_t width = dstPicYuv->getWidth();
 
-    primitives.blockcpy_pp(width, height, dst, dststride, src, srcstride);
+    int part = partitionFromSizes(width, height);
+    primitives.luma_copy_pp[part](dst, dststride, src, srcstride);
 }
 
 void TComYuv::copyPartToChroma(TComYuv* dstPicYuv, uint32_t partIdx)
@@ -264,7 +269,8 @@ void TComYuv::copyPartToPartLuma(TComYuv
     uint32_t srcstride = getStride();
     uint32_t dststride = dstPicYuv->getStride();
 
-    primitives.blockcpy_pp(width, height, dst, dststride, src, srcstride);
+    int part = partitionFromSizes(width, height);
+    primitives.luma_copy_pp[part](dst, dststride, src, srcstride);
 }
 
 void TComYuv::copyPartToPartLuma(TShortYUV* dstPicYuv, uint32_t partIdx, uint32_t width, uint32_t height)
@@ -275,7 +281,8 @@ void TComYuv::copyPartToPartLuma(TShortY
     uint32_t  srcstride = getStride();
     uint32_t  dststride = dstPicYuv->m_width;
 
-    primitives.blockcpy_sp(width, height, dst, dststride, src, srcstride);
+    int part = partitionFromSizes(width, height);
+    primitives.luma_copy_ps[part](dst, dststride, src, srcstride);
 }
 
 void TComYuv::copyPartToPartChroma(TComYuv* dstPicYuv, uint32_t partIdx, uint32_t width, uint32_t height)
diff -r 1ca01c82609f -r 90c2763ee027 source/Lib/TLibEncoder/TEncSearch.cpp
--- a/source/Lib/TLibEncoder/TEncSearch.cpp	Mon Nov 11 15:46:00 2013 +0530
+++ b/source/Lib/TLibEncoder/TEncSearch.cpp	Tue Nov 12 16:55:09 2013 +0530
@@ -436,7 +436,7 @@ void TEncSearch::xIntraCodingLumaBlk(TCo
     TCoeff*  coeff          = m_qtTempCoeffY[qtLayer] + numCoeffPerInc * absPartIdx;
 
     int16_t* reconQt        = m_qtTempTComYuv[qtLayer].getLumaAddr(absPartIdx);
-    uint32_t reconQtStride  = m_qtTempTComYuv[qtLayer].m_width;
+    assert(m_qtTempTComYuv[qtLayer].m_width == MAX_CU_SIZE);
 
     uint32_t zorder           = cu->getZorderIdxInCU() + absPartIdx;
     Pel*     reconIPred       = cu->getPic()->getPicYuvRec()->getLumaAddr(cu->getAddr(), zorder);
@@ -502,7 +502,7 @@ void TEncSearch::xIntraCodingLumaBlk(TCo
     }
 
     //===== reconstruction =====
-    primitives.calcrecon[size](pred, residual, recon, reconQt, reconIPred, stride, reconQtStride, reconIPredStride);
+    primitives.calcrecon[size](pred, residual, recon, reconQt, reconIPred, stride, MAX_CU_SIZE, reconIPredStride);
 
     //===== update distortion =====
     outDist += primitives.sse_pp[part](fenc, stride, recon, stride);
@@ -548,7 +548,7 @@ void TEncSearch::xIntraCodingChromaBlk(T
     uint32_t numCoeffPerInc = (cu->getSlice()->getSPS()->getMaxCUWidth() * cu->getSlice()->getSPS()->getMaxCUHeight() >> (cu->getSlice()->getSPS()->getMaxCUDepth() << 1)) >> 2;
     TCoeff*  coeff          = (chromaId > 0 ? m_qtTempCoeffCr[qtlayer] : m_qtTempCoeffCb[qtlayer]) + numCoeffPerInc * absPartIdx;
     int16_t* reconQt        = (chromaId > 0 ? m_qtTempTComYuv[qtlayer].getCrAddr(absPartIdx) : m_qtTempTComYuv[qtlayer].getCbAddr(absPartIdx));
-    uint32_t reconQtStride  = m_qtTempTComYuv[qtlayer].m_cwidth;
+    assert(m_qtTempTComYuv[qtlayer].m_cwidth == MAX_CU_SIZE / 2);
 
     uint32_t zorder           = cu->getZorderIdxInCU() + absPartIdx;
     Pel*     reconIPred       = (chromaId > 0 ? cu->getPic()->getPicYuvRec()->getCrAddr(cu->getAddr(), zorder) : cu->getPic()->getPicYuvRec()->getCbAddr(cu->getAddr(), zorder));
@@ -636,7 +636,7 @@ void TEncSearch::xIntraCodingChromaBlk(T
     }
 
     //===== reconstruction =====
-    primitives.calcrecon[size](pred, residual, recon, reconQt, reconIPred, stride, reconQtStride, reconIPredStride);
+    primitives.calcrecon[size](pred, residual, recon, reconQt, reconIPred, stride, MAX_CU_SIZE / 2, reconIPredStride);
 
     //===== update distortion =====
     uint32_t dist = primitives.sse_pp[part](fenc, stride, recon, stride);
@@ -954,24 +954,24 @@ void TEncSearch::xRecurIntraCodingQT(TCo
         uint32_t qtLayer   = cu->getSlice()->getSPS()->getQuadtreeTULog2MaxSize() - trSizeLog2;
         uint32_t zorder    = cu->getZorderIdxInCU() + absPartIdx;
         int16_t* src       = m_qtTempTComYuv[qtLayer].getLumaAddr(absPartIdx);
-        uint32_t srcstride = m_qtTempTComYuv[qtLayer].m_width;
+        assert(m_qtTempTComYuv[qtLayer].m_width == MAX_CU_SIZE);
         Pel*     dst       = cu->getPic()->getPicYuvRec()->getLumaAddr(cu->getAddr(), zorder);
         uint32_t dststride = cu->getPic()->getPicYuvRec()->getStride();
-        primitives.blockcpy_ps(width, height, dst, dststride, src, srcstride);
+        primitives.blockcpy_ps(width, height, dst, dststride, src, MAX_CU_SIZE);
 
         if (!bLumaOnly)
         {
             width >>= 1;
             height >>= 1;
             src       = m_qtTempTComYuv[qtLayer].getCbAddr(absPartIdx);
-            srcstride = m_qtTempTComYuv[qtLayer].m_cwidth;
+            assert(m_qtTempTComYuv[qtLayer].m_cwidth == MAX_CU_SIZE / 2);
             dst       = cu->getPic()->getPicYuvRec()->getCbAddr(cu->getAddr(), zorder);
             dststride = cu->getPic()->getPicYuvRec()->getCStride();
-            primitives.blockcpy_ps(width, height, dst, dststride, src, srcstride);
+            primitives.blockcpy_ps(width, height, dst, dststride, src, MAX_CU_SIZE / 2);
 
             src = m_qtTempTComYuv[qtLayer].getCrAddr(absPartIdx);
             dst = cu->getPic()->getPicYuvRec()->getCrAddr(cu->getAddr(), zorder);
-            primitives.blockcpy_ps(width, height, dst, dststride, src, srcstride);
+            primitives.blockcpy_ps(width, height, dst, dststride, src, MAX_CU_SIZE / 2);
         }
     }
 
@@ -1134,10 +1134,10 @@ void TEncSearch::xLoadIntraResultQT(TCom
     Pel*   reconIPred       = cu->getPic()->getPicYuvRec()->getLumaAddr(cu->getAddr(), zOrder);
     uint32_t   reconIPredStride = cu->getPic()->getPicYuvRec()->getStride();
     int16_t* reconQt          = m_qtTempTComYuv[qtlayer].getLumaAddr(absPartIdx);
-    uint32_t   reconQtStride    = m_qtTempTComYuv[qtlayer].m_width;
+    assert(m_qtTempTComYuv[qtlayer].m_width == MAX_CU_SIZE);
     uint32_t   width            = cu->getWidth(0) >> trDepth;
     uint32_t   height           = cu->getHeight(0) >> trDepth;
-    primitives.blockcpy_ps(width, height, reconIPred, reconIPredStride, reconQt, reconQtStride);
+    primitives.blockcpy_ps(width, height, reconIPred, reconIPredStride, reconQt, MAX_CU_SIZE);
 
     if (!bLumaOnly && !bSkipChroma)
     {
@@ -1146,12 +1146,12 @@ void TEncSearch::xLoadIntraResultQT(TCom
         reconIPred = cu->getPic()->getPicYuvRec()->getCbAddr(cu->getAddr(), zOrder);
         reconIPredStride = cu->getPic()->getPicYuvRec()->getCStride();
         reconQt = m_qtTempTComYuv[qtlayer].getCbAddr(absPartIdx);
-        reconQtStride = m_qtTempTComYuv[qtlayer].m_cwidth;
-        primitives.blockcpy_ps(width, height, reconIPred, reconIPredStride, reconQt, reconQtStride);
+        assert(m_qtTempTComYuv[qtlayer].m_cwidth == MAX_CU_SIZE / 2);
+        primitives.blockcpy_ps(width, height, reconIPred, reconIPredStride, reconQt, MAX_CU_SIZE / 2);
 
         reconIPred = cu->getPic()->getPicYuvRec()->getCrAddr(cu->getAddr(), zOrder);
         reconQt    = m_qtTempTComYuv[qtlayer].getCrAddr(absPartIdx);
-        primitives.blockcpy_ps(width, height, reconIPred, reconIPredStride, reconQt, reconQtStride);
+        primitives.blockcpy_ps(width, height, reconIPred, reconIPredStride, reconQt, MAX_CU_SIZE / 2);
     }
 }
 
@@ -1255,20 +1255,20 @@ void TEncSearch::xLoadIntraResultChromaQ
         uint32_t zorder           = cu->getZorderIdxInCU() + absPartIdx;
         uint32_t width            = cu->getWidth(0) >> (trDepth + 1);
         uint32_t height           = cu->getHeight(0) >> (trDepth + 1);
-        uint32_t reconQtStride    = m_qtTempTComYuv[qtlayer].m_cwidth;
+        assert(m_qtTempTComYuv[qtlayer].m_cwidth == MAX_CU_SIZE / 2);
         uint32_t reconIPredStride = cu->getPic()->getPicYuvRec()->getCStride();
 
         if (stateU0V1Both2 == 0 || stateU0V1Both2 == 2)
         {
             Pel* reconIPred = cu->getPic()->getPicYuvRec()->getCbAddr(cu->getAddr(), zorder);
             int16_t* reconQt  = m_qtTempTComYuv[qtlayer].getCbAddr(absPartIdx);
-            primitives.blockcpy_ps(width, height, reconIPred, reconIPredStride, reconQt, reconQtStride);
+            primitives.blockcpy_ps(width, height, reconIPred, reconIPredStride, reconQt, MAX_CU_SIZE / 2);
         }
         if (stateU0V1Both2 == 1 || stateU0V1Both2 == 2)
         {
             Pel* reconIPred = cu->getPic()->getPicYuvRec()->getCrAddr(cu->getAddr(), zorder);
             int16_t* reconQt  = m_qtTempTComYuv[qtlayer].getCrAddr(absPartIdx);
-            primitives.blockcpy_ps(width, height, reconIPred, reconIPredStride, reconQt, reconQtStride);
+            primitives.blockcpy_ps(width, height, reconIPred, reconIPredStride, reconQt, MAX_CU_SIZE / 2);
         }
     }
 }
@@ -1809,11 +1809,17 @@ void TEncSearch::estIntraPredQT(TComData
                 dststride   = cu->getPic()->getPicYuvRec()->getCStride();
                 src         = reconYuv->getCbAddr(partOffset);
                 srcstride   = reconYuv->getCStride();
-                primitives.chroma_copy_pp[part](dst, dststride, src, srcstride);
+                if (bChromaSame)
+                    primitives.luma_copy_pp[part](dst, dststride, src, srcstride);
+                else
+                    primitives.chroma_copy_pp[part](dst, dststride, src, srcstride);
 
                 dst         = cu->getPic()->getPicYuvRec()->getCrAddr(cu->getAddr(), zorder);
                 src         = reconYuv->getCrAddr(partOffset);
-                primitives.chroma_copy_pp[part](dst, dststride, src, srcstride);
+                if (bChromaSame)
+                    primitives.luma_copy_pp[part](dst, dststride, src, srcstride);
+                else
+                    primitives.chroma_copy_pp[part](dst, dststride, src, srcstride);
             }
         }
 
@@ -3182,10 +3188,10 @@ void TEncSearch::xEstimateResidualQT(TCo
 
             int scalingListType = 3 + g_eTTable[(int)TEXT_LUMA];
             assert(scalingListType < 6);
-            m_trQuant->invtransformNxN(cu->getCUTransquantBypass(absPartIdx), REG_DCT, curResiY, m_qtTempTComYuv[qtlayer].m_width,  coeffCurY, trWidth, trHeight, scalingListType, false, lastPosY); //this is for inter mode only
-
-            const uint32_t nonZeroDistY = primitives.sse_ss[partSize](resiYuv->getLumaAddr(absTUPartIdx), resiYuv->m_width, m_qtTempTComYuv[qtlayer].getLumaAddr(absTUPartIdx),
-                                                                      m_qtTempTComYuv[qtlayer].m_width);
+            assert(m_qtTempTComYuv[qtlayer].m_width == MAX_CU_SIZE);
+            m_trQuant->invtransformNxN(cu->getCUTransquantBypass(absPartIdx), REG_DCT, curResiY, MAX_CU_SIZE,  coeffCurY, trWidth, trHeight, scalingListType, false, lastPosY); //this is for inter mode only
+
+            const uint32_t nonZeroDistY = primitives.sse_ss[partSize](resiYuv->getLumaAddr(absTUPartIdx), resiYuv->m_width, m_qtTempTComYuv[qtlayer].getLumaAddr(absTUPartIdx), MAX_CU_SIZE);
             if (cu->isLosslessCoded(0))
             {
                 distY = nonZeroDistY;
@@ -3227,10 +3233,10 @@ void TEncSearch::xEstimateResidualQT(TCo
         if (!absSumY)
         {
             int16_t *ptr =  m_qtTempTComYuv[qtlayer].getLumaAddr(absTUPartIdx);
-            const uint32_t stride = m_qtTempTComYuv[qtlayer].m_width;
+            assert(m_qtTempTComYuv[qtlayer].m_width == MAX_CU_SIZE);
 
             assert(trWidth == trHeight);
-            primitives.blockfill_s[(int)g_convertToBit[trWidth]](ptr, stride, 0);
+            primitives.blockfill_s[(int)g_convertToBit[trWidth]](ptr, MAX_CU_SIZE, 0);
         }
 
         uint32_t distU = 0;
@@ -3254,11 +3260,12 @@ void TEncSearch::xEstimateResidualQT(TCo
 
                 int scalingListType = 3 + g_eTTable[(int)TEXT_CHROMA_U];
                 assert(scalingListType < 6);
-                m_trQuant->invtransformNxN(cu->getCUTransquantBypass(absPartIdx), REG_DCT, pcResiCurrU, m_qtTempTComYuv[qtlayer].m_cwidth, coeffCurU, trWidthC, trHeightC, scalingListType, false, lastPosU);
+                assert(m_qtTempTComYuv[qtlayer].m_cwidth == MAX_CU_SIZE / 2);
+                m_trQuant->invtransformNxN(cu->getCUTransquantBypass(absPartIdx), REG_DCT, pcResiCurrU, MAX_CU_SIZE / 2, coeffCurU, trWidthC, trHeightC, scalingListType, false, lastPosU);
 
                 uint32_t dist = primitives.sse_ss[partSizeC](resiYuv->getCbAddr(absTUPartIdxC), resiYuv->m_cwidth,
                                                              m_qtTempTComYuv[qtlayer].getCbAddr(absTUPartIdxC),
-                                                             m_qtTempTComYuv[qtlayer].m_cwidth);
+                                                             MAX_CU_SIZE / 2);
                 const uint32_t nonZeroDistU = m_rdCost->scaleChromaDistCb(dist);
 
                 if (cu->isLosslessCoded(0))
@@ -3301,10 +3308,10 @@ void TEncSearch::xEstimateResidualQT(TCo
             if (!absSumU)
             {
                 int16_t *ptr = m_qtTempTComYuv[qtlayer].getCbAddr(absTUPartIdxC);
-                const uint32_t stride = m_qtTempTComYuv[qtlayer].m_cwidth;
+                assert(m_qtTempTComYuv[qtlayer].m_cwidth == MAX_CU_SIZE / 2);
 
                 assert(trWidthC == trHeightC);
-                primitives.blockfill_s[(int)g_convertToBit[trWidthC]](ptr, stride, 0);
+                primitives.blockfill_s[(int)g_convertToBit[trWidthC]](ptr, MAX_CU_SIZE / 2, 0);
             }
 
             distV = m_rdCost->scaleChromaDistCr(primitives.sse_sp[partSizeC](resiYuv->getCrAddr(absTUPartIdxC), resiYuv->m_cwidth, m_tempPel, trWidthC));
@@ -3320,11 +3327,12 @@ void TEncSearch::xEstimateResidualQT(TCo
 
                 int scalingListType = 3 + g_eTTable[(int)TEXT_CHROMA_V];
                 assert(scalingListType < 6);
-                m_trQuant->invtransformNxN(cu->getCUTransquantBypass(absPartIdx), REG_DCT, curResiV, m_qtTempTComYuv[qtlayer].m_cwidth, coeffCurV, trWidthC, trHeightC, scalingListType, false, lastPosV);
+                assert(m_qtTempTComYuv[qtlayer].m_cwidth == MAX_CU_SIZE / 2);
+                m_trQuant->invtransformNxN(cu->getCUTransquantBypass(absPartIdx), REG_DCT, curResiV, MAX_CU_SIZE / 2, coeffCurV, trWidthC, trHeightC, scalingListType, false, lastPosV);
 
                 uint32_t dist = primitives.sse_ss[partSizeC](resiYuv->getCrAddr(absTUPartIdxC), resiYuv->m_cwidth,
                                                              m_qtTempTComYuv[qtlayer].getCrAddr(absTUPartIdxC),
-                                                             m_qtTempTComYuv[qtlayer].m_cwidth);
+                                                             MAX_CU_SIZE / 2);
                 const uint32_t nonZeroDistV = m_rdCost->scaleChromaDistCr(dist);
 
                 if (cu->isLosslessCoded(0))
@@ -3367,10 +3375,10 @@ void TEncSearch::xEstimateResidualQT(TCo
             if (!absSumV)
             {
                 int16_t *ptr =  m_qtTempTComYuv[qtlayer].getCrAddr(absTUPartIdxC);
-                const uint32_t stride = m_qtTempTComYuv[qtlayer].m_cwidth;
+                assert(m_qtTempTComYuv[qtlayer].m_cwidth == MAX_CU_SIZE / 2);
 
                 assert(trWidthC == trHeightC);
-                primitives.blockfill_s[(int)g_convertToBit[trWidthC]](ptr, stride, 0);
+                primitives.blockfill_s[(int)g_convertToBit[trWidthC]](ptr, MAX_CU_SIZE / 2, 0);
             }
         }
         cu->setCbfSubParts(absSumY ? setCbf : 0, TEXT_LUMA, absPartIdx, depth);
@@ -3387,7 +3395,7 @@ void TEncSearch::xEstimateResidualQT(TCo
             UInt64 singleCostY = MAX_INT64;
 
             int16_t *curResiY = m_qtTempTComYuv[qtlayer].getLumaAddr(absTUPartIdx);