[x265-commits] [x265] blockcopy_sp_4x2, optimized asm code according to modified C primitive

Praveen at videolan.org
Fri Nov 8 22:31:43 CET 2013


details:   http://hg.videolan.org/x265/rev/85dddb9aa165
branches:  
changeset: 4949:85dddb9aa165
user:      Praveen Tiwari
date:      Fri Nov 08 14:27:44 2013 +0530
description:
blockcopy_sp_4x2, optimized asm code according to modified C primitive
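For reference, the scalar C primitive these asm routines mirror is a plain short-to-pixel block copy. A minimal sketch, assuming a blockcopy_sp-style layout (names and the template parameters are illustrative, not x265's exact interface):

    #include <cstdint>

    typedef uint8_t pixel;

    // Illustrative reference for a blockcopy_sp (short -> pixel) copy of a
    // bx x by block: each int16_t source sample is narrowed into a pixel,
    // row by row, with independent source and destination strides.
    template<int bx, int by>
    void blockcopy_sp_ref(pixel* dst, intptr_t dstStride,
                          const int16_t* src, intptr_t srcStride)
    {
        for (int y = 0; y < by; y++)
        {
            for (int x = 0; x < bx; x++)
                dst[x] = (pixel)src[x];   // assumes the values already fit the pixel range

            dst += dstStride;
            src += srcStride;
        }
    }

The 4x2 case above would correspond to blockcopy_sp_ref<4, 2>; the following commits cover the remaining block sizes the same way.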
Subject: [x265] blockcopy_sp_4x4, optimized asm code according to modified C primitive

details:   http://hg.videolan.org/x265/rev/d5f67f9cba2c
branches:  
changeset: 4950:d5f67f9cba2c
user:      Praveen Tiwari
date:      Fri Nov 08 14:43:32 2013 +0530
description:
blockcopy_sp_4x4, optimized asm code according to modified C primitive
Subject: [x265] blockcopy_sp_4x16, optimized asm code

details:   http://hg.videolan.org/x265/rev/b20b89bf5348
branches:  
changeset: 4951:b20b89bf5348
user:      Praveen Tiwari
date:      Fri Nov 08 15:14:12 2013 +0530
description:
blockcopy_sp_4x16, optimized asm code
Subject: [x265] blockcopy_sp_4x8, optimized asm code

details:   http://hg.videolan.org/x265/rev/ceed26f375d5
branches:  
changeset: 4952:ceed26f375d5
user:      Praveen Tiwari
date:      Fri Nov 08 15:04:10 2013 +0530
description:
blockcopy_sp_4x8, optimized asm code
Subject: [x265] blockcopy_sp_8x4, optimized asm code

details:   http://hg.videolan.org/x265/rev/27c70b409c1b
branches:  
changeset: 4953:27c70b409c1b
user:      Praveen Tiwari
date:      Fri Nov 08 16:40:28 2013 +0530
description:
blockcopy_sp_8x4, optimized asm code
Subject: [x265] blockcopy_sp_8x6, optimized asm code

details:   http://hg.videolan.org/x265/rev/2fd3cf3b5edb
branches:  
changeset: 4954:2fd3cf3b5edb
user:      Praveen Tiwari
date:      Fri Nov 08 16:53:24 2013 +0530
description:
blockcopy_sp_8x6, optimized asm code
Subject: [x265] blockcopy_sp_8x2, optimized asm code

details:   http://hg.videolan.org/x265/rev/c8d25ce3b965
branches:  
changeset: 4955:c8d25ce3b965
user:      Praveen Tiwari
date:      Fri Nov 08 16:25:45 2013 +0530
description:
blockcopy_sp_8x2, optimized asm code
Subject: [x265] blockcopy_sp_8x8, optimized asm code

details:   http://hg.videolan.org/x265/rev/8cfa90a574f8
branches:  
changeset: 4956:8cfa90a574f8
user:      Praveen Tiwari
date:      Fri Nov 08 17:28:57 2013 +0530
description:
blockcopy_sp_8x8, optimized asm code
Subject: [x265] blockcopy_sp_8x16, optimized asm code

details:   http://hg.videolan.org/x265/rev/a0b003aac23e
branches:  
changeset: 4957:a0b003aac23e
user:      Praveen Tiwari
date:      Fri Nov 08 17:38:24 2013 +0530
description:
blockcopy_sp_8x16, optimized asm code
Subject: [x265] blockcopy_sp_12x16, optimized asm code

details:   http://hg.videolan.org/x265/rev/970517e2eb4c
branches:  
changeset: 4958:970517e2eb4c
user:      Praveen Tiwari
date:      Fri Nov 08 19:26:48 2013 +0530
description:
blockcopy_sp_12x16, optimized asm code
Subject: [x265] blockcopy_sp_16xN, optimized asm code

details:   http://hg.videolan.org/x265/rev/a1a9b29cccf9
branches:  
changeset: 4959:a1a9b29cccf9
user:      Praveen Tiwari
date:      Fri Nov 08 19:01:56 2013 +0530
description:
blockcopy_sp_16xN, optimized asm code
Subject: [x265] blockcopy_sp_24x32, optimized asm code

details:   http://hg.videolan.org/x265/rev/3cf4edc66844
branches:  
changeset: 4960:3cf4edc66844
user:      Praveen Tiwari
date:      Fri Nov 08 19:55:38 2013 +0530
description:
blockcopy_sp_24x32, optimized asm code
Subject: [x265] blockcopy_sp_64xN, optimized asm code

details:   http://hg.videolan.org/x265/rev/a1c0eb5f5d84
branches:  
changeset: 4961:a1c0eb5f5d84
user:      Praveen Tiwari
date:      Fri Nov 08 20:58:14 2013 +0530
description:
blockcopy_sp_64xN, optimized asm code
Subject: [x265] blockcopy_sp_48x64, optimized asm code

details:   http://hg.videolan.org/x265/rev/fa5544054a1d
branches:  
changeset: 4962:fa5544054a1d
user:      Praveen Tiwari
date:      Fri Nov 08 21:17:13 2013 +0530
description:
blockcopy_sp_48x64, optimized asm code
Subject: [x265] blockcopy_sp_32xN, optimized asm code

details:   http://hg.videolan.org/x265/rev/b95f9e753039
branches:  
changeset: 4963:b95f9e753039
user:      Praveen Tiwari
date:      Fri Nov 08 21:21:21 2013 +0530
description:
blockcopy_sp_32xN, optimized asm code
Subject: [x265] blockcopy_sp_6x8, optimized asm code

details:   http://hg.videolan.org/x265/rev/073ca727d5de
branches:  
changeset: 4964:073ca727d5de
user:      Praveen Tiwari
date:      Fri Nov 08 21:50:23 2013 +0530
description:
blockcopy_sp_6x8, optimized asm code
Subject: [x265] blockcopy_sp_2x4, optimized asm code

details:   http://hg.videolan.org/x265/rev/7bd27dfad3bf
branches:  
changeset: 4965:7bd27dfad3bf
user:      Praveen Tiwari
date:      Fri Nov 08 21:58:58 2013 +0530
description:
blockcopy_sp_2x4, optimized asm code
Subject: [x265] blockcopy_sp_2x8, optimized asm code

details:   http://hg.videolan.org/x265/rev/1e7c99e5b511
branches:  
changeset: 4966:1e7c99e5b511
user:      Praveen Tiwari
date:      Fri Nov 08 22:14:25 2013 +0530
description:
blockcopy_sp_2x8, optimized asm code
Subject: [x265] asm: optimised pixel_sad_xN_24x32 assembly code

details:   http://hg.videolan.org/x265/rev/cd16d2ed3128
branches:  
changeset: 4967:cd16d2ed3128
user:      Yuvaraj Venkatesh <yuvaraj at multicorewareinc.com>
date:      Fri Nov 08 17:59:38 2013 +0530
description:
asm: optimised pixel_sad_xN_24x32 assembly code
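The pixel_sad_x3/x4 primitives compute the SAD of one source block against three or four reference candidates in a single call. A rough C++ reference for the 24x32 shape, with an assumed signature (the real x265 interface and its asm use packed SAD instructions and fixed fenc strides):

    #include <cstdint>
    #include <cstdlib>

    typedef uint8_t pixel;

    // Illustrative reference: SAD of one 24x32 source block against N reference
    // blocks that share a stride, writing one sum per reference.
    template<int N>
    void sad_xN_24x32_ref(const pixel* fenc, intptr_t fencStride,
                          const pixel* const fref[N], intptr_t frefStride,
                          int32_t res[N])
    {
        for (int i = 0; i < N; i++)
        {
            int32_t sum = 0;
            const pixel* src = fenc;
            const pixel* ref = fref[i];
            for (int y = 0; y < 32; y++)
            {
                for (int x = 0; x < 24; x++)
                    sum += std::abs((int)src[x] - (int)ref[x]);
                src += fencStride;
                ref += frefStride;
            }
            res[i] = sum;
        }
    }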
Subject: [x265] TEncSearch: cleanup estIntraPredQT to use 32x32 logic for 64x64 blocks

details:   http://hg.videolan.org/x265/rev/abb7c130ca2f
branches:  
changeset: 4968:abb7c130ca2f
user:      Mahesh Doijade
date:      Fri Nov 08 15:34:39 2013 +0530
description:
TEncSearch: cleanup estIntraPredQT to use 32x32 logic for 64x64 blocks
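The 64x64 intra estimate already worked by downscaling to 32x32 and multiplying the SA8D cost by 4; this cleanup folds that path into the generic mode-cost loop via scaleWidth/scaleStride/costMultiplier (see the diff below). A sketch of the cost-scaling idea only; the encoder actually scales the reference samples and predicts directly at 32x32, and costFn32 stands in for primitives.sa8d[3]:

    #include <cstdint>

    typedef uint8_t pixel;

    // 2:1 downscale of a 64x64 block to 32x32 by averaging 2x2 patches,
    // analogous to the scale2D_64to32 primitive used in the diff below.
    static void scale64to32(pixel* dst, const pixel* src, intptr_t srcStride)
    {
        for (int y = 0; y < 32; y++)
            for (int x = 0; x < 32; x++)
            {
                int s = src[2 * y * srcStride + 2 * x]       + src[2 * y * srcStride + 2 * x + 1]
                      + src[(2 * y + 1) * srcStride + 2 * x] + src[(2 * y + 1) * srcStride + 2 * x + 1];
                dst[y * 32 + x] = (pixel)((s + 2) >> 2);
            }
    }

    // Illustrative 64x64 cost estimate: downscale both blocks to 32x32, take a
    // 32x32 block metric, and scale by 4 so the result stays comparable with
    // costs computed on true 64x64 blocks.
    static uint32_t estimate64x64Cost(const pixel* fenc, intptr_t fencStride,
                                      const pixel* pred, intptr_t predStride,
                                      uint32_t (*costFn32)(const pixel*, const pixel*))
    {
        pixel fencScaled[32 * 32];
        pixel predScaled[32 * 32];
        scale64to32(fencScaled, fenc, fencStride);
        scale64to32(predScaled, pred, predStride);
        return 4 * costFn32(fencScaled, predScaled);
    }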
Subject: [x265] TComPicYuv: fixup 16x16 picture padding by using unpadded width as pad base

details:   http://hg.videolan.org/x265/rev/74bed0a288f5
branches:  
changeset: 4969:74bed0a288f5
user:      Steve Borho <steve at borho.org>
date:      Fri Nov 08 14:30:32 2013 -0600
description:
TComPicYuv: fixup 16x16 picture padding by using unpadded width as pad base
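The fix below derives the internal pad from the unpadded dimension instead of the already-padded picture size. A minimal sketch of that rounding, using a hypothetical helper rather than x265's actual code:

    // Extra samples needed to round an unpadded dimension up to the next
    // multiple of 16; 0 when it is already aligned. The bug being fixed used
    // the already-padded picture size as the base, which could give the wrong
    // internal pad.
    static int internalPad16(int unpadded)
    {
        int rem = unpadded & 15;
        return rem ? 16 - rem : 0;
    }

In the actual code the result only replaces padx/pady when the remainder is non-zero, so an existing conformance pad is kept otherwise.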
Subject: [x265] no-rdo: refactor encodeResAndCalcRdInterCU function

details:   http://hg.videolan.org/x265/rev/66659d4a7b31
branches:  
changeset: 4970:66659d4a7b31
user:      Deepthi Devaki <deepthidevaki at multicorewareinc.com>
date:      Fri Nov 08 12:33:47 2013 +0530
description:
no-rdo: refactor encodeResAndCalcRdInterCU function

Split estimateBits and modeDecision into separate steps inside the function. estimateBits
uses a pseudo encode. The bitstream changes with this patch for --rd 1.
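In outline, the refactor separates estimating the bits/distortion of coding the residual from deciding between a coded residual and a forced all-zero residual (the skip-like path). A condensed sketch of that decision, using the distortion plus lambda-weighted bits model of calcRdCost in the diff below (names and signatures simplified, not the encoder's actual interface):

    #include <cstdint>

    struct RdChoice { bool zeroResidual; uint64_t cost; };

    // Illustrative RD cost: distortion plus lambda-weighted bits.
    static uint64_t rdCost(uint32_t distortion, uint32_t bits, double lambda)
    {
        return (uint64_t)(distortion + lambda * bits + 0.5);
    }

    // Mode decision sketch: compare the cost of coding the residual against the
    // cost of signalling a zero residual and keep the cheaper option. Lossless
    // coding never drops the residual, mirroring the isLosslessCoded() guard.
    static RdChoice chooseResidualMode(uint32_t codedBits, uint32_t codedDist,
                                       uint32_t zeroBits,  uint32_t zeroDist,
                                       double lambda, bool lossless)
    {
        uint64_t codedCost = rdCost(codedDist, codedBits, lambda);
        uint64_t zeroCost  = rdCost(zeroDist,  zeroBits,  lambda);

        if (lossless)
            zeroCost = codedCost + 1;   // force the coded-residual path

        if (zeroCost < codedCost)
            return { true, zeroCost };  // clear coefficients/CBFs, keep the prediction
        return { false, codedCost };
    }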
Subject: [x265] presets: adjust presets to increase spread and align closer with x264 presets

details:   http://hg.videolan.org/x265/rev/8487f675effa
branches:  
changeset: 4971:8487f675effa
user:      Steve Borho <steve at borho.org>
date:      Wed Nov 06 23:10:46 2013 -0600
description:
presets: adjust presets to increase spread and align closer with x264 presets
Subject: [x265] common: set default params to match medium preset, keep star search for medium

details:   http://hg.videolan.org/x265/rev/5b688170c506
branches:  
changeset: 4972:5b688170c506
user:      Steve Borho <steve at borho.org>
date:      Fri Nov 08 15:17:18 2013 -0600
description:
common: set default params to match medium preset, keep star search for medium

diffstat:

 source/Lib/TLibCommon/TComPicYuv.cpp  |    4 +-
 source/Lib/TLibEncoder/TEncSearch.cpp |  286 +++++++---
 source/Lib/TLibEncoder/TEncSearch.h   |   11 +
 source/common/common.cpp              |   57 +-
 source/common/x86/blockcopy8.asm      |  921 ++++++++++++++-------------------
 source/common/x86/sad-a.asm           |  167 ++---
 source/encoder/compress.cpp           |    7 +-
 7 files changed, 713 insertions(+), 740 deletions(-)

diffs (truncated from 2042 to 300 lines):

diff -r fef74c2e329d -r 5b688170c506 source/Lib/TLibCommon/TComPicYuv.cpp
--- a/source/Lib/TLibCommon/TComPicYuv.cpp	Fri Nov 08 02:57:47 2013 -0600
+++ b/source/Lib/TLibCommon/TComPicYuv.cpp	Fri Nov 08 15:17:18 2013 -0600
@@ -348,9 +348,9 @@ void TComPicYuv::copyFromPicture(const x
     int height = m_picHeight - pady;
 
     /* internal pad to multiple of 16x16 blocks */
-    uint8_t rem = m_picWidth & 15;
+    uint8_t rem = width & 15;
     padx = rem ? 16 - rem : padx;
-    rem = m_picHeight & 15;
+    rem = width & 15;
     pady = rem ? 16 - rem : pady;
 
 #if HIGH_BIT_DEPTH
diff -r fef74c2e329d -r 5b688170c506 source/Lib/TLibEncoder/TEncSearch.cpp
--- a/source/Lib/TLibEncoder/TEncSearch.cpp	Fri Nov 08 02:57:47 2013 -0600
+++ b/source/Lib/TLibEncoder/TEncSearch.cpp	Fri Nov 08 15:17:18 2013 -0600
@@ -1557,12 +1557,9 @@ void TEncSearch::estIntraPredQT(TComData
         //===== determine set of modes to be tested (using prediction signal only) =====
         int numModesAvailable = 35; //total number of Intra modes
         Pel* fenc   = fencYuv->getLumaAddr(pu, width);
-        Pel* pred   = predYuv->getLumaAddr(pu, width);
         uint32_t stride = predYuv->getStride();
         uint32_t rdModeList[FAST_UDI_MAX_RDMODE_NUM];
         int numModesForFullRD = g_intraModeNumFast[widthBit];
-        int log2SizeMinus2 = g_convertToBit[width];
-        pixelcmp_t sa8d = primitives.sa8d[log2SizeMinus2];
 
         bool doFastSearch = (numModesForFullRD != numModesAvailable);
         if (doFastSearch)
@@ -1577,100 +1574,77 @@ void TEncSearch::estIntraPredQT(TComData
             candNum = 0;
             uint32_t modeCosts[35];
 
-            Pel *pAbove0 = refAbove    + width - 1;
-            Pel *pAbove1 = refAboveFlt + width - 1;
-            Pel *pLeft0  = refLeft     + width - 1;
-            Pel *pLeft1  = refLeftFlt  + width - 1;
+            Pel *above         = refAbove    + width - 1;
+            Pel *aboveFiltered = refAboveFlt + width - 1;
+            Pel *left          = refLeft     + width - 1;
+            Pel *leftFiltered  = refLeftFlt  + width - 1;
 
             // 33 Angle modes once
             ALIGN_VAR_32(Pel, buf_trans[32 * 32]);
             ALIGN_VAR_32(Pel, tmp[33 * 32 * 32]);
-
-            if (width <= 32)
+            int scaleWidth = width;
+            int scaleStride = stride;
+            int costMultiplier = 1;
+
+            if (width > 32)
             {
-                // 1
-                primitives.intra_pred_dc(pAbove0 + 1, pLeft0 + 1, pred, stride, width, (width <= 16));
-                modeCosts[DC_IDX] = sa8d(fenc, stride, pred, stride);
-
-                // 0
-                Pel *above   = pAbove0;
-                Pel *left    = pLeft0;
-                if (width >= 8 && width <= 32)
-                {
-                    above = pAbove1;
-                    left  = pLeft1;
-                }
-                primitives.intra_pred_planar((pixel*)above + 1, (pixel*)left + 1, pred, stride, width);
-                modeCosts[PLANAR_IDX] = sa8d(fenc, stride, pred, stride);
-
-                // Transpose NxN
-                primitives.transpose[log2SizeMinus2](buf_trans, (pixel*)fenc, stride);
-
-                primitives.intra_pred_allangs[log2SizeMinus2](tmp, pAbove0, pLeft0, pAbove1, pLeft1, (width <= 16));
-
-                // TODO: We need SATD_x4 here
-                for (uint32_t mode = 2; mode < numModesAvailable; mode++)
-                {
-                    bool modeHor = (mode < 18);
-                    Pel *cmp = (modeHor ? buf_trans : fenc);
-                    intptr_t srcStride = (modeHor ? width : stride);
-                    modeCosts[mode] = sa8d(cmp, srcStride, &tmp[(mode - 2) * (width * width)], width);
-                }
-            }
-            else
-            {
-                // origin is 64x64, we scale to 32x32
-                // TODO: cli option to chose
-#if 1
-                ALIGN_VAR_32(Pel, buf_scale[32 * 32]);
-                primitives.scale2D_64to32(buf_scale, fenc, stride);
-                primitives.transpose[3](buf_trans, buf_scale, 32);
+                // origin is 64x64, we scale to 32x32 and setup required parameters
+                ALIGN_VAR_32(Pel, bufScale[32 * 32]);
+                primitives.scale2D_64to32(bufScale, fenc, stride);
+                fenc = bufScale;
 
                 // reserve space in case primitives need to store data in above
                 // or left buffers
                 Pel _above[4 * 32 + 1];
                 Pel _left[4 * 32 + 1];
-                Pel *const above = _above + 2 * 32;
-                Pel *const left = _left + 2 * 32;
-
-                above[0] = left[0] = pAbove0[0];
-                primitives.scale1D_128to64(above + 1, pAbove0 + 1, 0);
-                primitives.scale1D_128to64(left + 1, pLeft0 + 1, 0);
-
-                // 1
-                primitives.intra_pred_dc(above + 1, left + 1, tmp, 32, 32, false);
-                modeCosts[DC_IDX] = 4 * primitives.sa8d[3](buf_scale, 32, tmp, 32);
-
-                // 0
-                primitives.intra_pred_planar((pixel*)above + 1, (pixel*)left + 1, tmp, 32, 32);
-                modeCosts[PLANAR_IDX] = 4 * primitives.sa8d[3](buf_scale, 32, tmp, 32);
-
-                primitives.intra_pred_allangs[3](tmp, above, left, above, left, false);
-
-                // TODO: I use 4 of SATD32x32 to replace real 64x64
-                for (uint32_t mode = 2; mode < numModesAvailable; mode++)
-                {
-                    bool modeHor = (mode < 18);
-                    Pel *cmp_buf = (modeHor ? buf_trans : buf_scale);
-                    modeCosts[mode] = 4 * primitives.sa8d[3]((pixel*)cmp_buf, 32, (pixel*)&tmp[(mode - 2) * (32 * 32)], 32);
-                }
-
-#else // if 1
-                // 1
-                primitives.intra_pred_dc(pAbove0 + 1, pLeft0 + 1, pred, stride, width, false);
-                modeCosts[DC_IDX] = sa8d(fenc, stride, pred, stride);
-
-                // 0
-                primitives.intra_pred_planar((pixel*)pAbove0 + 1, (pixel*)pLeft0 + 1, pred, stride, width);
-                modeCosts[PLANAR_IDX] = sa8d(fenc, stride, pred, stride);
-
-                for (uint32_t mode = 2; mode < numModesAvailable; mode++)
-                {
-                    predIntraLumaAng(mode, pred, stride, width);
-                    modeCosts[mode] = sa8d(fenc, stride, pred, stride);
-                }
-
-#endif // if 1
+                Pel *aboveScale  = _above + 2 * 32;
+                Pel *leftScale   = _left + 2 * 32;
+                aboveScale[0] = leftScale[0] = above[0];
+                primitives.scale1D_128to64(aboveScale + 1, above + 1, 0);
+                primitives.scale1D_128to64(leftScale + 1, left + 1, 0);
+
+                scaleWidth = 32;
+                scaleStride = 32;
+                costMultiplier = 4;
+
+                // Filtered and Unfiltered refAbove and refLeft pointing to above and left.
+                above         = aboveScale;
+                left          = leftScale;
+                aboveFiltered = aboveScale; 
+                leftFiltered  = leftScale;
+            }
+
+            int log2SizeMinus2 = g_convertToBit[scaleWidth];
+            pixelcmp_t sa8d = primitives.sa8d[log2SizeMinus2];
+
+            // DC
+            primitives.intra_pred_dc(above + 1, left + 1, tmp, scaleStride, scaleWidth, (scaleWidth <= 16));
+            modeCosts[DC_IDX] = costMultiplier * sa8d(fenc, scaleStride, tmp, scaleStride);
+
+            Pel *abovePlanar   = above;
+            Pel *leftPlanar    = left;
+
+            if (width >= 8 && width <= 32)
+            {
+                abovePlanar = aboveFiltered;
+                leftPlanar  = leftFiltered;
+            }
+
+            // PLANAR
+            primitives.intra_pred_planar(abovePlanar + 1, leftPlanar + 1, tmp, scaleStride, scaleWidth);
+            modeCosts[PLANAR_IDX] = costMultiplier * sa8d(fenc, scaleStride, tmp, scaleStride);
+
+            // Transpose NxN
+            primitives.transpose[log2SizeMinus2](buf_trans, fenc, scaleStride);
+
+            primitives.intra_pred_allangs[log2SizeMinus2](tmp, above, left, aboveFiltered, leftFiltered, (scaleWidth <= 16));
+
+            for (uint32_t mode = 2; mode < numModesAvailable; mode++)
+            {
+                bool modeHor = (mode < 18);
+                Pel *cmp = (modeHor ? buf_trans : fenc);
+                intptr_t srcStride = (modeHor ? scaleWidth : scaleStride);
+                modeCosts[mode] = costMultiplier * sa8d(cmp, srcStride, &tmp[(mode - 2) * (scaleWidth * scaleWidth)], scaleWidth);
             }
 
             // Find N least cost modes. N = numModesForFullRD
@@ -2941,6 +2915,144 @@ void TEncSearch::encodeResAndCalcRdInter
     cu->setQPSubParts(qpBest, 0, cu->getDepth(0));
 }
 
+void TEncSearch::estimateRDInterCU(TComDataCU* cu, TComYuv* fencYuv, TComYuv* predYuv, TShortYUV* outResiYuv,
+                                   TShortYUV* outBestResiYuv, TComYuv* outReconYuv, bool /*bSkipRes*/, bool curUseRDOQ)
+{
+    uint32_t width  = cu->getWidth(0);
+    uint32_t height = cu->getHeight(0);
+
+    outResiYuv->subtract(fencYuv, predYuv, 0, width);
+
+    uint32_t zerobits = estimateZerobits(cu);
+    uint32_t zerodistortion = estimateZeroDist(cu, fencYuv, predYuv);
+    uint64_t zerocost = m_rdCost->calcRdCost(zerodistortion, zerobits);
+
+    uint32_t distortion = 0;
+    uint32_t bits = 0;
+    estimateBitsDist(cu, outResiYuv, bits, distortion, curUseRDOQ);
+    uint64_t cost = m_rdCost->calcRdCost(distortion, bits);
+
+    if (cu->isLosslessCoded(0))
+    {
+        zerocost = cost + 1;
+    }
+
+    if (zerocost < cost)
+    {
+        const uint32_t qpartnum = cu->getPic()->getNumPartInCU() >> (cu->getDepth(0) << 1);
+        ::memset(cu->getTransformIdx(), 0, qpartnum * sizeof(UChar));
+        ::memset(cu->getCbf(TEXT_LUMA), 0, qpartnum * sizeof(UChar));
+        ::memset(cu->getCbf(TEXT_CHROMA_U), 0, qpartnum * sizeof(UChar));
+        ::memset(cu->getCbf(TEXT_CHROMA_V), 0, qpartnum * sizeof(UChar));
+        ::memset(cu->getCoeffY(), 0, width * height * sizeof(TCoeff));
+        ::memset(cu->getCoeffCb(), 0, width * height * sizeof(TCoeff) >> 2);
+        ::memset(cu->getCoeffCr(), 0, width * height * sizeof(TCoeff) >> 2);
+        cu->setTransformSkipSubParts(0, 0, 0, 0, cu->getDepth(0));
+        if (cu->getMergeFlag(0) && cu->getPartitionSize(0) == SIZE_2Nx2N)
+        {
+            cu->setSkipFlagSubParts(true, 0, cu->getDepth(0));
+        }
+        bits = zerobits;
+        outBestResiYuv->clear();
+        generateRecon(cu, predYuv, outBestResiYuv, outReconYuv, true);
+    }
+    else
+    {
+        xSetResidualQTData(cu, 0, 0, outBestResiYuv, cu->getDepth(0), true);
+        generateRecon(cu, predYuv, outBestResiYuv, outReconYuv, false);
+    }
+
+    int part = partitionFromSizes(width, height);
+    distortion = primitives.sse_pp[part](fencYuv->getLumaAddr(), fencYuv->getStride(), outReconYuv->getLumaAddr(), outReconYuv->getStride());
+    part = partitionFromSizes(width >> 1, height >> 1);
+    distortion += m_rdCost->scaleChromaDistCb(primitives.sse_pp[part](fencYuv->getCbAddr(), fencYuv->getCStride(), outReconYuv->getCbAddr(), outReconYuv->getCStride()));
+    distortion += m_rdCost->scaleChromaDistCr(primitives.sse_pp[part](fencYuv->getCrAddr(), fencYuv->getCStride(), outReconYuv->getCrAddr(), outReconYuv->getCStride()));
+
+    cu->m_totalBits       = bits;
+    cu->m_totalDistortion = distortion;
+    cu->m_totalCost       = m_rdCost->calcRdCost(distortion, bits);
+}
+
+uint32_t TEncSearch::estimateZerobits(TComDataCU* cu)
+{
+    if (cu->isIntra(0))
+    {
+        return 0;
+    }
+
+    uint32_t zeroResiBits = 0;
+
+    uint32_t width  = cu->getWidth(0);
+    uint32_t height = cu->getHeight(0);
+
+    const uint32_t qpartnum = cu->getPic()->getNumPartInCU() >> (cu->getDepth(0) << 1);
+    ::memset(cu->getTransformIdx(), 0, qpartnum * sizeof(UChar));
+    ::memset(cu->getCbf(TEXT_LUMA), 0, qpartnum * sizeof(UChar));
+    ::memset(cu->getCbf(TEXT_CHROMA_U), 0, qpartnum * sizeof(UChar));
+    ::memset(cu->getCbf(TEXT_CHROMA_V), 0, qpartnum * sizeof(UChar));
+    ::memset(cu->getCoeffY(), 0, width * height * sizeof(TCoeff));
+    ::memset(cu->getCoeffCb(), 0, width * height * sizeof(TCoeff) >> 2);
+    ::memset(cu->getCoeffCr(), 0, width * height * sizeof(TCoeff) >> 2);
+    cu->setTransformSkipSubParts(0, 0, 0, 0, cu->getDepth(0));
+
+    m_rdGoOnSbacCoder->load(m_rdSbacCoders[cu->getDepth(0)][CI_CURR_BEST]);
+    zeroResiBits = xSymbolBitsInter(cu);
+    // Reset skipflags to false which would have set to true by xSymbolBitsInter if merge-skip
+    cu->setSkipFlagSubParts(false, 0, cu->getDepth(0));
+    return zeroResiBits;
+}
+
+uint32_t TEncSearch::estimateZeroDist(TComDataCU* cu, TComYuv* fencYuv, TComYuv* predYuv)
+{
+    uint32_t distortion = 0;
+
+    uint32_t width  = cu->getWidth(0);
+    uint32_t height = cu->getHeight(0);
+
+    int part = partitionFromSizes(width, height);
+
+    distortion = primitives.sse_pp[part](fencYuv->getLumaAddr(), fencYuv->getStride(), predYuv->getLumaAddr(), predYuv->getStride());
+    part = partitionFromSizes(width >> 1, height >> 1);
+    distortion += m_rdCost->scaleChromaDistCb(primitives.sse_pp[part](fencYuv->getCbAddr(), fencYuv->getCStride(), predYuv->getCbAddr(), predYuv->getCStride()));
+    distortion += m_rdCost->scaleChromaDistCr(primitives.sse_pp[part](fencYuv->getCrAddr(), fencYuv->getCStride(), predYuv->getCrAddr(), predYuv->getCStride()));
+    return distortion;
+}
+

