[x265-commits] [x265] search: remove unnecessary set of cbf flags in xEstimateR...

Sun Oct 26 02:14:59 CEST 2014

details:   http://hg.videolan.org/x265/rev/a0c07b8e583b
branches:  
changeset: 8641:a0c07b8e583b
user:      Ashok Kumar Mishra<ashok at multicorewareinc.com>
date:      Fri Oct 24 13:09:01 2014 +0530
description:
search: remove unnecessary set of cbf flags in xEstimateResidualQT()
Subject: [x265] quant.cpp: nits

details:   http://hg.videolan.org/x265/rev/5f0838850cb5
branches:  
changeset: 8642:5f0838850cb5
user:      Praveen Tiwari
date:      Fri Oct 24 14:28:30 2014 +0530
description:
quant.cpp: nits
Subject: [x265] search: remove redundant cbf flags setting in xEstimateResidualQT()

details:   http://hg.videolan.org/x265/rev/363bd8ef6c6b
branches:  
changeset: 8643:363bd8ef6c6b
user:      Ashok Kumar Mishra<ashok at multicorewareinc.com>
date:      Fri Oct 24 20:34:57 2014 +0530
description:
search: remove redundant cbf flags setting in xEstimateResidualQT()
Subject: [x265] search: refactored xEstimateResidualQT() to remove cbf flag settings

details:   http://hg.videolan.org/x265/rev/759c6cbf54fa
branches:  
changeset: 8644:759c6cbf54fa
user:      Ashok Kumar Mishra<ashok at multicorewareinc.com>
date:      Fri Oct 24 20:35:01 2014 +0530
description:
search: refactored xEstimateResidualQT() to remove cbf flag settings
Subject: [x265] weight_sp: pshufd to handle width 6 for SSE version of asm code

details:   http://hg.videolan.org/x265/rev/1a07740f85f5
branches:  
changeset: 8645:1a07740f85f5
user:      Praveen Tiwari
date:      Fri Oct 24 13:52:45 2014 +0530
description:
weight_sp: pshufd to handle width 6 for SSE version of asm code

Backout of 2cb8cdaa7df5
Subject: [x265] shortyuv: use absPartIdx for CU/TU part offset like everywhere else

details:   http://hg.videolan.org/x265/rev/0922d96a74a6
branches:  
changeset: 8646:0922d96a74a6
user:      Steve Borho <steve at borho.org>
date:      Fri Oct 24 12:55:19 2014 -0500
description:
shortyuv: use absPartIdx for CU/TU part offset like everywhere else
Subject: [x265] yuv: add copyPartToPart* methods for recon RQT finalization

details:   http://hg.videolan.org/x265/rev/d918b786a3e6
branches:  
changeset: 8647:d918b786a3e6
user:      Steve Borho <steve at borho.org>
date:      Fri Oct 24 12:57:09 2014 -0500
description:
yuv: add copyPartToPart* methods for recon RQT finalization

We're switching reconQt to be kept in pixels rather than shorts
Subject: [x265] search: fix 4:2:2 chroma tskip bit-cost estimation

details:   http://hg.videolan.org/x265/rev/0fc9c36d0c92
branches:  
changeset: 8648:0fc9c36d0c92
user:      Steve Borho <steve at borho.org>
date:      Fri Oct 24 14:54:56 2014 -0500
description:
search: fix 4:2:2 chroma tskip bit-cost estimation
Subject: [x265] search: rename tmpCoeff to coeffRQT, tmpShortYuv to reconQtYuv / resiQtYuv

details:   http://hg.videolan.org/x265/rev/847c45521c19
branches:  
changeset: 8649:847c45521c19
user:      Steve Borho <steve at borho.org>
date:      Fri Oct 24 11:24:05 2014 -0500
description:
search: rename tmpCoeff to coeffRQT, tmpShortYuv to reconQtYuv / resiQtYuv

Explain why these buffers are allocated to max CU size at every layer, fix a
few nits
Subject: [x265] search: rename a couple chroma intra helper methods

details:   http://hg.videolan.org/x265/rev/6f964d4cc8ef
branches:  
changeset: 8650:6f964d4cc8ef
user:      Steve Borho <steve at borho.org>
date:      Sat Oct 25 09:58:56 2014 -0500
description:
search: rename a couple chroma intra helper methods
Subject: [x265] search: improve a variable name

details:   http://hg.videolan.org/x265/rev/b51aceca9bd8
branches:  
changeset: 8651:b51aceca9bd8
user:      Steve Borho <steve at borho.org>
date:      Sat Oct 25 09:59:31 2014 -0500
description:
search: improve a variable name
Subject: [x265] search: reconYuv as ref

details:   http://hg.videolan.org/x265/rev/f97c6f14a975
branches:  
changeset: 8652:f97c6f14a975
user:      Steve Borho <steve at borho.org>
date:      Sat Oct 25 09:59:58 2014 -0500
description:
search: reconYuv as ref
Subject: [x265] search: simplify initTrDepth

details:   http://hg.videolan.org/x265/rev/ddafaee9bf39
branches:  
changeset: 8653:ddafaee9bf39
user:      Steve Borho <steve at borho.org>
date:      Sat Oct 25 10:00:30 2014 -0500
description:
search: simplify initTrDepth
Subject: [x265] search: simplify RDO chroma intra coding, changes tskip outputs

details:   http://hg.videolan.org/x265/rev/2261ad40ffe8
branches:  
changeset: 8654:2261ad40ffe8
user:      Steve Borho <steve at borho.org>
date:      Fri Oct 24 22:05:13 2014 -0500
description:
search: simplify RDO chroma intra coding, changes tskip outputs

Since the TU layers above tskip's 4x4 are not encoding their residual (they
only need distortion, not RD cost) there is no reason to try to preserve the
entropy coder state. This gives slightly better compression than
before, when tskip is enabled, and I believe it makes the code a lot more
maintainable.
Subject: [x265] search: give offsetSubTUCBFs a basic comment

details:   http://hg.videolan.org/x265/rev/67ae716977fd
branches:  
changeset: 8655:67ae716977fd
user:      Steve Borho <steve at borho.org>
date:      Sat Oct 25 10:56:30 2014 -0500
description:
search: give offsetSubTUCBFs a basic comment
Subject: [x265] search: rename methods that read coeff and recon from RQT struct at final depths

details:   http://hg.videolan.org/x265/rev/69ee86fd7284
branches:  
changeset: 8656:69ee86fd7284
user:      Steve Borho <steve at borho.org>
date:      Sat Oct 25 11:01:16 2014 -0500
description:
search: rename methods that read coeff and recon from RQT struct at final depths

also reorder arguments and pass reconYuv as a reference
Subject: [x265] search: nit. splitted is not a word

details:   http://hg.videolan.org/x265/rev/794bf8c060d4
branches:  
changeset: 8657:794bf8c060d4
user:      Steve Borho <steve at borho.org>
date:      Sat Oct 25 11:01:31 2014 -0500
description:
search: nit. splitted is not a word
Subject: [x265] search: remove tskip analysis out of luma chroma normal path

details:   http://hg.videolan.org/x265/rev/1ea467c9bb22
branches:  
changeset: 8658:1ea467c9bb22
user:      Steve Borho <steve at borho.org>
date:      Sat Oct 25 11:35:42 2014 -0500
description:
search: remove tskip analysis out of luma chroma normal path
Subject: [x265] search: keep recon QT in pixels, instead of shorts

details:   http://hg.videolan.org/x265/rev/567491c02bf7
branches:  
changeset: 8659:567491c02bf7
user:      Steve Borho <steve at borho.org>
date:      Sat Oct 25 12:55:00 2014 -0500
description:
search: keep recon QT in pixels, instead of shorts

This changes outputs, apparently because SSE is now comparing fenc against the
clipped recon instead of the un-clipped recon. This was punishing residuals
which were close to the pixel dynamic range limits. The user never sees un-
clipped pixels, and external distortion metrics always use clipped recon, so
it makes sense to do the same here (never mind the obvious perf benefits)
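
A minimal sketch of the effect described above, not the x265 code itself: it
assumes 8-bit samples, and the names Pixel8, clipToPixel and sseVsRecon are
hypothetical and exist only for this illustration. It computes SSE of the
source block against pred + residual, with and without clipping the sum to
the pixel range.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical 8-bit sample type, used only in this sketch.
typedef uint8_t Pixel8;

// Clamp a reconstructed value into the 8-bit pixel range [0, 255].
static Pixel8 clipToPixel(int v)
{
    return static_cast<Pixel8>(std::min(255, std::max(0, v)));
}

// Sum of squared errors between the source (fenc) and pred + resi.
// When clipRecon is true the reconstruction is clamped first, which is
// what a decoder (and any external distortion metric) would see.
static uint64_t sseVsRecon(const std::vector<Pixel8>& fenc,
                           const std::vector<Pixel8>& pred,
                           const std::vector<int16_t>& resi,
                           bool clipRecon)
{
    assert(fenc.size() == pred.size() && fenc.size() == resi.size());
    uint64_t sum = 0;
    for (size_t i = 0; i < fenc.size(); i++)
    {
        int recon = int(pred[i]) + int(resi[i]);
        if (clipRecon)
            recon = clipToPixel(recon);
        int diff = int(fenc[i]) - recon;
        sum += static_cast<uint64_t>(diff * diff);
    }
    return sum;
}
```

With fenc = {255, 0}, pred = {250, 4} and resi = {9, -7}, the un-clipped
reconstruction is {259, -3} and is charged for overshoot the viewer never
sees, while the clipped reconstruction matches the source exactly.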
Subject: [x265] search: improve comments and readability of residualTransformQuantIntra

details:   http://hg.videolan.org/x265/rev/58545ea1f6af
branches:  
changeset: 8660:58545ea1f6af
user:      Steve Borho <steve at borho.org>
date:      Sat Oct 25 13:15:28 2014 -0500
description:
search: improve comments and readability of residualTransformQuantIntra
Subject: [x265] search: avoid a context save at the last recursion depth

details:   http://hg.videolan.org/x265/rev/d7fbf10efe61
branches:  
changeset: 8661:d7fbf10efe61
user:      Steve Borho <steve at borho.org>
date:      Sat Oct 25 15:07:58 2014 -0500
description:
search: avoid a context save at the last recursion depth
Subject: [x265] trim x265_emms(), try to only use prior to floating point operations

details:   http://hg.videolan.org/x265/rev/64bb88dc7cb6
branches:  
changeset: 8662:64bb88dc7cb6
user:      Steve Borho <steve at borho.org>
date:      Sat Oct 25 15:18:30 2014 -0500
description:
trim x265_emms(), try to only use prior to floating point operations
Subject: [x265] primitives: remove unused calcrecon primitive (assembly needs cleanup)

details:   http://hg.videolan.org/x265/rev/daa0e77083a7
branches:  
changeset: 8663:daa0e77083a7
user:      Steve Borho <steve at borho.org>
date:      Sat Oct 25 15:34:58 2014 -0500
description:
primitives: remove unused calcrecon primitive (assembly needs cleanup)
Subject: [x265] search: prevent warnings about unused bCheckSplit value

details:   http://hg.videolan.org/x265/rev/5e8e0e5fb760
branches:  
changeset: 8664:5e8e0e5fb760
user:      Steve Borho <steve at borho.org>
date:      Sat Oct 25 15:38:40 2014 -0500
description:
search: prevent warnings about unused bCheckSplit value
Subject: [x265] encoder: issue warnings and explicitly disable tskip or culossless if rd < 3

details:   http://hg.videolan.org/x265/rev/b2aa1fd68ffa
branches:  
changeset: 8665:b2aa1fd68ffa
user:      Steve Borho <steve at borho.org>
date:      Sat Oct 25 15:44:43 2014 -0500
description:
encoder: issue warnings and explicitly disable tskip or culossless if rd < 3

The analysis code is quite incapable of making these RDO decisions at these
RD levels. It's best that these tools never appear to be enabled at these
RD levels, and to explain why.
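
A rough sketch of that kind of configuration check, under stated assumptions:
ParamSketch and disableUnsupportedTools are hypothetical stand-ins, not the
encoder's real parameter struct or function, and the warning text is invented
for illustration. It only shows the pattern of warning and then clearing a
tool flag when the RD level is below 3.

```cpp
#include <cassert>
#include <cstdio>

// Hypothetical stand-in for the encoder parameters touched by this check.
struct ParamSketch
{
    int  rdLevel;     // RD level requested by the user
    bool tskip;       // transform skip requested
    bool cuLossless;  // per-CU lossless requested
};

// Warn about and disable tools that need RDO mode decisions (RD level 3+).
// Returns the number of options that were turned off.
static int disableUnsupportedTools(ParamSketch& p)
{
    assert(p.rdLevel >= 0);
    int disabled = 0;
    if (p.rdLevel < 3)
    {
        if (p.tskip)
        {
            std::fprintf(stderr, "warning: tskip needs RD level 3 or above, disabling\n");
            p.tskip = false;
            disabled++;
        }
        if (p.cuLossless)
        {
            std::fprintf(stderr, "warning: cu-lossless needs RD level 3 or above, disabling\n");
            p.cuLossless = false;
            disabled++;
        }
    }
    return disabled;
}
```

Clearing the flags up front means later code can trust them without
re-checking the RD level.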
Subject: [x265] search: inline updateModeCost

details:   http://hg.videolan.org/x265/rev/e69a8546897a
branches:  
changeset: 8666:e69a8546897a
user:      Steve Borho <steve at borho.org>
date:      Sat Oct 25 15:50:15 2014 -0500
description:
search: inline updateModeCost
Subject: [x265] search: updateCandList() can be a static method

details:   http://hg.videolan.org/x265/rev/4e8edad1f2e6
branches:  
changeset: 8667:4e8edad1f2e6
user:      Steve Borho <steve at borho.org>
date:      Sat Oct 25 15:51:14 2014 -0500
description:
search: updateCandList() can be a static method
Subject: [x265] search: remove resiYuv from Mode, keep tmpResiYuv in m_rqt[]

details:   http://hg.videolan.org/x265/rev/08be12894acd
branches:  
changeset: 8668:08be12894acd
user:      Steve Borho <steve at borho.org>
date:      Sat Oct 25 16:12:29 2014 -0500
description:
search: remove resiYuv from Mode, keep tmpResiYuv in m_rqt[]

The residual buffer is always very short lived; there is no reason to keep a
copy of it per mode.
Subject: [x265] encoder: issue warning and disable --pmode if rdlevel < 2

details:   http://hg.videolan.org/x265/rev/4e7f9bca6f39
branches:  
changeset: 8669:4e7f9bca6f39
user:      Steve Borho <steve at borho.org>
date:      Sat Oct 25 16:13:15 2014 -0500
description:
encoder: issue warning and disable --pmode if rdlevel < 2
Subject: [x265] docs: update --tskip and --cu-lossless docs

details:   http://hg.videolan.org/x265/rev/f81a2cec4183
branches:  
changeset: 8670:f81a2cec4183
user:      Steve Borho <steve at borho.org>
date:      Sat Oct 25 16:21:16 2014 -0500
description:
docs: update --tskip and --cu-lossless docs
Subject: [x265] search: cleanup residualQTIntraChroma

details:   http://hg.videolan.org/x265/rev/4d3797830500
branches:  
changeset: 8671:4d3797830500
user:      Steve Borho <steve at borho.org>
date:      Sat Oct 25 19:01:43 2014 -0500
description:
search: cleanup residualQTIntraChroma

There was a bug where it was reading tskip before setting it to zero, but
fortunately we never allow analysis to set tskip anyway.
Subject: [x265] search: turn some redundant clears of tskip flags into runtime checks

details:   http://hg.videolan.org/x265/rev/72f2b87c86eb
branches:  
changeset: 8672:72f2b87c86eb
user:      Steve Borho <steve at borho.org>
date:      Sat Oct 25 19:02:16 2014 -0500
description:
search: turn some redundant clears of tskip flags into runtime checks
Subject: [x265] search: improve comments in mergeEstimation()

details:   http://hg.videolan.org/x265/rev/5186635c0536
branches:  
changeset: 8673:5186635c0536
user:      Steve Borho <steve at borho.org>
date:      Sat Oct 25 19:02:55 2014 -0500
description:
search: improve comments in mergeEstimation()

diffstat:

 doc/reST/cli.rst                     |     8 +-
 source/common/pixel.cpp              |    23 -
 source/common/primitives.h           |     1 -
 source/common/quant.cpp              |     4 +-
 source/common/shortyuv.cpp           |    32 +-
 source/common/shortyuv.h             |    10 +-
 source/common/x86/asm-primitives.cpp |     9 -
 source/common/x86/pixel-util8.asm    |     1 +
 source/common/yuv.cpp                |    19 +
 source/common/yuv.h                  |     3 +
 source/encoder/analysis.cpp          |    10 +-
 source/encoder/encoder.cpp           |    16 +
 source/encoder/entropy.cpp           |     6 +
 source/encoder/entropy.h             |     1 +
 source/encoder/frameencoder.cpp      |     2 +-
 source/encoder/motion.cpp            |     1 -
 source/encoder/search.cpp            |  1436 +++++++++++++++++----------------
 source/encoder/search.h              |    51 +-
 source/encoder/slicetype.cpp         |    15 -
 source/encoder/weightPrediction.cpp  |     8 +-
 source/test/pixelharness.cpp         |    60 -
 source/test/pixelharness.h           |     1 -
 22 files changed, 835 insertions(+), 882 deletions(-)

diffs (truncated from 2978 to 300 lines):

diff -r e3a3d17b821c -r 5186635c0536 doc/reST/cli.rst

--- a/doc/reST/cli.rst	Thu Oct 23 21:03:47 2014 -0500
+++ b/doc/reST/cli.rst	Sat Oct 25 19:02:55 2014 -0500
@@ -540,7 +540,10 @@ Spatial/intra options
 .. option:: --tskip, --no-tskip
 
 	Enable evaluation of transform skip (bypass DCT but still use
-	quantization) coding for intra coded blocks. Default disabled
+	quantization) coding for 4x4 TU coded blocks.
+
+	Only effective at RD levels 3 and above, which perform RDO mode
+	decisions. Default disabled
 
 .. option:: --tskip-fast, --no-tskip-fast
 
@@ -661,6 +664,9 @@ Mode decision / Analysis
 	specified, all CUs will be encoded as lossless unconditionally
 	regardless of whether this option was enabled. Default disabled.
 
+	Only effective at RD levels 3 and above, which perform RDO mode
+	decisions.
+
 .. option:: --signhide, --no-signhide
 
 	Hide sign bit of one coeff per TU (rdo). The last sign is implied.
diff -r e3a3d17b821c -r 5186635c0536 source/common/pixel.cpp
--- a/source/common/pixel.cpp	Thu Oct 23 21:03:47 2014 -0500
+++ b/source/common/pixel.cpp	Sat Oct 25 19:02:55 2014 -0500
@@ -593,24 +593,6 @@ void getResidual(pixel *fenc, pixel *pre
 }
 
 template<int blockSize>
-void calcRecons(pixel* pred, int16_t* residual, int16_t* recqt, pixel* recipred, int stride, int qtstride, int ipredstride)
-{
-    for (int y = 0; y < blockSize; y++)
-    {
-        for (int x = 0; x < blockSize; x++)
-        {
-            recqt[x] = (int16_t)Clip(static_cast<int16_t>(pred[x]) + residual[x]);
-            recipred[x] = (pixel)recqt[x];
-        }
-
-        pred += stride;
-        residual += stride;
-        recqt += qtstride;
-        recipred += ipredstride;
-    }
-}
-
-template<int blockSize>
 void transpose(pixel* dst, pixel* src, intptr_t stride)
 {
     for (int k = 0; k < blockSize; k++)
@@ -1372,11 +1354,6 @@ void Setup_C_PixelPrimitives(EncoderPrim
     p.calcresidual[BLOCK_16x16] = getResidual<16>;
     p.calcresidual[BLOCK_32x32] = getResidual<32>;
     p.calcresidual[BLOCK_64x64] = NULL;
-    p.calcrecon[BLOCK_4x4] = calcRecons<4>;
-    p.calcrecon[BLOCK_8x8] = calcRecons<8>;
-    p.calcrecon[BLOCK_16x16] = calcRecons<16>;
-    p.calcrecon[BLOCK_32x32] = calcRecons<32>;
-    p.calcrecon[BLOCK_64x64] = NULL;
 
     p.transpose[BLOCK_4x4] = transpose<4>;
     p.transpose[BLOCK_8x8] = transpose<8>;
diff -r e3a3d17b821c -r 5186635c0536 source/common/primitives.h
--- a/source/common/primitives.h	Thu Oct 23 21:03:47 2014 -0500
+++ b/source/common/primitives.h	Sat Oct 25 19:02:55 2014 -0500
@@ -270,7 +270,6 @@ struct EncoderPrimitives
     denoiseDct_t    denoiseDct;
 
     calcresidual_t  calcresidual[NUM_SQUARE_BLOCKS];
-    calcrecon_t     calcrecon[NUM_SQUARE_BLOCKS];
     transpose_t     transpose[NUM_SQUARE_BLOCKS];
 
     var_t           var[NUM_SQUARE_BLOCKS];
diff -r e3a3d17b821c -r 5186635c0536 source/common/quant.cpp
--- a/source/common/quant.cpp	Thu Oct 23 21:03:47 2014 -0500
+++ b/source/common/quant.cpp	Sat Oct 25 19:02:55 2014 -0500
@@ -40,7 +40,7 @@ struct coeffGroupRDStats
 {
     int     nnzBeforePos0;     /* indicates coeff other than pos 0 are coded */
     int64_t codedLevelAndDist; /* distortion and level cost of coded coefficients */
-    int64_t uncodedDist;       /* uncoded distortion cost of coded coefficients */ 
+    int64_t uncodedDist;       /* uncoded distortion cost of coded coefficients */
     int64_t sigCost;           /* cost of signaling significant coeff bitmap */
     int64_t sigCost0;          /* cost of signaling sig coeff bit of coeff 0 */
 };
@@ -169,7 +169,7 @@ bool Quant::init(bool useRDOQ, double ps
     m_resiDctCoeff = X265_MALLOC(int32_t, MAX_TR_SIZE * MAX_TR_SIZE * 2);
     m_fencDctCoeff = m_resiDctCoeff + (MAX_TR_SIZE * MAX_TR_SIZE);
     m_fencShortBuf = X265_MALLOC(int16_t, MAX_TR_SIZE * MAX_TR_SIZE);
-    
+
     return m_resiDctCoeff && m_fencShortBuf;
 }
 
diff -r e3a3d17b821c -r 5186635c0536 source/common/shortyuv.cpp
--- a/source/common/shortyuv.cpp	Thu Oct 23 21:03:47 2014 -0500
+++ b/source/common/shortyuv.cpp	Sat Oct 25 19:02:55 2014 -0500
@@ -79,41 +79,41 @@ void ShortYuv::subtract(const Yuv& srcYu
     primitives.chroma[m_csp].sub_ps[sizeIdx](m_buf[2], m_csize, srcYuv0.m_buf[2], srcYuv1.m_buf[2], srcYuv0.m_csize, srcYuv1.m_csize);
 }
 
-void ShortYuv::copyPartToPartLuma(ShortYuv& dstYuv, uint32_t partIdx, uint32_t log2Size) const
+void ShortYuv::copyPartToPartLuma(ShortYuv& dstYuv, uint32_t absPartIdx, uint32_t log2Size) const
 {
-    const int16_t* src = getLumaAddr(partIdx);
-    int16_t* dst = dstYuv.getLumaAddr(partIdx);
+    const int16_t* src = getLumaAddr(absPartIdx);
+    int16_t* dst = dstYuv.getLumaAddr(absPartIdx);
 
     primitives.square_copy_ss[log2Size - 2](dst, dstYuv.m_size, const_cast<int16_t*>(src), m_size);
 }
 
-void ShortYuv::copyPartToPartLuma(Yuv& dstYuv, uint32_t partIdx, uint32_t log2Size) const
+void ShortYuv::copyPartToPartLuma(Yuv& dstYuv, uint32_t absPartIdx, uint32_t log2Size) const
 {
-    const int16_t* src = getLumaAddr(partIdx);
-    pixel* dst = dstYuv.getLumaAddr(partIdx);
+    const int16_t* src = getLumaAddr(absPartIdx);
+    pixel* dst = dstYuv.getLumaAddr(absPartIdx);
 
     primitives.square_copy_sp[log2Size - 2](dst, dstYuv.m_size, const_cast<int16_t*>(src), m_size);
 }
 
-void ShortYuv::copyPartToPartChroma(ShortYuv& dstYuv, uint32_t partIdx, uint32_t log2SizeL) const
+void ShortYuv::copyPartToPartChroma(ShortYuv& dstYuv, uint32_t absPartIdx, uint32_t log2SizeL) const
 {
     int part = partitionFromLog2Size(log2SizeL);
-    const int16_t* srcU = getCbAddr(partIdx);
-    const int16_t* srcV = getCrAddr(partIdx);
-    int16_t* dstU = dstYuv.getCbAddr(partIdx);
-    int16_t* dstV = dstYuv.getCrAddr(partIdx);
+    const int16_t* srcU = getCbAddr(absPartIdx);
+    const int16_t* srcV = getCrAddr(absPartIdx);
+    int16_t* dstU = dstYuv.getCbAddr(absPartIdx);
+    int16_t* dstV = dstYuv.getCrAddr(absPartIdx);
 
     primitives.chroma[m_csp].copy_ss[part](dstU, dstYuv.m_csize, const_cast<int16_t*>(srcU), m_csize);
     primitives.chroma[m_csp].copy_ss[part](dstV, dstYuv.m_csize, const_cast<int16_t*>(srcV), m_csize);
 }
 
-void ShortYuv::copyPartToPartChroma(Yuv& dstYuv, uint32_t partIdx, uint32_t log2SizeL) const
+void ShortYuv::copyPartToPartChroma(Yuv& dstYuv, uint32_t absPartIdx, uint32_t log2SizeL) const
 {
     int part = partitionFromLog2Size(log2SizeL);
-    const int16_t* srcU = getCbAddr(partIdx);
-    const int16_t* srcV = getCrAddr(partIdx);
-    pixel* dstU = dstYuv.getCbAddr(partIdx);
-    pixel* dstV = dstYuv.getCrAddr(partIdx);
+    const int16_t* srcU = getCbAddr(absPartIdx);
+    const int16_t* srcV = getCrAddr(absPartIdx);
+    pixel* dstU = dstYuv.getCbAddr(absPartIdx);
+    pixel* dstV = dstYuv.getCrAddr(absPartIdx);
 
     primitives.chroma[m_csp].copy_sp[part](dstU, dstYuv.m_csize, const_cast<int16_t*>(srcU), m_csize);
     primitives.chroma[m_csp].copy_sp[part](dstV, dstYuv.m_csize, const_cast<int16_t*>(srcV), m_csize);
diff -r e3a3d17b821c -r 5186635c0536 source/common/shortyuv.h
--- a/source/common/shortyuv.h	Thu Oct 23 21:03:47 2014 -0500
+++ b/source/common/shortyuv.h	Sat Oct 25 19:02:55 2014 -0500
@@ -65,10 +65,12 @@ public:
     const int16_t* getChromaAddr(uint32_t chromaId, uint32_t partUnitIdx) const { return m_buf[chromaId] + getChromaAddrOffset(partUnitIdx); }
 
     void subtract(const Yuv& srcYuv0, const Yuv& srcYuv1, uint32_t log2Size);
-    void copyPartToPartLuma(ShortYuv& dstYuv, uint32_t partIdx, uint32_t log2Size) const;
-    void copyPartToPartChroma(ShortYuv& dstYuv, uint32_t partIdx, uint32_t log2SizeL) const;
-    void copyPartToPartLuma(Yuv& dstYuv, uint32_t partIdx, uint32_t log2Size) const;
-    void copyPartToPartChroma(Yuv& dstYuv, uint32_t partIdx, uint32_t log2SizeL) const;
+
+    void copyPartToPartLuma(ShortYuv& dstYuv, uint32_t absPartIdx, uint32_t log2Size) const;
+    void copyPartToPartChroma(ShortYuv& dstYuv, uint32_t absPartIdx, uint32_t log2SizeL) const;
+
+    void copyPartToPartLuma(Yuv& dstYuv, uint32_t absPartIdx, uint32_t log2Size) const;
+    void copyPartToPartChroma(Yuv& dstYuv, uint32_t absPartIdx, uint32_t log2SizeL) const;
 
     int getChromaAddrOffset(uint32_t idx) const
     {
diff -r e3a3d17b821c -r 5186635c0536 source/common/x86/asm-primitives.cpp
--- a/source/common/x86/asm-primitives.cpp	Thu Oct 23 21:03:47 2014 -0500
+++ b/source/common/x86/asm-primitives.cpp	Sat Oct 25 19:02:55 2014 -0500
@@ -1374,11 +1374,6 @@ void Setup_Assembly_Primitives(EncoderPr
         p.calcresidual[BLOCK_16x16] = x265_getResidual16_sse2;
         p.calcresidual[BLOCK_32x32] = x265_getResidual32_sse2;
 
-        p.calcrecon[BLOCK_4x4] = x265_calcRecons4_sse2;
-        p.calcrecon[BLOCK_8x8] = x265_calcRecons8_sse2;
-        p.calcrecon[BLOCK_16x16] = x265_calcRecons16_sse2;
-        p.calcrecon[BLOCK_32x32] = x265_calcRecons32_sse2;
-
         p.dct[DCT_4x4] = x265_dct4_sse2;
         p.idct[IDCT_4x4] = x265_idct4_sse2;
         p.idct[IDST_4x4] = x265_idst4_sse2;
@@ -1561,8 +1556,6 @@ void Setup_Assembly_Primitives(EncoderPr
         p.cvt32to16_shl[BLOCK_8x8] = x265_cvt32to16_shl_8_sse2;
         p.cvt32to16_shl[BLOCK_16x16] = x265_cvt32to16_shl_16_sse2;
         p.cvt32to16_shl[BLOCK_32x32] = x265_cvt32to16_shl_32_sse2;
-        p.calcrecon[BLOCK_4x4] = x265_calcRecons4_sse2;
-        p.calcrecon[BLOCK_8x8] = x265_calcRecons8_sse2;
         p.calcresidual[BLOCK_4x4] = x265_getResidual4_sse2;
         p.calcresidual[BLOCK_8x8] = x265_getResidual8_sse2;
         p.transpose[BLOCK_4x4] = x265_transpose4_sse2;
@@ -1671,8 +1664,6 @@ void Setup_Assembly_Primitives(EncoderPr
         CHROMA_BLOCKCOPY_422(ps, _sse4);
         LUMA_BLOCKCOPY(ps, _sse4);
 
-        p.calcrecon[BLOCK_16x16] = x265_calcRecons16_sse4;
-        p.calcrecon[BLOCK_32x32] = x265_calcRecons32_sse4;
         p.calcresidual[BLOCK_16x16] = x265_getResidual16_sse4;
         p.calcresidual[BLOCK_32x32] = x265_getResidual32_sse4;
         p.quant = x265_quant_sse4;
diff -r e3a3d17b821c -r 5186635c0536 source/common/x86/pixel-util8.asm
--- a/source/common/x86/pixel-util8.asm	Thu Oct 23 21:03:47 2014 -0500
+++ b/source/common/x86/pixel-util8.asm	Sat Oct 25 19:02:55 2014 -0500
@@ -1483,6 +1483,7 @@ cglobal weight_sp, 6, 7, 7, 0-(2*4)
     movd        [r1], m6
     je          .nextH
     add         r1, 4
+    pshufd      m6, m6, 1
 
 .width2:
     pextrw      [r1], m6, 0
diff -r e3a3d17b821c -r 5186635c0536 source/common/yuv.cpp
--- a/source/common/yuv.cpp	Thu Oct 23 21:03:47 2014 -0500
+++ b/source/common/yuv.cpp	Sat Oct 25 19:02:55 2014 -0500
@@ -163,3 +163,22 @@ void Yuv::addAvg(const ShortYuv& srcYuv0
         primitives.chroma[m_csp].addAvg[part](srcV0, srcV1, dstV, srcYuv0.m_csize, srcYuv1.m_csize, m_csize);
     }
 }
+
+void Yuv::copyPartToPartLuma(Yuv& dstYuv, uint32_t absPartIdx, uint32_t log2Size) const
+{
+    const pixel* src = getLumaAddr(absPartIdx);
+    pixel* dst = dstYuv.getLumaAddr(absPartIdx);
+    primitives.square_copy_pp[log2Size - 2](dst, dstYuv.m_size, const_cast<pixel*>(src), m_size);
+}
+
+void Yuv::copyPartToPartChroma(Yuv& dstYuv, uint32_t absPartIdx, uint32_t log2SizeL) const
+{
+    int part = partitionFromLog2Size(log2SizeL);
+    const pixel* srcU = getCbAddr(absPartIdx);
+    const pixel* srcV = getCrAddr(absPartIdx);
+    pixel* dstU = dstYuv.getCbAddr(absPartIdx);
+    pixel* dstV = dstYuv.getCrAddr(absPartIdx);
+
+    primitives.chroma[m_csp].copy_pp[part](dstU, dstYuv.m_csize, const_cast<pixel*>(srcU), m_csize);
+    primitives.chroma[m_csp].copy_pp[part](dstV, dstYuv.m_csize, const_cast<pixel*>(srcV), m_csize);
+}
diff -r e3a3d17b821c -r 5186635c0536 source/common/yuv.h
--- a/source/common/yuv.h	Thu Oct 23 21:03:47 2014 -0500
+++ b/source/common/yuv.h	Sat Oct 25 19:02:55 2014 -0500
@@ -75,6 +75,9 @@ public:
     // (srcYuv0 + srcYuv1)/2 for YUV partition (bidir averaging)
     void   addAvg(const ShortYuv& srcYuv0, const ShortYuv& srcYuv1, uint32_t absPartIdx, uint32_t width, uint32_t height, bool bLuma, bool bChroma);
 
+    void copyPartToPartLuma(Yuv& dstYuv, uint32_t absPartIdx, uint32_t log2Size) const;
+    void copyPartToPartChroma(Yuv& dstYuv, uint32_t absPartIdx, uint32_t log2SizeL) const;
+
     pixel* getLumaAddr(uint32_t absPartIdx)                      { return m_buf[0] + getAddrOffset(absPartIdx, m_size); }
     pixel* getCbAddr(uint32_t absPartIdx)                        { return m_buf[1] + getChromaAddrOffset(absPartIdx); }
     pixel* getCrAddr(uint32_t absPartIdx)                        { return m_buf[2] + getChromaAddrOffset(absPartIdx); }
diff -r e3a3d17b821c -r 5186635c0536 source/encoder/analysis.cpp
--- a/source/encoder/analysis.cpp	Thu Oct 23 21:03:47 2014 -0500
+++ b/source/encoder/analysis.cpp	Sat Oct 25 19:02:55 2014 -0500
@@ -92,7 +92,6 @@ bool Analysis::create(ThreadLocalData *t
             md.pred[j].cu.initialize(md.cuMemPool, depth, csp, j);
             ok &= md.pred[j].predYuv.create(cuSize, csp);
             ok &= md.pred[j].reconYuv.create(cuSize, csp);
-            ok &= md.pred[j].resiYuv.create(cuSize, csp);
             md.pred[j].fencYuv = &md.fencYuv;
         }
     }
@@ -111,7 +110,6 @@ void Analysis::destroy()
         {
             m_modeDepth[i].pred[j].predYuv.destroy();
             m_modeDepth[i].pred[j].reconYuv.destroy();
-            m_modeDepth[i].pred[j].resiYuv.destroy();
         }
     }
 }
@@ -776,7 +774,7 @@ void Analysis::compressInterCU_rd0_4(con
                         encodeResAndCalcRdInterCU(*md.bestMode, cuGeom);
                     else if (m_param->rdLevel == 1)
                     {
-                        md.bestMode->resiYuv.subtract(md.fencYuv, md.bestMode->predYuv, cuGeom.log2CUSize);
+                        m_rqt[cuGeom.depth].tmpResiYuv.subtract(md.fencYuv, md.bestMode->predYuv, cuGeom.log2CUSize);
                         generateCoeffRecon(*md.bestMode, cuGeom);
                     }
                 }
@@ -879,8 +877,6 @@ void Analysis::compressInterCU_rd0_4(con
     md.bestMode->cu.copyToPic(depth);
     if (md.bestMode != &md.pred[PRED_SPLIT] && m_param->rdLevel)
         md.bestMode->reconYuv.copyToPicYuv(*m_frame->m_reconPicYuv, cuAddr, cuGeom.encodeIdx);
-
-    x265_emms(); // TODO: Remove
 }
 
 void Analysis::compressInterCU_rd5_6(const CUData& parentCTU, const CUGeom& cuGeom)
@@ -1418,7 +1414,7 @@ void Analysis::encodeIntraInInter(Mode&