[x265-commits] [x265] 16bpp primitives: disabling dct/idct/dst/idst primitives

Tue Nov 12 04:01:01 CET 2013

details:   http://hg.videolan.org/x265/rev/8ca334701a92
branches:  
changeset: 4992:8ca334701a92
user:      Deepthi Nandakumar <deepthi at multicorewareinc.com>
date:      Mon Nov 11 14:34:27 2013 +0530
description:
16bpp primitives: disabling dct/idct/dst/idst primitives
Subject: [x265] Adding function pointer type & array definition for luma vsp filter functions.

details:   http://hg.videolan.org/x265/rev/8d496292dd1d
branches:  
changeset: 4993:8d496292dd1d
user:      Nabajit Deka
date:      Mon Nov 11 11:10:32 2013 +0530
description:
Adding function pointer type & array definition for luma vsp filter functions.
Subject: [x265] Adding C primitive for luma vsp filter functions.

details:   http://hg.videolan.org/x265/rev/d2b3aefb522e
branches:  
changeset: 4994:d2b3aefb522e
user:      Nabajit Deka
date:      Mon Nov 11 11:15:01 2013 +0530
description:
Adding C primitive for luma vsp filter functions.
Subject: [x265] Adding test bench code for luma vsp filter functions.

details:   http://hg.videolan.org/x265/rev/51358e3422b7
branches:  
changeset: 4995:51358e3422b7
user:      Nabajit Deka
date:      Mon Nov 11 11:20:09 2013 +0530
description:
Adding test bench code for luma vsp filter functions.
Subject: [x265] added blockcopy_ps C primitive and function pointers

details:   http://hg.videolan.org/x265/rev/7f3164f16551
branches:  
changeset: 4996:7f3164f16551
user:      Praveen Tiwari
date:      Mon Nov 11 11:41:51 2013 +0530
description:
added blockcopy_ps C primitive and function pointers
Subject: [x265] unit test code for block_copy_ps function

details:   http://hg.videolan.org/x265/rev/eab2cd89e813
branches:  
changeset: 4997:eab2cd89e813
user:      Praveen Tiwari
date:      Mon Nov 11 12:30:32 2013 +0530
description:
unit test code for block_copy_ps function
Subject: [x265] asm code for blockcopy_ps_8x2

details:   http://hg.videolan.org/x265/rev/11b09a9fa32f
branches:  
changeset: 4998:11b09a9fa32f
user:      Praveen Tiwari
date:      Mon Nov 11 13:07:57 2013 +0530
description:
asm code for blockcopy_ps_8x2
Subject: [x265] asm code for blockcopy_ps_8x4

details:   http://hg.videolan.org/x265/rev/25300bdf7bbe
branches:  
changeset: 4999:25300bdf7bbe
user:      Praveen Tiwari
date:      Mon Nov 11 13:35:11 2013 +0530
description:
asm code for blockcopy_ps_8x4
Subject: [x265] re-enable asm code for pixel_avg, the problem was a missing EMMS

details:   http://hg.videolan.org/x265/rev/a1577003ee96
branches:  
changeset: 5000:a1577003ee96
user:      Min Chen <chenm003 at 163.com>
date:      Mon Nov 11 16:21:00 2013 +0800
description:
re-enable asm code for pixel_avg, the problem was a missing EMMS
Subject: [x265] bugfix: PixelHarness::check_pixelavg_pp() output buffer was not initialized

details:   http://hg.videolan.org/x265/rev/9642b5b6500b
branches:  
changeset: 5001:9642b5b6500b
user:      Min Chen <chenm003 at 163.com>
date:      Mon Nov 11 17:41:32 2013 +0800
description:
bugfix: PixelHarness::check_pixelavg_pp() output buffer was not initialized
Subject: [x265] TEncCu: cleanup xComputeCostIntraInInter to use 32x32 logic for 64x64

details:   http://hg.videolan.org/x265/rev/2e90d81098af
branches:  
changeset: 5002:2e90d81098af
user:      Mahesh Doijade <maheshdoijade at multicorewareinc.com>
date:      Mon Nov 11 13:16:52 2013 +0530
description:
TEncCu: cleanup xComputeCostIntraInInter to use 32x32 logic for 64x64
Subject: [x265] compress: white-space nits

details:   http://hg.videolan.org/x265/rev/c94d51359a5f
branches:  
changeset: 5003:c94d51359a5f
user:      Steve Borho <steve at borho.org>
date:      Mon Nov 11 17:46:48 2013 -0600
description:
compress: white-space nits
Subject: [x265] asm code for blockcopy_ps_8x6

details:   http://hg.videolan.org/x265/rev/1fbaef13feb7
branches:  
changeset: 5004:1fbaef13feb7
user:      Praveen Tiwari
date:      Mon Nov 11 14:36:21 2013 +0530
description:
asm code for blockcopy_ps_8x6
Subject: [x265] asm code for blockcopy_ps, 8x6, 8x16 and 8x32

details:   http://hg.videolan.org/x265/rev/7d74ee88f3fe
branches:  
changeset: 5005:7d74ee88f3fe
user:      Praveen Tiwari
date:      Mon Nov 11 14:58:09 2013 +0530
description:
asm code for blockcopy_ps, 8x6, 8x16 and 8x32
Subject: [x265] asm code for blockcopy_ps_16x4

details:   http://hg.videolan.org/x265/rev/cb378330b31b
branches:  
changeset: 5006:cb378330b31b
user:      Praveen Tiwari
date:      Mon Nov 11 16:00:59 2013 +0530
description:
asm code for blockcopy_ps_16x4
Subject: [x265] asm code for blockcopy_ps, 16x8, 16x12, 16x16, 16x32

details:   http://hg.videolan.org/x265/rev/e5567a4eeec5
branches:  
changeset: 5007:e5567a4eeec5
user:      Praveen Tiwari
date:      Mon Nov 11 16:29:44 2013 +0530
description:
asm code for blockcopy_ps, 16x8, 16x12, 16x16, 16x32
Subject: [x265] eliminated register copy from BLOCKCOPY_PS_W16_H4 macro

details:   http://hg.videolan.org/x265/rev/7a0afcd7c4c9
branches:  
changeset: 5008:7a0afcd7c4c9
user:      Praveen Tiwari
date:      Mon Nov 11 16:44:45 2013 +0530
description:
eliminated register copy from BLOCKCOPY_PS_W16_H4 macro
Subject: [x265] blockcopy_ps_16x4, asm code is now sse4

details:   http://hg.videolan.org/x265/rev/1365b796a75e
branches:  
changeset: 5009:1365b796a75e
user:      Praveen Tiwari
date:      Mon Nov 11 16:54:27 2013 +0530
description:
blockcopy_ps_16x4, asm code is now sse4
Subject: [x265] asm code for blockcopy_ps_32xN

details:   http://hg.videolan.org/x265/rev/badcc7920c91
branches:  
changeset: 5010:badcc7920c91
user:      Praveen Tiwari
date:      Mon Nov 11 17:13:25 2013 +0530
description:
asm code for blockcopy_ps_32xN
Subject: [x265] asm code for blockcopy_ps_12x16

details:   http://hg.videolan.org/x265/rev/c09ba17002c0
branches:  
changeset: 5011:c09ba17002c0
user:      Praveen Tiwari
date:      Mon Nov 11 17:50:45 2013 +0530
description:
asm code for blockcopy_ps_12x16
Subject: [x265] asm code for blockcopy_ps_4x2

details:   http://hg.videolan.org/x265/rev/4c45ee313c3c
branches:  
changeset: 5012:4c45ee313c3c
user:      Praveen Tiwari
date:      Mon Nov 11 18:01:16 2013 +0530
description:
asm code for blockcopy_ps_4x2
Subject: [x265] asm code for blockcopy_ps_4x4

details:   http://hg.videolan.org/x265/rev/953fe27840b6
branches:  
changeset: 5013:953fe27840b6
user:      Praveen Tiwari
date:      Mon Nov 11 18:10:26 2013 +0530
description:
asm code for blockcopy_ps_4x4
Subject: [x265] asm code for blockcopy_ps_4x8

details:   http://hg.videolan.org/x265/rev/332793211a8d
branches:  
changeset: 5014:332793211a8d
user:      Praveen Tiwari
date:      Mon Nov 11 18:23:10 2013 +0530
description:
asm code for blockcopy_ps_4x8
Subject: [x265] asm code for blockcopy_ps_24x32

details:   http://hg.videolan.org/x265/rev/c8e0d150b111
branches:  
changeset: 5015:c8e0d150b111
user:      Praveen Tiwari
date:      Mon Nov 11 17:34:06 2013 +0530
description:
asm code for blockcopy_ps_24x32
Subject: [x265] asm code for blockcopy_ps_2x4

details:   http://hg.videolan.org/x265/rev/cf089f73913d
branches:  
changeset: 5016:cf089f73913d
user:      Praveen Tiwari
date:      Mon Nov 11 18:56:06 2013 +0530
description:
asm code for blockcopy_ps_2x4
Subject: [x265] asm code for blockcopy_ps_2x8

details:   http://hg.videolan.org/x265/rev/c047d5898b59
branches:  
changeset: 5017:c047d5898b59
user:      Praveen Tiwari
date:      Mon Nov 11 19:20:41 2013 +0530
description:
asm code for blockcopy_ps_2x8
Subject: [x265] asm code for blockcopy_ps_6x8

details:   http://hg.videolan.org/x265/rev/b208adfaaba6
branches:  
changeset: 5018:b208adfaaba6
user:      Praveen Tiwari
date:      Mon Nov 11 20:24:33 2013 +0530
description:
asm code for blockcopy_ps_6x8
Subject: [x265] added asm code blockcopy_ps_4x16 and invoked function pointer initialization with macro

details:   http://hg.videolan.org/x265/rev/67fb80ee548a
branches:  
changeset: 5019:67fb80ee548a
user:      Praveen Tiwari
date:      Mon Nov 11 20:35:55 2013 +0530
description:
added asm code blockcopy_ps_4x16 and invoked function pointer initialization with macro
Subject: [x265] added asm function for luma blockcopy_ps_16x64

details:   http://hg.videolan.org/x265/rev/8e20f3c1dbb4
branches:  
changeset: 5020:8e20f3c1dbb4
user:      Praveen Tiwari
date:      Mon Nov 11 20:50:50 2013 +0530
description:
added asm function for luma blockcopy_ps_16x64
Subject: [x265] asm code for luma blockcopy_ps_32x64

details:   http://hg.videolan.org/x265/rev/15b705145e15
branches:  
changeset: 5021:15b705145e15
user:      Praveen Tiwari
date:      Mon Nov 11 20:55:03 2013 +0530
description:
asm code for luma blockcopy_ps_32x64
Subject: [x265] asm code for luma blockcopy_ps_48x64

details:   http://hg.videolan.org/x265/rev/c19168acd391
branches:  
changeset: 5022:c19168acd391
user:      Praveen Tiwari
date:      Mon Nov 11 21:06:11 2013 +0530
description:
asm code for luma blockcopy_ps_48x64
Subject: [x265] asm code for blockcopy_ps_64xN

details:   http://hg.videolan.org/x265/rev/ed32ed5a0785
branches:  
changeset: 5023:ed32ed5a0785
user:      Praveen Tiwari
date:      Mon Nov 11 21:22:38 2013 +0530
description:
asm code for blockcopy_ps_64xN
Subject: [x265] added macro call for luma partition blockcopy_ps function

details:   http://hg.videolan.org/x265/rev/18dd57c38254
branches:  
changeset: 5024:18dd57c38254
user:      Praveen Tiwari
date:      Mon Nov 11 21:36:21 2013 +0530
description:
added macro call for luma partition blockcopy_ps function
Subject: [x265] asm: pixel_avg[32x16]

details:   http://hg.videolan.org/x265/rev/79a452bec247
branches:  
changeset: 5025:79a452bec247
user:      Min Chen <chenm003 at 163.com>
date:      Mon Nov 11 20:51:58 2013 +0800
description:
asm: pixel_avg[32x16]
Subject: [x265] use fixed stride/size on m_qtTempTComYuv, to reduce number of calcRecon() parameters

details:   http://hg.videolan.org/x265/rev/0f9c6391fa19
branches:  
changeset: 5026:0f9c6391fa19
user:      Min Chen <chenm003 at 163.com>
date:      Mon Nov 11 21:59:22 2013 +0800
description:
use fixed stride/size on m_qtTempTComYuv, to reduce number of calcRecon() parameters
Subject: [x265] asm: enabled pixel_avg_16x(64,32,12,4) assembly functions

details:   http://hg.videolan.org/x265/rev/1990e66030d1
branches:  
changeset: 5027:1990e66030d1
user:      Dnyaneshwar Gorade <dnyaneshwar at multicorewareinc.com>
date:      Mon Nov 11 16:50:59 2013 +0530
description:
asm: enabled pixel_avg_16x(64,32,12,4) assembly functions
Subject: [x265] asm: assembly code for x265_pixel_satd_32x8

details:   http://hg.videolan.org/x265/rev/da13148e7c6e
branches:  
changeset: 5028:da13148e7c6e
user:      Yuvaraj Venkatesh <yuvaraj at multicorewareinc.com>
date:      Mon Nov 11 17:01:26 2013 +0530
description:
asm: assembly code for x265_pixel_satd_32x8
Subject: [x265] asm: assembly code for x265_pixel_satd_32x16

details:   http://hg.videolan.org/x265/rev/27b97bc50331
branches:  
changeset: 5029:27b97bc50331
user:      Yuvaraj Venkatesh <yuvaraj at multicorewareinc.com>
date:      Mon Nov 11 20:06:04 2013 +0530
description:
asm: assembly code for x265_pixel_satd_32x16
Subject: [x265] asm: routines for luma vsp filter functions for all block sizes.

details:   http://hg.videolan.org/x265/rev/1eae34eb5995
branches:  
changeset: 5030:1eae34eb5995
user:      Nabajit Deka
date:      Mon Nov 11 15:01:29 2013 +0530
description:
asm: routines for luma vsp filter functions for all block sizes.
Subject: [x265] Adding asm function declarations for luma vsp filter functions.

details:   http://hg.videolan.org/x265/rev/937ac0c1bac4
branches:  
changeset: 5031:937ac0c1bac4
user:      Nabajit Deka
date:      Mon Nov 11 15:14:31 2013 +0530
description:
Adding asm function declarations for luma vsp filter functions.
Subject: [x265] Adding function pointer initializations for luma vsp functions.

details:   http://hg.videolan.org/x265/rev/d11de5be8e25
branches:  
changeset: 5032:d11de5be8e25
user:      Nabajit Deka
date:      Mon Nov 11 15:15:46 2013 +0530
description:
Adding function pointer initializations for luma vsp functions.
Subject: [x265] asm: hookup luma_vsp primitive, drop asm and intrinsic non-block versions

details:   http://hg.videolan.org/x265/rev/904b788b09e2
branches:  
changeset: 5033:904b788b09e2
user:      Steve Borho <steve at borho.org>
date:      Mon Nov 11 19:15:32 2013 -0600
description:
asm: hookup luma_vsp primitive, drop asm and intrinsic non-block versions
Subject: [x265] asm: use new block copy primitives where feasible

details:   http://hg.videolan.org/x265/rev/1c95568c7143
branches:  
changeset: 5034:1c95568c7143
user:      Steve Borho <steve at borho.org>
date:      Mon Nov 11 19:35:16 2013 -0600
description:
asm: use new block copy primitives where feasible
Subject: [x265] TComYuv: de-hungarian nits

details:   http://hg.videolan.org/x265/rev/d1d716083aa7
branches:  
changeset: 5035:d1d716083aa7
user:      Steve Borho <steve at borho.org>
date:      Mon Nov 11 19:43:33 2013 -0600
description:
TComYuv: de-hungarian nits
Subject: [x265] no-rdo: cleanups. Remove unnecessary memsets, rearrange computations.

details:   http://hg.videolan.org/x265/rev/1ca01c82609f
branches:  
changeset: 5036:1ca01c82609f
user:      Deepthi Devaki <deepthidevaki at multicorewareinc.com>
date:      Mon Nov 11 15:46:00 2013 +0530
description:
no-rdo: cleanups. Remove unnecessary memsets, rearrange computations.
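
Note on the primitive tables touched by this series: several changesets add a C primitive per block size and register it in a function pointer array indexed by partition. The following is a minimal sketch of that pattern for a pixel-to-short block copy, assuming an 8-bit build (pixel is uint8_t). The names copy_ps_t, blockcopy_ps_sketch, SketchPart, SketchPrimitives and setupSketchPrimitives are hypothetical, and the argument order is an assumption; none of this is copied from the x265 sources.

```cpp
#include <cstdint>

typedef uint8_t pixel; // assumption: 8-bit pixels

// Hypothetical pointer type: copy a block of pixels into a 16-bit buffer.
typedef void (*copy_ps_t)(int16_t *dst, intptr_t dstStride,
                          const pixel *src, intptr_t srcStride);

// Width and height are template parameters, so each instantiation has
// fixed loop bounds that the compiler can unroll.
template<int W, int H>
void blockcopy_ps_sketch(int16_t *dst, intptr_t dstStride,
                         const pixel *src, intptr_t srcStride)
{
    for (int y = 0; y < H; y++)
    {
        for (int x = 0; x < W; x++)
        {
            dst[x] = (int16_t)src[x]; // widen each pixel to a short
        }

        dst += dstStride;
        src += srcStride;
    }
}

// Hypothetical partition enum; the real table covers many more sizes.
enum SketchPart { SKETCH_4x4, SKETCH_8x4, SKETCH_8x8, NUM_SKETCH_PARTS };

struct SketchPrimitives
{
    copy_ps_t copy_ps[NUM_SKETCH_PARTS];
};

// Fill the table with the C versions; an optimized build could later
// overwrite individual entries with assembly routines.
void setupSketchPrimitives(SketchPrimitives &p)
{
    p.copy_ps[SKETCH_4x4] = blockcopy_ps_sketch<4, 4>;
    p.copy_ps[SKETCH_8x4] = blockcopy_ps_sketch<8, 4>;
    p.copy_ps[SKETCH_8x8] = blockcopy_ps_sketch<8, 8>;
}
```

A caller then selects the entry by partition index, in the same style as the primitives.luma_copy_pp[part](...) calls in the diff, instead of passing width and height at run time.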

diffstat:

 source/Lib/TLibCommon/TComPrediction.cpp |    2 +-
 source/Lib/TLibCommon/TComYuv.h          |    6 +-
 source/Lib/TLibEncoder/TEncSearch.cpp    |   70 +--
 source/common/ipfilter.cpp               |   45 +-
 source/common/pixel.cpp                  |   22 +-
 source/common/primitives.h               |    6 +-
 source/common/vec/dct-sse3.cpp           |    6 +-
 source/common/vec/dct-sse41.cpp          |    2 +
 source/common/vec/ipfilter-sse41.cpp     |    1 -
 source/common/x86/asm-primitives.cpp     |  103 +++-
 source/common/x86/blockcopy8.asm         |  704 +++++++++++++++++++++++++++++++
 source/common/x86/blockcopy8.h           |   49 ++
 source/common/x86/ipfilter8.asm          |  285 +++++++-----
 source/common/x86/ipfilter8.h            |   34 +-
 source/common/x86/mc-a.asm               |   35 +-
 source/common/x86/pixel-a.asm            |  102 ++++
 source/common/x86/pixel.h                |    5 +
 source/encoder/compress.cpp              |  156 +++---
 source/encoder/motion.cpp                |    8 +-
 source/encoder/ratecontrol.cpp           |    2 +-
 source/test/ipfilterharness.cpp          |   46 ++
 source/test/ipfilterharness.h            |    1 +
 source/test/pixelharness.cpp             |   60 ++-
 source/test/pixelharness.h               |    1 +
 24 files changed, 1445 insertions(+), 306 deletions(-)
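
Note on the luma vsp primitive: the short-to-pixel vertical filter added in source/common/ipfilter.cpp applies an 8-tap filter down each column, adds a rounding offset, shifts back to pixel precision and clamps. The sketch below illustrates that arithmetic under stated assumptions: an 8-bit build, 14-bit intermediate precision, 6-bit filter precision, and the HEVC half-sample luma coefficients. The k-prefixed constants and the name vertSpSketch are hypothetical stand-ins, not the x265 identifiers, and the coefficients are passed as a pointer here rather than looked up by index.

```cpp
#include <cstdint>

typedef uint8_t pixel; // assumption: 8-bit pixels

// Assumed constants for an 8-bit build (hypothetical names).
static const int kDepth        = 8;
static const int kInternalPrec = 14;
static const int kFilterPrec   = 6;
static const int kInternalOffs = 1 << (kInternalPrec - 1);

// HEVC half-sample luma interpolation taps (they sum to 64).
static const int16_t kHalfPelLuma[8] = { -1, 4, -11, 40, 40, -11, 4, -1 };

template<int W, int H>
void vertSpSketch(const int16_t *src, intptr_t srcStride,
                  pixel *dst, intptr_t dstStride, const int16_t *coeff)
{
    const int shift  = kFilterPrec + (kInternalPrec - kDepth);
    const int offset = (1 << (shift - 1)) + (kInternalOffs << kFilterPrec);
    const int maxVal = (1 << kDepth) - 1;

    // The filter window starts three rows above the output row, so the
    // caller must provide 3 rows of margin above and 4 below.
    src -= 3 * srcStride;

    for (int row = 0; row < H; row++)
    {
        for (int col = 0; col < W; col++)
        {
            int sum = 0;
            for (int t = 0; t < 8; t++)
            {
                sum += src[col + t * srcStride] * coeff[t];
            }

            // Round and scale in int, then clamp to the pixel range.
            // Relies on arithmetic right shift for negative values.
            int val = (sum + offset) >> shift;
            val = (val < 0) ? 0 : val;
            val = (val > maxVal) ? maxVal : val;
            dst[col] = (pixel)val;
        }

        src += srcStride;
        dst += dstStride;
    }
}
```

The sketch keeps the rounded value in an int until after clamping, which avoids truncating an out-of-range intermediate before the range check.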

diffs (truncated from 2413 to 300 lines):

diff -r 9d74638c3640 -r 1ca01c82609f source/Lib/TLibCommon/TComPrediction.cpp
--- a/source/Lib/TLibCommon/TComPrediction.cpp	Sat Nov 09 20:14:24 2013 -0600
+++ b/source/Lib/TLibCommon/TComPrediction.cpp	Mon Nov 11 15:46:00 2013 +0530
@@ -500,7 +500,7 @@ void TComPrediction::xPredInterLumaBlk(T
         int filterSize = NTAPS_LUMA;
         int halfFilterSize = (filterSize >> 1);
         primitives.ipfilter_ps[FILTER_H_P_S_8](src - (halfFilterSize - 1) * srcStride,  srcStride, m_immedVals, tmpStride, width, height + filterSize - 1, g_lumaFilter[xFrac]);
-        primitives.ipfilter_sp[FILTER_V_S_P_8](m_immedVals + (halfFilterSize - 1) * tmpStride, tmpStride, dst, dstStride, width, height, yFrac);
+        primitives.luma_vsp[partEnum](m_immedVals + (halfFilterSize - 1) * tmpStride, tmpStride, dst, dstStride, yFrac);
     }
 }
 
diff -r 9d74638c3640 -r 1ca01c82609f source/Lib/TLibCommon/TComYuv.h
--- a/source/Lib/TLibCommon/TComYuv.h	Sat Nov 09 20:14:24 2013 -0600
+++ b/source/Lib/TLibCommon/TComYuv.h	Mon Nov 11 15:46:00 2013 +0530
@@ -129,9 +129,9 @@ public:
     void    copyToPartChroma(TComYuv* dstPicYuv, uint32_t uiDstPartIdx);
 
     //  Copy the part of Big YUV buffer to other Small YUV buffer
-    void    copyPartToYuv(TComYuv* dstPicYuv, uint32_t uiSrcPartIdx);
-    void    copyPartToLuma(TComYuv* dstPicYuv, uint32_t uiSrcPartIdx);
-    void    copyPartToChroma(TComYuv* dstPicYuv, uint32_t uiSrcPartIdx);
+    void    copyPartToYuv(TComYuv* dstPicYuv, uint32_t srcPartIdx);
+    void    copyPartToLuma(TComYuv* dstPicYuv, uint32_t srcPartIdx);
+    void    copyPartToChroma(TComYuv* dstPicYuv, uint32_t srcPartIdx);
 
     //  Copy YUV partition buffer to other YUV partition buffer
     void    copyPartToPartYuv(TComYuv* dstPicYuv, uint32_t partIdx, uint32_t width, uint32_t height, bool bLuma = true, bool bChroma = true);
diff -r 9d74638c3640 -r 1ca01c82609f source/Lib/TLibEncoder/TEncSearch.cpp
--- a/source/Lib/TLibEncoder/TEncSearch.cpp	Sat Nov 09 20:14:24 2013 -0600
+++ b/source/Lib/TLibEncoder/TEncSearch.cpp	Mon Nov 11 15:46:00 2013 +0530
@@ -176,7 +176,7 @@ void TEncSearch::init(TEncCfg* cfg, TCom
 
         m_qtTempCoeffCb[i] = new TCoeff[(g_maxCUWidth >> m_hChromaShift) * (g_maxCUHeight >> m_vChromaShift)];
         m_qtTempCoeffCr[i] = new TCoeff[(g_maxCUWidth >> m_hChromaShift) * (g_maxCUHeight >> m_vChromaShift)];
-        m_qtTempTComYuv[i].create(g_maxCUWidth, g_maxCUHeight, cfg->getColorFormat());
+        m_qtTempTComYuv[i].create(MAX_CU_SIZE, MAX_CU_SIZE, cfg->getColorFormat());
     }
 
     m_sharedPredTransformSkip[0] = new Pel[MAX_TS_WIDTH * MAX_TS_HEIGHT];
@@ -428,6 +428,7 @@ void TEncSearch::xIntraCodingLumaBlk(TCo
     Pel*     pred         = predYuv->getLumaAddr(absPartIdx);
     int16_t* residual     = resiYuv->getLumaAddr(absPartIdx);
     Pel*     recon        = predYuv->getLumaAddr(absPartIdx);
+    int      part         = partitionFromSizes(width, height);
 
     uint32_t trSizeLog2     = g_convertToBit[cu->getSlice()->getSPS()->getMaxCUWidth() >> fullDepth] + 2;
     uint32_t qtLayer        = cu->getSlice()->getSPS()->getQuadtreeTULog2MaxSize() - trSizeLog2;
@@ -453,12 +454,12 @@ void TEncSearch::xIntraCodingLumaBlk(TCo
         // save prediction
         if (default0Save1Load2 == 1)
         {
-            primitives.blockcpy_pp(width, height, m_sharedPredTransformSkip[0], width, pred, stride);
+            primitives.luma_copy_pp[part](m_sharedPredTransformSkip[0], width, pred, stride);
         }
     }
     else
     {
-        primitives.blockcpy_pp(width, height, pred, stride, m_sharedPredTransformSkip[0], width);
+        primitives.luma_copy_pp[part](pred, stride, m_sharedPredTransformSkip[0], width);
     }
 
     //===== get residual signal =====
@@ -504,7 +505,6 @@ void TEncSearch::xIntraCodingLumaBlk(TCo
     primitives.calcrecon[size](pred, residual, recon, reconQt, reconIPred, stride, reconQtStride, reconIPredStride);
 
     //===== update distortion =====
-    int part = partitionFromSizes(width, height);
     outDist += primitives.sse_pp[part](fenc, stride, recon, stride);
 }
 
@@ -554,6 +554,7 @@ void TEncSearch::xIntraCodingChromaBlk(T
     Pel*     reconIPred       = (chromaId > 0 ? cu->getPic()->getPicYuvRec()->getCrAddr(cu->getAddr(), zorder) : cu->getPic()->getPicYuvRec()->getCbAddr(cu->getAddr(), zorder));
     uint32_t reconIPredStride = cu->getPic()->getPicYuvRec()->getCStride();
     bool     useTransformSkipChroma = cu->getTransformSkip(absPartIdx, ttype);
+    int      part = partitionFromSizes(width, height);
 
     //===== update chroma mode =====
     if (chromaPredMode == DM_CHROMA_IDX)
@@ -576,14 +577,14 @@ void TEncSearch::xIntraCodingChromaBlk(T
         if (default0Save1Load2 == 1)
         {
             Pel* predbuf = m_sharedPredTransformSkip[1 + chromaId];
-            primitives.blockcpy_pp(width, height, predbuf, width, pred, stride);
+            primitives.luma_copy_pp[part](predbuf, width, pred, stride);
         }
     }
     else
     {
         // load prediction
         Pel* predbuf = m_sharedPredTransformSkip[1 + chromaId];
-        primitives.blockcpy_pp(width, height, pred, stride, predbuf, width);
+        primitives.luma_copy_pp[part](pred, stride, predbuf, width);
     }
 
     //===== get residual signal =====
@@ -638,7 +639,6 @@ void TEncSearch::xIntraCodingChromaBlk(T
     primitives.calcrecon[size](pred, residual, recon, reconQt, reconIPred, stride, reconQtStride, reconIPredStride);
 
     //===== update distortion =====
-    int part = partitionFromSizes(width, height);
     uint32_t dist = primitives.sse_pp[part](fenc, stride, recon, stride);
     if (ttype == TEXT_CHROMA_U)
     {
@@ -1610,7 +1610,7 @@ void TEncSearch::estIntraPredQT(TComData
                 // Filtered and Unfiltered refAbove and refLeft pointing to above and left.
                 above         = aboveScale;
                 left          = leftScale;
-                aboveFiltered = aboveScale; 
+                aboveFiltered = aboveScale;
                 leftFiltered  = leftScale;
             }
 
@@ -1796,28 +1796,24 @@ void TEncSearch::estIntraPredQT(TComData
             uint32_t compWidth   = cu->getWidth(0) >> initTrDepth;
             uint32_t compHeight  = cu->getHeight(0) >> initTrDepth;
             uint32_t zorder      = cu->getZorderIdxInCU() + partOffset;
+            int      part        = partitionFromSizes(compWidth, compHeight);
             Pel*     dst         = cu->getPic()->getPicYuvRec()->getLumaAddr(cu->getAddr(), zorder);
             uint32_t dststride   = cu->getPic()->getPicYuvRec()->getStride();
             Pel*     src         = reconYuv->getLumaAddr(partOffset);
             uint32_t srcstride   = reconYuv->getStride();
-            primitives.blockcpy_pp(compWidth, compHeight, dst, dststride, src, srcstride);
+            primitives.luma_copy_pp[part](dst, dststride, src, srcstride);
 
             if (!bLumaOnly && !bSkipChroma)
             {
-                if (!bChromaSame)
-                {
-                    compWidth   >>= 1;
-                    compHeight  >>= 1;
-                }
                 dst         = cu->getPic()->getPicYuvRec()->getCbAddr(cu->getAddr(), zorder);
                 dststride   = cu->getPic()->getPicYuvRec()->getCStride();
                 src         = reconYuv->getCbAddr(partOffset);
                 srcstride   = reconYuv->getCStride();
-                primitives.blockcpy_pp(compWidth, compHeight, dst, dststride, src, srcstride);
+                primitives.chroma_copy_pp[part](dst, dststride, src, srcstride);
 
                 dst         = cu->getPic()->getPicYuvRec()->getCrAddr(cu->getAddr(), zorder);
                 src         = reconYuv->getCrAddr(partOffset);
-                primitives.blockcpy_pp(compWidth, compHeight, dst, dststride, src, srcstride);
+                primitives.chroma_copy_pp[part](dst, dststride, src, srcstride);
             }
         }
 
@@ -1851,7 +1847,7 @@ void TEncSearch::estIntraPredQT(TComData
     m_rdGoOnSbacCoder->load(m_rdSbacCoders[depth][CI_CURR_BEST]);
 
     //===== set distortion (rate and r-d costs are determined later) =====
-    outDistC                 = overallDistC;
+    outDistC              = overallDistC;
     cu->m_totalDistortion = overallDistY + overallDistC;
 }
 
@@ -2940,34 +2936,29 @@ void TEncSearch::estimateRDInterCU(TComD
     if (zerocost < cost)
     {
         const uint32_t qpartnum = cu->getPic()->getNumPartInCU() >> (cu->getDepth(0) << 1);
-        ::memset(cu->getTransformIdx(), 0, qpartnum * sizeof(UChar));
         ::memset(cu->getCbf(TEXT_LUMA), 0, qpartnum * sizeof(UChar));
         ::memset(cu->getCbf(TEXT_CHROMA_U), 0, qpartnum * sizeof(UChar));
         ::memset(cu->getCbf(TEXT_CHROMA_V), 0, qpartnum * sizeof(UChar));
-        ::memset(cu->getCoeffY(), 0, width * height * sizeof(TCoeff));
-        ::memset(cu->getCoeffCb(), 0, width * height * sizeof(TCoeff) >> 2);
-        ::memset(cu->getCoeffCr(), 0, width * height * sizeof(TCoeff) >> 2);
-        cu->setTransformSkipSubParts(0, 0, 0, 0, cu->getDepth(0));
         if (cu->getMergeFlag(0) && cu->getPartitionSize(0) == SIZE_2Nx2N)
         {
-            cu->setSkipFlagSubParts(true, 0, cu->getDepth(0));
+            cu->getSkipFlag()[0] = true;
         }
         bits = zerobits;
-        outBestResiYuv->clear();
         generateRecon(cu, predYuv, outBestResiYuv, outReconYuv, true);
+        distortion = zerodistortion;
     }
     else
     {
         xSetResidualQTData(cu, 0, 0, outBestResiYuv, cu->getDepth(0), true);
         generateRecon(cu, predYuv, outBestResiYuv, outReconYuv, false);
+
+        int part = partitionFromSizes(width, height);
+        distortion = primitives.sse_pp[part](fencYuv->getLumaAddr(), fencYuv->getStride(), outReconYuv->getLumaAddr(), outReconYuv->getStride());
+        part = partitionFromSizes(width >> 1, height >> 1);
+        distortion += m_rdCost->scaleChromaDistCb(primitives.sse_pp[part](fencYuv->getCbAddr(), fencYuv->getCStride(), outReconYuv->getCbAddr(), outReconYuv->getCStride()));
+        distortion += m_rdCost->scaleChromaDistCr(primitives.sse_pp[part](fencYuv->getCrAddr(), fencYuv->getCStride(), outReconYuv->getCrAddr(), outReconYuv->getCStride()));
     }
 
-    int part = partitionFromSizes(width, height);
-    distortion = primitives.sse_pp[part](fencYuv->getLumaAddr(), fencYuv->getStride(), outReconYuv->getLumaAddr(), outReconYuv->getStride());
-    part = partitionFromSizes(width >> 1, height >> 1);
-    distortion += m_rdCost->scaleChromaDistCb(primitives.sse_pp[part](fencYuv->getCbAddr(), fencYuv->getCStride(), outReconYuv->getCbAddr(), outReconYuv->getCStride()));
-    distortion += m_rdCost->scaleChromaDistCr(primitives.sse_pp[part](fencYuv->getCrAddr(), fencYuv->getCStride(), outReconYuv->getCrAddr(), outReconYuv->getCStride()));
-
     cu->m_totalBits       = bits;
     cu->m_totalDistortion = distortion;
     cu->m_totalCost       = m_rdCost->calcRdCost(distortion, bits);
@@ -2975,25 +2966,13 @@ void TEncSearch::estimateRDInterCU(TComD
 
 uint32_t TEncSearch::estimateZerobits(TComDataCU* cu)
 {
-    if (cu->isIntra(0))
-    {
-        return 0;
-    }
-
     uint32_t zeroResiBits = 0;
 
-    uint32_t width  = cu->getWidth(0);
-    uint32_t height = cu->getHeight(0);
-
     const uint32_t qpartnum = cu->getPic()->getNumPartInCU() >> (cu->getDepth(0) << 1);
-    ::memset(cu->getTransformIdx(), 0, qpartnum * sizeof(UChar));
+
     ::memset(cu->getCbf(TEXT_LUMA), 0, qpartnum * sizeof(UChar));
     ::memset(cu->getCbf(TEXT_CHROMA_U), 0, qpartnum * sizeof(UChar));
     ::memset(cu->getCbf(TEXT_CHROMA_V), 0, qpartnum * sizeof(UChar));
-    ::memset(cu->getCoeffY(), 0, width * height * sizeof(TCoeff));
-    ::memset(cu->getCoeffCb(), 0, width * height * sizeof(TCoeff) >> 2);
-    ::memset(cu->getCoeffCr(), 0, width * height * sizeof(TCoeff) >> 2);
-    cu->setTransformSkipSubParts(0, 0, 0, 0, cu->getDepth(0));
 
     m_rdGoOnSbacCoder->load(m_rdSbacCoders[cu->getDepth(0)][CI_CURR_BEST]);
     zeroResiBits = xSymbolBitsInter(cu);
@@ -3035,11 +3014,6 @@ void TEncSearch::generateRecon(TComDataC
 
 void TEncSearch::estimateBitsDist(TComDataCU* cu, TShortYUV* resiYuv, uint32_t& bits, uint32_t& distortion, bool curUseRDOQ)
 {
-    if (cu->isIntra(0))
-    {
-        return;
-    }
-
     bits = 0;
     distortion = 0;
     uint64_t cost = 0;
diff -r 9d74638c3640 -r 1ca01c82609f source/common/ipfilter.cpp
--- a/source/common/ipfilter.cpp	Sat Nov 09 20:14:24 2013 -0600
+++ b/source/common/ipfilter.cpp	Mon Nov 11 15:46:00 2013 +0530
@@ -425,6 +425,49 @@ void interp_vert_ps_c(pixel *src, intptr
     }
 }
 
+template<int N, int width, int height>
+void interp_vert_sp_c(int16_t *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx)
+{
+    int headRoom = IF_INTERNAL_PREC - X265_DEPTH;
+    int shift = IF_FILTER_PREC + headRoom;
+    int offset = (1 << (shift - 1)) + (IF_INTERNAL_OFFS << IF_FILTER_PREC);
+    uint16_t maxVal = (1 << X265_DEPTH) - 1;
+    const int16_t *coeff = (N == 8 ? g_lumaFilter[coeffIdx] : g_chromaFilter[coeffIdx]);
+
+    src -= (N / 2 - 1) * srcStride;
+
+    int row, col;
+    for (row = 0; row < height; row++)
+    {
+        for (col = 0; col < width; col++)
+        {
+            int sum;
+
+            sum  = src[col + 0 * srcStride] * coeff[0];
+            sum += src[col + 1 * srcStride] * coeff[1];
+            sum += src[col + 2 * srcStride] * coeff[2];
+            sum += src[col + 3 * srcStride] * coeff[3];
+            if (N == 8)
+            {
+                sum += src[col + 4 * srcStride] * coeff[4];
+                sum += src[col + 5 * srcStride] * coeff[5];
+                sum += src[col + 6 * srcStride] * coeff[6];
+                sum += src[col + 7 * srcStride] * coeff[7];
+            }
+
+            int16_t val = (int16_t)((sum + offset) >> shift);
+
+            val = (val < 0) ? 0 : val;
+            val = (val > maxVal) ? maxVal : val;
+
+            dst[col] = (pixel)val;
+        }
+
+        src += srcStride;
+        dst += dstStride;
+    }
+}
+
 typedef void (*ipfilter_ps_t)(pixel *src, intptr_t srcStride, short *dst, intptr_t dstStride, int width, int height, const short *coeff);
 typedef void (*ipfilter_sp_t)(short *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int width, int height, const short *coeff);
 
@@ -450,6 +493,7 @@ namespace x265 {
     p.luma_hps[LUMA_ ## W ## x ## H]     = interp_horiz_ps_c<8, W, H>;\
     p.luma_vpp[LUMA_ ## W ## x ## H]     = interp_vert_pp_c<8, W, H>; \
     p.luma_vps[LUMA_ ## W ## x ## H]     = interp_vert_ps_c<8, W, H>; \
+    p.luma_vsp[LUMA_ ## W ## x ## H]     = interp_vert_sp_c<8, W, H>; \
     p.luma_hvpp[LUMA_ ## W ## x ## H]    = interp_hv_pp_c<8, W, H>;
 
 void Setup_C_IPFilterPrimitives(EncoderPrimitives& p)
@@ -506,7 +550,6 @@ void Setup_C_IPFilterPrimitives(EncoderP