[x265-commits] [x265] fix shadowed variable warning

Thu Oct 31 19:43:16 CET 2013

details:   http://hg.videolan.org/x265/rev/a406f7c1dd3b
branches:  
changeset: 4773:a406f7c1dd3b
user:      Steve Borho <steve at borho.org>
date:      Wed Oct 30 20:20:08 2013 -0500
description:
fix shadowed variable warning
Subject: [x265] ipfilter: fix 16bpp build following f0eea23735a6

details:   http://hg.videolan.org/x265/rev/f06e4a24b388
branches:  
changeset: 4774:f06e4a24b388
user:      Steve Borho <steve at borho.org>
date:      Wed Oct 30 23:29:31 2013 -0500
description:
ipfilter: fix 16bpp build following f0eea23735a6
Subject: [x265] testpool: add missing stdio.h for printf

details:   http://hg.videolan.org/x265/rev/ec6b4d35f110
branches:  
changeset: 4775:ec6b4d35f110
user:      Steve Borho <steve at borho.org>
date:      Thu Oct 31 00:09:49 2013 -0500
description:
testpool: add missing stdio.h for printf
Subject: [x265] asm: fix a bug that occurred when compiling for win32

details:   http://hg.videolan.org/x265/rev/4a886c170a51
branches:  
changeset: 4776:4a886c170a51
user:      Yuvaraj Venkatesh <yuvaraj at multicorewareinc.com>
date:      Thu Oct 31 12:58:25 2013 +0530
description:
asm: fix a bug that occurred when compiling for win32
Subject: [x265] asm: reduce code size of pixel_sad_8x32 for better cache performance

details:   http://hg.videolan.org/x265/rev/e4a75488c147
branches:  
changeset: 4777:e4a75488c147
user:      Dnyaneshwar Gorade <dnyaneshwar at multicorewareinc.com>
date:      Thu Oct 31 14:06:12 2013 +0530
description:
asm: reduce code size of pixel_sad_8x32 for better cache performance
Subject: [x265] asm: reduce code size of sad_16xN, sad_32xN for better cache performance

details:   http://hg.videolan.org/x265/rev/9a0da4e6d9e3
branches:  
changeset: 4778:9a0da4e6d9e3
user:      Dnyaneshwar Gorade <dnyaneshwar at multicorewareinc.com>
date:      Thu Oct 31 15:10:34 2013 +0530
description:
asm: reduce code size of sad_16xN, sad_32xN for better cache performance
Subject: [x265] asm: fix a bug that broke the stack

details:   http://hg.videolan.org/x265/rev/08bc7ccc8aad
branches:  
changeset: 4779:08bc7ccc8aad
user:      Min Chen <chenm003 at 163.com>
date:      Thu Oct 31 20:59:27 2013 +0800
description:
asm: fix a bug that broke the stack
Subject: [x265] asm: reduce code size by reducing constant offsets

details:   http://hg.videolan.org/x265/rev/a64e813de628
branches:  
changeset: 4780:a64e813de628
user:      Min Chen <chenm003 at 163.com>
date:      Thu Oct 31 21:00:29 2013 +0800
description:
asm: reduce code size by reducing constant offsets
Subject: [x265] asm: fix bug in luma_p2s and activate it in the encoder

details:   http://hg.videolan.org/x265/rev/21dbf988079b
branches:  
changeset: 4781:21dbf988079b
user:      Min Chen <chenm003 at 163.com>
date:      Thu Oct 31 21:01:29 2013 +0800
description:
asm: fix bug in luma_p2s and activate it in the encoder
Subject: [x265] asm: chroma_p2s to replace ipfilter_p2s

details:   http://hg.videolan.org/x265/rev/4a40c4069ad1
branches:  
changeset: 4782:4a40c4069ad1
user:      Min Chen <chenm003 at 163.com>
date:      Thu Oct 31 21:01:43 2013 +0800
description:
asm: chroma_p2s to replace ipfilter_p2s
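
A minimal sketch of the templated-stride pixel-to-short conversion that the C fallback in ipfilter.cpp moves to, assuming an 8-bit build with IF_INTERNAL_PREC = 14, an internal offset of 1 << 13 and MAX_CU_SIZE = 64 (values assumed, not shown in this excerpt). All names below are hypothetical stand-ins, not x265 symbols. Because the destination stride is a compile-time constant, one body serves both a luma instantiation (stride 64) and a chroma instantiation (stride 32), which is why the new callers no longer pass dstStride.

```cpp
#include <cassert>
#include <stdint.h>

typedef uint8_t pixel;  // assumption: 8-bit build

// Assumed stand-ins for IF_INTERNAL_PREC, IF_INTERNAL_OFFS, X265_DEPTH, MAX_CU_SIZE.
const int kInternalPrec = 14;
const int kInternalOffs = 1 << (kInternalPrec - 1);
const int kBitDepth     = 8;
const int kMaxCUSize    = 64;

// Convert pixels to the signed 16-bit intermediate format used by the
// interpolation filters: scale up to internal precision, then remove the offset.
// The destination stride is a template parameter rather than an argument.
template<int dstStride>
void convertPelToShort(const pixel *src, intptr_t srcStride, int16_t *dst, int width, int height)
{
    assert(width <= dstStride);
    const int shift = kInternalPrec - kBitDepth;
    for (int y = 0; y < height; y++)
    {
        for (int x = 0; x < width; x++)
            dst[x] = (int16_t)((src[x] << shift) - kInternalOffs);
        src += srcStride;
        dst += dstStride;
    }
}

// Hypothetical instantiations mirroring p.luma_p2s and p.chroma_p2s.
void (*const lumaP2S)(const pixel *, intptr_t, int16_t *, int, int)   = convertPelToShort<kMaxCUSize>;
void (*const chromaP2S)(const pixel *, intptr_t, int16_t *, int, int) = convertPelToShort<kMaxCUSize / 2>;
```

With an 8-bit input, 0 maps to -8192 and 255 maps to 8128; the second row of a chroma block starts 32 elements into dst, and the second row of a luma block starts 64 elements in.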
Subject: [x265] assembly code for pixel_sad_x3_12x16

details:   http://hg.videolan.org/x265/rev/7ccdf622d081
branches:  
changeset: 4783:7ccdf622d081
user:      Yuvaraj Venkatesh <yuvaraj at multicorewareinc.com>
date:      Thu Oct 31 16:50:52 2013 +0530
description:
assembly code for pixel_sad_x3_12x16
Subject: [x265] assembly code for pixel_sad_x4_12x16

details:   http://hg.videolan.org/x265/rev/ed884e91d5d5
branches:  
changeset: 4784:ed884e91d5d5
user:      Yuvaraj Venkatesh <yuvaraj at multicorewareinc.com>
date:      Thu Oct 31 17:09:43 2013 +0530
description:
assembly code for pixel_sad_x4_12x16
Subject: [x265] no-rdo: Use entropy encoder for bit estimation.

details:   http://hg.videolan.org/x265/rev/775519fb9ba1
branches:  
changeset: 4785:775519fb9ba1
user:      Deepthi Devaki <deepthidevaki at multicorewareinc.com>
date:      Thu Oct 31 12:38:27 2013 +0530
description:
no-rdo: Use entropy encoder for bit estimation.

Use the entropy encoder instead of the motion-estimation bit estimate.
Subject: [x265] compress: cleanup, remove unused data structs

details:   http://hg.videolan.org/x265/rev/eed2b51675cf
branches:  
changeset: 4786:eed2b51675cf
user:      Deepthi Nandakumar <deepthi at multicorewareinc.com>
date:      Thu Oct 31 15:37:47 2013 +0530
description:
compress: cleanup, remove unused data structs
Subject: [x265] aq: set qp, lambda for every CU in the row before processing the CU

details:   http://hg.videolan.org/x265/rev/650e40a62322
branches:  
changeset: 4787:650e40a62322
user:      Aarthi Thirumalai
date:      Thu Oct 31 17:06:34 2013 +0530
description:
aq: set qp, lambda for every CU in the row before processing the CU

Enable the bUseDQP flag when AQ mode is ON.
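
With bUseDQP enabled, the xCheckDQP() hunk in the diff resets a CU's QP to its reference QP only when no luma or chroma cbf is set. This follows from HEVC syntax: cu_qp_delta is only coded when some coded block flag is nonzero, so a CU without residual is reconstructed at the predicted QP. A minimal sketch of that rule; CuQpState and effectiveQp are hypothetical names, not x265 API, and the 0..51 range assumes an 8-bit build.

```cpp
#include <cassert>

// Hypothetical stand-in for the per-CU state xCheckDQP() inspects.
struct CuQpState
{
    int  aqQp;     // QP chosen by adaptive quantization for this CU
    int  refQp;    // predicted QP (what cu->getRefQP(0) returns in the diff)
    bool cbfLuma;  // coded block flags for Y, Cb and Cr
    bool cbfCb;
    bool cbfCr;
};

// QP the decoder will reconstruct for this CU: the AQ QP can only take
// effect if a delta is coded, which requires at least one nonzero cbf.
inline int effectiveQp(const CuQpState &cu)
{
    assert(cu.aqQp >= 0 && cu.aqQp <= 51);   // 8-bit QP range assumed
    assert(cu.refQp >= 0 && cu.refQp <= 51);
    const bool hasResidual = cu.cbfLuma || cu.cbfCb || cu.cbfCr;
    return hasResidual ? cu.aqQp : cu.refQp;
}
```

Storing the reference QP for residual-free CUs keeps the encoder's QP predictor for later CUs in step with what a decoder will derive.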
Subject: [x265] aq: fix NULL pointer check

details:   http://hg.videolan.org/x265/rev/9acea4fbacef
branches:  
changeset: 4788:9acea4fbacef
user:      Steve Borho <steve at borho.org>
date:      Thu Oct 31 10:43:28 2013 -0500
description:
aq: fix NULL pointer check
Subject: [x265] aq: use more explicit chroma variance stride

details:   http://hg.videolan.org/x265/rev/3e2d69028a3b
branches:  
changeset: 4789:3e2d69028a3b
user:      Steve Borho <steve at borho.org>
date:      Thu Oct 31 10:44:22 2013 -0500
description:
aq: use more explicit chroma variance stride
Subject: [x265] aq: simplify acEnergyCu

details:   http://hg.videolan.org/x265/rev/180d95f09057
branches:  
changeset: 4790:180d95f09057
user:      Steve Borho <steve at borho.org>
date:      Thu Oct 31 10:46:13 2013 -0500
description:
aq: simplify acEnergyCu

EMMS was in the wrong place, and there were a few white-space issues.
Subject: [x265] aq: fixes for loop over 16x16 blocks

details:   http://hg.videolan.org/x265/rev/974a6afaddca
branches:  
changeset: 4791:974a6afaddca
user:      Steve Borho <steve at borho.org>
date:      Thu Oct 31 10:49:33 2013 -0500
description:
aq: fixes for loop over 16x16 blocks

This loop was busted when maxCUSize was not 64.  It still has a problem with
pictures whose dimensions are not exact multiples of 16.  The lookahead extends
the frame during lowres init to an exact multiple of 16 pixels, so its lowres CU
width will be wider than the width the AQ code uses, and the block_xy offsets
will be wrong for lookahead analysis.

The pixel extension needs to be moved earlier so that AQ and the lookahead have
a consistent 16x16 CU width.
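
A minimal sketch of the offset mismatch, assuming the AQ loop visits only whole 16x16 blocks while the lookahead first pads the width up to a multiple of 16. The helper names are hypothetical, not x265 API.

```cpp
// 16x16 blocks per row when a partial block at the right edge is dropped
// (assumed AQ behaviour).
constexpr int blocksPerRowTruncated(int widthPx) { return widthPx / 16; }

// 16x16 blocks per row after padding the width up to a multiple of 16,
// as the lookahead does during lowres init.
constexpr int blocksPerRowPadded(int widthPx) { return (widthPx + 15) / 16; }

// Raster offset (block_xy) of block (bx, by) in a grid with `stride` blocks per row.
constexpr int blockXY(int bx, int by, int stride) { return by * stride + bx; }
```

For a 1000-pixel-wide frame the truncated grid has 62 blocks per row and the padded grid 63, so block (0, 1) sits at offset 62 in one and 63 in the other; at 1920 pixels both grids have 120 blocks per row and agree. Padding before AQ runs makes both sides use the same stride.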
Subject: [x265] aq: remove unnecessary double->float->double conversions

details:   http://hg.videolan.org/x265/rev/abedbfdb1e12
branches:  
changeset: 4792:abedbfdb1e12
user:      Steve Borho <steve at borho.org>
date:      Thu Oct 31 10:50:01 2013 -0500
description:
aq: remove unnecessary double->float->double conversions
Subject: [x265] pixel: remove sad_x3_12 and sad_x4_16 intrinsic functions

details:   http://hg.videolan.org/x265/rev/2d08d77871b0
branches:  
changeset: 4793:2d08d77871b0
user:      Steve Borho <steve at borho.org>
date:      Thu Oct 31 11:25:09 2013 -0500
description:
pixel: remove sad_x3_12 and sad_x4_16 intrinsic functions
Subject: [x265] remove clang prevention for 12x16 pixel primitives

details:   http://hg.videolan.org/x265/rev/30c655ec95f7
branches:  
changeset: 4794:30c655ec95f7
user:      Steve Borho <steve at borho.org>
date:      Thu Oct 31 11:58:25 2013 -0500
description:
remove clang prevention for 12x16 pixel primitives
Subject: [x265] Ensure that the destination buffer is not overwritten. 64 is added as it is the maximum width supported for the luma filter.

details:   http://hg.videolan.org/x265/rev/935d96d93b70
branches:  
changeset: 4795:935d96d93b70
user:      Nabajit Deka
date:      Thu Oct 31 21:25:39 2013 +0530
description:
Ensure that the destination buffer is not overwritten. 64 is added as it is the maximum width supported for the luma filter.
Subject: [x265] asm: routines for vertical luma filter for all block sizes

details:   http://hg.videolan.org/x265/rev/faf29e19669f
branches:  
changeset: 4796:faf29e19669f
user:      Nabajit Deka
date:      Thu Oct 31 21:34:09 2013 +0530
description:
asm: routines for vertical luma filter for all block sizes
Subject: [x265] asm: fix typo bug in chroma_p2s

details:   http://hg.videolan.org/x265/rev/e842b2a4aeeb
branches:  
changeset: 4797:e842b2a4aeeb
user:      Min Chen <chenm003 at 163.com>
date:      Thu Oct 31 13:19:33 2013 -0500
description:
asm: fix typo bug in chroma_p2s

diffstat:

 source/Lib/TLibCommon/TComDataCU.cpp     |    4 +-
 source/Lib/TLibCommon/TComPrediction.cpp |   11 +-
 source/Lib/TLibEncoder/TEncCu.cpp        |   55 +--
 source/Lib/TLibEncoder/TEncCu.h          |    4 +-
 source/Lib/TLibEncoder/TEncSearch.h      |   12 +-
 source/common/ipfilter.cpp               |    6 +-
 source/common/primitives.h               |    1 +
 source/common/vec/ipfilter-sse41.cpp     |    3 +-
 source/common/vec/pixel-sse41.cpp        |  190 +----------
 source/common/x86/asm-primitives.cpp     |    8 +-
 source/common/x86/ipfilter8.asm          |  494 ++++++++++++++++++++++++++-
 source/common/x86/ipfilter8.h            |    1 +
 source/common/x86/sad-a.asm              |  552 ++++++++++++++----------------
 source/encoder/compress.cpp              |   74 +---
 source/encoder/encoder.cpp               |    2 +-
 source/encoder/frameencoder.cpp          |   44 ++-
 source/encoder/frameencoder.h            |    2 +
 source/encoder/framefilter.cpp           |    1 -
 source/encoder/ratecontrol.cpp           |   43 +-
 source/encoder/slicetype.cpp             |    2 +-
 source/test/ipfilterharness.cpp          |   48 ++-
 source/test/ipfilterharness.h            |    2 +-
 source/test/testpool.cpp                 |    1 +
 23 files changed, 876 insertions(+), 684 deletions(-)

diffs (truncated from 2242 to 300 lines):

diff -r 7f68debc632b -r e842b2a4aeeb source/Lib/TLibCommon/TComDataCU.cpp
--- a/source/Lib/TLibCommon/TComDataCU.cpp	Wed Oct 30 20:23:37 2013 +0530
+++ b/source/Lib/TLibCommon/TComDataCU.cpp	Thu Oct 31 13:19:33 2013 -0500
@@ -246,7 +246,7 @@ void TComDataCU::initCU(TComPic* pic, ui
     m_totalDistortion  = 0;
     m_totalBits        = 0;
     m_numPartitions    = pic->getNumPartInCU();
-
+    int qp             = pic->m_lowres.m_invQscaleFactor ? pic->getCU(getAddr())->getQP(0) : m_slice->getSliceQp();
     for (int i = 0; i < 4; i++)
     {
         m_avgCost[i] = 0;
@@ -304,7 +304,7 @@ void TComDataCU::initCU(TComPic* pic, ui
         memset(m_height           + firstElement, g_maxCUHeight,            numElements * sizeof(*m_height));
         memset(m_mvpNum[0]        + firstElement, -1,                       numElements * sizeof(*m_mvpNum[0]));
         memset(m_mvpNum[1]        + firstElement, -1,                       numElements * sizeof(*m_mvpNum[1]));
-        memset(m_qp               + firstElement, getSlice()->getSliceQp(), numElements * sizeof(*m_qp));
+        memset(m_qp               + firstElement, qp,                       numElements * sizeof(*m_qp));
         memset(m_bMergeFlags      + firstElement, false,                    numElements * sizeof(*m_bMergeFlags));
         memset(m_mergeIndex       + firstElement, 0,                        numElements * sizeof(*m_mergeIndex));
         memset(m_lumaIntraDir     + firstElement, DC_IDX,                   numElements * sizeof(*m_lumaIntraDir));
diff -r 7f68debc632b -r e842b2a4aeeb source/Lib/TLibCommon/TComPrediction.cpp
--- a/source/Lib/TLibCommon/TComPrediction.cpp	Wed Oct 30 20:23:37 2013 +0530
+++ b/source/Lib/TLibCommon/TComPrediction.cpp	Thu Oct 31 13:19:33 2013 -0500
@@ -508,7 +508,7 @@ void TComPrediction::xPredInterLumaBlk(T
 {
     int refStride = refPic->getStride();
     int refOffset = (mv->x >> 2) + (mv->y >> 2) * refStride;
-    Pel *ref      =  refPic->getLumaAddr(cu->getAddr(), cu->getZorderIdxInCU() + partAddr) + refOffset;
+    pixel *ref    =  refPic->getLumaAddr(cu->getAddr(), cu->getZorderIdxInCU() + partAddr) + refOffset;
 
     int dstStride = dstPic->m_width;
     int16_t *dst    = dstPic->getLumaAddr(partAddr);
@@ -521,7 +521,7 @@ void TComPrediction::xPredInterLumaBlk(T
 
     if ((yFrac | xFrac) == 0)
     {
-        primitives.ipfilter_p2s(ref, refStride, dst, dstStride, width, height);
+        primitives.luma_p2s(ref, refStride, dst, width, height);
     }
     else if (yFrac == 0)
     {
@@ -619,10 +619,13 @@ void TComPrediction::xPredInterChromaBlk
     uint32_t cxWidth = width >> 1;
     uint32_t cxHeight = height >> 1;
 
+    assert(dstStride == MAX_CU_SIZE / 2);
+    assert(((cxWidth | cxHeight) % 2) == 0);
+
     if ((yFrac | xFrac) == 0)
     {
-        primitives.ipfilter_p2s(refCb, refStride, dstCb, dstStride, cxWidth, cxHeight);
-        primitives.ipfilter_p2s(refCr, refStride, dstCr, dstStride, cxWidth, cxHeight);
+        primitives.chroma_p2s(refCb, refStride, dstCb, cxWidth, cxHeight);
+        primitives.chroma_p2s(refCr, refStride, dstCr, cxWidth, cxHeight);
     }
     else if (yFrac == 0)
     {
diff -r 7f68debc632b -r e842b2a4aeeb source/Lib/TLibEncoder/TEncCu.cpp
--- a/source/Lib/TLibEncoder/TEncCu.cpp	Wed Oct 30 20:23:37 2013 +0530
+++ b/source/Lib/TLibEncoder/TEncCu.cpp	Thu Oct 31 13:19:33 2013 -0500
@@ -86,12 +86,7 @@ void TEncCu::create(UChar totalDepth, ui
     m_bestPredYuv = new TComYuv*[m_totalDepth - 1];
     m_bestResiYuv = new TShortYUV*[m_totalDepth - 1];
     m_bestRecoYuv = new TComYuv*[m_totalDepth - 1];
-    for (int j = 0; j < 4; j++)
-    {
-        m_bestPredYuvNxN[j] = new TComYuv*[m_totalDepth - 1];
-        m_interCU_NxN[j]  = new TComDataCU*[m_totalDepth - 1];
-    }
-
+    
     m_tmpPredYuv = new TComYuv*[m_totalDepth - 1];
 
     m_modePredYuv[0] = new TComYuv*[m_totalDepth - 1];
@@ -119,12 +114,6 @@ void TEncCu::create(UChar totalDepth, ui
         m_tempCU[i] = new TComDataCU;
         m_tempCU[i]->create(numPartitions, width, height, maxWidth >> (m_totalDepth - 1));
 
-        for (int j = 0; j < 4; j++)
-        {
-            m_interCU_NxN[j][i] = new TComDataCU;
-            m_interCU_NxN[j][i]->create(numPartitions, width, height, maxWidth >> (m_totalDepth - 1));
-        }
-
         m_interCU_2Nx2N[i] = new TComDataCU;
         m_interCU_2Nx2N[i]->create(numPartitions, width, height, maxWidth >> (m_totalDepth - 1));
         m_interCU_2NxN[i] = new TComDataCU;
@@ -144,12 +133,6 @@ void TEncCu::create(UChar totalDepth, ui
         m_bestRecoYuv[i] = new TComYuv;
         m_bestRecoYuv[i]->create(width, height);
 
-        for (int j = 0; j < 4; j++)
-        {
-            m_bestPredYuvNxN[j][i] = new TComYuv;
-            m_bestPredYuvNxN[j][i]->create(width, height);
-        }
-
         m_tmpPredYuv[i] = new TComYuv;
         m_tmpPredYuv[i]->create(width, height);
 
@@ -231,16 +214,6 @@ void TEncCu::destroy()
             m_tempCU[i] = NULL;
         }
 
-        for (int j = 0; j < 4; j++)
-        {
-            if (m_interCU_NxN[j][i])
-            {
-                m_interCU_NxN[j][i]->destroy();
-                delete m_interCU_NxN[j][i];
-                m_interCU_NxN[j][i] = NULL;
-            }
-        }
-
         if (m_bestPredYuv[i])
         {
             m_bestPredYuv[i]->destroy();
@@ -259,16 +232,7 @@ void TEncCu::destroy()
             delete m_bestRecoYuv[i];
             m_bestRecoYuv[i] = NULL;
         }
-        for (int j = 0; j < 4; j++)
-        {
-            if (m_bestPredYuvNxN[j][i])
-            {
-                m_bestPredYuvNxN[j][i]->destroy();
-                delete m_bestPredYuvNxN[j][i];
-                m_bestPredYuvNxN[j][i] = NULL;
-            }
-        }
-
+    
         if (m_tmpPredYuv[i])
         {
             m_tmpPredYuv[i]->destroy();
@@ -329,12 +293,6 @@ void TEncCu::destroy()
     delete [] m_tempCU;
     m_tempCU = NULL;
 
-    for (int j = 0; j < 4; j++)
-    {
-        delete [] m_interCU_NxN[j];
-        m_interCU_NxN[j] = NULL;
-    }
-
     delete [] m_bestPredYuv;
     m_bestPredYuv = NULL;
     delete [] m_bestResiYuv;
@@ -342,12 +300,6 @@ void TEncCu::destroy()
     delete [] m_bestRecoYuv;
     m_bestRecoYuv = NULL;
 
-    for (int j = 0; j < 4; j++)
-    {
-        delete [] m_bestPredYuvNxN[j];
-        m_bestPredYuvNxN[j] = NULL;
-    }
-
     delete [] m_bestMergeRecoYuv;
     m_bestMergeRecoYuv = NULL;
     delete [] m_tmpPredYuv;
@@ -1691,7 +1643,8 @@ void TEncCu::xCheckDQP(TComDataCU* cu)
 
     if (cu->getSlice()->getPPS()->getUseDQP() && (g_maxCUWidth >> depth) >= cu->getSlice()->getPPS()->getMinCuDQPSize())
     {
-        cu->setQPSubParts(cu->getRefQP(0), 0, depth); // set QP to default QP
+        if (!cu->getCbf(0, TEXT_LUMA, 0) && !cu->getCbf(0, TEXT_CHROMA_U, 0) && !cu->getCbf(0, TEXT_CHROMA_V, 0))
+            cu->setQPSubParts(cu->getRefQP(0), 0, depth); // set QP to default QP
     }
 }
 
diff -r 7f68debc632b -r e842b2a4aeeb source/Lib/TLibEncoder/TEncCu.h
--- a/source/Lib/TLibEncoder/TEncCu.h	Wed Oct 30 20:23:37 2013 +0530
+++ b/source/Lib/TLibEncoder/TEncCu.h	Thu Oct 31 13:19:33 2013 -0500
@@ -76,15 +76,13 @@ private:
     TComDataCU** m_intraInInterCU;
     TComDataCU** m_mergeCU;
     TComDataCU** m_bestMergeCU;
-    TComDataCU** m_interCU_NxN[4];
     TComDataCU** m_bestCU;      ///< Best CUs at each depth
     TComDataCU** m_tempCU;      ///< Temporary CUs at each depth
 
     TComYuv**    m_bestPredYuv; ///< Best Prediction Yuv for each depth
     TShortYUV**  m_bestResiYuv; ///< Best Residual Yuv for each depth
     TComYuv**    m_bestRecoYuv; ///< Best Reconstruction Yuv for each depth
-    TComYuv**    m_bestPredYuvNxN[4];
-
+   
     TComYuv**    m_tmpPredYuv;  ///< Temporary Prediction Yuv for each depth
     TShortYUV**  m_tmpResiYuv;  ///< Temporary Residual Yuv for each depth
     TComYuv**    m_tmpRecoYuv;  ///< Temporary Reconstruction Yuv for each depth
diff -r 7f68debc632b -r e842b2a4aeeb source/Lib/TLibEncoder/TEncSearch.h
--- a/source/Lib/TLibEncoder/TEncSearch.h	Wed Oct 30 20:23:37 2013 +0530
+++ b/source/Lib/TLibEncoder/TEncSearch.h	Thu Oct 31 13:19:33 2013 -0500
@@ -165,6 +165,12 @@ public:
 
     void xSetIntraResultQT(TComDataCU* cu, uint32_t trDepth, uint32_t absPartIdx, bool bLumaOnly, TComYuv* reconYuv);
 
+    // -------------------------------------------------------------------------------------------------------------------
+    // compute symbol bits
+    // -------------------------------------------------------------------------------------------------------------------
+
+    uint32_t xSymbolBitsInter(TComDataCU* cu);
+
 protected:
 
     // --------------------------------------------------------------------------------------------
@@ -232,12 +238,6 @@ protected:
                              UInt64 &rdCost, uint32_t &outBits, uint32_t &outDist, uint32_t *puiZeroDist);
     void xSetResidualQTData(TComDataCU* cu, uint32_t absPartIdx, uint32_t absTUPartIdx, TShortYUV* resiYuv, uint32_t depth, bool bSpatial);
 
-    // -------------------------------------------------------------------------------------------------------------------
-    // compute symbol bits
-    // -------------------------------------------------------------------------------------------------------------------
-
-    uint32_t xSymbolBitsInter(TComDataCU* cu);
-
     void setWpScalingDistParam(TComDataCU* cu, int refIdx, int picList);
 };
 }
diff -r 7f68debc632b -r e842b2a4aeeb source/common/ipfilter.cpp
--- a/source/common/ipfilter.cpp	Wed Oct 30 20:23:37 2013 +0530
+++ b/source/common/ipfilter.cpp	Thu Oct 31 13:19:33 2013 -0500
@@ -264,6 +264,7 @@ void filterConvertPelToShort_c(pixel *sr
     }
 }
 
+template<int dstStride>
 void filterConvertPelToShort_c(pixel *src, intptr_t srcStride, int16_t *dst, int width, int height)
 {
     int shift = IF_INTERNAL_PREC - X265_DEPTH;
@@ -278,7 +279,7 @@ void filterConvertPelToShort_c(pixel *sr
         }
 
         src += srcStride;
-        dst += MAX_CU_SIZE;
+        dst += dstStride;
     }
 }
 
@@ -489,7 +490,8 @@ void Setup_C_IPFilterPrimitives(EncoderP
 
     p.ipfilter_p2s = filterConvertPelToShort_c;
     p.ipfilter_s2p = filterConvertShortToPel_c;
-    p.luma_p2s = filterConvertPelToShort_c;
+    p.luma_p2s = filterConvertPelToShort_c<MAX_CU_SIZE>;
+    p.chroma_p2s = filterConvertPelToShort_c<MAX_CU_SIZE/2>;
 
     p.extendRowBorder = extendCURowColBorder;
 }
diff -r 7f68debc632b -r e842b2a4aeeb source/common/primitives.h
--- a/source/common/primitives.h	Wed Oct 30 20:23:37 2013 +0530
+++ b/source/common/primitives.h	Thu Oct 31 13:19:33 2013 -0500
@@ -254,6 +254,7 @@ struct EncoderPrimitives
     filter_pp_t     luma_vpp[NUM_LUMA_PARTITIONS];
     filter_hv_pp_t  luma_hvpp[NUM_LUMA_PARTITIONS];
     filter_p2s_t    luma_p2s;
+    filter_p2s_t    chroma_p2s;
 
     intra_dc_t      intra_pred_dc;
     intra_planar_t  intra_pred_planar;
diff -r 7f68debc632b -r e842b2a4aeeb source/common/vec/ipfilter-sse41.cpp
--- a/source/common/vec/ipfilter-sse41.cpp	Wed Oct 30 20:23:37 2013 +0530
+++ b/source/common/vec/ipfilter-sse41.cpp	Thu Oct 31 13:19:33 2013 -0500
@@ -681,8 +681,9 @@ void filterHorizontal_pp(pixel *src, int
 #include "vectorclass.h"
 namespace {
 template<int N>
-void filterVertical_sp(int16_t *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int block_width, int block_height, const int16_t *coeff)
+void filterVertical_sp(int16_t *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int block_width, int block_height, int coeffIdx)
 {
+    const int16_t *coeff = (N == 8 ? g_lumaFilter[coeffIdx] : g_chromaFilter[coeffIdx]);
     int row, col;
 
     src -= (N / 2 - 1) * srcStride;
diff -r 7f68debc632b -r e842b2a4aeeb source/common/vec/pixel-sse41.cpp
--- a/source/common/vec/pixel-sse41.cpp	Wed Oct 30 20:23:37 2013 +0530
+++ b/source/common/vec/pixel-sse41.cpp	Thu Oct 31 13:19:33 2013 -0500
@@ -34,90 +34,6 @@ using namespace x265;
 namespace {
 #if !HIGH_BIT_DEPTH
 template<int ly>
-void sad_x3_12(pixel *fenc, pixel *fref1, pixel *fref2, pixel *fref3, intptr_t frefstride, int32_t *res)
-{
-    assert(ly == 16);
-    res[0] = res[1] = res[2] = 0;
-    __m128i T00, T01, T02, T03;
-    __m128i T10, T11, T12, T13;
-    __m128i T20, T21, T22, T23;
-    __m128i sum0, sum1;
-
-#ifndef MASK
-#define MASK _mm_set_epi32(0x0, 0xffffffff, 0xffffffff, 0xffffffff)
-#endif
-
-#define PROCESS_12x4x3(BASE) \
-    T00 = _mm_load_si128((__m128i*)(fenc + (BASE + 0) * FENC_STRIDE)); \
-    T00 = _mm_and_si128(T00, MASK); \
-    T01 = _mm_load_si128((__m128i*)(fenc + (BASE + 1) * FENC_STRIDE)); \