[x265-commits] [x265] analysis: remove redundant variables, cleanup variable names

Wed Sep 24 22:30:51 CEST 2014

details:   http://hg.videolan.org/x265/rev/6334cc645407
branches:  
changeset: 8099:6334cc645407
user:      Deepthi Nandakumar <deepthi at multicorewareinc.com>
date:      Sun Sep 21 22:30:08 2014 +0530
description:
analysis: remove redundant variables, cleanup variable names
Subject: [x265] analysis: remove CheckBestMode from CheckIntra

details:   http://hg.videolan.org/x265/rev/817abe294c8b
branches:  
changeset: 8100:817abe294c8b
user:      Deepthi Nandakumar <deepthi at multicorewareinc.com>
date:      Sun Sep 21 23:18:49 2014 +0530
description:
analysis: remove CheckBestMode from CheckIntra
Subject: [x265] psy-rd: fix bug in chroma psyEnergy for intra 4x4

details:   http://hg.videolan.org/x265/rev/d1c2b82de4db
branches:  
changeset: 8101:d1c2b82de4db
user:      Deepthi Nandakumar <deepthi at multicorewareinc.com>
date:      Mon Sep 22 08:53:40 2014 +0530
description:
psy-rd: fix bug in chroma psyEnergy for intra 4x4

Also add TODO, for all psyCost calculations
Subject: [x265] analysis: nits

details:   http://hg.videolan.org/x265/rev/39d0ba6012d5
branches:  
changeset: 8102:39d0ba6012d5
user:      Deepthi Nandakumar <deepthi at multicorewareinc.com>
date:      Mon Sep 22 09:46:00 2014 +0530
description:
analysis: nits
Subject: [x265] search: clean xRecurIntraCodingQT

details:   http://hg.videolan.org/x265/rev/d1ffc125f0a3
branches:  
changeset: 8103:d1ffc125f0a3
user:      Deepthi Nandakumar <deepthi at multicorewareinc.com>
date:      Mon Sep 22 09:46:18 2014 +0530
description:
search: clean xRecurIntraCodingQT
Subject: [x265] motion: avoid extra iterations when no subpel motion found

details:   http://hg.videolan.org/x265/rev/2c1d4c7d85ba
branches:  
changeset: 8104:2c1d4c7d85ba
user:      Steve Borho <steve at borho.org>
date:      Sat Sep 20 16:19:00 2014 +0100
description:
motion: avoid extra iterations when no subpel motion found

Subsequent iterations would also have returned zero, so running them would be
pointless. This is an adaptation of a patch by Sheva Xu.
Subject: [x265] bitcost: use enums for special constants rather than static const ints

details:   http://hg.videolan.org/x265/rev/3fd2d7acb6bb
branches:  
changeset: 8105:3fd2d7acb6bb
user:      Steve Borho <steve at borho.org>
date:      Sat Sep 20 16:31:00 2014 +0100
description:
bitcost: use enums for special constants rather than static const ints

enums require no storage
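
A minimal sketch of the idiom, assuming a made-up class (the names CostTable, MAX_ENTRIES and INVALID_COST are hypothetical and are not the constants in bitcost.h). An enumerator is a named compile-time constant rather than an object, so it occupies no storage and never needs an out-of-class definition:

```cpp
#include <cassert>

// Hypothetical class, only illustrating the idiom; names are not from x265.
struct CostTable
{
    // Enumerators are compile-time constants, not objects: they occupy
    // no storage and cannot have their address taken.
    enum { MAX_ENTRIES = 64, INVALID_COST = 0xFFFF };

    int entries[MAX_ENTRIES]; // usable wherever a constant expression is required

    int lookup(int i) const
    {
        assert(i >= 0);
        if (i >= MAX_ENTRIES)
            return INVALID_COST;
        return entries[i];
    }
};
```

A `static const int` member works the same way in most expressions, but it is an object and may need a separate definition if its address is taken or it is bound to a reference.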
Subject: [x265] nits: do not check for NULL from new operations

details:   http://hg.videolan.org/x265/rev/6e450860475a
branches:  
changeset: 8106:6e450860475a
user:      Steve Borho <steve at borho.org>
date:      Sat Sep 20 16:40:44 2014 +0100
description:
nits: do not check for NULL from new operations

By the C++ spec, new is incapable of returning NULL. If an allocation failure
actually occurs, an exception is thrown (which we do not catch).
Long term, all of these new operations need to be replaced by malloc and
explicit initialization and destruction. In the short term, these return value
checks are redundant.
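
As an illustration only, with a hypothetical Widget type that does not appear in x265: a plain new expression reports failure by throwing std::bad_alloc, so a NULL check after it is dead code. The std::nothrow form is the one that can return a null pointer, and only there is a check meaningful.

```cpp
#include <cassert>
#include <new>

// Hypothetical type used only for this sketch.
struct Widget
{
    int value;
    Widget() : value(42) {}
};

// Plain new: failure is reported by throwing std::bad_alloc, so the
// returned pointer is never NULL and needs no check.
Widget* makeThrowing()
{
    return new Widget;
}

// nothrow new: failure is reported by returning a null pointer, so the
// caller must check the result.
Widget* makeNoThrow()
{
    Widget* w = new (std::nothrow) Widget;
    if (!w)
        return nullptr; // allocation failed
    return w;
}
```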
Subject: [x265] bitstream: add paren to avoid ambiguous precedence in X265_CHECK

details:   http://hg.videolan.org/x265/rev/2599fd87b72e
branches:  
changeset: 8107:2599fd87b72e
user:      Steve Borho <steve at borho.org>
date:      Sat Sep 20 16:51:07 2014 +0100
description:
bitstream: add paren to avoid ambiguous precedence in X265_CHECK
Subject: [x265] entropy: fix SAO enable detection (refs #80)

details:   http://hg.videolan.org/x265/rev/c39538f0c59b
branches:  
changeset: 8108:c39538f0c59b
user:      Steve Borho <steve at borho.org>
date:      Sat Sep 20 16:52:29 2014 +0100
description:
entropy: fix SAO enable detection (refs #80)

Apparently our analysis never toggles luma separately from chroma, because this
bug has not resulted in any bad bitstreams that I know of. This bug was found
via static analysis.
Subject: [x265] encoder: use %u to sprintf unsigned ints (refs #80)

details:   http://hg.videolan.org/x265/rev/a58aea624122
branches:  
changeset: 8109:a58aea624122
user:      Steve Borho <steve at borho.org>
date:      Sat Sep 20 16:52:59 2014 +0100
description:
encoder: use %u to sprintf unsigned ints (refs #80)
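
A small sketch of the format-specifier point, assuming a hypothetical helper named formatCount (the actual call sites in encoder.cpp are not shown in this mail). `%u` matches `unsigned int`; passing an unsigned value to `%d` is a type mismatch and typically prints large values as negative numbers.

```cpp
#include <cassert>
#include <cstdio>

// Hypothetical helper: formats an unsigned counter into buf.
// Returns the number of characters written, excluding the terminator.
int formatCount(char* buf, std::size_t size, unsigned int count)
{
    // %u is the conversion that matches unsigned int.
    return std::snprintf(buf, size, "count=%u", count);
}
```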
Subject: [x265] TComWeightPrediction: combine duplicate inline functions (refs #80)

details:   http://hg.videolan.org/x265/rev/c7cc07fd21a7
branches:  
changeset: 8110:c7cc07fd21a7
user:      Steve Borho <steve at borho.org>
date:      Sat Sep 20 17:05:54 2014 +0100
description:
TComWeightPrediction: combine duplicate inline functions (refs #80)
Subject: [x265] predict: merge TComWeightPrediction functions into Predict

details:   http://hg.videolan.org/x265/rev/0be03e280b3d
branches:  
changeset: 8111:0be03e280b3d
user:      Steve Borho <steve at borho.org>
date:      Sat Sep 20 17:57:15 2014 +0100
description:
predict: merge TComWeightPrediction functions into Predict

* TComWeightPrediction had no member vars, the constructor was useless
* half of the functions were not used, they were dropped
* default arguments were removed, none were actually required
* x prefixes removed from method names
* comments were cleaned up
Subject: [x265] predict: don't bother keeping refidx as an array

details:   http://hg.videolan.org/x265/rev/1c172c1822e4
branches:  
changeset: 8112:1c172c1822e4
user:      Steve Borho <steve at borho.org>
date:      Sat Sep 20 18:05:32 2014 +0100
description:
predict: don't bother keeping refidx as an array

it is always indexed explicitly
Subject: [x265] nits: use parentheses to improve readability in shifts

details:   http://hg.videolan.org/x265/rev/fd435504f15e
branches:  
changeset: 8113:fd435504f15e
user:      Deepthi Nandakumar <deepthi at multicorewareinc.com>
date:      Mon Sep 22 13:14:54 2014 +0530
description:
nits: use parentheses to improve readability in shifts
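
The TComDataCU.cpp hunks in the diff further down rewrite `x << LOG2_UNIT_SIZE * 2` as `x << (LOG2_UNIT_SIZE * 2)`. Because `*` binds tighter than `<<`, both spellings compute the same value, so the change affects readability only. A minimal sketch, where kLog2UnitSize is a stand-in value and not the real LOG2_UNIT_SIZE:

```cpp
#include <cassert>
#include <cstdint>

// Stand-in value for illustration; the real LOG2_UNIT_SIZE is defined in x265.
const int kLog2UnitSize = 2;

// '*' has higher precedence than '<<', so this already shifts by (kLog2UnitSize * 2).
uint32_t offsetImplicit(uint32_t idx) { return idx << kLog2UnitSize * 2; }

// Same computation with the grouping spelled out for the reader.
uint32_t offsetExplicit(uint32_t idx) { return idx << (kLog2UnitSize * 2); }
```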
Subject: [x265] Backed out changeset: 25dde1ffab66

details:   http://hg.videolan.org/x265/rev/82bab5587bf1
branches:  
changeset: 8114:82bab5587bf1
user:      Deepthi Nandakumar <deepthi at multicorewareinc.com>
date:      Mon Sep 22 21:37:59 2014 +0530
description:
Backed out changeset: 25dde1ffab66

This commit needs more investigation, with specific VBV
use cases like 1-sec GOPs.
Subject: [x265] simplify intra filter (with fix for da61cf406f16)

details:   http://hg.videolan.org/x265/rev/ee76b64fd051
branches:  
changeset: 8115:ee76b64fd051
user:      Satoshi Nakagawa <nakagawa424 at oki.com>
date:      Mon Sep 22 21:28:59 2014 +0900
description:
simplify intra filter (with fix for da61cf406f16)
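
The TComPattern.cpp diff further down shows the reference-sample smoothing this commit touches: interior samples are filtered with a [1 2 1] kernel and rounded, while the end samples are copied unfiltered. A simplified sketch of that kernel on a single line of samples, assuming `int` in place of the encoder's pixel type and a hypothetical function name:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Smooths a line of reference samples with the [1 2 1] / 4 kernel,
// rounding to nearest. The first and last samples are copied unfiltered.
std::vector<int> smooth121(const std::vector<int>& in)
{
    std::vector<int> out(in);
    for (std::size_t i = 1; i + 1 < in.size(); i++)
        out[i] = (in[i - 1] + 2 * in[i] + in[i + 1] + 2) >> 2;
    return out;
}
```

The real code filters the left and above reference arrays separately and handles the shared top-left corner; this sketch leaves that out.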
Subject: [x265] add avx version for chroma_copy_ss 16x4, 16x8, 16x12, 16x16, 16x24, 16x32, 16x64 based on csp, approx 1.5x-2x speedup over SSE

details:   http://hg.videolan.org/x265/rev/1f5ffdc453ee
branches:  
changeset: 8116:1f5ffdc453ee
user:      Sagar Kotecha
date:      Tue Sep 23 12:47:02 2014 +0530
description:
add avx version for chroma_copy_ss 16x4, 16x8, 16x12, 16x16, 16x24, 16x32, 16x64 based on csp, approx 1.5x-2x speedup over SSE
Subject: [x265] asm: replace mova by movu to avoid AVX2 testbench crash in dct16, dct32, denoise_dct; it's the same speed on Haswell

details:   http://hg.videolan.org/x265/rev/02253e0800ea
branches:  
changeset: 8117:02253e0800ea
user:      Min Chen <chenm003 at 163.com>
date:      Tue Sep 23 12:18:31 2014 -0700
description:
asm: replace mova by movu to avoid AVX2 testbench crash in dct16, dct32, denoise_dct; it's the same speed on Haswell
Subject: [x265] asm: avx2 code for dct8x8

details:   http://hg.videolan.org/x265/rev/271e5eb1e396
branches:  
changeset: 8118:271e5eb1e396
user:      Yuvaraj Venkatesh <yuvaraj at multicorewareinc.com>
date:      Tue Sep 23 10:16:26 2014 +0530
description:
asm: avx2 code for dct8x8
Subject: [x265] blockcopy_ss: 64x16, 64x32, 64x48, 64x64 AVX version of asm code, approx double speedup compared to SSE

details:   http://hg.videolan.org/x265/rev/e2b577330c9b
branches:  
changeset: 8119:e2b577330c9b
user:      Sagar Kotecha
date:      Tue Sep 23 18:32:53 2014 +0530
description:
blockcopy_ss: 64x16, 64x32, 64x48, 64x64 AVX version of asm code, approx double speedup compared to SSE
Subject: [x265] Backed out changeset: fa2f1aa1456e

details:   http://hg.videolan.org/x265/rev/b2b7072ddbf7
branches:  
changeset: 8120:b2b7072ddbf7
user:      Deepthi Nandakumar <deepthi at multicorewareinc.com>
date:      Wed Sep 24 11:48:15 2014 +0530
description:
Backed out changeset: fa2f1aa1456e

This commit allocated the harness instances on the heap, thus
no longer respecting __declspec(align) directives for the member
fields. We could probably circumvent this by overloading operator new with
aligned_malloc, but I'm not sure this is good practice.
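
For reference, a sketch of what the workaround mentioned above could look like. This is not code from x265: AlignedHarness is a hypothetical class, standard `alignas` stands in for `__declspec(align)`, and the allocator over-allocates with malloc and rounds up by hand instead of calling a platform aligned-malloc function.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstdlib>
#include <new>

// Hypothetical harness-like class with an over-aligned member.
struct AlignedHarness
{
    enum { kAlign = 32 };

    alignas(32) unsigned char buffer[64];

    // Over-allocate, round the address up to kAlign, and stash the raw
    // malloc pointer just before the aligned block for operator delete.
    static void* operator new(std::size_t size)
    {
        void* raw = std::malloc(size + kAlign + sizeof(void*));
        if (!raw)
            throw std::bad_alloc();
        std::uintptr_t start = reinterpret_cast<std::uintptr_t>(raw) + sizeof(void*);
        std::uintptr_t aligned = (start + kAlign - 1) & ~static_cast<std::uintptr_t>(kAlign - 1);
        reinterpret_cast<void**>(aligned)[-1] = raw;
        return reinterpret_cast<void*>(aligned);
    }

    // Recover the raw pointer stored by operator new and free it.
    static void operator delete(void* p)
    {
        if (p)
            std::free(static_cast<void**>(p)[-1]);
    }
};
```

With this in place, `new AlignedHarness` yields a 32-byte-aligned object. Whether that is preferable to keeping the harness instances off the heap is the open question raised in the backout message.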
Subject: [x265] predict: combine redundant logic paths in predInterBi()

details:   http://hg.videolan.org/x265/rev/3f1681901fb4
branches:  
changeset: 8121:3f1681901fb4
user:      Steve Borho <steve at borho.org>
date:      Sat Sep 20 18:12:56 2014 +0100
description:
predict: combine redundant logic paths in predInterBi()

removes weightedPredictionBi(), which is no longer called
Subject: [x265] predict: use faster unidir prediction for B frames when weighting not enabled

details:   http://hg.videolan.org/x265/rev/30dd73bb8a93
branches:  
changeset: 8122:30dd73bb8a93
user:      Steve Borho <steve at borho.org>
date:      Sat Sep 20 18:30:03 2014 +0100
description:
predict: use faster unidir prediction for B frames when weighting not enabled
Subject: [x265] predict: streamline getWpScaling()

details:   http://hg.videolan.org/x265/rev/e26ce61cd2e3
branches:  
changeset: 8123:e26ce61cd2e3
user:      Steve Borho <steve at borho.org>
date:      Sat Sep 20 18:30:35 2014 +0100
description:
predict: streamline getWpScaling()
Subject: [x265] predict: remove list argument from motionCompensation(), always REF_PIC_LIST_X

details:   http://hg.videolan.org/x265/rev/cf90338bbc87
branches:  
changeset: 8124:cf90338bbc87
user:      Steve Borho <steve at borho.org>
date:      Sat Sep 20 18:39:12 2014 +0100
description:
predict: remove list argument from motionCompensation(), always REF_PIC_LIST_X
Subject: [x265] predict: remove checkIdenticalMotion()

details:   http://hg.videolan.org/x265/rev/532d0266e333
branches:  
changeset: 8125:532d0266e333
user:      Steve Borho <steve at borho.org>
date:      Sat Sep 20 18:40:56 2014 +0100
description:
predict: remove checkIdenticalMotion()

We will not insert the same reference picture into L1 and L0 at the same time,
so this check is utterly redundant.
Subject: [x265] refine deblocking filter

details:   http://hg.videolan.org/x265/rev/940cec3bf0b4
branches:  
changeset: 8126:940cec3bf0b4
user:      Satoshi Nakagawa <nakagawa424 at oki.com>
date:      Wed Sep 24 18:08:46 2014 +0900
description:
refine deblocking filter
Subject: [x265] blockcopy_pp_32x8: avx asm code, improved 281.20 cycles -> 165.47

details:   http://hg.videolan.org/x265/rev/2d8adf9a4ab0
branches:  
changeset: 8127:2d8adf9a4ab0
user:      Praveen Tiwari
date:      Wed Sep 24 15:42:15 2014 +0530
description:
blockcopy_pp_32x8: avx asm code, improved 281.20 cycles -> 165.47
Subject: [x265] blockcopy_pp_32x16: avx asm code, improved 477.74 cycles -> 309.99

details:   http://hg.videolan.org/x265/rev/b51e34a4b828
branches:  
changeset: 8128:b51e34a4b828
user:      Praveen Tiwari
date:      Wed Sep 24 17:05:42 2014 +0530
description:
blockcopy_pp_32x16: avx asm code, improved 477.74 cycles -> 309.99
Subject: [x265] blockcopy_pp_32x24: avx asm code, improved 621.84 cycles -> 371.94

details:   http://hg.videolan.org/x265/rev/fe901487b7cc
branches:  
changeset: 8129:fe901487b7cc
user:      Praveen Tiwari
date:      Wed Sep 24 17:50:48 2014 +0530
description:
blockcopy_pp_32x24: avx asm code, improved 621.84 cycles -> 371.94
Subject: [x265] blockcopy_pp avx asm code: 32x32, 32x48, 32x64 improved 803.69 -> 514.90, 1126.36 -> 655.24, 1454.09 -> 835.76 cycles

details:   http://hg.videolan.org/x265/rev/3fe7e7975eae
branches:  
changeset: 8130:3fe7e7975eae
user:      Praveen Tiwari
date:      Wed Sep 24 18:52:33 2014 +0530
description:
blockcopy_pp avx asm code: 32x32, 32x48, 32x64 improved 803.69 -> 514.90, 1126.36 -> 655.24, 1454.09 -> 835.76 cycles
Subject: [x265] primitives: remove unused block copy primitives

details:   http://hg.videolan.org/x265/rev/63b7cb39e9f1
branches:  
changeset: 8131:63b7cb39e9f1
user:      Steve Borho <steve at borho.org>
date:      Wed Sep 24 15:03:48 2014 -0500
description:
primitives: remove unused block copy primitives
Subject: [x265] cmake: remove blockcopy-sse3.cpp

details:   http://hg.videolan.org/x265/rev/c79590d89389
branches:  
changeset: 8132:c79590d89389
user:      Steve Borho <steve at borho.org>
date:      Wed Sep 24 15:04:08 2014 -0500
description:
cmake: remove blockcopy-sse3.cpp
Subject: [x265] vec: remove idct8, we have SSSE3 assembly for it

details:   http://hg.videolan.org/x265/rev/eb011fa1d2d8
branches:  
changeset: 8133:eb011fa1d2d8
user:      Steve Borho <steve at borho.org>
date:      Wed Sep 24 15:13:12 2014 -0500
description:
vec: remove idct8, we have SSSE3 assembly for it
Subject: [x265] vec: make a note for why we keep some of the remaining vector routines

details:   http://hg.videolan.org/x265/rev/f6a0b0a97a5b
branches:  
changeset: 8134:f6a0b0a97a5b
user:      Steve Borho <steve at borho.org>
date:      Wed Sep 24 15:30:16 2014 -0500
description:
vec: make a note for why we keep some of the remaining vector routines

diffstat:

 source/Lib/TLibCommon/TComDataCU.cpp           |    8 +-
 source/Lib/TLibCommon/TComPattern.cpp          |  188 ++----
 source/Lib/TLibCommon/TComPattern.h            |    2 +-
 source/Lib/TLibCommon/TComPicSym.cpp           |    4 +-
 source/Lib/TLibCommon/TComPicYuv.cpp           |    2 +-
 source/Lib/TLibCommon/TComPicYuv.h             |    4 +-
 source/Lib/TLibCommon/TComPicYuvMD5.cpp        |    2 +-
 source/Lib/TLibCommon/TComRom.cpp              |    2 +-
 source/Lib/TLibCommon/TComWeightPrediction.cpp |  692 -------------------------
 source/Lib/TLibCommon/TComWeightPrediction.h   |   75 --
 source/common/CMakeLists.txt                   |    4 +-
 source/common/bitstream.cpp                    |    2 +-
 source/common/deblock.cpp                      |  313 +++++------
 source/common/deblock.h                        |    4 +-
 source/common/frame.cpp                        |   27 +-
 source/common/pixel.cpp                        |   49 -
 source/common/primitives.h                     |    6 -
 source/common/quant.cpp                        |   16 +-
 source/common/vec/blockcopy-sse3.cpp           |  229 --------
 source/common/vec/dct-sse3.cpp                 |  272 +---------
 source/common/vec/dct-ssse3.cpp                |    3 +
 source/common/vec/vec-primitives.cpp           |    3 -
 source/common/x86/asm-primitives.cpp           |   36 +
 source/common/x86/blockcopy8.asm               |  282 ++++++++++
 source/common/x86/blockcopy8.h                 |   18 +
 source/common/x86/dct8.asm                     |  262 ++++++--
 source/common/x86/dct8.h                       |    1 +
 source/encoder/analysis.cpp                    |   83 +-
 source/encoder/analysis.h                      |   12 +-
 source/encoder/api.cpp                         |   35 +-
 source/encoder/bitcost.h                       |    4 +-
 source/encoder/encoder.cpp                     |   29 +-
 source/encoder/entropy.cpp                     |    4 +-
 source/encoder/motion.cpp                      |   10 +-
 source/encoder/predict.cpp                     |  417 +++++++++++---
 source/encoder/predict.h                       |   12 +-
 source/encoder/ratecontrol.cpp                 |    6 +-
 source/encoder/sao.cpp                         |    2 +-
 source/encoder/search.cpp                      |   77 +-
 source/encoder/search.h                        |    5 +-
 source/test/pixelharness.cpp                   |  126 ----
 source/test/pixelharness.h                     |    3 -
 source/test/testbench.cpp                      |   22 +-
 43 files changed, 1253 insertions(+), 2100 deletions(-)

diffs (truncated from 5043 to 300 lines):

diff -r c8f53398f8ce -r f6a0b0a97a5b source/Lib/TLibCommon/TComDataCU.cpp

--- a/source/Lib/TLibCommon/TComDataCU.cpp	Sat Sep 20 15:41:08 2014 +0100
+++ b/source/Lib/TLibCommon/TComDataCU.cpp	Wed Sep 24 15:30:16 2014 -0500
@@ -613,7 +613,7 @@ void TComDataCU::copyToPic(uint32_t dept
     m_cuMvField[1].copyTo(cu->getCUMvField(REF_PIC_LIST_1), m_absIdxInLCU);
 
     uint32_t tmpY  = 1 << ((g_maxLog2CUSize - depth) * 2);
-    uint32_t tmpY2 = m_absIdxInLCU << LOG2_UNIT_SIZE * 2;
+    uint32_t tmpY2 = m_absIdxInLCU << (LOG2_UNIT_SIZE * 2);
     memcpy(cu->getCoeffY() + tmpY2, m_trCoeff[0], sizeof(coeff_t) * tmpY);
 
     uint32_t tmpC  = tmpY  >> (m_hChromaShift + m_vChromaShift);
@@ -624,7 +624,7 @@ void TComDataCU::copyToPic(uint32_t dept
     if (m_slice->m_pps->bTransquantBypassEnabled)
     {
         uint32_t tmp  = 1 << ((g_maxLog2CUSize - depth) * 2);
-        uint32_t tmp2 = m_absIdxInLCU << LOG2_UNIT_SIZE * 2;
+        uint32_t tmp2 = m_absIdxInLCU << (LOG2_UNIT_SIZE * 2);
         memcpy(cu->getLumaOrigYuv() + tmp2, m_tqBypassOrigYuv[0], sizeof(pixel) * tmp);
 
         memcpy(cu->getChromaOrigYuv(1) + tmpC2, m_tqBypassOrigYuv[1], sizeof(pixel) * tmpC);
@@ -651,7 +651,7 @@ void TComDataCU::copyCodedToPic(uint32_t
     memcpy(cu->getCbf(TEXT_CHROMA_V) + m_absIdxInLCU, m_cbf[2], sizeInChar);
 
     uint32_t tmpY  = 1 << ((g_maxLog2CUSize - depth) * 2);
-    uint32_t tmpY2 = m_absIdxInLCU << LOG2_UNIT_SIZE * 2;
+    uint32_t tmpY2 = m_absIdxInLCU << (LOG2_UNIT_SIZE * 2);
     memcpy(cu->getCoeffY() + tmpY2, m_trCoeff[0], sizeof(coeff_t) * tmpY);
     tmpY  >>= m_hChromaShift + m_vChromaShift;
     tmpY2 >>= m_hChromaShift + m_vChromaShift;
@@ -704,7 +704,7 @@ void TComDataCU::copyToPic(uint32_t dept
     m_cuMvField[1].copyTo(cu->getCUMvField(REF_PIC_LIST_1), m_absIdxInLCU, partStart, qNumPart);
 
     uint32_t tmpY  = 1 << ((g_maxLog2CUSize - depth - partDepth) * 2);
-    uint32_t tmpY2 = partOffset << LOG2_UNIT_SIZE * 2;
+    uint32_t tmpY2 = partOffset << (LOG2_UNIT_SIZE * 2);
     memcpy(cu->getCoeffY() + tmpY2, m_trCoeff[0],  sizeof(coeff_t) * tmpY);
 
     uint32_t tmpC  = tmpY >> (m_hChromaShift + m_vChromaShift);
diff -r c8f53398f8ce -r f6a0b0a97a5b source/Lib/TLibCommon/TComPattern.cpp
--- a/source/Lib/TLibCommon/TComPattern.cpp	Sat Sep 20 15:41:08 2014 +0100
+++ b/source/Lib/TLibCommon/TComPattern.cpp	Wed Sep 24 15:30:16 2014 -0500
@@ -52,133 +52,96 @@ using namespace x265;
 void TComPattern::initAdiPattern(TComDataCU* cu, uint32_t zOrderIdxInPart, uint32_t partDepth, pixel* adiBuf,
                                  pixel* refAbove, pixel* refLeft, pixel* refAboveFlt, pixel* refLeftFlt, int dirMode)
 {
-    pixel* roiOrigin;
-    pixel* adiTemp;
-
-    int picStride = cu->m_pic->getStride();
-
     IntraNeighbors intraNeighbors;
 
     initIntraNeighbors(cu, zOrderIdxInPart, partDepth, true, &intraNeighbors);
     uint32_t tuSize = intraNeighbors.tuSize;
     uint32_t tuSize2 = tuSize << 1;
 
-    roiOrigin = cu->m_pic->getPicYuvRec()->getLumaAddr(cu->getAddr(), cu->getZorderIdxInCU() + zOrderIdxInPart);
-    adiTemp   = adiBuf;
+    pixel* adiOrigin = cu->m_pic->getPicYuvRec()->getLumaAddr(cu->getAddr(), cu->getZorderIdxInCU() + zOrderIdxInPart);
+    int picStride = cu->m_pic->getStride();
 
-    fillReferenceSamples(roiOrigin, picStride, adiTemp, intraNeighbors);
+    fillReferenceSamples(adiOrigin, picStride, adiBuf, intraNeighbors);
 
+    // initialization of ADI buffers
+    const int bufOffset = tuSize - 1;
+    refAbove += bufOffset;
+    refLeft += bufOffset;
+
+    //  ADI_BUF_STRIDE * (2 * tuSize + 1);
+    memcpy(refAbove, adiBuf, (tuSize2 + 1) * sizeof(pixel));
+    for (int k = 0; k < tuSize2 + 1; k++)
+        refLeft[k] = adiBuf[k * ADI_BUF_STRIDE];
+    
     bool bUseFilteredPredictions = (dirMode == ALL_IDX ? (8 | 16 | 32) & tuSize : g_intraFilterFlags[dirMode] & tuSize);
 
     if (bUseFilteredPredictions)
     {
         // generate filtered intra prediction samples
-        // left and left above border + above and above right border + top left corner = length of 3. filter buffer
-        int bufSize = tuSize2 + tuSize2 + 1;
-        uint32_t wh = ADI_BUF_STRIDE * (tuSize2 + 1);         // number of elements in one buffer
+        refAboveFlt += bufOffset;
+        refLeftFlt += bufOffset;
 
-        pixel* filterBuf  = adiBuf + wh;         // buffer for 2. filtering (sequential)
-        pixel* filterBufN = filterBuf + bufSize; // buffer for 1. filtering (sequential)
+        bool bStrongSmoothing = (tuSize == 32 && cu->m_slice->m_sps->bUseStrongIntraSmoothing);
 
-        int l = 0;
-        // left border from bottom to top
-        for (int i = 0; i < tuSize2; i++)
+        if (bStrongSmoothing)
         {
-            filterBuf[l++] = adiTemp[ADI_BUF_STRIDE * (tuSize2 - i)];
-        }
+            const int trSize  = 32;
+            const int trSize2 = 32 * 2;
+            const int threshold = 1 << (X265_DEPTH - 5);
+            int refBL = refLeft[trSize2];
+            int refTL = refAbove[0];
+            int refTR = refAbove[trSize2];
+            bStrongSmoothing = (abs(refBL + refTL - 2 * refLeft[trSize])  < threshold &&
+                                abs(refTL + refTR - 2 * refAbove[trSize]) < threshold);
 
-        // top left corner
-        filterBuf[l++] = adiTemp[0];
+            if (bStrongSmoothing)
+            {
+                // bilinear interpolation
+                const int shift = 5 + 1; // intraNeighbors.log2TrSize + 1;
+                int init = (refTL << shift) + tuSize;
+                int delta;
 
-        // above border from left to right
-        memcpy(&filterBuf[l], &adiTemp[1], tuSize2 * sizeof(*filterBuf));
+                refLeftFlt[0] = refAboveFlt[0] = refAbove[0];
 
-        if (tuSize >= 32 && cu->m_slice->m_sps->bUseStrongIntraSmoothing)
-        {
-            int bottomLeft = filterBuf[0];
-            int topLeft = filterBuf[tuSize2];
-            int topRight = filterBuf[bufSize - 1];
-            int threshold = 1 << (X265_DEPTH - 5);
-            bool bilinearLeft = abs(bottomLeft + topLeft - 2 * filterBuf[tuSize]) < threshold;
-            bool bilinearAbove  = abs(topLeft + topRight - 2 * filterBuf[tuSize2 + tuSize]) < threshold;
+                //TODO: Performance Primitive???
+                delta = refBL - refTL;
+                for (int i = 1; i < trSize2; i++)
+                    refLeftFlt[i] = (init + delta * i) >> shift;
+                refLeftFlt[trSize2] = refLeft[trSize2];
 
-            if (bilinearLeft && bilinearAbove)
-            {
-                int shift = intraNeighbors.log2TrSize + 1;
-                filterBufN[0] = filterBuf[0];
-                filterBufN[tuSize2] = filterBuf[tuSize2];
-                filterBufN[bufSize - 1] = filterBuf[bufSize - 1];
-                //TODO: Performance Primitive???
-                for (int i = 1; i < tuSize2; i++)
-                {
-                    filterBufN[i] = ((tuSize2 - i) * bottomLeft + i * topLeft + tuSize) >> shift;
-                }
+                delta = refTR - refTL;
+                for (int i = 1; i < trSize2; i++)
+                    refAboveFlt[i] = (init + delta * i) >> shift;
+                refAboveFlt[trSize2] = refAbove[trSize2];
 
-                for (int i = 1; i < tuSize2; i++)
-                {
-                    filterBufN[tuSize2 + i] = ((tuSize2 - i) * topLeft + i * topRight + tuSize) >> shift;
-                }
-            }
-            else
-            {
-                // 1. filtering with [1 2 1]
-                filterBufN[0] = filterBuf[0];
-                filterBufN[bufSize - 1] = filterBuf[bufSize - 1];
-                for (int i = 1; i < bufSize - 1; i++)
-                {
-                    filterBufN[i] = (filterBuf[i - 1] + 2 * filterBuf[i] + filterBuf[i + 1] + 2) >> 2;
-                }
-            }
-        }
-        else
-        {
-            // 1. filtering with [1 2 1]
-            filterBufN[0] = filterBuf[0];
-            filterBufN[bufSize - 1] = filterBuf[bufSize - 1];
-            for (int i = 1; i < bufSize - 1; i++)
-            {
-                filterBufN[i] = (filterBuf[i - 1] + 2 * filterBuf[i] + filterBuf[i + 1] + 2) >> 2;
+                return;
             }
         }
 
-        // initialization of ADI buffers
-        refAboveFlt += tuSize - 1;
-        refLeftFlt += tuSize - 1;
-        memcpy(refAboveFlt, filterBufN + tuSize2, (tuSize2 + 1) * sizeof(pixel));
-        for (int k = 0; k < tuSize2 + 1; k++)
-        {
-            refLeftFlt[k] = filterBufN[tuSize2 - k];   // Smoothened
-        }
-    }
+        refLeft[-1] = refAbove[1];
+        for (int i = 0; i < tuSize2; i++)
+            refLeftFlt[i] = (refLeft[i - 1] + 2 * refLeft[i] + refLeft[i + 1] + 2) >> 2;
+        refLeftFlt[tuSize2] = refLeft[tuSize2];
 
-    // initialization of ADI buffers
-    refAbove += tuSize - 1;
-    refLeft += tuSize - 1;
-
-    //  ADI_BUF_STRIDE * (2 * tuSize + 1);
-    memcpy(refAbove, adiBuf, (tuSize2 + 1) * sizeof(pixel));
-    for (int k = 0; k < tuSize2 + 1; k++)
-    {
-        refLeft[k] = adiBuf[k * ADI_BUF_STRIDE];
+        refAboveFlt[0] = refLeftFlt[0];
+        for (int i = 1; i < tuSize2; i++)
+            refAboveFlt[i] = (refAbove[i - 1] + 2 * refAbove[i] + refAbove[i + 1] + 2) >> 2;
+        refAboveFlt[tuSize2] = refAbove[tuSize2];
     }
 }
 
 void TComPattern::initAdiPatternChroma(TComDataCU* cu, uint32_t zOrderIdxInPart, uint32_t partDepth, pixel* adiBuf, uint32_t chromaId)
 {
-    pixel*  roiOrigin;
-    pixel*  adiTemp;
-
-    int picStride = cu->m_pic->getCStride();
-
     IntraNeighbors intraNeighbors;
 
     initIntraNeighbors(cu, zOrderIdxInPart, partDepth, false, &intraNeighbors);
     uint32_t tuSize = intraNeighbors.tuSize;
 
-    roiOrigin = cu->m_pic->getPicYuvRec()->getChromaAddr(chromaId, cu->getAddr(), cu->getZorderIdxInCU() + zOrderIdxInPart);
-    adiTemp   = getAdiChromaBuf(chromaId, tuSize, adiBuf);
+    pixel* adiOrigin = cu->m_pic->getPicYuvRec()->getChromaAddr(chromaId, cu->getAddr(), cu->getZorderIdxInCU() + zOrderIdxInPart);
+    int picStride = cu->m_pic->getCStride();
+    pixel* adiRef = getAdiChromaBuf(chromaId, tuSize, adiBuf);
 
-    fillReferenceSamples(roiOrigin, picStride, adiTemp, intraNeighbors);
+    fillReferenceSamples(adiOrigin, picStride, adiRef, intraNeighbors);
 }
 
 void TComPattern::initIntraNeighbors(TComDataCU* cu, uint32_t zOrderIdxInPart, uint32_t partDepth, bool isLuma, IntraNeighbors *intraNeighbors)
@@ -226,14 +189,13 @@ void TComPattern::initIntraNeighbors(TCo
     intraNeighbors->log2TrSize       = log2TrSize;
 }
 
-void TComPattern::fillReferenceSamples(pixel* roiOrigin, int picStride, pixel* adiTemp, const IntraNeighbors& intraNeighbors)
+void TComPattern::fillReferenceSamples(pixel* adiOrigin, int picStride, pixel* adiRef, const IntraNeighbors& intraNeighbors)
 {
     int numIntraNeighbor = intraNeighbors.numIntraNeighbor;
     int totalUnits       = intraNeighbors.totalUnits;
     uint32_t tuSize      = intraNeighbors.tuSize;
 
     uint32_t refSize = tuSize * 2 + 1;
-    pixel* roiTemp;
     int  i, j;
     int  dcValue = 1 << (X265_DEPTH - 1);
 
@@ -241,27 +203,23 @@ void TComPattern::fillReferenceSamples(p
     {
         // Fill border with DC value
         for (i = 0; i < refSize; i++)
-        {
-            adiTemp[i] = dcValue;
-        }
+            adiRef[i] = dcValue;
 
         for (i = 1; i < refSize; i++)
-        {
-            adiTemp[i * ADI_BUF_STRIDE] = dcValue;
-        }
+            adiRef[i * ADI_BUF_STRIDE] = dcValue;
     }
     else if (numIntraNeighbor == totalUnits)
     {
         // Fill top border with rec. samples
-        roiTemp = roiOrigin - picStride - 1;
-        memcpy(adiTemp, roiTemp, refSize * sizeof(*adiTemp));
+        pixel* adiTemp = adiOrigin - picStride - 1;
+        memcpy(adiRef, adiTemp, refSize * sizeof(*adiRef));
 
         // Fill left border with rec. samples
-        roiTemp = roiOrigin - 1;
+        adiTemp = adiOrigin - 1;
         for (i = 1; i < refSize; i++)
         {
-            adiTemp[i * ADI_BUF_STRIDE] = roiTemp[0];
-            roiTemp += picStride;
+            adiRef[i * ADI_BUF_STRIDE] = adiTemp[0];
+            adiTemp += picStride;
         }
     }
     else // reference samples are partially available
@@ -284,12 +242,12 @@ void TComPattern::fillReferenceSamples(p
         }
 
         // Fill top-left sample
-        roiTemp = roiOrigin - picStride - 1;
+        pixel* adiTemp =  adiOrigin - picStride - 1;
         pAdiLineTemp = pAdiLine + (leftUnits * unitHeight);
         pNeighborFlags = bNeighborFlags + leftUnits;
         if (*pNeighborFlags)
         {
-            pixel topLeftVal = roiTemp[0];
+            pixel topLeftVal = adiTemp[0];
             for (i = 0; i < unitWidth; i++)
             {
                 pAdiLineTemp[i] = topLeftVal;
@@ -297,7 +255,7 @@ void TComPattern::fillReferenceSamples(p
         }
 
         // Fill left & below-left samples