[x265-commits] [x265] asm: remove duplicate constant pw_256 and alignment nits

Fri Apr 3 21:26:29 CEST 2015

details:   http://hg.videolan.org/x265/rev/dd62c4e924ba
branches:  
changeset: 10019:dd62c4e924ba
user:      Dnyaneshwar G <dnyaneshwar at multicorewareinc.com>
date:      Fri Apr 03 10:45:54 2015 +0530
description:
asm: remove duplicate constant pw_256 and alignment nits
Subject: [x265] asm: avx2 code for intrapred_planar16x16

details:   http://hg.videolan.org/x265/rev/b95bbc82cc58
branches:  
changeset: 10020:b95bbc82cc58
user:      Dnyaneshwar G <dnyaneshwar at multicorewareinc.com>
date:      Fri Apr 03 11:11:47 2015 +0530
description:
asm: avx2 code for intrapred_planar16x16

AVX2:
intra_planar_16x16      16.24x   583.48          9475.36

SSE4:
intra_planar_16x16      11.54x   820.01          9466.91
Subject: [x265] asm: avx2 code for intra_planar_32x32

details:   http://hg.videolan.org/x265/rev/d23e5e9d6dd0
branches:  
changeset: 10021:d23e5e9d6dd0
user:      Dnyaneshwar G <dnyaneshwar at multicorewareinc.com>
date:      Fri Apr 03 11:35:53 2015 +0530
description:
asm: avx2 code for intra_planar_32x32

AVX2:
intra_planar_32x32      19.93x   1813.34         36132.20

SSE4:
intra_planar_32x32      12.25x   2951.42         36140.76
Subject: [x265] asm: avx2 code for intra_dc_32x32

details:   http://hg.videolan.org/x265/rev/aa565f72955c
branches:  
changeset: 10022:aa565f72955c
user:      Dnyaneshwar G <dnyaneshwar at multicorewareinc.com>
date:      Fri Apr 03 15:41:49 2015 +0530
description:
asm: avx2 code for intra_dc_32x32

AVX2:
intra_dc_32x32[f=0]     23.17x   435.66          10093.78

SSE4:
intra_dc_32x32[f=0]     14.36x   703.46          10100.78
Subject: [x265] asm: intra_pred_ang4_17 improved by ~57% over SSE4

details:   http://hg.videolan.org/x265/rev/38884a963301
branches:  
changeset: 10023:38884a963301
user:      Praveen Tiwari <praveen at multicorewareinc.com>
date:      Fri Apr 03 12:03:59 2015 +0530
description:
asm: intra_pred_ang4_17 improved by ~57% over SSE4

AVX2:
intra_ang_4x4[17]       11.06x   104.22          1152.57

SSE4:
intra_ang_4x4[17]       4.70x    244.43          1148.92
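
The benchmark rows in these descriptions appear to list the speedup over the C reference, the cycles of the optimized kernel, and the cycles of the C reference, in that order. That column reading is an assumption, but the figures are consistent with it. A minimal sketch of the arithmetic follows; the helper names are hypothetical and not part of x265.

```cpp
#include <cassert>

// Hypothetical helpers, not x265 code. Assumed column order in the
// benchmark rows: speedup, optimized cycles, C reference cycles.

// Speedup of an optimized kernel relative to the C reference.
double speedupOverC(double asmCycles, double cCycles)
{
    assert(asmCycles > 0.0);
    return cCycles / asmCycles;
}

// Fraction of cycles saved by a new kernel relative to an older one,
// e.g. AVX2 versus SSE4 for the same primitive.
double cycleReduction(double newCycles, double oldCycles)
{
    assert(oldCycles > 0.0);
    return (oldCycles - newCycles) / oldCycles;
}
```

With the intra_ang_4x4[17] figures, speedupOverC(104.22, 1152.57) is roughly 11.06 and cycleReduction(104.22, 244.43) is roughly 0.57, which matches the "~57% over SSE4" in the subject line.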
Subject: [x265] asm: intra_pred_ang4_16 improved by ~49% over SSE4

details:   http://hg.videolan.org/x265/rev/cd2577b482ae
branches:  
changeset: 10024:cd2577b482ae
user:      Praveen Tiwari <praveen at multicorewareinc.com>
date:      Fri Apr 03 12:19:15 2015 +0530
description:
asm: intra_pred_ang4_16 improved by ~49% over SSE4

AVX2:
intra_ang_4x4[16]       10.86x   104.30          1133.09

SSE4:
intra_ang_4x4[16]       5.51x    206.89          1139.52
Subject: [x265] asm: intra_pred_ang4_15 improved by ~53% over SSE4

details:   http://hg.videolan.org/x265/rev/8119b549ca9e
branches:  
changeset: 10025:8119b549ca9e
user:      Praveen Tiwari <praveen at multicorewareinc.com>
date:      Fri Apr 03 12:37:40 2015 +0530
description:
asm: intra_pred_ang4_15 improved by ~53% over SSE4

AVX2:
intra_ang_4x4[15]       10.93x   104.25          1140.00

SSE4:
intra_ang_4x4[15]       4.98x    225.91          1125.26
Subject: [x265] asm: intra_pred_ang4_14 improved by ~43% over SSE4

details:   http://hg.videolan.org/x265/rev/d240ff7beda2
branches:  
changeset: 10026:d240ff7beda2
user:      Praveen Tiwari <praveen at multicorewareinc.com>
date:      Fri Apr 03 12:51:29 2015 +0530
description:
asm: intra_pred_ang4_14 improved by ~43% over SSE4

AVX2:
intra_ang_4x4[14]       10.94x   102.94          1126.27

SSE4:
intra_ang_4x4[14]       6.14x    182.91          1122.57
Subject: [x265] asm: intra_pred_ang4_13 improved by ~43% over SSE4

details:   http://hg.videolan.org/x265/rev/ba4e530b68a2
branches:  
changeset: 10027:ba4e530b68a2
user:      Praveen Tiwari <praveen at multicorewareinc.com>
date:      Fri Apr 03 13:52:26 2015 +0530
description:
asm: intra_pred_ang4_13 improved by ~43% over SSE4

AVX2:
intra_ang_4x4[13]       10.73x   104.23          1118.51

SSE4:
intra_ang_4x4[13]       6.06x    184.99          1121.24
Subject: [x265] asm: intra_pred_ang4_12 improved by ~35% over SSE4

details:   http://hg.videolan.org/x265/rev/e68a5442024e
branches:  
changeset: 10028:e68a5442024e
user:      Praveen Tiwari <praveen at multicorewareinc.com>
date:      Fri Apr 03 14:03:40 2015 +0530
description:
asm: intra_pred_ang4_12 improved by ~35% over SSE4

AVX2:
intra_ang_4x4[12]       10.62x   104.55          1110.68

SSE4:
intra_ang_4x4[12]       6.84x    162.34          1110.04
Subject: [x265] asm: intra_pred_ang4_11 improved by ~31% over SSE4

details:   http://hg.videolan.org/x265/rev/9ec24afd357f
branches:  
changeset: 10029:9ec24afd357f
user:      Praveen Tiwari <praveen at multicorewareinc.com>
date:      Fri Apr 03 14:41:25 2015 +0530
description:
asm: intra_pred_ang4_11 improved by ~31% over SSE4

AVX2:
intra_ang_4x4[11]       10.58x   104.21          1102.93

SSE4:
intra_ang_4x4[11]       7.23x    152.13          1100.52
Subject: [x265] asm: intra_pred_ang4_9 improved by ~35% over SSE4

details:   http://hg.videolan.org/x265/rev/31ce52f6cc0e
branches:  
changeset: 10030:31ce52f6cc0e
user:      Praveen Tiwari <praveen at multicorewareinc.com>
date:      Fri Apr 03 14:55:56 2015 +0530
description:
asm: intra_pred_ang4_9 improved by ~35% over SSE4

AVX2:
intra_ang_4x4[ 9]       10.27x   104.54          1073.82

SSE4:
intra_ang_4x4[ 9]       6.48x    162.27          1051.73
Subject: [x265] asm: reduce code size with macro 'INTRA_PRED_TRANS_STORE_4x4'

details:   http://hg.videolan.org/x265/rev/942267525eb6
branches:  
changeset: 10031:942267525eb6
user:      Praveen Tiwari <praveen at multicorewareinc.com>
date:      Fri Apr 03 15:08:07 2015 +0530
description:
asm: reduce code size with macro 'INTRA_PRED_TRANS_STORE_4x4'
Subject: [x265] asm: general calSign to accelerate sao

details:   http://hg.videolan.org/x265/rev/4f3dfbfa5abd
branches:  
changeset: 10032:4f3dfbfa5abd
user:      Min Chen <chenm003 at 163.com>
date:      Fri Apr 03 19:10:07 2015 +0800
description:
asm: general calSign to accelerate sao
---
 source/common/x86/const-a.asm    |    3 ++
 source/common/x86/loopfilter.asm |   69 ++++++++++++++++++++++++++-----------
 source/encoder/sao.cpp           |   14 ++------
 source/test/pixelharness.cpp     |    8 ++--
 4 files changed, 58 insertions(+), 36 deletions(-)
Subject: [x265] asm: reduce 1 register in quant_avx2

details:   http://hg.videolan.org/x265/rev/bb526a6863d9
branches:  
changeset: 10033:bb526a6863d9
user:      Min Chen <chenm003 at 163.com>
date:      Fri Apr 03 19:10:12 2015 +0800
description:
asm: reduce 1 register in quant_avx2
Subject: [x265] improve fillReferenceSamples by merge pixel fill

details:   http://hg.videolan.org/x265/rev/6c759724db1e
branches:  
changeset: 10034:6c759724db1e
user:      Min Chen <chenm003 at 163.com>
date:      Fri Apr 03 19:10:15 2015 +0800
description:
improve fillReferenceSamples by merge pixel fill
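
The idea behind merging the pixel fill can be sketched as follows, assuming an 8-bit build where a pixel is a single byte. The function names are hypothetical; only the memset-based form mirrors what the predict.cpp hunk in this changeset adds. Builds with 16-bit pixels cannot use memset for this, since memset writes individual bytes, which is presumably why the loop is kept under HIGH_BIT_DEPTH.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

typedef uint8_t pixel; // assumption: 8-bit build, one byte per sample

// Per-unit loop: fill `units` consecutive runs of `unitSize` samples.
pixel* fillByUnits(pixel* adi, pixel refSample, int unitSize, int units)
{
    for (int u = 0; u < units; u++)
    {
        for (int i = 0; i < unitSize; i++)
            adi[i] = refSample;
        adi += unitSize;
    }
    return adi;
}

// Merged fill: the runs are contiguous, so a single memset covers them.
pixel* fillMerged(pixel* adi, pixel refSample, int unitSize, int units)
{
    assert(unitSize >= 0 && units >= 0);
    const int fillSize = unitSize * units;
    std::memset(adi, refSample, fillSize * sizeof(pixel));
    return adi + fillSize;
}
```

Both functions write the same bytes and return the same advanced pointer; the merged form simply replaces many short loops with one library call.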
Subject: [x265] primitives: rename luma_p2s to convert_p2s and move into PU

details:   http://hg.videolan.org/x265/rev/ac4af23cbdea
branches:  
changeset: 10035:ac4af23cbdea
user:      Rajesh Paulraj <rajesh at multicorewareinc.com>
date:      Fri Apr 03 18:18:48 2015 +0530
description:
primitives: rename luma_p2s to convert_p2s and move into PU
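
A rough sketch of the per-PU shape this rename moves to: block width and height become template parameters and the destination stride becomes a runtime argument, as in the filterPixelToShort_c signature in the ipfilter.cpp hunk of this changeset. The constants below are assumptions for an 8-bit build (they stand in for X265_DEPTH, IF_INTERNAL_PREC and IF_INTERNAL_OFFS), and the names are hypothetical.

```cpp
#include <cstdint>

typedef uint8_t pixel;                               // assumption: 8-bit build
const int kDepth = 8;                                // stands in for X265_DEPTH
const int kInternalPrec = 14;                        // assumed IF_INTERNAL_PREC
const int kInternalOffs = 1 << (kInternalPrec - 1);  // assumed IF_INTERNAL_OFFS

// Convert a width x height block of pixels to the 16-bit intermediate
// format: scale up to internal precision, then subtract the offset.
template<int width, int height>
void pixelToShortSketch(const pixel* src, intptr_t srcStride,
                        int16_t* dst, intptr_t dstStride)
{
    const int shift = kInternalPrec - kDepth;
    for (int row = 0; row < height; row++)
    {
        for (int col = 0; col < width; col++)
        {
            int16_t val = (int16_t)(src[col] << shift);
            dst[col] = (int16_t)(val - kInternalOffs);
        }
        src += srcStride;
        dst += dstStride;
    }
}

// One function pointer per block size, selected from a table by the caller.
typedef void (*p2s_sketch_t)(const pixel* src, intptr_t srcStride,
                             int16_t* dst, intptr_t dstStride);
```

Because each block size gets its own instantiation, a caller picks the entry for its partition and passes only the strides, rather than passing width and height on every call.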
Subject: [x265] asm: sse4 8bpp code for convert_p2s[4xN]

details:   http://hg.videolan.org/x265/rev/d866ce0b50ad
branches:  
changeset: 10036:d866ce0b50ad
user:      Rajesh Paulraj <rajesh at multicorewareinc.com>
date:      Fri Apr 03 18:32:44 2015 +0530
description:
asm: sse4 8bpp code for convert_p2s[4xN]

     convert_p2s[4x4](2.95x), convert_p2s[4x8](3.22x), convert_p2s[4x16](3.59x)
Subject: [x265] asm: ssse3 8bpp code for convert_p2s[8xN],convert_p2s[16xN]

details:   http://hg.videolan.org/x265/rev/04ea107e7f41
branches:  
changeset: 10037:04ea107e7f41
user:      Rajesh Paulraj <rajesh at multicorewareinc.com>
date:      Fri Apr 03 18:38:23 2015 +0530
description:
asm: ssse3 8bpp code for convert_p2s[8xN],convert_p2s[16xN]

     convert_p2s[8x4](4.15x), convert_p2s[8x8](4.87x), convert_p2s[8x16](5.57x),
     convert_p2s[8x32](5.71x), convert_p2s[16x4](9.48x),convert_p2s[16x8](11.68x),
     convert_p2s[16x12](12.47x), convert_p2s[16x16](12.77x),
     convert_p2s[16x32](13.26x), convert_p2s[16x64](12.68x)
Subject: [x265] asm: ssse3 8bpp code for convert_p2s[32xN],[64xN]

details:   http://hg.videolan.org/x265/rev/02c97d95802d
branches:  
changeset: 10038:02c97d95802d
user:      Rajesh Paulraj <rajesh at multicorewareinc.com>
date:      Fri Apr 03 18:41:41 2015 +0530
description:
asm: ssse3 8bpp code for convert_p2s[32xN],[64xN]

     convert_p2s[32x8](10.45x), convert_p2s[32x16](10.22x),
     convert_p2s[32x24](10.98x), convert_p2s[32x32](10.17x),
     convert_p2s[32x64](12.31x), convert_p2s[64x16](10.29x),
     convert_p2s[64x32](10.17x), convert_p2s[64x48](10.05x),
     convert_p2s[64x64](10.04x)
Subject: [x265] asm: ssse3 code for chroma_p2s for i420, i422, i444, reuse the luma code

details:   http://hg.videolan.org/x265/rev/a77cb2b78a12
branches:  
changeset: 10039:a77cb2b78a12
user:      Rajesh Paulraj <rajesh at multicorewareinc.com>
date:      Fri Apr 03 18:50:00 2015 +0530
description:
asm: ssse3 code for chroma_p2s for i420, i422, i444, reuse the luma code
Subject: [x265] asm: sse4 chroma_p2s[4x2](2.29x), ssse3 chroma_p2s[8x2](3.60x) for i420

details:   http://hg.videolan.org/x265/rev/3473c9fec18c
branches:  
changeset: 10040:3473c9fec18c
user:      Rajesh Paulraj <rajesh at multicorewareinc.com>
date:      Fri Apr 03 19:02:06 2015 +0530
description:
asm: sse4 chroma_p2s[4x2](2.29x), ssse3 chroma_p2s[8x2](3.60x) for i420
Subject: [x265] asm: only 4:4:4 chroma 4-tap filters are configured in asm-primitives.cpp

details:   http://hg.videolan.org/x265/rev/57c3306a7773
branches:  
changeset: 10041:57c3306a7773
user:      Steve Borho <steve at borho.org>
date:      Fri Apr 03 11:16:04 2015 -0500
description:
asm: only 4:4:4 chroma 4-tap filters are configured in asm-primitives.cpp

all the other 4:4:4 chroma primitives are configured by setupAliasPrimitives()
(aliased to luma CU and PU primitives)
Subject: [x265] cmake: avoid strict-overflow warnings in slicetype.cpp from GCC 4.9

details:   http://hg.videolan.org/x265/rev/1e47a8d8c226
branches:  
changeset: 10042:1e47a8d8c226
user:      Steve Borho <steve at borho.org>
date:      Fri Apr 03 12:08:21 2015 -0500
description:
cmake: avoid strict-overflow warnings in slicetype.cpp from GCC 4.9

C:\mcw\x265\source\encoder\slicetype.cpp: In member function 'void x265::Lookahead::slicetypeAnalyse(x265::Lowres**, bool)':
C:\mcw\x265\source\encoder\slicetype.cpp:1919:31: warning: assuming signed overflow does not occur when assuming that (X + c) >= X is always true [-Wstrict-overflow]
         bDoSearch[0] = p0 < b && fenc->lowresMvs[0][b - p0 - 1][0].x == 0x7FFF;
                               ^
and one other in an X265_CHECK statement. In this case, p0 and b are known to
have small positive values and so the logic is ok.
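
The guarded-index pattern the warning refers to can be reduced to the sketch below. The function and array are hypothetical stand-ins for the lookahead code; the point is only that the comparison p0 < b already ensures b - p0 - 1 is a valid non-negative index when both values are small and non-negative, so the signed arithmetic cannot overflow in practice.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for the check in slicetype.cpp: `mvx` holds one
// x component per reference distance, and 0x7FFF marks "not searched".
bool needsSearch(const int16_t* mvx, int p0, int b)
{
    // Assumption stated in the commit message: small positive values.
    assert(p0 >= 0 && b >= 0);

    // The left operand short-circuits, so the index is evaluated only
    // when p0 < b, which implies b - p0 - 1 >= 0.
    return p0 < b && mvx[b - p0 - 1] == 0x7FFF;
}
```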
Subject: [x265] api: make x265_cleanup() a NOP if an encoder is still open

details:   http://hg.videolan.org/x265/rev/96fef6b58853
branches:  
changeset: 10043:96fef6b58853
user:      Steve Borho <steve at borho.org>
date:      Fri Apr 03 13:27:08 2015 -0500
description:
api: make x265_cleanup() a NOP if an encoder is still open
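
One way such a guard could work is sketched below. This is not the actual api.cpp change, which is cut off by the truncated diff; the counter, the flag and the function names are all hypothetical, and the only behavior taken from the commit message is that cleanup does nothing while an encoder is still open.

```cpp
#include <atomic>

// Hypothetical state, not x265's real globals.
static std::atomic<int> g_openEncoders(0);
static bool g_globalsInitialized = false;

void sketchEncoderOpen()
{
    g_openEncoders++;
    g_globalsInitialized = true;
}

void sketchEncoderClose()
{
    g_openEncoders--;
}

// Returns true if global state was released, false if the call was a NOP.
bool sketchCleanup()
{
    if (g_openEncoders.load() > 0)
        return false; // an encoder is still open: leave globals untouched

    g_globalsInitialized = false;
    return true;
}
```

Calling sketchCleanup() between sketchEncoderOpen() and sketchEncoderClose() leaves the state intact; calling it after the last close releases it.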

diffstat:

 source/CMakeLists.txt                |    1 +
 source/common/ipfilter.cpp           |   36 +-
 source/common/param.cpp              |    2 +-
 source/common/predict.cpp            |   31 +-
 source/common/primitives.cpp         |    3 +-
 source/common/primitives.h           |    9 +-
 source/common/x86/asm-primitives.cpp |   99 +++-
 source/common/x86/const-a.asm        |  155 +++---
 source/common/x86/intrapred.h        |   11 +
 source/common/x86/intrapred8.asm     |  319 ++++++++++++++
 source/common/x86/ipfilter8.asm      |  747 +++++++++++++++++++++-------------
 source/common/x86/ipfilter8.h        |   58 +-
 source/common/x86/loopfilter.asm     |   63 ++-
 source/common/x86/pixel-util8.asm    |    6 +-
 source/encoder/CMakeLists.txt        |    6 +-
 source/encoder/api.cpp               |    9 +-
 source/encoder/sao.cpp               |   14 +-
 source/test/ipfilterharness.cpp      |  122 +----
 source/test/ipfilterharness.h        |    1 -
 source/test/pixelharness.cpp         |    8 +-
 20 files changed, 1103 insertions(+), 597 deletions(-)

diffs (truncated from 2379 to 300 lines):

diff -r 9a5fa67583fe -r 96fef6b58853 source/CMakeLists.txt
--- a/source/CMakeLists.txt	Thu Apr 02 13:21:32 2015 -0500
+++ b/source/CMakeLists.txt	Fri Apr 03 13:27:08 2015 -0500
@@ -196,6 +196,7 @@ if(GCC)
         add_definitions(-static)
         list(APPEND LINKER_OPTIONS "-static")
     endif(STATIC_LINK_CRT)
+    check_cxx_compiler_flag(-Wno-strict-overflow CC_HAS_NO_STRICT_OVERFLOW)
     check_cxx_compiler_flag(-Wno-narrowing CC_HAS_NO_NARROWING) 
     check_cxx_compiler_flag(-Wno-array-bounds CC_HAS_NO_ARRAY_BOUNDS) 
     if (CC_HAS_NO_ARRAY_BOUNDS)
diff -r 9a5fa67583fe -r 96fef6b58853 source/common/ipfilter.cpp
--- a/source/common/ipfilter.cpp	Thu Apr 02 13:21:32 2015 -0500
+++ b/source/common/ipfilter.cpp	Fri Apr 03 13:27:08 2015 -0500
@@ -34,27 +34,8 @@ using namespace x265;
 #endif
 
 namespace {
-template<int dstStride, int width, int height>
-void pixelToShort_c(const pixel* src, intptr_t srcStride, int16_t* dst)
-{
-    int shift = IF_INTERNAL_PREC - X265_DEPTH;
-    int row, col;
-
-    for (row = 0; row < height; row++)
-    {
-        for (col = 0; col < width; col++)
-        {
-            int16_t val = src[col] << shift;
-            dst[col] = val - (int16_t)IF_INTERNAL_OFFS;
-        }
-
-        src += srcStride;
-        dst += dstStride;
-    }
-}
-
-template<int dstStride>
-void filterPixelToShort_c(const pixel* src, intptr_t srcStride, int16_t* dst, int width, int height)
+template<int width, int height>
+void filterPixelToShort_c(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride)
 {
     int shift = IF_INTERNAL_PREC - X265_DEPTH;
     int row, col;
@@ -398,7 +379,7 @@ namespace x265 {
     p.chroma[X265_CSP_I420].pu[CHROMA_420_ ## W ## x ## H].filter_vps = interp_vert_ps_c<4, W, H>;  \
     p.chroma[X265_CSP_I420].pu[CHROMA_420_ ## W ## x ## H].filter_vsp = interp_vert_sp_c<4, W, H>;  \
     p.chroma[X265_CSP_I420].pu[CHROMA_420_ ## W ## x ## H].filter_vss = interp_vert_ss_c<4, W, H>; \
-    p.chroma[X265_CSP_I420].pu[CHROMA_420_ ## W ## x ## H].chroma_p2s = pixelToShort_c<MAX_CU_SIZE / 2, W, H>; 
+    p.chroma[X265_CSP_I420].pu[CHROMA_420_ ## W ## x ## H].p2s = filterPixelToShort_c<W, H>;
 
 #define CHROMA_422(W, H) \
     p.chroma[X265_CSP_I422].pu[CHROMA_422_ ## W ## x ## H].filter_hpp = interp_horiz_pp_c<4, W, H>; \
@@ -407,7 +388,7 @@ namespace x265 {
     p.chroma[X265_CSP_I422].pu[CHROMA_422_ ## W ## x ## H].filter_vps = interp_vert_ps_c<4, W, H>;  \
     p.chroma[X265_CSP_I422].pu[CHROMA_422_ ## W ## x ## H].filter_vsp = interp_vert_sp_c<4, W, H>;  \
     p.chroma[X265_CSP_I422].pu[CHROMA_422_ ## W ## x ## H].filter_vss = interp_vert_ss_c<4, W, H>; \
-    p.chroma[X265_CSP_I422].pu[CHROMA_422_ ## W ## x ## H].chroma_p2s = pixelToShort_c<MAX_CU_SIZE / 2, W, H>; 
+    p.chroma[X265_CSP_I422].pu[CHROMA_422_ ## W ## x ## H].p2s = filterPixelToShort_c<W, H>;
 
 #define CHROMA_444(W, H) \
     p.chroma[X265_CSP_I444].pu[LUMA_ ## W ## x ## H].filter_hpp = interp_horiz_pp_c<4, W, H>; \
@@ -416,7 +397,7 @@ namespace x265 {
     p.chroma[X265_CSP_I444].pu[LUMA_ ## W ## x ## H].filter_vps = interp_vert_ps_c<4, W, H>;  \
     p.chroma[X265_CSP_I444].pu[LUMA_ ## W ## x ## H].filter_vsp = interp_vert_sp_c<4, W, H>;  \
     p.chroma[X265_CSP_I444].pu[LUMA_ ## W ## x ## H].filter_vss = interp_vert_ss_c<4, W, H>; \
-    p.chroma[X265_CSP_I444].pu[LUMA_ ## W ## x ## H].chroma_p2s = pixelToShort_c<MAX_CU_SIZE, W, H>; 
+    p.chroma[X265_CSP_I444].pu[LUMA_ ## W ## x ## H].p2s = filterPixelToShort_c<W, H>;
 
 #define LUMA(W, H) \
     p.pu[LUMA_ ## W ## x ## H].luma_hpp     = interp_horiz_pp_c<8, W, H>; \
@@ -426,7 +407,7 @@ namespace x265 {
     p.pu[LUMA_ ## W ## x ## H].luma_vsp     = interp_vert_sp_c<8, W, H>;  \
     p.pu[LUMA_ ## W ## x ## H].luma_vss     = interp_vert_ss_c<8, W, H>;  \
     p.pu[LUMA_ ## W ## x ## H].luma_hvpp    = interp_hv_pp_c<8, W, H>; \
-    p.pu[LUMA_ ## W ## x ## H].filter_p2s = pixelToShort_c<MAX_CU_SIZE, W, H>
+    p.pu[LUMA_ ## W ## x ## H].convert_p2s = filterPixelToShort_c<W, H>;
 
 void setupFilterPrimitives_c(EncoderPrimitives& p)
 {
@@ -530,11 +511,6 @@ void setupFilterPrimitives_c(EncoderPrim
     CHROMA_444(48, 64);
     CHROMA_444(64, 16);
     CHROMA_444(16, 64);
-    p.luma_p2s = filterPixelToShort_c<MAX_CU_SIZE>;
-
-    p.chroma[X265_CSP_I444].p2s = filterPixelToShort_c<MAX_CU_SIZE>;
-    p.chroma[X265_CSP_I420].p2s = filterPixelToShort_c<MAX_CU_SIZE / 2>;
-    p.chroma[X265_CSP_I422].p2s = filterPixelToShort_c<MAX_CU_SIZE / 2>;
 
     p.extendRowBorder = extendCURowColBorder;
 }
diff -r 9a5fa67583fe -r 96fef6b58853 source/common/param.cpp
--- a/source/common/param.cpp	Thu Apr 02 13:21:32 2015 -0500
+++ b/source/common/param.cpp	Fri Apr 03 13:27:08 2015 -0500
@@ -1183,7 +1183,7 @@ int x265_set_globals(x265_param* param)
     uint32_t maxLog2CUSize = (uint32_t)g_log2Size[param->maxCUSize];
     uint32_t minLog2CUSize = (uint32_t)g_log2Size[param->minCUSize];
 
-    if (g_ctuSizeConfigured || ATOMIC_INC(&g_ctuSizeConfigured) > 1)
+    if (ATOMIC_INC(&g_ctuSizeConfigured) > 1)
     {
         if (g_maxCUSize != param->maxCUSize)
         {
diff -r 9a5fa67583fe -r 96fef6b58853 source/common/predict.cpp
--- a/source/common/predict.cpp	Thu Apr 02 13:21:32 2015 -0500
+++ b/source/common/predict.cpp	Fri Apr 03 13:27:08 2015 -0500
@@ -273,7 +273,7 @@ void Predict::predInterLumaPixel(const P
 void Predict::predInterLumaShort(const PredictionUnit& pu, ShortYuv& dstSYuv, const PicYuv& refPic, const MV& mv) const
 {
     int16_t* dst = dstSYuv.getLumaAddr(pu.puAbsPartIdx);
-    int dstStride = dstSYuv.m_size;
+    intptr_t dstStride = dstSYuv.m_size;
 
     intptr_t srcStride = refPic.m_stride;
     intptr_t srcOffset = (mv.x >> 2) + (mv.y >> 2) * srcStride;
@@ -288,7 +288,7 @@ void Predict::predInterLumaShort(const P
     X265_CHECK(dstStride == MAX_CU_SIZE, "stride expected to be max cu size\n");
 
     if (!(yFrac | xFrac))
-        primitives.luma_p2s(src, srcStride, dst, pu.width, pu.height);
+        primitives.pu[partEnum].convert_p2s(src, srcStride, dst, dstStride);
     else if (!yFrac)
         primitives.pu[partEnum].luma_hps(src, srcStride, dst, dstStride, xFrac, 0);
     else if (!xFrac)
@@ -375,14 +375,13 @@ void Predict::predInterChromaShort(const
     int partEnum = partitionFromSizes(pu.width, pu.height);
     
     uint32_t cxWidth  = pu.width >> m_hChromaShift;
-    uint32_t cxHeight = pu.height >> m_vChromaShift;
 
-    X265_CHECK(((cxWidth | cxHeight) % 2) == 0, "chroma block size expected to be multiple of 2\n");
+    X265_CHECK(((cxWidth | (pu.height >> m_vChromaShift)) % 2) == 0, "chroma block size expected to be multiple of 2\n");
 
     if (!(yFrac | xFrac))
     {
-        primitives.chroma[m_csp].p2s(refCb, refStride, dstCb, cxWidth, cxHeight);
-        primitives.chroma[m_csp].p2s(refCr, refStride, dstCr, cxWidth, cxHeight);
+        primitives.chroma[m_csp].pu[partEnum].p2s(refCb, refStride, dstCb, dstStride);
+        primitives.chroma[m_csp].pu[partEnum].p2s(refCr, refStride, dstCr, dstStride);
     }
     else if (!yFrac)
     {
@@ -817,7 +816,9 @@ void Predict::fillReferenceSamples(const
             const pixel refSample = *pAdiLineNext;
             // Pad unavailable samples with new value
             int nextOrTop = X265_MIN(next, leftUnits);
+
             // fill left column
+#if HIGH_BIT_DEPTH
             while (curr < nextOrTop)
             {
                 for (int i = 0; i < unitHeight; i++)
@@ -836,6 +837,24 @@ void Predict::fillReferenceSamples(const
                 adi += unitWidth;
                 curr++;
             }
+#else
+            X265_CHECK(curr <= nextOrTop, "curr must be less than or equal to nextOrTop\n");
+            if (curr < nextOrTop)
+            {
+                const int fillSize = unitHeight * (nextOrTop - curr);
+                memset(adi, refSample, fillSize * sizeof(pixel));
+                curr = nextOrTop;
+                adi += fillSize;
+            }
+
+            if (curr < next)
+            {
+                const int fillSize = unitWidth * (next - curr);
+                memset(adi, refSample, fillSize * sizeof(pixel));
+                curr = next;
+                adi += fillSize;
+            }
+#endif
         }
 
         // pad all other reference samples.
diff -r 9a5fa67583fe -r 96fef6b58853 source/common/primitives.cpp
--- a/source/common/primitives.cpp	Thu Apr 02 13:21:32 2015 -0500
+++ b/source/common/primitives.cpp	Fri Apr 03 13:27:08 2015 -0500
@@ -90,7 +90,6 @@ void setupAliasPrimitives(EncoderPrimiti
 
     /* alias chroma 4:4:4 from luma primitives (all but chroma filters) */
 
-    p.chroma[X265_CSP_I444].p2s = p.luma_p2s;
     p.chroma[X265_CSP_I444].cu[BLOCK_4x4].sa8d = NULL;
 
     for (int i = 0; i < NUM_PU_SIZES; i++)
@@ -98,7 +97,7 @@ void setupAliasPrimitives(EncoderPrimiti
         p.chroma[X265_CSP_I444].pu[i].copy_pp = p.pu[i].copy_pp;
         p.chroma[X265_CSP_I444].pu[i].addAvg  = p.pu[i].addAvg;
         p.chroma[X265_CSP_I444].pu[i].satd    = p.pu[i].satd;
-        p.chroma[X265_CSP_I444].pu[i].chroma_p2s = p.pu[i].filter_p2s;
+        p.chroma[X265_CSP_I444].pu[i].p2s     = p.pu[i].convert_p2s;
     }
 
     for (int i = 0; i < NUM_CU_SIZES; i++)
diff -r 9a5fa67583fe -r 96fef6b58853 source/common/primitives.h
--- a/source/common/primitives.h	Thu Apr 02 13:21:32 2015 -0500
+++ b/source/common/primitives.h	Fri Apr 03 13:27:08 2015 -0500
@@ -156,8 +156,7 @@ typedef void (*filter_ps_t) (const pixel
 typedef void (*filter_sp_t) (const int16_t* src, intptr_t srcStride, pixel* dst, intptr_t dstStride, int coeffIdx);
 typedef void (*filter_ss_t) (const int16_t* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride, int coeffIdx);
 typedef void (*filter_hv_pp_t) (const pixel* src, intptr_t srcStride, pixel* dst, intptr_t dstStride, int idxX, int idxY);
-typedef void (*filter_p2s_wxh_t)(const pixel* src, intptr_t srcStride, int16_t* dst, int width, int height);
-typedef void (*filter_p2s_t)(const pixel* src, intptr_t srcStride, int16_t* dst);
+typedef void (*filter_p2s_t)(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride);
 
 typedef void (*copy_pp_t)(pixel* dst, intptr_t dstStride, const pixel* src, intptr_t srcStride); // dst is aligned
 typedef void (*copy_sp_t)(pixel* dst, intptr_t dstStride, const int16_t* src, intptr_t srcStride);
@@ -211,7 +210,7 @@ struct EncoderPrimitives
         addAvg_t       addAvg;      // bidir motion compensation, uses 16bit values
 
         copy_pp_t      copy_pp;
-        filter_p2s_t   filter_p2s;
+        filter_p2s_t   convert_p2s;
     }
     pu[NUM_PU_SIZES];
 
@@ -290,7 +289,6 @@ struct EncoderPrimitives
     weightp_sp_t          weight_sp;
     weightp_pp_t          weight_pp;
 
-    filter_p2s_wxh_t      luma_p2s;
 
     findPosLast_t         findPosLast;
 
@@ -317,7 +315,7 @@ struct EncoderPrimitives
             filter_hps_t filter_hps;
             addAvg_t     addAvg;
             copy_pp_t    copy_pp;
-            filter_p2s_t chroma_p2s;
+            filter_p2s_t p2s;
 
         }
         pu[NUM_PU_SIZES];
@@ -337,7 +335,6 @@ struct EncoderPrimitives
         }
         cu[NUM_CU_SIZES];
 
-        filter_p2s_wxh_t p2s; // takes width/height as arguments
     }
     chroma[X265_CSP_COUNT];
 };
diff -r 9a5fa67583fe -r 96fef6b58853 source/common/x86/asm-primitives.cpp
--- a/source/common/x86/asm-primitives.cpp	Thu Apr 02 13:21:32 2015 -0500
+++ b/source/common/x86/asm-primitives.cpp	Fri Apr 03 13:27:08 2015 -0500
@@ -859,9 +859,6 @@ void setupAssemblyPrimitives(EncoderPrim
         PIXEL_AVG_W4(mmx2);
         LUMA_VAR(sse2);
 
-        p.luma_p2s = x265_luma_p2s_sse2;
-        p.chroma[X265_CSP_I420].p2s = x265_chroma_p2s_sse2;
-        p.chroma[X265_CSP_I422].p2s = x265_chroma_p2s_sse2;
 
         ALL_LUMA_TU(blockfill_s, blockfill_s, sse2);
         ALL_LUMA_TU_S(cpy1Dto2D_shr, cpy1Dto2D_shr_, sse2);
@@ -1273,31 +1270,6 @@ void setupAssemblyPrimitives(EncoderPrim
         ASSIGN_SSE_PP(ssse3);
         p.cu[BLOCK_4x4].sse_pp = x265_pixel_ssd_4x4_ssse3;
         p.chroma[X265_CSP_I422].cu[BLOCK_422_4x8].sse_pp = x265_pixel_ssd_4x8_ssse3;
-        p.pu[LUMA_4x4].filter_p2s = x265_pixelToShort_4x4_ssse3;
-        p.pu[LUMA_4x8].filter_p2s = x265_pixelToShort_4x8_ssse3;
-        p.pu[LUMA_4x16].filter_p2s = x265_pixelToShort_4x16_ssse3;
-        p.pu[LUMA_8x4].filter_p2s = x265_pixelToShort_8x4_ssse3;
-        p.pu[LUMA_8x8].filter_p2s = x265_pixelToShort_8x8_ssse3;
-        p.pu[LUMA_8x16].filter_p2s = x265_pixelToShort_8x16_ssse3;
-        p.pu[LUMA_8x32].filter_p2s = x265_pixelToShort_8x32_ssse3;
-        p.pu[LUMA_16x4].filter_p2s = x265_pixelToShort_16x4_ssse3;
-        p.pu[LUMA_16x8].filter_p2s = x265_pixelToShort_16x8_ssse3;
-        p.pu[LUMA_16x12].filter_p2s = x265_pixelToShort_16x12_ssse3;
-        p.pu[LUMA_16x16].filter_p2s = x265_pixelToShort_16x16_ssse3;
-        p.pu[LUMA_16x32].filter_p2s = x265_pixelToShort_16x32_ssse3;
-        p.pu[LUMA_16x64].filter_p2s = x265_pixelToShort_16x64_ssse3;
-        p.pu[LUMA_32x8].filter_p2s = x265_pixelToShort_32x8_ssse3;
-        p.pu[LUMA_32x16].filter_p2s = x265_pixelToShort_32x16_ssse3;
-        p.pu[LUMA_32x24].filter_p2s = x265_pixelToShort_32x24_ssse3;
-        p.pu[LUMA_32x32].filter_p2s = x265_pixelToShort_32x32_ssse3;
-        p.pu[LUMA_32x64].filter_p2s = x265_pixelToShort_32x64_ssse3;
-        p.pu[LUMA_64x16].filter_p2s = x265_pixelToShort_64x16_ssse3;
-        p.pu[LUMA_64x32].filter_p2s = x265_pixelToShort_64x32_ssse3;
-        p.pu[LUMA_64x48].filter_p2s = x265_pixelToShort_64x48_ssse3;
-        p.pu[LUMA_64x64].filter_p2s = x265_pixelToShort_64x64_ssse3;
-
-        p.chroma[X265_CSP_I420].p2s = x265_chroma_p2s_ssse3;
-        p.chroma[X265_CSP_I422].p2s = x265_chroma_p2s_ssse3;
 
         p.dst4x4 = x265_dst4_ssse3;
         p.cu[BLOCK_8x8].idct = x265_idct8_ssse3;
@@ -1307,6 +1279,52 @@ void setupAssemblyPrimitives(EncoderPrim
         p.frameInitLowres = x265_frame_init_lowres_core_ssse3;
         p.scale1D_128to64 = x265_scale1D_128to64_ssse3;
         p.scale2D_64to32 = x265_scale2D_64to32_ssse3;
+
+        p.pu[LUMA_8x4].convert_p2s = x265_filterPixelToShort_8x4_ssse3;
+        p.pu[LUMA_8x8].convert_p2s = x265_filterPixelToShort_8x8_ssse3;
+        p.pu[LUMA_8x16].convert_p2s = x265_filterPixelToShort_8x16_ssse3;
+        p.pu[LUMA_8x32].convert_p2s = x265_filterPixelToShort_8x32_ssse3;
+        p.pu[LUMA_16x4].convert_p2s = x265_filterPixelToShort_16x4_ssse3;