[x265-commits] [x265] vtune: add comma to prevent string concatenation - fixes ...

Mon Jan 12 05:50:27 CET 2015

details:   http://hg.videolan.org/x265/rev/1924c460d130
branches:  
changeset: 9063:1924c460d130
user:      Steve Borho <steve at borho.org>
date:      Fri Jan 09 11:35:26 2015 +0530
description:
vtune: add comma to prevent string concatenation - fixes task profiling
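
For readers unfamiliar with this class of bug, here is a minimal sketch of how a missing comma in a string table misbehaves. The table names and contents below are hypothetical and are not taken from vtune.cpp. Adjacent string literals are concatenated by the compiler, so the array silently loses an entry and every later index shifts by one.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// Hypothetical task-name tables (not the real vtune.cpp contents).
// kBroken is missing the comma after "analysis", so the compiler merges
// "analysis" and "filter" into the single literal "analysisfilter".
static const char* const kBroken[] = { "encode", "analysis" "filter", "output" };
static const char* const kFixed[]  = { "encode", "analysis", "filter", "output" };

// Number of elements in a string table, deduced at compile time.
template <std::size_t N>
constexpr std::size_t countOf(const char* const (&)[N]) { return N; }

// Demonstrates the effect: one entry fewer, and a shifted name at index 2.
static void checkTaskTables()
{
    assert(countOf(kBroken) == 3);
    assert(countOf(kFixed) == 4);
    assert(std::strcmp(kBroken[1], "analysisfilter") == 0);
    assert(std::strcmp(kBroken[2], "output") == 0);
    assert(std::strcmp(kFixed[2], "filter") == 0);
}
```

Code that looks up a name by an enum value sized for the four-entry table would read the wrong name, or read past the end of the broken table.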
Subject: [x265] Refactor EncoderPrimitives under common.

details:   http://hg.videolan.org/x265/rev/0fb899cd8e1a
branches:  
changeset: 9064:0fb899cd8e1a
user:      Kevin Wu <kevin at multicorewareinc.com>
date:      Thu Jan 08 15:23:38 2015 -0600
description:
Refactor EncoderPrimitives under common.
Subject: [x265] Refactor EncoderPrimitives under encoder.

details:   http://hg.videolan.org/x265/rev/7f6f97778548
branches:  
changeset: 9065:7f6f97778548
user:      Kevin Wu <kevin at multicorewareinc.com>
date:      Thu Jan 08 15:30:26 2015 -0600
description:
Refactor EncoderPrimitives under encoder.
Subject: [x265] Fix index to dct primitive when using dst.

details:   http://hg.videolan.org/x265/rev/efa3c407bf30
branches:  
changeset: 9066:efa3c407bf30
user:      Kevin Wu <kevin at multicorewareinc.com>
date:      Tue Jan 06 16:46:07 2015 -0600
description:
Fix index to dct primitive when using dst.

Use the dst4x4 or idst4x4 function pointers instead of indexing over the
EncoderPrimitives and calling dct/idct.
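
The idea described above can be sketched roughly as follows. This is a simplified illustration under stated assumptions: the function-pointer signature, the stub functions, and the helper `forwardTransform` are hypothetical, and only the member names `cu[...].dct` and `dst4x4` and the `BLOCK_*` names mirror the diff below. The point is that DST gets its own pointer, so a block-size index can only ever select a DCT.

```cpp
#include <cassert>
#include <cstdint>

// Simplified stand-in signature; the real x265 primitive types differ.
typedef void (*transform_t)(const int16_t* src, int16_t* dst, int size);

enum BlockSize { BLOCK_4x4, BLOCK_8x8, BLOCK_16x16, BLOCK_32x32, NUM_BLOCKS };

struct CUPrimitives { transform_t dct; };

struct Primitives
{
    CUPrimitives cu[NUM_BLOCKS]; // per-block-size table, DCT only
    transform_t  dst4x4;         // dedicated pointer, no table index involved
};

static int g_lastCalled = 0; // 1 = DCT stub ran, 2 = DST stub ran

static void dctStub(const int16_t*, int16_t*, int) { g_lastCalled = 1; }
static void dstStub(const int16_t*, int16_t*, int) { g_lastCalled = 2; }

static void setupPrimitives(Primitives& p)
{
    for (int i = 0; i < NUM_BLOCKS; i++)
        p.cu[i].dct = dctStub;
    p.dst4x4 = dstStub;
}

// Hypothetical caller: DST is requested explicitly rather than being
// reached through a shared index into the DCT table.
static void forwardTransform(const Primitives& p, int sizeIdx, bool useDST,
                             const int16_t* src, int16_t* dst)
{
    if (useDST)
    {
        assert(sizeIdx == BLOCK_4x4); // DST exists only for 4x4 blocks
        p.dst4x4(src, dst, 4);
    }
    else
        p.cu[sizeIdx].dct(src, dst, 4 << sizeIdx);
}
```

With the DST entry removed from the indexed table, an off-by-one between a DCT enum and a block-size enum can no longer pick the wrong transform.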
Subject: [x265] Refactor EncoderPrimitives under test.

details:   http://hg.videolan.org/x265/rev/d3c403664833
branches:  
changeset: 9067:d3c403664833
user:      Kevin Wu <kevin at multicorewareinc.com>
date:      Wed Jan 07 17:41:45 2015 -0600
description:
Refactor EncoderPrimitives under test.
Subject: [x265] test: Move dst/idst tests out of DctConf struct

details:   http://hg.videolan.org/x265/rev/4e64bb0efa3a
branches:  
changeset: 9068:4e64bb0efa3a
user:      Kevin Wu <kevin at multicorewareinc.com>
date:      Thu Jan 08 11:45:37 2015 -0600
description:
test: Move dst/idst tests out of DctConf struct
Subject: [x265] change data type in satd_4x4 for psyCost_ss

details:   http://hg.videolan.org/x265/rev/5b95e1e639f7
branches:  
changeset: 9069:5b95e1e639f7
user:      Divya Manivannan <divya at multicorewareinc.com>
date:      Fri Jan 09 13:09:39 2015 +0530
description:
change data type in satd_4x4 for psyCost_ss
Subject: [x265] add testbench for psyCost_ss and asm for psyCost_ss_4x4: improve 1989c->515c

details:   http://hg.videolan.org/x265/rev/7c8b6c7edd0c
branches:  
changeset: 9070:7c8b6c7edd0c
user:      Divya Manivannan <divya at multicorewareinc.com>
date:      Fri Jan 09 13:26:21 2015 +0530
description:
add testbench for psyCost_ss and asm for psyCost_ss_4x4: improve 1989c->515c
Subject: [x265] fix bug in sa8d_8x8 for psyCost_ss

details:   http://hg.videolan.org/x265/rev/79566465a64f
branches:  
changeset: 9071:79566465a64f
user:      Divya Manivannan <divya at multicorewareinc.com>
date:      Fri Jan 09 18:50:13 2015 +0530
description:
fix bug in sa8d_8x8 for psyCost_ss
Subject: [x265] primitives: white-space and comment cleanpus

details:   http://hg.videolan.org/x265/rev/1ffff9157c0a
branches:  
changeset: 9072:1ffff9157c0a
user:      Steve Borho <steve at borho.org>
date:      Fri Jan 09 19:43:12 2015 +0530
description:
primitives: white-space and comment cleanpus
Subject: [x265] primitives: move extendPicBorder funcdef to common.h

details:   http://hg.videolan.org/x265/rev/4973575ee22d
branches:  
changeset: 9073:4973575ee22d
user:      Steve Borho <steve at borho.org>
date:      Fri Jan 09 19:43:34 2015 +0530
description:
primitives: move extendPicBorder funcdef to common.h
Subject: [x265] intrapred: clarify angle/mode.

details:   http://hg.videolan.org/x265/rev/17ce633add70
branches:  
changeset: 9074:17ce633add70
user:      Deepthi Nandakumar <deepthi at multicorewareinc.com>
date:      Sun Jan 11 18:53:37 2015 +0530
description:
intrapred: clarify angle/mode.

Fixes asm/no-asm mismatch introduced in e23f671d64d1
Subject: [x265] analysis: simplify inter analysis structure to share more inter analysis data

details:   http://hg.videolan.org/x265/rev/1db4bd2df318
branches:  
changeset: 9075:1db4bd2df318
user:      Gopu Govindaswamy <gopu at multicorewareinc.com>
date:      Wed Dec 24 10:34:59 2014 +0530
description:
analysis: simplify inter analysis structure to share more inter analysis data
Subject: [x265] analysis load/save: dump skip mode info for reuse

details:   http://hg.videolan.org/x265/rev/7e4774b2aedd
branches:  
changeset: 9076:7e4774b2aedd
user:      Gopu Govindaswamy
date:      Sun Jan 11 21:15:07 2015 +0530
description:
analysis load/save: dump skip mode info for reuse
Subject: [x265] Merge

details:   http://hg.videolan.org/x265/rev/17de7ae8f654
branches:  
changeset: 9077:17de7ae8f654
user:      Steve Borho <steve at borho.org>
date:      Mon Jan 12 10:07:51 2015 +0530
description:
Merge

diffstat:

 source/common/common.h               |    10 +-
 source/common/dct.cpp                |    28 +-
 source/common/ipfilter.cpp           |    50 +-
 source/common/lowres.h               |     4 +-
 source/common/pixel.cpp              |   541 ++++++++--------
 source/common/predict.cpp            |    52 +-
 source/common/primitives.cpp         |   106 +-
 source/common/primitives.h           |   176 ++--
 source/common/quant.cpp              |    34 +-
 source/common/shortyuv.cpp           |    18 +-
 source/common/vec/dct-sse3.cpp       |     6 +-
 source/common/vec/dct-ssse3.cpp      |     4 +-
 source/common/x86/asm-primitives.cpp |  1092 +++++++++++++++++----------------
 source/common/x86/pixel-a.asm        |   154 ++++
 source/common/x86/pixel.h            |     1 +
 source/common/yuv.cpp                |    54 +-
 source/encoder/analysis.cpp          |   131 ++-
 source/encoder/analysis.h            |     3 +-
 source/encoder/encoder.cpp           |    38 +-
 source/encoder/framefilter.cpp       |    28 +-
 source/encoder/motion.cpp            |    42 +-
 source/encoder/ratecontrol.cpp       |     6 +-
 source/encoder/rdcost.h              |     4 +-
 source/encoder/search.cpp            |   134 ++--
 source/encoder/slicetype.cpp         |    14 +-
 source/encoder/weightPrediction.cpp  |    20 +-
 source/profile/vtune/vtune.cpp       |     2 +-
 source/test/ipfilterharness.cpp      |   104 +-
 source/test/mbdstharness.cpp         |    44 +-
 source/test/pixelharness.cpp         |   317 +++++----
 source/test/pixelharness.h           |     1 +
 31 files changed, 1766 insertions(+), 1452 deletions(-)

diffs (truncated from 5630 to 300 lines):

diff -r 77938d3e3f09 -r 17de7ae8f654 source/common/common.h
--- a/source/common/common.h	Fri Jan 09 11:02:16 2015 +0530
+++ b/source/common/common.h	Mon Jan 12 10:07:51 2015 +0530
@@ -366,10 +366,12 @@ struct SAOParam
     }
 };
 
-/* Stores inter (motion estimation) analysis data for a single frame */
+/* Stores inter analysis data for a single frame */
 struct analysis_inter_data
 {
-    int      ref;
+    int32_t*    ref;
+    uint8_t*    depth;
+    uint8_t*    modes;
 };
 
 /* Stores intra analysis data for a single frame. This struct needs better packing */
@@ -404,6 +406,10 @@ enum SignificanceMapContextType
     CONTEXT_TYPE_NxN = 2,
     CONTEXT_NUMBER_OF_TYPES = 3
 };
+
+/* located in pixel.cpp */
+void extendPicBorder(pixel* recon, intptr_t stride, int width, int height, int marginX, int marginY);
+
 }
 
 /* outside x265 namespace, but prefixed. defined in common.cpp */
diff -r 77938d3e3f09 -r 17de7ae8f654 source/common/dct.cpp
--- a/source/common/dct.cpp	Fri Jan 09 11:02:16 2015 +0530
+++ b/source/common/dct.cpp	Mon Jan 12 10:07:51 2015 +0530
@@ -765,22 +765,22 @@ void Setup_C_DCTPrimitives(EncoderPrimit
     p.dequant_normal = dequant_normal_c;
     p.quant = quant_c;
     p.nquant = nquant_c;
-    p.dct[DST_4x4] = dst4_c;
-    p.dct[DCT_4x4] = dct4_c;
-    p.dct[DCT_8x8] = dct8_c;
-    p.dct[DCT_16x16] = dct16_c;
-    p.dct[DCT_32x32] = dct32_c;
-    p.idct[IDST_4x4] = idst4_c;
-    p.idct[IDCT_4x4] = idct4_c;
-    p.idct[IDCT_8x8] = idct8_c;
-    p.idct[IDCT_16x16] = idct16_c;
-    p.idct[IDCT_32x32] = idct32_c;
+    p.dst4x4 = dst4_c;
+    p.cu[BLOCK_4x4].dct   = dct4_c;
+    p.cu[BLOCK_8x8].dct   = dct8_c;
+    p.cu[BLOCK_16x16].dct = dct16_c;
+    p.cu[BLOCK_32x32].dct = dct32_c;
+    p.idst4x4 = idst4_c;
+    p.cu[BLOCK_4x4].idct   = idct4_c;
+    p.cu[BLOCK_8x8].idct   = idct8_c;
+    p.cu[BLOCK_16x16].idct = idct16_c;
+    p.cu[BLOCK_32x32].idct = idct32_c;
     p.count_nonzero = count_nonzero_c;
     p.denoiseDct = denoiseDct_c;
 
-    p.copy_cnt[BLOCK_4x4] = copy_count<4>;
-    p.copy_cnt[BLOCK_8x8] = copy_count<8>;
-    p.copy_cnt[BLOCK_16x16] = copy_count<16>;
-    p.copy_cnt[BLOCK_32x32] = copy_count<32>;
+    p.cu[BLOCK_4x4].copy_cnt   = copy_count<4>;
+    p.cu[BLOCK_8x8].copy_cnt   = copy_count<8>;
+    p.cu[BLOCK_16x16].copy_cnt = copy_count<16>;
+    p.cu[BLOCK_32x32].copy_cnt = copy_count<32>;
 }
 }
diff -r 77938d3e3f09 -r 17de7ae8f654 source/common/ipfilter.cpp
--- a/source/common/ipfilter.cpp	Fri Jan 09 11:02:16 2015 +0530
+++ b/source/common/ipfilter.cpp	Mon Jan 12 10:07:51 2015 +0530
@@ -373,37 +373,37 @@ namespace x265 {
 // x265 private namespace
 
 #define CHROMA_420(W, H) \
-    p.chroma[X265_CSP_I420].filter_hpp[CHROMA_ ## W ## x ## H] = interp_horiz_pp_c<4, W, H>; \
-    p.chroma[X265_CSP_I420].filter_hps[CHROMA_ ## W ## x ## H] = interp_horiz_ps_c<4, W, H>; \
-    p.chroma[X265_CSP_I420].filter_vpp[CHROMA_ ## W ## x ## H] = interp_vert_pp_c<4, W, H>;  \
-    p.chroma[X265_CSP_I420].filter_vps[CHROMA_ ## W ## x ## H] = interp_vert_ps_c<4, W, H>;  \
-    p.chroma[X265_CSP_I420].filter_vsp[CHROMA_ ## W ## x ## H] = interp_vert_sp_c<4, W, H>;  \
-    p.chroma[X265_CSP_I420].filter_vss[CHROMA_ ## W ## x ## H] = interp_vert_ss_c<4, W, H>;
+    p.chroma[X265_CSP_I420].pu[CHROMA_ ## W ## x ## H].filter_hpp = interp_horiz_pp_c<4, W, H>; \
+    p.chroma[X265_CSP_I420].pu[CHROMA_ ## W ## x ## H].filter_hps = interp_horiz_ps_c<4, W, H>; \
+    p.chroma[X265_CSP_I420].pu[CHROMA_ ## W ## x ## H].filter_vpp = interp_vert_pp_c<4, W, H>;  \
+    p.chroma[X265_CSP_I420].pu[CHROMA_ ## W ## x ## H].filter_vps = interp_vert_ps_c<4, W, H>;  \
+    p.chroma[X265_CSP_I420].pu[CHROMA_ ## W ## x ## H].filter_vsp = interp_vert_sp_c<4, W, H>;  \
+    p.chroma[X265_CSP_I420].pu[CHROMA_ ## W ## x ## H].filter_vss = interp_vert_ss_c<4, W, H>;
 
 #define CHROMA_422(W, H) \
-    p.chroma[X265_CSP_I422].filter_hpp[CHROMA422_ ## W ## x ## H] = interp_horiz_pp_c<4, W, H>; \
-    p.chroma[X265_CSP_I422].filter_hps[CHROMA422_ ## W ## x ## H] = interp_horiz_ps_c<4, W, H>; \
-    p.chroma[X265_CSP_I422].filter_vpp[CHROMA422_ ## W ## x ## H] = interp_vert_pp_c<4, W, H>;  \
-    p.chroma[X265_CSP_I422].filter_vps[CHROMA422_ ## W ## x ## H] = interp_vert_ps_c<4, W, H>;  \
-    p.chroma[X265_CSP_I422].filter_vsp[CHROMA422_ ## W ## x ## H] = interp_vert_sp_c<4, W, H>;  \
-    p.chroma[X265_CSP_I422].filter_vss[CHROMA422_ ## W ## x ## H] = interp_vert_ss_c<4, W, H>;
+    p.chroma[X265_CSP_I422].pu[CHROMA422_ ## W ## x ## H].filter_hpp = interp_horiz_pp_c<4, W, H>; \
+    p.chroma[X265_CSP_I422].pu[CHROMA422_ ## W ## x ## H].filter_hps = interp_horiz_ps_c<4, W, H>; \
+    p.chroma[X265_CSP_I422].pu[CHROMA422_ ## W ## x ## H].filter_vpp = interp_vert_pp_c<4, W, H>;  \
+    p.chroma[X265_CSP_I422].pu[CHROMA422_ ## W ## x ## H].filter_vps = interp_vert_ps_c<4, W, H>;  \
+    p.chroma[X265_CSP_I422].pu[CHROMA422_ ## W ## x ## H].filter_vsp = interp_vert_sp_c<4, W, H>;  \
+    p.chroma[X265_CSP_I422].pu[CHROMA422_ ## W ## x ## H].filter_vss = interp_vert_ss_c<4, W, H>;
 
 #define CHROMA_444(W, H) \
-    p.chroma[X265_CSP_I444].filter_hpp[LUMA_ ## W ## x ## H] = interp_horiz_pp_c<4, W, H>; \
-    p.chroma[X265_CSP_I444].filter_hps[LUMA_ ## W ## x ## H] = interp_horiz_ps_c<4, W, H>; \
-    p.chroma[X265_CSP_I444].filter_vpp[LUMA_ ## W ## x ## H] = interp_vert_pp_c<4, W, H>;  \
-    p.chroma[X265_CSP_I444].filter_vps[LUMA_ ## W ## x ## H] = interp_vert_ps_c<4, W, H>;  \
-    p.chroma[X265_CSP_I444].filter_vsp[LUMA_ ## W ## x ## H] = interp_vert_sp_c<4, W, H>;  \
-    p.chroma[X265_CSP_I444].filter_vss[LUMA_ ## W ## x ## H] = interp_vert_ss_c<4, W, H>;
+    p.chroma[X265_CSP_I444].pu[LUMA_ ## W ## x ## H].filter_hpp = interp_horiz_pp_c<4, W, H>; \
+    p.chroma[X265_CSP_I444].pu[LUMA_ ## W ## x ## H].filter_hps = interp_horiz_ps_c<4, W, H>; \
+    p.chroma[X265_CSP_I444].pu[LUMA_ ## W ## x ## H].filter_vpp = interp_vert_pp_c<4, W, H>;  \
+    p.chroma[X265_CSP_I444].pu[LUMA_ ## W ## x ## H].filter_vps = interp_vert_ps_c<4, W, H>;  \
+    p.chroma[X265_CSP_I444].pu[LUMA_ ## W ## x ## H].filter_vsp = interp_vert_sp_c<4, W, H>;  \
+    p.chroma[X265_CSP_I444].pu[LUMA_ ## W ## x ## H].filter_vss = interp_vert_ss_c<4, W, H>;
 
 #define LUMA(W, H) \
-    p.luma_hpp[LUMA_ ## W ## x ## H]     = interp_horiz_pp_c<8, W, H>; \
-    p.luma_hps[LUMA_ ## W ## x ## H]     = interp_horiz_ps_c<8, W, H>; \
-    p.luma_vpp[LUMA_ ## W ## x ## H]     = interp_vert_pp_c<8, W, H>;  \
-    p.luma_vps[LUMA_ ## W ## x ## H]     = interp_vert_ps_c<8, W, H>;  \
-    p.luma_vsp[LUMA_ ## W ## x ## H]     = interp_vert_sp_c<8, W, H>;  \
-    p.luma_vss[LUMA_ ## W ## x ## H]     = interp_vert_ss_c<8, W, H>;  \
-    p.luma_hvpp[LUMA_ ## W ## x ## H]    = interp_hv_pp_c<8, W, H>;
+    p.pu[LUMA_ ## W ## x ## H].luma_hpp     = interp_horiz_pp_c<8, W, H>; \
+    p.pu[LUMA_ ## W ## x ## H].luma_hps     = interp_horiz_ps_c<8, W, H>; \
+    p.pu[LUMA_ ## W ## x ## H].luma_vpp     = interp_vert_pp_c<8, W, H>;  \
+    p.pu[LUMA_ ## W ## x ## H].luma_vps     = interp_vert_ps_c<8, W, H>;  \
+    p.pu[LUMA_ ## W ## x ## H].luma_vsp     = interp_vert_sp_c<8, W, H>;  \
+    p.pu[LUMA_ ## W ## x ## H].luma_vss     = interp_vert_ss_c<8, W, H>;  \
+    p.pu[LUMA_ ## W ## x ## H].luma_hvpp    = interp_hv_pp_c<8, W, H>;
 
 void Setup_C_IPFilterPrimitives(EncoderPrimitives& p)
 {
diff -r 77938d3e3f09 -r 17de7ae8f654 source/common/lowres.h
--- a/source/common/lowres.h	Fri Jan 09 11:02:16 2015 +0530
+++ b/source/common/lowres.h	Mon Jan 12 10:07:51 2015 +0530
@@ -69,7 +69,7 @@ struct ReferencePlanes
             int qmvy = qmv.y + (qmv.y & 1);
             int hpelB = (qmvy & 2) | ((qmvx & 2) >> 1);
             pixel *frefB = lowresPlane[hpelB] + blockOffset + (qmvx >> 2) + (qmvy >> 2) * lumaStride;
-            primitives.pixelavg_pp[LUMA_8x8](buf, outstride, frefA, lumaStride, frefB, lumaStride, 32);
+            primitives.pu[LUMA_8x8].pixelavg_pp(buf, outstride, frefA, lumaStride, frefB, lumaStride, 32);
             return buf;
         }
         else
@@ -91,7 +91,7 @@ struct ReferencePlanes
             int qmvy = qmv.y + (qmv.y & 1);
             int hpelB = (qmvy & 2) | ((qmvx & 2) >> 1);
             pixel *frefB = lowresPlane[hpelB] + blockOffset + (qmvx >> 2) + (qmvy >> 2) * lumaStride;
-            primitives.pixelavg_pp[LUMA_8x8](subpelbuf, 8, frefA, lumaStride, frefB, lumaStride, 32);
+            primitives.pu[LUMA_8x8].pixelavg_pp(subpelbuf, 8, frefA, lumaStride, frefB, lumaStride, 32);
             return comp(fenc, FENC_STRIDE, subpelbuf, 8);
         }
         else
diff -r 77938d3e3f09 -r 17de7ae8f654 source/common/pixel.cpp
--- a/source/common/pixel.cpp	Fri Jan 09 11:02:16 2015 +0530
+++ b/source/common/pixel.cpp	Mon Jan 12 10:07:51 2015 +0530
@@ -33,58 +33,58 @@
 using namespace x265;
 
 #define SET_FUNC_PRIMITIVE_TABLE_C(FUNC_PREFIX, FUNC_PREFIX_DEF, DATA_TYPE1, DATA_TYPE2) \
-    p.FUNC_PREFIX[LUMA_4x4]   = FUNC_PREFIX_DEF<4,  4, DATA_TYPE1, DATA_TYPE2>; \
-    p.FUNC_PREFIX[LUMA_8x8]   = FUNC_PREFIX_DEF<8,  8, DATA_TYPE1, DATA_TYPE2>; \
-    p.FUNC_PREFIX[LUMA_8x4]   = FUNC_PREFIX_DEF<8,  4, DATA_TYPE1, DATA_TYPE2>; \
-    p.FUNC_PREFIX[LUMA_4x8]   = FUNC_PREFIX_DEF<4,  8, DATA_TYPE1, DATA_TYPE2>; \
-    p.FUNC_PREFIX[LUMA_16x16] = FUNC_PREFIX_DEF<16, 16, DATA_TYPE1, DATA_TYPE2>; \
-    p.FUNC_PREFIX[LUMA_16x8]  = FUNC_PREFIX_DEF<16,  8, DATA_TYPE1, DATA_TYPE2>; \
-    p.FUNC_PREFIX[LUMA_8x16]  = FUNC_PREFIX_DEF<8, 16, DATA_TYPE1, DATA_TYPE2>; \
-    p.FUNC_PREFIX[LUMA_16x12] = FUNC_PREFIX_DEF<16, 12, DATA_TYPE1, DATA_TYPE2>; \
-    p.FUNC_PREFIX[LUMA_12x16] = FUNC_PREFIX_DEF<12, 16, DATA_TYPE1, DATA_TYPE2>; \
-    p.FUNC_PREFIX[LUMA_16x4]  = FUNC_PREFIX_DEF<16,  4, DATA_TYPE1, DATA_TYPE2>; \
-    p.FUNC_PREFIX[LUMA_4x16]  = FUNC_PREFIX_DEF<4, 16, DATA_TYPE1, DATA_TYPE2>; \
-    p.FUNC_PREFIX[LUMA_32x32] = FUNC_PREFIX_DEF<32, 32, DATA_TYPE1, DATA_TYPE2>; \
-    p.FUNC_PREFIX[LUMA_32x16] = FUNC_PREFIX_DEF<32, 16, DATA_TYPE1, DATA_TYPE2>; \
-    p.FUNC_PREFIX[LUMA_16x32] = FUNC_PREFIX_DEF<16, 32, DATA_TYPE1, DATA_TYPE2>; \
-    p.FUNC_PREFIX[LUMA_32x24] = FUNC_PREFIX_DEF<32, 24, DATA_TYPE1, DATA_TYPE2>; \
-    p.FUNC_PREFIX[LUMA_24x32] = FUNC_PREFIX_DEF<24, 32, DATA_TYPE1, DATA_TYPE2>; \
-    p.FUNC_PREFIX[LUMA_32x8]  = FUNC_PREFIX_DEF<32,  8, DATA_TYPE1, DATA_TYPE2>; \
-    p.FUNC_PREFIX[LUMA_8x32]  = FUNC_PREFIX_DEF<8, 32, DATA_TYPE1, DATA_TYPE2>; \
-    p.FUNC_PREFIX[LUMA_64x64] = FUNC_PREFIX_DEF<64, 64, DATA_TYPE1, DATA_TYPE2>; \
-    p.FUNC_PREFIX[LUMA_64x32] = FUNC_PREFIX_DEF<64, 32, DATA_TYPE1, DATA_TYPE2>; \
-    p.FUNC_PREFIX[LUMA_32x64] = FUNC_PREFIX_DEF<32, 64, DATA_TYPE1, DATA_TYPE2>; \
-    p.FUNC_PREFIX[LUMA_64x48] = FUNC_PREFIX_DEF<64, 48, DATA_TYPE1, DATA_TYPE2>; \
-    p.FUNC_PREFIX[LUMA_48x64] = FUNC_PREFIX_DEF<48, 64, DATA_TYPE1, DATA_TYPE2>; \
-    p.FUNC_PREFIX[LUMA_64x16] = FUNC_PREFIX_DEF<64, 16, DATA_TYPE1, DATA_TYPE2>; \
-    p.FUNC_PREFIX[LUMA_16x64] = FUNC_PREFIX_DEF<16, 64, DATA_TYPE1, DATA_TYPE2>;
+    p.pu[LUMA_4x4].FUNC_PREFIX   = FUNC_PREFIX_DEF<4,  4, DATA_TYPE1, DATA_TYPE2>; \
+    p.pu[LUMA_8x8].FUNC_PREFIX   = FUNC_PREFIX_DEF<8,  8, DATA_TYPE1, DATA_TYPE2>; \
+    p.pu[LUMA_8x4].FUNC_PREFIX   = FUNC_PREFIX_DEF<8,  4, DATA_TYPE1, DATA_TYPE2>; \
+    p.pu[LUMA_4x8].FUNC_PREFIX   = FUNC_PREFIX_DEF<4,  8, DATA_TYPE1, DATA_TYPE2>; \
+    p.pu[LUMA_16x16].FUNC_PREFIX = FUNC_PREFIX_DEF<16, 16, DATA_TYPE1, DATA_TYPE2>; \
+    p.pu[LUMA_16x8].FUNC_PREFIX  = FUNC_PREFIX_DEF<16,  8, DATA_TYPE1, DATA_TYPE2>; \
+    p.pu[LUMA_8x16].FUNC_PREFIX  = FUNC_PREFIX_DEF<8, 16, DATA_TYPE1, DATA_TYPE2>; \
+    p.pu[LUMA_16x12].FUNC_PREFIX = FUNC_PREFIX_DEF<16, 12, DATA_TYPE1, DATA_TYPE2>; \
+    p.pu[LUMA_12x16].FUNC_PREFIX = FUNC_PREFIX_DEF<12, 16, DATA_TYPE1, DATA_TYPE2>; \
+    p.pu[LUMA_16x4].FUNC_PREFIX  = FUNC_PREFIX_DEF<16,  4, DATA_TYPE1, DATA_TYPE2>; \
+    p.pu[LUMA_4x16].FUNC_PREFIX  = FUNC_PREFIX_DEF<4, 16, DATA_TYPE1, DATA_TYPE2>; \
+    p.pu[LUMA_32x32].FUNC_PREFIX = FUNC_PREFIX_DEF<32, 32, DATA_TYPE1, DATA_TYPE2>; \
+    p.pu[LUMA_32x16].FUNC_PREFIX = FUNC_PREFIX_DEF<32, 16, DATA_TYPE1, DATA_TYPE2>; \
+    p.pu[LUMA_16x32].FUNC_PREFIX = FUNC_PREFIX_DEF<16, 32, DATA_TYPE1, DATA_TYPE2>; \
+    p.pu[LUMA_32x24].FUNC_PREFIX = FUNC_PREFIX_DEF<32, 24, DATA_TYPE1, DATA_TYPE2>; \
+    p.pu[LUMA_24x32].FUNC_PREFIX = FUNC_PREFIX_DEF<24, 32, DATA_TYPE1, DATA_TYPE2>; \
+    p.pu[LUMA_32x8].FUNC_PREFIX  = FUNC_PREFIX_DEF<32,  8, DATA_TYPE1, DATA_TYPE2>; \
+    p.pu[LUMA_8x32].FUNC_PREFIX  = FUNC_PREFIX_DEF<8, 32, DATA_TYPE1, DATA_TYPE2>; \
+    p.pu[LUMA_64x64].FUNC_PREFIX = FUNC_PREFIX_DEF<64, 64, DATA_TYPE1, DATA_TYPE2>; \
+    p.pu[LUMA_64x32].FUNC_PREFIX = FUNC_PREFIX_DEF<64, 32, DATA_TYPE1, DATA_TYPE2>; \
+    p.pu[LUMA_32x64].FUNC_PREFIX = FUNC_PREFIX_DEF<32, 64, DATA_TYPE1, DATA_TYPE2>; \
+    p.pu[LUMA_64x48].FUNC_PREFIX = FUNC_PREFIX_DEF<64, 48, DATA_TYPE1, DATA_TYPE2>; \
+    p.pu[LUMA_48x64].FUNC_PREFIX = FUNC_PREFIX_DEF<48, 64, DATA_TYPE1, DATA_TYPE2>; \
+    p.pu[LUMA_64x16].FUNC_PREFIX = FUNC_PREFIX_DEF<64, 16, DATA_TYPE1, DATA_TYPE2>; \
+    p.pu[LUMA_16x64].FUNC_PREFIX = FUNC_PREFIX_DEF<16, 64, DATA_TYPE1, DATA_TYPE2>;
 
 #define SET_FUNC_PRIMITIVE_TABLE_C2(FUNC_PREFIX) \
-    p.FUNC_PREFIX[LUMA_4x4]   = FUNC_PREFIX<4,  4>; \
-    p.FUNC_PREFIX[LUMA_8x8]   = FUNC_PREFIX<8,  8>; \
-    p.FUNC_PREFIX[LUMA_8x4]   = FUNC_PREFIX<8,  4>; \
-    p.FUNC_PREFIX[LUMA_4x8]   = FUNC_PREFIX<4,  8>; \
-    p.FUNC_PREFIX[LUMA_16x16] = FUNC_PREFIX<16, 16>; \
-    p.FUNC_PREFIX[LUMA_16x8]  = FUNC_PREFIX<16,  8>; \
-    p.FUNC_PREFIX[LUMA_8x16]  = FUNC_PREFIX<8, 16>; \
-    p.FUNC_PREFIX[LUMA_16x12] = FUNC_PREFIX<16, 12>; \
-    p.FUNC_PREFIX[LUMA_12x16] = FUNC_PREFIX<12, 16>; \
-    p.FUNC_PREFIX[LUMA_16x4]  = FUNC_PREFIX<16,  4>; \
-    p.FUNC_PREFIX[LUMA_4x16]  = FUNC_PREFIX<4, 16>; \
-    p.FUNC_PREFIX[LUMA_32x32] = FUNC_PREFIX<32, 32>; \
-    p.FUNC_PREFIX[LUMA_32x16] = FUNC_PREFIX<32, 16>; \
-    p.FUNC_PREFIX[LUMA_16x32] = FUNC_PREFIX<16, 32>; \
-    p.FUNC_PREFIX[LUMA_32x24] = FUNC_PREFIX<32, 24>; \
-    p.FUNC_PREFIX[LUMA_24x32] = FUNC_PREFIX<24, 32>; \
-    p.FUNC_PREFIX[LUMA_32x8]  = FUNC_PREFIX<32,  8>; \
-    p.FUNC_PREFIX[LUMA_8x32]  = FUNC_PREFIX<8, 32>; \
-    p.FUNC_PREFIX[LUMA_64x64] = FUNC_PREFIX<64, 64>; \
-    p.FUNC_PREFIX[LUMA_64x32] = FUNC_PREFIX<64, 32>; \
-    p.FUNC_PREFIX[LUMA_32x64] = FUNC_PREFIX<32, 64>; \
-    p.FUNC_PREFIX[LUMA_64x48] = FUNC_PREFIX<64, 48>; \
-    p.FUNC_PREFIX[LUMA_48x64] = FUNC_PREFIX<48, 64>; \
-    p.FUNC_PREFIX[LUMA_64x16] = FUNC_PREFIX<64, 16>; \
-    p.FUNC_PREFIX[LUMA_16x64] = FUNC_PREFIX<16, 64>;
+    p.pu[LUMA_4x4].FUNC_PREFIX   = FUNC_PREFIX<4,  4>; \
+    p.pu[LUMA_8x8].FUNC_PREFIX   = FUNC_PREFIX<8,  8>; \
+    p.pu[LUMA_8x4].FUNC_PREFIX   = FUNC_PREFIX<8,  4>; \
+    p.pu[LUMA_4x8].FUNC_PREFIX   = FUNC_PREFIX<4,  8>; \
+    p.pu[LUMA_16x16].FUNC_PREFIX = FUNC_PREFIX<16, 16>; \
+    p.pu[LUMA_16x8].FUNC_PREFIX  = FUNC_PREFIX<16,  8>; \
+    p.pu[LUMA_8x16].FUNC_PREFIX  = FUNC_PREFIX<8, 16>; \
+    p.pu[LUMA_16x12].FUNC_PREFIX = FUNC_PREFIX<16, 12>; \
+    p.pu[LUMA_12x16].FUNC_PREFIX = FUNC_PREFIX<12, 16>; \
+    p.pu[LUMA_16x4].FUNC_PREFIX  = FUNC_PREFIX<16,  4>; \
+    p.pu[LUMA_4x16].FUNC_PREFIX  = FUNC_PREFIX<4, 16>; \
+    p.pu[LUMA_32x32].FUNC_PREFIX = FUNC_PREFIX<32, 32>; \
+    p.pu[LUMA_32x16].FUNC_PREFIX = FUNC_PREFIX<32, 16>; \
+    p.pu[LUMA_16x32].FUNC_PREFIX = FUNC_PREFIX<16, 32>; \
+    p.pu[LUMA_32x24].FUNC_PREFIX = FUNC_PREFIX<32, 24>; \
+    p.pu[LUMA_24x32].FUNC_PREFIX = FUNC_PREFIX<24, 32>; \
+    p.pu[LUMA_32x8].FUNC_PREFIX  = FUNC_PREFIX<32,  8>; \
+    p.pu[LUMA_8x32].FUNC_PREFIX  = FUNC_PREFIX<8, 32>; \
+    p.pu[LUMA_64x64].FUNC_PREFIX = FUNC_PREFIX<64, 64>; \
+    p.pu[LUMA_64x32].FUNC_PREFIX = FUNC_PREFIX<64, 32>; \
+    p.pu[LUMA_32x64].FUNC_PREFIX = FUNC_PREFIX<32, 64>; \
+    p.pu[LUMA_64x48].FUNC_PREFIX = FUNC_PREFIX<64, 48>; \
+    p.pu[LUMA_48x64].FUNC_PREFIX = FUNC_PREFIX<48, 64>; \
+    p.pu[LUMA_64x16].FUNC_PREFIX = FUNC_PREFIX<64, 16>; \
+    p.pu[LUMA_16x64].FUNC_PREFIX = FUNC_PREFIX<16, 64>;
 
 namespace {
 // place functions in anonymous namespace (file static)
@@ -243,9 +243,9 @@ int satd_4x4(const pixel* pix1, intptr_t
 
 static int satd_4x4(const int16_t* pix1, intptr_t stride_pix1)
 {
-    int64_t tmp[4][4];
-    int64_t s01, s23, d01, d23;
-    int64_t satd = 0;
+    int32_t tmp[4][4];
+    int32_t s01, s23, d01, d23;
+    int32_t satd = 0;
     int d;
 
     for (d = 0; d < 4; d++, pix1 += stride_pix1)
@@ -367,46 +367,55 @@ int sa8d_8x8(const pixel* pix1, intptr_t
     return (int)((_sa8d_8x8(pix1, i_pix1, pix2, i_pix2) + 2) >> 2);
 }
 
-inline int _sa8d_8x8(const int16_t* pix1, intptr_t i_pix1, const int16_t* pix2, intptr_t i_pix2)
+inline int _sa8d_8x8(const int16_t* pix1, intptr_t i_pix1)
 {
-    ssum2_t tmp[8][4];
-    ssum2_t a0, a1, a2, a3, a4, a5, a6, a7, b0, b1, b2, b3;
-    ssum2_t sum = 0;
+    int32_t tmp[8][8];
+    int32_t a0, a1, a2, a3, a4, a5, a6, a7;
+    int32_t sum = 0;
 
-    for (int i = 0; i < 8; i++, pix1 += i_pix1, pix2 += i_pix2)
+    for (int i = 0; i < 8; i++, pix1 += i_pix1)
     {
-        a0 = pix1[0] - pix2[0];
-        a1 = pix1[1] - pix2[1];