[x265-commits] [x265] doc: fix formatting of code sample

Wed May 20 18:52:29 CEST 2015

details:   http://hg.videolan.org/x265/rev/9b31a8a7bd57
branches:  
changeset: 10487:9b31a8a7bd57
user:      Steve Borho <steve at borho.org>
date:      Tue May 19 19:51:56 2015 -0500
description:
doc: fix formatting of code sample
Subject: [x265] asm: avx2 code for sad_x4[48x64] (33937 -> 15279) for 10 bpp

details:   http://hg.videolan.org/x265/rev/384d01eb7142
branches:  
changeset: 10488:384d01eb7142
user:      Sumalatha Polureddy
date:      Wed May 20 11:05:15 2015 +0530
description:
asm: avx2 code for sad_x4[48x64] (33937 -> 15279) for 10 bpp

sse2
sad_x4[48x64]  2.55x    33937.88        86421.41

avx2
sad_x4[48x64]  5.67x    15279.31        86572.20
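
The speedup column above can be cross-checked from the two cycle columns. A minimal sketch, assuming the second numeric column is the optimized primitive's cycle count and the third is the C reference's; the helper names (speedup, round2, checkSadX4Figures) are hypothetical and not part of x265:

```cpp
#include <cassert>
#include <cmath>

// Hypothetical helper: ratio of C-reference cycles to optimized cycles.
double speedup(double optimizedCycles, double referenceCycles)
{
    return referenceCycles / optimizedCycles;
}

// Round to two decimals, matching the precision of the printed "N.NNx" column.
double round2(double value)
{
    return std::round(value * 100.0) / 100.0;
}

// Re-derive the two speedups reported for sad_x4[48x64] in this commit message.
void checkSadX4Figures()
{
    assert(std::fabs(round2(speedup(33937.88, 86421.41)) - 2.55) < 1e-9); // sse2
    assert(std::fabs(round2(speedup(15279.31, 86572.20)) - 5.67) < 1e-9); // avx2
}
```

The C reference timings differ slightly between the two runs (86421.41 vs 86572.20), so the speedups are best compared within a run rather than across runs.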
Subject: [x265] param: tune grain disables rdoq-level. This provides better visual quality results

details:   http://hg.videolan.org/x265/rev/6fd44bfcb696
branches:  
changeset: 10489:6fd44bfcb696
user:      Deepthi Nandakumar <deepthi at multicorewareinc.com>
date:      Wed May 20 12:08:05 2015 +0530
description:
param: tune grain disables rdoq-level. This provides better visual quality results
Subject: [x265] asm: removed some duplicate constants in intrapred16.asm 16bpp

details:   http://hg.videolan.org/x265/rev/98279c718374
branches:  
changeset: 10490:98279c718374
user:      Dnyaneshwar G <dnyaneshwar at multicorewareinc.com>
date:      Wed May 20 12:49:59 2015 +0530
description:
asm: removed some duplicate constants in intrapred16.asm 16bpp

Also renamed pw_planar4_1, pw_planar8_1 and pd_planar32_1 to pw_3, pw_7 and pd_31 respectively, and moved them into the common const-a.asm file.
Subject: [x265] asm: removed duplicate constants in intrapred8.asm 8bpp, these constants are already defined in const-a.asm

details:   http://hg.videolan.org/x265/rev/27f6dd7d3aca
branches:  
changeset: 10491:27f6dd7d3aca
user:      Dnyaneshwar G <dnyaneshwar at multicorewareinc.com>
date:      Wed May 20 12:52:40 2015 +0530
description:
asm: removed duplicate constants in intrapred8.asm 8bpp, these constants are already defined in const-a.asm
Subject: [x265] asm: avx2 10bit code for luma_hpp[4xN]

details:   http://hg.videolan.org/x265/rev/f1493e1c6edf
branches:  
changeset: 10492:f1493e1c6edf
user:      Rajesh Paulraj <rajesh at multicorewareinc.com>
date:      Wed May 20 14:06:53 2015 +0530
description:
asm: avx2 10bit code for luma_hpp[4xN]

avx2:
luma_hpp[  4x4]         4.59x    423.90          1944.66
luma_hpp[  4x8]         4.74x    803.53          3806.63
luma_hpp[ 4x16]         4.73x    1574.01         7442.57

sse4:
luma_hpp[  4x4]         3.69x    527.97          1946.47
luma_hpp[  4x8]         3.93x    961.48          3780.20
luma_hpp[ 4x16]         4.06x    1833.63         7445.62
Subject: [x265] asm: filter_vpp, filter_vps for 64xN in avx2

details:   http://hg.videolan.org/x265/rev/cf3396fa2220
branches:  
changeset: 10493:cf3396fa2220
user:      Divya Manivannan <divya at multicorewareinc.com>
date:      Thu May 14 11:36:52 2015 +0530
description:
asm: filter_vpp, filter_vps for 64xN in avx2

filter_vpp[64x64, 64x48, 64x32, 64x16]: 15007c->7349c, 20465c->5519c, 7448c->3752c, 3705c->1917c
filter_vps[64x64, 64x48, 64x32, 64x16]: 15449c->9899c, 11674c->7483c, 7568c->4892c, 3892c->2483c
Subject: [x265] analysis: re-order RD 0/4 analysis to do splits before ME or intra

details:   http://hg.videolan.org/x265/rev/61a6bc52debf
branches:  
changeset: 10494:61a6bc52debf
user:      Ashok Kumar Mishra <ashok at multicorewareinc.com>
date:      Mon May 18 12:46:18 2015 +0530
description:
analysis: re-order RD 0/4 analysis to do splits before ME or intra
Subject: [x265] analysis: at RD 0/4 avoid motion references if not used by split blocks

details:   http://hg.videolan.org/x265/rev/b3ddacfe1e35
branches:  
changeset: 10495:b3ddacfe1e35
user:      Ashok Kumar Mishra <ashok at multicorewareinc.com>
date:      Mon May 18 13:18:54 2015 +0530
description:
analysis: at RD 0/4 avoid motion references if not used by split blocks
Subject: [x265] stats: profile effectiveness of reference limit masks

details:   http://hg.videolan.org/x265/rev/937b2a26dc1f
branches:  
changeset: 10496:937b2a26dc1f
user:      Ashok Kumar Mishra <ashok at multicorewareinc.com>
date:      Mon May 18 13:20:54 2015 +0530
description:
stats: profile effectiveness of reference limit masks
Subject: [x265] analysis: skip intra in RD 0/4 if split was analyzed and no split CUs used intra

details:   http://hg.videolan.org/x265/rev/ab01c9c7c6fd
branches:  
changeset: 10497:ab01c9c7c6fd
user:      Ashok Kumar Mishra <ashok at multicorewareinc.com>
date:      Mon May 18 13:22:19 2015 +0530
description:
analysis: skip intra in RD 0/4 if split was analyzed and no split CUs used intra
Subject: [x265] stats: RD 0/4 profile effectiveness of avoiding intra if split CUs did not select it

details:   http://hg.videolan.org/x265/rev/04fee4b299f6
branches:  
changeset: 10498:04fee4b299f6
user:      Ashok Kumar Mishra <ashok at multicorewareinc.com>
date:      Mon May 18 13:24:42 2015 +0530
description:
stats: RD 0/4 profile effectiveness of avoiding intra if split CUs did not select it
Subject: [x265] analysis: respect X265_REF_LIMIT_DEPTH with RD 0/4

details:   http://hg.videolan.org/x265/rev/7a00289539c0
branches:  
changeset: 10499:7a00289539c0
user:      Ashok Kumar Mishra <ashok at multicorewareinc.com>
date:      Mon May 18 13:31:59 2015 +0530
description:
analysis: respect X265_REF_LIMIT_DEPTH with RD 0/4

When this flag is not set, we do not restrict references used by parent CUs
Subject: [x265] cli: connect --limit-refs to param.limitReferences

details:   http://hg.videolan.org/x265/rev/3567484c8607
branches:  
changeset: 10500:3567484c8607
user:      Ashok Kumar Mishra <ashok at multicorewareinc.com>
date:      Mon May 18 13:34:33 2015 +0530
description:
cli: connect --limit-refs to param.limitReferences
Subject: [x265] stats: with the CU reference limit, even 8x8 can have skipped motion searches

details:   http://hg.videolan.org/x265/rev/79293e0515e0
branches:  
changeset: 10501:79293e0515e0
user:      Ashok Kumar Mishra <ashok at multicorewareinc.com>
date:      Mon May 18 13:36:26 2015 +0530
description:
stats: with the CU reference limit, even 8x8 can have skipped motion searches
Subject: [x265] analysis: model the effectiveness of --limit-ref with RD 0/4

details:   http://hg.videolan.org/x265/rev/899c9d889e79
branches:  
changeset: 10502:899c9d889e79
user:      Ashok Kumar Mishra <ashok at multicorewareinc.com>
date:      Mon May 18 13:26:33 2015 +0530
description:
analysis: model the effectiveness of --limit-ref with RD 0/4
Subject: [x265] analysis: re-order cost calculation for early-outs

details:   http://hg.videolan.org/x265/rev/aba0ec72510c
branches:  
changeset: 10503:aba0ec72510c
user:      Deepthi Nandakumar <deepthi at multicorewareinc.com>
date:      Wed May 20 16:10:42 2015 +0530
description:
analysis: re-order cost calculation for early-outs
Subject: [x265] docs: document --limit-refs

details:   http://hg.videolan.org/x265/rev/a28531c13d95
branches:  
changeset: 10504:a28531c13d95
user:      Deepthi Nandakumar <deepthi at multicorewareinc.com>
date:      Mon Mar 16 20:19:33 2015 -0500
description:
docs: document --limit-refs

This option is currently available only for rdLevels 0-4. Support for
rdLevels 5 and 6 is expected to follow shortly.
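
The --limit-refs value is a two-bit mask: X265_REF_LIMIT_DEPTH is 1 and X265_REF_LIMIT_CU is 2, so 3 enables both. A minimal sketch of decoding it follows, using local stand-in constants rather than the real x265 header. The helper names are hypothetical, and the lower bound in isValidLimitRefs is an assumption, since the check added in this series only rejects values above 3:

```cpp
#include <cassert>

// Local stand-ins for the flag values quoted in the --limit-refs documentation.
constexpr int kRefLimitDepth = 1; // mirrors X265_REF_LIMIT_DEPTH
constexpr int kRefLimitCu    = 2; // mirrors X265_REF_LIMIT_CU

// True when references at a depth are restricted to those used by the split CUs.
bool limitsByDepth(int limitRefs)
{
    return (limitRefs & kRefLimitDepth) != 0;
}

// True when rect/amp partitions are restricted to references chosen by 2Nx2N.
bool limitsByCu(int limitRefs)
{
    return (limitRefs & kRefLimitCu) != 0;
}

// Assumption: accept only 0..3; the patch itself only rejects values above 3.
bool isValidLimitRefs(int limitRefs)
{
    return limitRefs >= 0 && limitRefs <= 3;
}

// Small self-check covering the documented settings.
void checkLimitRefs()
{
    assert(limitsByDepth(1) && !limitsByCu(1));
    assert(!limitsByDepth(2) && limitsByCu(2));
    assert(limitsByDepth(3) && limitsByCu(3));
    assert(!isValidLimitRefs(4));
}
```

This is the same decoding that the new x265_print_params log line performs when it prints the cu and depth flags as separate 0/1 values.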
Subject: [x265] cli: delay calling showHelp until a param is allocated and defaulted

details:   http://hg.videolan.org/x265/rev/b30f39f374f1
branches:  
changeset: 10505:b30f39f374f1
user:      Steve Borho <steve at borho.org>
date:      Wed May 20 10:29:09 2015 -0500
description:
cli: delay calling showHelp until a param is allocated and defaulted

Removes an old 'help' variable that was initialized to zero but never set
Subject: [x265] asm: interp_4tap_vert_pX_4xN sse2

details:   http://hg.videolan.org/x265/rev/35dd4bea0bc7
branches:  
changeset: 10506:35dd4bea0bc7
user:      David T Yuen <dtyx265 at gmail.com>
date:      Tue May 19 18:29:06 2015 -0700
description:
asm: interp_4tap_vert_pX_4xN sse2

Improved register usage for addressing of output.  This improves 64-bit performance by 0.7% to 2.5%.
Also added interp_4tap_vert_ps_4x32 in primitives setup.

diffstat:

 doc/reST/api.rst                     |    2 +-
 doc/reST/cli.rst                     |   24 ++
 source/common/param.cpp              |   11 +-
 source/common/x86/asm-primitives.cpp |   13 +
 source/common/x86/const-a.asm        |    3 +
 source/common/x86/intrapred16.asm    |  131 +++++++--------
 source/common/x86/intrapred8.asm     |  101 ++++-------
 source/common/x86/ipfilter16.asm     |   78 +++++++++
 source/common/x86/ipfilter8.asm      |  167 +++++++++++++++++++-
 source/common/x86/sad16-a.asm        |    1 +
 source/encoder/analysis.cpp          |  281 ++++++++++++++++++++++++----------
 source/encoder/analysis.h            |    4 +-
 source/encoder/encoder.cpp           |   12 +
 source/encoder/entropy.cpp           |    2 -
 source/encoder/entropy.h             |    2 +
 source/encoder/search.cpp            |   14 +-
 source/encoder/search.h              |   11 +-
 source/x265.cpp                      |   11 +-
 source/x265cli.h                     |    6 +-
 19 files changed, 631 insertions(+), 243 deletions(-)

diffs (truncated from 1819 to 300 lines):

diff -r 58309953273e -r 35dd4bea0bc7 doc/reST/api.rst

--- a/doc/reST/api.rst	Tue May 19 17:04:04 2015 +0530
+++ b/doc/reST/api.rst	Tue May 19 18:29:06 2015 -0700
@@ -455,7 +455,7 @@ it was compiled against.
 
 A number of validations must be performed on the returned API structure
 in order to determine if it is safe for use by your application. If you
-do not perform these checks, your application is liable to crash.
+do not perform these checks, your application is liable to crash::
 
 	if (api->api_major_version != X265_MAJOR_VERSION) /* do not use */
 	if (api->sizeof_param != sizeof(x265_param))      /* do not use */
diff -r 58309953273e -r 35dd4bea0bc7 doc/reST/cli.rst
--- a/doc/reST/cli.rst	Tue May 19 17:04:04 2015 +0530
+++ b/doc/reST/cli.rst	Tue May 19 18:29:06 2015 -0700
@@ -581,6 +581,30 @@ the prediction quad-tree.
 	be consistent for all of them since the encoder configures several
 	key global data structures based on this range.
 
+.. option:: --limit-refs <0|1|2|3>
+
+	When set to X265_REF_LIMIT_DEPTH (1) x265 will limit the references
+	analyzed at the current depth based on the references used to code
+	the 4 sub-blocks at the next depth.  For example, a 16x16 CU will
+	only use the references used to code its four 8x8 CUs.
+
+	When set to X265_REF_LIMIT_CU (2), the rectangular and asymmetrical
+	partitions will only use references selected by the 2Nx2N motion
+	search (including at the lowest depth which is otherwise unaffected
+	by the depth limit).
+
+	When set to 3 (X265_REF_LIMIT_DEPTH && X265_REF_LIMIT_CU), the 2Nx2N 
+	motion search at each depth will only use references from the split 
+	CUs and the rect/amp motion searches at that depth will only use the 
+	reference(s) selected by 2Nx2N. 
+
+	You can often increase the number of references you are using
+	(within your decoder level limits) if you enable one or
+	both of these flags.
+
+	This feature is EXPERIMENTAL and currently only functional at RD
+	levels 0 through 4
+
 .. option:: --rect, --no-rect
 
 	Enable analysis of rectangular motion partitions Nx2N and 2NxN
diff -r 58309953273e -r 35dd4bea0bc7 source/common/param.cpp
--- a/source/common/param.cpp	Tue May 19 17:04:04 2015 +0530
+++ b/source/common/param.cpp	Tue May 19 18:29:06 2015 -0700
@@ -151,6 +151,7 @@ void x265_param_default(x265_param* para
     param->subpelRefine = 2;
     param->searchRange = 57;
     param->maxNumMergeCand = 2;
+    param->limitReferences = 0;
     param->bEnableWeightedPred = 1;
     param->bEnableWeightedBiPred = 0;
     param->bEnableEarlySkip = 0;
@@ -430,8 +431,8 @@ int x265_param_default_preset(x265_param
             param->deblockingFilterBetaOffset = -2;
             param->deblockingFilterTCOffset = -2;
             param->bIntraInBFrames = 0;
-            param->rdoqLevel = 1;
-            param->psyRdoq = 30;
+            param->rdoqLevel = 0;
+            param->psyRdoq = 0;
             param->psyRd = 0.5;
             param->rc.ipFactor = 1.1;
             param->rc.pbFactor = 1.1;
@@ -641,6 +642,7 @@ int x265_param_parse(x265_param* p, cons
         }
     }
     OPT("ref") p->maxNumReferences = atoi(value);
+    OPT("limit-refs") p->limitReferences = atoi(value);
     OPT("weightp") p->bEnableWeightedPred = atobool(value);
     OPT("weightb") p->bEnableWeightedBiPred = atobool(value);
     OPT("cbqpoffs") p->cbQpOffset = atoi(value);
@@ -1026,6 +1028,8 @@ int x265_check_params(x265_param* param)
           "subme must be less than or equal to X265_MAX_SUBPEL_LEVEL (7)");
     CHECK(param->subpelRefine < 0,
           "subme must be greater than or equal to 0");
+    CHECK(param->limitReferences > 3,
+          "limitReferences must be 0, 1, 2 or 3");
     CHECK(param->frameNumThreads < 0 || param->frameNumThreads > X265_MAX_FRAME_THREADS,
           "frameNumThreads (--frame-threads) must be [0 .. X265_MAX_FRAME_THREADS)");
     CHECK(param->cbQpOffset < -12, "Min. Chroma Cb QP Offset is -12");
@@ -1277,6 +1281,8 @@ void x265_print_params(x265_param* param
     if (param->rc.aqMode)
         x265_log(param, X265_LOG_INFO, "AQ: mode / str / qg-size / cu-tree  : %d / %0.1f / %d / %d\n", param->rc.aqMode,
                  param->rc.aqStrength, param->rc.qgSize, param->rc.cuTree);
+    x265_log(param, X265_LOG_INFO, "References / ref-limit  cu / depth  : %d / %d / %d\n",
+             param->maxNumReferences, !!(param->limitReferences & X265_REF_LIMIT_CU), !!(param->limitReferences & X265_REF_LIMIT_DEPTH));
 
     if (param->bLossless)
         x265_log(param, X265_LOG_INFO, "Rate Control                        : Lossless\n");
@@ -1420,6 +1426,7 @@ char *x265_param2string(x265_param* p)
     s += sprintf(s, " bframe-bias=%d", p->bFrameBias);
     s += sprintf(s, " b-adapt=%d", p->bFrameAdaptive);
     s += sprintf(s, " ref=%d", p->maxNumReferences);
+    s += sprintf(s, " limit-refs=%d", p->limitReferences);
     BOOL(p->bEnableWeightedPred, "weightp");
     BOOL(p->bEnableWeightedBiPred, "weightb");
     s += sprintf(s, " aq-mode=%d", p->rc.aqMode);
diff -r 58309953273e -r 35dd4bea0bc7 source/common/x86/asm-primitives.cpp
--- a/source/common/x86/asm-primitives.cpp	Tue May 19 17:04:04 2015 +0530
+++ b/source/common/x86/asm-primitives.cpp	Tue May 19 18:29:06 2015 -0700
@@ -1359,6 +1359,7 @@ void setupAssemblyPrimitives(EncoderPrim
         p.pu[LUMA_32x24].sad_x4 = x265_pixel_sad_x4_32x24_avx2;
         p.pu[LUMA_32x32].sad_x4 = x265_pixel_sad_x4_32x32_avx2;
         p.pu[LUMA_32x64].sad_x4 = x265_pixel_sad_x4_32x64_avx2;
+        p.pu[LUMA_48x64].sad_x4 = x265_pixel_sad_x4_48x64_avx2;
         p.pu[LUMA_64x16].sad_x4 = x265_pixel_sad_x4_64x16_avx2;
         p.pu[LUMA_64x32].sad_x4 = x265_pixel_sad_x4_64x32_avx2;
         p.pu[LUMA_64x48].sad_x4 = x265_pixel_sad_x4_64x48_avx2;
@@ -1407,6 +1408,9 @@ void setupAssemblyPrimitives(EncoderPrim
         p.pu[LUMA_4x8].luma_hps = x265_interp_8tap_horiz_ps_4x8_avx2;
         p.pu[LUMA_4x16].luma_hps = x265_interp_8tap_horiz_ps_4x16_avx2;
 
+        p.pu[LUMA_4x4].luma_hpp = x265_interp_8tap_horiz_pp_4x4_avx2;
+        p.pu[LUMA_4x8].luma_hpp = x265_interp_8tap_horiz_pp_4x8_avx2;
+        p.pu[LUMA_4x16].luma_hpp = x265_interp_8tap_horiz_pp_4x16_avx2;
         p.pu[LUMA_8x4].luma_hpp = x265_interp_8tap_horiz_pp_8x4_avx2;
         p.pu[LUMA_8x8].luma_hpp = x265_interp_8tap_horiz_pp_8x8_avx2;
         p.pu[LUMA_8x16].luma_hpp = x265_interp_8tap_horiz_pp_8x16_avx2;
@@ -1524,6 +1528,7 @@ void setupAssemblyPrimitives(EncoderPrim
         p.chroma[X265_CSP_I422].pu[CHROMA_422_4x4].filter_vps = x265_interp_4tap_vert_ps_4x4_sse2;
         p.chroma[X265_CSP_I422].pu[CHROMA_422_4x8].filter_vps = x265_interp_4tap_vert_ps_4x8_sse2;
         p.chroma[X265_CSP_I422].pu[CHROMA_422_4x16].filter_vps = x265_interp_4tap_vert_ps_4x16_sse2;
+        p.chroma[X265_CSP_I422].pu[CHROMA_422_4x32].filter_vps = x265_interp_4tap_vert_ps_4x32_sse2;
         p.chroma[X265_CSP_I444].pu[LUMA_4x4].filter_vps = x265_interp_4tap_vert_ps_4x4_sse2;
         p.chroma[X265_CSP_I444].pu[LUMA_4x8].filter_vps = x265_interp_4tap_vert_ps_4x8_sse2;
         p.chroma[X265_CSP_I444].pu[LUMA_4x16].filter_vps = x265_interp_4tap_vert_ps_4x16_sse2;
@@ -2771,6 +2776,10 @@ void setupAssemblyPrimitives(EncoderPrim
         p.chroma[X265_CSP_I444].pu[LUMA_16x64].filter_vps = x265_interp_4tap_vert_ps_16x64_avx2;
         p.chroma[X265_CSP_I444].pu[LUMA_32x64].filter_vps = x265_interp_4tap_vert_ps_32x64_avx2;
         p.chroma[X265_CSP_I444].pu[LUMA_48x64].filter_vps = x265_interp_4tap_vert_ps_48x64_avx2;
+        p.chroma[X265_CSP_I444].pu[LUMA_64x64].filter_vps = x265_interp_4tap_vert_ps_64x64_avx2;
+        p.chroma[X265_CSP_I444].pu[LUMA_64x48].filter_vps = x265_interp_4tap_vert_ps_64x48_avx2;
+        p.chroma[X265_CSP_I444].pu[LUMA_64x32].filter_vps = x265_interp_4tap_vert_ps_64x32_avx2;
+        p.chroma[X265_CSP_I444].pu[LUMA_64x16].filter_vps = x265_interp_4tap_vert_ps_64x16_avx2;
 
         //i422 for chroma_vpp
         p.chroma[X265_CSP_I422].pu[CHROMA_422_4x8].filter_vpp = x265_interp_4tap_vert_pp_4x8_avx2;
@@ -2820,6 +2829,10 @@ void setupAssemblyPrimitives(EncoderPrim
         p.chroma[X265_CSP_I444].pu[LUMA_16x64].filter_vpp = x265_interp_4tap_vert_pp_16x64_avx2;
         p.chroma[X265_CSP_I444].pu[LUMA_32x64].filter_vpp = x265_interp_4tap_vert_pp_32x64_avx2;
         p.chroma[X265_CSP_I444].pu[LUMA_48x64].filter_vpp = x265_interp_4tap_vert_pp_48x64_avx2;
+        p.chroma[X265_CSP_I444].pu[LUMA_64x64].filter_vpp = x265_interp_4tap_vert_pp_64x64_avx2;
+        p.chroma[X265_CSP_I444].pu[LUMA_64x48].filter_vpp = x265_interp_4tap_vert_pp_64x48_avx2;
+        p.chroma[X265_CSP_I444].pu[LUMA_64x32].filter_vpp = x265_interp_4tap_vert_pp_64x32_avx2;
+        p.chroma[X265_CSP_I444].pu[LUMA_64x16].filter_vpp = x265_interp_4tap_vert_pp_64x16_avx2;
 
         if (cpuMask & X265_CPU_BMI2)
             p.scanPosLast = x265_scanPosLast_avx2_bmi2;
diff -r 58309953273e -r 35dd4bea0bc7 source/common/x86/const-a.asm
--- a/source/common/x86/const-a.asm	Tue May 19 17:04:04 2015 +0530
+++ b/source/common/x86/const-a.asm	Tue May 19 18:29:06 2015 -0700
@@ -63,6 +63,8 @@ const pb_000000000000000F,           db 
 
 const pw_1,                 times 16 dw 1
 const pw_2,                 times 16 dw 2
+const pw_3,                 times 16 dw 3
+const pw_7,                 times 16 dw 7
 const pw_m2,                times  8 dw -2
 const pw_4,                 times  8 dw 4
 const pw_8,                 times  8 dw 8
@@ -112,6 +114,7 @@ const pd_2,                 times  8 dd 
 const pd_4,                 times  4 dd 4
 const pd_8,                 times  4 dd 8
 const pd_16,                times  4 dd 16
+const pd_31,                times  4 dd 31
 const pd_32,                times  8 dd 32
 const pd_64,                times  4 dd 64
 const pd_128,               times  4 dd 128
diff -r 58309953273e -r 35dd4bea0bc7 source/common/x86/intrapred16.asm
--- a/source/common/x86/intrapred16.asm	Tue May 19 17:04:04 2015 +0530
+++ b/source/common/x86/intrapred16.asm	Tue May 19 18:29:06 2015 -0700
@@ -44,7 +44,6 @@ const shuf_mode32_18,       db 14, 15, 1
 const pw_punpcklwd,         db  0,  1,  2,  3,  2,  3,  4,  5,  4,  5,  6,  7,  6,  7,  8,  9
 const c_mode32_10_0,        db  0,  1,  0,  1,  0,  1,  0,  1,  0,  1,  0,  1,  0,  1,  0,  1
 
-const pw_unpackwdq, times 8 db 0,1
 const pw_ang8_12,   db 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 12, 13, 0, 1
 const pw_ang8_13,   db 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 14, 15, 8, 9, 0, 1
 const pw_ang8_14,   db 0, 0, 0, 0, 0, 0, 0, 0, 14, 15, 10, 11, 4, 5, 0, 1
@@ -58,16 +57,6 @@ const pw_ang16_16,   db 0, 0, 0, 0, 0, 0
 
 ;; (blkSize - 1 - x)
 pw_planar4_0:         dw 3,  2,  1,  0,  3,  2,  1,  0
-pw_planar4_1:         dw 3,  3,  3,  3,  3,  3,  3,  3
-pw_planar8_0:         dw 7,  6,  5,  4,  3,  2,  1,  0
-pw_planar8_1:         dw 7,  7,  7,  7,  7,  7,  7,  7
-pw_planar16_0:        dw 15, 14, 13, 12, 11, 10,  9, 8
-pw_planar16_1:        dw 15, 15, 15, 15, 15, 15, 15, 15
-pd_planar32_1:        dd 31, 31, 31, 31
-
-pw_planar32_1:        dw 31, 31, 31, 31, 31, 31, 31, 31
-pw_planar32_L:        dw 31, 30, 29, 28, 27, 26, 25, 24
-pw_planar32_H:        dw 23, 22, 21, 20, 19, 18, 17, 16
 
 const planar32_table
 %assign x 31
@@ -85,8 +74,11 @@ const planar32_table1
 
 SECTION .text
 
+cextern pb_01
 cextern pw_1
 cextern pw_2
+cextern pw_3
+cextern pw_7
 cextern pw_4
 cextern pw_8
 cextern pw_15
@@ -95,6 +87,7 @@ cextern pw_31
 cextern pw_32
 cextern pw_1023
 cextern pd_16
+cextern pd_31
 cextern pd_32
 cextern pw_4096
 cextern multiL
@@ -681,7 +674,7 @@ cglobal intra_pred_planar8, 3,3,5
     pshufd          m4, m4, 0               ; v_bottomLeft
 
     pmullw          m3, [multiL]            ; (x + 1) * topRight
-    pmullw          m0, m1, [pw_planar8_1]  ; (blkSize - 1 - y) * above[x]
+    pmullw          m0, m1, [pw_7]          ; (blkSize - 1 - y) * above[x]
     paddw           m3, [pw_8]
     paddw           m3, m4
     paddw           m3, m0
@@ -695,7 +688,7 @@ cglobal intra_pred_planar8, 3,3,5
     pshufhw         m1, m2, 0x55 * (%1 - 4)
     pshufd          m1, m1, 0xAA
 %endif
-    pmullw          m1, [pw_planar8_0]
+    pmullw          m1, [pw_planar16_mul + mmsize]
     paddw           m1, m3
     psraw           m1, 4
     movu            [r0], m1
@@ -733,8 +726,8 @@ cglobal intra_pred_planar16, 3,3,8
 
     pmullw          m4, m3, [multiH]            ; (x + 1) * topRight
     pmullw          m3, [multiL]                ; (x + 1) * topRight
-    pmullw          m1, m2, [pw_planar16_1]     ; (blkSize - 1 - y) * above[x]
-    pmullw          m5, m7, [pw_planar16_1]     ; (blkSize - 1 - y) * above[x]
+    pmullw          m1, m2, [pw_15]             ; (blkSize - 1 - y) * above[x]
+    pmullw          m5, m7, [pw_15]             ; (blkSize - 1 - y) * above[x]
     paddw           m4, [pw_16]
     paddw           m3, [pw_16]
     paddw           m4, m6
@@ -770,8 +763,8 @@ cglobal intra_pred_planar16, 3,3,8
     paddw           m4, m1
     lea             r0, [r0 + r1 * 2]
 %endif
-    pmullw          m0, m5, [pw_planar8_0]
-    pmullw          m5, [pw_planar16_0]
+    pmullw          m0, m5, [pw_planar16_mul + mmsize]
+    pmullw          m5, [pw_planar16_mul]
     paddw           m0, m4
     paddw           m5, m3
     psraw           m5, 5
@@ -827,7 +820,7 @@ cglobal intra_pred_planar32, 3,3,16
     mova            m9, m6
     mova            m10, m6
 
-    mova            m12, [pw_planar32_1]
+    mova            m12, [pw_31]
     movu            m4, [r2 + 2]
     psubw           m8, m4
     pmullw          m4, m12
@@ -848,10 +841,10 @@ cglobal intra_pred_planar32, 3,3,16
     pmullw          m5, m12
     paddw           m3, m5
 
-    mova            m12, [pw_planar32_L]
-    mova            m13, [pw_planar32_H]
-    mova            m14, [pw_planar16_0]
-    mova            m15, [pw_planar8_0]
+    mova            m12, [pw_planar32_mul]
+    mova            m13, [pw_planar32_mul + mmsize]
+    mova            m14, [pw_planar16_mul]
+    mova            m15, [pw_planar16_mul + mmsize]
     add             r1, r1
 
 %macro PROCESS 1
@@ -1596,7 +1589,7 @@ cglobal intra_pred_planar4, 3,3,5
     pshufd          m4, m4, 0xAA
 
     pmullw          m3, [multi_2Row]        ; (x + 1) * topRight
-    pmullw          m0, m1, [pw_planar4_1]  ; (blkSize - 1 - y) * above[x]
+    pmullw          m0, m1, [pw_3]          ; (blkSize - 1 - y) * above[x]
 
     paddw           m3, [pw_4]
     paddw           m3, m4
@@ -1934,7 +1927,7 @@ cglobal intra_pred_planar4, 3,3,5
     pshufd          m4, m4, 0xAA
 
     pmullw          m3, [multi_2Row]        ; (x + 1) * topRight
-    pmullw          m0, m1, [pw_planar4_1]  ; (blkSize - 1 - y) * above[x]
+    pmullw          m0, m1, [pw_3]          ; (blkSize - 1 - y) * above[x]