[x265-commits] [x265] merge: set zero MV correctly

Deepthi Nandakumar deepthi at multicorewareinc.com
Mon Mar 16 17:11:38 CET 2015


details:   http://hg.videolan.org/x265/rev/1adf818d7f7c
branches:  
changeset: 9738:1adf818d7f7c
user:      Deepthi Nandakumar <deepthi at multicorewareinc.com>
date:      Mon Mar 16 14:24:55 2015 +0530
description:
merge: set zero MV correctly
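The `cudata.cpp` hunk changes `candMvField[count][0].mv = 0` to `candMvField[count][0].mv.word = 0`, i.e. it zeroes the motion vector through its 32-bit `word` member. Assuming the two 16-bit components share storage with that word (the member name comes from the diff; the layout below is an assumption, not x265's definition), a single 32-bit store clears both x and y. A hypothetical stand-in that uses `memcpy` to stay well-defined in C++:

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Hypothetical motion vector whose x/y pair can also be handled as one
// 32-bit word. Not x265's MV; only illustrates the packed representation.
struct SketchMV
{
    int16_t x;
    int16_t y;

    // Read both components as one 32-bit value (memcpy avoids the undefined
    // behaviour of reading an inactive union member in C++).
    int32_t word() const
    {
        int32_t w;
        std::memcpy(&w, this, sizeof(w));
        return w;
    }

    // Overwrite both components with a single 32-bit store.
    void setWord(int32_t w) { std::memcpy(this, &w, sizeof(w)); }
};

static_assert(sizeof(SketchMV) == sizeof(int32_t), "x and y must pack into 32 bits");
```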
Subject: [x265] merge: check merge reference indices

details:   http://hg.videolan.org/x265/rev/5c845d111d92
branches:  
changeset: 9739:5c845d111d92
user:      Deepthi Nandakumar <deepthi at multicorewareinc.com>
date:      Mon Mar 16 14:29:14 2015 +0530
description:
merge: check merge reference indices
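For context on what "merge reference indices" must satisfy: HEVC pads any remaining merge-list slots with zero-MV candidates whose refIdx counts up from 0 while it stays below the number of usable references (L0 count for P slices, the smaller of the L0/L1 counts for B slices), then falls back to 0, so every padded refIdx is in range. The sketch follows that spec rule rather than x265's `getInterMergeCandidates`; `SketchCand` and `padZeroMergeCands` are hypothetical names.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Hypothetical merge candidate: one MV and one refIdx per reference list.
struct SketchCand
{
    int mvx[2] = {0, 0};
    int mvy[2] = {0, 0};
    int refIdx[2] = {-1, -1};   // -1 marks an unused list
};

// Fill the list up to maxNumMergeCand with zero-MV candidates. refIdx runs
// 0, 1, ... while below numRefIdx and is 0 afterwards, so it never exceeds
// the slice's active reference count.
void padZeroMergeCands(std::vector<SketchCand>& cands, int maxNumMergeCand,
                       bool isB, int numRefIdxL0, int numRefIdxL1)
{
    const int numRefIdx = isB ? std::min(numRefIdxL0, numRefIdxL1) : numRefIdxL0;
    for (int zeroIdx = 0; (int)cands.size() < maxNumMergeCand; ++zeroIdx)
    {
        const int r = zeroIdx < numRefIdx ? zeroIdx : 0;
        SketchCand c;               // MVs default to (0, 0)
        c.refIdx[0] = r;
        if (isB)
            c.refIdx[1] = r;        // B slices use both lists
        cands.push_back(c);
    }
}
```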
Subject: [x265] predict: check for out of bounds ref indices before weight tables are used

details:   http://hg.videolan.org/x265/rev/9791a3bb74cf
branches:  
changeset: 9740:9791a3bb74cf
user:      Deepthi Nandakumar <deepthi at multicorewareinc.com>
date:      Mon Mar 16 16:03:40 2015 +0530
description:
predict: check for out of bounds ref indices before weight tables are used
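The `predict.cpp` hunk moves the `refIdx0`/`refIdx1` range checks ahead of the `m_weightPredTable[list][refIdx]` lookups, so an out-of-range index is caught before it is used as a table subscript. A minimal sketch of the same ordering with hypothetical types; where the diff asserts with `X265_CHECK`, this version returns `nullptr` so the effect is easy to test.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical stand-ins; not x265's WeightParam or Slice.
struct SketchWeight { int w, o, shift; bool present; };

struct SketchSlice
{
    static const int MAX_REFS = 16;
    int numRefIdx[2];                           // active refs per list
    SketchWeight weightTable[2][MAX_REFS];
};

// Validate refIdx against the active reference count *before* indexing the
// weight table. -1 (list unused) and out-of-range values both yield nullptr.
const SketchWeight* lookupWeight(const SketchSlice& s, int list, int refIdx)
{
    if (refIdx < 0 || refIdx >= s.numRefIdx[list])
        return nullptr;
    return &s.weightTable[list][refIdx];
}
```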
Subject: [x265] asm: filter_vsp[4x8], filter_vss[4x8] in avx2: 673c->339c, 608c->263c

details:   http://hg.videolan.org/x265/rev/8caa5bdcaf25
branches:  
changeset: 9741:8caa5bdcaf25
user:      Divya Manivannan <divya at multicorewareinc.com>
date:      Mon Mar 16 10:10:31 2015 +0530
description:
asm: filter_vsp[4x8], filter_vss[4x8] in avx2: 673c->339c, 608c->263c
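`filter_vsp` and `filter_vss` (this commit and the three that follow) apply HEVC's 4-tap chroma interpolation vertically to 16-bit intermediate samples: `vsp` converts back to 8-bit pixels, `vss` keeps 16-bit output. A scalar sketch of what the AVX2 kernels compute, assuming an 8-bit build; the constant names and the rounding offset are assumptions modeled on typical HEVC intermediate-precision handling, and the function names are hypothetical.

```cpp
#include <cassert>
#include <cstdint>

// HEVC 4-tap chroma interpolation filters, one row per 1/8-sample phase.
static const int16_t kChromaFilter[8][4] = {
    {  0, 64,  0,  0 }, { -2, 58, 10, -2 }, { -4, 54, 16, -2 }, { -6, 46, 28, -4 },
    { -4, 36, 36, -4 }, { -4, 28, 46, -6 }, { -2, 16, 54, -4 }, { -2, 10, 58, -2 }
};

// Assumed constants for an 8-bit build (illustrative names).
const int kFilterPrec   = 6;        // taps sum to 64 = 1 << 6
const int kInternalPrec = 14;       // precision of the int16 intermediates
const int kInternalOffs = 1 << 13;  // bias removed from intermediates
const int kBitDepth     = 8;

// "sp": int16 intermediates in, clipped 8-bit pixels out.
void vertSP(const int16_t* src, intptr_t srcStride, uint8_t* dst, intptr_t dstStride,
            int width, int height, int coeffIdx)
{
    const int16_t* c = kChromaFilter[coeffIdx];
    const int shift = kFilterPrec + (kInternalPrec - kBitDepth);   // 12
    const int offset = (1 << (shift - 1)) + (kInternalOffs << kFilterPrec);
    src -= srcStride;                        // taps cover rows -1, 0, +1, +2
    for (int y = 0; y < height; y++)
        for (int x = 0; x < width; x++)
        {
            int sum = 0;
            for (int t = 0; t < 4; t++)
                sum += c[t] * src[(y + t) * srcStride + x];
            const int v = (sum + offset) >> shift;
            dst[y * dstStride + x] = (uint8_t)(v < 0 ? 0 : (v > 255 ? 255 : v));
        }
}

// "ss": int16 intermediates in, int16 intermediates out (no bias, no clip).
void vertSS(const int16_t* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride,
            int width, int height, int coeffIdx)
{
    const int16_t* c = kChromaFilter[coeffIdx];
    src -= srcStride;
    for (int y = 0; y < height; y++)
        for (int x = 0; x < width; x++)
        {
            int sum = 0;
            for (int t = 0; t < 4; t++)
                sum += c[t] * src[(y + t) * srcStride + x];
            dst[y * dstStride + x] = (int16_t)(sum >> kFilterPrec);
        }
}
```

The SIMD versions compute the same sums for several columns at once; the 4x8, 4x16, 4x2 and 2x4 variants differ only in block shape.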
Subject: [x265] asm: filter_vsp[4x16], filter_vss[4x16] in avx2: 985c->599c, 877c->491c

details:   http://hg.videolan.org/x265/rev/d69d56a2fdb3
branches:  
changeset: 9742:d69d56a2fdb3
user:      Divya Manivannan <divya at multicorewareinc.com>
date:      Mon Mar 16 10:14:19 2015 +0530
description:
asm: filter_vsp[4x16], filter_vss[4x16] in avx2: 985c->599c, 877c->491c
Subject: [x265] asm: filter_vsp[4x2], filter_vss[4x2] in avx2: 237c->137c, 206c->118c

details:   http://hg.videolan.org/x265/rev/961c52b43700
branches:  
changeset: 9743:961c52b43700
user:      Divya Manivannan <divya at multicorewareinc.com>
date:      Mon Mar 16 10:17:10 2015 +0530
description:
asm: filter_vsp[4x2], filter_vss[4x2] in avx2: 237c->137c, 206c->118c
Subject: [x265] asm: filter_vsp[2x4], filter_vss[2x4] in avx2: 292c->189c, 248c->184c

details:   http://hg.videolan.org/x265/rev/f5747f4a22e4
branches:  
changeset: 9744:f5747f4a22e4
user:      Divya Manivannan <divya at multicorewareinc.com>
date:      Mon Mar 16 10:21:43 2015 +0530
description:
asm: filter_vsp[2x4], filter_vss[2x4] in avx2: 292c->189c, 248c->184c
Subject: [x265] asm-intra_pred_ang16_28: improved, 865.16c -> 456.44c

details:   http://hg.videolan.org/x265/rev/3c8f7b661c16
branches:  
changeset: 9745:3c8f7b661c16
user:      Praveen Tiwari <praveen at multicorewareinc.com>
date:      Fri Mar 13 14:02:25 2015 +0530
description:
asm-intra_pred_ang16_28: improved, 865.16c -> 456.44c

AVX2:
primitive               speedup  asm cycles      C cycles
intra_ang_16x16[28]     19.82x   456.44          9047.79

SSE4:
primitive               speedup  asm cycles      C cycles
intra_ang_16x16[28]     10.36x   865.16          8962.21
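HEVC angular prediction for the vertical modes walks along the reference row above the block by `intraPredAngle` in 1/32-sample steps per row; modes 26, 27, 28 and 29 use angles 0, 2, 5 and 9. A scalar sketch of that computation for non-negative angles, without the boundary smoothing HEVC applies in some modes; `rowFract` and `predAngVerticalPos` are hypothetical names. For angle 5, the per-row fractions reproduce the second byte of each `(32 - fract, fract)` pair in the `c_ang16_mode_28` table in the diff.

```cpp
#include <cassert>
#include <cstdint>

// Fractional position (1/32 sample) used by row y for a given angle.
int rowFract(int angle, int y)
{
    return ((y + 1) * angle) & 31;
}

// Hypothetical scalar model of vertical angular prediction, angle in 0..32.
// `above` points at the reference sample directly above column 0 and must
// hold at least 2 * size samples.
void predAngVerticalPos(const uint8_t* above, uint8_t* dst, intptr_t dstStride,
                        int size, int angle)
{
    assert(angle >= 0 && angle <= 32);
    for (int y = 0; y < size; y++)
    {
        const int idx = ((y + 1) * angle) >> 5;   // whole-sample offset
        const int fract = rowFract(angle, y);
        for (int x = 0; x < size; x++)
        {
            const int a = above[x + idx];
            // With fract == 0 the second sample has no weight; skip the read.
            const int b = fract ? above[x + idx + 1] : 0;
            dst[y * dstStride + x] = (uint8_t)(((32 - fract) * a + fract * b + 16) >> 5);
        }
    }
}
```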
Subject: [x265] asm-intrapred8.asm: rename macro 'INTRA_PRED_ANG16_25' to 'INTRA_PRED_ANG16_MC1'

details:   http://hg.videolan.org/x265/rev/6ecedcdb3aae
branches:  
changeset: 9746:6ecedcdb3aae
user:      Praveen Tiwari <praveen at multicorewareinc.com>
date:      Fri Mar 13 14:22:51 2015 +0530
description:
asm-intrapred8.asm: rename macro 'INTRA_PRED_ANG16_25' to 'INTRA_PRED_ANG16_MC1'

Give the macro a generic name so it can be reused by the asm for other intra_pred_ang16 modes.
Subject: [x265] asm-intrapred8.asm: use macro 'INTRA_PRED_ANG16_MC1' to shorten asm code length

details:   http://hg.videolan.org/x265/rev/34626be62712
branches:  
changeset: 9747:34626be62712
user:      Praveen Tiwari <praveen at multicorewareinc.com>
date:      Fri Mar 13 14:28:43 2015 +0530
description:
asm-intrapred8.asm: use macro 'INTRA_PRED_ANG16_MC1' to shorten asm code length
Subject: [x265] asm-intra_pred_ang16_27: improved, 645.84c -> 415.14c over SSE4 asm code

details:   http://hg.videolan.org/x265/rev/b366877f0a59
branches:  
changeset: 9748:b366877f0a59
user:      Praveen Tiwari <praveen at multicorewareinc.com>
date:      Fri Mar 13 18:17:38 2015 +0530
description:
asm-intra_pred_ang16_27: improved, 645.84c -> 415.14c over SSE4 asm code

AVX2:
primitive               speedup  asm cycles      C cycles
intra_ang_16x16[27]     21.17x   415.14          8789.72

SSE4:
primitive               speedup  asm cycles      C cycles
intra_ang_16x16[27]     13.68x   645.84          8833.49
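The speedup column is the C reference's cycle count divided by the asm cycle count (for mode 27 with AVX2, 8789.72 / 415.14 ≈ 21.17), while the subject's "over SSE4" figure compares the two asm versions directly: 645.84 / 415.14 ≈ 1.56x. The small helpers below (hypothetical names) only make that arithmetic explicit.

```cpp
#include <cassert>
#include <cmath>

// Speedup column: C reference cycles divided by asm cycles.
double speedupVsC(double cCycles, double asmCycles)
{
    return cCycles / asmCycles;
}

// Gain of a newer asm version over an older one, e.g. AVX2 over SSE4.
double gainOver(double oldCycles, double newCycles)
{
    return oldCycles / newCycles;
}

// Round to two decimals, as the benchmark output prints its ratios.
double roundTo2(double v)
{
    return std::round(v * 100.0) / 100.0;
}
```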
Subject: [x265] asm-intra_pred_ang16_29: improved, 866.95c -> 493.20c over SSE4 asm code

details:   http://hg.videolan.org/x265/rev/3ed5e5416686
branches:  
changeset: 9749:3ed5e5416686
user:      Praveen Tiwari <praveen at multicorewareinc.com>
date:      Fri Mar 13 19:40:19 2015 +0530
description:
asm-intra_pred_ang16_29: improved, 866.95c -> 493.20c over SSE4 asm code

AVX2:
primitive               speedup  asm cycles      C cycles
intra_ang_16x16[29]     18.72x   493.20          9231.12

SSE4:
primitive               speedup  asm cycles      C cycles
intra_ang_16x16[29]     10.46x   866.95          9072.53
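The AVX2 kernels in the diff multiply reference pixels against the `(32 - fract, fract)` byte pairs of the `c_ang16_mode_*` tables with `pmaddubsw`, then apply `pmulhrsw` with `pw_1024`. Assuming `pw_1024` holds 1024 in every 16-bit lane, as its name suggests, that multiply is a rounding shift: `pmulhrsw(a, 1024)` equals `(a + 16) >> 5`, which is the rounding HEVC's angular interpolation uses. A scalar lane model (function names hypothetical) that checks the equivalence:

```cpp
#include <cassert>
#include <cstdint>

// One 16-bit lane of pmaddubsw: unsigned bytes times signed bytes, the two
// products summed with signed 16-bit saturation.
int16_t pmaddubswLane(uint8_t u0, uint8_t u1, int8_t s0, int8_t s1)
{
    int32_t sum = u0 * s0 + u1 * s1;
    if (sum > INT16_MAX) sum = INT16_MAX;
    if (sum < INT16_MIN) sum = INT16_MIN;
    return (int16_t)sum;
}

// One 16-bit lane of pmulhrsw: (((a * b) >> 14) + 1) >> 1.
int16_t pmulhrswLane(int16_t a, int16_t b)
{
    const int32_t p = (int32_t)a * b;
    return (int16_t)(((p >> 14) + 1) >> 1);
}

// One predicted sample computed the way the SIMD loop does it.
uint8_t angSampleSimd(uint8_t p0, uint8_t p1, int fract)
{
    const int16_t m = pmaddubswLane(p0, p1, (int8_t)(32 - fract), (int8_t)fract);
    const int16_t r = pmulhrswLane(m, 1024);
    return (uint8_t)(r < 0 ? 0 : (r > 255 ? 255 : r));   // packuswb-style clamp
}
```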
Subject: [x265] asm-intra_pred_ang16_25: reduce const table address size

details:   http://hg.videolan.org/x265/rev/d1154c2989fa
branches:  
changeset: 9750:d1154c2989fa
user:      Praveen Tiwari <praveen at multicorewareinc.com>
date:      Fri Mar 13 19:47:16 2015 +0530
description:
asm-intra_pred_ang16_25: reduce const table address size
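One plausible reading of "reduce const table address size": x86 encodes the displacement of a `[r4 + disp]` operand in one byte only when it lies in -128..127. With 32-byte (`mmsize`) table rows, rows 4..7 of `c_ang16_mode_25` sit at offsets 128..224 and need four-byte displacements, so the diff advances `r4` by `4 * mmsize` and reuses row arguments 0 and 2. The sketch counts displacement bytes assuming each row is referenced once as `[r4 + row * mmsize]`; it ignores the bytes of the added `add` instruction, and the helper names are hypothetical.

```cpp
#include <cassert>

// Displacement bytes for a [base + disp] operand, ignoring special cases
// (rbp/r13 bases always need at least a disp8; rsp/r12 bases need a SIB
// byte): none for 0, one for a signed 8-bit value, otherwise four.
int dispBytes(int disp)
{
    if (disp == 0)
        return 0;
    return (disp >= -128 && disp <= 127) ? 1 : 4;
}

// Total displacement bytes for `rows` table rows of `rowBytes` each, with the
// base pointer rebased every `rebaseEvery` rows (0 = never rebase).
int tableDispBytes(int rows, int rowBytes, int rebaseEvery)
{
    int total = 0;
    for (int k = 0; k < rows; k++)
    {
        const int row = rebaseEvery ? k % rebaseEvery : k;
        total += dispBytes(row * rowBytes);
    }
    return total;
}
```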
Subject: [x265] asm-intrapred8.asm: replace 'lea' instruction with faster 'add' instruction

details:   http://hg.videolan.org/x265/rev/058b38bf8e85
branches:  
changeset: 9751:058b38bf8e85
user:      Praveen Tiwari <praveen at multicorewareinc.com>
date:      Fri Mar 13 19:51:56 2015 +0530
description:
asm-intrapred8.asm: replace 'lea' instruction with faster 'add' instruction
Subject: [x265] asm: chroma_hpp[4x2] for i420 avx2 - improved 138c->134c

details:   http://hg.videolan.org/x265/rev/c6bb6ec54973
branches:  
changeset: 9752:c6bb6ec54973
user:      Aasaipriya Chandran <aasaipriya at multicorewareinc.com>
date:      Mon Mar 16 10:32:43 2015 +0530
description:
asm: chroma_hpp[4x2] for i420 avx2 - improved 138c->134c
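`chroma_hpp` is the horizontal 4-tap chroma interpolation with 8-bit pixels in and out ("pp"), which this and the following chroma_hpp commits vectorize for several block sizes. A scalar sketch of what those kernels compute, assuming HEVC's chroma filter taps and an 8-bit build; `horizPP` is a hypothetical name.

```cpp
#include <cassert>
#include <cstdint>

// HEVC 4-tap chroma interpolation filters, one row per 1/8-sample phase.
static const int16_t kChromaFilter[8][4] = {
    {  0, 64,  0,  0 }, { -2, 58, 10, -2 }, { -4, 54, 16, -2 }, { -6, 46, 28, -4 },
    { -4, 36, 36, -4 }, { -4, 28, 46, -6 }, { -2, 16, 54, -4 }, { -2, 10, 58, -2 }
};

// Scalar horizontal "pp" filter: 8-bit pixels in, clipped 8-bit pixels out.
void horizPP(const uint8_t* src, intptr_t srcStride, uint8_t* dst, intptr_t dstStride,
             int width, int height, int coeffIdx)
{
    const int16_t* c = kChromaFilter[coeffIdx];
    src -= 1;                                  // taps cover columns -1..+2
    for (int y = 0; y < height; y++)
        for (int x = 0; x < width; x++)
        {
            int sum = 0;
            for (int t = 0; t < 4; t++)
                sum += c[t] * src[y * srcStride + x + t];
            const int v = (sum + 32) >> 6;     // taps sum to 64: round, divide
            dst[y * dstStride + x] = (uint8_t)(v < 0 ? 0 : (v > 255 ? 255 : v));
        }
}
```

The clip matters because the negative outer taps can overshoot on sharp edges, as the test's bump and dip cases show.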
Subject: [x265] asm: chroma_hpp[32x16, 32x24, 32x8] for i420 - improved 2966c->1647c, 4514c->2627c, 1494c->870c

details:   http://hg.videolan.org/x265/rev/7a115f22ee36
branches:  
changeset: 9753:7a115f22ee36
user:      Aasaipriya Chandran <aasaipriya at multicorewareinc.com>
date:      Mon Mar 16 10:36:29 2015 +0530
description:
asm: chroma_hpp[32x16, 32x24, 32x8] for i420 - improved 2966c->1647c, 4514c->2627c, 1494c->870c
Subject: [x265] asm: avx2 code for sad[32x32] for 8bpp

details:   http://hg.videolan.org/x265/rev/720154e24059
branches:  
changeset: 9754:720154e24059
user:      Sumalatha Polureddy <sumalatha at multicorewareinc.com>
date:      Mon Mar 16 10:43:13 2015 +0530
description:
asm: avx2 code for sad[32x32] for 8bpp

primitive   speedup  asm cycles      C cycles
sad[32x32]  44.69x   494.86          22114.45
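`sad[32x32]` is the sum of absolute differences between two 32x32 blocks of 8-bit pixels, the basic motion-search cost; the AVX2 kernel computes the same total with wide byte-difference instructions. A scalar sketch for reference (`sadBlock` and its signature are hypothetical). On the reported figures, 22114.45 / 494.86 ≈ 44.69, the speedup shown.

```cpp
#include <cassert>
#include <cstdint>
#include <cstdlib>

// Scalar sum of absolute differences over a width x height block.
int sadBlock(const uint8_t* a, intptr_t strideA, const uint8_t* b, intptr_t strideB,
             int width, int height)
{
    int sum = 0;
    for (int y = 0; y < height; y++)
        for (int x = 0; x < width; x++)
            sum += std::abs(a[y * strideA + x] - b[y * strideB + x]);  // int arithmetic
    return sum;
}
```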
Subject: [x265] asm: chroma_hpp[8x4, 8x16, 8x32] for i420 avx2 - improved 289c->220c, 928c->618c, 1802c->1128c

details:   http://hg.videolan.org/x265/rev/dd113f0bf713
branches:  
changeset: 9755:dd113f0bf713
user:      Aasaipriya Chandran <aasaipriya at multicorewareinc.com>
date:      Mon Mar 16 10:43:23 2015 +0530
description:
asm: chroma_hpp[8x4, 8x16, 8x32] for i420 avx2 - improved 289c->220c, 928c->618c, 1802c->1128c
Subject: [x265] asm: chroma_hpp[4x8, 4x16] for i420 avx2 - improved 346c->322c, 610c->586c

details:   http://hg.videolan.org/x265/rev/74496ce5d8ba
branches:  
changeset: 9756:74496ce5d8ba
user:      Aasaipriya Chandran <aasaipriya at multicorewareinc.com>
date:      Mon Mar 16 10:47:09 2015 +0530
description:
asm: chroma_hpp[4x8, 4x16] for i420 avx2 - improved 346c->322c, 610c->586c

diffstat:

 source/common/cudata.cpp             |    2 +-
 source/common/predict.cpp            |   11 +-
 source/common/x86/asm-primitives.cpp |   25 +
 source/common/x86/intrapred.h        |    3 +
 source/common/x86/intrapred8.asm     |  195 +++++++++-
 source/common/x86/ipfilter8.asm      |  670 +++++++++++++++++++++++++++++++++++
 source/common/x86/ipfilter8.h        |    1 +
 source/common/x86/sad-a.asm          |   29 +
 source/encoder/analysis.cpp          |    1 +
 9 files changed, 922 insertions(+), 15 deletions(-)

diffs (truncated from 1103 to 300 lines):

diff -r 6461985f33ac -r 74496ce5d8ba source/common/cudata.cpp
--- a/source/common/cudata.cpp	Sun Mar 15 11:58:32 2015 -0500
+++ b/source/common/cudata.cpp	Mon Mar 16 10:47:09 2015 +0530
@@ -1608,7 +1608,7 @@ uint32_t CUData::getInterMergeCandidates
     while (count < maxNumMergeCand)
     {
         candDir[count] = 1;
-        candMvField[count][0].mv = 0;
+        candMvField[count][0].mv.word = 0;
         candMvField[count][0].refIdx = r;
 
         if (isInterB)
diff -r 6461985f33ac -r 74496ce5d8ba source/common/predict.cpp
--- a/source/common/predict.cpp	Sun Mar 15 11:58:32 2015 -0500
+++ b/source/common/predict.cpp	Mon Mar 16 10:47:09 2015 +0530
@@ -130,6 +130,9 @@ void Predict::motionCompensation(const C
         WeightValues wv0[3], wv1[3];
         const WeightParam *pwp0, *pwp1;
 
+        X265_CHECK(refIdx0 < cu.m_slice->m_numRefIdx[0], "bidir refidx0 out of range\n");
+        X265_CHECK(refIdx1 < cu.m_slice->m_numRefIdx[1], "bidir refidx1 out of range\n");
+
         if (cu.m_slice->m_pps->bUseWeightedBiPred)
         {
             pwp0 = refIdx0 >= 0 ? cu.m_slice->m_weightPredTable[0][refIdx0] : NULL;
@@ -174,10 +177,6 @@ void Predict::motionCompensation(const C
             cu.clipMv(mv0);
             cu.clipMv(mv1);
 
-            /* Biprediction */
-            X265_CHECK(refIdx0 < cu.m_slice->m_numRefIdx[0], "bidir refidx0 out of range\n");
-            X265_CHECK(refIdx1 < cu.m_slice->m_numRefIdx[1], "bidir refidx1 out of range\n");
-
             if (bLuma)
             {
                 predInterLumaShort(pu, m_predShortYuv[0], *cu.m_slice->m_refPicList[0][refIdx0]->m_reconPic, mv0);
@@ -199,9 +198,6 @@ void Predict::motionCompensation(const C
             MV mv0 = cu.m_mv[0][pu.puAbsPartIdx];
             cu.clipMv(mv0);
 
-            /* uniprediction to L0 */
-            X265_CHECK(refIdx0 < cu.m_slice->m_numRefIdx[0], "unidir refidx0 out of range\n");
-
             if (pwp0 && pwp0->bPresentFlag)
             {
                 ShortYuv& shortYuv = m_predShortYuv[0];
@@ -228,7 +224,6 @@ void Predict::motionCompensation(const C
 
             /* uniprediction to L1 */
             X265_CHECK(refIdx1 >= 0, "refidx1 was not positive\n");
-            X265_CHECK(refIdx1 < cu.m_slice->m_numRefIdx[1], "unidir refidx1 out of range\n");
 
             if (pwp1 && pwp1->bPresentFlag)
             {
diff -r 6461985f33ac -r 74496ce5d8ba source/common/x86/asm-primitives.cpp
--- a/source/common/x86/asm-primitives.cpp	Sun Mar 15 11:58:32 2015 -0500
+++ b/source/common/x86/asm-primitives.cpp	Mon Mar 16 10:47:09 2015 +0530
@@ -1444,6 +1444,8 @@ void setupAssemblyPrimitives(EncoderPrim
         p.pu[LUMA_8x16].satd  = x265_pixel_satd_8x16_avx2;
         p.pu[LUMA_8x8].satd   = x265_pixel_satd_8x8_avx2;
 
+        p.pu[LUMA_32x32].sad = x265_pixel_sad_32x32_avx2;
+
         p.pu[LUMA_8x4].sad_x3 = x265_pixel_sad_x3_8x4_avx2;
         p.pu[LUMA_8x8].sad_x3 = x265_pixel_sad_x3_8x8_avx2;
         p.pu[LUMA_8x16].sad_x3 = x265_pixel_sad_x3_8x16_avx2;
@@ -1507,6 +1509,9 @@ void setupAssemblyPrimitives(EncoderPrim
         p.cu[BLOCK_8x8].intra_pred[24] = x265_intra_pred_ang8_24_avx2;
         p.cu[BLOCK_8x8].intra_pred[11] = x265_intra_pred_ang8_11_avx2;
         p.cu[BLOCK_16x16].intra_pred[25] = x265_intra_pred_ang16_25_avx2;
+        p.cu[BLOCK_16x16].intra_pred[28] = x265_intra_pred_ang16_28_avx2;
+        p.cu[BLOCK_16x16].intra_pred[27] = x265_intra_pred_ang16_27_avx2;
+        p.cu[BLOCK_16x16].intra_pred[29] = x265_intra_pred_ang16_29_avx2;
 
         // copy_sp primitives
         p.cu[BLOCK_16x16].copy_sp = x265_blockcopy_sp_16x16_avx2;
@@ -1581,6 +1586,18 @@ void setupAssemblyPrimitives(EncoderPrim
         p.chroma[X265_CSP_I420].pu[CHROMA_420_32x32].filter_hpp = x265_interp_4tap_horiz_pp_32x32_avx2;
         p.chroma[X265_CSP_I420].pu[CHROMA_420_16x16].filter_hpp = x265_interp_4tap_horiz_pp_16x16_avx2;
 
+        p.chroma[X265_CSP_I420].pu[CHROMA_420_4x2].filter_hpp = x265_interp_4tap_horiz_pp_4x2_avx2;
+        p.chroma[X265_CSP_I420].pu[CHROMA_420_4x8].filter_hpp = x265_interp_4tap_horiz_pp_4x8_avx2;
+        p.chroma[X265_CSP_I420].pu[CHROMA_420_4x16].filter_hpp = x265_interp_4tap_horiz_pp_4x16_avx2;
+
+        p.chroma[X265_CSP_I420].pu[CHROMA_420_32x16].filter_hpp = x265_interp_4tap_horiz_pp_32x16_avx2;
+        p.chroma[X265_CSP_I420].pu[CHROMA_420_32x24].filter_hpp = x265_interp_4tap_horiz_pp_32x24_avx2;
+        p.chroma[X265_CSP_I420].pu[CHROMA_420_32x8].filter_hpp = x265_interp_4tap_horiz_pp_32x8_avx2;
+
+        p.chroma[X265_CSP_I420].pu[CHROMA_420_8x4].filter_hpp = x265_interp_4tap_horiz_pp_8x4_avx2;
+        p.chroma[X265_CSP_I420].pu[CHROMA_420_8x16].filter_hpp = x265_interp_4tap_horiz_pp_8x16_avx2;
+        p.chroma[X265_CSP_I420].pu[CHROMA_420_8x32].filter_hpp = x265_interp_4tap_horiz_pp_8x32_avx2;
+
         p.chroma[X265_CSP_I420].pu[CHROMA_420_32x32].filter_hps = x265_interp_4tap_horiz_ps_32x32_avx2;
         p.chroma[X265_CSP_I420].pu[CHROMA_420_16x16].filter_hps = x265_interp_4tap_horiz_ps_16x16_avx2;
         p.chroma[X265_CSP_I420].pu[CHROMA_420_4x4].filter_hps = x265_interp_4tap_horiz_ps_4x4_avx2;
@@ -1653,6 +1670,10 @@ void setupAssemblyPrimitives(EncoderPrim
         p.chroma[X265_CSP_I420].pu[CHROMA_420_8x8].filter_vsp = x265_interp_4tap_vert_sp_8x8_avx2;
         p.chroma[X265_CSP_I420].pu[CHROMA_420_16x16].filter_vsp = x265_interp_4tap_vert_sp_16x16_avx2;
         p.chroma[X265_CSP_I420].pu[CHROMA_420_32x32].filter_vsp = x265_interp_4tap_vert_sp_32x32_avx2;
+        p.chroma[X265_CSP_I420].pu[CHROMA_420_2x4].filter_vsp = x265_interp_4tap_vert_sp_2x4_avx2;
+        p.chroma[X265_CSP_I420].pu[CHROMA_420_4x2].filter_vsp = x265_interp_4tap_vert_sp_4x2_avx2;
+        p.chroma[X265_CSP_I420].pu[CHROMA_420_4x8].filter_vsp = x265_interp_4tap_vert_sp_4x8_avx2;
+        p.chroma[X265_CSP_I420].pu[CHROMA_420_4x16].filter_vsp = x265_interp_4tap_vert_sp_4x16_avx2;
         p.chroma[X265_CSP_I420].pu[CHROMA_420_16x32].filter_vsp = x265_interp_4tap_vert_sp_16x32_avx2;
         p.chroma[X265_CSP_I420].pu[CHROMA_420_24x32].filter_vsp = x265_interp_4tap_vert_sp_24x32_avx2;
         p.chroma[X265_CSP_I420].pu[CHROMA_420_32x16].filter_vsp = x265_interp_4tap_vert_sp_32x16_avx2;
@@ -1661,6 +1682,10 @@ void setupAssemblyPrimitives(EncoderPrim
         p.chroma[X265_CSP_I420].pu[CHROMA_420_8x8].filter_vss = x265_interp_4tap_vert_ss_8x8_avx2;
         p.chroma[X265_CSP_I420].pu[CHROMA_420_16x16].filter_vss = x265_interp_4tap_vert_ss_16x16_avx2;
         p.chroma[X265_CSP_I420].pu[CHROMA_420_32x32].filter_vss = x265_interp_4tap_vert_ss_32x32_avx2;
+        p.chroma[X265_CSP_I420].pu[CHROMA_420_2x4].filter_vss = x265_interp_4tap_vert_ss_2x4_avx2;
+        p.chroma[X265_CSP_I420].pu[CHROMA_420_4x2].filter_vss = x265_interp_4tap_vert_ss_4x2_avx2;
+        p.chroma[X265_CSP_I420].pu[CHROMA_420_4x8].filter_vss = x265_interp_4tap_vert_ss_4x8_avx2;
+        p.chroma[X265_CSP_I420].pu[CHROMA_420_4x16].filter_vss = x265_interp_4tap_vert_ss_4x16_avx2;
         p.chroma[X265_CSP_I420].pu[CHROMA_420_16x32].filter_vss = x265_interp_4tap_vert_ss_16x32_avx2;
         p.chroma[X265_CSP_I420].pu[CHROMA_420_24x32].filter_vss = x265_interp_4tap_vert_ss_24x32_avx2;
         p.chroma[X265_CSP_I420].pu[CHROMA_420_32x16].filter_vss = x265_interp_4tap_vert_ss_32x16_avx2;
diff -r 6461985f33ac -r 74496ce5d8ba source/common/x86/intrapred.h
--- a/source/common/x86/intrapred.h	Sun Mar 15 11:58:32 2015 -0500
+++ b/source/common/x86/intrapred.h	Mon Mar 16 10:47:09 2015 +0530
@@ -184,6 +184,9 @@ void x265_intra_pred_ang8_12_avx2(pixel*
 void x265_intra_pred_ang8_24_avx2(pixel* dst, intptr_t dstStride, const pixel* srcPix, int dirMode, int bFilter);
 void x265_intra_pred_ang8_11_avx2(pixel* dst, intptr_t dstStride, const pixel* srcPix, int dirMode, int bFilter);
 void x265_intra_pred_ang16_25_avx2(pixel* dst, intptr_t dstStride, const pixel* srcPix, int dirMode, int bFilter);
+void x265_intra_pred_ang16_28_avx2(pixel* dst, intptr_t dstStride, const pixel* srcPix, int dirMode, int bFilter);
+void x265_intra_pred_ang16_27_avx2(pixel* dst, intptr_t dstStride, const pixel* srcPix, int dirMode, int bFilter);
+void x265_intra_pred_ang16_29_avx2(pixel* dst, intptr_t dstStride, const pixel* srcPix, int dirMode, int bFilter);
 void x265_all_angs_pred_4x4_sse4(pixel *dest, pixel *refPix, pixel *filtPix, int bLuma);
 void x265_all_angs_pred_8x8_sse4(pixel *dest, pixel *refPix, pixel *filtPix, int bLuma);
 void x265_all_angs_pred_16x16_sse4(pixel *dest, pixel *refPix, pixel *filtPix, int bLuma);
diff -r 6461985f33ac -r 74496ce5d8ba source/common/x86/intrapred8.asm
--- a/source/common/x86/intrapred8.asm	Sun Mar 15 11:58:32 2015 -0500
+++ b/source/common/x86/intrapred8.asm	Mon Mar 16 10:47:09 2015 +0530
@@ -2,6 +2,7 @@
 ;* Copyright (C) 2013 x265 project
 ;*
 ;* Authors: Min Chen <chenm003 at 163.com> <min.chen at multicorewareinc.com>
+;*          Praveen Kumar Tiwari <praveen at multicorewareinc.com>
 ;*
 ;* This program is free software; you can redistribute it and/or modify
 ;* it under the terms of the GNU General Public License as published by
@@ -123,6 +124,44 @@ c_ang16_mode_25:      db 2, 30, 2, 30, 2
                       db 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4
                       db 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 32, 0, 32, 0, 32, 0, 32, 0, 32, 0, 32, 0, 32, 0, 32, 0
 
+
+ALIGN 32
+c_ang16_mode_28:      db 27, 5, 27, 5, 27, 5, 27, 5, 27, 5, 27, 5, 27, 5, 27, 5, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10
+                      db 17, 15, 17, 15, 17, 15, 17, 15, 17, 15, 17, 15, 17, 15, 17, 15, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20
+                      db 7, 25, 7, 25, 7, 25, 7, 25, 7, 25, 7, 25, 7, 25, 7, 25, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30
+                      db 29, 3, 29, 3, 29, 3, 29, 3, 29, 3, 29, 3, 29, 3, 29, 3, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8
+                      db 19, 13, 19, 13, 19, 13, 19, 13, 19, 13, 19, 13, 19, 13, 19, 13, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18
+                      db 9, 23, 9, 23, 9, 23, 9, 23, 9, 23, 9, 23, 9, 23, 9, 23, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28
+                      db 31, 1, 31, 1, 31, 1, 31, 1, 31, 1, 31, 1, 31, 1, 31, 1, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6
+                      db 21, 11, 21, 11, 21, 11, 21, 11, 21, 11, 21, 11, 21, 11, 21, 11, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16
+
+
+ALIGN 32
+c_ang16_mode_27:      db 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4
+                      db 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8
+                      db 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12
+                      db 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16
+                      db 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20
+                      db 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24
+                      db 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28
+                      db 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30
+                      db 32, 0, 32, 0, 32, 0, 32, 0, 32, 0, 32, 0, 32, 0, 32, 0, 32, 0, 32, 0, 32, 0, 32, 0, 32, 0, 32, 0, 32, 0, 32, 0
+
+ALIGN 32
+intra_pred_shuff_0_15: db 0, 1, 1, 2, 2, 3, 3, 4, 4, 5, 5, 6, 6, 7, 7, 8, 8, 9, 9, 10, 10, 11, 11, 12, 12, 13, 13, 14, 14, 15, 15, 15
+
+
+ALIGN 32
+c_ang16_mode_29:     db 23, 9, 23, 9, 23, 9, 23, 9, 23, 9, 23, 9, 23, 9, 23, 9,  14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18
+                     db 5, 27, 5, 27, 5, 27, 5, 27, 5, 27, 5, 27, 5, 27, 5, 27, 5, 27, 5, 27, 5, 27, 5, 27, 5, 27, 5, 27, 5, 27, 5, 27
+                     db 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 19, 13, 19, 13, 19, 13, 19, 13, 19, 13, 19, 13, 19, 13, 19, 13
+                     db 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 1, 31, 1, 31, 1, 31, 1, 31, 1, 31, 1, 31, 1, 31, 1, 31
+                     db 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 15, 17, 15, 17, 15, 17, 15, 17, 15, 17, 15, 17, 15, 17, 15, 17
+                     db 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26
+                     db 29, 3, 29, 3, 29, 3, 29, 3, 29, 3, 29, 3, 29, 3, 29, 3, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12
+                     db 11, 21, 11, 21, 11, 21, 11, 21, 11, 21, 11, 21, 11, 21, 11, 21, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30
+                     db 25, 7, 25, 7, 25, 7, 25, 7, 25, 7, 25, 7, 25, 7, 25, 7, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16
+
 ALIGN 32
 ;; (blkSize - 1 - x)
 pw_planar4_0:         dw 3,  2,  1,  0,  3,  2,  1,  0
@@ -10699,7 +10738,7 @@ cglobal intra_pred_ang8_24, 3, 5, 5
     movu              [%2], xm4
 %endmacro
 
-%macro INTRA_PRED_ANG16_25 1
+%macro INTRA_PRED_ANG16_MC1 1
     INTRA_PRED_ANG16_MC0 r0, r0 + r1, %1
     INTRA_PRED_ANG16_MC0 r0 + 2 * r1, r0 + r3, (%1 + 1)
 %endmacro
@@ -10716,14 +10755,158 @@ cglobal intra_pred_ang16_25, 3, 5, 5
     lea               r3, [3 * r1]
     lea               r4, [c_ang16_mode_25]
 
-    INTRA_PRED_ANG16_25 0
+    INTRA_PRED_ANG16_MC1 0
 
     lea    r0, [r0 + 4 * r1]
-    INTRA_PRED_ANG16_25 2
+    INTRA_PRED_ANG16_MC1 2
+
+    add           r4, 4 * mmsize
 
     lea    r0, [r0 + 4 * r1]
-    INTRA_PRED_ANG16_25 4
+    INTRA_PRED_ANG16_MC1 0
 
     lea    r0, [r0 + 4 * r1]
-    INTRA_PRED_ANG16_25 6
-    RET
+    INTRA_PRED_ANG16_MC1 2
+    RET
+
+INIT_YMM avx2
+cglobal intra_pred_ang16_28, 3, 5, 6
+    mova              m0, [pw_1024]
+    mova              m5, [intra_pred_shuff_0_8]
+    lea               r3, [3 * r1]
+    lea               r4, [c_ang16_mode_28]
+
+    vbroadcasti128    m1, [r2 + 1]
+    pshufb            m1, m5
+    vbroadcasti128    m2, [r2 + 9]
+    pshufb            m2, m5
+
+    INTRA_PRED_ANG16_MC1 0
+
+    lea               r0, [r0 + 4 * r1]
+
+    INTRA_PRED_ANG16_MC0 r0, r0 + r1, 2
+
+    vbroadcasti128    m1, [r2 + 2]
+    pshufb            m1, m5
+    vbroadcasti128    m2, [r2 + 10]
+    pshufb            m2, m5
+
+    INTRA_PRED_ANG16_MC0 r0 + 2 * r1, r0 + r3, 3
+
+    lea               r0, [r0 + 4 * r1]
+    add               r4, 4 * mmsize
+
+    INTRA_PRED_ANG16_MC1 0
+
+    vbroadcasti128    m1, [r2 + 3]
+    pshufb            m1, m5
+    vbroadcasti128    m2, [r2 + 11]
+    pshufb            m2, m5
+
+    lea               r0, [r0 + 4 * r1]
+
+    INTRA_PRED_ANG16_MC1 2
+    RET
+
+INIT_YMM avx2
+cglobal intra_pred_ang16_27, 3, 5, 5
+    mova              m0, [pw_1024]
+    lea               r3, [3 * r1]
+    lea               r4, [c_ang16_mode_27]
+
+    vbroadcasti128    m1, [r2 + 1]
+    pshufb            m1, [intra_pred_shuff_0_8]
+    vbroadcasti128    m2, [r2 + 9]
+    pshufb            m2, [intra_pred_shuff_0_8]
+
+    INTRA_PRED_ANG16_MC1 0
+
+    lea               r0, [r0 + 4 * r1]
+    INTRA_PRED_ANG16_MC1 2
+
+    lea               r0, [r0 + 4 * r1]
+    add               r4, 4 * mmsize
+    INTRA_PRED_ANG16_MC1 0
+
+    lea               r0, [r0 + 4 * r1]
+    INTRA_PRED_ANG16_MC0 r0, r0 + r1, 2
+
+    vperm2i128        m1, m1, m2, 00100000b
+    pmaddubsw         m3, m1, [r4 + 3 * mmsize]
+    pmulhrsw          m3, m0
+    vbroadcasti128    m2, [r2 + 2]
+    pshufb            m2, [intra_pred_shuff_0_15]
+    pmaddubsw         m2, [r4 + 4 * mmsize]
+    pmulhrsw          m2, m0
+    packuswb          m3, m2
+    vpermq            m3, m3, 11011000b
+    movu              [r0 + 2 * r1], xm3
+    vextracti128      xm4, m3, 1
+    movu              [r0 + r3], xm4
+    RET
+
+INIT_YMM avx2
+cglobal intra_pred_ang16_29, 3, 5, 5
+    mova              m0, [pw_1024]
+    mova              m5, [intra_pred_shuff_0_8]
+    lea               r3, [3 * r1]

