[x265-commits] [x265] asm: fix eoln in comment

Fri Apr 10 18:35:56 CEST 2015

details:   http://hg.videolan.org/x265/rev/ee76a15fa312
branches:  
changeset: 10146:ee76a15fa312
user:      Steve Borho <steve at borho.org>
date:      Fri Apr 10 10:24:55 2015 -0500
description:
asm: fix eoln in comment
Subject: [x265] cli: annex_b format switch

details:   http://hg.videolan.org/x265/rev/9f6a053a2868
branches:  
changeset: 10147:9f6a053a2868
user:      Xinyue Lu <i at 7086.in>
date:      Thu Apr 09 18:06:44 2015 -0700
description:
cli: annex_b format switch

When bAnnexB is set to true, the NAL serializer places a start code (0x00 00 00 01) before each NAL unit.
When it is false, the serializer places a 4-byte length before each NAL unit instead.
Container formats may prefer the latter.

Also move output->setParam up so that the output can select the format before we initialize the encoder.
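
The two framing modes described above can be sketched as follows. This is a minimal illustration, not the code in source/encoder/nal.cpp: the helper name frameNal is hypothetical, the length prefix is assumed to be big-endian, and emulation-prevention bytes are left out entirely.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical helper (not part of x265): frames one NAL payload either
// with an Annex B start code or with a 4-byte length prefix.
// Assumption: the length is written big-endian, most significant byte first.
std::vector<uint8_t> frameNal(const std::vector<uint8_t>& payload, bool annexB)
{
    std::vector<uint8_t> out;
    out.reserve(payload.size() + 4);

    if (annexB)
    {
        // Annex B: start code 0x00 0x00 0x00 0x01 precedes the NAL unit
        static const uint8_t startCode[4] = { 0x00, 0x00, 0x00, 0x01 };
        out.insert(out.end(), startCode, startCode + 4);
    }
    else
    {
        // Length-prefixed: 4 bytes holding the payload size
        const uint32_t len = static_cast<uint32_t>(payload.size());
        out.push_back(static_cast<uint8_t>(len >> 24));
        out.push_back(static_cast<uint8_t>(len >> 16));
        out.push_back(static_cast<uint8_t>(len >> 8));
        out.push_back(static_cast<uint8_t>(len));
    }

    out.insert(out.end(), payload.begin(), payload.end());
    return out;
}
```

Both modes add four bytes per NAL unit, so the choice affects how a reader finds unit boundaries rather than the output size.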
Subject: [x265] asm: avx2 code for planecopy_sp

details:   http://hg.videolan.org/x265/rev/58386976e7b6
branches:  
changeset: 10148:58386976e7b6
user:      Dnyaneshwar G <dnyaneshwar at multicorewareinc.com>
date:      Fri Apr 10 10:56:41 2015 +0530
description:
asm: avx2 code for planecopy_sp

AVX2:
planecopy_sp   22.19x   5337.07         118407.46

SSE2:
planecopy_sp   14.83x   8106.54         120242.02
Subject: [x265] asm: avx2 8bpp code for convert_p2s[24xN]

details:   http://hg.videolan.org/x265/rev/9c46289a0957
branches:  
changeset: 10149:9c46289a0957
user:      Rajesh Paulraj <rajesh at multicorewareinc.com>
date:      Fri Apr 10 10:49:34 2015 +0530
description:
asm: avx2 8bpp code for convert_p2s[24xN]

     convert_p2s[24x32](16.21x)
Subject: [x265] asm: avx2 8bpp code for chroma_p2s[32xN],[24xN], reuse the luma code

details:   http://hg.videolan.org/x265/rev/b7dd8105b91c
branches:  
changeset: 10150:b7dd8105b91c
user:      Rajesh Paulraj <rajesh at multicorewareinc.com>
date:      Fri Apr 10 11:10:47 2015 +0530
description:
asm: avx2 8bpp code for chroma_p2s[32xN],[24xN], reuse the luma code
Subject: [x265] change costUncoded[] coordinate system from Raster to Zigzag

details:   http://hg.videolan.org/x265/rev/94d7485893a3
branches:  
changeset: 10151:94d7485893a3
user:      Min Chen <chenm003 at 163.com>
date:      Fri Apr 10 20:06:23 2015 +0800
description:
change costUncoded[] coordinate system from Raster to Zigzag
Subject: [x265] avoid calculate rateIncUp and rateIncDown when sigHide disabled

details:   http://hg.videolan.org/x265/rev/010a73622b59
branches:  
changeset: 10152:010a73622b59
user:      Min Chen <chenm003 at 163.com>
date:      Fri Apr 10 20:49:25 2015 +0800
description:
avoid calculating rateIncUp and rateIncDown when sigHide is disabled
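
The quant.cpp hunk for this changeset gates the sign-hiding bookkeeping by masking level with an all-ones or all-zero value. A minimal sketch of that pattern, using a hypothetical function name and a plain bool in place of the PPS flag:

```cpp
#include <cstdint>

// Hypothetical stand-in for the condition in Quant::rdoQuant: the costs
// for sign hiding are only worth recording when the feature is enabled
// and the quantized level is non-zero.
inline bool shouldRecordSignHideCosts(bool signHideEnabled, uint32_t level)
{
    // All ones when enabled, all zeros when disabled
    const uint32_t mask = signHideEnabled ? ~0u : 0u;
    return (mask & level) != 0;
}
```

With the flag off the mask clears every bit of level, so the block that computes rateIncUp and rateIncDown is skipped regardless of the coefficient value.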
Subject: [x265] asm: intra_pred_ang8_20 improved by ~4% over SSE4

details:   http://hg.videolan.org/x265/rev/270da1018d2e
branches:  
changeset: 10153:270da1018d2e
user:      Praveen Tiwari <praveen at multicorewareinc.com>
date:      Fri Apr 10 12:01:19 2015 +0530
description:
asm: intra_pred_ang8_20 improved by ~4% over SSE4

AVX2:
intra_ang_8x8[20]       7.98x    256.94          2050.52

SSE4:
intra_ang_8x8[20]       7.59x    267.77          2031.49
Subject: [x265] asm: intra_pred_ang8_16 improved by ~3% over SSE4

details:   http://hg.videolan.org/x265/rev/17b694085f6a
branches:  
changeset: 10154:17b694085f6a
user:      Praveen Tiwari <praveen at multicorewareinc.com>
date:      Fri Apr 10 12:38:48 2015 +0530
description:
asm: intra_pred_ang8_16 improved by ~3% over SSE4

AVX2:
intra_ang_8x8[16]       9.22x    360.04          3320.64

SSE4:
intra_ang_8x8[16]       8.68x    371.05          3222.21
Subject: [x265] asm: avx code for chroma sa8d, reused luma code

details:   http://hg.videolan.org/x265/rev/58bcc43a1333
branches:  
changeset: 10155:58bcc43a1333
user:      Sumalatha Polureddy
date:      Fri Apr 10 13:42:12 2015 +0530
description:
asm: avx code for chroma sa8d, reused luma code
Subject: [x265] asm: saoCuOrgB0 avx2 code: 23780c->18441c

details:   http://hg.videolan.org/x265/rev/a6c7cf774564
branches:  
changeset: 10156:a6c7cf774564
user:      Divya Manivannan <divya at multicorewareinc.com>
date:      Fri Apr 10 18:35:23 2015 +0530
description:
asm: saoCuOrgB0 avx2 code: 23780c->18441c

diffstat:

 doc/reST/cli.rst                         |    9 ++
 source/CMakeLists.txt                    |    2 +-
 source/common/param.cpp                  |    2 +
 source/common/quant.cpp                  |   16 +-
 source/common/x86/asm-primitives.cpp     |   21 +++++
 source/common/x86/intrapred.h            |    2 +
 source/common/x86/intrapred8.asm         |  125 +++++++++++++++++++++++++++++++
 source/common/x86/intrapred8_allangs.asm |    2 +-
 source/common/x86/ipfilter8.asm          |   69 +++++++++++++++++
 source/common/x86/ipfilter8.h            |   17 ++++
 source/common/x86/loopfilter.asm         |   80 +++++++++++++++++++-
 source/common/x86/loopfilter.h           |    1 +
 source/common/x86/pixel-a.asm            |  111 +++++++++++++++++++++++++++
 source/common/x86/pixel.h                |    1 +
 source/encoder/encoder.cpp               |    5 +
 source/encoder/nal.cpp                   |   18 ++++-
 source/encoder/nal.h                     |    1 +
 source/output/output.h                   |    6 +-
 source/output/raw.cpp                    |    7 +-
 source/output/raw.h                      |    8 +-
 source/test/pixelharness.cpp             |   22 ++++-
 source/x265.cpp                          |    6 +-
 source/x265.h                            |    5 +
 23 files changed, 506 insertions(+), 30 deletions(-)

diffs (truncated from 966 to 300 lines):

diff -r 984e254f93f7 -r a6c7cf774564 doc/reST/cli.rst
--- a/doc/reST/cli.rst	Thu Apr 09 11:48:08 2015 -0500
+++ b/doc/reST/cli.rst	Fri Apr 10 18:35:23 2015 +0530
@@ -1481,6 +1481,15 @@ VUI fields must be manually specified.
 Bitstream options
 =================
 
+.. option:: --annexb, --no-annexb
+
+	If enabled, x265 will produce Annex B bitstream format, which places
+	start codes before NAL. If disabled, x265 will produce file format,
+	which places length before NAL. x265 CLI will choose the right option
+	based on output format. Default enabled
+
+	**API ONLY**
+
 .. option:: --repeat-headers, --no-repeat-headers
 
 	If enabled, x265 will emit VPS, SPS, and PPS headers with every
diff -r 984e254f93f7 -r a6c7cf774564 source/CMakeLists.txt
--- a/source/CMakeLists.txt	Thu Apr 09 11:48:08 2015 -0500
+++ b/source/CMakeLists.txt	Fri Apr 10 18:35:23 2015 +0530
@@ -30,7 +30,7 @@ option(STATIC_LINK_CRT "Statically link 
 mark_as_advanced(FPROFILE_USE FPROFILE_GENERATE NATIVE_BUILD)
 
 # X265_BUILD must be incremented each time the public API is changed
-set(X265_BUILD 54)
+set(X265_BUILD 55)
 configure_file("${PROJECT_SOURCE_DIR}/x265.def.in"
                "${PROJECT_BINARY_DIR}/x265.def")
 configure_file("${PROJECT_SOURCE_DIR}/x265_config.h.in"
diff -r 984e254f93f7 -r a6c7cf774564 source/common/param.cpp
--- a/source/common/param.cpp	Thu Apr 09 11:48:08 2015 -0500
+++ b/source/common/param.cpp	Fri Apr 10 18:35:23 2015 +0530
@@ -117,6 +117,7 @@ void x265_param_default(x265_param* para
     param->levelIdc = 0;
     param->bHighTier = 0;
     param->interlaceMode = 0;
+    param->bAnnexB = 1;
     param->bRepeatHeaders = 0;
     param->bEnableAccessUnitDelimiters = 0;
     param->bEmitHRDSEI = 0;
@@ -580,6 +581,7 @@ int x265_param_parse(x265_param* p, cons
         }
     }
     OPT("cu-stats") p->bLogCuStats = atobool(value);
+    OPT("annexb") p->bAnnexB = atobool(value);
     OPT("repeat-headers") p->bRepeatHeaders = atobool(value);
     OPT("wpp") p->bEnableWavefront = atobool(value);
     OPT("ctu") p->maxCUSize = (uint32_t)atoi(value);
diff -r 984e254f93f7 -r a6c7cf774564 source/common/quant.cpp
--- a/source/common/quant.cpp	Thu Apr 09 11:48:08 2015 -0500
+++ b/source/common/quant.cpp	Fri Apr 10 18:35:23 2015 +0530
@@ -613,13 +613,13 @@ uint32_t Quant::rdoQuant(const CUData& c
              * FIX15 nature of the CABAC cost tables minus the forward transform scale */
 
             /* cost of not coding this coefficient (all distortion, no signal bits) */
-            costUncoded[scanPos] = ((int64_t)signCoef * signCoef) << scaleBits;
+            costUncoded[blkPos] = ((int64_t)signCoef * signCoef) << scaleBits;
             X265_CHECK((!!scanPos ^ !!blkPos) == 0, "failed on (blkPos=0 && scanPos!=0)\n");
             if (usePsyMask & scanPos)
                 /* when no residual coefficient is coded, predicted coef == recon coef */
-                costUncoded[scanPos] -= PSYVALUE(predictedCoef);
+                costUncoded[blkPos] -= PSYVALUE(predictedCoef);
 
-            totalUncodedCost += costUncoded[scanPos];
+            totalUncodedCost += costUncoded[blkPos];
 
             if (maxAbsLevel && lastScanPos < 0)
             {
@@ -638,7 +638,7 @@ uint32_t Quant::rdoQuant(const CUData& c
                 /* No non-zero coefficient yet found, but this does not mean
                  * there is no uncoded-cost for this coefficient. Pre-
                  * quantization the coefficient may have been non-zero */
-                totalRdCost += costUncoded[scanPos];
+                totalRdCost += costUncoded[blkPos];
             }
             else
             {
@@ -668,7 +668,7 @@ uint32_t Quant::rdoQuant(const CUData& c
                     {
                         /* set default costs to uncoded costs */
                         costSig[scanPos] = SIGCOST(estBitsSbac.significantBits[ctxSig][0]);
-                        costCoeff[scanPos] = costUncoded[scanPos] + costSig[scanPos];
+                        costCoeff[scanPos] = costUncoded[blkPos] + costSig[scanPos];
                     }
                     sigRateDelta[blkPos] = estBitsSbac.significantBits[ctxSig][1] - estBitsSbac.significantBits[ctxSig][0];
                     sigCoefBits = estBitsSbac.significantBits[ctxSig][1];
@@ -739,7 +739,7 @@ uint32_t Quant::rdoQuant(const CUData& c
                 totalRdCost += costCoeff[scanPos];
 
                 /* record costs for sign-hiding performed at the end */
-                if (level)
+                if ((cu.m_slice->m_pps->bSignHideEnabled ? ~0 : 0) & level)
                 {
                     const int32_t diff0 = level - 1 - baseLevel;
                     const int32_t diff2 = level + 1 - baseLevel;
@@ -810,7 +810,7 @@ uint32_t Quant::rdoQuant(const CUData& c
             {
                 sigCoeffGroupFlag64 |= cgBlkPosMask;
                 cgRdStats.codedLevelAndDist += costCoeff[scanPos] - costSig[scanPos];
-                cgRdStats.uncodedDist += costUncoded[scanPos];
+                cgRdStats.uncodedDist += costUncoded[blkPos];
                 cgRdStats.nnzBeforePos0 += scanPosinCG;
             }
         } /* end for (scanPosinCG) */
@@ -965,7 +965,7 @@ uint32_t Quant::rdoQuant(const CUData& c
                 }
 
                 totalRdCost -= costCoeff[scanPos];
-                totalRdCost += costUncoded[scanPos];
+                totalRdCost += costUncoded[blkPos];
             }
             else
                 totalRdCost -= costSig[scanPos];
diff -r 984e254f93f7 -r a6c7cf774564 source/common/x86/asm-primitives.cpp
--- a/source/common/x86/asm-primitives.cpp	Thu Apr 09 11:48:08 2015 -0500
+++ b/source/common/x86/asm-primitives.cpp	Fri Apr 10 18:35:23 2015 +0530
@@ -1488,6 +1488,10 @@ void setupAssemblyPrimitives(EncoderPrim
         p.chroma[X265_CSP_I420].pu[CHROMA_420_32x8].satd = x265_pixel_satd_32x8_avx;
         p.chroma[X265_CSP_I420].pu[CHROMA_420_8x32].satd = x265_pixel_satd_8x32_avx;
         ASSIGN_SA8D(avx);
+        p.chroma[X265_CSP_I420].cu[BLOCK_420_32x32].sa8d = x265_pixel_sa8d_32x32_avx;
+        p.chroma[X265_CSP_I420].cu[BLOCK_420_16x16].sa8d = x265_pixel_sa8d_16x16_avx;
+        p.chroma[X265_CSP_I420].cu[BLOCK_420_8x8].sa8d = x265_pixel_sa8d_8x8_avx;
+        p.chroma[X265_CSP_I420].cu[BLOCK_420_4x4].sa8d = x265_pixel_satd_4x4_avx;
         ASSIGN_SSE_PP(avx);
         p.chroma[X265_CSP_I420].cu[BLOCK_420_8x8].sse_pp = x265_pixel_ssd_8x8_avx;
         ASSIGN_SSE_SS(avx);
@@ -1552,6 +1556,8 @@ void setupAssemblyPrimitives(EncoderPrim
 #if X86_64
     if (cpuMask & X265_CPU_AVX2)
     {
+        p.planecopy_sp = x265_downShift_16_avx2;
+
         p.cu[BLOCK_32x32].intra_pred[DC_IDX] = x265_intra_pred_dc32_avx2;
 
         p.cu[BLOCK_16x16].intra_pred[PLANAR_IDX] = x265_intra_pred_planar16_avx2;
@@ -1563,6 +1569,7 @@ void setupAssemblyPrimitives(EncoderPrim
         p.saoCuOrgE0 = x265_saoCuOrgE0_avx2;
         p.saoCuOrgE1 = x265_saoCuOrgE1_avx2;
         p.saoCuOrgE1_2Rows = x265_saoCuOrgE1_2Rows_avx2;
+        p.saoCuOrgB0 = x265_saoCuOrgB0_avx2;
 
         p.cu[BLOCK_4x4].psy_cost_ss = x265_psyCost_ss_4x4_avx2;
         p.cu[BLOCK_8x8].psy_cost_ss = x265_psyCost_ss_8x8_avx2;
@@ -1769,11 +1776,13 @@ void setupAssemblyPrimitives(EncoderPrim
         p.cu[BLOCK_8x8].intra_pred[24] = x265_intra_pred_ang8_24_avx2;
         p.cu[BLOCK_8x8].intra_pred[11] = x265_intra_pred_ang8_11_avx2;
         p.cu[BLOCK_8x8].intra_pred[13] = x265_intra_pred_ang8_13_avx2;
+        p.cu[BLOCK_8x8].intra_pred[20] = x265_intra_pred_ang8_20_avx2;
         p.cu[BLOCK_8x8].intra_pred[21] = x265_intra_pred_ang8_21_avx2;
         p.cu[BLOCK_8x8].intra_pred[22] = x265_intra_pred_ang8_22_avx2;
         p.cu[BLOCK_8x8].intra_pred[23] = x265_intra_pred_ang8_23_avx2;
         p.cu[BLOCK_8x8].intra_pred[14] = x265_intra_pred_ang8_14_avx2;
         p.cu[BLOCK_8x8].intra_pred[15] = x265_intra_pred_ang8_15_avx2;
+        p.cu[BLOCK_8x8].intra_pred[16] = x265_intra_pred_ang8_16_avx2;
         p.cu[BLOCK_16x16].intra_pred[3] = x265_intra_pred_ang16_3_avx2;
         p.cu[BLOCK_16x16].intra_pred[4] = x265_intra_pred_ang16_4_avx2;
         p.cu[BLOCK_16x16].intra_pred[5] = x265_intra_pred_ang16_5_avx2;
@@ -2070,6 +2079,18 @@ void setupAssemblyPrimitives(EncoderPrim
         p.pu[LUMA_64x48].convert_p2s = x265_filterPixelToShort_64x48_avx2;
         p.pu[LUMA_64x64].convert_p2s = x265_filterPixelToShort_64x64_avx2;
         p.pu[LUMA_48x64].convert_p2s = x265_filterPixelToShort_48x64_avx2;
+        p.pu[LUMA_24x32].convert_p2s = x265_filterPixelToShort_24x32_avx2;
+
+        p.chroma[X265_CSP_I420].pu[CHROMA_420_24x32].p2s = x265_filterPixelToShort_24x32_avx2;
+        p.chroma[X265_CSP_I420].pu[CHROMA_420_32x8].p2s = x265_filterPixelToShort_32x8_avx2;
+        p.chroma[X265_CSP_I420].pu[CHROMA_420_32x16].p2s = x265_filterPixelToShort_32x16_avx2;
+        p.chroma[X265_CSP_I420].pu[CHROMA_420_32x24].p2s = x265_filterPixelToShort_32x24_avx2;
+        p.chroma[X265_CSP_I420].pu[CHROMA_420_32x32].p2s = x265_filterPixelToShort_32x32_avx2;
+        p.chroma[X265_CSP_I422].pu[CHROMA_422_24x64].p2s = x265_filterPixelToShort_24x64_avx2;
+        p.chroma[X265_CSP_I422].pu[CHROMA_422_32x16].p2s = x265_filterPixelToShort_32x16_avx2;
+        p.chroma[X265_CSP_I422].pu[CHROMA_422_32x32].p2s = x265_filterPixelToShort_32x32_avx2;
+        p.chroma[X265_CSP_I422].pu[CHROMA_422_32x48].p2s = x265_filterPixelToShort_32x48_avx2;
+        p.chroma[X265_CSP_I422].pu[CHROMA_422_32x64].p2s = x265_filterPixelToShort_32x64_avx2;
 
         if ((cpuMask & X265_CPU_BMI1) && (cpuMask & X265_CPU_BMI2))
             p.findPosLast = x265_findPosLast_x64;
diff -r 984e254f93f7 -r a6c7cf774564 source/common/x86/intrapred.h
--- a/source/common/x86/intrapred.h	Thu Apr 09 11:48:08 2015 -0500
+++ b/source/common/x86/intrapred.h	Fri Apr 10 18:35:23 2015 +0530
@@ -236,6 +236,8 @@ void x265_intra_pred_ang8_11_avx2(pixel*
 void x265_intra_pred_ang8_13_avx2(pixel* dst, intptr_t dstStride, const pixel* srcPix, int dirMode, int bFilter);
 void x265_intra_pred_ang8_14_avx2(pixel* dst, intptr_t dstStride, const pixel* srcPix, int dirMode, int bFilter);
 void x265_intra_pred_ang8_15_avx2(pixel* dst, intptr_t dstStride, const pixel* srcPix, int dirMode, int bFilter);
+void x265_intra_pred_ang8_16_avx2(pixel* dst, intptr_t dstStride, const pixel* srcPix, int dirMode, int bFilter);
+void x265_intra_pred_ang8_20_avx2(pixel* dst, intptr_t dstStride, const pixel* srcPix, int dirMode, int bFilter);
 void x265_intra_pred_ang8_21_avx2(pixel* dst, intptr_t dstStride, const pixel* srcPix, int dirMode, int bFilter);
 void x265_intra_pred_ang8_22_avx2(pixel* dst, intptr_t dstStride, const pixel* srcPix, int dirMode, int bFilter);
 void x265_intra_pred_ang8_23_avx2(pixel* dst, intptr_t dstStride, const pixel* srcPix, int dirMode, int bFilter);
diff -r 984e254f93f7 -r a6c7cf774564 source/common/x86/intrapred8.asm
--- a/source/common/x86/intrapred8.asm	Thu Apr 09 11:48:08 2015 -0500
+++ b/source/common/x86/intrapred8.asm	Fri Apr 10 18:35:23 2015 +0530
@@ -690,6 +690,12 @@ c_ang8_mode_15:       db 17, 15, 17, 15,
                       db 21, 11, 21, 11, 21, 11, 21, 11, 21, 11, 21, 11, 21, 11, 21, 11, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26
                       db 23, 9, 23, 9, 23, 9, 23, 9, 23, 9, 23, 9, 23, 9, 23, 9, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24
 
+ALIGN 32
+c_ang8_mode_20:       db 21, 11, 21, 11, 21, 11, 21, 11, 21, 11, 21, 11, 21, 11, 21, 11, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22
+                      db 31, 1, 31, 1, 31, 1, 31, 1, 31, 1, 31, 1, 31, 1, 31, 1, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12
+                      db 9, 23, 9, 23, 9, 23, 9, 23, 9, 23, 9, 23, 9, 23, 9, 23, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2
+                      db 19, 13, 19, 13, 19, 13, 19, 13, 19, 13, 19, 13, 19, 13, 19, 13, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24
+
 const ang_table
 %assign x 0
 %rep 32
@@ -11946,6 +11952,125 @@ cglobal intra_pred_ang8_15, 3, 6, 6
     RET
 
 INIT_YMM avx2
+cglobal intra_pred_ang8_16, 3, 6, 6
+    mova              m3, [pw_1024]
+    movu              xm5, [r2 + 16]
+    pinsrb            xm5, [r2], 0
+    lea               r5, [intra_pred_shuff_0_8]
+    mova              xm0, xm5
+    pslldq            xm5, 1
+    pinsrb            xm5, [r2 + 2], 0
+    vinserti128       m0, m0, xm5, 1
+    pshufb            m0, [r5]
+
+    lea               r4, [c_ang8_mode_20]
+    pmaddubsw         m1, m0, [r4]
+    pmulhrsw          m1, m3
+    mova              xm0, xm5
+    pslldq            xm5, 1
+    pinsrb            xm5, [r2 + 3], 0
+    vinserti128       m0, m0, xm5, 1
+    pshufb            m0, [r5]
+    pmaddubsw         m2, m0, [r4 + mmsize]
+    pmulhrsw          m2, m3
+    pslldq            xm5, 1
+    pinsrb            xm5, [r2 + 5], 0
+    vinserti128       m0, m5, xm5, 1
+    pshufb            m0, [r5]
+    pmaddubsw         m4, m0, [r4 + 2 * mmsize]
+    pmulhrsw          m4, m3
+    pslldq            xm5, 1
+    pinsrb            xm5, [r2 + 6], 0
+    mova              xm0, xm5
+    pslldq            xm5, 1
+    pinsrb            xm5, [r2 + 8], 0
+    vinserti128       m0, m0, xm5, 1
+    pshufb            m0, [r5]
+    pmaddubsw         m0, [r4 + 3 * mmsize]
+    pmulhrsw          m0, m3
+
+    packuswb          m1, m2
+    packuswb          m4, m0
+
+    vperm2i128        m2, m1, m4, 00100000b
+    vperm2i128        m1, m1, m4, 00110001b
+    punpcklbw         m4, m2, m1
+    punpckhbw         m2, m1
+    punpcklwd         m1, m4, m2
+    punpckhwd         m4, m2
+    mova              m0, [trans8_shuf]
+    vpermd            m1, m0, m1
+    vpermd            m4, m0, m4
+
+    lea               r3, [3 * r1]
+    movq              [r0], xm1
+    movhps            [r0 + r1], xm1
+    vextracti128      xm2, m1, 1
+    movq              [r0 + 2 * r1], xm2
+    movhps            [r0 + r3], xm2
+    lea               r0, [r0 + 4 * r1]
+    movq              [r0], xm4
+    movhps            [r0 + r1], xm4
+    vextracti128      xm2, m4, 1
+    movq              [r0 + 2 * r1], xm2
+    movhps            [r0 + r3], xm2
+    RET
+
+INIT_YMM avx2
+cglobal intra_pred_ang8_20, 3, 6, 6
+    mova              m3, [pw_1024]
+    movu              xm5, [r2]
+    lea               r5, [intra_pred_shuff_0_8]
+    mova              xm0, xm5
+    pslldq            xm5, 1
+    pinsrb            xm5, [r2 + 2 + 16], 0
+    vinserti128       m0, m0, xm5, 1
+    pshufb            m0, [r5]
+
+    lea               r4, [c_ang8_mode_20]
+    pmaddubsw         m1, m0, [r4]
+    pmulhrsw          m1, m3
+    mova              xm0, xm5
+    pslldq            xm5, 1
+    pinsrb            xm5, [r2 + 3 + 16], 0
+    vinserti128       m0, m0, xm5, 1
+    pshufb            m0, [r5]
+    pmaddubsw         m2, m0, [r4 + mmsize]
+    pmulhrsw          m2, m3
+    pslldq            xm5, 1
+    pinsrb            xm5, [r2 + 5 + 16], 0
+    vinserti128       m0, m5, xm5, 1
+    pshufb            m0, [r5]