[x265-commits] [x265] aq: implementation of fine grained adaptive quantization

Deepthi Nandakumar deepthi at multicorewareinc.com
Tue Apr 7 21:12:22 CEST 2015


details:   http://hg.videolan.org/x265/rev/b66b0e32d2ff
branches:  
changeset: 10096:b66b0e32d2ff
user:      Deepthi Nandakumar <deepthi at multicorewareinc.com>
date:      Tue Apr 07 09:39:25 2015 +0530
description:
aq: implementation of fine grained adaptive quantization

Currently adaptive quantization adjusts the QP values on 64x64 pixel CodingTree
units (CTUs) across a video frame. The new param option --qg-size enables QP
to be adjusted for individual quantization groups (QGs) of size 64/32/16.
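A minimal sketch of what a quantization-group size parameter implies; the function names (`clampQgSize`, `qgPerCtu`) are illustrative, not the actual x265 API. The QG size must fall in the inclusive range [minCUSize, maxCUSize], and a smaller QG size means more independent QP adjustments per CTU:

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper: clamp a requested QG size into the legal range.
// The commit restricts values to 64/32/16 within [minCUSize, maxCUSize].
static uint32_t clampQgSize(uint32_t requested, uint32_t minCUSize, uint32_t maxCUSize)
{
    if (requested > maxCUSize) return maxCUSize;
    if (requested < minCUSize) return minCUSize;
    return requested;
}

// With a 64x64 CTU and 16x16 QGs, each CTU contains (64/16)^2 = 16 groups,
// each of which may carry its own QP adjustment.
static uint32_t qgPerCtu(uint32_t ctuSize, uint32_t qgSize)
{
    uint32_t n = ctuSize / qgSize;
    return n * n;
}
```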
Subject: [x265] aq: add cost of sub-LCU level QP to RD costs

details:   http://hg.videolan.org/x265/rev/75d6c2588e34
branches:  
changeset: 10097:75d6c2588e34
user:      Deepthi Nandakumar <deepthi at multicorewareinc.com>
date:      Mon Apr 06 15:39:07 2015 +0530
description:
aq: add cost of sub-LCU level QP to RD costs
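A sketch of the idea behind charging sub-LCU QP to RD cost, with illustrative names and a fixed-point lambda (not the actual x265 code): once QP can change below the LCU level, the bits spent signalling each delta-QP must be added to that group's rate term, otherwise the mode search would pick fine-grained QP changes "for free".

```cpp
#include <cassert>
#include <cstdint>

// Illustrative only: lambda256 is lambda in 8.8 fixed point, since RD
// costs are typically accumulated in integer arithmetic. deltaQpBits is
// the signalling cost of the sub-LCU QP change being evaluated.
static uint64_t rdCost(uint64_t distortion, uint32_t bits,
                       uint32_t deltaQpBits, uint64_t lambda256)
{
    return distortion + ((lambda256 * (bits + deltaQpBits)) >> 8);
}
```

With lambda = 1 (lambda256 = 256), a mode costing 40 bits plus 8 delta-QP bits is charged 48 rate units, so two otherwise-equal candidates are correctly separated by their QP-signalling overhead.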
Subject: [x265] analysis: remove 1i64, typecast to size_t

details:   http://hg.videolan.org/x265/rev/095ed87526e5
branches:  
changeset: 10098:095ed87526e5
user:      Deepthi Nandakumar <deepthi at multicorewareinc.com>
date:      Tue Apr 07 09:42:31 2015 +0530
description:
analysis: remove 1i64, typecast to size_t
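The portability point behind this commit, shown with a hypothetical helper: the `1i64` literal suffix is MSVC-specific and does not compile under gcc/clang, whereas an explicit `size_t` cast is portable.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical example of the change: force size_t arithmetic portably.
static size_t allocBytes(size_t count, size_t elemSize)
{
    // return 1i64 * count * elemSize;   // MSVC-only literal suffix
    return (size_t)1 * count * elemSize; // portable typecast
}
```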
Subject: [x265] doc: add level 8.5 to --level docs

details:   http://hg.videolan.org/x265/rev/09d7ccdca69c
branches:  
changeset: 10099:09d7ccdca69c
user:      Steve Borho <steve at borho.org>
date:      Mon Apr 06 22:05:33 2015 -0500
description:
doc: add level 8.5 to --level docs
Subject: [x265] encoder: fix eoln damage and white-space nits

details:   http://hg.videolan.org/x265/rev/8be3ac946dc4
branches:  
changeset: 10100:8be3ac946dc4
user:      Steve Borho <steve at borho.org>
date:      Tue Apr 07 10:08:06 2015 -0500
description:
encoder: fix eoln damage and white-space nits
Subject: [x265] quant: remove unused fastMax

details:   http://hg.videolan.org/x265/rev/d53fe6fe7c8b
branches:  
changeset: 10101:d53fe6fe7c8b
user:      Steve Borho <steve at borho.org>
date:      Tue Apr 07 11:05:42 2015 -0500
description:
quant: remove unused fastMax
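The removed helper computed max(x, y) without a branch; its sibling `fastMin` stays in quant.cpp (both appear in the diff below). The mask `(x - y) >> (bits - 1)` is all-ones when x < y and zero otherwise, so the expression selects the correct operand:

```cpp
#include <cassert>
#include <climits>

// Branchless min/max as they appeared in quant.cpp; note the arithmetic
// right shift of a negative value, which is well-defined on the compilers
// x265 targets though implementation-defined in older C++ standards.
inline int fastMin(int x, int y)
{
    return y + ((x - y) & ((x - y) >> (sizeof(int) * CHAR_BIT - 1))); // min(x, y)
}

inline int fastMax(int x, int y)
{
    return x - ((x - y) & ((x - y) >> (sizeof(int) * CHAR_BIT - 1))); // max(x, y)
}
```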
Subject: [x265] analysis: fix crash when AQ is disabled

details:   http://hg.videolan.org/x265/rev/1ba7a0a0742b
branches:  
changeset: 10102:1ba7a0a0742b
user:      Steve Borho <steve at borho.org>
date:      Tue Apr 07 11:10:45 2015 -0500
description:
analysis: fix crash when AQ is disabled
Subject: [x265] backout fine-grained AQ until unexplained output changes are resolved

details:   http://hg.videolan.org/x265/rev/ff5e67b4a60a
branches:  
changeset: 10103:ff5e67b4a60a
user:      Steve Borho <steve at borho.org>
date:      Tue Apr 07 12:29:46 2015 -0500
description:
backout fine-grained AQ until unexplained output changes are resolved

This is a partial backout of b66b0e32d2ff and 75d6c2588e34, leaving the param,
cli, and documentation support in place to avoid API churn.  The feature will
be present and documented but unimplemented until the remaining issues are
resolved.
Subject: [x265] rc-test: revise rc test cases - use more test clips, additional rc features.

details:   http://hg.videolan.org/x265/rev/ba6d393a8ec8
branches:  
changeset: 10104:ba6d393a8ec8
user:      Aarthi Thirumalai
date:      Tue Apr 07 22:42:22 2015 +0530
description:
rc-test: revise rc test cases - use more test clips, additional rc features.
Subject: [x265] asm: improve avx2 code for add_ps[32x32] (1428 -> 1312)

details:   http://hg.videolan.org/x265/rev/e23102250bbe
branches:  
changeset: 10105:e23102250bbe
user:      Sumalatha Polureddy
date:      Tue Apr 07 13:59:45 2015 +0530
description:
asm: improve avx2 code for add_ps[32x32] (1428 -> 1312)
Subject: [x265] simplify coeff group clear when CG decides not to encode

details:   http://hg.videolan.org/x265/rev/10b657f2d3c1
branches:  
changeset: 10106:10b657f2d3c1
user:      Min Chen <chenm003 at 163.com>
date:      Tue Apr 07 20:03:28 2015 +0800
description:
simplify coeff group clear when CG decides not to encode
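The simplification (visible in the quant.cpp hunk further down) replaces a reverse-scan loop over the 16 coefficients of an uncoded group with four row-wise memsets. A self-contained sketch using the same names as the patch (`dstCoeff`, `blkPos`, `trSize`):

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Zero an entire 4x4 coefficient group (CG) that was decided not to be
// coded. blkPos is the raster index of the group's top-left coefficient;
// trSize is the transform width in coefficients, i.e. the row stride.
static void clearCoeffGroup(int16_t* dstCoeff, uint32_t blkPos, uint32_t trSize)
{
    memset(&dstCoeff[blkPos + 0 * trSize], 0, 4 * sizeof(*dstCoeff));
    memset(&dstCoeff[blkPos + 1 * trSize], 0, 4 * sizeof(*dstCoeff));
    memset(&dstCoeff[blkPos + 2 * trSize], 0, 4 * sizeof(*dstCoeff));
    memset(&dstCoeff[blkPos + 3 * trSize], 0, 4 * sizeof(*dstCoeff));
}
```

Because a CG occupies four contiguous 4-coefficient runs, the memsets touch exactly the same positions the old scan-order loop cleared, without per-coefficient bookkeeping.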
Subject: [x265] asm: intra_pred_ang16_11 improved by ~27% over SSE4

details:   http://hg.videolan.org/x265/rev/ea8a5f97245b
branches:  
changeset: 10107:ea8a5f97245b
user:      Praveen Tiwari <praveen at multicorewareinc.com>
date:      Tue Apr 07 11:02:33 2015 +0530
description:
asm: intra_pred_ang16_11 improved by ~27% over SSE4

AVX2:
intra_ang_16x16[11]     15.18x   787.92          11958.20

SSE4:
intra_ang_16x16[11]     10.48x   1075.33         11267.02
Subject: [x265] asm: intra_pred_ang16_9 improved by ~28% over SSE4

details:   http://hg.videolan.org/x265/rev/a6e50e84d731
branches:  
changeset: 10108:a6e50e84d731
user:      Praveen Tiwari <praveen at multicorewareinc.com>
date:      Tue Apr 07 13:35:25 2015 +0530
description:
asm: intra_pred_ang16_9 improved by ~28% over SSE4

AVX2:
intra_ang_16x16[ 9]     15.68x   770.21          12074.77

SSE4:
intra_ang_16x16[ 9]     11.35x   1072.00         12165.87
Subject: [x265] asm: intra_pred_ang16_8 improved by ~28% over SSE4

details:   http://hg.videolan.org/x265/rev/69af1c0c86cc
branches:  
changeset: 10109:69af1c0c86cc
user:      Praveen Tiwari <praveen at multicorewareinc.com>
date:      Tue Apr 07 16:32:12 2015 +0530
description:
asm: intra_pred_ang16_8 improved by ~28% over SSE4

AVX2:
intra_ang_16x16[ 8]     14.70x   792.85          11653.86

SSE4:
intra_ang_16x16[ 8]     11.28x   1014.29         11441.50
Subject: [x265] asm: intra_pred_ang16_7 improved by ~22% over SSE4

details:   http://hg.videolan.org/x265/rev/7ba2105a084d
branches:  
changeset: 10110:7ba2105a084d
user:      Praveen Tiwari <praveen at multicorewareinc.com>
date:      Tue Apr 07 17:32:26 2015 +0530
description:
asm: intra_pred_ang16_7 improved by ~22% over SSE4

AVX2:
intra_ang_16x16[ 7]     14.58x   795.95          11608.27

SSE4:
intra_ang_16x16[ 7]     11.54x   1021.72         11793.51
Subject: [x265] asm: optimize code size with macro 'INTRA_PRED_ANG16_CAL_ROW'

details:   http://hg.videolan.org/x265/rev/ae5dfd5187e9
branches:  
changeset: 10111:ae5dfd5187e9
user:      Praveen Tiwari <praveen at multicorewareinc.com>
date:      Tue Apr 07 17:49:16 2015 +0530
description:
asm: optimize code size with macro 'INTRA_PRED_ANG16_CAL_ROW'
Subject: [x265] asm: optimize buffer address using registers

details:   http://hg.videolan.org/x265/rev/3430dc0ae834
branches:  
changeset: 10112:3430dc0ae834
user:      Praveen Tiwari <praveen at multicorewareinc.com>
date:      Tue Apr 07 18:17:43 2015 +0530
description:
asm: optimize buffer address using registers
Subject: [x265] asm: avx2 8bpp code for convert_p2s[32xN]

details:   http://hg.videolan.org/x265/rev/676da42eb87d
branches:  
changeset: 10113:676da42eb87d
user:      Rajesh Paulraj <rajesh at multicorewareinc.com>
date:      Tue Apr 07 18:05:41 2015 +0530
description:
asm: avx2 8bpp code for convert_p2s[32xN]

     convert_p2s[32x8](11.11x), convert_p2s[32x16](11.01x),
     convert_p2s[32x24](11.03x), convert_p2s[32x32](11.00x),
     convert_p2s[32x64](11.06x)
Subject: [x265] asm: avx2 8bpp code for convert_p2s[64xN]

details:   http://hg.videolan.org/x265/rev/a72f08e05ab9
branches:  
changeset: 10114:a72f08e05ab9
user:      Rajesh Paulraj <rajesh at multicorewareinc.com>
date:      Tue Apr 07 18:36:17 2015 +0530
description:
asm: avx2 8bpp code for convert_p2s[64xN]

     convert_p2s[64x16](10.54x), convert_p2s[64x32](10.78x),
     convert_p2s[64x48](10.32x), convert_p2s[64x64](11.14x)

diffstat:

 doc/reST/cli.rst                     |    9 +-
 source/CMakeLists.txt                |    2 +-
 source/common/param.cpp              |    5 +
 source/common/quant.cpp              |   28 +--
 source/common/x86/asm-primitives.cpp |   14 +
 source/common/x86/intrapred.h        |    4 +
 source/common/x86/intrapred8.asm     |  324 +++++++++++++++++++++++++++++++++++
 source/common/x86/ipfilter8.asm      |  197 +++++++++++++++++++++
 source/common/x86/ipfilter8.h        |    9 +
 source/common/x86/pixeladd8.asm      |   37 ++-
 source/encoder/analysis.cpp          |   44 ++--
 source/encoder/analysis.h            |    2 +-
 source/encoder/encoder.cpp           |   15 +
 source/encoder/entropy.h             |    3 +-
 source/test/rate-control-tests.txt   |   52 ++--
 source/x265.h                        |    6 +
 source/x265cli.h                     |    2 +
 17 files changed, 665 insertions(+), 88 deletions(-)

diffs (truncated from 1087 to 300 lines):

diff -r 0ce13ce29304 -r a72f08e05ab9 doc/reST/cli.rst
--- a/doc/reST/cli.rst	Mon Apr 06 21:02:36 2015 -0500
+++ b/doc/reST/cli.rst	Tue Apr 07 18:36:17 2015 +0530
@@ -437,7 +437,7 @@ Profile, Level, Tier
 	times 10, for example level **5.1** is specified as "5.1" or "51",
 	and level **5.0** is specified as "5.0" or "50".
 
-	Annex A levels: 1, 2, 2.1, 3, 3.1, 4, 4.1, 5, 5.1, 5.2, 6, 6.1, 6.2
+	Annex A levels: 1, 2, 2.1, 3, 3.1, 4, 4.1, 5, 5.1, 5.2, 6, 6.1, 6.2, 8.5
 
 .. option:: --high-tier, --no-high-tier
 
@@ -1122,6 +1122,13 @@ Quality, rate control and rate distortio
 
 	**Range of values:** 0.0 to 3.0
 
+.. option:: --qg-size <64|32|16>
+	Enable adaptive quantization for sub-CTUs. This parameter specifies 
+	the minimum CU size at which QP can be adjusted, ie. Quantization Group
+	size. Allowed range of values are 64, 32, 16 provided this falls within 
+	the inclusive range [maxCUSize, minCUSize]. Experimental.
+	Default: same as maxCUSize
+
 .. option:: --cutree, --no-cutree
 
 	Enable the use of lookahead's lowres motion vector fields to
diff -r 0ce13ce29304 -r a72f08e05ab9 source/CMakeLists.txt
--- a/source/CMakeLists.txt	Mon Apr 06 21:02:36 2015 -0500
+++ b/source/CMakeLists.txt	Tue Apr 07 18:36:17 2015 +0530
@@ -30,7 +30,7 @@ option(STATIC_LINK_CRT "Statically link 
 mark_as_advanced(FPROFILE_USE FPROFILE_GENERATE NATIVE_BUILD)
 
 # X265_BUILD must be incremented each time the public API is changed
-set(X265_BUILD 53)
+set(X265_BUILD 54)
 configure_file("${PROJECT_SOURCE_DIR}/x265.def.in"
                "${PROJECT_BINARY_DIR}/x265.def")
 configure_file("${PROJECT_SOURCE_DIR}/x265_config.h.in"
diff -r 0ce13ce29304 -r a72f08e05ab9 source/common/param.cpp
--- a/source/common/param.cpp	Mon Apr 06 21:02:36 2015 -0500
+++ b/source/common/param.cpp	Tue Apr 07 18:36:17 2015 +0530
@@ -209,6 +209,7 @@ void x265_param_default(x265_param* para
     param->rc.zones = NULL;
     param->rc.bEnableSlowFirstPass = 0;
     param->rc.bStrictCbr = 0;
+    param->rc.qgSize = 64; /* Same as maxCUSize */
 
     /* Video Usability Information (VUI) */
     param->vui.aspectRatioIdc = 0;
@@ -263,6 +264,7 @@ int x265_param_default_preset(x265_param
             param->rc.aqStrength = 0.0;
             param->rc.aqMode = X265_AQ_NONE;
             param->rc.cuTree = 0;
+            param->rc.qgSize = 32;
             param->bEnableFastIntra = 1;
         }
         else if (!strcmp(preset, "superfast"))
@@ -279,6 +281,7 @@ int x265_param_default_preset(x265_param
             param->rc.aqStrength = 0.0;
             param->rc.aqMode = X265_AQ_NONE;
             param->rc.cuTree = 0;
+            param->rc.qgSize = 32;
             param->bEnableSAO = 0;
             param->bEnableFastIntra = 1;
         }
@@ -292,6 +295,7 @@ int x265_param_default_preset(x265_param
             param->rdLevel = 2;
             param->maxNumReferences = 1;
             param->rc.cuTree = 0;
+            param->rc.qgSize = 32;
             param->bEnableFastIntra = 1;
         }
         else if (!strcmp(preset, "faster"))
@@ -844,6 +848,7 @@ int x265_param_parse(x265_param* p, cons
     OPT2("pools", "numa-pools") p->numaPools = strdup(value);
     OPT("lambda-file") p->rc.lambdaFileName = strdup(value);
     OPT("analysis-file") p->analysisFileName = strdup(value);
+    OPT("qg-size") p->rc.qgSize = atoi(value);
     else
         return X265_PARAM_BAD_NAME;
 #undef OPT
diff -r 0ce13ce29304 -r a72f08e05ab9 source/common/quant.cpp
--- a/source/common/quant.cpp	Mon Apr 06 21:02:36 2015 -0500
+++ b/source/common/quant.cpp	Tue Apr 07 18:36:17 2015 +0530
@@ -50,11 +50,6 @@ inline int fastMin(int x, int y)
     return y + ((x - y) & ((x - y) >> (sizeof(int) * CHAR_BIT - 1))); // min(x, y)
 }
 
-inline int fastMax(int x, int y)
-{
-    return x - ((x - y) & ((x - y) >> (sizeof(int) * CHAR_BIT - 1))); // max(x, y)
-}
-
 inline int getICRate(uint32_t absLevel, int32_t diffLevel, const int* greaterOneBits, const int* levelAbsBits, const uint32_t absGoRice, const uint32_t maxVlc, uint32_t c1c2Idx)
 {
     X265_CHECK(c1c2Idx <= 3, "c1c2Idx check failure\n");
@@ -585,6 +580,7 @@ uint32_t Quant::rdoQuant(const CUData& c
     TUEntropyCodingParameters codeParams;
     cu.getTUEntropyCodingParameters(codeParams, absPartIdx, log2TrSize, bIsLuma);
     const uint32_t cgNum = 1 << (codeParams.log2TrSizeCG * 2);
+    const uint32_t cgStride = (trSize >> MLS_CG_LOG2_SIZE);
 
     /* TODO: update bit estimates if dirty */
     EstBitsSbac& estBitsSbac = m_entropyCoder->m_estBitsSbac;
@@ -601,7 +597,7 @@ uint32_t Quant::rdoQuant(const CUData& c
         const uint64_t cgBlkPosMask = ((uint64_t)1 << cgBlkPos);
         memset(&cgRdStats, 0, sizeof(coeffGroupRDStats));
 
-        const int patternSigCtx = calcPatternSigCtx(sigCoeffGroupFlag64, cgPosX, cgPosY, cgBlkPos, (trSize >> MLS_CG_LOG2_SIZE));
+        const int patternSigCtx = calcPatternSigCtx(sigCoeffGroupFlag64, cgPosX, cgPosY, cgBlkPos, cgStride);
 
         /* iterate over coefficients in each group in reverse scan order */
         for (int scanPosinCG = cgSize - 1; scanPosinCG >= 0; scanPosinCG--)
@@ -824,7 +820,7 @@ uint32_t Quant::rdoQuant(const CUData& c
              * of the significant coefficient group flag and evaluate whether the RD cost of the
              * coded group is more than the RD cost of the uncoded group */
 
-            uint32_t sigCtx = getSigCoeffGroupCtxInc(sigCoeffGroupFlag64, cgPosX, cgPosY, cgBlkPos, (trSize >> MLS_CG_LOG2_SIZE));
+            uint32_t sigCtx = getSigCoeffGroupCtxInc(sigCoeffGroupFlag64, cgPosX, cgPosY, cgBlkPos, cgStride);
 
             int64_t costZeroCG = totalRdCost + SIGCOST(estBitsSbac.significantCoeffGroupBits[sigCtx][0]);
             costZeroCG += cgRdStats.uncodedDist;       /* add distortion for resetting non-zero levels to zero levels */
@@ -841,23 +837,17 @@ uint32_t Quant::rdoQuant(const CUData& c
                 costCoeffGroupSig[cgScanPos] = SIGCOST(estBitsSbac.significantCoeffGroupBits[sigCtx][0]);
 
                 /* reset all coeffs to 0. UNCODE THIS COEFF GROUP! */
-                for (int scanPosinCG = cgSize - 1; scanPosinCG >= 0; scanPosinCG--)
-                {
-                    scanPos = cgScanPos * cgSize + scanPosinCG;
-                    uint32_t blkPos = codeParams.scan[scanPos];
-                    if (dstCoeff[blkPos])
-                    {
-                        costCoeff[scanPos] = costUncoded[scanPos];
-                        costSig[scanPos] = 0;
-                    }
-                    dstCoeff[blkPos] = 0;
-                }
+                const uint32_t blkPos = codeParams.scan[cgScanPos * cgSize];
+                memset(&dstCoeff[blkPos + 0 * trSize], 0, 4 * sizeof(*dstCoeff));
+                memset(&dstCoeff[blkPos + 1 * trSize], 0, 4 * sizeof(*dstCoeff));
+                memset(&dstCoeff[blkPos + 2 * trSize], 0, 4 * sizeof(*dstCoeff));
+                memset(&dstCoeff[blkPos + 3 * trSize], 0, 4 * sizeof(*dstCoeff));
             }
         }
         else
         {
             /* there were no coded coefficients in this coefficient group */
-            uint32_t ctxSig = getSigCoeffGroupCtxInc(sigCoeffGroupFlag64, cgPosX, cgPosY, cgBlkPos, (trSize >> MLS_CG_LOG2_SIZE));
+            uint32_t ctxSig = getSigCoeffGroupCtxInc(sigCoeffGroupFlag64, cgPosX, cgPosY, cgBlkPos, cgStride);
             costCoeffGroupSig[cgScanPos] = SIGCOST(estBitsSbac.significantCoeffGroupBits[ctxSig][0]);
             totalRdCost += costCoeffGroupSig[cgScanPos];  /* add cost of 0 bit in significant CG bitmap */
             totalRdCost -= cgRdStats.sigCost;             /* remove cost of significant coefficient bitmap */
diff -r 0ce13ce29304 -r a72f08e05ab9 source/common/x86/asm-primitives.cpp
--- a/source/common/x86/asm-primitives.cpp	Mon Apr 06 21:02:36 2015 -0500
+++ b/source/common/x86/asm-primitives.cpp	Tue Apr 07 18:36:17 2015 +0530
@@ -1761,6 +1761,10 @@ void setupAssemblyPrimitives(EncoderPrim
         p.cu[BLOCK_8x8].intra_pred[12] = x265_intra_pred_ang8_12_avx2;
         p.cu[BLOCK_8x8].intra_pred[24] = x265_intra_pred_ang8_24_avx2;
         p.cu[BLOCK_8x8].intra_pred[11] = x265_intra_pred_ang8_11_avx2;
+        p.cu[BLOCK_16x16].intra_pred[7] = x265_intra_pred_ang16_7_avx2;
+        p.cu[BLOCK_16x16].intra_pred[8] = x265_intra_pred_ang16_8_avx2;
+        p.cu[BLOCK_16x16].intra_pred[9] = x265_intra_pred_ang16_9_avx2;
+        p.cu[BLOCK_16x16].intra_pred[11] = x265_intra_pred_ang16_11_avx2;
         p.cu[BLOCK_16x16].intra_pred[25] = x265_intra_pred_ang16_25_avx2;
         p.cu[BLOCK_16x16].intra_pred[28] = x265_intra_pred_ang16_28_avx2;
         p.cu[BLOCK_16x16].intra_pred[27] = x265_intra_pred_ang16_27_avx2;
@@ -2037,6 +2041,16 @@ void setupAssemblyPrimitives(EncoderPrim
 
         p.pu[LUMA_16x16].luma_hvpp = x265_interp_8tap_hv_pp_16x16_avx2;
 
+        p.pu[LUMA_32x8].convert_p2s = x265_filterPixelToShort_32x8_avx2;
+        p.pu[LUMA_32x16].convert_p2s = x265_filterPixelToShort_32x16_avx2;
+        p.pu[LUMA_32x24].convert_p2s = x265_filterPixelToShort_32x24_avx2;
+        p.pu[LUMA_32x32].convert_p2s = x265_filterPixelToShort_32x32_avx2;
+        p.pu[LUMA_32x64].convert_p2s = x265_filterPixelToShort_32x64_avx2;
+        p.pu[LUMA_64x16].convert_p2s = x265_filterPixelToShort_64x16_avx2;
+        p.pu[LUMA_64x32].convert_p2s = x265_filterPixelToShort_64x32_avx2;
+        p.pu[LUMA_64x48].convert_p2s = x265_filterPixelToShort_64x48_avx2;
+        p.pu[LUMA_64x64].convert_p2s = x265_filterPixelToShort_64x64_avx2;
+
         if ((cpuMask & X265_CPU_BMI1) && (cpuMask & X265_CPU_BMI2))
             p.findPosLast = x265_findPosLast_x64;
     }
diff -r 0ce13ce29304 -r a72f08e05ab9 source/common/x86/intrapred.h
--- a/source/common/x86/intrapred.h	Mon Apr 06 21:02:36 2015 -0500
+++ b/source/common/x86/intrapred.h	Tue Apr 07 18:36:17 2015 +0530
@@ -233,6 +233,10 @@ void x265_intra_pred_ang8_25_avx2(pixel*
 void x265_intra_pred_ang8_12_avx2(pixel* dst, intptr_t dstStride, const pixel* srcPix, int dirMode, int bFilter);
 void x265_intra_pred_ang8_24_avx2(pixel* dst, intptr_t dstStride, const pixel* srcPix, int dirMode, int bFilter);
 void x265_intra_pred_ang8_11_avx2(pixel* dst, intptr_t dstStride, const pixel* srcPix, int dirMode, int bFilter);
+void x265_intra_pred_ang16_7_avx2(pixel* dst, intptr_t dstStride, const pixel* srcPix, int dirMode, int bFilter);
+void x265_intra_pred_ang16_8_avx2(pixel* dst, intptr_t dstStride, const pixel* srcPix, int dirMode, int bFilter);
+void x265_intra_pred_ang16_9_avx2(pixel* dst, intptr_t dstStride, const pixel* srcPix, int dirMode, int bFilter);
+void x265_intra_pred_ang16_11_avx2(pixel* dst, intptr_t dstStride, const pixel* srcPix, int dirMode, int bFilter);
 void x265_intra_pred_ang16_25_avx2(pixel* dst, intptr_t dstStride, const pixel* srcPix, int dirMode, int bFilter);
 void x265_intra_pred_ang16_28_avx2(pixel* dst, intptr_t dstStride, const pixel* srcPix, int dirMode, int bFilter);
 void x265_intra_pred_ang16_27_avx2(pixel* dst, intptr_t dstStride, const pixel* srcPix, int dirMode, int bFilter);
diff -r 0ce13ce29304 -r a72f08e05ab9 source/common/x86/intrapred8.asm
--- a/source/common/x86/intrapred8.asm	Mon Apr 06 21:02:36 2015 -0500
+++ b/source/common/x86/intrapred8.asm	Tue Apr 07 18:36:17 2015 +0530
@@ -123,6 +123,15 @@ c_ang16_mode_25:      db 2, 30, 2, 30, 2
                       db 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4
                       db 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 32, 0, 32, 0, 32, 0, 32, 0, 32, 0, 32, 0, 32, 0, 32, 0
 
+ALIGN 32
+c_ang16_mode_11:      db 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14
+                      db 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12
+                      db 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10
+                      db 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8
+                      db 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6
+                      db 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4
+                      db 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2
+                      db 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 32, 0, 32, 0, 32, 0, 32, 0, 32, 0, 32, 0, 32, 0, 32, 0
 
 ALIGN 32
 c_ang16_mode_28:      db 27, 5, 27, 5, 27, 5, 27, 5, 27, 5, 27, 5, 27, 5, 27, 5, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10
@@ -134,6 +143,15 @@ c_ang16_mode_28:      db 27, 5, 27, 5, 2
                       db 31, 1, 31, 1, 31, 1, 31, 1, 31, 1, 31, 1, 31, 1, 31, 1, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6
                       db 21, 11, 21, 11, 21, 11, 21, 11, 21, 11, 21, 11, 21, 11, 21, 11, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16
 
+ALIGN 32
+c_ang16_mode_9:       db 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18
+                      db 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20
+                      db 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22
+                      db 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24
+                      db 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26
+                      db 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28
+                      db 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30
+                      db 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 32, 0, 32, 0, 32, 0, 32, 0, 32, 0, 32, 0, 32, 0, 32, 0
 
 ALIGN 32
 c_ang16_mode_27:      db 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4
@@ -149,6 +167,15 @@ c_ang16_mode_27:      db 30, 2, 30, 2, 3
 ALIGN 32
 intra_pred_shuff_0_15: db 0, 1, 1, 2, 2, 3, 3, 4, 4, 5, 5, 6, 6, 7, 7, 8, 8, 9, 9, 10, 10, 11, 11, 12, 12, 13, 13, 14, 14, 15, 15, 15
 
+ALIGN 32
+c_ang16_mode_8:       db 27, 5, 27, 5, 27, 5, 27, 5, 27, 5, 27, 5, 27, 5, 27, 5, 19, 13, 19, 13, 19, 13, 19, 13, 19, 13, 19, 13, 19, 13, 19, 13
+                      db 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18
+                      db 17, 15, 17, 15, 17, 15, 17, 15, 17, 15, 17, 15, 17, 15, 17, 15, 9, 23, 9, 23, 9, 23, 9, 23, 9, 23, 9, 23, 9, 23, 9, 23
+                      db 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28
+                      db 7, 25, 7, 25, 7, 25, 7, 25, 7, 25, 7, 25, 7, 25, 7, 25, 31, 1, 31, 1, 31, 1, 31, 1, 31, 1, 31, 1, 31, 1, 31, 1
+                      db 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6
+                      db 29, 3, 29, 3, 29, 3, 29, 3, 29, 3, 29, 3, 29, 3, 29, 3, 21, 11, 21, 11, 21, 11, 21, 11, 21, 11, 21, 11, 21, 11, 21, 11
+                      db 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16
 
 ALIGN 32
 c_ang16_mode_29:     db 23, 9, 23, 9, 23, 9, 23, 9, 23, 9, 23, 9, 23, 9, 23, 9,  14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18
@@ -161,6 +188,15 @@ c_ang16_mode_29:     db 23, 9, 23, 9, 23
                      db 11, 21, 11, 21, 11, 21, 11, 21, 11, 21, 11, 21, 11, 21, 11, 21, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30
                      db 25, 7, 25, 7, 25, 7, 25, 7, 25, 7, 25, 7, 25, 7, 25, 7, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16
 
+ALIGN 32
+c_ang16_mode_7:      db 23, 9, 23, 9, 23, 9, 23, 9, 23, 9, 23, 9, 23, 9, 23, 9, 15, 17, 15, 17, 15, 17, 15, 17, 15, 17, 15, 17, 15, 17, 15, 17
+                     db 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26
+                     db 5, 27, 5, 27, 5, 27, 5, 27, 5, 27, 5, 27, 5, 27, 5, 27, 29, 3, 29, 3, 29, 3, 29, 3, 29, 3, 29, 3, 29, 3, 29, 3
+                     db 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12
+                     db 19, 13, 19, 13, 19, 13, 19, 13, 19, 13, 19, 13, 19, 13, 19, 13, 11, 21, 11, 21, 11, 21, 11, 21, 11, 21, 11, 21, 11, 21, 11, 21
+                     db 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30
+                     db 1, 31, 1, 31, 1, 31, 1, 31, 1, 31, 1, 31, 1, 31, 1, 31, 25, 7, 25, 7, 25, 7, 25, 7, 25, 7, 25, 7, 25, 7, 25, 7
+                     db 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16
 
 ALIGN 32
 c_ang16_mode_30:      db 19, 13, 19, 13, 19, 13, 19, 13, 19, 13, 19, 13, 19, 13, 19, 13, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26
@@ -11894,6 +11930,294 @@ cglobal intra_pred_ang8_24, 3, 5, 5
     movu              [%2], xm3
 %endmacro
 
+%if ARCH_X86_64 == 1
+%macro INTRA_PRED_TRANS_STORE_16x16 0
+    punpcklbw    m8, m0, m1
+    punpckhbw    m0, m1
+
+    punpcklbw    m1, m2, m3
+    punpckhbw    m2, m3
+
+    punpcklbw    m3, m4, m5
+    punpckhbw    m4, m5
+
+    punpcklbw    m5, m6, m7
+    punpckhbw    m6, m7
+
+    punpcklwd    m7, m8, m1
+    punpckhwd    m8, m1
+
+    punpcklwd    m1, m3, m5
+    punpckhwd    m3, m5
+
+    punpcklwd    m5, m0, m2
+    punpckhwd    m0, m2
+
+    punpcklwd    m2, m4, m6
+    punpckhwd    m4, m6
+
+    punpckldq    m6, m7, m1
+    punpckhdq    m7, m1
+
+    punpckldq    m1, m8, m3
+    punpckhdq    m8, m3
+


More information about the x265-commits mailing list