[x265-commits] [x265] analysis: re-order RD 5/6 analysis to do splits before ME...

Wed Jul 15 20:05:55 CEST 2015

details:   http://hg.videolan.org/x265/rev/42e55c6eafb0
branches:  
changeset: 10816:42e55c6eafb0
user:      Ashok Kumar Mishra<ashok at multicorewareinc.com>
date:      Thu May 21 19:16:28 2015 +0530
description:
analysis: re-order RD 5/6 analysis to do splits before ME or intra

This commit changes outputs because splits used to be avoided when an inter or
intra mode was chosen without residual coding. This recursion early-out is no
longer possible. Only merge without residual (aka skip) can abort recursion.

This commit changes the order of analysis such that the four split blocks are
analyzed prior to attempting any ME or intra modes. Future commits will use
the knowledge learned during split analysis to avoid unlikely work at the
current depth (reducing motion references and avoiding unlikely intra,
rectangular, asymmetric, and lossless modes).
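
The ordering described above can be illustrated with a small toy model. This
is only a sketch, not the code in analysis.cpp: ToyCU, ToyResult and
analyzeSplitFirst are hypothetical names, and the per-CU flags stand in for
real RD decisions. It assumes that skip is the only early-out, that the split
blocks are visited before the current depth, and that intra is attempted at
the current depth only when no split was analyzed or some split CU used intra.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical toy model; none of these names are x265 APIs.
struct ToyCU
{
    bool skipIsBest;             // merge without residual wins at this depth
    bool intraIsBest;            // intra would win if it were tried here
    std::vector<ToyCU> children; // empty, or the four split blocks
};

struct ToyResult
{
    bool usedIntra;   // some CU in this subtree selected intra
    int  intraTrials; // number of intra searches actually run
};

ToyResult analyzeSplitFirst(const ToyCU& cu)
{
    ToyResult r = { false, 0 };

    // Only skip (merge without residual) may abort the recursion.
    if (cu.skipIsBest)
        return r;

    // 1. Analyze the split blocks before any ME or intra at this depth.
    bool splitAnalyzed = !cu.children.empty();
    for (const ToyCU& child : cu.children)
    {
        ToyResult c = analyzeSplitFirst(child);
        r.usedIntra = r.usedIntra || c.usedIntra;
        r.intraTrials += c.intraTrials;
    }
    bool splitUsedIntra = r.usedIntra;

    // 2. Inter/ME at the current depth would run here (omitted in this toy).

    // 3. Try intra only if the split analysis gives no reason to avoid it.
    if (!splitAnalyzed || splitUsedIntra)
    {
        r.intraTrials++;
        if (cu.intraIsBest)
            r.usedIntra = true;
    }
    return r;
}
```

In this toy, a parent whose four split blocks all avoided intra runs no intra
search of its own, while a single intra split block re-enables it.
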
Subject: [x265] analysis: at RD 5/6 avoid motion references if not used by split blocks

details:   http://hg.videolan.org/x265/rev/c19a4ae5cf7d
branches:  
changeset: 10817:c19a4ae5cf7d
user:      Ashok Kumar Mishra<ashok at multicorewareinc.com>
date:      Tue Jun 23 20:35:07 2015 +0530
description:
analysis: at RD 5/6 avoid motion references if not used by split blocks
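
A minimal sketch of how a reference mask gathered from split blocks could be
built and queried. The bit layout follows getBestRefIdx() in the cudata.h hunk
of this diff (list 0 reference n in bit n, list 1 reference n in bit n + 16).
The function names bestRefMask and refAllowed are hypothetical, and the
explicit guard for unused lists is an addition of this sketch rather than
something taken from the patch.

```cpp
#include <cassert>
#include <cstdint>

// interDir: bit 0 set when list 0 is used, bit 1 set when list 1 is used.
// refIdxL0 / refIdxL1: reference index chosen in each list (ignored when
// the corresponding list is unused).
uint32_t bestRefMask(uint32_t interDir, int refIdxL0, int refIdxL1)
{
    uint32_t mask = 0;
    if (interDir & 1)
        mask |= 1u << refIdxL0;        // list 0 occupies bits 0..15
    if (interDir & 2)
        mask |= 1u << (refIdxL1 + 16); // list 1 occupies bits 16..31
    return mask;
}

// Hypothetical query: was this reference picked by any split block whose
// mask was OR-ed into splitMask?
bool refAllowed(uint32_t splitMask, int list, int refIdx)
{
    return ((splitMask >> (refIdx + 16 * list)) & 1u) != 0;
}
```

OR-ing the masks of the four split blocks gives the set of references that
motion estimation at the parent depth would still consider under this scheme.
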
Subject: [x265] analysis: skip intra in RD 5/6 if split was analyzed and no split CUs used intra

details:   http://hg.videolan.org/x265/rev/ef9cd36f3672
branches:  
changeset: 10818:ef9cd36f3672
user:      Ashok Kumar Mishra<ashok at multicorewareinc.com>
date:      Tue Jun 23 20:35:10 2015 +0530
description:
analysis: skip intra in RD 5/6 if split was analyzed and no split CUs used intra
Subject: [x265] stats: RD 5/6 profile effectiveness of avoiding intra if split CUs did not select it

details:   http://hg.videolan.org/x265/rev/af57c28db2ff
branches:  
changeset: 10819:af57c28db2ff
user:      Ashok Kumar Mishra<ashok at multicorewareinc.com>
date:      Tue Jun 23 20:35:13 2015 +0530
description:
stats: RD 5/6 profile effectiveness of avoiding intra if split CUs did not select it
Subject: [x265] analysis: respect X265_REF_LIMIT_DEPTH with RD 5/6

details:   http://hg.videolan.org/x265/rev/98bfdc49b66e
branches:  
changeset: 10820:98bfdc49b66e
user:      Ashok Kumar Mishra<ashok at multicorewareinc.com>
date:      Tue Jun 23 20:35:17 2015 +0530
description:
analysis: respect X265_REF_LIMIT_DEPTH with RD 5/6
Subject: [x265] analysis: model the effectiveness of --limit-ref with RD 5/6

details:   http://hg.videolan.org/x265/rev/63f8b338f2be
branches:  
changeset: 10821:63f8b338f2be
user:      Ashok Kumar Mishra<ashok at multicorewareinc.com>
date:      Tue Jun 23 20:35:20 2015 +0530
description:
analysis: model the effectiveness of --limit-ref with RD 5/6
Subject: [x265] Regression Test: added new command line --ref-limits for RD-5/6 in regression-tests.txt

details:   http://hg.videolan.org/x265/rev/8521e8d7a477
branches:  
changeset: 10822:8521e8d7a477
user:      Ashok Kumar Mishra<ashok at multicorewareinc.com>
date:      Tue Jun 23 20:35:24 2015 +0530
description:
Regression Test: added new command line --ref-limits for RD-5/6 in regression-tests.txt
Subject: [x265] analysis: removed switch-case to read the best ref index

details:   http://hg.videolan.org/x265/rev/19f3f98b5c73
branches:  
changeset: 10823:19f3f98b5c73
user:      Ashok Kumar Mishra<ashok at multicorewareinc.com>
date:      Tue Jun 23 20:35:28 2015 +0530
description:
analysis: removed switch-case to read the best ref index
Subject: [x265] analysis: used CUData helper function to get number of PUs and offset

details:   http://hg.videolan.org/x265/rev/a850ecb0895b
branches:  
changeset: 10824:a850ecb0895b
user:      Ashok Kumar Mishra<ashok at multicorewareinc.com>
date:      Tue Jun 23 20:35:34 2015 +0530
description:
analysis: used CUData helper function to get number of PUs and offset
Subject: [x265] entropy: removed g_puOffset table

details:   http://hg.videolan.org/x265/rev/35029d6001c5
branches:  
changeset: 10825:35029d6001c5
user:      Ashok Kumar Mishra<ashok at multicorewareinc.com>
date:      Tue Jun 23 20:35:38 2015 +0530
description:
entropy: removed g_puOffset table
Subject: [x265] dither: fix bitdepth check

details:   http://hg.videolan.org/x265/rev/8fc3ea4894c2
branches:  
changeset: 10826:8fc3ea4894c2
user:      Deepthi Nandakumar <deepthi at multicorewareinc.com>
date:      Wed Jul 15 11:54:24 2015 +0530
description:
dither: fix bitdepth check
Subject: [x265] cli: add 12-bit to showHelp

details:   http://hg.videolan.org/x265/rev/54689cbd2e01
branches:  
changeset: 10827:54689cbd2e01
user:      Deepthi Nandakumar <deepthi at multicorewareinc.com>
date:      Wed Jul 15 12:24:46 2015 +0530
description:
cli: add 12-bit to showHelp
Subject: [x265] asm: fix intra_pred_dc_sse2 in Main12

details:   http://hg.videolan.org/x265/rev/8efce8620ae2
branches:  
changeset: 10828:8efce8620ae2
user:      Min Chen <chenm003 at 163.com>
date:      Tue Jul 14 16:29:46 2015 -0700
description:
asm: fix intra_pred_dc_sse2 in Main12
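
One plausible reading of the psraw to psrlw changes in the intrapred16.asm
hunks below, sketched in C++ under the assumption that Main12 samples range up
to 4095. The 8x8 DC sum of 16 neighbours plus the rounding term can then reach
65528, which sets the top bit of a 16-bit lane, so an arithmetic shift treats
it as negative. The helper names are hypothetical and model a single 16-bit
lane, not the SIMD instructions themselves.

```cpp
#include <cassert>
#include <cstdint>

// Logical right shift of one 16-bit lane (psrlw-like): zero fill.
uint16_t shiftLogical16(uint16_t lane, int count)
{
    return static_cast<uint16_t>(lane >> count);
}

// Arithmetic right shift of one 16-bit lane (psraw-like): the sign bit is
// replicated into the vacated positions. Valid for count in 1..15.
uint16_t shiftArithmetic16(uint16_t lane, int count)
{
    uint16_t shifted = static_cast<uint16_t>(lane >> count);
    if (lane & 0x8000)
        shifted |= static_cast<uint16_t>(0xFFFFu << (16 - count));
    return shifted;
}

// DC of 16 equal neighbours for an 8x8 block: (sum + 8) >> 4, computed in a
// 16-bit lane as the SSE2 code does.
uint16_t dc8(uint16_t sample, bool logicalShift)
{
    uint16_t sum = static_cast<uint16_t>(16u * sample + 8u);
    return logicalShift ? shiftLogical16(sum, 4) : shiftArithmetic16(sum, 4);
}
```

With sample = 4095, dc8 yields 4095 with the logical shift and 65535 with the
arithmetic one. For the 16x16 and 32x32 sums even an unsigned 16-bit lane is
too small, which is consistent with those hunks widening to dwords first.
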

diffstat:

 source/common/cudata.cpp          |   32 ----
 source/common/cudata.h            |   37 ++++-
 source/common/x86/intrapred16.asm |   61 +++----
 source/encoder/analysis.cpp       |  281 ++++++++++++++++++++++---------------
 source/encoder/analysis.h         |    4 +-
 source/encoder/entropy.cpp        |    7 +-
 source/encoder/entropy.h          |    2 -
 source/encoder/search.cpp         |    2 +-
 source/test/regression-tests.txt  |   24 +-
 source/x265-extras.cpp            |    4 +-
 source/x265cli.h                  |    2 +-
 11 files changed, 247 insertions(+), 209 deletions(-)

diffs (truncated from 930 to 300 lines):

diff -r 8023786c5247 -r 8efce8620ae2 source/common/cudata.cpp
--- a/source/common/cudata.cpp	Mon Jul 13 17:38:02 2015 -0700
+++ b/source/common/cudata.cpp	Tue Jul 14 16:29:46 2015 -0700
@@ -112,38 +112,6 @@ inline MV scaleMv(MV mv, int scale)
     return MV((int16_t)mvx, (int16_t)mvy);
 }
 
-// Partition table.
-// First index is partitioning mode. Second index is partition index.
-// Third index is 0 for partition sizes, 1 for partition offsets. The 
-// sizes and offsets are encoded as two packed 4-bit values (X,Y). 
-// X and Y represent 1/4 fractions of the block size.
-const uint32_t partTable[8][4][2] =
-{
-    //        XY
-    { { 0x44, 0x00 }, { 0x00, 0x00 }, { 0x00, 0x00 }, { 0x00, 0x00 } }, // SIZE_2Nx2N.
-    { { 0x42, 0x00 }, { 0x42, 0x02 }, { 0x00, 0x00 }, { 0x00, 0x00 } }, // SIZE_2NxN.
-    { { 0x24, 0x00 }, { 0x24, 0x20 }, { 0x00, 0x00 }, { 0x00, 0x00 } }, // SIZE_Nx2N.
-    { { 0x22, 0x00 }, { 0x22, 0x20 }, { 0x22, 0x02 }, { 0x22, 0x22 } }, // SIZE_NxN.
-    { { 0x41, 0x00 }, { 0x43, 0x01 }, { 0x00, 0x00 }, { 0x00, 0x00 } }, // SIZE_2NxnU.
-    { { 0x43, 0x00 }, { 0x41, 0x03 }, { 0x00, 0x00 }, { 0x00, 0x00 } }, // SIZE_2NxnD.
-    { { 0x14, 0x00 }, { 0x34, 0x10 }, { 0x00, 0x00 }, { 0x00, 0x00 } }, // SIZE_nLx2N.
-    { { 0x34, 0x00 }, { 0x14, 0x30 }, { 0x00, 0x00 }, { 0x00, 0x00 } }  // SIZE_nRx2N.
-};
-
-// Partition Address table.
-// First index is partitioning mode. Second index is partition address.
-const uint32_t partAddrTable[8][4] =
-{
-    { 0x00, 0x00, 0x00, 0x00 }, // SIZE_2Nx2N.
-    { 0x00, 0x08, 0x08, 0x08 }, // SIZE_2NxN.
-    { 0x00, 0x04, 0x04, 0x04 }, // SIZE_Nx2N.
-    { 0x00, 0x04, 0x08, 0x0C }, // SIZE_NxN.
-    { 0x00, 0x02, 0x02, 0x02 }, // SIZE_2NxnU.
-    { 0x00, 0x0A, 0x0A, 0x0A }, // SIZE_2NxnD.
-    { 0x00, 0x01, 0x01, 0x01 }, // SIZE_nLx2N.
-    { 0x00, 0x05, 0x05, 0x05 }  // SIZE_nRx2N.
-};
-
 }
 
 cubcast_t CUData::s_partSet[NUM_FULL_DEPTH] = { NULL, NULL, NULL, NULL, NULL };
diff -r 8023786c5247 -r 8efce8620ae2 source/common/cudata.h
--- a/source/common/cudata.h	Mon Jul 13 17:38:02 2015 -0700
+++ b/source/common/cudata.h	Tue Jul 14 16:29:46 2015 -0700
@@ -121,6 +121,38 @@ typedef void(*cubcast_t)(uint8_t* dst, u
 // Partition count table, index represents partitioning mode.
 const uint32_t nbPartsTable[8] = { 1, 2, 2, 4, 2, 2, 2, 2 };
 
+// Partition table.
+// First index is partitioning mode. Second index is partition index.
+// Third index is 0 for partition sizes, 1 for partition offsets. The 
+// sizes and offsets are encoded as two packed 4-bit values (X,Y). 
+// X and Y represent 1/4 fractions of the block size.
+const uint32_t partTable[8][4][2] =
+{
+    //        XY
+    { { 0x44, 0x00 }, { 0x00, 0x00 }, { 0x00, 0x00 }, { 0x00, 0x00 } }, // SIZE_2Nx2N.
+    { { 0x42, 0x00 }, { 0x42, 0x02 }, { 0x00, 0x00 }, { 0x00, 0x00 } }, // SIZE_2NxN.
+    { { 0x24, 0x00 }, { 0x24, 0x20 }, { 0x00, 0x00 }, { 0x00, 0x00 } }, // SIZE_Nx2N.
+    { { 0x22, 0x00 }, { 0x22, 0x20 }, { 0x22, 0x02 }, { 0x22, 0x22 } }, // SIZE_NxN.
+    { { 0x41, 0x00 }, { 0x43, 0x01 }, { 0x00, 0x00 }, { 0x00, 0x00 } }, // SIZE_2NxnU.
+    { { 0x43, 0x00 }, { 0x41, 0x03 }, { 0x00, 0x00 }, { 0x00, 0x00 } }, // SIZE_2NxnD.
+    { { 0x14, 0x00 }, { 0x34, 0x10 }, { 0x00, 0x00 }, { 0x00, 0x00 } }, // SIZE_nLx2N.
+    { { 0x34, 0x00 }, { 0x14, 0x30 }, { 0x00, 0x00 }, { 0x00, 0x00 } }  // SIZE_nRx2N.
+};
+
+// Partition Address table.
+// First index is partitioning mode. Second index is partition address.
+const uint32_t partAddrTable[8][4] =
+{
+    { 0x00, 0x00, 0x00, 0x00 }, // SIZE_2Nx2N.
+    { 0x00, 0x08, 0x08, 0x08 }, // SIZE_2NxN.
+    { 0x00, 0x04, 0x04, 0x04 }, // SIZE_Nx2N.
+    { 0x00, 0x04, 0x08, 0x0C }, // SIZE_NxN.
+    { 0x00, 0x02, 0x02, 0x02 }, // SIZE_2NxnU.
+    { 0x00, 0x0A, 0x0A, 0x0A }, // SIZE_2NxnD.
+    { 0x00, 0x01, 0x01, 0x01 }, // SIZE_nLx2N.
+    { 0x00, 0x05, 0x05, 0x05 }  // SIZE_nRx2N.
+};
+
 // Holds part data for a CU of a given size, from an 8x8 CU to a CTU
 class CUData
 {
@@ -222,8 +254,11 @@ public:
     void     getNeighbourMV(uint32_t puIdx, uint32_t absPartIdx, InterNeighbourMV* neighbours) const;
     void     getIntraTUQtDepthRange(uint32_t tuDepthRange[2], uint32_t absPartIdx) const;
     void     getInterTUQtDepthRange(uint32_t tuDepthRange[2], uint32_t absPartIdx) const;
+    uint32_t getBestRefIdx(uint32_t subPartIdx) const { return ((m_interDir[subPartIdx] & 1) << m_refIdx[0][subPartIdx]) | 
+                                                              (((m_interDir[subPartIdx] >> 1) & 1) << (m_refIdx[1][subPartIdx] + 16)); }
+    uint32_t getPUOffset(uint32_t puIdx, uint32_t absPartIdx) const { return (partAddrTable[(int)m_partSize[absPartIdx]][puIdx] << (g_unitSizeDepth - m_cuDepth[absPartIdx]) * 2) >> 4; }
 
-    uint32_t getNumPartInter() const              { return nbPartsTable[(int)m_partSize[0]]; }
+    uint32_t getNumPartInter(uint32_t absPartIdx) const              { return nbPartsTable[(int)m_partSize[absPartIdx]]; }
     bool     isIntra(uint32_t absPartIdx) const   { return m_predMode[absPartIdx] == MODE_INTRA; }
     bool     isInter(uint32_t absPartIdx) const   { return !!(m_predMode[absPartIdx] & MODE_INTER); }
     bool     isSkipped(uint32_t absPartIdx) const { return m_predMode[absPartIdx] == MODE_SKIP; }
diff -r 8023786c5247 -r 8efce8620ae2 source/common/x86/intrapred16.asm
--- a/source/common/x86/intrapred16.asm	Mon Jul 13 17:38:02 2015 -0700
+++ b/source/common/x86/intrapred16.asm	Tue Jul 14 16:29:46 2015 -0700
@@ -142,7 +142,7 @@ cglobal intra_pred_dc4, 5,6,2
     test        r4d,            r4d
 
     paddw       m0,             [pw_4]
-    psraw       m0,             3
+    psrlw       m0,             3
 
     ; store DC 4x4
     movh        [r0],           m0
@@ -161,7 +161,7 @@ cglobal intra_pred_dc4, 5,6,2
     ; filter top
     movh        m1,             [r2 + 2]
     paddw       m1,             m0
-    psraw       m1,             2
+    psrlw       m1,             2
     movh        [r0],           m1             ; overwrite top-left pixel, we will update it later
 
     ; filter top-left
@@ -176,7 +176,7 @@ cglobal intra_pred_dc4, 5,6,2
     ; filter left
     movu        m1,             [r2 + 20]
     paddw       m1,             m0
-    psraw       m1,             2
+    psrlw       m1,             2
     movd        r3d,            m1
     mov         [r0 + r1 * 2],  r3w
     shr         r3d,            16
@@ -202,7 +202,7 @@ cglobal intra_pred_dc8, 5, 8, 2
     pmaddwd         m0,            [pw_1]
 
     paddw           m0,            [pw_8]
-    psraw           m0,            4              ; sum = sum / 16
+    psrlw           m0,            4              ; sum = sum / 16
     pshuflw         m0,            m0, 0
     pshufd          m0,            m0, 0          ; m0 = word [dc_val ...]
 
@@ -235,7 +235,7 @@ cglobal intra_pred_dc8, 5, 8, 2
     ; filter top
     movu            m0,            [r2 + 2]
     paddw           m0,            m1
-    psraw           m0,            2
+    psrlw           m0,            2
     movu            [r0],          m0
 
     ; filter top-left
@@ -250,7 +250,7 @@ cglobal intra_pred_dc8, 5, 8, 2
     ; filter left
     movu            m0,            [r2 + 36]
     paddw           m0,            m1
-    psraw           m0,            2
+    psrlw           m0,            2
     movh            r3,            m0
     mov             [r0 + r1 * 2], r3w
     shr             r3,            16
@@ -284,14 +284,10 @@ cglobal intra_pred_dc16, 5, 10, 4
     paddw           m0,                  m1
     paddw           m2,                  m3
     paddw           m0,                  m2
-    movhlps         m1,                  m0
-    paddw           m0,                  m1
-    pshuflw         m1,                  m0, 0x6E
-    paddw           m0,                  m1
-    pmaddwd         m0,                  [pw_1]
-
-    paddw           m0,                  [pw_16]
-    psraw           m0,                  5
+    HADDUW          m0,                  m1
+    paddd           m0,                  [pd_16]
+    psrld           m0,                  5
+
     movd            r5d,                 m0
     pshuflw         m0,                  m0, 0 ; m0 = word [dc_val ...]
     pshufd          m0,                  m0, 0
@@ -347,11 +343,11 @@ cglobal intra_pred_dc16, 5, 10, 4
     ; filter top
     movu            m2,                  [r2 + 2]
     paddw           m2,                  m1
-    psraw           m2,                  2
+    psrlw           m2,                  2
     movu            [r0],                m2
     movu            m3,                  [r2 + 18]
     paddw           m3,                  m1
-    psraw           m3,                  2
+    psrlw           m3,                  2
     movu            [r0 + 16],           m3
 
     ; filter top-left
@@ -366,7 +362,7 @@ cglobal intra_pred_dc16, 5, 10, 4
     ; filter left
     movu            m2,                  [r3 + 2]
     paddw           m2,                  m1
-    psraw           m2,                  2
+    psrlw           m2,                  2
 
     movq            r2,                  m2
     pshufd          m2,                  m2, 0xEE
@@ -388,7 +384,7 @@ cglobal intra_pred_dc16, 5, 10, 4
 
     movu            m3,                  [r3 + 18]
     paddw           m3,                  m1
-    psraw           m3,                  2
+    psrlw           m3,                  2
 
     movq            r3,                  m3
     pshufd          m3,                  m3, 0xEE
@@ -423,20 +419,19 @@ cglobal intra_pred_dc32, 3, 4, 6
     paddw           m0,                  m1
     paddw           m2,                  m3
     paddw           m0,                  m2
+    HADDUWD         m0,                  m1
+
     movu            m1,                  [r2]
-    movu            m3,                  [r2 + 16]
-    movu            m4,                  [r2 + 32]
-    movu            m5,                  [r2 + 48]
+    movu            m2,                  [r2 + 16]
+    movu            m3,                  [r2 + 32]
+    movu            m4,                  [r2 + 48]
+    paddw           m1,                  m2
+    paddw           m3,                  m4
     paddw           m1,                  m3
-    paddw           m4,                  m5
-    paddw           m1,                  m4
-    paddw           m0,                  m1
-    movhlps         m1,                  m0
-    paddw           m0,                  m1
-    pshuflw         m1,                  m0, 0x6E
-    paddw           m0,                  m1
-    pmaddwd         m0,                  [pw_1]
-
+    HADDUWD         m1,                  m2
+
+    paddd           m0,                  m1
+    HADDD           m0,                  m1
     paddd           m0,                  [pd_32]     ; sum = sum + 32
     psrld           m0,                  6           ; sum = sum / 64
     pshuflw         m0,                  m0, 0
@@ -487,7 +482,7 @@ cglobal intra_pred_dc16, 3, 9, 4
     phaddw          xm0,                 xm0
     pmaddwd         xm0,                 [pw_1]
     paddd           xm0,                 [pd_16]
-    psrad           xm0,                 5
+    psrld           xm0,                 5
     movd            r5d,                 xm0
     vpbroadcastw    m0,                  xm0
 
@@ -527,7 +522,7 @@ cglobal intra_pred_dc16, 3, 9, 4
     ; filter top
     movu            m2,                  [r2 + 2]
     paddw           m2,                  m1
-    psraw           m2,                  2
+    psrlw           m2,                  2
     movu            [r0],                m2
 
     ; filter top-left
@@ -542,7 +537,7 @@ cglobal intra_pred_dc16, 3, 9, 4
     ; filter left
     movu            m2,                  [r2 + 68]
     paddw           m2,                  m1
-    psraw           m2,                  2
+    psrlw           m2,                  2
     vextracti128    xm3,                 m2, 1
 
     movq            r3,                  xm2
diff -r 8023786c5247 -r 8efce8620ae2 source/encoder/analysis.cpp
--- a/source/encoder/analysis.cpp	Mon Jul 13 17:38:02 2015 -0700
+++ b/source/encoder/analysis.cpp	Tue Jul 14 16:29:46 2015 -0700
@@ -385,10 +385,10 @@ void Analysis::processPmode(PMODE& pmode
     /* perform Mode task, repeat until no more work is available */
     do
     {
+        uint32_t refMasks[2] = { 0, 0 };
+
         if (m_param->rdLevel <= 4)
         {
-            uint32_t refMasks[2] = { 0, 0 };
-
             switch (pmode.modes[task])
             {
             case PRED_INTRA:
@@ -443,7 +443,7 @@ void Analysis::processPmode(PMODE& pmode
                 break;
 
             case PRED_2Nx2N:
-                slave.checkInter_rd5_6(md.pred[PRED_2Nx2N], pmode.cuGeom, SIZE_2Nx2N);
+                slave.checkInter_rd5_6(md.pred[PRED_2Nx2N], pmode.cuGeom, SIZE_2Nx2N, refMasks);
                 md.pred[PRED_BIDIR].rdCost = MAX_INT64;
                 if (m_slice->m_sliceType == B_SLICE)
                 {
@@ -454,27 +454,27 @@ void Analysis::processPmode(PMODE& pmode
                 break;
 
             case PRED_Nx2N:
-                slave.checkInter_rd5_6(md.pred[PRED_Nx2N], pmode.cuGeom, SIZE_Nx2N);
+                slave.checkInter_rd5_6(md.pred[PRED_Nx2N], pmode.cuGeom, SIZE_Nx2N, refMasks);
                 break;
 
             case PRED_2NxN:
-                slave.checkInter_rd5_6(md.pred[PRED_2NxN], pmode.cuGeom, SIZE_2NxN);
+                slave.checkInter_rd5_6(md.pred[PRED_2NxN], pmode.cuGeom, SIZE_2NxN, refMasks);