[x265-commits] [x265] x86inc.asm: fix vpbroadcastd bug on Mac platform

Mon Sep 8 15:20:59 CEST 2014

details:   http://hg.videolan.org/x265/rev/51930084e148
branches:  
changeset: 7989:51930084e148
user:      Min Chen <chenm003 at 163.com>
date:      Fri Sep 05 16:48:03 2014 -0700
description:
x86inc.asm: fix vpbroadcastd bug on Mac platform
Subject: [x265] entropy: change top-level encode to encodeCTU

details:   http://hg.videolan.org/x265/rev/de5614144bce
branches:  
changeset: 7990:de5614144bce
user:      Deepthi Nandakumar <deepthi at multicorewareinc.com>
date:      Mon Sep 08 11:48:56 2014 +0530
description:
entropy: change top-level encode to encodeCTU
Subject: [x265] Merge with correct x86inc.asm patch

details:   http://hg.videolan.org/x265/rev/c55d69561948
branches:  
changeset: 7991:c55d69561948
user:      Steve Borho <steve at borho.org>
date:      Mon Sep 08 12:02:55 2014 +0200
description:
Merge with correct x86inc.asm patch
Subject: [x265] nits

details:   http://hg.videolan.org/x265/rev/a29aa966336e
branches:  
changeset: 7992:a29aa966336e
user:      Steve Borho <steve at borho.org>
date:      Mon Sep 08 14:00:52 2014 +0200
description:
nits
Subject: [x265] frameencoder: remove redundant clear of frame stats

details:   http://hg.videolan.org/x265/rev/a7465d789c64
branches:  
changeset: 7993:a7465d789c64
user:      Steve Borho <steve at borho.org>
date:      Mon Sep 08 14:03:30 2014 +0200
description:
frameencoder: remove redundant clear of frame stats

They were being zeroed in the constructor, in init(), and in compressCTURows.
Technically only the last is truly necessary, but I'm leaving the memset in the
constructor.
Subject: [x265] frameencoder: remove second encodeCU() pass over CTUs when SAO is disabled

details:   http://hg.videolan.org/x265/rev/a117564df3ef
branches:  
changeset: 7994:a117564df3ef
user:      Steve Borho <steve at borho.org>
date:      Fri Sep 05 17:56:17 2014 +0200
description:
frameencoder: remove second encodeCU() pass over CTUs when SAO is disabled

This is a performance optimization: it allows the encoder to generate the final
bitstream of each CTU as it is compressed, while it is still cache hot.

When SAO is enabled, SAO analysis must be performed and coded at the start of
the CTU, but SAO analysis currently requires surrounding CTUs to be encoded,
making the second pass unavoidable.

Note that this commit changes the way non-WPP encodes are performed, for the
better. It now always uses row 0's CI_CURR_BEST entropy coder instance to
communicate entropy state between all CTUs and between rows. This better models
how encodeSlice() works and makes RDO work better.
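
The control flow described above can be sketched roughly as follows. This is a
simplified illustration, not the actual frameencoder.cpp code: CtuSketch,
compressCtu, writeCtuBits and encodeFrameSketch are hypothetical names, and
entropy state handling is left out entirely.

```cpp
#include <vector>

// Hypothetical stand-in for a CTU; not an x265 type.
struct CtuSketch
{
    int  addr = 0;
    bool compressed = false;
    bool coded = false;
};

// Stand-in for analysis / mode decision of one CTU.
static void compressCtu(CtuSketch& ctu)
{
    ctu.compressed = true;
}

// Stand-in for writing the final bitstream of one CTU.
static void writeCtuBits(CtuSketch& ctu, std::vector<int>& codedOrder)
{
    ctu.coded = true;
    codedOrder.push_back(ctu.addr);
}

// Returns the number of passes made over the CTUs.
static int encodeFrameSketch(std::vector<CtuSketch>& ctus, bool saoEnabled,
                             std::vector<int>& codedOrder)
{
    for (CtuSketch& ctu : ctus)
    {
        compressCtu(ctu);
        if (!saoEnabled)
            writeCtuBits(ctu, codedOrder); // emit bits while the CTU is cache hot
    }
    if (!saoEnabled)
        return 1;

    // With SAO, analysis needs the surrounding CTUs to be encoded first,
    // so the final coding happens in a second pass.
    for (CtuSketch& ctu : ctus)
        writeCtuBits(ctu, codedOrder);
    return 2;
}
```

With saoEnabled set to false, each CTU is coded immediately after it is
compressed and the function makes a single pass; with it set to true, coding is
deferred to a second loop.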
Subject: [x265] frameencoder: merge more of encodeSlice() into processCU

details:   http://hg.videolan.org/x265/rev/60289c638600
branches:  
changeset: 7995:60289c638600
user:      Steve Borho <steve at borho.org>
date:      Mon Sep 08 12:41:13 2014 +0200
description:
frameencoder: merge more of encodeSlice() into processCU

This commit fixes no-WPP after the previous change. The per-row or per-frame
(+/- WPP) bitstreams are flushed as they are finished (and cache hot), and the
per-CU stats are summed per row and then summarized all in one place.
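
A minimal sketch of the "sum per row, summarize once" idea, assuming made-up
counter names (RowStatsSketch, FrameSummarySketch and summarizeRows are
hypothetical and do not reflect the real FrameStats layout). Counters stay
integer while they are accumulated; floating point is only used for the final
percentages.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical per-row counters, filled in while the row's CUs are coded.
struct RowStatsSketch
{
    uint64_t intraCuCount = 0;
    uint64_t interCuCount = 0;
    uint64_t skipCuCount  = 0;
};

// Hypothetical frame-level summary.
struct FrameSummarySketch
{
    uint64_t totalCu;
    double   percentIntra;
    double   percentInter;
    double   percentSkip;
};

static FrameSummarySketch summarizeRows(const std::vector<RowStatsSketch>& rows)
{
    uint64_t intra = 0, inter = 0, skip = 0;

    // Integer accumulation across rows.
    for (const RowStatsSketch& row : rows)
    {
        intra += row.intraCuCount;
        inter += row.interCuCount;
        skip  += row.skipCuCount;
    }

    FrameSummarySketch s = {};
    s.totalCu = intra + inter + skip;
    if (s.totalCu)
    {
        // Convert to percentages once, in one place.
        s.percentIntra = 100.0 * intra / s.totalCu;
        s.percentInter = 100.0 * inter / s.totalCu;
        s.percentSkip  = 100.0 * skip  / s.totalCu;
    }
    return s;
}
```

The guard on totalCu avoids a division by zero when no rows have been coded.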
Subject: [x265] frameencoder: do more CU stat math as integer

details:   http://hg.videolan.org/x265/rev/406d92c860d5
branches:  
changeset: 7996:406d92c860d5
user:      Steve Borho <steve at borho.org>
date:      Mon Sep 08 14:29:09 2014 +0200
description:
frameencoder: do more CU stat math as integer
Subject: [x265] frameencoder: rename percent fields for clarity

details:   http://hg.videolan.org/x265/rev/9581a45d4344
branches:  
changeset: 7997:9581a45d4344
user:      Steve Borho <steve at borho.org>
date:      Mon Sep 08 14:32:02 2014 +0200
description:
frameencoder: rename percent fields for clarity
Subject: [x265] rc: move FrameStats to ratecontrol.h

details:   http://hg.videolan.org/x265/rev/89e682182a7a
branches:  
changeset: 7998:89e682182a7a
user:      Steve Borho <steve at borho.org>
date:      Mon Sep 08 14:34:38 2014 +0200
description:
rc: move FrameStats to ratecontrol.h

rate control shouldn't need to include frameencoder.h
Subject: [x265] frameencoder: combine some conditional expressions

details:   http://hg.videolan.org/x265/rev/cb67f6f65577
branches:  
changeset: 7999:cb67f6f65577
user:      Steve Borho <steve at borho.org>
date:      Mon Sep 08 15:11:56 2014 +0200
description:
frameencoder: combine some conditional expressions
Subject: [x265] frameencoder: avoid another call to resetEntropy(), they are expensive

details:   http://hg.videolan.org/x265/rev/cfe197e3044d
branches:  
changeset: 8000:cfe197e3044d
user:      Steve Borho <steve at borho.org>
date:      Mon Sep 08 15:12:20 2014 +0200
description:
frameencoder: avoid another call to resetEntropy(), they are expensive

diffstat:

 source/common/dct.cpp                |    1 +
 source/common/x86/asm-primitives.cpp |    2 +
 source/common/x86/const-a.asm        |    1 +
 source/common/x86/pixel-util.h       |    1 +
 source/common/x86/pixel-util8.asm    |   61 +++++++++++++--
 source/common/x86/x86inc.asm         |   12 +++
 source/encoder/analysis.cpp          |   15 +++-
 source/encoder/entropy.cpp           |   16 ++--
 source/encoder/entropy.h             |    9 +-
 source/encoder/frameencoder.cpp      |  140 ++++++++++++++++++++---------------
 source/encoder/frameencoder.h        |   21 +----
 source/encoder/ratecontrol.cpp       |    7 +-
 source/encoder/ratecontrol.h         |   18 ++++-
 source/encoder/sao.cpp               |   51 ++++++------
 14 files changed, 220 insertions(+), 135 deletions(-)
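
For readers following the dequant_normal assembly in the diff below, the
arithmetic it performs can be modelled in scalar C++ roughly as follows. This
is a sketch derived from the comments in the assembly (clipQCoef * scale + add,
arithmetic right shift, clamp to the int16_t range); dequantNormalSketch is an
illustrative name, and the HIGH_BIT_DEPTH rescaling of scale and shift is
omitted.

```cpp
#include <algorithm>
#include <cstdint>

// Scalar model of the dequant arithmetic. Assumes 1 <= shift <= 10 and an
// arithmetic right shift for negative values, matching psrad.
static void dequantNormalSketch(const int16_t* quantCoef, int32_t* coef,
                                int num, int scale, int shift)
{
    // Rounding term. The assembly places this value in the upper 16 bits of
    // each [add scale] dword (bts on bit shift - 1 + 16), so that pmaddwd on
    // the interleaved [coef, 1] words yields coef * scale + 1 * add.
    const int add = 1 << (shift - 1);

    for (int n = 0; n < num; n++)
    {
        int32_t v = (quantCoef[n] * scale + add) >> shift;
        // Same clamp as pminsd with pd_32767 and pmaxsd with pd_n32768.
        coef[n] = std::min(32767, std::max(-32768, v));
    }
}
```

The AVX2 version processes 16 coefficients per loop iteration (hence the
"shr r2d, 4" on the count) and stores with mova, which is why the patch adds
the 32-byte alignment check on coef in dct.cpp.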

diffs (truncated from 737 to 300 lines):

diff -r 795878af3973 -r cfe197e3044d source/common/dct.cpp
--- a/source/common/dct.cpp	Fri Sep 05 16:03:44 2014 +0200
+++ b/source/common/dct.cpp	Mon Sep 08 15:12:20 2014 +0200
@@ -729,6 +729,7 @@ void dequant_normal_c(const int16_t* qua
     X265_CHECK(num <= 32 * 32, "dequant num %d too large\n", num);
     X265_CHECK((num % 8) == 0, "dequant num %d not multiple of 8\n", num);
     X265_CHECK(shift <= 10, "shift too large %d\n", shift);
+    X265_CHECK(((intptr_t)coef & 31) == 0, "dequant coef buffer not aligned\n");
 
     int add, coeffQ;
 
diff -r 795878af3973 -r cfe197e3044d source/common/x86/asm-primitives.cpp
--- a/source/common/x86/asm-primitives.cpp	Fri Sep 05 16:03:44 2014 +0200
+++ b/source/common/x86/asm-primitives.cpp	Mon Sep 08 15:12:20 2014 +0200
@@ -1442,6 +1442,7 @@ void Setup_Assembly_Primitives(EncoderPr
     {
         p.dct[DCT_4x4] = x265_dct4_avx2;
         p.nquant = x265_nquant_avx2;
+        p.dequant_normal = x265_dequant_normal_avx2;
     }
     /* at HIGH_BIT_DEPTH, pixel == short so we can reuse a number of primitives */
     for (int i = 0; i < NUM_LUMA_PARTITIONS; i++)
@@ -1739,6 +1740,7 @@ void Setup_Assembly_Primitives(EncoderPr
 
         p.dct[DCT_4x4] = x265_dct4_avx2;
         p.nquant = x265_nquant_avx2;
+        p.dequant_normal = x265_dequant_normal_avx2;
     }
 #endif // if HIGH_BIT_DEPTH
 }
diff -r 795878af3973 -r cfe197e3044d source/common/x86/const-a.asm
--- a/source/common/x86/const-a.asm	Fri Sep 05 16:03:44 2014 +0200
+++ b/source/common/x86/const-a.asm	Mon Sep 08 15:12:20 2014 +0200
@@ -89,6 +89,7 @@ const pd_512,      times 4 dd 512
 const pd_1024,     times 4 dd 1024
 const pd_2048,     times 4 dd 2048
 const pd_ffff,     times 4 dd 0xffff
+const pd_32767,    times 4 dd 32767
 const pd_n32768,   times 4 dd 0xffff8000
 const pw_ff00,     times 8 dw 0xff00
 
diff -r 795878af3973 -r cfe197e3044d source/common/x86/pixel-util.h
--- a/source/common/x86/pixel-util.h	Fri Sep 05 16:03:44 2014 +0200
+++ b/source/common/x86/pixel-util.h	Mon Sep 08 15:12:20 2014 +0200
@@ -48,6 +48,7 @@ uint32_t x265_quant_sse4(int32_t *coef, 
 uint32_t x265_nquant_sse4(int32_t *coef, int32_t *quantCoeff, int16_t *qCoef, int qBits, int add, int numCoeff);
 uint32_t x265_nquant_avx2(int32_t *coef, int32_t *quantCoeff, int16_t *qCoef, int qBits, int add, int numCoeff);
 void x265_dequant_normal_sse4(const int16_t* quantCoef, int32_t* coef, int num, int scale, int shift);
+void x265_dequant_normal_avx2(const int16_t* quantCoef, int32_t* coef, int num, int scale, int shift);
 int x265_count_nonzero_ssse3(const int16_t *quantCoeff, int numCoeff);
 
 void x265_weight_pp_sse4(pixel *src, pixel *dst, intptr_t srcStride, intptr_t dstStride, int width, int height, int w0, int round, int shift, int offset);
diff -r 795878af3973 -r cfe197e3044d source/common/x86/pixel-util8.asm
--- a/source/common/x86/pixel-util8.asm	Fri Sep 05 16:03:44 2014 +0200
+++ b/source/common/x86/pixel-util8.asm	Mon Sep 08 15:12:20 2014 +0200
@@ -54,6 +54,8 @@ cextern pw_1
 cextern pw_00ff
 cextern pw_2000
 cextern pw_pixel_max
+cextern pd_32767
+cextern pd_n32768
 
 ;-----------------------------------------------------------------------------
 ; void calcrecon(pixel* pred, int16_t* residual, int16_t* reconqt, pixel *reconipred, int stride, int strideqt, int strideipred)
@@ -1040,21 +1042,18 @@ cglobal nquant, 3,5,7
 ;-----------------------------------------------------------------------------
 INIT_XMM sse4
 cglobal dequant_normal, 5,5,5
-    movd        m1, r3              ; m1 = word [scale]
     mova        m2, [pw_1]
 %if HIGH_BIT_DEPTH
     cmp         r3d, 32767
     jle         .skip
-    psrld       m1, 2
+    shr         r3d, 2
     sub         r4d, 2
 .skip:
 %endif
     movd        m0, r4d             ; m0 = shift
-    xor         r3d, r3d
-    dec         r4d
+    add         r4d, 15
     bts         r3d, r4d
-    movd        m3, r3d
-    punpcklwd   m1, m3
+    movd        m1, r3d
     pshufd      m1, m1, 0           ; m1 = dword [add scale]
     ; m0 = shift
     ; m1 = scale
@@ -1071,8 +1070,8 @@ cglobal dequant_normal, 5,5,5
     pmovsxwd    m3, m3
     packssdw    m4, m4
     pmovsxwd    m4, m4
-    movu        [r1], m3
-    movu        [r1 + 16], m4
+    mova        [r1], m3
+    mova        [r1 + 16], m4
 
     add         r0, 16
     add         r1, 32
@@ -1082,6 +1081,52 @@ cglobal dequant_normal, 5,5,5
     RET
 
 
+INIT_YMM avx2
+cglobal dequant_normal, 5,5,7
+    vpbroadcastd    m2, [pw_1]          ; m2 = word [1]
+    vpbroadcastd    m5, [pd_32767]      ; m5 = dword [32767]
+    vpbroadcastd    m6, [pd_n32768]     ; m6 = dword [-32768]
+%if HIGH_BIT_DEPTH
+    cmp             r3d, 32767
+    jle            .skip
+    shr             r3d, 2
+    sub             r4d, 2
+.skip:
+%endif
+    movd            xm0, r4d            ; m0 = shift
+    add             r4d, -1+16
+    bts             r3d, r4d
+    vpbroadcastd    m1, r3d             ; m1 = dword [add scale]
+
+    ; m0 = shift
+    ; m1 = scale
+    ; m2 = word [1]
+    shr             r2d, 4
+.loop:
+    movu            m3, [r0]
+    punpckhwd       m4, m3, m2
+    punpcklwd       m3, m2
+    pmaddwd         m3, m1              ; m3 = dword (clipQCoef * scale + add)
+    pmaddwd         m4, m1
+    psrad           m3, xm0
+    psrad           m4, xm0
+    pminsd          m3, m5
+    pmaxsd          m3, m6
+    pminsd          m4, m5
+    pmaxsd          m4, m6
+    mova            [r1 + 0 * mmsize/2], xm3
+    mova            [r1 + 1 * mmsize/2], xm4
+    vextracti128    [r1 + 2 * mmsize/2], m3, 1
+    vextracti128    [r1 + 3 * mmsize/2], m4, 1
+
+    add             r0, mmsize
+    add             r1, mmsize * 2
+
+    dec             r2d
+    jnz            .loop
+    RET
+
+
 ;-----------------------------------------------------------------------------
 ; int count_nonzero(const int16_t *quantCoeff, int numCoeff);
 ;-----------------------------------------------------------------------------
diff -r 795878af3973 -r cfe197e3044d source/common/x86/x86inc.asm
--- a/source/common/x86/x86inc.asm	Fri Sep 05 16:03:44 2014 +0200
+++ b/source/common/x86/x86inc.asm	Mon Sep 08 15:12:20 2014 +0200
@@ -888,6 +888,8 @@ INIT_XMM
     %define ymmmm%1   mm%1
     %define ymmxmm%1 xmm%1
     %define ymmymm%1 ymm%1
+    %define ymm%1xmm xmm%1
+    %define xmm%1ymm ymm%1
     %define xm%1 xmm %+ m%1
     %define ym%1 ymm %+ m%1
 %endmacro
@@ -1480,3 +1482,13 @@ FMA4_INSTR fnmsubss, fnmsub132ss, fnmsub
 %endif
 %endmacro
 %endif
+
+; workaround: vpbroadcastd with register, the yasm will generate wrong code
+%macro vpbroadcastd 2
+  %ifid %2
+    movd         %1 %+ xmm, %2
+    vpbroadcastd %1, %1 %+ xmm
+  %else
+    vpbroadcastd %1, %2
+  %endif
+%endmacro
diff -r 795878af3973 -r cfe197e3044d source/encoder/analysis.cpp
--- a/source/encoder/analysis.cpp	Fri Sep 05 16:03:44 2014 +0200
+++ b/source/encoder/analysis.cpp	Mon Sep 08 15:12:20 2014 +0200
@@ -1056,21 +1056,30 @@ void Analysis::compressInterCU_rd0_4(TCo
             copyYuv2Pic(pic, outBestCU->getAddr(), absPartIdx, depth);
     }
 
+#if CHECKED_BUILD || _DEBUG
     /* Assert if Best prediction mode is NONE
      * Selected mode's RD-cost must be not MAX_INT64 */
     if (bInsidePicture)
     {
         X265_CHECK(outBestCU->getPartitionSize(0) != SIZE_NONE, "no best prediction size\n");
         X265_CHECK(outBestCU->getPredictionMode(0) != MODE_NONE, "no best prediction mode\n");
-        if (m_rdCost.m_psyRd)
+        if (m_param->rdLevel > 1)
         {
-            X265_CHECK(outBestCU->m_totalPsyCost != MAX_INT64, "no best partition cost\n");
+            if (m_rdCost.m_psyRd)
+            {
+                X265_CHECK(outBestCU->m_totalPsyCost != MAX_INT64, "no best partition cost\n");
+            }
+            else
+            {
+                X265_CHECK(outBestCU->m_totalRDCost != MAX_INT64, "no best partition cost\n");
+            }
         }
         else
         {
-            X265_CHECK(outBestCU->m_totalRDCost != MAX_INT64, "no best partition cost\n");
+            X265_CHECK(outBestCU->m_sa8dCost != MAX_INT64, "no best partition cost\n");
         }
     }
+#endif
 
     x265_emms();
 }
diff -r 795878af3973 -r cfe197e3044d source/encoder/entropy.cpp
--- a/source/encoder/entropy.cpp	Fri Sep 05 16:03:44 2014 +0200
+++ b/source/encoder/entropy.cpp	Mon Sep 08 15:12:20 2014 +0200
@@ -481,7 +481,7 @@ void Entropy::codeShortTermRefPicSet(RPS
     }
 }
 
-void Entropy::encodeCU(TComDataCU* cu)
+void Entropy::encodeCTU(TComDataCU* cu)
 {
     bool bEncodeDQP = cu->m_slice->m_pps->bUseDQP;
     encodeCU(cu, 0, 0, false, bEncodeDQP);
@@ -572,11 +572,6 @@ void Entropy::finishCU(TComDataCU* cu, u
     uint32_t realEndAddress = slice->m_endCUAddr;
     uint32_t cuAddr = cu->getSCUAddr() + absPartIdx;
 
-    // Encode slice finish
-    bool bTerminateSlice = false;
-    if (cuAddr + (cu->m_pic->getNumPartInCU() >> (depth << 1)) == realEndAddress)
-        bTerminateSlice = true;
-
     uint32_t granularityMask = g_maxCUSize - 1;
     uint32_t cuSize = 1 << cu->getLog2CUSize(absPartIdx);
     uint32_t rpelx = cu->getCUPelX() + g_zscanToPelX[absPartIdx] + cuSize;
@@ -586,12 +581,17 @@ void Entropy::finishCU(TComDataCU* cu, u
 
     if (granularityBoundary)
     {
+        // Encode slice finish
+        bool bTerminateSlice = false;
+        if (cuAddr + (cu->m_pic->getNumPartInCU() >> (depth << 1)) == realEndAddress)
+            bTerminateSlice = true;
+
         // The 1-terminating bit is added to all streams, so don't add it here when it's 1.
         if (!bTerminateSlice)
-            codeTerminatingBit(0);
+            encodeBinTrm(0);
 
         if (!m_bitIf)
-            resetBits();
+            resetBits(); // TODO: most likely unnecessary
     }
 }
 
diff -r 795878af3973 -r cfe197e3044d source/encoder/entropy.h
--- a/source/encoder/entropy.h	Fri Sep 05 16:03:44 2014 +0200
+++ b/source/encoder/entropy.h	Mon Sep 08 15:12:20 2014 +0200
@@ -133,7 +133,7 @@ public:
     void load(Entropy& src);
     void loadIntraDirModeLuma(Entropy& src);
     void store(Entropy& dest);
-    void loadContexts(Entropy& src)       { copyContextsFrom(src); }
+    void loadContexts(Entropy& src)    { copyContextsFrom(src); }
     void copyState(Entropy& other);
 
     void codeVPS(VPS* vps);
@@ -146,13 +146,12 @@ public:
     void codeSliceHeader(Slice* slice);
     void codeSliceHeaderWPPEntryPoints(Slice* slice, uint32_t *substreamSizes, uint32_t maxOffset);
     void codeShortTermRefPicSet(RPS* rps);
-    void codeSliceFinish()                   { finish(); }
-    void codeTerminatingBit(uint32_t lsLast) { encodeBinTrm(lsLast); }
+    void finishSlice()                 { encodeBinTrm(1); finish(); dynamic_cast<Bitstream*>(m_bitIf)->writeByteAlignment(); }
 
-    void encodeCU(TComDataCU* cu);
+    void encodeCTU(TComDataCU* cu);
     void codeSaoOffset(SaoLcuParam* saoLcuParam, uint32_t compIdx);
     void codeSaoUnitInterleaving(int compIdx, bool saoFlag, int rx, int ry, SaoLcuParam* saoLcuParam, int cuAddrInSlice, int cuAddrUpInSlice, int allowMergeLeft, int allowMergeUp);
-    void codeSaoMerge(uint32_t code) { encodeBin(code, m_contextState[OFF_SAO_MERGE_FLAG_CTX]); }
+    void codeSaoMerge(uint32_t code)   { encodeBin(code, m_contextState[OFF_SAO_MERGE_FLAG_CTX]); }
 
     void codeCUTransquantBypassFlag(uint32_t symbol);
     void codeSkipFlag(TComDataCU* cu, uint32_t absPartIdx);
diff -r 795878af3973 -r cfe197e3044d source/encoder/frameencoder.cpp
--- a/source/encoder/frameencoder.cpp	Fri Sep 05 16:03:44 2014 +0200
+++ b/source/encoder/frameencoder.cpp	Mon Sep 08 15:12:20 2014 +0200
@@ -117,7 +117,6 @@ bool FrameEncoder::init(Encoder *top, in
         ok &= m_rce.picTimingSEI && m_rce.hrdTiming;
     }
 
-    memset(&m_frameStats, 0, sizeof(m_frameStats));
     if (m_param->noiseReduction)
         m_nr = X265_MALLOC(NoiseReduction, 1);
     if (m_nr)