[x265-commits] [x265] asm: disable x265_scale2D_64to32_ssse3, DUMA finds access...

Wed Jan 8 00:20:10 CET 2014

details:   http://hg.videolan.org/x265/rev/d4bef967ae10
branches:  
changeset: 5800:d4bef967ae10
user:      Steve Borho <steve at borho.org>
date:      Mon Jan 06 15:08:05 2014 -0600
description:
asm: disable x265_scale2D_64to32_ssse3, DUMA finds access violations

I tried simple buffer padding workarounds, adding 16 bytes at the start and end
of bufScale, but it was still causing the access violation.
Subject: [x265] slicetype: better prevention for compiler warnings and misbehaviors

details:   http://hg.videolan.org/x265/rev/54835bf61c11
branches:  
changeset: 5801:54835bf61c11
user:      Steve Borho <steve at borho.org>
date:      Mon Jan 06 15:21:09 2014 -0600
description:
slicetype: better prevention for compiler warnings and misbehaviors
Subject: [x265] motion: add early out for subpel refine if bcost is already zero

details:   http://hg.videolan.org/x265/rev/63d6b04fe201
branches:  
changeset: 5802:63d6b04fe201
user:      Steve Borho <steve at borho.org>
date:      Mon Jan 06 16:11:53 2014 -0600
description:
motion: add early out for subpel refine if bcost is already zero
Subject: [x265] TComBitStream: rename variables for clarity

details:   http://hg.videolan.org/x265/rev/324d99e3d6ac
branches:  
changeset: 5803:324d99e3d6ac
user:      Steve Borho <steve at borho.org>
date:      Mon Jan 06 16:12:45 2014 -0600
description:
TComBitStream: rename variables for clarity

There was no point in making cnt an unsigned variable when the return value is
signed; this just adds more compiler warnings
Subject: [x265] TComBitstream: simplify and streamline start code checks

details:   http://hg.videolan.org/x265/rev/e1ee0fc31e79
branches:  
changeset: 5804:e1ee0fc31e79
user:      Steve Borho <steve at borho.org>
date:      Mon Jan 06 16:15:21 2014 -0600
description:
TComBitstream: simplify and streamline start code checks
Subject: [x265] ignore vim swap files

details:   http://hg.videolan.org/x265/rev/6d40ab7be379
branches:  
changeset: 5805:6d40ab7be379
user:      Steve Borho <steve at borho.org>
date:      Mon Jan 06 16:17:52 2014 -0600
description:
ignore vim swap files
Subject: [x265] TComBitStream: fix loop bounds so we do not check past end of buffer

details:   http://hg.videolan.org/x265/rev/bd9b395c80c7
branches:  
changeset: 5806:bd9b395c80c7
user:      Steve Borho <steve at borho.org>
date:      Mon Jan 06 16:26:32 2014 -0600
description:
TComBitStream: fix loop bounds so we do not check past end of buffer
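
The bounds fix described above can be sketched as a standalone function. This
is a minimal illustration, not the x265 code itself: the free-function form,
its name, and its (buf, size) signature are assumptions made for the example,
whereas the real method reads its buffer through getFIFO() and
getByteStreamLength(). The loop condition and match test follow the patched
TComBitStream.cpp hunk in the diff.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical free-function version of the patched loop; the name and the
// (buf, size) signature are illustrative only and not part of x265.
int countStartCodeEmulations(const uint8_t* buf, size_t size)
{
    int numStartCodes = 0;

    // "i + 2 < size" keeps buf[i + 2] inside the buffer. Writing the bound as
    // "i < size - 2" would wrap around for size < 2 because size is unsigned.
    for (size_t i = 0; i + 2 < size; i++)
    {
        // 0x00 0x00 followed by a byte in the range 0x00..0x03
        if (!buf[i] && !buf[i + 1] && buf[i + 2] <= 3)
        {
            numStartCodes++;
            i++; // as in the patch: skip one extra byte after a match
        }
    }

    return numStartCodes;
}
```

With the old bound (count < fsize), the last two iterations would read
rbsp[count + 1] and rbsp[count + 2] beyond the end of the stream.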
Subject: [x265] wtf? a useless comment and if()/else() with two identical statements?

details:   http://hg.videolan.org/x265/rev/c1cf926c20e0
branches:  
changeset: 5807:c1cf926c20e0
user:      Steve Borho <steve at borho.org>
date:      Mon Jan 06 23:14:06 2014 -0600
description:
wtf? a useless comment and if()/else() with two identical statements?
Subject: [x265] TComPrediction: simplify luma intra prediction function

details:   http://hg.videolan.org/x265/rev/4811da38078c
branches:  
changeset: 5808:4811da38078c
user:      Steve Borho <steve at borho.org>
date:      Mon Jan 06 23:15:58 2014 -0600
description:
TComPrediction: simplify luma intra prediction function
Subject: [x265] correct number of xmm registers on interp_8tap_horiz*

details:   http://hg.videolan.org/x265/rev/ca7bde495318
branches:  
changeset: 5809:ca7bde495318
user:      Min Chen <chenm003 at 163.com>
date:      Tue Jan 07 18:36:48 2014 +0800
description:
correct number of xmm registers on interp_8tap_horiz*
Subject: [x265] asm: fix memory access violation due to scale2D_64to32

details:   http://hg.videolan.org/x265/rev/c4edab8dab65
branches:  
changeset: 5810:c4edab8dab65
user:      Murugan Vairavel <murugan at multicorewareinc.com>
date:      Tue Jan 07 18:36:17 2014 +0530
description:
asm: fix memory access violation due to scale2D_64to32
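
Judging from the pixel-util8.asm hunks in the diff, the fix replaces each
second unaligned load at [r1 + N + 1] (or + 2 when HIGH_BIT_DEPTH is set) with
a shift of the register that was already loaded from [r1 + N]. The last
offset load of a row would presumably have read one pixel past the row's end,
which is the kind of access DUMA reports. The scalar C++ sketch below only
models the idea for 8-bit pixels; the function name is hypothetical and the
real routine works on 16-byte XMM registers, not on one lane at a time.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical scalar model of "psrlw reg, 8": each pair of pixels is treated
// as one little-endian 16-bit lane, and the odd-indexed pixel is recovered by
// shifting the lane right by 8 instead of loading again from row + 1.
std::vector<uint8_t> oddPixelsViaShift(const uint8_t* row, size_t len)
{
    std::vector<uint8_t> out;

    // x + 1 < len: only bytes inside [row, row + len) are ever read.
    for (size_t x = 0; x + 1 < len; x += 2)
    {
        // Pack two neighbouring pixels as the low and high byte of a lane.
        uint16_t lane = static_cast<uint16_t>(row[x] | (row[x + 1] << 8));

        // The shift moves the odd pixel into the low byte, zero above it.
        out.push_back(static_cast<uint8_t>(lane >> 8));
    }

    return out;
}
```

The design point is that the neighbouring pixel is derived from data already
in the register, so no load ever starts at an address other than the original
16-byte-stepped offsets.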

diffstat:

 .hgignore                                |   1 +
 source/Lib/TLibCommon/TComBitStream.cpp  |  13 ++---
 source/Lib/TLibCommon/TComPrediction.cpp |  24 +----------
 source/common/x86/ipfilter8.asm          |   2 +-
 source/common/x86/pixel-util8.asm        |  67 +++++++++++++++++--------------
 source/encoder/motion.cpp                |   6 ++-
 source/encoder/slicetype.cpp             |   4 +-
 7 files changed, 56 insertions(+), 61 deletions(-)

diffs (truncated from 309 to 300 lines):

diff -r abd4da45823c -r c4edab8dab65 .hgignore
--- a/.hgignore	Wed Jan 01 15:52:11 2014 -0600
+++ b/.hgignore	Tue Jan 07 18:36:17 2014 +0530
@@ -7,3 +7,4 @@ build/
 **.yuv
 **.y4m
 **.out
+**.swp
diff -r abd4da45823c -r c4edab8dab65 source/Lib/TLibCommon/TComBitStream.cpp
--- a/source/Lib/TLibCommon/TComBitStream.cpp	Wed Jan 01 15:52:11 2014 -0600
+++ b/source/Lib/TLibCommon/TComBitStream.cpp	Tue Jan 07 18:36:17 2014 +0530
@@ -184,21 +184,20 @@ void TComOutputBitstream::writeByteAlign
 
 int TComOutputBitstream::countStartCodeEmulations()
 {
-    uint32_t cnt = 0;
+    int numStartCodes = 0;
     uint8_t *rbsp = getFIFO();
     uint32_t fsize = getByteStreamLength();
 
-    for (uint32_t count = 0; count < fsize; count++)
+    for (uint32_t i = 0; i + 2 < fsize; i++)
     {
-        if ((rbsp[count + 2] == 0x00 || rbsp[count + 2] == 0x01 || rbsp[count + 2] == 0x02 || rbsp[count + 2] == 0x03)
-            && rbsp[count + 1] == 0x00 && rbsp[count] == 0x00)
+        if (!rbsp[i] && !rbsp[i + 1] && rbsp[i + 2] <= 3)
         {
-            cnt++;
-            count = count + 1;
+            numStartCodes++;
+            i++;
         }
     }
 
-    return cnt;
+    return numStartCodes;
 }
 
 void TComOutputBitstream::push_back(uint8_t val)
diff -r abd4da45823c -r c4edab8dab65 source/Lib/TLibCommon/TComPrediction.cpp
--- a/source/Lib/TLibCommon/TComPrediction.cpp	Wed Jan 01 15:52:11 2014 -0600
+++ b/source/Lib/TLibCommon/TComPrediction.cpp	Tue Jan 07 18:36:17 2014 +0530
@@ -153,18 +153,8 @@ void TComPrediction::predIntraLumaAng(ui
         refAbv = refAboveFlt + size - 1;
     }
 
-    // get starting pixel in block
-    bool bFilter = (size <= 16);
-
-    // Create the prediction
-    if (dirMode == PLANAR_IDX)
-    {
-        primitives.intra_pred[log2BlkSize - 2][PLANAR_IDX](dst, stride, refLft, refAbv, dirMode, 0);
-    }
-    else
-    {
-        primitives.intra_pred[log2BlkSize - 2][dirMode](dst, stride, refLft, refAbv, dirMode, bFilter);
-    }
+    bool bFilter = size <= 16 && dirMode != PLANAR_IDX;
+    primitives.intra_pred[log2BlkSize - 2][dirMode](dst, stride, refLft, refAbv, dirMode, bFilter);
 }
 
 // Angular chroma
@@ -183,15 +173,7 @@ void TComPrediction::predIntraChromaAng(
         refLft[k + width - 1] = src[k * ADI_BUF_STRIDE];
     }
 
-    // get starting pixel in block
-    if (dirMode == PLANAR_IDX)
-    {
-        primitives.intra_pred[log2BlkSize][dirMode](dst, stride, refLft + width - 1, refAbv + width - 1, dirMode, 0);
-    }
-    else
-    {
-        primitives.intra_pred[log2BlkSize][dirMode](dst, stride, refLft + width - 1, refAbv + width - 1, dirMode, 0);
-    }
+    primitives.intra_pred[log2BlkSize][dirMode](dst, stride, refLft + width - 1, refAbv + width - 1, dirMode, 0);
 }
 
 /** Function for checking identical motion.
diff -r abd4da45823c -r c4edab8dab65 source/common/x86/ipfilter8.asm
--- a/source/common/x86/ipfilter8.asm	Wed Jan 01 15:52:11 2014 -0600
+++ b/source/common/x86/ipfilter8.asm	Tue Jan 07 18:36:17 2014 +0530
@@ -623,7 +623,7 @@ IPFILTER_CHROMA_W 32, 32
 ;----------------------------------------------------------------------------------------------------------------------------
 %macro IPFILTER_LUMA 3
 INIT_XMM sse4
-cglobal interp_8tap_horiz_%3_%1x%2, 4, 7, 5
+cglobal interp_8tap_horiz_%3_%1x%2, 4,7,6
 
     mov       r4d, r4m
 
diff -r abd4da45823c -r c4edab8dab65 source/common/x86/pixel-util8.asm
--- a/source/common/x86/pixel-util8.asm	Wed Jan 01 15:52:11 2014 -0600
+++ b/source/common/x86/pixel-util8.asm	Tue Jan 07 18:36:17 2014 +0530
@@ -2325,17 +2325,17 @@ RET
 ;-----------------------------------------------------------------
 ; void scale2D_64to32(pixel *dst, pixel *src, intptr_t stride)
 ;-----------------------------------------------------------------
+%if HIGH_BIT_DEPTH
 INIT_XMM ssse3
 cglobal scale2D_64to32, 3, 4, 8, dest, src, stride
     mov       r3d,    32
-%if HIGH_BIT_DEPTH
     mova      m7,    [deinterleave_word_shuf]
     add       r2,    r2
 .loop
     movu      m0,    [r1]                  ;i
-    movu      m1,    [r1 + 2]              ;j
+    psrld     m1,    m0,    16             ;j
     movu      m2,    [r1 + r2]             ;k
-    movu      m3,    [r1 + r2 + 2]         ;l
+    psrld     m3,    m2,    16             ;l
     movu      m4,    m0
     movu      m5,    m2
     pxor      m4,    m1                    ;i^j
@@ -2350,9 +2350,9 @@ cglobal scale2D_64to32, 3, 4, 8, dest, s
     pand      m4,    [hmulw_16p]
     psubw     m0,    m4                    ;Result
     movu      m1,    [r1 + 16]             ;i
-    movu      m2,    [r1 + 16 + 2]         ;j
+    psrld     m2,    m1,    16             ;j
     movu      m3,    [r1 + r2 + 16]        ;k
-    movu      m4,    [r1 + r2 + 16 + 2]    ;l
+    psrld     m4,    m3,    16             ;l
     movu      m5,    m1
     movu      m6,    m3
     pxor      m5,    m2                    ;i^j
@@ -2373,9 +2373,9 @@ cglobal scale2D_64to32, 3, 4, 8, dest, s
     movu          [r0],     m0
 
     movu      m0,    [r1 + 32]             ;i
-    movu      m1,    [r1 + 32 + 2]         ;j
+    psrld     m1,    m0,    16             ;j
     movu      m2,    [r1 + r2 + 32]        ;k
-    movu      m3,    [r1 + r2 + 32 + 2]    ;l
+    psrld     m3,    m2,    16             ;l
     movu      m4,    m0
     movu      m5,    m2
     pxor      m4,    m1                    ;i^j
@@ -2390,9 +2390,9 @@ cglobal scale2D_64to32, 3, 4, 8, dest, s
     pand      m4,    [hmulw_16p]
     psubw     m0,    m4                    ;Result
     movu      m1,    [r1 + 48]             ;i
-    movu      m2,    [r1 + 48 + 2]         ;j
+    psrld     m2,    m1,    16             ;j
     movu      m3,    [r1 + r2 + 48]        ;k
-    movu      m4,    [r1 + r2 + 48 + 2]    ;l
+    psrld     m4,    m3,    16             ;l
     movu      m5,    m1
     movu      m6,    m3
     pxor      m5,    m2                    ;i^j
@@ -2413,9 +2413,9 @@ cglobal scale2D_64to32, 3, 4, 8, dest, s
     movu          [r0 + 16],    m0
 
     movu      m0,    [r1 + 64]             ;i
-    movu      m1,    [r1 + 64 + 2]         ;j
+    psrld     m1,    m0,    16             ;j
     movu      m2,    [r1 + r2 + 64]        ;k
-    movu      m3,    [r1 + r2 + 64 + 2]    ;l
+    psrld     m3,    m2,    16             ;l
     movu      m4,    m0
     movu      m5,    m2
     pxor      m4,    m1                    ;i^j
@@ -2430,9 +2430,9 @@ cglobal scale2D_64to32, 3, 4, 8, dest, s
     pand      m4,    [hmulw_16p]
     psubw     m0,    m4                    ;Result
     movu      m1,    [r1 + 80]             ;i
-    movu      m2,    [r1 + 80 + 2]         ;j
+    psrld     m2,    m1,    16             ;j
     movu      m3,    [r1 + r2 + 80]        ;k
-    movu      m4,    [r1 + r2 + 80 + 2]    ;l
+    psrld     m4,    m3,    16             ;l
     movu      m5,    m1
     movu      m6,    m3
     pxor      m5,    m2                    ;i^j
@@ -2453,9 +2453,9 @@ cglobal scale2D_64to32, 3, 4, 8, dest, s
     movu          [r0 + 32],    m0
 
     movu      m0,    [r1 + 96]             ;i
-    movu      m1,    [r1 + 96 + 2]         ;j
+    psrld     m1,    m0,    16             ;j
     movu      m2,    [r1 + r2 + 96]        ;k
-    movu      m3,    [r1 + r2 + 96 + 2]    ;l
+    psrld     m3,    m2,    16             ;l
     movu      m4,    m0
     movu      m5,    m2
     pxor      m4,    m1                    ;i^j
@@ -2469,10 +2469,10 @@ cglobal scale2D_64to32, 3, 4, 8, dest, s
     pand      m4,    m5                    ;(ij|kl)&st
     pand      m4,    [hmulw_16p]
     psubw     m0,    m4                    ;Result
-    movu      m1,    [r1 + 112]             ;i
-    movu      m2,    [r1 + 112 + 2]         ;j
-    movu      m3,    [r1 + r2 + 112]        ;k
-    movu      m4,    [r1 + r2 + 112 + 2]    ;l
+    movu      m1,    [r1 + 112]            ;i
+    psrld     m2,    m1,    16             ;j
+    movu      m3,    [r1 + r2 + 112]       ;k
+    psrld     m4,    m3,    16             ;l
     movu      m5,    m1
     movu      m6,    m3
     pxor      m5,    m2                    ;i^j
@@ -2492,14 +2492,22 @@ cglobal scale2D_64to32, 3, 4, 8, dest, s
     punpcklqdq    m0,           m1
     movu          [r0 + 48],    m0
     lea    r0,    [r0 + 64]
+    lea    r1,    [r1 + 2 * r2]
+    dec    r3d
+    jnz    .loop
+    RET
 %else
+
+INIT_XMM ssse3
+cglobal scale2D_64to32, 3, 4, 8, dest, src, stride
+    mov       r3d,    32
     mova        m7,      [deinterleave_shuf]
 .loop
 
     movu        m0,      [r1]                  ;i
-    movu        m1,      [r1 + 1]              ;j
+    psrlw       m1,      m0,    8              ;j
     movu        m2,      [r1 + r2]             ;k
-    movu        m3,      [r1 + r2 + 1]         ;l
+    psrlw       m3,      m2,    8              ;l
     movu        m4,      m0
     movu        m5,      m2
 
@@ -2517,9 +2525,9 @@ cglobal scale2D_64to32, 3, 4, 8, dest, s
     psubb       m0,      m4                    ;Result
 
     movu        m1,      [r1 + 16]             ;i
-    movu        m2,      [r1 + 16 + 1]         ;j
+    psrlw       m2,      m1,    8              ;j
     movu        m3,      [r1 + r2 + 16]        ;k
-    movu        m4,      [r1 + r2 + 16 + 1]    ;l
+    psrlw       m4,      m3,    8              ;l
     movu        m5,      m1
     movu        m6,      m3
 
@@ -2543,9 +2551,9 @@ cglobal scale2D_64to32, 3, 4, 8, dest, s
     movu          [r0],         m0
 
     movu        m0,      [r1 + 32]             ;i
-    movu        m1,      [r1 + 32 + 1]         ;j
+    psrlw       m1,      m0,    8              ;j
     movu        m2,      [r1 + r2 + 32]        ;k
-    movu        m3,      [r1 + r2 + 32 + 1]    ;l
+    psrlw       m3,      m2,    8              ;l
     movu        m4,      m0
     movu        m5,      m2
 
@@ -2563,9 +2571,9 @@ cglobal scale2D_64to32, 3, 4, 8, dest, s
     psubb       m0,      m4                    ;Result
 
     movu        m1,      [r1 + 48]             ;i
-    movu        m2,      [r1 + 48 + 1]         ;j
+    psrlw       m2,      m1,    8              ;j
     movu        m3,      [r1 + r2 + 48]        ;k
-    movu        m4,      [r1 + r2 + 48 + 1]    ;l
+    psrlw       m4,      m3,    8              ;l
     movu        m5,      m1
     movu        m6,      m3
 
@@ -2589,12 +2597,11 @@ cglobal scale2D_64to32, 3, 4, 8, dest, s
     movu          [r0 + 16],    m0
 
     lea    r0,    [r0 + 32]
-%endif
     lea    r1,    [r1 + 2 * r2]
     dec    r3d
     jnz    .loop
-
-RET
+    RET
+%endif
 
 
 ;-----------------------------------------------------------------------------
diff -r abd4da45823c -r c4edab8dab65 source/encoder/motion.cpp
--- a/source/encoder/motion.cpp	Wed Jan 01 15:52:11 2014 -0600
+++ b/source/encoder/motion.cpp	Tue Jan 07 18:36:17 2014 +0530
@@ -1051,7 +1051,11 @@ me_hex2:
 
     SubpelWorkload& wl = workload[this->subpelRefine];
 
-    if (ref->isLowres)
+    if (!bcost)
+    {
+        /* subpel refine isn't going to improve this */
+    }
+    else if (ref->isLowres)
     {
         int bdir = 0, cost;
         for (int i = 1; i <= wl.hpel_dirs; i++)
diff -r abd4da45823c -r c4edab8dab65 source/encoder/slicetype.cpp
--- a/source/encoder/slicetype.cpp	Wed Jan 01 15:52:11 2014 -0600
+++ b/source/encoder/slicetype.cpp	Tue Jan 07 18:36:17 2014 +0530
@@ -858,7 +858,9 @@ void Lookahead::slicetypeDecide()
             }