[x265-commits] [x265] fix bug for testbench string buffer overflow

Fri Nov 22 17:33:37 CET 2013

details:   http://hg.videolan.org/x265/rev/ab94f6effb71
branches:  
changeset: 5259:ab94f6effb71
user:      Min Chen <chenm003 at 163.com>
date:      Fri Nov 22 15:00:04 2013 +0800
description:
fix bug for testbench string buffer overflow
Subject: [x265] split dequant to normal and scaling path

details:   http://hg.videolan.org/x265/rev/4ec80bd40603
branches:  
changeset: 5260:4ec80bd40603
user:      Min Chen <chenm003 at 163.com>
date:      Fri Nov 22 18:49:49 2013 +0800
description:
split dequant to normal and scaling path
Subject: [x265] asm: code for sse_pp_12x16 routine

details:   http://hg.videolan.org/x265/rev/f09ca4290a55
branches:  
changeset: 5261:f09ca4290a55
user:      Murugan Vairavel <murugan at multicorewareinc.com>
date:      Fri Nov 22 15:50:33 2013 +0550
description:
asm: code for sse_pp_12x16 routine
Subject: [x265] pixel_add_ps_12x16, asm code

details:   http://hg.videolan.org/x265/rev/9f34d1d82296
branches:  
changeset: 5262:9f34d1d82296
user:      Praveen Tiwari <praveen at multicorewareinc.com>
date:      Fri Nov 22 17:35:45 2013 +0550
description:
pixel_add_ps_12x16, asm code
Subject: [x265] pixel_add_ps_48x64, asm code

details:   http://hg.videolan.org/x265/rev/3847098e9553
branches:  
changeset: 5263:3847098e9553
user:      Praveen Tiwari <praveen at multicorewareinc.com>
date:      Fri Nov 22 18:04:59 2013 +0550
description:
pixel_add_ps_48x64, asm code
Subject: [x265] pixel_add_ps_64xN, asm code

details:   http://hg.videolan.org/x265/rev/e7eeb6443303
branches:  
changeset: 5264:e7eeb6443303
user:      Praveen Tiwari <praveen at multicorewareinc.com>
date:      Fri Nov 22 18:17:02 2013 +0550
description:
pixel_add_ps_64xN, asm code
Subject: [x265] asm-primitives.cpp, removed temporary function pointer initialization, generated through macro calls

details:   http://hg.videolan.org/x265/rev/76e2c787aadb
branches:  
changeset: 5265:76e2c787aadb
user:      Praveen Tiwari <praveen at multicorewareinc.com>
date:      Fri Nov 22 19:04:26 2013 +0550
description:
asm-primitives.cpp, removed temporary function pointer initialization, generated through macro calls
Subject: [x265] asm: code for sse_pp_24x32 routine

details:   http://hg.videolan.org/x265/rev/0b9bccb2ef7f
branches:  
changeset: 5266:0b9bccb2ef7f
user:      Murugan Vairavel <murugan at multicorewareinc.com>
date:      Fri Nov 22 19:44:32 2013 +0550
description:
asm: code for sse_pp_24x32 routine
Subject: [x265] asm: code of sse_pp routine for 48x64 and 64x16 blocks

details:   http://hg.videolan.org/x265/rev/2e0a0a5eb0c7
branches:  
changeset: 5267:2e0a0a5eb0c7
user:      Murugan Vairavel <murugan at multicorewareinc.com>
date:      Fri Nov 22 20:09:55 2013 +0550
description:
asm: code of sse_pp routine for 48x64 and 64x16 blocks
Subject: [x265] TComYuv::addClip, integrated luma_add_ps

details:   http://hg.videolan.org/x265/rev/fd90bd911169
branches:  
changeset: 5268:fd90bd911169
user:      Praveen Tiwari <praveen at multicorewareinc.com>
date:      Fri Nov 22 20:43:13 2013 +0550
description:
TComYuv::addClip, integrated luma_add_ps
Subject: [x265] added blockcopy_sp function pointers

details:   http://hg.videolan.org/x265/rev/4b437f76280d
branches:  
changeset: 5269:4b437f76280d
user:      Praveen Tiwari <praveen at multicorewareinc.com>
date:      Fri Nov 22 20:52:51 2013 +0550
description:
added blockcopy_sp function pointers
Subject: [x265] asm: code of sse_pp routine for 64x32, 64x48 and 64x64 blocks

details:   http://hg.videolan.org/x265/rev/f082c556f337
branches:  
changeset: 5270:f082c556f337
user:      Murugan Vairavel <murugan at multicorewareinc.com>
date:      Fri Nov 22 21:04:40 2013 +0550
description:
asm: code of sse_pp routine for 64x32, 64x48 and 64x64 blocks
Subject: [x265] TComYuv::addClipChroma, integrated pixel_add_ps function

details:   http://hg.videolan.org/x265/rev/cc123a1ec253
branches:  
changeset: 5271:cc123a1ec253
user:      Praveen Tiwari <praveen at multicorewareinc.com>
date:      Fri Nov 22 21:25:37 2013 +0550
description:
TComYuv::addClipChroma, integrated pixel_add_ps function
Subject: [x265] pixelharness: fix the other header buffer

details:   http://hg.videolan.org/x265/rev/3c827bba6cd6
branches:  
changeset: 5272:3c827bba6cd6
user:      Steve Borho <steve at borho.org>
date:      Fri Nov 22 10:02:18 2013 -0600
description:
pixelharness: fix the other header buffer
Subject: [x265] pixel: drop intrinsic sse_pp functions, we have ASM coverage

details:   http://hg.videolan.org/x265/rev/1c74d7bfd007
branches:  
changeset: 5273:1c74d7bfd007
user:      Steve Borho <steve at borho.org>
date:      Fri Nov 22 10:18:18 2013 -0600
description:
pixel: drop intrinsic sse_pp functions, we have ASM coverage
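
Note on the add_ps changesets above: going by the loops removed from TComYuv.cpp in the diff below, an add_ps primitive adds a 16-bit residual block to a pixel block and clips the result. The following is only a minimal scalar sketch of that behavior. The name addPsSketch, the template parameters, and the 8-bit pixel typedef are assumptions made for illustration and are not taken from the x265 tree; the argument order follows the call sites in the diff.

```cpp
#include <algorithm>
#include <cstdint>

typedef uint8_t pixel; // assumption: 8-bit build, so the clip range is [0, 255]

// Hypothetical scalar reference for one W x H block:
// dst = clip(src0 + src1), with each buffer advanced by its own stride.
template<int W, int H>
void addPsSketch(pixel* dst, intptr_t dstStride,
                 const pixel* src0, const int16_t* src1,
                 intptr_t src0Stride, intptr_t src1Stride)
{
    for (int y = 0; y < H; y++)
    {
        for (int x = 0; x < W; x++)
        {
            // Widen to int before adding so the sum cannot wrap.
            int sum = static_cast<int>(src0[x]) + src1[x];
            dst[x] = static_cast<pixel>(std::min(255, std::max(0, sum)));
        }

        src0 += src0Stride;
        src1 += src1Stride;
        dst  += dstStride;
    }
}
```

Fixing the block size at compile time is what allows one function pointer per partition, which appears to be why the callers now compute a partition index once and pass it down.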

diffstat:

 source/Lib/TLibCommon/TComTrQuant.cpp |   15 +-
 source/Lib/TLibCommon/TComYuv.cpp     |   43 +--
 source/Lib/TLibCommon/TComYuv.h       |    4 +-
 source/common/dct.cpp                 |   70 ++--
 source/common/primitives.h            |    7 +-
 source/common/vec/dct-sse41.cpp       |  170 ++++++------
 source/common/vec/pixel-sse41.cpp     |  220 ----------------
 source/common/x86/asm-primitives.cpp  |   22 +-
 source/common/x86/pixel-a.asm         |  453 +++++++++++++++++++++++++++++++++-
 source/common/x86/pixel.h             |    7 +
 source/common/x86/pixeladd8.asm       |  249 ++++++++++++++++++
 source/test/mbdstharness.cpp          |   82 ++++-
 source/test/mbdstharness.h            |    3 +-
 source/test/pixelharness.cpp          |    4 +-
 14 files changed, 933 insertions(+), 416 deletions(-)
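
Note on the dequant split: the diff below moves the scale and shift computation into the caller and leaves the non-scaling-list primitive as a plain multiply, round, shift, and clip loop. The sketch here restates that path in self-contained form. The names dequantNormalSketch, scaleFor, and clip3 are hypothetical stand-ins for illustration; the scale table and the arithmetic are copied from the diff.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for the Clip3 helper used in the diff.
static inline int clip3(int lo, int hi, int v)
{
    return std::min(hi, std::max(lo, v));
}

// Caller-side setup, mirroring the invtransformNxN hunk:
// the per-remainder scale is shifted left by the QP period.
int scaleFor(int rem, int per)
{
    static const int invQuantScales[6] = { 40, 45, 51, 57, 64, 72 };
    assert(rem >= 0 && rem < 6);
    return invQuantScales[rem] << per;
}

// Scalar sketch of the normal (no scaling list) path.
void dequantNormalSketch(const int32_t* quantCoef, int32_t* coef,
                         int num, int scale, int shift)
{
    assert(shift >= 1);
    const int add = 1 << (shift - 1); // rounding offset

    for (int n = 0; n < num; n++)
    {
        int q = clip3(-32768, 32767, quantCoef[n]);
        coef[n] = clip3(-32768, 32767, (q * scale + add) >> shift);
    }
}
```

With the branch on useScalingList hoisted into the caller, each primitive has a single loop shape, which is presumably what makes the SSE4.1 version of the normal path straightforward to vectorize.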

diffs (truncated from 1664 to 300 lines):

diff -r 5009254d3d3a -r 1c74d7bfd007 source/Lib/TLibCommon/TComTrQuant.cpp

--- a/source/Lib/TLibCommon/TComTrQuant.cpp	Fri Nov 22 00:17:46 2013 -0600
+++ b/source/Lib/TLibCommon/TComTrQuant.cpp	Fri Nov 22 10:18:18 2013 -0600
@@ -409,8 +409,21 @@ void TComTrQuant::invtransformNxN(bool t
     int rem = m_qpParam.m_rem;
     bool useScalingList = getUseScalingList();
     uint32_t log2TrSize = g_convertToBit[width] + 2;
+    int transformShift = MAX_TR_DYNAMIC_RANGE - X265_DEPTH - log2TrSize;
+    int shift = QUANT_IQUANT_SHIFT - QUANT_SHIFT - transformShift;
     int32_t *dequantCoef = getDequantCoeff(scalingListType, m_qpParam.m_rem, log2TrSize - 2);
-    primitives.dequant(coeff, m_tmpCoeff, width, height, per, rem, useScalingList, log2TrSize, dequantCoef);
+
+    if (!useScalingList)
+    {
+        static const int invQuantScales[6] = { 40, 45, 51, 57, 64, 72 };
+        int scale = invQuantScales[rem] << per;
+        primitives.dequant_normal(coeff, m_tmpCoeff, width * height, scale, shift);
+    }
+    else
+    {
+        // CHECK_ME: this code is unverified because it is currently a dead path
+        primitives.dequant_scaling(coeff, dequantCoef, m_tmpCoeff, width * height, per, shift);
+    }
 
     if (useTransformSkip == true)
     {
diff -r 5009254d3d3a -r 1c74d7bfd007 source/Lib/TLibCommon/TComYuv.cpp
--- a/source/Lib/TLibCommon/TComYuv.cpp	Fri Nov 22 00:17:46 2013 -0600
+++ b/source/Lib/TLibCommon/TComYuv.cpp	Fri Nov 22 10:18:18 2013 -0600
@@ -395,14 +395,14 @@ void TComYuv::copyPartToPartChroma(TShor
 
 void TComYuv::addClip(TComYuv* srcYuv0, TShortYUV* srcYuv1, uint32_t trUnitIdx, uint32_t partSize)
 {
-    addClipLuma(srcYuv0, srcYuv1, trUnitIdx, partSize);
-    addClipChroma(srcYuv0, srcYuv1, trUnitIdx, partSize >> m_hChromaShift);
+    int part = partitionFromSizes(partSize, partSize);
+
+    addClipLuma(srcYuv0, srcYuv1, trUnitIdx, partSize, part);
+    addClipChroma(srcYuv0, srcYuv1, trUnitIdx, partSize >> m_hChromaShift, part);
 }
 
-void TComYuv::addClipLuma(TComYuv* srcYuv0, TShortYUV* srcYuv1, uint32_t trUnitIdx, uint32_t partSize)
+void TComYuv::addClipLuma(TComYuv* srcYuv0, TShortYUV* srcYuv1, uint32_t trUnitIdx, uint32_t partSize, uint32_t part)
 {
-    int x, y;
-
     Pel* src0 = srcYuv0->getLumaAddr(trUnitIdx, partSize);
     int16_t* src1 = srcYuv1->getLumaAddr(trUnitIdx, partSize);
     Pel* dst = getLumaAddr(trUnitIdx, partSize);
@@ -411,23 +411,11 @@ void TComYuv::addClipLuma(TComYuv* srcYu
     uint32_t src1Stride = srcYuv1->m_width;
     uint32_t dststride  = getStride();
 
-    for (y = partSize - 1; y >= 0; y--)
-    {
-        for (x = partSize - 1; x >= 0; x--)
-        {
-            dst[x] = ClipY(static_cast<int16_t>(src0[x]) + src1[x]);
-        }
-
-        src0 += src0Stride;
-        src1 += src1Stride;
-        dst  += dststride;
-    }
+    primitives.luma_add_ps[part](dst, dststride, src0, src1, src0Stride, src1Stride);
 }
 
-void TComYuv::addClipChroma(TComYuv* srcYuv0, TShortYUV* srcYuv1, uint32_t trUnitIdx, uint32_t partSize)
+void TComYuv::addClipChroma(TComYuv* srcYuv0, TShortYUV* srcYuv1, uint32_t trUnitIdx, uint32_t partSize, uint32_t part)
 {
-    int x, y;
-
     Pel* srcU0 = srcYuv0->getCbAddr(trUnitIdx, partSize);
     int16_t* srcU1 = srcYuv1->getCbAddr(trUnitIdx, partSize);
     Pel* srcV0 = srcYuv0->getCrAddr(trUnitIdx, partSize);
@@ -439,21 +427,8 @@ void TComYuv::addClipChroma(TComYuv* src
     uint32_t src1Stride = srcYuv1->m_cwidth;
     uint32_t dststride  = getCStride();
 
-    for (y = partSize - 1; y >= 0; y--)
-    {
-        for (x = partSize - 1; x >= 0; x--)
-        {
-            dstU[x] = ClipC(static_cast<int16_t>(srcU0[x]) + srcU1[x]);
-            dstV[x] = ClipC(static_cast<int16_t>(srcV0[x]) + srcV1[x]);
-        }
-
-        srcU0 += src0Stride;
-        srcU1 += src1Stride;
-        srcV0 += src0Stride;
-        srcV1 += src1Stride;
-        dstU  += dststride;
-        dstV  += dststride;
-    }
+   primitives.chroma[m_csp].add_ps[part](dstU, dststride, srcU0, srcU1, src0Stride, src1Stride);
+   primitives.chroma[m_csp].add_ps[part](dstV, dststride, srcV0, srcV1, src0Stride, src1Stride);
 }
 
 void TComYuv::subtract(TComYuv* srcYuv0, TComYuv* srcYuv1, uint32_t trUnitIdx, uint32_t partSize)
diff -r 5009254d3d3a -r 1c74d7bfd007 source/Lib/TLibCommon/TComYuv.h
--- a/source/Lib/TLibCommon/TComYuv.h	Fri Nov 22 00:17:46 2013 -0600
+++ b/source/Lib/TLibCommon/TComYuv.h	Fri Nov 22 10:18:18 2013 -0600
@@ -153,8 +153,8 @@ public:
 
     //  Clip(srcYuv0 + srcYuv1) -> m_apiBuf
     void    addClip(TComYuv* srcYuv0, TShortYUV* srcYuv1, uint32_t trUnitIdx, uint32_t partSize);
-    void    addClipLuma(TComYuv* srcYuv0, TShortYUV* srcYuv1, uint32_t trUnitIdx, uint32_t partSize);
-    void    addClipChroma(TComYuv* srcYuv0, TShortYUV* srcYuv1, uint32_t trUnitIdx, uint32_t partSize);
+    void    addClipLuma(TComYuv* srcYuv0, TShortYUV* srcYuv1, uint32_t trUnitIdx, uint32_t partSize, uint32_t part);
+    void    addClipChroma(TComYuv* srcYuv0, TShortYUV* srcYuv1, uint32_t trUnitIdx, uint32_t partSize, uint32_t part);
 
     //  srcYuv0 - srcYuv1 -> m_apiBuf
     void    subtract(TComYuv* srcYuv0, TComYuv* srcYuv1, uint32_t trUnitIdx, uint32_t partSize);
diff -r 5009254d3d3a -r 1c74d7bfd007 source/common/dct.cpp
--- a/source/common/dct.cpp	Fri Nov 22 00:17:46 2013 -0600
+++ b/source/common/dct.cpp	Fri Nov 22 10:18:18 2013 -0600
@@ -718,57 +718,52 @@ void idct32_c(int32_t *src, int16_t *dst
     }
 }
 
-void dequant_c(const int32_t* quantCoef, int32_t* coef, int width, int height, int per, int rem, bool useScalingList, unsigned int log2TrSize, int32_t *dequantCoef)
+void dequant_normal_c(const int32_t* quantCoef, int32_t* coef, int num, int scale, int shift)
 {
-    int invQuantScales[6] = { 40, 45, 51, 57, 64, 72 };
-
-    if (width > 32)
-    {
-        width  = 32;
-        height = 32;
-    }
+    static const int invQuantScales[6] = { 40, 45, 51, 57, 64, 72 };
+    assert(num <= 32 * 32);
 
     int add, coeffQ;
-    int transformShift = MAX_TR_DYNAMIC_RANGE - X265_DEPTH - log2TrSize;
-    int shift = QUANT_IQUANT_SHIFT - QUANT_SHIFT - transformShift;
 
     int clipQCoef;
 
-    if (useScalingList)
+    add = 1 << (shift - 1);
+
+    for (int n = 0; n < num; n++)
     {
-        shift += 4;
+        clipQCoef = Clip3(-32768, 32767, quantCoef[n]);
+        coeffQ = (clipQCoef * scale + add) >> shift;
+        coef[n] = Clip3(-32768, 32767, coeffQ);
+    }
+}
 
-        if (shift > per)
+void dequant_scaling_c(const int32_t* quantCoef, const int32_t *deQuantCoef, int32_t* coef, int num, int per, int shift)
+{
+    assert(num <= 32 * 32);
+
+    int add, coeffQ;
+    int clipQCoef;
+
+    shift += 4;
+
+    if (shift > per)
+    {
+        add = 1 << (shift - per - 1);
+
+        for (int n = 0; n < num; n++)
         {
-            add = 1 << (shift - per - 1);
-
-            for (int n = 0; n < width * height; n++)
-            {
-                clipQCoef = Clip3(-32768, 32767, quantCoef[n]);
-                coeffQ = ((clipQCoef * dequantCoef[n]) + add) >> (shift - per);
-                coef[n] = Clip3(-32768, 32767, coeffQ);
-            }
-        }
-        else
-        {
-            for (int n = 0; n < width * height; n++)
-            {
-                clipQCoef = Clip3(-32768, 32767, quantCoef[n]);
-                coeffQ   = Clip3(-32768, 32767, clipQCoef * dequantCoef[n]);
-                coef[n] = Clip3(-32768, 32767, coeffQ << (per - shift));
-            }
+            clipQCoef = Clip3(-32768, 32767, quantCoef[n]);
+            coeffQ = ((clipQCoef * deQuantCoef[n]) + add) >> (shift - per);
+            coef[n] = Clip3(-32768, 32767, coeffQ);
         }
     }
     else
     {
-        add = 1 << (shift - 1);
-        int scale = invQuantScales[rem] << per;
-
-        for (int n = 0; n < width * height; n++)
+        for (int n = 0; n < num; n++)
         {
             clipQCoef = Clip3(-32768, 32767, quantCoef[n]);
-            coeffQ = (clipQCoef * scale + add) >> shift;
-            coef[n] = Clip3(-32768, 32767, coeffQ);
+            coeffQ   = Clip3(-32768, 32767, clipQCoef * deQuantCoef[n]);
+            coef[n] = Clip3(-32768, 32767, coeffQ << (per - shift));
         }
     }
 }
@@ -804,7 +799,8 @@ namespace x265 {
 
 void Setup_C_DCTPrimitives(EncoderPrimitives& p)
 {
-    p.dequant = dequant_c;
+    p.dequant_scaling = dequant_scaling_c;
+    p.dequant_normal = dequant_normal_c;
     p.quant = quant_c;
     p.dct[DST_4x4] = dst4_c;
     p.dct[DCT_4x4] = dct4_c;
diff -r 5009254d3d3a -r 1c74d7bfd007 source/common/primitives.h
--- a/source/common/primitives.h	Fri Nov 22 00:17:46 2013 -0600
+++ b/source/common/primitives.h	Fri Nov 22 10:18:18 2013 -0600
@@ -178,8 +178,8 @@ typedef void (*calcresidual_t)(pixel *fe
 typedef void (*calcrecon_t)(pixel* pred, int16_t* residual, pixel* recon, int16_t* reconqt, pixel *reconipred, int stride, int strideqt, int strideipred);
 typedef void (*transpose_t)(pixel* dst, pixel* src, intptr_t stride);
 typedef uint32_t (*quant_t)(int32_t *coef, int32_t *quantCoeff, int32_t *deltaU, int32_t *qCoef, int qBits, int add, int numCoeff, int32_t* lastPos);
-typedef void (*dequant_t)(const int32_t* src, int32_t* dst, int width, int height, int mcqp_miper, int mcqp_mirem, bool useScalingList,
-                          unsigned int trSizeLog2, int32_t *dequantCoef);
+typedef void (*dequant_scaling_t)(const int32_t* src, const int32_t *dequantCoef, int32_t* dst, int num, int mcqp_miper, int shift);
+typedef void (*dequant_normal_t)(const int32_t* quantCoef, int32_t* coef, int num, int scale, int shift);
 
 typedef void (*weightp_pp_t)(pixel *src, pixel *dst, intptr_t srcStride, intptr_t dstStride, int width, int height, int w0, int round, int shift, int offset);
 typedef void (*weightp_sp_t)(int16_t *src, pixel *dst, intptr_t srcStride, intptr_t dstStride, int width, int height, int w0, int round, int shift, int offset);
@@ -261,7 +261,8 @@ struct EncoderPrimitives
     dct_t           dct[NUM_DCTS];
     idct_t          idct[NUM_IDCTS];
     quant_t         quant;
-    dequant_t       dequant;
+    dequant_scaling_t dequant_scaling;
+    dequant_normal_t dequant_normal;
 
     calcresidual_t  calcresidual[NUM_SQUARE_BLOCKS];
     calcrecon_t     calcrecon[NUM_SQUARE_BLOCKS];
diff -r 5009254d3d3a -r 1c74d7bfd007 source/common/vec/dct-sse41.cpp
--- a/source/common/vec/dct-sse41.cpp	Fri Nov 22 00:17:46 2013 -0600
+++ b/source/common/vec/dct-sse41.cpp	Fri Nov 22 10:18:18 2013 -0600
@@ -40,114 +40,103 @@
 using namespace x265;
 
 namespace {
-void dequant(const int32_t* quantCoef, int32_t* coef, int width, int height, int per, int rem, bool useScalingList, unsigned int log2TrSize, int32_t *deQuantCoef)
+// TODO: normal and 8bpp dequant have only 16-bit dynamic range, so we can reduce the 32-bit multiplication later
+void dequant_normal(const int32_t* quantCoef, int32_t* coef, int num, int scale, int shift)
 {
-    int invQuantScales[6] = { 40, 45, 51, 57, 64, 72 };
+    int valueToAdd = 1 << (shift - 1);
+    __m128i vScale = _mm_set1_epi32(scale);
+    __m128i vAdd = _mm_set1_epi32(valueToAdd);
 
-    if (width > 32)
+    for (int n = 0; n < num; n = n + 8)
     {
-        width  = 32;
-        height = 32;
+        __m128i quantCoef1, quantCoef2, quantCoef12, sign;
+
+        quantCoef1 = _mm_loadu_si128((__m128i*)(quantCoef + n));
+        quantCoef2 = _mm_loadu_si128((__m128i*)(quantCoef + n + 4));
+
+        quantCoef12 = _mm_packs_epi32(quantCoef1, quantCoef2);
+        sign = _mm_srai_epi16(quantCoef12, 15);
+        quantCoef1 = _mm_unpacklo_epi16(quantCoef12, sign);
+        quantCoef2 = _mm_unpackhi_epi16(quantCoef12, sign);
+
+        quantCoef1 = _mm_sra_epi32(_mm_add_epi32(_mm_mullo_epi32(quantCoef1, vScale), vAdd), _mm_cvtsi32_si128(shift));
+        quantCoef2 = _mm_sra_epi32(_mm_add_epi32(_mm_mullo_epi32(quantCoef2, vScale), vAdd), _mm_cvtsi32_si128(shift));
+
+        quantCoef12 = _mm_packs_epi32(quantCoef1, quantCoef2);
+        sign = _mm_srai_epi16(quantCoef12, 15);
+        quantCoef1 = _mm_unpacklo_epi16(quantCoef12, sign);
+        _mm_storeu_si128((__m128i*)(coef + n), quantCoef1);
+        quantCoef2 = _mm_unpackhi_epi16(quantCoef12, sign);
+        _mm_storeu_si128((__m128i*)(coef + n + 4), quantCoef2);
     }
+}
+
+void dequant_scaling(const int32_t* quantCoef, const int32_t *deQuantCoef, int32_t* coef, int num, int per, int shift)
+{
+    assert(num <= 32 * 32);
 
     int valueToAdd;
-    int transformShift = MAX_TR_DYNAMIC_RANGE - X265_DEPTH - log2TrSize;
-    int shift = QUANT_IQUANT_SHIFT - QUANT_SHIFT - transformShift;
 
-    if (useScalingList)
+    shift += 4;
+
+    if (shift > per)
     {
-        shift += 4;
+        valueToAdd = 1 << (shift - per - 1);
+        __m128i IAdd = _mm_set1_epi32(valueToAdd);
 
-        if (shift > per)