[x265-commits] [x265] intra-sse3.cpp: replace xPredIntraAng4x4 vector class fun...

Sun Oct 20 23:24:37 CEST 2013

details:   http://hg.videolan.org/x265/rev/c1e53b796ef4
branches:  
changeset: 4538:c1e53b796ef4
user:      Dnyaneshwar Gorade <dnyaneshwar at multicorewareinc.com>
date:      Fri Oct 18 10:51:33 2013 +0530
description:
intra-sse3.cpp: replace xPredIntraAng4x4 vector class function with intrinsic.
Subject: [x265] blockcopy-sse3.cpp: removed warning: overflow in implicit constant conversion.

details:   http://hg.videolan.org/x265/rev/48afd41e0753
branches:  
changeset: 4539:48afd41e0753
user:      Dnyaneshwar Gorade <dnyaneshwar at multicorewareinc.com>
date:      Fri Oct 18 10:29:53 2013 +0530
description:
blockcopy-sse3.cpp: removed warning: overflow in implicit constant conversion.
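
For context, a minimal sketch of the kind of fix this changeset makes, based on the blockcopy-sse3.cpp hunk in the diff. `_mm_set1_epi8` takes a `char`, so passing the integer constant 255 can draw the "overflow in implicit constant conversion" warning; an explicit narrowing cast states the intent. The names `kDepth`, `makeMaxVal` and `storeLanes` are hypothetical and stand in for `X265_DEPTH` and the inline expression in the real code.

```cpp
#include <emmintrin.h> // SSE2: _mm_set1_epi8, _mm_storeu_si128
#include <cassert>
#include <stdint.h>

// Hypothetical stand-in for X265_DEPTH in an 8-bit build.
static const int kDepth = 8;

// Broadcast the maximum pixel value (255 for 8-bit) to all 16 byte lanes.
// The casts narrow 255 on purpose: first to unsigned char, then to the
// char parameter type of _mm_set1_epi8 (bit pattern 0xFF on x86 compilers).
inline __m128i makeMaxVal()
{
    const unsigned char maxPixel = (unsigned char)((1 << kDepth) - 1);
    return _mm_set1_epi8((char)maxPixel);
}

// Copy the 16 byte lanes to memory so they can be inspected.
inline void storeLanes(__m128i v, uint8_t out[16])
{
    _mm_storeu_si128((__m128i*)out, v);
}
```

The broadcast value is unchanged by the cast; only the diagnostic goes away.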
Subject: [x265] added cvt32to16_shr_sse2 function to testbench.

details:   http://hg.videolan.org/x265/rev/f3523973eafb
branches:  
changeset: 4540:f3523973eafb
user:      Dnyaneshwar Gorade <dnyaneshwar at multicorewareinc.com>
date:      Fri Oct 18 14:18:05 2013 +0530
description:
added cvt32to16_shr_sse2 function to testbench.

Measured speedup is almost 14x.
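
As a rough illustration of what a testbench compares the SSE2 routine against, here is a scalar sketch of a 32-bit to 16-bit conversion with a right shift. The function name `cvt32to16_shr_ref`, the parameter order and the rounding term are assumptions made for this example; they are not taken from the changeset.

```cpp
#include <cassert>
#include <stdint.h>

// Scalar sketch: narrow 32-bit values to 16 bits after a rounding right shift.
// Assumes shift >= 1 and that the shifted result fits in int16_t.
void cvt32to16_shr_ref(int16_t* dst, const int32_t* src, int shift, int num)
{
    assert(shift >= 1);
    const int32_t round = 1 << (shift - 1); // assumed round-to-nearest term
    for (int i = 0; i < num; i++)
        dst[i] = (int16_t)((src[i] + round) >> shift);
}
```

A harness would typically run this and the optimized version on the same random input and compare the output buffers byte for byte.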
Subject: [x265] cmake: msvc yasm dependency fix

details:   http://hg.videolan.org/x265/rev/357a6d0c305d
branches:  
changeset: 4541:357a6d0c305d
user:      Steve Borho <steve at borho.org>
date:      Fri Oct 18 13:39:39 2013 -0500
description:
cmake: msvc yasm dependency fix
Subject: [x265] blockcopy-sse3.cpp: removed unnecessary variable.

details:   http://hg.videolan.org/x265/rev/9ff06eb3bc4d
branches:  
changeset: 4542:9ff06eb3bc4d
user:      Dnyaneshwar Gorade <dnyaneshwar at multicorewareinc.com>
date:      Fri Oct 18 16:13:07 2013 +0530
description:
blockcopy-sse3.cpp: removed unnecessary variable.
Subject: [x265] added pixelavg_pp function to testbench

details:   http://hg.videolan.org/x265/rev/fdd1262059ad
branches:  
changeset: 4543:fdd1262059ad
user:      Dnyaneshwar Gorade <dnyaneshwar at multicorewareinc.com>
date:      Fri Oct 18 15:53:12 2013 +0530
description:
added pixelavg_pp function to testbench
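
A scalar sketch of the operation under test may help here: averaging two pixel blocks with rounding. The argument order follows the `pixelavg_pp` call visible in the TEncSearch.cpp hunk (destination, destination stride, two sources with their strides), but `pixelavg_pp_ref`, the explicit `width`/`height` parameters and the omission of the trailing weight argument are assumptions for illustration only.

```cpp
#include <cassert>
#include <stdint.h>

typedef uint8_t pixel; // assumes an 8-bit build

// Rounded average of two blocks: dst = (src0 + src1 + 1) >> 1 per pixel.
void pixelavg_pp_ref(pixel* dst, intptr_t dstride,
                     const pixel* src0, intptr_t sstride0,
                     const pixel* src1, intptr_t sstride1,
                     int width, int height)
{
    for (int y = 0; y < height; y++)
    {
        for (int x = 0; x < width; x++)
            dst[x] = (pixel)((src0[x] + src1[x] + 1) >> 1);

        // Step each pointer to the next row using its own stride.
        dst  += dstride;
        src0 += sstride0;
        src1 += sstride1;
    }
}
```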
Subject: [x265] pixelharness: fix iteration through partition enums

details:   http://hg.videolan.org/x265/rev/7e95be5f70bc
branches:  
changeset: 4544:7e95be5f70bc
user:      Steve Borho <steve at borho.org>
date:      Fri Oct 18 14:31:39 2013 -0500
description:
pixelharness: fix iteration through partition enums
Subject: [x265] asm: disable remaining pixelavg primitives, they fail against our C ref

details:   http://hg.videolan.org/x265/rev/904ff6d6e5d9
branches:  
changeset: 4545:904ff6d6e5d9
user:      Steve Borho <steve at borho.org>
date:      Fri Oct 18 14:35:08 2013 -0500
description:
asm: disable remaining pixelavg primitives, they fail against our C ref
Subject: [x265] rc: removed warning, moved strength to acEnergyCu

details:   http://hg.videolan.org/x265/rev/089b29b4da2a
branches:  
changeset: 4546:089b29b4da2a
user:      Aarthi Thirumalai <aarthi at multicorewareinc.com>
date:      Fri Oct 18 17:47:20 2013 +0530
description:
rc: removed warning, moved strength to acEnergyCu
Subject: [x265] intra: replace intra_pred_dc vector class function with intrinsic

details:   http://hg.videolan.org/x265/rev/d24283fe5e31
branches:  
changeset: 4547:d24283fe5e31
user:      Yuvaraj Venkatesh <yuvaraj at multicorewareinc.com>
date:      Fri Oct 18 16:05:38 2013 +0530
description:
intra: replace intra_pred_dc vector class function with intrinsic
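
For readers unfamiliar with the primitive, a scalar sketch of DC intra prediction follows. It mirrors the arithmetic of the vector-class code removed in the diff (sum the `width` neighbours above and to the left, then divide by `2 * width` with rounding), but it is not the intrinsic implementation this changeset adds, it leaves out the optional edge filter, and the names `dcValue` and `intra_pred_dc_ref` are hypothetical.

```cpp
#include <cassert>
#include <stdint.h>

typedef uint8_t pixel; // assumes an 8-bit build

// Rounded mean of the `width` pixels above and the `width` pixels to the left.
// Assumes width is a power of two (4, 8, 16 or 32).
pixel dcValue(const pixel* above, const pixel* left, int width)
{
    int sum = 0;
    for (int i = 0; i < width; i++)
        sum += above[i] + left[i];

    int logSize = 0;
    while ((1 << logSize) < width)
        logSize++;
    logSize += 1; // 2 * width samples were summed

    return (pixel)((sum + (1 << (logSize - 1))) >> logSize);
}

// Fill a width x width block with the DC value.
void intra_pred_dc_ref(const pixel* above, const pixel* left,
                       pixel* dst, intptr_t dstStride, int width)
{
    const pixel dc = dcValue(above, left, width);
    for (int y = 0; y < width; y++)
        for (int x = 0; x < width; x++)
            dst[y * dstStride + x] = dc;
}
```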
Subject: [x265] intra: replace predDCFiltering vector class function with intrinsic

details:   http://hg.videolan.org/x265/rev/140f90417702
branches:  
changeset: 4548:140f90417702
user:      Yuvaraj Venkatesh <yuvaraj at multicorewareinc.com>
date:      Fri Oct 18 17:05:37 2013 +0530
description:
intra: replace predDCFiltering vector class function with intrinsic
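
The edge filter can be written as a short scalar routine. This sketch follows the boundary formulas visible in the removed `predDCFiltering` code (corner, first row, first column); the plain loop over the first row is an assumption that replaces the vector loads and stores, and `predDCFiltering_ref` is a hypothetical name.

```cpp
#include <cassert>
#include <stdint.h>

typedef uint8_t pixel; // assumes an 8-bit build

// Smooth the top row and left column of a DC-predicted block.
// dst must already be filled with the DC value.
void predDCFiltering_ref(const pixel* above, const pixel* left,
                         pixel* dst, intptr_t dstStride, int width)
{
    const int pixDC = dst[0];
    const int pixDCx3 = pixDC * 3 + 2; // 3 * DC plus the rounding term

    // Corner pixel blends both neighbours with twice the DC value.
    dst[0] = (pixel)((above[0] + left[0] + 2 * pixDC + 2) >> 2);

    // First row: blend each pixel with the neighbour above it.
    for (int x = 1; x < width; x++)
        dst[x] = (pixel)((above[x] + pixDCx3) >> 2);

    // First column: blend each pixel with the neighbour to its left.
    for (int y = 1; y < width; y++)
        dst[y * dstStride] = (pixel)((left[y] + pixDCx3) >> 2);
}
```

Interior pixels are left untouched, so only `2 * width - 1` values change.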
Subject: [x265] intra: remove SSE3 planar intrinsic functions; they are redundant

details:   http://hg.videolan.org/x265/rev/edf6eb8da4ca
branches:  
changeset: 4549:edf6eb8da4ca
user:      Steve Borho <steve at borho.org>
date:      Fri Oct 18 13:32:50 2013 -0500
description:
intra: remove SSE3 planar intrinsic functions; they are redundant
Subject: [x265] intra: nits

details:   http://hg.videolan.org/x265/rev/6a453beeea88
branches:  
changeset: 4550:6a453beeea88
user:      Steve Borho <steve at borho.org>
date:      Fri Oct 18 13:32:55 2013 -0500
description:
intra: nits
Subject: [x265] intra: sane function names and typedefs

details:   http://hg.videolan.org/x265/rev/1959dbe1b643
branches:  
changeset: 4551:1959dbe1b643
user:      Steve Borho <steve at borho.org>
date:      Fri Oct 18 14:06:56 2013 -0500
description:
intra: sane function names and typedefs
Subject: [x265] intra: isolate last remaining vector class functions (angular intra 8, 16, 32)

details:   http://hg.videolan.org/x265/rev/8de380c7bd41
branches:  
changeset: 4552:8de380c7bd41
user:      Steve Borho <steve at borho.org>
date:      Fri Oct 18 13:52:48 2013 -0500
description:
intra: isolate last remaining vector class functions (angular intra 8, 16, 32)
Subject: [x265] ipfilter.cpp, added code to support luma coefficients too

details:   http://hg.videolan.org/x265/rev/4bcfc0e23935
branches:  
changeset: 4553:4bcfc0e23935
user:      Praveen Tiwari
date:      Fri Oct 18 15:53:11 2013 +0530
description:
ipfilter.cpp, added code to support luma coefficients too
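
The ipfilter.cpp hunk in the diff selects the coefficient table from the template tap count `N`. A self-contained sketch of that selection is shown here; the tables hold the standard HEVC 8-tap luma and 4-tap chroma interpolation coefficients, while `kLumaFilter`, `kChromaFilter`, `selectCoeff` and `tapSum` are hypothetical names standing in for `g_lumaFilter`, `g_chromaFilter` and the inline expression.

```cpp
#include <cassert>
#include <stdint.h>

// HEVC luma interpolation filter, indexed by quarter-sample position.
static const int16_t kLumaFilter[4][8] =
{
    {  0, 0,   0, 64,  0,   0, 0,  0 },
    { -1, 4, -10, 58, 17,  -5, 1,  0 },
    { -1, 4, -11, 40, 40, -11, 4, -1 },
    {  0, 1,  -5, 17, 58, -10, 4, -1 }
};

// HEVC chroma interpolation filter, indexed by eighth-sample position.
static const int16_t kChromaFilter[8][4] =
{
    {  0, 64,  0,  0 }, { -2, 58, 10, -2 },
    { -4, 54, 16, -2 }, { -6, 46, 28, -4 },
    { -4, 36, 36, -4 }, { -4, 28, 46, -6 },
    { -2, 16, 54, -4 }, { -2, 10, 58, -2 }
};

// Pick the table from the tap count: 4 taps means chroma, otherwise luma.
template<int N>
const int16_t* selectCoeff(int coeffIdx)
{
    assert(coeffIdx >= 0 && coeffIdx < (N == 4 ? 8 : 4));
    return (N == 4) ? kChromaFilter[coeffIdx] : kLumaFilter[coeffIdx];
}

// Sum of the taps; every row of either table sums to 64.
template<int N>
int tapSum(int coeffIdx)
{
    const int16_t* c = selectCoeff<N>(coeffIdx);
    int sum = 0;
    for (int i = 0; i < N; i++)
        sum += c[i];
    return sum;
}
```

Because `N` is a compile-time constant, the compiler can fold the ternary away in each instantiation, so one template body serves both luma and chroma without a runtime branch.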
Subject: [x265] asm: corrected luma enum variable, testbench fix

details:   http://hg.videolan.org/x265/rev/8b507771e6b0
branches:  
changeset: 4554:8b507771e6b0
user:      Praveen Tiwari
date:      Fri Oct 18 15:43:54 2013 +0530
description:
asm: corrected luma enum variable, testbench fix
Subject: [x265] added 24x32 partition size asm code to chroma function

details:   http://hg.videolan.org/x265/rev/a301f749b0bc
branches:  
changeset: 4555:a301f749b0bc
user:      Praveen Tiwari
date:      Fri Oct 18 17:07:36 2013 +0530
description:
added 24x32 partition size asm code to chroma function
Subject: [x265] asm code for luma filter functions

details:   http://hg.videolan.org/x265/rev/0d146f05d561
branches:  
changeset: 4556:0d146f05d561
user:      Praveen Tiwari
date:      Fri Oct 18 16:18:12 2013 +0530
description:
asm code for luma filter functions
Subject: [x265] TEncSearch: add x265_emms() after use of pixelavg_pp and satd primitives

details:   http://hg.videolan.org/x265/rev/1fa93e1f4caa
branches:  
changeset: 4557:1fa93e1f4caa
user:      Steve Borho <steve at borho.org>
date:      Fri Oct 18 14:57:57 2013 -0500
description:
TEncSearch: add x265_emms() after use of pixelavg_pp and satd primitives
Subject: [x265] ipfilterharness: simplify filter names

details:   http://hg.videolan.org/x265/rev/4066e6e725ee
branches:  
changeset: 4558:4066e6e725ee
user:      Steve Borho <steve at borho.org>
date:      Fri Oct 18 15:00:40 2013 -0500
description:
ipfilterharness: simplify filter names
Subject: [x265] WaveFront: add new function to enable all rows

details:   http://hg.videolan.org/x265/rev/dd45e55248c8
branches:  
changeset: 4559:dd45e55248c8
user:      Deepthi Devaki <deepthidevaki at multicorewareinc.com>
date:      Fri Oct 18 17:01:32 2013 +0530
description:
WaveFront: add new function to enable all rows
Subject: [x265] Lookahead: implement wavefront parallel processing

details:   http://hg.videolan.org/x265/rev/c96f97cf3914
branches:  
changeset: 4560:c96f97cf3914
user:      Steve Borho <steve at borho.org>
date:      Fri Oct 18 17:10:56 2013 +0530
description:
Lookahead: implement wavefront parallel processing
Subject: [x265] intra: move intra_pred_dc to intra-sse41.cpp; it uses SSSE3 instructions

details:   http://hg.videolan.org/x265/rev/7ec69cb067fd
branches:  
changeset: 4561:7ec69cb067fd
user:      Steve Borho <steve at borho.org>
date:      Sun Oct 20 15:51:01 2013 -0500
description:
intra: move intra_pred_dc to intra-sse41.cpp; it uses SSSE3 instructions

We don't have an intra-ssse3.cpp and it seems a waste to create one just for
this one function.
Subject: [x265] reduce register copies in FILTER_H4_w2_2 and FILTER_H4_w4_2

details:   http://hg.videolan.org/x265/rev/fabb25ae4db4
branches:  
changeset: 4562:fabb25ae4db4
user:      Min Chen <chenm003 at 163.com>
date:      Sat Oct 19 18:08:07 2013 +0800
description:
reduce register copies in FILTER_H4_w2_2 and FILTER_H4_w4_2

diffstat:

 source/Lib/TLibEncoder/TEncSearch.cpp |    5 +-
 source/common/CMakeLists.txt          |    2 +-
 source/common/ipfilter.cpp            |    2 +-
 source/common/vec/blockcopy-sse3.cpp  |    5 +-
 source/common/vec/intra-sse3.cpp      |  812 ++++++++-------------------------
 source/common/vec/intra-sse41.cpp     |  190 +++++++
 source/common/wavefront.cpp           |    5 +
 source/common/wavefront.h             |   12 +-
 source/common/x86/asm-primitives.cpp  |   14 +-
 source/common/x86/ipfilter8.asm       |  250 ++++++++--
 source/encoder/encoder.cpp            |    3 +-
 source/encoder/ratecontrol.cpp        |    7 +-
 source/encoder/slicetype.cpp          |  152 +++++-
 source/encoder/slicetype.h            |   54 +-
 source/test/ipfilterharness.cpp       |   10 +-
 source/test/pixelharness.cpp          |   87 +++-
 source/test/pixelharness.h            |    4 +
 17 files changed, 893 insertions(+), 721 deletions(-)

diffs (truncated from 2501 to 300 lines):

diff -r d6d7187c5f4e -r fabb25ae4db4 source/Lib/TLibEncoder/TEncSearch.cpp
--- a/source/Lib/TLibEncoder/TEncSearch.cpp	Fri Oct 18 00:42:36 2013 -0500
+++ b/source/Lib/TLibEncoder/TEncSearch.cpp	Sat Oct 19 18:08:07 2013 +0800
@@ -2336,8 +2336,8 @@ void TEncSearch::predInterSearch(TComDat
 
                 int partEnum = PartitionFromSizes(roiWidth, roiHeight);
                 primitives.pixelavg_pp[partEnum](avg, roiWidth, ref0, m_predYuv[0].getStride(), ref1, m_predYuv[1].getStride(), 32);
-
                 int satdCost = primitives.satd[partEnum](pu, fenc->getStride(), avg, roiWidth);
+                x265_emms();
                 bits[2] = bits[0] + bits[1] - mbBits[0] - mbBits[1] + mbBits[2];
                 costbi =  satdCost + m_rdCost->getCost(bits[2]);
 
@@ -2347,10 +2347,9 @@ void TEncSearch::predInterSearch(TComDat
                     ref1 = cu->getSlice()->m_mref[1][refIdx[1]]->fpelPlane + (pu - fenc->getLumaAddr());  //MV(0,0) of ref1
                     intptr_t refStride = cu->getSlice()->m_mref[0][refIdx[0]]->lumaStride;
 
-                    partEnum = PartitionFromSizes(roiWidth, roiHeight);
                     primitives.pixelavg_pp[partEnum](avg, roiWidth, ref0, refStride, ref1, refStride, 32);
-
                     satdCost = primitives.satd[partEnum](pu, fenc->getStride(), avg, roiWidth);
+                    x265_emms();
 
                     unsigned int bitsZero0, bitsZero1;
                     m_me.setMVP(mvPredBi[0][refIdxBidir[0]]);
diff -r d6d7187c5f4e -r fabb25ae4db4 source/common/CMakeLists.txt
--- a/source/common/CMakeLists.txt	Fri Oct 18 00:42:36 2013 -0500
+++ b/source/common/CMakeLists.txt	Sat Oct 19 18:08:07 2013 +0800
@@ -202,7 +202,7 @@ if(ENABLE_PRIMITIVES_ASM)
             add_custom_command(
                 OUTPUT ${ASM}.obj
                 COMMAND ${YASM_EXECUTABLE} ARGS ${FLAGS} ${CMAKE_CURRENT_SOURCE_DIR}/x86/${ASM} -o ${ASM}.obj
-                DEPENDS x86/${ASM})
+                DEPENDS ${CMAKE_CURRENT_SOURCE_DIR}/x86/${ASM})
         endforeach()
         add_library(assembly STATIC x86/asm-primitives.cpp x86/pixel.h x86/mc.h x86/ipfilter8.h ${FULLPATHASM} ${OBJS})
     else()
diff -r d6d7187c5f4e -r fabb25ae4db4 source/common/ipfilter.cpp
--- a/source/common/ipfilter.cpp	Fri Oct 18 00:42:36 2013 -0500
+++ b/source/common/ipfilter.cpp	Sat Oct 19 18:08:07 2013 +0800
@@ -448,7 +448,7 @@ void extendCURowColBorder(pixel* txt, in
 template<int N, int width, int height>
 void interp_horiz_pp_c(pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx)
 {
-    int16_t const * coeff = g_chromaFilter[coeffIdx];
+    int16_t const * coeff = (N == 4) ? g_chromaFilter[coeffIdx] : g_lumaFilter[coeffIdx];
     int headRoom = IF_INTERNAL_PREC - X265_DEPTH;
     int offset =  (1 << (headRoom - 1));
     int16_t maxVal = (1 << X265_DEPTH) - 1;
diff -r d6d7187c5f4e -r fabb25ae4db4 source/common/vec/blockcopy-sse3.cpp
--- a/source/common/vec/blockcopy-sse3.cpp	Fri Oct 18 00:42:36 2013 -0500
+++ b/source/common/vec/blockcopy-sse3.cpp	Sat Oct 19 18:08:07 2013 +0800
@@ -134,11 +134,10 @@ void blockcopy_ps(int bx, int by, pixel 
 void pixeladd_pp(int bx, int by, pixel *dst, intptr_t dstride, pixel *src0, pixel *src1, intptr_t sstride0, intptr_t sstride1)
 {
     size_t aligncheck = (size_t)dst | (size_t)src0 | bx | sstride0 | sstride1 | dstride;
-    int i = 1;
 
     if (!(aligncheck & 15))
     {
-        __m128i maxval = _mm_set1_epi8((i << X265_DEPTH) - 1);
+        __m128i maxval = _mm_set1_epi8((unsigned char)((1 << X265_DEPTH) - 1));
         __m128i zero = _mm_setzero_si128();
 
         // fast path, multiples of 16 pixel wide blocks
@@ -162,7 +161,7 @@ void pixeladd_pp(int bx, int by, pixel *
     }
     else if (!(bx & 15))
     {
-        __m128i maxval = _mm_set1_epi8((i << X265_DEPTH) - 1);
+        __m128i maxval = _mm_set1_epi8((unsigned char)((1 << X265_DEPTH) - 1));
         __m128i zero = _mm_setzero_si128();
 
         // fast path, multiples of 16 pixel wide blocks but pointers/strides require unaligned accesses
diff -r d6d7187c5f4e -r fabb25ae4db4 source/common/vec/intra-sse3.cpp
--- a/source/common/vec/intra-sse3.cpp	Fri Oct 18 00:42:36 2013 -0500
+++ b/source/common/vec/intra-sse3.cpp	Sat Oct 19 18:08:07 2013 +0800
@@ -24,16 +24,11 @@
  * For more information, contact us at licensing at multicorewareinc.com.
  *****************************************************************************/
 
-#if defined(_MSC_VER)
-#define ALWAYSINLINE  __forceinline
-#endif
-
-#define INSTRSET 3
-#include "vectorclass.h"
-
 #include "primitives.h"
 #include "TLibCommon/TComRom.h"
 #include <assert.h>
+#include <xmmintrin.h> // SSE
+#include <pmmintrin.h> // SSE3
 
 using namespace x265;
 
@@ -96,387 +91,6 @@ const int angAP[17][64] =
 #define GETAP(X, Y) angAP[8 - (X)][(Y)]
 
 #if !HIGH_BIT_DEPTH
-inline void predDCFiltering(pixel* above, pixel* left, pixel* dst, intptr_t dstStride, int width)
-{
-    int y;
-    pixel pixDC = *dst;
-    int pixDCx3 = pixDC * 3 + 2;
-
-    // boundary pixels processing
-    dst[0] = (pixel)((above[0] + left[0] + 2 * pixDC + 2) >> 2);
-
-    Vec8us im1(pixDCx3);
-    Vec8us im2, im3;
-    Vec16uc pix;
-    switch (width)
-    {
-    case 4:
-        pix = load_partial(const_int(4), &above[1]);
-        im2 = extend_low(pix);
-        im2 = (im1 + im2) >> const_int(2);
-        pix = compress(im2, im2);
-        store_partial(const_int(4), &dst[1], pix);
-        break;
-
-    case 8:
-        pix = load_partial(const_int(8), &above[1]);
-        im2 = extend_low(pix);
-        im2 = (im1 + im2) >> const_int(2);
-        pix = compress(im2, im2);
-        store_partial(const_int(8), &dst[1], pix);
-        break;
-
-    case 16:
-        pix.load(&above[1]);
-        im2 = extend_low(pix);
-        im3 = extend_high(pix);
-        im2 = (im1 + im2) >> const_int(2);
-        im3 = (im1 + im3) >> const_int(2);
-        pix = compress(im2, im3);
-        pix.store(&dst[1]);
-        break;
-
-    case 32:
-        pix.load(&above[1]);
-        im2 = extend_low(pix);
-        im3 = extend_high(pix);
-        im2 = (im1 + im2) >> const_int(2);
-        im3 = (im1 + im3) >> const_int(2);
-        pix = compress(im2, im3);
-        pix.store(&dst[1]);
-
-        pix.load(&above[1 + 16]);
-        im2 = extend_low(pix);
-        im3 = extend_high(pix);
-        im2 = (im1 + im2) >> const_int(2);
-        im3 = (im1 + im3) >> const_int(2);
-        pix = compress(im2, im3);
-        pix.store(&dst[1 + 16]);
-        break;
-    }
-
-    for (y = 1; y < width; y++)
-    {
-        dst[dstStride] = (pixel)((left[y] + pixDCx3) >> 2);
-        dst += dstStride;
-    }
-}
-
-void intra_pred_dc(pixel* above, pixel* left, pixel* dst, intptr_t dstStride, int width, int filter)
-{
-    int sum;
-    int logSize = g_convertToBit[width] + 2;
-
-    Vec16uc pixL, pixT;
-    Vec8us  im;
-    Vec4ui  im1, im2;
-
-    switch (width)
-    {
-    case 4:
-        pixL.fromUint32(*(uint32_t*)left);
-        pixT.fromUint32(*(uint32_t*)above);
-        sum  = horizontal_add(extend_low(pixL));
-        sum += horizontal_add(extend_low(pixT));
-        break;
-
-    case 8:
-#if X86_64
-        pixL.fromUint64(*(uint64_t*)left);
-        pixT.fromUint64(*(uint64_t*)above);
-#else
-        pixL.load_partial(8, left);
-        pixT.load_partial(8, above);
-#endif
-        sum  = horizontal_add(extend_low(pixL));
-        sum += horizontal_add(extend_low(pixT));
-        break;
-
-    case 16:
-        pixL.load(left);
-        pixT.load(above);
-        sum  = horizontal_add_x(pixL);
-        sum += horizontal_add_x(pixT);
-        break;
-
-    default:
-    case 32:
-        pixL.load(left);
-        im1  = (Vec4ui)(pixL.sad(_mm_setzero_si128()));
-        pixL.load(left + 16);
-        im1 += (Vec4ui)(pixL.sad(_mm_setzero_si128()));
-
-        pixT.load(above);
-        im1 += (Vec4ui)(pixT.sad(_mm_setzero_si128()));
-        pixT.load(above + 16);
-        im1 += (Vec4ui)(pixT.sad(_mm_setzero_si128()));
-        im1 += (Vec4ui)((Vec128b)im1 >> const_int(64));
-        sum = toInt32(im1);
-        break;
-    }
-
-    logSize += 1;
-    pixel dcVal = (sum + (1 << (logSize - 1))) >> logSize;
-    Vec16uc dcValN(dcVal);
-    pixel *dst1 = dst;
-
-    switch (width)
-    {
-    case 4:
-        store_partial(const_int(4), dst1, dcValN);
-        dst1 += dstStride;
-        store_partial(const_int(4), dst1, dcValN);
-        dst1 += dstStride;
-        store_partial(const_int(4), dst1, dcValN);
-        dst1 += dstStride;
-        store_partial(const_int(4), dst1, dcValN);
-        break;
-
-    case 8:
-        store_partial(const_int(8), dst1, dcValN);
-        dst1 += dstStride;
-        store_partial(const_int(8), dst1, dcValN);
-        dst1 += dstStride;
-        store_partial(const_int(8), dst1, dcValN);
-        dst1 += dstStride;
-        store_partial(const_int(8), dst1, dcValN);
-        dst1 += dstStride;
-        store_partial(const_int(8), dst1, dcValN);
-        dst1 += dstStride;
-        store_partial(const_int(8), dst1, dcValN);
-        dst1 += dstStride;
-        store_partial(const_int(8), dst1, dcValN);
-        dst1 += dstStride;
-        store_partial(const_int(8), dst1, dcValN);
-        break;
-
-    case 16:
-        for (int k = 0; k < 16; k += 4)
-        {
-            store_partial(const_int(16), dst1, dcValN);
-            dst1 += dstStride;
-            store_partial(const_int(16), dst1, dcValN);
-            dst1 += dstStride;
-            store_partial(const_int(16), dst1, dcValN);
-            dst1 += dstStride;
-            store_partial(const_int(16), dst1, dcValN);
-            dst1 += dstStride;
-        }
-        break;
-
-    case 32:
-        for (int k = 0; k < 32; k += 2)
-        {
-            store_partial(const_int(16), dst1,      dcValN);
-            store_partial(const_int(16), dst1 + 16, dcValN);
-            dst1 += dstStride;
-            store_partial(const_int(16), dst1,      dcValN);
-            store_partial(const_int(16), dst1 + 16, dcValN);
-            dst1 += dstStride;
-        }
-        break;
-    }
-
-    if (filter)
-    {
-        predDCFiltering(above, left, dst, dstStride, width);
-    }
-}
-
-#define BROADCAST16(a, d, x) { \
-    const int dL = (d) & 3; \
-    const int dH = ((d)-4) & 3; \
-    if (d>=4) { \
-        (x) = _mm_shufflehi_epi16((a), dH * 0x55); \
-        (x) = _mm_unpackhi_epi64((x), (x)); \
-    } \
-    else { \
-        (x) = _mm_shufflelo_epi16((a), dL * 0x55); \
-        (x) = _mm_unpacklo_epi64((x), (x)); \
-    } \
-}
-