[x265-commits] [x265] TEncSearch: fix for gcc warning
Deepthi Devaki
deepthidevaki at multicorewareinc.com
Fri Oct 4 21:14:01 CEST 2013
details: http://hg.videolan.org/x265/rev/ae9c68edd6b2
branches:
changeset: 4193:ae9c68edd6b2
user: Deepthi Devaki <deepthidevaki at multicorewareinc.com>
date: Fri Oct 04 15:23:07 2013 +0530
description:
TEncSearch: fix for gcc warning
Subject: [x265] Bidir ME: store bits required for bidir which will be used for merge estimation
details: http://hg.videolan.org/x265/rev/5b987ed0a557
branches:
changeset: 4194:5b987ed0a557
user: Deepthi Devaki <deepthidevaki at multicorewareinc.com>
date: Fri Oct 04 15:40:48 2013 +0530
description:
Bidir ME: store bits required for bidir which will be used for merge estimation
Subject: [x265] TEncSearch: remove unused code
details: http://hg.videolan.org/x265/rev/a201bc951e10
branches:
changeset: 4195:a201bc951e10
user: Deepthi Devaki <deepthidevaki at multicorewareinc.com>
date: Fri Oct 04 15:25:15 2013 +0530
description:
TEncSearch: remove unused code
Subject: [x265] Replace sad_48 vector class function with intrinsic.
details: http://hg.videolan.org/x265/rev/88378feb4794
branches:
changeset: 4196:88378feb4794
user: yuvaraj
date: Fri Oct 04 15:32:59 2013 +0530
description:
Replace sad_48 vector class function with intrinsic.
Subject: [x265] Replace sad_64 vector class function with intrinsic.
details: http://hg.videolan.org/x265/rev/4f990ec05dc5
branches:
changeset: 4197:4f990ec05dc5
user: yuvaraj
date: Fri Oct 04 16:02:25 2013 +0530
description:
Replace sad_64 vector class function with intrinsic.
Subject: [x265] Replace sad_x3_48 vector class function with intrinsic.
details: http://hg.videolan.org/x265/rev/c29821f80cd3
branches:
changeset: 4198:c29821f80cd3
user: yuvaraj
date: Fri Oct 04 16:13:58 2013 +0530
description:
Replace sad_x3_48 vector class function with intrinsic.
Subject: [x265] Replace sad_x3_64 vector class function with intrinsic.
details: http://hg.videolan.org/x265/rev/6dcae4946fe3
branches:
changeset: 4199:6dcae4946fe3
user: yuvaraj
date: Fri Oct 04 16:20:32 2013 +0530
description:
Replace sad_x3_64 vector class function with intrinsic.
Subject: [x265] Replace sad_x4_48 vector class function with intrinsic.
details: http://hg.videolan.org/x265/rev/d370697071ed
branches:
changeset: 4200:d370697071ed
user: yuvaraj
date: Fri Oct 04 16:33:06 2013 +0530
description:
Replace sad_x4_48 vector class function with intrinsic.
Subject: [x265] Replace sad_x4_64 vector class function with intrinsic.
details: http://hg.videolan.org/x265/rev/d59dcf48b9de
branches:
changeset: 4201:d59dcf48b9de
user: yuvaraj
date: Fri Oct 04 16:41:08 2013 +0530
description:
Replace sad_x4_64 vector class function with intrinsic.
Subject: [x265] pixel: move SSE4.1 functions from pixel8.inc to pixel-sse41.cpp
details: http://hg.videolan.org/x265/rev/8829b508822b
branches:
changeset: 4202:8829b508822b
user: Steve Borho <steve at borho.org>
date: Fri Oct 04 12:33:54 2013 -0500
description:
pixel: move SSE4.1 functions from pixel8.inc to pixel-sse41.cpp
Subject: [x265] replace block_copy_p_p vector class function with intrinsic code.
details: http://hg.videolan.org/x265/rev/7b93c1cae0c4
branches:
changeset: 4203:7b93c1cae0c4
user: Dnyaneshwar
date: Fri Oct 04 16:27:02 2013 +0530
description:
replace block_copy_p_p vector class function with intrinsic code.
Performance is almost the same as that of the vector function.
Subject: [x265] replace block_copy_p_s (short to pixel) vector class function with intrinsic.
details: http://hg.videolan.org/x265/rev/64325084bd3b
branches:
changeset: 4204:64325084bd3b
user: Dnyaneshwar
date: Fri Oct 04 16:55:16 2013 +0530
description:
replace block_copy_p_s (short to pixel) vector class function with intrinsic.
Measured performance is the same as that of the vector function.
Subject: [x265] replace blockcopy_s_p (pixel to short) vector class function with intrinsic.
details: http://hg.videolan.org/x265/rev/5b7226f332be
branches:
changeset: 4205:5b7226f332be
user: Dnyaneshwar
date: Fri Oct 04 17:10:51 2013 +0530
description:
replace blockcopy_s_p (pixel to short) vector class function with intrinsic.
Performance is the same as that of the vector class function.
Subject: [x265] replace "pixelsub_sp" vector class function with intrinsic.
details: http://hg.videolan.org/x265/rev/1a884afb63bb
branches:
changeset: 4206:1a884afb63bb
user: Dnyaneshwar
date: Fri Oct 04 18:03:32 2013 +0530
description:
replace "pixelsub_sp" vector class function with intrinsic.
Performance is the same as that of the vector function.
Subject: [x265] Replace "pixeladd_ss" vector class function with intrinsic.
details: http://hg.videolan.org/x265/rev/cfc69c57d335
branches:
changeset: 4207:cfc69c57d335
user: Dnyaneshwar
date: Fri Oct 04 18:42:00 2013 +0530
description:
Replace "pixeladd_ss" vector class function with intrinsic.
Measured performance is the same as that of the vector function.
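All of the conversions in this series follow the same basic pattern: an aligned load/store pair expressed through the vector class types (Vec16c, Vec8us, and friends) is replaced by the equivalent raw SSE2 intrinsics. A minimal sketch of the mapping (illustration only; copy16 is a hypothetical helper, not code from any of these patches):

    #include <emmintrin.h>  // SSE2 intrinsics

    // vector class version being replaced:
    //     Vec16c word;
    //     word.load_a(src);   // aligned 16-byte load
    //     word.store_a(dst);  // aligned 16-byte store
    void copy16(unsigned char *dst, const unsigned char *src)
    {
        __m128i word = _mm_load_si128((const __m128i *)src); // aligned 16-byte load
        _mm_store_si128((__m128i *)dst, word);               // aligned 16-byte store
    }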
diffstat:
source/Lib/TLibEncoder/TEncSearch.cpp | 16 +-
source/common/vec/blockcopy-sse3.cpp | 86 +-
source/common/vec/pixel-sse41.cpp | 9796 ++++++++++++++++++++++++++++++++-
source/common/vec/pixel8.inc | 471 -
4 files changed, 9849 insertions(+), 520 deletions(-)
diffs (truncated from 10518 to 300 lines):
diff -r bf14f75b8cf9 -r cfc69c57d335 source/Lib/TLibEncoder/TEncSearch.cpp
--- a/source/Lib/TLibEncoder/TEncSearch.cpp Fri Oct 04 01:39:22 2013 -0500
+++ b/source/Lib/TLibEncoder/TEncSearch.cpp Fri Oct 04 18:42:00 2013 +0530
@@ -2249,14 +2249,12 @@ void TEncSearch::predInterSearch(TComDat
UInt mbBits[3] = { 1, 1, 0 };
int refIdx[2] = { 0, 0 }; // If un-initialized, may cause SEGV in bi-directional prediction iterative stage.
- int refIdxBidir[2];
+ int refIdxBidir[2] = { 0, 0 };
UInt partAddr;
int roiWidth, roiHeight;
PartSize partSize = cu->getPartitionSize(0);
- int bestBiPRefIdxL1 = 0;
- int bestBiPMvpL1 = 0;
UInt lastMode = 0;
int numPart = cu->getNumPartInter();
int numPredDir = cu->getSlice()->isInterP() ? 1 : 2;
@@ -2281,7 +2279,6 @@ void TEncSearch::predInterSearch(TComDat
UInt costbi = MAX_UINT;
UInt costTemp = 0;
UInt bitsTemp;
- UInt bestBiPDist = MAX_INT;
MV mvValidList1;
int refIdxValidList1 = 0;
UInt bitsValidList1 = MAX_UINT;
@@ -2328,13 +2325,6 @@ void TEncSearch::predInterSearch(TComDat
mvpIdx[refList][refIdxTmp] = cu->getMVPIdx(picList, partAddr);
mvpNum[refList][refIdxTmp] = cu->getMVPNum(picList, partAddr);
- if (cu->getSlice()->getMvdL1ZeroFlag() && refList == 1 && biPDistTemp < bestBiPDist)
- {
- bestBiPDist = biPDistTemp;
- bestBiPMvpL1 = mvpIdx[refList][refIdxTmp];
- bestBiPRefIdxL1 = refIdxTmp;
- }
-
bitsTemp += m_mvpIdxCost[mvpIdx[refList][refIdxTmp]][AMVP_MAX_NUM_CANDS];
if (refList == 1) // list 1
@@ -2441,7 +2431,8 @@ void TEncSearch::predInterSearch(TComDat
primitives.pixelavg_pp[partEnum](avg, roiWidth, ref0, ref1, m_predYuv[0].getStride(), m_predYuv[1].getStride());
int satdCost = primitives.satd[partEnum](pu, fenc->getStride(), avg, roiWidth);
- costbi = satdCost + m_rdCost->getCost(bits[0]) + m_rdCost->getCost(bits[1]);
+ bits[2] = bits[0] + bits[1] - mbBits[0] - mbBits[1] + mbBits[2];
+ costbi = satdCost + m_rdCost->getCost(bits[2]);
if (mv[0].notZero() || mv[1].notZero())
{
@@ -2470,6 +2461,7 @@ void TEncSearch::predInterSearch(TComDat
costbi = costZero;
mvBidir[0].x = mvBidir[0].y = 0;
mvBidir[1].x = mvBidir[1].y = 0;
+ bits[2] = bitsZero0 + bitsZero1 - mbBits[0] - mbBits[1] + mbBits[2];
}
}
} // if (B_SLICE)
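For reference, the bit accounting introduced above removes the mode bits charged to each uni-directional list and substitutes the bidir mode bits before costing. With mbBits[] = { 1, 1, 0 } as declared at the top of predInterSearch, a worked example (illustrative numbers only):

    // Suppose the list-0 search settled at bits[0] = 20 and list 1 at bits[1] = 18.
    // bits[2] = bits[0] + bits[1] - mbBits[0] - mbBits[1] + mbBits[2]
    //         = 20 + 18 - 1 - 1 + 0
    //         = 36
    // costbi = satdCost + m_rdCost->getCost(bits[2]), and bits[2] is kept
    // around for the merge estimation mentioned in the commit message.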
diff -r bf14f75b8cf9 -r cfc69c57d335 source/common/vec/blockcopy-sse3.cpp
--- a/source/common/vec/blockcopy-sse3.cpp Fri Oct 04 01:39:22 2013 -0500
+++ b/source/common/vec/blockcopy-sse3.cpp Fri Oct 04 18:42:00 2013 +0530
@@ -76,9 +76,8 @@ void blockcopy_p_p(int bx, int by, pixel
{
for (int x = 0; x < bx; x += 16)
{
- Vec16c word;
- word.load_a(src + x);
- word.store_a(dst + x);
+ __m128i word0 = _mm_load_si128((__m128i const*)(src + x)); // load block of 16 bytes from src
+ _mm_store_si128((__m128i*)&dst[x], word0); // store block into dst
}
src += sstride;
@@ -107,10 +106,14 @@ void blockcopy_p_s(int bx, int by, pixel
{
for (int x = 0; x < bx; x += 16)
{
- Vec8us word0, word1;
- word0.load_a(src + x);
- word1.load_a(src + x + 8);
- compress(word0, word1).store_a(dst + x);
+ __m128i word0 = _mm_load_si128((__m128i const*)(src + x)); // load block of 16 bytes from src
+ __m128i word1 = _mm_load_si128((__m128i const*)(src + x + 8));
+
+ __m128i mask = _mm_set1_epi32(0x00FF00FF); // mask keeping the low byte of each 16-bit value
+ __m128i low_mask = _mm_and_si128(word0, mask); // low bytes of shorts 0..7
+ __m128i high_mask = _mm_and_si128(word1, mask); // low bytes of shorts 8..15
+ __m128i word01 = _mm_packus_epi16(low_mask, high_mask); // pack 16 shorts into 16 bytes (unsigned saturation)
+ _mm_store_si128((__m128i*)&dst[x], word01); // store block into dst
}
src += sstride;
@@ -145,10 +148,11 @@ void blockcopy_s_p(int bx, int by, short
{
for (int x = 0; x < bx; x += 16)
{
- Vec16uc word;
- word.load_a(src + x);
- extend_low(word).store_a(dst + x);
- extend_high(word).store_a(dst + x + 8);
+ __m128i word0 = _mm_load_si128((__m128i const*)(src + x)); // load block of 16 bytes from src
+ __m128i word1 = _mm_unpacklo_epi8(word0, _mm_setzero_si128()); // zero-extend low 8 pixels to shorts
+ _mm_store_si128((__m128i*)&dst[x], word1); // store block into dst
+ __m128i word2 = _mm_unpackhi_epi8(word0, _mm_setzero_si128()); // zero-extend high 8 pixels to shorts
+ _mm_store_si128((__m128i*)&dst[x + 8], word2); // store block into dst
}
src += sstride;
@@ -182,14 +186,20 @@ void pixelsub_sp(int bx, int by, short *
{
for (int x = 0; x < bx; x += 16)
{
- Vec16uc word0, word1;
- Vec8s word3, word4;
- word0.load_a(src0 + x);
- word1.load_a(src1 + x);
- word3 = extend_low(word0) - extend_low(word1);
- word4 = extend_high(word0) - extend_high(word1);
- word3.store_a(dst + x);
- word4.store_a(dst + x + 8);
+ __m128i word0, word1;
+ __m128i word3, word4;
+ __m128i mask = _mm_setzero_si128();
+
+ word0 = _mm_load_si128((__m128i const*)(src0 + x)); // load 16 bytes from src0
+ word1 = _mm_load_si128((__m128i const*)(src1 + x)); // load 16 bytes from src1
+
+ word3 = _mm_unpacklo_epi8(word0, mask); // zero-extend low 8 pixels to shorts
+ word4 = _mm_unpacklo_epi8(word1, mask);
+ _mm_store_si128((__m128i*)&dst[x], _mm_subs_epi16(word3, word4)); // store block into dst
+
+ word3 = _mm_unpackhi_epi8(word0, mask); // zero-extend high 8 pixels to shorts
+ word4 = _mm_unpackhi_epi8(word1, mask);
+ _mm_store_si128((__m128i*)&dst[x + 8], _mm_subs_epi16(word3, word4)); // store block into dst
}
src0 += sstride0;
@@ -220,21 +230,24 @@ void pixeladd_ss(int bx, int by, short *
if ( !(aligncheck & 15) && !(bx & 7))
{
- Vec8s zero(0), maxval((1 << X265_DEPTH) - 1);
+ __m128i maxval = _mm_set1_epi16((1 << X265_DEPTH) - 1);
+ __m128i zero = _mm_setzero_si128();
+
// fast path, multiples of 8 pixel wide blocks
for (int y = 0; y < by; y++)
{
for (int x = 0; x < bx; x += 8)
{
- Vec8s vecsrc0, vecsrc1, vecsum;
- vecsrc0.load_a(src0 + x);
- vecsrc1.load_a(src1 + x);
+ __m128i word0, word1, sum;
- vecsum = add_saturated(vecsrc0, vecsrc1);
- vecsum = max(vecsum, zero);
- vecsum = min(vecsum, maxval);
+ word0 = _mm_load_si128((__m128i*)(src0 + x)); // load 8 shorts (16 bytes) from src0
+ word1 = _mm_load_si128((__m128i*)(src1 + x)); // load 8 shorts (16 bytes) from src1
- vecsum.store(dst + x);
+ sum = _mm_adds_epi16(word0, word1);
+ sum = _mm_max_epi16(sum, zero);
+ sum = _mm_min_epi16(sum, maxval);
+
+ _mm_store_si128((__m128i*)&dst[x], sum); // store block into dst
}
src0 += sstride0;
@@ -244,20 +257,23 @@ void pixeladd_ss(int bx, int by, short *
}
else if (!(bx & 7))
{
- Vec8s zero(0), maxval((1 << X265_DEPTH) - 1);
+ __m128i maxval = _mm_set1_epi16((1 << X265_DEPTH) - 1);
+ __m128i zero = _mm_setzero_si128();
+
for (int y = 0; y < by; y++)
{
for (int x = 0; x < bx; x += 8)
{
- Vec8s vecsrc0, vecsrc1, vecsum;
- vecsrc0.load(src0 + x);
- vecsrc1.load(src1 + x);
+ __m128i word0, word1, sum;
- vecsum = add_saturated(vecsrc0, vecsrc1);
- vecsum = max(vecsum, zero);
- vecsum = min(vecsum, maxval);
+ word0 = _mm_load_si128((__m128i*)(src0 + x)); // load 8 shorts (16 bytes) from src0
+ word1 = _mm_load_si128((__m128i*)(src1 + x)); // load 8 shorts (16 bytes) from src1
- vecsum.store(dst + x);
+ sum = _mm_adds_epi16(word0, word1);
+ sum = _mm_max_epi16(sum, zero);
+ sum = _mm_min_epi16(sum, maxval);
+
+ _mm_store_si128((__m128i*)&dst[x], sum); // store block into dst
}
src0 += sstride0;
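The pixeladd_ss kernels above compute, per element, a saturating add clamped to the valid pixel range. A scalar reference of the per-element operation may help when auditing the intrinsics (a sketch assuming X265_DEPTH is the pixel bit depth; pixeladd_one is a hypothetical helper, not patch code):

    // Scalar equivalent of one pixeladd_ss element (sketch only).
    static inline short pixeladd_one(short a, short b)
    {
        int sum = (int)a + (int)b;           // widen so the add cannot overflow
        int maxval = (1 << X265_DEPTH) - 1;  // e.g. 255 at 8-bit depth
        if (sum < 0)      sum = 0;           // matches _mm_max_epi16(sum, zero)
        if (sum > maxval) sum = maxval;      // matches _mm_min_epi16(sum, maxval)
        return (short)sum;
    }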
diff -r bf14f75b8cf9 -r cfc69c57d335 source/common/vec/pixel-sse41.cpp
--- a/source/common/vec/pixel-sse41.cpp Fri Oct 04 01:39:22 2013 -0500
+++ b/source/common/vec/pixel-sse41.cpp Fri Oct 04 18:42:00 2013 +0530
@@ -2346,6 +2346,1545 @@ int sad_32(pixel * fenc, intptr_t fencst
return _mm_cvtsi128_si32(sum0);
}
+template<int ly>
+int sad_48(pixel * fenc, intptr_t fencstride, pixel * fref, intptr_t frefstride)
+{
+ assert((ly % 4) == 0);
+
+ __m128i sum0 = _mm_setzero_si128();
+ __m128i sum1 = _mm_setzero_si128();
+
+ if (ly == 4)
+ {
+ __m128i T00, T01, T02;
+ __m128i T10, T11, T12;
+ __m128i T20, T21, T22;
+
+ T00 = _mm_load_si128((__m128i*)(fenc)); /* loading 48 8-bit integers from fenc into local variables */
+ T01 = _mm_load_si128((__m128i*)(fenc + 16));
+ T02 = _mm_load_si128((__m128i*)(fenc + 32));
+
+ T10 = _mm_loadu_si128((__m128i*)(fref)); /* loading 48 8-bit integers from fref into local variables */
+ T11 = _mm_loadu_si128((__m128i*)(fref + 16));
+ T12 = _mm_loadu_si128((__m128i*)(fref + 32));
+
+ T20 = _mm_sad_epu8(T00, T10);
+ T21 = _mm_sad_epu8(T01, T11);
+ T22 = _mm_sad_epu8(T02, T12);
+
+ sum0 = _mm_add_epi16(sum0, T20);
+ sum0 = _mm_add_epi16(sum0, T21);
+ sum0 = _mm_add_epi16(sum0, T22);
+
+ T00 = _mm_load_si128((__m128i*)(fenc + (1) * fencstride));
+ T01 = _mm_load_si128((__m128i*)(fenc + 16 + (1) * fencstride));
+ T02 = _mm_load_si128((__m128i*)(fenc + 32 + (1) * fencstride));
+
+ T10 = _mm_loadu_si128((__m128i*)(fref + (1) * frefstride));
+ T11 = _mm_loadu_si128((__m128i*)(fref + 16 + (1) * frefstride));
+ T12 = _mm_loadu_si128((__m128i*)(fref + 32 + (1) * frefstride));
+
+ T20 = _mm_sad_epu8(T00, T10);
+ T21 = _mm_sad_epu8(T01, T11);
+ T22 = _mm_sad_epu8(T02, T12);
+
+ sum0 = _mm_add_epi16(sum0, T20);
+ sum0 = _mm_add_epi16(sum0, T21);
+ sum0 = _mm_add_epi16(sum0, T22);
+
+ T00 = _mm_load_si128((__m128i*)(fenc + (2) * fencstride));
+ T01 = _mm_load_si128((__m128i*)(fenc + 16 + (2) * fencstride));
+ T02 = _mm_load_si128((__m128i*)(fenc + 32 + (2) * fencstride));
+
+ T10 = _mm_loadu_si128((__m128i*)(fref + (2) * frefstride));
+ T11 = _mm_loadu_si128((__m128i*)(fref + 16 + (2) * frefstride));
+ T12 = _mm_loadu_si128((__m128i*)(fref + 32 + (2) * frefstride));
+
+ T20 = _mm_sad_epu8(T00, T10);
+ T21 = _mm_sad_epu8(T01, T11);
+ T22 = _mm_sad_epu8(T02, T12);
+
+ sum0 = _mm_add_epi16(sum0, T20);
+ sum0 = _mm_add_epi16(sum0, T21);
+ sum0 = _mm_add_epi16(sum0, T22);
+
+ T00 = _mm_load_si128((__m128i*)(fenc + (3) * fencstride));
+ T01 = _mm_load_si128((__m128i*)(fenc + 16 + (3) * fencstride));
+ T02 = _mm_load_si128((__m128i*)(fenc + 32 + (3) * fencstride));
+
+ T10 = _mm_loadu_si128((__m128i*)(fref + (3) * frefstride));
+ T11 = _mm_loadu_si128((__m128i*)(fref + 16 + (3) * frefstride));
+ T12 = _mm_loadu_si128((__m128i*)(fref + 32 + (3) * frefstride));
+
+ T20 = _mm_sad_epu8(T00, T10);
+ T21 = _mm_sad_epu8(T01, T11);
+ T22 = _mm_sad_epu8(T02, T12);
+
+ sum0 = _mm_add_epi16(sum0, T20);
+ sum0 = _mm_add_epi16(sum0, T21);
+ sum0 = _mm_add_epi16(sum0, T22);
+ }
+ else if (ly == 8)
+ {
+ __m128i T00, T01, T02;
+ __m128i T10, T11, T12;
+ __m128i T20, T21, T22;
+
+ T00 = _mm_load_si128((__m128i*)(fenc));
+ T01 = _mm_load_si128((__m128i*)(fenc + 16));
+ T02 = _mm_load_si128((__m128i*)(fenc + 32));
+
+ T10 = _mm_loadu_si128((__m128i*)(fref));
+ T11 = _mm_loadu_si128((__m128i*)(fref + 16));
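The diff is truncated here; the remainder of sad_48 continues the same unrolled pattern. For orientation, _mm_sad_epu8 leaves two partial sums in the low and high 64-bit lanes of the accumulator, so a horizontal fold is needed before the final _mm_cvtsi128_si32 seen in the sad_32 context above. A minimal standalone sketch of that pattern (illustrative; sad16x1 is a hypothetical helper, not patch code):

    #include <emmintrin.h>  // SSE2
    #include <stdint.h>

    // SAD of a single 16-pixel row (sketch only).
    static inline int sad16x1(const uint8_t *fenc, const uint8_t *fref)
    {
        __m128i a = _mm_load_si128((const __m128i *)fenc);   // encoder block rows are aligned
        __m128i b = _mm_loadu_si128((const __m128i *)fref);  // reference rows may not be
        __m128i s = _mm_sad_epu8(a, b);                      // partial sums in lanes 0 and 2
        s = _mm_add_epi32(s, _mm_shuffle_epi32(s, _MM_SHUFFLE(1, 0, 3, 2))); // fold high half onto low
        return _mm_cvtsi128_si32(s);                         // extract the total
    }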