[x265-commits] [x265] TEncSearch: fix for gcc warning
Deepthi Devaki
deepthidevaki at multicorewareinc.com
Fri Oct 4 21:14:01 CEST 2013
details: http://hg.videolan.org/x265/rev/ae9c68edd6b2
branches:
changeset: 4193:ae9c68edd6b2
user: Deepthi Devaki <deepthidevaki at multicorewareinc.com>
date: Fri Oct 04 15:23:07 2013 +0530
description:
TEncSearch: fix for gcc warning
Subject: [x265] Bidir ME: store bits required for bidir which will be used for merge estimation
details: http://hg.videolan.org/x265/rev/5b987ed0a557
branches:
changeset: 4194:5b987ed0a557
user: Deepthi Devaki <deepthidevaki at multicorewareinc.com>
date: Fri Oct 04 15:40:48 2013 +0530
description:
Bidir ME: store bits required for bidir which will be used for merge estimation
Subject: [x265] TEncSearch: remove unused code
details: http://hg.videolan.org/x265/rev/a201bc951e10
branches:
changeset: 4195:a201bc951e10
user: Deepthi Devaki <deepthidevaki at multicorewareinc.com>
date: Fri Oct 04 15:25:15 2013 +0530
description:
TEncSearch: remove unused code
Subject: [x265] Replace sad_48 vector class function with intrinsic.
details: http://hg.videolan.org/x265/rev/88378feb4794
branches:
changeset: 4196:88378feb4794
user: yuvaraj
date: Fri Oct 04 15:32:59 2013 +0530
description:
Replace sad_48 vector class function with intrinsic.
Subject: [x265] Replace sad_64 vector class function with intrinsic.
details: http://hg.videolan.org/x265/rev/4f990ec05dc5
branches:
changeset: 4197:4f990ec05dc5
user: yuvaraj
date: Fri Oct 04 16:02:25 2013 +0530
description:
Replace sad_64 vector class function with intrinsic.
Subject: [x265] Replace sad_x3_48 vector class function with intrinsic.
details: http://hg.videolan.org/x265/rev/c29821f80cd3
branches:
changeset: 4198:c29821f80cd3
user: yuvaraj
date: Fri Oct 04 16:13:58 2013 +0530
description:
Replace sad_x3_48 vector class function with intrinsic.
Subject: [x265] Replace sad_x3_64 vector class function with intrinsic.
details: http://hg.videolan.org/x265/rev/6dcae4946fe3
branches:
changeset: 4199:6dcae4946fe3
user: yuvaraj
date: Fri Oct 04 16:20:32 2013 +0530
description:
Replace sad_x3_64 vector class function with intrinsic.
Subject: [x265] Replace sad_x4_48 vector class function with intrinsic.
details: http://hg.videolan.org/x265/rev/d370697071ed
branches:
changeset: 4200:d370697071ed
user: yuvaraj
date: Fri Oct 04 16:33:06 2013 +0530
description:
Replace sad_x4_48 vector class function with intrinsic.
Subject: [x265] Replace sad_x4_64 vector class function with intrinsic.
details: http://hg.videolan.org/x265/rev/d59dcf48b9de
branches:
changeset: 4201:d59dcf48b9de
user: yuvaraj
date: Fri Oct 04 16:41:08 2013 +0530
description:
Replace sad_x4_64 vector class function with intrinsic.
Subject: [x265] pixel: move SSE4.1 functions from pixel8.inc to pixel-sse41.cpp
details: http://hg.videolan.org/x265/rev/8829b508822b
branches:
changeset: 4202:8829b508822b
user: Steve Borho <steve at borho.org>
date: Fri Oct 04 12:33:54 2013 -0500
description:
pixel: move SSE4.1 functions from pixel8.inc to pixel-sse41.cpp
Subject: [x265] replace block_copy_p_p vector class function with intrinsic code.
details: http://hg.videolan.org/x265/rev/7b93c1cae0c4
branches:
changeset: 4203:7b93c1cae0c4
user: Dnyaneshwar
date: Fri Oct 04 16:27:02 2013 +0530
description:
replace block_copy_p_p vector class function with intrinsic code.
Performance is almost the same as that of the vector function.
Subject: [x265] replace block_copy_p_s (short to pixel) vector class function with intrinsic.
details: http://hg.videolan.org/x265/rev/64325084bd3b
branches:
changeset: 4204:64325084bd3b
user: Dnyaneshwar
date: Fri Oct 04 16:55:16 2013 +0530
description:
replace block_copy_p_s (short to pixel) vector class function with intrinsic.
Measured performance is the same as that of the vector function.
Subject: [x265] replace blockcopy_s_p (pixel to short) vector class function with intrinsic.
details: http://hg.videolan.org/x265/rev/5b7226f332be
branches:
changeset: 4205:5b7226f332be
user: Dnyaneshwar
date: Fri Oct 04 17:10:51 2013 +0530
description:
replace blockcopy_s_p (pixel to short) vector class function with intrinsic.
Performance is the same as that of the vector class function.
Subject: [x265] replace "pixelsub_sp" vector class function with intrinsic.
details: http://hg.videolan.org/x265/rev/1a884afb63bb
branches:
changeset: 4206:1a884afb63bb
user: Dnyaneshwar
date: Fri Oct 04 18:03:32 2013 +0530
description:
replace "pixelsub_sp" vector class function with intrinsic.
Performance is the same as that of the vector function.
Subject: [x265] Replace "pixeladd_ss" vector class function with intrinsic.
details: http://hg.videolan.org/x265/rev/cfc69c57d335
branches:
changeset: 4207:cfc69c57d335
user: Dnyaneshwar
date: Fri Oct 04 18:42:00 2013 +0530
description:
Replace "pixeladd_ss" vector class function with intrinsic.
Measured performance is the same as that of the vector function.
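All of the conversions in this series follow the same basic pattern: an aligned load/store pair expressed through the vector class types (Vec16c, Vec8us, and friends) is replaced by the equivalent raw SSE2 intrinsics. A minimal sketch of the mapping (illustration only; copy16 is a hypothetical helper, not code from any of these patches):

    #include <emmintrin.h>  // SSE2 intrinsics

    // vector class version being replaced:
    //     Vec16c word;
    //     word.load_a(src);   // aligned 16-byte load
    //     word.store_a(dst);  // aligned 16-byte store
    void copy16(unsigned char *dst, const unsigned char *src)
    {
        __m128i word = _mm_load_si128((const __m128i *)src); // aligned 16-byte load
        _mm_store_si128((__m128i *)dst, word);               // aligned 16-byte store
    }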
diffstat:
source/Lib/TLibEncoder/TEncSearch.cpp | 16 +-
source/common/vec/blockcopy-sse3.cpp | 86 +-
source/common/vec/pixel-sse41.cpp | 9796 ++++++++++++++++++++++++++++++++-
source/common/vec/pixel8.inc | 471 -
4 files changed, 9849 insertions(+), 520 deletions(-)
diffs (truncated from 10518 to 300 lines):
diff -r bf14f75b8cf9 -r cfc69c57d335 source/Lib/TLibEncoder/TEncSearch.cpp
--- a/source/Lib/TLibEncoder/TEncSearch.cpp Fri Oct 04 01:39:22 2013 -0500
+++ b/source/Lib/TLibEncoder/TEncSearch.cpp Fri Oct 04 18:42:00 2013 +0530
@@ -2249,14 +2249,12 @@ void TEncSearch::predInterSearch(TComDat
UInt mbBits[3] = { 1, 1, 0 };
int refIdx[2] = { 0, 0 }; // If un-initialized, may cause SEGV in bi-directional prediction iterative stage.
- int refIdxBidir[2];
+ int refIdxBidir[2] = { 0, 0 };
UInt partAddr;
int roiWidth, roiHeight;
PartSize partSize = cu->getPartitionSize(0);
- int bestBiPRefIdxL1 = 0;
- int bestBiPMvpL1 = 0;
UInt lastMode = 0;
int numPart = cu->getNumPartInter();
int numPredDir = cu->getSlice()->isInterP() ? 1 : 2;
@@ -2281,7 +2279,6 @@ void TEncSearch::predInterSearch(TComDat
UInt costbi = MAX_UINT;
UInt costTemp = 0;
UInt bitsTemp;
- UInt bestBiPDist = MAX_INT;
MV mvValidList1;
int refIdxValidList1 = 0;
UInt bitsValidList1 = MAX_UINT;
@@ -2328,13 +2325,6 @@ void TEncSearch::predInterSearch(TComDat
mvpIdx[refList][refIdxTmp] = cu->getMVPIdx(picList, partAddr);
mvpNum[refList][refIdxTmp] = cu->getMVPNum(picList, partAddr);
- if (cu->getSlice()->getMvdL1ZeroFlag() && refList == 1 && biPDistTemp < bestBiPDist)
- {
- bestBiPDist = biPDistTemp;
- bestBiPMvpL1 = mvpIdx[refList][refIdxTmp];
- bestBiPRefIdxL1 = refIdxTmp;
- }
-
bitsTemp += m_mvpIdxCost[mvpIdx[refList][refIdxTmp]][AMVP_MAX_NUM_CANDS];
if (refList == 1) // list 1
@@ -2441,7 +2431,8 @@ void TEncSearch::predInterSearch(TComDat
primitives.pixelavg_pp[partEnum](avg, roiWidth, ref0, ref1, m_predYuv[0].getStride(), m_predYuv[1].getStride());
int satdCost = primitives.satd[partEnum](pu, fenc->getStride(), avg, roiWidth);
- costbi = satdCost + m_rdCost->getCost(bits[0]) + m_rdCost->getCost(bits[1]);
+ bits[2] = bits[0] + bits[1] - mbBits[0] - mbBits[1] + mbBits[2];
+ costbi = satdCost + m_rdCost->getCost(bits[2]);
if (mv[0].notZero() || mv[1].notZero())
{
@@ -2470,6 +2461,7 @@ void TEncSearch::predInterSearch(TComDat
costbi = costZero;
mvBidir[0].x = mvBidir[0].y = 0;
mvBidir[1].x = mvBidir[1].y = 0;
+ bits[2] = bitsZero0 + bitsZero1 - mbBits[0] - mbBits[1] + mbBits[2];
}
}
} // if (B_SLICE)
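For reference, the bit accounting introduced above removes the mode bits charged to each uni-directional list and substitutes the bidir mode bits before costing. With mbBits[] = { 1, 1, 0 } as declared at the top of predInterSearch, a worked example (illustrative numbers only):

    // Suppose the list-0 search settled at bits[0] = 20 and list 1 at bits[1] = 18.
    // bits[2] = bits[0] + bits[1] - mbBits[0] - mbBits[1] + mbBits[2]
    //         = 20 + 18 - 1 - 1 + 0
    //         = 36
    // costbi = satdCost + m_rdCost->getCost(bits[2]), and bits[2] is kept
    // around for the merge estimation mentioned in the commit message.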
diff -r bf14f75b8cf9 -r cfc69c57d335 source/common/vec/blockcopy-sse3.cpp
--- a/source/common/vec/blockcopy-sse3.cpp Fri Oct 04 01:39:22 2013 -0500
+++ b/source/common/vec/blockcopy-sse3.cpp Fri Oct 04 18:42:00 2013 +0530
@@ -76,9 +76,8 @@ void blockcopy_p_p(int bx, int by, pixel
{
for (int x = 0; x < bx; x += 16)
{
- Vec16c word;
- word.load_a(src + x);
- word.store_a(dst + x);
+ __m128i word0 = _mm_load_si128((__m128i const*)(src + x)); // load block of 16 bytes from src
+ _mm_store_si128((__m128i*)&dst[x], word0); // store block into dst
}
src += sstride;
@@ -107,10 +106,14 @@ void blockcopy_p_s(int bx, int by, pixel
{
for (int x = 0; x < bx; x += 16)
{
- Vec8us word0, word1;
- word0.load_a(src + x);
- word1.load_a(src + x + 8);
- compress(word0, word1).store_a(dst + x);
+ __m128i word0 = _mm_load_si128((__m128i const*)(src + x)); // load block of 16 bytes from src
+ __m128i word1 = _mm_load_si128((__m128i const*)(src + x + 8));
+
+ __m128i mask = _mm_set1_epi32(0x00FF00FF); // mask keeping the low byte of each 16-bit value
+ __m128i low_mask = _mm_and_si128(word0, mask); // low bytes of shorts 0..7
+ __m128i high_mask = _mm_and_si128(word1, mask); // low bytes of shorts 8..15
+ __m128i word01 = _mm_packus_epi16(low_mask, high_mask); // pack 16 shorts into 16 bytes (unsigned saturation)
+ _mm_store_si128((__m128i*)&dst[x], word01); // store block into dst
}
src += sstride;
@@ -145,10 +148,11 @@ void blockcopy_s_p(int bx, int by, short
{
for (int x = 0; x < bx; x += 16)
{
- Vec16uc word;
- word.load_a(src + x);
- extend_low(word).store_a(dst + x);
- extend_high(word).store_a(dst + x + 8);
+ __m128i word0 = _mm_load_si128((__m128i const*)(src + x)); // load block of 16 bytes from src
+ __m128i word1 = _mm_unpacklo_epi8(word0, _mm_setzero_si128()); // zero-extend low 8 pixels to shorts
+ _mm_store_si128((__m128i*)&dst[x], word1); // store block into dst
+ __m128i word2 = _mm_unpackhi_epi8(word0, _mm_setzero_si128()); // zero-extend high 8 pixels to shorts
+ _mm_store_si128((__m128i*)&dst[x + 8], word2); // store block into dst
}
src += sstride;
@@ -182,14 +186,20 @@ void pixelsub_sp(int bx, int by, short *
{
for (int x = 0; x < bx; x += 16)
{
- Vec16uc word0, word1;
- Vec8s word3, word4;
- word0.load_a(src0 + x);
- word1.load_a(src1 + x);
- word3 = extend_low(word0) - extend_low(word1);
- word4 = extend_high(word0) - extend_high(word1);
- word3.store_a(dst + x);
- word4.store_a(dst + x + 8);
+ __m128i word0, word1;
+ __m128i word3, word4;
+ __m128i mask = _mm_setzero_si128();
+
+ word0 = _mm_load_si128((__m128i const*)(src0 + x)); // load 16 bytes from src0
+ word1 = _mm_load_si128((__m128i const*)(src1 + x)); // load 16 bytes from src1
+
+ word3 = _mm_unpacklo_epi8(word0, mask); // zero-extend low 8 pixels to shorts
+ word4 = _mm_unpacklo_epi8(word1, mask);
+ _mm_store_si128((__m128i*)&dst[x], _mm_subs_epi16(word3, word4)); // store block into dst
+
+ word3 = _mm_unpackhi_epi8(word0, mask); // zero-extend high 8 pixels to shorts
+ word4 = _mm_unpackhi_epi8(word1, mask);
+ _mm_store_si128((__m128i*)&dst[x + 8], _mm_subs_epi16(word3, word4)); // store block into dst
}
src0 += sstride0;
@@ -220,21 +230,24 @@ void pixeladd_ss(int bx, int by, short *
if ( !(aligncheck & 15) && !(bx & 7))
{
- Vec8s zero(0), maxval((1 << X265_DEPTH) - 1);
+ __m128i maxval = _mm_set1_epi16((1 << X265_DEPTH) - 1);
+ __m128i zero = _mm_setzero_si128();
+
// fast path, multiples of 8 pixel wide blocks
for (int y = 0; y < by; y++)
{
for (int x = 0; x < bx; x += 8)
{
- Vec8s vecsrc0, vecsrc1, vecsum;
- vecsrc0.load_a(src0 + x);
- vecsrc1.load_a(src1 + x);
+ __m128i word0, word1, sum;
- vecsum = add_saturated(vecsrc0, vecsrc1);
- vecsum = max(vecsum, zero);
- vecsum = min(vecsum, maxval);
+ word0 = _mm_load_si128((__m128i*)(src0 + x)); // load 8 shorts (16 bytes) from src0
+ word1 = _mm_load_si128((__m128i*)(src1 + x)); // load 8 shorts (16 bytes) from src1
- vecsum.store(dst + x);
+ sum = _mm_adds_epi16(word0, word1);
+ sum = _mm_max_epi16(sum, zero);
+ sum = _mm_min_epi16(sum, maxval);
+
+ _mm_store_si128((__m128i*)&dst[x], sum); // store block into dst
}
src0 += sstride0;
@@ -244,20 +257,23 @@ void pixeladd_ss(int bx, int by, short *
}
else if (!(bx & 7))
{
- Vec8s zero(0), maxval((1 << X265_DEPTH) - 1);
+ __m128i maxval = _mm_set1_epi16((1 << X265_DEPTH) - 1);
+ __m128i zero = _mm_setzero_si128();
+
for (int y = 0; y < by; y++)
{
for (int x = 0; x < bx; x += 8)
{
- Vec8s vecsrc0, vecsrc1, vecsum;
- vecsrc0.load(src0 + x);
- vecsrc1.load(src1 + x);
+ __m128i word0, word1, sum;
- vecsum = add_saturated(vecsrc0, vecsrc1);
- vecsum = max(vecsum, zero);
- vecsum = min(vecsum, maxval);
+ word0 = _mm_load_si128((__m128i*)(src0 + x)); // load 8 shorts (16 bytes) from src0
+ word1 = _mm_load_si128((__m128i*)(src1 + x)); // load 8 shorts (16 bytes) from src1
- vecsum.store(dst + x);
+ sum = _mm_adds_epi16(word0, word1);
+ sum = _mm_max_epi16(sum, zero);
+ sum = _mm_min_epi16(sum, maxval);
+
+ _mm_store_si128((__m128i*)&dst[x], sum); // store block into dst
}
src0 += sstride0;
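The pixeladd_ss kernels above compute, per element, a saturating add clamped to the valid pixel range. A scalar reference of the per-element operation may help when auditing the intrinsics (a sketch assuming X265_DEPTH is the pixel bit depth; pixeladd_one is a hypothetical helper, not patch code):

    // Scalar equivalent of one pixeladd_ss element (sketch only).
    static inline short pixeladd_one(short a, short b)
    {
        int sum = (int)a + (int)b;           // widen so the add cannot overflow
        int maxval = (1 << X265_DEPTH) - 1;  // e.g. 255 at 8-bit depth
        if (sum < 0)      sum = 0;           // matches _mm_max_epi16(sum, zero)
        if (sum > maxval) sum = maxval;      // matches _mm_min_epi16(sum, maxval)
        return (short)sum;
    }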
diff -r bf14f75b8cf9 -r cfc69c57d335 source/common/vec/pixel-sse41.cpp
--- a/source/common/vec/pixel-sse41.cpp Fri Oct 04 01:39:22 2013 -0500
+++ b/source/common/vec/pixel-sse41.cpp Fri Oct 04 18:42:00 2013 +0530
@@ -2346,6 +2346,1545 @@ int sad_32(pixel * fenc, intptr_t fencst
return _mm_cvtsi128_si32(sum0);
}
+template<int ly>
+int sad_48(pixel * fenc, intptr_t fencstride, pixel * fref, intptr_t frefstride)
+{
+ assert((ly % 4) == 0);
+
+ __m128i sum0 = _mm_setzero_si128();
+ __m128i sum1 = _mm_setzero_si128();
+
+ if (ly == 4)
+ {
+ __m128i T00, T01, T02;
+ __m128i T10, T11, T12;
+ __m128i T20, T21, T22;
+
+ T00 = _mm_load_si128((__m128i*)(fenc)); /* loading 48 8-bit integers from fenc into local variables */
+ T01 = _mm_load_si128((__m128i*)(fenc + 16));
+ T02 = _mm_load_si128((__m128i*)(fenc + 32));
+
+ T10 = _mm_loadu_si128((__m128i*)(fref)); /* loading 48 8-bit integers from fref into local variables */
+ T11 = _mm_loadu_si128((__m128i*)(fref + 16));
+ T12 = _mm_loadu_si128((__m128i*)(fref + 32));
+
+ T20 = _mm_sad_epu8(T00, T10);
+ T21 = _mm_sad_epu8(T01, T11);
+ T22 = _mm_sad_epu8(T02, T12);
+
+ sum0 = _mm_add_epi16(sum0, T20);
+ sum0 = _mm_add_epi16(sum0, T21);
+ sum0 = _mm_add_epi16(sum0, T22);
+
+ T00 = _mm_load_si128((__m128i*)(fenc + (1) * fencstride));
+ T01 = _mm_load_si128((__m128i*)(fenc + 16 + (1) * fencstride));
+ T02 = _mm_load_si128((__m128i*)(fenc + 32 + (1) * fencstride));
+
+ T10 = _mm_loadu_si128((__m128i*)(fref + (1) * frefstride));
+ T11 = _mm_loadu_si128((__m128i*)(fref + 16 + (1) * frefstride));
+ T12 = _mm_loadu_si128((__m128i*)(fref + 32 + (1) * frefstride));
+
+ T20 = _mm_sad_epu8(T00, T10);
+ T21 = _mm_sad_epu8(T01, T11);
+ T22 = _mm_sad_epu8(T02, T12);
+
+ sum0 = _mm_add_epi16(sum0, T20);
+ sum0 = _mm_add_epi16(sum0, T21);
+ sum0 = _mm_add_epi16(sum0, T22);
+
+ T00 = _mm_load_si128((__m128i*)(fenc + (2) * fencstride));
+ T01 = _mm_load_si128((__m128i*)(fenc + 16 + (2) * fencstride));
+ T02 = _mm_load_si128((__m128i*)(fenc + 32 + (2) * fencstride));
+
+ T10 = _mm_loadu_si128((__m128i*)(fref + (2) * frefstride));
+ T11 = _mm_loadu_si128((__m128i*)(fref + 16 + (2) * frefstride));
+ T12 = _mm_loadu_si128((__m128i*)(fref + 32 + (2) * frefstride));
+
+ T20 = _mm_sad_epu8(T00, T10);
+ T21 = _mm_sad_epu8(T01, T11);
+ T22 = _mm_sad_epu8(T02, T12);
+
+ sum0 = _mm_add_epi16(sum0, T20);
+ sum0 = _mm_add_epi16(sum0, T21);
+ sum0 = _mm_add_epi16(sum0, T22);
+
+ T00 = _mm_load_si128((__m128i*)(fenc + (3) * fencstride));
+ T01 = _mm_load_si128((__m128i*)(fenc + 16 + (3) * fencstride));
+ T02 = _mm_load_si128((__m128i*)(fenc + 32 + (3) * fencstride));
+
+ T10 = _mm_loadu_si128((__m128i*)(fref + (3) * frefstride));
+ T11 = _mm_loadu_si128((__m128i*)(fref + 16 + (3) * frefstride));
+ T12 = _mm_loadu_si128((__m128i*)(fref + 32 + (3) * frefstride));
+
+ T20 = _mm_sad_epu8(T00, T10);
+ T21 = _mm_sad_epu8(T01, T11);
+ T22 = _mm_sad_epu8(T02, T12);
+
+ sum0 = _mm_add_epi16(sum0, T20);
+ sum0 = _mm_add_epi16(sum0, T21);
+ sum0 = _mm_add_epi16(sum0, T22);
+ }
+ else if (ly == 8)
+ {
+ __m128i T00, T01, T02;
+ __m128i T10, T11, T12;
+ __m128i T20, T21, T22;
+
+ T00 = _mm_load_si128((__m128i*)(fenc));
+ T01 = _mm_load_si128((__m128i*)(fenc + 16));
+ T02 = _mm_load_si128((__m128i*)(fenc + 32));
+
+ T10 = _mm_loadu_si128((__m128i*)(fref));
+ T11 = _mm_loadu_si128((__m128i*)(fref + 16));
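The diff is truncated here; the remainder of sad_48 continues the same unrolled pattern. For orientation, _mm_sad_epu8 leaves two partial sums in the low and high 64-bit lanes of the accumulator, so a horizontal fold is needed before the final _mm_cvtsi128_si32 seen in the sad_32 context above. A minimal standalone sketch of that pattern (illustrative; sad16x1 is a hypothetical helper, not patch code):

    #include <emmintrin.h>  // SSE2
    #include <stdint.h>

    // SAD of a single 16-pixel row (sketch only).
    static inline int sad16x1(const uint8_t *fenc, const uint8_t *fref)
    {
        __m128i a = _mm_load_si128((const __m128i *)fenc);   // encoder block rows are aligned
        __m128i b = _mm_loadu_si128((const __m128i *)fref);  // reference rows may not be
        __m128i s = _mm_sad_epu8(a, b);                      // partial sums in lanes 0 and 2
        s = _mm_add_epi32(s, _mm_shuffle_epi32(s, _MM_SHUFFLE(1, 0, 3, 2))); // fold high half onto low
        return _mm_cvtsi128_si32(s);                         // extract the total
    }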