[x265-commits] [x265] intra-sse3.cpp: replace xPredIntraAng4x4 vector class fun...
Dnyaneshwar Gorade
dnyaneshwar at multicorewareinc.com
Sun Oct 20 23:24:37 CEST 2013
details: http://hg.videolan.org/x265/rev/c1e53b796ef4
branches:
changeset: 4538:c1e53b796ef4
user: Dnyaneshwar Gorade <dnyaneshwar at multicorewareinc.com>
date: Fri Oct 18 10:51:33 2013 +0530
description:
intra-sse3.cpp: replace xPredIntraAng4x4 vector class function with intrinsic.
Subject: [x265] blockcopy-sse3.cpp: removed warning: overflow in implicit constant conversion.
details: http://hg.videolan.org/x265/rev/48afd41e0753
branches:
changeset: 4539:48afd41e0753
user: Dnyaneshwar Gorade <dnyaneshwar at multicorewareinc.com>
date: Fri Oct 18 10:29:53 2013 +0530
description:
blockcopy-sse3.cpp: removed warning: overflow in implicit constant conversion.
Subject: [x265] added cvt32to16_shr_sse2 function to testbench.
details: http://hg.videolan.org/x265/rev/f3523973eafb
branches:
changeset: 4540:f3523973eafb
user: Dnyaneshwar Gorade <dnyaneshwar at multicorewareinc.com>
date: Fri Oct 18 14:18:05 2013 +0530
description:
added cvt32to16_shr_sse2 function to testbench.
Measured speedup is almost 14x.
Subject: [x265] cmake: msvc yasm dependency fix
details: http://hg.videolan.org/x265/rev/357a6d0c305d
branches:
changeset: 4541:357a6d0c305d
user: Steve Borho <steve at borho.org>
date: Fri Oct 18 13:39:39 2013 -0500
description:
cmake: msvc yasm dependency fix
Subject: [x265] blockcopy-sse3.cpp: removed unnecessary variable.
details: http://hg.videolan.org/x265/rev/9ff06eb3bc4d
branches:
changeset: 4542:9ff06eb3bc4d
user: Dnyaneshwar Gorade <dnyaneshwar at multicorewareinc.com>
date: Fri Oct 18 16:13:07 2013 +0530
description:
blockcopy-sse3.cpp: removed unnecessary variable.
Subject: [x265] added pixelavg_pp function to testbench
details: http://hg.videolan.org/x265/rev/fdd1262059ad
branches:
changeset: 4543:fdd1262059ad
user: Dnyaneshwar Gorade <dnyaneshwar at multicorewareinc.com>
date: Fri Oct 18 15:53:12 2013 +0530
description:
added pixelavg_pp function to testbench
Subject: [x265] pixelharness: fix iteration through partition enums
details: http://hg.videolan.org/x265/rev/7e95be5f70bc
branches:
changeset: 4544:7e95be5f70bc
user: Steve Borho <steve at borho.org>
date: Fri Oct 18 14:31:39 2013 -0500
description:
pixelharness: fix iteration through partition enums
Subject: [x265] asm: disable remaining pixelavg primitives, they fail against our C ref
details: http://hg.videolan.org/x265/rev/904ff6d6e5d9
branches:
changeset: 4545:904ff6d6e5d9
user: Steve Borho <steve at borho.org>
date: Fri Oct 18 14:35:08 2013 -0500
description:
asm: disable remaining pixelavg primitives, they fail against our C ref
Subject: [x265] rc: removed warning, moved strength to acEnergyCu
details: http://hg.videolan.org/x265/rev/089b29b4da2a
branches:
changeset: 4546:089b29b4da2a
user: Aarthi Thirumalai <aarthi at multicorewareinc.com>
date: Fri Oct 18 17:47:20 2013 +0530
description:
rc: removed warning, moved strength to acEnergyCu
Subject: [x265] intra: replace intra_pred_dc vector class function with intrinsic
details: http://hg.videolan.org/x265/rev/d24283fe5e31
branches:
changeset: 4547:d24283fe5e31
user: Yuvaraj Venkatesh <yuvaraj at multicorewareinc.com>
date: Fri Oct 18 16:05:38 2013 +0530
description:
intra: replace intra_pred_dc vector class function with intrinsic
Subject: [x265] intra: replace predDCFiltering vector class function with intrinsic
details: http://hg.videolan.org/x265/rev/140f90417702
branches:
changeset: 4548:140f90417702
user: Yuvaraj Venkatesh <yuvaraj at multicorewareinc.com>
date: Fri Oct 18 17:05:37 2013 +0530
description:
intra: replace predDCFiltering vector class function with intrinsic
Subject: [x265] intra: remove SSE3 planar intrinsic functions; they are redundant
details: http://hg.videolan.org/x265/rev/edf6eb8da4ca
branches:
changeset: 4549:edf6eb8da4ca
user: Steve Borho <steve at borho.org>
date: Fri Oct 18 13:32:50 2013 -0500
description:
intra: remove SSE3 planar intrinsic functions; they are redundant
Subject: [x265] intra: nits
details: http://hg.videolan.org/x265/rev/6a453beeea88
branches:
changeset: 4550:6a453beeea88
user: Steve Borho <steve at borho.org>
date: Fri Oct 18 13:32:55 2013 -0500
description:
intra: nits
Subject: [x265] intra: sane function names and typedefs
details: http://hg.videolan.org/x265/rev/1959dbe1b643
branches:
changeset: 4551:1959dbe1b643
user: Steve Borho <steve at borho.org>
date: Fri Oct 18 14:06:56 2013 -0500
description:
intra: sane function names and typedefs
Subject: [x265] intra: isolate last remaining vector class functions (angular intra 8, 16, 32)
details: http://hg.videolan.org/x265/rev/8de380c7bd41
branches:
changeset: 4552:8de380c7bd41
user: Steve Borho <steve at borho.org>
date: Fri Oct 18 13:52:48 2013 -0500
description:
intra: isolate last remaining vector class functions (angular intra 8, 16, 32)
Subject: [x265] ipfilter.cpp, added code to support luma coefficients too
details: http://hg.videolan.org/x265/rev/4bcfc0e23935
branches:
changeset: 4553:4bcfc0e23935
user: Praveen Tiwari
date: Fri Oct 18 15:53:11 2013 +0530
description:
ipfilter.cpp, added code to support luma coefficients too
Subject: [x265] asm: corrected luma enum variable, testbench fix
details: http://hg.videolan.org/x265/rev/8b507771e6b0
branches:
changeset: 4554:8b507771e6b0
user: Praveen Tiwari
date: Fri Oct 18 15:43:54 2013 +0530
description:
asm: corrected luma enum variable, testbench fix
Subject: [x265] added 24x32 partition size asm code to chroma function
details: http://hg.videolan.org/x265/rev/a301f749b0bc
branches:
changeset: 4555:a301f749b0bc
user: Praveen Tiwari
date: Fri Oct 18 17:07:36 2013 +0530
description:
added 24x32 partition size asm code to chroma function
Subject: [x265] asm code for luma filter functions
details: http://hg.videolan.org/x265/rev/0d146f05d561
branches:
changeset: 4556:0d146f05d561
user: Praveen Tiwari
date: Fri Oct 18 16:18:12 2013 +0530
description:
asm code for luma filter functions
Subject: [x265] TEncSearch: add x265_emms() after use of pixelavg_pp and satd primitives
details: http://hg.videolan.org/x265/rev/1fa93e1f4caa
branches:
changeset: 4557:1fa93e1f4caa
user: Steve Borho <steve at borho.org>
date: Fri Oct 18 14:57:57 2013 -0500
description:
TEncSearch: add x265_emms() after use of pixelavg_pp and satd primitives
Subject: [x265] ipfilterharness: simplify filter names
details: http://hg.videolan.org/x265/rev/4066e6e725ee
branches:
changeset: 4558:4066e6e725ee
user: Steve Borho <steve at borho.org>
date: Fri Oct 18 15:00:40 2013 -0500
description:
ipfilterharness: simplify filter names
Subject: [x265] WaveFront: add new function to enable all rows
details: http://hg.videolan.org/x265/rev/dd45e55248c8
branches:
changeset: 4559:dd45e55248c8
user: Deepthi Devaki <deepthidevaki at multicorewareinc.com>
date: Fri Oct 18 17:01:32 2013 +0530
description:
WaveFront: add new function to enable all rows
Subject: [x265] Lookahead: implement wavefront parallel processing
details: http://hg.videolan.org/x265/rev/c96f97cf3914
branches:
changeset: 4560:c96f97cf3914
user: Steve Borho <steve at borho.org>
date: Fri Oct 18 17:10:56 2013 +0530
description:
Lookahead: implement wavefront parallel processing
Subject: [x265] intra: move intra_pred_dc to intra-sse41.cpp; it uses SSSE3 instructions
details: http://hg.videolan.org/x265/rev/7ec69cb067fd
branches:
changeset: 4561:7ec69cb067fd
user: Steve Borho <steve at borho.org>
date: Sun Oct 20 15:51:01 2013 -0500
description:
intra: move intra_pred_dc to intra-sse41.cpp; it uses SSSE3 instructions
We don't have an intra-ssse3.cpp and it seems a waste to create one just for
this one function.
Subject: [x265] remove reduce register copy in FILTER_H4_w2_2 and FILTER_H4_w4_2
details: http://hg.videolan.org/x265/rev/fabb25ae4db4
branches:
changeset: 4562:fabb25ae4db4
user: Min Chen <chenm003 at 163.com>
date: Sat Oct 19 18:08:07 2013 +0800
description:
remove reduce register copy in FILTER_H4_w2_2 and FILTER_H4_w4_2
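
Note on the two blockcopy-sse3.cpp changesets above (4539 and 4542): the warning most likely arises because (1 << X265_DEPTH) - 1 equals 255 in an 8-bit build, which does not fit in the signed char parameter of _mm_set1_epi8. The following is a minimal sketch of the cast-based fix, not the project's code. DEMO_DEPTH and maxvalIsAllOnes are hypothetical names used only for illustration, and an 8-bit build is assumed.

```cpp
#include <emmintrin.h> // SSE2: _mm_set1_epi8, _mm_storeu_si128
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for the X265_DEPTH build constant (8-bit build assumed).
#define DEMO_DEPTH 8

// Builds the clamp constant the way the patched code does and reports
// whether every one of the 16 byte lanes holds 0xFF.
bool maxvalIsAllOnes()
{
    // 255 is out of range for a signed char, so the conversion is made
    // explicit. The bit pattern (0xFF) is what the byte-wise SIMD code needs.
    __m128i maxval = _mm_set1_epi8((char)(unsigned char)((1 << DEMO_DEPTH) - 1));

    uint8_t lanes[16];
    _mm_storeu_si128((__m128i*)lanes, maxval);

    for (int i = 0; i < 16; i++)
    {
        if (lanes[i] != 0xFF)
            return false;
    }
    return true;
}
```

With the explicit cast there is no need for the temporary "int i = 1" that the earlier changeset used to hide the constant from the compiler, which is presumably why the later changeset could drop it.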
diffstat:
source/Lib/TLibEncoder/TEncSearch.cpp | 5 +-
source/common/CMakeLists.txt | 2 +-
source/common/ipfilter.cpp | 2 +-
source/common/vec/blockcopy-sse3.cpp | 5 +-
source/common/vec/intra-sse3.cpp | 812 ++++++++-------------------------
source/common/vec/intra-sse41.cpp | 190 +++++++
source/common/wavefront.cpp | 5 +
source/common/wavefront.h | 12 +-
source/common/x86/asm-primitives.cpp | 14 +-
source/common/x86/ipfilter8.asm | 250 ++++++++--
source/encoder/encoder.cpp | 3 +-
source/encoder/ratecontrol.cpp | 7 +-
source/encoder/slicetype.cpp | 152 +++++-
source/encoder/slicetype.h | 54 +-
source/test/ipfilterharness.cpp | 10 +-
source/test/pixelharness.cpp | 87 +++-
source/test/pixelharness.h | 4 +
17 files changed, 893 insertions(+), 721 deletions(-)
diffs (truncated from 2501 to 300 lines):
diff -r d6d7187c5f4e -r fabb25ae4db4 source/Lib/TLibEncoder/TEncSearch.cpp
--- a/source/Lib/TLibEncoder/TEncSearch.cpp Fri Oct 18 00:42:36 2013 -0500
+++ b/source/Lib/TLibEncoder/TEncSearch.cpp Sat Oct 19 18:08:07 2013 +0800
@@ -2336,8 +2336,8 @@ void TEncSearch::predInterSearch(TComDat
int partEnum = PartitionFromSizes(roiWidth, roiHeight);
primitives.pixelavg_pp[partEnum](avg, roiWidth, ref0, m_predYuv[0].getStride(), ref1, m_predYuv[1].getStride(), 32);
-
int satdCost = primitives.satd[partEnum](pu, fenc->getStride(), avg, roiWidth);
+ x265_emms();
bits[2] = bits[0] + bits[1] - mbBits[0] - mbBits[1] + mbBits[2];
costbi = satdCost + m_rdCost->getCost(bits[2]);
@@ -2347,10 +2347,9 @@ void TEncSearch::predInterSearch(TComDat
ref1 = cu->getSlice()->m_mref[1][refIdx[1]]->fpelPlane + (pu - fenc->getLumaAddr()); //MV(0,0) of ref1
intptr_t refStride = cu->getSlice()->m_mref[0][refIdx[0]]->lumaStride;
- partEnum = PartitionFromSizes(roiWidth, roiHeight);
primitives.pixelavg_pp[partEnum](avg, roiWidth, ref0, refStride, ref1, refStride, 32);
-
satdCost = primitives.satd[partEnum](pu, fenc->getStride(), avg, roiWidth);
+ x265_emms();
unsigned int bitsZero0, bitsZero1;
m_me.setMVP(mvPredBi[0][refIdxBidir[0]]);
diff -r d6d7187c5f4e -r fabb25ae4db4 source/common/CMakeLists.txt
--- a/source/common/CMakeLists.txt Fri Oct 18 00:42:36 2013 -0500
+++ b/source/common/CMakeLists.txt Sat Oct 19 18:08:07 2013 +0800
@@ -202,7 +202,7 @@ if(ENABLE_PRIMITIVES_ASM)
add_custom_command(
OUTPUT ${ASM}.obj
COMMAND ${YASM_EXECUTABLE} ARGS ${FLAGS} ${CMAKE_CURRENT_SOURCE_DIR}/x86/${ASM} -o ${ASM}.obj
- DEPENDS x86/${ASM})
+ DEPENDS ${CMAKE_CURRENT_SOURCE_DIR}/x86/${ASM})
endforeach()
add_library(assembly STATIC x86/asm-primitives.cpp x86/pixel.h x86/mc.h x86/ipfilter8.h ${FULLPATHASM} ${OBJS})
else()
diff -r d6d7187c5f4e -r fabb25ae4db4 source/common/ipfilter.cpp
--- a/source/common/ipfilter.cpp Fri Oct 18 00:42:36 2013 -0500
+++ b/source/common/ipfilter.cpp Sat Oct 19 18:08:07 2013 +0800
@@ -448,7 +448,7 @@ void extendCURowColBorder(pixel* txt, in
template<int N, int width, int height>
void interp_horiz_pp_c(pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx)
{
- int16_t const * coeff = g_chromaFilter[coeffIdx];
+ int16_t const * coeff = (N == 4) ? g_chromaFilter[coeffIdx] : g_lumaFilter[coeffIdx];
int headRoom = IF_INTERNAL_PREC - X265_DEPTH;
int offset = (1 << (headRoom - 1));
int16_t maxVal = (1 << X265_DEPTH) - 1;
diff -r d6d7187c5f4e -r fabb25ae4db4 source/common/vec/blockcopy-sse3.cpp
--- a/source/common/vec/blockcopy-sse3.cpp Fri Oct 18 00:42:36 2013 -0500
+++ b/source/common/vec/blockcopy-sse3.cpp Sat Oct 19 18:08:07 2013 +0800
@@ -134,11 +134,10 @@ void blockcopy_ps(int bx, int by, pixel
void pixeladd_pp(int bx, int by, pixel *dst, intptr_t dstride, pixel *src0, pixel *src1, intptr_t sstride0, intptr_t sstride1)
{
size_t aligncheck = (size_t)dst | (size_t)src0 | bx | sstride0 | sstride1 | dstride;
- int i = 1;
if (!(aligncheck & 15))
{
- __m128i maxval = _mm_set1_epi8((i << X265_DEPTH) - 1);
+ __m128i maxval = _mm_set1_epi8((unsigned char)((1 << X265_DEPTH) - 1));
__m128i zero = _mm_setzero_si128();
// fast path, multiples of 16 pixel wide blocks
@@ -162,7 +161,7 @@ void pixeladd_pp(int bx, int by, pixel *
}
else if (!(bx & 15))
{
- __m128i maxval = _mm_set1_epi8((i << X265_DEPTH) - 1);
+ __m128i maxval = _mm_set1_epi8((unsigned char)((1 << X265_DEPTH) - 1));
__m128i zero = _mm_setzero_si128();
// fast path, multiples of 16 pixel wide blocks but pointers/strides require unaligned accesses
diff -r d6d7187c5f4e -r fabb25ae4db4 source/common/vec/intra-sse3.cpp
--- a/source/common/vec/intra-sse3.cpp Fri Oct 18 00:42:36 2013 -0500
+++ b/source/common/vec/intra-sse3.cpp Sat Oct 19 18:08:07 2013 +0800
@@ -24,16 +24,11 @@
* For more information, contact us at licensing at multicorewareinc.com.
*****************************************************************************/
-#if defined(_MSC_VER)
-#define ALWAYSINLINE __forceinline
-#endif
-
-#define INSTRSET 3
-#include "vectorclass.h"
-
#include "primitives.h"
#include "TLibCommon/TComRom.h"
#include <assert.h>
+#include <xmmintrin.h> // SSE
+#include <pmmintrin.h> // SSE3
using namespace x265;
@@ -96,387 +91,6 @@ const int angAP[17][64] =
#define GETAP(X, Y) angAP[8 - (X)][(Y)]
#if !HIGH_BIT_DEPTH
-inline void predDCFiltering(pixel* above, pixel* left, pixel* dst, intptr_t dstStride, int width)
-{
- int y;
- pixel pixDC = *dst;
- int pixDCx3 = pixDC * 3 + 2;
-
- // boundary pixels processing
- dst[0] = (pixel)((above[0] + left[0] + 2 * pixDC + 2) >> 2);
-
- Vec8us im1(pixDCx3);
- Vec8us im2, im3;
- Vec16uc pix;
- switch (width)
- {
- case 4:
- pix = load_partial(const_int(4), &above[1]);
- im2 = extend_low(pix);
- im2 = (im1 + im2) >> const_int(2);
- pix = compress(im2, im2);
- store_partial(const_int(4), &dst[1], pix);
- break;
-
- case 8:
- pix = load_partial(const_int(8), &above[1]);
- im2 = extend_low(pix);
- im2 = (im1 + im2) >> const_int(2);
- pix = compress(im2, im2);
- store_partial(const_int(8), &dst[1], pix);
- break;
-
- case 16:
- pix.load(&above[1]);
- im2 = extend_low(pix);
- im3 = extend_high(pix);
- im2 = (im1 + im2) >> const_int(2);
- im3 = (im1 + im3) >> const_int(2);
- pix = compress(im2, im3);
- pix.store(&dst[1]);
- break;
-
- case 32:
- pix.load(&above[1]);
- im2 = extend_low(pix);
- im3 = extend_high(pix);
- im2 = (im1 + im2) >> const_int(2);
- im3 = (im1 + im3) >> const_int(2);
- pix = compress(im2, im3);
- pix.store(&dst[1]);
-
- pix.load(&above[1 + 16]);
- im2 = extend_low(pix);
- im3 = extend_high(pix);
- im2 = (im1 + im2) >> const_int(2);
- im3 = (im1 + im3) >> const_int(2);
- pix = compress(im2, im3);
- pix.store(&dst[1 + 16]);
- break;
- }
-
- for (y = 1; y < width; y++)
- {
- dst[dstStride] = (pixel)((left[y] + pixDCx3) >> 2);
- dst += dstStride;
- }
-}
-
-void intra_pred_dc(pixel* above, pixel* left, pixel* dst, intptr_t dstStride, int width, int filter)
-{
- int sum;
- int logSize = g_convertToBit[width] + 2;
-
- Vec16uc pixL, pixT;
- Vec8us im;
- Vec4ui im1, im2;
-
- switch (width)
- {
- case 4:
- pixL.fromUint32(*(uint32_t*)left);
- pixT.fromUint32(*(uint32_t*)above);
- sum = horizontal_add(extend_low(pixL));
- sum += horizontal_add(extend_low(pixT));
- break;
-
- case 8:
-#if X86_64
- pixL.fromUint64(*(uint64_t*)left);
- pixT.fromUint64(*(uint64_t*)above);
-#else
- pixL.load_partial(8, left);
- pixT.load_partial(8, above);
-#endif
- sum = horizontal_add(extend_low(pixL));
- sum += horizontal_add(extend_low(pixT));
- break;
-
- case 16:
- pixL.load(left);
- pixT.load(above);
- sum = horizontal_add_x(pixL);
- sum += horizontal_add_x(pixT);
- break;
-
- default:
- case 32:
- pixL.load(left);
- im1 = (Vec4ui)(pixL.sad(_mm_setzero_si128()));
- pixL.load(left + 16);
- im1 += (Vec4ui)(pixL.sad(_mm_setzero_si128()));
-
- pixT.load(above);
- im1 += (Vec4ui)(pixT.sad(_mm_setzero_si128()));
- pixT.load(above + 16);
- im1 += (Vec4ui)(pixT.sad(_mm_setzero_si128()));
- im1 += (Vec4ui)((Vec128b)im1 >> const_int(64));
- sum = toInt32(im1);
- break;
- }
-
- logSize += 1;
- pixel dcVal = (sum + (1 << (logSize - 1))) >> logSize;
- Vec16uc dcValN(dcVal);
- pixel *dst1 = dst;
-
- switch (width)
- {
- case 4:
- store_partial(const_int(4), dst1, dcValN);
- dst1 += dstStride;
- store_partial(const_int(4), dst1, dcValN);
- dst1 += dstStride;
- store_partial(const_int(4), dst1, dcValN);
- dst1 += dstStride;
- store_partial(const_int(4), dst1, dcValN);
- break;
-
- case 8:
- store_partial(const_int(8), dst1, dcValN);
- dst1 += dstStride;
- store_partial(const_int(8), dst1, dcValN);
- dst1 += dstStride;
- store_partial(const_int(8), dst1, dcValN);
- dst1 += dstStride;
- store_partial(const_int(8), dst1, dcValN);
- dst1 += dstStride;
- store_partial(const_int(8), dst1, dcValN);
- dst1 += dstStride;
- store_partial(const_int(8), dst1, dcValN);
- dst1 += dstStride;
- store_partial(const_int(8), dst1, dcValN);
- dst1 += dstStride;
- store_partial(const_int(8), dst1, dcValN);
- break;
-
- case 16:
- for (int k = 0; k < 16; k += 4)
- {
- store_partial(const_int(16), dst1, dcValN);
- dst1 += dstStride;
- store_partial(const_int(16), dst1, dcValN);
- dst1 += dstStride;
- store_partial(const_int(16), dst1, dcValN);
- dst1 += dstStride;
- store_partial(const_int(16), dst1, dcValN);
- dst1 += dstStride;
- }
- break;
-
- case 32:
- for (int k = 0; k < 32; k += 2)
- {
- store_partial(const_int(16), dst1, dcValN);
- store_partial(const_int(16), dst1 + 16, dcValN);
- dst1 += dstStride;
- store_partial(const_int(16), dst1, dcValN);
- store_partial(const_int(16), dst1 + 16, dcValN);
- dst1 += dstStride;
- }
- break;
- }
-
- if (filter)
- {
- predDCFiltering(above, left, dst, dstStride, width);
- }
-}
-
-#define BROADCAST16(a, d, x) { \
- const int dL = (d) & 3; \
- const int dH = ((d)-4) & 3; \
- if (d>=4) { \
- (x) = _mm_shufflehi_epi16((a), dH * 0x55); \
- (x) = _mm_unpackhi_epi64((x), (x)); \
- } \
- else { \
- (x) = _mm_shufflelo_epi16((a), dL * 0x55); \
- (x) = _mm_unpacklo_epi64((x), (x)); \
- } \
-}
-
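
For readers following the removed intra_pred_dc and predDCFiltering code, the arithmetic it vectorizes can be written as a plain scalar loop. This is a rough sketch derived from the removed lines shown above, not the project's C reference. The name intraPredDcScalar is hypothetical, an 8-bit pixel type is assumed, and the top-row filter here writes only columns 1 to width - 1.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

typedef uint8_t pixel; // assumption: 8-bit build (!HIGH_BIT_DEPTH)

// Scalar sketch of DC intra prediction for a width x width block.
// above[x] is the neighbour over column x, left[y] the neighbour beside row y.
void intraPredDcScalar(const pixel* above, const pixel* left, pixel* dst,
                       intptr_t dstStride, int width, bool filter)
{
    // Sum all 2 * width neighbouring pixels.
    int sum = 0;
    for (int i = 0; i < width; i++)
        sum += above[i] + left[i];

    // Rounded mean; equals (sum + (1 << (logSize - 1))) >> logSize
    // with logSize = log2(width) + 1, since width is a power of two.
    pixel dc = (pixel)((sum + width) / (2 * width));

    // Fill the whole block with the DC value.
    for (int y = 0; y < width; y++)
        for (int x = 0; x < width; x++)
            dst[y * dstStride + x] = dc;

    if (!filter)
        return;

    // Edge smoothing: blend the DC value with the nearest neighbours.
    int dcX3 = dc * 3 + 2;
    dst[0] = (pixel)((above[0] + left[0] + 2 * dc + 2) >> 2);
    for (int x = 1; x < width; x++)
        dst[x] = (pixel)((above[x] + dcX3) >> 2);
    for (int y = 1; y < width; y++)
        dst[y * dstStride] = (pixel)((left[y] + dcX3) >> 2);
}
```

A scalar version like this is mainly useful as a cross-check when comparing a vector class implementation against its intrinsic replacement in the testbench.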