[x265-commits] [x265] asm: declare asm function pointers for sad_64xN partitions
Dnyaneshwar Gorade
dnyaneshwar at multicorewareinc.com
Wed Oct 30 20:38:28 CET 2013
details: http://hg.videolan.org/x265/rev/9f9b2f8d293a
branches:
changeset: 4754:9f9b2f8d293a
user: Dnyaneshwar Gorade <dnyaneshwar at multicorewareinc.com>
date: Wed Oct 30 12:54:18 2013 +0530
description:
asm: declare asm function pointers for sad_64xN partitions
Subject: [x265] chroma interp_4tap_vert_pp all blocks asm code
details: http://hg.videolan.org/x265/rev/74bf8634037c
branches:
changeset: 4755:74bf8634037c
user: Praveen Tiwari
date: Wed Oct 30 13:44:16 2013 +0530
description:
chroma interp_4tap_vert_pp all blocks asm code
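For reference, the 4-tap vertical chroma filter these asm routines cover can be sketched in C as below; the signature, coefficient precision (taps summing to 64, shift of 6), and the 8-bit clamp are assumptions for illustration, not the exact primitive in ipfilter8.asm.

#include <algorithm>
#include <cstdint>

typedef uint8_t pixel;  // 8-bit path; HIGH_BIT_DEPTH builds use a wider type

static void interp_4tap_vert_pp_ref(const pixel *src, intptr_t srcStride,
                                    pixel *dst, intptr_t dstStride,
                                    int width, int height, const int16_t coeff[4])
{
    const int shift = 6;                  // assumed filter precision (taps sum to 64)
    const int offset = 1 << (shift - 1);  // rounding offset

    src -= srcStride;  // the four taps span rows -1 .. +2 around each output row

    for (int row = 0; row < height; row++)
    {
        for (int col = 0; col < width; col++)
        {
            int sum = src[col] * coeff[0]
                + src[col + 1 * srcStride] * coeff[1]
                + src[col + 2 * srcStride] * coeff[2]
                + src[col + 3 * srcStride] * coeff[3];
            dst[col] = (pixel)std::min(255, std::max(0, (sum + offset) >> shift));
        }

        src += srcStride;
        dst += dstStride;
    }
}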
Subject: [x265] no-rdo: use bit estimates from ME to calculate RDcost.
details: http://hg.videolan.org/x265/rev/77db80a67f4e
branches:
changeset: 4756:77db80a67f4e
user: Deepthi Devaki <deepthidevaki at multicorewareinc.com>
date: Wed Oct 30 15:16:59 2013 +0530
description:
no-rdo: use bit estimates from ME to calculate RDcost.
Bits estimated during ME are stored in the CU and used, together with distortion, to calculate the RD cost. This results in a better bitrate with no-rdo, with a small drop in PSNR.
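The cost the no-rdo path forms from these stored ME bits is the usual Lagrangian; a minimal sketch, with the function name, integer rounding, and lambda type assumed rather than taken from the patch:

#include <cstdint>

static inline uint64_t calcNoRdoCost(uint32_t distortion, uint32_t bits, double lambda)
{
    // cost = D + lambda * R, reusing the bit count produced during motion estimation
    return distortion + (uint64_t)(lambda * bits + 0.5);
}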
Subject: [x265] asm: modified common macro for pixel_sad_64xN
details: http://hg.videolan.org/x265/rev/e9340727231d
branches:
changeset: 4757:e9340727231d
user: Dnyaneshwar Gorade <dnyaneshwar at multicorewareinc.com>
date: Wed Oct 30 12:57:17 2013 +0530
description:
asm: modified common macro for pixel_sad_64xN
Subject: [x265] asm: assembly code for pixel_sad_64x16
details: http://hg.videolan.org/x265/rev/4414f3394a61
branches:
changeset: 4758:4414f3394a61
user: Dnyaneshwar Gorade <dnyaneshwar at multicorewareinc.com>
date: Wed Oct 30 13:25:12 2013 +0530
description:
asm: assembly code for pixel_sad_64x16
Subject: [x265] asm: assembly code for pixel_sad_64x32
details: http://hg.videolan.org/x265/rev/42ad273b1d4f
branches:
changeset: 4759:42ad273b1d4f
user: Dnyaneshwar Gorade <dnyaneshwar at multicorewareinc.com>
date: Wed Oct 30 13:45:38 2013 +0530
description:
asm: assembly code for pixel_sad_64x32
Subject: [x265] asm: assembly code for pixel_sad_64x48 and pixel_sad_64x64
details: http://hg.videolan.org/x265/rev/700b46a1a0cf
branches:
changeset: 4760:700b46a1a0cf
user: Dnyaneshwar Gorade <dnyaneshwar at multicorewareinc.com>
date: Wed Oct 30 14:11:41 2013 +0530
description:
asm: assembly code for pixel_sad_64x48 and pixel_sad_64x64
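All of these routines return the sum of absolute differences for a 64-wide block; a minimal C reference of the value the asm is expected to produce (the template style mirrors pixel.cpp, but this exact helper is illustrative):

#include <cstdint>
#include <cstdlib>

typedef uint8_t pixel;

template<int lx, int ly>  // instanced as <64, 16>, <64, 32>, <64, 48>, <64, 64>
int sad_ref(const pixel *fenc, intptr_t fencstride, const pixel *fref, intptr_t frefstride)
{
    int sum = 0;

    for (int y = 0; y < ly; y++)
    {
        for (int x = 0; x < lx; x++)
            sum += abs(fenc[x] - fref[x]);

        fenc += fencstride;
        fref += frefstride;
    }

    return sum;
}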
Subject: [x265] asm: filterConvertPelToShort
details: http://hg.videolan.org/x265/rev/1a51e6cb0e0c
branches:
changeset: 4761:1a51e6cb0e0c
user: Min Chen <chenm003 at 163.com>
date: Wed Oct 30 22:47:44 2013 +0800
description:
asm: filterConvertPelToShort
Subject: [x265] asm: assembly code for pixel_sad_48x64
details: http://hg.videolan.org/x265/rev/78db76b7abec
branches:
changeset: 4762:78db76b7abec
user: Dnyaneshwar Gorade <dnyaneshwar at multicorewareinc.com>
date: Wed Oct 30 15:29:59 2013 +0530
description:
asm: assembly code for pixel_sad_48x64
Subject: [x265] asm: assembly code for pixel_sad_24x32
details: http://hg.videolan.org/x265/rev/ed5d877b8452
branches:
changeset: 4763:ed5d877b8452
user: Dnyaneshwar Gorade <dnyaneshwar at multicorewareinc.com>
date: Wed Oct 30 18:11:01 2013 +0530
description:
asm: assembly code for pixel_sad_24x32
Subject: [x265] asm: assembly code for pixel_sad_12x16
details: http://hg.videolan.org/x265/rev/8ee637b11d17
branches:
changeset: 4764:8ee637b11d17
user: Dnyaneshwar Gorade <dnyaneshwar at multicorewareinc.com>
date: Wed Oct 30 19:46:15 2013 +0530
description:
asm: assembly code for pixel_sad_12x16
Subject: [x265] assembly code for pixel_sad_x3_24x32
details: http://hg.videolan.org/x265/rev/de91fbc95b4a
branches:
changeset: 4765:de91fbc95b4a
user: Yuvaraj Venkatesh <yuvaraj at multicorewareinc.com>
date: Wed Oct 30 14:37:25 2013 +0530
description:
assembly code for pixel_sad_x3_24x32
Subject: [x265] assembly code for pixel_sad_x4_24x32
details: http://hg.videolan.org/x265/rev/f021f06f3b80
branches:
changeset: 4766:f021f06f3b80
user: Yuvaraj Venkatesh <yuvaraj at multicorewareinc.com>
date: Wed Oct 30 15:48:55 2013 +0530
description:
assembly code for pixel_sad_x4_24x32
Subject: [x265] assembly code for pixel_sad_x3_32xN
details: http://hg.videolan.org/x265/rev/e371719c4c47
branches:
changeset: 4767:e371719c4c47
user: Yuvaraj Venkatesh <yuvaraj at multicorewareinc.com>
date: Wed Oct 30 18:41:40 2013 +0530
description:
assembly code for pixel_sad_x3_32xN
Subject: [x265] assembly code for pixel_sad_x4_32xN
details: http://hg.videolan.org/x265/rev/c3cf2c42e854
branches:
changeset: 4768:c3cf2c42e854
user: Yuvaraj Venkatesh <yuvaraj at multicorewareinc.com>
date: Wed Oct 30 18:57:53 2013 +0530
description:
assembly code for pixel_sad_x4_32xN
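The x3 and x4 variants score one encode block against three or four reference candidates in a single call, so the asm can reuse the fenc loads across comparisons. A sketch of the x3 semantics, assuming the fixed FENC_STRIDE of 64 used elsewhere in x265:

#include <cstdint>
#include <cstdlib>

typedef uint8_t pixel;
enum { FENC_STRIDE = 64 };  // assumed encode-buffer stride

template<int lx, int ly>
void sad_x3_ref(const pixel *fenc, const pixel *fref0, const pixel *fref1,
                const pixel *fref2, intptr_t frefstride, int32_t *res)
{
    res[0] = res[1] = res[2] = 0;

    for (int y = 0; y < ly; y++)
    {
        for (int x = 0; x < lx; x++)
        {
            res[0] += abs(fenc[x] - fref0[x]);
            res[1] += abs(fenc[x] - fref1[x]);
            res[2] += abs(fenc[x] - fref2[x]);
        }

        fenc += FENC_STRIDE;
        fref0 += frefstride;
        fref1 += frefstride;
        fref2 += frefstride;
    }
}

The x4 variant is identical apart from a fourth reference pointer and res[3].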
Subject: [x265] pixel: remove 24 and 32 width sad intrinsic functions
details: http://hg.videolan.org/x265/rev/eccfe236169b
branches:
changeset: 4769:eccfe236169b
user: Steve Borho <steve at borho.org>
date: Wed Oct 30 13:10:52 2013 -0500
description:
pixel: remove 24 and 32 width sad intrinsic functions
These are now covered by assembly. Only 12, 48, and 64 remain because they
still lack x3 and x4 versions.
Subject: [x265] pixel: remove sad_12, sad_48, and sad_64
details: http://hg.videolan.org/x265/rev/645899ddda59
branches:
changeset: 4770:645899ddda59
user: Steve Borho <steve at borho.org>
date: Wed Oct 30 13:14:06 2013 -0500
description:
pixel: remove sad_12, sad_48, and sad_64
All single sads have asm coverage
Subject: [x265] added test code for blockcopy_pp function
details: http://hg.videolan.org/x265/rev/e8e84b67cf8f
branches:
changeset: 4771:e8e84b67cf8f
user: Praveen Tiwari
date: Wed Oct 30 20:30:17 2013 +0530
description:
added test code for blockcopy_pp function
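A harness check for a copy primitive typically fills a source buffer with random pixels, runs the C reference and the optimized candidate, and compares the outputs; a rough sketch, with buffer sizes, iteration count, and the helper name chosen for illustration rather than copied from pixelharness.cpp:

#include <cstdint>
#include <cstdlib>
#include <cstring>

typedef uint8_t pixel;
typedef void (*copy_pp_t)(pixel *dst, intptr_t dstride, pixel *src, intptr_t sstride);

static bool check_copy_pp(copy_pp_t ref, copy_pp_t opt)
{
    pixel src[64 * 64], refDst[64 * 64], optDst[64 * 64];

    for (int i = 0; i < 10; i++)
    {
        for (int j = 0; j < 64 * 64; j++)
            src[j] = (pixel)(rand() & 0xff);

        memset(refDst, 0, sizeof(refDst));
        memset(optDst, 0, sizeof(optDst));

        ref(refDst, 64, src, 64);  // C reference
        opt(optDst, 64, src, 64);  // asm / intrinsic under test

        if (memcmp(refDst, optDst, sizeof(refDst)))  // outputs must match byte-for-byte
            return false;
    }

    return true;
}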
Subject: [x265] added blockcopy_pp_c primitive according to modified argument list
details: http://hg.videolan.org/x265/rev/7f68debc632b
branches:
changeset: 4772:7f68debc632b
user: Praveen Tiwari
date: Wed Oct 30 20:23:37 2013 +0530
description:
added blockcopy_pp_c primitive according to modified argument list
diffstat:
source/Lib/TLibCommon/TComPrediction.cpp | 3 +
source/Lib/TLibEncoder/TEncSearch.cpp | 18 +-
source/Lib/TLibEncoder/TEncSearch.h | 2 +-
source/common/ipfilter.cpp | 19 +
source/common/pixel.cpp | 46 +
source/common/primitives.h | 7 +
source/common/vec/pixel-sse41.cpp | 965 +-----------------------
source/common/x86/asm-primitives.cpp | 37 +-
source/common/x86/ipfilter8.asm | 1226 ++++++++++++++++++++++++++++++
source/common/x86/ipfilter8.h | 1 +
source/common/x86/pixel.h | 8 +
source/common/x86/sad-a.asm | 1052 +++++++++++++++++++++++++-
source/encoder/compress.cpp | 6 +-
source/test/ipfilterharness.cpp | 51 +
source/test/ipfilterharness.h | 1 +
source/test/pixelharness.cpp | 56 +
source/test/pixelharness.h | 1 +
17 files changed, 2532 insertions(+), 967 deletions(-)
diffs (truncated from 3923 to 300 lines):
diff -r 65462024832b -r 7f68debc632b source/Lib/TLibCommon/TComPrediction.cpp
--- a/source/Lib/TLibCommon/TComPrediction.cpp Wed Oct 30 01:54:16 2013 -0500
+++ b/source/Lib/TLibCommon/TComPrediction.cpp Wed Oct 30 20:23:37 2013 +0530
@@ -516,6 +516,9 @@ void TComPrediction::xPredInterLumaBlk(T
int xFrac = mv->x & 0x3;
int yFrac = mv->y & 0x3;
+ assert((width % 4) + (height % 4) == 0);
+ assert(dstStride == MAX_CU_SIZE);
+
if ((yFrac | xFrac) == 0)
{
primitives.ipfilter_p2s(ref, refStride, dst, dstStride, width, height);
diff -r 65462024832b -r 7f68debc632b source/Lib/TLibEncoder/TEncSearch.cpp
--- a/source/Lib/TLibEncoder/TEncSearch.cpp Wed Oct 30 01:54:16 2013 -0500
+++ b/source/Lib/TLibEncoder/TEncSearch.cpp Wed Oct 30 20:23:37 2013 +0530
@@ -2115,7 +2115,7 @@ uint32_t TEncSearch::xGetInterPrediction
* \param bValid
* \returns void
*/
-void TEncSearch::xMergeEstimation(TComDataCU* cu, int puIdx, uint32_t& interDir, TComMvField* mvField, uint32_t& mergeIndex, uint32_t& outCost, TComMvField* mvFieldNeighbours, UChar* interDirNeighbours, int& numValidMergeCand)
+void TEncSearch::xMergeEstimation(TComDataCU* cu, int puIdx, uint32_t& interDir, TComMvField* mvField, uint32_t& mergeIndex, uint32_t& outCost, uint32_t& outbits, TComMvField* mvFieldNeighbours, UChar* interDirNeighbours, int& numValidMergeCand)
{
uint32_t absPartIdx = 0;
int width = 0;
@@ -2144,7 +2144,7 @@ void TEncSearch::xMergeEstimation(TComDa
{
uint32_t costCand = MAX_UINT;
uint32_t bitsCand = 0;
-
+
cu->getCUMvField(REF_PIC_LIST_0)->m_mv[absPartIdx] = mvFieldNeighbours[0 + 2 * mergeCand].mv;
cu->getCUMvField(REF_PIC_LIST_0)->m_refIdx[absPartIdx] = mvFieldNeighbours[0 + 2 * mergeCand].refIdx;
cu->getCUMvField(REF_PIC_LIST_1)->m_mv[absPartIdx] = mvFieldNeighbours[1 + 2 * mergeCand].mv;
@@ -2160,6 +2160,7 @@ void TEncSearch::xMergeEstimation(TComDa
if (costCand < outCost)
{
outCost = costCand;
+ outbits = bitsCand;
mvField[0] = mvFieldNeighbours[0 + 2 * mergeCand];
mvField[1] = mvFieldNeighbours[1 + 2 * mergeCand];
interDir = interDirNeighbours[mergeCand];
@@ -2226,6 +2227,8 @@ void TEncSearch::predInterSearch(TComDat
UChar interDirNeighbours[MRG_MAX_NUM_CANDS];
int numValidMergeCand = 0;
+ int totalmebits = 0;
+
for (int partIdx = 0; partIdx < numPart; partIdx++)
{
uint32_t listCost[2] = { MAX_UINT, MAX_UINT };
@@ -2495,7 +2498,8 @@ void TEncSearch::predInterSearch(TComDat
// find Merge result
uint32_t mrgCost = MAX_UINT;
- xMergeEstimation(cu, partIdx, mrgInterDir, mrgMvField, mrgIndex, mrgCost, mvFieldNeighbours, interDirNeighbours, numValidMergeCand);
+ uint32_t mrgBits = 0;
+ xMergeEstimation(cu, partIdx, mrgInterDir, mrgMvField, mrgIndex, mrgCost, mrgBits, mvFieldNeighbours, interDirNeighbours, numValidMergeCand);
if (mrgCost < meCost)
{
// set Merge result
@@ -2517,6 +2521,7 @@ void TEncSearch::predInterSearch(TComDat
#if CU_STAT_LOGFILE
meCost += mrgCost;
#endif
+ totalmebits += mrgBits;
}
else
{
@@ -2530,11 +2535,18 @@ void TEncSearch::predInterSearch(TComDat
#if CU_STAT_LOGFILE
meCost += meCost;
#endif
+ totalmebits += mebits;
}
}
+ else
+ {
+ totalmebits += mebits;
+ }
motionCompensation(cu, predYuv, REF_PIC_LIST_X, partIdx, bLuma, bChroma);
}
+ cu->m_totalBits = totalmebits;
+
setWpScalingDistParam(cu, -1, REF_PIC_LIST_X);
}
diff -r 65462024832b -r 7f68debc632b source/Lib/TLibEncoder/TEncSearch.h
--- a/source/Lib/TLibEncoder/TEncSearch.h Wed Oct 30 01:54:16 2013 -0500
+++ b/source/Lib/TLibEncoder/TEncSearch.h Wed Oct 30 20:23:37 2013 +0530
@@ -211,7 +211,7 @@ protected:
void xGetBlkBits(PartSize cuMode, bool bPSlice, int partIdx, uint32_t lastMode, uint32_t blockBit[3]);
void xMergeEstimation(TComDataCU* cu, int partIdx, uint32_t& uiInterDir,
- TComMvField* pacMvField, uint32_t& mergeIndex, uint32_t& outCost,
+ TComMvField* pacMvField, uint32_t& mergeIndex, uint32_t& outCost, uint32_t& outbits,
TComMvField* mvFieldNeighbors, UChar* interDirNeighbors, int& numValidMergeCand);
void xRestrictBipredMergeCand(TComDataCU* cu, uint32_t puIdx, TComMvField* mvFieldNeighbours,
diff -r 65462024832b -r 7f68debc632b source/common/ipfilter.cpp
--- a/source/common/ipfilter.cpp Wed Oct 30 01:54:16 2013 -0500
+++ b/source/common/ipfilter.cpp Wed Oct 30 20:23:37 2013 +0530
@@ -264,6 +264,24 @@ void filterConvertPelToShort_c(pixel *sr
}
}
+void filterConvertPelToShort_c(pixel *src, intptr_t srcStride, int16_t *dst, int width, int height)
+{
+ int shift = IF_INTERNAL_PREC - X265_DEPTH;
+ int row, col;
+
+ for (row = 0; row < height; row++)
+ {
+ for (col = 0; col < width; col++)
+ {
+ int16_t val = src[col] << shift;
+ dst[col] = val - (int16_t)IF_INTERNAL_OFFS;
+ }
+
+ src += srcStride;
+ dst += MAX_CU_SIZE;
+ }
+}
+
template<int N>
void filterVertical_pp_c(pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int width, int height, int16_t const *c)
{
@@ -471,6 +489,7 @@ void Setup_C_IPFilterPrimitives(EncoderP
p.ipfilter_p2s = filterConvertPelToShort_c;
p.ipfilter_s2p = filterConvertShortToPel_c;
+ p.luma_p2s = filterConvertPelToShort_c;
p.extendRowBorder = extendCURowColBorder;
}
diff -r 65462024832b -r 7f68debc632b source/common/pixel.cpp
--- a/source/common/pixel.cpp Wed Oct 30 01:54:16 2013 -0500
+++ b/source/common/pixel.cpp Wed Oct 30 20:23:37 2013 +0530
@@ -758,6 +758,21 @@ void plane_copy_deinterleave_chroma(pixe
}
}
}
+
+template<int bx, int by>
+void blockcopy_pp_c(pixel *a, intptr_t stridea, pixel *b, intptr_t strideb)
+{
+ for (int y = 0; y < by; y++)
+ {
+ for (int x = 0; x < bx; x++)
+ {
+ a[x] = b[x];
+ }
+
+ a += stridea;
+ b += strideb;
+ }
+}
} // end anonymous namespace
namespace x265 {
@@ -798,6 +813,37 @@ void Setup_C_PixelPrimitives(EncoderPrim
p.satd[LUMA_64x16] = satd8<64, 16>;
p.satd[LUMA_16x64] = satd8<16, 64>;
+#define CHROMA(W, H) \
+ p.chroma_copy_pp[CHROMA_ ## W ## x ## H] = blockcopy_pp_c<W, H>
+#define LUMA(W, H) \
+ p.luma_copy_pp[LUMA_ ## W ## x ## H] = blockcopy_pp_c<W, H>
+
+ LUMA(4, 4);
+ LUMA(8, 8); CHROMA(4, 4);
+ LUMA(4, 8); CHROMA(2, 4);
+ LUMA(8, 4); CHROMA(4, 2);
+ LUMA(16, 16); CHROMA(8, 8);
+ LUMA(16, 8); CHROMA(8, 4);
+ LUMA( 8, 16); CHROMA(4, 8);
+ LUMA(16, 12); CHROMA(8, 6);
+ LUMA(12, 16); CHROMA(6, 8);
+ LUMA(16, 4); CHROMA(8, 2);
+ LUMA( 4, 16); CHROMA(2, 8);
+ LUMA(32, 32); CHROMA(16, 16);
+ LUMA(32, 16); CHROMA(16, 8);
+ LUMA(16, 32); CHROMA(8, 16);
+ LUMA(32, 24); CHROMA(16, 12);
+ LUMA(24, 32); CHROMA(12, 16);
+ LUMA(32, 8); CHROMA(16, 4);
+ LUMA( 8, 32); CHROMA(4, 16);
+ LUMA(64, 64); CHROMA(32, 32);
+ LUMA(64, 32); CHROMA(32, 16);
+ LUMA(32, 64); CHROMA(16, 32);
+ LUMA(64, 48); CHROMA(32, 24);
+ LUMA(48, 64); CHROMA(24, 32);
+ LUMA(64, 16); CHROMA(32, 8);
+ LUMA(16, 64); CHROMA(8, 32);
+
//sse
#if HIGH_BIT_DEPTH
SET_FUNC_PRIMITIVE_TABLE_C(sse_pp, sse, pixelcmp_t, int16_t, int16_t)
diff -r 65462024832b -r 7f68debc632b source/common/primitives.h
--- a/source/common/primitives.h Wed Oct 30 01:54:16 2013 -0500
+++ b/source/common/primitives.h Wed Oct 30 20:23:37 2013 +0530
@@ -210,6 +210,9 @@ typedef void (*plane_copy_deinterleave_t
typedef void (*filter_pp_t) (pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx);
typedef void (*filter_hv_pp_t) (pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int idxX, int idxY);
+typedef void (*filter_p2s_t)(pixel *src, intptr_t srcStride, int16_t *dst, int width, int height);
+
+typedef void (*copy_pp_t)(pixel *dst, intptr_t dstride, pixel *src, intptr_t sstride); // dst is aligned
/* Define a structure containing function pointers to optimized encoder
* primitives. Each pointer can reference either an assembly routine,
@@ -235,6 +238,9 @@ struct EncoderPrimitives
cvt16to16_shl_t cvt16to16_shl;
cvt32to16_shr_t cvt32to16_shr;
+ copy_pp_t luma_copy_pp[NUM_LUMA_PARTITIONS];
+ copy_pp_t chroma_copy_pp[NUM_CHROMA_PARTITIONS];
+
ipfilter_pp_t ipfilter_pp[NUM_IPFILTER_P_P];
ipfilter_ps_t ipfilter_ps[NUM_IPFILTER_P_S];
ipfilter_sp_t ipfilter_sp[NUM_IPFILTER_S_P];
@@ -247,6 +253,7 @@ struct EncoderPrimitives
filter_pp_t chroma_vpp[NUM_CHROMA_PARTITIONS];
filter_pp_t luma_vpp[NUM_LUMA_PARTITIONS];
filter_hv_pp_t luma_hvpp[NUM_LUMA_PARTITIONS];
+ filter_p2s_t luma_p2s;
intra_dc_t intra_pred_dc;
intra_planar_t intra_pred_planar;
diff -r 65462024832b -r 7f68debc632b source/common/vec/pixel-sse41.cpp
--- a/source/common/vec/pixel-sse41.cpp Wed Oct 30 01:54:16 2013 -0500
+++ b/source/common/vec/pixel-sse41.cpp Wed Oct 30 20:23:37 2013 +0530
@@ -34,496 +34,6 @@ using namespace x265;
namespace {
#if !HIGH_BIT_DEPTH
template<int ly>
-// will only be instanced with ly == 16
-int sad_12(pixel *fenc, intptr_t fencstride, pixel *fref, intptr_t frefstride)
-{
- assert(ly == 16);
- __m128i sum0 = _mm_setzero_si128();
- __m128i sum1 = _mm_setzero_si128();
- __m128i T00, T01, T02, T03;
- __m128i T10, T11, T12, T13;
- __m128i T20, T21, T22, T23;
-
-#define MASK _mm_set_epi32(0x00000000, 0xffffffff, 0xffffffff, 0xffffffff)
-
-#define PROCESS_12x4(BASE) \
- T00 = _mm_load_si128((__m128i*)(fenc + (BASE + 0) * fencstride)); \
- T00 = _mm_and_si128(T00, MASK); \
- T01 = _mm_load_si128((__m128i*)(fenc + (BASE + 1) * fencstride)); \
- T01 = _mm_and_si128(T01, MASK); \
- T02 = _mm_load_si128((__m128i*)(fenc + (BASE + 2) * fencstride)); \
- T02 = _mm_and_si128(T02, MASK); \
- T03 = _mm_load_si128((__m128i*)(fenc + (BASE + 3) * fencstride)); \
- T03 = _mm_and_si128(T03, MASK); \
- T10 = _mm_loadu_si128((__m128i*)(fref + (BASE + 0) * frefstride)); \
- T10 = _mm_and_si128(T10, MASK); \
- T11 = _mm_loadu_si128((__m128i*)(fref + (BASE + 1) * frefstride)); \
- T11 = _mm_and_si128(T11, MASK); \
- T12 = _mm_loadu_si128((__m128i*)(fref + (BASE + 2) * frefstride)); \
- T12 = _mm_and_si128(T12, MASK); \
- T13 = _mm_loadu_si128((__m128i*)(fref + (BASE + 3) * frefstride)); \
- T13 = _mm_and_si128(T13, MASK); \
- T20 = _mm_sad_epu8(T00, T10); \
- T21 = _mm_sad_epu8(T01, T11); \
- T22 = _mm_sad_epu8(T02, T12); \
- T23 = _mm_sad_epu8(T03, T13); \
- sum0 = _mm_add_epi16(sum0, T20); \
- sum0 = _mm_add_epi16(sum0, T21); \
- sum0 = _mm_add_epi16(sum0, T22); \
- sum0 = _mm_add_epi16(sum0, T23)
-
- PROCESS_12x4(0);
- PROCESS_12x4(4);
- PROCESS_12x4(8);
- PROCESS_12x4(12);
-
- sum1 = _mm_shuffle_epi32(sum0, 2);
- sum0 = _mm_add_epi32(sum0, sum1);
-
- return _mm_cvtsi128_si32(sum0);
-}
-
-template<int ly>
-// always instanced for 32 rows
-int sad_24(pixel *fenc, intptr_t fencstride, pixel *fref, intptr_t frefstride)
-{
- __m128i sum0 = _mm_setzero_si128();
- __m128i sum1 = _mm_setzero_si128();
- __m128i T00, T01, T02, T03;
- __m128i T10, T11, T12, T13;
- __m128i T20, T21, T22, T23;
-
-#define PROCESS_24x4(BASE) \
- T00 = _mm_load_si128((__m128i*)(fenc + (BASE + 0) * fencstride)); \
- T01 = _mm_load_si128((__m128i*)(fenc + (BASE + 1) * fencstride)); \
- T02 = _mm_load_si128((__m128i*)(fenc + (BASE + 2) * fencstride)); \