[x265-commits] [x265] asm: assembly code for pixel_satd_24x32
Yuvaraj Venkatesh
yuvaraj at multicorewareinc.com
Thu Nov 14 14:58:15 CET 2013
details: http://hg.videolan.org/x265/rev/2ffe634ebd71
branches:
changeset: 5070:2ffe634ebd71
user: Yuvaraj Venkatesh <yuvaraj at multicorewareinc.com>
date: Wed Nov 13 13:08:03 2013 +0530
description:
asm: assembly code for pixel_satd_24x32
Subject: [x265] asm: assembly code for pixel_satd_32x32
details: http://hg.videolan.org/x265/rev/4ee655b93b03
branches:
changeset: 5071:4ee655b93b03
user: Yuvaraj Venkatesh <yuvaraj at multicorewareinc.com>
date: Wed Nov 13 16:43:37 2013 +0530
description:
asm: assembly code for pixel_satd_32x32
Subject: [x265] asm: assembly code for pixel_satd_64x16
details: http://hg.videolan.org/x265/rev/32e01ab333a6
branches:
changeset: 5072:32e01ab333a6
user: Yuvaraj Venkatesh <yuvaraj at multicorewareinc.com>
date: Wed Nov 13 17:04:08 2013 +0530
description:
asm: assembly code for pixel_satd_64x16
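The three satd commits above add dedicated assembly for the 24x32, 32x32 and 64x16 partitions, which the encoder previously assembled from smaller kernels (see the HEVC_SATD hunk in the diff below). As a hedged reference for what these routines compute, here is a scalar sketch of the sum of absolute transformed differences: a minimal model assuming the usual 4x4 Hadamard kernel tiled over the partition, not the project's exact C primitive.

    #include <cstdint>
    #include <cstdlib>

    // 4x4 Hadamard-transform SATD kernel (standard reference form)
    static int satd_4x4(const uint8_t *pix1, intptr_t stride1,
                        const uint8_t *pix2, intptr_t stride2)
    {
        int diff[16], m[16];
        for (int i = 0; i < 4; i++)
            for (int j = 0; j < 4; j++)
                diff[i * 4 + j] = pix1[i * stride1 + j] - pix2[i * stride2 + j];

        for (int i = 0; i < 4; i++)             // horizontal transform per row
        {
            int s01 = diff[i*4+0] + diff[i*4+1], d01 = diff[i*4+0] - diff[i*4+1];
            int s23 = diff[i*4+2] + diff[i*4+3], d23 = diff[i*4+2] - diff[i*4+3];
            m[i*4+0] = s01 + s23; m[i*4+1] = s01 - s23;
            m[i*4+2] = d01 + d23; m[i*4+3] = d01 - d23;
        }

        int sum = 0;
        for (int j = 0; j < 4; j++)             // vertical transform per column
        {
            int s01 = m[0*4+j] + m[1*4+j], d01 = m[0*4+j] - m[1*4+j];
            int s23 = m[2*4+j] + m[3*4+j], d23 = m[2*4+j] - m[3*4+j];
            sum += std::abs(s01 + s23) + std::abs(s01 - s23)
                 + std::abs(d01 + d23) + std::abs(d01 - d23);
        }
        return sum >> 1;                        // customary satd normalisation
    }

    // A larger partition (here 24x32) as a sum of kernel results
    static int satd_24x32_ref(const uint8_t *p1, intptr_t s1,
                              const uint8_t *p2, intptr_t s2)
    {
        int sum = 0;
        for (int y = 0; y < 32; y += 4)
            for (int x = 0; x < 24; x += 4)
                sum += satd_4x4(p1 + y * s1 + x, s1, p2 + y * s2 + x, s2);
        return sum;
    }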
Subject: [x265] asm: Proper indentation and function prototype updates for chroma hps filter functions for 2xN, 4xN, 6x8 and 12x16 block sizes.
details: http://hg.videolan.org/x265/rev/51d3c0782e46
branches:
changeset: 5073:51d3c0782e46
user: Nabajit Deka
date: Wed Nov 13 13:58:39 2013 +0530
description:
asm: Proper indentation and function prototype updates for chroma hps filter functions for 2xN, 4xN, 6x8 and 12x16 block sizes.
Subject: [x265] asm: routines for chroma hps filter functions for 8xN block sizes.
details: http://hg.videolan.org/x265/rev/3448252924ad
branches:
changeset: 5074:3448252924ad
user: Nabajit Deka
date: Wed Nov 13 14:11:00 2013 +0530
description:
asm: routines for chroma hps filter functions for 8xN block sizes.
Subject: [x265] asm: routines for chroma hps filter functions for 16xN, 24xN and 32xN
details: http://hg.videolan.org/x265/rev/d80ab2913b31
branches:
changeset: 5075:d80ab2913b31
user: Nabajit Deka
date: Wed Nov 13 14:30:22 2013 +0530
description:
asm: routines for chroma hps filter functions for 16xN, 24xN and 32xN
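The chroma hps routines in the commits above implement the first interpolation stage: horizontal 4-tap filtering of 8-bit pixels into a 16-bit intermediate plane. A hedged scalar sketch follows; the coefficient table is the standard HEVC chroma 4-tap set, and the 8192 offset matches the tab_c_8192 constant used by the vertical filters in the diff below, but the exact prototype of the new asm routines (width/height handling in particular) is an assumption.

    #include <cstdint>

    static const int16_t g_chromaFilter[8][4] =   // assumed HEVC chroma taps
    {
        {  0, 64,  0,  0 }, { -2, 58, 10, -2 }, { -4, 54, 16, -2 }, { -6, 46, 28, -4 },
        { -4, 36, 36, -4 }, { -4, 28, 46, -6 }, { -2, 16, 54, -4 }, { -2, 10, 58, -2 },
    };

    // horizontal, pixel in / short out ("hps"), 8-bit depth
    static void interp_4tap_horiz_ps_ref(const uint8_t *src, intptr_t srcStride,
                                         int16_t *dst, intptr_t dstStride,
                                         int coeffIdx, int width, int height)
    {
        const int16_t *c = g_chromaFilter[coeffIdx];
        src -= 1;                                 // taps span src[x-1..x+2]
        for (int y = 0; y < height; y++)
        {
            for (int x = 0; x < width; x++)
            {
                int sum = c[0] * src[x]     + c[1] * src[x + 1]
                        + c[2] * src[x + 2] + c[3] * src[x + 3];
                dst[x] = (int16_t)(sum - 8192);   // 16-bit intermediate, no clipping
            }
            src += srcStride;
            dst += dstStride;
        }
    }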
Subject: [x265] asm: routines for chroma vps filter functions for 4xN block sizes.
details: http://hg.videolan.org/x265/rev/23aecd3f9180
branches:
changeset: 5076:23aecd3f9180
user: Nabajit Deka
date: Wed Nov 13 15:45:39 2013 +0530
description:
asm: routines for chroma vps filter functions for 4xN block sizes.
Subject: [x265] asm: routines for chroma vps filter functions for 8xN block sizes
details: http://hg.videolan.org/x265/rev/91cfcd159ff3
branches:
changeset: 5077:91cfcd159ff3
user: Nabajit Deka
date: Wed Nov 13 16:02:48 2013 +0530
description:
asm: routines for chroma vps filter functions for 8xN block sizes
Subject: [x265] asm: routines for chroma vps filter functions for 6x8 and 12x16 block sizes.
details: http://hg.videolan.org/x265/rev/8e6dcabdccd5
branches:
changeset: 5078:8e6dcabdccd5
user: Nabajit Deka
date: Wed Nov 13 16:19:47 2013 +0530
description:
asm: routines for chroma vps filter functions for 6x8 and 12x16 block sizes.
Subject: [x265] asm: routines for chroma vps filter functions for 16xN block sizes.
details: http://hg.videolan.org/x265/rev/52d18d911356
branches:
changeset: 5079:52d18d911356
user: Nabajit Deka
date: Wed Nov 13 16:28:08 2013 +0530
description:
asm: routines for chroma vps filter functions for 16xN block sizes.
Subject: [x265] asm: routines for chroma vps filter function for 24x32 block size.
details: http://hg.videolan.org/x265/rev/21d27b188e71
branches:
changeset: 5080:21d27b188e71
user: Nabajit Deka
date: Wed Nov 13 16:35:45 2013 +0530
description:
asm: routines for chroma vps filter function for 24x32 block size.
Subject: [x265] asm: routines for chroma vps filter functions for 32xN block sizes.
details: http://hg.videolan.org/x265/rev/701b696d0670
branches:
changeset: 5081:701b696d0670
user: Nabajit Deka
date: Wed Nov 13 16:46:42 2013 +0530
description:
asm: routines for chroma vps filter functions for 32xN block sizes.
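The vps commits above are the vertical counterpart: 4-tap filtering down a column of 8-bit pixels into 16-bit intermediates, which is what the interp_4tap_vert_ps asm in the diff below computes (taps loaded from tab_ChromaCoeff, offset subtracted via tab_c_8192). A hedged sketch, assuming the same coefficient set as the horizontal sketch above and taking one selected coefficient row as an argument:

    #include <cstdint>

    // vertical, pixel in / short out ("vps"), 8-bit depth; 'coeff' is one row
    // of the chroma tap table selected by coeffIdx (cf. tab_ChromaCoeff)
    static void interp_4tap_vert_ps_ref(const uint8_t *src, intptr_t srcStride,
                                        int16_t *dst, intptr_t dstStride,
                                        const int16_t coeff[4], int width, int height)
    {
        src -= srcStride;                         // one row of lead-in, as in 'sub r0, r1'
        for (int y = 0; y < height; y++)
        {
            for (int x = 0; x < width; x++)
            {
                int sum = coeff[0] * src[x]
                        + coeff[1] * src[x + srcStride]
                        + coeff[2] * src[x + 2 * srcStride]
                        + coeff[3] * src[x + 3 * srcStride];
                dst[x] = (int16_t)(sum - 8192);   // the 'psubw' of tab_c_8192 in the asm
            }
            src += srcStride;
            dst += dstStride;
        }
    }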
Subject: [x265] Adding asm function declarations and initializations for chroma vps filter functions.
details: http://hg.videolan.org/x265/rev/5fc6ca938864
branches:
changeset: 5082:5fc6ca938864
user: Nabajit Deka
date: Wed Nov 13 16:58:12 2013 +0530
description:
Adding asm function declarations and initializations for chroma vps filter functions.
Subject: [x265] Change minimum architecture to SSE4 as chroma vsp functions for block sizes (2x4, 2x8 and 6x8) need faster SSE4 instructions.
details: http://hg.videolan.org/x265/rev/a04ca925ad3f
branches:
changeset: 5083:a04ca925ad3f
user: Nabajit Deka
date: Wed Nov 13 18:27:00 2013 +0530
description:
Change minimum architecture to SSE4 as chroma vsp functions for block sizes (2x4, 2x8 and 6x8) need faster SSE4 instructions.
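Raising the minimum architecture works because the primitive table is filled in per CPU capability: the new 2x4, 2x8 and 6x8 vertical sp routines are assigned only when the SSE4 flag is present, as the asm-primitives.cpp hunk in the diff below shows. A simplified, hedged sketch of that dispatch pattern (the surrounding structure, flag value and stub definitions are illustrative; only the primitive names mirror the diff):

    #include <cstdint>

    typedef void (*filter_sp_t)(const int16_t *src, intptr_t srcStride,
                                uint8_t *dst, intptr_t dstStride, int coeffIdx);

    enum { CHROMA_2x4, CHROMA_2x8, CHROMA_6x8, NUM_CHROMA_PARTS };
    enum { X265_CPU_SSE4 = 1 << 6 };              // illustrative flag value

    struct EncoderPrimitives { filter_sp_t chroma_vsp[NUM_CHROMA_PARTS]; };

    // stubs standing in for the SSE4 asm entry points in ipfilter8.asm
    static void x265_interp_4tap_vert_sp_2x4_sse4(const int16_t*, intptr_t, uint8_t*, intptr_t, int) {}
    static void x265_interp_4tap_vert_sp_2x8_sse4(const int16_t*, intptr_t, uint8_t*, intptr_t, int) {}
    static void x265_interp_4tap_vert_sp_6x8_sse4(const int16_t*, intptr_t, uint8_t*, intptr_t, int) {}

    static void setupChromaVsp(EncoderPrimitives &p, int cpuMask)
    {
        // C fallbacks are assumed to be installed beforehand; the asm versions
        // simply overwrite the table entries when the CPU reports SSE4.
        if (cpuMask & X265_CPU_SSE4)
        {
            p.chroma_vsp[CHROMA_2x4] = x265_interp_4tap_vert_sp_2x4_sse4;
            p.chroma_vsp[CHROMA_2x8] = x265_interp_4tap_vert_sp_2x8_sse4;
            p.chroma_vsp[CHROMA_6x8] = x265_interp_4tap_vert_sp_6x8_sse4;
        }
    }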
Subject: [x265] TEncSearch: Fix parameter type of xEstimateResidualQT
details: http://hg.videolan.org/x265/rev/c89e22d26bcd
branches:
changeset: 5084:c89e22d26bcd
user: Derek Buitenhuis <derek.buitenhuis at gmail.com>
date: Wed Nov 13 13:52:43 2013 +0000
description:
TEncSearch: Fix parameter type of xEstimateResidualQT
Fixes compilation with g++.
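The likely failure mode is a reference-parameter type mismatch that MSVC tolerates but g++ does not, since the two typedefs can resolve to distinct 64-bit types on an LP64 toolchain. A hedged illustration of the fixed form (the typedef and names are stand-ins, not the exact x265 call site):

    #include <cstdint>

    typedef unsigned long long UInt64;   // legacy HM-style typedef (assumed)

    // Before this commit the parameter was 'UInt64 &rdCost'.  On MSVC, UInt64
    // and uint64_t are the same underlying type, so a caller holding a
    // uint64_t could bind the reference; on LP64 g++, uint64_t is usually
    // 'unsigned long', a distinct type, and the same call is rejected.
    // Declaring everything as uint64_t removes the mismatch.
    static void xEstimateResidualQT(uint64_t &rdCost)
    {
        rdCost += 1;
    }

    int main()
    {
        uint64_t cost = 0;
        xEstimateResidualQT(cost);       // parameter and argument now agree
        return (int)cost;
    }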
Subject: [x265] Reindent after last commit
details: http://hg.videolan.org/x265/rev/5683ee5b793c
branches:
changeset: 5085:5683ee5b793c
user: Derek Buitenhuis <derek.buitenhuis at gmail.com>
date: Wed Nov 13 13:53:13 2013 +0000
description:
Reindent after last commit
Subject: [x265] asm: routines for chroma vps filter functions for 2x4 and 2x8 block sizes.
details: http://hg.videolan.org/x265/rev/c828dd4d9eae
branches:
changeset: 5086:c828dd4d9eae
user: Nabajit Deka
date: Wed Nov 13 15:30:09 2013 +0530
description:
asm: routines for chroma vps filter functions for 2x4 and 2x8 block sizes.
Subject: [x265] TEncSearch: nit
details: http://hg.videolan.org/x265/rev/e871fe75d5ab
branches:
changeset: 5087:e871fe75d5ab
user: Steve Borho <steve at borho.org>
date: Wed Nov 13 13:52:43 2013 +0000
description:
TEncSearch: nit
diffstat:
source/Lib/TLibEncoder/TEncSearch.cpp | 8 +-
source/Lib/TLibEncoder/TEncSearch.h | 2 +-
source/common/x86/asm-primitives.cpp | 25 +-
source/common/x86/ipfilter8.asm | 1424 ++++++++++++++++++++++++++++++--
source/common/x86/ipfilter8.h | 9 +-
source/common/x86/pixel-a.asm | 254 +++++
6 files changed, 1603 insertions(+), 119 deletions(-)
diffs (truncated from 1980 to 300 lines):
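The HEVC_SATD hunks below swap generic cmp<> wrappers for the new dedicated asm routines (and, for AVX2, spell the wrappers out explicitly instead of using the macro). A hedged sketch of what such a wrapper looks like, assuming it tiles a smaller hand-written comparison primitive over the full partition; the template parameters follow the diff, while the exact x265 definition and pixelcmp_t signature are assumptions:

    #include <cstdint>

    typedef int (*pixelcmp_t)(const uint8_t *fenc, intptr_t fencStride,
                              const uint8_t *fref, intptr_t frefStride);

    // lx x ly: partition size; dx x dy: block handled by 'compare' per call
    template<int lx, int ly, int dx, int dy, pixelcmp_t compare>
    int cmp(const uint8_t *fenc, intptr_t fencStride,
            const uint8_t *fref, intptr_t frefStride)
    {
        int sum = 0;
        for (int y = 0; y < ly; y += dy)
            for (int x = 0; x < lx; x += dx)
                sum += compare(fenc + y * fencStride + x, fencStride,
                               fref + y * frefStride + x, frefStride);
        return sum;
    }

    // usage mirroring the diff:
    //   p.satd[LUMA_64x16] = cmp<64, 16, 16, 16, x265_pixel_satd_16x16_avx2>;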
diff -r c4ca80d19105 -r e871fe75d5ab source/Lib/TLibEncoder/TEncSearch.cpp
--- a/source/Lib/TLibEncoder/TEncSearch.cpp Tue Nov 12 19:10:23 2013 +0530
+++ b/source/Lib/TLibEncoder/TEncSearch.cpp Wed Nov 13 13:52:43 2013 +0000
@@ -2808,8 +2808,8 @@ void TEncSearch::encodeResAndCalcRdInter
}
// Residual coding.
- int qp, qpBest = 0;
- UInt64 cost, bcost = MAX_INT64;
+ int qp, qpBest = 0;
+ uint64_t cost, bcost = MAX_INT64;
uint32_t trLevel = 0;
if ((cu->getWidth(0) > cu->getSlice()->getSPS()->getMaxTrSize()))
@@ -3042,7 +3042,7 @@ void TEncSearch::xEstimateResidualQT(TCo
uint32_t absTUPartIdx,
TShortYUV* resiYuv,
const uint32_t depth,
- UInt64 & rdCost,
+ uint64_t & rdCost,
uint32_t & outBits,
uint32_t & outDist,
uint32_t * outZeroDist,
@@ -3634,7 +3634,7 @@ void TEncSearch::xEstimateResidualQT(TCo
}
uint32_t subdivDist = 0;
uint32_t subdivBits = 0;
- UInt64 subDivCost = 0;
+ uint64_t subDivCost = 0;
const uint32_t qPartNumSubdiv = cu->getPic()->getNumPartInCU() >> ((depth + 1) << 1);
for (uint32_t i = 0; i < 4; ++i)
diff -r c4ca80d19105 -r e871fe75d5ab source/Lib/TLibEncoder/TEncSearch.h
--- a/source/Lib/TLibEncoder/TEncSearch.h Tue Nov 12 19:10:23 2013 +0530
+++ b/source/Lib/TLibEncoder/TEncSearch.h Wed Nov 13 13:52:43 2013 +0000
@@ -250,7 +250,7 @@ protected:
void xEncodeResidualQT(TComDataCU* cu, uint32_t absPartIdx, uint32_t depth, bool bSubdivAndCbf, TextType ttype);
void xEstimateResidualQT(TComDataCU* cu, uint32_t absPartIdx, uint32_t absTUPartIdx, TShortYUV* resiYuv, uint32_t depth,
- UInt64 &rdCost, uint32_t &outBits, uint32_t &outDist, uint32_t *puiZeroDist, bool curUseRDOQ = true);
+ uint64_t &rdCost, uint32_t &outBits, uint32_t &outDist, uint32_t *puiZeroDist, bool curUseRDOQ = true);
void xSetResidualQTData(TComDataCU* cu, uint32_t absPartIdx, uint32_t absTUPartIdx, TShortYUV* resiYuv, uint32_t depth, bool bSpatial);
void setWpScalingDistParam(TComDataCU* cu, int refIdx, int picList);
diff -r c4ca80d19105 -r e871fe75d5ab source/common/x86/asm-primitives.cpp
--- a/source/common/x86/asm-primitives.cpp Tue Nov 12 19:10:23 2013 +0530
+++ b/source/common/x86/asm-primitives.cpp Wed Nov 13 13:52:43 2013 +0000
@@ -59,14 +59,14 @@ extern "C" {
#define INIT8(name, cpu) INIT8_NAME(name, name, cpu)
#define HEVC_SATD(cpu) \
- p.satd[LUMA_32x32] = cmp<32, 32, 16, 16, x265_pixel_satd_16x16_ ## cpu>; \
- p.satd[LUMA_24x32] = cmp<24, 32, 8, 16, x265_pixel_satd_8x16_ ## cpu>; \
+ p.satd[LUMA_32x32] = x265_pixel_satd_32x32_ ## cpu; \
+ p.satd[LUMA_24x32] = x265_pixel_satd_24x32_ ## cpu; \
p.satd[LUMA_64x64] = cmp<64, 64, 16, 16, x265_pixel_satd_16x16_ ## cpu>; \
p.satd[LUMA_64x32] = cmp<64, 32, 16, 16, x265_pixel_satd_16x16_ ## cpu>; \
p.satd[LUMA_32x64] = cmp<32, 64, 16, 16, x265_pixel_satd_16x16_ ## cpu>; \
p.satd[LUMA_64x48] = cmp<64, 48, 16, 16, x265_pixel_satd_16x16_ ## cpu>; \
p.satd[LUMA_48x64] = cmp<48, 64, 16, 16, x265_pixel_satd_16x16_ ## cpu>; \
- p.satd[LUMA_64x16] = cmp < 64, 16, 16, 16, x265_pixel_satd_16x16_ ## cpu >
+ p.satd[LUMA_64x16] = x265_pixel_satd_64x16_ ## cpu
#define ASSGN_SSE(cpu) \
p.sse_pp[LUMA_8x8] = x265_pixel_ssd_8x8_ ## cpu; \
@@ -138,6 +138,7 @@ extern "C" {
#define SETUP_CHROMA_FUNC_DEF(W, H, cpu) \
p.chroma_hpp[CHROMA_ ## W ## x ## H] = x265_interp_4tap_horiz_pp_ ## W ## x ## H ## cpu; \
p.chroma_vpp[CHROMA_ ## W ## x ## H] = x265_interp_4tap_vert_pp_ ## W ## x ## H ## cpu; \
+ p.chroma_vps[CHROMA_ ## W ## x ## H] = x265_interp_4tap_vert_ps_ ## W ## x ## H ## cpu; \
p.chroma_copy_ps[CHROMA_ ## W ## x ## H] = x265_blockcopy_ps_ ## W ## x ## H ## cpu; \
p.chroma_sub_ps[CHROMA_ ## W ## x ## H] = x265_pixel_sub_ps_ ## W ## x ## H ## cpu;
@@ -176,14 +177,11 @@ extern "C" {
#define CHROMA_SP_FILTERS(cpu) \
SETUP_CHROMA_SP_FUNC_DEF(4, 4, cpu); \
SETUP_CHROMA_SP_FUNC_DEF(4, 2, cpu); \
- SETUP_CHROMA_SP_FUNC_DEF(2, 4, cpu); \
SETUP_CHROMA_SP_FUNC_DEF(8, 8, cpu); \
SETUP_CHROMA_SP_FUNC_DEF(8, 4, cpu); \
SETUP_CHROMA_SP_FUNC_DEF(4, 8, cpu); \
SETUP_CHROMA_SP_FUNC_DEF(8, 6, cpu); \
- SETUP_CHROMA_SP_FUNC_DEF(6, 8, cpu); \
SETUP_CHROMA_SP_FUNC_DEF(8, 2, cpu); \
- SETUP_CHROMA_SP_FUNC_DEF(2, 8, cpu); \
SETUP_CHROMA_SP_FUNC_DEF(16, 16, cpu); \
SETUP_CHROMA_SP_FUNC_DEF(16, 8, cpu); \
SETUP_CHROMA_SP_FUNC_DEF(8, 16, cpu); \
@@ -359,7 +357,6 @@ void Setup_Assembly_Primitives(EncoderPr
INIT8(sad_x3, _mmx2);
INIT8(sad_x4, _mmx2);
INIT8(satd, _mmx2);
- HEVC_SATD(mmx2);
p.satd[LUMA_8x32] = x265_pixel_satd_8x32_sse2;
p.satd[LUMA_12x16] = x265_pixel_satd_12x16_sse2;
p.satd[LUMA_16x4] = x265_pixel_satd_16x4_sse2;
@@ -526,6 +523,10 @@ void Setup_Assembly_Primitives(EncoderPr
p.chroma_copy_sp[CHROMA_2x4] = x265_blockcopy_sp_2x4_sse4;
p.chroma_copy_sp[CHROMA_2x8] = x265_blockcopy_sp_2x8_sse4;
p.chroma_copy_sp[CHROMA_6x8] = x265_blockcopy_sp_6x8_sse4;
+
+ p.chroma_vsp[CHROMA_2x4] = x265_interp_4tap_vert_sp_2x4_sse4;
+ p.chroma_vsp[CHROMA_2x8] = x265_interp_4tap_vert_sp_2x8_sse4;
+ p.chroma_vsp[CHROMA_6x8] = x265_interp_4tap_vert_sp_6x8_sse4;
}
if (cpuMask & X265_CPU_AVX)
{
@@ -539,6 +540,7 @@ void Setup_Assembly_Primitives(EncoderPr
p.sa8d[BLOCK_16x16] = x265_pixel_sa8d_16x16_avx;
SA8D_INTER_FROM_BLOCK(avx);
ASSGN_SSE(avx);
+ HEVC_SATD(avx);
p.sad_x3[LUMA_12x16] = x265_pixel_sad_x3_12x16_avx;
p.sad_x4[LUMA_12x16] = x265_pixel_sad_x4_12x16_avx;
@@ -588,10 +590,17 @@ void Setup_Assembly_Primitives(EncoderPr
{
INIT2(sad_x4, _avx2);
INIT4(satd, _avx2);
- HEVC_SATD(avx2);
INIT2_NAME(sse_pp, ssd, _avx2);
p.sa8d[BLOCK_8x8] = x265_pixel_sa8d_8x8_avx2;
SA8D_INTER_FROM_BLOCK8(avx2);
+ p.satd[LUMA_32x32] = cmp<32, 32, 16, 16, x265_pixel_satd_16x16_avx2>;
+ p.satd[LUMA_24x32] = cmp<24, 32, 8, 16, x265_pixel_satd_8x16_avx2>;
+ p.satd[LUMA_64x64] = cmp<64, 64, 16, 16, x265_pixel_satd_16x16_avx2>;
+ p.satd[LUMA_64x32] = cmp<64, 32, 16, 16, x265_pixel_satd_16x16_avx2>;
+ p.satd[LUMA_32x64] = cmp<32, 64, 16, 16, x265_pixel_satd_16x16_avx2>;
+ p.satd[LUMA_64x48] = cmp<64, 48, 16, 16, x265_pixel_satd_16x16_avx2>;
+ p.satd[LUMA_48x64] = cmp<48, 64, 16, 16, x265_pixel_satd_16x16_avx2>;
+ p.satd[LUMA_64x16] = cmp<64, 16, 16, 16, x265_pixel_satd_16x16_avx2>;
p.sad_x4[LUMA_16x12] = x265_pixel_sad_x4_16x12_avx2;
p.sad_x4[LUMA_16x32] = x265_pixel_sad_x4_16x32_avx2;
diff -r c4ca80d19105 -r e871fe75d5ab source/common/x86/ipfilter8.asm
--- a/source/common/x86/ipfilter8.asm Tue Nov 12 19:10:23 2013 +0530
+++ b/source/common/x86/ipfilter8.asm Wed Nov 13 13:52:43 2013 +0000
@@ -1401,6 +1401,874 @@ FILTER_V4_W8_8x6 8, 6
RET
+;-------------------------------------------------------------------------------------------------------------
+; void interp_4tap_vert_ps_4x2(pixel *src, intptr_t srcStride, int16_t *dst, intptr_t dstStride, int coeffIdx)
+;-------------------------------------------------------------------------------------------------------------
+INIT_XMM sse4
+cglobal interp_4tap_vert_ps_4x2, 4, 6, 8
+
+mov r4d, r4m
+sub r0, r1
+add r3d, r3d
+
+%ifdef PIC
+lea r5, [tab_ChromaCoeff]
+movd m0, [r5 + r4 * 4]
+%else
+movd m0, [tab_ChromaCoeff + r4 * 4]
+%endif
+
+pshufb m0, [tab_Cm]
+
+mova m1, [tab_c_8192]
+
+movd m2, [r0]
+movd m3, [r0 + r1]
+movd m4, [r0 + 2 * r1]
+lea r5, [r0 + 2 * r1]
+movd m5, [r5 + r1]
+
+punpcklbw m2, m3
+punpcklbw m6, m4, m5
+punpcklbw m2, m6
+
+pmaddubsw m2, m0
+
+movd m6, [r0 + 4 * r1]
+
+punpcklbw m3, m4
+punpcklbw m5, m6
+punpcklbw m3, m5
+
+pmaddubsw m3, m0
+
+phaddw m2, m3
+
+psubw m2, m1
+movlps [r2], m2
+movhps [r2 + r3], m2
+
+RET
+
+;-------------------------------------------------------------------------------------------------------------
+; void interp_4tap_vert_ps_4x4(pixel *src, intptr_t srcStride, int16_t *dst, intptr_t dstStride, int coeffIdx)
+;-------------------------------------------------------------------------------------------------------------
+INIT_XMM sse4
+cglobal interp_4tap_vert_ps_4x4, 4, 7, 8
+
+ mov r4d, r4m
+ sub r0, r1
+ add r3d, r3d
+
+%ifdef PIC
+ lea r5, [tab_ChromaCoeff]
+ movd m0, [r5 + r4 * 4]
+%else
+ movd m0, [tab_ChromaCoeff + r4 * 4]
+%endif
+
+ pshufb m0, [tab_Cm]
+
+ mova m1, [tab_c_8192]
+
+ movd m2, [r0]
+ movd m3, [r0 + r1]
+ movd m4, [r0 + 2 * r1]
+ lea r5, [r0 + 2 * r1]
+ movd m5, [r5 + r1]
+
+ punpcklbw m2, m3
+ punpcklbw m6, m4, m5
+ punpcklbw m2, m6
+
+ pmaddubsw m2, m0
+
+ movd m6, [r0 + 4 * r1]
+
+ punpcklbw m3, m4
+ punpcklbw m7, m5, m6
+ punpcklbw m3, m7
+
+ pmaddubsw m3, m0
+
+ phaddw m2, m3
+
+ psubw m2, m1
+ movlps [r2], m2
+ movhps [r2 + r3], m2
+
+ lea r5, [r0 + 4 * r1]
+ movd m2, [r5 + r1]
+
+ punpcklbw m4, m5
+ punpcklbw m3, m6, m2
+ punpcklbw m4, m3
+
+ pmaddubsw m4, m0
+
+ movd m3, [r5 + 2 * r1]
+
+ punpcklbw m5, m6
+ punpcklbw m2, m3
+ punpcklbw m5, m2
+
+ pmaddubsw m5, m0
+
+ phaddw m4, m5
+
+ psubw m4, m1
+ movlps [r2 + 2 * r3], m4
+ lea r6, [r2 + 2 * r3]
+ movhps [r6 + r3], m4
+
+ RET
+
+;---------------------------------------------------------------------------------------------------------------
+; void interp_4tap_vert_ps_%1x%2(pixel *src, intptr_t srcStride, int16_t *dst, intptr_t dstStride, int coeffIdx)
+;---------------------------------------------------------------------------------------------------------------
+%macro FILTER_V_PS_W4_H4 2
+INIT_XMM sse4
+cglobal interp_4tap_vert_ps_%1x%2, 4, 7, 8
+
+ mov r4d, r4m
+ sub r0, r1
+ add r3d, r3d
+
+%ifdef PIC
+ lea r5, [tab_ChromaCoeff]
+ movd m0, [r5 + r4 * 4]
+%else
+ movd m0, [tab_ChromaCoeff + r4 * 4]
+%endif
+
+ pshufb m0, [tab_Cm]
+
+ mova m1, [tab_c_8192]
+
+ mov r4d, %2/4
+
+.loop
+ movd m2, [r0]
+ movd m3, [r0 + r1]
+ movd m4, [r0 + 2 * r1]
+ lea r5, [r0 + 2 * r1]
+ movd m5, [r5 + r1]
+
+ punpcklbw m2, m3
+ punpcklbw m6, m4, m5
+ punpcklbw m2, m6
+
+ pmaddubsw m2, m0
+