[x265-commits] [x265] x86inc.asm: fix vpbroadcastd bug on Mac platform
Min Chen
chenm003 at 163.com
Mon Sep 8 15:20:59 CEST 2014
details: http://hg.videolan.org/x265/rev/51930084e148
branches:
changeset: 7989:51930084e148
user: Min Chen <chenm003 at 163.com>
date: Fri Sep 05 16:48:03 2014 -0700
description:
x86inc.asm: fix vpbroadcastd bug on Mac platform
Subject: [x265] entropy: change top-level encode to encodeCTU
details: http://hg.videolan.org/x265/rev/de5614144bce
branches:
changeset: 7990:de5614144bce
user: Deepthi Nandakumar <deepthi at multicorewareinc.com>
date: Mon Sep 08 11:48:56 2014 +0530
description:
entropy: change top-level encode to encodeCTU
Subject: [x265] Merge with correct x86inc.asm patch
details: http://hg.videolan.org/x265/rev/c55d69561948
branches:
changeset: 7991:c55d69561948
user: Steve Borho <steve at borho.org>
date: Mon Sep 08 12:02:55 2014 +0200
description:
Merge with correct x86inc.asm patch
Subject: [x265] nits
details: http://hg.videolan.org/x265/rev/a29aa966336e
branches:
changeset: 7992:a29aa966336e
user: Steve Borho <steve at borho.org>
date: Mon Sep 08 14:00:52 2014 +0200
description:
nits
Subject: [x265] frameencoder: remove redundant clear of frame stats
details: http://hg.videolan.org/x265/rev/a7465d789c64
branches:
changeset: 7993:a7465d789c64
user: Steve Borho <steve at borho.org>
date: Mon Sep 08 14:03:30 2014 +0200
description:
frameencoder: remove redundant clear of frame stats
They were being zeroed in the constructor, in init(), and in compressCTURows.
Technically only the last is truly necessary, but I'm leaving the memset in the
constructor.
Subject: [x265] frameencoder: remove second encodeCU() pass over CTUs when SAO is disabled
details: http://hg.videolan.org/x265/rev/a117564df3ef
branches:
changeset: 7994:a117564df3ef
user: Steve Borho <steve at borho.org>
date: Fri Sep 05 17:56:17 2014 +0200
description:
frameencoder: remove second encodeCU() pass over CTUs when SAO is disabled
This is a performance optimization: it allows the encoder to generate the final
bitstream of each CTU as soon as it is compressed, while it is still cache hot.
When SAO is enabled, SAO analysis must be performed and coded at the start of
the CTU, but SAO analysis currently requires the surrounding CTUs to be
encoded, making the second pass unavoidable.
Note that this commit changes the way non-WPP encodes are performed, for the
better. Now it always uses row 0's CI_CURR_BEST entropy coder instance to
communicate entropy state between all CTUs and between rows. This better models
how encodeSlice() works and makes RDO work better.
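A rough illustration of the ordering change described in this commit message,
not the actual frameencoder code: the function name and the event strings below
are hypothetical and exist only for this sketch. With SAO disabled each CTU is
coded right after it is compressed; with SAO enabled coding is deferred to a
second pass over the CTUs.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical illustration only; none of these names exist in x265.
// Returns the order in which the CTUs of one frame are compressed (analysed)
// and coded (written to the final bitstream).
std::vector<std::string> encodeOrder(int numCtus, bool saoEnabled)
{
    assert(numCtus >= 0);
    std::vector<std::string> events;

    for (int i = 0; i < numCtus; i++)
    {
        events.push_back("compress " + std::to_string(i));

        // Without SAO the final bits can be generated immediately,
        // while the CTU data is still hot in cache.
        if (!saoEnabled)
            events.push_back("code " + std::to_string(i));
    }

    // With SAO the coding step waits until the surrounding CTUs have been
    // compressed, so a second pass over the CTUs is still needed.
    if (saoEnabled)
    {
        for (int i = 0; i < numCtus; i++)
            events.push_back("code " + std::to_string(i));
    }

    return events;
}
```

In this sketch a two-CTU frame yields compress/code pairs per CTU when SAO is
off, and all compress events followed by all code events when SAO is on.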
Subject: [x265] frameencoder: merge more of encodeSlice() into processCU
details: http://hg.videolan.org/x265/rev/60289c638600
branches:
changeset: 7995:60289c638600
user: Steve Borho <steve at borho.org>
date: Mon Sep 08 12:41:13 2014 +0200
description:
frameencoder: merge more of encodeSlice() into processCU
This commit fixes no-WPP after the previous change. The per-row or per-frame
(with or without WPP) bitstreams are flushed as they are finished (and cache
hot), and the per-CU stats are summed per row and then summarized all in one
place.
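A minimal sketch of the "sum per row, summarize in one place" idea, assuming
made-up counter names (RowStats, FrameSummary and summarize are hypothetical
and are not the FrameStats fields in x265). The per-row counters stay integer,
and the conversion to percentages happens once per frame.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical per-row counters; not the real x265 structures.
struct RowStats
{
    uint64_t intra = 0;
    uint64_t inter = 0;
    uint64_t skip = 0;
};

// Hypothetical frame-level summary produced from the row counters.
struct FrameSummary
{
    uint64_t totalCUs;
    double percentIntra;
    double percentInter;
    double percentSkip;
};

FrameSummary summarize(const std::vector<RowStats>& rows)
{
    // Accumulate as integers; nothing is converted to floating point
    // until every row has been added.
    RowStats sum;
    for (const RowStats& r : rows)
    {
        sum.intra += r.intra;
        sum.inter += r.inter;
        sum.skip += r.skip;
    }

    uint64_t total = sum.intra + sum.inter + sum.skip;
    assert(total > 0 && "no CUs were counted");

    FrameSummary out;
    out.totalCUs = total;
    out.percentIntra = 100.0 * sum.intra / total;
    out.percentInter = 100.0 * sum.inter / total;
    out.percentSkip = 100.0 * sum.skip / total;
    return out;
}
```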
Subject: [x265] frameencoder: do more CU stat math as integer
details: http://hg.videolan.org/x265/rev/406d92c860d5
branches:
changeset: 7996:406d92c860d5
user: Steve Borho <steve at borho.org>
date: Mon Sep 08 14:29:09 2014 +0200
description:
frameencoder: do more CU stat math as integer
Subject: [x265] frameencoder: rename percent fields for clarity
details: http://hg.videolan.org/x265/rev/9581a45d4344
branches:
changeset: 7997:9581a45d4344
user: Steve Borho <steve at borho.org>
date: Mon Sep 08 14:32:02 2014 +0200
description:
frameencoder: rename percent fields for clarity
Subject: [x265] rc: move FrameStats to ratecontrol.h
details: http://hg.videolan.org/x265/rev/89e682182a7a
branches:
changeset: 7998:89e682182a7a
user: Steve Borho <steve at borho.org>
date: Mon Sep 08 14:34:38 2014 +0200
description:
rc: move FrameStats to ratecontrol.h
rate control shouldn't need to include frameencoder.h
Subject: [x265] frameencoder: combine some conditional expressions
details: http://hg.videolan.org/x265/rev/cb67f6f65577
branches:
changeset: 7999:cb67f6f65577
user: Steve Borho <steve at borho.org>
date: Mon Sep 08 15:11:56 2014 +0200
description:
frameencoder: combine some conditional expressions
Subject: [x265] frameencoder: avoid another call to resetEntropy(), they are expensive
details: http://hg.videolan.org/x265/rev/cfe197e3044d
branches:
changeset: 8000:cfe197e3044d
user: Steve Borho <steve at borho.org>
date: Mon Sep 08 15:12:20 2014 +0200
description:
frameencoder: avoid another call to resetEntropy(), they are expensive
diffstat:
source/common/dct.cpp | 1 +
source/common/x86/asm-primitives.cpp | 2 +
source/common/x86/const-a.asm | 1 +
source/common/x86/pixel-util.h | 1 +
source/common/x86/pixel-util8.asm | 61 +++++++++++++--
source/common/x86/x86inc.asm | 12 +++
source/encoder/analysis.cpp | 15 +++-
source/encoder/entropy.cpp | 16 ++--
source/encoder/entropy.h | 9 +-
source/encoder/frameencoder.cpp | 140 ++++++++++++++++++++---------------
source/encoder/frameencoder.h | 21 +----
source/encoder/ratecontrol.cpp | 7 +-
source/encoder/ratecontrol.h | 18 ++++-
source/encoder/sao.cpp | 51 ++++++------
14 files changed, 220 insertions(+), 135 deletions(-)
diffs (truncated from 737 to 300 lines):
diff -r 795878af3973 -r cfe197e3044d source/common/dct.cpp
--- a/source/common/dct.cpp Fri Sep 05 16:03:44 2014 +0200
+++ b/source/common/dct.cpp Mon Sep 08 15:12:20 2014 +0200
@@ -729,6 +729,7 @@ void dequant_normal_c(const int16_t* qua
X265_CHECK(num <= 32 * 32, "dequant num %d too large\n", num);
X265_CHECK((num % 8) == 0, "dequant num %d not multiple of 8\n", num);
X265_CHECK(shift <= 10, "shift too large %d\n", shift);
+ X265_CHECK(((intptr_t)coef & 31) == 0, "dequant coef buffer not aligned\n");
int add, coeffQ;
diff -r 795878af3973 -r cfe197e3044d source/common/x86/asm-primitives.cpp
--- a/source/common/x86/asm-primitives.cpp Fri Sep 05 16:03:44 2014 +0200
+++ b/source/common/x86/asm-primitives.cpp Mon Sep 08 15:12:20 2014 +0200
@@ -1442,6 +1442,7 @@ void Setup_Assembly_Primitives(EncoderPr
{
p.dct[DCT_4x4] = x265_dct4_avx2;
p.nquant = x265_nquant_avx2;
+ p.dequant_normal = x265_dequant_normal_avx2;
}
/* at HIGH_BIT_DEPTH, pixel == short so we can reuse a number of primitives */
for (int i = 0; i < NUM_LUMA_PARTITIONS; i++)
@@ -1739,6 +1740,7 @@ void Setup_Assembly_Primitives(EncoderPr
p.dct[DCT_4x4] = x265_dct4_avx2;
p.nquant = x265_nquant_avx2;
+ p.dequant_normal = x265_dequant_normal_avx2;
}
#endif // if HIGH_BIT_DEPTH
}
diff -r 795878af3973 -r cfe197e3044d source/common/x86/const-a.asm
--- a/source/common/x86/const-a.asm Fri Sep 05 16:03:44 2014 +0200
+++ b/source/common/x86/const-a.asm Mon Sep 08 15:12:20 2014 +0200
@@ -89,6 +89,7 @@ const pd_512, times 4 dd 512
const pd_1024, times 4 dd 1024
const pd_2048, times 4 dd 2048
const pd_ffff, times 4 dd 0xffff
+const pd_32767, times 4 dd 32767
const pd_n32768, times 4 dd 0xffff8000
const pw_ff00, times 8 dw 0xff00
diff -r 795878af3973 -r cfe197e3044d source/common/x86/pixel-util.h
--- a/source/common/x86/pixel-util.h Fri Sep 05 16:03:44 2014 +0200
+++ b/source/common/x86/pixel-util.h Mon Sep 08 15:12:20 2014 +0200
@@ -48,6 +48,7 @@ uint32_t x265_quant_sse4(int32_t *coef,
uint32_t x265_nquant_sse4(int32_t *coef, int32_t *quantCoeff, int16_t *qCoef, int qBits, int add, int numCoeff);
uint32_t x265_nquant_avx2(int32_t *coef, int32_t *quantCoeff, int16_t *qCoef, int qBits, int add, int numCoeff);
void x265_dequant_normal_sse4(const int16_t* quantCoef, int32_t* coef, int num, int scale, int shift);
+void x265_dequant_normal_avx2(const int16_t* quantCoef, int32_t* coef, int num, int scale, int shift);
int x265_count_nonzero_ssse3(const int16_t *quantCoeff, int numCoeff);
void x265_weight_pp_sse4(pixel *src, pixel *dst, intptr_t srcStride, intptr_t dstStride, int width, int height, int w0, int round, int shift, int offset);
diff -r 795878af3973 -r cfe197e3044d source/common/x86/pixel-util8.asm
--- a/source/common/x86/pixel-util8.asm Fri Sep 05 16:03:44 2014 +0200
+++ b/source/common/x86/pixel-util8.asm Mon Sep 08 15:12:20 2014 +0200
@@ -54,6 +54,8 @@ cextern pw_1
cextern pw_00ff
cextern pw_2000
cextern pw_pixel_max
+cextern pd_32767
+cextern pd_n32768
;-----------------------------------------------------------------------------
; void calcrecon(pixel* pred, int16_t* residual, int16_t* reconqt, pixel *reconipred, int stride, int strideqt, int strideipred)
@@ -1040,21 +1042,18 @@ cglobal nquant, 3,5,7
;-----------------------------------------------------------------------------
INIT_XMM sse4
cglobal dequant_normal, 5,5,5
- movd m1, r3 ; m1 = word [scale]
mova m2, [pw_1]
%if HIGH_BIT_DEPTH
cmp r3d, 32767
jle .skip
- psrld m1, 2
+ shr r3d, 2
sub r4d, 2
.skip:
%endif
movd m0, r4d ; m0 = shift
- xor r3d, r3d
- dec r4d
+ add r4d, 15
bts r3d, r4d
- movd m3, r3d
- punpcklwd m1, m3
+ movd m1, r3d
pshufd m1, m1, 0 ; m1 = dword [add scale]
; m0 = shift
; m1 = scale
@@ -1071,8 +1070,8 @@ cglobal dequant_normal, 5,5,5
pmovsxwd m3, m3
packssdw m4, m4
pmovsxwd m4, m4
- movu [r1], m3
- movu [r1 + 16], m4
+ mova [r1], m3
+ mova [r1 + 16], m4
add r0, 16
add r1, 32
@@ -1082,6 +1081,52 @@ cglobal dequant_normal, 5,5,5
RET
+INIT_YMM avx2
+cglobal dequant_normal, 5,5,7
+ vpbroadcastd m2, [pw_1] ; m2 = word [1]
+ vpbroadcastd m5, [pd_32767] ; m5 = dword [32767]
+ vpbroadcastd m6, [pd_n32768] ; m6 = dword [-32768]
+%if HIGH_BIT_DEPTH
+ cmp r3d, 32767
+ jle .skip
+ shr r3d, 2
+ sub r4d, 2
+.skip:
+%endif
+ movd xm0, r4d ; m0 = shift
+ add r4d, -1+16
+ bts r3d, r4d
+ vpbroadcastd m1, r3d ; m1 = dword [add scale]
+
+ ; m0 = shift
+ ; m1 = scale
+ ; m2 = word [1]
+ shr r2d, 4
+.loop:
+ movu m3, [r0]
+ punpckhwd m4, m3, m2
+ punpcklwd m3, m2
+ pmaddwd m3, m1 ; m3 = dword (clipQCoef * scale + add)
+ pmaddwd m4, m1
+ psrad m3, xm0
+ psrad m4, xm0
+ pminsd m3, m5
+ pmaxsd m3, m6
+ pminsd m4, m5
+ pmaxsd m4, m6
+ mova [r1 + 0 * mmsize/2], xm3
+ mova [r1 + 1 * mmsize/2], xm4
+ vextracti128 [r1 + 2 * mmsize/2], m3, 1
+ vextracti128 [r1 + 3 * mmsize/2], m4, 1
+
+ add r0, mmsize
+ add r1, mmsize * 2
+
+ dec r2d
+ jnz .loop
+ RET
+
+
;-----------------------------------------------------------------------------
; int count_nonzero(const int16_t *quantCoeff, int numCoeff);
;-----------------------------------------------------------------------------
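A scalar sketch of what the dequant_normal assembly above appears to compute,
written for illustration and not taken from the x265 sources. It assumes the
usual rounding form (q * scale + add) >> shift with add = 1 << (shift - 1),
followed by a clip to the int16 range. The second function mimics the trick in
the assembly: scale and add are packed into one dword (the bts at bit
shift - 1 + 16 sets the high word), and each coefficient is paired with the
constant 1 so that a single pmaddwd yields q * scale + 1 * add. The helper
names are hypothetical.

```cpp
#include <cassert>
#include <cstdint>

// Clip to the signed 16-bit range, as pminsd/pmaxsd do in the AVX2 version.
static int32_t clip16(int32_t v)
{
    return v < -32768 ? -32768 : (v > 32767 ? 32767 : v);
}

// Straightforward form: multiply, add the rounding offset, shift, clip.
// Assumes >> on a negative int is an arithmetic shift (true on common
// compilers, guaranteed since C++20), matching psrad.
int32_t dequantRef(int16_t q, int scale, int shift)
{
    assert(shift >= 1 && shift <= 10);
    assert(scale >= 0 && scale <= 32767);
    int32_t add = 1 << (shift - 1);
    return clip16((q * scale + add) >> shift);
}

// Packed form: low word = scale, high word = add, then a two-element
// dot product of (q, 1) with (scale, add), which is what pmaddwd computes
// for each pair of 16-bit lanes.
int32_t dequantPacked(int16_t q, int scale, int shift)
{
    assert(shift >= 1 && shift <= 10);
    assert(scale >= 0 && scale <= 32767);
    uint32_t packed = (uint32_t)scale | (1u << (shift - 1 + 16));
    int16_t lo = (int16_t)(packed & 0xffff); // scale
    int16_t hi = (int16_t)(packed >> 16);    // add
    int32_t dot = q * lo + 1 * hi;
    return clip16(dot >> shift);
}
```

Both functions should agree for every input in the asserted ranges, which is
the property the packed-constant trick relies on.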
diff -r 795878af3973 -r cfe197e3044d source/common/x86/x86inc.asm
--- a/source/common/x86/x86inc.asm Fri Sep 05 16:03:44 2014 +0200
+++ b/source/common/x86/x86inc.asm Mon Sep 08 15:12:20 2014 +0200
@@ -888,6 +888,8 @@ INIT_XMM
%define ymmmm%1 mm%1
%define ymmxmm%1 xmm%1
%define ymmymm%1 ymm%1
+ %define ymm%1xmm xmm%1
+ %define xmm%1ymm ymm%1
%define xm%1 xmm %+ m%1
%define ym%1 ymm %+ m%1
%endmacro
@@ -1480,3 +1482,13 @@ FMA4_INSTR fnmsubss, fnmsub132ss, fnmsub
%endif
%endmacro
%endif
+
+; workaround: vpbroadcastd with register, the yasm will generate wrong code
+%macro vpbroadcastd 2
+ %ifid %2
+ movd %1 %+ xmm, %2
+ vpbroadcastd %1, %1 %+ xmm
+ %else
+ vpbroadcastd %1, %2
+ %endif
+%endmacro
diff -r 795878af3973 -r cfe197e3044d source/encoder/analysis.cpp
--- a/source/encoder/analysis.cpp Fri Sep 05 16:03:44 2014 +0200
+++ b/source/encoder/analysis.cpp Mon Sep 08 15:12:20 2014 +0200
@@ -1056,21 +1056,30 @@ void Analysis::compressInterCU_rd0_4(TCo
copyYuv2Pic(pic, outBestCU->getAddr(), absPartIdx, depth);
}
+#if CHECKED_BUILD || _DEBUG
/* Assert if Best prediction mode is NONE
* Selected mode's RD-cost must be not MAX_INT64 */
if (bInsidePicture)
{
X265_CHECK(outBestCU->getPartitionSize(0) != SIZE_NONE, "no best prediction size\n");
X265_CHECK(outBestCU->getPredictionMode(0) != MODE_NONE, "no best prediction mode\n");
- if (m_rdCost.m_psyRd)
+ if (m_param->rdLevel > 1)
{
- X265_CHECK(outBestCU->m_totalPsyCost != MAX_INT64, "no best partition cost\n");
+ if (m_rdCost.m_psyRd)
+ {
+ X265_CHECK(outBestCU->m_totalPsyCost != MAX_INT64, "no best partition cost\n");
+ }
+ else
+ {
+ X265_CHECK(outBestCU->m_totalRDCost != MAX_INT64, "no best partition cost\n");
+ }
}
else
{
- X265_CHECK(outBestCU->m_totalRDCost != MAX_INT64, "no best partition cost\n");
+ X265_CHECK(outBestCU->m_sa8dCost != MAX_INT64, "no best partition cost\n");
}
}
+#endif
x265_emms();
}
diff -r 795878af3973 -r cfe197e3044d source/encoder/entropy.cpp
--- a/source/encoder/entropy.cpp Fri Sep 05 16:03:44 2014 +0200
+++ b/source/encoder/entropy.cpp Mon Sep 08 15:12:20 2014 +0200
@@ -481,7 +481,7 @@ void Entropy::codeShortTermRefPicSet(RPS
}
}
-void Entropy::encodeCU(TComDataCU* cu)
+void Entropy::encodeCTU(TComDataCU* cu)
{
bool bEncodeDQP = cu->m_slice->m_pps->bUseDQP;
encodeCU(cu, 0, 0, false, bEncodeDQP);
@@ -572,11 +572,6 @@ void Entropy::finishCU(TComDataCU* cu, u
uint32_t realEndAddress = slice->m_endCUAddr;
uint32_t cuAddr = cu->getSCUAddr() + absPartIdx;
- // Encode slice finish
- bool bTerminateSlice = false;
- if (cuAddr + (cu->m_pic->getNumPartInCU() >> (depth << 1)) == realEndAddress)
- bTerminateSlice = true;
-
uint32_t granularityMask = g_maxCUSize - 1;
uint32_t cuSize = 1 << cu->getLog2CUSize(absPartIdx);
uint32_t rpelx = cu->getCUPelX() + g_zscanToPelX[absPartIdx] + cuSize;
@@ -586,12 +581,17 @@ void Entropy::finishCU(TComDataCU* cu, u
if (granularityBoundary)
{
+ // Encode slice finish
+ bool bTerminateSlice = false;
+ if (cuAddr + (cu->m_pic->getNumPartInCU() >> (depth << 1)) == realEndAddress)
+ bTerminateSlice = true;
+
// The 1-terminating bit is added to all streams, so don't add it here when it's 1.
if (!bTerminateSlice)
- codeTerminatingBit(0);
+ encodeBinTrm(0);
if (!m_bitIf)
- resetBits();
+ resetBits(); // TODO: most likely unnecessary
}
}
diff -r 795878af3973 -r cfe197e3044d source/encoder/entropy.h
--- a/source/encoder/entropy.h Fri Sep 05 16:03:44 2014 +0200
+++ b/source/encoder/entropy.h Mon Sep 08 15:12:20 2014 +0200
@@ -133,7 +133,7 @@ public:
void load(Entropy& src);
void loadIntraDirModeLuma(Entropy& src);
void store(Entropy& dest);
- void loadContexts(Entropy& src) { copyContextsFrom(src); }
+ void loadContexts(Entropy& src) { copyContextsFrom(src); }
void copyState(Entropy& other);
void codeVPS(VPS* vps);
@@ -146,13 +146,12 @@ public:
void codeSliceHeader(Slice* slice);
void codeSliceHeaderWPPEntryPoints(Slice* slice, uint32_t *substreamSizes, uint32_t maxOffset);
void codeShortTermRefPicSet(RPS* rps);
- void codeSliceFinish() { finish(); }
- void codeTerminatingBit(uint32_t lsLast) { encodeBinTrm(lsLast); }
+ void finishSlice() { encodeBinTrm(1); finish(); dynamic_cast<Bitstream*>(m_bitIf)->writeByteAlignment(); }
- void encodeCU(TComDataCU* cu);
+ void encodeCTU(TComDataCU* cu);
void codeSaoOffset(SaoLcuParam* saoLcuParam, uint32_t compIdx);
void codeSaoUnitInterleaving(int compIdx, bool saoFlag, int rx, int ry, SaoLcuParam* saoLcuParam, int cuAddrInSlice, int cuAddrUpInSlice, int allowMergeLeft, int allowMergeUp);
- void codeSaoMerge(uint32_t code) { encodeBin(code, m_contextState[OFF_SAO_MERGE_FLAG_CTX]); }
+ void codeSaoMerge(uint32_t code) { encodeBin(code, m_contextState[OFF_SAO_MERGE_FLAG_CTX]); }
void codeCUTransquantBypassFlag(uint32_t symbol);
void codeSkipFlag(TComDataCU* cu, uint32_t absPartIdx);
diff -r 795878af3973 -r cfe197e3044d source/encoder/frameencoder.cpp
--- a/source/encoder/frameencoder.cpp Fri Sep 05 16:03:44 2014 +0200
+++ b/source/encoder/frameencoder.cpp Mon Sep 08 15:12:20 2014 +0200
@@ -117,7 +117,6 @@ bool FrameEncoder::init(Encoder *top, in
ok &= m_rce.picTimingSEI && m_rce.hrdTiming;
}
- memset(&m_frameStats, 0, sizeof(m_frameStats));
if (m_param->noiseReduction)
m_nr = X265_MALLOC(NoiseReduction, 1);
if (m_nr)