[x265-commits] [x265] frameencoder: avoid race hazard in end of frame logic
Steve Borho
steve at borho.org
Thu Mar 5 04:36:07 CET 2015
details: http://hg.videolan.org/x265/rev/7b5d44d7831e
branches:
changeset: 9612:7b5d44d7831e
user: Steve Borho <steve at borho.org>
date: Wed Mar 04 17:15:36 2015 -0600
description:
frameencoder: avoid race hazard in end of frame logic
This bug has been in the encoder, latent, for about 6 months so it deserves an
obituary (being somewhat presumptuous that this commit actually kills it). The
bug worked in this way:
1. The user starts a small-resolution encode at the superfast preset,
generating frame rates of about 200-300 fps depending on the hardware.
2. A worker thread finishes compressing all of the CTUs in its row, but then
gets evicted from its core before it is allowed to flush its row bitstream
object (leaving four bytes within the Entropy object unwritten).
3. The other threads quickly finish the rest of the frame and the worker which
compressed the last row signals frame completion.
4. The frame encoder thread concatenates the row bitstreams together and emits
the NAL. If you were lucky, the evicted row would have been only 4 bytes
long, since the resulting 0-byte length triggers a check failure indicating
something is seriously wrong. If you're unlucky, the NAL is garbage and does
evil things to your decoder.
5. The evicted worker finally gets back on the core and flushes its bitstream,
to the aid of no one.
With this commit, we note when the last row has been compressed (indicating all
CTUs in the frame are compressed) but we do not signal that the frame is
complete until the last worker thread leaves processRow() (taking advantage of
the recently introduced atomic count of workers).
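
The ordering described above can be illustrated with a minimal, self-contained C++ sketch. It is not the x265 implementation: the class, its members and the packing of the "last row done" flag into the same atomic word as the worker count are all assumptions made for illustration. The point it shows is that the worker that compresses the last row only records that fact, and the completion signal is raised by whichever worker leaves last.

```cpp
#include <atomic>
#include <chrono>
#include <condition_variable>
#include <mutex>
#include <thread>
#include <vector>

// Hypothetical stand-in for the frame encoder; none of these names are x265's.
class FrameSketch
{
public:
    explicit FrameSketch(int rows) : m_rows(rows), m_flushed(rows, 0) {}

    // Body of one worker thread, called once per CTU row.
    void processRow(int row)
    {
        m_state.fetch_add(1);                    // worker enters

        // ... compress all CTUs in this row ...
        if (row == m_rows - 1)
        {
            // Model the wavefront dependency: the last row cannot finish
            // before every earlier row has finished compressing.
            while (m_compressed.load() != m_rows - 1)
                std::this_thread::yield();
            m_state.fetch_or(LAST_ROW_DONE);     // note it, do not signal yet
        }
        else
        {
            m_compressed.fetch_add(1);
            if (row == 0)                        // simulate eviction before the flush
                std::this_thread::sleep_for(std::chrono::milliseconds(20));
        }

        m_flushed[row] = 1;                      // flush the row bitstream

        // Worker leaves. Only the worker that drops the count to zero after
        // the last row was compressed signals frame completion.
        if (m_state.fetch_sub(1) == (LAST_ROW_DONE | 1))
        {
            std::lock_guard<std::mutex> guard(m_lock);
            m_complete = true;
            m_done.notify_one();
        }
    }

    // Frame encoder thread: wait for completion, then verify every row flushed.
    bool waitAndCheck()
    {
        std::unique_lock<std::mutex> guard(m_lock);
        m_done.wait(guard, [this] { return m_complete; });
        for (int f : m_flushed)
            if (!f)
                return false;
        return true;
    }

private:
    enum { LAST_ROW_DONE = 1 << 30 };            // flag bit above the worker count

    int                     m_rows;
    std::vector<int>        m_flushed;
    std::atomic<int>        m_state{0};          // flag bit | active worker count
    std::atomic<int>        m_compressed{0};     // rows compressed, last row excluded
    std::mutex              m_lock;
    std::condition_variable m_done;
    bool                    m_complete = false;
};

// Runs one simulated frame; returns true if all rows were flushed
// by the time completion was signalled.
bool runFrameSketch(int rows)
{
    FrameSketch frame(rows);
    std::vector<std::thread> workers;
    for (int r = 0; r < rows; r++)
        workers.emplace_back(&FrameSketch::processRow, &frame, r);
    bool ok = frame.waitAndCheck();
    for (auto& w : workers)
        w.join();
    return ok;
}
```

Packing the flag and the count into one atomic word is a choice of this sketch: it makes "count reached zero and the last row is done" a single atomic observation, so the delayed worker in row 0 cannot be overtaken by the completion signal.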
Subject: [x265] asm: improve intra_pred_dc4_sse4 by merge reduce code
details: http://hg.videolan.org/x265/rev/0cb6948f9d3c
branches:
changeset: 9613:0cb6948f9d3c
user: Min Chen <chenm003 at 163.com>
date: Mon Mar 02 18:32:36 2015 -0800
description:
asm: improve intra_pred_dc4_sse4 by merge reduce code
Subject: [x265] asm: improve algorithm on luma_hps[8xN]
details: http://hg.videolan.org/x265/rev/ae36726c875c
branches:
changeset: 9614:ae36726c875c
user: Min Chen <chenm003 at 163.com>
date: Tue Mar 03 19:10:06 2015 -0800
description:
asm: improve algorithm on luma_hps[8xN]
                   Old        New
luma_hps[ 8x8]     978.30     726.92
luma_hps[ 8x4]     795.12     542.97
luma_hps[ 8x16]   1492.22    1121.78
luma_hps[ 8x32]   2590.12    1975.35
Subject: [x265] asm: intra pred dc8 sse2
details: http://hg.videolan.org/x265/rev/05dd8e2b2fda
branches:
changeset: 9615:05dd8e2b2fda
user: David T Yuen <dtyx265 at gmail.com>
date: Fri Feb 27 14:56:56 2015 -0800
description:
asm: intra pred dc8 sse2
This replaces the C code on processors that support SSE2 through SSSE3
The code is backported from intrapred dc8 sse4
./test/TestBench --testbench intrapred | grep 8x8
intra_dc_8x8[f=0] 3.86x 235.11 906.74
intra_dc_8x8[f=1] 2.31x 555.00 1280.00
and supports 32 bit
./test/TestBench --testbench intrapred | grep 8x8
intra_dc_8x8[f=0] 3.86x 235.21 906.81
intra_dc_8x8[f=1] 2.37x 539.99 1279.99
and a white space nit in intrapred.h
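
For readers who do not read SSE2, the arithmetic performed by the dc8 routine in the diff below can be written as a scalar sketch. This is a hypothetical reference written for this note, not the C primitive from the x265 tree; the function name is invented, and the neighbour layout (above samples at srcPix[1..8], left samples at srcPix[17..24]) is an assumption read off the loads in the assembly.

```cpp
#include <cstdint>

// Hypothetical scalar reference for an 8x8 DC predictor on 8-bit pixels.
// Assumed layout: srcPix[1..8] = above neighbours, srcPix[17..24] = left.
void intraPredDC8Sketch(std::uint8_t* dst, std::intptr_t dstStride,
                        const std::uint8_t* srcPix, bool filter)
{
    const int N = 8;
    const std::uint8_t* above = srcPix + 1;
    const std::uint8_t* left  = srcPix + 2 * N + 1;

    // DC value: rounded mean of the 16 neighbouring samples.
    int sum = 0;
    for (int i = 0; i < N; i++)
        sum += above[i] + left[i];
    const int dc = (sum + N) >> 4;

    // Fill the whole block with the DC value.
    for (int y = 0; y < N; y++)
        for (int x = 0; x < N; x++)
            dst[y * dstStride + x] = static_cast<std::uint8_t>(dc);

    if (!filter)
        return;

    // Edge smoothing: the corner blends both neighbours, the rest of the
    // top row and left column blend one neighbour with 3 * DC.
    dst[0] = static_cast<std::uint8_t>((above[0] + left[0] + 2 * dc + 2) >> 2);
    for (int x = 1; x < N; x++)
        dst[x] = static_cast<std::uint8_t>((above[x] + 3 * dc + 2) >> 2);
    for (int y = 1; y < N; y++)
        dst[y * dstStride] = static_cast<std::uint8_t>((left[y] + 3 * dc + 2) >> 2);
}
```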
Subject: [x265] asm: intra pred dc16 sse2
details: http://hg.videolan.org/x265/rev/95867667be8f
branches:
changeset: 9616:95867667be8f
user: David T Yuen <dtyx265 at gmail.com>
date: Mon Mar 02 14:32:14 2015 -0800
description:
asm: intra pred dc16 sse2
This replaces the C code on processors that support SSE2 through SSSE3
The code is backported from intrapred dc16 sse4 high bit
64-bit
./test/TestBench --testbench intrapred | grep 16x16
intra_dc_16x16[f=0] 2.43x 580.09 1412.50
intra_dc_16x16[f=1] 2.36x 1017.66 2400.02
32-bit
./test/TestBench --testbench intrapred | grep 16x16
intra_dc_16x16[f=0] 3.58x 754.99 2705.04
intra_dc_16x16[f=1] 3.00x 1230.08 3687.46
Subject: [x265] asm: intra pred planar4 sse2
details: http://hg.videolan.org/x265/rev/78ee1c3a3457
branches:
changeset: 9617:78ee1c3a3457
user: David T Yuen <dtyx265 at gmail.com>
date: Tue Mar 03 18:40:21 2015 -0800
description:
asm: intra pred planar4 sse2
This replaces the C code on processors that support SSE2 through SSSE3
The code is backported from intrapred planar4 sse4
64-bit
./test/TestBench --testbench intrapred | grep intra_planar_4x4
intra_planar_4x4 1.16x 507.48 587.52
32-bit
./test/TestBench --testbench intrapred | grep intra_planar_4x4
intra_planar_4x4 1.56x 532.49 832.30
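
As a reading aid for the planar4 assembly, here is a scalar sketch of the 4x4 planar interpolation it computes. The function name is hypothetical, and the neighbour layout (above at srcPix[1..4], topRight at srcPix[5], left at srcPix[9..12], bottomLeft at srcPix[13]) is an assumption inferred from the loads and shuffles in the assembly rather than taken from the x265 C source.

```cpp
#include <cstdint>

// Hypothetical scalar reference for 4x4 planar prediction on 8-bit pixels.
// Assumed layout: srcPix[1..4] = above, srcPix[5] = topRight,
//                 srcPix[9..12] = left, srcPix[13] = bottomLeft.
void intraPredPlanar4Sketch(std::uint8_t* dst, std::intptr_t dstStride,
                            const std::uint8_t* srcPix)
{
    const int N = 4;
    const std::uint8_t* above = srcPix + 1;
    const std::uint8_t* left  = srcPix + 2 * N + 1;
    const int topRight   = above[N];
    const int bottomLeft = left[N];

    for (int y = 0; y < N; y++)
    {
        for (int x = 0; x < N; x++)
        {
            // Horizontal and vertical linear blends, summed and rounded.
            const int value = (N - 1 - x) * left[y]  + (x + 1) * topRight
                            + (N - 1 - y) * above[x] + (y + 1) * bottomLeft
                            + N;
            dst[y * dstStride + x] = static_cast<std::uint8_t>(value >> 3);
        }
    }
}
```

The rounding term of 4 and the shift by 3 correspond to the pw_4 addition and the psraw by 3 in the assembly.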
Subject: [x265] asm: intra pred planar4 sse2 high bit
details: http://hg.videolan.org/x265/rev/edc70a895095
branches:
changeset: 9618:edc70a895095
user: David T Yuen <dtyx265 at gmail.com>
date: Tue Mar 03 18:43:49 2015 -0800
description:
asm: intra pred planar4 sse2 high bit
This replaces the C code on processors that support SSE2 through SSSE3
The code is backported from intrapred planar4 sse4 high bit
./test/TestBench --testbench intrapred | grep intra_planar_4x4
intra_planar_4x4 1.31x 434.94 569.95
Subject: [x265] asm: filter_vpp[4x2], filter_vps[4x2]: improve 142c->130c, 126c->121c
details: http://hg.videolan.org/x265/rev/7ed370850e36
branches:
changeset: 9619:7ed370850e36
user: Divya Manivannan <divya at multicorewareinc.com>
date: Wed Mar 04 11:49:20 2015 +0530
description:
asm: filter_vpp[4x2], filter_vps[4x2]: improve 142c->130c, 126c->121c
Subject: [x265] asm: filter_vpp[8x6], filter_vps[8x6]: improve 277c->226c, 264c->217c
details: http://hg.videolan.org/x265/rev/ea9bdb10353f
branches:
changeset: 9620:ea9bdb10353f
user: Divya Manivannan <divya at multicorewareinc.com>
date: Wed Mar 04 13:20:55 2015 +0530
description:
asm: filter_vpp[8x6], filter_vps[8x6]: improve 277c->226c, 264c->217c
diffstat:
source/common/x86/asm-primitives.cpp | 10 +
source/common/x86/intrapred.h | 7 +-
source/common/x86/intrapred16.asm | 112 ++++++++++-
source/common/x86/intrapred8.asm | 363 +++++++++++++++++++++++++++++++++-
source/common/x86/ipfilter8.asm | 346 ++++++++++++++++++++++++--------
source/encoder/frameencoder.cpp | 7 +-
source/encoder/frameencoder.h | 1 +
7 files changed, 739 insertions(+), 107 deletions(-)
diffs (truncated from 1038 to 300 lines):
diff -r 7605e562bef6 -r ea9bdb10353f source/common/x86/asm-primitives.cpp
--- a/source/common/x86/asm-primitives.cpp Wed Mar 04 14:40:44 2015 -0600
+++ b/source/common/x86/asm-primitives.cpp Wed Mar 04 13:20:55 2015 +0530
@@ -868,6 +868,8 @@ void setupAssemblyPrimitives(EncoderPrim
ALL_LUMA_TU_S(calcresidual, getResidual, sse2);
ALL_LUMA_TU_S(transpose, transpose, sse2);
+ p.cu[BLOCK_4x4].intra_pred[PLANAR_IDX] = x265_intra_pred_planar4_sse2;
+
p.cu[BLOCK_4x4].sse_ss = x265_pixel_ssd_ss_4x4_mmx2;
ALL_LUMA_CU(sse_ss, pixel_ssd_ss, sse2);
@@ -1205,6 +1207,10 @@ void setupAssemblyPrimitives(EncoderPrim
ALL_LUMA_TU_S(ssd_s, pixel_ssd_s_, sse2);
p.cu[BLOCK_4x4].intra_pred[DC_IDX] = x265_intra_pred_dc4_sse2;
+ p.cu[BLOCK_8x8].intra_pred[DC_IDX] = x265_intra_pred_dc8_sse2;
+ p.cu[BLOCK_16x16].intra_pred[DC_IDX] = x265_intra_pred_dc16_sse2;
+
+ p.cu[BLOCK_4x4].intra_pred[PLANAR_IDX] = x265_intra_pred_planar4_sse2;
p.cu[BLOCK_4x4].calcresidual = x265_getResidual4_sse2;
p.cu[BLOCK_8x8].calcresidual = x265_getResidual8_sse2;
@@ -1623,15 +1629,19 @@ void setupAssemblyPrimitives(EncoderPrim
p.chroma[X265_CSP_I420].pu[CHROMA_420_4x4].filter_vpp = x265_interp_4tap_vert_pp_4x4_avx2;
p.chroma[X265_CSP_I420].pu[CHROMA_420_8x8].filter_vpp = x265_interp_4tap_vert_pp_8x8_avx2;
p.chroma[X265_CSP_I420].pu[CHROMA_420_2x4].filter_vpp = x265_interp_4tap_vert_pp_2x4_avx2;
+ p.chroma[X265_CSP_I420].pu[CHROMA_420_4x2].filter_vpp = x265_interp_4tap_vert_pp_4x2_avx2;
p.chroma[X265_CSP_I420].pu[CHROMA_420_4x8].filter_vpp = x265_interp_4tap_vert_pp_4x8_avx2;
p.chroma[X265_CSP_I420].pu[CHROMA_420_8x4].filter_vpp = x265_interp_4tap_vert_pp_8x4_avx2;
+ p.chroma[X265_CSP_I420].pu[CHROMA_420_8x6].filter_vpp = x265_interp_4tap_vert_pp_8x6_avx2;
p.chroma[X265_CSP_I420].pu[CHROMA_420_8x16].filter_vpp = x265_interp_4tap_vert_pp_8x16_avx2;
p.chroma[X265_CSP_I420].pu[CHROMA_420_16x8].filter_vpp = x265_interp_4tap_vert_pp_16x8_avx2;
p.chroma[X265_CSP_I420].pu[CHROMA_420_2x4].filter_vps = x265_interp_4tap_vert_ps_2x4_avx2;
+ p.chroma[X265_CSP_I420].pu[CHROMA_420_4x2].filter_vps = x265_interp_4tap_vert_ps_4x2_avx2;
p.chroma[X265_CSP_I420].pu[CHROMA_420_4x4].filter_vps = x265_interp_4tap_vert_ps_4x4_avx2;
p.chroma[X265_CSP_I420].pu[CHROMA_420_4x8].filter_vps = x265_interp_4tap_vert_ps_4x8_avx2;
p.chroma[X265_CSP_I420].pu[CHROMA_420_8x4].filter_vps = x265_interp_4tap_vert_ps_8x4_avx2;
+ p.chroma[X265_CSP_I420].pu[CHROMA_420_8x6].filter_vps = x265_interp_4tap_vert_ps_8x6_avx2;
p.chroma[X265_CSP_I420].pu[CHROMA_420_8x8].filter_vps = x265_interp_4tap_vert_ps_8x8_avx2;
p.chroma[X265_CSP_I420].pu[CHROMA_420_8x16].filter_vps = x265_interp_4tap_vert_ps_8x16_avx2;
p.chroma[X265_CSP_I420].pu[CHROMA_420_16x8].filter_vps = x265_interp_4tap_vert_ps_16x8_avx2;
diff -r 7605e562bef6 -r ea9bdb10353f source/common/x86/intrapred.h
--- a/source/common/x86/intrapred.h Wed Mar 04 14:40:44 2015 -0600
+++ b/source/common/x86/intrapred.h Wed Mar 04 13:20:55 2015 +0530
@@ -26,12 +26,15 @@
#ifndef X265_INTRAPRED_H
#define X265_INTRAPRED_H
-void x265_intra_pred_dc4_sse2 (pixel* dst, intptr_t dstStride, const pixel*srcPix, int, int filter);
-void x265_intra_pred_dc4_sse4 (pixel* dst, intptr_t dstStride, const pixel*srcPix, int, int filter);
+void x265_intra_pred_dc4_sse2(pixel* dst, intptr_t dstStride, const pixel*srcPix, int, int filter);
+void x265_intra_pred_dc8_sse2(pixel* dst, intptr_t dstStride, const pixel*srcPix, int, int filter);
+void x265_intra_pred_dc16_sse2(pixel* dst, intptr_t dstStride, const pixel*srcPix, int, int filter);
+void x265_intra_pred_dc4_sse4(pixel* dst, intptr_t dstStride, const pixel*srcPix, int, int filter);
void x265_intra_pred_dc8_sse4(pixel* dst, intptr_t dstStride, const pixel* srcPix, int, int filter);
void x265_intra_pred_dc16_sse4(pixel* dst, intptr_t dstStride, const pixel* srcPix, int, int filter);
void x265_intra_pred_dc32_sse4(pixel* dst, intptr_t dstStride, const pixel* srcPix, int, int filter);
+void x265_intra_pred_planar4_sse2(pixel* dst, intptr_t dstStride, const pixel* srcPix, int, int);
void x265_intra_pred_planar4_sse4(pixel* dst, intptr_t dstStride, const pixel* srcPix, int, int);
void x265_intra_pred_planar8_sse4(pixel* dst, intptr_t dstStride, const pixel* srcPix, int, int);
void x265_intra_pred_planar16_sse4(pixel* dst, intptr_t dstStride, const pixel* srcPix, int, int);
diff -r 7605e562bef6 -r ea9bdb10353f source/common/x86/intrapred16.asm
--- a/source/common/x86/intrapred16.asm Wed Mar 04 14:40:44 2015 -0600
+++ b/source/common/x86/intrapred16.asm Wed Mar 04 13:20:55 2015 +0530
@@ -160,6 +160,116 @@ cglobal intra_pred_dc4, 5,6,2
.end:
RET
+;-------------------------------------------------------------------------------------------
+; void intra_pred_dc(pixel* above, pixel* left, pixel* dst, intptr_t dstStride, int filter)
+;-------------------------------------------------------------------------------------------
+INIT_XMM sse2
+cglobal intra_pred_dc32, 3, 4, 6
+ lea r3, [r2 + 130] ;130 = 32*sizeof(pixel)*2 + 1*sizeof(pixel)
+ add r2, 2
+ add r1, r1
+ movu m0, [r3]
+ movu m1, [r3 + 16]
+ movu m2, [r3 + 32]
+ movu m3, [r3 + 48]
+ paddw m0, m1
+ paddw m2, m3
+ paddw m0, m2
+ movu m1, [r2]
+ movu m3, [r2 + 16]
+ movu m4, [r2 + 32]
+ movu m5, [r2 + 48]
+ paddw m1, m3
+ paddw m4, m5
+ paddw m1, m4
+ paddw m0, m1
+ movhlps m1, m0
+ paddw m0, m1
+ pshuflw m1, m0, 0x6E
+ paddw m0, m1
+ pmaddwd m0, [pw_1]
+
+ paddd m0, [pd_32] ; sum = sum + 32
+ psrld m0, 6 ; sum = sum / 64
+ pshuflw m0, m0, 0
+ pshufd m0, m0, 0
+
+ lea r2, [r1 * 3]
+ ; store DC 32x32
+%assign x 1
+%rep 8
+ movu [r0 + 0], m0
+ movu [r0 + 16], m0
+ movu [r0 + 32], m0
+ movu [r0 + 48], m0
+ movu [r0 + r1 + 0], m0
+ movu [r0 + r1 + 16], m0
+ movu [r0 + r1 + 32], m0
+ movu [r0 + r1 + 48], m0
+ movu [r0 + r1 * 2 + 0], m0
+ movu [r0 + r1 * 2 + 16], m0
+ movu [r0 + r1 * 2 + 32], m0
+ movu [r0 + r1 * 2 + 48], m0
+ movu [r0 + r2 + 0], m0
+ movu [r0 + r2 + 16], m0
+ movu [r0 + r2 + 32], m0
+ movu [r0 + r2 + 48], m0
+ %if x < 8
+ lea r0, [r0 + r1 * 4]
+ %endif
+%assign x x + 1
+%endrep
+ RET
+
+;---------------------------------------------------------------------------------------
+; void intra_pred_planar(pixel* dst, intptr_t dstStride, pixel*srcPix, int, int filter)
+;---------------------------------------------------------------------------------------
+INIT_XMM sse2
+cglobal intra_pred_planar4, 3,3,5
+ movu m1, [r2 + 2]
+ movu m2, [r2 + 18]
+ pshufhw m3, m1, 0 ; topRight
+ pshufd m3, m3, 0xAA
+ pshufhw m4, m2, 0 ; bottomLeft
+ pshufd m4, m4, 0xAA
+
+ pmullw m3, [multi_2Row] ; (x + 1) * topRight
+ pmullw m0, m1, [pw_planar4_1] ; (blkSize - 1 - y) * above[x]
+
+ paddw m3, [pw_4]
+ paddw m3, m4
+ paddw m3, m0
+ psubw m4, m1
+
+ pshuflw m1, m2, 0
+ pmullw m1, [pw_planar4_0]
+ paddw m1, m3
+ paddw m3, m4
+ psraw m1, 3
+ movh [r0], m1
+
+ pshuflw m1, m2, 01010101b
+ pmullw m1, [pw_planar4_0]
+ paddw m1, m3
+ paddw m3, m4
+ psraw m1, 3
+ movh [r0 + r1 * 2], m1
+ lea r0, [r0 + 4 * r1]
+
+ pshuflw m1, m2, 10101010b
+ pmullw m1, [pw_planar4_0]
+ paddw m1, m3
+ paddw m3, m4
+ psraw m1, 3
+ movh [r0], m1
+
+ pshuflw m1, m2, 11111111b
+ pmullw m1, [pw_planar4_0]
+ paddw m1, m3
+ psraw m1, 3
+ movh [r0 + r1 * 2], m1
+ RET
+
;-----------------------------------------------------------------------------------
; void intra_pred_dc(pixel* dst, intptr_t dstStride, pixel* above, int, int filter)
;-----------------------------------------------------------------------------------
@@ -378,7 +488,7 @@ cglobal intra_pred_dc16, 5, 7, 4
;-------------------------------------------------------------------------------------------
INIT_XMM sse4
cglobal intra_pred_dc32, 3, 5, 6
- lea r3, [r2 + 130]
+ lea r3, [r2 + 130] ;130 = 32*sizeof(pixel)*2 + 1*sizeof(pixel)
add r2, 2
add r1, r1
movu m0, [r3]
diff -r 7605e562bef6 -r ea9bdb10353f source/common/x86/intrapred8.asm
--- a/source/common/x86/intrapred8.asm Wed Mar 04 14:40:44 2015 -0600
+++ b/source/common/x86/intrapred8.asm Wed Mar 04 13:20:55 2015 +0530
@@ -117,12 +117,14 @@ const ang_table
SECTION .text
+cextern pw_2
cextern pw_4
cextern pw_8
cextern pw_16
cextern pw_32
cextern pw_257
cextern pw_1024
+cextern pw_4096
cextern pb_unpackbd1
cextern multiL
cextern multiH
@@ -207,6 +209,292 @@ cglobal intra_pred_dc4, 5,5,3
;---------------------------------------------------------------------------------------------
; void intra_pred_dc(pixel* dst, intptr_t dstStride, pixel *srcPix, int dirMode, int bFilter)
;---------------------------------------------------------------------------------------------
+INIT_XMM sse2
+cglobal intra_pred_dc8, 5, 7, 3
+ pxor m0, m0
+ movh m1, [r2 + 1]
+ movh m2, [r2 + 17]
+ punpcklqdq m1, m2
+ psadbw m1, m0
+ pshufd m2, m1, 2
+ paddw m1, m2
+
+ paddw m1, [pw_8]
+ psraw m1, 4
+ pmullw m1, [pw_257]
+ pshuflw m1, m1, 0x00 ; m1 = byte [dc_val ...]
+
+ test r4d, r4d
+
+ ; store DC 8x8
+ lea r6, [r1 + r1 * 2]
+ lea r5, [r6 + r1 * 2]
+ movh [r0], m1
+ movh [r0 + r1], m1
+ movh [r0 + r1 * 2], m1
+ movh [r0 + r6], m1
+ movh [r0 + r1 * 4], m1
+ movh [r0 + r5], m1
+ movh [r0 + r6 * 2], m1
+ lea r5, [r5 + r1 * 2]
+ movh [r0 + r5], m1
+
+ ; Do DC Filter
+ jz .end
+ psrlw m1, 8
+ movq m2, [pw_2]
+ pmullw m2, m1
+ paddw m2, [pw_2]
+ movd r4d, m2 ; r4d = DC * 2 + 2
+ paddw m1, m2 ; m1 = DC * 3 + 2
+ pshufd m1, m1, 0
+
+ ; filter top
+ movq m2, [r2 + 1]
+ punpcklbw m2, m0
+ paddw m2, m1
+ psraw m2, 2 ; sum = sum / 16
+ packuswb m2, m2
+ movh [r0], m2
+
+ ; filter top-left
+ movzx r3d, byte [r2 + 17]
+ add r4d, r3d
+ movzx r3d, byte [r2 + 1]
+ add r3d, r4d
+ shr r3d, 2
+ mov [r0], r3b
+
+ ; filter left
+ movq m2, [r2 + 18]
+ punpcklbw m2, m0
+ paddw m2, m1
+ psraw m2, 2
+ packuswb m2, m2
+ movd r2d, m2
+ lea r0, [r0 + r1]
+ lea r5, [r6 + r1 * 2]
+ mov [r0], r2b
+ shr r2, 8
+ mov [r0 + r1], r2b
+ shr r2, 8
+ mov [r0 + r1 * 2], r2b
+ shr r2, 8
+ mov [r0 + r6], r2b
+ pshufd m2, m2, 0x01
+ movd r2d, m2
+ mov [r0 + r1 * 4], r2b
+ shr r2, 8
+ mov [r0 + r5], r2b
+ shr r2, 8
+ mov [r0 + r6 * 2], r2b
+
+.end:
+ RET
+
+;--------------------------------------------------------------------------------------------
+; void intra_pred_dc(pixel* dst, intptr_t dstStride, pixel *srcPix, int dirMode, int bFilter)