[x265-commits] [x265] frameencoder: avoid race hazard in end of frame logic

Steve Borho steve at borho.org
Thu Mar 5 04:36:07 CET 2015


details:   http://hg.videolan.org/x265/rev/7b5d44d7831e
branches:  
changeset: 9612:7b5d44d7831e
user:      Steve Borho <steve at borho.org>
date:      Wed Mar 04 17:15:36 2015 -0600
description:
frameencoder: avoid race hazard in end of frame logic

This bug has been latent in the encoder for about 6 months, so it deserves an
obituary (somewhat presumptuously assuming that this commit actually kills it).
The bug worked this way:

1. The user starts an encode of a small-resolution video at the superfast
   preset, generating frame rates of about 200-300 fps depending on the
   hardware.

2. A worker thread finishes compressing all of the CTUs in its row, but then
   gets evicted from its core before it is allowed to flush its row bitstream
   object (leaving four bytes within the Entropy object unwritten).

3. The other threads quickly finish the rest of the frame, and the worker which
   compressed the last row signals frame completion.

4. The frame encoder thread concatenates the row bitstreams together and emits
   the NAL. If you are lucky, the evicted row was only 4 bytes long, so its
   unflushed bitstream has a length of 0 bytes, which triggers a check failure
   indicating that something is seriously wrong. If you are unlucky, the NAL is
   garbage and does evil things to your decoder.

5. The evicted worker finally gets back on the core and flushes its bitstream,
   to the aid of no one.

With this commit, we note when the last row has been compressed (indicating all
CTUs in the frame are compressed) but we do not signal that the frame is
complete until the last worker thread leaves processRow() (taking advantage of
the recently introduced atomic count of workers).
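
The deferred signal described above can be sketched in a few lines of C++. This
is only an illustration of the idea, not the code from this changeset: the class
FrameCompletion and all of its members are hypothetical names, and the sketch
assumes that every worker calls leaveRow() as its very last action, after it has
flushed its row bitstream.

```cpp
#include <atomic>
#include <condition_variable>
#include <mutex>

// Hypothetical stand-in for the frame encoder's end-of-frame bookkeeping.
// None of these names come from the x265 sources.
class FrameCompletion
{
public:
    // A worker calls this when it enters its (hypothetical) processRow().
    void enterRow() { m_activeWorkers.fetch_add(1); }

    // A worker calls this as its very last action in processRow(), i.e.
    // after it has flushed its row bitstream.
    void leaveRow(bool compressedLastRow)
    {
        if (compressedLastRow)
            m_lastRowDone.store(true); // note it, but do not signal yet

        // fetch_sub returns the previous value: 1 means this worker is the
        // last one still inside processRow().
        if (m_activeWorkers.fetch_sub(1) == 1 && m_lastRowDone.load())
        {
            std::lock_guard<std::mutex> guard(m_lock);
            m_complete = true;
            m_cond.notify_all();
        }
    }

    // The frame encoder thread blocks here before concatenating the rows.
    void waitForCompletion()
    {
        std::unique_lock<std::mutex> guard(m_lock);
        m_cond.wait(guard, [this] { return m_complete; });
    }

    bool isComplete()
    {
        std::lock_guard<std::mutex> guard(m_lock);
        return m_complete;
    }

private:
    std::atomic<int>        m_activeWorkers{0};   // workers inside processRow()
    std::atomic<bool>       m_lastRowDone{false}; // all CTUs of the frame compressed
    std::mutex              m_lock;
    std::condition_variable m_cond;
    bool                    m_complete = false;   // guarded by m_lock
};
```

With this arrangement, a worker that compressed the last row while another
worker is still waiting to flush only records the fact; the completion signal is
raised by whichever worker leaves last.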

Subject: [x265] asm: improve intra_pred_dc4_sse4 by merge reduce code

details:   http://hg.videolan.org/x265/rev/0cb6948f9d3c
branches:  
changeset: 9613:0cb6948f9d3c
user:      Min Chen <chenm003 at 163.com>
date:      Mon Mar 02 18:32:36 2015 -0800
description:
asm: improve intra_pred_dc4_sse4 by merge reduce code

Subject: [x265] asm: improve algorithm on luma_hps[8xN]

details:   http://hg.videolan.org/x265/rev/ae36726c875c
branches:  
changeset: 9614:ae36726c875c
user:      Min Chen <chenm003 at 163.com>
date:      Tue Mar 03 19:10:06 2015 -0800
description:
asm: improve algorithm on luma_hps[8xN]
                   Old       New
luma_hps[  8x8]    978.30    726.92
luma_hps[  8x4]    795.12    542.97
luma_hps[ 8x16]   1492.22   1121.78
luma_hps[ 8x32]   2590.12   1975.35

Subject: [x265] asm: intra pred dc8 sse2

details:   http://hg.videolan.org/x265/rev/05dd8e2b2fda
branches:  
changeset: 9615:05dd8e2b2fda
user:      David T Yuen <dtyx265 at gmail.com>
date:      Fri Feb 27 14:56:56 2015 -0800
description:
asm: intra pred dc8 sse2

This replaces the C code on processors that support only SSE2 through SSSE3.
The code is backported from intrapred dc8 sse4.

./test/TestBench --testbench intrapred | grep 8x8
intra_dc_8x8[f=0]	3.86x 	 235.11   	 906.74
intra_dc_8x8[f=1]	2.31x 	 555.00   	 1280.00

It also supports 32-bit:

./test/TestBench --testbench intrapred | grep 8x8
intra_dc_8x8[f=0]	3.86x 	 235.21   	 906.81
intra_dc_8x8[f=1]	2.37x 	 539.99   	 1279.99

This also fixes a whitespace nit in intrapred.h.
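
For readers who do not want to trace the assembly, the following scalar C++
sketch shows what an 8x8 DC predictor with edge filtering computes. It is an
illustrative reference, not the C primitive from the x265 tree: the function
name intra_pred_dc8_ref is hypothetical, pixel is assumed to be 8 bits wide, and
the neighbour layout (above samples at srcPix[1..8], left samples at
srcPix[17..24]) is inferred from the offsets used in the assembly in this diff.

```cpp
#include <stdint.h>

typedef uint8_t pixel; // assumption: 8-bit build

// Scalar sketch of 8x8 DC intra prediction (hypothetical helper).
// Assumed layout: srcPix[1..8] = above row, srcPix[17..24] = left column.
void intra_pred_dc8_ref(pixel* dst, intptr_t dstStride, const pixel* srcPix,
                        int /*dirMode*/, int bFilter)
{
    const int N = 8;
    const pixel* above = srcPix + 1;
    const pixel* left  = srcPix + 2 * N + 1;

    // DC value: rounded mean of the 16 neighbouring samples.
    int sum = 0;
    for (int i = 0; i < N; i++)
        sum += above[i] + left[i];
    const int dc = (sum + N) >> 4;

    // Fill the whole block with the DC value.
    for (int y = 0; y < N; y++)
        for (int x = 0; x < N; x++)
            dst[y * dstStride + x] = (pixel)dc;

    if (bFilter)
    {
        // Corner sample blends both neighbours with 2 * DC.
        dst[0] = (pixel)((above[0] + left[0] + 2 * dc + 2) >> 2);

        // Remaining top row and left column blend one neighbour with 3 * DC.
        for (int x = 1; x < N; x++)
            dst[x] = (pixel)((above[x] + 3 * dc + 2) >> 2);
        for (int y = 1; y < N; y++)
            dst[y * dstStride] = (pixel)((left[y] + 3 * dc + 2) >> 2);
    }
}
```

The filtered path corresponds to the "DC * 2 + 2" and "DC * 3 + 2" constants
that the assembly builds before filtering the top row, the top-left sample, and
the left column.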

Subject: [x265] asm: intra pred dc16 sse2

details:   http://hg.videolan.org/x265/rev/95867667be8f
branches:  
changeset: 9616:95867667be8f
user:      David T Yuen <dtyx265 at gmail.com>
date:      Mon Mar 02 14:32:14 2015 -0800
description:
asm: intra pred dc16 sse2

This replaces the C code on processors that support only SSE2 through SSSE3.
The code is backported from intrapred dc16 sse4 high bit.

64-bit

./test/TestBench --testbench intrapred | grep 16x16
intra_dc_16x16[f=0]	2.43x 	 580.09   	 1412.50
intra_dc_16x16[f=1]	2.36x 	 1017.66  	 2400.02

32-bit

./test/TestBench --testbench intrapred | grep 16x16
intra_dc_16x16[f=0]	3.58x 	 754.99   	 2705.04
intra_dc_16x16[f=1]	3.00x 	 1230.08  	 3687.46

Subject: [x265] asm: intra pred planar4 sse2

details:   http://hg.videolan.org/x265/rev/78ee1c3a3457
branches:  
changeset: 9617:78ee1c3a3457
user:      David T Yuen <dtyx265 at gmail.com>
date:      Tue Mar 03 18:40:21 2015 -0800
description:
asm: intra pred planar4 sse2

This replaces the C code on processors that support only SSE2 through SSSE3.
The code is backported from intrapred planar4 sse4.

64-bit

./test/TestBench --testbench intrapred | grep intra_planar_4x4
intra_planar_4x4	1.16x 	 507.48   	 587.52

32-bit

./test/TestBench --testbench intrapred | grep intra_planar_4x4
intra_planar_4x4	1.56x 	 532.49   	 832.30
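
As a point of comparison for the assembly, here is a scalar C++ sketch of 4x4
planar prediction. It is an illustration under stated assumptions rather than
the C primitive this patch replaces: intra_pred_planar4_ref is a hypothetical
name, pixel is assumed to be 8 bits wide, and the neighbour layout (above
samples starting at srcPix[1], left samples starting at srcPix[9]) is inferred
from the load offsets in the high bit depth assembly in this diff.

```cpp
#include <stdint.h>

typedef uint8_t pixel; // assumption: 8-bit build

// Scalar sketch of 4x4 planar intra prediction (hypothetical helper).
// Assumed layout: srcPix[1..5] = above row including topRight,
//                 srcPix[9..13] = left column including bottomLeft.
void intra_pred_planar4_ref(pixel* dst, intptr_t dstStride, const pixel* srcPix,
                            int /*dirMode*/, int /*bFilter*/)
{
    const int N = 4;
    const pixel* above = srcPix + 1;
    const pixel* left  = srcPix + 2 * N + 1;
    const int topRight   = above[N];
    const int bottomLeft = left[N];

    for (int y = 0; y < N; y++)
    {
        for (int x = 0; x < N; x++)
        {
            // Horizontal and vertical linear interpolations, summed and
            // rounded; the shift is log2(N) + 1 = 3 for a 4x4 block.
            const int hor = (N - 1 - x) * left[y]  + (x + 1) * topRight;
            const int ver = (N - 1 - y) * above[x] + (y + 1) * bottomLeft;
            dst[y * dstStride + x] = (pixel)((hor + ver + N) >> 3);
        }
    }
}
```

The terms map onto the comments in the assembly: "(x + 1) * topRight" and
"(blkSize - 1 - y) * above[x]" are computed once, and the per-row update adds
(bottomLeft - above[x]) when moving down one row.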

Subject: [x265] asm: intra pred planar4 sse2 high bit

details:   http://hg.videolan.org/x265/rev/edc70a895095
branches:  
changeset: 9618:edc70a895095
user:      David T Yuen <dtyx265 at gmail.com>
date:      Tue Mar 03 18:43:49 2015 -0800
description:
asm: intra pred planar4 sse2 high bit

This replaces the C code on processors that support only SSE2 through SSSE3.
The code is backported from intrapred planar4 sse4 high bit.

./test/TestBench --testbench intrapred | grep intra_planar_4x4
intra_planar_4x4	1.31x 	 434.94   	 569.95

Subject: [x265] asm: filter_vpp[4x2], filter_vps[4x2]: improve 142c->130c, 126c->121c

details:   http://hg.videolan.org/x265/rev/7ed370850e36
branches:  
changeset: 9619:7ed370850e36
user:      Divya Manivannan <divya at multicorewareinc.com>
date:      Wed Mar 04 11:49:20 2015 +0530
description:
asm: filter_vpp[4x2], filter_vps[4x2]: improve 142c->130c, 126c->121c

Subject: [x265] asm: filter_vpp[8x6], filter_vps[8x6]: improve 277c->226c, 264c->217c

details:   http://hg.videolan.org/x265/rev/ea9bdb10353f
branches:  
changeset: 9620:ea9bdb10353f
user:      Divya Manivannan <divya at multicorewareinc.com>
date:      Wed Mar 04 13:20:55 2015 +0530
description:
asm: filter_vpp[8x6], filter_vps[8x6]: improve 277c->226c, 264c->217c

diffstat:

 source/common/x86/asm-primitives.cpp |   10 +
 source/common/x86/intrapred.h        |    7 +-
 source/common/x86/intrapred16.asm    |  112 ++++++++++-
 source/common/x86/intrapred8.asm     |  363 +++++++++++++++++++++++++++++++++-
 source/common/x86/ipfilter8.asm      |  346 ++++++++++++++++++++++++--------
 source/encoder/frameencoder.cpp      |    7 +-
 source/encoder/frameencoder.h        |    1 +
 7 files changed, 739 insertions(+), 107 deletions(-)

diffs (truncated from 1038 to 300 lines):

diff -r 7605e562bef6 -r ea9bdb10353f source/common/x86/asm-primitives.cpp
--- a/source/common/x86/asm-primitives.cpp	Wed Mar 04 14:40:44 2015 -0600
+++ b/source/common/x86/asm-primitives.cpp	Wed Mar 04 13:20:55 2015 +0530
@@ -868,6 +868,8 @@ void setupAssemblyPrimitives(EncoderPrim
         ALL_LUMA_TU_S(calcresidual, getResidual, sse2);
         ALL_LUMA_TU_S(transpose, transpose, sse2);
 
+        p.cu[BLOCK_4x4].intra_pred[PLANAR_IDX] = x265_intra_pred_planar4_sse2;
+
         p.cu[BLOCK_4x4].sse_ss = x265_pixel_ssd_ss_4x4_mmx2;
         ALL_LUMA_CU(sse_ss, pixel_ssd_ss, sse2);
 
@@ -1205,6 +1207,10 @@ void setupAssemblyPrimitives(EncoderPrim
         ALL_LUMA_TU_S(ssd_s, pixel_ssd_s_, sse2);
 
         p.cu[BLOCK_4x4].intra_pred[DC_IDX] = x265_intra_pred_dc4_sse2;
+        p.cu[BLOCK_8x8].intra_pred[DC_IDX] = x265_intra_pred_dc8_sse2;
+        p.cu[BLOCK_16x16].intra_pred[DC_IDX] = x265_intra_pred_dc16_sse2;
+
+        p.cu[BLOCK_4x4].intra_pred[PLANAR_IDX] = x265_intra_pred_planar4_sse2;
 
         p.cu[BLOCK_4x4].calcresidual = x265_getResidual4_sse2;
         p.cu[BLOCK_8x8].calcresidual = x265_getResidual8_sse2;
@@ -1623,15 +1629,19 @@ void setupAssemblyPrimitives(EncoderPrim
         p.chroma[X265_CSP_I420].pu[CHROMA_420_4x4].filter_vpp = x265_interp_4tap_vert_pp_4x4_avx2;
         p.chroma[X265_CSP_I420].pu[CHROMA_420_8x8].filter_vpp = x265_interp_4tap_vert_pp_8x8_avx2;
         p.chroma[X265_CSP_I420].pu[CHROMA_420_2x4].filter_vpp = x265_interp_4tap_vert_pp_2x4_avx2;
+        p.chroma[X265_CSP_I420].pu[CHROMA_420_4x2].filter_vpp = x265_interp_4tap_vert_pp_4x2_avx2;
         p.chroma[X265_CSP_I420].pu[CHROMA_420_4x8].filter_vpp = x265_interp_4tap_vert_pp_4x8_avx2;
         p.chroma[X265_CSP_I420].pu[CHROMA_420_8x4].filter_vpp = x265_interp_4tap_vert_pp_8x4_avx2;
+        p.chroma[X265_CSP_I420].pu[CHROMA_420_8x6].filter_vpp = x265_interp_4tap_vert_pp_8x6_avx2;
         p.chroma[X265_CSP_I420].pu[CHROMA_420_8x16].filter_vpp = x265_interp_4tap_vert_pp_8x16_avx2;
         p.chroma[X265_CSP_I420].pu[CHROMA_420_16x8].filter_vpp = x265_interp_4tap_vert_pp_16x8_avx2;
 
         p.chroma[X265_CSP_I420].pu[CHROMA_420_2x4].filter_vps = x265_interp_4tap_vert_ps_2x4_avx2;
+        p.chroma[X265_CSP_I420].pu[CHROMA_420_4x2].filter_vps = x265_interp_4tap_vert_ps_4x2_avx2;
         p.chroma[X265_CSP_I420].pu[CHROMA_420_4x4].filter_vps = x265_interp_4tap_vert_ps_4x4_avx2;
         p.chroma[X265_CSP_I420].pu[CHROMA_420_4x8].filter_vps = x265_interp_4tap_vert_ps_4x8_avx2;
         p.chroma[X265_CSP_I420].pu[CHROMA_420_8x4].filter_vps = x265_interp_4tap_vert_ps_8x4_avx2;
+        p.chroma[X265_CSP_I420].pu[CHROMA_420_8x6].filter_vps = x265_interp_4tap_vert_ps_8x6_avx2;
         p.chroma[X265_CSP_I420].pu[CHROMA_420_8x8].filter_vps = x265_interp_4tap_vert_ps_8x8_avx2;
         p.chroma[X265_CSP_I420].pu[CHROMA_420_8x16].filter_vps = x265_interp_4tap_vert_ps_8x16_avx2;
         p.chroma[X265_CSP_I420].pu[CHROMA_420_16x8].filter_vps = x265_interp_4tap_vert_ps_16x8_avx2;
diff -r 7605e562bef6 -r ea9bdb10353f source/common/x86/intrapred.h
--- a/source/common/x86/intrapred.h	Wed Mar 04 14:40:44 2015 -0600
+++ b/source/common/x86/intrapred.h	Wed Mar 04 13:20:55 2015 +0530
@@ -26,12 +26,15 @@
 #ifndef X265_INTRAPRED_H
 #define X265_INTRAPRED_H
 
-void x265_intra_pred_dc4_sse2 (pixel* dst, intptr_t dstStride, const pixel*srcPix, int, int filter);
-void x265_intra_pred_dc4_sse4 (pixel* dst, intptr_t dstStride, const pixel*srcPix, int, int filter);
+void x265_intra_pred_dc4_sse2(pixel* dst, intptr_t dstStride, const pixel*srcPix, int, int filter);
+void x265_intra_pred_dc8_sse2(pixel* dst, intptr_t dstStride, const pixel*srcPix, int, int filter);
+void x265_intra_pred_dc16_sse2(pixel* dst, intptr_t dstStride, const pixel*srcPix, int, int filter);
+void x265_intra_pred_dc4_sse4(pixel* dst, intptr_t dstStride, const pixel*srcPix, int, int filter);
 void x265_intra_pred_dc8_sse4(pixel* dst, intptr_t dstStride, const pixel* srcPix, int, int filter);
 void x265_intra_pred_dc16_sse4(pixel* dst, intptr_t dstStride, const pixel* srcPix, int, int filter);
 void x265_intra_pred_dc32_sse4(pixel* dst, intptr_t dstStride, const pixel* srcPix, int, int filter);
 
+void x265_intra_pred_planar4_sse2(pixel* dst, intptr_t dstStride, const pixel* srcPix, int, int);
 void x265_intra_pred_planar4_sse4(pixel* dst, intptr_t dstStride, const pixel* srcPix, int, int);
 void x265_intra_pred_planar8_sse4(pixel* dst, intptr_t dstStride, const pixel* srcPix, int, int);
 void x265_intra_pred_planar16_sse4(pixel* dst, intptr_t dstStride, const pixel* srcPix, int, int);
diff -r 7605e562bef6 -r ea9bdb10353f source/common/x86/intrapred16.asm
--- a/source/common/x86/intrapred16.asm	Wed Mar 04 14:40:44 2015 -0600
+++ b/source/common/x86/intrapred16.asm	Wed Mar 04 13:20:55 2015 +0530
@@ -160,6 +160,116 @@ cglobal intra_pred_dc4, 5,6,2
 .end:
     RET
 
+;-------------------------------------------------------------------------------------------
+; void intra_pred_dc(pixel* above, pixel* left, pixel* dst, intptr_t dstStride, int filter)
+;-------------------------------------------------------------------------------------------
+INIT_XMM sse2
+cglobal intra_pred_dc32, 3, 4, 6
+    lea             r3,                  [r2 + 130]     ;130 = 32*sizeof(pixel)*2 + 1*sizeof(pixel)
+    add             r2,                  2
+    add             r1,                  r1
+    movu            m0,                  [r3]
+    movu            m1,                  [r3 + 16]
+    movu            m2,                  [r3 + 32]
+    movu            m3,                  [r3 + 48]
+    paddw           m0,                  m1
+    paddw           m2,                  m3
+    paddw           m0,                  m2
+    movu            m1,                  [r2]
+    movu            m3,                  [r2 + 16]
+    movu            m4,                  [r2 + 32]
+    movu            m5,                  [r2 + 48]
+    paddw           m1,                  m3
+    paddw           m4,                  m5
+    paddw           m1,                  m4
+    paddw           m0,                  m1
+    movhlps         m1,                  m0
+    paddw           m0,                  m1
+    pshuflw         m1,                  m0, 0x6E
+    paddw           m0,                  m1
+    pmaddwd         m0,                  [pw_1]
+
+    paddd           m0,                  [pd_32]     ; sum = sum + 32
+    psrld           m0,                  6           ; sum = sum / 64
+    pshuflw         m0,                  m0, 0
+    pshufd          m0,                  m0, 0
+
+    lea             r2,                 [r1 * 3]
+    ; store DC 32x32
+%assign x 1
+%rep 8
+    movu            [r0 +  0],          m0
+    movu            [r0 + 16],          m0
+    movu            [r0 + 32],          m0
+    movu            [r0 + 48],          m0
+    movu            [r0 + r1 +  0],     m0
+    movu            [r0 + r1 + 16],     m0
+    movu            [r0 + r1 + 32],     m0
+    movu            [r0 + r1 + 48],     m0
+    movu            [r0 + r1 * 2 +  0], m0
+    movu            [r0 + r1 * 2 + 16], m0
+    movu            [r0 + r1 * 2 + 32], m0
+    movu            [r0 + r1 * 2 + 48], m0
+    movu            [r0 + r2 +  0],     m0
+    movu            [r0 + r2 + 16],     m0
+    movu            [r0 + r2 + 32],     m0
+    movu            [r0 + r2 + 48],     m0
+    %if x < 8
+    lea             r0, [r0 + r1 * 4]
+    %endif
+%assign x x + 1
+%endrep
+    RET
+
+;---------------------------------------------------------------------------------------
+; void intra_pred_planar(pixel* dst, intptr_t dstStride, pixel*srcPix, int, int filter)
+;---------------------------------------------------------------------------------------
+INIT_XMM sse2
+cglobal intra_pred_planar4, 3,3,5
+    movu            m1, [r2 + 2]
+    movu            m2, [r2 + 18]
+    pshufhw         m3, m1, 0               ; topRight
+    pshufd          m3, m3, 0xAA
+    pshufhw         m4, m2, 0               ; bottomLeft
+    pshufd          m4, m4, 0xAA
+
+    pmullw          m3, [multi_2Row]        ; (x + 1) * topRight
+    pmullw          m0, m1, [pw_planar4_1]  ; (blkSize - 1 - y) * above[x]
+
+    paddw           m3, [pw_4]
+    paddw           m3, m4
+    paddw           m3, m0
+    psubw           m4, m1
+
+    pshuflw         m1, m2, 0
+    pmullw          m1, [pw_planar4_0]
+    paddw           m1, m3
+    paddw           m3, m4
+    psraw           m1, 3
+    movh            [r0], m1
+
+    pshuflw         m1, m2, 01010101b
+    pmullw          m1, [pw_planar4_0]
+    paddw           m1, m3
+    paddw           m3, m4
+    psraw           m1, 3
+    movh            [r0 + r1 * 2], m1
+    lea             r0, [r0 + 4 * r1]
+
+    pshuflw         m1, m2, 10101010b
+    pmullw          m1, [pw_planar4_0]
+    paddw           m1, m3
+    paddw           m3, m4
+    psraw           m1, 3
+    movh            [r0], m1
+
+    pshuflw         m1, m2, 11111111b
+    pmullw          m1, [pw_planar4_0]
+    paddw           m1, m3
+    psraw           m1, 3
+    movh            [r0 + r1 * 2], m1
+    RET
+
 ;-----------------------------------------------------------------------------------
 ; void intra_pred_dc(pixel* dst, intptr_t dstStride, pixel* above, int, int filter)
 ;-----------------------------------------------------------------------------------
@@ -378,7 +488,7 @@ cglobal intra_pred_dc16, 5, 7, 4
 ;-------------------------------------------------------------------------------------------
 INIT_XMM sse4
 cglobal intra_pred_dc32, 3, 5, 6
-    lea             r3,                  [r2 + 130]
+    lea             r3,                  [r2 + 130]     ;130 = 32*sizeof(pixel)*2 + 1*sizeof(pixel)
     add             r2,                  2
     add             r1,                  r1
     movu            m0,                  [r3]
diff -r 7605e562bef6 -r ea9bdb10353f source/common/x86/intrapred8.asm
--- a/source/common/x86/intrapred8.asm	Wed Mar 04 14:40:44 2015 -0600
+++ b/source/common/x86/intrapred8.asm	Wed Mar 04 13:20:55 2015 +0530
@@ -117,12 +117,14 @@ const ang_table
 
 SECTION .text
 
+cextern pw_2
 cextern pw_4
 cextern pw_8
 cextern pw_16
 cextern pw_32
 cextern pw_257
 cextern pw_1024
+cextern pw_4096
 cextern pb_unpackbd1
 cextern multiL
 cextern multiH
@@ -207,6 +209,292 @@ cglobal intra_pred_dc4, 5,5,3
 ;---------------------------------------------------------------------------------------------
 ; void intra_pred_dc(pixel* dst, intptr_t dstStride, pixel *srcPix, int dirMode, int bFilter)
 ;---------------------------------------------------------------------------------------------
+INIT_XMM sse2
+cglobal intra_pred_dc8, 5, 7, 3
+    pxor            m0,            m0
+    movh            m1,            [r2 + 1]
+    movh            m2,            [r2 + 17]
+    punpcklqdq      m1,            m2
+    psadbw          m1,            m0
+    pshufd          m2,            m1, 2
+    paddw           m1,            m2
+
+    paddw           m1,            [pw_8]
+    psraw           m1,            4
+    pmullw          m1,            [pw_257]
+    pshuflw         m1,            m1, 0x00       ; m1 = byte [dc_val ...]
+
+    test            r4d,           r4d
+
+    ; store DC 8x8
+    lea             r6,            [r1 + r1 * 2]
+    lea             r5,            [r6 + r1 * 2]
+    movh            [r0],          m1
+    movh            [r0 + r1],     m1
+    movh            [r0 + r1 * 2], m1
+    movh            [r0 + r6],     m1
+    movh            [r0 + r1 * 4], m1
+    movh            [r0 + r5],     m1
+    movh            [r0 + r6 * 2], m1
+    lea             r5,            [r5 + r1 * 2]
+    movh            [r0 + r5],     m1
+
+    ; Do DC Filter
+    jz              .end
+    psrlw           m1,            8
+    movq            m2,            [pw_2]
+    pmullw          m2,            m1
+    paddw           m2,            [pw_2]
+    movd            r4d,           m2             ; r4d = DC * 2 + 2
+    paddw           m1,            m2             ; m1 = DC * 3 + 2
+    pshufd          m1,            m1, 0
+
+    ; filter top
+    movq            m2,            [r2 + 1]
+    punpcklbw       m2,            m0
+    paddw           m2,            m1
+    psraw           m2,            2              ; sum = sum / 16
+    packuswb        m2,            m2
+    movh            [r0],          m2
+
+    ; filter top-left
+    movzx           r3d, byte      [r2 + 17]
+    add             r4d,           r3d
+    movzx           r3d, byte      [r2 + 1]
+    add             r3d,           r4d
+    shr             r3d,           2
+    mov             [r0],          r3b
+
+    ; filter left
+    movq            m2,            [r2 + 18]
+    punpcklbw       m2,            m0
+    paddw           m2,            m1
+    psraw           m2,            2
+    packuswb        m2,            m2
+    movd            r2d,           m2
+    lea             r0,            [r0 + r1]
+    lea             r5,            [r6 + r1 * 2]
+    mov             [r0],          r2b
+    shr             r2,            8
+    mov             [r0 + r1],     r2b
+    shr             r2,            8
+    mov             [r0 + r1 * 2], r2b
+    shr             r2,            8
+    mov             [r0 + r6],     r2b
+    pshufd          m2,            m2, 0x01
+    movd            r2d,           m2
+    mov             [r0 + r1 * 4], r2b
+    shr             r2,            8
+    mov             [r0 + r5],     r2b
+    shr             r2,            8
+    mov             [r0 + r6 * 2], r2b
+
+.end:
+    RET
+
+;--------------------------------------------------------------------------------------------
+; void intra_pred_dc(pixel* dst, intptr_t dstStride, pixel *srcPix, int dirMode, int bFilter)

