[x265-commits] [x265] asm: 10bpp sse4 code for saoCuOrgE0, improved 8740c->974c...

Dnyaneshwar G dnyaneshwar at multicorewareinc.com
Wed Jun 24 19:25:11 CEST 2015


details:   http://hg.videolan.org/x265/rev/9192c823687d
branches:  
changeset: 10696:9192c823687d
user:      Dnyaneshwar G <dnyaneshwar at multicorewareinc.com>
date:      Fri Jun 19 16:47:56 2015 +0530
description:
asm: 10bpp sse4 code for saoCuOrgE0, improved 8740c->974c, over C code
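
For readers who want the scalar behaviour the 10bpp SSE4 routine is measured against, here is a rough C++ sketch of a horizontal (class 0) SAO edge offset over two rows. It is inferred from the prototype comment and the asm in the diff further down (two rows, sign of left and right neighbour, offset table lookup, clamp to 0..1023), so treat it as an illustration rather than the project's reference code. The names pixel10, signOf and saoEdge0Sketch are made up for this example.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

// Hypothetical names for illustration only; not taken from the x265 sources.
using pixel10 = uint16_t;

// Returns -1, 0 or +1 depending on the sign of v.
static inline int signOf(int v) { return (v > 0) - (v < 0); }

// Sketch of SAO edge offset, class 0 (horizontal), for 10-bit samples.
// Processes two rows, like the asm: row 0 uses signLeft[0], row 1 uses signLeft[1].
// Note that rec[width] (the right neighbour of the last sample) is read.
void saoEdge0Sketch(pixel10* rec, const int8_t* offsetEo, int width,
                    const int8_t* signLeft, intptr_t stride)
{
    assert(rec && offsetEo && signLeft && width > 0);
    for (int y = 0; y < 2; y++)
    {
        int left = signLeft[y];
        for (int x = 0; x < width; x++)
        {
            // Compare against the unmodified right neighbour.
            int right = signOf(int(rec[x]) - int(rec[x + 1]));
            int edgeType = right + left + 2;   // 0..4, index into offsetEo
            left = -right;                     // becomes signLeft for x + 1
            int v = int(rec[x]) + offsetEo[edgeType];
            rec[x] = pixel10(std::min(std::max(v, 0), 1023)); // 10-bit clamp
        }
        rec += stride;
    }
}
```

The clamp to 1023 corresponds to the pw_1023 constant the asm loads; an 8bpp build would clamp to 255 instead.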
Subject: [x265] asm: 10bpp sse4 code for saoCuOrgE1, improved 5017c->470c, over C code

details:   http://hg.videolan.org/x265/rev/cb483678d472
branches:  
changeset: 10697:cb483678d472
user:      Dnyaneshwar G <dnyaneshwar at multicorewareinc.com>
date:      Mon Jun 22 10:18:14 2015 +0530
description:
asm: 10bpp sse4 code for saoCuOrgE1, improved 5017c->470c, over C code
Subject: [x265] asm: 10bpp sse4 code for saoCuOrgE1_2Rows, improved 10095c->900c, over C code

details:   http://hg.videolan.org/x265/rev/49c09709a288
branches:  
changeset: 10698:49c09709a288
user:      Dnyaneshwar G <dnyaneshwar at multicorewareinc.com>
date:      Mon Jun 22 18:15:40 2015 +0530
description:
asm: 10bpp sse4 code for saoCuOrgE1_2Rows, improved 10095c->900c, over C code
Subject: [x265] asm: 10bpp sse4 code for saoCuOrgE2

details:   http://hg.videolan.org/x265/rev/653b3b2d59a2
branches:  
changeset: 10699:653b3b2d59a2
user:      Dnyaneshwar G <dnyaneshwar at multicorewareinc.com>
date:      Mon Jun 22 14:23:11 2015 +0530
description:
asm: 10bpp sse4 code for saoCuOrgE2

Performance improvement over C (speedup, optimized cycles, C cycles):
SAO_EO_2[0]     6.27x    207.22          1298.92
SAO_EO_2[1]     8.92x    555.20          4949.69
Subject: [x265] asm: 10bpp sse4 code for saoCuOrgE3

details:   http://hg.videolan.org/x265/rev/3ba5f136b20f
branches:  
changeset: 10700:3ba5f136b20f
user:      Dnyaneshwar G <dnyaneshwar at multicorewareinc.com>
date:      Mon Jun 22 16:06:52 2015 +0530
description:
asm: 10bpp sse4 code for saoCuOrgE3

Performance improvement over C (speedup, optimized cycles, C cycles):
SAO_EO_3[0]     4.97x    236.72          1177.29
SAO_EO_3[1]     8.67x    551.14          4778.67
Subject: [x265] asm: 10bpp sse4 code for saoCuOrgB0, improved 173346c->23127c over C code

details:   http://hg.videolan.org/x265/rev/c888d2ea8f14
branches:  
changeset: 10701:c888d2ea8f14
user:      Dnyaneshwar G <dnyaneshwar at multicorewareinc.com>
date:      Mon Jun 22 17:07:45 2015 +0530
description:
asm: 10bpp sse4 code for saoCuOrgB0, improved 173346c->23127c over C code
Subject: [x265] asm: pixelavg_pp[16xN] avx2 code for 10bpp

details:   http://hg.videolan.org/x265/rev/5dc4ce30c40c
branches:  
changeset: 10702:5dc4ce30c40c
user:      Rajesh Paulraj<rajesh at multicorewareinc.com>
date:      Mon Jun 22 19:34:32 2015 +0530
description:
asm: pixelavg_pp[16xN] avx2 code for 10bpp

avx2 (speedup, optimized cycles, C cycles):
avg_pp[ 16x4]  9.60x    140.07          1344.66
avg_pp[ 16x8]  12.90x   200.11          2580.72
avg_pp[16x12]  14.62x   265.30          3878.63
avg_pp[16x16]  15.00x   339.53          5094.42
avg_pp[16x32]  17.80x   578.67          10300.56
avg_pp[16x64]  19.37x   1050.96         20357.99

sse2:
avg_pp[ 16x4]  7.87x    170.18          1339.60
avg_pp[ 16x8]  8.22x    313.15          2575.54
avg_pp[16x12]  9.78x    394.35          3856.47
avg_pp[16x16]  10.41x   486.99          5070.16
avg_pp[16x32]  11.34x   902.48          10236.26
avg_pp[16x64]  11.96x   1686.64         20171.16
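
As a point of comparison for the numbers above, a plain C++ version of a rounding average of two prediction blocks might look like the sketch below. This assumes pixelavg_pp computes (a + b + 1) >> 1 per sample; the function name pixelAvgSketch and its parameter list are invented for the example and are not the x265 primitive signature.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical sketch: rounding average of two W x H blocks of 16-bit samples.
// Assumes the primitive computes (a + b + 1) >> 1; not the x265 signature.
template <int W, int H>
void pixelAvgSketch(uint16_t* dst, intptr_t dstStride,
                    const uint16_t* src0, intptr_t stride0,
                    const uint16_t* src1, intptr_t stride1)
{
    assert(dst && src0 && src1);
    for (int y = 0; y < H; y++)
    {
        for (int x = 0; x < W; x++)
            dst[x] = uint16_t((src0[x] + src1[x] + 1) >> 1);

        // Strides are in samples, not bytes.
        dst += dstStride;
        src0 += stride0;
        src1 += stride1;
    }
}
```

With W fixed at 16 and 10-bit samples stored as 16-bit words, one row is 32 bytes, which is exactly one YMM register; that is presumably why the 16xN sizes map well onto AVX2.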
Subject: [x265] faster algorithm to calculate ctxSet in codeCoeffNxN()

details:   http://hg.videolan.org/x265/rev/593a325e0950
branches:  
changeset: 10703:593a325e0950
user:      Min Chen <chenm003 at 163.com>
date:      Mon Jun 22 17:39:45 2015 -0700
description:
faster algorithm to calculate ctxSet in codeCoeffNxN()
Subject: [x265] reduce shift operator on coeff remain code

details:   http://hg.videolan.org/x265/rev/175c9d1a998b
branches:  
changeset: 10704:175c9d1a998b
user:      Min Chen <chenm003 at 163.com>
date:      Mon Jun 22 17:39:48 2015 -0700
description:
reduce shift operator on coeff remain code
Subject: [x265] asm: AVX2 of SAD_x4[32xN]

details:   http://hg.videolan.org/x265/rev/ff1c35f6a261
branches:  
changeset: 10705:ff1c35f6a261
user:      Min Chen <chenm003 at 163.com>
date:      Mon Jun 22 17:39:51 2015 -0700
description:
asm: AVX2 of SAD_x4[32xN]
AVX (speedup, optimized cycles, C cycles):
  sad_x4[32x32]  36.69x   2843.87         104330.24
  sad_x4[32x16]  35.67x   1547.93         55217.42
  sad_x4[32x24]  34.01x   2161.25         73503.10
  sad_x4[32x64]  38.73x   5122.28         198363.05

AVX2:
  sad_x4[32x32]  41.91x   2379.45         99724.21
  sad_x4[32x16]  35.79x   1395.48         49947.39
  sad_x4[32x24]  39.03x   1890.22         73777.83
  sad_x4[32x64]  39.64x   4997.68         198107.81
Subject: [x265] asm: improve AVX2 sad_x4[32xN] by new faster algorithm

details:   http://hg.videolan.org/x265/rev/3a5cd130f908
branches:  
changeset: 10706:3a5cd130f908
user:      Min Chen <chenm003 at 163.com>
date:      Mon Jun 22 17:39:54 2015 -0700
description:
asm: improve AVX2 sad_x4[32xN] by new faster algorithm
Old (speedup, optimized cycles, C cycles):
  sad_x4[32x32]  41.91x   2379.45         99724.21
  sad_x4[32x16]  35.79x   1395.48         49947.39
  sad_x4[32x24]  39.03x   1890.22         73777.83
  sad_x4[32x64]  39.64x   4997.68         198107.81

New:
  sad_x4[32x32]  60.80x   1672.85         101713.55
  sad_x4[32x16]  50.97x   989.42          50435.25
  sad_x4[32x24]  55.34x   1416.17         78370.77
  sad_x4[32x64]  70.01x   2830.01         198127.63
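
For context, sad_x4 scores one source block against four candidate reference blocks in a single call. A scalar C++ sketch of that idea is shown below, assuming 8-bit samples (sad-a.asm is the 8bpp file in the CMake lists). The name sadX4Sketch and the argument layout are made up for illustration and do not match the x265 primitive prototype.

```cpp
#include <cassert>
#include <cstdint>
#include <cstdlib>

// Hypothetical sketch: sum of absolute differences of one W x H source block
// against four reference blocks that share a common stride.
template <int W, int H>
void sadX4Sketch(const uint8_t* src, intptr_t srcStride,
                 const uint8_t* const refs[4], intptr_t refStride,
                 int32_t res[4])
{
    assert(src && refs && res);
    for (int i = 0; i < 4; i++)
    {
        const uint8_t* s = src;
        const uint8_t* r = refs[i];
        int32_t sum = 0;
        for (int y = 0; y < H; y++)
        {
            for (int x = 0; x < W; x++)
                sum += std::abs(int(s[x]) - int(r[x]));
            s += srcStride;
            r += refStride;
        }
        res[i] = sum;
    }
}
```

A SIMD version can load each source row once and reuse it for all four references, which is one reason the x4 form is preferred over four separate SAD calls.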
Subject: [x265] cmake: further cleanups for high-bit-depth comment and desc string

details:   http://hg.videolan.org/x265/rev/e16b8c5fa3ac
branches:  
changeset: 10707:e16b8c5fa3ac
user:      Steve Borho <steve at borho.org>
date:      Wed Jun 24 10:26:17 2015 -0500
description:
cmake: further cleanups for high-bit-depth comment and desc string
Subject: [x265] threading: fix 32bit multilib with GCC

details:   http://hg.videolan.org/x265/rev/95df7ec3c5e6
branches:  
changeset: 10708:95df7ec3c5e6
user:      Steve Borho <steve at borho.org>
date:      Wed Jun 24 10:31:02 2015 -0500
description:
threading: fix 32bit multilib with GCC
Subject: [x265] param: declare our custom strtok_r file-local to avoid multilib breakage

details:   http://hg.videolan.org/x265/rev/b1af4c36f48a
branches:  
changeset: 10709:b1af4c36f48a
user:      Steve Borho <steve at borho.org>
date:      Wed Jun 24 10:36:15 2015 -0500
description:
param: declare our custom strtok_r file-local to avoid multilib breakage
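
To illustrate what "file-local" buys here: a helper declared static has internal linkage, so two copies of the library linked into one binary (the multilib case) cannot collide on its symbol. Below is a minimal sketch of a reentrant tokenizer declared that way. Only the first two statements mirror what the diff shows; the rest of the body is an assumption, and the name tokenizeSketch is hypothetical (chosen so it does not clash with a libc strtok_r).

```cpp
#include <cassert>
#include <cstring>

// 'static' gives the function internal linkage: the symbol is not exported
// from this translation unit. Body after the first 'if' is an assumption.
static char* tokenizeSketch(char* str, const char* delim, char** nextp)
{
    assert(delim && nextp);
    if (!str)
        str = *nextp;                 // continue from the saved position
    if (!str)
        return nullptr;

    str += std::strspn(str, delim);   // skip leading delimiters
    if (!*str)
    {
        *nextp = str;
        return nullptr;               // no more tokens
    }

    char* end = str + std::strcspn(str, delim);
    if (*end)
        *end++ = '\0';                // terminate token, step past delimiter
    *nextp = end;
    return str;
}
```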

diffstat:

 source/CMakeLists.txt                |    6 +-
 source/common/CMakeLists.txt         |    2 +-
 source/common/param.cpp              |    2 +-
 source/common/threading.cpp          |   14 +-
 source/common/x86/asm-primitives.cpp |   21 +
 source/common/x86/loopfilter.asm     |  444 +++++++++++++++++++++++++++++++++++
 source/common/x86/mc-a.asm           |  139 ++++++++++
 source/common/x86/sad-a.asm          |   99 +++++++
 source/encoder/entropy.cpp           |   12 +-
 source/test/pixelharness.cpp         |   24 +-
 10 files changed, 735 insertions(+), 28 deletions(-)

diffs (truncated from 1063 to 300 lines):

diff -r f1f25aa959fc -r b1af4c36f48a source/CMakeLists.txt
--- a/source/CMakeLists.txt	Tue Jun 23 10:51:33 2015 -0500
+++ b/source/CMakeLists.txt	Wed Jun 24 10:36:15 2015 -0500
@@ -275,14 +275,14 @@ set(EXTRA_LINK_FLAGS "" CACHE STRING "Ex
 mark_as_advanced(EXTRA_LIB EXTRA_LINK_FLAGS)
 
 if(X64)
-    # NOTE: We only officially support 16bit-per-pixel compiles of x265
+    # NOTE: We only officially support high-bit-depth compiles of x265
     # on 64bit architectures. Main10 plus large resolution plus slow
     # preset plus 32bit address space usually means malloc failure.  You
     # can disable this if(X64) check if you desparately need a 32bit
     # build with 10bit/12bit support, but this violates the "shrink wrap
     # license" so to speak.  If it breaks you get to keep both halves.
-    # You will likely need to compile without assembly
-    option(HIGH_BIT_DEPTH "Store pixels as 16bit values" OFF)
+    # You will need to disable assembly manually.
+    option(HIGH_BIT_DEPTH "Store pixel samples as 16bit values (Main10)" OFF)
 endif(X64)
 if(HIGH_BIT_DEPTH)
     add_definitions(-DHIGH_BIT_DEPTH=1)
diff -r f1f25aa959fc -r b1af4c36f48a source/common/CMakeLists.txt
--- a/source/common/CMakeLists.txt	Tue Jun 23 10:51:33 2015 -0500
+++ b/source/common/CMakeLists.txt	Wed Jun 24 10:36:15 2015 -0500
@@ -46,7 +46,7 @@ if(ENABLE_ASSEMBLY)
                mc-a2.asm pixel-util8.asm blockcopy8.asm
                pixeladd8.asm dct8.asm)
     if(HIGH_BIT_DEPTH)
-        set(A_SRCS ${A_SRCS} sad16-a.asm intrapred16.asm ipfilter16.asm)
+        set(A_SRCS ${A_SRCS} sad16-a.asm intrapred16.asm ipfilter16.asm loopfilter.asm)
     else()
         set(A_SRCS ${A_SRCS} sad-a.asm intrapred8.asm intrapred8_allangs.asm ipfilter8.asm loopfilter.asm)
     endif()
diff -r f1f25aa959fc -r b1af4c36f48a source/common/param.cpp
--- a/source/common/param.cpp	Tue Jun 23 10:51:33 2015 -0500
+++ b/source/common/param.cpp	Wed Jun 24 10:36:15 2015 -0500
@@ -52,7 +52,7 @@
  */
 
 #undef strtok_r
-char* strtok_r(char* str, const char* delim, char** nextp)
+static char* strtok_r(char* str, const char* delim, char** nextp)
 {
     if (!str)
         str = *nextp;
diff -r f1f25aa959fc -r b1af4c36f48a source/common/threading.cpp
--- a/source/common/threading.cpp	Tue Jun 23 10:51:33 2015 -0500
+++ b/source/common/threading.cpp	Wed Jun 24 10:36:15 2015 -0500
@@ -21,21 +21,24 @@
  * For more information, contact us at license @ x265.com
  *****************************************************************************/
 
+#include "common.h"
 #include "threading.h"
+#include "cpu.h"
 
 namespace X265_NS {
 // x265 private namespace
 
 #if X265_ARCH_X86 && !defined(X86_64) && ENABLE_ASSEMBLY && defined(__GNUC__)
-extern "C" intptr_t x265_stack_align(void (*func)(), ...);
-#define x265_stack_align(func, ...) x265_stack_align((void (*)())func, __VA_ARGS__)
+extern "C" intptr_t PFX(stack_align)(void (*func)(), ...);
+#define STACK_ALIGN(func, ...) PFX(stack_align)((void (*)())func, __VA_ARGS__)
 #else
-#define x265_stack_align(func, ...) func(__VA_ARGS__)
+#define STACK_ALIGN(func, ...) func(__VA_ARGS__)
 #endif
 
 /* C shim for forced stack alignment */
 static void stackAlignMain(Thread *instance)
 {
+    // defer processing to the virtual function implemented in the derived class
     instance->threadMain();
 }
 
@@ -43,8 +46,7 @@ static void stackAlignMain(Thread *insta
 
 static DWORD WINAPI ThreadShim(Thread *instance)
 {
-    // defer processing to the virtual function implemented in the derived class
-    x265_stack_align(stackAlignMain, instance);
+    STACK_ALIGN(stackAlignMain, instance);
 
     return 0;
 }
@@ -77,7 +79,7 @@ static void *ThreadShim(void *opaque)
     // defer processing to the virtual function implemented in the derived class
     Thread *instance = reinterpret_cast<Thread *>(opaque);
 
-    x265_stack_align(stackAlignMain, instance);
+    STACK_ALIGN(stackAlignMain, instance);
 
     return NULL;
 }
diff -r f1f25aa959fc -r b1af4c36f48a source/common/x86/asm-primitives.cpp
--- a/source/common/x86/asm-primitives.cpp	Tue Jun 23 10:51:33 2015 -0500
+++ b/source/common/x86/asm-primitives.cpp	Wed Jun 24 10:36:15 2015 -0500
@@ -1089,6 +1089,15 @@ void setupAssemblyPrimitives(EncoderPrim
     }
     if (cpuMask & X265_CPU_SSE4)
     {
+        p.saoCuOrgE0 = PFX(saoCuOrgE0_sse4);
+        p.saoCuOrgE1 = PFX(saoCuOrgE1_sse4);
+        p.saoCuOrgE1_2Rows = PFX(saoCuOrgE1_2Rows_sse4);
+        p.saoCuOrgE2[0] = PFX(saoCuOrgE2_sse4);
+        p.saoCuOrgE2[1] = PFX(saoCuOrgE2_sse4);
+        p.saoCuOrgE3[0] = PFX(saoCuOrgE3_sse4);
+        p.saoCuOrgE3[1] = PFX(saoCuOrgE3_sse4);
+        p.saoCuOrgB0 = PFX(saoCuOrgB0_sse4);
+
         LUMA_ADDAVG(sse4);
         CHROMA_420_ADDAVG(sse4);
         CHROMA_422_ADDAVG(sse4);
@@ -1343,6 +1352,13 @@ void setupAssemblyPrimitives(EncoderPrim
         p.cu[BLOCK_32x32].intra_pred[33]    = PFX(intra_pred_ang32_33_avx2);
         p.cu[BLOCK_32x32].intra_pred[34]    = PFX(intra_pred_ang32_2_avx2);
 
+        p.pu[LUMA_16x4].pixelavg_pp = PFX(pixel_avg_16x4_avx2);
+        p.pu[LUMA_16x8].pixelavg_pp = PFX(pixel_avg_16x8_avx2);
+        p.pu[LUMA_16x12].pixelavg_pp = PFX(pixel_avg_16x12_avx2);
+        p.pu[LUMA_16x16].pixelavg_pp = PFX(pixel_avg_16x16_avx2);
+        p.pu[LUMA_16x32].pixelavg_pp = PFX(pixel_avg_16x32_avx2);
+        p.pu[LUMA_16x64].pixelavg_pp = PFX(pixel_avg_16x64_avx2);
+
         p.pu[LUMA_8x4].addAvg   = PFX(addAvg_8x4_avx2);
         p.pu[LUMA_8x8].addAvg   = PFX(addAvg_8x8_avx2);
         p.pu[LUMA_8x16].addAvg  = PFX(addAvg_8x16_avx2);
@@ -2747,6 +2763,11 @@ void setupAssemblyPrimitives(EncoderPrim
         p.pu[LUMA_16x12].sad_x4 = PFX(pixel_sad_x4_16x12_avx2);
         p.pu[LUMA_16x16].sad_x4 = PFX(pixel_sad_x4_16x16_avx2);
         p.pu[LUMA_16x32].sad_x4 = PFX(pixel_sad_x4_16x32_avx2);
+        p.pu[LUMA_32x32].sad_x4 = PFX(pixel_sad_x4_32x32_avx2);
+        p.pu[LUMA_32x16].sad_x4 = PFX(pixel_sad_x4_32x16_avx2);
+        p.pu[LUMA_32x64].sad_x4 = PFX(pixel_sad_x4_32x64_avx2);
+        p.pu[LUMA_32x24].sad_x4 = PFX(pixel_sad_x4_32x24_avx2);
+        p.pu[LUMA_32x8].sad_x4 = PFX(pixel_sad_x4_32x8_avx2);
 
         p.cu[BLOCK_16x16].sse_pp = PFX(pixel_ssd_16x16_avx2);
         p.cu[BLOCK_32x32].sse_pp = PFX(pixel_ssd_32x32_avx2);
diff -r f1f25aa959fc -r b1af4c36f48a source/common/x86/loopfilter.asm
--- a/source/common/x86/loopfilter.asm	Tue Jun 23 10:51:33 2015 -0500
+++ b/source/common/x86/loopfilter.asm	Wed Jun 24 10:36:15 2015 -0500
@@ -38,6 +38,7 @@ cextern pb_1
 cextern pb_128
 cextern pb_2
 cextern pw_2
+cextern pw_1023
 cextern pb_movemask
 
 
@@ -45,6 +46,107 @@ cextern pb_movemask
 ; void saoCuOrgE0(pixel * rec, int8_t * offsetEo, int lcuWidth, int8_t* signLeft, intptr_t stride)
 ;============================================================================================================
 INIT_XMM sse4
+%if HIGH_BIT_DEPTH
+cglobal saoCuOrgE0, 4,5,9
+    mov         r4d, r4m
+    movh        m6,  [r1]
+    movzx       r1d, byte [r3]
+    pxor        m5, m5
+    neg         r1b
+    movd        m0, r1d
+    lea         r1, [r0 + r4 * 2]
+    mov         r4d, r2d
+
+.loop:
+    movu        m7, [r0]
+    movu        m8, [r0 + 16]
+    movu        m2, [r0 + 2]
+    movu        m1, [r0 + 18]
+
+    pcmpgtw     m3, m7, m2
+    pcmpgtw     m2, m7
+    pcmpgtw     m4, m8, m1
+    pcmpgtw     m1, m8 
+
+    packsswb    m3, m4
+    packsswb    m2, m1
+
+    pand        m3, [pb_1]
+    por         m3, m2
+
+    palignr     m2, m3, m5, 15
+    por         m2, m0
+
+    mova        m4, [pw_1023]
+    psignb      m2, [pb_128]                ; m2 = signLeft
+    pxor        m0, m0
+    palignr     m0, m3, 15
+    paddb       m3, m2
+    paddb       m3, [pb_2]                  ; m2 = uiEdgeType
+    pshufb      m2, m6, m3
+    pmovsxbw    m3, m2                      ; offsetEo
+    punpckhbw   m2, m2
+    psraw       m2, 8
+    paddw       m7, m3
+    paddw       m8, m2
+    pmaxsw      m7, m5
+    pmaxsw      m8, m5
+    pminsw      m7, m4
+    pminsw      m8, m4
+    movu        [r0], m7
+    movu        [r0 + 16], m8
+
+    add         r0q, 32
+    sub         r2d, 16
+    jnz        .loop
+
+    movzx       r3d, byte [r3 + 1]
+    neg         r3b
+    movd        m0, r3d
+.loopH:
+    movu        m7, [r1]
+    movu        m8, [r1 + 16]
+    movu        m2, [r1 + 2]
+    movu        m1, [r1 + 18]
+
+    pcmpgtw     m3, m7, m2
+    pcmpgtw     m2, m7
+    pcmpgtw     m4, m8, m1
+    pcmpgtw     m1, m8 
+
+    packsswb    m3, m4
+    packsswb    m2, m1
+
+    pand        m3, [pb_1]
+    por         m3, m2
+
+    palignr     m2, m3, m5, 15
+    por         m2, m0
+
+    mova        m4, [pw_1023]
+    psignb      m2, [pb_128]                ; m2 = signLeft
+    pxor        m0, m0
+    palignr     m0, m3, 15
+    paddb       m3, m2
+    paddb       m3, [pb_2]                  ; m2 = uiEdgeType
+    pshufb      m2, m6, m3
+    pmovsxbw    m3, m2                      ; offsetEo
+    punpckhbw   m2, m2
+    psraw       m2, 8
+    paddw       m7, m3
+    paddw       m8, m2
+    pmaxsw      m7, m5
+    pmaxsw      m8, m5
+    pminsw      m7, m4
+    pminsw      m8, m4
+    movu        [r1], m7
+    movu        [r1 + 16], m8
+
+    add         r1q, 32
+    sub         r4d, 16
+    jnz        .loopH
+    RET
+%else ; HIGH_BIT_DEPTH
 cglobal saoCuOrgE0, 5, 5, 8, rec, offsetEo, lcuWidth, signLeft, stride
 
     mov         r4d, r4m
@@ -130,6 +232,7 @@ cglobal saoCuOrgE0, 5, 5, 8, rec, offset
     sub         r4d, 16
     jnz        .loopH
     RET
+%endif
 
 INIT_YMM avx2
 cglobal saoCuOrgE0, 5, 5, 7, rec, offsetEo, lcuWidth, signLeft, stride
@@ -189,6 +292,62 @@ cglobal saoCuOrgE0, 5, 5, 7, rec, offset
 ; void saoCuOrgE1(pixel *pRec, int8_t *m_iUpBuff1, int8_t *m_iOffsetEo, Int iStride, Int iLcuWidth)
 ;==================================================================================================
 INIT_XMM sse4
+%if HIGH_BIT_DEPTH
+cglobal saoCuOrgE1, 4,5,8
+    add         r3d, r3d
+    mov         r4d, r4m
+    pxor        m0, m0                      ; m0 = 0
+    mova        m6, [pb_2]                  ; m6 = [2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2]
+    shr         r4d, 4
+.loop
+    movu        m7, [r0]
+    movu        m5, [r0 + 16]
+    movu        m3, [r0 + r3]
+    movu        m1, [r0 + r3 + 16]
+
+    pcmpgtw     m2, m7, m3
+    pcmpgtw     m3, m7
+    pcmpgtw     m4, m5, m1
+    pcmpgtw     m1, m5 
+
+    packsswb    m2, m4
+    packsswb    m3, m1
+
+    pand        m2, [pb_1]
+    por         m2, m3
+
+    movu        m3, [r1]                    ; m3 = m_iUpBuff1
+
+    paddb       m3, m2
+    paddb       m3, m6
+

