[x265-commits] [x265] asm: 10bpp sse4 code for saoCuOrgE0, improved 8740c->974c...
Dnyaneshwar G
dnyaneshwar at multicorewareinc.com
Wed Jun 24 19:25:11 CEST 2015
details: http://hg.videolan.org/x265/rev/9192c823687d
branches:
changeset: 10696:9192c823687d
user: Dnyaneshwar G <dnyaneshwar at multicorewareinc.com>
date: Fri Jun 19 16:47:56 2015 +0530
description:
asm: 10bpp sse4 code for saoCuOrgE0, improved 8740c->974c, over C code
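For readers who do not follow the SSE4 code further down, here is a minimal scalar sketch of the operation the saoCuOrgE0 primitive vectorises for 10-bit samples. It assumes the routine handles two rows per call, as the two loops in the assembly suggest. The names saoEdge0TwoRows, pixel10 and kMaxPixel10 are hypothetical and do not come from the x265 tree.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical names throughout: a scalar sketch, not the x265 source.
using pixel10 = uint16_t;          // 10-bit sample stored in 16 bits
constexpr int kMaxPixel10 = 1023;  // (1 << 10) - 1, same bound as pw_1023

// Returns -1, 0 or +1 according to the sign of v.
static inline int signOf(int v) { return (v > 0) - (v < 0); }

// Horizontal edge offset (EO class 0) applied to two consecutive rows.
// rec[width] must be readable because each sample is compared with its
// right-hand neighbour. stride is counted in samples, not bytes.
static void saoEdge0TwoRows(pixel10* rec, const int8_t* offsetEo, int width,
                            const int8_t* signLeft, ptrdiff_t stride)
{
    assert(width > 0);
    for (int y = 0; y < 2; y++)
    {
        int left = signLeft[y];  // sign carried in from the sample to the left
        for (int x = 0; x < width; x++)
        {
            int right = signOf(int(rec[x]) - int(rec[x + 1]));
            int edgeType = right + left + 2;  // index 0..4 into offsetEo
            left = -right;                    // becomes the left sign for x + 1
            int v = int(rec[x]) + offsetEo[edgeType];
            rec[x] = pixel10(std::min(std::max(v, 0), kMaxPixel10));
        }
        rec += stride;
    }
}
```

The clamp to 1023 is the only part specific to 10-bit builds; an 8-bit build would clamp to 255 instead.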
Subject: [x265] asm: 10bpp sse4 code for saoCuOrgE1, improved 5017c->470c, over C code
details: http://hg.videolan.org/x265/rev/cb483678d472
branches:
changeset: 10697:cb483678d472
user: Dnyaneshwar G <dnyaneshwar at multicorewareinc.com>
date: Mon Jun 22 10:18:14 2015 +0530
description:
asm: 10bpp sse4 code for saoCuOrgE1, improved 5017c->470c, over C code
Subject: [x265] asm: 10bpp sse4 code for saoCuOrgE1_2Rows, improved 10095c->900c, over C code
details: http://hg.videolan.org/x265/rev/49c09709a288
branches:
changeset: 10698:49c09709a288
user: Dnyaneshwar G <dnyaneshwar at multicorewareinc.com>
date: Mon Jun 22 18:15:40 2015 +0530
description:
asm: 10bpp sse4 code for saoCuOrgE1_2Rows, improved 10095c->900c, over C code
Subject: [x265] asm: 10bpp sse4 code for saoCuOrgE2
details: http://hg.videolan.org/x265/rev/653b3b2d59a2
branches:
changeset: 10699:653b3b2d59a2
user: Dnyaneshwar G <dnyaneshwar at multicorewareinc.com>
date: Mon Jun 22 14:23:11 2015 +0530
description:
asm: 10bpp sse4 code for saoCuOrgE2
Performance improvement over C (speedup, optimized cycles, C cycles):
SAO_EO_2[0] 6.27x 207.22 1298.92
SAO_EO_2[1] 8.92x 555.20 4949.69
Subject: [x265] asm: 10bpp sse4 code for saoCuOrgE3
details: http://hg.videolan.org/x265/rev/3ba5f136b20f
branches:
changeset: 10700:3ba5f136b20f
user: Dnyaneshwar G <dnyaneshwar at multicorewareinc.com>
date: Mon Jun 22 16:06:52 2015 +0530
description:
asm: 10bpp sse4 code for saoCuOrgE3
Performance improvement over C (speedup, optimized cycles, C cycles):
SAO_EO_3[0] 4.97x 236.72 1177.29
SAO_EO_3[1] 8.67x 551.14 4778.67
Subject: [x265] asm: 10bpp sse4 code for saoCuOrgB0, improved 173346c->23127c over C code
details: http://hg.videolan.org/x265/rev/c888d2ea8f14
branches:
changeset: 10701:c888d2ea8f14
user: Dnyaneshwar G <dnyaneshwar at multicorewareinc.com>
date: Mon Jun 22 17:07:45 2015 +0530
description:
asm: 10bpp sse4 code for saoCuOrgB0, improved 173346c->23127c over C code
Subject: [x265] asm: pixelavg_pp[16xN] avx2 code for 10bpp
details: http://hg.videolan.org/x265/rev/5dc4ce30c40c
branches:
changeset: 10702:5dc4ce30c40c
user: Rajesh Paulraj <rajesh at multicorewareinc.com>
date: Mon Jun 22 19:34:32 2015 +0530
description:
asm: pixelavg_pp[16xN] avx2 code for 10bpp
avx2 (speedup, optimized cycles, C cycles):
avg_pp[ 16x4] 9.60x 140.07 1344.66
avg_pp[ 16x8] 12.90x 200.11 2580.72
avg_pp[16x12] 14.62x 265.30 3878.63
avg_pp[16x16] 15.00x 339.53 5094.42
avg_pp[16x32] 17.80x 578.67 10300.56
avg_pp[16x64] 19.37x 1050.96 20357.99
sse2:
avg_pp[ 16x4] 7.87x 170.18 1339.60
avg_pp[ 16x8] 8.22x 313.15 2575.54
avg_pp[16x12] 9.78x 394.35 3856.47
avg_pp[16x16] 10.41x 486.99 5070.16
avg_pp[16x32] 11.34x 902.48 10236.26
avg_pp[16x64] 11.96x 1686.64 20171.16
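As a point of reference for the numbers above, the following is a minimal scalar sketch of a rounded pixel average over a 16xN block of 10-bit samples. The function name pixelAvg16xN and its parameter list are assumptions made for illustration and may not match the x265 primitive signature.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

using pixel10 = uint16_t;  // hypothetical alias: 10-bit sample in 16 bits

// Rounded average of two 16-wide blocks, one row at a time.
// All strides are counted in samples, not bytes.
static void pixelAvg16xN(pixel10* dst, ptrdiff_t dstStride,
                         const pixel10* src0, ptrdiff_t stride0,
                         const pixel10* src1, ptrdiff_t stride1, int height)
{
    assert(height > 0);
    for (int y = 0; y < height; y++)
    {
        for (int x = 0; x < 16; x++)
            dst[x] = pixel10((src0[x] + src1[x] + 1) >> 1);  // round half up
        dst += dstStride;
        src0 += stride0;
        src1 += stride1;
    }
}
```

With 16-bit samples, one 256-bit AVX2 register holds a full 16-sample row, which may explain why the avx2 column pulls ahead of sse2 as the block height grows.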
Subject: [x265] faster algorithm to calculate ctxSet in codeCoeffNxN()
details: http://hg.videolan.org/x265/rev/593a325e0950
branches:
changeset: 10703:593a325e0950
user: Min Chen <chenm003 at 163.com>
date: Mon Jun 22 17:39:45 2015 -0700
description:
faster algorithm to calculate ctxSet in codeCoeffNxN()
Subject: [x265] reduce shift operator on coeff remain code
details: http://hg.videolan.org/x265/rev/175c9d1a998b
branches:
changeset: 10704:175c9d1a998b
user: Min Chen <chenm003 at 163.com>
date: Mon Jun 22 17:39:48 2015 -0700
description:
reduce shift operator on coeff remain code
Subject: [x265] asm: AVX2 of SAD_x4[32xN]
details: http://hg.videolan.org/x265/rev/ff1c35f6a261
branches:
changeset: 10705:ff1c35f6a261
user: Min Chen <chenm003 at 163.com>
date: Mon Jun 22 17:39:51 2015 -0700
description:
asm: AVX2 of SAD_x4[32xN]
AVX (speedup, optimized cycles, C cycles):
sad_x4[32x32] 36.69x 2843.87 104330.24
sad_x4[32x16] 35.67x 1547.93 55217.42
sad_x4[32x24] 34.01x 2161.25 73503.10
sad_x4[32x64] 38.73x 5122.28 198363.05
AVX2:
sad_x4[32x32] 41.91x 2379.45 99724.21
sad_x4[32x16] 35.79x 1395.48 49947.39
sad_x4[32x24] 39.03x 1890.22 73777.83
sad_x4[32x64] 39.64x 4997.68 198107.81
Subject: [x265] asm: improve AVX2 sad_x4[32xN] by new faster algorithm
details: http://hg.videolan.org/x265/rev/3a5cd130f908
branches:
changeset: 10706:3a5cd130f908
user: Min Chen <chenm003 at 163.com>
date: Mon Jun 22 17:39:54 2015 -0700
description:
asm: improve AVX2 sad_x4[32xN] by new faster algorithm
Old (speedup, optimized cycles, C cycles):
sad_x4[32x32] 41.91x 2379.45 99724.21
sad_x4[32x16] 35.79x 1395.48 49947.39
sad_x4[32x24] 39.03x 1890.22 73777.83
sad_x4[32x64] 39.64x 4997.68 198107.81
New:
sad_x4[32x32] 60.80x 1672.85 101713.55
sad_x4[32x16] 50.97x 989.42 50435.25
sad_x4[32x24] 55.34x 1416.17 78370.77
sad_x4[32x64] 70.01x 2830.01 198127.63
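For context, sad_x4 scores one source block against four candidate reference blocks in a single call. Below is a minimal scalar sketch for 8-bit samples. The name sadX4, the template parameters and the explicit source stride are assumptions for illustration and are not taken from the x265 headers.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstdlib>

using pixel = uint8_t;  // 8-bit build; hypothetical alias for this sketch

// Sum of absolute differences of one W x H source block against four
// reference candidates that share a common stride. res receives four sums.
template <int W, int H>
static void sadX4(const pixel* fenc, ptrdiff_t fencStride,
                  const pixel* ref0, const pixel* ref1,
                  const pixel* ref2, const pixel* ref3,
                  ptrdiff_t refStride, int32_t res[4])
{
    const pixel* refs[4] = { ref0, ref1, ref2, ref3 };
    res[0] = res[1] = res[2] = res[3] = 0;
    for (int y = 0; y < H; y++)
    {
        for (int x = 0; x < W; x++)
        {
            int cur = fenc[x];  // load the source sample once, use it 4 times
            for (int i = 0; i < 4; i++)
                res[i] += std::abs(cur - int(refs[i][x]));
        }
        fenc += fencStride;
        for (int i = 0; i < 4; i++)
            refs[i] += refStride;
    }
}
```

Reusing each source row for all four candidates is the main saving over four separate SAD calls. A 32-sample row of 8-bit pixels fits in one 256-bit register, so an AVX2 version can handle a full row per load.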
Subject: [x265] cmake: further cleanups for high-bit-depth comment and desc string
details: http://hg.videolan.org/x265/rev/e16b8c5fa3ac
branches:
changeset: 10707:e16b8c5fa3ac
user: Steve Borho <steve at borho.org>
date: Wed Jun 24 10:26:17 2015 -0500
description:
cmake: further cleanups for high-bit-depth comment and desc string
Subject: [x265] threading: fix 32bit multilib with GCC
details: http://hg.videolan.org/x265/rev/95df7ec3c5e6
branches:
changeset: 10708:95df7ec3c5e6
user: Steve Borho <steve at borho.org>
date: Wed Jun 24 10:31:02 2015 -0500
description:
threading: fix 32bit multilib with GCC
Subject: [x265] param: declare our custom strtok_r file-local to avoid multilib breakage
details: http://hg.videolan.org/x265/rev/b1af4c36f48a
branches:
changeset: 10709:b1af4c36f48a
user: Steve Borho <steve at borho.org>
date: Wed Jun 24 10:36:15 2015 -0500
description:
param: declare our custom strtok_r file-local to avoid multilib breakage
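The strtok_r change comes down to linkage. A function defined at namespace scope without static has external linkage, so when several bit-depth builds of the library are linked into one binary (presumably what "multilib" refers to here), each build can export the same symbol and they collide. A minimal sketch of a file-local tokenizer follows; the name localStrtokR is hypothetical and was chosen to avoid clashing with the libc function.

```cpp
#include <cassert>
#include <cstring>

// 'static' gives the function internal linkage: it is visible only inside
// this translation unit and no symbol is exported to the linker.
static char* localStrtokR(char* str, const char* delim, char** nextp)
{
    assert(delim && nextp);
    if (!str)
        str = *nextp;                 // continue from the previous call

    str += std::strspn(str, delim);   // skip leading delimiters
    if (!*str)
        return nullptr;               // no more tokens

    char* token = str;
    str += std::strcspn(str, delim);  // find the end of the token
    if (*str)
        *str++ = '\0';                // terminate it and step past the delimiter

    *nextp = str;
    return token;
}
```

Like the POSIX function it imitates, the sketch modifies its input buffer, so it must be given a writable copy of the string.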
diffstat:
source/CMakeLists.txt | 6 +-
source/common/CMakeLists.txt | 2 +-
source/common/param.cpp | 2 +-
source/common/threading.cpp | 14 +-
source/common/x86/asm-primitives.cpp | 21 +
source/common/x86/loopfilter.asm | 444 +++++++++++++++++++++++++++++++++++
source/common/x86/mc-a.asm | 139 ++++++++++
source/common/x86/sad-a.asm | 99 +++++++
source/encoder/entropy.cpp | 12 +-
source/test/pixelharness.cpp | 24 +-
10 files changed, 735 insertions(+), 28 deletions(-)
diffs (truncated from 1063 to 300 lines):
diff -r f1f25aa959fc -r b1af4c36f48a source/CMakeLists.txt
--- a/source/CMakeLists.txt Tue Jun 23 10:51:33 2015 -0500
+++ b/source/CMakeLists.txt Wed Jun 24 10:36:15 2015 -0500
@@ -275,14 +275,14 @@ set(EXTRA_LINK_FLAGS "" CACHE STRING "Ex
mark_as_advanced(EXTRA_LIB EXTRA_LINK_FLAGS)
if(X64)
- # NOTE: We only officially support 16bit-per-pixel compiles of x265
+ # NOTE: We only officially support high-bit-depth compiles of x265
# on 64bit architectures. Main10 plus large resolution plus slow
# preset plus 32bit address space usually means malloc failure. You
# can disable this if(X64) check if you desparately need a 32bit
# build with 10bit/12bit support, but this violates the "shrink wrap
# license" so to speak. If it breaks you get to keep both halves.
- # You will likely need to compile without assembly
- option(HIGH_BIT_DEPTH "Store pixels as 16bit values" OFF)
+ # You will need to disable assembly manually.
+ option(HIGH_BIT_DEPTH "Store pixel samples as 16bit values (Main10)" OFF)
endif(X64)
if(HIGH_BIT_DEPTH)
add_definitions(-DHIGH_BIT_DEPTH=1)
diff -r f1f25aa959fc -r b1af4c36f48a source/common/CMakeLists.txt
--- a/source/common/CMakeLists.txt Tue Jun 23 10:51:33 2015 -0500
+++ b/source/common/CMakeLists.txt Wed Jun 24 10:36:15 2015 -0500
@@ -46,7 +46,7 @@ if(ENABLE_ASSEMBLY)
mc-a2.asm pixel-util8.asm blockcopy8.asm
pixeladd8.asm dct8.asm)
if(HIGH_BIT_DEPTH)
- set(A_SRCS ${A_SRCS} sad16-a.asm intrapred16.asm ipfilter16.asm)
+ set(A_SRCS ${A_SRCS} sad16-a.asm intrapred16.asm ipfilter16.asm loopfilter.asm)
else()
set(A_SRCS ${A_SRCS} sad-a.asm intrapred8.asm intrapred8_allangs.asm ipfilter8.asm loopfilter.asm)
endif()
diff -r f1f25aa959fc -r b1af4c36f48a source/common/param.cpp
--- a/source/common/param.cpp Tue Jun 23 10:51:33 2015 -0500
+++ b/source/common/param.cpp Wed Jun 24 10:36:15 2015 -0500
@@ -52,7 +52,7 @@
*/
#undef strtok_r
-char* strtok_r(char* str, const char* delim, char** nextp)
+static char* strtok_r(char* str, const char* delim, char** nextp)
{
if (!str)
str = *nextp;
diff -r f1f25aa959fc -r b1af4c36f48a source/common/threading.cpp
--- a/source/common/threading.cpp Tue Jun 23 10:51:33 2015 -0500
+++ b/source/common/threading.cpp Wed Jun 24 10:36:15 2015 -0500
@@ -21,21 +21,24 @@
* For more information, contact us at license @ x265.com
*****************************************************************************/
+#include "common.h"
#include "threading.h"
+#include "cpu.h"
namespace X265_NS {
// x265 private namespace
#if X265_ARCH_X86 && !defined(X86_64) && ENABLE_ASSEMBLY && defined(__GNUC__)
-extern "C" intptr_t x265_stack_align(void (*func)(), ...);
-#define x265_stack_align(func, ...) x265_stack_align((void (*)())func, __VA_ARGS__)
+extern "C" intptr_t PFX(stack_align)(void (*func)(), ...);
+#define STACK_ALIGN(func, ...) PFX(stack_align)((void (*)())func, __VA_ARGS__)
#else
-#define x265_stack_align(func, ...) func(__VA_ARGS__)
+#define STACK_ALIGN(func, ...) func(__VA_ARGS__)
#endif
/* C shim for forced stack alignment */
static void stackAlignMain(Thread *instance)
{
+ // defer processing to the virtual function implemented in the derived class
instance->threadMain();
}
@@ -43,8 +46,7 @@ static void stackAlignMain(Thread *insta
static DWORD WINAPI ThreadShim(Thread *instance)
{
- // defer processing to the virtual function implemented in the derived class
- x265_stack_align(stackAlignMain, instance);
+ STACK_ALIGN(stackAlignMain, instance);
return 0;
}
@@ -77,7 +79,7 @@ static void *ThreadShim(void *opaque)
// defer processing to the virtual function implemented in the derived class
Thread *instance = reinterpret_cast<Thread *>(opaque);
- x265_stack_align(stackAlignMain, instance);
+ STACK_ALIGN(stackAlignMain, instance);
return NULL;
}
diff -r f1f25aa959fc -r b1af4c36f48a source/common/x86/asm-primitives.cpp
--- a/source/common/x86/asm-primitives.cpp Tue Jun 23 10:51:33 2015 -0500
+++ b/source/common/x86/asm-primitives.cpp Wed Jun 24 10:36:15 2015 -0500
@@ -1089,6 +1089,15 @@ void setupAssemblyPrimitives(EncoderPrim
}
if (cpuMask & X265_CPU_SSE4)
{
+ p.saoCuOrgE0 = PFX(saoCuOrgE0_sse4);
+ p.saoCuOrgE1 = PFX(saoCuOrgE1_sse4);
+ p.saoCuOrgE1_2Rows = PFX(saoCuOrgE1_2Rows_sse4);
+ p.saoCuOrgE2[0] = PFX(saoCuOrgE2_sse4);
+ p.saoCuOrgE2[1] = PFX(saoCuOrgE2_sse4);
+ p.saoCuOrgE3[0] = PFX(saoCuOrgE3_sse4);
+ p.saoCuOrgE3[1] = PFX(saoCuOrgE3_sse4);
+ p.saoCuOrgB0 = PFX(saoCuOrgB0_sse4);
+
LUMA_ADDAVG(sse4);
CHROMA_420_ADDAVG(sse4);
CHROMA_422_ADDAVG(sse4);
@@ -1343,6 +1352,13 @@ void setupAssemblyPrimitives(EncoderPrim
p.cu[BLOCK_32x32].intra_pred[33] = PFX(intra_pred_ang32_33_avx2);
p.cu[BLOCK_32x32].intra_pred[34] = PFX(intra_pred_ang32_2_avx2);
+ p.pu[LUMA_16x4].pixelavg_pp = PFX(pixel_avg_16x4_avx2);
+ p.pu[LUMA_16x8].pixelavg_pp = PFX(pixel_avg_16x8_avx2);
+ p.pu[LUMA_16x12].pixelavg_pp = PFX(pixel_avg_16x12_avx2);
+ p.pu[LUMA_16x16].pixelavg_pp = PFX(pixel_avg_16x16_avx2);
+ p.pu[LUMA_16x32].pixelavg_pp = PFX(pixel_avg_16x32_avx2);
+ p.pu[LUMA_16x64].pixelavg_pp = PFX(pixel_avg_16x64_avx2);
+
p.pu[LUMA_8x4].addAvg = PFX(addAvg_8x4_avx2);
p.pu[LUMA_8x8].addAvg = PFX(addAvg_8x8_avx2);
p.pu[LUMA_8x16].addAvg = PFX(addAvg_8x16_avx2);
@@ -2747,6 +2763,11 @@ void setupAssemblyPrimitives(EncoderPrim
p.pu[LUMA_16x12].sad_x4 = PFX(pixel_sad_x4_16x12_avx2);
p.pu[LUMA_16x16].sad_x4 = PFX(pixel_sad_x4_16x16_avx2);
p.pu[LUMA_16x32].sad_x4 = PFX(pixel_sad_x4_16x32_avx2);
+ p.pu[LUMA_32x32].sad_x4 = PFX(pixel_sad_x4_32x32_avx2);
+ p.pu[LUMA_32x16].sad_x4 = PFX(pixel_sad_x4_32x16_avx2);
+ p.pu[LUMA_32x64].sad_x4 = PFX(pixel_sad_x4_32x64_avx2);
+ p.pu[LUMA_32x24].sad_x4 = PFX(pixel_sad_x4_32x24_avx2);
+ p.pu[LUMA_32x8].sad_x4 = PFX(pixel_sad_x4_32x8_avx2);
p.cu[BLOCK_16x16].sse_pp = PFX(pixel_ssd_16x16_avx2);
p.cu[BLOCK_32x32].sse_pp = PFX(pixel_ssd_32x32_avx2);
diff -r f1f25aa959fc -r b1af4c36f48a source/common/x86/loopfilter.asm
--- a/source/common/x86/loopfilter.asm Tue Jun 23 10:51:33 2015 -0500
+++ b/source/common/x86/loopfilter.asm Wed Jun 24 10:36:15 2015 -0500
@@ -38,6 +38,7 @@ cextern pb_1
cextern pb_128
cextern pb_2
cextern pw_2
+cextern pw_1023
cextern pb_movemask
@@ -45,6 +46,107 @@ cextern pb_movemask
; void saoCuOrgE0(pixel * rec, int8_t * offsetEo, int lcuWidth, int8_t* signLeft, intptr_t stride)
;============================================================================================================
INIT_XMM sse4
+%if HIGH_BIT_DEPTH
+cglobal saoCuOrgE0, 4,5,9
+ mov r4d, r4m
+ movh m6, [r1]
+ movzx r1d, byte [r3]
+ pxor m5, m5
+ neg r1b
+ movd m0, r1d
+ lea r1, [r0 + r4 * 2]
+ mov r4d, r2d
+
+.loop:
+ movu m7, [r0]
+ movu m8, [r0 + 16]
+ movu m2, [r0 + 2]
+ movu m1, [r0 + 18]
+
+ pcmpgtw m3, m7, m2
+ pcmpgtw m2, m7
+ pcmpgtw m4, m8, m1
+ pcmpgtw m1, m8
+
+ packsswb m3, m4
+ packsswb m2, m1
+
+ pand m3, [pb_1]
+ por m3, m2
+
+ palignr m2, m3, m5, 15
+ por m2, m0
+
+ mova m4, [pw_1023]
+ psignb m2, [pb_128] ; m2 = signLeft
+ pxor m0, m0
+ palignr m0, m3, 15
+ paddb m3, m2
+ paddb m3, [pb_2] ; m3 = uiEdgeType
+ pshufb m2, m6, m3
+ pmovsxbw m3, m2 ; offsetEo
+ punpckhbw m2, m2
+ psraw m2, 8
+ paddw m7, m3
+ paddw m8, m2
+ pmaxsw m7, m5
+ pmaxsw m8, m5
+ pminsw m7, m4
+ pminsw m8, m4
+ movu [r0], m7
+ movu [r0 + 16], m8
+
+ add r0q, 32
+ sub r2d, 16
+ jnz .loop
+
+ movzx r3d, byte [r3 + 1]
+ neg r3b
+ movd m0, r3d
+.loopH:
+ movu m7, [r1]
+ movu m8, [r1 + 16]
+ movu m2, [r1 + 2]
+ movu m1, [r1 + 18]
+
+ pcmpgtw m3, m7, m2
+ pcmpgtw m2, m7
+ pcmpgtw m4, m8, m1
+ pcmpgtw m1, m8
+
+ packsswb m3, m4
+ packsswb m2, m1
+
+ pand m3, [pb_1]
+ por m3, m2
+
+ palignr m2, m3, m5, 15
+ por m2, m0
+
+ mova m4, [pw_1023]
+ psignb m2, [pb_128] ; m2 = signLeft
+ pxor m0, m0
+ palignr m0, m3, 15
+ paddb m3, m2
+ paddb m3, [pb_2] ; m3 = uiEdgeType
+ pshufb m2, m6, m3
+ pmovsxbw m3, m2 ; offsetEo
+ punpckhbw m2, m2
+ psraw m2, 8
+ paddw m7, m3
+ paddw m8, m2
+ pmaxsw m7, m5
+ pmaxsw m8, m5
+ pminsw m7, m4
+ pminsw m8, m4
+ movu [r1], m7
+ movu [r1 + 16], m8
+
+ add r1q, 32
+ sub r4d, 16
+ jnz .loopH
+ RET
+%else ; HIGH_BIT_DEPTH
cglobal saoCuOrgE0, 5, 5, 8, rec, offsetEo, lcuWidth, signLeft, stride
mov r4d, r4m
@@ -130,6 +232,7 @@ cglobal saoCuOrgE0, 5, 5, 8, rec, offset
sub r4d, 16
jnz .loopH
RET
+%endif
INIT_YMM avx2
cglobal saoCuOrgE0, 5, 5, 7, rec, offsetEo, lcuWidth, signLeft, stride
@@ -189,6 +292,62 @@ cglobal saoCuOrgE0, 5, 5, 7, rec, offset
; void saoCuOrgE1(pixel *pRec, int8_t *m_iUpBuff1, int8_t *m_iOffsetEo, Int iStride, Int iLcuWidth)
;==================================================================================================
INIT_XMM sse4
+%if HIGH_BIT_DEPTH
+cglobal saoCuOrgE1, 4,5,8
+ add r3d, r3d
+ mov r4d, r4m
+ pxor m0, m0 ; m0 = 0
+ mova m6, [pb_2] ; m6 = [2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2]
+ shr r4d, 4
+.loop
+ movu m7, [r0]
+ movu m5, [r0 + 16]
+ movu m3, [r0 + r3]
+ movu m1, [r0 + r3 + 16]
+
+ pcmpgtw m2, m7, m3
+ pcmpgtw m3, m7
+ pcmpgtw m4, m5, m1
+ pcmpgtw m1, m5
+
+ packsswb m2, m4
+ packsswb m3, m1
+
+ pand m2, [pb_1]
+ por m2, m3
+
+ movu m3, [r1] ; m3 = m_iUpBuff1
+
+ paddb m3, m2
+ paddb m3, m6
+