[x265-commits] [x265] asm: 10bpp avx2 code for intra_pred_ang32x32 mode 11 & 25

Wed Jun 17 06:14:49 CEST 2015

details:   http://hg.videolan.org/x265/rev/87c0aca8e965
branches:  
changeset: 10636:87c0aca8e965
user:      Dnyaneshwar G <dnyaneshwar at multicorewareinc.com>
date:      Thu Jun 11 11:04:36 2015 +0530
description:
asm: 10bpp avx2 code for intra_pred_ang32x32 mode 11 & 25

performance improvement over SSE:
intra_ang_32x32[11]    8256c->4236c, 48%
intra_ang_32x32[25]    5646c->2755c, 51%
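
The percentages above can be reproduced with a short calculation, assuming they express the reduction in cycle count relative to the SSE figure (the commit message does not state the formula). The helper name `cycleReductionPercent` is hypothetical and is not part of x265.

```cpp
// Hypothetical helper, not part of x265: percentage of cycles saved when a
// primitive drops from `before` cycles (SSE) to `after` cycles (AVX2).
// Assumes before > 0.
double cycleReductionPercent(double before, double after)
{
    return 100.0 * (before - after) / before;
}
```

For example, `cycleReductionPercent(8256, 4236)` gives a value between 48 and 49, in line with the 48% reported for mode 11.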
Subject: [x265] asm: 10bpp avx2 code for intra_pred_ang32x32 mode 12 & 24

details:   http://hg.videolan.org/x265/rev/b0c6a597dffc
branches:  
changeset: 10637:b0c6a597dffc
user:      Dnyaneshwar G <dnyaneshwar at multicorewareinc.com>
date:      Thu Jun 11 16:21:47 2015 +0530
description:
asm: 10bpp avx2 code for intra_pred_ang32x32 mode 12 & 24

performance improvement over SSE:
intra_ang_32x32[12]    8084c->4584c, 43%
intra_ang_32x32[24]    5629c->2934c, 48%
Subject: [x265] asm: avx2 interp_8tap_hv_pp for 8bpp

details:   http://hg.videolan.org/x265/rev/32590b25678b
branches:  
changeset: 10638:32590b25678b
user:      Aasaipriya Chandran <aasaipriya at multicorewareinc.com>
date:      Fri Jun 12 16:48:06 2015 +0530
description:
asm: avx2 interp_8tap_hv_pp for 8bpp

Removing the separate x265_interp_8tap_hv_pp_16x16_avx2 asm code, since it gives the same performance as calling the interp_8tap_hv_pp_cpu C function (which calls the luma_hps and luma_vsp asm functions individually).

Including ALL_LUMA_PU_T for luma_hvpp, which calls the interp_8tap_hv_pp_cpu C function.
ALL_LUMA_PU_T declares all sizes except 4x4, hence luma_hvpp[4x4] is included separately.
Subject: [x265] asm: new SSE2 primitive costC1C2Flag in codeCoeffNxN()

details:   http://hg.videolan.org/x265/rev/9e236126045a
branches:  
changeset: 10639:9e236126045a
user:      Min Chen <chenm003 at 163.com>
date:      Fri Jun 12 13:51:44 2015 -0700
description:
asm: new SSE2 primitive costC1C2Flag in codeCoeffNxN()
Subject: [x265] move firstC2Idx, firstC2Flag and c1Next from common to local

details:   http://hg.videolan.org/x265/rev/5db0498cad27
branches:  
changeset: 10640:5db0498cad27
user:      Min Chen <chenm003 at 163.com>
date:      Fri Jun 12 13:51:47 2015 -0700
description:
move firstC2Idx, firstC2Flag and c1Next from common to local
Subject: [x265] asm: avx2 interp_8tap_hv_pp for 16bpp

details:   http://hg.videolan.org/x265/rev/604bac6aa380
branches:  
changeset: 10641:604bac6aa380
user:      Aasaipriya Chandran <aasaipriya at multicorewareinc.com>
date:      Mon Jun 15 12:41:09 2015 +0530
description:
asm: avx2 interp_8tap_hv_pp for 16bpp

Including ALL_LUMA_PU_T for luma_hvpp, which calls the interp_8tap_hv_pp_cpu C function (which calls the luma_hps and luma_vsp asm functions individually).
ALL_LUMA_PU_T declares all sizes except 4x4, hence luma_hvpp[4x4] is included separately.
Subject: [x265] asm: prefix primitives with X265_NS [credited to Kevin Wu]

details:   http://hg.videolan.org/x265/rev/4948aeae8a18
branches:  
changeset: 10642:4948aeae8a18
user:      Deepthi Nandakumar <deepthi at multicorewareinc.com>
date:      Tue Jun 16 11:32:57 2015 +0530
description:
asm: prefix primitives with X265_NS [credited to Kevin Wu]

This commit declares all possible combinations of primitives and CPU
architectures, including those that have not been implemented.
Subject: [x265] cmake: ugly hacks for adding arbitrary link library

details:   http://hg.videolan.org/x265/rev/0902e35aee38
branches:  
changeset: 10643:0902e35aee38
user:      Steve Borho <steve at borho.org>
date:      Fri Jun 05 21:14:14 2015 -0500
description:
cmake: ugly hacks for adding arbitrary link library

(currently necessary for multilib library)
Subject: [x265] common: move remaining x265_ functions into private namespace

details:   http://hg.videolan.org/x265/rev/14a5b588659f
branches:  
changeset: 10644:14a5b588659f
user:      Steve Borho <steve at borho.org>
date:      Fri Jun 05 21:26:27 2015 -0500
description:
common: move remaining x265_ functions into private namespace
Subject: [x265] dct: rename g_entropyStateBits [credited to Steve Borho]

details:   http://hg.videolan.org/x265/rev/5fb6f53513cf
branches:  
changeset: 10645:5fb6f53513cf
user:      Deepthi Nandakumar <deepthi at multicorewareinc.com>
date:      Tue Jun 16 14:57:22 2015 +0530
description:
dct: rename g_entropyStateBits [credited to Steve Borho]
Subject: [x265] version: X265_NS prefix for build strings, max bit depth [credited to Steve Borho]

details:   http://hg.videolan.org/x265/rev/b76b5f0a82b6
branches:  
changeset: 10646:b76b5f0a82b6
user:      Deepthi Nandakumar <deepthi at multicorewareinc.com>
date:      Tue Jun 16 15:02:53 2015 +0530
description:
version: X265_NS prefix for build strings, max bit depth [credited to Steve Borho]
Subject: [x265] cli: retrieve build strings from api pointer, rather than exported symbols

details:   http://hg.videolan.org/x265/rev/df1d0a11f905
branches:  
changeset: 10647:df1d0a11f905
user:      Steve Borho <steve at borho.org>
date:      Sat Jun 06 12:31:32 2015 -0500
description:
cli: retrieve build strings from api pointer, rather than exported symbols
Subject: [x265] build: introduce a new multilib build folder for gmake environments

details:   http://hg.videolan.org/x265/rev/4de5b00815f4
branches:  
changeset: 10648:4de5b00815f4
user:      Steve Borho <steve at borho.org>
date:      Sat Jun 06 12:40:45 2015 -0500
description:
build: introduce a new multilib build folder for gmake environments

in theory this should also work with MSVC, but I have not yet tried it
Subject: [x265] fix issue #143 x265 is slow when it is built with GCC 5.1

details:   http://hg.videolan.org/x265/rev/be0ed447922c
branches:  
changeset: 10649:be0ed447922c
user:      Dnyaneshwar G <dnyaneshwar at multicorewareinc.com>
date:      Tue Jun 16 11:15:03 2015 +0530
description:
fix issue #143 x265 is slow when it is built with GCC 5.1

diffstat:

 build/multilib-gmake/10bit/build.sh         |     2 +
 build/multilib-gmake/8bit/build.sh          |     2 +
 build/multilib-gmake/README.txt             |    17 +
 source/CMakeLists.txt                       |     7 +
 source/cmake/CMakeASM_YASMInformation.cmake |     4 +-
 source/common/common.cpp                    |     6 +-
 source/common/common.h                      |     3 +-
 source/common/contexts.h                    |     2 +-
 source/common/cpu.cpp                       |    26 +-
 source/common/cpu.h                         |    15 +-
 source/common/dct.cpp                       |    57 +-
 source/common/param.cpp                     |     2 +-
 source/common/primitives.cpp                |    11 +-
 source/common/primitives.h                  |     9 +
 source/common/vec/vec-primitives.cpp        |     5 +-
 source/common/version.cpp                   |     9 +-
 source/common/x86/asm-primitives.cpp        |  4430 +++++++++++++-------------
 source/common/x86/blockcopy8.h              |   254 +-
 source/common/x86/dct8.h                    |    37 +-
 source/common/x86/intrapred.h               |   327 +-
 source/common/x86/intrapred16.asm           |   459 ++
 source/common/x86/ipfilter8.h               |  1135 +------
 source/common/x86/loopfilter.h              |    34 +-
 source/common/x86/mc.h                      |    33 +-
 source/common/x86/pixel-util.h              |   138 +-
 source/common/x86/pixel-util8.asm           |   148 +-
 source/common/x86/pixel.h                   |   280 +-
 source/common/x86/ssd-a.asm                 |     2 +-
 source/common/x86/x86inc.asm                |     2 +-
 source/encoder/api.cpp                      |    10 +-
 source/encoder/encoder.cpp                  |     6 +-
 source/encoder/entropy.cpp                  |    63 +-
 source/filters/filters.cpp                  |     2 +
 source/x265.cpp                             |     5 +-
 source/x265cli.h                            |     7 +-
 35 files changed, 3213 insertions(+), 4336 deletions(-)

diffs (truncated from 8685 to 300 lines):

diff -r 91f6f8daef59 -r be0ed447922c build/multilib-gmake/10bit/build.sh

--- /dev/null	Thu Jan 01 00:00:00 1970 +0000
+++ b/build/multilib-gmake/10bit/build.sh	Tue Jun 16 11:15:03 2015 +0530
@@ -0,0 +1,2 @@
+cmake ../../../source -DHIGH_BIT_DEPTH=ON -DEXPORT_C_API=OFF -DENABLE_SHARED=OFF -DENABLE_CLI=OFF
+make
diff -r 91f6f8daef59 -r be0ed447922c build/multilib-gmake/8bit/build.sh
--- /dev/null	Thu Jan 01 00:00:00 1970 +0000
+++ b/build/multilib-gmake/8bit/build.sh	Tue Jun 16 11:15:03 2015 +0530
@@ -0,0 +1,2 @@
+cmake ../../../source -DEXPORT_C_API=OFF -DENABLE_SHARED=OFF -DENABLE_CLI=ON -DEXTRA_LIB=x265.a
+make
diff -r 91f6f8daef59 -r be0ed447922c build/multilib-gmake/README.txt
--- /dev/null	Thu Jan 01 00:00:00 1970 +0000
+++ b/build/multilib-gmake/README.txt	Tue Jun 16 11:15:03 2015 +0530
@@ -0,0 +1,17 @@
+These two subfolders can be used to build a single x265 console binary
+on linux or any other system supporting 'make' builds, with both the
+8bit and 10bit libraries statically linked together.
+
+At the end of the process, the 8bit/libx265.a and 10bit/libx265.a could
+also be linked into other applications, so long as they use
+x265_api_get() or x265_api_query() to acquire the x265 API and then only
+use the function pointers and data elementes within (no other C APIs or
+symbols are exported).
+
+The folders must be built in a specific order. 10bit first, then 8bit
+last. ie:
+
+cd 10bit
+./build.sh
+cd ../8bit
+./build.sh
diff -r 91f6f8daef59 -r be0ed447922c source/CMakeLists.txt
--- a/source/CMakeLists.txt	Fri Jun 12 14:54:16 2015 +0530
+++ b/source/CMakeLists.txt	Tue Jun 16 11:15:03 2015 +0530
@@ -270,6 +270,8 @@ endif()
 # Build options
 set(LIB_INSTALL_DIR lib CACHE STRING "Install location of libraries")
 set(BIN_INSTALL_DIR bin CACHE STRING "Install location of executables")
+set(EXTRA_LIB "" CACHE STRING "Extra libraries to link against")
+mark_as_advanced(EXTRA_LIB)
 
 if(X64)
     # NOTE: We only officially support 16bit-per-pixel compiles of x265
@@ -391,6 +393,11 @@ add_library(x265-static STATIC $<TARGET_
 if(NOT MSVC)
     set_target_properties(x265-static PROPERTIES OUTPUT_NAME x265)
 endif()
+if(EXTRA_LIB)
+    # ugly link path hack
+    link_directories(${CMAKE_BINARY_DIR}/../10bit)
+    target_link_libraries(x265-static ${EXTRA_LIB})
+endif()
 install(TARGETS x265-static
     LIBRARY DESTINATION ${LIB_INSTALL_DIR}
     ARCHIVE DESTINATION ${LIB_INSTALL_DIR})
diff -r 91f6f8daef59 -r be0ed447922c source/cmake/CMakeASM_YASMInformation.cmake
--- a/source/cmake/CMakeASM_YASMInformation.cmake	Fri Jun 12 14:54:16 2015 +0530
+++ b/source/cmake/CMakeASM_YASMInformation.cmake	Tue Jun 16 11:15:03 2015 +0530
@@ -31,9 +31,9 @@ else()
 endif()
 
 if(HIGH_BIT_DEPTH)
-    list(APPEND ASM_FLAGS -DHIGH_BIT_DEPTH=1 -DBIT_DEPTH=10)
+    list(APPEND ASM_FLAGS -DHIGH_BIT_DEPTH=1 -DBIT_DEPTH=10 -DX265_NS=${X265_NS})
 else()
-    list(APPEND ASM_FLAGS -DHIGH_BIT_DEPTH=0 -DBIT_DEPTH=8)
+    list(APPEND ASM_FLAGS -DHIGH_BIT_DEPTH=0 -DBIT_DEPTH=8 -DX265_NS=${X265_NS})
 endif()
 
 list(APPEND ASM_FLAGS "${CMAKE_ASM_YASM_FLAGS}")
diff -r 91f6f8daef59 -r be0ed447922c source/common/common.cpp
--- a/source/common/common.cpp	Fri Jun 12 14:54:16 2015 +0530
+++ b/source/common/common.cpp	Tue Jun 16 11:15:03 2015 +0530
@@ -37,6 +37,8 @@
 int g_checkFailures;
 #endif
 
+namespace X265_NS {
+
 int64_t x265_mdate(void)
 {
 #if _WIN32
@@ -50,8 +52,6 @@ int64_t x265_mdate(void)
 #endif
 }
 
-using namespace X265_NS;
-
 #define X265_ALIGNBYTES 32
 
 #if _WIN32
@@ -215,3 +215,5 @@ error:
     fclose(fh);
     return NULL;
 }
+
+}
diff -r 91f6f8daef59 -r be0ed447922c source/common/common.h
--- a/source/common/common.h	Fri Jun 12 14:54:16 2015 +0530
+++ b/source/common/common.h	Tue Jun 16 11:15:03 2015 +0530
@@ -409,8 +409,6 @@ enum SignificanceMapContextType
 /* located in pixel.cpp */
 void extendPicBorder(pixel* recon, intptr_t stride, int width, int height, int marginX, int marginY);
 
-}
-
 /* outside x265 namespace, but prefixed. defined in common.cpp */
 int64_t  x265_mdate(void);
 #define  x265_log(param, ...) general_log(param, "x265", __VA_ARGS__)
@@ -427,6 +425,7 @@ void     x265_free(void *ptr);
 char*    x265_slurp_file(const char *filename);
 
 void     x265_setup_primitives(x265_param* param, int cpu); /* primitives.cpp */
+}
 
 #include "constants.h"
 
diff -r 91f6f8daef59 -r be0ed447922c source/common/contexts.h
--- a/source/common/contexts.h	Fri Jun 12 14:54:16 2015 +0530
+++ b/source/common/contexts.h	Tue Jun 16 11:15:03 2015 +0530
@@ -102,7 +102,7 @@
 #define OFF_TQUANT_BYPASS_FLAG_CTX (OFF_TRANSFORMSKIP_FLAG_CTX + 2 * NUM_TRANSFORMSKIP_FLAG_CTX)
 #define MAX_OFF_CTX_MOD            (OFF_TQUANT_BYPASS_FLAG_CTX +     NUM_TQUANT_BYPASS_FLAG_CTX)
 
-extern "C" const uint32_t g_entropyStateBits[128];
+extern "C" const uint32_t PFX(entropyStateBits)[128];
 
 namespace X265_NS {
 // private namespace
diff -r 91f6f8daef59 -r be0ed447922c source/common/cpu.cpp
--- a/source/common/cpu.cpp	Fri Jun 12 14:54:16 2015 +0530
+++ b/source/common/cpu.cpp	Tue Jun 16 11:15:03 2015 +0530
@@ -107,9 +107,9 @@ const cpu_name_t cpu_names[] =
 
 extern "C" {
 /* cpu-a.asm */
-int x265_cpu_cpuid_test(void);
-void x265_cpu_cpuid(uint32_t op, uint32_t *eax, uint32_t *ebx, uint32_t *ecx, uint32_t *edx);
-void x265_cpu_xgetbv(uint32_t op, uint32_t *eax, uint32_t *edx);
+int PFX(cpu_cpuid_test)(void);
+void PFX(cpu_cpuid)(uint32_t op, uint32_t *eax, uint32_t *ebx, uint32_t *ecx, uint32_t *edx);
+void PFX(cpu_xgetbv)(uint32_t op, uint32_t *eax, uint32_t *edx);
 }
 
 #if defined(_MSC_VER)
@@ -129,12 +129,12 @@ uint32_t cpu_detect(void)
         return 0;
 #endif
 
-    x265_cpu_cpuid(0, &eax, vendor + 0, vendor + 2, vendor + 1);
+    PFX(cpu_cpuid)(0, &eax, vendor + 0, vendor + 2, vendor + 1);
     max_basic_cap = eax;
     if (max_basic_cap == 0)
         return 0;
 
-    x265_cpu_cpuid(1, &eax, &ebx, &ecx, &edx);
+    PFX(cpu_cpuid)(1, &eax, &ebx, &ecx, &edx);
     if (edx & 0x00800000)
         cpu |= X265_CPU_MMX;
     else
@@ -159,7 +159,7 @@ uint32_t cpu_detect(void)
     if ((ecx & 0x18000000) == 0x18000000)
     {
         /* Check for OS support */
-        x265_cpu_xgetbv(0, &eax, &edx);
+        PFX(cpu_xgetbv)(0, &eax, &edx);
         if ((eax & 0x6) == 0x6)
         {
             cpu |= X265_CPU_AVX;
@@ -170,7 +170,7 @@ uint32_t cpu_detect(void)
 
     if (max_basic_cap >= 7)
     {
-        x265_cpu_cpuid(7, &eax, &ebx, &ecx, &edx);
+        PFX(cpu_cpuid)(7, &eax, &ebx, &ecx, &edx);
         /* AVX2 requires OS support, but BMI1/2 don't. */
         if ((cpu & X265_CPU_AVX) && (ebx & 0x00000020))
             cpu |= X265_CPU_AVX2;
@@ -185,12 +185,12 @@ uint32_t cpu_detect(void)
     if (cpu & X265_CPU_SSSE3)
         cpu |= X265_CPU_SSE2_IS_FAST;
 
-    x265_cpu_cpuid(0x80000000, &eax, &ebx, &ecx, &edx);
+    PFX(cpu_cpuid)(0x80000000, &eax, &ebx, &ecx, &edx);
     max_extended_cap = eax;
 
     if (max_extended_cap >= 0x80000001)
     {
-        x265_cpu_cpuid(0x80000001, &eax, &ebx, &ecx, &edx);
+        PFX(cpu_cpuid)(0x80000001, &eax, &ebx, &ecx, &edx);
 
         if (ecx & 0x00000020)
             cpu |= X265_CPU_LZCNT; /* Supported by Intel chips starting with Haswell */
@@ -233,7 +233,7 @@ uint32_t cpu_detect(void)
 
     if (!strcmp((char*)vendor, "GenuineIntel"))
     {
-        x265_cpu_cpuid(1, &eax, &ebx, &ecx, &edx);
+        PFX(cpu_cpuid)(1, &eax, &ebx, &ecx, &edx);
         int family = ((eax >> 8) & 0xf) + ((eax >> 20) & 0xff);
         int model  = ((eax >> 4) & 0xf) + ((eax >> 12) & 0xf0);
         if (family == 6)
@@ -264,11 +264,11 @@ uint32_t cpu_detect(void)
     if ((!strcmp((char*)vendor, "GenuineIntel") || !strcmp((char*)vendor, "CyrixInstead")) && !(cpu & X265_CPU_SSE42))
     {
         /* cacheline size is specified in 3 places, any of which may be missing */
-        x265_cpu_cpuid(1, &eax, &ebx, &ecx, &edx);
+        PFX(cpu_cpuid)(1, &eax, &ebx, &ecx, &edx);
         int cache = (ebx & 0xff00) >> 5; // cflush size
         if (!cache && max_extended_cap >= 0x80000006)
         {
-            x265_cpu_cpuid(0x80000006, &eax, &ebx, &ecx, &edx);
+            PFX(cpu_cpuid)(0x80000006, &eax, &ebx, &ecx, &edx);
             cache = ecx & 0xff; // cacheline size
         }
         if (!cache && max_basic_cap >= 2)
@@ -281,7 +281,7 @@ uint32_t cpu_detect(void)
             int max, i = 0;
             do
             {
-                x265_cpu_cpuid(2, buf + 0, buf + 1, buf + 2, buf + 3);
+                PFX(cpu_cpuid)(2, buf + 0, buf + 1, buf + 2, buf + 3);
                 max = buf[0] & 0xff;
                 buf[0] &= ~0xff;
                 for (int j = 0; j < 4; j++)
diff -r 91f6f8daef59 -r be0ed447922c source/common/cpu.h
--- a/source/common/cpu.h	Fri Jun 12 14:54:16 2015 +0530
+++ b/source/common/cpu.h	Tue Jun 16 11:15:03 2015 +0530
@@ -27,21 +27,26 @@
 
 #include "common.h"
 
+/* All assembly functions are prefixed with X265_NS (macro expanded) */
+#define PFX3(prefix, name) prefix ## _ ## name
+#define PFX2(prefix, name) PFX3(prefix, name)
+#define PFX(name)          PFX2(X265_NS, name)
+
 // from cpu-a.asm, if ASM primitives are compiled, else primitives.cpp
-extern "C" void x265_cpu_emms(void);
-extern "C" void x265_safe_intel_cpu_indicator_init(void);
+extern "C" void PFX(cpu_emms)(void);
+extern "C" void PFX(safe_intel_cpu_indicator_init)(void);
 
 #if _MSC_VER && _WIN64
-#define x265_emms() x265_cpu_emms()
+#define x265_emms() PFX(cpu_emms)()
 #elif _MSC_VER
 #include <mmintrin.h>
 #define x265_emms() _mm_empty()
 #elif __GNUC__
 // Cannot use _mm_empty() directly without compiling all the source with
 // a fixed CPU arch, which we would like to avoid at the moment
-#define x265_emms() x265_cpu_emms()
+#define x265_emms() PFX(cpu_emms)()
 #else
-#define x265_emms() x265_cpu_emms()
+#define x265_emms() PFX(cpu_emms)()
 #endif
 
 namespace X265_NS {
diff -r 91f6f8daef59 -r be0ed447922c source/common/dct.cpp
--- a/source/common/dct.cpp	Fri Jun 12 14:54:16 2015 +0530
+++ b/source/common/dct.cpp	Tue Jun 16 11:15:03 2015 +0530
@@ -854,7 +854,7 @@ uint32_t costCoeffNxN_c(const uint16_t *
             //encodeBin(sig, baseCtx[ctxSig]);
             const uint32_t mstate = baseCtx[ctxSig];
             const uint32_t mps = mstate & 1;
-            const uint32_t stateBits = g_entropyStateBits[mstate ^ sig];
+            const uint32_t stateBits = PFX(entropyStateBits)[mstate ^ sig];
             uint32_t nextState = (stateBits >> 24) + mps;
             if ((mstate ^ sig) == 1)
                 nextState = sig;
@@ -920,6 +920,60 @@ uint32_t costCoeffRemain_c(uint16_t *abs
     return sum;
 }
 
+
+uint32_t costC1C2Flag_c(uint16_t *absCoeff, intptr_t numC1Flag, uint8_t *baseCtxMod, intptr_t ctxOffset)
+{
+    uint32_t sum = 0;
+    uint32_t c1 = 1;
+    uint32_t firstC2Idx = 8;
+    uint32_t firstC2Flag = 2;
+    uint32_t c1Next = 0xFFFFFFFE;
+
+    int idx = 0;
+    do
+    {
+        uint32_t symbol1 = absCoeff[idx] > 1;
+        uint32_t symbol2 = absCoeff[idx] > 2;
+        //encodeBin(symbol1, baseCtxMod[c1]);
+        {
+            const uint32_t mstate = baseCtxMod[c1];
+            baseCtxMod[c1] = sbacNext(mstate, symbol1);
+            sum += sbacGetEntropyBits(mstate, symbol1);
+        }
+
+        if (symbol1)