[x265-commits] [x265] asm: assembly code for pixel_satd_24x32
Yuvaraj Venkatesh
yuvaraj at multicorewareinc.com
Thu Nov 14 14:58:15 CET 2013
details: http://hg.videolan.org/x265/rev/2ffe634ebd71
branches:
changeset: 5070:2ffe634ebd71
user: Yuvaraj Venkatesh <yuvaraj at multicorewareinc.com>
date: Wed Nov 13 13:08:03 2013 +0530
description:
asm: assembly code for pixel_satd_24x32
Subject: [x265] asm: assembly code for pixel_satd_32x32
details: http://hg.videolan.org/x265/rev/4ee655b93b03
branches:
changeset: 5071:4ee655b93b03
user: Yuvaraj Venkatesh <yuvaraj at multicorewareinc.com>
date: Wed Nov 13 16:43:37 2013 +0530
description:
asm: assembly code for pixel_satd_32x32
Subject: [x265] asm: assembly code for pixel_satd_64x16
details: http://hg.videolan.org/x265/rev/32e01ab333a6
branches:
changeset: 5072:32e01ab333a6
user: Yuvaraj Venkatesh <yuvaraj at multicorewareinc.com>
date: Wed Nov 13 17:04:08 2013 +0530
description:
asm: assembly code for pixel_satd_64x16
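The three satd commits above add dedicated assembly for the 24x32, 32x32 and 64x16 partitions, which the encoder previously assembled from smaller kernels (see the HEVC_SATD hunk in the diff below). As a hedged reference for what these routines compute, here is a scalar sketch of the sum of absolute transformed differences: a minimal model assuming the usual 4x4 Hadamard kernel tiled over the partition, not the project's exact C primitive.

    #include <cstdint>
    #include <cstdlib>

    // 4x4 Hadamard-transform SATD kernel (standard reference form)
    static int satd_4x4(const uint8_t *pix1, intptr_t stride1,
                        const uint8_t *pix2, intptr_t stride2)
    {
        int diff[16], m[16];
        for (int i = 0; i < 4; i++)
            for (int j = 0; j < 4; j++)
                diff[i * 4 + j] = pix1[i * stride1 + j] - pix2[i * stride2 + j];

        for (int i = 0; i < 4; i++)             // horizontal transform per row
        {
            int s01 = diff[i*4+0] + diff[i*4+1], d01 = diff[i*4+0] - diff[i*4+1];
            int s23 = diff[i*4+2] + diff[i*4+3], d23 = diff[i*4+2] - diff[i*4+3];
            m[i*4+0] = s01 + s23; m[i*4+1] = s01 - s23;
            m[i*4+2] = d01 + d23; m[i*4+3] = d01 - d23;
        }

        int sum = 0;
        for (int j = 0; j < 4; j++)             // vertical transform per column
        {
            int s01 = m[0*4+j] + m[1*4+j], d01 = m[0*4+j] - m[1*4+j];
            int s23 = m[2*4+j] + m[3*4+j], d23 = m[2*4+j] - m[3*4+j];
            sum += std::abs(s01 + s23) + std::abs(s01 - s23)
                 + std::abs(d01 + d23) + std::abs(d01 - d23);
        }
        return sum >> 1;                        // customary satd normalisation
    }

    // A larger partition (here 24x32) as a sum of kernel results
    static int satd_24x32_ref(const uint8_t *p1, intptr_t s1,
                              const uint8_t *p2, intptr_t s2)
    {
        int sum = 0;
        for (int y = 0; y < 32; y += 4)
            for (int x = 0; x < 24; x += 4)
                sum += satd_4x4(p1 + y * s1 + x, s1, p2 + y * s2 + x, s2);
        return sum;
    }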
Subject: [x265] asm: Proper indentation and function prototype updates for chroma hps filter functions for 2xN, 4xN, 6x8 and 12x16 block sizes.
details: http://hg.videolan.org/x265/rev/51d3c0782e46
branches:
changeset: 5073:51d3c0782e46
user: Nabajit Deka
date: Wed Nov 13 13:58:39 2013 +0530
description:
asm: Proper indentation and function prototype updates for chroma hps filter functions for 2xN, 4xN, 6x8 and 12x16 block sizes.
Subject: [x265] asm: routines for chroma hps filter functions for 8xN block sizes.
details: http://hg.videolan.org/x265/rev/3448252924ad
branches:
changeset: 5074:3448252924ad
user: Nabajit Deka
date: Wed Nov 13 14:11:00 2013 +0530
description:
asm: routines for chroma hps filter functions for 8xN block sizes.
Subject: [x265] asm: routines for chroma hps filter functions for 16xN, 24xN and 32xN
details: http://hg.videolan.org/x265/rev/d80ab2913b31
branches:
changeset: 5075:d80ab2913b31
user: Nabajit Deka
date: Wed Nov 13 14:30:22 2013 +0530
description:
asm: routines for chroma hps filter functions for 16xN, 24xN and 32xN
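The chroma hps routines in the commits above implement the first interpolation stage: horizontal 4-tap filtering of 8-bit pixels into a 16-bit intermediate plane. A hedged scalar sketch follows; the coefficient table is the standard HEVC chroma 4-tap set, and the 8192 offset matches the tab_c_8192 constant used by the vertical filters in the diff below, but the exact prototype of the new asm routines (width/height handling in particular) is an assumption.

    #include <cstdint>

    static const int16_t g_chromaFilter[8][4] =   // assumed HEVC chroma taps
    {
        {  0, 64,  0,  0 }, { -2, 58, 10, -2 }, { -4, 54, 16, -2 }, { -6, 46, 28, -4 },
        { -4, 36, 36, -4 }, { -4, 28, 46, -6 }, { -2, 16, 54, -4 }, { -2, 10, 58, -2 },
    };

    // horizontal, pixel in / short out ("hps"), 8-bit depth
    static void interp_4tap_horiz_ps_ref(const uint8_t *src, intptr_t srcStride,
                                         int16_t *dst, intptr_t dstStride,
                                         int coeffIdx, int width, int height)
    {
        const int16_t *c = g_chromaFilter[coeffIdx];
        src -= 1;                                 // taps span src[x-1..x+2]
        for (int y = 0; y < height; y++)
        {
            for (int x = 0; x < width; x++)
            {
                int sum = c[0] * src[x]     + c[1] * src[x + 1]
                        + c[2] * src[x + 2] + c[3] * src[x + 3];
                dst[x] = (int16_t)(sum - 8192);   // 16-bit intermediate, no clipping
            }
            src += srcStride;
            dst += dstStride;
        }
    }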
Subject: [x265] asm: routines for chroma vps filter functions for 4xN block sizes.
details: http://hg.videolan.org/x265/rev/23aecd3f9180
branches:
changeset: 5076:23aecd3f9180
user: Nabajit Deka
date: Wed Nov 13 15:45:39 2013 +0530
description:
asm: routines for chroma vps filter functions for 4xN block sizes.
Subject: [x265] asm: routines for chroma vps filter functions for 8xN block sizes
details: http://hg.videolan.org/x265/rev/91cfcd159ff3
branches:
changeset: 5077:91cfcd159ff3
user: Nabajit Deka
date: Wed Nov 13 16:02:48 2013 +0530
description:
asm: routines for chroma vps filter functions for 8xN block sizes
Subject: [x265] asm: routines for chroma vps filter functions for 6x8 and 12x16 block sizes.
details: http://hg.videolan.org/x265/rev/8e6dcabdccd5
branches:
changeset: 5078:8e6dcabdccd5
user: Nabajit Deka
date: Wed Nov 13 16:19:47 2013 +0530
description:
asm: routines for chroma vps filter functions for 6x8 and 12x16 block sizes.
Subject: [x265] asm: routines for chroma vps filter functions for 16xN block sizes.
details: http://hg.videolan.org/x265/rev/52d18d911356
branches:
changeset: 5079:52d18d911356
user: Nabajit Deka
date: Wed Nov 13 16:28:08 2013 +0530
description:
asm: routines for chroma vps filter functions for 16xN block sizes.
Subject: [x265] asm: routines for chroma vps filter function for 24x32 block size.
details: http://hg.videolan.org/x265/rev/21d27b188e71
branches:
changeset: 5080:21d27b188e71
user: Nabajit Deka
date: Wed Nov 13 16:35:45 2013 +0530
description:
asm: routines for chroma vps filter function for 24x32 block size.
Subject: [x265] asm: routines for chroma vps filter functions for 32xN block sizes.
details: http://hg.videolan.org/x265/rev/701b696d0670
branches:
changeset: 5081:701b696d0670
user: Nabajit Deka
date: Wed Nov 13 16:46:42 2013 +0530
description:
asm: routines for chroma vps filter functions for 32xN block sizes.
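The vps commits above are the vertical counterpart: 4-tap filtering down a column of 8-bit pixels into 16-bit intermediates, which is what the interp_4tap_vert_ps asm in the diff below computes (taps loaded from tab_ChromaCoeff, offset subtracted via tab_c_8192). A hedged sketch, assuming the same coefficient set as the horizontal sketch above and taking one selected coefficient row as an argument:

    #include <cstdint>

    // vertical, pixel in / short out ("vps"), 8-bit depth; 'coeff' is one row
    // of the chroma tap table selected by coeffIdx (cf. tab_ChromaCoeff)
    static void interp_4tap_vert_ps_ref(const uint8_t *src, intptr_t srcStride,
                                        int16_t *dst, intptr_t dstStride,
                                        const int16_t coeff[4], int width, int height)
    {
        src -= srcStride;                         // one row of lead-in, as in 'sub r0, r1'
        for (int y = 0; y < height; y++)
        {
            for (int x = 0; x < width; x++)
            {
                int sum = coeff[0] * src[x]
                        + coeff[1] * src[x + srcStride]
                        + coeff[2] * src[x + 2 * srcStride]
                        + coeff[3] * src[x + 3 * srcStride];
                dst[x] = (int16_t)(sum - 8192);   // the 'psubw' of tab_c_8192 in the asm
            }
            src += srcStride;
            dst += dstStride;
        }
    }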
Subject: [x265] Adding asm function declarations and initializations for chroma vps filter functions.
details: http://hg.videolan.org/x265/rev/5fc6ca938864
branches:
changeset: 5082:5fc6ca938864
user: Nabajit Deka
date: Wed Nov 13 16:58:12 2013 +0530
description:
Adding asm function declarations and initializations for chroma vps filter functions.
Subject: [x265] Change minimum architecture to SSE4 as chroma vsp functions for block sizes (2x4, 2x8 and 6x8) need faster SSE4 instructions.
details: http://hg.videolan.org/x265/rev/a04ca925ad3f
branches:
changeset: 5083:a04ca925ad3f
user: Nabajit Deka
date: Wed Nov 13 18:27:00 2013 +0530
description:
Change minimum architecture to SSE4 as chroma vsp functions for block sizes (2x4, 2x8 and 6x8) need faster SSE4 instructions.
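Raising the minimum architecture works because the primitive table is filled in per CPU capability: the new 2x4, 2x8 and 6x8 vertical sp routines are assigned only when the SSE4 flag is present, as the asm-primitives.cpp hunk in the diff below shows. A simplified, hedged sketch of that dispatch pattern (the surrounding structure, flag value and stub definitions are illustrative; only the primitive names mirror the diff):

    #include <cstdint>

    typedef void (*filter_sp_t)(const int16_t *src, intptr_t srcStride,
                                uint8_t *dst, intptr_t dstStride, int coeffIdx);

    enum { CHROMA_2x4, CHROMA_2x8, CHROMA_6x8, NUM_CHROMA_PARTS };
    enum { X265_CPU_SSE4 = 1 << 6 };              // illustrative flag value

    struct EncoderPrimitives { filter_sp_t chroma_vsp[NUM_CHROMA_PARTS]; };

    // stubs standing in for the SSE4 asm entry points in ipfilter8.asm
    static void x265_interp_4tap_vert_sp_2x4_sse4(const int16_t*, intptr_t, uint8_t*, intptr_t, int) {}
    static void x265_interp_4tap_vert_sp_2x8_sse4(const int16_t*, intptr_t, uint8_t*, intptr_t, int) {}
    static void x265_interp_4tap_vert_sp_6x8_sse4(const int16_t*, intptr_t, uint8_t*, intptr_t, int) {}

    static void setupChromaVsp(EncoderPrimitives &p, int cpuMask)
    {
        // C fallbacks are assumed to be installed beforehand; the asm versions
        // simply overwrite the table entries when the CPU reports SSE4.
        if (cpuMask & X265_CPU_SSE4)
        {
            p.chroma_vsp[CHROMA_2x4] = x265_interp_4tap_vert_sp_2x4_sse4;
            p.chroma_vsp[CHROMA_2x8] = x265_interp_4tap_vert_sp_2x8_sse4;
            p.chroma_vsp[CHROMA_6x8] = x265_interp_4tap_vert_sp_6x8_sse4;
        }
    }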
Subject: [x265] TEncSearch: Fix parameter type of xEstimateResidualQT
details: http://hg.videolan.org/x265/rev/c89e22d26bcd
branches:
changeset: 5084:c89e22d26bcd
user: Derek Buitenhuis <derek.buitenhuis at gmail.com>
date: Wed Nov 13 13:52:43 2013 +0000
description:
TEncSearch: Fix parameter type of xEstimateResidualQT
Fixes compilation with g++.
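The likely failure mode is a reference-parameter type mismatch that MSVC tolerates but g++ does not, since the two typedefs can resolve to distinct 64-bit types on an LP64 toolchain. A hedged illustration of the fixed form (the typedef and names are stand-ins, not the exact x265 call site):

    #include <cstdint>

    typedef unsigned long long UInt64;   // legacy HM-style typedef (assumed)

    // Before this commit the parameter was 'UInt64 &rdCost'.  On MSVC, UInt64
    // and uint64_t are the same underlying type, so a caller holding a
    // uint64_t could bind the reference; on LP64 g++, uint64_t is usually
    // 'unsigned long', a distinct type, and the same call is rejected.
    // Declaring everything as uint64_t removes the mismatch.
    static void xEstimateResidualQT(uint64_t &rdCost)
    {
        rdCost += 1;
    }

    int main()
    {
        uint64_t cost = 0;
        xEstimateResidualQT(cost);       // parameter and argument now agree
        return (int)cost;
    }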
Subject: [x265] Reindent after last commit
details: http://hg.videolan.org/x265/rev/5683ee5b793c
branches:
changeset: 5085:5683ee5b793c
user: Derek Buitenhuis <derek.buitenhuis at gmail.com>
date: Wed Nov 13 13:53:13 2013 +0000
description:
Reindent after last commit
Subject: [x265] asm: routines for chroma vps filter functions for 2x4 and 2x8 block sizes.
details: http://hg.videolan.org/x265/rev/c828dd4d9eae
branches:
changeset: 5086:c828dd4d9eae
user: Nabajit Deka
date: Wed Nov 13 15:30:09 2013 +0530
description:
asm: routines for chroma vps filter functions for 2x4 and 2x8 block sizes.
Subject: [x265] TEncSearch: nit
details: http://hg.videolan.org/x265/rev/e871fe75d5ab
branches:
changeset: 5087:e871fe75d5ab
user: Steve Borho <steve at borho.org>
date: Wed Nov 13 13:52:43 2013 +0000
description:
TEncSearch: nit
diffstat:
source/Lib/TLibEncoder/TEncSearch.cpp | 8 +-
source/Lib/TLibEncoder/TEncSearch.h | 2 +-
source/common/x86/asm-primitives.cpp | 25 +-
source/common/x86/ipfilter8.asm | 1424 ++++++++++++++++++++++++++++++--
source/common/x86/ipfilter8.h | 9 +-
source/common/x86/pixel-a.asm | 254 +++++
6 files changed, 1603 insertions(+), 119 deletions(-)
diffs (truncated from 1980 to 300 lines):
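The HEVC_SATD hunks below swap generic cmp<> wrappers for the new dedicated asm routines (and, for AVX2, spell the wrappers out explicitly instead of using the macro). A hedged sketch of what such a wrapper looks like, assuming it tiles a smaller hand-written comparison primitive over the full partition; the template parameters follow the diff, while the exact x265 definition and pixelcmp_t signature are assumptions:

    #include <cstdint>

    typedef int (*pixelcmp_t)(const uint8_t *fenc, intptr_t fencStride,
                              const uint8_t *fref, intptr_t frefStride);

    // lx x ly: partition size; dx x dy: block handled by 'compare' per call
    template<int lx, int ly, int dx, int dy, pixelcmp_t compare>
    int cmp(const uint8_t *fenc, intptr_t fencStride,
            const uint8_t *fref, intptr_t frefStride)
    {
        int sum = 0;
        for (int y = 0; y < ly; y += dy)
            for (int x = 0; x < lx; x += dx)
                sum += compare(fenc + y * fencStride + x, fencStride,
                               fref + y * frefStride + x, frefStride);
        return sum;
    }

    // usage mirroring the diff:
    //   p.satd[LUMA_64x16] = cmp<64, 16, 16, 16, x265_pixel_satd_16x16_avx2>;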
diff -r c4ca80d19105 -r e871fe75d5ab source/Lib/TLibEncoder/TEncSearch.cpp
--- a/source/Lib/TLibEncoder/TEncSearch.cpp Tue Nov 12 19:10:23 2013 +0530
+++ b/source/Lib/TLibEncoder/TEncSearch.cpp Wed Nov 13 13:52:43 2013 +0000
@@ -2808,8 +2808,8 @@ void TEncSearch::encodeResAndCalcRdInter
}
// Residual coding.
- int qp, qpBest = 0;
- UInt64 cost, bcost = MAX_INT64;
+ int qp, qpBest = 0;
+ uint64_t cost, bcost = MAX_INT64;
uint32_t trLevel = 0;
if ((cu->getWidth(0) > cu->getSlice()->getSPS()->getMaxTrSize()))
@@ -3042,7 +3042,7 @@ void TEncSearch::xEstimateResidualQT(TCo
uint32_t absTUPartIdx,
TShortYUV* resiYuv,
const uint32_t depth,
- UInt64 & rdCost,
+ uint64_t & rdCost,
uint32_t & outBits,
uint32_t & outDist,
uint32_t * outZeroDist,
@@ -3634,7 +3634,7 @@ void TEncSearch::xEstimateResidualQT(TCo
}
uint32_t subdivDist = 0;
uint32_t subdivBits = 0;
- UInt64 subDivCost = 0;
+ uint64_t subDivCost = 0;
const uint32_t qPartNumSubdiv = cu->getPic()->getNumPartInCU() >> ((depth + 1) << 1);
for (uint32_t i = 0; i < 4; ++i)
diff -r c4ca80d19105 -r e871fe75d5ab source/Lib/TLibEncoder/TEncSearch.h
--- a/source/Lib/TLibEncoder/TEncSearch.h Tue Nov 12 19:10:23 2013 +0530
+++ b/source/Lib/TLibEncoder/TEncSearch.h Wed Nov 13 13:52:43 2013 +0000
@@ -250,7 +250,7 @@ protected:
void xEncodeResidualQT(TComDataCU* cu, uint32_t absPartIdx, uint32_t depth, bool bSubdivAndCbf, TextType ttype);
void xEstimateResidualQT(TComDataCU* cu, uint32_t absPartIdx, uint32_t absTUPartIdx, TShortYUV* resiYuv, uint32_t depth,
- UInt64 &rdCost, uint32_t &outBits, uint32_t &outDist, uint32_t *puiZeroDist, bool curUseRDOQ = true);
+ uint64_t &rdCost, uint32_t &outBits, uint32_t &outDist, uint32_t *puiZeroDist, bool curUseRDOQ = true);
void xSetResidualQTData(TComDataCU* cu, uint32_t absPartIdx, uint32_t absTUPartIdx, TShortYUV* resiYuv, uint32_t depth, bool bSpatial);
void setWpScalingDistParam(TComDataCU* cu, int refIdx, int picList);
diff -r c4ca80d19105 -r e871fe75d5ab source/common/x86/asm-primitives.cpp
--- a/source/common/x86/asm-primitives.cpp Tue Nov 12 19:10:23 2013 +0530
+++ b/source/common/x86/asm-primitives.cpp Wed Nov 13 13:52:43 2013 +0000
@@ -59,14 +59,14 @@ extern "C" {
#define INIT8(name, cpu) INIT8_NAME(name, name, cpu)
#define HEVC_SATD(cpu) \
- p.satd[LUMA_32x32] = cmp<32, 32, 16, 16, x265_pixel_satd_16x16_ ## cpu>; \
- p.satd[LUMA_24x32] = cmp<24, 32, 8, 16, x265_pixel_satd_8x16_ ## cpu>; \
+ p.satd[LUMA_32x32] = x265_pixel_satd_32x32_ ## cpu; \
+ p.satd[LUMA_24x32] = x265_pixel_satd_24x32_ ## cpu; \
p.satd[LUMA_64x64] = cmp<64, 64, 16, 16, x265_pixel_satd_16x16_ ## cpu>; \
p.satd[LUMA_64x32] = cmp<64, 32, 16, 16, x265_pixel_satd_16x16_ ## cpu>; \
p.satd[LUMA_32x64] = cmp<32, 64, 16, 16, x265_pixel_satd_16x16_ ## cpu>; \
p.satd[LUMA_64x48] = cmp<64, 48, 16, 16, x265_pixel_satd_16x16_ ## cpu>; \
p.satd[LUMA_48x64] = cmp<48, 64, 16, 16, x265_pixel_satd_16x16_ ## cpu>; \
- p.satd[LUMA_64x16] = cmp < 64, 16, 16, 16, x265_pixel_satd_16x16_ ## cpu >
+ p.satd[LUMA_64x16] = x265_pixel_satd_64x16_ ## cpu
#define ASSGN_SSE(cpu) \
p.sse_pp[LUMA_8x8] = x265_pixel_ssd_8x8_ ## cpu; \
@@ -138,6 +138,7 @@ extern "C" {
#define SETUP_CHROMA_FUNC_DEF(W, H, cpu) \
p.chroma_hpp[CHROMA_ ## W ## x ## H] = x265_interp_4tap_horiz_pp_ ## W ## x ## H ## cpu; \
p.chroma_vpp[CHROMA_ ## W ## x ## H] = x265_interp_4tap_vert_pp_ ## W ## x ## H ## cpu; \
+ p.chroma_vps[CHROMA_ ## W ## x ## H] = x265_interp_4tap_vert_ps_ ## W ## x ## H ## cpu; \
p.chroma_copy_ps[CHROMA_ ## W ## x ## H] = x265_blockcopy_ps_ ## W ## x ## H ## cpu; \
p.chroma_sub_ps[CHROMA_ ## W ## x ## H] = x265_pixel_sub_ps_ ## W ## x ## H ## cpu;
@@ -176,14 +177,11 @@ extern "C" {
#define CHROMA_SP_FILTERS(cpu) \
SETUP_CHROMA_SP_FUNC_DEF(4, 4, cpu); \
SETUP_CHROMA_SP_FUNC_DEF(4, 2, cpu); \
- SETUP_CHROMA_SP_FUNC_DEF(2, 4, cpu); \
SETUP_CHROMA_SP_FUNC_DEF(8, 8, cpu); \
SETUP_CHROMA_SP_FUNC_DEF(8, 4, cpu); \
SETUP_CHROMA_SP_FUNC_DEF(4, 8, cpu); \
SETUP_CHROMA_SP_FUNC_DEF(8, 6, cpu); \
- SETUP_CHROMA_SP_FUNC_DEF(6, 8, cpu); \
SETUP_CHROMA_SP_FUNC_DEF(8, 2, cpu); \
- SETUP_CHROMA_SP_FUNC_DEF(2, 8, cpu); \
SETUP_CHROMA_SP_FUNC_DEF(16, 16, cpu); \
SETUP_CHROMA_SP_FUNC_DEF(16, 8, cpu); \
SETUP_CHROMA_SP_FUNC_DEF(8, 16, cpu); \
@@ -359,7 +357,6 @@ void Setup_Assembly_Primitives(EncoderPr
INIT8(sad_x3, _mmx2);
INIT8(sad_x4, _mmx2);
INIT8(satd, _mmx2);
- HEVC_SATD(mmx2);
p.satd[LUMA_8x32] = x265_pixel_satd_8x32_sse2;
p.satd[LUMA_12x16] = x265_pixel_satd_12x16_sse2;
p.satd[LUMA_16x4] = x265_pixel_satd_16x4_sse2;
@@ -526,6 +523,10 @@ void Setup_Assembly_Primitives(EncoderPr
p.chroma_copy_sp[CHROMA_2x4] = x265_blockcopy_sp_2x4_sse4;
p.chroma_copy_sp[CHROMA_2x8] = x265_blockcopy_sp_2x8_sse4;
p.chroma_copy_sp[CHROMA_6x8] = x265_blockcopy_sp_6x8_sse4;
+
+ p.chroma_vsp[CHROMA_2x4] = x265_interp_4tap_vert_sp_2x4_sse4;
+ p.chroma_vsp[CHROMA_2x8] = x265_interp_4tap_vert_sp_2x8_sse4;
+ p.chroma_vsp[CHROMA_6x8] = x265_interp_4tap_vert_sp_6x8_sse4;
}
if (cpuMask & X265_CPU_AVX)
{
@@ -539,6 +540,7 @@ void Setup_Assembly_Primitives(EncoderPr
p.sa8d[BLOCK_16x16] = x265_pixel_sa8d_16x16_avx;
SA8D_INTER_FROM_BLOCK(avx);
ASSGN_SSE(avx);
+ HEVC_SATD(avx);
p.sad_x3[LUMA_12x16] = x265_pixel_sad_x3_12x16_avx;
p.sad_x4[LUMA_12x16] = x265_pixel_sad_x4_12x16_avx;
@@ -588,10 +590,17 @@ void Setup_Assembly_Primitives(EncoderPr
{
INIT2(sad_x4, _avx2);
INIT4(satd, _avx2);
- HEVC_SATD(avx2);
INIT2_NAME(sse_pp, ssd, _avx2);
p.sa8d[BLOCK_8x8] = x265_pixel_sa8d_8x8_avx2;
SA8D_INTER_FROM_BLOCK8(avx2);
+ p.satd[LUMA_32x32] = cmp<32, 32, 16, 16, x265_pixel_satd_16x16_avx2>;
+ p.satd[LUMA_24x32] = cmp<24, 32, 8, 16, x265_pixel_satd_8x16_avx2>;
+ p.satd[LUMA_64x64] = cmp<64, 64, 16, 16, x265_pixel_satd_16x16_avx2>;
+ p.satd[LUMA_64x32] = cmp<64, 32, 16, 16, x265_pixel_satd_16x16_avx2>;
+ p.satd[LUMA_32x64] = cmp<32, 64, 16, 16, x265_pixel_satd_16x16_avx2>;
+ p.satd[LUMA_64x48] = cmp<64, 48, 16, 16, x265_pixel_satd_16x16_avx2>;
+ p.satd[LUMA_48x64] = cmp<48, 64, 16, 16, x265_pixel_satd_16x16_avx2>;
+ p.satd[LUMA_64x16] = cmp<64, 16, 16, 16, x265_pixel_satd_16x16_avx2>;
p.sad_x4[LUMA_16x12] = x265_pixel_sad_x4_16x12_avx2;
p.sad_x4[LUMA_16x32] = x265_pixel_sad_x4_16x32_avx2;
diff -r c4ca80d19105 -r e871fe75d5ab source/common/x86/ipfilter8.asm
--- a/source/common/x86/ipfilter8.asm Tue Nov 12 19:10:23 2013 +0530
+++ b/source/common/x86/ipfilter8.asm Wed Nov 13 13:52:43 2013 +0000
@@ -1401,6 +1401,874 @@ FILTER_V4_W8_8x6 8, 6
RET
+;-------------------------------------------------------------------------------------------------------------
+; void interp_4tap_vert_ps_4x2(pixel *src, intptr_t srcStride, int16_t *dst, intptr_t dstStride, int coeffIdx)
+;-------------------------------------------------------------------------------------------------------------
+INIT_XMM sse4
+cglobal interp_4tap_vert_ps_4x2, 4, 6, 8
+
+mov r4d, r4m
+sub r0, r1
+add r3d, r3d
+
+%ifdef PIC
+lea r5, [tab_ChromaCoeff]
+movd m0, [r5 + r4 * 4]
+%else
+movd m0, [tab_ChromaCoeff + r4 * 4]
+%endif
+
+pshufb m0, [tab_Cm]
+
+mova m1, [tab_c_8192]
+
+movd m2, [r0]
+movd m3, [r0 + r1]
+movd m4, [r0 + 2 * r1]
+lea r5, [r0 + 2 * r1]
+movd m5, [r5 + r1]
+
+punpcklbw m2, m3
+punpcklbw m6, m4, m5
+punpcklbw m2, m6
+
+pmaddubsw m2, m0
+
+movd m6, [r0 + 4 * r1]
+
+punpcklbw m3, m4
+punpcklbw m5, m6
+punpcklbw m3, m5
+
+pmaddubsw m3, m0
+
+phaddw m2, m3
+
+psubw m2, m1
+movlps [r2], m2
+movhps [r2 + r3], m2
+
+RET
+
+;-------------------------------------------------------------------------------------------------------------
+; void interp_4tap_vert_ps_4x4(pixel *src, intptr_t srcStride, int16_t *dst, intptr_t dstStride, int coeffIdx)
+;-------------------------------------------------------------------------------------------------------------
+INIT_XMM sse4
+cglobal interp_4tap_vert_ps_4x4, 4, 7, 8
+
+ mov r4d, r4m
+ sub r0, r1
+ add r3d, r3d
+
+%ifdef PIC
+ lea r5, [tab_ChromaCoeff]
+ movd m0, [r5 + r4 * 4]
+%else
+ movd m0, [tab_ChromaCoeff + r4 * 4]
+%endif
+
+ pshufb m0, [tab_Cm]
+
+ mova m1, [tab_c_8192]
+
+ movd m2, [r0]
+ movd m3, [r0 + r1]
+ movd m4, [r0 + 2 * r1]
+ lea r5, [r0 + 2 * r1]
+ movd m5, [r5 + r1]
+
+ punpcklbw m2, m3
+ punpcklbw m6, m4, m5
+ punpcklbw m2, m6
+
+ pmaddubsw m2, m0
+
+ movd m6, [r0 + 4 * r1]
+
+ punpcklbw m3, m4
+ punpcklbw m7, m5, m6
+ punpcklbw m3, m7
+
+ pmaddubsw m3, m0
+
+ phaddw m2, m3
+
+ psubw m2, m1
+ movlps [r2], m2
+ movhps [r2 + r3], m2
+
+ lea r5, [r0 + 4 * r1]
+ movd m2, [r5 + r1]
+
+ punpcklbw m4, m5
+ punpcklbw m3, m6, m2
+ punpcklbw m4, m3
+
+ pmaddubsw m4, m0
+
+ movd m3, [r5 + 2 * r1]
+
+ punpcklbw m5, m6
+ punpcklbw m2, m3
+ punpcklbw m5, m2
+
+ pmaddubsw m5, m0
+
+ phaddw m4, m5
+
+ psubw m4, m1
+ movlps [r2 + 2 * r3], m4
+ lea r6, [r2 + 2 * r3]
+ movhps [r6 + r3], m4
+
+ RET
+
+;---------------------------------------------------------------------------------------------------------------
+; void interp_4tap_vert_ps_%1x%2(pixel *src, intptr_t srcStride, int16_t *dst, intptr_t dstStride, int coeffIdx)
+;---------------------------------------------------------------------------------------------------------------
+%macro FILTER_V_PS_W4_H4 2
+INIT_XMM sse4
+cglobal interp_4tap_vert_ps_%1x%2, 4, 7, 8
+
+ mov r4d, r4m
+ sub r0, r1
+ add r3d, r3d
+
+%ifdef PIC
+ lea r5, [tab_ChromaCoeff]
+ movd m0, [r5 + r4 * 4]
+%else
+ movd m0, [tab_ChromaCoeff + r4 * 4]
+%endif
+
+ pshufb m0, [tab_Cm]
+
+ mova m1, [tab_c_8192]
+
+ mov r4d, %2/4
+
+.loop
+ movd m2, [r0]
+ movd m3, [r0 + r1]
+ movd m4, [r0 + 2 * r1]
+ lea r5, [r0 + 2 * r1]
+ movd m5, [r5 + r1]
+
+ punpcklbw m2, m3
+ punpcklbw m6, m4, m5
+ punpcklbw m2, m6
+
+ pmaddubsw m2, m0
+