[x265-commits] [x265] asm: Modifications to intrapred16 modes 3, 4, 32 and 33 so that they use the TRANSPOSE_STORE macro of intrapred32

Murugan Vairavel murugan at multicorewareinc.com
Wed Feb 5 21:21:34 CET 2014


details:   http://hg.videolan.org/x265/rev/cd73618857c5
branches:  
changeset: 6022:cd73618857c5
user:      Murugan Vairavel <murugan at multicorewareinc.com>
date:      Tue Feb 04 13:00:44 2014 +0530
description:
asm: Modifications to intrapred16 modes 3, 4, 32 and 33 so that they use the TRANSPOSE_STORE macro of intrapred32
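
For context: the horizontal-class angles (modes 3 and 4) are usually computed as the transpose of their vertical-class mirrors (modes 33 and 32), so a store macro that can optionally transpose on the way out lets one computation path serve both orientations. Below is a minimal scalar illustration of that idea, assuming that is what the shared macro does; it is not the actual TRANSPOSE_STORE NASM macro, and the names and parameters are hypothetical.

    // Hypothetical scalar equivalent of a transpose-on-store helper; the real
    // TRANSPOSE_STORE is a NASM macro operating on SSE registers.
    #include <cstdint>

    typedef uint8_t pixel;

    template <int N>
    static void transpose_store(pixel* dst, intptr_t dstStride,
                                const pixel rows[N][N], bool transpose)
    {
        for (int r = 0; r < N; r++)
            for (int c = 0; c < N; c++)
            {
                // For horizontal-class modes, row r of the computed block
                // becomes column r of the output; vertical-class modes store
                // the rows directly.
                if (transpose)
                    dst[c * dstStride + r] = rows[r][c];
                else
                    dst[r * dstStride + c] = rows[r][c];
            }
    }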
Subject: [x265] asm: intra_pred_ang8 asm code for all modes

details:   http://hg.videolan.org/x265/rev/669000ad4a0d
branches:  
changeset: 6023:669000ad4a0d
user:      Yuvaraj Venkatesh <yuvaraj at multicorewareinc.com>
date:      Tue Feb 04 15:11:07 2014 +0530
description:
asm: intra_pred_ang8 asm code for all modes
Subject: [x265] all_angs_pred_16x16, asm code

details:   http://hg.videolan.org/x265/rev/906d972bb4b7
branches:  
changeset: 6024:906d972bb4b7
user:      Praveen Tiwari
date:      Wed Feb 05 17:48:22 2014 +0530
description:
all_angs_pred_16x16, asm code
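
As a rough picture of what an all_angs primitive produces (a hypothetical scalar sketch, not the committed assembly): it fills one 16x16 prediction per angular mode, packed back to back in the destination buffer, so the mode-decision loop can score every angle from a single call. The only detail taken from the diff below is the dest + (mode - 2) * 16 * 16 layout, which matches the dstN[mode - 2] indexing in the removed predIntraAngs4 intrinsics; the signature and the predictor callback are assumptions.

    // Hypothetical sketch of the "all angles" output layout; the per-mode
    // predictor is passed in rather than implemented here.
    #include <cstdint>

    typedef uint8_t pixel;
    typedef void (*intra_pred_ang_t)(pixel* dst, intptr_t dstStride,
                                     const pixel* above, const pixel* left, int mode);

    static void all_angs_pred_16x16_sketch(pixel* dest, const pixel* above,
                                           const pixel* left, intra_pred_ang_t predict)
    {
        const int size = 16;
        for (int mode = 2; mode <= 34; mode++)
        {
            // One contiguous 16x16 prediction per mode, at
            // dest + (mode - 2) * 16 * 16, mirroring the dstN[mode - 2]
            // indexing in the removed intrinsics.
            predict(dest + (mode - 2) * size * size, size, above, left, mode);
        }
    }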
Subject: [x265] asm: remove redundant macro definition

details:   http://hg.videolan.org/x265/rev/ea99e4d138cd
branches:  
changeset: 6025:ea99e4d138cd
user:      Steve Borho <steve at borho.org>
date:      Wed Feb 05 13:30:16 2014 -0600
description:
asm: remove redundant macro definition
Subject: [x265] vec: remove 4x4, 8x8, and 16x16 allangs functions; covered by assembly

details:   http://hg.videolan.org/x265/rev/bf4dbea1e4f5
branches:  
changeset: 6026:bf4dbea1e4f5
user:      Steve Borho <steve at borho.org>
date:      Wed Feb 05 13:34:06 2014 -0600
description:
vec: remove 4x4, 8x8, and 16x16 allangs functions; covered by assembly
Subject: [x265] vec: remove 4x4 and 8x8 intra mode prediction functions, asm coverage

details:   http://hg.videolan.org/x265/rev/8c9e1b3564e8
branches:  
changeset: 6027:8c9e1b3564e8
user:      Steve Borho <steve at borho.org>
date:      Wed Feb 05 13:39:02 2014 -0600
description:
vec: remove 4x4 and 8x8 intra mode prediction functions, asm coverage

diffstat:

 source/common/vec/intra-sse41.cpp    |  5482 +------------------------------
 source/common/vec/intra-ssse3.cpp    |  1388 -------
 source/common/x86/asm-primitives.cpp |    29 +-
 source/common/x86/intrapred.h        |     2 +-
 source/common/x86/intrapred8.asm     |  6026 +++++++++++++++++++++++++++++++--
 5 files changed, 5680 insertions(+), 7247 deletions(-)

diffs (truncated from 13334 to 300 lines):

diff -r 2f54c7616ef8 -r 8c9e1b3564e8 source/common/vec/intra-sse41.cpp
--- a/source/common/vec/intra-sse41.cpp	Wed Feb 05 12:34:25 2014 -0600
+++ b/source/common/vec/intra-sse41.cpp	Wed Feb 05 13:39:02 2014 -0600
@@ -35,54 +35,6 @@ using namespace x265;
 
 namespace {
 #if !HIGH_BIT_DEPTH
-ALIGN_VAR_32(static const unsigned char, tab_angle_0[][16]) =
-{
-    { 0, 1, 1, 2, 2, 3, 3, 4, 4, 5, 5, 6, 6, 7, 7, 8 },         //  0
-    { 15, 0, 0, 1, 2, 3, 4, 5, 7, 0, 0, 9, 10, 11, 12, 13 },    //  1
-    { 12, 0, 0, 1, 2, 3, 4, 5, 3, 0, 0, 9, 10, 11, 12, 13 },    //  2
-    { 15, 11, 12, 0, 0, 1, 2, 3, 7, 3, 4, 0, 0, 9, 10, 11 },    //  3
-    { 13, 12, 11, 8, 8, 1, 2, 3, 5, 4, 3, 0, 0, 9, 10, 11 },    //  4
-    { 9, 0, 0, 1, 2, 3, 4, 5, 1, 0, 0, 9, 10, 11, 12, 13 },     //  5
-    { 11, 10, 9, 0, 0, 1, 2, 3, 4, 2, 1, 0, 0, 9, 10, 11 },     //  6
-    { 15, 12, 11, 10, 9, 0, 0, 1, 7, 4, 3, 2, 1, 0, 0, 9 },     //  7
-    { 0, 10, 11, 13, 1, 0, 10, 11, 3, 2, 0, 10, 5, 4, 2, 0 },    //  8
-
-    { 1, 2, 2, 3, 3, 4, 4,  5,  5,  6,  6,  7,  7,  8,  8,  9 },    //  9
-    { 2, 3, 3, 4, 4, 5, 5,  6,  6,  7,  7,  8,  8,  9,  9, 10 },    // 10
-    { 3, 4, 4, 5, 5, 6, 6,  7,  7,  8,  8,  9,  9, 10, 10, 11 },    // 11
-    { 4, 5, 5, 6, 6, 7, 7,  8,  8,  9,  9, 10, 10, 11, 11, 12 },    // 12
-    { 5, 6, 6, 7, 7, 8, 8,  9,  9, 10, 10, 11, 11, 12, 12, 13 },    // 13
-    { 6, 7, 7, 8, 8, 9, 9, 10, 10, 11, 11, 12, 12, 13, 13, 14 },    // 14
-    { 9, 0, 0, 1, 1, 2, 2,  3,  3,  4,  4,  5,  5,  6,  6,  7 },    // 15
-    { 2, 0, 0, 8, 8, 9, 9, 10, 10, 11, 11, 12, 12, 13, 13, 14 },    // 16
-    { 11, 0, 0, 1, 1, 2, 2, 3, 3, 4, 4, 5, 5, 6, 6, 7 },            // 17
-    { 4, 0, 0, 8, 8, 9, 9, 10, 10, 11, 11, 12, 12, 13, 13, 14 },    // 18
-    { 14, 11, 11, 0, 0, 1, 1, 2, 2, 3, 3, 4, 4, 5, 5, 6 },           // 19
-    { 7, 4, 4, 0, 0, 8, 8, 9, 9, 10, 10, 11, 11, 12, 12, 13 },      // 20
-    { 13, 0, 0, 1, 1, 2, 2, 3, 3, 4, 4, 5, 5, 6, 6, 7 },            // 21
-    { 6, 0, 0, 8, 8, 9, 9, 10, 10, 11, 11, 12, 12, 13, 13, 14 },    // 22
-    { 12, 9, 9, 0, 0, 1, 1, 2, 2, 3, 3, 4, 4, 5, 5, 6 },            // 23
-    { 5, 2, 2, 0, 0, 8, 8, 9, 9, 10, 10, 11, 11, 12, 12, 13 },      // 24
-    { 14, 12, 12, 9, 9, 0, 0, 1, 1, 2, 2, 3, 3, 4, 4, 5 },          // 25
-    { 7, 5, 5, 2, 2, 0, 0, 8, 8, 9, 9, 10, 10, 11, 11, 12 },        // 26
-    { 11, 9, 9, 0, 0, 1, 1, 2, 2, 3, 3, 4, 4, 5, 5, 6 },            // 27
-    { 4, 2, 2, 0, 0, 8, 8, 9, 9, 10, 10, 11, 11, 12, 12, 13 },      // 28
-    { 13, 11, 11, 9, 9, 0, 0, 1, 1, 2, 2, 3, 3, 4, 4, 5 },          // 29
-    { 6, 4, 4, 2, 2, 0, 0, 8, 8, 9, 9, 10, 10, 11, 11, 12 },        // 30
-    { 15, 13, 13, 11, 11, 9, 9, 0, 0, 1, 1, 2, 2, 3, 3, 4 },        // 31
-    { 10, 9, 9, 0, 0, 1, 1, 2, 2, 3, 3, 4, 4, 5, 5, 6 },            // 32
-    { 3, 2, 2, 0, 0, 8, 8, 9, 9, 10, 10, 11, 11, 12, 12, 13 },      // 33
-    { 12, 10, 10, 9, 9, 0, 0, 1, 1, 2, 2, 3, 3, 4, 4, 5 },          // 34
-    { 5, 3, 3, 2, 2, 0, 0, 8, 8, 9, 9, 10, 10, 11, 11, 12 },        // 35
-    { 13, 12, 12, 10, 10, 9, 9, 0, 0, 1, 1, 2, 2, 3, 3, 4 },        // 36
-    { 6, 5, 5, 3, 3, 2, 2, 0, 0, 8, 8, 9, 9, 10, 10, 11 },          // 37
-    { 15, 13, 13, 12, 12, 10, 10, 9, 9, 0, 0, 1, 1, 2, 2, 3 },      // 38
-    { 0, 7, 6, 5, 4, 2, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0 },             // 39
-    { 15, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14 },       // 40
-
-    { 7, 6, 5, 4, 3, 2, 1, 0, 15, 14, 13, 12, 11, 10, 9, 8 },       // 41
-};
-
 // TODO: Remove unused table and merge here
 ALIGN_VAR_32(static const unsigned char, tab_angle_2[][16]) =
 {
@@ -131,5415 +83,6 @@ ALIGN_VAR_32(static const char, tab_angl
 #undef MAKE_COEF8
 };
 
-// See doc/intra/T4.TXT for algorithm details
-void predIntraAngs4(pixel *dst, pixel *above0, pixel *left0, pixel *above1, pixel *left1, bool filter)
-{
-    // avoid warning
-    (void)left1;
-    (void)above1;
-
-    pixel(*dstN)[4 * 4] = (pixel(*)[4 * 4])dst;
-
-    __m128i T00, T01, T02, T03, T04, T05, T06, T07;
-    __m128i T10, T11, T12, T13;
-    __m128i T20, T21, T22, T23;
-    __m128i T30, T31, T32;
-    __m128i R00, R10, R20, R30;
-    __m128i R01, R11, R21, R31;
-
-    R00 = _mm_loadu_si128((__m128i*)(left0 + 1));    // [-- -- -- -- -- -- --  -- -08 -07 -06 -05 -04 -03 -02 -01]
-    R10 = _mm_srli_si128(R00, 1);                   // [-- -- -- -- -- -- --  --  -- -08 -07 -06 -05 -04 -03 -02]
-    R20 = _mm_srli_si128(R00, 2);                   // [-- -- -- -- -- -- --  --  --  -- -08 -07 -06 -05 -04 -03]
-    R30 = _mm_srli_si128(R00, 3);                   // [-- -- -- -- -- -- --  --  --  --  -- -08 -07 -06 -05 -04]
-
-    R01 = _mm_loadu_si128((__m128i*)(above0 + 1));   // [-- -- -- -- -- -- --  --  08  07  06  05  04  03  02  01]
-    R11 = _mm_srli_si128(R01, 1);                   // [-- -- -- -- -- -- --  --  --  08  07  06  05  04  03  02]
-    R21 = _mm_srli_si128(R01, 2);                   // [-- -- -- -- -- -- --  --  --  --  08  07  06  05  04  03]
-    R31 = _mm_srli_si128(R01, 3);                   // [-- -- -- -- -- -- --  --  --  --  --  08  07  06  05  04]
-
-    T00 = _mm_unpacklo_epi32(R00, R00);
-    T00 = _mm_unpacklo_epi64(T00, T00);
-    _mm_store_si128((__m128i*)dstN[8], T00);
-
-    T00 = _mm_unpacklo_epi32(R01, R01);
-    T00 = _mm_unpacklo_epi64(T00, T00);
-    _mm_store_si128((__m128i*)dstN[24], T00);
-
-    if (filter)
-    {
-        __m128i roundH, roundV;
-        __m128i pL = _mm_set1_epi16(left0[1]);
-        __m128i pT = _mm_set1_epi16(above0[1]);
-        roundH = _mm_set1_epi16(above0[0]);
-        roundV = roundH;
-
-        roundH = _mm_srai_epi16(_mm_sub_epi16(_mm_unpacklo_epi8(R01, _mm_setzero_si128()), roundH), 1);
-        roundV = _mm_srai_epi16(_mm_sub_epi16(_mm_unpacklo_epi8(R00, _mm_setzero_si128()), roundV), 1);
-
-        T00 = _mm_add_epi16(roundH, pL);
-        T00 = _mm_packus_epi16(T00, T00);
-        T01 = _mm_add_epi16(roundV, pT);
-        T01 = _mm_packus_epi16(T01, T01);
-
-        int tmp0;
-        tmp0 = _mm_cvtsi128_si32(T00);
-        dstN[8][0 * 4] = tmp0 & 0xFF;
-        dstN[8][1 * 4] = (tmp0 >> 8) & 0xFF;
-        dstN[8][2 * 4] = (tmp0 >> 16) & 0xFF;
-        dstN[8][3 * 4] = (tmp0 >> 24) & 0xFF;
-
-        tmp0 = _mm_cvtsi128_si32(T01);
-        dstN[24][0 * 4] = tmp0 & 0xFF;
-        dstN[24][1 * 4] = (tmp0 >> 8) & 0xFF;
-        dstN[24][2 * 4] = (tmp0 >> 16) & 0xFF;
-        dstN[24][3 * 4] = (tmp0 >> 24) & 0xFF;
-    }
-
-    const __m128i c_16 = _mm_set1_epi16(16);
-
-    T00 = _mm_shufflelo_epi16(R10, 0x94);
-    T01 = _mm_shufflelo_epi16(R20, 0x94);
-    T00 = _mm_unpacklo_epi32(T00, T01);
-    _mm_store_si128((__m128i*)dstN[0], T00);
-
-    T00 = _mm_shufflelo_epi16(R11, 0x94);
-    T01 = _mm_shufflelo_epi16(R21, 0x94);
-    T00 = _mm_unpacklo_epi32(T00, T01);
-    _mm_store_si128((__m128i*)dstN[32], T00);
-
-    T00 = _mm_shuffle_epi8(R00, _mm_load_si128((__m128i*)tab_angle_0[0]));
-    T01 = _mm_shuffle_epi8(R10, _mm_load_si128((__m128i*)tab_angle_0[0]));
-    T02 = _mm_shuffle_epi8(R20, _mm_load_si128((__m128i*)tab_angle_0[0]));
-    T03 = _mm_shuffle_epi8(R30, _mm_load_si128((__m128i*)tab_angle_0[0]));
-    T04 = _mm_shuffle_epi8(R01, _mm_load_si128((__m128i*)tab_angle_0[0]));
-    T05 = _mm_shuffle_epi8(R11, _mm_load_si128((__m128i*)tab_angle_0[0]));
-    T06 = _mm_shuffle_epi8(R21, _mm_load_si128((__m128i*)tab_angle_0[0]));
-    T07 = _mm_shuffle_epi8(R31, _mm_load_si128((__m128i*)tab_angle_0[0]));
-    T00 = _mm_unpacklo_epi64(T00, T04);
-    T01 = _mm_unpacklo_epi64(T01, T05);
-    T02 = _mm_unpacklo_epi64(T02, T06);
-    T03 = _mm_unpacklo_epi64(T03, T07);
-
-    T10 = _mm_maddubs_epi16(T00, _mm_load_si128((__m128i*)tab_angle_1[26]));
-    T11 = _mm_maddubs_epi16(T01, _mm_load_si128((__m128i*)tab_angle_1[20]));
-    T12 = _mm_maddubs_epi16(T02, _mm_load_si128((__m128i*)tab_angle_1[14]));
-    T13 = _mm_maddubs_epi16(T03, _mm_load_si128((__m128i*)tab_angle_1[8]));
-    T20 = _mm_unpacklo_epi64(T10, T11);
-    T21 = _mm_unpacklo_epi64(T12, T13);
-    T20 = _mm_srai_epi16(_mm_add_epi16(T20, c_16), 5);
-    T21 = _mm_srai_epi16(_mm_add_epi16(T21, c_16), 5);
-    T20 = _mm_packus_epi16(T20, T21);
-    _mm_store_si128((__m128i*)dstN[1], T20);
-    T22 = _mm_unpackhi_epi64(T10, T11);
-    T23 = _mm_unpackhi_epi64(T12, T13);
-    T22 = _mm_srai_epi16(_mm_add_epi16(T22, c_16), 5);
-    T23 = _mm_srai_epi16(_mm_add_epi16(T23, c_16), 5);
-    T22 = _mm_packus_epi16(T22, T23);
-    _mm_store_si128((__m128i*)dstN[31], T22);
-
-    T10 = _mm_maddubs_epi16(T00, _mm_load_si128((__m128i*)tab_angle_1[21]));
-    T11 = _mm_maddubs_epi16(T01, _mm_load_si128((__m128i*)tab_angle_1[10]));
-    T12 = _mm_maddubs_epi16(T01, _mm_load_si128((__m128i*)tab_angle_1[31]));
-    T13 = _mm_maddubs_epi16(T02, _mm_load_si128((__m128i*)tab_angle_1[20]));
-    T20 = _mm_unpacklo_epi64(T10, T11);
-    T21 = _mm_unpacklo_epi64(T12, T13);
-    T20 = _mm_srai_epi16(_mm_add_epi16(T20, c_16), 5);
-    T21 = _mm_srai_epi16(_mm_add_epi16(T21, c_16), 5);
-    T20 = _mm_packus_epi16(T20, T21);
-    _mm_store_si128((__m128i*)dstN[2], T20);
-    T22 = _mm_unpackhi_epi64(T10, T11);
-    T23 = _mm_unpackhi_epi64(T12, T13);
-    T22 = _mm_srai_epi16(_mm_add_epi16(T22, c_16), 5);
-    T23 = _mm_srai_epi16(_mm_add_epi16(T23, c_16), 5);
-    T22 = _mm_packus_epi16(T22, T23);
-    _mm_store_si128((__m128i*)dstN[30], T22);
-
-    T10 = _mm_maddubs_epi16(T00, _mm_load_si128((__m128i*)tab_angle_1[17]));
-    T11 = _mm_maddubs_epi16(T01, _mm_load_si128((__m128i*)tab_angle_1[2]));
-    T12 = _mm_maddubs_epi16(T01, _mm_load_si128((__m128i*)tab_angle_1[19]));
-    T13 = _mm_maddubs_epi16(T02, _mm_load_si128((__m128i*)tab_angle_1[4]));
-    T20 = _mm_unpacklo_epi64(T10, T11);
-    T21 = _mm_unpacklo_epi64(T12, T13);
-    T20 = _mm_srai_epi16(_mm_add_epi16(T20, c_16), 5);
-    T21 = _mm_srai_epi16(_mm_add_epi16(T21, c_16), 5);
-    T20 = _mm_packus_epi16(T20, T21);
-    _mm_store_si128((__m128i*)dstN[3], T20);
-    T22 = _mm_unpackhi_epi64(T10, T11);
-    T23 = _mm_unpackhi_epi64(T12, T13);
-    T22 = _mm_srai_epi16(_mm_add_epi16(T22, c_16), 5);
-    T23 = _mm_srai_epi16(_mm_add_epi16(T23, c_16), 5);
-    T22 = _mm_packus_epi16(T22, T23);
-    _mm_store_si128((__m128i*)dstN[29], T22);
-
-    T10 = _mm_maddubs_epi16(T00, _mm_load_si128((__m128i*)tab_angle_1[13]));
-    T11 = _mm_maddubs_epi16(T00, _mm_load_si128((__m128i*)tab_angle_1[26]));
-    T12 = _mm_maddubs_epi16(T01, _mm_load_si128((__m128i*)tab_angle_1[7]));
-    T13 = _mm_maddubs_epi16(T01, _mm_load_si128((__m128i*)tab_angle_1[20]));
-    T20 = _mm_unpacklo_epi64(T10, T11);
-    T21 = _mm_unpacklo_epi64(T12, T13);
-    T20 = _mm_srai_epi16(_mm_add_epi16(T20, c_16), 5);
-    T21 = _mm_srai_epi16(_mm_add_epi16(T21, c_16), 5);
-    T20 = _mm_packus_epi16(T20, T21);
-    _mm_store_si128((__m128i*)dstN[4], T20);
-    T22 = _mm_unpackhi_epi64(T10, T11);
-    T23 = _mm_unpackhi_epi64(T12, T13);
-    T22 = _mm_srai_epi16(_mm_add_epi16(T22, c_16), 5);
-    T23 = _mm_srai_epi16(_mm_add_epi16(T23, c_16), 5);
-    T22 = _mm_packus_epi16(T22, T23);
-    _mm_store_si128((__m128i*)dstN[28], T22);
-
-    T10 = _mm_maddubs_epi16(T00, _mm_load_si128((__m128i*)tab_angle_1[9]));
-    T11 = _mm_maddubs_epi16(T00, _mm_load_si128((__m128i*)tab_angle_1[18]));
-    T12 = _mm_maddubs_epi16(T00, _mm_load_si128((__m128i*)tab_angle_1[27]));
-    T13 = _mm_maddubs_epi16(T01, _mm_load_si128((__m128i*)tab_angle_1[4]));
-    T20 = _mm_unpacklo_epi64(T10, T11);
-    T21 = _mm_unpacklo_epi64(T12, T13);
-    T20 = _mm_srai_epi16(_mm_add_epi16(T20, c_16), 5);
-    T21 = _mm_srai_epi16(_mm_add_epi16(T21, c_16), 5);
-    T20 = _mm_packus_epi16(T20, T21);
-    _mm_store_si128((__m128i*)dstN[5], T20);
-    T22 = _mm_unpackhi_epi64(T10, T11);
-    T23 = _mm_unpackhi_epi64(T12, T13);
-    T22 = _mm_srai_epi16(_mm_add_epi16(T22, c_16), 5);
-    T23 = _mm_srai_epi16(_mm_add_epi16(T23, c_16), 5);
-    T22 = _mm_packus_epi16(T22, T23);
-    _mm_store_si128((__m128i*)dstN[27], T22);
-
-    T10 = _mm_maddubs_epi16(T00, _mm_load_si128((__m128i*)tab_angle_1[5]));
-    T11 = _mm_maddubs_epi16(T00, _mm_load_si128((__m128i*)tab_angle_1[10]));
-    T12 = _mm_maddubs_epi16(T00, _mm_load_si128((__m128i*)tab_angle_1[15]));
-    T13 = _mm_maddubs_epi16(T00, _mm_load_si128((__m128i*)tab_angle_1[20]));
-    T20 = _mm_unpacklo_epi64(T10, T11);
-    T21 = _mm_unpacklo_epi64(T12, T13);
-    T20 = _mm_srai_epi16(_mm_add_epi16(T20, c_16), 5);
-    T21 = _mm_srai_epi16(_mm_add_epi16(T21, c_16), 5);
-    T20 = _mm_packus_epi16(T20, T21);
-    _mm_store_si128((__m128i*)dstN[6], T20);
-    T22 = _mm_unpackhi_epi64(T10, T11);
-    T23 = _mm_unpackhi_epi64(T12, T13);
-    T22 = _mm_srai_epi16(_mm_add_epi16(T22, c_16), 5);
-    T23 = _mm_srai_epi16(_mm_add_epi16(T23, c_16), 5);
-    T22 = _mm_packus_epi16(T22, T23);
-    _mm_store_si128((__m128i*)dstN[26], T22);
-
-    T10 = _mm_maddubs_epi16(T00, _mm_load_si128((__m128i*)tab_angle_1[2]));
-    T11 = _mm_maddubs_epi16(T00, _mm_load_si128((__m128i*)tab_angle_1[4]));
-    T12 = _mm_maddubs_epi16(T00, _mm_load_si128((__m128i*)tab_angle_1[6]));
-    T13 = _mm_maddubs_epi16(T00, _mm_load_si128((__m128i*)tab_angle_1[8]));
-    T20 = _mm_unpacklo_epi64(T10, T11);
-    T21 = _mm_unpacklo_epi64(T12, T13);
-    T20 = _mm_srai_epi16(_mm_add_epi16(T20, c_16), 5);
-    T21 = _mm_srai_epi16(_mm_add_epi16(T21, c_16), 5);
-    T20 = _mm_packus_epi16(T20, T21);
-    _mm_store_si128((__m128i*)dstN[7], T20);
-    T22 = _mm_unpackhi_epi64(T10, T11);
-    T23 = _mm_unpackhi_epi64(T12, T13);
-    T22 = _mm_srai_epi16(_mm_add_epi16(T22, c_16), 5);
-    T23 = _mm_srai_epi16(_mm_add_epi16(T23, c_16), 5);
-    T22 = _mm_packus_epi16(T22, T23);
-    _mm_store_si128((__m128i*)dstN[25], T22);
-
-    R00 = _mm_loadu_si128((__m128i*)(left0));      // [-- -- -- -- -- --  -- -08 -07 -06 -05 -04 -03 -02 -01  00]
-    R10 = _mm_srli_si128(R00, 1);                   // [-- -- -- -- -- --  --  -- -08 -07 -06 -05 -04 -03 -02 -01]
-    R20 = _mm_srli_si128(R00, 2);                   // [-- -- -- -- -- --  --  --  -- -08 -07 -06 -05 -04 -03 -02]
-    R30 = _mm_srli_si128(R00, 3);                   // [-- -- -- -- -- --  --  --  --  -- -08 -07 -06 -05 -04 -03]
-
-    R01 = _mm_loadu_si128((__m128i*)(above0));     // [-- -- -- -- -- -- --   08  07  06  05  04  03  02  01  00]
-    R11 = _mm_srli_si128(R01, 1);                   // [-- -- -- -- -- -- --   --  08  07  06  05  04  03  02  01]
-    R21 = _mm_srli_si128(R01, 2);                   // [-- -- -- -- -- -- --   --  --  08  07  06  05  04  03  02]
-    R31 = _mm_srli_si128(R01, 3);                   // [-- -- -- -- -- -- --   --  --  --  08  07  06  05  04  03]
-
-    T00 = _mm_shuffle_epi8(R00, _mm_load_si128((__m128i*)tab_angle_0[0]));    // [ -- -08 -07 -06 -06 -05 -05 -04 -04 -03 -03 -02 -02 -01 -01  00]
-    T04 = _mm_shuffle_epi8(R01, _mm_load_si128((__m128i*)tab_angle_0[0]));    // [ --  08  07  06  06  05  05  04  04  03  03  02  02  01  01  00]
-    T00 = _mm_unpacklo_epi64(T00, T04);     // [ 04  03  03  02  02  01  01  00 -04 -03 -03 -02 -02 -01 -01  00]
-
-    T10 = _mm_maddubs_epi16(T00, _mm_load_si128((__m128i*)tab_angle_1[30]));
-    T11 = _mm_maddubs_epi16(T00, _mm_load_si128((__m128i*)tab_angle_1[28]));
-    T12 = _mm_maddubs_epi16(T00, _mm_load_si128((__m128i*)tab_angle_1[26]));
-    T13 = _mm_maddubs_epi16(T00, _mm_load_si128((__m128i*)tab_angle_1[24]));
-    T20 = _mm_unpacklo_epi64(T10, T11);
-    T21 = _mm_unpacklo_epi64(T12, T13);
-    T20 = _mm_srai_epi16(_mm_add_epi16(T20, c_16), 5);
-    T21 = _mm_srai_epi16(_mm_add_epi16(T21, c_16), 5);
-    T20 = _mm_packus_epi16(T20, T21);
-    _mm_store_si128((__m128i*)dstN[9], T20);
-    T22 = _mm_unpackhi_epi64(T10, T11);
-    T23 = _mm_unpackhi_epi64(T12, T13);
-    T22 = _mm_srai_epi16(_mm_add_epi16(T22, c_16), 5);
-    T23 = _mm_srai_epi16(_mm_add_epi16(T23, c_16), 5);
-    T22 = _mm_packus_epi16(T22, T23);
-    _mm_store_si128((__m128i*)dstN[23], T22);
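
For readers skimming the removed intrinsics above: the recurring maddubs / add c_16 / srai 5 / packus sequence is the HEVC two-tap angular interpolation, pred[i] = ((32 - fract) * ref[idx] + fract * ref[idx + 1] + 16) >> 5, with the (32 - fract, fract) byte pairs supplied by the tab_angle_1 coefficient tables. A scalar equivalent of one such tap (illustrative only; names are assumed):

    // Scalar equivalent of one maddubs/add/shift/pack step from the intrinsics
    // above: two-tap interpolation between adjacent reference pixels with the
    // same +16 rounding and >> 5 used by the SIMD path. Names are illustrative.
    #include <cstdint>

    typedef uint8_t pixel;

    static inline pixel angular_tap(const pixel* ref, int idx, int fract)
    {
        int v = ((32 - fract) * ref[idx] + fract * ref[idx + 1] + 16) >> 5;
        return (pixel)(v > 255 ? 255 : v); // packus_epi16 saturates the same way
    }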


More information about the x265-commits mailing list