[x265-commits] [x265] asm: Modifications to intrapred16 modes 3, 4, 32 and 33 s...
Murugan Vairavel
murugan at multicorewareinc.com
Wed Feb 5 21:21:34 CET 2014
details: http://hg.videolan.org/x265/rev/cd73618857c5
branches:
changeset: 6022:cd73618857c5
user: Murugan Vairavel <murugan at multicorewareinc.com>
date: Tue Feb 04 13:00:44 2014 +0530
description:
asm: Modifications to intrapred16 modes 3, 4, 32 and 33 such that they use the TRANSPOSE_STORE macro of intrapred32
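For the steep angular modes touched here (3, 4, 32 and 33) the prediction comes out a column at a time, so it has to be transposed before being written row by row; reusing intrapred32's TRANSPOSE_STORE avoids carrying a second copy of that code. Below is a minimal intrinsics sketch of the transpose-then-store idea for a 4x4 byte tile; the function name and tile size are illustrative only, not the actual macro, which is written in assembly and handles the full intrapred16/intrapred32 block widths.

#include <emmintrin.h> // SSE2
#include <stdint.h>
#include <string.h>

// Minimal sketch of the transpose-then-store idea for a 4x4 byte tile.
// r0..r3 each hold one predicted column in their low 4 bytes; the byte and
// word unpacks interleave them so every 32-bit lane of t2 is one output row.
static inline void transpose_store_4x4(uint8_t *dst, intptr_t stride,
                                       __m128i r0, __m128i r1,
                                       __m128i r2, __m128i r3)
{
    __m128i t0 = _mm_unpacklo_epi8(r0, r1);   // a0 b0 a1 b1 ...
    __m128i t1 = _mm_unpacklo_epi8(r2, r3);   // c0 d0 c1 d1 ...
    __m128i t2 = _mm_unpacklo_epi16(t0, t1);  // a0 b0 c0 d0 | a1 b1 c1 d1 | ...

    int32_t row0 = _mm_cvtsi128_si32(t2);
    int32_t row1 = _mm_cvtsi128_si32(_mm_srli_si128(t2, 4));
    int32_t row2 = _mm_cvtsi128_si32(_mm_srli_si128(t2, 8));
    int32_t row3 = _mm_cvtsi128_si32(_mm_srli_si128(t2, 12));
    memcpy(dst + 0 * stride, &row0, 4);
    memcpy(dst + 1 * stride, &row1, 4);
    memcpy(dst + 2 * stride, &row2, 4);
    memcpy(dst + 3 * stride, &row3, 4);
}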
Subject: [x265] asm: intra_pred_ang8 asm code for all modes
details: http://hg.videolan.org/x265/rev/669000ad4a0d
branches:
changeset: 6023:669000ad4a0d
user: Yuvaraj Venkatesh <yuvaraj at multicorewareinc.com>
date: Tue Feb 04 15:11:07 2014 +0530
description:
asm: intra_pred_ang8 asm code for all modes
Subject: [x265] all_angs_pred_16x16, asm code
details: http://hg.videolan.org/x265/rev/906d972bb4b7
branches:
changeset: 6024:906d972bb4b7
user: Praveen Tiwari
date: Wed Feb 05 17:48:22 2014 +0530
description:
all_angs_pred_16x16, asm code
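all_angs_pred_16x16 evaluates every angular mode in a single call so mode decision can read all candidates from one buffer. Judging from the removed 4x4 vector version further down (pixel(*dstN)[4 * 4] indexes one block per mode), each mode's prediction is stored contiguously. A hedged C++ sketch of that calling pattern, with pred_ang_t standing in for the real per-mode kernels (which are the assembly routines, not reproduced here):

#include <stdint.h>

typedef uint8_t pixel; // 8-bit build, matching intrapred8.asm

// Hypothetical single-mode predictor type used only for this sketch.
typedef void (*pred_ang_t)(pixel *dst, const pixel *above, const pixel *left,
                           int mode, bool filter);

// Sketch (assumption): the 33 angular modes 2..34 are evaluated in one call,
// each mode's 16x16 block written contiguously, so mode m starts at
// dst + (m - 2) * 16 * 16 -- the same per-mode layout the removed 4x4
// vector code expresses with pixel(*dstN)[4 * 4].
void all_angs_pred_16x16_ref(pred_ang_t predict, pixel *dst,
                             const pixel *above, const pixel *left, bool filter)
{
    for (int mode = 2; mode <= 34; mode++)
        predict(dst + (mode - 2) * 16 * 16, above, left, mode, filter);
}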
Subject: [x265] asm: remove redundant macro definition
details: http://hg.videolan.org/x265/rev/ea99e4d138cd
branches:
changeset: 6025:ea99e4d138cd
user: Steve Borho <steve at borho.org>
date: Wed Feb 05 13:30:16 2014 -0600
description:
asm: remove redundant macro definition
Subject: [x265] vec: remove 4x4, 8x8, and 16x16 allangs functions; covered by assembly
details: http://hg.videolan.org/x265/rev/bf4dbea1e4f5
branches:
changeset: 6026:bf4dbea1e4f5
user: Steve Borho <steve at borho.org>
date: Wed Feb 05 13:34:06 2014 -0600
description:
vec: remove 4x4, 8x8, and 16x16 allangs functions; covered by assembly
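These removals are possible because x265 dispatches through a table of function pointers: portable C references are installed first and assembly versions overwrite them when the CPU supports the required instruction set, so an intrinsic implementation becomes dead weight once assembly covers the same primitive. A rough sketch of that idea, using illustrative names (Primitives, setup_asm_primitives) rather than x265's actual table and setup functions:

#include <stdint.h>

typedef uint8_t pixel;

// Function-pointer type matching the signature of the removed predIntraAngs4
// shown in the diff below (two reference arrays plus a filter flag).
typedef void (*intra_allangs_t)(pixel *dst, pixel *above0, pixel *left0,
                                pixel *above1, pixel *left1, bool filter);

// Illustrative table only; x265's real table and setup code live in
// primitives.h / asm-primitives.cpp.
struct Primitives
{
    intra_allangs_t intra_pred_allangs[4]; // 4x4, 8x8, 16x16, 32x32
};

void setup_asm_primitives(Primitives &p, bool haveSSE4,
                          intra_allangs_t asm4x4,
                          intra_allangs_t asm8x8,
                          intra_allangs_t asm16x16)
{
    if (!haveSSE4)
        return;                           // keep the C fallbacks already installed
    p.intra_pred_allangs[0] = asm4x4;     // once assembly covers a size, the
    p.intra_pred_allangs[1] = asm8x8;     // matching intrinsic version can be
    p.intra_pred_allangs[2] = asm16x16;   // deleted, which is what these commits do
}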
Subject: [x265] vec: remove 4x4 and 8x8 intra mode prediction functions, asm coverage
details: http://hg.videolan.org/x265/rev/8c9e1b3564e8
branches:
changeset: 6027:8c9e1b3564e8
user: Steve Borho <steve at borho.org>
date: Wed Feb 05 13:39:02 2014 -0600
description:
vec: remove 4x4 and 8x8 intra mode prediction functions, asm coverage
diffstat:
source/common/vec/intra-sse41.cpp | 5482 +------------------------------
source/common/vec/intra-ssse3.cpp | 1388 -------
source/common/x86/asm-primitives.cpp | 29 +-
source/common/x86/intrapred.h | 2 +-
source/common/x86/intrapred8.asm | 6026 +++++++++++++++++++++++++++++++--
5 files changed, 5680 insertions(+), 7247 deletions(-)
diffs (truncated from 13334 to 300 lines):
diff -r 2f54c7616ef8 -r 8c9e1b3564e8 source/common/vec/intra-sse41.cpp
--- a/source/common/vec/intra-sse41.cpp Wed Feb 05 12:34:25 2014 -0600
+++ b/source/common/vec/intra-sse41.cpp Wed Feb 05 13:39:02 2014 -0600
@@ -35,54 +35,6 @@ using namespace x265;
namespace {
#if !HIGH_BIT_DEPTH
-ALIGN_VAR_32(static const unsigned char, tab_angle_0[][16]) =
-{
- { 0, 1, 1, 2, 2, 3, 3, 4, 4, 5, 5, 6, 6, 7, 7, 8 }, // 0
- { 15, 0, 0, 1, 2, 3, 4, 5, 7, 0, 0, 9, 10, 11, 12, 13 }, // 1
- { 12, 0, 0, 1, 2, 3, 4, 5, 3, 0, 0, 9, 10, 11, 12, 13 }, // 2
- { 15, 11, 12, 0, 0, 1, 2, 3, 7, 3, 4, 0, 0, 9, 10, 11 }, // 3
- { 13, 12, 11, 8, 8, 1, 2, 3, 5, 4, 3, 0, 0, 9, 10, 11 }, // 4
- { 9, 0, 0, 1, 2, 3, 4, 5, 1, 0, 0, 9, 10, 11, 12, 13 }, // 5
- { 11, 10, 9, 0, 0, 1, 2, 3, 4, 2, 1, 0, 0, 9, 10, 11 }, // 6
- { 15, 12, 11, 10, 9, 0, 0, 1, 7, 4, 3, 2, 1, 0, 0, 9 }, // 7
- { 0, 10, 11, 13, 1, 0, 10, 11, 3, 2, 0, 10, 5, 4, 2, 0 }, // 8
-
- { 1, 2, 2, 3, 3, 4, 4, 5, 5, 6, 6, 7, 7, 8, 8, 9 }, // 9
- { 2, 3, 3, 4, 4, 5, 5, 6, 6, 7, 7, 8, 8, 9, 9, 10 }, // 10
- { 3, 4, 4, 5, 5, 6, 6, 7, 7, 8, 8, 9, 9, 10, 10, 11 }, // 11
- { 4, 5, 5, 6, 6, 7, 7, 8, 8, 9, 9, 10, 10, 11, 11, 12 }, // 12
- { 5, 6, 6, 7, 7, 8, 8, 9, 9, 10, 10, 11, 11, 12, 12, 13 }, // 13
- { 6, 7, 7, 8, 8, 9, 9, 10, 10, 11, 11, 12, 12, 13, 13, 14 }, // 14
- { 9, 0, 0, 1, 1, 2, 2, 3, 3, 4, 4, 5, 5, 6, 6, 7 }, // 15
- { 2, 0, 0, 8, 8, 9, 9, 10, 10, 11, 11, 12, 12, 13, 13, 14 }, // 16
- { 11, 0, 0, 1, 1, 2, 2, 3, 3, 4, 4, 5, 5, 6, 6, 7 }, // 17
- { 4, 0, 0, 8, 8, 9, 9, 10, 10, 11, 11, 12, 12, 13, 13, 14 }, // 18
- { 14, 11, 11, 0, 0, 1, 1, 2, 2, 3, 3, 4, 4, 5, 5, 6 }, // 19
- { 7, 4, 4, 0, 0, 8, 8, 9, 9, 10, 10, 11, 11, 12, 12, 13 }, // 20
- { 13, 0, 0, 1, 1, 2, 2, 3, 3, 4, 4, 5, 5, 6, 6, 7 }, // 21
- { 6, 0, 0, 8, 8, 9, 9, 10, 10, 11, 11, 12, 12, 13, 13, 14 }, // 22
- { 12, 9, 9, 0, 0, 1, 1, 2, 2, 3, 3, 4, 4, 5, 5, 6 }, // 23
- { 5, 2, 2, 0, 0, 8, 8, 9, 9, 10, 10, 11, 11, 12, 12, 13 }, // 24
- { 14, 12, 12, 9, 9, 0, 0, 1, 1, 2, 2, 3, 3, 4, 4, 5 }, // 25
- { 7, 5, 5, 2, 2, 0, 0, 8, 8, 9, 9, 10, 10, 11, 11, 12 }, // 26
- { 11, 9, 9, 0, 0, 1, 1, 2, 2, 3, 3, 4, 4, 5, 5, 6 }, // 27
- { 4, 2, 2, 0, 0, 8, 8, 9, 9, 10, 10, 11, 11, 12, 12, 13 }, // 28
- { 13, 11, 11, 9, 9, 0, 0, 1, 1, 2, 2, 3, 3, 4, 4, 5 }, // 29
- { 6, 4, 4, 2, 2, 0, 0, 8, 8, 9, 9, 10, 10, 11, 11, 12 }, // 30
- { 15, 13, 13, 11, 11, 9, 9, 0, 0, 1, 1, 2, 2, 3, 3, 4 }, // 31
- { 10, 9, 9, 0, 0, 1, 1, 2, 2, 3, 3, 4, 4, 5, 5, 6 }, // 32
- { 3, 2, 2, 0, 0, 8, 8, 9, 9, 10, 10, 11, 11, 12, 12, 13 }, // 33
- { 12, 10, 10, 9, 9, 0, 0, 1, 1, 2, 2, 3, 3, 4, 4, 5 }, // 34
- { 5, 3, 3, 2, 2, 0, 0, 8, 8, 9, 9, 10, 10, 11, 11, 12 }, // 35
- { 13, 12, 12, 10, 10, 9, 9, 0, 0, 1, 1, 2, 2, 3, 3, 4 }, // 36
- { 6, 5, 5, 3, 3, 2, 2, 0, 0, 8, 8, 9, 9, 10, 10, 11 }, // 37
- { 15, 13, 13, 12, 12, 10, 10, 9, 9, 0, 0, 1, 1, 2, 2, 3 }, // 38
- { 0, 7, 6, 5, 4, 2, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0 }, // 39
- { 15, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14 }, // 40
-
- { 7, 6, 5, 4, 3, 2, 1, 0, 15, 14, 13, 12, 11, 10, 9, 8 }, // 41
-};
-
// TODO: Remove unused table and merge here
ALIGN_VAR_32(static const unsigned char, tab_angle_2[][16]) =
{
@@ -131,5415 +83,6 @@ ALIGN_VAR_32(static const char, tab_angl
#undef MAKE_COEF8
};
-// See doc/intra/T4.TXT for algorithm details
-void predIntraAngs4(pixel *dst, pixel *above0, pixel *left0, pixel *above1, pixel *left1, bool filter)
-{
- // avoid warning
- (void)left1;
- (void)above1;
-
- pixel(*dstN)[4 * 4] = (pixel(*)[4 * 4])dst;
-
- __m128i T00, T01, T02, T03, T04, T05, T06, T07;
- __m128i T10, T11, T12, T13;
- __m128i T20, T21, T22, T23;
- __m128i T30, T31, T32;
- __m128i R00, R10, R20, R30;
- __m128i R01, R11, R21, R31;
-
- R00 = _mm_loadu_si128((__m128i*)(left0 + 1)); // [-- -- -- -- -- -- -- -- -08 -07 -06 -05 -04 -03 -02 -01]
- R10 = _mm_srli_si128(R00, 1); // [-- -- -- -- -- -- -- -- -- -08 -07 -06 -05 -04 -03 -02]
- R20 = _mm_srli_si128(R00, 2); // [-- -- -- -- -- -- -- -- -- -- -08 -07 -06 -05 -04 -03]
- R30 = _mm_srli_si128(R00, 3); // [-- -- -- -- -- -- -- -- -- -- -- -08 -07 -06 -05 -04]
-
- R01 = _mm_loadu_si128((__m128i*)(above0 + 1)); // [-- -- -- -- -- -- -- -- 08 07 06 05 04 03 02 01]
- R11 = _mm_srli_si128(R01, 1); // [-- -- -- -- -- -- -- -- -- 08 07 06 05 04 03 02]
- R21 = _mm_srli_si128(R01, 2); // [-- -- -- -- -- -- -- -- -- -- 08 07 06 05 04 03]
- R31 = _mm_srli_si128(R01, 3); // [-- -- -- -- -- -- -- -- -- -- -- 08 07 06 05 04]
-
- T00 = _mm_unpacklo_epi32(R00, R00);
- T00 = _mm_unpacklo_epi64(T00, T00);
- _mm_store_si128((__m128i*)dstN[8], T00);
-
- T00 = _mm_unpacklo_epi32(R01, R01);
- T00 = _mm_unpacklo_epi64(T00, T00);
- _mm_store_si128((__m128i*)dstN[24], T00);
-
- if (filter)
- {
- __m128i roundH, roundV;
- __m128i pL = _mm_set1_epi16(left0[1]);
- __m128i pT = _mm_set1_epi16(above0[1]);
- roundH = _mm_set1_epi16(above0[0]);
- roundV = roundH;
-
- roundH = _mm_srai_epi16(_mm_sub_epi16(_mm_unpacklo_epi8(R01, _mm_setzero_si128()), roundH), 1);
- roundV = _mm_srai_epi16(_mm_sub_epi16(_mm_unpacklo_epi8(R00, _mm_setzero_si128()), roundV), 1);
-
- T00 = _mm_add_epi16(roundH, pL);
- T00 = _mm_packus_epi16(T00, T00);
- T01 = _mm_add_epi16(roundV, pT);
- T01 = _mm_packus_epi16(T01, T01);
-
- int tmp0;
- tmp0 = _mm_cvtsi128_si32(T00);
- dstN[8][0 * 4] = tmp0 & 0xFF;
- dstN[8][1 * 4] = (tmp0 >> 8) & 0xFF;
- dstN[8][2 * 4] = (tmp0 >> 16) & 0xFF;
- dstN[8][3 * 4] = (tmp0 >> 24) & 0xFF;
-
- tmp0 = _mm_cvtsi128_si32(T01);
- dstN[24][0 * 4] = tmp0 & 0xFF;
- dstN[24][1 * 4] = (tmp0 >> 8) & 0xFF;
- dstN[24][2 * 4] = (tmp0 >> 16) & 0xFF;
- dstN[24][3 * 4] = (tmp0 >> 24) & 0xFF;
- }
-
- const __m128i c_16 = _mm_set1_epi16(16);
-
- T00 = _mm_shufflelo_epi16(R10, 0x94);
- T01 = _mm_shufflelo_epi16(R20, 0x94);
- T00 = _mm_unpacklo_epi32(T00, T01);
- _mm_store_si128((__m128i*)dstN[0], T00);
-
- T00 = _mm_shufflelo_epi16(R11, 0x94);
- T01 = _mm_shufflelo_epi16(R21, 0x94);
- T00 = _mm_unpacklo_epi32(T00, T01);
- _mm_store_si128((__m128i*)dstN[32], T00);
-
- T00 = _mm_shuffle_epi8(R00, _mm_load_si128((__m128i*)tab_angle_0[0]));
- T01 = _mm_shuffle_epi8(R10, _mm_load_si128((__m128i*)tab_angle_0[0]));
- T02 = _mm_shuffle_epi8(R20, _mm_load_si128((__m128i*)tab_angle_0[0]));
- T03 = _mm_shuffle_epi8(R30, _mm_load_si128((__m128i*)tab_angle_0[0]));
- T04 = _mm_shuffle_epi8(R01, _mm_load_si128((__m128i*)tab_angle_0[0]));
- T05 = _mm_shuffle_epi8(R11, _mm_load_si128((__m128i*)tab_angle_0[0]));
- T06 = _mm_shuffle_epi8(R21, _mm_load_si128((__m128i*)tab_angle_0[0]));
- T07 = _mm_shuffle_epi8(R31, _mm_load_si128((__m128i*)tab_angle_0[0]));
- T00 = _mm_unpacklo_epi64(T00, T04);
- T01 = _mm_unpacklo_epi64(T01, T05);
- T02 = _mm_unpacklo_epi64(T02, T06);
- T03 = _mm_unpacklo_epi64(T03, T07);
-
- T10 = _mm_maddubs_epi16(T00, _mm_load_si128((__m128i*)tab_angle_1[26]));
- T11 = _mm_maddubs_epi16(T01, _mm_load_si128((__m128i*)tab_angle_1[20]));
- T12 = _mm_maddubs_epi16(T02, _mm_load_si128((__m128i*)tab_angle_1[14]));
- T13 = _mm_maddubs_epi16(T03, _mm_load_si128((__m128i*)tab_angle_1[8]));
- T20 = _mm_unpacklo_epi64(T10, T11);
- T21 = _mm_unpacklo_epi64(T12, T13);
- T20 = _mm_srai_epi16(_mm_add_epi16(T20, c_16), 5);
- T21 = _mm_srai_epi16(_mm_add_epi16(T21, c_16), 5);
- T20 = _mm_packus_epi16(T20, T21);
- _mm_store_si128((__m128i*)dstN[1], T20);
- T22 = _mm_unpackhi_epi64(T10, T11);
- T23 = _mm_unpackhi_epi64(T12, T13);
- T22 = _mm_srai_epi16(_mm_add_epi16(T22, c_16), 5);
- T23 = _mm_srai_epi16(_mm_add_epi16(T23, c_16), 5);
- T22 = _mm_packus_epi16(T22, T23);
- _mm_store_si128((__m128i*)dstN[31], T22);
-
- T10 = _mm_maddubs_epi16(T00, _mm_load_si128((__m128i*)tab_angle_1[21]));
- T11 = _mm_maddubs_epi16(T01, _mm_load_si128((__m128i*)tab_angle_1[10]));
- T12 = _mm_maddubs_epi16(T01, _mm_load_si128((__m128i*)tab_angle_1[31]));
- T13 = _mm_maddubs_epi16(T02, _mm_load_si128((__m128i*)tab_angle_1[20]));
- T20 = _mm_unpacklo_epi64(T10, T11);
- T21 = _mm_unpacklo_epi64(T12, T13);
- T20 = _mm_srai_epi16(_mm_add_epi16(T20, c_16), 5);
- T21 = _mm_srai_epi16(_mm_add_epi16(T21, c_16), 5);
- T20 = _mm_packus_epi16(T20, T21);
- _mm_store_si128((__m128i*)dstN[2], T20);
- T22 = _mm_unpackhi_epi64(T10, T11);
- T23 = _mm_unpackhi_epi64(T12, T13);
- T22 = _mm_srai_epi16(_mm_add_epi16(T22, c_16), 5);
- T23 = _mm_srai_epi16(_mm_add_epi16(T23, c_16), 5);
- T22 = _mm_packus_epi16(T22, T23);
- _mm_store_si128((__m128i*)dstN[30], T22);
-
- T10 = _mm_maddubs_epi16(T00, _mm_load_si128((__m128i*)tab_angle_1[17]));
- T11 = _mm_maddubs_epi16(T01, _mm_load_si128((__m128i*)tab_angle_1[2]));
- T12 = _mm_maddubs_epi16(T01, _mm_load_si128((__m128i*)tab_angle_1[19]));
- T13 = _mm_maddubs_epi16(T02, _mm_load_si128((__m128i*)tab_angle_1[4]));
- T20 = _mm_unpacklo_epi64(T10, T11);
- T21 = _mm_unpacklo_epi64(T12, T13);
- T20 = _mm_srai_epi16(_mm_add_epi16(T20, c_16), 5);
- T21 = _mm_srai_epi16(_mm_add_epi16(T21, c_16), 5);
- T20 = _mm_packus_epi16(T20, T21);
- _mm_store_si128((__m128i*)dstN[3], T20);
- T22 = _mm_unpackhi_epi64(T10, T11);
- T23 = _mm_unpackhi_epi64(T12, T13);
- T22 = _mm_srai_epi16(_mm_add_epi16(T22, c_16), 5);
- T23 = _mm_srai_epi16(_mm_add_epi16(T23, c_16), 5);
- T22 = _mm_packus_epi16(T22, T23);
- _mm_store_si128((__m128i*)dstN[29], T22);
-
- T10 = _mm_maddubs_epi16(T00, _mm_load_si128((__m128i*)tab_angle_1[13]));
- T11 = _mm_maddubs_epi16(T00, _mm_load_si128((__m128i*)tab_angle_1[26]));
- T12 = _mm_maddubs_epi16(T01, _mm_load_si128((__m128i*)tab_angle_1[7]));
- T13 = _mm_maddubs_epi16(T01, _mm_load_si128((__m128i*)tab_angle_1[20]));
- T20 = _mm_unpacklo_epi64(T10, T11);
- T21 = _mm_unpacklo_epi64(T12, T13);
- T20 = _mm_srai_epi16(_mm_add_epi16(T20, c_16), 5);
- T21 = _mm_srai_epi16(_mm_add_epi16(T21, c_16), 5);
- T20 = _mm_packus_epi16(T20, T21);
- _mm_store_si128((__m128i*)dstN[4], T20);
- T22 = _mm_unpackhi_epi64(T10, T11);
- T23 = _mm_unpackhi_epi64(T12, T13);
- T22 = _mm_srai_epi16(_mm_add_epi16(T22, c_16), 5);
- T23 = _mm_srai_epi16(_mm_add_epi16(T23, c_16), 5);
- T22 = _mm_packus_epi16(T22, T23);
- _mm_store_si128((__m128i*)dstN[28], T22);
-
- T10 = _mm_maddubs_epi16(T00, _mm_load_si128((__m128i*)tab_angle_1[9]));
- T11 = _mm_maddubs_epi16(T00, _mm_load_si128((__m128i*)tab_angle_1[18]));
- T12 = _mm_maddubs_epi16(T00, _mm_load_si128((__m128i*)tab_angle_1[27]));
- T13 = _mm_maddubs_epi16(T01, _mm_load_si128((__m128i*)tab_angle_1[4]));
- T20 = _mm_unpacklo_epi64(T10, T11);
- T21 = _mm_unpacklo_epi64(T12, T13);
- T20 = _mm_srai_epi16(_mm_add_epi16(T20, c_16), 5);
- T21 = _mm_srai_epi16(_mm_add_epi16(T21, c_16), 5);
- T20 = _mm_packus_epi16(T20, T21);
- _mm_store_si128((__m128i*)dstN[5], T20);
- T22 = _mm_unpackhi_epi64(T10, T11);
- T23 = _mm_unpackhi_epi64(T12, T13);
- T22 = _mm_srai_epi16(_mm_add_epi16(T22, c_16), 5);
- T23 = _mm_srai_epi16(_mm_add_epi16(T23, c_16), 5);
- T22 = _mm_packus_epi16(T22, T23);
- _mm_store_si128((__m128i*)dstN[27], T22);
-
- T10 = _mm_maddubs_epi16(T00, _mm_load_si128((__m128i*)tab_angle_1[5]));
- T11 = _mm_maddubs_epi16(T00, _mm_load_si128((__m128i*)tab_angle_1[10]));
- T12 = _mm_maddubs_epi16(T00, _mm_load_si128((__m128i*)tab_angle_1[15]));
- T13 = _mm_maddubs_epi16(T00, _mm_load_si128((__m128i*)tab_angle_1[20]));
- T20 = _mm_unpacklo_epi64(T10, T11);
- T21 = _mm_unpacklo_epi64(T12, T13);
- T20 = _mm_srai_epi16(_mm_add_epi16(T20, c_16), 5);
- T21 = _mm_srai_epi16(_mm_add_epi16(T21, c_16), 5);
- T20 = _mm_packus_epi16(T20, T21);
- _mm_store_si128((__m128i*)dstN[6], T20);
- T22 = _mm_unpackhi_epi64(T10, T11);
- T23 = _mm_unpackhi_epi64(T12, T13);
- T22 = _mm_srai_epi16(_mm_add_epi16(T22, c_16), 5);
- T23 = _mm_srai_epi16(_mm_add_epi16(T23, c_16), 5);
- T22 = _mm_packus_epi16(T22, T23);
- _mm_store_si128((__m128i*)dstN[26], T22);
-
- T10 = _mm_maddubs_epi16(T00, _mm_load_si128((__m128i*)tab_angle_1[2]));
- T11 = _mm_maddubs_epi16(T00, _mm_load_si128((__m128i*)tab_angle_1[4]));
- T12 = _mm_maddubs_epi16(T00, _mm_load_si128((__m128i*)tab_angle_1[6]));
- T13 = _mm_maddubs_epi16(T00, _mm_load_si128((__m128i*)tab_angle_1[8]));
- T20 = _mm_unpacklo_epi64(T10, T11);
- T21 = _mm_unpacklo_epi64(T12, T13);
- T20 = _mm_srai_epi16(_mm_add_epi16(T20, c_16), 5);
- T21 = _mm_srai_epi16(_mm_add_epi16(T21, c_16), 5);
- T20 = _mm_packus_epi16(T20, T21);
- _mm_store_si128((__m128i*)dstN[7], T20);
- T22 = _mm_unpackhi_epi64(T10, T11);
- T23 = _mm_unpackhi_epi64(T12, T13);
- T22 = _mm_srai_epi16(_mm_add_epi16(T22, c_16), 5);
- T23 = _mm_srai_epi16(_mm_add_epi16(T23, c_16), 5);
- T22 = _mm_packus_epi16(T22, T23);
- _mm_store_si128((__m128i*)dstN[25], T22);
-
- R00 = _mm_loadu_si128((__m128i*)(left0)); // [-- -- -- -- -- -- -- -08 -07 -06 -05 -04 -03 -02 -01 00]
- R10 = _mm_srli_si128(R00, 1); // [-- -- -- -- -- -- -- -- -08 -07 -06 -05 -04 -03 -02 -01]
- R20 = _mm_srli_si128(R00, 2); // [-- -- -- -- -- -- -- -- -- -08 -07 -06 -05 -04 -03 -02]
- R30 = _mm_srli_si128(R00, 3); // [-- -- -- -- -- -- -- -- -- -- -08 -07 -06 -05 -04 -03]
-
- R01 = _mm_loadu_si128((__m128i*)(above0)); // [-- -- -- -- -- -- -- 08 07 06 05 04 03 02 01 00]
- R11 = _mm_srli_si128(R01, 1); // [-- -- -- -- -- -- -- -- 08 07 06 05 04 03 02 01]
- R21 = _mm_srli_si128(R01, 2); // [-- -- -- -- -- -- -- -- -- 08 07 06 05 04 03 02]
- R31 = _mm_srli_si128(R01, 3); // [-- -- -- -- -- -- -- -- -- -- 08 07 06 05 04 03]
-
- T00 = _mm_shuffle_epi8(R00, _mm_load_si128((__m128i*)tab_angle_0[0])); // [ -- -08 -07 -06 -06 -05 -05 -04 -04 -03 -03 -02 -02 -01 -01 00]
- T04 = _mm_shuffle_epi8(R01, _mm_load_si128((__m128i*)tab_angle_0[0])); // [ -- 08 07 06 06 05 05 04 04 03 03 02 02 01 01 00]
- T00 = _mm_unpacklo_epi64(T00, T04); // [ 04 03 03 02 02 01 01 00 -04 -03 -03 -02 -02 -01 -01 00]
-
- T10 = _mm_maddubs_epi16(T00, _mm_load_si128((__m128i*)tab_angle_1[30]));
- T11 = _mm_maddubs_epi16(T00, _mm_load_si128((__m128i*)tab_angle_1[28]));
- T12 = _mm_maddubs_epi16(T00, _mm_load_si128((__m128i*)tab_angle_1[26]));
- T13 = _mm_maddubs_epi16(T00, _mm_load_si128((__m128i*)tab_angle_1[24]));
- T20 = _mm_unpacklo_epi64(T10, T11);
- T21 = _mm_unpacklo_epi64(T12, T13);
- T20 = _mm_srai_epi16(_mm_add_epi16(T20, c_16), 5);
- T21 = _mm_srai_epi16(_mm_add_epi16(T21, c_16), 5);
- T20 = _mm_packus_epi16(T20, T21);
- _mm_store_si128((__m128i*)dstN[9], T20);
- T22 = _mm_unpackhi_epi64(T10, T11);
- T23 = _mm_unpackhi_epi64(T12, T13);
- T22 = _mm_srai_epi16(_mm_add_epi16(T22, c_16), 5);
- T23 = _mm_srai_epi16(_mm_add_epi16(T23, c_16), 5);
- T22 = _mm_packus_epi16(T22, T23);
- _mm_store_si128((__m128i*)dstN[23], T22);
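The arithmetic all of the removed intrinsics above implement is the standard HEVC 2-tap angular interpolation: each predicted sample is a weighted blend of two adjacent reference samples with weights (32 - frac, frac), rounded by adding 16 and shifting right by 5, which is the recurring _mm_add_epi16(..., c_16) / _mm_srai_epi16(..., 5) pattern. A scalar reference sketch of that step only (the projection of row/column indices onto the reference array for negative angles is omitted):

#include <stdint.h>

typedef uint8_t pixel;

// Scalar reference for the per-sample interpolation the SIMD code performs.
static inline pixel intra_angular_interp(const pixel *ref, int idx, int frac)
{
    return (pixel)(((32 - frac) * ref[idx] + frac * ref[idx + 1] + 16) >> 5);
}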