[x265-commits] [x265] TEncSearch: nit
Steve Borho
steve at borho.org
Tue Dec 10 18:04:56 CET 2013
details: http://hg.videolan.org/x265/rev/644d27bb26e9
branches:
changeset: 5650:644d27bb26e9
user: Steve Borho <steve at borho.org>
date: Mon Dec 09 11:05:58 2013 -0600
description:
TEncSearch: nit
Subject: [x265] ratecontrol: make weightp analysis aware of colorspaces
details: http://hg.videolan.org/x265/rev/f25e60a2b62c
branches:
changeset: 5651:f25e60a2b62c
user: Steve Borho <steve at borho.org>
date: Mon Dec 09 11:10:32 2013 -0600
description:
ratecontrol: make weightp analysis aware of colorspaces
Subject: [x265] dct: drop intrinsic DCT 8x8 primitive, we have asm coverage
details: http://hg.videolan.org/x265/rev/eacdbae47e47
branches:
changeset: 5652:eacdbae47e47
user: Steve Borho <steve at borho.org>
date: Mon Dec 09 11:13:58 2013 -0600
description:
dct: drop intrinsic DCT 8x8 primitive, we have asm coverage
Subject: [x265] ratecontrol: avoid reads past the end of chroma buffers
details: http://hg.videolan.org/x265/rev/67e711fde921
branches: stable
changeset: 5653:67e711fde921
user: Steve Borho <steve at borho.org>
date: Mon Dec 09 11:36:11 2013 -0600
description:
ratecontrol: avoid reads past the end of chroma buffers
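Both ratecontrol commits above touch the same hazard: with a subsampled colorspace such as 4:2:0, the chroma planes are half the luma width and height, so any analysis loop bounded by the luma dimensions walks past the end of a chroma buffer. The actual patches are not included in this digest; the snippet below is only a generic C++ sketch of the idea, and every name in it (Plane, safeChromaSum, the shift parameters) is illustrative rather than x265 API.

    #include <cstddef>
    #include <cstdint>

    struct Plane
    {
        const uint8_t* buf;
        size_t stride, width, height;           // dimensions of *this* plane
    };

    // Bound the loop by the chroma plane's own dimensions, derived from the
    // luma size and the colorspace subsampling shifts (hShift = vShift = 1
    // for 4:2:0, vShift = 0 for 4:2:2, both 0 for 4:4:4).
    static uint64_t safeChromaSum(const Plane& chroma,
                                  size_t lumaWidth, size_t lumaHeight,
                                  int hShift, int vShift)
    {
        size_t cWidth  = lumaWidth  >> hShift;
        size_t cHeight = lumaHeight >> vShift;
        uint64_t acc = 0;
        for (size_t y = 0; y < cHeight; y++)    // NOT lumaHeight
            for (size_t x = 0; x < cWidth; x++) // NOT lumaWidth
                acc += chroma.buf[y * chroma.stride + x];
        return acc;
    }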
Subject: [x265] Merge with stable
details: http://hg.videolan.org/x265/rev/c6c73ef24c97
branches:
changeset: 5654:c6c73ef24c97
user: Steve Borho <steve at borho.org>
date: Mon Dec 09 11:57:16 2013 -0600
description:
Merge with stable
Subject: [x265] sbac: move global tables into x265 namespace
details: http://hg.videolan.org/x265/rev/7d4f5cbc68e7
branches: stable
changeset: 5655:7d4f5cbc68e7
user: Steve Borho <steve at borho.org>
date: Tue Dec 03 23:56:22 2013 -0600
description:
sbac: move global tables into x265 namespace
Subject: [x265] Merge with stable
details: http://hg.videolan.org/x265/rev/a88c5723d266
branches:
changeset: 5656:a88c5723d266
user: Steve Borho <steve at borho.org>
date: Mon Dec 09 13:01:26 2013 -0600
description:
Merge with stable
Subject: [x265] log: fix crash caused by logging after CU analysis
details: http://hg.videolan.org/x265/rev/ef26367cd10c
branches:
changeset: 5657:ef26367cd10c
user: Kavitha Sampath <kavitha at multicorewareinc.com>
date: Tue Dec 10 12:01:51 2013 +0530
description:
log: fix crash caused by logging after CU analysis
Subject: [x265] asm: align branch targets to 16 bytes
details: http://hg.videolan.org/x265/rev/89fea75bbc1b
branches:
changeset: 5658:89fea75bbc1b
user: Min Chen <chenm003 at 163.com>
date: Tue Dec 10 13:08:41 2013 +0800
description:
asm: align branch targets to 16 bytes
Subject: [x265] asm: 16bpp asm code for intra_pred_ang4_7
details: http://hg.videolan.org/x265/rev/f33ca21fe0c2
branches:
changeset: 5659:f33ca21fe0c2
user: Yuvaraj Venkatesh <yuvaraj at multicorewareinc.com>
date: Mon Dec 09 19:38:00 2013 +0550
description:
asm: 16bpp asm code for intra_pred_ang4_7
Subject: [x265] asm: 16bpp asm code for intra_pred_ang4_8 and intra_pred_ang4_9
details: http://hg.videolan.org/x265/rev/66d8405320d2
branches:
changeset: 5660:66d8405320d2
user: Yuvaraj Venkatesh <yuvaraj at multicorewareinc.com>
date: Mon Dec 09 20:17:21 2013 +0550
description:
asm: 16bpp asm code for intra_pred_ang4_8 and intra_pred_ang4_9
Subject: [x265] asm: 10bpp code of blockcopy_pp for 2xN, 4xN, 6x8 and 8xN blocks
details: http://hg.videolan.org/x265/rev/285a4d8c42a0
branches:
changeset: 5661:285a4d8c42a0
user: Murugan Vairavel <murugan at multicorewareinc.com>
date: Mon Dec 09 21:44:11 2013 +0550
description:
asm: 10bpp code of blockcopy_pp for 2xN, 4xN, 6x8 and 8xN blocks
Subject: [x265] asm: improve IntraPredDC_32x32 by replacing macro expansion with a loop
details: http://hg.videolan.org/x265/rev/9a8b0e81330f
branches:
changeset: 5662:9a8b0e81330f
user: Min Chen <chenm003 at 163.com>
date: Tue Dec 10 13:46:06 2013 +0800
description:
asm: improve IntraPredDC_32x32 by replacing macro expansion with a loop
Subject: [x265] rename IntraPred.cpp to intrapred.cpp to avoid name conflict
details: http://hg.videolan.org/x265/rev/7810ce2bdb53
branches:
changeset: 5663:7810ce2bdb53
user: Min Chen <chenm003 at 163.com>
date: Tue Dec 10 13:54:40 2013 +0800
description:
rename IntraPred.cpp to intrapred.cpp to avoid name conflict
Subject: [x265] asm: Intra Planar 16x16
details: http://hg.videolan.org/x265/rev/5604254f779e
branches:
changeset: 5664:5604254f779e
user: Min Chen <chenm003 at 163.com>
date: Tue Dec 10 18:37:35 2013 +0800
description:
asm: Intra Planar 16x16
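The planar mode this new 16x16 assembly implements is the standard HEVC planar predictor: each output sample blends the left and above reference samples with the above-right and below-left corners. The reference-style sketch below illustrates the arithmetic only; the actual x265 C primitive (and the way it packs its left/above reference arrays) may differ in detail.

    #include <cstdint>

    typedef uint8_t pixel;   // 8bpp build; the 16bpp build uses uint16_t

    // HEVC planar prediction, size = 16 for this commit. 'left' and 'above'
    // are assumed to hold size+1 samples each, so above[size] is the
    // above-right reference and left[size] the below-left reference.
    static void planar_pred_sketch(pixel* dst, intptr_t dstStride,
                                   const pixel* left, const pixel* above,
                                   int size)
    {
        int log2Size = 0;
        while ((1 << log2Size) < size)
            log2Size++;
        pixel aboveRight = above[size];
        pixel belowLeft  = left[size];
        for (int y = 0; y < size; y++)
            for (int x = 0; x < size; x++)
                dst[y * dstStride + x] = (pixel)(
                    ((size - 1 - x) * left[y]  + (x + 1) * aboveRight +
                     (size - 1 - y) * above[x] + (y + 1) * belowLeft  +
                     size) >> (log2Size + 1));
    }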
Subject: [x265] testbench: fix wrong width parameter in check_planar_primitive()
details: http://hg.videolan.org/x265/rev/5a7f116e3aae
branches:
changeset: 5665:5a7f116e3aae
user: Min Chen <chenm003 at 163.com>
date: Tue Dec 10 18:57:28 2013 +0800
description:
testbench: fix wrong width parameter in check_planar_primitive()
Subject: [x265] asm: pixel_add_ps integration code for luma and chroma partitions
details: http://hg.videolan.org/x265/rev/0c46964557c8
branches:
changeset: 5666:0c46964557c8
user: Murugan Vairavel <murugan at multicorewareinc.com>
date: Tue Dec 10 14:20:44 2013 +0550
description:
asm: pixel_add_ps integration code for luma and chroma partitions
Subject: [x265] asm: 10bpp code for blockcopy_ps_12x16
details: http://hg.videolan.org/x265/rev/54e8c012597c
branches:
changeset: 5667:54e8c012597c
user: Murugan Vairavel <murugan at multicorewareinc.com>
date: Tue Dec 10 14:42:56 2013 +0550
description:
asm: 10bpp code for blockcopy_ps_12x16
Subject: [x265] Bug fix in luma_hps C primitive.
details: http://hg.videolan.org/x265/rev/1863cdede774
branches:
changeset: 5668:1863cdede774
user: Nabajit Deka <nabajit at multicorewareinc.com>
date: Tue Dec 10 14:45:36 2013 +0550
description:
Bug fix in luma_hps C primitive.
Subject: [x265] asm: 10bpp code for blockcopy_ps_16xN
details: http://hg.videolan.org/x265/rev/2e56e8e76f72
branches:
changeset: 5669:2e56e8e76f72
user: Murugan Vairavel <murugan at multicorewareinc.com>
date: Tue Dec 10 15:15:04 2013 +0550
description:
asm: 10bpp code for blockcopy_ps_16xN
Subject: [x265] 16bpp: enabled blockfill_s primitive
details: http://hg.videolan.org/x265/rev/2cf9944afa92
branches:
changeset: 5670:2cf9944afa92
user: Dnyaneshwar G <dnyaneshwar at multicorewareinc.com>
date: Tue Dec 10 15:28:12 2013 +0550
description:
16bpp: enabled blockfill_s primitive
Subject: [x265] asm: 10bpp code for blockcopy_ps_24x32
details: http://hg.videolan.org/x265/rev/e7fff01a464b
branches:
changeset: 5671:e7fff01a464b
user: Murugan Vairavel <murugan at multicorewareinc.com>
date: Tue Dec 10 15:54:05 2013 +0550
description:
asm: 10bpp code for blockcopy_ps_24x32
Subject: [x265] asm: 10bpp code for blockcopy_ps_32xN
details: http://hg.videolan.org/x265/rev/64c8f43aa7ce
branches:
changeset: 5672:64c8f43aa7ce
user: Murugan Vairavel <murugan at multicorewareinc.com>
date: Tue Dec 10 16:11:27 2013 +0550
description:
asm: 10bpp code for blockcopy_ps_32xN
Subject: [x265] asm: 10bpp code for blockcopy_ps_48x64
details: http://hg.videolan.org/x265/rev/1679ad2da2a1
branches:
changeset: 5673:1679ad2da2a1
user: Murugan Vairavel <murugan at multicorewareinc.com>
date: Tue Dec 10 16:30:04 2013 +0550
description:
asm: 10bpp code for blockcopy_ps_48x64
Subject: [x265] asm: 16bpp asm code for intra_pred_ang4_10
details: http://hg.videolan.org/x265/rev/d604f25e6eab
branches:
changeset: 5674:d604f25e6eab
user: Yuvaraj Venkatesh <yuvaraj at multicorewareinc.com>
date: Tue Dec 10 18:00:18 2013 +0550
description:
asm: 16bpp asm code for intra_pred_ang4_10
Subject: [x265] asm: 10bpp blockcopy_ps bug fix
details: http://hg.videolan.org/x265/rev/8e34f135fd9e
branches:
changeset: 5675:8e34f135fd9e
user: Murugan Vairavel <murugan at multicorewareinc.com>
date: Tue Dec 10 19:09:25 2013 +0550
description:
asm: 10bpp blockcopy_ps bug fix
Subject: [x265] asm: 10bpp code for blockcopy_ps_64xN
details: http://hg.videolan.org/x265/rev/72e7899bef55
branches:
changeset: 5676:72e7899bef55
user: Murugan Vairavel <murugan at multicorewareinc.com>
date: Tue Dec 10 18:09:10 2013 +0550
description:
asm: 10bpp code for blockcopy_ps_64xN
Subject: [x265] asm: 16bpp code for intra_pred_ang4_26
details: http://hg.videolan.org/x265/rev/1eb855251cc5
branches:
changeset: 5677:1eb855251cc5
user: Yuvaraj Venkatesh <yuvaraj at multicorewareinc.com>
date: Tue Dec 10 18:14:02 2013 +0550
description:
asm: 16bpp code for intra_pred_ang4_26
Subject: [x265] asm: 10bpp blockcopy_ps integration for luma and chroma partitions
details: http://hg.videolan.org/x265/rev/887206700a13
branches:
changeset: 5678:887206700a13
user: Murugan Vairavel <murugan at multicorewareinc.com>
date: Tue Dec 10 18:15:20 2013 +0550
description:
asm: 10bpp blockcopy_ps integration for luma and chroma partitions
Subject: [x265] asm: 16bpp asm code for intra_pred_ang4 - mode 11,12,13
details: http://hg.videolan.org/x265/rev/b29166445321
branches:
changeset: 5679:b29166445321
user: Yuvaraj Venkatesh <yuvaraj at multicorewareinc.com>
date: Tue Dec 10 19:08:11 2013 +0550
description:
asm: 16bpp asm code for intra_pred_ang4 - mode 11,12,13
Subject: [x265] asm: 10bpp support for blockcopy_ps and blockcopy_sp
details: http://hg.videolan.org/x265/rev/8f8d4811352a
branches:
changeset: 5680:8f8d4811352a
user: Murugan Vairavel <murugan at multicorewareinc.com>
date: Tue Dec 10 19:16:51 2013 +0550
description:
asm: 10bpp support for blockcopy_ps and blockcopy_sp
Subject: [x265] asm: 16bpp asm code for intra_pred_ang4 - mode 14,15,16
details: http://hg.videolan.org/x265/rev/573a8cfac514
branches:
changeset: 5681:573a8cfac514
user: Yuvaraj Venkatesh <yuvaraj at multicorewareinc.com>
date: Tue Dec 10 19:35:01 2013 +0550
description:
asm: 16bpp asm code for intra_pred_ang4 - mode 14,15,16
Subject: [x265] asm: 10bpp code for calcresidual_4x4 and 8x8
details: http://hg.videolan.org/x265/rev/e4c13676c4b5
branches:
changeset: 5682:e4c13676c4b5
user: Murugan Vairavel <murugan at multicorewareinc.com>
date: Tue Dec 10 20:51:47 2013 +0550
description:
asm: 10bpp code for calcresidual_4x4 and 8x8
Subject: [x265] asm: 16bpp asm code for intra_pred_ang4 - mode 17,18
details: http://hg.videolan.org/x265/rev/384d99887688
branches:
changeset: 5683:384d99887688
user: Yuvaraj Venkatesh <yuvaraj at multicorewareinc.com>
date: Tue Dec 10 21:24:43 2013 +0550
description:
asm: 16bpp asm code for intra_pred_ang4 - mode 17,18
Subject: [x265] Add comment for luma_hps and chroma_hps test bench code.
details: http://hg.videolan.org/x265/rev/1cc5b2d87d8b
branches:
changeset: 5684:1cc5b2d87d8b
user: Nabajit Deka <nabajit at multicorewareinc.com>
date: Tue Dec 10 21:31:25 2013 +0550
description:
Add comment for luma_hps and chroma_hps test bench code.
Subject: [x265] asm: Hook up luma_hps with the encoder.
details: http://hg.videolan.org/x265/rev/af1f46818bed
branches:
changeset: 5685:af1f46818bed
user: Nabajit Deka <nabajit at multicorewareinc.com>
date: Tue Dec 10 21:34:18 2013 +0550
description:
asm: Hook up luma_hps with the encoder.
Subject: [x265] asm: 10bpp code for calcresidual_16x16 and 32x32
details: http://hg.videolan.org/x265/rev/1169201b50c4
branches:
changeset: 5686:1169201b50c4
user: Murugan Vairavel <murugan at multicorewareinc.com>
date: Tue Dec 10 21:40:16 2013 +0550
description:
asm: 10bpp code for calcresidual_16x16 and 32x32
Subject: [x265] Merge
details: http://hg.videolan.org/x265/rev/c4fdea3fd659
branches:
changeset: 5687:c4fdea3fd659
user: Deepthi Nandakumar <deepthi at multicorewareinc.com>
date: Tue Dec 10 22:03:29 2013 +0530
description:
Merge
Subject: [x265] intra: fix 64bit build of intrapred16.asm - Min please review
details: http://hg.videolan.org/x265/rev/dcef9f3bca1e
branches:
changeset: 5688:dcef9f3bca1e
user: Steve Borho <steve at borho.org>
date: Tue Dec 10 11:04:41 2013 -0600
description:
intra: fix 64bit build of intrapred16.asm - Min please review
diffstat:
source/Lib/TLibCommon/TComPrediction.cpp | 8 +-
source/Lib/TLibEncoder/TEncCu.cpp | 14 +-
source/Lib/TLibEncoder/TEncSearch.cpp | 2 +-
source/common/ipfilter.cpp | 2 +-
source/common/vec/dct-ssse3.cpp | 180 ---
source/common/x86/asm-primitives.cpp | 90 +-
source/common/x86/blockcopy8.asm | 1699 ++++++++++++++++++-----------
source/common/x86/const-a.asm | 3 +
source/common/x86/intrapred16.asm | 483 +++++++-
source/common/x86/pixel-util.h | 2 +
source/common/x86/pixel-util8.asm | 169 ++-
source/encoder/motion.cpp | 5 +-
source/encoder/ratecontrol.cpp | 7 +-
source/test/intrapredharness.cpp | 42 +-
source/test/intrapredharness.h | 2 +-
source/test/ipfilterharness.cpp | 6 +-
16 files changed, 1781 insertions(+), 933 deletions(-)
diffs (truncated from 3199 to 300 lines):
diff -r 7bd7937e762b -r dcef9f3bca1e source/Lib/TLibCommon/TComPrediction.cpp
--- a/source/Lib/TLibCommon/TComPrediction.cpp Mon Dec 09 18:02:09 2013 +0530
+++ b/source/Lib/TLibCommon/TComPrediction.cpp Tue Dec 10 11:04:41 2013 -0600
@@ -449,7 +449,7 @@ void TComPrediction::xPredInterLumaBlk(T
int tmpStride = width;
int filterSize = NTAPS_LUMA;
int halfFilterSize = (filterSize >> 1);
- primitives.ipfilter_ps[FILTER_H_P_S_8](src - (halfFilterSize - 1) * srcStride, srcStride, m_immedVals, tmpStride, width, height + filterSize - 1, g_lumaFilter[xFrac]);
+ primitives.luma_hps[partEnum](src, srcStride, m_immedVals, tmpStride, xFrac, 1);
primitives.luma_vsp[partEnum](m_immedVals + (halfFilterSize - 1) * tmpStride, tmpStride, dst, dstStride, yFrac);
}
}
@@ -467,6 +467,8 @@ void TComPrediction::xPredInterLumaBlk(T
int xFrac = mv->x & 0x3;
int yFrac = mv->y & 0x3;
+ int partEnum = partitionFromSizes(width, height);
+
assert((width % 4) + (height % 4) == 0);
assert(dstStride == MAX_CU_SIZE);
@@ -476,7 +478,7 @@ void TComPrediction::xPredInterLumaBlk(T
}
else if (yFrac == 0)
{
- primitives.ipfilter_ps[FILTER_H_P_S_8](ref, refStride, dst, dstStride, width, height, g_lumaFilter[xFrac]);
+ primitives.luma_hps[partEnum](ref, refStride, dst, dstStride, xFrac, 0);
}
else if (xFrac == 0)
{
@@ -487,7 +489,7 @@ void TComPrediction::xPredInterLumaBlk(T
int tmpStride = width;
int filterSize = NTAPS_LUMA;
int halfFilterSize = (filterSize >> 1);
- primitives.ipfilter_ps[FILTER_H_P_S_8](ref - (halfFilterSize - 1) * refStride, refStride, m_immedVals, tmpStride, width, height + filterSize - 1, g_lumaFilter[xFrac]);
+ primitives.luma_hps[partEnum](ref, refStride, m_immedVals, tmpStride, xFrac, 1);
primitives.ipfilter_ss[FILTER_V_S_S_8](m_immedVals + (halfFilterSize - 1) * tmpStride, tmpStride, dst, dstStride, width, height, yFrac);
}
}
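These TComPrediction.cpp hunks are evidently from the luma_hps hookup listed above (changeset 5685). Read from the diff alone: the explicit width/height arguments disappear because the partition enum bakes the block size into the primitive, xFrac is now passed as a coefficient index instead of a g_lumaFilter row pointer, and the trailing flag asks the primitive to produce the extra intermediate rows that the following vertical pass consumes (the caller no longer rewinds src by (halfFilterSize - 1) rows itself). The sketch below captures only that reading of the convention, with placeholder rounding constants; it is not the actual x265 primitive.

    #include <cstddef>
    #include <cstdint>

    typedef uint8_t pixel;        // 8bpp build; uint16_t for the 16bpp build

    // Hedged C sketch of the implied luma_hps convention. The real primitive
    // takes a coefficient index and looks up g_lumaFilter; the 8 taps are
    // passed directly here to keep the sketch self-contained.
    static void luma_hps_sketch(const pixel* src, intptr_t srcStride,
                                int16_t* dst, intptr_t dstStride,
                                int width, int height,
                                const int16_t coeff[8], int isRowExt)
    {
        const int NTAPS  = 8;     // NTAPS_LUMA
        const int shift  = 6;     // placeholders; real code derives shift
        const int offset = 0;     //   and offset from the bit depth
        if (isRowExt)
        {
            src -= (NTAPS / 2 - 1) * srcStride; // start 3 rows above the block
            height += NTAPS - 1;                // and emit 7 extra rows for
        }                                       // the vertical pass
        src -= NTAPS / 2 - 1;                   // centre the horizontal taps
        for (int y = 0; y < height; y++, src += srcStride, dst += dstStride)
            for (int x = 0; x < width; x++)
            {
                int sum = 0;
                for (int t = 0; t < NTAPS; t++)
                    sum += src[x + t] * coeff[t];
                dst[x] = (int16_t)((sum + offset) >> shift);
            }
    }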
diff -r 7bd7937e762b -r dcef9f3bca1e source/Lib/TLibEncoder/TEncCu.cpp
--- a/source/Lib/TLibEncoder/TEncCu.cpp Mon Dec 09 18:02:09 2013 +0530
+++ b/source/Lib/TLibEncoder/TEncCu.cpp Tue Dec 10 11:04:41 2013 -0600
@@ -361,25 +361,25 @@ void TEncCu::compressCU(TComDataCU* cu)
{
xCompressIntraCU(m_bestCU[0], m_tempCU[0], 0);
int i = 0, part;
- part = m_bestCU[0]->getDepth(i);
+ part = cu->getDepth(i);
do
{
m_log->totalCu++;
- int next = m_bestCU[0]->getTotalNumPart() >> (part * 2);
- if (part == g_maxCUDepth - 1 && m_bestCU[0]->getPartitionSize(i) != SIZE_2Nx2N)
+ int next = cu->getTotalNumPart() >> (part * 2);
+ if (part == g_maxCUDepth - 1 && cu->getPartitionSize(i) != SIZE_2Nx2N)
{
m_log->cntIntraNxN++;
}
else
{
m_log->cntIntra[part]++;
- if (m_bestCU[0]->getLumaIntraDir()[i] > 1)
+ if (cu->getLumaIntraDir()[i] > 1)
m_log->cuIntraDistribution[part][ANGULAR_MODE_ID]++;
else
- m_log->cuIntraDistribution[part][m_bestCU[0]->getLumaIntraDir()[i]]++;
+ m_log->cuIntraDistribution[part][cu->getLumaIntraDir()[i]]++;
}
i += next;
- part = m_bestCU[0]->getDepth(i);
+ part = cu->getDepth(i);
}
while (part < g_maxCUDepth);
}
@@ -400,7 +400,7 @@ void TEncCu::compressCU(TComDataCU* cu)
do
{
m_log->cntTotalCu[part]++;
- int next = m_bestCU[0]->getTotalNumPart() >> (part * 2);
+ int next = cu->getTotalNumPart() >> (part * 2);
if (cu->isSkipped(i))
{
m_log->cntSkipCu[part]++;
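This TEncCu.cpp change is the "log: fix crash caused by logging after CU analysis" commit (changeset 5657): the statistics pass now reads from cu, the CTU that compressCU was just called on, rather than from the m_bestCU[0] working buffer, which per the commit message is not safe to consult once analysis has finished. The loop itself is a z-order walk over the CTU's 4x4 partition units: a CU coded at depth d spans getTotalNumPart() >> (2 * d) of those units, so advancing the index by that span jumps to the next coded CU. A simplified, self-contained model of that walk (toy depth map, hypothetical names, 64x64 CTU assumed):

    #include <cstdio>
    #include <vector>

    int main()
    {
        const int maxDepth = 4;              // 64x64 CTU down to 8x8 leaves
        const int numPartitions = 256;       // (64/4) * (64/4) 4x4 units
        // Toy depth map in z-order: one 32x32 CU, then twelve 16x16 CUs.
        std::vector<int> depth(numPartitions, 2);
        for (int i = 0; i < 64; i++)
            depth[i] = 1;

        int cntCu[4] = { 0, 0, 0, 0 };
        int i = 0;
        while (i < numPartitions)
        {
            int d = depth[i];
            cntCu[d]++;
            i += numPartitions >> (d * 2);   // skip over this CU's units
        }
        for (int d = 0; d < maxDepth; d++)
            std::printf("depth %d: %d CUs\n", d, cntCu[d]);
        return 0;
    }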
diff -r 7bd7937e762b -r dcef9f3bca1e source/Lib/TLibEncoder/TEncSearch.cpp
--- a/source/Lib/TLibEncoder/TEncSearch.cpp Mon Dec 09 18:02:09 2013 +0530
+++ b/source/Lib/TLibEncoder/TEncSearch.cpp Tue Dec 10 11:04:41 2013 -0600
@@ -1634,7 +1634,7 @@ void TEncSearch::estIntraPredQT(TComData
}
// PLANAR
- primitives.intra_pred[log2SizeMinus2][PLANAR_IDX](tmp, scaleStride,leftPlanar, abovePlanar, 0, 0);
+ primitives.intra_pred[log2SizeMinus2][PLANAR_IDX](tmp, scaleStride, leftPlanar, abovePlanar, 0, 0);
modeCosts[PLANAR_IDX] = costMultiplier * sa8d(fenc, scaleStride, tmp, scaleStride);
// Transpose NxN
diff -r 7bd7937e762b -r dcef9f3bca1e source/common/ipfilter.cpp
--- a/source/common/ipfilter.cpp Mon Dec 09 18:02:09 2013 +0530
+++ b/source/common/ipfilter.cpp Tue Dec 10 11:04:41 2013 -0600
@@ -305,7 +305,7 @@ void interp_horiz_ps_c(pixel *src, intpt
sum += src[col + 7] * coeff[7];
}
- int16_t val = (int16_t)(sum + offset) >> shift;
+ int16_t val = (int16_t)((sum + offset) >> shift);
dst[col] = val;
}
diff -r 7bd7937e762b -r dcef9f3bca1e source/common/vec/dct-ssse3.cpp
--- a/source/common/vec/dct-ssse3.cpp Mon Dec 09 18:02:09 2013 +0530
+++ b/source/common/vec/dct-ssse3.cpp Tue Dec 10 11:04:41 2013 -0600
@@ -62,185 +62,6 @@ ALIGN_VAR_32(static const int16_t, tab_d
{ 18, -18, -89, 89, -50, 50, 75, -75 },
};
-void dct8(int16_t *src, int32_t *dst, intptr_t stride)
-{
- // Const
- __m128i c_2 = _mm_set1_epi32(2);
- __m128i c_256 = _mm_set1_epi32(256);
-
- // DCT1
- __m128i T00, T01, T02, T03, T04, T05, T06, T07;
- __m128i T10, T11, T12, T13, T14, T15, T16, T17;
- __m128i T20, T21, T22, T23, T24, T25, T26, T27;
- __m128i T30, T31, T32, T33;
- __m128i T40, T41, T42, T43, T44, T45, T46, T47;
- __m128i T50, T51, T52, T53, T54, T55, T56, T57;
-
- T00 = _mm_load_si128((__m128i*)&src[0 * stride]); // [07 06 05 04 03 02 01 00]
- T01 = _mm_load_si128((__m128i*)&src[1 * stride]); // [17 16 15 14 13 12 11 10]
- T02 = _mm_load_si128((__m128i*)&src[2 * stride]); // [27 26 25 24 23 22 21 20]
- T03 = _mm_load_si128((__m128i*)&src[3 * stride]); // [37 36 35 34 33 32 31 30]
- T04 = _mm_load_si128((__m128i*)&src[4 * stride]); // [47 46 45 44 43 42 41 40]
- T05 = _mm_load_si128((__m128i*)&src[5 * stride]); // [57 56 55 54 53 52 51 50]
- T06 = _mm_load_si128((__m128i*)&src[6 * stride]); // [67 66 65 64 63 62 61 60]
- T07 = _mm_load_si128((__m128i*)&src[7 * stride]); // [77 76 75 74 73 72 71 70]
-
- T10 = _mm_shuffle_epi8(T00, _mm_load_si128((__m128i*)tab_dct_8[0])); // [05 02 06 01 04 03 07 00]
- T11 = _mm_shuffle_epi8(T01, _mm_load_si128((__m128i*)tab_dct_8[0]));
- T12 = _mm_shuffle_epi8(T02, _mm_load_si128((__m128i*)tab_dct_8[0]));
- T13 = _mm_shuffle_epi8(T03, _mm_load_si128((__m128i*)tab_dct_8[0]));
- T14 = _mm_shuffle_epi8(T04, _mm_load_si128((__m128i*)tab_dct_8[0]));
- T15 = _mm_shuffle_epi8(T05, _mm_load_si128((__m128i*)tab_dct_8[0]));
- T16 = _mm_shuffle_epi8(T06, _mm_load_si128((__m128i*)tab_dct_8[0]));
- T17 = _mm_shuffle_epi8(T07, _mm_load_si128((__m128i*)tab_dct_8[0]));
-
- T20 = _mm_hadd_epi16(T10, T11); // [s25_1 s16_1 s34_1 s07_1 s25_0 s16_0 s34_0 s07_0]
- T21 = _mm_hadd_epi16(T12, T13); // [s25_3 s16_3 s34_3 s07_3 s25_2 s16_2 s34_2 s07_2]
- T22 = _mm_hadd_epi16(T14, T15); // [s25_5 s16_5 s34_5 s07_5 s25_4 s16_4 s34_4 s07_4]
- T23 = _mm_hadd_epi16(T16, T17); // [s25_7 s16_7 s34_7 s07_7 s25_6 s16_6 s34_6 s07_6]
-
- T24 = _mm_hsub_epi16(T10, T11); // [d25_1 d16_1 d34_1 d07_1 d25_0 d16_0 d34_0 d07_0]
- T25 = _mm_hsub_epi16(T12, T13); // [d25_3 d16_3 d34_3 d07_3 d25_2 d16_2 d34_2 d07_2]
- T26 = _mm_hsub_epi16(T14, T15); // [d25_5 d16_5 d34_5 d07_5 d25_4 d16_4 d34_4 d07_4]
- T27 = _mm_hsub_epi16(T16, T17); // [d25_7 d16_7 d34_7 d07_7 d25_6 d16_6 d34_6 d07_6]
-
- T30 = _mm_hadd_epi16(T20, T21); // [EE1_3 EE0_3 EE1_2 EE0_2 EE1_1 EE0_1 EE1_0 EE0_0]
- T31 = _mm_hadd_epi16(T22, T23); // [EE1_7 EE0_7 EE1_6 EE0_6 EE1_5 EE0_5 EE1_4 EE0_4]
- T32 = _mm_hsub_epi16(T20, T21); // [EO1_3 EO0_3 EO1_2 EO0_2 EO1_1 EO0_1 EO1_0 EO0_0]
- T33 = _mm_hsub_epi16(T22, T23); // [EO1_7 EO0_7 EO1_6 EO0_6 EO1_5 EO0_5 EO1_4 EO0_4]
-
- T40 = _mm_madd_epi16(T30, _mm_load_si128((__m128i*)tab_dct_8[1]));
- T41 = _mm_madd_epi16(T31, _mm_load_si128((__m128i*)tab_dct_8[1]));
- T40 = _mm_srai_epi32(_mm_add_epi32(T40, c_2), 2);
- T41 = _mm_srai_epi32(_mm_add_epi32(T41, c_2), 2);
- T50 = _mm_packs_epi32(T40, T41);
-
- T42 = _mm_madd_epi16(T30, _mm_load_si128((__m128i*)tab_dct_8[2]));
- T43 = _mm_madd_epi16(T31, _mm_load_si128((__m128i*)tab_dct_8[2]));
- T42 = _mm_srai_epi32(_mm_add_epi32(T42, c_2), 2);
- T43 = _mm_srai_epi32(_mm_add_epi32(T43, c_2), 2);
- T54 = _mm_packs_epi32(T42, T43);
-
- T44 = _mm_madd_epi16(T32, _mm_load_si128((__m128i*)tab_dct_8[3]));
- T45 = _mm_madd_epi16(T33, _mm_load_si128((__m128i*)tab_dct_8[3]));
- T44 = _mm_srai_epi32(_mm_add_epi32(T44, c_2), 2);
- T45 = _mm_srai_epi32(_mm_add_epi32(T45, c_2), 2);
- T52 = _mm_packs_epi32(T44, T45);
-
- T46 = _mm_madd_epi16(T32, _mm_load_si128((__m128i*)tab_dct_8[4]));
- T47 = _mm_madd_epi16(T33, _mm_load_si128((__m128i*)tab_dct_8[4]));
- T46 = _mm_srai_epi32(_mm_add_epi32(T46, c_2), 2);
- T47 = _mm_srai_epi32(_mm_add_epi32(T47, c_2), 2);
- T56 = _mm_packs_epi32(T46, T47);
-
- T40 = _mm_madd_epi16(T24, _mm_load_si128((__m128i*)tab_dct_8[5]));
- T41 = _mm_madd_epi16(T25, _mm_load_si128((__m128i*)tab_dct_8[5]));
- T42 = _mm_madd_epi16(T26, _mm_load_si128((__m128i*)tab_dct_8[5]));
- T43 = _mm_madd_epi16(T27, _mm_load_si128((__m128i*)tab_dct_8[5]));
- T40 = _mm_hadd_epi32(T40, T41);
- T42 = _mm_hadd_epi32(T42, T43);
- T40 = _mm_srai_epi32(_mm_add_epi32(T40, c_2), 2);
- T42 = _mm_srai_epi32(_mm_add_epi32(T42, c_2), 2);
- T51 = _mm_packs_epi32(T40, T42);
-
- T40 = _mm_madd_epi16(T24, _mm_load_si128((__m128i*)tab_dct_8[6]));
- T41 = _mm_madd_epi16(T25, _mm_load_si128((__m128i*)tab_dct_8[6]));
- T42 = _mm_madd_epi16(T26, _mm_load_si128((__m128i*)tab_dct_8[6]));
- T43 = _mm_madd_epi16(T27, _mm_load_si128((__m128i*)tab_dct_8[6]));
- T40 = _mm_hadd_epi32(T40, T41);
- T42 = _mm_hadd_epi32(T42, T43);
- T40 = _mm_srai_epi32(_mm_add_epi32(T40, c_2), 2);
- T42 = _mm_srai_epi32(_mm_add_epi32(T42, c_2), 2);
- T53 = _mm_packs_epi32(T40, T42);
-
- T40 = _mm_madd_epi16(T24, _mm_load_si128((__m128i*)tab_dct_8[7]));
- T41 = _mm_madd_epi16(T25, _mm_load_si128((__m128i*)tab_dct_8[7]));
- T42 = _mm_madd_epi16(T26, _mm_load_si128((__m128i*)tab_dct_8[7]));
- T43 = _mm_madd_epi16(T27, _mm_load_si128((__m128i*)tab_dct_8[7]));
- T40 = _mm_hadd_epi32(T40, T41);
- T42 = _mm_hadd_epi32(T42, T43);
- T40 = _mm_srai_epi32(_mm_add_epi32(T40, c_2), 2);
- T42 = _mm_srai_epi32(_mm_add_epi32(T42, c_2), 2);
- T55 = _mm_packs_epi32(T40, T42);
-
- T40 = _mm_madd_epi16(T24, _mm_load_si128((__m128i*)tab_dct_8[8]));
- T41 = _mm_madd_epi16(T25, _mm_load_si128((__m128i*)tab_dct_8[8]));
- T42 = _mm_madd_epi16(T26, _mm_load_si128((__m128i*)tab_dct_8[8]));
- T43 = _mm_madd_epi16(T27, _mm_load_si128((__m128i*)tab_dct_8[8]));
- T40 = _mm_hadd_epi32(T40, T41);
- T42 = _mm_hadd_epi32(T42, T43);
- T40 = _mm_srai_epi32(_mm_add_epi32(T40, c_2), 2);
- T42 = _mm_srai_epi32(_mm_add_epi32(T42, c_2), 2);
- T57 = _mm_packs_epi32(T40, T42);
-
- T10 = _mm_shuffle_epi8(T50, _mm_load_si128((__m128i*)tab_dct_8[0])); // [05 02 06 01 04 03 07 00]
- T11 = _mm_shuffle_epi8(T51, _mm_load_si128((__m128i*)tab_dct_8[0]));
- T12 = _mm_shuffle_epi8(T52, _mm_load_si128((__m128i*)tab_dct_8[0]));
- T13 = _mm_shuffle_epi8(T53, _mm_load_si128((__m128i*)tab_dct_8[0]));
- T14 = _mm_shuffle_epi8(T54, _mm_load_si128((__m128i*)tab_dct_8[0]));
- T15 = _mm_shuffle_epi8(T55, _mm_load_si128((__m128i*)tab_dct_8[0]));
- T16 = _mm_shuffle_epi8(T56, _mm_load_si128((__m128i*)tab_dct_8[0]));
- T17 = _mm_shuffle_epi8(T57, _mm_load_si128((__m128i*)tab_dct_8[0]));
-
- // DCT2
- T20 = _mm_madd_epi16(T10, _mm_load_si128((__m128i*)tab_dct_8[1])); // [64*s25_0 64*s16_0 64*s34_0 64*s07_0]
- T21 = _mm_madd_epi16(T11, _mm_load_si128((__m128i*)tab_dct_8[1])); // [64*s25_1 64*s16_1 64*s34_1 64*s07_1]
- T22 = _mm_madd_epi16(T12, _mm_load_si128((__m128i*)tab_dct_8[1])); // [64*s25_2 64*s16_2 64*s34_2 64*s07_2]
- T23 = _mm_madd_epi16(T13, _mm_load_si128((__m128i*)tab_dct_8[1])); // [64*s25_3 64*s16_3 64*s34_3 64*s07_3]
- T24 = _mm_madd_epi16(T14, _mm_load_si128((__m128i*)tab_dct_8[1])); // [64*s25_4 64*s16_4 64*s34_4 64*s07_4]
- T25 = _mm_madd_epi16(T15, _mm_load_si128((__m128i*)tab_dct_8[1])); // [64*s25_5 64*s16_5 64*s34_5 64*s07_5]
- T26 = _mm_madd_epi16(T16, _mm_load_si128((__m128i*)tab_dct_8[1])); // [64*s25_6 64*s16_6 64*s34_6 64*s07_6]
- T27 = _mm_madd_epi16(T17, _mm_load_si128((__m128i*)tab_dct_8[1])); // [64*s25_7 64*s16_7 64*s34_7 64*s07_7]
-
- T30 = _mm_hadd_epi32(T20, T21); // [64*(s16+s25)_1 64*(s07+s34)_1 64*(s16+s25)_0 64*(s07+s34)_0]
- T31 = _mm_hadd_epi32(T22, T23); // [64*(s16+s25)_3 64*(s07+s34)_3 64*(s16+s25)_2 64*(s07+s34)_2]
- T32 = _mm_hadd_epi32(T24, T25); // [64*(s16+s25)_5 64*(s07+s34)_5 64*(s16+s25)_4 64*(s07+s34)_4]
- T33 = _mm_hadd_epi32(T26, T27); // [64*(s16+s25)_7 64*(s07+s34)_7 64*(s16+s25)_6 64*(s07+s34)_6]
-
- T40 = _mm_hadd_epi32(T30, T31); // [64*((s07+s34)+(s16+s25))_3 64*((s07+s34)+(s16+s25))_2 64*((s07+s34)+(s16+s25))_1 64*((s07+s34)+(s16+s25))_0]
- T41 = _mm_hadd_epi32(T32, T33); // [64*((s07+s34)+(s16+s25))_7 64*((s07+s34)+(s16+s25))_6 64*((s07+s34)+(s16+s25))_5 64*((s07+s34)+(s16+s25))_4]
- T42 = _mm_hsub_epi32(T30, T31); // [64*((s07+s34)-(s16+s25))_3 64*((s07+s34)-(s16+s25))_2 64*((s07+s34)-(s16+s25))_1 64*((s07+s34)-(s16+s25))_0]
- T43 = _mm_hsub_epi32(T32, T33); // [64*((s07+s34)-(s16+s25))_7 64*((s07+s34)-(s16+s25))_6 64*((s07+s34)-(s16+s25))_5 64*((s07+s34)-(s16+s25))_4]
-
- T50 = _mm_srai_epi32(_mm_add_epi32(T40, c_256), 9);
- T51 = _mm_srai_epi32(_mm_add_epi32(T41, c_256), 9);
- T52 = _mm_srai_epi32(_mm_add_epi32(T42, c_256), 9);
- T53 = _mm_srai_epi32(_mm_add_epi32(T43, c_256), 9);
-
- _mm_store_si128((__m128i*)&dst[0 * 8 + 0], T50);
- _mm_store_si128((__m128i*)&dst[0 * 8 + 4], T51);
- _mm_store_si128((__m128i*)&dst[4 * 8 + 0], T52);
- _mm_store_si128((__m128i*)&dst[4 * 8 + 4], T53);
-
-#define MAKE_ODD(tab, dstPos) \
- T20 = _mm_madd_epi16(T10, _mm_load_si128((__m128i*)tab_dct_8[(tab)])); \
- T21 = _mm_madd_epi16(T11, _mm_load_si128((__m128i*)tab_dct_8[(tab)])); \
- T22 = _mm_madd_epi16(T12, _mm_load_si128((__m128i*)tab_dct_8[(tab)])); \
- T23 = _mm_madd_epi16(T13, _mm_load_si128((__m128i*)tab_dct_8[(tab)])); \
- T24 = _mm_madd_epi16(T14, _mm_load_si128((__m128i*)tab_dct_8[(tab)])); \
- T25 = _mm_madd_epi16(T15, _mm_load_si128((__m128i*)tab_dct_8[(tab)])); \
- T26 = _mm_madd_epi16(T16, _mm_load_si128((__m128i*)tab_dct_8[(tab)])); \
- T27 = _mm_madd_epi16(T17, _mm_load_si128((__m128i*)tab_dct_8[(tab)])); \
- T30 = _mm_hadd_epi32(T20, T21); \
- T31 = _mm_hadd_epi32(T22, T23); \
- T32 = _mm_hadd_epi32(T24, T25); \
- T33 = _mm_hadd_epi32(T26, T27); \
- T40 = _mm_hadd_epi32(T30, T31); \
- T41 = _mm_hadd_epi32(T32, T33); \
- T50 = _mm_srai_epi32(_mm_add_epi32(T40, c_256), 9); \
- T51 = _mm_srai_epi32(_mm_add_epi32(T41, c_256), 9); \
- _mm_store_si128((__m128i*)&dst[(dstPos) * 8 + 0], T50); \
- _mm_store_si128((__m128i*)&dst[(dstPos) * 8 + 4], T51);
-
- MAKE_ODD(9, 2);
- MAKE_ODD(10, 6);
- MAKE_ODD(11, 1);
- MAKE_ODD(12, 3);
- MAKE_ODD(13, 5);
- MAKE_ODD(14, 7);
-#undef MAKE_ODD
-}
-
ALIGN_VAR_32(static const int16_t, tab_dct_16_0[][8]) =
{
{ 0x0F0E, 0x0D0C, 0x0B0A, 0x0908, 0x0706, 0x0504, 0x0302, 0x0100 }, // 0
@@ -1280,7 +1101,6 @@ namespace x265 {
void Setup_Vec_DCTPrimitives_ssse3(EncoderPrimitives &p)
{
#if !HIGH_BIT_DEPTH