[x265-commits] [x265] disable SIGPIPE on Windows platform
Min Chen
chenm003 at 163.com
Wed Apr 15 08:37:35 CEST 2015
details: http://hg.videolan.org/x265/rev/dd456de98c23
branches:
changeset: 10170:dd456de98c23
user: Min Chen <chenm003 at 163.com>
date: Tue Apr 14 13:41:40 2015 +0800
description:
disable SIGPIPE on Windows platform
Subject: [x265] asm: improve sub_ps[16x16] (477 -> 461) and reduce code size
details: http://hg.videolan.org/x265/rev/e21ede5958ea
branches:
changeset: 10171:e21ede5958ea
user: Sumalatha Polureddy
date: Mon Apr 13 16:25:08 2015 +0530
description:
asm: improve sub_ps[16x16] (477 -> 461) and reduce code size
Subject: [x265] asm: intra_pred_ang32_18 improved by ~45% over SSE4
details: http://hg.videolan.org/x265/rev/becd2f63197d
branches:
changeset: 10172:becd2f63197d
user: Praveen Tiwari <praveen at multicorewareinc.com>
date: Tue Apr 14 11:49:12 2015 +0530
description:
asm: intra_pred_ang32_18 improved by ~45% over SSE4
AVX2:
intra_ang_32x32[18] 33.10x 354.58 11737.10
SSE4:
intra_ang_32x32[18] 17.51x 650.80 11396.64
Subject: [x265] sao: add saoCuOrgE3_2Rows function to process 2 rows
details: http://hg.videolan.org/x265/rev/6fce6c27e22b
branches:
changeset: 10173:6fce6c27e22b
user: Divya Manivannan <divya at multicorewareinc.com>
date: Tue Apr 14 10:18:29 2015 +0530
description:
sao: add saoCuOrgE3_2Rows function to process 2 rows
Subject: [x265] asm: avx2 code for satd_32xN
details: http://hg.videolan.org/x265/rev/8e583b1e4de8
branches:
changeset: 10174:8e583b1e4de8
user: Dnyaneshwar G <dnyaneshwar at multicorewareinc.com>
date: Tue Apr 14 14:13:31 2015 +0530
description:
asm: avx2 code for satd_32xN
AVX2:
satd[ 32x8] 8.40x 957.22 8040.38
satd[32x16] 8.31x 1950.86 16214.44
satd[32x24] 8.50x 2897.62 24636.81
satd[32x32] 8.88x 3952.35 35115.40
satd[32x64] 9.18x 7334.90 67312.13
AVX:
satd[ 32x8] 4.63x 1738.62 8048.18
satd[32x16] 5.01x 3249.63 16295.51
satd[32x24] 5.30x 4767.54 25279.60
satd[32x32] 5.67x 6156.74 34895.57
satd[32x64] 5.59x 11708.14 65479.60
Subject: [x265] asm: avx code for chroma copy_ss 32x64, reused luma code (2616 -> 1313)
details: http://hg.videolan.org/x265/rev/dc4e269d1dec
branches:
changeset: 10175:dc4e269d1dec
user: Sumalatha Polureddy
date: Tue Apr 14 15:51:59 2015 +0530
description:
asm: avx code for chroma copy_ss 32x64, reused luma code (2616 -> 1313)
sse2
[i422] copy_ss[32x64] 8.36x 2616.62 21881.62
avx
[i422] copy_ss[32x64] 16.80x 1313.77 22065.42
Subject: [x265] asm: ssse3 10bit code for convert_p2s[4xN]
details: http://hg.videolan.org/x265/rev/07848ecda186
branches:
changeset: 10176:07848ecda186
user: Rajesh Paulraj<rajesh at multicorewareinc.com>
date: Tue Apr 14 18:10:45 2015 +0530
description:
asm: ssse3 10bit code for convert_p2s[4xN]
convert_p2s[4x4](2.70x), convert_p2s[4x8](3.53x), convert_p2s[4x16](3.82x)
Subject: [x265] asm: ssse3 10bit code for convert_p2s[8xN]
details: http://hg.videolan.org/x265/rev/3adff3b58196
branches:
changeset: 10177:3adff3b58196
user: Rajesh Paulraj<rajesh at multicorewareinc.com>
date: Tue Apr 14 18:28:02 2015 +0530
description:
asm: ssse3 10bit code for convert_p2s[8xN]
convert_p2s[8x4](4.06x), convert_p2s[8x8](5.07x), convert_p2s[8x16](6.00x),
convert_p2s[8x32](6.42x)
Subject: [x265] asm: ssse3 10bit code for convert_p2s[16xN]
details: http://hg.videolan.org/x265/rev/c6d0421a367d
branches:
changeset: 10178:c6d0421a367d
user: Rajesh Paulraj<rajesh at multicorewareinc.com>
date: Tue Apr 14 18:50:02 2015 +0530
description:
asm: ssse3 10bit code for convert_p2s[16xN]
convert_p2s[16x4](8.18x), convert_p2s[16x8](10.59x),
convert_p2s[16x12](11.01x), convert_p2s[16x16](11.00x),
convert_p2s[16x32](11.59x), convert_p2s[16x64](11.68x)
Subject: [x265] asm: ssse3 10bit code for convert_p2s[32xN],[64xN]
details: http://hg.videolan.org/x265/rev/565829b7a970
branches:
changeset: 10179:565829b7a970
user: Rajesh Paulraj<rajesh at multicorewareinc.com>
date: Tue Apr 14 19:05:49 2015 +0530
description:
asm: ssse3 10bit code for convert_p2s[32xN],[64xN]
convert_p2s[32x8](9.51x), convert_p2s[32x16](10.44x),
convert_p2s[32x24](9.64x), convert_p2s[32x32](10.70x),
convert_p2s[32x64](11.52x), convert_p2s[64x16](10.35x),
convert_p2s[64x32](9.12x), convert_p2s[64x48](10.05x),
convert_p2s[64x64](9.00x)
Subject: [x265] asm: ssse3 10bit code for convert_p2s[24xN]
details: http://hg.videolan.org/x265/rev/00b90fb64d5f
branches:
changeset: 10180:00b90fb64d5f
user: Rajesh Paulraj<rajesh at multicorewareinc.com>
date: Tue Apr 14 19:18:46 2015 +0530
description:
asm: ssse3 10bit code for convert_p2s[24xN]
convert_p2s[24x32](14.57x)
Subject: [x265] improve rdoQuant() by reduce count of code group scan
details: http://hg.videolan.org/x265/rev/b26385e20632
branches:
changeset: 10181:b26385e20632
user: Min Chen <chenm003 at 163.com>
date: Tue Apr 14 21:18:58 2015 +0800
description:
improve rdoQuant() by reduce count of code group scan
Subject: [x265] improve rdoQuant() by use non-zero coeff group mask to reduce count of coeff scan
details: http://hg.videolan.org/x265/rev/3a87866c76ad
branches:
changeset: 10182:3a87866c76ad
user: Min Chen <chenm003 at 163.com>
date: Tue Apr 14 21:19:02 2015 +0800
description:
improve rdoQuant() by use non-zero coeff group mask to reduce count of coeff scan
Subject: [x265] improve rdoQuant() by block fill on non-zero coeff group
details: http://hg.videolan.org/x265/rev/44edfb7f0a0a
branches:
changeset: 10183:44edfb7f0a0a
user: Min Chen <chenm003 at 163.com>
date: Tue Apr 14 21:19:06 2015 +0800
description:
improve rdoQuant() by block fill on non-zero coeff group
Subject: [x265] asm: improve algorithm logic on saoCuOrgE3
details: http://hg.videolan.org/x265/rev/7f32086318d9
branches:
changeset: 10184:7f32086318d9
user: Min Chen <chenm003 at 163.com>
date: Wed Apr 15 14:08:36 2015 +0800
description:
asm: improve algorithm logic on saoCuOrgE3
Subject: [x265] regression: typo in rc tests
details: http://hg.videolan.org/x265/rev/737edf5ac008
branches:
changeset: 10185:737edf5ac008
user: mahesh pittala <mahesh at multicorewareinc.com>
date: Wed Apr 15 10:58:54 2015 +0530
description:
regression: typo in rc tests
diffstat:
source/common/loopfilter.cpp | 24 +
source/common/primitives.h | 2 +
source/common/quant.cpp | 23 +-
source/common/x86/asm-primitives.cpp | 32 +
source/common/x86/intrapred.h | 1 +
source/common/x86/intrapred8.asm | 94 +++++
source/common/x86/ipfilter16.asm | 586 +++++++++++++++++++++++++++++++---
source/common/x86/ipfilter8.h | 23 +
source/common/x86/loopfilter.asm | 40 +-
source/common/x86/pixel-a.asm | 300 +++++++++++++++++
source/common/x86/pixel-util8.asm | 50 ++-
source/encoder/sao.cpp | 24 +-
source/output/reconplay.cpp | 4 +
source/test/pixelharness.cpp | 46 ++-
source/test/pixelharness.h | 3 +
source/test/rate-control-tests.txt | 4 +-
16 files changed, 1146 insertions(+), 110 deletions(-)
diffs (truncated from 1572 to 300 lines):
diff -r abfbfdf724a0 -r 737edf5ac008 source/common/loopfilter.cpp
--- a/source/common/loopfilter.cpp Mon Apr 13 14:13:19 2015 -0700
+++ b/source/common/loopfilter.cpp Wed Apr 15 10:58:54 2015 +0530
@@ -122,6 +122,29 @@ void processSaoCUE3(pixel *rec, int8_t *
}
}
+void processSaoCUE3_2Rows(pixel *rec, int8_t *upBuff1, int8_t *offsetEo, intptr_t stride, int startX, int endX, int8_t* signDown)
+{
+    int8_t signDown1;
+    int8_t edgeType;
+
+    for (int y = 0; y < 2; y++)
+    {
+        edgeType = signDown[y] + upBuff1[startX] + 2;
+        upBuff1[startX - 1] = -signDown[y];
+        rec[startX] = x265_clip(rec[startX] + offsetEo[edgeType]);
+
+        for (int x = startX + 1; x < endX; x++)
+        {
+            signDown1 = signOf(rec[x] - rec[x + stride]);
+            edgeType = signDown1 + upBuff1[x] + 2;
+            upBuff1[x - 1] = -signDown1;
+            rec[x] = x265_clip(rec[x] + offsetEo[edgeType]);
+        }
+        upBuff1[endX - 1] = signOf(rec[endX - 1 + stride + 1] - rec[endX]);
+        rec += stride + 1;
+    }
+}
+
void processSaoCUB0(pixel* rec, const int8_t* offset, int ctuWidth, int ctuHeight, intptr_t stride)
{
#define SAO_BO_BITS 5
@@ -146,6 +169,7 @@ void setupLoopFilterPrimitives_c(Encoder
p.saoCuOrgE1_2Rows = processSaoCUE1_2Rows;
p.saoCuOrgE2 = processSaoCUE2;
p.saoCuOrgE3 = processSaoCUE3;
+ p.saoCuOrgE3_2Rows = processSaoCUE3_2Rows;
p.saoCuOrgB0 = processSaoCUB0;
p.sign = calSign;
}
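The new C primitive above classifies each sample by adding the sign of its difference to two neighbours, then uses the result (0..4) as an index into the offset table; the negated "down" sign is stored in upBuff1 so the next row can reuse it as its "up" sign. A minimal sketch of just that classification step follows, assuming an 8-bit build. The helper names are hypothetical stand-ins for signOf and x265_clip, whose definitions are not part of this diff.

```cpp
#include <algorithm>
#include <cstdint>

typedef uint8_t pixel; // assumption: 8-bit build

// -1, 0 or +1 depending on the sign of x (stand-in for the signOf helper).
inline int8_t signOfSketch(int x) { return (int8_t)((x > 0) - (x < 0)); }

// Clamp to the 8-bit pixel range (stand-in for x265_clip).
inline pixel clipSketch(int v) { return (pixel)std::min(255, std::max(0, v)); }

// Edge class of sample cur between neighbours a and b along one direction:
// 0 = local minimum, 2 = flat or monotonic, 4 = local maximum.
inline int edgeTypeSketch(pixel a, pixel cur, pixel b)
{
    return signOfSketch(cur - a) + signOfSketch(cur - b) + 2;
}

// Apply the offset selected by the edge class and clamp the result.
inline pixel applyEdgeOffsetSketch(pixel a, pixel cur, pixel b, const int8_t* offsetEo)
{
    return clipSketch(cur + offsetEo[edgeTypeSketch(a, cur, b)]);
}
```

With an offset table such as {3, 1, 0, -1, -3}, a local minimum is raised and a local maximum is lowered, which is the smoothing effect the edge-offset mode is after.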
diff -r abfbfdf724a0 -r 737edf5ac008 source/common/primitives.h
--- a/source/common/primitives.h Mon Apr 13 14:13:19 2015 -0700
+++ b/source/common/primitives.h Wed Apr 15 10:58:54 2015 +0530
@@ -172,6 +172,7 @@ typedef void (*saoCuOrgE0_t)(pixel* rec,
typedef void (*saoCuOrgE1_t)(pixel* rec, int8_t* upBuff1, int8_t* offsetEo, intptr_t stride, int width);
typedef void (*saoCuOrgE2_t)(pixel* rec, int8_t* pBufft, int8_t* pBuff1, int8_t* offsetEo, int lcuWidth, intptr_t stride);
typedef void (*saoCuOrgE3_t)(pixel* rec, int8_t* upBuff1, int8_t* m_offsetEo, intptr_t stride, int startX, int endX);
+typedef void (*saoCuOrgE3_2Rows_t)(pixel* rec, int8_t* upBuff1, int8_t* m_offsetEo, intptr_t stride, int startX, int endX, int8_t* signDown);
typedef void (*saoCuOrgB0_t)(pixel* rec, const int8_t* offsetBo, int ctuWidth, int ctuHeight, intptr_t stride);
typedef void (*sign_t)(int8_t *dst, const pixel *src1, const pixel *src2, const int endX);
typedef void (*planecopy_cp_t) (const uint8_t* src, intptr_t srcStride, pixel* dst, intptr_t dstStride, int width, int height, int shift);
@@ -277,6 +278,7 @@ struct EncoderPrimitives
saoCuOrgE1_t saoCuOrgE1, saoCuOrgE1_2Rows;
saoCuOrgE2_t saoCuOrgE2;
saoCuOrgE3_t saoCuOrgE3;
+ saoCuOrgE3_2Rows_t saoCuOrgE3_2Rows;
saoCuOrgB0_t saoCuOrgB0;
downscale_t frameInitLowres;
diff -r abfbfdf724a0 -r 737edf5ac008 source/common/quant.cpp
--- a/source/common/quant.cpp Mon Apr 13 14:13:19 2015 -0700
+++ b/source/common/quant.cpp Wed Apr 15 10:58:54 2015 +0530
@@ -981,26 +981,41 @@ uint32_t Quant::rdoQuant(const CUData& c
         dstCoeff[blkPos] = (int16_t)((level ^ mask) - mask);
     }
+    // Average 49.62 pixels
     /* clean uncoded coefficients */
-    for (int pos = bestLastIdx; pos <= lastScanPos; pos++)
+    for (int pos = bestLastIdx; pos <= fastMin(lastScanPos, (bestLastIdx | (SCAN_SET_SIZE - 1))); pos++)
+    {
         dstCoeff[codeParams.scan[pos]] = 0;
+    }
+    for (int pos = (bestLastIdx & ~(SCAN_SET_SIZE - 1)) + SCAN_SET_SIZE; pos <= lastScanPos; pos += SCAN_SET_SIZE)
+    {
+        const uint32_t blkPos = codeParams.scan[pos];
+        memset(&dstCoeff[blkPos + 0 * trSize], 0, 4 * sizeof(*dstCoeff));
+        memset(&dstCoeff[blkPos + 1 * trSize], 0, 4 * sizeof(*dstCoeff));
+        memset(&dstCoeff[blkPos + 2 * trSize], 0, 4 * sizeof(*dstCoeff));
+        memset(&dstCoeff[blkPos + 3 * trSize], 0, 4 * sizeof(*dstCoeff));
+    }
     /* rate-distortion based sign-hiding */
     if (cu.m_slice->m_pps->bSignHideEnabled && numSig >= 2)
     {
+        const int realLastScanPos = (bestLastIdx - 1) >> LOG2_SCAN_SET_SIZE;
         int lastCG = true;
-        for (int subSet = cgLastScanPos; subSet >= 0; subSet--)
+        for (int subSet = realLastScanPos; subSet >= 0; subSet--)
         {
             int subPos = subSet << LOG2_SCAN_SET_SIZE;
             int n;
+            if (!(sigCoeffGroupFlag64 & (1ULL << codeParams.scanCG[subSet])))
+                continue;
+
             /* measure distance between first and last non-zero coef in this
              * coding group */
             for (n = SCAN_SET_SIZE - 1; n >= 0; --n)
                 if (dstCoeff[codeParams.scan[n + subPos]])
                     break;
-            if (n < 0)
-                continue;
+
+            X265_CHECK(n >= 0, "non-zero coeff scan failuare!\n");
             int lastNZPosInCG = n;
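The two clearing loops in the quant.cpp hunk above can be read as: zero the tail of the coefficient group that contains bestLastIdx one coefficient at a time, then zero every later group as a whole 4x4 block. A minimal sketch of that idea follows. It assumes SCAN_SET_SIZE is 16, that the first scan position of each group is its top-left coefficient, and that coefficients beyond lastScanPos are already zero. The scan table built by makeToyScan is a hypothetical raster order for illustration only, not one of x265's scan tables, and all function names are made up for this sketch.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <cstring>
#include <vector>

// Assumed constants: a coefficient group (CG) holds 16 coefficients (4x4).
static const int kLog2ScanSetSize = 4;
static const int kScanSetSize = 1 << kLog2ScanSetSize;

// Hypothetical scan: CGs in raster order, raster order inside each CG.
// Only the property "first scan position of a CG is its top-left
// coefficient" matters for the block fill below.
std::vector<uint16_t> makeToyScan(int trSize)
{
    std::vector<uint16_t> scan;
    for (int cgY = 0; cgY < trSize; cgY += 4)
        for (int cgX = 0; cgX < trSize; cgX += 4)
            for (int y = 0; y < 4; y++)
                for (int x = 0; x < 4; x++)
                    scan.push_back((uint16_t)((cgY + y) * trSize + cgX + x));
    return scan;
}

// Baseline: clear one coefficient per scan position.
void clearPerCoeff(int16_t* dstCoeff, const uint16_t* scan, int bestLastIdx, int lastScanPos)
{
    for (int pos = bestLastIdx; pos <= lastScanPos; pos++)
        dstCoeff[scan[pos]] = 0;
}

// Block fill: finish the partial CG, then zero whole 4x4 CGs row by row.
void clearBlockFill(int16_t* dstCoeff, const uint16_t* scan, int trSize, int bestLastIdx, int lastScanPos)
{
    const int endOfFirstCG = std::min(lastScanPos, bestLastIdx | (kScanSetSize - 1));
    for (int pos = bestLastIdx; pos <= endOfFirstCG; pos++)
        dstCoeff[scan[pos]] = 0;

    for (int pos = (bestLastIdx & ~(kScanSetSize - 1)) + kScanSetSize; pos <= lastScanPos; pos += kScanSetSize)
    {
        const uint32_t blkPos = scan[pos]; // top-left coefficient of this CG
        for (int row = 0; row < 4; row++)
            std::memset(&dstCoeff[blkPos + row * trSize], 0, 4 * sizeof(*dstCoeff));
    }
}

// Fill scan positions 0..lastScanPos with non-zero values, run both
// variants, and report whether the resulting blocks are identical.
bool variantsAgree(int trSize, int bestLastIdx, int lastScanPos)
{
    assert(lastScanPos < trSize * trSize);
    const std::vector<uint16_t> scan = makeToyScan(trSize);
    std::vector<int16_t> a(trSize * trSize, 0), b;
    for (int pos = 0; pos <= lastScanPos; pos++)
        a[scan[pos]] = (int16_t)(pos + 1);
    b = a;
    clearPerCoeff(a.data(), scan.data(), bestLastIdx, lastScanPos);
    clearBlockFill(b.data(), scan.data(), trSize, bestLastIdx, lastScanPos);
    return a == b;
}
```

The block fill touches coefficients past lastScanPos inside the last group, so the two variants only agree when those coefficients are zero to begin with; the sketch sets up its input that way.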
diff -r abfbfdf724a0 -r 737edf5ac008 source/common/x86/asm-primitives.cpp
--- a/source/common/x86/asm-primitives.cpp Mon Apr 13 14:13:19 2015 -0700
+++ b/source/common/x86/asm-primitives.cpp Wed Apr 15 10:58:54 2015 +0530
@@ -948,6 +948,30 @@ void setupAssemblyPrimitives(EncoderPrim
p.cu[BLOCK_16x16].count_nonzero = x265_count_nonzero_16x16_ssse3;
p.cu[BLOCK_32x32].count_nonzero = x265_count_nonzero_32x32_ssse3;
p.frameInitLowres = x265_frame_init_lowres_core_ssse3;
+
+ p.pu[LUMA_4x4].convert_p2s = x265_filterPixelToShort_4x4_ssse3;
+ p.pu[LUMA_4x8].convert_p2s = x265_filterPixelToShort_4x8_ssse3;
+ p.pu[LUMA_4x16].convert_p2s = x265_filterPixelToShort_4x16_ssse3;
+ p.pu[LUMA_8x4].convert_p2s = x265_filterPixelToShort_8x4_ssse3;
+ p.pu[LUMA_8x8].convert_p2s = x265_filterPixelToShort_8x8_ssse3;
+ p.pu[LUMA_8x16].convert_p2s = x265_filterPixelToShort_8x16_ssse3;
+ p.pu[LUMA_8x32].convert_p2s = x265_filterPixelToShort_8x32_ssse3;
+ p.pu[LUMA_16x4].convert_p2s = x265_filterPixelToShort_16x4_ssse3;
+ p.pu[LUMA_16x8].convert_p2s = x265_filterPixelToShort_16x8_ssse3;
+ p.pu[LUMA_16x12].convert_p2s = x265_filterPixelToShort_16x12_ssse3;
+ p.pu[LUMA_16x16].convert_p2s = x265_filterPixelToShort_16x16_ssse3;
+ p.pu[LUMA_16x32].convert_p2s = x265_filterPixelToShort_16x32_ssse3;
+ p.pu[LUMA_16x64].convert_p2s = x265_filterPixelToShort_16x64_ssse3;
+ p.pu[LUMA_32x8].convert_p2s = x265_filterPixelToShort_32x8_ssse3;
+ p.pu[LUMA_32x16].convert_p2s = x265_filterPixelToShort_32x16_ssse3;
+ p.pu[LUMA_32x24].convert_p2s = x265_filterPixelToShort_32x24_ssse3;
+ p.pu[LUMA_32x32].convert_p2s = x265_filterPixelToShort_32x32_ssse3;
+ p.pu[LUMA_32x64].convert_p2s = x265_filterPixelToShort_32x64_ssse3;
+ p.pu[LUMA_64x16].convert_p2s = x265_filterPixelToShort_64x16_ssse3;
+ p.pu[LUMA_64x32].convert_p2s = x265_filterPixelToShort_64x32_ssse3;
+ p.pu[LUMA_64x48].convert_p2s = x265_filterPixelToShort_64x48_ssse3;
+ p.pu[LUMA_64x64].convert_p2s = x265_filterPixelToShort_64x64_ssse3;
+ p.pu[LUMA_24x32].convert_p2s = x265_filterPixelToShort_24x32_ssse3;
}
if (cpuMask & X265_CPU_SSE4)
{
@@ -1516,6 +1540,7 @@ void setupAssemblyPrimitives(EncoderPrim
p.chroma[X265_CSP_I420].cu[CHROMA_420_16x16].copy_ss = x265_blockcopy_ss_16x16_avx;
p.chroma[X265_CSP_I420].cu[CHROMA_420_32x32].copy_ss = x265_blockcopy_ss_32x32_avx;
p.chroma[X265_CSP_I422].cu[CHROMA_422_16x32].copy_ss = x265_blockcopy_ss_16x32_avx;
+ p.chroma[X265_CSP_I422].cu[CHROMA_422_32x64].copy_ss = x265_blockcopy_ss_32x64_avx;
p.chroma[X265_CSP_I420].pu[CHROMA_420_32x8].copy_pp = x265_blockcopy_pp_32x8_avx;
p.pu[LUMA_32x8].copy_pp = x265_blockcopy_pp_32x8_avx;
@@ -1669,6 +1694,12 @@ void setupAssemblyPrimitives(EncoderPrim
p.pu[LUMA_8x16].satd = x265_pixel_satd_8x16_avx2;
p.pu[LUMA_8x8].satd = x265_pixel_satd_8x8_avx2;
+ p.pu[LUMA_32x8].satd = x265_pixel_satd_32x8_avx2;
+ p.pu[LUMA_32x16].satd = x265_pixel_satd_32x16_avx2;
+ p.pu[LUMA_32x24].satd = x265_pixel_satd_32x24_avx2;
+ p.pu[LUMA_32x32].satd = x265_pixel_satd_32x32_avx2;
+ p.pu[LUMA_32x64].satd = x265_pixel_satd_32x64_avx2;
+
p.pu[LUMA_32x8].sad = x265_pixel_sad_32x8_avx2;
p.pu[LUMA_32x16].sad = x265_pixel_sad_32x16_avx2;
p.pu[LUMA_32x24].sad = x265_pixel_sad_32x24_avx2;
@@ -1821,6 +1852,7 @@ void setupAssemblyPrimitives(EncoderPrim
p.cu[BLOCK_32x32].intra_pred[23] = x265_intra_pred_ang32_23_avx2;
p.cu[BLOCK_32x32].intra_pred[22] = x265_intra_pred_ang32_22_avx2;
p.cu[BLOCK_32x32].intra_pred[21] = x265_intra_pred_ang32_21_avx2;
+ p.cu[BLOCK_32x32].intra_pred[18] = x265_intra_pred_ang32_18_avx2;
// copy_sp primitives
p.cu[BLOCK_16x16].copy_sp = x265_blockcopy_sp_16x16_avx2;
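The convert_p2s entries registered above widen pixels to the 16-bit intermediate format used by the interpolation filters. A minimal scalar sketch of that conversion follows, assuming a 10-bit build (as in the commit subjects), 14-bit intermediate precision and an offset of 8192. The function name, its signature and the constant names are illustrative assumptions, not copies of x265's code.

```cpp
#include <cstddef>
#include <cstdint>

typedef uint16_t pixel;                 // assumption: 10-bit build
static const int kBitDepth = 10;        // assumed sample depth
static const int kInternalPrec = 14;    // assumed intermediate precision
static const int kInternalOffs = 1 << (kInternalPrec - 1); // 8192

// Scalar pixel-to-short sketch: scale each sample up to the intermediate
// precision, then subtract the offset so the result is centred on zero.
void convertP2SSketch(const pixel* src, ptrdiff_t srcStride,
                      int16_t* dst, ptrdiff_t dstStride,
                      int width, int height)
{
    const int shift = kInternalPrec - kBitDepth; // 4 for 10-bit input
    for (int y = 0; y < height; y++)
    {
        for (int x = 0; x < width; x++)
            dst[x] = (int16_t)((src[x] << shift) - kInternalOffs);
        src += srcStride;
        dst += dstStride;
    }
}
```

Each output depends on one input sample only, which is why the wider block sizes in the commit log vectorise so well.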
diff -r abfbfdf724a0 -r 737edf5ac008 source/common/x86/intrapred.h
--- a/source/common/x86/intrapred.h Mon Apr 13 14:13:19 2015 -0700
+++ b/source/common/x86/intrapred.h Wed Apr 15 10:58:54 2015 +0530
@@ -277,6 +277,7 @@ void x265_intra_pred_ang32_24_avx2(pixel
void x265_intra_pred_ang32_23_avx2(pixel* dst, intptr_t dstStride, const pixel* srcPix, int dirMode, int bFilter);
void x265_intra_pred_ang32_22_avx2(pixel* dst, intptr_t dstStride, const pixel* srcPix, int dirMode, int bFilter);
void x265_intra_pred_ang32_21_avx2(pixel* dst, intptr_t dstStride, const pixel* srcPix, int dirMode, int bFilter);
+void x265_intra_pred_ang32_18_avx2(pixel* dst, intptr_t dstStride, const pixel* srcPix, int dirMode, int bFilter);
void x265_all_angs_pred_4x4_sse2(pixel *dest, pixel *refPix, pixel *filtPix, int bLuma);
void x265_all_angs_pred_4x4_sse4(pixel *dest, pixel *refPix, pixel *filtPix, int bLuma);
void x265_all_angs_pred_8x8_sse4(pixel *dest, pixel *refPix, pixel *filtPix, int bLuma);
diff -r abfbfdf724a0 -r 737edf5ac008 source/common/x86/intrapred8.asm
--- a/source/common/x86/intrapred8.asm Mon Apr 13 14:13:19 2015 -0700
+++ b/source/common/x86/intrapred8.asm Wed Apr 15 10:58:54 2015 +0530
@@ -28,6 +28,7 @@
SECTION_RODATA 32
intra_pred_shuff_0_8: times 2 db 0, 1, 1, 2, 2, 3, 3, 4, 4, 5, 5, 6, 6, 7, 7, 8
+intra_pred_shuff_15_0: times 2 db 15, 14, 13, 12, 11, 10, 9, 8, 7, 6, 5, 4, 3, 2, 1, 0
pb_0_8 times 8 db 0, 8
pb_unpackbw1 times 2 db 1, 8, 2, 8, 3, 8, 4, 8
@@ -10366,6 +10367,99 @@ cglobal intra_pred_ang32_17, 4,7,8
RET
+INIT_YMM avx2
+cglobal intra_pred_ang32_18, 4, 4, 3
+ movu m0, [r2]
+ movu xm1, [r2 + 1 + 64]
+ pshufb xm1, [intra_pred_shuff_15_0]
+ mova xm2, xm0
+ vinserti128 m1, m1, xm2, 1
+
+ lea r3, [r1 * 3]
+
+ movu [r0], m0
+ palignr m2, m0, m1, 15
+ movu [r0 + r1], m2
+ palignr m2, m0, m1, 14
+ movu [r0 + r1 * 2], m2
+ palignr m2, m0, m1, 13
+ movu [r0 + r3], m2
+
+ lea r0, [r0 + r1 * 4]
+ palignr m2, m0, m1, 12
+ movu [r0], m2
+ palignr m2, m0, m1, 11
+ movu [r0 + r1], m2
+ palignr m2, m0, m1, 10
+ movu [r0 + r1 * 2], m2
+ palignr m2, m0, m1, 9
+ movu [r0 + r3], m2
+
+ lea r0, [r0 + r1 * 4]
+ palignr m2, m0, m1, 8
+ movu [r0], m2
+ palignr m2, m0, m1, 7
+ movu [r0 + r1], m2
+ palignr m2, m0, m1, 6
+ movu [r0 + r1 * 2], m2
+ palignr m2, m0, m1, 5
+ movu [r0 + r3], m2
+
+ lea r0, [r0 + r1 * 4]
+ palignr m2, m0, m1, 4
+ movu [r0], m2
+ palignr m2, m0, m1, 3
+ movu [r0 + r1], m2
+ palignr m2, m0, m1, 2
+ movu [r0 + r1 * 2], m2
+ palignr m2, m0, m1, 1
+ movu [r0 + r3], m2
+
+ lea r0, [r0 + r1 * 4]
+ movu [r0], m1
+
+ movu xm0, [r2 + 64 + 17]
+ pshufb xm0, [intra_pred_shuff_15_0]
+ vinserti128 m0, m0, xm1, 1
+
+ palignr m2, m1, m0, 15
+ movu [r0 + r1], m2
+ palignr m2, m1, m0, 14
+ movu [r0 + r1 * 2], m2
+ palignr m2, m1, m0, 13
+ movu [r0 + r3], m2
+
+ lea r0, [r0 + r1 * 4]
+ palignr m2, m1, m0, 12
+ movu [r0], m2
+ palignr m2, m1, m0, 11
+ movu [r0 + r1], m2
+ palignr m2, m1, m0, 10
+ movu [r0 + r1 * 2], m2
+ palignr m2, m1, m0, 9
+ movu [r0 + r3], m2
+
+ lea r0, [r0 + r1 * 4]
+ palignr m2, m1, m0, 8
+ movu [r0], m2
+ palignr m2, m1, m0, 7
+ movu [r0 + r1], m2
+ palignr m2, m1, m0,6
+ movu [r0 + r1 * 2], m2
+ palignr m2, m1, m0, 5
+ movu [r0 + r3], m2
+
+ lea r0, [r0 + r1 * 4]
+ palignr m2, m1, m0, 4
+ movu [r0], m2
+ palignr m2, m1, m0, 3
+ movu [r0 + r1], m2
+ palignr m2, m1, m0,2
+ movu [r0 + r1 * 2], m2
+ palignr m2, m1, m0, 1
+ movu [r0 + r3], m2
+ RET
+
INIT_XMM sse4
cglobal intra_pred_ang32_18, 4,5,5
movu m0, [r2] ; [15 14 13 12 11 10 9 8 7 6 5 4 3 2 1 0]
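Reading the AVX2 routine above, each output row is the previous row shifted right by one byte, with the next left-neighbour sample entering at column 0. A scalar C++ sketch of that pattern follows, assuming 8-bit pixels and the reference layout implied by the loads: srcPix[0] is the top-left sample, srcPix[1..] the row above, and srcPix[65..] the left column. The function name is hypothetical.

```cpp
#include <cstddef>
#include <cstdint>

typedef uint8_t pixel; // assumption: 8-bit build

// Scalar sketch of 32x32 angular mode 18 (pure diagonal copy).
// Row y, column x copies the reference sample on the same diagonal:
//   x >= y : srcPix[x - y]        (top-left corner and row above)
//   x <  y : srcPix[64 + (y - x)] (left column, walking downwards)
void intraPredAng32Mode18Sketch(pixel* dst, ptrdiff_t dstStride, const pixel* srcPix)
{
    const int size = 32;
    for (int y = 0; y < size; y++)
        for (int x = 0; x < size; x++)
            dst[y * dstStride + x] = (x >= y) ? srcPix[x - y] : srcPix[2 * size + (y - x)];
}
```

Because no interpolation is involved, the assembly needs only loads, one byte reversal of the left column, and palignr shifts.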
diff -r abfbfdf724a0 -r 737edf5ac008 source/common/x86/ipfilter16.asm
--- a/source/common/x86/ipfilter16.asm Mon Apr 13 14:13:19 2015 -0700
+++ b/source/common/x86/ipfilter16.asm Wed Apr 15 10:58:54 2015 +0530
@@ -117,6 +117,7 @@ SECTION .text
cextern pd_32
cextern pw_pixel_max