[x265-commits] [x265] asm: avx2 code for high_bit_depth satd_16x8
Dnyaneshwar G
dnyaneshwar at multicorewareinc.com
Sat May 9 19:54:57 CEST 2015
details: http://hg.videolan.org/x265/rev/948636c0bbab
branches:
changeset: 10385:948636c0bbab
user: Dnyaneshwar G <dnyaneshwar at multicorewareinc.com>
date: Thu May 07 13:41:38 2015 +0530
description:
asm: avx2 code for high_bit_depth satd_16x8
AVX2:
satd[ 16x8] 8.92x 500.34 4461.95
AVX:
satd[ 16x8] 4.35x 1039.88 4521.10
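The figures above are consistent with reading each TestBench row as: speedup factor, time for the optimized primitive, time for the C reference. That reading is inferred from the arithmetic, not stated in the commit message. A minimal sketch that checks it (the function names are hypothetical, not part of x265):

```cpp
#include <cassert>
#include <cmath>

// Speedup factor: C reference time divided by optimized time.
// Assumption: column 2 is the optimized primitive, column 3 the C reference.
double speedup(double optimizedTime, double referenceTime)
{
    assert(optimizedTime > 0.0);
    return referenceTime / optimizedTime;
}

// True when a factor printed with two decimals agrees with the computed ratio.
bool matchesReported(double reported, double optimizedTime, double referenceTime)
{
    return std::fabs(speedup(optimizedTime, referenceTime) - reported) < 0.01;
}
```

For the AVX2 row, speedup(500.34, 4461.95) comes out at about 8.92, matching the printed factor.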
Subject: [x265] asm: avx2 code for high_bit_depth satd_16xN, improved over ~50% than previous asm
details: http://hg.videolan.org/x265/rev/216cf6567633
branches:
changeset: 10386:216cf6567633
user: Dnyaneshwar G <dnyaneshwar at multicorewareinc.com>
date: Thu May 07 14:20:01 2015 +0530
description:
asm: avx2 code for high_bit_depth satd_16xN, improved over ~50% than previous asm
Subject: [x265] asm: avx2 code for high_bit_depth satd_32xN, improved over ~50% than previous asm
details: http://hg.videolan.org/x265/rev/915ddaa2f810
branches:
changeset: 10387:915ddaa2f810
user: Dnyaneshwar G <dnyaneshwar at multicorewareinc.com>
date: Thu May 07 14:38:56 2015 +0530
description:
asm: avx2 code for high_bit_depth satd_32xN, improved over ~50% than previous asm
Subject: [x265] asm: avx2 code for high_bit_depth satd_64xN, improved over ~50% than previous asm
details: http://hg.videolan.org/x265/rev/044200a7367f
branches:
changeset: 10388:044200a7367f
user: Dnyaneshwar G <dnyaneshwar at multicorewareinc.com>
date: Thu May 07 14:49:34 2015 +0530
description:
asm: avx2 code for high_bit_depth satd_64xN, improved over ~50% than previous asm
Subject: [x265] asm: avx2 code for high_bit_depth satd_48x64, improved over ~50% than previous asm
details: http://hg.videolan.org/x265/rev/57fce5531352
branches:
changeset: 10389:57fce5531352
user: Dnyaneshwar G <dnyaneshwar at multicorewareinc.com>
date: Thu May 07 14:58:39 2015 +0530
description:
asm: avx2 code for high_bit_depth satd_48x64, improved over ~50% than previous asm
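For readers unfamiliar with the primitive these changesets accelerate: SATD is the sum of absolute values of the Hadamard-transformed difference between two blocks. The sketch below is a textbook scalar version, not the x265 C reference and not a description of how the AVX2 code is organised. The 16-bit pixel type, the halving of the sum, the 4x4 tiling of larger blocks and all names are assumptions made for illustration.

```cpp
#include <cstdint>
#include <cstdlib>

// Assumption: a high-bit-depth build stores each sample in 16 bits.
using pixel = uint16_t;

// One 4-point Hadamard butterfly, in place, over v[0], v[step], v[2*step], v[3*step].
static void hadamard4(int* v, int step)
{
    int s01 = v[0] + v[step],            d01 = v[0] - v[step];
    int s23 = v[2 * step] + v[3 * step], d23 = v[2 * step] - v[3 * step];
    v[0]        = s01 + s23;
    v[step]     = s01 - s23;
    v[2 * step] = d01 + d23;
    v[3 * step] = d01 - d23;
}

// Textbook 4x4 SATD: transform the difference block, sum absolute coefficients.
int satd4x4(const pixel* a, intptr_t strideA, const pixel* b, intptr_t strideB)
{
    int d[16];
    for (int y = 0; y < 4; y++)
        for (int x = 0; x < 4; x++)
            d[4 * y + x] = int(a[y * strideA + x]) - int(b[y * strideB + x]);

    for (int y = 0; y < 4; y++) hadamard4(d + 4 * y, 1); // transform rows
    for (int x = 0; x < 4; x++) hadamard4(d + x, 4);     // transform columns

    int sum = 0;
    for (int i = 0; i < 16; i++)
        sum += std::abs(d[i]);
    return sum / 2; // halving is an assumed normalization
}

// Hypothetical helper: cover a WxH block with 4x4 tiles and add the tile costs.
template<int W, int H>
int satdTiled(const pixel* a, intptr_t strideA, const pixel* b, intptr_t strideB)
{
    int total = 0;
    for (int y = 0; y < H; y += 4)
        for (int x = 0; x < W; x += 4)
            total += satd4x4(a + y * strideA + x, strideA, b + y * strideB + x, strideB);
    return total;
}
```

With two flat 16x8 blocks that differ by 3 everywhere, each tile has a single nonzero coefficient of 48, so each tile contributes 24 and the block total is 192.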
Subject: [x265] cli.rst - qgsize formatting fixed, fixed explanation of --strict-cbr
details: http://hg.videolan.org/x265/rev/ac0e35cb5b89
branches:
changeset: 10390:ac0e35cb5b89
user: Tom Vaughan <tom.vaughan at multicorewareinc.com>
date: Fri May 08 21:22:47 2015 +0000
description:
cli.rst - qgsize formatting fixed, fixed explanation of --strict-cbr
Subject: [x265] asm: interp_4tap_vert_pp sse2
details: http://hg.videolan.org/x265/rev/a391c5114ef9
branches:
changeset: 10391:a391c5114ef9
user: David T Yuen <dtyx265 at gmail.com>
date: Fri May 08 13:01:22 2015 -0700
description:
asm: interp_4tap_vert_pp sse2
This replaces c code for 2x4, 2x8 and 2x16
64-bit
./test/TestBench --testbench interp | grep vpp
chroma_vpp[ 2x4] 1.80x 644.93 1159.95
chroma_vpp[ 2x8] 1.72x 1204.84 2067.43
chroma_vpp[ 2x16] 1.99x 2252.45 4480.33
32-bit
./test/TestBench --testbench interp | grep vpp
chroma_vpp[ 2x4] 1.80x 822.45 1479.92
chroma_vpp[ 2x8] 1.90x 1477.46 2807.42
chroma_vpp[ 2x16] 2.26x 2770.04 6247.41
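As a reading aid for the interp_4tap_vert_pp changesets, here is a scalar sketch of a 4-tap vertical pixel-to-pixel filter. The rounding constant 32, the shift by 6, the saturation to 8 bits and the one-row back-step of src follow the pw_32, psraw 6, packuswb and "sub r0, r1" steps visible in the asm in this digest. The coefficient table is the standard HEVC chroma table and is assumed to match tabw_ChromaCoeff. The function name, the explicit width/height parameters and the 8-bit pixel type are assumptions; this is not the x265 C primitive.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

using pixel = uint8_t; // assumption: 8-bit build

// HEVC 4-tap chroma interpolation coefficients, indexed by fractional position.
static const int16_t kChromaCoeff[8][4] = {
    {  0, 64,  0,  0 }, { -2, 58, 10, -2 }, { -4, 54, 16, -2 }, { -6, 46, 28, -4 },
    { -4, 36, 36, -4 }, { -4, 28, 46, -6 }, { -2, 16, 54, -4 }, { -2, 10, 58, -2 }
};

void interp4tapVertPP(const pixel* src, intptr_t srcStride,
                      pixel* dst, intptr_t dstStride,
                      int width, int height, int coeffIdx)
{
    assert(coeffIdx >= 0 && coeffIdx < 8);
    const int16_t* c = kChromaCoeff[coeffIdx];

    src -= srcStride; // the four taps cover rows -1 .. +2 around each output row

    for (int row = 0; row < height; row++)
    {
        for (int col = 0; col < width; col++)
        {
            int sum = c[0] * src[col]
                    + c[1] * src[col + srcStride]
                    + c[2] * src[col + 2 * srcStride]
                    + c[3] * src[col + 3 * srcStride];
            // Round, drop the 6 fractional bits (arithmetic shift assumed for
            // negative sums), then saturate to the 8-bit pixel range.
            int val = (sum + 32) >> 6;
            dst[col] = pixel(std::min(255, std::max(0, val)));
        }
        src += srcStride;
        dst += dstStride;
    }
}
```

With coeffIdx 0 the filter reduces to a copy of the source rows, which makes a convenient sanity check for an asm replacement.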
Subject: [x265] asm: interp_4tap_vert_pp sse2
details: http://hg.videolan.org/x265/rev/099134882df4
branches:
changeset: 10392:099134882df4
user: David T Yuen <dtyx265 at gmail.com>
date: Fri May 08 13:08:21 2015 -0700
description:
asm: interp_4tap_vert_pp sse2
This replaces c code for 4x2
64-bit
./test/TestBench --testbench interp | grep vpp | grep " 4x"
chroma_vpp[ 4x2] 2.17x 514.98 1117.46
32-bit
./test/TestBench --testbench interp | grep vpp | grep " 4x"
chroma_vpp[ 4x2] 2.35x 589.99 1387.42
Subject: [x265] asm: interp_4tap_vert_pp sse2
details: http://hg.videolan.org/x265/rev/3cade0851ecd
branches:
changeset: 10393:3cade0851ecd
user: David T Yuen <dtyx265 at gmail.com>
date: Fri May 08 13:34:34 2015 -0700
description:
asm: interp_4tap_vert_pp sse2
This replaces c code for 4x4, 4x8, 4x16 and 4x32
64-bit
./test/TestBench --testbench interp | grep vpp | grep " 4x"
chroma_vpp[ 4x4] 2.11x 1000.01 2107.46
chroma_vpp[ 4x2] 2.13x 524.99 1117.38
chroma_vpp[ 4x8] 2.28x 1932.54 4400.88
chroma_vpp[ 4x16] 2.29x 3782.51 8675.26
chroma_vpp[ 4x8] 2.28x 1927.55 4400.15
chroma_vpp[ 4x4] 2.10x 1005.00 2107.43
chroma_vpp[ 4x16] 2.29x 3782.51 8674.96
chroma_vpp[ 4x32] 2.27x 7475.00 16994.84
chroma_vpp[ 4x4] 2.10x 1005.00 2107.45
chroma_vpp[ 4x8] 2.28x 1927.50 4400.29
chroma_vpp[ 4x16] 2.30x 3777.50 8675.26
32-bit
./test/TestBench --testbench interp | grep vpp | grep " 4x"
chroma_vpp[ 4x4] 2.33x 1159.99 2697.42
chroma_vpp[ 4x2] 2.39x 580.00 1387.46
chroma_vpp[ 4x8] 2.59x 2185.00 5662.72
chroma_vpp[ 4x16] 2.64x 4205.00 11117.68
chroma_vpp[ 4x8] 2.59x 2185.00 5662.75
chroma_vpp[ 4x4] 2.29x 1177.50 2697.49
chroma_vpp[ 4x16] 2.65x 4202.50 11117.68
chroma_vpp[ 4x32] 2.65x 8242.49 21837.50
chroma_vpp[ 4x4] 2.29x 1177.49 2697.42
chroma_vpp[ 4x8] 2.59x 2184.99 5662.75
chroma_vpp[ 4x16] 2.64x 4205.00 11117.68
Subject: [x265] asm: interp_4tap_vert_pp sse2
details: http://hg.videolan.org/x265/rev/f0f222d0f073
branches:
changeset: 10394:f0f222d0f073
user: David T Yuen <dtyx265 at gmail.com>
date: Fri May 08 13:43:25 2015 -0700
description:
asm: interp_4tap_vert_pp sse2
This replaces c code for 6x8 and 6x16 for 64-bit only
64-bit
./test/TestBench --testbench interp | grep vpp | grep " 6x"
chroma_vpp[ 6x8] 2.95x 2152.49 6340.15
chroma_vpp[ 6x16] 3.01x 4159.98 12530.22
Subject: [x265] asm: interp_4tap_vert_pp sse2
details: http://hg.videolan.org/x265/rev/4ecd64010b80
branches:
changeset: 10395:4ecd64010b80
user: David T Yuen <dtyx265 at gmail.com>
date: Fri May 08 13:55:02 2015 -0700
description:
asm: interp_4tap_vert_pp sse2
This replaces c code for 8x2, 8x4 and 8x6 for 64-bit only
64-bit
./test/TestBench --testbench interp | grep vpp | grep " 8x"
chroma_vpp[ 8x4] 3.97x 1047.50 4161.69
chroma_vpp[ 8x6] 3.95x 1559.98 6161.25
chroma_vpp[ 8x2] 3.71x 560.00 2077.42
chroma_vpp[ 8x4] 3.91x 1065.00 4160.75
chroma_vpp[ 8x4] 3.91x 1064.90 4160.91
Subject: [x265] asm: interp_4tap_vert_pp sse2
details: http://hg.videolan.org/x265/rev/c99dbb717aa2
branches:
changeset: 10396:c99dbb717aa2
user: David T Yuen <dtyx265 at gmail.com>
date: Fri May 08 19:14:29 2015 -0700
description:
asm: interp_4tap_vert_pp sse2
This code replaces c code for 8x8, 8x12, 8x16, 8x32 and 8x64
64-bit
./test/TestBench --testbench interp | grep vpp | grep " 8x"
chroma_vpp[ 8x8] 4.07x 2009.95 8188.89
chroma_vpp[ 8x16] 4.07x 3989.99 16231.35
chroma_vpp[ 8x32] 4.05x 7909.96 32071.44
chroma_vpp[ 8x12] 4.06x 3022.50 12270.46
chroma_vpp[ 8x64] 4.07x 15745.53 64057.47
Subject: [x265] use new combo nextState and bitsCost table to reduce memory and address operators in codeCoeffNxN()
details: http://hg.videolan.org/x265/rev/80eab76506d1
branches:
changeset: 10397:80eab76506d1
user: Min Chen <chenm003 at 163.com>
date: Fri May 08 17:20:04 2015 -0700
description:
use new combo nextState and bitsCost table to reduce memory and address operators in codeCoeffNxN()
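The general idea of a combined table can be sketched as follows: store the next context state and the bit cost in one 32-bit entry, so that a single load replaces two table lookups. The bit layout below (state in the top 8 bits, cost in the low 24) and every name are hypothetical, chosen only for illustration. The digest does not show the actual layout of g_entropyStateBits.

```cpp
#include <cstdint>

// Hypothetical packing: next state in bits 31..24, bit cost in bits 23..0.
constexpr uint32_t packEntry(uint8_t nextState, uint32_t bitsCost)
{
    return (uint32_t(nextState) << 24) | (bitsCost & 0x00FFFFFFu);
}

constexpr uint8_t entryNextState(uint32_t entry) { return uint8_t(entry >> 24); }
constexpr uint32_t entryBits(uint32_t entry)     { return entry & 0x00FFFFFFu; }

struct Step
{
    uint8_t  state; // context state after coding the bin
    uint32_t bits;  // accumulated bit cost
};

// One table load yields both the new state and the cost to add.
inline Step advance(const uint32_t* table, uint32_t index, uint32_t accumulatedBits)
{
    const uint32_t entry = table[index];
    return Step{ entryNextState(entry), accumulatedBits + entryBits(entry) };
}
```

A caller would keep the running state and cost in registers and call advance once per coded bin. Compared with separate next-state and cost tables, that is one address computation and one load per bin instead of two.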
Subject: [x265] param: added qcomp into x265 info on ABR or CRF ratecontrol mode
details: http://hg.videolan.org/x265/rev/a218d4f3b49a
branches:
changeset: 10398:a218d4f3b49a
user: Jie Zhang <zj262144 at 163.com>
date: Thu May 07 10:15:31 2015 +0800
description:
param: added qcomp into x265 info on ABR or CRF ratecontrol mode
Subject: [x265] param: space nits
details: http://hg.videolan.org/x265/rev/3700169eb622
branches:
changeset: 10399:3700169eb622
user: Jie Zhang <zj262144 at 163.com>
date: Thu May 07 10:18:17 2015 +0800
description:
param: space nits
Subject: [x265] Merge with default (prep for 1.7)
details: http://hg.videolan.org/x265/rev/b642b3d8cc1e
branches: stable
changeset: 10400:b642b3d8cc1e
user: Steve Borho <steve at borho.org>
date: Sat May 09 12:33:18 2015 -0500
description:
Merge with default (prep for 1.7)
diffstat:
doc/reST/cli.rst | 7 +-
source/common/contexts.h | 1 +
source/common/param.cpp | 6 +-
source/common/x86/asm-primitives.cpp | 55 ++
source/common/x86/ipfilter8.asm | 834 +++++++++++++++++++++++++++++++++++
source/common/x86/ipfilter8.h | 20 +
source/common/x86/pixel-a.asm | 733 ++++++++++++++++++++++++++++++-
source/encoder/entropy.cpp | 45 +-
8 files changed, 1688 insertions(+), 13 deletions(-)
diffs (truncated from 1823 to 300 lines):
diff -r 7a1fd7073941 -r b642b3d8cc1e doc/reST/cli.rst
--- a/doc/reST/cli.rst Tue May 05 14:44:19 2015 -0700
+++ b/doc/reST/cli.rst Sat May 09 12:33:18 2015 -0500
@@ -1143,6 +1143,7 @@ Quality, rate control and rate distortio
**Range of values:** 0.0 to 3.0
.. option:: --qg-size <64|32|16>
+
Enable adaptive quantization for sub-CTUs. This parameter specifies
the minimum CU size at which QP can be adjusted, ie. Quantization Group
size. Allowed range of values are 64, 32, 16 provided this falls within
@@ -1200,12 +1201,12 @@ Quality, rate control and rate distortio
.. option:: --strict-cbr, --no-strict-cbr
Enables stricter conditions to control bitrate deviance from the
- target bitrate in CBR mode. Bitrate adherence is prioritised
+ target bitrate in ABR mode. Bit rate adherence is prioritised
over quality. Rate tolerance is reduced to 50%. Default disabled.
This option is for use-cases which require the final average bitrate
- to be within very strict limits of the target - preventing overshoots
- completely, and achieve bitrates within 5% of target bitrate,
+ to be within very strict limits of the target; preventing overshoots,
+ while keeping the bit rate within 5% of the target setting,
especially in short segment encodes. Typically, the encoder stays
conservative, waiting until there is enough feedback in terms of
encoded frames to control QP. strict-cbr allows the encoder to be
diff -r 7a1fd7073941 -r b642b3d8cc1e source/common/contexts.h
--- a/source/common/contexts.h Tue May 05 14:44:19 2015 -0700
+++ b/source/common/contexts.h Sat May 09 12:33:18 2015 -0500
@@ -106,6 +106,7 @@ namespace x265 {
// private namespace
extern const uint32_t g_entropyBits[128];
+extern const uint32_t g_entropyStateBits[128];
extern const uint8_t g_nextState[128][2];
#define sbacGetMps(S) ((S) & 1)
diff -r 7a1fd7073941 -r b642b3d8cc1e source/common/param.cpp
--- a/source/common/param.cpp Tue May 05 14:44:19 2015 -0700
+++ b/source/common/param.cpp Sat May 09 12:33:18 2015 -0500
@@ -1283,11 +1283,11 @@ void x265_print_params(x265_param* param
else switch (param->rc.rateControlMode)
{
case X265_RC_ABR:
- x265_log(param, X265_LOG_INFO, "Rate Control : ABR-%d kbps\n", param->rc.bitrate); break;
+ x265_log(param, X265_LOG_INFO, "Rate Control / qCompress : ABR-%d kbps / %0.2f\n", param->rc.bitrate, param->rc.qCompress); break;
case X265_RC_CQP:
- x265_log(param, X265_LOG_INFO, "Rate Control : CQP-%d\n", param->rc.qp); break;
+ x265_log(param, X265_LOG_INFO, "Rate Control : CQP-%d\n", param->rc.qp); break;
case X265_RC_CRF:
- x265_log(param, X265_LOG_INFO, "Rate Control : CRF-%0.1f\n", param->rc.rfConstant); break;
+ x265_log(param, X265_LOG_INFO, "Rate Control / qCompress : CRF-%0.1f / %0.2f\n", param->rc.rfConstant, param->rc.qCompress); break;
}
if (param->rc.vbvBufferSize)
diff -r 7a1fd7073941 -r b642b3d8cc1e source/common/x86/asm-primitives.cpp
--- a/source/common/x86/asm-primitives.cpp Tue May 05 14:44:19 2015 -0700
+++ b/source/common/x86/asm-primitives.cpp Sat May 09 12:33:18 2015 -0500
@@ -1181,6 +1181,26 @@ void setupAssemblyPrimitives(EncoderPrim
}
if (cpuMask & X265_CPU_AVX2)
{
+ p.pu[LUMA_48x64].satd = x265_pixel_satd_48x64_avx2;
+
+ p.pu[LUMA_64x16].satd = x265_pixel_satd_64x16_avx2;
+ p.pu[LUMA_64x32].satd = x265_pixel_satd_64x32_avx2;
+ p.pu[LUMA_64x48].satd = x265_pixel_satd_64x48_avx2;
+ p.pu[LUMA_64x64].satd = x265_pixel_satd_64x64_avx2;
+
+ p.pu[LUMA_32x8].satd = x265_pixel_satd_32x8_avx2;
+ p.pu[LUMA_32x16].satd = x265_pixel_satd_32x16_avx2;
+ p.pu[LUMA_32x24].satd = x265_pixel_satd_32x24_avx2;
+ p.pu[LUMA_32x32].satd = x265_pixel_satd_32x32_avx2;
+ p.pu[LUMA_32x64].satd = x265_pixel_satd_32x64_avx2;
+
+ p.pu[LUMA_16x4].satd = x265_pixel_satd_16x4_avx2;
+ p.pu[LUMA_16x8].satd = x265_pixel_satd_16x8_avx2;
+ p.pu[LUMA_16x12].satd = x265_pixel_satd_16x12_avx2;
+ p.pu[LUMA_16x16].satd = x265_pixel_satd_16x16_avx2;
+ p.pu[LUMA_16x32].satd = x265_pixel_satd_16x32_avx2;
+ p.pu[LUMA_16x64].satd = x265_pixel_satd_16x64_avx2;
+
p.cu[BLOCK_32x32].ssd_s = x265_pixel_ssd_s_32_avx2;
p.cu[BLOCK_16x16].sse_ss = x265_pixel_ssd_ss_16x16_avx2;
@@ -1350,6 +1370,41 @@ void setupAssemblyPrimitives(EncoderPrim
CHROMA_420_VSP_FILTERS(_sse2);
CHROMA_422_VSP_FILTERS(_sse2);
CHROMA_444_VSP_FILTERS(_sse2);
+ p.chroma[X265_CSP_I420].pu[CHROMA_420_2x4].filter_vpp = x265_interp_4tap_vert_pp_2x4_sse2;
+ p.chroma[X265_CSP_I420].pu[CHROMA_420_2x8].filter_vpp = x265_interp_4tap_vert_pp_2x8_sse2;
+ p.chroma[X265_CSP_I420].pu[CHROMA_420_4x2].filter_vpp = x265_interp_4tap_vert_pp_4x2_sse2;
+ p.chroma[X265_CSP_I420].pu[CHROMA_420_4x4].filter_vpp = x265_interp_4tap_vert_pp_4x4_sse2;
+ p.chroma[X265_CSP_I420].pu[CHROMA_420_4x8].filter_vpp = x265_interp_4tap_vert_pp_4x8_sse2;
+ p.chroma[X265_CSP_I420].pu[CHROMA_420_4x16].filter_vpp = x265_interp_4tap_vert_pp_4x16_sse2;
+ p.chroma[X265_CSP_I422].pu[CHROMA_422_2x16].filter_vpp = x265_interp_4tap_vert_pp_2x16_sse2;
+ p.chroma[X265_CSP_I422].pu[CHROMA_422_4x4].filter_vpp = x265_interp_4tap_vert_pp_4x4_sse2;
+ p.chroma[X265_CSP_I422].pu[CHROMA_422_4x8].filter_vpp = x265_interp_4tap_vert_pp_4x8_sse2;
+ p.chroma[X265_CSP_I422].pu[CHROMA_422_4x16].filter_vpp = x265_interp_4tap_vert_pp_4x16_sse2;
+ p.chroma[X265_CSP_I422].pu[CHROMA_422_4x32].filter_vpp = x265_interp_4tap_vert_pp_4x32_sse2;
+ p.chroma[X265_CSP_I444].pu[LUMA_4x4].filter_vpp = x265_interp_4tap_vert_pp_4x4_sse2;
+ p.chroma[X265_CSP_I444].pu[LUMA_4x8].filter_vpp = x265_interp_4tap_vert_pp_4x8_sse2;
+ p.chroma[X265_CSP_I444].pu[LUMA_4x16].filter_vpp = x265_interp_4tap_vert_pp_4x16_sse2;
+#if X86_64
+ p.chroma[X265_CSP_I420].pu[CHROMA_420_6x8].filter_vpp = x265_interp_4tap_vert_pp_6x8_sse2;
+ p.chroma[X265_CSP_I420].pu[CHROMA_420_8x2].filter_vpp = x265_interp_4tap_vert_pp_8x2_sse2;
+ p.chroma[X265_CSP_I420].pu[CHROMA_420_8x4].filter_vpp = x265_interp_4tap_vert_pp_8x4_sse2;
+ p.chroma[X265_CSP_I420].pu[CHROMA_420_8x6].filter_vpp = x265_interp_4tap_vert_pp_8x6_sse2;
+ p.chroma[X265_CSP_I420].pu[CHROMA_420_8x8].filter_vpp = x265_interp_4tap_vert_pp_8x8_sse2;
+ p.chroma[X265_CSP_I420].pu[CHROMA_420_8x16].filter_vpp = x265_interp_4tap_vert_pp_8x16_sse2;
+ p.chroma[X265_CSP_I420].pu[CHROMA_420_8x32].filter_vpp = x265_interp_4tap_vert_pp_8x32_sse2;
+ p.chroma[X265_CSP_I422].pu[CHROMA_422_6x16].filter_vpp = x265_interp_4tap_vert_pp_6x16_sse2;
+ p.chroma[X265_CSP_I422].pu[CHROMA_422_8x4].filter_vpp = x265_interp_4tap_vert_pp_8x4_sse2;
+ p.chroma[X265_CSP_I422].pu[CHROMA_422_8x4].filter_vpp = x265_interp_4tap_vert_pp_8x4_sse2;
+ p.chroma[X265_CSP_I422].pu[CHROMA_422_8x8].filter_vpp = x265_interp_4tap_vert_pp_8x8_sse2;
+ p.chroma[X265_CSP_I422].pu[CHROMA_422_8x12].filter_vpp = x265_interp_4tap_vert_pp_8x12_sse2;
+ p.chroma[X265_CSP_I422].pu[CHROMA_422_8x16].filter_vpp = x265_interp_4tap_vert_pp_8x16_sse2;
+ p.chroma[X265_CSP_I422].pu[CHROMA_422_8x32].filter_vpp = x265_interp_4tap_vert_pp_8x32_sse2;
+ p.chroma[X265_CSP_I422].pu[CHROMA_422_8x64].filter_vpp = x265_interp_4tap_vert_pp_8x64_sse2;
+ p.chroma[X265_CSP_I444].pu[LUMA_8x4].filter_vpp = x265_interp_4tap_vert_pp_8x4_sse2;
+ p.chroma[X265_CSP_I444].pu[LUMA_8x8].filter_vpp = x265_interp_4tap_vert_pp_8x8_sse2;
+ p.chroma[X265_CSP_I444].pu[LUMA_8x16].filter_vpp = x265_interp_4tap_vert_pp_8x16_sse2;
+ p.chroma[X265_CSP_I444].pu[LUMA_8x32].filter_vpp = x265_interp_4tap_vert_pp_8x32_sse2;
+#endif
ALL_LUMA_PU(luma_hpp, interp_8tap_horiz_pp, sse2);
p.pu[LUMA_4x4].luma_hpp = x265_interp_8tap_horiz_pp_4x4_sse2;
diff -r 7a1fd7073941 -r b642b3d8cc1e source/common/x86/ipfilter8.asm
--- a/source/common/x86/ipfilter8.asm Tue May 05 14:44:19 2015 -0700
+++ b/source/common/x86/ipfilter8.asm Sat May 09 12:33:18 2015 -0500
@@ -1042,6 +1042,840 @@ cglobal interp_8tap_horiz_%3_%1x%2, 4,6,
IPFILTER_LUMA_sse2 64, 16, ps
IPFILTER_LUMA_sse2 16, 64, ps
+%macro WORD_TO_DOUBLE 1
+%if ARCH_X86_64
+ punpcklbw %1, m8
+%else
+ punpcklbw %1, %1
+ psrlw %1, 8
+%endif
+%endmacro
+
+;-----------------------------------------------------------------------------
+; void interp_4tap_vert_pp_2xn(pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx)
+;-----------------------------------------------------------------------------
+%macro FILTER_V4_W2_H4_sse2 1
+INIT_XMM sse2
+%if ARCH_X86_64
+cglobal interp_4tap_vert_pp_2x%1, 4, 6, 9
+ pxor m8, m8
+%else
+cglobal interp_4tap_vert_pp_2x%1, 4, 6, 8
+%endif
+ mov r4d, r4m
+ sub r0, r1
+
+%ifdef PIC
+ lea r5, [tabw_ChromaCoeff]
+ movh m0, [r5 + r4 * 8]
+%else
+ movh m0, [tabw_ChromaCoeff + r4 * 8]
+%endif
+
+ punpcklqdq m0, m0
+ mova m1, [pw_32]
+ lea r5, [3 * r1]
+
+%assign x 1
+%rep %1/4
+ movd m2, [r0]
+ movd m3, [r0 + r1]
+ movd m4, [r0 + 2 * r1]
+ movd m5, [r0 + r5]
+
+ punpcklbw m2, m3
+ punpcklbw m6, m4, m5
+ punpcklwd m2, m6
+
+ WORD_TO_DOUBLE m2
+ pmaddwd m2, m0
+
+ lea r0, [r0 + 4 * r1]
+ movd m6, [r0]
+
+ punpcklbw m3, m4
+ punpcklbw m7, m5, m6
+ punpcklwd m3, m7
+
+ WORD_TO_DOUBLE m3
+ pmaddwd m3, m0
+
+ packssdw m2, m3
+ pshuflw m3, m2, q2301
+ pshufhw m3, m3, q2301
+ paddw m2, m3
+ psrld m2, 16
+
+ movd m7, [r0 + r1]
+
+ punpcklbw m4, m5
+ punpcklbw m3, m6, m7
+ punpcklwd m4, m3
+
+ WORD_TO_DOUBLE m4
+ pmaddwd m4, m0
+
+ movd m3, [r0 + 2 * r1]
+
+ punpcklbw m5, m6
+ punpcklbw m7, m3
+ punpcklwd m5, m7
+
+ WORD_TO_DOUBLE m5
+ pmaddwd m5, m0
+
+ packssdw m4, m5
+ pshuflw m5, m4, q2301
+ pshufhw m5, m5, q2301
+ paddw m4, m5
+ psrld m4, 16
+
+ packssdw m2, m4
+ paddw m2, m1
+ psraw m2, 6
+ packuswb m2, m2
+
+%if ARCH_X86_64
+ movq r4, m2
+ mov [r2], r4w
+ shr r4, 16
+ mov [r2 + r3], r4w
+ lea r2, [r2 + 2 * r3]
+ shr r4, 16
+ mov [r2], r4w
+ shr r4, 16
+ mov [r2 + r3], r4w
+%else
+ movd r4, m2
+ mov [r2], r4w
+ shr r4, 16
+ mov [r2 + r3], r4w
+ lea r2, [r2 + 2 * r3]
+ psrldq m2, 4
+ movd r4, m2
+ mov [r2], r4w
+ shr r4, 16
+ mov [r2 + r3], r4w
+%endif
+
+%if x < %1/4
+ lea r2, [r2 + 2 * r3]
+%endif
+%assign x x+1
+%endrep
+ RET
+
+%endmacro
+
+ FILTER_V4_W2_H4_sse2 4
+ FILTER_V4_W2_H4_sse2 8
+ FILTER_V4_W2_H4_sse2 16
+
+;-----------------------------------------------------------------------------
+; void interp_4tap_vert_pp_4x2(pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx)
+;-----------------------------------------------------------------------------
+INIT_XMM sse2
+cglobal interp_4tap_vert_pp_4x2, 4, 6, 8
+
+ mov r4d, r4m
+ sub r0, r1
+ pxor m7, m7
+
+%ifdef PIC
+ lea r5, [tabw_ChromaCoeff]
+ movh m0, [r5 + r4 * 8]
+%else
+ movh m0, [tabw_ChromaCoeff + r4 * 8]
+%endif
+
+ lea r5, [r0 + 2 * r1]
+ punpcklqdq m0, m0
+ movd m2, [r0]
+ movd m3, [r0 + r1]
+ movd m4, [r5]
+ movd m5, [r5 + r1]
+
+ punpcklbw m2, m3
+ punpcklbw m1, m4, m5
+ punpcklwd m2, m1
+
+ movhlps m6, m2
+ punpcklbw m2, m7
+ punpcklbw m6, m7
+ pmaddwd m2, m0
+ pmaddwd m6, m0
+ packssdw m2, m6
+
+ movd m1, [r0 + 4 * r1]