[x265-commits] [x265] api: clarify docs and use of x265_api_get()

Thu Apr 30 22:37:02 CEST 2015

details:   http://hg.videolan.org/x265/rev/a3ba8c92dcea
branches:  
changeset: 10329:a3ba8c92dcea
user:      Deepthi Nandakumar <deepthi at multicorewareinc.com>
date:      Thu Apr 30 09:44:07 2015 +0530
description:
api: clarify docs and use of x265_api_get()
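
The library-selection rules that the clarified api.rst text describes for x265_api_get() can be sketched as a small lookup. This is only a model of the documented behaviour, not the libx265 implementation: the function name libraryForBitDepth, the "<linked libx265>" marker, and the empty-string result for unsupported depths are assumptions made for illustration. The DLL names are the Windows names listed in the patch.

```cpp
#include <cassert>
#include <string>

// Hypothetical model of the lookup described in doc/reST/api.rst.
// requestedDepth: value the application would pass to x265_api_get().
// linkedDepth:    bit depth of the libx265 build the application links to.
// Returns "<linked libx265>" when the linked build is used, the shared
// library name to bind otherwise, or "" for depths the docs do not cover.
std::string libraryForBitDepth(int requestedDepth, int linkedDepth)
{
    // The documentation says a linked build supports one depth, 8 or 10.
    assert(linkedDepth == 8 || linkedDepth == 10);

    // Zero means "system default"; a matching depth needs no other library.
    if (requestedDepth == 0 || requestedDepth == linkedDepth)
        return "<linked libx265>";

    // A different depth makes the linked library try to bind a shared
    // library whose name encodes the requested depth.
    if (requestedDepth == 8)
        return "libx265_main.dll";
    if (requestedDepth == 10)
        return "libx265_main10.dll";

    return "";
}
```

Under this model, an 8bpp executable asking for depth 10 resolves to libx265_main10.dll, which matches the packaging example in the documentation.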
Subject: [x265] doc: replace sublayer with enhancement layer

details:   http://hg.videolan.org/x265/rev/2a1dd8a1b324
branches:  
changeset: 10330:2a1dd8a1b324
user:      Deepthi Nandakumar <deepthi at multicorewareinc.com>
date:      Thu Apr 30 13:27:12 2015 +0530
description:
doc: replace sublayer with enhancement layer
Subject: [x265] asm: chroma_hpp[48x64] for i444 - improved 17498c->13381c

details:   http://hg.videolan.org/x265/rev/5c9b9856de29
branches:  
changeset: 10331:5c9b9856de29
user:      Aasaipriya Chandran <aasaipriya at multicorewareinc.com>
date:      Wed Apr 29 17:11:03 2015 +0530
description:
asm: chroma_hpp[48x64] for i444 - improved 17498c->13381c
Subject: [x265] convert sigCtx table from [4][4] to [16]

details:   http://hg.videolan.org/x265/rev/60a66c581d67
branches:  
changeset: 10332:60a66c581d67
user:      Min Chen <chenm003 at 163.com>
date:      Thu Apr 30 18:48:46 2015 +0800
description:
convert sigCtx table from [4][4] to [16]
Subject: [x265] pre-compute abs coeff and simplify scan table

details:   http://hg.videolan.org/x265/rev/e6f14a4b35ed
branches:  
changeset: 10333:e6f14a4b35ed
user:      Min Chen <chenm003 at 163.com>
date:      Thu Apr 30 18:48:49 2015 +0800
description:
pre-compute abs coeff and simplify scan table
Subject: [x265] remove reduce check on firstC2FlagIdx

details:   http://hg.videolan.org/x265/rev/f406b2e6262e
branches:  
changeset: 10334:f406b2e6262e
user:      Min Chen <chenm003 at 163.com>
date:      Thu Apr 30 18:48:53 2015 +0800
description:
remove reduce check on firstC2FlagIdx
Subject: [x265] fast RD path on encode coeff remain code in codeCoeffNxN()

details:   http://hg.videolan.org/x265/rev/2158765e992f
branches:  
changeset: 10335:2158765e992f
user:      Min Chen <chenm003 at 163.com>
date:      Thu Apr 30 18:48:57 2015 +0800
description:
fast RD path on encode coeff remain code in codeCoeffNxN()
Subject: [x265] improve compute on baseLevel by 2-bits encode code

details:   http://hg.videolan.org/x265/rev/84b6da2f3da0
branches:  
changeset: 10336:84b6da2f3da0
user:      Min Chen <chenm003 at 163.com>
date:      Thu Apr 30 18:49:01 2015 +0800
description:
improve compute on baseLevel by 2-bits encode code
Subject: [x265] simplify compute on get codeNumber length

details:   http://hg.videolan.org/x265/rev/73a3bfc8c2a2
branches:  
changeset: 10337:73a3bfc8c2a2
user:      Min Chen <chenm003 at 163.com>
date:      Thu Apr 30 18:49:05 2015 +0800
description:
simplify compute on get codeNumber length
Subject: [x265] faster clip operator on goRiceParam

details:   http://hg.videolan.org/x265/rev/432f2e3df326
branches:  
changeset: 10338:432f2e3df326
user:      Min Chen <chenm003 at 163.com>
date:      Thu Apr 30 18:49:09 2015 +0800
description:
faster clip operator on goRiceParam
Subject: [x265] simplify logic on get coeff remain cost in codeCoeffNxN()

details:   http://hg.videolan.org/x265/rev/d774ef13d9a5
branches:  
changeset: 10339:d774ef13d9a5
user:      Min Chen <chenm003 at 163.com>
date:      Thu Apr 30 18:49:14 2015 +0800
description:
simplify logic on get coeff remain cost in codeCoeffNxN()
Subject: [x265] fix check failure in Entropy::writeCoefRemainExGolomb()

details:   http://hg.videolan.org/x265/rev/554a5c9b1646
branches:  
changeset: 10340:554a5c9b1646
user:      Min Chen <chenm003 at 163.com>
date:      Thu Apr 30 19:52:30 2015 +0800
description:
fix check failure in Entropy::writeCoefRemainExGolomb()
Subject: [x265] asm: downgrade x265_interp_8tap_hv_pp_8x8 from SSE4 to SSSE3

details:   http://hg.videolan.org/x265/rev/e7aba11a3bbc
branches:  
changeset: 10341:e7aba11a3bbc
user:      Min Chen <chenm003 at 163.com>
date:      Thu Apr 30 19:52:36 2015 +0800
description:
asm: downgrade x265_interp_8tap_hv_pp_8x8 from SSE4 to SSSE3
Subject: [x265] asm: filter_vpp, filter_vps for 12x32 in avx2

details:   http://hg.videolan.org/x265/rev/ba4f1516cea2
branches:  
changeset: 10342:ba4f1516cea2
user:      Divya Manivannan <divya at multicorewareinc.com>
date:      Thu Apr 30 15:16:19 2015 +0530
description:
asm: filter_vpp, filter_vps for 12x32 in avx2

filter_vpp[12x32]: 2307c->1885c
filter_vps[12x32]: 1884c->1612c
Subject: [x265] asm: filter_vpp, filter_vps for 8x12 in avx2

details:   http://hg.videolan.org/x265/rev/0562f6ae98f1
branches:  
changeset: 10343:0562f6ae98f1
user:      Divya Manivannan <divya at multicorewareinc.com>
date:      Thu Apr 30 15:31:33 2015 +0530
description:
asm: filter_vpp, filter_vps for 8x12 in avx2

filter_vpp[8x12]: 425c->388c
filter_vps[8x12]: 458c->388c
Subject: [x265] asm: filter_vpp, filter_vps for 2x4 in avx2

details:   http://hg.videolan.org/x265/rev/21b710bafb92
branches:  
changeset: 10344:21b710bafb92
user:      Divya Manivannan <divya at multicorewareinc.com>
date:      Thu Apr 30 18:08:03 2015 +0530
description:
asm: filter_vpp, filter_vps for 2x4 in avx2
Subject: [x265] search: cleanup checkBestMVP(), no behavior change

details:   http://hg.videolan.org/x265/rev/2b3275b1eb85
branches:  
changeset: 10345:2b3275b1eb85
user:      Steve Borho <steve at borho.org>
date:      Wed Apr 29 12:36:21 2015 -0500
description:
search: cleanup checkBestMVP(), no behavior change
Subject: [x265] search: introduce selectMVP helper method

details:   http://hg.videolan.org/x265/rev/acf4ede2ca53
branches:  
changeset: 10346:acf4ede2ca53
user:      Steve Borho <steve at borho.org>
date:      Wed Apr 29 14:40:48 2015 -0500
description:
search: introduce selectMVP helper method
Subject: [x265] search: do not clip MVP in setSearchRange()

details:   http://hg.videolan.org/x265/rev/5f89a0776b96
branches:  
changeset: 10347:5f89a0776b96
user:      Steve Borho <steve at borho.org>
date:      Wed Apr 29 14:50:26 2015 -0500
description:
search: do not clip MVP in setSearchRange()

The MVP itself should not be clipped, since this will make MVD calculations
incorrect.  Motion estimation is always careful to clip all motion vectors to
within the available pixel range (mvmin/mvmax) during the search, so it is safe
for the MVP to be out of range.
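
One way to see why a clipped MVP breaks the MVD is a minimal sketch like the one below. The MV struct and the helper names are hypothetical stand-ins, not the x265 classes, and the sketch assumes the decoder rebuilds each vector as predictor plus difference using the predictor exactly as derived.

```cpp
#include <algorithm>
#include <cassert>

// Hypothetical stand-in for a motion vector; not the x265 MV class.
struct MV
{
    int x, y;
};

inline MV operator+(MV a, MV b) { return MV{a.x + b.x, a.y + b.y}; }
inline MV operator-(MV a, MV b) { return MV{a.x - b.x, a.y - b.y}; }
inline bool operator==(MV a, MV b) { return a.x == b.x && a.y == b.y; }

// Clamp a vector into the searchable range [mvmin, mvmax].
inline MV clipMV(MV v, MV mvmin, MV mvmax)
{
    assert(mvmin.x <= mvmax.x && mvmin.y <= mvmax.y);
    return MV{std::max(mvmin.x, std::min(v.x, mvmax.x)),
              std::max(mvmin.y, std::min(v.y, mvmax.y))};
}

// Encoder side: the coded difference is taken against a predictor.
inline MV encodeMVD(MV bestMv, MV mvp) { return bestMv - mvp; }

// Decoder side (assumption): predictor is used as derived, never clipped.
inline MV decodeMV(MV mvp, MV mvd) { return mvp + mvd; }
```

With a range of [-64, 64] and an out-of-range predictor of (100, 0), a search result of (64, 4) round-trips correctly when the difference is taken against the unclipped predictor. Taking it against the clipped predictor (64, 0) instead yields a difference of (0, 4), which the decoder would turn into (100, 4), a different vector from the one the encoder chose.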
Subject: [x265] search: allow AMP to use motion estimation for 64x64 CUs

details:   http://hg.videolan.org/x265/rev/bca33880585a
branches:  
changeset: 10348:bca33880585a
user:      Steve Borho <steve at borho.org>
date:      Sat Apr 25 00:41:25 2015 -0500
description:
search: allow AMP to use motion estimation for 64x64 CUs

This was a hold-over from the HM, which never performed motion searches for
AMP PUs in 64x64 CUs, presumably because they were never optimized.
Because of the way the RD levels were developed, RD levels 0..4 always
hard-coded bMergeOnly to false and, to compensate, never attempted AMP
modes for 64x64 CUs.

This patch makes AMP partitions always perform motion estimation, regardless of
CU size and RD level, and it removes the bMergeOnly argument to predInterSearch.
It should give a small improvement to compression efficiency at slower presets
for a minimal performance cost (since 64x64 inter analysis is relatively rare).
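
The behaviour change can be summarised as a before/after decision table. The sketch below is a reading of the commit message only, for a CU where AMP is otherwise enabled: the enum and both function names are hypothetical, and the merge-only result for RD levels above 4 is an assumption inferred from the HM hold-over described above, not something checked against the code.

```cpp
#include <cassert>

// Hypothetical names for illustration; none of these exist in x265.
enum class AmpSearch
{
    NotAttempted,     // AMP partitions are skipped entirely
    MergeOnly,        // only merge candidates are evaluated
    MotionEstimation  // a full motion search is performed
};

// Before the patch, as described in the commit message.
inline AmpSearch ampSearchBefore(int cuSize, int rdLevel)
{
    assert(rdLevel >= 0);
    if (cuSize == 64)
    {
        // RD 0..4 never attempted AMP at 64x64; higher levels are assumed
        // to have taken the HM-style merge-only path.
        return rdLevel <= 4 ? AmpSearch::NotAttempted : AmpSearch::MergeOnly;
    }
    return AmpSearch::MotionEstimation;
}

// After the patch: motion estimation regardless of CU size and RD level.
inline AmpSearch ampSearchAfter(int /*cuSize*/, int rdLevel)
{
    assert(rdLevel >= 0);
    return AmpSearch::MotionEstimation;
}
```

Only the 64x64 rows differ between the two functions, which is consistent with the commit's expectation of a small cost, since 64x64 inter analysis is relatively rare.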

diffstat:

 doc/reST/api.rst                     |   26 ++-
 doc/reST/cli.rst                     |    4 +-
 source/common/x86/asm-primitives.cpp |   12 +-
 source/common/x86/ipfilter8.asm      |  252 +++++++++++++++++++++++++++++++++-
 source/common/x86/ipfilter8.h        |    2 +-
 source/encoder/analysis.cpp          |   45 ++---
 source/encoder/analysis.h            |    2 +-
 source/encoder/entropy.cpp           |  175 +++++++++++++++---------
 source/encoder/search.cpp            |  171 ++++++++---------------
 source/encoder/search.h              |    7 +-
 source/x265.cpp                      |    4 +-
 11 files changed, 468 insertions(+), 232 deletions(-)

diffs (truncated from 1231 to 300 lines):

diff -r 74d7fe7a81ad -r bca33880585a doc/reST/api.rst
--- a/doc/reST/api.rst	Wed Apr 29 11:08:44 2015 -0500
+++ b/doc/reST/api.rst	Sat Apr 25 00:41:25 2015 -0500
@@ -352,7 +352,7 @@ CTU size::
 Multi-library Interface
 =======================
 
-If your application might want to make a runtime selection between among
+If your application might want to make a runtime selection between
 a number of libx265 libraries (perhaps 8bpp and 16bpp), then you will
 want to use the multi-library interface.
 
@@ -370,16 +370,20 @@ without the **x265_** prefix. So **x265_
      *   libx265 */
     const x265_api* x265_api_get(int bitDepth);
 
-The general idea is to request the API for the bitDepth you would prefer
-the encoder to use (8 or 10), and if that returns NULL you request the
-API for bitDepth=0, which returns the system default libx265.
+Note that using this multi-library API in your application is only the
+first step.
 
-Note that using this multi-library API in your application is only the
-first step.  Your application must link to one build of libx265
-(statically or dynamically) and this linked version of libx265 will
-support one bit-depth (8 or 10 bits). If you request a different
-bit-depth, the linked libx265 will attempt to dynamically bind a shared
-library libx265 with a name appropriate for the requested bit-depth:
+Your application must link to one build of libx265 (statically or 
+dynamically) and this linked version of libx265 will support one 
+bit-depth (8 or 10 bits). 
+
+Your application must now request the API for the bitDepth you would 
+prefer the encoder to use (8 or 10). If the requested bitdepth is zero, 
+or if it matches the bitdepth of the system default libx265 (the 
+currently linked library), then this library will be used for encode.
+If you request a different bit-depth, the linked libx265 will attempt 
+to dynamically bind a shared library with a name appropriate for the 
+requested bit-depth:
 
     8-bit:  libx265_main.dll
     10-bit: libx265_main10.dll
@@ -390,7 +394,7 @@ library libx265 with a name appropriate 
 For example on Windows, one could package together an x265.exe
 statically linked against the 8bpp libx265 together with a
 libx265_main10.dll in the same folder, and this executable would be able
-to encode 10bit bitstreams by specifying -P main10 on the command line.
+to encode main and main10 bitstreams.
 
 On Linux, x265 packagers could install 8bpp static and shared libraries
 under the name libx265 (so all applications link against 8bpp libx265)
diff -r 74d7fe7a81ad -r bca33880585a doc/reST/cli.rst
--- a/doc/reST/cli.rst	Wed Apr 29 11:08:44 2015 -0500
+++ b/doc/reST/cli.rst	Sat Apr 25 00:41:25 2015 -0500
@@ -1559,8 +1559,8 @@ Bitstream options
 
 	Enable a temporal sub layer. All referenced I/P/B frames are in the
 	base layer and all unreferenced B frames are placed in a temporal
-	sublayer. A decoder may chose to drop the sublayer and only decode
-	and display the base layer slices.
+	enhancement layer. A decoder may chose to drop the enhancement layer 
+	and only decode and display the base layer slices.
 	
 	If used with a fixed GOP (:option:`b-adapt` 0) and :option:`bframes`
 	3 then the two layers evenly split the frame rate, with a cadence of
diff -r 74d7fe7a81ad -r bca33880585a source/common/x86/asm-primitives.cpp
--- a/source/common/x86/asm-primitives.cpp	Wed Apr 29 11:08:44 2015 -0500
+++ b/source/common/x86/asm-primitives.cpp	Sat Apr 25 00:41:25 2015 -0500
@@ -1447,6 +1447,9 @@ void setupAssemblyPrimitives(EncoderPrim
 
         ALL_LUMA_TU(count_nonzero, count_nonzero, ssse3);
 
+        // MUST be done after LUMA_FILTERS() to overwrite default version
+        p.pu[LUMA_8x8].luma_hvpp = x265_interp_8tap_hv_pp_8x8_ssse3;
+
         p.frameInitLowres = x265_frame_init_lowres_core_ssse3;
         p.scale1D_128to64 = x265_scale1D_128to64_ssse3;
         p.scale2D_64to32 = x265_scale2D_64to32_ssse3;
@@ -1548,7 +1551,7 @@ void setupAssemblyPrimitives(EncoderPrim
         CHROMA_444_VSP_FILTERS_SSE4(_sse4);
 
         // MUST be done after LUMA_FILTERS() to overwrite default version
-        p.pu[LUMA_8x8].luma_hvpp = x265_interp_8tap_hv_pp_8x8_sse4;
+        p.pu[LUMA_8x8].luma_hvpp = x265_interp_8tap_hv_pp_8x8_ssse3;
 
         LUMA_CU_BLOCKCOPY(ps, sse4);
         CHROMA_420_CU_BLOCKCOPY(ps, sse4);
@@ -2408,6 +2411,7 @@ void setupAssemblyPrimitives(EncoderPrim
         p.chroma[X265_CSP_I444].pu[LUMA_64x32].filter_hpp = x265_interp_4tap_horiz_pp_64x32_avx2;
         p.chroma[X265_CSP_I444].pu[LUMA_64x48].filter_hpp = x265_interp_4tap_horiz_pp_64x48_avx2;
         p.chroma[X265_CSP_I444].pu[LUMA_64x16].filter_hpp = x265_interp_4tap_horiz_pp_64x16_avx2;
+        p.chroma[X265_CSP_I444].pu[LUMA_48x64].filter_hpp = x265_interp_4tap_horiz_pp_48x64_avx2;
 
         p.chroma[X265_CSP_I422].pu[CHROMA_422_4x4].filter_hps = x265_interp_4tap_horiz_ps_4x4_avx2;
         p.chroma[X265_CSP_I422].pu[CHROMA_422_4x8].filter_hps = x265_interp_4tap_horiz_ps_4x8_avx2;
@@ -2536,6 +2540,9 @@ void setupAssemblyPrimitives(EncoderPrim
         p.chroma[X265_CSP_I422].pu[CHROMA_422_8x64].filter_vps = x265_interp_4tap_vert_ps_8x64_avx2;
         p.chroma[X265_CSP_I422].pu[CHROMA_422_32x64].filter_vps = x265_interp_4tap_vert_ps_32x64_avx2;
         p.chroma[X265_CSP_I422].pu[CHROMA_422_32x48].filter_vps = x265_interp_4tap_vert_ps_32x48_avx2;
+        p.chroma[X265_CSP_I422].pu[CHROMA_422_12x32].filter_vps = x265_interp_4tap_vert_ps_12x32_avx2;
+        p.chroma[X265_CSP_I422].pu[CHROMA_422_8x12].filter_vps = x265_interp_4tap_vert_ps_8x12_avx2;
+        p.chroma[X265_CSP_I422].pu[CHROMA_422_2x4].filter_vps = x265_interp_4tap_vert_ps_2x4_avx2;
 
         //i444 for chroma_vps
         p.chroma[X265_CSP_I444].pu[LUMA_4x4].filter_vps = x265_interp_4tap_vert_ps_4x4_avx2;
@@ -2577,6 +2584,9 @@ void setupAssemblyPrimitives(EncoderPrim
         p.chroma[X265_CSP_I422].pu[CHROMA_422_8x64].filter_vpp = x265_interp_4tap_vert_pp_8x64_avx2;
         p.chroma[X265_CSP_I422].pu[CHROMA_422_32x64].filter_vpp = x265_interp_4tap_vert_pp_32x64_avx2;
         p.chroma[X265_CSP_I422].pu[CHROMA_422_32x48].filter_vpp = x265_interp_4tap_vert_pp_32x48_avx2;
+        p.chroma[X265_CSP_I422].pu[CHROMA_422_12x32].filter_vpp = x265_interp_4tap_vert_pp_12x32_avx2;
+        p.chroma[X265_CSP_I422].pu[CHROMA_422_8x12].filter_vpp = x265_interp_4tap_vert_pp_8x12_avx2;
+        p.chroma[X265_CSP_I422].pu[CHROMA_422_2x4].filter_vpp = x265_interp_4tap_vert_pp_2x4_avx2;
 
         //i444 for chroma_vpp
         p.chroma[X265_CSP_I444].pu[LUMA_4x4].filter_vpp = x265_interp_4tap_vert_pp_4x4_avx2;
diff -r 74d7fe7a81ad -r bca33880585a source/common/x86/ipfilter8.asm
--- a/source/common/x86/ipfilter8.asm	Wed Apr 29 11:08:44 2015 -0500
+++ b/source/common/x86/ipfilter8.asm	Sat Apr 25 00:41:25 2015 -0500
@@ -3157,7 +3157,7 @@ cglobal interp_8tap_horiz_pp_%1x%2, 4,6,
 ;-----------------------------------------------------------------------------
 ; void interp_8tap_hv_pp_%1x%2(pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int idxX, int idxY)
 ;-----------------------------------------------------------------------------
-INIT_XMM sse4
+INIT_XMM ssse3
 cglobal interp_8tap_hv_pp_8x8, 4, 7, 8, 0-15*16
 %define coef        m7
 %define stk_buf     rsp
@@ -5556,6 +5556,148 @@ cglobal interp_4tap_vert_%1_8x16, 4, 7, 
     FILTER_VER_CHROMA_AVX2_8x16 pp
     FILTER_VER_CHROMA_AVX2_8x16 ps
 
+%macro FILTER_VER_CHROMA_AVX2_8x12 1
+INIT_YMM avx2
+cglobal interp_4tap_vert_%1_8x12, 4, 7, 8
+    mov             r4d, r4m
+    shl             r4d, 6
+
+%ifdef PIC
+    lea             r5, [tab_ChromaCoeffVer_32]
+    add             r5, r4
+%else
+    lea             r5, [tab_ChromaCoeffVer_32 + r4]
+%endif
+
+    lea             r4, [r1 * 3]
+    sub             r0, r1
+%ifidn %1, pp
+    mova            m7, [pw_512]
+%else
+    add             r3d, r3d
+    mova            m7, [pw_2000]
+%endif
+    lea             r6, [r3 * 3]
+    movq            xm1, [r0]                       ; m1 = row 0
+    movq            xm2, [r0 + r1]                  ; m2 = row 1
+    punpcklbw       xm1, xm2
+    movq            xm3, [r0 + r1 * 2]              ; m3 = row 2
+    punpcklbw       xm2, xm3
+    vinserti128     m5, m1, xm2, 1
+    pmaddubsw       m5, [r5]
+    movq            xm4, [r0 + r4]                  ; m4 = row 3
+    punpcklbw       xm3, xm4
+    lea             r0, [r0 + r1 * 4]
+    movq            xm1, [r0]                       ; m1 = row 4
+    punpcklbw       xm4, xm1
+    vinserti128     m2, m3, xm4, 1
+    pmaddubsw       m0, m2, [r5 + 1 * mmsize]
+    paddw           m5, m0
+    pmaddubsw       m2, [r5]
+    movq            xm3, [r0 + r1]                  ; m3 = row 5
+    punpcklbw       xm1, xm3
+    movq            xm4, [r0 + r1 * 2]              ; m4 = row 6
+    punpcklbw       xm3, xm4
+    vinserti128     m1, m1, xm3, 1
+    pmaddubsw       m0, m1, [r5 + 1 * mmsize]
+    paddw           m2, m0
+    pmaddubsw       m1, [r5]
+    movq            xm3, [r0 + r4]                  ; m3 = row 7
+    punpcklbw       xm4, xm3
+    lea             r0, [r0 + r1 * 4]
+    movq            xm0, [r0]                       ; m0 = row 8
+    punpcklbw       xm3, xm0
+    vinserti128     m4, m4, xm3, 1
+    pmaddubsw       m3, m4, [r5 + 1 * mmsize]
+    paddw           m1, m3
+    pmaddubsw       m4, [r5]
+    movq            xm3, [r0 + r1]                  ; m3 = row 9
+    punpcklbw       xm0, xm3
+    movq            xm6, [r0 + r1 * 2]              ; m6 = row 10
+    punpcklbw       xm3, xm6
+    vinserti128     m0, m0, xm3, 1
+    pmaddubsw       m3, m0, [r5 + 1 * mmsize]
+    paddw           m4, m3
+    pmaddubsw       m0, [r5]
+%ifidn %1, pp
+    pmulhrsw        m5, m7                          ; m5 = word: row 0, row 1
+    pmulhrsw        m2, m7                          ; m2 = word: row 2, row 3
+    pmulhrsw        m1, m7                          ; m1 = word: row 4, row 5
+    pmulhrsw        m4, m7                          ; m4 = word: row 6, row 7
+    packuswb        m5, m2
+    packuswb        m1, m4
+    vextracti128    xm2, m5, 1
+    vextracti128    xm4, m1, 1
+    movq            [r2], xm5
+    movq            [r2 + r3], xm2
+    movhps          [r2 + r3 * 2], xm5
+    movhps          [r2 + r6], xm2
+    lea             r2, [r2 + r3 * 4]
+    movq            [r2], xm1
+    movq            [r2 + r3], xm4
+    movhps          [r2 + r3 * 2], xm1
+    movhps          [r2 + r6], xm4
+%else
+    psubw           m5, m7                          ; m5 = word: row 0, row 1
+    psubw           m2, m7                          ; m2 = word: row 2, row 3
+    psubw           m1, m7                          ; m1 = word: row 4, row 5
+    psubw           m4, m7                          ; m4 = word: row 6, row 7
+    vextracti128    xm3, m5, 1
+    movu            [r2], xm5
+    movu            [r2 + r3], xm3
+    vextracti128    xm3, m2, 1
+    movu            [r2 + r3 * 2], xm2
+    movu            [r2 + r6], xm3
+    lea             r2, [r2 + r3 * 4]
+    vextracti128    xm5, m1, 1
+    vextracti128    xm3, m4, 1
+    movu            [r2], xm1
+    movu            [r2 + r3], xm5
+    movu            [r2 + r3 * 2], xm4
+    movu            [r2 + r6], xm3
+%endif
+    movq            xm3, [r0 + r4]                  ; m3 = row 11
+    punpcklbw       xm6, xm3
+    lea             r0, [r0 + r1 * 4]
+    movq            xm5, [r0]                       ; m5 = row 12
+    punpcklbw       xm3, xm5
+    vinserti128     m6, m6, xm3, 1
+    pmaddubsw       m3, m6, [r5 + 1 * mmsize]
+    paddw           m0, m3
+    pmaddubsw       m6, [r5]
+    movq            xm3, [r0 + r1]                  ; m3 = row 13
+    punpcklbw       xm5, xm3
+    movq            xm2, [r0 + r1 * 2]              ; m2 = row 14
+    punpcklbw       xm3, xm2
+    vinserti128     m5, m5, xm3, 1
+    pmaddubsw       m3, m5, [r5 + 1 * mmsize]
+    paddw           m6, m3
+    lea             r2, [r2 + r3 * 4]
+%ifidn %1, pp
+    pmulhrsw        m0, m7                          ; m0 = word: row 8, row 9
+    pmulhrsw        m6, m7                          ; m6 = word: row 10, row 11
+    packuswb        m0, m6
+    vextracti128    xm6, m0, 1
+    movq            [r2], xm0
+    movq            [r2 + r3], xm6
+    movhps          [r2 + r3 * 2], xm0
+    movhps          [r2 + r6], xm6
+%else
+    psubw           m0, m7                          ; m0 = word: row 8, row 9
+    psubw           m6, m7                          ; m6 = word: row 10, row 11
+    vextracti128    xm1, m0, 1
+    vextracti128    xm3, m6, 1
+    movu            [r2], xm0
+    movu            [r2 + r3], xm1
+    movu            [r2 + r3 * 2], xm6
+    movu            [r2 + r6], xm3
+%endif
+    RET
+%endmacro
+
+    FILTER_VER_CHROMA_AVX2_8x12 pp
+    FILTER_VER_CHROMA_AVX2_8x12 ps
+
 %macro FILTER_VER_CHROMA_AVX2_8xN 2
 INIT_YMM avx2
 cglobal interp_4tap_vert_%1_8x%2, 4, 7, 8
@@ -7560,9 +7702,9 @@ cglobal interp_4tap_vert_%1_16x4, 4, 6, 
     FILTER_VER_CHROMA_AVX2_16x4 pp
     FILTER_VER_CHROMA_AVX2_16x4 ps
 
-%macro FILTER_VER_CHROMA_AVX2_12x16 1
-INIT_YMM avx2
-cglobal interp_4tap_vert_%1_12x16, 4, 7, 8
+%macro FILTER_VER_CHROMA_AVX2_12xN 2
+INIT_YMM avx2
+cglobal interp_4tap_vert_%1_12x%2, 4, 7, 8
     mov             r4d, r4m
     shl             r4d, 6
 
@@ -7582,7 +7724,7 @@ cglobal interp_4tap_vert_%1_12x16, 4, 7,
     vbroadcasti128  m7, [pw_2000]
 %endif
     lea             r6, [r3 * 3]
-
+%rep %2 / 16
     movu            xm0, [r0]                       ; m0 = row 0
     movu            xm1, [r0 + r1]                  ; m1 = row 1
     punpckhbw       xm2, xm0, xm1
@@ -7868,11 +8010,15 @@ cglobal interp_4tap_vert_%1_12x16, 4, 7,
     vextracti128    xm5, m5, 1