[x265-commits] [x265] asm: intra_pred_ang4_2_sse2

David T Yuen dtyx265 at gmail.com
Tue Mar 24 22:23:47 CET 2015


details:   http://hg.videolan.org/x265/rev/6b8da2264523
branches:  
changeset: 9864:6b8da2264523
user:      David T Yuen <dtyx265 at gmail.com>
date:      Mon Mar 23 12:26:38 2015 -0700
description:
asm: intra_pred_ang4_2_sse2

This is backported from sse4 code and replaces c code.

64-bit

./test/TestBench --testbench intrapred | grep "intra_ang_4x4\[ 2\]"
intra_ang_4x4[ 2]	8.86x 	 134.98   	 1195.68

32-bit

./test/TestBench --testbench intrapred | grep "intra_ang_4x4\[ 2\]"
intra_ang_4x4[ 2]	9.23x 	 222.48   	 2053.30
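
Note on reading the TestBench rows: the columns are not labelled in the commit message. The figures are consistent with the first number being the speedup, the second the cycle count of the optimized primitive, and the third the cycle count of the C reference. That reading is an assumption inferred from the numbers, and the helper name below is hypothetical. A minimal C++ sketch of the relationship:

```cpp
// Hypothetical helper, not part of x265: speedup as the ratio of
// reference (C) cycles to optimized (asm) cycles.
constexpr double speedup(double optimizedCycles, double referenceCycles)
{
    return referenceCycles / optimizedCycles;
}
```

With the 64-bit row above, speedup(134.98, 1195.68) comes to roughly 8.86, which matches the reported 8.86x.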
Subject: [x265] asm: intra_pred_ang4_3_sse2

details:   http://hg.videolan.org/x265/rev/1ad3d2d854a4
branches:  
changeset: 9865:1ad3d2d854a4
user:      David T Yuen <dtyx265 at gmail.com>
date:      Mon Mar 23 12:35:33 2015 -0700
description:
asm: intra_pred_ang4_3_sse2

This is backported from sse4 code and replaces c code.

64-bit

./test/TestBench --testbench intrapred | grep "intra_ang_4x4\[ 3\]"
intra_ang_4x4[ 3]	2.58x 	 704.98   	 1818.77

32-bit

./test/TestBench --testbench intrapred | grep "intra_ang_4x4\[ 3\]"
intra_ang_4x4[ 3]	3.68x 	 757.49   	 2784.21
Subject: [x265] asm: intra_pred_ang4_4_sse2

details:   http://hg.videolan.org/x265/rev/e91b92457670
branches:  
changeset: 9866:e91b92457670
user:      David T Yuen <dtyx265 at gmail.com>
date:      Mon Mar 23 12:53:22 2015 -0700
description:
asm: intra_pred_ang4_4_sse2

This is backported from sse4 code and replaces c code.

64-bit

./test/TestBench --testbench intrapred | grep "intra_ang_4x4\[ 4\]"
intra_ang_4x4[ 4]	2.74x 	 709.98   	 1947.60

32-bit

./test/TestBench --testbench intrapred | grep "intra_ang_4x4\[ 4\]"
intra_ang_4x4[ 4]	3.97x 	 747.49   	 2970.13
Subject: [x265] asm: intra_pred_ang4_5_sse2

details:   http://hg.videolan.org/x265/rev/02bc460262b4
branches:  
changeset: 9867:02bc460262b4
user:      David T Yuen <dtyx265 at gmail.com>
date:      Mon Mar 23 12:57:42 2015 -0700
description:
asm: intra_pred_ang4_5_sse2

This is backported from sse4 code and replaces c code.

64-bit

./test/TestBench --testbench intrapred | grep "intra_ang_4x4\[ 5\]"
intra_ang_4x4[ 5]	2.94x 	 684.47   	 2014.99

32-bit

./test/TestBench --testbench intrapred | grep "intra_ang_4x4\[ 5\]"
intra_ang_4x4[ 5]	3.82x 	 747.48   	 2854.97
Subject: [x265] asm: intra_pred_ang4_6_sse2

details:   http://hg.videolan.org/x265/rev/e043561425f9
branches:  
changeset: 9868:e043561425f9
user:      David T Yuen <dtyx265 at gmail.com>
date:      Mon Mar 23 13:01:36 2015 -0700
description:
asm: intra_pred_ang4_6_sse2

This is backported from sse4 code and replaces c code.

64-bit

./test/TestBench --testbench intrapred | grep "intra_ang_4x4\[ 6\]"
intra_ang_4x4[ 6]	2.92x 	 655.00   	 1914.97

32-bit

./test/TestBench --testbench intrapred | grep "intra_ang_4x4\[ 6\]"
intra_ang_4x4[ 6]	3.96x 	 717.58   	 2844.93
Subject: [x265] asm: intra_pred_ang4_7_sse2

details:   http://hg.videolan.org/x265/rev/3daa8229d676
branches:  
changeset: 9869:3daa8229d676
user:      David T Yuen <dtyx265 at gmail.com>
date:      Mon Mar 23 13:18:33 2015 -0700
description:
asm: intra_pred_ang4_7_sse2

This is backported from sse4 code and replaces c code.

64-bit

./test/TestBench --testbench intrapred | grep "intra_ang_4x4\[ 7\]"
intra_ang_4x4[ 7]	2.77x 	 655.00   	 1817.47

32-bit

./test/TestBench --testbench intrapred | grep "intra_ang_4x4\[ 7\]"
intra_ang_4x4[ 7]	3.56x 	 762.50   	 2714.98
Subject: [x265] asm: intra_pred_ang4_8_sse2

details:   http://hg.videolan.org/x265/rev/71636c334b57
branches:  
changeset: 9870:71636c334b57
user:      David T Yuen <dtyx265 at gmail.com>
date:      Mon Mar 23 13:22:23 2015 -0700
description:
asm: intra_pred_ang4_8_sse2

This is backported from sse4 code and replaces c code.

64-bit

./test/TestBench --testbench intrapred | grep "intra_ang_4x4\[ 8\]"
intra_ang_4x4[ 8]	3.04x 	 640.00   	 1942.47

32-bit

./test/TestBench --testbench intrapred | grep "intra_ang_4x4\[ 8\]"
intra_ang_4x4[ 8]	3.97x 	 722.50   	 2864.98
Subject: [x265] asm: intra_pred_ang4_9_sse2

details:   http://hg.videolan.org/x265/rev/1b69c3a7bbd5
branches:  
changeset: 9871:1b69c3a7bbd5
user:      David T Yuen <dtyx265 at gmail.com>
date:      Mon Mar 23 13:28:35 2015 -0700
description:
asm: intra_pred_ang4_9_sse2

This is backported from sse4 code and replaces c code.

64-bit

./test/TestBench --testbench intrapred | grep "intra_ang_4x4\[ 9\]"
intra_ang_4x4[ 9]	2.97x 	 645.00   	 1917.47

32-bit

./test/TestBench --testbench intrapred | grep "intra_ang_4x4\[ 9\]"
intra_ang_4x4[ 9]	4.03x 	 722.50   	 2910.00
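
For readers following the diff further down: the shared .do_filter4x4 path multiplies neighbouring pixel pairs by (32 - x, x) weights from pw_ang_table, adds 16, and shifts right by 5. A scalar C++ sketch of that two-tap filter follows. The function names are hypothetical, and the per-mode angle values (26 for mode 3, 21 for mode 4, and so on) are taken from the HEVC angular mode table rather than from this commit; they reproduce the table indices noted in the asm comments.

```cpp
#include <cstdint>

// Fractional weight for a given row: ((row + 1) * angle) mod 32.
// 'angle' is the per-mode displacement (assumed: 26 for mode 3).
constexpr int angFrac(int angle, int row)
{
    return ((row + 1) * angle) & 31;
}

// Two-tap interpolation between neighbouring reference pixels a and b,
// rounded to nearest: ((32 - frac) * a + frac * b + 16) >> 5.
constexpr uint8_t angFilter(uint8_t a, uint8_t b, int frac)
{
    return static_cast<uint8_t>(((32 - frac) * a + frac * b + 16) >> 5);
}
```

For mode 3, angFrac(26, row) gives 26, 20, 14, 8 for rows 0 to 3, the same entries that intra_pred_ang4_3 loads from pw_ang_table.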
Subject: [x265] asm: avx2 code for ssd_s[16x16] for 8bpp

details:   http://hg.videolan.org/x265/rev/cd68e557eb7b
branches:  
changeset: 9872:cd68e557eb7b
user:      Sumalatha Polureddy
date:      Tue Mar 24 10:25:06 2015 +0530
description:
asm: avx2 code for ssd_s[16x16] for 8bpp

see3
ssd_s[16x16]  6.33x    345.70          2188.47

avx2
ssd_s[16x16]  9.86x    221.34          2183.05
Subject: [x265] asm: psyCost_pp avx2 code for BLOCK(8x8,16x16,32x32,64x64)

details:   http://hg.videolan.org/x265/rev/9eefa3feecdb
branches:  
changeset: 9873:9eefa3feecdb
user:      Dnyaneshwar G <dnyaneshwar at multicorewareinc.com>
date:      Mon Mar 23 14:10:52 2015 +0530
description:
asm: psyCost_pp avx2 code for BLOCK(8x8,16x16,32x32,64x64)

AVX2:
psy_cost_pp[8x8]         12.28x   611.76          7511.84
psy_cost_pp[16x16]       13.43x   2253.78         30262.36
psy_cost_pp[32x32]       14.16x   8578.93         121519.92
psy_cost_pp[64x64]       12.37x   39645.38        490279.69

SSE4:
psy_cost_pp[8x8]         8.40x    930.68          7818.93
psy_cost_pp[16x16]       8.57x    3648.62         31282.65
psy_cost_pp[32x32]       8.73x    13969.57        121993.38
psy_cost_pp[64x64]       8.74x    54604.69        477252.69
Subject: [x265] asm: psyCost_pp avx2 code for BLOCK_4x4

details:   http://hg.videolan.org/x265/rev/48fee0fa4814
branches:  
changeset: 9874:48fee0fa4814
user:      Dnyaneshwar G <dnyaneshwar at multicorewareinc.com>
date:      Mon Mar 23 20:20:02 2015 +0530
description:
asm: psyCost_pp avx2 code for BLOCK_4x4

AVX2:
psy_cost_pp[4x4]         10.30x   216.56          2230.77

SSE4:
psy_cost_pp[4x4]         6.53x    352.01          2297.35
Subject: [x265] analysis: only perform checks if merge mode was selected

details:   http://hg.videolan.org/x265/rev/27717be056d3
branches:  
changeset: 9875:27717be056d3
user:      Steve Borho <steve at borho.org>
date:      Tue Mar 24 14:26:24 2015 -0500
description:
analysis: only perform checks if merge mode was selected
Subject: [x265] param: bframes can match lookaheadDepth if both are zero (fixes #118)

details:   http://hg.videolan.org/x265/rev/a962bb577a47
branches:  
changeset: 9876:a962bb577a47
user:      Steve Borho <steve at borho.org>
date:      Tue Mar 24 15:10:46 2015 -0500
description:
param: bframes can match lookaheadDepth if both are zero (fixes #118)
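
The condition changed by this commit (see the param.cpp hunk below) can be read in isolation as a predicate. The following is a reduced C++ sketch with a hypothetical function name and plain arguments in place of x265_param; it is not the x265 code itself.

```cpp
// Hypothetical stand-alone form of the updated check: returns true when
// the parameter combination should be rejected. The leading 'bframes'
// term is what lets bframes == lookaheadDepth == 0 pass.
constexpr bool lookaheadTooShallow(int bframes, int lookaheadDepth, bool statRead)
{
    return bframes && bframes >= lookaheadDepth && !statRead;
}
```

With both values at zero the predicate is false, so the configuration is accepted; with bframes = 3 and lookaheadDepth = 3 it is still rejected unless stats are being read.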
Subject: [x265] slicetype: fix crash when lookaheadDepth is 0

details:   http://hg.videolan.org/x265/rev/c7740b6cec26
branches:  
changeset: 9877:c7740b6cec26
user:      Steve Borho <steve at borho.org>
date:      Tue Mar 24 15:30:53 2015 -0500
description:
slicetype: fix crash when lookaheadDepth is 0
Subject: [x265] slicetype: spleling

details:   http://hg.videolan.org/x265/rev/e637273e2ae6
branches:  
changeset: 9878:e637273e2ae6
user:      Steve Borho <steve at borho.org>
date:      Tue Mar 24 15:31:05 2015 -0500
description:
slicetype: spleling

diffstat:

 source/common/param.cpp              |    2 +-
 source/common/x86/asm-primitives.cpp |   17 +
 source/common/x86/intrapred.h        |    9 +
 source/common/x86/intrapred8.asm     |  244 +++++++++++++++++++++++++
 source/common/x86/pixel-a.asm        |  332 ++++++++++++++++++++++++++++++++++-
 source/common/x86/pixel.h            |    7 +
 source/common/x86/ssd-a.asm          |   29 +++
 source/encoder/analysis.cpp          |    4 +-
 source/encoder/slicetype.cpp         |    4 +-
 9 files changed, 642 insertions(+), 6 deletions(-)

diffs (truncated from 783 to 300 lines):

diff -r 7b66c36ed9ef -r e637273e2ae6 source/common/param.cpp
--- a/source/common/param.cpp	Mon Mar 23 19:55:02 2015 -0500
+++ b/source/common/param.cpp	Tue Mar 24 15:31:05 2015 -0500
@@ -1055,7 +1055,7 @@ int x265_check_params(x265_param* param)
           "RD Level is out of range");
     CHECK(param->rdoqLevel < 0 || param->rdoqLevel > 2,
         "RDOQ Level is out of range");
-    CHECK(param->bframes >= param->lookaheadDepth && !param->rc.bStatRead,
+    CHECK(param->bframes && param->bframes >= param->lookaheadDepth && !param->rc.bStatRead,
           "Lookahead depth must be greater than the max consecutive bframe count");
     CHECK(param->bframes < 0,
           "bframe count should be greater than zero");
diff -r 7b66c36ed9ef -r e637273e2ae6 source/common/x86/asm-primitives.cpp
--- a/source/common/x86/asm-primitives.cpp	Mon Mar 23 19:55:02 2015 -0500
+++ b/source/common/x86/asm-primitives.cpp	Tue Mar 24 15:31:05 2015 -0500
@@ -1196,6 +1196,15 @@ void setupAssemblyPrimitives(EncoderPrim
         p.cu[BLOCK_16x16].intra_pred[PLANAR_IDX] = x265_intra_pred_planar16_sse2;
         p.cu[BLOCK_32x32].intra_pred[PLANAR_IDX] = x265_intra_pred_planar32_sse2;
 
+        p.cu[BLOCK_4x4].intra_pred[2] = x265_intra_pred_ang4_2_sse2;
+        p.cu[BLOCK_4x4].intra_pred[3] = x265_intra_pred_ang4_3_sse2;
+        p.cu[BLOCK_4x4].intra_pred[4] = x265_intra_pred_ang4_4_sse2;
+        p.cu[BLOCK_4x4].intra_pred[5] = x265_intra_pred_ang4_5_sse2;
+        p.cu[BLOCK_4x4].intra_pred[6] = x265_intra_pred_ang4_6_sse2;
+        p.cu[BLOCK_4x4].intra_pred[7] = x265_intra_pred_ang4_7_sse2;
+        p.cu[BLOCK_4x4].intra_pred[8] = x265_intra_pred_ang4_8_sse2;
+        p.cu[BLOCK_4x4].intra_pred[9] = x265_intra_pred_ang4_9_sse2;
+
         p.cu[BLOCK_4x4].calcresidual = x265_getResidual4_sse2;
         p.cu[BLOCK_8x8].calcresidual = x265_getResidual8_sse2;
 
@@ -1417,6 +1426,12 @@ void setupAssemblyPrimitives(EncoderPrim
 #if X86_64
     if (cpuMask & X265_CPU_AVX2)
     {
+        p.cu[BLOCK_4x4].psy_cost_pp = x265_psyCost_pp_4x4_avx2;
+        p.cu[BLOCK_8x8].psy_cost_pp = x265_psyCost_pp_8x8_avx2;
+        p.cu[BLOCK_16x16].psy_cost_pp = x265_psyCost_pp_16x16_avx2;
+        p.cu[BLOCK_32x32].psy_cost_pp = x265_psyCost_pp_32x32_avx2;
+        p.cu[BLOCK_64x64].psy_cost_pp = x265_psyCost_pp_64x64_avx2;
+
         p.pu[LUMA_8x4].addAvg = x265_addAvg_8x4_avx2;
         p.pu[LUMA_8x8].addAvg = x265_addAvg_8x8_avx2;
         p.pu[LUMA_8x16].addAvg = x265_addAvg_8x16_avx2;
@@ -1519,6 +1534,8 @@ void setupAssemblyPrimitives(EncoderPrim
         p.pu[LUMA_16x32].sad_x4 = x265_pixel_sad_x4_16x32_avx2;
 
         p.cu[BLOCK_16x16].sse_pp = x265_pixel_ssd_16x16_avx2;
+
+        p.cu[BLOCK_16x16].ssd_s = x265_pixel_ssd_s_16_avx2;
         p.cu[BLOCK_32x32].ssd_s = x265_pixel_ssd_s_32_avx2;
 
         p.cu[BLOCK_8x8].copy_cnt = x265_copy_cnt_8_avx2;
diff -r 7b66c36ed9ef -r e637273e2ae6 source/common/x86/intrapred.h
--- a/source/common/x86/intrapred.h	Mon Mar 23 19:55:02 2015 -0500
+++ b/source/common/x86/intrapred.h	Tue Mar 24 15:31:05 2015 -0500
@@ -47,6 +47,15 @@ void x265_intra_pred_planar32_sse4(pixel
 #define DECL_ANG(bsize, mode, cpu) \
     void x265_intra_pred_ang ## bsize ## _ ## mode ## _ ## cpu(pixel* dst, intptr_t dstStride, const pixel* srcPix, int dirMode, int bFilter);
 
+DECL_ANG(4, 2, sse2);
+DECL_ANG(4, 3, sse2);
+DECL_ANG(4, 4, sse2);
+DECL_ANG(4, 5, sse2);
+DECL_ANG(4, 6, sse2);
+DECL_ANG(4, 7, sse2);
+DECL_ANG(4, 8, sse2);
+DECL_ANG(4, 9, sse2);
+
 DECL_ANG(4, 2, ssse3);
 DECL_ANG(4, 3, sse4);
 DECL_ANG(4, 4, sse4);
diff -r 7b66c36ed9ef -r e637273e2ae6 source/common/x86/intrapred8.asm
--- a/source/common/x86/intrapred8.asm	Mon Mar 23 19:55:02 2015 -0500
+++ b/source/common/x86/intrapred8.asm	Tue Mar 24 15:31:05 2015 -0500
@@ -267,6 +267,13 @@ const ang_table
 %assign x x+1
 %endrep
 
+const pw_ang_table
+%assign x 0
+%rep 32
+    times 4 dw (32-x), x
+%assign x x+1
+%endrep
+
 SECTION .text
 
 cextern pw_2
@@ -1109,6 +1116,243 @@ cglobal intra_pred_planar32, 3,3,8,0-(4*
 
 %endif ; end ARCH_X86_32
 
+;-----------------------------------------------------------------------------------------
+; void intraPredAng4(pixel* dst, intptr_t dstStride, pixel* src, int dirMode, int bFilter)
+;-----------------------------------------------------------------------------------------
+INIT_XMM sse2
+cglobal intra_pred_ang4_2, 3,5,3
+    lea         r4, [r2 + 2]
+    add         r2, 10
+    cmp         r3m, byte 34
+    cmove       r2, r4
+
+    movh        m0, [r2]
+    movd        [r0], m0
+    mova        m1, m0
+    psrldq      m1, 1
+    movd        [r0 + r1], m1
+    mova        m2, m0
+    psrldq      m2, 2
+    movd        [r0 + r1 * 2], m2
+    lea         r1, [r1 * 3]
+    psrldq      m0, 3
+    movd        [r0 + r1], m0
+    RET
+
+INIT_XMM sse2
+cglobal intra_pred_ang4_3, 3,5,8
+    mov         r4, 1
+    cmp         r3m, byte 33
+    mov         r3, 9
+    cmove       r3, r4
+
+    movh        m0, [r2 + r3]   ; [8 7 6 5 4 3 2 1]
+    mova        m1, m0
+    psrldq      m1, 1           ; [x 8 7 6 5 4 3 2]
+    punpcklbw   m0, m1          ; [x 8 8 7 7 6 6 5 5 4 4 3 3 2 2 1]
+    mova        m1, m0
+    psrldq      m1, 2           ; [x x x x x x x x 6 5 5 4 4 3 3 2]
+    mova        m2, m0
+    psrldq      m2, 4           ; [x x x x x x x x 7 6 6 5 5 4 4 3]
+    mova        m3, m0
+    psrldq      m3, 6           ; [x x x x x x x x 8 7 7 6 6 5 5 4]
+    punpcklqdq  m0, m1
+    punpcklqdq  m2, m3
+
+    lea         r3, [pw_ang_table + 20 * 16]
+    mova        m4, [r3 + 6 * 16]   ; [26]
+    mova        m5, [r3]            ; [20]
+    mova        m6, [r3 - 6 * 16]   ; [14]
+    mova        m7, [r3 - 12 * 16]  ; [ 8]
+    jmp        .do_filter4x4
+
+    ; NOTE: share path, input is m0=[1 0], m2=[3 2], m3,m4=coef, flag_z=no_transpose
+ALIGN 16
+.do_filter4x4:
+    pxor        m1, m1
+    pxor        m3, m3
+    punpckhbw   m3, m0
+    psrlw       m3, 8
+    pmaddwd     m3, m5
+    punpcklbw   m0, m1
+    pmaddwd     m0, m4
+    packssdw    m0, m3
+    paddw       m0, [pw_16]
+    psraw       m0, 5
+    pxor        m3, m3
+    punpckhbw   m3, m2
+    psrlw       m3, 8
+    pmaddwd     m3, m7
+    punpcklbw   m2, m1
+    pmaddwd     m2, m6
+    packssdw    m2, m3
+    paddw       m2, [pw_16]
+    psraw       m2, 5
+
+    ; NOTE: mode 33 doesn't reorder, UNSAFE but I don't use any instruction that affect eflag register before
+    jz         .store
+
+    ; transpose 4x4 c_trans_4x4           db  0,  4,  8, 12,  1,  5,  9, 13,  2,  6, 10, 14,  3,  7, 11, 15
+    pshufd      m0, m0, 0xD8
+    pshufd      m1, m2, 0xD8
+    pshuflw     m0, m0, 0xD8
+    pshuflw     m1, m1, 0xD8
+    pshufhw     m0, m0, 0xD8
+    pshufhw     m1, m1, 0xD8
+    mova        m2, m0
+    punpckldq   m0, m1
+    punpckhdq   m2, m1
+
+.store:
+    packuswb    m0, m2
+    movd        [r0], m0
+    pshufd      m0, m0, 0x39
+    movd        [r0 + r1], m0
+    pshufd      m0, m0, 0x39
+    movd        [r0 + r1 * 2], m0
+    lea         r1, [r1 * 3]
+    pshufd      m0, m0, 0x39
+    movd        [r0 + r1], m0
+    RET
+
+cglobal intra_pred_ang4_4, 3,5,8
+    xor         r4, r4
+    inc         r4
+    cmp         r3m, byte 32
+    mov         r3, 9
+    cmove       r3, r4
+
+    movh        m0, [r2 + r3]    ; [8 7 6 5 4 3 2 1]
+    mova        m1, m0
+    psrldq      m1, 1           ; [x 8 7 6 5 4 3 2]
+    punpcklbw   m0, m1          ; [x 8 8 7 7 6 6 5 5 4 4 3 3 2 2 1]
+    mova        m1, m0
+    psrldq      m1, 2           ; [x x x x x x x x 6 5 5 4 4 3 3 2]
+    mova        m3, m0
+    psrldq      m3, 4           ; [x x x x x x x x 7 6 6 5 5 4 4 3]
+    punpcklqdq  m0, m1
+    punpcklqdq  m2, m1, m3
+
+    lea         r3, [pw_ang_table + 18 * 16]
+    mova        m4, [r3 +  3 * 16]  ; [21]
+    mova        m5, [r3 -  8 * 16]  ; [10]
+    mova        m6, [r3 + 13 * 16]  ; [31]
+    mova        m7, [r3 +  2 * 16]  ; [20]
+    jmp         mangle(private_prefix %+ _ %+ intra_pred_ang4_3 %+ SUFFIX %+ .do_filter4x4)
+
+cglobal intra_pred_ang4_5, 3,5,8
+    xor         r4, r4
+    inc         r4
+    cmp         r3m, byte 31
+    mov         r3, 9
+    cmove       r3, r4
+
+    movh        m0, [r2 + r3]    ; [8 7 6 5 4 3 2 1]
+    mova        m1, m0
+    psrldq      m1, 1           ; [x 8 7 6 5 4 3 2]
+    punpcklbw   m0, m1          ; [x 8 8 7 7 6 6 5 5 4 4 3 3 2 2 1]
+    mova        m1, m0
+    psrldq      m1, 2           ; [x x x x x x x x 6 5 5 4 4 3 3 2]
+    mova        m3, m0
+    psrldq      m3, 4           ; [x x x x x x x x 7 6 6 5 5 4 4 3]
+    punpcklqdq  m0, m1
+    punpcklqdq  m2, m1, m3
+
+    lea         r3, [pw_ang_table + 10 * 16]
+    mova        m4, [r3 +  7 * 16]  ; [17]
+    mova        m5, [r3 -  8 * 16]  ; [ 2]
+    mova        m6, [r3 +  9 * 16]  ; [19]
+    mova        m7, [r3 -  6 * 16]  ; [ 4]
+    jmp         mangle(private_prefix %+ _ %+ intra_pred_ang4_3 %+ SUFFIX %+ .do_filter4x4)
+
+cglobal intra_pred_ang4_6, 3,5,8
+    xor         r4, r4
+    inc         r4
+    cmp         r3m, byte 30
+    mov         r3, 9
+    cmove       r3, r4
+
+    movh        m0, [r2 + r3]    ; [8 7 6 5 4 3 2 1]
+    mova        m1, m0
+    psrldq      m1, 1           ; [x 8 7 6 5 4 3 2]
+    punpcklbw   m0, m1          ; [x 8 8 7 7 6 6 5 5 4 4 3 3 2 2 1]
+    mova        m2, m0
+    psrldq      m2, 2           ; [x x x x x x x x 6 5 5 4 4 3 3 2]
+    punpcklqdq  m0, m0
+    punpcklqdq  m2, m2
+
+    lea         r3, [pw_ang_table + 19 * 16]
+    mova        m4, [r3 -  6 * 16]  ; [13]
+    mova        m5, [r3 +  7 * 16]  ; [26]
+    mova        m6, [r3 - 12 * 16]  ; [ 7]
+    mova        m7, [r3 +  1 * 16]  ; [20]
+    jmp         mangle(private_prefix %+ _ %+ intra_pred_ang4_3 %+ SUFFIX %+ .do_filter4x4)
+
+cglobal intra_pred_ang4_7, 3,5,8
+    xor         r4, r4
+    inc         r4
+    cmp         r3m, byte 29
+    mov         r3, 9
+    cmove       r3, r4
+
+    movh        m0, [r2 + r3]    ; [8 7 6 5 4 3 2 1]
+    mova        m1, m0
+    psrldq      m1, 1           ; [x 8 7 6 5 4 3 2]
+    punpcklbw   m0, m1          ; [x 8 8 7 7 6 6 5 5 4 4 3 3 2 2 1]
+    mova        m3, m0
+    psrldq      m3, 2           ; [x x x x x x x x 6 5 5 4 4 3 3 2]
+    punpcklqdq  m2, m0, m3
+    punpcklqdq  m0, m0
+
+    lea         r3, [pw_ang_table + 20 * 16]
+    mova        m4, [r3 - 11 * 16]  ; [ 9]
+    mova        m5, [r3 -  2 * 16]  ; [18]
+    mova        m6, [r3 +  7 * 16]  ; [27]
+    mova        m7, [r3 - 16 * 16]  ; [ 4]
+    jmp         mangle(private_prefix %+ _ %+ intra_pred_ang4_3 %+ SUFFIX %+ .do_filter4x4)
+
+cglobal intra_pred_ang4_8, 3,5,8
+    xor         r4, r4
+    inc         r4
+    cmp         r3m, byte 28
+    mov         r3, 9
+    cmove       r3, r4
+
+    movh        m0, [r2 + r3]    ; [8 7 6 5 4 3 2 1]
+    mova        m1, m0
+    psrldq      m1, 1           ; [x 8 7 6 5 4 3 2]
+    punpcklbw   m0, m1          ; [x 8 8 7 7 6 6 5 5 4 4 3 3 2 2 1]
+    punpcklqdq  m0, m0

