<div style="line-height:1.7;color:#000000;font-size:14px;font-family:arial"><div><br> </div><pre><br>At 2014-09-05 19:51:58,yuvaraj@multicorewareinc.com wrote:
># HG changeset patch
># User Yuvaraj Venkatesh <yuvaraj@multicorewareinc.com>
># Date 1409917643 -19800
># Fri Sep 05 17:17:23 2014 +0530
># Node ID 78143d079d48b0b0cbcda7bb208389342d433c55
># Parent 93db2f53fe573537bcd4eb53ca3cdb69af557eb5
>asm: avx2 assembly code for dct16
>
>diff -r 93db2f53fe57 -r 78143d079d48 source/common/x86/asm-primitives.cpp
>--- a/source/common/x86/asm-primitives.cpp Thu Sep 04 16:42:24 2014 -0700
>+++ b/source/common/x86/asm-primitives.cpp Fri Sep 05 17:17:23 2014 +0530
>@@ -1734,8 +1734,8 @@
> p.cvt32to16_shl[BLOCK_16x16] = x265_cvt32to16_shl_16_avx2;
> p.cvt32to16_shl[BLOCK_32x32] = x265_cvt32to16_shl_32_avx2;
> p.denoiseDct = x265_denoise_dct_avx2;
>-
> p.dct[DCT_4x4] = x265_dct4_avx2;
>+ p.dct[DCT_16x16] = x265_dct16_avx2;
</pre><pre>Your code only works on x64, so this needs an x64 check here.</pre><pre> </pre><pre>> }
> #endif // if HIGH_BIT_DEPTH
> }
>diff -r 93db2f53fe57 -r 78143d079d48 source/common/x86/dct8.asm
>--- a/source/common/x86/dct8.asm Thu Sep 04 16:42:24 2014 -0700
>+++ b/source/common/x86/dct8.asm Fri Sep 05 17:17:23 2014 +0530
>@@ -29,13 +29,61 @@
> %include "x86util.asm"
>
> SECTION_RODATA 32
>+tab_dct16_1: times 2 dw 64, 64, 64, 64, 64, 64, 64, 64
</pre><pre>We can cut the size of this table in half, see below.</pre><pre> </pre><pre> </pre><pre>>+%macro DCT16_PASS_1_E 2
>+ mova m7, [tab_dct16_3 + %1]
</pre><pre>>+ pmaddwd m4, m0, m7
</pre><pre><pre>Two choices: use vpbroadcastq to halve the table memory, or fold the load into pmaddwd.</pre><pre>I suggest keeping the tab_dct16_3 address in a general-purpose register to reduce code size; you are working on x64, so you have many free registers.</pre><pre>>+ phaddd m4, m4
>+
>+ pmaddwd m6, m2, m7
>+ phaddd m6, m6
>+
>+ punpcklqdq m4, m6
</pre><pre>We may be able to combine this with the two phaddd.</pre><pre> </pre><pre>>+ mova m0, m8
>+ phaddw m0, m0
</pre><pre>phaddw m0, m8, m8?</pre><pre> </pre><pre>>+
>+ pshufb m1, m14
>+ mova m2, m1
>+ phaddw m2, m2
>+
>+ punpcklqdq m0, m2
</pre><pre>Combine this with the last two phaddw.</pre><pre>>+
>+ lea r0, [r0 + 8 * r2]
>+ add r5, 256
>+
>+ dec r4
>+ jnz .pass1
>+
>+ mov r5, rsp
>+ mov r4, 2
>+ add r2d, r2d
>+ lea r3, [r2 * 3]
>+ vpbroadcastd m9, [pd_512]
>+
>+.pass2:
>+ mova m0, [r5 + 0 * 32] ; [row0lo row4lo]
>+ mova m1, [r5 + 8 * 32] ; [row0hi row4hi]
>+
>+ mova m2, [r5 + 1 * 32] ; [row1lo row5lo]
>+ mova m3, [r5 + 9 * 32] ; [row1hi row5hi]
>+
>+ mova m4, [r5 + 2 * 32] ; [row2lo row6lo]
>+ mova m5, [r5 + 10 * 32] ; [row2hi row6hi]
>+
>+ mova m6, [r5 + 3 * 32] ; [row3lo row7lo]
>+ mova m7, [r5 + 11 * 32] ; [row3hi row7hi]
>+
>+ DCT16_PASS_2 0 * 32
>+ mova [r1], m10
>+ DCT16_PASS_2 1 * 32
>+ mova [r1 + r2], m10
>+ DCT16_PASS_2 2 * 32
>+ mova [r1 + r2 * 2], m10
>+ DCT16_PASS_2 3 * 32
>+ mova [r1 + r3], m10
>+
>+ lea r6, [r1 + r2 * 4]
>+ DCT16_PASS_2 4 * 32
>+ mova [r6], m10
>+ DCT16_PASS_2 5 * 32
>+ mova [r6 + r2], m10
>+ DCT16_PASS_2 6 * 32
>+ mova [r6 + r2 * 2], m10
>+ DCT16_PASS_2 7 * 32
>+ mova [r6 + r3], m10
>+
>+ lea r6, [r6 + r2 * 4]
>+ DCT16_PASS_2 8 * 32
>+ mova [r6], m10
>+ DCT16_PASS_2 9 * 32
>+ mova [r6 + r2], m10
>+ DCT16_PASS_2 10 * 32
>+ mova [r6 + r2 * 2], m10
>+ DCT16_PASS_2 11 * 32
>+ mova [r6 + r3], m10
>+
>+ lea r6, [r6 + r2 * 4]
>+ DCT16_PASS_2 12 * 32
>+ mova [r6], m10
>+ DCT16_PASS_2 13 * 32
>+ mova [r6 + r2], m10
>+ DCT16_PASS_2 14 * 32
>+ mova [r6 + r2 * 2], m10
>+ DCT16_PASS_2 15 * 32
>+ mova [r6 + r3], m10
>+
>+ add r1, 32
>+ add r5, 128
>+
>+ dec r4
</pre><pre>Do you need a 64-bit counter?</pre><pre> </pre><pre>>+ jnz .pass2
>+
>+ RET
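</pre><pre>To illustrate the "combine with the two phaddd" comment on DCT16_PASS_1_E, here is a rough scalar sketch of one 128-bit lane (the AVX2 forms of phaddd and punpcklqdq work on each lane independently). Lane and the function names are made up for this illustration and are not x265 code. It only compares the value left in the destination register, so it assumes nothing later in the macro reads m6 after the sequence; that part of the patch is not quoted here.

```cpp
// Scalar model of one 128-bit lane holding four 32-bit values.
// Hypothetical helper type, for illustration only.
struct Lane { int v[4]; };

// phaddd dst, src: pairwise sums of dst fill the low half,
// pairwise sums of src fill the high half.
// (The real instruction wraps on overflow; keep inputs small here.)
Lane phaddd(Lane a, Lane b) {
    return Lane{{a.v[0] + a.v[1], a.v[2] + a.v[3],
                 b.v[0] + b.v[1], b.v[2] + b.v[3]}};
}

// punpcklqdq dst, src: low 64 bits of dst, then low 64 bits of src.
Lane punpcklqdq(Lane a, Lane b) {
    return Lane{{a.v[0], a.v[1], b.v[0], b.v[1]}};
}

// Sequence in the patch: phaddd a, a / phaddd b, b / punpcklqdq a, b.
Lane three_ops(Lane a, Lane b) {
    return punpcklqdq(phaddd(a, a), phaddd(b, b));
}

// Possible replacement: a single phaddd a, b.
Lane one_op(Lane a, Lane b) {
    return phaddd(a, b);
}

// Element-wise comparison of two lanes.
bool same(Lane x, Lane y) {
    for (int i = 0; i != 4; ++i)
        if (x.v[i] != y.v[i]) return false;
    return true;
}
```

In this model both forms leave the same four values in the destination, so "phaddd m4, m6" looks like a drop-in for the phaddd / phaddd / punpcklqdq triple. The same reasoning should apply to the phaddw pair, with 16-bit elements instead of 32-bit ones.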
</pre></pre></div>