<div style="line-height:1.7;color:#000000;font-size:14px;font-family:Arial"><pre><br>At 2018-04-06 21:17:37, mythreyi@multicorewareinc.com wrote:
># HG changeset patch
># User Jayashree
># Date 1517283539 28800
># Mon Jan 29 19:38:59 2018 -0800
># Node ID 3c6e5ce07dbca7f967e4b5b62fe450979da3bf81
># Parent 624c83571d1df840e1206c46e589044fbf87ff32
>x86: AVX512 'count_nonzero_16x16' avx-512 kernel, 22% speedup over avx2
>
>count_nonzero[16x16] 18.88x -> 23.04x
>
>+;-----------------------------------------------------------------------------
>+; int x265_count_nonzero_16x16_avx512(const int16_t *quantCoeff);
>+;-----------------------------------------------------------------------------
>+INIT_ZMM avx512
>+cglobal count_nonzero_16x16, 1,4,2
>+ mov r1, 0xFFFFFFFFFFFFFFFF
>+ kmovq k2, r1
<div><br></div><div>k2 is loaded with all ones here, so the mov/kmovq setup and the {k2} write mask on vpcmpb look unnecessary: leaving the mask out (the k0 encoding) already disables masking. From the Intel instruction set extensions reference:
https://www.cs.utexas.edu/~hunt/class/2017-spring/cs350c/documents/Intel-x86-Docs/64-ia-32-architectures-instruction-set-extensions-reference-manual.pdf
2.5.1.1 Opmask Register K0<br>The only exception to the opmask rules described above is that opmask k0 can not be used as a predicate operand.<br>Opmask k0 cannot be encoded as a predicate operand for a vector operation; the encoding value that would select<br>opmask k0 will instead selects an implicit opmask value of 0xFFFFFFFFFFFFFFFF, thereby effectively disabling<br>masking. Opmask register k0 can still be used for any instruction that takes opmask register(s) as operand(s)<br>(either source or destination).<br>
</div><div><br></div><div>>+ xor r3, r3</div>>+ pxor m0, m0
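To illustrate the k0 rule quoted above, here is a minimal scalar sketch in C of a byte not-equal compare with and without a write mask. It is only a model under my own assumptions, not x265 code: the function names model_vpcmpb_neq_masked and model_vpcmpb_neq are hypothetical, and each models one 64-byte compare with predicate 4 (not-equal).

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical scalar model (not x265 code) of a 64-byte not-equal compare
 * under a write mask k: result bit i is set only when mask bit i is set
 * AND a[i] != b[i]. Masked-off elements produce a 0 bit. */
static uint64_t model_vpcmpb_neq_masked(uint64_t k, const int8_t *a, const int8_t *b)
{
    assert(a != 0 && b != 0);
    uint64_t result = 0;
    for (int i = 0; i < 64; i++) {
        if (((k >> i) & 1) && a[i] != b[i])
            result |= (uint64_t)1 << i;
    }
    return result;
}

/* The same compare with no write mask at all (the k0 encoding):
 * every element takes part in the compare. */
static uint64_t model_vpcmpb_neq(const int8_t *a, const int8_t *b)
{
    assert(a != 0 && b != 0);
    uint64_t result = 0;
    for (int i = 0; i < 64; i++) {
        if (a[i] != b[i])
            result |= (uint64_t)1 << i;
    }
    return result;
}
```

With k set to 0xFFFFFFFFFFFFFFFF, the masked model returns the same bits as the unmasked one for any input, which is why the explicit all-ones mask should be removable.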
>+
>+%assign x 0
<div>>+%rep 4</div><div>This only repeats 4 times, so the %rep unroll is unnecessary here.</div><div>I suggest loading all of the bytes up front, so that the memory latency can be hidden behind the compute instructions.</div><div><br></div>>+ movu m1, [r0 + x]
<div>>+ vpacksswb m1, [r0 + x + 64]</div><div>>+%assign x x+128</div>>+ vpcmpb k1 {k2}, m1, m0, 00000100b
>+ kmovq r1, k1
>+ popcnt r2, r1
>+ add r3d, r2d
>+%endrep
>+ mov eax, r3d
>+
>+ RET
>+
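For checking any rework of this kernel, a scalar model in C may help. This is a sketch under my own assumptions, not code from x265: the names sat_s16_to_s8, count_nonzero_16x16_ref and count_nonzero_16x16_packed are hypothetical. The second function mirrors the kernel's steps (saturating pack to bytes, compare against zero, popcount). Two properties make the approach valid: signed saturation never turns a nonzero 16-bit value into zero, and the in-lane reordering done by vpacksswb does not change the count.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Signed saturation of one 16-bit value to 8 bits, as vpacksswb does per element.
 * A nonzero input always stays nonzero (e.g. 256 becomes 127, not 0). */
static int8_t sat_s16_to_s8(int16_t v)
{
    if (v > INT8_MAX) return INT8_MAX;
    if (v < INT8_MIN) return INT8_MIN;
    return (int8_t)v;
}

/* Plain reference: count the nonzero coefficients in a 16x16 block. */
static int count_nonzero_16x16_ref(const int16_t *quantCoeff)
{
    assert(quantCoeff != NULL);
    int count = 0;
    for (size_t i = 0; i < 256; i++)
        count += (quantCoeff[i] != 0);
    return count;
}

/* Hypothetical model of the kernel's steps. Each outer iteration stands for
 * one %rep pass: 64 coefficients (128 bytes) packed down to 64 bytes. */
static int count_nonzero_16x16_packed(const int16_t *quantCoeff)
{
    assert(quantCoeff != NULL);
    int count = 0;
    for (size_t base = 0; base < 256; base += 64) {
        uint64_t mask = 0;
        /* pack + compare-not-equal against zero, one mask bit per byte */
        for (size_t i = 0; i < 64; i++) {
            if (sat_s16_to_s8(quantCoeff[base + i]) != 0)
                mask |= (uint64_t)1 << i;
        }
        /* popcnt: clear the lowest set bit until the mask is empty */
        for (; mask != 0; mask &= mask - 1)
            count++;
    }
    return count;
}
```

Both functions should agree on every input. Values such as 256 or -256 are useful test cases, because a truncating narrow would map them to zero while the saturating pack does not.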
</pre></div>