<div style="line-height:1.7;color:#000000;font-size:14px;font-family:Arial"><pre><br>At 2018-04-06 21:17:37, mythreyi@multicorewareinc.com wrote:

># HG changeset patch

># User Jayashree

># Date 1517283539 28800

>#      Mon Jan 29 19:38:59 2018 -0800

># Node ID 3c6e5ce07dbca7f967e4b5b62fe450979da3bf81

># Parent  624c83571d1df840e1206c46e589044fbf87ff32

>x86: AVX512 'count_nonzero_16x16' avx-512 kernel, 22% speedup over avx2

>

>count_nonzero[16x16]   18.88x ->  23.04x

>

>+;-----------------------------------------------------------------------------

>+; int x265_count_nonzero_16x16_avx512(const int16_t *quantCoeff);

>+;-----------------------------------------------------------------------------

>+INIT_ZMM avx512

>+cglobal count_nonzero_16x16, 1,4,2

>+    mov             r1, 0xFFFFFFFFFFFFFFFF

>+    kmovq           k2, r1

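<div><br></div><div>The all-ones mask in k2 looks unnecessary: an instruction with no opmask already behaves as if every mask bit were set. See section 2.5.1.1 of:
https://www.cs.utexas.edu/~hunt/class/2017-spring/cs350c/documents/Intel-x86-Docs/64-ia-32-architectures-instruction-set-extensions-reference-manual.pdf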

<span style="font-family: NeoSansIntelMedium; font-size: 11pt; color: rgb(8, 96, 168); font-variant-numeric: normal; font-variant-east-asian: normal;">2.5.1.1 Opmask Register K0<br><span style="font-family: Verdana; font-size: 9pt; color: rgb(0, 0, 0); font-variant-numeric: normal; font-variant-east-asian: normal;">The only exception to the opmask rules described above is that opmask k0 can not be used as a predicate operand.<br><span style="font-size: 9pt; font-variant-numeric: normal; font-variant-east-asian: normal;">Opmask k0 cannot be encoded as a predicate operand for a vector operation; the encoding value that would select<br><span style="font-size: 9pt; font-variant-numeric: normal; font-variant-east-asian: normal;">opmask k0 will instead selects an implicit opmask value of 0xFFFFFFFFFFFFFFFF, thereby effectively disabling<br><span style="font-size: 9pt; font-variant-numeric: normal; font-variant-east-asian: normal;">masking. Opmask register k0 can still be used for any instruction that takes opmask register(s) as operand(s)<br><span style="font-size: 9pt; font-variant-numeric: normal; font-variant-east-asian: normal;">(either source or destination).</span></span></span></span></span><br style="font-variant-numeric: normal; font-variant-east-asian: normal; line-height: normal; text-align: -webkit-auto; white-space: normal; text-size-adjust: auto;">

</span></div><div><br></div><div>>+    xor             r3, r3</div>>+    pxor            m0, m0

>+

>+%assign x 0

<div>>+%rep 4</div><div>There are only 4 iterations, so the %rep unroll is unnecessary here.</div><div>I suggest loading all of the bytes up front, so that the memory latency can be hidden behind the compute instructions.</div><div><br></div>>+    movu            m1, [r0 + x]

<div>>+    vpacksswb       m1, [r0 + x + 64]</div><div>>+%assign x x+128</div>>+    vpcmpb          k1 {k2}, m1, m0, 00000100b

>+    kmovq           r1, k1

>+    popcnt          r2, r1

>+    add             r3d, r2d

>+%endrep

>+    mov             eax, r3d

>+

>+    RET

>+
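<div>Putting the two comments above together, the kernel could look roughly like the sketch below. It is untested and only illustrates the idea: the compares carry no opmask, so k2 and the mov/kmovq that set it up go away, and all eight 64-byte loads (16x16 coefficients x 2 bytes = 512 bytes) are issued before any compare. The register allocation (m1-m4, k1-k4, reusing r0 as a temporary once the loads are done) is my own assumption, not something taken from the patch.</div>
INIT_ZMM avx512
cglobal count_nonzero_16x16, 1,4,5
    pxor            m0, m0                  ; all-zero reference vector

    ; issue every load first so memory latency overlaps with the packs
    movu            m1, [r0]
    movu            m2, [r0 + 128]
    movu            m3, [r0 + 256]
    movu            m4, [r0 + 384]

    ; signed saturation keeps a nonzero int16 nonzero as a byte;
    ; the per-lane reordering done by the pack does not change the count
    vpacksswb       m1, [r0 + 64]
    vpacksswb       m2, [r0 + 192]
    vpacksswb       m3, [r0 + 320]
    vpacksswb       m4, [r0 + 448]

    ; predicate 4 = not-equal; no opmask means all 64 bytes are compared
    vpcmpb          k1, m1, m0, 4
    vpcmpb          k2, m2, m0, 4
    vpcmpb          k3, m3, m0, 4
    vpcmpb          k4, m4, m0, 4

    kmovq           r0, k1                  ; r0 is free after the last load
    kmovq           r1, k2
    kmovq           r2, k3
    kmovq           r3, k4
    popcnt          r0, r0
    popcnt          r1, r1
    popcnt          r2, r2
    popcnt          r3, r3

    add             r0d, r1d
    add             r2d, r3d
    add             r0d, r2d
    mov             eax, r0d                ; total is at most 256
    RET
<div>Whether this is actually faster than the %rep version would need to be checked with the testbench.</div>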


</pre></div>