[x265] [PATCH 300 of 307] x86: AVX512 'count_nonzero_16x16' avx-512 kernel, 22% speedup over avx2

Fri Apr 6 20:12:59 CEST 2018

Sorry, I missed a line; resending with an additional comment.

At 2018-04-07 01:27:34, "chen" <chenm003 at 163.com> wrote:

At 2018-04-06 21:17:37, mythreyi at multicorewareinc.com wrote:
># HG changeset patch
># User Jayashree
># Date 1517283539 28800
>#      Mon Jan 29 19:38:59 2018 -0800
># Node ID 3c6e5ce07dbca7f967e4b5b62fe450979da3bf81
># Parent  624c83571d1df840e1206c46e589044fbf87ff32
>x86: AVX512 'count_nonzero_16x16' avx-512 kernel, 22% speedup over avx2
>
>count_nonzero[16x16]   18.88x ->  23.04x
>
>+;-----------------------------------------------------------------------------
>+; int x265_count_nonzero_16x16_avx512(const int16_t *quantCoeff);
>+;-----------------------------------------------------------------------------
>+INIT_ZMM avx512
>+cglobal count_nonzero_16x16, 1,4,2
>+    mov             r1, 0xFFFFFFFFFFFFFFFF
>+    kmovq           k2, r1

https://www.cs.utexas.edu/~hunt/class/2017-spring/cs350c/documents/Intel-x86-Docs/64-ia-32-architectures-instruction-set-extensions-reference-manual.pdf
2.5.1.1 Opmask Register K0
The only exception to the opmask rules described above is that opmask k0 can not be used as a predicate operand.
Opmask k0 cannot be encoded as a predicate operand for a vector operation; the encoding value that would select
opmask k0 will instead selects an implicit opmask value of 0xFFFFFFFFFFFFFFFF, thereby effectively disabling
masking. Opmask register k0 can still be used for any instruction that takes opmask register(s) as operand(s)
(either source or destination).
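
To illustrate the passage above, here is a rough scalar model in C (not code from the patch; the function names are made up for this sketch). It assumes the usual compare-into-mask behaviour, where a result bit is cleared whenever the corresponding write-mask bit is clear. Under that assumption an all-ones write mask, such as the value loaded into k2, gives the same result as no masking at all.

```c
#include <assert.h>
#include <stdint.h>

/* Scalar model of "vpcmpb k1 {k2}, a, b, NE" on 64 signed bytes.
 * Bit i of the result is set when a[i] != b[i] AND bit i of the
 * write mask is set. Names are hypothetical, for illustration only. */
uint64_t cmp_ne_bytes_masked(const int8_t a[64], const int8_t b[64],
                             uint64_t writemask)
{
    uint64_t result = 0;
    for (int i = 0; i < 64; i++) {
        if (a[i] != b[i])
            result |= (uint64_t)1 << i;
    }
    return result & writemask;
}

/* Usage sketch: with an all-ones mask, only the compare decides the bits. */
void demo_all_ones_mask(void)
{
    int8_t a[64] = {0};
    int8_t zero[64] = {0};
    a[3] = 5;
    a[40] = -1;
    uint64_t expected = ((uint64_t)1 << 3) | ((uint64_t)1 << 40);
    assert(cmp_ne_bytes_masked(a, zero, UINT64_MAX) == expected);
}
```

In other words, the mask only matters when some of its bits are zero.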

>+    xor             r3, r3
>+    pxor            m0, m0
>+
>+%assign x 0

>+%rep 4
This is only unrolled 4 times, so the %rep unroll is unnecessary here.
I suggest loading all of the data at once, so that the memory latency can be hidden behind the calculation instructions.

>+    movu            m1, [r0 + x]

>+    vpacksswb       m1, [r0 + x + 64]
>+%assign x x+128
>+    vpcmpb          k1 {k2}, m1, m0, 00000100b
Could you please declare a new macro/constant? It is difficult for developers to understand that '00000100b' (4) means NE (per Intel's documentation).
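
As a sketch of what named predicates could look like, here is a C model (illustration only; in the .asm file this would presumably be a set of defines, and every name below is hypothetical). The numeric values are the VPCMP comparison predicate encodings from imm8[2:0], compared as signed bytes as vpcmpb does.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical names for the VPCMP predicate encodings (imm8[2:0]). */
enum vpcmp_pred {
    VPCMP_EQ    = 0, /* a == b       */
    VPCMP_LT    = 1, /* a <  b       */
    VPCMP_LE    = 2, /* a <= b       */
    VPCMP_FALSE = 3, /* always false */
    VPCMP_NE    = 4, /* a != b       */
    VPCMP_NLT   = 5, /* a >= b       */
    VPCMP_NLE   = 6, /* a >  b       */
    VPCMP_TRUE  = 7  /* always true  */
};

/* Scalar model of one signed-byte comparison under a given predicate. */
int vpcmp_eval(int8_t a, int8_t b, enum vpcmp_pred p)
{
    switch (p) {
    case VPCMP_EQ:    return a == b;
    case VPCMP_LT:    return a < b;
    case VPCMP_LE:    return a <= b;
    case VPCMP_FALSE: return 0;
    case VPCMP_NE:    return a != b;
    case VPCMP_NLT:   return a >= b;
    case VPCMP_NLE:   return a > b;
    case VPCMP_TRUE:  return 1;
    }
    return 0;
}

/* Usage sketch: the literal 00000100b in the patch is the NE predicate. */
void demo_ne_predicate(void)
{
    assert(VPCMP_NE == 0x04);
    assert(vpcmp_eval(7, 0, VPCMP_NE) == 1);
    assert(vpcmp_eval(0, 0, VPCMP_NE) == 0);
}
```

With a name in place of the literal, the compare line states its intent without a trip to the manual.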

>+    kmovq           r1, k1
>+    popcnt          r2, r1
>+    add             r3d, r2d
>+%endrep
>+    mov             eax, r3d
>+
>+    RET
>+
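
For reference, here is a scalar C sketch of what the kernel above appears to compute. It is not x265's C primitive, and the function names are invented for this example. The second function models the pack-then-compare approach. Signed saturation (as in vpacksswb) never turns a nonzero 16-bit value into a zero byte, so counting nonzero bytes after packing gives the same total, and the order in which the pack instruction interleaves elements does not affect a count.

```c
#include <assert.h>
#include <stdint.h>

/* Plain reference: count the nonzero coefficients of a 16x16 block. */
int count_nonzero_16x16_ref(const int16_t *quantCoeff)
{
    int count = 0;
    for (int i = 0; i < 16 * 16; i++)
        count += (quantCoeff[i] != 0);
    return count;
}

/* Signed saturation from 16 to 8 bits, the per-element effect of packsswb. */
int8_t sat8(int16_t v)
{
    if (v > 127)
        return 127;
    if (v < -128)
        return -128;
    return (int8_t)v;
}

/* Model of the pack-then-compare path: saturate to bytes, test != 0, count. */
int count_nonzero_16x16_packed(const int16_t *quantCoeff)
{
    int count = 0;
    for (int i = 0; i < 16 * 16; i++)
        count += (sat8(quantCoeff[i]) != 0);
    return count;
}

/* Usage sketch: 256 would truncate to a zero byte, but saturates to 127. */
void demo_count_nonzero(void)
{
    int16_t coeff[256] = {0};
    coeff[0] = 256;
    coeff[17] = -512;
    coeff[255] = 1;
    assert(count_nonzero_16x16_ref(coeff) == 3);
    assert(count_nonzero_16x16_packed(coeff) == 3);
}
```

A plain truncation to 8 bits would miss values such as 256, which is why the saturating pack matters here.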
