[x265] [PATCH 300 of 307] x86: AVX512 'count_nonzero_16x16' avx-512 kernel, 22% speedup over avx2
chen
chenm003 at 163.com
Fri Apr 6 20:12:59 CEST 2018
Sorry, I missed a line; resending with an additional comment.
At 2018-04-07 01:27:34, "chen" <chenm003 at 163.com> wrote:
At 2018-04-06 21:17:37, mythreyi at multicorewareinc.com wrote:
># HG changeset patch
># User Jayashree
># Date 1517283539 28800
># Mon Jan 29 19:38:59 2018 -0800
># Node ID 3c6e5ce07dbca7f967e4b5b62fe450979da3bf81
># Parent 624c83571d1df840e1206c46e589044fbf87ff32
>x86: AVX512 'count_nonzero_16x16' avx-512 kernel, 22% speedup over avx2
>
>count_nonzero[16x16] 18.88x -> 23.04x
>
>+;-----------------------------------------------------------------------------
>+; int x265_count_nonzero_16x16_avx512(const int16_t *quantCoeff);
>+;-----------------------------------------------------------------------------
>+INIT_ZMM avx512
>+cglobal count_nonzero_16x16, 1,4,2
>+ mov r1, 0xFFFFFFFFFFFFFFFF
>+ kmovq k2, r1
https://www.cs.utexas.edu/~hunt/class/2017-spring/cs350c/documents/Intel-x86-Docs/64-ia-32-architectures-instruction-set-extensions-reference-manual.pdf
2.5.1.1 Opmask Register K0
The only exception to the opmask rules described above is that opmask k0 can not be used as a predicate operand.
Opmask k0 cannot be encoded as a predicate operand for a vector operation; the encoding value that would select
opmask k0 will instead select an implicit opmask value of 0xFFFFFFFFFFFFFFFF, thereby effectively disabling
masking. Opmask register k0 can still be used for any instruction that takes opmask register(s) as operand(s)
(either source or destination).
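To illustrate the quoted passage, here is a minimal scalar sketch in C of a 64-lane byte "not equal" compare under a write mask. The helper name cmp_ne_masked is hypothetical and only models the behaviour; it is not code from the patch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical scalar model of a masked 64-lane byte "not equal" compare.
 * Lanes whose bit in k is 0 always yield a 0 bit in the result. */
uint64_t cmp_ne_masked(const int8_t *a, const int8_t *b, uint64_t k)
{
    uint64_t result = 0;
    assert(a != NULL && b != NULL);
    for (int i = 0; i < 64; i++) {
        if (((k >> i) & 1) && a[i] != b[i])
            result |= (uint64_t)1 << i;
    }
    return result;
}
```

With k = 0xFFFFFFFFFFFFFFFF every lane takes part, which is the same result the implicit mask gives when no {k} is written. As I read the passage, this suggests the mov/kmovq pair that builds k2 may not be needed.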
>+ xor r3, r3
>+ pxor m0, m0
>+
>+%assign x 0
>+%rep 4
This is unrolled only 4 times, so the %rep unroll seems unnecessary here.
I suggest loading all of the bytes at once, so that the memory latency can be hidden behind the compute instructions.
>+ movu m1, [r0 + x]
>+ vpacksswb m1, [r0 + x + 64]
>+%assign x x+128
>+ vpcmpb k1 {k2}, m1, m0, 00000100b
Could you please declare a new macro/constant? It is difficult for developers to see that '00000100b' (4) means NE (per Intel's documentation).
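As a sketch of what named predicates could look like, here is a scalar C model. The enum and function names (cmp_pred, CMP_NE, cmp_bytes) are hypothetical; only the numeric values are taken from the comparison-predicate table in Intel's manual. In the assembly source the equivalent would presumably be a %define or similar constant.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical names for the 3-bit compare predicate immediate. */
enum cmp_pred {
    CMP_EQ    = 0,
    CMP_LT    = 1,
    CMP_LE    = 2,
    CMP_FALSE = 3,
    CMP_NE    = 4,  /* the 00000100b used in the patch */
    CMP_NLT   = 5,
    CMP_NLE   = 6,
    CMP_TRUE  = 7
};

/* Scalar model of an unmasked 64-lane signed byte compare:
 * bit i of the result is set when the predicate holds for lane i. */
uint64_t cmp_bytes(const int8_t *a, const int8_t *b, enum cmp_pred pred)
{
    uint64_t mask = 0;
    assert(a != NULL && b != NULL);
    for (int i = 0; i < 64; i++) {
        int hit = 0;
        switch (pred) {
        case CMP_EQ:    hit = a[i] == b[i]; break;
        case CMP_LT:    hit = a[i] <  b[i]; break;
        case CMP_LE:    hit = a[i] <= b[i]; break;
        case CMP_FALSE: hit = 0;            break;
        case CMP_NE:    hit = a[i] != b[i]; break;
        case CMP_NLT:   hit = a[i] >= b[i]; break;
        case CMP_NLE:   hit = a[i] >  b[i]; break;
        case CMP_TRUE:  hit = 1;            break;
        }
        if (hit)
            mask |= (uint64_t)1 << i;
    }
    return mask;
}
```

A call such as cmp_bytes(a, zero, CMP_NE) then reads as "not equal" without a trip to the manual.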
>+ kmovq r1, k1
>+ popcnt r2, r1
>+ add r3d, r2d
>+%endrep
>+ mov eax, r3d
>+
>+ RET
>+
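For checking the kernel, a scalar reference may be handy. The following is a minimal sketch in C, not code from the x265 tree: the function names are hypothetical, and the second function only models the kernel's idea of narrowing with signed saturation before comparing against zero.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Plain reference: count the non-zero entries among 16x16 = 256 coefficients. */
int count_nonzero_16x16_ref(const int16_t *quantCoeff)
{
    int count = 0;
    assert(quantCoeff != NULL);
    for (int i = 0; i < 16 * 16; i++)
        count += quantCoeff[i] != 0;
    return count;
}

/* Signed saturation from 16 to 8 bits, as a signed-saturating pack does per lane. */
int8_t sat_i16_to_i8(int16_t v)
{
    if (v > INT8_MAX) return INT8_MAX;
    if (v < INT8_MIN) return INT8_MIN;
    return (int8_t)v;
}

/* Model of the kernel's approach: narrow each coefficient to a byte,
 * then count the bytes that differ from zero. */
int count_nonzero_16x16_packed(const int16_t *quantCoeff)
{
    int count = 0;
    assert(quantCoeff != NULL);
    for (int i = 0; i < 16 * 16; i++)
        count += sat_i16_to_i8(quantCoeff[i]) != 0;
    return count;
}
```

Saturation matters here: a value such as 256 would become 0 if only the low byte were kept, but it saturates to 127 and is still counted. The lane order produced by the pack does not affect the result, since only the total is needed.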