[x265] [arm64] port count_nonzero, blkfill, and copy_{ss, sp, ps}
chen
chenm003 at 163.com
Sat Jul 24 05:42:12 UTC 2021
At 2021-07-24 05:23:44, "Pop, Sebastian" <spop at amazon.com> wrote:
Hi,
> + fmov w12, s4
> + neg w12, w12
> + add w0, w12, #16
> (-w12) + 16 equals 16 - w12; loading #16 into w0 can execute in parallel with the FMOV.
I see a small improvement with this change. Please see the attached patch.
> + clz v2.4h, v0.4h
> + clz v3.4h, v1.4h
> + ushr v2.4h, v2.4h, #4
> + ushr v3.4h, v3.4h, #4
> + add v2.4h, v2.4h, v3.4h
> clz+ushr+add is slower than cmeq+add in both execution throughput and cycles.
I do not see any improvement with this change applied to x265_copy_cnt_{4,8,16,32}:
@@ -508,14 +508,14 @@ function x265_copy_cnt_4_neon
.rept 2
ld1 {v0.8b}, [x1], x2
ld1 {v1.8b}, [x1], x2
- clz v2.4h, v0.4h
- clz v3.4h, v1.4h
- ushr v2.4h, v2.4h, #4
- ushr v3.4h, v3.4h, #4
- add v2.4h, v2.4h, v3.4h
- add v4.4h, v4.4h, v2.4h
st1 {v0.8b}, [x0], #8
st1 {v1.8b}, [x0], #8
+ cmeq v0.4h, v0.4h, #0
+ cmeq v1.4h, v1.4h, #0
+ ushr v0.4h, v0.4h, #15
+ ushr v1.4h, v1.4h, #15
+ add v4.4h, v4.4h, v0.4h
+ add v4.4h, v4.4h, v1.4h
.endr
uaddlv s4, v4.4h
fmov w12, s4
Before this change, the time is slightly better:
copy_cnt[4x4] 13.84x 7.53 104.19
copy_cnt[8x8] 31.37x 12.44 390.16
copy_cnt[16x16] 43.34x 35.83 1553.07
copy_cnt[32x32] 47.40x 129.28 6127.89
than after the change:
copy_cnt[4x4] 13.91x 7.50 104.25
copy_cnt[8x8] 31.09x 12.57 390.92
copy_cnt[16x16] 43.12x 36.04 1554.11
copy_cnt[32x32] 47.38x 129.34 6128.81
Neoverse-N1 SWOG says:
https://documentation-service.arm.com/static/5f05e93dcafe527e86f61acd
CLZ latency 2, throughput 2
CMEQ latency 2, throughput 1
Changing CLZ to CMEQ therefore gives less parallelism, because CMEQ has the lower throughput.
You didn't see an improvement because you still use USHR. After CMEQ, each lane holds 0 or -1 depending on the result, so we can accumulate these -1 values directly: the sum is the negated count of zero coefficients, from which the number of non-zero coefficients follows. This reduces 3 instructions to 2.
> The copy_s* looks good; my only comment is that the TBL instruction is faster than XTN/XTN2.
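The counting argument above can be checked with a scalar sketch in C. It models one 16-bit lane per loop iteration and is not the NEON code from the patch: the function names are hypothetical, and the plain int accumulator stands in for the vector lanes and the final reduction, which in real assembly would have to treat the accumulated -1 values as signed.

```c
#include <assert.h>
#include <stdint.h>

/* Count leading zeros of a 16-bit lane; returns 16 only when x == 0. */
static unsigned clz16(uint16_t x)
{
    unsigned n = 0;
    for (uint16_t m = 0x8000; m && !(x & m); m >>= 1)
        n++;
    return n;
}

/* Model of clz + ushr #4 + add: each zero lane contributes 1,
 * then the zero count is subtracted from the block size. */
static int nonzero_clz(const int16_t *c, int n)
{
    int zeros = 0;
    for (int i = 0; i < n; i++)
        zeros += (int)(clz16((uint16_t)c[i]) >> 4);
    return n - zeros;
}

/* Model of cmeq #0 + add: each zero lane contributes -1 (all ones),
 * so the accumulator holds -zeros and no shift is needed. */
static int nonzero_cmeq(const int16_t *c, int n)
{
    int acc = 0;
    for (int i = 0; i < n; i++)
        acc += (c[i] == 0) ? -1 : 0;
    return n + acc;
}

/* Hypothetical self-check on a 4x4 block with 7 non-zero coefficients. */
static void check_counts_agree(void)
{
    const int16_t blk[16] = {0, 3, 0, -1, 7, 0, 0, 0, 1, 2, 0, 0, -5, 0, 0, 9};
    assert(nonzero_clz(blk, 16) == nonzero_cmeq(blk, 16));
}
```

Both models return the same count; the second one needs only the compare and the add per lane, and the sign is absorbed by adding the accumulator to the block size instead of subtracting it.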
>
Neoverse-N1 SWOG says TBL is as fast as XTN:
TBL (with 1 or 2 table regs) latency 2 throughput 2
XTN latency 2 throughput 2
+ xtn v0.8b, v0.8h
+ xtn2 v0.16b, v1.8h
is equivalent to
tbl v0.16b, {v0.16b, v1.16b}, v2.16b
where v2 holds the byte indices 0, 2, 4, ..., 30.
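The equivalence can be illustrated with a scalar sketch in C. It assumes the v2 index register holds the even byte indices 0, 2, ..., 30, which the message does not spell out, and it models the two table registers as one 32-byte little-endian table. The function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Model of xtn + xtn2: keep the low byte of each halfword,
 * first from a (lower half of the result), then from b (upper half). */
static void narrow_xtn(const uint16_t a[8], const uint16_t b[8], uint8_t out[16])
{
    for (int i = 0; i < 8; i++) {
        out[i]     = (uint8_t)a[i];
        out[8 + i] = (uint8_t)b[i];
    }
}

/* Model of tbl with two table registers: the table is the 32 bytes of
 * {a, b} in little-endian lane order; out-of-range indices yield 0. */
static void narrow_tbl(const uint16_t a[8], const uint16_t b[8],
                       const uint8_t idx[16], uint8_t out[16])
{
    uint8_t table[32];
    for (int i = 0; i < 8; i++) {
        table[2 * i]          = (uint8_t)(a[i] & 0xff);
        table[2 * i + 1]      = (uint8_t)(a[i] >> 8);
        table[16 + 2 * i]     = (uint8_t)(b[i] & 0xff);
        table[16 + 2 * i + 1] = (uint8_t)(b[i] >> 8);
    }
    for (int i = 0; i < 16; i++)
        out[i] = (idx[i] < 32) ? table[idx[i]] : 0;
}

/* Hypothetical self-check: even indices select exactly the low bytes. */
static void check_tbl_matches_xtn(void)
{
    uint16_t a[8], b[8];
    uint8_t idx[16], r1[16], r2[16];
    for (int i = 0; i < 8; i++) {
        a[i] = (uint16_t)(0x1100 + i);
        b[i] = (uint16_t)(0x2280 + i);
    }
    for (int i = 0; i < 16; i++)
        idx[i] = (uint8_t)(2 * i);   /* assumed contents of v2 */
    narrow_xtn(a, b, r1);
    narrow_tbl(a, b, idx, r2);
    assert(memcmp(r1, r2, sizeof r1) == 0);
}
```

The sketch only shows that one table lookup can replace the two narrowing instructions; it says nothing about cost, and the index register has to be loaded once outside the loop.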
Thanks,
Sebastian