[x265] [arm64] port count_nonzero, blkfill, and copy_{ss, sp, ps}

Pop, Sebastian spop at amazon.com
Fri Jul 23 21:23:44 UTC 2021


Hi,

> +    fmov            w12, s4
> +    neg             w12, w12
> +    add             w0, w12, #16
> (-w12) + 16 is equal to 16 - w12; loading #16 into w0 first may execute in parallel with the FMOV.

I see a small improvement with this change.  Please see the attached patch.

> +    clz             v2.4h, v0.4h
> +    clz             v3.4h, v1.4h
> +    ushr            v2.4h, v2.4h, #4
> +    ushr            v3.4h, v3.4h, #4
> +    add             v2.4h, v2.4h, v3.4h
> clz+ushr+add is slower than cmeq+add in either execution throughput or cycles.

I do not see any improvement with this change applied to x265_copy_cnt_{4,8,16,32}:

@@ -508,14 +508,14 @@ function x265_copy_cnt_4_neon
.rept 2
     ld1             {v0.8b}, [x1], x2
     ld1             {v1.8b}, [x1], x2
-    clz             v2.4h, v0.4h
-    clz             v3.4h, v1.4h
-    ushr            v2.4h, v2.4h, #4
-    ushr            v3.4h, v3.4h, #4
-    add             v2.4h, v2.4h, v3.4h
-    add             v4.4h, v4.4h, v2.4h
     st1             {v0.8b}, [x0], #8
     st1             {v1.8b}, [x0], #8
+    cmeq            v0.4h, v0.4h, #0
+    cmeq            v1.4h, v1.4h, #0
+    ushr            v0.4h, v0.4h, #15
+    ushr            v1.4h, v1.4h, #15
+    add             v4.4h, v4.4h, v0.4h
+    add             v4.4h, v4.4h, v1.4h
.endr
     uaddlv          s4, v4.4h
     fmov            w12, s4

Before this change, the timings are slightly better (columns: speedup over
the C version, assembly time, C time):

         copy_cnt[4x4]  13.84x   7.53            104.19
         copy_cnt[8x8]  31.37x   12.44           390.16
       copy_cnt[16x16]  43.34x   35.83           1553.07
       copy_cnt[32x32]  47.40x   129.28          6127.89

than after the change:

         copy_cnt[4x4]  13.91x   7.50            104.25
         copy_cnt[8x8]  31.09x   12.57           390.92
       copy_cnt[16x16]  43.12x   36.04           1554.11
       copy_cnt[32x32]  47.38x   129.34          6128.81

Neoverse-N1 SWOG says:
https://documentation-service.arm.com/static/5f05e93dcafe527e86f61acd

CLZ  latency 2, throughput 2
CMEQ latency 2, throughput 1

Replacing CLZ with CMEQ reduces parallelism, because CMEQ has the lower throughput.
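
For reference, both sequences compute the same per-lane value: 1 for a zero
16-bit lane and 0 otherwise, and the final result is 16 minus the sum of those
flags. A minimal scalar sketch in C, not the NEON code itself; all function
names here are hypothetical and only model one lane at a time:

```c
#include <assert.h>
#include <stdint.h>

/* Scalar model of CLZ on one 16-bit lane: number of leading zero bits,
 * which is 16 when the lane is zero. */
static unsigned clz16(uint16_t x)
{
    unsigned n = 0;
    for (uint16_t mask = 0x8000; mask != 0 && (x & mask) == 0; mask >>= 1)
        n++;
    return n;
}

/* clz + ushr #4: only a count of 16 (zero lane) survives the shift as 1. */
static unsigned zero_flag_clz(uint16_t x)
{
    return clz16(x) >> 4;
}

/* cmeq #0 + ushr #15: the all-ones mask of a zero lane becomes 1. */
static unsigned zero_flag_cmeq(uint16_t x)
{
    uint16_t mask = (x == 0) ? 0xFFFF : 0x0000;
    return (unsigned)(mask >> 15);
}

/* Check that the two models agree for every possible 16-bit lane value. */
static void check_zero_flags_agree(void)
{
    for (uint32_t v = 0; v <= 0xFFFF; v++)
        assert(zero_flag_clz((uint16_t)v) == zero_flag_cmeq((uint16_t)v));
}

/* Hypothetical scalar equivalent of the 4x4 count: accumulate the zero
 * flags, then return 16 - zeros, as the fmov/neg/add tail does. */
static int count_nonzero_4x4(const int16_t coeff[16], int use_cmeq)
{
    unsigned zeros = 0;
    for (int i = 0; i < 16; i++) {
        uint16_t lane = (uint16_t)coeff[i];
        zeros += use_cmeq ? zero_flag_cmeq(lane) : zero_flag_clz(lane);
    }
    return 16 - (int)zeros;
}
```

Since the results are identical, the choice between the two sequences is
purely a matter of latency and throughput on the target core.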

> The copy_s* looks good; my only comment is that the TBL instruction is faster than XTN/XTN2
>

Neoverse-N1 SWOG says TBL is as fast as XTN:

TBL (with 1 or 2 table regs) latency 2 throughput 2
XTN latency 2 throughput 2
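
To make the comparison concrete, here is a scalar sketch in C of the two ways
to narrow eight 16-bit lanes to bytes. It assumes the TBL suggestion means a
one-register table lookup with indices selecting the even-numbered (low)
bytes; that index choice and all function names are my assumptions, not taken
from the patch:

```c
#include <assert.h>
#include <stdint.h>

/* Model of XTN Vd.8B, Vn.8H: keep the low 8 bits of each 16-bit lane. */
static void narrow_xtn(const uint16_t src[8], uint8_t dst[8])
{
    for (int i = 0; i < 8; i++)
        dst[i] = (uint8_t)(src[i] & 0xFF);
}

/* Model of TBL with one table register: each result byte is table[idx],
 * or 0 when the index is out of range (>= 16). */
static void tbl1(const uint8_t table[16], const uint8_t idx[8], uint8_t dst[8])
{
    for (int i = 0; i < 8; i++)
        dst[i] = (idx[i] < 16) ? table[idx[i]] : 0;
}

/* Register byte layout: lane i occupies byte 2i (low) and 2i+1 (high). */
static void lanes_to_bytes(const uint16_t src[8], uint8_t bytes[16])
{
    for (int i = 0; i < 8; i++) {
        bytes[2 * i]     = (uint8_t)(src[i] & 0xFF);
        bytes[2 * i + 1] = (uint8_t)(src[i] >> 8);
    }
}

/* Narrowing via TBL: look up the even-numbered bytes (assumed indices). */
static void narrow_tbl(const uint16_t src[8], uint8_t dst[8])
{
    static const uint8_t even_idx[8] = {0, 2, 4, 6, 8, 10, 12, 14};
    uint8_t bytes[16];
    lanes_to_bytes(src, bytes);
    tbl1(bytes, even_idx, dst);
}

/* Check that both narrowing models produce the same eight bytes. */
static void check_narrowing_agrees(const uint16_t src[8])
{
    uint8_t a[8], b[8];
    narrow_xtn(src, a);
    narrow_tbl(src, b);
    for (int i = 0; i < 8; i++)
        assert(a[i] == b[i]);
}
```

Under that assumption the two instructions produce the same bytes, and TBL
additionally needs a register holding the index vector.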

Thanks,
Sebastian
-------------- next part --------------
A non-text attachment was scrubbed...
Name: 0001-arm64-port-count_nonzero-blkfill-and-copy_-ss-sp-ps.patch
Type: application/octet-stream
Size: 39026 bytes
Desc: 0001-arm64-port-count_nonzero-blkfill-and-copy_-ss-sp-ps.patch
URL: <http://mailman.videolan.org/pipermail/x265-devel/attachments/20210723/ef140129/attachment-0001.obj>