[x265] [arm64] port count_nonzero, blkfill, and copy_{ss, sp, ps}

Sat Jul 24 05:42:12 UTC 2021



At 2021-07-24 05:23:44, "Pop, Sebastian" <spop at amazon.com> wrote:

Hi,

 

> +    fmov            w12, s4

> +    neg             w12, w12

> +    add             w0, w12, #16

> (-w12) + 16 is equal to 16 - w12; loading #16 into w0 first lets that move execute in parallel with the FMOV.

 

I see a small improvement with this change.  Please see the attached patch.

 

> +    clz             v2.4h, v0.4h

> +    clz             v3.4h, v1.4h

> +    ushr            v2.4h, v2.4h, #4

> +    ushr            v3.4h, v3.4h, #4

> +    add             v2.4h, v2.4h, v3.4h

> clz+ushr+add is slower than cmeq+add in both execution throughput and cycles.

 

I do not see any improvement with this change applied to x265_copy_cnt_{4,8,16,32}:

 

@@ -508,14 +508,14 @@ function x265_copy_cnt_4_neon

.rept 2

     ld1             {v0.8b}, [x1], x2

     ld1             {v1.8b}, [x1], x2

-    clz             v2.4h, v0.4h

-    clz             v3.4h, v1.4h

-    ushr            v2.4h, v2.4h, #4

-    ushr            v3.4h, v3.4h, #4

-    add             v2.4h, v2.4h, v3.4h

-    add             v4.4h, v4.4h, v2.4h

     st1             {v0.8b}, [x0], #8

     st1             {v1.8b}, [x0], #8

+    cmeq            v0.4h, v0.4h, #0

+    cmeq            v1.4h, v1.4h, #0

+    ushr            v0.4h, v0.4h, #15

+    ushr            v1.4h, v1.4h, #15

+    add             v4.4h, v4.4h, v0.4h

+    add             v4.4h, v4.4h, v1.4h

.endr

     uaddlv          s4, v4.4h

     fmov            w12, s4

 

Before this change, the time is slightly better:

 

         copy_cnt[4x4]  13.84x   7.53            104.19

         copy_cnt[8x8]  31.37x   12.44           390.16

       copy_cnt[16x16]  43.34x   35.83           1553.07

       copy_cnt[32x32]  47.40x   129.28          6127.89

 

than after the change:

 

         copy_cnt[4x4]  13.91x   7.50            104.25

         copy_cnt[8x8]  31.09x   12.57           390.92

       copy_cnt[16x16]  43.12x   36.04           1554.11

       copy_cnt[32x32]  47.38x   129.34          6128.81

 

Neoverse-N1 SWOG says:

https://documentation-service.arm.com/static/5f05e93dcafe527e86f61acd

 

CLZ  latency 2, throughput 2

CMEQ latency 2, throughput 1

 

Changing CLZ to CMEQ reduces parallelism, because CMEQ has the lower throughput.







You didn't see an improvement because you still use USHR. After CMEQ, each lane is 0 or -1 depending on the result, so we can sum these -1 values directly to get the total number of non-zero coeffs. That reduces 3 instructions to 2.
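
The three counting schemes discussed above can be compared with a small scalar C model. This is only a sketch under stated assumptions: the function names are hypothetical, each loop iteration stands in for one 16-bit NEON lane, and the model checks the arithmetic only, not the timing on any core.

```c
#include <assert.h>
#include <stdint.h>

/* Scalar model of CLZ on one 16-bit lane; returns 16 when x == 0. */
unsigned clz16(uint16_t x)
{
    unsigned n = 0;
    for (uint16_t mask = 0x8000; mask != 0 && (x & mask) == 0; mask >>= 1)
        n++;
    return n;
}

/* Variant A (original patch): clz + ushr #4 gives 1 only for a zero lane. */
int nonzero_clz(const int16_t *c, int n)
{
    assert(n >= 0);
    unsigned zeros = 0;
    for (int i = 0; i < n; i++)
        zeros += clz16((uint16_t)c[i]) >> 4;
    return n - (int)zeros;
}

/* Variant B (tested diff): cmeq #0 + ushr #15 also gives 1 for a zero lane. */
int nonzero_cmeq_ushr(const int16_t *c, int n)
{
    assert(n >= 0);
    unsigned zeros = 0;
    for (int i = 0; i < n; i++) {
        uint16_t m = (c[i] == 0) ? 0xFFFF : 0;  /* cmeq result: all ones or 0 */
        zeros += m >> 15;                       /* ushr #15: 1 or 0 */
    }
    return n - (int)zeros;
}

/* Variant C (suggestion): accumulate the cmeq mask itself, -1 per zero lane. */
int nonzero_cmeq_sum(const int16_t *c, int n)
{
    assert(n >= 0);
    int acc = 0;
    for (int i = 0; i < n; i++)
        acc += (c[i] == 0) ? -1 : 0;            /* no shift needed */
    return n + acc;                             /* acc == -(number of zeros) */
}
```

In this model all three variants return the same count for any input; variant C simply folds the sign into the final addition of the block size instead of shifting every lane.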




 

> The copy_s* functions look good; my only comment is that the TBL instruction is faster than XTN/XTN2.

> 

 

Neoverse-N1 SWOG says TBL is as fast as XTN:

 

TBL (with 1 or 2 table regs) latency 2 throughput 2

XTN latency 2 throughput 2




+    xtn             v0.8b, v0.8h

+    xtn2            v0.16b, v1.8h

is equivalent to (with v2 holding the even byte indices 0, 2, ..., 30)
tbl             v0.16b, {v0.16b, v1.16b}, v2.16b
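
A scalar C model can illustrate why a single TBL can stand in for the XTN/XTN2 pair. It is a sketch, not code from the patch: the function names are hypothetical, and the index vector (even byte offsets 0, 2, ..., 30) is an assumption about what v2 would have to contain, since the thread does not spell it out.

```c
#include <assert.h>
#include <stdint.h>

/* XTN + XTN2: keep the low byte of each 16-bit lane of two input vectors. */
void narrow_xtn(const uint16_t lo[8], const uint16_t hi[8], uint8_t out[16])
{
    for (int i = 0; i < 8; i++) {
        out[i]     = (uint8_t)lo[i];   /* xtn  v0.8b,  v0.8h */
        out[i + 8] = (uint8_t)hi[i];   /* xtn2 v0.16b, v1.8h */
    }
}

/* Assumed contents of the index register: byte offsets 0, 2, ..., 30. */
void make_even_indices(uint8_t idx[16])
{
    for (int i = 0; i < 16; i++)
        idx[i] = (uint8_t)(2 * i);
}

/* TBL with a two-register table: out[i] = table[idx[i]], 0 if out of range. */
void narrow_tbl(const uint16_t lo[8], const uint16_t hi[8],
                const uint8_t idx[16], uint8_t out[16])
{
    uint8_t table[32];
    /* Lay the two vectors out as 32 bytes in little-endian lane order. */
    for (int i = 0; i < 8; i++) {
        table[2 * i]          = (uint8_t)(lo[i] & 0xFF);
        table[2 * i + 1]      = (uint8_t)(lo[i] >> 8);
        table[16 + 2 * i]     = (uint8_t)(hi[i] & 0xFF);
        table[16 + 2 * i + 1] = (uint8_t)(hi[i] >> 8);
    }
    for (int i = 0; i < 16; i++)
        out[i] = (idx[i] < 32) ? table[idx[i]] : 0;
}
```

With the even-index vector, `narrow_tbl` produces the same sixteen bytes as `narrow_xtn`. In assembly the index register would need to be set up once before the loop, which is a cost the XTN/XTN2 pair does not have.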







 

Thanks,

Sebastian