[x265] [arm64] port count_nonzero, blkfill, and copy_{ss, sp, ps}

Sun Jul 25 05:49:18 UTC 2021

Hi,


@@ -508,19 +508,17 @@ function x265_copy_cnt_4_neon
......

+    uaddlv          s4, v4.4h

Unsigned?




+    umov            w12, v4.h[0]

+    sxth            w12, w12

+    add             x0, x12, #16




The SXTH is unnecessary: the count of zeros must be in the range [0,16], so W12 is in the range [-16,0].

Please also remember that W0 is the low half of X0, and that the result in register S4 is an int32.
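
The point about the reduction can be checked with a small scalar model. This is only a sketch in Python, not x265 code: the helper names are hypothetical, and `tail_signed` assumes one reading of the comment, namely that SADDLV followed by a plain FMOV/ADD on W registers would replace the UMOV/SXTH pair. Each lane of v4 is modeled as a 16-bit value in [-4,0], since every lane accumulates four CMEQ results in the 4x4 case.

```python
from itertools import product

MASK16, MASK32 = 0xFFFF, 0xFFFFFFFF

def to_signed(value, bits):
    """Interpret the low `bits` bits of value as a two's-complement integer."""
    value &= (1 << bits) - 1
    return value - (1 << bits) if value >> (bits - 1) else value

def uaddlv_4h(lanes):
    """Model of UADDLV Sd, Vn.4H: lanes are summed as unsigned 16-bit values."""
    return sum(lane & MASK16 for lane in lanes) & MASK32

def saddlv_4h(lanes):
    """Model of SADDLV Sd, Vn.4H: lanes are summed as signed 16-bit values."""
    return sum(to_signed(lane, 16) for lane in lanes) & MASK32

def tail_posted(lanes):
    """Tail from the posted patch: uaddlv, umov, sxth, add x0."""
    s4 = uaddlv_4h(lanes)
    w12 = s4 & MASK16                    # umov w12, v4.h[0] zero-extends lane 0
    w12 = to_signed(w12, 16) & MASK32    # sxth w12, w12 (upper half of X12 is cleared)
    x0 = w12 + 16                        # add x0, x12, #16
    return x0 & MASK32                   # the caller only reads W0

def tail_signed(lanes):
    """Assumed alternative: saddlv s4, v4.4h; fmov w12, s4; add w0, w12, #16."""
    w12 = saddlv_4h(lanes)
    return (w12 + 16) & MASK32

def check_all():
    """Compare both tails for every possible per-lane zero count (0..4)."""
    for zeros in product(range(5), repeat=4):
        lanes = [(-z) & MASK16 for z in zeros]
        expected = 16 - sum(zeros)
        assert tail_posted(lanes) == expected
        assert tail_signed(lanes) == expected
    return True
```

In this model both tails agree on W0 for all inputs. The unsigned sum itself is not a usable int32 (four lanes of 0xFFFC add up to 0x3FFF0), which is why the posted version has to extract and sign-extend the low halfword first.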




The rest of the patch looks good.




Regards,

Min Chen

At 2021-07-25 13:31:06, "Pop, Sebastian" <spop at amazon.com> wrote:

Hi,

 

> You didn't see an improvement because you still use USHR. After CMEQ we get 0 or -1

> depending on the result, so we can sum these -1 values to derive the total number of

> non-zero coeffs; this reduces 3 instructions to 2.

 

You are right.  With this change I see a lot of improvement:

 

@@ -508,19 +508,17 @@ function x265_copy_cnt_4_neon

.rept 2

     ld1             {v0.8b}, [x1], x2

     ld1             {v1.8b}, [x1], x2

-    clz             v2.4h, v0.4h

-    clz             v3.4h, v1.4h

-    ushr            v2.4h, v2.4h, #4

-    ushr            v3.4h, v3.4h, #4

-    add             v2.4h, v2.4h, v3.4h

-    add             v4.4h, v4.4h, v2.4h

     st1             {v0.8b}, [x0], #8

     st1             {v1.8b}, [x0], #8

+    cmeq            v0.4h, v0.4h, #0

+    cmeq            v1.4h, v1.4h, #0

+    add             v4.4h, v4.4h, v0.4h

+    add             v4.4h, v4.4h, v1.4h

.endr

     uaddlv          s4, v4.4h

-    fmov            w12, s4

-    mov             w11, #16

-    sub             w0, w11, w12

+    umov            w12, v4.h[0]

+    sxth            w12, w12

+    add             x0, x12, #16

     ret

endfunc
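
The equivalence of the two counting schemes in this hunk can be sketched with a scalar model. The Python below is illustrative only and uses hypothetical helper names; it treats a 4x4 block as a flat list of 16 coefficients and ignores how they are spread over vector lanes.

```python
def clz16(x):
    """Count leading zeros of a 16-bit lane, as CLZ on .4h would."""
    x &= 0xFFFF
    return 16 - x.bit_length()

def zero_flag_old(coeff):
    """clz + ushr #4: 16 >> 4 == 1 for a zero lane, 0..15 >> 4 == 0 otherwise."""
    return clz16(coeff) >> 4

def zero_flag_new(coeff):
    """cmeq #0: all ones (-1 as int16) for a zero lane, 0 otherwise."""
    return 0xFFFF if (coeff & 0xFFFF) == 0 else 0

def count_nonzero_old(block):
    """Removed code path: accumulate +1 per zero, then compute 16 - zeros."""
    zeros = sum(zero_flag_old(c) for c in block)
    return 16 - zeros

def count_nonzero_new(block):
    """Added code path: accumulate -1 per zero, then compute sum + 16."""
    acc = sum(zero_flag_new(c) for c in block) & 0xFFFF   # low halfword of the sum
    signed = acc - 0x10000 if acc & 0x8000 else acc       # sign-extend as sxth does
    return signed + 16
```

Per coefficient row, the old scheme needs CLZ, USHR and ADD, while the new one needs only CMEQ and ADD, which matches the "3 instructions to 2" remark quoted above.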

 

 

Before:

         copy_cnt[4x4]  13.93x   7.50            104.56

         copy_cnt[8x8]  31.20x   12.70           396.33

       copy_cnt[16x16]  43.22x   36.00           1556.03

       copy_cnt[32x32]  47.39x   129.34          6129.63

 

After:

         copy_cnt[4x4]  14.76x   7.12            105.12

         copy_cnt[8x8]  37.56x   10.60           398.25

       copy_cnt[16x16]  52.57x   29.74           1563.60

       copy_cnt[32x32]  62.22x   98.37           6120.29
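
To put the two tables side by side, the relative cycle reduction can be computed directly. This sketch assumes the columns are speedup, cycles of the assembly version, and cycles of the C reference; the headers are not shown here, but the ratios are consistent with that reading (for example 104.56 / 7.50 is about 13.9).

```python
# Second column of each table, assumed to be cycles of the NEON version.
before = {"4x4": 7.50, "8x8": 12.70, "16x16": 36.00, "32x32": 129.34}
after = {"4x4": 7.12, "8x8": 10.60, "16x16": 29.74, "32x32": 98.37}

def cycle_reduction(old, new):
    """Percentage of cycles saved going from old to new."""
    return (old - new) / old * 100.0

reductions = {size: cycle_reduction(before[size], after[size]) for size in before}

for size, pct in reductions.items():
    print(f"copy_cnt[{size}]: {pct:.1f}% fewer cycles")
```

Under that assumption the saving grows with block size, from roughly 5% for 4x4 to roughly 24% for 32x32.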

 

 

> +    xtn             v0.8b, v0.8h

> +    xtn2            v0.16b, v1.8h

> equal to

> tbl v0, {v0,v1}, v2
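
The equivalence claimed in the quote can be illustrated with a byte-level model. This is a sketch under assumptions: the index vector in v2 is not shown in the thread, so `EVEN_BYTES` below (indices 0, 2, ..., 30) is a guess at what it would hold, and the function names are hypothetical.

```python
def bytes_of_8h(lanes):
    """Little-endian byte image of a register viewed as eight 16-bit lanes."""
    out = []
    for lane in lanes:
        out += [lane & 0xFF, (lane >> 8) & 0xFF]
    return out

def xtn_pair(v0, v1):
    """xtn v0.8b, v0.8h then xtn2 v0.16b, v1.8h: keep the low byte of each lane."""
    return [lane & 0xFF for lane in v0] + [lane & 0xFF for lane in v1]

def tbl_two_regs(table, indices):
    """Two-register TBL: each index selects a table byte, out of range yields 0."""
    return [table[i] if i < len(table) else 0 for i in indices]

# Assumed content of v2: pick every even byte of the 32-byte table {v0, v1}.
EVEN_BYTES = list(range(0, 32, 2))

def narrow_with_tbl(v0, v1):
    """Single-instruction variant: tbl over the concatenated bytes of v0 and v1."""
    return tbl_two_regs(bytes_of_8h(v0) + bytes_of_8h(v1), EVEN_BYTES)
```

With that index vector, one TBL yields the same 16 bytes as the XTN/XTN2 pair, at the cost of keeping the indices resident in a register.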

 

You are right.  With this change I see a lot of improvement:

 

Before:

copy_sp[16x16]  85.13x   18.78           1599.19

copy_sp[32x32]  96.31x   65.07           6266.88

copy_sp[64x64]  98.81x   252.38          24937.40

[i422] copy_sp[16x32]  91.93x   34.32           3154.89

[i422] copy_sp[32x64]  99.54x   128.29          12769.10

 

After:

copy_sp[16x16]  96.23x   16.42           1579.74

copy_sp[32x32]  104.33x  57.84           6034.24

copy_sp[64x64]  110.79x  221.66          24558.72

[i422] copy_sp[16x32]  97.74x   31.89           3116.46

[i422] copy_sp[32x64]  111.37x  112.39          12517.52

 

Please see the amended patch.

 

Thanks,

Sebastian