[x265] [arm64] port count_nonzero, blkfill, and copy_{ss, sp, ps}

Pop, Sebastian spop at amazon.com
Sun Jul 25 05:31:06 UTC 2021


Hi,

> You didn't see an improvement because you still use USHR. After CMEQ, each lane
> is 0 or -1 depending on the result, so we can sum these -1 values to get the
> total number of non-zero coeffs. This reduces 3 instructions to 2.

You are right.  With this change I see a lot of improvement:

@@ -508,19 +508,17 @@ function x265_copy_cnt_4_neon
.rept 2
     ld1             {v0.8b}, [x1], x2
     ld1             {v1.8b}, [x1], x2
-    clz             v2.4h, v0.4h
-    clz             v3.4h, v1.4h
-    ushr            v2.4h, v2.4h, #4
-    ushr            v3.4h, v3.4h, #4
-    add             v2.4h, v2.4h, v3.4h
-    add             v4.4h, v4.4h, v2.4h
     st1             {v0.8b}, [x0], #8
     st1             {v1.8b}, [x0], #8
+    cmeq            v0.4h, v0.4h, #0
+    cmeq            v1.4h, v1.4h, #0
+    add             v4.4h, v4.4h, v0.4h
+    add             v4.4h, v4.4h, v1.4h
.endr
     uaddlv          s4, v4.4h
-    fmov            w12, s4
-    mov             w11, #16
-    sub             w0, w11, w12
+    umov            w12, v4.h[0]
+    sxth            w12, w12
+    add             x0, x12, #16
     ret
endfunc


Before:
         copy_cnt[4x4]  13.93x   7.50            104.56
         copy_cnt[8x8]  31.20x   12.70           396.33
       copy_cnt[16x16]  43.22x   36.00           1556.03
       copy_cnt[32x32]  47.39x   129.34          6129.63

After:
         copy_cnt[4x4]  14.76x   7.12            105.12
         copy_cnt[8x8]  37.56x   10.60           398.25
       copy_cnt[16x16]  52.57x   29.74           1563.60
       copy_cnt[32x32]  62.22x   98.37           6120.29
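
For anyone following the counting trick, here is a scalar C model of the 4x4
sequence from the diff above. It is only a sketch: the function name
count_nonzero_4x4 and the lane-by-lane emulation are illustrative assumptions
and are not taken from the patch. It shows why accumulating the CMEQ masks
(-1 for each zero coefficient) and adding 16 at the end yields the number of
non-zero coefficients.

```c
#include <stdint.h>

/* Hypothetical scalar model of the NEON sequence in the diff (not patch code).
 * coeff holds the 16 coefficients of a 4x4 block, row by row. */
static int count_nonzero_4x4(const int16_t coeff[16])
{
    uint16_t acc[4] = {0, 0, 0, 0};              /* models the four lanes of v4.4h */

    for (int row = 0; row < 4; row++) {
        for (int lane = 0; lane < 4; lane++) {
            /* cmeq ..., #0: all ones (-1) if the coefficient is zero, else 0 */
            uint16_t mask = (coeff[row * 4 + lane] == 0) ? 0xFFFFu : 0u;
            /* add v4.4h, v4.4h, vN.4h: wraps modulo 2^16 in each lane */
            acc[lane] = (uint16_t)(acc[lane] + mask);
        }
    }

    /* uaddlv s4, v4.4h: unsigned widening sum of the four lanes */
    uint32_t sum = 0;
    for (int lane = 0; lane < 4; lane++)
        sum += acc[lane];

    /* umov w12, v4.h[0] followed by sxth w12, w12: take the low 16 bits
     * and sign-extend them; the result is minus the number of zeros. */
    int32_t neg_zeros = (int32_t)(sum & 0xFFFFu);
    if (neg_zeros >= 0x8000)
        neg_zeros -= 0x10000;

    /* add x0, x12, #16: 16 coefficients minus the zero count */
    return (int)(neg_zeros + 16);
}
```

The unsigned widening sum is harmless here because only the low 16 bits are
kept, and those agree with the signed sum modulo 2^16.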


> +    xtn             v0.8b, v0.8h
> +    xtn2            v0.16b, v1.8h
> equal to
> tbl v0, {v0,v1}, v2

You are right.  With this change I see a lot of improvement:

Before:
       copy_sp[16x16]  85.13x   18.78           1599.19
       copy_sp[32x32]  96.31x   65.07           6266.88
       copy_sp[64x64]  98.81x   252.38          24937.40
[i422] copy_sp[16x32]  91.93x   34.32           3154.89
[i422] copy_sp[32x64]  99.54x   128.29          12769.10

After:
       copy_sp[16x16]  96.23x   16.42           1579.74
       copy_sp[32x32]  104.33x  57.84           6034.24
       copy_sp[64x64]  110.79x  221.66          24558.72
[i422] copy_sp[16x32]  97.74x   31.89           3116.46
[i422] copy_sp[32x64]  111.37x  112.39          12517.52
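
To illustrate the xtn/xtn2 versus tbl equivalence quoted above, here is a
scalar C model of both forms. It is a sketch under stated assumptions: the
function names are hypothetical, and the index vector (v2 in the quoted
instruction) is assumed to hold the even byte offsets 0, 2, ..., 30, which the
mail does not spell out. With those indices, a two-register table lookup picks
the low byte of each halfword, which is what the narrowing pair does.

```c
#include <stdint.h>

/* Hypothetical model of xtn v0.8b, v0.8h + xtn2 v0.16b, v1.8h:
 * keep the low 8 bits of each of the 16 halfwords. */
static void narrow_xtn(const uint16_t src[16], uint8_t dst[16])
{
    for (int i = 0; i < 16; i++)
        dst[i] = (uint8_t)(src[i] & 0xFF);
}

/* Hypothetical model of tbl with a two-register table {v0, v1}. */
static void narrow_tbl(const uint16_t src[16], uint8_t dst[16])
{
    uint8_t table[32];   /* byte view of the two 128-bit source registers */
    uint8_t idx[16];     /* assumed contents of the index register v2 */

    for (int i = 0; i < 16; i++) {
        /* Halfword i occupies bytes 2i (low) and 2i+1 (high) of the table. */
        table[2 * i]     = (uint8_t)(src[i] & 0xFF);
        table[2 * i + 1] = (uint8_t)(src[i] >> 8);
        idx[i] = (uint8_t)(2 * i);   /* assumption: even offsets 0, 2, ..., 30 */
    }

    /* tbl: each result byte is table[idx], or 0 if idx is out of range. */
    for (int i = 0; i < 16; i++)
        dst[i] = (idx[i] < 32) ? table[idx[i]] : 0;
}
```

Both functions produce the same 16 bytes for any input. In the assembly, the
tbl form does it with one instruction once the index vector is set up outside
the loop.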

Please see the amended patch.

Thanks,
Sebastian
-------------- next part --------------
A non-text attachment was scrubbed...
Name: 0001-arm64-port-count_nonzero-blkfill-and-copy_-ss-sp-ps.patch
Type: application/octet-stream
Size: 38986 bytes
Desc: 0001-arm64-port-count_nonzero-blkfill-and-copy_-ss-sp-ps.patch
URL: <http://mailman.videolan.org/pipermail/x265-devel/attachments/20210725/3b6e6f57/attachment-0001.obj>

