[x265] [arm64] port count_nonzero, blkfill, and copy_{ss, sp, ps}
chen
chenm003 at 163.com
Sun Jul 25 05:49:18 UTC 2021
Hi,
@@ -508,19 +508,17 @@ function x265_copy_cnt_4_neon
......
+ uaddlv s4, v4.4h
Unsigned?
+ umov w12, v4.h[0]
+ sxth w12, w12
+ add x0, x12, #16
The SXTH is unnecessary: the count of zeros must be in the range [0,16], so W12 is in the range [-16,0].
Please also remember that W0 is the low part of X0, and that the result in register S4 is an int32.
The rest of the patch looks good.
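
For readers following the range argument, here is a minimal scalar sketch in C of the final reduction. It is a model under stated assumptions, not code from the patch: the helper names are hypothetical, each lane is assumed to hold minus the number of zero coefficients in one column of a 4x4 block, and the SADDLV tail is only one possible reading of the comments above.

```c
#include <assert.h>
#include <stdint.h>

/* UADDLV s4, v4.4h: zero-extend each 16-bit lane, then sum into 32 bits. */
static uint32_t uaddlv_4h(const int16_t lane[4]) {
    uint32_t sum = 0;
    for (int i = 0; i < 4; i++)
        sum += (uint16_t)lane[i];
    return sum;
}

/* SADDLV s4, v4.4h: sign-extend each 16-bit lane, then sum into 32 bits. */
static int32_t saddlv_4h(const int16_t lane[4]) {
    int32_t sum = 0;
    for (int i = 0; i < 4; i++)
        sum += lane[i];
    return sum;
}

/* Posted tail: uaddlv; umov w12, v4.h[0]; sxth w12, w12; add x0, x12, #16. */
static int nonzero_posted(const int16_t lane[4]) {
    uint32_t w12 = uaddlv_4h(lane) & 0xFFFFu;   /* umov keeps the low halfword */
    if (w12 & 0x8000u)
        w12 |= 0xFFFF0000u;                     /* sxth w12, w12 */
    uint64_t x0 = (uint64_t)w12 + 16;           /* writing w12 zeroed x12[63:32] */
    return (int)(uint32_t)x0;                   /* the caller reads w0 = x0[31:0] */
}

/* Hypothetical alternative tail: saddlv; fmov w12, s4; add w0, w12, #16. */
static int nonzero_signed(const int16_t lane[4]) {
    for (int i = 0; i < 4; i++)
        assert(lane[i] >= -4 && lane[i] <= 0);  /* 4 rows, 0 or -1 per row */
    return saddlv_4h(lane) + 16;
}
```

In this model both tails return the same count. The unsigned sum only works because its low halfword is sign-extended afterwards, whereas the signed sum is already an int32 in [-16,0].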
Regards,
Min Chen
At 2021-07-25 13:31:06, "Pop, Sebastian" <spop at amazon.com> wrote:
Hi,
> You didn't see an improvement because you still use USHR. After CMEQ we get
> 0 or -1 depending on the result, so we can sum these -1 values to get the
> total number of non-zero coeffs. This reduces 3 instructions to 2.
You are right. With this change I see a lot of improvement:
@@ -508,19 +508,17 @@ function x265_copy_cnt_4_neon
.rept 2
ld1 {v0.8b}, [x1], x2
ld1 {v1.8b}, [x1], x2
- clz v2.4h, v0.4h
- clz v3.4h, v1.4h
- ushr v2.4h, v2.4h, #4
- ushr v3.4h, v3.4h, #4
- add v2.4h, v2.4h, v3.4h
- add v4.4h, v4.4h, v2.4h
st1 {v0.8b}, [x0], #8
st1 {v1.8b}, [x0], #8
+ cmeq v0.4h, v0.4h, #0
+ cmeq v1.4h, v1.4h, #0
+ add v4.4h, v4.4h, v0.4h
+ add v4.4h, v4.4h, v1.4h
.endr
uaddlv s4, v4.4h
- fmov w12, s4
- mov w11, #16
- sub w0, w11, w12
+ umov w12, v4.h[0]
+ sxth w12, w12
+ add x0, x12, #16
ret
endfunc
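
As a rough illustration of why the two counting schemes in the diff agree, the following C sketch models both on a 4x4 block of 16-bit coefficients. It is a scalar approximation, not the NEON code: the function names are hypothetical, the copy to the destination is left out, and a single accumulator stands in for the four vector lanes.

```c
#include <stdint.h>

/* Scalar stand-in for CLZ on a 16-bit lane; returns 16 for a zero input. */
static unsigned clz16(uint16_t x) {
    unsigned n = 0;
    for (uint16_t bit = 0x8000; bit != 0 && !(x & bit); bit >>= 1)
        n++;
    return n;
}

/* Old scheme: clz, then ushr #4 gives 1 only for a zero lane; 16 - zeros. */
static int count_nonzero_clz(const int16_t coeff[16]) {
    int zeros = 0;
    for (int i = 0; i < 16; i++)
        zeros += (int)(clz16((uint16_t)coeff[i]) >> 4);
    return 16 - zeros;
}

/* New scheme: cmeq #0 gives -1 for a zero lane, 0 otherwise; sum + 16. */
static int count_nonzero_cmeq(const int16_t coeff[16]) {
    int acc = 0;
    for (int i = 0; i < 16; i++)
        acc += (coeff[i] == 0) ? -1 : 0;
    return acc + 16;
}
```

The compare result can be accumulated directly, so the shift and the separate add of two shifted vectors drop out of the loop. The final subtraction from 16 becomes an addition of 16.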
Before:
copy_cnt[4x4] 13.93x 7.50 104.56
copy_cnt[8x8] 31.20x 12.70 396.33
copy_cnt[16x16] 43.22x 36.00 1556.03
copy_cnt[32x32] 47.39x 129.34 6129.63
After:
copy_cnt[4x4] 14.76x 7.12 105.12
copy_cnt[8x8] 37.56x 10.60 398.25
copy_cnt[16x16] 52.57x 29.74 1563.60
copy_cnt[32x32] 62.22x 98.37 6120.29
> + xtn v0.8b, v0.8h
> + xtn2 v0.16b, v1.8h
> equal to
> tbl v0, {v0,v1}, v2
You are right. With this change I see a lot of improvement:
Before:
copy_sp[16x16] 85.13x 18.78 1599.19
copy_sp[32x32] 96.31x 65.07 6266.88
copy_sp[64x64] 98.81x 252.38 24937.40
[i422] copy_sp[16x32] 91.93x 34.32 3154.89
[i422] copy_sp[32x64] 99.54x 128.29 12769.10
After:
copy_sp[16x16] 96.23x 16.42 1579.74
copy_sp[32x32] 104.33x 57.84 6034.24
copy_sp[64x64] 110.79x 221.66 24558.72
[i422] copy_sp[16x32] 97.74x 31.89 3116.46
[i422] copy_sp[32x64] 111.37x 112.39 12517.52
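
To make the XTN/XTN2 versus TBL equivalence quoted above concrete, here is a small C model of both forms. It is a sketch under assumptions: the thread does not show the contents of v2, so the index vector 0, 2, 4, ..., 30 (the low byte of every halfword in little-endian lane order) is assumed, and the function names are hypothetical.

```c
#include <stdint.h>

/* xtn v0.8b, v0.8h then xtn2 v0.16b, v1.8h: keep the low byte of each lane. */
static void narrow_xtn(const uint16_t v0[8], const uint16_t v1[8],
                       uint8_t out[16]) {
    for (int i = 0; i < 8; i++) {
        out[i] = (uint8_t)(v0[i] & 0xFF);      /* xtn fills the low 8 bytes */
        out[8 + i] = (uint8_t)(v1[i] & 0xFF);  /* xtn2 fills the high 8 bytes */
    }
}

/* tbl v0, {v0, v1}, v2: byte lookup in the 32-byte table formed by v0:v1. */
static void narrow_tbl(const uint16_t v0[8], const uint16_t v1[8],
                       uint8_t out[16]) {
    uint8_t table[32];
    uint8_t idx[16];
    for (int i = 0; i < 8; i++) {
        /* Lay the halfwords out as bytes in little-endian order. */
        table[2 * i] = (uint8_t)(v0[i] & 0xFF);
        table[2 * i + 1] = (uint8_t)(v0[i] >> 8);
        table[16 + 2 * i] = (uint8_t)(v1[i] & 0xFF);
        table[16 + 2 * i + 1] = (uint8_t)(v1[i] >> 8);
    }
    for (int i = 0; i < 16; i++)
        idx[i] = (uint8_t)(2 * i);             /* assumed contents of v2 */
    for (int i = 0; i < 16; i++)
        out[i] = (idx[i] < 32) ? table[idx[i]] : 0;  /* out of range gives 0 */
}
```

With that index vector a single table lookup yields the same 16 bytes as the two narrowing instructions, at the cost of keeping the indices in a register.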
Please see the amended patch.
Thanks,
Sebastian