[x265] [arm64] Status and combined patch

chen chenm003 at 163.com
Fri Jan 28 05:12:47 UTC 2022


Hi Sebastian,


Thank you for the further explanation. My comments are inline.



At 2022-01-28 10:08:36, "Pop, Sebastian" <spop at amazon.com> wrote:

Hi Min Chen,





Thank you for your review comments; they helped improve the performance of scanPosLast on arm64:





           scanPosLast  5.46x    782.47          4275.92

I think I addressed all the changes you requested with the exception of the two below:


> +    // get sign
> +    cmeq            v5.16b, v3.16b, #0  //      equal to zero
> +    mvn             v5.16b, v5.16b      // v5 = non-zero
> [MC] Why not replace cmeq+mvn by cmgt?


[SP] We cannot replace the sequence with cmgt.
cmgt #0 is "Compare signed Greater than zero".
cmgt #0 would only select positive values.
We need all non-zero values, i.e., negative and positive values.


[MC] This is my mistake: I forgot that CMGT #0 is a signed comparison. How about CMHI against a vector register that holds zeros?
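
To make the three comparisons concrete, here is a minimal scalar sketch in C of what each one computes on a single byte lane. It is only a model of the lane semantics, not code from the patch; the function names are made up for illustration, and the cast to int8_t assumes the usual two's-complement behaviour.

```c
#include <stdint.h>

/* Scalar models of one byte lane. All names here are illustrative
 * assumptions, not identifiers from the x265 patch. */

/* cmgt Vd.16b, Vn.16b, #0: all ones if the lane is > 0 as a SIGNED byte. */
static uint8_t lane_cmgt_zero(uint8_t x)
{
    return ((int8_t)x > 0) ? 0xFF : 0x00;
}

/* cmeq Vd.16b, Vn.16b, #0 followed by mvn: all ones if the lane is non-zero. */
static uint8_t lane_cmeq_mvn(uint8_t x)
{
    uint8_t is_zero = (x == 0) ? 0xFF : 0x00;
    return (uint8_t)~is_zero;
}

/* cmhi Vd.16b, Vn.16b, Vzero.16b: all ones if the lane is > 0 as an
 * UNSIGNED byte, where Vzero is a register holding zeros. */
static uint8_t lane_cmhi_zero(uint8_t x)
{
    return (x > 0) ? 0xFF : 0x00;
}

/* Number of byte values for which cmgt #0 differs from cmeq+mvn. */
int count_mismatch_cmgt(void)
{
    int n = 0;
    for (int v = 0; v < 256; v++)
        n += lane_cmgt_zero((uint8_t)v) != lane_cmeq_mvn((uint8_t)v);
    return n;
}

/* Number of byte values for which cmhi against zero differs from cmeq+mvn. */
int count_mismatch_cmhi(void)
{
    int n = 0;
    for (int v = 0; v < 256; v++)
        n += lane_cmhi_zero((uint8_t)v) != lane_cmeq_mvn((uint8_t)v);
    return n;
}
```

In this model cmgt #0 disagrees with cmeq+mvn on every negative byte (0x80 to 0xFF), while the unsigned compare against zero matches it for all 256 values.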


> +    // val - w13 = pmovmskb(v3)
> +    and             v3.16b, v3.16b, v28.16b
> +    mov             d4, v3.d[1]
> +    addv            b13, v3.8b
> +    addv            b14, v4.8b
> [MC] Does ADDV support .16b?


[SP] I cannot use the .16b variant of ADDV.
The data in v3.16b is ANDed with a mask in v28.16b:
    and             v3.16b, v3.16b, v28.16b
The mask in v28 is:
.byte 0x1, 0x2, 0x4, 0x8, 0x10, 0x20, 0x40, 0x80, 0x1, 0x2, 0x4, 0x8, 0x10, 0x20, 0x40, 0x80
This is used to select which byte gets counted in which position.
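
As a rough illustration of the point above, the following scalar C sketch models the masked ADDV sequence on 16 byte lanes, each assumed to be 0x00 or 0xFF. The function names are hypothetical, and combining the two 8-bit sums with a shift and OR is an assumption about the surrounding code, which is not quoted here.

```c
#include <stdint.h>

/* Bit weights as in v28: the same eight weights for each 8-byte half. */
static const uint8_t kMask[16] = {
    0x01, 0x02, 0x04, 0x08, 0x10, 0x20, 0x40, 0x80,
    0x01, 0x02, 0x04, 0x08, 0x10, 0x20, 0x40, 0x80
};

/* Model of: and v3, v3, v28 ; addv b13, v3.8b ; addv b14, v4.8b.
 * Each half is summed into its own 8-bit result. Joining the halves
 * as lo | (hi << 8) is an assumption made for this sketch. */
uint16_t movemask_two_addv(const uint8_t lanes[16])
{
    uint8_t lo = 0, hi = 0;
    for (int i = 0; i < 8; i++) {
        lo = (uint8_t)(lo + (lanes[i] & kMask[i]));
        hi = (uint8_t)(hi + (lanes[i + 8] & kMask[i + 8]));
    }
    return (uint16_t)(lo | (hi << 8));
}

/* Model of a single addv b, v.16b over the same masked data.
 * The result is still one 8-bit value, so lanes i and i + 8
 * contribute the same weight and cannot be told apart. */
uint8_t movemask_one_addv16(const uint8_t lanes[16])
{
    uint8_t sum = 0;
    for (int i = 0; i < 16; i++)
        sum = (uint8_t)(sum + (lanes[i] & kMask[i]));
    return sum;
}
```

With only lane 0 set, or only lane 8 set, the single 16-byte sum returns 0x01 in both cases, whereas the two 8-byte sums keep them apart as 0x0001 and 0x0100.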


To use an ADDV .16b I would need to encode the position of the bytes
in 16 bits instead of 8 bits, i.e., the mask would be:
.hword 0x1, 0x2, 0x4, 0x8, 0x10, 0x20, 0x40, 0x80, 0x100, 0x200, 0x400, 0x800, 0x1000, 0x2000, 0x4000, 0x8000
However, that would require the data to be in 16-bit vector elements, and a NEON vector holds only 8 of those (.8h), which is half of the 16 needed.


Another solution I was considering is to decrease the vector factor for the loop from 16 to 8.

That would simplify the code for pmovmskb, however the scalar code would be less efficient, as it would only deal with half the bytes.

Do you think I should try a lower vector factor of 8?



[MC] I have two alternative algorithms that use shll and ushl to reduce the number of addv instructions and to run on two parallel data paths, but they still take the same 7 instructions, so we can keep your current version here.


