[x265] [arm64] Status and combined patch

Pop, Sebastian spop at amazon.com
Fri Jan 28 02:08:36 UTC 2022


Hi Min Chen,


Thank you for your review comments; they helped improve the performance of scanPosLast on arm64:


           primitive    speedup  arm64    C
           scanPosLast  5.46x    782.47   4275.92

I think I have addressed all the changes you requested, except for the two below:

> +    // get sign
> +    cmeq            v5.16b, v3.16b, #0  //      equal to zero
> +    mvn             v5.16b, v5.16b      // v5 = non-zero
> [MC] Why not replace cmeq+mvn by cmgt?

[SP] We cannot replace the sequence with cmgt.
cmgt #0 is "Compare signed Greater than zero", so it would only select positive values.
We need all non-zero values, i.e., both negative and positive values.
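To make the difference concrete, here is a minimal scalar C model of what each sequence does to one signed byte lane. The helper names (cmeq_zero, mvn, cmgt_zero, nonzero_masks) are hypothetical and only mimic the per-lane behaviour of the instructions; this is a sketch, not code from the patch.

```c
#include <stdint.h>
#include <stddef.h>

/* One byte lane of CMEQ Vd.16B, Vn.16B, #0: 0xFF if the lane is zero, else 0x00. */
static uint8_t cmeq_zero(int8_t x) { return x == 0 ? 0xFF : 0x00; }

/* One byte lane of MVN: bitwise NOT. */
static uint8_t mvn(uint8_t x) { return (uint8_t)~x; }

/* One byte lane of CMGT Vd.16B, Vn.16B, #0: signed "greater than zero". */
static uint8_t cmgt_zero(int8_t x) { return x > 0 ? 0xFF : 0x00; }

/* Applies both candidate sequences to a 16-lane vector.
   cmeq_mvn[i] is 0xFF for every non-zero lane (positive or negative);
   cmgt[i] is 0xFF only for strictly positive lanes. */
static void nonzero_masks(const int8_t v[16], uint8_t cmeq_mvn[16], uint8_t cmgt[16]) {
    for (size_t i = 0; i < 16; i++) {
        cmeq_mvn[i] = mvn(cmeq_zero(v[i]));
        cmgt[i] = cmgt_zero(v[i]);
    }
}
```

For a lane holding -2, the cmeq+mvn path yields 0xFF while cmgt #0 yields 0x00, so replacing the pair with cmgt would silently drop negative coefficients.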

> +    // val - w13 = pmovmskb(v3)
> +    and             v3.16b, v3.16b, v28.16b
> +    mov             d4, v3.d[1]
> +    addv            b13, v3.8b
> +    addv            b14, v4.8b
> [MC] ADDV support .16b?

[SP] I cannot use the .16b variant of ADDV.
The data in v3.16b is ANDed with a mask in v28.16b:
    and             v3.16b, v3.16b, v28.16b
The mask in v28 is:
.byte 0x1, 0x2, 0x4, 0x8, 0x10, 0x20, 0x40, 0x80, 0x1, 0x2, 0x4, 0x8, 0x10, 0x20, 0x40, 0x80
It assigns each byte lane the bit position it contributes to in the resulting mask.
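The following is a minimal scalar C sketch of how the AND with v28 plus two 8-byte ADDVs reproduces x86 pmovmskb, assuming every input lane is 0x00 or 0xFF (as produced by cmeq/mvn). The names kBitMask, addv_8b and movemask_16b are hypothetical, and the final combine of the two bytes into one 16-bit value is an assumption, since that step is not shown in the quoted hunk.

```c
#include <stdint.h>
#include <stddef.h>

/* Per-lane bit weights, matching the mask loaded into v28. */
static const uint8_t kBitMask[16] = {
    0x01, 0x02, 0x04, 0x08, 0x10, 0x20, 0x40, 0x80,
    0x01, 0x02, 0x04, 0x08, 0x10, 0x20, 0x40, 0x80
};

/* Scalar model of ADDV Bd, Vn.8B: sum of 8 byte lanes, truncated to 8 bits. */
static uint8_t addv_8b(const uint8_t lanes[8]) {
    unsigned sum = 0;
    for (size_t i = 0; i < 8; i++)
        sum += lanes[i];
    return (uint8_t)sum;
}

/* Emulates pmovmskb for lanes that are all 0x00 or 0xFF.
   Within each 8-byte half every lane has a distinct bit, so the sum
   never carries and equals the OR of the selected bits. */
static uint16_t movemask_16b(const uint8_t v[16]) {
    uint8_t masked[16];
    for (size_t i = 0; i < 16; i++)
        masked[i] = v[i] & kBitMask[i];   /* and v3.16b, v3.16b, v28.16b */
    uint8_t lo = addv_8b(&masked[0]);     /* addv b13, v3.8b */
    uint8_t hi = addv_8b(&masked[8]);     /* mov d4, v3.d[1]; addv b14, v4.8b */
    return (uint16_t)(lo | (hi << 8));    /* assumed combine step */
}
```

For example, with lanes 0, 9 and 15 set to 0xFF, the low half sums to 0x01 and the high half to 0x82, giving the mask 0x8201.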

To use ADDV .16b, I would need to encode the position of each byte in 16 bits instead of 8 bits, i.e., the mask would be:
.hword 0x1, 0x2, 0x4, 0x8, 0x10, 0x20, 0x40, 0x80, 0x100, 0x200, 0x400, 0x800, 0x1000, 0x2000, 0x4000, 0x8000
However, that would require the data to be in 16-bit vector elements, and a 128-bit NEON register holds only eight 16-bit lanes (.8h), half of the 16 lanes we need.
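To illustrate why a single ADDV over all 16 byte lanes cannot stand in for the two 8-byte reductions with the current mask, here is a hypothetical scalar model (addv_16b_masked is an illustrative name, not patch code). ADDV Bd, Vn.16B accumulates into one 8-bit result, so lanes i and i+8, which share the same weight, become indistinguishable, and carries can make two set lanes look like a different one.

```c
#include <stdint.h>
#include <stddef.h>

/* Same per-lane weights as v28: the pattern repeats for the upper 8 lanes. */
static const uint8_t kBitMask[16] = {
    0x01, 0x02, 0x04, 0x08, 0x10, 0x20, 0x40, 0x80,
    0x01, 0x02, 0x04, 0x08, 0x10, 0x20, 0x40, 0x80
};

/* Scalar model of AND with the mask followed by ADDV Bd, Vn.16B:
   all 16 masked byte lanes are summed into a single 8-bit result. */
static uint8_t addv_16b_masked(const uint8_t v[16]) {
    unsigned sum = 0;
    for (size_t i = 0; i < 16; i++)
        sum += v[i] & kBitMask[i];
    return (uint8_t)sum;   /* truncated to 8 bits, like the B destination */
}
```

With only lane 0 set and with only lane 8 set, both inputs reduce to 0x01; with lanes 0 and 8 set, the result is 0x02, identical to having only lane 1 set.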

Another option I considered is reducing the loop's vectorization factor from 16 to 8.
That would simplify the pmovmskb emulation, but the scalar code would be less efficient, because each iteration would process only half as many bytes.
Do you think I should try a vectorization factor of 8?

Thanks,
Sebastian
-------------- next part --------------
A non-text attachment was scrubbed...
Name: 0001-arm64-port-scanPosLast.patch
Type: text/x-patch
Size: 5556 bytes
Desc: 0001-arm64-port-scanPosLast.patch
URL: <http://mailman.videolan.org/pipermail/x265-devel/attachments/20220128/462536af/attachment.bin>


More information about the x265-devel mailing list