[x265] [arm64] port LUMA_VPP_4xN

Pop, Sebastian spop at amazon.com
Wed Jul 7 18:20:01 UTC 2021


Attached the amended patch with movi.
That improved performance, thanks!

I have seen the cmp/br pattern several times.
We can do the reordering tuning after all the interpolate functions are ported.

Sebastian

From: x265-devel <x265-devel-bounces at videolan.org> on behalf of chen <chenm003 at 163.com>
Reply-To: Development for x265 <x265-devel at videolan.org>
Date: Tuesday, July 6, 2021 at 9:10 PM
To: Development for x265 <x265-devel at videolan.org>
Subject: RE: [EXTERNAL] [x265] [arm64] port LUMA_VPP_4xN



Looks good to me.

There are some small improvements possible; they can go into a future version.
For example,

+    mov             w12, #32
+    dup             v16.4s, w12
is equivalent to
MOVI v16.4s,#32
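
As a lane-level illustration of why the two forms are interchangeable, here is a minimal sketch in portable C. It is not part of the patch; the type and function names (vec4s, mov_then_dup, movi_4s, same_vec4s) are hypothetical and only model what the instructions write to the register.

```c
#include <assert.h>
#include <stdint.h>

/* Models one 128-bit vector register viewed as 4 x 32-bit lanes (.4s). */
typedef struct { uint32_t s[4]; } vec4s;

/* mov w12, #imm ; dup v16.4s, w12
 * Two instructions: load a general register, then broadcast it. */
static vec4s mov_then_dup(uint32_t imm) {
    uint32_t w12 = imm;
    vec4s v;
    for (int i = 0; i < 4; i++) v.s[i] = w12;
    return v;
}

/* movi v16.4s, #imm8
 * One instruction: broadcast an 8-bit immediate to every 32-bit lane. */
static vec4s movi_4s(uint32_t imm8) {
    assert(imm8 <= 255);  /* the unshifted form only encodes 8 bits */
    vec4s v;
    for (int i = 0; i < 4; i++) v.s[i] = imm8;
    return v;
}

/* Returns 1 when every lane matches. */
static int same_vec4s(vec4s a, vec4s b) {
    for (int i = 0; i < 4; i++)
        if (a.s[i] != b.s[i]) return 0;
    return 1;
}
```

The constant 32 fits the 8-bit immediate, so the single MOVI produces the same register contents and also frees w12.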

We may also get more performance by reordering the compares and branches:
+    cmp             x4, #0
+    b.eq            0f
+    cmp             x4, #1
+    b.eq            1f
+    cmp             x4, #2
+    b.eq            2f
+    cmp             x4, #3
+    b.eq            3f
+0:


At 2021-07-07 00:01:17, "Pop, Sebastian" <spop at amazon.com> wrote:
Thanks for your careful reviews.
I addressed the problems for eor and rodata.
Please see the attached patch.

Sebastian

From: x265-devel <x265-devel-bounces at videolan.org> on behalf of chen <chenm003 at 163.com>
Reply-To: Development for x265 <x265-devel at videolan.org>
Date: Friday, July 2, 2021 at 8:11 PM
To: Development for x265 <x265-devel at videolan.org>
Subject: RE: [EXTERNAL] [x265] [arm64] port LUMA_VPP_4xN



Hi,

I put my comments inline. Thanks.

By the way, I found one more possible improvement in this patch:
+    eor             v17.16b, v17.16b, v17.16b
The register-clearing operation can be replaced by MOVI.

At 2021-07-03 02:43:07, "Pop, Sebastian" <spop at amazon.com> wrote:
Hi,
thanks for your review.

> +#ifdef __MACH__
> +#   define MACH
> +#else
> +#   define MACH #
> It is not a good idea to bypass .const_data this way

MACH uses the ".const_data" directive, which is invalid for ELF.
For ELF the directive is ".rodata":

> ELF     .section        .rodata
> MACH    .const_data

[MC] I mean you could declare a MACH_RODATA or similar macro that is empty on ELF and expands to something on Mach-O. I think that is better than using '#' to skip the unneeded statement.

> +    ushll           v0.8h, v0.8b, #0
> ...
> +    mul             v16.8h, v0.8h, v24.8h
> Why not MULL?

That would not work for the rest of the computation.
Part of the data in v0 gets used in the next computation,
and then I would have to split mla into a mull + add.

[MC] This depends on your algorithm. In your code
below, you combine row1 and row2 and multiply by
coeff[0]; however, it also works as 8b x 8b
with UMULL.
That is a slightly more complex algorithm, though,
so we can keep this version and improve it in the
future.
*** Code
> +    mul             v16.8h, v0.8h, v24.8h
> +    ext             v21.16b, v0.16b, v1.16b, #8
> +    mul             v17.8h, v21.8h, v24.8h
> +    mov             v0.16b, v1.16b
*** End
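
To make the UMULL remark concrete, here is a minimal sketch in portable C of the arithmetic only. It assumes the filter tap is non-negative and fits in 8 bits; a negative tap would need different handling, which this sketch does not cover. It also ignores the data-reuse concern raised above (v0 feeding the next step). All names are hypothetical and not from the patch.

```c
#include <assert.h>
#include <stdint.h>

typedef struct { uint8_t  b[8]; } vec8b;  /* 64-bit register as 8 x u8   */
typedef struct { uint16_t h[8]; } vec8h;  /* 128-bit register as 8 x u16 */

/* ushll vd.8h, vn.8b, #0 : zero-extend each byte to a halfword */
static vec8h ushll0(vec8b n) {
    vec8h d;
    for (int i = 0; i < 8; i++) d.h[i] = n.b[i];
    return d;
}

/* mul vd.8h, vn.8h, vm.8h : lane-wise product truncated to 16 bits */
static vec8h mul_8h(vec8h n, vec8h m) {
    vec8h d;
    for (int i = 0; i < 8; i++)
        d.h[i] = (uint16_t)((uint32_t)n.h[i] * m.h[i]);
    return d;
}

/* umull vd.8h, vn.8b, vm.8b : unsigned widening multiply, u8 x u8 -> u16 */
static vec8h umull_8b(vec8b n, vec8b m) {
    vec8h d;
    for (int i = 0; i < 8; i++)
        d.h[i] = (uint16_t)((uint32_t)n.b[i] * m.b[i]);
    return d;
}

/* Returns 1 when widen-then-multiply and UMULL give the same products. */
static int widen_mul_matches_umull(vec8b px, int coeff) {
    assert(coeff >= 0 && coeff <= 255);  /* assumption: non-negative 8-bit tap */
    vec8b c8;
    vec8h c16;
    for (int i = 0; i < 8; i++) {
        c8.b[i]  = (uint8_t)coeff;
        c16.h[i] = (uint16_t)coeff;
    }
    vec8h a = mul_8h(ushll0(px), c16);
    vec8h b = umull_8b(px, c8);
    for (int i = 0; i < 8; i++)
        if (a.h[i] != b.h[i]) return 0;
    return 1;
}
```

The largest possible product, 255 x 255 = 65025, still fits in 16 bits, which is why the two routes agree under that assumption.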



> +    orr             v0.16b, v1.16b, v1.16b
> This is equivalent to MOV; I guess the assembler will pick the right instruction on ARM64

I replaced orr with mov instructions.

> +    // sum row[0-7]
> +    dup             v18.2d, v16.d[1]
> +    dup             v19.2d, v17.d[1]
> +    add             v16.4h, v16.4h, v18.4h
> +    add             v17.4h, v17.4h, v19.4h
> +    trn1            v16.2d, v16.2d, v17.2d
> How about ADDP?

I replaced the above 5 instructions with the following 3 and the performance improved.

    trn1            v20.2d, v16.2d, v17.2d
    trn2            v21.2d, v16.2d, v17.2d
    add             v16.8h, v20.8h, v21.8h
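
The equivalence of the two sequences can be checked with a lane-level model. The following is a minimal sketch in portable C, not code from the patch; the names (vec8h_t, sum_halves_5insn, sum_halves_3insn, check_same) are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Models one 128-bit register as 8 x 16-bit lanes; lanes 0-3 are d[0],
 * lanes 4-7 are d[1]. */
typedef struct { uint16_t h[8]; } vec8h_t;

/* Original sequence: dup / dup / add.4h / add.4h / trn1.2d */
static vec8h_t sum_halves_5insn(vec8h_t v16, vec8h_t v17) {
    vec8h_t lo16 = {{0}}, lo17 = {{0}}, out;
    for (int i = 0; i < 4; i++) {
        /* dup vX.2d, vY.d[1] copies the high half down; add .4h then sums
         * the low half with it and zeroes the upper 64 bits. */
        lo16.h[i] = (uint16_t)(v16.h[i] + v16.h[4 + i]);
        lo17.h[i] = (uint16_t)(v17.h[i] + v17.h[4 + i]);
    }
    /* trn1 v16.2d : result = { v16.d[0], v17.d[0] } */
    for (int i = 0; i < 4; i++) {
        out.h[i]     = lo16.h[i];
        out.h[4 + i] = lo17.h[i];
    }
    return out;
}

/* Replacement sequence: trn1.2d / trn2.2d / add.8h */
static vec8h_t sum_halves_3insn(vec8h_t v16, vec8h_t v17) {
    vec8h_t v20, v21, out;
    for (int i = 0; i < 4; i++) {
        v20.h[i] = v16.h[i];      v20.h[4 + i] = v17.h[i];      /* trn1: low halves  */
        v21.h[i] = v16.h[4 + i];  v21.h[4 + i] = v17.h[4 + i];  /* trn2: high halves */
    }
    for (int i = 0; i < 8; i++)
        out.h[i] = (uint16_t)(v20.h[i] + v21.h[i]);             /* add .8h */
    return out;
}

/* Aborts if any lane differs between the two results. */
static void check_same(vec8h_t a, vec8h_t b) {
    for (int i = 0; i < 8; i++) assert(a.h[i] == b.h[i]);
}
```

Both routes add the low and high halves of each input and pack the two sums into one register; the shorter one does the two additions with a single full-width add.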

Please see attached the amended patch.

Thanks,
Sebastian
-------------- next part --------------
A non-text attachment was scrubbed...
Name: 0001-arm64-port-luma_vpp.patch
Type: application/octet-stream
Size: 19151 bytes
Desc: 0001-arm64-port-luma_vpp.patch
URL: <http://mailman.videolan.org/pipermail/x265-devel/attachments/20210707/aac7d7ec/attachment-0001.obj>

