[x265] [arm64] port LUMA_VPP_4xN

Wed Jul 7 02:06:50 UTC 2021

Looks good for me.

There have some little improve, it may update in future version.
For example,

+    mov             w12, #32
+    dup             v16.4s, w12
Equal to
MOVI v16.4s,#32

We may get more performance by reorder compare & branch
+    cmp             x4, #0
+    b.eq            0f
+    cmp             x4, #1
+    b.eq            1f
+    cmp             x4, #2
+    b.eq            2f
+    cmp             x4, #3
+    b.eq            3f
+0:

At 2021-07-07 00:01:17, "Pop, Sebastian" <spop at amazon.com> wrote:

Thanks for your careful reviews.

I addressed the problems for eor and rodata.

Please see the attached patch.

Sebastian

From: x265-devel <x265-devel-bounces at videolan.org> on behalf of chen <chenm003 at 163.com>
Reply-To: Development for x265 <x265-devel at videolan.org>
Date: Friday, July 2, 2021 at 8:11 PM
To: Development for x265 <x265-devel at videolan.org>
Subject: RE: [EXTERNAL] [x265] [arm64] port LUMA_VPP_4xN

|

CAUTION: This email originated from outside of the organization. Do not click links or open attachments unless you can confirm the sender and know the content is safe.

|

Hi,

I put my comments inline. thanks.

btw: I found more improve on this patch.

+    eor             v17.16b, v17.16b, v17.16b

The clear register operator may replace by MOVI

At 2021-07-03 02:43:07, "Pop, Sebastian" <spop at amazon.com> wrote:

Hi,

thanks for your review.

> +#ifdef __MACH__

> +#   define MACH

> +#else

> +#   define MACH #

> This is not good idea to bypass .const_data

MACH uses ".const_data" directive, which is invalid for ELF.

For ELF the directive is ".rodata":

> ELF     .section        .rodata

> MACH    .const_data

[MC] I means you may declare MACH_RODATA so similar macro, it is empty on ELF but something on Macho, I guess it better than '#' to bypass unnecessary statement.

> +    ushll           v0.8h, v0.8b, #0

> ...

> +    mul             v16.8h, v0.8h, v24.8h

> Why not MULL?

That would not work for the rest of the computation.

Part of the data in v0 gets used in the next computation,

and then I would have to split mla into a mull + add.

[MC] This is depends on your algorithm, in your code

below, you combin row1 & row2 and multiplier

coeff[0], however, it also works with 8b x 8b

with UMULL.

However, it is a little complex algorithm,

so we can keep this version and improve in

future.

*** Code

> +    mul             v16.8h, v0.8h, v24.8h

> +    ext             v21.16b, v0.16b, v1.16b, #8

> +    mul             v17.8h, v21.8h, v24.8h

> +    mov             v0.16b, v1.16b

*** End

> +    orr             v0.16b, v1.16b, v1.16b

> This is equal to MOV, I guess compiler will replace to right instruction on ARM64

I replaced orr with mov instructions.

> +    // sum row[0-7]

> +    dup             v18.2d, v16.d[1]

> +    dup             v19.2d, v17.d[1]

> +    add             v16.4h, v16.4h, v18.4h

> +    add             v17.4h, v17.4h, v19.4h

> +    trn1            v16.2d, v16.2d, v17.2d

> How about ADDP?

I replaced the above 5 instructions with the following 3 and the performance improved.

    trn1            v20.2d, v16.2d, v17.2d

    trn2            v21.2d, v16.2d, v17.2d

    add             v16.8h, v20.8h, v21.8h

Please see attached the amended patch.

Thanks,

Sebastian
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://mailman.videolan.org/pipermail/x265-devel/attachments/20210707/d642488b/attachment.html>