[x265] [arm64] port LUMA_VPP_4xN
chen
chenm003 at 163.com
Wed Jul 7 02:06:50 UTC 2021
Looks good for me.
There have some little improve, it may update in future version.
For example,
+ mov w12, #32
+ dup v16.4s, w12
Equal to
MOVI v16.4s,#32
We may get more performance by reorder compare & branch
+ cmp x4, #0
+ b.eq 0f
+ cmp x4, #1
+ b.eq 1f
+ cmp x4, #2
+ b.eq 2f
+ cmp x4, #3
+ b.eq 3f
+0:
At 2021-07-07 00:01:17, "Pop, Sebastian" <spop at amazon.com> wrote:
Thanks for your careful reviews.
I addressed the problems for eor and rodata.
Please see the attached patch.
Sebastian
From: x265-devel <x265-devel-bounces at videolan.org> on behalf of chen <chenm003 at 163.com>
Reply-To: Development for x265 <x265-devel at videolan.org>
Date: Friday, July 2, 2021 at 8:11 PM
To: Development for x265 <x265-devel at videolan.org>
Subject: RE: [EXTERNAL] [x265] [arm64] port LUMA_VPP_4xN
|
CAUTION: This email originated from outside of the organization. Do not click links or open attachments unless you can confirm the sender and know the content is safe.
|
Hi,
I put my comments inline. thanks.
btw: I found more improve on this patch.
+ eor v17.16b, v17.16b, v17.16b
The clear register operator may replace by MOVI
At 2021-07-03 02:43:07, "Pop, Sebastian" <spop at amazon.com> wrote:
Hi,
thanks for your review.
> +#ifdef __MACH__
> +# define MACH
> +#else
> +# define MACH #
> This is not good idea to bypass .const_data
MACH uses ".const_data" directive, which is invalid for ELF.
For ELF the directive is ".rodata":
> ELF .section .rodata
> MACH .const_data
[MC] I means you may declare MACH_RODATA so similar macro, it is empty on ELF but something on Macho, I guess it better than '#' to bypass unnecessary statement.
> + ushll v0.8h, v0.8b, #0
> ...
> + mul v16.8h, v0.8h, v24.8h
> Why not MULL?
That would not work for the rest of the computation.
Part of the data in v0 gets used in the next computation,
and then I would have to split mla into a mull + add.
[MC] This is depends on your algorithm, in your code
below, you combin row1 & row2 and multiplier
coeff[0], however, it also works with 8b x 8b
with UMULL.
However, it is a little complex algorithm,
so we can keep this version and improve in
future.
*** Code
> + mul v16.8h, v0.8h, v24.8h
> + ext v21.16b, v0.16b, v1.16b, #8
> + mul v17.8h, v21.8h, v24.8h
> + mov v0.16b, v1.16b
*** End
> + orr v0.16b, v1.16b, v1.16b
> This is equal to MOV, I guess compiler will replace to right instruction on ARM64
I replaced orr with mov instructions.
> + // sum row[0-7]
> + dup v18.2d, v16.d[1]
> + dup v19.2d, v17.d[1]
> + add v16.4h, v16.4h, v18.4h
> + add v17.4h, v17.4h, v19.4h
> + trn1 v16.2d, v16.2d, v17.2d
> How about ADDP?
I replaced the above 5 instructions with the following 3 and the performance improved.
trn1 v20.2d, v16.2d, v17.2d
trn2 v21.2d, v16.2d, v17.2d
add v16.8h, v20.8h, v21.8h
Please see attached the amended patch.
Thanks,
Sebastian
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://mailman.videolan.org/pipermail/x265-devel/attachments/20210707/d642488b/attachment.html>
More information about the x265-devel
mailing list