[x265] [PATCH 0/8] AArch64: Clean up and optimize block copy primitives

chen chenm003 at 163.com
Tue May 20 14:40:28 UTC 2025


Hi Li,




Thanks for the details; we can keep the current method.




Regards,
Min

At 2025-05-20 17:42:58, "Li Zhang" <Li.Zhang2 at arm.com> wrote:

Hi Chen,

 

Thanks for the comment.

 

The optimization guide recommends LDP + STP for memory copy loops.

Older compilers sometimes struggle to generate optimal code from the vld1q_<x>_x2 intrinsics.

Using two separate vld1q_<x> loads is the approach most likely to get most compilers to generate optimal code (LDP + STP).
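
For illustration only, here is a minimal sketch of the two forms being compared, assuming a single 16-element row of int16_t. The helper names (copy16_two_loads, copy16_x2, copy_variants_match, self_check) are hypothetical and do not come from the patches. The non-AArch64 branch is a plain memcpy fallback so the sketch also builds on other targets. Whether either form actually compiles to LDP + STP depends on the compiler and its version, so the generated assembly should be checked rather than assumed.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>
#if defined(__aarch64__)
#include <arm_neon.h>
#endif

// Hypothetical helper: copy 16 int16_t values using two separate
// 128-bit loads and two separate 128-bit stores.
static inline void copy16_two_loads(int16_t *dst, const int16_t *src)
{
#if defined(__aarch64__)
    int16x8_t a0 = vld1q_s16(src + 0);
    int16x8_t a1 = vld1q_s16(src + 8);
    vst1q_s16(dst + 0, a0);
    vst1q_s16(dst + 8, a1);
#else
    // Assumption: portable fallback, only so the sketch builds off AArch64.
    std::memcpy(dst, src, 16 * sizeof(int16_t));
#endif
}

// Hypothetical helper: the same copy using the paired-register (_x2)
// intrinsics. Some older toolchains may not provide these at all.
static inline void copy16_x2(int16_t *dst, const int16_t *src)
{
#if defined(__aarch64__)
    int16x8x2_t a = vld1q_s16_x2(src);
    vst1q_s16_x2(dst, a);
#else
    std::memcpy(dst, src, 16 * sizeof(int16_t));
#endif
}

// Returns true if both variants reproduce the source row exactly.
static bool copy_variants_match()
{
    int16_t src[16];
    int16_t d0[16] = {0};
    int16_t d1[16] = {0};
    for (int i = 0; i < 16; i++)
        src[i] = static_cast<int16_t>(i * 3 - 7);

    copy16_two_loads(d0, src);
    copy16_x2(d1, src);

    return std::memcmp(d0, src, sizeof(src)) == 0 &&
           std::memcmp(d1, src, sizeof(src)) == 0;
}

// Functional check only; it says nothing about relative performance.
static void self_check()
{
    assert(copy_variants_match());
}
```

Both helpers are functionally equivalent, so any difference would only show up in the generated instructions and in timing on a given compiler and core.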

 

Regards,

Li

 

From: chen <chenm003 at 163.com>
Date: Tuesday, 2025. May 20. at 5:18
To: Development for x265 <x265-devel at videolan.org>
Cc: nd <nd at arm.com>, Li Zhang <Li.Zhang2 at arm.com>
Subject: Re:[x265] [PATCH 0/8] AArch64: Clean up and optimize block copy primitives

Hi Li,

 

Thanks for the improvement patches.

They look good to me; just one small comment below.

 

In most of the functions:
+    int16x8_t a0 = vld1q_s16(src + w + 0);
+    int16x8_t a1 = vld1q_s16(src + w + 8);

How does the performance compare to vld1q_s16_x2?

 

Regards,
Chen
 
At 2025-05-20 00:41:39, "Li Zhang" <li.zhang2 at arm.com> wrote:
>Hello,
> 
>This patch series optimizes and implements several AArch64 block copy
>primitives using Neon intrinsics. It also cleans up and removes the Neon
>and SVE assembly implementations that are either slower or offer no
>performance benefit.
> 
>Many thanks,
>Li
> 
>Li Zhang (8):
>  AArch64: Optimize blockcopy_pp_neon intrinsics implementation
>  AArch64: Optimize blockcopy_ps Neon intrinsics implementation
>  AArch64: Implement blockcopy_ss primitives using Neon intrinsics
>  AArch64: Implement blockcopy_sp primitives using Neon intrinsics
>  AArch64: Optimize cpy1Dto2D_shl Neon intrinsics implementation
>  AArch64: Optimize cpy2Dto1D_shl Neon intrinsics implementation
>  AArch64: Implement cpy2Dto1D_shr using Neon intrinsics
>  AArch64: Implement cpy1Dto2D_shr using Neon intrinsics
> 
> source/common/CMakeLists.txt              |    2 +-
> source/common/aarch64/asm-primitives.cpp  |  180 ---
> source/common/aarch64/blockcopy8-common.S |   54 -
> source/common/aarch64/blockcopy8-sve.S    | 1346 ---------------------
> source/common/aarch64/blockcopy8.S        | 1049 ----------------
> source/common/aarch64/pixel-prim.cpp      |  358 +++++-
> 6 files changed, 305 insertions(+), 2684 deletions(-)
> delete mode 100644 source/common/aarch64/blockcopy8-common.S
> 
>--
>2.39.5 (Apple Git-154)
> 
>_______________________________________________
>x265-devel mailing list
>x265-devel at videolan.org
>https://mailman.videolan.org/listinfo/x265-devel