[x265] [PATCH 0/2] AArch64: DCT optimization and NEON intrapred fix

Sun Apr 12 05:42:20 UTC 2026

Hi,

Patch 1 includes an optimized intra_pred_ang_neon. Since the width is a
compile-time constant, how does memcpy perform compared with the intrinsic functions?

For Patch 2, in copySingleRow_neon<4>, could we use memcpy or
*(uint32_t*)dst = *(uint32_t*)src
instead of the scalar copy?
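
To make the memcpy option concrete, here is a minimal sketch. copyRowMemcpy is a hypothetical name used only for illustration; it is not the patch's copySingleRow_neon. With a constant size, compilers typically lower memcpy to plain loads/stores (one 32-bit access for 4 bytes), though that depends on the compiler and flags, so it would need to be measured. Unlike the uint32_t* cast, it makes no alignment or aliasing assumptions.

```cpp
#include <cstdint>
#include <cstring>

// Hypothetical helper (not from the patch): copies exactly Width bytes
// of one 8-bit pixel row and touches nothing beyond them.
template<int Width>
inline void copyRowMemcpy(uint8_t *dst, const uint8_t *src)
{
    // Width is a compile-time constant, so the call can be inlined by the
    // compiler; dst and src may be unaligned and must not overlap.
    std::memcpy(dst, src, Width);
}
```

Usage would be copyRowMemcpy<4>(dst, src) once per row of the 4x4 block.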

At 2026-04-10 17:18:49, "Wiki Deng" <wiki.deng at hj-micro.com> wrote:

Hi,

  This patch series contains two AArch64 NEON optimizations for x265:

  Patch 1: AArch64: Optimize DCT kernels with stride-aware implementations
  The previous DCT implementation performed an unnecessary memcpy into a
  contiguous buffer before running transforms. This patch introduces
  stride-aware versions of all DCT butterfly kernels (4x4, 8x8, 16x16,
  32x32) that load directly from stride layout, eliminating the
  intermediate buffer copy. It also adds HIGH_BIT_DEPTH support and
  replaces memcpy calls in intrapred with NEON-optimized helpers.
  Performance: DCT 8x8 memory operations reduced by ~75%.
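
  The difference between the two shapes described for Patch 1 can be sketched in portable C++. This is an illustration only: the function names and the simplified first butterfly stage are assumptions, not code from the patch, and the real kernels use NEON intrinsics.

```cpp
#include <cstdint>
#include <cstring>

// Hypothetical illustration, not code from the patch.
constexpr int N = 4;

// First butterfly stage of a 4-point transform on one row of residuals.
inline void butterflyRow(const int16_t *s, int32_t *out)
{
    out[0] = s[0] + s[3];
    out[1] = s[1] + s[2];
    out[2] = s[0] - s[3];
    out[3] = s[1] - s[2];
}

// Copy-first shape: gather the strided block into a contiguous buffer,
// then run the butterflies on the copy.
void stageWithCopy(const int16_t *src, intptr_t stride, int32_t *out)
{
    int16_t tmp[N * N];
    for (int y = 0; y < N; y++)
        std::memcpy(tmp + y * N, src + y * stride, N * sizeof(int16_t));
    for (int y = 0; y < N; y++)
        butterflyRow(tmp + y * N, out + y * N);
}

// Stride-aware shape: each row is read directly at src + y * stride,
// so the intermediate buffer and its extra loads/stores disappear.
void stageStrided(const int16_t *src, intptr_t stride, int32_t *out)
{
    for (int y = 0; y < N; y++)
        butterflyRow(src + y * stride, out + y * N);
}
```

  Both functions produce the same output; only the number of memory passes differs.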

  Patch 2: AArch64: Fix 4x4 NEON memory overflow in intrapred helpers
  The 8-bit width=4 NEON copy helpers used vld1_u8/vst1_u8 which
  read/write 8 bytes instead of the required 4 bytes. In
  all_angs_pred_neon<2>(), 4x4 mode outputs are packed contiguously
  (16 bytes per mode), so writing 8 bytes per row overwrites adjacent
  mode buffers.
  Fixed by switching to scalar copy for exact 4-byte access, which also
  avoids strict aliasing UB from uint32_t* casts and potential alignment
  issues.
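
  The overwrite described for Patch 2 can be reproduced with a small portable model. This is a sketch under stated assumptions: writeMode and clobberedBytes are hypothetical names, memcpy of 8 bytes stands in for the 8-byte vector store, and the buffer is padded so the demo itself stays in bounds.

```cpp
#include <cstdint>
#include <cstring>
#include <vector>

constexpr int kBlock = 4;                     // 4x4 block
constexpr int kModeBytes = kBlock * kBlock;   // 16 bytes per mode, packed

// Hypothetical model: writes the 4 rows of one mode into the packed output,
// storing bytesPerStore bytes per row (8 mimics an 8-byte vector store,
// 4 is the exact row width).
void writeMode(uint8_t *out, int mode, uint8_t value, int bytesPerStore)
{
    uint8_t row[8];
    std::memset(row, value, sizeof(row));
    uint8_t *dst = out + mode * kModeBytes;
    for (int y = 0; y < kBlock; y++)
        std::memcpy(dst + y * kBlock, row, bytesPerStore);
}

// Counts how many bytes of mode 1's buffer change when only mode 0 is written.
int clobberedBytes(int bytesPerStore)
{
    std::vector<uint8_t> out(3 * kModeBytes, 0);  // padded so the demo stays in bounds
    writeMode(out.data(), 0, 0xFF, bytesPerStore);
    int n = 0;
    for (int i = kModeBytes; i < 2 * kModeBytes; i++)
        n += (out[i] != 0);
    return n;
}
```

  In this model the last row's 8-byte store covers bytes 12..19, so the first 4 bytes of the next mode are overwritten, while the 4-byte store leaves them untouched.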
  Affected files:
    - source/common/aarch64/dct-prim.cpp
    - source/common/aarch64/dct-prim-sve.cpp
    - source/common/aarch64/intrapred-prim.cpp
  Best regards,

Wiki Deng
wiki.deng at hj-micro.com
