[x265] [PATCH 0/2] AArch64: DCT optimization and NEON intrapred fix
chen
chenm003 at 163.com
Sun Apr 12 05:42:20 UTC 2026
Hi,
In Patch 1, which includes the optimized intra_pred_ang_neon, the width is constant.
How does the performance of memcpy compare with that of the intrinsic functions?
For Patch 2, in copySingleRow_neon<4>,
could we use memcpy or
*(uint32_t*)dst = *(uint32_t*)src
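For reference, the memcpy option could look roughly like the sketch below. The helper name copyRow4 is hypothetical and is not taken from the patch; the only assumption is an 8-bit row of exactly 4 pixels. A fixed-size memcpy has no alignment or aliasing requirements, and compilers commonly lower it to a single 32-bit load and store, though that should be checked in the generated code for the target.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Hypothetical helper (not from the patch): copy exactly 4 bytes of an 8-bit row.
static inline void copyRow4(std::uint8_t* dst, const std::uint8_t* src)
{
    assert(dst != nullptr && src != nullptr);
    // Fixed-size memcpy: touches exactly 4 bytes, with no alignment or
    // strict-aliasing requirement on either pointer.
    std::memcpy(dst, src, 4);
}
```

Unlike the uint32_t* cast, this form stays well defined when dst or src is not 4-byte aligned.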
On 2026-04-10 17:18:49, "Wiki Deng" <wiki.deng at hj-micro.com> wrote:
Hi,
This patch series contains two AArch64 NEON optimizations for x265:
Patch 1: AArch64: Optimize DCT kernels with stride-aware implementations
The previous DCT implementation performed an unnecessary memcpy into a
contiguous buffer before running transforms. This patch introduces
stride-aware versions of all DCT butterfly kernels (4x4, 8x8, 16x16,
32x32) that load directly from the strided layout, eliminating the
intermediate buffer copy. It also adds HIGH_BIT_DEPTH support and
replaces memcpy calls in intrapred with NEON-optimized helpers.
Performance: DCT 8x8 memory operations reduced by ~75%.
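As a rough scalar illustration of the idea (not the NEON code from the patch), the sketch below compares a copy-first first stage with a stride-aware one for a 4x4 block. All function names are hypothetical, and only the first butterfly stage of a 4-point transform is shown.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// First butterfly stage on one row of 4 samples: two sums, two differences.
static void butterflyRow4(const std::int16_t* r, std::int32_t* out)
{
    out[0] = r[0] + r[3];
    out[1] = r[1] + r[2];
    out[2] = r[0] - r[3];
    out[3] = r[1] - r[2];
}

// Copy-first variant: gather the strided rows into a contiguous 4x4 block.
static void stage1WithCopy(const std::int16_t* src, std::ptrdiff_t stride,
                           std::int32_t* out)
{
    assert(stride >= 4);
    std::int16_t block[16];
    for (int y = 0; y < 4; y++)
        std::memcpy(block + y * 4, src + y * stride, 4 * sizeof(std::int16_t));
    for (int y = 0; y < 4; y++)
        butterflyRow4(block + y * 4, out + y * 4);
}

// Stride-aware variant: read each row in place, with no intermediate buffer.
static void stage1Strided(const std::int16_t* src, std::ptrdiff_t stride,
                          std::int32_t* out)
{
    assert(stride >= 4);
    for (int y = 0; y < 4; y++)
        butterflyRow4(src + y * stride, out + y * 4);
}
```

Both variants produce the same output; the second simply skips the store and reload of the temporary block.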
Patch 2: AArch64: Fix 4x4 NEON memory overflow in intrapred helpers
The 8-bit width=4 NEON copy helpers used vld1_u8/vst1_u8 which
read/write 8 bytes instead of the required 4 bytes. In
all_angs_pred_neon<2>(), 4x4 mode outputs are packed contiguously
(16 bytes per mode), so writing 8 bytes per row overwrites adjacent
mode buffers.
Fixed by switching to scalar copy for exact 4-byte access, which also
avoids strict aliasing UB from uint32_t* casts and potential alignment
issues.
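The overflow can be reproduced without NEON by modelling the store width with memcpy. In the sketch below, writeMode and the constants are hypothetical names, and the packed layout (16 bytes per 4x4 mode) follows the description above; bytesPerRow = 8 stands in for an 8-byte vst1_u8 store and bytesPerRow = 4 for the exact copy.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

constexpr int kSize = 4;                   // 4x4 block
constexpr int kModeBytes = kSize * kSize;  // 16 bytes per packed mode

// Fill one mode's 4x4 slot with `value`, storing `bytesPerRow` bytes per row.
// The caller must leave at least 4 spare bytes after the slot when
// bytesPerRow is 8, since the last row then writes past the slot.
static void writeMode(std::uint8_t* out, int mode, std::uint8_t value,
                      int bytesPerRow)
{
    assert(bytesPerRow == 4 || bytesPerRow == 8);
    std::uint8_t row[8];
    std::memset(row, value, sizeof(row));
    std::uint8_t* dst = out + mode * kModeBytes;
    for (int y = 0; y < kSize; y++)
        std::memcpy(dst + y * kSize, row, bytesPerRow);
}
```

With 8-byte stores, rows 0 to 2 spill into the next row of the same mode, which is rewritten afterwards, but row 3 spills 4 bytes into the first row of the following mode. Whether that corruption survives depends on the order in which modes are written, which this sketch does not try to reproduce.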
Affected files:
- source/common/aarch64/dct-prim.cpp
- source/common/aarch64/dct-prim-sve.cpp
- source/common/aarch64/intrapred-prim.cpp
Best regards,
Wiki Deng
wiki.deng at hj-micro.com