[x265] [PATCH v3] AArch64: Stride-aware DCT optimization for NEON and SVE
chen
chenm003 at 163.com
Tue Apr 21 02:48:39 UTC 2026
Hi,
This revision looks good to me, thank you.
At 2026-04-21 09:07:20, "Wiki Deng" <wiki.deng at hj-micro.com> wrote:
Hi
This is v3 of the AArch64 DCT optimization patch. Changes from v2:
- Removed unused stride-aware functions (partialButterfly4_neon_stride,
partialButterfly8_neon_stride, fastForwardDst4_neon_stride,
pass1Butterfly8_sve)
- Restored brace style in memcpy loops to match existing codebase
Wiki Deng
wiki.deng at hj-micro.com
Original:
From:Wiki Deng <wiki.deng at hj-micro.com>
Date:2026-04-17 14:03:04 (China, GMT+08:00)
To:x265-devel <x265-devel at videolan.org>
Cc:x265Contributions <x265Contributions at multicorewareinc.com>
Subject:[PATCH v2] AArch64: Stride-aware DCT optimization for NEON and SVE
Hi
This is v2 of the AArch64 DCT optimization patch. Changes from v1:
- Dropped all intrapred-prim.cpp changes (memcpy is equivalent
to NEON intrinsics for compile-time known widths)
- Small DCT forward transforms (4×4, 8×8) retain memcpy path;
stride-aware only for 16×16 and 32×32
What it does
The DCT forward transforms (4×4, 8×8, 16×16, 32×32) operate in
two passes: pass 1 transforms rows into an intermediate flat
buffer, and pass 2 transforms columns to produce the final
coefficients. The encoder passes residual data in stride layout
(CU pitch), requiring a memcpy to flatten it before DCT.
This patch eliminates that intermediate copy for large transforms
by making pass 1 stride-aware — it loads rows directly from
src + row * srcStride. For small transforms (4×4, 8×8), memcpy
is retained since contiguous loads are more cache-friendly at
those sizes.
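The difference between the two pass-1 variants can be sketched in portable scalar C++. This is only an illustration, not the patch's code: rowButterfly, pass1Flattened and pass1StrideAware are hypothetical names, and the trivial sum/difference butterfly stands in for the real NEON/SVE DCT kernels.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical stand-in for a pass-1 row kernel: one sum/difference
// butterfly stage over a row of N samples (not the real DCT butterfly).
template <int N>
static void rowButterfly(const int16_t *row, int32_t *out)
{
    for (int i = 0; i < N / 2; i++)
    {
        out[i]         = row[i] + row[N - 1 - i];
        out[N / 2 + i] = row[i] - row[N - 1 - i];
    }
}

// memcpy path: flatten the strided block into a contiguous buffer first,
// then run pass 1 over the flat copy.
template <int N>
std::vector<int32_t> pass1Flattened(const int16_t *src, intptr_t srcStride)
{
    int16_t flat[N * N];
    for (int row = 0; row < N; row++)
    {
        std::memcpy(flat + row * N, src + row * srcStride, N * sizeof(int16_t));
    }

    std::vector<int32_t> dst(N * N);
    for (int row = 0; row < N; row++)
    {
        rowButterfly<N>(flat + row * N, dst.data() + row * N);
    }
    return dst;
}

// Stride-aware path: pass 1 reads each row directly from
// src + row * srcStride, so no intermediate copy is needed.
template <int N>
std::vector<int32_t> pass1StrideAware(const int16_t *src, intptr_t srcStride)
{
    std::vector<int32_t> dst(N * N);
    for (int row = 0; row < N; row++)
    {
        rowButterfly<N>(src + row * srcStride, dst.data() + row * N);
    }
    return dst;
}
```

Both functions produce the same intermediate buffer for any srcStride >= N; the stride-aware one simply skips the N row copies.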
Changes
- source/common/aarch64/dct-prim.cpp (+491, −20):
Adds *_neon_stride pass-1 kernels for 16×16 and 32×32 DCT.
DCT 4×4 and 8×8 retain memcpy.
- source/common/aarch64/dct-prim-sve.cpp (+104, −2):
Adds stride-aware SVE pass-1 kernels for dct{16,32}_sve.
dct8_sve retains memcpy.
Note: IDCT/IDST pass 2 already writes to stride layout natively,
so no changes were needed. The NEON 16×16 DCT continues using
the existing hand-tuned assembly (PFX(dct16_neon) from dct.S);
the SVE 16×16 path uses the stride-aware intrinsic.
Testing
- TestBench: PASS (NEON + SVE, zero mismatches)
- Bit-exact output: identical bitstreams before and after
End-to-end encoding performance (3840×2160, 100 frames, CRF 28,
16 cores pinned to NUMA node 0, median of 3 runs):
Preset        Before      After    Speedup
ultrafast      3.30s      3.25s     +1.5%
superfast      4.66s      4.54s     +2.6%
veryfast       7.22s      7.03s     +2.6%
faster         7.34s      7.26s     +1.1%
fast           8.57s      8.44s     +1.5%
medium        11.74s     11.23s     +4.3%
slow          31.35s     30.26s     +3.5%
slower       112.98s    110.07s     +2.6%
veryslow     217.67s    209.37s     +3.8%
placebo      376.74s    360.95s     +4.2%
All presets show a positive speedup. Presets from medium onward
generally benefit more (+2.6% to +4.3%, against +1.1% to +2.6% for
the faster presets), as DCT occupies a larger share of encoder
cycles when ME/RDO search is more thorough.
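The Speedup figures in the table are consistent with the relative reduction in wall-clock time, (Before - After) / Before. A minimal sketch for rechecking a row, with timeReductionPercent being a hypothetical helper name:

```cpp
// Relative reduction in wall-clock time, in percent.
// Assumption: this is how the Speedup column was derived; it matches
// every row of the table when rounded to one decimal place.
double timeReductionPercent(double beforeSec, double afterSec)
{
    return (beforeSec - afterSec) / beforeSec * 100.0;
}
```

For the medium preset, timeReductionPercent(11.74, 11.23) gives about 4.34, which rounds to the +4.3% listed.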
Environment
- AArch64, GCC 13.2.1
- SIMD: NEON, Neon_DotProd, Neon_I8MM, SVE, SVE2, SVE2_BitPerm
BR
Wiki Deng
wiki.deng at hj-micro.com
Original:
From:Wiki Deng <wiki.deng at hj-micro.com>
Date:2026-04-10 17:18:49 (China, GMT+08:00)
To:x265-devel <x265-devel at videolan.org>
Cc:x265Contributions <x265Contributions at multicorewareinc.com>
Subject:[PATCH 0/2] AArch64: DCT optimization and NEON intrapred fix
Hi,
This patch series contains two AArch64 NEON optimizations for x265:
Patch 1: AArch64: Optimize DCT kernels with stride-aware implementations
The previous DCT implementation performed an unnecessary memcpy into a
contiguous buffer before running transforms. This patch introduces
stride-aware versions of all DCT butterfly kernels (4x4, 8x8, 16x16,
32x32) that load directly from stride layout, eliminating the
intermediate buffer copy. It also adds HIGH_BIT_DEPTH support and
replaces memcpy calls in intrapred with NEON-optimized helpers.
Performance: DCT 8x8 memory operations reduced by ~75%.
Patch 2: AArch64: Fix 4x4 NEON memory overflow in intrapred helpers
The 8-bit width=4 NEON copy helpers used vld1_u8/vst1_u8 which
read/write 8 bytes instead of the required 4 bytes. In
all_angs_pred_neon<2>(), 4x4 mode outputs are packed contiguously
(16 bytes per mode), so writing 8 bytes per row overwrites adjacent
mode buffers.
Fixed by switching to scalar copy for exact 4-byte access, which also
avoids strict aliasing UB from uint32_t* casts and potential alignment
issues.
Affected files:
- source/common/aarch64/dct-prim.cpp
- source/common/aarch64/dct-prim-sve.cpp
- source/common/aarch64/intrapred-prim.cpp
Best regards,
Wiki Deng
wiki.deng at hj-micro.com