<head></head><body style="line-height: 1.5; font-size: 14px; color: rgba(38, 42, 51, 0.9); font-family: Source Han Sans;">
<div style="text-align: left;" data-mce-style="text-align: left;"><span style="font-size: 18px;" data-mce-style="font-size: 18px;">Hi, </span><br><span style="font-size: 18px;" data-mce-style="font-size: 18px;"> </span><br><span style="font-size: 18px;" data-mce-style="font-size: 18px;"> This patch series contains two AArch64 NEON changes for x265:</span></div><div><br><span style="font-size: 18px;" data-mce-style="font-size: 18px;"> Patch 1: AArch64: Optimize DCT kernels with stride-aware implementations</span><br><span style="font-size: 18px;" data-mce-style="font-size: 18px;"> The previous DCT implementation performed an unnecessary memcpy into a</span><br><span style="font-size: 18px;" data-mce-style="font-size: 18px;"> contiguous buffer before running transforms. This patch introduces</span><br><span style="font-size: 18px;" data-mce-style="font-size: 18px;"> stride-aware versions of all DCT butterfly kernels (4x4, 8x8, 16x16,</span><br><span style="font-size: 18px;" data-mce-style="font-size: 18px;"> 32x32) that load directly from the strided source, eliminating the</span><br><span style="font-size: 18px;" data-mce-style="font-size: 18px;"> intermediate buffer copy. It also adds HIGH_BIT_DEPTH support and</span><br><span style="font-size: 18px;" data-mce-style="font-size: 18px;"> replaces memcpy calls in intrapred with NEON-optimized helpers.</span><br><span style="font-size: 18px;" data-mce-style="font-size: 18px;"> Performance: memory operations for the 8x8 DCT are reduced by ~75%.</span></div><div><br><span style="font-size: 18px;" data-mce-style="font-size: 18px;"> Patch 2: AArch64: Fix 4x4 NEON memory overflow in intrapred helpers</span><br><span style="font-size: 18px;" data-mce-style="font-size: 18px;"> The 8-bit width=4 NEON copy helpers used vld1_u8/vst1_u8, which</span><br><span style="font-size: 18px;" data-mce-style="font-size: 18px;"> read/write 8 bytes instead of the required 4 bytes. 
In</span><br><span style="font-size: 18px;" data-mce-style="font-size: 18px;"> all_angs_pred_neon&lt;2&gt;(), 4x4 mode outputs are packed contiguously</span><br><span style="font-size: 18px;" data-mce-style="font-size: 18px;"> (16 bytes per mode), so writing 8 bytes per row overwrites adjacent</span><br><span style="font-size: 18px;" data-mce-style="font-size: 18px;"> mode buffers.</span><br><span style="font-size: 18px;" data-mce-style="font-size: 18px;"> Fixed by switching to scalar copy for exact 4-byte access, which also</span><br><span style="font-size: 18px;" data-mce-style="font-size: 18px;"> avoids strict aliasing UB from uint32_t* casts and potential alignment</span><br><span style="font-size: 18px;" data-mce-style="font-size: 18px;"> issues.</span><br><br><span style="font-size: 18px;" data-mce-style="font-size: 18px;"> Affected files:</span><br><span style="font-size: 18px;" data-mce-style="font-size: 18px;"> - source/common/aarch64/dct-prim.cpp</span><br><span style="font-size: 18px;" data-mce-style="font-size: 18px;"> - source/common/aarch64/dct-prim-sve.cpp</span><br><span style="font-size: 18px;" data-mce-style="font-size: 18px;"> - source/common/aarch64/intrapred-prim.cpp</span><br><br><span style="font-size: 18px;" data-mce-style="font-size: 18px;"> Best regards,</span><br><br><br><span style="font-size: 18px;" data-mce-style="font-size: 18px;">Wiki Deng</span><br><span style="font-size: 18px;" data-mce-style="font-size: 18px;">wiki.deng@hj-micro.com</span><br><br></div>
</body>