<div data-ntes="ntes_mail_body_root" style="line-height:1.7;color:#000000;font-size:14px;font-family:Arial"><div id="spnEditorContent"><div style="margin: 0;">Hi,</div><div style="margin: 0;"><br>Patch 1 includes an optimized <b>intra_pred_ang_neon</b> in which the width is constant.<br>How does the performance of memcpy compare with that of the intrinsic functions?</div><div style="margin: 0;"><br></div><div style="margin: 0;">For Patch 2, in copySingleRow_neon&lt;4&gt;,<br>could we use memcpy or<br>*(uint32_t*)dst = *(uint32_t*)src?</div><div style="margin: 0;"><br></div><div style="margin: 0;"><br></div></div><div style="position:relative;zoom:1"></div><div id="divNeteaseMailCard"></div><div style="margin: 0;"><br></div><p> 2026-04-10 17:18:49, "Wiki Deng" &lt;wiki.deng@hj-micro.com&gt; </p><blockquote id="isReplyContent" style="PADDING-LEFT: 1ex; MARGIN: 0px 0px 0px 0.8ex; BORDER-LEFT: #ccc 1px solid"><div style="line-height: 1.5; font-size: 14px; color: rgba(38, 42, 51, 0.9); font-family: Source Han Sans;">
<div style="text-align: left;" data-mce-style="text-align: left;"><span style="font-size: 18px;" data-mce-style="font-size: 18px;">Hi, </span><br><span style="font-size: 18px;" data-mce-style="font-size: 18px;"> </span><br><span style="font-size: 18px;" data-mce-style="font-size: 18px;"> This patch series contains two AArch64 NEON optimizations for x265:</span></div><div><br><span style="font-size: 18px;" data-mce-style="font-size: 18px;"> Patch 1: AArch64: Optimize DCT kernels with stride-aware implementations</span><br><span style="font-size: 18px;" data-mce-style="font-size: 18px;"> The previous DCT implementation performed an unnecessary memcpy into a</span><br><span style="font-size: 18px;" data-mce-style="font-size: 18px;"> contiguous buffer before running transforms. This patch introduces</span><br><span style="font-size: 18px;" data-mce-style="font-size: 18px;"> stride-aware versions of all DCT butterfly kernels (4x4, 8x8, 16x16,</span><br><span style="font-size: 18px;" data-mce-style="font-size: 18px;"> 32x32) that load directly from the strided layout, eliminating the</span><br><span style="font-size: 18px;" data-mce-style="font-size: 18px;"> intermediate buffer copy. It also adds HIGH_BIT_DEPTH support and</span><br><span style="font-size: 18px;" data-mce-style="font-size: 18px;"> replaces memcpy calls in intrapred with NEON-optimized helpers.</span><br><span style="font-size: 18px;" data-mce-style="font-size: 18px;"> Performance: DCT 8x8 memory operations reduced by ~75%.</span></div><div><br><span style="font-size: 18px;" data-mce-style="font-size: 18px;"> Patch 2: AArch64: Fix 4x4 NEON memory overflow in intrapred helpers</span><br><span style="font-size: 18px;" data-mce-style="font-size: 18px;"> The 8-bit width=4 NEON copy helpers used vld1_u8/vst1_u8, which</span><br><span style="font-size: 18px;" data-mce-style="font-size: 18px;"> read/write 8 bytes instead of the required 4 bytes. In</span><br><span style="font-size: 18px;" data-mce-style="font-size: 18px;"> all_angs_pred_neon&lt;2&gt;(), 4x4 mode outputs are packed contiguously</span><br><span style="font-size: 18px;" data-mce-style="font-size: 18px;"> (16 bytes per mode), so writing 8 bytes per row overwrites adjacent</span><br><span style="font-size: 18px;" data-mce-style="font-size: 18px;"> mode buffers.</span><br><span style="font-size: 18px;" data-mce-style="font-size: 18px;"> Fixed by switching to a scalar copy for exact 4-byte access, which also</span><br><span style="font-size: 18px;" data-mce-style="font-size: 18px;"> avoids strict aliasing UB from uint32_t* casts and potential alignment</span><br><span style="font-size: 18px;" data-mce-style="font-size: 18px;"> issues.</span><br><span style="font-size: 18px;" data-mce-style="font-size: 18px;"> Affected files:</span><br><span style="font-size: 18px;" data-mce-style="font-size: 18px;"> - source/common/aarch64/dct-prim.cpp</span><br><span style="font-size: 18px;" data-mce-style="font-size: 18px;"> - source/common/aarch64/dct-prim-sve.cpp</span><br><span style="font-size: 18px;" data-mce-style="font-size: 18px;"> - source/common/aarch64/intrapred-prim.cpp</span><br><span style="font-size: 18px;" data-mce-style="font-size: 18px;"> Best regards,</span><br><br><br><span style="font-size: 18px;" data-mce-style="font-size: 18px;">Wiki Deng</span><br><span style="font-size: 18px;" data-mce-style="font-size: 18px;">wiki.deng@hj-micro.com</span><br><br></div>
</div></blockquote></div>
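<p>For reference, the exact 4-byte row copy discussed for Patch 2 could be sketched as below. This is only an illustration, not the x265 source: the names <b>copy_row4</b> and <b>fill_block4x4</b> are hypothetical, and the 16-bytes-per-mode packing is taken from the patch description. A fixed-size memcpy has no aliasing or alignment requirements, and compilers commonly lower it to a single 32-bit load and store, though that should be checked in the generated code.</p>

```cpp
#include <cstdint>
#include <cstring>

// Hypothetical helper (not the x265 implementation): copy exactly one
// 4-pixel row of 8-bit samples. A constant-size memcpy avoids the
// strict-aliasing and alignment concerns of a uint32_t* cast.
static inline void copy_row4(uint8_t *dst, const uint8_t *src)
{
    std::memcpy(dst, src, 4);
}

// Hypothetical helper: fill one packed 4x4 block (16 contiguous bytes)
// by repeating a 4-pixel row. Each row writes 4 bytes, so nothing past
// the 16-byte block is touched.
static void fill_block4x4(uint8_t *block, const uint8_t *row)
{
    for (int y = 0; y < 4; y++)
        copy_row4(block + 4 * y, row);
}
```

<p>Because each call writes only 4 bytes, the last row of one mode's block cannot spill into the next mode's block, which is the overwrite the 8-byte vst1_u8 store caused.</p>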