[x265] [PATCH v3] AArch64: Stride-aware DCT optimization for NEON and SVE

Wiki Deng wiki.deng at hj-micro.com
Tue Apr 21 01:07:20 UTC 2026


Hi,

This is v3 of the AArch64 DCT optimization patch. Changes from v2:
  - Removed unused stride-aware functions (partialButterfly4_neon_stride,
    partialButterfly8_neon_stride, fastForwardDst4_neon_stride,
    pass1Butterfly8_sve)
  - Restored brace style in memcpy loops to match the existing codebase








Wiki Deng
wiki.deng at hj-micro.com










Original:
From: Wiki Deng <wiki.deng at hj-micro.com>
Date: 2026-04-17 14:03:04 (China, GMT+08:00)
To: x265-devel <x265-devel at videolan.org>
Cc: x265Contributions <x265Contributions at multicorewareinc.com>
Subject: [PATCH v2] AArch64: Stride-aware DCT optimization for NEON and SVE

Hi
                                                                                                         
  This is v2 of the AArch64 DCT optimization patch. Changes from v1:                                     
  - Dropped all intrapred-prim.cpp changes (memcpy is equivalent                          
    to NEON intrinsics for compile-time known widths)                                                    
  - Small DCT forward transforms (4×4, 8×8) retain memcpy path;                                          
    stride-aware only for 16×16 and 32×32                                                                
                                                                                                         
  What it does                                                                                           
                                                                                                         
  The DCT forward transforms (4×4, 8×8, 16×16, 32×32) operate in                                         
  two passes: pass 1 transforms rows into an intermediate flat                            
  buffer, and pass 2 transforms columns to produce the final                                             
  coefficients. The encoder passes residual data in stride layout                                        
  (CU pitch), requiring a memcpy to flatten it before DCT.                                               
                                                                                                         
  This patch eliminates that intermediate copy for large transforms                                      
  by making pass 1 stride-aware — it loads rows directly from                                            
  src + row * srcStride. For small transforms (4×4, 8×8), memcpy                                         
  is retained since contiguous loads are more cache-friendly at                                          
  those sizes.                                                                                           
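
To illustrate the difference between the two paths, here is a minimal scalar sketch. It is not the NEON/SVE code from the patch: `rowKernel`, `pass1_flatten`, `pass1_stride`, and `pathsAgree16` are hypothetical names, and the row kernel is a single butterfly stage standing in for the real pass-1 transform.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical stand-in for a pass-1 row kernel: one butterfly stage
// (sums in the first half, differences in the second). Not the real DCT.
template<int N>
static void rowKernel(const int16_t *row, int16_t *out)
{
    for (int i = 0; i < N / 2; i++)
    {
        out[i]         = (int16_t)(row[i] + row[N - 1 - i]);
        out[N / 2 + i] = (int16_t)(row[i] - row[N - 1 - i]);
    }
}

// Memcpy path: flatten the strided block first, then run pass 1 on the copy.
template<int N>
static void pass1_flatten(const int16_t *src, intptr_t srcStride, int16_t *dst)
{
    int16_t block[N * N];
    for (int r = 0; r < N; r++)
    {
        memcpy(&block[r * N], &src[r * srcStride], N * sizeof(int16_t));
    }
    for (int r = 0; r < N; r++)
    {
        rowKernel<N>(&block[r * N], &dst[r * N]);
    }
}

// Stride-aware path: each row is read directly from src + row * srcStride,
// so the intermediate flat copy of the input is not needed.
template<int N>
static void pass1_stride(const int16_t *src, intptr_t srcStride, int16_t *dst)
{
    for (int r = 0; r < N; r++)
    {
        rowKernel<N>(src + r * srcStride, &dst[r * N]);
    }
}

// Runs both paths on the same 16x16 block and reports whether they agree.
static bool pathsAgree16(intptr_t srcStride)
{
    const int N = 16;
    std::vector<int16_t> src((size_t)N * (size_t)srcStride);
    for (size_t i = 0; i < src.size(); i++)
    {
        src[i] = (int16_t)((int)((i * 37) % 255) - 127);
    }
    int16_t a[N * N], b[N * N];
    pass1_flatten<N>(src.data(), srcStride, a);
    pass1_stride<N>(src.data(), srcStride, b);
    return memcmp(a, b, sizeof(a)) == 0;
}
```

Calling `pathsAgree16(64)` compares the two paths for a 16×16 block inside a 64-sample-wide buffer. The outputs are the same by construction; only the memory traffic differs.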
                                                                                                         
  Changes                                                                                                
                                                                                                         
  - source/common/aarch64/dct-prim.cpp (+491, −20):                                                      
    Adds *_neon_stride pass-1 kernels for 16×16 and 32×32 DCT.                            
    DCT 4×4 and 8×8 retain memcpy.                                                                       
  - source/common/aarch64/dct-prim-sve.cpp (+104, −2):                                                   
    Adds stride-aware SVE pass-1 kernels for dct{16,32}_sve.                                             
    dct8_sve retains memcpy.                                                                             
                                                                                                         
  Note: IDCT/IDST pass 2 already writes to stride layout natively,                                       
  so no changes were needed. The NEON 16×16 DCT continues using                                          
  the existing hand-tuned assembly (PFX(dct16_neon) from dct.S);                                         
  the SVE 16×16 path uses the stride-aware intrinsic.                                                    
                                                                                                         
  Testing                                                                                                
                                                                                                         
  - TestBench: PASS (NEON + SVE, zero mismatches)                                                        
  - Bit-exact output: identical bitstreams before and after                               
                                                                                                         
  End-to-end encoding performance (3840×2160, 100 frames, CRF 28,                                        
  16 cores pinned to NUMA node 0, median of 3 runs):                                                     
                                                                                                         
    Preset     Before    After     Speedup                                                               
    ultrafast  3.30s     3.25s     +1.5%                                                                 
    superfast  4.66s     4.54s     +2.6%                                                                 
    veryfast   7.22s     7.03s     +2.6%                                                                 
    faster     7.34s     7.26s     +1.1%                                                                 
    fast       8.57s     8.44s     +1.5%                                                                 
    medium     11.74s    11.23s    +4.3%                                                                 
    slow       31.35s    30.26s    +3.5%                                                                 
    slower     112.98s   110.07s   +2.6%                                                                 
    veryslow   217.67s   209.37s   +3.8%                                                                 
    placebo    376.74s   360.95s   +4.2%                                                                 
                                                                                                         
  All presets show positive speedup. Presets from medium upward
  generally benefit more (+2.6% to +4.3%) than the faster ones
  (+1.1% to +2.6%), as DCT occupies a larger share of encoder cycles
  when ME/RDO search is more thorough.
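
The Speedup column appears to be the relative reduction in wall-clock time, (Before − After) / Before. This is inferred from the numbers, not stated in the message; the veryfast row distinguishes it from a throughput ratio (Before / After − 1), which would give about +2.7%. A small sketch of that check, with hypothetical helper names:

```cpp
#include <cassert>
#include <cmath>

// Relative reduction in encode time, in percent.
static double timeReductionPct(double before, double after)
{
    return (before - after) / before * 100.0;
}

// Alternative definition: throughput gain, in percent.
static double throughputGainPct(double before, double after)
{
    return (before / after - 1.0) * 100.0;
}

// True if the time-reduction formula reproduces a table entry that was
// rounded to one decimal place.
static bool matchesTable(double before, double after, double reportedPct)
{
    return std::fabs(timeReductionPct(before, after) - reportedPct) < 0.05;
}
```

For example, `matchesTable(7.22, 7.03, 2.6)` holds for the veryfast row, while `throughputGainPct(7.22, 7.03)` is roughly 2.70.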
                                                                                                         
  Environment                                                                                            
                                                                                                         
  - AArch64, GCC 13.2.1
  - SIMD: NEON, Neon_DotProd, Neon_I8MM, SVE, SVE2, SVE2_BitPerm                          
                                                                                                         

BR


Wiki Deng
wiki.deng at hj-micro.com










Original:
From: Wiki Deng <wiki.deng at hj-micro.com>
Date: 2026-04-10 17:18:49 (China, GMT+08:00)
To: x265-devel <x265-devel at videolan.org>
Cc: x265Contributions <x265Contributions at multicorewareinc.com>
Subject: [PATCH 0/2] AArch64: DCT optimization and NEON intrapred fix

Hi,
                       
  This patch series contains two AArch64 NEON optimizations for x265:

  Patch 1: AArch64: Optimize DCT kernels with stride-aware implementations
  The previous DCT implementation performed an unnecessary memcpy into a
  contiguous buffer before running transforms. This patch introduces
  stride-aware versions of all DCT butterfly kernels (4x4, 8x8, 16x16,
  32x32) that load directly from stride layout, eliminating the
  intermediate buffer copy. It also adds HIGH_BIT_DEPTH support and
  replaces memcpy calls in intrapred with NEON-optimized helpers.
  Performance: DCT 8x8 memory operations reduced by ~75%.

  Patch 2: AArch64: Fix 4x4 NEON memory overflow in intrapred helpers
  The 8-bit width=4 NEON copy helpers used vld1_u8/vst1_u8 which
  read/write 8 bytes instead of the required 4 bytes. In
  all_angs_pred_neon<2>(), 4x4 mode outputs are packed contiguously
  (16 bytes per mode), so writing 8 bytes per row overwrites adjacent
  mode buffers.
  Fixed by switching to scalar copy for exact 4-byte access, which also
  avoids strict aliasing UB from uint32_t* casts and potential alignment
  issues.
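
A minimal portable sketch of the exact-width copy described above, assuming 4x4 outputs packed at 16 bytes per mode. `copyRow4`, `pack4x4`, and `neighbourIntact` are hypothetical names, not the helpers in intrapred-prim.cpp:

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Copy one row of four 8-bit samples with scalar stores. Exactly 4 bytes
// are written; an 8-byte vld1_u8/vst1_u8 pair would also touch dst[4..7].
static inline void copyRow4(uint8_t *dst, const uint8_t *src)
{
    for (int i = 0; i < 4; i++)
    {
        dst[i] = src[i];
    }
}

// Pack a 4x4 block whose rows are srcStride bytes apart into 16
// contiguous bytes.
static void pack4x4(uint8_t *dst, const uint8_t *src, intptr_t srcStride)
{
    for (int r = 0; r < 4; r++)
    {
        copyRow4(dst + r * 4, src + r * srcStride);
    }
}

// Two packed mode buffers sit back to back. With an 8-byte store, the
// last row of mode 0 (bytes 12..15) would spill into bytes 16..19,
// which belong to mode 1. Here mode 1 must keep its sentinel value.
static bool neighbourIntact()
{
    uint8_t modes[2 * 16];
    memset(modes, 0xAA, sizeof(modes));

    uint8_t src[4 * 8];
    for (int i = 0; i < 32; i++)
    {
        src[i] = (uint8_t)i;
    }

    pack4x4(modes, src, 8);

    for (int i = 16; i < 32; i++)
    {
        if (modes[i] != 0xAA)
        {
            return false;
        }
    }
    // Row 3 of the source starts at offset 3 * 8 = 24.
    return modes[12] == 24 && modes[15] == 27;
}
```

Because the copy goes through `uint8_t` element accesses, it needs no `uint32_t*` cast and makes no alignment assumption about either pointer.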
  Affected files:
    - source/common/aarch64/dct-prim.cpp
    - source/common/aarch64/dct-prim-sve.cpp
    - source/common/aarch64/intrapred-prim.cpp
  Best regards,


Wiki Deng
wiki.deng at hj-micro.com






-------------- next part --------------
A non-text attachment was scrubbed...
Name: v3-0001-AArch64-Stride-aware-DCT-optimization-for-NEON-an.patch
Type: application/octet-stream
Size: 14498 bytes
Desc: not available
URL: <http://mailman.videolan.org/pipermail/x265-devel/attachments/20260421/cd8961bc/attachment-0001.obj>