[x265] [PATCH] RISC-V: Add RVV optimized DCT32x32

Wed Apr 22 01:21:03 UTC 2026

On 4/12/26 12:59, chen wrote:
> Thanks for the contribution.
> Most of it looks good to me; some comments follow.
> 
> 
> +.macro lx rd, addr
> +#if (__riscv_xlen == 32)
> + lw \rd, \addr
> +#elif (__riscv_xlen == 64)
> + ld \rd, \addr
> +#else
> + lq \rd, \addr
> +#endif
> RV128I is still a draft, so we could replace this branch with #error.
> 
> 
> + li t0, 4096
> + // temp stack address
> + sub t5, sp, t0
> + li t0, 2048
> + sub sp, t5, t0
> I don't recommend allocating 6 KB of stack in this function without checking that the pages are available. The allocation exceeds the 4 KB page size, which is a potential memory risk.
> Another risk is that a large VLEN may overflow the temporary buffer. Please add a comment stating the safe VLEN range (VLEN <= 1024?).

Thanks for the detailed review.

I’ll switch this temporary buffer to heap allocation in the next revision to avoid relying on a relatively large stack frame and make the behavior more robust across different environments.
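A minimal sketch of what heap allocation for this scratch area could look like on the C side. The names `dct32_scratch_alloc`, `dct32_scratch_free`, `DCT32_TMP_BYTES`, and the 64-byte alignment are placeholders for illustration and do not come from the patch; only the 6144-byte size is taken from the two `li` constants quoted above.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* 4096 + 2048 bytes, matching the two `li` constants in the quoted code. */
#define DCT32_TMP_BYTES 6144u
/* Hypothetical alignment; the patch does not state a requirement. */
#define DCT32_TMP_ALIGN 64u

/* Hypothetical helper: returns an aligned scratch buffer, or NULL on failure. */
static void *dct32_scratch_alloc(void)
{
    /* C11 aligned_alloc expects the size to be a multiple of the alignment. */
    assert(DCT32_TMP_BYTES % DCT32_TMP_ALIGN == 0);
    return aligned_alloc(DCT32_TMP_ALIGN, DCT32_TMP_BYTES);
}

/* Hypothetical helper: releases a buffer obtained from dct32_scratch_alloc. */
static void dct32_scratch_free(void *p)
{
    free(p);
}
```

Allocating once per encoder context and passing the pointer to the assembly routine would keep the allocation off the hot path, though whether that fits the primitive's calling convention is an open question here.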

Regarding the VLEN point, the temporary buffer is currently sized based on the fixed 32x32 data layout, rather than being derived from VLEN. From my understanding, VLEN should not directly affect the buffer size. Could you help clarify how a larger VLEN would lead to overflowing this buffer?

If there is indeed a dependency, I can either document the valid VLEN range or adjust the implementation to make the constraint explicit.
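To make the question concrete, here is a small C model of the `vl` returned by the quoted `vsetvli t4, t1, e16, m1, ta, ma` with `t1 = 32`, and of the byte count produced by `slli t2, t4, 2`. The helper names are hypothetical. The model only covers this one `vsetvli`; whether any other part of the kernel indexes the buffer by VLEN cannot be determined from the excerpts quoted in this thread.

```c
#include <assert.h>
#include <stddef.h>

/* VLMAX = LMUL * VLEN / SEW, for integer LMUL. */
static size_t rvv_vlmax(size_t vlen_bits, size_t sew_bits, size_t lmul)
{
    assert(sew_bits != 0);
    return lmul * vlen_bits / sew_bits;
}

/*
 * Models vsetvli for AVL <= VLMAX or AVL >= 2*VLMAX, where the RVV spec
 * fixes vl = min(AVL, VLMAX). In between, an implementation may choose a
 * smaller vl; that range is not reached for AVL = 32 with a power-of-two
 * VLEN at e16/m1.
 */
static size_t rvv_vl(size_t avl, size_t vlmax)
{
    return avl < vlmax ? avl : vlmax;
}

/* Bytes per chunk after widening e16 to e32, i.e. the value of `vl << 2`. */
static size_t dct32_widened_bytes(size_t vlen_bits)
{
    size_t vl = rvv_vl(32, rvv_vlmax(vlen_bits, 16, 1));
    return vl << 2;
}
```

Under this model the chunk size grows with VLEN up to 512 bits and then stays at 128 bytes, because AVL = 32 caps `vl` regardless of how wide the registers are.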
> +        li     t1, 32
> 
> +        vsetvli t4, t1, e16, m1, ta, ma    <-- m1
> 
> ...
> +function func_tr_32xN_\name\()_rvv
> +        .option arch, +zba
> +        // E saved from tmp stack
> +        mv              a7, t5
> +        // one vector bytes after widen
> +        slli            t2, t4, 2
> This potentially depends on m1. I suggest adding a comment as a reminder that if the vsetvli setting changes, this line needs to be updated as well.
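One way to express such a coupling so that it cannot drift silently is a compile-time check. The C sketch below is only an analogy to the assembly; `SEW_BITS`, `STRIDE_SHIFT`, and `widened_stride_bytes` are hypothetical names. It assumes the shift by 2 encodes the widened element size (e16 widened to e32 gives 4 bytes per element), while the LMUL setting influences the value of `t4` itself.

```c
#include <assert.h>
#include <stddef.h>

enum {
    SEW_BITS      = 16,               /* element width selected by vsetvli (e16) */
    WIDENED_BYTES = 2 * SEW_BITS / 8, /* bytes per element after widening */
    STRIDE_SHIFT  = 2                 /* shift amount used by `slli t2, t4, 2` */
};

/* Fails the build if the element width changes without updating the shift. */
_Static_assert((1 << STRIDE_SHIFT) == WIDENED_BYTES,
               "STRIDE_SHIFT must match the widened element size");

/* Bytes occupied by vl widened elements. */
static size_t widened_stride_bytes(size_t vl)
{
    return vl << STRIDE_SHIFT;
}
```

In the assembly source, the equivalent would be a comment next to both the `vsetvli` and the `slli`, as suggested above.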
> 
> 
> 
> 
> Others,
> There are some indentation mismatches on the DCT32_4_DST_ADD_1_MEMBER line.
> 
> 
> At 2026-02-06 16:14:53, "daichengrong" <daichengrong at iscas.ac.cn> wrote:
>> This patch adds an RVV-optimized implementation of DCT 32x32 for RISC-V.
>>
>> The current implementation in the repository is written with the assumption of a 128-bit VLEN and does not account for wider vector lengths. Therefore, initial testing was performed on a 128-bit platform, allowing the results to directly reflect the advantages of the optimized code over the existing implementation.
>>
>> **SG2044 (128-bit VLEN):**
>>
>> ```
>> dct32x32 | 5.14x | 1800.12 | 9247.73
>> dct32x32 | 9.85x |  935.26 | 9214.26
>> ```
>>
>> Building on this, the new implementation adopts a Vector-Length Agnostic (VLA) design. Additional testing on a 256-bit platform demonstrates good scalability and further performance gains.
>>
>> **Banana Pi F3 (256-bit VLEN):**
>>
>> ```
>> dct32x32 | 5.59x | 2222.48 | 12420.64
>> dct32x32 | 13.28x |  935.97 | 12431.17
>> ```
>>
>> To simplify comparison with the existing implementation, this patch introduces an `RVV_DCT32_OPT` compile-time option. The optimization can be disabled using:
>>
>> ```
>> -DRVV_DCT32_OPT=0
>> ```
>>
>> allowing straightforward A/B performance testing.
>>
>> Signed-off-by: daichengrong <daichengrong at iscas.ac.cn>
>> ---
>> source/CMakeLists.txt                    |   6 +
>> source/common/CMakeLists.txt             |   2 +-
>> source/common/riscv64/asm-primitives.cpp |   3 +
>> source/common/riscv64/dct-32dct.S        | 714 +++++++++++++++++++++++
>> source/common/riscv64/fun-decls.h        |   1 +
>> 5 files changed, 725 insertions(+), 1 deletion(-)
>> mode change 100755 => 100644 source/CMakeLists.txt
>> create mode 100644 source/common/riscv64/dct-32dct.S
>>
>