[x265] [PATCH] RISCV64: add copy_cnt assembly optimization

Wed Jul 9 04:12:06 UTC 2025

Hi Changsheng,

Thanks for the reminder; I overlooked the 'm2'.

I still feel that RISC-V V is at an early stage: parts of the ISA are unclear, there is no cycle/throughput table, and so on. RISC-V is also fragmented, so asm code optimized for one kind of CPU may be slower on another vendor's.

We can keep GCC/LLVM auto-vectorization as the starting point and add asm code later, once the documentation is good enough.

Regards,
Chen

At 2025-07-09 10:02:27, wu.changsheng at sanechips.com.cn wrote:

Hi Chen,

Thank you for your reply. Let me try to address your questions.

1. GCC names registers V8, V10, V12, and V14 because the m2 in vsetvli a5,zero,e32,m2,ta,ma sets LMUL=2, which combines Vi and Vi+1 into one register group and doubles the effective vector width to 2*VLEN. The four fields of vlseg4e32.v therefore occupy eight registers, V8–V15.

2. Our code is based on the officially ratified and released specification documents, and we have tested and verified it on hardware compliant with the RVA23 profile. Therefore, no code changes will be needed because of instruction changes. For example, the Matrix extension planned as a future RISC-V addition will not alter the existing Vector 1.0 extension.
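
The register grouping described in point 1 can be sketched in C. This is only an illustrative model, not code from the patch; the function names are hypothetical, and it assumes an integer LMUL of 1, 2, or 4.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical helper: first register of segment field `field` for a
 * segment load with destination v<vd> under integer LMUL.
 * Field k occupies v[vd + k*lmul] .. v[vd + k*lmul + lmul - 1]. */
int seg_field_first_reg(int vd, int lmul, int field)
{
    assert(lmul == 1 || lmul == 2 || lmul == 4);
    assert(vd % lmul == 0);          /* register groups must be aligned to LMUL */
    return vd + field * lmul;
}

/* Print the register group used by each of the nf fields. */
void print_seg_groups(int vd, int lmul, int nf)
{
    assert(nf >= 1 && nf <= 8);
    assert(nf * lmul <= 8);          /* V 1.0 constraint: EMUL * NFIELDS <= 8 */
    assert(vd + nf * lmul <= 32);    /* must not run past v31 */
    for (int k = 0; k < nf; k++) {
        int first = seg_field_first_reg(vd, lmul, k);
        printf("field %d: v%d..v%d\n", k, first, first + lmul - 1);
    }
}
```

Calling print_seg_groups(8, 2, 4) models vlseg4e32.v v8 under m2 and lists the groups starting at v8, v10, v12, and v14; with lmul set to 1 the same call lists v8, v9, v10, and v11, which is the numbering in the specification's m1 examples.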

Best Wishes！

Changsheng Wu

E：wu.changsheng at sanechips.com.cn

SANECHIPS TECHNOLOGY CO.,LTD.

Original
From: chen <chenm003 at 163.com>
To: 吴昌盛0318004250;
Cc: x265-devel at videolan.org <x265-devel at videolan.org>;mahesh at multicorewareinc.com <mahesh at multicorewareinc.com>;pavan.tarun at multicorewareinc.com <pavan.tarun at multicorewareinc.com>;沈显来0318003851;袁佳0318004243;
Date: 2025-07-08 22:35
Subject: Re:Re: [x265] [PATCH] RISCV64: add copy_cnt assembly optimization

Hi Changsheng,

Thank you for providing so much detailed information. I am glad to see that RISC-V V is gradually maturing.

I ran a simple experiment with GCC, which can now generate vectorized code automatically. However, I found a special instruction, vlseg4e32.v, in the output, and after consulting the RISC-V V documentation I still have many doubts.

For example, the registers are (V8, V9, V10, V11) in the document but (V8, V10, V12, V14) in the GCC output.

Also, the details of many asm instructions still exist only in scattered slide decks from different users and companies, without a unified official document.

These issues may lead to repeated code rework in the future, so I still do not recommend accepting RISC-V V now.

If possible, please help the RISC-V community improve the RISC-V V ISA documentation. I would be pleased to accept RISC-V V as one of the target platforms in the future.

Regards,
Chen

Code

int test1(int *x, int N)
{
    int sum = 0;
    if (__builtin_expect(N % 4, 0))
    {
        for(int i = 0; i < N; i+=4)
        {
            sum += (x[i+0] + x[i+1] + x[i+2] + x[i+3]);
        }
    }
    else
        __builtin_unreachable();
    return sum;
}

GCC output

test1(int*, int):
        ble     a1,zero,.L4
        vsetvli a5,zero,e32,m2,ta,ma
        addiw   a4,a1,-1
        vmv.v.i v4,0
        srliw   a4,a4,2
        addiw   a4,a4,1
.L3:
        vsetvli a5,a4,e32,m2,tu,ma
        vlseg4e32.v     v8,(a0)
        slli    a3,a5,4
        sub     a4,a4,a5
        add     a0,a0,a3
        vadd.vv v2,v10,v8
        vadd.vv v2,v2,v12
        vadd.vv v2,v2,v14
        vadd.vv v4,v4,v2
        bne     a4,zero,.L3
        vsetvli a5,zero,e32,m2,ta,ma
        vmv.s.x v1,zero
        vredsum.vs      v4,v4,v1
        vmv.x.s a0,v4
        ret
        ret
.L4:
        li      a0,0
        ret
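
As a reading aid, the GCC listing can be modelled in scalar C. This is a rough sketch, not compiler output: test1_model is a hypothetical name, and MAX_VL = 8 assumes VLEN = 128 (eight 32-bit lanes per m2 register group).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum { MAX_VL = 8 };  /* assumed VLMAX for e32, m2 with VLEN = 128 */

/* Scalar model of the vectorized loop in the listing. */
int32_t test1_model(const int32_t *x, int n)
{
    if (n <= 0)
        return 0;                               /* ble a1,zero,.L4 */
    assert(x != NULL);

    int32_t acc[MAX_VL] = {0};                  /* vmv.v.i v4,0 */
    size_t segs = ((size_t)(n - 1) >> 2) + 1;   /* addiw / srliw / addiw */

    while (segs != 0) {
        size_t vl = segs < MAX_VL ? segs : MAX_VL;  /* vsetvli a5,a4,... */
        for (size_t i = 0; i < vl; i++) {
            /* vlseg4e32.v: field k of segment i lands in lane i of group k
             * (groups v8, v10, v12, v14 under m2). */
            int32_t f0 = x[4 * i + 0], f1 = x[4 * i + 1];
            int32_t f2 = x[4 * i + 2], f3 = x[4 * i + 3];
            acc[i] += f1 + f0 + f2 + f3;        /* four vadd.vv */
        }
        x += 4 * vl;    /* slli a3,a5,4 ; add a0,a0,a3 (16 bytes per segment) */
        segs -= vl;     /* sub a4,a4,a5 */
    }

    int32_t sum = 0;                            /* vmv.s.x v1,zero */
    for (size_t i = 0; i < MAX_VL; i++)
        sum += acc[i];                          /* vredsum.vs v4,v4,v1 */
    return sum;                                 /* vmv.x.s a0,v4 */
}
```

Lanes of acc beyond vl keep their old values in the last iteration, which mirrors the tail-undisturbed (tu) setting in the loop's vsetvli.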

RISC-V Document
7.8.2. Vector Strided Segment Loads and Stores

Vector strided segment loads and stores move contiguous segments where each segment is separated by the byte-stride offset given in the rs2 GPR argument.

Note: Negative and zero strides are supported.

    # Format
    vlsseg<nf>e<eew>.v vd, (rs1), rs2, vm          # Strided segment loads
    vssseg<nf>e<eew>.v vs3, (rs1), rs2, vm         # Strided segment stores

    # Examples
    vsetvli a1, t0, e8, ta, ma
    vlsseg3e8.v v4, (x5), x6   # Load bytes at addresses x5+i*x6   into v4[i],
                               #  and bytes at addresses x5+i*x6+1 into v5[i],
                               #  and bytes at addresses x5+i*x6+2 into v6[i].

    # Examples
    vsetvli a1, t0, e32, ta, ma
    vssseg2e32.v v2, (x5), x6   # Store words from v2[i] to address x5+i*x6
                                #   and words from v3[i] to address x5+i*x6+4
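
The addressing described in the quoted vlsseg3e8.v example can be written as a scalar C loop. This is a minimal sketch under stated assumptions: vlsseg_e8_model is a hypothetical name, masking is ignored, and the nf destination registers are modelled as nf consecutive rows of vl bytes in one array.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Scalar model of an unmasked vlsseg<nf>e8.v:
 *   out[f][i] = mem[base + i*stride + f]
 * Row f of `out` stands in for destination register v(vd + f). */
void vlsseg_e8_model(uint8_t *out, size_t nf, size_t vl,
                     const uint8_t *base, ptrdiff_t stride)
{
    assert(nf >= 1 && nf <= 8);
    for (size_t i = 0; i < vl; i++)
        for (size_t f = 0; f < nf; f++)
            out[f * vl + i] = base[(ptrdiff_t)i * stride + (ptrdiff_t)f];
}
```

With nf = 3, row 0 corresponds to v4, row 1 to v5, and row 2 to v6 in the quoted example; a negative or zero stride works the same way as long as every computed address stays inside the buffer.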

At 2025-07-07 15:24:16, wu.changsheng at sanechips.com.cn wrote:

Hi Chen,

Thank you for your previous feedback.

I'd like to supplement some information about RISC-V Vector V1.0 and hope you can reconsider x265 support for the RISC-V architecture.   

1. The RISC-V community considers Vector V1.0 a stable version. The RISC-V Vector V1.0 was officially approved and released in 2021. The server profile RVA23, released in October 2024, also specifies Vector V1.0, and The RISC-V Instruction Set Manual Volume published the same year adopts Vector V1.0 as well. 

2. Many chip manufacturers already support RISC-V Vector Extension V1.0, such as the already released SiFive P670/P470, Andes NX27V, Alibaba C920, and SpaceMIT X100 CPUs. In the next year or two, many more vendors will launch chips supporting Vector V1.0.

3. GCC experimentally introduced RISC-V Vector support in GCC 12 (May 2022) and officially supported RISC-V Vector V1.0 in GCC 14 (May 2024).

4. The Linux kernel merged support for RISC-V Vector V1.0 in June 2023, and it is included in the 6.6 LTS release.

5. Our company has already planned to deploy RISC-V servers in data centers, with x265 video encoding being one of the key business scenarios. We will continue contributing RISC-V architecture patches.

RISC-V has garnered widespread attention and strong investment, leading to rapid development. I believe it will become another mainstream architecture following x86 and Arm. RISC-V is now commercially viable and deserves adoption by the x265 community.

Best Wishes！

Changsheng Wu

M: +86 13776570034

E：wu.changsheng at sanechips.com.cn

SANECHIPS TECHNOLOGY CO.,LTD.

From: chen <chenm003 at 163.com>
To: 吴昌盛0318004250;
Cc: x265-devel at videolan.org <x265-devel at videolan.org>;mahesh at multicorewareinc.com <mahesh at multicorewareinc.com>;pavan.tarun at multicorewareinc.com <pavan.tarun at multicorewareinc.com>;沈显来0318003851;袁佳0318004243;吴昌盛0318004250;
Date: 2025-07-07 03:28
Subject: Re:[x265] [PATCH] RISCV64: add copy_cnt assembly optimization

Hi Changsheng,

Thanks for the patches.

However, I don't think the RISC-V V extension is stable enough yet:

v1.0 was frozen in September 2021

v1.1 went to public review in May 2023

there have been no further updates as of July 2025

Also, most instructions have no description of their behavior.

For example, vredsum.vs in the patch

vredsum.vs  vd, vs2, vs1, vm   # vd[0] =  sum( vs1[0] , vs2[*] )

I just guess it is
vd[0] =  vs1[0] + sum(vs2[*])
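
That reading can be written out as a scalar C model. This is only a sketch: vredsum_vs_model is a hypothetical name, and it assumes SEW = 32, no masking, and vl > 0 (with vl = 0 the instruction leaves vd unchanged).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Scalar model of unmasked vredsum.vs vd, vs2, vs1 at SEW = 32.
 * Returns the value written to vd[0]: vs1[0] + sum(vs2[0..vl-1]). */
int32_t vredsum_vs_model(const int32_t *vs2, size_t vl, int32_t vs1_0)
{
    assert(vl > 0);                     /* vl == 0 would leave vd untouched */
    uint32_t acc = (uint32_t)vs1_0;     /* unsigned so the sum wraps on overflow */
    for (size_t i = 0; i < vl; i++)
        acc += (uint32_t)vs2[i];
    return (int32_t)acc;
}
```

For vs2 = {1, 2, 3, 4} and vs1[0] = 10 the model returns 20.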

Another example is vlse8.v.

I would guess it is equivalent to x86 PSHUFB or ARM VTBL.

These are only guesses; I have not been able to confirm my understanding over the past couple of years, and there are too many similar problems in the RISC-V V extension.
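
For comparison, the ratified V 1.0 text lists vlse8.v among the strided loads from memory, while the register permute closest to PSHUFB/VTBL is vrgather.vv. A minimal scalar sketch of both in C follows; the function names and the array-based register model are illustrative assumptions, and masking is ignored.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* vlse8.v vd, (rs1), rs2: strided load, vd[i] = mem[rs1 + i*rs2]. */
void vlse8_model(uint8_t *vd, size_t vl, const uint8_t *base, ptrdiff_t stride)
{
    for (size_t i = 0; i < vl; i++)
        vd[i] = base[(ptrdiff_t)i * stride];
}

/* vrgather.vv vd, vs2, vs1: register gather,
 * vd[i] = (vs1[i] >= VLMAX) ? 0 : vs2[vs1[i]]. */
void vrgather_vv_model(uint8_t *vd, const uint8_t *vs2, const uint8_t *vs1,
                       size_t vl, size_t vlmax)
{
    assert(vd != vs2 && vd != vs1);   /* destination may not overlap sources */
    for (size_t i = 0; i < vl; i++)
        vd[i] = (vs1[i] >= vlmax) ? 0 : vs2[vs1[i]];
}
```

The first function reads from memory at a fixed byte stride; the second selects elements of one register using indices held in another, which is the table-lookup behavior of PSHUFB and VTBL.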

So I suggest not integrating the RISC-V patches until the specification becomes stable enough.

Regards,

Chen

At 2025-07-06 10:08:25, wu.changsheng at sanechips.com.cn wrote:

From 7562e3a834a6a5ea76ab1b97acf915e095646cd5 Mon Sep 17 00:00:00 2001

From: Changsheng Wu <wu.changsheng at sanechips.com.cn>

Date: Sat, 5 Jul 2025 23:09:14 +0800

Subject: [PATCH] RISCV64: add copy_cnt assembly optimization

TestBench test result:

      primitive |      speedup |   optimized (asm) |   reference (C)
  copy_cnt[4x4] |        1.34x |          123.12   |      165.06
  copy_cnt[8x8] |        2.64x |          214.07   |      564.26
copy_cnt[16x16] |        3.96x |          563.83   |      2232.00
copy_cnt[32x32] |        7.44x |          2144.80  |      15954.42
