<html><head></head><body><div dir="auto">Thanks for the suggestion. To clarify, the implementation does not choose between 128-bit and 256-bit vector widths at compile time; the only compile-time switch, `RVV_DCT32_OPT`, selects between the existing and the new DCT32 routine for A/B testing. The code follows a Vector-Length Agnostic (VLA) approach: it calls `vsetvli` to ask the hardware how many 16-bit elements it can process per iteration, and both passes step through the 32 lines of the block in chunks of that size. The vector width is therefore resolved by the hardware at runtime, and no width-specific function pointers are needed.<br>The existing repository implementation was written with a 128-bit VLEN assumption, which is why the initial validation was performed on 128-bit hardware to provide a direct comparison. With the VLA design, the same code runs correctly on wider vector hardware (e.g., 256-bit) without separate code paths, and the Banana Pi F3 (256-bit VLEN) results in the patch description show good scalability.<br>Please let me know if I misunderstood your concern; I’m happy to clarify further.</div><br><br><div class="gmail_quote"><div dir="auto">On February 6, 2026 4:37:21 PM GMT+08:00, wu.changsheng@sanechips.com.cn wrote:</div><blockquote class="gmail_quote" style="margin: 0pt 0pt 0pt 0.8ex; border-left: 1px solid rgb(204, 204, 204); padding-left: 1ex;">
<div class="zcontentRow"><div>It is recommended not to decide between using 128-bit or 256-bit width at compile time. Instead, detect the bit width at runtime and then select accordingly when initializing function pointers.</div><div style="font-size:14px;font-family:微软雅黑,Microsoft YaHei;line-height:1.5"><br></div><div style="font-size:14px;font-family:微软雅黑,Microsoft YaHei;line-height:1.5"><br></div><div style="font-size:14px;font-family:微软雅黑,Microsoft YaHei;line-height:1.5"><br></div><div unonameen="Wu Changsheng0318004250" unonamech="吴昌盛0318004250" style="line-height:1.5" class="zMailSign"><div style="display: none;" unonameen="Wu Changsheng0318004250" unonamech="吴昌盛0318004250" class="zMailSignTitle"><label class="sign_nameUno">吴昌盛0318004250</label><span class="sign_arrow"></span></div><div class="zMailSignContent"><p style="box-sizing: border-box; outline: 0px; white-space: normal; font-family: Arial, Helvetica, "Microsoft Yahei", sans-serif; margin-top: 0px; margin-bottom: 0px; padding: 0px; min-height: 14px; background-color: rgb(255, 255, 255); line-height: normal;">Best Wishes!</p><p style="box-sizing: border-box; outline: 0px; white-space: normal; font-family: Arial, Helvetica, "Microsoft Yahei", sans-serif; margin-top: 0px; margin-bottom: 0px; padding: 0px; min-height: 14px; background-color: rgb(255, 255, 255); line-height: normal;">Changsheng Wu</p><p style="box-sizing: border-box; outline: 0px; white-space: normal; font-family: Arial, Helvetica, "Microsoft Yahei", sans-serif; margin-top: 0px; margin-bottom: 0px; padding: 0px; min-height: 14px; background-color: rgb(255, 255, 255);"><span style="box-sizing: border-box; outline: 0px; font-family: arial, sans-serif, "Myriad Pro"; line-height: normal;">E:wu.changsheng@sanechips.com.cn</span></p><p style="box-sizing: border-box; outline: 0px; white-space: normal; font-family: Arial, Helvetica, "Microsoft Yahei", sans-serif; margin-top: 0px; margin-bottom: 0px; padding: 0px; min-height: 14px; 
background-color: rgb(255, 255, 255); line-height: normal;">SANECHIPS TECHNOLOGY CO.,LTD.</p><p><br></p></div></div><div style="line-height:1.5" class="zMailFrom"></div><div style="display:block" class="zhistoryRow"><div style="width: 100%; height: 28px; line-height: 28px; background-color: #E0E5E9; color: #1388FF; text-align: center;" class="zhistoryDes">Original</div><div id="zwriteHistoryContainer"><div class="control-group zhistoryPanel"><div style="padding: 8px; background-color: #F5F6F8;" class="zhistoryHeader"><div><strong>From: </strong><span class="zreadUserName">daichengrong <daichengrong@iscas.ac.cn></span>
</div><div><strong>To: </strong><span style="display: inline;" class="zreadUserName">x265-devel@videolan.org <x265-devel@videolan.org>;</span>
</div><div><strong>Date: </strong><span data-zmail-format-date="res:WriteMailResource.sendDateFormat" data-zmail-timezone="8" data-zmail-timezone-code="Asia/Shanghai" data-zmail-timezone-value="2026-02-06 16:15:13">February 6, 2026 16:15</span>
</div><div><strong>Subject: </strong><span class="zreadTitle"><strong>[x265] [PATCH] RISC-V: Add RVV optimized DCT32x32</strong></span>
</div></div><div zmailbusiness="businessExternal"></div><div class="zhistoryContent">This patch adds an RVV-optimized implementation of DCT 32x32 for RISC-V.<br> <br>The current implementation in the repository is written with the assumption of a 128-bit VLEN and does not account for wider vector lengths. Therefore, initial testing was performed on a 128-bit platform, allowing the results to directly reflect the advantages of the optimized code over the existing implementation.<br> <br>**SG2044 (128-bit VLEN):**<br> <br>```<br>primitive | speedup vs. C | optimized | C reference<br>dct32x32 | 5.14x | 1800.12 | 9247.73<br>dct32x32 | 9.85x | 935.26 | 9214.26<br>```<br> <br>Building on this, the new implementation adopts a Vector-Length Agnostic (VLA) design. Additional testing on a 256-bit platform demonstrates good scalability and further performance gains.<br> <br>**Banana Pi F3 (256-bit VLEN):**<br> <br>```<br>primitive | speedup vs. C | optimized | C reference<br>dct32x32 | 5.59x | 2222.48 | 12420.64<br>dct32x32 | 13.28x | 935.97 | 12431.17<br>```<br> <br>To simplify comparison with the existing implementation, this patch introduces an `RVV_DCT32_OPT` compile-time option. 
The optimization can be disabled using:<br> <br>```<br>-DRVV_DCT32_OPT=0<br>```<br> <br>allowing straightforward A/B performance testing.<br> <br>Signed-off-by: daichengrong <daichengrong@iscas.ac.cn> <br>---<br> source/CMakeLists.txt | 6 +<br> source/common/CMakeLists.txt | 2 +-<br> source/common/riscv64/asm-primitives.cpp | 3 +<br> source/common/riscv64/dct-32dct.S | 714 +++++++++++++++++++++++<br> source/common/riscv64/fun-decls.h | 1 +<br> 5 files changed, 725 insertions(+), 1 deletion(-)<br> mode change 100755 => 100644 source/CMakeLists.txt<br> create mode 100644 source/common/riscv64/dct-32dct.S<br> <br>diff --git a/source/CMakeLists.txt b/source/CMakeLists.txt<br>old mode 100755<br>new mode 100644<br>index 9f93b6ec2..fd91da702<br>--- a/source/CMakeLists.txt<br>+++ b/source/CMakeLists.txt<br>@@ -512,6 +512,11 @@ int main() {<br> message(STATUS "Found RVV")<br> add_definitions(-DHAVE_RVV=1)<br> <br>+ option(RVV_DCT32_OPT "Enable use of RVV DCT32 OPT" ON)<br>+ if(RVV_DCT32_OPT)<br>+ add_definitions(-DHAVE_RVV_OPT=1)<br>+ endif()<br>+<br> set(RVV_INTRINSIC_TEST [[<br> #include <riscv_vector.h> <br> #include <stdint.h> <br>@@ -947,6 +952,7 @@ if((MSVC_IDE OR XCODE OR GCC) AND ENABLE_ASSEMBLY)<br> enable_language(ASM)<br> foreach(ASM ${RISCV64_ASMS})<br> set(ASM_SRC ${CMAKE_CURRENT_SOURCE_DIR}/common/riscv64/${ASM})<br>+ message(STATUS "add ... 
${ASM_SRC}")<br> list(APPEND ASM_SRCS ${ASM_SRC})<br> list(APPEND ASM_OBJS ${ASM}.${SUFFIX})<br> add_custom_command(<br>diff --git a/source/common/CMakeLists.txt b/source/common/CMakeLists.txt<br>index 69125c3cb..4945af009 100644<br>--- a/source/common/CMakeLists.txt<br>+++ b/source/common/CMakeLists.txt<br>@@ -185,7 +185,7 @@ if(ENABLE_ASSEMBLY AND (RISCV64 OR CROSS_COMPILE_RISCV64))<br> source_group(Assembly FILES ${ASM_PRIMITIVES})<br> <br> # Add riscv64 assembly files here.<br>- set(A_SRCS asm.S blockcopy8.S dct.S sad-a.S ssd-a.S pixel-util.S mc-a.S p2s.S sao.S loopfilter.S intrapred.S riscv64_utils.S)<br>+ set(A_SRCS asm.S blockcopy8.S dct.S sad-a.S ssd-a.S pixel-util.S mc-a.S p2s.S sao.S loopfilter.S intrapred.S riscv64_utils.S dct-32dct.S)<br> set(VEC_PRIMITIVES)<br> <br> if(CPU_HAS_RVV)<br>diff --git a/source/common/riscv64/asm-primitives.cpp b/source/common/riscv64/asm-primitives.cpp<br>index ce03288f9..7bd017cf8 100644<br>--- a/source/common/riscv64/asm-primitives.cpp<br>+++ b/source/common/riscv64/asm-primitives.cpp<br>@@ -234,6 +234,9 @@ void setupRVVPrimitives(EncoderPrimitives &p)<br> p.dst4x4 = PFX(dst4_v);<br> <br> ALL_LUMA_TU_S(dct, dct, v);<br>+#if defined(HAVE_RVV_OPT)<br>+ p.cu[BLOCK_32x32].dct = PFX(dct_32_v_opt);<br>+#endif<br> ALL_LUMA_TU_S(idct, idct, v);<br> <br> ALL_LUMA_TU_L(nonPsyRdoQuant, nonPsyRdoQuant, v);<br>diff --git a/source/common/riscv64/dct-32dct.S b/source/common/riscv64/dct-32dct.S<br>new file mode 100644<br>index 000000000..a25521706<br>--- /dev/null<br>+++ b/source/common/riscv64/dct-32dct.S<br>@@ -0,0 +1,714 @@<br>+/*****************************************************************************<br>+ * Copyright (C) 2026 MulticoreWare, Inc<br>+ *<br>+ * Authors: daichengrong <daichengrong@iscas.ac.cn> <br>+ *<br>+ * This program is free software; you can redistribute it and/or modify<br>+ * it under the terms of the GNU General Public License as published by<br>+ * the Free Software Foundation; either version 2 of the 
License, or<br>+ * (at your option) any later version.<br>+ *<br>+ * This program is distributed in the hope that it will be useful,<br>+ * but WITHOUT ANY WARRANTY; without even the implied warranty of<br>+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the<br>+ * GNU General Public License for more details.<br>+ *<br>+ * You should have received a copy of the GNU General Public License<br>+ * along with this program; if not, write to the Free Software<br>+ * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02111, USA.<br>+ *<br>+ * This program is also available under a commercial proprietary license.<br>+ * For more information, contact us at license @ x265.com.<br>+ *****************************************************************************/<br>+<br>+#include "asm.S" <br>+<br>+#ifdef __APPLE__<br>+.section __RODATA,__rodata<br>+#else<br>+.section .rodata<br>+#endif<br>+<br>+.align 4<br>+<br>+.set dct32_shift_1, 4 + BIT_DEPTH - 8<br>+.set dct32_shift_2, 11<br>+<br>+.text<br>+<br>+#define DCT32_O_CONSTANT_1_0 90, 90, 88, 85, 82, 78, 73, 67, 61, 54, 46, 38, 31, 22, 13, 4<br>+#define DCT32_O_CONSTANT_3_1 90, 82, 67, 46, 22, -4, -31, -54, -73, -85, -90, -88, -78, -61, -38, -13<br>+#define DCT32_O_CONSTANT_5_2 88, 67, 31, -13, -54, -82, -90, -78, -46, -4, 38, 73, 90, 85, 61, 22<br>+#define DCT32_O_CONSTANT_7_3 85, 46, -13, -67, -90, -73, -22, 38, 82, 88, 54, -4, -61, -90, -78, -31<br>+#define DCT32_O_CONSTANT_9_4 82, 22, -54, -90, -61, 13, 78, 85, 31, -46, -90, -67, 4, 73, 88, 38<br>+#define DCT32_O_CONSTANT_11_5 78, -4, -82, -73, 13, 85, 67, -22, -88, -61, 31, 90, 54, -38, -90, -46<br>+#define DCT32_O_CONSTANT_13_6 73, -31, -90, -22, 78, 67, -38, -90, -13, 82, 61, -46, -88, -4, 85, 54<br>+#define DCT32_O_CONSTANT_15_7 67, -54, -78, 38, 85, -22, -90, 4, 90, 13, -88, -31, 82, 46, -73, -61<br>+#define DCT32_O_CONSTANT_17_8 61, -73, -46, 82, 31, -88, -13, 90, -4, -90, 22, 85, -38, -78, 54, 67<br>+#define DCT32_O_CONSTANT_19_9 54, -85, -4, 88, 
-46, -61, 82, 13, -90, 38, 67, -78, -22, 90, -31, -73<br>+#define DCT32_O_CONSTANT_21_10 46, -90, 38, 54, -90, 31, 61, -88, 22, 67, -85, 13, 73, -82, 4, 78<br>+#define DCT32_O_CONSTANT_23_11 38, -88, 73, -4, -67, 90, -46, -31, 85, -78, 13, 61, -90, 54, 22, -82<br>+#define DCT32_O_CONSTANT_25_12 31, -78, 90, -61, 4, 54, -88, 82, -38, -22, 73, -90, 67, -13, -46, 85<br>+#define DCT32_O_CONSTANT_27_13 22, -61, 85, -90, 73, -38, -4, 46, -78, 90, -82, 54, -13, -31, 67, -88<br>+#define DCT32_O_CONSTANT_29_14 13, -38, 61, -78, 88, -90, 85, -73, 54, -31, 4, 22, -46, 67, -82, 90<br>+#define DCT32_O_CONSTANT_31_15 4, -13, 22, -31, 38, -46, 54, -61, 67, -73, 78, -82, 85, -88, 90, -90<br>+<br>+<br>+#define DCT32_EO_CONSTANT_2_0 90, 87, 80, 70, 57, 43, 25, 9<br>+#define DCT32_EO_CONSTANT_6_1 87, 57, 9, -43, -80, -90, -70, -25<br>+#define DCT32_EO_CONSTANT_10_2 80, 9, -70, -87, -25, 57, 90, 43<br>+#define DCT32_EO_CONSTANT_14_3 70, -43, -87, 9, 90, 25, -80, -57<br>+<br>+#define DCT32_EO_CONSTANT_18_4 57, -80, -25, 90, -9, -87, 43, 70<br>+#define DCT32_EO_CONSTANT_22_5 43, -90, 57, 25, -87, 70, 9, -80<br>+#define DCT32_EO_CONSTANT_26_6 25, -70, 90, -80, 43, 9, -57, 87<br>+#define DCT32_EO_CONSTANT_30_7 9, -25, 43, -57, 70, -80, 87, -90<br>+<br>+.macro lx rd, addr<br>+#if (__riscv_xlen == 32)<br>+ lw \rd, \addr<br>+#elif (__riscv_xlen == 64)<br>+ ld \rd, \addr<br>+#else<br>+ lq \rd, \addr<br>+#endif<br>+.endm<br>+<br>+.macro sx rd, addr<br>+#if (__riscv_xlen == 32)<br>+ sw \rd, \addr<br>+#elif (__riscv_xlen == 64)<br>+ sd \rd, \addr<br>+#else<br>+ sq \rd, \addr<br>+#endif<br>+.endm<br>+<br>+.macro butterfly e, o, tmp_p, tmp_m<br>+ vadd.vv \tmp_p, \e, \o<br>+ vsub.vv \tmp_m, \e, \o<br>+.endm<br>+<br>+.macro butterfly_widen e, o, tmp_p, tmp_m<br>+ vwadd.vv \tmp_p, \e, \o<br>+ vwsub.vv \tmp_m, \e, \o<br>+.endm<br>+<br>+.macro DCT32_EEO_CAL dst, m1, m2, m3, m4, s1, s2, s3, s4, line, shift<br>+ li a2, \m1<br>+ li a3, \m2<br>+ li a4, \m3<br>+ li a5, \m4<br>+ vmul.vx \dst, \s1, a2<br>+ 
vmacc.vx \dst, a3, \s2<br>+ vmacc.vx \dst, a4, \s3<br>+ vmacc.vx \dst, a5, \s4<br>+.endm<br>+<br>+.macro DCT32_4_DST_ADD_1_MEMBER first, in, dst_start_index, dst1, dst2, dst3, dst4, t0, t1,t2,t3,t4,t5,t6,t7,t8,t9,t10,t11,t12,t13,t14,t15<br>+.if \dst_start_index == 0<br>+ li a2, \t0<br>+ li a3, \t1<br>+ li a4, \t2<br>+ li a5, \t3<br>+.elseif \dst_start_index == 4<br>+ li a2, \t4<br>+ li a3, \t5<br>+ li a4, \t6<br>+ li a5, \t7<br>+.elseif \dst_start_index == 8<br>+ li a2, \t8<br>+ li a3, \t9<br>+ li a4, \t10<br>+ li a5, \t11<br>+.else<br>+ li a2, \t12<br>+ li a3, \t13<br>+ li a4, \t14<br>+ li a5, \t15<br>+.endif<br>+<br>+.if \first == 1<br>+ vmul.vx \dst1, \in, a2<br>+ vmul.vx \dst2, \in, a3<br>+ vmul.vx \dst3, \in, a4<br>+ vmul.vx \dst4, \in, a5<br>+.else<br>+ vmacc.vx \dst1, a2, \in<br>+ vmacc.vx \dst2, a3, \in<br>+ vmacc.vx \dst3, a4, \in<br>+ vmacc.vx \dst4, a5, \in<br>+.endif<br>+.endm<br>+<br>+.macro DCT32_STORE_L line, shift, in<br>+ vnclip.wi \in, \in, \shift<br>+ addi t0, a1, 32 * 2 * \line<br>+ vse16.v \in, (t0)<br>+.endm<br>+<br>+.macro tr_32xN_rvv name, shift<br>+function func_tr_32xN_\name\()_rvv<br>+ .option arch, +zba<br>+ // E vectors are stored at the start of the temp stack area (t5)<br>+ mv a7, t5<br>+ // bytes per widened (32-bit) vector: vl * 4<br>+ slli t2, t4, 2<br>+ // O vectors are stored after the 16 E vectors: t5 + 16 * t2<br>+ slli t0, t2, 4<br>+ add a6, t5, t0<br>+<br>+ // load 0-3 28-31<br>+ add t0, a0, 2*0<br>+ vlsseg4e16.v v0,(a0), t3<br>+ add t0, a0, 2*28<br>+ vlsseg4e16.v v4,(t0), t3<br>+<br>+ butterfly_widen v0, v7, v8, v16<br>+ butterfly_widen v1, v6, v10, v18<br>+ butterfly_widen v2, v5, v12, v20<br>+ butterfly_widen v3, v4, v14, v22<br>+<br>+ // load 4-7 24-27<br>+ add t0, a0, 2*4<br>+ vlsseg4e16.v v0,(t0), t3<br>+ add t0, a0, 2*24<br>+ vlsseg4e16.v v4,(t0), t3<br>+<br>+ // save E 0 1 2 3<br>+ vse32.v v8, (a7)<br>+ add a7, a7, t2<br>+ vse32.v v10, (a7)<br>+ add a7, a7, t2<br>+ vse32.v v12, (a7)<br>+ add a7, a7, t2<br>+ vse32.v v14, (a7)<br>+<br>+ // save O 0 1 2 3<br>+ vse32.v v16, (a6)<br>+ add a6, a6, 
t2<br>+ vse32.v v18, (a6)<br>+ add a6, a6, t2<br>+ vse32.v v20, (a6)<br>+ add a6, a6, t2<br>+ vse32.v v22, (a6)<br>+<br>+ vsetvli zero, zero, e32, m2, ta, ma<br>+ DCT32_4_DST_ADD_1_MEMBER 1, v16, 0, v24, v26, v28, v30, DCT32_O_CONSTANT_1_0<br>+ DCT32_4_DST_ADD_1_MEMBER 0, v18, 0, v24, v26, v28, v30, DCT32_O_CONSTANT_3_1<br>+<br>+ DCT32_4_DST_ADD_1_MEMBER 0, v20, 0, v24, v26, v28, v30, DCT32_O_CONSTANT_5_2<br>+ DCT32_4_DST_ADD_1_MEMBER 0, v22, 0, v24, v26, v28, v30, DCT32_O_CONSTANT_7_3<br>+<br>+ vsetvli zero, zero, e16, m1, ta, ma<br>+ butterfly_widen v0, v7, v8, v16<br>+ butterfly_widen v1, v6, v10, v18<br>+ butterfly_widen v2, v5, v12, v20<br>+ butterfly_widen v3, v4, v14, v22<br>+<br>+ // load 8-11 20-23<br>+ add t0, a0, 2*8<br>+ vlsseg4e16.v v0,(t0), t3<br>+ add t0, a0, 2*20<br>+ vlsseg4e16.v v4,(t0), t3<br>+<br>+ // save E 4 5 6 7<br>+ add a7, a7, t2<br>+ vse32.v v8, (a7)<br>+ add a7, a7, t2<br>+ vse32.v v10, (a7)<br>+ add a7, a7, t2<br>+ vse32.v v12, (a7)<br>+ add a7, a7, t2<br>+ vse32.v v14, (a7)<br>+<br>+ // save O 4 5 6 7<br>+ add a6, a6, t2<br>+ vse32.v v16, (a6)<br>+ add a6, a6, t2<br>+ vse32.v v18, (a6)<br>+ add a6, a6, t2<br>+ vse32.v v20, (a6)<br>+ add a6, a6, t2<br>+ vse32.v v22, (a6)<br>+<br>+ vsetvli zero, zero, e32, m2, ta, ma<br>+ DCT32_4_DST_ADD_1_MEMBER 0, v16, 0, v24, v26, v28, v30, DCT32_O_CONSTANT_9_4<br>+ DCT32_4_DST_ADD_1_MEMBER 0, v18, 0, v24, v26, v28, v30, DCT32_O_CONSTANT_11_5<br>+<br>+ DCT32_4_DST_ADD_1_MEMBER 0, v20, 0, v24, v26, v28, v30, DCT32_O_CONSTANT_13_6<br>+ DCT32_4_DST_ADD_1_MEMBER 0, v22, 0, v24, v26, v28, v30, DCT32_O_CONSTANT_15_7<br>+<br>+ vsetvli zero, zero, e16, m1, ta, ma<br>+ butterfly_widen v0, v7, v8, v16<br>+ butterfly_widen v1, v6, v10, v18<br>+ butterfly_widen v2, v5, v12, v20<br>+ butterfly_widen v3, v4, v14, v22<br>+<br>+ // load 12-15 16-19<br>+ add t0, a0, 2*12<br>+ vlsseg4e16.v v0,(t0), t3<br>+ add t0, a0, 2*16<br>+ vlsseg4e16.v v4,(t0), t3<br>+<br>+ // save E 8 9 10 11<br>+ add a7, a7, t2<br>+ vse32.v v8, 
(a7)<br>+ add a7, a7, t2<br>+ vse32.v v10, (a7)<br>+ add a7, a7, t2<br>+ vse32.v v12, (a7)<br>+ add a7, a7, t2<br>+ vse32.v v14, (a7)<br>+<br>+ // save O 8 9 10 11<br>+ add a6, a6, t2<br>+ vse32.v v16, (a6)<br>+ add a6, a6, t2<br>+ vse32.v v18, (a6)<br>+ add a6, a6, t2<br>+ vse32.v v20, (a6)<br>+ add a6, a6, t2<br>+ vse32.v v22, (a6)<br>+<br>+ vsetvli zero, zero, e32, m2, ta, ma<br>+ DCT32_4_DST_ADD_1_MEMBER 0, v16, 0, v24, v26, v28, v30, DCT32_O_CONSTANT_17_8<br>+ DCT32_4_DST_ADD_1_MEMBER 0, v18, 0, v24, v26, v28, v30, DCT32_O_CONSTANT_19_9<br>+<br>+ DCT32_4_DST_ADD_1_MEMBER 0, v20, 0, v24, v26, v28, v30, DCT32_O_CONSTANT_21_10<br>+ DCT32_4_DST_ADD_1_MEMBER 0, v22, 0, v24, v26, v28, v30, DCT32_O_CONSTANT_23_11<br>+<br>+ vsetvli zero, zero, e16, m1, ta, ma<br>+ butterfly_widen v0, v7, v8, v16<br>+ butterfly_widen v1, v6, v10, v18<br>+ butterfly_widen v2, v5, v12, v20<br>+ butterfly_widen v3, v4, v14, v22<br>+<br>+ // save E 12 13 14 15<br>+ add a7, a7, t2<br>+ vse32.v v8, (a7)<br>+ add a7, a7, t2<br>+ vse32.v v10, (a7)<br>+ add a7, a7, t2<br>+ vse32.v v12, (a7)<br>+ add a7, a7, t2<br>+ vse32.v v14, (a7)<br>+<br>+ vsetvli zero, zero, e32, m2, ta, ma<br>+ DCT32_4_DST_ADD_1_MEMBER 0, v16, 0, v24, v26, v28, v30, DCT32_O_CONSTANT_25_12<br>+ DCT32_4_DST_ADD_1_MEMBER 0, v18, 0, v24, v26, v28, v30, DCT32_O_CONSTANT_27_13<br>+<br>+ DCT32_4_DST_ADD_1_MEMBER 0, v20, 0, v24, v26, v28, v30, DCT32_O_CONSTANT_29_14<br>+ DCT32_4_DST_ADD_1_MEMBER 0, v22, 0, v24, v26, v28, v30, DCT32_O_CONSTANT_31_15<br>+<br>+ vsetvli zero, zero, e16, m1, ta, ma<br>+ DCT32_STORE_L 1, \shift, v24<br>+ DCT32_STORE_L 3, \shift, v26<br>+ DCT32_STORE_L 5, \shift, v28<br>+ DCT32_STORE_L 7, \shift, v30<br>+<br>+<br>+ // cal dst 4-15<br>+ vsetvli zero, zero, e32, m2, ta, ma<br>+ // 12<br>+ DCT32_4_DST_ADD_1_MEMBER 1, v16, 4, v0, v2, v4, v6, DCT32_O_CONSTANT_25_12<br>+ DCT32_4_DST_ADD_1_MEMBER 1, v16, 8, v8, v10, v12, v14, DCT32_O_CONSTANT_25_12<br>+ DCT32_4_DST_ADD_1_MEMBER 1, v16, 12, v24, v26, v28, v30, 
DCT32_O_CONSTANT_25_12<br>+ // reload O0 to v16<br>+ slli t0, t2, 4<br>+ add a6, t5, t0<br>+ vle32.v v16, (a6)<br>+<br>+ // 13<br>+ DCT32_4_DST_ADD_1_MEMBER 0, v18, 4, v0, v2, v4, v6, DCT32_O_CONSTANT_27_13<br>+ DCT32_4_DST_ADD_1_MEMBER 0, v18, 8, v8, v10, v12, v14, DCT32_O_CONSTANT_27_13<br>+ DCT32_4_DST_ADD_1_MEMBER 0, v18, 12, v24, v26, v28, v30, DCT32_O_CONSTANT_27_13<br>+ // reload O1 to v18<br>+ add a6, a6, t2<br>+ vle32.v v18, (a6)<br>+<br>+ // 14<br>+ DCT32_4_DST_ADD_1_MEMBER 0, v20, 4, v0, v2, v4, v6, DCT32_O_CONSTANT_29_14<br>+ DCT32_4_DST_ADD_1_MEMBER 0, v20, 8, v8, v10, v12, v14, DCT32_O_CONSTANT_29_14<br>+ DCT32_4_DST_ADD_1_MEMBER 0, v20, 12, v24, v26, v28, v30, DCT32_O_CONSTANT_29_14<br>+ // reload O2 to v20<br>+ add a6, a6, t2<br>+ vle32.v v20, (a6)<br>+<br>+ // 15<br>+ DCT32_4_DST_ADD_1_MEMBER 0, v22, 4, v0, v2, v4, v6, DCT32_O_CONSTANT_31_15<br>+ DCT32_4_DST_ADD_1_MEMBER 0, v22, 8, v8, v10, v12, v14, DCT32_O_CONSTANT_31_15<br>+ DCT32_4_DST_ADD_1_MEMBER 0, v22, 12, v24, v26, v28, v30, DCT32_O_CONSTANT_31_15<br>+ // reload O3 to v22<br>+ add a6, a6, t2<br>+ vle32.v v22, (a6)<br>+<br>+ DCT32_4_DST_ADD_1_MEMBER 0, v16, 4, v0, v2, v4, v6, DCT32_O_CONSTANT_1_0<br>+ DCT32_4_DST_ADD_1_MEMBER 0, v16, 8, v8, v10, v12, v14, DCT32_O_CONSTANT_1_0<br>+ DCT32_4_DST_ADD_1_MEMBER 0, v16, 12, v24, v26, v28, v30, DCT32_O_CONSTANT_1_0<br>+ // reload O4 to v16<br>+ add a6, a6, t2<br>+ vle32.v v16, (a6)<br>+<br>+ DCT32_4_DST_ADD_1_MEMBER 0, v18, 4, v0, v2, v4, v6, DCT32_O_CONSTANT_3_1<br>+ DCT32_4_DST_ADD_1_MEMBER 0, v18, 8, v8, v10, v12, v14, DCT32_O_CONSTANT_3_1<br>+ DCT32_4_DST_ADD_1_MEMBER 0, v18, 12, v24, v26, v28, v30, DCT32_O_CONSTANT_3_1<br>+ // reload O5 to v18<br>+ add a6, a6, t2<br>+ vle32.v v18, (a6)<br>+<br>+ DCT32_4_DST_ADD_1_MEMBER 0, v20, 4, v0, v2, v4, v6, DCT32_O_CONSTANT_5_2<br>+ DCT32_4_DST_ADD_1_MEMBER 0, v20, 8, v8, v10, v12, v14, DCT32_O_CONSTANT_5_2<br>+ DCT32_4_DST_ADD_1_MEMBER 0, v20, 12, v24, v26, v28, v30, DCT32_O_CONSTANT_5_2<br>+ // reload 
O6 to v20<br>+ add a6, a6, t2<br>+ vle32.v v20, (a6)<br>+<br>+ DCT32_4_DST_ADD_1_MEMBER 0, v22, 4, v0, v2, v4, v6, DCT32_O_CONSTANT_7_3<br>+ DCT32_4_DST_ADD_1_MEMBER 0, v22, 8, v8, v10, v12, v14, DCT32_O_CONSTANT_7_3<br>+ DCT32_4_DST_ADD_1_MEMBER 0, v22, 12, v24, v26, v28, v30, DCT32_O_CONSTANT_7_3<br>+ // reload O7 to v22<br>+ add a6, a6, t2<br>+ vle32.v v22, (a6)<br>+<br>+ DCT32_4_DST_ADD_1_MEMBER 0, v16, 4, v0, v2, v4, v6, DCT32_O_CONSTANT_9_4<br>+ DCT32_4_DST_ADD_1_MEMBER 0, v16, 8, v8, v10, v12, v14, DCT32_O_CONSTANT_9_4<br>+ DCT32_4_DST_ADD_1_MEMBER 0, v16, 12, v24, v26, v28, v30, DCT32_O_CONSTANT_9_4<br>+ // reload O8 to v16<br>+ add a6, a6, t2<br>+ vle32.v v16, (a6)<br>+<br>+ DCT32_4_DST_ADD_1_MEMBER 0, v18, 4, v0, v2, v4, v6, DCT32_O_CONSTANT_11_5<br>+ DCT32_4_DST_ADD_1_MEMBER 0, v18, 8, v8, v10, v12, v14, DCT32_O_CONSTANT_11_5<br>+ DCT32_4_DST_ADD_1_MEMBER 0, v18, 12, v24, v26, v28, v30, DCT32_O_CONSTANT_11_5<br>+ // reload O9 to v18<br>+ add a6, a6, t2<br>+ vle32.v v18, (a6)<br>+<br>+ DCT32_4_DST_ADD_1_MEMBER 0, v20, 4, v0, v2, v4, v6, DCT32_O_CONSTANT_13_6<br>+ DCT32_4_DST_ADD_1_MEMBER 0, v20, 8, v8, v10, v12, v14, DCT32_O_CONSTANT_13_6<br>+ DCT32_4_DST_ADD_1_MEMBER 0, v20, 12, v24, v26, v28, v30, DCT32_O_CONSTANT_13_6<br>+ // reload O10 to v20<br>+ add a6, a6, t2<br>+ vle32.v v20, (a6)<br>+<br>+ DCT32_4_DST_ADD_1_MEMBER 0, v22, 4, v0, v2, v4, v6, DCT32_O_CONSTANT_15_7<br>+ DCT32_4_DST_ADD_1_MEMBER 0, v22, 8, v8, v10, v12, v14, DCT32_O_CONSTANT_15_7<br>+ DCT32_4_DST_ADD_1_MEMBER 0, v22, 12, v24, v26, v28, v30, DCT32_O_CONSTANT_15_7<br>+ // reload O11 to v22<br>+ add a6, a6, t2<br>+ vle32.v v22, (a6)<br>+<br>+<br>+ DCT32_4_DST_ADD_1_MEMBER 0, v16, 4, v0, v2, v4, v6, DCT32_O_CONSTANT_17_8<br>+ DCT32_4_DST_ADD_1_MEMBER 0, v16, 8, v8, v10, v12, v14, DCT32_O_CONSTANT_17_8<br>+ DCT32_4_DST_ADD_1_MEMBER 0, v16, 12, v24, v26, v28, v30, DCT32_O_CONSTANT_17_8<br>+<br>+ // reload E 0 to v16<br>+ add a7, t5, zero<br>+ vle32.v v16, (a7)<br>+<br>+ 
DCT32_4_DST_ADD_1_MEMBER 0, v18, 4, v0, v2, v4, v6, DCT32_O_CONSTANT_19_9<br>+ DCT32_4_DST_ADD_1_MEMBER 0, v18, 8, v8, v10, v12, v14, DCT32_O_CONSTANT_19_9<br>+ DCT32_4_DST_ADD_1_MEMBER 0, v18, 12, v24, v26, v28, v30, DCT32_O_CONSTANT_19_9<br>+ // reload E1 to v18<br>+ add a7, a7, t2<br>+ vle32.v v18, (a7)<br>+<br>+ DCT32_4_DST_ADD_1_MEMBER 0, v20, 4, v0, v2, v4, v6, DCT32_O_CONSTANT_21_10<br>+ DCT32_4_DST_ADD_1_MEMBER 0, v20, 8, v8, v10, v12, v14, DCT32_O_CONSTANT_21_10<br>+ DCT32_4_DST_ADD_1_MEMBER 0, v20, 12, v24, v26, v28, v30, DCT32_O_CONSTANT_21_10<br>+ // reload E2 to v20<br>+ add a7, a7, t2<br>+ vle32.v v20, (a7)<br>+<br>+ DCT32_4_DST_ADD_1_MEMBER 0, v22, 4, v0, v2, v4, v6, DCT32_O_CONSTANT_23_11<br>+<br>+ vsetvli zero, zero, e16, m1, ta, ma<br>+ // write 9 11 13 15<br>+ DCT32_STORE_L 9, \shift, v0<br>+ DCT32_STORE_L 11, \shift, v2<br>+ DCT32_STORE_L 13, \shift, v4<br>+ DCT32_STORE_L 15, \shift, v6<br>+<br>+ // reload E3 to v0<br>+ add a7, a7, t2<br>+ vle32.v v0, (a7)<br>+ // reload E12 to v2<br>+ add a7, a7, t2<br>+ sh3add a7, t2, a7<br>+ vle32.v v2, (a7)<br>+ // reload E13 to v4<br>+ add a7, a7, t2<br>+ vle32.v v4, (a7)<br>+ // reload E14 to v6<br>+ add a7, a7, t2<br>+ vle32.v v6, (a7)<br>+<br>+ vsetvli zero, zero, e32, m2, ta, ma<br>+ DCT32_4_DST_ADD_1_MEMBER 0, v22, 8, v8, v10, v12, v14, DCT32_O_CONSTANT_23_11<br>+ // write 17 19 21 23<br>+ vsetvli zero, zero, e16, m1, ta, ma<br>+ DCT32_STORE_L 17, \shift, v8<br>+ DCT32_STORE_L 19, \shift, v10<br>+ DCT32_STORE_L 21, \shift, v12<br>+ DCT32_STORE_L 23, \shift, v14<br>+<br>+ // reload E15 to v8<br>+ add a7, a7, t2<br>+ vle32.v v8, (a7)<br>+<br>+ vsetvli zero, zero, e32, m2, ta, ma<br>+ DCT32_4_DST_ADD_1_MEMBER 0, v22, 12, v24, v26, v28, v30, DCT32_O_CONSTANT_23_11<br>+ vsetvli zero, zero, e16, m1, ta, ma<br>+ // write 25 27 29 31<br>+ DCT32_STORE_L 25, \shift, v24<br>+ DCT32_STORE_L 27, \shift, v26<br>+ DCT32_STORE_L 29, \shift, v28<br>+ DCT32_STORE_L 31, \shift, v30<br>+<br>+ vsetvli zero, zero, e32, m2, 
ta, ma<br>+ // cal E 3 12 EE EO 3<br>+ butterfly v0, v2, v10, v0<br>+ // save EE 3<br>+ slli t0, t2, 4<br>+ add a6, t5, t0<br>+ vse32.v v10, (a6)<br>+ // reload E 4<br>+ sh2add a7, t2, t5<br>+ vle32.v v10, (a7)<br>+<br>+ // cal dst 2 6 10 14<br>+ DCT32_4_DST_ADD_1_MEMBER 1, v0, 0, v24 v26 v28 v30, DCT32_EO_CONSTANT_14_3<br>+<br>+ // cal E 2 13 EE EO 2<br>+ butterfly v20, v4, v12, v20<br>+ // save EE 2<br>+ add a6, a6, t2<br>+ vse32.v v12, (a6)<br>+ // reload E 5<br>+ add a7, a7, t2<br>+ vle32.v v12, (a7)<br>+<br>+ DCT32_4_DST_ADD_1_MEMBER 0, v20, 0, v24 v26 v28 v30, DCT32_EO_CONSTANT_10_2<br>+<br>+ // cal E 1 14 EE EO 1<br>+ butterfly v18, v6, v14, v18<br>+ // save EE 1<br>+ add a6, a6, t2<br>+ vse32.v v14, (a6)<br>+ // reload E 6<br>+ add a7, a7, t2<br>+ vle32.v v14, (a7)<br>+ DCT32_4_DST_ADD_1_MEMBER 0, v18, 0, v24 v26 v28 v30, DCT32_EO_CONSTANT_6_1<br>+<br>+ // cal E 0 15 EE EO 0<br>+ butterfly v16, v8, v22, v16<br>+ // save EE 0<br>+ add a6, a6, t2<br>+ vse32.v v22, (a6)<br>+ // reload E 7<br>+ add a7, a7, t2<br>+ vle32.v v22, (a7)<br>+<br>+ DCT32_4_DST_ADD_1_MEMBER 0, v16, 0, v24 v26 v28 v30, DCT32_EO_CONSTANT_2_0<br>+<br>+ // cal dst 18 22 26 30<br>+ DCT32_4_DST_ADD_1_MEMBER 1, v0, 4, v2 v4 v6 v8, DCT32_EO_CONSTANT_14_3<br>+ // reload E 8 v0<br>+ add a7, a7, t2<br>+ vle32.v v0, (a7)<br>+<br>+ DCT32_4_DST_ADD_1_MEMBER 0, v20, 4, v2 v4 v6 v8, DCT32_EO_CONSTANT_10_2<br>+ // reload E 9 v20<br>+ add a7, a7, t2<br>+ vle32.v v20, (a7)<br>+<br>+ DCT32_4_DST_ADD_1_MEMBER 0, v18, 4, v2 v4 v6 v8, DCT32_EO_CONSTANT_6_1<br>+ // reload E 10 v18<br>+ add a7, a7, t2<br>+ vle32.v v18, (a7)<br>+<br>+ DCT32_4_DST_ADD_1_MEMBER 0, v16, 4, v2 v4 v6 v8, DCT32_EO_CONSTANT_2_0<br>+<br>+<br>+ // cal E 7 8 EE EO 7<br>+ butterfly v22, v0, v16, v22<br>+ // reload E 11 v0<br>+ add a7, a7, t2<br>+ vle32.v v0, (a7)<br>+ DCT32_4_DST_ADD_1_MEMBER 0, v22, 0, v24 v26 v28 v30, DCT32_EO_CONSTANT_30_7<br>+ DCT32_4_DST_ADD_1_MEMBER 0, v22, 4, v2 v4 v6 v8, DCT32_EO_CONSTANT_30_7<br>+<br>+ // cal E 
6 9 EE EO 6<br>+ butterfly v14, v20, v22, v14<br>+ // reload EE 0 v20<br>+ vle32.v v20, (a6)<br>+<br>+ DCT32_4_DST_ADD_1_MEMBER 0, v14, 0, v24 v26 v28 v30, DCT32_EO_CONSTANT_26_6<br>+ DCT32_4_DST_ADD_1_MEMBER 0, v14, 4, v2 v4 v6 v8, DCT32_EO_CONSTANT_26_6<br>+<br>+ // cal E 5 10 EE EO 5<br>+ butterfly v12, v18, v14, v12<br>+<br>+ // reload EE 1 v18<br>+ sub a6, a6, t2<br>+ vle32.v v18, (a6)<br>+<br>+ DCT32_4_DST_ADD_1_MEMBER 0, v12, 0, v24 v26 v28 v30, DCT32_EO_CONSTANT_22_5<br>+ DCT32_4_DST_ADD_1_MEMBER 0, v12, 4, v2 v4 v6 v8, DCT32_EO_CONSTANT_22_5<br>+ // load EE 1 v18<br>+<br>+ // cal E 4 11 EE EO 4<br>+ butterfly v10, v0, v12, v10<br>+ // reload EE 2 v0<br>+ sub a6, a6, t2<br>+ vle32.v v0, (a6)<br>+ DCT32_4_DST_ADD_1_MEMBER 0, v10, 0, v24 v26 v28 v30, DCT32_EO_CONSTANT_18_4<br>+ DCT32_4_DST_ADD_1_MEMBER 0, v10, 4, v2 v4 v6 v8, DCT32_EO_CONSTANT_18_4<br>+ // reload EE 3 v10<br>+ sub a6, a6, t2<br>+ vle32.v v10, (a6)<br>+<br>+ //write dst 2 6 10 14 18 22 26 30<br>+ vsetvli zero, zero, e16, m1, ta, ma<br>+ DCT32_STORE_L 2, \shift, v24<br>+ DCT32_STORE_L 6, \shift, v26<br>+ DCT32_STORE_L 10, \shift, v28<br>+ DCT32_STORE_L 14, \shift, v30<br>+<br>+ DCT32_STORE_L 18, \shift, v2<br>+ DCT32_STORE_L 22, \shift, v4<br>+ DCT32_STORE_L 26, \shift, v6<br>+ DCT32_STORE_L 30, \shift, v8<br>+<br>+ vsetvli zero, zero, e32, m2, ta, ma<br>+ // EE 0-7 ready in register<br>+<br>+ // EE 3 4 EEE EEO 3<br>+ butterfly v10, v12, v28, v26<br>+ // EE 1 6 EEE EEO 1<br>+ butterfly v18, v22, v24, v22<br>+ // EE 2 5 EEE EEO 2<br>+ butterfly v0, v14, v30, v10<br>+ // EE 0 7 EEE EEO 0<br>+ butterfly v20, v16, v14, v12<br>+<br>+<br>+ // EEO[0-3] v12 v22 v10 v26<br>+ //dst 4 12 20 28<br>+ DCT32_EEO_CAL v4, 89, 75, 50, 18, v12, v22, v10, v26, 4, \shift<br>+ DCT32_EEO_CAL v8, 75, -18, -89, -50, v12, v22, v10, v26, 12, \shift<br>+ DCT32_EEO_CAL v6, 50, -89, 18, 75, v12, v22, v10, v26, 20, \shift<br>+ DCT32_EEO_CAL v16, 18, -50, 75, -89, v12, v22, v10, v26, 28, \shift<br>+<br>+ vsetvli zero, zero, 
e16, m1, ta, ma<br>+<br>+ DCT32_STORE_L 4, \shift, v4<br>+ DCT32_STORE_L 12, \shift, v8<br>+ DCT32_STORE_L 20, \shift, v6<br>+ DCT32_STORE_L 28, \shift, v16<br>+<br>+ vsetvli zero, zero, e32, m2, ta, ma<br>+ # EEEE[0] = EEE[0] + EEE[3];<br>+ # EEEO[0] = EEE[0] - EEE[3];<br>+ butterfly v14, v28, v16, v20<br>+ # EEEE[1] = EEE[1] + EEE[2];<br>+ # EEEO[1] = EEE[1] - EEE[2];<br>+ butterfly v24, v30, v2, v4<br>+<br>+<br>+ # dst[0] = (int16_t)((g_t32[0][0] * EEEE[0] + g_t32[0][1] * EEEE[1] + add) >> shift);<br>+ // 64 64<br>+ li a2, 64<br>+ li a3, 64<br>+ vmul.vx v18, v16, a2<br>+ vmacc.vx v18, a3, v2<br>+ # dst[8 * line] = (int16_t)((g_t32[8][0] * EEEO[0] + g_t32[8][1] * EEEO[1] + add) >> shift);<br>+ // 83 36<br>+ li a2, 83<br>+ li a3, 36<br>+ vmul.vx v6, v20, a2<br>+ vmacc.vx v6, a3, v4<br>+ # dst[16 * line] = (int16_t)((g_t32[16][0] * EEEE[0] + g_t32[16][1] * EEEE[1] + add) >> shift);<br>+ // 64 -64<br>+ li a2, 64<br>+ li a3, -64<br>+ vmul.vx v8, v16, a2<br>+ vmacc.vx v8, a3, v2<br>+ # dst[24 * line] = (int16_t)((g_t32[24][0] * EEEO[0] + g_t32[24][1] * EEEO[1] + add) >> shift);<br>+ // 36 -83<br>+ li a2, 36<br>+ li a3, -83<br>+ vmul.vx v10, v20, a2<br>+ vmacc.vx v10, a3, v4<br>+<br>+ //write dst 0 8 16 24<br>+ vsetvli zero, zero, e16, m1, ta, ma<br>+ DCT32_STORE_L 0, \shift, v18<br>+ DCT32_STORE_L 8, \shift, v6<br>+ DCT32_STORE_L 16, \shift, v8<br>+ DCT32_STORE_L 24, \shift, v10<br>+<br>+ ret<br>+endfunc<br>+.endm<br>+<br>+tr_32xN_rvv firstpass, dct32_shift_1<br>+tr_32xN_rvv secondpass, dct32_shift_2<br>+<br>+.macro DCT_N size<br>+function PFX(dct_\size\()_v_opt)<br>+ .option arch, +zba<br>+<br>+ addi sp, sp, -16<br>+ sx ra, (sp)<br>+<br>+ mv t6, a1<br>+ csrwi vxrm, 0<br>+<br>+ li t1, 32<br>+ vsetvli t4, t1, e16, m1, ta, ma<br>+<br>+ li t0, 4096<br>+ // temp stack address<br>+ sub t5, sp, t0<br>+ li t0, 2048<br>+ sub sp, t5, t0<br>+<br>+ // a0<br>+ mv a1, sp<br>+ slli t3, a2, 1<br>+1:<br>+ jal func_tr_32xN_firstpass_rvv<br>+ mul t0, t4, t3<br>+ add a0, a0, t0<br>+ 
slli t0, t4, 1<br>+ add a1, a1, t0<br>+ sub t1, t1, t4<br>+ bnez t1, 1b<br>+<br>+ li t1, 32<br>+ mv a0, sp<br>+ mv a1, t6<br>+ li t3, 64<br>+1:<br>+ jal func_tr_32xN_secondpass_rvv<br>+ slli t0, t4, 6<br>+ add a0, a0, t0<br>+ slli t0, t4, 1<br>+ add a1, a1, t0<br>+ sub t1, t1, t4<br>+ bnez t1, 1b<br>+<br>+2:<br>+ li t0, 4096+2048<br>+ add sp, sp, t0<br>+ lx ra, (sp)<br>+ addi sp, sp, 16<br>+<br>+ ret<br>+endfunc<br>+.endm<br>+<br>+DCT_N 32<br>diff --git a/source/common/riscv64/fun-decls.h b/source/common/riscv64/fun-decls.h<br>index ec04d9968..7ffb32e65 100644<br>--- a/source/common/riscv64/fun-decls.h<br>+++ b/source/common/riscv64/fun-decls.h<br>@@ -123,6 +123,7 @@ FUNCDEF_TU_S(void, cpy1Dto2D_shr, v, int16_t* dst, const int16_t* src, intptr_t<br> FUNCDEF_TU_S(void, ssimDist, v, const pixel *fenc, uint32_t fStride, const pixel *recon, intptr_t rstride, uint64_t *ssBlock, int shift, uint64_t *ac_k);<br> FUNCDEF_TU_S(void, idct, v, const int16_t* src, int16_t* dst, intptr_t dstStride);<br> FUNCDEF_TU_S(void, dct, v, const int16_t* src, int16_t* dst, intptr_t srcStride);<br>+FUNCDEF_TU_S(void, dct, v_opt, const int16_t* src, int16_t* dst, intptr_t srcStride);<br> FUNCDEF_TU_S(void, getResidual, v, const pixel* fenc, const pixel* pred, int16_t* residual, intptr_t stride);<br> <br> FUNCDEF_TU_S2(void, intra_pred_planar, rvv, pixel* dst, intptr_t dstride, const pixel* srcPix, int, int);</div></div></div></div></div></blockquote></div></body></html>