<div dir="ltr">This patch has been pushed to the master branch. <br clear="all"><div><div dir="ltr" class="gmail_signature" data-smartmail="gmail_signature"><div dir="ltr"><div><b>__________________________</b></div><div><b>Karam Singh</b></div><div><b>Ph.D. IIT Guwahati</b></div><div><font size="1">Senior Software (Video Coding) Engineer  </font></div><div><font size="1">Mobile: +91 8011279030</font></div><div><font size="1">Block 9A, 6th floor, DLF Cyber City</font></div><div><font size="1">Manapakkam, Chennai 600 089</font></div></div></div></div><br></div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Fri, Sep 6, 2024 at 4:16 PM Wei Chen &lt;<a href="mailto:Wei.Chen@arm.com">Wei.Chen@arm.com</a>&gt; wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">Starting from b3ce1c32da303c92241e85bd69298ab9903c0126,<br>

we observed that the video streams generated by x265<br>

compiled with CLANG and GCC, using the same parameters<br>

and input video, are inconsistent.<br>

<br>

Encoding logs for CLANG compiled x265:<br>

x265 [info]: frame I:      2, Avg QP:37.79  kb/s: 72220.75<br>

x265 [info]: frame P:     20, Avg QP:37.75  kb/s: 74146.30<br>

x265 [info]: frame B:     76, Avg QP:39.75  kb/s: 67231.66<br>

<br>

Encoding logs for GCC compiled x265:<br>

x265 [info]: frame I:      2, Avg QP:37.79  kb/s: 72220.75<br>

x265 [info]: frame P:     20, Avg QP:37.75  kb/s: 74022.17<br>

x265 [info]: frame B:     76, Avg QP:39.75  kb/s: 67220.26<br>

<br>

This is because a macro LOAD_DIFF_16x4_sve2 was introduced in<br>

commit b3ce1c32da303c92241e85bd69298ab9903c0126 for AArch64<br>

(this macro has since been renamed to LOAD_DIFF_16x4_sve) and<br>

is used by SATD kernels. The macro LOAD_DIFF_16x4_sve uses the<br>

callee-saved SVE registers z8 to z15 without saving or restoring them.<br>

<br>

Unfortunately, in MotionEstimate, a caller of SATD,<br>

CLANG generates code that uses s10. Here is a simplified<br>

code snippet:<br>

<br>

        fmov  s10, #0.50000000<br>

        bl subpelCompare // will invoke SATD<br>

        fadd  s0, s0, s10<br>

<br>

Thus, the content of s10 used in MotionEstimate is overwritten<br>

by the SATD function, which affects the correctness of subsequent<br>

video encoding.<br>

<br>

We did not observe GCC generating similar code that uses s8 to s15<br>

in MotionEstimate. We can also steer CLANG away from s10 in<br>

MotionEstimate by passing -mcpu=neoverse-n2, which makes the video<br>

streams consistent.<br>

However, such compiler-dependent behavior is unreliable.<br>
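
<br>

For reference, AAPCS64 requires a callee to preserve the low 64 bits of<br>
v8 to v15 (d8 to d15). A routine that really needs z8 to z15 would therefore<br>
have to spill and reload d8 to d15 itself. The following is only a minimal<br>
sketch of that alternative under this assumption; the label name is<br>
hypothetical and the code is not part of this patch:<br>

```asm
// Hypothetical routine (illustration only, not x265 code): preserve
// d8-d15, the callee-saved low halves of v8-v15, around a body that
// clobbers z8-z15.
example_preserve_d8_d15:
    stp     d8,  d9,  [sp, #-64]!   // allocate 64 bytes and save d8/d9
    stp     d10, d11, [sp, #16]
    stp     d12, d13, [sp, #32]
    stp     d14, d15, [sp, #48]

    // ... body that is free to use z8-z15 ...

    ldp     d14, d15, [sp, #48]
    ldp     d12, d13, [sp, #32]
    ldp     d10, d11, [sp, #16]
    ldp     d8,  d9,  [sp], #64     // restore d8/d9 and release the 64 bytes
    ret
```

This costs eight extra memory instructions per call, which is the stack<br>
traffic that reusing caller-saved registers avoids.<br>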

<br>

Therefore, in this patch, we avoid using z8 to z15 and extra stack<br>

operations by reusing z0 to z7 in LOAD_DIFF_16x4_sve. Here are the<br>

test results after applying the patch:<br>

<br>

Encoding logs for CLANG compiled x265:<br>

x265 [info]: frame I:      2, Avg QP:37.79  kb/s: 72220.75<br>

x265 [info]: frame P:     20, Avg QP:37.75  kb/s: 74022.17<br>

x265 [info]: frame B:     76, Avg QP:39.75  kb/s: 67220.26<br>

<br>

Encoding logs for GCC compiled x265:<br>

x265 [info]: frame I:      2, Avg QP:37.79  kb/s: 72220.75<br>

x265 [info]: frame P:     20, Avg QP:37.75  kb/s: 74022.17<br>

x265 [info]: frame B:     76, Avg QP:39.75  kb/s: 67220.26<br>

<br>

Change-Id: Ic0e0c0706b99e53b138cceb758ee6ce148130e4b<br>

---<br>

  source/common/aarch64/pixel-util-sve.S | 34 +++++++++++++-------------<br>

  1 file changed, 17 insertions(+), 17 deletions(-)<br>

<br>

diff --git a/source/common/aarch64/pixel-util-sve.S b/source/common/aarch64/pixel-util-sve.S<br>

index 3d073d42e..106ba903a 100644<br>

--- a/source/common/aarch64/pixel-util-sve.S<br>

+++ b/source/common/aarch64/pixel-util-sve.S<br>

@@ -190,27 +190,27 @@ endfunc<br>

      ld1b            {z7.h}, p0/z, [x2, x11]<br>

      add             x0, x0, x1<br>

      add             x2, x2, x3<br>

-    ld1b            {z29.h}, p0/z, [x0]<br>

-    ld1b            {z9.h}, p0/z, [x0, x11]<br>

-    ld1b            {z10.h}, p0/z, [x2]<br>

-    ld1b            {z11.h}, p0/z, [x2, x11]<br>

-    add             x0, x0, x1<br>

-    add             x2, x2, x3<br>

-    ld1b            {z12.h}, p0/z, [x0]<br>

-    ld1b            {z13.h}, p0/z, [x0, x11]<br>

-    ld1b            {z14.h}, p0/z, [x2]<br>

-    ld1b            {z15.h}, p0/z, [x2, x11]<br>

-    add             x0, x0, x1<br>

-    add             x2, x2, x3<br>

-<br>

      sub             \v0\().h, z0.h, z2.h<br>

      sub             \v4\().h, z1.h, z3.h<br>

      sub             \v1\().h, z4.h, z6.h<br>

      sub             \v5\().h, z5.h, z7.h<br>

-    sub             \v2\().h, z29.h, z10.h<br>

-    sub             \v6\().h, z9.h, z11.h<br>

-    sub             \v3\().h, z12.h, z14.h<br>

-    sub             \v7\().h, z13.h, z15.h<br>

+<br>

+    ld1b            {z0.h}, p0/z, [x0]<br>

+    ld1b            {z1.h}, p0/z, [x0, x11]<br>

+    ld1b            {z2.h}, p0/z, [x2]<br>

+    ld1b            {z3.h}, p0/z, [x2, x11]<br>

+    add             x0, x0, x1<br>

+    add             x2, x2, x3<br>

+    ld1b            {z4.h}, p0/z, [x0]<br>

+    ld1b            {z5.h}, p0/z, [x0, x11]<br>

+    ld1b            {z6.h}, p0/z, [x2]<br>

+    ld1b            {z7.h}, p0/z, [x2, x11]<br>

+    add             x0, x0, x1<br>

+    add             x2, x2, x3<br>

+    sub             \v2\().h, z0.h, z2.h<br>

+    sub             \v6\().h, z1.h, z3.h<br>

+    sub             \v3\().h, z4.h, z6.h<br>

+    sub             \v7\().h, z5.h, z7.h<br>

  .endm<br>

<br>

  // one vertical hadamard pass and two horizontal<br>

-- <br>

2.34.1<br>

</blockquote></div>