[x265] [arm64] port ssim_4x4x2_core

Mon Aug 9 17:06:05 UTC 2021

Hi Min Chen,

Thanks for the suggestion.
I tried several patches to avoid using uadalp, but all of my changes produced slower code.
For example:

Before:
    ssim_4x4x2_core  30.69x   13.39   410.85
After:
    ssim_4x4x2_core  27.33x   15.01   410.15

with the following change:

@@ -1707,31 +1707,37 @@ function x265_ssim_4x4x2_core_neon
     uaddl           v29.8h, v4.8b, v5.8b
     uaddlp          v30.4s, v16.8h
     uaddlp          v31.4s, v24.8h
+    uaddlp          v0.4s, v17.8h
+    uaddlp          v1.4s, v18.8h
+    uaddlp          v4.4s, v19.8h
+    uaddlp          v5.4s, v20.8h

     uaddw           v28.8h, v28.8h, v2.8b
     uaddw           v29.8h, v29.8h, v6.8b
-    uadalp          v30.4s, v17.8h
+    add             v0.4s, v0.4s, v1.4s
+    add             v4.4s, v4.4s, v5.4s
+    uaddlp          v2.4s, v21.8h
+    uaddlp          v6.4s, v22.8h
+    add             v30.4s, v30.4s, v0.4s
+    add             v2.4s, v2.4s, v4.4s
     uadalp          v31.4s, v25.8h

     uaddw           v28.8h, v28.8h, v3.8b
     uaddw           v29.8h, v29.8h, v7.8b
-    uadalp          v30.4s, v18.8h
+    add             v30.4s, v30.4s, v6.4s
+    uaddlp          v3.4s, v23.8h
+    add             v30.4s, v30.4s, v2.4s
     uadalp          v31.4s, v26.8h

     uaddlp          v28.4s, v28.8h
     uaddlp          v29.4s, v29.8h
-    uadalp          v30.4s, v19.8h
+    add             v30.4s, v30.4s, v3.4s
     uadalp          v31.4s, v27.8h

     addp            v28.4s, v28.4s, v28.4s
     addp            v29.4s, v29.4s, v29.4s
-    uadalp          v30.4s, v20.8h
-    addp            v31.4s, v31.4s, v31.4s
-
-    uadalp          v30.4s, v21.8h
-    uadalp          v30.4s, v22.8h
-    uadalp          v30.4s, v23.8h
     addp            v30.4s, v30.4s, v30.4s
+    addp            v31.4s, v31.4s, v31.4s

     st4             {v28.2s, v29.2s, v30.2s, v31.2s}, [x4]
     ret

I think it is fine to use uadalp, as the instruction is pipelined.
The Neoverse-N1 SWOG (Software Optimization Guide) gives the latency for UADALP as “4(1)”,
with Note 2: “Other accumulate pipelines also support late-forwarding of accumulate operands from similar μOPs, allowing a typical sequence of such μOPs to issue one every cycle (accumulate latency shown in parentheses).”
Each accumulate in the reduction chain therefore takes only 1 cycle, which is hard to beat with plain “add” operations that take 2 cycles each, as in the above change.
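
For reference, here is a minimal scalar sketch in C of why the two variants compute the same sums: uadalp is a uaddlp whose result is added into the destination, so a chain of uadalp can be replaced by separate uaddlp results combined with plain adds. The function names (uaddlp_model, uadalp_model, add_model, reductions_agree) and the test data are hypothetical and only model the lane arithmetic; they say nothing about timing, and the grouping of the adds is illustrative rather than a copy of the change above.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Scalar model of UADDLP Vd.4S, Vn.8H: each 32-bit lane of d receives
 * the sum of one adjacent pair of unsigned 16-bit lanes of n. */
static void uaddlp_model(uint32_t d[4], const uint16_t n[8])
{
    for (int i = 0; i < 4; i++)
        d[i] = (uint32_t)n[2 * i] + (uint32_t)n[2 * i + 1];
}

/* Scalar model of UADALP Vd.4S, Vn.8H: the same pairwise sum,
 * accumulated into the existing contents of d. */
static void uadalp_model(uint32_t d[4], const uint16_t n[8])
{
    for (int i = 0; i < 4; i++)
        d[i] += (uint32_t)n[2 * i] + (uint32_t)n[2 * i + 1];
}

/* Scalar model of ADD Vd.4S, Vn.4S, Vm.4S (lane-wise 32-bit add). */
static void add_model(uint32_t d[4], const uint32_t a[4], const uint32_t b[4])
{
    for (int i = 0; i < 4; i++)
        d[i] = a[i] + b[i];
}

/* Reduce eight 8x16-bit rows in two ways and compare the results.
 * Returns 1 when both reductions agree, 0 otherwise.
 * The input values are arbitrary (assumed) test data. */
static int reductions_agree(void)
{
    uint16_t in[8][8];
    for (unsigned r = 0; r < 8; r++)
        for (unsigned c = 0; c < 8; c++)
            in[r][c] = (uint16_t)(r * 8191u + c * 257u + 1u);

    /* Variant 1: one uaddlp followed by a chain of seven uadalp. */
    uint32_t chain[4];
    uaddlp_model(chain, in[0]);
    for (int r = 1; r < 8; r++)
        uadalp_model(chain, in[r]);

    /* Variant 2: eight independent uaddlp, combined with plain adds. */
    uint32_t part[8][4];
    for (int r = 0; r < 8; r++)
        uaddlp_model(part[r], in[r]);
    uint32_t tree[4];
    memcpy(tree, part[0], sizeof(tree));
    for (int r = 1; r < 8; r++)
        add_model(tree, tree, part[r]);

    return memcmp(chain, tree, sizeof(chain)) == 0;
}
```

Calling reductions_agree() returns 1 for this data, since unsigned 32-bit addition is associative and commutative. The two variants differ only in instruction count and dependency structure, which is what the benchmark numbers measure.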

I was able to improve performance slightly by starting the longest reduction chain earlier.
See the attached patch.
Before:
    ssim_4x4x2_core  30.69x   13.39   410.85
After:
    ssim_4x4x2_core  31.03x   13.32   413.45

Sebastian

From: x265-devel <x265-devel-bounces at videolan.org> on behalf of chen <chenm003 at 163.com>
Reply-To: Development for x265 <x265-devel at videolan.org>
Date: Saturday, August 7, 2021 at 2:20 AM
To: Development for x265 <x265-devel at videolan.org>
Subject: RE: [EXTERNAL] [x265] [arm64] port ssim_4x4x2_core


Hi,

Code looks good.
The only comment is that UADALP is slower; we can adjust the order of the sums to avoid it.


Regards,
Min Chen



On 2021-08-07 02:01:13, "Pop, Sebastian" <spop at amazon.com> wrote:
Hi,
The attached patch ports the following kernel to arm64:

ssim_4x4x2_core  30.69x   13.39           410.85

Ok to commit?

Thanks,
Sebastian

-------------- next part --------------
A non-text attachment was scrubbed...
Name: 0001-arm64-port-ssim_4x4x2_core.patch
Type: application/octet-stream
Size: 3481 bytes
Desc: 0001-arm64-port-ssim_4x4x2_core.patch
URL: <http://mailman.videolan.org/pipermail/x265-devel/attachments/20210809/2d5c8e3b/attachment-0001.obj>