[x265] [arm64] port ssim_4x4x2_core
Pop, Sebastian
spop at amazon.com
Mon Aug 9 17:06:05 UTC 2021
Hi Min Chen,
Thanks for the suggestion.
I tried several patches to avoid using uadalp; however, all of my changes produced slower code.
For example,
Before:
ssim_4x4x2_core 30.69x 13.39 410.85
After:
ssim_4x4x2_core 27.33x 15.01 410.15
with the following change:
@@ -1707,31 +1707,37 @@ function x265_ssim_4x4x2_core_neon
uaddl v29.8h, v4.8b, v5.8b
uaddlp v30.4s, v16.8h
uaddlp v31.4s, v24.8h
+ uaddlp v0.4s, v17.8h
+ uaddlp v1.4s, v18.8h
+ uaddlp v4.4s, v19.8h
+ uaddlp v5.4s, v20.8h
uaddw v28.8h, v28.8h, v2.8b
uaddw v29.8h, v29.8h, v6.8b
- uadalp v30.4s, v17.8h
+ add v0.4s, v0.4s, v1.4s
+ add v4.4s, v4.4s, v5.4s
+ uaddlp v2.4s, v21.8h
+ uaddlp v6.4s, v22.8h
+ add v30.4s, v30.4s, v0.4s
+ add v2.4s, v2.4s, v4.4s
uadalp v31.4s, v25.8h
uaddw v28.8h, v28.8h, v3.8b
uaddw v29.8h, v29.8h, v7.8b
- uadalp v30.4s, v18.8h
+ add v30.4s, v30.4s, v6.4s
+ uaddlp v3.4s, v23.8h
+ add v30.4s, v30.4s, v2.4s
uadalp v31.4s, v26.8h
uaddlp v28.4s, v28.8h
uaddlp v29.4s, v29.8h
- uadalp v30.4s, v19.8h
+ add v30.4s, v30.4s, v3.4s
uadalp v31.4s, v27.8h
addp v28.4s, v28.4s, v28.4s
addp v29.4s, v29.4s, v29.4s
- uadalp v30.4s, v20.8h
- addp v31.4s, v31.4s, v31.4s
-
- uadalp v30.4s, v21.8h
- uadalp v30.4s, v22.8h
- uadalp v30.4s, v23.8h
addp v30.4s, v30.4s, v30.4s
+ addp v31.4s, v31.4s, v31.4s
st4 {v28.2s, v29.2s, v30.2s, v31.2s}, [x4]
ret
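To make the reordering in the change above easier to follow, here is a minimal scalar sketch of the two summation orders. It is only a model: the helper names (uaddlp, uadalp, add, chain_sum, tree_sum) are hypothetical Python stand-ins for the NEON instructions, operating on lists of 8 unsigned 16-bit lanes, and the tree order simply mirrors the order of the adds in the diff. It shows that both orders accumulate the same eight inputs (v16..v23) into v30.

```python
MASK32 = 0xFFFFFFFF  # results live in 32-bit lanes


def uaddlp(src):
    """Model of UADDLP Vd.4S, Vn.8H: add adjacent pairs of 16-bit lanes into 32-bit lanes."""
    return [(src[2 * i] + src[2 * i + 1]) & MASK32 for i in range(len(src) // 2)]


def uadalp(acc, src):
    """Model of UADALP Vd.4S, Vn.8H: pairwise widening add, accumulated onto acc."""
    return [(a + p) & MASK32 for a, p in zip(acc, uaddlp(src))]


def add(a, b):
    """Model of ADD Vd.4S, Vn.4S, Vm.4S: lane-wise 32-bit add."""
    return [(x + y) & MASK32 for x, y in zip(a, b)]


def chain_sum(v):
    """Original order: one uaddlp followed by seven uadalp into the same register."""
    acc = uaddlp(v[0])
    for src in v[1:]:
        acc = uadalp(acc, src)
    return acc


def tree_sum(v):
    """Order used in the experimental change; v[0]..v[7] stand for v16..v23."""
    v30 = uaddlp(v[0])
    v0, v1 = uaddlp(v[1]), uaddlp(v[2])
    v4, v5 = uaddlp(v[3]), uaddlp(v[4])
    v0 = add(v0, v1)
    v4 = add(v4, v5)
    v2, v6 = uaddlp(v[5]), uaddlp(v[6])
    v30 = add(v30, v0)
    v2 = add(v2, v4)
    v30 = add(v30, v6)
    v3 = uaddlp(v[7])
    v30 = add(v30, v2)
    v30 = add(v30, v3)
    return v30


# Arbitrary test data: eight vectors of eight 16-bit lanes each.
inputs = [[(i * 7919 + j * 104729) % 65536 for j in range(8)] for i in range(8)]
same_result = chain_sum(inputs) == tree_sum(inputs)
```

The two orders differ only in how the partial sums are grouped, so any speed difference comes from instruction scheduling and latency, not from the amount of data summed.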
I think it is fine to use uadalp as the instruction is pipelined.
The Neoverse N1 Software Optimization Guide (SWOG) lists the latency of UADALP as “4(1)”, with Note 2: “Other accumulate pipelines also support late-forwarding of accumulate operands from similar μOPs, allowing a typical sequence of such μOPs to issue one every cycle (accumulate latency shown in parentheses).”
The accumulate in the reduction chain only takes 1 cycle, which is hard to beat with plain “add” operations that take 2 cycles as in the above change.
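As a rough back-of-the-envelope illustration of that argument, the sketch below counts only the serial dependency steps that feed v30. The two latencies are assumptions taken from this message (1 cycle for a forwarded accumulate, 2 cycles for a plain add); the function name is hypothetical, and the model ignores the initial uaddlp, issue width, and everything else the core does in parallel, so it is not a prediction of the measured numbers.

```python
ACC_LATENCY = 1  # assumption: forwarded accumulate latency, the "(1)" in "4(1)"
ADD_LATENCY = 2  # assumption: plain vector add latency, as stated in the message


def serial_cycles(dependent_ops, latency_per_op):
    """Cycles spent on a chain in which each op waits for the previous one."""
    return dependent_ops * latency_per_op


# Original code: seven uadalp accumulate into v30 one after another.
uadalp_chain = serial_cycles(7, ACC_LATENCY)

# Experimental change: the longest run of dependent adds ending in v30 is
# add v0 -> add v30 -> add v30 -> add v30 -> add v30, i.e. five adds.
add_chain = serial_cycles(5, ADD_LATENCY)

print(uadalp_chain, add_chain)
```

Under these assumptions the add-based version has fewer dependent steps but a longer serial path, which is at least consistent with it being slower in the benchmark.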
I was able to slightly improve performance by starting the longest reduction chain earlier.
See the attached patch.
Before:
ssim_4x4x2_core 30.69x 13.39 410.85
After:
ssim_4x4x2_core 31.03x 13.32 413.45
Sebastian
From: x265-devel <x265-devel-bounces at videolan.org> on behalf of chen <chenm003 at 163.com>
Reply-To: Development for x265 <x265-devel at videolan.org>
Date: Saturday, August 7, 2021 at 2:20 AM
To: Development for x265 <x265-devel at videolan.org>
Subject: RE: [EXTERNAL] [x265] [arm64] port ssim_4x4x2_core
Hi,
Code looks good.
My only comment is that UADALP is slower; we can adjust the order of the sums to avoid it.
Regards,
Min Chen
2021-08-07 02:01:13,"Pop, Sebastian" <spop at amazon.com>
Hi,
The attached patch ports the following kernel to arm64:
ssim_4x4x2_core 30.69x 13.39 410.85
Ok to commit?
Thanks,
Sebastian
-------------- next part --------------
A non-text attachment was scrubbed...
Name: 0001-arm64-port-ssim_4x4x2_core.patch
Type: application/octet-stream
Size: 3481 bytes
Desc: 0001-arm64-port-ssim_4x4x2_core.patch
URL: <http://mailman.videolan.org/pipermail/x265-devel/attachments/20210809/2d5c8e3b/attachment-0001.obj>