<html xmlns:o="urn:schemas-microsoft-com:office:office" xmlns:w="urn:schemas-microsoft-com:office:word" xmlns:m="http://schemas.microsoft.com/office/2004/12/omml" xmlns="http://www.w3.org/TR/REC-html40">
<head>
<meta http-equiv="Content-Type" content="text/html; charset=us-ascii">
<meta name="Generator" content="Microsoft Word 15 (filtered medium)">
<style><!--
/* Font Definitions */
@font-face
{font-family:SimSun;
panose-1:2 1 6 0 3 1 1 1 1 1;}
@font-face
{font-family:"Cambria Math";
panose-1:2 4 5 3 5 4 6 3 2 4;}
@font-face
{font-family:DengXian;
panose-1:2 1 6 0 3 1 1 1 1 1;}
@font-face
{font-family:Aptos;
panose-1:2 11 0 4 2 2 2 2 2 4;}
@font-face
{font-family:Consolas;
panose-1:2 11 6 9 2 2 4 3 2 4;}
@font-face
{font-family:"\@DengXian";
panose-1:2 1 6 0 3 1 1 1 1 1;}
@font-face
{font-family:"\@SimSun";
panose-1:2 1 6 0 3 1 1 1 1 1;}
/* Style Definitions */
p.MsoNormal, li.MsoNormal, div.MsoNormal
{margin:0in;
font-size:10.0pt;
font-family:"Aptos",sans-serif;}
pre
{mso-style-priority:99;
mso-style-link:"HTML Preformatted Char";
margin:0in;
font-size:10.0pt;
font-family:"Courier New";}
span.HTMLPreformattedChar
{mso-style-name:"HTML Preformatted Char";
mso-style-priority:99;
mso-style-link:"HTML Preformatted";
font-family:"Consolas",serif;}
span.EmailStyle21
{mso-style-type:personal-reply;
font-family:"Aptos",sans-serif;
color:windowtext;}
.MsoChpDefault
{mso-style-type:export-only;
font-size:10.0pt;
mso-ligatures:none;}
@page WordSection1
{size:8.5in 11.0in;
margin:1.0in 1.0in 1.0in 1.0in;}
div.WordSection1
{page:WordSection1;}
--></style>
</head>
<body lang="EN-US" link="#467886" vlink="#96607D" style="word-wrap:break-word">
<div class="WordSection1">
<p class="MsoNormal"><span style="font-size:11.0pt">Hi Chen,<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt"><o:p> </o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt">Thanks for the feedback.<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt"><o:p> </o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt">I tried using LDR with register offsets and unrolling by 2; the performance is almost the same for the two approaches<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt">(at most 0.03x deviation, up or down, depending on the block size).<o:p></o:p></span></p>
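<p class="MsoNormal"><span style="font-size:11.0pt"><o:p> </o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt">For reference, a minimal sketch of what an unroll-by-2 variant with register-offset LDR/STR could look like. It is an illustration only, not necessarily the exact code that was benchmarked. It assumes the same register assignment as the quoted patch (x0/x1 for dst and its stride, x2/x3 and x4/x5 for the two sources and their strides) and an even \h, and it shows only the body that would replace the existing .rept block:<o:p></o:p></span></p>
<pre>.rept \h / 2
    ldr     s0, [x2]                // row 0 of the first source
    ldr     s1, [x4]                // row 0 of the second source
    ldr     s2, [x2, x3]            // row 1, addressed as base + stride
    ldr     s3, [x4, x5]
    add     x2, x2, x3, lsl #1      // advance both sources by two rows
    add     x4, x4, x5, lsl #1
    urhadd  v0.8b, v0.8b, v1.8b     // rounding average, row 0
    urhadd  v2.8b, v2.8b, v3.8b     // rounding average, row 1
    str     s0, [x0]
    str     s2, [x0, x1]            // row 1 of dst, base + stride
    add     x0, x0, x1, lsl #1      // advance dst by two rows
.endr</pre>
<p class="MsoNormal"><span style="font-size:11.0pt">Written this way, two rows need three ADDs instead of six, at the cost of two extra vector registers per iteration.<o:p></o:p></span></p>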
<p class="MsoNormal"><span style="font-size:11.0pt"><o:p> </o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt">Regards,<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt">Li<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt"><o:p> </o:p></span></p>
<div id="mail-editor-reference-message-container">
<div>
<div>
<div style="border:none;border-top:solid #B5C4DF 1.0pt;padding:3.0pt 0in 0in 0in">
<p class="MsoNormal" style="margin-bottom:12.0pt"><b><span style="font-size:12.0pt;color:black">From:
</span></b><span style="font-size:12.0pt;color:black">x265-devel <x265-devel-bounces@videolan.org> on behalf of chen <chenm003@163.com><br>
<b>Date: </b>Friday, June 20, 2025 at 7:09<br>
<b>To: </b>Development for x265 <x265-devel@videolan.org><br>
<b>Cc: </b>nd <nd@arm.com><br>
<b>Subject: </b>Re: [x265] [PATCH] AArch64: Optimize pixel_avg_pp_4xh<o:p></o:p></span></p>
</div>
<div>
<div id="spnEditorContent">
<p style="margin:0in"><span style="font-size:10.5pt;font-family:"Arial",sans-serif;color:black">The code looks good to me.<o:p></o:p></span></p>
<p style="margin:0in"><span style="font-size:10.5pt;font-family:"Arial",sans-serif;color:black">By the way, LDR supports register-offset addressing; how about unrolling by 2 to reduce the number of ADD instructions?<o:p></o:p></span></p>
<p style="margin:0in"><span style="font-size:10.5pt;font-family:"Arial",sans-serif;color:black"><o:p> </o:p></span></p>
</div>
<pre><span style="color:black"><br>At 2025-06-19 22:58:53, "Li Zhang" <li.zhang2@arm.com> wrote:<o:p></o:p></span></pre>
<pre><span style="color:black">>Use LDR and STR instead of LD1 to lane in the pixel_avg_pp_4xh assembly<o:p></o:p></span></pre>
<pre><span style="color:black">>implementation. The new approach is a wholly destructive operation and<o:p></o:p></span></pre>
<pre><span style="color:black">>removes a false dependency on the existing register contents.<o:p></o:p></span></pre>
<pre><span style="color:black">><o:p> </o:p></span></pre>
<pre><span style="color:black">>The change provides up to 2.5x speed up.<o:p></o:p></span></pre>
<pre><span style="color:black">>---<o:p></o:p></span></pre>
<pre><span style="color:black">> source/common/aarch64/mc-a.S | 9 ++++++---<o:p></o:p></span></pre>
<pre><span style="color:black">> 1 file changed, 6 insertions(+), 3 deletions(-)<o:p></o:p></span></pre>
<pre><span style="color:black">><o:p> </o:p></span></pre>
<pre><span style="color:black">>diff --git a/source/common/aarch64/mc-a.S b/source/common/aarch64/mc-a.S<o:p></o:p></span></pre>
<pre><span style="color:black">>index 130bf1a4a..ff18713fa 100644<o:p></o:p></span></pre>
<pre><span style="color:black">>--- a/source/common/aarch64/mc-a.S<o:p></o:p></span></pre>
<pre><span style="color:black">>+++ b/source/common/aarch64/mc-a.S<o:p></o:p></span></pre>
<pre><span style="color:black">>@@ -38,10 +38,13 @@<o:p></o:p></span></pre>
<pre><span style="color:black">> .macro pixel_avg_pp_4xN_neon h<o:p></o:p></span></pre>
<pre><span style="color:black">> function PFX(pixel_avg_pp_4x\h\()_neon)<o:p></o:p></span></pre>
<pre><span style="color:black">> .rept \h<o:p></o:p></span></pre>
<pre><span style="color:black">>- ld1 {v0.s}[0], [x2], x3<o:p></o:p></span></pre>
<pre><span style="color:black">>- ld1 {v1.s}[0], [x4], x5<o:p></o:p></span></pre>
<pre><span style="color:black">>+ ldr s0, [x2]<o:p></o:p></span></pre>
<pre><span style="color:black">>+ ldr s1, [x4]<o:p></o:p></span></pre>
<pre><span style="color:black">>+ add x2, x2, x3<o:p></o:p></span></pre>
<pre><span style="color:black">>+ add x4, x4, x5<o:p></o:p></span></pre>
<pre><span style="color:black">> urhadd v2.8b, v0.8b, v1.8b<o:p></o:p></span></pre>
<pre><span style="color:black">>- st1 {v2.s}[0], [x0], x1<o:p></o:p></span></pre>
<pre><span style="color:black">>+ str s2, [x0]<o:p></o:p></span></pre>
<pre><span style="color:black">>+ add x0, x0, x1<o:p></o:p></span></pre>
<pre><span style="color:black">> .endr<o:p></o:p></span></pre>
<pre><span style="color:black">> ret<o:p></o:p></span></pre>
<pre><span style="color:black">> endfunc<o:p></o:p></span></pre>
<pre><span style="color:black">>-- <o:p></o:p></span></pre>
<pre><span style="color:black">>2.39.5 (Apple Git-154)<o:p></o:p></span></pre>
<pre><span style="color:black">><o:p> </o:p></span></pre>
</div>
</div>
</div>
</div>
</div>
</body>
</html>