<html xmlns:o="urn:schemas-microsoft-com:office:office" xmlns:w="urn:schemas-microsoft-com:office:word" xmlns:m="http://schemas.microsoft.com/office/2004/12/omml" xmlns="http://www.w3.org/TR/REC-html40">
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
<meta name="Generator" content="Microsoft Word 15 (filtered medium)">
<style><!--
/* Font Definitions */
@font-face
{font-family:"Cambria Math";
panose-1:2 4 5 3 5 4 6 3 2 4;}
@font-face
{font-family:Calibri;
panose-1:2 15 5 2 2 2 4 3 2 4;}
/* Style Definitions */
p.MsoNormal, li.MsoNormal, div.MsoNormal
{margin:0in;
font-size:12.0pt;
font-family:"Calibri",sans-serif;}
span.EmailStyle21
{mso-style-type:personal-reply;
font-family:"Calibri",sans-serif;
color:windowtext;}
.MsoChpDefault
{mso-style-type:export-only;
font-size:10.0pt;}
@page WordSection1
{size:8.5in 11.0in;
margin:1.0in 1.0in 1.0in 1.0in;}
div.WordSection1
{page:WordSection1;}
--></style>
</head>
<body lang="EN-US" link="#0563C1" vlink="#954F72" style="word-wrap:break-word">
<div class="WordSection1">
<div>
<div>
<div>
<p style="margin:0in"><span style="font-size:10.5pt;font-family:'Arial',sans-serif">Hi Min Chen,<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt">Thanks for your reviews.<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt"><o:p> </o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt">> +.macro SAD_X_END_64 x<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt">> + uaddlp v16.4s, v16.8h<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt">> The dynamic range is 64*255 = 16320 -> 14-bits, so we are not need extend to 32-bits in here<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt">><o:p> </o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt">> + uaddlp v17.4s, v17.8h<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt">> + uaddlp v18.4s, v18.8h<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt">> + uaddlp v20.4s, v20.8h<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt">> + uaddlp v21.4s, v21.8h<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt">> + uaddlp v22.4s, v22.8h<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt">> + add v16.4s, v16.4s, v20.4s<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt">> + add v17.4s, v17.4s, v21.4s<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt">> + add v18.4s, v18.4s, v22.4s<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt">> + trn2 v20.2d, v16.2d, v16.2d<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt">> + trn2 v21.2d, v17.2d, v17.2d<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt">> + trn2 v22.2d, v18.2d, v18.2d<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt">> + add v16.2s, v16.2s, v20.2s<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt">><o:p> </o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt">> + add v17.2s, v17.2s, v21.2s<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt">> + add v18.2s, v18.2s, v22.2s<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt">> + uaddlp v16.1d, v16.2s<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt">> ADD+TRN2+ADD generate sum of v16+v20 in V.2s, follow by UADDLP into V.1s<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt">><o:p> </o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt">> As we analyze dynamic range in above, we can replace it by<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt">> ADD v16, v20 ; 15-bits<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt">> (ignore inst for V17=V17+V21, etc)<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt">> ADD v16, V17 ; 16-bits<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt">> (ignore other registers)<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt">> ADDLV s0,v16<o:p></o:p></span></p>
<p style="margin:0in"><span style="font-size:11.0pt"><o:p> </o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt">Following your recommendation, I tried the code below, which delays widening to<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt">the last step with uaddlv. It does not pass the correctness tests.<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt"><o:p> </o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt">.macro SAD_X_END_64 x<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt"> add v16.8h, v16.8h, v20.8h<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt"> add v17.8h, v17.8h, v21.8h<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt"> add v18.8h, v18.8h, v22.8h<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt"> trn2 v20.2d, v16.2d, v16.2d<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt"> trn2 v21.2d, v17.2d, v17.2d<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt"> trn2 v22.2d, v18.2d, v18.2d<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt"> add v16.4h, v16.4h, v20.4h<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt"> add v17.4h, v17.4h, v21.4h<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt"> add v18.4h, v18.4h, v22.4h<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt"> uaddlv s16, v16.4h<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt"> uaddlv s17, v17.4h<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt"> uaddlv s18, v18.4h<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt"> stp s16, s17, [x6], #8<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt">.if \x == 3<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt"> str s18, [x6]<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt">.elseif \x == 4<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt"> add v19.8h, v19.8h, v23.8h<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt"> trn2 v23.2d, v19.2d, v19.2d<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt"> add v19.2s, v19.2s, v23.2s<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt"> uaddlv s19, v19.4h<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt"> stp s18, s19, [x6]<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt">.endif<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt"> ret<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt">.endm<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt"><o:p> </o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt">On entry to the code above, the values observed in each lane of v16 to<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt">v23 already occupy the full 16 bits. For example,<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt"><o:p> </o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt">(gdb) p $v16.h.u<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt">$21 = {65024, 65024, 65024, 65024, 65024, 65024, 65024, 65024}<o:p></o:p></span></p>
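<p class="MsoNormal"><span style="font-size:11.0pt">To illustrate why the lane width matters, here is a minimal Python sketch that models the two reduction orders on the lane values printed above. It is only a model of the arithmetic, not the assembly itself. The helper names are hypothetical, and the contents of v20 are an assumption (taken to be the same as the v16 dump), since only v16 was printed.<o:p></o:p></span></p>
<pre>
```python
# Model of the reduction for one accumulator pair (v16, v20).
# Assumption: v20 holds the same lane values as the v16 dump; only v16 was printed.
LANE16 = 2 ** 16  # a .8h or .4h lane wraps modulo 2^16
LANE32 = 2 ** 32  # a .4s or .2s lane wraps modulo 2^32


def add_lanes(a, b, modulus):
    """Lane-wise ADD with wraparound, like `add vD.8h, vA.8h, vB.8h`."""
    return [(x + y) % modulus for x, y in zip(a, b)]


def uaddlp(v):
    """Pairwise widening add, like `uaddlp vD.4s, vS.8h` (result lanes are twice as wide)."""
    return [v[i] + v[i + 1] for i in range(0, len(v), 2)]


def delayed_widening(v16, v20):
    """Stay in 16-bit lanes until the final uaddlv."""
    s = add_lanes(v16, v20, LANE16)   # add    v16.8h, v16.8h, v20.8h
    lo, hi = s[:4], s[4:]             # trn2   v20.2d, v16.2d, v16.2d
    s = add_lanes(lo, hi, LANE16)     # add    v16.4h, v16.4h, v20.4h
    return sum(s)                     # uaddlv s16, v16.4h


def widen_first(v16, v20):
    """Widen to 32-bit lanes first, as in the original patch."""
    a, b = uaddlp(v16), uaddlp(v20)   # uaddlp v16.4s, v16.8h (and v20)
    s = add_lanes(a, b, LANE32)       # add    v16.4s, v16.4s, v20.4s
    lo, hi = s[:2], s[2:]             # trn2   v20.2d, v16.2d, v16.2d
    s = add_lanes(lo, hi, LANE32)     # add    v16.2s, v16.2s, v20.2s
    return sum(s)                     # uaddlp v16.1d, v16.2s


v16 = [65024] * 8  # values from the gdb dump
v20 = [65024] * 8  # assumed, see above
expected = sum(v16) + sum(v20)

print("expected:        ", expected)
print("widen first:     ", widen_first(v16, v20))
print("delayed widening:", delayed_widening(v16, v20))
```
</pre>
<p class="MsoNormal"><span style="font-size:11.0pt">Under these assumptions the widen-first order reproduces the expected sum, while the delayed order loses the carries out of the 16-bit lanes in both add steps.<o:p></o:p></span></p>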
<p class="MsoNormal"><span style="font-size:11.0pt"><o:p> </o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt">Each lane of v16 accumulates 4 absolute differences per iteration, each of at most 255:<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt"> uabal \v1\().8h, v0.8b, v4.8b<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt"> uabal \v1\().8h, v1.8b, v5.8b<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt"> uabal \v1\().8h, v2.8b, v6.8b<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt"> uabal \v1\().8h, v3.8b, v7.8b<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt">and this runs in a loop of 64 iterations.<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt">So the dynamic range of each vector element is 4*64*255 = 65280, which already needs all 16 bits.<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt">Adding two such lanes can reach 2*65280 = 130560, which needs 17 bits and wraps in a 16-bit lane.<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt">We therefore need to widen in the first step, as in the original patch,<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt">and we cannot postpone widening to the last step of the reduction.<o:p></o:p></span></p>
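<p class="MsoNormal"><span style="font-size:11.0pt">The bound can be checked stage by stage with a short Python sketch. The constants come from the figures in this mail (4 uabal per iteration, 64 iterations, differences of at most 255). The stage names and the helper are hypothetical and only serve the illustration.<o:p></o:p></span></p>
<pre>
```python
def bits_needed(value):
    """Number of bits needed to hold a non-negative integer."""
    return max(1, value.bit_length())


MAX_DIFF = 255      # largest absolute difference of two 8-bit pixels
UABAL_PER_ITER = 4  # four uabal instructions feed each accumulator lane
ITERATIONS = 64     # loop trip count

per_lane = UABAL_PER_ITER * ITERATIONS * MAX_DIFF  # one lane of v16 on entry
after_pair_add = 2 * per_lane                      # v16 + v20, lane by lane
after_fold = 2 * after_pair_add                    # low half + high half
total = 4 * after_fold                             # final sum over 4 lanes

stages = [
    ("accumulator lane", per_lane),
    ("after v16 + v20", after_pair_add),
    ("after folding halves", after_fold),
    ("final sum", total),
]
for name, bound in stages:
    print(f"{name:22s} max {bound:8d}  needs {bits_needed(bound)} bits")
```
</pre>
<p class="MsoNormal"><span style="font-size:11.0pt">Every stage after the accumulator itself exceeds 16 bits in the worst case, which is consistent with widening before the first add.<o:p></o:p></span></p>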
<p class="MsoNormal"><span style="font-size:11.0pt"><o:p> </o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt">> </span><span style="font-size:10.5pt;font-family:'Arial',sans-serif;color:black">I guess STP may store two result in a cycle</span><span style="font-size:11.0pt"><o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt"><o:p> </o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt">Please find attached the amended patch, which uses store-pair (stp) instructions.<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt">I have seen a small performance improvement with this change.<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt"><o:p> </o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt">Thanks,<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt">Sebastian<o:p></o:p></span></p>
</div>
</div>
</div>
</div>
</body>
</html>