<div dir="ltr"><br><div class="gmail_extra"><br><br><div class="gmail_quote">On Wed, Oct 30, 2013 at 11:03 AM, chen <span dir="ltr"><<a href="mailto:chenm003@163.com" target="_blank">chenm003@163.com</a>></span> wrote:<br>
<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div style="line-height:1.7;font-size:14px;font-family:arial"><div><div class="h5"><div>+%macro PROCESS_SAD_12x4 0<br>+ movu m1, [r2]<br>
+ movu m2, [r0]<br>+ pand m1, m4<br>+ pand m2, m4<br>+ psadbw m1, m2<br>+ paddd m0, m1<br>+ lea r2, [r2 + r3]<br>+ lea r0, [r0 + r1]<br>+ movu m1, [r2]<br>+ movu m2, [r0]<br>
+ pand m1, m4<br>+ pand m2, m4<br>+ psadbw m1, m2<br>+ paddd m0, m1</div>
</div></div><blockquote style="PADDING-LEFT:1ex;MARGIN:0px 0px 0px 0.8ex;BORDER-LEFT:#ccc 1px solid">
<div dir="ltr"><div><div class="h5">
<div class="gmail_quote">>>+ lea r2, [r2 + r3]<br>>>+ lea r0, [r0 + r1]</div>
<div class="gmail_quote">>>+ movu m1, [r2]<br>>>+ movu m2, [r0]</div>
<div class="gmail_quote"><br></div>
<div class="gmail_quote">
<div class="gmail_quote">We don't need to recompute the address with lea every time we add the stride to it. First try to form the address directly in the load, using an index scaled by 1, 2, 4, or 8; only if that is not possible should the address be computed separately.</div>
<div class="gmail_quote">For example, the four instructions above can be replaced with just these two:</div>
<div class="gmail_quote"><br></div>
<div class="gmail_quote">movu m1, [r2 + 2 * r3]</div>movu m2, [r0 + 2 * r1]<br></div>
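<div>A minimal C sketch of the addressing suggestion above, offered as a scalar model rather than the actual assembly. It forms each row address as base + k * stride instead of advancing the pointer after every row. The function names, the test strides, and the precomputed 3 * stride values are assumptions for illustration only; x86 addressing can scale an index by 1, 2, 4, or 8, so a factor of 3 would need a spare register.</div>
<pre>
```c
/* Scalar model of a 12x4 SAD (hypothetical helper names, not x265 code). */
static int abs_diff(int a, int b)
{
    return a > b ? a - b : b - a;
}

/* Sum of absolute differences over one 12-pixel row. */
static int sad_row12(const unsigned char *a, const unsigned char *b)
{
    int x, sum = 0;
    for (x = 0; x < 12; x++)
        sum += abs_diff(a[x], b[x]);
    return sum;
}

/* Row addresses are base + k * stride, mirroring [r0 + r1] and
   [r0 + 2 * r1]; the base pointers are never modified. */
int sad_12x4_offsets(const unsigned char *pix1, int stride1,
                     const unsigned char *pix2, int stride2)
{
    /* Assumption: 3 * stride is computed once, as it would be kept in a
       spare register, because 3 is not a valid x86 index scale. */
    int stride1x3 = 3 * stride1;
    int stride2x3 = 3 * stride2;

    return sad_row12(pix1, pix2)
         + sad_row12(pix1 + stride1, pix2 + stride2)
         + sad_row12(pix1 + 2 * stride1, pix2 + 2 * stride2)
         + sad_row12(pix1 + stride1x3, pix2 + stride2x3);
}

/* Reference: the pointer-advancing form, one stride add per row. */
int sad_12x4_advance(const unsigned char *pix1, int stride1,
                     const unsigned char *pix2, int stride2)
{
    int y, sum = 0;
    for (y = 0; y < 4; y++) {
        sum += sad_row12(pix1, pix2);
        pix1 += stride1;
        pix2 += stride2;
    }
    return sum;
}
```
</pre>
<div>Both functions return the same sum for any input; the first simply does no pointer updates between rows.</div>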
<div class="gmail_quote"><br>+ pand m1, m4<br>+ pand m2, m4<br>+ psadbw m1, m2<br>+ paddd m0, m1<br>+ lea r2, [r2 + r3]<br>+ lea r0, [r0 + r1]<br>+ movu m1, [r2]<br>+ movu m2, [r0]<br>
+ pand m1, m4<br>+ pand m2, m4<br>+ psadbw m1, m2<br>+ paddd m0, m1<br>+%endmacro<br>+<br> %macro PROCESS_SAD_16x4 0<br> movu m1, [r2]<br> movu m2, [r2 + r3]<br>@@ -1007,6 +1041,29 @@<br>
movd eax, m0<br> RET<br><br>+;-----------------------------------------------------------------------------<br>+; int pixel_sad_12x16(
uint8_t *, intptr_t, uint8_t *, intptr_t )<br>+;-----------------------------------------------------------------------------<br>+cglobal pixel_sad_12x16, 4,4,4<br>+ mova m4, [MSK]<br>+ pxor m0, m0<br>+<br>+ PROCESS_SAD_12x4<br>
+ lea r2, [r2 + r3]<br>+ lea r0, [r0 + r1]<br>+ PROCESS_SAD_12x4<br>+ lea r2, [r2 + r3]<br>+ lea r0, [r0 + r1]<br>+ PROCESS_SAD_12x4<br>+ lea r2, [r2 + r3]<br>
+ lea r0, [r0 + r1]<br>+ PROCESS_SAD_12x4<br>+<br>+ movhlps m1, m0<br>+ paddd m0, m1<br>+ movd eax, m0<br>+ RET<br>+<br> %endmacro<br><b></b></div>
<div class="gmail_quote">The lea instruction is overused here; please eliminate these and use the available registers to save load operations.</div><br></div></div>Excuse me, I forgot something: for 12xN, using MOVQ+MOVD is better than MOVU+PAND.</div>
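<div>A rough scalar C model of the two 12-pixel load strategies mentioned above, assuming hypothetical function names. The first mimics MOVU + PAND: it reads 16 bytes per row and zeroes bytes 12 to 15 with a mask. The second mimics MOVQ + MOVD: it reads 8 bytes and then 4 bytes, so nothing beyond column 11 is touched. The sketch only shows that both give the same sum; it says nothing about which is faster.</div>
<pre>
```c
/* Absolute difference of two pixel values. */
static int abs_diff(int a, int b)
{
    return a > b ? a - b : b - a;
}

/* Model of MOVU + PAND: 16 bytes are read, the last 4 are masked to zero
   in both operands so they contribute nothing to the sum. */
int sad_row12_masked(const unsigned char *a, const unsigned char *b)
{
    static const unsigned char mask[16] = {
        255, 255, 255, 255, 255, 255, 255, 255,
        255, 255, 255, 255, 0, 0, 0, 0
    };
    int x, sum = 0;
    for (x = 0; x < 16; x++)
        sum += abs_diff(a[x] & mask[x], b[x] & mask[x]);
    return sum;
}

/* Model of MOVQ + MOVD: exactly 12 bytes are read per row. */
int sad_row12_split(const unsigned char *a, const unsigned char *b)
{
    int x, sum = 0;
    for (x = 0; x < 8; x++)      /* the 8-byte (MOVQ-sized) part */
        sum += abs_diff(a[x], b[x]);
    for (x = 8; x < 12; x++)     /* the 4-byte (MOVD-sized) part */
        sum += abs_diff(a[x], b[x]);
    return sum;
}
```
</pre>
<div>Note that the masked version needs 16 readable bytes per row, while the split version needs only 12.</div>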
</blockquote></div></blockquote><div><br></div><div><br></div><div>I've queued all of these changes for the default branch, since they are already faster than the intrinsics and this allows us to remove quite a number of them. Further optimizations should be made on top of these changes once they are applied. </div>
</div><div><br></div>-- <br>Steve Borho
</div></div>