<div dir="ltr"><div dir="ltr">Hi Chen,<div>Thanks for your suggestions. Your feedback is noted. </div></div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Thu, Mar 7, 2019 at 3:41 PM chen <<a href="mailto:chenm003@163.com">chenm003@163.com</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div style="line-height:1.7;color:rgb(0,0,0);font-size:14px;font-family:Arial"><div>Just say it works.</div><div><br></div><div><div>First of all,</div><div>The expected algorithm is the square of (x >> shift)</div><div>It is an 8-bit value (I assume we are talking about 8bpp; 16bpp is similar) multiplied by an 8-bit value, so the result is 16 bits.</div><div>The function works at the CU level, so blockSize is at most 64, i.e. 6 bits.</div><div>So the maximum dynamic range is 16+6+6 = 28 bits </div><div><br></div><div>Therefore, the uint64_t output is unnecessary in 8bpp mode.</div><div><br></div><div>Moreover, PMOVZXBD+VPMULDQ can be replaced by PMOVZXBW+PMADDWD (please remember that PMADDUBSW treats only one of its inputs as unsigned),</div><div>which may improve processing throughput by 3~4 times. 
</div></div><div><div>I don't know why VPMULLD was not used; it would almost double performance</div></div><div><br></div><div>Further, the extra VPSRLDQ is needed only because VPMULDQ was chosen</div><div><br></div><div><div>+ vpmuldq m2, m1, m1</div><div>+ vpsrldq m1, m1, 4</div><div>+ vpmuldq m1, m1, m1</div></div><div><br></div><div><br></div><div>Regards,</div><div>Min</div><div><br></div>At 2019-03-07 17:36:19, "Dinesh Kumar Reddy" <<a href="mailto:dinesh@multicorewareinc.com" target="_blank">dinesh@multicorewareinc.com</a>> wrote:<br> <blockquote id="gmail-m_-6760942120823212322isReplyContent" style="padding-left:1ex;margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204)"><div class="gmail_quote"><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div dir="ltr"><div dir="ltr"><div>+static void normFact_c(const pixel* src, uint32_t blockSize, int shift, uint64_t *z_k)</div><div>+{</div><div>+ *z_k = 0;</div><div>+ for (uint32_t block_yy = 0; block_yy < blockSize; block_yy += 1)</div><div>+ {</div><div>+ for (uint32_t block_xx = 0; block_xx < blockSize; block_xx += 1)</div><div>+ {</div><div>+ uint32_t temp = src[block_yy * blockSize + block_xx] >> shift;</div><div>+ *z_k += temp * temp;</div><div>+ }</div><div>+ }</div><div>+}</div><div>+</div><div>diff -r d12a4caf7963 -r 19f27e0c8a6f source/common/x86/pixel-a.asm</div><div>--- a/source/common/x86/pixel-a.asm<span style="white-space:pre-wrap"> </span>Wed Feb 27 12:35:02 2019 +0530</div><div>+++ b/source/common/x86/pixel-a.asm<span style="white-space:pre-wrap"> </span>Mon Mar 04 15:36:38 2019 +0530</div><div>@@ -388,6 +388,16 @@</div><div> vpaddq m7, m6</div><div> %endmacro</div><div> </div><div>+%macro NORM_FACT_COL 1</div><div>+ vpsrld m1, m0, SSIMRD_SHIFT</div><div>+ vpmuldq m2, m1, m1</div><div>+ vpsrldq m1, m1, 4</div><div>+ vpmuldq m1, m1, m1</div><div>+</div><div>+ vpaddq m1, m2</div><div>+ vpaddq m3, m1</div><div>+%endmacro</div><div>+</div><div> ; FIXME 
avoid the spilling of regs to hold 3*stride.</div><div> ; for small blocks on x86_32, modify pixel pointer instead.</div><div> </div><div>@@ -16303,3 +16313,266 @@</div><div> movq [r4], xm4</div><div> movq [r6], xm7</div><div> RET</div><div>+</div><div>+</div><div>+;static void normFact_c(const pixel* src, uint32_t blockSize, int shift, uint64_t *z_k)</div><div>+;{</div><div>+; *z_k = 0;</div><div>+; for (uint32_t block_yy = 0; block_yy < blockSize; block_yy += 1)</div><div>+; {</div><div>+; for (uint32_t block_xx = 0; block_xx < blockSize; block_xx += 1)</div><div>+; {</div><div>+; uint32_t temp = src[block_yy * blockSize + block_xx] >> shift;</div><div>+; *z_k += temp * temp;</div><div>+; }</div><div>+; }</div><div>+;}</div><div>+;--------------------------------------------------------------------------------------</div><div>+; void normFact_c(const pixel* src, uint32_t blockSize, int shift, uint64_t *z_k)</div><div>+;--------------------------------------------------------------------------------------</div><div>+INIT_YMM avx2</div><div>+cglobal normFact8, 4, 5, 6</div><div>+ mov r4d, 8</div><div>+ vpxor m3, m3 ;z_k</div><div>+ vpxor m5, m5</div><div>+.row:</div><div>+%if HIGH_BIT_DEPTH</div><div>+ vpmovzxwd m0, [r0] ;src</div><div>+%elif BIT_DEPTH == 8</div><div>+ vpmovzxbd m0, [r0]</div><div>+%else</div><div>+ %error Unsupported BIT_DEPTH!</div><div>+%endif</div><div></div></div></div><br>
</blockquote></div>
</blockquote></div>_______________________________________________<br>
x265-devel mailing list<br>
<a href="mailto:x265-devel@videolan.org" target="_blank">x265-devel@videolan.org</a><br>
<a href="https://mailman.videolan.org/listinfo/x265-devel" rel="noreferrer" target="_blank">https://mailman.videolan.org/listinfo/x265-devel</a><br>
</blockquote></div><br clear="all"><div><br></div>-- <br><div dir="ltr" class="gmail_signature"><div dir="ltr"><div><div dir="ltr"><i><b style="background-color:rgb(255,255,255)"><font color="#000000">Regards,</font></b></i><div><i><b style="background-color:rgb(255,255,255)"><font color="#000000">Akil</font></b></i></div></div></div></div></div></div>
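The dynamic-range argument in the quoted review (16 + 6 + 6 = 28 bits for 8bpp) can be checked with a small C sketch. This is only an illustration under stated assumptions: `pixel` is taken to be `uint8_t` (an 8bpp build), and `norm_fact_ref`, `norm_fact_u32`, and `norm_fact_worst_case` are hypothetical helper names that are not part of the patch. The first function mirrors the `normFact_c` loop from the quoted diff; the second does the same sum in a 32-bit accumulator.

```c
#include <stdint.h>

/* Assumption: 8bpp build, so a pixel is one unsigned byte. */
typedef uint8_t pixel;

/* Same computation as normFact_c in the quoted patch, returning the
 * sum instead of writing through a pointer (hypothetical helper). */
static uint64_t norm_fact_ref(const pixel *src, uint32_t blockSize, int shift)
{
    uint64_t z_k = 0;
    for (uint32_t i = 0; i < blockSize * blockSize; i++) {
        uint32_t temp = (uint32_t)(src[i] >> shift);
        z_k += temp * temp;
    }
    return z_k;
}

/* Hypothetical 32-bit variant: each square is below 2^16 and a 64x64
 * block has 2^12 samples, so the sum stays below 2^28 and cannot
 * overflow a uint32_t for 8-bit input. */
static uint32_t norm_fact_u32(const pixel *src, uint32_t blockSize, int shift)
{
    uint32_t z_k = 0;
    for (uint32_t i = 0; i < blockSize * blockSize; i++) {
        uint32_t temp = (uint32_t)(src[i] >> shift);
        z_k += temp * temp;
    }
    return z_k;
}

/* Largest possible sum for a given block size with 8-bit pixels and
 * shift == 0 (every sample equal to 255). */
static uint32_t norm_fact_worst_case(uint32_t blockSize)
{
    return 255u * 255u * blockSize * blockSize;
}
```

For a 64x64 block of all-255 pixels the worst case is 255 * 255 * 4096, which is below 2^28, so both variants return the same value there. For 16bpp input the bound is larger and this sketch does not apply.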