<div style="line-height:1.7;color:#000000;font-size:14px;font-family:Arial"><div>Just say it works.</div><div><br></div><div><div>First of all,</div><div>The expected algorithm is the sum of squares of (x >> shift).</div><div>It is an 8-bit value (I assume we are talking about 8bpp; 16bpp is similar) multiplied by an 8-bit value, so each product is 16 bits.</div><div>The function works at CU level, so blockSize is at most 64, i.e. 6 bits per dimension.</div><div>So the maximum dynamic range is 16+6+6 = 28 bits.</div><div><br></div><div>Therefore, the uint64_t output is unnecessary in 8bpp mode.</div><div><br></div><div>Moreover, PMOVZXBD+VPMULDQ can be replaced by PMOVZXBW+PMADDWD (please remember that PMADDUBSW treats only one of its inputs as unsigned),</div><div>which may raise processing throughput by 3~4 times.</div></div><div><div>I don't know why VPMULLD was not used; it would almost double performance.</div></div><div><br></div><div>Further, the VPSRLDQ is unnecessary; it is only there because VPMULDQ was chosen:</div><div><br></div><div><div>+ vpmuldq m2, m1, m1</div><div>+ vpsrldq m1, m1, 4</div><div>+ vpmuldq m1, m1, m1</div></div><div><br></div><div><br></div><div>Regards,</div><div>Min</div><div><br></div>At 2019-03-07 17:36:19, "Dinesh Kumar Reddy" <dinesh@multicorewareinc.com> wrote:<br> <blockquote id="isReplyContent" style="PADDING-LEFT: 1ex; MARGIN: 0px 0px 0px 0.8ex; BORDER-LEFT: #ccc 1px solid"><div class="gmail_quote"><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div dir="ltr"><div dir="ltr"><div>+static void normFact_c(const pixel* src, uint32_t blockSize, int shift, uint64_t *z_k)</div><div>+{</div><div>+ *z_k = 0;</div><div>+ for (uint32_t block_yy = 0; block_yy < blockSize; block_yy += 1)</div><div>+ {</div><div>+ for (uint32_t block_xx = 0; block_xx < blockSize; block_xx += 1)</div><div>+ {</div><div>+ uint32_t temp = src[block_yy * blockSize + block_xx] >> shift;</div><div>+ *z_k += temp * temp;</div><div>+ }</div><div>+ 
}</div><div>+}</div><div>+</div><div>diff -r d12a4caf7963 -r 19f27e0c8a6f source/common/x86/pixel-a.asm</div><div>--- a/source/common/x86/pixel-a.asm<span style="white-space:pre-wrap"> </span>Wed Feb 27 12:35:02 2019 +0530</div><div>+++ b/source/common/x86/pixel-a.asm<span style="white-space:pre-wrap"> </span>Mon Mar 04 15:36:38 2019 +0530</div><div>@@ -388,6 +388,16 @@</div><div> vpaddq m7, m6</div><div> %endmacro</div><div> </div><div>+%macro NORM_FACT_COL 1</div><div>+ vpsrld m1, m0, SSIMRD_SHIFT</div><div>+ vpmuldq m2, m1, m1</div><div>+ vpsrldq m1, m1, 4</div><div>+ vpmuldq m1, m1, m1</div><div>+</div><div>+ vpaddq m1, m2</div><div>+ vpaddq m3, m1</div><div>+%endmacro</div><div>+</div><div> ; FIXME avoid the spilling of regs to hold 3*stride.</div><div> ; for small blocks on x86_32, modify pixel pointer instead.</div><div> </div><div>@@ -16303,3 +16313,266 @@</div><div> movq [r4], xm4</div><div> movq [r6], xm7</div><div> RET</div><div>+</div><div>+</div><div>+;static void normFact_c(const pixel* src, uint32_t blockSize, int shift, uint64_t *z_k)</div><div>+;{</div><div>+; *z_k = 0;</div><div>+; for (uint32_t block_yy = 0; block_yy < blockSize; block_yy += 1)</div><div>+; {</div><div>+; for (uint32_t block_xx = 0; block_xx < blockSize; block_xx += 1)</div><div>+; {</div><div>+; uint32_t temp = src[block_yy * blockSize + block_xx] >> shift;</div><div>+; *z_k += temp * temp;</div><div>+; }</div><div>+; }</div><div>+;}</div><div>+;--------------------------------------------------------------------------------------</div><div>+; void normFact_c(const pixel* src, uint32_t blockSize, int shift, uint64_t *z_k)</div><div>+;--------------------------------------------------------------------------------------</div><div>+INIT_YMM avx2</div><div>+cglobal normFact8, 4, 5, 6</div><div>+ mov r4d, 8</div><div>+ vpxor m3, m3 ;z_k</div><div>+ vpxor m5, m5</div><div>+.row:</div><div>+%if HIGH_BIT_DEPTH</div><div>+ vpmovzxwd m0, [r0] ;src</div><div>+%elif BIT_DEPTH == 
8</div><div>+ vpmovzxbd m0, [r0]</div><div>+%else</div><div>+ %error Unsupported BIT_DEPTH!</div><div>+%endif</div><div></div></div></div><br>
</blockquote></div>
</blockquote></div>
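<div>A minimal C sketch of the 28-bit dynamic-range argument from the reply, assuming an 8bpp build where pixel is uint8_t. The names norm_fact_u64, norm_fact_u32 and worst_case_8bpp are hypothetical and are not part of the patch; the sketch only checks that a 32-bit accumulator gives the same result as the 64-bit one for the worst-case 64x64 block.</div>

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Reference loop with a 64-bit accumulator, mirroring the quoted normFact_c.
 * Assumption: pixel is uint8_t (8bpp). */
static uint64_t norm_fact_u64(const uint8_t *src, uint32_t blockSize, int shift)
{
    uint64_t sum = 0;
    for (uint32_t i = 0; i < blockSize * blockSize; i++) {
        uint32_t t = (uint32_t)src[i] >> shift;
        sum += (uint64_t)t * t;
    }
    return sum;
}

/* Same loop with a 32-bit accumulator (hypothetical variant). */
static uint32_t norm_fact_u32(const uint8_t *src, uint32_t blockSize, int shift)
{
    uint32_t sum = 0;
    for (uint32_t i = 0; i < blockSize * blockSize; i++) {
        uint32_t t = (uint32_t)src[i] >> shift;
        sum += t * t; /* each product is at most 255 * 255, i.e. 16 bits */
    }
    return sum;
}

/* Worst case for 8bpp: a 64x64 block of 255 with shift = 0. */
static uint32_t worst_case_8bpp(void)
{
    static uint8_t block[64 * 64];
    memset(block, 0xFF, sizeof(block));

    uint64_t wide = norm_fact_u64(block, 64, 0);
    uint32_t narrow = norm_fact_u32(block, 64, 0);

    assert(wide < (1ull << 28)); /* 16 + 6 + 6 = 28 bits */
    assert(wide == narrow);      /* no overflow in the 32-bit sum */
    return narrow;
}
```

<div>Calling worst_case_8bpp() yields 4096 * 255 * 255 = 266342400, which is below 2^28 = 268435456, so under these assumptions the 8bpp sum fits in 32 bits.</div>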