<div dir="ltr"><br><div class="gmail_extra"><br><br><div class="gmail_quote">On Fri, Oct 4, 2013 at 6:26 AM,  <span dir="ltr"><<a href="mailto:dnyaneshwar@multicorewareinc.com" target="_blank">dnyaneshwar@multicorewareinc.com</a>></span> wrote:<br>
<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"># HG changeset patch<br>
# User Dnyaneshwar<br></blockquote><div><br></div><div>Please use your first name, last name, and email address in the User field.</div><div> </div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">
# Date 1380885916 -19800<br>
#      Fri Oct 04 16:55:16 2013 +0530<br>
# Node ID f4100c037a0d6f64d78a8a313e175f6c8445e30b<br>
# Parent  69943bfd02a2feea711da586eb15c7ac77fa700d<br>
replace block_copy_p_s (short to pixel) vector class function with intrinsic.<br></blockquote><div><br></div><div>Please include a blank line between the summary line and the further explanation.</div><div> </div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">

Performance measured is same as that of vector function.<br></blockquote><div><br></div><div>This is not a surprise. The vector class code performs reasonably well when the function is small enough to be inlined entirely. As the function grows, MSVC's inliner starts to disable itself and the vector class performance drops off considerably.</div>
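<div><br></div><div>For reference, here is a minimal scalar sketch of what the masked pack in the patch below appears to do per element. _mm_packus_epi16 saturates signed 16-bit lanes to the range 0..255, so masking each lane with 0x00FF first keeps every value in range and the net effect is truncation to the low byte. The names packus_lane and copy_p_s_row are hypothetical and exist only for this illustration; they are not part of the patch or the x265 source.</div>

```cpp
#include <cstddef>
#include <cstdint>

// Scalar model of one lane of _mm_packus_epi16:
// signed 16-bit input, unsigned 8-bit output, with saturation.
static inline uint8_t packus_lane(int16_t v)
{
    if (v < 0) return 0;
    if (v > 255) return 255;
    return static_cast<uint8_t>(v);
}

// Hypothetical helper: scalar equivalent of "mask with 0x00FF, then pack".
// After masking, every value lies in 0..255, so saturation never triggers
// and the result is simply the low byte of each 16-bit input.
static inline void copy_p_s_row(uint8_t* dst, const int16_t* src, size_t width)
{
    for (size_t x = 0; x < width; x++)
    {
        int16_t masked = static_cast<int16_t>(src[x] & 0x00FF);
        dst[x] = packus_lane(masked);
    }
}
```

<div>Without the mask, an input of 256 would saturate to 255 and a negative input would clamp to 0; with the mask, both wrap to their low byte instead.</div>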
<div> </div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">
diff -r 69943bfd02a2 -r f4100c037a0d source/common/vec/blockcopy-sse3.cpp<br>
--- a/source/common/vec/blockcopy-sse3.cpp      Fri Oct 04 16:27:02 2013 +0530<br>
+++ b/source/common/vec/blockcopy-sse3.cpp      Fri Oct 04 16:55:16 2013 +0530<br>
@@ -106,10 +106,14 @@<br>
         {<br>
             for (int x = 0; x < bx; x += 16)<br>
             {<br>
-                Vec8us word0, word1;<br>
-                word0.load_a(src + x);<br>
-                word1.load_a(src + x + 8);<br>
-                compress(word0, word1).store_a(dst + x);<br>
+                __m128i word0 = _mm_load_si128((__m128i const*)(src + x));       // load block of 16 byte from src<br>
+                __m128i word1 = _mm_load_si128((__m128i const*)(src + x + 8));<br>
+<br>
+                __m128i mask = _mm_set1_epi32(0x00FF00FF);                  // mask for low bytes<br>
+                __m128i low_mask = _mm_and_si128(word0, mask);              // bytes of low<br>
+                __m128i high_mask = _mm_and_si128(word1, mask);             // bytes of high<br>
+                __m128i word01 = _mm_packus_epi16(low_mask, high_mask);     // unsigned pack<br>
+                _mm_store_si128((__m128i*)&dst[x], word01);                 // store block into dst<br>
             }<br>
<br>
             src += sstride;<br>
</blockquote></div><br><br clear="all"><div><br></div>-- <br>Steve Borho
</div></div>