<div style="line-height:1.7;color:#000000;font-size:14px;font-family:arial"><DIV>At 2013-10-15 16:35:15,dnyaneshwar@multicorewareinc.com wrote:<BR>># HG changeset patch<BR>># User Dnyaneshwar Gorade &lt;dnyaneshwar@multicorewareinc.com&gt;<BR>># Date 1381826069 -19800<BR>># Tue Oct 15 14:04:29 2013 +0530<BR>># Node ID 3cd533917aa110f7231abf6e0186e99b22dd4dcf<BR>># Parent 1a85d8814346efdb984ea9eae24d1b06b973e9a8<BR>>pixel-sse41.cpp: Modified PROCESS_SSE_SS4x1 macro with faster intrinsics.<BR>><BR>>diff -r 1a85d8814346 -r 3cd533917aa1 source/common/vec/pixel-sse41.cpp<BR>>--- a/source/common/vec/pixel-sse41.cpp Tue Oct 15 12:45:58 2013 +0530<BR>>+++ b/source/common/vec/pixel-sse41.cpp Tue Oct 15 14:04:29 2013 +0530<BR>
>@@ -5331,10 +5331,8 @@<BR>> #define PROCESS_SSE_SS4x1(BASE)\<BR>> m1 = _mm_loadu_si128((__m128i const*)(fenc + BASE)); \<BR>> n1 = _mm_loadu_si128((__m128i const*)(fref + BASE)); \<BR>>- sign1 = _mm_srai_epi16(m1, 15); \<BR>>- tmp1 = _mm_unpacklo_epi16(m1, sign1); \<BR>>- sign2 = _mm_srai_epi16(n1, 15); \<BR>>- tmp2 = _mm_unpacklo_epi16(n1, sign2); \<BR>>+ tmp1= _mm_cvtepi16_epi32(m1); \<BR>>+ tmp2= _mm_cvtepi16_epi32(n1); \<BR>> diff = _mm_sub_epi32(tmp1, tmp2); \<BR>>
diff = _mm_mullo_epi32(diff, diff); \<BR>> sum = _mm_add_epi32(sum, diff)<BR></DIV>
<DIV>Two suggestions:</DIV>
<DIV>1. Be careful when using SSE4 instructions with the VS compiler; it has many bugs.</DIV>
<DIV>2. Do we use the full 16-bit dynamic range? If not, we could use the PMADDWD instruction for better performance.</DIV>
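<DIV>To illustrate suggestion 2, here is a rough sketch of the PMADDWD idea using only SSE2 intrinsics. The helper name sse_ss4_madd is hypothetical and is not part of the patch. It assumes every difference fenc[i] - fref[i] fits in a signed 16-bit value, i.e. the inputs do not use the full 16-bit dynamic range; if they do, the 16-bit subtraction wraps and the result is wrong.</DIV>

```cpp
#include <emmintrin.h> // SSE2: _mm_madd_epi16 maps to PMADDWD
#include <cassert>
#include <cstdint>

// Hypothetical helper (not from the patch): sum of squared differences
// over four int16_t samples.
// ASSUMPTION: each difference fenc[i] - fref[i] fits in int16_t, so the
// 16-bit subtraction below cannot wrap.
static inline int32_t sse_ss4_madd(const int16_t* fenc, const int16_t* fref)
{
    assert(fenc && fref);
    // Load four 16-bit samples into the low 64 bits; the upper half is zeroed.
    __m128i m1 = _mm_loadl_epi64(reinterpret_cast<const __m128i*>(fenc));
    __m128i n1 = _mm_loadl_epi64(reinterpret_cast<const __m128i*>(fref));
    // Subtract in 16 bits, so no widening to 32 bits is needed.
    __m128i diff = _mm_sub_epi16(m1, n1);
    // PMADDWD: multiply 16-bit pairs and add neighbours into 32-bit lanes,
    // giving { d0*d0 + d1*d1, d2*d2 + d3*d3, 0, 0 }.
    __m128i sq = _mm_madd_epi16(diff, diff);
    // Fold 32-bit lane 1 onto lane 0 and return the low lane.
    sq = _mm_add_epi32(sq, _mm_srli_epi64(sq, 32));
    return _mm_cvtsi128_si32(sq);
}

// Scalar reference, for checking the vector version.
static inline int32_t sse_ss4_scalar(const int16_t* fenc, const int16_t* fref)
{
    int32_t sum = 0;
    for (int i = 0; i < 4; i++)
    {
        int32_t d = fenc[i] - fref[i];
        sum += d * d;
    }
    return sum;
}
```

<DIV>The same pattern would cover eight samples per register with a full 128-bit load, since PMADDWD needs no widening step first. Whether it is actually faster in this macro would need to be measured.</DIV>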
<DIV> </DIV></div>