[x265] Fwd: [PATCH] replace block_copy_p_p vector class function with intrinsic code

Praveen Tiwari praveen at multicorewareinc.com
Sat Oct 5 08:55:34 CEST 2013


---------- Forwarded message ----------
From: <dnyaneshwar at multicorewareinc.com>
Date: Fri, Oct 4, 2013 at 4:27 PM
Subject: [x265] [PATCH] replace block_copy_p_p vector class function with
intrinsic code
To: x265-devel at videolan.org


         {
             for (int x = 0; x < bx; x += 16)
             {
-                Vec16c word;
-                word.load_a(src + x);
-                word.store_a(dst + x);
+                __m128i word0 = _mm_load_si128((__m128i const*)(src + x)); // load block of 16 byte from src
+                _mm_store_si128((__m128i*)&dst[x], word0); // store block into dst
             }
Here too, I suggest adding an unrolled loop for multiples of 8, using the
64-bit load function. Suppose the block width comes to something like 24 or
25. The loop above copies the first 16 elements, but the rest (for 25, that
is 25 - 16 = 9) have to be copied one element at a time. If we add an
unrolled loop for 8, only one element (25 - 16 - 8 = 1) is left to copy
individually. Please add an unrolled loop for 8 and test it.
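
A rough sketch of the suggestion, not the actual x265 code. The function name
copy_block_sketch, the uint8_t pixel type and the ptrdiff_t strides are
assumptions made for illustration. It also uses unaligned loads and stores so
that it stays valid for any pointer, whereas the patch uses the aligned
_mm_load_si128 / _mm_store_si128 pair.

```cpp
#include <emmintrin.h> // SSE2 intrinsics
#include <cstddef>
#include <cstdint>

// Hypothetical helper (not the x265 primitive): copies a block that is
// bx bytes wide and by rows high from src to dst.
static void copy_block_sketch(int bx, int by,
                              uint8_t* dst, ptrdiff_t dstride,
                              const uint8_t* src, ptrdiff_t sstride)
{
    for (int y = 0; y < by; y++)
    {
        int x = 0;

        // 16 bytes per iteration (128-bit load/store, unaligned variant).
        for (; x + 16 <= bx; x += 16)
        {
            __m128i word = _mm_loadu_si128((const __m128i*)(src + x));
            _mm_storeu_si128((__m128i*)(dst + x), word);
        }

        // 8 bytes per iteration (64-bit load/store); runs at most once,
        // because fewer than 16 bytes remain after the loop above.
        for (; x + 8 <= bx; x += 8)
        {
            __m128i half = _mm_loadl_epi64((const __m128i*)(src + x));
            _mm_storel_epi64((__m128i*)(dst + x), half);
        }

        // Remaining 0..7 bytes are copied individually.
        for (; x < bx; x++)
            dst[x] = src[x];

        src += sstride;
        dst += dstride;
    }
}
```

With bx = 25 this copies 16 bytes, then 8 bytes, and leaves a single byte for
the scalar loop, which matches the 25 - 16 - 8 = 1 arithmetic above.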



             src += sstride;

Regards,
Praveen