[x265] Fwd: [PATCH] replace block_copy_p_p vector class function with intrinsic code
Praveen Tiwari
praveen at multicorewareinc.com
Sat Oct 5 08:55:34 CEST 2013
---------- Forwarded message ----------
From: <dnyaneshwar at multicorewareinc.com>
Date: Fri, Oct 4, 2013 at 4:27 PM
Subject: [x265] [PATCH] replace block_copy_p_p vector class function with
intrinsic code
To: x265-devel at videolan.org
{
for (int x = 0; x < bx; x += 16)
{
- Vec16c word;
- word.load_a(src + x);
- word.store_a(dst + x);
+ __m128i word0 = _mm_load_si128((__m128i const*)(src + x)); // load block of 16 byte from src
+ _mm_store_si128((__m128i*)&dst[x], word0); // store block into dst
}
Here too, I suggest adding an unrolled step for multiples of 8, using the
64-bit load function. Suppose the width (bx) is something like 24 or 25: the
loop above copies 16 elements, but the rest (for 25, that is 25 - 16 = 9)
have to be copied one element at a time. If we add an unrolled step for 8,
only 25 - 16 - 8 = 1 element has to be copied individually. Please add an
unrolled loop for 8 and test it.
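To illustrate the suggestion, here is a rough sketch of one row copy with a 16-byte loop, a single 8-byte step, and a scalar tail. It is not the actual x265 primitive: the helper name copyRowSketch is hypothetical, the pixel type is assumed to be uint8_t, and unaligned loads/stores are used so the sketch does not depend on the alignment the patch assumes.

```cpp
#include <emmintrin.h> // SSE2 intrinsics
#include <cstdint>

// Hypothetical helper (not from the patch): copy bx bytes of one row.
// Assumes 8-bit pixels; uses unaligned accesses so any pointers are safe.
static void copyRowSketch(uint8_t* dst, const uint8_t* src, int bx)
{
    int x = 0;

    // Full 16-byte blocks.
    for (; x + 16 <= bx; x += 16)
    {
        __m128i word = _mm_loadu_si128(reinterpret_cast<const __m128i*>(src + x));
        _mm_storeu_si128(reinterpret_cast<__m128i*>(dst + x), word);
    }

    // At most one 8-byte block remains, handled with the 64-bit load/store.
    if (x + 8 <= bx)
    {
        __m128i half = _mm_loadl_epi64(reinterpret_cast<const __m128i*>(src + x));
        _mm_storel_epi64(reinterpret_cast<__m128i*>(dst + x), half);
        x += 8;
    }

    // Fewer than 8 bytes remain; copy them individually.
    for (; x < bx; x++)
        dst[x] = src[x];
}
```

With bx = 25 this copies 16 bytes in the loop, 8 bytes in the 64-bit step, and only 1 byte in the scalar tail, which matches the arithmetic above.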
src += sstride;
Regards,
Praveen