[x265] Fwd: [PATCH] blockcopy_pp_12x32: SSE2 asm code optimization

Fri Feb 6 17:03:11 CET 2015

At 2015-02-06 17:23:23,"Praveen Tiwari" <praveen at multicorewareinc.com> wrote:

---------- Forwarded message ----------
From: chen<chenm003 at 163.com>
Date: Thu, Feb 5, 2015 at 5:55 PM
Subject: Re: [x265] [PATCH] blockcopy_pp_12x32: SSE2 asm code optimization
To: Development for x265 <x265-devel at videolan.org>

>> This code is right,
>> but could you try using general-register moves (rN, rNd) in x64 mode?

I applied your idea of using general-purpose registers as the buffer in x64 mode to 4x8 (easy to test with), but surprisingly, using SIMD registers is faster. Here are the code and performance numbers:
primitive       speedup  optimized       C reference
copy_pp[  4x8]  2.67x    139.98          374.18          [using general register move (rN, rNd)]
copy_pp[  4x8]  3.34x    109.60          366.35          [SIMD registers as buffer]

Code [using general register move (rN, rNd)]:
;-----------------------------------------------------------------------------
; void blockcopy_pp_4x8(pixel* dst, intptr_t dstStride, const pixel* src, intptr_t srcStride)
;-----------------------------------------------------------------------------
INIT_XMM sse2
cglobal blockcopy_pp_4x8, 4, 10, 0

    lea     r4,    [3 * r1]
    lea     r5,    [3 * r3]

    mov     r6d,     [r2]
    mov     r7d,     [r2 + r3]
    mov     r8d,     [r2 + 2 * r3]
    mov     r9d,     [r2 + r5]

    mov     [r0],          r6d
    mov     [r0 + r1],     r7d
    mov     [r0 + 2 * r1], r8d
    mov     [r0 + r4],     r9d

It is slower because you use more registers.
Use the sequence below instead:

mov r6d,  [r2]
mov [r0], r6d
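
Extended to the whole block, that load-then-store pattern might look like the sketch below. It is untested and rests on a few assumptions: 8-bit pixels (so one row of 4 pixels is a single dword), the x86inc.asm macros already used above, and r4 as the only scratch register, which is my choice so that cglobal requests 5 registers instead of 10. The pointers are advanced two rows at a time so that no 3 * stride registers are needed.

;-----------------------------------------------------------------------------
; void blockcopy_pp_4x8(pixel* dst, intptr_t dstStride, const pixel* src, intptr_t srcStride)
; sketch only: one scratch register, each row loaded and stored immediately
;-----------------------------------------------------------------------------
INIT_XMM sse2
cglobal blockcopy_pp_4x8, 4, 5, 0

%rep 3                              ; rows 0-5, two rows per iteration
    mov     r4d,           [r2]
    mov     [r0],          r4d
    mov     r4d,           [r2 + r3]
    mov     [r0 + r1],     r4d
    lea     r2,            [r2 + 2 * r3]    ; src += 2 * srcStride
    lea     r0,            [r0 + 2 * r1]    ; dst += 2 * dstStride
%endrep

    mov     r4d,           [r2]             ; rows 6-7, no pointer update needed
    mov     [r0],          r4d
    mov     r4d,           [r2 + r3]
    mov     [r0 + r1],     r4d
    RET

Whether this beats the SIMD version would still have to be measured with the test bench.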
