[x265] [PATCH] add avx version for chroma_copy_ss 16x4, 16x8, 16x12, 16x16, 16x24, 16x32, 16x64 based on csp, approx 1.5x-2x speedup over SSE
chen
chenm003 at 163.com
Mon Sep 22 22:40:16 CEST 2014
At 2014-09-22 21:15:57,sagar at multicorewareinc.com wrote:
># HG changeset patch
># User Sagar Kotecha <sagar at multicorewareinc.com>
># Date 1411391728 -19800
># Mon Sep 22 18:45:28 2014 +0530
># Node ID 2fb0a3286265a757c94a36cec0695817116d5260
># Parent fd435504f15e0b13dabba9efe0aa94e7047060b5
>add avx version for chroma_copy_ss 16x4, 16x8, 16x12, 16x16, 16x24, 16x32, 16x64 based on csp, approx 1.5x-2x speedup over SSE
>
>--- a/source/common/x86/blockcopy8.asm Mon Sep 22 13:14:54 2014 +0530
>+++ b/source/common/x86/blockcopy8.asm Mon Sep 22 18:45:28 2014 +0530
>@@ -2904,6 +2904,46 @@
> BLOCKCOPY_SS_W16_H4 16, 12
>
> ;-----------------------------------------------------------------------------
>+; void blockcopy_ss_16x4(int16_t *dest, intptr_t deststride, int16_t *src, intptr_t srcstride)
>+;-----------------------------------------------------------------------------
>+%macro BLOCKCOPY_SS_W16_H4_avx 2
>+INIT_YMM avx
>+cglobal blockcopy_ss_%1x%2, 4, 5, 2
>+ mov r4d, %2/4
>+ add r1, r1
>+ add r3, r3
>+.loop:
>+ movu m0, [r2]
>+ movu m1, [r2 + r3]
>+
>+ movu [r0], m0
>+ movu [r0 + r1], m1
>+
>+ lea r2, [r2 + 2 * r3]
>+ lea r0, [r0 + 2 * r1]
You have more free registers, so you can precompute r1*3 and r3*3 in two of them; that removes two of the LEAs.
>+
>+ movu m0, [r2]
>+ movu m1, [r2 + r3]
>+ movu [r0], m0
>+ movu [r0 + r1], m1
>+
>+ dec r4d
>+ lea r0, [r0 + 2 * r1]
>+ lea r2, [r2 + 2 * r3]
After the optimization above, you can change the scale factor here from 2 to 4.
>+ jnz .loop
Put dec directly before jnz: an adjacent dec+jnz pair may save 1 uop, since newer CPUs fuse the two instructions (macro-fusion).
>+ RET
>+%endmacro
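Taken together, the three review comments describe one restructuring of the loop: address all four rows from a single base pointer (using a precomputed 3 * stride), then advance each pointer once by 4 * stride, with the counter decrement placed right before the branch. The following is only a scalar C model of that addressing, not x265 code; the function names copy_two_step, copy_four_step and loops_match are hypothetical, and memcpy of 32 bytes stands in for one ymm movu load/store pair.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* One 16-sample row of int16_t is 32 bytes, i.e. one ymm load/store. */
#define ROW_BYTES (16 * sizeof(int16_t))

/* Loop shape of the submitted patch: copy two rows, advance both
 * pointers by 2 * stride, copy two more rows, advance again. */
static void copy_two_step(int16_t *dst, intptr_t dst_stride,
                          const int16_t *src, intptr_t src_stride, int height)
{
    unsigned char *d = (unsigned char *)dst;
    const unsigned char *s = (const unsigned char *)src;
    intptr_t ds = dst_stride * 2;   /* models "add r1, r1": stride in bytes */
    intptr_t ss = src_stride * 2;   /* models "add r3, r3" */

    assert(height % 4 == 0);
    for (int n = height / 4; n != 0; n--) {
        memcpy(d, s, ROW_BYTES);
        memcpy(d + ds, s + ss, ROW_BYTES);
        s += 2 * ss;                /* first pair of LEAs */
        d += 2 * ds;
        memcpy(d, s, ROW_BYTES);
        memcpy(d + ds, s + ss, ROW_BYTES);
        d += 2 * ds;                /* second pair of LEAs */
        s += 2 * ss;
    }
}

/* Loop shape suggested in the review: 3 * stride is computed once
 * outside the loop, so each iteration needs only one pointer update
 * per buffer, scaled by 4. */
static void copy_four_step(int16_t *dst, intptr_t dst_stride,
                           const int16_t *src, intptr_t src_stride, int height)
{
    unsigned char *d = (unsigned char *)dst;
    const unsigned char *s = (const unsigned char *)src;
    intptr_t ds = dst_stride * 2;
    intptr_t ss = src_stride * 2;
    intptr_t ds3 = ds * 3;          /* would live in a spare register */
    intptr_t ss3 = ss * 3;

    assert(height % 4 == 0);
    for (int n = height / 4; n != 0; n--) {
        memcpy(d, s, ROW_BYTES);
        memcpy(d + ds, s + ss, ROW_BYTES);
        memcpy(d + 2 * ds, s + 2 * ss, ROW_BYTES);
        memcpy(d + ds3, s + ss3, ROW_BYTES);
        d += 4 * ds;                /* single update, factor 4 */
        s += 4 * ss;
    }
}

/* Returns 1 if both loop shapes produce identical output and every
 * copied row matches its source row. Strides are in int16_t units and
 * must be between 16 and 64; height must be a multiple of 4, at most 64. */
int loops_match(int height, intptr_t dst_stride, intptr_t src_stride)
{
    static int16_t src[64 * 64], a[64 * 64], b[64 * 64];

    for (int i = 0; i < 64 * 64; i++)
        src[i] = (int16_t)(i * 7 + 1);
    memset(a, 0, sizeof a);
    memset(b, 0, sizeof b);

    copy_two_step(a, dst_stride, src, src_stride, height);
    copy_four_step(b, dst_stride, src, src_stride, height);

    if (memcmp(a, b, sizeof a) != 0)
        return 0;
    for (int r = 0; r < height; r++)
        if (memcmp(a + r * dst_stride, src + r * src_stride, ROW_BYTES) != 0)
            return 0;
    return 1;
}
```

In the assembly itself, the two extra stride values would need two more general-purpose registers than the 5 currently requested in the cglobal line, and the model says nothing about timing; it only checks that the rewritten addressing copies the same rows.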