[x264-devel] [PATCH 1/1] arm: optimize luma intra deblock neon asm
Janne Grunau
janne-x264 at jannau.net
Tue Sep 1 00:31:40 CEST 2015
On 2015-08-31 00:55:58 +0200, Janne Grunau wrote:
> Hi Martin,
>
> hopefully faster luma intra deblock. Pushes only if_1 and if_2 onto the
> stack and then calculates first the p0'-p2' values and then the q0'-q2'
> values. There is some overhead due to the split calculations but
> hopefully less than spilling registers onto the stack.
>
> As for the other patches, feel free to squash it.
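For readers without the patch at hand, here is a rough scalar sketch of the H.264 luma intra (bS=4) edge filter that computes the p0'-p2' and q0'-q2' values mentioned above, with the p side finished before the q side. It is not the NEON code from the patch: the function name luma_intra_filter_line and the `strong` flag are made up for illustration, and it makes no claim about which conditions the asm calls if_1 and if_2. It assumes 8-bit samples and at least four valid samples on each side of the edge.

```c
#include <stdint.h>
#include <stdlib.h>

/* Filter one line of samples across an edge (hypothetical helper).
 * pix points at q0; p0..p3 sit at pix[-1*stride]..pix[-4*stride] and
 * q1..q3 at pix[1*stride]..pix[3*stride]. */
static void luma_intra_filter_line(uint8_t *pix, int stride, int alpha, int beta)
{
    int p3 = pix[-4*stride], p2 = pix[-3*stride];
    int p1 = pix[-2*stride], p0 = pix[-1*stride];
    int q0 = pix[0],         q1 = pix[1*stride];
    int q2 = pix[2*stride],  q3 = pix[3*stride];

    /* Leave the edge alone if it looks like real image content. */
    if (abs(p0 - q0) >= alpha || abs(p1 - p0) >= beta || abs(q1 - q0) >= beta)
        return;

    /* Small step across the edge: the 3-sample filters may be used. */
    int strong = abs(p0 - q0) < (alpha >> 2) + 2;

    /* p side first: p0', p1', p2' (or only p0' in the weak case). */
    if (strong && abs(p2 - p0) < beta) {
        pix[-1*stride] = (p2 + 2*p1 + 2*p0 + 2*q0 + q1 + 4) >> 3;
        pix[-2*stride] = (p2 + p1 + p0 + q0 + 2) >> 2;
        pix[-3*stride] = (2*p3 + 3*p2 + p1 + p0 + q0 + 4) >> 3;
    } else {
        pix[-1*stride] = (2*p1 + p0 + q1 + 2) >> 2;
    }

    /* Then the q side, using the original (unfiltered) p values. */
    if (strong && abs(q2 - q0) < beta) {
        pix[0]        = (p1 + 2*p0 + 2*q0 + 2*q1 + q2 + 4) >> 3;
        pix[1*stride] = (p0 + q0 + q1 + q2 + 2) >> 2;
        pix[2*stride] = (2*q3 + 3*q2 + q1 + q0 + p0 + 4) >> 3;
    } else {
        pix[0] = (2*q1 + q0 + p1 + 2) >> 2;
    }
}
```

Both halves read the same original samples, so they can be evaluated one after the other as long as those inputs stay available. A vector version that does this presumably only has to keep the condition masks around between the two halves.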
Benchmark results for cortex-{a7,a15} on an exynos 5422:

cortex-a7:                   C    neon (before)    neon (after)
deblock_luma_intra[0]:    6627             3498            3236
deblock_luma_intra[1]:    7314             2460            2163

cortex-a15:                  C    neon (before)    neon (after)
deblock_luma_intra[0]:    3300             1213            1220
deblock_luma_intra[1]:    4128              812             793
The cortex-a7 in the exynos seems to be faster than the rpi2. I'd have
hoped for an improvement on the a15 too, but it is still worthwhile for
the ~10% speedup on the cortex-a7.
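To make the ~10% figure easy to re-derive, a small sketch of the arithmetic, assuming the benchmark numbers are timings where lower is better. The helper name reduction_pct is made up for illustration.

```c
/* Relative reduction, in percent, going from `before` to `after`.
 * Positive means the new code is faster, negative means slower. */
static double reduction_pct(double before, double after)
{
    return 100.0 * (before - after) / before;
}
```

Applied to the neon (before) and neon (after) columns, this gives about 7.5% and 12.1% for the two cortex-a7 cases, and about -0.6% and 2.3% for the cortex-a15.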
Janne