[x264-devel] [PATCH 1/1] arm: optimize luma intra deblock neon asm

Janne Grunau janne-x264 at jannau.net
Tue Sep 1 00:31:40 CEST 2015


On 2015-08-31 00:55:58 +0200, Janne Grunau wrote:
> Hi Martin,
> 
> hopefully faster luma intra deblock. Pushes only if_1 and if_2 onto the
> stack and then first calculates the p0'-p2' values and then the q0'-q2'
> values. There is some overhead due to the split calculations but
> hopefully less than spilling registers onto the stack.
> 
> As for the other patches, feel free to squash it.
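To make the split order concrete, here is a minimal scalar C sketch of the H.264 luma intra (bS=4) edge filter for a single line of samples, written so that all p0'-p2' outputs are produced before any q0'-q2' output. It is only an illustration, not the NEON code from the patch: the function name luma_intra_line, the 8-sample array layout (p3..p0, q0..q3) and the flag names are assumptions, and which conditions the patch actually keeps as if_1 and if_2 is not spelled out in this mail. The sketch simply keeps two shared flags that both halves reuse.

```c
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical helper: filter one line across an edge.
 * Assumed layout: px[0..3] = p3,p2,p1,p0 and px[4..7] = q0,q1,q2,q3. */
static void luma_intra_line(uint8_t px[8], int alpha, int beta)
{
    /* Load everything first so both halves see the unfiltered samples. */
    int p3 = px[0], p2 = px[1], p1 = px[2], p0 = px[3];
    int q0 = px[4], q1 = px[5], q2 = px[6], q3 = px[7];

    /* Two conditions shared by the p and q halves (illustrative names). */
    int filter = abs(p0 - q0) < alpha && abs(p1 - p0) < beta &&
                 abs(q1 - q0) < beta;
    int strong = abs(p0 - q0) < (alpha >> 2) + 2;

    if (!filter)
        return;

    /* First half: p0', p1', p2' */
    if (strong && abs(p2 - p0) < beta) {
        px[3] = (p2 + 2*p1 + 2*p0 + 2*q0 + q1 + 4) >> 3;
        px[2] = (p2 + p1 + p0 + q0 + 2) >> 2;
        px[1] = (2*p3 + 3*p2 + p1 + p0 + q0 + 4) >> 3;
    } else {
        px[3] = (2*p1 + p0 + q1 + 2) >> 2;
    }

    /* Second half: q0', q1', q2' */
    if (strong && abs(q2 - q0) < beta) {
        px[4] = (p1 + 2*p0 + 2*q0 + 2*q1 + q2 + 4) >> 3;
        px[5] = (p0 + q0 + q1 + q2 + 2) >> 2;
        px[6] = (2*q3 + 3*q2 + q1 + q0 + p0 + 4) >> 3;
    } else {
        px[4] = (2*q1 + q0 + p1 + 2) >> 2;
    }
}
```

In a vectorized version the shared flags would be masks that have to stay live across both halves, which is presumably why only those need to be saved while the rest is recomputed.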

Benchmark results for cortex-{a7,a15} on an exynos 5422:

cortex-a7:              C       neon (before)   neon (after)
deblock_luma_intra[0]:  6627    3498            3236
deblock_luma_intra[1]:  7314    2460            2163

cortex-a15:             C       neon (before)   neon (after)
deblock_luma_intra[0]:  3300    1213            1220
deblock_luma_intra[1]:  4128    812             793

The cortex-a7 in the exynos seems to be faster than the rpi2. I'd have
hoped for an improvement on the a15 too, but it is still worthwhile for
the ~10% speedup on the cortex-a7.
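For reference, the relative change can be worked out from the table with a few lines of C. This is just a sketch: the helper name percent_saved is made up for illustration, and it assumes the table entries are timings where lower is better.

```c
#include <stdio.h>

/* Hypothetical helper: percentage of time saved going from `before`
 * to `after`; a negative result means the new code is slower. */
static double percent_saved(double before, double after)
{
    return 100.0 * (before - after) / before;
}

/* Print the change for the four neon (before)/(after) pairs in the table. */
static void print_deblock_savings(void)
{
    printf("a7  [0]: %.1f%%\n", percent_saved(3498, 3236));
    printf("a7  [1]: %.1f%%\n", percent_saved(2460, 2163));
    printf("a15 [0]: %.1f%%\n", percent_saved(1213, 1220));
    printf("a15 [1]: %.1f%%\n", percent_saved(812, 793));
}
```

With the numbers above this works out to roughly 7.5% and 12.1% on the cortex-a7, and about -0.6% and 2.3% on the cortex-a15.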

Janne

