[x264-devel] [PATCH 1/4] x264_intra_sad_x3_4x4_armv6
Måns Rullgård
mans at mansr.com
Tue Jan 31 01:37:08 CET 2012
George Stephanos <gaf.stephanos at gmail.com> writes:
> +function x264_intra_sad_x3_4x4_armv6
> + push {r4-r6,lr}
> + mov r5, #0
> +
> +.set Y, 0
> +.rept 4
> +.if Y==0
> + ldrb r6, [r1, #Y*FDEC_STRIDE-1]
> + add r3, r6, r6, lsl #8
> +.else
> + ldrb r3, [r1, #Y*FDEC_STRIDE-1]
> + add r6, r3
> + add r3, r3, r3, lsl #8
> +.endif
> + ldr r4, [r0, #Y*FENC_STRIDE]
> + add r3, r3, r3, lsl #16
> + usada8 r5, r3, r4, r5
> +.set Y, Y+1
> +.endr
> + str r5, [r2, #4]
> + mov r5, #0
> +
> + ldr r3, [r1, #-1*FDEC_STRIDE]
> +
> + ldr r4, [r0, #0*FENC_STRIDE]
> + ldr r1, [r0, #1*FENC_STRIDE]
> + usada8 r5, r3, r4, r5
> + ldr r4, [r0, #2*FENC_STRIDE]
> + usada8 r5, r3, r1, r5
> + ldr r1, [r0, #3*FENC_STRIDE]
> + usada8 r5, r3, r4, r5
> + usada8 r5, r3, r1, r5
> +
> + str r5, [r2]
> +
> + mov r5, #0
> + add r6, #4
> + usad8 r1, r3, r5
> +
> + add r1, r6
> + lsr r1, #3
> + add r1, r1, r1, lsl #8
> + ldr r4, [r0, #0*FENC_STRIDE]
> + add r1, r1, r1, lsl #16
> + ldr r3, [r0, #1*FENC_STRIDE]
> + usada8 r5, r1, r4, r5
> + ldr r4, [r0, #2*FENC_STRIDE]
> + usada8 r5, r1, r3, r5
> + ldr r3, [r0, #3*FENC_STRIDE]
> + usada8 r5, r1, r4, r5
> + usada8 r5, r1, r3, r5
> +
> + str r5, [r2, #8]
> + pop {r4-r6,pc}
> +.endfunc
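This code has three independent blocks in sequence, each with a tight
dependency chain internally: every usada8 accumulates into r5, so each
one has to wait for the previous one. Moreover, all three blocks load
the same four rows from [r0]. Use a few more registers and interleave
the three blocks instead, loading each row only once.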
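
To illustrate the suggested structure, here is a rough C model of the
data flow, not the assembly itself. It assumes FENC_STRIDE is 16 and
FDEC_STRIDE is 32, and the names intra_sad_x3_4x4_model, usada8, splat
and load32 are made up for this sketch. Each source row is loaded once
and fed to three separate accumulators, so the three SAD chains are
independent and can overlap:

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define FENC_STRIDE 16  /* assumed value */
#define FDEC_STRIDE 32  /* assumed value */

/* Model of usada8: acc plus the sum of absolute byte differences of a and b. */
static uint32_t usada8(uint32_t a, uint32_t b, uint32_t acc)
{
    for (int i = 0; i < 4; i++) {
        int x = (int)((a >> (8 * i)) & 0xff);
        int y = (int)((b >> (8 * i)) & 0xff);
        acc += (uint32_t)abs(x - y);
    }
    return acc;
}

/* Replicate one byte into all four lanes (the two add/lsl steps in the patch). */
static uint32_t splat(uint32_t b)
{
    return b * 0x01010101u;
}

/* Unaligned-safe 32-bit load. */
static uint32_t load32(const uint8_t *p)
{
    uint32_t v;
    memcpy(&v, p, sizeof v);
    return v;
}

/* res[0] = vertical SAD, res[1] = horizontal SAD, res[2] = DC SAD,
 * matching the store offsets 0, 4 and 8 in the patch. */
void intra_sad_x3_4x4_model(const uint8_t *fenc, const uint8_t *fdec, int res[3])
{
    uint32_t top = load32(fdec - FDEC_STRIDE);
    uint32_t left[4];
    uint32_t dc_sum = usada8(top, 0, 4);    /* sum of top pixels + rounding */

    for (int y = 0; y < 4; y++) {
        left[y] = fdec[y * FDEC_STRIDE - 1];
        dc_sum += left[y];
    }
    uint32_t dc = splat(dc_sum >> 3);

    /* Three accumulators, one load per row. */
    uint32_t sad_v = 0, sad_h = 0, sad_dc = 0;
    for (int y = 0; y < 4; y++) {
        uint32_t row = load32(fenc + y * FENC_STRIDE);
        sad_v  = usada8(top,            row, sad_v);
        sad_h  = usada8(splat(left[y]), row, sad_h);
        sad_dc = usada8(dc,             row, sad_dc);
    }

    res[0] = (int)sad_v;
    res[1] = (int)sad_h;
    res[2] = (int)sad_dc;
}
```

In the assembly this would mean keeping three accumulator registers live
instead of reusing r5, so the usada8 instructions for different
predictions do not wait on each other.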
--
Måns Rullgård
mans at mansr.com