Hello!

Sorry for the extreme delay in finishing up this patch. I've now applied some of Måns' suggestions (interleaved some instructions, removed unnecessary register pushes, and otherwise attempted to improve performance further).

Best regards,
Stefan Groenroos

On Tue, Feb 8, 2011 at 3:59 PM, Måns Rullgård <mans@mansr.com> wrote:
Stefan Grönroos <stefan.gronroos@gmail.com> writes:

> Hello guys!
>
> The attached patch changes x264_mc_chroma_neon to work with the NV12
> format. In quick tests, the code seems to give about 15% higher fps on a
> Nokia N900 using a Foreman CIF test video with default settings.
>
> Still, the instruction ordering could probably be optimized further. I also
> have a feeling that something could be done better, which occurred to me
> while writing the code, but I don't remember what it was. Maybe someone
> more into NEON could take a look at the code?
>
> Best regards,
> Stefan Grönroos
>
> From a14ef35d96f9eb5e452405f10fcd0d2f8c48cb55 Mon Sep 17 00:00:00 2001
> From: Stefan Groenroos <stefan.gronroos@gmail.com>
> Date: Sun, 6 Feb 2011 11:56:30 +0200
> Subject: [PATCH] Fix mc_chroma NEON code for NV12.
>
> ---
> common/arm/mc-a.S | 355 +++++++++++++++++++++++++++++++---------------------
> common/arm/mc-c.c | 2 +-
> 2 files changed, 213 insertions(+), 144 deletions(-)
>
> diff --git a/common/arm/mc-a.S b/common/arm/mc-a.S
> index 62e88cc..f06de85 100644
> --- a/common/arm/mc-a.S
> +++ b/common/arm/mc-a.S
> @@ -806,57 +806,56 @@ copy_w16_aligned_loop:
> .endfunc
>
>
> -// void x264_mc_chroma_neon( uint8_t *dst, int i_dst_stride,
> -// uint8_t *src, int i_src_stride,
> -// int dx, int dy, int i_width, int i_height );
> +//void mc_chroma( pixel *dstu, pixel *dstv, int i_dst_stride,
> +// pixel *src, int i_src_stride,
> +// int mvx, int mvy,
> +// int i_width, int i_height )
> function x264_mc_chroma_neon
> - push {r4-r6, lr}
> - ldrd r4, [sp, #16]
> - ldr r6, [sp, #24]
> + push {r4-r7, lr}
> + vpush {d8-d15}

Only push/pop the registers you actually use. Keeping sp 8-byte
aligned before the vpush might also save a cycle. Pushing an extra
core register to maintain alignment costs nothing.
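
For illustration, a rough sketch of what that could look like. Which d
registers actually need saving depends on what the final code clobbers, so
d8-d11 below is just a guess, and r8 is pushed purely as alignment padding:

    push            {r4-r8, lr}     // 6 core registers = 24 bytes, sp stays 8-byte aligned
    vpush           {d8-d11}        // save only the callee-saved d registers that are used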

> + ldrd r4, [sp, #84]
> + ldrd r6, [sp, #92]
>
> - asr lr, r5, #3
> - mul lr, r3, lr
> - add r2, r2, r4, asr #3
> - cmp r6, #4
> - add r2, r2, lr
> + asr lr, r6, #3
> + mul lr, r4, lr
> + add r3, r3, r5, asr #2
> + cmp r7, #4
> + add r3, r3, lr
> + bic r3, r3, #0x1
>
> - and r4, r4, #7
> and r5, r5, #7
> - pld [r2]
> - pld [r2, r3]
> + and r6, r6, #7
> + pld [r3]
> + pld [r3, r4]
>
> bgt mc_chroma_w8
> beq mc_chroma_w4
>
> -// calculate cA cB cC cD
> .macro CHROMA_MC_START r0 r1
> - muls lr, r4, r5
> - rsb r6, lr, r5, lsl #3
> - rsb ip, lr, r4, lsl #3
> - sub r4, lr, r4, lsl #3
> - sub r4, r4, r5, lsl #3
> - add r4, r4, #64
> + muls lr, r5, r6
> + rsb r7, lr, r6, lsl #3
> + rsb ip, lr, r5, lsl #3
> + sub r5, lr, r5, lsl #3
> + sub r5, r5, r6, lsl #3
> + add r5, r5, #64
>
> beq 2f
> -
> - add r5, r2, r3
> -
> - vdup.8 d0, r4
> - lsl r3, r3, #1
> + vdup.8 d0, r5
> vdup.8 d1, ip
> - vld1.64 {\r0}, [r2], r3
> - vdup.8 d2, r6
> - vld1.64 {\r1}, [r5], r3
> + vld2.8 {\r0}, [r3], r4
> + vdup.8 d2, r7
> + vld2.8 {\r1}, [r3], r4
> vdup.8 d3, lr

It is sometimes faster to use a VLD1 followed by a VUZP. The total
issue cycles are the same, but if interleaved with arithmetic, there
are more opportunities for dual-issue. On Cortex-A5 this split is
almost always faster due to how VLDn for n>1 is implemented there.

It could also be an idea to defer the deinterleaving until later in
the calculations.
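
Concretely, the idea might look something like this (untested, register
numbers taken from the patch); the separate VUZP can then be scheduled in
between arithmetic instructions:

    // instead of:
    vld2.8          {d4-d5}, [r3], r4
    // load linearly and deinterleave afterwards:
    vld1.8          {d4-d5}, [r3], r4
    vuzp.8          d4, d5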

> - ldr r4, [sp, #28]
> -
> - vext.8 d5, d4, d5, #1
> - vext.8 d7, d6, d7, #1
> + ldr r5, [sp, #100]
> .endm
>
> .macro CHROMA_MC width, align
> mc_chroma_w\width:
> - CHROMA_MC_START d4, d6
> + CHROMA_MC_START d4-d5, d8-d9
> + vext.8 d6, d4, d6, #1
> + vext.8 d7, d5, d7, #1
> + vext.8 d10, d8, d10, #1
> + vext.8 d11, d9, d11, #1
> // since the element size varies, there's a different index for the 2nd store
> .if \width == 4
> .set st2, 1
> @@ -864,187 +863,257 @@ mc_chroma_w\width:
> .set st2, 2
> .endif
>
> - vtrn.32 d4, d5
> - vtrn.32 d6, d7
> + vtrn.32 d4, d6
> + vtrn.32 d5, d7
> + vtrn.32 d8, d10
> + vtrn.32 d9, d11
>
> vtrn.32 d0, d1
> vtrn.32 d2, d3
>
> 1: // height loop, interpolate xy
> - pld [r5]
> + pld [r3]
> vmull.u8 q8, d4, d0
> - vmlal.u8 q8, d6, d2
> - vld1.64 {d4}, [r2], r3
> - vext.8 d5, d4, d5, #1
> - vtrn.32 d4, d5
> - vmull.u8 q9, d6, d0
> - vmlal.u8 q9, d4, d2
> - vld1.64 {d6}, [r5], r3
> + vmlal.u8 q8, d8, d2
> + vmull.u8 q9, d5, d0
> + vmlal.u8 q9, d9, d2
> + vld2.8 {d4-d5}, [r3], r4
> + vext.8 d6, d4, d6, #1
> + vext.8 d7, d5, d7, #1
> + vtrn.32 d4, d6
> + vtrn.32 d5, d7
> +
> + vmull.u8 q10, d8, d0
> + vmlal.u8 q10, d4, d2
> + vmull.u8 q11, d9, d0
> + vmlal.u8 q11, d5, d2
> + vld2.8 {d8-d9}, [r3], r4

Here, some interleaving of arithmetic and permutation instructions
might help, since these can dual-issue (on the A8).
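
For example, one possible (untested) ordering: the q10/q11 multiplies read
only d8/d9 and d0, so they can slot in between the permutes, while the
accumulates that read d4/d5 stay after the VTRNs:

    vld2.8          {d4-d5}, [r3], r4
    vmull.u8        q10, d8,  d0        // independent of the load above
    vext.8          d6,  d4,  d6,  #1
    vmull.u8        q11, d9,  d0
    vext.8          d7,  d5,  d7,  #1
    vtrn.32         d4,  d6
    vmlal.u8        q10, d4,  d2        // uses the permuted d4
    vtrn.32         d5,  d7
    vmlal.u8        q11, d5,  d2        // uses the permuted d5
    vld2.8          {d8-d9}, [r3], r4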

> vadd.i16 d16, d16, d17
> vadd.i16 d17, d18, d19
> + vadd.i16 d18, d20, d21
> + vadd.i16 d19, d22, d23
> vrshrn.u16 d16, q8, #6
> - subs r4, r4, #2
> - pld [r2]
> - vext.8 d7, d6, d7, #1
> - vtrn.32 d6, d7
> - vst1.\align {d16[0]}, [r0,:\align], r1
> - vst1.\align {d16[st2]}, [r0,:\align], r1
> + vrshrn.u16 d18, q9, #6
> + subs r5, r5, #2
> + pld [r3]
> + vext.8 d10, d8, d10, #1
> + vext.8 d11, d9, d11, #1
> + vtrn.32 d8, d10
> + vtrn.32 d9, d11

I like to keep the operands vertically aligned.
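
That is, padding the operands out to fixed columns, e.g.:

    vmull.u8        q10, d8,  d0
    vmlal.u8        q10, d4,  d2
    vmull.u8        q11, d9,  d0
    vmlal.u8        q11, d5,  d2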

> + vst1.\align {d16[0]}, [r0,:\align], r2
> + vst1.\align {d16[st2]}, [r1,:\align], r2
> + vst1.\align {d18[0]}, [r0,:\align], r2
> + vst1.\align {d18[st2]}, [r1,:\align], r2
> bgt 1b

The same comments apply to the rest of the cases.

I would suggest trying out a few improvements if time permits,
otherwise go with this as is. It's certainly faster than C...

--
Måns Rullgård
mans@mansr.com