Thanks for the comments, Mĺns!<br>Hopefully I'll have some time in the weekend to try out your suggested changes, and try to squeeze some extra performance out of it.<br><br>Stefan<br><br><div class="gmail_quote">On Tue, Feb 8, 2011 at 3:59 PM, Mĺns Rullgĺrd <span dir="ltr"><<a href="mailto:mans@mansr.com">mans@mansr.com</a>></span> wrote:<br>
<blockquote class="gmail_quote" style="margin: 0pt 0pt 0pt 0.8ex; border-left: 1px solid rgb(204, 204, 204); padding-left: 1ex;"><div><div></div><div class="h5">Stefan Grönroos <<a href="mailto:stefan.gronroos@gmail.com">stefan.gronroos@gmail.com</a>> writes:<br>
<br>
> Hello guys!<br>
><br>
> The attached patch changes the x264_mc_chroma_neon to work with the NV12<br>
> format. With quick tests, the code seems to give about 15% higher fps on a<br>
> Nokia N900 using a Foreman CIF test video on default settings.<br>
><br>
> Still, the instruction ordering should maybe be optimized still. I also have<br>
> a feeling that I thought something could be done better when I wrote it, but<br>
> I don't remember what it was. Maybe someone more into NEON could take a look<br>
> at the code?<br>
><br>
> Best regards,<br>
> Stefan Grönroos<br>
><br>
</div></div>> From a14ef35d96f9eb5e452405f10fcd0d2f8c48cb55 Mon Sep 17 00:00:00 2001<br>
> From: Stefan Groenroos <<a href="mailto:stefan.gronroos@gmail.com">stefan.gronroos@gmail.com</a>><br>
> Date: Sun, 6 Feb 2011 11:56:30 +0200<br>
> Subject: [PATCH] Fix mc_chroma NEON code for NV12.<br>
><br>
> ---<br>
> common/arm/mc-a.S | 355 +++++++++++++++++++++++++++++++---------------------<br>
> common/arm/mc-c.c | 2 +-<br>
> 2 files changed, 213 insertions(+), 144 deletions(-)<br>
><br>
> diff --git a/common/arm/mc-a.S b/common/arm/mc-a.S<br>
> index 62e88cc..f06de85 100644<br>
> --- a/common/arm/mc-a.S<br>
> +++ b/common/arm/mc-a.S<br>
> @@ -806,57 +806,56 @@ copy_w16_aligned_loop:<br>
> .endfunc<br>
><br>
><br>
> -// void x264_mc_chroma_neon( uint8_t *dst, int i_dst_stride,<br>
> -// uint8_t *src, int i_src_stride,<br>
> -// int dx, int dy, int i_width, int i_height );<br>
> +//void mc_chroma( pixel *dstu, pixel *dstv, int i_dst_stride,<br>
> +// pixel *src, int i_src_stride,<br>
> +// int mvx, int mvy,<br>
> +// int i_width, int i_height )<br>
> function x264_mc_chroma_neon<br>
> - push {r4-r6, lr}<br>
> - ldrd r4, [sp, #16]<br>
> - ldr r6, [sp, #24]<br>
> + push {r4-r7, lr}<br>
> + vpush {d8-d15}<br>
<br>
Only push/pop the registers you actually use. Keeping sp 8-byte<br>
aligned before the vpush might also save a cycle. Pushing an extra<br>
core register to maintain alignment costs nothing.<br>
<br>
> + ldrd r4, [sp, #84]<br>
> + ldrd r6, [sp, #92]<br>
><br>
> - asr lr, r5, #3<br>
> - mul lr, r3, lr<br>
> - add r2, r2, r4, asr #3<br>
> - cmp r6, #4<br>
> - add r2, r2, lr<br>
> + asr lr, r6, #3<br>
> + mul lr, r4, lr<br>
> + add r3, r3, r5, asr #2<br>
> + cmp r7, #4<br>
> + add r3, r3, lr<br>
> + bic r3, r3, #0x1<br>
><br>
> - and r4, r4, #7<br>
> and r5, r5, #7<br>
> - pld [r2]<br>
> - pld [r2, r3]<br>
> + and r6, r6, #7<br>
> + pld [r3]<br>
> + pld [r3, r4]<br>
><br>
> bgt mc_chroma_w8<br>
> beq mc_chroma_w4<br>
><br>
> -// calculate cA cB cC cD<br>
> .macro CHROMA_MC_START r0 r1<br>
> - muls lr, r4, r5<br>
> - rsb r6, lr, r5, lsl #3<br>
> - rsb ip, lr, r4, lsl #3<br>
> - sub r4, lr, r4, lsl #3<br>
> - sub r4, r4, r5, lsl #3<br>
> - add r4, r4, #64<br>
> + muls lr, r5, r6<br>
> + rsb r7, lr, r6, lsl #3<br>
> + rsb ip, lr, r5, lsl #3<br>
> + sub r5, lr, r5, lsl #3<br>
> + sub r5, r5, r6, lsl #3<br>
> + add r5, r5, #64<br>
><br>
> beq 2f<br>
> -<br>
> - add r5, r2, r3<br>
> -<br>
> - vdup.8 d0, r4<br>
> - lsl r3, r3, #1<br>
> + vdup.8 d0, r5<br>
> vdup.8 d1, ip<br>
> - vld1.64 {\r0}, [r2], r3<br>
> - vdup.8 d2, r6<br>
> - vld1.64 {\r1}, [r5], r3<br>
> + vld2.8 {\r0}, [r3], r4<br>
> + vdup.8 d2, r7<br>
> + vld2.8 {\r1}, [r3], r4<br>
> vdup.8 d3, lr<br>
<br>
It is sometimes faster to do use VLD1 followed by VUZP. The total<br>
issue cycles are the same, but if interleaved with arithmetic, there<br>
are more opportunities for dual-issue. On Cortex-A5 this split is<br>
almost always faster due to how VLDn for n>1 is implemented there.<br>
<br>
It could also be an idea to defer the deinterleaving until later in<br>
the calculations.<br>
<br>
> - ldr r4, [sp, #28]<br>
> -<br>
> - vext.8 d5, d4, d5, #1<br>
> - vext.8 d7, d6, d7, #1<br>
> + ldr r5, [sp, #100]<br>
> .endm<br>
><br>
> .macro CHROMA_MC width, align<br>
> mc_chroma_w\width:<br>
> - CHROMA_MC_START d4, d6<br>
> + CHROMA_MC_START d4-d5, d8-d9<br>
> + vext.8 d6, d4, d6, #1<br>
> + vext.8 d7, d5, d7, #1<br>
> + vext.8 d10, d8, d10, #1<br>
> + vext.8 d11, d9, d11, #1<br>
> // since the element size varies, there's a different index for the 2nd store<br>
> .if \width == 4<br>
> .set st2, 1<br>
> @@ -864,187 +863,257 @@ mc_chroma_w\width:<br>
> .set st2, 2<br>
> .endif<br>
><br>
> - vtrn.32 d4, d5<br>
> - vtrn.32 d6, d7<br>
> + vtrn.32 d4, d6<br>
> + vtrn.32 d5, d7<br>
> + vtrn.32 d8, d10<br>
> + vtrn.32 d9, d11<br>
><br>
> vtrn.32 d0, d1<br>
> vtrn.32 d2, d3<br>
><br>
> 1: // height loop, interpolate xy<br>
> - pld [r5]<br>
> + pld [r3]<br>
> vmull.u8 q8, d4, d0<br>
> - vmlal.u8 q8, d6, d2<br>
> - vld1.64 {d4}, [r2], r3<br>
> - vext.8 d5, d4, d5, #1<br>
> - vtrn.32 d4, d5<br>
> - vmull.u8 q9, d6, d0<br>
> - vmlal.u8 q9, d4, d2<br>
> - vld1.64 {d6}, [r5], r3<br>
> + vmlal.u8 q8, d8, d2<br>
> + vmull.u8 q9, d5, d0<br>
> + vmlal.u8 q9, d9, d2<br>
> + vld2.8 {d4-d5}, [r3], r4<br>
> + vext.8 d6, d4, d6, #1<br>
> + vext.8 d7, d5, d7, #1<br>
> + vtrn.32 d4, d6<br>
> + vtrn.32 d5, d7<br>
> +<br>
> + vmull.u8 q10, d8, d0<br>
> + vmlal.u8 q10, d4, d2<br>
> + vmull.u8 q11, d9, d0<br>
> + vmlal.u8 q11, d5, d2<br>
> + vld2.8 {d8-d9}, [r3], r4<br>
<br>
Here some interleaving of arithmetic and permutation instructions<br>
might help since these can dual-issue (on A8).<br>
<br>
> vadd.i16 d16, d16, d17<br>
> vadd.i16 d17, d18, d19<br>
> + vadd.i16 d18, d20, d21<br>
> + vadd.i16 d19, d22, d23<br>
> vrshrn.u16 d16, q8, #6<br>
> - subs r4, r4, #2<br>
> - pld [r2]<br>
> - vext.8 d7, d6, d7, #1<br>
> - vtrn.32 d6, d7<br>
> - vst1.\align {d16[0]}, [r0,:\align], r1<br>
> - vst1.\align {d16[st2]}, [r0,:\align], r1<br>
> + vrshrn.u16 d18, q9, #6<br>
> + subs r5, r5, #2<br>
> + pld [r3]<br>
> + vext.8 d10, d8, d10, #1<br>
> + vext.8 d11, d9, d11, #1<br>
> + vtrn.32 d8, d10<br>
> + vtrn.32 d9, d11<br>
<br>
I like to keep the operands vertically aligned.<br>
<br>
> + vst1.\align {d16[0]}, [r0,:\align], r2<br>
> + vst1.\align {d16[st2]}, [r1,:\align], r2<br>
> + vst1.\align {d18[0]}, [r0,:\align], r2<br>
> + vst1.\align {d18[st2]}, [r1,:\align], r2<br>
> bgt 1b<br>
<br>
The same comments apply to the rest of other cases.<br>
<br>
I would suggest trying out a few improvements if time permits,<br>
otherwise go with this as is. It's certainly faster than C...<br>
<font color="#888888"><br>
--<br>
Mĺns Rullgĺrd<br>
<a href="mailto:mans@mansr.com">mans@mansr.com</a><br>
_______________________________________________<br>
x264-devel mailing list<br>
<a href="mailto:x264-devel@videolan.org">x264-devel@videolan.org</a><br>
<a href="http://mailman.videolan.org/listinfo/x264-devel" target="_blank">http://mailman.videolan.org/listinfo/x264-devel</a><br>
</font></blockquote></div><br>