[x264-devel] commit: Compile fixes for pre-ARMv6T2 and/or PIC (David Conrad)

Mon Sep 7 04:52:03 CEST 2009

David Conrad <lessen42 at gmail.com> writes:

> On Sep 6, 2009, at 2:45 PM, Måns Rullgård wrote:
>
>> git at videolan.org (git version control) writes:
>>
>>> x264 | branch: master | David Conrad <lessen42 at gmail.com> | Wed
>>> Sep  2 16:14:59 2009 -0700|
>>> [e390cbf993d180b1db413746272e232ac3068dad] | committer: Jason
>>> Garrett-Glaser
>>>
>>> Compile fixes for pre-ARMv6T2 and/or PIC
>>>
>>> +.macro movconst rd, val
>>> +#ifdef HAVE_ARMV6T2
>>> +    movw        \rd, #:lower16:\val
>>> +.if \val >> 16
>>> +    movt        \rd, #:upper16:\val
>>> +.endif
>>> +#else
>>> +    ldr         \rd, =\val
>>> +#endif
>>> +.endm
>>> +
>>> @@ -1209,9 +1203,8 @@ function x264_pixel_ssim_end4_neon, export=1
>>>     vshl.s32    q2,  q2,  #6
>>>     vadd.s32    q1,  q8,  q8
>>>
>>> -    mov         r3, #416        // ssim_c1= .01*.01*255*255*64
>>> -    movw        ip, #39355      // ssim_c2= .03*.03*255*255*64*63 - 3<<16
>>> -    movt        ip, #3
>>> +    mov         r3, #416        // ssim_c1 = .01*.01*255*255*64
>>> +    movconst    ip, 235963      // ssim_c2 = .03*.03*255*255*64*63
>>>     vdup.32     q14, r3
>>>     vdup.32     q15, ip
>>>
>>> diff --git a/common/arm/predict-a.S b/common/arm/predict-a.S
>>> index 46e687b..8ff61a2 100644
>>> --- a/common/arm/predict-a.S
>>> +++ b/common/arm/predict-a.S
>>> @@ -102,7 +102,7 @@ function x264_predict_4x4_ddr_armv6, export=1
>>>     add     r4, r4, r3, lsl #8
>>>     add     r5, r5, r4, lsl #8
>>>     add     r6, r6, r5, lsl #8
>>> -    ldr     ip, pb_1
>>> +    ldr     ip, =0x01010101
>>
>> Why not use movconst here?
>
> Oops it should, that was the first thing I did (since I didn't know
> about the syntax earlier); movconst was the last. I'll change it in
> with iPhone support.
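
As a sanity check on the macro quoted above, here is a rough C model of
what movconst emits on the ARMv6T2 path: movw loads the low 16 bits and
zeroes the top half, and movt is only assembled when the top half is
non-zero. The function names are made up for illustration and are not
part of x264.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helpers modelling the :lower16: and :upper16: operators. */
uint32_t lower16(uint32_t val) { return val & 0xffffu; }
uint32_t upper16(uint32_t val) { return val >> 16; }

/* Model of the HAVE_ARMV6T2 branch of the movconst macro. */
uint32_t movconst_model(uint32_t val)
{
    /* movw rd, #:lower16:val  (writes the low half, clears the high half) */
    uint32_t rd = lower16(val);

    /* .if \val >> 16 ... movt rd, #:upper16:val ... .endif */
    if (val >> 16)
        rd = (rd & 0xffffu) | (upper16(val) << 16);

    return rd;
}

/* Self-check for the ssim_c2 constant from the patch. */
void check_movconst_model(void)
{
    assert(lower16(235963u) == 39355u);   /* the old movw immediate */
    assert(upper16(235963u) == 3u);       /* the old movt immediate */
    assert(movconst_model(235963u) == 235963u);
}
```

For ssim_c2 this reproduces the old hand-written movw #39355 / movt #3
pair, since 235963 = (3 << 16) + 39355.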

In this particular case, a shift/or sequence might be faster on pre-T2
CPUs:

    mov ip, #0x01
    orr ip, ip, #0x0100
    orr ip, ip, ip, lsl #16

For best results, interleave with other instructions to avoid a stall
between the second and third line.  Shifted operands are required one
cycle earlier than non-shifted.
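
To see that the three instructions really produce 0x01010101, the same
steps can be written out in C, one statement per instruction. This only
models the arithmetic, not the timing; the function names are
hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Mirrors the three ARM instructions above, one C statement each. */
uint32_t build_pb_1(void)
{
    uint32_t ip;

    ip = 0x01;             /* mov ip, #0x01            -> 0x00000001 */
    ip = ip | 0x0100;      /* orr ip, ip, #0x0100      -> 0x00000101 */
    ip = ip | (ip << 16);  /* orr ip, ip, ip, lsl #16  -> 0x01010101 */

    return ip;
}

/* Self-check against the literal used in the patch. */
void check_pb_1(void)
{
    assert(build_pb_1() == 0x01010101u);
}
```

Both immediates (0x01 and 0x0100) are 8-bit values at an even rotation,
so each fits in a single ARM data-processing instruction without a
literal pool load.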

>>> +    # arm-gcc-4.2 produces incorrect output with -ffast-math
>>> +    # and it doesn't save any speed anyway on 4.4, so disable it
>>> +    CFLAGS="-O4 -fno-fast-math $CFLAGS"
>>
>> Details?
>
> The output wasn't bitexact to x86 with either CodeSourcery 2007q3 or
> Apple gcc 4.2, although it was bitexact between those two compilers
> and looked fine (no obvious artifacts). The stats showed a
> higher bitrate (crf) with much more I macroblocks than P/B, so Jason
> suggested it was probably something being messed up in SAD scores but
> left RD fine. I didn't investigate exactly what gcc screwed up, but
> gcc 4.4 and Apple gcc 4.0 both matched x86 output, as did both 4.2
> variants with -fno-fast-math, and I didn't measure a speedup on arm
> with gcc 4.4 with -ffast-math.

OK, sounds like a real bug.

-- 
Måns Rullgård
mans at mansr.com