[x264-devel] [PATCH 1/1] checkasm: fix arm64 register clobber test

Mon Aug 24 20:51:44 CEST 2015

On Mon, 24 Aug 2015, Janne Grunau wrote:

> On 2015-08-24 12:42:47 +0300, Martin Storsjö wrote:
>> On Mon, 17 Aug 2015, Janne Grunau wrote:
>>
>>> Hi Martin,
>>>
>>> the code has a couple of problems, see below in the commit message.
>>> feel free squash this patch. I still need to test on ios though.
>>> IIRC it won't work there since the ABI for stack arguments and variadic
>>> arguments is incompatible. Stack arguments take only there size + padding
>>> for natural alignment but variadic arguments are allocated to slots of
>>> 8 bytes. There are iirc one or two functions which require args on the
>>> stack. If those arguments are smaller than 8 bytes this won't work.
>>> I don't think this is fixable, i.e. arm64 x264_checkasm_call can't work
>>> on ios.
>>>
>>> Janne
>>>
>>> ---8<---
>>> The stackpointer SP must be 16-byte aligned if it's used for meory
>>> access. Solve this by creating a proper frame record.
>>>
>>> Use the frame pointer to access the original arguments on the stack.
>>>
>>> Provide a macro to test neon args.
>>>
>>> Do not write past int ok.
>>> ---
>>> tools/checkasm-aarch64.S | 58 ++++++++++++++++++++++--------------------------
>>> 1 file changed, 26 insertions(+), 32 deletions(-)
>>>
>>> diff --git a/tools/checkasm-aarch64.S b/tools/checkasm-aarch64.S
>>> index 2b86d62..ec36518 100644
>>> --- a/tools/checkasm-aarch64.S
>>> +++ b/tools/checkasm-aarch64.S
>>> @@ -56,11 +56,11 @@ error_message:
>>> // max number of args used by any x264 asm function.
>>> #define MAX_ARGS 15
>>>
>>> -#define ARG_STACK 8*(MAX_ARGS - 6)
>>> -#define PUSHED 8*8 + 8*12
>>> +#define ARG_STACK ((8*(MAX_ARGS - 6) + 15) & ~15)
>>>
>>> function x264_checkasm_call, export=1
>>> -    str         x30, [sp, #-8]!
>>> +    stp         x29, x30, [sp, #-16]!
>>> +    mov         x29, sp
>>>    stp         x19, x20, [sp, #-16]!
>>>    stp         x21, x22, [sp, #-16]!
>>>    stp         x23, x24, [sp, #-16]!
>>> @@ -82,12 +82,13 @@ function x264_checkasm_call, export=1
>>>    ldp         x25, x26, [x9], #16
>>>    ldp         x27, x28, [x9], #16
>>>
>>> -    str         x1,  [sp, #-8]!
>>> +    str         x1,  [sp, #-16]!
>>>
>>>    sub         sp,  sp,  #ARG_STACK
>>> .equ pos, 0
>>> +// first two stacked args are copied to x6, x7
>>> .rept MAX_ARGS-6
>>> -    ldr         x9, [sp, #ARG_STACK + PUSHED + 16 + pos]
>>> +    ldr         x9, [x29, #16 + 16 + pos]
>>>    str         x9, [sp, #pos]
>>> .equ pos, pos + 8
>>> .endr
>>> @@ -99,33 +100,26 @@ function x264_checkasm_call, export=1
>>>    mov         x3,  x5
>>>    mov         x4,  x6
>>>    mov         x5,  x7
>>> -    ldp         x6,  x7,  [sp, #ARG_STACK + PUSHED]
>>> +    ldp         x6,  x7,  [x29, #16]
>>>    blr         x12
>>>    add         sp,  sp,  #ARG_STACK
>>> -    ldr         x2,  [sp], #8
>>> -
>>> -    stp         x0,  x1, [sp, #-16]!
>>> +    ldr         x2,  [sp]
>>> +    stp         x0,  x1, [sp]
>>>    movrel      x9, register_init
>>> -    ldp         d0,  d1,  [x9], #16
>>> -    ldp         d2,  d3,  [x9], #16
>>> -    ldp         d4,  d5,  [x9], #16
>>> -    ldp         d6,  d7,  [x9], #16
>>> -    eor         v0.8b,  v0.8b,  v8.8b
>>> -    eor         v1.8b,  v1.8b,  v9.8b
>>> -    eor         v2.8b,  v2.8b,  v10.8b
>>> -    eor         v3.8b,  v3.8b,  v11.8b
>>> -    eor         v4.8b,  v4.8b,  v12.8b
>>> -    eor         v5.8b,  v5.8b,  v13.8b
>>> -    eor         v6.8b,  v6.8b,  v14.8b
>>> -    eor         v7.8b,  v7.8b,  v15.8b
>>> -    orr         v0.8b,  v0.8b,  v1.8b
>>> -    orr         v0.8b,  v0.8b,  v2.8b
>>> -    orr         v0.8b,  v0.8b,  v3.8b
>>> -    orr         v0.8b,  v0.8b,  v4.8b
>>> -    orr         v0.8b,  v0.8b,  v5.8b
>>> -    orr         v0.8b,  v0.8b,  v6.8b
>>> -    orr         v0.8b,  v0.8b,  v7.8b
>>> -    fmov        x3,  d0
>>> +    movi        v3.8h,  #0
>>> +
>>> +.macro check_reg_neon reg1, reg2
>>> +    ldr         q0,  [x9], #16
>>> +    uzp1        v1.2d,  v\reg1\().2d, v\reg2\().2d
>>> +    eor         v0.16b, v0.16b, v1.16b
>>> +    orr         v3.16b, v3.16b, v0.16b
>>> +.endm
>>> +    check_reg_neon  8,  9
>>> +    check_reg_neon  10, 11
>>> +    check_reg_neon  12, 13
>>> +    check_reg_neon  14, 15
>>> +    xtn         v3.8b,  v3.8h
>>
>> Doesn't this drop half of the bits in the registers, i.e. a bit set
>> in the upper half of the halfwords would be discarded and missed? An
>> uqxtn would probably handle that, right?
>
> half (16-bit) or double (64-bit) words?

halfwords

> in any case the code is correct.

No, I just reproduced a clobbered register that it failed to notice. E.g., 
before the check_reg_neon, do "mov x0, #42; mov v8.b[1], w0", and it will 
fail to notice. For odd indices between 1 and 7, it will fail to notice 
the corrupted register, while even indices between 0 and 6 are caught 
properly (and 8-15 ignored, as they should).

> it takes the the lower 64-bit of the vector registers and writes them 
> into v1. The ABI just mandates that the lower 64-bit of the refisters 
> should be preserved, i.e. the equivalent of the ARM NEON d8-d15.

Yes, it takes the lower 64-bit half of two 128 bit vector registers and 
writes them into v1. But then, at the end, once you have the bitfield in 
v3 showing which bits had errors (where every bit is significant; the 
upper 64 bits of this register comes from the errors in v9, v11 etc), you 
discard every other byte in the register.

// Martin