[x265] [arm64] port sad_x{3,4}

Fri Jul 23 19:54:46 UTC 2021

Hi Min Chen,
thanks for your reviews.

> +.macro SAD_X_END_64 x
> +    uaddlp          v16.4s, v16.8h
> The dynamic range is 64*255 = 16320 -> 14 bits, so we do not need to extend to 32 bits here
>
> +    uaddlp          v17.4s, v17.8h
> +    uaddlp          v18.4s, v18.8h
> +    uaddlp          v20.4s, v20.8h
> +    uaddlp          v21.4s, v21.8h
> +    uaddlp          v22.4s, v22.8h
> +    add             v16.4s, v16.4s, v20.4s
> +    add             v17.4s, v17.4s, v21.4s
> +    add             v18.4s, v18.4s, v22.4s
> +    trn2            v20.2d, v16.2d, v16.2d
> +    trn2            v21.2d, v17.2d, v17.2d
> +    trn2            v22.2d, v18.2d, v18.2d
> +    add             v16.2s, v16.2s, v20.2s
>
> +    add             v17.2s, v17.2s, v21.2s
> +    add             v18.2s, v18.2s, v22.2s
> +    uaddlp          v16.1d, v16.2s
> ADD+TRN2+ADD generates the sum of v16+v20 in V.2s, followed by UADDLP into V.1d
>
> Given the dynamic range analysis above, we can replace it with
> ADD v16, v20   ; 15-bits
>         (ignore inst for V17=V17+V21, etc)
> ADD v16, V17  ; 16-bits
>         (ignore other registers)
> ADDLV s0,v16

Following your recommendation, I tried the code below, which delays widening to
the last step with uaddlv.  It does not pass the correctness tests.

.macro SAD_X_END_64 x
    add             v16.8h, v16.8h, v20.8h
    add             v17.8h, v17.8h, v21.8h
    add             v18.8h, v18.8h, v22.8h
    trn2            v20.2d, v16.2d, v16.2d
    trn2            v21.2d, v17.2d, v17.2d
    trn2            v22.2d, v18.2d, v18.2d
    add             v16.4h, v16.4h, v20.4h
    add             v17.4h, v17.4h, v21.4h
    add             v18.4h, v18.4h, v22.4h
    uaddlv          s16, v16.4h
    uaddlv          s17, v17.4h
    uaddlv          s18, v18.4h
    stp             s16, s17, [x6], #8
.if \x == 3
    str             s18, [x6]
.elseif \x == 4
    add             v19.8h, v19.8h, v23.8h
    trn2            v23.2d, v19.2d, v19.2d
    add             v19.4h, v19.4h, v23.4h
    uaddlv          s19, v19.4h
    stp             s18, s19, [x6]
.endif
    ret
.endm

When execution reaches the code above, the values in the lanes of v16 to v23
already use the full 16 bits.  For example,

(gdb) p $v16.h.u
$21 = {65024, 65024, 65024, 65024, 65024, 65024, 65024, 65024}
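
To illustrate what lanes this large do to the reduction, here is a minimal
scalar C sketch of the two strategies for one candidate.  It is only a model,
not x265 code: the function names are made up for this mail, and it assumes
that v20 holds the same lane values as the v16 dump above.

```c
#include <assert.h>
#include <stdint.h>

#define LANES 8

/* Attempted macro: every add stays in 16-bit lanes, only uaddlv widens. */
uint32_t reduce_late_widening(const uint16_t a[LANES], const uint16_t b[LANES])
{
    uint16_t s[LANES], h[LANES / 2];
    uint32_t total = 0;

    for (int i = 0; i < LANES; i++)
        s[i] = (uint16_t)(a[i] + b[i]);             /* add .8h, wraps mod 2^16 */
    for (int i = 0; i < LANES / 2; i++)
        h[i] = (uint16_t)(s[i] + s[i + LANES / 2]); /* trn2 + add .4h, wraps */
    for (int i = 0; i < LANES / 2; i++)
        total += h[i];                              /* uaddlv widens to 32 bits */
    return total;
}

/* Original patch: widen first.  Modelled here as an exact 32-bit sum. */
uint32_t reduce_early_widening(const uint16_t a[LANES], const uint16_t b[LANES])
{
    uint32_t total = 0;

    for (int i = 0; i < LANES; i++)
        total += (uint32_t)a[i] + (uint32_t)b[i];
    return total;
}

/* Self-check with the lane value seen in gdb (65024 in every lane). */
void check_observed_lanes(void)
{
    uint16_t a[LANES], b[LANES];

    for (int i = 0; i < LANES; i++)
        a[i] = b[i] = 65024;
    assert(reduce_early_widening(a, b) == 1040384u); /* 16 * 65024 */
    assert(reduce_late_widening(a, b) != reduce_early_widening(a, b));
}
```

With these inputs the late-widening model drops the carries of both 16-bit
adds, so the two results differ.  For small lane values the two functions
agree.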

Each lane of v16 accumulates 4 absolute differences, each in the range 0..255:
    uabal           \v1\().8h, v0.8b, v4.8b
    uabal           \v1\().8h, v1.8b, v5.8b
    uabal           \v1\().8h, v2.8b, v6.8b
    uabal           \v1\().8h, v3.8b, v7.8b
and this runs in a loop of 64 iterations.
So the dynamic range of each vector element is 4*64*255 = 65280, which needs
the full 16 bits.  Already the first add of two such lanes (v16 + v20) can
exceed 16 bits, so we need to widen in the first step, as in the original
patch, and cannot postpone widening to the last step of the reduction.
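
The range arithmetic can be checked with a few lines of C.  This is just a
sketch of the calculation with hypothetical helper names; the constants are
the ones from the accumulation loop described above.

```c
#include <assert.h>
#include <stdint.h>

enum {
    UABAL_PER_ITERATION = 4,   /* four uabal into the same accumulator */
    ITERATIONS          = 64,  /* loop trip count */
    MAX_ABS_DIFF        = 255  /* largest |a - b| for 8-bit pixels */
};

/* Largest value a single 16-bit accumulator lane can reach after the loop. */
uint32_t worst_case_lane(void)
{
    return (uint32_t)UABAL_PER_ITERATION * ITERATIONS * MAX_ABS_DIFF;
}

/* Number of bits needed to represent v (0 needs 0 bits). */
unsigned bits_needed(uint32_t v)
{
    unsigned n = 0;

    while (v != 0) {
        n++;
        v >>= 1;
    }
    return n;
}

/* Self-check of the figures quoted in this thread. */
void check_ranges(void)
{
    assert(bits_needed(64u * 255u) == 14);            /* estimate in the review */
    assert(worst_case_lane() == 65280u);              /* 4*64*255 */
    assert(bits_needed(worst_case_lane()) == 16);     /* lane is already full */
    assert(bits_needed(2u * worst_case_lane()) == 17);/* v16 + v20 overflows */
}
```

The 14-bit estimate holds for one uabal per iteration; with four of them per
lane the accumulator is already at 16 bits before the reduction starts.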

> I guess STP may store two results in a cycle

Please find attached the amended patch, which uses store pairs (stp).
I have seen a small performance improvement with this change.

Thanks,
Sebastian
-------------- next part --------------
A non-text attachment was scrubbed...
Name: 0001-arm64-port-sad_x-3-4.patch
Type: application/octet-stream
Size: 17856 bytes
Desc: 0001-arm64-port-sad_x-3-4.patch
URL: <http://mailman.videolan.org/pipermail/x265-devel/attachments/20210723/ab26a9eb/attachment-0001.obj>