[x265] [arm64] port sad_x{3,4}
chen
chenm003 at 163.com
Sat Jul 24 05:44:15 UTC 2021
Hi,
That was my mistake: I overlooked that part of the SAD accumulation, so your code is fine as it is. Thank you.
Regards,
Min Chen
At 2021-07-24 03:54:46, "Pop, Sebastian" <spop at amazon.com> wrote:
Hi Min Chen,
thanks for your reviews.
> +.macro SAD_X_END_64 x
> + uaddlp v16.4s, v16.8h
> The dynamic range is 64*255 = 16320 -> 14 bits, so we do not need to extend to 32 bits here
>
> + uaddlp v17.4s, v17.8h
> + uaddlp v18.4s, v18.8h
> + uaddlp v20.4s, v20.8h
> + uaddlp v21.4s, v21.8h
> + uaddlp v22.4s, v22.8h
> + add v16.4s, v16.4s, v20.4s
> + add v17.4s, v17.4s, v21.4s
> + add v18.4s, v18.4s, v22.4s
> + trn2 v20.2d, v16.2d, v16.2d
> + trn2 v21.2d, v17.2d, v17.2d
> + trn2 v22.2d, v18.2d, v18.2d
> + add v16.2s, v16.2s, v20.2s
>
> + add v17.2s, v17.2s, v21.2s
> + add v18.2s, v18.2s, v22.2s
> + uaddlp v16.1d, v16.2s
> ADD+TRN2+ADD generates the sum of v16+v20 in V.2s, followed by UADDLP into V.1d
>
> Given the dynamic range analyzed above, we can replace it with
> ADD v16, v20 ; 15 bits
> (instructions for V17 = V17 + V21, etc. omitted)
> ADD v16, V17 ; 16 bits
> (other registers omitted)
> UADDLV s0, v16
Following your recommendation I tried the following code to delay widening to
the last step with uaddlv. This code does not pass correctness tests.
.macro SAD_X_END_64 x
add v16.8h, v16.8h, v20.8h
add v17.8h, v17.8h, v21.8h
add v18.8h, v18.8h, v22.8h
trn2 v20.2d, v16.2d, v16.2d
trn2 v21.2d, v17.2d, v17.2d
trn2 v22.2d, v18.2d, v18.2d
add v16.4h, v16.4h, v20.4h
add v17.4h, v17.4h, v21.4h
add v18.4h, v18.4h, v22.4h
uaddlv s16, v16.4h
uaddlv s17, v17.4h
uaddlv s18, v18.4h
stp s16, s17, [x6], #8
.if \x == 3
str s18, [x6]
.elseif \x == 4
add v19.8h, v19.8h, v23.8h
trn2 v23.2d, v19.2d, v19.2d
add v19.2s, v19.2s, v23.2s
uaddlv s19, v19.4h
stp s18, s19, [x6]
.endif
ret
.endm
When the above code starts executing, the values in each lane of v16 to
v23 already occupy the full 16 bits. For example,
(gdb) p $v16.h.u
$21 = {65024, 65024, 65024, 65024, 65024, 65024, 65024, 65024}
Each lane of v16 accumulates 4 absolute differences, each at most 255:
uabal \v1\().8h, v0.8b, v4.8b
uabal \v1\().8h, v1.8b, v5.8b
uabal \v1\().8h, v2.8b, v6.8b
uabal \v1\().8h, v3.8b, v7.8b
and this is in a loop of 64 iterations.
So the dynamic range for each vector element is 4*64*255 = 65280 -> 16 bits.
Adding any two such lanes without widening can overflow 16 bits, so we need to
widen the arithmetic in the first step, as in the original patch, and we cannot
postpone widening to the last step of the reduction.
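
To make the overflow concrete, here is a rough Python model of the lane arithmetic. It is only a sketch and not part of the patch: the helper names are made up for illustration, each uabal is modelled as one absolute difference added per lane, and v20 is assumed to hold the same lane values as the v16 dump shown by gdb (only v16 was printed).

```python
MASK16 = 0xFFFF  # a .8h lane holds 16 bits


def accumulate_lane(diff, rows=64, uabal_per_row=4):
    """Model one 16-bit lane after the SAD loop: rows * uabal_per_row
    accumulations of the same absolute difference, wrapping at 16 bits."""
    acc = 0
    for _ in range(rows * uabal_per_row):
        acc = (acc + diff) & MASK16  # uabal adds |a - b| into the lane
    return acc


def reduce_narrow(lo, hi):
    """Model the failing macro: 16-bit adds first, widen only in uaddlv."""
    v = [(a + b) & MASK16 for a, b in zip(lo, hi)]      # add  v16.8h, v16.8h, v20.8h
    v = [(v[i] + v[i + 4]) & MASK16 for i in range(4)]  # trn2 + add v16.4h
    return sum(v)                                       # uaddlv s16, v16.4h


def reduce_widen_first(lo, hi):
    """Model the original patch: uaddlp widens to 32 bits before any add."""
    w_lo = [lo[i] + lo[i + 1] for i in range(0, 8, 2)]  # uaddlp v16.4s, v16.8h
    w_hi = [hi[i] + hi[i + 1] for i in range(0, 8, 2)]  # uaddlp v20.4s, v20.8h
    # The remaining adds are on 32-bit lanes; the total is far below 2**32.
    return sum(a + b for a, b in zip(w_lo, w_hi))


lane = accumulate_lane(254)        # 65024, the value seen in gdb
worst = accumulate_lane(255)       # 65280, still fits in 16 bits
v16 = [lane] * 8
v20 = [lane] * 8                   # assumption: same contents as v16
exact = reduce_widen_first(v16, v20)   # 1040384 == 16 * 65024
wrapped = reduce_narrow(v16, v20)      # 253952, the carries are lost
```

In this model a single lane never wraps during accumulation, but the very first non-widening add of two lanes already does, which is consistent with the failing correctness tests.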
> I guess STP may store two result in a cycle
Please find attached the amended patch, which uses store-pair (stp) instructions.
I have seen a small performance improvement with this change.
Thanks,
Sebastian