[x265] [arm64] port sad

chen chenm003 at 163.com
Sat Jul 17 08:56:41 UTC 2021


Hi Sebastian,



Thank you for your code.


First, sorry for the delay. I was very busy last week with my family and my toy hardware codec, so I only had a little spare time over the weekend.
Next, I have not looked at all of the functions yet, but I have some comments on 64x64.


In this function, unroll=8 (4*2) performs well on out-of-order (OOO) CPUs, but it may hurt performance on low-end CPUs such as the Cortex-A53 because of cache misses and related issues. Of course, this is not a problem for this version of the patch.


In the 64x64 case, the sum is calculated by the code below.
==========
+.macro SAD_END_64
+    add         v16.8h, v16.8h, v17.8h
+    add         v17.8h, v18.8h, v19.8h
+    add         v16.8h, v16.8h, v17.8h
+    uaddlv      s0,  v16.8h
+    fmov        w0,  s0
+    add         v18.8h, v20.8h, v21.8h
+    add         v19.8h, v22.8h, v23.8h
+    add         v17.8h, v18.8h, v19.8h
+    uaddlv      s1,  v17.8h
+    fmov        w1,  s1
+    add         w0, w0, w1
+    ret
+.endm
==========
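
A rough check of the overflow margin in this macro, as a minimal C sketch. It assumes 8-bit pixels and that the 4096 absolute differences of a 64x64 block are spread evenly over the eight accumulators v16..v23 (8 lanes of 16 bits each). I have not verified that against the rest of the patch, and the function names are only for illustration.

```c
#include <stdint.h>

/* Worst-case value of one 16-bit lane after adding `regs` accumulator
 * registers together lane by lane.
 * Assumption (not taken from the patch): 8-bit pixels, and the 64x64
 * block's absolute differences are spread evenly over 8 registers
 * with 8 lanes each. */
static uint32_t worst_case_lane(uint32_t regs)
{
    const uint32_t pixels = 64 * 64;                 /* 4096 absolute differences */
    const uint32_t lanes = 8 * 8;                    /* 8 registers x 8 lanes */
    const uint32_t per_lane = pixels / lanes * 255;  /* 64 * 255 = 16320 */
    return regs * per_lane;
}

/* Nonzero when the value still fits in an unsigned 16-bit lane. */
static int fits_u16(uint32_t v)
{
    return v <= UINT16_MAX;
}
```

Under these assumptions, four registers added together give at most 65280 per lane, which still fits in 16 bits, while all eight give 130560, which does not. That matches the split into two groups of four before each UADDLV.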


You use two UADDLV instructions to avoid overflow. How about summing these partial registers in the NEON domain first, to reduce the number of UADDLV instructions?
e.g.
UADDLP v16.4s, v16.8h
UADDLP v17.4s, v17.8h
ADD    v16.4s, v16.4s, v17.4s
UADDLV d0, v16.4s
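
To illustrate the idea, here is a small scalar C model of the widening pairwise add. It is only a sketch: the helper names are hypothetical, and the two input arrays stand for the two 8x16-bit partial sums that are left after the first three ADDs of each group.

```c
#include <stdint.h>

/* Scalar model of UADDLP Vd.4S, Vn.8H: add adjacent 16-bit lanes
 * and store each result in a 32-bit lane. */
static void uaddlp_8h_to_4s(const uint16_t in[8], uint32_t out[4])
{
    for (int i = 0; i < 4; i++)
        out[i] = (uint32_t)in[2 * i] + in[2 * i + 1];
}

/* Widen both partial sums, add them in 32-bit lanes, then reduce
 * across lanes once. No intermediate value can wrap. */
static uint32_t reduce_widened(const uint16_t a[8], const uint16_t b[8])
{
    uint32_t wa[4], wb[4], total = 0;
    uaddlp_8h_to_4s(a, wa);
    uaddlp_8h_to_4s(b, wb);
    for (int i = 0; i < 4; i++)
        total += wa[i] + wb[i];
    return total;
}

/* For contrast: adding the two registers in 16-bit lanes first
 * (like ADD .8H) wraps modulo 2^16 before the final reduction. */
static uint32_t reduce_narrow(const uint16_t a[8], const uint16_t b[8])
{
    uint32_t total = 0;
    for (int i = 0; i < 8; i++)
        total += (uint16_t)(a[i] + b[i]);
    return total;
}
```

With every lane of both inputs at 65280, reduce_widened returns 1044480 (4096 * 255), while reduce_narrow loses the carries. So a single cross-lane reduction is enough once the lanes are 32 bits wide.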


Regards,
Min Chen

At 2021-07-17 04:44:05, "Pop, Sebastian" <spop at amazon.com> wrote:

Hi,

The attached patch ports the following kernels to arm64:

    sad[  4x4]  10.11x   6.50     65.72
    sad[  8x8]  28.95x   8.50     246.00
    sad[  8x4]  23.03x   5.45     125.43
    sad[  4x8]  12.09x   10.64    128.68
    sad[16x16]  53.37x   19.19    1024.05
    sad[ 16x8]  43.09x   11.62    500.84
    sad[ 8x16]  31.03x   16.87    523.44
    sad[ 16x4]  39.73x   6.27     249.10
    sad[16x12]  50.55x   15.10    763.44
    sad[ 4x16]  14.23x   19.39    275.91
    sad[12x16]  33.68x   22.95    772.81
    sad[32x32]  62.10x   64.84    4026.97
    sad[32x16]  59.82x   33.74    2018.56
    sad[16x32]  57.94x   35.01    2028.17
    sad[ 32x8]  53.98x   18.77    1013.48
    sad[32x24]  61.29x   49.36    3024.90
    sad[ 8x32]  31.84x   32.49    1034.56
    sad[24x32]  53.61x   56.39    3022.97
    sad[64x64]  65.24x   255.86   16692.29
    sad[64x32]  61.77x   131.16   8100.90
    sad[32x64]  62.31x   128.90   8031.79
    sad[64x16]  60.28x   67.35    4060.31
    sad[64x48]  62.53x   193.59   12104.64
    sad[16x64]  61.10x   66.13    4040.26
    sad[48x64]  61.75x   194.68   12022.14

 

Ok to commit?

Thanks,
Sebastian

 