[x265] [arm64] port sad_x{3,4}

chen chenm003 at 163.com
Thu Jul 22 23:43:37 UTC 2021


Hi,

Some comments:

+.macro SAD_X_END_64 x
+    uaddlp          v16.4s, v16.8h

The dynamic range is 64*255 = 16320, which fits in 14 bits, so there is no need to extend to 32 bits here.
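
To make the range argument concrete, here is a small C sketch of the arithmetic. It is not part of the patch: `bits_needed` and `worst_case_sum` are hypothetical helpers, and it assumes, as stated above, that one 16-bit lane accumulates at most 64 absolute differences of 8-bit pixels.

```c
#include <assert.h>
#include <stdint.h>

/* Number of bits needed to hold an unsigned value (0 needs 0 bits). */
static unsigned bits_needed(uint32_t v)
{
    unsigned n = 0;
    while (v != 0) {
        n++;
        v >>= 1;
    }
    return n;
}

/* Worst-case value of one accumulator lane after summing `count`
 * absolute differences of 8-bit pixels (each difference is at most 255).
 * Hypothetical helper, used only to illustrate the review comment. */
static uint32_t worst_case_sum(uint32_t count)
{
    assert(count <= UINT32_MAX / 255u); /* keep the model itself from overflowing */
    return count * 255u;
}
```

With these helpers, `bits_needed(worst_case_sum(64))` is 14. Doubling the count to 128 gives 15 bits and doubling again to 256 gives 16 bits, so under this assumption two further lane-wise additions still fit in a 16-bit lane.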

+    uaddlp          v17.4s, v17.8h
+    uaddlp          v18.4s, v18.8h
+    uaddlp          v20.4s, v20.8h
+    uaddlp          v21.4s, v21.8h
+    uaddlp          v22.4s, v22.8h
+    add             v16.4s, v16.4s, v20.4s
+    add             v17.4s, v17.4s, v21.4s
+    add             v18.4s, v18.4s, v22.4s
+    trn2            v20.2d, v16.2d, v16.2d
+    trn2            v21.2d, v17.2d, v17.2d
+    trn2            v22.2d, v18.2d, v18.2d
+    add             v16.2s, v16.2s, v20.2s
+    add             v17.2s, v17.2s, v21.2s
+    add             v18.2s, v18.2s, v22.2s
+    uaddlp          v16.1d, v16.2s

ADD+TRN2+ADD generates the sum of v16+v20 in V.2s, followed by UADDLP into V.1d.

Given the dynamic range analyzed above, we can replace it with:

ADD v16, v20   ; 15 bits
        (instructions for v17 = v17 + v21, etc. omitted)
ADD v16, v17   ; 16 bits
        (other registers omitted)
UADDLV s0, v16
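
As a sanity check of that idea, the following C sketch models both reduction orders on 8-lane 16-bit accumulators. It is only a model under the range assumption above, not the actual kernel: the function names are hypothetical, and it covers just one 16-bit ADD followed by a widening across-lane sum.

```c
#include <assert.h>
#include <stdint.h>

#define LANES 8

/* Original order, modeled on the quoted macro: widen pairs of 16-bit lanes
 * to 32 bits (UADDLP), add the two accumulators, then fold to a scalar. */
static uint32_t reduce_widen_first(const uint16_t a[LANES], const uint16_t b[LANES])
{
    uint32_t sum = 0;
    for (int i = 0; i < LANES / 2; i++) {
        uint32_t wa = (uint32_t)a[2 * i] + a[2 * i + 1];
        uint32_t wb = (uint32_t)b[2 * i] + b[2 * i + 1];
        sum += wa + wb;
    }
    return sum;
}

/* Suggested order: add the accumulators in 16-bit lanes (ADD), then do a
 * single widening across-lane sum (UADDLV). */
static uint32_t reduce_add_first(const uint16_t a[LANES], const uint16_t b[LANES])
{
    uint32_t sum = 0;
    for (int i = 0; i < LANES; i++) {
        uint16_t lane = (uint16_t)(a[i] + b[i]); /* wraps like a 16-bit lane */
        sum += lane;
    }
    return sum;
}

/* Fill every lane of both accumulators with `v` and compare the two orders. */
static int orders_agree(uint16_t v)
{
    uint16_t a[LANES], b[LANES];
    for (int i = 0; i < LANES; i++) {
        a[i] = v;
        b[i] = v;
    }
    return reduce_widen_first(a, b) == reduce_add_first(a, b);
}

/* Usage: the orders agree at the 14-bit worst case, but not once a lane
 * value is large enough for the 16-bit ADD to wrap. */
static void self_check(void)
{
    assert(orders_agree(16320));
    assert(!orders_agree(40000));
}
```

The second assertion in `self_check` shows why the range analysis matters: the shorter sequence is only equivalent while the 16-bit additions cannot overflow.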

+    uaddlp          v17.1d, v17.2s
+    uaddlp          v18.1d, v18.2s
+    st1             {v16.s}[0], [x6], #4
+    st1             {v17.s}[0], [x6], #4
+    st1             {v18.s}[0], [x6], #4

I guess STP could store two of the results in one cycle.

Regards,
Min Chen


At 2021-07-22 14:30:50, "Pop, Sebastian" <spop at amazon.com> wrote:

Hi,

the attached patch ports the following kernels to arm64:

         kernel         speedup  asm             C
         sad_x3[  4x4]  12.23x   13.79           168.68
         sad_x4[  4x4]  14.12x   15.82           223.43
         sad_x3[  8x8]  35.05x   17.45           611.47
         sad_x4[  8x8]  38.48x   21.18           814.95
         sad_x3[  8x4]  27.19x   11.46           311.48
         sad_x4[  8x4]  30.40x   13.60           413.37
         sad_x3[  4x8]  14.16x   22.99           325.37
         sad_x4[  4x8]  15.82x   27.39           433.23
         sad_x3[16x16]  40.94x   57.94           2371.97
         sad_x4[16x16]  43.63x   72.44           3160.44
         sad_x3[ 16x8]  38.84x   30.54           1186.15
         sad_x4[ 16x8]  39.23x   40.16           1575.43
         sad_x3[ 8x16]  38.74x   31.43           1217.71
         sad_x4[ 8x16]  41.48x   39.01           1618.17
         sad_x3[ 16x4]  31.82x   18.88           600.72
         sad_x4[ 16x4]  36.35x   21.87           795.00
         sad_x3[16x12]  40.27x   43.87           1766.74
         sad_x4[16x12]  42.58x   55.94           2381.75
         sad_x3[ 4x16]  15.34x   42.16           646.67
         sad_x4[ 4x16]  17.08x   51.06           872.12
         sad_x3[12x16]  29.45x   61.06           1798.28
         sad_x4[12x16]  30.39x   78.94           2399.17
         sad_x3[32x32]  42.85x   216.39          9272.65
         sad_x4[32x32]  42.53x   294.98          12544.76
         sad_x3[32x16]  42.09x   110.35          4644.86
         sad_x4[32x16]  41.71x   151.05          6301.01
         sad_x3[16x32]  44.19x   106.99          4728.04
         sad_x4[16x32]  44.72x   139.94          6257.96
         sad_x3[ 32x8]  40.10x   58.16           2332.47
         sad_x4[ 32x8]  41.17x   76.65           3155.96
         sad_x3[32x24]  42.69x   162.76          6947.64
         sad_x4[32x24]  42.08x   223.88          9421.46
         sad_x3[ 8x32]  41.86x   57.89           2423.47
         sad_x4[ 8x32]  45.26x   71.56           3239.07
         sad_x3[24x32]  45.10x   155.22          6999.53
         sad_x4[24x32]  45.30x   205.87          9325.60
         sad_x3[64x64]  39.87x   925.36          36892.50
         sad_x4[64x64]  40.80x   1214.79         49557.66
         sad_x3[64x32]  39.40x   468.08          18444.51
         sad_x4[64x32]  40.71x   609.27          24803.74
         sad_x3[32x64]  43.48x   426.05          18522.95
         sad_x4[32x64]  43.31x   577.80          25024.14
         sad_x3[64x16]  38.67x   238.72          9231.84
         sad_x4[64x16]  40.36x   308.10          12435.08
         sad_x3[64x48]  39.70x   695.95          27628.87
         sad_x4[64x48]  40.74x   912.56          37173.46
         sad_x3[16x64]  44.85x   208.19          9337.52
         sad_x4[16x64]  45.46x   274.68          12487.54
         sad_x3[48x64]  42.68x   653.74          27903.74
         sad_x4[48x64]  44.67x   835.79          37336.87

Ok to commit?

Thanks,
Sebastian