[x265] [PATCH] SAO: no need to unroll chroma to avoid increased code size

Wed Feb 17 10:39:41 CET 2016

At 2016-02-17 17:32:23, "Ashok Kumar Mishra" <ashok at multicorewareinc.com> wrote:

On Wed, Feb 17, 2016 at 2:42 PM, chen <chenm003 at 163.com> wrote:

At 2016-02-17 16:40:47, "Ashok Kumar Mishra" <ashok at multicorewareinc.com> wrote:

On Wed, Feb 17, 2016 at 9:28 AM, chen <chenm003 at 163.com> wrote:

At 2016-02-16 23:12:02,ashok at multicorewareinc.com wrote:
># HG changeset patch
># User Ashok Kumar Mishra<ashok at multicorewareinc.com>
># Date 1455633570 -19800
>#      Tue Feb 16 20:09:30 2016 +0530
># Node ID 36751a3dce37e4f506f4bdec12e20ef665b42012
># Parent  33b02e2af5a4b38cd54d3f94f163aae640855dbf
>SAO: no need to unroll chroma to avoid increased code size
>
>diff -r 33b02e2af5a4 -r 36751a3dce37 source/encoder/sao.cpp
>--- a/source/encoder/sao.cpp	Tue Feb 16 20:09:30 2016 +0530
>+++ b/source/encoder/sao.cpp	Tue Feb 16 20:09:30 2016 +0530
>@@ -113,7 +113,6 @@
>         m_clipTable = &(m_clipTableBase[rangeExt]);
> 
>         // Share with fast clip lookup table
>-
>         for (int i = 0; i < rangeExt; i++)
>             m_clipTableBase[i] = 0;
> 
>@@ -638,13 +637,8 @@
> {
>     PicYuv* reconPic = m_frame->m_reconPic;
>     intptr_t stride = reconPic->m_strideC;
>-    int ctuWidth  = g_maxCUSize;
>-    int ctuHeight = g_maxCUSize;
>-
>-    {
>-        ctuWidth  >>= m_hChromaShift;
>-        ctuHeight >>= m_vChromaShift;
>-    }
>+    int ctuWidth  = g_maxCUSize >> m_hChromaShift;
>+    int ctuHeight = g_maxCUSize >> m_vChromaShift;
> 
>     int addr = idxY * m_numCuInWidth + idxX;
>     pixel* recCb = reconPic->getCbAddr(addr);
>@@ -652,88 +646,53 @@
> 
>     if (idxX == 0)
>     {
>-        for (int i = 0; i < ctuHeight + 1; i++)
>+        for (int i = 0; i < ctuHeight + 1; i++, recCb += stride, recCr += stride)

This style hurts readability.

First of all, I don't understand why this is less readable. I believe in two things: the code should be readable, i.e. not complex, so that future maintenance is easier,
and at the same time the code should be compact. We should pay attention to code size, since its impact is much larger on embedded systems.
Since we are working on x86, we may not feel its impact. Yes, I agree that in certain cases unrolling a loop that repeats the same operations gives better performance, but that is not the case here.
Here I feel it is simply wrong to write separate code for Cb and Cr. It increases the code size, which is avoidable.

There are still some unnecessary copy operations below the if (typeIdx >= 0) condition check.
I will clean up this and other parts of the code in my next patches.

Here, the for() header is mixed with control code (the pointer increments), which hurts the readability of the code.

For embedded systems, code size is not a big problem; performance depends on the CPU. For example, on a TI DSP a jump instruction needs 6 cycles, and up to 48 instructions could be issued in that time; the newer ARM Cortex cores have a 13-stage pipeline with out-of-order execution, so an unrolled loop may hide the other chroma operations in those cycles.
Of course, I don't mind changing this to a loop; chroma is not the bottleneck.
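To make the readability point concrete, here is a minimal sketch of the two loop styles under discussion. The function names and the 8-bit pixel typedef are assumptions made for illustration; they are not taken from sao.cpp. Both functions copy one column of a strided plane and should behave identically, so the choice between them is purely one of style.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

typedef uint8_t pixel; // assumption: 8-bit pixels, for illustration only

// Style A: the pointer advance sits in the loop body (code before the patch).
static void saveColumnBody(const pixel* rec, std::ptrdiff_t stride, int rows, pixel* dst)
{
    assert(rows >= 0);
    for (int i = 0; i < rows; i++)
    {
        dst[i] = rec[0];
        rec += stride;
    }
}

// Style B: the pointer advance is folded into the for() header (the patch).
static void saveColumnHeader(const pixel* rec, std::ptrdiff_t stride, int rows, pixel* dst)
{
    assert(rows >= 0);
    for (int i = 0; i < rows; i++, rec += stride)
        dst[i] = rec[0];
}
```

Style B saves two lines per loop at the cost of a busier header; which one reads better is the judgment call being debated here.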

Yes, code size is a problem as I understand it, since on-chip program memory is very limited on TI DSPs as well as on other processors.

We should not unroll code and inflate its size where that is not necessary, as in this part of the code, where we gain
no performance by simply unrolling. Just imagine if we wrote separate code for each of the components Y, Cb and Cr in
every part of the encoder: how large would the code be!
Unless you can get the encoder code and data under 32KB, we have to use external memory to store the code.

>         {
>             m_tmpL1[1][i] = recCb[0];
>             m_tmpL1[2][i] = recCr[0];
>-            recCb += stride;
>-            recCr += stride;
>         }
>     }
> 
>-    bool mergeLeftFlagCb = (ctuParam[1][addr].mergeMode == SAO_MERGE_LEFT);
>-    int typeIdxCb = ctuParam[1][addr].typeIdx;

Here, accessing typeIdx needs no multiplication; see below.

>-
>-    bool mergeLeftFlagCr = (ctuParam[2][addr].mergeMode == SAO_MERGE_LEFT);
>-    int typeIdxCr = ctuParam[2][addr].typeIdx;
>-
>     if (idxX != (m_numCuInWidth - 1))
>     {
>         recCb = reconPic->getCbAddr(addr);
>         recCr = reconPic->getCrAddr(addr);
>-        for (int i = 0; i < ctuHeight + 1; i++)
>+        for (int i = 0; i < ctuHeight + 1; i++, recCb += stride, recCr += stride)
>         {
>             m_tmpL2[1][i] = recCb[ctuWidth - 1];
>             m_tmpL2[2][i] = recCr[ctuWidth - 1];
>-            recCb += stride;
>-            recCr += stride;
>         }
>     }
> 
>-    // Process U
>-    if (typeIdxCb >= 0)
>+    for (int plane = 1; plane < 3; plane++)
>     {
>-        if (!mergeLeftFlagCb)
>+        int typeIdx = ctuParam[plane][addr].typeIdx;
Here, the array index access requires a multiplication.
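If the repeated ctuParam[plane][addr] address computation is a concern, one option is to compute it once and reuse a reference. The sketch below assumes a hypothetical CtuParamSketch struct and helper names; it is not the x265 data layout. An optimizing compiler may well perform this hoisting on its own, so any gain would need to be measured.

```cpp
#include <cassert>

// Hypothetical stand-in for the per-CTU SAO parameters; not the x265 struct.
struct CtuParamSketch
{
    int mergeMode;
    int typeIdx;
    int bandPos;
    int offset[4];
};

// Indexes through [plane][addr] on every access, as the looped patch does.
static int sumOffsetsIndexed(CtuParamSketch* const planes[3], int plane, int addr)
{
    assert(plane >= 0 && plane < 3);
    int sum = 0;
    for (int i = 0; i < 4; i++)
        sum += planes[plane][addr].offset[i];
    return sum;
}

// Computes the element address once and reuses it through a reference.
static int sumOffsetsHoisted(CtuParamSketch* const planes[3], int plane, int addr)
{
    assert(plane >= 0 && plane < 3);
    const CtuParamSketch& p = planes[plane][addr];
    int sum = 0;
    for (int i = 0; i < 4; i++)
        sum += p.offset[i];
    return sum;
}
```

Both helpers return the same value; the second only makes the single address computation explicit in the source.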

>+        if (typeIdx >= 0)
>         {
>-            if (typeIdxCb == SAO_BO)
>+            if (ctuParam[plane][addr].mergeMode != SAO_MERGE_LEFT)
>             {
>-                memset(m_offsetBo[1], 0, sizeof(m_offsetBo[0]));
>+                if (typeIdx == SAO_BO)
>+                {
>+                    memset(m_offsetBo[plane], 0, sizeof(m_offsetBo[0]));
> 
>-                for (int i = 0; i < SAO_NUM_OFFSET; i++)
>-                    m_offsetBo[1][((ctuParam[1][addr].bandPos + i) & (SAO_NUM_BO_CLASSES - 1))] = (int8_t)(ctuParam[1][addr].offset[i] << SAO_BIT_INC);
>-            }
>-            else // if (typeIdx == SAO_EO_0 || typeIdx == SAO_EO_1 || typeIdx == SAO_EO_2 || typeIdx == SAO_EO_3)
>-            {
>-                int offset[NUM_EDGETYPE];
>-                offset[0] = 0;
>-                for (int i = 0; i < SAO_NUM_OFFSET; i++)
>-                    offset[i + 1] = ctuParam[1][addr].offset[i] << SAO_BIT_INC;
>+                    for (int i = 0; i < SAO_NUM_OFFSET; i++)
>+                        m_offsetBo[plane][((ctuParam[plane][addr].bandPos + i) & (SAO_NUM_BO_CLASSES - 1))] = (int8_t)(ctuParam[plane][addr].offset[i] << SAO_BIT_INC);
>+                }
>+                else // if (typeIdx == SAO_EO_0 || typeIdx == SAO_EO_1 || typeIdx == SAO_EO_2 || typeIdx == SAO_EO_3)
>+                {
>+                    int offset[NUM_EDGETYPE];
>+                    offset[0] = 0;
>+                    for (int i = 0; i < SAO_NUM_OFFSET; i++)
>+                        offset[i + 1] = ctuParam[plane][addr].offset[i] << SAO_BIT_INC;

This loop runs only 4 times; we could merge it with the loop below.
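A possible shape for that merge is sketched below. The constant values (NUM_EDGETYPE = 5, SAO_NUM_OFFSET = 4, and the contents of s_eoTable) are assumptions for illustration and should be checked against the x265 source; the function names are hypothetical. The scaling is written as a multiply rather than the << used in the patch so that negative offsets stay well defined before C++20.

```cpp
#include <cassert>
#include <cstdint>

// Assumed values, for illustration only; verify against the x265 source.
enum { NUM_EDGETYPE = 5, SAO_NUM_OFFSET = 4 };
static const int s_eoTable[NUM_EDGETYPE] = { 1, 2, 0, 3, 4 };

// Two-pass version: fill a temporary offset[] array, then remap it.
static void fillEoTwoPass(const int* srcOffset, int bitInc, int8_t* dst)
{
    assert(bitInc >= 0);
    int offset[NUM_EDGETYPE];
    offset[0] = 0;
    for (int i = 0; i < SAO_NUM_OFFSET; i++)
        offset[i + 1] = srcOffset[i] * (1 << bitInc);

    for (int edgeType = 0; edgeType < NUM_EDGETYPE; edgeType++)
        dst[edgeType] = (int8_t)offset[s_eoTable[edgeType]];
}

// Merged version: one loop, no temporary array.
static void fillEoMerged(const int* srcOffset, int bitInc, int8_t* dst)
{
    assert(bitInc >= 0);
    for (int edgeType = 0; edgeType < NUM_EDGETYPE; edgeType++)
    {
        int idx = s_eoTable[edgeType];
        // Table entry 0 maps to the implicit zero offset.
        dst[edgeType] = (idx == 0) ? (int8_t)0
                                   : (int8_t)(srcOffset[idx - 1] * (1 << bitInc));
    }
}
```

The merged form trades the temporary array for a branch per edge type; with only five iterations either version is cheap, so the choice is mostly about clarity.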

> 
>-                for (int edgeType = 0; edgeType < NUM_EDGETYPE; edgeType++)
>-                    m_offsetEo[1][edgeType] = (int8_t)offset[s_eoTable[edgeType]];
>+                    for (int edgeType = 0; edgeType < NUM_EDGETYPE; edgeType++)
>+                        m_offsetEo[plane][edgeType] = (int8_t)offset[s_eoTable[edgeType]];
>+                }
>             }
>         }
>-        processSaoCu(addr, typeIdxCb, 1);
>+        processSaoCu(addr, typeIdx, plane);
>+        std::swap(m_tmpL1[plane], m_tmpL2[plane]);
>     }

_______________________________________________
x265-devel mailing list
x265-devel at videolan.org
https://mailman.videolan.org/listinfo/x265-devel
