[x265] [PATCH] SAO: no need to unroll chroma to avoid increased code size

Wed Feb 17 10:32:23 CET 2016

On Wed, Feb 17, 2016 at 2:42 PM, chen <chenm003 at 163.com> wrote:

>
>
> 在 2016-02-17 16:40:47，"Ashok Kumar Mishra" <ashok at multicorewareinc.com>
> 写道：
>
> On Wed, Feb 17, 2016 at 9:28 AM, chen <chenm003 at 163.com> wrote:
>
>>
>> At 2016-02-16 23:12:02,ashok at multicorewareinc.com wrote:
>> ># HG changeset patch
>> ># User Ashok Kumar Mishra<ashok at multicorewareinc.com>
>> ># Date 1455633570 -19800
>> >#      Tue Feb 16 20:09:30 2016 +0530
>> ># Node ID 36751a3dce37e4f506f4bdec12e20ef665b42012
>> ># Parent  33b02e2af5a4b38cd54d3f94f163aae640855dbf
>> >SAO: no need to unroll chroma to avoid increased code size
>> >
>> >diff -r 33b02e2af5a4 -r 36751a3dce37 source/encoder/sao.cpp
>> >--- a/source/encoder/sao.cpp	Tue Feb 16 20:09:30 2016 +0530
>> >+++ b/source/encoder/sao.cpp	Tue Feb 16 20:09:30 2016 +0530
>> >@@ -113,7 +113,6 @@
>> >         m_clipTable = &(m_clipTableBase[rangeExt]);
>> >
>> >         // Share with fast clip lookup table
>> >-
>> >         for (int i = 0; i < rangeExt; i++)
>> >             m_clipTableBase[i] = 0;
>> >
>> >@@ -638,13 +637,8 @@
>> > {
>> >     PicYuv* reconPic = m_frame->m_reconPic;
>> >     intptr_t stride = reconPic->m_strideC;
>> >-    int ctuWidth  = g_maxCUSize;
>> >-    int ctuHeight = g_maxCUSize;
>> >-
>> >-    {
>> >-        ctuWidth  >>= m_hChromaShift;
>> >-        ctuHeight >>= m_vChromaShift;
>> >-    }
>> >+    int ctuWidth  = g_maxCUSize >> m_hChromaShift;
>> >+    int ctuHeight = g_maxCUSize >> m_vChromaShift;
>> >
>> >     int addr = idxY * m_numCuInWidth + idxX;
>> >     pixel* recCb = reconPic->getCbAddr(addr);
>> >@@ -652,88 +646,53 @@
>> >
>> >     if (idxX == 0)
>> >     {
>> >-        for (int i = 0; i < ctuHeight + 1; i++)
>> >+        for (int i = 0; i < ctuHeight + 1; i++, recCb += stride, recCr += stride)
>>
>> this style is not good to readable
>>
>> First of all I couldn't understand why it is not good to readable. I
> believe in two things, the code should be good to readable, i.e., not
> complex so that in future code maintenance  will be easier
> and at the same time the code should be compact. We should look into the
> code size since its impact is much larger in embedded systems.
> Since we are working in x86, we may not be feeling it's impact. Yes I
> agree that there are certain cases like unrolling a loop to do same
> operations will give good performance. But here this is not the case.
> Here I feel it is completely wrong to write separate code for Cb and Cr.
> It is increasing the code size, which is avoidable.
>
> Still there are some unnecessary copy operations are there below
> if(typeIdx >= 0) condition check.
> I will clean this and other parts in code in my next patches.
>
>
> In here, the for() mixed with control code, it affect readable on code.
>
> For embedded system, code size is not a big problem, the performance
> depends on CPU, e.g. TI DSP jump instruction need 6 cycles, we may generate
> up to 48 instructions during that period.; the new ARM cortex have
> 13-stages pipeline with out-of-order execute feature,  unroll loop may
> hidden another Chroma operators in these cycles.
> Of course, I don't care change to loop, the Chroma is not bottleneck.
>
>
Yes code size is a problem as per my understanding, since the on-chip
program memory size is very limited on TI DSP as well as other processors.
We should not unroll the code and make it reasonably high where it is not
necessary at all as it is in this part of code where we are not going
to achieve any performance by simply unrolling the code. Just imagine one
scenario if we write separate code for each components  Y, Cb and Cr in each
and every part of encoder, how much the code size it be !!!

>
>> >         {
>> >             m_tmpL1[1][i] = recCb[0];
>> >             m_tmpL1[2][i] = recCr[0];
>> >-            recCb += stride;
>> >-            recCr += stride;
>> >         }
>> >     }
>> >
>> >-    bool mergeLeftFlagCb = (ctuParam[1][addr].mergeMode == SAO_MERGE_LEFT);
>> >-    int typeIdxCb = ctuParam[1][addr].typeIdx;
>> in here, no multiplication  operator to access typeIdx, see below
>>
>> >-
>> >-    bool mergeLeftFlagCr = (ctuParam[2][addr].mergeMode == SAO_MERGE_LEFT);
>> >-    int typeIdxCr = ctuParam[2][addr].typeIdx;
>> >-
>> >     if (idxX != (m_numCuInWidth - 1))
>> >     {
>> >         recCb = reconPic->getCbAddr(addr);
>> >         recCr = reconPic->getCrAddr(addr);
>> >-        for (int i = 0; i < ctuHeight + 1; i++)
>> >+        for (int i = 0; i < ctuHeight + 1; i++, recCb += stride, recCr += stride)
>> >         {
>> >             m_tmpL2[1][i] = recCb[ctuWidth - 1];
>> >             m_tmpL2[2][i] = recCr[ctuWidth - 1];
>> >-            recCb += stride;
>> >-            recCr += stride;
>> >         }
>> >     }
>> >
>> >-    // Process U
>> >-    if (typeIdxCb >= 0)
>> >+    for (int plane = 1; plane < 3; plane++)
>> >     {
>> >-        if (!mergeLeftFlagCb)
>> >+        int typeIdx = ctuParam[plane][addr].typeIdx;
>> here necessary multiplication on array index access
>>
>>
>> >+        if (typeIdx >= 0)
>> >         {
>> >-            if (typeIdxCb == SAO_BO)
>> >+            if (ctuParam[plane][addr].mergeMode != SAO_MERGE_LEFT)
>> >             {
>> >-                memset(m_offsetBo[1], 0, sizeof(m_offsetBo[0]));
>> >+                if (typeIdx == SAO_BO)
>> >+                {
>> >+                    memset(m_offsetBo[plane], 0, sizeof(m_offsetBo[0]));
>> >
>> >-                for (int i = 0; i < SAO_NUM_OFFSET; i++)
>> >-                    m_offsetBo[1][((ctuParam[1][addr].bandPos + i) & (SAO_NUM_BO_CLASSES - 1))] = (int8_t)(ctuParam[1][addr].offset[i] << SAO_BIT_INC);
>> >-            }
>> >-            else // if (typeIdx == SAO_EO_0 || typeIdx == SAO_EO_1 || typeIdx == SAO_EO_2 || typeIdx == SAO_EO_3)
>> >-            {
>> >-                int offset[NUM_EDGETYPE];
>> >-                offset[0] = 0;
>> >-                for (int i = 0; i < SAO_NUM_OFFSET; i++)
>> >-                    offset[i + 1] = ctuParam[1][addr].offset[i] << SAO_BIT_INC;
>> >+                    for (int i = 0; i < SAO_NUM_OFFSET; i++)
>> >+                        m_offsetBo[plane][((ctuParam[plane][addr].bandPos + i) & (SAO_NUM_BO_CLASSES - 1))] = (int8_t)(ctuParam[plane][addr].offset[i] << SAO_BIT_INC);
>> >+                }
>> >+                else // if (typeIdx == SAO_EO_0 || typeIdx == SAO_EO_1 || typeIdx == SAO_EO_2 || typeIdx == SAO_EO_3)
>> >+                {
>> >+                    int offset[NUM_EDGETYPE];
>> >+                    offset[0] = 0;
>> >+                    for (int i = 0; i < SAO_NUM_OFFSET; i++)
>> >+                        offset[i + 1] = ctuParam[plane][addr].offset[i] << SAO_BIT_INC;
>> the loop just 4 times, we may merge with below loop
>>
>> >
>> >-                for (int edgeType = 0; edgeType < NUM_EDGETYPE; edgeType++)
>> >-                    m_offsetEo[1][edgeType] = (int8_t)offset[s_eoTable[edgeType]];
>> >+                    for (int edgeType = 0; edgeType < NUM_EDGETYPE; edgeType++)
>> >+                        m_offsetEo[plane][edgeType] = (int8_t)offset[s_eoTable[edgeType]];
>> >+                }
>> >             }
>> >         }
>> >-        processSaoCu(addr, typeIdxCb, 1);
>> >+        processSaoCu(addr, typeIdx, plane);
>> >+        std::swap(m_tmpL1[plane], m_tmpL2[plane]);
>> >     }
>>
>>
>>
>> _______________________________________________
>> x265-devel mailing list
>> x265-devel at videolan.org
>> https://mailman.videolan.org/listinfo/x265-devel
>>
>>
>
> _______________________________________________
> x265-devel mailing list
> x265-devel at videolan.org
> https://mailman.videolan.org/listinfo/x265-devel
>
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://mailman.videolan.org/pipermail/x265-devel/attachments/20160217/98572428/attachment.html>