<div style="line-height:1.7;color:#000000;font-size:14px;font-family:arial"><br><div></div><div id="divNeteaseMailCard"></div><br>在 2016-02-17 17:32:23，"Ashok Kumar Mishra" <ashok@multicorewareinc.com> 写道：<br> <blockquote id="isReplyContent" style="PADDING-LEFT: 1ex; MARGIN: 0px 0px 0px 0.8ex; BORDER-LEFT: #ccc 1px solid"><div dir="ltr"><br><div class="gmail_extra"><br><div class="gmail_quote">On Wed, Feb 17, 2016 at 2:42 PM, chen <span dir="ltr"><<a href="mailto:chenm003@163.com" target="_blank">chenm003@163.com</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div style="line-height:1.7;color:#000000;font-size:14px;font-family:arial"><div><div class="h5"><br><div></div><div></div><br>在 2016-02-17 16:40:47，"Ashok Kumar Mishra" <<a href="mailto:ashok@multicorewareinc.com" target="_blank">ashok@multicorewareinc.com</a>> 写道：<br> </div></div><blockquote style="PADDING-LEFT:1ex;MARGIN:0px 0px 0px 0.8ex;BORDER-LEFT:#ccc 1px solid"><div dir="ltr"><div class="gmail_extra"><div class="gmail_quote"><div><div class="h5">On Wed, Feb 17, 2016 at 9:28 AM, chen <span dir="ltr"><<a href="mailto:chenm003@163.com" target="_blank">chenm003@163.com</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div style="line-height:1.7;color:#000000;font-size:14px;font-family:arial"><br><pre><div><div>At 2016-02-16 23:12:02,<a href="mailto:ashok@multicorewareinc.com" target="_blank">ashok@multicorewareinc.com</a> wrote:

># HG changeset patch

># User Ashok Kumar Mishra<<a href="mailto:ashok@multicorewareinc.com" target="_blank">ashok@multicorewareinc.com</a>>

># Date 1455633570 -19800

>#      Tue Feb 16 20:09:30 2016 +0530

># Node ID 36751a3dce37e4f506f4bdec12e20ef665b42012

># Parent  33b02e2af5a4b38cd54d3f94f163aae640855dbf

>SAO: no need to unroll chroma to avoid increased code size

>

>diff -r 33b02e2af5a4 -r 36751a3dce37 source/encoder/sao.cpp

>--- a/source/encoder/sao.cpp   Tue Feb 16 20:09:30 2016 +0530

>+++ b/source/encoder/sao.cpp   Tue Feb 16 20:09:30 2016 +0530

>@@ -113,7 +113,6 @@

>         m_clipTable = &(m_clipTableBase[rangeExt]);

> 

>         // Share with fast clip lookup table

>-

>         for (int i = 0; i < rangeExt; i++)

>             m_clipTableBase[i] = 0;

> 

>@@ -638,13 +637,8 @@

> {

>     PicYuv* reconPic = m_frame->m_reconPic;

>     intptr_t stride = reconPic->m_strideC;

>-    int ctuWidth  = g_maxCUSize;

>-    int ctuHeight = g_maxCUSize;

>-

>-    {

>-        ctuWidth  >>= m_hChromaShift;

>-        ctuHeight >>= m_vChromaShift;

>-    }

>+    int ctuWidth  = g_maxCUSize >> m_hChromaShift;

>+    int ctuHeight = g_maxCUSize >> m_vChromaShift;

> 

>     int addr = idxY * m_numCuInWidth + idxX;

>     pixel* recCb = reconPic->getCbAddr(addr);

>@@ -652,88 +646,53 @@

> 

>     if (idxX == 0)

>     {

>-        for (int i = 0; i < ctuHeight + 1; i++)

>+        for (int i = 0; i < ctuHeight + 1; i++, recCb += stride, recCr += stride)<br><br></div></div>this style is not good to readable<span><br></span></pre></div></blockquote><div>First of all I couldn't understand why it is not good to readable. I believe in two things, the code should be good to readable, i.e., not complex so that in future code maintenance  will be easier</div><div>and at the same time the code should be compact. We should look into the code size since its impact is much larger in embedded systems.</div><div>Since we are working in x86, we may not be feeling it's impact. Yes I agree that there are certain cases like unrolling a loop to do same operations will give good performance. But here this is not the case.</div><div>Here I feel it is completely wrong to write separate code for Cb and Cr. It is increasing the code size, which is avoidable.</div><div><br></div><div>Still there are some unnecessary copy operations are there below if(typeIdx >= 0) condition check.</div></div></div><div><div><div class="h5">I will clean this and other parts in code in my next patches. <br><br><br></div></div>In here, the for() mixed with control code, it affect readable on code.<br><br>For embedded system, code size is not a big problem, the performance depends on CPU, e.g. TI DSP jump instruction need 6 cycles, we may generate up to 48 instructions during that period.; the new ARM cortex have 13-stages pipeline with out-of-order execute feature,  unroll loop may hidden another Chroma operators in these cycles. <br>Of course, I don't care change to loop, the Chroma is not bottleneck.<br></div></div></div></div></blockquote></div></blockquote><div><br></div><div>Yes code size is a problem as per my understanding, since the on-chip program memory size is very limited on TI DSP as well as other processors.<br><div>We should not unroll the code and make it reasonably high where it 

is not necessary at all as it is in this part of code where we are not 

going</div><div>to achieve any performance by simply unrolling the code.

 Just imagine one scenario if we write separate code for each components

  Y, Cb and Cr in each</div><div>and every part of encoder, how much the code size it be !!!</div><pre><span>

</span></pre>unless you may make encoder code & data less than 32KB, otherwise we have to use external memory to store code.<br><br></div><br>>         {

>             m_tmpL1[1][i] = recCb[0];

>             m_tmpL1[2][i] = recCr[0];

>-            recCb += stride;

>-            recCr += stride;

>         }

>     }

> 

>-    bool mergeLeftFlagCb = (ctuParam[1][addr].mergeMode == SAO_MERGE_LEFT);

>-    int typeIdxCb = ctuParam[1][addr].typeIdx;<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div style="line-height:1.7;color:#000000;font-size:14px;font-family:arial"><blockquote style="PADDING-LEFT:1ex;MARGIN:0px 0px 0px 0.8ex;BORDER-LEFT:#ccc 1px solid"><div dir="ltr"><div class="gmail_extra"><div class="gmail_quote"><div><div class="h5"><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div style="line-height:1.7;color:#000000;font-size:14px;font-family:arial"><pre><span></span>in here, no multiplication  operator to access typeIdx, see below<span><br>

>-

>-    bool mergeLeftFlagCr = (ctuParam[2][addr].mergeMode == SAO_MERGE_LEFT);

>-    int typeIdxCr = ctuParam[2][addr].typeIdx;

>-

>     if (idxX != (m_numCuInWidth - 1))

>     {

>         recCb = reconPic->getCbAddr(addr);

>         recCr = reconPic->getCrAddr(addr);

>-        for (int i = 0; i < ctuHeight + 1; i++)

>+        for (int i = 0; i < ctuHeight + 1; i++, recCb += stride, recCr += stride)

>         {

>             m_tmpL2[1][i] = recCb[ctuWidth - 1];

>             m_tmpL2[2][i] = recCr[ctuWidth - 1];

>-            recCb += stride;

>-            recCr += stride;

>         }

>     }

> 

>-    // Process U

>-    if (typeIdxCb >= 0)

>+    for (int plane = 1; plane < 3; plane++)

>     {

>-        if (!mergeLeftFlagCb)

>+        int typeIdx = ctuParam[plane][addr].typeIdx;<br></span>here necessary multiplication on array index access<div><div><br>

>+        if (typeIdx >= 0)

>         {

>-            if (typeIdxCb == SAO_BO)

>+            if (ctuParam[plane][addr].mergeMode != SAO_MERGE_LEFT)

>             {

>-                memset(m_offsetBo[1], 0, sizeof(m_offsetBo[0]));

>+                if (typeIdx == SAO_BO)

>+                {

>+                    memset(m_offsetBo[plane], 0, sizeof(m_offsetBo[0]));

> 

>-                for (int i = 0; i < SAO_NUM_OFFSET; i++)

>-                    m_offsetBo[1][((ctuParam[1][addr].bandPos + i) & (SAO_NUM_BO_CLASSES - 1))] = (int8_t)(ctuParam[1][addr].offset[i] << SAO_BIT_INC);

>-            }

>-            else // if (typeIdx == SAO_EO_0 || typeIdx == SAO_EO_1 || typeIdx == SAO_EO_2 || typeIdx == SAO_EO_3)

>-            {

>-                int offset[NUM_EDGETYPE];

>-                offset[0] = 0;

>-                for (int i = 0; i < SAO_NUM_OFFSET; i++)

>-                    offset[i + 1] = ctuParam[1][addr].offset[i] << SAO_BIT_INC;

>+                    for (int i = 0; i < SAO_NUM_OFFSET; i++)

>+                        m_offsetBo[plane][((ctuParam[plane][addr].bandPos + i) & (SAO_NUM_BO_CLASSES - 1))] = (int8_t)(ctuParam[plane][addr].offset[i] << SAO_BIT_INC);<br>>+                }

>+                else // if (typeIdx == SAO_EO_0 || typeIdx == SAO_EO_1 || typeIdx == SAO_EO_2 || typeIdx == SAO_EO_3)

>+                {

>+                    int offset[NUM_EDGETYPE];

>+                    offset[0] = 0;

>+                    for (int i = 0; i < SAO_NUM_OFFSET; i++)

>+                        offset[i + 1] = ctuParam[plane][addr].offset[i] << SAO_BIT_INC;<br></div></div>the loop just 4 times, we may merge with below loop<span><br>

> 

>-                for (int edgeType = 0; edgeType < NUM_EDGETYPE; edgeType++)

>-                    m_offsetEo[1][edgeType] = (int8_t)offset[s_eoTable[edgeType]];

>+                    for (int edgeType = 0; edgeType < NUM_EDGETYPE; edgeType++)

>+                        m_offsetEo[plane][edgeType] = (int8_t)offset[s_eoTable[edgeType]];

>+                }

>             }

>         }

>-        processSaoCu(addr, typeIdxCb, 1);

>+        processSaoCu(addr, typeIdx, plane);

>+        std::swap(m_tmpL1[plane], m_tmpL2[plane]);

>     }

</span></pre></div><br>_______________________________________________<br>

x265-devel mailing list<br>

<a href="mailto:x265-devel@videolan.org" target="_blank">x265-devel@videolan.org</a><br>

<a href="https://mailman.videolan.org/listinfo/x265-devel" rel="noreferrer" target="_blank">https://mailman.videolan.org/listinfo/x265-devel</a><br>

<br></blockquote></div></div></div><br></div></div>

</blockquote></div><br>_______________________________________________<br>

x265-devel mailing list<br>

<a href="mailto:x265-devel@videolan.org">x265-devel@videolan.org</a><br>

<a href="https://mailman.videolan.org/listinfo/x265-devel" rel="noreferrer" target="_blank">https://mailman.videolan.org/listinfo/x265-devel</a><br>

<br></blockquote></div><br></div></div>

</blockquote></div>