<div dir="ltr">Looks like this doesn't apply to the head... grrr...<div>Apologies for the mistake - will resend a fixed patch.</div></div><div class="gmail_extra"><br clear="all"><div><div class="gmail_signature"><div dir="ltr"><div><div dir="ltr"><div><div dir="ltr"><div><div dir="ltr">Pradeep Ramachandran, PhD<div>Solution Architect,</div><div>Multicoreware Inc.</div><div>Ph:   +91 99627 82018</div></div></div></div></div></div></div></div></div></div>

<br><div class="gmail_quote">On Tue, Sep 1, 2015 at 1:51 PM,  <span dir="ltr"><<a href="mailto:pradeep@multicorewareinc.com" target="_blank">pradeep@multicorewareinc.com</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"># HG changeset patch<br>

# User Pradeep Ramachandran <<a href="mailto:pradeep@multicorewareinc.com">pradeep@multicorewareinc.com</a>><br>

# Date 1441095610 -19800<br>

#      Tue Sep 01 13:50:10 2015 +0530<br>

# Node ID eda792e5d9c746e099879e8e58211f7807b88075<br>

# Parent  9f669a8d959e557c7ea2cc7be59c807dc5b8dee2<br>

Performance: Don't split threads into per-NUMA pools unless specified in cli<br>

<br>

Fixes the threadpools to store a numa node mask instead of a numa node id to<br>

use liNUMA functionality, and changes default thread pool behavior to use one<br>

monolythic pool<br>

<br>

We see gains of 10%+ on Intel Xeon E5-2666v3, E5-2640 v2 machines for 4K videos<br>

encoding at an ABR of 15Mbps in slower and versylow settings. Ultrafast shows a<br>

perf dip of ~8% on the same machines. So when running in ultrafast mode, explicilty<br>

specify # threads with the --pools command (--pools 18,18 on the E5-2666v3 for<br>

example)<br>

<br>

diff -r 9f669a8d959e -r eda792e5d9c7 doc/reST/cli.rst<br>

--- a/doc/reST/cli.rst  Tue Sep 01 11:38:47 2015 +0530<br>

+++ b/doc/reST/cli.rst  Tue Sep 01 13:50:10 2015 +0530<br>

@@ -202,15 +202,29 @@<br>

        "-"       - same as "none"<br>

        "10"      - allocate one pool, using up to 10 cores on node 0<br>

        "-,+"     - allocate one pool, using all cores on node 1<br>

-       "+,-,+"   - allocate two pools, using all cores on nodes 0 and 2<br>

-       "+,-,+,-" - allocate two pools, using all cores on nodes 0 and 2<br>

-       "-,*"     - allocate three pools, using all cores on nodes 1, 2 and 3<br>

+       "+,-,+"   - allocate one pool, using only cores on nodes 0 and 2<br>

+       "+,-,+,-" - allocate one pool, using only cores on nodes 0 and 2<br>

+       "-,*"     - allocate one pool, using all cores on nodes 1, 2 and 3<br>

        "8,8,8,8" - allocate four pools with up to 8 threads in each pool<br>

+       "8,+,+,+" - allocate two pools, the first with 8 threads on node 0, and the second with all cores on node 1,2,3<br>

<br>

-       The total number of threads will be determined by the number of threads<br>

-       assigned to all nodes. The worker threads will each be given affinity for<br>

-       their node, they will not be allowed to migrate between nodes, but they<br>

-       will be allowed to move between CPU cores within their node.<br>

+       A thread pool dedicated to a given NUMA node is enabled only when the<br>

+       number of threads to be created on that NUMA node is explicitly mentioned<br>

+       in that corresponding position with the --pools option. Else, all threads<br>

+       are spawned from a single pool. The total number of threads will be<br>

+       determined by the number of threads assigned to the enabled NUMA nodes for<br>

+       that pool. The worker threads are be given affinity to all the enabled<br>

+       NUMA nodes for that pool and may migrate between them, unless explicitly<br>

+       specified as described above.<br>

+<br>

+       In the case that any threadpool has more than 64 threads, the threadpool<br>

+       may be broken down into multiple pools of 64 threads each; on 32-bit<br>

+       machines, this number is 32. All pools are given affinity to the NUMA<br>

+       nodes on which the original pool had affinity. For performance reasons,<br>

+       the last thread pool is spawned only if it has more than 32 threads for<br>

+       64-bit machines, or 16 for 32-bit machines. If the total number of threads<br>

+       in the system doesn't obey this constraint, we may spawn fewer threads<br>

+       than cores which has been emperically shown to be better for performance.<br>

<br>

        If the four pool features: :option:`--wpp`, :option:`--pmode`,<br>

        :option:`--pme` and :option:`--lookahead-slices` are all disabled,<br>

@@ -219,10 +233,6 @@<br>

        If "none" is specified, then all four of the thread pool features are<br>

        implicitly disabled.<br>

<br>

-       Multiple thread pools will be allocated for any NUMA node with more than<br>

-       64 logical CPU cores. But any given thread pool will always use at most<br>

-       one NUMA node.<br>

-<br>

        Frame encoders are distributed between the available thread pools,<br>

        and the encoder will never generate more thread pools than<br>

        :option:`--frame-threads`.  The pools are used for WPP and for<br>

@@ -238,8 +248,12 @@<br>

        system, a POSIX build of libx265 without libnuma will be less work<br>

        efficient. See :ref:`thread pools <pools>` for more detail.<br>

<br>

-       Default "", one thread is allocated per detected hardware thread<br>

-       (logical CPU cores) and one thread pool per NUMA node.<br>

+       Default "", one pool is created across all available NUMA nodes, with<br>

+       one thread allocated per detected hardware thread<br>

+       (logical CPU cores). In the case that the total number of threads is more<br>

+       than the maximum size that ATOMIC operations can handle (32 for 32-bit<br>

+       compiles, and 64 for 64-bit compiles), multiple thread pools may be<br>

+       spawned subject to the performance constraint described above.<br>

<br>

        Note that the string value will need to be escaped or quoted to<br>

        protect against shell expansion on many platforms<br>

diff -r 9f669a8d959e -r eda792e5d9c7 source/common/threadpool.cpp<br>

--- a/source/common/threadpool.cpp      Tue Sep 01 11:38:47 2015 +0530<br>

+++ b/source/common/threadpool.cpp      Tue Sep 01 13:50:10 2015 +0530<br>

@@ -226,8 +226,13 @@<br>

 {<br>

     enum { MAX_NODE_NUM = 127 };<br>

     int cpusPerNode[MAX_NODE_NUM + 1];<br>

+    int threadsPerPool[MAX_NODE_NUM + 2];<br>

+    unsigned long nodeMaskPerPool[MAX_NODE_NUM +2];<br>

<br>

     memset(cpusPerNode, 0, sizeof(cpusPerNode));<br>

+    memset(threadsPerPool, 0, sizeof(threadsPerPool));<br>

+    memset(nodeMaskPerPool, 0, sizeof(nodeMaskPerPool));<br>

+<br>

     int numNumaNodes = X265_MIN(getNumaNodeCount(), MAX_NODE_NUM);<br>

     int cpuCount = getCpuCount();<br>

     bool bNumaSupport = false;<br>

@@ -258,7 +263,7 @@<br>

         for (int i = 0; i < numNumaNodes; i++)<br>

             x265_log(p, X265_LOG_DEBUG, "detected NUMA node %d with %d logical cores\n", i, cpusPerNode[i]);<br>

<br>

-    /* limit nodes based on param->numaPools */<br>

+    /* limit threads based on param->numaPools */<br>

     if (p->numaPools && *p->numaPools)<br>

     {<br>

         const char *nodeStr = p->numaPools;<br>

@@ -266,19 +271,30 @@<br>

         {<br>

             if (!*nodeStr)<br>

             {<br>

-                cpusPerNode[i] = 0;<br>

+                threadsPerPool[i] = 0;<br>

                 continue;<br>

             }<br>

             else if (*nodeStr == '-')<br>

-                cpusPerNode[i] = 0;<br>

+                threadsPerPool[i] = 0;<br>

             else if (*nodeStr == '*')<br>

+            {<br>

+                for(int j = i; j < numNumaNodes; j++)<br>

+                {<br>

+                    threadsPerPool[numNumaNodes] += cpusPerNode[j];<br>

+                    nodeMaskPerPool[numNumaNodes] |= ((unsigned long)1 << j);<br>

+                }<br>

                 break;<br>

+            }<br>

             else if (*nodeStr == '+')<br>

-                ;<br>

+            {<br>

+                threadsPerPool[numNumaNodes] += cpusPerNode[i];<br>

+                nodeMaskPerPool[numNumaNodes] = ((unsigned long)1 << i);<br>

+            }<br>

             else<br>

             {<br>

                 int count = atoi(nodeStr);<br>

-                cpusPerNode[i] = X265_MIN(count, cpusPerNode[i]);<br>

+                threadsPerPool[i] = X265_MIN(count, cpusPerNode[i]);<br>

+                nodeMaskPerPool[i] = ((unsigned long)1 << i);<br>

             }<br>

<br>

             /* consume current node string, comma, and white-space */<br>

@@ -288,24 +304,31 @@<br>

                ++nodeStr;<br>

         }<br>

     }<br>

+    else<br>

+    {<br>

+        for(int i = 0; i < numNumaNodes; i++)<br>

+        {<br>

+            threadsPerPool[numNumaNodes]  += cpusPerNode[i];<br>

+            nodeMaskPerPool[numNumaNodes] |= ((unsigned long)1 << i);<br>

+        }<br>

+    }<br>

<br>

-    // In the case that numa is disabled and we have more CPUs than 64,<br>

-    // spawn the last pool only if the # threads in that pool is > 1/2 max (heuristic)<br>

-    if ((numNumaNodes == 1) &&<br>

-        (cpusPerNode[0] > MAX_POOL_THREADS) &&<br>

-        (cpusPerNode[0] % MAX_POOL_THREADS < (MAX_POOL_THREADS / 2)))<br>

+    // If the last pool size is > MAX_POOL_THREADS, clip it to spawn thread pools only of size >= 1/2 max (heuristic)<br>

+    if ((threadsPerPool[numNumaNodes] > MAX_POOL_THREADS) &&<br>

+       ((threadsPerPool[numNumaNodes] % MAX_POOL_THREADS) < (MAX_POOL_THREADS / 2)))<br>

     {<br>

-        cpusPerNode[0] -= (cpusPerNode[0] % MAX_POOL_THREADS);<br>

-        x265_log(p, X265_LOG_DEBUG, "Creating only %d worker threads to prevent asymmetry in pools; may not use all HW contexts\n", cpusPerNode[0]);<br>

+        threadsPerPool[numNumaNodes] -= (threadsPerPool[numNumaNodes] % MAX_POOL_THREADS);<br>

+        x265_log(p, X265_LOG_DEBUG,<br>

+                 "Creating only %d worker threads beyond specified numbers with --pools (if specified) to prevent asymmetry in pools; may not use all HW contexts\n", threadsPerPool[numNumaNodes]);<br>

     }<br>

<br>

     numPools = 0;<br>

-    for (int i = 0; i < numNumaNodes; i++)<br>

+    for (int i = 0; i < numNumaNodes + 1; i++)<br>

     {<br>

         if (bNumaSupport)<br>

             x265_log(p, X265_LOG_DEBUG, "NUMA node %d may use %d logical cores\n", i, cpusPerNode[i]);<br>

-        if (cpusPerNode[i])<br>

-            numPools += (cpusPerNode[i] + MAX_POOL_THREADS - 1) / MAX_POOL_THREADS;<br>

+        if (threadsPerPool[i])<br>

+            numPools += (threadsPerPool[i] + MAX_POOL_THREADS - 1) / MAX_POOL_THREADS;<br>

     }<br>

<br>

     if (!numPools)<br>

@@ -324,20 +347,20 @@<br>

         int node = 0;<br>

         for (int i = 0; i < numPools; i++)<br>

         {<br>

-            while (!cpusPerNode[node])<br>

+            while (!threadsPerPool[node])<br>

                 node++;<br>

-            int cores = X265_MIN(MAX_POOL_THREADS, cpusPerNode[node]);<br>

-            if (!pools[i].create(cores, maxProviders, node))<br>

+            int numThreads = X265_MIN(MAX_POOL_THREADS, threadsPerPool[node]);<br>

+            if (!pools[i].create(numThreads, maxProviders, nodeMaskPerPool[node]))<br>

             {<br>

                 X265_FREE(pools);<br>

                 numPools = 0;<br>

                 return NULL;<br>

             }<br>

             if (numNumaNodes > 1)<br>

-                x265_log(p, X265_LOG_INFO, "Thread pool %d using %d threads on NUMA node %d\n", i, cores, node);<br>

+                x265_log(p, X265_LOG_INFO, "Thread pool %d using %d threads with NUMA node mask %lx\n", i, numThreads, nodeMaskPerPool[node]);<br>

             else<br>

-                x265_log(p, X265_LOG_INFO, "Thread pool created using %d threads\n", cores);<br>

-            cpusPerNode[node] -= cores;<br>

+                x265_log(p, X265_LOG_INFO, "Thread pool created using %d threads\n", numThreads);<br>

+            threadsPerPool[node] -= numThreads;<br>

         }<br>

     }<br>

     else<br>

@@ -350,11 +373,27 @@<br>

     memset(this, 0, sizeof(*this));<br>

 }<br>

<br>

-bool ThreadPool::create(int numThreads, int maxProviders, int node)<br>

+bool ThreadPool::create(int numThreads, int maxProviders, unsigned long nodeMask)<br>

 {<br>

     X265_CHECK(numThreads <= MAX_POOL_THREADS, "a single thread pool cannot have more than MAX_POOL_THREADS threads\n");<br>

<br>

-    m_numaNode = node;<br>

+#if defined(_WIN32_WINNT) && _WIN32_WINNT >= _WIN32_WINNT_WIN7<br>

+    m_numaNodeMask = new DWORD_PTR;<br>

+    (*m_numaNodeMask) = nodeMask & ~(0x1 << X265_MIN(getNumaNodeCount(), MAX_NODE_NUM));<br>

+#elif HAVE_LIBNUMA<br>

+    if (numa_available() >= 0)<br>

+    {<br>

+        struct bitmask* nodemask = numa_allocate_nodemask();<br>

+        if (nodemask)<br>

+        {<br>

+            *(nodemask->maskp) = nodeMask;<br>

+            m_numaNodeMask = nodemask;<br>

+        }<br>

+        else<br>

+            x265_log(NULL, X265_LOG_ERROR, "unable to get NUMA node mask for %lx\n", nodeMask);<br>

+    }<br>

+#endif<br>

+<br>

     m_numWorkers = numThreads;<br>

<br>

     m_workers = X265_MALLOC(WorkerThread, numThreads);<br>

@@ -408,36 +447,39 @@<br>

<br>

     X265_FREE(m_workers);<br>

     X265_FREE(m_jpTable);<br>

+<br>

+    if(m_numaNodeMask)<br>

+#if defined(_WIN32_WINNT) && _WIN32_WINNT >= _WIN32_WINNT_WIN7<br>

+        free(m_numaNodeMask);<br>

+#elif HAVE_LIBNUMA<br>

+        numa_free_nodemask((struct bitmask*)m_numaNodeMask);<br>

+#endif<br>

 }<br>

<br>

 void ThreadPool::setCurrentThreadAffinity()<br>

 {<br>

-    setThreadNodeAffinity(m_numaNode);<br>

+    setThreadNodeAffinity(m_numaNodeMask);<br>

 }<br>

<br>

 /* static */<br>

-void ThreadPool::setThreadNodeAffinity(int numaNode)<br>

+void ThreadPool::setThreadNodeAffinity(void *numaNodeMask)<br>

 {<br>

 #if defined(_WIN32_WINNT) && _WIN32_WINNT >= _WIN32_WINNT_WIN7<br>

-    GROUP_AFFINITY groupAffinity;<br>

-    if (GetNumaNodeProcessorMaskEx((USHORT)numaNode, &groupAffinity))<br>

-    {<br>

-        if (SetThreadAffinityMask(GetCurrentThread(), (DWORD_PTR)groupAffinity.Mask))<br>

-            return;<br>

-    }<br>

-    x265_log(NULL, X265_LOG_ERROR, "unable to set thread affinity to NUMA node %d\n", numaNode);<br>

+    if (SetThreadAffinityMask(GetCurrentThread(), (DWORD_PTR)numaNodeMask))<br>

+        return;<br>

+    else<br>

+        x265_log(NULL, X265_LOG_ERROR, "unable to set thread affinity for NUMA node mask\n");<br>

 #elif HAVE_LIBNUMA<br>

     if (numa_available() >= 0)<br>

     {<br>

-        numa_run_on_node(numaNode);<br>

-        numa_set_preferred(numaNode);<br>

+        numa_run_on_node_mask((struct bitmask*)numaNodeMask);<br>

+        numa_set_interleave_mask((struct bitmask*)numaNodeMask);<br>

         numa_set_localalloc();<br>

         return;<br>

     }<br>

-    x265_log(NULL, X265_LOG_ERROR, "unable to set thread affinity to NUMA node %d\n", numaNode);<br>

-#else<br>

-    (void)numaNode;<br>

+    x265_log(NULL, X265_LOG_ERROR, "unable to set thread affinity for NUMA node mask\n");<br>

 #endif<br>

+    return;<br>

 }<br>

<br>

 /* static */<br>

diff -r 9f669a8d959e -r eda792e5d9c7 source/common/threadpool.h<br>

--- a/source/common/threadpool.h        Tue Sep 01 11:38:47 2015 +0530<br>

+++ b/source/common/threadpool.h        Tue Sep 01 13:50:10 2015 +0530<br>

@@ -83,7 +83,7 @@<br>

     sleepbitmap_t m_sleepBitmap;<br>

     int           m_numProviders;<br>

     int           m_numWorkers;<br>

-    int           m_numaNode;<br>

+    void*         m_numaNodeMask;<br>

     bool          m_isActive;<br>

<br>

     JobProvider** m_jpTable;<br>

@@ -92,7 +92,7 @@<br>

     ThreadPool();<br>

     ~ThreadPool();<br>

<br>

-    bool create(int numThreads, int maxProviders, int node);<br>

+    bool create(int numThreads, int maxProviders, unsigned long nodeMask);<br>

     bool start();<br>

     void stopWorkers();<br>

     void setCurrentThreadAffinity();<br>

@@ -103,7 +103,7 @@<br>

<br>

     static int  getCpuCount();<br>

     static int  getNumaNodeCount();<br>

-    static void setThreadNodeAffinity(int node);<br>

+    static void setThreadNodeAffinity(void *numaNodeMask);<br>

 };<br>

<br>

 /* Any worker thread may enlist the help of idle worker threads from the same<br>

</blockquote></div><br></div>