<div dir="ltr">Thanks. Most comments are simple. I have only one response, in <span style="background-color:rgb(255,255,0)">this colour</span> - let me know your thoughts.<div><div class="gmail_extra"><br clear="all"><div><div class="gmail_signature"><div dir="ltr"><div><div dir="ltr"><div><div dir="ltr"><div><div dir="ltr">Pradeep Ramachandran, PhD<div>Solution Architect,</div><div>Multicoreware Inc.</div><div>Ph: +91 99627 82018</div></div></div></div></div></div></div></div></div></div>
<br><div class="gmail_quote">On Wed, Sep 2, 2015 at 11:55 PM, Steve Borho <span dir="ltr"><<a href="mailto:steve@borho.org" target="_blank">steve@borho.org</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><span class="">On 09/02, <a href="mailto:pradeep@multicorewareinc.com">pradeep@multicorewareinc.com</a> wrote:<br>
> # HG changeset patch
> # User Pradeep Ramachandran <pradeep@multicorewareinc.com>
> # Date 1441169444 -19800
> # Wed Sep 02 10:20:44 2015 +0530
> # Node ID 5f7e53224962ef0279f39e745aa59eea7cd2141a
> # Parent 86e9bd7dd19278fceef65fc93a06dc8746ec9daf
> Performance: Don't split threads into per-NUMA pools unless specified in cli
>
> Fixes the threadpools to store a numa node mask instead of a numa node id to
> use libnuma functionality, and changes default thread pool behavior to use one
> monolythic pool

nit: monolithic

> We see gains of 10%+ on Intel Xeon E5-2666v3, E5-2640 v2 machines for 4K videos
> encoding at an ABR of 15Mbps in slower and veryslow settings. Ultrafast shows a
> perf dip of ~8% on the same machines. So when running in ultrafast mode, explicitly
> specify # threads with the --pools command (--pools 18,18 on the E5-2666v3 which
> has two sockets of 18 threads each, for example)

it all looks pretty good, couple nits
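
(for reference, the ultrafast override described above would look something
like this on the two-socket E5-2666v3 from the commit message; the input and
output file names here are made up:)

    x265 --preset ultrafast --pools "18,18" --bitrate 15000 \
         --input video-4k.y4m --output out.hevc

(the pool string is quoted to guard against shell expansion, per the cli.rst
note in the diff below)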
<div><div class="h5"><br>
> diff -r 86e9bd7dd192 -r 5f7e53224962 doc/reST/cli.rst<br>
> --- a/doc/reST/cli.rst Tue Sep 01 17:06:05 2015 +0530<br>
> +++ b/doc/reST/cli.rst Wed Sep 02 10:20:44 2015 +0530<br>
> @@ -202,15 +202,29 @@<br>
> "-" - same as "none"<br>
> "10" - allocate one pool, using up to 10 cores on node 0<br>
> "-,+" - allocate one pool, using all cores on node 1<br>
> - "+,-,+" - allocate two pools, using all cores on nodes 0 and 2<br>
> - "+,-,+,-" - allocate two pools, using all cores on nodes 0 and 2<br>
> - "-,*" - allocate three pools, using all cores on nodes 1, 2 and 3<br>
> + "+,-,+" - allocate one pool, using only cores on nodes 0 and 2<br>
> + "+,-,+,-" - allocate one pool, using only cores on nodes 0 and 2<br>
> + "-,*" - allocate one pool, using all cores on nodes 1, 2 and 3<br>
> "8,8,8,8" - allocate four pools with up to 8 threads in each pool<br>
> + "8,+,+,+" - allocate two pools, the first with 8 threads on node 0, and the second with all cores on node 1,2,3<br>
><br>
> - The total number of threads will be determined by the number of threads<br>
> - assigned to all nodes. The worker threads will each be given affinity for<br>
> - their node, they will not be allowed to migrate between nodes, but they<br>
> - will be allowed to move between CPU cores within their node.<br>
> + A thread pool dedicated to a given NUMA node is enabled only when the
> + number of threads to be created on that NUMA node is explicitly mentioned
> + in that corresponding position with the --pools option. Else, all threads
> + are spawned from a single pool. The total number of threads will be
> + determined by the number of threads assigned to the enabled NUMA nodes for
> + that pool. The worker threads are given affinity to all the enabled
> + NUMA nodes for that pool and may migrate between them, unless explicitly
> + specified as described above.
> +
> + In the case that any threadpool has more than 64 threads, the threadpool
> + may be broken down into multiple pools of 64 threads each; on 32-bit
> + machines, this number is 32. All pools are given affinity to the NUMA
> + nodes on which the original pool had affinity. For performance reasons,
> + the last thread pool is spawned only if it has more than 32 threads for
> + 64-bit machines, or 16 for 32-bit machines. If the total number of threads
> + in the system doesn't obey this constraint, we may spawn fewer threads
> + than cores which has been empirically shown to be better for performance.
>
> If the four pool features: :option:`--wpp`, :option:`--pmode`,
> :option:`--pme` and :option:`--lookahead-slices` are all disabled,
> @@ -219,10 +233,6 @@
> If "none" is specified, then all four of the thread pool features are
> implicitly disabled.
>
> - Multiple thread pools will be allocated for any NUMA node with more than
> - 64 logical CPU cores. But any given thread pool will always use at most
> - one NUMA node.
> -
> Frame encoders are distributed between the available thread pools,
> and the encoder will never generate more thread pools than
> :option:`--frame-threads`. The pools are used for WPP and for
> @@ -238,8 +248,12 @@
> system, a POSIX build of libx265 without libnuma will be less work
> efficient. See :ref:`thread pools <pools>` for more detail.
>
> - Default "", one thread is allocated per detected hardware thread
> - (logical CPU cores) and one thread pool per NUMA node.
> + Default "", one pool is created across all available NUMA nodes, with
> + one thread allocated per detected hardware thread
> + (logical CPU cores). In the case that the total number of threads is more
> + than the maximum size that ATOMIC operations can handle (32 for 32-bit
> + compiles, and 64 for 64-bit compiles), multiple thread pools may be
> + spawned subject to the performance constraint described above.
>
> Note that the string value will need to be escaped or quoted to
> protect against shell expansion on many platforms
> diff -r 86e9bd7dd192 -r 5f7e53224962 source/common/threadpool.cpp
> --- a/source/common/threadpool.cpp Tue Sep 01 17:06:05 2015 +0530
> +++ b/source/common/threadpool.cpp Wed Sep 02 10:20:44 2015 +0530
> @@ -226,8 +226,13 @@
> {
> enum { MAX_NODE_NUM = 127 };
> int cpusPerNode[MAX_NODE_NUM + 1];
> + int threadsPerPool[MAX_NODE_NUM + 2];
> + unsigned long nodeMaskPerPool[MAX_NODE_NUM +2];

use explicit size types when you care about the size: uint32_t
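
e.g., a sketch of the declaration with a fixed-width type (needs <stdint.h>;
note a uint32_t mask can only represent 32 node bits, so a 64-bit fixed-width
type may be needed to cover all nodes):

    uint32_t nodeMaskPerPool[MAX_NODE_NUM + 2];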
<div><div class="h5"><br>
> memset(cpusPerNode, 0, sizeof(cpusPerNode));<br>
> + memset(threadsPerPool, 0, sizeof(threadsPerPool));<br>
> + memset(nodeMaskPerPool, 0, sizeof(nodeMaskPerPool));<br>
> +<br>
> int numNumaNodes = X265_MIN(getNumaNodeCount(), MAX_NODE_NUM);<br>
> int cpuCount = getCpuCount();<br>
> bool bNumaSupport = false;<br>
> @@ -258,7 +263,7 @@<br>
> for (int i = 0; i < numNumaNodes; i++)<br>
> x265_log(p, X265_LOG_DEBUG, "detected NUMA node %d with %d logical cores\n", i, cpusPerNode[i]);<br>
><br>
> - /* limit nodes based on param->numaPools */<br>
> + /* limit threads based on param->numaPools */<br>
> if (p->numaPools && *p->numaPools)<br>
> {<br>
> const char *nodeStr = p->numaPools;<br>
> @@ -266,19 +271,30 @@<br>
> {<br>
> if (!*nodeStr)<br>
> {<br>
> - cpusPerNode[i] = 0;<br>
> + threadsPerPool[i] = 0;<br>
> continue;<br>
> }<br>
> else if (*nodeStr == '-')<br>
> - cpusPerNode[i] = 0;<br>
> + threadsPerPool[i] = 0;<br>
> else if (*nodeStr == '*')<br>
> + {<br>
> + for(int j = i; j < numNumaNodes; j++)<br>

w/s

> + {
> + threadsPerPool[numNumaNodes] += cpusPerNode[j];
> + nodeMaskPerPool[numNumaNodes] |= ((unsigned long)1 << j);

could use 1U
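
e.g. (sketch; 1UL rather than 1U if the mask array stays unsigned long, so the
shifted literal matches the destination's width):

    nodeMaskPerPool[numNumaNodes] |= (1UL << j);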
<div><div class="h5"><br>
> + }<br>
> break;<br>
> + }<br>
> else if (*nodeStr == '+')<br>
> - ;<br>
> + {<br>
> + threadsPerPool[numNumaNodes] += cpusPerNode[i];<br>
> + nodeMaskPerPool[numNumaNodes] = ((unsigned long)1 << i);<br>
> + }<br>
> else<br>
> {<br>
> int count = atoi(nodeStr);<br>
> - cpusPerNode[i] = X265_MIN(count, cpusPerNode[i]);<br>
> + threadsPerPool[i] = X265_MIN(count, cpusPerNode[i]);<br>
> + nodeMaskPerPool[i] = ((unsigned long)1 << i);<br>
> }<br>
><br>
> /* consume current node string, comma, and white-space */<br>
> @@ -288,24 +304,31 @@<br>
> ++nodeStr;<br>
> }<br>
> }<br>
> -<br>
> - // In the case that numa is disabled and we have more CPUs than 64,<br>
> - // spawn the last pool only if the # threads in that pool is > 1/2 max (heuristic)<br>
> - if ((numNumaNodes == 1) &&<br>
> - (cpusPerNode[0] > MAX_POOL_THREADS) &&<br>
> - (cpusPerNode[0] % MAX_POOL_THREADS < (MAX_POOL_THREADS / 2)))<br>
> + else<br>
> {<br>
> - cpusPerNode[0] -= (cpusPerNode[0] % MAX_POOL_THREADS);<br>
> - x265_log(p, X265_LOG_DEBUG, "Creating only %d worker threads to prevent asymmetry in pools; may not use all HW contexts\n", cpusPerNode[0]);<br>
> + for(int i = 0; i < numNumaNodes; i++)<br>
> + {<br>
> + threadsPerPool[numNumaNodes] += cpusPerNode[i];<br>
> + nodeMaskPerPool[numNumaNodes] |= ((unsigned long)1 << i);<br>
> + }<br>
> + }<br>
> +<br>
> + // If the last pool size is > MAX_POOL_THREADS, clip it to spawn thread pools only of size >= 1/2 max (heuristic)<br>
> + if ((threadsPerPool[numNumaNodes] > MAX_POOL_THREADS) &&<br>
> + ((threadsPerPool[numNumaNodes] % MAX_POOL_THREADS) < (MAX_POOL_THREADS / 2)))<br>
> + {<br>
> + threadsPerPool[numNumaNodes] -= (threadsPerPool[numNumaNodes] % MAX_POOL_THREADS);<br>
> + x265_log(p, X265_LOG_DEBUG,<br>
> + "Creating only %d worker threads beyond specified numbers with --pools (if specified) to prevent asymmetry in pools; may not use all HW contexts\n", threadsPerPool[numNumaNodes]);<br>
> }<br>
><br>
> numPools = 0;<br>
> - for (int i = 0; i < numNumaNodes; i++)<br>
> + for (int i = 0; i < numNumaNodes + 1; i++)<br>
> {<br>
> if (bNumaSupport)<br>
> x265_log(p, X265_LOG_DEBUG, "NUMA node %d may use %d logical cores\n", i, cpusPerNode[i]);<br>
> - if (cpusPerNode[i])<br>
> - numPools += (cpusPerNode[i] + MAX_POOL_THREADS - 1) / MAX_POOL_THREADS;<br>
> + if (threadsPerPool[i])<br>
> + numPools += (threadsPerPool[i] + MAX_POOL_THREADS - 1) / MAX_POOL_THREADS;<br>
> }<br>
><br>
> if (!numPools)<br>
> @@ -324,20 +347,20 @@<br>
> int node = 0;<br>
> for (int i = 0; i < numPools; i++)<br>
> {<br>
> - while (!cpusPerNode[node])<br>
> + while (!threadsPerPool[node])<br>
> node++;<br>
> - int cores = X265_MIN(MAX_POOL_THREADS, cpusPerNode[node]);<br>
> - if (!pools[i].create(cores, maxProviders, node))<br>
> + int numThreads = X265_MIN(MAX_POOL_THREADS, threadsPerPool[node]);<br>
> + if (!pools[i].create(numThreads, maxProviders, nodeMaskPerPool[node]))<br>
> {<br>
> X265_FREE(pools);<br>
> numPools = 0;<br>
> return NULL;<br>
> }<br>
> if (numNumaNodes > 1)<br>
> - x265_log(p, X265_LOG_INFO, "Thread pool %d using %d threads on NUMA node %d\n", i, cores, node);<br>
> + x265_log(p, X265_LOG_INFO, "Thread pool %d using %d threads with NUMA node mask %lx\n", i, numThreads, nodeMaskPerPool[node]);<br>
> else<br>
> - x265_log(p, X265_LOG_INFO, "Thread pool created using %d threads\n", cores);<br>
> - cpusPerNode[node] -= cores;<br>
> + x265_log(p, X265_LOG_INFO, "Thread pool created using %d threads\n", numThreads);<br>
> + threadsPerPool[node] -= numThreads;<br>
> }<br>
> }<br>
> else<br>
> @@ -350,11 +373,28 @@<br>
> memset(this, 0, sizeof(*this));<br>
> }<br>
><br>
> -bool ThreadPool::create(int numThreads, int maxProviders, int node)<br>
> +bool ThreadPool::create(int numThreads, int maxProviders, unsigned long nodeMask)<br>
> {<br>
> X265_CHECK(numThreads <= MAX_POOL_THREADS, "a single thread pool cannot have more than MAX_POOL_THREADS threads\n");<br>
><br>
> - m_numaNode = node;<br>
> +#if defined(_WIN32_WINNT) && _WIN32_WINNT >= _WIN32_WINNT_WIN7<br>
> + DWORD *nodemask = new DWORD;<br>
<br>
why malloc?

m_numaNodeMask is a void* pointer because the underlying type differs between
Windows and non-Windows builds. If I declared `DWORD nodemask = ...` and
assigned m_numaNodeMask = &nodemask, it won't work: that is a reference to a
local variable which goes out of scope as soon as create() returns, right?
Hence the malloc.

Is there some better way to do this?
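
One possible alternative (untested sketch, not what the patch does): drop the
void* and give the member the right type on each platform, so the Windows path
can store the mask by value and skip the allocation entirely:

    /* sketch for threadpool.h: per-platform member instead of void* */
    #if defined(_WIN32_WINNT) && _WIN32_WINNT >= _WIN32_WINNT_WIN7
        DWORD_PTR m_numaMask;        /* stored by value; nothing to free */
    #elif HAVE_LIBNUMA
        struct bitmask* m_numaMask;  /* owned; freed with numa_free_nodemask() */
    #endif

setThreadNodeAffinity() would then take the mask by value on Windows instead
of dereferencing a heap pointer.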
<div><div class="h5"><br>
> + (*nodemask) = nodeMask & ~(0x1 << getNumaNodeCount());<br>
> + m_numaNodeMask = nodemask;<br>
> +#elif HAVE_LIBNUMA<br>
> + if (numa_available() >= 0)<br>
> + {<br>
> + struct bitmask* nodemask = numa_allocate_nodemask();<br>
> + if (nodemask)<br>
> + {<br>
> + *(nodemask->maskp) = nodeMask;<br>
> + m_numaNodeMask = nodemask;<br>
> + }<br>
> + else<br>
> + x265_log(NULL, X265_LOG_ERROR, "unable to get NUMA node mask for %lx\n", nodeMask);<br>
> + }<br>
> +#endif<br>
> +<br>
> m_numWorkers = numThreads;<br>
><br>
> m_workers = X265_MALLOC(WorkerThread, numThreads);<br>
> @@ -408,36 +448,39 @@<br>
><br>
> X265_FREE(m_workers);<br>
> X265_FREE(m_jpTable);<br>
> +<br>
> + if(m_numaNodeMask)<br>
> +#if defined(_WIN32_WINNT) && _WIN32_WINNT >= _WIN32_WINNT_WIN7<br>
> + free(m_numaNodeMask);<br>
<br>
new must be coupled with delete, this will crash on some O/S
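
i.e., if the heap allocation stays as-is, the teardown has to use the matching
operator (sketch):

    if (m_numaNodeMask)
    {
    #if defined(_WIN32_WINNT) && _WIN32_WINNT >= _WIN32_WINNT_WIN7
        delete (DWORD*)m_numaNodeMask;  /* pairs with 'new DWORD' in create() */
    #elif HAVE_LIBNUMA
        numa_free_nodemask((struct bitmask*)m_numaNodeMask);
    #endif
    }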
<div><div class="h5"><br>
> +#elif HAVE_LIBNUMA<br>
> + numa_free_nodemask((struct bitmask*)m_numaNodeMask);<br>
> +#endif<br>
> }<br>
><br>
> void ThreadPool::setCurrentThreadAffinity()<br>
> {<br>
> - setThreadNodeAffinity(m_numaNode);<br>
> + setThreadNodeAffinity(m_numaNodeMask);<br>
> }<br>
><br>
> /* static */<br>
> -void ThreadPool::setThreadNodeAffinity(int numaNode)<br>
> +void ThreadPool::setThreadNodeAffinity(void *numaNodeMask)<br>
> {<br>
> #if defined(_WIN32_WINNT) && _WIN32_WINNT >= _WIN32_WINNT_WIN7<br>
> - GROUP_AFFINITY groupAffinity;<br>
> - if (GetNumaNodeProcessorMaskEx((USHORT)numaNode, &groupAffinity))<br>
> - {<br>
> - if (SetThreadAffinityMask(GetCurrentThread(), (DWORD_PTR)groupAffinity.Mask))<br>
> - return;<br>
> - }<br>
> - x265_log(NULL, X265_LOG_ERROR, "unable to set thread affinity to NUMA node %d\n", numaNode);<br>
> + if (SetThreadAffinityMask(GetCurrentThread(), (DWORD_PTR)(*((DWORD*)numaNodeMask))))<br>
> + return;<br>
> + else<br>
> + x265_log(NULL, X265_LOG_ERROR, "unable to set thread affinity for NUMA node mask\n");<br>
> #elif HAVE_LIBNUMA<br>
> if (numa_available() >= 0)<br>
> {<br>
> - numa_run_on_node(numaNode);<br>
> - numa_set_preferred(numaNode);<br>
> + numa_run_on_node_mask((struct bitmask*)numaNodeMask);<br>
> + numa_set_interleave_mask((struct bitmask*)numaNodeMask);<br>
> numa_set_localalloc();<br>
> return;<br>
> }<br>
> - x265_log(NULL, X265_LOG_ERROR, "unable to set thread affinity to NUMA node %d\n", numaNode);<br>
> -#else<br>
> - (void)numaNode;<br>
> + x265_log(NULL, X265_LOG_ERROR, "unable to set thread affinity for NUMA node mask\n");<br>
> #endif<br>
> + return;<br>
> }<br>
><br>
> /* static */<br>
> diff -r 86e9bd7dd192 -r 5f7e53224962 source/common/threadpool.h<br>
> --- a/source/common/threadpool.h Tue Sep 01 17:06:05 2015 +0530<br>
> +++ b/source/common/threadpool.h Wed Sep 02 10:20:44 2015 +0530<br>
> @@ -83,7 +83,7 @@<br>
> sleepbitmap_t m_sleepBitmap;<br>
> int m_numProviders;<br>
> int m_numWorkers;<br>
> - int m_numaNode;<br>
> + void* m_numaNodeMask;<br>
> bool m_isActive;<br>
><br>
> JobProvider** m_jpTable;<br>
> @@ -92,7 +92,7 @@<br>
> ThreadPool();<br>
> ~ThreadPool();<br>
><br>
> - bool create(int numThreads, int maxProviders, int node);<br>
> + bool create(int numThreads, int maxProviders, unsigned long nodeMask);<br>
> bool start();<br>
> void stopWorkers();<br>
> void setCurrentThreadAffinity();<br>
> @@ -103,7 +103,7 @@<br>
><br>
> static int getCpuCount();<br>
> static int getNumaNodeCount();<br>
> - static void setThreadNodeAffinity(int node);<br>
> + static void setThreadNodeAffinity(void *numaNodeMask);<br>
> };<br>
><br>
> /* Any worker thread may enlist the help of idle worker threads from the same<br>
<span class="HOEnZb"><font color="#888888"><br>
<br>
--<br>
Steve Borho<br>
_______________________________________________<br>
x265-devel mailing list<br>
<a href="mailto:x265-devel@videolan.org">x265-devel@videolan.org</a><br>
<a href="https://mailman.videolan.org/listinfo/x265-devel" rel="noreferrer" target="_blank">https://mailman.videolan.org/listinfo/x265-devel</a><br>
</font></span></blockquote></div><br></div></div></div>