[x265] [PATCH] Performance: Don't split threads into per-NUMA pools unless specified in cli
Pradeep Ramachandran
pradeep at multicorewareinc.com
Thu Sep 3 05:51:14 CEST 2015
Thanks. Most comments are simple to address. I have only one response,
inline below - let me know your thoughts.
Pradeep Ramachandran, PhD
Solution Architect,
Multicoreware Inc.
Ph: +91 99627 82018
On Wed, Sep 2, 2015 at 11:55 PM, Steve Borho <steve at borho.org> wrote:
> On 09/02, pradeep at multicorewareinc.com wrote:
> > # HG changeset patch
> > # User Pradeep Ramachandran <pradeep at multicorewareinc.com>
> > # Date 1441169444 -19800
> > # Wed Sep 02 10:20:44 2015 +0530
> > # Node ID 5f7e53224962ef0279f39e745aa59eea7cd2141a
> > # Parent 86e9bd7dd19278fceef65fc93a06dc8746ec9daf
> > Performance: Don't split threads into per-NUMA pools unless specified in cli
> >
> > Fixes the thread pools to store a NUMA node mask instead of a NUMA node id
> > to use libnuma functionality, and changes the default thread pool behavior
> > to use one monolythic pool
>
> nit: monolithic
>
> > We see gains of 10%+ on Intel Xeon E5-2666v3 and E5-2640 v2 machines for 4K
> > video encoding at an ABR of 15Mbps in the slower and veryslow presets. Ultrafast
> > shows a perf dip of ~8% on the same machines. So when running in ultrafast mode,
> > explicitly specify the thread count with the --pools option (for example,
> > --pools 18,18 on the E5-2666v3, which has two sockets of 18 threads each)
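
(Side note for library users: a hedged sketch, not part of the patch, assuming
the public x265_param_alloc() / x265_param_default_preset() / x265_param_parse()
entry points - the CLI recommendation above corresponds to something like:

    #include <x265.h>

    /* Sketch: mirror "--pools 18,18" through the public API when using the
     * ultrafast preset, where the perf dip was observed. */
    static x265_param *make_param_with_pools(void)
    {
        x265_param *param = x265_param_alloc();
        if (!param)
            return NULL;
        x265_param_default_preset(param, "ultrafast", NULL);
        x265_param_parse(param, "pools", "18,18"); /* one 18-thread pool per socket */
        return param;
    }

The string passed for "pools" follows the same syntax documented for --pools
in the cli.rst changes below.)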
>
> it all looks pretty good, couple nits
>
> > diff -r 86e9bd7dd192 -r 5f7e53224962 doc/reST/cli.rst
> > --- a/doc/reST/cli.rst Tue Sep 01 17:06:05 2015 +0530
> > +++ b/doc/reST/cli.rst Wed Sep 02 10:20:44 2015 +0530
> > @@ -202,15 +202,29 @@
> > "-" - same as "none"
> > "10" - allocate one pool, using up to 10 cores on node 0
> > "-,+" - allocate one pool, using all cores on node 1
> > - "+,-,+" - allocate two pools, using all cores on nodes 0 and 2
> > - "+,-,+,-" - allocate two pools, using all cores on nodes 0 and 2
> > -     "-,*"     - allocate three pools, using all cores on nodes 1, 2 and 3
> > + "+,-,+" - allocate one pool, using only cores on nodes 0 and 2
> > + "+,-,+,-" - allocate one pool, using only cores on nodes 0 and 2
> > + "-,*" - allocate one pool, using all cores on nodes 1, 2 and 3
> > "8,8,8,8" - allocate four pools with up to 8 threads in each pool
> > +     "8,+,+,+" - allocate two pools, the first with 8 threads on node 0, and the second with all cores on nodes 1, 2 and 3
> >
> > -     The total number of threads will be determined by the number of threads
> > -     assigned to all nodes. The worker threads will each be given affinity for
> > -     their node, they will not be allowed to migrate between nodes, but they
> > -     will be allowed to move between CPU cores within their node.
> > +     A thread pool dedicated to a given NUMA node is enabled only when the
> > +     number of threads to be created on that NUMA node is explicitly mentioned
> > +     in the corresponding position with the --pools option. Otherwise, all threads
> > +     are spawned from a single pool. The total number of threads will be
> > +     determined by the number of threads assigned to the enabled NUMA nodes for
> > +     that pool. The worker threads are given affinity to all the enabled
> > +     NUMA nodes for that pool and may migrate between them, unless explicitly
> > +     specified as described above.
> > +
> > +     In the case that any thread pool has more than 64 threads, the thread pool
> > +     may be broken down into multiple pools of 64 threads each; on 32-bit
> > +     machines, this number is 32. All pools are given affinity to the NUMA
> > +     nodes on which the original pool had affinity. For performance reasons,
> > +     the last thread pool is spawned only if it has more than 32 threads for
> > +     64-bit machines, or 16 for 32-bit machines. If the total number of threads
> > +     in the system doesn't obey this constraint, we may spawn fewer threads
> > +     than cores, which has been empirically shown to be better for performance.
> >
> > If the four pool features: :option:`--wpp`, :option:`--pmode`,
> > :option:`--pme` and :option:`--lookahead-slices` are all disabled,
> > @@ -219,10 +233,6 @@
> > If "none" is specified, then all four of the thread pool features
> are
> > implicitly disabled.
> >
> > -     Multiple thread pools will be allocated for any NUMA node with more than
> > -     64 logical CPU cores. But any given thread pool will always use at most
> > -     one NUMA node.
> > -
> > Frame encoders are distributed between the available thread pools,
> > and the encoder will never generate more thread pools than
> > :option:`--frame-threads`. The pools are used for WPP and for
> > @@ -238,8 +248,12 @@
> > system, a POSIX build of libx265 without libnuma will be less work
> > efficient. See :ref:`thread pools <pools>` for more detail.
> >
> > - Default "", one thread is allocated per detected hardware thread
> > - (logical CPU cores) and one thread pool per NUMA node.
> > +     Default "", one pool is created across all available NUMA nodes, with
> > +     one thread allocated per detected hardware thread
> > +     (logical CPU cores). In the case that the total number of threads is more
> > +     than the maximum size that atomic operations can handle (32 for 32-bit
> > +     compiles, and 64 for 64-bit compiles), multiple thread pools may be
> > + spawned subject to the performance constraint described above.
> >
> > Note that the string value will need to be escaped or quoted to
> > protect against shell expansion on many platforms
> > diff -r 86e9bd7dd192 -r 5f7e53224962 source/common/threadpool.cpp
> > --- a/source/common/threadpool.cpp Tue Sep 01 17:06:05 2015 +0530
> > +++ b/source/common/threadpool.cpp Wed Sep 02 10:20:44 2015 +0530
> > @@ -226,8 +226,13 @@
> > {
> > enum { MAX_NODE_NUM = 127 };
> > int cpusPerNode[MAX_NODE_NUM + 1];
> > + int threadsPerPool[MAX_NODE_NUM + 2];
> > + unsigned long nodeMaskPerPool[MAX_NODE_NUM +2];
>
> use explicit size types when you care about the size: uint32_t
>
> > memset(cpusPerNode, 0, sizeof(cpusPerNode));
> > + memset(threadsPerPool, 0, sizeof(threadsPerPool));
> > + memset(nodeMaskPerPool, 0, sizeof(nodeMaskPerPool));
> > +
> > int numNumaNodes = X265_MIN(getNumaNodeCount(), MAX_NODE_NUM);
> > int cpuCount = getCpuCount();
> > bool bNumaSupport = false;
> > @@ -258,7 +263,7 @@
> > for (int i = 0; i < numNumaNodes; i++)
> > x265_log(p, X265_LOG_DEBUG, "detected NUMA node %d with %d
> logical cores\n", i, cpusPerNode[i]);
> >
> > - /* limit nodes based on param->numaPools */
> > + /* limit threads based on param->numaPools */
> > if (p->numaPools && *p->numaPools)
> > {
> > const char *nodeStr = p->numaPools;
> > @@ -266,19 +271,30 @@
> > {
> > if (!*nodeStr)
> > {
> > - cpusPerNode[i] = 0;
> > + threadsPerPool[i] = 0;
> > continue;
> > }
> > else if (*nodeStr == '-')
> > - cpusPerNode[i] = 0;
> > + threadsPerPool[i] = 0;
> > else if (*nodeStr == '*')
> > + {
> > + for(int j = i; j < numNumaNodes; j++)
>
> w/s
>
> > + {
> > + threadsPerPool[numNumaNodes] += cpusPerNode[j];
> > + nodeMaskPerPool[numNumaNodes] |= ((unsigned long)1
> << j);
>
> could use 1U
>
> > + }
> > break;
> > + }
> > else if (*nodeStr == '+')
> > - ;
> > + {
> > + threadsPerPool[numNumaNodes] += cpusPerNode[i];
> > +                     nodeMaskPerPool[numNumaNodes] |= ((unsigned long)1 << i);
> > + }
> > else
> > {
> > int count = atoi(nodeStr);
> > - cpusPerNode[i] = X265_MIN(count, cpusPerNode[i]);
> > + threadsPerPool[i] = X265_MIN(count, cpusPerNode[i]);
> > + nodeMaskPerPool[i] = ((unsigned long)1 << i);
> > }
> >
> > /* consume current node string, comma, and white-space */
> > @@ -288,24 +304,31 @@
> > ++nodeStr;
> > }
> > }
> > -
> > - // In the case that numa is disabled and we have more CPUs than 64,
> > - // spawn the last pool only if the # threads in that pool is > 1/2
> max (heuristic)
> > - if ((numNumaNodes == 1) &&
> > - (cpusPerNode[0] > MAX_POOL_THREADS) &&
> > - (cpusPerNode[0] % MAX_POOL_THREADS < (MAX_POOL_THREADS / 2)))
> > + else
> > {
> > - cpusPerNode[0] -= (cpusPerNode[0] % MAX_POOL_THREADS);
> > - x265_log(p, X265_LOG_DEBUG, "Creating only %d worker threads to
> prevent asymmetry in pools; may not use all HW contexts\n", cpusPerNode[0]);
> > + for(int i = 0; i < numNumaNodes; i++)
> > + {
> > + threadsPerPool[numNumaNodes] += cpusPerNode[i];
> > + nodeMaskPerPool[numNumaNodes] |= ((unsigned long)1 << i);
> > + }
> > + }
> > +
> > + // If the last pool size is > MAX_POOL_THREADS, clip it to spawn
> thread pools only of size >= 1/2 max (heuristic)
> > + if ((threadsPerPool[numNumaNodes] > MAX_POOL_THREADS) &&
> > + ((threadsPerPool[numNumaNodes] % MAX_POOL_THREADS) <
> (MAX_POOL_THREADS / 2)))
> > + {
> > + threadsPerPool[numNumaNodes] -= (threadsPerPool[numNumaNodes] %
> MAX_POOL_THREADS);
> > + x265_log(p, X265_LOG_DEBUG,
> > + "Creating only %d worker threads beyond specified
> numbers with --pools (if specified) to prevent asymmetry in pools; may not
> use all HW contexts\n", threadsPerPool[numNumaNodes]);
> > }
> >
> > numPools = 0;
> > - for (int i = 0; i < numNumaNodes; i++)
> > + for (int i = 0; i < numNumaNodes + 1; i++)
> > {
> > if (bNumaSupport)
> > x265_log(p, X265_LOG_DEBUG, "NUMA node %d may use %d
> logical cores\n", i, cpusPerNode[i]);
> > - if (cpusPerNode[i])
> > - numPools += (cpusPerNode[i] + MAX_POOL_THREADS - 1) /
> MAX_POOL_THREADS;
> > + if (threadsPerPool[i])
> > + numPools += (threadsPerPool[i] + MAX_POOL_THREADS - 1) /
> MAX_POOL_THREADS;
> > }
> >
> > if (!numPools)
> > @@ -324,20 +347,20 @@
> > int node = 0;
> > for (int i = 0; i < numPools; i++)
> > {
> > - while (!cpusPerNode[node])
> > + while (!threadsPerPool[node])
> > node++;
> > - int cores = X265_MIN(MAX_POOL_THREADS, cpusPerNode[node]);
> > - if (!pools[i].create(cores, maxProviders, node))
> > + int numThreads = X265_MIN(MAX_POOL_THREADS,
> threadsPerPool[node]);
> > + if (!pools[i].create(numThreads, maxProviders,
> nodeMaskPerPool[node]))
> > {
> > X265_FREE(pools);
> > numPools = 0;
> > return NULL;
> > }
> > if (numNumaNodes > 1)
> > - x265_log(p, X265_LOG_INFO, "Thread pool %d using %d
> threads on NUMA node %d\n", i, cores, node);
> > + x265_log(p, X265_LOG_INFO, "Thread pool %d using %d
> threads with NUMA node mask %lx\n", i, numThreads, nodeMaskPerPool[node]);
> > else
> > - x265_log(p, X265_LOG_INFO, "Thread pool created using
> %d threads\n", cores);
> > - cpusPerNode[node] -= cores;
> > + x265_log(p, X265_LOG_INFO, "Thread pool created using
> %d threads\n", numThreads);
> > + threadsPerPool[node] -= numThreads;
> > }
> > }
> > else
> > @@ -350,11 +373,28 @@
> > memset(this, 0, sizeof(*this));
> > }
> >
> > -bool ThreadPool::create(int numThreads, int maxProviders, int node)
> > +bool ThreadPool::create(int numThreads, int maxProviders, unsigned long
> nodeMask)
> > {
> > X265_CHECK(numThreads <= MAX_POOL_THREADS, "a single thread pool
> cannot have more than MAX_POOL_THREADS threads\n");
> >
> > - m_numaNode = node;
> > +#if defined(_WIN32_WINNT) && _WIN32_WINNT >= _WIN32_WINNT_WIN7
> > + DWORD *nodemask = new DWORD;
>
> why malloc?
>
m_numaNodeMask is a void* pointer because the type differs between Windows and
non-Windows builds. If I declared `DWORD nodemask = ...` and assigned
m_numaNodeMask = &nodemask, it wouldn't work: that is the address of a local
variable which goes out of scope as soon as create() returns, right? Hence the
heap allocation.
Is there some better way to do this?
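
To make the intended ownership concrete, here is a minimal sketch (the
PoolSketch name and the plain unsigned long member are hypothetical
simplifications, not the patch itself): the mask is heap-allocated so that it
outlives create(), is stored through the void* member, and is released in the
destructor - and, per your other comment, released with delete rather than
free() since it came from new.

    #include <cstddef>

    struct PoolSketch
    {
        void *m_numaNodeMask; /* DWORD* on Windows, struct bitmask* with libnuma */

        PoolSketch() : m_numaNodeMask(NULL) {}

        bool create(unsigned long nodeMask)
        {
            /* Heap allocation: the address of a stack local would dangle as
             * soon as create() returns, so the mask must outlive this call. */
            unsigned long *mask = new unsigned long(nodeMask);
            m_numaNodeMask = mask;
            return true;
        }

        ~PoolSketch()
        {
            /* anything obtained with new must be released with delete, not free() */
            delete static_cast<unsigned long *>(m_numaNodeMask);
        }
    };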
>
> > + (*nodemask) = nodeMask & ~(0x1 << getNumaNodeCount());
> > + m_numaNodeMask = nodemask;
> > +#elif HAVE_LIBNUMA
> > + if (numa_available() >= 0)
> > + {
> > + struct bitmask* nodemask = numa_allocate_nodemask();
> > + if (nodemask)
> > + {
> > + *(nodemask->maskp) = nodeMask;
> > + m_numaNodeMask = nodemask;
> > + }
> > + else
> > + x265_log(NULL, X265_LOG_ERROR, "unable to get NUMA node
> mask for %lx\n", nodeMask);
> > + }
> > +#endif
> > +
> > m_numWorkers = numThreads;
> >
> > m_workers = X265_MALLOC(WorkerThread, numThreads);
> > @@ -408,36 +448,39 @@
> >
> > X265_FREE(m_workers);
> > X265_FREE(m_jpTable);
> > +
> > + if(m_numaNodeMask)
> > +#if defined(_WIN32_WINNT) && _WIN32_WINNT >= _WIN32_WINNT_WIN7
> > + free(m_numaNodeMask);
>
> new must be coupled with delete, this will crash on some O/S
>
> > +#elif HAVE_LIBNUMA
> > + numa_free_nodemask((struct bitmask*)m_numaNodeMask);
> > +#endif
> > }
> >
> > void ThreadPool::setCurrentThreadAffinity()
> > {
> > - setThreadNodeAffinity(m_numaNode);
> > + setThreadNodeAffinity(m_numaNodeMask);
> > }
> >
> > /* static */
> > -void ThreadPool::setThreadNodeAffinity(int numaNode)
> > +void ThreadPool::setThreadNodeAffinity(void *numaNodeMask)
> > {
> > #if defined(_WIN32_WINNT) && _WIN32_WINNT >= _WIN32_WINNT_WIN7
> > - GROUP_AFFINITY groupAffinity;
> > - if (GetNumaNodeProcessorMaskEx((USHORT)numaNode, &groupAffinity))
> > - {
> > - if (SetThreadAffinityMask(GetCurrentThread(),
> (DWORD_PTR)groupAffinity.Mask))
> > - return;
> > - }
> > - x265_log(NULL, X265_LOG_ERROR, "unable to set thread affinity to
> NUMA node %d\n", numaNode);
> > + if (SetThreadAffinityMask(GetCurrentThread(),
> (DWORD_PTR)(*((DWORD*)numaNodeMask))))
> > + return;
> > + else
> > + x265_log(NULL, X265_LOG_ERROR, "unable to set thread affinity
> for NUMA node mask\n");
> > #elif HAVE_LIBNUMA
> > if (numa_available() >= 0)
> > {
> > - numa_run_on_node(numaNode);
> > - numa_set_preferred(numaNode);
> > + numa_run_on_node_mask((struct bitmask*)numaNodeMask);
> > + numa_set_interleave_mask((struct bitmask*)numaNodeMask);
> > numa_set_localalloc();
> > return;
> > }
> > - x265_log(NULL, X265_LOG_ERROR, "unable to set thread affinity to
> NUMA node %d\n", numaNode);
> > -#else
> > - (void)numaNode;
> > + x265_log(NULL, X265_LOG_ERROR, "unable to set thread affinity for
> NUMA node mask\n");
> > #endif
> > + return;
> > }
> >
> > /* static */
> > diff -r 86e9bd7dd192 -r 5f7e53224962 source/common/threadpool.h
> > --- a/source/common/threadpool.h Tue Sep 01 17:06:05 2015 +0530
> > +++ b/source/common/threadpool.h Wed Sep 02 10:20:44 2015 +0530
> > @@ -83,7 +83,7 @@
> > sleepbitmap_t m_sleepBitmap;
> > int m_numProviders;
> > int m_numWorkers;
> > - int m_numaNode;
> > + void* m_numaNodeMask;
> > bool m_isActive;
> >
> > JobProvider** m_jpTable;
> > @@ -92,7 +92,7 @@
> > ThreadPool();
> > ~ThreadPool();
> >
> > - bool create(int numThreads, int maxProviders, int node);
> > + bool create(int numThreads, int maxProviders, unsigned long
> nodeMask);
> > bool start();
> > void stopWorkers();
> > void setCurrentThreadAffinity();
> > @@ -103,7 +103,7 @@
> >
> > static int getCpuCount();
> > static int getNumaNodeCount();
> > - static void setThreadNodeAffinity(int node);
> > + static void setThreadNodeAffinity(void *numaNodeMask);
> > };
> >
> > /* Any worker thread may enlist the help of idle worker threads from
> the same
>
> --
> Steve Borho
> _______________________________________________
> x265-devel mailing list
> x265-devel at videolan.org
> https://mailman.videolan.org/listinfo/x265-devel
>