[x265] [PATCH] Performance: Don't split threads into per-NUMA pools unless specified in cli

Steve Borho steve at borho.org
Thu Sep 3 06:52:38 CEST 2015


On 09/03, Pradeep Ramachandran wrote:
> Thanks. Most comments are simple. I have only one response, inline below
> - let me know your thoughts.
> 
> Pradeep Ramachandran, PhD
> Solution Architect,
> Multicoreware Inc.
> Ph:   +91 99627 82018
> 
> On Wed, Sep 2, 2015 at 11:55 PM, Steve Borho <steve at borho.org> wrote:
> 
> > On 09/02, pradeep at multicorewareinc.com wrote:
> > > # HG changeset patch
> > > # User Pradeep Ramachandran <pradeep at multicorewareinc.com>
> > > # Date 1441169444 -19800
> > > #      Wed Sep 02 10:20:44 2015 +0530
> > > # Node ID 5f7e53224962ef0279f39e745aa59eea7cd2141a
> > > # Parent  86e9bd7dd19278fceef65fc93a06dc8746ec9daf
> > > Performance: Don't split threads into per-NUMA pools unless specified in
> > cli
> > >
> > > Fixes the threadpools to store a numa node mask instead of a numa node
> > id to
> > use libNUMA functionality, and changes default thread pool behavior to
> > use one
> > > monolythic pool
> >
> > nit: monolithic
> >
> > > We see gains of 10%+ on Intel Xeon E5-2666v3, E5-2640 v2 machines for 4K
> > videos
> > > encoding at an ABR of 15Mbps in slower and veryslow settings. Ultrafast
> > shows a
> > > perf dip of ~8% on the same machines. So when running in ultrafast mode,
> > explicitly
> > > specify # threads with the --pools command (--pools 18,18 on the
> > E5-2666v3 which
> > > has two sockets of 18 threads each, for example)
> >
> > it all looks pretty good, couple nits
> >
> > > diff -r 86e9bd7dd192 -r 5f7e53224962 doc/reST/cli.rst
> > > --- a/doc/reST/cli.rst        Tue Sep 01 17:06:05 2015 +0530
> > > +++ b/doc/reST/cli.rst        Wed Sep 02 10:20:44 2015 +0530
> > > @@ -202,15 +202,29 @@
> > >       "-"       - same as "none"
> > >       "10"      - allocate one pool, using up to 10 cores on node 0
> > >       "-,+"     - allocate one pool, using all cores on node 1
> > > -     "+,-,+"   - allocate two pools, using all cores on nodes 0 and 2
> > > -     "+,-,+,-" - allocate two pools, using all cores on nodes 0 and 2
> > > -     "-,*"     - allocate three pools, using all cores on nodes 1, 2
> > and 3
> > > +     "+,-,+"   - allocate one pool, using only cores on nodes 0 and 2
> > > +     "+,-,+,-" - allocate one pool, using only cores on nodes 0 and 2
> > > +     "-,*"     - allocate one pool, using all cores on nodes 1, 2 and 3
> > >       "8,8,8,8" - allocate four pools with up to 8 threads in each pool
> > > +     "8,+,+,+" - allocate two pools, the first with 8 threads on node
> > 0, and the second with all cores on node 1,2,3
> > >
> > > -     The total number of threads will be determined by the number of
> > threads
> > > -     assigned to all nodes. The worker threads will each be given
> > affinity for
> > > -     their node, they will not be allowed to migrate between nodes, but
> > they
> > > -     will be allowed to move between CPU cores within their node.
> > > +     A thread pool dedicated to a given NUMA node is enabled only when
> > the
> > > +     number of threads to be created on that NUMA node is explicitly
> > mentioned
> > > +     in that corresponding position with the --pools option. Otherwise, all
> > threads
> > > +     are spawned from a single pool. The total number of threads will be
> > > +     determined by the number of threads assigned to the enabled NUMA
> > nodes for
> > > +     that pool. The worker threads are given affinity to all the
> > enabled
> > > +     NUMA nodes for that pool and may migrate between them, unless
> > explicitly
> > > +     specified as described above.
> > > +
> > > +     In the case that any threadpool has more than 64 threads, the
> > threadpool
> > > +     may be broken down into multiple pools of 64 threads each; on
> > 32-bit
> > > +     machines, this number is 32. All pools are given affinity to the
> > NUMA
> > > +     nodes on which the original pool had affinity. For performance
> > reasons,
> > > +     the last thread pool is spawned only if it has more than 32
> > threads for
> > > +     64-bit machines, or 16 for 32-bit machines. If the total number of
> > threads
> > > +     in the system doesn't obey this constraint, we may spawn fewer
> > threads
> > > +     than cores, which has been empirically shown to be better for
> > performance.
> > >
> > >       If the four pool features: :option:`--wpp`, :option:`--pmode`,
> > >       :option:`--pme` and :option:`--lookahead-slices` are all disabled,
> > > @@ -219,10 +233,6 @@
> > >       If "none" is specified, then all four of the thread pool features
> > are
> > >       implicitly disabled.
> > >
> > > -     Multiple thread pools will be allocated for any NUMA node with
> > more than
> > > -     64 logical CPU cores. But any given thread pool will always use at
> > most
> > > -     one NUMA node.
> > > -
> > >       Frame encoders are distributed between the available thread pools,
> > >       and the encoder will never generate more thread pools than
> > >       :option:`--frame-threads`.  The pools are used for WPP and for
> > > @@ -238,8 +248,12 @@
> > >       system, a POSIX build of libx265 without libnuma will be less work
> > >       efficient. See :ref:`thread pools <pools>` for more detail.
> > >
> > > -     Default "", one thread is allocated per detected hardware thread
> > > -     (logical CPU cores) and one thread pool per NUMA node.
> > > +     Default "", one pool is created across all available NUMA nodes,
> > with
> > > +     one thread allocated per detected hardware thread
> > > +     (logical CPU cores). In the case that the total number of threads
> > is more
> > > +     than the maximum size that ATOMIC operations can handle (32 for
> > 32-bit
> > > +     compiles, and 64 for 64-bit compiles), multiple thread pools may be
> > > +     spawned subject to the performance constraint described above.
> > >
> > >       Note that the string value will need to be escaped or quoted to
> > >       protect against shell expansion on many platforms
> > > diff -r 86e9bd7dd192 -r 5f7e53224962 source/common/threadpool.cpp
> > > --- a/source/common/threadpool.cpp    Tue Sep 01 17:06:05 2015 +0530
> > > +++ b/source/common/threadpool.cpp    Wed Sep 02 10:20:44 2015 +0530
> > > @@ -226,8 +226,13 @@
> > >  {
> > >      enum { MAX_NODE_NUM = 127 };
> > >      int cpusPerNode[MAX_NODE_NUM + 1];
> > > +    int threadsPerPool[MAX_NODE_NUM + 2];
> > > +    unsigned long nodeMaskPerPool[MAX_NODE_NUM +2];
> >
> > use explicit size types when you care about the size: uint32_t
> >
> > >      memset(cpusPerNode, 0, sizeof(cpusPerNode));
> > > +    memset(threadsPerPool, 0, sizeof(threadsPerPool));
> > > +    memset(nodeMaskPerPool, 0, sizeof(nodeMaskPerPool));
> > > +
> > >      int numNumaNodes = X265_MIN(getNumaNodeCount(), MAX_NODE_NUM);
> > >      int cpuCount = getCpuCount();
> > >      bool bNumaSupport = false;
> > > @@ -258,7 +263,7 @@
> > >          for (int i = 0; i < numNumaNodes; i++)
> > >              x265_log(p, X265_LOG_DEBUG, "detected NUMA node %d with %d
> > logical cores\n", i, cpusPerNode[i]);
> > >
> > > -    /* limit nodes based on param->numaPools */
> > > +    /* limit threads based on param->numaPools */
> > >      if (p->numaPools && *p->numaPools)
> > >      {
> > >          const char *nodeStr = p->numaPools;
> > > @@ -266,19 +271,30 @@
> > >          {
> > >              if (!*nodeStr)
> > >              {
> > > -                cpusPerNode[i] = 0;
> > > +                threadsPerPool[i] = 0;
> > >                  continue;
> > >              }
> > >              else if (*nodeStr == '-')
> > > -                cpusPerNode[i] = 0;
> > > +                threadsPerPool[i] = 0;
> > >              else if (*nodeStr == '*')
> > > +            {
> > > +                for(int j = i; j < numNumaNodes; j++)
> >
> > w/s
> >
> > > +                {
> > > +                    threadsPerPool[numNumaNodes] += cpusPerNode[j];
> > > +                    nodeMaskPerPool[numNumaNodes] |= ((unsigned long)1
> > << j);
> >
> > could use 1U
> >
> > > +                }
> > >                  break;
> > > +            }
> > >              else if (*nodeStr == '+')
> > > -                ;
> > > +            {
> > > +                threadsPerPool[numNumaNodes] += cpusPerNode[i];
> > > +                nodeMaskPerPool[numNumaNodes] |= ((unsigned long)1 << i);
> > > +            }
> > >              else
> > >              {
> > >                  int count = atoi(nodeStr);
> > > -                cpusPerNode[i] = X265_MIN(count, cpusPerNode[i]);
> > > +                threadsPerPool[i] = X265_MIN(count, cpusPerNode[i]);
> > > +                nodeMaskPerPool[i] = ((unsigned long)1 << i);
> > >              }
> > >
> > >              /* consume current node string, comma, and white-space */
> > > @@ -288,24 +304,31 @@
> > >                 ++nodeStr;
> > >          }
> > >      }
> > > -
> > > -    // In the case that numa is disabled and we have more CPUs than 64,
> > > -    // spawn the last pool only if the # threads in that pool is > 1/2
> > max (heuristic)
> > > -    if ((numNumaNodes == 1) &&
> > > -        (cpusPerNode[0] > MAX_POOL_THREADS) &&
> > > -        (cpusPerNode[0] % MAX_POOL_THREADS < (MAX_POOL_THREADS / 2)))
> > > +    else
> > >      {
> > > -        cpusPerNode[0] -= (cpusPerNode[0] % MAX_POOL_THREADS);
> > > -        x265_log(p, X265_LOG_DEBUG, "Creating only %d worker threads to
> > prevent asymmetry in pools; may not use all HW contexts\n", cpusPerNode[0]);
> > > +        for(int i = 0; i < numNumaNodes; i++)
> > > +        {
> > > +            threadsPerPool[numNumaNodes]  += cpusPerNode[i];
> > > +            nodeMaskPerPool[numNumaNodes] |= ((unsigned long)1 << i);
> > > +        }
> > > +    }
> > > +
> > > +    // If the last pool size is > MAX_POOL_THREADS, clip it to spawn
> > thread pools only of size >= 1/2 max (heuristic)
> > > +    if ((threadsPerPool[numNumaNodes] > MAX_POOL_THREADS) &&
> > > +        ((threadsPerPool[numNumaNodes] % MAX_POOL_THREADS) <
> > (MAX_POOL_THREADS / 2)))
> > > +    {
> > > +        threadsPerPool[numNumaNodes] -= (threadsPerPool[numNumaNodes] %
> > MAX_POOL_THREADS);
> > > +        x265_log(p, X265_LOG_DEBUG,
> > > +                 "Creating only %d worker threads beyond specified
> > numbers with --pools (if specified) to prevent asymmetry in pools; may not
> > use all HW contexts\n", threadsPerPool[numNumaNodes]);
> > >      }
> > >
> > >      numPools = 0;
> > > -    for (int i = 0; i < numNumaNodes; i++)
> > > +    for (int i = 0; i < numNumaNodes + 1; i++)
> > >      {
> > >          if (bNumaSupport)
> > >              x265_log(p, X265_LOG_DEBUG, "NUMA node %d may use %d
> > logical cores\n", i, cpusPerNode[i]);
> > > -        if (cpusPerNode[i])
> > > -            numPools += (cpusPerNode[i] + MAX_POOL_THREADS - 1) /
> > MAX_POOL_THREADS;
> > > +        if (threadsPerPool[i])
> > > +            numPools += (threadsPerPool[i] + MAX_POOL_THREADS - 1) /
> > MAX_POOL_THREADS;
> > >      }
> > >
> > >      if (!numPools)
> > > @@ -324,20 +347,20 @@
> > >          int node = 0;
> > >          for (int i = 0; i < numPools; i++)
> > >          {
> > > -            while (!cpusPerNode[node])
> > > +            while (!threadsPerPool[node])
> > >                  node++;
> > > -            int cores = X265_MIN(MAX_POOL_THREADS, cpusPerNode[node]);
> > > -            if (!pools[i].create(cores, maxProviders, node))
> > > +            int numThreads = X265_MIN(MAX_POOL_THREADS,
> > threadsPerPool[node]);
> > > +            if (!pools[i].create(numThreads, maxProviders,
> > nodeMaskPerPool[node]))
> > >              {
> > >                  X265_FREE(pools);
> > >                  numPools = 0;
> > >                  return NULL;
> > >              }
> > >              if (numNumaNodes > 1)
> > > -                x265_log(p, X265_LOG_INFO, "Thread pool %d using %d
> > threads on NUMA node %d\n", i, cores, node);
> > > +                x265_log(p, X265_LOG_INFO, "Thread pool %d using %d
> > threads with NUMA node mask %lx\n", i, numThreads, nodeMaskPerPool[node]);
> > >              else
> > > -                x265_log(p, X265_LOG_INFO, "Thread pool created using
> > %d threads\n", cores);
> > > -            cpusPerNode[node] -= cores;
> > > +                x265_log(p, X265_LOG_INFO, "Thread pool created using
> > %d threads\n", numThreads);
> > > +            threadsPerPool[node] -= numThreads;
> > >          }
> > >      }
> > >      else
> > > @@ -350,11 +373,28 @@
> > >      memset(this, 0, sizeof(*this));
> > >  }
> > >
> > > -bool ThreadPool::create(int numThreads, int maxProviders, int node)
> > > +bool ThreadPool::create(int numThreads, int maxProviders, unsigned long
> > nodeMask)
> > >  {
> > >      X265_CHECK(numThreads <= MAX_POOL_THREADS, "a single thread pool
> > cannot have more than MAX_POOL_THREADS threads\n");
> > >
> > > -    m_numaNode = node;
> > > +#if defined(_WIN32_WINNT) && _WIN32_WINNT >= _WIN32_WINNT_WIN7
> > > +    DWORD *nodemask = new DWORD;
> >
> > why malloc?
> >
> 
> m_numaNodeMask is a void* pointer as the type differs across windows and
> non-windows builds. If I declared `DWORD nodemask = ...` and assigned
> m_numaNodeMask = &nodemask, it won't work, as that is a pointer to a local
> variable which will soon go out of scope, right? Hence the malloc.
> 
> Is there some better way to do this?

if it has to be a pointer, you might as well make the data storage a
member of ThreadPool, with something like this in the ThreadPool class
definition:

#if defined(_WIN32_WINNT) && _WIN32_WINNT >= _WIN32_WINNT_WIN7
    DWORD m_winNodeMask;
#endif

then:

    m_numaNodeMask = &m_winNodeMask;

-- 
Steve Borho


More information about the x265-devel mailing list