[x265] Question about NUMA and core/thread use

Pradeep Ramachandran pradeep at multicorewareinc.com
Wed May 10 09:26:46 CEST 2017


On Wed, May 10, 2017 at 12:32 PM, Michael Lackner <
michael.lackner at unileoben.ac.at> wrote:

> On 05/10/2017 08:24 AM, Pradeep Ramachandran wrote:
> > On Wed, May 10, 2017 at 11:12 AM, Michael Lackner <
> > michael.lackner at unileoben.ac.at> wrote:
> >
> >> Thank you very much for your input!
> >>
> >> Since --pmode and --pme seem to break NUMA support, I disabled them. I
> >> simply cannot tell
> >> users that they have to switch off NUMA in their UEFI firmware just for
> >> this one
> >> application. There may be a lot of situations where this is just not
> >> doable.
> >>
> >> If there is a way to make --pmode --pme work together with x265's NUMA
> >> support, I'd use it, but I don't know how.
> >
> > Could you please elaborate here? It seems to work fine for us. I've
> > tried dual-socket systems on CentOS and Windows Server 2016 and I see
> > all sockets being used.
>
> It's like this: x265 does *say* it's using 32 threads in two NUMA pools.
> That's just how it should be. But it behaves very strangely, almost never
> loading more than two logical cores. FPS is extremely low, so it's really
> slow.
>
> CPU load stays at 190-200%, sometimes briefly dropping to 140-150%, when
> it should be in the range of 2800-3200%. As soon as I remove --pmode
> --pme, the system is loaded very well! It almost never drops below the
> 3000% (30 cores) mark then.
>
> It also works *with* --pmode --pme, but only if NUMA is disabled at the
> firmware level, showing only a classic, flat topology to the OS.
>
> That behavior can be seen on CentOS 7.3 Linux, with x265 2.4+2 compiled
> using GCC 4.8.5 and yasm 1.3.0. The machine is an HP ProLiant DL360 Gen9
> with two Intel Xeon E5-2620 CPUs.
>
> Removing --pmode --pme was suggested by Mario *LigH* Rohkrämer earlier in
> this thread.
>

This seems to be something specific to your configuration. I just tried an
identical experiment on two systems that I have, both dual-socket
E5-2699 v4s (88 threads spread across two sockets), running CentOS 6.8 and
CentOS 7.2. I compiled x265 with gcc 4.4 and see utilization pick up to
closer to 5000% (monitored using htop) when --pme and --pmode are enabled
on the command line; without these options, utilization is closer to 3300%.
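To compare the two configurations the same way on your own machine, a pair of runs like the following can isolate the effect of --pmode/--pme (the input file name and preset here are made up; substitute your own clip, and watch per-core load in htop or `mpstat -P ALL 1` during each run):

```shell
# Hypothetical 8K source clip -- adjust input/preset to your test case.
# Baseline run, default threading:
x265 --input input_8k.y4m --preset slower --output out_baseline.hevc

# Same clip with parallel mode/motion estimation enabled:
x265 --input input_8k.y4m --preset slower --pmode --pme \
     --output out_pmode_pme.hevc
```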


> Here is my topology when NUMA is enabled (pretty simple):
>
> # numactl -H
> available: 2 nodes (0-1)
> node 0 cpus: 0 1 2 3 4 5 6 7 16 17 18 19 20 21 22 23
> node 0 size: 32638 MB
> node 0 free: 266 MB
> node 1 cpus: 8 9 10 11 12 13 14 15 24 25 26 27 28 29 30 31
> node 1 size: 32768 MB
> node 1 free: 82 MB
> node distances:
> node   0   1
>   0:  10  21
>   1:  21  10
>
> Thanks!
>

You seem to have very little free memory on each node, which might be
forcing you to swap to disk and therefore hurting performance. I recommend
freeing up some memory before running x265 to see if that helps.
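Note that much of the "used" memory reported per node may just be page cache rather than application memory. On Linux it can be released before a run (this requires root, and dropping caches will temporarily slow anything relying on cached file data):

```shell
# Per-NUMA-node free memory before dropping caches.
numactl -H | grep free

# Flush dirty pages to disk, then drop page cache, dentries and inodes.
sync
echo 3 > /proc/sys/vm/drop_caches

# Per-node free memory should now be substantially higher.
numactl -H | grep free
```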


> >> Ah yes, I've also found that 8K does indeed help a ton. With 4K and
> >> similar settings, I'm
> >> able to load 16-25 CPUs currently, sometimes briefly 30. With 8K, load
> is
> >> much higher.
> >>
> >> Maybe you can advise how to maximize parallelization / loading as many
> >> CPUs as possible
> >> without breaking NUMA support on both Windows and Linux.
> >>
> >> I'm saying this, because my benchmarking project is targeting multiple
> >> operating systems,
> >> it currently works on:
> >>   * Windows NT 5.2 & 6.0 (wo. NUMA)
> >>   * Windows NT 6.1 - 10.0 (w. NUMA)
> >>   * MacOS X (wo. NUMA)
> >>   * Linux (w. and wo. NUMA)
> >>   * FreeBSD, OpenBSD, NetBSD and DragonFly BSD UNIX (wo. NUMA)
> >>   * Solaris (wo. NUMA)
> >>   * Haiku OS (wo. NUMA)
> >>
> >> Thank you very much!
> >>
> >> Best,
> >> Michael
> >>
> >> On 05/10/2017 07:21 AM, Pradeep Ramachandran wrote:
> >>> Michael,
> >>> Adding --lookahead-threads 2 statically allocates two threads for
> >>> lookahead. Therefore, the worker threads launched to work on WPP will
> >>> be 32-2 = 30 in count. We've found some situations in which statically
> >>> allocating threads for lookahead was useful and therefore decided to
> >>> expose it to the user. Please see if this helps your use-case and
> >>> enable it appropriately.
> >>>
> >>> Now, as far as scaling up for 8K goes, a single instance of x265
> >>> scales well to 25-30 threads depending on the preset you're running.
> >>> We've found that pmode and pme help performance considerably on some
> >>> Broadwell server systems, but again, that is also dependent on content.
> >>> I would encourage you to play with those settings and see if they help
> >>> your use case. Beyond these thread counts, one instance of x265 may not
> >>> be beneficial for you.
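Once a single instance stops scaling, one common pattern (a sketch; the chunk file names are made up) is to run one independent x265 instance per NUMA node, pinning each to that node's CPUs and memory with numactl:

```shell
# One encoder per NUMA node, each bound to that node's cores and local
# memory; chunk0/chunk1 stand in for independently encodable inputs.
numactl --cpunodebind=0 --membind=0 \
    x265 --input chunk0.y4m --preset slower --output chunk0.hevc &
numactl --cpunodebind=1 --membind=1 \
    x265 --input chunk1.y4m --preset slower --output chunk1.hevc &
wait
```

Alternatively, x265's own --pools option (a comma-separated thread count per NUMA node) can confine a single instance to chosen nodes without external pinning.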
> >>>
> >>> Pradeep.
> >>>
> >>> On Fri, May 5, 2017 at 3:26 PM, Michael Lackner <
> >>> michael.lackner at unileoben.ac.at> wrote:
> >>>
> >>>> I found the reason for "why did x265 use 30 threads and not 32, when I
> >>>> have 32 CPUs".
> >>>>
> >>>> Actually, it was (once again) my own fault. Thinking I knew better
> >>>> than x265, I spawned two lookahead threads starting with 32 logical
> >>>> CPUs ('--lookahead-threads 2').
> >>>>
> >>>> It seems x265 reserves two dedicated CPUs for this, but then it
> >>>> couldn't permanently saturate them.
> >>>>
> >>>> I still don't know at what point I should start using dedicated
> >>>> lookahead threads for 8K content. 64 CPUs? 256 CPUs? Or should I
> >>>> leave everything to x265? My goal is to fully load as many CPUs as
> >>>> possible in the future.
> >>>>
> >>>> In any case, the culprit was myself.
> >>>>
> >>>> On 05/04/2017 11:18 AM, Mario *LigH* Rohkrämer wrote:
> >>>>> Am 04.05.2017, 10:58 Uhr, schrieb Michael Lackner <
> >>>> michael.lackner at unileoben.ac.at>:
> >>>>>
> >>>>>> Still wondering why not 32, but ok.
> >>>>>
> >>>>> x265 will calculate how many threads it really needs to utilize WPP
> >>>>> and the other parallelizable steps, in relation to the frame
> >>>>> dimensions and the complexity. It may not *need* more than 30
> >>>>> threads, and would not have any task to give to two more. Possibly.
> >>>>> Developers know better...
> >>>>
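The dependence on frame dimensions can be made concrete. With WPP, each 64-pixel CTU row is a candidate for its own thread, but every row must trail the row above it by about two CTUs, so usable concurrency is bounded by roughly min(rows, cols/2). A back-of-the-envelope sketch (a simplification that ignores frame-level parallelism and preset effects):

```python
import math

CTU = 64  # x265's default maximum CTU size in pixels

def wpp_concurrency_bound(width, height, ctu=CTU):
    """Rough upper bound on simultaneously active WPP rows.

    Each CTU row is a potential worker thread, but a row must stay
    ~2 CTUs behind the row above it, so at most ceil(cols / 2) rows
    can make progress at the same time.
    """
    cols = math.ceil(width / ctu)
    rows = math.ceil(height / ctu)
    return min(rows, math.ceil(cols / 2))

print(wpp_concurrency_bound(3840, 2160))  # 4K  -> 30
print(wpp_concurrency_bound(7680, 4320))  # 8K  -> 60
```

For 4K that bound works out to about 30 concurrent rows, which lines up with x265 choosing 30 worker threads on this 32-CPU machine.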
> >>>> --
> >>>> Michael Lackner
> >>>> Lehrstuhl für Informationstechnologie (CiT)
> >>>> Montanuniversität Leoben
> >>>> Tel.: +43 (0)3842/402-1505 | Mail: michael.lackner at unileoben.ac.at
> >>>> Fax.: +43 (0)3842/402-1502 | Web: http://institute.unileoben.ac.
> >>>> at/infotech
> >>>> _______________________________________________
> >>>> x265-devel mailing list
> >>>> x265-devel at videolan.org
> >>>> https://mailman.videolan.org/listinfo/x265-devel
> >>
>
>