[x265] Question about NUMA and core/thread use

Michael Lackner michael.lackner at unileoben.ac.at
Thu May 4 09:09:35 CEST 2017


Hello,

I would like to ask a few questions about NUMA support in x265 (and parallelization in
general).

I've observed some really weird behavior with x265 (2.4+2 in my case) on CentOS 7.3 x64
Linux, running on an HP ProLiant DL360 Gen9 machine with two Intel Xeon E5-2620 CPUs.

So that's 2 processors, 16 physical cores and 32 logical CPUs in total.

1.)

First, I compiled and linked against libnuma to build myself a NUMA-aware version. Then I
tried to encode some 8K content with wpp, pmode and pme.

According to the info output it recognized the machine's two NUMA nodes just fine, and
spawned enough threads as well, but the encoder was *horribly* slow. Looking at the CPU
utilization it became apparent why: Only two cores were being used! No idea why. For a
very brief period the utilization would skyrocket to ~3000% (30 CPUs), but then drop back
to ~200%, sometimes ~100% for most of the time.
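One thing that could be checked in this situation is whether the encoder's threads are
even allowed to run on all logical CPUs. The following is only a minimal sketch, assuming
Linux and Python 3; the function name cpu_visibility is made up for illustration, and the
ID to pass would be the x265 process ID or one of its thread IDs from /proc/<pid>/task.

```python
import os


def cpu_visibility(pid=0):
    """Compare the logical CPUs the OS reports with those a task may run on.

    pid=0 means the calling process. On Linux a thread ID can be passed as well.
    """
    total = os.cpu_count() or 0
    if hasattr(os, "sched_getaffinity"):
        # Linux only: the set of CPUs the scheduler may place this task on.
        allowed = sorted(os.sched_getaffinity(pid))
    else:
        # Fallback assumption for other platforms: no affinity restriction.
        allowed = list(range(total))
    return {"total": total, "allowed": allowed}


if __name__ == "__main__":
    info = cpu_visibility()
    print(f"{len(info['allowed'])} of {info['total']} logical CPUs allowed: {info['allowed']}")
```

If the allowed list is much shorter than the total for most worker threads, that would
point at thread pinning rather than at a lack of work for the pool.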

According to another user I know, a similar NUMA-aware x64 build of 2.4+2 for Win7+ showed
weird behavior as well, on some bladecenter (also Hewlett Packard!), running Windows
Server 2012 natively. He also had two sockets and the same number of cores/threads.
According to him, "only half of the cores are being used".

So I went to HP's UEFI setup and switched the NUMA topology from "clustered" to "flat", and
activated node interleaving (?), basically disabling NUMA support. Then I re-ran the test.
I also told that other user to try the same.

2.)

Now x265 couldn't see any NUMA nodes anymore, and utilization was constantly high!
However, it spawned only 30 threads instead of 32, as if there were an upper bound for
threads per pool. This was the same on both Windows Server 2012 and CentOS 7.3 Linux.
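To cross-check what the operating system itself reports after such a firmware change
(independently of x265), the node layout can be read from sysfs on Linux. This is just a
rough sketch, assuming the usual /sys/devices/system/node layout; the helper names
parse_cpulist and numa_nodes are my own and not part of any x265 or libnuma API.

```python
import glob
import os
import re


def parse_cpulist(text):
    """Expand a kernel cpulist string such as "0-7,16-23" into a sorted list of CPU ids."""
    cpus = set()
    for part in text.strip().split(","):
        if not part:
            continue  # empty list, e.g. a node without CPUs
        if "-" in part:
            lo, hi = part.split("-", 1)
            cpus.update(range(int(lo), int(hi) + 1))
        else:
            cpus.add(int(part))
    return sorted(cpus)


def numa_nodes(sysfs_root="/sys/devices/system/node"):
    """Return {node_id: [cpu ids]} from sysfs; an empty dict if nothing is found."""
    nodes = {}
    for path in glob.glob(os.path.join(sysfs_root, "node[0-9]*")):
        match = re.search(r"node(\d+)$", path)
        if not match:
            continue
        try:
            with open(os.path.join(path, "cpulist")) as handle:
                nodes[int(match.group(1))] = parse_cpulist(handle.read())
        except OSError:
            continue  # node directory without a readable cpulist
    return nodes


if __name__ == "__main__":
    for node_id, cpus in sorted(numa_nodes().items()):
        print(f"node {node_id}: {len(cpus)} logical CPUs -> {cpus}")
```

With node interleaving enabled I would expect a single node listing all 32 logical CPUs,
and two nodes with 16 logical CPUs each in the "clustered" setting.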

When running like that, x265 would sometimes briefly use all 32 cores, but mostly just 30.
Still, it was MUCH faster than with the somehow broken NUMA stuff from before.

So:

Question A: What went wrong with NUMA? Would you assume that the issue is a HP firmware
bug, or did I do something wrong? I didn't specify any --pools either, left that for x265
to decide.

Question B: Without NUMA and 32 logical CPUs, x265 said: "Thread pool created using 30
threads". Why not 32? Two logical CPUs stayed mostly unused during the encode.

You may also want to know about other parallelization options I set:

--slices 2 --lookahead-slices 4 --lookahead-threads 2 (not sure if those really are a good
choice)

I really don't wanna tell users to "just switch off NUMA and run this with --pools
<number-of-logical-cpus>"

Thank you very much for any insights you might be able to provide!

Best,
Michael

-- 
Michael Lackner
Lehrstuhl für Informationstechnologie (CiT)
Montanuniversität Leoben
Tel.: +43 (0)3842/402-1505 | Mail: michael.lackner at unileoben.ac.at
Fax.: +43 (0)3842/402-1502 | Web: http://institute.unileoben.ac.at/infotech

