[x265] A question about NUMA issues on MS Windows (again..)

Michael Lackner michael.lackner at unileoben.ac.at
Mon Dec 11 10:21:46 CET 2017


Hello,

I would like to ask the developers about a NUMA issue that a user has been encountering
while running a benchmark I built around x265.

His system consists of the following components:

Supermicro H8QGi-F (quad socket G34 for AMD Opteron CPUs)
4 x Opteron 6128 chips with 8 cores each, 32 cores total
Windows Server 2012 (NT 6.2 with NUMA support in x265)
x265 patched to work with 8192x4608¹ for even more parallelism. Settings below².

Now, those CPUs are multi-chip-modules with 2 x 4 cores each. Thus, that quad-socket
system consists of 8 NUMA nodes in total.
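
The node count above follows from simple arithmetic, sketched here as a minimal illustration. The helper `numa_layout` and its parameter names are hypothetical and not part of x265; the numbers are the ones given for this system (4 sockets, 2 dies per package, 4 cores per die).

```python
# Hypothetical helper (not part of x265): derive the NUMA layout of a
# multi-socket system built from multi-chip-module CPUs.
def numa_layout(sockets, dies_per_socket, cores_per_die):
    # Each die has its own memory controller, so each die is one NUMA node.
    nodes = sockets * dies_per_socket
    return {
        "nodes": nodes,
        "cores_per_node": cores_per_die,
        "total_cores": nodes * cores_per_die,
    }

# Quad-socket board, Opteron 6128: 2 dies per package, 4 cores per die.
layout = numa_layout(sockets=4, dies_per_socket=2, cores_per_die=4)
print(layout)

# If only three of those nodes are loaded, this many cores do the work:
busy_cores = 3 * layout["cores_per_node"]
print(busy_cores, "of", layout["total_cores"], "cores busy")
```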

However, for some reason, x265 uses only the first three of those nodes, as shown in the
following screenshot provided by the affected user:

http://www.directupload.net/file/d/4931/7nowm2f5.jpg

Loading all 32 cores with these settings shouldn't be an issue: other 32-thread machines³
on Linux and Windows showed nearly full load. So it should be possible here as well, if
x265 would just spawn all 8 thread pools required.

I'm wondering whether this is caused by something stupid I'm doing with the options (like
--slices 2), or whether something else could be the reason for only three thread pools
being spawned. With only three pools, x265 is much slower than it should be.
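
One experiment that might help narrow this down is to request the pools explicitly rather than relying on auto-detection. The sketch below only builds the argument string. It assumes that x265's `--pools` option takes a comma-separated list with one entry per NUMA node, where a number is a thread count and `+` means all cores detected on that node, which is how the x265 CLI documentation describes it. The helper `pools_argument` is hypothetical, and whether an explicit list changes the behaviour on this machine is untested.

```python
# Hypothetical helper (not part of x265): build a value for the --pools
# option with one entry per NUMA node.
def pools_argument(node_count, threads_per_node=None):
    # "+" asks for all cores detected on a node (assumption based on the
    # x265 CLI documentation); a number requests that many threads.
    entry = "+" if threads_per_node is None else str(threads_per_node)
    return ",".join([entry] * node_count)

# 8 NUMA nodes with 4 cores each, as on the quad-socket Opteron system.
explicit = pools_argument(8, 4)
automatic = pools_argument(8)

print("--pools", explicit)
print("--pools", automatic)
```

Either string would be appended to the existing command line. If x265 still reports only three pools in its log output, that would point at node detection rather than at the other options.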

Maybe you can provide me with some input on this?

Thank you very much!


¹: source/input/input.h, line 30:

#define MAX_FRAME_HEIGHT 4608


²: Pass 1 settings for the given case, tuned to make everything fit in 12-16GB of RAM
while producing just the desired amount of runtime and parallelism (unless I'm doing
something very wrong here):

--y4m --frames 800 -D 10 --fps 24 --allow-non-conformance -p veryslow --pmode --pme
--slices 2 --lookahead-slices 4 --rc-lookahead 3 --open-gop --ref 6 --bframes 2
--b-pyramid --bitrate 60000 --rect --amp --aq-mode 2 --no-sao --qcomp 0.75
--no-strong-intra-smoothing --psy-rd 1.6 --psy-rdoq 5.0 --rdoq-level 1 --ssim-rd
--tu-inter-depth 3 --tu-intra-depth 3 --ctu 16 --max-tu-size 16 --qg-size 16 --pass 1
--slow-firstpass --stats ".\var\temporary-output\v.stats" --csv
".\var\temporary-output\pass1-framestats.txt" --csv-log-level 1 --sar 1 --range full -o
".\var\temporary-output\pass1.h265"


³: Other machines tested; both showed nearly full load on all cores/threads:

1.) 2 x Xeon E5-2620 v4 Broadwells, 16 cores / 32 threads total on CentOS 7.3 Linux
2.) AMD Ryzen Threadripper 1950X, 16 cores / 32 threads total on Windows 10 Pro 1703

-- 
Michael Lackner
Lehrstuhl für Informationstechnologie (CiT)
Montanuniversität Leoben
Tel.: +43 (0)3842/402-1505 | Mail: michael.lackner at unileoben.ac.at
Fax.: +43 (0)3842/402-1502 | Web: http://institute.unileoben.ac.at/infotech

