[x265] >64 maximum threads per NUMA node and per system for flat topologies
Michael Lackner
michael.lackner at unileoben.ac.at
Tue Jul 16 08:45:11 UTC 2024
Greetings!
Short version
-------------
I would like to ask whether anyone could help me make x265 spawn more than 64 worker
threads per NUMA node, or per system on flat topologies. My target would be 256 per node /
threadpool, as this is the minimum that modern operating systems support per node or
system. I'm asking for help because my C++ skills are... practically non-existent.
I adjusted MAX_POOL_THREADS in source/common/threadpool.h to 256 on 64-bit, but this is
not enough: it's still limited to 64 in practice. I assume threadpool.cpp needs to be
modified as well?
Could anyone here help me with this?
Thank you very much!
Long version with more background info
--------------------------------------
I maintain a cross-platform x265-based benchmark tuned towards many-core systems and
without 32-bit support. It's based on a slightly modified encoder and a very large 8K
input, and is also parameter-tuned for parallelism. With that, it can load machines with
well over 64 CPUs.
Now we have the issue that x265 appears to scale to over 64 logical CPUs only on systems
with multiple NUMA nodes, like multi-socket servers. If it's a single-node or non-NUMA
machine with more than 64 cores, it loads those 64 fully, while leaving the others idle.
Modern operating system versions (Windows, Linux, *BSD) however are able to address at
least 256 logical CPUs for a flat topology (e.g. large Threadripper CPUs) or per NUMA node
on systems where NUMA nodes exist.
What I would like to do is modify x265 so that it can spawn 256 worker threads per
threadpool / NUMA node and 256 on systems where only one node or no NUMA exists.
For now, I have only modified the calculation of MAX_POOL_THREADS in
source/common/threadpool.h as follows:
enum { MAX_POOL_THREADS = sizeof(sleepbitmap_t) * 32 };
sleepbitmap_t is of type uint64_t, so its size is 8 bytes. That multiplied by 32 should
give 256, the desired value.
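
To make the arithmetic concrete, here is a small sketch. It assumes the pool tracks
workers with one bit each in a single sleepbitmap_t, which is suggested by the type's
name but is not something verified here against threadpool.cpp. BITMAP_BITS and
idFitsInBitmap() are hypothetical names used only for illustration; they are not part
of x265.

```cpp
#include <climits>
#include <cstdint>

typedef uint64_t sleepbitmap_t;  // 8 bytes on 64-bit builds, as described above

// The modified enum from threadpool.h: 8 * 32 = 256.
enum { MAX_POOL_THREADS = sizeof(sleepbitmap_t) * 32 };

// Hypothetical: how many distinct worker ids one bitmap word can hold (8 * 8 = 64).
enum { BITMAP_BITS = sizeof(sleepbitmap_t) * CHAR_BIT };

static_assert(MAX_POOL_THREADS == 256, "enum now allows 256 workers");
static_assert(BITMAP_BITS == 64, "but one uint64_t still has only 64 bits");

// Hypothetical helper: true if worker `id` can be represented as a bit
// in a single sleepbitmap_t. Shifting a 64-bit value by 64 or more is
// undefined behaviour in C++, so ids >= 64 cannot be encoded this way.
inline bool idFitsInBitmap(int id)
{
    return id >= 0 && id < BITMAP_BITS;
}
```

If that assumption holds, raising MAX_POOL_THREADS alone would enlarge the limit on
paper while the bitmap itself still only has room for workers 0 to 63.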
That however is clearly not all that's needed, as it still only scales to 64 cores per
pool, so I guess the code in threadpool.cpp needs adaptation as well.
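
One possible direction, sketched under the assumption that the 64-thread ceiling comes
from a one-bit-per-worker uint64_t bitmap, would be to spread the sleep bitmap over
several 64-bit words. The class below is a standalone illustration, not x265 code:
WideSleepBitmap, markSleeping() and tryAcquire() are hypothetical names, and the real
pool may need additional changes elsewhere (for example in how pools are sized and
how threads are bound to CPUs).

```cpp
#include <array>
#include <atomic>
#include <cstdint>

// Hypothetical constants for this sketch only.
constexpr int kWordBits   = 64;                        // bits per bitmap word
constexpr int kMaxThreads = 256;                       // desired workers per pool
constexpr int kWords      = kMaxThreads / kWordBits;   // 4 words of 64 bits

class WideSleepBitmap
{
public:
    WideSleepBitmap()
    {
        for (auto& w : m_words)
            w.store(0);  // start with no worker marked as sleeping
    }

    // Mark worker `id` (0..kMaxThreads-1) as sleeping.
    void markSleeping(int id)
    {
        m_words[id / kWordBits].fetch_or(uint64_t{1} << (id % kWordBits));
    }

    // Atomically claim the lowest-numbered sleeping worker.
    // Returns its id, or -1 if no worker is currently sleeping.
    int tryAcquire()
    {
        for (int w = 0; w < kWords; ++w)
        {
            uint64_t bits = m_words[w].load();
            while (bits)
            {
                int bit = countTrailingZeros(bits);
                uint64_t mask = uint64_t{1} << bit;
                // Clear the bit; we own the worker only if it was still set.
                uint64_t prev = m_words[w].fetch_and(~mask);
                if (prev & mask)
                    return w * kWordBits + bit;
                bits = prev & ~mask;  // lost the race, retry with fresh state
            }
        }
        return -1;
    }

private:
    // Portable count-trailing-zeros; the caller guarantees v != 0.
    static int countTrailingZeros(uint64_t v)
    {
        int n = 0;
        while (!(v & 1)) { v >>= 1; ++n; }
        return n;
    }

    std::array<std::atomic<uint64_t>, kWords> m_words;
};
```

In this sketch a worker with id 200 lands in word 3, bit 8, and is found again by
scanning the words in order. Whether such a structure could be dropped into
threadpool.cpp without further changes is an open question.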
Given how much I suck at C++, I find it hard to locate the parts that define this
64-threads-per-node limit. I have a few ideas, but am just too uncertain due to my low skill.
Hence I would like to ask whether any of you could help me make this modification or point
me in the right direction. I do have corresponding virtual machines on a large EPYC host
for testing, so that part I can just do by myself.
Note: The benchmark is version-locked to 2.5+48-bd438ce10843, so it's a pretty old version
of x265. I thought that maybe that's the reason, and current versions can already do
this... but I don't think so, as MAX_POOL_THREADS still appears to have a maximum of 64 in
x265 version 3.6.
Once again: Thank you very much!
Best regards
--
Michael Lackner
System Operator
Lehrstuhl für Informationstechnologie (CiT)
Montanuniversität Leoben
Tel.: +43 3842 402 1505 | Mail: michael.lackner at unileoben.ac.at
Fax.: +43 3842 402 1502 | Web : http://infotech.unileoben.ac.at