[x265] >64 maximum threads per NUMA node and per system for flat topologies

Michael Lackner michael.lackner at unileoben.ac.at
Tue Jul 16 08:45:11 UTC 2024


Greetings!

Short version
-------------

I would like to ask whether anyone could help me make x265 spawn more than 64 worker 
threads per NUMA node, or per system on flat topologies. My target would be 256 per node / 
thread pool, as this is the minimum that modern operating systems support per node or 
system. I'm asking for help because my C++ skills are... practically nonexistent.

I adjusted MAX_POOL_THREADS in source/common/threadpool.h to 256 on 64-bit, but this is 
not enough; it's still limited to 64 in practice. I assume threadpool.cpp needs to be 
modified as well?

Could anyone here help me with this?

Thank you very much!


Long version with more background info
--------------------------------------

I maintain a cross-platform x265-based benchmark tuned towards many-core systems and 
without 32-bit support. It's based on a slightly modified encoder and very large 8K input, 
and its parameters are also tuned for parallelism. With that, it can load machines with 
well over 64 CPUs.

Now we have the issue that x265 appears to scale beyond 64 logical CPUs only on systems 
with multiple NUMA nodes, like multi-socket servers. On a single-node or non-NUMA machine 
with more than 64 logical CPUs, it fully loads 64 of them while leaving the rest idle.

Modern operating system versions (Windows, Linux, *BSD), however, are able to address at 
least 256 logical CPUs for a flat topology (e.g. large Threadripper CPUs) or per NUMA node 
on systems where NUMA nodes exist.

What I would like to do is modify x265 so that it can spawn 256 worker threads per 
threadpool / NUMA node and 256 on systems where only one node or no NUMA exists.

For now, I have only modified the calculation of MAX_POOL_THREADS in 
source/common/threadpool.h as follows:

   enum { MAX_POOL_THREADS = sizeof(sleepbitmap_t) * 32 };

sleepbitmap_t is of type uint64_t on 64-bit builds, so its size is 8 bytes. Multiplied by 
32, that should give 256, the desired value.
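
The arithmetic can be checked at compile time. The sketch below assumes that 
sleepbitmap_t is uint64_t, as stated above. It additionally assumes (this is a guess from 
the type name, not something confirmed here) that the pool tracks one worker thread per 
bit of a single sleepbitmap_t value. Under that assumption, one word can only distinguish 
64 workers, no matter what MAX_POOL_THREADS says. The names BITS_PER_BITMAP, 
fitsInOneWord and bitFor are hypothetical and only serve the illustration.

```cpp
#include <cassert>
#include <climits>
#include <cstddef>
#include <cstdint>

// Assumption: mirrors the 64-bit typedef described in the post.
typedef uint64_t sleepbitmap_t;

// The modified constant from the post: 8 bytes * 32 = 256.
enum { MAX_POOL_THREADS = sizeof(sleepbitmap_t) * 32 };

// Number of bits, and thus one-bit-per-thread slots, in a single bitmap word.
constexpr std::size_t BITS_PER_BITMAP = sizeof(sleepbitmap_t) * CHAR_BIT;

static_assert(MAX_POOL_THREADS == 256, "8 bytes * 32 should give 256");
static_assert(BITS_PER_BITMAP == 64, "a uint64_t holds exactly 64 bits");

// Hypothetical helper: can this thread id be represented by one bit of the word?
constexpr bool fitsInOneWord(unsigned threadId)
{
    return threadId < BITS_PER_BITMAP;
}

// Hypothetical helper: the bit for a given thread id.
// Shifting by 64 or more would be undefined behaviour, hence the assert.
inline sleepbitmap_t bitFor(unsigned threadId)
{
    assert(fitsInOneWord(threadId));
    return static_cast<sleepbitmap_t>(1) << threadId;
}
```

If that assumption holds, thread ids 0 to 63 each get a bit, while ids 64 to 255 have no 
bit left to occupy, which would match the observed 64-thread ceiling.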

That, however, is clearly not all that's needed, as it still only scales to 64 cores per 
pool, so I guess the code in threadpool.cpp needs adaptation as well.
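
One possible direction, sketched purely as an assumption about what such an adaptation 
might involve: if the pool really keeps one bit per worker in a single 64-bit word, that 
word would have to become an array of words. WideSleepMap and all of its members are 
hypothetical names and not part of x265. The sketch is also not thread-safe; a real pool 
would presumably need atomic operations on each word.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical multi-word bitmap with room for 256 thread ids. Not x265 code.
struct WideSleepMap
{
    enum { BITS_PER_WORD = 64, NUM_WORDS = 4, MAX_THREADS = BITS_PER_WORD * NUM_WORDS };

    uint64_t words[NUM_WORDS] = {};   // all bits start cleared

    // Mark thread `id` (e.g. as sleeping).
    void set(int id)
    {
        assert(id >= 0 && id < MAX_THREADS);
        words[id / BITS_PER_WORD] |= static_cast<uint64_t>(1) << (id % BITS_PER_WORD);
    }

    // Unmark thread `id`.
    void clear(int id)
    {
        assert(id >= 0 && id < MAX_THREADS);
        words[id / BITS_PER_WORD] &= ~(static_cast<uint64_t>(1) << (id % BITS_PER_WORD));
    }

    // Is thread `id` currently marked?
    bool test(int id) const
    {
        assert(id >= 0 && id < MAX_THREADS);
        return ((words[id / BITS_PER_WORD] >> (id % BITS_PER_WORD)) & 1) != 0;
    }

    // Lowest marked thread id, or -1 if none is marked.
    // A plain loop is used instead of a compiler-specific count-trailing-zeros intrinsic.
    int findFirstSet() const
    {
        for (int w = 0; w < NUM_WORDS; w++)
        {
            if (!words[w])
                continue;
            for (int b = 0; b < BITS_PER_WORD; b++)
                if ((words[w] >> b) & 1)
                    return w * BITS_PER_WORD + b;
        }
        return -1;
    }
};
```

Every place that shifts, masks or scans the single-word bitmap would need to go through 
an interface like this instead, which is why changing the constant alone would not be 
expected to lift the limit.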

Given how much I suck at C++, I find it hard to locate the parts that define this 
64-threads-per-node limit. I have a few ideas, but am just too uncertain due to my low skill.


Hence I would like to ask whether any of you could help me make this modification or point 
me in the right direction. I do have corresponding virtual machines on a large EPYC host 
for testing, so that part I can just do by myself.

Note: The benchmark is version-locked to 2.5+48-bd438ce10843, so it's a pretty old version 
of x265. I thought that maybe that's the reason, and current versions can already do 
this... but I don't think so, as MAX_POOL_THREADS still appears to have a maximum of 64 in 
x265 version 3.6.

Once again: Thank you very much!

Best regards

-- 
Michael Lackner
System Operator
Lehrstuhl für Informationstechnologie (CiT)
Montanuniversität Leoben
Tel.: +43 3842 402 1505 | Mail: michael.lackner at unileoben.ac.at
Fax.: +43 3842 402 1502 | Web : http://infotech.unileoben.ac.at

