[x265] >64 maximum threads per NUMA node and per system for flat topologies

Michael Lackner michael.lackner at unileoben.ac.at
Tue Jul 16 11:36:12 UTC 2024


Hello,

You're right of course, but benchmark results on machines with multiple NUMA nodes suggest 
that x265 still gets faster with more than 64 worker threads. The returns will diminish at 
some point of course, but there is still something to gain. The systems on which this became 
clear were dual AMD EPYC 9474F, dual EPYC 7502 and dual EPYC 7V12 machines.

On the CPU hardware and operating system side of things I can say that >64 is definitely 
possible and has been for a while, with or without NUMA. That's as long as you're not 
running some very old OS versions like Windows 7, FreeBSD 8 or Linux 2.6.

Modern systems simply have to support this. Consider e.g. the AMD Ryzen Threadripper PRO 
7995WX CPU.

It's a CPU for single-socket machines, so you have one NUMA node (yes, some UEFIs support 
splitting it up into multiple virtual NUMA nodes, but that's not a given). Yet it features 
96 cores and with that 192 logical CPUs when SMT is active. And a single process on a 
modern OS *can* load that thing.
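
To illustrate the point, here is a minimal C++ sketch (not x265 code) in which a single 
process starts one worker thread per logical CPU without applying any upper bound. The 
helper names runWorkers and logicalCpuCount are made up for this example, and the loop 
body is only a stand-in for real work. Note that on Windows, machines with more than 64 
logical CPUs are split into processor groups, so the count a runtime reports and where 
the threads end up may depend on the OS version.

```cpp
#include <atomic>
#include <cstdint>
#include <thread>
#include <vector>

// Hypothetical helper: starts `workers` threads, lets each one run a short
// CPU-bound loop, and returns how many of them finished. No cap is applied.
unsigned runWorkers(unsigned workers)
{
    std::atomic<unsigned> finished{0};
    std::vector<std::thread> pool;
    pool.reserve(workers);

    for (unsigned i = 0; i < workers; ++i) {
        pool.emplace_back([&finished] {
            volatile std::uint64_t sink = 0;
            for (std::uint64_t n = 0; n < 1000000; ++n)
                sink = sink + n;          // stand-in for real encoding work
            finished.fetch_add(1);
        });
    }

    for (auto& t : pool)
        t.join();

    return finished.load();
}

// Hypothetical helper: hardware_concurrency() is allowed to return 0 when
// the count cannot be determined, so fall back to 1 in that case.
unsigned logicalCpuCount()
{
    unsigned n = std::thread::hardware_concurrency();
    return n == 0 ? 1 : n;
}
```

Calling runWorkers(logicalCpuCount()) from main() starts as many threads as the runtime 
reports logical CPUs. Whether they are actually spread over all CPUs is up to the OS 
scheduler and any affinity settings.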

That is why I am certain that this is a limitation of x265 in this case. Other programs 
had similar issues in the past. Adobe Photoshop, Cinebench R15 and Steinberg Cubase would 
make for some examples. They all raised their hardcoded upper limits in newer versions 
(Cinebench, for instance, gained support for up to 256 CPUs with version R20).
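
The effect of such a hardcoded limit can be sketched in a few lines of C++. This is an 
illustration only: the constant kHardcodedMaxThreads and the function effectiveThreads 
are hypothetical and are not taken from x265 or from any of the programs named above.

```cpp
#include <algorithm>

// Hypothetical compile-time limit, chosen here only to mirror the 64-thread
// ceiling discussed in this thread.
constexpr unsigned kHardcodedMaxThreads = 64;

// Returns the number of worker threads that would actually be used when the
// requested count is clamped to a fixed cap.
unsigned effectiveThreads(unsigned logicalCpus,
                          unsigned cap = kHardcodedMaxThreads)
{
    return std::min(logicalCpus, cap);
}
```

With the default cap, effectiveThreads(192) yields 64, so two thirds of a machine with 
192 logical CPUs would stay idle. With the cap raised to 256, effectiveThreads(192, 256) 
yields 192.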

So I'm reasonably confident here.

It's just my confidence in my C++ skills that is practically zero. ;)

Thank you!

Best
Michael

On 16/07/2024 12:35, Mario *LigH* Rohkrämer wrote:
> Disclaimer: I am not a specialist here, just a brief reply...
> 
> Obviously, multithreading requires support from the CPU hardware and from the operating 
> system. As long as they are limited (core mask register width, API parameter width), an 
> application won't be able to break these limits.
> 
> But there is also a limit of efficiency. Running the encoding in parallel on more cores 
> may speed up the calculation but also may reduce the scope of every task; finding 
> redundancies in the material to be used for bitrate reduction may get harder when each 
> thread sees less of the material due to some separation.
> 
> Furthermore, parallelism saturates. The effort of managing parallel threads and the amount 
> of stalling due to dependencies probably makes one encoding task with the maximum number 
> of threads less efficient than running two tasks with each half of that.
> 
> So, whether you can is just half the question... But you are probably most interested in 
> the first of my three remarks for now. That much, I will be curious too.

-- 
Michael Lackner
System Operator
Lehrstuhl für Informationstechnologie (CiT)
Montanuniversität Leoben
Tel.: +43 3842 402 1505 | Mail: michael.lackner at unileoben.ac.at
Fax.: +43 3842 402 1502 | Web : http://infotech.unileoben.ac.at


