[x265] >64 maximum threads per NUMA node and per system for flat topologies

Mario *LigH* Rohkrämer contact at ligh.de
Tue Jul 16 11:49:19 UTC 2024


There is indeed a limit based on the OS edition:

https://www.anandtech.com/show/15483/amd-threadripper-3990x-review/3

 > ... Windows 10 Home is limited to 64 cores (threads), whereas 
Pro/Education versions go up to 128, and then Workstation/Enterprise to 256.

More details about these limits and how they might be bypassed:

https://codeinsecurity.wordpress.com/2022/04/07/cpu-socket-and-core-count-limits-in-windows-10-and-how-to-remove-them/
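Part of the ceiling can also sit on the application side. A thread pool that tracks its workers in a single 64-bit mask cannot address more than 64 of them, because a uint64_t has only 64 bits and shifting 1 by 64 or more is undefined behaviour in C++. A minimal sketch, assuming a hypothetical idle-worker bitmap (`WorkerBitmap` is an illustrative name, not x265's actual code), shows how storing one 64-bit word per group of 64 workers removes that fixed limit:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical idle-worker bitmap (illustration only, not x265 code).
// Instead of one uint64_t (max 64 flags), it keeps one 64-bit word
// per group of 64 workers, so the worker count is not capped at 64.
class WorkerBitmap {
public:
    explicit WorkerBitmap(std::size_t workers)
        : size_(workers), words_((workers + 63) / 64, std::uint64_t{0}) {}

    // Mark worker i (e.g. as idle).
    void set(std::size_t i) {
        assert(i < size_);
        words_[i / 64] |= (std::uint64_t{1} << (i % 64));
    }

    // Unmark worker i.
    void clear(std::size_t i) {
        assert(i < size_);
        words_[i / 64] &= ~(std::uint64_t{1} << (i % 64));
    }

    // Is worker i marked?
    bool test(std::size_t i) const {
        assert(i < size_);
        return (words_[i / 64] >> (i % 64)) & 1u;
    }

    std::size_t size() const { return size_; }
    std::size_t words() const { return words_.size(); }

    // Number of marked workers across all words.
    std::size_t count() const {
        std::size_t n = 0;
        for (std::uint64_t w : words_)
            for (; w; w &= w - 1) ++n;  // clear the lowest set bit each step
        return n;
    }

private:
    std::size_t size_;
    std::vector<std::uint64_t> words_;
};
```

For example, `WorkerBitmap idle(192);` covers all 192 logical CPUs of a 96-core CPU with SMT using three 64-bit words. A real pool would also need to scan for free workers across words and keep the updates thread-safe (e.g. with atomics), which this sketch leaves out.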


Michael Lackner wrote on 16.07.2024 at 13:36:
> Hello,
> 
> You're right of course, but benchmark results on machines with multiple 
> NUMA nodes suggest that x265 can still go faster beyond 64 worker 
> threads. The returns will be diminishing at some point of course, but 
> still. The systems on which this became clear were dual AMD EPYC 9474F, 
> dual EPYC 7502 and dual EPYC 7V12 ones.
> 
> On the CPU hardware and operating system side of things I can say that 
>  >64 is definitely possible and has been for a while, with or without 
> NUMA. That's as long as you're not running some very old OS versions 
> like Windows 7, FreeBSD 8 or Linux 2.6.
> 
> Modern systems simply have to support this. Consider e.g. the AMD 
> Threadripper 7990WX CPU.
> 
> It's a CPU for single-socket machines, so you have one NUMA node (yes, 
> some UEFIs support splitting it into multiple virtual NUMA nodes, but 
> that's not a given). Yet it features 96 cores and thus 192 logical 
> CPUs when SMT is active. And a single process on a modern OS *can* load 
> that thing.
> 
> That is why I am certain that this is a limitation of x265 in this case. 
> Other programs had similar issues in the past; Adobe Photoshop, 
> Cinebench R15 and Steinberg Cubase would be some examples. They 
> all raised their hardcoded upper limits in newer versions (Cinebench 
> R20, for instance, came with support for up to 256 CPUs).
> 
> So I'm reasonably confident here.
> 
> It's just my confidence in my C++ skills that is practically zero. ;)
> 
> Thank you!
> 
> Best
> Michael
> 
> On 16/07/2024 12:35, Mario *LigH* Rohkrämer wrote:
>> Disclaimer: I am not a specialist here, just a brief reply...
>>
>> Obviously, multithreading requires support from the CPU hardware and 
>> from the operating system. As long as they are limited (core mask 
>> register width, API parameter width), an application won't be able to 
>> break these limits.
>>
>> But there is also a limit of efficiency. Running the encoding in 
>> parallel on more cores may speed up the calculation but also may 
>> reduce the scope of every task; finding redundancies in the material 
>> to be used for bitrate reduction may get harder when each thread sees 
>> less of the material due to some separation.
>>
>> Furthermore, parallelism saturates. The effort of managing parallel 
>> threads and the amount of stalling due to dependencies probably make 
>> one encoding task with the maximum number of threads less efficient 
>> than two tasks with half as many threads each.
>>
>> So, whether you can is just half the question... But you are probably 
>> most interested in the first of my three remarks for now. I am 
>> curious about that too.
> 
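The diminishing returns mentioned on both sides can be illustrated with Amdahl's law: if a fraction p of the work parallelizes perfectly, the speedup on n threads is S(n) = 1 / ((1 - p) + p/n). A minimal C++ sketch follows; the parallel fraction p = 0.95 used below is an arbitrary assumption for illustration, not a measured property of x265, and `split_throughput` is a hypothetical helper comparing one large encode against several smaller ones on the same threads.

```cpp
#include <cassert>

// Amdahl's law: speedup on n threads when a fraction p of the work
// parallelizes perfectly and the remaining (1 - p) stays serial.
double amdahl_speedup(double p, double n) {
    assert(p >= 0.0 && p <= 1.0 && n >= 1.0);
    return 1.0 / ((1.0 - p) + p / n);
}

// Combined throughput, in multiples of one single-threaded encode,
// when n threads are split evenly across `jobs` independent encodes.
double split_throughput(double p, double n, int jobs) {
    assert(jobs >= 1);
    return jobs * amdahl_speedup(p, n / jobs);
}
```

With p = 0.95, 64 threads give about 15.4x and 128 threads about 17.4x (the ceiling is 1/(1 - p) = 20x), while two 64-thread encodes on the same 128 threads reach about 30.8x combined. Amdahl's law ignores memory bandwidth, NUMA effects and the coding-efficiency loss from splitting the work, so real x265 numbers will differ; it only shows the general shape of the curve.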


-- 

Fun and success!

Mario *LigH* Rohkrämer
mailto:contact at ligh.de

