[x265] >64 maximum threads per NUMA node and per system for flat topologies
Mario *LigH* Rohkrämer
contact at ligh.de
Tue Jul 16 11:49:19 UTC 2024
There is indeed a limit based on the OS edition:
https://www.anandtech.com/show/15483/amd-threadripper-3990x-review/3
> ... Windows 10 Home is limited to 64 cores (threads), whereas
Pro/Education versions go up to 128, and then Workstation/Enterprise to 256.
More details about these limits and possible ways to get past them:
https://codeinsecurity.wordpress.com/2022/04/07/cpu-socket-and-core-count-limits-in-windows-10-and-how-to-remove-them/
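To illustrate where the number 64 comes from: as far as I understand it, a Windows affinity mask is one 64-bit integer with one bit per logical CPU, so machines with more logical CPUs are split into "processor groups" of at most 64 each. The following is only a rough sketch of that arithmetic; the names (kCpusPerGroup, CpuLocation, groupsNeeded, locate) are made up for illustration and are not taken from x265 or from the Windows API, and the real OS may distribute CPUs across groups differently (e.g. along NUMA nodes).

```cpp
#include <cstdint>

// Assumption: one affinity mask is a 64-bit integer, one bit per logical CPU.
constexpr unsigned kCpusPerGroup = 64;

// Hypothetical helper type: where a flat CPU index lands in a grouped topology.
struct CpuLocation {
    unsigned group;      // index of the 64-CPU group
    std::uint64_t mask;  // single-bit mask inside that group
};

// Number of groups needed to cover `logicalCpus` CPUs (rounded up).
constexpr unsigned groupsNeeded(unsigned logicalCpus) {
    return (logicalCpus + kCpusPerGroup - 1) / kCpusPerGroup;
}

// Map a 0-based flat CPU index to its group and its bit within the group.
constexpr CpuLocation locate(unsigned cpuIndex) {
    return CpuLocation{cpuIndex / kCpusPerGroup,
                       std::uint64_t{1} << (cpuIndex % kCpusPerGroup)};
}
```

With 192 logical CPUs, groupsNeeded(192) gives 3 groups, and locate(70) lands in group 1 at bit 6. A thread pool that only ever fills a single mask would therefore stop at 64 threads, no matter how many CPUs the machine has.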
Michael Lackner wrote on 16.07.2024 at 13:36:
> Hello,
>
> You're right of course, but benchmark results on machines with multiple
> NUMA nodes suggest that x265 can still go faster beyond 64 worker
> threads. The returns will be diminishing at some point of course, but
> still. The systems on which this became clear were dual AMD EPYC 9474F,
> dual EPYC 7502 and dual EPYC 7V12 ones.
>
> On the CPU hardware and operating system side of things I can say that
> >64 is definitely possible and has been for a while, with or without
> NUMA. That's as long as you're not running some very old OS versions
> like Windows 7, FreeBSD 8 or Linux 2.6.
>
> Modern systems simply have to support this. Consider e.g. the AMD
> Threadripper PRO 7995WX CPU.
>
> It's a CPU for single-socket machines, so you have one NUMA node (yes,
> some UEFIs support splitting it up into virtual multiple NUMA nodes, but
> that's not a given). Yet it features 96 cores and with that 192 logical
> CPUs when SMT is active. And a single process on a modern OS *can* load
> that thing.
>
> That is why I am certain that this is a limitation of x265 in this case.
> Other programs had similar issues in the past. Adobe Photoshop,
> Cinebench R15 and Steinberg Cubase are some examples. They all
> raised their hardcoded upper limits in newer versions (Cinebench R20,
> for instance, came with support for up to 256 CPUs).
>
> So I'm reasonably confident here.
>
> It's just my confidence in my C++ skills that is practically zero. ;)
>
> Thank you!
>
> Best
> Michael
>
> On 16/07/2024 12:35, Mario *LigH* Rohkrämer wrote:
>> Disclaimer: I am not a specialist here, just a brief reply...
>>
>> Obviously, multithreading requires support from the CPU hardware and
>> from the operating system. As long as they are limited (core mask
>> register width, API parameter width), an application won't be able to
>> break these limits.
>>
>> But there is also a limit of efficiency. Running the encoding in
>> parallel on more cores may speed up the calculation but also may
>> reduce the scope of every task; finding redundancies in the material
>> to be used for bitrate reduction may get harder when each thread sees
>> less of the material due to some separation.
>>
>> Furthermore, parallelism saturates. The effort of managing parallel
>> threads and the amount of stalling due to dependencies probably make
>> one encoding task with the maximum number of threads less efficient
>> than running two tasks with half of that each.
>>
>> So, whether you can is just half the question... But you are probably
>> most interested in the first of my three remarks for now. I am
>> curious about that one too.
>
--
Fun and success!
Mario *LigH* Rohkrämer
mailto:contact at ligh.de