[x265] >64 maximum threads per NUMA node and per system for flat topologies
Michael Lackner
michael.lackner at unileoben.ac.at
Tue Jul 16 12:35:10 UTC 2024
Yes, but Windows 10 Home is not the issue here. :)
If somebody runs Windows 10 Home on their Threadripper 7995WX machine, then... well... I'd
just blame the user for picking the wrong operating system for their hardware.
But let's say we have Windows 11 Professional running on such a 7995WX machine. Then
x265 will still use only 64 of its 192 logical CPUs! The same happens on Linux and BSD.
So we're not running into operating system limits. There appear to be some hard-coded
limits in x265 itself.
I just found some old documentation of mine where I noted down that raising
MAX_POOL_THREADS did work on a Threadripper 3990X with 128 logical CPUs on a single NUMA
node when specifying 128 threads per pool with the parameter '--pools 128'.
When doing that with MAX_POOL_THREADS raised to 256, x265 prints this info line:
"x265 [info]: Thread pool created using 128 threads"
But it seems I didn't test that thoroughly, looking only at that info line thinking "cool,
it works now".
But it doesn't.
I just re-tested this, and the result is clearly a failure. Please look at the attached
htop screenshot (some of the minimally loaded cores likely show only ffmpeg input decoder
load, not x265 load).
It still shows "x265 [info]: Thread pool created using 128 threads", but that's bogus.
On a 2-node, 2-socket NUMA machine with 64 logical CPUs per node and 128 in total, x265
spawns two threadpools instead of one to match the different topology, and in this case
all 128 CPUs get loaded (no longer 100% on all of them, but still).
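
The observations so far fit a simple model, sketched below in C++. It assumes one
thread pool per NUMA node and a cap of 64 threads per pool. Both assumptions are
inferred from the behaviour described in this mail, not taken from x265's source, and
the function name usableThreads is made up for illustration.

```cpp
#include <algorithm>

// Hypothetical model, not x265 code: one pool per NUMA node,
// each pool limited to capPerPool worker threads.
int usableThreads(int numaNodes, int cpusPerNode, int capPerPool)
{
    // Each node contributes at most capPerPool threads, however many CPUs it has.
    return numaNodes * std::min(cpusPerNode, capPerPool);
}
```

With a cap of 64, usableThreads(1, 128, 64) gives 64 (the 3990X case),
usableThreads(2, 64, 64) gives 128 (the 2-socket case), and usableThreads(1, 192, 64)
gives 64 (the 7995WX case).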
So the limit is on threads per pool! And I'm 99% certain it's just hidden somewhere in
x265's code... unless I'm just overlooking something obvious here.
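
One way such a limit can stay hidden even after a thread-count constant is raised is a
fixed-width bitmap that tracks idle workers. The C++ sketch below is only an
illustration of that pattern. Whether x265's thread pool works this way is an
assumption that would have to be checked in its source, and the names IdleBitmap,
markIdle, takeIdle and schedulableWorkers are hypothetical.

```cpp
#include <cstdint>

// Hypothetical pool state, not x265 code: one bit per worker, set while it is idle.
struct IdleBitmap
{
    static constexpr int kBits = 64;  // width of the integer type, not of any thread constant
    std::uint64_t bits = 0;

    // Marks worker `id` as idle. Returns false if the id does not fit in the bitmap.
    bool markIdle(int id)
    {
        if (id < 0 || id >= kBits)
            return false;             // workers 64 and above cannot be tracked
        bits |= (std::uint64_t{1} << id);
        return true;
    }

    // Returns the lowest idle worker id and clears its bit, or -1 if none is idle.
    int takeIdle()
    {
        if (bits == 0)
            return -1;
        int id = 0;
        while (((bits >> id) & 1) == 0)
            ++id;
        bits &= ~(std::uint64_t{1} << id);
        return id;
    }
};

// Counts how many of `requested` workers could ever be marked idle and handed work.
int schedulableWorkers(int requested)
{
    IdleBitmap map;
    int count = 0;
    for (int id = 0; id < requested; ++id)
        if (map.markIdle(id))
            ++count;
    return count;
}
```

In this model a pool can be created with 128 threads, which would match the info line,
yet schedulableWorkers(128) returns 64 because only 64 workers fit in the bitmap.
Lifting the limit would then mean widening or replacing the bitmap, not just raising
the constant.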
Best Regards,
Michael
On 16/07/2024 13:49, Mario *LigH* Rohkrämer wrote:
> There is indeed a limit based on the OS edition:
>
> https://www.anandtech.com/show/15483/amd-threadripper-3990x-review/3
>
> > ... Windows 10 Home is limited to 64 cores (threads), whereas Pro/Education versions go
> > up to 128, and then Workstation/Enterprise to 256.
>
> More details about limits and possibly passing them:
>
> https://codeinsecurity.wordpress.com/2022/04/07/cpu-socket-and-core-count-limits-in-windows-10-and-how-to-remove-them/
>
>
> Michael Lackner schrieb am 16.07.2024 um 13:36:
>> Hello,
>>
>> You're right of course, but benchmark results on machines with multiple NUMA nodes
>> suggest that x265 can still go faster beyond 64 worker threads. The returns will be
>> diminishing at some point of course, but still. The systems on which this became clear
>> were dual AMD EPYC 9474F, dual EPYC 7502 and dual EPYC 7V12 ones.
>>
>> On the CPU hardware and operating system side of things I can say that >64 is
>> definitely possible and has been for a while, with or without NUMA. That's as long as
>> you're not running some very old OS versions like Windows 7, FreeBSD 8 or Linux 2.6.
>>
>> Modern systems simply have to support this. Consider e.g. the AMD Threadripper 7995WX CPU.
>>
>> It's a CPU for single-socket machines, so you have one NUMA node (yes, some UEFIs
>> support splitting it up into virtual multiple NUMA nodes, but that's not a given). Yet
>> it features 96 cores and with that 192 logical CPUs when SMT is active. And a single
>> process on a modern OS *can* load that thing.
>>
>> That is why I am certain that this is a limitation of x265 in this case. Other programs
>> had similar issues in the past. Adobe Photoshop, Cinebench R15 and Steinberg Cubase
>> would make for some examples. They all updated their hardcoded upper limits in newer
>> versions (Cinebench R20, for example, came with support for up to 256 CPUs).
>>
>> So I'm reasonably confident here.
>>
>> It's just my confidence in my C++ skills that is practically zero. ;)
>>
>> Thank you!
>>
>> Best
>> Michael
>>
>> On 16/07/2024 12:35, Mario *LigH* Rohkrämer wrote:
>>> Disclaimer: I am not a specialist here, just a brief reply...
>>>
>>> Obviously, multithreading requires support from the CPU hardware and from the operating
>>> system. As long as they are limited (core mask register width, API parameter width), an
>>> application won't be able to break these limits.
>>>
>>> But there is also a limit of efficiency. Running the encoding in parallel on more cores
>>> may speed up the calculation but also may reduce the scope of every task; finding
>>> redundancies in the material to be used for bitrate reduction may get harder when each
>>> thread sees less of the material due to some separation.
>>>
>>> Furthermore, parallelism saturates. The effort of managing parallel threads and the
>>> amount of stalling due to dependencies probably make one encoding task with the
>>> maximum number of threads less efficient than running two tasks with half of that each.
>>>
>>> So, whether you can is just half the question... But you are probably most interested
>>> in the first of my three remarks for now. I am curious about that one, too.
>>
>
>
--
Michael Lackner
System Operator
Lehrstuhl für Informationstechnologie (CiT)
Montanuniversität Leoben
Tel.: +43 3842 402 1505 | Mail: michael.lackner at unileoben.ac.at
Fax.: +43 3842 402 1502 | Web : http://infotech.unileoben.ac.at
-------------- next part --------------
A non-text attachment was scrubbed...
Name: htop.png
Type: image/png
Size: 23798 bytes
Desc: not available
URL: <http://mailman.videolan.org/pipermail/x265-devel/attachments/20240716/c16c0307/attachment-0001.png>