[x265] Parallelization on "manycore" systems

Michael Lackner michael.lackner at unileoben.ac.at
Thu Feb 2 09:01:20 CET 2017


Good morning,

Actually, the original x264 benchmark was 2-pass, so this one will probably be 2-pass as
well. I just gave you the first of the two commands; the second is pass #2 with similar
options.
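
To make the two-pass setup concrete, here is a minimal Python sketch of how the two
commands might be assembled. The helper name two_pass_commands and the file names
tears.y4m and out.hevc are hypothetical; v.stats and the options are taken from the
command quoted below. It assumes the x265 CLI options --input, --output, --pass, --stats
and --slow-firstpass, and that the first-pass bitstream can be discarded via the
platform's null device. It only builds and prints the commands, it does not run them.

```python
import os
import shlex

def two_pass_commands(input_path, stats_path, common_opts, output_path):
    """Return (pass1, pass2) argument lists for a two-pass x265 encode.

    Both passes share the same options and stats file; only --pass and the
    output target differ. --slow-firstpass keeps pass 1 at full complexity.
    """
    base = ["x265", "--input", input_path] + list(common_opts)
    base += ["--stats", stats_path]
    # Assumption: the pass-1 bitstream is not needed, so send it to the null device.
    pass1 = base + ["--pass", "1", "--slow-firstpass", "--output", os.devnull]
    pass2 = base + ["--pass", "2", "--output", output_path]
    return pass1, pass2

if __name__ == "__main__":
    # A shortened subset of the options from the quoted command line.
    opts = ["-D", "10", "-p", "veryslow", "--bitrate", "10000",
            "--ctu", "16", "--wpp", "--pmode", "--pme"]
    for cmd in two_pass_commands("tears.y4m", "v.stats", opts, "out.hevc"):
        print(shlex.join(cmd))
```

Dropping "--slow-firstpass" from the pass-1 list would give the faster first pass instead.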

It's not "just" a benchmark, but also a stress test. If you're interested, this is the
current/old project from 2010-2017: http://www.xin.at/x264/index-en.php

As you can see further down, a lot of users have had runtimes of days, in rarer cases
weeks or even months (!) running it. So it's ok if it's slow. Actually, we want it that way!

So there is no "theoretical" knowledge as to how it would scale using slices and
lookahead slices/threads? I ask because there is no way I can get my hands on anything
with more than 64 logical CPUs in the near future.
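
As a very rough back-of-the-envelope illustration (not a prediction of real scaling), the
following Python sketch counts CTU rows for a given frame size and CTU size. It assumes
the usual wavefront (WPP) model in which each CTU row trails the row above it by two
CTUs, so at most about half the CTU columns' worth of rows can be in flight at once
within one frame. The function names are hypothetical, and the sketch ignores frame
threads, --pmode, --pme, slices and the lookahead entirely.

```python
import math

def ctu_grid(width, height, ctu_size):
    """Return (columns, rows) of CTUs needed to cover the frame."""
    cols = math.ceil(width / ctu_size)
    rows = math.ceil(height / ctu_size)
    return cols, rows

def wpp_row_bound(width, height, ctu_size):
    """Rough upper bound on CTU rows processed concurrently in one frame.

    Assumption: a row may start once the row above has finished two CTUs,
    so no more than ceil(cols / 2) rows overlap, and never more than the
    number of rows that exist.
    """
    cols, rows = ctu_grid(width, height, ctu_size)
    return min(rows, math.ceil(cols / 2))

if __name__ == "__main__":
    # Tears of Steel source resolution, with the CTU sizes x265 supports.
    for ctu in (16, 32, 64):
        cols, rows = ctu_grid(4096, 1714, ctu)
        bound = wpp_row_bound(4096, 1714, ctu)
        print(f"ctu={ctu}: {cols}x{rows} CTUs, WPP row bound ~{bound}")
```

With --ctu 16 the 4096x1714 frame has 108 CTU rows, versus 27 rows at 64x64 CTUs, which
is why a small CTU size looks attractive for exposing parallelism. Even so, this bound is
far below 1000-2000, and it says nothing about how well those rows actually keep CPUs
busy.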

I tried to overcommit CPUs using virtual machine hypervisors (more vCPUs than real ones),
but it makes the guest OSes completely unresponsive.

So I simply can't test this I fear... :( I'd love to test and report back, but at the
moment I just don't have the hardware.

As for content, I'm currently playing around with the Blender movie "Tears of Steel"
(https://mango.blender.org/), which is a mix of live action and lots of CGI in 4096x1714
res. I'm using the Y4M version from here: http://media.xiph.org/tearsofsteel/. This is
just preliminary, though; I can still switch content. It just has to be under a license
that makes it (or rather parts of it) freely redistributable.

I'm in no rush either, so if --slices is very new and still subject to change, I can just
wait a few more months as well. 20-25 CPUs wouldn't be useful in our case. I could always
just spawn multiple processes in parallel, but that's something I really don't want to do.

Thanks for your reply!

Best,
Michael

On 02/02/2017 05:39 AM, Pradeep Ramachandran wrote:
> Michael,
> There have been a few other efforts to create a benchmarking tool around
> x265 as it stresses the CPU pretty heavily; some of these tools are
> available for free download as well.
> 
> As far as scaling goes, at this point, a single instance of x265 scales
> well to around 20-25 CPU threads, but the serial nature of the decisions
> that we make limits our parallelism to these levels. We have recently
> implemented the slices feature (the above number is without slices) which
> could enable us to scale more, and this is something that we're actively
> looking at. More lookahead-slices and lookahead-threads should help, but
> the specifics depend on the content that you are encoding; so I'd encourage
> you to play with them and share your results so that the community can also
> comment.
> 
> Also, from your command-line below, I would remove the --pass 1
> --slow-firstpass options as I think those aren't relevant for you; they are
> used to generate results from a quick first pass that can be refined
> further in subsequent passes.
> 
> Pradeep.
> 
> On Wed, Feb 1, 2017 at 4:55 PM, Michael Lackner
> <michael.lackner at unileoben.ac.at> wrote:
> 
>> Greetings,
>>
>> I have a question about parallelization in x265. I'm currently preparing a
>> benchmarking project based on x265 (a successor of a similar project using
>> x264).
>>
>> The x264 one created in 2010 was locked on a specific version/options and
>> is now running out of steam because it fails to fully utilize today's
>> larger processors (16 and more logical CPUs).
>>
>> I'm currently basing this new thing on 4K input content (either UHD or
>> full 4096x2160, unsure), and I'd like it to scale up to around 1000-2000
>> logical CPUs or more if possible (fully loading them). This would also make
>> it possible to load entire shared memory clusters today.
>>
>> I don't care about effective output quality that much, so parallelization
>> is paramount.
>>
>> I've seen that x265 has a few knobs you can turn manually to better
>> utilize many cores, but for my content I'm not sure when I should set which
>> option to what value?! I don't have test systems for this yet of course...
>>
>> I've begun to write a script to determine logical CPU counts on Windows,
>> Linux and FreeBSD, I just need to know what to do with the following:
>>
>> --slices <integer>
>> --lookahead-slices <0..16>
>> --lookahead-threads <integer>
>>
>> I'm already using:
>>
>> --ctu 16
>> --wpp
>> --pmode
>> --pme
>>
>> In total, my current options are like this (I also want to be hard on the
>> CPU per clock to make the benchmark run long enough even with a small
>> enough input file, but only where it doesn't hurt parallelization):
>>
>> -D 10 --fps 24000/1001 -p veryslow --pmode --pme --wpp --open-gop --ref 6
>> --bframes 16 --b-pyramid --weightb --max-merge 5 --b-intra --bitrate 10000
>> --rect --amp --aq-mode 2 --no-sao --qcomp 0.75 --no-strong-intra-smoothing
>> --psy-rd 1.6 --psy-rdoq 5.0 --rdoq-level 1 --tu-inter-depth 4
>> --tu-intra-depth 4 --ctu 16 --max-tu-size 32 --pass 1 --slow-firstpass
>> --stats v.stats --sar 1 --range full
>>
>> These might not be good settings for my purpose, and some are redundant
>> given the profile I guess, which is why I'd like to ask here. I'm just
>> unsure when I should start using more lookahead slices. And then? Should I
>> switch from lookahead slices to lookahead threads at some point, or can
>> both be used together!?
>>
>> When should I start slicing up input frames, and what do I need to
>> consider when choosing proper values for --slices <integer> etc.?
>>
>> Is it even possible to scale to THAT many cores with 4K/UHD content?!
>>
>> I wanna make this a bit more future-proof this time around...
>>
>> Thanks a lot for your input!
>>
>> Best,
>> Michael

-- 
Michael Lackner
Lehrstuhl für Informationstechnologie (CiT)
Montanuniversität Leoben
Tel.: +43 (0)3842/402-1505 | Mail: michael.lackner at unileoben.ac.at
Fax.: +43 (0)3842/402-1502 | Web: http://institute.unileoben.ac.at/infotech

