[x265] NUMA vs. no NUMA, a quick test (Was: Re: Question about NUMA and core/thread use)

Thu May 4 15:45:54 CEST 2017

And here we go, in case somebody is interested, a NUMA/no NUMA comparison.

This is an HP ProLiant DL360 G9 machine with two Xeon E5-2620 v4 8-core processors with HT
enabled, so two NUMA nodes with 32GiB of Reg. ECC DDR4/2133 and 16 threads each. 64GiB RAM
and 32 threads total. QPI link speed was 8.0GT/s, which should translate into a
theoretical peak of 64GB/s (counting both directions of two QPI links at 16GB/s per
direction each), if I'm not severely mistaken. The memory is configured in a quad channel
setup per node, which should translate into a theoretical peak of roughly 68GB/s (about
64GiB/s) per node.
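
For anyone who wants to check the arithmetic behind the two peak figures, here is a
minimal Python sketch. It assumes 8 bytes per transfer on each 64-bit DDR4 channel,
2 bytes of payload per QPI transfer per direction, and two QPI links between the sockets.
The link count is an assumption, not something verified on this machine, and the function
names are made up for illustration.

```python
# Back-of-the-envelope peak bandwidth figures, in decimal GB/s and binary GiB/s.
GIB = 2**30


def ddr_peak_bytes_per_s(mega_transfers_per_s, channels, bus_bytes=8):
    """Peak DRAM bandwidth: transfers/s x bytes per 64-bit channel x channels."""
    return mega_transfers_per_s * 1e6 * bus_bytes * channels


def qpi_peak_bytes_per_s(giga_transfers_per_s, links, bytes_per_transfer=2, directions=2):
    """Peak QPI bandwidth, summed over both directions of all links (assumed layout)."""
    return giga_transfers_per_s * 1e9 * bytes_per_transfer * directions * links


mem = ddr_peak_bytes_per_s(2133, channels=4)  # DDR4-2133, quad channel, per node
qpi = qpi_peak_bytes_per_s(8.0, links=2)      # 8.0 GT/s, two links assumed

for name, value in (("memory per node", mem), ("QPI between sockets", qpi)):
    print(f"{name}: {value / 1e9:.1f} GB/s = {value / GIB:.1f} GiB/s")
```

These are theoretical peaks only; sustained bandwidth in practice is lower.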

I ran x265 over 800 frames at a resolution of 8192×3428, once with NUMA fully enabled
and once with NUMA switched off, as a 2-pass encode with a target bitrate of 10000kbit/s
(settings below).

OS was CentOS 7.3 Linux (x86_64), compiler was gcc/g++ 4.8.5, assembler was yasm 1.3.0.

As Mario Rohkrämer suggested, both runs were done with '--no-pmode --no-pme', which made
parallelization work properly with NUMA activated.

And this is the result (HH:MM:SS.mmm, where HH=hours, MM=minutes, SS=seconds and
mmm=milliseconds):

With NUMA: 01:41:59.584
Without NUMA: 01:42:29.381

So the difference is about 30 seconds out of roughly 1h42m, or about 0.5%. That seems
completely negligible, at least on a "small" 2-socket machine. Maybe it would start to
matter with more sockets and more cores/threads per NUMA node? Or on a shared memory
cluster with much higher latencies between bricks (if those are supported by libnuma,
that is...).
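
To put a number on the gap between the two runs, here is a small Python sketch that parses
the two wall-clock times reported above and prints the absolute and relative difference.
The helper name parse_hms is made up for illustration. Keep in mind this compares a single
run per configuration.

```python
from datetime import timedelta


def parse_hms(text):
    """Parse 'HH:MM:SS.mmm' into a timedelta."""
    hours, minutes, seconds = text.split(":")
    return timedelta(hours=int(hours), minutes=int(minutes), seconds=float(seconds))


with_numa = parse_hms("01:41:59.584")
without_numa = parse_hms("01:42:29.381")

delta = without_numa - with_numa
relative = delta / without_numa  # timedelta / timedelta yields a plain float

print(f"difference: {delta.total_seconds():.3f} s ({relative:.2%} of the no-NUMA run)")
```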

Settings:

Pass 1:
--y4m -D 10 --fps 24 --allow-non-conformance -p veryslow --no-pmode --no-pme --slices 2
--lookahead-slices 4 --lookahead-threads 2 --open-gop --ref 6 --bframes 8 --b-pyramid
--bitrate 10000 --rect --amp --aq-mode 2 --no-sao --qcomp 0.75 --no-strong-intra-smoothing
--psy-rd 1.6 --psy-rdoq 5.0 --rdoq-level 1 --ssim-rd --tu-inter-depth 3 --tu-intra-depth 3
--ctu 16 --max-tu-size 16 --qg-size 16 --pass 1 --slow-firstpass --stats "v.stats" --csv
"p1.csv" --csv-log-level 1 --sar 1 --range full -o "pass1.h265"

Pass 2:
--y4m -D 10 --fps 24 --allow-non-conformance -p veryslow --no-pmode --no-pme --slices 2
--lookahead-slices 4 --lookahead-threads 2 --open-gop --ref 6 --bframes 8 --b-pyramid
--bitrate 10000 --rect --amp --aq-mode 2 --no-sao --qcomp 0.75 --no-strong-intra-smoothing
--psy-rd 1.6 --psy-rdoq 5.0 --rdoq-level 1 --ssim-rd --tu-inter-depth 3 --tu-intra-depth 3
--ctu 16 --max-tu-size 16 --qg-size 16 --pass 2 --stats "v.stats" --csv "p2.csv"
--csv-log-level 1 --sar 1 --range full -o "pass2.h265"

On 05/04/2017 11:18 AM, Mario *LigH* Rohkrämer wrote:
> On 04.05.2017 at 10:58, Michael Lackner <michael.lackner at unileoben.ac.at> wrote:
> 
>> Still wondering why not 32, but ok.
> 
> x265 will calculate how many threads it really needs to utilize WPP and the other
> parallelizable steps, in relation to the frame dimensions and the complexity. It may not
> *need* more than 30 threads, and would not have any task to give to two more. Possibly.
> Developers know better...

-- 
Michael Lackner
Lehrstuhl für Informationstechnologie (CiT)
Montanuniversität Leoben
Tel.: +43 (0)3842/402-1505 | Mail: michael.lackner at unileoben.ac.at
Fax.: +43 (0)3842/402-1502 | Web: http://institute.unileoben.ac.at/infotech