[x265] Regarding new thread pooling: Low CPU utilization on Dual Xeon

Steve Borho steve at borho.org
Mon Mar 2 17:49:53 CET 2015


On Sun, Mar 1, 2015 at 4:00 AM, Mario Rohkrämer <contact at ligh.de> wrote:
> There have been several reports already that the current x265 binaries are
> not very efficient since the new thread pooling was introduced.
>
> One of the members of the German speaking doom9/Gleitz video board -
> 'Massaguana' - reported a very low utilization, ~25-30% overall, on a dual
> socket board:
>
> 2x Intel Xeon CPU X5675 @ 3.07GHz
> logical processor count: 24 (2*6 physical cores *2 with HT)
>
> He is running encoding tests on MacOS X 10.10.2 via Hybrid (by Selur) and
> Handbrake (here he is also discussing crashes with certain parameters in
> their forum).
>
> Attached is a screenshot of a per-core CPU utilization bar indicator. As you
> can see, only a few cores are used at all, and even those only lightly.
> Apparently these threads are waiting on each other a lot? More details -
> some logs and screenshots - in German in this thread:
>
> http://forum.gleitz.info/showthread.php?46557&p=449737#post449737 (ff.)
>
> It would certainly help to evaluate the thread pooling better if there were
> some sample values available, at least for a typical single-socket
> quad-core, possibly one with HyperThreading, and dual-socket systems.
> Preferably in some of the famous forums (doom9, VideoHelp) and the online
> documentation.
>
> In return, Massaguana will be happy to offer tests on his machine. In the
> meantime, he will discuss with Selur how to add pme and pmode to the
> parameters in Hybrid...

Utilization numbers by themselves are not very illuminating; I'm more
curious whether the performance went up or down after the change.

x265 (any HEVC encoder, really) struggles with data dependencies. They
are everywhere and they are tight, while the large block sizes reduce
the effectiveness of frame parallelism relative to AVC. This is
particularly true at resolutions of 720p and below.

The thread pool changes themselves did not affect this much. The
effect you are mostly seeing is the removal of wave-front processing
from the lookahead's frame cost estimates. The previous code used
more cores, but it used them inefficiently, with a lot of overhead.
The new lookahead batches cost estimates together so that each worker
performs an entire frame cost estimate as one job. This lowered
utilization but improved performance (CTUs per worker-second went up
substantially, and fps improved).

This batching only helps --b-adapt 2; if you are using --b-adapt 1 or
0 there is much less work to be done in the lookahead, so it is
mostly serialized. There is a slice-threading option for lookahead
cost estimates (similar to x264's lookahead threading), but it is
currently only enabled if you have more than 12 logical cores per
socket. I could be convinced to lower this threshold, but only if I
see evidence that slicetypeDecide() is the bottleneck on that machine.

The other major change in the new thread pool design is that x265 now
creates one pool per socket and distributes the frames being encoded
round-robin between those pools, so that the threads working on a
particular frame are always on the same socket. This again can lower
overall utilization (by a small amount), but only because each thread
is more compute-efficient.

If you would like to see these effects, enable DETAILED_CU_STATS and
run encodes with builds from before and after the change.

My focus right now is on work efficiency: making the best use of the
CPU cycles we spend by avoiding redundant work, avoiding overheads,
and using SIMD in more places. If utilization drops as we make the
encoder faster, that is a double win for server throughput.

-- 
Steve Borho
