[x265] Questions about parallelism

Deepthi Nandakumar deepthi at multicorewareinc.com
Mon Feb 2 11:43:06 CET 2015


Hello,

First, glad to know we've got company analyzing performance issues on x265!
Hopefully, this will lead to lots of good ideas.

On Mon, Feb 2, 2015 at 3:36 PM, Nicolas Morey-Chaisemartin
<nmorey at kalray.eu> wrote:

> Hi,
>
> I've been elbow-deep in the overall parallelism of x265 these last few
> days to debug a performance problem we have and ended up with some
> questions.
> A little bit of context first:
>  - Our goal is to offload frame encoding to an external accelerator.
>  - In the current implementation we have minor changes to the overall code
> and override "FrameEncoder::compressCTURows", which pushes the frame to
> the accelerator and mostly waits until the frame is compressed.
>  - Our HW encoder can handle multiple FrameEncoders through pipelining
>  - We are working on x265 1.4
>
> This usually works great and we have very good performance figures for HD
> video. However we recently switched to 4K and ran into a performance issue.
> A few measurements told us that decoding + pre-analysis could handle UHD
> at 30 fps without any trouble.
> We also know that our accelerator can handle more than UHD at 30 fps.
> However, when plugging both together, the performance fell short
> (somewhere around 20 fps).
>
> After a lot of tracing, we found out that part of it is due to the way the
> LookAhead works.
> Here is the scenario (Lookahead=8, frame-threads=4)
>
> * Main thread pushes 7 frames to the lookahead, filling it for the first
> time, and returns
> * Main thread pushes an eighth, runs a synchronous slice decision, and then
> pushes the frame to a frame encoder
> * Main thread pushes a new frame, starts an async slice decision, checks
> for an encoded frame, waits for the slice decision and pushes the frame to
> a frame encoder.
>   However, the async decision here pops 4 frames from the lookahead
>   (current + 3 B-frames)
> * Twice in a row, main thread pushes a new frame to the lookahead (no
> decision triggered as it's not full), and pushes a new one to the frame
> encoders
> * Main thread pushes a frame to the lookahead (no decision triggered),
> *waits for a frame encoder to finish* and pushes a new frame to a frame
> encoder
>
> And the issue is here, we are at a point where no work is done on the x86
> side, because the main thread is stuck waiting for an encoded frame to come
> back before doing anything else.
> So on the next frame, there is a latency introduced before pushing a new
> frame to a frame encoder because we have to run a slice decision
> beforehand. And this is where the performance is lost for us.
>

We ran into this exact issue while doing performance analysis on machines
with 40+ cores, where we found slicetypeDecide to be a bottleneck. Please
check out this commit.

Changeset: 9033 (d36211d0190f) slicetype: allow queue to fill past full to
           prevent bottlenecks …
User:      Steve Borho <steve at borho.org>
Date:      2015-01-06 15:38:58 +0530 (3 weeks)

This patch essentially lets lookahead run slightly ahead of the Frame
Encoders, so the output queue always has decided frames available for
encoding.
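
To illustrate the idea only, here is a minimal toy sketch (not the code from
the changeset). ToyLookahead, Frame and all method names are hypothetical and
chosen for this example; the assumption is simply that the input queue never
refuses a picture, even past its nominal depth, and that decided frames are
parked in an output queue until an encoder asks for them.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>

// Hypothetical stand-in for a picture; only the picture order count is kept.
struct Frame { int poc; };

// Toy model of a lookahead whose input queue may grow past its nominal depth.
// All names here are illustrative assumptions, not x265's API.
class ToyLookahead {
public:
    explicit ToyLookahead(std::size_t depth) : depth_(depth) { assert(depth > 0); }

    // Never blocks and never rejects: the input queue may exceed depth_.
    void addPicture(Frame f) { input_.push_back(f); }

    // Once at least depth_ frames are queued, move up to `batch` frames to
    // the output queue. Returns the number of frames decided.
    std::size_t decideIfReady(std::size_t batch) {
        if (input_.size() < depth_) return 0;
        std::size_t n = 0;
        while (n < batch && !input_.empty()) {
            output_.push_back(input_.front());
            input_.pop_front();
            ++n;
        }
        return n;
    }

    // Hands the oldest decided frame to an encoder; false if none is ready.
    bool getDecidedPicture(Frame& out) {
        if (output_.empty()) return false;
        out = output_.front();
        output_.pop_front();
        return true;
    }

    std::size_t inputSize() const { return input_.size(); }
    std::size_t outputSize() const { return output_.size(); }

private:
    std::size_t depth_;
    std::deque<Frame> input_;   // undecided frames, allowed to exceed depth_
    std::deque<Frame> output_;  // decided frames waiting for a frame encoder
};
```

With a depth of 8 and a batch of 4, pushing a ninth picture is accepted, and a
single decision then leaves four decided frames waiting in the output queue,
so the next few encoder requests do not have to wait for a slice decision.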

We're very close to tagging 1.5 - which will contain this and other
improvements.


> I have written a quick hack for that which gave us a good perf improvement
> (~ 30% using our accelerator)
>
> My question would be:
> Is there a reason to have a single thread do the "update lookahead/run
> decide/pop encoded/push to encode" actions synchronously?
> This creates dependencies in the encoding pipeline which are difficult to
> control (and understand).
> Would not a simple pipeline work as well?
>
> The idea would be something like this:
> Input -> Input Thread <-> LookAhead queue <-> LookAhead Thread <->
> input_queue <-> Main thread <-> FrameEncoders <-> Main Thread -> Output
> stream
>

We could try this out, but on generic x86 multicore, it may not benefit
much. It would be good to see if the current tip fixes the performance
issues on your accelerator pipeline as well.
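
For discussion, the staged pipeline proposed above could be sketched roughly
as follows. This is a toy model under stated assumptions, not x265 code:
BlockingQueue and runPipeline are hypothetical names, frames are reduced to
integer POCs, the "lookahead" stage just forwards frames, and the queue
capacities (8 and 4) simply mirror the Lookahead=8, frame-threads=4 scenario.

```cpp
#include <cassert>
#include <condition_variable>
#include <cstddef>
#include <mutex>
#include <queue>
#include <thread>
#include <utility>
#include <vector>

// Bounded blocking FIFO connecting two pipeline stages (illustrative only).
template <typename T>
class BlockingQueue {
public:
    explicit BlockingQueue(std::size_t capacity) : capacity_(capacity) { assert(capacity > 0); }

    // Blocks while the queue is full.
    void push(T v) {
        std::unique_lock<std::mutex> lock(m_);
        notFull_.wait(lock, [this] { return q_.size() < capacity_; });
        q_.push(std::move(v));
        notEmpty_.notify_one();
    }

    // Blocks while empty; returns false once the queue is closed and drained.
    bool pop(T& out) {
        std::unique_lock<std::mutex> lock(m_);
        notEmpty_.wait(lock, [this] { return !q_.empty() || closed_; });
        if (q_.empty()) return false;
        out = std::move(q_.front());
        q_.pop();
        notFull_.notify_one();
        return true;
    }

    // Signals that no more items will be pushed.
    void close() {
        { std::lock_guard<std::mutex> lock(m_); closed_ = true; }
        notEmpty_.notify_all();
    }

private:
    std::size_t capacity_;
    std::queue<T> q_;
    std::mutex m_;
    std::condition_variable notFull_, notEmpty_;
    bool closed_ = false;
};

// Input thread -> lookahead queue -> lookahead thread -> encode queue ->
// calling thread (standing in for the frame encoders).
std::vector<int> runPipeline(int frameCount) {
    BlockingQueue<int> lookaheadQueue(8);
    BlockingQueue<int> encodeQueue(4);
    std::vector<int> encoded;

    std::thread input([&] {
        for (int poc = 0; poc < frameCount; ++poc) lookaheadQueue.push(poc);
        lookaheadQueue.close();
    });
    std::thread lookahead([&] {
        int poc;
        while (lookaheadQueue.pop(poc)) encodeQueue.push(poc);  // "decide" = forward
        encodeQueue.close();
    });

    int poc;
    while (encodeQueue.pop(poc)) encoded.push_back(poc);

    input.join();
    lookahead.join();
    return encoded;
}
```

Each stage only blocks on its own queues, so a slow consumer stalls the
stages upstream through back-pressure instead of through one thread doing
every step in sequence. Real slice-type decisions, frame reordering and
zero-latency mode are deliberately left out of this sketch.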


> The retrieval of encoded frames could be done by another thread, but it
> just adds complexity and I'm not sure it's worth it.
>
> I know this means some important code changes and probably some trickery
> to handle zero-latency encoding but we will need to do some of these
> changes for our encoder so I'd love to have some feedback from you and
> maybe even agree to have such changes upstream!
>
> Regards
>
> Nicolas
>
>
> --
> Nicolas Morey Chaisemartin
> Phone : +33 6 42 46 68 87
> _______________________________________________
> x265-devel mailing list
> x265-devel at videolan.org
> https://mailman.videolan.org/listinfo/x265-devel
>