[x265] Questions about parallelism

Nicolas Morey-Chaisemartin nmorey at kalray.eu
Mon Feb 2 11:06:15 CET 2015


Hi,

I've been elbow-deep in the overall parallelism of x265 these last few days to debug a performance problem we have and ended up with some questions.
A little bit of context first:
  - Our goal is to offload frame encoding to an external accelerator.
  - In the current implementation we make minor changes to the overall code and override "FrameEncoder::compressCTURows", which pushes the frame to the accelerator and mostly waits until the frame is compressed.
  - Our HW encoder can handle multiple FrameEncoders through pipelining.
  - We are working on x265 1.4

This usually works great and we have very good performance figures for HD video. However we recently switched to 4K and ran into a performance issue.
A few measurements told us that decoding + pre-analysis could handle UHD at 30 fps without any trouble.
We also know that our accelerator on its own can handle more than UHD at 30 fps. However, when plugging both together, the performance fell short (somewhere around 20 fps).

After a lot of tracing, we found out that part of it is due to the way the LookAhead works.
Here is the scenario (Lookahead=8, frame-threads=4)

* The main thread pushes 7 frames to the lookahead, filling it for the first time, and returns
* The main thread pushes an eighth frame, runs a synchronous slice decision, and then pushes the frame to a frame encoder
* The main thread pushes a new frame, starts an asynchronous slice decision, checks for an encoded frame, waits for the slice decision, and pushes the frame to a frame encoder
However, the asynchronous decision here pops 4 frames from the lookahead (current + 3 B-frames)
* Twice in a row, the main thread pushes a new frame to the lookahead (no decision is triggered as it's not full) and pushes a new one to the frame encoders
* The main thread pushes a frame to the lookahead (no decision triggered), *waits for a frame encoder to finish*, and pushes a new frame to a frame encoder

And the issue is here: we are at a point where no work is done on the x86 side, because the main thread is stuck waiting for an encoded frame to come back before doing anything else.
So on the next frame, latency is introduced before a new frame can be pushed to a frame encoder, because a slice decision has to run first. This is where the performance is lost for us.
I have written a quick hack for that, which gave us a good performance improvement (~30% using our accelerator).

My question would be:
Is there a reason to have a single thread do the "update lookahead/run decide/pop encoded/push to encode" actions synchronously?
This creates dependencies in the encoding pipeline which are difficult to control (and understand).
Would not a simple pipeline work as well?

The idea would be something like this:
Input -> Input Thread <-> LookAhead queue <-> LookAhead Thread <-> input_queue <-> Main thread <-> FrameEncoders <-> Main Thread -> Output stream
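
A minimal sketch of what such a decoupled pipeline could look like, under stated assumptions: `BlockingQueue`, `Frame` and `runPipeline` are hypothetical names for illustration and are not x265 classes, the queue capacities simply mirror the Lookahead=8 / frame-threads=4 scenario above, and the "slice decision" is a placeholder that keeps frames in input order (a real lookahead reorders them). The point is only that each stage blocks on its own queue instead of on the other stages.

```cpp
#include <cassert>
#include <condition_variable>
#include <cstddef>
#include <deque>
#include <mutex>
#include <thread>
#include <utility>
#include <vector>

// Hypothetical bounded blocking queue (not part of x265).
template <typename T>
class BlockingQueue {
public:
    explicit BlockingQueue(std::size_t capacity) : m_capacity(capacity) {}

    // Blocks while the queue is full.
    void push(T value) {
        std::unique_lock<std::mutex> lock(m_mutex);
        m_notFull.wait(lock, [this] { return m_items.size() < m_capacity; });
        m_items.push_back(std::move(value));
        m_notEmpty.notify_one();
    }

    // Blocks until an item is available. Returns false once the queue
    // has been closed and fully drained.
    bool pop(T& out) {
        std::unique_lock<std::mutex> lock(m_mutex);
        m_notEmpty.wait(lock, [this] { return !m_items.empty() || m_closed; });
        if (m_items.empty()) return false;
        out = std::move(m_items.front());
        m_items.pop_front();
        m_notFull.notify_one();
        return true;
    }

    // Signals that no more items will be pushed.
    void close() {
        std::lock_guard<std::mutex> lock(m_mutex);
        m_closed = true;
        m_notEmpty.notify_all();
    }

private:
    std::mutex m_mutex;
    std::condition_variable m_notFull, m_notEmpty;
    std::deque<T> m_items;
    std::size_t m_capacity;
    bool m_closed = false;
};

// Toy frame: display-order number plus a slice type chosen by the lookahead.
struct Frame { int poc; char sliceType; };

// Runs input -> lookahead -> "main thread" as separate stages and returns
// the frame numbers in the order the last stage received them.
std::vector<int> runPipeline(int frameCount) {
    BlockingQueue<Frame> lookaheadQueue(8);  // assumed lookahead depth
    BlockingQueue<Frame> inputQueue(4);      // assumed: one slot per frame encoder
    std::vector<int> submitted;

    std::thread inputThread([&] {
        for (int i = 0; i < frameCount; ++i) lookaheadQueue.push(Frame{i, '?'});
        lookaheadQueue.close();
    });

    std::thread lookaheadThread([&] {
        Frame f{0, '?'};
        // Placeholder decision: every 4th frame is 'P', the rest are 'B'.
        while (lookaheadQueue.pop(f)) {
            f.sliceType = (f.poc % 4 == 0) ? 'P' : 'B';
            inputQueue.push(f);
        }
        inputQueue.close();
    });

    // The calling thread stands in for the main thread feeding FrameEncoders.
    Frame f{0, '?'};
    while (inputQueue.pop(f)) submitted.push_back(f.poc);

    inputThread.join();
    lookaheadThread.join();
    return submitted;
}
```

With this shape, the stage feeding the frame encoders only ever waits on `inputQueue`, so slice decisions can keep running in the lookahead thread while the encoders (or an accelerator) are busy.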

The retrieval of encoded frames could be done by another thread, but it just adds complexity and I'm not sure it's worth it.

I know this means some significant code changes and probably some trickery to handle zero-latency encoding, but we will need to make some of these changes for our encoder, so I'd love to have some feedback from you and maybe even agree to have such changes upstream!

Regards

Nicolas


-- 
Nicolas Morey Chaisemartin
Phone : +33 6 42 46 68 87

