<div dir="ltr">Hi Steve,<br><div class="gmail_extra"><br></div><div class="gmail_extra">Does x265 already queue tasks for execution in the processing pool at CU granularity, or at the granularity of a row of CUs?</div><div class="gmail_extra">

<br></div><div class="gmail_extra">RAUL<br><br><div class="gmail_quote">On Wed, Apr 9, 2014 at 3:21 AM, Nicolas Morey-Chaisemartin <span dir="ltr">&lt;<a href="mailto:nmorey@kalray.eu" target="_blank">nmorey@kalray.eu</a>&gt;</span> wrote:<br>

<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">Hi Steve,<div class="">

<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">

Our plan is to write a custom encoder core optimized for our platform and<br>

use it as an accelerator for x265 running on a x86 processor.<br>

It should look something like this:<br>

<br>

/-------------------------\                 /---------------------\<br>

|  x86                    |                 | 1 or more MPPA-256  |<br>

|                         |                 |                     |<br>

|  x265 preAnalysis +     | <= PCI Link =>  | Kalray Encoder Core |<br>

|  Rate Control           |                 |                     |<br>

|                         |                 |                     |<br>

\-------------------------/                 \---------------------/<br>

<br>

The idea for the encoder core is to implement a CTU encoder.<br>

This leaves us some flexibility in how we dispatch the CTUs across<br>

the cores (Tiles, frame parallelism, etc.)<br>

</blockquote>

Yes, my first guess is that you would want to move TEncCu::processCU() to<br>

your remote cores, and perhaps distribute that work to as many cores<br>

as possible, and allow the main CPU to manage the wave-front and frame<br>

threading data dependencies and the higher level tasks (slice<br>

decisions and rate control)<br>

</blockquote></div>

The issue with this approach is that the feedback loop between our accelerator and the x86 would be too tight to reach maximum performance.<br>

Streaming all the required data back and forth and synchronizing would quickly become a bottleneck.<br>

That's why our first approach is centered on tiles, as they let us push more work at once.<br>

However, the encoder core running on our platform performs roughly the same function as processCU, so we can move in either direction later.<br>

<br><span class="HOEnZb"><font color="#888888">

-- <br>

Nicolas Morey Chaisemartin<br>

Phone : <a href="tel:%2B33%206%2042%2046%2068%2087" value="+33642466887" target="_blank">+33 6 42 46 68 87</a></font></span><div class="HOEnZb"><div class="h5"><br>

<br>

_______________________________________________<br>

x265-devel mailing list<br>

<a href="mailto:x265-devel@videolan.org" target="_blank">x265-devel@videolan.org</a><br>

<a href="https://mailman.videolan.org/listinfo/x265-devel" target="_blank">https://mailman.videolan.org/listinfo/x265-devel</a><br>

</div></div></blockquote></div><br></div></div>