[x265] Custom LowRes scale

Mon Jul 21 19:11:06 CEST 2014

On 07/21, Nicolas Morey-Chaisemartin wrote:
> Hi,
> 
> We recently profiled x265 pre-analysis to estimate what performance we
> could reach using our accelerator and I was quite disappointed by the
> performance.  When running on a Core-i7 with AVX at roughly 2.7GHz, we
> barely reached the 30fps mark using ultrafast preset on a 4K video.
> 
> After a little bit of browsing I realized that work in LowRes is
> always done at 1/4th of the final resolution, which seems fair but
> requires a huge amount of work for 4K.  It seemed straightforward
> enough to change the divider at LowRes initialization, but it seems
> there are a lot of hard coded values that depend both on the LowRes
> divider and the LowRes CU Size.
> 
> Here's a patch (definitely not applicable like this but just to give an
> idea of where I'm going) that seems to fix most of the hard-coded
> values.  It still works with a X265_LOWRES_SCALE of 4 and the perf is
> definitely improving (29fps => 40fps on a 2048x1024 medium preset on a
> E5504).
> 
> Would you be interested in a clean version of this? At least the
> hard-coded CU_SIZE part?  IMHO it would be better to have a "dynamic"
> value for LowRes depending on preset (or equivalent) and the input
> resolution...  1/4th is fast enough in HD not to be an issue but for
> RT stream in 4K or more, 1/16 will be compulsory.

Interesting. I imagine much 4k content would work decently well even
with further downscaling of the lookahead pictures.

The lowres motion vectors are used in weight analysis as well, so that
file would need to be updated.

Another thing that makes our lookahead slower than x264's is our intra
analysis.  We're measuring DC and planar and then running our `all-angs'
function which generates all 33 angular predictions at once and then
measuring them all with satd. I've been wanting to turn that into a scan
that measures 6 angular modes evenly spaced by 5.  You then pick the
best angular option and search +2 and -2, then pick the best again and
search +1 and -1 (a gradient descent). At the end you measure DC and
planar and pick the best cost mode.  This results in 12 predictions and
satd calls instead of 35 (closer to x264's 9 lowres predictions).
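
The scan described above can be sketched roughly as follows. This is
only an illustration, not x265 code: scanIntraModes, IntraChoice and the
costOf callback are hypothetical names, and costOf stands in for
"generate the prediction for this mode and return its satd cost". The
coarse starting modes (5, 10, ..., 30) are an assumption; the text only
says six modes spaced by 5. With that choice the refinement can reach
modes 2..33 but never mode 34.

```cpp
#include <cstdint>
#include <functional>
#include <limits>

// HEVC intra mode numbering: 0 = planar, 1 = DC, 2..34 = angular.
constexpr int kPlanar = 0;
constexpr int kDc = 1;
constexpr int kFirstAngular = 2;
constexpr int kLastAngular = 34;

struct IntraChoice {
    int mode;       // best mode found so far
    uint32_t cost;  // its cost as reported by the callback
};

// Hypothetical callback: build the prediction for `mode` and return its
// cost (satd in the lookahead). Not an x265 API.
using CostFn = std::function<uint32_t(int mode)>;

IntraChoice scanIntraModes(const CostFn& costOf)
{
    IntraChoice best{-1, std::numeric_limits<uint32_t>::max()};

    auto consider = [&](int mode) {
        uint32_t c = costOf(mode);
        if (c < best.cost)
            best = IntraChoice{mode, c};
    };
    auto considerAngular = [&](int mode) {
        // Guard against stepping outside the angular range.
        if (mode >= kFirstAngular && mode <= kLastAngular)
            consider(mode);
    };

    // Coarse pass: 6 angular modes spaced by 5 (assumed start at mode 5).
    for (int mode = 5; mode <= 30; mode += 5)
        considerAngular(mode);

    // Refinement: +/-2 around the best mode, then +/-1 around the new best.
    for (int step = 2; step >= 1; --step) {
        const int center = best.mode;  // fix the center before probing
        considerAngular(center - step);
        considerAngular(center + step);
    }

    // Finally compare against the two non-angular modes.
    consider(kDc);
    consider(kPlanar);
    return best;
}
```

With these starting modes every probe stays inside 2..34, so the
callback runs 6 + 2 + 2 + 2 = 12 times, matching the count above.
Because it only looks near the best coarse mode, the scan can miss the
true minimum when the cost is not roughly unimodal across angles.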

Our `all-angs' function is pretty good, it takes approx 10x the time of
one angular prediction function to generate all 33. So it's not obvious
whether it would be better for this scan approach to call the ten
individual angular functions or the `all-angs' function once. About half
of the angular predictions must be transposed - and the all-angs
function ignores this for better performance and thus we transpose the
original pixels instead when measuring satd cost of those modes (but we
only have to do that transpose once). The individual angular functions
do this transpose for you internally, if the mode requires it, resulting
in possibly more transposes. This further complicates guessing which
approach would be faster.

Once this approach is working, we could use it in the main encode
functions as a --fast-intra option.

Lastly, we need a signed CLA from you before we can accept code
contributions: https://bitbucket.org/multicoreware/x265/wiki/Contribute

-- 
Steve Borho