[x264-devel] Re: Scalability
Christian Bienia
cbienia at CS.Princeton.EDU
Thu Mar 1 19:52:10 CET 2007
On Thu, 2007-03-01 at 11:44, Alex Izvorski wrote:
> On Wed, 2007-02-28 at 15:49 -0700, Loren Merritt wrote:
> > On Wed, 28 Feb 2007, Alex Izvorski wrote:
> > > Thank you for some very interesting results. Could you do a run with:
> > >
> > > x264 --bitrate=8000 --ref=1 --keyint=30 --scenecut=-1 --bframes=1
> > > --no-b-adapt --threads ${NTHREADS} -o output.264 input_1920x1080_512.y4m
> > >
> > > and with threads up to 256 or 512? I suspect that will scale much
> > > better. Thanks ;)
> >
> > OK, I remembered a limiting factor: The motion estimation prepass
> > (for b-adapt, scenecut, and ratecontrol) is singlethreaded. It runs
> > concurrently with all the other threads, but only 1 thread is allocated to
> > the prepass.
> > If you disable those features (--qp 20 --no-b-adapt --scenecut -1) or if
> > you run a 2nd pass (which doesn't do the prepass again), it should scale
> > better.
>
> Yes, that is what I was trying to do, I just didn't realize the
> ratecontrol needs a prepass also ;)
>
> > I would guess that Alex's command would scale worse than the original,
> > since Alex left in --bitrate (so a prepass is still needed), but increased
> > the speed of all the other options (so the prepass takes a larger
> > fraction of the total cpu-time).
>
> Quite right.
>
> Christian, could you change the options I sent to:
>
> x264 --qp=20 --ref=1 --keyint=30 --scenecut=-1 --bframes=1
> --no-b-adapt --threads ${NTHREADS} -o output.264 input_1920x1080_512.y4m
I have the new performance results for those arguments. Because of the
limitation Limin pointed out yesterday, x264 couldn't run with more than
27 threads.
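
For anyone who wants to repeat the measurement, a minimal timing harness
could look like the sketch below. The x264 options are the ones quoted
above; the function names (build_cmd, time_run, benchmark) are made up
for illustration, and the script does not restrict the number of CPUs
itself, so that part has to be handled separately.

```python
import subprocess
import time

# Thread counts used for the measurements in this mail.
THREAD_COUNTS = [1, 2, 4, 8, 16, 27]


def build_cmd(nthreads, infile="input_1920x1080_512.y4m", outfile="output.264"):
    """Return the x264 command line quoted above as an argument list."""
    return [
        "x264", "--qp=20", "--ref=1", "--keyint=30", "--scenecut=-1",
        "--bframes=1", "--no-b-adapt", "--threads", str(nthreads),
        "-o", outfile, infile,
    ]


def time_run(cmd):
    """Run cmd once and return its wall-clock time in seconds."""
    start = time.perf_counter()
    subprocess.run(cmd, check=True)
    return time.perf_counter() - start


def benchmark(thread_counts=THREAD_COUNTS):
    """Time one encode per thread count; returns {nthreads: seconds}."""
    return {n: time_run(build_cmd(n)) for n in thread_counts}
```

Calling benchmark() runs one encode per thread count and returns the
wall-clock times, which can then be turned into speedup numbers.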
The numbers are again for #CPUs = #Threads. To get a better estimate of
how efficiently x264 can exploit parallelism, I've also computed the
parallel efficiency (Speedup/CPUs):
#Threads       1       2       4       8      16      27
Time [s]  839.93  471.99  240.77  124.02    72.2   52.09
Speedup        1    1.78    3.49    6.77   11.63   16.12
P. Eff.     100%     89%     87%     85%     73%     60%
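
The derived rows can be recomputed from the measured times. This is just
a small sketch of the arithmetic (speedup = T(1)/T(n), efficiency =
speedup/n); the variable and function names are only for illustration.

```python
# Measured wall-clock times from the table above, in seconds.
threads = [1, 2, 4, 8, 16, 27]
times = [839.93, 471.99, 240.77, 124.02, 72.2, 52.09]


def speedup(t1, tn):
    """Speedup relative to the single-threaded run."""
    return t1 / tn


def efficiency(t1, tn, n):
    """Parallel efficiency: speedup divided by the number of CPUs."""
    return speedup(t1, tn) / n


rows = [
    (n, speedup(times[0], t), efficiency(times[0], t, n))
    for n, t in zip(threads, times)
]

for n, s, e in rows:
    print(f"{n:>3} threads: speedup {s:6.2f}, efficiency {e:5.0%}")
```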
With up to 8 threads x264 does a good job overall of utilizing additional
computational resources. Efficiency drops by about 10 percentage points
as soon as x264 runs multi-threaded, but stays almost the same after
that. The decline starts with the step from 8 to 16 threads: From there
on, the efficiency drops by approx. 13 percentage points per 8 threads.
If this trend continues past 27 threads, x264 will run with approx. 45%
efficiency with 32 threads. More threads will be more or less useless.
The current parallelization model x264 uses seems to be usable with up
to 8 CPUs, if work stealing is delegated to the OS scheduler by running
with more threads than CPUs. 16 CPUs are also more or less usable for
x264, but that really pushes it. All that, of course, assumes that the
input stream is amenable to the current work extraction approach.
Overall, the current parallelization of x264 still seems a little
limited. Loren mentioned that fine-grained parallelization is possible,
but more difficult. A combined approach seems promising, though: A
central work queue would contain all work units which await processing.
A work unit can be a chunk of a frame (no more than 2-4 chunks per
frame). As soon as a thread has finished working on a work unit, it
enqueues any work units of subsequent frames which are now possible to
encode (and dequeues its next work unit). This approach would also
eliminate the need to have more threads than CPUs to get a high
utilization. :-)
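
To make the idea a bit more concrete, here is a rough sketch of such a
work queue. It is not x264 code. The dependency rule is an assumption
made up for the example: chunk c of frame f may start once chunks c-1,
c and c+1 of frame f-1 are done. The names run_encoder and encode are
hypothetical too, and encode is only a placeholder for the real work.

```python
import queue
import threading


def run_encoder(num_frames, chunks_per_frame, num_workers,
                encode=lambda frame, chunk: None):
    """Process all (frame, chunk) units via one shared work queue.

    Returns the order in which the units were completed.
    """
    def neighbours(frame, chunk):
        # Assumed dependency rule: a unit is tied to the same chunk and
        # the adjacent chunks of the neighbouring frame.
        return [(frame, k) for k in (chunk - 1, chunk, chunk + 1)
                if 0 <= k < chunks_per_frame]

    units = [(f, c) for f in range(num_frames) for c in range(chunks_per_frame)]
    if not units:
        return []

    # Number of unfinished dependencies per unit (frame 0 has none).
    remaining = {(f, c): (len(neighbours(f - 1, c)) if f > 0 else 0)
                 for f, c in units}
    work = queue.Queue()
    lock = threading.Lock()
    done = []

    for c in range(chunks_per_frame):
        work.put((0, c))  # units of the first frame are ready immediately

    def worker():
        while True:
            unit = work.get()
            if unit is None:  # sentinel: no work left
                return
            encode(*unit)
            frame, chunk = unit
            with lock:
                done.append(unit)
                # Enqueue units of the next frame that just became ready.
                if frame + 1 < num_frames:
                    for dep in neighbours(frame + 1, chunk):
                        remaining[dep] -= 1
                        if remaining[dep] == 0:
                            work.put(dep)
                if len(done) == len(units):
                    for _ in range(num_workers):
                        work.put(None)

    pool = [threading.Thread(target=worker) for _ in range(num_workers)]
    for t in pool:
        t.start()
    for t in pool:
        t.join()
    return done
```

With this structure num_workers can simply be set to the number of
CPUs, since an idle thread always picks up whatever unit is ready next.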
- Chris