[x264-devel] implementing Cluster farming?

Wed Mar 18 09:59:45 CET 2009

Hello,

I've been using a homegrown, Makefile-based (!) implementation of x264
cluster-farming for about a year now across ~70 GHz of Core2 CPUs, with
linear scaling (which, in my case, means 2-3x realtime 720p with
high-end settings). While the details of this implementation (which, at
this point, only runs on OS X) are not really relevant, I think I've
encountered a few issues that apply to the broader case of any x264
cluster-based implementation (which, ideally, should be MPI-based for
portability reasons).

1) When doing a CRF run, you have to identify "good" scenecut points
(those that would be IDRs in an equivalent serial x264 run), ideally
before you start doing any real encoding. Personally, I've found that
having to discard the beginning of a slow-running encoding job (because
it was started at an arbitrary boundary instead of a scene cut) is
rather costly, so I've opted for a very fast "0-pass", where I start an
instance of "x264 -m 0 --me dia -r 1" every 5000 frames and look for
scene cuts in order to know where to start the real encoding jobs. This
ends up being faster, even though you have to decode the video twice.

2) For a second-pass run (using one of the above CRF runs as the first
pass), the split-at-scenecuts problem is already taken care of, but you
have to maintain the global rate control state across multiple jobs
(x264farm does this by doing its own 2-pass RC; in a hypothetical
x264-mpi, copying some x264_ratecontrol_t's should be enough).

3) In any case, I've found it useful to have a frame-accurate decoder
compiled within x264 (as opposed to, say, a pipe from avs2yuv),
especially when >100 MB/s of YV12 data has to be passed around. As an
alternative to reading everything from raw YUV, I hacked together a
decoder in muxers.c based on FFMPEGSource (a library developed by
Myrsloik which gives libavcodec frame-accurate seeking on most
sources). This tends to be instrumental in avoiding bottlenecks (from
disk reads or on the network).
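
The job planning in 1) can be sketched roughly as follows. This is
only an illustration, not the Makefile implementation described above:
it assumes the 0-pass has already produced a list of scene-cut frame
numbers, and the function name plan_chunks and its parameters are
hypothetical. For each probe point (every 5000 frames), the first scene
cut at or after it becomes the start of a real encoding job.

```python
from bisect import bisect_left


def plan_chunks(scenecuts, total_frames, probe_interval=5000):
    """Return (start, end) frame ranges whose starts fall on scene cuts.

    scenecuts: frame numbers reported as scene cuts by the fast 0-pass
               (assumed input; how they are parsed is not shown here).
    end is exclusive; the first chunk always starts at frame 0.
    """
    cuts = sorted(set(scenecuts))
    starts = [0]
    for probe in range(probe_interval, total_frames, probe_interval):
        i = bisect_left(cuts, probe)      # first cut at or after the probe
        if i == len(cuts):
            break                         # no later cut: last chunk runs to the end
        cut = cuts[i]
        # Skip cuts already used (two probes can land on the same cut)
        # and cuts that lie beyond the end of the video.
        if starts[-1] < cut < total_frames:
            starts.append(cut)
    ends = starts[1:] + [total_frames]
    return list(zip(starts, ends))


if __name__ == "__main__":
    for start, end in plan_chunks([120, 5100, 5300, 9990, 10040, 16000], 17000):
        print(f"job: frames {start}..{end - 1}")
```

Chunks produced this way vary in length, so a scheduler would still
need to balance them across machines.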
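
To illustrate the "global state" problem in 2): each second-pass job
needs to know its share of the overall bit budget. The sketch below
uses plain proportional allocation based on first-pass sizes. That is
a deliberate simplification, not x264's actual rate control and not
necessarily what x264farm does; the function name and the input layout
are assumptions made for the example.

```python
def chunk_bitrates_kbps(chunks, target_total_bits, fps):
    """Give each chunk a bitrate proportional to its first-pass size.

    chunks: list of (num_frames, first_pass_bits) tuples, one per job
            (hypothetical layout, e.g. summed from first-pass stats).
    Returns one average bitrate in kbit/s per chunk.
    """
    total_first_pass = sum(bits for _, bits in chunks)
    if total_first_pass <= 0:
        raise ValueError("first pass produced no bits")
    rates = []
    for num_frames, bits in chunks:
        share = target_total_bits * bits / total_first_pass  # bits for this chunk
        seconds = num_frames / fps
        rates.append(share / seconds / 1000.0)
    return rates


if __name__ == "__main__":
    # Two equally long chunks; the second was three times as costly in pass 1.
    print(chunk_bitrates_kbps([(2500, 100), (2500, 300)], 8_000_000, 25))
```

A fixed split like this cannot move bits between jobs once they are
running, which is one reason to share the rate control state instead.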
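
The ">100 MB/s" figure in 3) can be checked with a little arithmetic.
YV12 stores a full-resolution luma plane plus two quarter-resolution
chroma planes, i.e. 1.5 bytes per pixel. The 25 fps source rate below
is an assumption (the post only says 720p at 2-3x realtime), and the
helper names are made up for the example.

```python
def yv12_frame_bytes(width, height):
    """Bytes in one 8-bit YV12 (4:2:0) frame; assumes even dimensions."""
    return width * height * 3 // 2


def raw_throughput_mb_s(width, height, fps, speedup):
    """Raw YV12 data rate in MB/s (10^6 bytes) at `speedup` x realtime."""
    return yv12_frame_bytes(width, height) * fps * speedup / 1e6


if __name__ == "__main__":
    # 1280x720 at an assumed 25 fps, encoded at 3x realtime.
    print(yv12_frame_bytes(1280, 720))           # 1382400
    print(raw_throughput_mb_s(1280, 720, 25, 3)) # about 103.7 MB/s
```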

Personally, I've found the performance/reliability (of my
cobbled-together implementation) good enough that I'm considering
buying multiple half-1U Q8200s (at 400€ a pop) in a rack instead of a
Nehalem machine...

Best regards,

Antoine Gerschenfeld