VOUT4 results with directx
gbazin at netcourrier.com
Mon Aug 20 08:07:06 CEST 2001
I did a few experiments with the new video output module for vlc (VOUT4), and I
will detail my results in this mail.
I wanted to try 4 different methods of frame buffer allocation:
- method 1: buffers allocated in system memory and transferred to video memory
with a memcpy() when we actually do the display. (this is the method that vlc
- method 2: buffers allocated in system memory and transferred to video memory
via DMA transfer when we actually do the display.
- method 3: buffers allocated directly in video memory.
- method 4: buffers allocated in system memory for I and P pictures and in
video memory for B pictures.
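To make method 1 concrete, here is a minimal sketch of the copy step in plain C. It is not the actual vlc code: copy_plane() and display_yv12() are hypothetical names, the source picture is assumed to be tightly packed YV12 (Y plane, then V, then U) with even width and height, and the destination chroma pitch is assumed to be half the luma pitch, which a real driver does not guarantee.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Copy one plane row by row. The destination pitch (bytes per row) may be
 * larger than the visible width, so one big memcpy() of the whole plane is
 * only valid when both pitches are equal to the width. */
static void copy_plane(uint8_t *dst, size_t dst_pitch,
                       const uint8_t *src, size_t src_pitch,
                       size_t width, size_t height)
{
    assert(dst_pitch >= width && src_pitch >= width);
    for (size_t y = 0; y < height; y++)
        memcpy(dst + y * dst_pitch, src + y * src_pitch, width);
}

/* Hypothetical helper: copy a tightly packed YV12 picture held in system
 * memory into a destination buffer that has its own pitch.
 * Assumption: the destination chroma pitch is dst_pitch / 2. */
static void display_yv12(uint8_t *dst, size_t dst_pitch,
                         const uint8_t *src, size_t width, size_t height)
{
    size_t chroma_w = width / 2;
    size_t chroma_h = height / 2;
    size_t chroma_pitch = dst_pitch / 2;

    /* Y plane: full resolution */
    copy_plane(dst, dst_pitch, src, width, width, height);
    dst += dst_pitch * height;
    src += width * height;

    /* V plane: quarter resolution, comes first in YV12 */
    copy_plane(dst, chroma_pitch, src, chroma_w, chroma_w, chroma_h);
    dst += chroma_pitch * chroma_h;
    src += chroma_w * chroma_h;

    /* U plane */
    copy_plane(dst, chroma_pitch, src, chroma_w, chroma_w, chroma_h);
}
```

With a real surface, dst and dst_pitch would come from locking the surface; here they are just parameters so the sketch stays independent of any graphics API.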
These experiments were done with the directx plugin for vlc, because directx
can give you direct access to the graphics card memory (with Xvideo you
don't have direct access to video memory, buffers can only be created in
The difference between system and video memory is really important. Every
frame buffer has to end up in video memory to be displayed, so method 3 should
save some memory transfers overall. But (there's always a but) video memory
is usually quite slow compared to system memory, so it is a bad idea to
store a picture buffer there if you need to read from it again. In the MPEG world,
for example, P and B pictures are reconstructed _from_ I and P pictures, thus
these picture types need to be stored in fast memory (this is why I
introduced method 4).
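The allocation rule behind method 4 can be sketched as a small C function. The type and function names below are made up for illustration and are not taken from the vlc sources; the only thing the sketch encodes is the rule stated above (reference pictures in system memory, B pictures in video memory).

```c
#include <assert.h>

/* Hypothetical types, for illustration only. */
typedef enum { PICTURE_I, PICTURE_P, PICTURE_B } picture_type_t;
typedef enum { MEM_SYSTEM, MEM_VIDEO } memory_kind_t;

/* I and P pictures are read back by the decoder to reconstruct later
 * pictures, so they go in fast system memory. B pictures are only
 * displayed, so they can be written straight into video memory. */
static memory_kind_t choose_buffer_memory(picture_type_t type)
{
    switch (type) {
    case PICTURE_I:
    case PICTURE_P:
        return MEM_SYSTEM;
    case PICTURE_B:
        return MEM_VIDEO;
    }
    assert(0 && "unknown picture type");
    return MEM_SYSTEM;
}
```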
As expected, method 4 is really overkill. VLC ends up being a few times slower
I was really hopeful about method 2. DirectX can take advantage of DMA
transfers (when supported) when blitting (copying) picture buffers. As DMA
transfers are asynchronous, that would mean we would be able to work with
fast buffers (in system memory) and let the DMA controller do the transfer to
video memory without any CPU overhead.
Unfortunately I found out that although it is usually possible with directx
to create picture buffers in system memory, it isn't possible to do this with
YV12 picture buffers.
As I couldn't implement method 2 with directx, method 3 should have been the
best performer. But (you see, there's always a but ;) to my disappointment
I found that method 1 is actually the best.
(the difference is more than noticeable, method 1 is almost twice as fast a
I can't really explain why. Maybe it's because of cache issues (when doing a
memcpy() we are transferring whole cache lines?), or maybe vlc's mpeg
decoder is not optimized enough (too many memory accesses)?
In short, I'm really disappointed. I was hoping to see a big performance
boost from vout4 but found absolutely none!
(I guess I still have a few more things to try: instead of using a YUV
overlay I could use on-the-fly YUV -> RGB hardware conversion, but I'm not
sure it is widely available.)
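For reference, this is roughly what a YUV -> RGB conversion computes for each pixel, written as a software sketch in C. It assumes 8-bit BT.601 video-range values and uses the common integer approximation; the names clamp_u8, rgb_t and yuv_to_rgb are made up here, and what a given graphics card actually does in hardware may differ.

```c
#include <stdint.h>

typedef struct { uint8_t r, g, b; } rgb_t;

/* Clamp an intermediate result to the 0..255 range of an 8-bit channel. */
static uint8_t clamp_u8(int v)
{
    return (uint8_t)(v < 0 ? 0 : (v > 255 ? 255 : v));
}

/* Convert one pixel, assuming BT.601 video range (Y in 16..235,
 * U and V centered on 128). Coefficients are scaled by 256. */
static rgb_t yuv_to_rgb(uint8_t y, uint8_t u, uint8_t v)
{
    int c = (int)y - 16;
    int d = (int)u - 128;
    int e = (int)v - 128;
    rgb_t out;

    out.r = clamp_u8((298 * c + 409 * e + 128) / 256);
    out.g = clamp_u8((298 * c - 100 * d - 208 * e + 128) / 256);
    out.b = clamp_u8((298 * c + 516 * d + 128) / 256);
    return out;
}
```

Doing this on the CPU for every pixel of every frame is exactly the work the overlay (or a hardware conversion during the blit) is supposed to save.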
If anybody has other suggestions I will be glad to try them. I can also
provide my code if someone wants to do more extensive testing.
(PS: these results only apply to the directx plugin. Xvideo WILL take
advantage of vout4 because vlc was doing one too many memcpy() of the frame
buffer from system mem to system mem)