VOUT4 results with directx

Gildas Bazin gbazin at netcourrier.com
Mon Aug 20 08:07:06 CEST 2001


I did a few experiments with the new video output module for vlc (VOUT4), and 
I will detail my results in this mail.

I wanted to try 4 different methods of frame buffer allocation:

- method 1: buffers allocated in system memory and transferred to video memory 
with a memcpy() when we actually do the display (this is the method that vlc 
currently implements).

- method 2: buffers allocated in system memory and transferred to video memory 
via DMA transfer when we actually do the display.

- method 3: buffers allocated directly in video memory.

- method 4: buffers allocated in system memory for I and P pictures and in 
video memory for B pictures.
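
To make the display-time copy of method 1 concrete, here is a minimal sketch 
in C. It is not vlc's actual code: the function names are hypothetical, and it 
assumes a tightly packed YV12 source (Y plane, then V, then U), even 
dimensions, and a destination whose chroma pitch is half its luma pitch.

```c
#include <stdint.h>
#include <string.h>

/* Copy one plane line by line. The source and destination may use
 * different pitches (bytes per line), so one big memcpy() is not enough. */
static void copy_plane(uint8_t *dst, size_t dst_pitch,
                       const uint8_t *src, size_t src_pitch,
                       size_t width, size_t height)
{
    for (size_t y = 0; y < height; y++)
        memcpy(dst + y * dst_pitch, src + y * src_pitch, width);
}

/* Copy a packed YV12 frame into a destination buffer.
 * Assumption: the destination chroma pitch is dst_pitch / 2. */
static void copy_yv12(uint8_t *dst, size_t dst_pitch,
                      const uint8_t *src, size_t width, size_t height)
{
    size_t cw = width / 2;   /* chroma plane width  */
    size_t ch = height / 2;  /* chroma plane height */

    /* Y plane */
    copy_plane(dst, dst_pitch, src, width, width, height);
    dst += dst_pitch * height;
    src += width * height;

    /* V plane */
    copy_plane(dst, dst_pitch / 2, src, cw, cw, ch);
    dst += (dst_pitch / 2) * ch;
    src += cw * ch;

    /* U plane */
    copy_plane(dst, dst_pitch / 2, src, cw, cw, ch);
}
```

In a real plugin the destination pointer and pitch would come from locking 
the video surface; here they are plain parameters so the sketch stays 
self-contained.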

These experiments were done with the directx plugin for vlc, because directx 
can give you direct access to the graphics card memory (with Xvideo you 
don't have direct access to video memory; buffers can only be created in 
system memory).

The difference between system and video memory is really important. Every 
frame buffer has to end up in video memory to be displayed, so method 3 should 
overall save some memory transfers. But (there's always a but) video memory 
is usually quite slow compared to system memory, so it is a bad idea to 
store a picture buffer there if you need to read from it again. In the MPEG 
world, for example, P and B pictures are reconstructed _from_ I and P 
pictures, thus I and P pictures need to be stored in fast memory (this is why 
I introduced method 4).
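
The placement rule behind methods 3 and 4 can be sketched as a small decision 
function. This is only an illustration: the type and function names below are 
hypothetical and do not come from vlc.

```c
#include <stdbool.h>

typedef enum { PICTURE_I, PICTURE_P, PICTURE_B } picture_type_t;
typedef enum { MEM_SYSTEM, MEM_VIDEO } mem_location_t;

/* I and P pictures are read back to reconstruct later pictures;
 * B pictures are only displayed. */
static bool is_reference(picture_type_t type)
{
    return type != PICTURE_B;
}

/* Where the frame buffer lives for each of the four methods. */
static mem_location_t choose_location(int method, picture_type_t type)
{
    switch (method) {
    case 3:
        return MEM_VIDEO;   /* everything directly in video memory */
    case 4:
        /* keep pictures that get read again in fast system memory */
        return is_reference(type) ? MEM_SYSTEM : MEM_VIDEO;
    default:
        return MEM_SYSTEM;  /* methods 1 and 2: copy or DMA at display time */
    }
}
```

With method 4, only B pictures skip the extra transfer, since they are never 
used as a reference for another picture.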

As expected, method 4 is really overkill. VLC ends up being a few times slower 
than before.

I was really hopeful about method 2. DirectX can take advantage of DMA 
transfers (when supported) when blitting (copying) picture buffers. As DMA 
transfers are asynchronous, that would mean we would be able to work with 
fast buffers (in system memory) and let the DMA controller do the transfer to 
video memory without any CPU overhead.
Unfortunately I found out that although it is usually possible with directx 
to create picture buffers in system memory, it isn't possible to do this with 
YV12 picture buffers.

As I couldn't implement method 2 with directx, method 3 should have been the 
best performer. But (you see, there's always a but ;) to my disappointment 
I found that method 1 is actually the best.
(The difference is more than noticeable: method 1 is almost twice as fast as 
method 3.)
I can't really explain why. Maybe it's because of cache issues (when doing a 
memcpy() we are transferring whole cache lines?), or maybe vlc's mpeg 
decoder is not optimized enough (too many memory accesses)?

In short, I'm really disappointed. I was hoping to see a big performance 
boost from vout4 but found none at all!
(I guess I still have a few more things to try: instead of using a YUV 
overlay I could use on-the-fly YUV -> RGB hardware conversion, but I'm not 
sure it is widely available.)

If anybody has other suggestions I will be glad to try them. I can also 
provide my code if someone wants to do more extensive testing.

(PS: these results only apply to the directx plugin. Xvideo WILL take 
advantage of vout4, because vlc was doing one memcpy() too many of the frame 
buffer from system memory to system memory.)


