[vlc-devel] Software decoding in Hardware buffers

Steve Lhomme robux4 at ycbcr.xyz
Fri Aug 9 07:50:43 CEST 2019


On 2019-08-08 18:27, Rémi Denis-Courmont wrote:
> Le torstaina 8. elokuuta 2019, 15.29.30 EEST Steve Lhomme a écrit :
>> Any opinion ?
> 
> I don't see why we should mess the architecture for a hardware-specific
> implementation-specific unmaintained module.

It's not unmaintained; I was planning to revive it to make sure that the 
default player on the Raspberry Pi remains VLC when we release 4.0. It 
seems there's a different implementation now, so I'll adapt that one.

One reason for that is to make sure our new push architecture is sound 
and can adapt to many use cases. Supporting SoC architectures should 
still be possible with the new architecture. Allocating all the buffers 
once in the display made this easy and efficient (in terms of copies, 
not memory usage). We should aim for the same level of efficiency.

Also let me remind you the VLC motto: "VLC plays everything and runs 
everywhere".

> Even when the GPU uses the same RAM as the CPU, it typically uses different
> pixel format, tile format and/or memory coherency protocol, or it might simply
> not have a suitable IOMMU. As such, VLC cannot render directly in it.
> 
> And if it could, then by definition, it implies that the decoder and filters can
> allocate and *reference* picture buffers as they see fit, regardless of the
> hardware. Which means the software on CPU side is doing the allocation. If so,
> then there are no good technical reasons why push cannot work - misdesigning
> the display plugin is not a good reason.

I haven't proposed any design change to the display plugin beyond what 
was already discussed. What I proposed is a way to allocate CPU pictures 
from the GPU. My current solution involves optionally creating a video 
context when the decoder doesn't provide one.
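To make the idea concrete, here is a minimal sketch of that fallback 
path. The names below (CreateCpuMappableVideoContext, the owner struct 
and its fields) are hypothetical illustrations, not the actual VLC 4.0 
API; the only point is the control flow: when the decoder hands us no 
video context, one is created from the display's decoder device so the 
CPU-written pictures can still be allocated in GPU-mappable memory.

/* Sketch only: all names are hypothetical, not the real VLC 4.0 API. */
struct hypothetical_dec_owner {
    struct vlc_decoder_device *dec_dev; /* device exposed by the display */
    struct vlc_video_context  *vctx;    /* NULL for software decoders    */
};

/* Hypothetical helper: wrap the decoder device in a video context whose
 * pictures live in GPU-mappable memory but are writable from the CPU. */
struct vlc_video_context *
CreateCpuMappableVideoContext(struct vlc_decoder_device *dec_dev);

static int UpdateOutputFormat(struct hypothetical_dec_owner *owner)
{
    if (owner->vctx == NULL)
    {
        /* The software decoder did not provide a video context, so create
         * one ourselves: the display then receives "GPU" pictures and the
         * buffers can be allocated once, as with hardware decoding. */
        owner->vctx = CreateCpuMappableVideoContext(owner->dec_dev);
        if (owner->vctx == NULL)
            return -1; /* fall back to plain CPU-allocated pictures */
    }
    return 0;
}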

It could even be used on desktop. For example, on Intel platforms it's 
possible to do this without much performance penalty. I used to do it in 
D3D11 until I realized it sucked with dedicated GPU memory. But I had no 
way to measure the exact impact of the switch because the code was quite 
different. Now it might be possible to tell. I have a feeling that on 
Intel it may actually be better to decode into "GPU" buffers directly. 
The driver can take shortcuts that we can't. It may do the copy more 
efficiently if it needs one (or maybe it doesn't need one at all). And it 
can do the copy asynchronously (like every command sent to an 
ID3D11DeviceContext), as long as the picture is ready when it needs to be 
displayed.
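
As a rough illustration of that last point, here is a sketch (plain C 
with COBJMACROS, error handling omitted, the NV12 1920x1080 format is 
just a placeholder) of what such an upload looks like: the CPU writes 
into a staging texture, and CopyResource queues an asynchronous copy 
into a default-usage texture on the ID3D11DeviceContext; the driver 
schedules it and it only has to complete before the picture is rendered.

#define COBJMACROS
#include <stdint.h>
#include <d3d11.h>

/* Sketch: upload one decoded picture through a staging texture and let
 * the driver copy it to GPU memory asynchronously. */
static void UploadPicture(ID3D11Device *dev, ID3D11DeviceContext *ctx,
                          const uint8_t *pixels, unsigned pitch,
                          ID3D11Texture2D **out_gpu)
{
    D3D11_TEXTURE2D_DESC desc = {
        .Width = 1920, .Height = 1080,
        .MipLevels = 1, .ArraySize = 1,
        .Format = DXGI_FORMAT_NV12,
        .SampleDesc = { .Count = 1 },
        .Usage = D3D11_USAGE_STAGING,            /* CPU-writable copy source */
        .CPUAccessFlags = D3D11_CPU_ACCESS_WRITE,
    };
    ID3D11Texture2D *staging, *gpu;
    ID3D11Device_CreateTexture2D(dev, &desc, NULL, &staging);

    desc.Usage = D3D11_USAGE_DEFAULT;            /* GPU-only destination */
    desc.CPUAccessFlags = 0;
    desc.BindFlags = D3D11_BIND_SHADER_RESOURCE; /* so the display can sample it */
    ID3D11Device_CreateTexture2D(dev, &desc, NULL, &gpu);

    /* CPU side: the software decoder writes its output here. */
    D3D11_MAPPED_SUBRESOURCE map;
    ID3D11DeviceContext_Map(ctx, (ID3D11Resource *)staging, 0,
                            D3D11_MAP_WRITE, 0, &map);
    /* ... copy `pixels` (row pitch `pitch`) into map.pData / map.RowPitch ... */
    (void)pixels; (void)pitch;
    ID3D11DeviceContext_Unmap(ctx, (ID3D11Resource *)staging, 0);

    /* GPU side: queued like any other command on the immediate context and
     * executed asynchronously by the driver before the frame is rendered. */
    ID3D11DeviceContext_CopyResource(ctx, (ID3D11Resource *)gpu,
                                     (ID3D11Resource *)staging);

    ID3D11Texture2D_Release(staging);
    *out_gpu = gpu;
}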

